Thread: PATCH: logical_work_mem and logical streaming of large in-progress transactions
PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Tomas Vondra
Date:
Hi all,

Attached is a patch series that implements two features for logical replication - the ability to define a memory limit for the reorder buffer (responsible for building the decoded transactions), and the ability to stream large in-progress transactions (exceeding the memory limit).

I'm submitting those two changes together, because one builds on the other, and it's beneficial to discuss them together.

PART 1: adding logical_work_mem memory limit (0001)
---------------------------------------------------

Currently, limiting the amount of memory consumed by logical decoding is tricky (or you might say impossible) for several reasons:

* The value is hard-coded, so it's not possible to customize it.

* The amount of decoded changes to keep in memory is restricted by the number of changes. It's not very clear how this relates to memory consumption, as the change size depends on table structure, etc.

* The number is "per (sub)transaction", so a transaction with many subtransactions may easily consume a significant amount of memory without actually hitting the limit.

So the patch does two things. Firstly, it introduces logical_work_mem, a GUC restricting memory consumed by all transactions currently kept in the reorder buffer. Secondly, it adds simple memory accounting, tracking the amount of memory used in total (for the whole reorder buffer, to compare against logical_work_mem) and per transaction (so that we can quickly pick a transaction to spill to disk).

The one wrinkle in the patch is that the memory limit can't be enforced when reading changes spilled to disk - with multiple subtransactions, we can't easily predict how many changes to pre-read for each of them. At that point we still use the existing max_changes_in_memory limit. Luckily, changes introduced in the other parts of the patch should allow addressing this deficiency.

PART 2: streaming of large in-progress transactions (0002-0006)
---------------------------------------------------------------

Note: This part is split into multiple smaller chunks, addressing different parts of the logical decoding infrastructure. That's mostly to allow easier reviews, though. Ultimately, it's just one patch.

Processing large transactions often results in significant apply lag, for a couple of reasons. One reason is network bandwidth - while we do decode the changes incrementally (as we read the WAL), we keep them locally, either in memory or spilled to files. Then at commit time, all the changes get sent to the downstream (and applied) at the same time. For large transactions the time to do the network transfer may be significant, causing apply lag.

This patch extends the logical replication infrastructure (output plugin API, reorder buffer, pgoutput, replication protocol etc.) to allow streaming of in-progress transactions instead of spilling them to local files.

The extensions to the API are pretty straightforward. Aside from adding methods to stream changes/messages and commit a streamed transaction, the API needs a function to abort a streamed (sub)transaction, and functions to demarcate a block of streamed changes.

To decode a transaction, we need to know all its subtransactions, and invalidations. Currently, those are only known at commit time (although some assignments may be known earlier), but invalidations are only ever written in the commit record. So far that was fine, because we only decode/replay transactions at commit time, when all of this is known (because it's either in the commit record, or written before it).
But for in-progress transactions (i.e. the subject of interest here), that is not the case. So the patch modifies WAL-logging to ensure those two bits of information are written immediately (for wal_level=logical). For assignments that was fairly simple, thanks to existing caching. For invalidations, it requires a new WAL record type and a couple of changes in inval.c.

On the apply side, we simply receive the streamed changes and write them into a file (one file per toplevel transaction, which is possible thanks to the assignments being known immediately). And then at commit time the changes are replayed locally, without having to copy a large chunk of data over the network.

WAL overhead
------------

Of course, these changes to WAL logging are not for free - logging assignments individually (instead of multiple subtransactions at once) means higher xlog record overhead. Similarly, (sub)transactions doing a lot of DDL may result in a lot of invalidations written to WAL (again, with full xlog record overhead per invalidation).

I've done a number of tests to measure the impact, and for extreme corner cases the additional amount of WAL is about 40% in both cases. By an "extreme corner case" I mean workloads intentionally triggering many assignments/invalidations, without doing a lot of meaningful work.

For assignments, imagine a single-row table (no indexes), and a transaction like this one:

BEGIN;
UPDATE t SET v = v + 1;
SAVEPOINT s1;
UPDATE t SET v = v + 1;
SAVEPOINT s2;
UPDATE t SET v = v + 1;
SAVEPOINT s3;
...
UPDATE t SET v = v + 1;
SAVEPOINT s10;
UPDATE t SET v = v + 1;
COMMIT;

For invalidations, add a CREATE TEMPORARY TABLE to each subtransaction.

For more realistic workloads (large table with indexes, runs long enough to generate FPIs, etc.) the overhead drops below 5%. Which is much more acceptable, of course, although not perfect.

In both cases, there was pretty much no measurable impact on performance (as measured by tps).

I do not think there's a way around this requirement (having assignments and invalidations), if we want to decode in-progress transactions. But perhaps it would be possible to do some sort of caching (say, at the command level), to reduce the xlog record overhead? Not sure. All ideas are welcome, of course. In the worst case, I think we can add a GUC enabling this additional logging - when disabled, streaming of in-progress transactions would not be possible.

Simplifying ReorderBuffer
-------------------------

One interesting consequence of having assignments is that we could get rid of the ReorderBuffer iterator, used to merge changes from subxacts. The assignments allow us to keep changes for each toplevel transaction in a single list, in LSN order, and just walk it. Abort can be performed by remembering the position of the first change in each subxact, and just discarding the tail. This is what the apply worker does with the streamed changes and aborts.

It would also allow us to enforce the memory limit while restoring transactions spilled to disk, because we would not have the problem with restoring changes for many subtransactions.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachment
- 0001-Introduce-logical_work_mem-to-limit-ReorderBuffer-me.patch.gz
- 0002-Issue-XLOG_XACT_ASSIGNMENT-with-wal_level-logical.patch.gz
- 0003-Issue-individual-invalidations-with-wal_level-logica.patch.gz
- 0004-Extend-the-output-plugin-API-with-stream-methods.patch.gz
- 0005-Implement-streaming-mode-in-ReorderBuffer.patch.gz
- 0006-Add-support-for-streaming-to-built-in-replication.patch.gz
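To make the memory accounting described in PART 1 of the opening mail concrete, here is a minimal toy sketch in C. This is not the patch code: TxnEntry, spill_to_disk() and the hard-coded limit are illustrative stand-ins for the ReorderBuffer machinery and the logical_work_mem GUC.

#include <stddef.h>

typedef struct TxnEntry
{
    unsigned int xid;               /* transaction id */
    size_t      size;               /* bytes of decoded changes kept in memory */
    struct TxnEntry *next;
} TxnEntry;

static size_t total_size = 0;                   /* whole reorder buffer */
static size_t memory_limit = 4 * 1024 * 1024;   /* stand-in for logical_work_mem */

/* Placeholder: the real code would write the txn's changes to a temp file. */
static void
spill_to_disk(TxnEntry *txn)
{
    (void) txn;
}

/* Called whenever a decoded change is added to a transaction. */
static void
account_change_added(TxnEntry *txn, size_t change_size)
{
    txn->size += change_size;       /* per-transaction tally */
    total_size += change_size;      /* buffer-wide tally, compared to the limit */
}

/* Linear scan for the in-memory transaction using the most memory. */
static TxnEntry *
find_largest_txn(TxnEntry *txns)
{
    TxnEntry   *largest = NULL;

    for (TxnEntry *t = txns; t != NULL; t = t->next)
        if (t->size > 0 && (largest == NULL || t->size > largest->size))
            largest = t;

    return largest;
}

/* After adding a change: while over the limit, spill the largest transaction. */
static void
enforce_memory_limit(TxnEntry *txns)
{
    while (total_size > memory_limit)
    {
        TxnEntry   *victim = find_largest_txn(txns);

        if (victim == NULL)
            break;                  /* everything already spilled */

        spill_to_disk(victim);
        total_size -= victim->size;
        victim->size = 0;
    }
}

The real patch does this inside the ReorderBuffer, which already tracks the set of in-memory transactions; the linear scan in find_largest_txn() is the part whose cost is discussed further down the thread.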
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Erikjan Rijkers
Date:
On 2017-12-23 05:57, Tomas Vondra wrote:
> Hi all,
>
> Attached is a patch series that implements two features to the logical replication - ability to define a memory limit for the reorderbuffer (responsible for building the decoded transactions), and ability to stream large in-progress transactions (exceeding the memory limit).

logical replication of 2 instances is OK but 3 and up fail with:

TRAP: FailedAssertion("!(last_lsn < change->lsn)", File: "reorderbuffer.c", Line: 1773)

I can cobble up a script but I hope you have enough from the assertion to see what's going wrong...
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Tomas Vondra
Date:
On 12/23/2017 03:03 PM, Erikjan Rijkers wrote:
> On 2017-12-23 05:57, Tomas Vondra wrote:
>> Hi all,
>>
>> Attached is a patch series that implements two features to the logical replication - ability to define a memory limit for the reorderbuffer (responsible for building the decoded transactions), and ability to stream large in-progress transactions (exceeding the memory limit).
>
> logical replication of 2 instances is OK but 3 and up fail with:
>
> TRAP: FailedAssertion("!(last_lsn < change->lsn)", File: "reorderbuffer.c", Line: 1773)
>
> I can cobble up a script but I hope you have enough from the assertion to see what's going wrong...

The assertion says that the iterator produces changes in an order that does not correlate with LSN. But I have a hard time understanding how that could happen, particularly because according to the line number this happens in ReorderBufferCommit(), i.e. the current (non-streaming) case.

So instructions to reproduce the issue would be very helpful.

Attached is v2 of the patch series, fixing two bugs I discovered today. I don't think either of these is related to your issue, though.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachment
- 0001-Introduce-logical_work_mem-to-limit-ReorderBuffer-v2.patch.gz
- 0002-Issue-XLOG_XACT_ASSIGNMENT-with-wal_level-logical-v2.patch.gz
- 0003-Issue-individual-invalidations-with-wal_level-log-v2.patch.gz
- 0004-Extend-the-output-plugin-API-with-stream-methods-v2.patch.gz
- 0005-Implement-streaming-mode-in-ReorderBuffer-v2.patch.gz
- 0006-Add-support-for-streaming-to-built-in-replication-v2.patch.gz
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Erik Rijkers
Date:
On 2017-12-23 21:06, Tomas Vondra wrote:
> On 12/23/2017 03:03 PM, Erikjan Rijkers wrote:
>> On 2017-12-23 05:57, Tomas Vondra wrote:
>>> Hi all,
>>>
>>> Attached is a patch series that implements two features to the logical replication - ability to define a memory limit for the reorderbuffer (responsible for building the decoded transactions), and ability to stream large in-progress transactions (exceeding the memory limit).
>>
>> logical replication of 2 instances is OK but 3 and up fail with:
>>
>> TRAP: FailedAssertion("!(last_lsn < change->lsn)", File: "reorderbuffer.c", Line: 1773)
>>
>> I can cobble up a script but I hope you have enough from the assertion to see what's going wrong...
>
> The assertion says that the iterator produces changes in order that does not correlate with LSN. But I have a hard time understanding how that could happen, particularly because according to the line number this happens in ReorderBufferCommit(), i.e. the current (non-streaming) case.
>
> So instructions to reproduce the issue would be very helpful.

Using:

0001-Introduce-logical_work_mem-to-limit-ReorderBuffer-v2.patch
0002-Issue-XLOG_XACT_ASSIGNMENT-with-wal_level-logical-v2.patch
0003-Issue-individual-invalidations-with-wal_level-log-v2.patch
0004-Extend-the-output-plugin-API-with-stream-methods-v2.patch
0005-Implement-streaming-mode-in-ReorderBuffer-v2.patch
0006-Add-support-for-streaming-to-built-in-replication-v2.patch

As you expected, the problem is the same with these new patches.

I have now tested more, and seen that it does not always fail. I guess it fails here 3 times out of 4. But the laptop I'm using at the moment is old and slow -- it may well be a factor as we've seen before [1].

Attached is the bash script that I put together. I tested with NUM_INSTANCES=2, which yields success, and NUM_INSTANCES=3, which fails often. This same program run with HEAD never seems to fail (I tried a few dozen times).

thanks,

Erik Rijkers

[1] https://www.postgresql.org/message-id/3897361c7010c4ac03f358173adbcd60%40xs4all.nl
Attachment
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Tomas Vondra
Date:
On 12/23/2017 11:23 PM, Erik Rijkers wrote:
> On 2017-12-23 21:06, Tomas Vondra wrote:
>> On 12/23/2017 03:03 PM, Erikjan Rijkers wrote:
>>> On 2017-12-23 05:57, Tomas Vondra wrote:
>>>> Hi all,
>>>>
>>>> Attached is a patch series that implements two features to the logical replication - ability to define a memory limit for the reorderbuffer (responsible for building the decoded transactions), and ability to stream large in-progress transactions (exceeding the memory limit).
>>>
>>> logical replication of 2 instances is OK but 3 and up fail with:
>>>
>>> TRAP: FailedAssertion("!(last_lsn < change->lsn)", File: "reorderbuffer.c", Line: 1773)
>>>
>>> I can cobble up a script but I hope you have enough from the assertion to see what's going wrong...
>>
>> The assertion says that the iterator produces changes in order that does not correlate with LSN. But I have a hard time understanding how that could happen, particularly because according to the line number this happens in ReorderBufferCommit(), i.e. the current (non-streaming) case.
>>
>> So instructions to reproduce the issue would be very helpful.
>
> Using:
>
> 0001-Introduce-logical_work_mem-to-limit-ReorderBuffer-v2.patch
> 0002-Issue-XLOG_XACT_ASSIGNMENT-with-wal_level-logical-v2.patch
> 0003-Issue-individual-invalidations-with-wal_level-log-v2.patch
> 0004-Extend-the-output-plugin-API-with-stream-methods-v2.patch
> 0005-Implement-streaming-mode-in-ReorderBuffer-v2.patch
> 0006-Add-support-for-streaming-to-built-in-replication-v2.patch
>
> As you expected the problem is the same with these new patches.
>
> I have now tested more, and seen that it not always fails. I guess that it here fails 3 times out of 4. But the laptop I'm using at the moment is old and slow -- it may well be a factor as we've seen before [1].
>
> Attached is the bash that I put together. I tested with NUM_INSTANCES=2, which yields success, and NUM_INSTANCES=3, which fails often. This same program run with HEAD never seems to fail (I tried a few dozen times).

Thanks. Unfortunately I still can't reproduce the issue. I even tried running it in valgrind, to see if there are some memory access issues (which should also slow it down significantly).

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Craig Ringer
Date:
On 23 December 2017 at 12:57, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
Hi all,
Attached is a patch series that implements two features to the logical
replication - ability to define a memory limit for the reorderbuffer
(responsible for building the decoded transactions), and ability to
stream large in-progress transactions (exceeding the memory limit).
I'm submitting those two changes together, because one builds on the
other, and it's beneficial to discuss them together.
PART 1: adding logical_work_mem memory limit (0001)
---------------------------------------------------
Currently, limiting the amount of memory consumed by logical decoding is
tricky (or you might say impossible) for several reasons:
* The value is hard-coded, so it's not quite possible to customize it.
* The amount of decoded changes to keep in memory is restricted by
number of changes. It's not very clear how this relates to memory
consumption, as the change size depends on table structure, etc.
* The number is "per (sub)transaction", so a transaction with many
subtransactions may easily consume significant amount of memory without
actually hitting the limit.
Also, even without subtransactions, we assemble a ReorderBufferTXN per transaction. Since transactions usually occur concurrently, systems with many concurrent txns can face lots of memory use.
We can't exclude tables that won't actually be replicated at the reorder buffering phase either. So txns use memory whether or not they do anything interesting as far as a given logical decoding session is concerned. Even if we'll throw all the data away we must buffer and assemble it first so we can make that decision.
Because logical decoding considers snapshots and cid increments even from other DBs (at least when the txn makes catalog changes) the memory use can get BIG too. I was recently working with a system that had accumulated 2GB of snapshots ... on each slot. With 7 slots, one for each DB.
So there's lots of room for difficulty with unpredictable memory use.
So the patch does two things. Firstly, it introduces logical_work_mem, a
GUC restricting memory consumed by all transactions currently kept in
the reorder buffer
Does this consider the (currently high, IIRC) overhead of tracking serialized changes?
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Erik Rijkers
Date:
>>>> logical replication of 2 instances is OK but 3 and up fail with:
>>>>
>>>> TRAP: FailedAssertion("!(last_lsn < change->lsn)", File: "reorderbuffer.c", Line: 1773)
>>>>
>>>> I can cobble up a script but I hope you have enough from the assertion to see what's going wrong...
>>>
>>> The assertion says that the iterator produces changes in order that does not correlate with LSN. But I have a hard time understanding how that could happen, particularly because according to the line number this happens in ReorderBufferCommit(), i.e. the current (non-streaming) case.
>>>
>>> So instructions to reproduce the issue would be very helpful.
>>
>> Using:
>>
>> 0001-Introduce-logical_work_mem-to-limit-ReorderBuffer-v2.patch
>> 0002-Issue-XLOG_XACT_ASSIGNMENT-with-wal_level-logical-v2.patch
>> 0003-Issue-individual-invalidations-with-wal_level-log-v2.patch
>> 0004-Extend-the-output-plugin-API-with-stream-methods-v2.patch
>> 0005-Implement-streaming-mode-in-ReorderBuffer-v2.patch
>> 0006-Add-support-for-streaming-to-built-in-replication-v2.patch
>>
>> As you expected the problem is the same with these new patches.
>>
>> I have now tested more, and seen that it not always fails. I guess that it here fails 3 times out of 4. But the laptop I'm using at the moment is old and slow -- it may well be a factor as we've seen before [1].
>>
>> Attached is the bash that I put together. I tested with NUM_INSTANCES=2, which yields success, and NUM_INSTANCES=3, which fails often. This same program run with HEAD never seems to fail (I tried a few dozen times).
>
> Thanks. Unfortunately I still can't reproduce the issue. I even tried running it in valgrind, to see if there are some memory access issues (which should also slow it down significantly).

One wonders again if 2ndquadrant shouldn't invest in some old hardware ;)

Another Good Thing would be if there was a provision in the buildfarm to test patches like these. But I'm probably not the first one to suggest that; no doubt it'll be possible someday.

In the meantime I'll try to repeat this crash on other machines (but that will be after the holidays).

Erik Rijkers
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Tomas Vondra
Date:
On 12/24/2017 05:51 AM, Craig Ringer wrote:
> On 23 December 2017 at 12:57, Tomas Vondra <tomas.vondra@2ndquadrant.com <mailto:tomas.vondra@2ndquadrant.com>> wrote:
>
> Hi all,
>
> Attached is a patch series that implements two features to the logical replication - ability to define a memory limit for the reorderbuffer (responsible for building the decoded transactions), and ability to stream large in-progress transactions (exceeding the memory limit).
>
> I'm submitting those two changes together, because one builds on the other, and it's beneficial to discuss them together.
>
> PART 1: adding logical_work_mem memory limit (0001)
> ---------------------------------------------------
>
> Currently, limiting the amount of memory consumed by logical decoding is tricky (or you might say impossible) for several reasons:
>
> * The value is hard-coded, so it's not quite possible to customize it.
>
> * The amount of decoded changes to keep in memory is restricted by number of changes. It's not very unclear how this relates to memory consumption, as the change size depends on table structure, etc.
>
> * The number is "per (sub)transaction", so a transaction with many subtransactions may easily consume significant amount of memory without actually hitting the limit.
>
> Also, even without subtransactions, we assemble a ReorderBufferTXN per transaction. Since transactions usually occur concurrently, systems with many concurrent txns can face lots of memory use.

I don't see how that could be a problem, considering the number of toplevel transactions is rather limited (to max_connections or so).

> We can't exclude tables that won't actually be replicated at the reorder buffering phase either. So txns use memory whether or not they do anything interesting as far as a given logical decoding session is concerned. Even if we'll throw all the data away we must buffer and assemble it first so we can make that decision.

Yep.

> Because logical decoding considers snapshots and cid increments even from other DBs (at least when the txn makes catalog changes) the memory use can get BIG too. I was recently working with a system that had accumulated 2GB of snapshots ... on each slot. With 7 slots, one for each DB.
>
> So there's lots of room for difficulty with unpredictable memory use.

Yep.

> So the patch does two things. Firstly, it introduces logical_work_mem, a GUC restricting memory consumed by all transactions currently kept in the reorder buffer
>
> Does this consider the (currently high, IIRC) overhead of tracking serialized changes?

Consider in what sense?

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Tomas Vondra
Date:
On 12/24/2017 10:00 AM, Erik Rijkers wrote:
>>>>> logical replication of 2 instances is OK but 3 and up fail with:
>>>>>
>>>>> TRAP: FailedAssertion("!(last_lsn < change->lsn)", File: "reorderbuffer.c", Line: 1773)
>>>>>
>>>>> I can cobble up a script but I hope you have enough from the assertion to see what's going wrong...
>>>>
>>>> The assertion says that the iterator produces changes in order that does not correlate with LSN. But I have a hard time understanding how that could happen, particularly because according to the line number this happens in ReorderBufferCommit(), i.e. the current (non-streaming) case.
>>>>
>>>> So instructions to reproduce the issue would be very helpful.
>>>
>>> Using:
>>>
>>> 0001-Introduce-logical_work_mem-to-limit-ReorderBuffer-v2.patch
>>> 0002-Issue-XLOG_XACT_ASSIGNMENT-with-wal_level-logical-v2.patch
>>> 0003-Issue-individual-invalidations-with-wal_level-log-v2.patch
>>> 0004-Extend-the-output-plugin-API-with-stream-methods-v2.patch
>>> 0005-Implement-streaming-mode-in-ReorderBuffer-v2.patch
>>> 0006-Add-support-for-streaming-to-built-in-replication-v2.patch
>>>
>>> As you expected the problem is the same with these new patches.
>>>
>>> I have now tested more, and seen that it not always fails. I guess that it here fails 3 times out of 4. But the laptop I'm using at the moment is old and slow -- it may well be a factor as we've seen before [1].
>>>
>>> Attached is the bash that I put together. I tested with NUM_INSTANCES=2, which yields success, and NUM_INSTANCES=3, which fails often. This same program run with HEAD never seems to fail (I tried a few dozen times).
>>
>> Thanks. Unfortunately I still can't reproduce the issue. I even tried running it in valgrind, to see if there are some memory access issues (which should also slow it down significantly).
>
> One wonders again if 2ndquadrant shouldn't invest in some old hardware ;)

Well, I've done tests on various machines, including some really slow ones, and I still haven't managed to reproduce the failures using your script. So I don't think that would really help.

But I have reproduced it by using a custom stress test script. Turns out the asserts are overly strict - instead of

Assert(prev_lsn < current_lsn);

it should have been

Assert(prev_lsn <= current_lsn);

because some XLOG records may contain multiple rows (e.g. MULTI_INSERT).

The attached v3 fixes this issue, and also a couple of other thinkos:

1) The AssertChangeLsnOrder assert check was somewhat broken.

2) We've been sending aborts for all subtransactions, even those not yet streamed. So the downstream got confused and fell over because of an assert.

3) The streamed transactions were written to /tmp, using filenames based on the subscription OID and the XID of the toplevel transaction. That's fine as long as there's just a single replica running - if there are more, the filenames will clash, causing really strange failures. So move the files to base/pgsql_tmp, where regular temporary files are written. I'm not claiming this is perfect, perhaps we need to invent another location.

FWIW I believe the relation sync cache is somewhat broken by the streaming. I thought resetting it would be good enough, but it's more complicated (and trickier) than that. I'm aware of it, and I'll look into that next - but probably not before 2018.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachment
- 0001-Introduce-logical_work_mem-to-limit-ReorderBuffer-v3.patch.gz
- 0002-Issue-XLOG_XACT_ASSIGNMENT-with-wal_level-logical-v3.patch.gz
- 0003-Issue-individual-invalidations-with-wal_level-log-v3.patch.gz
- 0004-Extend-the-output-plugin-API-with-stream-methods-v3.patch.gz
- 0005-Implement-streaming-mode-in-ReorderBuffer-v3.patch.gz
- 0006-Add-support-for-streaming-to-built-in-replication-v3.patch.gz
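For illustration, the corrected check amounts to asserting non-decreasing LSNs while walking a transaction's change list. A toy sketch follows (ToyChange is an illustrative stand-in for the real change struct; Assert, XLogRecPtr and InvalidXLogRecPtr are the real PostgreSQL names):

/*
 * Toy sketch of the corrected ordering check: LSNs produced by the change
 * iterator must be non-decreasing, not strictly increasing, because one WAL
 * record (e.g. a multi-insert) can carry several changes with the same LSN.
 */
typedef struct ToyChange
{
    XLogRecPtr  lsn;
    struct ToyChange *next;
} ToyChange;

static void
assert_changes_in_lsn_order(ToyChange *head)
{
    XLogRecPtr  prev_lsn = InvalidXLogRecPtr;

    for (ToyChange *change = head; change != NULL; change = change->next)
    {
        Assert(prev_lsn <= change->lsn);    /* "<" here was the too-strict check */
        prev_lsn = change->lsn;
    }
}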
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Erik Rijkers
Date:
That indeed fixed the problem: running that same pgbench test, I see no crashes anymore (on any of 3 different machines, and with several pgbench parameters).

Thank you,

Erik Rijkers
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Dmitry Dolgov
Date:
> On 25 December 2017 at 18:40, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
> The attached v3 fixes this issue, and also a couple of other thinkos
Thank you for the patch, it looks quite interesting. After a quick look at it
(mostly the first one so far, but I'm going to continue) I have a few questions:
> + * XXX With many subtransactions this might be quite slow, because we'll have
> + * to walk through all of them. There are some options how we could improve
> + * that: (a) maintain some secondary structure with transactions sorted by
> + * amount of changes, (b) not looking for the entirely largest transaction,
> + * but e.g. for transaction using at least some fraction of the memory limit,
> + * and (c) evicting multiple transactions at once, e.g. to free a given portion
> + * of the memory limit (e.g. 50%).
Do you want to address these possible alternatives somehow in this patch, or do
you want to leave them outside? Maybe it makes sense to apply some combination of
them, e.g. maintain a secondary structure with relatively large transactions,
and then start evicting them. If it's somehow not enough, then start to evict
multiple transactions at once (option "c").
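For illustration, option (c) could look roughly like the following toy sketch, reusing the TxnEntry bookkeeping, find_largest_txn() and spill_to_disk() placeholders from the sketch near the top of the thread (none of this is patch code):

/*
 * Toy sketch of option (c): evict the currently largest transactions, one
 * after another, until a given fraction of the memory limit has been freed.
 */
static void
evict_to_free_fraction(TxnEntry *txns, size_t limit, double fraction)
{
    size_t      to_free = (size_t) (limit * fraction);     /* e.g. 0.5 for 50% */
    size_t      freed = 0;

    while (freed < to_free)
    {
        TxnEntry   *victim = find_largest_txn(txns);

        if (victim == NULL)
            break;                  /* nothing left in memory to evict */

        freed += victim->size;
        spill_to_disk(victim);
        total_size -= victim->size;
        victim->size = 0;
    }
}

Option (a) would mostly change how the victim lookup inside the loop is implemented, so the two can indeed be combined.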
> + /*
> + * We clamp manually-set values to at least 64kB. The maintenance_work_mem
> + * uses a higher minimum value (1MB), so this is OK.
> + */
> + if (*newval < 64)
> + *newval = 64;
> +
I'm not sure what the recommended practice is here, but maybe it makes sense to
emit a warning when the value is clamped to 64kB? Otherwise it can be
unexpected.
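For illustration, the suggested warning might look roughly like this. The hook name check_logical_work_mem is hypothetical, and whether a WARNING belongs in a GUC check hook at all is part of the question; ereport/errmsg and the check-hook signature are the real PostgreSQL APIs:

static bool
check_logical_work_mem(int *newval, void **extra, GucSource source)
{
    if (*newval < 64)
    {
        /* make the silent clamp visible instead of surprising the user */
        ereport(WARNING,
                (errmsg("logical_work_mem value %d kB is below the minimum, clamping to 64 kB",
                        *newval)));
        *newval = 64;
    }

    return true;
}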
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Peter Eisentraut
Date:
On 12/22/17 23:57, Tomas Vondra wrote:
> PART 1: adding logical_work_mem memory limit (0001)
> ---------------------------------------------------

The documentation in this patch contains some references to later features (streaming). Perhaps that could be separated so that the patches can be applied independently.

I don't see the need to tie this setting to maintenance_work_mem. maintenance_work_mem is often set to very large values, which could then have undesirable side effects on this use.

Moreover, the name logical_work_mem makes it sound like it's a logical version of work_mem. Maybe we could think of another name.

I think we need a way to report on how much memory is actually used, so the setting can be tuned. Something analogous to log_temp_files perhaps.

--
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Tomas Vondra
Date:
On 01/02/2018 04:07 PM, Peter Eisentraut wrote:
> On 12/22/17 23:57, Tomas Vondra wrote:
>> PART 1: adding logical_work_mem memory limit (0001)
>> ---------------------------------------------------
>
> The documentation in this patch contains some references to later features (streaming). Perhaps that could be separated so that the patches can be applied independently.

Yeah, that's probably a good idea. But now that you mention it, I wonder if "streaming" is really a good term. We already use it for "streaming replication" and it may be quite confusing to use it for another feature (particularly when it's streaming within logical streaming replication). But I can't really think of a better name ...

> I don't see the need to tie this setting to maintenance_work_mem. maintenance_work_mem is often set to very large values, which could then have undesirable side effects on this use.

Well, we need to pick some default value, and we can either use a fixed value (not sure what would be a good default) or tie it to an existing GUC. We only really have work_mem and maintenance_work_mem, and the walsender process will never use more than one such buffer. Which seems to be closer to maintenance_work_mem.

Pretty much any default value can have undesirable side effects.

> Moreover, the name logical_work_mem makes it sound like it's a logical version of work_mem. Maybe we could think of another name.

I won't object to a better name, of course. Any proposals?

> I think we need a way to report on how much memory is actually used, so the setting can be tuned. Something analogous to log_temp_files perhaps.

Yes, I agree. I'm just about to submit an updated version of the patch series, which also introduces a few columns into pg_stat_replication, tracking this type of stats (amount of data spilled to disk or streamed, etc.).

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Tomas Vondra
Date:
Hi,

attached is v4 of the patch series, with a couple of changes:

1) Fixes a bunch of bugs I discovered during stress testing.

I'm not going to go into details, but the main fixes are related to properly updating progress from the worker, and not streaming when creating the logical replication slot.

2) Introduces columns into pg_stat_replication.

The new columns track various kinds of statistics (number of xacts, bytes, ...) about spill-to-disk/streaming. This will be useful when tuning the GUC memory limit.

3) Two temporary bugfixes that make the patch series work.

The first one (0008) makes sure is_known_subxact is set properly for all subtransactions, and there's a separate fix in the CF. So this will eventually go away.

The second one (0009) fixes an issue that is specific to streaming. It does fix the issue, but I need a bit more time to think about it before merging it into 0005.

This does pass extensive stress testing with a workload mixing DML, DDL, subtransactions, aborts, etc. under valgrind. I'm working on extending the test coverage, and introducing various error conditions (e.g. walsender/walreceiver timeouts, failures on both ends, etc.).

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachment
- 0001-Introduce-logical_work_mem-to-limit-ReorderBuffer-v4.patch.gz
- 0002-Issue-XLOG_XACT_ASSIGNMENT-with-wal_level-logical-v4.patch.gz
- 0003-Issue-individual-invalidations-with-wal_level-log-v4.patch.gz
- 0004-Extend-the-output-plugin-API-with-stream-methods-v4.patch.gz
- 0005-Implement-streaming-mode-in-ReorderBuffer-v4.patch.gz
- 0006-Add-support-for-streaming-to-built-in-replication-v4.patch.gz
- 0007-Track-statistics-for-streaming-spilling-v4.patch.gz
- 0008-BUGFIX-make-sure-subxact-is-marked-as-is_known_as-v4.patch.gz
- 0009-BUGFIX-set-final_lsn-for-subxacts-before-cleanup-v4.patch.gz
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Tomas Vondra
Date:
On 01/03/2018 09:06 PM, Tomas Vondra wrote:
> Hi,
>
> attached is v4 of the patch series, with a couple of changes:
>
> 1) Fixes a bunch of bugs I discovered during stress testing.
>
> I'm not going to go into details, but the main fixes are related to properly updating progress from the worker, and not streaming when creating the logical replication slot.
>
> 2) Introduces columns into pg_stat_replication.
>
> The new columns track various kinds of statistics (number of xacts, bytes, ...) about spill-to-disk/streaming. This will be useful when tuning the GUC memory limit.
>
> 3) Two temporary bugfixes that make the patch series work.

Forgot to mention that v4 also extends CREATE SUBSCRIPTION to allow customizing the streaming and memory limit. So you can do

CREATE SUBSCRIPTION ... WITH (streaming=on, work_mem=1024)

and this subscription will allow streaming, and logical_work_mem (on the provider) will be set to 1MB.

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Peter Eisentraut
Date:
On 1/3/18 14:53, Tomas Vondra wrote:
>> I don't see the need to tie this setting to maintenance_work_mem. maintenance_work_mem is often set to very large values, which could then have undesirable side effects on this use.
>
> Well, we need to pick some default value, and we can either use a fixed value (not sure what would be a good default) or tie it to an existing GUC. We only really have work_mem and maintenance_work_mem, and the walsender process will never use more than one such buffer. Which seems to be closer to maintenance_work_mem.
>
> Pretty much any default value can have undesirable side effects.

Let's just make it an independent setting unless we know any better. We don't have a lot of settings that depend on other settings, and the ones we do have have a very specific relationship.

>> Moreover, the name logical_work_mem makes it sound like it's a logical version of work_mem. Maybe we could think of another name.
>
> I won't object to a better name, of course. Any proposals?

logical_decoding_[work_]mem?

>> I think we need a way to report on how much memory is actually used, so the setting can be tuned. Something analogous to log_temp_files perhaps.
>
> Yes, I agree. I'm just about to submit an updated version of the patch series, that also introduces a few columns pg_stat_replication, tracking this type of stats (amount of data spilled to disk or streamed, etc.).

That seems OK. Perhaps we could bring forward the part of that patch that applies to this feature. That would also help testing *this* feature and determine what appropriate settings are.

--
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Peter Eisentraut
Date:
On 1/3/18 15:13, Tomas Vondra wrote:
> Forgot to mention that the v4 also extends the CREATE SUBSCRIPTION to allow customizing the streaming and memory limit. So you can do
>
> CREATE SUBSCRIPTION ... WITH (streaming=on, work_mem=1024)
>
> and this subscription will allow streaming, and the logica_work_mem (on provider) will be set to 1MB.

I was wondering already during PG10 development whether we should give subscriptions a generic configuration array, like databases and roles have, so we don't have to hardcode a bunch of similar stuff every time we add an option like this. At the time we only had synchronous_commit, but now we're adding more.

Also, instead of sticking this into the START_REPLICATION command, could we just run a SET command? That should work over replication connections as well.

--
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Peter Eisentraut
Date:
On 12/22/17 23:57, Tomas Vondra wrote:
> PART 1: adding logical_work_mem memory limit (0001)
> ---------------------------------------------------
>
> Currently, limiting the amount of memory consumed by logical decoding is tricky (or you might say impossible) for several reasons:

I would like to see some more discussion on this, but I think not a lot of people understand the details, so I'll try to write up an explanation here. This code is also somewhat new to me, so please correct me if there are inaccuracies, while keeping in mind that I'm trying to simplify.

The data in the WAL is written as it happens, so the changes belonging to different transactions are all mixed together. One of the jobs of logical decoding is to reassemble the changes belonging to each transaction. The top-level data structure for that is the infamous ReorderBuffer. So as it reads the WAL and sees something about a transaction, it keeps a copy of that change in memory, indexed by transaction ID (ReorderBufferChange). When the transaction commits, the accumulated changes are passed to the output plugin and then freed. If the transaction aborts, then changes are just thrown away.

So when logical decoding is active, a copy of the changes for each active transaction is kept in memory (once per walsender).

More precisely, the above happens for each subtransaction. When the top-level transaction commits, it finds all its subtransactions in the ReorderBuffer, reassembles everything in the right order, then invokes the output plugin.

All this could end up using an unbounded amount of memory, so there is a mechanism to spill changes to disk. The way this currently works is hardcoded, and this patch proposes to change that.

Currently, when a transaction or subtransaction has accumulated 4096 changes, it is spilled to disk. When the top-level transaction commits, things are read back from disk to do the final processing mentioned above.

This all works mostly fine, but you can construct some more extreme cases where this can blow up.

Here is a mundane example. Let's say a change entry takes 100 bytes (it might contain a new row, or an update key and some new column values, for example). If you have 100 concurrent active sessions and no subtransactions, then logical decoding memory is bounded by 4096 * 100 * 100 = 40 MB (per walsender) before things spill to disk.

Now let's say you are using a lot of subtransactions, because you are using PL functions, exception handling, triggers, doing batch updates. If you have 200 subtransactions on average per concurrent session, the memory usage bound in that case would be 4096 * 100 * 100 * 200 = 8 GB (per walsender). And so on. If you have more concurrent sessions or larger changes or more subtransactions, you'll use much more than those 8 GB. And if you don't have those 8 GB, then you're stuck at this point.

That is the consideration when we record changes, but we also need memory when we do the final processing at commit time. That is slightly less problematic because we only process one top-level transaction at a time, so the formula is only 4096 * avg_size_of_changes * nr_subxacts (without the concurrent sessions factor).

So, this patch proposes to improve this as follows:

- We compute the actual size of each ReorderBufferChange and keep a running tally for each transaction, instead of just counting the number of changes.

- We have a configuration setting that allows us to change the limit instead of the hardcoded 4096. The configuration setting is also in terms of memory, not in number of changes.

- The configuration setting is for the total memory usage per decoding session, not per subtransaction. (So we also keep a running tally for the entire ReorderBuffer.)

There are two open issues with this patch:

One, this mechanism only applies when recording changes. The processing at commit time still uses the previous hardcoded mechanism. The reason for this is, AFAIU, that as things currently work, you have to have all subtransactions in memory to do the final processing. There are some proposals to change this as well, but they are more involved. Arguably, per my explanation above, memory use at commit time is less likely to be a problem.

Two, what to do when the memory limit is reached. With the old accounting, this was easy, because we'd decide for each subtransaction independently whether to spill it to disk, when it has reached its 4096 limit. Now, we are looking at a global limit, so we have to find a transaction to spill in some other way. The proposed patch searches through the entire list of transactions to find the largest one. But as the patch says:

"XXX With many subtransactions this might be quite slow, because we'll have to walk through all of them. There are some options how we could improve that: (a) maintain some secondary structure with transactions sorted by amount of changes, (b) not looking for the entirely largest transaction, but e.g. for transaction using at least some fraction of the memory limit, and (c) evicting multiple transactions at once, e.g. to free a given portion of the memory limit (e.g. 50%)."

(a) would create more overhead for the case where everything fits into memory, so it seems unattractive. Some combination of (b) and (c) seems useful, but we'd have to come up with something concrete.

Thoughts?

--
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
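To illustrate the per-change accounting summarized above, here is a toy sketch. ToySizedChange and its fields are illustrative stand-ins for ReorderBufferChange, and TxnEntry/total_size are the toy bookkeeping from the sketch near the top of the thread; none of this is the actual patch code.

/*
 * Toy sketch of computing the memory charged for one change: the struct
 * itself plus whatever tuple data it carries, added both to the owning
 * transaction's tally and to the buffer-wide total.
 */
typedef struct ToySizedChange
{
    size_t      old_tuple_len;  /* old row image (UPDATE/DELETE), 0 if none */
    size_t      new_tuple_len;  /* new row image (INSERT/UPDATE), 0 if none */
} ToySizedChange;

static size_t
change_memory_size(const ToySizedChange *change)
{
    return sizeof(ToySizedChange)
        + change->old_tuple_len
        + change->new_tuple_len;
}

static void
account_for_change(TxnEntry *txn, const ToySizedChange *change)
{
    size_t      sz = change_memory_size(change);

    txn->size += sz;            /* per-transaction running tally */
    total_size += sz;           /* whole decoding session, checked against the limit */
}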
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Greg Stark
Date:
On 11 January 2018 at 19:41, Peter Eisentraut <peter.eisentraut@2ndquadrant.com> wrote:
> Two, what to do when the memory limit is reached. With the old accounting, this was easy, because we'd decide for each subtransaction independently whether to spill it to disk, when it has reached its 4096 limit. Now, we are looking at a global limit, so we have to find a transaction to spill in some other way. The proposed patch searches through the entire list of transactions to find the largest one. But as the patch says:
>
> "XXX With many subtransactions this might be quite slow, because we'll have to walk through all of them. There are some options how we could improve that: (a) maintain some secondary structure with transactions sorted by amount of changes, (b) not looking for the entirely largest transaction, but e.g. for transaction using at least some fraction of the memory limit, and (c) evicting multiple transactions at once, e.g. to free a given portion of the memory limit (e.g. 50%)."

AIUI spilling to disk doesn't affect absorbing future updates, we would just keep accumulating them in memory, right? We won't need to unspill until it comes time to commit.

Is there any actual advantage to picking the largest transaction? It means fewer spills and fewer unspills at commit time, but that's just a bigger spike of I/O and more of a chance of spilling more than necessary to get by. In the end it'll be more or less the same amount of data read back, just all in one big spike when spilling and one big spike when committing. If you spilled smaller transactions you would have a small amount of I/O more frequently and have to read back small amounts for many commits. But it would add up to the same amount of I/O (or less, if you avoid spilling more than necessary).

The real aim should be to try to pick the transaction that will be committed furthest in the future. That gives you the most memory to use for live transactions for the longest time and could let you process the maximum amount of transactions without spilling them. So either the oldest transaction (in the expectation that it's been open a while and appears to be a long-lived batch job that will stay open for a long time) or the youngest transaction (in the expectation that all transactions are more or less equally long-lived) might make sense.

--
greg
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Peter Eisentraut
Date:
On 1/11/18 18:23, Greg Stark wrote:
> AIUI spilling to disk doesn't affect absorbing future updates, we would just keep accumulating them in memory right? We won't need to unspill until it comes time to commit.

Once a transaction has been serialized, future updates keep accumulating in memory, until perhaps it gets serialized again. But then at commit time, if a transaction has been partially serialized at all, all the remaining changes are also serialized before the whole thing is read back in (see reorderbuffer.c line 855).

So one optimization would be to specially keep track of all transactions that have been serialized already and pick those first for further serialization, because it will be done eventually anyway.

But this is only a secondary optimization, because it doesn't help in the extreme cases that either no (or few) transactions have been serialized or all (or most) transactions have been serialized.

> The real aim should be to try to pick the transaction that will be committed furthest in the future. That gives you the most memory to use for live transactions for the longest time and could let you process the maximum amount of transactions without spilling them. So either the oldest transaction (in the expectation that it's been open a while and appears to be a long-lived batch job that will stay open for a long time) or the youngest transaction (in the expectation that all transactions are more or less equally long-lived) might make sense.

Yes, that makes sense. We'd still need to keep a separate ordered list of transactions somewhere, but that might be easier if we just order them in the order we see them.

--
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Tomas Vondra
Date:
On 01/11/2018 08:41 PM, Peter Eisentraut wrote:
> On 12/22/17 23:57, Tomas Vondra wrote:
>> PART 1: adding logical_work_mem memory limit (0001)
>> ---------------------------------------------------
>>
>> Currently, limiting the amount of memory consumed by logical decoding is tricky (or you might say impossible) for several reasons:
>
> I would like to see some more discussion on this, but I think not a lot of people understand the details, so I'll try to write up an explanation here. This code is also somewhat new to me, so please correct me if there are inaccuracies, while keeping in mind that I'm trying to simplify.
>
> ... snip ...

Thanks for a comprehensive summary of the patch!

> "XXX With many subtransactions this might be quite slow, because we'll have to walk through all of them. There are some options how we could improve that: (a) maintain some secondary structure with transactions sorted by amount of changes, (b) not looking for the entirely largest transaction, but e.g. for transaction using at least some fraction of the memory limit, and (c) evicting multiple transactions at once, e.g. to free a given portion of the memory limit (e.g. 50%)."
>
> (a) would create more overhead for the case where everything fits into memory, so it seems unattractive. Some combination of (b) and (c) seems useful, but we'd have to come up with something concrete.

Yeah, when writing that comment I was worried that (a) might get rather expensive. I was thinking about maintaining a dlist of transactions sorted by size (ReorderBuffer now only has a hash table), so that we could evict transactions from the beginning of the list.

But while that speeds up the choice of transactions to evict, the added cost is rather high, particularly when most transactions are roughly of the same size. Because in that case we probably have to move the nodes around in the list quite often. So it seems wiser to just walk the list once when looking for a victim.

What I'm thinking about instead is tracking just some approximated version of this - it does not really matter whether we evict the really largest transaction or one that is a couple of kilobytes smaller. What we care about is an answer to this question: Is there some very large transaction that we could evict to free a lot of memory, or are all transactions fairly small?

So perhaps we can define some "size classes" and track to which of them each transaction belongs. For example, we could split the memory limit into 100 buckets, each representing a 1% size increment. A transaction would not switch the class very often, and it would be trivial to pick the largest transaction. When all the transactions are squashed in the smallest classes, we may switch to some alternative strategy. Not sure.

In any case, I don't really know how expensive the selection actually is, and if it's an issue. I'll do some measurements.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
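For illustration, the "size classes" idea might look roughly like this toy sketch (reusing the illustrative TxnEntry from earlier sketches; the bookkeeping that moves a transaction between classes when its size changes is omitted, and none of this is patch code):

/*
 * Toy sketch of bucketing transactions by size: the memory limit is split
 * into 100 classes of 1% each, and the eviction victim is any transaction
 * from the highest occupied class.
 */
#define NUM_SIZE_CLASSES 100

static TxnEntry *size_class_heads[NUM_SIZE_CLASSES];   /* per-class lists */

static int
txn_size_class(size_t txn_size, size_t limit)
{
    int         cls = (int) ((txn_size * NUM_SIZE_CLASSES) / limit);

    /* transactions at or above the limit all land in the last class */
    if (cls >= NUM_SIZE_CLASSES)
        cls = NUM_SIZE_CLASSES - 1;
    return cls;
}

static TxnEntry *
pick_eviction_victim(void)
{
    for (int cls = NUM_SIZE_CLASSES - 1; cls >= 0; cls--)
    {
        /* any member of the highest occupied class is "large enough" */
        if (size_class_heads[cls] != NULL)
            return size_class_heads[cls];
    }
    return NULL;                /* reorder buffer is empty */
}

When only the lowest classes are occupied, that is the "all transactions are fairly small" case where a different strategy (or simply evicting several of them) would be needed.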
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Tomas Vondra
Date:
On 01/12/2018 05:35 PM, Peter Eisentraut wrote:
> On 1/11/18 18:23, Greg Stark wrote:
>> AIUI spilling to disk doesn't affect absorbing future updates, we would just keep accumulating them in memory right? We won't need to unspill until it comes time to commit.
>
> Once a transaction has been serialized, future updates keep accumulating in memory, until perhaps it gets serialized again. But then at commit time, if a transaction has been partially serialized at all, all the remaining changes are also serialized before the whole thing is read back in (see reorderbuffer.c line 855).
>
> So one optimization would be to specially keep track of all transactions that have been serialized already and pick those first for further serialization, because it will be done eventually anyway.
>
> But this is only a secondary optimization, because it doesn't help in the extreme cases that either no (or few) transactions have been serialized or all (or most) transactions have been serialized.
>
>> The real aim should be to try to pick the transaction that will be committed furthest in the future. That gives you the most memory to use for live transactions for the longest time and could let you process the maximum amount of transactions without spilling them. So either the oldest transaction (in the expectation that it's been open a while and appears to be a long-lived batch job that will stay open for a long time) or the youngest transaction (in the expectation that all transactions are more or less equally long-lived) might make sense.
>
> Yes, that makes sense. We'd still need to keep a separate ordered list of transactions somewhere, but that might be easier if we just order them in the order we see them.

Wouldn't the 'toplevel_by_lsn' list be suitable for this? Subtransactions don't really commit independently, but as part of the toplevel xact. And that list is ordered by LSN, which is pretty much exactly the order in which we see the transactions.

I feel somewhat uncomfortable about evicting the oldest (or youngest) transactions based on some assumed correlation with the commit order. I'm pretty sure that will bite us badly for some workloads.

Another somewhat non-intuitive detail is that because ReorderBuffer switched to the Generation allocator for changes (which usually represent 99% of the memory used during decoding), it does not reuse memory the way AllocSet does. Actually, it does not reuse memory at all, aiming to eventually give the memory back to libc (which AllocSet can't do).

Because of this, evicting the youngest transactions seems like a quite bad idea, because those chunks will not be reused and there may be other chunks on the blocks, preventing their release.

Yeah, complicated stuff.

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
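For reference, eviction driven by an LSN-ordered list of toplevel transactions (like the toplevel_by_lsn list mentioned above) would look roughly like this toy sketch; whether oldest-first is actually the right policy is exactly what is being debated here. The list variable and spill_to_disk() are illustrative placeholders carried over from the earlier sketches, not patch code.

/*
 * Toy sketch of age-based eviction: walk toplevel transactions in the order
 * of their first LSN (oldest first) and spill them until the total falls
 * below the limit.
 */
static TxnEntry *toplevel_oldest_first;     /* toy stand-in for toplevel_by_lsn */

static void
evict_oldest_until_under_limit(size_t limit)
{
    for (TxnEntry *txn = toplevel_oldest_first;
         txn != NULL && total_size > limit;
         txn = txn->next)
    {
        if (txn->size == 0)
            continue;           /* already spilled, nothing left to free */

        spill_to_disk(txn);
        total_size -= txn->size;
        txn->size = 0;
    }
}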
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Peter Eisentraut
Date:
On 1/12/18 23:19, Tomas Vondra wrote:
> Wouldn't the 'toplevel_by_lsn' be suitable for this? Subtransactions don't really commit independently, but as part of the toplevel xact. And that list is ordered by LSN, which is pretty much exactly the order in which we see the transactions.

Yes indeed. There is even ReorderBufferGetOldestTXN().

> Another somewhat non-intuitive detail is that because ReorderBuffer switched to Generation allocator for changes (which usually represent 99% of the memory used during decoding), it does not reuse memory the way AllocSet does. Actually, it does not reuse memory at all, aiming to eventually give the memory back to libc (which AllocSet can't do).
>
> Because of this evicting the youngest transactions seems like a quite bad idea, because those chunks will not be reused and there may be other chunks on the blocks, preventing their release.

Right. But this raises the question whether we are doing the memory accounting on the right level. If we are doing all this tracking based on ReorderBufferChanges, but then serializing changes possibly doesn't actually free any memory in the operating system, that's no good.

Can we get some usage statistics out of the memory context? It seems like we need to keep serializing transactions until we actually see the memory context size drop.

--
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Tomas Vondra
Date:
Attached is v5, fixing a silly bug in part 0006, causing segfault when creating a subscription. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachment
- 0001-Introduce-logical_work_mem-to-limit-ReorderBuffer-v5.patch.gz
- 0002-Issue-XLOG_XACT_ASSIGNMENT-with-wal_level-logical-v5.patch.gz
- 0003-Issue-individual-invalidations-with-wal_level-log-v5.patch.gz
- 0004-Extend-the-output-plugin-API-with-stream-methods-v5.patch.gz
- 0005-Implement-streaming-mode-in-ReorderBuffer-v5.patch.gz
- 0006-Add-support-for-streaming-to-built-in-replication-v5.patch.gz
- 0007-Track-statistics-for-streaming-spilling-v5.patch.gz
- 0008-BUGFIX-make-sure-subxact-is-marked-as-is_known_as-v5.patch.gz
- 0009-BUGFIX-set-final_lsn-for-subxacts-before-cleanup-v5.patch.gz
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Tomas Vondra
Date:
On 01/19/2018 03:34 PM, Tomas Vondra wrote:
> Attached is v5, fixing a silly bug in part 0006, causing segfault when creating a subscription.

Meh, there was a bug in the sgml docs (<variable> vs. <varname>), causing another failure. Hopefully v6 will pass the CI build; it does pass a build with the same parameters on my system.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachment
- 0001-Introduce-logical_work_mem-to-limit-ReorderBuffer-v6.patch.gz
- 0002-Issue-XLOG_XACT_ASSIGNMENT-with-wal_level-logical-v6.patch.gz
- 0003-Issue-individual-invalidations-with-wal_level-log-v6.patch.gz
- 0004-Extend-the-output-plugin-API-with-stream-methods-v6.patch.gz
- 0005-Implement-streaming-mode-in-ReorderBuffer-v6.patch.gz
- 0006-Add-support-for-streaming-to-built-in-replication-v6.patch.gz
- 0007-Track-statistics-for-streaming-spilling-v6.patch.gz
- 0008-BUGFIX-make-sure-subxact-is-marked-as-is_known_as-v6.patch.gz
- 0009-BUGFIX-set-final_lsn-for-subxacts-before-cleanup-v6.patch.gz
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Masahiko Sawada
Date:
On Sat, Jan 20, 2018 at 7:08 AM, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote: > On 01/19/2018 03:34 PM, Tomas Vondra wrote: >> Attached is v5, fixing a silly bug in part 0006, causing segfault when >> creating a subscription. >> > > Meh, there was a bug in the sgml docs (<variable> vs. <varname>), > causing another failure. Hopefully v6 will pass the CI build, it does > pass a build with the same parameters on my system. Thank you for working on this. This patch would be helpful for synchronous replication. I haven't looked at the code deeply yet, but I've reviewed the v6 patch set, especially on the subscriber side. All of the patches can be applied to current HEAD cleanly. Here are my review comments. ---- CREATE SUBSCRIPTION commands accept work_mem < 64 but it leads to an ERROR on the publisher side when starting replication. Probably we should check the value on the subscriber side as well. ---- When streaming = on, if we drop the subscription in the middle of receiving stream changes, DROP SUBSCRIPTION could leak tmp files (.changes file and .subxacts file). It also happens when a transaction on the upstream is aborted without an abort record. ---- Since we can change both the streaming option and the work_mem option by ALTER SUBSCRIPTION, the documentation of ALTER SUBSCRIPTION needs to be updated. ---- If we create a subscription without any options, both pg_subscription.substream and pg_subscription.subworkmem are set to null. However, since GetSubscription isn't aware of NULLs, we start the replication with invalid options like the following. LOG: received replication command: START_REPLICATION SLOT "hoge_sub" LOGICAL 0/0 (proto_version '2', work_mem '893219954', streaming 'on', publication_names '"hoge_pub"') I think we can set substream to false and subworkmem to -1 instead of null, and then make libpqrcv_startstreaming not set the streaming option if the stored value is -1. ---- Some WARNING messages appeared. Maybe these are for debugging purposes? WARNING: updating stream stats 0x1c12ef8 4 3 65604 WARNING: UpdateSpillStats: updating stats 0x1c12ef8 0 0 0 39 41 2632080 Regards, -- Masahiko Sawada NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
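For the first point, a minimal sketch of the kind of subscriber-side check being suggested (a backend-style fragment; the local variable name and the surrounding option-parsing code are assumptions, only the 64 kB floor and the standard ereport idiom are taken as given):

    /* Hypothetical check while parsing the subscription's work_mem option. */
    if (work_mem != -1 && work_mem < 64)
        ereport(ERROR,
                (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
                 errmsg("work_mem must be at least 64 kB")));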
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Peter Eisentraut
Date:
To close out this commit fest, I'm setting both of these patches as returned with feedback, as there are apparently significant issues to be addressed. Feel free to move them to the next commit fest when you think they are ready to be continued. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Tomas Vondra
Date:
On 01/31/2018 07:53 AM, Masahiko Sawada wrote: > On Sat, Jan 20, 2018 at 7:08 AM, Tomas Vondra > <tomas.vondra@2ndquadrant.com> wrote: >> On 01/19/2018 03:34 PM, Tomas Vondra wrote: >>> Attached is v5, fixing a silly bug in part 0006, causing segfault when >>> creating a subscription. >>> >> >> Meh, there was a bug in the sgml docs (<variable> vs. <varname>), >> causing another failure. Hopefully v6 will pass the CI build, it does >> pass a build with the same parameters on my system. > > Thank you for working on this. This patch would be helpful for > synchronous replication. > > I haven't looked at the code deeply yet, but I've reviewed the v6 > patch set, especially on the subscriber side. All of the patches can be > applied to current HEAD cleanly. Here are my review comments. > > ---- > CREATE SUBSCRIPTION commands accept work_mem < 64 but it leads to an ERROR > on the publisher side when starting replication. Probably we should check > the value on the subscriber side as well. > > ---- > When streaming = on, if we drop the subscription in the middle of > receiving stream changes, DROP SUBSCRIPTION could leak tmp files > (.changes file and .subxacts file). It also happens when a > transaction on the upstream is aborted without an abort record. > > ---- > Since we can change both the streaming option and the work_mem option by ALTER > SUBSCRIPTION, the documentation of ALTER SUBSCRIPTION needs to be updated. > > ---- > If we create a subscription without any options, both > pg_subscription.substream and pg_subscription.subworkmem are set to > null. However, since GetSubscription isn't aware of NULLs, we start the > replication with invalid options like the following. > LOG: received replication command: START_REPLICATION SLOT "hoge_sub" > LOGICAL 0/0 (proto_version '2', work_mem '893219954', streaming 'on', > publication_names '"hoge_pub"') > > I think we can set substream to false and subworkmem to -1 instead of > null, and then make libpqrcv_startstreaming not set the streaming option > if the stored value is -1. > > ---- > Some WARNING messages appeared. Maybe these are for debugging purposes? > > WARNING: updating stream stats 0x1c12ef8 4 3 65604 > WARNING: UpdateSpillStats: updating stats 0x1c12ef8 0 0 0 39 41 2632080 > > Regards, > Thanks for the review! I'll address the issues in the next version of the patch. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Tomas Vondra
Date:
On 02/01/2018 03:51 PM, Peter Eisentraut wrote: > To close out this commit fest, I'm setting both of these patches as > returned with feedback, as there are apparently significant issues to be > addressed. Feel free to move them to the next commit fest when you > think they are ready to be continued. > Will do. Thanks for the feedback. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Andres Freund
Date:
On 2018-02-01 23:51:55 +0100, Tomas Vondra wrote: > On 02/01/2018 03:51 PM, Peter Eisentraut wrote: > > To close out this commit fest, I'm setting both of these patches as > > returned with feedback, as there are apparently significant issues to be > > addressed. Feel free to move them to the next commit fest when you > > think they are ready to be continued. > > > > Will do. Thanks for the feedback. Hm, this CF entry is marked as needs review as of 2018-03-01 12:54:48, but I don't see a newer version posted? Greetings, Andres Freund
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Tomas Vondra
Date:
On 03/02/2018 02:12 AM, Andres Freund wrote: > On 2018-02-01 23:51:55 +0100, Tomas Vondra wrote: >> On 02/01/2018 03:51 PM, Peter Eisentraut wrote: >>> To close out this commit fest, I'm setting both of these patches as >>> returned with feedback, as there are apparently significant issues to be >>> addressed. Feel free to move them to the next commit fest when you >>> think they are ready to be continued. >>> >> >> Will do. Thanks for the feedback. > > Hm, this CF entry is marked as needs review as of 2018-03-01 12:54:48, > but I don't see a newer version posted? > Ah, apologies - that's due to moving the patch from the last CF (it was marked as RWF so I had to reopen it before moving it). I'll submit a new version of the patch shortly, please mark it as WOA until then. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
David Steele
Date:
Hi Tomas. On 3/1/18 9:33 PM, Tomas Vondra wrote: > On 03/02/2018 02:12 AM, Andres Freund wrote: >> On 2018-02-01 23:51:55 +0100, Tomas Vondra wrote: >>> On 02/01/2018 03:51 PM, Peter Eisentraut wrote: >>>> To close out this commit fest, I'm setting both of these patches as >>>> returned with feedback, as there are apparently significant issues to be >>>> addressed. Feel free to move them to the next commit fest when you >>>> think they are ready to be continued. >>>> >>> >>> Will do. Thanks for the feedback. >> >> Hm, this CF entry is marked as needs review as of 2018-03-01 12:54:48, >> but I don't see a newer version posted? >> > > Ah, apologies - that's due to moving the patch from the last CF (it was > marked as RWF so I had to reopen it before moving it). I'll submit a new > version of the patch shortly, please mark it as WOA until then. Marked as Waiting on Author. -- -David david@pgmasters.net
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Andres Freund
Date:
Hi, On 2018-03-01 21:39:36 -0500, David Steele wrote: > On 3/1/18 9:33 PM, Tomas Vondra wrote: > > On 03/02/2018 02:12 AM, Andres Freund wrote: > > > Hm, this CF entry is marked as needs review as of 2018-03-01 12:54:48, > > > but I don't see a newer version posted? > > > > > > > Ah, apologies - that's due to moving the patch from the last CF (it was > > marked as RWF so I had to reopen it before moving it). I'll submit a new > > version of the patch shortly, please mark it as WOA until then. > > Marked as Waiting on Author. Sorry to be the hard-ass, but given this patch hasn't been moved forward since 2018-01-19, I'm not sure why it's eligible to be in this CF in the first place? Greetings, Andres Freund
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Robert Haas
Date:
On Thu, Mar 1, 2018 at 9:33 PM, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote: > Ah, apologies - that's due to moving the patch from the last CF (it was > marked as RWF so I had to reopen it before moving it). I'll submit a new > version of the patch shortly, please mark it as WOA until then. So, the way it's supposed to work is you resubmit the patch first and then re-activate the CF entry. If you get to re-activate the CF entry without actually updating the patch, and then submit the patch afterwards, then the CF deadline becomes largely meaningless. I think a new patch should be rejected as untimely. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
David Steele
Date:
On 3/2/18 3:06 PM, Robert Haas wrote: > On Thu, Mar 1, 2018 at 9:33 PM, Tomas Vondra > <tomas.vondra@2ndquadrant.com> wrote: >> Ah, apologies - that's due to moving the patch from the last CF (it was >> marked as RWF so I had to reopen it before moving it). I'll submit a new >> version of the patch shortly, please mark it as WOA until then. > > So, the way it's supposed to work is you resubmit the patch first and > then re-activate the CF entry. If you get to re-activate the CF entry > without actually updating the patch, and then submit the patch > afterwards, then the CF deadline becomes largely meaningless. I think > a new patch should rejected as untimely. Hmmm, I missed that implication last night. I'll mark this Returned with Feedback. Tomas, please move to the next CF once you have an updated patch. Thanks, -- -David david@pgmasters.net
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Tomas Vondra
Date:
On 03/02/2018 09:21 PM, David Steele wrote: > On 3/2/18 3:06 PM, Robert Haas wrote: >> On Thu, Mar 1, 2018 at 9:33 PM, Tomas Vondra >> <tomas.vondra@2ndquadrant.com> wrote: >>> Ah, apologies - that's due to moving the patch from the last CF (it was >>> marked as RWF so I had to reopen it before moving it). I'll submit a new >>> version of the patch shortly, please mark it as WOA until then. >> >> So, the way it's supposed to work is you resubmit the patch first and >> then re-activate the CF entry. If you get to re-activate the CF entry >> without actually updating the patch, and then submit the patch >> afterwards, then the CF deadline becomes largely meaningless. I think >> a new patch should rejected as untimely. > > Hmmm, I missed that implication last night. I'll mark this Returned > with Feedback. > > Tomas, please move to the next CF once you have an updated patch. > Can you guys please point me to the CF rules that say this? Because my understanding (and not just mine, AFAICS) was obviously different. Clearly there's a disconnect somewhere. -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Tomas Vondra
Date:
Hi there, attached is an updated patch fixing all the reported issues (a bit more about those below). The main change in this patch version is reworked logging of subxact assignments, which needs to be done immediately for incremental decoding to work properly. The previous patch versions did that by logging a separate xlog record, which however had rather noticeable space overhead (~40% on a worst-case test - tiny table, no FPWs, ...). While in practice the overhead would be much closer to 0%, it still seemed unacceptable. Andres proposed doing something like we do with replication origins in XLogRecordAssemble, i.e. inventing a special block and embedding the assignment info into that (in the next xlog record). This turned out to work quite well, and the worst-case space overhead dropped to ~5%. I have attempted to do something like that with the invalidations, which is the other thing that needs to be logged immediately for incremental decoding to work correctly. The plan was to use the same approach as for assignments, i.e. embed the invalidations into the next xlog record and stop sending them in the commit message. That however turned out to be much more complicated - the embedding is fairly trivial, of course, but unlike assignments the invalidations are needed for hot standbys. If we only send them incrementally, I think the standby would have to collect them from the WAL records and store them in a way that survives restarts. So for invalidations the patch uses the original approach with a new xlog record type (ignored by the standby), and still logs the invalidations in the commit record (which is what the standby relies on). On 02/01/2018 11:50 PM, Tomas Vondra wrote: > On 01/31/2018 07:53 AM, Masahiko Sawada wrote: > ... >> ---- >> CREATE SUBSCRIPTION commands accept work_mem < 64 but it leads to an ERROR >> on the publisher side when starting replication. Probably we should check >> the value on the subscriber side as well. >> Added. >> ---- >> When streaming = on, if we drop the subscription in the middle of >> receiving stream changes, DROP SUBSCRIPTION could leak tmp files >> (.changes file and .subxacts file). It also happens when a >> transaction on the upstream is aborted without an abort record. >> Right. The files would get cleaned up eventually during restart (just like other temporary files), but leaking them after DROP SUBSCRIPTION is not cool. So I've added simple tracking of files (or rather streamed XIDs) in the worker, and clean them up explicitly on exit. >> ---- >> Since we can change both the streaming option and the work_mem option by ALTER >> SUBSCRIPTION, the documentation of ALTER SUBSCRIPTION needs to be updated. >> Yep, I've added a note that work_mem and streaming can also be changed. Those changes won't be applied to the already running worker, though. >> ---- >> If we create a subscription without any options, both >> pg_subscription.substream and pg_subscription.subworkmem are set to >> null. However, since GetSubscription isn't aware of NULLs, we start the >> replication with invalid options like the following. >> LOG: received replication command: START_REPLICATION SLOT "hoge_sub" >> LOGICAL 0/0 (proto_version '2', work_mem '893219954', streaming 'on', >> publication_names '"hoge_pub"') >> >> I think we can set substream to false and subworkmem to -1 instead of >> null, and then make libpqrcv_startstreaming not set the streaming option >> if the stored value is -1. >> Good catch! I've done pretty much what you suggested here, i.e.
store -1/false instead and then handle that in libpqrcv_startstreaming. >> ---- >> Some WARNING messages appeared. Maybe these are for debugging purposes? >> >> WARNING: updating stream stats 0x1c12ef8 4 3 65604 >> WARNING: UpdateSpillStats: updating stats 0x1c12ef8 0 0 0 39 41 2632080 >> Yeah, those should be removed. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
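For readers following along, the assignment information being discussed is the subxact-to-toplevel mapping that the pre-existing XLOG_XACT_ASSIGNMENT record already carries (the struct below is from access/xact.h in the backend sources); what this patch version changes is when and how that information reaches the WAL, not its content:

    typedef struct xl_xact_assignment
    {
        TransactionId xtop;         /* assigned top-level XID */
        int           nsubxacts;    /* number of subtransaction XIDs */
        TransactionId xsub[FLEXIBLE_ARRAY_MEMBER];  /* assigned subxids */
    } xl_xact_assignment;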
Attachment
- 0001-Introduce-logical_work_mem-to-limit-ReorderBuffer.patch.gz
- 0002-Immediatel-WAL-log-assignments.patch.gz
- 0003-Issue-individual-invalidations-with-wal_level-logica.patch.gz
- 0004-Extend-the-output-plugin-API-with-stream-methods.patch.gz
- 0005-Implement-streaming-mode-in-ReorderBuffer.patch.gz
- 0006-Add-support-for-streaming-to-built-in-replication.patch.gz
- 0007-Track-statistics-for-streaming-spilling.patch.gz
- 0008-BUGFIX-make-sure-subxact-is-marked-as-is_known_as_su.patch.gz
- 0009-BUGFIX-set-final_lsn-for-subxacts-before-cleanup.patch.gz
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Tomas Vondra
Date:
On 03/02/2018 09:05 PM, Andres Freund wrote: > Hi, > > On 2018-03-01 21:39:36 -0500, David Steele wrote: >> On 3/1/18 9:33 PM, Tomas Vondra wrote: >>> On 03/02/2018 02:12 AM, Andres Freund wrote: >>>> Hm, this CF entry is marked as needs review as of 2018-03-01 12:54:48, >>>> but I don't see a newer version posted? >>>> >>> >>> Ah, apologies - that's due to moving the patch from the last CF (it was >>> marked as RWF so I had to reopen it before moving it). I'll submit a new >>> version of the patch shortly, please mark it as WOA until then. >> >> Marked as Waiting on Author. > > Sorry to be the hard-ass, but given this patch hasn't been moved forward > since 2018-01-19, I'm not sure why it's elegible to be in this CF in the > first place? > That is somewhat misleading, I think. You're right the last version was submitted on 2018-01-19, but the next review arrived on 2018-01-31, i.e. right at the end of the CF. So it's not like the patch was sitting there with unresolved issues. Based on that review the patch was marked as RWF and thus not moved to 2018-03 automatically. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Andres Freund
Date:
On 2018-03-03 02:00:46 +0100, Tomas Vondra wrote: > That is somewhat misleading, I think. You're right the last version was > submitted on 2018-01-19, but the next review arrived on 2018-01-31, i.e. > right at the end of the CF. So it's not like the patch was sitting there > with unresolved issues. Based on that review the patch was marked as RWF > and thus not moved to 2018-03 automatically. I don't see how this changes anything. - Andres
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Tomas Vondra
Date:
On 03/03/2018 02:01 AM, Andres Freund wrote: > On 2018-03-03 02:00:46 +0100, Tomas Vondra wrote: >> That is somewhat misleading, I think. You're right the last version >> was submitted on 2018-01-19, but the next review arrived on >> 2018-01-31, i.e. right at the end of the CF. So it's not like the >> patch was sitting there with unresolved issues. Based on that >> review the patch was marked as RWF and thus not moved to 2018-03 >> automatically. > > I don't see how this changes anything. > You've used "The patch hasn't moved forward since 2018-01-19," as an argument why the patch is not eligible for 2018-03. I suggest that argument is misleading, because patches generally do not move without reviews, and it's difficult to respond to a review that arrives on the last day of a commitfest. Consider that without the review, the patch would end up with NR status, and would be moved to the next CF automatically. Isn't that a bit weird? kind regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Andres Freund
Date:
On 2018-03-03 02:34:06 +0100, Tomas Vondra wrote: > On 03/03/2018 02:01 AM, Andres Freund wrote: > > On 2018-03-03 02:00:46 +0100, Tomas Vondra wrote: > >> That is somewhat misleading, I think. You're right the last version > >> was submitted on 2018-01-19, but the next review arrived on > >> 2018-01-31, i.e. right at the end of the CF. So it's not like the > >> patch was sitting there with unresolved issues. Based on that > >> review the patch was marked as RWF and thus not moved to 2018-03 > >> automatically. > > > > I don't see how this changes anything. > > > > You've used "The patch hasn't moved forward since 2018-01-19," as an > argument why the patch is not eligible for 2018-03. I suggest that > argument is misleading, because patches generally do not move without > reviews, and it's difficult to respond to a review that arrives on the > last day of a commitfest. > > Consider that without the review, the patch would end up with NR status, > and would be moved to the next CF automatically. Isn't that a bit weird? Not sure I follow. The point is that nobody would have complained if you'd moved the patch into this fest if you'd updated it *before* it started? Greetings, Andres Freund
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
David Steele
Date:
On 3/2/18 8:01 PM, Andres Freund wrote: > On 2018-03-03 02:00:46 +0100, Tomas Vondra wrote: >> That is somewhat misleading, I think. You're right the last version was >> submitted on 2018-01-19, but the next review arrived on 2018-01-31, i.e. >> right at the end of the CF. So it's not like the patch was sitting there >> with unresolved issues. Based on that review the patch was marked as RWF >> and thus not moved to 2018-03 automatically. > > I don't see how this changes anything. I agree that things could be clearer, and Andres has produced a great document that we can build on. The old one had gotten a bit stale. However, I think it's pretty obvious that a CF entry should be accompanied with a patch. It sounds like the timing was awkward but you still had 28 days to produce a new patch. I also notice that you submitted 7 patches in this CF but are reviewing zero. -- -David david@pgmasters.net
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Tomas Vondra
Date:
On 03/03/2018 02:37 AM, David Steele wrote: > On 3/2/18 8:01 PM, Andres Freund wrote: >> On 2018-03-03 02:00:46 +0100, Tomas Vondra wrote: >>> That is somewhat misleading, I think. You're right the last version was >>> submitted on 2018-01-19, but the next review arrived on 2018-01-31, i.e. >>> right at the end of the CF. So it's not like the patch was sitting there >>> with unresolved issues. Based on that review the patch was marked as RWF >>> and thus not moved to 2018-03 automatically. >> >> I don't see how this changes anything. > > I agree that things could be clearer, and Andres has produced a great > document that we can build on. The old one had gotten a bit stale. > > However, I think it's pretty obvious that a CF entry should be > accompanied with a patch. It sounds like the timing was awkward but > you still had 28 days to produce a new patch. > Based on internal discussion I'm not so sure about the "pretty obvious" part. It certainly wasn't that obvious to me, otherwise I'd submit the revised patch earlier - hindsight is 20/20. > I also notice that you submitted 7 patches in this CF but are > reviewing zero. > I've volunteered to review a couple of patches at the FOSDEM Developer Meeting - I thought Stephen was entering that into the CF app, not sure where it got lost. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
David Steele
Date:
On 3/2/18 8:54 PM, Tomas Vondra wrote: > On 03/03/2018 02:37 AM, David Steele wrote: >> On 3/2/18 8:01 PM, Andres Freund wrote: >>> On 2018-03-03 02:00:46 +0100, Tomas Vondra wrote: >>>> That is somewhat misleading, I think. You're right the last version was >>>> submitted on 2018-01-19, but the next review arrived on 2018-01-31, i.e. >>>> right at the end of the CF. So it's not like the patch was sitting there >>>> with unresolved issues. Based on that review the patch was marked as RWF >>>> and thus not moved to 2018-03 automatically. >>> >>> I don't see how this changes anything. >> >> I agree that things could be clearer, and Andres has produced a great >> document that we can build on. The old one had gotten a bit stale. >> >> However, I think it's pretty obvious that a CF entry should be >> accompanied with a patch. It sounds like the timing was awkward but >> you still had 28 days to produce a new patch. > > Based on internal discussion I'm not so sure about the "pretty obvious" > part. It certainly wasn't that obvious to me, otherwise I'd submit the > revised patch earlier - hindsight is 20/20. Indeed it is. Be assured that nobody takes pleasure in pushing patches, but we have limited resources and must make some choices. >> I also notice that you submitted 7 patches in this CF but are >> reviewing zero. > > I've volunteered to review a couple of patches at the FOSDEM Developer > Meeting - I thought Stephen was entering that into the CF app, not sure > where it got lost. There are plenty of patches that need review, so go for it. Regards, -- -David david@pgmasters.net
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Erik Rijkers
Date:
On 2018-03-03 01:55, Tomas Vondra wrote: > Hi there, > > attached is an updated patch fixing all the reported issues (a bit more > about those below). Hi, 0007-Track-statistics-for-streaming-spilling.patch won't apply. All the other patches apply ok. patch complains with: patching file doc/src/sgml/monitoring.sgml patching file src/backend/catalog/system_views.sql Hunk #1 succeeded at 734 (offset 2 lines). patching file src/backend/replication/logical/reorderbuffer.c patching file src/backend/replication/walsender.c patching file src/include/catalog/pg_proc.h Hunk #1 FAILED at 2903. 1 out of 1 hunk FAILED -- saving rejects to file src/include/catalog/pg_proc.h.rej patching file src/include/replication/reorderbuffer.h patching file src/include/replication/walsender_private.h patching file src/test/regress/expected/rules.out Hunk #1 succeeded at 1861 (offset 2 lines). Attached is the produced reject file. thanks, Erik Rijkers
Attachment
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Tomas Vondra
Date:
On 03/03/2018 06:19 AM, Erik Rijkers wrote: > On 2018-03-03 01:55, Tomas Vondra wrote: >> Hi there, >> >> attached is an updated patch fixing all the reported issues (a bit more >> about those below). > > Hi, > > 0007-Track-statistics-for-streaming-spilling.patch won't apply. All > the other patches apply ok. > > patch complaints with: > > patching file doc/src/sgml/monitoring.sgml > patching file src/backend/catalog/system_views.sql > Hunk #1 succeeded at 734 (offset 2 lines). > patching file src/backend/replication/logical/reorderbuffer.c > patching file src/backend/replication/walsender.c > patching file src/include/catalog/pg_proc.h > Hunk #1 FAILED at 2903. > 1 out of 1 hunk FAILED -- saving rejects to file > src/include/catalog/pg_proc.h.rej > patching file src/include/replication/reorderbuffer.h > patching file src/include/replication/walsender_private.h > patching file src/test/regress/expected/rules.out > Hunk #1 succeeded at 1861 (offset 2 lines). > > Attached the produced reject file. > Yeah, that's due to fd1a421fe66 which changed columns in pg_proc.h. Attached is a rebased patch, fixing this. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachment
- 0001-Introduce-logical_work_mem-to-limit-ReorderBuffer.patch.gz
- 0002-Immediatel-WAL-log-assignments.patch.gz
- 0003-Issue-individual-invalidations-with-wal_level-logica.patch.gz
- 0004-Extend-the-output-plugin-API-with-stream-methods.patch.gz
- 0005-Implement-streaming-mode-in-ReorderBuffer.patch.gz
- 0006-Add-support-for-streaming-to-built-in-replication.patch.gz
- 0007-Track-statistics-for-streaming-spilling.patch.gz
- 0008-BUGFIX-make-sure-subxact-is-marked-as-is_known_as_su.patch.gz
- 0009-BUGFIX-set-final_lsn-for-subxacts-before-cleanup.patch.gz
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Peter Eisentraut
Date:
I think this patch is not going to be ready for PG11. - It depends on some work in the thread "logical decoding of two-phase transactions", which is still in progress. - Various details in the logical_work_mem patch (0001) are unresolved. - This being partially a performance feature, we haven't seen any performance tests (e.g., which settings result in which latencies under which workloads). That said, the feature seems useful and desirable, and the implementation makes sense. There are documentation and tests. But there is a significant amount of design and coding work still necessary. Attached is a fixup patch that I needed to make it compile. The last two patches in your series (0008, 0009) are labeled as bug fixes. Would you like to argue that they should be applied independently of the rest of the feature? -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachment
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Konstantin Knizhnik
Date:
On 11.01.2018 22:41, Peter Eisentraut wrote: > On 12/22/17 23:57, Tomas Vondra wrote: >> PART 1: adding logical_work_mem memory limit (0001) >> --------------------------------------------------- >> >> Currently, limiting the amount of memory consumed by logical decoding is >> tricky (or you might say impossible) for several reasons: > I would like to see some more discussion on this, but I think not a lot > of people understand the details, so I'll try to write up an explanation > here. This code is also somewhat new to me, so please correct me if > there are inaccuracies, while keeping in mind that I'm trying to simplify. > > The data in the WAL is written as it happens, so the changes belonging > to different transactions are all mixed together. One of the jobs of > logical decoding is to reassemble the changes belonging to each > transaction. The top-level data structure for that is the infamous > ReorderBuffer. So as it reads the WAL and sees something about a > transaction, it keeps a copy of that change in memory, indexed by > transaction ID (ReorderBufferChange). When the transaction commits, the > accumulated changes are passed to the output plugin and then freed. If > the transaction aborts, then changes are just thrown away. > > So when logical decoding is active, a copy of the changes for each > active transaction is kept in memory (once per walsender). > > More precisely, the above happens for each subtransaction. When the > top-level transaction commits, it finds all its subtransactions in the > ReorderBuffer, reassembles everything in the right order, then invokes > the output plugin. > > All this could end up using an unbounded amount of memory, so there is a > mechanism to spill changes to disk. The way this currently works is > hardcoded, and this patch proposes to change that. > > Currently, when a transaction or subtransaction has accumulated 4096 > changes, it is spilled to disk. When the top-level transaction commits, > things are read back from disk to do the final processing mentioned above. > > This all works mostly fine, but you can construct some more extreme > cases where this can blow up. > > Here is a mundane example. Let's say a change entry takes 100 bytes (it > might contain a new row, or an update key and some new column values, > for example). If you have 100 concurrent active sessions and no > subtransactions, then logical decoding memory is bounded by 4096 * 100 * > 100 = 40 MB (per walsender) before things spill to disk. > > Now let's say you are using a lot of subtransactions, because you are > using PL functions, exception handling, triggers, doing batch updates. > If you have 200 subtransactions on average per concurrent session, the > memory usage bound in that case would be 4096 * 100 * 100 * 200 = 8 GB > (per walsender). And so on. If you have more concurrent sessions or > larger changes or more subtransactions, you'll use much more than those > 8 GB. And if you don't have those 8 GB, then you're stuck at this point. > > That is the consideration when we record changes, but we also need > memory when we do the final processing at commit time. That is slightly > less problematic because we only process one top-level transaction at a > time, so the formula is only 4096 * avg_size_of_changes * nr_subxacts > (without the concurrent sessions factor). 
> > So, this patch proposes to improve this as follows: > > - We compute the actual size of each ReorderBufferChange and keep a > running tally for each transaction, instead of just counting the number > of changes. > > - We have a configuration setting that allows us to change the limit > instead of the hardcoded 4096. The configuration setting is also in > terms of memory, not in number of changes. > > - The configuration setting is for the total memory usage per decoding > session, not per subtransaction. (So we also keep a running tally for > the entire ReorderBuffer.) > > There are two open issues with this patch: > > One, this mechanism only applies when recording changes. The processing > at commit time still uses the previous hardcoded mechanism. The reason > for this is, AFAIU, that as things currently work, you have to have all > subtransactions in memory to do the final processing. There are some > proposals to change this as well, but they are more involved. Arguably, > per my explanation above, memory use at commit time is less likely to be > a problem. > > Two, what to do when the memory limit is reached. With the old > accounting, this was easy, because we'd decide for each subtransaction > independently whether to spill it to disk, when it has reached its 4096 > limit. Now, we are looking at a global limit, so we have to find a > transaction to spill in some other way. The proposed patch searches > through the entire list of transactions to find the largest one. But as > the patch says: > > "XXX With many subtransactions this might be quite slow, because we'll > have to walk through all of them. There are some options how we could > improve that: (a) maintain some secondary structure with transactions > sorted by amount of changes, (b) not looking for the entirely largest > transaction, but e.g. for transaction using at least some fraction of > the memory limit, and (c) evicting multiple transactions at once, e.g. > to free a given portion of the memory limit (e.g. 50%)." > > (a) would create more overhead for the case where everything fits into > memory, so it seems unattractive. Some combination of (b) and (c) seems > useful, but we'd have to come up with something concrete. > > Thoughts? > I am very sorry that I have not noticed this thread before. Spilling to a file in the reorder buffer is the main factor limiting the speed of importing data in multimaster and shardman (sharding based on FDW with redundancy provided by LR). This is why we think a lot about possible ways of addressing this issue. Right now the data of a huge transaction is written to disk three times before it is applied at the replica, and obviously read three times as well. First it is saved in WAL, then spilled to disk by the reorder buffer, and once again spilled to disk at the replica before assignment to the particular apply worker (the last one is specific to multimaster, which can apply received transactions concurrently). We considered three different approaches: 1. Streaming. It is similar to the proposed patch; the main difference is that we do not want to spill the transaction to a temporary file at the replica, but apply it immediately in a separate backend and abort the transaction if it is aborted at the master. Certainly it will work only with 2PC. 2. Elimination of spilling by rescanning WAL. 3. Bypass WAL: add hooks to heapam to buffer and propagate changes immediately to the replica and apply them in a dedicated backend. I have implemented a prototype of such replication.
With one replica it shows about a 1.5x slowdown compared with standalone/async LR and about a 2-3x improvement compared with sync LR. For two replicas the result is 2x slower than async LR and 2-8 times faster than sync LR (depending on the number of concurrent connections). Approach 3) seems to be specific to multimaster/shardman, so most likely it can not be considered for general LR. So I want to compare 1 and 2. Did you ever think about something like 2? Right now in the proposed patch you just move spilling to a file from the master to the replica. It still can make sense to avoid memory overflow and reduce disk IO at the master. But if we have just one huge transaction (COPY) importing gigabytes of data to the database, then performance will be almost the same with your patch or without it. The only difference is where we serialize the transaction: at the master or at the replica side. In this sense this patch doesn't solve the problem with slow load of large bulks of data through LR. Alternatively (approach 2) we can have a small in-memory buffer for decoding the transaction and remember the LSN and snapshot of this transaction's start. In case of buffer overflow we just continue the WAL traversal until we reach the end of the transaction. After that we restart scanning WAL from the beginning of this transaction and at this second pass send changes directly to the output plugin. So we have to scan WAL several times but do not need to spill anything to disk, neither at the publisher nor at the subscriber side. Certainly this approach will be inefficient if we have several long interleaving transactions. But in most customer use cases we have observed until now there is just one huge transaction performing a bulk load. Maybe I missed something, but this approach seems to be easier to implement than transaction streaming. And it doesn't require any changes in the output plugin API. I realize that it is a little bit late to ask this question once your patch is almost ready, but what do you think about it? Are there some pitfalls with this approach? There is one more aspect and performance problem with LR we have faced with shardman: if there are several publications for different subsets of tables at one instance, then WAL senders have to do a lot of useless work. They are decoding transactions which have no relation to their publication. But a WAL sender doesn't know that until it reaches the end of the transaction. What is worse: if the transaction is huge, then all WAL senders will spill it to disk even though only one of them actually needs it. So the data of a huge transaction is written not three times, but N times, where N is the number of publications. The only solution to the problem we can imagine is to let the backend somehow inform the WAL sender (through a shared message queue?) about the LSNs it should consider. In this case the WAL sender can skip large portions of WAL without decoding. We also want to know the opinion of 2ndQuadrant about this idea. -- Konstantin Knizhnik Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Peter Eisentraut
Date:
This patch set was not updated for the 2018-07 commitfest, so moved to -09. On 09.03.18 17:07, Peter Eisentraut wrote: > I think this patch is not going to be ready for PG11. > > - It depends on some work in the thread "logical decoding of two-phase > transactions", which is still in progress. > > - Various details in the logical_work_mem patch (0001) are unresolved. > > - This being partially a performance feature, we haven't seen any > performance tests (e.g., which settings result in which latencies under > which workloads). > > That said, the feature seems useful and desirable, and the > implementation makes sense. There are documentation and tests. But > there is a significant amount of design and coding work still necessary. > > Attached is a fixup patch that I needed to make it compile. > > The last two patches in your series (0008, 0009) are labeled as bug > fixes. Would you like to argue that they should be applied > independently of the rest of the feature? > -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Michael Paquier
Date:
On Sat, Mar 03, 2018 at 03:52:40PM +0100, Tomas Vondra wrote: > Yeah, that's due to fd1a421fe66 which changed columns in pg_proc.h. > Attached is a rebased patch, fixing this. The latest patch set does not apply anymore, and had no activity for the last two months, so I am marking it as returned with feedback. -- Michael
Attachment
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Tomas Vondra
Date:
Hi, Attached is an updated version of this patch series. It's meant to be applied on top of the 2pc decoding patch [1], because streaming of in-progress transactions requires handling of concurrent aborts. So it may or may not apply directly to master, I'm not sure - unfortunately that's likely to confuse the cputube thing, but I don't want to include the 2pc decoding bits here because that would be just confusing. If needed, the part introducing the logical_work_mem limit for ReorderBuffer can be separated and committed independently, but I do expect this to be committed after the 2pc decoding patch so I've left it like this. This new version is mostly just a rebase to current master (or almost, because 2pc decoding only applies to 29180e5d78 due to minor bitrot), but it also addresses the new stuff committed since last version (most importantly decoding of TRUNCATE). It also fixes a bug in WAL-logging of subxact assignments, where the assignment was included in records with XID=0, essentially failing to track the subxact properly. For the logical_work_mem part, I think this is quite solid. The main question is how to pick transactions for eviction. For now it uses the same approach as master (i.e. picking the largest top-level transaction, although measured by amount of memory and not just number of changes). But I've realized that may not work that great with the Generation context, because unlike AllocSet it does not reuse the memory. That's nice as it allows freeing old blocks (which AllocSet can't), but it means a small transaction can have a change on old blocks, preventing them from being freed. That is something we have in pg11 already, because that's where the Generation context got introduced - I haven't seen this issue in practice, but we might need to do something about it. In any case, I'm thinking we may need to pick a different eviction algorithm - say picking the transaction with the oldest change (and looping until we release at least one block in the Generation context), or maybe looking for blocks mixing changes from the smallest number of transactions, or something like that. Other ideas are welcome. I don't think the exact algorithm is particularly critical, because it's meant to be triggered only very rarely (i.e. pick logical_work_mem high enough). The in-progress streaming is a mostly mechanical extension of existing functionality (new methods in various APIs, ...) and refactoring of ReorderBuffer to handle incremental decoding. I'm sure it'd benefit from reviews, of course. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
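As a concrete reference for the eviction discussion, here is a minimal standalone model of the current policy (evict the largest transaction, measured by accounted memory, until the total drops under the limit). The names and numbers are invented, and per the caveat above the bytes "released" here may not be immediately returned to the OS by the Generation context:

    #include <stdio.h>
    #include <stddef.h>

    typedef struct ToyTxn
    {
        int    xid;
        size_t size;                /* accounted bytes still in memory */
    } ToyTxn;

    int
    main(void)
    {
        ToyTxn  txns[] = {{101, 300 * 1024}, {102, 900 * 1024}, {103, 150 * 1024}};
        int     ntxns = 3;
        size_t  limit = 512 * 1024; /* stand-in for logical_work_mem */
        size_t  total = 0;

        for (int i = 0; i < ntxns; i++)
            total += txns[i].size;

        /* Evict the largest transaction repeatedly until we are under the limit. */
        while (total > limit)
        {
            int largest = 0;

            for (int i = 1; i < ntxns; i++)
                if (txns[i].size > txns[largest].size)
                    largest = i;

            printf("evicting xid %d, releasing %zu bytes\n",
                   txns[largest].xid, txns[largest].size);
            total -= txns[largest].size;
            txns[largest].size = 0;  /* spilled to disk (or streamed) */
        }

        printf("in-memory total now %zu bytes\n", total);
        return 0;
    }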
Attachment
- 0001-Add-logical_work_mem-to-limit-ReorderBuffer-20181216.patch.gz
- 0002-Immediately-WAL-log-assignments-20181216.patch.gz
- 0003-Issue-individual-invalidations-with-wal_lev-20181216.patch.gz
- 0004-Extend-the-output-plugin-API-with-stream-me-20181216.patch.gz
- 0005-Implement-streaming-mode-in-ReorderBuffer-20181216.patch.gz
- 0006-Add-support-for-streaming-to-built-in-repli-20181216.patch.gz
- 0007-Track-statistics-for-streaming-spilling-20181216.patch.gz
- 0008-BUGFIX-set-final_lsn-for-subxacts-before-cl-20181216.patch.gz
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Tomas Vondra
Date:
FWIW the original CF entry in 2018-07 [1] was marked as RWF. I'm not sure what's the right way to resubmit such patches, so I've created a new entry in 2019-01 [2] referencing the same hackers thread (and with the same authors/reviewers metadata). [1] https://commitfest.postgresql.org/19/1429/ [2] https://commitfest.postgresql.org/21/1927/ regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Alexey Kondratov
Date:
Hi Tomas, > This new version is mostly just a rebase to current master (or almost, > because 2pc decoding only applies to 29180e5d78 due to minor bitrot), > but it also addresses the new stuff committed since last version (most > importantly decoding of TRUNCATE). It also fixes a bug in WAL-logging of > subxact assignments, where the assignment was included in records with > XID=0, essentially failing to track the subxact properly. I started reviewing your patch about a month ago and tried to do an in-depth review, since I am very interested in this patch too. The new version is not applicable to master 29180e5d78, but everything is OK after applying the 2pc patch first. Anyway, I guess it may complicate further testing and review, since any potential reviewer has to take into account both patches at once. The previous version was applicable to master and was working fine for me separately (except for a few patch-specific issues, which I try to explain below). Patch review ======== First of all, I want to say thank you for such a huge amount of work. Here are some problems which I have found and hopefully fixed with my additional patch (please find it attached; it should be applicable to the last commit of your newest patch version): 1) The most important issue is that your tap tests were broken - there was a missing "WITH (streaming=true)" option in the subscription-creating statement. Therefore, the spilling mechanism has been tested rather than streaming. 2) After fixing the tests, the first one with simple streaming immediately fails because of a logical replication worker segmentation fault. It happens since the worker tries to call stream_cleanup_files inside stream_open_file at the stream start while nxids is zero; it then goes to a negative value and everything crashes. Something similar may happen with the xids array, so I added two checks there. 3) The next problem is much more critical and concerns historic MVCC visibility rules. Previously, the walsender started decoding a transaction at commit and we were able to resolve all xmin, xmax, combocids to cmin/cmax, build the tuplecids hash and so on, but now we start doing all these things on the fly. Thus, a rather difficult situation arises: HeapTupleSatisfiesHistoricMVCC is trying to validate catalog tuples which are currently in the future relative to the current decoder position inside the transaction, e.g. we may want to resolve cmin/cmax of a tuple which was created with cid 3 and deleted with cid 5, while we are currently at cid 4, so our tuplecids hash is not complete enough to handle such a case. I have updated the HeapTupleSatisfiesHistoricMVCC visibility rules with two options: /* * If we accidentally see a tuple from our transaction but cannot resolve its * cmin, it is probably from the future, thus drop it. */ if (!resolved) return false; and /* * If we accidentally see a tuple from our transaction but cannot resolve its * cmax, or cmax == InvalidCommandId, it is probably still valid, thus accept it. */ if (!resolved || cmax == InvalidCommandId) return true; 4) There was a problem with marking the top-level transaction as having catalog changes if one of its subtransactions has them. It was causing a problem with DDL statements just after a subtransaction start (savepoint), so data from new columns was not replicated. 5) A similar issue with sending the schema. You send the schema only once per sub/transaction (IIRC), while we have to update the schema on each catalog change: invalidation execution, snapshot rebuild, adding new tuple cids.
So I ended up adding an is_schema_send flag to ReorderBufferTXN, since it is easy to set it inside the RB and read it in the output plugin. Probably we have to choose a better place for this flag. 6) To better handle all these tricky cases I added a new tap test, 014_stream_tough_ddl.pl, which consists of a really tough combination of DDL, DML, savepoints and ROLLBACK/RELEASE in a single transaction. I marked all my fixes and every questionable place with a comment and a "TOCHECK:" label for easy search. Removing pretty much any of these fixes leads to test failures due to segmentation faults or replication mismatches. Though I mostly read and tested the old version of the patch, after a quick look it seems that all these fixes are applicable to the new version as well. Performance ======== I have also performed a series of performance tests, and found that the patch adds a huge overhead in the case of a large transaction consisting of many small rows, e.g.: CREATE TABLE large_test (num1 bigint, num2 double precision, num3 double precision); EXPLAIN (ANALYZE, BUFFERS) INSERT INTO large_test (num1, num2, num3) SELECT round(random()*10), random(), random()*142 FROM generate_series(1, 1000000) s(i); Execution Time: 2407.709 ms Total Time: 11494,238 ms (00:11,494) With synchronous_standby_names and 64 MB logical_work_mem it takes up to x5 longer, while without the patch it is about x2. Thus, logical replication streaming is approximately x4 slower for similar transactions. However, dealing with large transactions consisting of a small number of large rows is much better: CREATE TABLE large_text (t TEXT); EXPLAIN (ANALYZE, BUFFERS) INSERT INTO large_text SELECT (SELECT string_agg('x', ',') FROM generate_series(1, 1000000)) FROM generate_series(1, 125); Execution Time: 3545.642 ms Total Time: 7678,617 ms (00:07,679) It is around the same x2 as without the patch. If someone is interested, I also added flamegraphs of the walsender and the logical replication worker during the processing of the first (numeric) transaction. Regards -- Alexey Kondratov Postgres Professional https://www.postgrespro.com Russian Postgres Company
Attachment
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Tomas Vondra
Date:
Hi Alexey, Thanks for the thorough and extremely valuable review! On 12/17/18 5:23 PM, Alexey Kondratov wrote: > Hi Tomas, >> This new version is mostly just a rebase to current master (or almost, >> because 2pc decoding only applies to 29180e5d78 due to minor bitrot), >> but it also addresses the new stuff committed since last version (most >> importantly decoding of TRUNCATE). It also fixes a bug in WAL-logging of >> subxact assignments, where the assignment was included in records with >> XID=0, essentially failing to track the subxact properly. > > I started reviewing your patch about a month ago and tried to do an > in-depth review, since I am very interested in this patch too. The new > version is not applicable to master 29180e5d78, but everything is OK > after applying 2pc patch before. Anyway, I guess it may complicate > further testing and review, since any potential reviewer has to take > into account both patches at once. Previous version was applicable to > master and was working fine for me separately (excepting a few > patch-specific issues, which I try to explain below). > I agree it's somewhat annoying, but I don't think there's a better way, unfortunately. Decoding in-progress transactions does require safe handling of concurrent aborts, so it has to be committed after the 2pc decoding patch (which makes that possible). But the 2pc patch also touches the same places as this patch series (it reworks the reorder buffer for example). > > Patch review > ======== > > First of all, I want to say thank you for such a huge work done. Here > are some problems, which I have found and hopefully fixed with my > additional patch (please, find attached, it should be applicable to the > last commit of your newest patch version): > > 1) The most important issue is that your tap tests were broken—there was > missing option "WITH (streaming=true)" in the subscription creating > statement. Therefore, spilling mechanism has been tested rather than > streaming. > D'oh! > 2) After fixing tests the first one with simple streaming is immediately > failed, because of logical replication worker segmentation fault. It > happens, since worker tries to call stream_cleanup_files inside > stream_open_file at the stream start, while nxids is zero, then it goes > to the negative value and everything crashes. Something similar may > happen with xids array, so I added two checks there. > > 3) The next problem is much more critical and is dedicated to historic > MVCC visibility rules. Previously, walsender was starting to decode > transaction on commit and we were able to resolve all xmin, xmax, > combocids to cmin/cmax, build tuplecids hash and so on, but now we start > doing all these things on the fly. > > Thus, rather difficult situation arises: HeapTupleSatisfiesHistoricMVCC > is trying to validate catalog tuples, which are currently in the future > relatively to the current decoder position inside transaction, e.g. we > may want to resolve cmin/cmax of a tuple, which was created with cid 3 > and deleted with cid 5, while we are currently at cid 4, so our > tuplecids hash is not complete to handle such a case. > Damn it! I ran into those two issues some time ago and fixed them, but I've forgotten to merge that fix into the patch. I'll merge those fixes, compare them to your proposed fix, and send a new version tomorrow. > > 4) There was a problem with marking top-level transaction as having > catalog changes if one of its subtransactions has.
It was causing a > problem with DDL statements just after subtransaction start (savepoint), > so data from new columns is not replicated. > > 5) Similar issue with schema send. You send schema only once per each > sub/transaction (IIRC), while we have to update schema on each catalog > change: invalidation execution, snapshot rebuild, adding new tuple cids. > So I ended up with adding is_schema_send flag to ReorderBufferTXN, since > it is easy to set it inside RB and read in the output plugin. Probably, > we have to choose a better place for this flag. > Hmm. Can you share an example of how to trigger these issues? > 6) To better handle all these tricky cases I added new tap > test—014_stream_tough_ddl.pl—which consist of really tough combination > of DDL, DML, savepoints and ROLLBACK/RELEASE in a single transaction. > Thanks! > I marked all my fixes and every questionable place with comment and > "TOCHECK:" label for easy search. Removing of pretty any of these fixes > leads to the tests fail due to the segmentation fault or replication > mismatch. Though I mostly read and tested old version of patch, but > after a quick look it seems that all these fixes are applicable to the > new version of patch as well. > Thanks. I'll go through your patch tomorrow. > > Performance > ======== > > I have also performed a series of performance tests, and found that > patch adds a huge overhead in the case of a large transaction consisting > of many small rows, e.g.: > > CREATE TABLE large_test (num1 bigint, num2 double precision, num3 double > precision); > > EXPLAIN (ANALYZE, BUFFERS) INSERT INTO large_test (num1, num2, num3) > SELECT round(random()*10), random(), random()*142 > FROM generate_series(1, 1000000) s(i); > > Execution Time: 2407.709 ms > Total Time: 11494,238 ms (00:11,494) > > With synchronous_standby_names and 64 MB logical_work_mem it takes up to > x5 longer, while without patch it is about x2. Thus, logical replication > streaming is approximately x4 as slower for similar transactions. > > However, dealing with large transactions consisting of a small number of > large rows is much better: > > CREATE TABLE large_text (t TEXT); > > EXPLAIN (ANALYZE, BUFFERS) INSERT INTO large_text > SELECT (SELECT string_agg('x', ',') > FROM generate_series(1, 1000000)) FROM generate_series(1, 125); > > Execution Time: 3545.642 ms > Total Time: 7678,617 ms (00:07,679) > > It is around the same x2 as without patch. If someone is interested I > also added flamegraphs of walsender and logical replication worker > during first numerical transaction processing. > Interesting. Any idea where the extra overhead in this particular case comes from? It's hard to deduce that from the single flame graph, when I don't have anything to compare it with (i.e. the flame graph for the "normal" case). I'll investigate this (probably not this week), but in general it's good to keep in mind a couple of things: 1) Some overhead is expected, due to doing things incrementally. 2) The memory limit should be set to a sufficiently high value to be hit only infrequently. 3) And when the limit is actually hit, it's an alternative to spilling large amounts of data locally (to disk) or incurring significant replication lag later. So I'm not particularly worried, but I'll look into that. I'd be much more worried if there was measurable overhead in cases when there's no streaming happening (either because it's disabled or the memory limit was not hit).
regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: PATCH: logical_work_mem and logical streaming of largein-progress transactions
From
Alexey Kondratov
Date:
On 18.12.2018 1:28, Tomas Vondra wrote: >> 4) There was a problem with marking top-level transaction as having >> catalog changes if one of its subtransactions has. It was causing a >> problem with DDL statements just after subtransaction start (savepoint), >> so data from new columns is not replicated. >> >> 5) Similar issue with schema send. You send schema only once per each >> sub/transaction (IIRC), while we have to update schema on each catalog >> change: invalidation execution, snapshot rebuild, adding new tuple cids. >> So I ended up with adding is_schema_send flag to ReorderBufferTXN, since >> it is easy to set it inside RB and read in the output plugin. Probably, >> we have to choose a better place for this flag. >> > Hmm. Can you share an example how to trigger these issues? Test cases inside 014_stream_tough_ddl.pl and old ones (with streaming=true option added) should reproduce all these issues. In general, it happens in a txn like: INSERT SAVEPOINT ALTER TABLE ... ADD COLUMN INSERT then the second insert may discover old version of catalog. > Interesting. Any idea where does the extra overhead in this particular > case come from? It's hard to deduce that from the single flame graph, > when I don't have anything to compare it with (i.e. the flame graph for > the "normal" case). I guess that bottleneck is in disk operations. You can check logical_repl_worker_new_perf.svg flame graph: disk reads (~9%) and writes (~26%) take around 35% of CPU time in summary. To compare, please, see attached flame graph for the following transaction: INSERT INTO large_text SELECT (SELECT string_agg('x', ',') FROM generate_series(1, 2000)) FROM generate_series(1, 1000000); Execution Time: 44519.816 ms Time: 98333,642 ms (01:38,334) where disk IO is only ~7-8% in total. So we get very roughly the same ~x4-5 performance drop here. JFYI, I am using a machine with SSD for tests. Therefore, probably you may write changes on receiver in bigger chunks, not each change separately. > So I'm not particularly worried, but I'll look into that. I'd be much > more worried if there was measurable overhead in cases when there's no > streaming happening (either because it's disabled or the memory limit > was not hit). What I have also just found, is that if a table row is large enough to be TOASTed, e.g.: INSERT INTO large_text SELECT (SELECT string_agg('x', ',') FROM generate_series(1, 1000000)) FROM generate_series(1, 1000); then logical_work_mem limit is not hit and we neither stream, nor spill to disk this transaction, while it is still large. In contrast, the transaction above (with 1000000 smaller rows) being comparable in size is streamed. Not sure, that it is easy to add proper accounting of TOAST-able columns, but it worth it. -- Alexey Kondratov Postgres Professional https://www.postgrespro.com Russian Postgres Company
Attachment
Re: PATCH: logical_work_mem and logical streaming of largein-progress transactions
From
Tomas Vondra
Date:
Hi Alexey, Attached is an updated version of the patches, with all the fixes I've done in the past. I believe it should fix at least some of the issues you reported - certainly the problem with stream_cleanup_files, but perhaps some of the other issues too. I'm a bit confused by the changes to TAP tests. Per the patch summary, some .pl files get renamed (nor sure why), a new one is added, etc. So I've instead enabled streaming subscriptions in all tests, which with this patch produces two failures: Test Summary Report ------------------- t/004_sync.pl (Wstat: 7424 Tests: 1 Failed: 0) Non-zero exit status: 29 Parse errors: Bad plan. You planned 7 tests but ran 1. t/011_stream_ddl.pl (Wstat: 256 Tests: 2 Failed: 1) Failed test: 2 Non-zero exit status: 1 So yeah, there's more stuff to fix. But I can't directly apply your fixes because the updated patches are somewhat different. On 12/18/18 3:07 PM, Alexey Kondratov wrote: > On 18.12.2018 1:28, Tomas Vondra wrote: >>> 4) There was a problem with marking top-level transaction as having >>> catalog changes if one of its subtransactions has. It was causing a >>> problem with DDL statements just after subtransaction start (savepoint), >>> so data from new columns is not replicated. >>> >>> 5) Similar issue with schema send. You send schema only once per each >>> sub/transaction (IIRC), while we have to update schema on each catalog >>> change: invalidation execution, snapshot rebuild, adding new tuple cids. >>> So I ended up with adding is_schema_send flag to ReorderBufferTXN, since >>> it is easy to set it inside RB and read in the output plugin. Probably, >>> we have to choose a better place for this flag. >>> >> Hmm. Can you share an example how to trigger these issues? > > Test cases inside 014_stream_tough_ddl.pl and old ones (with > streaming=true option added) should reproduce all these issues. In > general, it happens in a txn like: > > INSERT > SAVEPOINT > ALTER TABLE ... ADD COLUMN > INSERT > > then the second insert may discover old version of catalog. > Yeah, that's the issue I've discovered before and thought it got fixed. >> Interesting. Any idea where does the extra overhead in this particular >> case come from? It's hard to deduce that from the single flame graph, >> when I don't have anything to compare it with (i.e. the flame graph for >> the "normal" case). > > I guess that bottleneck is in disk operations. You can check > logical_repl_worker_new_perf.svg flame graph: disk reads (~9%) and > writes (~26%) take around 35% of CPU time in summary. To compare, > please, see attached flame graph for the following transaction: > > INSERT INTO large_text > SELECT (SELECT string_agg('x', ',') > FROM generate_series(1, 2000)) FROM generate_series(1, 1000000); > > Execution Time: 44519.816 ms > Time: 98333,642 ms (01:38,334) > > where disk IO is only ~7-8% in total. So we get very roughly the same > ~x4-5 performance drop here. JFYI, I am using a machine with SSD for tests. > > Therefore, probably you may write changes on receiver in bigger chunks, > not each change separately. > Possibly, I/O is certainly a possible culprit, although we should be using buffered I/O and there certainly are not any fsyncs here. So I'm not sure why would it be cheaper to do the writes in batches. BTW does this mean you see the overhead on the apply side? Or are you running this on a single machine, and it's difficult to decide? >> So I'm not particularly worried, but I'll look into that. 
I'd be much >> more worried if there was measurable overhead in cases when there's no >> streaming happening (either because it's disabled or the memory limit >> was not hit). > > What I have also just found, is that if a table row is large enough to > be TOASTed, e.g.: > > INSERT INTO large_text > SELECT (SELECT string_agg('x', ',') > FROM generate_series(1, 1000000)) FROM generate_series(1, 1000); > > then logical_work_mem limit is not hit and we neither stream, nor spill > to disk this transaction, while it is still large. In contrast, the > transaction above (with 1000000 smaller rows) being comparable in size > is streamed. Not sure, that it is easy to add proper accounting of > TOAST-able columns, but it worth it. > That's certainly strange and possibly a bug in the memory accounting code. I'm not sure why would that happen, though, because TOAST data look just like regular INSERT changes. Interesting. I wonder if it's already fixed in this updated version, but it's a bit too late to investigate that today. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachment
- 0001-Add-logical_work_mem-to-limit-ReorderBuffer-20181219.patch.gz
- 0002-Immediately-WAL-log-assignments-20181219.patch.gz
- 0003-Issue-individual-invalidations-with-wal_lev-20181219.patch.gz
- 0004-Extend-the-output-plugin-API-with-stream-me-20181219.patch.gz
- 0005-Implement-streaming-mode-in-ReorderBuffer-20181219.patch.gz
- 0006-Add-support-for-streaming-to-built-in-repli-20181219.patch.gz
- 0007-Track-statistics-for-streaming-spilling-20181219.patch.gz
- 0008-BUGFIX-set-final_lsn-for-subxacts-before-cl-20181219.patch.gz
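Regarding the TOAST accounting question in the message above: whether logical_work_mem is hit depends on what the per-change accounting counts. The sketch below shows the general shape of such accounting; the field names follow ReorderBufferChange only approximately, the function name is illustrative, and the TOAST remark in the comment reflects the thread's hypothesis rather than a confirmed diagnosis.

#include "postgres.h"
#include "replication/reorderbuffer.h"

/*
 * Approximate sketch of per-change memory accounting.  TOASTed values are
 * decoded as additional INSERT changes on the relation's toast table, so
 * they are accounted for only if every change (including those) adds its
 * tuple sizes to the owning toplevel transaction.  Names and details are
 * assumptions for illustration, not the patch's actual code.
 */
static Size
change_memory_size(ReorderBufferChange *change)
{
	Size		sz = sizeof(ReorderBufferChange);

	switch (change->action)
	{
		case REORDER_BUFFER_CHANGE_INSERT:
		case REORDER_BUFFER_CHANGE_UPDATE:
		case REORDER_BUFFER_CHANGE_DELETE:
			if (change->data.tp.oldtuple)
				sz += sizeof(ReorderBufferTupleBuf) +
					change->data.tp.oldtuple->tuple.t_len;
			if (change->data.tp.newtuple)
				sz += sizeof(ReorderBufferTupleBuf) +
					change->data.tp.newtuple->tuple.t_len;
			break;
		default:
			/* messages, snapshots, etc. would need their own estimates */
			break;
	}

	return sz;
}

If the TOAST-table changes are being accounted this way, a large TOASTed row should push the transaction over the limit just like many small rows do, which is why the behaviour Alexey observed looks like an accounting bug.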
Re: PATCH: logical_work_mem and logical streaming of largein-progress transactions
From
Alexey Kondratov
Date:
Hi Tomas, > I'm a bit confused by the changes to TAP tests. Per the patch summary, > some .pl files get renamed (nor sure why), a new one is added, etc. I added a new tap test case, added the streaming=true option inside the old stream_* ones, and incremented the streaming test numbers (+2) because of the collision between 009_matviews.pl / 009_stream_simple.pl and 010_truncate.pl / 010_stream_subxact.pl. At least in the previous version of the patch they were under the same numbers. Nothing special, but for simplicity, please find my new tap test attached separately. > So > I've instead enabled streaming subscriptions in all tests, which with > this patch produces two failures: > > Test Summary Report > ------------------- > t/004_sync.pl (Wstat: 7424 Tests: 1 Failed: 0) > Non-zero exit status: 29 > Parse errors: Bad plan. You planned 7 tests but ran 1. > t/011_stream_ddl.pl (Wstat: 256 Tests: 2 Failed: 1) > Failed test: 2 > Non-zero exit status: 1 > > So yeah, there's more stuff to fix. But I can't directly apply your > fixes because the updated patches are somewhat different. The fixes should apply cleanly to the previous version of your patch. Also, I am not sure that it is a good idea to simply enable streaming subscriptions in all tests (e.g. the pre-streaming-patch t/004_sync.pl), since then they no longer exercise the non-streaming code. >>> Interesting. Any idea where does the extra overhead in this particular >>> case come from? It's hard to deduce that from the single flame graph, >>> when I don't have anything to compare it with (i.e. the flame graph for >>> the "normal" case). >> I guess that bottleneck is in disk operations. You can check >> logical_repl_worker_new_perf.svg flame graph: disk reads (~9%) and >> writes (~26%) take around 35% of CPU time in summary. To compare, >> please, see attached flame graph for the following transaction: >> >> INSERT INTO large_text >> SELECT (SELECT string_agg('x', ',') >> FROM generate_series(1, 2000)) FROM generate_series(1, 1000000); >> >> Execution Time: 44519.816 ms >> Time: 98333,642 ms (01:38,334) >> >> where disk IO is only ~7-8% in total. So we get very roughly the same >> ~x4-5 performance drop here. JFYI, I am using a machine with SSD for tests. >> >> Therefore, probably you may write changes on receiver in bigger chunks, >> not each change separately. >> > Possibly, I/O is certainly a possible culprit, although we should be > using buffered I/O and there certainly are not any fsyncs here. So I'm > not sure why would it be cheaper to do the writes in batches. > > BTW does this mean you see the overhead on the apply side? Or are you > running this on a single machine, and it's difficult to decide? I run this on a single machine, but walsender and worker are utilizing almost 100% of CPU per each process all the time, and at apply side I/O syscalls take about 1/3 of CPU time. Though I am still not sure, but for me this result somehow links performance drop with problems at receiver side. Writing in batches was just a hypothesis and to validate it I have performed test with large txn, but consisting of a smaller number of wide rows. This test does not exhibit any significant performance drop, while it was streamed too. So it seems to be valid. Anyway, I do not have other reasonable ideas beside that right now. Regards -- Alexey Kondratov Postgres Professional https://www.postgrespro.com Russian Postgres Company
Attachment
Re: PATCH: logical_work_mem and logical streaming of largein-progress transactions
From
Tomas Vondra
Date:
Hi, Attached is an updated patch series, merging fixes and changes to TAP tests proposed by Alexey. I've merged the fixes into the appropriate patches, and I've kept the TAP changes / new tests as separate patches towards the end of the series. I'm a bit unhappy with two aspects of the current patch series: 1) We now track schema changes in two ways - using the pre-existing schema_sent flag in RelationSyncEntry, and the (newly added) flag in ReorderBuffer. While those options are used for regular vs. streamed transactions, fundamentally it's the same thing and so having two competing ways seems like a bad idea. Not sure what's the best way to resolve this, though. 2) We've removed quite a few asserts, particularly ensuring sanity of cmin/cmax values. To some extent that's expected, because allowing decoding of in-progress transactions relaxes some of those rules. But I'd be much happier if some of those asserts could be reinstated, even if only in a weaker form. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachment
- 0001-Add-logical_work_mem-to-limit-ReorderBuffer-20190114.patch.gz
- 0002-Immediately-WAL-log-assignments-20190114.patch.gz
- 0003-Issue-individual-invalidations-with-wal_lev-20190114.patch.gz
- 0004-Extend-the-output-plugin-API-with-stream-me-20190114.patch.gz
- 0005-Implement-streaming-mode-in-ReorderBuffer-20190114.patch.gz
- 0006-Add-support-for-streaming-to-built-in-repli-20190114.patch.gz
- 0007-Track-statistics-for-streaming-spilling-20190114.patch.gz
- 0008-Enable-streaming-for-all-subscription-TAP-t-20190114.patch.gz
- 0009-BUGFIX-set-final_lsn-for-subxacts-before-cl-20190114.patch.gz
- 0010-Add-TAP-test-for-streaming-vs.-DDL-20190114.patch.gz
Re: PATCH: logical_work_mem and logical streaming of largein-progress transactions
From
Michael Paquier
Date:
On Mon, Jan 14, 2019 at 07:23:31PM +0100, Tomas Vondra wrote: > Attached is an updated patch series, merging fixes and changes to TAP > tests proposed by Alexey. I've merged the fixes into the appropriate > patches, and I've kept the TAP changes / new tests as separate patches > towards the end of the series. Patch 4 of the latest set fails to apply, so I have moved the patch to next CF, waiting on author. -- Michael
Attachment
Re: PATCH: logical_work_mem and logical streaming of largein-progress transactions
From
Alexey Kondratov
Date:
Hi Tomas, On 14.01.2019 21:23, Tomas Vondra wrote: > Attached is an updated patch series, merging fixes and changes to TAP > tests proposed by Alexey. I've merged the fixes into the appropriate > patches, and I've kept the TAP changes / new tests as separate patches > towards the end of the series. I had problems applying this patch along with the 2pc streaming one to the current master, but everything applied well on 97c39498e5. Regression tests pass. What I personally do not like in the current TAP test set is that you have added "WITH (streaming=on)" to all tests, including old non-streaming ones. It seems unclear which mechanism is tested there: streaming, but those transactions probably do not hit the memory limit, so it depends on default server parameters; or non-streaming, but then what is the need for (streaming=on)? I would prefer to add (streaming=on) only to the new tests, where it is clearly necessary. > I'm a bit unhappy with two aspects of the current patch series: > > 1) We now track schema changes in two ways - using the pre-existing > schema_sent flag in RelationSyncEntry, and the (newly added) flag in > ReorderBuffer. While those options are used for regular vs. streamed > transactions, fundamentally it's the same thing and so having two > competing ways seems like a bad idea. Not sure what's the best way to > resolve this, though. Yes, sure - when I found problems with streaming of extensive DDL, I added a new flag in the simplest way, and it worked. Now, the old schema_sent flag is per-relation, while the new one - is_schema_sent - is per top-level transaction. If I get it correctly, the former seems to be more thrifty, since a new schema is sent only if we are streaming a change for a relation whose schema is outdated. In contrast, in the latter case we will send a new schema even if there are no new changes that belong to this relation. I guess it would be better to stick to the old behavior. I will try to investigate how to better use it in the streaming mode as well. > 2) We've removed quite a few asserts, particularly ensuring sanity of > cmin/cmax values. To some extent that's expected, because allowing > decoding of in-progress transactions relaxes some of those rules. But > I'd be much happier if some of those asserts could be reinstated, even > if only in a weaker form. Asserts have been removed from two places: (1) HeapTupleSatisfiesHistoricMVCC, which seems inevitable, since we are touching the essence of the MVCC visibility rules when trying to decode an in-progress transaction, and (2) ReorderBufferBuildTupleCidHash, which is probably not related directly to the topic of the ongoing patch, since Arseny Sher faced the same issue with simple repetitive DDL decoding [1] recently. Not many, but I agree that replacing them with some softer asserts would be better than just removing them, especially for point (1). [1] https://www.postgresql.org/message-id/flat/874l9p8hyw.fsf%40ars-thinkpad Regards -- Alexey Kondratov Postgres Professional https://www.postgrespro.com Russian Postgres Company
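To make the per-relation behavior Alexey prefers concrete, here is a simplified sketch in the style of pgoutput.c: the schema is (re)sent only when a change for a relation with an outdated schema is about to be written, and a relcache invalidation clears the flag. The struct, callback, and helper are stripped down to the essentials, so treat their exact shapes as assumptions rather than the patch's actual code.

#include "postgres.h"
#include "replication/logical.h"
#include "replication/logicalproto.h"
#include "utils/hsearch.h"

/* Simplified per-relation cache entry, roughly as in pgoutput.c. */
typedef struct RelationSyncEntry
{
	Oid		relid;			/* hash key */
	bool	schema_sent;	/* schema already sent for this relation? */
} RelationSyncEntry;

static HTAB *RelationSyncCache = NULL;	/* relid -> RelationSyncEntry */

/* Called before writing a change: send the schema only if it has not been
 * sent yet, or was invalidated by a catalog change since the last send. */
static void
maybe_send_schema(LogicalDecodingContext *ctx, Relation relation,
				  RelationSyncEntry *entry)
{
	if (entry->schema_sent)
		return;

	OutputPluginPrepareWrite(ctx, false);
	logicalrep_write_rel(ctx->out, relation);
	OutputPluginWrite(ctx, false);

	entry->schema_sent = true;
}

/* Relcache invalidation callback: any catalog change on the relation forces
 * the schema to be re-sent before the next change for it goes out. */
static void
rel_sync_cache_relation_cb(Datum arg, Oid relid)
{
	RelationSyncEntry *entry;

	if (RelationSyncCache == NULL)
		return;

	entry = hash_search(RelationSyncCache, &relid, HASH_FIND, NULL);
	if (entry != NULL)
		entry->schema_sent = false;
}

The per-toplevel-transaction is_schema_sent flag, by contrast, cannot tell which relations actually need their schema re-sent, which is the "thriftiness" difference discussed above.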
Re: PATCH: logical_work_mem and logical streaming of largein-progress transactions
From
Alexey Kondratov
Date:
Hi Tomas, >>>> Interesting. Any idea where does the extra overhead in this particular >>>> case come from? It's hard to deduce that from the single flame graph, >>>> when I don't have anything to compare it with (i.e. the flame graph for >>>> the "normal" case). >>> I guess that bottleneck is in disk operations. You can check >>> logical_repl_worker_new_perf.svg flame graph: disk reads (~9%) and >>> writes (~26%) take around 35% of CPU time in summary. To compare, >>> please, see attached flame graph for the following transaction: >>> >>> INSERT INTO large_text >>> SELECT (SELECT string_agg('x', ',') >>> FROM generate_series(1, 2000)) FROM generate_series(1, 1000000); >>> >>> Execution Time: 44519.816 ms >>> Time: 98333,642 ms (01:38,334) >>> >>> where disk IO is only ~7-8% in total. So we get very roughly the same >>> ~x4-5 performance drop here. JFYI, I am using a machine with SSD for tests. >>> >>> Therefore, probably you may write changes on receiver in bigger chunks, >>> not each change separately. >>> >> Possibly, I/O is certainly a possible culprit, although we should be >> using buffered I/O and there certainly are not any fsyncs here. So I'm >> not sure why would it be cheaper to do the writes in batches. >> >> BTW does this mean you see the overhead on the apply side? Or are you >> running this on a single machine, and it's difficult to decide? > > I run this on a single machine, but walsender and worker are utilizing > almost 100% of CPU per each process all the time, and at apply side > I/O syscalls take about 1/3 of CPU time. Though I am still not sure, > but for me this result somehow links performance drop with problems at > receiver side. > > Writing in batches was just a hypothesis and to validate it I have > performed test with large txn, but consisting of a smaller number of > wide rows. This test does not exhibit any significant performance > drop, while it was streamed too. So it seems to be valid. Anyway, I do > not have other reasonable ideas beside that right now. I've recently checked this patch again and tried to improve it in terms of performance. As a result I've implemented a new POC version of the applier (attached). Almost everything in the streaming logic stayed intact, but the apply worker is significantly different. As I wrote earlier, I still claim that spilling changes to disk at the applier side adds additional overhead, but it is possible to get rid of it. In my additional patch I do the following: 1) Maintain a pool of additional background workers (bgworkers) that are connected with the main logical apply worker via shm_mq's. Each worker is dedicated to the processing of a specific streamed transaction. 2) When we receive a streamed change for some transaction, we check whether there is an existing dedicated bgworker in the HTAB (xid -> bgworker), or whether there is an idle one in the idle list, or we spawn a new one. 3) We pass all changes (between STREAM START/STOP) to that bgworker via shm_mq_send without intermediate waiting. However, we wait for the bgworker to apply the entire chunk of changes at STREAM STOP, since we don't want transaction reordering. 4) When a transaction is committed/aborted, the worker is added to the idle list and waits for a reassignment message. 5) I have reused the same apply_dispatch machinery in the bgworkers, since most of the actions are practically very similar. Thus, we do not spill anything at the applier side, so transaction changes are processed by bgworkers as normal backends do. At the same time, changes processing is strictly serial, which prevents transaction reordering and possible conflicts/anomalies. Even though we trade off performance in favor of stability, the result is rather impressive. I have used a similar query for testing as before: EXPLAIN (ANALYZE, BUFFERS) INSERT INTO large_test (num1, num2, num3) SELECT round(random()*10), random(), random()*142 FROM generate_series(1, 1000000) s(i); with 1kk (1000000), 3kk and 5kk rows; logical_work_mem = 64MB and synchronous_standby_names = 'FIRST 1 (large_sub)'. The table schema is the following: CREATE TABLE large_test ( id serial primary key, num1 bigint, num2 double precision, num3 double precision ); Here are the results:

----------------------------------------------------------------
| N   | Time on master, sec | Total xact time, sec | Ratio     |
----------------------------------------------------------------
| On commit (master, v13)                                      |
----------------------------------------------------------------
| 1kk | 6.5                 | 17.6                 | x2.74     |
| 3kk | 21                  | 55.4                 | x2.64     |
| 5kk | 38.3                | 91.5                 | x2.39     |
----------------------------------------------------------------
| Stream + spill                                               |
----------------------------------------------------------------
| 1kk | 5.9                 | 18                   | x3        |
| 3kk | 19.5                | 52.4                 | x2.7      |
| 5kk | 33.3                | 86.7                 | x2.86     |
----------------------------------------------------------------
| Stream + BGW pool                                            |
----------------------------------------------------------------
| 1kk | 6                   | 12                   | x2        |
| 3kk | 18.5                | 30.5                 | x1.65     |
| 5kk | 35.6                | 53.9                 | x1.51     |
----------------------------------------------------------------

It seems that the overhead added by the synchronous replica is 2-3 times lower compared with Postgres master and with streaming + spilling. Therefore, the original patch eliminates the delay before the sender starts processing a large transaction, while this additional patch speeds up the applier side. Although the overall speed-up is surely measurable, there is room for improvement yet: 1) Currently, bgworkers are only spawned on demand, without an initial pool, and are never stopped. Maybe we should create a small pool at replication start and offload some of the idle bgworkers if they exceed some limit? 2) Probably we can somehow track that an incoming change conflicts with some of the xacts being processed, so we can wait for specific bgworkers only in that case? 3) Since the communication between the main logical apply worker and each bgworker from the pool is a 'single producer --- single consumer' problem, it is probably possible to wait and set/check flags without locks, using just atomics. What do you think about this concept in general? Any concerns and criticism are welcome! Regards -- Alexey Kondratov Postgres Professional https://www.postgrespro.com Russian Postgres Company P.S. This patch should be applicable to your last patch set. I would rebase it against master, but it depends on the 2pc patch, which I don't know well enough.
Attachment
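The dispatch described in points 1-3 of the message above could look roughly like the sketch below. StreamApplyWorker and assign_idle_or_new_bgworker() are hypothetical stand-ins for the POC's machinery; only shm_mq_send() and hash_search() are real PostgreSQL APIs here, and the whole thing is an illustration under those assumptions rather than the actual patch code.

#include "postgres.h"
#include "lib/stringinfo.h"
#include "storage/shm_mq.h"
#include "utils/hsearch.h"

/* Hypothetical per-transaction apply worker slot (not the POC's real struct). */
typedef struct StreamApplyWorker
{
	TransactionId	xid;	/* hash key: toplevel xid of the streamed xact */
	shm_mq_handle  *mq;		/* queue from the main apply worker to the bgworker */
} StreamApplyWorker;

static HTAB *stream_workers = NULL;		/* xid -> StreamApplyWorker */

/* Hypothetical helper: reuse an idle bgworker or start a new one. */
static void assign_idle_or_new_bgworker(StreamApplyWorker *worker);

/*
 * Route one streamed change to the bgworker dedicated to its transaction.
 * The bgworker on the other side of the queue simply runs the message
 * through the usual apply_dispatch() logic.
 */
static void
forward_streamed_change(TransactionId xid, StringInfo s)
{
	bool				found;
	StreamApplyWorker  *worker;

	worker = hash_search(stream_workers, &xid, HASH_ENTER, &found);
	if (!found)
		assign_idle_or_new_bgworker(worker);

	if (shm_mq_send(worker->mq, s->len, s->data, false) != SHM_MQ_SUCCESS)
		ereport(ERROR,
				(errmsg("could not forward streamed change to apply bgworker")));
}

/*
 * At STREAM STOP (and at streamed commit/abort) the main apply worker blocks
 * until the bgworker has drained and applied the whole chunk, which is what
 * keeps the overall apply strictly serial.
 */

Waiting at STREAM STOP rather than only at commit is the key trade-off: it forfeits concurrency between transactions but preserves the upstream apply order, which is the property discussed in the following messages.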
Re: PATCH: logical_work_mem and logical streaming of largein-progress transactions
From
Tomas Vondra
Date:
On Wed, Aug 28, 2019 at 08:17:47PM +0300, Alexey Kondratov wrote: >Hi Tomas, > >>>>>Interesting. Any idea where does the extra overhead in this particular >>>>>case come from? It's hard to deduce that from the single flame graph, >>>>>when I don't have anything to compare it with (i.e. the flame >>>>>graph for >>>>>the "normal" case). >>>>I guess that bottleneck is in disk operations. You can check >>>>logical_repl_worker_new_perf.svg flame graph: disk reads (~9%) and >>>>writes (~26%) take around 35% of CPU time in summary. To compare, >>>>please, see attached flame graph for the following transaction: >>>> >>>>INSERT INTO large_text >>>>SELECT (SELECT string_agg('x', ',') >>>>FROM generate_series(1, 2000)) FROM generate_series(1, 1000000); >>>> >>>>Execution Time: 44519.816 ms >>>>Time: 98333,642 ms (01:38,334) >>>> >>>>where disk IO is only ~7-8% in total. So we get very roughly the same >>>>~x4-5 performance drop here. JFYI, I am using a machine with SSD >>>>for tests. >>>> >>>>Therefore, probably you may write changes on receiver in bigger chunks, >>>>not each change separately. >>>> >>>Possibly, I/O is certainly a possible culprit, although we should be >>>using buffered I/O and there certainly are not any fsyncs here. So I'm >>>not sure why would it be cheaper to do the writes in batches. >>> >>>BTW does this mean you see the overhead on the apply side? Or are you >>>running this on a single machine, and it's difficult to decide? >> >>I run this on a single machine, but walsender and worker are >>utilizing almost 100% of CPU per each process all the time, and at >>apply side I/O syscalls take about 1/3 of CPU time. Though I am >>still not sure, but for me this result somehow links performance >>drop with problems at receiver side. >> >>Writing in batches was just a hypothesis and to validate it I have >>performed test with large txn, but consisting of a smaller number of >>wide rows. This test does not exhibit any significant performance >>drop, while it was streamed too. So it seems to be valid. Anyway, I >>do not have other reasonable ideas beside that right now. > >I've checked recently this patch again and tried to elaborate it in >terms of performance. As a result I've implemented a new POC version >of the applier (attached). Almost everything in streaming logic stayed >intact, but apply worker is significantly different. > >As I wrote earlier I still claim, that spilling changes on disk at the >applier side adds additional overhead, but it is possible to get rid >of it. In my additional patch I do the following: > >1) Maintain a pool of additional background workers (bgworkers), that >are connected with main logical apply worker via shm_mq's. Each worker >is dedicated to the processing of specific streamed transaction. > >2) When we receive a streamed change for some transaction, we check >whether there is an existing dedicated bgworker in HTAB (xid -> >bgworker), or there are some in the idle list, or spawn a new one. > >3) We pass all changes (between STREAM START/STOP) to that bgworker >via shm_mq_send without intermediate waiting. However, we wait for >bgworker to apply the entire changes chunk at STREAM STOP, since we >don't want transactions reordering. > >4) When transaction is commited/aborted worker is being added to the >idle list and is waiting for reassigning message. > >5) I have used the same machinery with apply_dispatch in bgworkers, >since most of actions are practically very similar. 
> >Thus, we do not spill anything at the applier side, so transaction >changes are processed by bgworkers as normal backends do. In the same >time, changes processing is strictly serial, which prevents >transactions reordering and possible conflicts/anomalies. Even though >we trade off performance in favor of stability the result is rather >impressive. I have used a similar query for testing as before: > >EXPLAIN (ANALYZE, BUFFERS) INSERT INTO large_test (num1, num2, num3) > SELECT round(random()*10), random(), random()*142 > FROM generate_series(1, 1000000) s(i); > >with 1kk (1000000), 3kk and 5kk rows; logical_work_mem = 64MB and >synchronous_standby_names = 'FIRST 1 (large_sub)'. Table schema is >following: > >CREATE TABLE large_test ( > id serial primary key, > num1 bigint, > num2 double precision, > num3 double precision >); > >Here are the results: > >------------------------------------------------------------------- >| N | Time on master, sec | Total xact time, sec | Ratio | >------------------------------------------------------------------- >| On commit (master, v13) | >------------------------------------------------------------------- >| 1kk | 6.5 | 17.6 | x2.74 | >------------------------------------------------------------------- >| 3kk | 21 | 55.4 | x2.64 | >------------------------------------------------------------------- >| 5kk | 38.3 | 91.5 | x2.39 | >------------------------------------------------------------------- >| Stream + spill | >------------------------------------------------------------------- >| 1kk | 5.9 | 18 | x3 | >------------------------------------------------------------------- >| 3kk | 19.5 | 52.4 | x2.7 | >------------------------------------------------------------------- >| 5kk | 33.3 | 86.7 | x2.86 | >------------------------------------------------------------------- >| Stream + BGW pool | >------------------------------------------------------------------- >| 1kk | 6 | 12 | x2 | >------------------------------------------------------------------- >| 3kk | 18.5 | 30.5 | x1.65 | >------------------------------------------------------------------- >| 5kk | 35.6 | 53.9 | x1.51 | >------------------------------------------------------------------- > >It seems that overhead added by synchronous replica is lower by 2-3 >times compared with Postgres master and streaming with spilling. >Therefore, the original patch eliminated delay before large >transaction processing start by sender, while this additional patch >speeds up the applier side. > >Although the overall speed up is surely measurable, there is a room >for improvements yet: > >1) Currently bgworkers are only spawned on demand without some initial >pool and never stopped. Maybe we should create a small pool on >replication start and offload some of idle bgworkers if they exceed >some limit? > >2) Probably we can track somehow that incoming change has conflicts >with some of being processed xacts, so we can wait for specific >bgworkers only in that case? > >3) Since the communication between main logical apply worker and each >bgworker from the pool is a 'single producer --- single consumer' >problem, then probably it is possible to wait and set/check flags >without locks, but using just atomics. > >What do you think about this concept in general? Any concerns and >criticism are welcome! > Hi Alexey, I'm unable to do any in-depth review of the patch over the next two weeks or so, but I think the idea of having a pool of apply workers is sound and can be quite beneficial for some workloads. 
I don't think it matters very much whether the workers are started at the beginning or allocated ad hoc, that's IMO a minor implementation detail. There's one huge challenge that I however don't see mentioned in your message or in the patch (after cursory reading) - ensuring the same commit order, and introducing deadlocks that would not exist in single-process apply. Surely, we want to end up with the same commit order as on the upstream, otherwise we might easily get different data on the subscriber. So when we pass the large transaction to a separate process, this process has to wait for the other processes processing transactions that committed first. And similarly, other processes have to wait for this process, depending on the commit order. I might have missed something, but I don't see anything like that in your patch. Essentially, this means there needs to be some sort of wait between those apply processes, enforcing the commit order. That however means we can easily introduce deadlocks into workloads where the serial apply would not have that issue - imagine multiple large transactions touching the same set of rows. We may ship them to different bgworkers, and those processes may deadlock. Of course, the deadlock detector will come around (assuming the wait is done in a way visible to the detector) and will abort one of the processes. But we don't know it'll abort the right one - it may easily abort the apply process that needs to commit first, while everyone else is waiting for it. Which stalls the apply forever. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: PATCH: logical_work_mem and logical streaming of largein-progress transactions
From
Alexey Kondratov
Date:
On 28.08.2019 22:06, Tomas Vondra wrote: > >> >>>>>> Interesting. Any idea where does the extra overhead in this >>>>>> particular >>>>>> case come from? It's hard to deduce that from the single flame >>>>>> graph, >>>>>> when I don't have anything to compare it with (i.e. the flame >>>>>> graph for >>>>>> the "normal" case). >>>>> I guess that bottleneck is in disk operations. You can check >>>>> logical_repl_worker_new_perf.svg flame graph: disk reads (~9%) and >>>>> writes (~26%) take around 35% of CPU time in summary. To compare, >>>>> please, see attached flame graph for the following transaction: >>>>> >>>>> INSERT INTO large_text >>>>> SELECT (SELECT string_agg('x', ',') >>>>> FROM generate_series(1, 2000)) FROM generate_series(1, 1000000); >>>>> >>>>> Execution Time: 44519.816 ms >>>>> Time: 98333,642 ms (01:38,334) >>>>> >>>>> where disk IO is only ~7-8% in total. So we get very roughly the same >>>>> ~x4-5 performance drop here. JFYI, I am using a machine with SSD >>>>> for tests. >>>>> >>>>> Therefore, probably you may write changes on receiver in bigger >>>>> chunks, >>>>> not each change separately. >>>>> >>>> Possibly, I/O is certainly a possible culprit, although we should be >>>> using buffered I/O and there certainly are not any fsyncs here. So I'm >>>> not sure why would it be cheaper to do the writes in batches. >>>> >>>> BTW does this mean you see the overhead on the apply side? Or are you >>>> running this on a single machine, and it's difficult to decide? >>> >>> I run this on a single machine, but walsender and worker are >>> utilizing almost 100% of CPU per each process all the time, and at >>> apply side I/O syscalls take about 1/3 of CPU time. Though I am >>> still not sure, but for me this result somehow links performance >>> drop with problems at receiver side. >>> >>> Writing in batches was just a hypothesis and to validate it I have >>> performed test with large txn, but consisting of a smaller number of >>> wide rows. This test does not exhibit any significant performance >>> drop, while it was streamed too. So it seems to be valid. Anyway, I >>> do not have other reasonable ideas beside that right now. >> >> It seems that overhead added by synchronous replica is lower by 2-3 >> times compared with Postgres master and streaming with spilling. >> Therefore, the original patch eliminated delay before large >> transaction processing start by sender, while this additional patch >> speeds up the applier side. >> >> Although the overall speed up is surely measurable, there is a room >> for improvements yet: >> >> 1) Currently bgworkers are only spawned on demand without some >> initial pool and never stopped. Maybe we should create a small pool >> on replication start and offload some of idle bgworkers if they >> exceed some limit? >> >> 2) Probably we can track somehow that incoming change has conflicts >> with some of being processed xacts, so we can wait for specific >> bgworkers only in that case? >> >> 3) Since the communication between main logical apply worker and each >> bgworker from the pool is a 'single producer --- single consumer' >> problem, then probably it is possible to wait and set/check flags >> without locks, but using just atomics. >> >> What do you think about this concept in general? Any concerns and >> criticism are welcome! >> > Hi Tomas, Thank you for a quick response. > I don't think it matters very much whether the workers are started at the > beginning or allocated ad hoc, that's IMO a minor implementation detail. 
OK, I had the same vision about this point. Any minor differences here will be negligible for a sufficiently large transaction. > > There's one huge challenge that I however don't see mentioned in your > message or in the patch (after cursory reading) - ensuring the same > commit > order, and introducing deadlocks that would not exist in single-process > apply. Probably I haven't explained well this part, sorry for that. In my patch I don't use workers pool for a concurrent transaction apply, but rather for a fast context switch between long-lived streamed transactions. In other words we apply all changes arrived from the sender in a completely serial manner. Being written step-by-step it looks like: 1) Read STREAM START message and figure out the target worker by xid. 2) Put all changes, which belong to this xact, to the selected worker one by one via shm_mq_send. 3) Read STREAM STOP message and wait until our worker has applied all changes in the queue. 4) Process all other chunks of streamed xacts in the same manner. 5) Process all non-streamed xacts immediately in the main apply worker loop. 6) If we read STREAMED COMMIT/ABORT we again wait until the selected worker either commits or aborts. Thus, it automatically guarantees the same commit order on replica as on master. Yes, we lose some performance here, since we don't apply transactions concurrently, but doing so would bring all those problems you have described. However, you helped me to figure out another point I have forgotten. Although we ensure commit order automatically, the beginning of streamed xacts may reorder. It happens if some small xacts have been committed on master since the streamed one started, because we do not start streaming immediately, but only after the logical_work_mem limit is hit. I have performed some tests with conflicting xacts and it seems that it's not a problem, since the locking mechanism in Postgres guarantees that if there would be some deadlocks, they will happen earlier on master. So if some records hit the WAL, it is safe to apply them sequentially. Am I wrong? Anyway, I'm going to double check the safety of this part later. Regards -- Alexey Kondratov Postgres Professional https://www.postgrespro.com Russian Postgres Company
Re: PATCH: logical_work_mem and logical streaming of largein-progress transactions
From
Tomas Vondra
Date:
On Thu, Aug 29, 2019 at 05:37:45PM +0300, Alexey Kondratov wrote: >On 28.08.2019 22:06, Tomas Vondra wrote: >> >>> >>>>>>>Interesting. Any idea where does the extra overhead in >>>>>>>this particular >>>>>>>case come from? It's hard to deduce that from the single >>>>>>>flame graph, >>>>>>>when I don't have anything to compare it with (i.e. the >>>>>>>flame graph for >>>>>>>the "normal" case). >>>>>>I guess that bottleneck is in disk operations. You can check >>>>>>logical_repl_worker_new_perf.svg flame graph: disk reads (~9%) and >>>>>>writes (~26%) take around 35% of CPU time in summary. To compare, >>>>>>please, see attached flame graph for the following transaction: >>>>>> >>>>>>INSERT INTO large_text >>>>>>SELECT (SELECT string_agg('x', ',') >>>>>>FROM generate_series(1, 2000)) FROM generate_series(1, 1000000); >>>>>> >>>>>>Execution Time: 44519.816 ms >>>>>>Time: 98333,642 ms (01:38,334) >>>>>> >>>>>>where disk IO is only ~7-8% in total. So we get very roughly the same >>>>>>~x4-5 performance drop here. JFYI, I am using a machine with >>>>>>SSD for tests. >>>>>> >>>>>>Therefore, probably you may write changes on receiver in >>>>>>bigger chunks, >>>>>>not each change separately. >>>>>> >>>>>Possibly, I/O is certainly a possible culprit, although we should be >>>>>using buffered I/O and there certainly are not any fsyncs here. So I'm >>>>>not sure why would it be cheaper to do the writes in batches. >>>>> >>>>>BTW does this mean you see the overhead on the apply side? Or are you >>>>>running this on a single machine, and it's difficult to decide? >>>> >>>>I run this on a single machine, but walsender and worker are >>>>utilizing almost 100% of CPU per each process all the time, and >>>>at apply side I/O syscalls take about 1/3 of CPU time. Though I >>>>am still not sure, but for me this result somehow links >>>>performance drop with problems at receiver side. >>>> >>>>Writing in batches was just a hypothesis and to validate it I >>>>have performed test with large txn, but consisting of a smaller >>>>number of wide rows. This test does not exhibit any significant >>>>performance drop, while it was streamed too. So it seems to be >>>>valid. Anyway, I do not have other reasonable ideas beside that >>>>right now. >>> >>>It seems that overhead added by synchronous replica is lower by >>>2-3 times compared with Postgres master and streaming with >>>spilling. Therefore, the original patch eliminated delay before >>>large transaction processing start by sender, while this >>>additional patch speeds up the applier side. >>> >>>Although the overall speed up is surely measurable, there is a >>>room for improvements yet: >>> >>>1) Currently bgworkers are only spawned on demand without some >>>initial pool and never stopped. Maybe we should create a small >>>pool on replication start and offload some of idle bgworkers if >>>they exceed some limit? >>> >>>2) Probably we can track somehow that incoming change has >>>conflicts with some of being processed xacts, so we can wait for >>>specific bgworkers only in that case? >>> >>>3) Since the communication between main logical apply worker and >>>each bgworker from the pool is a 'single producer --- single >>>consumer' problem, then probably it is possible to wait and >>>set/check flags without locks, but using just atomics. >>> >>>What do you think about this concept in general? Any concerns and >>>criticism are welcome! >>> >> > >Hi Tomas, > >Thank you for a quick response. 
> >>I don't think it matters very much whether the workers are started at the >>beginning or allocated ad hoc, that's IMO a minor implementation detail. > >OK, I had the same vision about this point. Any minor differences here >will be negligible for a sufficiently large transaction. > >> >>There's one huge challenge that I however don't see mentioned in your >>message or in the patch (after cursory reading) - ensuring the same >>commit >>order, and introducing deadlocks that would not exist in single-process >>apply. > >Probably I haven't explained well this part, sorry for that. In my >patch I don't use workers pool for a concurrent transaction apply, but >rather for a fast context switch between long-lived streamed >transactions. In other words we apply all changes arrived from the >sender in a completely serial manner. Being written step-by-step it >looks like: > >1) Read STREAM START message and figure out the target worker by xid. > >2) Put all changes, which belong to this xact, to the selected worker >one by one via shm_mq_send. > >3) Read STREAM STOP message and wait until our worker has applied all >changes in the queue. > >4) Process all other chunks of streamed xacts in the same manner. > >5) Process all non-streamed xacts immediately in the main apply worker loop. > >6) If we read STREAMED COMMIT/ABORT we again wait until the selected >worker either commits or aborts. > >Thus, it automatically guarantees the same commit order on replica as >on master. Yes, we lose some performance here, since we don't apply >transactions concurrently, but doing so would bring all those problems you >have described. > OK, so it's apply in multiple processes, but at any moment only a single apply process is active. >However, you helped me to figure out another point I have forgotten. >Although we ensure commit order automatically, the beginning of >streamed xacts may reorder. It happens if some small xacts have been >committed on master since the streamed one started, because we do not >start streaming immediately, but only after the logical_work_mem limit is hit. I >have performed some tests with conflicting xacts and it seems that >it's not a problem, since the locking mechanism in Postgres guarantees >that if there would be some deadlocks, they will happen earlier on >master. So if some records hit the WAL, it is safe to apply them >sequentially. Am I wrong? > I think you're right that the way you interleave the changes ensures you can't introduce new deadlocks between transactions in this stream. I don't think reordering the blocks of streamed transactions matters, as long as the commit order is ensured in this case. >Anyway, I'm going to double check the safety of this part later. > OK. FWIW my understanding is that the speedup comes mostly from elimination of the serialization to a file. That however requires savepoints to handle aborts of subtransactions - I'm pretty sure it'd be trivial to create a workload where this will be much slower (with many aborts of large subtransactions). regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: PATCH: logical_work_mem and logical streaming of largein-progress transactions
From
Konstantin Knizhnik
Date:
> > FWIW my understanding is that the speedup comes mostly from > elimination of > the serialization to a file. That however requires savepoints to handle > aborts of subtransactions - I'm pretty sure it'd be trivial to create a > workload where this will be much slower (with many aborts of large > subtransactions). > > I think that instead of defining savepoints it is simpler and more efficient to use BeginInternalSubTransaction + ReleaseCurrentSubTransaction/RollbackAndReleaseCurrentSubTransaction as it is done in PL/pgSQL (pl_exec.c). Not sure if it can pr -- Konstantin Knizhnik Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Re: PATCH: logical_work_mem and logical streaming of largein-progress transactions
From
Alvaro Herrera
Date:
In the interest of moving things forward, how far are we from making 0001 committable? If I understand correctly, the rest of this patchset depends on https://commitfest.postgresql.org/24/944/ which seems to be moving at a glacial pace (or, actually, slower, because glaciers do move, which cannot be said of that other patch.) -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: PATCH: logical_work_mem and logical streaming of largein-progress transactions
From
Tomas Vondra
Date:
On Mon, Sep 02, 2019 at 06:06:50PM -0400, Alvaro Herrera wrote: >In the interest of moving things forward, how far are we from making >0001 committable? If I understand correctly, the rest of this patchset >depends on https://commitfest.postgresql.org/24/944/ which seems to be >moving at a glacial pace (or, actually, slower, because glaciers do >move, which cannot be said of that other patch.) > I think 0001 is mostly there. I think there's one bug in this patch version, but I need to check and I'll post an updated version shortly if needed. FWIW maybe we should stop comparing things to glaciers. 50 years from now people won't know what a glacier is, and it'll be just like the floppy icon on the save button. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: PATCH: logical_work_mem and logical streaming of largein-progress transactions
From
Alexey Kondratov
Date:
>> >> FWIW my understanding is that the speedup comes mostly from >> elimination of >> the serialization to a file. That however requires savepoints to handle >> aborts of subtransactions - I'm pretty sure it'd be trivial to create a >> workload where this will be much slower (with many aborts of large >> subtransactions). >> Yes, and it was my main motivation to eliminate that extra serialization to a file. I've experimented a bit with large transactions + savepoints + aborts and ended up with the following query (the same schema as before with 600k rows): BEGIN; SAVEPOINT s1; UPDATE large_test SET num1 = num1 + 1, num2 = num2 + 1, num3 = num3 + 1; SAVEPOINT s2; UPDATE large_test SET num1 = num1 + 1, num2 = num2 + 1, num3 = num3 + 1; SAVEPOINT s3; UPDATE large_test SET num1 = num1 + 1, num2 = num2 + 1, num3 = num3 + 1; ROLLBACK TO SAVEPOINT s3; ROLLBACK TO SAVEPOINT s2; ROLLBACK TO SAVEPOINT s1; END; It looks like the worst-case scenario, as we do a lot of work and then abort all subxacts one by one. As expected, it takes much longer (up to x30) to process using the background worker instead of spilling to file. Surely, it is much easier to truncate a file than to apply all changes + abort. However, I guess that this kind of load pattern is not the most typical for real-life applications. Also, this test helped me to find a bug in my current savepoints routine, so a new patch is attached. On 30.08.2019 18:59, Konstantin Knizhnik wrote: > > I think that instead of defining savepoints it is simpler and more > efficient to use > > BeginInternalSubTransaction + > ReleaseCurrentSubTransaction/RollbackAndReleaseCurrentSubTransaction > > as it is done in PL/pgSQL (pl_exec.c). > Not sure if it can pr > Both BeginInternalSubTransaction and DefineSavepoint use PushTransaction() internally for a normal subtransaction start. So they seem to be identical from the performance perspective, which is also stated in the comment section: /* * BeginInternalSubTransaction * This is the same as DefineSavepoint except it allows TBLOCK_STARTED, * TBLOCK_IMPLICIT_INPROGRESS, TBLOCK_END, and TBLOCK_PREPARE states, * and therefore it can safely be used in functions that might be called * when not inside a BEGIN block or when running deferred triggers at * COMMIT/PREPARE time. Also, it automatically does * CommitTransactionCommand/StartTransactionCommand instead of expecting * the caller to do it. */ Please correct me if I'm wrong. Anyway, I've performed a profiling of my apply worker (flamegraph is attached) and it spends the vast amount of time (>90%) applying changes. So the problem is not in the savepoints themselves, but in the fact that we first apply all changes and then abort all the work. Not sure, that it is possible to do something in this case. Regards -- Alexey Kondratov Postgres Professional https://www.postgrespro.com Russian Postgres Company
Attachment
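For reference, the internal-subtransaction pattern being discussed is used roughly as follows in PL/pgSQL's pl_exec.c. The sketch below is a simplified illustration of that pattern applied to replaying a buffered chunk of changes; apply_buffered_changes() is a hypothetical placeholder, and this is not code from either patch.

#include "postgres.h"
#include "access/xact.h"
#include "utils/memutils.h"
#include "utils/resowner.h"

static void apply_buffered_changes(void);	/* hypothetical: replay the chunk */

/*
 * Simplified illustration of BeginInternalSubTransaction +
 * ReleaseCurrentSubTransaction/RollbackAndReleaseCurrentSubTransaction,
 * as done in pl_exec.c, applied to a streamed subtransaction.
 */
static void
apply_streamed_subxact(bool aborted)
{
	MemoryContext oldcontext = CurrentMemoryContext;
	ResourceOwner oldowner = CurrentResourceOwner;

	BeginInternalSubTransaction(NULL);	/* no savepoint name needed */
	MemoryContextSwitchTo(oldcontext);

	PG_TRY();
	{
		apply_buffered_changes();

		if (aborted)
			RollbackAndReleaseCurrentSubTransaction();
		else
			ReleaseCurrentSubTransaction();
	}
	PG_CATCH();
	{
		/* On error, roll back the subtransaction before re-throwing. */
		RollbackAndReleaseCurrentSubTransaction();
		MemoryContextSwitchTo(oldcontext);
		CurrentResourceOwner = oldowner;
		PG_RE_THROW();
	}
	PG_END_TRY();

	MemoryContextSwitchTo(oldcontext);
	CurrentResourceOwner = oldowner;
}

The xact.c comment quoted above explains the other difference from DefineSavepoint: BeginInternalSubTransaction also performs the CommitTransactionCommand/StartTransactionCommand calls itself and does not need a savepoint name.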
Re: PATCH: logical_work_mem and logical streaming of largein-progress transactions
From
Konstantin Knizhnik
Date:
On 16.09.2019 19:54, Alexey Kondratov wrote: > On 30.08.2019 18:59, Konstantin Knizhnik wrote: >> >> I think that instead of defining savepoints it is simpler and more >> efficient to use >> >> BeginInternalSubTransaction + >> ReleaseCurrentSubTransaction/RollbackAndReleaseCurrentSubTransaction >> >> as it is done in PL/pgSQL (pl_exec.c). >> Not sure if it can pr >> > > Both BeginInternalSubTransaction and DefineSavepoint use > PushTransaction() internally for a normal subtransaction start. So > they seem to be identical from the performance perspective, which is > also stated in the comment section: Yes, they are definitely using the same mechanism and most likely provide similar performance. But BeginInternalSubTransaction does not require generating a savepoint name, which seems redundant in this case. > > Anyway, I've performed a profiling of my apply worker (flamegraph is > attached) and it spends the vast amount of time (>90%) applying > changes. So the problem is not in the savepoints themselves, but in > the fact that we first apply all changes and then abort all the work. > Not sure, that it is possible to do something in this case. > Looks like the only way to increase apply speed is to do it in parallel: make it possible to concurrently execute non-conflicting transactions.
Re: PATCH: logical_work_mem and logical streaming of largein-progress transactions
From
Tomas Vondra
Date:
On Mon, Sep 16, 2019 at 10:29:18PM +0300, Konstantin Knizhnik wrote: > > >On 16.09.2019 19:54, Alexey Kondratov wrote: >>On 30.08.2019 18:59, Konstantin Knizhnik wrote: >>> >>>I think that instead of defining savepoints it is simpler and more >>>efficient to use >>> >>>BeginInternalSubTransaction + >>>ReleaseCurrentSubTransaction/RollbackAndReleaseCurrentSubTransaction >>> >>>as it is done in PL/pgSQL (pl_exec.c). >>>Not sure if it can pr >>> >> >>Both BeginInternalSubTransaction and DefineSavepoint use >>PushTransaction() internally for a normal subtransaction start. So >>they seem to be identical from the performance perspective, which >>is also stated in the comment section: > >Yes, they are definitely using the same mechanism and most likely >provide similar performance. >But BeginInternalSubTransaction does not require generating a >savepoint name, which seems redundant in this case. > > >> >>Anyway, I've performed a profiling of my apply worker (flamegraph is >>attached) and it spends the vast amount of time (>90%) applying >>changes. So the problem is not in the savepoints themselves, but in >>the fact that we first apply all changes and then abort all the >>work. Not sure, that it is possible to do something in this case. >> > >Looks like the only way to increase apply speed is to do it in >parallel: make it possible to concurrently execute non-conflicting >transactions. > True, although it seems like a massive can of worms to me. I'm not aware of a way to identify non-conflicting transactions in advance, so it would have to be implemented as optimistic apply, with detection of and recovery from conflicts. I'm not against doing that, and I'm willing to spend some time on reviews etc. but it seems like a completely separate effort. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: PATCH: logical_work_mem and logical streaming of largein-progress transactions
From
Amit Kapila
Date:
On Tue, Sep 3, 2019 at 4:30 AM Alvaro Herrera <alvherre@2ndquadrant.com> wrote: > > In the interest of moving things forward, how far are we from making > 0001 committable? If I understand correctly, the rest of this patchset > depends on https://commitfest.postgresql.org/24/944/ which seems to be > moving at a glacial pace (or, actually, slower, because glaciers do > move, which cannot be said of that other patch.) > I am not sure if it is completely correct that the other part of the patch is dependent on that CF entry. I have studied both threads (not every detail) and it seems to me it is dependent on one of the patches from that series which handles concurrent aborts. It is patch 0003-Gracefully-handle-concurrent-aborts-of-uncommitted-t.Jan4.patch from what Nikhil has posted on that thread [1]. Am I wrong? So IIUC, the problem of concurrent aborts is that if we allow catalog scans for in-progress transactions, then we might get wrong answers in cases where somebody has performed Alter-Abort-Alter, which is clearly explained with an example in email [2]. To solve that problem Nikhil seems to have written a patch [1] which detects these concurrent aborts during a system table scan and then aborts the decoding of such a transaction. Now, the problem is that the patch was written considering 2PC transactions and might not deal with all cases for in-progress transactions, especially when sub-transactions are involved, as alluded to by Arseny Sher [3]. So, the problem seems to be for cases when some sub-transaction aborts, but the main transaction still continues and we try to decode it. Nikhil's patch won't be able to deal with it because I think it just checks the top-level xid, whereas for this we need to check all subxids, which I think is possible now as Tomas's patch seems to WAL-log each xid assignment. It might or might not be the best solution to check the status of all subxids, but I think first we need to agree that the problem is just for concurrent aborts and that we can solve it by using some part of the technology being developed as part of the patch "Logical decoding of two-phase transactions" (https://commitfest.postgresql.org/24/944/) rather than the entire patchset. I hope I am not saying something very obvious here and it helps in moving this patch forward. Thoughts? [1] - https://www.postgresql.org/message-id/CAMGcDxcBmN6jNeQkgWddfhX8HbSjQpW%3DUo70iBY3P_EPdp%2BLTQ%40mail.gmail.com [2] - https://www.postgresql.org/message-id/EEBD82AA-61EE-46F4-845E-05B94168E8F2%40postgrespro.ru [3] - https://www.postgresql.org/message-id/87a7py4iwl.fsf%40ars-thinkpad -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
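To make the concurrent-abort handling described above more concrete, here is a conceptual sketch of the kind of check such a patch adds around catalog scans while decoding an in-progress transaction. The variable and function names are illustrative assumptions, not the actual code from either patch series; TransactionIdIsInProgress(), TransactionIdDidCommit() and ERRCODE_TRANSACTION_ROLLBACK are existing PostgreSQL facilities.

#include "postgres.h"
#include "access/transam.h"
#include "storage/procarray.h"

/* Illustrative global: xid of the in-progress transaction being decoded. */
static TransactionId CheckXidAlive = InvalidTransactionId;

/*
 * Conceptual sketch only: bail out of decoding when the transaction whose
 * changes we are currently replaying (or, per the sub-transaction concern
 * above, one of its subxacts) has aborted concurrently, since its catalog
 * entries may no longer be sane.
 */
static void
abort_decoding_if_xact_aborted(void)
{
	if (!TransactionIdIsValid(CheckXidAlive))
		return;					/* not decoding an in-progress transaction */

	/*
	 * If the xid is neither in progress nor committed, it must have aborted
	 * under us; give up on this transaction instead of returning bogus data.
	 */
	if (!TransactionIdIsInProgress(CheckXidAlive) &&
		!TransactionIdDidCommit(CheckXidAlive))
		ereport(ERROR,
				(errcode(ERRCODE_TRANSACTION_ROLLBACK),
				 errmsg("transaction aborted during system catalog scan")));
}

The decoder would set CheckXidAlive before replaying an in-progress transaction, run this check from the catalog scan paths, and catch the error to stop decoding that transaction cleanly.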
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Amit Kapila
Date:
On Tue, Sep 3, 2019 at 4:16 PM Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote: > > On Mon, Sep 02, 2019 at 06:06:50PM -0400, Alvaro Herrera wrote: > >In the interest of moving things forward, how far are we from making > >0001 committable? If I understand correctly, the rest of this patchset > >depends on https://commitfest.postgresql.org/24/944/ which seems to be > >moving at a glacial pace (or, actually, slower, because glaciers do > >move, which cannot be said of that other patch.) > > > > I think 0001 is mostly there. I think there's one bug in this patch > version, but I need to check and I'll post an updated version shortly if > needed. > Did you get a chance to work on 0001? I have a few comments on that patch: 1. + * To limit the amount of memory used by decoded changes, we track memory + * used at the reorder buffer level (i.e. total amount of memory), and for + * each toplevel transaction. When the total amount of used memory exceeds + * the limit, the toplevel transaction consuming the most memory is either + * serialized or streamed. Do we need to mention 'streamed' as part of this patch? It seems to me that this is an independent patch which can be committed without patches that stream the changes. So, we can remove it from here and other places where it is used. 2. + * deserializing and applying very few changes). We probably to give more + * memory to the oldest subtransactions. /We probably to/ It seems some word is missing after probably. 3. + * Find the largest transaction (toplevel or subxact) to evict (spill to disk). + * + * XXX With many subtransactions this might be quite slow, because we'll have + * to walk through all of them. There are some options how we could improve + * that: (a) maintain some secondary structure with transactions sorted by + * amount of changes, (b) not looking for the entirely largest transaction, + * but e.g. for transaction using at least some fraction of the memory limit, + * and (c) evicting multiple transactions at once, e.g. to free a given portion + * of the memory limit (e.g. 50%). + */ +static ReorderBufferTXN * +ReorderBufferLargestTXN(ReorderBuffer *rb) What is the guarantee that after evicting largest transaction, we won't immediately hit the memory limit? Say, all of the transactions are of almost similar size which I don't think is that uncommon a case. Instead, the strategy mentioned in point (c) or something like that seems more promising. In that strategy, there is some risk that it might lead to many smaller disk writes which we might want to control via some threshold (like we should not flush more than N xacts). In this, we also need to ensure that the total memory freed must be greater than the current change. I think we have some discussion around this point but didn't reach any conclusion which means some more brainstorming is required. 4. +int logical_work_mem; /* 4MB */ What this 4MB in comments indicate? 5. +/* + * Check whether the logical_work_mem limit was reached, and if yes pick + * the transaction tx should spill its data to disk. The second part of the sentence "pick the transaction tx should spill" seems to be incomplete. Apart from this, I see that Peter E. has raised some other points on this patch which are not yet addressed as those also need some discussion, so I will respond to those separately with my opinion. These comments are based on the last patch posted by you on this thread [1]. You might have fixed some of these already, so ignore if that is the case. 
[1] - https://www.postgresql.org/message-id/76fc440e-91c3-afe2-b78a-987205b3c758%402ndquadrant.com -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
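For reference while reading the comments above, the eviction path under discussion looks roughly like the sketch below (simplified, not the exact patch code). Here rb->size is the total-memory counter added by 0001, logical_work_mem is in kB, and ReorderBufferLargestTXN() / ReorderBufferSerializeTXN() are the helpers referred to in point 3.

/*
 * Sketch: after queueing a change, keep spilling the largest transaction
 * to disk until total memory usage drops below the configured limit again.
 */
static void
ReorderBufferCheckMemoryLimit(ReorderBuffer *rb)
{
    while (rb->size >= (Size) logical_work_mem * 1024L)
    {
        ReorderBufferTXN *txn = ReorderBufferLargestTXN(rb);

        /* serializing the changes also subtracts their size from txn and rb */
        ReorderBufferSerializeTXN(rb, txn);
    }
}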
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Tomas Vondra
Date:
Hi, Attached is an updated patch series, rebased on current master. It does fix one memory accounting bug in ReorderBufferToastReplace (the code was not properly updating the amount of memory). I've also included the patch series with decoding of 2PC transactions, which this depends on. This way we have a chance of making the cfbot happy. So parts 0001-0004 and 0009-0014 are "this" patch series, while 0005-0008 are the extra pieces from the other patch. I've done it like this because the initial parts are independent, and so might be committed irrespectedly of the other patch series. In practice that's only reasonable for 0001, which adds the memory limit - the rest is infrastucture for the streaming of in-progress transactions. On Wed, Sep 25, 2019 at 06:55:01PM +0530, Amit Kapila wrote: >On Tue, Sep 3, 2019 at 4:30 AM Alvaro Herrera <alvherre@2ndquadrant.com> wrote: >> >> In the interest of moving things forward, how far are we from making >> 0001 committable? If I understand correctly, the rest of this patchset >> depends on https://commitfest.postgresql.org/24/944/ which seems to be >> moving at a glacial pace (or, actually, slower, because glaciers do >> move, which cannot be said of that other patch.) >> > >I am not sure if it is completely correct that the other part of the >patch is dependent on that CF entry. I have studied both the threads >(not every detail) and it seems to me it is dependent on one of the >patches from that series which handles concurrent aborts. It is patch >0003-Gracefully-handle-concurrent-aborts-of-uncommitted-t.Jan4.patch >from what the Nikhil has posted on that thread [1]. Am, I wrong? > You're right - the part handling aborts is the only part required. There are dependencies on some other changes from the 2PC patch, but those are mostly refactorings that can be undone (e.g. switch from independent flags to a single bitmap in reorderbuffer). >So IIUC, the problem of concurrent aborts is that if we allow catalog >scans for in-progress transactions, then we might get wrong answers in >cases where somebody has performed Alter-Abort-Alter which is clearly >explained with an example in email [2]. To solve that problem Nikhil >seems to have written a patch [1] which detects these concurrent >aborts during a system table scan and then aborts the decoding of such >a transaction. > >Now, the problem is that patch has written considering 2PC >transactions and might not deal with all cases for in-progress >transactions especially when sub-transactions are involved as alluded >by Arseny Sher [3]. So, the problem seems to be for cases when some >sub-transaction aborts, but the main transaction still continued and >we try to decode it. Nikhil's patch won't be able to deal with it >because I think it just checks top-level xid whereas for this we need >to check all-subxids which I think is possible now as Tomas seems to >have written WAL for each xid-assignment. It might or might not be >the best solution to check the status of all-subxids, but I think >first we need to agree that the problem is just for concurrent aborts >and that we can solve it by using some part of the technology being >developed as part of patch "Logical decoding of two-phase >transactions" (https://commitfest.postgresql.org/24/944/) rather than >the entire patchset. > >I hope I am not saying something very obvious here and it helps in >moving this patch forward. > No, that's a good question, and I'm not sure what the answer is at the moment. 
My understanding was that the infrastructure in the 2PC patch is enough even for subtransactions, but I might be wrong. I need to think about that for a while. Maybe we should focus on the 0001 part for now - it can be committed independently and does provide a useful feature. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachment
- 0001-Add-logical_work_mem-to-limit-ReorderBuffer-20190926.patch.gz
- 0002-Immediately-WAL-log-assignments-20190926.patch.gz
- 0003-Issue-individual-invalidations-with-wal_lev-20190926.patch.gz
- 0004-Extend-the-output-plugin-API-with-stream-me-20190926.patch.gz
- 0005-Cleaning-up-of-flags-in-ReorderBufferTXN-st-20190926.patch.gz
- 0006-Support-decoding-of-two-phase-transactions--20190926.patch.gz
- 0007-Gracefully-handle-concurrent-aborts-of-unco-20190926.patch.gz
- 0008-Teach-test_decoding-plugin-to-work-with-2PC-20190926.patch.gz
- 0009-Implement-streaming-mode-in-ReorderBuffer-20190926.patch.gz
- 0010-Add-support-for-streaming-to-built-in-repli-20190926.patch.gz
- 0011-Track-statistics-for-streaming-spilling-20190926.patch.gz
- 0012-Enable-streaming-for-all-subscription-TAP-t-20190926.patch.gz
- 0013-BUGFIX-set-final_lsn-for-subxacts-before-cl-20190926.patch.gz
- 0014-Add-TAP-test-for-streaming-vs.-DDL-20190926.patch.gz
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Tomas Vondra
Date:
On Thu, Sep 26, 2019 at 06:58:17PM +0530, Amit Kapila wrote: >On Tue, Sep 3, 2019 at 4:16 PM Tomas Vondra ><tomas.vondra@2ndquadrant.com> wrote: >> >> On Mon, Sep 02, 2019 at 06:06:50PM -0400, Alvaro Herrera wrote: >> >In the interest of moving things forward, how far are we from making >> >0001 committable? If I understand correctly, the rest of this patchset >> >depends on https://commitfest.postgresql.org/24/944/ which seems to be >> >moving at a glacial pace (or, actually, slower, because glaciers do >> >move, which cannot be said of that other patch.) >> > >> >> I think 0001 is mostly there. I think there's one bug in this patch >> version, but I need to check and I'll post an updated version shortly if >> needed. >> > >Did you get a chance to work on 0001? I have a few comments on that patch: >1. >+ * To limit the amount of memory used by decoded changes, we track memory >+ * used at the reorder buffer level (i.e. total amount of memory), and for >+ * each toplevel transaction. When the total amount of used memory exceeds >+ * the limit, the toplevel transaction consuming the most memory is either >+ * serialized or streamed. > >Do we need to mention 'streamed' as part of this patch? It seems to >me that this is an independent patch which can be committed without >patches that stream the changes. So, we can remove it from here and >other places where it is used. > You're right - this patch should not mention streaming because the parts adding that capability are later in the series. So it can trigger just the serialization to disk. >2. >+ * deserializing and applying very few changes). We probably to give more >+ * memory to the oldest subtransactions. > >/We probably to/ >It seems some word is missing after probably. > Yes. >3. >+ * Find the largest transaction (toplevel or subxact) to evict (spill to disk). >+ * >+ * XXX With many subtransactions this might be quite slow, because we'll have >+ * to walk through all of them. There are some options how we could improve >+ * that: (a) maintain some secondary structure with transactions sorted by >+ * amount of changes, (b) not looking for the entirely largest transaction, >+ * but e.g. for transaction using at least some fraction of the memory limit, >+ * and (c) evicting multiple transactions at once, e.g. to free a given portion >+ * of the memory limit (e.g. 50%). >+ */ >+static ReorderBufferTXN * >+ReorderBufferLargestTXN(ReorderBuffer *rb) > >What is the guarantee that after evicting largest transaction, we >won't immediately hit the memory limit? Say, all of the transactions >are of almost similar size which I don't think is that uncommon a >case. Not sure I understand - what do you mean 'immediately hit'? We do check the limit after queueing a change, and we know that this change is what got us over the limit. We pick the largest transaction (which has to be larger than the change we just entered) and evict it, getting below the memory limit again. The next change can get us over the memory limit again, of course, but there's not much we could do about that. > Instead, the strategy mentioned in point (c) or something like >that seems more promising. In that strategy, there is some risk that >it might lead to many smaller disk writes which we might want to >control via some threshold (like we should not flush more than N >xacts). In this, we also need to ensure that the total memory freed >must be greater than the current change. 
> >I think we have some discussion around this point but didn't reach any >conclusion which means some more brainstorming is required. > I agree it's worth investigating, but I'm not sure it's necessary before committing v1 of the feature. I don't think there's a clear winner strategy, and the current approach works fairly well I think. The comment is concerned with the cost of ReorderBufferLargestTXN with many transactions, but we can only have certain number of top-level transactions (max_connections + certain number of not-yet-assigned subtransactions). And 0002 patch essentially gets rid of the subxacts entirely, further reducing the maximum number of xacts to walk. >4. >+int logical_work_mem; /* 4MB */ > >What this 4MB in comments indicate? > Sorry, that's a mistake. >5. >+/* >+ * Check whether the logical_work_mem limit was reached, and if yes pick >+ * the transaction tx should spill its data to disk. > >The second part of the sentence "pick the transaction tx should spill" >seems to be incomplete. > Yeah, that's a poor wording. Will fix. >Apart from this, I see that Peter E. has raised some other points on >this patch which are not yet addressed as those also need some >discussion, so I will respond to those separately with my opinion. > OK, thanks. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
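For readers who don't have the patch open, the accounting being discussed is essentially the following (a simplified sketch; ReorderBufferChangeSize() stands for the per-change size computation in the patch):

/*
 * Sketch: keep both the per-transaction and the buffer-wide memory counters
 * up to date whenever a change is added to or removed from the buffer, so
 * the limit check and the choice of an eviction victim stay cheap.
 */
static void
ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
                                ReorderBufferChange *change,
                                bool addition)
{
    ReorderBufferTXN *txn = change->txn;
    Size        sz = ReorderBufferChangeSize(change);

    if (addition)
    {
        txn->size += sz;
        rb->size += sz;
    }
    else
    {
        Assert(txn->size >= sz && rb->size >= sz);
        txn->size -= sz;
        rb->size -= sz;
    }
}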
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Alvaro Herrera
Date:
On 2019-Sep-26, Tomas Vondra wrote: > Hi, > > Attached is an updated patch series, rebased on current master. It does > fix one memory accounting bug in ReorderBufferToastReplace (the code was > not properly updating the amount of memory). Cool. Can we aim to get 0001 pushed during this commitfest, or is that a lost cause? The large new comment in reorderbuffer.c says that a transaction might get spilled *or streamed*, but surely that second thing is not correct, since before the subsequent patches it's not possible to stream transactions that have not yet finished? How certain are you about the approach to measure memory used by a reorderbuffer transaction ... does it not cause a measurable performance drop? I wonder if it would make more sense to use separate contexts per transaction and use context-level accounting (per the patch Jeff Davis posted elsewhere for hash joins ... though I see now that that only works for aset.c, not other memcxt implementations), or something like that. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Alvaro Herrera
Date:
On 2019-Sep-26, Alvaro Herrera wrote: > How certain are you about the approach to measure memory used by a > reorderbuffer transaction ... does it not cause a measurable performance > drop? I wonder if it would make more sense to use a separate contexts > per transaction and use context-level accounting (per the patch Jeff > Davis posted elsewhere for hash joins ... though I see now that that > only works fot aset.c, not other memcxt implementations), or something > like that. Oh, I just noticed that that patch was posted separately in its own thread, and that that improved version does include support for other memory context implementations. Excellent. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Tomas Vondra
Date:
On Thu, Sep 26, 2019 at 04:36:20PM -0300, Alvaro Herrera wrote: >On 2019-Sep-26, Alvaro Herrera wrote: > >> How certain are you about the approach to measure memory used by a >> reorderbuffer transaction ... does it not cause a measurable performance >> drop? I wonder if it would make more sense to use a separate contexts >> per transaction and use context-level accounting (per the patch Jeff >> Davis posted elsewhere for hash joins ... though I see now that that >> only works fot aset.c, not other memcxt implementations), or something >> like that. > >Oh, I just noticed that that patch was posted separately in its own >thread, and that that improved version does include support for other >memory context implementations. Excellent. > Unfortunately, that won't fly, for two simple reasons: 1) The memory accounting patch is known to perform poorly with many child contexts - this was why array_agg/string_agg were problematic, before we rewrote them not to create memory context for each group. It could be done differently (eager accounting) but then the overhead for regular/common cases (with just a couple of contexts) is higher. So that seems like a much inferior option. 2) We can't actually have a single context per transaction. Some parts (REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID) of a transaction are not evicted, so we'd have to keep them in a separate context. It'd also mean higher allocation overhead, because now we can reuse chunks cross-transaction. So one transaction commits or gets serialized, and we reuse the chunks for something else. With per-transaction contexts we'd lose some of this benefit - we could only reuse chunks within a transaction (i.e. large transactions that get spilled to disk) but not across commits. I don't have any numbers, of course, but I wouldn't be surprised if it was significant e.g. for small transactions that don't get spilled. And creating/destroying the contexts is not free either, I think. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
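As background for the chunk-reuse argument: the reorder buffer already allocates changes, transactions and tuple buffers from allocator contexts shared by all transactions, roughly like this excerpt from ReorderBufferAllocate() (quoted from memory, so treat the exact parameters as approximate):

    /* allocators shared by all transactions tracked by this reorder buffer */
    buffer->change_context = SlabContextCreate(new_ctx,
                                               "Change",
                                               SLAB_DEFAULT_BLOCK_SIZE,
                                               sizeof(ReorderBufferChange));

    buffer->txn_context = SlabContextCreate(new_ctx,
                                            "TXN",
                                            SLAB_DEFAULT_BLOCK_SIZE,
                                            sizeof(ReorderBufferTXN));

    buffer->tup_context = GenerationContextCreate(new_ctx,
                                                  "Tuples",
                                                  SLAB_LARGE_BLOCK_SIZE);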
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Tomas Vondra
Date:
On Thu, Sep 26, 2019 at 04:33:59PM -0300, Alvaro Herrera wrote: >On 2019-Sep-26, Tomas Vondra wrote: > >> Hi, >> >> Attached is an updated patch series, rebased on current master. It does >> fix one memory accounting bug in ReorderBufferToastReplace (the code was >> not properly updating the amount of memory). > >Cool. > >Can we aim to get 0001 pushed during this commitfest, or is that a lost >cause? > It's tempting. The patch has been in the queue for quite a bit of time, and I think it's solid (at least 0001). I'll address the comments from Peter's review about separating the GUC etc. and polish it a bit more. If I manage to do that by Monday, I'll consider pushing it. If anyone feels I shouldn't do that, let me know. The one open question pointed out by Amit is how the patch picks the transaction for eviction. My feeling is that it's fine and can be improved later if necessary, but I'll try to construct a worst case (max_connections xacts, each with 64 subxacts) to verify. >The large new comment in reorderbuffer.c says that a transaction might >get spilled *or streamed*, but surely that second thing is not correct, >since before the subsequent patches it's not possible to stream >transactions that have not yet finished? > True. That's a residue of reordering the patch series repeatedly, I think. I'll fix that while polishing the patch. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Amit Kapila
Date:
On Fri, Sep 27, 2019 at 12:06 AM Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote: > > On Thu, Sep 26, 2019 at 06:58:17PM +0530, Amit Kapila wrote: > > >3. > >+ * Find the largest transaction (toplevel or subxact) to evict (spill to disk). > >+ * > >+ * XXX With many subtransactions this might be quite slow, because we'll have > >+ * to walk through all of them. There are some options how we could improve > >+ * that: (a) maintain some secondary structure with transactions sorted by > >+ * amount of changes, (b) not looking for the entirely largest transaction, > >+ * but e.g. for transaction using at least some fraction of the memory limit, > >+ * and (c) evicting multiple transactions at once, e.g. to free a given portion > >+ * of the memory limit (e.g. 50%). > >+ */ > >+static ReorderBufferTXN * > >+ReorderBufferLargestTXN(ReorderBuffer *rb) > > > >What is the guarantee that after evicting largest transaction, we > >won't immediately hit the memory limit? Say, all of the transactions > >are of almost similar size which I don't think is that uncommon a > >case. > > Not sure I understand - what do you mean 'immediately hit'? > > We do check the limit after queueing a change, and we know that this > change is what got us over the limit. We pick the largest transaction > (which has to be larger than the change we just entered) and evict it, > getting below the memory limit again. > > The next change can get us over the memory limit again, of course, > Yeah, this is what I want to say when I wrote that it can immediately hit again. > but > there's not much we could do about that. > > > Instead, the strategy mentioned in point (c) or something like > >that seems more promising. In that strategy, there is some risk that > >it might lead to many smaller disk writes which we might want to > >control via some threshold (like we should not flush more than N > >xacts). In this, we also need to ensure that the total memory freed > >must be greater than the current change. > > > >I think we have some discussion around this point but didn't reach any > >conclusion which means some more brainstorming is required. > > > > I agree it's worth investigating, but I'm not sure it's necessary before > committing v1 of the feature. I don't think there's a clear winner > strategy, and the current approach works fairly well I think. > > The comment is concerned with the cost of ReorderBufferLargestTXN with > many transactions, but we can only have certain number of top-level > transactions (max_connections + certain number of not-yet-assigned > subtransactions). And 0002 patch essentially gets rid of the subxacts > entirely, further reducing the maximum number of xacts to walk. > That would be good, but I don't understand how. The second patch will update the subxacts in top-level ReorderBufferTxn, but it won't remove it from hash table. It also doesn't seem to be caring for considering the size of subxacts in top-level xact, so not sure how will it reduce the number of xacts to walk. I might be missing something here. Can you explain a bit how 0002 patch would help in reducing the maximum number of xacts to walk? -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Amit Kapila
Date:
On Tue, Jan 9, 2018 at 7:55 AM Peter Eisentraut <peter.eisentraut@2ndquadrant.com> wrote: > > On 1/3/18 14:53, Tomas Vondra wrote: > >> I don't see the need to tie this setting to maintenance_work_mem. > >> maintenance_work_mem is often set to very large values, which could > >> then have undesirable side effects on this use. > > > > Well, we need to pick some default value, and we can either use a fixed > > value (not sure what would be a good default) or tie it to an existing > > GUC. We only really have work_mem and maintenance_work_mem, and the > > walsender process will never use more than one such buffer. Which seems > > to be closer to maintenance_work_mem. > > > > Pretty much any default value can have undesirable side effects. > > Let's just make it an independent setting unless we know any better. We > don't have a lot of settings that depend on other settings, and the ones > we do have a very specific relationship. > > >> Moreover, the name logical_work_mem makes it sound like it's a logical > >> version of work_mem. Maybe we could think of another name. > > > > I won't object to a better name, of course. Any proposals? > > logical_decoding_[work_]mem? > Having a separate variable for this can give more flexibility, but OTOH it will add one more knob which user might not have a good idea to set. What are the problems we see if directly use work_mem for this case? If we can't use work_mem, then I think the name proposed by you (logical_decoding_work_mem) sounds good to me. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Amit Kapila
Date:
On Thu, Sep 26, 2019 at 11:37 PM Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote: > > On Wed, Sep 25, 2019 at 06:55:01PM +0530, Amit Kapila wrote: > >On Tue, Sep 3, 2019 at 4:30 AM Alvaro Herrera <alvherre@2ndquadrant.com> wrote: > >> > >> In the interest of moving things forward, how far are we from making > >> 0001 committable? If I understand correctly, the rest of this patchset > >> depends on https://commitfest.postgresql.org/24/944/ which seems to be > >> moving at a glacial pace (or, actually, slower, because glaciers do > >> move, which cannot be said of that other patch.) > >> > > > >I am not sure if it is completely correct that the other part of the > >patch is dependent on that CF entry. I have studied both the threads > >(not every detail) and it seems to me it is dependent on one of the > >patches from that series which handles concurrent aborts. It is patch > >0003-Gracefully-handle-concurrent-aborts-of-uncommitted-t.Jan4.patch > >from what the Nikhil has posted on that thread [1]. Am, I wrong? > > > > You're right - the part handling aborts is the only part required. There > are dependencies on some other changes from the 2PC patch, but those are > mostly refactorings that can be undone (e.g. switch from independent > flags to a single bitmap in reorderbuffer). > > >So IIUC, the problem of concurrent aborts is that if we allow catalog > >scans for in-progress transactions, then we might get wrong answers in > >cases where somebody has performed Alter-Abort-Alter which is clearly > >explained with an example in email [2]. To solve that problem Nikhil > >seems to have written a patch [1] which detects these concurrent > >aborts during a system table scan and then aborts the decoding of such > >a transaction. > > > >Now, the problem is that patch has written considering 2PC > >transactions and might not deal with all cases for in-progress > >transactions especially when sub-transactions are involved as alluded > >by Arseny Sher [3]. So, the problem seems to be for cases when some > >sub-transaction aborts, but the main transaction still continued and > >we try to decode it. Nikhil's patch won't be able to deal with it > >because I think it just checks top-level xid whereas for this we need > >to check all-subxids which I think is possible now as Tomas seems to > >have written WAL for each xid-assignment. It might or might not be > >the best solution to check the status of all-subxids, but I think > >first we need to agree that the problem is just for concurrent aborts > >and that we can solve it by using some part of the technology being > >developed as part of patch "Logical decoding of two-phase > >transactions" (https://commitfest.postgresql.org/24/944/) rather than > >the entire patchset. > > > >I hope I am not saying something very obvious here and it helps in > >moving this patch forward. > > > > No, that's a good question, and I'm not sure what the answer is at the > moment. My understanding was that the infrastructure in the 2PC patch is > enough even for subtransactions, but I might be wrong. > I also think the patch that handles concurrent aborts should be sufficient, but that need to be integrated with your patch. Earlier, I thought we need to check whether any of the subtransaction is aborted as mentioned by Arseny Sher, but now after thinking again about that problem, it seems that checking only the status current subtransaction should be sufficient. 
Because, if the user concurrently does Rollback to Savepoint, which aborts multiple subtransactions, the latest one must be aborted as well, which is what I think we want to detect. Once we detect that, we have two options: (a) restart the decoding of that transaction, removing the changes of all subxacts, or (b) somehow mark the transaction such that it gets decoded only at commit time. > > Maybe we should focus on the 0001 part for now - it can be committed > independently and does provide a useful feature. > If that can be done sooner, then it is fine, but otherwise, preparing the patches on top of HEAD would facilitate the review of those. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Tomas Vondra
Date:
On Fri, Sep 27, 2019 at 02:33:32PM +0530, Amit Kapila wrote: >On Tue, Jan 9, 2018 at 7:55 AM Peter Eisentraut ><peter.eisentraut@2ndquadrant.com> wrote: >> >> On 1/3/18 14:53, Tomas Vondra wrote: >> >> I don't see the need to tie this setting to maintenance_work_mem. >> >> maintenance_work_mem is often set to very large values, which could >> >> then have undesirable side effects on this use. >> > >> > Well, we need to pick some default value, and we can either use a fixed >> > value (not sure what would be a good default) or tie it to an existing >> > GUC. We only really have work_mem and maintenance_work_mem, and the >> > walsender process will never use more than one such buffer. Which seems >> > to be closer to maintenance_work_mem. >> > >> > Pretty much any default value can have undesirable side effects. >> >> Let's just make it an independent setting unless we know any better. We >> don't have a lot of settings that depend on other settings, and the ones >> we do have a very specific relationship. >> >> >> Moreover, the name logical_work_mem makes it sound like it's a logical >> >> version of work_mem. Maybe we could think of another name. >> > >> > I won't object to a better name, of course. Any proposals? >> >> logical_decoding_[work_]mem? >> > >Having a separate variable for this can give more flexibility, but >OTOH it will add one more knob which user might not have a good idea >to set. What are the problems we see if directly use work_mem for >this case? > IMHO it's similar to autovacuum_work_mem - we have an independent setting, but most people use it as -1 so we use maintenance_work_mem as a default value. I think it makes sense to do the same thing here. It does ad an extra knob anyway (I don't think we should just use maintenance_work_mem directly, the user should have an option to override it when needed). But most users will not notice. FWIW I don't think we should use work_mem, maintenace_work_mem seems somewhat more appropriate here (not related to queries, etc.). >If we can't use work_mem, then I think the name proposed by you >(logical_decoding_work_mem) sounds good to me. > Yes, that name seems better. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Amit Kapila
Date:
On Fri, Sep 27, 2019 at 4:55 PM Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote: > > On Fri, Sep 27, 2019 at 02:33:32PM +0530, Amit Kapila wrote: > >On Tue, Jan 9, 2018 at 7:55 AM Peter Eisentraut > ><peter.eisentraut@2ndquadrant.com> wrote: > >> > >> On 1/3/18 14:53, Tomas Vondra wrote: > >> >> I don't see the need to tie this setting to maintenance_work_mem. > >> >> maintenance_work_mem is often set to very large values, which could > >> >> then have undesirable side effects on this use. > >> > > >> > Well, we need to pick some default value, and we can either use a fixed > >> > value (not sure what would be a good default) or tie it to an existing > >> > GUC. We only really have work_mem and maintenance_work_mem, and the > >> > walsender process will never use more than one such buffer. Which seems > >> > to be closer to maintenance_work_mem. > >> > > >> > Pretty much any default value can have undesirable side effects. > >> > >> Let's just make it an independent setting unless we know any better. We > >> don't have a lot of settings that depend on other settings, and the ones > >> we do have a very specific relationship. > >> > >> >> Moreover, the name logical_work_mem makes it sound like it's a logical > >> >> version of work_mem. Maybe we could think of another name. > >> > > >> > I won't object to a better name, of course. Any proposals? > >> > >> logical_decoding_[work_]mem? > >> > > > >Having a separate variable for this can give more flexibility, but > >OTOH it will add one more knob which user might not have a good idea > >to set. What are the problems we see if directly use work_mem for > >this case? > > > > IMHO it's similar to autovacuum_work_mem - we have an independent > setting, but most people use it as -1 so we use maintenance_work_mem as > a default value. I think it makes sense to do the same thing here. > > It does ad an extra knob anyway (I don't think we should just use > maintenance_work_mem directly, the user should have an option to > override it when needed). But most users will not notice. > > FWIW I don't think we should use work_mem, maintenace_work_mem seems > somewhat more appropriate here (not related to queries, etc.). > I have the same concern for using maintenace_work_mem as Peter E. which is that the value of maintenace_work_mem will generally be higher which is suitable for its current purpose, but not for the purpose this patch is using. AFAIU, at this stage we want a better memory accounting system for logical decoding and we are not sure what is a good value for this variable. So, I think using work_mem or maintenace_work_mem should serve the purpose. Later, if we have requirements from people to have better control over the memory required for this purpose then we can introduce a new variable. I understand that currently work_mem is primarily tied with memory used for query workspaces, but it might be okay to extend it for this purpose. Another point is that the default for that sound to be more appealing for this case. I can see the argument against it which is having a separate variable will make the things look clean and give better control. So, if we can't convince ourselves for using work_mem, we can introduce a new guc variable and keep the default as 4MB or work_mem. 
I feel it is always tempting to introduce a new guc for the different tasks unless there is an exact match, but OTOH, having lesser guc's has its own advantage which is that people don't have to bother about a new setting which they need to tune and especially for which they can't decide with ease. I am not telling that we should not introduce new guc when it is required, but just to give more thought before doing so. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Amit Kapila
Date:
On Thu, Sep 26, 2019 at 11:37 PM Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote: > > Hi, > > Attached is an updated patch series, rebased on current master. It does > fix one memory accounting bug in ReorderBufferToastReplace (the code was > not properly updating the amount of memory). > Few comments on 0001 1. I am getting below linking error in pgoutput when compiling the patch on my windows system: pgoutput.obj : error LNK2001: unresolved external symbol _logical_work_mem You need to use PGDLLIMPORT for logical_work_mem. 2. After, I fixed above and tried some basic test, it fails with below callstack: postgres.exe!ExceptionalCondition(const char * conditionName=0x00d92854, const char * errorType=0x00d928bc, const char * fileName=0x00d92e60, int lineNumber=2148) Line 55 postgres.exe!ReorderBufferChangeMemoryUpdate(ReorderBuffer * rb=0x02693390, ReorderBufferChange * change=0x0269dd38, bool addition=true) Line 2148 postgres.exe!ReorderBufferQueueChange(ReorderBuffer * rb=0x02693390, unsigned int xid=525, unsigned __int64 lsn=36083720, ReorderBufferChange * change=0x0269dd38) Line 635 postgres.exe!DecodeInsert(LogicalDecodingContext * ctx=0x0268ef80, XLogRecordBuffer * buf=0x012cf718) Line 716 + 0x24 bytes C postgres.exe!DecodeHeapOp(LogicalDecodingContext * ctx=0x0268ef80, XLogRecordBuffer * buf=0x012cf718) Line 437 + 0xd bytes C postgres.exe!LogicalDecodingProcessRecord(LogicalDecodingContext * ctx=0x0268ef80, XLogReaderState * record=0x0268f228) Line 129 postgres.exe!pg_logical_slot_get_changes_guts(FunctionCallInfoBaseData * fcinfo=0x02688680, bool confirm=true, bool binary=false) Line 307 postgres.exe!pg_logical_slot_get_changes(FunctionCallInfoBaseData * fcinfo=0x02688680) Line 376 Basically, the assert added by you in ReorderBufferChangeMemoryUpdate failed. Then, I explored a bit and it seems that you have missed assigning a value to txn, a new variable added by this patch in structure ReorderBufferChange: @@ -77,6 +82,9 @@ typedef struct ReorderBufferChange /* The type of change. */ enum ReorderBufferChangeType action; + /* Transaction this change belongs to. */ + struct ReorderBufferTXN *txn; 3. @@ -206,6 +206,17 @@ CREATE SUBSCRIPTION <replaceable class="parameter">subscription_name</replaceabl </para> </listitem> </varlistentry> + + <varlistentry> + <term><literal>work_mem</literal> (<type>integer</type>)</term> + <listitem> + <para> + Limits the amount of memory used to decode changes on the + publisher. If not specified, the publisher will use the default + specified by <varname>logical_work_mem</varname>. + </para> + </listitem> + </varlistentry> I don't see any explanation of how this will be useful? How can a subscriber predict the amount of memory required by a publisher for decoding? This is more unpredictable because when initially the changes are recorded in ReorderBuffer, it doesn't even filter corresponding to any publisher. Do we really need this? I think giving more knobs to the user is helpful when they can someway know how to use it. In this case, it is not clear whether the user can ever use this. 4. Can we some way expose the memory consumed by ReorderBuffer? If so, we might be able to write some tests covering new functionality. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
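Regarding point 2 above, the failing assertion suggests the fix is simply to associate each change with its transaction before the accounting runs, i.e. something along the lines of the sketch below (a simplified version of ReorderBufferQueueChange(), not the actual patch code; ReorderBufferChangeMemoryUpdate() and ReorderBufferCheckMemoryLimit() are the accounting/eviction helpers from 0001):

void
ReorderBufferQueueChange(ReorderBuffer *rb, TransactionId xid,
                         XLogRecPtr lsn, ReorderBufferChange *change)
{
    ReorderBufferTXN *txn;

    txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);

    change->lsn = lsn;
    change->txn = txn;          /* the assignment missing in the posted patch */

    dlist_push_tail(&txn->changes, &change->node);
    txn->nentries++;
    txn->nentries_mem++;

    /* update per-transaction and total memory accounting */
    ReorderBufferChangeMemoryUpdate(rb, change, true);

    /* and spill the largest transaction(s) if we exceeded the limit */
    ReorderBufferCheckMemoryLimit(rb);
}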
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Tomas Vondra
Date:
On Sat, Sep 28, 2019 at 01:36:46PM +0530, Amit Kapila wrote: >On Fri, Sep 27, 2019 at 4:55 PM Tomas Vondra ><tomas.vondra@2ndquadrant.com> wrote: >> >> On Fri, Sep 27, 2019 at 02:33:32PM +0530, Amit Kapila wrote: >> >On Tue, Jan 9, 2018 at 7:55 AM Peter Eisentraut >> ><peter.eisentraut@2ndquadrant.com> wrote: >> >> >> >> On 1/3/18 14:53, Tomas Vondra wrote: >> >> >> I don't see the need to tie this setting to maintenance_work_mem. >> >> >> maintenance_work_mem is often set to very large values, which could >> >> >> then have undesirable side effects on this use. >> >> > >> >> > Well, we need to pick some default value, and we can either use a fixed >> >> > value (not sure what would be a good default) or tie it to an existing >> >> > GUC. We only really have work_mem and maintenance_work_mem, and the >> >> > walsender process will never use more than one such buffer. Which seems >> >> > to be closer to maintenance_work_mem. >> >> > >> >> > Pretty much any default value can have undesirable side effects. >> >> >> >> Let's just make it an independent setting unless we know any better. We >> >> don't have a lot of settings that depend on other settings, and the ones >> >> we do have a very specific relationship. >> >> >> >> >> Moreover, the name logical_work_mem makes it sound like it's a logical >> >> >> version of work_mem. Maybe we could think of another name. >> >> > >> >> > I won't object to a better name, of course. Any proposals? >> >> >> >> logical_decoding_[work_]mem? >> >> >> > >> >Having a separate variable for this can give more flexibility, but >> >OTOH it will add one more knob which user might not have a good idea >> >to set. What are the problems we see if directly use work_mem for >> >this case? >> > >> >> IMHO it's similar to autovacuum_work_mem - we have an independent >> setting, but most people use it as -1 so we use maintenance_work_mem as >> a default value. I think it makes sense to do the same thing here. >> >> It does ad an extra knob anyway (I don't think we should just use >> maintenance_work_mem directly, the user should have an option to >> override it when needed). But most users will not notice. >> >> FWIW I don't think we should use work_mem, maintenace_work_mem seems >> somewhat more appropriate here (not related to queries, etc.). >> > >I have the same concern for using maintenace_work_mem as Peter E. >which is that the value of maintenace_work_mem will generally be >higher which is suitable for its current purpose, but not for the >purpose this patch is using. AFAIU, at this stage we want a better >memory accounting system for logical decoding and we are not sure what >is a good value for this variable. So, I think using work_mem or >maintenace_work_mem should serve the purpose. Later, if we have >requirements from people to have better control over the memory >required for this purpose then we can introduce a new variable. > >I understand that currently work_mem is primarily tied with memory >used for query workspaces, but it might be okay to extend it for this >purpose. Another point is that the default for that sound to be more >appealing for this case. I can see the argument against it which is >having a separate variable will make the things look clean and give >better control. So, if we can't convince ourselves for using >work_mem, we can introduce a new guc variable and keep the default as >4MB or work_mem. 
> >I feel it is always tempting to introduce a new guc for the different >tasks unless there is an exact match, but OTOH, having lesser guc's >has its own advantage which is that people don't have to bother about >a new setting which they need to tune and especially for which they >can't decide with ease. I am not telling that we should not introduce >new guc when it is required, but just to give more thought before >doing so. > I do think having a separate GUC is a must, irrespective of what other GUC (if any) is used as a default. You're right that the maintenance_work_mem value might be too high (e.g. in cases with many subscriptions), but the same issue applies to work_mem - there's no guarantee work_mem is lower than maintenance_work_mem, and in analytics databases it may be set very high. So work_mem does not really solve the issue. IMHO we can't really do without a new GUC. It's not difficult to create examples that would benefit from a small/large memory limit, depending on the number of subscriptions etc. I do however agree the GUC does not have to be tied to any existing one, it was just an attempt to use a more sensible default value. I do think m_w_m would be fine, but I can live with using an explicit value. So that's what I did in the attached patch - I've renamed the GUC to logical_decoding_work_mem, detached it from m_w_m and set the default to 64MB (i.e. the same default as m_w_m). It should also fix all the issues from the recent reviews (at least I believe so). I've realized that one of the subsequent patches allows overriding the limit for individual subscriptions (in the CREATE SUBSCRIPTION command). I think it'd be good to move this bit forward, but I think it can be done in a separate patch. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachment
- 0001-Add-logical_decoding_work_mem-to-limit-ReorderBuffer.patch.gz
- 0002-Immediately-WAL-log-assignments.patch.gz
- 0003-Issue-individual-invalidations-with-wal_level-logica.patch.gz
- 0004-Extend-the-output-plugin-API-with-stream-methods.patch.gz
- 0005-Cleaning-up-of-flags-in-ReorderBufferTXN-structure.patch.gz
- 0006-Support-decoding-of-two-phase-transactions-at-PREPAR.patch.gz
- 0007-Gracefully-handle-concurrent-aborts-of-uncommitted.patch.gz
- 0008-Teach-test_decoding-plugin-to-work-with-2PC.patch.gz
- 0009-Implement-streaming-mode-in-ReorderBuffer.patch.gz
- 0010-Add-support-for-streaming-to-built-in-replication.patch.gz
- 0011-Track-statistics-for-streaming-spilling.patch.gz
- 0012-Enable-streaming-for-all-subscription-TAP-tests.patch.gz
- 0013-BUGFIX-set-final_lsn-for-subxacts-before-cleanup.patch.gz
- 0014-Add-TAP-test-for-streaming-vs.-DDL.patch.gz
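To make the renamed GUC concrete, the guc.c entry for logical_decoding_work_mem would look roughly like this (a sketch from memory; the exact bounds, default and description strings in the attached 0001 may differ):

    {
        {"logical_decoding_work_mem", PGC_USERSET, RESOURCES_MEM,
            gettext_noop("Sets the maximum memory to be used for logical decoding."),
            gettext_noop("This much memory can be used by each internal "
                         "reorder buffer before spilling to disk."),
            GUC_UNIT_KB
        },
        &logical_decoding_work_mem,
        65536, 64, MAX_KILOBYTES,
        NULL, NULL, NULL
    },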
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Dilip Kumar
Date:
On Thu, Sep 26, 2019 at 11:38 PM Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote: > > No, that's a good question, and I'm not sure what the answer is at the > moment. My understanding was that the infrastructure in the 2PC patch is > enough even for subtransactions, but I might be wrong. I need to think > about that for a while. > IIUC, for 2PC it's enough to check whether the main transaction is aborted or not, but for an in-progress transaction it's possible that the current subtransaction might have done catalog changes and might get aborted while we are decoding. So we need to extend the infrastructure such that we can check the status of the transaction for which we are decoding the change. Also, I think we need to handle ERRCODE_TRANSACTION_ROLLBACK and ignore it. I have attached a small patch to handle this which can be applied on top of your patch set. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Attachment
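The "handle ERRCODE_TRANSACTION_ROLLBACK and ignore it" part would presumably look something like the sketch below inside the decoding loop (illustrative only, not the attached patch; memory context handling is simplified):

    MemoryContext oldcontext = CurrentMemoryContext;

    PG_TRY();
    {
        /* ... decode and apply/stream changes of the in-progress transaction ... */
    }
    PG_CATCH();
    {
        ErrorData  *errdata;

        MemoryContextSwitchTo(oldcontext);
        errdata = CopyErrorData();

        if (errdata->sqlerrcode == ERRCODE_TRANSACTION_ROLLBACK)
        {
            /*
             * The transaction we were decoding aborted concurrently; it will
             * never commit, so clean up and stop decoding it.
             */
            FlushErrorState();
            FreeErrorData(errdata);
        }
        else
        {
            FreeErrorData(errdata);
            PG_RE_THROW();
        }
    }
    PG_END_TRY();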
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Amit Kapila
Date:
On Sun, Sep 29, 2019 at 12:39 AM Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote: > > On Sat, Sep 28, 2019 at 01:36:46PM +0530, Amit Kapila wrote: > >On Fri, Sep 27, 2019 at 4:55 PM Tomas Vondra > ><tomas.vondra@2ndquadrant.com> wrote: > > I do think having a separate GUC is a must, irrespectedly of what other > GUC (if any) is used as a default. You're right the maintenance_work_mem > value might be too high (e.g. in cases with many subscriptions), but the > same issue applies to work_mem - there's no guarantee work_mem is lower > than maintenance_work_mem, and in analytics databases it may be set very > high. So work_mem does not really solve the issue > > IMHO we can't really do without a new GUC. It's not difficult to create > examples that would benefit from small/large memory limit, depending on > the number of subscriptions etc. > > I do however agree the GUC does not have to be tied to any existing one, > it was just an attempt to use a more sensible default value. I do think > m_w_m would be fine, but I can live with using an explicit value. > > So that's what I did in the attached patch - I've renamed the GUC to > logical_decoding_work_mem, detached it from m_w_m and set the default to > 64MB (i.e. the same default as m_w_m). Fair enough, let's not argue more on this unless someone else wants to share his opinion. > It should also fix all the issues > from the recent reviews (at least I believe so). > Have you given any thought on creating a test case for this patch? I think you also told that you will test some worst-case scenarios and report the numbers so that we are convinced that the current eviction algorithm is good. > I've realized that one of the subsequent patches allows overriding the > limit for individual subscriptions (in the CREATE SUBSCRIPTION command). > I think it'd be good to move this bit forward, but I think it can be > done in a separate patch. > Yeah, it is better to deal it separately as I am also not entirely convinced at this stage about this parameter. I have mentioned the same in the previous email as well. While glancing through the changes, I noticed a small thing: +#logical_decoding_work_mem = 64MB # min 1MB, or -1 to use maintenance_work_mem I guess this need to be updated. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Alvaro Herrera
Date:
On 2019-Sep-29, Amit Kapila wrote: > On Sun, Sep 29, 2019 at 12:39 AM Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote: > > So that's what I did in the attached patch - I've renamed the GUC to > > logical_decoding_work_mem, detached it from m_w_m and set the default to > > 64MB (i.e. the same default as m_w_m). > > Fair enough, let's not argue more on this unless someone else wants to > share his opinion. I just read this part of the conversation and I agree that having a separate GUC with its own value independent from other GUCs is a good solution. Tying it to m_w_m seemed reasonable, but it's true that people frequently set m_w_m very high, and it would be undesirable to propagate that value to logical decoding memory usage. I wonder what would constitute good advice on how to set this value, I mean what is the metric that the user needs to be thinking about. Is it the total of memory required to keep all concurrent write transactions in memory? (Quick example: if you do 2048 wTPS and each transaction lasts 1s, and each transaction does 1kB of logically-decoded changes, then ~2MB are sufficient for the average case. Is that correct? I *think* that full-page images do not count, correct? With these things in mind users could go through pg_waldump output and figure out what to set the value to.) -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Tomas Vondra
Date:
On Sun, Sep 29, 2019 at 02:30:44PM -0300, Alvaro Herrera wrote: >On 2019-Sep-29, Amit Kapila wrote: > >> On Sun, Sep 29, 2019 at 12:39 AM Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote: > >> > So that's what I did in the attached patch - I've renamed the GUC to >> > logical_decoding_work_mem, detached it from m_w_m and set the default to >> > 64MB (i.e. the same default as m_w_m). >> >> Fair enough, let's not argue more on this unless someone else wants to >> share his opinion. > >I just read this part of the conversation and I agree that having a >separate GUC with its own value independent from other GUCs is a good >solution. Tying it to m_w_m seemed reasonable, but it's true that >people frequently set m_w_m very high, and it would be undesirable to >propagate that value to logical decoding memory usage. > > >I wonder what would constitute good advice on how to set this value, I >mean what is the metric that the user needs to be thinking about. Is >it the total of memory required to keep all concurrent write transactions >in memory? (Quick example: if you do 2048 wTPS and each transaction >lasts 1s, and each transaction does 1kB of logically-decoded changes, >then ~2MB are sufficient for the average case. Is that correct? Yes, something like that. Essentially we'd like to keep all concurrent transactions decoded in memory, to eliminate the need to spill to disk. One of the subsequent patches adds some subscription-level stats, so maybe we don't need to worry about this too much - the stats seem like a better source of information for tuning. >I *think* that full-page images do not count, correct? With these >things in mind users could go through pg_waldump output and figure out >what to set the value to.) > Right, FPW do not matter here. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Amit Kapila
Date:
On Sun, Sep 29, 2019 at 11:24 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Sun, Sep 29, 2019 at 12:39 AM Tomas Vondra
> <tomas.vondra@2ndquadrant.com> wrote:
> >
>
> Yeah, it is better to deal it separately as I am also not entirely
> convinced at this stage about this parameter. I have mentioned the
> same in the previous email as well.
>
> While glancing through the changes, I noticed a small thing:
> +#logical_decoding_work_mem = 64MB # min 1MB, or -1 to use maintenance_work_mem
>
> I guess this need to be updated.
>
On further testing, I found that the patch seems to have problems with toast. Consider below scenario:
Session-1
Create table large_text(t1 text);
INSERT INTO large_text
SELECT (SELECT string_agg('x', ',')
FROM generate_series(1, 1000000)) FROM generate_series(1, 1000);
Session-2
SELECT * FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');
SELECT * FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL); --kaboom
The second statement in Session-2 leads to a crash.
Other than that, I am not sure if the changes related to spilling to disk after logical_decoding_work_mem work for toast tables, as I couldn't hit that code for the toast table case, but I might be missing something. As mentioned previously, I feel there should be some way to test whether this patch works for the cases it claims to work. As of now, I have to check via debugging. Let me know if there is any way I can test this.
I am reluctant to say, but I think this patch still needs some more work (review, test, rework) before we can commit it.
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Tomas Vondra
Date:
On Tue, Oct 01, 2019 at 06:55:52PM +0530, Amit Kapila wrote: >On Sun, Sep 29, 2019 at 11:24 AM Amit Kapila <amit.kapila16@gmail.com> >wrote: >> On Sun, Sep 29, 2019 at 12:39 AM Tomas Vondra >> <tomas.vondra@2ndquadrant.com> wrote: >> > >> >> Yeah, it is better to deal it separately as I am also not entirely >> convinced at this stage about this parameter. I have mentioned the >> same in the previous email as well. >> >> While glancing through the changes, I noticed a small thing: >> +#logical_decoding_work_mem = 64MB # min 1MB, or -1 to use >maintenance_work_mem >> >> I guess this need to be updated. >> > >On further testing, I found that the patch seems to have problems with >toast. Consider below scenario: >Session-1 >Create table large_text(t1 text); >INSERT INTO large_text >SELECT (SELECT string_agg('x', ',') >FROM generate_series(1, 1000000)) FROM generate_series(1, 1000); > >Session-2 >SELECT * FROM pg_create_logical_replication_slot('regression_slot', >'test_decoding'); >SELECT * FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL); >*--kaboom* > >The second statement in Session-2 leads to a crash. > OK, thanks for the report - will investigate. >Other than that, I am not sure if the changes related to spill to disk >after logical_decoding_work_mem works for toast table as I couldn't hit >that code for toast table case, but I might be missing something. As >mentioned previously, I feel there should be some way to test whether this >patch works for the cases it claims to work. As of now, I have to check >via debugging. Let me know if there is any way, I can test this. > That's one of the reasons why I proposed to move the statistics (which say how many transactions / bytes were spilled to disk) from a later patch in the series. I don't think there's a better way. >I am reluctant to say, but I think this patch still needs some more work >(review, test, rework) before we can commit it. > I agreee. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Amit Kapila
Date:
On Tue, Oct 1, 2019 at 7:21 PM Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
On Tue, Oct 01, 2019 at 06:55:52PM +0530, Amit Kapila wrote:
>
>On further testing, I found that the patch seems to have problems with
>toast. Consider below scenario:
>Session-1
>Create table large_text(t1 text);
>INSERT INTO large_text
>SELECT (SELECT string_agg('x', ',')
>FROM generate_series(1, 1000000)) FROM generate_series(1, 1000);
>
>Session-2
>SELECT * FROM pg_create_logical_replication_slot('regression_slot',
>'test_decoding');
>SELECT * FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL);
>*--kaboom*
>
>The second statement in Session-2 leads to a crash.
>
OK, thanks for the report - will investigate.
It was an assertion failure in ReorderBufferCleanupTXN at the line below:
+ /* Check we're not mixing changes from different transactions. */
+ Assert(change->txn == txn);
>Other than that, I am not sure if the changes related to spill to disk
>after logical_decoding_work_mem works for toast table as I couldn't hit
>that code for toast table case, but I might be missing something. As
>mentioned previously, I feel there should be some way to test whether this
>patch works for the cases it claims to work. As of now, I have to check
>via debugging. Let me know if there is any way, I can test this.
>
That's one of the reasons why I proposed to move the statistics (which
say how many transactions / bytes were spilled to disk) from a later
patch in the series. I don't think there's a better way.
I like that idea, but I think you need to split that patch so that it only adds the stats related to spilling. It would be easier to review if you can prepare it on top of 0001-Add-logical_decoding_work_mem-to-limit-ReorderBuffer.
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Tomas Vondra
Date:
On Wed, Oct 02, 2019 at 04:27:30AM +0530, Amit Kapila wrote: >On Tue, Oct 1, 2019 at 7:21 PM Tomas Vondra <tomas.vondra@2ndquadrant.com> >wrote: > >> On Tue, Oct 01, 2019 at 06:55:52PM +0530, Amit Kapila wrote: >> > >> >On further testing, I found that the patch seems to have problems with >> >toast. Consider below scenario: >> >Session-1 >> >Create table large_text(t1 text); >> >INSERT INTO large_text >> >SELECT (SELECT string_agg('x', ',') >> >FROM generate_series(1, 1000000)) FROM generate_series(1, 1000); >> > >> >Session-2 >> >SELECT * FROM pg_create_logical_replication_slot('regression_slot', >> >'test_decoding'); >> >SELECT * FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL); >> >*--kaboom* >> > >> >The second statement in Session-2 leads to a crash. >> > >> >> OK, thanks for the report - will investigate. >> > >It was an assertion failure in ReorderBufferCleanupTXN at below line: >+ /* Check we're not mixing changes from different transactions. */ >+ Assert(change->txn == txn); > Can you still reproduce this issue with the patch I sent on 28/9? I have been unable to trigger the failure, and it seems pretty similar to the failure you reported (and I fixed) on 28/9. >> >Other than that, I am not sure if the changes related to spill to disk >> >after logical_decoding_work_mem works for toast table as I couldn't hit >> >that code for toast table case, but I might be missing something. As >> >mentioned previously, I feel there should be some way to test whether this >> >patch works for the cases it claims to work. As of now, I have to check >> >via debugging. Let me know if there is any way, I can test this. >> > >> >> That's one of the reasons why I proposed to move the statistics (which >> say how many transactions / bytes were spilled to disk) from a later >> patch in the series. I don't think there's a better way. >> >> >I like that idea, but I think you need to split that patch to only get the >stats related to the spill. It would be easier to review if you can >prepare that atop of >0001-Add-logical_decoding_work_mem-to-limit-ReorderBuffer. > Sure, I wasn't really proposing to adding all stats from that patch, including those related to streaming. We need to extract just those related to spilling. And yes, it needs to be moved right after 0001. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Dilip Kumar
Date:
I have attempted to test the performance of (Stream + Spill) vs (Stream + BGW pool) and I can see the similar gain what Alexey had shown[1]. In addition to this, I have rebased the latest patchset [2] without the two-phase logical decoding patch set. Test results: I have repeated the same test as Alexy[1] for 1kk and 1kk data and here is my result Stream + Spill N time on master(sec) Total xact time (sec) 1kk 6 21 3kk 18 55 Stream + BGW pool N time on master(sec) Total xact time (sec) 1kk 6 13 3kk 19 35 Patch details: All the patches are the same as posted on [2] except 1. 0006-Gracefully-handle-concurrent-aborts-of-uncommitted -> I have removed the handling of error which is specific for 2PC 2. 0007-Implement-streaming-mode-in-ReorderBuffer -> Rebased without 2PC 3. 0009-Extend-the-concurrent-abort-handling-for-in-progress -> New patch to handle concurrent abort error for the in-progress transaction and also add handling for the sub transaction's abort. 4. v3-0014-BGWorkers-pool-for-streamed-transactions-apply -> Rebased Alexey's patch [1] https://www.postgresql.org/message-id/8eda5118-2dd0-79a1-4fe9-eec7e334de17%40postgrespro.ru [2] https://www.postgresql.org/message-id/20190928190917.hrpknmq76v3ts3lj%40development On Thu, Oct 3, 2019 at 4:03 AM Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote: > > On Wed, Oct 02, 2019 at 04:27:30AM +0530, Amit Kapila wrote: > >On Tue, Oct 1, 2019 at 7:21 PM Tomas Vondra <tomas.vondra@2ndquadrant.com> > >wrote: > > > >> On Tue, Oct 01, 2019 at 06:55:52PM +0530, Amit Kapila wrote: > >> > > >> >On further testing, I found that the patch seems to have problems with > >> >toast. Consider below scenario: > >> >Session-1 > >> >Create table large_text(t1 text); > >> >INSERT INTO large_text > >> >SELECT (SELECT string_agg('x', ',') > >> >FROM generate_series(1, 1000000)) FROM generate_series(1, 1000); > >> > > >> >Session-2 > >> >SELECT * FROM pg_create_logical_replication_slot('regression_slot', > >> >'test_decoding'); > >> >SELECT * FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL); > >> >*--kaboom* > >> > > >> >The second statement in Session-2 leads to a crash. > >> > > >> > >> OK, thanks for the report - will investigate. > >> > > > >It was an assertion failure in ReorderBufferCleanupTXN at below line: > >+ /* Check we're not mixing changes from different transactions. */ > >+ Assert(change->txn == txn); > > > > Can you still reproduce this issue with the patch I sent on 28/9? I have > been unable to trigger the failure, and it seems pretty similar to the > failure you reported (and I fixed) on 28/9. > > >> >Other than that, I am not sure if the changes related to spill to disk > >> >after logical_decoding_work_mem works for toast table as I couldn't hit > >> >that code for toast table case, but I might be missing something. As > >> >mentioned previously, I feel there should be some way to test whether this > >> >patch works for the cases it claims to work. As of now, I have to check > >> >via debugging. Let me know if there is any way, I can test this. > >> > > >> > >> That's one of the reasons why I proposed to move the statistics (which > >> say how many transactions / bytes were spilled to disk) from a later > >> patch in the series. I don't think there's a better way. > >> > >> > >I like that idea, but I think you need to split that patch to only get the > >stats related to the spill. It would be easier to review if you can > >prepare that atop of > >0001-Add-logical_decoding_work_mem-to-limit-ReorderBuffer. 
> > > > Sure, I wasn't really proposing to adding all stats from that patch, > including those related to streaming. We need to extract just those > related to spilling. And yes, it needs to be moved right after 0001. > > regards > > -- > Tomas Vondra http://www.2ndQuadrant.com > PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services > > -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Attachment
- 0001-Add-logical_decoding_work_mem-to-limit-ReorderBuffer.patch
- 0002-Immediately-WAL-log-assignments.patch
- 0003-Issue-individual-invalidations-with-wal_level-logica.patch
- 0005-Cleaning-up-of-flags-in-ReorderBufferTXN-structure.patch
- 0004-Extend-the-output-plugin-API-with-stream-methods.patch
- 0006-Gracefully-handle-concurrent-aborts-of-uncommitted.patch
- 0007-Implement-streaming-mode-in-ReorderBuffer.patch
- 0008-Add-support-for-streaming-to-built-in-replication.patch
- 0010-Track-statistics-for-streaming-spilling.patch
- 0009-Extend-the-concurrent-abort-handling-for-in-progress.patch
- 0011-Enable-streaming-for-all-subscription-TAP-tests.patch
- 0012-BUGFIX-set-final_lsn-for-subxacts-before-cleanup.patch
- 0013-Add-TAP-test-for-streaming-vs.-DDL.patch
- v3-0014-BGWorkers-pool-for-streamed-transactions-apply.patch
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Amit Kapila
Date:
On Thu, Oct 3, 2019 at 4:03 AM Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
On Wed, Oct 02, 2019 at 04:27:30AM +0530, Amit Kapila wrote:
>On Tue, Oct 1, 2019 at 7:21 PM Tomas Vondra <tomas.vondra@2ndquadrant.com>
>wrote:
>
>> On Tue, Oct 01, 2019 at 06:55:52PM +0530, Amit Kapila wrote:
>> >
>> >On further testing, I found that the patch seems to have problems with
>> >toast. Consider below scenario:
>> >Session-1
>> >Create table large_text(t1 text);
>> >INSERT INTO large_text
>> >SELECT (SELECT string_agg('x', ',')
>> >FROM generate_series(1, 1000000)) FROM generate_series(1, 1000);
>> >
>> >Session-2
>> >SELECT * FROM pg_create_logical_replication_slot('regression_slot',
>> >'test_decoding');
>> >SELECT * FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL);
>> >*--kaboom*
>> >
>> >The second statement in Session-2 leads to a crash.
>> >
>>
>> OK, thanks for the report - will investigate.
>>
>
>It was an assertion failure in ReorderBufferCleanupTXN at below line:
>+ /* Check we're not mixing changes from different transactions. */
>+ Assert(change->txn == txn);
>
Can you still reproduce this issue with the patch I sent on 28/9? I have
been unable to trigger the failure, and it seems pretty similar to the
failure you reported (and I fixed) on 28/9.
Yes, it seems we need a similar change in ReorderBufferAddNewTupleCids. I think in session-2 you need to create the replication slot before creating the table in session-1 to see this problem.
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -2196,6 +2196,7 @@ ReorderBufferAddNewTupleCids(ReorderBuffer *rb, TransactionId xid,
change->data.tuplecid.cmax = cmax;
change->data.tuplecid.combocid = combocid;
change->lsn = lsn;
+ change->txn = txn;
change->action = REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID;
dlist_push_tail(&txn->tuplecids, &change->node);
Few more comments:
-----------------------------------
1.
+static bool
+check_logical_decoding_work_mem(int *newval, void **extra, GucSource source)
+{
+ /*
+ * -1 indicates fallback.
+ *
+ * If we haven't yet changed the boot_val default of -1, just let it be.
+ * logical decoding will look to maintenance_work_mem instead.
+ */
+ if (*newval == -1)
+ return true;
+
+ /*
+ * We clamp manually-set values to at least 64kB. The maintenance_work_mem
+ * uses a higher minimum value (1MB), so this is OK.
+ */
+ if (*newval < 64)
+ *newval = 64;
I think this needs to be changed, as we no longer rely on maintenance_work_mem. Another thing related to this is that the default value for logical_decoding_work_mem still seems to be -1; we need to make it 64MB. I noticed this while debugging the memory accounting changes, and I think it is the reason why I was not seeing toast-related changes being serialized: in that test, I hadn't changed the default value of logical_decoding_work_mem.
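For reference, a rough sketch of how the guc.c entry could look once the maintenance_work_mem fallback and the custom check hook are dropped (illustrative only, not the committed code; the 65536/64 bounds follow the bugs_and_review_comments_fix diff quoted later in this thread, and "or streaming" is dropped per comment 1 above):

{
	{"logical_decoding_work_mem", PGC_USERSET, RESOURCES_MEM,
		gettext_noop("Sets the maximum memory to be used for logical decoding."),
		gettext_noop("This much memory can be used by each internal "
					 "reorder buffer before spilling to disk."),
		GUC_UNIT_KB
	},
	&logical_decoding_work_mem,
	65536, 64, MAX_KILOBYTES,		/* default 64MB, minimum 64kB */
	NULL, NULL, NULL				/* bounds do the clamping, no check hook needed */
},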
2.
+ /*
+ * We're going modify the size of the change, so to make sure the
+ * accounting is correct we'll make it look like we're removing the
+ * change now (with the old size), and then re-add it at the end.
+ */
/going modify/going to modify/
3.
+ *
+ * While updating the existing change with detoasted tuple data, we need to
+ * update the memory accounting info, because the change size will differ.
+ * Otherwise the accounting may get out of sync, triggering serialization
+ * at unexpected times.
+ *
+ * We simply subtract size of the change before rejiggering the tuple, and
+ * then adding the new size. This makes it look like the change was removed
+ * and then added back, except it only tweaks the accounting info.
+ *
+ * In particular it can't trigger serialization, which would be pointless
+ * anyway as it happens during commit processing right before handing
+ * the change to the output plugin.
*/
static void
ReorderBufferToastReplace(ReorderBuffer *rb, ReorderBufferTXN *txn,
@@ -3023,6 +3281,13 @@ ReorderBufferToastReplace(ReorderBuffer *rb, ReorderBufferTXN *txn,
if (txn->toast_hash == NULL)
return;
+ /*
+ * We're going modify the size of the change, so to make sure the
+ * accounting is correct we'll make it look like we're removing the
+ * change now (with the old size), and then re-add it at the end.
+ */
+ ReorderBufferChangeMemoryUpdate(rb, change, false);
It is not very clear why this change is required. Basically, this is done at commit time, after which we shouldn't attempt to spill these changes. This is mentioned in the comments as well, but if that is the case, it is not clear how and when the accounting could create a problem. If possible, can you explain it with an example?
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Dilip Kumar
Date:
On Thu, Oct 3, 2019 at 1:18 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > I have attempted to test the performance of (Stream + Spill) vs > (Stream + BGW pool) and I can see the similar gain what Alexey had > shown[1]. > > In addition to this, I have rebased the latest patchset [2] without > the two-phase logical decoding patch set. > > Test results: > I have repeated the same test as Alexy[1] for 1kk and 1kk data and > here is my result > Stream + Spill > N time on master(sec) Total xact time (sec) > 1kk 6 21 > 3kk 18 55 > > Stream + BGW pool > N time on master(sec) Total xact time (sec) > 1kk 6 13 > 3kk 19 35 > > Patch details: > All the patches are the same as posted on [2] except > 1. 0006-Gracefully-handle-concurrent-aborts-of-uncommitted -> I have > removed the handling of error which is specific for 2PC Here[1], I mentioned that I have removed the 2PC changes from this[0006] patch but mistakenly I attached the original patch itself instead of the modified version. So attaching the modified version of only this patch other patches are the same. > 2. 0007-Implement-streaming-mode-in-ReorderBuffer -> Rebased without 2PC > 3. 0009-Extend-the-concurrent-abort-handling-for-in-progress -> New > patch to handle concurrent abort error for the in-progress transaction > and also add handling for the sub transaction's abort. > 4. v3-0014-BGWorkers-pool-for-streamed-transactions-apply -> Rebased > Alexey's patch [1] https://www.postgresql.org/message-id/CAFiTN-vHoksqvV4BZ0479NhugGe4QHq_ezngNdDd-YRQ_2cwug%40mail.gmail.com -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Attachment
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Dilip Kumar
Date:
On Thu, Oct 3, 2019 at 2:43 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Thu, Oct 3, 2019 at 4:03 AM Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote: >> >> On Wed, Oct 02, 2019 at 04:27:30AM +0530, Amit Kapila wrote: >> >On Tue, Oct 1, 2019 at 7:21 PM Tomas Vondra <tomas.vondra@2ndquadrant.com> >> >wrote: >> > >> >> On Tue, Oct 01, 2019 at 06:55:52PM +0530, Amit Kapila wrote: >> >> > >> >> >On further testing, I found that the patch seems to have problems with >> >> >toast. Consider below scenario: >> >> >Session-1 >> >> >Create table large_text(t1 text); >> >> >INSERT INTO large_text >> >> >SELECT (SELECT string_agg('x', ',') >> >> >FROM generate_series(1, 1000000)) FROM generate_series(1, 1000); >> >> > >> >> >Session-2 >> >> >SELECT * FROM pg_create_logical_replication_slot('regression_slot', >> >> >'test_decoding'); >> >> >SELECT * FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL); >> >> >*--kaboom* >> >> > >> >> >The second statement in Session-2 leads to a crash. >> >> > >> >> >> >> OK, thanks for the report - will investigate. >> >> >> > >> >It was an assertion failure in ReorderBufferCleanupTXN at below line: >> >+ /* Check we're not mixing changes from different transactions. */ >> >+ Assert(change->txn == txn); >> > >> >> Can you still reproduce this issue with the patch I sent on 28/9? I have >> been unable to trigger the failure, and it seems pretty similar to the >> failure you reported (and I fixed) on 28/9. > > > Yes, it seems we need a similar change in ReorderBufferAddNewTupleCids. I think in session-2 you need to create replicationslot before creating table in session-1 to see this problem. > > --- a/src/backend/replication/logical/reorderbuffer.c > +++ b/src/backend/replication/logical/reorderbuffer.c > @@ -2196,6 +2196,7 @@ ReorderBufferAddNewTupleCids(ReorderBuffer *rb, TransactionId xid, > change->data.tuplecid.cmax = cmax; > change->data.tuplecid.combocid = combocid; > change->lsn = lsn; > + change->txn = txn; > change->action = REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID; > dlist_push_tail(&txn->tuplecids, &change->node); > > Few more comments: > ----------------------------------- > 1. > +static bool > +check_logical_decoding_work_mem(int *newval, void **extra, GucSource source) > +{ > + /* > + * -1 indicates fallback. > + * > + * If we haven't yet changed the boot_val default of -1, just let it be. > + * logical decoding will look to maintenance_work_mem instead. > + */ > + if (*newval == -1) > + return true; > + > + /* > + * We clamp manually-set values to at least 64kB. The maintenance_work_mem > + * uses a higher minimum value (1MB), so this is OK. > + */ > + if (*newval < 64) > + *newval = 64; > > I think this needs to be changed as now we don't rely on maintenance_work_mem. Another thing related to this is that Ithink the default value for logical_decoding_work_mem still seems to be -1. We need to make it to 64MB. I have seen thiswhile debugging memory accounting changes. I think this is the reason why I was not seeing toast related changes beingserialized because, in that test, I haven't changed the default value of logical_decoding_work_mem. > > 2. > + /* > + * We're going modify the size of the change, so to make sure the > + * accounting is correct we'll make it look like we're removing the > + * change now (with the old size), and then re-add it at the end. > + */ > > > /going modify/going to modify/ > > 3. 
> + * > + * While updating the existing change with detoasted tuple data, we need to > + * update the memory accounting info, because the change size will differ. > + * Otherwise the accounting may get out of sync, triggering serialization > + * at unexpected times. > + * > + * We simply subtract size of the change before rejiggering the tuple, and > + * then adding the new size. This makes it look like the change was removed > + * and then added back, except it only tweaks the accounting info. > + * > + * In particular it can't trigger serialization, which would be pointless > + * anyway as it happens during commit processing right before handing > + * the change to the output plugin. > */ > static void > ReorderBufferToastReplace(ReorderBuffer *rb, ReorderBufferTXN *txn, > @@ -3023,6 +3281,13 @@ ReorderBufferToastReplace(ReorderBuffer *rb, ReorderBufferTXN *txn, > if (txn->toast_hash == NULL) > return; > > + /* > + * We're going modify the size of the change, so to make sure the > + * accounting is correct we'll make it look like we're removing the > + * change now (with the old size), and then re-add it at the end. > + */ > + ReorderBufferChangeMemoryUpdate(rb, change, false); > > It is not very clear why this change is required. Basically, this is done at commit time after which actually we shouldn'tattempt to spill these changes. This is mentioned in comments as well, but it is not clear if that is the case,then how and when accounting can create a problem. If possible, can you explain it with an example? > IIUC, we are keeping the track of the memory in ReorderBuffer which is common across the transactions. So even if this transaction is committing and will not spill to dis but we need to keep the memory accounting correct for the future changes in other transactions. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Amit Kapila
Date:
On Sun, Oct 13, 2019 at 12:25 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Thu, Oct 3, 2019 at 2:43 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > 3. > > + * > > + * While updating the existing change with detoasted tuple data, we need to > > + * update the memory accounting info, because the change size will differ. > > + * Otherwise the accounting may get out of sync, triggering serialization > > + * at unexpected times. > > + * > > + * We simply subtract size of the change before rejiggering the tuple, and > > + * then adding the new size. This makes it look like the change was removed > > + * and then added back, except it only tweaks the accounting info. > > + * > > + * In particular it can't trigger serialization, which would be pointless > > + * anyway as it happens during commit processing right before handing > > + * the change to the output plugin. > > */ > > static void > > ReorderBufferToastReplace(ReorderBuffer *rb, ReorderBufferTXN *txn, > > @@ -3023,6 +3281,13 @@ ReorderBufferToastReplace(ReorderBuffer *rb, ReorderBufferTXN *txn, > > if (txn->toast_hash == NULL) > > return; > > > > + /* > > + * We're going modify the size of the change, so to make sure the > > + * accounting is correct we'll make it look like we're removing the > > + * change now (with the old size), and then re-add it at the end. > > + */ > > + ReorderBufferChangeMemoryUpdate(rb, change, false); > > > > It is not very clear why this change is required. Basically, this is done at commit time after which actually we shouldn'tattempt to spill these changes. This is mentioned in comments as well, but it is not clear if that is the case,then how and when accounting can create a problem. If possible, can you explain it with an example? > > > IIUC, we are keeping the track of the memory in ReorderBuffer which is > common across the transactions. So even if this transaction is > committing and will not spill to dis but we need to keep the memory > accounting correct for the future changes in other transactions. > You are right. I somehow missed that we need to keep the size computation in sync even during commit for other in-progress transactions in the ReorderBuffer. You can ignore this point or maybe slightly adjust the comment to make it explicit. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
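A rough sketch of the accounting pattern being discussed (illustrative only, not the actual patch hunk; it assumes the symmetric call with true re-adds the size, which is not shown in the quoted diff):

static void
toast_replace_accounting_sketch(ReorderBuffer *rb, ReorderBufferChange *change)
{
	/*
	 * The memory total is tracked for the whole reorder buffer and shared by
	 * all transactions, so it must stay in sync even while this transaction
	 * is committing: other in-progress transactions may still be picked for
	 * spilling based on that total.
	 */

	/* pretend the change is removed, using its old (pre-detoast) size */
	ReorderBufferChangeMemoryUpdate(rb, change, false);

	/* ... swap in the detoasted tuple data here ... */

	/* re-add it with the new size; only the accounting info is adjusted */
	ReorderBufferChangeMemoryUpdate(rb, change, true);
}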
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Craig Ringer
Date:
On Sun, 13 Oct 2019 at 19:50, Amit Kapila <amit.kapila16@gmail.com> wrote:
On Sun, Oct 13, 2019 at 12:25 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Thu, Oct 3, 2019 at 2:43 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > 3.
> > + *
> > + * While updating the existing change with detoasted tuple data, we need to
> > + * update the memory accounting info, because the change size will differ.
> > + * Otherwise the accounting may get out of sync, triggering serialization
> > + * at unexpected times.
> > + *
> > + * We simply subtract size of the change before rejiggering the tuple, and
> > + * then adding the new size. This makes it look like the change was removed
> > + * and then added back, except it only tweaks the accounting info.
> > + *
> > + * In particular it can't trigger serialization, which would be pointless
> > + * anyway as it happens during commit processing right before handing
> > + * the change to the output plugin.
> > */
> > static void
> > ReorderBufferToastReplace(ReorderBuffer *rb, ReorderBufferTXN *txn,
> > @@ -3023,6 +3281,13 @@ ReorderBufferToastReplace(ReorderBuffer *rb, ReorderBufferTXN *txn,
> > if (txn->toast_hash == NULL)
> > return;
> >
> > + /*
> > + * We're going modify the size of the change, so to make sure the
> > + * accounting is correct we'll make it look like we're removing the
> > + * change now (with the old size), and then re-add it at the end.
> > + */
> > + ReorderBufferChangeMemoryUpdate(rb, change, false);
> >
> > It is not very clear why this change is required. Basically, this is done at commit time after which actually we shouldn't attempt to spill these changes. This is mentioned in comments as well, but it is not clear if that is the case, then how and when accounting can create a problem. If possible, can you explain it with an example?
> >
> IIUC, we are keeping the track of the memory in ReorderBuffer which is
> common across the transactions. So even if this transaction is
> committing and will not spill to dis but we need to keep the memory
> accounting correct for the future changes in other transactions.
>
You are right. I somehow missed that we need to keep the size
computation in sync even during commit for other in-progress
transactions in the ReorderBuffer. You can ignore this point or maybe
slightly adjust the comment to make it explicit.
Does anyone object if we add the reorder buffer total size & in-memory size to struct WalSnd too, so we can report it in pg_stat_replication?
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Amit Kapila
Date:
On Mon, Oct 14, 2019 at 6:51 AM Craig Ringer <craig@2ndquadrant.com> wrote: > > On Sun, 13 Oct 2019 at 19:50, Amit Kapila <amit.kapila16@gmail.com> wrote: >> > > > Does anyone object if we add the reorder buffer total size & in-memory size to struct WalSnd too, so we can report it inpg_stat_replication? > There is already a patch (0011-Track-statistics-for-streaming-spilling) in this series posted by Tomas[1] which tracks important statistics in WalSnd which I think are good enough. Have you checked that? I am not sure if adding additional size will help, but I might be missing something. > I can follow up with a patch to add on top of this one if you think it's reasonable. I'll also take the opportunity toadd a number of tracepoints across the walsender and logical decoding, since right now it's very opaque in production systems... and everyone just LOVES hunting down debug syms and attaching gdb to production DBs. > Sure, adding tracepoints can be helpful, but isn't it better to start that as a separate thread? [1] - https://www.postgresql.org/message-id/20190928190917.hrpknmq76v3ts3lj%40development -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Dilip Kumar
Date:
On Thu, Oct 3, 2019 at 4:03 AM Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote: > > On Wed, Oct 02, 2019 at 04:27:30AM +0530, Amit Kapila wrote: > >On Tue, Oct 1, 2019 at 7:21 PM Tomas Vondra <tomas.vondra@2ndquadrant.com> > >wrote: > > > >> On Tue, Oct 01, 2019 at 06:55:52PM +0530, Amit Kapila wrote: > >> > > >> >On further testing, I found that the patch seems to have problems with > >> >toast. Consider below scenario: > >> >Session-1 > >> >Create table large_text(t1 text); > >> >INSERT INTO large_text > >> >SELECT (SELECT string_agg('x', ',') > >> >FROM generate_series(1, 1000000)) FROM generate_series(1, 1000); > >> > > >> >Session-2 > >> >SELECT * FROM pg_create_logical_replication_slot('regression_slot', > >> >'test_decoding'); > >> >SELECT * FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL); > >> >*--kaboom* > >> > > >> >The second statement in Session-2 leads to a crash. > >> > > >> > >> OK, thanks for the report - will investigate. > >> > > > >It was an assertion failure in ReorderBufferCleanupTXN at below line: > >+ /* Check we're not mixing changes from different transactions. */ > >+ Assert(change->txn == txn); > > > > Can you still reproduce this issue with the patch I sent on 28/9? I have > been unable to trigger the failure, and it seems pretty similar to the > failure you reported (and I fixed) on 28/9. > > >> >Other than that, I am not sure if the changes related to spill to disk > >> >after logical_decoding_work_mem works for toast table as I couldn't hit > >> >that code for toast table case, but I might be missing something. As > >> >mentioned previously, I feel there should be some way to test whether this > >> >patch works for the cases it claims to work. As of now, I have to check > >> >via debugging. Let me know if there is any way, I can test this. > >> > > >> > >> That's one of the reasons why I proposed to move the statistics (which > >> say how many transactions / bytes were spilled to disk) from a later > >> patch in the series. I don't think there's a better way. > >> > >> > >I like that idea, but I think you need to split that patch to only get the > >stats related to the spill. It would be easier to review if you can > >prepare that atop of > >0001-Add-logical_decoding_work_mem-to-limit-ReorderBuffer. > > > > Sure, I wasn't really proposing to adding all stats from that patch, > including those related to streaming. We need to extract just those > related to spilling. And yes, it needs to be moved right after 0001. > I have extracted the spilling related code to a separate patch on top of 0001. I have also fixed some bugs and review comments and attached as a separate patch. Later I can merge it to the main patch if you agree with the changes. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Attachment
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Amit Kapila
Date:
On Mon, Oct 14, 2019 at 3:09 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Thu, Oct 3, 2019 at 4:03 AM Tomas Vondra > <tomas.vondra@2ndquadrant.com> wrote: > > > > > > Sure, I wasn't really proposing to adding all stats from that patch, > > including those related to streaming. We need to extract just those > > related to spilling. And yes, it needs to be moved right after 0001. > > > I have extracted the spilling related code to a separate patch on top > of 0001. I have also fixed some bugs and review comments and attached > as a separate patch. Later I can merge it to the main patch if you > agree with the changes. > Few comments ------------------------- 0001-Add-logical_decoding_work_mem-to-limit-ReorderBuffer 1. + { + {"logical_decoding_work_mem", PGC_USERSET, RESOURCES_MEM, + gettext_noop("Sets the maximum memory to be used for logical decoding."), + gettext_noop("This much memory can be used by each internal " + "reorder buffer before spilling to disk or streaming."), + GUC_UNIT_KB + }, I think we can remove 'or streaming' from above sentence for now. We can add it later with later patch where streaming will be allowed. 2. @@ -206,6 +206,18 @@ CREATE SUBSCRIPTION <replaceable class="parameter">subscription_name</replaceabl </para> </listitem> </varlistentry> + + <varlistentry> + <term><literal>work_mem</literal> (<type>integer</type>)</term> + <listitem> + <para> + Limits the amount of memory used to decode changes on the + publisher. If not specified, the publisher will use the default + specified by <varname>logical_decoding_work_mem</varname>. When + needed, additional data are spilled to disk. + </para> + </listitem> + </varlistentry> It is not clear why we need this parameter at least with this patch? I have raised this multiple times [1][2]. bugs_and_review_comments_fix 1. }, &logical_decoding_work_mem, - -1, -1, MAX_KILOBYTES, - check_logical_decoding_work_mem, NULL, NULL + 65536, 64, MAX_KILOBYTES, + NULL, NULL, NULL I think the default value should be 1MB similar to maintenance_work_mem. The same was true before this change. 2. -#logical_decoding_work_mem = 64MB # min 1MB, or -1 to use maintenance_work_mem +i#logical_decoding_work_mem = 64MB # min 64kB It seems the 'i' is a leftover character in the above change. Also, change the default value considering the previous point. 3. @@ -2479,7 +2480,7 @@ ReorderBufferSerializeTXN(ReorderBuffer *rb, ReorderBufferTXN *txn) /* update the statistics */ rb->spillCount += 1; - rb->spillTxns += txn->serialized ? 1 : 0; + rb->spillTxns += txn->serialized ? 0 : 1; rb->spillBytes += size; Why is this change required? Shouldn't we increase the spillTxns count only when the txn is serialized? 0002-Track-statistics-for-spilling 1. + <row> + <entry><structfield>spill_txns</structfield></entry> + <entry><type>integer</type></entry> + <entry>Number of transactions spilled to disk after the memory used by + logical decoding exceeds <literal>logical_work_mem</literal>. The + counter gets incremented both for toplevel transactions and + subtransactions. + </entry> + </row> The parameter name is wrong here. /logical_work_mem/logical_decoding_work_mem 2. + <row> + <entry><structfield>spill_txns</structfield></entry> + <entry><type>integer</type></entry> + <entry>Number of transactions spilled to disk after the memory used by + logical decoding exceeds <literal>logical_work_mem</literal>. The + counter gets incremented both for toplevel transactions and + subtransactions. 
+ </entry> + </row> + <row> + <entry><structfield>spill_count</structfield></entry> + <entry><type>integer</type></entry> + <entry>Number of times transactions were spilled to disk. Transactions + may get spilled repeatedly, and this counter gets incremented on every + such invocation. + </entry> + </row> + <row> + <entry><structfield>spill_bytes</structfield></entry> + <entry><type>integer</type></entry> + <entry>Amount of decoded transaction data spilled to disk. + </entry> + </row> In all the above cases, the explanation text starts immediately after <entry> tag, but the general coding practice is to start from the next line, see the explanation of nearby parameters. It seems these parameters are added in pg-stat-wal-receiver-view in the docs, but in code, it is present as part of pg_stat_replication. It seems doc needs to be updated. Am, I missing something? 3. ReorderBufferSerializeTXN() { .. /* update the statistics */ rb->spillCount += 1; rb->spillTxns += txn->serialized ? 0 : 1; rb->spillBytes += size; Assert(spilled == txn->nentries_mem); Assert(dlist_is_empty(&txn->changes)); txn->nentries_mem = 0; txn->serialized = true; .. } I am not able to understand the above code. We are setting the serialized parameter a few lines after we check it and increment the spillTxns count. Can you please explain it? Also, isn't spillTxns count bit confusing, because in some cases it will include subtransactions and other cases (where the largest picked transaction is a subtransaction) it won't include it? -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Dilip Kumar
Date:
On Fri, Oct 18, 2019 at 5:32 PM Amit Kapila <amit.kapila16@gmail.com> wrote: I have replied to some of your questions inline. I will work on the remaining comments and post the patch for the same. > > > > > > Sure, I wasn't really proposing to adding all stats from that patch, > > > including those related to streaming. We need to extract just those > > > related to spilling. And yes, it needs to be moved right after 0001. > > > > > I have extracted the spilling related code to a separate patch on top > > of 0001. I have also fixed some bugs and review comments and attached > > as a separate patch. Later I can merge it to the main patch if you > > agree with the changes. > > > > Few comments > ------------------------- > 0001-Add-logical_decoding_work_mem-to-limit-ReorderBuffer > 1. > + { > + {"logical_decoding_work_mem", PGC_USERSET, RESOURCES_MEM, > + gettext_noop("Sets the maximum memory to be used for logical decoding."), > + gettext_noop("This much memory can be used by each internal " > + "reorder buffer before spilling to disk or streaming."), > + GUC_UNIT_KB > + }, > > I think we can remove 'or streaming' from above sentence for now. We > can add it later with later patch where streaming will be allowed. > > 2. > @@ -206,6 +206,18 @@ CREATE SUBSCRIPTION <replaceable > class="parameter">subscription_name</replaceabl > </para> > </listitem> > </varlistentry> > + > + <varlistentry> > + <term><literal>work_mem</literal> (<type>integer</type>)</term> > + <listitem> > + <para> > + Limits the amount of memory used to decode changes on the > + publisher. If not specified, the publisher will use the default > + specified by <varname>logical_decoding_work_mem</varname>. When > + needed, additional data are spilled to disk. > + </para> > + </listitem> > + </varlistentry> > > It is not clear why we need this parameter at least with this patch? > I have raised this multiple times [1][2]. > > bugs_and_review_comments_fix > 1. > }, > &logical_decoding_work_mem, > - -1, -1, MAX_KILOBYTES, > - check_logical_decoding_work_mem, NULL, NULL > + 65536, 64, MAX_KILOBYTES, > + NULL, NULL, NULL > > I think the default value should be 1MB similar to > maintenance_work_mem. The same was true before this change. > > 2. -#logical_decoding_work_mem = 64MB # min 1MB, or -1 to use > maintenance_work_mem > +i#logical_decoding_work_mem = 64MB # min 64kB > > It seems the 'i' is a leftover character in the above change. Also, > change the default value considering the previous point. > > 3. > @@ -2479,7 +2480,7 @@ ReorderBufferSerializeTXN(ReorderBuffer *rb, > ReorderBufferTXN *txn) > > /* update the statistics */ > rb->spillCount += 1; > - rb->spillTxns += txn->serialized ? 1 : 0; > + rb->spillTxns += txn->serialized ? 0 : 1; > rb->spillBytes += size; > > Why is this change required? Shouldn't we increase the spillTxns > count only when the txn is serialized? Prior to this change it was increasing the rb->spillTxns, every time we try to serialize the changes of the transaction. Now, only we increase first time when it is not yet serialized. > 0002-Track-statistics-for-spilling > 1. > + <row> > + <entry><structfield>spill_txns</structfield></entry> > + <entry><type>integer</type></entry> > + <entry>Number of transactions spilled to disk after the memory used by > + logical decoding exceeds <literal>logical_work_mem</literal>. The > + counter gets incremented both for toplevel transactions and > + subtransactions. > + </entry> > + </row> > > The parameter name is wrong here. 
/logical_work_mem/logical_decoding_work_mem > > 2. > + <row> > + <entry><structfield>spill_txns</structfield></entry> > + <entry><type>integer</type></entry> > + <entry>Number of transactions spilled to disk after the memory used by > + logical decoding exceeds <literal>logical_work_mem</literal>. The > + counter gets incremented both for toplevel transactions and > + subtransactions. > + </entry> > + </row> > + <row> > + <entry><structfield>spill_count</structfield></entry> > + <entry><type>integer</type></entry> > + <entry>Number of times transactions were spilled to disk. Transactions > + may get spilled repeatedly, and this counter gets incremented on every > + such invocation. > + </entry> > + </row> > + <row> > + <entry><structfield>spill_bytes</structfield></entry> > + <entry><type>integer</type></entry> > + <entry>Amount of decoded transaction data spilled to disk. > + </entry> > + </row> > > In all the above cases, the explanation text starts immediately after > <entry> tag, but the general coding practice is to start from the next > line, see the explanation of nearby parameters. > > It seems these parameters are added in pg-stat-wal-receiver-view in > the docs, but in code, it is present as part of pg_stat_replication. > It seems doc needs to be updated. Am, I missing something? > > 3. > ReorderBufferSerializeTXN() > { > .. > /* update the statistics */ > rb->spillCount += 1; > rb->spillTxns += txn->serialized ? 0 : 1; > rb->spillBytes += size; > > Assert(spilled == txn->nentries_mem); > Assert(dlist_is_empty(&txn->changes)); > txn->nentries_mem = 0; > txn->serialized = true; > .. > } > > I am not able to understand the above code. We are setting the > serialized parameter a few lines after we check it and increment the > spillTxns count. Can you please explain it? Basically, when the first time we attempt to serialize a transaction, txn->serialized will be false, that time we will increment the rb->spillTxns and after that set txn->serialized to true. From next time onwards if we try to serialize the same transaction we will not increment the rb->spillTxns so that we count each transaction only once. > > Also, isn't spillTxns count bit confusing, because in some cases it > will include subtransactions and other cases (where the largest picked > transaction is a subtransaction) it won't include it? I did not understand your comment completely. Basically, every transaction which we are serializing we will increase the count first time right? whether it is the main transaction or the sub-transaction. Am I missing something? -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Amit Kapila
Date:
On Mon, Oct 21, 2019 at 10:48 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Fri, Oct 18, 2019 at 5:32 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > 3. > > @@ -2479,7 +2480,7 @@ ReorderBufferSerializeTXN(ReorderBuffer *rb, > > ReorderBufferTXN *txn) > > > > /* update the statistics */ > > rb->spillCount += 1; > > - rb->spillTxns += txn->serialized ? 1 : 0; > > + rb->spillTxns += txn->serialized ? 0 : 1; > > rb->spillBytes += size; > > > > Why is this change required? Shouldn't we increase the spillTxns > > count only when the txn is serialized? > > Prior to this change it was increasing the rb->spillTxns, every time > we try to serialize the changes of the transaction. Now, only we > increase first time when it is not yet serialized. > > > > > 3. > > ReorderBufferSerializeTXN() > > { > > .. > > /* update the statistics */ > > rb->spillCount += 1; > > rb->spillTxns += txn->serialized ? 0 : 1; > > rb->spillBytes += size; > > > > Assert(spilled == txn->nentries_mem); > > Assert(dlist_is_empty(&txn->changes)); > > txn->nentries_mem = 0; > > txn->serialized = true; > > .. > > } > > > > I am not able to understand the above code. We are setting the > > serialized parameter a few lines after we check it and increment the > > spillTxns count. Can you please explain it? > > Basically, when the first time we attempt to serialize a transaction, > txn->serialized will be false, that time we will increment the > rb->spillTxns and after that set txn->serialized to true. From next > time onwards if we try to serialize the same transaction we will not > increment the rb->spillTxns so that we count each transaction only > once. > Your explanation for both the above comments makes sense to me. Can you please add some comments along these lines because it is not apparent why one wants to increase the spillTxns counter when txn->serialized is false? > > > > Also, isn't spillTxns count bit confusing, because in some cases it > > will include subtransactions and other cases (where the largest picked > > transaction is a subtransaction) it won't include it? > > I did not understand your comment completely. Basically, every > transaction which we are serializing we will increase the count first > time right? whether it is the main transaction or the sub-transaction. > It was not clear to me earlier whether we always increase the spillTxns counter for subtransactions or not. But now, looking at code carefully, it is clear that is it is getting increased in every case. In short, you don't need to do anything for this comment. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
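For reference, a rough sketch of how the requested comment could read around the statistics update in ReorderBufferSerializeTXN (illustrative wording only, not the actual patch text):

/* update the statistics */
rb->spillCount += 1;

/*
 * Count the transaction towards spillTxns only the first time it is
 * serialized: txn->serialized is still false on that first pass and is set
 * to true below, so repeated serializations of the same (sub)transaction
 * are counted just once.
 */
rb->spillTxns += txn->serialized ? 0 : 1;
rb->spillBytes += size;

Assert(spilled == txn->nentries_mem);
Assert(dlist_is_empty(&txn->changes));
txn->nentries_mem = 0;
txn->serialized = true;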
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Dilip Kumar
Date:
On Mon, Oct 21, 2019 at 2:50 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Mon, Oct 21, 2019 at 10:48 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > On Fri, Oct 18, 2019 at 5:32 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > 3. > > > @@ -2479,7 +2480,7 @@ ReorderBufferSerializeTXN(ReorderBuffer *rb, > > > ReorderBufferTXN *txn) > > > > > > /* update the statistics */ > > > rb->spillCount += 1; > > > - rb->spillTxns += txn->serialized ? 1 : 0; > > > + rb->spillTxns += txn->serialized ? 0 : 1; > > > rb->spillBytes += size; > > > > > > Why is this change required? Shouldn't we increase the spillTxns > > > count only when the txn is serialized? > > > > Prior to this change it was increasing the rb->spillTxns, every time > > we try to serialize the changes of the transaction. Now, only we > > increase first time when it is not yet serialized. > > > > > > > > 3. > > > ReorderBufferSerializeTXN() > > > { > > > .. > > > /* update the statistics */ > > > rb->spillCount += 1; > > > rb->spillTxns += txn->serialized ? 0 : 1; > > > rb->spillBytes += size; > > > > > > Assert(spilled == txn->nentries_mem); > > > Assert(dlist_is_empty(&txn->changes)); > > > txn->nentries_mem = 0; > > > txn->serialized = true; > > > .. > > > } > > > > > > I am not able to understand the above code. We are setting the > > > serialized parameter a few lines after we check it and increment the > > > spillTxns count. Can you please explain it? > > > > Basically, when the first time we attempt to serialize a transaction, > > txn->serialized will be false, that time we will increment the > > rb->spillTxns and after that set txn->serialized to true. From next > > time onwards if we try to serialize the same transaction we will not > > increment the rb->spillTxns so that we count each transaction only > > once. > > > > Your explanation for both the above comments makes sense to me. Can > you please add some comments along these lines because it is not > apparent why one wants to increase the spillTxns counter when > txn->serialized is false? Ok, I will add comments in the next patch. > > > > > > > Also, isn't spillTxns count bit confusing, because in some cases it > > > will include subtransactions and other cases (where the largest picked > > > transaction is a subtransaction) it won't include it? > > > > I did not understand your comment completely. Basically, every > > transaction which we are serializing we will increase the count first > > time right? whether it is the main transaction or the sub-transaction. > > > > It was not clear to me earlier whether we always increase the > spillTxns counter for subtransactions or not. But now, looking at > code carefully, it is clear that is it is getting increased in every > case. In short, you don't need to do anything for this comment. ok -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Dilip Kumar
Date:
On Fri, Oct 18, 2019 at 5:32 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Mon, Oct 14, 2019 at 3:09 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > On Thu, Oct 3, 2019 at 4:03 AM Tomas Vondra > > <tomas.vondra@2ndquadrant.com> wrote: > > > > > > > > > Sure, I wasn't really proposing to adding all stats from that patch, > > > including those related to streaming. We need to extract just those > > > related to spilling. And yes, it needs to be moved right after 0001. > > > > > I have extracted the spilling related code to a separate patch on top > > of 0001. I have also fixed some bugs and review comments and attached > > as a separate patch. Later I can merge it to the main patch if you > > agree with the changes. > > > > Few comments > ------------------------- > 0001-Add-logical_decoding_work_mem-to-limit-ReorderBuffer > 1. > + { > + {"logical_decoding_work_mem", PGC_USERSET, RESOURCES_MEM, > + gettext_noop("Sets the maximum memory to be used for logical decoding."), > + gettext_noop("This much memory can be used by each internal " > + "reorder buffer before spilling to disk or streaming."), > + GUC_UNIT_KB > + }, > > I think we can remove 'or streaming' from above sentence for now. We > can add it later with later patch where streaming will be allowed. Done > > 2. > @@ -206,6 +206,18 @@ CREATE SUBSCRIPTION <replaceable > class="parameter">subscription_name</replaceabl > </para> > </listitem> > </varlistentry> > + > + <varlistentry> > + <term><literal>work_mem</literal> (<type>integer</type>)</term> > + <listitem> > + <para> > + Limits the amount of memory used to decode changes on the > + publisher. If not specified, the publisher will use the default > + specified by <varname>logical_decoding_work_mem</varname>. When > + needed, additional data are spilled to disk. > + </para> > + </listitem> > + </varlistentry> > > It is not clear why we need this parameter at least with this patch? > I have raised this multiple times [1][2]. I have moved it out as a separate patch (0003) so that if we need that we need this for the streaming transaction then we can keep this. > > bugs_and_review_comments_fix > 1. > }, > &logical_decoding_work_mem, > - -1, -1, MAX_KILOBYTES, > - check_logical_decoding_work_mem, NULL, NULL > + 65536, 64, MAX_KILOBYTES, > + NULL, NULL, NULL > > I think the default value should be 1MB similar to > maintenance_work_mem. The same was true before this change. default value for maintenance_work_mem is also 64MB. Did you mean min value? > > 2. -#logical_decoding_work_mem = 64MB # min 1MB, or -1 to use > maintenance_work_mem > +i#logical_decoding_work_mem = 64MB # min 64kB > > It seems the 'i' is a leftover character in the above change. Also, > change the default value considering the previous point. oops, fixed. > > 3. > @@ -2479,7 +2480,7 @@ ReorderBufferSerializeTXN(ReorderBuffer *rb, > ReorderBufferTXN *txn) > > /* update the statistics */ > rb->spillCount += 1; > - rb->spillTxns += txn->serialized ? 1 : 0; > + rb->spillTxns += txn->serialized ? 0 : 1; > rb->spillBytes += size; > > Why is this change required? Shouldn't we increase the spillTxns > count only when the txn is serialized? Already agreed in previous mail so added comments > > 0002-Track-statistics-for-spilling > 1. > + <row> > + <entry><structfield>spill_txns</structfield></entry> > + <entry><type>integer</type></entry> > + <entry>Number of transactions spilled to disk after the memory used by > + logical decoding exceeds <literal>logical_work_mem</literal>. 
The > + counter gets incremented both for toplevel transactions and > + subtransactions. > + </entry> > + </row> > > The parameter name is wrong here. /logical_work_mem/logical_decoding_work_mem done > > 2. > + <row> > + <entry><structfield>spill_txns</structfield></entry> > + <entry><type>integer</type></entry> > + <entry>Number of transactions spilled to disk after the memory used by > + logical decoding exceeds <literal>logical_work_mem</literal>. The > + counter gets incremented both for toplevel transactions and > + subtransactions. > + </entry> > + </row> > + <row> > + <entry><structfield>spill_count</structfield></entry> > + <entry><type>integer</type></entry> > + <entry>Number of times transactions were spilled to disk. Transactions > + may get spilled repeatedly, and this counter gets incremented on every > + such invocation. > + </entry> > + </row> > + <row> > + <entry><structfield>spill_bytes</structfield></entry> > + <entry><type>integer</type></entry> > + <entry>Amount of decoded transaction data spilled to disk. > + </entry> > + </row> > > In all the above cases, the explanation text starts immediately after > <entry> tag, but the general coding practice is to start from the next > line, see the explanation of nearby parameters. It seems it's mixed, for example, you can see <entry>Timeline number of last write-ahead log location received and flushed to disk, the initial value of this field being the timeline number of the first log location used when WAL receiver is started </entry> or <entry>Timeline number of last write-ahead log location received and flushed to disk, the initial value of this field being the timeline number of the first log location used when WAL receiver is started </entry> > > It seems these parameters are added in pg-stat-wal-receiver-view in > the docs, but in code, it is present as part of pg_stat_replication. > It seems doc needs to be updated. Am, I missing something? Fixed > > 3. > ReorderBufferSerializeTXN() > { > .. > /* update the statistics */ > rb->spillCount += 1; > rb->spillTxns += txn->serialized ? 0 : 1; > rb->spillBytes += size; > > Assert(spilled == txn->nentries_mem); > Assert(dlist_is_empty(&txn->changes)); > txn->nentries_mem = 0; > txn->serialized = true; > .. > } > > I am not able to understand the above code. We are setting the > serialized parameter a few lines after we check it and increment the > spillTxns count. Can you please explain it? > > Also, isn't spillTxns count bit confusing, because in some cases it > will include subtransactions and other cases (where the largest picked > transaction is a subtransaction) it won't include it? > Already discussed in the last mail. I have merged bugs_and_review_comments_fix.patch changes to 0001 and 0002. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Attachment
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Amit Kapila
Date:
On Thu, Oct 3, 2019 at 1:18 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > I have attempted to test the performance of (Stream + Spill) vs > (Stream + BGW pool) and I can see the similar gain what Alexey had > shown[1]. > > In addition to this, I have rebased the latest patchset [2] without > the two-phase logical decoding patch set. > > Test results: > I have repeated the same test as Alexy[1] for 1kk and 1kk data and > here is my result > Stream + Spill > N time on master(sec) Total xact time (sec) > 1kk 6 21 > 3kk 18 55 > > Stream + BGW pool > N time on master(sec) Total xact time (sec) > 1kk 6 13 > 3kk 19 35 > I think the test results for the master are missing. Also, how about running these tests over a network (means master and subscriber are not on the same machine)? In general, yours and Alexy's test results show that there is merit by having workers applying such transactions. OTOH, as noted above [1], we are also worried about the performance of Rollbacks if we follow that approach. I am not sure how much we need to worry about Rollabcks if commits are faster, but can we think of recording the changes in memory and only write to a file if the changes are above a certain threshold? I think that might help saving I/O in many cases. I am not very sure if we do that how much additional workers can help, but they might still help. I think we need to do some tests and experiments to figure out what is the best approach? What do you think? Tomas, Alexey, do you have any thoughts on this matter? I think it is important that we figure out the way to proceed in this patch. [1] - https://www.postgresql.org/message-id/b25ce80e-f536-78c8-d5c8-a5df3e230785%40postgrespro.ru -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
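To make the "write to a file only above a threshold" idea concrete, here is a small self-contained sketch in plain C (not PostgreSQL code and not part of any posted patch; all names are made up). The apply side buffers streamed changes for an in-progress transaction in memory and starts writing a temporary file only once the buffered size crosses a threshold, so small streamed transactions that commit or abort quickly never touch disk:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

typedef struct StreamBuffer
{
	char   *mem;		/* in-memory buffer of streamed changes */
	size_t	len;		/* bytes currently buffered in memory */
	size_t	cap;		/* allocated size of mem */
	size_t	threshold;	/* start spilling to a file beyond this many bytes */
	FILE   *spill;		/* NULL until the threshold is crossed */
} StreamBuffer;

static void
stream_buffer_append(StreamBuffer *buf, const char *data, size_t n)
{
	if (buf->spill == NULL && buf->len + n <= buf->threshold)
	{
		/* still below the threshold: keep everything in memory */
		if (buf->len + n > buf->cap)
		{
			buf->cap = (buf->len + n) * 2;
			buf->mem = realloc(buf->mem, buf->cap);
		}
		memcpy(buf->mem + buf->len, data, n);
		buf->len += n;
		return;
	}

	if (buf->spill == NULL)
	{
		/* first time over the threshold: flush the memory buffer to a file */
		buf->spill = tmpfile();
		fwrite(buf->mem, 1, buf->len, buf->spill);
	}

	/* from now on, changes for this transaction go straight to the file */
	fwrite(data, 1, n, buf->spill);
}

On abort, a transaction that never crossed the threshold can simply free its memory buffer; only the larger ones pay the file cleanup cost.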
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Dilip Kumar
Date:
On Tue, Oct 22, 2019 at 10:46 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Thu, Oct 3, 2019 at 1:18 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > I have attempted to test the performance of (Stream + Spill) vs > > (Stream + BGW pool) and I can see the similar gain what Alexey had > > shown[1]. > > > > In addition to this, I have rebased the latest patchset [2] without > > the two-phase logical decoding patch set. > > > > Test results: > > I have repeated the same test as Alexy[1] for 1kk and 1kk data and > > here is my result > > Stream + Spill > > N time on master(sec) Total xact time (sec) > > 1kk 6 21 > > 3kk 18 55 > > > > Stream + BGW pool > > N time on master(sec) Total xact time (sec) > > 1kk 6 13 > > 3kk 19 35 > > > > I think the test results for the master are missing. Yeah, That time, I was planning to compare spill vs bgworker. Also, how about > running these tests over a network (means master and subscriber are > not on the same machine)? Yeah, we should do that that will show the merit of streaming the in-progress transactions. In general, yours and Alexy's test results > show that there is merit by having workers applying such transactions. > OTOH, as noted above [1], we are also worried about the performance > of Rollbacks if we follow that approach. I am not sure how much we > need to worry about Rollabcks if commits are faster, but can we think > of recording the changes in memory and only write to a file if the > changes are above a certain threshold? I think that might help saving > I/O in many cases. I am not very sure if we do that how much > additional workers can help, but they might still help. I think we > need to do some tests and experiments to figure out what is the best > approach? What do you think? I agree with the point. I think we might need to do some small changes and test to see what could be the best method to handle the streamed changes at the subscriber end. > > Tomas, Alexey, do you have any thoughts on this matter? I think it is > important that we figure out the way to proceed in this patch. > > [1] - https://www.postgresql.org/message-id/b25ce80e-f536-78c8-d5c8-a5df3e230785%40postgrespro.ru > -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Tomas Vondra
Date:
On Tue, Oct 22, 2019 at 10:30:16AM +0530, Dilip Kumar wrote: >On Fri, Oct 18, 2019 at 5:32 PM Amit Kapila <amit.kapila16@gmail.com> wrote: >> >> On Mon, Oct 14, 2019 at 3:09 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: >> > >> > On Thu, Oct 3, 2019 at 4:03 AM Tomas Vondra >> > <tomas.vondra@2ndquadrant.com> wrote: >> > > >> > > >> > > Sure, I wasn't really proposing to adding all stats from that patch, >> > > including those related to streaming. We need to extract just those >> > > related to spilling. And yes, it needs to be moved right after 0001. >> > > >> > I have extracted the spilling related code to a separate patch on top >> > of 0001. I have also fixed some bugs and review comments and attached >> > as a separate patch. Later I can merge it to the main patch if you >> > agree with the changes. >> > >> >> Few comments >> ------------------------- >> 0001-Add-logical_decoding_work_mem-to-limit-ReorderBuffer >> 1. >> + { >> + {"logical_decoding_work_mem", PGC_USERSET, RESOURCES_MEM, >> + gettext_noop("Sets the maximum memory to be used for logical decoding."), >> + gettext_noop("This much memory can be used by each internal " >> + "reorder buffer before spilling to disk or streaming."), >> + GUC_UNIT_KB >> + }, >> >> I think we can remove 'or streaming' from above sentence for now. We >> can add it later with later patch where streaming will be allowed. >Done >> >> 2. >> @@ -206,6 +206,18 @@ CREATE SUBSCRIPTION <replaceable >> class="parameter">subscription_name</replaceabl >> </para> >> </listitem> >> </varlistentry> >> + >> + <varlistentry> >> + <term><literal>work_mem</literal> (<type>integer</type>)</term> >> + <listitem> >> + <para> >> + Limits the amount of memory used to decode changes on the >> + publisher. If not specified, the publisher will use the default >> + specified by <varname>logical_decoding_work_mem</varname>. When >> + needed, additional data are spilled to disk. >> + </para> >> + </listitem> >> + </varlistentry> >> >> It is not clear why we need this parameter at least with this patch? >> I have raised this multiple times [1][2]. > >I have moved it out as a separate patch (0003) so that if we need that >we need this for the streaming transaction then we can keep this. >> I'm OK with moving it to a separate patch. That being said I think ability to control memory usage for individual subscriptions is very useful. Saying "We don't need such parameter" is essentially equivalent to saying "One size fits all" and I think we know that's not true. Imagine a system with multiple subscriptions, some of them mostly replicating OLTP changes, but one or two replicating tables that are updated in batches. What we'd have is to allow higher limit for the batch subscriptions, but much lower limit for the OLTP ones (which they should never hit in practice). With a single global GUC, you'll either have a high value - risking OOM when the OLTP subscriptions happen to decode a batch update, or a low value affecting the batch subscriotions. It's not strictly necessary (and we already have such limit), so I'm OK with treating it as an enhancement for the future. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
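To make the two knobs concrete: the global limit is an ordinary GUC, while the per-subscription override (split out as patch 0003 and not guaranteed to survive in this form) would look roughly like the following; the subscription, publication and connection string names and the values are only illustrative:

  -- instance-wide default for all walsenders
  ALTER SYSTEM SET logical_decoding_work_mem = '64MB';
  SELECT pg_reload_conf();

  -- hypothetical per-subscription override for a batch-heavy subscription,
  -- per the work_mem option quoted earlier (value in kilobytes)
  CREATE SUBSCRIPTION batch_sub
      CONNECTION 'host=publisher dbname=postgres'
      PUBLICATION batch_pub
      WITH (work_mem = 262144);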
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Tomas Vondra
Date:
On Tue, Oct 22, 2019 at 11:01:48AM +0530, Dilip Kumar wrote: >On Tue, Oct 22, 2019 at 10:46 AM Amit Kapila <amit.kapila16@gmail.com> wrote: >> >> On Thu, Oct 3, 2019 at 1:18 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: >> > >> > I have attempted to test the performance of (Stream + Spill) vs >> > (Stream + BGW pool) and I can see the similar gain what Alexey had >> > shown[1]. >> > >> > In addition to this, I have rebased the latest patchset [2] without >> > the two-phase logical decoding patch set. >> > >> > Test results: >> > I have repeated the same test as Alexy[1] for 1kk and 1kk data and >> > here is my result >> > Stream + Spill >> > N time on master(sec) Total xact time (sec) >> > 1kk 6 21 >> > 3kk 18 55 >> > >> > Stream + BGW pool >> > N time on master(sec) Total xact time (sec) >> > 1kk 6 13 >> > 3kk 19 35 >> > >> >> I think the test results for the master are missing. >Yeah, That time, I was planning to compare spill vs bgworker. > Also, how about >> running these tests over a network (means master and subscriber are >> not on the same machine)? > >Yeah, we should do that that will show the merit of streaming the >in-progress transactions. > Which I agree it's an interesting feature, I think we need to stop adding more stuff to this patch series - it's already complex enough, so making it even more (unnecessary) stuff is a distraction and will make it harder to get anything committed. Typical "scope creep". I think the current behavior (spill to file) is sufficient for v0 and can be improved later - that's fine. I don't think we need to bother with comparisons to master very much, because while it might be a bit slower in some cases, you can always disable streaming (so if there's a regression for your workload, you can undo that). > In general, yours and Alexy's test results >> show that there is merit by having workers applying such transactions. >> OTOH, as noted above [1], we are also worried about the performance >> of Rollbacks if we follow that approach. I am not sure how much we >> need to worry about Rollabcks if commits are faster, but can we think >> of recording the changes in memory and only write to a file if the >> changes are above a certain threshold? I think that might help saving >> I/O in many cases. I am not very sure if we do that how much >> additional workers can help, but they might still help. I think we >> need to do some tests and experiments to figure out what is the best >> approach? What do you think? >I agree with the point. I think we might need to do some small >changes and test to see what could be the best method to handle the >streamed changes at the subscriber end. > >> >> Tomas, Alexey, do you have any thoughts on this matter? I think it is >> important that we figure out the way to proceed in this patch. >> >> [1] - https://www.postgresql.org/message-id/b25ce80e-f536-78c8-d5c8-a5df3e230785%40postgrespro.ru >> > I think the patch should do the simplest thing possible, i.e. what it does today. Otherwise we'll never get it committed. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Alexey Kondratov
Date:
On 22.10.2019 20:22, Tomas Vondra wrote: > On Tue, Oct 22, 2019 at 11:01:48AM +0530, Dilip Kumar wrote: >> On Tue, Oct 22, 2019 at 10:46 AM Amit Kapila >> <amit.kapila16@gmail.com> wrote: >> In general, yours and Alexy's test results >>> show that there is merit by having workers applying such transactions. >>> OTOH, as noted above [1], we are also worried about the performance >>> of Rollbacks if we follow that approach. I am not sure how much we >>> need to worry about Rollabcks if commits are faster, but can we think >>> of recording the changes in memory and only write to a file if the >>> changes are above a certain threshold? I think that might help saving >>> I/O in many cases. I am not very sure if we do that how much >>> additional workers can help, but they might still help. I think we >>> need to do some tests and experiments to figure out what is the best >>> approach? What do you think? >> I agree with the point. I think we might need to do some small >> changes and test to see what could be the best method to handle the >> streamed changes at the subscriber end. >> >>> >>> Tomas, Alexey, do you have any thoughts on this matter? I think it is >>> important that we figure out the way to proceed in this patch. >>> >>> [1] - >>> https://www.postgresql.org/message-id/b25ce80e-f536-78c8-d5c8-a5df3e230785%40postgrespro.ru >>> >> > > I think the patch should do the simplest thing possible, i.e. what it > does today. Otherwise we'll never get it committed. > I have to agree with Tomas, that keeping things as simple as possible should be a main priority right now. Otherwise, the entire patch set will pass next release cycle without being committed at least partially. In the same time, it resolves important problem from my perspective. It moves I/O overhead from primary to replica using large transactions streaming, which is a nice to have feature I guess. Later it would be possible to replace logical apply worker with bgworkers pool in a separated patch, if we decide that it is a viable solution. Anyway, regarding the Amit's questions: - I doubt that maintaining a separate buffer on the apply side before spilling to disk would help enough. We already have ReorderBuffer with logical_work_mem limit, and if we exceeded that limit on the sender side, then most probably we exceed it on the applier side as well, excepting the case when this new buffer size will be significantly higher then logical_work_mem to keep multiple open xacts. - I still think that we should optimize database for commits, not rollbacks. BGworkers pool is dramatically slower for rollbacks-only load, though being at least twice as faster for commits-only. I do not know how it will perform with real life load, but this drawback may be inappropriate for such a general purpose database like Postgres. - Tomas' implementation of streaming with spilling does not have this bias between commits/aborts. However, it has a noticeable performance drop (~x5 slower compared with master [1]) for large transaction consisting of many small rows. Although it is not of an order of magnitude slower. Another thing is it that about a year ago I have found some problems with MVCC/visibility and fixed them somehow [1]. If I get it correctly Tomas adapted some of those fixes into his patch set, but I think that this part should be reviewed carefully again. I would be glad to check it, but now I am a little bit confused with all the patch set variants in the thread. Which is the last one? Is it still dependent on 2pc decoding? 
[1] https://www.postgresql.org/message-id/flat/40c38758-04b5-74f4-c963-cf300f9e5dff%40postgrespro.ru#98d06fefc88122385dacb2f03f7c30f7 Thanks for moving this patch forward! -- Alexey Kondratov Postgres Professional https://www.postgrespro.com Russian Postgres Company
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Amit Kapila
Date:
On Tue, Oct 22, 2019 at 10:42 PM Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote: > > On Tue, Oct 22, 2019 at 10:30:16AM +0530, Dilip Kumar wrote: > > > >I have moved it out as a separate patch (0003) so that if we need that > >we need this for the streaming transaction then we can keep this. > >> > > I'm OK with moving it to a separate patch. That being said I think > ability to control memory usage for individual subscriptions is very > useful. Saying "We don't need such parameter" is essentially equivalent > to saying "One size fits all" and I think we know that's not true. > > Imagine a system with multiple subscriptions, some of them mostly > replicating OLTP changes, but one or two replicating tables that are > updated in batches. What we'd have is to allow higher limit for the > batch subscriptions, but much lower limit for the OLTP ones (which they > should never hit in practice). > This point is not clear to me. The changes are recorded in ReorderBuffer which doesn't have any filtering aka it will have all the changes irrespective of the subscriber. How will it make a difference to have different limits? > With a single global GUC, you'll either have a high value - risking > OOM when the OLTP subscriptions happen to decode a batch update, or a > low value affecting the batch subscriotions. > > It's not strictly necessary (and we already have such limit), so I'm OK > with treating it as an enhancement for the future. > I am fine too if its usage is clear. I might be missing something here. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Amit Kapila
Date:
On Wed, Oct 23, 2019 at 12:32 AM Alexey Kondratov <a.kondratov@postgrespro.ru> wrote: > > On 22.10.2019 20:22, Tomas Vondra wrote: > > > > I think the patch should do the simplest thing possible, i.e. what it > > does today. Otherwise we'll never get it committed. > > > > I have to agree with Tomas, that keeping things as simple as possible > should be a main priority right now. Otherwise, the entire patch set > will pass next release cycle without being committed at least partially. > In the same time, it resolves important problem from my perspective. It > moves I/O overhead from primary to replica using large transactions > streaming, which is a nice to have feature I guess. > > Later it would be possible to replace logical apply worker with > bgworkers pool in a separated patch, if we decide that it is a viable > solution. Anyway, regarding the Amit's questions: > > - I doubt that maintaining a separate buffer on the apply side before > spilling to disk would help enough. We already have ReorderBuffer with > logical_work_mem limit, and if we exceeded that limit on the sender > side, then most probably we exceed it on the applier side as well, > I think on the sender side, the limit is for un-filtered changes (which means on the ReorderBuffer which has all the changes) whereas, on the receiver side, we will only have the requested changes which can make a difference? > excepting the case when this new buffer size will be significantly > higher then logical_work_mem to keep multiple open xacts. > I am not sure but I think we can have different controlling parameters on the subscriber-side. > - I still think that we should optimize database for commits, not > rollbacks. BGworkers pool is dramatically slower for rollbacks-only > load, though being at least twice as faster for commits-only. I do not > know how it will perform with real life load, but this drawback may be > inappropriate for such a general purpose database like Postgres. > > - Tomas' implementation of streaming with spilling does not have this > bias between commits/aborts. However, it has a noticeable performance > drop (~x5 slower compared with master [1]) for large transaction > consisting of many small rows. Although it is not of an order of > magnitude slower. > Did you ever identify the reason why it was slower in that case? I can see the numbers shared by you and Dilip which shows that the BGWorker pool is a really good idea and will work great for commit-mostly workload whereas the numbers without that are not very encouraging, maybe we have not benchmarked enough. This is the reason I am trying to see if we can do something to get the benefits similar to what is shown by your idea. I am not against doing something simple for the first version and then enhance it later, but it won't be good if we commit it with regression in some typical cases and depend on the user to use it when it seems favorable to its case. Also, sometimes it becomes difficult to generate enthusiasm to enhance the feature once the main patch is committed. I am not telling that always happens or will happen in this case. It is better if we put some energy and get things as good as possible in the first go itself. I am as much interested as you, Tomas or others are, otherwise, I wouldn't have spent a lot of time on this to disentangle it from 2PC patch which seems to get stalled due to lack of interest. > Another thing is it that about a year ago I have found some problems > with MVCC/visibility and fixed them somehow [1]. 
If I get it correctly > Tomas adapted some of those fixes into his patch set, but I think that > this part should be reviewed carefully again. > Agreed, I have read your emails and could see that you have done very good work on this project along with Tomas. But unfortunately, it didn't get committed. At this stage, we are working on just the first part of the patch which is to allow the data to spill once it crosses the logical_decoding_work_mem on the master side. I think we need more problems to discuss and solve once that is done. > I would be glad to check > it, but now I am a little bit confused with all the patch set variants > in the thread. Which is the last one? Is it still dependent on 2pc decoding? > I think the latest patches posted by Dilip are not dependent on logical decoding, but I haven't studied them yet. You can find those at [1][2]. As per discussion in this thread, we are also trying to see if we can make some part of the patch-series committed first, the latest patches corresponding to which are posted at [3]. [1] - https://www.postgresql.org/message-id/CAFiTN-vHoksqvV4BZ0479NhugGe4QHq_ezngNdDd-YRQ_2cwug%40mail.gmail.com [2] - https://www.postgresql.org/message-id/CAFiTN-vT%2B42xRbkw%3DhBnp44XkAyZaKZVA5hcvAMsYth3rk7vhg%40mail.gmail.com [3] - https://www.postgresql.org/message-id/CAFiTN-vkFB0RBEjVkLWhdgTYShSrSu3kCYObMghgXEwKA1FXRA%40mail.gmail.com -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Amit Kapila
Date:
On Tue, Oct 22, 2019 at 10:30 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > I have merged bugs_and_review_comments_fix.patch changes to 0001 and 0002. > I was wondering whether we have checked the code coverage after this patch? Previously, the existing tests seem to be covering most parts of the function ReorderBufferSerializeTXN [1]. After this patch, the timing to call ReorderBufferSerializeTXN will change, so that might impact the testing of the same. If it is already covered, then I would like to either add a new test or extend existing test with the help of new spill counters. If it is not getting covered, then we need to think of extending the existing test or write a new test to cover the function ReorderBufferSerializeTXN. [1] - https://coverage.postgresql.org/src/backend/replication/logical/reorderbuffer.c.gcov.html -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
vignesh C
Date:
On Tue, Oct 22, 2019 at 10:52 PM Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote: > > I think the patch should do the simplest thing possible, i.e. what it > does today. Otherwise we'll never get it committed. > I found a couple of crashes while reviewing and testing flushing of open transaction data: Issue 1: #0 0x00007f22c5722337 in raise () from /lib64/libc.so.6 #1 0x00007f22c5723a28 in abort () from /lib64/libc.so.6 #2 0x0000000000ec5390 in ExceptionalCondition (conditionName=0x10ea814 "!dlist_is_empty(head)", errorType=0x10ea804 "FailedAssertion", fileName=0x10ea7e0 "../../../../src/include/lib/ilist.h", lineNumber=458) at assert.c:54 #3 0x0000000000b4fb91 in dlist_tail_element_off (head=0x19e4db8, off=64) at ../../../../src/include/lib/ilist.h:458 #4 0x0000000000b546d0 in ReorderBufferAbortOld (rb=0x191b6b0, oldestRunningXid=3834) at reorderbuffer.c:1966 #5 0x0000000000b3ca03 in DecodeStandbyOp (ctx=0x19af990, buf=0x7ffcbc26dc50) at decode.c:332 #6 0x0000000000b3c208 in LogicalDecodingProcessRecord (ctx=0x19af990, record=0x19afc50) at decode.c:121 #7 0x0000000000b7109e in XLogSendLogical () at walsender.c:2845 #8 0x0000000000b6f5e4 in WalSndLoop (send_data=0xb70f77 <XLogSendLogical>) at walsender.c:2199 #9 0x0000000000b6c7e1 in StartLogicalReplication (cmd=0x1983168) at walsender.c:1128 #10 0x0000000000b6da6f in exec_replication_command (cmd_string=0x18f70a0 "START_REPLICATION SLOT \"sub1\" LOGICAL 0/0 (proto_version '1', publication_names '\"pub1\"')") at walsender.c:1545 Issue 2: #0 0x00007f1d7ddc4337 in raise () from /lib64/libc.so.6 #1 0x00007f1d7ddc5a28 in abort () from /lib64/libc.so.6 #2 0x0000000000ec4e1d in ExceptionalCondition (conditionName=0x10ead30 "txn->final_lsn != InvalidXLogRecPtr", errorType=0x10ea284 "FailedAssertion", fileName=0x10ea2d0 "reorderbuffer.c", lineNumber=3052) at assert.c:54 #3 0x0000000000b577e0 in ReorderBufferRestoreCleanup (rb=0x2ae36b0, txn=0x2bafb08) at reorderbuffer.c:3052 #4 0x0000000000b52b1c in ReorderBufferCleanupTXN (rb=0y x2ae36b0, txn=0x2bafb08) at reorderbuffer.c:1318 #5 0x0000000000b5279d in ReorderBufferCleanupTXN (rb=0x2ae36b0, txn=0x2b9d778) at reorderbuffer.c:1257 #6 0x0000000000b5475c in ReorderBufferAbortOld (rb=0x2ae36b0, oldestRunningXid=3835) at reorderbuffer.c:1973 #7 0x0000000000b3ca03 in DecodeStandbyOp (ctx=0x2b676d0, buf=0x7ffcbc74cc00) at decode.c:332 #8 0x0000000000b3c208 in LogicalDecodingProcessRecord (ctx=0x2b676d0, record=0x2b67990) at decode.c:121 #9 0x0000000000b70b2b in XLogSendLogical () at walsender.c:2845 These failures come randomly. I'm not able to reproduce this issue with simple test case. I have attached the test case which I used to test. I will further try to find a scenario which could reproduce consistently. Posting it so that it can help someone in identifying the problem parallelly through code review by experts. Regards, Vignesh EnterpriseDB: http://www.enterprisedb.com
Attachment
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Dilip Kumar
Date:
On Wed, Oct 30, 2019 at 9:38 AM vignesh C <vignesh21@gmail.com> wrote: >
I have noticed one more problem in the logic of setting the logical decoding work mem from the create subscription command. If the subscription command does not specify work_mem, a garbage value is sent to the walsender, and the walsender overwrites its own value with that garbage value. After investigating a bit I have found the reason for the same.

@@ -406,6 +406,9 @@ libpqrcv_startstreaming(WalReceiverConn *conn,
  appendStringInfo(&cmd, "proto_version '%u'",
      options->proto.logical.proto_version);
+ appendStringInfo(&cmd, ", work_mem '%d'",
+     options->proto.logical.work_mem);

I think the problem is that we are unconditionally sending the work_mem as part of the CREATE REPLICATION SLOT, without checking whether it's valid or not.

--- a/src/backend/catalog/pg_subscription.c
+++ b/src/backend/catalog/pg_subscription.c
@@ -71,6 +71,7 @@ GetSubscription(Oid subid, bool missing_ok)
  sub->name = pstrdup(NameStr(subform->subname));
  sub->owner = subform->subowner;
  sub->enabled = subform->subenabled;
+ sub->workmem = subform->subworkmem;

Another problem is that there is no handling if subform->subworkmem is NULL. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
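One possible shape of the fix, just as a sketch: guard the option so it is only forwarded when the subscription actually set it. The use of 0 as the "not specified" sentinel is an assumption here; the real patch may use a different convention, or fix it on the catalog side instead.

  /* inside libpqrcv_startstreaming(): only send work_mem when it was set */
  if (options->proto.logical.work_mem > 0)
      appendStringInfo(&cmd, ", work_mem '%d'",
                       options->proto.logical.work_mem);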
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Kuntal Ghosh
Date:
Hello hackers, I've done some performance testing of this feature. Following is my test case (taken from an earlier thread):

postgres=# CREATE TABLE large_test (num1 bigint, num2 double precision, num3 double precision);
postgres=# \timing on
postgres=# EXPLAIN (ANALYZE, BUFFERS) INSERT INTO large_test (num1, num2, num3) SELECT round(random()*10), random(), random()*142 FROM generate_series(1, 1000000) s(i);

I've kept the publisher and subscriber on two different systems.

HEAD:
With 1000000 tuples, Execution Time: 2576.821 ms, Time: 9632.158 ms (00:09.632), Spill count: 245
With 10000000 tuples (10 times more), Execution Time: 30359.509 ms, Time: 95261.024 ms (01:35.261), Spill count: 2442

With the memory accounting patch, following are the performance results:
With 1000000 tuples,
logical_decoding_work_mem=64kB, Execution Time: 2414.371 ms, Time: 9648.223 ms (00:09.648), Spill count: 2315
logical_decoding_work_mem=64MB, Execution Time: 2477.830 ms, Time: 9895.161 ms (00:09.895), Spill count: 3
With 10000000 tuples (10 times more),
logical_decoding_work_mem=64kB, Execution Time: 38259.227 ms, Time: 105761.978 ms (01:45.762), Spill count: 23149
logical_decoding_work_mem=64MB, Execution Time: 24624.639 ms, Time: 89985.342 ms (01:29.985), Spill count: 23

With the logical decoding of in-progress transactions patch and with streaming on, following are the performance results:
With 1000000 tuples,
logical_decoding_work_mem=64kB, Execution Time: 2674.034 ms, Time: 20779.601 ms (00:20.780)
logical_decoding_work_mem=64MB, Execution Time: 2062.404 ms, Time: 9559.953 ms (00:09.560)
With 10000000 tuples (10 times more),
logical_decoding_work_mem=64kB, Execution Time: 26949.588 ms, Time: 196261.892 ms (03:16.262)
logical_decoding_work_mem=64MB, Execution Time: 27084.403 ms, Time: 90079.286 ms (01:30.079)

-- Thanks & Regards, Kuntal Ghosh EnterpriseDB: http://www.enterprisedb.com
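For completeness, the replication side of this test is not shown in the mail; a minimal setup would be something like the following (publication, subscription and connection string names are assumptions):

  -- publisher
  CREATE PUBLICATION pub_large_test FOR TABLE large_test;

  -- subscriber (same table definition as on the publisher)
  CREATE TABLE large_test (num1 bigint, num2 double precision, num3 double precision);
  CREATE SUBSCRIPTION sub_large_test
      CONNECTION 'host=publisher dbname=postgres'
      PUBLICATION pub_large_test;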
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Dilip Kumar
Date:
On Mon, Nov 4, 2019 at 2:43 PM Kuntal Ghosh <kuntalghosh.2007@gmail.com> wrote: > > Hello hackers, > > I've done some performance testing of this feature. Following is my > test case (taken from an earlier thread): > > postgres=# CREATE TABLE large_test (num1 bigint, num2 double > precision, num3 double precision); > postgres=# \timing on > postgres=# EXPLAIN (ANALYZE, BUFFERS) INSERT INTO large_test (num1, > num2, num3) SELECT round(random()*10), random(), random()*142 FROM > generate_series(1, 1000000) s(i); > > I've kept the publisher and subscriber in two different system. > > HEAD: > With 1000000 tuples, > Execution Time: 2576.821 ms, Time: 9.632.158 ms (00:09.632), Spill count: 245 > With 10000000 tuples (10 times more), > Execution Time: 30359.509 ms, Time: 95261.024 ms (01:35.261), Spill count: 2442 > > With the memory accounting patch, following are the performance results: > With 100000 tuples, > logical_decoding_work_mem=64kB, Execution Time: 2414.371 ms, Time: > 9648.223 ms (00:09.648), Spill count: 2315 > logical_decoding_work_mem=64MB, Execution Time: 2477.830 ms, Time: > 9895.161 ms (00:09.895), Spill count 3 > With 1000000 tuples (10 times more), > logical_decoding_work_mem=64kB, Execution Time: 38259.227 ms, Time: > 105761.978 ms (01:45.762), Spill count: 23149 > logical_decoding_work_mem=64MB, Execution Time: 24624.639 ms, Time: > 89985.342 ms (01:29.985), Spill count: 23 > > With logical decoding of in-progress transactions patch and with > streaming on, following are the performance results: > With 100000 tuples, > logical_decoding_work_mem=64kB, Execution Time: 2674.034 ms, Time: > 20779.601 ms (00:20.780) > logical_decoding_work_mem=64MB, Execution Time: 2062.404 ms, Time: > 9559.953 ms (00:09.560) > With 1000000 tuples (10 times more), > logical_decoding_work_mem=64kB, Execution Time: 26949.588 ms, Time: > 196261.892 ms (03:16.262) > logical_decoding_work_mem=64MB, Execution Time: 27084.403 ms, Time: > 90079.286 ms (01:30.079) So your result shows that with "streaming on", performance is degrading? By any chance did you try to see where is the bottleneck? -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Kuntal Ghosh
Date:
On Mon, Nov 4, 2019 at 3:32 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > So your result shows that with "streaming on", performance is > degrading? By any chance did you try to see where is the bottleneck? > Right. But, as we increase the logical_decoding_work_mem, the performance improves. I've not analyzed the bottleneck yet. I'm looking into the same. -- Thanks & Regards, Kuntal Ghosh EnterpriseDB: http://www.enterprisedb.com
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
vignesh C
Date:
On Thu, Oct 24, 2019 at 7:07 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Tue, Oct 22, 2019 at 10:30 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > I have merged bugs_and_review_comments_fix.patch changes to 0001 and 0002. > > > > I was wondering whether we have checked the code coverage after this > patch? Previously, the existing tests seem to be covering most parts > of the function ReorderBufferSerializeTXN [1]. After this patch, the > timing to call ReorderBufferSerializeTXN will change, so that might > impact the testing of the same. If it is already covered, then I > would like to either add a new test or extend existing test with the > help of new spill counters. If it is not getting covered, then we > need to think of extending the existing test or write a new test to > cover the function ReorderBufferSerializeTXN. > I have run the tests with coverage and found that ReorderBufferSerializeTXN is not being hit. The reason it is not being hit is because of the following check in ReorderBufferCheckMemoryLimit: /* bail out if we haven't exceeded the memory limit */ if (rb->size < logical_decoding_work_mem * 1024L) return; Previously the tests from contrib/test_decoding could hit ReorderBufferSerializeTXN function. I'm checking if we can modify the test or add new test to hit ReorderBufferSerializeTXN function. Regards, Vignesh EnterpriseDB: http://www.enterprisedb.com
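A sketch of the kind of test addition that would exercise that path again from a test_decoding-style SQL script, given the 64kB minimum of the GUC ('regression_slot' follows the existing test convention; the table and row counts are made up):

  SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');
  CREATE TABLE spill_test (id int, payload text);
  SET logical_decoding_work_mem = '64kB';
  BEGIN;
  INSERT INTO spill_test SELECT i, repeat('x', 100) FROM generate_series(1, 5000) i;
  COMMIT;
  -- decoding this transaction should now have to go through ReorderBufferSerializeTXN
  SELECT count(*) FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL);
  SELECT pg_drop_replication_slot('regression_slot');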
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Amit Kapila
Date:
On Wed, Oct 30, 2019 at 9:38 AM vignesh C <vignesh21@gmail.com> wrote: > > On Tue, Oct 22, 2019 at 10:52 PM Tomas Vondra > <tomas.vondra@2ndquadrant.com> wrote: > > > > I think the patch should do the simplest thing possible, i.e. what it > > does today. Otherwise we'll never get it committed. > > > I found a couple of crashes while reviewing and testing flushing of > open transaction data: > Thanks for doing these tests. However, I don't think these issues are anyway related to this patch. It seems to be base code issues manifested by this patch. See my analysis below. > Issue 1: > #0 0x00007f22c5722337 in raise () from /lib64/libc.so.6 > #1 0x00007f22c5723a28 in abort () from /lib64/libc.so.6 > #2 0x0000000000ec5390 in ExceptionalCondition > (conditionName=0x10ea814 "!dlist_is_empty(head)", errorType=0x10ea804 > "FailedAssertion", > fileName=0x10ea7e0 "../../../../src/include/lib/ilist.h", > lineNumber=458) at assert.c:54 > #3 0x0000000000b4fb91 in dlist_tail_element_off (head=0x19e4db8, > off=64) at ../../../../src/include/lib/ilist.h:458 > #4 0x0000000000b546d0 in ReorderBufferAbortOld (rb=0x191b6b0, > oldestRunningXid=3834) at reorderbuffer.c:1966 > #5 0x0000000000b3ca03 in DecodeStandbyOp (ctx=0x19af990, > buf=0x7ffcbc26dc50) at decode.c:332 > This seems to be the problem of base code where we abort immediately after serializing the changes because in that case, the changes list will be empty. I think you can try to reproduce it via the debugger or by hacking the code such that it serializes after every change and then if you abort after one change, it should hit this problem. > > Issue 2: > #0 0x00007f1d7ddc4337 in raise () from /lib64/libc.so.6 > #1 0x00007f1d7ddc5a28 in abort () from /lib64/libc.so.6 > #2 0x0000000000ec4e1d in ExceptionalCondition > (conditionName=0x10ead30 "txn->final_lsn != InvalidXLogRecPtr", > errorType=0x10ea284 "FailedAssertion", > fileName=0x10ea2d0 "reorderbuffer.c", lineNumber=3052) at assert.c:54 > #3 0x0000000000b577e0 in ReorderBufferRestoreCleanup (rb=0x2ae36b0, > txn=0x2bafb08) at reorderbuffer.c:3052 > #4 0x0000000000b52b1c in ReorderBufferCleanupTXN (rb=0y x2ae36b0, > txn=0x2bafb08) at reorderbuffer.c:1318 > #5 0x0000000000b5279d in ReorderBufferCleanupTXN (rb=0x2ae36b0, > txn=0x2b9d778) at reorderbuffer.c:1257 > #6 0x0000000000b5475c in ReorderBufferAbortOld (rb=0x2ae36b0, > oldestRunningXid=3835) at reorderbuffer.c:1973 > This seems to be again the problem with base code as we don't update the final_lsn for subtransactions during ReorderBufferAbortOld. This can also be reproduced with some hacking in code or via debugger in a similar way as explained for the previous problem but with a difference that there must be subtransaction involved in this case. > #7 0x0000000000b3ca03 in DecodeStandbyOp (ctx=0x2b676d0, > buf=0x7ffcbc74cc00) at decode.c:332 > #8 0x0000000000b3c208 in LogicalDecodingProcessRecord (ctx=0x2b676d0, > record=0x2b67990) at decode.c:121 > #9 0x0000000000b70b2b in XLogSendLogical () at walsender.c:2845 > > These failures come randomly. > I'm not able to reproduce this issue with simple test case. Yeah, it appears to be difficult to reproduce unless you hack the code to serialize every change or use debugger to forcefully flush the changes every time. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Dilip Kumar
Date:
On Mon, Nov 4, 2019 at 5:22 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Wed, Oct 30, 2019 at 9:38 AM vignesh C <vignesh21@gmail.com> wrote: > > > > On Tue, Oct 22, 2019 at 10:52 PM Tomas Vondra > > <tomas.vondra@2ndquadrant.com> wrote: > > > > > > I think the patch should do the simplest thing possible, i.e. what it > > > does today. Otherwise we'll never get it committed. > > > > > I found a couple of crashes while reviewing and testing flushing of > > open transaction data: > > > > Thanks for doing these tests. However, I don't think these issues are > anyway related to this patch. It seems to be base code issues > manifested by this patch. See my analysis below. > > > Issue 1: > > #0 0x00007f22c5722337 in raise () from /lib64/libc.so.6 > > #1 0x00007f22c5723a28 in abort () from /lib64/libc.so.6 > > #2 0x0000000000ec5390 in ExceptionalCondition > > (conditionName=0x10ea814 "!dlist_is_empty(head)", errorType=0x10ea804 > > "FailedAssertion", > > fileName=0x10ea7e0 "../../../../src/include/lib/ilist.h", > > lineNumber=458) at assert.c:54 > > #3 0x0000000000b4fb91 in dlist_tail_element_off (head=0x19e4db8, > > off=64) at ../../../../src/include/lib/ilist.h:458 > > #4 0x0000000000b546d0 in ReorderBufferAbortOld (rb=0x191b6b0, > > oldestRunningXid=3834) at reorderbuffer.c:1966 > > #5 0x0000000000b3ca03 in DecodeStandbyOp (ctx=0x19af990, > > buf=0x7ffcbc26dc50) at decode.c:332 > > > > This seems to be the problem of base code where we abort immediately > after serializing the changes because in that case, the changes list > will be empty. I think you can try to reproduce it via the debugger > or by hacking the code such that it serializes after every change and > then if you abort after one change, it should hit this problem. > I think you might need to kill the server after all changes are serialized otherwise normal abort will hit the ReorderBufferAbort and that will remove your ReorderBufferTXN entry and you will never hit this case. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
vignesh C
Date:
On Mon, Nov 4, 2019 at 3:46 PM vignesh C <vignesh21@gmail.com> wrote: > > On Thu, Oct 24, 2019 at 7:07 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > On Tue, Oct 22, 2019 at 10:30 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > > > I have merged bugs_and_review_comments_fix.patch changes to 0001 and 0002. > > > > > > > I was wondering whether we have checked the code coverage after this > > patch? Previously, the existing tests seem to be covering most parts > > of the function ReorderBufferSerializeTXN [1]. After this patch, the > > timing to call ReorderBufferSerializeTXN will change, so that might > > impact the testing of the same. If it is already covered, then I > > would like to either add a new test or extend existing test with the > > help of new spill counters. If it is not getting covered, then we > > need to think of extending the existing test or write a new test to > > cover the function ReorderBufferSerializeTXN. > > > I have run the tests with coverage and found that > ReorderBufferSerializeTXN is not being hit. > The reason it is not being hit is because of the following check in > ReorderBufferCheckMemoryLimit: > /* bail out if we haven't exceeded the memory limit */ > if (rb->size < logical_decoding_work_mem * 1024L) > return; > Previously the tests from contrib/test_decoding could hit > ReorderBufferSerializeTXN function. > I'm checking if we can modify the test or add new test to hit > ReorderBufferSerializeTXN function. I have made one change to the configuration file in contrib/test_decoding directory, with that the coverage seems to be fine. I have seen that the coverage is almost like the code before applying the patch. I have attached the test change and the coverage report for reference. Coverage report includes the core logical work memory files for base code and by applying 0001-Add-logical_decoding_work_mem-to-limit-ReorderBuffer and 0002-Track-statistics-for-spilling patches. Regards, Vignesh EnterpriseDB: http://www.enterprisedb.com
Attachment
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
vignesh C
Date:
On Mon, Nov 4, 2019 at 5:22 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Wed, Oct 30, 2019 at 9:38 AM vignesh C <vignesh21@gmail.com> wrote: > > > > On Tue, Oct 22, 2019 at 10:52 PM Tomas Vondra > > <tomas.vondra@2ndquadrant.com> wrote: > > > > > > I think the patch should do the simplest thing possible, i.e. what it > > > does today. Otherwise we'll never get it committed. > > > > > I found a couple of crashes while reviewing and testing flushing of > > open transaction data: > > > > Thanks for doing these tests. However, I don't think these issues are > anyway related to this patch. It seems to be base code issues > manifested by this patch. See my analysis below. > > > Issue 1: > > #0 0x00007f22c5722337 in raise () from /lib64/libc.so.6 > > #1 0x00007f22c5723a28 in abort () from /lib64/libc.so.6 > > #2 0x0000000000ec5390 in ExceptionalCondition > > (conditionName=0x10ea814 "!dlist_is_empty(head)", errorType=0x10ea804 > > "FailedAssertion", > > fileName=0x10ea7e0 "../../../../src/include/lib/ilist.h", > > lineNumber=458) at assert.c:54 > > #3 0x0000000000b4fb91 in dlist_tail_element_off (head=0x19e4db8, > > off=64) at ../../../../src/include/lib/ilist.h:458 > > #4 0x0000000000b546d0 in ReorderBufferAbortOld (rb=0x191b6b0, > > oldestRunningXid=3834) at reorderbuffer.c:1966 > > #5 0x0000000000b3ca03 in DecodeStandbyOp (ctx=0x19af990, > > buf=0x7ffcbc26dc50) at decode.c:332 > > > > This seems to be the problem of base code where we abort immediately > after serializing the changes because in that case, the changes list > will be empty. I think you can try to reproduce it via the debugger > or by hacking the code such that it serializes after every change and > then if you abort after one change, it should hit this problem. > > > > > Issue 2: > > #0 0x00007f1d7ddc4337 in raise () from /lib64/libc.so.6 > > #1 0x00007f1d7ddc5a28 in abort () from /lib64/libc.so.6 > > #2 0x0000000000ec4e1d in ExceptionalCondition > > (conditionName=0x10ead30 "txn->final_lsn != InvalidXLogRecPtr", > > errorType=0x10ea284 "FailedAssertion", > > fileName=0x10ea2d0 "reorderbuffer.c", lineNumber=3052) at assert.c:54 > > #3 0x0000000000b577e0 in ReorderBufferRestoreCleanup (rb=0x2ae36b0, > > txn=0x2bafb08) at reorderbuffer.c:3052 > > #4 0x0000000000b52b1c in ReorderBufferCleanupTXN (rb=0y x2ae36b0, > > txn=0x2bafb08) at reorderbuffer.c:1318 > > #5 0x0000000000b5279d in ReorderBufferCleanupTXN (rb=0x2ae36b0, > > txn=0x2b9d778) at reorderbuffer.c:1257 > > #6 0x0000000000b5475c in ReorderBufferAbortOld (rb=0x2ae36b0, > > oldestRunningXid=3835) at reorderbuffer.c:1973 > > > > This seems to be again the problem with base code as we don't update > the final_lsn for subtransactions during ReorderBufferAbortOld. This > can also be reproduced with some hacking in code or via debugger in a > similar way as explained for the previous problem but with a > difference that there must be subtransaction involved in this case. > > > #7 0x0000000000b3ca03 in DecodeStandbyOp (ctx=0x2b676d0, > > buf=0x7ffcbc74cc00) at decode.c:332 > > #8 0x0000000000b3c208 in LogicalDecodingProcessRecord (ctx=0x2b676d0, > > record=0x2b67990) at decode.c:121 > > #9 0x0000000000b70b2b in XLogSendLogical () at walsender.c:2845 > > > > These failures come randomly. > > I'm not able to reproduce this issue with simple test case. > > Yeah, it appears to be difficult to reproduce unless you hack the code > to serialize every change or use debugger to forcefully flush the > changes every time. 
> Thanks Amit for your analysis, I was able to reproduce the above issue consistently by making some code changes and with help of debugger. I did one change so that it flushes every time instead of flushing after the buffer size exceeds the logical_decoding_work_mem, attached one of the transactions and called abort. When the server restarts after abort, this problem occurs consistently. I could reproduce the issue with base code also. It seems like this issue is not an issue of 0001-Add-logical_decoding_work_mem-to-limit-ReorderBuffer patch and exists from base code. I will post the issue in hackers with details. Regards, Vignesh EnterpriseDB: http://www.enterprisedb.com
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Amit Kapila
Date:
On Wed, Nov 6, 2019 at 11:33 AM vignesh C <vignesh21@gmail.com> wrote: > > I have made one change to the configuration file in > contrib/test_decoding directory, with that the coverage seems to be > fine. I have seen that the coverage is almost like the code before > applying the patch. I have attached the test change and the coverage > report for reference. Coverage report includes the core logical work > memory files for base code and by applying > 0001-Add-logical_decoding_work_mem-to-limit-ReorderBuffer and > 0002-Track-statistics-for-spilling patches. > Thanks, I have incorporated your test changes and modified the two patches. Please see attached. Changes: --------------- 1. In guc.c, we should include reorderbuffer.h, not logical.h as we define logical_decoding_work_mem in earlier. 2. + * To limit the amount of memory used by decoded changes, we track memory + * used at the reorder buffer level (i.e. total amount of memory), and for + * each toplevel transaction. When the total amount of used memory exceeds + * the limit, the toplevel transaction consuming the most memory is then + * serialized to disk. In the above comments, removed 'toplevel' as we track memory usage for both toplevel and subtransactions. 3. There were still a few mentions of streaming which I have removed. 4. In the docs, the type for stats spill_* was integer whereas it should be bigint. 5. +UpdateSpillStats(LogicalDecodingContext *ctx) +{ + ReorderBuffer *rb = ctx->reorder; + + SpinLockAcquire(&MyWalSnd->mutex); + + MyWalSnd->spillTxns = rb->spillTxns; + MyWalSnd->spillCount = rb->spillCount; + MyWalSnd->spillBytes = rb->spillBytes; + + elog(WARNING, "UpdateSpillStats: updating stats %p %ld %ld %ld", + rb, rb->spillTxns, rb->spillCount, rb->spillBytes); Changed the above elog to DEBUG1 as otherwise it was getting printed very frequently. I think we can make it DEBUG2 if we want. 6. There was an extra space in rules.out due to which test was failing. I have fixed it. What do you think? -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Attachment
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Dilip Kumar
Date:
On Thu, Nov 7, 2019 at 3:19 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Wed, Nov 6, 2019 at 11:33 AM vignesh C <vignesh21@gmail.com> wrote: > > > > I have made one change to the configuration file in > > contrib/test_decoding directory, with that the coverage seems to be > > fine. I have seen that the coverage is almost like the code before > > applying the patch. I have attached the test change and the coverage > > report for reference. Coverage report includes the core logical work > > memory files for base code and by applying > > 0001-Add-logical_decoding_work_mem-to-limit-ReorderBuffer and > > 0002-Track-statistics-for-spilling patches. > > > > Thanks, I have incorporated your test changes and modified the two > patches. Please see attached. > > Changes: > --------------- > 1. In guc.c, we should include reorderbuffer.h, not logical.h as we > define logical_decoding_work_mem in earlier. Yeah Right. > > 2. > + * To limit the amount of memory used by decoded changes, we track memory > + * used at the reorder buffer level (i.e. total amount of memory), and for > + * each toplevel transaction. When the total amount of used memory exceeds > + * the limit, the toplevel transaction consuming the most memory is then > + * serialized to disk. > > In the above comments, removed 'toplevel' as we track memory usage for > both toplevel and subtransactions. Correct. > > 3. There were still a few mentions of streaming which I have removed. > ok > 4. In the docs, the type for stats spill_* was integer whereas it > should be bigint. ok > > 5. > +UpdateSpillStats(LogicalDecodingContext *ctx) > +{ > + ReorderBuffer *rb = ctx->reorder; > + > + SpinLockAcquire(&MyWalSnd->mutex); > + > + MyWalSnd->spillTxns = rb->spillTxns; > + MyWalSnd->spillCount = rb->spillCount; > + MyWalSnd->spillBytes = rb->spillBytes; > + > + elog(WARNING, "UpdateSpillStats: updating stats %p %ld %ld %ld", > + rb, rb->spillTxns, rb->spillCount, rb->spillBytes); > > Changed the above elog to DEBUG1 as otherwise it was getting printed > very frequently. I think we can make it DEBUG2 if we want. Yeah, it should not be WARNING. > > 6. There was an extra space in rules.out due to which test was > failing. I have fixed it. My Bad. I have induced while separating out the changes for the spilling. > What do you think? I have reviewed your changes and looks fine to me. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Amit Kapila
Date:
On Thu, Nov 7, 2019 at 3:50 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Thu, Nov 7, 2019 at 3:19 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > What do you think? > I have reviewed your changes and looks fine to me. > Okay, thanks. I am also happy with the two patches I have posted in my last email [1]. Tomas, would you like to take a look at those patches and commit them if you are happy or would you like me to do the same? Some notes before commit: -------------------------------------- 1. Commit message need to be changed for the first patch ------------------------------------------------------------------------- A. > The memory limit is defined by a new logical_decoding_work_mem GUC, so for example we can do this SET logical_decoding_work_mem = '128kB' > to trigger very aggressive streaming. The minimum value is 64kB. I think this patch doesn't contain streaming, so we either need to reword it or remove it. B. > The logical_decoding_work_mem may be set either in postgresql.conf, in which case it serves as the default for all publisherson that instance, or when creating the > subscription, using a work_mem paramemter in the WITH clause (specifies number of kilobytes). We need to reword this as we have decided to remove the setting from the subscription side as of now. 2. I think we can change the message level in UpdateSpillStats() to DEBUG2. 3. I think we need catversion bump for the second patch. 4. I think we can combine both patches and commit as one patch, but it is okay to commit them separately as well. [1] - https://www.postgresql.org/message-id/CAA4eK1Kdmi6VVguKEHV6Ho2isCPVFdQtt0WLsK10fiuE59_0Yw%40mail.gmail.com -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Alexey Kondratov
Date:
On 04.11.2019 13:05, Kuntal Ghosh wrote: > On Mon, Nov 4, 2019 at 3:32 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: >> So your result shows that with "streaming on", performance is >> degrading? By any chance did you try to see where is the bottleneck? >> > Right. But, as we increase the logical_decoding_work_mem, the > performance improves. I've not analyzed the bottleneck yet. I'm > looking into the same. My guess is that 64 kB is just too small value. In the table schema used for tests every rows takes at least 24 bytes for storing column values. Thus, with this logical_decoding_work_mem value the limit should be hit after about 2500+ rows, or about 400 times during transaction of 1000000 rows size. It is just too frequent, while ReorderBufferStreamTXN includes a whole bunch of logic, e.g. it always starts internal transaction: /* * Decoding needs access to syscaches et al., which in turn use * heavyweight locks and such. Thus we need to have enough state around to * keep track of those. The easiest way is to simply use a transaction * internally. That also allows us to easily enforce that nothing writes * to the database by checking for xid assignments. ... */ Also it issues separated stream_start/stop messages around each streamed transaction chunk. So if streaming starts and stops too frequently it adds additional overhead and may even interfere with current in-progress transaction. If I get it correctly, then it is rather expected with too small values of logical_decoding_work_mem. Probably it may be optimized, but I am not sure that it is worth doing right now. Regards -- Alexey Kondratov Postgres Professional https://www.postgrespro.com Russian Postgres Company
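Rough arithmetic behind that estimate, counting only the column data as in the message above (per-change bookkeeping overhead would only make the chunks smaller and the cycles more frequent):

  SELECT (64 * 1024) / 24              AS rows_until_limit,      -- ~2730
         1000000 / ((64 * 1024) / 24)  AS approx_stream_cycles;  -- ~366

which matches the "2500+ rows" and "about 400 times" figures.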
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Kuntal Ghosh
Date:
On Tue, Nov 12, 2019 at 4:12 PM Alexey Kondratov <a.kondratov@postgrespro.ru> wrote: > > On 04.11.2019 13:05, Kuntal Ghosh wrote: > > On Mon, Nov 4, 2019 at 3:32 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > >> So your result shows that with "streaming on", performance is > >> degrading? By any chance did you try to see where is the bottleneck? > >> > > Right. But, as we increase the logical_decoding_work_mem, the > > performance improves. I've not analyzed the bottleneck yet. I'm > > looking into the same. > > My guess is that 64 kB is just too small value. In the table schema used > for tests every rows takes at least 24 bytes for storing column values. > Thus, with this logical_decoding_work_mem value the limit should be hit > after about 2500+ rows, or about 400 times during transaction of 1000000 > rows size. > > It is just too frequent, while ReorderBufferStreamTXN includes a whole > bunch of logic, e.g. it always starts internal transaction: > > /* > * Decoding needs access to syscaches et al., which in turn use > * heavyweight locks and such. Thus we need to have enough state around to > * keep track of those. The easiest way is to simply use a transaction > * internally. That also allows us to easily enforce that nothing writes > * to the database by checking for xid assignments. ... > */ > > Also it issues separated stream_start/stop messages around each streamed > transaction chunk. So if streaming starts and stops too frequently it > adds additional overhead and may even interfere with current in-progress > transaction. > Yeah, I've also found the same. With stream_start/stop message, it writes 1 byte of checksum and 4 bytes of number of sub-transactions which increases the write amplification significantly. -- Thanks & Regards, Kuntal Ghosh EnterpriseDB: http://www.enterprisedb.com
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Amit Kapila
Date:
On Thu, Oct 3, 2019 at 1:18 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > As mentioned by me a few days back that the first patch in this series is ready to go [1] (I am hoping Tomas will pick it up), so I have started the review of other patches Review/Questions on 0002-Immediately-WAL-log-assignments.patch ------------------------------------------------------------------------------------------------- 1. This patch adds the top_xid in WAL whenever the first time WAL for a subtransaction XID is written to correctly decode the changes of in-progress transaction. This patch also removes logging and applying WAL for XLOG_XACT_ASSIGNMENT which might have some effect. As replay of that, it prunes KnownAssignedXids to prevent overflow of that array. See comments in procarray.c (KnownAssignedTransactionIds sub-module). Can you please explain how after removing the WAL for XLOG_XACT_ASSIGNMENT, we will handle that or I am missing something and there is no impact of same? 2. +#define XLOG_INCLUDE_INVALS 0x08 /* include invalidations */ This doesn't seem to be used in this patch. [1] - https://www.postgresql.org/message-id/CAA4eK1JM0%3DRwODZQrn8DTQ3dbcb9xwKDdHCmVOryAk_xoKf9Nw%40mail.gmail.com -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Dilip Kumar
Date:
On Wed, Nov 13, 2019 at 5:55 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Thu, Oct 3, 2019 at 1:18 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > As mentioned by me a few days back that the first patch in this series > is ready to go [1] (I am hoping Tomas will pick it up), so I have > started the review of other patches > > Review/Questions on 0002-Immediately-WAL-log-assignments.patch > ------------------------------------------------------------------------------------------------- > 1. This patch adds the top_xid in WAL whenever the first time WAL for > a subtransaction XID is written to correctly decode the changes of > in-progress transaction. This patch also removes logging and applying > WAL for XLOG_XACT_ASSIGNMENT which might have some effect. As replay > of that, it prunes KnownAssignedXids to prevent overflow of that > array. See comments in procarray.c (KnownAssignedTransactionIds > sub-module). Can you please explain how after removing the WAL for > XLOG_XACT_ASSIGNMENT, we will handle that or I am missing something > and there is no impact of same? It seems like a problem to me as well. One option could be that since now we are adding the top transaction id in the first WAL of the subtransaction we can directly update the pg_subtrans and avoid adding sub transaction id in the KnownAssignedXids and mark it as lastOverflowedXid. But, I don't think we should go in that direction otherwise it will impact the performance of visibility check on the hot-standby. Let's see what Tomas has in mind. > > 2. > +#define XLOG_INCLUDE_INVALS 0x08 /* include invalidations */ > > This doesn't seem to be used in this patch. > -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Amit Kapila
Date:
On Thu, Nov 14, 2019 at 9:37 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Wed, Nov 13, 2019 at 5:55 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > On Thu, Oct 3, 2019 at 1:18 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > > > > As mentioned by me a few days back that the first patch in this series > > is ready to go [1] (I am hoping Tomas will pick it up), so I have > > started the review of other patches > > > > Review/Questions on 0002-Immediately-WAL-log-assignments.patch > > ------------------------------------------------------------------------------------------------- > > 1. This patch adds the top_xid in WAL whenever the first time WAL for > > a subtransaction XID is written to correctly decode the changes of > > in-progress transaction. This patch also removes logging and applying > > WAL for XLOG_XACT_ASSIGNMENT which might have some effect. As replay > > of that, it prunes KnownAssignedXids to prevent overflow of that > > array. See comments in procarray.c (KnownAssignedTransactionIds > > sub-module). Can you please explain how after removing the WAL for > > XLOG_XACT_ASSIGNMENT, we will handle that or I am missing something > > and there is no impact of same? > > It seems like a problem to me as well. One option could be that > since now we are adding the top transaction id in the first WAL of the > subtransaction we can directly update the pg_subtrans and avoid adding > sub transaction id in the KnownAssignedXids and mark it as > lastOverflowedXid. > Hmm, I am not sure if we can do that easily because I think in RecordKnownAssignedTransactionIds, we add those based on the gap via KnownAssignedXidsAdd and only remove them later while applying WAL for XLOG_XACT_ASSIGNMENT. I think if we really want to go in this direction then for each WAL record we need to check if it has XLR_BLOCK_ID_TOPLEVEL_XID set and then call function ProcArrayApplyXidAssignment() with the required information. I think this line of attack has WAL overhead both on master whenever subtransactions are involved and also on hot-standby for doing the work for each subtransaction separately. The WAL apply needs to acquire and release PROCArrayLock in exclusive mode for each subtransaction whereas now it does it once for PGPROC_MAX_CACHED_SUBXIDS number of subtransactions which can conflict with queries running on standby. The other idea could be that we keep the current XLOG_XACT_ASSIGNMENT mechanism (WAL logging and apply of same on hot-standby) as it is and additionally log top_xid the first time when WAL is written for a subtransaction only when wal_level >= WAL_LEVEL_LOGICAL. Then use the same for logical decoding. The advantage of this approach is that we will incur the overhead of additional transactionid only when required especially not with default server configuration. Thoughts? -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
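A minimal standalone sketch of this second idea — keep XLOG_XACT_ASSIGNMENT exactly as today for hot standby, and additionally attach the top-level xid to a subtransaction's first WAL record only under logical wal_level. The names below are illustrative stand-ins, not the patch's actual code:

#include <stdbool.h>
#include <stdio.h>

typedef enum { WAL_LEVEL_MINIMAL, WAL_LEVEL_REPLICA, WAL_LEVEL_LOGICAL } WalLevel;

static WalLevel wal_level = WAL_LEVEL_REPLICA;

/* Should this WAL record carry the top-level xid for logical decoding? */
static bool
include_top_xid(bool is_subxact, bool top_xid_already_logged)
{
    return wal_level >= WAL_LEVEL_LOGICAL &&
           is_subxact &&
           !top_xid_already_logged;
}

int main(void)
{
    printf("replica, first subxact record:  %d\n", include_top_xid(true, false));
    wal_level = WAL_LEVEL_LOGICAL;
    printf("logical, first subxact record:  %d\n", include_top_xid(true, false));
    printf("logical, later subxact records: %d\n", include_top_xid(true, true));
    return 0;
}

With this shape, KnownAssignedXids pruning on the standby is untouched, and only servers running with wal_level=logical pay the extra bytes on a subtransaction's first record.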
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Dilip Kumar
Date:
On Thu, Nov 14, 2019 at 12:10 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Thu, Nov 14, 2019 at 9:37 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > On Wed, Nov 13, 2019 at 5:55 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > On Thu, Oct 3, 2019 at 1:18 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > > > > > > > As mentioned by me a few days back that the first patch in this series > > > is ready to go [1] (I am hoping Tomas will pick it up), so I have > > > started the review of other patches > > > > > > Review/Questions on 0002-Immediately-WAL-log-assignments.patch > > > ------------------------------------------------------------------------------------------------- > > > 1. This patch adds the top_xid in WAL whenever the first time WAL for > > > a subtransaction XID is written to correctly decode the changes of > > > in-progress transaction. This patch also removes logging and applying > > > WAL for XLOG_XACT_ASSIGNMENT which might have some effect. As replay > > > of that, it prunes KnownAssignedXids to prevent overflow of that > > > array. See comments in procarray.c (KnownAssignedTransactionIds > > > sub-module). Can you please explain how after removing the WAL for > > > XLOG_XACT_ASSIGNMENT, we will handle that or I am missing something > > > and there is no impact of same? > > > > It seems like a problem to me as well. One option could be that > > since now we are adding the top transaction id in the first WAL of the > > subtransaction we can directly update the pg_subtrans and avoid adding > > sub transaction id in the KnownAssignedXids and mark it as > > lastOverflowedXid. > > > > Hmm, I am not sure if we can do that easily because I think in > RecordKnownAssignedTransactionIds, we add those based on the gap via > KnownAssignedXidsAdd and only remove them later while applying WAL for > XLOG_XACT_ASSIGNMENT. I think if we really want to go in this > direction then for each WAL record we need to check if it has > XLR_BLOCK_ID_TOPLEVEL_XID set and then call function > ProcArrayApplyXidAssignment() with the required information. I think > this line of attack has WAL overhead both on master whenever > subtransactions are involved and also on hot-standby for doing the > work for each subtransaction separately. The WAL apply needs to > acquire and release PROCArrayLock in exclusive mode for each > subtransaction whereas now it does it once for > PGPROC_MAX_CACHED_SUBXIDS number of subtransactions which can conflict > with queries running on standby. Right > > The other idea could be that we keep the current XLOG_XACT_ASSIGNMENT > mechanism (WAL logging and apply of same on hot-standby) as it is and > additionally log top_xid the first time when WAL is written for a > subtransaction only when wal_level >= WAL_LEVEL_LOGICAL. Then use the > same for logical decoding. The advantage of this approach is that we > will incur the overhead of additional transactionid only when required > especially not with default server configuration. > > Thoughts? The idea seems reasonable to me. 
Apart from this, I have another question in 0003-Issue-individual-invalidations-with-wal_level-logical.patch @@ -543,6 +588,18 @@ RegisterSnapshotInvalidation(Oid dbId, Oid relId) { AddSnapshotInvalidationMessage(&transInvalInfo->CurrentCmdInvalidMsgs, dbId, relId); + + /* Issue an invalidation WAL record (when wal_level=logical) */ + if (XLogLogicalInfoActive()) + { + SharedInvalidationMessage msg; + + msg.sn.id = SHAREDINVALSNAPSHOT_ID; + msg.sn.dbId = dbId; + msg.sn.relId = relId; + + LogLogicalInvalidations(1, &msg, false); + } } I am not sure why do we need to explicitly WAL log the snapshot invalidation? because this is logged for invalidating the catalog snapshot and for logical decoding we use HistoricSnapshot, not the catalog snapshot. I might be missing something? -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Amit Kapila
Date:
On Thu, Nov 14, 2019 at 3:40 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > Apart from this, I have another question in > 0003-Issue-individual-invalidations-with-wal_level-logical.patch > > @@ -543,6 +588,18 @@ RegisterSnapshotInvalidation(Oid dbId, Oid relId) > { > AddSnapshotInvalidationMessage(&transInvalInfo->CurrentCmdInvalidMsgs, > dbId, relId); > + > + /* Issue an invalidation WAL record (when wal_level=logical) */ > + if (XLogLogicalInfoActive()) > + { > + SharedInvalidationMessage msg; > + > + msg.sn.id = SHAREDINVALSNAPSHOT_ID; > + msg.sn.dbId = dbId; > + msg.sn.relId = relId; > + > + LogLogicalInvalidations(1, &msg, false); > + } > } > > I am not sure why do we need to explicitly WAL log the snapshot > invalidation? because this is logged for invalidating the catalog > snapshot and for logical decoding we use HistoricSnapshot, not the > catalog snapshot. > I think it has been logged because without this patch as well we log all the invalidation messages at commit time and process them during decoding. However, I agree that this particular invalidation message is not required for logical decoding for the reason you mentioned. I think as we are explicitly logging invalidations, so it is better to avoid this if we can. Few other comments on this patch: 1. + case REORDER_BUFFER_CHANGE_INVALIDATION: + + /* + * Execute the invalidation message locally. + * + * XXX Do we need to care about relcacheInitFileInval and + * the other fields added to ReorderBufferChange, or just + * about the message itself? + */ + LocalExecuteInvalidationMessage(&change->data.inval.msg); + break; Here, why are we executing messages individually? Can't we just follow what we do in DecodeCommit which is to record the invalidations in ReorderBufferTXN as we encounter them and then allow them to execute on each REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID. Is there a reason why we don't do ReorderBufferXidSetCatalogChanges when we receive any invalidation message? 2. @@ -3025,8 +3073,8 @@ ReorderBufferRestoreChange(ReorderBuffer *rb, ReorderBufferTXN *txn, * although we don't check the memory limit when restoring the changes in * this branch (we only do that when initially queueing the changes after * decoding), because we will release the changes later, and that will - * update the accounting too (subtracting the size from the counters). - * And we don't want to underflow there. + * update the accounting too (subtracting the size from the counters). And + * we don't want to underflow there. */ This seems like an unrelated change. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
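For reference, the accumulate-then-replay approach suggested in point 1 could look roughly like the standalone toy below. The real reorder buffer keeps SharedInvalidationMessages in the transaction's invalidations array and replays them with LocalExecuteInvalidationMessage; everything here is simplified and the types are stand-ins:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Simplified stand-in for SharedInvalidationMessage. */
typedef struct InvalMessage
{
    int      id;     /* which cache/kind of invalidation */
    unsigned dbId;
    unsigned relId;
} InvalMessage;

/* Toy transaction that accumulates invalidations instead of executing each
 * one as it is decoded; they are replayed together at the next command-id
 * change, mirroring what commit-time processing does today. */
typedef struct ToyTxn
{
    InvalMessage *invalidations;
    int           ninvalidations;
} ToyTxn;

static void
add_invalidations(ToyTxn *txn, int nmsgs, const InvalMessage *msgs)
{
    txn->invalidations = realloc(txn->invalidations,
                                 sizeof(InvalMessage) * (txn->ninvalidations + nmsgs));
    memcpy(txn->invalidations + txn->ninvalidations, msgs,
           sizeof(InvalMessage) * nmsgs);
    txn->ninvalidations += nmsgs;
}

static void
execute_invalidations(const ToyTxn *txn)
{
    for (int i = 0; i < txn->ninvalidations; i++)
        printf("invalidate cache %d (db %u, rel %u)\n",
               txn->invalidations[i].id,
               txn->invalidations[i].dbId,
               txn->invalidations[i].relId);
}

int main(void)
{
    ToyTxn       txn = {NULL, 0};
    InvalMessage m1 = {41, 12345, 16384};   /* e.g. a relcache invalidation */
    InvalMessage m2 = {42, 12345, 16385};   /* e.g. a catcache invalidation */

    add_invalidations(&txn, 1, &m1);
    add_invalidations(&txn, 1, &m2);
    execute_invalidations(&txn);            /* would run at the command-id change */
    free(txn.invalidations);
    return 0;
}

The replies below weigh the trade-off: batching replays every accumulated message at each command-id change, while per-message execution touches only the cache the message names.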
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Dilip Kumar
Date:
On Fri, Nov 15, 2019 at 3:50 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Thu, Nov 14, 2019 at 3:40 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > > > Apart from this, I have another question in > > 0003-Issue-individual-invalidations-with-wal_level-logical.patch > > > > @@ -543,6 +588,18 @@ RegisterSnapshotInvalidation(Oid dbId, Oid relId) > > { > > AddSnapshotInvalidationMessage(&transInvalInfo->CurrentCmdInvalidMsgs, > > dbId, relId); > > + > > + /* Issue an invalidation WAL record (when wal_level=logical) */ > > + if (XLogLogicalInfoActive()) > > + { > > + SharedInvalidationMessage msg; > > + > > + msg.sn.id = SHAREDINVALSNAPSHOT_ID; > > + msg.sn.dbId = dbId; > > + msg.sn.relId = relId; > > + > > + LogLogicalInvalidations(1, &msg, false); > > + } > > } > > > > I am not sure why do we need to explicitly WAL log the snapshot > > invalidation? because this is logged for invalidating the catalog > > snapshot and for logical decoding we use HistoricSnapshot, not the > > catalog snapshot. > > > > I think it has been logged because without this patch as well we log > all the invalidation messages at commit time and process them during > decoding. However, I agree that this particular invalidation message > is not required for logical decoding for the reason you mentioned. I > think as we are explicitly logging invalidations, so it is better to > avoid this if we can. Ok > > Few other comments on this patch: > 1. > + case REORDER_BUFFER_CHANGE_INVALIDATION: > + > + /* > + * Execute the invalidation message locally. > + * > + * XXX Do we need to care about relcacheInitFileInval and > + * the other fields added to ReorderBufferChange, or just > + * about the message itself? > + */ > + LocalExecuteInvalidationMessage(&change->data.inval.msg); > + break; > > Here, why are we executing messages individually? Can't we just > follow what we do in DecodeCommit which is to record the invalidations > in ReorderBufferTXN as we encounter them and then allow them to > execute on each REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID. Is there a > reason why we don't do ReorderBufferXidSetCatalogChanges when we > receive any invalidation message? IMHO, the reason is that in DecodeCommit, we get all the invalidation at one time so, at REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID, we don't know which invalidation message to execute so for being safe we have to execute all. But, since we are logging all invalidation individually, we exactly know at this stage which cache to invalidate. So it is better to only invalidate required cache not all. > > 2. > @@ -3025,8 +3073,8 @@ ReorderBufferRestoreChange(ReorderBuffer *rb, > ReorderBufferTXN *txn, > * although we don't check the memory limit when restoring the changes in > * this branch (we only do that when initially queueing the changes after > * decoding), because we will release the changes later, and that will > - * update the accounting too (subtracting the size from the counters). > - * And we don't want to underflow there. > + * update the accounting too (subtracting the size from the counters). And > + * we don't want to underflow there. > */ > > This seems like an unrelated change. Indeed. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Amit Kapila
Date:
On Fri, Nov 15, 2019 at 4:01 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Fri, Nov 15, 2019 at 3:50 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > Few other comments on this patch: > > 1. > > + case REORDER_BUFFER_CHANGE_INVALIDATION: > > + > > + /* > > + * Execute the invalidation message locally. > > + * > > + * XXX Do we need to care about relcacheInitFileInval and > > + * the other fields added to ReorderBufferChange, or just > > + * about the message itself? > > + */ > > + LocalExecuteInvalidationMessage(&change->data.inval.msg); > > + break; > > > > Here, why are we executing messages individually? Can't we just > > follow what we do in DecodeCommit which is to record the invalidations > > in ReorderBufferTXN as we encounter them and then allow them to > > execute on each REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID. Is there a > > reason why we don't do ReorderBufferXidSetCatalogChanges when we > > receive any invalidation message? > IMHO, the reason is that in DecodeCommit, we get all the invalidation > at one time so, at REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID, we don't > know which invalidation message to execute so for being safe we have > to execute all. But, since we are logging all invalidation > individually, we exactly know at this stage which cache to invalidate. > So it is better to only invalidate required cache not all. > In that case, invalidations can be processed multiple times, the first time when these individual WAL logs for invalidation are processed and then later at commit time when we accumulate all invalidation messages and then execute them for REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID. Can we avoid to execute invalidations from other places after this patch which also includes executing them as part of XLOG_INVALIDATIONS processing? -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Amit Kapila
Date:
On Thu, Nov 7, 2019 at 5:13 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > Some notes before commit: > -------------------------------------- > 1. > Commit message need to be changed for the first patch > ------------------------------------------------------------------------- > A. > > The memory limit is defined by a new logical_decoding_work_mem GUC, so for example we can do this > > SET logical_decoding_work_mem = '128kB' > > > to trigger very aggressive streaming. The minimum value is 64kB. > > I think this patch doesn't contain streaming, so we either need to > reword it or remove it. > > B. > > The logical_decoding_work_mem may be set either in postgresql.conf, in which case it serves as the default for all publisherson that instance, or when creating the > > subscription, using a work_mem paramemter in the WITH clause (specifies number of kilobytes). > > We need to reword this as we have decided to remove the setting from > the subscription side as of now. > > 2. I think we can change the message level in UpdateSpillStats() to DEBUG2. > I have made these modifications and additionally ran pgindent. > 4. I think we can combine both patches and commit as one patch, but it > is okay to commit them separately as well. > I am not sure if this is a good idea, so still kept them as separate. Tomas, do let me know if you want to commit these or if you have any comments, otherwise, I will commit these on Tuesday (19-Nov)? -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Attachment
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Dilip Kumar
Date:
On Fri, Nov 15, 2019 at 4:19 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Fri, Nov 15, 2019 at 4:01 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > On Fri, Nov 15, 2019 at 3:50 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > > > Few other comments on this patch: > > > 1. > > > + case REORDER_BUFFER_CHANGE_INVALIDATION: > > > + > > > + /* > > > + * Execute the invalidation message locally. > > > + * > > > + * XXX Do we need to care about relcacheInitFileInval and > > > + * the other fields added to ReorderBufferChange, or just > > > + * about the message itself? > > > + */ > > > + LocalExecuteInvalidationMessage(&change->data.inval.msg); > > > + break; > > > > > > Here, why are we executing messages individually? Can't we just > > > follow what we do in DecodeCommit which is to record the invalidations > > > in ReorderBufferTXN as we encounter them and then allow them to > > > execute on each REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID. Is there a > > > reason why we don't do ReorderBufferXidSetCatalogChanges when we > > > receive any invalidation message? I think it's fine to call ReorderBufferXidSetCatalogChanges, only on commit. Because this is required to add any committed transaction to the snapshot if it has done any catalog changes. So I think there is no point in setting that flag every time we get an invalidation message. > > IMHO, the reason is that in DecodeCommit, we get all the invalidation > > at one time so, at REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID, we don't > > know which invalidation message to execute so for being safe we have > > to execute all. But, since we are logging all invalidation > > individually, we exactly know at this stage which cache to invalidate. > > So it is better to only invalidate required cache not all. > > > > In that case, invalidations can be processed multiple times, the first > time when these individual WAL logs for invalidation are processed and > then later at commit time when we accumulate all invalidation messages > and then execute them for REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID. > Can we avoid to execute invalidations from other places after this > patch which also includes executing them as part of XLOG_INVALIDATIONS > processing? I think we can avoid invalidation which is done as part of REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID. I need to further investigate the invalidation which is done as part of XLOG_INVALIDATIONS. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Amit Kapila
Date:
On Mon, Nov 18, 2019 at 5:02 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Fri, Nov 15, 2019 at 4:19 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > On Fri, Nov 15, 2019 at 4:01 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > > > On Fri, Nov 15, 2019 at 3:50 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > > > > > > Few other comments on this patch: > > > > 1. > > > > + case REORDER_BUFFER_CHANGE_INVALIDATION: > > > > + > > > > + /* > > > > + * Execute the invalidation message locally. > > > > + * > > > > + * XXX Do we need to care about relcacheInitFileInval and > > > > + * the other fields added to ReorderBufferChange, or just > > > > + * about the message itself? > > > > + */ > > > > + LocalExecuteInvalidationMessage(&change->data.inval.msg); > > > > + break; > > > > > > > > Here, why are we executing messages individually? Can't we just > > > > follow what we do in DecodeCommit which is to record the invalidations > > > > in ReorderBufferTXN as we encounter them and then allow them to > > > > execute on each REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID. Is there a > > > > reason why we don't do ReorderBufferXidSetCatalogChanges when we > > > > receive any invalidation message? > > I think it's fine to call ReorderBufferXidSetCatalogChanges, only on > commit. Because this is required to add any committed transaction to > the snapshot if it has done any catalog changes. > Hmm, this is also used to build cid hash map (see ReorderBufferBuildTupleCidHash) which we need to use while streaming changes for the in-progress transactions. So, I think that it would be required earlier (before commit) as well. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Amit Kapila
Date:
On Sat, Nov 16, 2019 at 6:44 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Thu, Nov 7, 2019 at 5:13 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > Some notes before commit: > > -------------------------------------- > > 1. > > Commit message need to be changed for the first patch > > ------------------------------------------------------------------------- > > A. > > > The memory limit is defined by a new logical_decoding_work_mem GUC, so for example we can do this > > > > SET logical_decoding_work_mem = '128kB' > > > > > to trigger very aggressive streaming. The minimum value is 64kB. > > > > I think this patch doesn't contain streaming, so we either need to > > reword it or remove it. > > > > B. > > > The logical_decoding_work_mem may be set either in postgresql.conf, in which case it serves as the default for allpublishers on that instance, or when creating the > > > subscription, using a work_mem paramemter in the WITH clause (specifies number of kilobytes). > > > > We need to reword this as we have decided to remove the setting from > > the subscription side as of now. > > > > 2. I think we can change the message level in UpdateSpillStats() to DEBUG2. > > > > I have made these modifications and additionally ran pgindent. > > > 4. I think we can combine both patches and commit as one patch, but it > > is okay to commit them separately as well. > > > > I am not sure if this is a good idea, so still kept them as separate. > I have committed the first patch. I will commit the second one related to stats of spilled xacts on Thursday. The second patch needs catalog version bump as well because we are modifying the catalog contents in that patch. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Dilip Kumar
Date:
On Tue, Nov 19, 2019 at 5:23 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Mon, Nov 18, 2019 at 5:02 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > On Fri, Nov 15, 2019 at 4:19 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > On Fri, Nov 15, 2019 at 4:01 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > > > > > On Fri, Nov 15, 2019 at 3:50 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > > > > > > > > > Few other comments on this patch: > > > > > 1. > > > > > + case REORDER_BUFFER_CHANGE_INVALIDATION: > > > > > + > > > > > + /* > > > > > + * Execute the invalidation message locally. > > > > > + * > > > > > + * XXX Do we need to care about relcacheInitFileInval and > > > > > + * the other fields added to ReorderBufferChange, or just > > > > > + * about the message itself? > > > > > + */ > > > > > + LocalExecuteInvalidationMessage(&change->data.inval.msg); > > > > > + break; > > > > > > > > > > Here, why are we executing messages individually? Can't we just > > > > > follow what we do in DecodeCommit which is to record the invalidations > > > > > in ReorderBufferTXN as we encounter them and then allow them to > > > > > execute on each REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID. Is there a > > > > > reason why we don't do ReorderBufferXidSetCatalogChanges when we > > > > > receive any invalidation message? > > > > I think it's fine to call ReorderBufferXidSetCatalogChanges, only on > > commit. Because this is required to add any committed transaction to > > the snapshot if it has done any catalog changes. > > > > Hmm, this is also used to build cid hash map (see > ReorderBufferBuildTupleCidHash) which we need to use while streaming > changes for the in-progress transactions. So, I think that it would > be required earlier (before commit) as well. > Oh right, I guess I missed that part. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Dilip Kumar
Date:
On Wed, Nov 20, 2019 at 11:15 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Tue, Nov 19, 2019 at 5:23 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > On Mon, Nov 18, 2019 at 5:02 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > > > On Fri, Nov 15, 2019 at 4:19 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > > On Fri, Nov 15, 2019 at 4:01 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > > > > > > > On Fri, Nov 15, 2019 at 3:50 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > > > > > > > > > > > > Few other comments on this patch: > > > > > > 1. > > > > > > + case REORDER_BUFFER_CHANGE_INVALIDATION: > > > > > > + > > > > > > + /* > > > > > > + * Execute the invalidation message locally. > > > > > > + * > > > > > > + * XXX Do we need to care about relcacheInitFileInval and > > > > > > + * the other fields added to ReorderBufferChange, or just > > > > > > + * about the message itself? > > > > > > + */ > > > > > > + LocalExecuteInvalidationMessage(&change->data.inval.msg); > > > > > > + break; > > > > > > > > > > > > Here, why are we executing messages individually? Can't we just > > > > > > follow what we do in DecodeCommit which is to record the invalidations > > > > > > in ReorderBufferTXN as we encounter them and then allow them to > > > > > > execute on each REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID. Is there a > > > > > > reason why we don't do ReorderBufferXidSetCatalogChanges when we > > > > > > receive any invalidation message? > > > > > > I think it's fine to call ReorderBufferXidSetCatalogChanges, only on > > > commit. Because this is required to add any committed transaction to > > > the snapshot if it has done any catalog changes. > > > > > > > Hmm, this is also used to build cid hash map (see > > ReorderBufferBuildTupleCidHash) which we need to use while streaming > > changes for the in-progress transactions. So, I think that it would > > be required earlier (before commit) as well. > > > Oh right, I guess I missed that part. Attached a new rebased version of the patch set. I have fixed all the issues discussed up-thread and agreed upon. Pending Issues: 1. The default value of the logical_decoding_work_mem is set to 64kb in test_decoding/logical.conf. So we need to change the expected output files for the test decoding module. 2. Need to complete the patch for concurrent abort handling of the (sub)transaction. There are some pending issues with the existing patch[1]. [1] https://www.postgresql.org/message-id/CAFiTN-ud98kWHCo2YKS55H8rGw3_A7ESyssHwU0xPU6KJsoy6A%40mail.gmail.com -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Attachment
- 0002-Immediately-WAL-log-assignments.patch
- 0001-Track-statistics-for-spilling-of-changes-from-Reorde.patch
- 0003-Issue-individual-invalidations-with-wal_level-logica.patch
- 0005-Cleaning-up-of-flags-in-ReorderBufferTXN-structure.patch
- 0004-Extend-the-output-plugin-API-with-stream-methods.patch
- 0006-Gracefully-handle-concurrent-aborts-of-uncommitted-t.patch
- 0007-Implement-streaming-mode-in-ReorderBuffer.patch
- 0008-Support-logical_decoding_work_mem-set-from-create-su.patch
- 0010-Track-statistics-for-streaming.patch
- 0009-Add-support-for-streaming-to-built-in-replication.patch
- 0012-BUGFIX-set-final_lsn-for-subxacts-before-cleanup.patch
- 0011-Enable-streaming-for-all-subscription-TAP-tests.patch
- 0013-Add-TAP-test-for-streaming-vs.-DDL.patch
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Dilip Kumar
Date:
On Wed, Nov 20, 2019 at 8:22 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Wed, Nov 20, 2019 at 11:15 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > On Tue, Nov 19, 2019 at 5:23 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > On Mon, Nov 18, 2019 at 5:02 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > > > > > On Fri, Nov 15, 2019 at 4:19 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > > > > On Fri, Nov 15, 2019 at 4:01 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > > > > > > > > > On Fri, Nov 15, 2019 at 3:50 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > > > > > > > > > > > > > > > Few other comments on this patch: > > > > > > > 1. > > > > > > > + case REORDER_BUFFER_CHANGE_INVALIDATION: > > > > > > > + > > > > > > > + /* > > > > > > > + * Execute the invalidation message locally. > > > > > > > + * > > > > > > > + * XXX Do we need to care about relcacheInitFileInval and > > > > > > > + * the other fields added to ReorderBufferChange, or just > > > > > > > + * about the message itself? > > > > > > > + */ > > > > > > > + LocalExecuteInvalidationMessage(&change->data.inval.msg); > > > > > > > + break; > > > > > > > > > > > > > > Here, why are we executing messages individually? Can't we just > > > > > > > follow what we do in DecodeCommit which is to record the invalidations > > > > > > > in ReorderBufferTXN as we encounter them and then allow them to > > > > > > > execute on each REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID. Is there a > > > > > > > reason why we don't do ReorderBufferXidSetCatalogChanges when we > > > > > > > receive any invalidation message? > > > > > > > > I think it's fine to call ReorderBufferXidSetCatalogChanges, only on > > > > commit. Because this is required to add any committed transaction to > > > > the snapshot if it has done any catalog changes. > > > > > > > > > > Hmm, this is also used to build cid hash map (see > > > ReorderBufferBuildTupleCidHash) which we need to use while streaming > > > changes for the in-progress transactions. So, I think that it would > > > be required earlier (before commit) as well. > > > > > Oh right, I guess I missed that part. > > Attached a new rebased version of the patch set. I have fixed all > the issues discussed up-thread and agreed upon. > > Pending Issues: > 1. The default value of the logical_decoding_work_mem is set to 64kb > in test_decoding/logical.conf. So we need to change the expected > output files for the test decoding module. > 2. Need to complete the patch for concurrent abort handling of the > (sub)transaction. There are some pending issues with the existing > patch[1]. > [1] https://www.postgresql.org/message-id/CAFiTN-ud98kWHCo2YKS55H8rGw3_A7ESyssHwU0xPU6KJsoy6A%40mail.gmail.com Apart from these there is one more issue reported upthread[2] [2] https://www.postgresql.org/message-id/CAFiTN-vrSNkAfRVrWKe2R1dqFBTubjt%3DDYS%3DjhH%2BjiCoBODdaw%40mail.gmail.com -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Amit Kapila
Date:
On Tue, Nov 19, 2019 at 5:25 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Sat, Nov 16, 2019 at 6:44 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > On Thu, Nov 7, 2019 at 5:13 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > Some notes before commit: > > > -------------------------------------- > > > 1. > > > Commit message need to be changed for the first patch > > > ------------------------------------------------------------------------- > > > A. > > > > The memory limit is defined by a new logical_decoding_work_mem GUC, so for example we can do this > > > > > > SET logical_decoding_work_mem = '128kB' > > > > > > > to trigger very aggressive streaming. The minimum value is 64kB. > > > > > > I think this patch doesn't contain streaming, so we either need to > > > reword it or remove it. > > > > > > B. > > > > The logical_decoding_work_mem may be set either in postgresql.conf, in which case it serves as the default for allpublishers on that instance, or when creating the > > > > subscription, using a work_mem paramemter in the WITH clause (specifies number of kilobytes). > > > > > > We need to reword this as we have decided to remove the setting from > > > the subscription side as of now. > > > > > > 2. I think we can change the message level in UpdateSpillStats() to DEBUG2. > > > > > > > I have made these modifications and additionally ran pgindent. > > > > > 4. I think we can combine both patches and commit as one patch, but it > > > is okay to commit them separately as well. > > > > > > > I am not sure if this is a good idea, so still kept them as separate. > > > > I have committed the first patch. I will commit the second one > related to stats of spilled xacts on Thursday. The second patch needs > catalog version bump as well because we are modifying the catalog > contents in that patch. > Committed the second one as well. Now, we can move to a review of patches for "streaming of in-progress transactions". -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Dilip Kumar
Date:
On Thu, Nov 21, 2019 at 9:02 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Wed, Nov 20, 2019 at 8:22 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > On Wed, Nov 20, 2019 at 11:15 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > > > On Tue, Nov 19, 2019 at 5:23 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > > On Mon, Nov 18, 2019 at 5:02 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > > > > > > > On Fri, Nov 15, 2019 at 4:19 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > > > > > > On Fri, Nov 15, 2019 at 4:01 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > > > > > > > > > > > On Fri, Nov 15, 2019 at 3:50 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > > > > > > > > > > > > > > > > > > Few other comments on this patch: > > > > > > > > 1. > > > > > > > > + case REORDER_BUFFER_CHANGE_INVALIDATION: > > > > > > > > + > > > > > > > > + /* > > > > > > > > + * Execute the invalidation message locally. > > > > > > > > + * > > > > > > > > + * XXX Do we need to care about relcacheInitFileInval and > > > > > > > > + * the other fields added to ReorderBufferChange, or just > > > > > > > > + * about the message itself? > > > > > > > > + */ > > > > > > > > + LocalExecuteInvalidationMessage(&change->data.inval.msg); > > > > > > > > + break; > > > > > > > > > > > > > > > > Here, why are we executing messages individually? Can't we just > > > > > > > > follow what we do in DecodeCommit which is to record the invalidations > > > > > > > > in ReorderBufferTXN as we encounter them and then allow them to > > > > > > > > execute on each REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID. Is there a > > > > > > > > reason why we don't do ReorderBufferXidSetCatalogChanges when we > > > > > > > > receive any invalidation message? > > > > > > > > > > I think it's fine to call ReorderBufferXidSetCatalogChanges, only on > > > > > commit. Because this is required to add any committed transaction to > > > > > the snapshot if it has done any catalog changes. > > > > > > > > > > > > > Hmm, this is also used to build cid hash map (see > > > > ReorderBufferBuildTupleCidHash) which we need to use while streaming > > > > changes for the in-progress transactions. So, I think that it would > > > > be required earlier (before commit) as well. > > > > > > > Oh right, I guess I missed that part. > > > > Attached a new rebased version of the patch set. I have fixed all > > the issues discussed up-thread and agreed upon. > > > > Pending Issues: > > 1. The default value of the logical_decoding_work_mem is set to 64kb > > in test_decoding/logical.conf. So we need to change the expected > > output files for the test decoding module. > > 2. Need to complete the patch for concurrent abort handling of the > > (sub)transaction. There are some pending issues with the existing > > patch[1]. > > [1] https://www.postgresql.org/message-id/CAFiTN-ud98kWHCo2YKS55H8rGw3_A7ESyssHwU0xPU6KJsoy6A%40mail.gmail.com > Apart from these there is one more issue reported upthread[2] > [2] https://www.postgresql.org/message-id/CAFiTN-vrSNkAfRVrWKe2R1dqFBTubjt%3DDYS%3DjhH%2BjiCoBODdaw%40mail.gmail.com > I have rebased the patch on the latest head and also fix the issue of "concurrent abort handling of the (sub)transaction." and attached as (v1-0013-Extend-handling-of-concurrent-aborts-for-streamin) along with the complete patch set. I have added the version number so that we can track the changes. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Attachment
- v1-0001-Immediately-WAL-log-assignments.patch
- v1-0002-Issue-individual-invalidations-with-wal_level-log.patch
- v1-0004-Cleaning-up-of-flags-in-ReorderBufferTXN-structur.patch
- v1-0005-Gracefully-handle-concurrent-aborts-of-uncommitte.patch
- v1-0003-Extend-the-output-plugin-API-with-stream-methods.patch
- v1-0006-Implement-streaming-mode-in-ReorderBuffer.patch
- v1-0007-Support-logical_decoding_work_mem-set-from-create.patch
- v1-0008-Add-support-for-streaming-to-built-in-replication.patch
- v1-0009-Track-statistics-for-streaming.patch
- v1-0010-Enable-streaming-for-all-subscription-TAP-tests.patch
- v1-0011-BUGFIX-set-final_lsn-for-subxacts-before-cleanup.patch
- v1-0012-Add-TAP-test-for-streaming-vs.-DDL.patch
- v1-0013-Extend-handling-of-concurrent-aborts-for-streamin.patch
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Michael Paquier
Date:
On Fri, Nov 22, 2019 at 01:18:11PM +0530, Dilip Kumar wrote: > I have rebased the patch on the latest head and also fix the issue of > "concurrent abort handling of the (sub)transaction." and attached as > (v1-0013-Extend-handling-of-concurrent-aborts-for-streamin) along with > the complete patch set. I have added the version number so that we > can track the changes. The patch has rotten a bit and does not apply anymore. Could you please send a rebased version? I have moved it to next CF, waiting on author. -- Michael
Attachment
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Dilip Kumar
Date:
On Sun, Dec 1, 2019 at 7:58 AM Michael Paquier <michael@paquier.xyz> wrote: > > On Fri, Nov 22, 2019 at 01:18:11PM +0530, Dilip Kumar wrote: > > I have rebased the patch on the latest head and also fix the issue of > > "concurrent abort handling of the (sub)transaction." and attached as > > (v1-0013-Extend-handling-of-concurrent-aborts-for-streamin) along with > > the complete patch set. I have added the version number so that we > > can track the changes. > > The patch has rotten a bit and does not apply anymore. Could you > please send a rebased version? I have moved it to next CF, waiting on > author. I have rebased the patch set on the latest head. Apart from this, there is one issue reported by my colleague Vignesh. The issue is that if we use more than two relations in a transaction then there is an error on standby (no relation map entry for remote relation ID 16390). After analyzing I have found that for the streaming transaction an "is_schema_sent" flag is kept in ReorderBufferTXN. And, I think that is done so that we can send the schema for each transaction stream so that if any subtransaction gets aborted we don't lose the logical WAL for that schema. But, this solution has induced a very basic issue that if a transaction operate on more than 1 relation then after sending the schema for the first relation it will mark the flag true and the schema for the subsequent relations will never be sent. I am still working on finding a better solution for this if anyone has any opinion/solution about this feel free to suggest. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Attachment
- v2-0001-Immediately-WAL-log-assignments.patch
- v2-0002-Issue-individual-invalidations-with-wal_level-log.patch
- v2-0004-Cleaning-up-of-flags-in-ReorderBufferTXN-structur.patch
- v2-0005-Gracefully-handle-concurrent-aborts-of-uncommitte.patch
- v2-0003-Extend-the-output-plugin-API-with-stream-methods.patch
- v2-0006-Implement-streaming-mode-in-ReorderBuffer.patch
- v2-0007-Support-logical_decoding_work_mem-set-from-create.patch
- v2-0008-Add-support-for-streaming-to-built-in-replication.patch
- v2-0009-Track-statistics-for-streaming.patch
- v2-0010-Enable-streaming-for-all-subscription-TAP-tests.patch
- v2-0011-BUGFIX-set-final_lsn-for-subxacts-before-cleanup.patch
- v2-0012-Add-TAP-test-for-streaming-vs.-DDL.patch
- v2-0013-Extend-handling-of-concurrent-aborts-for-streamin.patch
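To illustrate the is_schema_sent failure mode described above, here is a standalone toy model (not pgoutput's real code): with a single per-transaction boolean, the first relation sets the flag and the schema of every relation touched after it is silently skipped, which is what leaves the subscriber without a relation map entry.

#include <stdbool.h>
#include <stdio.h>

/* Toy model of the bug: one "schema sent" flag per transaction instead of
 * tracking it per relation. */
typedef struct ToyTxn
{
    bool is_schema_sent;
} ToyTxn;

static void
maybe_send_schema(ToyTxn *txn, const char *relname)
{
    if (txn->is_schema_sent)
    {
        printf("skip schema for %s (flag already set by another relation!)\n", relname);
        return;
    }
    printf("send schema for %s\n", relname);
    txn->is_schema_sent = true;
}

int main(void)
{
    ToyTxn txn = {false};

    maybe_send_schema(&txn, "tab_a");   /* schema sent */
    maybe_send_schema(&txn, "tab_b");   /* wrongly skipped -> error on the subscriber */
    return 0;
}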
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Dilip Kumar
Date:
On Mon, Dec 2, 2019 at 2:01 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Sun, Dec 1, 2019 at 7:58 AM Michael Paquier <michael@paquier.xyz> wrote: > > > > On Fri, Nov 22, 2019 at 01:18:11PM +0530, Dilip Kumar wrote: > > > I have rebased the patch on the latest head and also fix the issue of > > > "concurrent abort handling of the (sub)transaction." and attached as > > > (v1-0013-Extend-handling-of-concurrent-aborts-for-streamin) along with > > > the complete patch set. I have added the version number so that we > > > can track the changes. > > > > The patch has rotten a bit and does not apply anymore. Could you > > please send a rebased version? I have moved it to next CF, waiting on > > author. > > I have rebased the patch set on the latest head. > I have review the patch set and here are few comments/questions 1. +static void +pg_decode_stream_change(LogicalDecodingContext *ctx, + ReorderBufferTXN *txn, + Relation relation, + ReorderBufferChange *change) +{ + OutputPluginPrepareWrite(ctx, true); + appendStringInfo(ctx->out, "streaming change for TXN %u", txn->xid); + OutputPluginWrite(ctx, true); +} Should we show the tuple in the streamed change like we do for the pg_decode_change? 2. pg_logical_slot_get_changes_guts It recreate the decoding slot [ctx = CreateDecodingContext(InvalidXLogRecPtr] but doesn't set the streaming to false, should we pass a parameter to pg_logical_slot_get_changes_guts saying whether we want streamed results or not 3. + XLogRecPtr prev_lsn = InvalidXLogRecPtr; ReorderBufferChange *change; ReorderBufferChange *specinsert = NULL; @@ -1565,6 +1965,16 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid, Relation relation = NULL; Oid reloid; + /* + * Enforce correct ordering of changes, merged from multiple + * subtransactions. The changes may have the same LSN due to + * MULTI_INSERT xlog records. + */ + if (prev_lsn != InvalidXLogRecPtr) + Assert(prev_lsn <= change->lsn); + + prev_lsn = change->lsn; I did not understand, how this change is relavent to this patch 4. + /* + * TOCHECK: We have to rebuild historic snapshot to be sure it includes all + * information about subtransactions, which could arrive after streaming start. + */ + if (!txn->is_schema_sent) + snapshot_now = ReorderBufferCopySnap(rb, txn->base_snapshot, + txn, command_id); In which case, txn->is_schema_sent will be true, because at the end of the stream in ReorderBufferExecuteInvalidations we are always setting it false, so while sending next stream it will always be false. That means we never required snapshot_now variable in ReorderBufferTXN. 5. @@ -2299,6 +2746,23 @@ ReorderBufferXidSetCatalogChanges(ReorderBuffer *rb, TransactionId xid, txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true); txn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES; + + /* + * We read catalog changes from WAL, which are not yet sent, so + * invalidate current schema in order output plugin can resend + * schema again. + */ + txn->is_schema_sent = false; Same as point 4, during decode time it will never be true. 6. + /* send fields */ + pq_sendint64(out, commit_lsn); + pq_sendint64(out, txn->end_lsn); + pq_sendint64(out, txn->commit_time); Commit_time and end_lsn is used in standby_feedback 7. 
+ /* FIXME optimize the search by bsearch on sorted data */ + for (i = nsubxacts; i > 0; i--) + { + if (subxacts[i - 1].xid == subxid) + { + subidx = (i - 1); + found = true; + break; + } + } We cannot roll back an intermediate subtransaction without rolling back the latest sub-transaction, so why do we need to search in the array? It will always be the last subxact, no? 8. + /* + * send feedback to upstream + * + * XXX Probably should send a valid LSN. But which one? + */ + send_feedback(InvalidXLogRecPtr, false, false); Why is feedback sent for every change? -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
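A standalone sketch of the simplification point 7 hints at: if subtransaction aborts always arrive innermost-first, the aborted xid must be the last entry of the assignment-ordered subxacts array, so neither a linear scan nor a bsearch is needed. A real implementation would still want a fallback for the unexpected case; the types below are toys:

#include <stdio.h>

typedef unsigned int TransactionId;

typedef struct SubXactInfo
{
    TransactionId xid;
} SubXactInfo;

/* Return the index of the aborted subtransaction, relying on aborts arriving
 * innermost-first, i.e. always for the last assigned subxact. */
static int
find_aborted_subxact(const SubXactInfo *subxacts, int nsubxacts, TransactionId subxid)
{
    if (nsubxacts > 0 && subxacts[nsubxacts - 1].xid == subxid)
        return nsubxacts - 1;
    return -1;              /* unexpected: fall back to a full search */
}

int main(void)
{
    SubXactInfo subs[] = {{730}, {731}, {735}};

    printf("abort of 735 -> index %d\n", find_aborted_subxact(subs, 3, 735));
    printf("abort of 731 -> index %d\n", find_aborted_subxact(subs, 3, 731));
    return 0;
}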
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Amit Kapila
Date:
On Mon, Dec 2, 2019 at 2:02 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Sun, Dec 1, 2019 at 7:58 AM Michael Paquier <michael@paquier.xyz> wrote: > > > > On Fri, Nov 22, 2019 at 01:18:11PM +0530, Dilip Kumar wrote: > > > I have rebased the patch on the latest head and also fix the issue of > > > "concurrent abort handling of the (sub)transaction." and attached as > > > (v1-0013-Extend-handling-of-concurrent-aborts-for-streamin) along with > > > the complete patch set. I have added the version number so that we > > > can track the changes. > > > > The patch has rotten a bit and does not apply anymore. Could you > > please send a rebased version? I have moved it to next CF, waiting on > > author. > > I have rebased the patch set on the latest head. > > Apart from this, there is one issue reported by my colleague Vignesh. > The issue is that if we use more than two relations in a transaction > then there is an error on standby (no relation map entry for remote > relation ID 16390). After analyzing I have found that for the > streaming transaction an "is_schema_sent" flag is kept in > ReorderBufferTXN. And, I think that is done so that we can send the > schema for each transaction stream so that if any subtransaction gets > aborted we don't lose the logical WAL for that schema. But, this > solution has induced a very basic issue that if a transaction operate > on more than 1 relation then after sending the schema for the first > relation it will mark the flag true and the schema for the subsequent > relations will never be sent. > How about keeping a list of top-level xids in each RelationSyncEntry? Basically, whenever we send the schema for any transaction, we note that in RelationSyncEntry and at abort time we can remove xid from the list. Now, whenever, we check whether to send schema for any operation in a transaction, we will check if our xid is present in that list for a particular RelationSyncEntry and take an action based on that (if xid is present, then we won't send the schema, otherwise, send it). -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
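A standalone toy sketch of this proposal, with names only loosely modeled on pgoutput's RelationSyncEntry and a fixed-size array standing in for a proper list of top-level xids:

#include <stdbool.h>
#include <stdio.h>

typedef unsigned int TransactionId;

#define MAX_XIDS 16     /* toy bound; a real implementation would use a list */

/* Per-relation entry remembering which top-level transactions have already
 * had this relation's schema streamed to the subscriber. */
typedef struct ToyRelationSyncEntry
{
    unsigned      relid;
    TransactionId streamed_xids[MAX_XIDS];
    int           nxids;
} ToyRelationSyncEntry;

static bool
schema_already_sent(const ToyRelationSyncEntry *entry, TransactionId topxid)
{
    for (int i = 0; i < entry->nxids; i++)
        if (entry->streamed_xids[i] == topxid)
            return true;
    return false;
}

static void
mark_schema_sent(ToyRelationSyncEntry *entry, TransactionId topxid)
{
    if (entry->nxids < MAX_XIDS)
        entry->streamed_xids[entry->nxids++] = topxid;
}

/* On (sub)transaction abort, forget the xid so the schema is re-sent later. */
static void
forget_xid_on_abort(ToyRelationSyncEntry *entry, TransactionId topxid)
{
    for (int i = 0; i < entry->nxids; i++)
        if (entry->streamed_xids[i] == topxid)
        {
            entry->streamed_xids[i] = entry->streamed_xids[--entry->nxids];
            return;
        }
}

int main(void)
{
    ToyRelationSyncEntry rel = {16384, {0}, 0};
    TransactionId        txn = 1000;

    if (!schema_already_sent(&rel, txn))
        mark_schema_sent(&rel, txn);        /* schema goes out with the first change */
    printf("later change, schema needed again? %d\n", !schema_already_sent(&rel, txn));

    forget_xid_on_abort(&rel, txn);         /* abort: schema must be re-sent next time */
    printf("after abort, schema needed again? %d\n", !schema_already_sent(&rel, txn));
    return 0;
}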
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Dilip Kumar
Date:
On Tue, Dec 10, 2019 at 9:52 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Mon, Dec 2, 2019 at 2:02 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > On Sun, Dec 1, 2019 at 7:58 AM Michael Paquier <michael@paquier.xyz> wrote: > > > > > > On Fri, Nov 22, 2019 at 01:18:11PM +0530, Dilip Kumar wrote: > > > > I have rebased the patch on the latest head and also fix the issue of > > > > "concurrent abort handling of the (sub)transaction." and attached as > > > > (v1-0013-Extend-handling-of-concurrent-aborts-for-streamin) along with > > > > the complete patch set. I have added the version number so that we > > > > can track the changes. > > > > > > The patch has rotten a bit and does not apply anymore. Could you > > > please send a rebased version? I have moved it to next CF, waiting on > > > author. > > > > I have rebased the patch set on the latest head. > > > > Apart from this, there is one issue reported by my colleague Vignesh. > > The issue is that if we use more than two relations in a transaction > > then there is an error on standby (no relation map entry for remote > > relation ID 16390). After analyzing I have found that for the > > streaming transaction an "is_schema_sent" flag is kept in > > ReorderBufferTXN. And, I think that is done so that we can send the > > schema for each transaction stream so that if any subtransaction gets > > aborted we don't lose the logical WAL for that schema. But, this > > solution has induced a very basic issue that if a transaction operate > > on more than 1 relation then after sending the schema for the first > > relation it will mark the flag true and the schema for the subsequent > > relations will never be sent. > > > > How about keeping a list of top-level xids in each RelationSyncEntry? > Basically, whenever we send the schema for any transaction, we note > that in RelationSyncEntry and at abort time we can remove xid from the > list. Now, whenever, we check whether to send schema for any > operation in a transaction, we will check if our xid is present in > that list for a particular RelationSyncEntry and take an action based > on that (if xid is present, then we won't send the schema, otherwise, > send it). The idea make sense to me. I will try to write a patch for this and test. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Amit Kapila
Date:
On Mon, Dec 9, 2019 at 1:27 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > I have review the patch set and here are few comments/questions > > 1. > +static void > +pg_decode_stream_change(LogicalDecodingContext *ctx, > + ReorderBufferTXN *txn, > + Relation relation, > + ReorderBufferChange *change) > +{ > + OutputPluginPrepareWrite(ctx, true); > + appendStringInfo(ctx->out, "streaming change for TXN %u", txn->xid); > + OutputPluginWrite(ctx, true); > +} > > Should we show the tuple in the streamed change like we do for the > pg_decode_change? > I think so. The patch shows the message in pg_decode_stream_message(), so why to prohibit showing tuple here? > 2. pg_logical_slot_get_changes_guts > It recreate the decoding slot [ctx = > CreateDecodingContext(InvalidXLogRecPtr] but doesn't set the streaming > to false, should we pass a parameter to > pg_logical_slot_get_changes_guts saying whether we want streamed results or not > CreateDecodingContext internally calls StartupDecodingContext which sets the value of streaming based on if the plugin has provided callbacks for streaming functions. Isn't that sufficient? Why do we need additional parameters here? > 3. > + XLogRecPtr prev_lsn = InvalidXLogRecPtr; > ReorderBufferChange *change; > ReorderBufferChange *specinsert = NULL; > > @@ -1565,6 +1965,16 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid, > Relation relation = NULL; > Oid reloid; > > + /* > + * Enforce correct ordering of changes, merged from multiple > + * subtransactions. The changes may have the same LSN due to > + * MULTI_INSERT xlog records. > + */ > + if (prev_lsn != InvalidXLogRecPtr) > + Assert(prev_lsn <= change->lsn); > + > + prev_lsn = change->lsn; > I did not understand, how this change is relavent to this patch > This is just to ensure that changes are in LSN order. I think as we are merging the changes before commit for streaming, it is good to have such an Assertion for ReorderBufferStreamTXN. And, if we want to have it in ReorderBufferStreamTXN, then there is no harm in keeping it in ReorderBufferCommit() at least to keep the code consistent. Do you see any problem with this? > 4. > + /* > + * TOCHECK: We have to rebuild historic snapshot to be sure it includes all > + * information about subtransactions, which could arrive after streaming start. > + */ > + if (!txn->is_schema_sent) > + snapshot_now = ReorderBufferCopySnap(rb, txn->base_snapshot, > + txn, command_id); > > In which case, txn->is_schema_sent will be true, because at the end of > the stream in ReorderBufferExecuteInvalidations we are always setting > it false, > so while sending next stream it will always be false. That means we > never required snapshot_now variable in ReorderBufferTXN. > You are probably right, but as discussed we need to change this part of design/code (when to send schema changes) due to the issues discovered. So, I think this part will anyway change when we fix that problem. > 5. > @@ -2299,6 +2746,23 @@ ReorderBufferXidSetCatalogChanges(ReorderBuffer > *rb, TransactionId xid, > txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true); > > txn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES; > + > + /* > + * We read catalog changes from WAL, which are not yet sent, so > + * invalidate current schema in order output plugin can resend > + * schema again. > + */ > + txn->is_schema_sent = false; > > Same as point 4, during decode time it will never be true. > Sure, my previous point's reply applies here as well. > 6. 
> + /* send fields */ > + pq_sendint64(out, commit_lsn); > + pq_sendint64(out, txn->end_lsn); > + pq_sendint64(out, txn->commit_time); > > Commit_time and end_lsn is used in standby_feedback > I don't understand what you mean by this. Can you be a bit more clear? > > 7. > + /* FIXME optimize the search by bsearch on sorted data */ > + for (i = nsubxacts; i > 0; i--) > + { > + if (subxacts[i - 1].xid == subxid) > + { > + subidx = (i - 1); > + found = true; > + break; > + } > + } > We can not rollback intermediate subtransaction without rollbacking > latest sub-transaction, so why do we need > to search in the array? It will always be the the last subxact no? > The same thing is already mentioned in the comments above this code ("XXX Or perhaps we can rely on the aborts to arrive in the reverse order, i.e. from the inner-most subxact (when nested)? In which case we could simply check the last element."). I think what you are saying is probably right, but we can leave this as it is for now because this is a minor optimization which can be done later as well if required. However, if you see any correctness issue, then we can discuss. > 8. > + /* > + * send feedback to upstream > + * > + * XXX Probably should send a valid LSN. But which one? > + */ > + send_feedback(InvalidXLogRecPtr, false, false); > > Why feedback is sent for every change? > I will study this part of the patch and let you know my opinion. Few comments on this patch series: 0001-Immediately-WAL-log-assignments: ------------------------------------------------------------ The commit message still refers to the old design for this patch. I think you need to modify the commit message as per the latest patch. 0002-Issue-individual-invalidations-with-wal_level-log ---------------------------------------------------------------------------- 1. xact_desc_invalidations(StringInfo buf, { .. + else if (msg->id == SHAREDINVALSNAPSHOT_ID) + appendStringInfo(buf, " snapshot %u", msg->sn.relId); You have removed logging for the above cache but forgot to remove its reference from one of the places. Also, I think you need to add a comment somewhere in inval.c to say why you are writing for WAL for some types of invalidations and not for others? 0003-Extend-the-output-plugin-API-with-stream-methods -------------------------------------------------------------------------------- 1. + are required, while <function>stream_message_cb</function> and + <function>stream_message_cb</function> are optional. stream_message_cb is mentioned twice. It seems the second one is for truncate. 2. size of the transaction size and network bandwidth, the transfer time + may significantly increase the apply lag. /size of the transaction size/size of the transaction no need to mention size twice. 3. + Similarly to spill-to-disk behavior, streaming is triggered when the total + amount of changes decoded from the WAL (for all in-progress transactions) + exceeds limit defined by <varname>logical_work_mem</varname> setting. The guc name used is wrong. /Similarly to/Similar to/ 4. stream_start_cb_wrapper() { .. + /* state.report_location = apply_lsn; */ .. + /* FIXME ctx->write_location = apply_lsn; */ .. } See, if we can fix these and similar in the callback for the stop. I think we don't have final_lsn till we commit/abort. Can we compute before calling these API's? 0005-Gracefully-handle-concurrent-aborts-of-uncommitte ---------------------------------------------------------------------------------- 1. 
@@ -1877,6 +1877,7 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid, PG_CATCH(); { /* TODO: Encapsulate cleanup from the PG_TRY and PG_CATCH blocks */ + if (iterstate) ReorderBufferIterTXNFinish(rb, iterstate); Spurious line change. 2. The commit message of this patch refers to Prepared transactions. I think that needs to be changed. 0006-Implement-streaming-mode-in-ReorderBuffer ------------------------------------------------------------------------- 1. + +/* iterator for streaming (only get data from memory) */ +static ReorderBufferStreamIterTXNState * ReorderBufferStreamIterTXNInit( + ReorderBuffer *rb, + ReorderBufferTXN *txn); + +static ReorderBufferChange *ReorderBufferStreamIterTXNNext( + ReorderBuffer *rb, + ReorderBufferStreamIterTXNState * state); + +static void ReorderBufferStreamIterTXNFinish( + ReorderBuffer *rb, + ReorderBufferStreamIterTXNState * state); Do we really need to introduce new APIs for iterating over changes from streamed transactions? Why can't we reuse the same API's as we use for committed xacts? 2. +static void +ReorderBufferStreamCommit(ReorderBuffer *rb, ReorderBufferTXN *txn) Please write some comments atop ReorderBufferStreamCommit. 3. +ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn) { .. .. + if (txn->snapshot_now == NULL) + { + dlist_iter subxact_i; + + /* make sure this transaction is streamed for the first time */ + Assert(!rbtxn_is_streamed(txn)); + + /* at the beginning we should have invalid command ID */ + Assert(txn->command_id == InvalidCommandId); + + dlist_foreach(subxact_i, &txn->subtxns) + { + ReorderBufferTXN *subtxn; + + subtxn = dlist_container(ReorderBufferTXN, node, subxact_i.cur); + + if (subtxn->base_snapshot != NULL && + (txn->base_snapshot == NULL || + txn->base_snapshot_lsn > subtxn->base_snapshot_lsn)) + { + txn->base_snapshot = subtxn->base_snapshot; The logic here seems to be correct, but I am not sure why it is not considered to purge the base snapshot before assigning the subtxn's snapshot and similarly, we have not purged snapshot for subtxn once we are done with it. I think we can use ReorderBufferTransferSnapToParent to replace part of the logic here. Do you see any reason for doing things differently here? 4. In ReorderBufferStreamTXN, why do you need to use ReorderBufferCopySnap to assign txn->base_snapshot to snapshot_now. 5. I see a lot of code similarity in ReorderBufferStreamTXN and existing ReorderBufferCommit. I understand that there are some subtle differences due to which we need to write this new function but can't we encapsulate the specific parts of code in functions and then call from both places. I am talking about code in different cases for change->action. 6. + * Note: We never stream and serialize a transaction at the same time (e /(e/(we -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
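Regarding the FIXME in point 7 about replacing the reverse linear scan over subxacts with a binary search: a minimal sketch of what that could look like, assuming the subxacts array really is kept sorted by xid as the comment implies; the comparator name and the exact SubXactInfo layout are illustrative only, not part of the posted patch.

/* Hypothetical comparator for bsearch over subxacts, sorted by xid. */
static int
subxact_info_cmp(const void *a, const void *b)
{
	TransactionId xid_a = ((const SubXactInfo *) a)->xid;
	TransactionId xid_b = ((const SubXactInfo *) b)->xid;

	if (xid_a < xid_b)
		return -1;
	if (xid_a > xid_b)
		return 1;
	return 0;
}

	/* ... and the lookup replacing the backwards loop ... */
	SubXactInfo	key;
	SubXactInfo *match;

	key.xid = subxid;
	match = (SubXactInfo *) bsearch(&key, subxacts, nsubxacts,
									sizeof(SubXactInfo), subxact_info_cmp);
	found = (match != NULL);
	if (found)
		subidx = match - subxacts;

Whether the loop is needed at all (as opposed to simply checking the last element, per the XXX comment quoted above) is the open question, so this only shows the narrower optimization.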
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Robert Haas
Date:
On Mon, Dec 2, 2019 at 3:32 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > I have rebased the patch set on the latest head. 0001 looks like a clever approach, but are you sure it doesn't hurt performance when many small XLOG records are being inserted? I think XLogRecordAssemble() can get pretty hot in some workloads. With regard to 0002, logging a separate WAL record for each invalidation seems painful; I think most operations that generate invalidations generate a bunch of them all at once. Perhaps you could just queue up invalidations as they happen, and then force anything that's been queued up to be emitted into WAL just before you emit any WAL record that might need to be decoded. Regarding 0005, it seems to me that this is no good: + errmsg("improper heap_getnext call"))); I think we should be using elog() rather than ereport() here, because this should only happen if there's a bug in a logical decoding plugin. At first, I thought maybe this should just be an Assert(), but since there are third-party logical decoding plugins available, checking this even in non-assert builds seems like a good idea. However, I think making it translatable is overkill; users should never see this, only developers. I also think that the message is really bad, because it just tells you did something bad. It gives no inkling as to why it was bad. 0006 contains lots of XXX comments that look like real issues. I guess those need to be fixed. Also, why don't we do the thing that the commit message for 0006 says we could "theoretically" do? I don't understand why we need the k-way merge at all, + if (prev_lsn != InvalidXLogRecPtr) + Assert(prev_lsn <= change->lsn); There is no reason to ever write an if statement that contains only an Assert, and it's bad style. Write Assert(prev_lsn == InvalidXLogRecPtr || prev_lsn <= change->lsn), or better yet, use XLogRecPtrIsInvalid. The purpose and mechanism of the is_schema_sent flag is not clear to me. The word "schema" here seems to be being used to mean "snapshot," which is rather confusing. I'm also somewhat unclear on what's happening here with invalidations. Perhaps that's as much a defect in my understanding as it is reflective of any problem with the patch, but I also don't see any comments either in 0002 or later patches explaining the theory of operation. If I've missed some, please point me in the right direction. Hypothetically speaking, it seems to me that if you just did InvalidateSystemCaches() every time the snapshot changed, you wouldn't need anything else (unless we're concerned with non-transactional invalidation messages like smgr and relmapper invalidations; not quite sure how those are handled). And, on the other hand, if we don't do InvalidateSystemCaches() every time the snapshot changes, then I don't understand why this works now, even without streaming. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
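To make the suggested rewrite of that assertion concrete, a minimal before/after sketch using the variable names from the patch hunk quoted above:

	/* before: an if statement that exists only to guard an Assert */
	if (prev_lsn != InvalidXLogRecPtr)
		Assert(prev_lsn <= change->lsn);

	/* after: fold the guard into the assertion, using XLogRecPtrIsInvalid */
	Assert(XLogRecPtrIsInvalid(prev_lsn) || prev_lsn <= change->lsn);
	prev_lsn = change->lsn;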
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Dilip Kumar
Date:
On Wed, Dec 11, 2019 at 5:22 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Mon, Dec 9, 2019 at 1:27 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > I have review the patch set and here are few comments/questions > > > > 1. > > +static void > > +pg_decode_stream_change(LogicalDecodingContext *ctx, > > + ReorderBufferTXN *txn, > > + Relation relation, > > + ReorderBufferChange *change) > > +{ > > + OutputPluginPrepareWrite(ctx, true); > > + appendStringInfo(ctx->out, "streaming change for TXN %u", txn->xid); > > + OutputPluginWrite(ctx, true); > > +} > > > > Should we show the tuple in the streamed change like we do for the > > pg_decode_change? > > > > I think so. The patch shows the message in > pg_decode_stream_message(), so why to prohibit showing tuple here? > > > 2. pg_logical_slot_get_changes_guts > > It recreate the decoding slot [ctx = > > CreateDecodingContext(InvalidXLogRecPtr] but doesn't set the streaming > > to false, should we pass a parameter to > > pg_logical_slot_get_changes_guts saying whether we want streamed results or not > > > > CreateDecodingContext internally calls StartupDecodingContext which > sets the value of streaming based on if the plugin has provided > callbacks for streaming functions. Isn't that sufficient? Why do we > need additional parameters here? I don't think that if plugin provides streaming function then we should stream. Like pgoutput plugin provides streaming function but we only stream if streaming is on in create subscription command. So I feel that should be true with any plugin. > > > 3. > > + XLogRecPtr prev_lsn = InvalidXLogRecPtr; > > ReorderBufferChange *change; > > ReorderBufferChange *specinsert = NULL; > > > > @@ -1565,6 +1965,16 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid, > > Relation relation = NULL; > > Oid reloid; > > > > + /* > > + * Enforce correct ordering of changes, merged from multiple > > + * subtransactions. The changes may have the same LSN due to > > + * MULTI_INSERT xlog records. > > + */ > > + if (prev_lsn != InvalidXLogRecPtr) > > + Assert(prev_lsn <= change->lsn); > > + > > + prev_lsn = change->lsn; > > I did not understand, how this change is relavent to this patch > > > > This is just to ensure that changes are in LSN order. I think as we > are merging the changes before commit for streaming, it is good to > have such an Assertion for ReorderBufferStreamTXN. And, if we want > to have it in ReorderBufferStreamTXN, then there is no harm in keeping > it in ReorderBufferCommit() at least to keep the code consistent. Do > you see any problem with this? I am fine with this. > > > 4. > > + /* > > + * TOCHECK: We have to rebuild historic snapshot to be sure it includes all > > + * information about subtransactions, which could arrive after streaming start. > > + */ > > + if (!txn->is_schema_sent) > > + snapshot_now = ReorderBufferCopySnap(rb, txn->base_snapshot, > > + txn, command_id); > > > > In which case, txn->is_schema_sent will be true, because at the end of > > the stream in ReorderBufferExecuteInvalidations we are always setting > > it false, > > so while sending next stream it will always be false. That means we > > never required snapshot_now variable in ReorderBufferTXN. > > > > You are probably right, but as discussed we need to change this part > of design/code (when to send schema changes) due to the issues > discovered. So, I think this part will anyway change when we fix that > problem. Make sense. > > > 5. 
> > @@ -2299,6 +2746,23 @@ ReorderBufferXidSetCatalogChanges(ReorderBuffer > > *rb, TransactionId xid, > > txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true); > > > > txn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES; > > + > > + /* > > + * We read catalog changes from WAL, which are not yet sent, so > > + * invalidate current schema in order output plugin can resend > > + * schema again. > > + */ > > + txn->is_schema_sent = false; > > > > Same as point 4, during decode time it will never be true. > > > > Sure, my previous point's reply applies here as well. ok > > > 6. > > + /* send fields */ > > + pq_sendint64(out, commit_lsn); > > + pq_sendint64(out, txn->end_lsn); > > + pq_sendint64(out, txn->commit_time); > > > > Commit_time and end_lsn is used in standby_feedback > > > > I don't understand what you mean by this. Can you be a bit more clear? I think I paste it here by mistake. just ignore it. > > > > > 7. > > + /* FIXME optimize the search by bsearch on sorted data */ > > + for (i = nsubxacts; i > 0; i--) > > + { > > + if (subxacts[i - 1].xid == subxid) > > + { > > + subidx = (i - 1); > > + found = true; > > + break; > > + } > > + } > > We can not rollback intermediate subtransaction without rollbacking > > latest sub-transaction, so why do we need > > to search in the array? It will always be the the last subxact no? > > > > The same thing is already mentioned in the comments above this code > ("XXX Or perhaps we can rely on the aborts to arrive in the reverse > order, i.e. from the inner-most subxact (when nested)? In which case > we could simply check the last element."). I think what you are > saying is probably right, but we can leave this as it is for now > because this is a minor optimization which can be done later as well > if required. However, if you see any correctness issue, then we can > discuss. I think more than optimization here we have the question of whether this loop is required at all or not. Because, by optimizing we are not adding the complexity, infact it will be simple. I think here we need more analysis that whether we need to traverse the array or not. So maybe for time being we can leave this as it is. > > > 8. > > + /* > > + * send feedback to upstream > > + * > > + * XXX Probably should send a valid LSN. But which one? > > + */ > > + send_feedback(InvalidXLogRecPtr, false, false); > > > > Why feedback is sent for every change? > > > > I will study this part of the patch and let you know my opinion. Sure. > > Few comments on this patch series: > > 0001-Immediately-WAL-log-assignments: > ------------------------------------------------------------ > > The commit message still refers to the old design for this patch. I > think you need to modify the commit message as per the latest patch. > > 0002-Issue-individual-invalidations-with-wal_level-log > ---------------------------------------------------------------------------- > 1. > xact_desc_invalidations(StringInfo buf, > { > .. > + else if (msg->id == SHAREDINVALSNAPSHOT_ID) > + appendStringInfo(buf, " snapshot %u", msg->sn.relId); > > You have removed logging for the above cache but forgot to remove its > reference from one of the places. Also, I think you need to add a > comment somewhere in inval.c to say why you are writing for WAL for > some types of invalidations and not for others? > > 0003-Extend-the-output-plugin-API-with-stream-methods > -------------------------------------------------------------------------------- > 1. 
> + are required, while <function>stream_message_cb</function> and > + <function>stream_message_cb</function> are optional. > > stream_message_cb is mentioned twice. It seems the second one is for truncate. > > 2. > size of the transaction size and network bandwidth, the transfer time > + may significantly increase the apply lag. > > /size of the transaction size/size of the transaction > > no need to mention size twice. > > 3. > + Similarly to spill-to-disk behavior, streaming is triggered when the total > + amount of changes decoded from the WAL (for all in-progress > transactions) > + exceeds limit defined by <varname>logical_work_mem</varname> setting. > > The guc name used is wrong. /Similarly to/Similar to/ > > 4. > stream_start_cb_wrapper() > { > .. > + /* state.report_location = apply_lsn; */ > .. > + /* FIXME ctx->write_location = apply_lsn; */ > .. > } > > See, if we can fix these and similar in the callback for the stop. I > think we don't have final_lsn till we commit/abort. Can we compute > before calling these API's? > > > 0005-Gracefully-handle-concurrent-aborts-of-uncommitte > ---------------------------------------------------------------------------------- > 1. > @@ -1877,6 +1877,7 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid, > PG_CATCH(); > { > /* TODO: Encapsulate cleanup > from the PG_TRY and PG_CATCH blocks */ > + > if (iterstate) > ReorderBufferIterTXNFinish(rb, iterstate); > > Spurious line change. > > 2. The commit message of this patch refers to Prepared transactions. > I think that needs to be changed. > > 0006-Implement-streaming-mode-in-ReorderBuffer > ------------------------------------------------------------------------- > 1. > + > +/* iterator for streaming (only get data from memory) */ > +static ReorderBufferStreamIterTXNState * ReorderBufferStreamIterTXNInit( > + > ReorderBuffer *rb, > + > ReorderBufferTXN > *txn); > + > +static ReorderBufferChange *ReorderBufferStreamIterTXNNext( > + ReorderBuffer *rb, > + > ReorderBufferStreamIterTXNState * state); > + > +static void ReorderBufferStreamIterTXNFinish( > + > ReorderBuffer *rb, > + > ReorderBufferStreamIterTXNState * state); > > Do we really need to introduce new APIs for iterating over changes > from streamed transactions? Why can't we reuse the same API's as we > use for committed xacts? > > 2. > +static void > +ReorderBufferStreamCommit(ReorderBuffer *rb, ReorderBufferTXN *txn) > > Please write some comments atop ReorderBufferStreamCommit. > > 3. > +ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn) > { > .. > .. > + if (txn->snapshot_now > == NULL) > + { > + dlist_iter subxact_i; > + > + /* make sure this transaction is streamed for the first time */ > + > Assert(!rbtxn_is_streamed(txn)); > + > + /* at the beginning we should have invalid command ID */ > + Assert(txn->command_id == > InvalidCommandId); > + > + dlist_foreach(subxact_i, &txn->subtxns) > + { > + ReorderBufferTXN *subtxn; > + > + > subtxn = dlist_container(ReorderBufferTXN, node, subxact_i.cur); > + > + if (subtxn->base_snapshot != NULL && > + > (txn->base_snapshot == NULL || > + txn->base_snapshot_lsn > subtxn->base_snapshot_lsn)) > + { > + > txn->base_snapshot = subtxn->base_snapshot; > > The logic here seems to be correct, but I am not sure why it is not > considered to purge the base snapshot before assigning the subtxn's > snapshot and similarly, we have not purged snapshot for subtxn once we > are done with it. I think we can use > ReorderBufferTransferSnapToParent to replace part of the logic here. 
> Do you see any reason for doing things differently here? > > 4. In ReorderBufferStreamTXN, why do you need to use > ReorderBufferCopySnap to assign txn->base_snapshot to snapshot_now. > > 5. I see a lot of code similarity in ReorderBufferStreamTXN and > existing ReorderBufferCommit. I understand that there are some subtle > differences due to which we need to write this new function but can't > we encapsulate the specific parts of code in functions and then call > from both places. I am talking about code in different cases for > change->action. > > 6. + * Note: We never stream and serialize a transaction at the same time (e > /(e/(we > I will look into these comments and reply separately. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Amit Kapila
Date:
On Wed, Dec 11, 2019 at 11:46 PM Robert Haas <robertmhaas@gmail.com> wrote: > > On Mon, Dec 2, 2019 at 3:32 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > I have rebased the patch set on the latest head. > > 0001 looks like a clever approach, but are you sure it doesn't hurt > performance when many small XLOG records are being inserted? I think > XLogRecordAssemble() can get pretty hot in some workloads. > I don't think we have evaluated it yet, but we should do it. The point to note is that it is only for the case when wal_level is 'logical' (see IsSubTransactionAssignmentPending) in which case we already log more WAL, so this might not impact much. I guess that it might be better to have that check in XLogRecordAssemble for the sake of clarity. > > Regarding 0005, it seems to me that this is no good: > > + errmsg("improper heap_getnext call"))); > > I think we should be using elog() rather than ereport() here, because > this should only happen if there's a bug in a logical decoding plugin. > At first, I thought maybe this should just be an Assert(), but since > there are third-party logical decoding plugins available, checking > this even in non-assert builds seems like a good idea. However, I > think making it translatable is overkill; users should never see this, > only developers. > makes sense. I think we should change it. > > + if (prev_lsn != InvalidXLogRecPtr) > + Assert(prev_lsn <= change->lsn); > > There is no reason to ever write an if statement that contains only an > Assert, and it's bad style. Write Assert(prev_lsn == InvalidXLogRecPtr > || prev_lsn <= change->lsn), or better yet, use XLogRecPtrIsInvalid. > Agreed. > The purpose and mechanism of the is_schema_sent flag is not clear to > me. The word "schema" here seems to be being used to mean "snapshot," > which is rather confusing. > I have explained this flag below along with invalidations as both are slightly related. > I'm also somewhat unclear on what's happening here with invalidations. > Perhaps that's as much a defect in my understanding as it is > reflective of any problem with the patch, but I also don't see any > comments either in 0002 or later patches explaining the theory of > operation. If I've missed some, please point me in the right > direction. Hypothetically speaking, it seems to me that if you just > did InvalidateSystemCaches() every time the snapshot changed, you > wouldn't need anything else (unless we're concerned with > non-transactional invalidation messages like smgr and relmapper > invalidations; not quite sure how those are handled). And, on the > other hand, if we don't do InvalidateSystemCaches() every time the > snapshot changes, then I don't understand why this works now, even > without streaming. > I think the way invalidations work for logical replication is that normally, we always start a new transaction before decoding each commit which allows us to accept the invalidations (via AtStart_Cache). However, if there are catalog changes within the transaction being decoded, we need to reflect those before trying to decode the WAL of operation which happened after that catalog change. As we are not logging the WAL for each invalidation, we need to execute all the invalidation messages for this transaction at each catalog change. We are able to do that now as we decode the entire WAL for a transaction only once we get the commit's WAL which contains all the invalidation messages. 
So, we queue them up and execute them for each catalog change which we identify by WAL record XLOG_HEAP2_NEW_CID. The second related concept is that before sending each change to the downstream (via pgoutput), we check whether we need to send the schema. We decide this based on the local map entry (RelationSyncEntry), which indicates whether the schema for the relation has already been sent or not. Once the schema of the relation is sent, the entry for that relation in the map indicates it. At the time of invalidation processing we also blow away this map, so it always reflects the correct state. Now, to decode an in-progress transaction, we need to ensure that we have received the WAL for all the invalidations before decoding the WAL of the action that happened immediately after that catalog change. This is the reason we started WAL logging individual invalidations. So, with this change we don't need to execute all the invalidations at each catalog change; rather, we execute them as and when their WAL is being decoded. The current mechanism to send schema changes won't work for streaming transactions because after sending the change, the subtransaction might abort. On subtransaction abort, the downstream will simply discard the changes, and we will lose the previous schema change sent. There is no such problem currently because we process all the aborts before sending any change. So, the current idea of having a schema_sent flag in each map entry (RelationSyncEntry) won't work for streaming transactions. To solve this, the patch initially kept a flag 'is_schema_sent' for each top-level transaction (in ReorderBufferTXN) so that we always send the schema for each (sub)transaction of a streaming transaction, but that won't work if we access multiple relations in the same subtransaction. To solve this problem, we are thinking of keeping a list/array of top-level xids in each RelationSyncEntry. Basically, whenever we send the schema for any transaction, we note that in the RelationSyncEntry, and at abort/commit time we can remove the xid from the list. Now, whenever we check whether to send the schema for any operation in a transaction, we will check if our xid is present in that list for the particular RelationSyncEntry and take an action based on that (if the xid is present, then we won't send the schema, otherwise, send it). I think during decoding we should not have that many open transactions, so the search in the array should be cheap enough, but we can consider some other data structure like a hash as well. I will think some more and respond to your remaining comments/suggestions. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
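A minimal sketch of the list-of-xids idea described above, assuming a new field is added to pgoutput's existing RelationSyncEntry; the field name, the helpers, and storing xids via lappend_int/lfirst_int are illustrative assumptions, not the final design:

typedef struct RelationSyncEntry
{
	Oid			relid;			/* existing: relation oid */
	bool		schema_sent;	/* existing: schema sent for normal txns */
	List	   *streamed_txns;	/* assumed new field: top-level xids of
								 * in-progress transactions for which the
								 * schema has already been sent */
	/* ... other existing fields ... */
} RelationSyncEntry;

/* Should the schema be (re)sent before this change? */
static bool
schema_needed(RelationSyncEntry *relentry, TransactionId xid, bool in_streaming)
{
	ListCell   *lc;

	if (!in_streaming)
		return !relentry->schema_sent;

	foreach(lc, relentry->streamed_txns)
	{
		if ((TransactionId) lfirst_int(lc) == xid)
			return false;		/* already sent within this transaction */
	}
	return true;
}

/* Record that the schema was sent for xid (called from the send path);
 * on stream commit/abort of that xid the entry would be removed again. */
static void
set_schema_sent_for_xid(RelationSyncEntry *relentry, TransactionId xid)
{
	relentry->streamed_txns = lappend_int(relentry->streamed_txns, (int) xid);
}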
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Amit Kapila
Date:
On Wed, Dec 11, 2019 at 11:46 PM Robert Haas <robertmhaas@gmail.com> wrote: > > On Mon, Dec 2, 2019 at 3:32 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > I have rebased the patch set on the latest head. > > 0001 looks like a clever approach, but are you sure it doesn't hurt > performance when many small XLOG records are being inserted? I think > XLogRecordAssemble() can get pretty hot in some workloads. > > With regard to 0002, logging a separate WAL record for each > invalidation seems painful; I think most operations that generate > invalidations generate a bunch of them all at once. Perhaps you could > just queue up invalidations as they happen, and then force anything > that's been queued up to be emitted into WAL just before you emit any > WAL record that might need to be decoded. > I feel we can log the invalidations of the entire command at one go if we log at CommandEndInvalidationMessages. We already have all the invalidations of current command in transInvalInfo->CurrentCmdInvalidMsgs. This can save us the effort of maintaining a new separate list/queue for invalidations and to a good extent, it will ameliorate your concern of logging each invalidation separately. > > 0006 contains lots of XXX comments that look like real issues. I guess > those need to be fixed. Also, why don't we do the thing that the > commit message for 0006 says we could "theoretically" do? I don't > understand why we need the k-way merge at all, > I think we can do what is written in the commit message, but then we need to maintain two paths (one for streaming contexts and other for non-streaming contexts) unless we want to entirely get rid of storing subtransaction changes separately which seems like a more fundamental change. Right now, also to some extent such things are there, but I have already given a comment to minimize it. Having said that, I think we can go either way. I think the original intention was to avoid doing more stuff unless it is really required as this is already a big patchset, but maybe Tomas has a different idea about this. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
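As a rough illustration of logging the whole command's invalidations in one go, here is roughly where that hook could sit in inval.c; LogLogicalInvalidations is the record-emitting helper named in the patch's commit message, and calling it from here is the idea under discussion, not committed behavior:

void
CommandEndInvalidationMessages(void)
{
	/* Quietly return if there is no invalidation state (e.g. bootstrap). */
	if (transInvalInfo == NULL)
		return;

	ProcessInvalidationMessages(&transInvalInfo->CurrentCmdInvalidMsgs,
								LocalExecuteInvalidationMessage);

	/*
	 * Idea discussed above: with wal_level = logical, emit all of the
	 * current command's invalidations as a single XLOG_XACT_INVALIDATIONS
	 * record, instead of one WAL record per message.
	 */
	if (XLogLogicalInfoActive())
		LogLogicalInvalidations();

	AppendInvalidationMessages(&transInvalInfo->PriorCmdInvalidMsgs,
							   &transInvalInfo->CurrentCmdInvalidMsgs);
}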
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Amit Kapila
Date:
On Thu, Dec 12, 2019 at 9:45 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Wed, Dec 11, 2019 at 5:22 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > On Mon, Dec 9, 2019 at 1:27 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > > > I have review the patch set and here are few comments/questions > > > > > > 1. > > > +static void > > > +pg_decode_stream_change(LogicalDecodingContext *ctx, > > > + ReorderBufferTXN *txn, > > > + Relation relation, > > > + ReorderBufferChange *change) > > > +{ > > > + OutputPluginPrepareWrite(ctx, true); > > > + appendStringInfo(ctx->out, "streaming change for TXN %u", txn->xid); > > > + OutputPluginWrite(ctx, true); > > > +} > > > > > > Should we show the tuple in the streamed change like we do for the > > > pg_decode_change? > > > > > > > I think so. The patch shows the message in > > pg_decode_stream_message(), so why to prohibit showing tuple here? > > > > > 2. pg_logical_slot_get_changes_guts > > > It recreate the decoding slot [ctx = > > > CreateDecodingContext(InvalidXLogRecPtr] but doesn't set the streaming > > > to false, should we pass a parameter to > > > pg_logical_slot_get_changes_guts saying whether we want streamed results or not > > > > > > > CreateDecodingContext internally calls StartupDecodingContext which > > sets the value of streaming based on if the plugin has provided > > callbacks for streaming functions. Isn't that sufficient? Why do we > > need additional parameters here? > > I don't think that if plugin provides streaming function then we > should stream. Like pgoutput plugin provides streaming function but > we only stream if streaming is on in create subscription command. So > I feel that should be true with any plugin. > How about adding a new boolean parameter (streaming) in pg_create_logical_replication_slot()? -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
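A sketch of how that could be wired up on the decoding-context side, assuming an extra boolean is threaded down from the caller into StartupDecodingContext; the parameter name and the exact set of callbacks checked are assumptions for illustration only:

	/* In StartupDecodingContext(), hypothetically: */
	ctx->streaming = enable_streaming &&	/* caller explicitly asked for it */
		(ctx->callbacks.stream_start_cb != NULL) &&
		(ctx->callbacks.stream_stop_cb != NULL) &&
		(ctx->callbacks.stream_abort_cb != NULL) &&
		(ctx->callbacks.stream_commit_cb != NULL) &&
		(ctx->callbacks.stream_change_cb != NULL);

With something like this, pg_logical_slot_get_changes() could pass enable_streaming = false (or expose it as a new argument or slot option), while walsender-based subscriptions created with streaming = on would pass true.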
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Masahiko Sawada
Date:
On Mon, 2 Dec 2019 at 17:32, Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Sun, Dec 1, 2019 at 7:58 AM Michael Paquier <michael@paquier.xyz> wrote: > > > > On Fri, Nov 22, 2019 at 01:18:11PM +0530, Dilip Kumar wrote: > > > I have rebased the patch on the latest head and also fix the issue of > > > "concurrent abort handling of the (sub)transaction." and attached as > > > (v1-0013-Extend-handling-of-concurrent-aborts-for-streamin) along with > > > the complete patch set. I have added the version number so that we > > > can track the changes. > > > > The patch has rotten a bit and does not apply anymore. Could you > > please send a rebased version? I have moved it to next CF, waiting on > > author. > > I have rebased the patch set on the latest head. Thank you for working on this. This might have already been discussed but I have a question about the changes of logical replication worker. In the current logical replication there is a problem that the response time are doubled when using synchronous replication because wal senders send changes after commit. It's worse especially when a transaction makes a lot of changes. So I expected this feature to reduce the response time by sending changes even while the transaction is progressing but it doesn't seem to be. The logical replication worker writes changes to temporary files and applies these changes when the worker received commit record (STREAM COMMIT). Since the worker sends the LSN of commit record as flush LSN to the publisher after applying all changes, the publisher must wait for all changes are applied to the subscriber. Another problem would be that the worker doesn't receive changes during applying changes of other transactions. These things make me think it's better to have a new worker dedicated to apply changes like we have the wal receiver process and the startup process. Maybe we can have 2 workers (receiver and applyer) per subscriptions. Any thoughts? Regards, -- Masahiko Sawada http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Kyotaro Horiguchi
Date:
Hello. At Fri, 13 Dec 2019 14:46:20 +0530, Amit Kapila <amit.kapila16@gmail.com> wrote in > On Wed, Dec 11, 2019 at 11:46 PM Robert Haas <robertmhaas@gmail.com> wrote: > > > > On Mon, Dec 2, 2019 at 3:32 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > I have rebased the patch set on the latest head. > > > > 0001 looks like a clever approach, but are you sure it doesn't hurt > > performance when many small XLOG records are being inserted? I think > > XLogRecordAssemble() can get pretty hot in some workloads. > > > > With regard to 0002, logging a separate WAL record for each > > invalidation seems painful; I think most operations that generate > > invalidations generate a bunch of them all at once. Perhaps you could > > just queue up invalidations as they happen, and then force anything > > that's been queued up to be emitted into WAL just before you emit any > > WAL record that might need to be decoded. > > > > I feel we can log the invalidations of the entire command at one go if > we log at CommandEndInvalidationMessages. We already have all the > invalidations of current command in > transInvalInfo->CurrentCmdInvalidMsgs. This can save us the effort of > maintaining a new separate list/queue for invalidations and to a good > extent, it will ameliorate your concern of logging each invalidation > separately. I have a question on this. Does that mean that the current logical decoder (or reorderbuffer) may emit incorrect result if it made a catalog change during the current transaction being decoded? If so, this is not a feature but a bug fix. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Amit Kapila
Date:
On Fri, Dec 20, 2019 at 11:47 AM Masahiko Sawada <masahiko.sawada@2ndquadrant.com> wrote: > > On Mon, 2 Dec 2019 at 17:32, Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > On Sun, Dec 1, 2019 at 7:58 AM Michael Paquier <michael@paquier.xyz> wrote: > > > > > > On Fri, Nov 22, 2019 at 01:18:11PM +0530, Dilip Kumar wrote: > > > > I have rebased the patch on the latest head and also fix the issue of > > > > "concurrent abort handling of the (sub)transaction." and attached as > > > > (v1-0013-Extend-handling-of-concurrent-aborts-for-streamin) along with > > > > the complete patch set. I have added the version number so that we > > > > can track the changes. > > > > > > The patch has rotten a bit and does not apply anymore. Could you > > > please send a rebased version? I have moved it to next CF, waiting on > > > author. > > > > I have rebased the patch set on the latest head. > > Thank you for working on this. > > This might have already been discussed but I have a question about the > changes of logical replication worker. In the current logical > replication there is a problem that the response time are doubled when > using synchronous replication because wal senders send changes after > commit. It's worse especially when a transaction makes a lot of > changes. So I expected this feature to reduce the response time by > sending changes even while the transaction is progressing but it > doesn't seem to be. The logical replication worker writes changes to > temporary files and applies these changes when the worker received > commit record (STREAM COMMIT). Since the worker sends the LSN of > commit record as flush LSN to the publisher after applying all > changes, the publisher must wait for all changes are applied to the > subscriber. > The main aim of this feature is to reduce apply lag. Because if we send all the changes together it can delay there apply because of network delay, whereas if most of the changes are already sent, then we will save the effort on sending the entire data at commit time. This in itself gives us decent benefits. Sure, we can further improve it by having separate workers (dedicated to apply the changes) as you are suggesting and in fact, there is a patch for that as well(see the performance results and bgworker patch at [1]), but if try to shove in all the things in one go, then it will be difficult to get this patch committed (there are already enough things and the patch is quite big that to get it right takes a lot of energy). So, the plan is something like that first we get the basic feature and then try to improve by having dedicated workers or things like that. Does this make sense to you? [1] - https://www.postgresql.org/message-id/8eda5118-2dd0-79a1-4fe9-eec7e334de17%40postgrespro.ru -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Amit Kapila
Date:
On Fri, Dec 20, 2019 at 2:00 PM Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote: > > Hello. > > At Fri, 13 Dec 2019 14:46:20 +0530, Amit Kapila <amit.kapila16@gmail.com> wrote in > > On Wed, Dec 11, 2019 at 11:46 PM Robert Haas <robertmhaas@gmail.com> wrote: > > > > > > On Mon, Dec 2, 2019 at 3:32 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > I have rebased the patch set on the latest head. > > > > > > 0001 looks like a clever approach, but are you sure it doesn't hurt > > > performance when many small XLOG records are being inserted? I think > > > XLogRecordAssemble() can get pretty hot in some workloads. > > > > > > With regard to 0002, logging a separate WAL record for each > > > invalidation seems painful; I think most operations that generate > > > invalidations generate a bunch of them all at once. Perhaps you could > > > just queue up invalidations as they happen, and then force anything > > > that's been queued up to be emitted into WAL just before you emit any > > > WAL record that might need to be decoded. > > > > > > > I feel we can log the invalidations of the entire command at one go if > > we log at CommandEndInvalidationMessages. We already have all the > > invalidations of current command in > > transInvalInfo->CurrentCmdInvalidMsgs. This can save us the effort of > > maintaining a new separate list/queue for invalidations and to a good > > extent, it will ameliorate your concern of logging each invalidation > > separately. > > I have a question on this. Does that mean that the current logical > decoder (or reorderbuffer) > What does currently refer to here? Is it about HEAD or about the patch? Without the patch, we decode only at commit time and by that time we have all invalidations (logged with commit WAL record), so we just execute them at each catalog change (see the actions in REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID). The patch has to separately WAL log each invalidation because we can decode the intermittent changes, so we can't wait till commit. The above is just an optimization for the patch. AFAIK, there is no correctness issue here, but let me know if you see any. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
vignesh C
Date:
On Mon, Dec 2, 2019 at 2:02 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Sun, Dec 1, 2019 at 7:58 AM Michael Paquier <michael@paquier.xyz> wrote: > > > > On Fri, Nov 22, 2019 at 01:18:11PM +0530, Dilip Kumar wrote: > > > I have rebased the patch on the latest head and also fix the issue of > > > "concurrent abort handling of the (sub)transaction." and attached as > > > (v1-0013-Extend-handling-of-concurrent-aborts-for-streamin) along with > > > the complete patch set. I have added the version number so that we > > > can track the changes. > > > > The patch has rotten a bit and does not apply anymore. Could you > > please send a rebased version? I have moved it to next CF, waiting on > > author. > > I have rebased the patch set on the latest head. > Few comments: assert variable should be within #ifdef USE_ASSERT_CHECKING in patch v2-0008-Add-support-for-streaming-to-built-in-replication.patch: + int64 subidx; + bool found = false; + char path[MAXPGPATH]; + + subidx = -1; + subxact_info_read(MyLogicalRepWorker->subid, xid); + + /* FIXME optimize the search by bsearch on sorted data */ + for (i = nsubxacts; i > 0; i--) + { + if (subxacts[i - 1].xid == subxid) + { + subidx = (i - 1); + found = true; + break; + } + } + + /* We should not receive aborts for unknown subtransactions. */ + Assert(found); Add the typedefs like below in typedefs.lst common across the patches: xl_xact_invalidations, ReorderBufferStreamIterTXNEntry, ReorderBufferStreamIterTXNState, SubXactInfo "are written" appears twice in commit message of v2-0002-Issue-individual-invalidations-with-wal_level-log.patch: The individual invalidations are written are written using a new xlog record type XLOG_XACT_INVALIDATIONS, from RM_XACT_ID resource manager. See LogLogicalInvalidations for details. v2-0002-Issue-individual-invalidations-with-wal_level-log.patch patch does not compile by itself: reorderbuffer.c:1822:9: error: ‘ReorderBufferTXN’ has no member named ‘is_schema_sent’ + LocalExecuteInvalidationMessage(&change->data.inval.msg); + txn->is_schema_sent = false; + break; Should we include printing of id here like in earlier cases in v2-0002-Issue-individual-invalidations-with-wal_level-log.patch: + appendStringInfo(buf, " relcache %u", msg->rc.relId); + /* not expected, but print something anyway */ + else if (msg->id == SHAREDINVALSMGR_ID) + appendStringInfoString(buf, " smgr"); + /* not expected, but print something anyway */ + else if (msg->id == SHAREDINVALRELMAP_ID) + appendStringInfo(buf, " relmap db %u", msg->rm.dbId); There is some code duplication in stream_change_cb_wrapper, stream_truncate_cb_wrapper, stream_message_cb_wrapper, stream_abort_cb_wrapper, stream_commit_cb_wrapper, stream_start_cb_wrapper and stream_stop_cb_wrapper functions in v2-0003-Extend-the-output-plugin-API-with-stream-methods.patch patch. Should we have a separate function for common code? 
Should we can add function header for AssertChangeLsnOrder in v2-0006-Implement-streaming-mode-in-ReorderBuffer.patch: +static void +AssertChangeLsnOrder(ReorderBuffer *rb, ReorderBufferTXN *txn) +{ This "Assert(txn->first_lsn != InvalidXLogRecPtr)"can be before the loop, can be checked only once: + dlist_foreach(iter, &txn->changes) + { + ReorderBufferChange *cur_change; + + cur_change = dlist_container(ReorderBufferChange, node, iter.cur); + + Assert(txn->first_lsn != InvalidXLogRecPtr); + Assert(cur_change->lsn != InvalidXLogRecPtr); + Assert(txn->first_lsn <= cur_change->lsn); Should we add function header for ReorderBufferDestroyTupleCidHash in v2-0006-Implement-streaming-mode-in-ReorderBuffer.patch: +static void +ReorderBufferDestroyTupleCidHash(ReorderBuffer *rb, ReorderBufferTXN *txn) +{ + if (txn->tuplecid_hash != NULL) + { + hash_destroy(txn->tuplecid_hash); + txn->tuplecid_hash = NULL; + } +} + Should we add function header for ReorderBufferStreamCommit in v2-0006-Implement-streaming-mode-in-ReorderBuffer.patch: +static void +ReorderBufferStreamCommit(ReorderBuffer *rb, ReorderBufferTXN *txn) +{ + /* we should only call this for previously streamed transactions */ + Assert(rbtxn_is_streamed(txn)); + + ReorderBufferStreamTXN(rb, txn); + + rb->stream_commit(rb, txn, txn->final_lsn); + + ReorderBufferCleanupTXN(rb, txn); +} + Should we add function header for ReorderBufferCanStream in v2-0006-Implement-streaming-mode-in-ReorderBuffer.patch: +static bool +ReorderBufferCanStream(ReorderBuffer *rb) +{ + LogicalDecodingContext *ctx = rb->private_data; + + return ctx->streaming; +} patch v2-0008-Add-support-for-streaming-to-built-in-replication.patch does not apply: Hunk #18 FAILED at 2035. Hunk #19 succeeded at 2199 (offset -16 lines). 1 out of 19 hunks FAILED -- saving rejects to file src/backend/replication/logical/worker.c.rej Header inclusion may not be required in patch v2-0008-Add-support-for-streaming-to-built-in-replication.patch: +++ b/src/backend/replication/logical/launcher.c @@ -14,6 +14,8 @@ * *------------------------------------------------------------------------- */ +#include <sys/types.h> +#include <unistd.h> Regards, Vignesh EnterpriseDB: http://www.enterprisedb.com
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Amit Kapila
Date:
On Sun, Dec 22, 2019 at 5:04 PM vignesh C <vignesh21@gmail.com> wrote: > > Few comments: > assert variable should be within #ifdef USE_ASSERT_CHECKING in patch > v2-0008-Add-support-for-streaming-to-built-in-replication.patch: > + int64 subidx; > + bool found = false; > + char path[MAXPGPATH]; > + > + subidx = -1; > + subxact_info_read(MyLogicalRepWorker->subid, xid); > + > + /* FIXME optimize the search by bsearch on sorted data */ > + for (i = nsubxacts; i > 0; i--) > + { > + if (subxacts[i - 1].xid == subxid) > + { > + subidx = (i - 1); > + found = true; > + break; > + } > + } > + > + /* We should not receive aborts for unknown subtransactions. */ > + Assert(found); > We can use PG_USED_FOR_ASSERTS_ONLY for that variable. > > Should we include printing of id here like in earlier cases in > v2-0002-Issue-individual-invalidations-with-wal_level-log.patch: > + appendStringInfo(buf, " relcache %u", msg->rc.relId); > + /* not expected, but print something anyway */ > + else if (msg->id == SHAREDINVALSMGR_ID) > + appendStringInfoString(buf, " smgr"); > + /* not expected, but print something anyway */ > + else if (msg->id == SHAREDINVALRELMAP_ID) > + appendStringInfo(buf, " relmap db %u", msg->rm.dbId); > I am not sure if this patch is logging these invalidations, so not sure if it makes sense to add more ids in the cases you are referring to. However, if we change it to logging all invalidations at command end as being discussed in this thread, then it might be better to do what you are suggesting. > > Should we can add function header for AssertChangeLsnOrder in > v2-0006-Implement-streaming-mode-in-ReorderBuffer.patch: > +static void > +AssertChangeLsnOrder(ReorderBuffer *rb, ReorderBufferTXN *txn) > +{ > > This "Assert(txn->first_lsn != InvalidXLogRecPtr)"can be before the > loop, can be checked only once: > + dlist_foreach(iter, &txn->changes) > + { > + ReorderBufferChange *cur_change; > + > + cur_change = dlist_container(ReorderBufferChange, > node, iter.cur); > + > + Assert(txn->first_lsn != InvalidXLogRecPtr); > + Assert(cur_change->lsn != InvalidXLogRecPtr); > + Assert(txn->first_lsn <= cur_change->lsn); > This makes sense to me. Another thing about this function, do we really need "ReorderBuffer *rb" parameter in this function? -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
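Putting those two review comments together (hoist the first_lsn check out of the loop, and drop the apparently unused ReorderBuffer argument), AssertChangeLsnOrder could end up looking roughly like this; a sketch of the suggested shape only, not the final patch:

#ifdef USE_ASSERT_CHECKING
/* Verify that the changes of a transaction are ordered by LSN. */
static void
AssertChangeLsnOrder(ReorderBufferTXN *txn)
{
	dlist_iter	iter;
	XLogRecPtr	prev_lsn = txn->first_lsn;

	Assert(txn->first_lsn != InvalidXLogRecPtr);

	dlist_foreach(iter, &txn->changes)
	{
		ReorderBufferChange *cur_change;

		cur_change = dlist_container(ReorderBufferChange, node, iter.cur);

		Assert(cur_change->lsn != InvalidXLogRecPtr);
		Assert(prev_lsn <= cur_change->lsn);

		prev_lsn = cur_change->lsn;
	}
}
#endif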
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Robert Haas
Date:
On Thu, Dec 12, 2019 at 3:41 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > I don't think we have evaluated it yet, but we should do it. The > point to note is that it is only for the case when wal_level is > 'logical' (see IsSubTransactionAssignmentPending) in which case we > already log more WAL, so this might not impact much. I guess that it > might be better to have that check in XLogRecordAssemble for the sake > of clarity. I don't think that this is really a valid argument. Just because we have some overhead now doesn't mean that adding more won't hurt. Even testing the wal_level costs a little something. > I think the way invalidations work for logical replication is that > normally, we always start a new transaction before decoding each > commit which allows us to accept the invalidations (via > AtStart_Cache). However, if there are catalog changes within the > transaction being decoded, we need to reflect those before trying to > decode the WAL of operation which happened after that catalog change. > As we are not logging the WAL for each invalidation, we need to > execute all the invalidation messages for this transaction at each > catalog change. We are able to do that now as we decode the entire WAL > for a transaction only once we get the commit's WAL which contains all > the invalidation messages. So, we queue them up and execute them for > each catalog change which we identify by WAL record > XLOG_HEAP2_NEW_CID. Thanks for the explanation. That makes sense. But, it's still true, AFAICS, that instead of doing this stuff with logging invalidations you could just InvalidateSystemCaches() in the cases where you are currently applying all of the transaction's invalidations. That approach might be worse than changing the way invalidations are logged, but the two approaches deserve to be compared. One approach has more CPU overhead and the other has more WAL overhead, so it's a little hard to compare them, but it seems worth mulling over. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Masahiko Sawada
Date:
On Fri, 20 Dec 2019 at 22:30, Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Fri, Dec 20, 2019 at 11:47 AM Masahiko Sawada > <masahiko.sawada@2ndquadrant.com> wrote: > > > > On Mon, 2 Dec 2019 at 17:32, Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > > > On Sun, Dec 1, 2019 at 7:58 AM Michael Paquier <michael@paquier.xyz> wrote: > > > > > > > > On Fri, Nov 22, 2019 at 01:18:11PM +0530, Dilip Kumar wrote: > > > > > I have rebased the patch on the latest head and also fix the issue of > > > > > "concurrent abort handling of the (sub)transaction." and attached as > > > > > (v1-0013-Extend-handling-of-concurrent-aborts-for-streamin) along with > > > > > the complete patch set. I have added the version number so that we > > > > > can track the changes. > > > > > > > > The patch has rotten a bit and does not apply anymore. Could you > > > > please send a rebased version? I have moved it to next CF, waiting on > > > > author. > > > > > > I have rebased the patch set on the latest head. > > > > Thank you for working on this. > > > > This might have already been discussed but I have a question about the > > changes of logical replication worker. In the current logical > > replication there is a problem that the response time are doubled when > > using synchronous replication because wal senders send changes after > > commit. It's worse especially when a transaction makes a lot of > > changes. So I expected this feature to reduce the response time by > > sending changes even while the transaction is progressing but it > > doesn't seem to be. The logical replication worker writes changes to > > temporary files and applies these changes when the worker received > > commit record (STREAM COMMIT). Since the worker sends the LSN of > > commit record as flush LSN to the publisher after applying all > > changes, the publisher must wait for all changes are applied to the > > subscriber. > > > > The main aim of this feature is to reduce apply lag. Because if we > send all the changes together it can delay there apply because of > network delay, whereas if most of the changes are already sent, then > we will save the effort on sending the entire data at commit time. > This in itself gives us decent benefits. Sure, we can further improve > it by having separate workers (dedicated to apply the changes) as you > are suggesting and in fact, there is a patch for that as well(see the > performance results and bgworker patch at [1]), but if try to shove in > all the things in one go, then it will be difficult to get this patch > committed (there are already enough things and the patch is quite big > that to get it right takes a lot of energy). So, the plan is > something like that first we get the basic feature and then try to > improve by having dedicated workers or things like that. Does this > make sense to you? > Thank you for explanation. The plan makes sense. But I think in the current design it's a problem that logical replication worker doesn't receive changes (and doesn't check interrupts) during applying committed changes even if we don't have a worker dedicated for applying. I think the worker should continue to receive changes and save them to temporary files even during applying changes. Otherwise the buffer would be easily full and replication gets stuck. Regards, -- Masahiko Sawada http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Amit Kapila
Date:
On Tue, Dec 24, 2019 at 11:17 AM Masahiko Sawada <masahiko.sawada@2ndquadrant.com> wrote: > > On Fri, 20 Dec 2019 at 22:30, Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > The main aim of this feature is to reduce apply lag. Because if we > > send all the changes together it can delay there apply because of > > network delay, whereas if most of the changes are already sent, then > > we will save the effort on sending the entire data at commit time. > > This in itself gives us decent benefits. Sure, we can further improve > > it by having separate workers (dedicated to apply the changes) as you > > are suggesting and in fact, there is a patch for that as well(see the > > performance results and bgworker patch at [1]), but if try to shove in > > all the things in one go, then it will be difficult to get this patch > > committed (there are already enough things and the patch is quite big > > that to get it right takes a lot of energy). So, the plan is > > something like that first we get the basic feature and then try to > > improve by having dedicated workers or things like that. Does this > > make sense to you? > > > > Thank you for explanation. The plan makes sense. But I think in the > current design it's a problem that logical replication worker doesn't > receive changes (and doesn't check interrupts) during applying > committed changes even if we don't have a worker dedicated for > applying. I think the worker should continue to receive changes and > save them to temporary files even during applying changes. > Won't it beat the purpose of this feature which is to reduce the apply lag? Basically, it can so happen that while applying commit, it constantly gets changes of other transactions which will delay the apply of the current transaction. Also, won't it create some further work to identify the order of commits? Say while applying commit-1, it receives 5 other commits that are written to separate temporary files. How will we later identify which transaction's WAL we need to apply first? We might deduce by LSN's, but I think that could be tricky. Another thing is that I think it could lead to some design complications as well because while applying commit, you need some sort of callback or something like that to receive and flush totally unrelated changes. It could lead to another kind of failure mode wherein while applying commit if it tries to receive another transaction data and some failure happens while writing the data of that transaction. I am not sure if it is a good idea to try something like that. > Otherwise > the buffer would be easily full and replication gets stuck. > Are you telling about network buffer? I think the best way as discussed is to launch new workers for streamed transactions, but we can do that as an additional feature. Anyway, as proposed, users can choose the streaming mode for subscriptions, so there is an option to turn this selectively. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Masahiko Sawada
Date:
On Tue, 24 Dec 2019 at 17:21, Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Tue, Dec 24, 2019 at 11:17 AM Masahiko Sawada > <masahiko.sawada@2ndquadrant.com> wrote: > > > > On Fri, 20 Dec 2019 at 22:30, Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > > > The main aim of this feature is to reduce apply lag. Because if we > > > send all the changes together it can delay there apply because of > > > network delay, whereas if most of the changes are already sent, then > > > we will save the effort on sending the entire data at commit time. > > > This in itself gives us decent benefits. Sure, we can further improve > > > it by having separate workers (dedicated to apply the changes) as you > > > are suggesting and in fact, there is a patch for that as well(see the > > > performance results and bgworker patch at [1]), but if try to shove in > > > all the things in one go, then it will be difficult to get this patch > > > committed (there are already enough things and the patch is quite big > > > that to get it right takes a lot of energy). So, the plan is > > > something like that first we get the basic feature and then try to > > > improve by having dedicated workers or things like that. Does this > > > make sense to you? > > > > > > > Thank you for explanation. The plan makes sense. But I think in the > > current design it's a problem that logical replication worker doesn't > > receive changes (and doesn't check interrupts) during applying > > committed changes even if we don't have a worker dedicated for > > applying. I think the worker should continue to receive changes and > > save them to temporary files even during applying changes. > > > > Won't it beat the purpose of this feature which is to reduce the apply > lag? Basically, it can so happen that while applying commit, it > constantly gets changes of other transactions which will delay the > apply of the current transaction. You're right. But it seems to me that it optimizes the apply lags of only a transaction that made many changes. On the other hand if a transaction made many changes applying of subsequent changes are delayed. > Also, won't it create some further > work to identify the order of commits? Say while applying commit-1, > it receives 5 other commits that are written to separate temporary > files. How will we later identify which transaction's WAL we need to > apply first? We might deduce by LSN's, but I think that could be > tricky. Another thing is that I think it could lead to some design > complications as well because while applying commit, you need some > sort of callback or something like that to receive and flush totally > unrelated changes. It could lead to another kind of failure mode > wherein while applying commit if it tries to receive another > transaction data and some failure happens while writing the data of > that transaction. I am not sure if it is a good idea to try something > like that. It's just an idea but we might want to have new workers dedicated to apply changes first and then we will have streaming option later. That way we can reduce the flush lags depending on use cases. The commit order can be determined by the receiver and shared with the applyer in shared memory. Once we separated workers the streaming option can be introduced without such a downside. > > > Otherwise > > the buffer would be easily full and replication gets stuck. > > > > Are you telling about network buffer? Yes. 
> I think the best way as > discussed is to launch new workers for streamed transactions, but we > can do that as an additional feature. Anyway, as proposed, users can > choose the streaming mode for subscriptions, so there is an option to > turn this selectively.

Yes. But a user who wants to use this feature would want to replicate many changes, and I guess the side effect is quite big. I think that at least we need to make logical replication tolerate such a situation.

Regards, -- Masahiko Sawada http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Tomas Vondra
Date:
On Tue, Dec 10, 2019 at 10:23:19AM +0530, Dilip Kumar wrote: >On Tue, Dec 10, 2019 at 9:52 AM Amit Kapila <amit.kapila16@gmail.com> wrote: >> >> On Mon, Dec 2, 2019 at 2:02 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: >> > >> > On Sun, Dec 1, 2019 at 7:58 AM Michael Paquier <michael@paquier.xyz> wrote: >> > > >> > > On Fri, Nov 22, 2019 at 01:18:11PM +0530, Dilip Kumar wrote: >> > > > I have rebased the patch on the latest head and also fix the issue of >> > > > "concurrent abort handling of the (sub)transaction." and attached as >> > > > (v1-0013-Extend-handling-of-concurrent-aborts-for-streamin) along with >> > > > the complete patch set. I have added the version number so that we >> > > > can track the changes. >> > > >> > > The patch has rotten a bit and does not apply anymore. Could you >> > > please send a rebased version? I have moved it to next CF, waiting on >> > > author. >> > >> > I have rebased the patch set on the latest head. >> > >> > Apart from this, there is one issue reported by my colleague Vignesh. >> > The issue is that if we use more than two relations in a transaction >> > then there is an error on standby (no relation map entry for remote >> > relation ID 16390). After analyzing I have found that for the >> > streaming transaction an "is_schema_sent" flag is kept in >> > ReorderBufferTXN. And, I think that is done so that we can send the >> > schema for each transaction stream so that if any subtransaction gets >> > aborted we don't lose the logical WAL for that schema. But, this >> > solution has induced a very basic issue that if a transaction operate >> > on more than 1 relation then after sending the schema for the first >> > relation it will mark the flag true and the schema for the subsequent >> > relations will never be sent. >> > >> >> How about keeping a list of top-level xids in each RelationSyncEntry? >> Basically, whenever we send the schema for any transaction, we note >> that in RelationSyncEntry and at abort time we can remove xid from the >> list. Now, whenever, we check whether to send schema for any >> operation in a transaction, we will check if our xid is present in >> that list for a particular RelationSyncEntry and take an action based >> on that (if xid is present, then we won't send the schema, otherwise, >> send it). >The idea make sense to me. I will try to write a patch for this and test. > Yeah, the "is_schema_sent" flag in ReorderBufferTXN does not work - it needs to be in the RelationSyncEntry. In fact, I already have code for that in my private repository - I thought the patches I sent here do include this, but apparently I forgot to include this bit :-( Attached is a rebased patch series, fixing this. It's essentially v2 with a couple of patches (0003, 0008, 0009 and 0012) replacing the is_schema_sent with correct handling. 0003 - removes an is_schema_sent reference added prematurely (it's added by a later patch, causing compile failure) 0008 - adds the is_schema_sent back (essentially reverting 0003) 0009 - removes is_schema_sent entirely 0012 - adds the correct handling of schema flags in pgoutput I don't know what other changes you've made since v2, so this way it should be possible to just take 0003, 0008, 0009 and 0012 and slip them in with minimal hassle. FWIW thanks to everyone (and Amit and Dilip in particular) working on this patch series. There's been a lot of great reviews and improvements since I abandoned this thread for a while. I expect to be able to spend more time working on this in January. 
regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachment
- 0001-Immediately-WAL-log-assignments-v3.patch
- 0002-Issue-individual-invalidations-with-wal_level-log-v3.patch
- 0003-fixup-is_schema_sent-set-too-early-v3.patch
- 0004-Extend-the-output-plugin-API-with-stream-methods-v3.patch
- 0005-Cleaning-up-of-flags-in-ReorderBufferTXN-structur-v3.patch
- 0006-Gracefully-handle-concurrent-aborts-of-uncommitte-v3.patch
- 0007-Implement-streaming-mode-in-ReorderBuffer-v3.patch
- 0008-fixup-add-is_schema_sent-back-v3.patch
- 0009-fixup-get-rid-of-is_schema_sent-entirely-v3.patch
- 0010-Support-logical_decoding_work_mem-set-from-create-v3.patch
- 0011-Add-support-for-streaming-to-built-in-replication-v3.patch
- 0012-fixup-add-proper-schema-tracking-v3.patch
- 0013-Track-statistics-for-streaming-v3.patch
- 0014-Enable-streaming-for-all-subscription-TAP-tests-v3.patch
- 0015-BUGFIX-set-final_lsn-for-subxacts-before-cleanup-v3.patch
- 0016-Add-TAP-test-for-streaming-vs.-DDL-v3.patch
- 0017-Extend-handling-of-concurrent-aborts-for-streamin-v3.patch
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Dilip Kumar
Date:
On Sat, Dec 28, 2019 at 9:33 PM Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote: > > On Tue, Dec 10, 2019 at 10:23:19AM +0530, Dilip Kumar wrote: > >On Tue, Dec 10, 2019 at 9:52 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > >> > >> On Mon, Dec 2, 2019 at 2:02 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > >> > > >> > On Sun, Dec 1, 2019 at 7:58 AM Michael Paquier <michael@paquier.xyz> wrote: > >> > > > >> > > On Fri, Nov 22, 2019 at 01:18:11PM +0530, Dilip Kumar wrote: > >> > > > I have rebased the patch on the latest head and also fix the issue of > >> > > > "concurrent abort handling of the (sub)transaction." and attached as > >> > > > (v1-0013-Extend-handling-of-concurrent-aborts-for-streamin) along with > >> > > > the complete patch set. I have added the version number so that we > >> > > > can track the changes. > >> > > > >> > > The patch has rotten a bit and does not apply anymore. Could you > >> > > please send a rebased version? I have moved it to next CF, waiting on > >> > > author. > >> > > >> > I have rebased the patch set on the latest head. > >> > > >> > Apart from this, there is one issue reported by my colleague Vignesh. > >> > The issue is that if we use more than two relations in a transaction > >> > then there is an error on standby (no relation map entry for remote > >> > relation ID 16390). After analyzing I have found that for the > >> > streaming transaction an "is_schema_sent" flag is kept in > >> > ReorderBufferTXN. And, I think that is done so that we can send the > >> > schema for each transaction stream so that if any subtransaction gets > >> > aborted we don't lose the logical WAL for that schema. But, this > >> > solution has induced a very basic issue that if a transaction operate > >> > on more than 1 relation then after sending the schema for the first > >> > relation it will mark the flag true and the schema for the subsequent > >> > relations will never be sent. > >> > > >> > >> How about keeping a list of top-level xids in each RelationSyncEntry? > >> Basically, whenever we send the schema for any transaction, we note > >> that in RelationSyncEntry and at abort time we can remove xid from the > >> list. Now, whenever, we check whether to send schema for any > >> operation in a transaction, we will check if our xid is present in > >> that list for a particular RelationSyncEntry and take an action based > >> on that (if xid is present, then we won't send the schema, otherwise, > >> send it). > >The idea make sense to me. I will try to write a patch for this and test. > > > > Yeah, the "is_schema_sent" flag in ReorderBufferTXN does not work - it > needs to be in the RelationSyncEntry. In fact, I already have code for > that in my private repository - I thought the patches I sent here do > include this, but apparently I forgot to include this bit :-( > > Attached is a rebased patch series, fixing this. It's essentially v2 > with a couple of patches (0003, 0008, 0009 and 0012) replacing the > is_schema_sent with correct handling. > > > 0003 - removes an is_schema_sent reference added prematurely (it's added > by a later patch, causing compile failure) > > 0008 - adds the is_schema_sent back (essentially reverting 0003) > > 0009 - removes is_schema_sent entirely > > 0012 - adds the correct handling of schema flags in pgoutput > > > I don't know what other changes you've made since v2, so this way it > should be possible to just take 0003, 0008, 0009 and 0012 and slip them > in with minimal hassle. 
> > FWIW thanks to everyone (and Amit and Dilip in particular) working on > this patch series. There's been a lot of great reviews and improvements > since I abandoned this thread for a while. I expect to be able to spend > more time working on this in January. >

+static void
+set_schema_sent_in_streamed_txn(RelationSyncEntry *entry, TransactionId xid)
+{
+ MemoryContext oldctx;
+
+ oldctx = MemoryContextSwitchTo(CacheMemoryContext);
+
+ entry->streamed_txns = lappend_int(entry->streamed_txns, xid);
+
+ MemoryContextSwitchTo(oldctx);
+}

I was looking into the schema tracking solution and I have one question: shouldn't we remove the topxid from the list if the (sub)transaction is aborted? Because once it is aborted we need to resend the schema. I think we can remove the xid from the list in the cleanup_rel_sync_cache function?

I have observed some more issues:

1. Currently, in ReorderBufferCommit, it is always expected that whenever we get REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM, we must have already got REORDER_BUFFER_CHANGE_INTERNAL_SPEC_INSERT, and in SPEC_CONFIRM we send the tuple we got in SPEC_INSERT. But now those two messages can be in different streams, so we need to find a way to handle this. Maybe once we get SPEC_INSERT we can remember the tuple, and then if we get the SPEC_CONFIRM in the next stream we can send that tuple?

2. During commit time in DecodeCommit we check whether we need to skip the changes of the transaction or not by calling SnapBuildXactNeedsSkip, but since we now support streaming it's possible that before we decode the commit WAL we might have already sent the changes to the output plugin even though we could have skipped those changes. So my question is: instead of checking at commit time, can't we check before adding to the ReorderBuffer itself, or truncate the changes if SnapBuildXactNeedsSkip is true whenever the logical_decoding_work_mem limit is reached? Am I missing something here?

-- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
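For reference, the lookup side of this scheme (a minimal sketch only; the function name and the streamed_txns field follow the hunk above, everything else is an assumption) could look roughly like this:

static bool
get_schema_sent_in_streamed_txn(RelationSyncEntry *entry, TransactionId xid)
{
    ListCell   *lc;

    /* Linear scan over the top-level xids of streamed transactions. */
    foreach(lc, entry->streamed_txns)
    {
        if (xid == (TransactionId) lfirst_int(lc))
            return true;
    }

    return false;
}

The send-schema decision in pgoutput would then consult this for streamed transactions instead of relying on the single is_schema_sent flag discussed above.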
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Dilip Kumar
Date:
On Thu, Dec 12, 2019 at 9:44 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > Yesterday, Tomas has posted the latest version of the patch set which contain the fix for schema send part. Meanwhile, I was working on few review comments/bugfixes and refactoring. I have tried to merge those changes with the latest patch set except the refactoring related to "0006-Implement-streaming-mode-in-ReorderBuffer" patch, because Tomas has also made some changes in the same patch. I have created a separate patch for the same so that we can review the changes and then we can merge them to the main patch. > On Wed, Dec 11, 2019 at 5:22 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > On Mon, Dec 9, 2019 at 1:27 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > > > I have review the patch set and here are few comments/questions > > > > > > 1. > > > +static void > > > +pg_decode_stream_change(LogicalDecodingContext *ctx, > > > + ReorderBufferTXN *txn, > > > + Relation relation, > > > + ReorderBufferChange *change) > > > +{ > > > + OutputPluginPrepareWrite(ctx, true); > > > + appendStringInfo(ctx->out, "streaming change for TXN %u", txn->xid); > > > + OutputPluginWrite(ctx, true); > > > +} > > > > > > Should we show the tuple in the streamed change like we do for the > > > pg_decode_change? > > > > > > > I think so. The patch shows the message in > > pg_decode_stream_message(), so why to prohibit showing tuple here? Yeah, we can do that. One option is that we can directly register "pg_decode_change" function as stream_change_cb plugin and that will show the tuple, another option is that we can write a similar function as pg_decode_change and change the message which includes the text "STREAM" so that the user can distinguish between tuple from committed transaction and the in-progress transaction. While analyzing this solution I have encountered one more issue, the problem is that currently, during commit time in DecodeCommit we check whether we need to skip the changes of the transaction or not by calling SnapBuildXactNeedsSkip but since now we support streaming so it's possible that before commit wal arrive we might have already sent the changes to the output plugin even though we could have skipped those changes. So my question is instead of checking at the commit time can't we check before adding to ReorderBuffer itself or we can truncate the changes if SnapBuildXactNeedsSkip is true whenever logical_decoding_workmem limit is reached. > > Few comments on this patch series: > > > > 0001-Immediately-WAL-log-assignments: > > ------------------------------------------------------------ > > > > The commit message still refers to the old design for this patch. I > > think you need to modify the commit message as per the latest patch. Done > > > > 0002-Issue-individual-invalidations-with-wal_level-log > > ---------------------------------------------------------------------------- > > 1. > > xact_desc_invalidations(StringInfo buf, > > { > > .. > > + else if (msg->id == SHAREDINVALSNAPSHOT_ID) > > + appendStringInfo(buf, " snapshot %u", msg->sn.relId); > > > > You have removed logging for the above cache but forgot to remove its > > reference from one of the places. Also, I think you need to add a > > comment somewhere in inval.c to say why you are writing for WAL for > > some types of invalidations and not for others? Done > > > > 0003-Extend-the-output-plugin-API-with-stream-methods > > -------------------------------------------------------------------------------- > > 1. 
> > + are required, while <function>stream_message_cb</function> and > > + <function>stream_message_cb</function> are optional. > > > > stream_message_cb is mentioned twice. It seems the second one is for truncate. Done > > > > 2. > > size of the transaction size and network bandwidth, the transfer time > > + may significantly increase the apply lag. > > > > /size of the transaction size/size of the transaction > > > > no need to mention size twice. Done > > > > 3. > > + Similarly to spill-to-disk behavior, streaming is triggered when the total > > + amount of changes decoded from the WAL (for all in-progress > > transactions) > > + exceeds limit defined by <varname>logical_work_mem</varname> setting. > > > > The guc name used is wrong. /Similarly to/Similar to/ Done > > > > 4. > > stream_start_cb_wrapper() > > { > > .. > > + /* state.report_location = apply_lsn; */ > > .. > > + /* FIXME ctx->write_location = apply_lsn; */ > > .. > > } > > > > See, if we can fix these and similar in the callback for the stop. I > > think we don't have final_lsn till we commit/abort. Can we compute > > before calling these API's? Done > > > > > > 0005-Gracefully-handle-concurrent-aborts-of-uncommitte > > ---------------------------------------------------------------------------------- > > 1. > > @@ -1877,6 +1877,7 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid, > > PG_CATCH(); > > { > > /* TODO: Encapsulate cleanup > > from the PG_TRY and PG_CATCH blocks */ > > + > > if (iterstate) > > ReorderBufferIterTXNFinish(rb, iterstate); > > > > Spurious line change. > > Done > > 2. The commit message of this patch refers to Prepared transactions. > > I think that needs to be changed. > > > > 0006-Implement-streaming-mode-in-ReorderBuffer > > ------------------------------------------------------------------------- > > 1. > > + > > +/* iterator for streaming (only get data from memory) */ > > +static ReorderBufferStreamIterTXNState * ReorderBufferStreamIterTXNInit( > > + > > ReorderBuffer *rb, > > + > > ReorderBufferTXN > > *txn); > > + > > +static ReorderBufferChange *ReorderBufferStreamIterTXNNext( > > + ReorderBuffer *rb, > > + > > ReorderBufferStreamIterTXNState * state); > > + > > +static void ReorderBufferStreamIterTXNFinish( > > + > > ReorderBuffer *rb, > > + > > ReorderBufferStreamIterTXNState * state); > > > > Do we really need to introduce new APIs for iterating over changes > > from streamed transactions? Why can't we reuse the same API's as we > > use for committed xacts? Done > > > > 2. > > +static void > > +ReorderBufferStreamCommit(ReorderBuffer *rb, ReorderBufferTXN *txn) > > > > Please write some comments atop ReorderBufferStreamCommit. Done > > > > 3. > > +ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn) > > { > > .. > > .. 
> > + if (txn->snapshot_now > > == NULL) > > + { > > + dlist_iter subxact_i; > > + > > + /* make sure this transaction is streamed for the first time */ > > + > > Assert(!rbtxn_is_streamed(txn)); > > + > > + /* at the beginning we should have invalid command ID */ > > + Assert(txn->command_id == > > InvalidCommandId); > > + > > + dlist_foreach(subxact_i, &txn->subtxns) > > + { > > + ReorderBufferTXN *subtxn; > > + > > + > > subtxn = dlist_container(ReorderBufferTXN, node, subxact_i.cur); > > + > > + if (subtxn->base_snapshot != NULL && > > + > > (txn->base_snapshot == NULL || > > + txn->base_snapshot_lsn > subtxn->base_snapshot_lsn)) > > + { > > + > > txn->base_snapshot = subtxn->base_snapshot; > > > > The logic here seems to be correct, but I am not sure why it is not > > considered to purge the base snapshot before assigning the subtxn's > > snapshot and similarly, we have not purged snapshot for subtxn once we > > are done with it. I think we can use > > ReorderBufferTransferSnapToParent to replace part of the logic here. > > Do you see any reason for doing things differently here? Done > > > > 4. In ReorderBufferStreamTXN, why do you need to use > > ReorderBufferCopySnap to assign txn->base_snapshot to snapshot_now. IMHO, here instead of directly copying the base snapshot we are modifying it by passing command id and thats the reason we are copying it. > > > > 5. I see a lot of code similarity in ReorderBufferStreamTXN and > > existing ReorderBufferCommit. I understand that there are some subtle > > differences due to which we need to write this new function but can't > > we encapsulate the specific parts of code in functions and then call > > from both places. I am talking about code in different cases for > > change->action. Done > > > > 6. + * Note: We never stream and serialize a transaction at the same time (e > > /(e/(we Done I have also found one bug in "v3-0012-fixup-add-proper-schema-tracking.patch" due to which some of the streaming test cases were failing, I have created a separate patch to fix the same. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Attachment
- v4-0001-Immediately-WAL-log-assignments.patch
- v4-0002-Issue-individual-invalidations-with-wal_level-log.patch
- v4-0003-fixup-is_schema_sent-set-too-early.patch
- v4-0004-Extend-the-output-plugin-API-with-stream-methods.patch
- v4-0005-Cleaning-up-of-flags-in-ReorderBufferTXN-structur.patch
- v4-0006-Gracefully-handle-concurrent-aborts-of-uncommitte.patch
- v4-0007-Implement-streaming-mode-in-ReorderBuffer.patch
- v4-0008-fixup-add-is_schema_sent-back.patch
- v4-0009-fixup-get-rid-of-is_schema_sent-entirely.patch
- v4-0010-Support-logical_decoding_work_mem-set-from-create.patch
- v4-0011-Add-support-for-streaming-to-built-in-replication.patch
- v4-0012-fixup-add-proper-schema-tracking.patch
- v4-0013-Track-statistics-for-streaming.patch
- v4-0014-Enable-streaming-for-all-subscription-TAP-tests.patch
- v4-0015-BUGFIX-set-final_lsn-for-subxacts-before-cleanup.patch
- v4-0016-Add-TAP-test-for-streaming-vs.-DDL.patch
- v4-0017-Extend-handling-of-concurrent-aborts-for-streamin.patch
- v4-0018-Review-comment-fix-and-refactoring.patch
- v4-0019-Bugfix-in-schema-tracking.patch
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Amit Kapila
Date:
On Sun, Dec 29, 2019 at 1:34 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > I have observed some more issues > > 1. Currently, In ReorderBufferCommit, it is always expected that > whenever we get REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM, we must > have already got REORDER_BUFFER_CHANGE_INTERNAL_SPEC_INSERT and in > SPEC_CONFIRM we send the tuple we got in SPECT_INSERT. But, now those > two messages can be in different streams. So we need to find a way to > handle this. Maybe once we get SPEC_INSERT then we can remember the > tuple and then if we get the SPECT_CONFIRM in the next stream we can > send that tuple? >

Your suggestion makes sense to me. So, we can try it.

> 2. During commit time in DecodeCommit we check whether we need to skip > the changes of the transaction or not by calling > SnapBuildXactNeedsSkip but since now we support streaming so it's > possible that before we decode the commit WAL, we might have already > sent the changes to the output plugin even though we could have > skipped those changes. So my question is instead of checking at the > commit time can't we check before adding to ReorderBuffer itself >

I think if we can do that, then the same will be true for the current code irrespective of this patch. It is possible that we can't take that decision while decoding because we haven't assembled a consistent snapshot yet. We might be able to do that while we try to stream the changes. I think we need to take care of all the conditions during streaming (when the logical_decoding_work_mem limit is reached) as we do in DecodeCommit. This needs a bit more study.

-- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
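To make the idea concrete, here is an illustrative sketch only (the struct and helper names are assumptions; in the real reorderbuffer this state would have to live in ReorderBufferTXN so it survives stream boundaries):

typedef struct SpecInsertState
{
    /* speculative insert seen in an earlier stream, not yet confirmed */
    ReorderBufferChange *pending_spec_insert;
} SpecInsertState;

static void
handle_spec_insert(SpecInsertState *state, ReorderBufferChange *change)
{
    /* Hold on to the tuple; it is only sent once SPEC_CONFIRM is seen. */
    state->pending_spec_insert = change;
}

static void
handle_spec_confirm(SpecInsertState *state, ReorderBuffer *rb,
                    ReorderBufferTXN *txn, Relation relation)
{
    /* The confirm may arrive in a later stream than the insert itself. */
    if (state->pending_spec_insert != NULL)
    {
        rb->apply_change(rb, txn, relation, state->pending_spec_insert);
        state->pending_spec_insert = NULL;
    }
}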
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Amit Kapila
Date:
On Thu, Dec 26, 2019 at 12:36 PM Masahiko Sawada <masahiko.sawada@2ndquadrant.com> wrote: > > On Tue, 24 Dec 2019 at 17:21, Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > > Thank you for explanation. The plan makes sense. But I think in the > > > current design it's a problem that logical replication worker doesn't > > > receive changes (and doesn't check interrupts) during applying > > > committed changes even if we don't have a worker dedicated for > > > applying. I think the worker should continue to receive changes and > > > save them to temporary files even during applying changes. > > > > > > > Won't it beat the purpose of this feature which is to reduce the apply > > lag? Basically, it can so happen that while applying commit, it > > constantly gets changes of other transactions which will delay the > > apply of the current transaction. > > You're right. But it seems to me that it optimizes the apply lags of > only a transaction that made many changes. On the other hand if a > transaction made many changes applying of subsequent changes are > delayed. > Hmm, how would it be worse than the current situation where once commit is encountered on the publisher, we won't start with other transactions until the replay of the same is finished on subscriber? > > > I think the best way as > > discussed is to launch new workers for streamed transactions, but we > > can do that as an additional feature. Anyway, as proposed, users can > > choose the streaming mode for subscriptions, so there is an option to > > turn this selectively. > > Yes. But user who wants to use this feature would want to replicate > many changes but I guess the side effect is quite big. I think that at > least we need to make the logical replication tolerate such situation. > What exactly you mean by "at least we need to make the logical replication tolerate such situation."? Do you have something specific in mind? -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Dilip Kumar
Date:
On Mon, Dec 30, 2019 at 3:43 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Sun, Dec 29, 2019 at 1:34 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > I have observed some more issues > > > > 1. Currently, In ReorderBufferCommit, it is always expected that > > whenever we get REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM, we must > > have already got REORDER_BUFFER_CHANGE_INTERNAL_SPEC_INSERT and in > > SPEC_CONFIRM we send the tuple we got in SPECT_INSERT. But, now those > > two messages can be in different streams. So we need to find a way to > > handle this. Maybe once we get SPEC_INSERT then we can remember the > > tuple and then if we get the SPECT_CONFIRM in the next stream we can > > send that tuple? > > > > Your suggestion makes sense to me. So, we can try it. Sure. > > > 2. During commit time in DecodeCommit we check whether we need to skip > > the changes of the transaction or not by calling > > SnapBuildXactNeedsSkip but since now we support streaming so it's > > possible that before we decode the commit WAL, we might have already > > sent the changes to the output plugin even though we could have > > skipped those changes. So my question is instead of checking at the > > commit time can't we check before adding to ReorderBuffer itself > > > > I think if we can do that then the same will be true for current code > irrespective of this patch. I think it is possible that we can't take > that decision while decoding because we haven't assembled a consistent > snapshot yet. I think we might be able to do that while we try to > stream the changes. I think we need to take care of all the > conditions during streaming (when the logical_decoding_workmem limit > is reached) as we do in DecodeCommit. This needs a bit more study. I agree. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Amit Kapila
Date:
On Tue, Dec 24, 2019 at 10:58 AM Robert Haas <robertmhaas@gmail.com> wrote: > > On Thu, Dec 12, 2019 at 3:41 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > I think the way invalidations work for logical replication is that > > normally, we always start a new transaction before decoding each > > commit which allows us to accept the invalidations (via > > AtStart_Cache). However, if there are catalog changes within the > > transaction being decoded, we need to reflect those before trying to > > decode the WAL of operation which happened after that catalog change. > > As we are not logging the WAL for each invalidation, we need to > > execute all the invalidation messages for this transaction at each > > catalog change. We are able to do that now as we decode the entire WAL > > for a transaction only once we get the commit's WAL which contains all > > the invalidation messages. So, we queue them up and execute them for > > each catalog change which we identify by WAL record > > XLOG_HEAP2_NEW_CID. > > Thanks for the explanation. That makes sense. But, it's still true, > AFAICS, that instead of doing this stuff with logging invalidations > you could just InvalidateSystemCaches() in the cases where you are > currently applying all of the transaction's invalidations. That > approach might be worse than changing the way invalidations are > logged, but the two approaches deserve to be compared. One approach > has more CPU overhead and the other has more WAL overhead, so it's a > little hard to compare them, but it seems worth mulling over. >

I have given this some thought, and it seems to me that it will increase not only CPU usage but also network usage. The increase in CPU usage would be for all WALSenders that decode a transaction that has performed DDL. The increase in network usage comes from the fact that we would need to send the schema of relations again even though those relations didn't actually require invalidation, because the blanket invalidation blows away our local map that remembers which relation schemas have already been sent.

-- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Amit Kapila
Date:
On Sun, Dec 29, 2019 at 1:34 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > On Sat, Dec 28, 2019 at 9:33 PM Tomas Vondra > <tomas.vondra@2ndquadrant.com> wrote: > > > > > > Yeah, the "is_schema_sent" flag in ReorderBufferTXN does not work - it > > needs to be in the RelationSyncEntry. In fact, I already have code for > > that in my private repository - I thought the patches I sent here do > > include this, but apparently I forgot to include this bit :-( > > > > Attached is a rebased patch series, fixing this. It's essentially v2 > > with a couple of patches (0003, 0008, 0009 and 0012) replacing the > > is_schema_sent with correct handling. > > > > > > 0003 - removes an is_schema_sent reference added prematurely (it's added > > by a later patch, causing compile failure) > > > > 0008 - adds the is_schema_sent back (essentially reverting 0003) > > > > 0009 - removes is_schema_sent entirely > > > > 0012 - adds the correct handling of schema flags in pgoutput > > Thanks for splitting the changes. They are quite clear. > > > > I don't know what other changes you've made since v2, so this way it > > should be possible to just take 0003, 0008, 0009 and 0012 and slip them > > in with minimal hassle. > > > > FWIW thanks to everyone (and Amit and Dilip in particular) working on > > this patch series. There's been a lot of great reviews and improvements > > since I abandoned this thread for a while. I expect to be able to spend > > more time working on this in January. > > > +static void > +set_schema_sent_in_streamed_txn(RelationSyncEntry *entry, TransactionId xid) > +{ > + MemoryContext oldctx; > + > + oldctx = MemoryContextSwitchTo(CacheMemoryContext); > + > + entry->streamed_txns = lappend_int(entry->streamed_txns, xid); > + > + MemoryContextSwitchTo(oldctx); > +} > I was looking into the schema tracking solution and I have one > question, Shouldn't we remove the topxid from the list if the > (sub)transaction is aborted? because once it is aborted we need to > resent the schema. > I think you are right because, at abort, the subscriber would remove the changes (for a subtransaction) including the schema changes sent and then it won't be able to understand the subsequent changes sent by the publisher. Won't we need to remove xid from the list at commit time as well, otherwise, the list will keep on growing. One more thing, we need to search the list of all the relations in the local map to find xid being aborted/committed, right? If so, won't it be costly doing at each transaction abort/commit? -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
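A rough sketch of that cleanup, under the assumption that pgoutput keeps its entries in a local hash table (the RelationSyncCache name and streamed_txns field follow the hunks quoted above; everything else is illustrative), might look like this:

static void
cleanup_rel_sync_cache(TransactionId xid)
{
    HASH_SEQ_STATUS hash_seq;
    RelationSyncEntry *entry;

    /*
     * Walk every entry in the local relation sync cache and forget that
     * the schema was sent for this (now committed or aborted) streamed
     * transaction, so the per-entry list does not grow without bound.
     * This full-cache walk is exactly the per-commit/abort cost being
     * discussed above.
     */
    hash_seq_init(&hash_seq, RelationSyncCache);
    while ((entry = (RelationSyncEntry *) hash_seq_search(&hash_seq)) != NULL)
        entry->streamed_txns = list_delete_int(entry->streamed_txns, xid);
}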
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Dilip Kumar
Date:
On Sat, Jan 4, 2020 at 10:00 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Sun, Dec 29, 2019 at 1:34 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Sat, Dec 28, 2019 at 9:33 PM Tomas Vondra > > <tomas.vondra@2ndquadrant.com> wrote: > > +static void > > +set_schema_sent_in_streamed_txn(RelationSyncEntry *entry, TransactionId xid) > > +{ > > + MemoryContext oldctx; > > + > > + oldctx = MemoryContextSwitchTo(CacheMemoryContext); > > + > > + entry->streamed_txns = lappend_int(entry->streamed_txns, xid); > > + > > + MemoryContextSwitchTo(oldctx); > > +} > > I was looking into the schema tracking solution and I have one > > question, Shouldn't we remove the topxid from the list if the > > (sub)transaction is aborted? because once it is aborted we need to > > resent the schema. > > > > I think you are right because, at abort, the subscriber would remove > the changes (for a subtransaction) including the schema changes sent > and then it won't be able to understand the subsequent changes sent by > the publisher. Won't we need to remove xid from the list at commit > time as well, otherwise, the list will keep on growing.

Yes, we need to remove the xid from the list at the time of commit as well.

> One more > thing, we need to search the list of all the relations in the local > map to find xid being aborted/committed, right? If so, won't it be > costly doing at each transaction abort/commit?

Yeah, if multiple concurrent transactions operate on common relations then the list can grow long. I am not sure how many concurrent large transactions are likely; maybe the list won't be so large that searching it becomes very costly. Otherwise, we could maintain a sorted array of the xids and do a binary search, or we could maintain a hash table?

-- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
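Purely as an illustration of the sorted-array alternative mentioned here (an assumption about the representation, not what the patch does), the lookup would be a plain binary search:

static bool
streamed_txn_contains(const TransactionId *xids, int nxids, TransactionId xid)
{
    int         low = 0;
    int         high = nxids - 1;

    /* Binary search over an array of xids kept in ascending order. */
    while (low <= high)
    {
        int         mid = low + (high - low) / 2;

        if (xids[mid] == xid)
            return true;
        else if (xids[mid] < xid)
            low = mid + 1;
        else
            high = mid - 1;
    }

    return false;
}

Whether that (or a hash table) is worth it depends on how long the per-relation list can realistically get, as discussed above.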
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Amit Kapila
Date:
On Mon, Dec 30, 2019 at 3:11 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Thu, Dec 12, 2019 at 9:44 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > Yesterday, Tomas has posted the latest version of the patch set which > contain the fix for schema send part. Meanwhile, I was working on few > review comments/bugfixes and refactoring. I have tried to merge those > changes with the latest patch set except the refactoring related to > "0006-Implement-streaming-mode-in-ReorderBuffer" patch, because Tomas > has also made some changes in the same patch. > I don't see any changes by Tomas in that particular patch, am I missing something? > I have created a > separate patch for the same so that we can review the changes and then > we can merge them to the main patch. > It is better to merge it with the main patch for "Implement-streaming-mode-in-ReorderBuffer", otherwise, it is a bit difficult to review. > > > 0002-Issue-individual-invalidations-with-wal_level-log > > > ---------------------------------------------------------------------------- > > > 1. > > > xact_desc_invalidations(StringInfo buf, > > > { > > > .. > > > + else if (msg->id == SHAREDINVALSNAPSHOT_ID) > > > + appendStringInfo(buf, " snapshot %u", msg->sn.relId); > > > > > > You have removed logging for the above cache but forgot to remove its > > > reference from one of the places. Also, I think you need to add a > > > comment somewhere in inval.c to say why you are writing for WAL for > > > some types of invalidations and not for others? > Done > I don't see any new comments as asked by me. I think we should also consider WAL logging at each command end instead of doing piecemeal as discussed in another email [1], which will have lesser code changes and maybe better in performance. You might want to evaluate the performance of both approaches. > > > > > > 0003-Extend-the-output-plugin-API-with-stream-methods > > > -------------------------------------------------------------------------------- > > > > > > 4. > > > stream_start_cb_wrapper() > > > { > > > .. > > > + /* state.report_location = apply_lsn; */ > > > .. > > > + /* FIXME ctx->write_location = apply_lsn; */ > > > .. > > > } > > > > > > See, if we can fix these and similar in the callback for the stop. I > > > think we don't have final_lsn till we commit/abort. Can we compute > > > before calling these API's? > Done > You have just used final_lsn, but I don't see where you have ensured that it is set before the API stream_stop_cb_wrapper. I think we need something similar to what Vignesh has done in one of his bug-fix patch [2]. See my comment below in this regard. > > > > > > > > > 0005-Gracefully-handle-concurrent-aborts-of-uncommitte > > > ---------------------------------------------------------------------------------- > > > 1. > > > @@ -1877,6 +1877,7 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid, > > > PG_CATCH(); > > > { > > > /* TODO: Encapsulate cleanup > > > from the PG_TRY and PG_CATCH blocks */ > > > + > > > if (iterstate) > > > ReorderBufferIterTXNFinish(rb, iterstate); > > > > > > Spurious line change. > > > > Done + /* + * We don't expect direct calls to heap_getnext with valid + * CheckXidAlive for regular tables. Track that below. 
+ */ + if (unlikely(TransactionIdIsValid(CheckXidAlive) && + !(IsCatalogRelation(scan->rs_base.rs_rd) || + RelationIsUsedAsCatalogTable(scan->rs_base.rs_rd)))) + elog(ERROR, "improper heap_getnext call"); Earlier, I thought we don't need to check if it is a regular table in this check, but it is required because output plugins can try to do that and if they do so during decoding (with historic snapshots), the same should be not allowed. How about changing the error message to "unexpected heap_getnext call during logical decoding" or something like that? > > > 2. The commit message of this patch refers to Prepared transactions. > > > I think that needs to be changed. > > > > > > 0006-Implement-streaming-mode-in-ReorderBuffer > > > ------------------------------------------------------------------------- Few comments on v4-0018-Review-comment-fix-and-refactoring: 1. + if (streaming) + { + /* + * Set the last last of the stream as the final lsn before calling + * stream stop. + */ + txn->final_lsn = prev_lsn; + rb->stream_stop(rb, txn); + } Shouldn't we try to final_lsn as is done by Vignesh's patch [2]? 2. + if (streaming) + { + /* + * Set the CheckXidAlive to the current (sub)xid for which this + * change belongs to so that we can detect the abort while we are + * decoding. + */ + CheckXidAlive = change->txn->xid; + + /* Increment the stream count. */ + streamed++; + } Is the variable 'streamed' used anywhere? 3. + /* + * Destroy the (relfilenode, ctid) hashtable, so that we don't leak + * any memory. We could also keep the hash table and update it with + * new ctid values, but this seems simpler and good enough for now. + */ + ReorderBufferDestroyTupleCidHash(rb, txn); Won't this be required only when we are streaming changes? As per my understanding apart from the above comments, the known pending work for this patchset is as follows: a. The two open items agreed to you in the email [3]. b. Complete the handling of schema_sent as discussed above [4]. c. Few comments by Vignesh and the response on the same by me [5][6]. d. WAL overhead and performance testing for additional WAL logging by this patchset. e. Some way to see the tuple for streamed transactions by decoding API as speculated by you [7]. Have I missed anything? [1] - https://www.postgresql.org/message-id/CAA4eK1LOa%2B2KqNX%3Dm%3D1qMBDW%2Bo50AuwjAOX6ZqL-rWGiH1F9MQ%40mail.gmail.com [2] - https://www.postgresql.org/message-id/CALDaNm3MDxFnsZsnSqVhPBLS3%3DqzNH6%2BYzB%3DxYuX2vbtsUeFgw%40mail.gmail.com [3] - https://www.postgresql.org/message-id/CAFiTN-uT5YZE0egGhKdTteTjcGrPi8hb%3DFMPpr9_hEB7hozQ-Q%40mail.gmail.com [4] - https://www.postgresql.org/message-id/CAA4eK1KjD9x0mS4JxzCbu3gu-r6K7XJRV%2BZcGb3BH6U3x2uxew%40mail.gmail.com [5] - https://www.postgresql.org/message-id/CALDaNm0DNUojjt7CV-fa59_kFbQQ3rcMBtauvo44ttea7r9KaA%40mail.gmail.com [6] - https://www.postgresql.org/message-id/CAA4eK1%2BZvupW00c--dqEg8f3dHZDOGmA9xOQLyQHjRSoDi6AkQ%40mail.gmail.com [7] - https://www.postgresql.org/message-id/CAFiTN-t8PmKA1X4jEqKmkvs0ggWpy0APWpPuaJwpx2YpfAf97w%40mail.gmail.com -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Dilip Kumar
Date:
On Sat, Jan 4, 2020 at 4:07 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Mon, Dec 30, 2019 at 3:11 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > On Thu, Dec 12, 2019 at 9:44 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > > Yesterday, Tomas has posted the latest version of the patch set which > > contain the fix for schema send part. Meanwhile, I was working on few > > review comments/bugfixes and refactoring. I have tried to merge those > > changes with the latest patch set except the refactoring related to > > "0006-Implement-streaming-mode-in-ReorderBuffer" patch, because Tomas > > has also made some changes in the same patch. > > > > I don't see any changes by Tomas in that particular patch, am I > missing something? He has created some sub-patch from the main patch for handling schema-sent issue. So if I make change in that patch all other patches will conflict. > > > I have created a > > separate patch for the same so that we can review the changes and then > > we can merge them to the main patch. > > > > It is better to merge it with the main patch for > "Implement-streaming-mode-in-ReorderBuffer", otherwise, it is a bit > difficult to review. Actually, we can merge 0008, 0009, 0012, 0018 to the main patch (0007). Basically, if we merge all of them then we don't need to deal with the conflict. I think Tomas has kept them separate so that we can review the solution for the schema sent. And, I kept 0018 as a separate patch to avoid conflict and rebasing in 0008, 0009 and 0012. In the next patch set, I will merge all of them to 0007. > > > > > 0002-Issue-individual-invalidations-with-wal_level-log > > > > ---------------------------------------------------------------------------- > > > > 1. > > > > xact_desc_invalidations(StringInfo buf, > > > > { > > > > .. > > > > + else if (msg->id == SHAREDINVALSNAPSHOT_ID) > > > > + appendStringInfo(buf, " snapshot %u", msg->sn.relId); > > > > > > > > You have removed logging for the above cache but forgot to remove its > > > > reference from one of the places. Also, I think you need to add a > > > > comment somewhere in inval.c to say why you are writing for WAL for > > > > some types of invalidations and not for others? > > Done > > > > I don't see any new comments as asked by me. Oh, I just fixed one part of the comment and overlooked the rest. Will fix. I think we should also > consider WAL logging at each command end instead of doing piecemeal as > discussed in another email [1], which will have lesser code changes > and maybe better in performance. You might want to evaluate the > performance of both approaches. Ok > > > > > > > > > 0003-Extend-the-output-plugin-API-with-stream-methods > > > > -------------------------------------------------------------------------------- > > > > > > > > 4. > > > > stream_start_cb_wrapper() > > > > { > > > > .. > > > > + /* state.report_location = apply_lsn; */ > > > > .. > > > > + /* FIXME ctx->write_location = apply_lsn; */ > > > > .. > > > > } > > > > > > > > See, if we can fix these and similar in the callback for the stop. I > > > > think we don't have final_lsn till we commit/abort. Can we compute > > > > before calling these API's? > > Done > > > > You have just used final_lsn, but I don't see where you have ensured > that it is set before the API stream_stop_cb_wrapper. I think we need > something similar to what Vignesh has done in one of his bug-fix patch > [2]. See my comment below in this regard. You can refer below hunk in 0018. 
+ /* + * Done with current changes, call stream_stop callback for streaming + * transaction, commit callback otherwise. + */ + if (streaming) + { + /* + * Set the last last of the stream as the final lsn before calling + * stream stop. + */ + txn->final_lsn = prev_lsn; + rb->stream_stop(rb, txn); + } > > > > > > > > > > > > > 0005-Gracefully-handle-concurrent-aborts-of-uncommitte > > > > ---------------------------------------------------------------------------------- > > > > 1. > > > > @@ -1877,6 +1877,7 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid, > > > > PG_CATCH(); > > > > { > > > > /* TODO: Encapsulate cleanup > > > > from the PG_TRY and PG_CATCH blocks */ > > > > + > > > > if (iterstate) > > > > ReorderBufferIterTXNFinish(rb, iterstate); > > > > > > > > Spurious line change. > > > > > > Done > > + /* > + * We don't expect direct calls to heap_getnext with valid > + * CheckXidAlive for regular tables. Track that below. > + */ > + if (unlikely(TransactionIdIsValid(CheckXidAlive) && > + !(IsCatalogRelation(scan->rs_base.rs_rd) || > + RelationIsUsedAsCatalogTable(scan->rs_base.rs_rd)))) > + elog(ERROR, "improper heap_getnext call"); > > Earlier, I thought we don't need to check if it is a regular table in > this check, but it is required because output plugins can try to do > that I did not understand that, can you give some example? and if they do so during decoding (with historic snapshots), the > same should be not allowed. > > How about changing the error message to "unexpected heap_getnext call > during logical decoding" or something like that? Ok > > > > > 2. The commit message of this patch refers to Prepared transactions. > > > > I think that needs to be changed. > > > > > > > > 0006-Implement-streaming-mode-in-ReorderBuffer > > > > ------------------------------------------------------------------------- > > Few comments on v4-0018-Review-comment-fix-and-refactoring: > 1. > + if (streaming) > + { > + /* > + * Set the last last of the stream as the final lsn before calling > + * stream stop. > + */ > + txn->final_lsn = prev_lsn; > + rb->stream_stop(rb, txn); > + } > > Shouldn't we try to final_lsn as is done by Vignesh's patch [2]? Isn't it the same, there we are doing while serializing and here we are doing while streaming? Basically, the last LSN we streamed. Am I missing something? > > 2. > + if (streaming) > + { > + /* > + * Set the CheckXidAlive to the current (sub)xid for which this > + * change belongs to so that we can detect the abort while we are > + * decoding. > + */ > + CheckXidAlive = change->txn->xid; > + > + /* Increment the stream count. */ > + streamed++; > + } > > Is the variable 'streamed' used anywhere? > > 3. > + /* > + * Destroy the (relfilenode, ctid) hashtable, so that we don't leak > + * any memory. We could also keep the hash table and update it with > + * new ctid values, but this seems simpler and good enough for now. > + */ > + ReorderBufferDestroyTupleCidHash(rb, txn); > > Won't this be required only when we are streaming changes? I will work on this review comments and reply to them separately along with the patch. > > As per my understanding apart from the above comments, the known > pending work for this patchset is as follows: > a. The two open items agreed to you in the email [3]. > b. Complete the handling of schema_sent as discussed above [4]. > c. Few comments by Vignesh and the response on the same by me [5][6]. > d. WAL overhead and performance testing for additional WAL logging by > this patchset. > e. 
Some way to see the tuple for streamed transactions by decoding API > as speculated by you [7]. > > Have I missed anything? I think this is the list I remember. Apart from these few points by Robert which are still under discussion[8]. > > [1] - https://www.postgresql.org/message-id/CAA4eK1LOa%2B2KqNX%3Dm%3D1qMBDW%2Bo50AuwjAOX6ZqL-rWGiH1F9MQ%40mail.gmail.com > [2] - https://www.postgresql.org/message-id/CALDaNm3MDxFnsZsnSqVhPBLS3%3DqzNH6%2BYzB%3DxYuX2vbtsUeFgw%40mail.gmail.com > [3] - https://www.postgresql.org/message-id/CAFiTN-uT5YZE0egGhKdTteTjcGrPi8hb%3DFMPpr9_hEB7hozQ-Q%40mail.gmail.com > [4] - https://www.postgresql.org/message-id/CAA4eK1KjD9x0mS4JxzCbu3gu-r6K7XJRV%2BZcGb3BH6U3x2uxew%40mail.gmail.com > [5] - https://www.postgresql.org/message-id/CALDaNm0DNUojjt7CV-fa59_kFbQQ3rcMBtauvo44ttea7r9KaA%40mail.gmail.com > [6] - https://www.postgresql.org/message-id/CAA4eK1%2BZvupW00c--dqEg8f3dHZDOGmA9xOQLyQHjRSoDi6AkQ%40mail.gmail.com > [7] - https://www.postgresql.org/message-id/CAFiTN-t8PmKA1X4jEqKmkvs0ggWpy0APWpPuaJwpx2YpfAf97w%40mail.gmail.com [8] https://www.postgresql.org/message-id/CA%2BTgmoYH6N_YDvKH9AaAJo5ZTHn142K%3DB75VO9yKvjjjHcoZhA%40mail.gmail.com -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Amit Kapila
Date:
On Mon, Jan 6, 2020 at 9:21 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Sat, Jan 4, 2020 at 4:07 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > It is better to merge it with the main patch for > > "Implement-streaming-mode-in-ReorderBuffer", otherwise, it is a bit > > difficult to review. > Actually, we can merge 0008, 0009, 0012, 0018 to the main patch > (0007). Basically, if we merge all of them then we don't need to deal > with the conflict. I think Tomas has kept them separate so that we > can review the solution for the schema sent. And, I kept 0018 as a > separate patch to avoid conflict and rebasing in 0008, 0009 and 0012. > In the next patch set, I will merge all of them to 0007. > Okay, I think we can merge those patches. > > > > + /* > > + * We don't expect direct calls to heap_getnext with valid > > + * CheckXidAlive for regular tables. Track that below. > > + */ > > + if (unlikely(TransactionIdIsValid(CheckXidAlive) && > > + !(IsCatalogRelation(scan->rs_base.rs_rd) || > > + RelationIsUsedAsCatalogTable(scan->rs_base.rs_rd)))) > > + elog(ERROR, "improper heap_getnext call"); > > > > Earlier, I thought we don't need to check if it is a regular table in > > this check, but it is required because output plugins can try to do > > that > I did not understand that, can you give some example? > I think it can lead to the same problem of concurrent aborts as for catalog scans. > > > > > > > 2. The commit message of this patch refers to Prepared transactions. > > > > > I think that needs to be changed. > > > > > > > > > > 0006-Implement-streaming-mode-in-ReorderBuffer > > > > > ------------------------------------------------------------------------- > > > > Few comments on v4-0018-Review-comment-fix-and-refactoring: > > 1. > > + if (streaming) > > + { > > + /* > > + * Set the last last of the stream as the final lsn before calling > > + * stream stop. > > + */ > > + txn->final_lsn = prev_lsn; > > + rb->stream_stop(rb, txn); > > + } > > > > Shouldn't we try to final_lsn as is done by Vignesh's patch [2]? > Isn't it the same, there we are doing while serializing and here we > are doing while streaming? Basically, the last LSN we streamed. Am I > missing something? > No, I think you are right. Few more comments: -------------------------------- v4-0007-Implement-streaming-mode-in-ReorderBuffer 1. +ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn) { .. + /* + * TOCHECK: We have to rebuild historic snapshot to be sure it includes all + * information about subtransactions, which could arrive after streaming start. + */ + if (!txn->is_schema_sent) + snapshot_now = ReorderBufferCopySnap(rb, txn->base_snapshot, + txn, command_id); .. } Why are we using base snapshot here instead of the snapshot we saved the first time streaming has happened? And as mentioned in comments, won't we need to consider the snapshots for subtransactions that arrived after the last time we have streamed the changes? 2. + /* remember the command ID and snapshot for the streaming run */ + txn->command_id = command_id; + txn- >snapshot_now = ReorderBufferCopySnap(rb, snapshot_now, + txn, command_id); I don't see where the txn->snapshot_now is getting freed. The base_snapshot is freed in ReorderBufferCleanupTXN, but I don't see this getting freed. 3. +static void +ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn) { .. + /* + * If this is a subxact, we need to stream the top-level transaction + * instead. 
+ */ + if (txn->toptxn) + { + ReorderBufferStreamTXN(rb, txn->toptxn); + return; + } Is it ever possible that we reach here for subtransaction, if not, then it should be Assert rather than if condition? 4. In ReorderBufferStreamTXN(), don't we need to set some of the txn fields like origin_id, origin_lsn as we do in ReorderBufferCommit() especially to cover the case when it gets called due to memory overflow (aka via ReorderBufferCheckMemoryLimit). v4-0017-Extend-handling-of-concurrent-aborts-for-streamin 1. @@ -3712,7 +3727,22 @@ ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn) if (using_subtxn) RollbackAndReleaseCurrentSubTransaction(); - PG_RE_THROW(); + /* re-throw only if it's not an abort */ + if (errdata- >sqlerrcode != ERRCODE_TRANSACTION_ROLLBACK) + { + MemoryContextSwitchTo(ecxt); + PG_RE_THROW(); + } + else + { + /* remember the command ID and snapshot for the streaming run */ + txn- >command_id = command_id; + txn->snapshot_now = ReorderBufferCopySnap(rb, snapshot_now, + txn, command_id); + rb->stream_stop(rb, txn); + + FlushErrorState(); + } Can you update comments either in the above code block or some other place to explain what is the concurrent abort problem and how we dealt with it? Also, please explain how the above error handling is sufficient to address all the various scenarios (sub-transaction got aborted when we have already sent some changes, or when we have not sent any changes yet). v4-0006-Gracefully-handle-concurrent-aborts-of-uncommitte 1. + /* + * If CheckXidAlive is valid, then we check if it aborted. If it did, we + * error out + */ + if (TransactionIdIsValid(CheckXidAlive) && + !TransactionIdIsInProgress(CheckXidAlive) && + !TransactionIdDidCommit(CheckXidAlive)) + ereport(ERROR, + (errcode(ERRCODE_TRANSACTION_ROLLBACK), + errmsg("transaction aborted during system catalog scan"))); Why here we can't use TransactionIdDidAbort? If we can't use it, then can you add comments stating the reason of the same. 2. /* + * An xid value pointing to a possibly ongoing or a prepared transaction. + * Currently used in logical decoding. It's possible that such transactions + * can get aborted while the decoding is ongoing. + */ +TransactionId CheckXidAlive = InvalidTransactionId; In comments, there is a mention of a prepared transaction. Do we allow prepared transactions to be decoded as part of this patch? 3. + /* + * If CheckXidAlive is valid, then we check if it aborted. If it did, we + * error out + */ + if (TransactionIdIsValid (CheckXidAlive) && + !TransactionIdIsInProgress(CheckXidAlive) && + !TransactionIdDidCommit(CheckXidAlive)) This comment just says what code below is doing, can you explain the rationale behind this check. It would be better if it is clear by reading comments, why we are doing this check after fetching the tuple. I think this can refer to the comment I suggested to add for changes in patch v4-0017-Extend-handling-of-concurrent-aborts-for-streamin. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Dilip Kumar
Date:
On Mon, Jan 6, 2020 at 2:11 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Mon, Jan 6, 2020 at 9:21 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > On Sat, Jan 4, 2020 at 4:07 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > > > It is better to merge it with the main patch for > > > "Implement-streaming-mode-in-ReorderBuffer", otherwise, it is a bit > > > difficult to review. > > Actually, we can merge 0008, 0009, 0012, 0018 to the main patch > > (0007). Basically, if we merge all of them then we don't need to deal > > with the conflict. I think Tomas has kept them separate so that we > > can review the solution for the schema sent. And, I kept 0018 as a > > separate patch to avoid conflict and rebasing in 0008, 0009 and 0012. > > In the next patch set, I will merge all of them to 0007. > > > > Okay, I think we can merge those patches. ok > > > > > > > + /* > > > + * We don't expect direct calls to heap_getnext with valid > > > + * CheckXidAlive for regular tables. Track that below. > > > + */ > > > + if (unlikely(TransactionIdIsValid(CheckXidAlive) && > > > + !(IsCatalogRelation(scan->rs_base.rs_rd) || > > > + RelationIsUsedAsCatalogTable(scan->rs_base.rs_rd)))) > > > + elog(ERROR, "improper heap_getnext call"); > > > > > > Earlier, I thought we don't need to check if it is a regular table in > > > this check, but it is required because output plugins can try to do > > > that > > I did not understand that, can you give some example? > > > > I think it can lead to the same problem of concurrent aborts as for > catalog scans. Yeah, got it. > > > > > > > > > > 2. The commit message of this patch refers to Prepared transactions. > > > > > > I think that needs to be changed. > > > > > > > > > > > > 0006-Implement-streaming-mode-in-ReorderBuffer > > > > > > ------------------------------------------------------------------------- > > > > > > Few comments on v4-0018-Review-comment-fix-and-refactoring: > > > 1. > > > + if (streaming) > > > + { > > > + /* > > > + * Set the last last of the stream as the final lsn before calling > > > + * stream stop. > > > + */ > > > + txn->final_lsn = prev_lsn; > > > + rb->stream_stop(rb, txn); > > > + } > > > > > > Shouldn't we try to final_lsn as is done by Vignesh's patch [2]? > > Isn't it the same, there we are doing while serializing and here we > > are doing while streaming? Basically, the last LSN we streamed. Am I > > missing something? > > > > No, I think you are right. > > Few more comments: > -------------------------------- > v4-0007-Implement-streaming-mode-in-ReorderBuffer > 1. > +ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn) > { > .. > + /* > + * TOCHECK: We have to rebuild historic snapshot to be sure it includes all > + * information about > subtransactions, which could arrive after streaming start. > + */ > + if (!txn->is_schema_sent) > + snapshot_now > = ReorderBufferCopySnap(rb, txn->base_snapshot, > + txn, > command_id); > .. > } > > Why are we using base snapshot here instead of the snapshot we saved > the first time streaming has happened? And as mentioned in comments, > won't we need to consider the snapshots for subtransactions that > arrived after the last time we have streamed the changes? > > 2. > + /* remember the command ID and snapshot for the streaming run */ > + txn->command_id = command_id; > + txn- > >snapshot_now = ReorderBufferCopySnap(rb, snapshot_now, > + > txn, command_id); > > I don't see where the txn->snapshot_now is getting freed. 
The > base_snapshot is freed in ReorderBufferCleanupTXN, but I don't see > this getting freed. Ok, I will check that and fix. > > 3. > +static void > +ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn) > { > .. > + /* > + * If this is a subxact, we need to stream the top-level transaction > + * instead. > + */ > + if (txn->toptxn) > + { > + > ReorderBufferStreamTXN(rb, txn->toptxn); > + return; > + } > > Is it ever possible that we reach here for subtransaction, if not, > then it should be Assert rather than if condition? ReorderBufferCheckMemoryLimit, can call it either for the subtransaction or for the main transaction, depends upon in which ReorderBufferTXN you are adding the current change. I will analyze your other comments and fix them in the next version. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Amit Kapila
Date:
On Mon, Jan 6, 2020 at 3:56 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Mon, Jan 6, 2020 at 2:11 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > 3. > > +static void > > +ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn) > > { > > .. > > + /* > > + * If this is a subxact, we need to stream the top-level transaction > > + * instead. > > + */ > > + if (txn->toptxn) > > + { > > + > > ReorderBufferStreamTXN(rb, txn->toptxn); > > + return; > > + } > > > > Is it ever possible that we reach here for subtransaction, if not, > > then it should be Assert rather than if condition? > > ReorderBufferCheckMemoryLimit, can call it either for the > subtransaction or for the main transaction, depends upon in which > ReorderBufferTXN you are adding the current change. > That function has code like below: ReorderBufferCheckMemoryLimit() { .. if (ReorderBufferCanStream(rb)) { /* * Pick the largest toplevel transaction and evict it from memory by * streaming the already decoded part. */ txn = ReorderBufferLargestTopTXN(rb); /* we know there has to be one, because the size is not zero */ Assert(txn && !txn->toptxn); .. ReorderBufferStreamTXN(rb, txn); .. } How can it ReorderBufferTXN pass for subtransaction? -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
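Since the quoted fragment is hard to read once flattened, here is the same call site laid out as a minimal sketch (the memory-size check and the spill branch are paraphrased from the existing reorderbuffer code rather than copied from the patch):

static void
ReorderBufferCheckMemoryLimit(ReorderBuffer *rb)
{
    ReorderBufferTXN *txn;

    /* bail out if we are below the memory limit */
    if (rb->size < logical_decoding_work_mem * 1024L)
        return;

    if (ReorderBufferCanStream(rb))
    {
        /*
         * Pick the largest toplevel transaction and evict it from memory
         * by streaming the already decoded part.
         */
        txn = ReorderBufferLargestTopTXN(rb);

        /* we know there has to be one, because the size is not zero */
        Assert(txn && !txn->toptxn);

        ReorderBufferStreamTXN(rb, txn);
    }
    else
    {
        /* pick the largest (sub)transaction and spill it to disk */
        txn = ReorderBufferLargestTXN(rb);
        ReorderBufferSerializeTXN(rb, txn);
    }
}

As the Assert shows, ReorderBufferStreamTXN is only ever reached with a toplevel transaction from this path, which is why the toptxn branch inside it can become an Assert.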
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Dilip Kumar
Date:
On Mon, Jan 6, 2020 at 4:36 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Mon, Jan 6, 2020 at 3:56 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > On Mon, Jan 6, 2020 at 2:11 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > 3. > > > +static void > > > +ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn) > > > { > > > .. > > > + /* > > > + * If this is a subxact, we need to stream the top-level transaction > > > + * instead. > > > + */ > > > + if (txn->toptxn) > > > + { > > > + > > > ReorderBufferStreamTXN(rb, txn->toptxn); > > > + return; > > > + } > > > > > > Is it ever possible that we reach here for subtransaction, if not, > > > then it should be Assert rather than if condition? > > > > ReorderBufferCheckMemoryLimit, can call it either for the > > subtransaction or for the main transaction, depends upon in which > > ReorderBufferTXN you are adding the current change. > > > > That function has code like below: > > ReorderBufferCheckMemoryLimit() > { > .. > if (ReorderBufferCanStream(rb)) > { > /* > * Pick the largest toplevel transaction and evict it from memory by > * streaming the already decoded part. > */ > txn = ReorderBufferLargestTopTXN(rb); > /* we know there has to be one, because the size is not zero */ > Assert(txn && !txn->toptxn); > .. > ReorderBufferStreamTXN(rb, txn); > .. > } > > How can it ReorderBufferTXN pass for subtransaction? > Hmm, I missed it. You are right, will fix it. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Dilip Kumar
Date:
On Mon, Jan 6, 2020 at 4:44 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Mon, Jan 6, 2020 at 4:36 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > On Mon, Jan 6, 2020 at 3:56 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > > > On Mon, Jan 6, 2020 at 2:11 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > > 3. > > > > +static void > > > > +ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn) > > > > { > > > > .. > > > > + /* > > > > + * If this is a subxact, we need to stream the top-level transaction > > > > + * instead. > > > > + */ > > > > + if (txn->toptxn) > > > > + { > > > > + > > > > ReorderBufferStreamTXN(rb, txn->toptxn); > > > > + return; > > > > + } > > > > > > > > Is it ever possible that we reach here for subtransaction, if not, > > > > then it should be Assert rather than if condition? > > > > > > ReorderBufferCheckMemoryLimit, can call it either for the > > > subtransaction or for the main transaction, depends upon in which > > > ReorderBufferTXN you are adding the current change. > > > > > > > That function has code like below: > > > > ReorderBufferCheckMemoryLimit() > > { > > .. > > if (ReorderBufferCanStream(rb)) > > { > > /* > > * Pick the largest toplevel transaction and evict it from memory by > > * streaming the already decoded part. > > */ > > txn = ReorderBufferLargestTopTXN(rb); > > /* we know there has to be one, because the size is not zero */ > > Assert(txn && !txn->toptxn); > > .. > > ReorderBufferStreamTXN(rb, txn); > > .. > > } > > > > How can it ReorderBufferTXN pass for subtransaction? > > > Hmm, I missed it. You are right, will fix it. > I have observed one more design issue. The problem is that when we get a toasted chunks we remember the changes in the memory(hash table) but don't stream until we get the actual change on the main table. Now, the problem is that we might get the change of the toasted table and the main table in different streams. So basically, in a stream, if we have only got the toasted tuples then even after ReorderBufferStreamTXN the memory usage will not be reduced. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Amit Kapila
Date:
On Wed, Jan 8, 2020 at 1:12 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > I have observed one more design issue. > Good observation. > The problem is that when we > get a toasted chunks we remember the changes in the memory(hash table) > but don't stream until we get the actual change on the main table. > Now, the problem is that we might get the change of the toasted table > and the main table in different streams. So basically, in a stream, > if we have only got the toasted tuples then even after > ReorderBufferStreamTXN the memory usage will not be reduced. > I think we can't split such changes in a different stream (unless we design an entirely new solution to send partial changes of toast data), so we need to send them together. We can keep a flag like data_complete in ReorderBufferTxn and mark it complete only when we are able to assemble the entire tuple. Now, whenever, we try to stream the changes once we reach the memory threshold, we can check whether the data_complete flag is true, if so, then only send the changes, otherwise, we can pick the next largest transaction. I think we can retry it for few times and if we get the incomplete data for multiple transactions, then we can decide to spill the transaction or maybe we can directly spill the first largest transaction which has incomplete data. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
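A rough sketch of that retry-then-spill idea, for illustration only (data_complete and ReorderBufferLargestStreamableTopTXN are hypothetical names, not code from any posted patch):

static void
ReorderBufferEvictSomething(ReorderBuffer *rb)
{
    /* hypothetical helper: largest toplevel txn with data_complete == true */
    ReorderBufferTXN *txn = ReorderBufferLargestStreamableTopTXN(rb);

    if (txn != NULL)
        ReorderBufferStreamTXN(rb, txn);
    else
    {
        /*
         * Every large transaction currently has incompletely assembled
         * toast data, so fall back to spilling the largest
         * (sub)transaction to disk.
         */
        ReorderBufferSerializeTXN(rb, ReorderBufferLargestTXN(rb));
    }
}

The per-transaction flag would be cleared when a toast chunk is queued and set again once the change on the main table arrives and the tuple can be assembled.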
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Dilip Kumar
Date:
On Thu, Jan 9, 2020 at 9:35 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Wed, Jan 8, 2020 at 1:12 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > I have observed one more design issue. > > > > Good observation. > > > The problem is that when we > > get a toasted chunks we remember the changes in the memory(hash table) > > but don't stream until we get the actual change on the main table. > > Now, the problem is that we might get the change of the toasted table > > and the main table in different streams. So basically, in a stream, > > if we have only got the toasted tuples then even after > > ReorderBufferStreamTXN the memory usage will not be reduced. > > > > I think we can't split such changes in a different stream (unless we > design an entirely new solution to send partial changes of toast > data), so we need to send them together. We can keep a flag like > data_complete in ReorderBufferTxn and mark it complete only when we > are able to assemble the entire tuple. Now, whenever, we try to > stream the changes once we reach the memory threshold, we can check > whether the data_complete flag is true, if so, then only send the > changes, otherwise, we can pick the next largest transaction. I think > we can retry it for few times and if we get the incomplete data for > multiple transactions, then we can decide to spill the transaction or > maybe we can directly spill the first largest transaction which has > incomplete data. > Yeah, we might do something on this line. Basically, we need to mark the top-transaction as data-incomplete if any of its subtransaction is having data-incomplete (it will always be the latest sub-transaction of the top transaction). Also, for streaming, we are checking the largest top transaction whereas for spilling we just need the larget (sub) transaction. So we also need to decide while picking the largest top transaction for streaming, if we get a few transactions with in-complete data then how we will go for the spill. Do we spill all the sub-transactions under this top transaction or we will again find the larget (sub) transaction for spilling. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Amit Kapila
Date:
On Thu, Jan 9, 2020 at 10:30 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Thu, Jan 9, 2020 at 9:35 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > On Wed, Jan 8, 2020 at 1:12 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > > > I have observed one more design issue. > > > > > > > Good observation. > > > > > The problem is that when we > > > get a toasted chunks we remember the changes in the memory(hash table) > > > but don't stream until we get the actual change on the main table. > > > Now, the problem is that we might get the change of the toasted table > > > and the main table in different streams. So basically, in a stream, > > > if we have only got the toasted tuples then even after > > > ReorderBufferStreamTXN the memory usage will not be reduced. > > > > > > > I think we can't split such changes in a different stream (unless we > > design an entirely new solution to send partial changes of toast > > data), so we need to send them together. We can keep a flag like > > data_complete in ReorderBufferTxn and mark it complete only when we > > are able to assemble the entire tuple. Now, whenever, we try to > > stream the changes once we reach the memory threshold, we can check > > whether the data_complete flag is true, if so, then only send the > > changes, otherwise, we can pick the next largest transaction. I think > > we can retry it for few times and if we get the incomplete data for > > multiple transactions, then we can decide to spill the transaction or > > maybe we can directly spill the first largest transaction which has > > incomplete data. > > > Yeah, we might do something on this line. Basically, we need to mark > the top-transaction as data-incomplete if any of its subtransaction is > having data-incomplete (it will always be the latest sub-transaction > of the top transaction). Also, for streaming, we are checking the > largest top transaction whereas for spilling we just need the larget > (sub) transaction. So we also need to decide while picking the > largest top transaction for streaming, if we get a few transactions > with in-complete data then how we will go for the spill. Do we spill > all the sub-transactions under this top transaction or we will again > find the larget (sub) transaction for spilling. > I think it is better to do later as that will lead to the spill of only required (minimum changes to get the memory below threshold) changes. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Dilip Kumar
Date:
On Thu, Jan 9, 2020 at 12:09 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Thu, Jan 9, 2020 at 10:30 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > On Thu, Jan 9, 2020 at 9:35 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > On Wed, Jan 8, 2020 at 1:12 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > > > > > I have observed one more design issue. > > > > > > > > > > Good observation. > > > > > > > The problem is that when we > > > > get a toasted chunks we remember the changes in the memory(hash table) > > > > but don't stream until we get the actual change on the main table. > > > > Now, the problem is that we might get the change of the toasted table > > > > and the main table in different streams. So basically, in a stream, > > > > if we have only got the toasted tuples then even after > > > > ReorderBufferStreamTXN the memory usage will not be reduced. > > > > > > > > > > I think we can't split such changes in a different stream (unless we > > > design an entirely new solution to send partial changes of toast > > > data), so we need to send them together. We can keep a flag like > > > data_complete in ReorderBufferTxn and mark it complete only when we > > > are able to assemble the entire tuple. Now, whenever, we try to > > > stream the changes once we reach the memory threshold, we can check > > > whether the data_complete flag is true, if so, then only send the > > > changes, otherwise, we can pick the next largest transaction. I think > > > we can retry it for few times and if we get the incomplete data for > > > multiple transactions, then we can decide to spill the transaction or > > > maybe we can directly spill the first largest transaction which has > > > incomplete data. > > > > > Yeah, we might do something on this line. Basically, we need to mark > > the top-transaction as data-incomplete if any of its subtransaction is > > having data-incomplete (it will always be the latest sub-transaction > > of the top transaction). Also, for streaming, we are checking the > > largest top transaction whereas for spilling we just need the larget > > (sub) transaction. So we also need to decide while picking the > > largest top transaction for streaming, if we get a few transactions > > with in-complete data then how we will go for the spill. Do we spill > > all the sub-transactions under this top transaction or we will again > > find the larget (sub) transaction for spilling. > > > > I think it is better to do later as that will lead to the spill of > only required (minimum changes to get the memory below threshold) > changes. Make sense to me. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Dilip Kumar
Date:
On Mon, Jan 6, 2020 at 2:11 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Mon, Jan 6, 2020 at 9:21 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > On Sat, Jan 4, 2020 at 4:07 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > > > It is better to merge it with the main patch for > > > "Implement-streaming-mode-in-ReorderBuffer", otherwise, it is a bit > > > difficult to review. > > Actually, we can merge 0008, 0009, 0012, 0018 to the main patch > > (0007). Basically, if we merge all of them then we don't need to deal > > with the conflict. I think Tomas has kept them separate so that we > > can review the solution for the schema sent. And, I kept 0018 as a > > separate patch to avoid conflict and rebasing in 0008, 0009 and 0012. > > In the next patch set, I will merge all of them to 0007. > > > > Okay, I think we can merge those patches. Done 0008, 0009, 0017, 0018 are merged to 0007, 0012 is merged to 0010 > > Few more comments: > -------------------------------- > v4-0007-Implement-streaming-mode-in-ReorderBuffer > 1. > +ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn) > { > .. > + /* > + * TOCHECK: We have to rebuild historic snapshot to be sure it includes all > + * information about > subtransactions, which could arrive after streaming start. > + */ > + if (!txn->is_schema_sent) > + snapshot_now > = ReorderBufferCopySnap(rb, txn->base_snapshot, > + txn, > command_id); > .. > } > > Why are we using base snapshot here instead of the snapshot we saved > the first time streaming has happened? And as mentioned in comments, > won't we need to consider the snapshots for subtransactions that > arrived after the last time we have streamed the changes? Fixed > > 2. > + /* remember the command ID and snapshot for the streaming run */ > + txn->command_id = command_id; > + txn- > >snapshot_now = ReorderBufferCopySnap(rb, snapshot_now, > + > txn, command_id); > > I don't see where the txn->snapshot_now is getting freed. The > base_snapshot is freed in ReorderBufferCleanupTXN, but I don't see > this getting freed. I have freed this In ReorderBufferCleanupTXN > > 3. > +static void > +ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn) > { > .. > + /* > + * If this is a subxact, we need to stream the top-level transaction > + * instead. > + */ > + if (txn->toptxn) > + { > + > ReorderBufferStreamTXN(rb, txn->toptxn); > + return; > + } > > Is it ever possible that we reach here for subtransaction, if not, > then it should be Assert rather than if condition? Fixed > > 4. In ReorderBufferStreamTXN(), don't we need to set some of the txn > fields like origin_id, origin_lsn as we do in ReorderBufferCommit() > especially to cover the case when it gets called due to memory > overflow (aka via ReorderBufferCheckMemoryLimit). We get origin_lsn during commit time so I am not sure how can we do that. I have also noticed that currently, we are not using origin_lsn on the subscriber side. I think need more investigation that if we want this then do we need to log it early. > > v4-0017-Extend-handling-of-concurrent-aborts-for-streamin > 1. 
> @@ -3712,7 +3727,22 @@ ReorderBufferStreamTXN(ReorderBuffer *rb, > ReorderBufferTXN *txn) > if (using_subtxn) > > RollbackAndReleaseCurrentSubTransaction(); > > - PG_RE_THROW(); > + /* re-throw only if it's not an abort */ > + if (errdata- > >sqlerrcode != ERRCODE_TRANSACTION_ROLLBACK) > + { > + MemoryContextSwitchTo(ecxt); > + PG_RE_THROW(); > + > } > + else > + { > + /* remember the command ID and snapshot for the streaming run */ > + txn- > >command_id = command_id; > + txn->snapshot_now = ReorderBufferCopySnap(rb, snapshot_now, > + > txn, command_id); > + rb->stream_stop(rb, txn); > + > + > FlushErrorState(); > + } > > Can you update comments either in the above code block or some other > place to explain what is the concurrent abort problem and how we dealt > with it? Also, please explain how the above error handling is > sufficient to address all the various scenarios (sub-transaction got > aborted when we have already sent some changes, or when we have not > sent any changes yet). Done > > v4-0006-Gracefully-handle-concurrent-aborts-of-uncommitte > 1. > + /* > + * If CheckXidAlive is valid, then we check if it aborted. If it did, we > + * error out > + */ > + if (TransactionIdIsValid(CheckXidAlive) && > + !TransactionIdIsInProgress(CheckXidAlive) && > + !TransactionIdDidCommit(CheckXidAlive)) > + ereport(ERROR, > + (errcode(ERRCODE_TRANSACTION_ROLLBACK), > + errmsg("transaction aborted during system catalog scan"))); > > Why here we can't use TransactionIdDidAbort? If we can't use it, then > can you add comments stating the reason of the same. Done > > 2. > /* > + * An xid value pointing to a possibly ongoing or a prepared transaction. > + * Currently used in logical decoding. It's possible that such transactions > + * can get aborted while the decoding is ongoing. > + */ > +TransactionId CheckXidAlive = InvalidTransactionId; > > In comments, there is a mention of a prepared transaction. Do we > allow prepared transactions to be decoded as part of this patch? Fixed > > 3. > + /* > + * If CheckXidAlive is valid, then we check if it aborted. If it did, we > + * error out > + */ > + if (TransactionIdIsValid > (CheckXidAlive) && > + !TransactionIdIsInProgress(CheckXidAlive) && > + !TransactionIdDidCommit(CheckXidAlive)) > > This comment just says what code below is doing, can you explain the > rationale behind this check. It would be better if it is clear by > reading comments, why we are doing this check after fetching the > tuple. I think this can refer to the comment I suggested to add for > changes in patch > v4-0017-Extend-handling-of-concurrent-aborts-for-streamin. Done -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Attachment
- v5-0001-Immediately-WAL-log-assignments.patch
- v5-0004-Extend-the-output-plugin-API-with-stream-methods.patch
- v5-0002-Issue-individual-invalidations-with-wal_level-log.patch
- v5-0005-Cleaning-up-of-flags-in-ReorderBufferTXN-structur.patch
- v5-0003-fixup-is_schema_sent-set-too-early.patch
- v5-0006-Gracefully-handle-concurrent-aborts-of-uncommitte.patch
- v5-0007-Implement-streaming-mode-in-ReorderBuffer.patch
- v5-0009-Support-logical_decoding_work_mem-set-from-create.patch
- v5-0008-Fix-speculative-insert-bug.patch
- v5-0010-Add-support-for-streaming-to-built-in-replication.patch
- v5-0011-Track-statistics-for-streaming.patch
- v5-0013-BUGFIX-set-final_lsn-for-subxacts-before-cleanup.patch
- v5-0014-Add-TAP-test-for-streaming-vs.-DDL.patch
- v5-0012-Enable-streaming-for-all-subscription-TAP-tests.patch
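For readers following the concurrent-abort discussion in the message above, the check being referred to reads roughly as below (assembled from the quoted fragments; the explanatory comment is added here and reflects one likely reason for not using TransactionIdDidAbort, namely that it does not report transactions aborted implicitly by a crash):

/*
 * If CheckXidAlive is valid, the (sub)transaction whose changes we are
 * decoding may abort concurrently.  Error out with
 * ERRCODE_TRANSACTION_ROLLBACK so that ReorderBufferStreamTXN can catch
 * it, stop the current stream cleanly and retry the transaction later.
 *
 * "not in progress and not committed" is used instead of
 * TransactionIdDidAbort because the latter does not return true for
 * transactions that were aborted by a crash.
 */
if (TransactionIdIsValid(CheckXidAlive) &&
    !TransactionIdIsInProgress(CheckXidAlive) &&
    !TransactionIdDidCommit(CheckXidAlive))
    ereport(ERROR,
            (errcode(ERRCODE_TRANSACTION_ROLLBACK),
             errmsg("transaction aborted during system catalog scan")));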
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Dilip Kumar
Date:
On Sat, Jan 4, 2020 at 4:07 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Mon, Dec 30, 2019 at 3:11 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > On Thu, Dec 12, 2019 at 9:44 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > 0002-Issue-individual-invalidations-with-wal_level-log > > > > ---------------------------------------------------------------------------- > > > > 1. > > > > xact_desc_invalidations(StringInfo buf, > > > > { > > > > .. > > > > + else if (msg->id == SHAREDINVALSNAPSHOT_ID) > > > > + appendStringInfo(buf, " snapshot %u", msg->sn.relId); > > > > > > > > You have removed logging for the above cache but forgot to remove its > > > > reference from one of the places. Also, I think you need to add a > > > > comment somewhere in inval.c to say why you are writing for WAL for > > > > some types of invalidations and not for others? > > Done > > > > I don't see any new comments as asked by me. Done I think we should also > consider WAL logging at each command end instead of doing piecemeal as > discussed in another email [1], which will have lesser code changes > and maybe better in performance. You might want to evaluate the > performance of both approaches. Still pending, will work on this. > > > > > 0005-Gracefully-handle-concurrent-aborts-of-uncommitte > > > > ---------------------------------------------------------------------------------- > > > > 1. > > > > @@ -1877,6 +1877,7 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid, > > > > PG_CATCH(); > > > > { > > > > /* TODO: Encapsulate cleanup > > > > from the PG_TRY and PG_CATCH blocks */ > > > > + > > > > if (iterstate) > > > > ReorderBufferIterTXNFinish(rb, iterstate); > > > > > > > > Spurious line change. > > > > > > Done > > + /* > + * We don't expect direct calls to heap_getnext with valid > + * CheckXidAlive for regular tables. Track that below. > + */ > + if (unlikely(TransactionIdIsValid(CheckXidAlive) && > + !(IsCatalogRelation(scan->rs_base.rs_rd) || > + RelationIsUsedAsCatalogTable(scan->rs_base.rs_rd)))) > + elog(ERROR, "improper heap_getnext call"); > > Earlier, I thought we don't need to check if it is a regular table in > this check, but it is required because output plugins can try to do > that and if they do so during decoding (with historic snapshots), the > same should be not allowed. > > How about changing the error message to "unexpected heap_getnext call > during logical decoding" or something like that? Done > > > > > 2. The commit message of this patch refers to Prepared transactions. > > > > I think that needs to be changed. > > > > > > > > 0006-Implement-streaming-mode-in-ReorderBuffer > > > > ------------------------------------------------------------------------- > > Few comments on v4-0018-Review-comment-fix-and-refactoring: > 1. > + if (streaming) > + { > + /* > + * Set the last last of the stream as the final lsn before calling > + * stream stop. > + */ > + txn->final_lsn = prev_lsn; > + rb->stream_stop(rb, txn); > + } > > Shouldn't we try to final_lsn as is done by Vignesh's patch [2]? Already agreed upon current implementation > > 2. > + if (streaming) > + { > + /* > + * Set the CheckXidAlive to the current (sub)xid for which this > + * change belongs to so that we can detect the abort while we are > + * decoding. > + */ > + CheckXidAlive = change->txn->xid; > + > + /* Increment the stream count. */ > + streamed++; > + } > > Is the variable 'streamed' used anywhere? Removed > > 3. 
> + /* > + * Destroy the (relfilenode, ctid) hashtable, so that we don't leak > + * any memory. We could also keep the hash table and update it with > + * new ctid values, but this seems simpler and good enough for now. > + */ > + ReorderBufferDestroyTupleCidHash(rb, txn); > > Won't this be required only when we are streaming changes? Fixed > > As per my understanding apart from the above comments, the known > pending work for this patchset is as follows: > a. The two open items agreed to you in the email [3]. > b. Complete the handling of schema_sent as discussed above [4]. > c. Few comments by Vignesh and the response on the same by me [5][6]. > d. WAL overhead and performance testing for additional WAL logging by > this patchset. > e. Some way to see the tuple for streamed transactions by decoding API > as speculated by you [7]. > > Have I missed anything? I have worked upon most of these items, I will reply to them separately. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Dilip Kumar
Date:
On Mon, Dec 30, 2019 at 3:43 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Sun, Dec 29, 2019 at 1:34 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > I have observed some more issues > > > > 1. Currently, In ReorderBufferCommit, it is always expected that > > whenever we get REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM, we must > > have already got REORDER_BUFFER_CHANGE_INTERNAL_SPEC_INSERT and in > > SPEC_CONFIRM we send the tuple we got in SPECT_INSERT. But, now those > > two messages can be in different streams. So we need to find a way to > > handle this. Maybe once we get SPEC_INSERT then we can remember the > > tuple and then if we get the SPECT_CONFIRM in the next stream we can > > send that tuple? > > > > Your suggestion makes sense to me. So, we can try it. I have implemented this and attached it as a separate patch. In my latest patch set[1] > > > 2. During commit time in DecodeCommit we check whether we need to skip > > the changes of the transaction or not by calling > > SnapBuildXactNeedsSkip but since now we support streaming so it's > > possible that before we decode the commit WAL, we might have already > > sent the changes to the output plugin even though we could have > > skipped those changes. So my question is instead of checking at the > > commit time can't we check before adding to ReorderBuffer itself > > > > I think if we can do that then the same will be true for current code > irrespective of this patch. I think it is possible that we can't take > that decision while decoding because we haven't assembled a consistent > snapshot yet. I think we might be able to do that while we try to > stream the changes. I think we need to take care of all the > conditions during streaming (when the logical_decoding_workmem limit > is reached) as we do in DecodeCommit. This needs a bit more study. I have analyzed this further and I think we can not decide all the conditions even while streaming. Because IMHO once we get the SNAPBUILD_FULL_SNAPSHOT we can add the changes to the reorder buffer so that if we get the commit of the transaction after we reach to the SNAPBUILD_CONSISTENT. However, if we get the commit before we reach to SNAPBUILD_CONSISTENT then we need to ignore this transaction. Now, even if we have SNAPBUILD_FULL_SNAPSHOT we can stream the changes which might get dropped later but that we can not decide while streaming. [1] https://www.postgresql.org/message-id/CAFiTN-snMb%3D53oqkM8av8Lqfxojjm4OBwCNxmFssgLCceY_zgg%40mail.gmail.com -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
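A minimal sketch of the SPEC_INSERT/SPEC_CONFIRM idea agreed above, as it could look inside the per-change loop of the streaming code (txn->specinsert is an illustrative field name; relation comes from the surrounding loop):

case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_INSERT:
    /* do not stream yet; the confirm may only arrive in a later stream */
    txn->specinsert = change;
    break;

case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM:
    /* now send the remembered tuple and forget it */
    Assert(txn->specinsert != NULL);
    rb->stream_change(rb, txn, relation, txn->specinsert);
    txn->specinsert = NULL;
    break;

Keeping the remembered change on the ReorderBufferTXN (rather than in a local variable) is what lets it survive from one stream to the next.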
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Alvaro Herrera
Date:
I pushed 0005 (the rbtxn flags thing) after some light editing. It's been around for long enough ... -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Alvaro Herrera
Date:
Here's a rebase of this patch series. I didn't change anything except 1. disregard what was 0005, since I already pushed it. 2. roll 0003 into 0002. 3. rebase 0007 (now 0005) to account for the reorderbuffer changes. (I did notice that 0005 adds a new boolean any_data_sent, which is silly -- it should be another txn_flags bit.) However, tests don't pass for me; notably, test_decoding crashes. OTOH I noticed that the streamed transaction support in test_decoding writes the XID to the output, which is going to make it useless for regression testing. It probably should not emit the numerical values. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
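On the any_data_sent point, following the txn_flags style of the rbtxn patch that was just pushed, the boolean would turn into something like this (the flag name and bit value here are illustrative, not taken from any posted patch):

/* in reorderbuffer.h, next to the existing RBTXN_* bits */
#define RBTXN_ANY_DATA_SENT       0x0010

#define rbtxn_any_data_sent(txn) \
    (((txn)->txn_flags & RBTXN_ANY_DATA_SENT) != 0)

/* where the patch currently does txn->any_data_sent = true */
txn->txn_flags |= RBTXN_ANY_DATA_SENT;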
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Alvaro Herrera
Date:
On 2020-Jan-10, Alvaro Herrera wrote: > Here's a rebase of this patch series. I didn't change anything except ... this time with attachments ... -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachment
- v6-0001-Immediately-WAL-log-assignments.patch
- v6-0002-Issue-individual-invalidations-with-wal_level-log.patch
- v6-0003-Extend-the-output-plugin-API-with-stream-methods.patch
- v6-0004-Gracefully-handle-concurrent-aborts-of-uncommitte.patch
- v6-0005-Implement-streaming-mode-in-ReorderBuffer.patch
- v6-0006-Fix-speculative-insert-bug.patch
- v6-0007-Support-logical_decoding_work_mem-set-from-create.patch
- v6-0008-Add-support-for-streaming-to-built-in-replication.patch
- v6-0009-Track-statistics-for-streaming.patch
- v6-0010-Enable-streaming-for-all-subscription-TAP-tests.patch
- v6-0011-BUGFIX-set-final_lsn-for-subxacts-before-cleanup.patch
- v6-0012-Add-TAP-test-for-streaming-vs.-DDL.patch
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Alvaro Herrera
Date:
On 2020-Jan-10, Alvaro Herrera wrote: > From 7d671806584fff71067c8bde38b2f642ba1331a9 Mon Sep 17 00:00:00 2001 > From: Dilip Kumar <dilip.kumar@enterprisedb.com> > Date: Wed, 20 Nov 2019 16:41:13 +0530 > Subject: [PATCH v6 10/12] Enable streaming for all subscription TAP tests This patch turns a lot of tests into the streamed mode. While it's great that streaming mode is tested, we should add new tests for it rather than failing to keep tests for the non-streamed mode. I suggest that we add two versions of each test, one for each mode. Maybe the way to do that is to create some subroutine that can be called twice. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Dilip Kumar
Date:
On Thu, Jan 9, 2020 at 12:09 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Thu, Jan 9, 2020 at 10:30 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > On Thu, Jan 9, 2020 at 9:35 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > On Wed, Jan 8, 2020 at 1:12 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > > > > > I have observed one more design issue. > > > > > > > > > > Good observation. > > > > > > > The problem is that when we > > > > get a toasted chunks we remember the changes in the memory(hash table) > > > > but don't stream until we get the actual change on the main table. > > > > Now, the problem is that we might get the change of the toasted table > > > > and the main table in different streams. So basically, in a stream, > > > > if we have only got the toasted tuples then even after > > > > ReorderBufferStreamTXN the memory usage will not be reduced. > > > > > > > > > > I think we can't split such changes in a different stream (unless we > > > design an entirely new solution to send partial changes of toast > > > data), so we need to send them together. We can keep a flag like > > > data_complete in ReorderBufferTxn and mark it complete only when we > > > are able to assemble the entire tuple. Now, whenever, we try to > > > stream the changes once we reach the memory threshold, we can check > > > whether the data_complete flag is true, if so, then only send the > > > changes, otherwise, we can pick the next largest transaction. I think > > > we can retry it for few times and if we get the incomplete data for > > > multiple transactions, then we can decide to spill the transaction or > > > maybe we can directly spill the first largest transaction which has > > > incomplete data. > > > > > Yeah, we might do something on this line. Basically, we need to mark > > the top-transaction as data-incomplete if any of its subtransaction is > > having data-incomplete (it will always be the latest sub-transaction > > of the top transaction). Also, for streaming, we are checking the > > largest top transaction whereas for spilling we just need the larget > > (sub) transaction. So we also need to decide while picking the > > largest top transaction for streaming, if we get a few transactions > > with in-complete data then how we will go for the spill. Do we spill > > all the sub-transactions under this top transaction or we will again > > find the larget (sub) transaction for spilling. > > > > I think it is better to do later as that will lead to the spill of > only required (minimum changes to get the memory below threshold) > changes. I think instead of doing this can't we just spill the changes which are in toast_hash. Basically, at the end of the stream, we have some toast tuple which we could not stream because we did not have the insert for the main table then we can spill only those changes which are in tuple hash. And, in the subsequent stream whenever we get the insert for the main table at that time we can restore those changes and stream. We can also maintain some flag saying data is not complete (with some change LSN number) and after that LSN we can spill any toast change to disk until we get the change for the main table, by this way we can avoid building tuple hash until we get the change for the main table. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Amit Kapila
Date:
On Mon, Jan 13, 2020 at 3:18 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Thu, Jan 9, 2020 at 12:09 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > On Thu, Jan 9, 2020 at 10:30 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > > > On Thu, Jan 9, 2020 at 9:35 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > > > The problem is that when we > > > > > get a toasted chunks we remember the changes in the memory(hash table) > > > > > but don't stream until we get the actual change on the main table. > > > > > Now, the problem is that we might get the change of the toasted table > > > > > and the main table in different streams. So basically, in a stream, > > > > > if we have only got the toasted tuples then even after > > > > > ReorderBufferStreamTXN the memory usage will not be reduced. > > > > > > > > > > > > > I think we can't split such changes in a different stream (unless we > > > > design an entirely new solution to send partial changes of toast > > > > data), so we need to send them together. We can keep a flag like > > > > data_complete in ReorderBufferTxn and mark it complete only when we > > > > are able to assemble the entire tuple. Now, whenever, we try to > > > > stream the changes once we reach the memory threshold, we can check > > > > whether the data_complete flag is true Here, we can also consider streaming the changes when data_complete is false, but some additional changes have been added to the same txn as the new changes might complete the tuple. > > > > , if so, then only send the > > > > changes, otherwise, we can pick the next largest transaction. I think > > > > we can retry it for few times and if we get the incomplete data for > > > > multiple transactions, then we can decide to spill the transaction or > > > > maybe we can directly spill the first largest transaction which has > > > > incomplete data. > > > > > > > Yeah, we might do something on this line. Basically, we need to mark > > > the top-transaction as data-incomplete if any of its subtransaction is > > > having data-incomplete (it will always be the latest sub-transaction > > > of the top transaction). Also, for streaming, we are checking the > > > largest top transaction whereas for spilling we just need the larget > > > (sub) transaction. So we also need to decide while picking the > > > largest top transaction for streaming, if we get a few transactions > > > with in-complete data then how we will go for the spill. Do we spill > > > all the sub-transactions under this top transaction or we will again > > > find the larget (sub) transaction for spilling. > > > > > > > I think it is better to do later as that will lead to the spill of > > only required (minimum changes to get the memory below threshold) > > changes. > I think instead of doing this can't we just spill the changes which > are in toast_hash. Basically, at the end of the stream, we have some > toast tuple which we could not stream because we did not have the > insert for the main table then we can spill only those changes which > are in tuple hash. > Hmm, I think this can turn out to be inefficient because we can easily end up spilling the data even when we don't need to so. Consider cases, where part of the streamed changes are for toast, and remaining are the changes which we would have streamed and hence can be removed. In such cases, we could have easily consumed remaining changes for toast without spilling. 
Also, I am not sure spilling changes from the hash table is a good idea, as they are no longer in the same order as they were in the ReorderBuffer. That means the order in which we would normally serialize the changes would change, and that might have some impact, so we would need some more study if we want to pursue this idea. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Dilip Kumar
Date:
On Sat, Jan 11, 2020 at 3:07 AM Alvaro Herrera <alvherre@2ndquadrant.com> wrote: > > On 2020-Jan-10, Alvaro Herrera wrote: > > > Here's a rebase of this patch series. I didn't change anything except > > ... this time with attachments ... The patch set fails to apply on the head so rebased. (Rebased on commit cebf9d6e6ee13cbf9f1a91ec633cf96780ffc985) -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Attachment
- v7-0002-Issue-individual-invalidations-with-wal_level-log.patch
- v7-0001-Immediately-WAL-log-assignments.patch
- v7-0005-Implement-streaming-mode-in-ReorderBuffer.patch
- v7-0004-Gracefully-handle-concurrent-aborts-of-uncommitte.patch
- v7-0003-Extend-the-output-plugin-API-with-stream-methods.patch
- v7-0006-Fix-speculative-insert-bug.patch
- v7-0008-Add-support-for-streaming-to-built-in-replication.patch
- v7-0009-Track-statistics-for-streaming.patch
- v7-0007-Support-logical_decoding_work_mem-set-from-create.patch
- v7-0010-Enable-streaming-for-all-subscription-TAP-tests.patch
- v7-0011-BUGFIX-set-final_lsn-for-subxacts-before-cleanup.patch
- v7-0012-Add-TAP-test-for-streaming-vs.-DDL.patch
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Tomas Vondra
Date:
On Tue, Jan 14, 2020 at 10:56:37AM +0530, Dilip Kumar wrote: >On Sat, Jan 11, 2020 at 3:07 AM Alvaro Herrera <alvherre@2ndquadrant.com> wrote: >> >> On 2020-Jan-10, Alvaro Herrera wrote: >> >> > Here's a rebase of this patch series. I didn't change anything except >> >> ... this time with attachments ... >The patch set fails to apply on the head so rebased. (Rebased on >commit cebf9d6e6ee13cbf9f1a91ec633cf96780ffc985) > I've noticed the patch was in WoA state since 2019/12/01, but there's been quite a lot of traffic on this thread and a bunch of new patch versions. So I've switched it to "needs review" - if that's not the right status, let me know. Also, the patch was moved forward mostly by Amit and Dilip, so I've added them as authors in the CF app (well, what matters is the commit message, of course, but let's keep this up to date too). regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Dilip Kumar
Date:
On Tue, Jan 14, 2020 at 10:44 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Mon, Jan 13, 2020 at 3:18 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > On Thu, Jan 9, 2020 at 12:09 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > On Thu, Jan 9, 2020 at 10:30 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > > > > > On Thu, Jan 9, 2020 at 9:35 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > > > > > The problem is that when we > > > > > > get a toasted chunks we remember the changes in the memory(hash table) > > > > > > but don't stream until we get the actual change on the main table. > > > > > > Now, the problem is that we might get the change of the toasted table > > > > > > and the main table in different streams. So basically, in a stream, > > > > > > if we have only got the toasted tuples then even after > > > > > > ReorderBufferStreamTXN the memory usage will not be reduced. > > > > > > > > > > > > > > > > I think we can't split such changes in a different stream (unless we > > > > > design an entirely new solution to send partial changes of toast > > > > > data), so we need to send them together. We can keep a flag like > > > > > data_complete in ReorderBufferTxn and mark it complete only when we > > > > > are able to assemble the entire tuple. Now, whenever, we try to > > > > > stream the changes once we reach the memory threshold, we can check > > > > > whether the data_complete flag is true > > Here, we can also consider streaming the changes when data_complete is > false, but some additional changes have been added to the same txn as > the new changes might complete the tuple. > > > > > > , if so, then only send the > > > > > changes, otherwise, we can pick the next largest transaction. I think > > > > > we can retry it for few times and if we get the incomplete data for > > > > > multiple transactions, then we can decide to spill the transaction or > > > > > maybe we can directly spill the first largest transaction which has > > > > > incomplete data. > > > > > > > > > Yeah, we might do something on this line. Basically, we need to mark > > > > the top-transaction as data-incomplete if any of its subtransaction is > > > > having data-incomplete (it will always be the latest sub-transaction > > > > of the top transaction). Also, for streaming, we are checking the > > > > largest top transaction whereas for spilling we just need the larget > > > > (sub) transaction. So we also need to decide while picking the > > > > largest top transaction for streaming, if we get a few transactions > > > > with in-complete data then how we will go for the spill. Do we spill > > > > all the sub-transactions under this top transaction or we will again > > > > find the larget (sub) transaction for spilling. > > > > > > > > > > I think it is better to do later as that will lead to the spill of > > > only required (minimum changes to get the memory below threshold) > > > changes. > > I think instead of doing this can't we just spill the changes which > > are in toast_hash. Basically, at the end of the stream, we have some > > toast tuple which we could not stream because we did not have the > > insert for the main table then we can spill only those changes which > > are in tuple hash. > > > > Hmm, I think this can turn out to be inefficient because we can easily > end up spilling the data even when we don't need to so. 
Consider > cases, where part of the streamed changes are for toast, and remaining > are the changes which we would have streamed and hence can be removed. > In such cases, we could have easily consumed remaining changes for > toast without spilling. Also, I am not sure if spilling changes from > the hash table is a good idea as they are no more in the same order as > they were in ReorderBuffer which means the order in which we serialize > the changes normally would change and that might have some impact, so > we would need some more study if we want to pursue this idea. I have fixed this bug and attached it as a separate patch. I will merge it to the main patch after we agree with the idea and after some more testing. The idea is that whenever we get the toasted chunk instead of directly inserting it into the toast hash I am inserting it into some local list so that if we don't get the change for the main table then we can insert these changes back to the txn->changes. So once we get the change for the main table at that time I am preparing the hash table to merge the chunks. If the stream is over and we haven't got the changes for the main table, that time we will mark the txn that it has some pending toast changes so that next time we will not pick the same transaction for the streaming. This flag will be cleaned whenever we get any changes for the txn (insert or /update). There is also a possibility that even after we stream the changes the rb->size is not below logical_decoding_work_mem because we could not stream the changes so for handling this after streaming we recheck the size and if it is still not under control then we pick another transaction. In some cases, we might not get any transaction to stream because the transaction has the pending toast change flag set, In this case, we will go for the spill. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Attachment
- v8-0001-Immediately-WAL-log-assignments.patch
- v8-0003-Extend-the-output-plugin-API-with-stream-methods.patch
- v8-0002-Issue-individual-invalidations-with-wal_level-log.patch
- v8-0004-Gracefully-handle-concurrent-aborts-of-uncommitte.patch
- v8-0005-Implement-streaming-mode-in-ReorderBuffer.patch
- v8-0006-Fix-speculative-insert-bug.patch
- v8-0007-Support-logical_decoding_work_mem-set-from-create.patch
- v8-0009-Track-statistics-for-streaming.patch
- v8-0010-Enable-streaming-for-all-subscription-TAP-tests.patch
- v8-0008-Add-support-for-streaming-to-built-in-replication.patch
- v8-0011-BUGFIX-set-final_lsn-for-subxacts-before-cleanup.patch
- v8-0012-Add-TAP-test-for-streaming-vs.-DDL.patch
- v8-0013-Bugfix-handling-of-incomplete-toast-tuple.patch
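A sketch of the end-of-stream bookkeeping described in the message above (toast_chunks, the local list, and has_pending_toast are illustrative names; dlist is the usual backend list type):

/*
 * End of ReorderBufferStreamTXN: toast chunks whose main-table change
 * never arrived in this stream go back onto txn->changes, and the txn
 * is marked so it is not picked again for streaming until new changes
 * come in.
 */
if (!dlist_is_empty(&toast_chunks))
{
    txn->has_pending_toast = true;

    while (!dlist_is_empty(&toast_chunks))
        dlist_push_tail(&txn->changes, dlist_pop_head_node(&toast_chunks));
}

The caller then rechecks rb->size against logical_decoding_work_mem; if it is still over the limit it picks another candidate transaction, and if every candidate has pending toast changes it falls back to spilling, as described above.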
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Dilip Kumar
Date:
On Sat, Jan 4, 2020 at 4:07 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > Update on the open items > As per my understanding apart from the above comments, the known > pending work for this patchset is as follows: > a. The two open items agreed to you in the email [3]. -> The first part is done and the second part is an improvement,not a bugfix. I will try to work on this part in the next patch set. > b. Complete the handling of schema_sent as discussed above [4]. -> Done > c. Few comments by Vignesh and the response on the same by me [5][6]. -> Done > d. WAL overhead and performance testing for additional WAL logging by > this patchset. -> Pending > e. Some way to see the tuple for streamed transactions by decoding API > as speculated by you [7]. ->Pending f. Bug in the toast table handling -> Submitted as a separate POC patch, which can be merged to the main after review and more testing. > [3] - https://www.postgresql.org/message-id/CAFiTN-uT5YZE0egGhKdTteTjcGrPi8hb%3DFMPpr9_hEB7hozQ-Q%40mail.gmail.com > [4] - https://www.postgresql.org/message-id/CAA4eK1KjD9x0mS4JxzCbu3gu-r6K7XJRV%2BZcGb3BH6U3x2uxew%40mail.gmail.com > [5] - https://www.postgresql.org/message-id/CALDaNm0DNUojjt7CV-fa59_kFbQQ3rcMBtauvo44ttea7r9KaA%40mail.gmail.com > [6] - https://www.postgresql.org/message-id/CAA4eK1%2BZvupW00c--dqEg8f3dHZDOGmA9xOQLyQHjRSoDi6AkQ%40mail.gmail.com > [7] - https://www.postgresql.org/message-id/CAFiTN-t8PmKA1X4jEqKmkvs0ggWpy0APWpPuaJwpx2YpfAf97w%40mail.gmail.com -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Alvaro Herrera
Date:
I looked at this patchset and it seemed natural to apply 0008 next (adding work_mem to subscriptions). Attached is Dilip's latest version, plus my review changes. This will break the patch tester's logic; sorry about that. What part of this change is what sets the process's logical_decoding_work_mem to the given value? I was unable to figure that out. Is it missing or am I just stupid? Changes: * the patch adds logical_decoding_work_mem SGML, but that has already been applied (cec2edfa7859); remove dupe. * parse_subscription_options() comment says that it will raise an error if a caller does not pass the pointer for an option but option list specifies that option. It does not really implement that behavior (an existing problem): instead, if the pointer is not passed, the option is ignored. Moreover, this new patch continued to fail to handle things as the comment says. I decided to implement the documented behavior instead; it's now inconsistent with how the other options are implemented. I think we should fix the other options to behave as the comment says, because it's a more convenient API; if we instead opted to update the code comment to match the code, each caller would have to be checked to verify that the correct options are passed, which is pointless and error prone. * the parse_subscription_options API is a mess. I reordered the arguments a little bit; also change the argument layout in callers so that each caller is grouped more sensibly. Also added comments to simplify reading the argument lists. I think this could be fixed by using an ad-hoc struct to pass in and out. Didn't get around to doing that, seems an unrelated potential improvement. * trying to do own range checking in pgoutput and subscriptioncmds.c seems pointless and likely to get out of sync with guc.c. Simpler is to call set_config_option() to verify that the argument is in range. (Note a further problem in the patch series: the range check in subscriptioncmds.c is only added in patch 0009). * parsing integers using scanint8() seemed weird (error messages there do not correspond to what we want). After a couple of false starts, I decided to rely on guc.c's set_config_option() followed by parse_int(). That also has the benefit that you can give it units. * psql \dRs+ should display the work_mem; patch failed to do that. Added. Unit display is done by pg_size_pretty(), which might be different from what guc.c does, but I think it works OK. It's the first place where we use pg_size_pretty to show a memory limit, however. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachment
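As a reference for the unit handling described in the message above, the option parsing could look roughly like this (a sketch only; defel stands for the DefElem carrying the work_mem option value, and the additional range check through set_config_option mentioned above is omitted):

int         wm;
const char *hintmsg;
char       *valstr = defGetString(defel);

/* accept the same syntax and units as the logical_decoding_work_mem GUC */
if (!parse_int(valstr, &wm, GUC_UNIT_KB, &hintmsg))
    ereport(ERROR,
            (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
             errmsg("invalid value for parameter \"%s\": \"%s\"",
                    "work_mem", valstr),
             hintmsg ? errhint("%s", hintmsg) : 0));

With GUC_UNIT_KB, values such as '64MB' are accepted and converted to kilobytes, matching the unit the GUC itself uses.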
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Amit Kapila
Date:
On Wed, Jan 22, 2020 at 10:07 PM Alvaro Herrera <alvherre@2ndquadrant.com> wrote: > > I looked at this patchset and it seemed natural to apply 0008 next > (adding work_mem to subscriptions). > I am not so sure whether we need this patch as the exact scenario where it can help is not very clear to me and neither did anyone explained. I have raised this concern earlier as well [1]. The point is that 'logical_decoding_work_mem' applies to the entire ReorderBuffer in the publisher's side and how will a parameter from a particular subscription help in that? [1] - https://www.postgresql.org/message-id/CAA4eK1J%2B3kab6RSZrgj0YiQV1r%2BH3FWVaNjKhWvpEe5-bpZiBw%40mail.gmail.com -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Amit Kapila
Date:
On Wed, Jan 22, 2020 at 10:30 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Tue, Jan 14, 2020 at 10:44 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > Hmm, I think this can turn out to be inefficient because we can easily > > end up spilling the data even when we don't need to so. Consider > > cases, where part of the streamed changes are for toast, and remaining > > are the changes which we would have streamed and hence can be removed. > > In such cases, we could have easily consumed remaining changes for > > toast without spilling. Also, I am not sure if spilling changes from > > the hash table is a good idea as they are no more in the same order as > > they were in ReorderBuffer which means the order in which we serialize > > the changes normally would change and that might have some impact, so > > we would need some more study if we want to pursue this idea. > I have fixed this bug and attached it as a separate patch. I will > merge it to the main patch after we agree with the idea and after some > more testing. > > The idea is that whenever we get the toasted chunk instead of directly > inserting it into the toast hash I am inserting it into some local > list so that if we don't get the change for the main table then we can > insert these changes back to the txn->changes. So once we get the > change for the main table at that time I am preparing the hash table > to merge the chunks. > I think this idea will work but appears to be quite costly because (a) you might need to serialize/deserialize the changes multiple times and might attempt streaming multiple times even though you can't do (b) you need to remove/add the same set of changes from the main list multiple times. It seems to me that we need to add all of this new handling because while taking the decision whether to stream or not we don't know whether the txn has changes that can't be streamed. One idea to make it work is that we identify it while decoding the WAL. I think we need to set a bit in the insert/delete WAL record to identify if the tuple belongs to a toast relation. This won't add any additional overhead in WAL and reduce a lot of complexity in the logical decoding and also decoding will be efficient. If this is feasible, then we can do the same for speculative insertions. In patch v8-0013-Bugfix-handling-of-incomplete-toast-tuple, why is below change required? --- a/contrib/test_decoding/logical.conf +++ b/contrib/test_decoding/logical.conf @@ -1,3 +1,4 @@ wal_level = logical max_replication_slots = 4 logical_decoding_work_mem = 64kB +logging_collector=on -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
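A sketch of what the proposed WAL-side marking could look like (XLH_INSERT_ON_TOAST_RELATION and ReorderBufferMarkPartialChange are hypothetical names here; xl_heap_insert already carries a flags byte with spare bits, so no extra WAL space is needed):

/* heap_insert(): while assembling the xl_heap_insert record */
if (IsToastRelation(relation))
    xlrec.flags |= XLH_INSERT_ON_TOAST_RELATION;

/*
 * DecodeInsert(): toast changes can now be recognized without any
 * catalog access; mark the transaction accordingly.
 */
if (xlrec->flags & XLH_INSERT_ON_TOAST_RELATION)
    ReorderBufferMarkPartialChange(ctx->reorder, XLogRecGetXid(r));

The same bit-in-the-record approach could be applied to speculative insertions, as mentioned above.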
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Dilip Kumar
Date:
On Tue, Jan 28, 2020 at 11:28 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Wed, Jan 22, 2020 at 10:30 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > On Tue, Jan 14, 2020 at 10:44 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > > > Hmm, I think this can turn out to be inefficient because we can easily > > > end up spilling the data even when we don't need to so. Consider > > > cases, where part of the streamed changes are for toast, and remaining > > > are the changes which we would have streamed and hence can be removed. > > > In such cases, we could have easily consumed remaining changes for > > > toast without spilling. Also, I am not sure if spilling changes from > > > the hash table is a good idea as they are no more in the same order as > > > they were in ReorderBuffer which means the order in which we serialize > > > the changes normally would change and that might have some impact, so > > > we would need some more study if we want to pursue this idea. > > I have fixed this bug and attached it as a separate patch. I will > > merge it to the main patch after we agree with the idea and after some > > more testing. > > > > The idea is that whenever we get the toasted chunk instead of directly > > inserting it into the toast hash I am inserting it into some local > > list so that if we don't get the change for the main table then we can > > insert these changes back to the txn->changes. So once we get the > > change for the main table at that time I am preparing the hash table > > to merge the chunks. > > > > > I think this idea will work but appears to be quite costly because (a) > you might need to serialize/deserialize the changes multiple times and > might attempt streaming multiple times even though you can't do (b) > you need to remove/add the same set of changes from the main list > multiple times. I agree with this. > > It seems to me that we need to add all of this new handling because > while taking the decision whether to stream or not we don't know > whether the txn has changes that can't be streamed. One idea to make > it work is that we identify it while decoding the WAL. I think we > need to set a bit in the insert/delete WAL record to identify if the > tuple belongs to a toast relation. This won't add any additional > overhead in WAL and reduce a lot of complexity in the logical decoding > and also decoding will be efficient. If this is feasible, then we can > do the same for speculative insertions. The Idea looks good to me. I will work on this. > > In patch v8-0013-Bugfix-handling-of-incomplete-toast-tuple, why is > below change required? > > --- a/contrib/test_decoding/logical.conf > +++ b/contrib/test_decoding/logical.conf > @@ -1,3 +1,4 @@ > wal_level = logical > max_replication_slots = 4 > logical_decoding_work_mem = 64kB > +logging_collector=on Sorry, these are some local changes which got included in the patch. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Amit Kapila
Date:
On Tue, Jan 28, 2020 at 11:34 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Tue, Jan 28, 2020 at 11:28 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > On Wed, Jan 22, 2020 at 10:30 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > > > On Tue, Jan 14, 2020 at 10:44 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > > > > > > Hmm, I think this can turn out to be inefficient because we can easily > > > > end up spilling the data even when we don't need to so. Consider > > > > cases, where part of the streamed changes are for toast, and remaining > > > > are the changes which we would have streamed and hence can be removed. > > > > In such cases, we could have easily consumed remaining changes for > > > > toast without spilling. Also, I am not sure if spilling changes from > > > > the hash table is a good idea as they are no more in the same order as > > > > they were in ReorderBuffer which means the order in which we serialize > > > > the changes normally would change and that might have some impact, so > > > > we would need some more study if we want to pursue this idea. > > > I have fixed this bug and attached it as a separate patch. I will > > > merge it to the main patch after we agree with the idea and after some > > > more testing. > > > > > > The idea is that whenever we get the toasted chunk instead of directly > > > inserting it into the toast hash I am inserting it into some local > > > list so that if we don't get the change for the main table then we can > > > insert these changes back to the txn->changes. So once we get the > > > change for the main table at that time I am preparing the hash table > > > to merge the chunks. > > > > > > > > > I think this idea will work but appears to be quite costly because (a) > > you might need to serialize/deserialize the changes multiple times and > > might attempt streaming multiple times even though you can't do (b) > > you need to remove/add the same set of changes from the main list > > multiple times. > I agree with this. > > > > It seems to me that we need to add all of this new handling because > > while taking the decision whether to stream or not we don't know > > whether the txn has changes that can't be streamed. One idea to make > > it work is that we identify it while decoding the WAL. I think we > > need to set a bit in the insert/delete WAL record to identify if the > > tuple belongs to a toast relation. This won't add any additional > > overhead in WAL and reduce a lot of complexity in the logical decoding > > and also decoding will be efficient. If this is feasible, then we can > > do the same for speculative insertions. > The Idea looks good to me. I will work on this. > One more thing we can do is to identify whether the tuple belongs to toast relation while decoding it. However, I think to do that we need to have access to relcache at that time and that might add some overhead as we need to do that for each tuple. Can we investigate what it will take to do that and if it is better than setting a bit during WAL logging. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Dilip Kumar
Date:
On Tue, Jan 28, 2020 at 11:43 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Tue, Jan 28, 2020 at 11:34 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > On Tue, Jan 28, 2020 at 11:28 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > On Wed, Jan 22, 2020 at 10:30 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > > > > > On Tue, Jan 14, 2020 at 10:44 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > > > > > > > > > Hmm, I think this can turn out to be inefficient because we can easily > > > > > end up spilling the data even when we don't need to so. Consider > > > > > cases, where part of the streamed changes are for toast, and remaining > > > > > are the changes which we would have streamed and hence can be removed. > > > > > In such cases, we could have easily consumed remaining changes for > > > > > toast without spilling. Also, I am not sure if spilling changes from > > > > > the hash table is a good idea as they are no more in the same order as > > > > > they were in ReorderBuffer which means the order in which we serialize > > > > > the changes normally would change and that might have some impact, so > > > > > we would need some more study if we want to pursue this idea. > > > > I have fixed this bug and attached it as a separate patch. I will > > > > merge it to the main patch after we agree with the idea and after some > > > > more testing. > > > > > > > > The idea is that whenever we get the toasted chunk instead of directly > > > > inserting it into the toast hash I am inserting it into some local > > > > list so that if we don't get the change for the main table then we can > > > > insert these changes back to the txn->changes. So once we get the > > > > change for the main table at that time I am preparing the hash table > > > > to merge the chunks. > > > > > > > > > > > > > I think this idea will work but appears to be quite costly because (a) > > > you might need to serialize/deserialize the changes multiple times and > > > might attempt streaming multiple times even though you can't do (b) > > > you need to remove/add the same set of changes from the main list > > > multiple times. > > I agree with this. > > > > > > It seems to me that we need to add all of this new handling because > > > while taking the decision whether to stream or not we don't know > > > whether the txn has changes that can't be streamed. One idea to make > > > it work is that we identify it while decoding the WAL. I think we > > > need to set a bit in the insert/delete WAL record to identify if the > > > tuple belongs to a toast relation. This won't add any additional > > > overhead in WAL and reduce a lot of complexity in the logical decoding > > > and also decoding will be efficient. If this is feasible, then we can > > > do the same for speculative insertions. > > The Idea looks good to me. I will work on this. > > > > One more thing we can do is to identify whether the tuple belongs to > toast relation while decoding it. However, I think to do that we need > to have access to relcache at that time and that might add some > overhead as we need to do that for each tuple. Can we investigate > what it will take to do that and if it is better than setting a bit > during WAL logging. IMHO, for the catalog scan, we will have to start/stop the transaction for each change. So do you want that we should evaluate its performance? Also, during we get the change we might not have the complete historic snapshot ready to fetch the rel cache entry. 
-- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Amit Kapila
Date:
On Tue, Jan 28, 2020 at 11:58 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Tue, Jan 28, 2020 at 11:43 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > It seems to me that we need to add all of this new handling because > > > > while taking the decision whether to stream or not we don't know > > > > whether the txn has changes that can't be streamed. One idea to make > > > > it work is that we identify it while decoding the WAL. I think we > > > > need to set a bit in the insert/delete WAL record to identify if the > > > > tuple belongs to a toast relation. This won't add any additional > > > > overhead in WAL and reduce a lot of complexity in the logical decoding > > > > and also decoding will be efficient. If this is feasible, then we can > > > > do the same for speculative insertions. > > > The Idea looks good to me. I will work on this. > > > > > > > One more thing we can do is to identify whether the tuple belongs to > > toast relation while decoding it. However, I think to do that we need > > to have access to relcache at that time and that might add some > > overhead as we need to do that for each tuple. Can we investigate > > what it will take to do that and if it is better than setting a bit > > during WAL logging. > > IMHO, for the catalog scan, we will have to start/stop the transaction > for each change. So do you want that we should evaluate its > performance? > No, I was not thinking about each change, but at the level of ReorderBufferTXN. > Also, during we get the change we might not have the > complete historic snapshot ready to fetch the rel cache entry. > Before decoding each change (say DecodeInsert), we call SnapBuildProcessChange. Isn't that sufficient? Even, if the above is possible, I am not sure how good is it for each change we fetch rel cache entry, that is the point I was worried. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
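[Sketch] For context on "at the level of ReorderBufferTXN": replaying a decoded transaction already runs inside a transaction with a historic snapshot installed once per ReorderBufferTXN, roughly as sketched below (modelled on ReorderBufferCommit). A per-change relcache lookup would therefore not need its own transaction, only this existing wrapper.

    using_subtxn = IsTransactionOrTransactionBlock();
    if (using_subtxn)
        BeginInternalSubTransaction("replay");
    else
        StartTransactionCommand();

    SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash);

    /*
     * ... iterate over the transaction's changes; catalog and relcache
     * access is valid here because the historic snapshot is installed ...
     */

    TeardownHistoricSnapshot(false);

    if (using_subtxn)
        RollbackAndReleaseCurrentSubTransaction();
    else
        AbortCurrentTransaction();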
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Dilip Kumar
Date:
On Tue, Jan 28, 2020 at 1:30 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Tue, Jan 28, 2020 at 11:58 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > On Tue, Jan 28, 2020 at 11:43 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > > It seems to me that we need to add all of this new handling because > > > > > while taking the decision whether to stream or not we don't know > > > > > whether the txn has changes that can't be streamed. One idea to make > > > > > it work is that we identify it while decoding the WAL. I think we > > > > > need to set a bit in the insert/delete WAL record to identify if the > > > > > tuple belongs to a toast relation. This won't add any additional > > > > > overhead in WAL and reduce a lot of complexity in the logical decoding > > > > > and also decoding will be efficient. If this is feasible, then we can > > > > > do the same for speculative insertions. > > > > The Idea looks good to me. I will work on this. > > > > > > > > > > One more thing we can do is to identify whether the tuple belongs to > > > toast relation while decoding it. However, I think to do that we need > > > to have access to relcache at that time and that might add some > > > overhead as we need to do that for each tuple. Can we investigate > > > what it will take to do that and if it is better than setting a bit > > > during WAL logging. > > > > IMHO, for the catalog scan, we will have to start/stop the transaction > > for each change. So do you want that we should evaluate its > > performance? > > > > No, I was not thinking about each change, but at the level of ReorderBufferTXN. That means we will have to keep that transaction open until we decode the commit WAL for that ReorderBufferTXN or you have anything else in mind? > > > Also, during we get the change we might not have the > > complete historic snapshot ready to fetch the rel cache entry. > > > > Before decoding each change (say DecodeInsert), we call > SnapBuildProcessChange. Isn't that sufficient? Yeah, Right, we can get some recache entry based on the base snapshot. And, that might be sufficient to know whether it's a toast relation or not. > > Even, if the above is possible, I am not sure how good is it for each > change we fetch rel cache entry, that is the point I was worried. We might not need to scan the catalog every time, we might get it from the cache itself. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
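[Sketch] A hedged sketch of the lookup being weighed here, assuming the historic snapshot is already installed as in the wrapper sketched earlier; after the first access, the relcache/syscache should answer repeat lookups for the same relation without a fresh catalog scan.

    Oid         reloid;
    Relation    relation;

    reloid = RelidByRelfilenode(change->data.tp.relnode.spcNode,
                                change->data.tp.relnode.relNode);

    relation = RelationIdGetRelation(reloid);
    if (RelationIsValid(relation))
    {
        if (IsToastRelation(relation))
        {
            /* remember on the txn/change that this is a toast chunk */
        }
        RelationClose(relation);
    }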
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Amit Kapila
Date:
On Tue, Jan 28, 2020 at 1:55 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Tue, Jan 28, 2020 at 1:30 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > On Tue, Jan 28, 2020 at 11:58 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > > > On Tue, Jan 28, 2020 at 11:43 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > > > > It seems to me that we need to add all of this new handling because > > > > > > while taking the decision whether to stream or not we don't know > > > > > > whether the txn has changes that can't be streamed. One idea to make > > > > > > it work is that we identify it while decoding the WAL. I think we > > > > > > need to set a bit in the insert/delete WAL record to identify if the > > > > > > tuple belongs to a toast relation. This won't add any additional > > > > > > overhead in WAL and reduce a lot of complexity in the logical decoding > > > > > > and also decoding will be efficient. If this is feasible, then we can > > > > > > do the same for speculative insertions. > > > > > The Idea looks good to me. I will work on this. > > > > > > > > > > > > > One more thing we can do is to identify whether the tuple belongs to > > > > toast relation while decoding it. However, I think to do that we need > > > > to have access to relcache at that time and that might add some > > > > overhead as we need to do that for each tuple. Can we investigate > > > > what it will take to do that and if it is better than setting a bit > > > > during WAL logging. > > > > > > IMHO, for the catalog scan, we will have to start/stop the transaction > > > for each change. So do you want that we should evaluate its > > > performance? > > > > > > > No, I was not thinking about each change, but at the level of ReorderBufferTXN. > That means we will have to keep that transaction open until we decode > the commit WAL for that ReorderBufferTXN or you have anything else in > mind? > or probably till we start streaming. > > > > > Also, during we get the change we might not have the > > > complete historic snapshot ready to fetch the rel cache entry. > > > > > > > Before decoding each change (say DecodeInsert), we call > > SnapBuildProcessChange. Isn't that sufficient? > Yeah, Right, we can get some recache entry based on the base snapshot. > And, that might be sufficient to know whether it's a toast relation or > not. > > > > Even, if the above is possible, I am not sure how good is it for each > > change we fetch rel cache entry, that is the point I was worried. > > We might not need to scan the catalog every time, we might get it from > the cache itself. > Right, but I am not completely sure if that is better than setting a bit in WAL record for toast tuples. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Amit Kapila
Date:
On Fri, Jan 10, 2020 at 10:14 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Mon, Jan 6, 2020 at 2:11 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > Few more comments: > > -------------------------------- > > v4-0007-Implement-streaming-mode-in-ReorderBuffer > > 1. > > +ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn) > > { > > .. > > + /* > > + * TOCHECK: We have to rebuild historic snapshot to be sure it includes all > > + * information about > > subtransactions, which could arrive after streaming start. > > + */ > > + if (!txn->is_schema_sent) > > + snapshot_now > > = ReorderBufferCopySnap(rb, txn->base_snapshot, > > + txn, > > command_id); > > .. > > } > > > > Why are we using base snapshot here instead of the snapshot we saved > > the first time streaming has happened? And as mentioned in comments, > > won't we need to consider the snapshots for subtransactions that > > arrived after the last time we have streamed the changes? > Fixed > +static void +ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn) { .. + /* + * We can not use txn->snapshot_now directly because after we there + * might be some new sub-transaction which after the last streaming run + * so we need to add those sub-xip in the snapshot. + */ + snapshot_now = ReorderBufferCopySnap(rb, txn->snapshot_now, + txn, command_id); "because after we there", you seem to forget a word between 'we' and 'there'. So as we are copying it now, does this mean it will consider the snapshots for subtransactions that arrived after the last time we have streamed the changes? If so, have you tested it and can we add the same in comments. Also, if we need to copy the snapshot here, then do we need to again copy it in ReorderBufferProcessTXN(in below code and in catch block in the same function). { .. + /* + * Remember the command ID and snapshot if transaction is streaming + * otherwise free the snapshot if we have copied it. + */ + if (streaming) + { + txn->command_id = command_id; + txn->snapshot_now = ReorderBufferCopySnap(rb, snapshot_now, + txn, command_id); + } + else if (snapshot_now->copied) + ReorderBufferFreeSnap(rb, snapshot_now); .. } > > > > 4. In ReorderBufferStreamTXN(), don't we need to set some of the txn > > fields like origin_id, origin_lsn as we do in ReorderBufferCommit() > > especially to cover the case when it gets called due to memory > > overflow (aka via ReorderBufferCheckMemoryLimit). > We get origin_lsn during commit time so I am not sure how can we do > that. I have also noticed that currently, we are not using origin_lsn > on the subscriber side. I think need more investigation that if we > want this then do we need to log it early. > Have you done any investigation of this point? You might want to look at pg_replication_origin* APIs. Today, again looking at this code, I think with current coding, it won't be used even when we encounter commit record. Because ReorderBufferCommit calls ReorderBufferStreamCommit which will make sure that origin_id and origin_lsn is never sent. I think at least that should be fixed, if not, probably, we need a comment with reasoning why we think it is okay not to do in this case. + /* + * If we are streaming the in-progress transaction then Discard the /Discard/discard > > > > v4-0006-Gracefully-handle-concurrent-aborts-of-uncommitte > > 1. > > + /* > > + * If CheckXidAlive is valid, then we check if it aborted. 
If it did, we > > + * error out > > + */ > > + if (TransactionIdIsValid(CheckXidAlive) && > > + !TransactionIdIsInProgress(CheckXidAlive) && > > + !TransactionIdDidCommit(CheckXidAlive)) > > + ereport(ERROR, > > + (errcode(ERRCODE_TRANSACTION_ROLLBACK), > > + errmsg("transaction aborted during system catalog scan"))); > > > > Why here we can't use TransactionIdDidAbort? If we can't use it, then > > can you add comments stating the reason of the same. > Done + * If CheckXidAlive is valid, then we check if it aborted. If it did, we + * error out. Instead of directly checking the abort status we do check + * if it is not in progress transaction and no committed. Because if there + * were a system crash then status of the the transaction which were running + * at that time might not have marked. So we need to consider them as + * aborted. Refer detailed comments at snapmgr.c where the variable is + * declared. How about replacing the above comment with below one: If CheckXidAlive is valid, then we check if it aborted. If it did, we error out. We can't directly use TransactionIdDidAbort as after crash such transaction might not have been marked as aborted. See detailed comments at snapmgr.c where the variable is declared. I am not able to understand the change in v8-0011-BUGFIX-set-final_lsn-for-subxacts-before-cleanup. Do you have any explanation for the same? -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Dilip Kumar
Date:
On Thu, Jan 30, 2020 at 4:11 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Fri, Jan 10, 2020 at 10:14 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > On Mon, Jan 6, 2020 at 2:11 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > > > > > Few more comments: > > > -------------------------------- > > > v4-0007-Implement-streaming-mode-in-ReorderBuffer > > > 1. > > > +ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn) > > > { > > > .. > > > + /* > > > + * TOCHECK: We have to rebuild historic snapshot to be sure it includes all > > > + * information about > > > subtransactions, which could arrive after streaming start. > > > + */ > > > + if (!txn->is_schema_sent) > > > + snapshot_now > > > = ReorderBufferCopySnap(rb, txn->base_snapshot, > > > + txn, > > > command_id); > > > .. > > > } > > > > > > Why are we using base snapshot here instead of the snapshot we saved > > > the first time streaming has happened? And as mentioned in comments, > > > won't we need to consider the snapshots for subtransactions that > > > arrived after the last time we have streamed the changes? > > Fixed > > > > +static void > +ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn) > { > .. > + /* > + * We can not use txn->snapshot_now directly because after we there > + * might be some new sub-transaction which after the last streaming run > + * so we need to add those sub-xip in the snapshot. > + */ > + snapshot_now = ReorderBufferCopySnap(rb, txn->snapshot_now, > + txn, command_id); > > "because after we there", you seem to forget a word between 'we' and > 'there'. So as we are copying it now, does this mean it will consider > the snapshots for subtransactions that arrived after the last time we > have streamed the changes? If so, have you tested it and can we add > the same in comments. Ok > Also, if we need to copy the snapshot here, then do we need to again > copy it in ReorderBufferProcessTXN(in below code and in catch block in > the same function). I think so because as part of the "REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT" change, we might directly point to the snapshot and that will get truncated when we truncate all the changes of the ReorderBufferTXN. So I think we can check if snapshot_now->copied is true then we can avoid copying otherwise we can copy? Other comments look fine to me so I will reply to them along with the next version of the patch. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
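[Sketch] A minimal sketch of the arrangement described here, taking the existing copied flag on the snapshot as the source of truth:

    /* copy only if we don't already own a copy of the snapshot */
    if (!snapshot_now->copied)
        snapshot_now = ReorderBufferCopySnap(rb, snapshot_now,
                                             txn, command_id);

    /* ... process/stream the changes ... */

    /* free only a snapshot that was copied for this run */
    if (snapshot_now->copied)
        ReorderBufferFreeSnap(rb, snapshot_now);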
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Amit Kapila
Date:
On Thu, Jan 30, 2020 at 6:10 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Thu, Jan 30, 2020 at 4:11 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > Also, if we need to copy the snapshot here, then do we need to again > > copy it in ReorderBufferProcessTXN(in below code and in catch block in > > the same function). > I think so because as part of the > "REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT" change, we might directly > point to the snapshot and that will get truncated when we truncate all > the changes of the ReorderBufferTXN. So I think we can check if > snapshot_now->copied is true then we can avoid copying otherwise we > can copy? > Yeah, that makes sense, but I think then we also need to ensure that ReorderBufferStreamTXN frees the snapshot only when it is copied. It seems to me it should be always copied in the place where we are trying to free it, so probably we should have an Assert there. One more thing: ReorderBufferProcessTXN() { .. + if (streaming) + { + /* + * While streaming an in-progress transaction there is a + * possibility that the (sub)transaction might get aborted + * concurrently. In such case if the (sub)transaction has + * catalog update then we might decode the tuple using wrong + * catalog version. So for detecting the concurrent abort we + * set CheckXidAlive to the current (sub)transaction's xid for + * which this change belongs to. And, during catalog scan we + * can check the status of the xid and if it is aborted we will + * report an specific error which we can ignore. We might have + * already streamed some of the changes for the aborted + * (sub)transaction, but that is fine because when we decode the + * abort we will stream abort message to truncate the changes in + * the subscriber. + */ + CheckXidAlive = change->txn->xid; + } .. } I think it is better to move the above code into an inline function (something like SetXidAlive). It will make the code in function ReorderBufferProcessTXN look cleaner and easier to understand. > Other comments look fine to me so I will reply to them along with the > next version of the patch. > Okay, thanks. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
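[Sketch] A sketch of the suggested helper; the name and exact placement are only a suggestion here:

    static inline void
    SetXidAlive(TransactionId xid)
    {
        /*
         * Remember the xid of the (sub)transaction whose changes are being
         * applied, so that catalog scans can detect a concurrent abort and
         * raise the specific error the caller knows how to ignore.
         */
        CheckXidAlive = xid;
    }

    /* in ReorderBufferProcessTXN(): */
    if (streaming)
        SetXidAlive(change->txn->xid);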
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Amit Kapila
Date:
On Thu, Jan 30, 2020 at 6:10 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > Other comments look fine to me so I will reply to them along with the > next version of the patch. > This still needs more work, so I have moved this to the next CF. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Amit Kapila
Date:
On Fri, Jan 10, 2020 at 10:53 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Mon, Dec 30, 2019 at 3:43 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > 2. During commit time in DecodeCommit we check whether we need to skip > > > the changes of the transaction or not by calling > > > SnapBuildXactNeedsSkip but since now we support streaming so it's > > > possible that before we decode the commit WAL, we might have already > > > sent the changes to the output plugin even though we could have > > > skipped those changes. So my question is instead of checking at the > > > commit time can't we check before adding to ReorderBuffer itself > > > > > > > I think if we can do that then the same will be true for current code > > irrespective of this patch. I think it is possible that we can't take > > that decision while decoding because we haven't assembled a consistent > > snapshot yet. I think we might be able to do that while we try to > > stream the changes. I think we need to take care of all the > > conditions during streaming (when the logical_decoding_workmem limit > > is reached) as we do in DecodeCommit. This needs a bit more study. > > I have analyzed this further and I think we can not decide all the > conditions even while streaming. Because IMHO once we get the > SNAPBUILD_FULL_SNAPSHOT we can add the changes to the reorder buffer > so that if we get the commit of the transaction after we reach to the > SNAPBUILD_CONSISTENT. However, if we get the commit before we reach > to SNAPBUILD_CONSISTENT then we need to ignore this transaction. Now, > even if we have SNAPBUILD_FULL_SNAPSHOT we can stream the changes > which might get dropped later but that we can not decide while > streaming. > This makes sense to me, but we should add a comment for the same when we are streaming to say we can't skip similar to how we do during commit time because of the above reason described by you. Also, what about other conditions where we can skip the transaction, basically cases like (a) when the transaction happened in another database, (b) when the output plugin is not interested in the origin and (c) when we are doing fast-forwarding -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
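[Sketch] For reference, the commit-time checks in DecodeCommit that the streaming path would want to mirror where possible look roughly like this; per the analysis above, the snapshot-builder condition is the one that cannot be decided reliably at stream time:

    if (SnapBuildXactNeedsSkip(ctx->snapshot_builder, buf->origptr) ||
        (parsed->dbId != InvalidOid && parsed->dbId != ctx->slot->data.database) ||
        ctx->fast_forward ||
        FilterByOrigin(ctx, origin_id))
    {
        /* forget the changes; nothing is sent to the output plugin */
    }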
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Dilip Kumar
Date:
On Mon, Feb 3, 2020 at 9:51 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Fri, Jan 10, 2020 at 10:53 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > On Mon, Dec 30, 2019 at 3:43 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > > > > 2. During commit time in DecodeCommit we check whether we need to skip > > > > the changes of the transaction or not by calling > > > > SnapBuildXactNeedsSkip but since now we support streaming so it's > > > > possible that before we decode the commit WAL, we might have already > > > > sent the changes to the output plugin even though we could have > > > > skipped those changes. So my question is instead of checking at the > > > > commit time can't we check before adding to ReorderBuffer itself > > > > > > > > > > I think if we can do that then the same will be true for current code > > > irrespective of this patch. I think it is possible that we can't take > > > that decision while decoding because we haven't assembled a consistent > > > snapshot yet. I think we might be able to do that while we try to > > > stream the changes. I think we need to take care of all the > > > conditions during streaming (when the logical_decoding_workmem limit > > > is reached) as we do in DecodeCommit. This needs a bit more study. > > > > I have analyzed this further and I think we can not decide all the > > conditions even while streaming. Because IMHO once we get the > > SNAPBUILD_FULL_SNAPSHOT we can add the changes to the reorder buffer > > so that if we get the commit of the transaction after we reach to the > > SNAPBUILD_CONSISTENT. However, if we get the commit before we reach > > to SNAPBUILD_CONSISTENT then we need to ignore this transaction. Now, > > even if we have SNAPBUILD_FULL_SNAPSHOT we can stream the changes > > which might get dropped later but that we can not decide while > > streaming. > > > > This makes sense to me, but we should add a comment for the same when > we are streaming to say we can't skip similar to how we do during > commit time because of the above reason described by you. Also, what > about other conditions where we can skip the transaction, basically > cases like (a) when the transaction happened in another database, (b) > when the output plugin is not interested in the origin and (c) when we > are doing fast-forwarding I will analyze those and fix in my next version of the patch. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Dilip Kumar
Date:
On Tue, Jan 28, 2020 at 11:43 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Tue, Jan 28, 2020 at 11:34 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > On Tue, Jan 28, 2020 at 11:28 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > On Wed, Jan 22, 2020 at 10:30 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > > > > > On Tue, Jan 14, 2020 at 10:44 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > > > > > > > > > Hmm, I think this can turn out to be inefficient because we can easily > > > > > end up spilling the data even when we don't need to so. Consider > > > > > cases, where part of the streamed changes are for toast, and remaining > > > > > are the changes which we would have streamed and hence can be removed. > > > > > In such cases, we could have easily consumed remaining changes for > > > > > toast without spilling. Also, I am not sure if spilling changes from > > > > > the hash table is a good idea as they are no more in the same order as > > > > > they were in ReorderBuffer which means the order in which we serialize > > > > > the changes normally would change and that might have some impact, so > > > > > we would need some more study if we want to pursue this idea. > > > > I have fixed this bug and attached it as a separate patch. I will > > > > merge it to the main patch after we agree with the idea and after some > > > > more testing. > > > > > > > > The idea is that whenever we get the toasted chunk instead of directly > > > > inserting it into the toast hash I am inserting it into some local > > > > list so that if we don't get the change for the main table then we can > > > > insert these changes back to the txn->changes. So once we get the > > > > change for the main table at that time I am preparing the hash table > > > > to merge the chunks. > > > > > > > > > > > > > I think this idea will work but appears to be quite costly because (a) > > > you might need to serialize/deserialize the changes multiple times and > > > might attempt streaming multiple times even though you can't do (b) > > > you need to remove/add the same set of changes from the main list > > > multiple times. > > I agree with this. > > > > > > It seems to me that we need to add all of this new handling because > > > while taking the decision whether to stream or not we don't know > > > whether the txn has changes that can't be streamed. One idea to make > > > it work is that we identify it while decoding the WAL. I think we > > > need to set a bit in the insert/delete WAL record to identify if the > > > tuple belongs to a toast relation. This won't add any additional > > > overhead in WAL and reduce a lot of complexity in the logical decoding > > > and also decoding will be efficient. If this is feasible, then we can > > > do the same for speculative insertions. > > The Idea looks good to me. I will work on this. > > > > One more thing we can do is to identify whether the tuple belongs to > toast relation while decoding it. However, I think to do that we need > to have access to relcache at that time and that might add some > overhead as we need to do that for each tuple. Can we investigate > what it will take to do that and if it is better than setting a bit > during WAL logging. > I have done some more analysis on this and it appears that there are few problems in doing this. Basically, once we get the confirmed flush location, we advance the replication_slot_catalog_xmin so that vacuum can garbage collect the old tuple. 
So the problem is that while we are collecting the changes in the ReorderBuffer our catalog version might have removed, and we might not find any relation entry with that relfilenodeid (because it is dropped or altered in the future). -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Amit Kapila
Date:
On Tue, Feb 4, 2020 at 11:00 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Tue, Jan 28, 2020 at 11:43 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > One more thing we can do is to identify whether the tuple belongs to > > toast relation while decoding it. However, I think to do that we need > > to have access to relcache at that time and that might add some > > overhead as we need to do that for each tuple. Can we investigate > > what it will take to do that and if it is better than setting a bit > > during WAL logging. > > > I have done some more analysis on this and it appears that there are > few problems in doing this. Basically, once we get the confirmed > flush location, we advance the replication_slot_catalog_xmin so that > vacuum can garbage collect the old tuple. So the problem is that > while we are collecting the changes in the ReorderBuffer our catalog > version might have removed, and we might not find any relation entry > with that relfilenodeid (because it is dropped or altered in the > future). > Hmm, this means this can also occur while streaming the changes. The main reason as I understand is that it is because before decoding commit, we don't know whether these changes are already sent to the subscriber (based on confirmed_flush_location/start_decoding_at). I think it is better to skip streaming such transactions as we can't make the right decision about these and as this can happen generally after the crash for the first few transactions, it shouldn't matter much if we serialize such transactions instead of streaming them. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
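[Sketch] Put together, the behaviour being proposed when the memory limit is reached might look like the sketch below. ReorderBufferStreamTXN comes from the patch series under discussion, and both the streaming-capability check and the "safe to stream" test are placeholders, not real functions:

    static void
    ReorderBufferCheckMemoryLimit(ReorderBuffer *rb)
    {
        ReorderBufferTXN *txn;

        /* nothing to do while we are below the limit */
        if (rb->size < logical_decoding_work_mem * 1024L)
            return;

        txn = ReorderBufferLargestTXN(rb);

        if (CanStreamChanges(rb) && TxnIsSafeToStream(txn)) /* hypothetical checks */
            ReorderBufferStreamTXN(rb, txn);    /* send changes downstream now */
        else
            ReorderBufferSerializeTXN(rb, txn); /* spill the changes to disk */
    }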
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Dilip Kumar
Date:
On Thu, Jan 30, 2020 at 4:11 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Fri, Jan 10, 2020 at 10:14 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > On Mon, Jan 6, 2020 at 2:11 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > > > > > Few more comments: > > > -------------------------------- > > > v4-0007-Implement-streaming-mode-in-ReorderBuffer > > > 1. > > > +ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn) > > > { > > > .. > > > + /* > > > + * TOCHECK: We have to rebuild historic snapshot to be sure it includes all > > > + * information about > > > subtransactions, which could arrive after streaming start. > > > + */ > > > + if (!txn->is_schema_sent) > > > + snapshot_now > > > = ReorderBufferCopySnap(rb, txn->base_snapshot, > > > + txn, > > > command_id); > > > .. > > > } > > > > > > Why are we using base snapshot here instead of the snapshot we saved > > > the first time streaming has happened? And as mentioned in comments, > > > won't we need to consider the snapshots for subtransactions that > > > arrived after the last time we have streamed the changes? > > Fixed > > > > +static void > +ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn) > { > .. > + /* > + * We can not use txn->snapshot_now directly because after we there > + * might be some new sub-transaction which after the last streaming run > + * so we need to add those sub-xip in the snapshot. > + */ > + snapshot_now = ReorderBufferCopySnap(rb, txn->snapshot_now, > + txn, command_id); > > "because after we there", you seem to forget a word between 'we' and > 'there'. Fixed So as we are copying it now, does this mean it will consider > the snapshots for subtransactions that arrived after the last time we > have streamed the changes? If so, have you tested it and can we add > the same in comments. Yes I have tested. Comment added. > > Also, if we need to copy the snapshot here, then do we need to again > copy it in ReorderBufferProcessTXN(in below code and in catch block in > the same function). > > { > .. > + /* > + * Remember the command ID and snapshot if transaction is streaming > + * otherwise free the snapshot if we have copied it. > + */ > + if (streaming) > + { > + txn->command_id = command_id; > + txn->snapshot_now = ReorderBufferCopySnap(rb, snapshot_now, > + txn, command_id); > + } > + else if (snapshot_now->copied) > + ReorderBufferFreeSnap(rb, snapshot_now); > .. > } > Fixed > > > > > > 4. In ReorderBufferStreamTXN(), don't we need to set some of the txn > > > fields like origin_id, origin_lsn as we do in ReorderBufferCommit() > > > especially to cover the case when it gets called due to memory > > > overflow (aka via ReorderBufferCheckMemoryLimit). > > We get origin_lsn during commit time so I am not sure how can we do > > that. I have also noticed that currently, we are not using origin_lsn > > on the subscriber side. I think need more investigation that if we > > want this then do we need to log it early. > > > > Have you done any investigation of this point? You might want to look > at pg_replication_origin* APIs. Today, again looking at this code, I > think with current coding, it won't be used even when we encounter > commit record. Because ReorderBufferCommit calls > ReorderBufferStreamCommit which will make sure that origin_id and > origin_lsn is never sent. I think at least that should be fixed, if > not, probably, we need a comment with reasoning why we think it is > okay not to do in this case. 
Still, the problem is the same because, currently, we are sending origin_lsn as part of the "pgoutput_begin" message. Now, for the streaming transaction, we have already sent the stream start. However, we might send this during the stream commit, but I am not completely sure because currently, the consumer of this message "apply_handle_origin" is just ignoring it. I have also looked into pg_replication_origin* APIs and they are used for setting origin id and tracking the progress, but they will not consume the origin_lsn we are sending in pgoutput_begin so this is not directly related. > > + /* > + * If we are streaming the in-progress transaction then Discard the > > /Discard/discard Done > > > > > > > v4-0006-Gracefully-handle-concurrent-aborts-of-uncommitte > > > 1. > > > + /* > > > + * If CheckXidAlive is valid, then we check if it aborted. If it did, we > > > + * error out > > > + */ > > > + if (TransactionIdIsValid(CheckXidAlive) && > > > + !TransactionIdIsInProgress(CheckXidAlive) && > > > + !TransactionIdDidCommit(CheckXidAlive)) > > > + ereport(ERROR, > > > + (errcode(ERRCODE_TRANSACTION_ROLLBACK), > > > + errmsg("transaction aborted during system catalog scan"))); > > > > > > Why here we can't use TransactionIdDidAbort? If we can't use it, then > > > can you add comments stating the reason of the same. > > Done > > + * If CheckXidAlive is valid, then we check if it aborted. If it did, we > + * error out. Instead of directly checking the abort status we do check > + * if it is not in progress transaction and no committed. Because if there > + * were a system crash then status of the the transaction which were running > + * at that time might not have marked. So we need to consider them as > + * aborted. Refer detailed comments at snapmgr.c where the variable is > + * declared. > > > How about replacing the above comment with below one: > > If CheckXidAlive is valid, then we check if it aborted. If it did, we > error out. We can't directly use TransactionIdDidAbort as after crash > such transaction might not have been marked as aborted. See detailed > comments at snapmgr.c where the variable is declared. Done > > I am not able to understand the change in > v8-0011-BUGFIX-set-final_lsn-for-subxacts-before-cleanup. Do you have > any explanation for the same? It appears that in ReorderBufferCommitChild we are always setting the final_lsn of the subxacts so it should not be invalid. For testing, I have changed this as an assert and checked but it never hit. So maybe we can remove this change. Apart from that, I have fixed the toast tuple streaming bug by setting the flag bit in the WAL (attached as 0012). I have also extended this solution for handling the speculative insert bug so old patch for a speculative insert bug fix is removed. I am also exploring the solution that how can we do this without setting the flag in the WAL as we discussed upthread. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Attachment
- v9-0001-Immediately-WAL-log-assignments.patch
- v9-0005-Implement-streaming-mode-in-ReorderBuffer.patch
- v9-0002-Issue-individual-invalidations-with-wal_level-log.patch
- v9-0004-Gracefully-handle-concurrent-aborts-of-uncommitte.patch
- v9-0003-Extend-the-output-plugin-API-with-stream-methods.patch
- v9-0006-Support-logical_decoding_work_mem-set-from-create.patch
- v9-0007-Add-support-for-streaming-to-built-in-replication.patch
- v9-0008-Track-statistics-for-streaming.patch
- v9-0010-BUGFIX-set-final_lsn-for-subxacts-before-cleanup.patch
- v9-0009-Enable-streaming-for-all-subscription-TAP-tests.patch
- v9-0011-Add-TAP-test-for-streaming-vs.-DDL.patch
- v9-0012-Bugfix-handling-of-incomplete-toast-tuple.patch
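[Sketch] As a companion to the write-side flag sketch earlier in the thread, the decode side of the flag-bit approach could, in outline, mark the transaction as holding an incomplete tuple and have the memory-limit code spill rather than stream it. All names below are illustrative and not necessarily what the attached 0012 patch does:

    /* while decoding the insert record */
    toast_insert = (xlrec->flags & XLH_INSERT_ON_TOAST_RELATION) != 0;

    /* when queuing the change into the reorder buffer */
    if (toast_insert)
        txn->has_partial_change = true;     /* hypothetical ReorderBufferTXN field */

    /* when the memory limit is reached */
    if (txn->has_partial_change)
        ReorderBufferSerializeTXN(rb, txn); /* spill: streaming would split the tuple */
    else
        ReorderBufferStreamTXN(rb, txn);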
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Dilip Kumar
Date:
On Fri, Jan 31, 2020 at 8:08 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Thu, Jan 30, 2020 at 6:10 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > On Thu, Jan 30, 2020 at 4:11 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > Also, if we need to copy the snapshot here, then do we need to again > > > copy it in ReorderBufferProcessTXN(in below code and in catch block in > > > the same function). > > I think so because as part of the > > "REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT" change, we might directly > > point to the snapshot and that will get truncated when we truncate all > > the changes of the ReorderBufferTXN. So I think we can check if > > snapshot_now->copied is true then we can avoid copying otherwise we > > can copy? > > > > Yeah, that makes sense, but I think then we also need to ensure that > ReorderBufferStreamTXN frees the snapshot only when it is copied. It > seems to me it should be always copied in the place where we are > trying to free it, so probably we should have an Assert there. > > One more thing: > ReorderBufferProcessTXN() > { > .. > + if (streaming) > + { > + /* > + * While streaming an in-progress transaction there is a > + * possibility that the (sub)transaction might get aborted > + * concurrently. In such case if the (sub)transaction has > + * catalog update then we might decode the tuple using wrong > + * catalog version. So for detecting the concurrent abort we > + * set CheckXidAlive to the current (sub)transaction's xid for > + * which this change belongs to. And, during catalog scan we > + * can check the status of the xid and if it is aborted we will > + * report an specific error which we can ignore. We might have > + * already streamed some of the changes for the aborted > + * (sub)transaction, but that is fine because when we decode the > + * abort we will stream abort message to truncate the changes in > + * the subscriber. > + */ > + CheckXidAlive = change->txn->xid; > + } > .. > } > > I think it is better to move the above code into an inline function > (something like SetXidAlive). It will make the code in function > ReorderBufferProcessTXN look cleaner and easier to understand. > Fixed in the latest version sent upthread. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Dilip Kumar
Date:
On Wed, Feb 5, 2020 at 9:27 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Tue, Feb 4, 2020 at 11:00 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > On Tue, Jan 28, 2020 at 11:43 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > > > One more thing we can do is to identify whether the tuple belongs to > > > toast relation while decoding it. However, I think to do that we need > > > to have access to relcache at that time and that might add some > > > overhead as we need to do that for each tuple. Can we investigate > > > what it will take to do that and if it is better than setting a bit > > > during WAL logging. > > > > > I have done some more analysis on this and it appears that there are > > few problems in doing this. Basically, once we get the confirmed > > flush location, we advance the replication_slot_catalog_xmin so that > > vacuum can garbage collect the old tuple. So the problem is that > > while we are collecting the changes in the ReorderBuffer our catalog > > version might have removed, and we might not find any relation entry > > with that relfilenodeid (because it is dropped or altered in the > > future). > > > > Hmm, this means this can also occur while streaming the changes. The > main reason as I understand is that it is because before decoding > commit, we don't know whether these changes are already sent to the > subscriber (based on confirmed_flush_location/start_decoding_at). Right. >I think it is better to skip streaming such transactions as we can't > make the right decision about these and as this can happen generally > after the crash for the first few transactions, it shouldn't matter > much if we serialize such transactions instead of streaming them. I think the idea makes sense to me. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Amit Kapila
Date:
On Wed, Feb 5, 2020 at 9:46 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > Fixed in the latest version sent upthread. > Okay, thanks. I haven't looked at the latest version of patch series as I was reviewing the previous version and I think all of these comments are in the patch which is not modified. Here are my comments: I think we don't need to maintain v8-0007-Support-logical_decoding_work_mem-set-from-create as per discussion in one of the above emails [1] as its usage is not clear. v8-0008-Add-support-for-streaming-to-built-in-replication 1. - information. The allowed options are <literal>slot_name</literal> and - <literal>synchronous_commit</literal> + information. The allowed options are <literal>slot_name</literal>, + <literal>synchronous_commit</literal>, <literal>work_mem</literal> + and <literal>streaming</literal>. As per the discussion above [1], I don't think we need work_mem here. You might want to remove the other usage from the patch as well. 2. @@ -59,7 +59,8 @@ parse_subscription_options(List *options, bool *connect, bool *enabled_given, bool *slot_name_given, char **slot_name, bool *copy_data, char **synchronous_commit, bool *refresh, int *logical_wm, - bool *logical_wm_given) + bool *logical_wm_given, bool *streaming, + bool *streaming_given) It is not clear to me why we need two parameters 'streaming' and 'streaming_given' in this API. Can't we handle similar to parameter 'refresh'? 3. diff --git a/src/backend/replication/logical/launcher.c b/src/backend/replication/logical/launcher.c index aec885e..e80d00c 100644 --- a/src/backend/replication/logical/launcher.c +++ b/src/backend/replication/logical/launcher.c @@ -14,6 +14,8 @@ * *------------------------------------------------------------------------- */ +#include <sys/types.h> +#include <unistd.h> #include "postgres.h" I see only the above change in launcher.c. Why we need to include these if there is no other change (at least not in this patch). 4. stream_start_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn) /* Push callback + info on the error context stack */ state.ctx = ctx; state.callback_name = "stream_start"; - /* state.report_location = apply_lsn; */ + state.report_location = InvalidXLogRecPtr; errcallback.callback = output_plugin_error_callback; errcallback.arg = (void *) &state; errcallback.previous = error_context_stack; @@ -1194,7 +1194,7 @@ stream_stop_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn) /* Push callback + info on the error context stack */ state.ctx = ctx; state.callback_name = "stream_stop"; - /* state.report_location = apply_lsn; */ + state.report_location = InvalidXLogRecPtr; errcallback.callback = output_plugin_error_callback; errcallback.arg = (void *) &state; errcallback.previous = error_context_stack; Don't we want to set txn->final_lsn in report location as we do at few other places? 5. -logicalrep_write_delete(StringInfo out, Relation rel, HeapTuple oldtuple) +logicalrep_write_delete(StringInfo out, TransactionId xid, + Relation rel, HeapTuple oldtuple) { + pq_sendbyte(out, 'D'); /* action DELETE */ + Assert(rel->rd_rel->relreplident == REPLICA_IDENTITY_DEFAULT || rel->rd_rel->relreplident == REPLICA_IDENTITY_FULL || rel->rd_rel->relreplident == REPLICA_IDENTITY_INDEX); - pq_sendbyte(out, 'D'); /* action DELETE */ Why this patch need to change the above code? 6. 
+void +logicalrep_write_stream_start(StringInfo out, + TransactionId xid, bool first_segment) +{ + pq_sendbyte(out, 'S'); /* action STREAM START */ + + Assert(TransactionIdIsValid(xid)); + + /* transaction ID (we're starting to stream, so must be valid) */ + pq_sendint32(out, xid); + + /* 1 if this is the first streaming segment for this xid */ + pq_sendint32(out, first_segment ? 1 : 0); +} + +TransactionId +logicalrep_read_stream_start(StringInfo in, bool *first_segment) +{ + TransactionId xid; + + Assert(first_segment); + + xid = pq_getmsgint(in, 4); + *first_segment = (pq_getmsgint(in, 4) == 1); + + return xid; +} In these functions for sending bool, pq_sendint32 is used. Can't we use pq_sendbyte similar to what we do in boolsend? 7. +void +logicalrep_write_stream_stop(StringInfo out, TransactionId xid) +{ + pq_sendbyte(out, 'E'); /* action STREAM END */ + + Assert(TransactionIdIsValid(xid)); + + /* transaction ID (we're starting to stream, so must be valid) */ + pq_sendint32(out, xid); +} In comments, 'starting to stream' is mentioned whereas this function is to stop it. 8. +void +logicalrep_write_stream_stop(StringInfo out, TransactionId xid) +{ + pq_sendbyte(out, 'E'); /* action STREAM END */ + + Assert(TransactionIdIsValid(xid)); + + /* transaction ID (we're starting to stream, so must be valid) */ + pq_sendint32(out, xid); +} + +TransactionId +logicalrep_read_stream_stop(StringInfo in) +{ + TransactionId xid; + + xid = pq_getmsgint(in, 4); + + return xid; +} Is there a reason to send xid on stopping stream? I don't see any use of function logicalrep_read_stream_stop. 9. + * XXX Add calls to pgstat_report_wait_start/pgstat_report_wait_end. + */ +static void +subxact_info_write(Oid subid, TransactionId xid) { .. + pgstat_report_wait_start(WAIT_EVENT_LOGICAL_SUBXACT_WRITE); .. + pgstat_report_wait_end(); .. } I see the calls to pgstat_report_wait_start/pgstat_report_wait_end in this function, so not sure if the above comment makes sense. 10. + * The files are placed in /tmp by default, and the filenames include both + * the XID of the toplevel transaction and OID of the subscription. Are we keeping files in /tmp or pg's temp tablespace dir. Seeing below code, it doesn't seem that we place them in /tmp. If I am correct, then can you update the comment. +static void +subxact_filename(char *path, Oid subid, TransactionId xid) +{ + char tempdirpath[MAXPGPATH]; + + TempTablespacePath(tempdirpath, DEFAULTTABLESPACE_OID); 11. + * The change is serialied in a simple format, with length (not including + * the length), action code (identifying the message type) and message + * contents (without the subxact TransactionId value). + * .. + */ +static void +stream_write_change(char action, StringInfo s) The part of the comment which says "with length (not including the length) .." is not clear to me. What does "not including the length" mean? 12. + * TODO: Add missing_ok flag to specify in which cases it's OK not to + * find the files, and when it's an error. + */ +static void +stream_cleanup_files(Oid subid, TransactionId xid) I think we can implement this TODO. It is clear when this function is called from apply_handle_stream_commit, the file must exist. We can similarly analyze other callers of this API. 13. +apply_handle_stream_abort(StringInfo s) { .. + /* FIXME optimize the search by bsearch on sorted data */ + for (i = nsubxacts; i > 0; i--) .. I am not sure how important this optimization is, so instead of FIXME, it is better to keep it as a XXX comment. 
In the future, if we hit any performance issue due to this, we can revisit our decision. [1] - https://www.postgresql.org/message-id/CAA4eK1LH7xzF%2B-qHRv9EDXQTFYjPUYZw5B7FSK9QLEg7F603OQ%40mail.gmail.com -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
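[Sketch] To illustrate point 6, the stream-start functions from the quoted patch could send the first_segment flag as a single byte, mirroring what boolsend does; a sketch:

    void
    logicalrep_write_stream_start(StringInfo out,
                                  TransactionId xid, bool first_segment)
    {
        pq_sendbyte(out, 'S');      /* action STREAM START */

        Assert(TransactionIdIsValid(xid));

        /* transaction ID (we're starting to stream, so must be valid) */
        pq_sendint32(out, xid);

        /* 1 if this is the first streaming segment for this xid */
        pq_sendbyte(out, first_segment ? 1 : 0);
    }

    TransactionId
    logicalrep_read_stream_start(StringInfo in, bool *first_segment)
    {
        TransactionId xid;

        Assert(first_segment);

        xid = pq_getmsgint(in, 4);
        *first_segment = (pq_getmsgbyte(in) == 1);

        return xid;
    }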
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Amit Kapila
Date:
On Wed, Feb 5, 2020 at 9:42 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > > I am not able to understand the change in > > v8-0011-BUGFIX-set-final_lsn-for-subxacts-before-cleanup. Do you have > > any explanation for the same? > > It appears that in ReorderBufferCommitChild we are always setting the > final_lsn of the subxacts so it should not be invalid. For testing, I > have changed this as an assert and checked but it never hit. So maybe > we can remove this change. > Tomas, do you remember anything about this change? We are talking about below change: From: Tomas Vondra <tv@fuzzy.cz> Date: Thu, 26 Sep 2019 19:14:45 +0200 Subject: [PATCH v8 11/13] BUGFIX: set final_lsn for subxacts before cleanup --- src/backend/replication/logical/reorderbuffer.c | 4 ++++ 1 file changed, 4 insertions(+) diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c index fe4e57c..beb6cd2 100644 --- a/src/backend/replication/logical/reorderbuffer.c +++ b/src/backend/replication/logical/reorderbuffer.c @@ -1327,6 +1327,10 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn) subtxn = dlist_container(ReorderBufferTXN, node, iter.cur); + /* make sure subtxn has final_lsn */ + if (subtxn->final_lsn == InvalidXLogRecPtr) + subtxn->final_lsn = txn->final_lsn; + -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Dilip Kumar
Date:
On Wed, Feb 5, 2020 at 4:05 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Wed, Feb 5, 2020 at 9:46 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > Fixed in the latest version sent upthread. > > > > Okay, thanks. I haven't looked at the latest version of patch series > as I was reviewing the previous version and I think all of these > comments are in the patch which is not modified. Here are my > comments: > > I think we don't need to maintain > v8-0007-Support-logical_decoding_work_mem-set-from-create as per > discussion in one of the above emails [1] as its usage is not clear. > > v8-0008-Add-support-for-streaming-to-built-in-replication > 1. > - information. The allowed options are <literal>slot_name</literal> and > - <literal>synchronous_commit</literal> > + information. The allowed options are <literal>slot_name</literal>, > + <literal>synchronous_commit</literal>, <literal>work_mem</literal> > + and <literal>streaming</literal>. > > As per the discussion above [1], I don't think we need work_mem here. > You might want to remove the other usage from the patch as well. After putting more thought on this it appears that there could be some use cases for setting the work_mem from the subscription, Assume a case where data are coming from two different origins and based on the origin ids different slots might collect different type of changes, So isn't it good to have different work_mem for different slots? I am not saying that the current way of implementing is the best one but that we can improve. First, we need to decide whether we have a use case for this or not. Please let me know your thought on the same. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Amit Kapila
Date:
On Fri, Feb 7, 2020 at 4:18 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Wed, Feb 5, 2020 at 4:05 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > On Wed, Feb 5, 2020 at 9:46 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > > > Fixed in the latest version sent upthread. > > > > > > > Okay, thanks. I haven't looked at the latest version of patch series > > as I was reviewing the previous version and I think all of these > > comments are in the patch which is not modified. Here are my > > comments: > > > > I think we don't need to maintain > > v8-0007-Support-logical_decoding_work_mem-set-from-create as per > > discussion in one of the above emails [1] as its usage is not clear. > > > > v8-0008-Add-support-for-streaming-to-built-in-replication > > 1. > > - information. The allowed options are <literal>slot_name</literal> and > > - <literal>synchronous_commit</literal> > > + information. The allowed options are <literal>slot_name</literal>, > > + <literal>synchronous_commit</literal>, <literal>work_mem</literal> > > + and <literal>streaming</literal>. > > > > As per the discussion above [1], I don't think we need work_mem here. > > You might want to remove the other usage from the patch as well. > > After putting more thought on this it appears that there could be some > use cases for setting the work_mem from the subscription, Assume a > case where data are coming from two different origins and based on the > origin ids different slots might collect different type of changes, > So isn't it good to have different work_mem for different slots? I am > not saying that the current way of implementing is the best one but > that we can improve. First, we need to decide whether we have a use > case for this or not. > That is the whole point. I don't see a very clear usage of this and neither did anybody explained clearly how it will be useful. I am not denying that what you are describing has no use, but as you said we might need to invent an entirely new way even if we have such a use. I think it is better to avoid the requirements which are not essential for this patch. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Dilip Kumar
Date:
On Mon, Feb 10, 2020 at 1:52 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Fri, Feb 7, 2020 at 4:18 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > On Wed, Feb 5, 2020 at 4:05 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > On Wed, Feb 5, 2020 at 9:46 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > > > > > Fixed in the latest version sent upthread. > > > > > > > > > > Okay, thanks. I haven't looked at the latest version of patch series > > > as I was reviewing the previous version and I think all of these > > > comments are in the patch which is not modified. Here are my > > > comments: > > > > > > I think we don't need to maintain > > > v8-0007-Support-logical_decoding_work_mem-set-from-create as per > > > discussion in one of the above emails [1] as its usage is not clear. > > > > > > v8-0008-Add-support-for-streaming-to-built-in-replication > > > 1. > > > - information. The allowed options are <literal>slot_name</literal> and > > > - <literal>synchronous_commit</literal> > > > + information. The allowed options are <literal>slot_name</literal>, > > > + <literal>synchronous_commit</literal>, <literal>work_mem</literal> > > > + and <literal>streaming</literal>. > > > > > > As per the discussion above [1], I don't think we need work_mem here. > > > You might want to remove the other usage from the patch as well. > > > > After putting more thought on this it appears that there could be some > > use cases for setting the work_mem from the subscription, Assume a > > case where data are coming from two different origins and based on the > > origin ids different slots might collect different type of changes, > > So isn't it good to have different work_mem for different slots? I am > > not saying that the current way of implementing is the best one but > > that we can improve. First, we need to decide whether we have a use > > case for this or not. > > > > That is the whole point. I don't see a very clear usage of this and > neither did anybody explained clearly how it will be useful. I am not > denying that what you are describing has no use, but as you said we > might need to invent an entirely new way even if we have such a use. > I think it is better to avoid the requirements which are not essential > for this patch. Ok, I will include this change in the next patch set. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Dilip Kumar
Date:
On Wed, Feb 5, 2020 at 4:05 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Wed, Feb 5, 2020 at 9:46 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > I think we don't need to maintain > v8-0007-Support-logical_decoding_work_mem-set-from-create as per > discussion in one of the above emails [1] as its usage is not clear. Done > v8-0008-Add-support-for-streaming-to-built-in-replication > 1. > - information. The allowed options are <literal>slot_name</literal> and > - <literal>synchronous_commit</literal> > + information. The allowed options are <literal>slot_name</literal>, > + <literal>synchronous_commit</literal>, <literal>work_mem</literal> > + and <literal>streaming</literal>. > > As per the discussion above [1], I don't think we need work_mem here. > You might want to remove the other usage from the patch as well. Done > 2. > @@ -59,7 +59,8 @@ parse_subscription_options(List *options, bool > *connect, bool *enabled_given, > bool *slot_name_given, char **slot_name, > bool *copy_data, char **synchronous_commit, > bool *refresh, int *logical_wm, > - bool *logical_wm_given) > + bool *logical_wm_given, bool *streaming, > + bool *streaming_given) > > It is not clear to me why we need two parameters 'streaming' and > 'streaming_given' in this API. Can't we handle similar to parameter > 'refresh'? The streaming option we need to update in the system table, so if we don't remember whether the user has given its value or not then how we will know whether to update this column or not? Or you are suggesting that we should always mark this as updated but IMHO that is not a good idea. > 3. > diff --git a/src/backend/replication/logical/launcher.c > b/src/backend/replication/logical/launcher.c > index aec885e..e80d00c 100644 > --- a/src/backend/replication/logical/launcher.c > +++ b/src/backend/replication/logical/launcher.c > @@ -14,6 +14,8 @@ > * > *------------------------------------------------------------------------- > */ > +#include <sys/types.h> > +#include <unistd.h> > > #include "postgres.h" > > I see only the above change in launcher.c. Why we need to include > these if there is no other change (at least not in this patch). Removed > 4. > stream_start_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn) > /* Push callback + info on the error context stack */ > state.ctx = ctx; > state.callback_name = "stream_start"; > - /* state.report_location = apply_lsn; */ > + state.report_location = InvalidXLogRecPtr; > errcallback.callback = output_plugin_error_callback; > errcallback.arg = (void *) &state; > errcallback.previous = error_context_stack; > @@ -1194,7 +1194,7 @@ stream_stop_cb_wrapper(ReorderBuffer *cache, > ReorderBufferTXN *txn) > /* Push callback + info on the error context stack */ > state.ctx = ctx; > state.callback_name = "stream_stop"; > - /* state.report_location = apply_lsn; */ > + state.report_location = InvalidXLogRecPtr; > errcallback.callback = output_plugin_error_callback; > errcallback.arg = (void *) &state; > errcallback.previous = error_context_stack; > > Don't we want to set txn->final_lsn in report location as we do at few > other places? Fixed > 5. 
> -logicalrep_write_delete(StringInfo out, Relation rel, HeapTuple oldtuple) > +logicalrep_write_delete(StringInfo out, TransactionId xid, > + Relation rel, HeapTuple oldtuple) > { > + pq_sendbyte(out, 'D'); /* action DELETE */ > + > Assert(rel->rd_rel->relreplident == REPLICA_IDENTITY_DEFAULT || > rel->rd_rel->relreplident == REPLICA_IDENTITY_FULL || > rel->rd_rel->relreplident == REPLICA_IDENTITY_INDEX); > > - pq_sendbyte(out, 'D'); /* action DELETE */ > > Why this patch need to change the above code? Fixed > 6. > +void > +logicalrep_write_stream_start(StringInfo out, > + TransactionId xid, bool first_segment) > +{ > + pq_sendbyte(out, 'S'); /* action STREAM START */ > + > + Assert(TransactionIdIsValid(xid)); > + > + /* transaction ID (we're starting to stream, so must be valid) */ > + pq_sendint32(out, xid); > + > + /* 1 if this is the first streaming segment for this xid */ > + pq_sendint32(out, first_segment ? 1 : 0); > +} > + > +TransactionId > +logicalrep_read_stream_start(StringInfo in, bool *first_segment) > +{ > + TransactionId xid; > + > + Assert(first_segment); > + > + xid = pq_getmsgint(in, 4); > + *first_segment = (pq_getmsgint(in, 4) == 1); > + > + return xid; > +} > > In these functions for sending bool, pq_sendint32 is used. Can't we > use pq_sendbyte similar to what we do in boolsend? Done > 7. > +void > +logicalrep_write_stream_stop(StringInfo out, TransactionId xid) > +{ > + pq_sendbyte(out, 'E'); /* action STREAM END */ > + > + Assert(TransactionIdIsValid(xid)); > + > + /* transaction ID (we're starting to stream, so must be valid) */ > + pq_sendint32(out, xid); > +} > > In comments, 'starting to stream' is mentioned whereas this function > is to stop it. Fixed > 8. > +void > +logicalrep_write_stream_stop(StringInfo out, TransactionId xid) > +{ > + pq_sendbyte(out, 'E'); /* action STREAM END */ > + > + Assert(TransactionIdIsValid(xid)); > + > + /* transaction ID (we're starting to stream, so must be valid) */ > + pq_sendint32(out, xid); > +} > + > +TransactionId > +logicalrep_read_stream_stop(StringInfo in) > +{ > + TransactionId xid; > + > + xid = pq_getmsgint(in, 4); > + > + return xid; > +} > > Is there a reason to send xid on stopping stream? I don't see any use > of function logicalrep_read_stream_stop. Removed > 9. > + * XXX Add calls to pgstat_report_wait_start/pgstat_report_wait_end. > + */ > +static void > +subxact_info_write(Oid subid, TransactionId xid) > { > .. > + pgstat_report_wait_start(WAIT_EVENT_LOGICAL_SUBXACT_WRITE); > .. > + pgstat_report_wait_end(); > .. > } > > I see the calls to pgstat_report_wait_start/pgstat_report_wait_end in > this function, so not sure if the above comment makes sense. Fixed > 10. > + * The files are placed in /tmp by default, and the filenames include both > + * the XID of the toplevel transaction and OID of the subscription. > > Are we keeping files in /tmp or pg's temp tablespace dir. Seeing > below code, it doesn't seem that we place them in /tmp. If I am > correct, then can you update the comment. > +static void > +subxact_filename(char *path, Oid subid, TransactionId xid) > +{ > + char tempdirpath[MAXPGPATH]; > + > + TempTablespacePath(tempdirpath, DEFAULTTABLESPACE_OID); Done > 11. > + * The change is serialied in a simple format, with length (not including > + * the length), action code (identifying the message type) and message > + * contents (without the subxact TransactionId value). > + * > .. 
> + */ > +static void > +stream_write_change(char action, StringInfo s) > > The part of the comment which says "with length (not including the > length) .." is not clear to me. What does "not including the length" > mean? Basically, it says that the 4 bytes which are used for storing then the length of total data doesn't include the 4 bytes. > 12. > + * TODO: Add missing_ok flag to specify in which cases it's OK not to > + * find the files, and when it's an error. > + */ > +static void > +stream_cleanup_files(Oid subid, TransactionId xid) > > I think we can implement this TODO. It is clear when this function is > called from apply_handle_stream_commit, the file must exist. We can > similarly analyze other callers of this API. Done > 13. > +apply_handle_stream_abort(StringInfo s) > { > .. > + /* FIXME optimize the search by bsearch on sorted data */ > + for (i = nsubxacts; i > 0; i--) > .. > > I am not sure how important this optimization is, so instead of FIXME, > it is better to keep it as a XXX comment. In the future, if we hit > any performance issue due to this, we can revisit our decision. Done -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Attachment
- v10-0002-Issue-individual-invalidations-with-wal_level-lo.patch
- v10-0003-Extend-the-output-plugin-API-with-stream-methods.patch
- v10-0005-Implement-streaming-mode-in-ReorderBuffer.patch
- v10-0004-Gracefully-handle-concurrent-aborts-of-uncommitt.patch
- v10-0008-Enable-streaming-for-all-subscription-TAP-tests.patch
- v10-0009-Add-TAP-test-for-streaming-vs.-DDL.patch
- v10-0006-Add-support-for-streaming-to-built-in-replicatio.patch
- v10-0007-Track-statistics-for-streaming.patch
- v10-0010-Bugfix-handling-of-incomplete-toast-tuple.patch
- v10-0001-Immediately-WAL-log-assignments.patch
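To make comment 6 above concrete, a sketch of what the stream-start read/write functions might look like once the boolean is sent as a single byte instead of a 4-byte integer. This is an illustration reconstructed from the code quoted in the review, not necessarily the exact code that ended up in v10:

    #include "postgres.h"
    #include "access/transam.h"
    #include "libpq/pqformat.h"

    void
    logicalrep_write_stream_start(StringInfo out,
                                  TransactionId xid, bool first_segment)
    {
        pq_sendbyte(out, 'S');      /* action STREAM START */

        Assert(TransactionIdIsValid(xid));

        /* transaction ID (we're starting to stream, so must be valid) */
        pq_sendint32(out, xid);

        /* a single byte: 1 if this is the first streaming segment for this xid */
        pq_sendbyte(out, first_segment ? 1 : 0);
    }

    TransactionId
    logicalrep_read_stream_start(StringInfo in, bool *first_segment)
    {
        TransactionId xid;

        Assert(first_segment);

        xid = pq_getmsgint(in, 4);
        *first_segment = (pq_getmsgbyte(in) == 1);

        return xid;
    }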
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Dilip Kumar
Date:
On Tue, Feb 11, 2020 at 8:42 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: The patch set was not applying on the head so I have rebased it. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Attachment
- v11-0001-Immediately-WAL-log-assignments.patch
- v11-0004-Gracefully-handle-concurrent-aborts-of-uncommitt.patch
- v11-0002-Issue-individual-invalidations-with-wal_level-lo.patch
- v11-0003-Extend-the-output-plugin-API-with-stream-methods.patch
- v11-0005-Implement-streaming-mode-in-ReorderBuffer.patch
- v11-0007-Track-statistics-for-streaming.patch
- v11-0006-Add-support-for-streaming-to-built-in-replicatio.patch
- v11-0008-Enable-streaming-for-all-subscription-TAP-tests.patch
- v11-0009-Add-TAP-test-for-streaming-vs.-DDL.patch
- v11-0010-Bugfix-handling-of-incomplete-toast-tuple.patch
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Dilip Kumar
Date:
On Thu, Feb 13, 2020 at 8:42 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Tue, Feb 11, 2020 at 8:42 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > The patch set was not applying on the head so I have rebased it. I have changed patch 0002 so that instead of logging the WAL for each invalidation, we now log at each command end as discussed upthread [1]. Soon we will evaluate the performance of this change and post the results. [1] https://www.postgresql.org/message-id/CAA4eK1LOa%2B2KqNX%3Dm%3D1qMBDW%2Bo50AuwjAOX6ZqL-rWGiH1F9MQ%40mail.gmail.com -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Attachment
- v12-0001-Immediately-WAL-log-assignments.patch
- v12-0002-Issue-individual-invalidations-with-wal_level-lo.patch
- v12-0005-Implement-streaming-mode-in-ReorderBuffer.patch
- v12-0003-Extend-the-output-plugin-API-with-stream-methods.patch
- v12-0004-Gracefully-handle-concurrent-aborts-of-uncommitt.patch
- v12-0007-Track-statistics-for-streaming.patch
- v12-0006-Add-support-for-streaming-to-built-in-replicatio.patch
- v12-0008-Enable-streaming-for-all-subscription-TAP-tests.patch
- v12-0009-Add-TAP-test-for-streaming-vs.-DDL.patch
- v12-0010-Bugfix-handling-of-incomplete-toast-tuple.patch
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Tomas Vondra
Date:
Hi, I started looking at this patch series again, hoping to get it moving for PG13. There's been a tremendous amount of work done since I last worked on it, and a lot was discussed on this thread, so it'll take a while to get familiar with the new code ... The first thing I realized that WAL-logging of assignments in v12 does both the "old" logging (using dedicated message) and "new" with toplevel-XID embedded in the first message. Yes, the patch was wrong, because it eliminated all calls to ProcArrayApplyXidAssignment() and so it was trivial to crash the replica due to KnownAssignedXids overflow. But I don't think re-introducing XLOG_XACT_ASSIGNMENT message is the right fix. I actually proposed doing this (having both ways to log assignments) so that there's no regression risk with (wal_level < logical). But IIRC Andres objected to it, argumenting that we should not log the same piece of information in two very different ways at the same time (IIRC it was discussed on the FOSDEM dev meeting, so I don't have a link to share). And I do agree with him ... The question is, why couldn't the replica use the same assignment info we already write for logical decoding? The main challenge is that now the assignment can be sent in many different xlog messages, from a bunch of resource managers (essentially, any xlog message with a xid can have embedded XID of the toplevel xact). So the handling would either need to happen in every rmgr, or we need to move it before we call the rmgr. For exampple, we might do this e.g. in StartupXLOG() I think, per the attached patch (FWIW this particular fix was written by Masahiko Sawada, not me). This does the trick for me - I'm no longer able to reproduce the KnownAssignedXids overflow. The one difference is that we used to call ProcArrayApplyXidAssignment for larger groups of XIDs, as sent in the assignment message. Now we call it for each individual assignment. I don't know if this is an issue, but I suppose we might introduce some sort of local caching (accumulate the assignments into a local array, call the function only when we have enough of them). Aside from that, I think there's a minor bug in xact.c - the patch adds a "assigned" field to TransactionStateData, but then it fails to add a default value into TopTransactionStateData. We probably interpret NULL as false, but then there's nothing for the pointer. I suspect it might leave some random garbage there, leading to strange things later. Another thing I noticed is LogicalDecodingProcessRecord() extracts the toplevel XID using a macro txid = XLogRecGetTopXid(record); but then it just starts accessing the fields directly again in the ReorderBufferAssignChild call. I think we should do this instead: ReorderBufferAssignChild(ctx->reorder, txid, XLogRecGetXid(record), buf.origptr); regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
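For illustration, the approach described above (applying the embedded top-level XID in the redo loop, before the record is handed to its resource manager) might look roughly like the following. XLogRecGetTopXid() is the macro added by the patch series; the helper name and its exact placement in StartupXLOG() are assumptions, and whether this per-record approach is acceptable is exactly what the rest of the thread debates:

    #include "postgres.h"
    #include "access/transam.h"
    #include "access/xlogreader.h"
    #include "storage/procarray.h"

    /*
     * Hypothetical helper, invoked from the redo loop right before the
     * rmgr redo routine.  XLogRecGetTopXid() comes from the patch.
     */
    static void
    ApplyEmbeddedXidAssignment(XLogReaderState *record)
    {
        TransactionId xid = XLogRecGetXid(record);
        TransactionId topxid = XLogRecGetTopXid(record);

        /* nothing to do unless the record carries a subxact assignment */
        if (!TransactionIdIsValid(topxid) || topxid == xid)
            return;

        /* remember that xid is a subtransaction of topxid */
        ProcArrayApplyXidAssignment(topxid, 1, &xid);
    }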
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Tomas Vondra
Date:
D'oh! As usual I forgot to actually attach the patch I mentioned. So here it is ... regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachment
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Dilip Kumar
Date:
On Wed, Mar 4, 2020 at 3:16 AM Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote: > > Hi, > > I started looking at this patch series again, hoping to get it moving > for PG13. Nice. There's been a tremendous amount of work done since I last > worked on it, and a lot was discussed on this thread, so it'll take a > while to get familiar with the new code ... > > The first thing I realized that WAL-logging of assignments in v12 does > both the "old" logging (using dedicated message) and "new" with > toplevel-XID embedded in the first message. Yes, the patch was wrong, > because it eliminated all calls to ProcArrayApplyXidAssignment() and so > it was trivial to crash the replica due to KnownAssignedXids overflow. > But I don't think re-introducing XLOG_XACT_ASSIGNMENT message is the > right fix. > > I actually proposed doing this (having both ways to log assignments) so > that there's no regression risk with (wal_level < logical). But IIRC > Andres objected to it, argumenting that we should not log the same piece > of information in two very different ways at the same time (IIRC it was > discussed on the FOSDEM dev meeting, so I don't have a link to share). > And I do agree with him ... > > The question is, why couldn't the replica use the same assignment info > we already write for logical decoding? The main challenge is that now > the assignment can be sent in many different xlog messages, from a bunch > of resource managers (essentially, any xlog message with a xid can have > embedded XID of the toplevel xact). So the handling would either need to > happen in every rmgr, or we need to move it before we call the rmgr. > > For exampple, we might do this e.g. in StartupXLOG() I think, per the > attached patch (FWIW this particular fix was written by Masahiko Sawada, > not me). This does the trick for me - I'm no longer able to reproduce > the KnownAssignedXids overflow. > > The one difference is that we used to call ProcArrayApplyXidAssignment > for larger groups of XIDs, as sent in the assignment message. Now we > call it for each individual assignment. I don't know if this is an > issue, but I suppose we might introduce some sort of local caching > (accumulate the assignments into a local array, call the function only > when we have enough of them). Thanks for the pointers, I will think over these points. > > Aside from that, I think there's a minor bug in xact.c - the patch adds > a "assigned" field to TransactionStateData, but then it fails to add a > default value into TopTransactionStateData. We probably interpret NULL > as false, but then there's nothing for the pointer. I suspect it might > leave some random garbage there, leading to strange things later. Actually, we will never access that field for the TopTransactionStateData, right? See below code, we have a check that only if IsSubTransaction(), then we access the "assigned" filed. 
+bool +IsSubTransactionAssignmentPending(void) +{ + if (!XLogLogicalInfoActive()) + return false; + + /* we need to be in a transaction state */ + if (!IsTransactionState()) + return false; + + /* it has to be a subtransaction */ + if (!IsSubTransaction()) + return false; + + /* the subtransaction has to have a XID assigned */ + if (!TransactionIdIsValid(GetCurrentTransactionIdIfAny())) + return false; + + /* and it needs to have 'assigned' */ + return !CurrentTransactionState->assigned; + +} > > Another thing I noticed is LogicalDecodingProcessRecord() extracts the > toplevel XID using a macro > > txid = XLogRecGetTopXid(record); > > but then it just starts accessing the fields directly again in the > ReorderBufferAssignChild call. I think we should do this instead: > > ReorderBufferAssignChild(ctx->reorder, > txid, > XLogRecGetXid(record), > buf.origptr); Make sense. I will change this in the patch. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Amit Kapila
Date:
On Wed, Mar 4, 2020 at 3:16 AM Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote: > > Hi, > > I started looking at this patch series again, hoping to get it moving > for PG13. > It is good to keep moving this forward, but there are quite a few problems with the design which need a broader discussion. Some of what I recall are: a. Handling of abort of concurrent transactions. There is some code in the patch which might work, but there is not much discussion when it was posted. b. Handling of partial tuples (while streaming, we came to know that toast tuple is not complete or speculative insert is incomplete). For this also, we have proposed a few solutions which need further discussion. One of those is implemented in the patch series. c. We might also need some handling for replication origins. d. Try to minimize the performance overhead of WAL logging for invalidations. We discussed different solutions for this and implemented one of those. e. How to skip already streamed transactions. There might be a few more which I can't recall now. Apart from this, I haven't done any detailed review of subscriber-side implementation where we write streamed transactions to file. All of this will need much more discussion and review before we can say it is ready to commit, so I thought it might be better to pick it up for PG14 and focus on other things that have a better chance for PG13 especially because all the problems were not solved/discussed before last CF. However, it is a good idea to keep moving this and have a discussion on some of these issues. > There's been a tremendous amount of work done since I last > worked on it, and a lot was discussed on this thread, so it'll take a > while to get familiar with the new code ... > > The first thing I realized that WAL-logging of assignments in v12 does > both the "old" logging (using dedicated message) and "new" with > toplevel-XID embedded in the first message. Yes, the patch was wrong, > because it eliminated all calls to ProcArrayApplyXidAssignment() and so > it was trivial to crash the replica due to KnownAssignedXids overflow. > But I don't think re-introducing XLOG_XACT_ASSIGNMENT message is the > right fix. > > I actually proposed doing this (having both ways to log assignments) so > that there's no regression risk with (wal_level < logical). But IIRC > Andres objected to it, argumenting that we should not log the same piece > of information in two very different ways at the same time (IIRC it was > discussed on the FOSDEM dev meeting, so I don't have a link to share). > And I do agree with him ... > So, aren't we worried about the overhead of the amount of WAL and performance impact for the transactions? We might want to check the pgbench read-write test to see if that will add any significant overhead. > The question is, why couldn't the replica use the same assignment info > we already write for logical decoding? > I haven't thought about it in detail, but we can think on those lines if the performance overhead is in the acceptable range. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Amit Kapila
Date:
On Wed, Mar 4, 2020 at 10:28 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Wed, Mar 4, 2020 at 3:16 AM Tomas Vondra > <tomas.vondra@2ndquadrant.com> wrote: > > > > The first thing I realized that WAL-logging of assignments in v12 does > > both the "old" logging (using dedicated message) and "new" with > > toplevel-XID embedded in the first message. Yes, the patch was wrong, > > because it eliminated all calls to ProcArrayApplyXidAssignment() and so > > it was trivial to crash the replica due to KnownAssignedXids overflow. > > But I don't think re-introducing XLOG_XACT_ASSIGNMENT message is the > > right fix. > > > > I actually proposed doing this (having both ways to log assignments) so > > that there's no regression risk with (wal_level < logical). But IIRC > > Andres objected to it, argumenting that we should not log the same piece > > of information in two very different ways at the same time (IIRC it was > > discussed on the FOSDEM dev meeting, so I don't have a link to share). > > And I do agree with him ... > > > > So, aren't we worried about the overhead of the amount of WAL and > performance impact for the transactions? We might want to check the > pgbench read-write test to see if that will add any significant > overhead. > I have briefly looked at the original patch and it seems the additional overhead is only when subtransactions are involved, so ideally, it shouldn't impact default pgbench, but there is no harm in checking. It might be that we need to build a custom script with subtransactions involved to measure the impact, but I think it is worth checking -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Dilip Kumar
Date:
On Wed, Mar 4, 2020 at 2:33 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Wed, Mar 4, 2020 at 10:28 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > On Wed, Mar 4, 2020 at 3:16 AM Tomas Vondra > > <tomas.vondra@2ndquadrant.com> wrote: > > > > > > The first thing I realized that WAL-logging of assignments in v12 does > > > both the "old" logging (using dedicated message) and "new" with > > > toplevel-XID embedded in the first message. Yes, the patch was wrong, > > > because it eliminated all calls to ProcArrayApplyXidAssignment() and so > > > it was trivial to crash the replica due to KnownAssignedXids overflow. > > > But I don't think re-introducing XLOG_XACT_ASSIGNMENT message is the > > > right fix. > > > > > > I actually proposed doing this (having both ways to log assignments) so > > > that there's no regression risk with (wal_level < logical). But IIRC > > > Andres objected to it, argumenting that we should not log the same piece > > > of information in two very different ways at the same time (IIRC it was > > > discussed on the FOSDEM dev meeting, so I don't have a link to share). > > > And I do agree with him ... > > > > > > > So, aren't we worried about the overhead of the amount of WAL and > > performance impact for the transactions? We might want to check the > > pgbench read-write test to see if that will add any significant > > overhead. > > > > I have briefly looked at the original patch and it seems the > additional overhead is only when subtransactions are involved, so > ideally, it shouldn't impact default pgbench, but there is no harm in > checking. It might be that we need to build a custom script with > subtransactions involved to measure the impact, but I think it is > worth checking I agree. I will test the same and post the results. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Tomas Vondra
Date:
On Wed, Mar 04, 2020 at 10:28:32AM +0530, Amit Kapila wrote: >On Wed, Mar 4, 2020 at 3:16 AM Tomas Vondra ><tomas.vondra@2ndquadrant.com> wrote: >> >> Hi, >> >> I started looking at this patch series again, hoping to get it moving >> for PG13. >> > >It is good to keep moving this forward, but there are quite a few >problems with the design which need a broader discussion. Some of >what I recall are: >a. Handling of abort of concurrent transactions. There is some code >in the patch which might work, but there is not much discussion when >it was posted. >b. Handling of partial tuples (while streaming, we came to know that >toast tuple is not complete or speculative insert is incomplete). For >this also, we have proposed a few solutions which need further >discussion. One of those is implemented in the patch series. >c. We might also need some handling for replication origins. >d. Try to minimize the performance overhead of WAL logging for >invalidations. We discussed different solutions for this and >implemented one of those. >e. How to skip already streamed transactions. > >There might be a few more which I can't recall now. Apart from this, >I haven't done any detailed review of subscriber-side implementation >where we write streamed transactions to file. All of this will need >much more discussion and review before we can say it is ready to >commit, so I thought it might be better to pick it up for PG14 and >focus on other things that have a better chance for PG13 especially >because all the problems were not solved/discussed before last CF. >However, it is a good idea to keep moving this and have a discussion >on some of these issues. > Sure, there's a lot to discuss. And it's possible (likely) it's not feasible to get this into PG13. But I think it's still worth discussing it, instead of just punting it into the next CF right away. >> There's been a tremendous amount of work done since I last >> worked on it, and a lot was discussed on this thread, so it'll take a >> while to get familiar with the new code ... >> >> The first thing I realized that WAL-logging of assignments in v12 does >> both the "old" logging (using dedicated message) and "new" with >> toplevel-XID embedded in the first message. Yes, the patch was wrong, >> because it eliminated all calls to ProcArrayApplyXidAssignment() and so >> it was trivial to crash the replica due to KnownAssignedXids overflow. >> But I don't think re-introducing XLOG_XACT_ASSIGNMENT message is the >> right fix. >> >> I actually proposed doing this (having both ways to log assignments) so >> that there's no regression risk with (wal_level < logical). But IIRC >> Andres objected to it, argumenting that we should not log the same piece >> of information in two very different ways at the same time (IIRC it was >> discussed on the FOSDEM dev meeting, so I don't have a link to share). >> And I do agree with him ... >> > >So, aren't we worried about the overhead of the amount of WAL and >performance impact for the transactions? We might want to check the >pgbench read-write test to see if that will add any significant >overhead. > Well, sure. I agree we need to see how this affects performance, and I'll do some benchmarks (I think I did that when submitting the patch, but I don't recall the numbers / details). Isn't it a bit strange to log stuff twice, though, if we worry about performance? Surely that's more expensive than logging it just once. Of course, it might be useful if most systems need just the "old" way. 
I know it's going to be a bit hand-wavy, but I think embedding the assignments into existing WAL messages is about the cheapest way to log this. I would not expect this to be measurably more expensive than what we have now, but I might be wrong. >> The question is, why couldn't the replica use the same assignment info >> we already write for logical decoding? >> > >I haven't thought about it in detail, but we can think on those lines >if the performance overhead is in the acceptable range. > OK, let me do some measurements ... regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Tomas Vondra
Date:
On Wed, Mar 04, 2020 at 09:13:49AM +0530, Dilip Kumar wrote: >On Wed, Mar 4, 2020 at 3:16 AM Tomas Vondra ><tomas.vondra@2ndquadrant.com> wrote: >> >> Hi, >> >> I started looking at this patch series again, hoping to get it moving >> for PG13. > >Nice. > > There's been a tremendous amount of work done since I last >> worked on it, and a lot was discussed on this thread, so it'll take a >> while to get familiar with the new code ... >> >> The first thing I realized that WAL-logging of assignments in v12 does >> both the "old" logging (using dedicated message) and "new" with >> toplevel-XID embedded in the first message. Yes, the patch was wrong, >> because it eliminated all calls to ProcArrayApplyXidAssignment() and so >> it was trivial to crash the replica due to KnownAssignedXids overflow. >> But I don't think re-introducing XLOG_XACT_ASSIGNMENT message is the >> right fix. >> >> I actually proposed doing this (having both ways to log assignments) so >> that there's no regression risk with (wal_level < logical). But IIRC >> Andres objected to it, argumenting that we should not log the same piece >> of information in two very different ways at the same time (IIRC it was >> discussed on the FOSDEM dev meeting, so I don't have a link to share). >> And I do agree with him ... >> >> The question is, why couldn't the replica use the same assignment info >> we already write for logical decoding? The main challenge is that now >> the assignment can be sent in many different xlog messages, from a bunch >> of resource managers (essentially, any xlog message with a xid can have >> embedded XID of the toplevel xact). So the handling would either need to >> happen in every rmgr, or we need to move it before we call the rmgr. >> >> For exampple, we might do this e.g. in StartupXLOG() I think, per the >> attached patch (FWIW this particular fix was written by Masahiko Sawada, >> not me). This does the trick for me - I'm no longer able to reproduce >> the KnownAssignedXids overflow. >> >> The one difference is that we used to call ProcArrayApplyXidAssignment >> for larger groups of XIDs, as sent in the assignment message. Now we >> call it for each individual assignment. I don't know if this is an >> issue, but I suppose we might introduce some sort of local caching >> (accumulate the assignments into a local array, call the function only >> when we have enough of them). > >Thanks for the pointers, I will think over these points. > >> >> Aside from that, I think there's a minor bug in xact.c - the patch adds >> a "assigned" field to TransactionStateData, but then it fails to add a >> default value into TopTransactionStateData. We probably interpret NULL >> as false, but then there's nothing for the pointer. I suspect it might >> leave some random garbage there, leading to strange things later. > >Actually, we will never access that field for the >TopTransactionStateData, right? >See below code, we have a check that only if IsSubTransaction(), then >we access the "assigned" filed. 
> >+bool >+IsSubTransactionAssignmentPending(void) >+{ >+ if (!XLogLogicalInfoActive()) >+ return false; >+ >+ /* we need to be in a transaction state */ >+ if (!IsTransactionState()) >+ return false; >+ >+ /* it has to be a subtransaction */ >+ if (!IsSubTransaction()) >+ return false; >+ >+ /* the subtransaction has to have a XID assigned */ >+ if (!TransactionIdIsValid(GetCurrentTransactionIdIfAny())) >+ return false; >+ >+ /* and it needs to have 'assigned' */ >+ return !CurrentTransactionState->assigned; >+ >+} > The problem is not with the "assigned" field, really. AFAICS we probably initialize it to false because we interpret NULL as false. My concern was that we essentially leave the last pointer not initialized. That seems like a bug, not sure if it breaks something in practice. >> >> Another thing I noticed is LogicalDecodingProcessRecord() extracts the >> toplevel XID using a macro >> >> txid = XLogRecGetTopXid(record); >> >> but then it just starts accessing the fields directly again in the >> ReorderBufferAssignChild call. I think we should do this instead: >> >> ReorderBufferAssignChild(ctx->reorder, >> txid, >> XLogRecGetXid(record), >> buf.origptr); > >Make sense. I will change this in the patch. > +1, thanks regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Amit Kapila
Date:
On Thu, Mar 5, 2020 at 11:20 PM Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote: > > On Wed, Mar 04, 2020 at 10:28:32AM +0530, Amit Kapila wrote: > > > > Sure, there's a lot to discuss. And it's possible (likely) it's not > feasible to get this into PG13. But I think it's still worth discussing > it, instead of just punting it into the next CF right away. > That makes sense to me. > >> There's been a tremendous amount of work done since I last > >> worked on it, and a lot was discussed on this thread, so it'll take a > >> while to get familiar with the new code ... > >> > >> The first thing I realized that WAL-logging of assignments in v12 does > >> both the "old" logging (using dedicated message) and "new" with > >> toplevel-XID embedded in the first message. Yes, the patch was wrong, > >> because it eliminated all calls to ProcArrayApplyXidAssignment() and so > >> it was trivial to crash the replica due to KnownAssignedXids overflow. > >> But I don't think re-introducing XLOG_XACT_ASSIGNMENT message is the > >> right fix. > >> > >> I actually proposed doing this (having both ways to log assignments) so > >> that there's no regression risk with (wal_level < logical). But IIRC > >> Andres objected to it, argumenting that we should not log the same piece > >> of information in two very different ways at the same time (IIRC it was > >> discussed on the FOSDEM dev meeting, so I don't have a link to share). > >> And I do agree with him ... > >> > > > >So, aren't we worried about the overhead of the amount of WAL and > >performance impact for the transactions? We might want to check the > >pgbench read-write test to see if that will add any significant > >overhead. > > > > Well, sure. I agree we need to see how this affects performance, and > I'll do some benchmarks (I think I did that when submitting the patch, > but I don't recall the numbers / details). > > Isn't it a bit strange to log stuff twice, though, if we worry about > performance? Surely that's more expensive than logging it just once. Of > course, it might be useful if most systems need just the "old" way. > > I know it's going to be a bit hand-wavy, but I think embedding the > assignments into existing WAL messages is about the cheapest way to log > this. I would not expect this to be mesurably more expensive than what > we have now, but I might be wrong. > I agree that this shouldn't be much expensive, but it is better to be sure in that regard. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Amit Kapila
Date:
On Wed, Mar 4, 2020 at 9:14 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Wed, Mar 4, 2020 at 3:16 AM Tomas Vondra > <tomas.vondra@2ndquadrant.com> wrote: > > > > > > The first thing I realized that WAL-logging of assignments in v12 does > > both the "old" logging (using dedicated message) and "new" with > > toplevel-XID embedded in the first message. Yes, the patch was wrong, > > because it eliminated all calls to ProcArrayApplyXidAssignment() and so > > it was trivial to crash the replica due to KnownAssignedXids overflow. > > But I don't think re-introducing XLOG_XACT_ASSIGNMENT message is the > > right fix. > > > > I actually proposed doing this (having both ways to log assignments) so > > that there's no regression risk with (wal_level < logical). But IIRC > > Andres objected to it, argumenting that we should not log the same piece > > of information in two very different ways at the same time (IIRC it was > > discussed on the FOSDEM dev meeting, so I don't have a link to share). > > And I do agree with him ... > > > > The question is, why couldn't the replica use the same assignment info > > we already write for logical decoding? The main challenge is that now > > the assignment can be sent in many different xlog messages, from a bunch > > of resource managers (essentially, any xlog message with a xid can have > > embedded XID of the toplevel xact). So the handling would either need to > > happen in every rmgr, or we need to move it before we call the rmgr. > > > > For exampple, we might do this e.g. in StartupXLOG() I think, per the > > attached patch (FWIW this particular fix was written by Masahiko Sawada, > > not me). This does the trick for me - I'm no longer able to reproduce > > the KnownAssignedXids overflow. > > > > The one difference is that we used to call ProcArrayApplyXidAssignment > > for larger groups of XIDs, as sent in the assignment message. Now we > > call it for each individual assignment. I don't know if this is an > > issue, but I suppose we might introduce some sort of local caching > > (accumulate the assignments into a local array, call the function only > > when we have enough of them). > > Thanks for the pointers, I will think over these points. > I have looked at the solution proposed and I would like to share my findings. I think calling ProcArrayApplyXidAssignment for each subtransaction is not a good idea for a couple of reasons: (a) It will just beat the purpose of maintaining KnowAssignedXids array which is to avoid looking at pg_subtrans in TransactionIdIsInProgress() on standby. Basically, if we remove it for each subXid, it will consider the KnowAssignedXids to be overflowed and check pg_subtrans frequently. (b) Calling ProcArrayApplyXidAssignment() for each subtransaction can be costly from the perspective of concurrency because it acquires ProcArrayLock in Exclusive mode, so concurrently running transactions might start blocking at this lock. Also, I see that SubTransSetParent() makes the page dirty, so it might lead to more writes if we spread out setting that by calling it separately for each sub-transaction. Apart from this, I don't see how the proposed fix is correct because as far as I can see it tries to remove the Xid before we even record it via RecordKnownAssignedTransactionIds(). It seems after patch RecordKnownAssignedTransactionIds() will be called after ProcArrayApplyXidAssignment(), how could that be correct. Thoughts? -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Dilip Kumar
Date:
On Sat, Mar 28, 2020 at 11:56 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Wed, Mar 4, 2020 at 9:14 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > On Wed, Mar 4, 2020 at 3:16 AM Tomas Vondra > > <tomas.vondra@2ndquadrant.com> wrote: > > > > > > > > > The first thing I realized that WAL-logging of assignments in v12 does > > > both the "old" logging (using dedicated message) and "new" with > > > toplevel-XID embedded in the first message. Yes, the patch was wrong, > > > because it eliminated all calls to ProcArrayApplyXidAssignment() and so > > > it was trivial to crash the replica due to KnownAssignedXids overflow. > > > But I don't think re-introducing XLOG_XACT_ASSIGNMENT message is the > > > right fix. > > > > > > I actually proposed doing this (having both ways to log assignments) so > > > that there's no regression risk with (wal_level < logical). But IIRC > > > Andres objected to it, argumenting that we should not log the same piece > > > of information in two very different ways at the same time (IIRC it was > > > discussed on the FOSDEM dev meeting, so I don't have a link to share). > > > And I do agree with him ... > > > > > > The question is, why couldn't the replica use the same assignment info > > > we already write for logical decoding? The main challenge is that now > > > the assignment can be sent in many different xlog messages, from a bunch > > > of resource managers (essentially, any xlog message with a xid can have > > > embedded XID of the toplevel xact). So the handling would either need to > > > happen in every rmgr, or we need to move it before we call the rmgr. > > > > > > For exampple, we might do this e.g. in StartupXLOG() I think, per the > > > attached patch (FWIW this particular fix was written by Masahiko Sawada, > > > not me). This does the trick for me - I'm no longer able to reproduce > > > the KnownAssignedXids overflow. > > > > > > The one difference is that we used to call ProcArrayApplyXidAssignment > > > for larger groups of XIDs, as sent in the assignment message. Now we > > > call it for each individual assignment. I don't know if this is an > > > issue, but I suppose we might introduce some sort of local caching > > > (accumulate the assignments into a local array, call the function only > > > when we have enough of them). > > > > Thanks for the pointers, I will think over these points. > > > > I have looked at the solution proposed and I would like to share my > findings. I think calling ProcArrayApplyXidAssignment for each > subtransaction is not a good idea for a couple of reasons: > (a) It will just beat the purpose of maintaining KnowAssignedXids > array which is to avoid looking at pg_subtrans in > TransactionIdIsInProgress() on standby. Basically, if we remove it > for each subXid, it will consider the KnowAssignedXids to be > overflowed and check pg_subtrans frequently. Right, I also think this is a problem with this solution. I think we may try to avoid this by caching this information. But, then we will have to maintain this in some dimensional array which stores sub-transaction ids per top transaction or we can maintain a list of sub-transaction for each transaction. I haven't thought about how much complexity this solution will add. > (b) Calling ProcArrayApplyXidAssignment() for each subtransaction can > be costly from the perspective of concurrency because it acquires > ProcArrayLock in Exclusive mode, so concurrently running transactions > might start blocking at this lock. 
Right > Also, I see that > SubTransSetParent() makes the page dirty, so it might lead to more writes if we spread out setting that by calling it separately for each sub-transaction. Right. > > Apart from this, I don't see how the proposed fix is correct because > as far as I can see it tries to remove the Xid before we even record > it via RecordKnownAssignedTransactionIds(). It seems after patch > RecordKnownAssignedTransactionIds() will be called after > ProcArrayApplyXidAssignment(), how could that be correct. Valid point. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Amit Kapila
Date:
On Sat, Mar 28, 2020 at 2:19 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Sat, Mar 28, 2020 at 11:56 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > I have looked at the solution proposed and I would like to share my > > findings. I think calling ProcArrayApplyXidAssignment for each > > subtransaction is not a good idea for a couple of reasons: > > (a) It will just beat the purpose of maintaining KnowAssignedXids > > array which is to avoid looking at pg_subtrans in > > TransactionIdIsInProgress() on standby. Basically, if we remove it > > for each subXid, it will consider the KnowAssignedXids to be > > overflowed and check pg_subtrans frequently. > > Right, I also think this is a problem with this solution. I think we > may try to avoid this by caching this information. But, then we will > have to maintain this in some dimensional array which stores > sub-transaction ids per top transaction or we can maintain a list of > sub-transaction for each transaction. I haven't thought about how > much complexity this solution will add. > How about if instead of writing an XLOG_XACT_ASSIGNMENT WAL, we set a flag in TransactionStateData and then log that as special information whenever we write next WAL record for a new subtransaction? Then during recovery, we can only call ProcArrayApplyXidAssignment when we find that special flag is set in a WAL record. One idea could be to use a flag bit in XLogRecord.xl_info. If that is feasible then the solution can work as it is now, without any overhead or change in the way we maintain KnownAssignedXids. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Tomas Vondra
Date:
On Sat, Mar 28, 2020 at 03:29:34PM +0530, Amit Kapila wrote: >On Sat, Mar 28, 2020 at 2:19 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: >> >> On Sat, Mar 28, 2020 at 11:56 AM Amit Kapila <amit.kapila16@gmail.com> wrote: >> > >> > >> > I have looked at the solution proposed and I would like to share my >> > findings. I think calling ProcArrayApplyXidAssignment for each >> > subtransaction is not a good idea for a couple of reasons: >> > (a) It will just beat the purpose of maintaining KnowAssignedXids >> > array which is to avoid looking at pg_subtrans in >> > TransactionIdIsInProgress() on standby. Basically, if we remove it >> > for each subXid, it will consider the KnowAssignedXids to be >> > overflowed and check pg_subtrans frequently. >> >> Right, I also think this is a problem with this solution. I think we >> may try to avoid this by caching this information. But, then we will >> have to maintain this in some dimensional array which stores >> sub-transaction ids per top transaction or we can maintain a list of >> sub-transaction for each transaction. I haven't thought about how >> much complexity this solution will add. >> > >How about if instead of writing an XLOG_XACT_ASSIGNMENT WAL, we set a >flag in TransactionStateData and then log that as special information >whenever we write next WAL record for a new subtransaction? Then >during recovery, we can only call ProcArrayApplyXidAssignment when we >find that special flag is set in a WAL record. One idea could be to >use a flag bit in XLogRecord.xl_info. If that is feasible then the >solution can work as it is now, without any overhead or change in the >way we maintain KnownAssignedXids. > Ummm, how is that different from what the patch is doing now? I mean, we only write the top-level XID for the first WAL record in each subxact, right? Or what would be the difference with your approach? Anyway, I think you're right the ProcArrayApplyXidAssignment call was done too early, but I think that can be fixed by moving it until after the RecordKnownAssignedTransactionIds call, no? Essentially, right before rm_redo(). You're right calling ProcArrayApplyXidAssignment() may be an issue, because it exclusively acquires the ProcArrayLock. I've actually hinted that might be an issue in my original message, suggesting we might add a local cache of assigned XIDs (a small static array, doing essentially the same thing we used to do on the upstream node). I haven't done that in my WIP patch to keep it simple, but AFACS it'd work. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
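To make the "local cache of assigned XIDs" idea a bit more concrete, a small static array on the standby could batch the assignments and only call ProcArrayApplyXidAssignment (and thus take ProcArrayLock) once per batch, roughly as sketched below. All names are hypothetical, and flushing the cache at commit/abort of the cached top-level transaction is omitted:

    #include "postgres.h"
    #include "access/transam.h"
    #include "storage/proc.h"
    #include "storage/procarray.h"

    /* hypothetical batch size, mirroring the primary-side subxid cache */
    #define XID_ASSIGNMENT_CACHE_SIZE   PGPROC_MAX_CACHED_SUBXIDS

    static TransactionId cached_topxid = InvalidTransactionId;
    static TransactionId cached_subxids[XID_ASSIGNMENT_CACHE_SIZE];
    static int  ncached_subxids = 0;

    /* flush whatever assignments we have accumulated so far */
    static void
    FlushXidAssignmentCache(void)
    {
        if (ncached_subxids > 0)
            ProcArrayApplyXidAssignment(cached_topxid, ncached_subxids,
                                        cached_subxids);
        ncached_subxids = 0;
        cached_topxid = InvalidTransactionId;
    }

    /* called for each embedded assignment found during replay */
    static void
    CacheXidAssignment(TransactionId topxid, TransactionId subxid)
    {
        /* a different top-level xact, or a full cache, forces a flush */
        if (cached_topxid != topxid ||
            ncached_subxids >= XID_ASSIGNMENT_CACHE_SIZE)
            FlushXidAssignmentCache();

        cached_topxid = topxid;
        cached_subxids[ncached_subxids++] = subxid;
    }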
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Amit Kapila
Date:
On Sun, Mar 29, 2020 at 6:29 AM Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote: > > On Sat, Mar 28, 2020 at 03:29:34PM +0530, Amit Kapila wrote: > >On Sat, Mar 28, 2020 at 2:19 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > >How about if instead of writing an XLOG_XACT_ASSIGNMENT WAL, we set a > >flag in TransactionStateData and then log that as special information > >whenever we write next WAL record for a new subtransaction? Then > >during recovery, we can only call ProcArrayApplyXidAssignment when we > >find that special flag is set in a WAL record. One idea could be to > >use a flag bit in XLogRecord.xl_info. If that is feasible then the > >solution can work as it is now, without any overhead or change in the > >way we maintain KnownAssignedXids. > > > > Ummm, how is that different from what the patch is doing now? I mean, we > only write the top-level XID for the first WAL record in each subxact, > right? Or what would be the difference with your approach? > We have to do what the patch is currently doing and additionally, we will set this flag after PGPROC_MAX_CACHED_SUBXIDS which would allow us to call ProcArrayApplyXidAssignment during WAL replay only after PGPROC_MAX_CACHED_SUBXIDS number of subxacts. It will help us in clearing the KnownAssignedXids at the same time as we do now, so no additional performance overhead. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Tomas Vondra
Date:
On Sun, Mar 29, 2020 at 11:19:21AM +0530, Amit Kapila wrote: >On Sun, Mar 29, 2020 at 6:29 AM Tomas Vondra ><tomas.vondra@2ndquadrant.com> wrote: >> >> On Sat, Mar 28, 2020 at 03:29:34PM +0530, Amit Kapila wrote: >> >On Sat, Mar 28, 2020 at 2:19 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: >> > >> >How about if instead of writing an XLOG_XACT_ASSIGNMENT WAL, we set a >> >flag in TransactionStateData and then log that as special information >> >whenever we write next WAL record for a new subtransaction? Then >> >during recovery, we can only call ProcArrayApplyXidAssignment when we >> >find that special flag is set in a WAL record. One idea could be to >> >use a flag bit in XLogRecord.xl_info. If that is feasible then the >> >solution can work as it is now, without any overhead or change in the >> >way we maintain KnownAssignedXids. >> > >> >> Ummm, how is that different from what the patch is doing now? I mean, we >> only write the top-level XID for the first WAL record in each subxact, >> right? Or what would be the difference with your approach? >> > >We have to do what the patch is currently doing and additionally, we >will set this flag after PGPROC_MAX_CACHED_SUBXIDS which would allow >us to call ProcArrayApplyXidAssignment during WAL replay only after >PGPROC_MAX_CACHED_SUBXIDS number of subxacts. It will help us in >clearing the KnownAssignedXids at the same time as we do now, so no >additional performance overhead. > Hmmm. So we'd still log assignment twice? Or would we keep just the immediate assignments (embedded into xlog records), and cache the subxids on the replica somehow? regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Amit Kapila
Date:
On Sun, Mar 29, 2020 at 9:01 PM Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote: > > On Sun, Mar 29, 2020 at 11:19:21AM +0530, Amit Kapila wrote: > >On Sun, Mar 29, 2020 at 6:29 AM Tomas Vondra > ><tomas.vondra@2ndquadrant.com> wrote: > >> > >> Ummm, how is that different from what the patch is doing now? I mean, we > >> only write the top-level XID for the first WAL record in each subxact, > >> right? Or what would be the difference with your approach? > >> > > > >We have to do what the patch is currently doing and additionally, we > >will set this flag after PGPROC_MAX_CACHED_SUBXIDS which would allow > >us to call ProcArrayApplyXidAssignment during WAL replay only after > >PGPROC_MAX_CACHED_SUBXIDS number of subxacts. It will help us in > >clearing the KnownAssignedXids at the same time as we do now, so no > >additional performance overhead. > > > > Hmmm. So we'd still log assignment twice? Or would we keep just the > immediate assignments (embedded into xlog records), and cache the > subxids on the replica somehow? > I think we need to cache the subxids on the replica somehow but I don't have a very good idea for it. Basically, there are two ways to do it (a) Change the KnownAssignedXids in some way so that we can easily find this information without losing on the current benefits of it. I can't think of a good way to do that and even if we come up with something, it could easily be a lot of work, (b) Cache the subxids for a particular transaction in local memory along with KnownAssignedXids. This is doable but now we have two data-structures (one in shared memory and other in local memory) managing the same information in different ways. Do you have any other ideas? -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Tomas Vondra
Date:
On Mon, Mar 30, 2020 at 11:47:57AM +0530, Amit Kapila wrote: >On Sun, Mar 29, 2020 at 9:01 PM Tomas Vondra ><tomas.vondra@2ndquadrant.com> wrote: >> >> On Sun, Mar 29, 2020 at 11:19:21AM +0530, Amit Kapila wrote: >> >On Sun, Mar 29, 2020 at 6:29 AM Tomas Vondra >> ><tomas.vondra@2ndquadrant.com> wrote: >> >> >> >> Ummm, how is that different from what the patch is doing now? I mean, we >> >> only write the top-level XID for the first WAL record in each subxact, >> >> right? Or what would be the difference with your approach? >> >> >> > >> >We have to do what the patch is currently doing and additionally, we >> >will set this flag after PGPROC_MAX_CACHED_SUBXIDS which would allow >> >us to call ProcArrayApplyXidAssignment during WAL replay only after >> >PGPROC_MAX_CACHED_SUBXIDS number of subxacts. It will help us in >> >clearing the KnownAssignedXids at the same time as we do now, so no >> >additional performance overhead. >> > >> >> Hmmm. So we'd still log assignment twice? Or would we keep just the >> immediate assignments (embedded into xlog records), and cache the >> subxids on the replica somehow? >> > >I think we need to cache the subxids on the replica somehow but I >don't have a very good idea for it. Basically, there are two ways to >do it (a) Change the KnownAssignedXids in some way so that we can >easily find this information without losing on the current benefits of >it. I can't think of a good way to do that and even if we come up >with something, it could easily be a lot of work, (b) Cache the >subxids for a particular transaction in local memory along with >KnownAssignedXids. This is doable but now we have two data-structures >(one in shared memory and other in local memory) managing the same >information in different ways. > >Do you have any other ideas? I don't follow. Why couldn't we have a simple cache on the standby? It could be either a simple array or a hash table (with the top-level xid as hash key)? I think the single array would be sufficient, but the hash table would allow keeping the apply logic more or less as it's today. See the attached patch that adds such cache - I do admit I haven't tested this, but hopefully it's a sufficient illustration of the idea. It does not handle cleanup of the cache, but I think that should not be difficult - we simply need to remove entries for transactions that got committed or rolled back. And do something about transactions without an explicit commit/rollback record, but that can be done by also handling XLOG_RUNNING_XACTS (by removing anything preceding oldestRunningXid). I don't think this is particularly complicated or a lot of code, and I don't see why would it require data structures in shared memory. Only the walreceiver on standby needs to worry about this, no? regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
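A minimal standalone sketch of the kind of standby-side cache described above (an illustration of the idea, not the code in the attached patch) could keep one entry per top-level xid with a growable list of its subxids, and prune entries once the transaction ends or falls behind oldestRunningXid. All names and limits here are assumptions; a real implementation would presumably use the backend's dynahash and memory contexts instead.

#include <stdint.h>
#include <stdlib.h>

typedef uint32_t TransactionId;

#define MAX_TRACKED_XACTS 128       /* illustrative limit for the sketch */

typedef struct XidAssignmentEntry
{
    TransactionId  topxid;          /* key; a real version would hash on this */
    int            nsubxids;
    int            maxsubxids;
    TransactionId *subxids;
} XidAssignmentEntry;

static XidAssignmentEntry entries[MAX_TRACKED_XACTS];
static int nentries = 0;

/* Remember that 'subxid' belongs to top-level transaction 'topxid'. */
void
cache_assignment(TransactionId topxid, TransactionId subxid)
{
    XidAssignmentEntry *e = NULL;

    for (int i = 0; i < nentries; i++)
        if (entries[i].topxid == topxid)
            e = &entries[i];

    if (e == NULL)
    {
        if (nentries == MAX_TRACKED_XACTS)
            abort();                /* a real version would grow the table */
        e = &entries[nentries++];
        e->topxid = topxid;
        e->nsubxids = 0;
        e->maxsubxids = 8;
        e->subxids = malloc(e->maxsubxids * sizeof(TransactionId));
        if (e->subxids == NULL)
            abort();
    }

    if (e->nsubxids == e->maxsubxids)
    {
        TransactionId *grown;

        grown = realloc(e->subxids, 2 * e->maxsubxids * sizeof(TransactionId));
        if (grown == NULL)
            abort();
        e->subxids = grown;
        e->maxsubxids *= 2;
    }
    e->subxids[e->nsubxids++] = subxid;
}

/*
 * Drop entries we no longer care about: transactions that committed or
 * rolled back, and (via XLOG_RUNNING_XACTS) anything older than
 * oldestRunningXid.  The comparison below ignores xid wraparound, which
 * real code must handle with TransactionIdPrecedes().
 */
void
forget_assignments_before(TransactionId oldestRunningXid)
{
    int j = 0;

    for (int i = 0; i < nentries; i++)
    {
        if (entries[i].topxid < oldestRunningXid)
            free(entries[i].subxids);
        else
            entries[j++] = entries[i];
    }
    nentries = j;
}

The linear scan keeps the sketch short; with many concurrent top-level transactions, a hash table keyed by the top-level xid (as suggested above) avoids the O(n) lookup.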
Attachment
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Amit Kapila
Date:
On Mon, Mar 30, 2020 at 8:58 PM Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote: > > On Mon, Mar 30, 2020 at 11:47:57AM +0530, Amit Kapila wrote: > > > >I think we need to cache the subxids on the replica somehow but I > >don't have a very good idea for it. Basically, there are two ways to > >do it (a) Change the KnownAssignedXids in some way so that we can > >easily find this information without losing on the current benefits of > >it. I can't think of a good way to do that and even if we come up > >with something, it could easily be a lot of work, (b) Cache the > >subxids for a particular transaction in local memory along with > >KnownAssignedXids. This is doable but now we have two data-structures > >(one in shared memory and other in local memory) managing the same > >information in different ways. > > > >Do you have any other ideas? > > I don't follow. Why couldn't we have a simple cache on the standby? It > could be either a simple array or a hash table (with the top-level xid > as hash key)? > I think having something like we discussed or what you have in the patch won't be sufficient to clean the KnownAssignedXid array. The point is that we won't write a WAL for xid-subxid association for unlogged relations in the "Immediately WAL-log assignments" patch, however, the KnownAssignedXid would have both kinds of Xids as we autofill it with gaps (see RecordKnownAssignedTransactionIds). I think if my understanding is correct to make it work we might need major surgery in the code or have to maintain KnownAssignedXid array differently. > > I don't think this is particularly complicated or a lot of code, and I > don't see why would it require data structures in shared memory. Only > the walreceiver on standby needs to worry about this, no? > Not a new data structure in shared memory, but we already have a KnownTransactionId structure in shared memory. So, after having a local cache, we will have xidAssignmentsHash and KnownTransactionId maintaining the same information in different ways. And, we need to ensure both are cleaned up properly. That was what I was pointing above related to maintaining two structures. However, I think before discussing more on this, we need to think about the above problem. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Tomas Vondra
Date:
On Tue, Apr 07, 2020 at 12:17:44PM +0530, Amit Kapila wrote: >On Mon, Mar 30, 2020 at 8:58 PM Tomas Vondra ><tomas.vondra@2ndquadrant.com> wrote: >> >> On Mon, Mar 30, 2020 at 11:47:57AM +0530, Amit Kapila wrote: >> > >> >I think we need to cache the subxids on the replica somehow but I >> >don't have a very good idea for it. Basically, there are two ways to >> >do it (a) Change the KnownAssignedXids in some way so that we can >> >easily find this information without losing on the current benefits of >> >it. I can't think of a good way to do that and even if we come up >> >with something, it could easily be a lot of work, (b) Cache the >> >subxids for a particular transaction in local memory along with >> >KnownAssignedXids. This is doable but now we have two data-structures >> >(one in shared memory and other in local memory) managing the same >> >information in different ways. >> > >> >Do you have any other ideas? >> >> I don't follow. Why couldn't we have a simple cache on the standby? It >> could be either a simple array or a hash table (with the top-level xid >> as hash key)? >> > >I think having something like we discussed or what you have in the >patch won't be sufficient to clean the KnownAssignedXid array. The >point is that we won't write a WAL for xid-subxid association for >unlogged relations in the "Immediately WAL-log assignments" patch, >however, the KnownAssignedXid would have both kinds of Xids as we >autofill it with gaps (see RecordKnownAssignedTransactionIds). I >think if my understanding is correct to make it work we might need >major surgery in the code or have to maintain KnownAssignedXid array >differently. Hmm, that's a good point. If I understand correctly, the issue is that if we create new subxact, write something into an unlogged table, and then create new subxact, the XID of the first subxact will be "known assigned" but we won't know it's a subxact or to which parent xact it belongs (because there will be no WAL records that could encode it). I wonder if there's a simple solution (e.g. when creating the second subxact we might notice the xid-subxid assignment was not logged, and write some "dummy" WAL record). But I admit it seems a bit ugly. >> >> I don't think this is particularly complicated or a lot of code, and I >> don't see why would it require data structures in shared memory. Only >> the walreceiver on standby needs to worry about this, no? >> > >Not a new data structure in shared memory, but we already have a >KnownTransactionId structure in shared memory. So, after having a >local cache, we will have xidAssignmentsHash and KnownTransactionId >maintaining the same information in different ways. And, we need to >ensure both are cleaned up properly. That was what I was pointing >above related to maintaining two structures. However, I think before >discussing more on this, we need to think about the above problem. > Sure. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Dilip Kumar
Date:
On Wed, Apr 8, 2020 at 6:29 AM Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote: > > On Tue, Apr 07, 2020 at 12:17:44PM +0530, Amit Kapila wrote: > >On Mon, Mar 30, 2020 at 8:58 PM Tomas Vondra > ><tomas.vondra@2ndquadrant.com> wrote: > >> > >> On Mon, Mar 30, 2020 at 11:47:57AM +0530, Amit Kapila wrote: > >> > > >> >I think we need to cache the subxids on the replica somehow but I > >> >don't have a very good idea for it. Basically, there are two ways to > >> >do it (a) Change the KnownAssignedXids in some way so that we can > >> >easily find this information without losing on the current benefits of > >> >it. I can't think of a good way to do that and even if we come up > >> >with something, it could easily be a lot of work, (b) Cache the > >> >subxids for a particular transaction in local memory along with > >> >KnownAssignedXids. This is doable but now we have two data-structures > >> >(one in shared memory and other in local memory) managing the same > >> >information in different ways. > >> > > >> >Do you have any other ideas? > >> > >> I don't follow. Why couldn't we have a simple cache on the standby? It > >> could be either a simple array or a hash table (with the top-level xid > >> as hash key)? > >> > > > >I think having something like we discussed or what you have in the > >patch won't be sufficient to clean the KnownAssignedXid array. The > >point is that we won't write a WAL for xid-subxid association for > >unlogged relations in the "Immediately WAL-log assignments" patch, > >however, the KnownAssignedXid would have both kinds of Xids as we > >autofill it with gaps (see RecordKnownAssignedTransactionIds). I > >think if my understanding is correct to make it work we might need > >major surgery in the code or have to maintain KnownAssignedXid array > >differently. > > Hmm, that's a good point. If I understand correctly, the issue is > that if we create new subxact, write something into an unlogged table, > and then create new subxact, the XID of the first subxact will be "known > assigned" but we won't know it's a subxact or to which parent xact it > belongs (because there will be no WAL records that could encode it). > > I wonder if there's a simple solution (e.g. when creating the second > subxact we might notice the xid-subxid assignment was not logged, and > write some "dummy" WAL record). But I admit it seems a bit ugly. > > >> > >> I don't think this is particularly complicated or a lot of code, and I > >> don't see why would it require data structures in shared memory. Only > >> the walreceiver on standby needs to worry about this, no? > >> > > > >Not a new data structure in shared memory, but we already have a > >KnownTransactionId structure in shared memory. So, after having a > >local cache, we will have xidAssignmentsHash and KnownTransactionId > >maintaining the same information in different ways. And, we need to > >ensure both are cleaned up properly. That was what I was pointing > >above related to maintaining two structures. However, I think before > >discussing more on this, we need to think about the above problem. I have rebased the patch on the latest head. I haven't yet changed anything for xid assignment thing because it is not yet concluded. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Attachment
- v13-0001-Immediately-WAL-log-assignments.patch
- v13-0004-Gracefully-handle-concurrent-aborts-of-uncommitt.patch
- v13-0002-Issue-individual-invalidations-with-wal_level-lo.patch
- v13-0003-Extend-the-output-plugin-API-with-stream-methods.patch
- v13-0005-Implement-streaming-mode-in-ReorderBuffer.patch
- v13-0006-Add-support-for-streaming-to-built-in-replicatio.patch
- v13-0007-Track-statistics-for-streaming.patch
- v13-0009-Add-TAP-test-for-streaming-vs.-DDL.patch
- v13-0008-Enable-streaming-for-all-subscription-TAP-tests.patch
- v13-0010-Bugfix-handling-of-incomplete-toast-tuple.patch
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Kuntal Ghosh
Date:
On Thu, Apr 9, 2020 at 2:40 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > I have rebased the patch on the latest head. I haven't yet changed > anything for xid assignment thing because it is not yet concluded. > Some review comments from 0001-Immediately-WAL-log-*.patch, +bool +IsSubTransactionAssignmentPending(void) +{ + if (!XLogLogicalInfoActive()) + return false; + + /* we need to be in a transaction state */ + if (!IsTransactionState()) + return false; + + /* it has to be a subtransaction */ + if (!IsSubTransaction()) + return false; + + /* the subtransaction has to have a XID assigned */ + if (!TransactionIdIsValid(GetCurrentTransactionIdIfAny())) + return false; + + /* and it needs to have 'assigned' */ + return !CurrentTransactionState->assigned; + +} IMHO, it's important to reduce the complexity of this function since it's been called for every WAL insertion. During the lifespan of a transaction, any of these if conditions will only be evaluated if previous conditions are true. So, we can maintain some state machine to avoid multiple evaluation of a condition inside a transaction. But, if the overhead is not much, it's not worth I guess. +#define SizeOfTransactionId (sizeof(TransactionId) + sizeof(char)) This looks wrong. We should change the name of this Macro or we can add the 1 byte directly in HEADER_SCRATCH_SIZE and some comments. @@ -195,6 +197,10 @@ XLogResetInsertion(void) { int i; + /* reset the subxact assignment flag (if needed) */ + if (curinsert_flags & XLOG_INCLUDE_XID) + MarkSubTransactionAssigned(); The comment looks contradictory. XLogSetRecordFlags(uint8 flags) { Assert(begininsert_called); - curinsert_flags = flags; + curinsert_flags |= flags; } I didn't understand why we need this change in this patch. + txid = XLogRecGetTopXid(record); + + /* + * If the toplevel_xid is valid, we need to assign the subxact to the + * toplevel transaction. We need to do this for all records, hence we + * do it before the switch. + */ s/toplevel_xid/toplevel xid or s/toplevel_xid/txid if (SnapBuildCurrentState(builder) < SNAPBUILD_FULL_SNAPSHOT && - info != XLOG_XACT_ASSIGNMENT) + !TransactionIdIsValid(r->toplevel_xid)) Perhaps, XLogRecGetTopXid() can be used. -- Thanks & Regards, Kuntal Ghosh EnterpriseDB: http://www.enterprisedb.com
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Dilip Kumar
Date:
On Mon, Apr 13, 2020 at 4:14 PM Kuntal Ghosh <kuntalghosh.2007@gmail.com> wrote: > > On Thu, Apr 9, 2020 at 2:40 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > I have rebased the patch on the latest head. I haven't yet changed > > anything for xid assignment thing because it is not yet concluded. > > > Some review comments from 0001-Immediately-WAL-log-*.patch, > > +bool > +IsSubTransactionAssignmentPending(void) > +{ > + if (!XLogLogicalInfoActive()) > + return false; > + > + /* we need to be in a transaction state */ > + if (!IsTransactionState()) > + return false; > + > + /* it has to be a subtransaction */ > + if (!IsSubTransaction()) > + return false; > + > + /* the subtransaction has to have a XID assigned */ > + if (!TransactionIdIsValid(GetCurrentTransactionIdIfAny())) > + return false; > + > + /* and it needs to have 'assigned' */ > + return !CurrentTransactionState->assigned; > + > +} > IMHO, it's important to reduce the complexity of this function since > it's been called for every WAL insertion. During the lifespan of a > transaction, any of these if conditions will only be evaluated if > previous conditions are true. So, we can maintain some state machine > to avoid multiple evaluation of a condition inside a transaction. But, > if the overhead is not much, it's not worth I guess. Yeah maybe, in some cases we can avoid checking multiple conditions by maintaining that state. But, that state will have to be at the transaction level. But, I am not sure how much worth it will be to add one extra condition to skip a few if checks and it will also add the code complexity. And, in some cases where logical decoding is not enabled, it may add one extra check? I mean first check the state and that will take you to the first if check. > > +#define SizeOfTransactionId (sizeof(TransactionId) + sizeof(char)) > This looks wrong. We should change the name of this Macro or we can > add the 1 byte directly in HEADER_SCRATCH_SIZE and some comments. I think this is in sync with below code (SizeOfXlogOrigin), SO doen't make much sense to add different terminology no? #define SizeOfXlogOrigin (sizeof(RepOriginId) + sizeof(char)) +#define SizeOfTransactionId (sizeof(TransactionId) + sizeof(char)) > > @@ -195,6 +197,10 @@ XLogResetInsertion(void) > { > int i; > > + /* reset the subxact assignment flag (if needed) */ > + if (curinsert_flags & XLOG_INCLUDE_XID) > + MarkSubTransactionAssigned(); > The comment looks contradictory. > > XLogSetRecordFlags(uint8 flags) > { > Assert(begininsert_called); > - curinsert_flags = flags; > + curinsert_flags |= flags; > } > I didn't understand why we need this change in this patch. I think it's changed so that below code can use it, but we have directly set the flag. I think I will change in the next version. @@ -748,6 +754,18 @@ XLogRecordAssemble(RmgrId rmid, uint8 info, scratch += sizeof(replorigin_session_origin); } + /* followed by toplevel XID, if not already included in previous record */ + if (IsSubTransactionAssignmentPending()) + { + TransactionId xid = GetTopTransactionIdIfAny(); + + /* update the flag (later used by XLogInsertRecord) */ + curinsert_flags |= XLOG_INCLUDE_XID; > > + txid = XLogRecGetTopXid(record); > + > + /* > + * If the toplevel_xid is valid, we need to assign the subxact to the > + * toplevel transaction. We need to do this for all records, hence we > + * do it before the switch. 
> + */ > s/toplevel_xid/toplevel xid or s/toplevel_xid/txid Okay, we can change > if (SnapBuildCurrentState(builder) < SNAPBUILD_FULL_SNAPSHOT && > - info != XLOG_XACT_ASSIGNMENT) > + !TransactionIdIsValid(r->toplevel_xid)) > Perhaps, XLogRecGetTopXid() can be used. ok -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Kuntal Ghosh
Date:
On Mon, Apr 13, 2020 at 5:20 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > On Mon, Apr 13, 2020 at 4:14 PM Kuntal Ghosh <kuntalghosh.2007@gmail.com> wrote: > > > > +#define SizeOfTransactionId (sizeof(TransactionId) + sizeof(char)) > > This looks wrong. We should change the name of this Macro or we can > > add the 1 byte directly in HEADER_SCRATCH_SIZE and some comments. > > I think this is in sync with below code (SizeOfXlogOrigin), SO doen't > make much sense to add different terminology no? > #define SizeOfXlogOrigin (sizeof(RepOriginId) + sizeof(char)) > +#define SizeOfTransactionId (sizeof(TransactionId) + sizeof(char)) > In that case, we can rename this, for example, SizeOfXLogTransactionId. Some review comments from 0002-Issue-individual-*.path, +void +ReorderBufferAddInvalidation(ReorderBuffer *rb, TransactionId xid, + XLogRecPtr lsn, int nmsgs, + SharedInvalidationMessage *msgs) +{ + MemoryContext oldcontext; + ReorderBufferChange *change; + + /* XXX Should we even write invalidations without valid XID? */ + if (xid == InvalidTransactionId) + return; + + Assert(xid != InvalidTransactionId); It seems we don't call the function if xid is not valid. In fact, @@ -281,6 +281,24 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf) } case XLOG_XACT_ASSIGNMENT: break; + case XLOG_XACT_INVALIDATIONS: + { + TransactionId xid; + xl_xact_invalidations *invals; + + xid = XLogRecGetXid(r); + invals = (xl_xact_invalidations *) XLogRecGetData(r); + + if (!TransactionIdIsValid(xid)) + break; + + ReorderBufferAddInvalidation(reorder, xid, buf->origptr, + invals->nmsgs, invals->msgs); Why should we insert an WAL record for such cases? + * When wal_level=logical, write invalidations into WAL at each command end to + * support the decoding of the in-progress transaction. As of now it was + * enough to log invalidation only at commit because we are only decoding the + * transaction at the commit time. We only need to log the catalog cache and + * relcache invalidation. There can not be any active MVCC scan in logical + * decoding so we don't need to log the snapshot invalidation. The alignment is not right. /* * CommandEndInvalidationMessages - * Process queued-up invalidation messages at end of one command - * in a transaction. + * Process queued-up invalidation messages at end of one command + * in a transaction. Looks unnecessary changes. * Note: - * This should be called during CommandCounterIncrement(), - * after we have advanced the command ID. + * This should be called during CommandCounterIncrement(), + * after we have advanced the command ID. */ Looks unnecessary changes. if (transInvalInfo == NULL) - return; + return; Looks unnecessary changes. + /* prepare record */ + memset(&xlrec, 0, sizeof(xlrec)); We should use MinSizeOfXactInvalidations, no? -- Thanks & Regards, Kuntal Ghosh EnterpriseDB: http://www.enterprisedb.com
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Dilip Kumar
Date:
On Mon, Apr 13, 2020 at 6:12 PM Kuntal Ghosh <kuntalghosh.2007@gmail.com> wrote: > > On Mon, Apr 13, 2020 at 5:20 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Mon, Apr 13, 2020 at 4:14 PM Kuntal Ghosh <kuntalghosh.2007@gmail.com> wrote: > > > > > > +#define SizeOfTransactionId (sizeof(TransactionId) + sizeof(char)) > > > This looks wrong. We should change the name of this Macro or we can > > > add the 1 byte directly in HEADER_SCRATCH_SIZE and some comments. > > > > I think this is in sync with below code (SizeOfXlogOrigin), SO doen't > > make much sense to add different terminology no? > > #define SizeOfXlogOrigin (sizeof(RepOriginId) + sizeof(char)) > > +#define SizeOfTransactionId (sizeof(TransactionId) + sizeof(char)) > > > In that case, we can rename this, for example, SizeOfXLogTransactionId. Make sense. > > Some review comments from 0002-Issue-individual-*.path, > > +void > +ReorderBufferAddInvalidation(ReorderBuffer *rb, TransactionId xid, > + XLogRecPtr lsn, int nmsgs, > + SharedInvalidationMessage *msgs) > +{ > + MemoryContext oldcontext; > + ReorderBufferChange *change; > + > + /* XXX Should we even write invalidations without valid XID? */ > + if (xid == InvalidTransactionId) > + return; > + > + Assert(xid != InvalidTransactionId); > > It seems we don't call the function if xid is not valid. In fact, > > @@ -281,6 +281,24 @@ DecodeXactOp(LogicalDecodingContext *ctx, > XLogRecordBuffer *buf) > } > case XLOG_XACT_ASSIGNMENT: > break; > + case XLOG_XACT_INVALIDATIONS: > + { > + TransactionId xid; > + xl_xact_invalidations *invals; > + > + xid = XLogRecGetXid(r); > + invals = (xl_xact_invalidations *) XLogRecGetData(r); > + > + if (!TransactionIdIsValid(xid)) > + break; > + > + ReorderBufferAddInvalidation(reorder, xid, buf->origptr, > + invals->nmsgs, invals->msgs); > > Why should we insert an WAL record for such cases? I think we can avoid this. I will analyze and send update in my next patch. > > + * When wal_level=logical, write invalidations into WAL at each command end to > + * support the decoding of the in-progress transaction. As of now it was > + * enough to log invalidation only at commit because we are only decoding the > + * transaction at the commit time. We only need to log the catalog cache and > + * relcache invalidation. There can not be any active MVCC scan in logical > + * decoding so we don't need to log the snapshot invalidation. > The alignment is not right. Will fix. > /* > * CommandEndInvalidationMessages > - * Process queued-up invalidation messages at end of one command > - * in a transaction. > + * Process queued-up invalidation messages at end of one command > + * in a transaction. > Looks unnecessary changes. Will fix. > > * Note: > - * This should be called during CommandCounterIncrement(), > - * after we have advanced the command ID. > + * This should be called during CommandCounterIncrement(), > + * after we have advanced the command ID. > */ > Looks unnecessary changes. Will fix. > if (transInvalInfo == NULL) > - return; > + return; > Looks unnecessary changes. > > + /* prepare record */ > + memset(&xlrec, 0, sizeof(xlrec)); > We should use MinSizeOfXactInvalidations, no? Right. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Kuntal Ghosh
Date:
On Mon, Apr 13, 2020 at 6:34 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > Skipping 0003 for now. Review comments from 0004-Gracefully-handle-*.patch @@ -5490,6 +5523,14 @@ heap_finish_speculative(Relation relation, ItemPointer tid) ItemId lp = NULL; HeapTupleHeader htup; + /* + * We don't expect direct calls to heap_hot_search with + * valid CheckXidAlive for regular tables. Track that below. + */ + if (unlikely(TransactionIdIsValid(CheckXidAlive) && + !(IsCatalogRelation(relation) || RelationIsUsedAsCatalogTable(relation)))) + elog(ERROR, "unexpected heap_hot_search call during logical decoding"); The call is to heap_finish_speculative. @@ -481,6 +482,19 @@ systable_getnext(SysScanDesc sysscan) } } + if (TransactionIdIsValid(CheckXidAlive) && + !TransactionIdIsInProgress(CheckXidAlive) && + !TransactionIdDidCommit(CheckXidAlive)) + ereport(ERROR, + (errcode(ERRCODE_TRANSACTION_ROLLBACK), + errmsg("transaction aborted during system catalog scan"))); s/transaction aborted/transaction aborted concurrently perhaps? Also, can we move this check at the begining of the function? If the condition fails, we can skip the sys scan. Some of the checks looks repetative in the same file. Should we declare them as inline functions? Review comments from 0005-Implement-streaming*.patch +static void +AssertChangeLsnOrder(ReorderBufferTXN *txn) +{ +#ifdef USE_ASSERT_CHECKING + dlist_iter iter; ... +#endif +} We can implement the same as following: #ifdef USE_ASSERT_CHECKING static void AssertChangeLsnOrder(ReorderBufferTXN *txn) { dlist_iter iter; ... } #else #define AssertChangeLsnOrder(txn) ((void)true) #endif + * if it is aborted we will report an specific error which we can ignore. We s/an specific/a specific + * Set the last last of the stream as the final lsn before calling + * stream stop. s/last last/last PG_CATCH(); { + MemoryContext ecxt = MemoryContextSwitchTo(ccxt); + ErrorData *errdata = CopyErrorData(); When we don't re-throw, the errdata should be freed by calling FreeErrorData(errdata), right? + /* + * Set the last last of the stream as the final lsn before + * calling stream stop. + */ + txn->final_lsn = prev_lsn; + rb->stream_stop(rb, txn); + + FlushErrorState(); + } stream_stop() can still throw some error, right? In that case, we should flush the error state before calling stream_stop(). + /* + * Remember the command ID and snapshot if transaction is streaming + * otherwise free the snapshot if we have copied it. + */ + if (streaming) + { + txn->command_id = command_id; + + /* Avoid copying if it's already copied. */ + if (snapshot_now->copied) + txn->snapshot_now = snapshot_now; + else + txn->snapshot_now = ReorderBufferCopySnap(rb, snapshot_now, + txn, command_id); + } + else if (snapshot_now->copied) + ReorderBufferFreeSnap(rb, snapshot_now); Hmm, it seems this part needs an assumption that after copying the snapshot, no subsequent step can throw any error. If they do, then we can again create a copy of the snapshot in catch block, which will leak some memory. Is my understanding correct? + } + else + { + ReorderBufferCleanupTXN(rb, txn); + PG_RE_THROW(); + } Shouldn't we switch back to previously created error memory context before re-throwing? 
+ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid, + XLogRecPtr commit_lsn, XLogRecPtr end_lsn, + TimestampTz commit_time, + RepOriginId origin_id, XLogRecPtr origin_lsn) +{ + ReorderBufferTXN *txn; + volatile Snapshot snapshot_now; + volatile CommandId command_id = FirstCommandId; In the modified ReorderBufferCommit(), why is it necessary to declare the above two variable as volatile? There is no try-catch block here. @@ -1946,6 +2284,13 @@ ReorderBufferAbort(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn) if (txn == NULL) return; + /* + * When the (sub)transaction was streamed, notify the remote node + * about the abort only if we have sent any data for this transaction. + */ + if (rbtxn_is_streamed(txn) && txn->any_data_sent) + rb->stream_abort(rb, txn, lsn); + s/When/If + /* + * When the (sub)transaction was streamed, notify the remote node + * about the abort. + */ + if (rbtxn_is_streamed(txn)) + rb->stream_abort(rb, txn, lsn); s/When/If. And, in this case, if we've not sent any data, why should we send the abort message (similar to the previous one)? + * Note: We never do both stream and serialize a transaction (we only spill + * to disk when streaming is not supported by the plugin), so only one of + * those two flags may be set at any given time. + */ +#define rbtxn_is_streamed(txn) \ +( \ + ((txn)->txn_flags & RBTXN_IS_STREAMED) != 0 \ +) Should we put any assert (not necessarily here) to validate the above comment? + txn = ReorderBufferLargestTopTXN(rb); + + /* we know there has to be one, because the size is not zero */ + Assert(txn && !txn->toptxn); + Assert(txn->size > 0); + Assert(rb->size >= txn->size); The same three assertions are already there in ReorderBufferLargestTopTXN(). +static bool +ReorderBufferCanStream(ReorderBuffer *rb) +{ + LogicalDecodingContext *ctx = rb->private_data; + + return ctx->streaming; +} Potential inline function. +static void +ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn) +{ + volatile Snapshot snapshot_now; + volatile CommandId command_id; Here also, do we need to declare these two variables as volatile? -- Thanks & Regards, Kuntal Ghosh EnterpriseDB: http://www.enterprisedb.com
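For background on why the volatile qualifier matters at all in this area: PG_TRY/PG_CATCH is built on setjmp/longjmp, and C only guarantees the post-longjmp value of a local variable if it is either left unmodified after setjmp or declared volatile. So the qualifier is needed only for variables changed inside the PG_TRY block and read in or after PG_CATCH, which is consistent with the comments above. A tiny standalone illustration of that rule, independent of the patch:

#include <setjmp.h>
#include <stdio.h>

static jmp_buf env;

static void
do_work(void)
{
    longjmp(env, 1);            /* plays the role of elog(ERROR) unwinding to PG_CATCH */
}

int
main(void)
{
    volatile int counter = 0;   /* modified between setjmp and longjmp: needs volatile */
    int unchanged = 42;         /* never modified after setjmp: no volatile required */

    if (setjmp(env) == 0)       /* "PG_TRY" */
    {
        counter = 1;
        do_work();              /* never returns */
    }
    else                        /* "PG_CATCH" */
    {
        /*
         * Without the volatile qualifier the value of 'counter' here would be
         * indeterminate, because it changed after setjmp() was called.
         */
        printf("counter=%d unchanged=%d\n", counter, unchanged);
    }
    return 0;
}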
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Amit Kapila
Date:
On Wed, Apr 8, 2020 at 6:29 AM Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote: > > On Tue, Apr 07, 2020 at 12:17:44PM +0530, Amit Kapila wrote: > >On Mon, Mar 30, 2020 at 8:58 PM Tomas Vondra > ><tomas.vondra@2ndquadrant.com> wrote: > > > >I think having something like we discussed or what you have in the > >patch won't be sufficient to clean the KnownAssignedXid array. The > >point is that we won't write a WAL for xid-subxid association for > >unlogged relations in the "Immediately WAL-log assignments" patch, > >however, the KnownAssignedXid would have both kinds of Xids as we > >autofill it with gaps (see RecordKnownAssignedTransactionIds). I > >think if my understanding is correct to make it work we might need > >major surgery in the code or have to maintain KnownAssignedXid array > >differently. > > Hmm, that's a good point. If I understand correctly, the issue is > that if we create new subxact, write something into an unlogged table, > and then create new subxact, the XID of the first subxact will be "known > assigned" but we won't know it's a subxact or to which parent xact it > belongs (because there will be no WAL records that could encode it). > Yeah, there could be multiple such missing subxacts. > I wonder if there's a simple solution (e.g. when creating the second > subxact we might notice the xid-subxid assignment was not logged, and > write some "dummy" WAL record). > That WAL record can have multiple xids. > But I admit it seems a bit ugly. > Yeah, I guess it could be tricky as well because while assembling some WAL record, we need to generate an additional dummy record or might need to add additional information to the current record being formed. I think the handling of such WAL records during hot-standby and in logical decoding could vary. During logical decoding, currently, we don't form an association for subtransactions if it doesn't have any changes (see ReorderBufferCommitChild) and now with this new type of record, I think we need to ensure that we don't form such association. I think after quite some changes, tweaks and a lot of testing, we might be able to remove XLOG_XACT_ASSIGNMENT but I am not sure if it is worth doing along with this patch. I think it would have been good to do this if we are adding any visible overhead with this patch and or it is easy to do that. However, none of that seems to be true, so it might be better to write good comments in the code indicating what all we need to do to remove XLOG_XACT_ASSIGNMENT so that if we feel it is important to do in future we can do so. I am not against spending effort on this but I don't see the urgency of doing it along with this patch. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Amit Kapila
Date:
On Mon, Apr 13, 2020 at 6:12 PM Kuntal Ghosh <kuntalghosh.2007@gmail.com> wrote: > > On Mon, Apr 13, 2020 at 5:20 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Mon, Apr 13, 2020 at 4:14 PM Kuntal Ghosh <kuntalghosh.2007@gmail.com> wrote: > > > > > > +#define SizeOfTransactionId (sizeof(TransactionId) + sizeof(char)) > > > This looks wrong. We should change the name of this Macro or we can > > > add the 1 byte directly in HEADER_SCRATCH_SIZE and some comments. > > > > I think this is in sync with below code (SizeOfXlogOrigin), SO doen't > > make much sense to add different terminology no? > > #define SizeOfXlogOrigin (sizeof(RepOriginId) + sizeof(char)) > > +#define SizeOfTransactionId (sizeof(TransactionId) + sizeof(char)) > > > In that case, we can rename this, for example, SizeOfXLogTransactionId. > > Some review comments from 0002-Issue-individual-*.path, > > +void > +ReorderBufferAddInvalidation(ReorderBuffer *rb, TransactionId xid, > + XLogRecPtr lsn, int nmsgs, > + SharedInvalidationMessage *msgs) > +{ > + MemoryContext oldcontext; > + ReorderBufferChange *change; > + > + /* XXX Should we even write invalidations without valid XID? */ > + if (xid == InvalidTransactionId) > + return; > + > + Assert(xid != InvalidTransactionId); > > It seems we don't call the function if xid is not valid. In fact, > You have a valid point. Also, it is not clear if we are first checking (xid == InvalidTransactionId) and returning from the function, how can even Assert hit. > @@ -281,6 +281,24 @@ DecodeXactOp(LogicalDecodingContext *ctx, > XLogRecordBuffer *buf) > } > case XLOG_XACT_ASSIGNMENT: > break; > + case XLOG_XACT_INVALIDATIONS: > + { > + TransactionId xid; > + xl_xact_invalidations *invals; > + > + xid = XLogRecGetXid(r); > + invals = (xl_xact_invalidations *) XLogRecGetData(r); > + > + if (!TransactionIdIsValid(xid)) > + break; > + > + ReorderBufferAddInvalidation(reorder, xid, buf->origptr, > + invals->nmsgs, invals->msgs); > > Why should we insert an WAL record for such cases? > Right, if there is any such case, we should avoid it. One more point about this patch, the commit message needs to be updated: > The new invalidations are written to WAL immediately, without any such caching. Perhaps it would be possible to add similar caching, > e.g. at the command level, or something like that? I think the above part of commit message is not right as the patch already does such a caching now at the command level. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Dilip Kumar
Date:
On Mon, Apr 13, 2020 at 11:43 PM Kuntal Ghosh <kuntalghosh.2007@gmail.com> wrote: > > On Mon, Apr 13, 2020 at 6:34 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > Skipping 0003 for now. Review comments from 0004-Gracefully-handle-*.patch > > @@ -5490,6 +5523,14 @@ heap_finish_speculative(Relation relation, > ItemPointer tid) > ItemId lp = NULL; > HeapTupleHeader htup; > > + /* > + * We don't expect direct calls to heap_hot_search with > + * valid CheckXidAlive for regular tables. Track that below. > + */ > + if (unlikely(TransactionIdIsValid(CheckXidAlive) && > + !(IsCatalogRelation(relation) || RelationIsUsedAsCatalogTable(relation)))) > + elog(ERROR, "unexpected heap_hot_search call during logical decoding"); > The call is to heap_finish_speculative. Fixed > @@ -481,6 +482,19 @@ systable_getnext(SysScanDesc sysscan) > } > } > > + if (TransactionIdIsValid(CheckXidAlive) && > + !TransactionIdIsInProgress(CheckXidAlive) && > + !TransactionIdDidCommit(CheckXidAlive)) > + ereport(ERROR, > + (errcode(ERRCODE_TRANSACTION_ROLLBACK), > + errmsg("transaction aborted during system catalog scan"))); > s/transaction aborted/transaction aborted concurrently perhaps? Also, > can we move this check at the begining of the function? If the > condition fails, we can skip the sys scan. We must check this after we get the tuple because our goal is, not to decode based on the wrong tuple. And, if we move the check before then, what if the transaction aborted after the check. Once we get the tuple and if the transaction is alive by that time then it doesn't matter even if it aborts because we have got the right tuple already. > > Some of the checks looks repetative in the same file. Should we > declare them as inline functions? > > Review comments from 0005-Implement-streaming*.patch > > +static void > +AssertChangeLsnOrder(ReorderBufferTXN *txn) > +{ > +#ifdef USE_ASSERT_CHECKING > + dlist_iter iter; > ... > +#endif > +} > > We can implement the same as following: > #ifdef USE_ASSERT_CHECKING > static void > AssertChangeLsnOrder(ReorderBufferTXN *txn) > { > dlist_iter iter; > ... > } > #else > #define AssertChangeLsnOrder(txn) ((void)true) > #endif I am not sure, this doesn't look clean. Moreover, the other similar functions are defined in the same way. e.g. AssertTXNLsnOrder. > > + * if it is aborted we will report an specific error which we can ignore. We > s/an specific/a specific Done > > + * Set the last last of the stream as the final lsn before calling > + * stream stop. > s/last last/last > > PG_CATCH(); > { > + MemoryContext ecxt = MemoryContextSwitchTo(ccxt); > + ErrorData *errdata = CopyErrorData(); > When we don't re-throw, the errdata should be freed by calling > FreeErrorData(errdata), right? Done > > + /* > + * Set the last last of the stream as the final lsn before > + * calling stream stop. > + */ > + txn->final_lsn = prev_lsn; > + rb->stream_stop(rb, txn); > + > + FlushErrorState(); > + } > stream_stop() can still throw some error, right? In that case, we > should flush the error state before calling stream_stop(). Done > > + /* > + * Remember the command ID and snapshot if transaction is streaming > + * otherwise free the snapshot if we have copied it. > + */ > + if (streaming) > + { > + txn->command_id = command_id; > + > + /* Avoid copying if it's already copied. 
*/ > + if (snapshot_now->copied) > + txn->snapshot_now = snapshot_now; > + else > + txn->snapshot_now = ReorderBufferCopySnap(rb, snapshot_now, > + txn, command_id); > + } > + else if (snapshot_now->copied) > + ReorderBufferFreeSnap(rb, snapshot_now); > Hmm, it seems this part needs an assumption that after copying the > snapshot, no subsequent step can throw any error. If they do, then we > can again create a copy of the snapshot in catch block, which will > leak some memory. Is my understanding correct? Actually, In CATCH we copy only if the error is ERRCODE_TRANSACTION_ROLLBACK. And, that can occur during systable scan. Basically, in TRY block we copy snapshot after we have streamed all the changes i.e. systable scan is done, now if there is any error that will not be ERRCODE_TRANSACTION_ROLLBACK. So we will not copy again. > > + } > + else > + { > + ReorderBufferCleanupTXN(rb, txn); > + PG_RE_THROW(); > + } > Shouldn't we switch back to previously created error memory context > before re-throwing? Fixed. > > +ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid, > + XLogRecPtr commit_lsn, XLogRecPtr end_lsn, > + TimestampTz commit_time, > + RepOriginId origin_id, XLogRecPtr origin_lsn) > +{ > + ReorderBufferTXN *txn; > + volatile Snapshot snapshot_now; > + volatile CommandId command_id = FirstCommandId; > In the modified ReorderBufferCommit(), why is it necessary to declare > the above two variable as volatile? There is no try-catch block here. Fixed > > @@ -1946,6 +2284,13 @@ ReorderBufferAbort(ReorderBuffer *rb, > TransactionId xid, XLogRecPtr lsn) > if (txn == NULL) > return; > > + /* > + * When the (sub)transaction was streamed, notify the remote node > + * about the abort only if we have sent any data for this transaction. > + */ > + if (rbtxn_is_streamed(txn) && txn->any_data_sent) > + rb->stream_abort(rb, txn, lsn); > + > s/When/If > > + /* > + * When the (sub)transaction was streamed, notify the remote node > + * about the abort. > + */ > + if (rbtxn_is_streamed(txn)) > + rb->stream_abort(rb, txn, lsn); > s/When/If. And, in this case, if we've not sent any data, why should > we send the abort message (similar to the previous one)? Fixed > > + * Note: We never do both stream and serialize a transaction (we only spill > + * to disk when streaming is not supported by the plugin), so only one of > + * those two flags may be set at any given time. > + */ > +#define rbtxn_is_streamed(txn) \ > +( \ > + ((txn)->txn_flags & RBTXN_IS_STREAMED) != 0 \ > +) > Should we put any assert (not necessarily here) to validate the above comment? Because of toast handling, this assumption is changed now so I will remove this note in that patch (0010). > > + txn = ReorderBufferLargestTopTXN(rb); > + > + /* we know there has to be one, because the size is not zero */ > + Assert(txn && !txn->toptxn); > + Assert(txn->size > 0); > + Assert(rb->size >= txn->size); > The same three assertions are already there in ReorderBufferLargestTopTXN(). > > +static bool > +ReorderBufferCanStream(ReorderBuffer *rb) > +{ > + LogicalDecodingContext *ctx = rb->private_data; > + > + return ctx->streaming; > +} > Potential inline function. Done > +static void > +ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn) > +{ > + volatile Snapshot snapshot_now; > + volatile CommandId command_id; > Here also, do we need to declare these two variables as volatile? Done -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Attachment
- v14-0001-Immediately-WAL-log-assignments.patch
- v14-0003-Extend-the-output-plugin-API-with-stream-methods.patch
- v14-0002-Issue-individual-invalidations-with.patch
- v14-0005-Implement-streaming-mode-in-ReorderBuffer.patch
- v14-0004-Gracefully-handle-concurrent-aborts-of-uncommitt.patch
- v14-0006-Add-support-for-streaming-to-built-in-replicatio.patch
- v14-0008-Enable-streaming-for-all-subscription-TAP-tests.patch
- v14-0009-Add-TAP-test-for-streaming-vs.-DDL.patch
- v14-0007-Track-statistics-for-streaming.patch
- v14-0010-Bugfix-handling-of-incomplete-toast-tuple.patch
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Dilip Kumar
Date:
On Tue, Apr 14, 2020 at 2:57 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Mon, Apr 13, 2020 at 6:12 PM Kuntal Ghosh <kuntalghosh.2007@gmail.com> wrote: > > > > On Mon, Apr 13, 2020 at 5:20 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > On Mon, Apr 13, 2020 at 4:14 PM Kuntal Ghosh <kuntalghosh.2007@gmail.com> wrote: > > > > > > > > +#define SizeOfTransactionId (sizeof(TransactionId) + sizeof(char)) > > > > This looks wrong. We should change the name of this Macro or we can > > > > add the 1 byte directly in HEADER_SCRATCH_SIZE and some comments. > > > > > > I think this is in sync with below code (SizeOfXlogOrigin), SO doen't > > > make much sense to add different terminology no? > > > #define SizeOfXlogOrigin (sizeof(RepOriginId) + sizeof(char)) > > > +#define SizeOfTransactionId (sizeof(TransactionId) + sizeof(char)) > > > > > In that case, we can rename this, for example, SizeOfXLogTransactionId. > > > > Some review comments from 0002-Issue-individual-*.path, > > > > +void > > +ReorderBufferAddInvalidation(ReorderBuffer *rb, TransactionId xid, > > + XLogRecPtr lsn, int nmsgs, > > + SharedInvalidationMessage *msgs) > > +{ > > + MemoryContext oldcontext; > > + ReorderBufferChange *change; > > + > > + /* XXX Should we even write invalidations without valid XID? */ > > + if (xid == InvalidTransactionId) > > + return; > > + > > + Assert(xid != InvalidTransactionId); > > > > It seems we don't call the function if xid is not valid. In fact, > > > > You have a valid point. Also, it is not clear if we are first > checking (xid == InvalidTransactionId) and returning from the > function, how can even Assert hit. I have changed to code, now we only have an assert. > > > @@ -281,6 +281,24 @@ DecodeXactOp(LogicalDecodingContext *ctx, > > XLogRecordBuffer *buf) > > } > > case XLOG_XACT_ASSIGNMENT: > > break; > > + case XLOG_XACT_INVALIDATIONS: > > + { > > + TransactionId xid; > > + xl_xact_invalidations *invals; > > + > > + xid = XLogRecGetXid(r); > > + invals = (xl_xact_invalidations *) XLogRecGetData(r); > > + > > + if (!TransactionIdIsValid(xid)) > > + break; > > + > > + ReorderBufferAddInvalidation(reorder, xid, buf->origptr, > > + invals->nmsgs, invals->msgs); > > > > Why should we insert an WAL record for such cases? > > > > Right, if there is any such case, we should avoid it. I think we don't have any such case because we are logging at the command end. So I have created an assert instead of the check. > One more point about this patch, the commit message needs to be updated: > > > The new invalidations are written to WAL immediately, without any > such caching. Perhaps it would be possible to add similar caching, > > e.g. at the command level, or something like that? > > I think the above part of commit message is not right as the patch > already does such a caching now at the command level. Right, I have removed that. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Amit Kapila
Date:
On Tue, Apr 14, 2020 at 3:41 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > > > > @@ -281,6 +281,24 @@ DecodeXactOp(LogicalDecodingContext *ctx, > > > XLogRecordBuffer *buf) > > > } > > > case XLOG_XACT_ASSIGNMENT: > > > break; > > > + case XLOG_XACT_INVALIDATIONS: > > > + { > > > + TransactionId xid; > > > + xl_xact_invalidations *invals; > > > + > > > + xid = XLogRecGetXid(r); > > > + invals = (xl_xact_invalidations *) XLogRecGetData(r); > > > + > > > + if (!TransactionIdIsValid(xid)) > > > + break; > > > + > > > + ReorderBufferAddInvalidation(reorder, xid, buf->origptr, > > > + invals->nmsgs, invals->msgs); > > > > > > Why should we insert an WAL record for such cases? > > > > > > > Right, if there is any such case, we should avoid it. > > I think we don't have any such case because we are logging at the > command end. So I have created an assert instead of the check. > Have you tried to ensure this in some way? One idea could be to add an Assert (to check if transaction id is assigned) in the new code where you are writing WAL for this action and then run make check-world and or make installcheck-world. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Dilip Kumar
Date:
On Tue, Apr 14, 2020 at 3:57 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Tue, Apr 14, 2020 at 3:41 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > > > > > > > > @@ -281,6 +281,24 @@ DecodeXactOp(LogicalDecodingContext *ctx, > > > > XLogRecordBuffer *buf) > > > > } > > > > case XLOG_XACT_ASSIGNMENT: > > > > break; > > > > + case XLOG_XACT_INVALIDATIONS: > > > > + { > > > > + TransactionId xid; > > > > + xl_xact_invalidations *invals; > > > > + > > > > + xid = XLogRecGetXid(r); > > > > + invals = (xl_xact_invalidations *) XLogRecGetData(r); > > > > + > > > > + if (!TransactionIdIsValid(xid)) > > > > + break; > > > > + > > > > + ReorderBufferAddInvalidation(reorder, xid, buf->origptr, > > > > + invals->nmsgs, invals->msgs); > > > > > > > > Why should we insert an WAL record for such cases? > > > > > > > > > > Right, if there is any such case, we should avoid it. > > > > I think we don't have any such case because we are logging at the > > command end. So I have created an assert instead of the check. > > > > Have you tried to ensure this in some way? One idea could be to add > an Assert (to check if transaction id is assigned) in the new code > where you are writing WAL for this action and then run make > check-world and or make installcheck-world. Yeah, I had already tested that. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Erik Rijkers
Date:
On 2020-04-14 12:10, Dilip Kumar wrote: > v14-0001-Immediately-WAL-log-assignments.patch + > v14-0002-Issue-individual-invalidations-with.patch + > v14-0003-Extend-the-output-plugin-API-with-stream-methods.patch+ > v14-0004-Gracefully-handle-concurrent-aborts-of-uncommitt.patch+ > v14-0005-Implement-streaming-mode-in-ReorderBuffer.patch + > v14-0006-Add-support-for-streaming-to-built-in-replicatio.patch+ > v14-0007-Track-statistics-for-streaming.patch + > v14-0008-Enable-streaming-for-all-subscription-TAP-tests.patch + > v14-0009-Add-TAP-test-for-streaming-vs.-DDL.patch + > v14-0010-Bugfix-handling-of-incomplete-toast-tuple.patch applied on top of 8128b0c (a few hours ago) Hi, I haven't followed this thread and maybe this instability is known/expected; just thought I'd let you know. When running a pgbench run over logical replication (cascading down two replicas), I get this segmentation fault. 2020-04-14 17:27:28.135 CEST [8118] DETAIL: Streaming transactions committing after 0/5FA2A38, reading WAL from 0/5FA2A00. 2020-04-14 17:27:28.135 CEST [8118] LOG: logical decoding found consistent point at 0/5FA2A00 2020-04-14 17:27:28.135 CEST [8118] DETAIL: There are no running transactions. 2020-04-14 17:27:28.138 CEST [8006] LOG: server process (PID 8118) was terminated by signal 11: Segmentation fault 2020-04-14 17:27:28.138 CEST [8006] DETAIL: Failed process was running: COMMIT 2020-04-14 17:27:28.138 CEST [8006] LOG: terminating any other active server processes 2020-04-14 17:27:28.138 CEST [8163] WARNING: terminating connection because of crash of another server process 2020-04-14 17:27:28.138 CEST [8163] DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory. 2020-04-14 17:27:28.138 CEST [8163] HINT: In a moment you should be able to reconnect to the database and repeat your command. This error happens somewhat buried away in my test-stuff; I can dig it out and make it into a repeatable test if you need it. (debian stretch/gcc 9.3.0) Erik Rijkers
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Dilip Kumar
Date:
On Tue, Apr 14, 2020 at 9:14 PM Erik Rijkers <er@xs4all.nl> wrote: > > On 2020-04-14 12:10, Dilip Kumar wrote: > > > v14-0001-Immediately-WAL-log-assignments.patch + > > v14-0002-Issue-individual-invalidations-with.patch + > > v14-0003-Extend-the-output-plugin-API-with-stream-methods.patch+ > > v14-0004-Gracefully-handle-concurrent-aborts-of-uncommitt.patch+ > > v14-0005-Implement-streaming-mode-in-ReorderBuffer.patch + > > v14-0006-Add-support-for-streaming-to-built-in-replicatio.patch+ > > v14-0007-Track-statistics-for-streaming.patch + > > v14-0008-Enable-streaming-for-all-subscription-TAP-tests.patch + > > v14-0009-Add-TAP-test-for-streaming-vs.-DDL.patch + > > v14-0010-Bugfix-handling-of-incomplete-toast-tuple.patch > > applied on top of 8128b0c (a few hours ago) > > Hi, > > I haven't followed this thread and maybe this instabilty is > known/expected; just thought I'd let you know. > > When doing running a pgbench run over logical replication (cascading > down two replicas), I get this segmentation fault. Thanks for the testing. Is it possible to share the call stack? > > 2020-04-14 17:27:28.135 CEST [8118] DETAIL: Streaming transactions > committing after 0/5FA2A38, reading WAL from 0/5FA2A00. > 2020-04-14 17:27:28.135 CEST [8118] LOG: logical decoding found > consistent point at 0/5FA2A00 > 2020-04-14 17:27:28.135 CEST [8118] DETAIL: There are no running > transactions. > 2020-04-14 17:27:28.138 CEST [8006] LOG: server process (PID 8118) was > terminated by signal 11: Segmentation fault > 2020-04-14 17:27:28.138 CEST [8006] DETAIL: Failed process was running: > COMMIT > 2020-04-14 17:27:28.138 CEST [8006] LOG: terminating any other active > server processes > 2020-04-14 17:27:28.138 CEST [8163] WARNING: terminating connection > because of crash of another server process > 2020-04-14 17:27:28.138 CEST [8163] DETAIL: The postmaster has > commanded this server process to roll back the current transaction and > exit, because another server process exited abnormally and possibly > corrupted shared memory. > 2020-04-14 17:27:28.138 CEST [8163] HINT: In a moment you should be > able to reconnect to the database and repeat your command. > > > This error happens somewhat buried away in my test-stuff; I can dig it > out and make it into a repeatable test if you need it. (debian > stretch/gcc 9.3.0) Yeah, that will be great. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Dilip Kumar
Date:
On Tue, Apr 14, 2020 at 9:14 PM Erik Rijkers <er@xs4all.nl> wrote: > > On 2020-04-14 12:10, Dilip Kumar wrote: > > > v14-0001-Immediately-WAL-log-assignments.patch + > > v14-0002-Issue-individual-invalidations-with.patch + > > v14-0003-Extend-the-output-plugin-API-with-stream-methods.patch+ > > v14-0004-Gracefully-handle-concurrent-aborts-of-uncommitt.patch+ > > v14-0005-Implement-streaming-mode-in-ReorderBuffer.patch + > > v14-0006-Add-support-for-streaming-to-built-in-replicatio.patch+ > > v14-0007-Track-statistics-for-streaming.patch + > > v14-0008-Enable-streaming-for-all-subscription-TAP-tests.patch + > > v14-0009-Add-TAP-test-for-streaming-vs.-DDL.patch + > > v14-0010-Bugfix-handling-of-incomplete-toast-tuple.patch > > applied on top of 8128b0c (a few hours ago) Hi Erik, While setting up the cascading replication I have hit one issue on the base code[1]. After fixing that I have got one crash with streaming on patch. I am not sure whether you are facing any of these 2 issues or any other issue. If your issue is not any of these then please share the call stack and steps to reproduce. [1] https://www.postgresql.org/message-id/CAFiTN-u64S5bUiPL1q5kwpHNd0hRnf1OE-bzxNiOs5zo84i51w%40mail.gmail.com -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Attachment
Re: PATCH: logical_work_mem and logical streaming of largein-progress transactions
From
Erik Rijkers
Date:
On 2020-04-16 11:33, Dilip Kumar wrote: > On Tue, Apr 14, 2020 at 9:14 PM Erik Rijkers <er@xs4all.nl> wrote: >> >> On 2020-04-14 12:10, Dilip Kumar wrote: >> >> > v14-0001-Immediately-WAL-log-assignments.patch + >> > v14-0002-Issue-individual-invalidations-with.patch + >> > v14-0003-Extend-the-output-plugin-API-with-stream-methods.patch+ >> > v14-0004-Gracefully-handle-concurrent-aborts-of-uncommitt.patch+ >> > v14-0005-Implement-streaming-mode-in-ReorderBuffer.patch + >> > v14-0006-Add-support-for-streaming-to-built-in-replicatio.patch+ >> > v14-0007-Track-statistics-for-streaming.patch + >> > v14-0008-Enable-streaming-for-all-subscription-TAP-tests.patch + >> > v14-0009-Add-TAP-test-for-streaming-vs.-DDL.patch + >> > v14-0010-Bugfix-handling-of-incomplete-toast-tuple.patch >> >> applied on top of 8128b0c (a few hours ago) > I've added your new patch [bugfix_replica_identity_full_on_subscriber.patch] on top of all those above but the crash (apparently the same crash) that I had earlier still occurs (and pretty soon). server process (PID 1721) was terminated by signal 11: Segmentation fault I'll try to isolate it better and get a stacktrace > Hi Erik, > > While setting up the cascading replication I have hit one issue on > base code[1]. After fixing that I have got one crash with streaming > on patch. I am not sure whether you are facing any of these 2 issues > or any other issue. If your issue is not any of these then plese > share the callstack and steps to reproduce. > > [1] > https://www.postgresql.org/message-id/CAFiTN-u64S5bUiPL1q5kwpHNd0hRnf1OE-bzxNiOs5zo84i51w%40mail.gmail.com > > > -- > Regards, > Dilip Kumar > EnterpriseDB: http://www.enterprisedb.com
Re: PATCH: logical_work_mem and logical streaming of largein-progress transactions
From
Kuntal Ghosh
Date:
On Tue, Apr 14, 2020 at 3:41 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > Few review comments from 0006-Add-support-for-streaming*.patch + subxacts[nsubxacts].offset = lseek(stream_fd, 0, SEEK_END); lseek can return (-)ve value in case of error, right? + /* + * We might need to create the tablespace's tempfile directory, if no + * one has yet done so. + * + * Don't check for error from mkdir; it could fail if the directory + * already exists (maybe someone else just did the same thing). If + * it doesn't work then we'll bomb out when opening the file + */ + mkdir(tempdirpath, S_IRWXU); If that's the only reason, perhaps we can use something like following: if (mkdir(tempdirpath, S_IRWXU) < 0 && errno != EEXIST) throw error; + + CloseTransientFile(stream_fd); Might failed to close the file. We should handle the case. Also, I think we need some implementations in dumpSubscription() to dump the (streaming = 'on') option. -- Thanks & Regards, Kuntal Ghosh EnterpriseDB: http://www.enterprisedb.com
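For reference, a minimal sketch of the error handling being suggested above, written in the usual backend style. The function name, variable names and message texts are illustrative only and are not taken from the patch:

#include "postgres.h"

#include <errno.h>
#include <sys/stat.h>
#include <unistd.h>

#include "storage/fd.h"

/*
 * Illustrative sketch: create the tablespace's tempfile directory if it
 * does not exist yet, ignoring EEXIST but reporting any other failure,
 * and check the results of lseek() and CloseTransientFile() instead of
 * discarding them.  Returns the end-of-file offset that would be stored
 * in subxacts[nsubxacts].offset.
 */
static off_t
stream_spool_example(const char *tempdirpath, int stream_fd)
{
	off_t		endpos;

	if (mkdir(tempdirpath, S_IRWXU) < 0 && errno != EEXIST)
		ereport(ERROR,
				(errcode_for_file_access(),
				 errmsg("could not create directory \"%s\": %m",
						tempdirpath)));

	endpos = lseek(stream_fd, 0, SEEK_END);
	if (endpos < 0)
		ereport(ERROR,
				(errcode_for_file_access(),
				 errmsg("could not seek to end of streaming spool file: %m")));

	if (CloseTransientFile(stream_fd) != 0)
		ereport(ERROR,
				(errcode_for_file_access(),
				 errmsg("could not close streaming spool file: %m")));

	return endpos;
}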
Re: PATCH: logical_work_mem and logical streaming of largein-progress transactions
From
Tomas Vondra
Date:
On Mon, Apr 13, 2020 at 05:20:39PM +0530, Dilip Kumar wrote: >On Mon, Apr 13, 2020 at 4:14 PM Kuntal Ghosh <kuntalghosh.2007@gmail.com> wrote: >> >> On Thu, Apr 9, 2020 at 2:40 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: >> > >> > I have rebased the patch on the latest head. I haven't yet changed >> > anything for xid assignment thing because it is not yet concluded. >> > >> Some review comments from 0001-Immediately-WAL-log-*.patch, >> >> +bool >> +IsSubTransactionAssignmentPending(void) >> +{ >> + if (!XLogLogicalInfoActive()) >> + return false; >> + >> + /* we need to be in a transaction state */ >> + if (!IsTransactionState()) >> + return false; >> + >> + /* it has to be a subtransaction */ >> + if (!IsSubTransaction()) >> + return false; >> + >> + /* the subtransaction has to have a XID assigned */ >> + if (!TransactionIdIsValid(GetCurrentTransactionIdIfAny())) >> + return false; >> + >> + /* and it needs to have 'assigned' */ >> + return !CurrentTransactionState->assigned; >> + >> +} >> IMHO, it's important to reduce the complexity of this function since >> it's been called for every WAL insertion. During the lifespan of a >> transaction, any of these if conditions will only be evaluated if >> previous conditions are true. So, we can maintain some state machine >> to avoid multiple evaluation of a condition inside a transaction. But, >> if the overhead is not much, it's not worth I guess. > >Yeah maybe, in some cases we can avoid checking multiple conditions by >maintaining that state. But, that state will have to be at the >transaction level. But, I am not sure how much worth it will be to >add one extra condition to skip a few if checks and it will also add >the code complexity. And, in some cases where logical decoding is not >enabled, it may add one extra check? I mean first check the state and >that will take you to the first if check. > Perhaps. I think we should only do that if we can demonstrate it's an issue in practice. Otherwise it's just unnecessary complexity. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
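To make the idea concrete, here is a purely hypothetical sketch of the "state machine" variant being discussed: cache the outcome in the transaction state once all the per-subtransaction conditions have been seen to hold, so that later WAL insertions in the same subtransaction skip straight to the 'assigned' flag. The assignment_checked field does not exist in TransactionStateData today; it is invented for this sketch, and as the discussion above concludes, the extra complexity may well not be worth it:

/*
 * Hypothetical variant of IsSubTransactionAssignmentPending() with a
 * cached per-subtransaction result.  'assignment_checked' is an invented
 * field; everything else matches the function quoted above.
 */
bool
IsSubTransactionAssignmentPending(void)
{
	if (!XLogLogicalInfoActive())
		return false;

	/* once the conditions below have held, they keep holding */
	if (CurrentTransactionState->assignment_checked)
		return !CurrentTransactionState->assigned;

	/* we need to be in a transaction state */
	if (!IsTransactionState())
		return false;

	/* it has to be a subtransaction */
	if (!IsSubTransaction())
		return false;

	/* the subtransaction has to have a XID assigned */
	if (!TransactionIdIsValid(GetCurrentTransactionIdIfAny()))
		return false;

	/* remember that all the above conditions now hold */
	CurrentTransactionState->assignment_checked = true;

	return !CurrentTransactionState->assigned;
}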
Re: PATCH: logical_work_mem and logical streaming of largein-progress transactions
From
Erik Rijkers
Date:
On 2020-04-16 11:46, Erik Rijkers wrote: > On 2020-04-16 11:33, Dilip Kumar wrote: >> On Tue, Apr 14, 2020 at 9:14 PM Erik Rijkers <er@xs4all.nl> wrote: >>> >>> On 2020-04-14 12:10, Dilip Kumar wrote: >>> >>> > v14-0001-Immediately-WAL-log-assignments.patch + >>> > v14-0002-Issue-individual-invalidations-with.patch + >>> > v14-0003-Extend-the-output-plugin-API-with-stream-methods.patch+ >>> > v14-0004-Gracefully-handle-concurrent-aborts-of-uncommitt.patch+ >>> > v14-0005-Implement-streaming-mode-in-ReorderBuffer.patch + >>> > v14-0006-Add-support-for-streaming-to-built-in-replicatio.patch+ >>> > v14-0007-Track-statistics-for-streaming.patch + >>> > v14-0008-Enable-streaming-for-all-subscription-TAP-tests.patch + >>> > v14-0009-Add-TAP-test-for-streaming-vs.-DDL.patch + >>> > v14-0010-Bugfix-handling-of-incomplete-toast-tuple.patch >>> >>> applied on top of 8128b0c (a few hours ago) >> > > I've added your new patch > > [bugfix_replica_identity_full_on_subscriber.patch] > > on top of all those above but the crash (apparently the same crash) > that I had earlier still occurs (and pretty soon). > > server process (PID 1721) was terminated by signal 11: Segmentation > fault > > I'll try to isolate it better and get a stacktrace > > >> Hi Erik, >> >> While setting up the cascading replication I have hit one issue on >> base code[1]. After fixing that I have got one crash with streaming >> on patch. I am not sure whether you are facing any of these 2 issues >> or any other issue. If your issue is not any of these then plese >> share the callstack and steps to reproduce. I figured out a few things about this. Attached is a bash script test.sh, to reproduce: There is a variable CRASH_IT that determines whether the whole thing will fail (with a segmentation fault) or not. As attached it has CRASH_IT=0 and does not crash. When you change that to CRASH_IT=1, then it will crash. It turns out that this just depends on a short wait state (3 seconds, on my machine) between setting up de replication, and the running of pgbench. It's possible that on very fast machines maybe it does not occur; we've had such difference between hardware before. This is a i5-3330S. It deletes files so look it over before you run it. It may also depend on some of my local set-up but I guess that should be easily fixed. Can you let me know if you can reproduce the problem with this? thanks, Erik Rijkers >> >> [1] >> https://www.postgresql.org/message-id/CAFiTN-u64S5bUiPL1q5kwpHNd0hRnf1OE-bzxNiOs5zo84i51w%40mail.gmail.com >> >> >> -- >> Regards, >> Dilip Kumar >> EnterpriseDB: http://www.enterprisedb.com
Re: PATCH: logical_work_mem and logical streaming of largein-progress transactions
From
Erik Rijkers
Date:
On 2020-04-18 11:07, Erik Rijkers wrote: >>> Hi Erik, >>> >>> While setting up the cascading replication I have hit one issue on >>> base code[1]. After fixing that I have got one crash with streaming >>> on patch. I am not sure whether you are facing any of these 2 issues >>> or any other issue. If your issue is not any of these then plese >>> share the callstack and steps to reproduce. > > I figured out a few things about this. Attached is a bash script > test.sh, to reproduce: And the attached file, test.sh. (sorry) > There is a variable CRASH_IT that determines whether the whole thing > will fail (with a segmentation fault) or not. As attached it has > CRASH_IT=0 and does not crash. When you change that to CRASH_IT=1, > then it will crash. It turns out that this just depends on a short > wait state (3 seconds, on my machine) between setting up de > replication, and the running of pgbench. It's possible that on very > fast machines maybe it does not occur; we've had such difference > between hardware before. This is a i5-3330S. > > It deletes files so look it over before you run it. It may also > depend on some of my local set-up but I guess that should be easily > fixed. > > Can you let me know if you can reproduce the problem with this? > > thanks, > > Erik Rijkers > > > >>> >>> [1] >>> https://www.postgresql.org/message-id/CAFiTN-u64S5bUiPL1q5kwpHNd0hRnf1OE-bzxNiOs5zo84i51w%40mail.gmail.com >>> >>> >>> -- >>> Regards, >>> Dilip Kumar >>> EnterpriseDB: http://www.enterprisedb.com
Attachment
Re: PATCH: logical_work_mem and logical streaming of largein-progress transactions
From
Erik Rijkers
Date:
On 2020-04-18 11:10, Erik Rijkers wrote: > On 2020-04-18 11:07, Erik Rijkers wrote: >>>> Hi Erik, >>>> >>>> While setting up the cascading replication I have hit one issue on >>>> base code[1]. After fixing that I have got one crash with streaming >>>> on patch. I am not sure whether you are facing any of these 2 >>>> issues >>>> or any other issue. If your issue is not any of these then plese >>>> share the callstack and steps to reproduce. >> >> I figured out a few things about this. Attached is a bash script >> test.sh, to reproduce: > > And the attached file, test.sh. (sorry) It turns out I must have been mistaken somewhere. I probably missed bugfix_in_schema_sent.patch) I have just now rebuilt all the instances on top of master with these patches: > [v14-0001-Immediately-WAL-log-assignments.patch] > [v14-0002-Issue-individual-invalidations-with.patch] > [v14-0003-Extend-the-output-plugin-API-with-stream-methods.patch] > [v14-0004-Gracefully-handle-concurrent-aborts-of-uncommitt.patch] > [v14-0005-Implement-streaming-mode-in-ReorderBuffer.patch] > [v14-0006-Add-support-for-streaming-to-built-in-replicatio.patch] > [v14-0007-Track-statistics-for-streaming.patch] > [v14-0008-Enable-streaming-for-all-subscription-TAP-tests.patch] > [v14-0009-Add-TAP-test-for-streaming-vs.-DDL.patch] > [v14-0010-Bugfix-handling-of-incomplete-toast-tuple.patch] > [bugfix_in_schema_sent.patch] (by the way: this build's regression tests 'ddl', 'toast', and 'spill' fail) I seem now able to run all my test programs on these instances without errors. Sorry, I seem to have raised a false alarm (although there was initially certainly a problem). Erik Rijkers >> There is a variable CRASH_IT that determines whether the whole thing >> will fail (with a segmentation fault) or not. As attached it has >> CRASH_IT=0 and does not crash. When you change that to CRASH_IT=1, >> then it will crash. It turns out that this just depends on a short >> wait state (3 seconds, on my machine) between setting up de >> replication, and the running of pgbench. It's possible that on very >> fast machines maybe it does not occur; we've had such difference >> between hardware before. This is a i5-3330S. >> >> It deletes files so look it over before you run it. It may also >> depend on some of my local set-up but I guess that should be easily >> fixed. >> >> Can you let me know if you can reproduce the problem with this? >> >> thanks, >> >> Erik Rijkers >> >> >> >>>> >>>> [1] >>>> https://www.postgresql.org/message-id/CAFiTN-u64S5bUiPL1q5kwpHNd0hRnf1OE-bzxNiOs5zo84i51w%40mail.gmail.com >>>> >>>> >>>> -- >>>> Regards, >>>> Dilip Kumar >>>> EnterpriseDB: http://www.enterprisedb.com
Re: PATCH: logical_work_mem and logical streaming of largein-progress transactions
From
Dilip Kumar
Date:
On Sat, Apr 18, 2020 at 6:12 PM Erik Rijkers <er@xs4all.nl> wrote: > > On 2020-04-18 11:10, Erik Rijkers wrote: > > On 2020-04-18 11:07, Erik Rijkers wrote: > >>>> Hi Erik, > >>>> > >>>> While setting up the cascading replication I have hit one issue on > >>>> base code[1]. After fixing that I have got one crash with streaming > >>>> on patch. I am not sure whether you are facing any of these 2 > >>>> issues > >>>> or any other issue. If your issue is not any of these then plese > >>>> share the callstack and steps to reproduce. > >> > >> I figured out a few things about this. Attached is a bash script > >> test.sh, to reproduce: > > > > And the attached file, test.sh. (sorry) > > It turns out I must have been mistaken somewhere. I probably missed > bugfix_in_schema_sent.patch) > > I have just now rebuilt all the instances on top of master with these > patches: > > > [v14-0001-Immediately-WAL-log-assignments.patch] > > [v14-0002-Issue-individual-invalidations-with.patch] > > [v14-0003-Extend-the-output-plugin-API-with-stream-methods.patch] > > [v14-0004-Gracefully-handle-concurrent-aborts-of-uncommitt.patch] > > [v14-0005-Implement-streaming-mode-in-ReorderBuffer.patch] > > [v14-0006-Add-support-for-streaming-to-built-in-replicatio.patch] > > [v14-0007-Track-statistics-for-streaming.patch] > > [v14-0008-Enable-streaming-for-all-subscription-TAP-tests.patch] > > [v14-0009-Add-TAP-test-for-streaming-vs.-DDL.patch] > > [v14-0010-Bugfix-handling-of-incomplete-toast-tuple.patch] > > [bugfix_in_schema_sent.patch] > > (by the way: this build's regression tests 'ddl', 'toast', and > 'spill' fail) > > I seem now able to run all my test programs on these instances without > errors. > > Sorry, I seem to have raised a false alarm (although there was initially > certainly a problem). No problem, Thanks for confirming. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: PATCH: logical_work_mem and logical streaming of largein-progress transactions
From
Dilip Kumar
Date:
On Sat, Apr 18, 2020 at 6:12 PM Erik Rijkers <er@xs4all.nl> wrote: > > On 2020-04-18 11:10, Erik Rijkers wrote: > > On 2020-04-18 11:07, Erik Rijkers wrote: > >>>> Hi Erik, > >>>> > >>>> While setting up the cascading replication I have hit one issue on > >>>> base code[1]. After fixing that I have got one crash with streaming > >>>> on patch. I am not sure whether you are facing any of these 2 > >>>> issues > >>>> or any other issue. If your issue is not any of these then plese > >>>> share the callstack and steps to reproduce. > >> > >> I figured out a few things about this. Attached is a bash script > >> test.sh, to reproduce: > > > > And the attached file, test.sh. (sorry) > > It turns out I must have been mistaken somewhere. I probably missed > bugfix_in_schema_sent.patch) > > I have just now rebuilt all the instances on top of master with these > patches: > > > [v14-0001-Immediately-WAL-log-assignments.patch] > > [v14-0002-Issue-individual-invalidations-with.patch] > > [v14-0003-Extend-the-output-plugin-API-with-stream-methods.patch] > > [v14-0004-Gracefully-handle-concurrent-aborts-of-uncommitt.patch] > > [v14-0005-Implement-streaming-mode-in-ReorderBuffer.patch] > > [v14-0006-Add-support-for-streaming-to-built-in-replicatio.patch] > > [v14-0007-Track-statistics-for-streaming.patch] > > [v14-0008-Enable-streaming-for-all-subscription-TAP-tests.patch] > > [v14-0009-Add-TAP-test-for-streaming-vs.-DDL.patch] > > [v14-0010-Bugfix-handling-of-incomplete-toast-tuple.patch] > > [bugfix_in_schema_sent.patch] > > (by the way: this build's regression tests 'ddl', 'toast', and > 'spill' fail) Yeah, this is a. known issue, actually, while streaming the transaction the output message is changed. I have a plan to work on this part. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: PATCH: logical_work_mem and logical streaming of largein-progress transactions
From
Dilip Kumar
Date:
On Tue, Apr 21, 2020 at 5:30 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Sat, Apr 18, 2020 at 6:12 PM Erik Rijkers <er@xs4all.nl> wrote: > > > > On 2020-04-18 11:10, Erik Rijkers wrote: > > > On 2020-04-18 11:07, Erik Rijkers wrote: > > >>>> Hi Erik, > > >>>> > > >>>> While setting up the cascading replication I have hit one issue on > > >>>> base code[1]. After fixing that I have got one crash with streaming > > >>>> on patch. I am not sure whether you are facing any of these 2 > > >>>> issues > > >>>> or any other issue. If your issue is not any of these then plese > > >>>> share the callstack and steps to reproduce. > > >> > > >> I figured out a few things about this. Attached is a bash script > > >> test.sh, to reproduce: > > > > > > And the attached file, test.sh. (sorry) > > > > It turns out I must have been mistaken somewhere. I probably missed > > bugfix_in_schema_sent.patch) > > > > I have just now rebuilt all the instances on top of master with these > > patches: > > > > > [v14-0001-Immediately-WAL-log-assignments.patch] > > > [v14-0002-Issue-individual-invalidations-with.patch] > > > [v14-0003-Extend-the-output-plugin-API-with-stream-methods.patch] > > > [v14-0004-Gracefully-handle-concurrent-aborts-of-uncommitt.patch] > > > [v14-0005-Implement-streaming-mode-in-ReorderBuffer.patch] > > > [v14-0006-Add-support-for-streaming-to-built-in-replicatio.patch] > > > [v14-0007-Track-statistics-for-streaming.patch] > > > [v14-0008-Enable-streaming-for-all-subscription-TAP-tests.patch] > > > [v14-0009-Add-TAP-test-for-streaming-vs.-DDL.patch] > > > [v14-0010-Bugfix-handling-of-incomplete-toast-tuple.patch] > > > [bugfix_in_schema_sent.patch] > > > > (by the way: this build's regression tests 'ddl', 'toast', and > > 'spill' fail) > > Yeah, this is a. known issue, actually, while streaming the > transaction the output message is changed. I have a plan to work on > this part. I have fixed this part. Basically, now, I have created a separate function to get the streaming changes 'pg_logical_slot_get_streaming_changes'. So the default function pg_logical_slot_get_changes will work as it is and test decoding test cases will not fail. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
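A guess at the shape of that new function, assuming pg_logical_slot_get_changes_guts() simply grows an extra boolean argument that enables streaming; the exact signature is an assumption made for illustration, not copied from the patch:

#include "postgres.h"

#include "fmgr.h"

/*
 * Hypothetical sketch (would live in logicalfuncs.c next to the existing
 * functions): a separate SQL-callable entry point that turns streaming on,
 * so pg_logical_slot_get_changes() keeps its current non-streaming
 * behaviour and the test_decoding regression output stays stable.
 */
Datum
pg_logical_slot_get_streaming_changes(PG_FUNCTION_ARGS)
{
	/* confirm = true, binary = false, streaming = true (assumed) */
	return pg_logical_slot_get_changes_guts(fcinfo, true, false, true);
}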
Attachment
- v15-0001-Immediately-WAL-log-assignments.patch
- v15-0002-Issue-individual-invalidations-with-wal_level-lo.patch
- v15-0004-Gracefully-handle-concurrent-aborts-of-uncommitt.patch
- v15-0003-Extend-the-output-plugin-API-with-stream-methods.patch
- v15-0005-Implement-streaming-mode-in-ReorderBuffer.patch
- v15-0006-Add-support-for-streaming-to-built-in-replicatio.patch
- v15-0009-Add-TAP-test-for-streaming-vs.-DDL.patch
- v15-0008-Enable-streaming-for-all-subscription-TAP-tests.patch
- v15-0010-Bugfix-handling-of-incomplete-toast-tuple.patch
- v15-0007-Track-statistics-for-streaming.patch
- v15-0011-Provide-new-api-to-get-the-streaming-changes.patch
Re: PATCH: logical_work_mem and logical streaming of largein-progress transactions
From
Erik Rijkers
Date:
On 2020-04-22 16:49, Dilip Kumar wrote: > On Tue, Apr 21, 2020 at 5:30 PM Dilip Kumar <dilipbalaut@gmail.com> > wrote: >> >> > >> > (by the way: this build's regression tests 'ddl', 'toast', and >> > 'spill' fail) >> >> Yeah, this is a. known issue, actually, while streaming the >> transaction the output message is changed. I have a plan to work on >> this part. > > I have fixed this part. Basically, now, I have created a separate > function to get the streaming changes > 'pg_logical_slot_get_streaming_changes'. So the default function > pg_logical_slot_get_changes will work as it is and test decoding test > cases will not fail. The 'ddl' one is apparently not quite fixed - I get this in (cd contrib; make check)' (in both assert-enabled and non-assert-enabled build) grep -A7 -B7 make.check_contrib.out: contrib/make.check_contrib.out-============== initializing database system ============== contrib/make.check_contrib.out-============== starting postmaster ============== contrib/make.check_contrib.out-running on port 64464 with PID 9175 contrib/make.check_contrib.out-============== creating database "contrib_regression" ============== contrib/make.check_contrib.out-CREATE DATABASE contrib/make.check_contrib.out-ALTER DATABASE contrib/make.check_contrib.out-============== running regression test queries ============== contrib/make.check_contrib.out:test ddl ... FAILED 840 ms contrib/make.check_contrib.out-test xact ... ok 24 ms contrib/make.check_contrib.out-test rewrite ... ok 187 ms contrib/make.check_contrib.out-test toast ... ok 851 ms contrib/make.check_contrib.out-test permissions ... ok 26 ms contrib/make.check_contrib.out-test decoding_in_xact ... ok 31 ms contrib/make.check_contrib.out-test decoding_into_rel ... ok 25 ms contrib/make.check_contrib.out-test binary ... ok 12 ms Otherwise patches apply and build OK so will go run some tests...
Re: PATCH: logical_work_mem and logical streaming of largein-progress transactions
From
Dilip Kumar
Date:
On Wed, Apr 22, 2020 at 9:31 PM Erik Rijkers <er@xs4all.nl> wrote: > > On 2020-04-22 16:49, Dilip Kumar wrote: > > On Tue, Apr 21, 2020 at 5:30 PM Dilip Kumar <dilipbalaut@gmail.com> > > wrote: > >> > >> > > >> > (by the way: this build's regression tests 'ddl', 'toast', and > >> > 'spill' fail) > >> > >> Yeah, this is a. known issue, actually, while streaming the > >> transaction the output message is changed. I have a plan to work on > >> this part. > > > > I have fixed this part. Basically, now, I have created a separate > > function to get the streaming changes > > 'pg_logical_slot_get_streaming_changes'. So the default function > > pg_logical_slot_get_changes will work as it is and test decoding test > > cases will not fail. > > The 'ddl' one is apparently not quite fixed - I get this in (cd > contrib; make check)' (in both assert-enabled and non-assert-enabled > build) Can you send me the contrib/test_decoding/regression.diffs file? > grep -A7 -B7 make.check_contrib.out: > > contrib/make.check_contrib.out-============== initializing database > system ============== > contrib/make.check_contrib.out-============== starting postmaster > ============== > contrib/make.check_contrib.out-running on port 64464 with PID 9175 > contrib/make.check_contrib.out-============== creating database > "contrib_regression" ============== > contrib/make.check_contrib.out-CREATE DATABASE > contrib/make.check_contrib.out-ALTER DATABASE > contrib/make.check_contrib.out-============== running regression test > queries ============== > contrib/make.check_contrib.out:test ddl ... > FAILED 840 ms > contrib/make.check_contrib.out-test xact ... ok > 24 ms > contrib/make.check_contrib.out-test rewrite ... ok > 187 ms > contrib/make.check_contrib.out-test toast ... ok > 851 ms > contrib/make.check_contrib.out-test permissions ... ok > 26 ms > contrib/make.check_contrib.out-test decoding_in_xact ... ok > 31 ms > contrib/make.check_contrib.out-test decoding_into_rel ... ok > 25 ms > contrib/make.check_contrib.out-test binary ... ok > 12 ms > > Otherwise patches apply and build OK so will go run some tests... Thanks -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: PATCH: logical_work_mem and logical streaming of largein-progress transactions
From
Erik Rijkers
Date:
On 2020-04-23 05:24, Dilip Kumar wrote: > On Wed, Apr 22, 2020 at 9:31 PM Erik Rijkers <er@xs4all.nl> wrote: >> >> The 'ddl' one is apparently not quite fixed - I get this in (cd >> contrib; make check)' (in both assert-enabled and non-assert-enabled >> build) > > Can you send me the contrib/test_decoding/regression.diffs file? Attached. Below is the patch list, in case that was unclear 20200422/v15-0001-Immediately-WAL-log-assignments.patch + 20200422/v15-0002-Issue-individual-invalidations-with-wal_level-lo.patch+ 20200422/v15-0003-Extend-the-output-plugin-API-with-stream-methods.patch+ 20200422/v15-0004-Gracefully-handle-concurrent-aborts-of-uncommitt.patch+ 20200422/v15-0005-Implement-streaming-mode-in-ReorderBuffer.patch + 20200422/v15-0006-Add-support-for-streaming-to-built-in-replicatio.patch+ 20200422/v15-0007-Track-statistics-for-streaming.patch + 20200422/v15-0008-Enable-streaming-for-all-subscription-TAP-tests.patch + 20200422/v15-0009-Add-TAP-test-for-streaming-vs.-DDL.patch + 20200422/v15-0010-Bugfix-handling-of-incomplete-toast-tuple.patch + 20200422/v15-0011-Provide-new-api-to-get-the-streaming-changes.patch + 20200414/bugfix_in_schema_sent.patch >> grep -A7 -B7 make.check_contrib.out: >> >> contrib/make.check_contrib.out-============== initializing database >> system ============== >> contrib/make.check_contrib.out-============== starting postmaster >> ============== >> contrib/make.check_contrib.out-running on port 64464 with PID 9175 >> contrib/make.check_contrib.out-============== creating database >> "contrib_regression" ============== >> contrib/make.check_contrib.out-CREATE DATABASE >> contrib/make.check_contrib.out-ALTER DATABASE >> contrib/make.check_contrib.out-============== running regression test >> queries ============== >> contrib/make.check_contrib.out:test ddl ... >> FAILED 840 ms >> contrib/make.check_contrib.out-test xact ... >> ok >> 24 ms >> contrib/make.check_contrib.out-test rewrite ... >> ok >> 187 ms >> contrib/make.check_contrib.out-test toast ... >> ok >> 851 ms >> contrib/make.check_contrib.out-test permissions ... >> ok >> 26 ms >> contrib/make.check_contrib.out-test decoding_in_xact ... >> ok >> 31 ms >> contrib/make.check_contrib.out-test decoding_into_rel ... >> ok >> 25 ms >> contrib/make.check_contrib.out-test binary ... >> ok >> 12 ms >> >> Otherwise patches apply and build OK so will go run some tests... > > Thanks > > > -- > Regards, > Dilip Kumar > EnterpriseDB: http://www.enterprisedb.com
Attachment
Re: PATCH: logical_work_mem and logical streaming of largein-progress transactions
From
Dilip Kumar
Date:
On Thu, Apr 23, 2020 at 2:28 PM Erik Rijkers <er@xs4all.nl> wrote: > > On 2020-04-23 05:24, Dilip Kumar wrote: > > On Wed, Apr 22, 2020 at 9:31 PM Erik Rijkers <er@xs4all.nl> wrote: > >> > >> The 'ddl' one is apparently not quite fixed - I get this in (cd > >> contrib; make check)' (in both assert-enabled and non-assert-enabled > >> build) > > > > Can you send me the contrib/test_decoding/regression.diffs file? > > Attached. So from regression.diffs, it appears that it is failing in a memory allocation (+ERROR: invalid memory alloc request size 94119198201896). My colleague tried to reproduce this in a different environment but without success so far. One more thing that surprises me is that after (v15-0011-Provide-new-api-to-get-the-streaming-changes.patch) it should never take the streaming path at all. However, we cannot ignore the fact that some of the changes might impact the non-streaming path as well. Is it possible for you to somehow stop or break the code and send the stack trace? One idea: from the log we can see where the error is raised, i.e. MemoryContextAlloc or palloc or some other similar function. Once we know that, we can convert that error into an assert and find the call stack. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
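One concrete way to do that, as a throwaway debugging hack rather than a proposed change: the message comes from the allocation-size check that MemoryContextAlloc(), palloc() and the other allocators in src/backend/utils/mmgr/mcxt.c perform, so in an assert-enabled build that check can temporarily be turned into a hard crash to get a core file and a backtrace at the offending call site:

	/* temporary debugging change in MemoryContextAlloc() and friends (mcxt.c) */
	if (!AllocSizeIsValid(size))
	{
		Assert(false);			/* force an abort/core so we can get the backtrace */
		elog(ERROR, "invalid memory alloc request size %zu", size);
	}

Alternatively, attaching gdb to the decoding backend and setting a breakpoint on errfinish() catches the ERROR at the point where it is raised, without rebuilding.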
Re: PATCH: logical_work_mem and logical streaming of largein-progress transactions
From
Dilip Kumar
Date:
On Fri, Apr 17, 2020 at 1:40 AM Kuntal Ghosh <kuntalghosh.2007@gmail.com> wrote: > > On Tue, Apr 14, 2020 at 3:41 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > Few review comments from 0006-Add-support-for-streaming*.patch > > + subxacts[nsubxacts].offset = lseek(stream_fd, 0, SEEK_END); > lseek can return (-)ve value in case of error, right? > > + /* > + * We might need to create the tablespace's tempfile directory, if no > + * one has yet done so. > + * > + * Don't check for error from mkdir; it could fail if the directory > + * already exists (maybe someone else just did the same thing). If > + * it doesn't work then we'll bomb out when opening the file > + */ > + mkdir(tempdirpath, S_IRWXU); > If that's the only reason, perhaps we can use something like following: > > if (mkdir(tempdirpath, S_IRWXU) < 0 && errno != EEXIST) > throw error; Done > > + > + CloseTransientFile(stream_fd); > Might failed to close the file. We should handle the case. Changed Still, one place is pending because I don't have the filename there to report an error. One option is we can just give an error without the filename. I will try to think about this part. > Also, I think we need some implementations in dumpSubscription() to > dump the (streaming = 'on') option. Right, created another patch and attached. I have also fixed a couple of bugs internally reported by my colleague Neha Sharma. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Attachment
- v16-0002-Issue-individual-invalidations-with-wal_level-lo.patch
- v16-0001-Immediately-WAL-log-assignments.patch
- v16-0003-Extend-the-output-plugin-API-with-stream-methods.patch
- v16-0004-Gracefully-handle-concurrent-aborts-of-uncommitt.patch
- v16-0005-Implement-streaming-mode-in-ReorderBuffer.patch
- v16-0009-Add-TAP-test-for-streaming-vs.-DDL.patch
- v16-0006-Add-support-for-streaming-to-built-in-replicatio.patch
- v16-0007-Track-statistics-for-streaming.patch
- v16-0008-Enable-streaming-for-all-subscription-TAP-tests.patch
- v16-0010-Bugfix-handling-of-incomplete-toast-tuple.patch
- v16-0012-Add-streaming-option-in-pg_dump.patch
- v16-0011-Provide-new-api-to-get-the-streaming-changes.patch
Re: PATCH: logical_work_mem and logical streaming of largein-progress transactions
From
Amit Kapila
Date:
On Mon, Apr 27, 2020 at 4:05 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > I have also fixed a couple of bugs internally reported by my colleague > Neha Sharma. > I think it would be good if you can briefly explain what were the bugs and how you fixed those? -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: PATCH: logical_work_mem and logical streaming of largein-progress transactions
From
Dilip Kumar
Date:
On Mon, Apr 27, 2020 at 4:13 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Mon, Apr 27, 2020 at 4:05 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > I have also fixed a couple of bugs internally reported by my colleague > > Neha Sharma. > > > > I think it would be good if you can briefly explain what were the bugs > and how you fixed those? Issue1: If the concurrent transaction was aborted, then in the CATCH block we were not freeing the memory of the toast_hash, and that tripped the assertion that txn->size is 0 after the stream is complete. Issue2: After streaming is complete we set txn->final_lsn from a value remembered in a local variable. But mistakenly that value was remembered in a variable local to the TRY block, so on a concurrent abort the CATCH block always saw it as zero. As a result final_lsn became 0 after streaming and the assertion failed. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
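To illustrate Issue2, a rough sketch with assumed names (not the actual patch code): the variable that remembers the last streamed LSN has to be declared outside PG_TRY, and marked volatile because it is modified inside the TRY block and read in the CATCH block; a variable declared inside the TRY block would read as 0/InvalidXLogRecPtr on the concurrent-abort path, which is exactly the broken final_lsn described above.

#include "postgres.h"

#include "access/xlogdefs.h"
#include "lib/ilist.h"
#include "replication/reorderbuffer.h"

/*
 * Illustrative sketch only.  prev_lsn lives in the enclosing scope, so when
 * a concurrent abort lands us in PG_CATCH it still holds the LSN of the
 * last change we streamed and can be used for txn->final_lsn.
 */
static void
stream_txn_sketch(ReorderBufferTXN *txn)
{
	volatile XLogRecPtr prev_lsn = InvalidXLogRecPtr;

	PG_TRY();
	{
		dlist_iter	iter;

		dlist_foreach(iter, &txn->changes)
		{
			ReorderBufferChange *change;

			change = dlist_container(ReorderBufferChange, node, iter.cur);
			prev_lsn = change->lsn;

			/* ... send the change to the output plugin here ... */
		}
	}
	PG_CATCH();
	{
		/*
		 * Concurrent abort: free per-transaction resources (e.g. the toast
		 * hash from Issue1) and record how far we streamed.
		 */
		txn->final_lsn = prev_lsn;
		PG_RE_THROW();
	}
	PG_END_TRY();
}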
Re: PATCH: logical_work_mem and logical streaming of largein-progress transactions
From
Amit Kapila
Date:
On Mon, Apr 27, 2020 at 4:05 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > [latest patches] v16-0004-Gracefully-handle-concurrent-aborts-of-uncommitt - Any actions leading to transaction ID assignment are prohibited. That, among others, + Note that access to user catalog tables or regular system catalog tables + in the output plugins has to be done via the <literal>systable_*</literal> scan APIs only. + Access via the <literal>heap_*</literal> scan APIs will error out. + Additionally, any actions leading to transaction ID assignment are prohibited. That, among others, .. @@ -1383,6 +1392,14 @@ heap_fetch(Relation relation, bool valid; /* + * We don't expect direct calls to heap_fetch with valid + * CheckXidAlive for regular tables. Track that below. + */ + if (unlikely(TransactionIdIsValid(CheckXidAlive) && + !(IsCatalogRelation(relation) || RelationIsUsedAsCatalogTable(relation)))) + elog(ERROR, "unexpected heap_fetch call during logical decoding"); + I think comments and code don't match. In the comment, we are saying that via output plugins access to user catalog tables or regular system catalog tables won't be allowed via heap_* APIs but code doesn't seem to reflect it. I feel only TransactionIdIsValid(CheckXidAlive) is sufficient here. See, the original discussion about this point [1] (Refer "I think it'd also be good to add assertions to codepaths not going through systable_* asserting that ..."). Isn't it better to block the scan to user catalog tables or regular system catalog tables for tableam scan APIs rather than at the heap level? There might be some APIs like heap_getnext where such a check might still be required but I guess it is still better to block at tableam level. [1] - https://www.postgresql.org/message-id/20180726200241.aje4dv4jsv25v4k2%40alap3.anarazel.de -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: PATCH: logical_work_mem and logical streaming of largein-progress transactions
From
Dilip Kumar
Date:
On Tue, Apr 28, 2020 at 3:11 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Mon, Apr 27, 2020 at 4:05 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > [latest patches] > > v16-0004-Gracefully-handle-concurrent-aborts-of-uncommitt > - Any actions leading to transaction ID assignment are prohibited. > That, among others, > + Note that access to user catalog tables or regular system catalog tables > + in the output plugins has to be done via the > <literal>systable_*</literal> scan APIs only. > + Access via the <literal>heap_*</literal> scan APIs will error out. > + Additionally, any actions leading to transaction ID assignment > are prohibited. That, among others, > .. > @@ -1383,6 +1392,14 @@ heap_fetch(Relation relation, > bool valid; > > /* > + * We don't expect direct calls to heap_fetch with valid > + * CheckXidAlive for regular tables. Track that below. > + */ > + if (unlikely(TransactionIdIsValid(CheckXidAlive) && > + !(IsCatalogRelation(relation) || RelationIsUsedAsCatalogTable(relation)))) > + elog(ERROR, "unexpected heap_fetch call during logical decoding"); > + > > I think comments and code don't match. In the comment, we are saying > that via output plugins access to user catalog tables or regular > system catalog tables won't be allowed via heap_* APIs but code > doesn't seem to reflect it. I feel only > TransactionIdIsValid(CheckXidAlive) is sufficient here. See, the > original discussion about this point [1] (Refer "I think it'd also be > good to add assertions to codepaths not going through systable_* > asserting that ..."). Right, So I think we can just add an assert in these function that Assert(!TransactionIdIsValid(CheckXidAlive)) ? > > Isn't it better to block the scan to user catalog tables or regular > system catalog tables for tableam scan APIs rather than at the heap > level? There might be some APIs like heap_getnext where such a check > might still be required but I guess it is still better to block at > tableam level. > > [1] - https://www.postgresql.org/message-id/20180726200241.aje4dv4jsv25v4k2%40alap3.anarazel.de Okay, let me analyze this part. Because someplace we have to keep at heap level like heap_getnext and other places at tableam level so it seems a bit inconsistent. Also, I think the number of checks might going to increase because some of the heap functions like heap_hot_search_buffer are being called from multiple tableam calls, so we need to put check at every place. Another point is that I feel some of the checks what we have today might not be required like heap_finish_speculative, is not fetching any tuple for us so why do we need to care about this function? -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: PATCH: logical_work_mem and logical streaming of largein-progress transactions
From
Amit Kapila
Date:
On Tue, Apr 28, 2020 at 3:55 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Tue, Apr 28, 2020 at 3:11 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > On Mon, Apr 27, 2020 at 4:05 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > > [latest patches] > > > > v16-0004-Gracefully-handle-concurrent-aborts-of-uncommitt > > - Any actions leading to transaction ID assignment are prohibited. > > That, among others, > > + Note that access to user catalog tables or regular system catalog tables > > + in the output plugins has to be done via the > > <literal>systable_*</literal> scan APIs only. > > + Access via the <literal>heap_*</literal> scan APIs will error out. > > + Additionally, any actions leading to transaction ID assignment > > are prohibited. That, among others, > > .. > > @@ -1383,6 +1392,14 @@ heap_fetch(Relation relation, > > bool valid; > > > > /* > > + * We don't expect direct calls to heap_fetch with valid > > + * CheckXidAlive for regular tables. Track that below. > > + */ > > + if (unlikely(TransactionIdIsValid(CheckXidAlive) && > > + !(IsCatalogRelation(relation) || RelationIsUsedAsCatalogTable(relation)))) > > + elog(ERROR, "unexpected heap_fetch call during logical decoding"); > > + > > > > I think comments and code don't match. In the comment, we are saying > > that via output plugins access to user catalog tables or regular > > system catalog tables won't be allowed via heap_* APIs but code > > doesn't seem to reflect it. I feel only > > TransactionIdIsValid(CheckXidAlive) is sufficient here. See, the > > original discussion about this point [1] (Refer "I think it'd also be > > good to add assertions to codepaths not going through systable_* > > asserting that ..."). > > Right, So I think we can just add an assert in these function that > Assert(!TransactionIdIsValid(CheckXidAlive)) ? > I am fine with Assertion but update the documentation accordingly. However, I think you can once cross-verify if there are any output plugins that are already using such APIs. There is a list of "Logical Decoding Plugins" on the wiki [1], just look into those once. > > > > Isn't it better to block the scan to user catalog tables or regular > > system catalog tables for tableam scan APIs rather than at the heap > > level? There might be some APIs like heap_getnext where such a check > > might still be required but I guess it is still better to block at > > tableam level. > > > > [1] - https://www.postgresql.org/message-id/20180726200241.aje4dv4jsv25v4k2%40alap3.anarazel.de > > Okay, let me analyze this part. Because someplace we have to keep at > heap level like heap_getnext and other places at tableam level so it > seems a bit inconsistent. Also, I think the number of checks might > going to increase because some of the heap functions like > heap_hot_search_buffer are being called from multiple tableam calls, > so we need to put check at every place. > > Another point is that I feel some of the checks what we have today > might not be required like heap_finish_speculative, is not fetching > any tuple for us so why do we need to care about this function? > Yeah, I don't see the need for such a check (or Assertion) in heap_finish_speculative. One additional comment: --------------------------------------- - Any actions leading to transaction ID assignment are prohibited. That, among others, + Note that access to user catalog tables or regular system catalog tables + in the output plugins has to be done via the <literal>systable_*</literal> scan APIs only. 
+ Access via the <literal>heap_*</literal> scan APIs will error out. + Additionally, any actions leading to transaction ID assignment are prohibited. That, among others, The above text doesn't seem to be aligned properly and you need to update it if we want to change the error to Assertion for heap APIs [1] - https://wiki.postgresql.org/wiki/Logical_Decoding_Plugins -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
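For illustration, blocking at the tableam level might look roughly like the following: the condition is the one from the heap_fetch hunk quoted earlier in this exchange, hoisted into a wrapper modelled on the existing table_scan_getnextslot() inline function in tableam.h. CheckXidAlive comes from the 0004 patch; the wrapper name and error text are made up for this sketch:

#include "postgres.h"

#include "access/tableam.h"
#include "access/transam.h"
#include "catalog/catalog.h"
#include "executor/tuptable.h"
#include "utils/rel.h"

extern TransactionId CheckXidAlive;	/* introduced by the 0004 patch */

/*
 * Illustrative sketch: perform the CheckXidAlive sanity check once in the
 * tableam wrapper instead of in each heap_* routine, so every table access
 * method is covered in one place.
 */
static inline bool
table_scan_getnextslot_checked(TableScanDesc sscan, ScanDirection direction,
							   TupleTableSlot *slot)
{
	Relation	rel = sscan->rs_rd;

	/* CheckXidAlive is only set while decoding an in-progress transaction */
	if (unlikely(TransactionIdIsValid(CheckXidAlive) &&
				 !(IsCatalogRelation(rel) ||
				   RelationIsUsedAsCatalogTable(rel))))
		elog(ERROR, "improper non-catalog table access during logical decoding");

	slot->tts_tableOid = RelationGetRelid(rel);
	return rel->rd_tableam->scan_getnextslot(sscan, direction, slot);
}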
Re: PATCH: logical_work_mem and logical streaming of largein-progress transactions
From
Mahendra Singh Thalor
Date:
On Fri, 24 Apr 2020 at 11:55, Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Thu, Apr 23, 2020 at 2:28 PM Erik Rijkers <er@xs4all.nl> wrote: > > > > On 2020-04-23 05:24, Dilip Kumar wrote: > > > On Wed, Apr 22, 2020 at 9:31 PM Erik Rijkers <er@xs4all.nl> wrote: > > >> > > >> The 'ddl' one is apparently not quite fixed - I get this in (cd > > >> contrib; make check)' (in both assert-enabled and non-assert-enabled > > >> build) > > > > > > Can you send me the contrib/test_decoding/regression.diffs file? > > > > Attached. > > So from regression.diff, it appears that in failing in memory > allocation (+ERROR: invalid memory alloc request size > 94119198201896). My colleague tried to reproduce this in a different > environment but there is no success so far. One more thing surprises > me is that after > (v15-0011-Provide-new-api-to-get-the-streaming-changes.patch) > actually, it should never go for the streaming path. However, we can > not ignore the fact that some of the changes might impact the > non-streaming path as well. Is it possible for you to somehow stop or > break the code and send the stack trace? One idea is by seeing the > log we can see from where the error is raised i.e MemoryContextAlloc > or palloc or some other similar function. Once we know that we can > convert that error to an assert and find the call stack. > > -- Thanks Erik for reporting this issue. I am able to reproduce this issue(+ERROR: invalid memory alloc request size) on the top of v16 patch set. I applied all patches(12 patches) of v16 series and then I fired "make check -i" from "contrib/test_decoding" folder. Below is stack trace of error: #0 0x0000560b1350902d in MemoryContextAlloc (context=0x560b14188d70, size=94605581787992) at mcxt.c:806 #1 0x0000560b130f0ad5 in ReorderBufferRestoreChange (rb=0x560b14188e90, txn=0x560b141baf08, data=0x560b1418a5e8 "K") at reorderbuffer.c:3680 #2 0x0000560b130f0662 in ReorderBufferRestoreChanges (rb=0x560b14188e90, txn=0x560b141baf08, file=0x560b1418ad10, segno=0x560b1418ad20) at reorderbuffer.c:3564 #3 0x0000560b130e918a in ReorderBufferIterTXNInit (rb=0x560b14188e90, txn=0x560b141baf08, iter_state=0x7ffef18b1600) at reorderbuffer.c:1186 #4 0x0000560b130eaee1 in ReorderBufferProcessTXN (rb=0x560b14188e90, txn=0x560b141baf08, commit_lsn=25986584, snapshot_now=0x560b141b74d8, command_id=0, streaming=false) at reorderbuffer.c:1785 #5 0x0000560b130ecae1 in ReorderBufferCommit (rb=0x560b14188e90, xid=508, commit_lsn=25986584, end_lsn=25989088, commit_time=641449268431600, origin_id=0, origin_lsn=0) at reorderbuffer.c:2315 #6 0x0000560b130d14a1 in DecodeCommit (ctx=0x560b1416ea80, buf=0x7ffef18b19b0, parsed=0x7ffef18b1850, xid=508) at decode.c:654 #7 0x0000560b130cff98 in DecodeXactOp (ctx=0x560b1416ea80, buf=0x7ffef18b19b0) at decode.c:261 #8 0x0000560b130cf99a in LogicalDecodingProcessRecord (ctx=0x560b1416ea80, record=0x560b1416ee00) at decode.c:130 #9 0x0000560b130dbbbc in pg_logical_slot_get_changes_guts (fcinfo=0x560b1417ee50, confirm=true, binary=false, streaming=false) at logicalfuncs.c:285 #10 0x0000560b130dbe71 in pg_logical_slot_get_changes (fcinfo=0x560b1417ee50) at logicalfuncs.c:354 #11 0x0000560b12e294d4 in ExecMakeTableFunctionResult (setexpr=0x560b14177838, econtext=0x560b14177748, argContext=0x560b1417ed30, expectedDesc=0x560b141814a0, randomAccess=false) at execSRF.c:234 #12 0x0000560b12e5490f in FunctionNext (node=0x560b14177630) at nodeFunctionscan.c:94 #13 0x0000560b12e2c108 in ExecScanFetch (node=0x560b14177630, accessMtd=0x560b12e54836 <FunctionNext>, 
recheckMtd=0x560b12e54e15 <FunctionRecheck>) at execScan.c:133 #14 0x0000560b12e2c227 in ExecScan (node=0x560b14177630, accessMtd=0x560b12e54836 <FunctionNext>, recheckMtd=0x560b12e54e15 <FunctionRecheck>) at execScan.c:199 #15 0x0000560b12e54e9b in ExecFunctionScan (pstate=0x560b14177630) at nodeFunctionscan.c:270 #16 0x0000560b12e24e23 in ExecProcNodeFirst (node=0x560b14177630) at execProcnode.c:450 #17 0x0000560b12e3e172 in ExecProcNode (node=0x560b14177630) at ../../../src/include/executor/executor.h:245 #18 0x0000560b12e3e998 in fetch_input_tuple (aggstate=0x560b14176f40) at nodeAgg.c:566 #19 0x0000560b12e4398f in agg_fill_hash_table (aggstate=0x560b14176f40) at nodeAgg.c:2518 #20 0x0000560b12e42c9a in ExecAgg (pstate=0x560b14176f40) at nodeAgg.c:2139 #21 0x0000560b12e24e23 in ExecProcNodeFirst (node=0x560b14176f40) at execProcnode.c:450 #22 0x0000560b12e8bb58 in ExecProcNode (node=0x560b14176f40) at ../../../src/include/executor/executor.h:245 #23 0x0000560b12e8bd59 in ExecSort (pstate=0x560b14176d28) at nodeSort.c:108 #24 0x0000560b12e24e23 in ExecProcNodeFirst (node=0x560b14176d28) at execProcnode.c:450 #25 0x0000560b12e10e71 in ExecProcNode (node=0x560b14176d28) at ../../../src/include/executor/executor.h:245 #26 0x0000560b12e15c4c in ExecutePlan (estate=0x560b14176af0, planstate=0x560b14176d28, use_parallel_mode=false, operation=CMD_SELECT, sendTuples=true, numberTuples=0, direction=ForwardScanDirection, dest=0x560b1419d188, execute_once=true) at execMain.c:1646 #27 0x0000560b12e11a19 in standard_ExecutorRun (queryDesc=0x560b1412db10, direction=ForwardScanDirection, count=0, execute_once=true) at execMain.c:364 #28 0x0000560b12e116e1 in ExecutorRun (queryDesc=0x560b1412db10, direction=ForwardScanDirection, count=0, execute_once=true) at execMain.c:308 #29 0x0000560b131f2177 in PortalRunSelect (portal=0x560b140db860, forward=true, count=0, dest=0x560b1419d188) at pquery.c:912 #30 0x0000560b131f1b14 in PortalRun (portal=0x560b140db860, count=9223372036854775807, isTopLevel=true, run_once=true, dest=0x560b1419d188, altdest=0x560b1419d188, qc=0x7ffef18b2350) at pquery.c:756 #31 0x0000560b131e550b in exec_simple_query ( query_string=0x560b14076720 "/ display results, but hide most of the output /\nSELECT count(*), min(data), max(data)\nFROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1')\nG"...) at postgres.c:1239 #32 0x0000560b131ee343 in PostgresMain (argc=1, argv=0x560b1409faa0, dbname=0x560b1409f858 "contrib_regression", username=0x560b1409f830 "mahendrathalor") at postgres.c:4315 #33 0x0000560b130a325b in BackendRun (port=0x560b14096880) at postmaster.c:4510 #34 0x0000560b130a22c3 in BackendStartup (port=0x560b14096880) at postmaster.c:4202 #35 0x0000560b1309a5cc in ServerLoop () at postmaster.c:1727 #36 0x0000560b130997c9 in PostmasterMain (argc=8, argv=0x560b1406f010) at postmaster.c:1400 #37 0x0000560b12ee9530 in main (argc=8, argv=0x560b1406f010) at main.c:210 I have Ubuntu setup. I think, this is reproducing into Ubuntu only. I am looking into this issue with Dilip. -- Thanks and Regards Mahendra Singh Thalor EnterpriseDB: http://www.enterprisedb.com
Attachment
Re: PATCH: logical_work_mem and logical streaming of largein-progress transactions
From
Mahendra Singh Thalor
Date:
On Wed, 29 Apr 2020 at 11:15, Mahendra Singh Thalor <mahi6run@gmail.com> wrote: > > On Fri, 24 Apr 2020 at 11:55, Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > On Thu, Apr 23, 2020 at 2:28 PM Erik Rijkers <er@xs4all.nl> wrote: > > > > > > On 2020-04-23 05:24, Dilip Kumar wrote: > > > > On Wed, Apr 22, 2020 at 9:31 PM Erik Rijkers <er@xs4all.nl> wrote: > > > >> > > > >> The 'ddl' one is apparently not quite fixed - I get this in (cd > > > >> contrib; make check)' (in both assert-enabled and non-assert-enabled > > > >> build) > > > > > > > > Can you send me the contrib/test_decoding/regression.diffs file? > > > > > > Attached. > > > > So from regression.diff, it appears that in failing in memory > > allocation (+ERROR: invalid memory alloc request size > > 94119198201896). My colleague tried to reproduce this in a different > > environment but there is no success so far. One more thing surprises > > me is that after > > (v15-0011-Provide-new-api-to-get-the-streaming-changes.patch) > > actually, it should never go for the streaming path. However, we can > > not ignore the fact that some of the changes might impact the > > non-streaming path as well. Is it possible for you to somehow stop or > > break the code and send the stack trace? One idea is by seeing the > > log we can see from where the error is raised i.e MemoryContextAlloc > > or palloc or some other similar function. Once we know that we can > > convert that error to an assert and find the call stack. > > > > -- > > Thanks Erik for reporting this issue. > > I am able to reproduce this issue(+ERROR: invalid memory alloc > request size) on the top of v16 patch set. I applied all patches(12 > patches) of v16 series and then I fired "make check -i" from > "contrib/test_decoding" folder. Below is stack trace of error: > > #0 0x0000560b1350902d in MemoryContextAlloc (context=0x560b14188d70, > size=94605581787992) at mcxt.c:806 > #1 0x0000560b130f0ad5 in ReorderBufferRestoreChange > (rb=0x560b14188e90, txn=0x560b141baf08, data=0x560b1418a5e8 "K") at > reorderbuffer.c:3680 > #2 0x0000560b130f0662 in ReorderBufferRestoreChanges > (rb=0x560b14188e90, txn=0x560b141baf08, file=0x560b1418ad10, > segno=0x560b1418ad20) at reorderbuffer.c:3564 > #3 0x0000560b130e918a in ReorderBufferIterTXNInit (rb=0x560b14188e90, > txn=0x560b141baf08, iter_state=0x7ffef18b1600) at reorderbuffer.c:1186 > #4 0x0000560b130eaee1 in ReorderBufferProcessTXN (rb=0x560b14188e90, > txn=0x560b141baf08, commit_lsn=25986584, snapshot_now=0x560b141b74d8, > command_id=0, streaming=false) > at reorderbuffer.c:1785 > #5 0x0000560b130ecae1 in ReorderBufferCommit (rb=0x560b14188e90, > xid=508, commit_lsn=25986584, end_lsn=25989088, > commit_time=641449268431600, origin_id=0, origin_lsn=0) > at reorderbuffer.c:2315 > #6 0x0000560b130d14a1 in DecodeCommit (ctx=0x560b1416ea80, > buf=0x7ffef18b19b0, parsed=0x7ffef18b1850, xid=508) at decode.c:654 > #7 0x0000560b130cff98 in DecodeXactOp (ctx=0x560b1416ea80, > buf=0x7ffef18b19b0) at decode.c:261 > #8 0x0000560b130cf99a in LogicalDecodingProcessRecord > (ctx=0x560b1416ea80, record=0x560b1416ee00) at decode.c:130 > #9 0x0000560b130dbbbc in pg_logical_slot_get_changes_guts > (fcinfo=0x560b1417ee50, confirm=true, binary=false, streaming=false) > at logicalfuncs.c:285 > #10 0x0000560b130dbe71 in pg_logical_slot_get_changes > (fcinfo=0x560b1417ee50) at logicalfuncs.c:354 > #11 0x0000560b12e294d4 in ExecMakeTableFunctionResult > (setexpr=0x560b14177838, econtext=0x560b14177748, > argContext=0x560b1417ed30, expectedDesc=0x560b141814a0, > 
randomAccess=false) at execSRF.c:234 > #12 0x0000560b12e5490f in FunctionNext (node=0x560b14177630) at > nodeFunctionscan.c:94 > #13 0x0000560b12e2c108 in ExecScanFetch (node=0x560b14177630, > accessMtd=0x560b12e54836 <FunctionNext>, recheckMtd=0x560b12e54e15 > <FunctionRecheck>) at execScan.c:133 > #14 0x0000560b12e2c227 in ExecScan (node=0x560b14177630, > accessMtd=0x560b12e54836 <FunctionNext>, recheckMtd=0x560b12e54e15 > <FunctionRecheck>) at execScan.c:199 > #15 0x0000560b12e54e9b in ExecFunctionScan (pstate=0x560b14177630) at > nodeFunctionscan.c:270 > #16 0x0000560b12e24e23 in ExecProcNodeFirst (node=0x560b14177630) at > execProcnode.c:450 > #17 0x0000560b12e3e172 in ExecProcNode (node=0x560b14177630) at > ../../../src/include/executor/executor.h:245 > #18 0x0000560b12e3e998 in fetch_input_tuple (aggstate=0x560b14176f40) > at nodeAgg.c:566 > #19 0x0000560b12e4398f in agg_fill_hash_table > (aggstate=0x560b14176f40) at nodeAgg.c:2518 > #20 0x0000560b12e42c9a in ExecAgg (pstate=0x560b14176f40) at nodeAgg.c:2139 > #21 0x0000560b12e24e23 in ExecProcNodeFirst (node=0x560b14176f40) at > execProcnode.c:450 > #22 0x0000560b12e8bb58 in ExecProcNode (node=0x560b14176f40) at > ../../../src/include/executor/executor.h:245 > #23 0x0000560b12e8bd59 in ExecSort (pstate=0x560b14176d28) at nodeSort.c:108 > #24 0x0000560b12e24e23 in ExecProcNodeFirst (node=0x560b14176d28) at > execProcnode.c:450 > #25 0x0000560b12e10e71 in ExecProcNode (node=0x560b14176d28) at > ../../../src/include/executor/executor.h:245 > #26 0x0000560b12e15c4c in ExecutePlan (estate=0x560b14176af0, > planstate=0x560b14176d28, use_parallel_mode=false, > operation=CMD_SELECT, sendTuples=true, numberTuples=0, > direction=ForwardScanDirection, dest=0x560b1419d188, > execute_once=true) at execMain.c:1646 > #27 0x0000560b12e11a19 in standard_ExecutorRun > (queryDesc=0x560b1412db10, direction=ForwardScanDirection, count=0, > execute_once=true) at execMain.c:364 > #28 0x0000560b12e116e1 in ExecutorRun (queryDesc=0x560b1412db10, > direction=ForwardScanDirection, count=0, execute_once=true) at > execMain.c:308 > #29 0x0000560b131f2177 in PortalRunSelect (portal=0x560b140db860, > forward=true, count=0, dest=0x560b1419d188) at pquery.c:912 > #30 0x0000560b131f1b14 in PortalRun (portal=0x560b140db860, > count=9223372036854775807, isTopLevel=true, run_once=true, > dest=0x560b1419d188, altdest=0x560b1419d188, > qc=0x7ffef18b2350) at pquery.c:756 > #31 0x0000560b131e550b in exec_simple_query ( > query_string=0x560b14076720 "/ display results, but hide most of the > output /\nSELECT count(*), min(data), max(data)\nFROM > pg_logical_slot_get_changes('regression_slot', NULL, NULL, > 'include-xids', '0', 'skip-empty-xacts', '1')\nG"...) at > postgres.c:1239 > #32 0x0000560b131ee343 in PostgresMain (argc=1, argv=0x560b1409faa0, > dbname=0x560b1409f858 "contrib_regression", username=0x560b1409f830 > "mahendrathalor") at postgres.c:4315 > #33 0x0000560b130a325b in BackendRun (port=0x560b14096880) at postmaster.c:4510 > #34 0x0000560b130a22c3 in BackendStartup (port=0x560b14096880) at > postmaster.c:4202 > #35 0x0000560b1309a5cc in ServerLoop () at postmaster.c:1727 > #36 0x0000560b130997c9 in PostmasterMain (argc=8, argv=0x560b1406f010) > at postmaster.c:1400 > #37 0x0000560b12ee9530 in main (argc=8, argv=0x560b1406f010) at main.c:210 > > I have Ubuntu setup. I think, this is reproducing into Ubuntu only. I > am looking into this issue with Dilip. This error is due to invalid size. 
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index eed9a5048b..487c1b4252 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -3678,7 +3678,7 @@ ReorderBufferRestoreChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 			change->data.inval.invalidations =
 				MemoryContextAlloc(rb->context,
-								   change->data.msg.message_size);
+								   inval_size);
 			/* read the message */
 			memcpy(change->data.inval.invalidations, data, inval_size);
 			data += inval_size;

The above change fixes the error: the allocation size was taken from change->data.msg.message_size, a different member of the change data union that holds an unrelated value for an invalidation change, instead of the inval_size just read from the spill file, which is where the bogus alloc request size came from. Thanks Dilip for helping.

-- Thanks and Regards Mahendra Singh Thalor EnterpriseDB: http://www.enterprisedb.com
Re: PATCH: logical_work_mem and logical streaming of largein-progress transactions
From
Dilip Kumar
Date:
On Wed, Apr 29, 2020 at 12:37 PM Mahendra Singh Thalor <mahi6run@gmail.com> wrote: > > On Wed, 29 Apr 2020 at 11:15, Mahendra Singh Thalor <mahi6run@gmail.com> wrote: > > > > On Fri, 24 Apr 2020 at 11:55, Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > > > On Thu, Apr 23, 2020 at 2:28 PM Erik Rijkers <er@xs4all.nl> wrote: > > > > > > > > On 2020-04-23 05:24, Dilip Kumar wrote: > > > > > On Wed, Apr 22, 2020 at 9:31 PM Erik Rijkers <er@xs4all.nl> wrote: > > > > >> > > > > >> The 'ddl' one is apparently not quite fixed - I get this in (cd > > > > >> contrib; make check)' (in both assert-enabled and non-assert-enabled > > > > >> build) > > > > > > > > > > Can you send me the contrib/test_decoding/regression.diffs file? > > > > > > > > Attached. > > > > > > So from regression.diff, it appears that in failing in memory > > > allocation (+ERROR: invalid memory alloc request size > > > 94119198201896). My colleague tried to reproduce this in a different > > > environment but there is no success so far. One more thing surprises > > > me is that after > > > (v15-0011-Provide-new-api-to-get-the-streaming-changes.patch) > > > actually, it should never go for the streaming path. However, we can > > > not ignore the fact that some of the changes might impact the > > > non-streaming path as well. Is it possible for you to somehow stop or > > > break the code and send the stack trace? One idea is by seeing the > > > log we can see from where the error is raised i.e MemoryContextAlloc > > > or palloc or some other similar function. Once we know that we can > > > convert that error to an assert and find the call stack. > > > > > > -- > > > > Thanks Erik for reporting this issue. > > > > I am able to reproduce this issue(+ERROR: invalid memory alloc > > request size) on the top of v16 patch set. I applied all patches(12 > > patches) of v16 series and then I fired "make check -i" from > > "contrib/test_decoding" folder. 
Below is stack trace of error: > > > > #0 0x0000560b1350902d in MemoryContextAlloc (context=0x560b14188d70, > > size=94605581787992) at mcxt.c:806 > > #1 0x0000560b130f0ad5 in ReorderBufferRestoreChange > > (rb=0x560b14188e90, txn=0x560b141baf08, data=0x560b1418a5e8 "K") at > > reorderbuffer.c:3680 > > #2 0x0000560b130f0662 in ReorderBufferRestoreChanges > > (rb=0x560b14188e90, txn=0x560b141baf08, file=0x560b1418ad10, > > segno=0x560b1418ad20) at reorderbuffer.c:3564 > > #3 0x0000560b130e918a in ReorderBufferIterTXNInit (rb=0x560b14188e90, > > txn=0x560b141baf08, iter_state=0x7ffef18b1600) at reorderbuffer.c:1186 > > #4 0x0000560b130eaee1 in ReorderBufferProcessTXN (rb=0x560b14188e90, > > txn=0x560b141baf08, commit_lsn=25986584, snapshot_now=0x560b141b74d8, > > command_id=0, streaming=false) > > at reorderbuffer.c:1785 > > #5 0x0000560b130ecae1 in ReorderBufferCommit (rb=0x560b14188e90, > > xid=508, commit_lsn=25986584, end_lsn=25989088, > > commit_time=641449268431600, origin_id=0, origin_lsn=0) > > at reorderbuffer.c:2315 > > #6 0x0000560b130d14a1 in DecodeCommit (ctx=0x560b1416ea80, > > buf=0x7ffef18b19b0, parsed=0x7ffef18b1850, xid=508) at decode.c:654 > > #7 0x0000560b130cff98 in DecodeXactOp (ctx=0x560b1416ea80, > > buf=0x7ffef18b19b0) at decode.c:261 > > #8 0x0000560b130cf99a in LogicalDecodingProcessRecord > > (ctx=0x560b1416ea80, record=0x560b1416ee00) at decode.c:130 > > #9 0x0000560b130dbbbc in pg_logical_slot_get_changes_guts > > (fcinfo=0x560b1417ee50, confirm=true, binary=false, streaming=false) > > at logicalfuncs.c:285 > > #10 0x0000560b130dbe71 in pg_logical_slot_get_changes > > (fcinfo=0x560b1417ee50) at logicalfuncs.c:354 > > #11 0x0000560b12e294d4 in ExecMakeTableFunctionResult > > (setexpr=0x560b14177838, econtext=0x560b14177748, > > argContext=0x560b1417ed30, expectedDesc=0x560b141814a0, > > randomAccess=false) at execSRF.c:234 > > #12 0x0000560b12e5490f in FunctionNext (node=0x560b14177630) at > > nodeFunctionscan.c:94 > > #13 0x0000560b12e2c108 in ExecScanFetch (node=0x560b14177630, > > accessMtd=0x560b12e54836 <FunctionNext>, recheckMtd=0x560b12e54e15 > > <FunctionRecheck>) at execScan.c:133 > > #14 0x0000560b12e2c227 in ExecScan (node=0x560b14177630, > > accessMtd=0x560b12e54836 <FunctionNext>, recheckMtd=0x560b12e54e15 > > <FunctionRecheck>) at execScan.c:199 > > #15 0x0000560b12e54e9b in ExecFunctionScan (pstate=0x560b14177630) at > > nodeFunctionscan.c:270 > > #16 0x0000560b12e24e23 in ExecProcNodeFirst (node=0x560b14177630) at > > execProcnode.c:450 > > #17 0x0000560b12e3e172 in ExecProcNode (node=0x560b14177630) at > > ../../../src/include/executor/executor.h:245 > > #18 0x0000560b12e3e998 in fetch_input_tuple (aggstate=0x560b14176f40) > > at nodeAgg.c:566 > > #19 0x0000560b12e4398f in agg_fill_hash_table > > (aggstate=0x560b14176f40) at nodeAgg.c:2518 > > #20 0x0000560b12e42c9a in ExecAgg (pstate=0x560b14176f40) at nodeAgg.c:2139 > > #21 0x0000560b12e24e23 in ExecProcNodeFirst (node=0x560b14176f40) at > > execProcnode.c:450 > > #22 0x0000560b12e8bb58 in ExecProcNode (node=0x560b14176f40) at > > ../../../src/include/executor/executor.h:245 > > #23 0x0000560b12e8bd59 in ExecSort (pstate=0x560b14176d28) at nodeSort.c:108 > > #24 0x0000560b12e24e23 in ExecProcNodeFirst (node=0x560b14176d28) at > > execProcnode.c:450 > > #25 0x0000560b12e10e71 in ExecProcNode (node=0x560b14176d28) at > > ../../../src/include/executor/executor.h:245 > > #26 0x0000560b12e15c4c in ExecutePlan (estate=0x560b14176af0, > > planstate=0x560b14176d28, use_parallel_mode=false, > > 
operation=CMD_SELECT, sendTuples=true, numberTuples=0, > > direction=ForwardScanDirection, dest=0x560b1419d188, > > execute_once=true) at execMain.c:1646 > > #27 0x0000560b12e11a19 in standard_ExecutorRun > > (queryDesc=0x560b1412db10, direction=ForwardScanDirection, count=0, > > execute_once=true) at execMain.c:364 > > #28 0x0000560b12e116e1 in ExecutorRun (queryDesc=0x560b1412db10, > > direction=ForwardScanDirection, count=0, execute_once=true) at > > execMain.c:308 > > #29 0x0000560b131f2177 in PortalRunSelect (portal=0x560b140db860, > > forward=true, count=0, dest=0x560b1419d188) at pquery.c:912 > > #30 0x0000560b131f1b14 in PortalRun (portal=0x560b140db860, > > count=9223372036854775807, isTopLevel=true, run_once=true, > > dest=0x560b1419d188, altdest=0x560b1419d188, > > qc=0x7ffef18b2350) at pquery.c:756 > > #31 0x0000560b131e550b in exec_simple_query ( > > query_string=0x560b14076720 "/ display results, but hide most of the > > output /\nSELECT count(*), min(data), max(data)\nFROM > > pg_logical_slot_get_changes('regression_slot', NULL, NULL, > > 'include-xids', '0', 'skip-empty-xacts', '1')\nG"...) at > > postgres.c:1239 > > #32 0x0000560b131ee343 in PostgresMain (argc=1, argv=0x560b1409faa0, > > dbname=0x560b1409f858 "contrib_regression", username=0x560b1409f830 > > "mahendrathalor") at postgres.c:4315 > > #33 0x0000560b130a325b in BackendRun (port=0x560b14096880) at postmaster.c:4510 > > #34 0x0000560b130a22c3 in BackendStartup (port=0x560b14096880) at > > postmaster.c:4202 > > #35 0x0000560b1309a5cc in ServerLoop () at postmaster.c:1727 > > #36 0x0000560b130997c9 in PostmasterMain (argc=8, argv=0x560b1406f010) > > at postmaster.c:1400 > > #37 0x0000560b12ee9530 in main (argc=8, argv=0x560b1406f010) at main.c:210 > > > > I have Ubuntu setup. I think, this is reproducing into Ubuntu only. I > > am looking into this issue with Dilip. > > This error is due to invalid size. > > diff --git a/src/backend/replication/logical/reorderbuffer.c > b/src/backend/replication/logical/reorderbuffer.c > index eed9a5048b..487c1b4252 100644 > --- a/src/backend/replication/logical/reorderbuffer.c > +++ b/src/backend/replication/logical/reorderbuffer.c > @@ -3678,7 +3678,7 @@ ReorderBufferRestoreChange(ReorderBuffer *rb, > ReorderBufferTXN *txn, > > change->data.inval.invalidations = > MemoryContextAlloc(rb->context, > - > change->data.msg.message_size); > + > inval_size); > /* read the message */ > > memcpy(change->data.inval.invalidations, data, inval_size); > data += inval_size; > > Above change, fixes the error. Thanks Dilip for helping. Thanks, Mahendra for reproducing and help in fixing this. I will include this change in my next patch set.
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Dilip Kumar
Date:
On Tue, Apr 28, 2020 at 3:55 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Tue, Apr 28, 2020 at 3:11 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > On Mon, Apr 27, 2020 at 4:05 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > > [latest patches] > > > > v16-0004-Gracefully-handle-concurrent-aborts-of-uncommitt > > - Any actions leading to transaction ID assignment are prohibited. > > That, among others, > > + Note that access to user catalog tables or regular system catalog tables > > + in the output plugins has to be done via the > > <literal>systable_*</literal> scan APIs only. > > + Access via the <literal>heap_*</literal> scan APIs will error out. > > + Additionally, any actions leading to transaction ID assignment > > are prohibited. That, among others, > > .. > > @@ -1383,6 +1392,14 @@ heap_fetch(Relation relation, > > bool valid; > > > > /* > > + * We don't expect direct calls to heap_fetch with valid > > + * CheckXidAlive for regular tables. Track that below. > > + */ > > + if (unlikely(TransactionIdIsValid(CheckXidAlive) && > > + !(IsCatalogRelation(relation) || RelationIsUsedAsCatalogTable(relation)))) > > + elog(ERROR, "unexpected heap_fetch call during logical decoding"); > > + > > > > I think comments and code don't match. In the comment, we are saying > > that via output plugins access to user catalog tables or regular > > system catalog tables won't be allowed via heap_* APIs but code > > doesn't seem to reflect it. I feel only > > TransactionIdIsValid(CheckXidAlive) is sufficient here. See, the > > original discussion about this point [1] (Refer "I think it'd also be > > good to add assertions to codepaths not going through systable_* > > asserting that ..."). > > Right, So I think we can just add an assert in these function that > Assert(!TransactionIdIsValid(CheckXidAlive)) ? > > > > > Isn't it better to block the scan to user catalog tables or regular > > system catalog tables for tableam scan APIs rather than at the heap > > level? There might be some APIs like heap_getnext where such a check > > might still be required but I guess it is still better to block at > > tableam level. > > > > [1] - https://www.postgresql.org/message-id/20180726200241.aje4dv4jsv25v4k2%40alap3.anarazel.de > > Okay, let me analyze this part. Because someplace we have to keep at > heap level like heap_getnext and other places at tableam level so it > seems a bit inconsistent. Also, I think the number of checks might > going to increase because some of the heap functions like > heap_hot_search_buffer are being called from multiple tableam calls, > so we need to put check at every place. > > Another point is that I feel some of the checks what we have today > might not be required like heap_finish_speculative, is not fetching > any tuple for us so why do we need to care about this function? While testing these changes, I have noticed that the systable_* APIs internally, calls tableam apis and so if we just put assert Assert(!TransactionIdIsValid(CheckXidAlive)) then it will always hit that assert. Whether we put these assert in heap APIs or the tableam APIs because systable_ always access heap through tableam APIs. 
Refer below callstack #0 table_index_fetch_tuple (scan=0x2392558, tid=0x2392270, snapshot=0x2392178, slot=0x2391f60, call_again=0x2392276, all_dead=0x7fff4b6cc89e) at ../../../../src/include/access/tableam.h:1035 #1 0x00000000005100b6 in index_fetch_heap (scan=0x2392210, slot=0x2391f60) at indexam.c:577 #2 0x00000000005101ea in index_getnext_slot (scan=0x2392210, direction=ForwardScanDirection, slot=0x2391f60) at indexam.c:637 #3 0x000000000050e8f9 in systable_getnext (sysscan=0x2391f08) at genam.c:474 #4 0x0000000000aa44a2 in RelidByRelfilenode (reltablespace=0, relfilenode=16593) at relfilenodemap.c:213 #5 0x00000000008a64da in ReorderBufferProcessTXN (rb=0x23734b0, txn=0x2398e28, commit_lsn=23953600, snapshot_now=0x237b168, command_id=0, streaming=false) at reorderbuffer.c:1823 #6 0x00000000008a7201 in ReorderBufferCommit (rb=0x23734b0, xid=518, commit_lsn=23953600, end_lsn=23953648, commit_time=641466886013448, origin_id=0, origin_lsn=0) at reorderbuffer.c:2315 #7 0x00000000008985b1 in DecodeCommit (ctx=0x22e16a0, buf=0x7fff4b6cce30, parsed=0x7fff4b6ccca0, xid=518) at decode.c:654 #8 0x0000000000897a76 in DecodeXactOp (ctx=0x22e16a0, buf=0x7fff4b6cce30) at decode.c:261 #9 0x0000000000897739 in LogicalDecodingProcessRecord (ctx=0x22e16a0, record=0x22e19a0) at decode.c:130 So basically, the problem is that we can not distinguish whether the tableam/heap routine is called directly or via systable_*. Now I understand the current code was actually giving error for the user table not the system table with the assumption that the system table will come to this function only via systable_*. Only user table can come directly. So if this is not a system table i.e. we reach here directly so error out. Now, I am not sure if it is not for the system table then what is the purpose of throwing that error? -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Dilip Kumar
Date:
On Wed, Apr 29, 2020 at 2:56 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Tue, Apr 28, 2020 at 3:55 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > On Tue, Apr 28, 2020 at 3:11 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > On Mon, Apr 27, 2020 at 4:05 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > > > > [latest patches] > > > > > > v16-0004-Gracefully-handle-concurrent-aborts-of-uncommitt > > > - Any actions leading to transaction ID assignment are prohibited. > > > That, among others, > > > + Note that access to user catalog tables or regular system catalog tables > > > + in the output plugins has to be done via the > > > <literal>systable_*</literal> scan APIs only. > > > + Access via the <literal>heap_*</literal> scan APIs will error out. > > > + Additionally, any actions leading to transaction ID assignment > > > are prohibited. That, among others, > > > .. > > > @@ -1383,6 +1392,14 @@ heap_fetch(Relation relation, > > > bool valid; > > > > > > /* > > > + * We don't expect direct calls to heap_fetch with valid > > > + * CheckXidAlive for regular tables. Track that below. > > > + */ > > > + if (unlikely(TransactionIdIsValid(CheckXidAlive) && > > > + !(IsCatalogRelation(relation) || RelationIsUsedAsCatalogTable(relation)))) > > > + elog(ERROR, "unexpected heap_fetch call during logical decoding"); > > > + > > > > > > I think comments and code don't match. In the comment, we are saying > > > that via output plugins access to user catalog tables or regular > > > system catalog tables won't be allowed via heap_* APIs but code > > > doesn't seem to reflect it. I feel only > > > TransactionIdIsValid(CheckXidAlive) is sufficient here. See, the > > > original discussion about this point [1] (Refer "I think it'd also be > > > good to add assertions to codepaths not going through systable_* > > > asserting that ..."). > > > > Right, So I think we can just add an assert in these function that > > Assert(!TransactionIdIsValid(CheckXidAlive)) ? > > > > > > > > Isn't it better to block the scan to user catalog tables or regular > > > system catalog tables for tableam scan APIs rather than at the heap > > > level? There might be some APIs like heap_getnext where such a check > > > might still be required but I guess it is still better to block at > > > tableam level. > > > > > > [1] - https://www.postgresql.org/message-id/20180726200241.aje4dv4jsv25v4k2%40alap3.anarazel.de > > > > Okay, let me analyze this part. Because someplace we have to keep at > > heap level like heap_getnext and other places at tableam level so it > > seems a bit inconsistent. Also, I think the number of checks might > > going to increase because some of the heap functions like > > heap_hot_search_buffer are being called from multiple tableam calls, > > so we need to put check at every place. > > > > Another point is that I feel some of the checks what we have today > > might not be required like heap_finish_speculative, is not fetching > > any tuple for us so why do we need to care about this function? > > While testing these changes, I have noticed that the systable_* APIs > internally, calls tableam apis and so if we just put assert > Assert(!TransactionIdIsValid(CheckXidAlive)) then it will always hit > that assert. Whether we put these assert in heap APIs or the tableam > APIs because systable_ always access heap through tableam APIs. 
> > Refer below callstack > #0 table_index_fetch_tuple (scan=0x2392558, tid=0x2392270, > snapshot=0x2392178, slot=0x2391f60, call_again=0x2392276, > all_dead=0x7fff4b6cc89e) > at ../../../../src/include/access/tableam.h:1035 > #1 0x00000000005100b6 in index_fetch_heap (scan=0x2392210, > slot=0x2391f60) at indexam.c:577 > #2 0x00000000005101ea in index_getnext_slot (scan=0x2392210, > direction=ForwardScanDirection, slot=0x2391f60) at indexam.c:637 > #3 0x000000000050e8f9 in systable_getnext (sysscan=0x2391f08) at genam.c:474 > #4 0x0000000000aa44a2 in RelidByRelfilenode (reltablespace=0, > relfilenode=16593) at relfilenodemap.c:213 > #5 0x00000000008a64da in ReorderBufferProcessTXN (rb=0x23734b0, > txn=0x2398e28, commit_lsn=23953600, snapshot_now=0x237b168, > command_id=0, streaming=false) > at reorderbuffer.c:1823 > #6 0x00000000008a7201 in ReorderBufferCommit (rb=0x23734b0, xid=518, > commit_lsn=23953600, end_lsn=23953648, commit_time=641466886013448, > origin_id=0, origin_lsn=0) > at reorderbuffer.c:2315 > #7 0x00000000008985b1 in DecodeCommit (ctx=0x22e16a0, > buf=0x7fff4b6cce30, parsed=0x7fff4b6ccca0, xid=518) at decode.c:654 > #8 0x0000000000897a76 in DecodeXactOp (ctx=0x22e16a0, > buf=0x7fff4b6cce30) at decode.c:261 > #9 0x0000000000897739 in LogicalDecodingProcessRecord (ctx=0x22e16a0, > record=0x22e19a0) at decode.c:130 > > So basically, the problem is that we can not distinguish whether the > tableam/heap routine is called directly or via systable_*. > > Now I understand the current code was actually giving error for the > user table not the system table with the assumption that the system > table will come to this function only via systable_*. Only user table > can come directly. So if this is not a system table i.e. we reach > here directly so error out. Now, I am not sure if it is not for the > system table then what is the purpose of throwing that error? Putting some more thought upon this, I am just wondering what do we really want any such check because, we are always getting relation description from the reorder buffer code, not from the pgoutput plugin. And, our main issue with the concurrent abort is that we shall not get the wrong catalog entry for decoding our tuple. So if we are always getting our relation entry using RelationIdGetRelation then why should we bother about how output plugin is accessing system/user relations? -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Amit Kapila
Date:
On Wed, Apr 29, 2020 at 3:19 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Wed, Apr 29, 2020 at 2:56 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > On Tue, Apr 28, 2020 at 3:55 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > > > On Tue, Apr 28, 2020 at 3:11 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > > On Mon, Apr 27, 2020 at 4:05 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > > > > > > [latest patches] > > > > > > > > v16-0004-Gracefully-handle-concurrent-aborts-of-uncommitt > > > > - Any actions leading to transaction ID assignment are prohibited. > > > > That, among others, > > > > + Note that access to user catalog tables or regular system catalog tables > > > > + in the output plugins has to be done via the > > > > <literal>systable_*</literal> scan APIs only. > > > > + Access via the <literal>heap_*</literal> scan APIs will error out. > > > > + Additionally, any actions leading to transaction ID assignment > > > > are prohibited. That, among others, > > > > .. > > > > @@ -1383,6 +1392,14 @@ heap_fetch(Relation relation, > > > > bool valid; > > > > > > > > /* > > > > + * We don't expect direct calls to heap_fetch with valid > > > > + * CheckXidAlive for regular tables. Track that below. > > > > + */ > > > > + if (unlikely(TransactionIdIsValid(CheckXidAlive) && > > > > + !(IsCatalogRelation(relation) || RelationIsUsedAsCatalogTable(relation)))) > > > > + elog(ERROR, "unexpected heap_fetch call during logical decoding"); > > > > + > > > > > > > > I think comments and code don't match. In the comment, we are saying > > > > that via output plugins access to user catalog tables or regular > > > > system catalog tables won't be allowed via heap_* APIs but code > > > > doesn't seem to reflect it. I feel only > > > > TransactionIdIsValid(CheckXidAlive) is sufficient here. See, the > > > > original discussion about this point [1] (Refer "I think it'd also be > > > > good to add assertions to codepaths not going through systable_* > > > > asserting that ..."). > > > > > > Right, So I think we can just add an assert in these function that > > > Assert(!TransactionIdIsValid(CheckXidAlive)) ? > > > > > > > > > > > Isn't it better to block the scan to user catalog tables or regular > > > > system catalog tables for tableam scan APIs rather than at the heap > > > > level? There might be some APIs like heap_getnext where such a check > > > > might still be required but I guess it is still better to block at > > > > tableam level. > > > > > > > > [1] - https://www.postgresql.org/message-id/20180726200241.aje4dv4jsv25v4k2%40alap3.anarazel.de > > > > > > Okay, let me analyze this part. Because someplace we have to keep at > > > heap level like heap_getnext and other places at tableam level so it > > > seems a bit inconsistent. Also, I think the number of checks might > > > going to increase because some of the heap functions like > > > heap_hot_search_buffer are being called from multiple tableam calls, > > > so we need to put check at every place. > > > > > > Another point is that I feel some of the checks what we have today > > > might not be required like heap_finish_speculative, is not fetching > > > any tuple for us so why do we need to care about this function? > > > > While testing these changes, I have noticed that the systable_* APIs > > internally, calls tableam apis and so if we just put assert > > Assert(!TransactionIdIsValid(CheckXidAlive)) then it will always hit > > that assert. 
Whether we put these assert in heap APIs or the tableam > > APIs because systable_ always access heap through tableam APIs. > > .. .. > > Putting some more thought upon this, I am just wondering what do we > really want any such check because, we are always getting relation > description from the reorder buffer code, not from the pgoutput > plugin. > But can't they access other catalogs like pg_publication*? I think the basic thing we want to ensure here is that all historic accesses always use the systable_* APIs to access catalogs. We can ensure that by having Asserts (or elog(ERROR, ...)) in the heap/tableam APIs. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Dilip Kumar
Date:
On Thu, Apr 30, 2020 at 12:31 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Wed, Apr 29, 2020 at 3:19 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > On Wed, Apr 29, 2020 at 2:56 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > > > On Tue, Apr 28, 2020 at 3:55 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > > > > > On Tue, Apr 28, 2020 at 3:11 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > > > > On Mon, Apr 27, 2020 at 4:05 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > > > > > > > > [latest patches] > > > > > > > > > > v16-0004-Gracefully-handle-concurrent-aborts-of-uncommitt > > > > > - Any actions leading to transaction ID assignment are prohibited. > > > > > That, among others, > > > > > + Note that access to user catalog tables or regular system catalog tables > > > > > + in the output plugins has to be done via the > > > > > <literal>systable_*</literal> scan APIs only. > > > > > + Access via the <literal>heap_*</literal> scan APIs will error out. > > > > > + Additionally, any actions leading to transaction ID assignment > > > > > are prohibited. That, among others, > > > > > .. > > > > > @@ -1383,6 +1392,14 @@ heap_fetch(Relation relation, > > > > > bool valid; > > > > > > > > > > /* > > > > > + * We don't expect direct calls to heap_fetch with valid > > > > > + * CheckXidAlive for regular tables. Track that below. > > > > > + */ > > > > > + if (unlikely(TransactionIdIsValid(CheckXidAlive) && > > > > > + !(IsCatalogRelation(relation) || RelationIsUsedAsCatalogTable(relation)))) > > > > > + elog(ERROR, "unexpected heap_fetch call during logical decoding"); > > > > > + > > > > > > > > > > I think comments and code don't match. In the comment, we are saying > > > > > that via output plugins access to user catalog tables or regular > > > > > system catalog tables won't be allowed via heap_* APIs but code > > > > > doesn't seem to reflect it. I feel only > > > > > TransactionIdIsValid(CheckXidAlive) is sufficient here. See, the > > > > > original discussion about this point [1] (Refer "I think it'd also be > > > > > good to add assertions to codepaths not going through systable_* > > > > > asserting that ..."). > > > > > > > > Right, So I think we can just add an assert in these function that > > > > Assert(!TransactionIdIsValid(CheckXidAlive)) ? > > > > > > > > > > > > > > Isn't it better to block the scan to user catalog tables or regular > > > > > system catalog tables for tableam scan APIs rather than at the heap > > > > > level? There might be some APIs like heap_getnext where such a check > > > > > might still be required but I guess it is still better to block at > > > > > tableam level. > > > > > > > > > > [1] - https://www.postgresql.org/message-id/20180726200241.aje4dv4jsv25v4k2%40alap3.anarazel.de > > > > > > > > Okay, let me analyze this part. Because someplace we have to keep at > > > > heap level like heap_getnext and other places at tableam level so it > > > > seems a bit inconsistent. Also, I think the number of checks might > > > > going to increase because some of the heap functions like > > > > heap_hot_search_buffer are being called from multiple tableam calls, > > > > so we need to put check at every place. > > > > > > > > Another point is that I feel some of the checks what we have today > > > > might not be required like heap_finish_speculative, is not fetching > > > > any tuple for us so why do we need to care about this function? 
> > > > > > While testing these changes, I have noticed that the systable_* APIs > > > internally, calls tableam apis and so if we just put assert > > > Assert(!TransactionIdIsValid(CheckXidAlive)) then it will always hit > > > that assert. Whether we put these assert in heap APIs or the tableam > > > APIs because systable_ always access heap through tableam APIs. > > > > .. > .. > > > > Putting some more thought upon this, I am just wondering what do we > > really want any such check because, we are always getting relation > > description from the reorder buffer code, not from the pgoutput > > plugin. > > > > But can't they access other catalogs like pg_publication*? I think > the basic thing we want to ensure here is that all historic accesses > always use systable* APIs to access catalogs. We can ensure that via > having Asserts (or elog(ERROR, ..) in heap/tableam APIs. Yeah, it can. So I have changed it now: along with CheckXidAlive, I have kept one more flag, so whenever CheckXidAlive is set and we pass through systable_beginscan we will set that flag. Then, while accessing the tableam APIs, we check that if CheckXidAlive is set the other flag must also be set; otherwise we throw an error. Apart from this, I have also fixed one defect raised by my colleague Neha Sharma: the incomplete-toast flag was not reset when the main table tuple was inserted through a speculative insert, and because of that the data was not streamed even when we later got the speculative confirm, since the flag was never reset. This patch also includes the fix for the issue raised by Erik. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Attachment
- v17-0003-Extend-the-output-plugin-API-with-stream-methods.patch
- v17-0001-Immediately-WAL-log-assignments.patch
- v17-0004-Gracefully-handle-concurrent-aborts-of-uncommitt.patch
- v17-0005-Implement-streaming-mode-in-ReorderBuffer.patch
- v17-0002-Issue-individual-invalidations-with-wal_level-lo.patch
- v17-0008-Enable-streaming-for-all-subscription-TAP-tests.patch
- v17-0007-Track-statistics-for-streaming.patch
- v17-0009-Add-TAP-test-for-streaming-vs.-DDL.patch
- v17-0010-Bugfix-handling-of-incomplete-toast-tuple.patch
- v17-0006-Add-support-for-streaming-to-built-in-replicatio.patch
- v17-0012-Add-streaming-option-in-pg_dump.patch
- v17-0011-Provide-new-api-to-get-the-streaming-changes.patch
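The message above describes the CheckXidAlive-plus-flag mechanism only in prose, so here is a rough consolidated sketch of how the pieces fit together, using the variable names that appear later in the thread (the flag is first called sysbegin_called and later renamed bsysscan). This is an illustration of the idea, not the patch's exact code.

/* snapmgr.c (sketch): state set up while decoding a possibly-aborting xact */
TransactionId CheckXidAlive = InvalidTransactionId;
bool		bsysscan = false;

/* genam.c, systable_beginscan() (sketch): remember that catalog access is
 * going through the systable_* layer, where concurrent aborts are handled */
if (TransactionIdIsValid(CheckXidAlive))
	bsysscan = true;

/* heapam.c / tableam scan routines (sketch): reject direct access while
 * decoding, since only the systable_* path checks for concurrent aborts */
if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
	elog(ERROR, "unexpected heap_getnext call during logical decoding");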
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Amit Kapila
Date:
On Fri, May 1, 2020 at 8:41 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Thu, Apr 30, 2020 at 12:31 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > But can't they access other catalogs like pg_publication*? I think > > the basic thing we want to ensure here is that all historic accesses > > always use systable* APIs to access catalogs. We can ensure that via > > having Asserts (or elog(ERROR, ..) in heap/tableam APIs. > > Yeah, it can. So I have changed it now, actually along with > CheckXidLive, I have kept one more flag so whenever CheckXidLive is > set and we pass through systable_beginscan we will set that flag. So > while accessing the tableam API we will set if CheckXidLive is set > then another flag must also be set otherwise we through an error. > Okay, I have reviewed these changes and below are my comments: Review of v17-0004-Gracefully-handle-concurrent-aborts-of-uncommitt -------------------------------------------------------------------- 1. + /* + * If CheckXidAlive is set then set a flag that this call is passed through + * systable_beginscan. See detailed comments at snapmgr.c where these + * variables are declared. + */ + if (TransactionIdIsValid(CheckXidAlive)) + sysbegin_called = true; a. How about calling this variable as bsysscan or sysscan instead of sysbegin_called? b. There is an extra space between detailed and comments. A similar change is required at other place where this comment is used. c. How about writing the first line as "If CheckXidAlive is set then set a flag to indicate that system table scan is in-progress." 2. - Any actions leading to transaction ID assignment are prohibited. That, among others, - includes writing to tables, performing DDL changes, and - calling <literal>pg_current_xact_id()</literal>. + Note that access to user catalog tables or regular system catalog tables in + the output plugins has to be done via the <literal>systable_*</literal> scan + APIs only. The user tables should not be accesed in the output plugins anyways. + Access via the <literal>heap_*</literal> scan APIs will error out. The line "The user tables should not be accesed in the output plugins anyways." seems a bit of out of place. I don't think this is required here. If you read the previous paragraph in the same document it is written: "Read only access to relations is permitted as long as only relations are accessed that either have been created by <command>initdb</command> in the <literal>pg_catalog</literal> schema, or have been marked as user provided catalog tables using ...". I think that is sufficient to convey the information that the newly added line by you is trying to convey. 3. + /* + * We don't expect direct calls to this routine when CheckXidAlive is a + * valid transaction id, this should only come through systable_* call. + * CheckXidAlive is set during logical decoding of a transactions. + */ + if (unlikely(TransactionIdIsValid(CheckXidAlive) && !sysbegin_called)) + elog(ERROR, "unexpected heap_getnext call during logical decoding"); How about changing this comment as "We don't expect direct calls to heap_getnext with valid CheckXidAlive for catalog or regular tables. See detailed comments at snapmgr.c where these variables are declared."? Change the similar comment used in other places in the patch. For this specific API, we can also say "Normally we have such a check at tableam level API but this is called from many places so we need to ensure it here." 4. + * If CheckXidAlive is valid, then we check if it aborted. 
If it did, we error + * out. We can't directly use TransactionIdDidAbort as after crash such + * transaction might not have been marked as aborted. See detailed comments + * at snapmgr.c where the variable is declared. + */ +static inline void +HandleConcurrentAbort() Can we change the comments as "Error out, if CheckXidAlive is aborted. We can't directly use TransactionIdDidAbort as after crash such transaction might not have been marked as aborted." After this add one empty line and then we can say something like: "This is a special API to check if CheckXidAlive is aborted in system table scan APIs. See detailed comments at snapmgr.c where the variable is declared." 5. Shouldn't we add a check in table_scan_sample_next_block and table_scan_sample_next_tuple APIs as well? 6. /* + * An xid value pointing to a possibly ongoing (sub)transaction. + * Currently used in logical decoding. It's possible that such transactions + * can get aborted while the decoding is ongoing. If CheckXidAlive is set + * then we will set sysbegin_called flag when we call systable_beginscan. This + * is to ensure that from the pgoutput plugin we should never directly access + * the tableam or heap apis because we are checking for the concurrent abort + * only in systable_* apis. + */ +TransactionId CheckXidAlive = InvalidTransactionId; +bool sysbegin_called = false; Can we change the above comment as "CheckXidAlive is a xid value pointing to a possibly ongoing (sub)transaction. Currently, it is used in logical decoding. It's possible that such transactions can get aborted while the decoding is ongoing in which case we skip decoding that particular transaction. To ensure that we check whether the CheckXidAlive is aborted after fetching the tuple from system tables. We also ensure that during logical decoding we never directly access the tableam or heap APIs because we are checking for the concurrent aborts only in systable_* APIs." > Apart from this, I have also fixed one defect raised by my colleague > Neha Sharma. That issue is the incomplete toast tuple flag was not > reset when the main table tuple was inserted through speculative > insert and due to that data was not streamed even if later we were > getting speculative confirm because incomplete toast flag was never > reset. This patch also includes the fix for the issue raised by Erik. > It would be better if you can mention which all patches contain the changes as it will be easier to review the fix. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
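For review point 4, a minimal sketch of the helper under discussion may make the intent clearer. The abort test follows the quoted comment (it deliberately avoids TransactionIdDidAbort and instead treats "not in progress and not committed" as aborted); the error message and errcode below are illustrative guesses, not the patch's exact wording.

/* Sketch of the helper discussed in review point 4 above. */
static inline void
HandleConcurrentAbort(void)
{
	/*
	 * If CheckXidAlive is set and that transaction is neither in progress
	 * nor committed, it must have aborted concurrently, so bail out.  We
	 * can't rely on TransactionIdDidAbort because, after a crash, aborted
	 * transactions may never have been marked as aborted.
	 */
	if (TransactionIdIsValid(CheckXidAlive) &&
		!TransactionIdIsInProgress(CheckXidAlive) &&
		!TransactionIdDidCommit(CheckXidAlive))
		ereport(ERROR,
				(errcode(ERRCODE_TRANSACTION_ROLLBACK),
				 errmsg("transaction aborted during system catalog scan")));
}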
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Dilip Kumar
Date:
On Mon, May 4, 2020 at 5:16 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Fri, May 1, 2020 at 8:41 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > 5. Shouldn't we add a check in table_scan_sample_next_block and > table_scan_sample_next_tuple APIs as well? I am not sure that we need to do that. Generally, we want to avoid fetching a wrong system table tuple that could be used for taking some decision or for decoding a tuple, but I don't think table_scan_sample falls under that category. > > Apart from this, I have also fixed one defect raised by my colleague > > Neha Sharma. That issue is the incomplete toast tuple flag was not > > reset when the main table tuple was inserted through speculative > > insert and due to that data was not streamed even if later we were > > getting speculative confirm because incomplete toast flag was never > > reset. This patch also includes the fix for the issue raised by Erik. > > > > It would be better if you can mention which all patches contain the > changes as it will be easier to review the fix. Fix1: v17-0010-Bugfix-handling-of-incomplete-toast-tuple.patch Fix2: v17-0002-Issue-individual-invalidations-with-wal_level-lo.patch I will work on the other comments and send the updated patch. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Amit Kapila
Date:
On Tue, May 5, 2020 at 9:27 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Mon, May 4, 2020 at 5:16 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > On Fri, May 1, 2020 at 8:41 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > > > 5. Shouldn't we add a check in table_scan_sample_next_block and > > table_scan_sample_next_tuple APIs as well? > > I am not sure that we need to do that, Because generally, we want to > avoid getting any wrong system table tuple which we can use for taking > some decision or decode tuple. But, I don't think that > table_scan_sample falls under that category. > Hmm, I am asking a check similar to what you have in function table_scan_bitmap_next_block(), can't we have that one? BTW, I noticed a below spurious line removal in the patch we are talking about. +/* * These are updated by GetSnapshotData. We initialize them this way * for the convenience of TransactionIdIsInProgress: even in bootstrap * mode, we don't want it to say that BootstrapTransactionId is in progress. @@ -2043,7 +2055,6 @@ SetupHistoricSnapshot(Snapshot historic_snapshot, HTAB *tuplecids) tuplecid_data = tuplecids; } - -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Dilip Kumar
Date:
On Tue, May 5, 2020 at 10:25 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Tue, May 5, 2020 at 9:27 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > On Mon, May 4, 2020 at 5:16 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > On Fri, May 1, 2020 at 8:41 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > > > > > > 5. Shouldn't we add a check in table_scan_sample_next_block and > > > table_scan_sample_next_tuple APIs as well? > > > > I am not sure that we need to do that, Because generally, we want to > > avoid getting any wrong system table tuple which we can use for taking > > some decision or decode tuple. But, I don't think that > > table_scan_sample falls under that category. > > > > Hmm, I am asking a check similar to what you have in function > table_scan_bitmap_next_block(), can't we have that one? Yeah, we can put that and there is no harm in it, but my point is that table_scan_bitmap_next_block and the other functions where I have put the check are used for fetching tuples that can be used for decoding a tuple or taking some decision, whereas IMHO table_scan_sample_next_tuple is only used for analyzing the table. So do we really need to do that? Am I missing something here? > BTW, I > noticed a below spurious line removal in the patch we are talking > about. > > +/* > * These are updated by GetSnapshotData. We initialize them this way > * for the convenience of TransactionIdIsInProgress: even in bootstrap > * mode, we don't want it to say that BootstrapTransactionId is in progress. > @@ -2043,7 +2055,6 @@ SetupHistoricSnapshot(Snapshot > historic_snapshot, HTAB *tuplecids) > tuplecid_data = tuplecids; > } > > - Okay, I will take care of this. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Amit Kapila
Date:
On Tue, May 5, 2020 at 10:31 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Tue, May 5, 2020 at 10:25 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > On Tue, May 5, 2020 at 9:27 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > > > On Mon, May 4, 2020 at 5:16 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > > On Fri, May 1, 2020 at 8:41 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > > > > > > > > > 5. Shouldn't we add a check in table_scan_sample_next_block and > > > > table_scan_sample_next_tuple APIs as well? > > > > > > I am not sure that we need to do that, Because generally, we want to > > > avoid getting any wrong system table tuple which we can use for taking > > > some decision or decode tuple. But, I don't think that > > > table_scan_sample falls under that category. > > > > > > > Hmm, I am asking a check similar to what you have in function > > table_scan_bitmap_next_block(), can't we have that one? > > Yeah we can put that and there is no harm in that, but my point is > the table_scan_bitmap_next_block and other functions where I have put > the check are used for fetching the tuple which can be used for > decoding tuple or taking some decision, but IMHO, > table_scan_sample_next_tuple is only used for analyzing the table. > These will be used in TABLESAMPLE scan. Try something like "select c1 from t1 TABLESAMPLE BERNOULLI(30);". So, I guess these APIs can also be used to fetch the tuple. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
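For reference, the requested check in table_scan_sample_next_block() would look roughly like the one already added to table_scan_bitmap_next_block(): the dispatch in the function body is the existing tableam call, and the flag name (bsysscan) is the one adopted earlier in the thread. A sketch, not the patch's exact code:

/* tableam.h (sketch): add the same guard used by the other scan wrappers */
static inline bool
table_scan_sample_next_block(TableScanDesc scan, struct SampleScanState *scanstate)
{
	/*
	 * During logical decoding of in-progress transactions, relation access
	 * is expected to go through the systable_* APIs only; error out on a
	 * direct call, since TABLESAMPLE scans also fetch tuples via this path.
	 */
	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
		elog(ERROR, "unexpected table_scan_sample_next_block call during logical decoding");

	return scan->rs_rd->rd_tableam->scan_sample_next_block(scan, scanstate);
}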
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Dilip Kumar
Date:
On Mon, May 4, 2020 at 5:16 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Fri, May 1, 2020 at 8:41 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > On Thu, Apr 30, 2020 at 12:31 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > > > But can't they access other catalogs like pg_publication*? I think > > > the basic thing we want to ensure here is that all historic accesses > > > always use systable* APIs to access catalogs. We can ensure that via > > > having Asserts (or elog(ERROR, ..) in heap/tableam APIs. > > > > Yeah, it can. So I have changed it now, actually along with > > CheckXidLive, I have kept one more flag so whenever CheckXidLive is > > set and we pass through systable_beginscan we will set that flag. So > > while accessing the tableam API we will set if CheckXidLive is set > > then another flag must also be set otherwise we through an error. > > > > Okay, I have reviewed these changes and below are my comments: > > Review of v17-0004-Gracefully-handle-concurrent-aborts-of-uncommitt > -------------------------------------------------------------------- > 1. > + /* > + * If CheckXidAlive is set then set a flag that this call is passed through > + * systable_beginscan. See detailed comments at snapmgr.c where these > + * variables are declared. > + */ > + if (TransactionIdIsValid(CheckXidAlive)) > + sysbegin_called = true; > > a. How about calling this variable as bsysscan or sysscan instead of > sysbegin_called? Done > b. There is an extra space between detailed and comments. A similar > change is required at other place where this comment is used. Done > c. How about writing the first line as "If CheckXidAlive is set then > set a flag to indicate that system table scan is in-progress." > > 2. > - Any actions leading to transaction ID assignment are prohibited. > That, among others, > - includes writing to tables, performing DDL changes, and > - calling <literal>pg_current_xact_id()</literal>. > + Note that access to user catalog tables or regular system > catalog tables in > + the output plugins has to be done via the > <literal>systable_*</literal> scan > + APIs only. The user tables should not be accesed in the output > plugins anyways. > + Access via the <literal>heap_*</literal> scan APIs will error out. > > The line "The user tables should not be accesed in the output plugins > anyways." seems a bit of out of place. I don't think this is required > here. If you read the previous paragraph in the same document it is > written: "Read only access to relations is permitted as long as only > relations are accessed that either have been created by > <command>initdb</command> in the <literal>pg_catalog</literal> schema, > or have been marked as user provided catalog tables using ...". I > think that is sufficient to convey the information that the newly > added line by you is trying to convey. Right. > > 3. > + /* > + * We don't expect direct calls to this routine when CheckXidAlive is a > + * valid transaction id, this should only come through systable_* call. > + * CheckXidAlive is set during logical decoding of a transactions. > + */ > + if (unlikely(TransactionIdIsValid(CheckXidAlive) && !sysbegin_called)) > + elog(ERROR, "unexpected heap_getnext call during logical decoding"); > > How about changing this comment as "We don't expect direct calls to > heap_getnext with valid CheckXidAlive for catalog or regular tables. > See detailed comments at snapmgr.c where these variables are > declared."? 
Change the similar comment used in other places in the > patch. > > For this specific API, we can also say "Normally we have such a check > at tableam level API but this is called from many places so we need to > ensure it here." Done > > 4. > + * If CheckXidAlive is valid, then we check if it aborted. If it did, we error > + * out. We can't directly use TransactionIdDidAbort as after crash such > + * transaction might not have been marked as aborted. See detailed comments > + * at snapmgr.c where the variable is declared. > + */ > +static inline void > +HandleConcurrentAbort() > > Can we change the comments as "Error out, if CheckXidAlive is aborted. > We can't directly use TransactionIdDidAbort as after crash such > transaction might not have been marked as aborted." > > After this add one empty line and then we can say something like: > "This is a special API to check if CheckXidAlive is aborted in system > table scan APIs. See detailed comments at snapmgr.c where the > variable is declared." > > 5. Shouldn't we add a check in table_scan_sample_next_block and > table_scan_sample_next_tuple APIs as well? Done > 6. > /* > + * An xid value pointing to a possibly ongoing (sub)transaction. > + * Currently used in logical decoding. It's possible that such transactions > + * can get aborted while the decoding is ongoing. If CheckXidAlive is set > + * then we will set sysbegin_called flag when we call systable_beginscan. This > + * is to ensure that from the pgoutput plugin we should never directly access > + * the tableam or heap apis because we are checking for the concurrent abort > + * only in systable_* apis. > + */ > +TransactionId CheckXidAlive = InvalidTransactionId; > +bool sysbegin_called = false; > > Can we change the above comment as "CheckXidAlive is a xid value > pointing to a possibly ongoing (sub)transaction. Currently, it is > used in logical decoding. It's possible that such transactions can > get aborted while the decoding is ongoing in which case we skip > decoding that particular transaction. To ensure that we check whether > the CheckXidAlive is aborted after fetching the tuple from system > tables. We also ensure that during logical decoding we never directly > access the tableam or heap APIs because we are checking for the > concurrent aborts only in systable_* APIs." Done I have also fixed one issue in the patch v18-0010-Bugfix-handling-of-incomplete-toast-tuple.patch. Basically, the check, in ReorderBufferLargestTopTXN for selecting the largest top transaction was incorrect so I have fixed that. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Attachment
- v18-0001-Immediately-WAL-log-assignments.patch
- v18-0004-Gracefully-handle-concurrent-aborts-of-uncommitt.patch
- v18-0002-Issue-individual-invalidations-with-wal_level-lo.patch
- v18-0005-Implement-streaming-mode-in-ReorderBuffer.patch
- v18-0003-Extend-the-output-plugin-API-with-stream-methods.patch
- v18-0006-Add-support-for-streaming-to-built-in-replicatio.patch
- v18-0009-Add-TAP-test-for-streaming-vs.-DDL.patch
- v18-0010-Bugfix-handling-of-incomplete-toast-tuple.patch
- v18-0007-Track-statistics-for-streaming.patch
- v18-0008-Enable-streaming-for-all-subscription-TAP-tests.patch
- v18-0011-Provide-new-api-to-get-the-streaming-changes.patch
- v18-0012-Add-streaming-option-in-pg_dump.patch
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Dilip Kumar
Date:
On Tue, May 5, 2020 at 4:06 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Mon, May 4, 2020 at 5:16 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > On Fri, May 1, 2020 at 8:41 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > > > On Thu, Apr 30, 2020 at 12:31 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > > > > > > But can't they access other catalogs like pg_publication*? I think > > > > the basic thing we want to ensure here is that all historic accesses > > > > always use systable* APIs to access catalogs. We can ensure that via > > > > having Asserts (or elog(ERROR, ..) in heap/tableam APIs. > > > > > > Yeah, it can. So I have changed it now, actually along with > > > CheckXidLive, I have kept one more flag so whenever CheckXidLive is > > > set and we pass through systable_beginscan we will set that flag. So > > > while accessing the tableam API we will set if CheckXidLive is set > > > then another flag must also be set otherwise we through an error. > > > > > > > Okay, I have reviewed these changes and below are my comments: > > > > Review of v17-0004-Gracefully-handle-concurrent-aborts-of-uncommitt > > -------------------------------------------------------------------- > > 1. > > + /* > > + * If CheckXidAlive is set then set a flag that this call is passed through > > + * systable_beginscan. See detailed comments at snapmgr.c where these > > + * variables are declared. > > + */ > > + if (TransactionIdIsValid(CheckXidAlive)) > > + sysbegin_called = true; > > > > a. How about calling this variable as bsysscan or sysscan instead of > > sysbegin_called? > > Done > > > b. There is an extra space between detailed and comments. A similar > > change is required at other place where this comment is used. > > Done > > > c. How about writing the first line as "If CheckXidAlive is set then > > set a flag to indicate that system table scan is in-progress." > > > > 2. > > - Any actions leading to transaction ID assignment are prohibited. > > That, among others, > > - includes writing to tables, performing DDL changes, and > > - calling <literal>pg_current_xact_id()</literal>. > > + Note that access to user catalog tables or regular system > > catalog tables in > > + the output plugins has to be done via the > > <literal>systable_*</literal> scan > > + APIs only. The user tables should not be accesed in the output > > plugins anyways. > > + Access via the <literal>heap_*</literal> scan APIs will error out. > > > > The line "The user tables should not be accesed in the output plugins > > anyways." seems a bit of out of place. I don't think this is required > > here. If you read the previous paragraph in the same document it is > > written: "Read only access to relations is permitted as long as only > > relations are accessed that either have been created by > > <command>initdb</command> in the <literal>pg_catalog</literal> schema, > > or have been marked as user provided catalog tables using ...". I > > think that is sufficient to convey the information that the newly > > added line by you is trying to convey. > > Right. > > > > > 3. > > + /* > > + * We don't expect direct calls to this routine when CheckXidAlive is a > > + * valid transaction id, this should only come through systable_* call. > > + * CheckXidAlive is set during logical decoding of a transactions. 
> > + */ > > + if (unlikely(TransactionIdIsValid(CheckXidAlive) && !sysbegin_called)) > > + elog(ERROR, "unexpected heap_getnext call during logical decoding"); > > > > How about changing this comment as "We don't expect direct calls to > > heap_getnext with valid CheckXidAlive for catalog or regular tables. > > See detailed comments at snapmgr.c where these variables are > > declared."? Change the similar comment used in other places in the > > patch. > > > > For this specific API, we can also say "Normally we have such a check > > at tableam level API but this is called from many places so we need to > > ensure it here." > > Done > > > > > 4. > > + * If CheckXidAlive is valid, then we check if it aborted. If it did, we error > > + * out. We can't directly use TransactionIdDidAbort as after crash such > > + * transaction might not have been marked as aborted. See detailed comments > > + * at snapmgr.c where the variable is declared. > > + */ > > +static inline void > > +HandleConcurrentAbort() > > > > Can we change the comments as "Error out, if CheckXidAlive is aborted. > > We can't directly use TransactionIdDidAbort as after crash such > > transaction might not have been marked as aborted." > > > > After this add one empty line and then we can say something like: > > "This is a special API to check if CheckXidAlive is aborted in system > > table scan APIs. See detailed comments at snapmgr.c where the > > variable is declared." > > > > 5. Shouldn't we add a check in table_scan_sample_next_block and > > table_scan_sample_next_tuple APIs as well? > > Done > > > 6. > > /* > > + * An xid value pointing to a possibly ongoing (sub)transaction. > > + * Currently used in logical decoding. It's possible that such transactions > > + * can get aborted while the decoding is ongoing. If CheckXidAlive is set > > + * then we will set sysbegin_called flag when we call systable_beginscan. This > > + * is to ensure that from the pgoutput plugin we should never directly access > > + * the tableam or heap apis because we are checking for the concurrent abort > > + * only in systable_* apis. > > + */ > > +TransactionId CheckXidAlive = InvalidTransactionId; > > +bool sysbegin_called = false; > > > > Can we change the above comment as "CheckXidAlive is a xid value > > pointing to a possibly ongoing (sub)transaction. Currently, it is > > used in logical decoding. It's possible that such transactions can > > get aborted while the decoding is ongoing in which case we skip > > decoding that particular transaction. To ensure that we check whether > > the CheckXidAlive is aborted after fetching the tuple from system > > tables. We also ensure that during logical decoding we never directly > > access the tableam or heap APIs because we are checking for the > > concurrent aborts only in systable_* APIs." > > Done > > I have also fixed one issue in the patch > v18-0010-Bugfix-handling-of-incomplete-toast-tuple.patch. > > Basically, the check, in ReorderBufferLargestTopTXN for selecting the > largest top transaction was incorrect so I have fixed that. There was one unrelated bug fix in v18-0010 patch reported by Neha Sharma offlist so sending the updated version. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Attachment
- v19-0005-Implement-streaming-mode-in-ReorderBuffer.patch
- v19-0001-Immediately-WAL-log-assignments.patch
- v19-0002-Issue-individual-invalidations-with-wal_level-lo.patch
- v19-0004-Gracefully-handle-concurrent-aborts-of-uncommitt.patch
- v19-0003-Extend-the-output-plugin-API-with-stream-methods.patch
- v19-0007-Track-statistics-for-streaming.patch
- v19-0008-Enable-streaming-for-all-subscription-TAP-tests.patch
- v19-0009-Add-TAP-test-for-streaming-vs.-DDL.patch
- v19-0010-Bugfix-handling-of-incomplete-toast-tuple.patch
- v19-0006-Add-support-for-streaming-to-built-in-replicatio.patch
- v19-0011-Provide-new-api-to-get-the-streaming-changes.patch
- v19-0012-Add-streaming-option-in-pg_dump.patch
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Dilip Kumar
Date:
On Tue, May 5, 2020 at 7:13 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: I have fixed one more issue in the 0010 patch. The issue was that once the transaction was serialized (because of an incomplete toast tuple) after streaming, the serialized store was not cleaned up, so the same tuples were streamed multiple times. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Attachment
- v20-0002-Issue-individual-invalidations-with-wal_level-lo.patch
- v20-0001-Immediately-WAL-log-assignments.patch
- v20-0003-Extend-the-output-plugin-API-with-stream-methods.patch
- v20-0005-Implement-streaming-mode-in-ReorderBuffer.patch
- v20-0004-Gracefully-handle-concurrent-aborts-of-uncommitt.patch
- v20-0007-Track-statistics-for-streaming.patch
- v20-0008-Enable-streaming-for-all-subscription-TAP-tests.patch
- v20-0009-Add-TAP-test-for-streaming-vs.-DDL.patch
- v20-0006-Add-support-for-streaming-to-built-in-replicatio.patch
- v20-0011-Provide-new-api-to-get-the-streaming-changes.patch
- v20-0010-Bugfix-handling-of-incomplete-toast-tuple.patch
- v20-0012-Add-streaming-option-in-pg_dump.patch
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Amit Kapila
Date:
On Thu, May 7, 2020 at 6:17 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Tue, May 5, 2020 at 7:13 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > I have fixed one more issue in 0010 patch. The issue was that once > the transaction is serialized due to the incomplete toast after > streaming the serialized store was not cleaned up so it was streaming > the same tuple multiple times. > I have reviewed a few patches (003, 004, and 005) and below are my comments. v20-0003-Extend-the-output-plugin-API-with-stream-methods ---------------------------------------------------------------------------------------- 1. +static void +pg_decode_stream_change(LogicalDecodingContext *ctx, + ReorderBufferTXN *txn, + Relation relation, + ReorderBufferChange *change) +{ + OutputPluginPrepareWrite(ctx, true); + appendStringInfo(ctx->out, "streaming change for TXN %u", txn->xid); + OutputPluginWrite(ctx, true); +} + +static void +pg_decode_stream_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn, + int nrelations, Relation relations[], + ReorderBufferChange *change) +{ + OutputPluginPrepareWrite(ctx, true); + appendStringInfo(ctx->out, "streaming truncate for TXN %u", txn->xid); + OutputPluginWrite(ctx, true); +} In the above and similar APIs, there are parameters like relation which are not used. I think you should add some comments atop these APIs to explain why it is so? I guess it is because we want to keep them similar to non-stream version of APIs and we can't display relation or other information as the transaction is still in-progress. 2. + <para> + Similar to spill-to-disk behavior, streaming is triggered when the total + amount of changes decoded from the WAL (for all in-progress transactions) + exceeds limit defined by <varname>logical_decoding_work_mem</varname> setting. + At that point the largest toplevel transaction (measured by amount of memory + currently used for decoded changes) is selected and streamed. + </para> I think we need to explain here the cases/exception where we need to spill even when stream is enabled and check if this is per latest implementation, otherwise, update it. 3. + * To support streaming, we require change/commit/abort callbacks. The + * message callback is optional, similarly to regular output plugins. /similarly/similar 4. +static void +stream_start_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn) +{ + LogicalDecodingContext *ctx = cache->private_data; + LogicalErrorCallbackState state; + ErrorContextCallback errcallback; + + Assert(!ctx->fast_forward); + + /* We're only supposed to call this when streaming is supported. */ + Assert(ctx->streaming); + + /* Push callback + info on the error context stack */ + state.ctx = ctx; + state.callback_name = "stream_start"; + /* state.report_location = apply_lsn; */ Why can't we supply the report_location here? I think here we need to report txn->first_lsn if this is the very first stream and txn->final_lsn if it is any consecutive one. 5. +static void +stream_stop_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn) +{ + LogicalDecodingContext *ctx = cache->private_data; + LogicalErrorCallbackState state; + ErrorContextCallback errcallback; + + Assert(!ctx->fast_forward); + + /* We're only supposed to call this when streaming is supported. */ + Assert(ctx->streaming); + + /* Push callback + info on the error context stack */ + state.ctx = ctx; + state.callback_name = "stream_stop"; + /* state.report_location = apply_lsn; */ Can't we report txn->final_lsn here? 6. 
I think it will be good if we can provide an example of streaming changes via test_decoding at https://www.postgresql.org/docs/devel/test-decoding.html. I think we can also explain there why the user is not expected to see the actual data in the stream. v20-0004-Gracefully-handle-concurrent-aborts-of-uncommitt ---------------------------------------------------------------------------------------- 7. + /* + * We don't expect direct calls to table_tuple_get_latest_tid with valid + * CheckXidAlive for catalog or regular tables. There is an extra space between 'CheckXidAlive' and 'for'. I can see similar problems in other places as well where this comment is used, fix those as well. 8. +/* + * CheckXidAlive is a xid value pointing to a possibly ongoing (sub) + * transaction. Currently, it is used in logical decoding. It's possible + * that such transactions can get aborted while the decoding is ongoing in + * which case we skip decoding that particular transaction. To ensure that we + * check whether the CheckXidAlive is aborted after fetching the tuple from + * system tables. We also ensure that during logical decoding we never + * directly access the tableam or heap APIs because we are checking for the + * concurrent aborts only in systable_* APIs. + */ In this comment, there is an inconsistency in the space used after completing the sentence. In the part "transaction. To", single space is used whereas at other places two spaces are used after a full stop. v20-0005-Implement-streaming-mode-in-ReorderBuffer ----------------------------------------------------------------------------- 9. Implement streaming mode in ReorderBuffer Instead of serializing the transaction to disk after reaching the maximum number of changes in memory (4096 changes), we consume the changes we have in memory and invoke new stream API methods. This happens in ReorderBufferStreamTXN() using about the same logic as in ReorderBufferCommit() logic. I think the above part of the commit message needs to be updated. 10. Theoretically, we could get rid of the k-way merge, and append the changes to the toplevel xact directly (and remember the position in the list in case the subxact gets aborted later). I don't think this part of the commit message is correct as we sometimes need to spill even during streaming. Please check the entire commit message and update according to the latest implementation. 11. - * HeapTupleSatisfiesHistoricMVCC. + * tqual.c's HeapTupleSatisfiesHistoricMVCC. + * + * We do build the hash table even if there are no CIDs. That's + * because when streaming in-progress transactions we may run into + * tuples with the CID before actually decoding them. Think e.g. about + * INSERT followed by TRUNCATE, where the TRUNCATE may not be decoded + * yet when applying the INSERT. So we build a hash table so that + * ResolveCminCmaxDuringDecoding does not segfault in this case. + * + * XXX We might limit this behavior to streaming mode, and just bail + * out when decoding transaction at commit time (at which point it's + * guaranteed to see all CIDs). */ static void ReorderBufferBuildTupleCidHash(ReorderBuffer *rb, ReorderBufferTXN *txn) @@ -1350,9 +1498,6 @@ ReorderBufferBuildTupleCidHash(ReorderBuffer *rb, ReorderBufferTXN *txn) dlist_iter iter; HASHCTL hash_ctl; - if (!rbtxn_has_catalog_changes(txn) || dlist_is_empty(&txn->tuplecids)) - return; - I don't understand this change. Why would "INSERT followed by TRUNCATE" could lead to a tuple which can come for decode before its CID? 
The patch has made changes based on this assumption in HeapTupleSatisfiesHistoricMVCC which appears to be very risky as the behavior could be dependent on whether we are streaming the changes for in-progress xact or at the commit of a transaction. We might want to generate a test to once validate this behavior. Also, the comment refers to tqual.c which is wrong as this API is now in heapam_visibility.c. 12. + * setup CheckXidAlive if it's not committed yet. We don't check if the xid + * aborted. That will happen during catalog access. Also reset the + * sysbegin_called flag. */ - if (txn->base_snapshot == NULL) + if (!TransactionIdDidCommit(xid)) { - Assert(txn->ninvalidations == 0); - ReorderBufferCleanupTXN(rb, txn); - return; + CheckXidAlive = xid; + bsysscan = false; } In the comment, the flag name 'sysbegin_called' should be bsysscan. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
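To make points 4 and 5 concrete, here is a rough sketch of how stream_start_cb_wrapper could fill in report_location, following the first_lsn/final_lsn suggestion above. This is not final code: it assumes the rbtxn_is_streamed() macro and the stream_start_cb callback added by the patch series, and whether these are the right LSNs to report is exactly what is debated in the follow-up mails.

static void
stream_start_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn)
{
    LogicalDecodingContext *ctx = cache->private_data;
    LogicalErrorCallbackState state;
    ErrorContextCallback errcallback;

    Assert(!ctx->fast_forward);

    /* We're only supposed to call this when streaming is supported. */
    Assert(ctx->streaming);

    /* Push callback + info on the error context stack */
    state.ctx = ctx;
    state.callback_name = "stream_start";

    /*
     * Sketch: report the transaction's first LSN for its very first stream,
     * and the last LSN seen so far (final_lsn) for any subsequent stream.
     */
    state.report_location = rbtxn_is_streamed(txn) ? txn->final_lsn
                                                   : txn->first_lsn;

    errcallback.callback = output_plugin_error_callback;
    errcallback.arg = (void *) &state;
    errcallback.previous = error_context_stack;
    error_context_stack = &errcallback;

    /* set output state */
    ctx->accept_writes = true;
    ctx->write_xid = txn->xid;

    /* do the actual work: call the callback */
    ctx->callbacks.stream_start_cb(ctx, txn);

    /* Pop the error context stack */
    error_context_stack = errcallback.previous;
}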
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Dilip Kumar
Date:
On Tue, May 12, 2020 at 4:39 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Thu, May 7, 2020 at 6:17 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > On Tue, May 5, 2020 at 7:13 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > I have fixed one more issue in 0010 patch. The issue was that once > > the transaction is serialized due to the incomplete toast after > > streaming the serialized store was not cleaned up so it was streaming > > the same tuple multiple times. > > > > I have reviewed a few patches (003, 004, and 005) and below are my comments. Thanks for the review, I am replying some of the comments where I have confusion, others are fine. > > v20-0003-Extend-the-output-plugin-API-with-stream-methods > ---------------------------------------------------------------------------------------- > 1. > +static void > +pg_decode_stream_change(LogicalDecodingContext *ctx, > + ReorderBufferTXN *txn, > + Relation relation, > + ReorderBufferChange *change) > +{ > + OutputPluginPrepareWrite(ctx, true); > + appendStringInfo(ctx->out, "streaming change for TXN %u", txn->xid); > + OutputPluginWrite(ctx, true); > +} > + > +static void > +pg_decode_stream_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn, > + int nrelations, Relation relations[], > + ReorderBufferChange *change) > +{ > + OutputPluginPrepareWrite(ctx, true); > + appendStringInfo(ctx->out, "streaming truncate for TXN %u", txn->xid); > + OutputPluginWrite(ctx, true); > +} > > In the above and similar APIs, there are parameters like relation > which are not used. I think you should add some comments atop these > APIs to explain why it is so? I guess it is because we want to keep > them similar to non-stream version of APIs and we can't display > relation or other information as the transaction is still in-progress. I think because the interfaces are designed that way because other decoding plugins might need it e.g. in pgoutput we need change and relation but not here. We have other similar examples also e.g. pg_decode_message has the parameter txn but not used. Do you think we still need to add comments? > 4. > +static void > +stream_start_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn) > +{ > + LogicalDecodingContext *ctx = cache->private_data; > + LogicalErrorCallbackState state; > + ErrorContextCallback errcallback; > + > + Assert(!ctx->fast_forward); > + > + /* We're only supposed to call this when streaming is supported. */ > + Assert(ctx->streaming); > + > + /* Push callback + info on the error context stack */ > + state.ctx = ctx; > + state.callback_name = "stream_start"; > + /* state.report_location = apply_lsn; */ > > Why can't we supply the report_location here? I think here we need to > report txn->first_lsn if this is the very first stream and > txn->final_lsn if it is any consecutive one. I am not sure about this, Because for the very first stream we will report the location of the first lsn of the stream and for the consecutive stream we will report the last lsn in the stream. > > 11. > - * HeapTupleSatisfiesHistoricMVCC. > + * tqual.c's HeapTupleSatisfiesHistoricMVCC. > + * > + * We do build the hash table even if there are no CIDs. That's > + * because when streaming in-progress transactions we may run into > + * tuples with the CID before actually decoding them. Think e.g. about > + * INSERT followed by TRUNCATE, where the TRUNCATE may not be decoded > + * yet when applying the INSERT. 
So we build a hash table so that > + * ResolveCminCmaxDuringDecoding does not segfault in this case. > + * > + * XXX We might limit this behavior to streaming mode, and just bail > + * out when decoding transaction at commit time (at which point it's > + * guaranteed to see all CIDs). > */ > static void > ReorderBufferBuildTupleCidHash(ReorderBuffer *rb, ReorderBufferTXN *txn) > @@ -1350,9 +1498,6 @@ ReorderBufferBuildTupleCidHash(ReorderBuffer > *rb, ReorderBufferTXN *txn) > dlist_iter iter; > HASHCTL hash_ctl; > > - if (!rbtxn_has_catalog_changes(txn) || dlist_is_empty(&txn->tuplecids)) > - return; > - > > I don't understand this change. Why would "INSERT followed by > TRUNCATE" could lead to a tuple which can come for decode before its > CID? Actually, even if we haven't decoded the DDL operation but in the actual system table the tuple might have been deleted from the next operation. e.g. while we are streaming the INSERT it is possible that the truncate has already deleted that tuple and set the max for the tuple. So before streaming patch, we were only streaming the INSERT only on commit so by that time we had got all the operation which has done DDL and we would have already prepared tuple CID hash. The patch has made changes based on this assumption in > HeapTupleSatisfiesHistoricMVCC which appears to be very risky as the > behavior could be dependent on whether we are streaming the changes > for in-progress xact or at the commit of a transaction. We might want > to generate a test to once validate this behavior. We have already added the test case for the same, 011_stream_ddl.pl in test/subscription > Also, the comment refers to tqual.c which is wrong as this API is now > in heapam_visibility.c. Ok, will fix. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
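On the unused-parameter question from comment 1, a comment atop the test_decoding stream callbacks could capture the reasoning discussed above; the wording below is only a suggestion, and the function body is unchanged from the patch.

/*
 * Callback for a change that is part of a streamed, still in-progress
 * transaction.
 *
 * Note: 'relation' and 'change' are part of the callback signature because
 * other output plugins (e.g. pgoutput) need them to transmit the actual
 * change.  test_decoding deliberately ignores them here: the transaction is
 * still in progress and may yet abort, so we only emit a short placeholder
 * message identifying the XID.
 */
static void
pg_decode_stream_change(LogicalDecodingContext *ctx,
                        ReorderBufferTXN *txn,
                        Relation relation,
                        ReorderBufferChange *change)
{
    OutputPluginPrepareWrite(ctx, true);
    appendStringInfo(ctx->out, "streaming change for TXN %u", txn->xid);
    OutputPluginWrite(ctx, true);
}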
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Amit Kapila
Date:
On Wed, May 13, 2020 at 11:35 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Tue, May 12, 2020 at 4:39 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > v20-0003-Extend-the-output-plugin-API-with-stream-methods > > ---------------------------------------------------------------------------------------- > > 1. > > +static void > > +pg_decode_stream_change(LogicalDecodingContext *ctx, > > + ReorderBufferTXN *txn, > > + Relation relation, > > + ReorderBufferChange *change) > > +{ > > + OutputPluginPrepareWrite(ctx, true); > > + appendStringInfo(ctx->out, "streaming change for TXN %u", txn->xid); > > + OutputPluginWrite(ctx, true); > > +} > > + > > +static void > > +pg_decode_stream_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn, > > + int nrelations, Relation relations[], > > + ReorderBufferChange *change) > > +{ > > + OutputPluginPrepareWrite(ctx, true); > > + appendStringInfo(ctx->out, "streaming truncate for TXN %u", txn->xid); > > + OutputPluginWrite(ctx, true); > > +} > > > > In the above and similar APIs, there are parameters like relation > > which are not used. I think you should add some comments atop these > > APIs to explain why it is so? I guess it is because we want to keep > > them similar to non-stream version of APIs and we can't display > > relation or other information as the transaction is still in-progress. > > I think because the interfaces are designed that way because other > decoding plugins might need it e.g. in pgoutput we need change and > relation but not here. We have other similar examples also e.g. > pg_decode_message has the parameter txn but not used. Do you think we > still need to add comments? > In that case, we can leave but lets ensure that we are not exposing any parameter which is not used and if there is any due to some reason, we should document it. I will also look into this. > > 4. > > +static void > > +stream_start_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn) > > +{ > > + LogicalDecodingContext *ctx = cache->private_data; > > + LogicalErrorCallbackState state; > > + ErrorContextCallback errcallback; > > + > > + Assert(!ctx->fast_forward); > > + > > + /* We're only supposed to call this when streaming is supported. */ > > + Assert(ctx->streaming); > > + > > + /* Push callback + info on the error context stack */ > > + state.ctx = ctx; > > + state.callback_name = "stream_start"; > > + /* state.report_location = apply_lsn; */ > > > > Why can't we supply the report_location here? I think here we need to > > report txn->first_lsn if this is the very first stream and > > txn->final_lsn if it is any consecutive one. > > I am not sure about this, Because for the very first stream we will > report the location of the first lsn of the stream and for the > consecutive stream we will report the last lsn in the stream. > Yeah, that doesn't seem to be consistent. How about if get it as an additional parameter? The caller can pass the lsn of the very first change it is trying to decode in this stream. > > > > 11. > > - * HeapTupleSatisfiesHistoricMVCC. > > + * tqual.c's HeapTupleSatisfiesHistoricMVCC. > > + * > > + * We do build the hash table even if there are no CIDs. That's > > + * because when streaming in-progress transactions we may run into > > + * tuples with the CID before actually decoding them. Think e.g. about > > + * INSERT followed by TRUNCATE, where the TRUNCATE may not be decoded > > + * yet when applying the INSERT. 
So we build a hash table so that > > + * ResolveCminCmaxDuringDecoding does not segfault in this case. > > + * > > + * XXX We might limit this behavior to streaming mode, and just bail > > + * out when decoding transaction at commit time (at which point it's > > + * guaranteed to see all CIDs). > > */ > > static void > > ReorderBufferBuildTupleCidHash(ReorderBuffer *rb, ReorderBufferTXN *txn) > > @@ -1350,9 +1498,6 @@ ReorderBufferBuildTupleCidHash(ReorderBuffer > > *rb, ReorderBufferTXN *txn) > > dlist_iter iter; > > HASHCTL hash_ctl; > > > > - if (!rbtxn_has_catalog_changes(txn) || dlist_is_empty(&txn->tuplecids)) > > - return; > > - > > > > I don't understand this change. Why would "INSERT followed by > > TRUNCATE" could lead to a tuple which can come for decode before its > > CID? > > Actually, even if we haven't decoded the DDL operation but in the > actual system table the tuple might have been deleted from the next > operation. e.g. while we are streaming the INSERT it is possible that > the truncate has already deleted that tuple and set the max for the > tuple. So before streaming patch, we were only streaming the INSERT > only on commit so by that time we had got all the operation which has > done DDL and we would have already prepared tuple CID hash. > Okay, but I think for that case how good is that we always allow CID hash table to be built even if there are no catalog changes in TXN (see changes in ReorderBufferBuildTupleCidHash). Can't we detect that while resolving the cmin/cmax? Few more comments for v20-0005-Implement-streaming-mode-in-ReorderBuffer: ---------------------------------------------------------------------------------------------------------------- 1. /* - * Binary heap comparison function. + * Binary heap comparison function (regular non-streaming iterator). */ static int ReorderBufferIterCompare(Datum a, Datum b, void *arg) It seems to me the above comment change is not required as per the latest patch. 2. * For subtransactions, we only mark them as streamed when there are + * any changes in them. + * + * We do it this way because of aborts - we don't want to send aborts + * for XIDs the downstream is not aware of. And of course, it always + * knows about the toplevel xact (we send the XID in all messages), + * but we never stream XIDs of empty subxacts. + */ + if ((!txn->toptxn) || (txn->nentries_mem != 0)) + txn->txn_flags |= RBTXN_IS_STREAMED; /when there are any changes in them/when there are changes in them. I think we don't need 'any' in the above sentence. 3. And, during catalog scan we can check the status of the xid and + * if it is aborted we will report a specific error that we can ignore. We + * might have already streamed some of the changes for the aborted + * (sub)transaction, but that is fine because when we decode the abort we will + * stream abort message to truncate the changes in the subscriber. + */ +static inline void +SetupCheckXidLive(TransactionId xid) In the above comment, I don't think it is right to say that we ignore the error raised due to the aborted transaction. We need to say that we discard the already streamed changes on such an error. 4. +static inline void +SetupCheckXidLive(TransactionId xid) +{ /* - * If this transaction has no snapshot, it didn't make any changes to the - * database, so there's nothing to decode. Note that - * ReorderBufferCommitChild will have transferred any snapshots from - * subtransactions if there were any. + * setup CheckXidAlive if it's not committed yet. We don't check if the xid + * aborted. 
That will happen during catalog access. Also reset the + * sysbegin_called flag. */ - if (txn->base_snapshot == NULL) + if (!TransactionIdDidCommit(xid)) { - Assert(txn->ninvalidations == 0); - ReorderBufferCleanupTXN(rb, txn); - return; + CheckXidAlive = xid; + bsysscan = false; } I think this function is inline as it needs to be called for each change. If that is the case and otherwise also, isn't it better that we check if passed xid is the same as CheckXidAlive before checking TransactionIdDidCommit as TransactionIdDidCommit can be costly and calling it for each change might not be a good idea? 5. setup CheckXidAlive if it's not committed yet. We don't check if the xid + * aborted. That will happen during catalog access. Also reset the + * sysbegin_called flag. /if the xid aborted/if the xid is aborted. missing comma after Also. 6. ReorderBufferProcessTXN() { .. - /* build data to be able to lookup the CommandIds of catalog tuples */ + /* + * build data to be able to lookup the CommandIds of catalog tuples + */ ReorderBufferBuildTupleCidHash(rb, txn); .. } Is there a need to change the formatting of the comment? 7. ReorderBufferProcessTXN() { .. if (using_subtxn) - BeginInternalSubTransaction("replay"); + BeginInternalSubTransaction("stream"); else StartTransactionCommand(); .. } I am not sure changing unconditionally "replay" to "stream" is a good idea. How about something like BeginInternalSubTransaction(streaming ? "stream" : "replay");? 8. @@ -1588,8 +1766,6 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid, * use as a normal record. It'll be cleaned up at the end * of INSERT processing. */ - if (specinsert == NULL) - elog(ERROR, "invalid ordering of speculative insertion changes"); You have removed this check but all other handling of specinsert is same as far as this patch is concerned. Why so? 9. @@ -1676,8 +1860,6 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid, * freed/reused while restoring spooled data from * disk. */ - Assert(change->data.tp.newtuple != NULL); - dlist_delete(&change->node); Why is this Assert removed? 10. @@ -1753,7 +1935,15 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid, relations[nrelations++] = relation; } - rb->apply_truncate(rb, txn, nrelations, relations, change); + if (streaming) + { + rb->stream_truncate(rb, txn, nrelations, relations, change); + + /* Remember that we have sent some data. */ + change->txn->any_data_sent = true; + } + else + rb->apply_truncate(rb, txn, nrelations, relations, change); Can we encapsulate this in a separate function like ReorderBufferApplyTruncate or something like that? Basically, rather than having streaming check in this function, lets do it in some other internal function. And we can likewise do it for all the streaming checks in this function or at least whereever it is feasible. That will make this function look clean. 11. + * We currently can only decode a transaction's contents when its commit + * record is read because that's the only place where we know about cache + * invalidations. Thus, once a toplevel commit is read, we iterate over the top + * and subtransactions (using a k-way merge) and replay the changes in lsn + * order. + */ +void +ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid, { .. I think the above comment needs to be updated after this patch. This API can now be used during the decode of both a in-progress and a committed transaction. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
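For point 10, one possible shape of the wrapper (the function name is only a suggestion; it would sit as a static helper in reorderbuffer.c and replace the inline streaming check in the apply loop):

/*
 * Dispatch a truncate to either the streaming or the regular callback, so
 * that the main change-apply loop does not need an explicit streaming check.
 */
static void
ReorderBufferApplyTruncate(ReorderBuffer *rb, ReorderBufferTXN *txn,
                           int nrelations, Relation *relations,
                           ReorderBufferChange *change, bool streaming)
{
    if (streaming)
    {
        rb->stream_truncate(rb, txn, nrelations, relations, change);

        /* Remember that we have sent some data for this transaction. */
        change->txn->any_data_sent = true;
    }
    else
        rb->apply_truncate(rb, txn, nrelations, relations, change);
}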
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Dilip Kumar
Date:
On Wed, May 13, 2020 at 4:50 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Wed, May 13, 2020 at 11:35 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > On Tue, May 12, 2020 at 4:39 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > > > v20-0003-Extend-the-output-plugin-API-with-stream-methods > > > ---------------------------------------------------------------------------------------- > > > 1. > > > +static void > > > +pg_decode_stream_change(LogicalDecodingContext *ctx, > > > + ReorderBufferTXN *txn, > > > + Relation relation, > > > + ReorderBufferChange *change) > > > +{ > > > + OutputPluginPrepareWrite(ctx, true); > > > + appendStringInfo(ctx->out, "streaming change for TXN %u", txn->xid); > > > + OutputPluginWrite(ctx, true); > > > +} > > > + > > > +static void > > > +pg_decode_stream_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn, > > > + int nrelations, Relation relations[], > > > + ReorderBufferChange *change) > > > +{ > > > + OutputPluginPrepareWrite(ctx, true); > > > + appendStringInfo(ctx->out, "streaming truncate for TXN %u", txn->xid); > > > + OutputPluginWrite(ctx, true); > > > +} > > > > > > In the above and similar APIs, there are parameters like relation > > > which are not used. I think you should add some comments atop these > > > APIs to explain why it is so? I guess it is because we want to keep > > > them similar to non-stream version of APIs and we can't display > > > relation or other information as the transaction is still in-progress. > > > > I think because the interfaces are designed that way because other > > decoding plugins might need it e.g. in pgoutput we need change and > > relation but not here. We have other similar examples also e.g. > > pg_decode_message has the parameter txn but not used. Do you think we > > still need to add comments? > > > > In that case, we can leave but lets ensure that we are not exposing > any parameter which is not used and if there is any due to some > reason, we should document it. I will also look into this. > > > > 4. > > > +static void > > > +stream_start_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn) > > > +{ > > > + LogicalDecodingContext *ctx = cache->private_data; > > > + LogicalErrorCallbackState state; > > > + ErrorContextCallback errcallback; > > > + > > > + Assert(!ctx->fast_forward); > > > + > > > + /* We're only supposed to call this when streaming is supported. */ > > > + Assert(ctx->streaming); > > > + > > > + /* Push callback + info on the error context stack */ > > > + state.ctx = ctx; > > > + state.callback_name = "stream_start"; > > > + /* state.report_location = apply_lsn; */ > > > > > > Why can't we supply the report_location here? I think here we need to > > > report txn->first_lsn if this is the very first stream and > > > txn->final_lsn if it is any consecutive one. > > > > I am not sure about this, Because for the very first stream we will > > report the location of the first lsn of the stream and for the > > consecutive stream we will report the last lsn in the stream. > > > > Yeah, that doesn't seem to be consistent. How about if get it as an > additional parameter? The caller can pass the lsn of the very first > change it is trying to decode in this stream. Hmm, I think we need to call ReorderBufferIterTXNInit and ReorderBufferIterTXNNext and get the first change of the stream after that we shall call stream start then we can find out the first LSN of the stream. I will see how to do so that it doesn't look awkward. 
Basically, as of now, our code is of this layout. 1. stream_start; 2. ReorderBufferIterTXNInit(rb, txn, &iterstate); while ((change = ReorderBufferIterTXNNext(rb, iterstate)) != NULL) { stream changes } 3. stream stop So if we want to know the first lsn of this stream then we shall do something like this 1. ReorderBufferIterTXNInit(rb, txn, &iterstate); while ((change = ReorderBufferIterTXNNext(rb, iterstate)) != NULL) { 2. if first_change stream_start; stream changes } 3. stream stop > > > > > > 11. > > > - * HeapTupleSatisfiesHistoricMVCC. > > > + * tqual.c's HeapTupleSatisfiesHistoricMVCC. > > > + * > > > + * We do build the hash table even if there are no CIDs. That's > > > + * because when streaming in-progress transactions we may run into > > > + * tuples with the CID before actually decoding them. Think e.g. about > > > + * INSERT followed by TRUNCATE, where the TRUNCATE may not be decoded > > > + * yet when applying the INSERT. So we build a hash table so that > > > + * ResolveCminCmaxDuringDecoding does not segfault in this case. > > > + * > > > + * XXX We might limit this behavior to streaming mode, and just bail > > > + * out when decoding transaction at commit time (at which point it's > > > + * guaranteed to see all CIDs). > > > */ > > > static void > > > ReorderBufferBuildTupleCidHash(ReorderBuffer *rb, ReorderBufferTXN *txn) > > > @@ -1350,9 +1498,6 @@ ReorderBufferBuildTupleCidHash(ReorderBuffer > > > *rb, ReorderBufferTXN *txn) > > > dlist_iter iter; > > > HASHCTL hash_ctl; > > > > > > - if (!rbtxn_has_catalog_changes(txn) || dlist_is_empty(&txn->tuplecids)) > > > - return; > > > - > > > > > > I don't understand this change. Why would "INSERT followed by > > > TRUNCATE" could lead to a tuple which can come for decode before its > > > CID? > > > > Actually, even if we haven't decoded the DDL operation but in the > > actual system table the tuple might have been deleted from the next > > operation. e.g. while we are streaming the INSERT it is possible that > > the truncate has already deleted that tuple and set the max for the > > tuple. So before streaming patch, we were only streaming the INSERT > > only on commit so by that time we had got all the operation which has > > done DDL and we would have already prepared tuple CID hash. > > > > Okay, but I think for that case how good is that we always allow CID > hash table to be built even if there are no catalog changes in TXN > (see changes in ReorderBufferBuildTupleCidHash). Can't we detect that > while resolving the cmin/cmax? Maybe in ResolveCminCmaxDuringDecoding we can see if tuplecid_data is NULL then we can return as unresolved and then caller can take a call based on that. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
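In code form, the restructured loop sketched above could look roughly like this, assuming stream_start gains the extra LSN parameter proposed upthread (variable names are illustrative, and the real code also has to handle a stream with no changes and the error-cleanup path):

    ReorderBufferIterTXNState *volatile iterstate = NULL;
    ReorderBufferChange *change;
    bool        stream_started = false;

    ReorderBufferIterTXNInit(rb, txn, &iterstate);
    while ((change = ReorderBufferIterTXNNext(rb, iterstate)) != NULL)
    {
        /*
         * Emit stream_start lazily, once the LSN of the first change in this
         * stream is known, so it can double as the report location.
         */
        if (!stream_started)
        {
            rb->stream_start(rb, txn, change->lsn);
            stream_started = true;
        }

        /* ... switch on change->action and invoke the stream_* callback ... */
    }

    ReorderBufferIterTXNFinish(rb, iterstate);

    if (stream_started)
        rb->stream_stop(rb, txn);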
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Amit Kapila
Date:
On Wed, May 13, 2020 at 9:16 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Wed, May 13, 2020 at 4:50 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > > 4. > > > > +static void > > > > +stream_start_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn) > > > > +{ > > > > + LogicalDecodingContext *ctx = cache->private_data; > > > > + LogicalErrorCallbackState state; > > > > + ErrorContextCallback errcallback; > > > > + > > > > + Assert(!ctx->fast_forward); > > > > + > > > > + /* We're only supposed to call this when streaming is supported. */ > > > > + Assert(ctx->streaming); > > > > + > > > > + /* Push callback + info on the error context stack */ > > > > + state.ctx = ctx; > > > > + state.callback_name = "stream_start"; > > > > + /* state.report_location = apply_lsn; */ > > > > > > > > Why can't we supply the report_location here? I think here we need to > > > > report txn->first_lsn if this is the very first stream and > > > > txn->final_lsn if it is any consecutive one. > > > > > > I am not sure about this, Because for the very first stream we will > > > report the location of the first lsn of the stream and for the > > > consecutive stream we will report the last lsn in the stream. > > > > > > > Yeah, that doesn't seem to be consistent. How about if get it as an > > additional parameter? The caller can pass the lsn of the very first > > change it is trying to decode in this stream. > > Hmm, I think we need to call ReorderBufferIterTXNInit and > ReorderBufferIterTXNNext and get the first change of the stream after > that we shall call stream start then we can find out the first LSN of > the stream. I will see how to do so that it doesn't look awkward. > Basically, as of now, our code is of this layout. > > 1. stream_start; > 2. ReorderBufferIterTXNInit(rb, txn, &iterstate); > while ((change = ReorderBufferIterTXNNext(rb, iterstate)) != NULL) > { > stream changes > } > 3. stream stop > > So if we want to know the first lsn of this stream then we shall do > something like this > > 1. ReorderBufferIterTXNInit(rb, txn, &iterstate); > while ((change = ReorderBufferIterTXNNext(rb, iterstate)) != NULL) > { > 2. if first_change > stream_start; > > stream changes > } > 3. stream stop > Yeah, something like that would work. I think you need to see it is first change for 'streaming' mode. > > > > > > > > 11. > > > > - * HeapTupleSatisfiesHistoricMVCC. > > > > + * tqual.c's HeapTupleSatisfiesHistoricMVCC. > > > > + * > > > > + * We do build the hash table even if there are no CIDs. That's > > > > + * because when streaming in-progress transactions we may run into > > > > + * tuples with the CID before actually decoding them. Think e.g. about > > > > + * INSERT followed by TRUNCATE, where the TRUNCATE may not be decoded > > > > + * yet when applying the INSERT. So we build a hash table so that > > > > + * ResolveCminCmaxDuringDecoding does not segfault in this case. > > > > + * > > > > + * XXX We might limit this behavior to streaming mode, and just bail > > > > + * out when decoding transaction at commit time (at which point it's > > > > + * guaranteed to see all CIDs). 
> > > > */ > > > > static void > > > > ReorderBufferBuildTupleCidHash(ReorderBuffer *rb, ReorderBufferTXN *txn) > > > > @@ -1350,9 +1498,6 @@ ReorderBufferBuildTupleCidHash(ReorderBuffer > > > > *rb, ReorderBufferTXN *txn) > > > > dlist_iter iter; > > > > HASHCTL hash_ctl; > > > > > > > > - if (!rbtxn_has_catalog_changes(txn) || dlist_is_empty(&txn->tuplecids)) > > > > - return; > > > > - > > > > > > > > I don't understand this change. Why would "INSERT followed by > > > > TRUNCATE" could lead to a tuple which can come for decode before its > > > > CID? > > > > > > Actually, even if we haven't decoded the DDL operation but in the > > > actual system table the tuple might have been deleted from the next > > > operation. e.g. while we are streaming the INSERT it is possible that > > > the truncate has already deleted that tuple and set the max for the > > > tuple. So before streaming patch, we were only streaming the INSERT > > > only on commit so by that time we had got all the operation which has > > > done DDL and we would have already prepared tuple CID hash. > > > > > > > Okay, but I think for that case how good is that we always allow CID > > hash table to be built even if there are no catalog changes in TXN > > (see changes in ReorderBufferBuildTupleCidHash). Can't we detect that > > while resolving the cmin/cmax? > > Maybe in ResolveCminCmaxDuringDecoding we can see if tuplecid_data is > NULL then we can return as unresolved and then caller can take a call > based on that. > Yeah, and add appropriate comments about why we are doing so and in what kind of scenario that can happen. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
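The "return as unresolved" idea would amount to an early exit at the top of ResolveCminCmaxDuringDecoding, roughly as below (a sketch only; the callers in heapam_visibility.c then have to treat false as "cmin/cmax unknown" and decide how to behave):

    /*
     * Without a tuplecid hash there is nothing to resolve.  This can happen
     * while streaming an in-progress transaction, where a later catalog
     * change (e.g. a TRUNCATE) has not been decoded yet.  Report the
     * cmin/cmax as unresolved and let the caller decide what to do.
     */
    if (tuplecid_data == NULL)
        return false;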
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Dilip Kumar
Date:
On Tue, May 12, 2020 at 4:39 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Thu, May 7, 2020 at 6:17 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > On Tue, May 5, 2020 at 7:13 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > I have fixed one more issue in 0010 patch. The issue was that once > > the transaction is serialized due to the incomplete toast after > > streaming the serialized store was not cleaned up so it was streaming > > the same tuple multiple times. > > > > I have reviewed a few patches (003, 004, and 005) and below are my comments. > > v20-0003-Extend-the-output-plugin-API-with-stream-methods > ---------------------------------------------------------------------------------------- > 2. > + <para> > + Similar to spill-to-disk behavior, streaming is triggered when the total > + amount of changes decoded from the WAL (for all in-progress transactions) > + exceeds limit defined by > <varname>logical_decoding_work_mem</varname> setting. > + At that point the largest toplevel transaction (measured by > amount of memory > + currently used for decoded changes) is selected and streamed. > + </para> > > I think we need to explain here the cases/exception where we need to > spill even when stream is enabled and check if this is per latest > implementation, otherwise, update it. Done > 3. > + * To support streaming, we require change/commit/abort callbacks. The > + * message callback is optional, similarly to regular output plugins. > > /similarly/similar Done > 4. > +static void > +stream_start_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn) > +{ > + LogicalDecodingContext *ctx = cache->private_data; > + LogicalErrorCallbackState state; > + ErrorContextCallback errcallback; > + > + Assert(!ctx->fast_forward); > + > + /* We're only supposed to call this when streaming is supported. */ > + Assert(ctx->streaming); > + > + /* Push callback + info on the error context stack */ > + state.ctx = ctx; > + state.callback_name = "stream_start"; > + /* state.report_location = apply_lsn; */ > > Why can't we supply the report_location here? I think here we need to > report txn->first_lsn if this is the very first stream and > txn->final_lsn if it is any consecutive one. Done > 5. > +static void > +stream_stop_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn) > +{ > + LogicalDecodingContext *ctx = cache->private_data; > + LogicalErrorCallbackState state; > + ErrorContextCallback errcallback; > + > + Assert(!ctx->fast_forward); > + > + /* We're only supposed to call this when streaming is supported. */ > + Assert(ctx->streaming); > + > + /* Push callback + info on the error context stack */ > + state.ctx = ctx; > + state.callback_name = "stream_stop"; > + /* state.report_location = apply_lsn; */ > > Can't we report txn->final_lsn here We are already setting this to the txn->final_ls in 0006 patch, but I have moved it into this patch now. > 6. I think it will be good if we can provide an example of streaming > changes via test_decoding at > https://www.postgresql.org/docs/devel/test-decoding.html. I think we > can also explain there why the user is not expected to see the actual > data in the stream. I have a few problems to solve here. - With streaming transaction also shall we show the actual values or we shall do like it is currently in the patch (appendStringInfo(ctx->out, "streaming change for TXN %u", txn->xid);). I think we should show the actual values instead of what we are doing now. 
- In the example we can not show a real example, because of the in-progress transaction to show the changes, we might have to implement a lot of tuple. I think we can show the partial output? > v20-0004-Gracefully-handle-concurrent-aborts-of-uncommitt > ---------------------------------------------------------------------------------------- > 7. > + /* > + * We don't expect direct calls to table_tuple_get_latest_tid with valid > + * CheckXidAlive for catalog or regular tables. > > There is an extra space between 'CheckXidAlive' and 'for'. I can see > similar problems in other places as well where this comment is used, > fix those as well. Done > 8. > +/* > + * CheckXidAlive is a xid value pointing to a possibly ongoing (sub) > + * transaction. Currently, it is used in logical decoding. It's possible > + * that such transactions can get aborted while the decoding is ongoing in > + * which case we skip decoding that particular transaction. To ensure that we > + * check whether the CheckXidAlive is aborted after fetching the tuple from > + * system tables. We also ensure that during logical decoding we never > + * directly access the tableam or heap APIs because we are checking for the > + * concurrent aborts only in systable_* APIs. > + */ > > In this comment, there is an inconsistency in the space used after > completing the sentence. In the part "transaction. To", single space > is used whereas at other places two spaces are used after a full stop. Done > v20-0005-Implement-streaming-mode-in-ReorderBuffer > ----------------------------------------------------------------------------- > 9. > Implement streaming mode in ReorderBuffer > > Instead of serializing the transaction to disk after reaching the > maximum number of changes in memory (4096 changes), we consume the > changes we have in memory and invoke new stream API methods. This > happens in ReorderBufferStreamTXN() using about the same logic as > in ReorderBufferCommit() logic. > > I think the above part of the commit message needs to be updated. Done > 10. > Theoretically, we could get rid of the k-way merge, and append the > changes to the toplevel xact directly (and remember the position > in the list in case the subxact gets aborted later). > > I don't think this part of the commit message is correct as we > sometimes need to spill even during streaming. Please check the > entire commit message and update according to the latest > implementation. Done > 11. > - * HeapTupleSatisfiesHistoricMVCC. > + * tqual.c's HeapTupleSatisfiesHistoricMVCC. > + * > + * We do build the hash table even if there are no CIDs. That's > + * because when streaming in-progress transactions we may run into > + * tuples with the CID before actually decoding them. Think e.g. about > + * INSERT followed by TRUNCATE, where the TRUNCATE may not be decoded > + * yet when applying the INSERT. So we build a hash table so that > + * ResolveCminCmaxDuringDecoding does not segfault in this case. > + * > + * XXX We might limit this behavior to streaming mode, and just bail > + * out when decoding transaction at commit time (at which point it's > + * guaranteed to see all CIDs). > */ > static void > ReorderBufferBuildTupleCidHash(ReorderBuffer *rb, ReorderBufferTXN *txn) > @@ -1350,9 +1498,6 @@ ReorderBufferBuildTupleCidHash(ReorderBuffer > *rb, ReorderBufferTXN *txn) > dlist_iter iter; > HASHCTL hash_ctl; > > - if (!rbtxn_has_catalog_changes(txn) || dlist_is_empty(&txn->tuplecids)) > - return; > - > > I don't understand this change. 
Why would "INSERT followed by > TRUNCATE" could lead to a tuple which can come for decode before its > CID? The patch has made changes based on this assumption in > HeapTupleSatisfiesHistoricMVCC which appears to be very risky as the > behavior could be dependent on whether we are streaming the changes > for in-progress xact or at the commit of a transaction. We might want > to generate a test to once validate this behavior. > > Also, the comment refers to tqual.c which is wrong as this API is now > in heapam_visibility.c. Done. > 12. > + * setup CheckXidAlive if it's not committed yet. We don't check if the xid > + * aborted. That will happen during catalog access. Also reset the > + * sysbegin_called flag. > */ > - if (txn->base_snapshot == NULL) > + if (!TransactionIdDidCommit(xid)) > { > - Assert(txn->ninvalidations == 0); > - ReorderBufferCleanupTXN(rb, txn); > - return; > + CheckXidAlive = xid; > + bsysscan = false; > } > > In the comment, the flag name 'sysbegin_called' should be bsysscan. Done -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Attachment
- v21-0003-Extend-the-output-plugin-API-with-stream-methods.patch
- v21-0004-Gracefully-handle-concurrent-aborts-of-uncommitt.patch
- v21-0001-Immediately-WAL-log-assignments.patch
- v21-0002-Issue-individual-invalidations-with-wal_level-lo.patch
- v21-0005-Implement-streaming-mode-in-ReorderBuffer.patch
- v21-0006-Add-support-for-streaming-to-built-in-replicatio.patch
- v21-0008-Enable-streaming-for-all-subscription-TAP-tests.patch
- v21-0009-Add-TAP-test-for-streaming-vs.-DDL.patch
- v21-0010-Bugfix-handling-of-incomplete-toast-tuple.patch
- v21-0007-Track-statistics-for-streaming.patch
- v21-0012-Add-streaming-option-in-pg_dump.patch
- v21-0011-Provide-new-api-to-get-the-streaming-changes.patch
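As background for comments 7 and 8 above, the concurrent-abort detection that the CheckXidAlive comment describes boils down to a check of roughly this shape after each tuple fetched in the systable_* routines (a sketch based on the patch; the error message text is illustrative):

    /*
     * If CheckXidAlive is set, the in-progress transaction being decoded may
     * have aborted concurrently.  If it is neither running nor committed,
     * error out with a recognizable error code so that the decoding logic
     * can discard the changes already streamed for it.
     */
    if (TransactionIdIsValid(CheckXidAlive) &&
        !TransactionIdIsInProgress(CheckXidAlive) &&
        !TransactionIdDidCommit(CheckXidAlive))
        ereport(ERROR,
                (errcode(ERRCODE_TRANSACTION_ROLLBACK),
                 errmsg("transaction aborted during system catalog scan")));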
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Dilip Kumar
Date:
On Wed, May 13, 2020 at 4:50 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Wed, May 13, 2020 at 11:35 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > On Tue, May 12, 2020 at 4:39 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > > > v20-0003-Extend-the-output-plugin-API-with-stream-methods > > > ---------------------------------------------------------------------------------------- > > > 1. > > > +static void > > > +pg_decode_stream_change(LogicalDecodingContext *ctx, > > > + ReorderBufferTXN *txn, > > > + Relation relation, > > > + ReorderBufferChange *change) > > > +{ > > > + OutputPluginPrepareWrite(ctx, true); > > > + appendStringInfo(ctx->out, "streaming change for TXN %u", txn->xid); > > > + OutputPluginWrite(ctx, true); > > > +} > > > + > > > +static void > > > +pg_decode_stream_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn, > > > + int nrelations, Relation relations[], > > > + ReorderBufferChange *change) > > > +{ > > > + OutputPluginPrepareWrite(ctx, true); > > > + appendStringInfo(ctx->out, "streaming truncate for TXN %u", txn->xid); > > > + OutputPluginWrite(ctx, true); > > > +} > > > > > > In the above and similar APIs, there are parameters like relation > > > which are not used. I think you should add some comments atop these > > > APIs to explain why it is so? I guess it is because we want to keep > > > them similar to non-stream version of APIs and we can't display > > > relation or other information as the transaction is still in-progress. > > > > I think because the interfaces are designed that way because other > > decoding plugins might need it e.g. in pgoutput we need change and > > relation but not here. We have other similar examples also e.g. > > pg_decode_message has the parameter txn but not used. Do you think we > > still need to add comments? > > > > In that case, we can leave but lets ensure that we are not exposing > any parameter which is not used and if there is any due to some > reason, we should document it. I will also look into this. Ok > > > 4. > > > +static void > > > +stream_start_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn) > > > +{ > > > + LogicalDecodingContext *ctx = cache->private_data; > > > + LogicalErrorCallbackState state; > > > + ErrorContextCallback errcallback; > > > + > > > + Assert(!ctx->fast_forward); > > > + > > > + /* We're only supposed to call this when streaming is supported. */ > > > + Assert(ctx->streaming); > > > + > > > + /* Push callback + info on the error context stack */ > > > + state.ctx = ctx; > > > + state.callback_name = "stream_start"; > > > + /* state.report_location = apply_lsn; */ > > > > > > Why can't we supply the report_location here? I think here we need to > > > report txn->first_lsn if this is the very first stream and > > > txn->final_lsn if it is any consecutive one. > > > > I am not sure about this, Because for the very first stream we will > > report the location of the first lsn of the stream and for the > > consecutive stream we will report the last lsn in the stream. > > > > Yeah, that doesn't seem to be consistent. How about if get it as an > additional parameter? The caller can pass the lsn of the very first > change it is trying to decode in this stream. Done > > > 11. > > > - * HeapTupleSatisfiesHistoricMVCC. > > > + * tqual.c's HeapTupleSatisfiesHistoricMVCC. > > > + * > > > + * We do build the hash table even if there are no CIDs. 
That's > > > + * because when streaming in-progress transactions we may run into > > > + * tuples with the CID before actually decoding them. Think e.g. about > > > + * INSERT followed by TRUNCATE, where the TRUNCATE may not be decoded > > > + * yet when applying the INSERT. So we build a hash table so that > > > + * ResolveCminCmaxDuringDecoding does not segfault in this case. > > > + * > > > + * XXX We might limit this behavior to streaming mode, and just bail > > > + * out when decoding transaction at commit time (at which point it's > > > + * guaranteed to see all CIDs). > > > */ > > > static void > > > ReorderBufferBuildTupleCidHash(ReorderBuffer *rb, ReorderBufferTXN *txn) > > > @@ -1350,9 +1498,6 @@ ReorderBufferBuildTupleCidHash(ReorderBuffer > > > *rb, ReorderBufferTXN *txn) > > > dlist_iter iter; > > > HASHCTL hash_ctl; > > > > > > - if (!rbtxn_has_catalog_changes(txn) || dlist_is_empty(&txn->tuplecids)) > > > - return; > > > - > > > > > > I don't understand this change. Why would "INSERT followed by > > > TRUNCATE" could lead to a tuple which can come for decode before its > > > CID? > > > > Actually, even if we haven't decoded the DDL operation but in the > > actual system table the tuple might have been deleted from the next > > operation. e.g. while we are streaming the INSERT it is possible that > > the truncate has already deleted that tuple and set the max for the > > tuple. So before streaming patch, we were only streaming the INSERT > > only on commit so by that time we had got all the operation which has > > done DDL and we would have already prepared tuple CID hash. > > > > Okay, but I think for that case how good is that we always allow CID > hash table to be built even if there are no catalog changes in TXN > (see changes in ReorderBufferBuildTupleCidHash). Can't we detect that > while resolving the cmin/cmax? Done > > Few more comments for v20-0005-Implement-streaming-mode-in-ReorderBuffer: > ---------------------------------------------------------------------------------------------------------------- > 1. > /* > - * Binary heap comparison function. > + * Binary heap comparison function (regular non-streaming iterator). > */ > static int > ReorderBufferIterCompare(Datum a, Datum b, void *arg) > > It seems to me the above comment change is not required as per the latest patch. Done > 2. > * For subtransactions, we only mark them as streamed when there are > + * any changes in them. > + * > + * We do it this way because of aborts - we don't want to send aborts > + * for XIDs the downstream is not aware of. And of course, it always > + * knows about the toplevel xact (we send the XID in all messages), > + * but we never stream XIDs of empty subxacts. > + */ > + if ((!txn->toptxn) || (txn->nentries_mem != 0)) > + txn->txn_flags |= RBTXN_IS_STREAMED; > > /when there are any changes in them/when there are changes in them. I > think we don't need 'any' in the above sentence. Done > 3. > And, during catalog scan we can check the status of the xid and > + * if it is aborted we will report a specific error that we can ignore. We > + * might have already streamed some of the changes for the aborted > + * (sub)transaction, but that is fine because when we decode the abort we will > + * stream abort message to truncate the changes in the subscriber. > + */ > +static inline void > +SetupCheckXidLive(TransactionId xid) > > In the above comment, I don't think it is right to say that we ignore > the error raised due to the aborted transaction. 
We need to say that > we discard the already streamed changes on such an error. Done. > 4. > +static inline void > +SetupCheckXidLive(TransactionId xid) > +{ > /* > - * If this transaction has no snapshot, it didn't make any changes to the > - * database, so there's nothing to decode. Note that > - * ReorderBufferCommitChild will have transferred any snapshots from > - * subtransactions if there were any. > + * setup CheckXidAlive if it's not committed yet. We don't check if the xid > + * aborted. That will happen during catalog access. Also reset the > + * sysbegin_called flag. > */ > - if (txn->base_snapshot == NULL) > + if (!TransactionIdDidCommit(xid)) > { > - Assert(txn->ninvalidations == 0); > - ReorderBufferCleanupTXN(rb, txn); > - return; > + CheckXidAlive = xid; > + bsysscan = false; > } > > I think this function is inline as it needs to be called for each > change. If that is the case and otherwise also, isn't it better that > we check if passed xid is the same as CheckXidAlive before checking > TransactionIdDidCommit as TransactionIdDidCommit can be costly and > calling it for each change might not be a good idea? Done, Also I think it is good the check the TransactionIdIsInProgress instead of !TransactionIdDidCommit. I have changed that as well. > 5. > setup CheckXidAlive if it's not committed yet. We don't check if the xid > + * aborted. That will happen during catalog access. Also reset the > + * sysbegin_called flag. > > /if the xid aborted/if the xid is aborted. missing comma after Also. Done > 6. > ReorderBufferProcessTXN() > { > .. > - /* build data to be able to lookup the CommandIds of catalog tuples */ > + /* > + * build data to be able to lookup the CommandIds of catalog tuples > + */ > ReorderBufferBuildTupleCidHash(rb, txn); > .. > } > > Is there a need to change the formatting of the comment? No need changed back. > > 7. > ReorderBufferProcessTXN() > { > .. > if (using_subtxn) > - BeginInternalSubTransaction("replay"); > + BeginInternalSubTransaction("stream"); > else > StartTransactionCommand(); > .. > } > > I am not sure changing unconditionally "replay" to "stream" is a good > idea. How about something like BeginInternalSubTransaction(streaming > ? "stream" : "replay");? Done > 8. > @@ -1588,8 +1766,6 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid, > * use as a normal record. It'll be cleaned up at the end > * of INSERT processing. > */ > - if (specinsert == NULL) > - elog(ERROR, "invalid ordering of speculative insertion changes"); > > You have removed this check but all other handling of specinsert is > same as far as this patch is concerned. Why so? Seems like a merge issue, or the leftover from the old design of the toast handling where we were streaming with the partial tuple. fixed now. > 9. > @@ -1676,8 +1860,6 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid, > * freed/reused while restoring spooled data from > * disk. > */ > - Assert(change->data.tp.newtuple != NULL); > - > dlist_delete(&change->node); > > Why is this Assert removed? Same cause as above so fixed. > 10. > @@ -1753,7 +1935,15 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid, > relations[nrelations++] = relation; > } > > - rb->apply_truncate(rb, txn, nrelations, relations, change); > + if (streaming) > + { > + rb->stream_truncate(rb, txn, nrelations, relations, change); > + > + /* Remember that we have sent some data. 
*/ > + change->txn->any_data_sent = true; > + } > + else > + rb->apply_truncate(rb, txn, nrelations, relations, change); > > Can we encapsulate this in a separate function like > ReorderBufferApplyTruncate or something like that? Basically, rather > than having streaming check in this function, lets do it in some other > internal function. And we can likewise do it for all the streaming > checks in this function or at least whereever it is feasible. That > will make this function look clean. Done for truncate and change. I think we can create a few more such functions for start/stop and cleanup handling on error. I will work on that. > 11. > + * We currently can only decode a transaction's contents when its commit > + * record is read because that's the only place where we know about cache > + * invalidations. Thus, once a toplevel commit is read, we iterate over the top > + * and subtransactions (using a k-way merge) and replay the changes in lsn > + * order. > + */ > +void > +ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid, > { > .. > > I think the above comment needs to be updated after this patch. This > API can now be used during the decode of both a in-progress and a > committed transaction. Done -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
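Combining the short-circuit from comment 4 with the switch to TransactionIdIsInProgress mentioned above, SetupCheckXidLive could end up looking roughly like this (a sketch of the direction discussed, not the exact patch code):

static inline void
SetupCheckXidLive(TransactionId xid)
{
    /* Already set up for this xid; avoid the costlier checks below. */
    if (TransactionIdEquals(CheckXidAlive, xid))
        return;

    /*
     * Remember the xid if the transaction is still in progress, so that the
     * systable_* routines can detect a concurrent abort; otherwise make sure
     * no stale xid is left behind.  Also reset the bsysscan flag.
     */
    if (TransactionIdIsInProgress(xid))
        CheckXidAlive = xid;
    else
        CheckXidAlive = InvalidTransactionId;

    bsysscan = false;
}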
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Amit Kapila
Date:
On Fri, May 15, 2020 at 2:47 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > 6. I think it will be good if we can provide an example of streaming > > changes via test_decoding at > > https://www.postgresql.org/docs/devel/test-decoding.html. I think we > > can also explain there why the user is not expected to see the actual > > data in the stream. > > I have a few problems to solve here. > - With streaming transaction also shall we show the actual values or > we shall do like it is currently in the patch > (appendStringInfo(ctx->out, "streaming change for TXN %u", > txn->xid);). I think we should show the actual values instead of what > we are doing now. > I think why we don't want to display the tuple at this stage is because it is not clear by this time if the transaction will commit or abort. I am not sure if displaying the contents of aborted transactions is a good idea but if there is a reason for doing so, we can do it later as well. > - In the example we can not show a real example, because of the > in-progress transaction to show the changes, we might have to > implement a lot of tuple. I think we can show the partial output? > I think we can display what API will actually display, what is the confusion here. I have a few more comments on the previous version of patch v20-0005-Implement-streaming-mode-in-ReorderBuffer. If you have fixed any, then leave those and fix others. Review comments: ------------------------------ 1. @@ -1762,10 +1952,16 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid, } case REORDER_BUFFER_CHANGE_MESSAGE: - rb->message(rb, txn, change->lsn, true, - change->data.msg.prefix, - change->data.msg.message_size, - change->data.msg.message); + if (streaming) + rb->stream_message(rb, txn, change->lsn, true, + change->data.msg.prefix, + change->data.msg.message_size, + change->data.msg.message); + else + rb->message(rb, txn, change->lsn, true, + change->data.msg.prefix, + change->data.msg.message_size, + change->data.msg.message); Don't we need to set any_data_sent flag while streaming messages as we do for other types of changes? 2. + if (streaming) + { + /* + * Set the last of the stream as the final lsn before calling + * stream stop. + */ + if (!XLogRecPtrIsInvalid(prev_lsn)) + txn->final_lsn = prev_lsn; + rb->stream_stop(rb, txn); + } I am not sure if it is good to use final_lsn for this purpose. See comments for this variable in reorderbuffer.h. Basically, it is used for a specific purpose on different occasions. Now, if we want to start using it for a new purpose, we need to study its interaction with all other places and update the comments as well. Can we pass an additional parameter to stream_stop() instead? 3. + /* remember the command ID and snapshot for the streaming run */ + txn->command_id = command_id; + + /* Avoid copying if it's already copied. */ + if (snapshot_now->copied) + txn->snapshot_now = snapshot_now; + else + txn->snapshot_now = ReorderBufferCopySnap(rb, snapshot_now, + txn, command_id); This code is used at two different places, can we try to keep this in a single function. 4. In ReorderBufferProcessTXN(), the patch is calling stream_stop in both the try and catch block. If there is an error after calling it in a try block, we might call it again via catch. I think that will lead to sending a stop message twice. Won't that be a problem? See the usage of iterstate in the catch block, we have made it safe from a similar problem. 5. + if (streaming) + { + /* Discard the changes that we just streamed. 
*/ + ReorderBufferTruncateTXN(rb, txn); - PG_RE_THROW(); + /* Re-throw only if it's not an abort. */ + if (errdata->sqlerrcode != ERRCODE_TRANSACTION_ROLLBACK) + { + MemoryContextSwitchTo(ecxt); + PG_RE_THROW(); + } + else + { + FlushErrorState(); + FreeErrorData(errdata); + errdata = NULL; + I think here we can write few comments on why we are doing error-code specific handling, basically, explain a bit about concurrent abort handling and or refer to the part of comments where it is explained. 6. PG_CATCH(); { + MemoryContext ecxt = MemoryContextSwitchTo(ccxt); + ErrorData *errdata = CopyErrorData(); I don't understand the usage of memory context in this part of the code. Basically, you are switching to CurrentMemoryContext here, do some error handling and then again reset back to some random context before rethrowing the error. If there is some purpose for it, then it might be better if you can write a few comments to explain the same. 7. +ReorderBufferCommit() { .. + /* + * If the transaction was (partially) streamed, we need to commit it in a + * 'streamed' way. That is, we first stream the remaining part of the + * transaction, and then invoke stream_commit message. + * + * XXX Called after everything (origin ID and LSN, ...) is stored in the + * transaction, so we don't pass that directly. + * + * XXX Somewhat hackish redirection, perhaps needs to be refactored? + */ + if (rbtxn_is_streamed(txn)) + { + ReorderBufferStreamCommit(rb, txn); + return; + } + .. } "XXX Somewhat hackish redirection, perhaps needs to be refactored?" What kind of refactoring we can do here? To me, it looks okay. 8. @@ -2295,6 +2677,13 @@ ReorderBufferXidSetCatalogChanges(ReorderBuffer *rb, TransactionId xid, txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true); txn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES; + + /* + * TOCHECK: Mark toplevel transaction as having catalog changes too + * if one of its children has. + */ + if (txn->toptxn != NULL) + txn->toptxn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES; } Why are we marking top transaction here? -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
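For comment 3, the duplicated snapshot bookkeeping could be pulled into a small helper along these lines (the helper name is invented here purely for illustration; the body is the code already present at both call sites):

/*
 * Remember the command id and snapshot of a partially streamed transaction
 * so that the next streaming run can resume with the same visibility
 * information.
 */
static void
ReorderBufferSaveTXNSnapshot(ReorderBuffer *rb, ReorderBufferTXN *txn,
                             Snapshot snapshot_now, CommandId command_id)
{
    txn->command_id = command_id;

    /* Avoid copying if the snapshot is already a copy we own. */
    if (snapshot_now->copied)
        txn->snapshot_now = snapshot_now;
    else
        txn->snapshot_now = ReorderBufferCopySnap(rb, snapshot_now,
                                                  txn, command_id);
}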
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Dilip Kumar
Date:
On Fri, May 15, 2020 at 4:04 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Fri, May 15, 2020 at 2:47 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > > 6. I think it will be good if we can provide an example of streaming > > > changes via test_decoding at > > > https://www.postgresql.org/docs/devel/test-decoding.html. I think we > > > can also explain there why the user is not expected to see the actual > > > data in the stream. > > > > I have a few problems to solve here. > > - With streaming transaction also shall we show the actual values or > > we shall do like it is currently in the patch > > (appendStringInfo(ctx->out, "streaming change for TXN %u", > > txn->xid);). I think we should show the actual values instead of what > > we are doing now. > > > > I think why we don't want to display the tuple at this stage is > because it is not clear by this time if the transaction will commit or > abort. I am not sure if displaying the contents of aborted > transactions is a good idea but if there is a reason for doing so, we > can do it later as well. Ok. > > > - In the example we can not show a real example, because of the > > in-progress transaction to show the changes, we might have to > > implement a lot of tuple. I think we can show the partial output? > > > > I think we can display what API will actually display, what is the > confusion here. What, I meant is that even with the logical_decoding_work_mem=64kb, we need to have quite a few changes in a transaction to stream it so the example output will be quite big in size. So I told we might not show the real example instead we will just show a few lines and cut the remaining. But, I got your point we can just show how it will look like. > > I have a few more comments on the previous version of patch > v20-0005-Implement-streaming-mode-in-ReorderBuffer. If you have fixed > any, then leave those and fix others. > > Review comments: > ------------------------------ > 1. > @@ -1762,10 +1952,16 @@ ReorderBufferCommit(ReorderBuffer *rb, > TransactionId xid, > } > > case REORDER_BUFFER_CHANGE_MESSAGE: > - rb->message(rb, txn, change->lsn, true, > - change->data.msg.prefix, > - change->data.msg.message_size, > - change->data.msg.message); > + if (streaming) > + rb->stream_message(rb, txn, change->lsn, true, > + change->data.msg.prefix, > + change->data.msg.message_size, > + change->data.msg.message); > + else > + rb->message(rb, txn, change->lsn, true, > + change->data.msg.prefix, > + change->data.msg.message_size, > + change->data.msg.message); > > Don't we need to set any_data_sent flag while streaming messages as we > do for other types of changes? Actually, pgoutput plugin don't send any data on stream_message. But, I agree that how other plugin will handle. I will analyze this part again, maybe we have to such flag at the plugin level and whether stop is sent to not can also be handled at the plugin level. > 2. > + if (streaming) > + { > + /* > + * Set the last of the stream as the final lsn before calling > + * stream stop. > + */ > + if (!XLogRecPtrIsInvalid(prev_lsn)) > + txn->final_lsn = prev_lsn; > + rb->stream_stop(rb, txn); > + } > > I am not sure if it is good to use final_lsn for this purpose. See > comments for this variable in reorderbuffer.h. Basically, it is used > for a specific purpose on different occasions. Now, if we want to > start using it for a new purpose, we need to study its interaction > with all other places and update the comments as well. 
Can we pass an > additional parameter to stream_stop() instead? I think it was in sync with the spill code, right? I mean, the last change we spill is set as the final_lsn and the same is done here. The other comments look fine, so I will work on them and reply separately. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Amit Kapila
Date:
On Fri, May 15, 2020 at 4:20 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Fri, May 15, 2020 at 4:04 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > > - In the example we can not show a real example, because of the > > > in-progress transaction to show the changes, we might have to > > > implement a lot of tuple. I think we can show the partial output? > > > > > > > I think we can display what API will actually display, what is the > > confusion here. > > What, I meant is that even with the logical_decoding_work_mem=64kb, we > need to have quite a few changes in a transaction to stream it so the > example output will be quite big in size. So I told we might not show > the real example instead we will just show a few lines and cut the > remaining. But, I got your point we can just show how it will look > like. > Right. > > > > I have a few more comments on the previous version of patch > > v20-0005-Implement-streaming-mode-in-ReorderBuffer. If you have fixed > > any, then leave those and fix others. > > > > Review comments: > > ------------------------------ > > 1. > > @@ -1762,10 +1952,16 @@ ReorderBufferCommit(ReorderBuffer *rb, > > TransactionId xid, > > } > > > > case REORDER_BUFFER_CHANGE_MESSAGE: > > - rb->message(rb, txn, change->lsn, true, > > - change->data.msg.prefix, > > - change->data.msg.message_size, > > - change->data.msg.message); > > + if (streaming) > > + rb->stream_message(rb, txn, change->lsn, true, > > + change->data.msg.prefix, > > + change->data.msg.message_size, > > + change->data.msg.message); > > + else > > + rb->message(rb, txn, change->lsn, true, > > + change->data.msg.prefix, > > + change->data.msg.message_size, > > + change->data.msg.message); > > > > Don't we need to set any_data_sent flag while streaming messages as we > > do for other types of changes? > > Actually, pgoutput plugin don't send any data on stream_message. But, > I agree that how other plugin will handle. I will analyze this part > again, maybe we have to such flag at the plugin level and whether stop > is sent to not can also be handled at the plugin level. > Okay, lets discuss this after your analysis. > > 2. > > + if (streaming) > > + { > > + /* > > + * Set the last of the stream as the final lsn before calling > > + * stream stop. > > + */ > > + if (!XLogRecPtrIsInvalid(prev_lsn)) > > + txn->final_lsn = prev_lsn; > > + rb->stream_stop(rb, txn); > > + } > > > > I am not sure if it is good to use final_lsn for this purpose. See > > comments for this variable in reorderbuffer.h. Basically, it is used > > for a specific purpose on different occasions. Now, if we want to > > start using it for a new purpose, we need to study its interaction > > with all other places and update the comments as well. Can we pass an > > additional parameter to stream_stop() instead? > > I think it was in sycn with the spill code right? I mean the last > change we spill is set as the final_lsn and same is done here. > But we use final_lsn in ReorderBufferRestoreCleanup() for serialized changes. Now, in some case if we first do serialization, then perform streaming and then tried to call ReorderBufferRestoreCleanup(), it might not work as intended. Now, this might not happen today but I don't think we have any protection to avoid that. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Dilip Kumar
Date:
On Fri, May 15, 2020 at 4:35 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Fri, May 15, 2020 at 4:20 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > On Fri, May 15, 2020 at 4:04 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > > > > > > - In the example we can not show a real example, because of the > > > > in-progress transaction to show the changes, we might have to > > > > implement a lot of tuple. I think we can show the partial output? > > > > > > > > > > I think we can display what API will actually display, what is the > > > confusion here. > > > > What, I meant is that even with the logical_decoding_work_mem=64kb, we > > need to have quite a few changes in a transaction to stream it so the > > example output will be quite big in size. So I told we might not show > > the real example instead we will just show a few lines and cut the > > remaining. But, I got your point we can just show how it will look > > like. > > > > Right. > > > > > > > I have a few more comments on the previous version of patch > > > v20-0005-Implement-streaming-mode-in-ReorderBuffer. If you have fixed > > > any, then leave those and fix others. > > > > > > Review comments: > > > ------------------------------ > > > 1. > > > @@ -1762,10 +1952,16 @@ ReorderBufferCommit(ReorderBuffer *rb, > > > TransactionId xid, > > > } > > > > > > case REORDER_BUFFER_CHANGE_MESSAGE: > > > - rb->message(rb, txn, change->lsn, true, > > > - change->data.msg.prefix, > > > - change->data.msg.message_size, > > > - change->data.msg.message); > > > + if (streaming) > > > + rb->stream_message(rb, txn, change->lsn, true, > > > + change->data.msg.prefix, > > > + change->data.msg.message_size, > > > + change->data.msg.message); > > > + else > > > + rb->message(rb, txn, change->lsn, true, > > > + change->data.msg.prefix, > > > + change->data.msg.message_size, > > > + change->data.msg.message); > > > > > > Don't we need to set any_data_sent flag while streaming messages as we > > > do for other types of changes? > > > > Actually, pgoutput plugin don't send any data on stream_message. But, > > I agree that how other plugin will handle. I will analyze this part > > again, maybe we have to such flag at the plugin level and whether stop > > is sent to not can also be handled at the plugin level. > > > > Okay, lets discuss this after your analysis. > > > > 2. > > > + if (streaming) > > > + { > > > + /* > > > + * Set the last of the stream as the final lsn before calling > > > + * stream stop. > > > + */ > > > + if (!XLogRecPtrIsInvalid(prev_lsn)) > > > + txn->final_lsn = prev_lsn; > > > + rb->stream_stop(rb, txn); > > > + } > > > > > > I am not sure if it is good to use final_lsn for this purpose. See > > > comments for this variable in reorderbuffer.h. Basically, it is used > > > for a specific purpose on different occasions. Now, if we want to > > > start using it for a new purpose, we need to study its interaction > > > with all other places and update the comments as well. Can we pass an > > > additional parameter to stream_stop() instead? > > > > I think it was in sycn with the spill code right? I mean the last > > change we spill is set as the final_lsn and same is done here. > > > > But we use final_lsn in ReorderBufferRestoreCleanup() for serialized > changes. Now, in some case if we first do serialization, then perform > streaming and then tried to call ReorderBufferRestoreCleanup(),it > might not work as intended. Now, this might not happen today but I > don't think we have any protection to avoid that. 
If streaming is complete then we will remove the serialized flag, so it will not cause any issue. However, we can avoid setting final_lsn here and instead pass the last LSN of the stream as an additional parameter to stream_stop. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
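A minimal sketch of that alternative, i.e. leaving txn->final_lsn alone and handing the last LSN of the run to stream_stop directly. The extra callback parameter is an assumption of this sketch, not necessarily the shape the patch ends up with:

    if (streaming)
    {
        /*
         * Hand the LSN of the last change in this run to stream_stop
         * instead of overloading txn->final_lsn (which the cleanup of
         * serialized changes relies on).  prev_lsn may still be invalid
         * if nothing was processed in this run.
         */
        rb->stream_stop(rb, txn, prev_lsn);
    }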
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Dilip Kumar
Date:
On Fri, May 15, 2020 at 4:04 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Fri, May 15, 2020 at 2:47 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > > 6. I think it will be good if we can provide an example of streaming > > > changes via test_decoding at > > > https://www.postgresql.org/docs/devel/test-decoding.html. I think we > > > can also explain there why the user is not expected to see the actual > > > data in the stream. > > > > I have a few problems to solve here. > > - With streaming transaction also shall we show the actual values or > > we shall do like it is currently in the patch > > (appendStringInfo(ctx->out, "streaming change for TXN %u", > > txn->xid);). I think we should show the actual values instead of what > > we are doing now. > > > > I think why we don't want to display the tuple at this stage is > because it is not clear by this time if the transaction will commit or > abort. I am not sure if displaying the contents of aborted > transactions is a good idea but if there is a reason for doing so, we > can do it later as well. > > > - In the example we can not show a real example, because of the > > in-progress transaction to show the changes, we might have to > > implement a lot of tuple. I think we can show the partial output? > > > > I think we can display what API will actually display, what is the > confusion here. Added example in the v22-0011 patch where I have added the API to get streaming changes. > I have a few more comments on the previous version of patch > v20-0005-Implement-streaming-mode-in-ReorderBuffer. If you have fixed > any, then leave those and fix others. > > Review comments: > ------------------------------ > 1. > @@ -1762,10 +1952,16 @@ ReorderBufferCommit(ReorderBuffer *rb, > TransactionId xid, > } > > case REORDER_BUFFER_CHANGE_MESSAGE: > - rb->message(rb, txn, change->lsn, true, > - change->data.msg.prefix, > - change->data.msg.message_size, > - change->data.msg.message); > + if (streaming) > + rb->stream_message(rb, txn, change->lsn, true, > + change->data.msg.prefix, > + change->data.msg.message_size, > + change->data.msg.message); > + else > + rb->message(rb, txn, change->lsn, true, > + change->data.msg.prefix, > + change->data.msg.message_size, > + change->data.msg.message); > > Don't we need to set any_data_sent flag while streaming messages as we > do for other types of changes? I think any_data_sent, was added to avoid sending abort to the subscriber if we haven't sent any data, but this is not complete as the output plugin can also take the decision not to send. So I think this should not be done as part of this patch and can be done separately. I think there is already a thread for handling the same[1] > 2. > + if (streaming) > + { > + /* > + * Set the last of the stream as the final lsn before calling > + * stream stop. > + */ > + if (!XLogRecPtrIsInvalid(prev_lsn)) > + txn->final_lsn = prev_lsn; > + rb->stream_stop(rb, txn); > + } > > I am not sure if it is good to use final_lsn for this purpose. See > comments for this variable in reorderbuffer.h. Basically, it is used > for a specific purpose on different occasions. Now, if we want to > start using it for a new purpose, we need to study its interaction > with all other places and update the comments as well. Can we pass an > additional parameter to stream_stop() instead? Done > 3. > + /* remember the command ID and snapshot for the streaming run */ > + txn->command_id = command_id; > + > + /* Avoid copying if it's already copied. 
*/ > + if (snapshot_now->copied) > + txn->snapshot_now = snapshot_now; > + else > + txn->snapshot_now = ReorderBufferCopySnap(rb, snapshot_now, > + txn, command_id); > > This code is used at two different places, can we try to keep this in > a single function. Done > 4. > In ReorderBufferProcessTXN(), the patch is calling stream_stop in both > the try and catch block. If there is an error after calling it in a > try block, we might call it again via catch. I think that will lead > to sending a stop message twice. Won't that be a problem? See the > usage of iterstate in the catch block, we have made it safe from a > similar problem. IMHO, we don't need that, because we only call stream_stop in the catch block if the error type is ERRCODE_TRANSACTION_ROLLBACK. So if in TRY block we have already stopped the stream then we should not get that error. I have added the comments for the same. > 5. > + if (streaming) > + { > + /* Discard the changes that we just streamed. */ > + ReorderBufferTruncateTXN(rb, txn); > > - PG_RE_THROW(); > + /* Re-throw only if it's not an abort. */ > + if (errdata->sqlerrcode != ERRCODE_TRANSACTION_ROLLBACK) > + { > + MemoryContextSwitchTo(ecxt); > + PG_RE_THROW(); > + } > + else > + { > + FlushErrorState(); > + FreeErrorData(errdata); > + errdata = NULL; > + > > I think here we can write few comments on why we are doing error-code > specific handling, basically, explain a bit about concurrent abort > handling and or refer to the part of comments where it is explained. Done > 6. > PG_CATCH(); > { > + MemoryContext ecxt = MemoryContextSwitchTo(ccxt); > + ErrorData *errdata = CopyErrorData(); > > I don't understand the usage of memory context in this part of the > code. Basically, you are switching to CurrentMemoryContext here, do > some error handling and then again reset back to some random context > before rethrowing the error. If there is some purpose for it, then it > might be better if you can write a few comments to explain the same. Basically, the ccxt is the CurrentMemoryContext when we started the streaming and ecxt it the context when we catch the error. So ideally, before this change, it will rethrow in the context when we catch the error i.e. ecxt. So what we are trying to do is put it back to normal context (ccxt) and copy the error data in the normal context. And, if we are not handling it gracefully then put it back to the context it was in, and rethrow. > > 7. > +ReorderBufferCommit() > { > .. > + /* > + * If the transaction was (partially) streamed, we need to commit it in a > + * 'streamed' way. That is, we first stream the remaining part of the > + * transaction, and then invoke stream_commit message. > + * > + * XXX Called after everything (origin ID and LSN, ...) is stored in the > + * transaction, so we don't pass that directly. > + * > + * XXX Somewhat hackish redirection, perhaps needs to be refactored? > + */ > + if (rbtxn_is_streamed(txn)) > + { > + ReorderBufferStreamCommit(rb, txn); > + return; > + } > + > .. > } > > "XXX Somewhat hackish redirection, perhaps needs to be refactored?" > What kind of refactoring we can do here? To me, it looks okay. I think it looks fine to me also. So I have removed this comment. > 8. > @@ -2295,6 +2677,13 @@ ReorderBufferXidSetCatalogChanges(ReorderBuffer > *rb, TransactionId xid, > txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true); > > txn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES; > + > + /* > + * TOCHECK: Mark toplevel transaction as having catalog changes too > + * if one of its children has. 
> + */ > + if (txn->toptxn != NULL) > + txn->toptxn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES; > } > > Why are we marking top transaction here? We need to mark top transaction to decide whether to build tuplecid hash or not. In non-streaming mode, we are only sending during the commit time, and during commit time we know whether the top transaction has any catalog changes or not based on the invalidation message so we are marking the top transaction there in DecodeCommit. Since here we are not waiting till commit so we need to mark the top transaction as soon as we mark any of its child transactions. [1] https://www.postgresql.org/message-id/CAMkU=1yohp9-dv48FLoSPrMqYEyyS5ZWkaZGD41RJr10xiNo_Q@mail.gmail.com -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Attachment
- v22-0001-Immediately-WAL-log-assignments.patch
- v22-0004-Gracefully-handle-concurrent-aborts-of-uncommitt.patch
- v22-0002-Issue-individual-invalidations-with-wal_level-lo.patch
- v22-0003-Extend-the-output-plugin-API-with-stream-methods.patch
- v22-0005-Implement-streaming-mode-in-ReorderBuffer.patch
- v22-0008-Enable-streaming-for-all-subscription-TAP-tests.patch
- v22-0006-Add-support-for-streaming-to-built-in-replicatio.patch
- v22-0009-Add-TAP-test-for-streaming-vs.-DDL.patch
- v22-0007-Track-statistics-for-streaming.patch
- v22-0010-Bugfix-handling-of-incomplete-toast-tuple.patch
- v22-0011-Provide-new-api-to-get-the-streaming-changes.patch
- v22-0012-Add-streaming-option-in-pg_dump.patch
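For reference, the error handling Dilip describes for comments 5 and 6 above can be sketched roughly as follows. This is an annotated outline of the quoted hunks rather than the exact patch code (in the patch the truncation only happens when streaming):

    PG_CATCH();
    {
        /*
         * ccxt is the memory context we were in when streaming started;
         * switch back to it so the copied ErrorData is not allocated in
         * the error context that is about to be reset.
         */
        MemoryContext ecxt = MemoryContextSwitchTo(ccxt);
        ErrorData  *errdata = CopyErrorData();

        /* Discard the changes that we just streamed. */
        ReorderBufferTruncateTXN(rb, txn);

        if (errdata->sqlerrcode == ERRCODE_TRANSACTION_ROLLBACK)
        {
            /*
             * Concurrent abort of the transaction being streamed: swallow
             * the error; the eventual abort record will tell the subscriber
             * to throw the streamed changes away.
             */
            FlushErrorState();
            FreeErrorData(errdata);
            errdata = NULL;
        }
        else
        {
            /* Any other error is real: restore the context and rethrow. */
            MemoryContextSwitchTo(ecxt);
            PG_RE_THROW();
        }
    }
    PG_END_TRY();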
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Amit Kapila
Date:
On Sun, May 17, 2020 at 12:41 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Fri, May 15, 2020 at 4:04 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > Review comments: > > ------------------------------ > > 1. > > @@ -1762,10 +1952,16 @@ ReorderBufferCommit(ReorderBuffer *rb, > > TransactionId xid, > > } > > > > case REORDER_BUFFER_CHANGE_MESSAGE: > > - rb->message(rb, txn, change->lsn, true, > > - change->data.msg.prefix, > > - change->data.msg.message_size, > > - change->data.msg.message); > > + if (streaming) > > + rb->stream_message(rb, txn, change->lsn, true, > > + change->data.msg.prefix, > > + change->data.msg.message_size, > > + change->data.msg.message); > > + else > > + rb->message(rb, txn, change->lsn, true, > > + change->data.msg.prefix, > > + change->data.msg.message_size, > > + change->data.msg.message); > > > > Don't we need to set any_data_sent flag while streaming messages as we > > do for other types of changes? > > I think any_data_sent, was added to avoid sending abort to the > subscriber if we haven't sent any data, but this is not complete as > the output plugin can also take the decision not to send. So I think > this should not be done as part of this patch and can be done > separately. I think there is already a thread for handling the > same[1] > Hmm, but prior to this patch, we never use to send (empty) aborts but now that will be possible. It is probably okay to deal that with another patch mentioned by you but I felt at least any_data_sent will work for some cases. OTOH, it appears to be half-baked solution, so we should probably refrain from adding it. BTW, how do the pgoutput plugin deal with it? I see that apply_handle_stream_abort will unconditionally try to unlink the file and it will probably fail. Have you tested this scenario after your latest changes? > > > 4. > > In ReorderBufferProcessTXN(), the patch is calling stream_stop in both > > the try and catch block. If there is an error after calling it in a > > try block, we might call it again via catch. I think that will lead > > to sending a stop message twice. Won't that be a problem? See the > > usage of iterstate in the catch block, we have made it safe from a > > similar problem. > > IMHO, we don't need that, because we only call stream_stop in the > catch block if the error type is ERRCODE_TRANSACTION_ROLLBACK. So if > in TRY block we have already stopped the stream then we should not get > that error. I have added the comments for the same. > I am still slightly nervous about it as I don't see any solid guarantee for the same. You are right as the code stands today but due to any code that gets added in the future, it might not remain true. I feel it is better to have an Assert here to ensure that stream_stop won't be called the second time. I don't see any good way of doing it other than by maintaining flag or some state but I think it will be good to ensure this. > > > 6. > > PG_CATCH(); > > { > > + MemoryContext ecxt = MemoryContextSwitchTo(ccxt); > > + ErrorData *errdata = CopyErrorData(); > > > > I don't understand the usage of memory context in this part of the > > code. Basically, you are switching to CurrentMemoryContext here, do > > some error handling and then again reset back to some random context > > before rethrowing the error. If there is some purpose for it, then it > > might be better if you can write a few comments to explain the same. 
> > Basically, the ccxt is the CurrentMemoryContext when we started the > streaming and ecxt it the context when we catch the error. So > ideally, before this change, it will rethrow in the context when we > catch the error i.e. ecxt. So what we are trying to do is put it back > to normal context (ccxt) and copy the error data in the normal > context. And, if we are not handling it gracefully then put it back > to the context it was in, and rethrow. > Okay, but when errorcode is *not* ERRCODE_TRANSACTION_ROLLBACK, don't we need to clean up the reorderbuffer by calling ReorderBufferCleanupTXN? If so, then you can try to combine it with the not-streaming else loop. > > > 8. > > @@ -2295,6 +2677,13 @@ ReorderBufferXidSetCatalogChanges(ReorderBuffer > > *rb, TransactionId xid, > > txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true); > > > > txn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES; > > + > > + /* > > + * TOCHECK: Mark toplevel transaction as having catalog changes too > > + * if one of its children has. > > + */ > > + if (txn->toptxn != NULL) > > + txn->toptxn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES; > > } > > > > Why are we marking top transaction here? > > We need to mark top transaction to decide whether to build tuplecid > hash or not. In non-streaming mode, we are only sending during the > commit time, and during commit time we know whether the top > transaction has any catalog changes or not based on the invalidation > message so we are marking the top transaction there in DecodeCommit. > Since here we are not waiting till commit so we need to mark the top > transaction as soon as we mark any of its child transactions. > But how does it help? We use this flag (via ReorderBufferXidHasCatalogChanges) in SnapBuildCommitTxn which is anyway done in DecodeCommit and that too after setting this flag for the top transaction if required. So, how will it help in setting it while processing for subxid. Also, even if we have to do it won't it add the xid needlessly in builder->committed.xip array? -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
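A rough sketch of the kind of guard asked for in point 4 above: track whether stream_stop has already been sent for the current run and Assert on a second attempt. The flag and helper names are invented here, and the extra last-LSN parameter follows the change discussed earlier in the thread:

    /* Invented for illustration: one flag per in-progress streaming run. */
    static bool stream_stop_sent = false;

    static void
    stream_stop_once(ReorderBuffer *rb, ReorderBufferTXN *txn, XLogRecPtr last_lsn)
    {
        /* A second stream_stop for the same run indicates a logic bug. */
        Assert(!stream_stop_sent);

        rb->stream_stop(rb, txn, last_lsn);
        stream_stop_sent = true;
    }

The flag would be reset again when the next stream_start is sent.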
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Amit Kapila
Date:
On Mon, May 18, 2020 at 4:10 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Sun, May 17, 2020 at 12:41 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > Few comments on v20-0010-Bugfix-handling-of-incomplete-toast-tuple 1. + /* + * If this is a toast insert then set the corresponding bit. Otherwise, if + * we have toast insert bit set and this is insert/update then clear the + * bit. + */ + if (toast_insert) + toptxn->txn_flags |= RBTXN_HAS_TOAST_INSERT; + else if (rbtxn_has_toast_insert(txn) && + ChangeIsInsertOrUpdate(change->action)) + { Here, it might better to add a comment on why we expect only Insert/Update? Also, it might be better that we add an assert for other operations. 2. @@ -1865,8 +1920,8 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn, * disk. */ dlist_delete(&change->node); - ReorderBufferToastAppendChunk(rb, txn, relation, - change); + ReorderBufferToastAppendChunk(rb, txn, relation, + change); } This seems to be a spurious change. 3. + /* + * If streaming is enable and we have serialized this transaction because + * it had incomplete tuple. So if now we have got the complete tuple we + * can stream it. + */ + if (ReorderBufferCanStream(rb) && can_stream && rbtxn_is_serialized(toptxn) + && !(rbtxn_has_toast_insert(txn)) && !(rbtxn_has_spec_insert(txn))) + { This comment is just saying what you are doing in the if-check. I think you need to explain the rationale behind it. I don't like the variable name 'can_stream' because it matches ReorderBufferCanStream whereas it is for a different purpose, how about naming it as 'change_complete' or something like that. The check has many conditions, can we move it to a separate function to make the code here look clean? -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
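As an illustration of the last suggestion above (moving the multi-condition check into a separate function), a sketch might look like this. The helper name is invented; the flag macros are the ones from the quoted hunk:

    /* Sketch only: can this serialized transaction now be streamed? */
    static bool
    ReorderBufferTxnIsStreamable(ReorderBuffer *rb, ReorderBufferTXN *toptxn)
    {
        return ReorderBufferCanStream(rb) &&
            rbtxn_is_serialized(toptxn) &&
            !rbtxn_has_toast_insert(toptxn) &&
            !rbtxn_has_spec_insert(toptxn);
    }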
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Amit Kapila
Date:
On Mon, May 18, 2020 at 5:57 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > 3. > + /* > + * If streaming is enable and we have serialized this transaction because > + * it had incomplete tuple. So if now we have got the complete tuple we > + * can stream it. > + */ > + if (ReorderBufferCanStream(rb) && can_stream && rbtxn_is_serialized(toptxn) > + && !(rbtxn_has_toast_insert(txn)) && !(rbtxn_has_spec_insert(txn))) > + { > > This comment is just saying what you are doing in the if-check. I > think you need to explain the rationale behind it. I don't like the > variable name 'can_stream' because it matches ReorderBufferCanStream > whereas it is for a different purpose, how about naming it as > 'change_complete' or something like that. The check has many > conditions, can we move it to a separate function to make the code > here look clean? > Do we really need this? Immediately after this check, we are calling ReorderBufferCheckMemoryLimit which will anyway stream the changes if required. Can we move the changes related to the detection of incomplete data to a separate function? Another comments on v20-0010-Bugfix-handling-of-incomplete-toast-tuple: + else if (rbtxn_has_toast_insert(txn) && + ChangeIsInsertOrUpdate(change->action)) + { + toptxn->txn_flags &= ~RBTXN_HAS_TOAST_INSERT; + can_stream = true; + } .. +#define ChangeIsInsertOrUpdate(action) \ + (((action) == REORDER_BUFFER_CHANGE_INSERT) || \ + ((action) == REORDER_BUFFER_CHANGE_UPDATE) || \ + ((action) == REORDER_BUFFER_CHANGE_INTERNAL_SPEC_INSERT)) How can we clear the RBTXN_HAS_TOAST_INSERT flag on REORDER_BUFFER_CHANGE_INTERNAL_SPEC_INSERT action? IIUC, the basic idea used to handle incomplete changes (which is possible in case of toast tuples and speculative inserts) is to mark such TXNs as containing incomplete changes and then while finding the largest top-level TXN for streaming, we ignore such TXN's and move to next largest TXN. If none of the TXNs have complete changes then we choose the largest (sub)transaction and spill the same to make the in-memory changes below logical_decoding_work_mem threshold. This idea can work but the strategy to choose the transaction is suboptimal for cases where TXNs have some changes which are complete followed by an incomplete toast or speculative tuple. I was having an offlist discussion with Robert on this problem and he suggested that it would be better if we track the complete part of changes separately and then we can avoid the drawback mentioned above. I have thought about this and I think it can work if we track the size and LSN of completed changes. I think we need to ensure that if there is concurrent abort then we discard all changes for current (sub)transaction not only up to completed changes LSN whereas if the streaming is successful then we can truncate the changes only up to completed changes LSN. What do you think? I wonder why you have done this as 0010 in the patch series, it should be as 0006 after the 0005-Implement-streaming-mode-in-ReorderBuffer.patch. If we can do that way then it would be easier for me to review. Is there a reason for not doing so? -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
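The "completed changes" idea above could be captured with a couple of extra bookkeeping fields. Everything below, including the field and variable names, is invented here purely to make the proposal concrete; it is not from any posted patch version:

    /*
     * Hypothetical bookkeeping: whenever a change completes a tuple (no
     * partial toast chunks or unconfirmed speculative insert remain),
     * remember how much of the transaction is "complete" so far.
     */
    if (change_completes_tuple)
    {
        txn->complete_size = txn->size;
        txn->complete_lsn = change->lsn;
    }

Streaming would then send and truncate changes only up to complete_lsn on success, while a concurrent abort would still discard the whole (sub)transaction.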
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Dilip Kumar
Date:
On Tue, May 19, 2020 at 2:34 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Mon, May 18, 2020 at 5:57 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > 3. > > + /* > > + * If streaming is enable and we have serialized this transaction because > > + * it had incomplete tuple. So if now we have got the complete tuple we > > + * can stream it. > > + */ > > + if (ReorderBufferCanStream(rb) && can_stream && rbtxn_is_serialized(toptxn) > > + && !(rbtxn_has_toast_insert(txn)) && !(rbtxn_has_spec_insert(txn))) > > + { > > > > This comment is just saying what you are doing in the if-check. I > > think you need to explain the rationale behind it. I don't like the > > variable name 'can_stream' because it matches ReorderBufferCanStream > > whereas it is for a different purpose, how about naming it as > > 'change_complete' or something like that. The check has many > > conditions, can we move it to a separate function to make the code > > here look clean? > > > > Do we really need this? Immediately after this check, we are calling > ReorderBufferCheckMemoryLimit which will anyway stream the changes if > required. Actually, ReorderBufferCheckMemoryLimit is only meant for checking whether we need to stream the changes due to the memory limit. But suppose when memory limit exceeds that time we could not stream the transaction because there was only incomplete toast insert so we serialized. Now, when we get the tuple which makes the changes complete but now it is not crossing the memory limit as changes were already serialized. So I am not sure whether it is a good idea to stream the transaction as soon as we get the complete changes or we shall wait till next time memory limit exceed and that time we select the suitable candidate. Ideally, we were are in streaming more and the transaction is serialized means it was already a candidate for streaming but could not stream due to the incomplete changes so shouldn't we stream it immediately as soon as its changes are complete even though now we are in memory limit. Because our target is to stream not spill so we should try to stream the spilled changes on the first opportunity. Can we move the changes related to the detection of > incomplete data to a separate function? Ok. > > Another comments on v20-0010-Bugfix-handling-of-incomplete-toast-tuple: > > + else if (rbtxn_has_toast_insert(txn) && > + ChangeIsInsertOrUpdate(change->action)) > + { > + toptxn->txn_flags &= ~RBTXN_HAS_TOAST_INSERT; > + can_stream = true; > + } > .. > +#define ChangeIsInsertOrUpdate(action) \ > + (((action) == REORDER_BUFFER_CHANGE_INSERT) || \ > + ((action) == REORDER_BUFFER_CHANGE_UPDATE) || \ > + ((action) == REORDER_BUFFER_CHANGE_INTERNAL_SPEC_INSERT)) > > How can we clear the RBTXN_HAS_TOAST_INSERT flag on > REORDER_BUFFER_CHANGE_INTERNAL_SPEC_INSERT action? Partial toast insert means we have inserted in the toast but not in the main table. So even if it is spec insert we can form the complete tuple, however, we can still not stream it because we haven't got spec_confirm but for that, we are marking another flag. So if the insert is aspect insert the toast insert will also be spec insert and as part of that toast, spec inserts we are marking partial tuple so cleaning that flag should happen when the spec insert is done for the main table right? 
> IIUC, the basic idea used to handle incomplete changes (which is > possible in case of toast tuples and speculative inserts) is to mark > such TXNs as containing incomplete changes and then while finding the > largest top-level TXN for streaming, we ignore such TXN's and move to > next largest TXN. If none of the TXNs have complete changes then we > choose the largest (sub)transaction and spill the same to make the > in-memory changes below logical_decoding_work_mem threshold. This > idea can work but the strategy to choose the transaction is suboptimal > for cases where TXNs have some changes which are complete followed by > an incomplete toast or speculative tuple. I was having an offlist > discussion with Robert on this problem and he suggested that it would > be better if we track the complete part of changes separately and then > we can avoid the drawback mentioned above. I have thought about this > and I think it can work if we track the size and LSN of completed > changes. I think we need to ensure that if there is concurrent abort > then we discard all changes for current (sub)transaction not only up > to completed changes LSN whereas if the streaming is successful then > we can truncate the changes only up to completed changes LSN. What do > you think? > > I wonder why you have done this as 0010 in the patch series, it should > be as 0006 after the > 0005-Implement-streaming-mode-in-ReorderBuffer.patch. If we can do > that way then it would be easier for me to review. Is there a reason > for not doing so? No reason, I can do that. Actually, later we can merge the changes to 0005 only, I kept separate for review. Anyway, in the next version, I will make it as 0006. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Amit Kapila
Date:
On Tue, May 19, 2020 at 3:31 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Tue, May 19, 2020 at 2:34 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > On Mon, May 18, 2020 at 5:57 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > > > 3. > > > + /* > > > + * If streaming is enable and we have serialized this transaction because > > > + * it had incomplete tuple. So if now we have got the complete tuple we > > > + * can stream it. > > > + */ > > > + if (ReorderBufferCanStream(rb) && can_stream && rbtxn_is_serialized(toptxn) > > > + && !(rbtxn_has_toast_insert(txn)) && !(rbtxn_has_spec_insert(txn))) > > > + { > > > > > > This comment is just saying what you are doing in the if-check. I > > > think you need to explain the rationale behind it. I don't like the > > > variable name 'can_stream' because it matches ReorderBufferCanStream > > > whereas it is for a different purpose, how about naming it as > > > 'change_complete' or something like that. The check has many > > > conditions, can we move it to a separate function to make the code > > > here look clean? > > > > > > > Do we really need this? Immediately after this check, we are calling > > ReorderBufferCheckMemoryLimit which will anyway stream the changes if > > required. > > Actually, ReorderBufferCheckMemoryLimit is only meant for checking > whether we need to stream the changes due to the memory limit. But > suppose when memory limit exceeds that time we could not stream the > transaction because there was only incomplete toast insert so we > serialized. Now, when we get the tuple which makes the changes > complete but now it is not crossing the memory limit as changes were > already serialized. So I am not sure whether it is a good idea to > stream the transaction as soon as we get the complete changes or we > shall wait till next time memory limit exceed and that time we select > the suitable candidate. > I think it is better to wait till next time we exceed the memory threshold. > Ideally, we were are in streaming more and > the transaction is serialized means it was already a candidate for > streaming but could not stream due to the incomplete changes so > shouldn't we stream it immediately as soon as its changes are complete > even though now we are in memory limit. > The only time we need to stream or spill is when we exceed memory threshold. In the above case, it is possible that next time there is some other candidate transaction that we can stream. > > > > Another comments on v20-0010-Bugfix-handling-of-incomplete-toast-tuple: > > > > + else if (rbtxn_has_toast_insert(txn) && > > + ChangeIsInsertOrUpdate(change->action)) > > + { > > + toptxn->txn_flags &= ~RBTXN_HAS_TOAST_INSERT; > > + can_stream = true; > > + } > > .. > > +#define ChangeIsInsertOrUpdate(action) \ > > + (((action) == REORDER_BUFFER_CHANGE_INSERT) || \ > > + ((action) == REORDER_BUFFER_CHANGE_UPDATE) || \ > > + ((action) == REORDER_BUFFER_CHANGE_INTERNAL_SPEC_INSERT)) > > > > How can we clear the RBTXN_HAS_TOAST_INSERT flag on > > REORDER_BUFFER_CHANGE_INTERNAL_SPEC_INSERT action? > > Partial toast insert means we have inserted in the toast but not in > the main table. So even if it is spec insert we can form the complete > tuple, however, we can still not stream it because we haven't got > spec_confirm but for that, we are marking another flag. 
So if the > insert is aspect insert the toast insert will also be spec insert and > as part of that toast, spec inserts we are marking partial tuple so > cleaning that flag should happen when the spec insert is done for the > main table right? > Sounds reasonable. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Amit Kapila
Date:
On Fri, May 15, 2020 at 2:47 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Tue, May 12, 2020 at 4:39 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > 4. > > +static void > > +stream_start_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn) > > +{ > > + LogicalDecodingContext *ctx = cache->private_data; > > + LogicalErrorCallbackState state; > > + ErrorContextCallback errcallback; > > + > > + Assert(!ctx->fast_forward); > > + > > + /* We're only supposed to call this when streaming is supported. */ > > + Assert(ctx->streaming); > > + > > + /* Push callback + info on the error context stack */ > > + state.ctx = ctx; > > + state.callback_name = "stream_start"; > > + /* state.report_location = apply_lsn; */ > > > > Why can't we supply the report_location here? I think here we need to > > report txn->first_lsn if this is the very first stream and > > txn->final_lsn if it is any consecutive one. > > Done > Now after your change in stream_start_cb_wrapper, we assign report_location as first_lsn passed as input to function but write_location is still txn->first_lsn. Shouldn't we assing passed in first_lsn to write_location? It seems assigning txn->first_lsn won't be correct for streams other than first-one. > > 5. > > +static void > > +stream_stop_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn) > > +{ > > + LogicalDecodingContext *ctx = cache->private_data; > > + LogicalErrorCallbackState state; > > + ErrorContextCallback errcallback; > > + > > + Assert(!ctx->fast_forward); > > + > > + /* We're only supposed to call this when streaming is supported. */ > > + Assert(ctx->streaming); > > + > > + /* Push callback + info on the error context stack */ > > + state.ctx = ctx; > > + state.callback_name = "stream_stop"; > > + /* state.report_location = apply_lsn; */ > > > > Can't we report txn->final_lsn here > > We are already setting this to the txn->final_ls in 0006 patch, but I > have moved it into this patch now. > Similar to previous point, here also, I think we need to assign report and write location as last_lsn passed to this API. > > > > v20-0005-Implement-streaming-mode-in-ReorderBuffer > > ----------------------------------------------------------------------------- > > 10. > > Theoretically, we could get rid of the k-way merge, and append the > > changes to the toplevel xact directly (and remember the position > > in the list in case the subxact gets aborted later). > > > > I don't think this part of the commit message is correct as we > > sometimes need to spill even during streaming. Please check the > > entire commit message and update according to the latest > > implementation. > > Done > You seem to forgot about removing the other part of message ("This adds a second iterator for the streaming case...." which is not relavant now. > > 11. > > - * HeapTupleSatisfiesHistoricMVCC. > > + * tqual.c's HeapTupleSatisfiesHistoricMVCC. > > + * > > + * We do build the hash table even if there are no CIDs. That's > > + * because when streaming in-progress transactions we may run into > > + * tuples with the CID before actually decoding them. Think e.g. about > > + * INSERT followed by TRUNCATE, where the TRUNCATE may not be decoded > > + * yet when applying the INSERT. So we build a hash table so that > > + * ResolveCminCmaxDuringDecoding does not segfault in this case. > > + * > > + * XXX We might limit this behavior to streaming mode, and just bail > > + * out when decoding transaction at commit time (at which point it's > > + * guaranteed to see all CIDs). 
> > */ > > static void > > ReorderBufferBuildTupleCidHash(ReorderBuffer *rb, ReorderBufferTXN *txn) > > @@ -1350,9 +1498,6 @@ ReorderBufferBuildTupleCidHash(ReorderBuffer > > *rb, ReorderBufferTXN *txn) > > dlist_iter iter; > > HASHCTL hash_ctl; > > > > - if (!rbtxn_has_catalog_changes(txn) || dlist_is_empty(&txn->tuplecids)) > > - return; > > - > > > > I don't understand this change. Why would "INSERT followed by > > TRUNCATE" could lead to a tuple which can come for decode before its > > CID? The patch has made changes based on this assumption in > > HeapTupleSatisfiesHistoricMVCC which appears to be very risky as the > > behavior could be dependent on whether we are streaming the changes > > for in-progress xact or at the commit of a transaction. We might want > > to generate a test to once validate this behavior. > > > > Also, the comment refers to tqual.c which is wrong as this API is now > > in heapam_visibility.c. > > Done. > + * INSERT. So in such cases we assume the CIDs is from the future command + * and return as unresolve. + */ + if (tuplecid_data == NULL) + return false; + Here lets reword the last line of comment as ". So in such cases we assume the CID is from the future command." -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
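To make the point about the wrappers concrete, the relevant assignments would look roughly like this (a fragment only, assuming the wrappers take the first/last LSN of the current stream as a parameter):

    /* In stream_start_cb_wrapper(cache, txn, first_lsn): */

    /* Report errors against the start of this particular stream ... */
    state.report_location = first_lsn;

    /*
     * ... and use the same LSN as the write location; txn->first_lsn is
     * only correct for the very first stream of a transaction.
     */
    ctx->write_location = first_lsn;

    /* Similarly, stream_stop_cb_wrapper(cache, txn, last_lsn) would set: */
    state.report_location = last_lsn;
    ctx->write_location = last_lsn;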
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Amit Kapila
Date:
On Fri, May 15, 2020 at 2:48 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Wed, May 13, 2020 at 4:50 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > 3. > > And, during catalog scan we can check the status of the xid and > > + * if it is aborted we will report a specific error that we can ignore. We > > + * might have already streamed some of the changes for the aborted > > + * (sub)transaction, but that is fine because when we decode the abort we will > > + * stream abort message to truncate the changes in the subscriber. > > + */ > > +static inline void > > +SetupCheckXidLive(TransactionId xid) > > > > In the above comment, I don't think it is right to say that we ignore > > the error raised due to the aborted transaction. We need to say that > > we discard the already streamed changes on such an error. > > Done. > In the same comment, there is typo (/messageto/message to). > > 4. > > +static inline void > > +SetupCheckXidLive(TransactionId xid) > > +{ > > /* > > - * If this transaction has no snapshot, it didn't make any changes to the > > - * database, so there's nothing to decode. Note that > > - * ReorderBufferCommitChild will have transferred any snapshots from > > - * subtransactions if there were any. > > + * setup CheckXidAlive if it's not committed yet. We don't check if the xid > > + * aborted. That will happen during catalog access. Also reset the > > + * sysbegin_called flag. > > */ > > - if (txn->base_snapshot == NULL) > > + if (!TransactionIdDidCommit(xid)) > > { > > - Assert(txn->ninvalidations == 0); > > - ReorderBufferCleanupTXN(rb, txn); > > - return; > > + CheckXidAlive = xid; > > + bsysscan = false; > > } > > > > I think this function is inline as it needs to be called for each > > change. If that is the case and otherwise also, isn't it better that > > we check if passed xid is the same as CheckXidAlive before checking > > TransactionIdDidCommit as TransactionIdDidCommit can be costly and > > calling it for each change might not be a good idea? > > Done, Also I think it is good the check the TransactionIdIsInProgress > instead of !TransactionIdDidCommit. I have changed that as well. > What if it is aborted just before this check? I think the decode API won't be able to detect that and sys* API won't care to check because CheckXidAlive won't be set for that case. > > 5. > > setup CheckXidAlive if it's not committed yet. We don't check if the xid > > + * aborted. That will happen during catalog access. Also reset the > > + * sysbegin_called flag. > > > > /if the xid aborted/if the xid is aborted. missing comma after Also. > > Done > You forgot to change as per the second part of the comment (missing comma after Also). > > > 8. > > @@ -1588,8 +1766,6 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid, > > * use as a normal record. It'll be cleaned up at the end > > * of INSERT processing. > > */ > > - if (specinsert == NULL) > > - elog(ERROR, "invalid ordering of speculative insertion changes"); > > > > You have removed this check but all other handling of specinsert is > > same as far as this patch is concerned. Why so? > > Seems like a merge issue, or the leftover from the old design of the > toast handling where we were streaming with the partial tuple. > fixed now. > > > 9. > > @@ -1676,8 +1860,6 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid, > > * freed/reused while restoring spooled data from > > * disk. 
> > */ > > - Assert(change->data.tp.newtuple != NULL); > > - > > dlist_delete(&change->node); > > > > Why is this Assert removed? > > Same cause as above so fixed. > > > 10. > > @@ -1753,7 +1935,15 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid, > > relations[nrelations++] = relation; > > } > > > > - rb->apply_truncate(rb, txn, nrelations, relations, change); > > + if (streaming) > > + { > > + rb->stream_truncate(rb, txn, nrelations, relations, change); > > + > > + /* Remember that we have sent some data. */ > > + change->txn->any_data_sent = true; > > + } > > + else > > + rb->apply_truncate(rb, txn, nrelations, relations, change); > > > > Can we encapsulate this in a separate function like > > ReorderBufferApplyTruncate or something like that? Basically, rather > > than having streaming check in this function, lets do it in some other > > internal function. And we can likewise do it for all the streaming > > checks in this function or at least whereever it is feasible. That > > will make this function look clean. > > Done for truncate and change. I think we can create a few more such > functions for > start/stop and cleanup handling on error. I will work on that. > Yeah, I think that would be better. One minor comment change suggestion: /* + * start stream or begin the transaction. If this is the first + * change in the current stream. + */ We can write the above comment as "Start the stream or begin the transaction for the first change in the current stream." -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
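Putting review comment 4 and the follow-up question above together, the function could be shaped roughly like this. This is a sketch of the suggestion, not the code in any posted patch version:

    static inline void
    SetupCheckXidLive(TransactionId xid)
    {
        /*
         * Cheap fast path first: if we have already set up the check for
         * this xid, skip the comparatively expensive clog lookup below.
         */
        if (TransactionIdEquals(CheckXidAlive, xid))
            return;

        /*
         * Set up CheckXidAlive if the xid is not committed yet.  We do not
         * check whether it has aborted here; a concurrent abort is detected
         * later, during catalog access.  Also reset the bsysscan flag.
         */
        if (!TransactionIdDidCommit(xid))
        {
            CheckXidAlive = xid;
            bsysscan = false;
        }
    }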
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Amit Kapila
Date:
On Tue, May 19, 2020 at 6:01 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Fri, May 15, 2020 at 2:48 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > I have further reviewed v22 and below are my comments: v22-0005-Implement-streaming-mode-in-ReorderBuffer -------------------------------------------------------------------------- 1. + * Note: We never do both stream and serialize a transaction (we only spill + * to disk when streaming is not supported by the plugin), so only one of + * those two flags may be set at any given time. + */ +#define rbtxn_is_streamed(txn) \ +( \ + ((txn)->txn_flags & RBTXN_IS_STREAMED) != 0 \ +) The above 'Note' is not correct as per the latest implementation. v22-0006-Add-support-for-streaming-to-built-in-replicatio ---------------------------------------------------------------------------- 2. --- a/src/backend/replication/logical/launcher.c +++ b/src/backend/replication/logical/launcher.c @@ -14,7 +14,6 @@ * *------------------------------------------------------------------------- */ - #include "postgres.h" Spurious line removal. 3. +void +logicalrep_write_stream_commit(StringInfo out, ReorderBufferTXN *txn, + XLogRecPtr commit_lsn) +{ + uint8 flags = 0; + + pq_sendbyte(out, 'c'); /* action STREAM COMMIT */ + + Assert(TransactionIdIsValid(txn->xid)); + + /* transaction ID (we're starting to stream, so must be valid) */ + pq_sendint32(out, txn->xid); The part of the comment "we're starting to stream, so must be valid" is not correct as we are not at the start of the stream here. The patch has used the same incorrect sentence at few places, kindly fix those as well. 4. + * XXX Do we need to allocate it in TopMemoryContext? + */ +static void +subxact_info_add(TransactionId xid) { .. For this and other places in a patch like in function stream_open_file(), instead of using TopMemoryContext, can we consider using a new memory context LogicalStreamingContext or something like that. We can create LogicalStreamingContext under TopMemoryContext. I don't see any need of using TopMemoryContext here. 5. +static void +subxact_info_add(TransactionId xid) This function has assumed a valid value for global variables like stream_fd and stream_xid. I think it is better to have Assert for those in this function before using them. The Assert for those are present in handle_streamed_transaction but I feel they should be in subxact_info_add. 6. +subxact_info_add(TransactionId xid) /* + * In most cases we're checking the same subxact as we've already seen in + * the last call, so make ure just ignore it (this change comes later). + */ + if (subxact_last == xid) + return; Typo and minor correction, /ure just/sure to 7. +subxact_info_write(Oid subid, TransactionId xid) { .. + /* + * But we free the memory allocated for subxact info. There might be one + * exceptional transaction with many subxacts, and we don't want to keep + * the memory allocated forewer. + * + */ a. Typo, /forewer/forever b. The extra line at the end of the comment is not required. 8. + * XXX Maybe we should only include the checksum when the cluster is + * initialized with checksums? + */ +static void +subxact_info_write(Oid subid, TransactionId xid) Do we really need to have the checksum for temporary files? I have checked a few other similar cases like SharedFileSet stuff for parallel hash join but didn't find them using checksums. Can you also once see other usages of temporary files and then let us decide if we see any reason to have checksums for this? 
Another point is we don't seem to be doing this for 'changes' file, see stream_write_change. So, not sure, there is any sense to write checksum for subxact file. Tomas, do you see any reason for the same? 9. +subxact_filename(char *path, Oid subid, TransactionId xid) +{ + char tempdirpath[MAXPGPATH]; + + TempTablespacePath(tempdirpath, DEFAULTTABLESPACE_OID); + + /* + * We might need to create the tablespace's tempfile directory, if no + * one has yet done so. + */ + if ((MakePGDirectory(tempdirpath) < 0) && errno != EEXIST) + ereport(ERROR, + (errcode_for_file_access(), + errmsg("could not create directory \"%s\": %m", + tempdirpath))); + + snprintf(path, MAXPGPATH, "%s/logical-%u-%u.subxacts", + tempdirpath, subid, xid); +} Temporary files created in PGDATA/base/pgsql_tmp follow a certain naming convention (see docs[1]) which is not followed here. You can also refer SharedFileSetPath and OpenTemporaryFile. I think we can just try to follow that convention and then additionally append subid, xid and .subxacts. Also, a similar change is required for changes_filename. I would like to know if there is a reason why we want to use different naming convention here? 10. + * This can only be called at the beginning of a "streaming" block, i.e. + * between stream_start/stream_stop messages from the upstream. + */ +static void +stream_close_file(void) The comment seems to be wrong. I think this can be only called at stream end, so it should be "This can only be called at the end of a "streaming" block, i.e. at stream_stop message from the upstream." 11. + * the order the transactions are sent in. So streamed trasactions are + * handled separately by using schema_sent flag in ReorderBufferTXN. + * * For partitions, 'pubactions' considers not only the table's own * publications, but also those of all of its ancestors. */ typedef struct RelationSyncEntry { Oid relid; /* relation oid */ - + TransactionId xid; /* transaction that created the record */ /* * Did we send the schema? If ancestor relid is set, its schema must also * have been sent for this to be true. */ bool schema_sent; + List *streamed_txns; /* streamed toplevel transactions with this + * schema */ The part of comment "So streamed trasactions are handled separately by using schema_sent flag in ReorderBufferTXN." doesn't seem to match with what we are doing in the latest version of the patch. 12. maybe_send_schema() { .. + if (in_streaming) + { + /* + * TOCHECK: We have to send schema after each catalog change and it may + * occur when streaming already started, so we have to track new catalog + * changes somehow. + */ + schema_sent = get_schema_sent_in_streamed_txn(relentry, topxid); .. .. } I think it is good to once verify/test what this comment says but as per code we should be sending the schema after each catalog change as we invalidate the streamed_txns list in rel_sync_cache_relation_cb which must be called during relcache invalidation. Do we see any problem with that mechanism? 13. +/* + * Notify downstream to discard the streamed transaction (along with all + * it's subtransactions, if it's a toplevel transaction). + */ +static void +pgoutput_stream_commit(struct LogicalDecodingContext *ctx, + ReorderBufferTXN *txn, + XLogRecPtr commit_lsn) This comment is copied from pgoutput_stream_abort, so doesn't match what this function is doing. [1] - https://www.postgresql.org/docs/devel/storage-file-layout.html -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
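For point 4 above, a minimal sketch of the suggested dedicated context. The context variable and the way it is threaded into subxact_info_add are illustrative; AllocSetContextCreate and ALLOCSET_DEFAULT_SIZES are the standard memory-context APIs:

    /* Illustrative: lives for the lifetime of the apply worker. */
    static MemoryContext LogicalStreamingContext = NULL;

    static void
    subxact_info_add(TransactionId xid)
    {
        MemoryContext oldctx;

        if (LogicalStreamingContext == NULL)
            LogicalStreamingContext =
                AllocSetContextCreate(TopMemoryContext,
                                      "LogicalStreamingContext",
                                      ALLOCSET_DEFAULT_SIZES);

        /* Allocate/grow the subxacts array in the streaming context. */
        oldctx = MemoryContextSwitchTo(LogicalStreamingContext);
        /* the actual palloc/repalloc of subxacts, as in the patch, goes here */
        MemoryContextSwitchTo(oldctx);
    }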
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Amit Kapila
Date:
On Fri, May 22, 2020 at 11:54 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > v22-0006-Add-support-for-streaming-to-built-in-replicatio > ---------------------------------------------------------------------------- > Few more comments on v22-0006 patch: 1. +stream_cleanup_files(Oid subid, TransactionId xid, bool missing_ok) +{ + int i; + char path[MAXPGPATH]; + bool found = false; + + subxact_filename(path, subid, xid); + + if ((unlink(path) < 0) && (errno != ENOENT) && !missing_ok) + ereport(ERROR, + (errcode_for_file_access(), + errmsg("could not remove file \"%s\": %m", path))); Here, we have unlinked the files containing information of subxacts but don't we need to free the corresponding memory (memory for subxacts) as well? 2. apply_handle_stream_abort() { .. + subxact_filename(path, MyLogicalRepWorker->subid, xid); + + if (unlink(path) < 0) + ereport(ERROR, + (errcode_for_file_access(), + errmsg("could not remove file \"%s\": %m", path))); + + return; .. } Like the previous comment, it seems here also we need to free subxacts memory and additionally we forgot to adjust the xids array as well. 3. apply_handle_stream_abort() { .. + /* XXX optimize the search by bsearch on sorted data */ + for (i = nsubxacts; i > 0; i--) + { + if (subxacts[i - 1].xid == subxid) + { + subidx = (i - 1); + found = true; + break; + } + } + + if (!found) + return; .. } Is it possible that we didn't find the xid in subxacts array? If so, I think we should mention the same in comments, otherwise, we should have an assert for found. 4. apply_handle_stream_abort() { .. + changes_filename(path, MyLogicalRepWorker->subid, xid); + + if (truncate(path, subxacts[subidx].offset)) + ereport(ERROR, + (errcode_for_file_access(), + errmsg("could not truncate file \"%s\": %m", path))); .. } Will truncate works on Windows? I see in the code we ftruncate which is defined as chsize in win32.h and win32_port.h. I have not tested this so I am not very sure about this. I got a below warning when I tried to compile this code on Windows. I think it is better to ftruncate as it is used at other places in the code as well. worker.c(798): warning C4013: 'truncate' undefined; assuming extern returning int -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
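As an illustration of the ftruncate() point in comment 4 above, the truncation could go through a file descriptor, e.g. (a sketch only, not the patch code; it assumes the usual postgres.h, <fcntl.h> and storage/fd.h includes):

static void
truncate_changes_file(const char *path, off_t offset)
{
    int         fd;

    fd = OpenTransientFile(path, O_RDWR | PG_BINARY);
    if (fd < 0)
        ereport(ERROR,
                (errcode_for_file_access(),
                 errmsg("could not open file \"%s\": %m", path)));

    /* ftruncate maps to chsize on Windows via win32_port.h, so this is portable */
    if (ftruncate(fd, offset) != 0)
        ereport(ERROR,
                (errcode_for_file_access(),
                 errmsg("could not truncate file \"%s\": %m", path)));

    CloseTransientFile(fd);
}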
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Dilip Kumar
Date:
On Mon, May 18, 2020 at 5:57 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Mon, May 18, 2020 at 4:10 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > On Sun, May 17, 2020 at 12:41 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > > Few comments on v20-0010-Bugfix-handling-of-incomplete-toast-tuple > 1. > + /* > + * If this is a toast insert then set the corresponding bit. Otherwise, if > + * we have toast insert bit set and this is insert/update then clear the > + * bit. > + */ > + if (toast_insert) > + toptxn->txn_flags |= RBTXN_HAS_TOAST_INSERT; > + else if (rbtxn_has_toast_insert(txn) && > + ChangeIsInsertOrUpdate(change->action)) > + { > > Here, it might better to add a comment on why we expect only > Insert/Update? Also, it might be better that we add an assert for > other operations. I have added comments that why on Insert/Update we clean the flag. But I don't think we only expect insert/update, we might get the toast delete right? because in toast update we will do toast delete + toast insert. So when we get toast delete we just don't want to do anything. > > 2. > @@ -1865,8 +1920,8 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, > ReorderBufferTXN *txn, > * disk. > */ > dlist_delete(&change->node); > - ReorderBufferToastAppendChunk(rb, txn, relation, > - change); > + ReorderBufferToastAppendChunk(rb, txn, relation, > + change); > } > > This seems to be a spurious change. Done > 3. > + /* > + * If streaming is enable and we have serialized this transaction because > + * it had incomplete tuple. So if now we have got the complete tuple we > + * can stream it. > + */ > + if (ReorderBufferCanStream(rb) && can_stream && rbtxn_is_serialized(toptxn) > + && !(rbtxn_has_toast_insert(txn)) && !(rbtxn_has_spec_insert(txn))) > + { > > This comment is just saying what you are doing in the if-check. I > think you need to explain the rationale behind it. I don't like the > variable name 'can_stream' because it matches ReorderBufferCanStream > whereas it is for a different purpose, how about naming it as > 'change_complete' or something like that. The check has many > conditions, can we move it to a separate function to make the code > here look clean? As per the other comments we have removed this part in the latest patch set. Apart from these comments fixes, there are 2 more changes 1. Handling of the toast tuple is changed as per the offlist discussion with you Basically, now, instead of not streaming the txn with the incomplete tuple, we are streaming it up to the last complete lsn. So of the txn has incomplete changes but its complete size is largest then we will stream this. And, after streaming we will truncate the transaction up to the last complete lsn. 2. There is a bug fix in handling the stream abort in 0008 (earlier it was 0006). -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Attachment
- v23-0001-Immediately-WAL-log-assignments.patch
- v23-0005-Implement-streaming-mode-in-ReorderBuffer.patch
- v23-0004-Gracefully-handle-concurrent-aborts-of-uncommitt.patch
- v23-0002-Issue-individual-invalidations-with-wal_level-lo.patch
- v23-0003-Extend-the-output-plugin-API-with-stream-methods.patch
- v23-0009-Enable-streaming-for-all-subscription-TAP-tests.patch
- v23-0008-Add-support-for-streaming-to-built-in-replicatio.patch
- v23-0007-Track-statistics-for-streaming.patch
- v23-0006-Bugfix-handling-of-incomplete-toast-spec-insert-.patch
- v23-0010-Add-TAP-test-for-streaming-vs.-DDL.patch
- v23-0011-Provide-new-api-to-get-the-streaming-changes.patch
- v23-0012-Add-streaming-option-in-pg_dump.patch
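To restate the toast-flag rule described above in one place, here is an illustrative helper (the flag and macro names come from the quoted snippets; the helper itself and its signature are hypothetical, since in the patch this logic lives inline where changes are queued):

static void
update_toast_insert_flag(ReorderBufferTXN *toptxn, ReorderBufferTXN *txn,
                         ReorderBufferChange *change, bool toast_insert)
{
    if (toast_insert)
    {
        /* a toast chunk arrived first, so the change is incomplete for now */
        toptxn->txn_flags |= RBTXN_HAS_TOAST_INSERT;
    }
    else if (rbtxn_has_toast_insert(txn) &&
             ChangeIsInsertOrUpdate(change->action))
    {
        /*
         * The main-table insert/update (or spec insert) that follows the
         * toast chunks completes the tuple, so clear the marker.  A toast
         * delete (part of a toast update) intentionally does nothing here.
         */
        toptxn->txn_flags &= ~RBTXN_HAS_TOAST_INSERT;
    }
}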
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Dilip Kumar
Date:
On Mon, May 18, 2020 at 4:10 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Sun, May 17, 2020 at 12:41 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > On Fri, May 15, 2020 at 4:04 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > > > Review comments: > > > ------------------------------ > > > 1. > > > @@ -1762,10 +1952,16 @@ ReorderBufferCommit(ReorderBuffer *rb, > > > TransactionId xid, > > > } > > > > > > case REORDER_BUFFER_CHANGE_MESSAGE: > > > - rb->message(rb, txn, change->lsn, true, > > > - change->data.msg.prefix, > > > - change->data.msg.message_size, > > > - change->data.msg.message); > > > + if (streaming) > > > + rb->stream_message(rb, txn, change->lsn, true, > > > + change->data.msg.prefix, > > > + change->data.msg.message_size, > > > + change->data.msg.message); > > > + else > > > + rb->message(rb, txn, change->lsn, true, > > > + change->data.msg.prefix, > > > + change->data.msg.message_size, > > > + change->data.msg.message); > > > > > > Don't we need to set any_data_sent flag while streaming messages as we > > > do for other types of changes? > > > > I think any_data_sent, was added to avoid sending abort to the > > subscriber if we haven't sent any data, but this is not complete as > > the output plugin can also take the decision not to send. So I think > > this should not be done as part of this patch and can be done > > separately. I think there is already a thread for handling the > > same[1] > > > > Hmm, but prior to this patch, we never use to send (empty) aborts but > now that will be possible. It is probably okay to deal that with > another patch mentioned by you but I felt at least any_data_sent will > work for some cases. OTOH, it appears to be half-baked solution, so > we should probably refrain from adding it. BTW, how do the pgoutput > plugin deal with it? I see that apply_handle_stream_abort will > unconditionally try to unlink the file and it will probably fail. > Have you tested this scenario after your latest changes? Yeah, I see, I think this is a problem, but this exists without my latest change as well, if pgoutput ignore some changes because it is not published then we will see a similar error. Shall we handle the ENOENT error case from unlink? I think the best idea is that we shall track the empty transaction. > > > 4. > > > In ReorderBufferProcessTXN(), the patch is calling stream_stop in both > > > the try and catch block. If there is an error after calling it in a > > > try block, we might call it again via catch. I think that will lead > > > to sending a stop message twice. Won't that be a problem? See the > > > usage of iterstate in the catch block, we have made it safe from a > > > similar problem. > > > > IMHO, we don't need that, because we only call stream_stop in the > > catch block if the error type is ERRCODE_TRANSACTION_ROLLBACK. So if > > in TRY block we have already stopped the stream then we should not get > > that error. I have added the comments for the same. > > > > I am still slightly nervous about it as I don't see any solid > guarantee for the same. You are right as the code stands today but > due to any code that gets added in the future, it might not remain > true. I feel it is better to have an Assert here to ensure that > stream_stop won't be called the second time. I don't see any good way > of doing it other than by maintaining flag or some state but I think > it will be good to ensure this. Done > > > 6. 
> > > PG_CATCH(); > > > { > > > + MemoryContext ecxt = MemoryContextSwitchTo(ccxt); > > > + ErrorData *errdata = CopyErrorData(); > > > > > > I don't understand the usage of memory context in this part of the > > > code. Basically, you are switching to CurrentMemoryContext here, do > > > some error handling and then again reset back to some random context > > > before rethrowing the error. If there is some purpose for it, then it > > > might be better if you can write a few comments to explain the same. > > > > Basically, the ccxt is the CurrentMemoryContext when we started the > > streaming and ecxt it the context when we catch the error. So > > ideally, before this change, it will rethrow in the context when we > > catch the error i.e. ecxt. So what we are trying to do is put it back > > to normal context (ccxt) and copy the error data in the normal > > context. And, if we are not handling it gracefully then put it back > > to the context it was in, and rethrow. > > > > Okay, but when errorcode is *not* ERRCODE_TRANSACTION_ROLLBACK, don't > we need to clean up the reorderbuffer by calling > ReorderBufferCleanupTXN? If so, then you can try to combine it with > the not-streaming else loop. Done > > > 8. > > > @@ -2295,6 +2677,13 @@ ReorderBufferXidSetCatalogChanges(ReorderBuffer > > > *rb, TransactionId xid, > > > txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true); > > > > > > txn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES; > > > + > > > + /* > > > + * TOCHECK: Mark toplevel transaction as having catalog changes too > > > + * if one of its children has. > > > + */ > > > + if (txn->toptxn != NULL) > > > + txn->toptxn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES; > > > } > > > > > > Why are we marking top transaction here? > > > > We need to mark top transaction to decide whether to build tuplecid > > hash or not. In non-streaming mode, we are only sending during the > > commit time, and during commit time we know whether the top > > transaction has any catalog changes or not based on the invalidation > > message so we are marking the top transaction there in DecodeCommit. > > Since here we are not waiting till commit so we need to mark the top > > transaction as soon as we mark any of its child transactions. > > > > But how does it help? We use this flag (via > ReorderBufferXidHasCatalogChanges) in SnapBuildCommitTxn which is > anyway done in DecodeCommit and that too after setting this flag for > the top transaction if required. So, how will it help in setting it > while processing for subxid. Also, even if we have to do it won't it > add the xid needlessly in builder->committed.xip array? In ReorderBufferBuildTupleCidHash, we use this flag to decide whether to build the tuplecid hash or not based on whether it has catalog changes or not. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
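For reference, the overall shape of the PG_CATCH() handling being discussed is roughly the following (a sketch, not the patch text; the cleanup inside the rollback branch stands in for whatever the patch actually does on a concurrent abort):

    MemoryContext ccxt = CurrentMemoryContext;

    PG_TRY();
    {
        /* ... decode and stream the in-progress transaction ... */
    }
    PG_CATCH();
    {
        MemoryContext ecxt = MemoryContextSwitchTo(ccxt);
        ErrorData  *errdata = CopyErrorData();

        if (errdata->sqlerrcode == ERRCODE_TRANSACTION_ROLLBACK)
        {
            /*
             * Concurrent abort of the transaction being streamed: expected,
             * so swallow the error, clean up the reorder buffer, and let the
             * abort record be replayed later.
             */
            FlushErrorState();
            FreeErrorData(errdata);
        }
        else
        {
            /* unexpected error: switch back and re-throw */
            MemoryContextSwitchTo(ecxt);
            PG_RE_THROW();
        }
    }
    PG_END_TRY();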
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Dilip Kumar
Date:
On Tue, May 19, 2020 at 4:33 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Tue, May 19, 2020 at 3:31 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > On Tue, May 19, 2020 at 2:34 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > On Mon, May 18, 2020 at 5:57 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > > > > > > 3. > > > > + /* > > > > + * If streaming is enable and we have serialized this transaction because > > > > + * it had incomplete tuple. So if now we have got the complete tuple we > > > > + * can stream it. > > > > + */ > > > > + if (ReorderBufferCanStream(rb) && can_stream && rbtxn_is_serialized(toptxn) > > > > + && !(rbtxn_has_toast_insert(txn)) && !(rbtxn_has_spec_insert(txn))) > > > > + { > > > > > > > > This comment is just saying what you are doing in the if-check. I > > > > think you need to explain the rationale behind it. I don't like the > > > > variable name 'can_stream' because it matches ReorderBufferCanStream > > > > whereas it is for a different purpose, how about naming it as > > > > 'change_complete' or something like that. The check has many > > > > conditions, can we move it to a separate function to make the code > > > > here look clean? > > > > > > > > > > Do we really need this? Immediately after this check, we are calling > > > ReorderBufferCheckMemoryLimit which will anyway stream the changes if > > > required. > > > > Actually, ReorderBufferCheckMemoryLimit is only meant for checking > > whether we need to stream the changes due to the memory limit. But > > suppose when memory limit exceeds that time we could not stream the > > transaction because there was only incomplete toast insert so we > > serialized. Now, when we get the tuple which makes the changes > > complete but now it is not crossing the memory limit as changes were > > already serialized. So I am not sure whether it is a good idea to > > stream the transaction as soon as we get the complete changes or we > > shall wait till next time memory limit exceed and that time we select > > the suitable candidate. > > > > I think it is better to wait till next time we exceed the memory threshold. Okay, done this way. > > Ideally, we were are in streaming more and > > the transaction is serialized means it was already a candidate for > > streaming but could not stream due to the incomplete changes so > > shouldn't we stream it immediately as soon as its changes are complete > > even though now we are in memory limit. > > > > The only time we need to stream or spill is when we exceed memory > threshold. In the above case, it is possible that next time there is > some other candidate transaction that we can stream. > > > > > > > Another comments on v20-0010-Bugfix-handling-of-incomplete-toast-tuple: > > > > > > + else if (rbtxn_has_toast_insert(txn) && > > > + ChangeIsInsertOrUpdate(change->action)) > > > + { > > > + toptxn->txn_flags &= ~RBTXN_HAS_TOAST_INSERT; > > > + can_stream = true; > > > + } > > > .. > > > +#define ChangeIsInsertOrUpdate(action) \ > > > + (((action) == REORDER_BUFFER_CHANGE_INSERT) || \ > > > + ((action) == REORDER_BUFFER_CHANGE_UPDATE) || \ > > > + ((action) == REORDER_BUFFER_CHANGE_INTERNAL_SPEC_INSERT)) > > > > > > How can we clear the RBTXN_HAS_TOAST_INSERT flag on > > > REORDER_BUFFER_CHANGE_INTERNAL_SPEC_INSERT action? > > > > Partial toast insert means we have inserted in the toast but not in > > the main table. 
So even if it is spec insert we can form the complete > > tuple, however, we can still not stream it because we haven't got > > spec_confirm but for that, we are marking another flag. So if the > > insert is a spec insert the toast insert will also be spec insert and > > as part of that toast, spec inserts we are marking partial tuple so > > cleaning that flag should happen when the spec insert is done for the > > main table right? > > > > Sounds reasonable. ok -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Dilip Kumar
Date:
On Tue, May 19, 2020 at 5:34 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Fri, May 15, 2020 at 2:47 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > On Tue, May 12, 2020 at 4:39 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > > 4. > > > +static void > > > +stream_start_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn) > > > +{ > > > + LogicalDecodingContext *ctx = cache->private_data; > > > + LogicalErrorCallbackState state; > > > + ErrorContextCallback errcallback; > > > + > > > + Assert(!ctx->fast_forward); > > > + > > > + /* We're only supposed to call this when streaming is supported. */ > > > + Assert(ctx->streaming); > > > + > > > + /* Push callback + info on the error context stack */ > > > + state.ctx = ctx; > > > + state.callback_name = "stream_start"; > > > + /* state.report_location = apply_lsn; */ > > > > > > Why can't we supply the report_location here? I think here we need to > > > report txn->first_lsn if this is the very first stream and > > > txn->final_lsn if it is any consecutive one. > > > > Done > > > > Now after your change in stream_start_cb_wrapper, we assign > report_location as first_lsn passed as input to function but > write_location is still txn->first_lsn. Shouldn't we assing passed in > first_lsn to write_location? It seems assigning txn->first_lsn won't > be correct for streams other than first-one. Done > > > > 5. > > > +static void > > > +stream_stop_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn) > > > +{ > > > + LogicalDecodingContext *ctx = cache->private_data; > > > + LogicalErrorCallbackState state; > > > + ErrorContextCallback errcallback; > > > + > > > + Assert(!ctx->fast_forward); > > > + > > > + /* We're only supposed to call this when streaming is supported. */ > > > + Assert(ctx->streaming); > > > + > > > + /* Push callback + info on the error context stack */ > > > + state.ctx = ctx; > > > + state.callback_name = "stream_stop"; > > > + /* state.report_location = apply_lsn; */ > > > > > > Can't we report txn->final_lsn here > > > > We are already setting this to the txn->final_ls in 0006 patch, but I > > have moved it into this patch now. > > > > Similar to previous point, here also, I think we need to assign report > and write location as last_lsn passed to this API. Done > > > > > v20-0005-Implement-streaming-mode-in-ReorderBuffer > > > ----------------------------------------------------------------------------- > > > 10. > > > Theoretically, we could get rid of the k-way merge, and append the > > > changes to the toplevel xact directly (and remember the position > > > in the list in case the subxact gets aborted later). > > > > > > I don't think this part of the commit message is correct as we > > > sometimes need to spill even during streaming. Please check the > > > entire commit message and update according to the latest > > > implementation. > > > > Done > > > > You seem to forgot about removing the other part of message ("This > adds a second iterator for the streaming case...." which is not > relavant now. Done > > > 11. > > > - * HeapTupleSatisfiesHistoricMVCC. > > > + * tqual.c's HeapTupleSatisfiesHistoricMVCC. > > > + * > > > + * We do build the hash table even if there are no CIDs. That's > > > + * because when streaming in-progress transactions we may run into > > > + * tuples with the CID before actually decoding them. Think e.g. about > > > + * INSERT followed by TRUNCATE, where the TRUNCATE may not be decoded > > > + * yet when applying the INSERT. 
So we build a hash table so that > > > + * ResolveCminCmaxDuringDecoding does not segfault in this case. > > > + * > > > + * XXX We might limit this behavior to streaming mode, and just bail > > > + * out when decoding transaction at commit time (at which point it's > > > + * guaranteed to see all CIDs). > > > */ > > > static void > > > ReorderBufferBuildTupleCidHash(ReorderBuffer *rb, ReorderBufferTXN *txn) > > > @@ -1350,9 +1498,6 @@ ReorderBufferBuildTupleCidHash(ReorderBuffer > > > *rb, ReorderBufferTXN *txn) > > > dlist_iter iter; > > > HASHCTL hash_ctl; > > > > > > - if (!rbtxn_has_catalog_changes(txn) || dlist_is_empty(&txn->tuplecids)) > > > - return; > > > - > > > > > > I don't understand this change. Why would "INSERT followed by > > > TRUNCATE" could lead to a tuple which can come for decode before its > > > CID? The patch has made changes based on this assumption in > > > HeapTupleSatisfiesHistoricMVCC which appears to be very risky as the > > > behavior could be dependent on whether we are streaming the changes > > > for in-progress xact or at the commit of a transaction. We might want > > > to generate a test to once validate this behavior. > > > > > > Also, the comment refers to tqual.c which is wrong as this API is now > > > in heapam_visibility.c. > > > > Done. > > > > + * INSERT. So in such cases we assume the CIDs is from the future command > + * and return as unresolve. > + */ > + if (tuplecid_data == NULL) > + return false; > + > > Here lets reword the last line of comment as ". So in such cases we > assume the CID is from the future command." Done -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
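Putting the pieces above together, the start-of-stream wrapper could end up looking roughly like this (reconstructed from the quoted snippet plus the report/write location change discussed above; field and callback names may not match the patch exactly, and output_plugin_error_callback is the existing error callback in logical.c):

static void
stream_start_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
                        XLogRecPtr first_lsn)
{
    LogicalDecodingContext *ctx = cache->private_data;
    LogicalErrorCallbackState state;
    ErrorContextCallback errcallback;

    Assert(!ctx->fast_forward);

    /* We're only supposed to call this when streaming is supported. */
    Assert(ctx->streaming);

    /* Push callback + info on the error context stack */
    state.ctx = ctx;
    state.callback_name = "stream_start";
    state.report_location = first_lsn;  /* start LSN of this stream */
    errcallback.callback = output_plugin_error_callback;
    errcallback.arg = (void *) &state;
    errcallback.previous = error_context_stack;
    error_context_stack = &errcallback;

    /* set write location so downstream progress reporting is correct */
    ctx->write_location = first_lsn;

    /* do the actual work: call the callback */
    ctx->callbacks.stream_start_cb(ctx, txn);

    /* Pop the error context stack */
    error_context_stack = errcallback.previous;
}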
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Dilip Kumar
Date:
On Tue, May 19, 2020 at 6:01 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Fri, May 15, 2020 at 2:48 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > On Wed, May 13, 2020 at 4:50 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > > 3. > > > And, during catalog scan we can check the status of the xid and > > > + * if it is aborted we will report a specific error that we can ignore. We > > > + * might have already streamed some of the changes for the aborted > > > + * (sub)transaction, but that is fine because when we decode the abort we will > > > + * stream abort message to truncate the changes in the subscriber. > > > + */ > > > +static inline void > > > +SetupCheckXidLive(TransactionId xid) > > > > > > In the above comment, I don't think it is right to say that we ignore > > > the error raised due to the aborted transaction. We need to say that > > > we discard the already streamed changes on such an error. > > > > Done. > > > > In the same comment, there is typo (/messageto/message to). Done > > > 4. > > > +static inline void > > > +SetupCheckXidLive(TransactionId xid) > > > +{ > > > /* > > > - * If this transaction has no snapshot, it didn't make any changes to the > > > - * database, so there's nothing to decode. Note that > > > - * ReorderBufferCommitChild will have transferred any snapshots from > > > - * subtransactions if there were any. > > > + * setup CheckXidAlive if it's not committed yet. We don't check if the xid > > > + * aborted. That will happen during catalog access. Also reset the > > > + * sysbegin_called flag. > > > */ > > > - if (txn->base_snapshot == NULL) > > > + if (!TransactionIdDidCommit(xid)) > > > { > > > - Assert(txn->ninvalidations == 0); > > > - ReorderBufferCleanupTXN(rb, txn); > > > - return; > > > + CheckXidAlive = xid; > > > + bsysscan = false; > > > } > > > > > > I think this function is inline as it needs to be called for each > > > change. If that is the case and otherwise also, isn't it better that > > > we check if passed xid is the same as CheckXidAlive before checking > > > TransactionIdDidCommit as TransactionIdDidCommit can be costly and > > > calling it for each change might not be a good idea? > > > > Done, Also I think it is good the check the TransactionIdIsInProgress > > instead of !TransactionIdDidCommit. I have changed that as well. > > > > What if it is aborted just before this check? I think the decode API > won't be able to detect that and sys* API won't care to check because > CheckXidAlive won't be set for that case. Yeah, that's the problem, I think it should be TransactionIdDidCommit only. > > > 5. > > > setup CheckXidAlive if it's not committed yet. We don't check if the xid > > > + * aborted. That will happen during catalog access. Also reset the > > > + * sysbegin_called flag. > > > > > > /if the xid aborted/if the xid is aborted. missing comma after Also. > > > > Done > > > > You forgot to change as per the second part of the comment (missing > comma after Also). Done > > > 8. > > > @@ -1588,8 +1766,6 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid, > > > * use as a normal record. It'll be cleaned up at the end > > > * of INSERT processing. > > > */ > > > - if (specinsert == NULL) > > > - elog(ERROR, "invalid ordering of speculative insertion changes"); > > > > > > You have removed this check but all other handling of specinsert is > > > same as far as this patch is concerned. Why so? 
> > > > Seems like a merge issue, or the leftover from the old design of the > > toast handling where we were streaming with the partial tuple. > > fixed now. > > > > > 9. > > > @@ -1676,8 +1860,6 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid, > > > * freed/reused while restoring spooled data from > > > * disk. > > > */ > > > - Assert(change->data.tp.newtuple != NULL); > > > - > > > dlist_delete(&change->node); > > > > > > Why is this Assert removed? > > > > Same cause as above so fixed. > > > > > 10. > > > @@ -1753,7 +1935,15 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid, > > > relations[nrelations++] = relation; > > > } > > > > > > - rb->apply_truncate(rb, txn, nrelations, relations, change); > > > + if (streaming) > > > + { > > > + rb->stream_truncate(rb, txn, nrelations, relations, change); > > > + > > > + /* Remember that we have sent some data. */ > > > + change->txn->any_data_sent = true; > > > + } > > > + else > > > + rb->apply_truncate(rb, txn, nrelations, relations, change); > > > > > > Can we encapsulate this in a separate function like > > > ReorderBufferApplyTruncate or something like that? Basically, rather > > > than having streaming check in this function, lets do it in some other > > > internal function. And we can likewise do it for all the streaming > > > checks in this function or at least whereever it is feasible. That > > > will make this function look clean. > > > > Done for truncate and change. I think we can create a few more such > > functions for > > start/stop and cleanup handling on error. I will work on that. > > > > Yeah, I think that would be better. I have done some refactoring, please look into the latest version. > One minor comment change suggestion: > /* > + * start stream or begin the transaction. If this is the first > + * change in the current stream. > + */ > > We can write the above comment as "Start the stream or begin the > transaction for the first change in the current stream." Done -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
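As an aside, the encapsulation suggested earlier ("Can we encapsulate this in a separate function like ReorderBufferApplyTruncate...") could look roughly like this (a sketch; the actual refactoring in the patch may differ):

static void
ReorderBufferApplyTruncate(ReorderBuffer *rb, ReorderBufferTXN *txn,
                           int nrelations, Relation *relations,
                           ReorderBufferChange *change, bool streaming)
{
    /* route to the streaming or non-streaming callback in one place */
    if (streaming)
        rb->stream_truncate(rb, txn, nrelations, relations, change);
    else
        rb->apply_truncate(rb, txn, nrelations, relations, change);
}

That keeps the main loop in ReorderBufferProcessTXN free of repeated "if (streaming)" checks.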
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Dilip Kumar
Date:
On Fri, May 22, 2020 at 11:54 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Tue, May 19, 2020 at 6:01 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > On Fri, May 15, 2020 at 2:48 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > > I have further reviewed v22 and below are my comments: > > v22-0005-Implement-streaming-mode-in-ReorderBuffer > -------------------------------------------------------------------------- > 1. > + * Note: We never do both stream and serialize a transaction (we only spill > + * to disk when streaming is not supported by the plugin), so only one of > + * those two flags may be set at any given time. > + */ > +#define rbtxn_is_streamed(txn) \ > +( \ > + ((txn)->txn_flags & RBTXN_IS_STREAMED) != 0 \ > +) > > The above 'Note' is not correct as per the latest implementation. That is removed in 0010 in the latest version you can see in 0006. > v22-0006-Add-support-for-streaming-to-built-in-replicatio > ---------------------------------------------------------------------------- > 2. > --- a/src/backend/replication/logical/launcher.c > +++ b/src/backend/replication/logical/launcher.c > @@ -14,7 +14,6 @@ > * > *------------------------------------------------------------------------- > */ > - > #include "postgres.h" > > Spurious line removal. Fixed > 3. > +void > +logicalrep_write_stream_commit(StringInfo out, ReorderBufferTXN *txn, > + XLogRecPtr commit_lsn) > +{ > + uint8 flags = 0; > + > + pq_sendbyte(out, 'c'); /* action STREAM COMMIT */ > + > + Assert(TransactionIdIsValid(txn->xid)); > + > + /* transaction ID (we're starting to stream, so must be valid) */ > + pq_sendint32(out, txn->xid); > > The part of the comment "we're starting to stream, so must be valid" > is not correct as we are not at the start of the stream here. The > patch has used the same incorrect sentence at few places, kindly fix > those as well. I have removed that part of the comment. > 4. > + * XXX Do we need to allocate it in TopMemoryContext? > + */ > +static void > +subxact_info_add(TransactionId xid) > { > .. > > For this and other places in a patch like in function > stream_open_file(), instead of using TopMemoryContext, can we consider > using a new memory context LogicalStreamingContext or something like > that. We can create LogicalStreamingContext under TopMemoryContext. I > don't see any need of using TopMemoryContext here. But, when we will delete/reset the LogicalStreamingContext? because we are planning to keep this memory until the worker is alive so that supposed to be the top memory context. If we create any other context with the same life span as TopMemoryContext then what is the point? Am I missing something? > 5. > +static void > +subxact_info_add(TransactionId xid) > > This function has assumed a valid value for global variables like > stream_fd and stream_xid. I think it is better to have Assert for > those in this function before using them. The Assert for those are > present in handle_streamed_transaction but I feel they should be in > subxact_info_add. Done > 6. > +subxact_info_add(TransactionId xid) > /* > + * In most cases we're checking the same subxact as we've already seen in > + * the last call, so make ure just ignore it (this change comes later). > + */ > + if (subxact_last == xid) > + return; > > Typo and minor correction, /ure just/sure to Done > 7. > +subxact_info_write(Oid subid, TransactionId xid) > { > .. > + /* > + * But we free the memory allocated for subxact info. 
There might be one > + * exceptional transaction with many subxacts, and we don't want to keep > + * the memory allocated forewer. > + * > + */ > > a. Typo, /forewer/forever > b. The extra line at the end of the comment is not required. Done > 8. > + * XXX Maybe we should only include the checksum when the cluster is > + * initialized with checksums? > + */ > +static void > +subxact_info_write(Oid subid, TransactionId xid) > > Do we really need to have the checksum for temporary files? I have > checked a few other similar cases like SharedFileSet stuff for > parallel hash join but didn't find them using checksums. Can you also > once see other usages of temporary files and then let us decide if we > see any reason to have checksums for this? Yeah, even I can see other places checksum is not used. > > Another point is we don't seem to be doing this for 'changes' file, > see stream_write_change. So, not sure, there is any sense to write > checksum for subxact file. I can see there are comment atop this function * XXX The subxact file includes CRC32C of the contents. Maybe we should * include something like that here too, but doing so will not be as * straighforward, because we write the file in chunks. > > Tomas, do you see any reason for the same? > 9. > +subxact_filename(char *path, Oid subid, TransactionId xid) > +{ > + char tempdirpath[MAXPGPATH]; > + > + TempTablespacePath(tempdirpath, DEFAULTTABLESPACE_OID); > + > + /* > + * We might need to create the tablespace's tempfile directory, if no > + * one has yet done so. > + */ > + if ((MakePGDirectory(tempdirpath) < 0) && errno != EEXIST) > + ereport(ERROR, > + (errcode_for_file_access(), > + errmsg("could not create directory \"%s\": %m", > + tempdirpath))); > + > + snprintf(path, MAXPGPATH, "%s/logical-%u-%u.subxacts", > + tempdirpath, subid, xid); > +} > > Temporary files created in PGDATA/base/pgsql_tmp follow a certain > naming convention (see docs[1]) which is not followed here. You can > also refer SharedFileSetPath and OpenTemporaryFile. I think we can > just try to follow that convention and then additionally append subid, > xid and .subxacts. Also, a similar change is required for > changes_filename. I would like to know if there is a reason why we > want to use different naming convention here? I have changed it to this: pgsql_tmpPID-subid-xid.subxacts. > 10. > + * This can only be called at the beginning of a "streaming" block, i.e. > + * between stream_start/stream_stop messages from the upstream. > + */ > +static void > +stream_close_file(void) > > The comment seems to be wrong. I think this can be only called at > stream end, so it should be "This can only be called at the end of a > "streaming" block, i.e. at stream_stop message from the upstream." Right, I have fixed it. > 11. > + * the order the transactions are sent in. So streamed trasactions are > + * handled separately by using schema_sent flag in ReorderBufferTXN. > + * > * For partitions, 'pubactions' considers not only the table's own > * publications, but also those of all of its ancestors. > */ > typedef struct RelationSyncEntry > { > Oid relid; /* relation oid */ > - > + TransactionId xid; /* transaction that created the record */ > /* > * Did we send the schema? If ancestor relid is set, its schema must also > * have been sent for this to be true. 
> */ > bool schema_sent; > + List *streamed_txns; /* streamed toplevel transactions with this > + * schema */ > > The part of comment "So streamed trasactions are handled separately by > using schema_sent flag in ReorderBufferTXN." doesn't seem to match > with what we are doing in the latest version of the patch. Yeah, it's wrong, I have fixed it. > 12. > maybe_send_schema() > { > .. > + if (in_streaming) > + { > + /* > + * TOCHECK: We have to send schema after each catalog change and it may > + * occur when streaming already started, so we have to track new catalog > + * changes somehow. > + */ > + schema_sent = get_schema_sent_in_streamed_txn(relentry, topxid); > .. > .. > } > > I think it is good to once verify/test what this comment says but as > per code we should be sending the schema after each catalog change as > we invalidate the streamed_txns list in rel_sync_cache_relation_cb > which must be called during relcache invalidation. Do we see any > problem with that mechanism? I have tested this, I think we are already sending the schema after each catalog change. > 13. > +/* > + * Notify downstream to discard the streamed transaction (along with all > + * it's subtransactions, if it's a toplevel transaction). > + */ > +static void > +pgoutput_stream_commit(struct LogicalDecodingContext *ctx, > + ReorderBufferTXN *txn, > + XLogRecPtr commit_lsn) > > This comment is copied from pgoutput_stream_abort, so doesn't match > what this function is doing. Done -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Attachment
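For illustration, a subxact filename following that convention could be built like this (a sketch assuming storage/fd.h, miscadmin.h and catalog/pg_tablespace_d.h; the exact format string in the patch may differ):

static void
subxact_filename(char *path, Oid subid, TransactionId xid)
{
    char        tempdirpath[MAXPGPATH];

    TempTablespacePath(tempdirpath, DEFAULTTABLESPACE_OID);

    /* pgsql_tmp<backend pid>-<subscription oid>-<xid>.subxacts */
    snprintf(path, MAXPGPATH, "%s/%s%d-%u-%u.subxacts",
             tempdirpath, PG_TEMP_FILE_PREFIX, MyProcPid, subid, xid);
}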
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Dilip Kumar
Date:
On Fri, May 22, 2020 at 4:46 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Fri, May 22, 2020 at 11:54 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > v22-0006-Add-support-for-streaming-to-built-in-replicatio > > ---------------------------------------------------------------------------- > > > Few more comments on v22-0006 patch: > > 1. > +stream_cleanup_files(Oid subid, TransactionId xid, bool missing_ok) > +{ > + int i; > + char path[MAXPGPATH]; > + bool found = false; > + > + subxact_filename(path, subid, xid); > + > + if ((unlink(path) < 0) && (errno != ENOENT) && !missing_ok) > + ereport(ERROR, > + (errcode_for_file_access(), > + errmsg("could not remove file \"%s\": %m", path))); > > Here, we have unlinked the files containing information of subxacts > but don't we need to free the corresponding memory (memory for > subxacts) as well? Basically, stream_cleanup_files, is used for 1) cleanup file on worker exit 2) while writing the first segment of the xid we clean up to ensure there are no orphaned file with same xid. 3) After apply commit we clean up the file. Whereas subxacts memory is used between the stream start and stream stop as soon stream stop we write the subxacts changes to file and free the memory. So there is no case that we can have subxact memory at stream_cleanup_files, except on worker exit but there we are already exiting the worker. IMHO we don't need to free memory there. > 2. > apply_handle_stream_abort() > { > .. > + subxact_filename(path, MyLogicalRepWorker->subid, xid); > + > + if (unlink(path) < 0) > + ereport(ERROR, > + (errcode_for_file_access(), > + errmsg("could not remove file \"%s\": %m", path))); > + > + return; > .. > } > > Like the previous comment, it seems here also we need to free subxacts > memory and additionally we forgot to adjust the xids array as well. In this, we are allocating memory in subxact_info_read, but we are again calling subxact_info_write which will free the memory. > 3. > apply_handle_stream_abort() > { > .. > + /* XXX optimize the search by bsearch on sorted data */ > + for (i = nsubxacts; i > 0; i--) > + { > + if (subxacts[i - 1].xid == subxid) > + { > + subidx = (i - 1); > + found = true; > + break; > + } > + } > + > + if (!found) > + return; > .. > } > > Is it possible that we didn't find the xid in subxacts array? If so, > I think we should mention the same in comments, otherwise, we should > have an assert for found. We may not find due to the empty transaction, I have changed the comments. > 4. > apply_handle_stream_abort() > { > .. > + changes_filename(path, MyLogicalRepWorker->subid, xid); > + > + if (truncate(path, subxacts[subidx].offset)) > + ereport(ERROR, > + (errcode_for_file_access(), > + errmsg("could not truncate file \"%s\": %m", path))); > .. > } > > Will truncate works on Windows? I see in the code we ftruncate which > is defined as chsize in win32.h and win32_port.h. I have not tested > this so I am not very sure about this. I got a below warning when I > tried to compile this code on Windows. I think it is better to > ftruncate as it is used at other places in the code as well. > > worker.c(798): warning C4013: 'truncate' undefined; assuming extern > returning int I have changed to the ftruncate. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Erik Rijkers
Date:
On 2020-05-25 16:37, Dilip Kumar wrote: > On Fri, May 22, 2020 at 11:54 AM Amit Kapila <amit.kapila16@gmail.com> > wrote: >> >> On Tue, May 19, 2020 at 6:01 PM Amit Kapila <amit.kapila16@gmail.com> >> wrote: >> > >> > On Fri, May 15, 2020 at 2:48 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: >> > > >> >> I have further reviewed v22 and below are my comments: >> >> [v24.tar] Hi, I am not able to extract all files correctly from this tar. The first file v24-0001-* seems to have some 'binary' junk at the top. (The other 11 files seem normally readable) Erik Rijkers
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Dilip Kumar
Date:
On Mon, May 25, 2020 at 8:48 PM Erik Rijkers <er@xs4all.nl> wrote: > > Hi, > > I am not able to extract all files correctly from this tar. > > The first file v24-0001-* seems to have some 'binary' junk at the top. > > (The other 11 files seem normally readably) Okay, sending again. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Attachment
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Amit Kapila
Date:
On Fri, May 22, 2020 at 6:21 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Mon, May 18, 2020 at 5:57 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > Few comments on v20-0010-Bugfix-handling-of-incomplete-toast-tuple > > 1. > > + /* > > + * If this is a toast insert then set the corresponding bit. Otherwise, if > > + * we have toast insert bit set and this is insert/update then clear the > > + * bit. > > + */ > > + if (toast_insert) > > + toptxn->txn_flags |= RBTXN_HAS_TOAST_INSERT; > > + else if (rbtxn_has_toast_insert(txn) && > > + ChangeIsInsertOrUpdate(change->action)) > > + { > > > > Here, it might better to add a comment on why we expect only > > Insert/Update? Also, it might be better that we add an assert for > > other operations. > > I have added comments that why on Insert/Update we clean the flag. > But I don't think we only expect insert/update, we might get the > toast delete right? because in toast update we will do toast delete + > toast insert. So when we get toast delete we just don't want to do > anything. > Okay, that makes sense. > > > > 2. > > @@ -1865,8 +1920,8 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, > > ReorderBufferTXN *txn, > > * disk. > > */ > > dlist_delete(&change->node); > > - ReorderBufferToastAppendChunk(rb, txn, relation, > > - change); > > + ReorderBufferToastAppendChunk(rb, txn, relation, > > + change); > > } > > > > This seems to be a spurious change. > > Done > > 2. There is a bug fix in handling the stream abort in 0008 (earlier it > was 0006). > The code changes look fine but it is not clear what was the exact issue. Can you explain? -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Amit Kapila
Date:
On Fri, May 22, 2020 at 6:22 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Mon, May 18, 2020 at 4:10 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > On Sun, May 17, 2020 at 12:41 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > > > On Fri, May 15, 2020 at 4:04 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > > > > > > Review comments: > > > > ------------------------------ > > > > 1. > > > > @@ -1762,10 +1952,16 @@ ReorderBufferCommit(ReorderBuffer *rb, > > > > TransactionId xid, > > > > } > > > > > > > > case REORDER_BUFFER_CHANGE_MESSAGE: > > > > - rb->message(rb, txn, change->lsn, true, > > > > - change->data.msg.prefix, > > > > - change->data.msg.message_size, > > > > - change->data.msg.message); > > > > + if (streaming) > > > > + rb->stream_message(rb, txn, change->lsn, true, > > > > + change->data.msg.prefix, > > > > + change->data.msg.message_size, > > > > + change->data.msg.message); > > > > + else > > > > + rb->message(rb, txn, change->lsn, true, > > > > + change->data.msg.prefix, > > > > + change->data.msg.message_size, > > > > + change->data.msg.message); > > > > > > > > Don't we need to set any_data_sent flag while streaming messages as we > > > > do for other types of changes? > > > > > > I think any_data_sent, was added to avoid sending abort to the > > > subscriber if we haven't sent any data, but this is not complete as > > > the output plugin can also take the decision not to send. So I think > > > this should not be done as part of this patch and can be done > > > separately. I think there is already a thread for handling the > > > same[1] > > > > > > > Hmm, but prior to this patch, we never use to send (empty) aborts but > > now that will be possible. It is probably okay to deal that with > > another patch mentioned by you but I felt at least any_data_sent will > > work for some cases. OTOH, it appears to be half-baked solution, so > > we should probably refrain from adding it. BTW, how do the pgoutput > > plugin deal with it? I see that apply_handle_stream_abort will > > unconditionally try to unlink the file and it will probably fail. > > Have you tested this scenario after your latest changes? > > Yeah, I see, I think this is a problem, but this exists without my > latest change as well, if pgoutput ignore some changes because it is > not published then we will see a similar error. Shall we handle the > ENOENT error case from unlink? > Isn't this problem only for subxact file as we anyway create changes file as part of start stream message which should have come after abort? If so, can't we detect whether subxact file exists probably by using nsubxacts or something like that? Can you please once try to reproduce this scenario to ensure that we are not missing anything? > > > > > > 8. > > > > @@ -2295,6 +2677,13 @@ ReorderBufferXidSetCatalogChanges(ReorderBuffer > > > > *rb, TransactionId xid, > > > > txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true); > > > > > > > > txn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES; > > > > + > > > > + /* > > > > + * TOCHECK: Mark toplevel transaction as having catalog changes too > > > > + * if one of its children has. > > > > + */ > > > > + if (txn->toptxn != NULL) > > > > + txn->toptxn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES; > > > > } > > > > > > > > Why are we marking top transaction here? > > > > > > We need to mark top transaction to decide whether to build tuplecid > > > hash or not. 
In non-streaming mode, we are only sending during the > > > commit time, and during commit time we know whether the top > > > transaction has any catalog changes or not based on the invalidation > > > message so we are marking the top transaction there in DecodeCommit. > > > Since here we are not waiting till commit so we need to mark the top > > > transaction as soon as we mark any of its child transactions. > > > > > > > But how does it help? We use this flag (via > > ReorderBufferXidHasCatalogChanges) in SnapBuildCommitTxn which is > > anyway done in DecodeCommit and that too after setting this flag for > > the top transaction if required. So, how will it help in setting it > > while processing for subxid. Also, even if we have to do it won't it > > add the xid needlessly in builder->committed.xip array? > > In ReorderBufferBuildTupleCidHash, we use this flag to decide whether > to build the tuplecid hash or not based on whether it has catalog > changes or not. > Okay, but you haven't answered the second part of the question: "won't it add the xid of top transaction needlessly in builder->committed.xip array, see function SnapBuildCommitTxn?" IIUC, this can happen without patch as well because DecodeCommit also sets the flags just based on invalidation messages irrespective of whether the messages are generated by top transaction or not, is that right? If this is correct, please explain why we are doing so in the comments. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Dilip Kumar
Date:
On Tue, May 26, 2020 at 10:27 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Fri, May 22, 2020 at 6:21 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > On Mon, May 18, 2020 at 5:57 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > > > Few comments on v20-0010-Bugfix-handling-of-incomplete-toast-tuple > > > 1. > > > + /* > > > + * If this is a toast insert then set the corresponding bit. Otherwise, if > > > + * we have toast insert bit set and this is insert/update then clear the > > > + * bit. > > > + */ > > > + if (toast_insert) > > > + toptxn->txn_flags |= RBTXN_HAS_TOAST_INSERT; > > > + else if (rbtxn_has_toast_insert(txn) && > > > + ChangeIsInsertOrUpdate(change->action)) > > > + { > > > > > > Here, it might better to add a comment on why we expect only > > > Insert/Update? Also, it might be better that we add an assert for > > > other operations. > > > > I have added comments that why on Insert/Update we clean the flag. > > But I don't think we only expect insert/update, we might get the > > toast delete right? because in toast update we will do toast delete + > > toast insert. So when we get toast delete we just don't want to do > > anything. > > > > Okay, that makes sense. > > > > > > > 2. > > > @@ -1865,8 +1920,8 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, > > > ReorderBufferTXN *txn, > > > * disk. > > > */ > > > dlist_delete(&change->node); > > > - ReorderBufferToastAppendChunk(rb, txn, relation, > > > - change); > > > + ReorderBufferToastAppendChunk(rb, txn, relation, > > > + change); > > > } > > > > > > This seems to be a spurious change. > > > > Done > > > > 2. There is a bug fix in handling the stream abort in 0008 (earlier it > > was 0006). > > > > The code changes look fine but it is not clear what was the exact > issue. Can you explain? Basically, in case of an empty subtransaction, we were reading the subxacts info but when we could not find the subxid in the subxacts info we were not releasing the memory. So on next subxact_info_read it will expect that subxacts should be freed but we did not free it in that !found case. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
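In other words, the fix boils down to releasing what subxact_info_read() allocated even when the subxid is not found, something like (illustrative, not the actual diff; variable names follow the quoted snippets):

    if (!found)
    {
        /* empty subtransaction: nothing to truncate, just drop the info */
        if (subxacts)
            pfree(subxacts);
        subxacts = NULL;
        nsubxacts = 0;
        return;
    }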
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Amit Kapila
Date:
On Mon, May 25, 2020 at 8:07 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Fri, May 22, 2020 at 11:54 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > 4. > > + * XXX Do we need to allocate it in TopMemoryContext? > > + */ > > +static void > > +subxact_info_add(TransactionId xid) > > { > > .. > > > > For this and other places in a patch like in function > > stream_open_file(), instead of using TopMemoryContext, can we consider > > using a new memory context LogicalStreamingContext or something like > > that. We can create LogicalStreamingContext under TopMemoryContext. I > > don't see any need of using TopMemoryContext here. > > But, when we will delete/reset the LogicalStreamingContext? > Why can't we reset it at each stream stop message? > because > we are planning to keep this memory until the worker is alive so that > supposed to be the top memory context. > Which part of allocation do we want to keep till the worker is alive? Why we need memory-related to subxacts till the worker is alive? As we have now, after reading subxact info (subxact_info_read), we need to ensure that it is freed after its usage due to which we need to remember and perform pfree at various places. I think we should once see the possibility that such that we could switch to this new context in start stream message and reset it in stop stream message. That might help in avoiding MemoryContextSwitchTo TopMemoryContext at various places. > If we create any other context > with the same life span as TopMemoryContext then what is the point? > It is helpful for debugging. It is recommended that we don't use the top memory context unless it is really required. Read about it in src/backend/utils/mmgr/README. > > > 8. > > + * XXX Maybe we should only include the checksum when the cluster is > > + * initialized with checksums? > > + */ > > +static void > > +subxact_info_write(Oid subid, TransactionId xid) > > > > Do we really need to have the checksum for temporary files? I have > > checked a few other similar cases like SharedFileSet stuff for > > parallel hash join but didn't find them using checksums. Can you also > > once see other usages of temporary files and then let us decide if we > > see any reason to have checksums for this? > > Yeah, even I can see other places checksum is not used. > So, unless someone speaks up before you are ready for the next version of the patch, can we remove it? > > > > Another point is we don't seem to be doing this for 'changes' file, > > see stream_write_change. So, not sure, there is any sense to write > > checksum for subxact file. > > I can see there are comment atop this function > > * XXX The subxact file includes CRC32C of the contents. Maybe we should > * include something like that here too, but doing so will not be as > * straighforward, because we write the file in chunks. > You can remove this comment as well. I don't know how advantageous it is to checksum temporary files. We can anyway add it later if there is a reason for doing so. > > > > 12. > > maybe_send_schema() > > { > > .. > > + if (in_streaming) > > + { > > + /* > > + * TOCHECK: We have to send schema after each catalog change and it may > > + * occur when streaming already started, so we have to track new catalog > > + * changes somehow. > > + */ > > + schema_sent = get_schema_sent_in_streamed_txn(relentry, topxid); > > .. > > .. 
> > } > > > > I think it is good to once verify/test what this comment says but as > > per code we should be sending the schema after each catalog change as > > we invalidate the streamed_txns list in rel_sync_cache_relation_cb > > which must be called during relcache invalidation. Do we see any > > problem with that mechanism? > > I have tested this, I think we are already sending the schema after > each catalog change. > Then remove "TOCHECK" in the above comment. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
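The lifecycle being suggested, in rough form (LogicalStreamingContext is a hypothetical context created under TopMemoryContext; subxact_info_read() is assumed to mirror the quoted subxact_info_write() signature):

    /* at stream_start: per-stream allocations go into the streaming context */
    MemoryContext oldctx = MemoryContextSwitchTo(LogicalStreamingContext);

    subxact_info_read(MyLogicalRepWorker->subid, stream_xid);
    MemoryContextSwitchTo(oldctx);

    /* ... apply the streamed changes ... */

    /* at stream_stop: persist what must survive, then drop everything else */
    subxact_info_write(MyLogicalRepWorker->subid, stream_xid);
    stream_close_file();
    MemoryContextReset(LogicalStreamingContext);

With that arrangement nothing allocated for the stream needs an explicit pfree().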
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Amit Kapila
Date:
On Tue, May 26, 2020 at 2:44 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Tue, May 26, 2020 at 10:27 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > > 2. There is a bug fix in handling the stream abort in 0008 (earlier it > > > was 0006). > > > > > > > The code changes look fine but it is not clear what was the exact > > issue. Can you explain? > > Basically, in case of an empty subtransaction, we were reading the > subxacts info but when we could not find the subxid in the subxacts > info we were not releasing the memory. So on next subxact_info_read > it will expect that subxacts should be freed but we did not free it in > that !found case. > Okay, on looking at it again, the same code exists in subxact_info_write as well. It is better to have a function for it. Can we have a structure like SubXactContext for all the variables used for subxact? As mentioned earlier I find the allocation/deallocation of subxacts a bit ad-hoc, so there will always be a chance that we can forget to free it. Having it allocated in memory context which we can reset later might reduce that risk. One idea could be that we have a special memory context for start and stop messages which can be used to allocate the subxacts there. In case of commit/abort, we can allow subxacts information to be allocated in ApplyMessageContext which is reset at the end of each protocol message. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
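A hypothetical shape for such a structure, based on the variables visible in the quoted snippets (illustrative only):

typedef struct SubXactInfo
{
    TransactionId xid;      /* XID of the subtransaction */
    off_t         offset;   /* offset of its first change in the changes
                             * file, used for truncation on subxact abort */
} SubXactInfo;

typedef struct SubXactContext
{
    SubXactInfo *subxacts;      /* array of known subtransactions */
    uint32       nsubxacts;     /* number of valid entries */
    uint32       nsubxacts_max; /* allocated size of the array */
} SubXactContext;

Keeping these together (and allocating the array in a context that is reset per stream or per protocol message) makes it much harder to leak the array in an early-return path like the !found case discussed earlier.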
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Mahendra Singh Thalor
Date:
On Tue, 26 May 2020 at 16:46, Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Tue, May 26, 2020 at 2:44 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > On Tue, May 26, 2020 at 10:27 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > >
> > > > 2. There is a bug fix in handling the stream abort in 0008 (earlier it
> > > > was 0006).
> > > >
> > >
> > > The code changes look fine but it is not clear what was the exact
> > > issue. Can you explain?
> >
> > Basically, in case of an empty subtransaction, we were reading the
> > subxacts info but when we could not find the subxid in the subxacts
> > info we were not releasing the memory. So on next subxact_info_read
> > it will expect that subxacts should be freed but we did not free it in
> > that !found case.
> >
>
> Okay, on looking at it again, the same code exists in
> subxact_info_write as well. It is better to have a function for it.
> Can we have a structure like SubXactContext for all the variables used
> for subxact? As mentioned earlier I find the allocation/deallocation
> of subxacts a bit ad-hoc, so there will always be a chance that we can
> forget to free it. Having it allocated in memory context which we can
> reset later might reduce that risk. One idea could be that we have a
> special memory context for start and stop messages which can be used
> to allocate the subxacts there. In case of commit/abort, we can allow
> subxacts information to be allocated in ApplyMessageContext which is
> reset at the end of each protocol message.
>
> --
> With Regards,
> Amit Kapila.
> EnterpriseDB: http://www.enterprisedb.com
>
>
Hi all,
On top of the v16 patch set [1], I did some testing of DDLs and DMLs to measure WAL size and performance. Below is the testing summary:
Test parameters:
wal_level = 'logical'
max_connections = '150'
wal_receiver_timeout = '600s'
max_wal_size = '2GB'
min_wal_size = '2GB'
autovacuum= 'off'
checkpoint_timeout= '1d'
Test results:
(% LSN change is the WAL increase of the with-patch run relative to the without-patch run for each pair of rows.)
SN. | operation name | CREATE index: LSN diff (bytes) | time (sec) | % LSN change | Add col int/date: LSN diff (bytes) | time (sec) | % LSN change | Add col text: LSN diff (bytes) | time (sec) | % LSN change
1 | 1 DDL, without patch | 17728 | 0.89116 | 1.624548 | 976 | 0.764393 | 11.475409 | 33904 | 0.80044 | 2.80792
  |        with patch | 18016 | 0.804868 | - | 1088 | 0.763602 | - | 34856 | 0.787108 | -
2 | 2 DDL, without patch | 19872 | 0.860348 | 2.73752 | 1632 | 0.763199 | 13.7254902 | 34560 | 0.806086 | 3.078703
  |        with patch | 20416 | 0.839065 | - | 1856 | 0.733147 | - | 35624 | 0.829281 | -
3 | 3 DDL, without patch | 22016 | 0.894891 | 3.63372093 | 2288 | 0.776871 | 14.685314 | 35216 | 0.803493 | 3.339391186
  |        with patch | 22816 | 0.828028 | - | 2624 | 0.737177 | - | 36392 | 0.800194 | -
4 | 4 DDL, without patch | 24160 | 0.901686 | 4.4701986 | 2944 | 0.768445 | 15.217391 | 35872 | 0.77489 | 3.590544
  |        with patch | 25240 | 0.887143 | - | 3392 | 0.768382 | - | 37160 | 0.82777 | -
5 | 5 DDL, without patch | 26328 | 0.901686 | 4.9832877 | 3600 | 0.751879 | 15.555555 | 36528 | 0.817928 | 3.832676
  |        with patch | 27640 | 0.914078 | - | 4160 | 0.74709 | - | 37928 | 0.820621 | -
6 | 6 DDL, without patch | 28472 | 0.936385 | 5.5071649 | 4256 | 0.745179 | 15.78947368 | 37184 | 0.797043 | 4.066265
  |        with patch | 30040 | 0.958226 | - | 4928 | 0.725321 | - | 38696 | 0.814535 | -
7 | 8 DDL, without patch | 32760 | 1.0022203 | 6.422466 | 5568 | 0.757468 | 16.091954 | 38496 | 0.83207 | 4.509559
  |        with patch | 34864 | 0.966777 | - | 6464 | 0.769072 | - | 40232 | 0.903604 | -
8 | 11 DDL, without patch | 50296 | 1.0022203 | 5.662478 | 7536 | 0.748332 | 16.666666 | 40464 | 0.822266 | 5.179913
  |        with patch | 53144 | 0.966777 | - | 8792 | 0.750553 | - | 42560 | 0.797133 | -
9 | 15 DDL, without patch | 58896 | 1.267253 | 5.662478 | 10184 | 0.776875 | 16.496465 | 43112 | 0.821916 | 5.84524
  |        with patch | 62768 | 1.27234 | - | 11864 | 0.746844 | - | 45632 | 0.812567 | -
10 | 1 DDL & 3 DML, without patch | 18240 | 0.812551 | 1.6228 | 1192 | 0.771993 | 10.067114 | 34120 | 0.849467 | 2.8113599
   |                with patch | 18536 | 0.819089 | - | 1312 | 0.785117 | - | 35080 | 0.855456 | -
11 | 3 DDL & 5 DML, without patch | 23656 | 0.926616 | 3.4832606 | 2656 | 0.758029 | 13.55421687 | 35584 | 0.829377 | 3.372302
   |                with patch | 24480 | 0.915517 | - | 3016 | 0.797206 | - | 36784 | 0.839176 | -
12 | 10 DDL & 5 DML, without patch | 52760 | 1.101005 | 4.958301744 | 7288 | 0.763065 | 16.02634468 | 40216 | 0.837843 | 4.993037
   |                 with patch | 55376 | 1.105241 | - | 8456 | 0.779257 | - | 42224 | 0.835206 | -
13 | 10 DML, without patch | 1008 | 0.791091 | 6.349206 | 1008 | 0.81105 | 6.349206 | 1008 | 0.78817 | 6.349206
   |         with patch | 1072 | 0.807875 | - | 1072 | 0.771113 | - | 1072 | 0.759789 | -
To see all operations, please see the test_results spreadsheet [2].
Summary:
Basically, the patch writes a per-command invalidation message, so I tested different combinations of DDL and DML operations. I have not observed any performance degradation with the patch. For "create index" DDLs, the WAL increase is 1-7% for 1-15 DDLs; for "add col int/date" DDLs it is 11-17%; for "add col text" DDLs it is 2-6%; and for mixed DDL & DML workloads it is 2-10%.
As to why "add col int/date" shows 11-17% extra WAL: the absolute overhead is not large, but such a DDL generates only ~1000 bytes of WAL, so an extra ~100 bytes is already around 10%, whereas "add col text" generates ~35000 bytes (mostly due to TOAST), so the same overhead is a much smaller percentage. For example, a single "add col int" DDL goes from 976 to 1088 bytes (+112 bytes, ~11.5%), while a single "add col text" DDL goes from 33904 to 34856 bytes (+952 bytes, ~2.8%).
[1]: https://www.postgresql.org/message-id/CAFiTN-vnnrk580ucZVYnub_UQ-ayROew8fQ2Yn5aFYMeF0U03w%40mail.gmail.com
[2]: https://docs.google.com/spreadsheets/d/1g11MrSd_I39505OnGoLFVslz3ykbZ1nmfR_gUiE_O9k/edit?usp=sharing
--
Thanks and Regards
Mahendra Singh Thalor
EnterpriseDB: http://www.enterprisedb.com
Re: PATCH: logical_work_mem and logical streaming of largein-progress transactions
From
Dilip Kumar
Date:
On Tue, May 26, 2020 at 7:45 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Mon, May 25, 2020 at 8:48 PM Erik Rijkers <er@xs4all.nl> wrote: > > > > > Hi, > > > > I am not able to extract all files correctly from this tar. > > > > The first file v24-0001-* seems to have some 'binary' junk at the top. > > > > (The other 11 files seem normally readably) > > Okay, sending again. While reviewing/testing I have found a couple of problems in 0005 and 0006, which I have fixed in the attached version. In 0005: in the latest version we start a stream or begin the txn only if there are any changes, because we do this inside the while loop, so we also need to send stream_stop/commit if we have started the stream. In 0006: If we are streaming the serialized changes and there are still a few incomplete changes, then currently we are not deleting the spilled file, but the spill file contains all the changes of the transaction because there is no way to partially truncate it. So in the next stream, it will try to resend those. I have fixed this by sending the spilled transaction as soon as its changes are complete, so ideally we can always delete the spilled file. It is also a better solution because this transaction was already spilled once, and that happened because we could not stream it, so we had better stream it at the first opportunity; that will reduce the replay lag, which is our whole purpose here. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Attachment
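To make the 0005 fix above concrete, here is a simplified sketch of the intended control flow (not the patch's actual code; the iterator and the stream_start/stream_stop callback signatures are stand-ins): the stream is opened lazily on the first change that is actually sent, and closed only if it was opened.

static void
stream_changes_sketch(ReorderBuffer *rb, ReorderBufferTXN *txn)
{
    bool        stream_started = false;
    ReorderBufferChange *change;

    while ((change = get_next_change(txn)) != NULL)     /* stand-in iterator */
    {
        /* Open the stream lazily, on the first change we actually send. */
        if (!stream_started)
        {
            rb->stream_start(rb, txn, change->lsn);
            stream_started = true;
        }

        /* ... dispatch the change via stream_change()/stream_message() ... */
    }

    /* Close the stream only if we opened one. */
    if (stream_started)
        rb->stream_stop(rb, txn);
}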
Re: PATCH: logical_work_mem and logical streaming of largein-progress transactions
From
Dilip Kumar
Date:
On Tue, May 26, 2020 at 12:00 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Fri, May 22, 2020 at 6:22 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > On Mon, May 18, 2020 at 4:10 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > On Sun, May 17, 2020 at 12:41 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > > > > > On Fri, May 15, 2020 at 4:04 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > > > > > > > > > Review comments: > > > > > ------------------------------ > > > > > 1. > > > > > @@ -1762,10 +1952,16 @@ ReorderBufferCommit(ReorderBuffer *rb, > > > > > TransactionId xid, > > > > > } > > > > > > > > > > case REORDER_BUFFER_CHANGE_MESSAGE: > > > > > - rb->message(rb, txn, change->lsn, true, > > > > > - change->data.msg.prefix, > > > > > - change->data.msg.message_size, > > > > > - change->data.msg.message); > > > > > + if (streaming) > > > > > + rb->stream_message(rb, txn, change->lsn, true, > > > > > + change->data.msg.prefix, > > > > > + change->data.msg.message_size, > > > > > + change->data.msg.message); > > > > > + else > > > > > + rb->message(rb, txn, change->lsn, true, > > > > > + change->data.msg.prefix, > > > > > + change->data.msg.message_size, > > > > > + change->data.msg.message); > > > > > > > > > > Don't we need to set any_data_sent flag while streaming messages as we > > > > > do for other types of changes? > > > > > > > > I think any_data_sent, was added to avoid sending abort to the > > > > subscriber if we haven't sent any data, but this is not complete as > > > > the output plugin can also take the decision not to send. So I think > > > > this should not be done as part of this patch and can be done > > > > separately. I think there is already a thread for handling the > > > > same[1] > > > > > > > > > > Hmm, but prior to this patch, we never use to send (empty) aborts but > > > now that will be possible. It is probably okay to deal that with > > > another patch mentioned by you but I felt at least any_data_sent will > > > work for some cases. OTOH, it appears to be half-baked solution, so > > > we should probably refrain from adding it. BTW, how do the pgoutput > > > plugin deal with it? I see that apply_handle_stream_abort will > > > unconditionally try to unlink the file and it will probably fail. > > > Have you tested this scenario after your latest changes? > > > > Yeah, I see, I think this is a problem, but this exists without my > > latest change as well, if pgoutput ignore some changes because it is > > not published then we will see a similar error. Shall we handle the > > ENOENT error case from unlink? > Isn't this problem only for subxact file as we anyway create changes > file as part of start stream message which should have come after > abort? If so, can't we detect whether subxact file exists probably by > using nsubxacts or something like that? Can you please once try to > reproduce this scenario to ensure that we are not missing anything? I have tested this, as of now, by default we create both changes and subxact files irrespective of whether we get any subtransactions or not. Maybe this could be optimized that only if we have any subxact then only create that file otherwise not? What's your opinion on the same. > > > > > 8. 
> > > > > @@ -2295,6 +2677,13 @@ ReorderBufferXidSetCatalogChanges(ReorderBuffer > > > > > *rb, TransactionId xid, > > > > > txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true); > > > > > > > > > > txn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES; > > > > > + > > > > > + /* > > > > > + * TOCHECK: Mark toplevel transaction as having catalog changes too > > > > > + * if one of its children has. > > > > > + */ > > > > > + if (txn->toptxn != NULL) > > > > > + txn->toptxn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES; > > > > > } > > > > > > > > > > Why are we marking top transaction here? > > > > > > > > We need to mark top transaction to decide whether to build tuplecid > > > > hash or not. In non-streaming mode, we are only sending during the > > > > commit time, and during commit time we know whether the top > > > > transaction has any catalog changes or not based on the invalidation > > > > message so we are marking the top transaction there in DecodeCommit. > > > > Since here we are not waiting till commit so we need to mark the top > > > > transaction as soon as we mark any of its child transactions. > > > > > > > > > > But how does it help? We use this flag (via > > > ReorderBufferXidHasCatalogChanges) in SnapBuildCommitTxn which is > > > anyway done in DecodeCommit and that too after setting this flag for > > > the top transaction if required. So, how will it help in setting it > > > while processing for subxid. Also, even if we have to do it won't it > > > add the xid needlessly in builder->committed.xip array? > > > > In ReorderBufferBuildTupleCidHash, we use this flag to decide whether > > to build the tuplecid hash or not based on whether it has catalog > > changes or not. > > > > Okay, but you haven't answered the second part of the question: "won't > it add the xid of top transaction needlessly in builder->committed.xip > array, see function SnapBuildCommitTxn?" IIUC, this can happen > without patch as well because DecodeCommit also sets the flags just > based on invalidation messages irrespective of whether the messages > are generated by top transaction or not, is that right? Yes, with or without the patch it always adds the topxid. I think purpose for doing this with/without patch is not for the snapshot instead we are marking the top itself that some of its subtxn has the catalog changes so that while building the tuplecid has we can know whether to build the hash or not. But, having said that I feel in ReorderBufferBuildTupleCidHash why do we need these two checks if (!rbtxn_has_catalog_changes(txn) || dlist_is_empty(&txn->tuplecids)) return; I mean it should be enough to just have the check, because if we have added something to the tuplecids then catalog changes must be there because that time we are setting the catalog changes to true. if (dlist_is_empty(&txn->tuplecids)) return; I think in the base code there are multiple things going on 1. If we get new CID we always set the catalog change in that transaction but add the tuplecids in the top transaction. So basically, top transaction is so far not marked with catalog changes but it has tuplecids. 2. Now, in DecodeCommit the top xid will be marked that it has catalog changes based on the invalidation messages. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: PATCH: logical_work_mem and logical streaming of largein-progress transactions
From
Dilip Kumar
Date:
On Tue, May 26, 2020 at 3:04 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Mon, May 25, 2020 at 8:07 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > On Fri, May 22, 2020 at 11:54 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > 4. > > > + * XXX Do we need to allocate it in TopMemoryContext? > > > + */ > > > +static void > > > +subxact_info_add(TransactionId xid) > > > { > > > .. > > > > > > For this and other places in a patch like in function > > > stream_open_file(), instead of using TopMemoryContext, can we consider > > > using a new memory context LogicalStreamingContext or something like > > > that. We can create LogicalStreamingContext under TopMemoryContext. I > > > don't see any need of using TopMemoryContext here. > > > > But, when we will delete/reset the LogicalStreamingContext? > > > > Why can't we reset it at each stream stop message? > > because > > we are planning to keep this memory until the worker is alive so that > > supposed to be the top memory context. > > > > Which part of allocation do we want to keep till the worker is alive? static TransactionId *xids = NULL; we need to keep till worker life space. > Why we need memory-related to subxacts till the worker is alive? As > we have now, after reading subxact info (subxact_info_read), we need > to ensure that it is freed after its usage due to which we need to > remember and perform pfree at various places. > > I think we should once see the possibility that such that we could > switch to this new context in start stream message and reset it in > stop stream message. That might help in avoiding > MemoryContextSwitchTo TopMemoryContext at various places. Ok, I understand, I think subxacts can be allocated in new LogicalStreamingContext which we can reset at the stream stop. How about xids? shall we create another context that will stay until the worker lifespan? > > If we create any other context > > with the same life span as TopMemoryContext then what is the point? >> > > It is helpful for debugging. It is recommended that we don't use the > top memory context unless it is really required. Read about it in > src/backend/utils/mmgr/README. I see. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
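A minimal sketch of the memory-context arrangement being discussed, assuming a LogicalStreamingContext created once under TopMemoryContext and reset at every stream stop (the function names and the SubXactInfo type are illustrative):

static MemoryContext LogicalStreamingContext = NULL;

/* On the first stream start, create the context under TopMemoryContext. */
static void
ensure_streaming_context(void)
{
    if (LogicalStreamingContext == NULL)
        LogicalStreamingContext =
            AllocSetContextCreate(TopMemoryContext,
                                  "LogicalStreamingContext",
                                  ALLOCSET_DEFAULT_SIZES);
}

/* Per-stream data such as the subxacts array is allocated in the context. */
static SubXactInfo *
read_subxacts_sketch(uint32 nsubxacts)
{
    MemoryContext oldctx;
    SubXactInfo *subxacts;

    oldctx = MemoryContextSwitchTo(LogicalStreamingContext);
    subxacts = palloc0(nsubxacts * sizeof(SubXactInfo));
    MemoryContextSwitchTo(oldctx);

    return subxacts;
}

/* At stream stop, everything allocated during the stream goes away at once. */
static void
reset_streaming_context(void)
{
    MemoryContextReset(LogicalStreamingContext);
}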
Re: PATCH: logical_work_mem and logical streaming of largein-progress transactions
From
Amit Kapila
Date:
On Thu, May 28, 2020 at 12:46 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Tue, May 26, 2020 at 12:00 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > Isn't this problem only for subxact file as we anyway create changes > > file as part of start stream message which should have come after > > abort? If so, can't we detect whether subxact file exists probably by > > using nsubxacts or something like that? Can you please once try to > > reproduce this scenario to ensure that we are not missing anything? > > I have tested this, as of now, by default we create both changes and > subxact files irrespective of whether we get any subtransactions or > not. Maybe this could be optimized that only if we have any subxact > then only create that file otherwise not? What's your opinion on the > same. > Yeah, that makes sense. > > > > > > 8. > > > > > > @@ -2295,6 +2677,13 @@ ReorderBufferXidSetCatalogChanges(ReorderBuffer > > > > > > *rb, TransactionId xid, > > > > > > txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true); > > > > > > > > > > > > txn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES; > > > > > > + > > > > > > + /* > > > > > > + * TOCHECK: Mark toplevel transaction as having catalog changes too > > > > > > + * if one of its children has. > > > > > > + */ > > > > > > + if (txn->toptxn != NULL) > > > > > > + txn->toptxn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES; > > > > > > } > > > > > > > > > > > > Why are we marking top transaction here? > > > > > > > > > > We need to mark top transaction to decide whether to build tuplecid > > > > > hash or not. In non-streaming mode, we are only sending during the > > > > > commit time, and during commit time we know whether the top > > > > > transaction has any catalog changes or not based on the invalidation > > > > > message so we are marking the top transaction there in DecodeCommit. > > > > > Since here we are not waiting till commit so we need to mark the top > > > > > transaction as soon as we mark any of its child transactions. > > > > > > > > > > > > > But how does it help? We use this flag (via > > > > ReorderBufferXidHasCatalogChanges) in SnapBuildCommitTxn which is > > > > anyway done in DecodeCommit and that too after setting this flag for > > > > the top transaction if required. So, how will it help in setting it > > > > while processing for subxid. Also, even if we have to do it won't it > > > > add the xid needlessly in builder->committed.xip array? > > > > > > In ReorderBufferBuildTupleCidHash, we use this flag to decide whether > > > to build the tuplecid hash or not based on whether it has catalog > > > changes or not. > > > > > > > Okay, but you haven't answered the second part of the question: "won't > > it add the xid of top transaction needlessly in builder->committed.xip > > array, see function SnapBuildCommitTxn?" IIUC, this can happen > > without patch as well because DecodeCommit also sets the flags just > > based on invalidation messages irrespective of whether the messages > > are generated by top transaction or not, is that right? > > Yes, with or without the patch it always adds the topxid. I think > purpose for doing this with/without patch is not for the snapshot > instead we are marking the top itself that some of its subtxn has the > catalog changes so that while building the tuplecid has we can know > whether to build the hash or not. 
But, having said that I feel in > ReorderBufferBuildTupleCidHash why do we need these two checks > if (!rbtxn_has_catalog_changes(txn) || dlist_is_empty(&txn->tuplecids)) > return; > > I mean it should be enough to just have the check, because if we have > added something to the tuplecids then catalog changes must be there > because that time we are setting the catalog changes to true. > > if (dlist_is_empty(&txn->tuplecids)) > return; > > I think in the base code there are multiple things going on > 1. If we get new CID we always set the catalog change in that > transaction but add the tuplecids in the top transaction. So > basically, top transaction is so far not marked with catalog changes > but it has tuplecids. > 2. Now, in DecodeCommit the top xid will be marked that it has catalog > changes based on the invalidation messages. > I don't think it is advisable to remove that check from base code unless we have a strong reason for doing so. I think here you can write better comments about why you are marking the flag for top transaction and remove TOCHECK from the comment. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
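One possible wording for that comment (a sketch only, not the committed text), based on the explanation given earlier in the thread:

/*
 * Mark the top-level transaction as having catalog changes too if one of
 * its child transactions has.  In streaming mode we cannot wait until
 * commit time to set this, and the flag on the top-level transaction is
 * what later decides whether the tuplecid hash needs to be built.
 */
if (txn->toptxn != NULL)
    txn->toptxn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES;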
Re: PATCH: logical_work_mem and logical streaming of largein-progress transactions
From
Amit Kapila
Date:
On Thu, May 28, 2020 at 12:57 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Tue, May 26, 2020 at 3:04 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > Why we need memory-related to subxacts till the worker is alive? As > > we have now, after reading subxact info (subxact_info_read), we need > > to ensure that it is freed after its usage due to which we need to > > remember and perform pfree at various places. > > > > I think we should once see the possibility that such that we could > > switch to this new context in start stream message and reset it in > > stop stream message. That might help in avoiding > > MemoryContextSwitchTo TopMemoryContext at various places. > > Ok, I understand, I think subxacts can be allocated in new > LogicalStreamingContext which we can reset at the stream stop. How > about xids? > How about storing xids in ApplyContext? We do store similar lifespan things in that context, for ex. see store_flush_position. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
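A hedged sketch of the ApplyContext suggestion (the xids array and the helper are illustrative, not from the patch): data that must live for the whole worker lifetime, such as the xids seen while streaming, can simply be allocated in ApplyContext, the same way store_flush_position keeps its state.

static TransactionId *xids = NULL;
static uint32 nxids = 0;
static uint32 maxnxids = 0;

static void
remember_streamed_xid(TransactionId xid)
{
    MemoryContext oldctx = MemoryContextSwitchTo(ApplyContext);

    if (xids == NULL)
    {
        maxnxids = 64;
        xids = palloc(maxnxids * sizeof(TransactionId));
    }
    else if (nxids == maxnxids)
    {
        maxnxids *= 2;
        xids = repalloc(xids, maxnxids * sizeof(TransactionId));
    }

    xids[nxids++] = xid;

    MemoryContextSwitchTo(oldctx);
}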
Re: PATCH: logical_work_mem and logical streaming of largein-progress transactions
From
Dilip Kumar
Date:
On Thu, May 28, 2020 at 3:15 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Thu, May 28, 2020 at 12:57 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > On Tue, May 26, 2020 at 3:04 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > Why we need memory-related to subxacts till the worker is alive? As > > > we have now, after reading subxact info (subxact_info_read), we need > > > to ensure that it is freed after its usage due to which we need to > > > remember and perform pfree at various places. > > > > > > I think we should once see the possibility that such that we could > > > switch to this new context in start stream message and reset it in > > > stop stream message. That might help in avoiding > > > MemoryContextSwitchTo TopMemoryContext at various places. > > > > Ok, I understand, I think subxacts can be allocated in new > > LogicalStreamingContext which we can reset at the stream stop. How > > about xids? > > > > How about storing xids in ApplyContext? We do store similar lifespan > things in that context, for ex. see store_flush_position. That sounds good to me, I will make this change in the next patch set, along with other changes. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: PATCH: logical_work_mem and logical streaming of largein-progress transactions
From
Dilip Kumar
Date:
On Thu, May 28, 2020 at 2:41 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Thu, May 28, 2020 at 12:46 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > On Tue, May 26, 2020 at 12:00 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > Isn't this problem only for subxact file as we anyway create changes > > > file as part of start stream message which should have come after > > > abort? If so, can't we detect whether subxact file exists probably by > > > using nsubxacts or something like that? Can you please once try to > > > reproduce this scenario to ensure that we are not missing anything? > > > > I have tested this, as of now, by default we create both changes and > > subxact files irrespective of whether we get any subtransactions or > > not. Maybe this could be optimized that only if we have any subxact > > then only create that file otherwise not? What's your opinion on the > > same. > > > > Yeah, that makes sense. > > > > > > > > 8. > > > > > > > @@ -2295,6 +2677,13 @@ ReorderBufferXidSetCatalogChanges(ReorderBuffer > > > > > > > *rb, TransactionId xid, > > > > > > > txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true); > > > > > > > > > > > > > > txn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES; > > > > > > > + > > > > > > > + /* > > > > > > > + * TOCHECK: Mark toplevel transaction as having catalog changes too > > > > > > > + * if one of its children has. > > > > > > > + */ > > > > > > > + if (txn->toptxn != NULL) > > > > > > > + txn->toptxn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES; > > > > > > > } > > > > > > > > > > > > > > Why are we marking top transaction here? > > > > > > > > > > > > We need to mark top transaction to decide whether to build tuplecid > > > > > > hash or not. In non-streaming mode, we are only sending during the > > > > > > commit time, and during commit time we know whether the top > > > > > > transaction has any catalog changes or not based on the invalidation > > > > > > message so we are marking the top transaction there in DecodeCommit. > > > > > > Since here we are not waiting till commit so we need to mark the top > > > > > > transaction as soon as we mark any of its child transactions. > > > > > > > > > > > > > > > > But how does it help? We use this flag (via > > > > > ReorderBufferXidHasCatalogChanges) in SnapBuildCommitTxn which is > > > > > anyway done in DecodeCommit and that too after setting this flag for > > > > > the top transaction if required. So, how will it help in setting it > > > > > while processing for subxid. Also, even if we have to do it won't it > > > > > add the xid needlessly in builder->committed.xip array? > > > > > > > > In ReorderBufferBuildTupleCidHash, we use this flag to decide whether > > > > to build the tuplecid hash or not based on whether it has catalog > > > > changes or not. > > > > > > > > > > Okay, but you haven't answered the second part of the question: "won't > > > it add the xid of top transaction needlessly in builder->committed.xip > > > array, see function SnapBuildCommitTxn?" IIUC, this can happen > > > without patch as well because DecodeCommit also sets the flags just > > > based on invalidation messages irrespective of whether the messages > > > are generated by top transaction or not, is that right? > > > > Yes, with or without the patch it always adds the topxid. 
I think > > purpose for doing this with/without patch is not for the snapshot > > instead we are marking the top itself that some of its subtxn has the > > catalog changes so that while building the tuplecid has we can know > > whether to build the hash or not. But, having said that I feel in > > ReorderBufferBuildTupleCidHash why do we need these two checks > > if (!rbtxn_has_catalog_changes(txn) || dlist_is_empty(&txn->tuplecids)) > > return; > > > > I mean it should be enough to just have the check, because if we have > > added something to the tuplecids then catalog changes must be there > > because that time we are setting the catalog changes to true. > > > > if (dlist_is_empty(&txn->tuplecids)) > > return; > > > > I think in the base code there are multiple things going on > > 1. If we get new CID we always set the catalog change in that > > transaction but add the tuplecids in the top transaction. So > > basically, top transaction is so far not marked with catalog changes > > but it has tuplecids. > > 2. Now, in DecodeCommit the top xid will be marked that it has catalog > > changes based on the invalidation messages. > > > > I don't think it is advisable to remove that check from base code > unless we have a strong reason for doing so. I think here you can > write better comments about why you are marking the flag for top > transaction and remove TOCHECK from the comment. Ok, I will do that. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: PATCH: logical_work_mem and logical streaming of largein-progress transactions
From
Amit Kapila
Date:
On Wed, May 27, 2020 at 8:22 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Tue, May 26, 2020 at 7:45 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > > > Okay, sending again. > > While reviewing/testing I have found a couple of problems in 0005 and > 0006 which I have fixed in the attached version. > I haven't reviewed the new fixes yet but I have some comments on 0008-Add-support-for-streaming-to-built-in-replicatio.patch. 1. I think the temporary files (and or handles) used for storing the information of changes and subxacts are getting leaked in the patch. At some places, it is taken care to close the file but cases like apply_handle_stream_commit where if any error occurred in apply_dispatch(), the file might not get closed. The other place is in apply_handle_stream_abort() where if there is an error in ftruncate the file won't be closed. Now, the bigger problem is with changes related file which is opened in apply_handle_stream_start and closed in apply_handle_stream_stop and if there is any error in-between, we won't close it. OTOH, I think the worker will exit on an error so it might not matter but then why we are at few other places we are closing it before the error? I think on error these temporary files should be removed instead of relying on them to get removed next time when we receive changes for the same transaction which I feel is what we do in other cases where we use temporary files like for sorts or hashjoins. Also, what if the changes file size overflows "OS file size limit"? If we agree that the above are problems then do you think we should explore using BufFile interface (see storage/file/buffile.c) to avoid all such problems? 2. apply_handle_stream_abort() { .. + /* discard the subxacts added later */ + nsubxacts = subidx; + + /* write the updated subxact list */ + subxact_info_write(MyLogicalRepWorker->subid, xid); .. } Here, if subxacts becomes zero, then also subxact_info_write will create a new file and write checksum. I think subxact_info_write should have a check for nsubxacts > 0 before writing to the file. 3. apply_handle_stream_commit(StringInfo s) { .. + /* + * send feedback to upstream + * + * XXX Probably should send a valid LSN. But which one? + */ + send_feedback(InvalidXLogRecPtr, false, false); .. } Why do we need to send the feedback at this stage after applying each message? If we see a non-streamed case, we never send_feedback after each message. So, following that, I don't see the need to send it here but if you see any specific reason then do let me know? And if we have to send feedback, then we need to decide the appropriate values as well. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
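For point 2, the suggested guard could look roughly like this (a sketch; the surrounding file handling is elided and the variable name is an assumption):

static void
subxact_info_write_sketch(Oid subid, TransactionId xid)
{
    /* Nothing to store: don't create an empty file just to hold a checksum. */
    if (nsubxacts == 0)
        return;

    /* ... open the subxact file and write the nsubxacts entries ... */
}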
Re: PATCH: logical_work_mem and logical streaming of largein-progress transactions
From
Amit Kapila
Date:
On Wed, May 27, 2020 at 8:22 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > While reviewing/testing I have found a couple of problems in 0005 and > 0006 which I have fixed in the attached version. > .. > > In 0006: If we are streaming the serialized changed and there are > still few incomplete changes, then currently we are not deleting the > spilled file, but the spill file contains all the changes of the > transaction because there is no way to partially truncate it. So in > the next stream, it will try to resend those. I have fixed this by > sending the spilled transaction as soon as its changes are complete so > ideally, we can always delete the spilled file. It is also a better > solution because this transaction is already spilled once and that > happened because we could not stream it, so we better stream it on > the first opportunity that will reduce the replay lag which is our > whole purpose here. > I have reviewed these changes (in the patch v25-0006-Bugfix-handling-of-incomplete-toast-spec-insert-) and below are my comments. 1. + /* + * If the transaction is serialized and the the changes are complete in + * the top level transaction then immediately stream the transaction. + * The reason for not waiting for memory limit to get full is that in + * the streaming mode, if the transaction serialized that means we have + * already reached the memory limit but that time we could not stream + * this due to incomplete tuple so now stream it as soon as the tuple + * is complete. + */ + if (rbtxn_is_serialized(txn)) + ReorderBufferStreamTXN(rb, toptxn); I think here it is important to explain why it is a must to stream a prior serialized transaction as otherwise, later we won't be able to know how to truncate a file. 2. + * If complete_truncate is set we completely truncate the transaction, + * otherwise we truncate upto last_complete_lsn if the transaction has + * incomplete changes. Basically, complete_truncate is passed true only if + * concurrent abort is detected while processing the TXN. */ static void -ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn) +ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn, + bool partial_truncate) { The description talks about complete_truncate flag whereas API is using partial_truncate flag. I think the description needs to be changed. 3. + /* We have truncated upto last complete lsn so stop. */ + if (partial_truncate && rbtxn_has_incomplete_tuple(toptxn) && + (change->lsn > toptxn->last_complete_lsn)) + { + /* + * If this is a top transaction then we can reset the + * last_complete_lsn and complete_size, because by now we would + * have stream all the changes upto last_complete_lsn. + */ + if (txn->toptxn == NULL) + { + toptxn->last_complete_lsn = InvalidXLogRecPtr; + toptxn->complete_size = 0; + } + break; + } I think here we can add an Assert to ensure that we don't partially truncate when the transaction is serialized and add comments for the same. 4. + /* + * Subtract the processed changes from the nentries/nentries_mem Refer + * detailed comment atop this variable in ReorderBufferTXN structure. + * We do this only ff we are truncating the partial changes otherwise + * reset these values directly to 0. + */ + if (partial_truncate) + { + txn->nentries -= txn->nprocessed; + txn->nentries_mem -= txn->nprocessed; + } + else + { + txn->nentries = 0; + txn->nentries_mem = 0; + } I think we can write this comment as "Adjust nentries/nentries_mem based on the changes processed. 
See comments where nprocessed is declared." 5. + /* + * In streaming mode, sometime we can't stream all the changes due to the + * incomplete changes. So we can not directly reset the values of + * nentries/nentries_mem to 0 after one stream is sent like we do in + * non-streaming mode. So while sending one stream we keep count of the + * changes processed in thi stream and only those many changes we decrement + * from the nentries/nentries_mem. + */ + uint64 nprocessed; How about something like: "Number of changes processed. This is used to keep track of changes that remained to be streamed. As of now, this can happen either due to toast tuples or speculative insertions where we need to wait for multiple changes before we can send them." 6. + /* Size of the commplete changes. */ + Size complete_size; Typo. /commplete/complete 7. + /* + * Increment the nprocessed count. See the detailed comment + * for usage of this in ReorderBufferTXN structure. + */ + change->txn->nprocessed++; Ideally, this has to be incremented after processing the change. So, we can combine it with existing check in the patch as below: if (streaming) { change->txn->nprocessed++; if (rbtxn_has_incomplete_tuple(txn) && prev_lsn == txn->last_complete_lsn) { /* Only in streaming mode we should get here. */ Assert(streaming); partial_truncate = true; break; } } -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: PATCH: logical_work_mem and logical streaming of largein-progress transactions
From
Amit Kapila
Date:
On Wed, May 27, 2020 at 5:19 PM Mahendra Singh Thalor <mahi6run@gmail.com> wrote:
On Tue, 26 May 2020 at 16:46, Amit Kapila <amit.kapila16@gmail.com> wrote:
Hi all,
On the top of v16 patch set [1], I did some testing for DDL's and DML's to test wal size and performance. Below is the testing summary;
Test parameters:
wal_level = 'logical'
max_connections = '150'
wal_receiver_timeout = '600s'
max_wal_size = '2GB'
min_wal_size = '2GB'
autovacuum= 'off'
checkpoint_timeout= '1d'
Test results:
(quoted test-results table trimmed; see the table in the previous message)
Why are you seeing any additional WAL in case-13 (10 DML) where there is no DDL? I think it is because you have used savepoints in that case, which will add some additional WAL. You seem to have 9 savepoints in that test, which should ideally generate 36 bytes of additional WAL (4 bytes per transaction id for each subtransaction). Also, in the other cases where you took data for DDL and DML, you have used savepoints in those tests. For savepoints, I suggest doing separate tests as you have done in case-13, but with 3, 5, 7, and 10 savepoints, and probably each transaction can update a row of 200 bytes or so.
I think you can take data for somewhat more realistic cases of DDL and DML combination like 3 DDL's with 10 DML and 3 DDL's with 15 DML operations. In general, I think we will see many more DML's per DDL. It is good to see the worst-case WAL and performance overhead as you have done.
Re: PATCH: logical_work_mem and logical streaming of largein-progress transactions
From
Dilip Kumar
Date:
On Fri, May 29, 2020 at 2:54 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Wed, May 27, 2020 at 8:22 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > > > While reviewing/testing I have found a couple of problems in 0005 and > > 0006 which I have fixed in the attached version. > > > .. > > > > In 0006: If we are streaming the serialized changed and there are > > still few incomplete changes, then currently we are not deleting the > > spilled file, but the spill file contains all the changes of the > > transaction because there is no way to partially truncate it. So in > > the next stream, it will try to resend those. I have fixed this by > > sending the spilled transaction as soon as its changes are complete so > > ideally, we can always delete the spilled file. It is also a better > > solution because this transaction is already spilled once and that > > happened because we could not stream it, so we better stream it on > > the first opportunity that will reduce the replay lag which is our > > whole purpose here. > > > > I have reviewed these changes (in the patch > v25-0006-Bugfix-handling-of-incomplete-toast-spec-insert-) and below > are my comments. > > 1. > + /* > + * If the transaction is serialized and the the changes are complete in > + * the top level transaction then immediately stream the transaction. > + * The reason for not waiting for memory limit to get full is that in > + * the streaming mode, if the transaction serialized that means we have > + * already reached the memory limit but that time we could not stream > + * this due to incomplete tuple so now stream it as soon as the tuple > + * is complete. > + */ > + if (rbtxn_is_serialized(txn)) > + ReorderBufferStreamTXN(rb, toptxn); > > I think here it is important to explain why it is a must to stream a > prior serialized transaction as otherwise, later we won't be able to > know how to truncate a file. Done > 2. > + * If complete_truncate is set we completely truncate the transaction, > + * otherwise we truncate upto last_complete_lsn if the transaction has > + * incomplete changes. Basically, complete_truncate is passed true only if > + * concurrent abort is detected while processing the TXN. > */ > static void > -ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn) > +ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn, > + bool partial_truncate) > { > > The description talks about complete_truncate flag whereas API is > using partial_truncate flag. I think the description needs to be > changed. Fixed > 3. > + /* We have truncated upto last complete lsn so stop. */ > + if (partial_truncate && rbtxn_has_incomplete_tuple(toptxn) && > + (change->lsn > toptxn->last_complete_lsn)) > + { > + /* > + * If this is a top transaction then we can reset the > + * last_complete_lsn and complete_size, because by now we would > + * have stream all the changes upto last_complete_lsn. > + */ > + if (txn->toptxn == NULL) > + { > + toptxn->last_complete_lsn = InvalidXLogRecPtr; > + toptxn->complete_size = 0; > + } > + break; > + } > > I think here we can add an Assert to ensure that we don't partially > truncate when the transaction is serialized and add comments for the > same. Done > 4. > + /* > + * Subtract the processed changes from the nentries/nentries_mem Refer > + * detailed comment atop this variable in ReorderBufferTXN structure. > + * We do this only ff we are truncating the partial changes otherwise > + * reset these values directly to 0. 
> + */ > + if (partial_truncate) > + { > + txn->nentries -= txn->nprocessed; > + txn->nentries_mem -= txn->nprocessed; > + } > + else > + { > + txn->nentries = 0; > + txn->nentries_mem = 0; > + } > > I think we can write this comment as "Adjust nentries/nentries_mem > based on the changes processed. See comments where nprocessed is > declared." > > 5. > + /* > + * In streaming mode, sometime we can't stream all the changes due to the > + * incomplete changes. So we can not directly reset the values of > + * nentries/nentries_mem to 0 after one stream is sent like we do in > + * non-streaming mode. So while sending one stream we keep count of the > + * changes processed in thi stream and only those many changes we decrement > + * from the nentries/nentries_mem. > + */ > + uint64 nprocessed; > > How about something like: "Number of changes processed. This is used > to keep track of changes that remained to be streamed. As of now, > this can happen either due to toast tuples or speculative insertions > where we need to wait for multiple changes before we can send them." Done > 6. > + /* Size of the commplete changes. */ > + Size complete_size; > > Typo. /commplete/complete > > 7. > + /* > + * Increment the nprocessed count. See the detailed comment > + * for usage of this in ReorderBufferTXN structure. > + */ > + change->txn->nprocessed++; > > Ideally, this has to be incremented after processing the change. So, > we can combine it with existing check in the patch as below: > > if (streaming) > { > change->txn->nprocessed++; > > if (rbtxn_has_incomplete_tuple(txn) && > prev_lsn == txn->last_complete_lsn) > { > /* Only in streaming mode we should get here. */ > Assert(streaming); > partial_truncate = true; > break; > } > } Done Apart from this, there was one more issue in this patch + if (partial_truncate && rbtxn_has_incomplete_tuple(toptxn) && + (change->lsn > toptxn->last_complete_lsn)) + { + /* + * If this is a top transaction then we can reset the + * last_complete_lsn and complete_size, because by now we would + * have stream all the changes upto last_complete_lsn. + */ + if (txn->toptxn == NULL) + { + toptxn->last_complete_lsn = InvalidXLogRecPtr; + toptxn->complete_size = 0; + } + break; We shall reset toptxn->last_complete_lsn and toptxn->complete_size, outside this {(change->lsn > toptxn->last_complete_lsn)} check, because we might be in subxact when we meet this condition, so in that case, for toptxn we never reach here and it will never get reset, I have fixed this. Apart from this one more fix in 0005, basically, CheckLiveXid was never reset, so I have fixed that as well. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Attachment
Re: PATCH: logical_work_mem and logical streaming of largein-progress transactions
From
Dilip Kumar
Date:
On Thu, May 28, 2020 at 2:41 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Thu, May 28, 2020 at 12:46 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > On Tue, May 26, 2020 at 12:00 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > Isn't this problem only for subxact file as we anyway create changes > > > file as part of start stream message which should have come after > > > abort? If so, can't we detect whether subxact file exists probably by > > > using nsubxacts or something like that? Can you please once try to > > > reproduce this scenario to ensure that we are not missing anything? > > > > I have tested this, as of now, by default we create both changes and > > subxact files irrespective of whether we get any subtransactions or > > not. Maybe this could be optimized that only if we have any subxact > > then only create that file otherwise not? What's your opinion on the > > same. > > > > Yeah, that makes sense. > > > > > > > > 8. > > > > > > > @@ -2295,6 +2677,13 @@ ReorderBufferXidSetCatalogChanges(ReorderBuffer > > > > > > > *rb, TransactionId xid, > > > > > > > txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true); > > > > > > > > > > > > > > txn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES; > > > > > > > + > > > > > > > + /* > > > > > > > + * TOCHECK: Mark toplevel transaction as having catalog changes too > > > > > > > + * if one of its children has. > > > > > > > + */ > > > > > > > + if (txn->toptxn != NULL) > > > > > > > + txn->toptxn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES; > > > > > > > } > > > > > > > > > > > > > > Why are we marking top transaction here? > > > > > > > > > > > > We need to mark top transaction to decide whether to build tuplecid > > > > > > hash or not. In non-streaming mode, we are only sending during the > > > > > > commit time, and during commit time we know whether the top > > > > > > transaction has any catalog changes or not based on the invalidation > > > > > > message so we are marking the top transaction there in DecodeCommit. > > > > > > Since here we are not waiting till commit so we need to mark the top > > > > > > transaction as soon as we mark any of its child transactions. > > > > > > > > > > > > > > > > But how does it help? We use this flag (via > > > > > ReorderBufferXidHasCatalogChanges) in SnapBuildCommitTxn which is > > > > > anyway done in DecodeCommit and that too after setting this flag for > > > > > the top transaction if required. So, how will it help in setting it > > > > > while processing for subxid. Also, even if we have to do it won't it > > > > > add the xid needlessly in builder->committed.xip array? > > > > > > > > In ReorderBufferBuildTupleCidHash, we use this flag to decide whether > > > > to build the tuplecid hash or not based on whether it has catalog > > > > changes or not. > > > > > > > > > > Okay, but you haven't answered the second part of the question: "won't > > > it add the xid of top transaction needlessly in builder->committed.xip > > > array, see function SnapBuildCommitTxn?" IIUC, this can happen > > > without patch as well because DecodeCommit also sets the flags just > > > based on invalidation messages irrespective of whether the messages > > > are generated by top transaction or not, is that right? > > > > Yes, with or without the patch it always adds the topxid. 
I think > > purpose for doing this with/without patch is not for the snapshot > > instead we are marking the top itself that some of its subtxn has the > > catalog changes so that while building the tuplecid has we can know > > whether to build the hash or not. But, having said that I feel in > > ReorderBufferBuildTupleCidHash why do we need these two checks > > if (!rbtxn_has_catalog_changes(txn) || dlist_is_empty(&txn->tuplecids)) > > return; > > > > I mean it should be enough to just have the check, because if we have > > added something to the tuplecids then catalog changes must be there > > because that time we are setting the catalog changes to true. > > > > if (dlist_is_empty(&txn->tuplecids)) > > return; > > > > I think in the base code there are multiple things going on > > 1. If we get new CID we always set the catalog change in that > > transaction but add the tuplecids in the top transaction. So > > basically, top transaction is so far not marked with catalog changes > > but it has tuplecids. > > 2. Now, in DecodeCommit the top xid will be marked that it has catalog > > changes based on the invalidation messages. > > > > I don't think it is advisable to remove that check from base code > unless we have a strong reason for doing so. I think here you can > write better comments about why you are marking the flag for top > transaction and remove TOCHECK from the comment. Done. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: PATCH: logical_work_mem and logical streaming of largein-progress transactions
From
Dilip Kumar
Date:
On Tue, May 26, 2020 at 4:46 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Tue, May 26, 2020 at 2:44 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > On Tue, May 26, 2020 at 10:27 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > > > > > 2. There is a bug fix in handling the stream abort in 0008 (earlier it > > > > was 0006). > > > > > > > > > > The code changes look fine but it is not clear what was the exact > > > issue. Can you explain? > > > > Basically, in case of an empty subtransaction, we were reading the > > subxacts info but when we could not find the subxid in the subxacts > > info we were not releasing the memory. So on next subxact_info_read > > it will expect that subxacts should be freed but we did not free it in > > that !found case. > > > > Okay, on looking at it again, the same code exists in > subxact_info_write as well. It is better to have a function for it. > Can we have a structure like SubXactContext for all the variables used > for subxact? As mentioned earlier I find the allocation/deallocation > of subxacts a bit ad-hoc, so there will always be a chance that we can > forget to free it. Having it allocated in memory context which we can > reset later might reduce that risk. One idea could be that we have a > special memory context for start and stop messages which can be used > to allocate the subxacts there. In case of commit/abort, we can allow > subxacts information to be allocated in ApplyMessageContext which is > reset at the end of each protocol message. Changed as per this. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: PATCH: logical_work_mem and logical streaming of largein-progress transactions
From
Dilip Kumar
Date:
On Tue, May 26, 2020 at 3:04 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Mon, May 25, 2020 at 8:07 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > On Fri, May 22, 2020 at 11:54 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > 4. > > > + * XXX Do we need to allocate it in TopMemoryContext? > > > + */ > > > +static void > > > +subxact_info_add(TransactionId xid) > > > { > > > .. > > > > > > For this and other places in a patch like in function > > > stream_open_file(), instead of using TopMemoryContext, can we consider > > > using a new memory context LogicalStreamingContext or something like > > > that. We can create LogicalStreamingContext under TopMemoryContext. I > > > don't see any need of using TopMemoryContext here. > > > > But, when we will delete/reset the LogicalStreamingContext? > > > > Why can't we reset it at each stream stop message? Done this > > > because > > we are planning to keep this memory until the worker is alive so that > > supposed to be the top memory context. > > > > Which part of allocation do we want to keep till the worker is alive? > Why we need memory-related to subxacts till the worker is alive? As > we have now, after reading subxact info (subxact_info_read), we need > to ensure that it is freed after its usage due to which we need to > remember and perform pfree at various places. > > I think we should once see the possibility that such that we could > switch to this new context in start stream message and reset it in > stop stream message. That might help in avoiding > MemoryContextSwitchTo TopMemoryContext at various places. > > > If we create any other context > > with the same life span as TopMemoryContext then what is the point? > > > > It is helpful for debugging. It is recommended that we don't use the > top memory context unless it is really required. Read about it in > src/backend/utils/mmgr/README. xids is now allocated in ApplyContext > > > 8. > > > + * XXX Maybe we should only include the checksum when the cluster is > > > + * initialized with checksums? > > > + */ > > > +static void > > > +subxact_info_write(Oid subid, TransactionId xid) > > > > > > Do we really need to have the checksum for temporary files? I have > > > checked a few other similar cases like SharedFileSet stuff for > > > parallel hash join but didn't find them using checksums. Can you also > > > once see other usages of temporary files and then let us decide if we > > > see any reason to have checksums for this? > > > > Yeah, even I can see other places checksum is not used. > > > > So, unless someone speaks up before you are ready for the next version > of the patch, can we remove it? Done > > > Another point is we don't seem to be doing this for 'changes' file, > > > see stream_write_change. So, not sure, there is any sense to write > > > checksum for subxact file. > > > > I can see there are comment atop this function > > > > * XXX The subxact file includes CRC32C of the contents. Maybe we should > > * include something like that here too, but doing so will not be as > > * straighforward, because we write the file in chunks. > > > > You can remove this comment as well. I don't know how advantageous it > is to checksum temporary files. We can anyway add it later if there > is a reason for doing so. Done > > > > > 12. > > > maybe_send_schema() > > > { > > > .. 
> > > + if (in_streaming) > > > + { > > > + /* > > > + * TOCHECK: We have to send schema after each catalog change and it may > > > + * occur when streaming already started, so we have to track new catalog > > > + * changes somehow. > > > + */ > > > + schema_sent = get_schema_sent_in_streamed_txn(relentry, topxid); > > > .. > > > .. > > > } > > > > > > I think it is good to once verify/test what this comment says but as > > > per code we should be sending the schema after each catalog change as > > > we invalidate the streamed_txns list in rel_sync_cache_relation_cb > > > which must be called during relcache invalidation. Do we see any > > > problem with that mechanism? > > > > I have tested this, I think we are already sending the schema after > > each catalog change. > > > > Then remove "TOCHECK" in the above comment. Done -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
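As a side note, the reset-at-stream-stop scheme discussed above could look roughly like the sketch below. The handler names match the ones used elsewhere in the thread, but the bodies are only an illustration, not the actual patch:

/* created lazily; lives for the life of the apply worker */
static MemoryContext LogicalStreamingContext = NULL;

static void
apply_handle_stream_start(StringInfo s)
{
    MemoryContext oldctx;

    if (LogicalStreamingContext == NULL)
        LogicalStreamingContext = AllocSetContextCreate(TopMemoryContext,
                                                        "LogicalStreamingContext",
                                                        ALLOCSET_DEFAULT_SIZES);

    oldctx = MemoryContextSwitchTo(LogicalStreamingContext);
    /* allocate subxact info and other per-stream state here */
    MemoryContextSwitchTo(oldctx);
}

static void
apply_handle_stream_stop(StringInfo s)
{
    /* releases everything allocated since the matching stream start */
    MemoryContextReset(LogicalStreamingContext);
}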
Re: PATCH: logical_work_mem and logical streaming of largein-progress transactions
From
Dilip Kumar
Date:
On Thu, May 28, 2020 at 5:22 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Wed, May 27, 2020 at 8:22 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > On Tue, May 26, 2020 at 7:45 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > > > > > > Okay, sending again. > > > > While reviewing/testing I have found a couple of problems in 0005 and > > 0006 which I have fixed in the attached version. > > > > I haven't reviewed the new fixes yet but I have some comments on > 0008-Add-support-for-streaming-to-built-in-replicatio.patch. > 1. > I think the temporary files (and or handles) used for storing the > information of changes and subxacts are getting leaked in the patch. > At some places, it is taken care to close the file but cases like > apply_handle_stream_commit where if any error occurred in > apply_dispatch(), the file might not get closed. The other place is > in apply_handle_stream_abort() where if there is an error in ftruncate > the file won't be closed. Now, the bigger problem is with changes > related file which is opened in apply_handle_stream_start and closed > in apply_handle_stream_stop and if there is any error in-between, we > won't close it. > > OTOH, I think the worker will exit on an error so it might not matter > but then why we are at few other places we are closing it before the > error? I think on error these temporary files should be removed > instead of relying on them to get removed next time when we receive > changes for the same transaction which I feel is what we do in other > cases where we use temporary files like for sorts or hashjoins. > > Also, what if the changes file size overflows "OS file size limit"? > If we agree that the above are problems then do you think we should > explore using BufFile interface (see storage/file/buffile.c) to avoid > all such problems? I also think that the file size is a problem. I think we can use BufFile with some modifications. We can not use the BufFileCreateTemp, because of few reasons 1) files get deleted on close, but we have to open/close on every stream start/stop. 2) even if we try to avoid closing we need to the BufFile pointers (which take 8192k per file) because there is no option to pass the file name. I thin for our use case BufFileCreateShared is more suitable. I think we need to do some modifications so that we can use these apps without SharedFileSet. Otherwise, we need to unnecessarily need to create SharedFileSet for each transaction and also need to maintain it in xid array or xid hash until transaction commit/abort. So I suggest following modifications in shared files set so that we can conveniently use it. 1. ChooseTablespace(const SharedFileSet fileset, const char name) if fileset is NULL then select the DEFAULTTABLESPACEOID 2. SharedFileSetPath(char path, SharedFileSet fileset, Oid tablespace) If fileset is NULL then in directory path we can use MyProcPID or something instead of fileset->creator_pid. 3. Pass some parameter to BufFileOpenShared, so that it can open the file in RW mode instead of read-only mode. > 2. > apply_handle_stream_abort() > { > .. > + /* discard the subxacts added later */ > + nsubxacts = subidx; > + > + /* write the updated subxact list */ > + subxact_info_write(MyLogicalRepWorker->subid, xid); > .. > } > > Here, if subxacts becomes zero, then also subxact_info_write will > create a new file and write checksum. 
How, will it create the new file, in fact it will write nsubxacts as 0 in the existing file, and I think we need to do that right so that in next open we will know that the nsubxact is 0. I think subxact_info_write > should have a check for nsubxacts > 0 before writing to the file. But, even if nsubxacts become 0 we want to write the file so that we can overwrite the previous info. > 3. > apply_handle_stream_commit(StringInfo s) > { > .. > + /* > + * send feedback to upstream > + * > + * XXX Probably should send a valid LSN. But which one? > + */ > + send_feedback(InvalidXLogRecPtr, false, false); > .. > } > > Why do we need to send the feedback at this stage after applying each > message? If we see a non-streamed case, we never send_feedback after > each message. So, following that, I don't see the need to send it here > but if you see any specific reason then do let me know? And if we > have to send feedback, then we need to decide the appropriate values > as well. Let me put more thought on this and then I will revert back to you. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
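Just to make the three modifications above a bit more tangible, the interfaces under discussion would change roughly along the lines below. This is only an illustration of the proposal; the mode argument and the NULL-fileset behaviour do not exist in the current code:

/* 1. NULL fileset: fall back to the default tablespace */
Oid      ChooseTablespace(const SharedFileSet *fileset, const char *name);

/* 2. NULL fileset: build the temp directory path from MyProcPid
 *    instead of fileset->creator_pid */
void     SharedFileSetPath(char *path, SharedFileSet *fileset, Oid tablespace);

/* 3. extra mode argument, e.g. O_RDONLY vs. O_RDWR */
BufFile *BufFileOpenShared(SharedFileSet *fileset, const char *name, int mode);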
Re: PATCH: logical_work_mem and logical streaming of largein-progress transactions
From
Amit Kapila
Date:
On Tue, Jun 2, 2020 at 3:59 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Thu, May 28, 2020 at 5:22 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > Also, what if the changes file size overflows "OS file size limit"? > > If we agree that the above are problems then do you think we should > > explore using BufFile interface (see storage/file/buffile.c) to avoid > > all such problems? > > I also think that the file size is a problem. I think we can use > BufFile with some modifications. We can not use the > BufFileCreateTemp, because of few reasons > 1) files get deleted on close, but we have to open/close on every > stream start/stop. > 2) even if we try to avoid closing we need to the BufFile pointers > (which take 8192k per file) because there is no option to pass the > file name. > > I thin for our use case BufFileCreateShared is more suitable. I think > we need to do some modifications so that we can use these apps without > SharedFileSet. Otherwise, we need to unnecessarily need to create > SharedFileSet for each transaction and also need to maintain it in xid > array or xid hash until transaction commit/abort. So I suggest > following modifications in shared files set so that we can > conveniently use it. > 1. ChooseTablespace(const SharedFileSet fileset, const char name) > if fileset is NULL then select the DEFAULTTABLESPACEOID > 2. SharedFileSetPath(char path, SharedFileSet fileset, Oid tablespace) > If fileset is NULL then in directory path we can use MyProcPID or > something instead of fileset->creator_pid. > Hmm, I find these modifications a bit ad-hoc. So, not sure if it is better than the patch maintains sharedfileset information. > 3. Pass some parameter to BufFileOpenShared, so that it can open the > file in RW mode instead of read-only mode. > This seems okay. > > > 2. > > apply_handle_stream_abort() > > { > > .. > > + /* discard the subxacts added later */ > > + nsubxacts = subidx; > > + > > + /* write the updated subxact list */ > > + subxact_info_write(MyLogicalRepWorker->subid, xid); > > .. > > } > > > > Here, if subxacts becomes zero, then also subxact_info_write will > > create a new file and write checksum. > > How, will it create the new file, in fact it will write nsubxacts as 0 > in the existing file, and I think we need to do that right so that in > next open we will know that the nsubxact is 0. > > I think subxact_info_write > > should have a check for nsubxacts > 0 before writing to the file. > > But, even if nsubxacts become 0 we want to write the file so that we > can overwrite the previous info. > Can't we just remove the file for such a case? apply_handle_stream_abort() { .. + /* XXX optimize the search by bsearch on sorted data */ + for (i = nsubxacts; i > 0; i--) + { + if (subxacts[i - 1].xid == subxid) + { + subidx = (i - 1); + found = true; + break; + } + } + + /* + * If it's an empty sub-transaction then we will not find the subxid + * here so just free the memory and return. + */ + if (!found) + { + /* Free the subxacts memory */ + if (subxacts) + pfree(subxacts); + + subxacts = NULL; + subxact_last = InvalidTransactionId; + nsubxacts = 0; + nsubxacts_max = 0; + + return; + } .. } I have one question regarding the above code. Isn't it possible that a particular subtransaction id doesn't have any change but others do we have? For ex. 
cases like below: postgres=# begin; BEGIN postgres=*# insert into t1 values(1); INSERT 0 1 postgres=*# savepoint s1; SAVEPOINT postgres=*# savepoint s2; SAVEPOINT postgres=*# insert into t1 values(2); INSERT 0 1 postgres=*# insert into t1 values(3); INSERT 0 1 postgres=*# Rollback to savepoint s1; ROLLBACK postgres=*# commit; Here, we have performed Rolledback to savepoint s1 which doesn't have any change of its own. I think this would have handled but just wanted to confirm. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: PATCH: logical_work_mem and logical streaming of largein-progress transactions
From
Dilip Kumar
Date:
On Tue, Jun 2, 2020 at 4:56 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Tue, Jun 2, 2020 at 3:59 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > On Thu, May 28, 2020 at 5:22 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > > > Also, what if the changes file size overflows "OS file size limit"? > > > If we agree that the above are problems then do you think we should > > > explore using BufFile interface (see storage/file/buffile.c) to avoid > > > all such problems? > > > > I also think that the file size is a problem. I think we can use > > BufFile with some modifications. We can not use the > > BufFileCreateTemp, because of few reasons > > 1) files get deleted on close, but we have to open/close on every > > stream start/stop. > > 2) even if we try to avoid closing we need to the BufFile pointers > > (which take 8192k per file) because there is no option to pass the > > file name. > > > > I thin for our use case BufFileCreateShared is more suitable. I think > > we need to do some modifications so that we can use these apps without > > SharedFileSet. Otherwise, we need to unnecessarily need to create > > SharedFileSet for each transaction and also need to maintain it in xid > > array or xid hash until transaction commit/abort. So I suggest > > following modifications in shared files set so that we can > > conveniently use it. > > 1. ChooseTablespace(const SharedFileSet fileset, const char name) > > if fileset is NULL then select the DEFAULTTABLESPACEOID > > 2. SharedFileSetPath(char path, SharedFileSet fileset, Oid tablespace) > > If fileset is NULL then in directory path we can use MyProcPID or > > something instead of fileset->creator_pid. > > > > Hmm, I find these modifications a bit ad-hoc. So, not sure if it is > better than the patch maintains sharedfileset information. I think we might do something better here, maybe by supplying function pointer or so, but maintaining sharedfileset which contains different tablespace/mutext which we don't need at all for our purpose also doesn't sound very appealing. Let me see if I can not come up with some clean way of avoiding the need to shared-fileset then maybe we can go with the shared fileset idea. > > 3. Pass some parameter to BufFileOpenShared, so that it can open the > > file in RW mode instead of read-only mode. > > > > This seems okay. > > > > > > 2. > > > apply_handle_stream_abort() > > > { > > > .. > > > + /* discard the subxacts added later */ > > > + nsubxacts = subidx; > > > + > > > + /* write the updated subxact list */ > > > + subxact_info_write(MyLogicalRepWorker->subid, xid); > > > .. > > > } > > > > > > Here, if subxacts becomes zero, then also subxact_info_write will > > > create a new file and write checksum. > > > > How, will it create the new file, in fact it will write nsubxacts as 0 > > in the existing file, and I think we need to do that right so that in > > next open we will know that the nsubxact is 0. > > > > I think subxact_info_write > > > should have a check for nsubxacts > 0 before writing to the file. > > > > But, even if nsubxacts become 0 we want to write the file so that we > > can overwrite the previous info. > > > > Can't we just remove the file for such a case? But, as of now, we expect if it is not a first-time stream start then the file exists. Actually, currently, it's very easy that if it is not the first segment we always expect that the file must exist, otherwise an error. Now if it is not the first segment then we will need to handle multiple cases. 
a) subxact_info_read need to handle the error case, because the file may not exist because there was no subxact in last stream or it was deleted because nsubxact become 0. b) subxact_info_write, there will be multiple cases that if nsubxact was already 0 then we can avoid writing the file, but if it become 0 now we need to remove the file. Let me think more on that. > > apply_handle_stream_abort() > { > .. > + /* XXX optimize the search by bsearch on sorted data */ > + for (i = nsubxacts; i > 0; i--) > + { > + if (subxacts[i - 1].xid == subxid) > + { > + subidx = (i - 1); > + found = true; > + break; > + } > + } > + > + /* > + * If it's an empty sub-transaction then we will not find the subxid > + * here so just free the memory and return. > + */ > + if (!found) > + { > + /* Free the subxacts memory */ > + if (subxacts) > + pfree(subxacts); > + > + subxacts = NULL; > + subxact_last = InvalidTransactionId; > + nsubxacts = 0; > + nsubxacts_max = 0; > + > + return; > + } > .. > } > > I have one question regarding the above code. Isn't it possible that > a particular subtransaction id doesn't have any change but others do > we have? For ex. cases like below: > > postgres=# begin; > BEGIN > postgres=*# insert into t1 values(1); > INSERT 0 1 > postgres=*# savepoint s1; > SAVEPOINT > postgres=*# savepoint s2; > SAVEPOINT > postgres=*# insert into t1 values(2); > INSERT 0 1 > postgres=*# insert into t1 values(3); > INSERT 0 1 > postgres=*# Rollback to savepoint s1; > ROLLBACK > postgres=*# commit; > > Here, we have performed Rolledback to savepoint s1 which doesn't have > any change of its own. I think this would have handled but just > wanted to confirm. But internally, that will send abort for the s2 first, and for that, we will find xid and truncate, and later we will send abort for s1 but that we will not find and do nothing? Anyway, I will test it and let you know. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
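To sketch cases (a) and (b) above in code, something like the following could work; the variable and helper names follow the patch hunks quoted earlier in the thread, and the error handling shown is only an assumption about how it might look:

/* (b) in subxact_info_write: nothing to remember, so drop any stale file */
if (nsubxacts == 0)
{
    if (unlink(path) < 0 && errno != ENOENT)
        ereport(ERROR,
                (errcode_for_file_access(),
                 errmsg("could not remove file \"%s\": %m", path)));
    return;
}

/* (a) in subxact_info_read: a missing file simply means "no subxacts yet" */
fd = OpenTransientFile(path, O_RDONLY | PG_BINARY);
if (fd < 0)
{
    if (errno == ENOENT)
    {
        nsubxacts = 0;
        nsubxacts_max = 0;
        subxacts = NULL;
        return;
    }
    ereport(ERROR,
            (errcode_for_file_access(),
             errmsg("could not open file \"%s\": %m", path)));
}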
Re: PATCH: logical_work_mem and logical streaming of largein-progress transactions
From
Amit Kapila
Date:
On Tue, Jun 2, 2020 at 7:53 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Tue, Jun 2, 2020 at 4:56 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > On Tue, Jun 2, 2020 at 3:59 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > > > I thin for our use case BufFileCreateShared is more suitable. I think > > > we need to do some modifications so that we can use these apps without > > > SharedFileSet. Otherwise, we need to unnecessarily need to create > > > SharedFileSet for each transaction and also need to maintain it in xid > > > array or xid hash until transaction commit/abort. So I suggest > > > following modifications in shared files set so that we can > > > conveniently use it. > > > 1. ChooseTablespace(const SharedFileSet fileset, const char name) > > > if fileset is NULL then select the DEFAULTTABLESPACEOID > > > 2. SharedFileSetPath(char path, SharedFileSet fileset, Oid tablespace) > > > If fileset is NULL then in directory path we can use MyProcPID or > > > something instead of fileset->creator_pid. > > > > > > > Hmm, I find these modifications a bit ad-hoc. So, not sure if it is > > better than the patch maintains sharedfileset information. > > I think we might do something better here, maybe by supplying function > pointer or so, but maintaining sharedfileset which contains different > tablespace/mutext which we don't need at all for our purpose also > doesn't sound very appealing. > I think we can say something similar for Relation (rel cache entry as well) maintained in LogicalRepRelMapEntry. I think we only need a pointer to that information. > Let me see if I can not come up with > some clean way of avoiding the need to shared-fileset then maybe we > can go with the shared fileset idea. > Fair enough. .. > > > > > > But, even if nsubxacts become 0 we want to write the file so that we > > > can overwrite the previous info. > > > > > > > Can't we just remove the file for such a case? > > But, as of now, we expect if it is not a first-time stream start then > the file exists. > Isn't it primarily because we do subxact_info_write in stop stream which will create such a file irrespective of whether we have any subxacts? If so, isn't that an unnecessary write? > Actually, currently, it's very easy that if it is > not the first segment we always expect that the file must exist, > otherwise an error. > I think we can check if the file doesn't exist then we can initialize nsubxacts as 0. > Now if it is not the first segment then we will > need to handle multiple cases. > > a) subxact_info_read need to handle the error case, because the file > may not exist because there was no subxact in last stream or it was > deleted because nsubxact become 0. > b) subxact_info_write, there will be multiple cases that if nsubxact > was already 0 then we can avoid writing the file, but if it become 0 > now we need to remove the file. > > Let me think more on that. > I feel we should be able to deal with these cases but if you find any difficulty then let us discuss. I understand there is some ease if we always have subxacts file but OTOH it sounds quite awkward that we need so many file operations to detect the case whether the transaction has any subtransactions. > > > > Here, we have performed Rolledback to savepoint s1 which doesn't have > > any change of its own. I think this would have handled but just > > wanted to confirm. 
> > But internally, that will send abort for the s2 first, and for that, > we will find xid and truncate, and later we will send abort for s1 but > that we will not find and do nothing? Anyway, I will test it and let > you know. > It would be good if we can test and confirm this behavior once. If it is not very inconvenient then we can even try to include a test for the same in the patch. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: PATCH: logical_work_mem and logical streaming of largein-progress transactions
From
Amit Kapila
Date:
On Fri, May 29, 2020 at 8:31 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > The fixes in the latest patchset are correct. Few minor comments: v26-0005-Implement-streaming-mode-in-ReorderBuffer + /* + * Mark toplevel transaction as having catalog changes too if one of its + * children has so that the ReorderBufferBuildTupleCidHash can conveniently + * check just toplevel transaction and decide whethe we need to build the + * hash table or not. In non-streaming mode we mark the toplevel + * transaction in DecodeCommit as we only stream on commit. Typo, /whethe/whether missing comma, /In non-streaming mode we/In non-streaming mode, we v26-0008-Add-support-for-streaming-to-built-in-replicatio + /* + * This memory context used for per stream data when streaming mode is + * enabled. This context is reeset on each stream stop. + */ Can we slightly modify the above comment as "This is used in the streaming mode for the changes between the start and stop stream messages. We reset this context on the stream stop message."? -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: PATCH: logical_work_mem and logical streaming of largein-progress transactions
From
Dilip Kumar
Date:
On Wed, Jun 3, 2020 at 2:43 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Tue, Jun 2, 2020 at 7:53 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > On Tue, Jun 2, 2020 at 4:56 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > On Tue, Jun 2, 2020 at 3:59 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > > > > > I thin for our use case BufFileCreateShared is more suitable. I think > > > > we need to do some modifications so that we can use these apps without > > > > SharedFileSet. Otherwise, we need to unnecessarily need to create > > > > SharedFileSet for each transaction and also need to maintain it in xid > > > > array or xid hash until transaction commit/abort. So I suggest > > > > following modifications in shared files set so that we can > > > > conveniently use it. > > > > 1. ChooseTablespace(const SharedFileSet fileset, const char name) > > > > if fileset is NULL then select the DEFAULTTABLESPACEOID > > > > 2. SharedFileSetPath(char path, SharedFileSet fileset, Oid tablespace) > > > > If fileset is NULL then in directory path we can use MyProcPID or > > > > something instead of fileset->creator_pid. > > > > > > > > > > Hmm, I find these modifications a bit ad-hoc. So, not sure if it is > > > better than the patch maintains sharedfileset information. > > > > I think we might do something better here, maybe by supplying function > > pointer or so, but maintaining sharedfileset which contains different > > tablespace/mutext which we don't need at all for our purpose also > > doesn't sound very appealing. > > > > I think we can say something similar for Relation (rel cache entry as > well) maintained in LogicalRepRelMapEntry. I think we only need a > pointer to that information. Yeah, I see. > > Let me see if I can not come up with > > some clean way of avoiding the need to shared-fileset then maybe we > > can go with the shared fileset idea. > > > > Fair enough. While evaluating it further I feel there are a few more problems to solve if we are using BufFile, First thing is that in subxact file we maintain the information of xid and its offset in the changes file. So now, we will also have to store 'fileno' but that we can find using BufFileTell. Yet another problem is that currently, we don't have the truncate option in the BufFile, but we need it if the sub-transaction gets aborted. I think we can implement an extra interface with the BufFile and should not be very hard as we already know the fileno and the offset. I will evaluate this part further and let you know about the same. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
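A rough sketch of how that could fit together is below. BufFileTell() is existing API; BufFileTruncateShared() and the fileno/offset fields in the subxact entry are the hypothetical additions being discussed here:

int     fileno;
off_t   offset;

/* remember where this subtransaction starts inside the changes BufFile */
BufFileTell(changes_file, &fileno, &offset);
subxacts[nsubxacts].fileno = fileno;
subxacts[nsubxacts].offset = offset;

/* proposed new interface for discarding an aborted subtransaction's data */
extern void BufFileTruncateShared(BufFile *file, int fileno, off_t offset);

/* on stream abort of that subxact, throw away everything written after it */
BufFileTruncateShared(changes_file, subxacts[subidx].fileno,
                      subxacts[subidx].offset);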
Re: PATCH: logical_work_mem and logical streaming of largein-progress transactions
From
Mahendra Singh Thalor
Date:
On Fri, 29 May 2020 at 15:52, Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Wed, May 27, 2020 at 5:19 PM Mahendra Singh Thalor <mahi6run@gmail.com> wrote:
>>
>> On Tue, 26 May 2020 at 16:46, Amit Kapila <amit.kapila16@gmail.com> wrote:
>>
>> Hi all,
>> On the top of v16 patch set [1], I did some testing for DDL's and DML's to test wal size and performance. Below is the testing summary;
>>
>> Test parameters:
>> wal_level= 'logical
>> max_connections = '150'
>> wal_receiver_timeout = '600s'
>> max_wal_size = '2GB'
>> min_wal_size = '2GB'
>> autovacuum= 'off'
>> checkpoint_timeout= '1d'
>>
>> Test results:
>>
>> CREATE index operations | Add col int(date) operations | Add col text operations |
>> SN. | operation name | LSN diff (in bytes) | time (in sec) | % LSN change | LSN diff (in bytes) | time (in sec) | % LSN change | LSN diff (in bytes) | time (in sec) | % LSN change |
>> 1 | 1 DDL without patch | 17728 | 0.89116 | 1.624548 | 976 | 0.764393 | 11.475409 | 33904 | 0.80044 | 2.80792 |
>> with patch | 18016 | 0.804868 | 1088 | 0.763602 | 34856 | 0.787108 |
>> 2 | 2 DDL without patch | 19872 | 0.860348 | 2.73752 | 1632 | 0.763199 | 13.7254902 | 34560 | 0.806086 | 3.078703 |
>> with patch | 20416 | 0.839065 | 1856 | 0.733147 | 35624 | 0.829281 |
>> 3 | 3 DDL without patch | 22016 | 0.894891 | 3.63372093 | 2288 | 0.776871 | 14.685314 | 35216 | 0.803493 | 3.339391186 |
>> with patch | 22816 | 0.828028 | 2624 | 0.737177 | 36392 | 0.800194 |
>> 4 | 4 DDL without patch | 24160 | 0.901686 | 4.4701986 | 2944 | 0.768445 | 15.217391 | 35872 | 0.77489 | 3.590544 |
>> with patch | 25240 | 0.887143 | 3392 | 0.768382 | 37160 | 0.82777 |
>> 5 | 5 DDL without patch | 26328 | 0.901686 | 4.9832877 | 3600 | 0.751879 | 15.555555 | 36528 | 0.817928 | 3.832676 |
>> with patch | 27640 | 0.914078 | 4160 | 0.74709 | 37928 | 0.820621 |
>> 6 | 6 DDL without patch | 28472 | 0.936385 | 5.5071649 | 4256 | 0.745179 | 15.78947368 | 37184 | 0.797043 | 4.066265 |
>> with patch | 30040 | 0.958226 | 4928 | 0.725321 | 38696 | 0.814535 |
>> 7 | 8 DDL without patch | 32760 | 1.0022203 | 6.422466 | 5568 | 0.757468 | 16.091954 | 38496 | 0.83207 | 4.509559 |
>> with patch | 34864 | 0.966777 | 6464 | 0.769072 | 40232 | 0.903604 |
>> 8 | 11 DDL without patch | 50296 | 1.0022203 | 5.662478 | 7536 | 0.748332 | 16.666666 | 40464 | 0.822266 | 5.179913 |
>> with patch | 53144 | 0.966777 | 8792 | 0.750553 | 42560 | 0.797133 |
>> 9 | 15 DDL without patch | 58896 | 1.267253 | 5.662478 | 10184 | 0.776875 | 16.496465 | 43112 | 0.821916 | 5.84524 |
>> with patch | 62768 | 1.27234 | 11864 | 0.746844 | 45632 | 0.812567 |
>> 10 | 1 DDL & 3 DML without patch | 18240 | 0.812551 | 1.6228 | 1192 | 0.771993 | 10.067114 | 34120 | 0.849467 | 2.8113599 |
>> with patch | 18536 | 0.819089 | 1312 | 0.785117 | 35080 | 0.855456 |
>> 11 | 3 DDL & 5 DML without patch | 23656 | 0.926616 | 3.4832606 | 2656 | 0.758029 | 13.55421687 | 35584 | 0.829377 | 3.372302 |
>> with patch | 24480 | 0.915517 | 3016 | 0.797206 | 36784 | 0.839176 |
>> 12 | 10 DDL & 5 DML without patch | 52760 | 1.101005 | 4.958301744 | 7288 | 0.763065 | 16.02634468 | 40216 | 0.837843 | 4.993037 |
>> with patch | 55376 | 1.105241 | 8456 | 0.779257 | 42224 | 0.835206 |
>> 13 | 10 DML without patch | 1008 | 0.791091 | 6.349206 | 1008 | 0.81105 | 6.349206 | 1008 | 0.78817 | 6.349206 |
>> with patch | 1072 | 0.807875 | 1072 | 0.771113 | 1072 | 0.759789 |
>>
>> To see all operations, please see[2] test_results
>>
>
> Why are you seeing any additional WAL in case-13 (10 DML) where there is no DDL? I think it is because you have used savepoints in that case which will add some additional WAL. You seems to have 9 savepoints in that test which should ideally generate 36 bytes of additional WAL (4-byte per transaction id for each subtransaction). Also, in other cases where you took data for DDL and DML, you have also used savepoints in those tests. I suggest for savepoints, let's do separate tests as you have done in case-13 but we can do it 3,5,7,10 savepoints and probably each transaction can update a row of 200 bytes or so.
>
Thanks Amit for reviewing results.
Yes, you are correct. I used savepoints in DML so it was showing additional wal.
As suggested above, I did testing for DML's, DDL's and savepoints. Below is the test results:
Test results:
CREATE index operations | Add col int(date) operations | Add col text operations | ||||||||
SN. | operation name | LSN diff (in bytes) | time (in sec) | % LSN change | LSN diff (in bytes) | time (in sec) | % LSN change | LSN diff (in bytes) | time (in sec) | % LSN change |
1 | 1 DDL without patch | 17728 | 0.89116 | 1.624548 | 976 | 0.764393 | 11.475409 | 33904 | 0.80044 | 2.80792 |
with patch | 18016 | 0.804868 | 1088 | 0.763602 | 34856 | 0.787108 | ||||
2 | 2 DDL without patch | 19872 | 0.860348 | 2.73752 | 1632 | 0.763199 | 13.7254902 | 34560 | 0.806086 | 3.078703 |
with patch | 20416 | 0.839065 | 1856 | 0.733147 | 35624 | 0.829281 | ||||
3 | 3 DDL without patch | 22016 | 0.894891 | 3.63372093 | 2288 | 0.776871 | 14.685314 | 35216 | 0.803493 | 3.339391186 |
with patch | 22816 | 0.828028 | 2624 | 0.737177 | 36392 | 0.800194 | ||||
4 | 4 DDL without patch | 24160 | 0.901686 | 4.4701986 | 2944 | 0.768445 | 15.217391 | 35872 | 0.77489 | 3.590544 |
with patch | 25240 | 0.887143 | 3392 | 0.768382 | 37160 | 0.82777 | ||||
5 | 5 DDL without patch | 26328 | 0.901686 | 4.9832877 | 3600 | 0.751879 | 15.555555 | 36528 | 0.817928 | 3.832676 |
with patch | 27640 | 0.914078 | 4160 | 0.74709 | 37928 | 0.820621 | ||||
6 | 6 DDL without patch | 28472 | 0.936385 | 5.5071649 | 4256 | 0.745179 | 15.78947368 | 37184 | 0.797043 | 4.066265 |
with patch | 30040 | 0.958226 | 4928 | 0.725321 | 38696 | 0.814535 | ||||
7 | 8 DDL without patch | 32760 | 1.0022203 | 6.422466 | 5568 | 0.757468 | 16.091954 | 38496 | 0.83207 | 4.509559 |
with patch | 34864 | 0.966777 | 6464 | 0.769072 | 40232 | 0.903604 | ||||
8 | 11 DDL without patch | 50296 | 1.0022203 | 5.662478 | 7536 | 0.748332 | 16.666666 | 40464 | 0.822266 | 5.179913 |
with patch | 53144 | 0.966777 | 8792 | 0.750553 | 42560 | 0.797133 | ||||
9 | 15 DDL without patch | 58896 | 1.267253 | 5.662478 | 10184 | 0.776875 | 16.496465 | 43112 | 0.821916 | 5.84524 |
with patch | 62768 | 1.27234 | 11864 | 0.746844 | 45632 | 0.812567 | ||||
10 | 1 DDL & 3 DML without patch | 18224 | 0.865753 | 1.58033362 | 1176 | 0.78074 | 9.523809 | 34104 | 0.857664 | 2.7914614 |
with patch | 18512 | 0.854788 | 1288 | 0.767758 | 35056 | 0.877604 | ||||
11 | 3 DDL & 5 DML without patch | 23632 | 0.954274 | 3.385203 | 2632 | 0.785501 | 12.765957 | 35560 | 0.87744 | 3.3070866 |
with patch | 24432 | 0.927245 | 2968 | 0.857528 | 36736 | 0.867555 | ||||
12 | 3 DDL & 10 DML without patch | 25088 | 0.941534 | 3.316326 | 3040 | 0.812123 | 11.052631 | 35968 | 0.877769 | 3.269579 |
with patch | 25920 | 0.898643 | 3376 | 0.804943 | 37144 | 0.879752 | ||||
13 | 3 DDL & 15 DML without patch | 26400 | 0.949599 | 3.151515 | 3392 | 0.818491 | 9.90566037 | 36320 | 0.859353 | 3.2378854 |
with patch | 27232 | 0.892505 | 3728 | 0.789752 | 37320 | 0.812386 | ||||
14 | 5 DDL & 15 DML without patch | 31904 | 0.994223 | 4.287863 | 4704 | 0.838091 | 11.904761 | 37632 | 0.867281 | 3.720238095 |
with patch | 33272 | 0.968122 | 5264 | 0.816922 | 39032 | 0.876364 | ||||
15 | 1 DML without patch | 328 | 0.817988 | 0 | ||||||
with patch | 328 | 0.794927 | ||||||||
16 | 3 DML without patch | 464 | 0.791229 | 0 | ||||||
with patch | 464 | 0.806211 | ||||||||
17 | 5 DML without patch | 608 | 0.794258 | 0 | ||||||
with patch | 608 | 0.802001 | ||||||||
18 | 10 DML without patch | 968 | 0.831733 | 0 | ||||||
with patch | 968 | 0.852777 |
Results for savepoints:
SN. | Operation name | Operation | LSN diff (in bytes) | time (in sec) | % LSN change |
1 | 1 savepoint without patch | begin; insert into perftest values (1); savepoint s1; update perftest set c1 = 5 where c1 = 1; commit; | 408 | 0.805615 | 1.960784 |
with patch | 416 | 0.823121 | |||
2 | 2 savepoint without patch | begin; insert into perftest values (1); savepoint s1; update perftest set c1 = 5 where c1 = 1; savepoint s2; update perftest set c1 = 6 where c1 = 5; commit; | 488 | 0.827147 | 3.278688 |
with patch | 504 | 0.819165 | |||
3 | 3 savepoint without patch | begin; insert into perftest values (1); savepoint s1; update perftest set c1 = 2 where c1 = 1; savepoint s2; update perftest set c1 = 3 where c1 = 2; savepoint s3; update perftest set c1 = 4 where c1 = 3; commit; | 560 | 0.806441 | 4.28571428 |
with patch | 584 | 0.821316 | |||
4 | 5 savepoint without patch | 712 | 0.823774 | 5.617977528 | |
with patch | 752 | 0.800037 | |||
5 | 7 savepoint without patch | 864 | 0.829136 | 6.48148148 | |
with patch | 920 | 0.793751 | |||
6 | 10 savepoint without patch | 1096 | 0.77946 | 7.29927007 | |
with patch | 1176 | 0.78711 |
To see all the operations (DDL's and DML's), please see test_results
Testing summary:
Basically, we are writing per-command invalidation messages, and to test that I have tested with different combinations of DDL and DML operations. I have not observed any performance degradation with the patch. For "create index" DDL's, the % change in WAL is 1-7% for 1-15 DDL's. For "add col int/date" DDL's, it is 11-17% for 1-15 DDL's, and for "add col text" DDL's, it is 2-6% for 1-15 DDL's. For mix (DDL & DML), it is 2-10%.
Why are we seeing 11-13% of extra WAL? Basically, the amount of extra WAL is not very high, but the amount of WAL generated with add column int/date is just ~1000 bytes, so an additional 100 bytes will be around 10%, and for add column text it is ~35000 bytes, so the % is less. For text, these ~35000 bytes are due to toast.
There is no change in WAL size for DML operations. For savepoints, we are getting a maximum of 8 bytes of WAL increment per savepoint (basically, for a sub-transaction we are adding 5 bytes to store the xid, but due to padding it is 8 bytes, and sometimes, if the WAL is already aligned, we get a 0-byte increment).
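For example, each savepoint should cost at most MAXALIGN(5) = 8 bytes, and the numbers above are consistent with that: the 10-savepoint case differs by 1176 - 1096 = 80 bytes (10 x 8), and the 7-savepoint case by 920 - 864 = 56 bytes (7 x 8).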
Thanks and Regards
Mahendra Singh Thalor
EnterpriseDB: http://www.enterprisedb.com
Re: PATCH: logical_work_mem and logical streaming of largein-progress transactions
From
Amit Kapila
Date:
On Fri, May 29, 2020 at 8:31 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > Apart from this one more fix in 0005, basically, CheckLiveXid was > never reset, so I have fixed that as well. > I have made a number of modifications in the 0001 patch and attached is the result. I have changed/added comments, done some cosmetic cleanup, and ran pgindent. The most notable change is to remove the below code change: DecodeXactOp() { .. - * However, it's critical to process XLOG_XACT_ASSIGNMENT records even + * However, it's critical to process records with subxid assignment even * when the snapshot is being built: it is possible to get later records * that require subxids to be properly assigned. */ if (SnapBuildCurrentState(builder) < SNAPBUILD_FULL_SNAPSHOT && - info != XLOG_XACT_ASSIGNMENT) + !TransactionIdIsValid(XLogRecGetTopXid(r))) .. } I have not only removed the change done by the patch but the check related to XLOG_XACT_ASSIGNMENT as well. That check has been added by commit bac2fae05c to ensure that we process XLOG_XACT_ASSIGNMENT even if snapshot state is not SNAPBUILD_FULL_SNAPSHOT. Now, with this patch that is not required because we are making the subtransaction and top-level transaction much earlier than this. I have verified that it doesn't reopen the bug by running the test provided in the original report [1]. Let me know what you think of the changes? If you find them okay, then feel to include them in the next patch-set. [1] - https://www.postgresql.org/message-id/CAONYFtOv%2BEr1p3WAuwUsy1zsCFrSYvpHLhapC_fMD-zNaRWxYg%40mail.gmail.com -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Attachment
Re: PATCH: logical_work_mem and logical streaming of largein-progress transactions
From
Dilip Kumar
Date:
On Thu, Jun 4, 2020 at 2:05 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Wed, Jun 3, 2020 at 2:43 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > On Tue, Jun 2, 2020 at 7:53 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > > > On Tue, Jun 2, 2020 at 4:56 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > > On Tue, Jun 2, 2020 at 3:59 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > > > > > > > I thin for our use case BufFileCreateShared is more suitable. I think > > > > > we need to do some modifications so that we can use these apps without > > > > > SharedFileSet. Otherwise, we need to unnecessarily need to create > > > > > SharedFileSet for each transaction and also need to maintain it in xid > > > > > array or xid hash until transaction commit/abort. So I suggest > > > > > following modifications in shared files set so that we can > > > > > conveniently use it. > > > > > 1. ChooseTablespace(const SharedFileSet fileset, const char name) > > > > > if fileset is NULL then select the DEFAULTTABLESPACEOID > > > > > 2. SharedFileSetPath(char path, SharedFileSet fileset, Oid tablespace) > > > > > If fileset is NULL then in directory path we can use MyProcPID or > > > > > something instead of fileset->creator_pid. > > > > > > > > > > > > > Hmm, I find these modifications a bit ad-hoc. So, not sure if it is > > > > better than the patch maintains sharedfileset information. > > > > > > I think we might do something better here, maybe by supplying function > > > pointer or so, but maintaining sharedfileset which contains different > > > tablespace/mutext which we don't need at all for our purpose also > > > doesn't sound very appealing. > > > > > > > I think we can say something similar for Relation (rel cache entry as > > well) maintained in LogicalRepRelMapEntry. I think we only need a > > pointer to that information. > > Yeah, I see. > > > > Let me see if I can not come up with > > > some clean way of avoiding the need to shared-fileset then maybe we > > > can go with the shared fileset idea. > > > > > > > Fair enough. > > While evaluating it further I feel there are a few more problems to > solve if we are using BufFile, First thing is that in subxact file we > maintain the information of xid and its offset in the changes file. > So now, we will also have to store 'fileno' but that we can find using > BufFileTell. Yet another problem is that currently, we don't > have the truncate option in the BufFile, but we need it if the > sub-transaction gets aborted. I think we can implement an extra > interface with the BufFile and should not be very hard as we already > know the fileno and the offset. I will evaluate this part further and > let you know about the same. I have further evaluated this and also tested the concept with a POC patch. Soon I will complete and share, here is the scatch of the idea. As discussed we will use SharedBufFile for changes files and subxact files. There will be a separate LogicalStreamingResourceOwner, which will be used to manage the VFD of the shared buf files. We can create a per stream resource owner i.e. on stream start we will create the resource owner and all the shared buffiles will be opened under that resource owner, which will be deleted on stream stop. We need to remember the SharedFileSet so that for subsequent stream for the same transaction we can open the same file again, for this we will use a hash table with xid as a key and in that, we will keep stream_fileset and subxact_fileset's pointers as payload. 
+typedef struct StreamXidHash +{ + TransactionId xid; + SharedFileSet *stream_fileset; + SharedFileSet *subxact_fileset; +} StreamXidHash; We have to do some extension to the buffile modules, some of them are already discussed up-thread but still listing them all down here - A new interface BufFileTruncateShared(BufFile *file, int fileno, off_t offset), for truncating the subtransaction changes, if changes are spread across multiple files those files will be deleted and we will adjust the file count and current offset accordingly in BufFile. - In BufFileOpenShared, we will have to implement a mode so that we can open in write mode as well, current only read-only mode supported. - In SharedFileSetInit, if dsm_segment is NULL then we will not register the file deletion on on_dsm_detach. - As usual, we will clean up the files on stream abort/commit, or on the worker exit. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
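As an illustration of how the xid hash could be set up in the apply worker (the table name, initial size, and the use of ApplyContext here are assumptions for the sketch, and xid is the toplevel transaction id taken from the stream start message):

HASHCTL         hash_ctl;
HTAB           *xidhash;
StreamXidHash  *ent;
bool            found;

memset(&hash_ctl, 0, sizeof(hash_ctl));
hash_ctl.keysize = sizeof(TransactionId);
hash_ctl.entrysize = sizeof(StreamXidHash);
hash_ctl.hcxt = ApplyContext;

xidhash = hash_create("StreamXidHash", 1024, &hash_ctl,
                      HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);

/* at stream start: find or create the entry for this transaction */
ent = (StreamXidHash *) hash_search(xidhash, &xid, HASH_ENTER, &found);
if (!found)
{
    ent->stream_fileset = NULL;     /* created lazily on first change */
    ent->subxact_fileset = NULL;    /* created lazily on first subxact */
}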
Re: PATCH: logical_work_mem and logical streaming of largein-progress transactions
From
Dilip Kumar
Date:
On Fri, Jun 5, 2020 at 11:37 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Fri, May 29, 2020 at 8:31 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > Apart from this one more fix in 0005, basically, CheckLiveXid was > > never reset, so I have fixed that as well. > > > > I have made a number of modifications in the 0001 patch and attached > is the result. I have changed/added comments, done some cosmetic > cleanup, and ran pgindent. The most notable change is to remove the > below code change: > DecodeXactOp() > { > .. > - * However, it's critical to process XLOG_XACT_ASSIGNMENT records even > + * However, it's critical to process records with subxid assignment even > * when the snapshot is being built: it is possible to get later records > * that require subxids to be properly assigned. > */ > if (SnapBuildCurrentState(builder) < SNAPBUILD_FULL_SNAPSHOT && > - info != XLOG_XACT_ASSIGNMENT) > + !TransactionIdIsValid(XLogRecGetTopXid(r))) > .. > } > > I have not only removed the change done by the patch but the check > related to XLOG_XACT_ASSIGNMENT as well. That check has been added by > commit bac2fae05c to ensure that we process XLOG_XACT_ASSIGNMENT even > if snapshot state is not SNAPBUILD_FULL_SNAPSHOT. Now, with this > patch that is not required because we are making the subtransaction > and top-level transaction much earlier than this. I have verified > that it doesn't reopen the bug by running the test provided in the > original report [1]. > > Let me know what you think of the changes? If you find them okay, > then feel to include them in the next patch-set. > > [1] - https://www.postgresql.org/message-id/CAONYFtOv%2BEr1p3WAuwUsy1zsCFrSYvpHLhapC_fMD-zNaRWxYg%40mail.gmail.com Thanks for the patch, I will review it and include it in my next version. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: PATCH: logical_work_mem and logical streaming of largein-progress transactions
From
Amit Kapila
Date:
On Sun, Jun 7, 2020 at 5:08 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Fri, Jun 5, 2020 at 11:37 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > Let me know what you think of the changes? If you find them okay, > > then feel to include them in the next patch-set. > > > > [1] - https://www.postgresql.org/message-id/CAONYFtOv%2BEr1p3WAuwUsy1zsCFrSYvpHLhapC_fMD-zNaRWxYg%40mail.gmail.com > > Thanks for the patch, I will review it and include it in my next version. > Okay, I have done review of 0002-Issue-individual-invalidations-with-wal_level-lo.patch and below are my comments: 1. I don't think it is a good idea that logical decoding process the new XLOG_XACT_INVALIDATIONS and existing WAL records for invalidations like XLOG_INVALIDATIONS and what we do in DecodeCommit (see code in the check "if (parsed->nmsgs > 0)"). I think if that is required for some particular reason then we should write detailed comments about the same. I have tried some experiments to see if those are really required: a. After applying patch 0002, I have tried by commenting out the processing of invalidations via DecodeCommit and found some regression tests were failing but the reason for failure was that we are not setting RBTXN_HAS_CATALOG_CHANGES for the toptxn when subtxn has catalog changes and when I did that all regression tests started passing. See the attached diff patch (v27-0003-Incremental-patch-for-0002-to-test-removal-of-du) atop 0002 patch. b. The processing of invalidations for XLOG_INVALIDATIONS is added by commit c6ff84b06a for xid-less transactions. See https://postgr.es/m/CAB-SwXY6oH=9twBkXJtgR4UC1NqT-vpYAtxCseME62ADwyK5OA@mail.gmail.com to know why that has been added. Now, after this patch we will process the same invalidations via XLOG_XACT_INVALIDATIONS and XLOG_INVALIDATIONS which doesn't seem warranted. Also, the below assertion will fail for xid-less transactions (try create index concurrently statement): + case XLOG_XACT_INVALIDATIONS: + { + TransactionId xid; + xl_xact_invalidations *invals; + + xid = XLogRecGetXid(r); + invals = (xl_xact_invalidations *) XLogRecGetData(r); + + Assert(TransactionIdIsValid(xid)); I feel we don't need the processing of XLOG_INVALIDATIONS in logical decoding after this patch but to prove that first we need to write a test case which need XLOG_INVALIDATIONS in the HEAD as commit c6ff84b06a doesn't add one. I think we need two code paths in XLOG_XACT_INVALIDATIONS where if it is for xid-less transactions, then execute actions immediately as we are doing in processing of XLOG_INVALIDATIONS, otherwise, do what we are doing currently in the patch. If the above point (b) is correct, I am not sure if it is a good idea to use RM_XACT_ID as resource manager if for this WAL in LogLogicalInvalidations, what do you think? I think one of the usages we still need is in ReorderBufferForget because it can be called when we skip processing the txn. See the comments in DecodeCommit where we call this function. If I am correct, we need to probably collect all invalidations in ReorderBufferTxn as we are collecting tuplecids and use them here. We can do the same during processing of XLOG_XACT_INVALIDATIONS. I had also thought a bit about removing logging of invalidations at commit time altogether but it seems processing hot-standby is somewhat tightly coupled with existing WAL logging. See xact_redo_commit (a comment atop call to ProcessCommittedInvalidationMessages). It says we need to maintain the order when we process invalidations. 
If we can later find a way to avoid that we can probably remove it but for now maybe we can live with it. 2. + /* not expected, but print something anyway */ + else if (msg->id == SHAREDINVALSMGR_ID) + appendStringInfoString(buf, " smgr"); + /* not expected, but print something anyway */ + else if (msg->id == SHAREDINVALRELMAP_ID) I think the above comment is not valid after we started logging at CCI. 3. + + xid = XLogRecGetXid(r); + invals = (xl_xact_invalidations *) XLogRecGetData(r); + + Assert(TransactionIdIsValid(xid)); + ReorderBufferAddInvalidation(reorder, xid, buf->origptr, + invals->nmsgs, invals->msgs); Here, it should check !ctx->forward as we do in DecodeCommit, do we have any reason for not doing so. We can test once by changing this. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Attachment
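Regarding point 1 above, the two code paths being suggested for XLOG_XACT_INVALIDATIONS could look roughly as follows. ReorderBufferImmediateInvalidation() is the existing function used for XLOG_INVALIDATIONS; the record layout and ReorderBufferAddInvalidation() follow the patch hunks quoted above, so treat the exact names as assumptions:

case XLOG_XACT_INVALIDATIONS:
    {
        TransactionId xid = XLogRecGetXid(r);
        xl_xact_invalidations *invals;

        invals = (xl_xact_invalidations *) XLogRecGetData(r);

        if (!TransactionIdIsValid(xid))
        {
            /*
             * xid-less transactions (e.g. CREATE INDEX CONCURRENTLY):
             * execute the invalidations immediately, as is done today
             * for XLOG_INVALIDATIONS.
             */
            ReorderBufferImmediateInvalidation(reorder, invals->nmsgs,
                                               invals->msgs);
        }
        else
            ReorderBufferAddInvalidation(reorder, xid, buf->origptr,
                                         invals->nmsgs, invals->msgs);
        break;
    }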
Re: PATCH: logical_work_mem and logical streaming of largein-progress transactions
From
Amit Kapila
Date:
On Mon, Jun 8, 2020 at 11:53 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > I think one of the usages we still need is in ReorderBufferForget > because it can be called when we skip processing the txn. See the > comments in DecodeCommit where we call this function. If I am > correct, we need to probably collect all invalidations in > ReorderBufferTxn as we are collecting tuplecids and use them here. We > can do the same during processing of XLOG_XACT_INVALIDATIONS. > One more point related to this is that after this patch series, we need to consider executing all invalidation during transaction abort. Because it is possible that due to memory overflow, we have processed some of the messages which also contain a few XACT_INVALIDATION messages, so to avoid cache pollution, we need to execute all of them in abort. We also do the similar thing in Rollback/Rollback To Savepoint, see AtEOXact_Inval and AtEOSubXact_Inval. Few other comments on 0002-Issue-individual-invalidations-with-wal_level-lo.patch --------------------------------------------------------------------------------------------------------------- 1. + if (transInvalInfo->CurrentCmdInvalidMsgs.cclist) + { + ProcessInvalidationMessagesMulti(&transInvalInfo->CurrentCmdInvalidMsgs, + MakeSharedInvalidMessagesArray); + invalMessages = SharedInvalidMessagesArray; + nmsgs = numSharedInvalidMessagesArray; + SharedInvalidMessagesArray = NULL; + numSharedInvalidMessagesArray = 0; a. Immediately after ProcessInvalidationMessagesMulti, isn't it better to have an Assertion like Assert(!(numSharedInvalidMessagesArray > 0 && SharedInvalidMessagesArray == NULL));? b. Why check "if (transInvalInfo->CurrentCmdInvalidMsgs.cclist)" is required? If you see xactGetCommittedInvalidationMessages where we do something similar, we only check for valid value of transInvalInfo and here we check the same in the caller of LogLogicalInvalidations, isn't that sufficient? If that is sufficient, we can either have the same check here or have an Assert for the same. 2. @@ -1092,6 +1101,9 @@ CommandEndInvalidationMessages(void) if (transInvalInfo == NULL) return; + if (XLogLogicalInfoActive()) + LogLogicalInvalidations(); + ProcessInvalidationMessages(&transInvalInfo->CurrentCmdInvalidMsgs, LocalExecuteInvalidationMessage); Generally, we WAL log the action after performing it but here you are writing WAL first. Is there any specific reason? If so, can we write a comment about the same? 3. + * When wal_level=logical, write invalidations into WAL at each command end to + * support the decoding of the in-progress transaction. As of now it was + * enough to log invalidation only at commit because we are only decoding the + * transaction at the commit time. We only need to log the catalog cache and + * relcache invalidation. There can not be any active MVCC scan in logical + * decoding so we don't need to log the snapshot invalidation. I think this comment doesn't hold good after we have changed the patch to LOG invalidations at the time of CCI. 4. + +/* + * Emit WAL for invalidations. + */ +static void +LogLogicalInvalidations() Add the function name atop of this function in comments to match the style with other nearby functions. How about modifying it to something like: "Emit WAL for invalidations. This is currently only used for logging invalidations at the command end." 5. + * + * XXX Do we need to care about relcacheInitFileInval and + * the other fields added to ReorderBufferChange, or just + * about the message itself? 
+ */ I don't think we need to do anything about relcacheInitFileInval. This is used to remove the stale files (RELCACHE_INIT_FILENAME) that have obsolete information about relcache. The walsender process that is doing decoding doesn't require us to do anything about this. Also, if you see before this patch, we don't do anything about relcache files during decoding of invalidation messages. In short, I think we can remove this comment unless you see some use of it. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
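For point 1(a) above, the suggested assertion spliced into the quoted hunk would look something like this (the surrounding variable names come from the quoted patch code):

ProcessInvalidationMessagesMulti(&transInvalInfo->CurrentCmdInvalidMsgs,
                                 MakeSharedInvalidMessagesArray);

/* a non-zero message count must come with a valid array */
Assert(!(numSharedInvalidMessagesArray > 0 &&
         SharedInvalidMessagesArray == NULL));

invalMessages = SharedInvalidMessagesArray;
nmsgs = numSharedInvalidMessagesArray;
SharedInvalidMessagesArray = NULL;
numSharedInvalidMessagesArray = 0;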
Re: PATCH: logical_work_mem and logical streaming of largein-progress transactions
From
Amit Kapila
Date:
On Thu, Jun 4, 2020 at 5:06 PM Mahendra Singh Thalor <mahi6run@gmail.com> wrote:
On Fri, 29 May 2020 at 15:52, Amit Kapila <amit.kapila16@gmail.com> wrote:
>
To see all the operations (DDL's and DML's), please see test_results. Testing summary: Basically, we are writing per-command invalidation messages, and to test that I have tested with different combinations of DDL and DML operations. I have not observed any performance degradation with the patch. For "create index" DDL's, the % change in WAL is 1-7% for 1-15 DDL's. For "add col int/date" DDL's, it is 11-17% for 1-15 DDL's, and for "add col text" DDL's, it is 2-6% for 1-15 DDL's. For mix (DDL & DML), it is 2-10%.
Why are we seeing 11-13% of extra WAL? Basically, the amount of extra WAL is not very high, but the amount of WAL generated with add column int/date is just ~1000 bytes, so an additional 100 bytes will be around 10%, and for add column text it is ~35000 bytes, so the % is less. For text, these ~35000 bytes are due to toast. There is no change in WAL size for DML operations. For savepoints, we are getting a maximum of 8 bytes of WAL increment per savepoint (basically, for a sub-transaction we are adding 5 bytes to store the xid, but due to padding it is 8 bytes, and sometimes, if the WAL is already aligned, we get a 0-byte increment).
So, if I read it correctly, there is no performance penalty with either of the patches but there is some additional WAL which in most cases is 2-5% but in worst cases and some specific DDL's it is up to 15%. I think as this WAL overhead is incurred only when wal_level is logical, we might have to live with it as the other alternative is to blow up all caches on any DDL in WALSenders and that will have both CPU and network overhead as explained previously [1]. I feel if the WAL overhead pinches any workload, we might want to do it under some new GUC (which will disable streaming of transactions) but I don't think we need to go there.
What do you think?
Re: PATCH: logical_work_mem and logical streaming of largein-progress transactions
From
Dilip Kumar
Date:
On Tue, Jun 9, 2020 at 3:39 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Thu, Jun 4, 2020 at 5:06 PM Mahendra Singh Thalor <mahi6run@gmail.com> wrote: >> >> On Fri, 29 May 2020 at 15:52, Amit Kapila <amit.kapila16@gmail.com> wrote: >> > >> >> >> To see all the operations (DDL's and DML's), please see test_results >> >> Testing summary: >> Basically, we are writing per-command invalidation messages, and to test that I have tested with different combinations of DDL and DML operations. I have not observed any performance degradation with the patch. For "create index" DDL's, the % change in WAL is 1-7% for 1-15 DDL's. For "add col int/date" DDL's, it is 11-17% for 1-15 DDL's, and for "add col text" DDL's, it is 2-6% for 1-15 DDL's. For mix (DDL & DML), it is 2-10%. >> >> Why are we seeing 11-13% of extra WAL? Basically, the amount of extra WAL is not very high, but the amount of WAL generated with add column int/date is just ~1000 bytes, so an additional 100 bytes will be around 10%, and for add column text it is ~35000 bytes, so the % is less. For text, these ~35000 bytes are due to toast. >> There is no change in WAL size for DML operations. For savepoints, we are getting a maximum of 8 bytes of WAL increment per savepoint (basically, for a sub-transaction we are adding 5 bytes to store the xid, but due to padding it is 8 bytes, and sometimes, if the WAL is already aligned, we get a 0-byte increment) > > > So, if I read it correctly, there is no performance penalty with either of the patches but there is some additional WAL which in most cases is 2-5% but in worst cases and some specific DDL's it is up to 15%. I think as this WAL overhead is incurred only when wal_level is logical, we might have to live with it as the other alternative is to blow up all caches on any DDL in WALSenders and that will have both CPU and network overhead as explained previously [1]. I feel if the WAL overhead pinches any workload, we might want to do it under some new GUC (which will disable streaming of transactions) but I don't think we need to go there. > > What do you think? Even I feel so because the WAL overhead is only with wal_level=logical and especially with DDL and ideally, there should not be a large amount of DDL in the system compared to other operations. So I think we can live with the current approach. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: PATCH: logical_work_mem and logical streaming of largein-progress transactions
From
Dilip Kumar
Date:
On Sun, Jun 7, 2020 at 5:06 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Thu, Jun 4, 2020 at 2:05 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > On Wed, Jun 3, 2020 at 2:43 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > On Tue, Jun 2, 2020 at 7:53 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > > > > > On Tue, Jun 2, 2020 at 4:56 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > > > > On Tue, Jun 2, 2020 at 3:59 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > > > > > > > > > I thin for our use case BufFileCreateShared is more suitable. I think > > > > > > we need to do some modifications so that we can use these apps without > > > > > > SharedFileSet. Otherwise, we need to unnecessarily need to create > > > > > > SharedFileSet for each transaction and also need to maintain it in xid > > > > > > array or xid hash until transaction commit/abort. So I suggest > > > > > > following modifications in shared files set so that we can > > > > > > conveniently use it. > > > > > > 1. ChooseTablespace(const SharedFileSet fileset, const char name) > > > > > > if fileset is NULL then select the DEFAULTTABLESPACEOID > > > > > > 2. SharedFileSetPath(char path, SharedFileSet fileset, Oid tablespace) > > > > > > If fileset is NULL then in directory path we can use MyProcPID or > > > > > > something instead of fileset->creator_pid. > > > > > > > > > > > > > > > > Hmm, I find these modifications a bit ad-hoc. So, not sure if it is > > > > > better than the patch maintains sharedfileset information. > > > > > > > > I think we might do something better here, maybe by supplying function > > > > pointer or so, but maintaining sharedfileset which contains different > > > > tablespace/mutext which we don't need at all for our purpose also > > > > doesn't sound very appealing. > > > > > > > > > > I think we can say something similar for Relation (rel cache entry as > > > well) maintained in LogicalRepRelMapEntry. I think we only need a > > > pointer to that information. > > > > Yeah, I see. > > > > > > Let me see if I can not come up with > > > > some clean way of avoiding the need to shared-fileset then maybe we > > > > can go with the shared fileset idea. > > > > > > > > > > Fair enough. > > > > While evaluating it further I feel there are a few more problems to > > solve if we are using BufFile, First thing is that in subxact file we > > maintain the information of xid and its offset in the changes file. > > So now, we will also have to store 'fileno' but that we can find using > > BufFileTell. Yet another problem is that currently, we don't > > have the truncate option in the BufFile, but we need it if the > > sub-transaction gets aborted. I think we can implement an extra > > interface with the BufFile and should not be very hard as we already > > know the fileno and the offset. I will evaluate this part further and > > let you know about the same. > > I have further evaluated this and also tested the concept with a POC > patch. Soon I will complete and share, here is the scatch of the > idea. > > As discussed we will use SharedBufFile for changes files and subxact > files. There will be a separate LogicalStreamingResourceOwner, which > will be used to manage the VFD of the shared buf files. We can create > a per stream resource owner i.e. on stream start we will create the > resource owner and all the shared buffiles will be opened under that > resource owner, which will be deleted on stream stop. 
We need to > remember the SharedFileSet so that for subsequent stream for the same > transaction we can open the same file again, for this we will use a > hash table with xid as a key and in that, we will keep stream_fileset > and subxact_fileset's pointers as payload. > > +typedef struct StreamXidHash > +{ > + TransactionId xid; > + SharedFileSet *stream_fileset; > + SharedFileSet *subxact_fileset; > +} StreamXidHash; > > We have to do some extension to the buffile modules, some of them are > already discussed up-thread but still listing them all down here > - A new interface BufFileTruncateShared(BufFile *file, int fileno, > off_t offset), for truncating the subtransaction changes, if changes > are spread across multiple files those files will be deleted and we > will adjust the file count and current offset accordingly in BufFile. > - In BufFileOpenShared, we will have to implement a mode so that we > can open in write mode as well, current only read-only mode supported. > - In SharedFileSetInit, if dsm_segment is NULL then we will not > register the file deletion on on_dsm_detach. > - As usual, we will clean up the files on stream abort/commit, or on > the worker exit. Currently, I am done with a working prototype of using the BufFile infrastructure for the tempfile. Meanwhile, I want to discuss a few interface changes required for the BufFile infrastructure. 1. Support read-write mode for "BufFileOpenShared": basically, in workers we will be opening the xid's changes and subxact files per stream, so we need an RW mode even in the open. I have passed a flag for the same. 2. Files should not be closed at the end of the transaction: currently, files opened with BufFileCreateShared/BufFileOpenShared are registered to be closed on EOXACT. Basically, we need to open the changes file on stream start and keep it open until stream stop, so we cannot afford to have it closed at EOXACT. I have added a flag for the same. 3. As discussed above, we need to support truncate for handling subtransaction abort, so I have added a new interface for the same. 4. Every time we open the changes file, we need to seek to the end, so I have supported SEEK_END. Attached is the WIP patch describing my changes. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Attachment
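To make the proposal above easier to follow, here is a rough sketch (in header form) of the interface extensions and the per-xid hash entry it describes. The signatures follow the wording of the mail, not the final committed interface, so treat them as assumptions.

    /* Sketch only: signatures follow the proposal in this mail, not the final patch. */
    #include "storage/buffile.h"
    #include "storage/sharedfileset.h"

    /*
     * Truncate a shared BufFile to the given fileno/offset; any later physical
     * segment files are deleted and the file count / current offset adjusted
     * accordingly (used to discard the changes of an aborted subtransaction).
     */
    extern void BufFileTruncateShared(BufFile *file, int fileno, off_t offset);

    /*
     * Proposed: allow opening an existing shared BufFile for read-write, not
     * just read-only, so the worker can append to it on each new stream.
     */
    extern BufFile *BufFileOpenShared(SharedFileSet *fileset, const char *name,
                                      int mode);

    /*
     * Hash entry keyed by toplevel xid, so that a later stream of the same
     * transaction can find and reopen its changes/subxact filesets.
     */
    typedef struct StreamXidHash
    {
        TransactionId  xid;             /* hash key */
        SharedFileSet *stream_fileset;  /* fileset of the changes file */
        SharedFileSet *subxact_fileset; /* fileset of the subxact file, if any */
    } StreamXidHash;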
Re: PATCH: logical_work_mem and logical streaming of largein-progress transactions
From
Amit Kapila
Date:
On Wed, Jun 10, 2020 at 2:30 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > Currently, I am done with a working prototype of using the BufFile > infrastructure for the tempfile. Meanwhile, I want to discuss a few > interface changes required for the BufFIle infrastructure. > > 1. Support read-write mode for "BufFileOpenShared", Basically, in > workers we will be opening the xid's changes and subxact files per > stream, so we need an RW mode even in the open. I have passed a flag > for the same. > Generally file open APIs have mode as a parameter to indicate read_only or read_write. Using flag here seems a bit odd to me. > 2. Files should not be closed at the end of the transaction: > Currently, files opened with BufFileCreateShared/BufFileOpenShared are > registered to be closed on EOXACT. Basically, we need to open the > changes file on the stream start and keep it open until stream stop, > so we can not afford to get it closed on the EOXACT. I have added a > flag for the same. > But where do we end the transaction before the stream stop which can lead to closure of this file? -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: PATCH: logical_work_mem and logical streaming of largein-progress transactions
From
Dilip Kumar
Date:
On Wed, Jun 10, 2020 at 4:00 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Wed, Jun 10, 2020 at 2:30 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > > > Currently, I am done with a working prototype of using the BufFile > > infrastructure for the tempfile. Meanwhile, I want to discuss a few > > interface changes required for the BufFIle infrastructure. > > > > 1. Support read-write mode for "BufFileOpenShared", Basically, in > > workers we will be opening the xid's changes and subxact files per > > stream, so we need an RW mode even in the open. I have passed a flag > > for the same. > > > > Generally file open APIs have mode as a parameter to indicate > read_only or read_write. Using flag here seems a bit odd to me. Let me think about it, we can try to pass the mode. > > 2. Files should not be closed at the end of the transaction: > > Currently, files opened with BufFileCreateShared/BufFileOpenShared are > > registered to be closed on EOXACT. Basically, we need to open the > > changes file on the stream start and keep it open until stream stop, > > so we can not afford to get it closed on the EOXACT. I have added a > > flag for the same. > > > > But where do we end the transaction before the stream stop which can > lead to closure of this file? Currently, I am keeping the transaction only while creating/opening the files and closing immediately after that, maybe we can keep the transaction until stream stop, then we can avoid this changes, and we can also avoid creating extra resource owner? What is your thought on this? -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: PATCH: logical_work_mem and logical streaming of largein-progress transactions
From
Amit Kapila
Date:
On Wed, Jun 10, 2020 at 5:02 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Wed, Jun 10, 2020 at 4:00 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > 2. Files should not be closed at the end of the transaction: > > > Currently, files opened with BufFileCreateShared/BufFileOpenShared are > > > registered to be closed on EOXACT. Basically, we need to open the > > > changes file on the stream start and keep it open until stream stop, > > > so we can not afford to get it closed on the EOXACT. I have added a > > > flag for the same. > > > > > > > But where do we end the transaction before the stream stop which can > > lead to closure of this file? > > Currently, I am keeping the transaction only while creating/opening > the files and closing immediately after that, maybe we can keep the > transaction until stream stop, then we can avoid this changes, and we > can also avoid creating extra resource owner? What is your thought on > this? > I would prefer to keep the transaction until the stream stop unless there are good reasons for not doing so. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
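In concrete terms, keeping the transaction for the whole stream chunk would look roughly like the sketch below in the apply worker's stream handlers (apply_handle_stream_start/stop are the handler names used by the patch; the elided steps are only indicated by comments, so this is a sketch rather than the actual worker code):

    static void
    apply_handle_stream_start(StringInfo s)
    {
        /* ... read the toplevel xid from 's', set up streaming state ... */

        /*
         * Start the per-stream transaction.  The shared BufFiles opened below
         * need a resource owner and temp tablespaces, and this also lets the
         * files stay open until stream stop without any EOXACT special casing.
         */
        ensure_transaction();

        /* ... open/create the changes file for this xid, read subxact info ... */
    }

    static void
    apply_handle_stream_stop(StringInfo s)
    {
        /* ... flush subxact info, close the changes file ... */

        /* End the per-stream transaction. */
        Assert(IsTransactionState());
        CommitTransactionCommand();
    }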
Re: PATCH: logical_work_mem and logical streaming of largein-progress transactions
From
Dilip Kumar
Date:
On Wed, Jun 10, 2020 at 5:11 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Wed, Jun 10, 2020 at 5:02 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > On Wed, Jun 10, 2020 at 4:00 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > 2. Files should not be closed at the end of the transaction: > > > > Currently, files opened with BufFileCreateShared/BufFileOpenShared are > > > > registered to be closed on EOXACT. Basically, we need to open the > > > > changes file on the stream start and keep it open until stream stop, > > > > so we can not afford to get it closed on the EOXACT. I have added a > > > > flag for the same. > > > > > > > > > > But where do we end the transaction before the stream stop which can > > > lead to closure of this file? > > > > Currently, I am keeping the transaction only while creating/opening > > the files and closing immediately after that, maybe we can keep the > > transaction until stream stop, then we can avoid this changes, and we > > can also avoid creating extra resource owner? What is your thought on > > this? > > > > I would prefer to keep the transaction until the stream stop unless > there are good reasons for not doing so. I am ready with the first patch set which replaces the temp file usage in the worker with the buffile usage. (patch v27-0013 and v27-0014) Open item: - As of now, I have kept the buffile changes and the worker using buffile as separate patches for review. Later I will make buffile changes patch as a base patch and I will merge the worker changes with the 0008 patch. - Currently, while reading/writing the streaming/subxact files we are reporting the wait event for example 'pgstat_report_wait_start(WAIT_EVENT_LOGICAL_SUBXACT_WRITE);', but BufFileWrite/BufFileRead internally reports the read/write wait event. So I think we can avoid reporting that? Basically, this part is still I have to work upon, once we get the consensus then I can remove those extra wait event from the patch. - There are still a few open comments, from your other mails, I still have to work upon. So I will work on those in the next version. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Attachment
Re: PATCH: logical_work_mem and logical streaming of largein-progress transactions
From
Amit Kapila
Date:
On Fri, Jun 12, 2020 at 11:38 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > - Currently, while reading/writing the streaming/subxact files we are > reporting the wait event for example > 'pgstat_report_wait_start(WAIT_EVENT_LOGICAL_SUBXACT_WRITE);', but > BufFileWrite/BufFileRead internally reports the read/write wait event. > So I think we can avoid reporting that? > Yes, we can avoid that. No other place using BufFileRead does any such reporting. > Basically, this part is still > I have to work upon, once we get the consensus then I can remove those > extra wait event from the patch. > Okay, feel free to send an updated patch with the above change. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
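For clarity, the change being agreed on here is simply to drop the extra bracketing, since BufFileWrite()/BufFileRead() already report WAIT_EVENT_BUFFILE_WRITE/READ internally. A schematic before/after, with the error message wording taken from the snippets quoted in this thread:

    /* Before: raw write() bracketed by an explicit wait event. */
    pgstat_report_wait_start(WAIT_EVENT_LOGICAL_SUBXACT_WRITE);
    if (write(fd, &nsubxacts, sizeof(nsubxacts)) != sizeof(nsubxacts))
        ereport(ERROR,
                (errcode_for_file_access(),
                 errmsg("could not write to file \"%s\": %m", path)));
    pgstat_report_wait_end();

    /*
     * After: BufFileWrite() already reports WAIT_EVENT_BUFFILE_WRITE itself,
     * so no extra reporting is needed around it.
     */
    BufFileWrite(fd, &nsubxacts, sizeof(nsubxacts));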
Re: PATCH: logical_work_mem and logical streaming of largein-progress transactions
From
Dilip Kumar
Date:
On Fri, Jun 12, 2020 at 4:35 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Fri, Jun 12, 2020 at 11:38 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > - Currently, while reading/writing the streaming/subxact files we are > > reporting the wait event for example > > 'pgstat_report_wait_start(WAIT_EVENT_LOGICAL_SUBXACT_WRITE);', but > > BufFileWrite/BufFileRead internally reports the read/write wait event. > > So I think we can avoid reporting that? > > > > Yes, we can avoid that. No other place using BufFileRead does any > such reporting. I agree. > > Basically, this part is still > > I have to work upon, once we get the consensus then I can remove those > > extra wait event from the patch. > > > > Okay, feel free to send an updated patch with the above change. Sure, I will do that in the next patch set. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: PATCH: logical_work_mem and logical streaming of largein-progress transactions
From
Amit Kapila
Date:
On Mon, Jun 15, 2020 at 9:12 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Fri, Jun 12, 2020 at 4:35 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > Basically, this part is still > > > I have to work upon, once we get the consensus then I can remove those > > > extra wait event from the patch. > > > > > > > Okay, feel free to send an updated patch with the above change. > > Sure, I will do that in the next patch set. > I have few more comments on the patch 0013-Change-buffile-interface-required-for-streaming-.patch: 1. - * temp_file_limit of the caller, are read-only and are automatically closed - * at the end of the transaction but are not deleted on close. + * temp_file_limit of the caller, are read-only if the flag is set and are + * automatically closed at the end of the transaction but are not deleted on + * close. */ File -PathNameOpenTemporaryFile(const char *path) +PathNameOpenTemporaryFile(const char *path, int mode) No need to say "are read-only if the flag is set". I don't see any flag passed to function so that part of the comment doesn't seem appropriate. 2. @@ -68,7 +68,8 @@ SharedFileSetInit(SharedFileSet *fileset, dsm_segment *seg) } /* Register our cleanup callback. */ - on_dsm_detach(seg, SharedFileSetOnDetach, PointerGetDatum(fileset)); + if (seg) + on_dsm_detach(seg, SharedFileSetOnDetach, PointerGetDatum(fileset)); } Add comments atop function to explain when we don't want to register the dsm detach stuff? 3. + */ + newFile = file->numFiles - 1; + newOffset = FileSize(file->files[file->numFiles - 1]); break; FileSize can return negative lengths to indicate failure which we should handle. See other places in the code where FileSize is used? But I have another question here which is why we need to implement SEEK_END? How other usages of BufFile interface takes care of this? I see an API BufFileTell which can give the current read/write location in the file, isn't that sufficient for your usage? Also, how before BufFile usage is this thing handled in the patch? 4. + /* Loop over all the files upto the fileno which we want to truncate. */ + for (i = file->numFiles - 1; i >= fileno; i--) "the files", extra space in the above part of the comment. 5. + /* + * Except the fileno, we can directly delete other files. Before 'we', there is extra space. 6. + else + { + FileTruncate(file->files[i], offset, WAIT_EVENT_BUFFILE_READ); + newOffset = offset; + } The wait event passed here doesn't seem to be appropriate. You might want to introduce a new wait event WAIT_EVENT_BUFFILE_TRUNCATE. Also, the error handling for FileTruncate is missing. 7. + if ((i != fileno || offset == 0) && fileno != 0) + { + SharedSegmentName(segment_name, file->name, i); + SharedFileSetDelete(file->fileset, segment_name, true); + newFile--; + newOffset = MAX_PHYSICAL_FILESIZE; + } Similar to the previous comment, I think we should handle the failure of SharedFileSetDelete. 8. I think the comments related to BufFile shared API usage need to be expanded in the code to explain the new usage. For ex., see the below comments atop buffile.c * BufFile supports temporary files that can be made read-only and shared with * other backends, as infrastructure for parallel execution. Such files need * to be created as a member of a SharedFileSet that all participants are * attached to. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
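A sketch of how comments 3 and 6 above could be addressed (WAIT_EVENT_BUFFILE_TRUNCATE is the proposed new wait event, not an existing one, and the error message wording is only illustrative):

    /* Comment 3: FileSize() can fail, so check before using it as an offset. */
    newFile = file->numFiles - 1;
    newOffset = FileSize(file->files[newFile]);
    if (newOffset < 0)
        ereport(ERROR,
                (errcode_for_file_access(),
                 errmsg("could not determine size of temporary file \"%s\": %m",
                        FilePathName(file->files[newFile]))));

    /* Comment 6: dedicated wait event plus error handling for the truncate. */
    if (FileTruncate(file->files[i], offset, WAIT_EVENT_BUFFILE_TRUNCATE) < 0)
        ereport(ERROR,
                (errcode_for_file_access(),
                 errmsg("could not truncate file \"%s\": %m",
                        FilePathName(file->files[i]))));
    newOffset = offset;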
Re: PATCH: logical_work_mem and logical streaming of largein-progress transactions
From
Amit Kapila
Date:
On Mon, Jun 15, 2020 at 6:29 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > I have few more comments on the patch > 0013-Change-buffile-interface-required-for-streaming-.patch: > Review comments on 0014-Worker-tempfile-use-the-shared-buffile-infrastru: 1. The subxact file is only create if there + * are any suxact info under this xid. + */ +typedef struct StreamXidHash Lets slightly reword the part of the comment as "The subxact file is created iff there is any suxact info under this xid." 2. @@ -710,6 +740,9 @@ apply_handle_stream_stop(StringInfo s) subxact_info_write(MyLogicalRepWorker->subid, stream_xid); stream_close_file(); + /* Commit the per-stream transaction */ + CommitTransactionCommand(); Before calling commit, ensure that we are in a valid transaction. I think we can have an Assert for IsTransactionState(). 3. @@ -761,11 +791,13 @@ apply_handle_stream_abort(StringInfo s) int64 i; int64 subidx; - int fd; + BufFile *fd; bool found = false; char path[MAXPGPATH]; + StreamXidHash *ent; subidx = -1; + ensure_transaction(); subxact_info_read(MyLogicalRepWorker->subid, xid); Why to call ensure_transaction here? Is there any reason that we won't have a valid transaction by now? If not, then its better to have an Assert for IsTransactionState(). 4. - if (write(fd, &nsubxacts, sizeof(nsubxacts)) != sizeof(nsubxacts)) + if (BufFileWrite(fd, &nsubxacts, sizeof(nsubxacts)) != sizeof(nsubxacts)) { - int save_errno = errno; + int save_errno = errno; - CloseTransientFile(fd); + BufFileClose(fd); On error, won't these files be close automatically? If so, why at this place and before other errors, we need to close this? 5. if ((len > 0) && ((BufFileRead(fd, subxacts, len)) != len)) { int save_errno = errno; BufFileClose(fd); errno = save_errno; ereport(ERROR, (errcode_for_file_access(), errmsg("could not read file \"%s\": %m", Can we change the error message to "could not read from streaming transactions file .." or something like that and similarly we can change the message for failure in reading changes file? 6. if (BufFileWrite(fd, &nsubxacts, sizeof(nsubxacts)) != sizeof(nsubxacts)) { int save_errno = errno; BufFileClose(fd); errno = save_errno; ereport(ERROR, (errcode_for_file_access(), errmsg("could not write to file \"%s\": %m", Similar to previous, can we change it to "could not write to streaming transactions file 7. @@ -2855,17 +2844,32 @@ stream_open_file(Oid subid, TransactionId xid, bool first_segment) * for writing, in append mode. */ if (first_segment) - flags = (O_WRONLY | O_CREAT | O_EXCL | PG_BINARY); - else - flags = (O_WRONLY | O_APPEND | PG_BINARY); + { + /* + * Shared fileset handle must be allocated in the persistent context. + */ + SharedFileSet *fileset = + MemoryContextAlloc(ApplyContext, sizeof(SharedFileSet)); - stream_fd = OpenTransientFile(path, flags); + PrepareTempTablespaces(); + SharedFileSetInit(fileset, NULL); Why are we calling PrepareTempTablespaces here? It is already called in SharedFileSetInit. 8. + /* + * Start a transaction on stream start, this transaction will be committed + * on the stream stop. We need the transaction for handling the buffile, + * used for serializing the streaming data and subxact info. + */ + ensure_transaction(); I think we need this for PrepareTempTablespaces to set the temptablespaces. Also, isn't it required for a cleanup of buffile resources at the transaction end? Are there any other reasons for it as well? The comment should be a bit more clear for why we need a transaction here. 9. 
* Open a file for streamed changes from a toplevel transaction identified * by stream_xid (global variable). If it's the first chunk of streamed * changes for this transaction, perform cleanup by removing existing * files after a possible previous crash. .. stream_open_file(Oid subid, TransactionId xid, bool first_segment) The above part comment atop stream_open_file needs to be changed after new implementation. 10. * enabled. This context is reeset on each stream stop. */ LogicalStreamingContext = AllocSetContextCreate(ApplyContext, /reeset/reset 11. stream_cleanup_files(Oid subid, TransactionId xid, bool missing_ok) { .. + /* No entry created for this xid so simply return. */ + if (ent == NULL) + return; .. } Is there any reason or scenario where this ent can be NULL? If not, it will be better to have an Assert for the same. 12. subxact_info_write(Oid subid, TransactionId xid) { .. + /* + * If there is no subtransaction then nothing to do, but if already have + * subxact file then delete that. + */ + if (nsubxacts == 0) { - ereport(ERROR, - (errcode_for_file_access(), - errmsg("could not create file \"%s\": %m", - path))); + if (ent->subxact_fileset) + { + cleanup_subxact_info(); + BufFileDeleteShared(ent->subxact_fileset, path); + ent->subxact_fileset = NULL; .. } Here don't we need to free the subxact_fileset before setting it to NULL? 13. + /* + * Scan complete hash and delete the underlying files for the the xids. + * Also delete the memory for the shared file sets. + */ /the the/the. Instead of "delete the memory", it would be better to say "release the memory". 14. + /* + * We might not have created the suxact fileset if there is no sub + * transaction. + */ /suxact/subxact -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
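Two of the smaller points above (comments 2 and 5) in sketch form, with the message wording only as an example of the suggested direction:

    /* Comment 2: the per-stream transaction must still be open here. */
    Assert(IsTransactionState());
    CommitTransactionCommand();

    /* Comment 5: name the kind of file in the error message. */
    if (len > 0 && BufFileRead(fd, subxacts, len) != len)
        ereport(ERROR,
                (errcode_for_file_access(),
                 errmsg("could not read from streaming transaction's subxact file \"%s\": %m",
                        path)));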
Re: PATCH: logical_work_mem and logical streaming of largein-progress transactions
From
Dilip Kumar
Date:
On Tue, Jun 9, 2020 at 3:04 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Mon, Jun 8, 2020 at 11:53 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > I think one of the usages we still need is in ReorderBufferForget > > because it can be called when we skip processing the txn. See the > > comments in DecodeCommit where we call this function. If I am > > correct, we need to probably collect all invalidations in > > ReorderBufferTxn as we are collecting tuplecids and use them here. We > > can do the same during processing of XLOG_XACT_INVALIDATIONS. > > > > One more point related to this is that after this patch series, we > need to consider executing all invalidation during transaction abort. > Because it is possible that due to memory overflow, we have processed > some of the messages which also contain a few XACT_INVALIDATION > messages, so to avoid cache pollution, we need to execute all of them > in abort. We also do the similar thing in Rollback/Rollback To > Savepoint, see AtEOXact_Inval and AtEOSubXact_Inval. I have analyzed this further and I think there is a problem with that. If, instead of keeping each invalidation as an individual change, we combine them in ReorderBufferTxn's invalidation list, what happens when the (sub)transaction is aborted? Basically, in that case we will end up executing all those invalidations even though we never polluted the cache, because we never tried to stream the transaction. So this will affect the normal, non-streamed case: every time a transaction aborts, we would execute the invalidations it logged. One way out is to build the list at the sub-transaction level and, just before sending the transaction (on commit), combine all the (sub)transactions' invalidation lists. But since we already have the invalidations in the commit message, I don't think there is any point in adding this complexity. My main worry, though, is about streaming transactions; the problems are: - On the arrival of an individual invalidation, we cannot directly add it to the top-level transaction's invalidation list, because if the transaction later aborts before we stream (or we stream directly on commit), we end up with an unnecessarily long list of invalidations produced by aborted subtransactions. - If we keep collecting them in the individual subtransactions' ReorderBufferTxn->invalidations, then the question is when to merge them. I think it is a good idea to merge them all as soon as we try to stream, or on commit. So, since this solution of combining the (sub)transactions' invalidations is required for the streaming case anyway, we can use it as a common solution, whether the transaction is streamed due to memory overflow or sent at commit. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
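Purely as an illustration of the "collect per subtransaction, merge when streaming/committing" idea described above: the helper below is hypothetical; only the invalidations/ninvalidations fields and the subtxns list of ReorderBufferTXN are existing structure members. As the follow-up mails note, the patch ultimately went the simpler route of appending each XLOG_XACT_INVALIDATIONS record directly to the toplevel txn->invalidations.

    /*
     * Hypothetical helper: fold each subtransaction's invalidations into the
     * toplevel transaction just before it is streamed or replayed at commit.
     */
    static void
    ReorderBufferMergeInvalidations(ReorderBuffer *rb, ReorderBufferTXN *txn)
    {
        dlist_iter  iter;

        dlist_foreach(iter, &txn->subtxns)
        {
            ReorderBufferTXN *subtxn = dlist_container(ReorderBufferTXN, node,
                                                       iter.cur);

            if (subtxn->ninvalidations == 0)
                continue;

            if (txn->invalidations == NULL)
                txn->invalidations = (SharedInvalidationMessage *)
                    MemoryContextAlloc(rb->context,
                                       sizeof(SharedInvalidationMessage) *
                                       subtxn->ninvalidations);
            else
                txn->invalidations = (SharedInvalidationMessage *)
                    repalloc(txn->invalidations,
                             sizeof(SharedInvalidationMessage) *
                             (txn->ninvalidations + subtxn->ninvalidations));

            memcpy(txn->invalidations + txn->ninvalidations,
                   subtxn->invalidations,
                   sizeof(SharedInvalidationMessage) * subtxn->ninvalidations);
            txn->ninvalidations += subtxn->ninvalidations;
        }
    }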
Re: PATCH: logical_work_mem and logical streaming of largein-progress transactions
From
Amit Kapila
Date:
On Tue, Jun 16, 2020 at 7:49 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Tue, Jun 9, 2020 at 3:04 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > On Mon, Jun 8, 2020 at 11:53 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > I think one of the usages we still need is in ReorderBufferForget > > > because it can be called when we skip processing the txn. See the > > > comments in DecodeCommit where we call this function. If I am > > > correct, we need to probably collect all invalidations in > > > ReorderBufferTxn as we are collecting tuplecids and use them here. We > > > can do the same during processing of XLOG_XACT_INVALIDATIONS. > > > > > > > One more point related to this is that after this patch series, we > > need to consider executing all invalidation during transaction abort. > > Because it is possible that due to memory overflow, we have processed > > some of the messages which also contain a few XACT_INVALIDATION > > messages, so to avoid cache pollution, we need to execute all of them > > in abort. We also do the similar thing in Rollback/Rollback To > > Savepoint, see AtEOXact_Inval and AtEOSubXact_Inval. > > I have analyzed this further and I think there is some problem with > that. If Instead of keeping the invalidation as an individual change, > if we try to combine them in ReorderBufferTxn's invalidation then what > happens if the (sub)transaction is aborted. Basically, in this case, > we will end up executing all those invalidations for those we never > polluted the cache if we never try to stream it. So this will affect > the normal case where we haven't streamed the transaction because > every time we have executed the invalidation logged by transaction > those are aborted. One way is we develop the list at the > sub-transaction level and just before sending the transaction (on > commit) combine all the (sub) transaction's invalidation list. But, > I think since we already have the invalidation in the commit message > then there is no point in adding this complexity. > But, my main worry is about the streaming transaction, the problems are > - Immediately on the arrival of individual invalidation, we can not > directly add to the top-level transaction's invalidation list because > later if the transaction aborted before we stream (or we directly > stream on commit) then we will get an unnecessarily long list of > invalidation which is done by aborted subtransaction. > Is there any problem you see with this or you are concerned with the efficiency? Please note, we already do something similar in ReorderBufferForget and if your concern is efficiency then that applies to existing cases as well. I think if we want we can improve it later in many ways and one of them you have already suggested, at this time, the main thing is correctness and also aborts are not frequent enough to worry too much about their performance. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: PATCH: logical_work_mem and logical streaming of largein-progress transactions
From
Dilip Kumar
Date:
On Wed, Jun 17, 2020 at 9:33 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Tue, Jun 16, 2020 at 7:49 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > On Tue, Jun 9, 2020 at 3:04 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > On Mon, Jun 8, 2020 at 11:53 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > > I think one of the usages we still need is in ReorderBufferForget > > > > because it can be called when we skip processing the txn. See the > > > > comments in DecodeCommit where we call this function. If I am > > > > correct, we need to probably collect all invalidations in > > > > ReorderBufferTxn as we are collecting tuplecids and use them here. We > > > > can do the same during processing of XLOG_XACT_INVALIDATIONS. > > > > > > > > > > One more point related to this is that after this patch series, we > > > need to consider executing all invalidation during transaction abort. > > > Because it is possible that due to memory overflow, we have processed > > > some of the messages which also contain a few XACT_INVALIDATION > > > messages, so to avoid cache pollution, we need to execute all of them > > > in abort. We also do the similar thing in Rollback/Rollback To > > > Savepoint, see AtEOXact_Inval and AtEOSubXact_Inval. > > > > I have analyzed this further and I think there is some problem with > > that. If Instead of keeping the invalidation as an individual change, > > if we try to combine them in ReorderBufferTxn's invalidation then what > > happens if the (sub)transaction is aborted. Basically, in this case, > > we will end up executing all those invalidations for those we never > > polluted the cache if we never try to stream it. So this will affect > > the normal case where we haven't streamed the transaction because > > every time we have executed the invalidation logged by transaction > > those are aborted. One way is we develop the list at the > > sub-transaction level and just before sending the transaction (on > > commit) combine all the (sub) transaction's invalidation list. But, > > I think since we already have the invalidation in the commit message > > then there is no point in adding this complexity. > > But, my main worry is about the streaming transaction, the problems are > > - Immediately on the arrival of individual invalidation, we can not > > directly add to the top-level transaction's invalidation list because > > later if the transaction aborted before we stream (or we directly > > stream on commit) then we will get an unnecessarily long list of > > invalidation which is done by aborted subtransaction. > > > > Is there any problem you see with this or you are concerned with the > efficiency? Please note, we already do something similar in > ReorderBufferForget and if your concern is efficiency then that > applies to existing cases as well. I think if we want we can improve > it later in many ways and one of them you have already suggested, at > this time, the main thing is correctness and also aborts are not > frequent enough to worry too much about their performance. As of now, I am not seeing the problem, I was just concerned about processing more invalidation messages in the aborted cases compared to current code, even if the streaming is off/ or transaction never streamed as memory size is not crossed. But, I agree that it is only in the case of the abort, so I will work on this and later maybe we can test the performance. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: PATCH: logical_work_mem and logical streaming of largein-progress transactions
From
Amit Kapila
Date:
On Tue, Jun 16, 2020 at 2:36 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Mon, Jun 15, 2020 at 6:29 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > I have few more comments on the patch > > 0013-Change-buffile-interface-required-for-streaming-.patch: > > > > Review comments on 0014-Worker-tempfile-use-the-shared-buffile-infrastru: > changes_filename(char *path, Oid subid, TransactionId xid) { - char tempdirpath[MAXPGPATH]; - - TempTablespacePath(tempdirpath, DEFAULTTABLESPACE_OID); - - /* - * We might need to create the tablespace's tempfile directory, if no - * one has yet done so. - */ - if ((MakePGDirectory(tempdirpath) < 0) && errno != EEXIST) - ereport(ERROR, - (errcode_for_file_access(), - errmsg("could not create directory \"%s\": %m", - tempdirpath))); - - snprintf(path, MAXPGPATH, "%s/%s%d-%u-%u.changes", - tempdirpath, PG_TEMP_FILE_PREFIX, MyProcPid, subid, xid); + snprintf(path, MAXPGPATH, "%u-%u.changes", subid, xid); Today, I was studying this change and its impact. Initially, I thought that because the patch has removed pgsql_tmp prefix from the filename, it might create problems if the temporary files remain on the disk after the crash. Now as the patch has started using BufFile interface, it seems to be internally taking care of the same by generating names like "base/pgsql_tmp/pgsql_tmp13774.0.sharedfileset/16393-513.changes.0". Basically, it ensures to create the file in the directory starting with pgsql_tmp. I have tried by crashing the server in a situation where the temp files remain and after the restart, they are removed. So, it seems okay to generate file names like that but I still suggest testing other paths like backup where we ignore files whose names start with PG_TEMP_FILE_PREFIX. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: PATCH: logical_work_mem and logical streaming of largein-progress transactions
From
Dilip Kumar
Date:
On Mon, Jun 8, 2020 at 11:53 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Sun, Jun 7, 2020 at 5:08 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > On Fri, Jun 5, 2020 at 11:37 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > > > Let me know what you think of the changes? If you find them okay, > > > then feel to include them in the next patch-set. > > > > > > [1] - https://www.postgresql.org/message-id/CAONYFtOv%2BEr1p3WAuwUsy1zsCFrSYvpHLhapC_fMD-zNaRWxYg%40mail.gmail.com > > > > Thanks for the patch, I will review it and include it in my next version. I have merged your changes 0002 in this version. > Okay, I have done review of > 0002-Issue-individual-invalidations-with-wal_level-lo.patch and below > are my comments: > > 1. I don't think it is a good idea that logical decoding process the > new XLOG_XACT_INVALIDATIONS and existing WAL records for invalidations > like XLOG_INVALIDATIONS and what we do in DecodeCommit (see code in > the check "if (parsed->nmsgs > 0)"). I think if that is required for > some particular reason then we should write detailed comments about > the same. I have tried some experiments to see if those are really > required: > a. After applying patch 0002, I have tried by commenting out the > processing of invalidations via DecodeCommit and found some regression > tests were failing but the reason for failure was that we are not > setting RBTXN_HAS_CATALOG_CHANGES for the toptxn when subtxn has > catalog changes and when I did that all regression tests started > passing. See the attached diff patch > (v27-0003-Incremental-patch-for-0002-to-test-removal-of-du) atop 0002 > patch. > b. The processing of invalidations for XLOG_INVALIDATIONS is added by > commit c6ff84b06a for xid-less transactions. See > https://postgr.es/m/CAB-SwXY6oH=9twBkXJtgR4UC1NqT-vpYAtxCseME62ADwyK5OA@mail.gmail.com > to know why that has been added. Now, after this patch we will > process the same invalidations via XLOG_XACT_INVALIDATIONS and > XLOG_INVALIDATIONS which doesn't seem warranted. Also, the below > assertion will fail for xid-less transactions (try create index > concurrently statement): > + case XLOG_XACT_INVALIDATIONS: > + { > + TransactionId xid; > + xl_xact_invalidations *invals; > + > + xid = XLogRecGetXid(r); > + invals = (xl_xact_invalidations *) XLogRecGetData(r); > + > + Assert(TransactionIdIsValid(xid)); > > I feel we don't need the processing of XLOG_INVALIDATIONS in logical > decoding after this patch but to prove that first we need to write a > test case which need XLOG_INVALIDATIONS in the HEAD as commit > c6ff84b06a doesn't add one. I think we need two code paths in > XLOG_XACT_INVALIDATIONS where if it is for xid-less transactions, then > execute actions immediately as we are doing in processing of > XLOG_INVALIDATIONS, otherwise, do what we are doing currently in the > patch. If the above point (b) is correct, I am not sure if it is a > good idea to use RM_XACT_ID as resource manager if for this WAL in > LogLogicalInvalidations, what do you think? > > I think one of the usages we still need is in ReorderBufferForget > because it can be called when we skip processing the txn. See the > comments in DecodeCommit where we call this function. If I am > correct, we need to probably collect all invalidations in > ReorderBufferTxn as we are collecting tuplecids and use them here. We > can do the same during processing of XLOG_XACT_INVALIDATIONS. 
> > I had also thought a bit about removing logging of invalidations at > commit time altogether but it seems processing hot-standby is somewhat > tightly coupled with existing WAL logging. See xact_redo_commit (a > comment atop call to ProcessCommittedInvalidationMessages). It says > we need to maintain the order when we process invalidations. If we > can later find a way to avoid that we can probably remove it but for > now maybe we can live with it. Yes, I have made the changes. Basically, now I am only using the XLOG_XACT_INVALIDATIONS for generating all the invalidation messages. So whenever we are getting the new set of XLOG_XACT_INVALIDATIONS, we are directly appending it to the txn->invalidations. I have tested the XLOG_INVALIDATIONS part but while sending this mail I realized that we could write some automated test for the same. I will work on that soon. > 2. > + /* not expected, but print something anyway */ > + else if (msg->id == SHAREDINVALSMGR_ID) > + appendStringInfoString(buf, " smgr"); > + /* not expected, but print something anyway */ > + else if (msg->id == SHAREDINVALRELMAP_ID) > > I think the above comment is not valid after we started logging at CCI. Yup, fixed. > 3. > + > + xid = XLogRecGetXid(r); > + invals = (xl_xact_invalidations *) XLogRecGetData(r); > + > + Assert(TransactionIdIsValid(xid)); > + ReorderBufferAddInvalidation(reorder, xid, buf->origptr, > + invals->nmsgs, invals->msgs); > > Here, it should check !ctx->forward as we do in DecodeCommit, do we > have any reason for not doing so. We can test once by changing this. Yeah, it should have this check. Mostly it contains changes in 0002, apart from that we needed some changes in 0005,0006 to rebase on 0002 and also there is one bug fix in 0005, basically the txn->snapshot_now was not getting set to NULL after freeing so it was getting double free. I have also removed the extra wait even from the 0014 as BufFile is already logging the wait event internally and also some changes because BufFileWrite interface is changed in recent commits. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Attachment
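For readers following along, the decode-side handling being described boils down to roughly the following case in DecodeXactOp(). The function and struct names follow the snippets quoted earlier in this thread (they may differ in the final patch), and the skip check is written here using the existing ctx->fast_forward field that DecodeCommit already consults.

    case XLOG_XACT_INVALIDATIONS:
        {
            TransactionId xid;
            xl_xact_invalidations *invals;

            xid = XLogRecGetXid(r);
            invals = (xl_xact_invalidations *) XLogRecGetData(r);

            /*
             * Accumulate the messages in txn->invalidations so they can be
             * executed while streaming, at commit, and on abort.
             */
            if (!ctx->fast_forward)
                ReorderBufferAddInvalidation(reorder, xid, buf->origptr,
                                             invals->nmsgs, invals->msgs);
        }
        break;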
Re: PATCH: logical_work_mem and logical streaming of largein-progress transactions
From
Dilip Kumar
Date:
On Tue, Jun 9, 2020 at 3:04 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Mon, Jun 8, 2020 at 11:53 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > I think one of the usages we still need is in ReorderBufferForget > > because it can be called when we skip processing the txn. See the > > comments in DecodeCommit where we call this function. If I am > > correct, we need to probably collect all invalidations in > > ReorderBufferTxn as we are collecting tuplecids and use them here. We > > can do the same during processing of XLOG_XACT_INVALIDATIONS. > > > > One more point related to this is that after this patch series, we > need to consider executing all invalidation during transaction abort. > Because it is possible that due to memory overflow, we have processed > some of the messages which also contain a few XACT_INVALIDATION > messages, so to avoid cache pollution, we need to execute all of them > in abort. We also do the similar thing in Rollback/Rollback To > Savepoint, see AtEOXact_Inval and AtEOSubXact_Inval. Yes, we need to do that, So now we are collecting all the invalidation under txn->invalidation so they are getting executed on abort. > > Few other comments on > 0002-Issue-individual-invalidations-with-wal_level-lo.patch > --------------------------------------------------------------------------------------------------------------- > 1. > + if (transInvalInfo->CurrentCmdInvalidMsgs.cclist) > + { > + ProcessInvalidationMessagesMulti(&transInvalInfo->CurrentCmdInvalidMsgs, > + MakeSharedInvalidMessagesArray); > + invalMessages = SharedInvalidMessagesArray; > + nmsgs = numSharedInvalidMessagesArray; > + SharedInvalidMessagesArray = NULL; > + numSharedInvalidMessagesArray = 0; > > a. Immediately after ProcessInvalidationMessagesMulti, isn't it better > to have an Assertion like Assert(!(numSharedInvalidMessagesArray > 0 > && SharedInvalidMessagesArray == NULL));? Done > b. Why check "if (transInvalInfo->CurrentCmdInvalidMsgs.cclist)" is > required? If you see xactGetCommittedInvalidationMessages where we do > something similar, we only check for valid value of transInvalInfo and > here we check the same in the caller of LogLogicalInvalidations, isn't > that sufficient? If that is sufficient, we can either have the same > check here or have an Assert for the same. I have put the same check here. > > 2. > @@ -1092,6 +1101,9 @@ CommandEndInvalidationMessages(void) > if (transInvalInfo == NULL) > return; > > + if (XLogLogicalInfoActive()) > + LogLogicalInvalidations(); > + > ProcessInvalidationMessages(&transInvalInfo->CurrentCmdInvalidMsgs, > LocalExecuteInvalidationMessage); > Generally, we WAL log the action after performing it but here you are > writing WAL first. Is there any specific reason? If so, can we write > a comment about the same? Yeah, there is no reason for the same so moved it down. > > 3. > + * When wal_level=logical, write invalidations into WAL at each command end to > + * support the decoding of the in-progress transaction. As of now it was > + * enough to log invalidation only at commit because we are only decoding the > + * transaction at the commit time. We only need to log the catalog cache and > + * relcache invalidation. There can not be any active MVCC scan in logical > + * decoding so we don't need to log the snapshot invalidation. > > I think this comment doesn't hold good after we have changed the patch > to LOG invalidations at the time of CCI. Right, modified. > > 4. > + > +/* > + * Emit WAL for invalidations. 
> + */ > +static void > +LogLogicalInvalidations() > > Add the function name atop of this function in comments to match the > style with other nearby functions. How about modifying it to > something like: "Emit WAL for invalidations. This is currently only > used for logging invalidations at the command end." Done > > 5. > + * > + * XXX Do we need to care about relcacheInitFileInval and > + * the other fields added to ReorderBufferChange, or just > + * about the message itself? > + */ > > I don't think we need to do anything about relcacheInitFileInval. > This is used to remove the stale files (RELCACHE_INIT_FILENAME) that > have obsolete information about relcache. The walsender process that > is doing decoding doesn't require us to do anything about this. Also, > if you see before this patch, we don't do anything about relcache > files during decoding of invalidation messages. In short, I think we > can remove this comment unless you see some use of it. Now, we have removed the Invalidation change itself so this comment is gone. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
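The ordering change agreed in comment 2 amounts to the sketch below, laid over the existing CommandEndInvalidationMessages(); LogLogicalInvalidations() is the new function added by the patch, and it must run before the current command's messages are appended to the prior-commands list, since it reads CurrentCmdInvalidMsgs.

    void
    CommandEndInvalidationMessages(void)
    {
        /* Quietly return if there is no invalidation state to work on. */
        if (transInvalInfo == NULL)
            return;

        /* First apply the current command's invalidations locally ... */
        ProcessInvalidationMessages(&transInvalInfo->CurrentCmdInvalidMsgs,
                                    LocalExecuteInvalidationMessage);

        /* ... then WAL-log them for logical decoding of in-progress xacts. */
        if (XLogLogicalInfoActive())
            LogLogicalInvalidations();

        AppendInvalidationMessages(&transInvalInfo->PriorCmdInvalidMsgs,
                                   &transInvalInfo->CurrentCmdInvalidMsgs);
    }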
Re: PATCH: logical_work_mem and logical streaming of largein-progress transactions
From
Dilip Kumar
Date:
On Mon, Jun 15, 2020 at 6:30 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Mon, Jun 15, 2020 at 9:12 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > On Fri, Jun 12, 2020 at 4:35 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > > > Basically, this part is still > > > > I have to work upon, once we get the consensus then I can remove those > > > > extra wait event from the patch. > > > > > > > > > > Okay, feel free to send an updated patch with the above change. > > > > Sure, I will do that in the next patch set. > > > > I have few more comments on the patch > 0013-Change-buffile-interface-required-for-streaming-.patch: > > 1. > - * temp_file_limit of the caller, are read-only and are automatically closed > - * at the end of the transaction but are not deleted on close. > + * temp_file_limit of the caller, are read-only if the flag is set and are > + * automatically closed at the end of the transaction but are not deleted on > + * close. > */ > File > -PathNameOpenTemporaryFile(const char *path) > +PathNameOpenTemporaryFile(const char *path, int mode) > > No need to say "are read-only if the flag is set". I don't see any > flag passed to function so that part of the comment doesn't seem > appropriate. Done > 2. > @@ -68,7 +68,8 @@ SharedFileSetInit(SharedFileSet *fileset, dsm_segment *seg) > } > > /* Register our cleanup callback. */ > - on_dsm_detach(seg, SharedFileSetOnDetach, PointerGetDatum(fileset)); > + if (seg) > + on_dsm_detach(seg, SharedFileSetOnDetach, PointerGetDatum(fileset)); > } > > Add comments atop function to explain when we don't want to register > the dsm detach stuff? Done, I am planning to work on more cleaner function for on_proc_exit as we discussed offlist. I will work on this in the next version. > 3. > + */ > + newFile = file->numFiles - 1; > + newOffset = FileSize(file->files[file->numFiles - 1]); > break; > > FileSize can return negative lengths to indicate failure which we > should handle. Done See other places in the code where FileSize is used? > But I have another question here which is why we need to implement > SEEK_END? How other usages of BufFile interface takes care of this? > I see an API BufFileTell which can give the current read/write > location in the file, isn't that sufficient for your usage? Also, how > before BufFile usage is this thing handled in the patch? So far we never supported to open the file in write mode, only we create in write mode. So if we have created the file and its open we can always use BufFileTell, which will tell the current end location of the file. But, once we close and open again it always set to read from the start of the file as per the current use case. We need a way to jump to the end of the last file for appending it. > 4. > + /* Loop over all the files upto the fileno which we want to truncate. */ > + for (i = file->numFiles - 1; i >= fileno; i--) > > "the files", extra space in the above part of the comment. Fixed > 5. > + /* > + * Except the fileno, we can directly delete other files. > > Before 'we', there is extra space. Done. > 6. > + else > + { > + FileTruncate(file->files[i], offset, WAIT_EVENT_BUFFILE_READ); > + newOffset = offset; > + } > > The wait event passed here doesn't seem to be appropriate. You might > want to introduce a new wait event WAIT_EVENT_BUFFILE_TRUNCATE. Also, > the error handling for FileTruncate is missing. Done > 7. 
> + if ((i != fileno || offset == 0) && fileno != 0) > + { > + SharedSegmentName(segment_name, file->name, i); > + SharedFileSetDelete(file->fileset, segment_name, true); > + newFile--; > + newOffset = MAX_PHYSICAL_FILESIZE; > + } > > Similar to the previous comment, I think we should handle the failure > of SharedFileSetDelete. > > 8. I think the comments related to BufFile shared API usage need to be > expanded in the code to explain the new usage. For ex., see the below > comments atop buffile.c > * BufFile supports temporary files that can be made read-only and shared with > * other backends, as infrastructure for parallel execution. Such files need > * to be created as a member of a SharedFileSet that all participants are > * attached to. Other fixes (offlist raised by my colleague Neha Sharma) 1. In BufFileTruncateShared, the files were not closed before deleting. (in 0013) 2. In apply_handle_stream_commit, the file name in debug message was printed before populating the name (0014) 3. On concurrent abort we are truncating all the changes including some incomplete changes, so later when we get the complete changes we don't have the previous changes, e.g, if we had specinsert in the last stream and due to concurrent abort detection if we delete that changes later we will get spec_confirm without spec insert. We could have simply avoided deleting all the changes, but I think the better fix is once we detect the concurrent abort for any transaction, then why do we need to collect the changes for that, we can simply avoid that. So I have put that fix. (0006) -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Attachment
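The SEEK_END point above comes down to something like this when the worker reopens the changes file for a later chunk of the same transaction. BufFileOpenShared() taking a mode and BufFileSeek() accepting SEEK_END are among the interface additions being discussed here (not the pre-existing API), and the error wording is only illustrative.

    /*
     * Reopen this transaction's changes file and position at its end so the
     * new chunk of streamed changes is appended.
     */
    stream_fd = BufFileOpenShared(ent->stream_fileset, path, O_RDWR);

    if (BufFileSeek(stream_fd, 0, 0, SEEK_END) != 0)
        ereport(ERROR,
                (errcode_for_file_access(),
                 errmsg("could not seek to end of streamed changes file \"%s\": %m",
                        path)));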
Re: PATCH: logical_work_mem and logical streaming of largein-progress transactions
From
Dilip Kumar
Date:
On Tue, Jun 16, 2020 at 2:37 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Mon, Jun 15, 2020 at 6:29 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > I have few more comments on the patch > > 0013-Change-buffile-interface-required-for-streaming-.patch: > > > > Review comments on 0014-Worker-tempfile-use-the-shared-buffile-infrastru: > 1. > The subxact file is only create if there > + * are any suxact info under this xid. > + */ > +typedef struct StreamXidHash > > Lets slightly reword the part of the comment as "The subxact file is > created iff there is any suxact info under this xid." Done > > 2. > @@ -710,6 +740,9 @@ apply_handle_stream_stop(StringInfo s) > subxact_info_write(MyLogicalRepWorker->subid, stream_xid); > stream_close_file(); > > + /* Commit the per-stream transaction */ > + CommitTransactionCommand(); > > Before calling commit, ensure that we are in a valid transaction. I > think we can have an Assert for IsTransactionState(). Done > 3. > @@ -761,11 +791,13 @@ apply_handle_stream_abort(StringInfo s) > > int64 i; > int64 subidx; > - int fd; > + BufFile *fd; > bool found = false; > char path[MAXPGPATH]; > + StreamXidHash *ent; > > subidx = -1; > + ensure_transaction(); > subxact_info_read(MyLogicalRepWorker->subid, xid); > > Why to call ensure_transaction here? Is there any reason that we > won't have a valid transaction by now? If not, then its better to > have an Assert for IsTransactionState(). We are only starting transaction from stream_start to stream_stop, so at stream_abort we will not have the transaction. > 4. > - if (write(fd, &nsubxacts, sizeof(nsubxacts)) != sizeof(nsubxacts)) > + if (BufFileWrite(fd, &nsubxacts, sizeof(nsubxacts)) != sizeof(nsubxacts)) > { > - int save_errno = errno; > + int save_errno = errno; > > - CloseTransientFile(fd); > + BufFileClose(fd); > > On error, won't these files be close automatically? If so, why at > this place and before other errors, we need to close this? Yes, that's correct. I have fixed those. > 5. > if ((len > 0) && ((BufFileRead(fd, subxacts, len)) != len)) > { > int save_errno = errno; > > BufFileClose(fd); > errno = save_errno; > ereport(ERROR, > (errcode_for_file_access(), > errmsg("could not read file \"%s\": %m", > > Can we change the error message to "could not read from streaming > transactions file .." or something like that and similarly we can > change the message for failure in reading changes file? Done > 6. > if (BufFileWrite(fd, &nsubxacts, sizeof(nsubxacts)) != sizeof(nsubxacts)) > { > int save_errno = errno; > > BufFileClose(fd); > errno = save_errno; > ereport(ERROR, > (errcode_for_file_access(), > errmsg("could not write to file \"%s\": %m", > > Similar to previous, can we change it to "could not write to streaming > transactions file BufFileWrite is not returning failure anymore. > 7. > @@ -2855,17 +2844,32 @@ stream_open_file(Oid subid, TransactionId xid, > bool first_segment) > * for writing, in append mode. > */ > if (first_segment) > - flags = (O_WRONLY | O_CREAT | O_EXCL | PG_BINARY); > - else > - flags = (O_WRONLY | O_APPEND | PG_BINARY); > + { > + /* > + * Shared fileset handle must be allocated in the persistent context. > + */ > + SharedFileSet *fileset = > + MemoryContextAlloc(ApplyContext, sizeof(SharedFileSet)); > > - stream_fd = OpenTransientFile(path, flags); > + PrepareTempTablespaces(); > + SharedFileSetInit(fileset, NULL); > > Why are we calling PrepareTempTablespaces here? It is already called > in SharedFileSetInit. 
My bad, First I tired using SharedFileSetInit but later it got changed for forgot to remove this part. > 8. > + /* > + * Start a transaction on stream start, this transaction will be committed > + * on the stream stop. We need the transaction for handling the buffile, > + * used for serializing the streaming data and subxact info. > + */ > + ensure_transaction(); > > I think we need this for PrepareTempTablespaces to set the > temptablespaces. Also, isn't it required for a cleanup of buffile > resources at the transaction end? Are there any other reasons for it > as well? The comment should be a bit more clear for why we need a > transaction here. I am not sure that will it make sense to add a comment here that why buffile and sharedfileset need a transaction? Do you think that we should add comment in buffile/shared fileset API that it should be called under a transaction? > 9. > * Open a file for streamed changes from a toplevel transaction identified > * by stream_xid (global variable). If it's the first chunk of streamed > * changes for this transaction, perform cleanup by removing existing > * files after a possible previous crash. > .. > stream_open_file(Oid subid, TransactionId xid, bool first_segment) > > The above part comment atop stream_open_file needs to be changed after > new implementation. Done > 10. > * enabled. This context is reeset on each stream stop. > */ > LogicalStreamingContext = AllocSetContextCreate(ApplyContext, > > /reeset/reset Done > 11. > stream_cleanup_files(Oid subid, TransactionId xid, bool missing_ok) > { > .. > + /* No entry created for this xid so simply return. */ > + if (ent == NULL) > + return; > .. > } > > Is there any reason or scenario where this ent can be NULL? If not, > it will be better to have an Assert for the same. Right, it should be an assert, even if all the changes are ignored for the top transaction, we should have sent the stream_start. > 12. > subxact_info_write(Oid subid, TransactionId xid) > { > .. > + /* > + * If there is no subtransaction then nothing to do, but if already have > + * subxact file then delete that. > + */ > + if (nsubxacts == 0) > { > - ereport(ERROR, > - (errcode_for_file_access(), > - errmsg("could not create file \"%s\": %m", > - path))); > + if (ent->subxact_fileset) > + { > + cleanup_subxact_info(); > + BufFileDeleteShared(ent->subxact_fileset, path); > + ent->subxact_fileset = NULL; > .. > } > > Here don't we need to free the subxact_fileset before setting it to NULL? Yes, done > 13. > + /* > + * Scan complete hash and delete the underlying files for the the xids. > + * Also delete the memory for the shared file sets. > + */ > > /the the/the. Instead of "delete the memory", it would be better to > say "release the memory". Done > > 14. > + /* > + * We might not have created the suxact fileset if there is no sub > + * transaction. > + */ > > /suxact/subxact Done -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
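For reference, the early-exit branch of subxact_info_write() after comment 12 (freeing the fileset before clearing the pointer) ends up looking roughly like the snippet quoted in the review, plus the agreed pfree:

    /*
     * Nothing to write if there are no subtransactions; if an earlier stream
     * already created a subxact file, get rid of it and release the fileset.
     */
    if (nsubxacts == 0)
    {
        if (ent->subxact_fileset)
        {
            cleanup_subxact_info();
            BufFileDeleteShared(ent->subxact_fileset, path);
            pfree(ent->subxact_fileset);
            ent->subxact_fileset = NULL;
        }
        return;
    }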
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Amit Kapila
Date:
On Thu, Jun 18, 2020 at 9:02 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > Yes, I have made the changes. Basically, now I am only using the > XLOG_XACT_INVALIDATIONS for generating all the invalidation messages. > So whenever we are getting the new set of XLOG_XACT_INVALIDATIONS, we > are directly appending it to the txn->invalidations. I have tested > the XLOG_INVALIDATIONS part but while sending this mail I realized > that we could write some automated test for the same. > Can you share how you have tested it? > I will work on > that soon. > Cool, I think having a regression test for this will be a good idea. @@ -2012,8 +2014,6 @@ ReorderBufferForget(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn) if (txn->base_snapshot != NULL && txn->ninvalidations > 0) ReorderBufferImmediateInvalidation(rb, txn->ninvalidations, txn->invalidations); - else - Assert(txn->ninvalidations == 0); Why this Assert is removed? Apart from above, I have made a number of changes in 0002-WAL-Log-invalidations-at-command-end-with-wal_le to remove some unnecessary changes, edited comments, ran pgindent and updated the commit message. If you are fine with these changes, then do include them in your next version. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Attachment
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Amit Kapila
Date:
On Mon, Jun 22, 2020 at 4:26 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Thu, Jun 18, 2020 at 9:02 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > Yes, I have made the changes. Basically, now I am only using the > > XLOG_XACT_INVALIDATIONS for generating all the invalidation messages. > > So whenever we are getting the new set of XLOG_XACT_INVALIDATIONS, we > > are directly appending it to the txn->invalidations. I have tested > > the XLOG_INVALIDATIONS part but while sending this mail I realized > > that we could write some automated test for the same. > > > > Can you share how you have tested it? > > > I will work on > > that soon. > > > > Cool, I think having a regression test for this will be a good idea. > Other than above tests, can we somehow verify that the invalidations generated at commit time are the same as what we do with this patch? We have verified with individual commands but it would be great if we can verify for the regression tests. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Dilip Kumar
Date:
On Mon, Jun 22, 2020 at 4:26 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Thu, Jun 18, 2020 at 9:02 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > Yes, I have made the changes. Basically, now I am only using the > > XLOG_XACT_INVALIDATIONS for generating all the invalidation messages. > > So whenever we are getting the new set of XLOG_XACT_INVALIDATIONS, we > > are directly appending it to the txn->invalidations. I have tested > > the XLOG_INVALIDATIONS part but while sending this mail I realized > > that we could write some automated test for the same. > > > > Can you share how you have tested it? I just ran create index concurrently and decoded the changes. > > I will work on > > that soon. > > > > Cool, I think having a regression test for this will be a good idea. ok > @@ -2012,8 +2014,6 @@ ReorderBufferForget(ReorderBuffer *rb, > TransactionId xid, XLogRecPtr lsn) > if (txn->base_snapshot != NULL && txn->ninvalidations > 0) > ReorderBufferImmediateInvalidation(rb, txn->ninvalidations, > txn->invalidations); > - else > - Assert(txn->ninvalidations == 0); > > Why this Assert is removed? Even if the base_snapshot is NULL, now we are collecting the txn->invalidation. However, we haven't done any activity for that transaction so we don't need to execute the invalidations same as the code before, but assert is no more valid. > Apart from above, I have made a number of changes in > 0002-WAL-Log-invalidations-at-command-end-with-wal_le to remove some > unnecessary changes, edited comments, ran pgindent and updated the > commit message. If you are fine with these changes, then do include > them in your next version. Thanks, I will check those. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Amit Kapila
Date:
On Mon, Jun 22, 2020 at 4:41 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Mon, Jun 22, 2020 at 4:26 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > On Thu, Jun 18, 2020 at 9:02 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > > > Yes, I have made the changes. Basically, now I am only using the > > > XLOG_XACT_INVALIDATIONS for generating all the invalidation messages. > > > So whenever we are getting the new set of XLOG_XACT_INVALIDATIONS, we > > > are directly appending it to the txn->invalidations. I have tested > > > the XLOG_INVALIDATIONS part but while sending this mail I realized > > > that we could write some automated test for the same. > > > > > > > Can you share how you have tested it? > > I just ran create index concurrently and decoded the changes. > Hmm, I think that won't reproduce the exact problem. What I wanted was to run another command after "create index concurrently" which depends on that and see if the decoding fails by removing the XLOG_INVALIDATIONS code. Once you get some failure, you can apply the 0002 patch and see if the test is passed? > > > @@ -2012,8 +2014,6 @@ ReorderBufferForget(ReorderBuffer *rb, > > TransactionId xid, XLogRecPtr lsn) > > if (txn->base_snapshot != NULL && txn->ninvalidations > 0) > > ReorderBufferImmediateInvalidation(rb, txn->ninvalidations, > > txn->invalidations); > > - else > > - Assert(txn->ninvalidations == 0); > > > > Why this Assert is removed? > > Even if the base_snapshot is NULL, now we are collecting the > txn->invalidation. > But there doesn't seem to be any check even before this patch which directly prohibits accumulating invalidations in DecodeCommit. We have check for base_snapshot in ReorderBufferCommit. Did you get any failure with that check? -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Dilip Kumar
Date:
On Mon, Jun 22, 2020 at 5:26 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Mon, Jun 22, 2020 at 4:41 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > On Mon, Jun 22, 2020 at 4:26 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > On Thu, Jun 18, 2020 at 9:02 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > > > > > Yes, I have made the changes. Basically, now I am only using the > > > > XLOG_XACT_INVALIDATIONS for generating all the invalidation messages. > > > > So whenever we are getting the new set of XLOG_XACT_INVALIDATIONS, we > > > > are directly appending it to the txn->invalidations. I have tested > > > > the XLOG_INVALIDATIONS part but while sending this mail I realized > > > > that we could write some automated test for the same. > > > > > > > > > > Can you share how you have tested it? > > > > I just ran create index concurrently and decoded the changes. > > > > Hmm, I think that won't reproduce the exact problem. What I wanted > was to run another command after "create index concurrently" which > depends on that and see if the decoding fails by removing the > XLOG_INVALIDATIONS code. Once you get some failure, you can apply the > 0002 patch and see if the test is passed? Okay, I will test that. > > > > > @@ -2012,8 +2014,6 @@ ReorderBufferForget(ReorderBuffer *rb, > > > TransactionId xid, XLogRecPtr lsn) > > > if (txn->base_snapshot != NULL && txn->ninvalidations > 0) > > > ReorderBufferImmediateInvalidation(rb, txn->ninvalidations, > > > txn->invalidations); > > > - else > > > - Assert(txn->ninvalidations == 0); > > > > > > Why this Assert is removed? > > > > Even if the base_snapshot is NULL, now we are collecting the > > txn->invalidation. > > > > But there doesn't seem to be any check even before this patch which > directly prohibits accumulating invalidations in DecodeCommit. We > have check for base_snapshot in ReorderBufferCommit. Did you get any > failure with that check? Because earlier ReorderBufferForget for toptxn will be called if the top transaction is aborted and in abort case, we are not logging any invalidation so that will be 0. However same is not true now. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Amit Kapila
Date:
On Mon, Jun 22, 2020 at 6:38 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Mon, Jun 22, 2020 at 5:26 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > @@ -2012,8 +2014,6 @@ ReorderBufferForget(ReorderBuffer *rb, > > > > TransactionId xid, XLogRecPtr lsn) > > > > if (txn->base_snapshot != NULL && txn->ninvalidations > 0) > > > > ReorderBufferImmediateInvalidation(rb, txn->ninvalidations, > > > > txn->invalidations); > > > > - else > > > > - Assert(txn->ninvalidations == 0); > > > > > > > > Why this Assert is removed? > > > > > > Even if the base_snapshot is NULL, now we are collecting the > > > txn->invalidation. > > > > > > > But there doesn't seem to be any check even before this patch which > > directly prohibits accumulating invalidations in DecodeCommit. We > > have check for base_snapshot in ReorderBufferCommit. Did you get any > > failure with that check? > > Because earlier ReorderBufferForget for toptxn will be called if the > top transaction is aborted and in abort case, we are not logging any > invalidation so that will be 0. However same is not true now. > AFAICS, ReorderBufferForget() is called (via DecodeCommit) only when we need to skip the transaction. It doesn't seem to be called from Abort path (DecodeAbort/ReorderBufferAbort doesn't use ReorderBufferForget). I am not sure which code path are you referring here, can you please share the code flow which you are referring to here. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Dilip Kumar
Date:
On Tue, Jun 23, 2020 at 8:18 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Mon, Jun 22, 2020 at 6:38 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > On Mon, Jun 22, 2020 at 5:26 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > > @@ -2012,8 +2014,6 @@ ReorderBufferForget(ReorderBuffer *rb, > > > > > TransactionId xid, XLogRecPtr lsn) > > > > > if (txn->base_snapshot != NULL && txn->ninvalidations > 0) > > > > > ReorderBufferImmediateInvalidation(rb, txn->ninvalidations, > > > > > txn->invalidations); > > > > > - else > > > > > - Assert(txn->ninvalidations == 0); > > > > > > > > > > Why this Assert is removed? > > > > > > > > Even if the base_snapshot is NULL, now we are collecting the > > > > txn->invalidation. > > > > > > > > > > But there doesn't seem to be any check even before this patch which > > > directly prohibits accumulating invalidations in DecodeCommit. We > > > have check for base_snapshot in ReorderBufferCommit. Did you get any > > > failure with that check? > > > > Because earlier ReorderBufferForget for toptxn will be called if the > > top transaction is aborted and in abort case, we are not logging any > > invalidation so that will be 0. However same is not true now. > > > > AFAICS, ReorderBufferForget() is called (via DecodeCommit) only when > we need to skip the transaction. It doesn't seem to be called from > Abort path (DecodeAbort/ReorderBufferAbort doesn't use > ReorderBufferForget). I am not sure which code path are you referring > here, can you please share the code flow which you are referring to > here. I think you are right, during some intermediate code change, it crashed on that assert (I guess I might be adding invalidation to the sub-transaction but not sure what was that state) and I assumed that is the reason that I explained above but, now I see my assumption was wrong. I will put back that assert. By testing, I could not hit any case where we hit that assert even after my changes, still I will put more thought if by any chance our case is different then the base code. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Dilip Kumar
Date:
On Tue, Jun 23, 2020 at 10:13 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Tue, Jun 23, 2020 at 8:18 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > On Mon, Jun 22, 2020 at 6:38 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > > > On Mon, Jun 22, 2020 at 5:26 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > > > > @@ -2012,8 +2014,6 @@ ReorderBufferForget(ReorderBuffer *rb, > > > > > > TransactionId xid, XLogRecPtr lsn) > > > > > > if (txn->base_snapshot != NULL && txn->ninvalidations > 0) > > > > > > ReorderBufferImmediateInvalidation(rb, txn->ninvalidations, > > > > > > txn->invalidations); > > > > > > - else > > > > > > - Assert(txn->ninvalidations == 0); > > > > > > > > > > > > Why this Assert is removed? > > > > > > > > > > Even if the base_snapshot is NULL, now we are collecting the > > > > > txn->invalidation. > > > > > > > > > > > > > But there doesn't seem to be any check even before this patch which > > > > directly prohibits accumulating invalidations in DecodeCommit. We > > > > have check for base_snapshot in ReorderBufferCommit. Did you get any > > > > failure with that check? > > > > > > Because earlier ReorderBufferForget for toptxn will be called if the > > > top transaction is aborted and in abort case, we are not logging any > > > invalidation so that will be 0. However same is not true now. > > > > > > > AFAICS, ReorderBufferForget() is called (via DecodeCommit) only when > > we need to skip the transaction. It doesn't seem to be called from > > Abort path (DecodeAbort/ReorderBufferAbort doesn't use > > ReorderBufferForget). I am not sure which code path are you referring > > here, can you please share the code flow which you are referring to > > here. > > I think you are right, during some intermediate code change, it > crashed on that assert (I guess I might be adding invalidation to the > sub-transaction but not sure what was that state) and I assumed that > is the reason that I explained above but, now I see my assumption was > wrong. I will put back that assert. By testing, I could not hit any > case where we hit that assert even after my changes, still I will put > more thought if by any chance our case is different then the base > code. Here is the POC patch to discuss the idea of a cleanup of shared fileset on proc exit. As discussed offlist, here I am maintaining the list of shared fileset. First time when the list is NULL I am registering the cleanup function with on_proc_exit routine. After that for subsequent fileset, I am just appending it to filesetlist. There is also an interface to unregister the shared file set from the cleanup list and that is done by the caller whenever we are deleting the shared fileset manually. While explaining it here, I think there could be one issue if we delete all the element from the list will become NULL and on next SharedFileSetInit we will again register the function. Maybe that is not a problem but we can avoid registering multiple times by using some flag in the file -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Attachment
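To make the proc-exit cleanup idea above concrete, here is a minimal sketch in the style of the POC patch. It assumes the filesetlist, SharedFileSetOnProcExit and SharedFileSetUnregister names quoted later in this thread; SharedFileSetRegister is a hypothetical helper standing in for the code the POC adds inline to SharedFileSetInit, and list_delete_ptr is a simplification of the foreach/list_delete_cell loop in the actual patch.

#include "postgres.h"
#include "nodes/pg_list.h"
#include "storage/ipc.h"
#include "storage/sharedfileset.h"

/* Process-local list of filesets that must be cleaned up at proc exit. */
static List *filesetlist = NIL;

/* on_proc_exit callback: remove the files of every still-registered fileset. */
static void
SharedFileSetOnProcExit(int status, Datum arg)
{
	ListCell   *l;

	foreach(l, filesetlist)
		SharedFileSetDeleteAll((SharedFileSet *) lfirst(l));

	filesetlist = NIL;
}

/* Track a fileset that is not backed by a DSM segment. */
static void
SharedFileSetRegister(SharedFileSet *fileset)
{
	/* Register the callback only once, for the first tracked fileset. */
	if (filesetlist == NIL)
		on_proc_exit(SharedFileSetOnProcExit, 0);

	filesetlist = lcons((void *) fileset, filesetlist);
}

/* Forget a fileset whose files the caller has already deleted explicitly. */
static void
SharedFileSetUnregister(SharedFileSet *fileset)
{
	filesetlist = list_delete_ptr(filesetlist, fileset);
}

The point of keeping a process-local list is that a single on_proc_exit callback can remove the files of every fileset the apply worker still has registered if it exits abnormally, while a fileset that is deleted explicitly is unregistered first so its files are not removed twice.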
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Amit Kapila
Date:
On Tue, Jun 23, 2020 at 7:00 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > Here is the POC patch to discuss the idea of a cleanup of shared > fileset on proc exit. As discussed offlist, here I am maintaining > the list of shared fileset. First time when the list is NULL I am > registering the cleanup function with on_proc_exit routine. After > that for subsequent fileset, I am just appending it to filesetlist. > There is also an interface to unregister the shared file set from the > cleanup list and that is done by the caller whenever we are deleting > the shared fileset manually. While explaining it here, I think there > could be one issue if we delete all the element from the list will > become NULL and on next SharedFileSetInit we will again register the > function. Maybe that is not a problem but we can avoid registering > multiple times by using some flag in the file > I don't understand what you mean by "using some flag in the file". Review comments on various patches. poc_shared_fileset_cleanup_on_procexit ================================= 1. - ent->subxact_fileset = - MemoryContextAlloc(ApplyContext, sizeof(SharedFileSet)); + MemoryContext oldctx; + /* Shared fileset handle must be allocated in the persistent context */ + oldctx = MemoryContextSwitchTo(ApplyContext); + ent->subxact_fileset = palloc(sizeof(SharedFileSet)); SharedFileSetInit(ent->subxact_fileset, NULL); + MemoryContextSwitchTo(oldctx); fd = BufFileCreateShared(ent->subxact_fileset, path); Why is this change required for this patch and why we only cover SharedFileSetInit in the Apply context and not BufFileCreateShared? The comment is also not very clear on this point. 2. +void +SharedFileSetUnregister(SharedFileSet *input_fileset) +{ + bool found = false; + ListCell *l; + + Assert(filesetlist != NULL); + + /* Loop over all the pending shared fileset entry */ + foreach (l, filesetlist) + { + SharedFileSet *fileset = (SharedFileSet *) lfirst(l); + + /* remove the entry from the list and delete the underlying files */ + if (input_fileset->number == fileset->number) + { + SharedFileSetDeleteAll(fileset); + filesetlist = list_delete_cell(filesetlist, l); Why are we calling SharedFileSetDeleteAll here when in the caller we have already deleted the fileset as per below code? BufFileDeleteShared(ent->stream_fileset, path); + SharedFileSetUnregister(ent->stream_fileset); I think it will be good if somehow we can remove the fileset from filesetlist during BufFileDeleteShared. If that is possible, then we don't need a separate API for SharedFileSetUnregister. 3. +static List * filesetlist = NULL; + static void SharedFileSetOnDetach(dsm_segment *segment, Datum datum); +static void SharedFileSetOnProcExit(int status, Datum arg); static void SharedFileSetPath(char *path, SharedFileSet *fileset, Oid tablespace); static void SharedFilePath(char *path, SharedFileSet *fileset, const char *name); static Oid ChooseTablespace(const SharedFileSet *fileset, const char *name); @@ -76,6 +80,13 @@ SharedFileSetInit(SharedFileSet *fileset, dsm_segment *seg) /* Register our cleanup callback. */ if (seg) on_dsm_detach(seg, SharedFileSetOnDetach, PointerGetDatum(fileset)); + else + { + if (filesetlist == NULL) + on_proc_exit(SharedFileSetOnProcExit, 0); We use NIL for list initialization and comparison. See lock_files usage. 4. 
+SharedFileSetOnProcExit(int status, Datum arg) +{ + ListCell *l; + + /* Loop over all the pending shared fileset entry */ + foreach (l, filesetlist) + { + SharedFileSet *fileset = (SharedFileSet *) lfirst(l); + SharedFileSetDeleteAll(fileset); + } We can initialize filesetlist as NIL after the for loop as it will make the code look clean. Comments on other patches: ========================= 5. > 3. On concurrent abort we are truncating all the changes including > some incomplete changes, so later when we get the complete changes we > don't have the previous changes, e.g, if we had specinsert in the > last stream and due to concurrent abort detection if we delete that > changes later we will get spec_confirm without spec insert. We could > have simply avoided deleting all the changes, but I think the better > fix is once we detect the concurrent abort for any transaction, then > why do we need to collect the changes for that, we can simply avoid > that. So I have put that fix. (0006) > On similar lines, I think we need to skip processing message, see else part of code in ReorderBufferQueueMessage. 6. In v29-0002-Issue-individual-invalidations-with-wal_level-lo, xact_desc_invalidations seems to be a subset of standby_desc_invalidations, can we have a common code for them? 7. I think we can avoid sending v29-0007-Track-statistics-for-streaming this each time. We can do this after the main patch is complete. Also, we might need to change how and where these stats will be tracked. See the related discussion [1]. 8. In v29-0005-Implement-streaming-mode-in-ReorderBuffer, * Return oldest transaction in reorderbuffer @@ -863,6 +909,9 @@ ReorderBufferAssignChild(ReorderBuffer *rb, TransactionId xid, /* set the reference to top-level transaction */ subtxn->toptxn = txn; + /* set the reference to toplevel transaction */ + subtxn->toptxn = txn; + There is a double initialization of subtxn->toptxn. You need to remove this line from 0005 patch as we have now added it in an earlier patch. 9. I think you forgot to update the patch to execute invalidations in Abort case or I might be missing something. I don't see any changes in ReorderBufferAbort. You have agreed in one of the emails above [2] about handling the same. 10. In v29-0008-Add-support-for-streaming-to-built-in-replicatio, apply_handle_stream_commit(StringInfo s) { .. + /* + * send feedback to upstream + * + * XXX Probably should send a valid LSN. But which one? + */ + send_feedback(InvalidXLogRecPtr, false, false); .. } I have given a comment on this code that we don't need this feedback and you mentioned on June 02 [3] that you will think on it and let me know your opinion but I don't see a response from you yet. Can you get back to me regarding this point? 11. Add some comments as to why we have used Shared BufFile interface instead of Temp BufFile interface? 12. In v29-0013-Change-buffile-interface-required-for-streaming, + * Initialize a space for temporary files that can be opened other backends. /opened other backends/opened for access by other backends [1] - https://www.postgresql.org/message-id/CA%2Bfd4k5_pPAYRTDrO2PbtTOe0eHQpBvuqmCr8ic39uTNmR49Eg%40mail.gmail.com [2] - https://www.postgresql.org/message-id/CAFiTN-t7WZZjFrAjSYj4fu%3DFZ2JKENN8ZHCUZaw-srnrHMWMrg%40mail.gmail.com [3] - https://www.postgresql.org/message-id/CAFiTN-tHpd%2BzXVemo9WqQUJS50p9m8jD%3DAWjsugKZQ4F-K8Pbw%40mail.gmail.com -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Amit Kapila
Date:
On Mon, Jun 22, 2020 at 11:56 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Tue, Jun 16, 2020 at 2:37 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > 8. > > + /* > > + * Start a transaction on stream start, this transaction will be committed > > + * on the stream stop. We need the transaction for handling the buffile, > > + * used for serializing the streaming data and subxact info. > > + */ > > + ensure_transaction(); > > > > I think we need this for PrepareTempTablespaces to set the > > temptablespaces. Also, isn't it required for a cleanup of buffile > > resources at the transaction end? Are there any other reasons for it > > as well? The comment should be a bit more clear for why we need a > > transaction here. > > I am not sure that will it make sense to add a comment here that why > buffile and sharedfileset need a transaction? > You can say usage of BufFile interface expects us to be in the transaction for so and so reason.... Do you think that we > should add comment in buffile/shared fileset API that it should be > called under a transaction? > I am fine with that as well. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Dilip Kumar
Date:
iOn Wed, Jun 24, 2020 at 4:04 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Tue, Jun 23, 2020 at 7:00 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > Here is the POC patch to discuss the idea of a cleanup of shared > > fileset on proc exit. As discussed offlist, here I am maintaining > > the list of shared fileset. First time when the list is NULL I am > > registering the cleanup function with on_proc_exit routine. After > > that for subsequent fileset, I am just appending it to filesetlist. > > There is also an interface to unregister the shared file set from the > > cleanup list and that is done by the caller whenever we are deleting > > the shared fileset manually. While explaining it here, I think there > > could be one issue if we delete all the element from the list will > > become NULL and on next SharedFileSetInit we will again register the > > function. Maybe that is not a problem but we can avoid registering > > multiple times by using some flag in the file > > > > I don't understand what you mean by "using some flag in the file". Basically, in POC as shown in below code snippet, We are checking that if the "filesetlist" is NULL then only register the on_proc_exit function. But, as described above if all the items are deleted the list will be NULL. So I told that instead of checking the filesetlist is NULL, we can have just a boolean variable that if we have registered the callback then don't do it again. @@ -76,6 +80,13 @@ SharedFileSetInit(SharedFileSet *fileset, dsm_segment *seg) /* Register our cleanup callback. */ if (seg) on_dsm_detach(seg, SharedFileSetOnDetach, PointerGetDatum(fileset)); + else + { + if (filesetlist == NULL) + on_proc_exit(SharedFileSetOnProcExit, 0); + + filesetlist = lcons((void *)fileset, filesetlist); + } } > > Review comments on various patches. > > poc_shared_fileset_cleanup_on_procexit > ================================= > 1. > - ent->subxact_fileset = > - MemoryContextAlloc(ApplyContext, sizeof(SharedFileSet)); > + MemoryContext oldctx; > > + /* Shared fileset handle must be allocated in the persistent context */ > + oldctx = MemoryContextSwitchTo(ApplyContext); > + ent->subxact_fileset = palloc(sizeof(SharedFileSet)); > SharedFileSetInit(ent->subxact_fileset, NULL); > + MemoryContextSwitchTo(oldctx); > fd = BufFileCreateShared(ent->subxact_fileset, path); > > Why is this change required for this patch and why we only cover > SharedFileSetInit in the Apply context and not BufFileCreateShared? > The comment is also not very clear on this point. Because only the sharedfileset and the filesetlist which is allocated under SharedFileSetInit, are required in the permanent context. BufFileCreateShared, only creates the Buffile and VFD which will be required only within the current stream so transaction context is enough. > 2. > +void > +SharedFileSetUnregister(SharedFileSet *input_fileset) > +{ > + bool found = false; > + ListCell *l; > + > + Assert(filesetlist != NULL); > + > + /* Loop over all the pending shared fileset entry */ > + foreach (l, filesetlist) > + { > + SharedFileSet *fileset = (SharedFileSet *) lfirst(l); > + > + /* remove the entry from the list and delete the underlying files */ > + if (input_fileset->number == fileset->number) > + { > + SharedFileSetDeleteAll(fileset); > + filesetlist = list_delete_cell(filesetlist, l); > > Why are we calling SharedFileSetDeleteAll here when in the caller we > have already deleted the fileset as per below code? 
> BufFileDeleteShared(ent->stream_fileset, path); > + SharedFileSetUnregister(ent->stream_fileset); > > I think it will be good if somehow we can remove the fileset from > filesetlist during BufFileDeleteShared. If that is possible, then we > don't need a separate API for SharedFileSetUnregister. But the filesetlist is maintained at the sharedfileset level, so even if we delete from BufFileDeleteShared, we need to call an API from the sharedfileset layer to unregister the fileset. Am I missing something? > 3. > +static List * filesetlist = NULL; > + > static void SharedFileSetOnDetach(dsm_segment *segment, Datum datum); > +static void SharedFileSetOnProcExit(int status, Datum arg); > static void SharedFileSetPath(char *path, SharedFileSet *fileset, Oid > tablespace); > static void SharedFilePath(char *path, SharedFileSet *fileset, const > char *name); > static Oid ChooseTablespace(const SharedFileSet *fileset, const char *name); > @@ -76,6 +80,13 @@ SharedFileSetInit(SharedFileSet *fileset, dsm_segment *seg) > /* Register our cleanup callback. */ > if (seg) > on_dsm_detach(seg, SharedFileSetOnDetach, PointerGetDatum(fileset)); > + else > + { > + if (filesetlist == NULL) > + on_proc_exit(SharedFileSetOnProcExit, 0); > > We use NIL for list initialization and comparison. See lock_files usage. Right. > 4. > +SharedFileSetOnProcExit(int status, Datum arg) > +{ > + ListCell *l; > + > + /* Loop over all the pending shared fileset entry */ > + foreach (l, filesetlist) > + { > + SharedFileSet *fileset = (SharedFileSet *) lfirst(l); > + SharedFileSetDeleteAll(fileset); > + } > > We can initialize filesetlist as NIL after the for loop as it will > make the code look clean. ok Thanks for your feedback on this. I will reply to other comments separately. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Amit Kapila
Date:
On Wed, Jun 24, 2020 at 4:27 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > iOn Wed, Jun 24, 2020 at 4:04 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > On Tue, Jun 23, 2020 at 7:00 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > > > Here is the POC patch to discuss the idea of a cleanup of shared > > > fileset on proc exit. As discussed offlist, here I am maintaining > > > the list of shared fileset. First time when the list is NULL I am > > > registering the cleanup function with on_proc_exit routine. After > > > that for subsequent fileset, I am just appending it to filesetlist. > > > There is also an interface to unregister the shared file set from the > > > cleanup list and that is done by the caller whenever we are deleting > > > the shared fileset manually. While explaining it here, I think there > > > could be one issue if we delete all the element from the list will > > > become NULL and on next SharedFileSetInit we will again register the > > > function. Maybe that is not a problem but we can avoid registering > > > multiple times by using some flag in the file > > > > > > > I don't understand what you mean by "using some flag in the file". > > Basically, in POC as shown in below code snippet, We are checking > that if the "filesetlist" is NULL then only register the on_proc_exit > function. But, as described above if all the items are deleted the > list will be NULL. So I told that instead of checking the filesetlist > is NULL, we can have just a boolean variable that if we have > registered the callback then don't do it again. > Check if there is any precedent of the same in the code? > > > > > Review comments on various patches. > > > > poc_shared_fileset_cleanup_on_procexit > > ================================= > > 1. > > - ent->subxact_fileset = > > - MemoryContextAlloc(ApplyContext, sizeof(SharedFileSet)); > > + MemoryContext oldctx; > > > > + /* Shared fileset handle must be allocated in the persistent context */ > > + oldctx = MemoryContextSwitchTo(ApplyContext); > > + ent->subxact_fileset = palloc(sizeof(SharedFileSet)); > > SharedFileSetInit(ent->subxact_fileset, NULL); > > + MemoryContextSwitchTo(oldctx); > > fd = BufFileCreateShared(ent->subxact_fileset, path); > > > > Why is this change required for this patch and why we only cover > > SharedFileSetInit in the Apply context and not BufFileCreateShared? > > The comment is also not very clear on this point. > > Because only the sharedfileset and the filesetlist which is allocated > under SharedFileSetInit, are required in the permanent context. > BufFileCreateShared, only creates the Buffile and VFD which will be > required only within the current stream so transaction context is > enough. > Okay, then add some more comments to explain it or if you have explained it elsewhere, then add a reference for the same. > > 2. > > +void > > +SharedFileSetUnregister(SharedFileSet *input_fileset) > > +{ > > + bool found = false; > > + ListCell *l; > > + > > + Assert(filesetlist != NULL); > > + > > + /* Loop over all the pending shared fileset entry */ > > + foreach (l, filesetlist) > > + { > > + SharedFileSet *fileset = (SharedFileSet *) lfirst(l); > > + > > + /* remove the entry from the list and delete the underlying files */ > > + if (input_fileset->number == fileset->number) > > + { > > + SharedFileSetDeleteAll(fileset); > > + filesetlist = list_delete_cell(filesetlist, l); > > > > Why are we calling SharedFileSetDeleteAll here when in the caller we > > have already deleted the fileset as per below code? 
> > BufFileDeleteShared(ent->stream_fileset, path); > > + SharedFileSetUnregister(ent->stream_fileset); > > > > I think it will be good if somehow we can remove the fileset from > > filesetlist during BufFileDeleteShared. If that is possible, then we > > don't need a separate API for SharedFileSetUnregister. > > But the filesetlist is maintained at the sharedfileset level, so even > if we delete from BufFileDeleteShared, we need to call an API from the > sharedfileset layer to unregister the fileset. > Sure, but isn't it better if we can call such an API from BufFileDeleteShared? -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Dilip Kumar
Date:
On Wed, Jun 24, 2020 at 4:04 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Tue, Jun 23, 2020 at 7:00 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > Here is the POC patch to discuss the idea of a cleanup of shared > > fileset on proc exit. As discussed offlist, here I am maintaining > > the list of shared fileset. First time when the list is NULL I am > > registering the cleanup function with on_proc_exit routine. After > > that for subsequent fileset, I am just appending it to filesetlist. > > There is also an interface to unregister the shared file set from the > > cleanup list and that is done by the caller whenever we are deleting > > the shared fileset manually. While explaining it here, I think there > > could be one issue if we delete all the element from the list will > > become NULL and on next SharedFileSetInit we will again register the > > function. Maybe that is not a problem but we can avoid registering > > multiple times by using some flag in the file > > > > I don't understand what you mean by "using some flag in the file". > > Review comments on various patches. > > poc_shared_fileset_cleanup_on_procexit > ================================= > 1. > - ent->subxact_fileset = > - MemoryContextAlloc(ApplyContext, sizeof(SharedFileSet)); > + MemoryContext oldctx; > > + /* Shared fileset handle must be allocated in the persistent context */ > + oldctx = MemoryContextSwitchTo(ApplyContext); > + ent->subxact_fileset = palloc(sizeof(SharedFileSet)); > SharedFileSetInit(ent->subxact_fileset, NULL); > + MemoryContextSwitchTo(oldctx); > fd = BufFileCreateShared(ent->subxact_fileset, path); > > Why is this change required for this patch and why we only cover > SharedFileSetInit in the Apply context and not BufFileCreateShared? > The comment is also not very clear on this point. Added the comments for the same. > 2. > +void > +SharedFileSetUnregister(SharedFileSet *input_fileset) > +{ > + bool found = false; > + ListCell *l; > + > + Assert(filesetlist != NULL); > + > + /* Loop over all the pending shared fileset entry */ > + foreach (l, filesetlist) > + { > + SharedFileSet *fileset = (SharedFileSet *) lfirst(l); > + > + /* remove the entry from the list and delete the underlying files */ > + if (input_fileset->number == fileset->number) > + { > + SharedFileSetDeleteAll(fileset); > + filesetlist = list_delete_cell(filesetlist, l); > > Why are we calling SharedFileSetDeleteAll here when in the caller we > have already deleted the fileset as per below code? > BufFileDeleteShared(ent->stream_fileset, path); > + SharedFileSetUnregister(ent->stream_fileset); That's wrong I have removed this. > I think it will be good if somehow we can remove the fileset from > filesetlist during BufFileDeleteShared. If that is possible, then we > don't need a separate API for SharedFileSetUnregister. I have done as discussed on later replies, basically called SharedFileSetUnregister from BufFileDeleteShared. > 3. > +static List * filesetlist = NULL; > + > static void SharedFileSetOnDetach(dsm_segment *segment, Datum datum); > +static void SharedFileSetOnProcExit(int status, Datum arg); > static void SharedFileSetPath(char *path, SharedFileSet *fileset, Oid > tablespace); > static void SharedFilePath(char *path, SharedFileSet *fileset, const > char *name); > static Oid ChooseTablespace(const SharedFileSet *fileset, const char *name); > @@ -76,6 +80,13 @@ SharedFileSetInit(SharedFileSet *fileset, dsm_segment *seg) > /* Register our cleanup callback. 
*/ > if (seg) > on_dsm_detach(seg, SharedFileSetOnDetach, PointerGetDatum(fileset)); > + else > + { > + if (filesetlist == NULL) > + on_proc_exit(SharedFileSetOnProcExit, 0); > > We use NIL for list initialization and comparison. See lock_files usage. Done > 4. > +SharedFileSetOnProcExit(int status, Datum arg) > +{ > + ListCell *l; > + > + /* Loop over all the pending shared fileset entry */ > + foreach (l, filesetlist) > + { > + SharedFileSet *fileset = (SharedFileSet *) lfirst(l); > + SharedFileSetDeleteAll(fileset); > + } > > We can initialize filesetlist as NIL after the for loop as it will > make the code look clean. Right. > Comments on other patches: > ========================= > 5. > > 3. On concurrent abort we are truncating all the changes including > > some incomplete changes, so later when we get the complete changes we > > don't have the previous changes, e.g, if we had specinsert in the > > last stream and due to concurrent abort detection if we delete that > > changes later we will get spec_confirm without spec insert. We could > > have simply avoided deleting all the changes, but I think the better > > fix is once we detect the concurrent abort for any transaction, then > > why do we need to collect the changes for that, we can simply avoid > > that. So I have put that fix. (0006) > > > > On similar lines, I think we need to skip processing message, see else > part of code in ReorderBufferQueueMessage. Basically, ReorderBufferQueueMessage also calls the ReorderBufferQueueChange internally for transactional changes. But, having said that, I realize the idea of skipping the changes in ReorderBufferQueueChange is not good, because by then we have already allocated the memory for the change and the tuple and it's not a correct to ReturnChanges because it will update the memory accounting. So I think we can do it at a more centralized place and before we process the change, maybe in LogicalDecodingProcessRecord, before going to the switch we can call a function from the reorderbuffer.c layer to see whether this transaction is detected as aborted or not. But I have to think more on this line that can we skip all the processing of that record or not. Your other comments look fine to me so I will send in the next patch set and reply on them individually. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Attachment
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Dilip Kumar
Date:
On Thu, Jun 25, 2020 at 7:10 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Wed, Jun 24, 2020 at 4:04 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > On Tue, Jun 23, 2020 at 7:00 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > > > Here is the POC patch to discuss the idea of a cleanup of shared > > > fileset on proc exit. As discussed offlist, here I am maintaining > > > the list of shared fileset. First time when the list is NULL I am > > > registering the cleanup function with on_proc_exit routine. After > > > that for subsequent fileset, I am just appending it to filesetlist. > > > There is also an interface to unregister the shared file set from the > > > cleanup list and that is done by the caller whenever we are deleting > > > the shared fileset manually. While explaining it here, I think there > > > could be one issue if we delete all the element from the list will > > > become NULL and on next SharedFileSetInit we will again register the > > > function. Maybe that is not a problem but we can avoid registering > > > multiple times by using some flag in the file > > > > > > > I don't understand what you mean by "using some flag in the file". > > > > Review comments on various patches. > > > > poc_shared_fileset_cleanup_on_procexit > > ================================= > > 1. > > - ent->subxact_fileset = > > - MemoryContextAlloc(ApplyContext, sizeof(SharedFileSet)); > > + MemoryContext oldctx; > > > > + /* Shared fileset handle must be allocated in the persistent context */ > > + oldctx = MemoryContextSwitchTo(ApplyContext); > > + ent->subxact_fileset = palloc(sizeof(SharedFileSet)); > > SharedFileSetInit(ent->subxact_fileset, NULL); > > + MemoryContextSwitchTo(oldctx); > > fd = BufFileCreateShared(ent->subxact_fileset, path); > > > > Why is this change required for this patch and why we only cover > > SharedFileSetInit in the Apply context and not BufFileCreateShared? > > The comment is also not very clear on this point. > > Added the comments for the same. > > > 2. > > +void > > +SharedFileSetUnregister(SharedFileSet *input_fileset) > > +{ > > + bool found = false; > > + ListCell *l; > > + > > + Assert(filesetlist != NULL); > > + > > + /* Loop over all the pending shared fileset entry */ > > + foreach (l, filesetlist) > > + { > > + SharedFileSet *fileset = (SharedFileSet *) lfirst(l); > > + > > + /* remove the entry from the list and delete the underlying files */ > > + if (input_fileset->number == fileset->number) > > + { > > + SharedFileSetDeleteAll(fileset); > > + filesetlist = list_delete_cell(filesetlist, l); > > > > Why are we calling SharedFileSetDeleteAll here when in the caller we > > have already deleted the fileset as per below code? > > BufFileDeleteShared(ent->stream_fileset, path); > > + SharedFileSetUnregister(ent->stream_fileset); > > That's wrong I have removed this. > > > > I think it will be good if somehow we can remove the fileset from > > filesetlist during BufFileDeleteShared. If that is possible, then we > > don't need a separate API for SharedFileSetUnregister. > > I have done as discussed on later replies, basically called > SharedFileSetUnregister from BufFileDeleteShared. > > > 3. 
> > +static List * filesetlist = NULL; > > + > > static void SharedFileSetOnDetach(dsm_segment *segment, Datum datum); > > +static void SharedFileSetOnProcExit(int status, Datum arg); > > static void SharedFileSetPath(char *path, SharedFileSet *fileset, Oid > > tablespace); > > static void SharedFilePath(char *path, SharedFileSet *fileset, const > > char *name); > > static Oid ChooseTablespace(const SharedFileSet *fileset, const char *name); > > @@ -76,6 +80,13 @@ SharedFileSetInit(SharedFileSet *fileset, dsm_segment *seg) > > /* Register our cleanup callback. */ > > if (seg) > > on_dsm_detach(seg, SharedFileSetOnDetach, PointerGetDatum(fileset)); > > + else > > + { > > + if (filesetlist == NULL) > > + on_proc_exit(SharedFileSetOnProcExit, 0); > > > > We use NIL for list initialization and comparison. See lock_files usage. > > Done > > > 4. > > +SharedFileSetOnProcExit(int status, Datum arg) > > +{ > > + ListCell *l; > > + > > + /* Loop over all the pending shared fileset entry */ > > + foreach (l, filesetlist) > > + { > > + SharedFileSet *fileset = (SharedFileSet *) lfirst(l); > > + SharedFileSetDeleteAll(fileset); > > + } > > > > We can initialize filesetlist as NIL after the for loop as it will > > make the code look clean. > > Right. > > > Comments on other patches: > > ========================= > > 5. > > > 3. On concurrent abort we are truncating all the changes including > > > some incomplete changes, so later when we get the complete changes we > > > don't have the previous changes, e.g, if we had specinsert in the > > > last stream and due to concurrent abort detection if we delete that > > > changes later we will get spec_confirm without spec insert. We could > > > have simply avoided deleting all the changes, but I think the better > > > fix is once we detect the concurrent abort for any transaction, then > > > why do we need to collect the changes for that, we can simply avoid > > > that. So I have put that fix. (0006) > > > > > > > On similar lines, I think we need to skip processing message, see else > > part of code in ReorderBufferQueueMessage. > > Basically, ReorderBufferQueueMessage also calls the > ReorderBufferQueueChange internally for transactional changes. But, > having said that, I realize the idea of skipping the changes in > ReorderBufferQueueChange is not good, because by then we have already > allocated the memory for the change and the tuple and it's not a > correct to ReturnChanges because it will update the memory accounting. > So I think we can do it at a more centralized place and before we > process the change, maybe in LogicalDecodingProcessRecord, before > going to the switch we can call a function from the reorderbuffer.c > layer to see whether this transaction is detected as aborted or not. > But I have to think more on this line that can we skip all the > processing of that record or not. > > Your other comments look fine to me so I will send in the next patch > set and reply on them individually. I think we can not put this check, in the higher-level functions like LogicalDecodingProcessRecord or DecodeXXXOp because we need to process that xid at least for abort, so I think it is good to keep the check, inside ReorderBufferQueueChange only and we can free the memory of the change if the abort is detected. Also, if just skip those changes in ReorderBufferQueueChange then the effect will be localized to that particular transaction which is already aborted. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
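To illustrate the resolution being described here, below is a minimal sketch of what the early exit could look like if the check stays inside reorderbuffer.c. The helper name is hypothetical, the concurrent_abort field on ReorderBufferTXN is assumed to be the marker the streaming patch sets once a concurrent abort is detected, and the extra boolean on ReorderBufferReturnChange (telling it not to adjust memory accounting for a change that was never queued) is likewise an assumed extension of the current two-argument API.

/*
 * Sketch of a helper inside reorderbuffer.c: returns true and frees the
 * change if the transaction is already known to be concurrently aborted,
 * so ReorderBufferQueueChange can simply return without queueing it.
 */
static bool
ReorderBufferChangeSkipIfAborted(ReorderBuffer *rb, ReorderBufferTXN *txn,
								 ReorderBufferChange *change)
{
	/* Nothing to do if no concurrent abort has been seen for this xact. */
	if (!txn->concurrent_abort)
		return false;

	/*
	 * The transaction is already aborted, so collecting further changes for
	 * it is pointless.  Free the change without touching the memory
	 * accounting, since it was never added to the queue.
	 */
	ReorderBufferReturnChange(rb, change, false);
	return true;
}

ReorderBufferQueueChange would call this right after looking up (or creating) the transaction and return early when it reports true, which keeps the effect local to the already-aborted transaction, as described above.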
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Amit Kapila
Date:
On Fri, Jun 26, 2020 at 10:39 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Thu, Jun 25, 2020 at 7:10 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > On Wed, Jun 24, 2020 at 4:04 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > Comments on other patches: > > > ========================= > > > 5. > > > > 3. On concurrent abort we are truncating all the changes including > > > > some incomplete changes, so later when we get the complete changes we > > > > don't have the previous changes, e.g, if we had specinsert in the > > > > last stream and due to concurrent abort detection if we delete that > > > > changes later we will get spec_confirm without spec insert. We could > > > > have simply avoided deleting all the changes, but I think the better > > > > fix is once we detect the concurrent abort for any transaction, then > > > > why do we need to collect the changes for that, we can simply avoid > > > > that. So I have put that fix. (0006) > > > > > > > > > > On similar lines, I think we need to skip processing message, see else > > > part of code in ReorderBufferQueueMessage. > > > > Basically, ReorderBufferQueueMessage also calls the > > ReorderBufferQueueChange internally for transactional changes. Yes, that is correct but I was thinking about the non-transactional part due to the below code there. else { ReorderBufferTXN *txn = NULL; volatile Snapshot snapshot_now = snapshot; if (xid != InvalidTransactionId) txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true); Even though we are using txn here but I think we don't need to skip it for aborted xacts because without patch as well such messages get decoded irrespective of transaction status. What do you think? > > But, > > having said that, I realize the idea of skipping the changes in > > ReorderBufferQueueChange is not good, because by then we have already > > allocated the memory for the change and the tuple and it's not a > > correct to ReturnChanges because it will update the memory accounting. > > So I think we can do it at a more centralized place and before we > > process the change, maybe in LogicalDecodingProcessRecord, before > > going to the switch we can call a function from the reorderbuffer.c > > layer to see whether this transaction is detected as aborted or not. > > But I have to think more on this line that can we skip all the > > processing of that record or not. > > > > Your other comments look fine to me so I will send in the next patch > > set and reply on them individually. > > I think we can not put this check, in the higher-level functions like > LogicalDecodingProcessRecord or DecodeXXXOp because we need to process > that xid at least for abort, so I think it is good to keep the check, > inside ReorderBufferQueueChange only and we can free the memory of the > change if the abort is detected. Also, if just skip those changes in > ReorderBufferQueueChange then the effect will be localized to that > particular transaction which is already aborted. > Fair enough and for cases like non-transactional part of ReorderBufferQueueMessage, I think we anyway need to process the message irrespective of transaction status. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Amit Kapila
Date:
On Thu, Jun 25, 2020 at 7:11 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Wed, Jun 24, 2020 at 4:04 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > Review comments on various patches. > > > > poc_shared_fileset_cleanup_on_procexit > > ================================= > > 1. > > - ent->subxact_fileset = > > - MemoryContextAlloc(ApplyContext, sizeof(SharedFileSet)); > > + MemoryContext oldctx; > > > > + /* Shared fileset handle must be allocated in the persistent context */ > > + oldctx = MemoryContextSwitchTo(ApplyContext); > > + ent->subxact_fileset = palloc(sizeof(SharedFileSet)); > > SharedFileSetInit(ent->subxact_fileset, NULL); > > + MemoryContextSwitchTo(oldctx); > > fd = BufFileCreateShared(ent->subxact_fileset, path); > > > > Why is this change required for this patch and why we only cover > > SharedFileSetInit in the Apply context and not BufFileCreateShared? > > The comment is also not very clear on this point. > > Added the comments for the same. > 1. + /* + * Shared fileset handle must be allocated in the persistent context. + * Also, SharedFileSetInit allocate the memory for sharefileset list + * so we need to allocate that in the long term meemory context. + */ How about "We need to maintain shared fileset across multiple stream open/close calls. So, we allocate it in a persistent context." 2. + /* + * If the caller is following the dsm based cleanup then we don't + * maintain the filesetlist so return. + */ + if (filesetlist == NULL) + return; The check here should use 'NIL' instead of 'NULL' Other than that the changes in this particular patch looks good to me. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Dilip Kumar
Date:
On Wed, Jun 24, 2020 at 4:04 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > Comments on other patches: > ========================= Replying to the pending comments. > 6. > In v29-0002-Issue-individual-invalidations-with-wal_level-lo, > xact_desc_invalidations seems to be a subset of > standby_desc_invalidations, can we have a common code for them? Done > 7. > I think we can avoid sending v29-0007-Track-statistics-for-streaming > this each time. We can do this after the main patch is complete. > Also, we might need to change how and where these stats will be > tracked. See the related discussion [1]. Removed > 8. In v29-0005-Implement-streaming-mode-in-ReorderBuffer, > * Return oldest transaction in reorderbuffer > @@ -863,6 +909,9 @@ ReorderBufferAssignChild(ReorderBuffer *rb, > TransactionId xid, > /* set the reference to top-level transaction */ > subtxn->toptxn = txn; > > + /* set the reference to toplevel transaction */ > + subtxn->toptxn = txn; > + > > There is a double initialization of subtxn->toptxn. You need to > remove this line from 0005 patch as we have now added it in an earlier > patch. Done > 9. I think you forgot to update the patch to execute invalidations in > Abort case or I might be missing something. I don't see any changes > in ReorderBufferAbort. You have agreed in one of the emails above [2] > about handling the same. Done, check 0005 > 10. In v29-0008-Add-support-for-streaming-to-built-in-replicatio, > apply_handle_stream_commit(StringInfo s) > { > .. > + /* > + * send feedback to upstream > + * > + * XXX Probably should send a valid LSN. But which one? > + */ > + send_feedback(InvalidXLogRecPtr, false, false); > .. > } > > I have given a comment on this code that we don't need this feedback > and you mentioned on June 02 [3] that you will think on it and let me > know your opinion but I don't see a response from you yet. Can you > get back to me regarding this point? Yeah, I have analyzed this and this seems we don't need this. Because like non-streaming mode here also sending feedback mechanisms shall be the same. I don't see any reason for sending extra feedback on commit. > 11. Add some comments as to why we have used Shared BufFile interface > instead of Temp BufFile interface? Done > 12. In v29-0013-Change-buffile-interface-required-for-streaming, > + * Initialize a space for temporary files that can be opened other backends. > > /opened other backends/opened for access by other backends Done -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Attachment
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Dilip Kumar
Date:
On Fri, Jun 26, 2020 at 11:47 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Thu, Jun 25, 2020 at 7:11 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > On Wed, Jun 24, 2020 at 4:04 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > > > Review comments on various patches. > > > > > > poc_shared_fileset_cleanup_on_procexit > > > ================================= > > > 1. > > > - ent->subxact_fileset = > > > - MemoryContextAlloc(ApplyContext, sizeof(SharedFileSet)); > > > + MemoryContext oldctx; > > > > > > + /* Shared fileset handle must be allocated in the persistent context */ > > > + oldctx = MemoryContextSwitchTo(ApplyContext); > > > + ent->subxact_fileset = palloc(sizeof(SharedFileSet)); > > > SharedFileSetInit(ent->subxact_fileset, NULL); > > > + MemoryContextSwitchTo(oldctx); > > > fd = BufFileCreateShared(ent->subxact_fileset, path); > > > > > > Why is this change required for this patch and why we only cover > > > SharedFileSetInit in the Apply context and not BufFileCreateShared? > > > The comment is also not very clear on this point. > > > > Added the comments for the same. > > > > 1. > + /* > + * Shared fileset handle must be allocated in the persistent context. > + * Also, SharedFileSetInit allocate the memory for sharefileset list > + * so we need to allocate that in the long term meemory context. > + */ > > How about "We need to maintain shared fileset across multiple stream > open/close calls. So, we allocate it in a persistent context." Done > 2. > + /* > + * If the caller is following the dsm based cleanup then we don't > + * maintain the filesetlist so return. > + */ > + if (filesetlist == NULL) > + return; > > The check here should use 'NIL' instead of 'NULL' Done > Other than that the changes in this particular patch looks good to me. Added as a last patch in the series, in the next version I will merge this to 0012 and 0013. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Dilip Kumar
Date:
On Mon, Jun 22, 2020 at 4:30 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Mon, Jun 22, 2020 at 4:26 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > On Thu, Jun 18, 2020 at 9:02 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > > > Yes, I have made the changes. Basically, now I am only using the > > > XLOG_XACT_INVALIDATIONS for generating all the invalidation messages. > > > So whenever we are getting the new set of XLOG_XACT_INVALIDATIONS, we > > > are directly appending it to the txn->invalidations. I have tested > > > the XLOG_INVALIDATIONS part but while sending this mail I realized > > > that we could write some automated test for the same. > > > > > > > Can you share how you have tested it? > > > > > I will work on > > > that soon. > > > > > > > Cool, I think having a regression test for this will be a good idea. > > > > Other than above tests, can we somehow verify that the invalidations > generated at commit time are the same as what we do with this patch? > We have verified with individual commands but it would be great if we > can verify for the regression tests. I have verified this using a few random test cases. For verifying this I have made some temporary code changes with an assert as shown below. Basically, on DecodeCommit we call ReorderBufferAddInvalidations function only for an assert checking. -void ReorderBufferAddInvalidations(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn, Size nmsgs, - SharedInvalidationMessage *msgs) + SharedInvalidationMessage *msgs, bool commit) { ReorderBufferTXN *txn; txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true); - + if (commit) + { + Assert(txn->ninvalidations == nmsgs); + return; + } The result is that for a normal local test it works fine. But with regression suit, it hit an assert at many places because if the rollback of the subtransaction is involved then at commit time invalidation messages those are not logged whereas with command time invalidation those are logged. As of now, I have only put assert on the count, if we need to verify the exact messages then we might need to somehow categories the invalidation messages because the ordering of the messages will not be the same. For testing this we will have to arrange them by category i.e relcahce, catcache and then we can compare them. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
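For the category-wise comparison mentioned above, a small helper along the following lines could bucket a SharedInvalidationMessage array before comparing the commit-time set with the accumulated one. This is only a sketch for the verification hack, not part of any patch: the helper name and struct are hypothetical, while the category IDs are the ones defined in storage/sinval.h.

#include "postgres.h"
#include "storage/sinval.h"

typedef struct InvalCategoryCounts
{
	int			catcache;
	int			relcache;
	int			snapshot;
	int			other;			/* catalog, smgr, relmap, ... */
} InvalCategoryCounts;

static InvalCategoryCounts
count_invalidation_categories(SharedInvalidationMessage *msgs, int nmsgs)
{
	InvalCategoryCounts counts = {0, 0, 0, 0};
	int			i;

	for (i = 0; i < nmsgs; i++)
	{
		if (msgs[i].id >= 0)
			counts.catcache++;	/* non-negative ids are catcache entries */
		else if (msgs[i].id == SHAREDINVALRELCACHE_ID)
			counts.relcache++;
		else if (msgs[i].id == SHAREDINVALSNAPSHOT_ID)
			counts.snapshot++;
		else
			counts.other++;
	}

	return counts;
}

The commit-record messages and the accumulated XLOG_XACT_INVALIDATIONS messages could then be compared per bucket (or sorted within each bucket) rather than positionally, since only the ordering of the messages is expected to differ.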
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Amit Kapila
Date:
On Mon, Jun 29, 2020 at 4:24 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Mon, Jun 22, 2020 at 4:30 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > Other than above tests, can we somehow verify that the invalidations > > generated at commit time are the same as what we do with this patch? > > We have verified with individual commands but it would be great if we > > can verify for the regression tests. > > I have verified this using a few random test cases. For verifying > this I have made some temporary code changes with an assert as shown > below. Basically, on DecodeCommit we call > ReorderBufferAddInvalidations function only for an assert checking. > > -void > ReorderBufferAddInvalidations(ReorderBuffer *rb, TransactionId xid, > XLogRecPtr > lsn, Size nmsgs, > - > SharedInvalidationMessage *msgs) > + > SharedInvalidationMessage *msgs, bool commit) > { > ReorderBufferTXN *txn; > > txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true); > - > + if (commit) > + { > + Assert(txn->ninvalidations == nmsgs); > + return; > + } > > The result is that for a normal local test it works fine. But with > regression suit, it hit an assert at many places because if the > rollback of the subtransaction is involved then at commit time > invalidation messages those are not logged whereas with command time > invalidation those are logged. > Yeah, somehow, we need to ignore rollback to savepoint tests and verify for others. > As of now, I have only put assert on the count, if we need to verify > the exact messages then we might need to somehow categories the > invalidation messages because the ordering of the messages will not be > the same. For testing this we will have to arrange them by category > i.e relcahce, catcache and then we can compare them. > Can't we do this by verifying that each message at commit time exists in the list of invalidation messages we have collected via processing XLOG_XACT_INVALIDATIONS? One additional question on patch v30-0003-Extend-the-output-plugin-API-with-stream-methods: +static void +stream_commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn, + XLogRecPtr apply_lsn) { .. .. + state.report_location = apply_lsn; .. .. + ctx->write_location = apply_lsn; .. } Can't we name the last parameter as 'commit_lsn' as that is how documentation in the patch spells it and it sounds more appropriate? Also, is there a reason for assigning report_location and write_location differently than what we do in commit_cb_wrapper? Basically, assign those as txn->final_lsn and txn->end_lsn respectively. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Dilip Kumar
Date:
On Tue, Jun 30, 2020 at 9:20 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Mon, Jun 29, 2020 at 4:24 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > On Mon, Jun 22, 2020 at 4:30 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > > > Other than above tests, can we somehow verify that the invalidations > > > generated at commit time are the same as what we do with this patch? > > > We have verified with individual commands but it would be great if we > > > can verify for the regression tests. > > > > I have verified this using a few random test cases. For verifying > > this I have made some temporary code changes with an assert as shown > > below. Basically, on DecodeCommit we call > > ReorderBufferAddInvalidations function only for an assert checking. > > > > -void > > ReorderBufferAddInvalidations(ReorderBuffer *rb, TransactionId xid, > > XLogRecPtr > > lsn, Size nmsgs, > > - > > SharedInvalidationMessage *msgs) > > + > > SharedInvalidationMessage *msgs, bool commit) > > { > > ReorderBufferTXN *txn; > > > > txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true); > > - > > + if (commit) > > + { > > + Assert(txn->ninvalidations == nmsgs); > > + return; > > + } > > > > The result is that for a normal local test it works fine. But with > > regression suit, it hit an assert at many places because if the > > rollback of the subtransaction is involved then at commit time > > invalidation messages those are not logged whereas with command time > > invalidation those are logged. > > > > Yeah, somehow, we need to ignore rollback to savepoint tests and > verify for others. Yeah, I have run the regression suite, I can see a lot of failure maybe we can somehow see the diff and confirm that all the failures are due to rollback to savepoint only. I will work on this. > > > As of now, I have only put assert on the count, if we need to verify > > the exact messages then we might need to somehow categories the > > invalidation messages because the ordering of the messages will not be > > the same. For testing this we will have to arrange them by category > > i.e relcahce, catcache and then we can compare them. > > > > Can't we do this by verifying that each message at commit time exists > in the list of invalidation messages we have collected via processing > XLOG_XACT_INVALIDATIONS? Let me try what is the easiest way to test this. > > One additional question on patch > v30-0003-Extend-the-output-plugin-API-with-stream-methods: > +static void > +stream_commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn, > + XLogRecPtr apply_lsn) > { > .. > .. > + state.report_location = apply_lsn; > .. > .. > + ctx->write_location = apply_lsn; > .. > } > > Can't we name the last parameter as 'commit_lsn' as that is how > documentation in the patch spells it and it sounds more appropriate? You are right commit_lsn seems more appropriate here. > Also, is there a reason for assigning report_location and > write_location differently than what we do in commit_cb_wrapper? > Basically, assign those as txn->final_lsn and txn->end_lsn > respectively. Yes, I think it should be handled in same way as commit_cb_wrapper. Because before calling ReorderBufferStreamCommit in ReorderBufferCommit, we are properly updating the final_lsn as well as the end_lsn. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
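To make the agreed change concrete, the wrapper would simply mirror commit_cb_wrapper in logical.c; a sketch follows (illustrative, not the exact text of the next patch version):
static void
stream_commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
                         XLogRecPtr commit_lsn)
{
    LogicalDecodingContext *ctx = cache->private_data;
    LogicalErrorCallbackState state;
    ErrorContextCallback errcallback;

    /* Push callback + info on the error context stack */
    state.ctx = ctx;
    state.callback_name = "stream_commit";
    state.report_location = txn->final_lsn; /* beginning of commit record */
    errcallback.callback = output_plugin_error_callback;
    errcallback.arg = (void *) &state;
    errcallback.previous = error_context_stack;
    error_context_stack = &errcallback;

    /* Set output state. */
    ctx->accept_writes = true;
    ctx->write_xid = txn->xid;
    ctx->write_location = txn->end_lsn; /* points to the end of the record */

    /* Do the actual work: call the callback. */
    ctx->callbacks.stream_commit_cb(ctx, txn, commit_lsn);

    /* Pop the error context stack */
    error_context_stack = errcallback.previous;
}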
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Amit Kapila
Date:
On Tue, Jun 30, 2020 at 10:13 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Tue, Jun 30, 2020 at 9:20 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > Can't we name the last parameter as 'commit_lsn' as that is how > > documentation in the patch spells it and it sounds more appropriate? > > You are right commit_lsn seems more appropriate here. > > > Also, is there a reason for assigning report_location and > > write_location differently than what we do in commit_cb_wrapper? > > Basically, assign those as txn->final_lsn and txn->end_lsn > > respectively. > > Yes, I think it should be handled in same way as commit_cb_wrapper. > Because before calling ReorderBufferStreamCommit in > ReorderBufferCommit, we are properly updating the final_lsn as well as > the end_lsn. > Okay, I have made these changes in the attached patch and there are few more changes in 0003-Extend-the-output-plugin-API-with-stream-methods. 1. In pg_decode_stream_message, for transactional messages, we were displaying message contents which is different from other streaming APIs. I have changed it so that streaming API doesn't display message contents for transactional messages. 2. + /* in streaming mode, stream_change_cb is required */ + if (ctx->callbacks.stream_change_cb == NULL) + ereport(ERROR, + (errmsg("Output plugin supports streaming, but has not registered " + "stream_change_cb callback."))); The error messages seem a bit weird. (a) doesn't include error code, (b) not in PG style. I have changed all the error messages to fix these two issues and change the message as well 3. Rearranged the functions stream_* so that the optional functions are at the end and also arranged other functions in a way that looks more logical to me. 4. Updated comments, commit message, and edited docs in the patch. I have made a few changes in 0004-Gracefully-handle-concurrent-aborts-of-transacti as well. 1. The variable bsysscan was not being reset in case of error. I have introduced a new function to reset both bsysscan and CheckXidAlive during transaction abort. Also, snapmgr.c doesn't seem right place for these variables, so I moved them to xact.c. I think this will make the initialization of CheckXidAlive during catch in ReorderBufferProcessTXN redundant. 2. Updated comments and commit message. Let me know what you think about the above changes. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
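A rough sketch of the abort-time reset described in point 1 above (the function name is made up for the example; the posted patch may spell it differently):
/* In xact.c, next to the now-moved CheckXidAlive and bsysscan. */
void
ResetLogicalStreamingState(void)
{
    /* Forget the xid we were checking for concurrent aborts. */
    CheckXidAlive = InvalidTransactionId;

    /* We are no longer inside a catalog scan done on behalf of decoding. */
    bsysscan = false;
}
Calling this from AbortTransaction() and AbortSubTransaction() ensures an error exit cannot leave stale values behind, which is what makes the extra reset in the catch block of ReorderBufferProcessTXN redundant.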
Attachment
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Amit Kapila
Date:
On Tue, Jun 30, 2020 at 5:20 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > Let me know what you think about the above changes. > I went ahead and made few changes in 0005-Implement-streaming-mode-in-ReorderBuffer which are explained below. I have few questions and suggestions for the patch as well which are also covered in below points. 1. + if (prev_lsn == InvalidXLogRecPtr) + { + if (streaming) + rb->stream_start(rb, txn, change->lsn); + else + rb->begin(rb, txn); + stream_started = true; + } I don't think we want to move begin callback here that will change the existing semantics, so it is better to move begin at its original position. I have made the required changes in the attached patch. 2. ReorderBufferTruncateTXN() { .. + dlist_foreach_modify(iter, &txn->changes) + { + ReorderBufferChange *change; + + change = dlist_container(ReorderBufferChange, node, iter.cur); + + /* remove the change from it's containing list */ + dlist_delete(&change->node); + + ReorderBufferReturnChange(rb, change); + } .. } I think here we can add an Assert that we're not mixing changes from different transactions. See the changes in the patch. 3. SetupCheckXidLive() { .. + /* + * setup CheckXidAlive if it's not committed yet. We don't check if the xid + * aborted. That will happen during catalog access. Also, reset the + * bsysscan flag. + */ + if (!TransactionIdDidCommit(xid)) + { + CheckXidAlive = xid; + bsysscan = false; .. } What is the need to reset bsysscan flag here if we are already resetting on error (like in the previous patch sent by me)? 4. ReorderBufferProcessTXN() { .. .. + /* Reset the CheckXidAlive */ + if (streaming) + CheckXidAlive = InvalidTransactionId; .. } Similar to the previous point, we don't need this as well because AbortCurrentTransaction would have taken care of this. 5. + * XXX Do we need to check if the transaction has some changes to stream + * (maybe it got streamed right before the commit, which attempts to + * stream it again before the commit)? + */ +static void +ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn) The above comment doesn't make much sense to me, so I have removed it. Basically, if there are no changes before commit, we still need to send commit and anyway if there are no more changes ReorderBufferProcessTXN will not do anything. 6. +ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn) { .. if (txn->snapshot_now == NULL) + { + dlist_iter subxact_i; + + /* make sure this transaction is streamed for the first time */ + Assert(!rbtxn_is_streamed(txn)); + + /* at the beginning we should have invalid command ID */ + Assert(txn->command_id == InvalidCommandId); + + dlist_foreach(subxact_i, &txn->subtxns) + { + ReorderBufferTXN *subtxn; + + subtxn = dlist_container(ReorderBufferTXN, node, subxact_i.cur); + ReorderBufferTransferSnapToParent(txn, subtxn); + } .. } Here, it is possible that there is no base_snapshot for txn, so we need a check for that similar to ReorderBufferCommit. 7. Apart from the above, I made few changes in comments and ran pgindent. 8. We can't stream the transaction before we reach the SNAPBUILD_CONSISTENT state because some other output plugin can apply those changes unlike what we do with pgoutput plugin (which writes to file). And, I think applying the transactions without reaching a consistent state would be anyway wrong. So, we should avoid that and if do that then we should have an Assert for streamed txns rather than sending abort for them in ReorderBufferForget. 9. 
+ReorderBufferHandleConcurrentAbort(ReorderBuffer *rb, ReorderBufferTXN *txn, { .. + ReorderBufferToastReset(rb, txn); + if (specinsert != NULL) + ReorderBufferReturnChange(rb, specinsert); .. } Why do we need to do these here when we wouldn't have been done for any exception other than ERRCODE_TRANSACTION_ROLLBACK? 10. I have got the below failure once. I have not investigated this in detail as the patch is still under progress. See, if you have any idea? # Failed test 'check extra columns contain local defaults' # at t/013_stream_subxact_ddl_abort.pl line 81. # got: '2|0' # expected: '1000|500' # Looks like you failed 1 test of 2. make[2]: *** [check] Error 1 make[1]: *** [check-subscription-recurse] Error 2 make[1]: *** Waiting for unfinished jobs.... make: *** [check-world-src/test-recurse] Error 2 11. Can we test by introducing a new GUC such that all the transactions (at least in existing tests) start to stream? Basically, it will allow us to disregard logical_decoding_work_mem and ensure that all regression tests pass through new-code. Note, I am suggesting this just for testing purposes, not for actual integration in the code. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
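For point 6 above, the guard could look similar to the early exit in ReorderBufferCommit; a minimal sketch near the top of ReorderBufferStreamTXN (whether any additional cleanup is wanted at that point is left open here):
/*
 * A transaction that never built a base snapshot cannot have produced
 * decodable changes yet, so there is nothing to stream.
 */
if (txn->base_snapshot == NULL)
    return;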
Attachment
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Dilip Kumar
Date:
On Tue, Jun 30, 2020 at 5:20 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Tue, Jun 30, 2020 at 10:13 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > On Tue, Jun 30, 2020 at 9:20 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > Can't we name the last parameter as 'commit_lsn' as that is how > > > documentation in the patch spells it and it sounds more appropriate? > > > > You are right commit_lsn seems more appropriate here. > > > > > Also, is there a reason for assigning report_location and > > > write_location differently than what we do in commit_cb_wrapper? > > > Basically, assign those as txn->final_lsn and txn->end_lsn > > > respectively. > > > > Yes, I think it should be handled in same way as commit_cb_wrapper. > > Because before calling ReorderBufferStreamCommit in > > ReorderBufferCommit, we are properly updating the final_lsn as well as > > the end_lsn. > > > > Okay, I have made these changes in the attached patch and there are > few more changes in > 0003-Extend-the-output-plugin-API-with-stream-methods. > 1. In pg_decode_stream_message, for transactional messages, we were > displaying message contents which is different from other streaming > APIs. I have changed it so that streaming API doesn't display message > contents for transactional messages. Ok, make sense. > 2. > + /* in streaming mode, stream_change_cb is required */ > + if (ctx->callbacks.stream_change_cb == NULL) > + ereport(ERROR, > + (errmsg("Output plugin supports streaming, but has not registered " > + "stream_change_cb callback."))); > > The error messages seem a bit weird. (a) doesn't include error code, > (b) not in PG style. I have changed all the error messages to fix > these two issues and change the message as well ok > 3. Rearranged the functions stream_* so that the optional functions > are at the end and also arranged other functions in a way that looks > more logical to me. Make sense to me. > 4. Updated comments, commit message, and edited docs in the patch. > > I have made a few changes in > 0004-Gracefully-handle-concurrent-aborts-of-transacti as well. > 1. The variable bsysscan was not being reset in case of error. I have > introduced a new function to reset both bsysscan and CheckXidAlive > during transaction abort. Also, snapmgr.c doesn't seem right place > for these variables, so I moved them to xact.c. I think this will > make the initialization of CheckXidAlive during catch in > ReorderBufferProcessTXN redundant. That looks better. > 2. Updated comments and commit message. > > Let me know what you think about the above changes. All the above changes look good to me and I will include in the next version. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Dilip Kumar
Date:
On Sat, Jul 4, 2020 at 11:35 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Tue, Jun 30, 2020 at 5:20 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > Let me know what you think about the above changes. > > > > I went ahead and made few changes in > 0005-Implement-streaming-mode-in-ReorderBuffer which are explained > below. I have few questions and suggestions for the patch as well > which are also covered in below points. > > 1. > + if (prev_lsn == InvalidXLogRecPtr) > + { > + if (streaming) > + rb->stream_start(rb, txn, change->lsn); > + else > + rb->begin(rb, txn); > + stream_started = true; > + } > > I don't think we want to move begin callback here that will change the > existing semantics, so it is better to move begin at its original > position. I have made the required changes in the attached patch. Looks good to me. > 2. > ReorderBufferTruncateTXN() > { > .. > + dlist_foreach_modify(iter, &txn->changes) > + { > + ReorderBufferChange *change; > + > + change = dlist_container(ReorderBufferChange, node, iter.cur); > + > + /* remove the change from it's containing list */ > + dlist_delete(&change->node); > + > + ReorderBufferReturnChange(rb, change); > + } > .. > } > > I think here we can add an Assert that we're not mixing changes from > different transactions. See the changes in the patch. Looks fine. > 3. > SetupCheckXidLive() > { > .. > + /* > + * setup CheckXidAlive if it's not committed yet. We don't check if the xid > + * aborted. That will happen during catalog access. Also, reset the > + * bsysscan flag. > + */ > + if (!TransactionIdDidCommit(xid)) > + { > + CheckXidAlive = xid; > + bsysscan = false; > .. > } > > What is the need to reset bsysscan flag here if we are already > resetting on error (like in the previous patch sent by me)? Yeah, now we don't not need this. > 4. > ReorderBufferProcessTXN() > { > .. > .. > + /* Reset the CheckXidAlive */ > + if (streaming) > + CheckXidAlive = InvalidTransactionId; > .. > } > > Similar to the previous point, we don't need this as well because > AbortCurrentTransaction would have taken care of this. Right > 5. > + * XXX Do we need to check if the transaction has some changes to stream > + * (maybe it got streamed right before the commit, which attempts to > + * stream it again before the commit)? > + */ > +static void > +ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn) > > The above comment doesn't make much sense to me, so I have removed it. > Basically, if there are no changes before commit, we still need to > send commit and anyway if there are no more changes > ReorderBufferProcessTXN will not do anything. ok > 6. > +ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn) > { > .. > if (txn->snapshot_now == NULL) > + { > + dlist_iter subxact_i; > + > + /* make sure this transaction is streamed for the first time */ > + Assert(!rbtxn_is_streamed(txn)); > + > + /* at the beginning we should have invalid command ID */ > + Assert(txn->command_id == InvalidCommandId); > + > + dlist_foreach(subxact_i, &txn->subtxns) > + { > + ReorderBufferTXN *subtxn; > + > + subtxn = dlist_container(ReorderBufferTXN, node, subxact_i.cur); > + ReorderBufferTransferSnapToParent(txn, subtxn); > + } > .. > } > > Here, it is possible that there is no base_snapshot for txn, so we > need a check for that similar to ReorderBufferCommit. > > 7. Apart from the above, I made few changes in comments and ran pgindent. Ok > 8. 
We can't stream the transaction before we reach the > SNAPBUILD_CONSISTENT state because some other output plugin can apply > those changes unlike what we do with pgoutput plugin (which writes to > file). And, I think applying the transactions without reaching a > consistent state would be anyway wrong. So, we should avoid that and > if do that then we should have an Assert for streamed txns rather than > sending abort for them in ReorderBufferForget. I will work on this point. > 9. > +ReorderBufferHandleConcurrentAbort(ReorderBuffer *rb, ReorderBufferTXN *txn, > { > .. > + ReorderBufferToastReset(rb, txn); > + if (specinsert != NULL) > + ReorderBufferReturnChange(rb, specinsert); > .. > } > > Why do we need to do these here when we wouldn't have been done for > any exception other than ERRCODE_TRANSACTION_ROLLBACK? Because we are handling this exception "ERRCODE_TRANSACTION_ROLLBACK" gracefully and we are continuing with further decoding so we need to return this change back. > 10. I have got the below failure once. I have not investigated this > in detail as the patch is still under progress. See, if you have any > idea? > # Failed test 'check extra columns contain local defaults' > # at t/013_stream_subxact_ddl_abort.pl line 81. > # got: '2|0' > # expected: '1000|500' > # Looks like you failed 1 test of 2. > make[2]: *** [check] Error 1 > make[1]: *** [check-subscription-recurse] Error 2 > make[1]: *** Waiting for unfinished jobs.... > make: *** [check-world-src/test-recurse] Error 2 Even I got the failure once and after that, it did not reproduce. I have executed it multiple time but it did not reproduce again. Are you able to reproduce it consistently? > 11. Can we test by introducing a new GUC such that all the > transactions (at least in existing tests) start to stream? Basically, > it will allow us to disregard logical_decoding_work_mem and ensure > that all regression tests pass through new-code. Note, I am > suggesting this just for testing purposes, not for actual integration > in the code. Yeah, that's a good suggestion. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Dilip Kumar
Date:
On Tue, Jun 30, 2020 at 10:13 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Tue, Jun 30, 2020 at 9:20 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > On Mon, Jun 29, 2020 at 4:24 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > > > On Mon, Jun 22, 2020 at 4:30 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > > > > > > Other than above tests, can we somehow verify that the invalidations > > > > generated at commit time are the same as what we do with this patch? > > > > We have verified with individual commands but it would be great if we > > > > can verify for the regression tests. > > > > > > I have verified this using a few random test cases. For verifying > > > this I have made some temporary code changes with an assert as shown > > > below. Basically, on DecodeCommit we call > > > ReorderBufferAddInvalidations function only for an assert checking. > > > > > > -void > > > ReorderBufferAddInvalidations(ReorderBuffer *rb, TransactionId xid, > > > XLogRecPtr > > > lsn, Size nmsgs, > > > - > > > SharedInvalidationMessage *msgs) > > > + > > > SharedInvalidationMessage *msgs, bool commit) > > > { > > > ReorderBufferTXN *txn; > > > > > > txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true); > > > - > > > + if (commit) > > > + { > > > + Assert(txn->ninvalidations == nmsgs); > > > + return; > > > + } > > > > > > The result is that for a normal local test it works fine. But with > > > regression suit, it hit an assert at many places because if the > > > rollback of the subtransaction is involved then at commit time > > > invalidation messages those are not logged whereas with command time > > > invalidation those are logged. > > > > > > > Yeah, somehow, we need to ignore rollback to savepoint tests and > > verify for others. > > Yeah, I have run the regression suite, I can see a lot of failure > maybe we can somehow see the diff and confirm that all the failures > are due to rollback to savepoint only. I will work on this. I have compared the changes logged at command end vs logged at commit time. I have ignored the invalidation for the transaction which has any aborted subtransaction in it. While testing this I found one issue, the issue is that if there are some invalidation generated between last command counter increment and the commit transaction then those were not logged. I have fixed the issue by logging the pending invalidation in RecordTransactionCommit. I will include the changes in the next patch set. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
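Concretely, the fix described above amounts to flushing any still-pending invalidations from the commit path, roughly as follows (a sketch; the exact placement inside RecordTransactionCommit is up to the next patch version):
/* In RecordTransactionCommit(), only when WAL carries logical info. */
if (XLogLogicalInfoActive())
    LogLogicalInvalidations();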
Attachment
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Amit Kapila
Date:
On Sun, Jul 5, 2020 at 4:47 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Sat, Jul 4, 2020 at 11:35 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > 9. > > +ReorderBufferHandleConcurrentAbort(ReorderBuffer *rb, ReorderBufferTXN *txn, > > { > > .. > > + ReorderBufferToastReset(rb, txn); > > + if (specinsert != NULL) > > + ReorderBufferReturnChange(rb, specinsert); > > .. > > } > > > > Why do we need to do these here when we wouldn't have been done for > > any exception other than ERRCODE_TRANSACTION_ROLLBACK? > > Because we are handling this exception "ERRCODE_TRANSACTION_ROLLBACK" > gracefully and we are continuing with further decoding so we need to > return this change back. > Okay, then I suggest we should do these before calling stream_stop and also move ReorderBufferResetTXN after calling stream_stop to follow a pattern similar to try block unless there is a reason for not doing so. Also, it would be good if we can initialize specinsert with NULL after returning the change as we are doing at other places. > > 10. I have got the below failure once. I have not investigated this > > in detail as the patch is still under progress. See, if you have any > > idea? > > # Failed test 'check extra columns contain local defaults' > > # at t/013_stream_subxact_ddl_abort.pl line 81. > > # got: '2|0' > > # expected: '1000|500' > > # Looks like you failed 1 test of 2. > > make[2]: *** [check] Error 1 > > make[1]: *** [check-subscription-recurse] Error 2 > > make[1]: *** Waiting for unfinished jobs.... > > make: *** [check-world-src/test-recurse] Error 2 > > Even I got the failure once and after that, it did not reproduce. I > have executed it multiple time but it did not reproduce again. Are > you able to reproduce it consistently? > No, I am also not able to reproduce it consistently but I think this can fail if a subscriber sends the replay_location before actually replaying the changes. First, I thought that extra send_feedback we have in apply_handle_stream_commit might have caused this but I guess that can't happen because we need the commit time location for that and we are storing the same at the end of apply_handle_stream_commit after applying all messages. I am not sure what is going on here. I think we somehow need to reproduce this or some variant of this test consistently to find the root cause. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Dilip Kumar
Date:
On Mon, Jul 6, 2020 at 11:31 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Sun, Jul 5, 2020 at 4:47 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > On Sat, Jul 4, 2020 at 11:35 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > > 9. > > > +ReorderBufferHandleConcurrentAbort(ReorderBuffer *rb, ReorderBufferTXN *txn, > > > { > > > .. > > > + ReorderBufferToastReset(rb, txn); > > > + if (specinsert != NULL) > > > + ReorderBufferReturnChange(rb, specinsert); > > > .. > > > } > > > > > > Why do we need to do these here when we wouldn't have been done for > > > any exception other than ERRCODE_TRANSACTION_ROLLBACK? > > > > Because we are handling this exception "ERRCODE_TRANSACTION_ROLLBACK" > > gracefully and we are continuing with further decoding so we need to > > return this change back. > > > > Okay, then I suggest we should do these before calling stream_stop and > also move ReorderBufferResetTXN after calling stream_stop to follow a > pattern similar to try block unless there is a reason for not doing > so. Also, it would be good if we can initialize specinsert with NULL > after returning the change as we are doing at other places. Okay > > > 10. I have got the below failure once. I have not investigated this > > > in detail as the patch is still under progress. See, if you have any > > > idea? > > > # Failed test 'check extra columns contain local defaults' > > > # at t/013_stream_subxact_ddl_abort.pl line 81. > > > # got: '2|0' > > > # expected: '1000|500' > > > # Looks like you failed 1 test of 2. > > > make[2]: *** [check] Error 1 > > > make[1]: *** [check-subscription-recurse] Error 2 > > > make[1]: *** Waiting for unfinished jobs.... > > > make: *** [check-world-src/test-recurse] Error 2 > > > > Even I got the failure once and after that, it did not reproduce. I > > have executed it multiple time but it did not reproduce again. Are > > you able to reproduce it consistently? > > > > No, I am also not able to reproduce it consistently but I think this > can fail if a subscriber sends the replay_location before actually > replaying the changes. First, I thought that extra send_feedback we > have in apply_handle_stream_commit might have caused this but I guess > that can't happen because we need the commit time location for that > and we are storing the same at the end of apply_handle_stream_commit > after applying all messages. I am not sure what is going on here. I > think we somehow need to reproduce this or some variant of this test > consistently to find the root cause. And I think it appeared first time for me, so maybe either induced from past few versions so some changes in the last few versions might have exposed it. I have noticed that almost 50% of the time I am able to reproduce after the clean build so I can trace back from which version it started appearing that way it will be easy to narrow down. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Amit Kapila
Date:
On Mon, Jul 6, 2020 at 11:44 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Mon, Jul 6, 2020 at 11:31 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > 10. I have got the below failure once. I have not investigated this > > > > in detail as the patch is still under progress. See, if you have any > > > > idea? > > > > # Failed test 'check extra columns contain local defaults' > > > > # at t/013_stream_subxact_ddl_abort.pl line 81. > > > > # got: '2|0' > > > > # expected: '1000|500' > > > > # Looks like you failed 1 test of 2. > > > > make[2]: *** [check] Error 1 > > > > make[1]: *** [check-subscription-recurse] Error 2 > > > > make[1]: *** Waiting for unfinished jobs.... > > > > make: *** [check-world-src/test-recurse] Error 2 > > > > > > Even I got the failure once and after that, it did not reproduce. I > > > have executed it multiple time but it did not reproduce again. Are > > > you able to reproduce it consistently? > > > > > > > No, I am also not able to reproduce it consistently but I think this > > can fail if a subscriber sends the replay_location before actually > > replaying the changes. First, I thought that extra send_feedback we > > have in apply_handle_stream_commit might have caused this but I guess > > that can't happen because we need the commit time location for that > > and we are storing the same at the end of apply_handle_stream_commit > > after applying all messages. I am not sure what is going on here. I > > think we somehow need to reproduce this or some variant of this test > > consistently to find the root cause. > > And I think it appeared first time for me, so maybe either induced > from past few versions so some changes in the last few versions might > have exposed it. I have noticed that almost 50% of the time I am able > to reproduce after the clean build so I can trace back from which > version it started appearing that way it will be easy to narrow down. > One more comment ReorderBufferLargestTopTXN { .. dlist_foreach(iter, &rb->toplevel_by_lsn) { ReorderBufferTXN *txn; + Size size = 0; + Size largest_size = 0; txn = dlist_container(ReorderBufferTXN, node, iter.cur); - /* if the current transaction is larger, remember it */ - if ((!largest) || (txn->size > largest->size)) + /* + * If this transaction have some incomplete changes then only consider + * the size upto last complete lsn. + */ + if (rbtxn_has_incomplete_tuple(txn)) + size = txn->complete_size; + else + size = txn->total_size; + + /* If the current transaction is larger then remember it. */ + if ((largest != NULL || size > largest_size) && size > 0) Here largest_size is a local variable inside the loop which is initialized to 0 in each iteration and that will lead to picking each next txn as largest. This seems wrong to me. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Dilip Kumar
Date:
On Mon, Jul 6, 2020 at 3:09 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Mon, Jul 6, 2020 at 11:44 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > On Mon, Jul 6, 2020 at 11:31 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > > > > 10. I have got the below failure once. I have not investigated this > > > > > in detail as the patch is still under progress. See, if you have any > > > > > idea? > > > > > # Failed test 'check extra columns contain local defaults' > > > > > # at t/013_stream_subxact_ddl_abort.pl line 81. > > > > > # got: '2|0' > > > > > # expected: '1000|500' > > > > > # Looks like you failed 1 test of 2. > > > > > make[2]: *** [check] Error 1 > > > > > make[1]: *** [check-subscription-recurse] Error 2 > > > > > make[1]: *** Waiting for unfinished jobs.... > > > > > make: *** [check-world-src/test-recurse] Error 2 > > > > > > > > Even I got the failure once and after that, it did not reproduce. I > > > > have executed it multiple time but it did not reproduce again. Are > > > > you able to reproduce it consistently? > > > > > > > > > > No, I am also not able to reproduce it consistently but I think this > > > can fail if a subscriber sends the replay_location before actually > > > replaying the changes. First, I thought that extra send_feedback we > > > have in apply_handle_stream_commit might have caused this but I guess > > > that can't happen because we need the commit time location for that > > > and we are storing the same at the end of apply_handle_stream_commit > > > after applying all messages. I am not sure what is going on here. I > > > think we somehow need to reproduce this or some variant of this test > > > consistently to find the root cause. > > > > And I think it appeared first time for me, so maybe either induced > > from past few versions so some changes in the last few versions might > > have exposed it. I have noticed that almost 50% of the time I am able > > to reproduce after the clean build so I can trace back from which > > version it started appearing that way it will be easy to narrow down. > > > > One more comment > ReorderBufferLargestTopTXN > { > .. > dlist_foreach(iter, &rb->toplevel_by_lsn) > { > ReorderBufferTXN *txn; > + Size size = 0; > + Size largest_size = 0; > > txn = dlist_container(ReorderBufferTXN, node, iter.cur); > > - /* if the current transaction is larger, remember it */ > - if ((!largest) || (txn->size > largest->size)) > + /* > + * If this transaction have some incomplete changes then only consider > + * the size upto last complete lsn. > + */ > + if (rbtxn_has_incomplete_tuple(txn)) > + size = txn->complete_size; > + else > + size = txn->total_size; > + > + /* If the current transaction is larger then remember it. */ > + if ((largest != NULL || size > largest_size) && size > 0) > > Here largest_size is a local variable inside the loop which is > initialized to 0 in each iteration and that will lead to picking each > next txn as largest. This seems wrong to me. You are right, will fix. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
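For reference, a sketch of the corrected selection loop, with the running maximum kept outside the loop (variable and field names follow the quoted patch code; this is not necessarily the exact fix that will be posted):
static ReorderBufferTXN *
ReorderBufferLargestTopTXN(ReorderBuffer *rb)
{
    dlist_iter  iter;
    ReorderBufferTXN *largest = NULL;
    Size        largest_size = 0;   /* must persist across iterations */

    dlist_foreach(iter, &rb->toplevel_by_lsn)
    {
        ReorderBufferTXN *txn;
        Size        size;

        txn = dlist_container(ReorderBufferTXN, node, iter.cur);

        /*
         * If this transaction has some incomplete changes, only consider
         * the size up to the last complete lsn.
         */
        if (rbtxn_has_incomplete_tuple(txn))
            size = txn->complete_size;
        else
            size = txn->total_size;

        /* Remember the largest transaction seen so far (skips empty ones). */
        if (size > largest_size)
        {
            largest = txn;
            largest_size = size;
        }
    }

    return largest;
}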
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Amit Kapila
Date:
On Sun, Jul 5, 2020 at 8:37 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Tue, Jun 30, 2020 at 10:13 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > > > Yeah, I have run the regression suite, I can see a lot of failure > > maybe we can somehow see the diff and confirm that all the failures > > are due to rollback to savepoint only. I will work on this. > > I have compared the changes logged at command end vs logged at commit > time. I have ignored the invalidation for the transaction which has > any aborted subtransaction in it. While testing this I found one > issue, the issue is that if there are some invalidation generated > between last command counter increment and the commit transaction then > those were not logged. I have fixed the issue by logging the pending > invalidation in RecordTransactionCommit. I will include the changes > in the next patch set. > I think it would have been better if you could have given examples for such cases where you need this extra logging. Anyway, below are few minor comments on this patch: 1. + /* + * Log any pending invalidations which are adding between the last + * command counter increment and the commit. + */ + if (XLogLogicalInfoActive()) + LogLogicalInvalidations(); I think we can change this comment slightly and extend a bit to say for which kind of special cases we are adding this. "Log any pending invalidations which are added between the last CommandCounterIncrement and the commit. Normally for DDLs, we log this at each command end, however for certain cases where we directly update the system table the invalidations were not logged at command end." Something like above based on cases that are not covered by command end WAL logging. 2. + * Emit WAL for invalidations. This is currently only used for logging + * invalidations at the command end. + */ +void +LogLogicalInvalidations() After this is getting used at a new place, it is better to modify the above comment to something like: "Emit WAL for invalidations. This is currently only used for logging invalidations at the command end or at commit time if any invalidations are pending." -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Amit Kapila
Date:
On Wed, Jul 8, 2020 at 9:36 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Sun, Jul 5, 2020 at 8:37 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > I have compared the changes logged at command end vs logged at commit > > time. I have ignored the invalidation for the transaction which has > > any aborted subtransaction in it. While testing this I found one > > issue, the issue is that if there are some invalidation generated > > between last command counter increment and the commit transaction then > > those were not logged. I have fixed the issue by logging the pending > > invalidation in RecordTransactionCommit. I will include the changes > > in the next patch set. > > > > I think it would have been better if you could have given examples for > such cases where you need this extra logging. Anyway, below are few > minor comments on this patch: > > 1. > + /* > + * Log any pending invalidations which are adding between the last > + * command counter increment and the commit. > + */ > + if (XLogLogicalInfoActive()) > + LogLogicalInvalidations(); > > I think we can change this comment slightly and extend a bit to say > for which kind of special cases we are adding this. "Log any pending > invalidations which are added between the last CommandCounterIncrement > and the commit. Normally for DDLs, we log this at each command end, > however for certain cases where we directly update the system table > the invalidations were not logged at command end." > > Something like above based on cases that are not covered by command > end WAL logging. > > 2. > + * Emit WAL for invalidations. This is currently only used for logging > + * invalidations at the command end. > + */ > +void > +LogLogicalInvalidations() > > After this is getting used at a new place, it is better to modify the > above comment to something like: "Emit WAL for invalidations. This is > currently only used for logging invalidations at the command end or at > commit time if any invalidations are pending." > I have done some more review and below are my comments: Review-v31-0010-Provide-new-api-to-get-the-streaming-changes ---------------------------------------------------------------------------------------------- 1. --- a/src/backend/catalog/system_views.sql +++ b/src/backend/catalog/system_views.sql @@ -1240,6 +1240,14 @@ LANGUAGE INTERNAL VOLATILE ROWS 1000 COST 1000 AS 'pg_logical_slot_get_changes'; +CREATE OR REPLACE FUNCTION pg_logical_slot_get_streaming_changes( + IN slot_name name, IN upto_lsn pg_lsn, IN upto_nchanges int, VARIADIC options text[] DEFAULT '{}', + OUT lsn pg_lsn, OUT xid xid, OUT data text) +RETURNS SETOF RECORD +LANGUAGE INTERNAL +VOLATILE ROWS 1000 COST 1000 +AS 'pg_logical_slot_get_streaming_changes'; If we are going to add a new streaming API for get_changes, don't we need for pg_logical_slot_get_binary_changes, pg_logical_slot_peek_changes and pg_logical_slot_peek_binary_changes as well? I was thinking why not add a new parameter (streaming boolean) instead of adding the new APIs. This could be an optional parameter which if user doesn't specify will be considered as false. We already have optional parameters for APIs like pg_create_logical_replication_slot. 2. You forgot to update sgml/func.sgml. This will be required even if we decide to add a new parameter instead of a new API. 3. + /* If called has not asked for streaming changes then disable it. */ + ctx->streaming &= streaming; /If called/If the caller 4. 
diff --git a/.gitignore b/.gitignore index 794e35b..6083744 100644 --- a/.gitignore +++ b/.gitignore @@ -42,3 +42,4 @@ lib*.pc /Debug/ /Release/ /tmp_install/ +/build/ Why the patch contains this change? 5. If I apply the first six patches and run the regressions, it fails primarily because streaming got enabled by default. And then when I applied this patch, the tests passed because it disables streaming by default. I think this should be patch 0007. Replication Origins ------------------------------ I think we also need to conclude on origins related discussion [1]. As far as I can see, the origin_id can be sent with the first startup message. The origin_lsn and origin_commit can be sent with the last start of streaming commit if we want but not sure if that is of use. If we need to send it earlier then we need to record it with other WAL records. The point is that those are set with pg_replication_origin_xact_setup but not sure how and when that function is called. The other alternative is that we can ignore that for now and once the usage is clear we can enhance it. What do you think? [1] - https://www.postgresql.org/message-id/CAA4eK1JwXaCezFw%2BkZwoxbLKYD0nWpC2rPgx7vUsaDAc0AZaow%40mail.gmail.com -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Ajin Cherian
Date:
I was going through this thread, testing and reviewing the patches; I think this is a great feature to have and one which customers would appreciate. I wanted to help out, and I saw a request for a test patch for a GUC to always enable streaming on logical replication. Here's one on top of patchset v31, just in case you still need it. By default the GUC is turned on; I ran the regression tests with it and didn't see any errors.
thanks,
Ajin
Fujitsu Australia
On Wed, Jul 8, 2020 at 8:02 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
On Wed, Jul 8, 2020 at 9:36 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Sun, Jul 5, 2020 at 8:37 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > I have compared the changes logged at command end vs logged at commit
> > time. I have ignored the invalidation for the transaction which has
> > any aborted subtransaction in it. While testing this I found one
> > issue, the issue is that if there are some invalidation generated
> > between last command counter increment and the commit transaction then
> > those were not logged. I have fixed the issue by logging the pending
> > invalidation in RecordTransactionCommit. I will include the changes
> > in the next patch set.
> >
>
> I think it would have been better if you could have given examples for
> such cases where you need this extra logging. Anyway, below are few
> minor comments on this patch:
>
> 1.
> + /*
> + * Log any pending invalidations which are adding between the last
> + * command counter increment and the commit.
> + */
> + if (XLogLogicalInfoActive())
> + LogLogicalInvalidations();
>
> I think we can change this comment slightly and extend a bit to say
> for which kind of special cases we are adding this. "Log any pending
> invalidations which are added between the last CommandCounterIncrement
> and the commit. Normally for DDLs, we log this at each command end,
> however for certain cases where we directly update the system table
> the invalidations were not logged at command end."
>
> Something like above based on cases that are not covered by command
> end WAL logging.
>
> 2.
> + * Emit WAL for invalidations. This is currently only used for logging
> + * invalidations at the command end.
> + */
> +void
> +LogLogicalInvalidations()
>
> After this is getting used at a new place, it is better to modify the
> above comment to something like: "Emit WAL for invalidations. This is
> currently only used for logging invalidations at the command end or at
> commit time if any invalidations are pending."
>
I have done some more review and below are my comments:
Review-v31-0010-Provide-new-api-to-get-the-streaming-changes
----------------------------------------------------------------------------------------------
1.
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1240,6 +1240,14 @@ LANGUAGE INTERNAL
VOLATILE ROWS 1000 COST 1000
AS 'pg_logical_slot_get_changes';
+CREATE OR REPLACE FUNCTION pg_logical_slot_get_streaming_changes(
+ IN slot_name name, IN upto_lsn pg_lsn, IN upto_nchanges int,
VARIADIC options text[] DEFAULT '{}',
+ OUT lsn pg_lsn, OUT xid xid, OUT data text)
+RETURNS SETOF RECORD
+LANGUAGE INTERNAL
+VOLATILE ROWS 1000 COST 1000
+AS 'pg_logical_slot_get_streaming_changes';
If we are going to add a new streaming API for get_changes, don't we
need for pg_logical_slot_get_binary_changes,
pg_logical_slot_peek_changes and pg_logical_slot_peek_binary_changes
as well? I was thinking why not add a new parameter (streaming
boolean) instead of adding the new APIs. This could be an optional
parameter which if user doesn't specify will be considered as false.
We already have optional parameters for APIs like
pg_create_logical_replication_slot.
2. You forgot to update sgml/func.sgml. This will be required even if
we decide to add a new parameter instead of a new API.
3.
+ /* If called has not asked for streaming changes then disable it. */
+ ctx->streaming &= streaming;
/If called/If the caller
4.
diff --git a/.gitignore b/.gitignore
index 794e35b..6083744 100644
--- a/.gitignore
+++ b/.gitignore
@@ -42,3 +42,4 @@ lib*.pc
/Debug/
/Release/
/tmp_install/
+/build/
Why the patch contains this change?
5. If I apply the first six patches and run the regressions, it fails
primarily because streaming got enabled by default. And then when I
applied this patch, the tests passed because it disables streaming by
default. I think this should be patch 0007.
Replication Origins
------------------------------
I think we also need to conclude on origins related discussion [1].
As far as I can see, the origin_id can be sent with the first startup
message. The origin_lsn and origin_commit can be sent with the last
start of streaming commit if we want but not sure if that is of use.
If we need to send it earlier then we need to record it with other WAL
records. The point is that those are set with
pg_replication_origin_xact_setup but not sure how and when that
function is called. The other alternative is that we can ignore that
for now and once the usage is clear we can enhance it. What do you
think?
[1] - https://www.postgresql.org/message-id/CAA4eK1JwXaCezFw%2BkZwoxbLKYD0nWpC2rPgx7vUsaDAc0AZaow%40mail.gmail.com
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Attachment
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Amit Kapila
Date:
On Wed, Jul 8, 2020 at 7:31 PM Ajin Cherian <itsajin@gmail.com> wrote: > > I was going through this thread, testing and reviewing the patches; I think this is a great feature to have and one which customers would appreciate. I wanted to help out, and I saw a request for a test patch for a GUC to always enable streaming on logical replication. Here's one on top of patchset v31, just in case you still need it. By default the GUC is turned on; I ran the regression tests with it and didn't see any errors. > Thanks for showing the interest in the patch. How have you ensured that streaming is happening? I don't think the proposed patch can ensure it for every case, because we also rely on logical_decoding_work_mem to decide whether to stream/spill; see ReorderBufferCheckMemoryLimit. I think with your patch it will allow streaming for cases where we have a large amount of WAL to decode. I feel you need to add some DEBUG messages (or some other way) to ensure that all existing and new test cases related to logical decoding will perform the streaming. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Ajin Cherian
Date:
On Thu, Jul 9, 2020 at 12:28 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
On Wed, Jul 8, 2020 at 7:31 PM Ajin Cherian <itsajin@gmail.com> wrote:
Thanks for showing the interest in patch. How have you ensured that
streaming is happening? I don't think the proposed patch can ensure
it for every case because we also rely on logical_decoding_work_mem to
decide whether to stream/spill, see ReorderBufferCheckMemoryLimit. I
think with your patch it will allow streaming for cases where we have
large amount of WAL to decode.
Maybe I missed something, but I looked at ReorderBufferCheckMemoryLimit; even there it checks the same function ReorderBufferCanStream() and decides whether to stream or spill. Did I miss something?
while (rb->size >= logical_decoding_work_mem * 1024L)
{
/*
* Pick the largest transaction (or subtransaction) and evict it from
* memory by streaming, if supported. Otherwise, spill to disk.
*/
if (ReorderBufferCanStream(rb) &&
(txn = ReorderBufferLargestTopTXN(rb)) != NULL)
{
/* we know there has to be one, because the size is not zero */
Assert(txn && !txn->toptxn);
Assert(txn->total_size > 0);
Assert(rb->size >= txn->total_size);
ReorderBufferStreamTXN(rb, txn);
}
else
I will also add the debug messages and tests as you suggested.
regards,
Ajin Cherian
Fujitsu Australia
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Amit Kapila
Date:
On Thu, Jul 9, 2020 at 8:18 AM Ajin Cherian <itsajin@gmail.com> wrote: > > On Thu, Jul 9, 2020 at 12:28 PM Amit Kapila <amit.kapila16@gmail.com> wrote: >> >> On Wed, Jul 8, 2020 at 7:31 PM Ajin Cherian <itsajin@gmail.com> wrote: >> >> Thanks for showing the interest in patch. How have you ensured that >> streaming is happening? I don't think the proposed patch can ensure >> it for every case because we also rely on logical_decoding_work_mem to >> decide whether to stream/spill, see ReorderBufferCheckMemoryLimit. I >> think with your patch it will allow streaming for cases where we have >> large amount of WAL to decode. >> > > Maybe I missed something but I looked at ReorderBufferCheckMemoryLimit, even there it checks the same function ReorderBufferCanStream() and decides whether to stream or spill. Did I miss something? > > while (rb->size >= logical_decoding_work_mem * 1024L) > { There is a check before above loop: ReorderBufferCheckMemoryLimit(ReorderBuffer *rb) { ReorderBufferTXN *txn; /* bail out if we haven't exceeded the memory limit */ if (rb->size < logical_decoding_work_mem * 1024L) return; This will prevent the streaming/spill to occur. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Dilip Kumar
Date:
On Thu, Jul 9, 2020 at 8:47 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Thu, Jul 9, 2020 at 8:18 AM Ajin Cherian <itsajin@gmail.com> wrote: > > > > On Thu, Jul 9, 2020 at 12:28 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > >> > >> On Wed, Jul 8, 2020 at 7:31 PM Ajin Cherian <itsajin@gmail.com> wrote: > >> > >> Thanks for showing the interest in patch. How have you ensured that > >> streaming is happening? I don't think the proposed patch can ensure > >> it for every case because we also rely on logical_decoding_work_mem to > >> decide whether to stream/spill, see ReorderBufferCheckMemoryLimit. I > >> think with your patch it will allow streaming for cases where we have > >> large amount of WAL to decode. > >> > > > > Maybe I missed something but I looked at ReorderBufferCheckMemoryLimit, even there it checks the same function ReorderBufferCanStream() and decides whether to stream or spill. Did I miss something? > > > > while (rb->size >= logical_decoding_work_mem * 1024L) > > { > > There is a check before above loop: > > ReorderBufferCheckMemoryLimit(ReorderBuffer *rb) > { > ReorderBufferTXN *txn; > > /* bail out if we haven't exceeded the memory limit */ > if (rb->size < logical_decoding_work_mem * 1024L) > return; > > This will prevent the streaming/spill to occur. I think if the GUC is set then maybe we can bypass this check so that it can try to stream every single change? -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Amit Kapila
Date:
On Thu, Jul 9, 2020 at 8:55 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Thu, Jul 9, 2020 at 8:47 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > On Thu, Jul 9, 2020 at 8:18 AM Ajin Cherian <itsajin@gmail.com> wrote: > > > > > > On Thu, Jul 9, 2020 at 12:28 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > >> > > >> On Wed, Jul 8, 2020 at 7:31 PM Ajin Cherian <itsajin@gmail.com> wrote: > > >> > > >> Thanks for showing the interest in patch. How have you ensured that > > >> streaming is happening? I don't think the proposed patch can ensure > > >> it for every case because we also rely on logical_decoding_work_mem to > > >> decide whether to stream/spill, see ReorderBufferCheckMemoryLimit. I > > >> think with your patch it will allow streaming for cases where we have > > >> large amount of WAL to decode. > > >> > > > > > > Maybe I missed something but I looked at ReorderBufferCheckMemoryLimit, even there it checks the same function ReorderBufferCanStream() and decides whether to stream or spill. Did I miss something? > > > > > > while (rb->size >= logical_decoding_work_mem * 1024L) > > > { > > > > There is a check before above loop: > > > > ReorderBufferCheckMemoryLimit(ReorderBuffer *rb) > > { > > ReorderBufferTXN *txn; > > > > /* bail out if we haven't exceeded the memory limit */ > > if (rb->size < logical_decoding_work_mem * 1024L) > > return; > > > > This will prevent the streaming/spill to occur. > > I think if the GUC is set then maybe we can bypass this check so that > it can try to stream every single change? > Yeah and probably we need to do something for the check "while (rb->size >= logical_decoding_work_mem * 1024L)" as well. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
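One way such a test-only knob could look, purely for illustration (the GUC name and its wiring are assumptions, not something taken from the posted patches):
/* A developer-only boolean GUC, off by default (hypothetical name). */
bool        logical_decoding_force_stream = false;

/* In ReorderBufferCheckMemoryLimit(): let the GUC bypass the threshold ... */
if (!logical_decoding_force_stream &&
    rb->size < logical_decoding_work_mem * 1024L)
    return;

/* ... and keep evicting while anything is left when the GUC is set. */
while (rb->size > 0 &&
       (logical_decoding_force_stream ||
        rb->size >= logical_decoding_work_mem * 1024L))
{
    /*
     * Pick the largest transaction (or subtransaction) and evict it from
     * memory by streaming, if supported; otherwise spill it to disk.
     */
    if (ReorderBufferCanStream(rb) &&
        (txn = ReorderBufferLargestTopTXN(rb)) != NULL)
        ReorderBufferStreamTXN(rb, txn);
    else
    {
        txn = ReorderBufferLargestTXN(rb);
        ReorderBufferSerializeTXN(rb, txn);
    }
}
With the GUC on, every nonempty reorder buffer gets drained through the streaming (or, as a fallback, spilling) path, which is what would let the existing regression tests exercise the new code.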
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Dilip Kumar
Date:
On Wed, Jul 8, 2020 at 3:32 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Wed, Jul 8, 2020 at 9:36 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > On Sun, Jul 5, 2020 at 8:37 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > > > I have compared the changes logged at command end vs logged at commit > > > time. I have ignored the invalidation for the transaction which has > > > any aborted subtransaction in it. While testing this I found one > > > issue, the issue is that if there are some invalidation generated > > > between last command counter increment and the commit transaction then > > > those were not logged. I have fixed the issue by logging the pending > > > invalidation in RecordTransactionCommit. I will include the changes > > > in the next patch set. > > > > > > > I think it would have been better if you could have given examples for > > such cases where you need this extra logging. Anyway, below are few > > minor comments on this patch: > > > > 1. > > + /* > > + * Log any pending invalidations which are adding between the last > > + * command counter increment and the commit. > > + */ > > + if (XLogLogicalInfoActive()) > > + LogLogicalInvalidations(); > > > > I think we can change this comment slightly and extend a bit to say > > for which kind of special cases we are adding this. "Log any pending > > invalidations which are added between the last CommandCounterIncrement > > and the commit. Normally for DDLs, we log this at each command end, > > however for certain cases where we directly update the system table > > the invalidations were not logged at command end." > > > > Something like above based on cases that are not covered by command > > end WAL logging. > > > > 2. > > + * Emit WAL for invalidations. This is currently only used for logging > > + * invalidations at the command end. > > + */ > > +void > > +LogLogicalInvalidations() > > > > After this is getting used at a new place, it is better to modify the > > above comment to something like: "Emit WAL for invalidations. This is > > currently only used for logging invalidations at the command end or at > > commit time if any invalidations are pending." > > > > I have done some more review and below are my comments: > > Review-v31-0010-Provide-new-api-to-get-the-streaming-changes > ---------------------------------------------------------------------------------------------- > 1. > --- a/src/backend/catalog/system_views.sql > +++ b/src/backend/catalog/system_views.sql > @@ -1240,6 +1240,14 @@ LANGUAGE INTERNAL > VOLATILE ROWS 1000 COST 1000 > AS 'pg_logical_slot_get_changes'; > > +CREATE OR REPLACE FUNCTION pg_logical_slot_get_streaming_changes( > + IN slot_name name, IN upto_lsn pg_lsn, IN upto_nchanges int, > VARIADIC options text[] DEFAULT '{}', > + OUT lsn pg_lsn, OUT xid xid, OUT data text) > +RETURNS SETOF RECORD > +LANGUAGE INTERNAL > +VOLATILE ROWS 1000 COST 1000 > +AS 'pg_logical_slot_get_streaming_changes'; > > If we are going to add a new streaming API for get_changes, don't we > need for pg_logical_slot_get_binary_changes, > pg_logical_slot_peek_changes and pg_logical_slot_peek_binary_changes > as well? I was thinking why not add a new parameter (streaming > boolean) instead of adding the new APIs. This could be an optional > parameter which if user doesn't specify will be considered as false. > We already have optional parameters for APIs like > pg_create_logical_replication_slot. > > 2. You forgot to update sgml/func.sgml. 
This will be required even if > we decide to add a new parameter instead of a new API. > > 3. > + /* If called has not asked for streaming changes then disable it. */ > + ctx->streaming &= streaming; > > /If called/If the caller > > 4. > diff --git a/.gitignore b/.gitignore > index 794e35b..6083744 100644 > --- a/.gitignore > +++ b/.gitignore > @@ -42,3 +42,4 @@ lib*.pc > /Debug/ > /Release/ > /tmp_install/ > +/build/ > > Why the patch contains this change? > > 5. If I apply the first six patches and run the regressions, it fails > primarily because streaming got enabled by default. And then when I > applied this patch, the tests passed because it disables streaming by > default. I think this should be patch 0007.

Only replying to the replication origin point; the other comments look fine to me, so I will work on those.

> Replication Origins > ------------------------------ > I think we also need to conclude on origins related discussion [1]. > As far as I can see, the origin_id can be sent with the first startup > message. The origin_lsn and origin_commit can be sent with the last > start of streaming commit if we want but not sure if that is of use. > If we need to send it earlier then we need to record it with other WAL > records. The point is that those are set with > pg_replication_origin_xact_setup but not sure how and when that > function is called.

pg_replication_origin_xact_setup is an exposed function, so it allows a user to set an origin for their session so that all the operations done from that session will be marked with that origin id. And the clear use case for this is to avoid sending such transactions by using FilterByOrigin. But I am not sure, regarding the point we discussed at [1], what the use is of the origin and origin_lsn we send at pgoutput_begin_txn.

> The other alternative is that we can ignore that > for now and once the usage is clear we can enhance it. What do you > think?

That seems like a sensible option to me.

> [1] - https://www.postgresql.org/message-id/CAA4eK1JwXaCezFw%2BkZwoxbLKYD0nWpC2rPgx7vUsaDAc0AZaow%40mail.gmail.com

-- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
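For reference, FilterByOrigin simply consults the output plugin's filter_by_origin callback to decide whether a change should be skipped. A minimal sketch of such a callback follows; the function name and the skip-everything-remote policy are only illustrative of the mechanism (not pgoutput's current behaviour), while the signature follows the existing filter_by_origin_cb:

static bool
my_origin_filter(LogicalDecodingContext *ctx, RepOriginId origin_id)
{
	/* Returning true asks the decoder to skip changes with this origin. */
	return origin_id != InvalidRepOriginId;
}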
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Amit Kapila
Date:
On Thu, Jul 9, 2020 at 2:41 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Wed, Jul 8, 2020 at 3:32 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > Only replying to the replication origin point, other comment looks > fine to me so I will work on those. > > > Replication Origins > > ------------------------------ > > I think we also need to conclude on origins related discussion [1]. > > As far as I can see, the origin_id can be sent with the first startup > > message. The origin_lsn and origin_commit can be sent with the last > > start of streaming commit if we want but not sure if that is of use. > > If we need to send it earlier then we need to record it with other WAL > > records. The point is that those are set with > > pg_replication_origin_xact_setup but not sure how and when that > > function is called. > > pg_replication_origin_xact_setup is exposed function so this will > allow a user to set an origin for their session so that all the > operation done from that session will be marked by that origin id. > Hmm, I think that can be done by pg_replication_origin_session_setup. > And the clear use case for this is to avoid sending such transactions > by suing FilterByOrigin. But I am not sure about the point that we > discussed at [1] that what is the use of the origin and origin_lsn we > send at pgoutput_begin_txn. > I could see the use of 'origin' with FilterByOrigin but not sure how origin_lsn can be used? > The other alternative is that we can ignore that > > for now and once the usage is clear we can enhance it. What do you > > think? > > That seems like a sensible option to me. > I have responded to that another thread. Let us see if someone responds to it. Feel free to add if you have some points related to that thread. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Ajin Cherian
Date:
On Thu, Jul 9, 2020 at 1:30 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> I think if the GUC is set then maybe we can bypass this check so that
> it can try to stream every single change?
>
Yeah and probably we need to do something for the check "while
(rb->size >= logical_decoding_work_mem * 1024L)" as well.
I have made this change, as discussed, and the regression tests seem to run fine. I have added a debug log that records the streaming for each transaction. I also had to bypass certain asserts in ReorderBufferLargestTopTXN(), as now we are going through the entire list of transactions and not just picking the biggest transaction.
regards,
Ajin
Fujitsu Australia
Attachment
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Dilip Kumar
Date:
On Fri, Jul 10, 2020 at 9:21 AM Ajin Cherian <itsajin@gmail.com> wrote: > > > > On Thu, Jul 9, 2020 at 1:30 PM Amit Kapila <amit.kapila16@gmail.com> wrote: >> >> >> > I think if the GUC is set then maybe we can bypass this check so that >> > it can try to stream every single change? >> > >> >> Yeah and probably we need to do something for the check "while >> (rb->size >= logical_decoding_work_mem * 1024L)" as well. >> >> > I have made this change, as discussed, and the regression tests seem to run fine. I have added a debug log that records the streaming for each transaction. I also had to bypass certain asserts in ReorderBufferLargestTopTXN(), as now we are going through the entire list of transactions and not just picking the biggest transaction.

So if always_stream_logical is true then we always go for streaming even if the size limit is not reached, and that is good. And if always_stream_logical is set then we set ctx->streaming=true, which is also good. So now I don't think we need to change this part of the code, because when we bypass the memory limit and set ctx->streaming=true it will always select the streaming option unless it is impossible. With your changes, sometimes, due to incomplete toast changes, if it cannot pick the largest top txn for streaming it will hang forever in the while loop; in that case, it should go for spilling.

while (rb->size >= logical_decoding_work_mem * 1024L)
{
    /*
     * Pick the largest transaction (or subtransaction) and evict it from
     * memory by streaming, if supported. Otherwise, spill to disk.
     */
    if (ReorderBufferCanStream(rb) &&
        (txn = ReorderBufferLargestTopTXN(rb)) != NULL)

-- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Ajin Cherian
Date:
On Fri, Jul 10, 2020 at 3:11 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
With your changes sometimes due to incomplete toast
changes, if it can not pick the largest top txn for streaming it will
hang forever in the while loop, in that case, it should go for
spilling.
while (rb->size >= logical_decoding_work_mem * 1024L)
{
/*
* Pick the largest transaction (or subtransaction) and evict it from
* memory by streaming, if supported. Otherwise, spill to disk.
*/
if (ReorderBufferCanStream(rb) &&
(txn = ReorderBufferLargestTopTXN(rb)) != NULL)
Which is this condition (of not picking largest top txn)? Wouldn't ReorderBufferLargestTopTXN then return a NULL? If not, is there a way to know that a transaction cannot be streamed, so there can be an exit condition for the while loop?
regards,
Ajin Cherian
Fujitsu Australia
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Dilip Kumar
Date:
On Fri, Jul 10, 2020 at 11:01 AM Ajin Cherian <itsajin@gmail.com> wrote: > > > > On Fri, Jul 10, 2020 at 3:11 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: >> >> With your changes sometimes due to incomplete toast >> changes, if it can not pick the largest top txn for streaming it will >> hang forever in the while loop, in that case, it should go for >> spilling. >> >> while (rb->size >= logical_decoding_work_mem * 1024L) >> { >> /* >> * Pick the largest transaction (or subtransaction) and evict it from >> * memory by streaming, if supported. Otherwise, spill to disk. >> */ >> if (ReorderBufferCanStream(rb) && >> (txn = ReorderBufferLargestTopTXN(rb)) != NULL) >> >> > > Which is this condition (of not picking largest top txn)? Wouldn't ReorderBufferLargestTopTXN then return a NULL? If not,is there a way to know that a transaction cannot be streamed, so there can be an exit condition for the while loop? Okay, I see, so if ReorderBufferLargestTopTXN returns NULL you are breaking the loop. I did not see the other part of the patch but I agree that it will not go in an infinite loop. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
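For clarity, the shape of the eviction loop being discussed, with the fallback described above, is roughly as follows; this is only a sketch around the patch's function names, not the exact code:

while (rb->size >= logical_decoding_work_mem * 1024L)
{
	ReorderBufferTXN *txn;

	/*
	 * Prefer streaming the largest toplevel transaction.  If none can be
	 * streamed right now (e.g. only transactions with incomplete toast
	 * changes remain), fall back to spilling the largest transaction, and
	 * if nothing at all can be evicted, leave the loop instead of spinning.
	 */
	if (ReorderBufferCanStream(rb) &&
		(txn = ReorderBufferLargestTopTXN(rb)) != NULL)
		ReorderBufferStreamTXN(rb, txn);
	else if ((txn = ReorderBufferLargestTXN(rb)) != NULL)
		ReorderBufferSerializeTXN(rb, txn);
	else
		break;
}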
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Dilip Kumar
Date:
On Sat, Jul 4, 2020 at 11:35 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Tue, Jun 30, 2020 at 5:20 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > Let me know what you think about the above changes. > > > > I went ahead and made few changes in > 0005-Implement-streaming-mode-in-ReorderBuffer which are explained > below. I have few questions and suggestions for the patch as well > which are also covered in below points. > > 1. > + if (prev_lsn == InvalidXLogRecPtr) > + { > + if (streaming) > + rb->stream_start(rb, txn, change->lsn); > + else > + rb->begin(rb, txn); > + stream_started = true; > + } > > I don't think we want to move begin callback here that will change the > existing semantics, so it is better to move begin at its original > position. I have made the required changes in the attached patch. > > 2. > ReorderBufferTruncateTXN() > { > .. > + dlist_foreach_modify(iter, &txn->changes) > + { > + ReorderBufferChange *change; > + > + change = dlist_container(ReorderBufferChange, node, iter.cur); > + > + /* remove the change from it's containing list */ > + dlist_delete(&change->node); > + > + ReorderBufferReturnChange(rb, change); > + } > .. > } > > I think here we can add an Assert that we're not mixing changes from > different transactions. See the changes in the patch. > > 3. > SetupCheckXidLive() > { > .. > + /* > + * setup CheckXidAlive if it's not committed yet. We don't check if the xid > + * aborted. That will happen during catalog access. Also, reset the > + * bsysscan flag. > + */ > + if (!TransactionIdDidCommit(xid)) > + { > + CheckXidAlive = xid; > + bsysscan = false; > .. > } > > What is the need to reset bsysscan flag here if we are already > resetting on error (like in the previous patch sent by me)? > > 4. > ReorderBufferProcessTXN() > { > .. > .. > + /* Reset the CheckXidAlive */ > + if (streaming) > + CheckXidAlive = InvalidTransactionId; > .. > } > > Similar to the previous point, we don't need this as well because > AbortCurrentTransaction would have taken care of this. > > 5. > + * XXX Do we need to check if the transaction has some changes to stream > + * (maybe it got streamed right before the commit, which attempts to > + * stream it again before the commit)? > + */ > +static void > +ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn) > > The above comment doesn't make much sense to me, so I have removed it. > Basically, if there are no changes before commit, we still need to > send commit and anyway if there are no more changes > ReorderBufferProcessTXN will not do anything. > > 6. > +ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn) > { > .. > if (txn->snapshot_now == NULL) > + { > + dlist_iter subxact_i; > + > + /* make sure this transaction is streamed for the first time */ > + Assert(!rbtxn_is_streamed(txn)); > + > + /* at the beginning we should have invalid command ID */ > + Assert(txn->command_id == InvalidCommandId); > + > + dlist_foreach(subxact_i, &txn->subtxns) > + { > + ReorderBufferTXN *subtxn; > + > + subtxn = dlist_container(ReorderBufferTXN, node, subxact_i.cur); > + ReorderBufferTransferSnapToParent(txn, subtxn); > + } > .. > } > > Here, it is possible that there is no base_snapshot for txn, so we > need a check for that similar to ReorderBufferCommit. > > 7. Apart from the above, I made few changes in comments and ran pgindent. > > 8. 
We can't stream the transaction before we reach the > SNAPBUILD_CONSISTENT state because some other output plugin can apply > those changes unlike what we do with pgoutput plugin (which writes to > file). And, I think applying the transactions without reaching a > consistent state would be anyway wrong. So, we should avoid that and > if do that then we should have an Assert for streamed txns rather than > sending abort for them in ReorderBufferForget.

I was analyzing this point so currently, we only enable streaming in StartReplicationSlot so basically in CreateReplicationSlot the streaming will be always off because by that time plugins are not yet startup that will happen only on StartReplicationSlot. See below snippet from patch 0007. However, I agree that during start replication slot we might decode some of the extra WAL of the transactions for which we already got the commit confirmation, and we must have a way to avoid that. But I think we don't need to do anything for the CONSISTENT snapshot point. What's your thought on this?

@@ -1016,6 +1016,12 @@ CreateReplicationSlot(CreateReplicationSlotCmd *cmd)
WalSndPrepareWrite, WalSndWriteData,
WalSndUpdateProgress);
+ /*
+ * Make sure streaming is disabled here - we may have the methods,
+ * but we don't have anywhere to send the data yet.
+ */
+ ctx->streaming = false;
+

-- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Dilip Kumar
Date:
On Mon, Jul 6, 2020 at 11:43 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Mon, Jul 6, 2020 at 11:31 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > On Sun, Jul 5, 2020 at 4:47 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > > > On Sat, Jul 4, 2020 at 11:35 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > > > > > 9. > > > > +ReorderBufferHandleConcurrentAbort(ReorderBuffer *rb, ReorderBufferTXN *txn, > > > > { > > > > .. > > > > + ReorderBufferToastReset(rb, txn); > > > > + if (specinsert != NULL) > > > > + ReorderBufferReturnChange(rb, specinsert); > > > > .. > > > > } > > > > > > > > Why do we need to do these here when we wouldn't have been done for > > > > any exception other than ERRCODE_TRANSACTION_ROLLBACK? > > > > > > Because we are handling this exception "ERRCODE_TRANSACTION_ROLLBACK" > > > gracefully and we are continuing with further decoding so we need to > > > return this change back. > > > > > > > Okay, then I suggest we should do these before calling stream_stop and > > also move ReorderBufferResetTXN after calling stream_stop to follow a > > pattern similar to try block unless there is a reason for not doing > > so. Also, it would be good if we can initialize specinsert with NULL > > after returning the change as we are doing at other places. > > Okay > > > > > 10. I have got the below failure once. I have not investigated this > > > > in detail as the patch is still under progress. See, if you have any > > > > idea? > > > > # Failed test 'check extra columns contain local defaults' > > > > # at t/013_stream_subxact_ddl_abort.pl line 81. > > > > # got: '2|0' > > > > # expected: '1000|500' > > > > # Looks like you failed 1 test of 2. > > > > make[2]: *** [check] Error 1 > > > > make[1]: *** [check-subscription-recurse] Error 2 > > > > make[1]: *** Waiting for unfinished jobs.... > > > > make: *** [check-world-src/test-recurse] Error 2 > > > > > > Even I got the failure once and after that, it did not reproduce. I > > > have executed it multiple time but it did not reproduce again. Are > > > you able to reproduce it consistently? > > > > > > > No, I am also not able to reproduce it consistently but I think this > > can fail if a subscriber sends the replay_location before actually > > replaying the changes. First, I thought that extra send_feedback we > > have in apply_handle_stream_commit might have caused this but I guess > > that can't happen because we need the commit time location for that > > and we are storing the same at the end of apply_handle_stream_commit > > after applying all messages. I am not sure what is going on here. I > > think we somehow need to reproduce this or some variant of this test > > consistently to find the root cause. > > And I think it appeared first time for me, so maybe either induced > from past few versions so some changes in the last few versions might > have exposed it. I have noticed that almost 50% of the time I am able > to reproduce after the clean build so I can trace back from which > version it started appearing that way it will be easy to narrow down. I think the reason for the failure is that we are not setting remote_final_lsn, in the streaming mode. I have put multiple logs and executed in log and from logs it appeared that some of the logical wal did not get replayed due to below check in should_apply_changes_for_rel. 
return (rel->state == SUBREL_STATE_READY || (rel->state == SUBREL_STATE_SYNCDONE && rel->statelsn <= remote_final_lsn)); I still need to do the detailed analysis that why does this fail in some cases, basically, most of the time the rel->state is SUBREL_STATE_READY so this check passes but whenever the state is SUBREL_STATE_SYNCDONE it failed because we never update remote_final_lsn. I will try to set this value in apply_handle_stream_commit and see whether it ever fails or not. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Amit Kapila
Date:
On Fri, Jul 10, 2020 at 3:37 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Sat, Jul 4, 2020 at 11:35 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > 8. We can't stream the transaction before we reach the > > SNAPBUILD_CONSISTENT state because some other output plugin can apply > > those changes unlike what we do with pgoutput plugin (which writes to > > file). And, I think applying the transactions without reaching a > > consistent state would be anyway wrong. So, we should avoid that and > > if do that then we should have an Assert for streamed txns rather than > > sending abort for them in ReorderBufferForget. > > I was analyzing this point so currently, we only enable streaming in > StartReplicationSlot so basically in CreateReplicationSlot the > streaming will be always off because by that time plugins are not yet > startup that will happen only on StartReplicationSlot. > What do you mean by 'startup' in the above sentence? AFAICS, we do call startup_cb_wrapper in CreateInitDecodingContext which is called from both CreateReplicationSlot and create_logical_replication_slot before the start of decoding. In CreateInitDecodingContext, we call StartupDecodingContext which should load the plugin. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Dilip Kumar
Date:
On Mon, Jul 13, 2020 at 10:14 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Fri, Jul 10, 2020 at 3:37 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > On Sat, Jul 4, 2020 at 11:35 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > > > 8. We can't stream the transaction before we reach the > > > SNAPBUILD_CONSISTENT state because some other output plugin can apply > > > those changes unlike what we do with pgoutput plugin (which writes to > > > file). And, I think applying the transactions without reaching a > > > consistent state would be anyway wrong. So, we should avoid that and > > > if do that then we should have an Assert for streamed txns rather than > > > sending abort for them in ReorderBufferForget. > > > > I was analyzing this point so currently, we only enable streaming in > > StartReplicationSlot so basically in CreateReplicationSlot the > > streaming will be always off because by that time plugins are not yet > > startup that will happen only on StartReplicationSlot. > > > > What do you mean by 'startup' in the above sentence? AFAICS, we do > call startup_cb_wrapper in CreateInitDecodingContext which is called > from both CreateReplicationSlot and create_logical_replication_slot > before the start of decoding. In CreateInitDecodingContext, we call > StartupDecodingContext which should load the plugin. Yeah, you are right that we do call startup_cb_wrapper from CreateInitDecodingContext as well. I think I got confused by below comment in patch 0007 @@ -1016,6 +1016,12 @@ CreateReplicationSlot(CreateReplicationSlotCmd *cmd) WalSndPrepareWrite, WalSndWriteData, WalSndUpdateProgress); + /* + * Make sure streaming is disabled here - we may have the methods, + * but we don't have anywhere to send the data yet. + */ + ctx->streaming = false; + Basically, during CreateReplicationSlot we forcefully disable the streaming with the comment "we don't have anywhere to send the data yet". So my point is during CreateReplicationSlot time the streaming will always be off and once we are done with creating the slot we will be having consistent snapshot. So my point is can we just check that while decoding unless the current LSN reaches the start_decoding_at point we should not start streaming and after that we can start. At that time we can have an assert that the snapshot should be CONSISTENT. However, before doing that I need to check on this point that why after creating slot we are setting ctx->streaming to false. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Dilip Kumar
Date:
On Sun, Jul 12, 2020 at 9:56 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Mon, Jul 6, 2020 at 11:43 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > On Mon, Jul 6, 2020 at 11:31 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > On Sun, Jul 5, 2020 at 4:47 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > > > > > On Sat, Jul 4, 2020 at 11:35 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > > > > > > > > 9. > > > > > +ReorderBufferHandleConcurrentAbort(ReorderBuffer *rb, ReorderBufferTXN *txn, > > > > > { > > > > > .. > > > > > + ReorderBufferToastReset(rb, txn); > > > > > + if (specinsert != NULL) > > > > > + ReorderBufferReturnChange(rb, specinsert); > > > > > .. > > > > > } > > > > > > > > > > Why do we need to do these here when we wouldn't have been done for > > > > > any exception other than ERRCODE_TRANSACTION_ROLLBACK? > > > > > > > > Because we are handling this exception "ERRCODE_TRANSACTION_ROLLBACK" > > > > gracefully and we are continuing with further decoding so we need to > > > > return this change back. > > > > > > > > > > Okay, then I suggest we should do these before calling stream_stop and > > > also move ReorderBufferResetTXN after calling stream_stop to follow a > > > pattern similar to try block unless there is a reason for not doing > > > so. Also, it would be good if we can initialize specinsert with NULL > > > after returning the change as we are doing at other places. > > > > Okay > > > > > > > 10. I have got the below failure once. I have not investigated this > > > > > in detail as the patch is still under progress. See, if you have any > > > > > idea? > > > > > # Failed test 'check extra columns contain local defaults' > > > > > # at t/013_stream_subxact_ddl_abort.pl line 81. > > > > > # got: '2|0' > > > > > # expected: '1000|500' > > > > > # Looks like you failed 1 test of 2. > > > > > make[2]: *** [check] Error 1 > > > > > make[1]: *** [check-subscription-recurse] Error 2 > > > > > make[1]: *** Waiting for unfinished jobs.... > > > > > make: *** [check-world-src/test-recurse] Error 2 > > > > > > > > Even I got the failure once and after that, it did not reproduce. I > > > > have executed it multiple time but it did not reproduce again. Are > > > > you able to reproduce it consistently? > > > > > > > > > > No, I am also not able to reproduce it consistently but I think this > > > can fail if a subscriber sends the replay_location before actually > > > replaying the changes. First, I thought that extra send_feedback we > > > have in apply_handle_stream_commit might have caused this but I guess > > > that can't happen because we need the commit time location for that > > > and we are storing the same at the end of apply_handle_stream_commit > > > after applying all messages. I am not sure what is going on here. I > > > think we somehow need to reproduce this or some variant of this test > > > consistently to find the root cause. > > > > And I think it appeared first time for me, so maybe either induced > > from past few versions so some changes in the last few versions might > > have exposed it. I have noticed that almost 50% of the time I am able > > to reproduce after the clean build so I can trace back from which > > version it started appearing that way it will be easy to narrow down. > > I think the reason for the failure is that we are not setting > remote_final_lsn, in the streaming mode. 
I have put multiple logs and > executed in log and from logs it appeared that some of the logical wal > did not get replayed due to below check in > should_apply_changes_for_rel. > return (rel->state == SUBREL_STATE_READY || (rel->state == > SUBREL_STATE_SYNCDONE && rel->statelsn <= remote_final_lsn)); > > I still need to do the detailed analysis that why does this fail in > some cases, basically, most of the time the rel->state is > SUBREL_STATE_READY so this check passes but whenever the state is > SUBREL_STATE_SYNCDONE it failed because we never update > remote_final_lsn. I will try to set this value in > apply_handle_stream_commit and see whether it ever fails or not. I have verified that after setting the remote_final_lsn in the apply_handle_stream_commit, I don't see that regression failure in over 70 runs whereas without that change it failed 6 times in 50 runs. Apart from this, I have noticed one more thing related to the same point. Basically, in the apply_handle_commit, we are calling process_syncing_tables whereas we are not calling the same in apply_handle_stream_commit. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
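Roughly, the fix being described would look like the sketch below (heavily simplified, using the patch's function names; only the parts relevant to this discussion are shown, the rest is elided in comments):

static void
apply_handle_stream_commit(StringInfo s)
{
	TransactionId xid;
	LogicalRepCommitData commit_data;

	xid = logicalrep_read_stream_commit(s, &commit_data);

	/*
	 * Set remote_final_lsn before replaying the spooled messages, so that
	 * should_apply_changes_for_rel() sees a valid value, just as
	 * apply_handle_commit() does.
	 */
	remote_final_lsn = commit_data.commit_lsn;

	apply_spooled_messages(xid, commit_data.commit_lsn);

	/* ... commit the local transaction, advance the origin, etc. ... */

	/* Mirror apply_handle_commit() here as well. */
	process_syncing_tables(commit_data.end_lsn);
}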
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Amit Kapila
Date:
On Mon, Jul 13, 2020 at 10:40 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Mon, Jul 13, 2020 at 10:14 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > On Fri, Jul 10, 2020 at 3:37 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > > > On Sat, Jul 4, 2020 at 11:35 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > > > > > > 8. We can't stream the transaction before we reach the > > > > SNAPBUILD_CONSISTENT state because some other output plugin can apply > > > > those changes unlike what we do with pgoutput plugin (which writes to > > > > file). And, I think applying the transactions without reaching a > > > > consistent state would be anyway wrong. So, we should avoid that and > > > > if do that then we should have an Assert for streamed txns rather than > > > > sending abort for them in ReorderBufferForget. > > > > > > I was analyzing this point so currently, we only enable streaming in > > > StartReplicationSlot so basically in CreateReplicationSlot the > > > streaming will be always off because by that time plugins are not yet > > > startup that will happen only on StartReplicationSlot. > > > > > > > What do you mean by 'startup' in the above sentence? AFAICS, we do > > call startup_cb_wrapper in CreateInitDecodingContext which is called > > from both CreateReplicationSlot and create_logical_replication_slot > > before the start of decoding. In CreateInitDecodingContext, we call > > StartupDecodingContext which should load the plugin. > > Yeah, you are right that we do call startup_cb_wrapper from > CreateInitDecodingContext as well. I think I got confused by below > comment in patch 0007 > > @@ -1016,6 +1016,12 @@ CreateReplicationSlot(CreateReplicationSlotCmd *cmd) > WalSndPrepareWrite, WalSndWriteData, > WalSndUpdateProgress); > + /* > + * Make sure streaming is disabled here - we may have the methods, > + * but we don't have anywhere to send the data yet. > + */ > + ctx->streaming = false; > + > > Basically, during CreateReplicationSlot we forcefully disable the > streaming with the comment "we don't have anywhere to send the data > yet". So my point is during CreateReplicationSlot time the streaming > will always be off and once we are done with creating the slot we will > be having consistent snapshot. So my point is can we just check that > while decoding unless the current LSN reaches the start_decoding_at > point we should not start streaming and after that we can start. At > that time we can have an assert that the snapshot should be > CONSISTENT. However, before doing that I need to check on this point > that why after creating slot we are setting ctx->streaming to false. > I think you can refer to commit message as well for that "We however must explicitly disable streaming replication during replication slot creation, even if the plugin supports it. We don't need to replicate the changes accumulated during this phase, and moreover, we don't have a replication connection open so we don't have where to send the data anyway.". I don't think this is a good way to hack the streaming flag because for SQL API's, we don't have a good reason to disable the streaming in this way. I guess if we had a condition related to reaching CONSISTENT snapshot during streaming then we won't need to hack the streaming flag in this way. Once we reach the CONSISTENT snapshot state, we come out of the creation of a replication slot (see how we use DecodingContextReady to achieve that) phase. 
So, I feel we should remove the ctx->streaming setting to false and add a CONSISTENT snapshot check during streaming unless you have a reason for not doing so. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
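For reference, the check being proposed could look roughly like the following; the helper name is only illustrative, not necessarily what the patch uses:

static bool
ReorderBufferCanStartStreaming(ReorderBuffer *rb)
{
	LogicalDecodingContext *ctx = rb->private_data;
	SnapBuild  *builder = ctx->snapshot_builder;

	/* Never start streaming before a consistent snapshot is reached. */
	return ReorderBufferCanStream(rb) &&
		   SnapBuildCurrentState(builder) == SNAPBUILD_CONSISTENT;
}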
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Amit Kapila
Date:
On Mon, Jul 13, 2020 at 10:47 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Sun, Jul 12, 2020 at 9:56 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > On Mon, Jul 6, 2020 at 11:43 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > > > On Mon, Jul 6, 2020 at 11:31 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > > > > > > > 10. I have got the below failure once. I have not investigated this > > > > > > in detail as the patch is still under progress. See, if you have any > > > > > > idea? > > > > > > # Failed test 'check extra columns contain local defaults' > > > > > > # at t/013_stream_subxact_ddl_abort.pl line 81. > > > > > > # got: '2|0' > > > > > > # expected: '1000|500' > > > > > > # Looks like you failed 1 test of 2. > > > > > > make[2]: *** [check] Error 1 > > > > > > make[1]: *** [check-subscription-recurse] Error 2 > > > > > > make[1]: *** Waiting for unfinished jobs.... > > > > > > make: *** [check-world-src/test-recurse] Error 2 > > > > > > > > > > Even I got the failure once and after that, it did not reproduce. I > > > > > have executed it multiple time but it did not reproduce again. Are > > > > > you able to reproduce it consistently? > > > > > > > > > ... .. > > > > I think the reason for the failure is that we are not setting > > remote_final_lsn, in the streaming mode. I have put multiple logs and > > executed in log and from logs it appeared that some of the logical wal > > did not get replayed due to below check in > > should_apply_changes_for_rel. > > return (rel->state == SUBREL_STATE_READY || (rel->state == > > SUBREL_STATE_SYNCDONE && rel->statelsn <= remote_final_lsn)); > > > > I still need to do the detailed analysis that why does this fail in > > some cases, basically, most of the time the rel->state is > > SUBREL_STATE_READY so this check passes but whenever the state is > > SUBREL_STATE_SYNCDONE it failed because we never update > > remote_final_lsn. I will try to set this value in > > apply_handle_stream_commit and see whether it ever fails or not. > > I have verified that after setting the remote_final_lsn in the > apply_handle_stream_commit, I don't see that regression failure in > over 70 runs whereas without that change it failed 6 times in 50 runs. > Your analysis and fix seem correct to me. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Dilip Kumar
Date:
On Mon, Jul 13, 2020 at 11:10 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Mon, Jul 13, 2020 at 10:40 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > On Mon, Jul 13, 2020 at 10:14 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > On Fri, Jul 10, 2020 at 3:37 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > > > > > On Sat, Jul 4, 2020 at 11:35 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > > > > > > > > > 8. We can't stream the transaction before we reach the > > > > > SNAPBUILD_CONSISTENT state because some other output plugin can apply > > > > > those changes unlike what we do with pgoutput plugin (which writes to > > > > > file). And, I think applying the transactions without reaching a > > > > > consistent state would be anyway wrong. So, we should avoid that and > > > > > if do that then we should have an Assert for streamed txns rather than > > > > > sending abort for them in ReorderBufferForget. > > > > > > > > I was analyzing this point so currently, we only enable streaming in > > > > StartReplicationSlot so basically in CreateReplicationSlot the > > > > streaming will be always off because by that time plugins are not yet > > > > startup that will happen only on StartReplicationSlot. > > > > > > > > > > What do you mean by 'startup' in the above sentence? AFAICS, we do > > > call startup_cb_wrapper in CreateInitDecodingContext which is called > > > from both CreateReplicationSlot and create_logical_replication_slot > > > before the start of decoding. In CreateInitDecodingContext, we call > > > StartupDecodingContext which should load the plugin. > > > > Yeah, you are right that we do call startup_cb_wrapper from > > CreateInitDecodingContext as well. I think I got confused by below > > comment in patch 0007 > > > > @@ -1016,6 +1016,12 @@ CreateReplicationSlot(CreateReplicationSlotCmd *cmd) > > WalSndPrepareWrite, WalSndWriteData, > > WalSndUpdateProgress); > > + /* > > + * Make sure streaming is disabled here - we may have the methods, > > + * but we don't have anywhere to send the data yet. > > + */ > > + ctx->streaming = false; > > + > > > > Basically, during CreateReplicationSlot we forcefully disable the > > streaming with the comment "we don't have anywhere to send the data > > yet". So my point is during CreateReplicationSlot time the streaming > > will always be off and once we are done with creating the slot we will > > be having consistent snapshot. So my point is can we just check that > > while decoding unless the current LSN reaches the start_decoding_at > > point we should not start streaming and after that we can start. At > > that time we can have an assert that the snapshot should be > > CONSISTENT. However, before doing that I need to check on this point > > that why after creating slot we are setting ctx->streaming to false. > > > > I think you can refer to commit message as well for that "We however > must explicitly disable streaming replication during replication slot > creation, even if the plugin supports it. We don't need to replicate > the changes accumulated during this phase, and moreover, we don't have > a replication connection open so we don't have where to send the data > anyway.". I don't think this is a good way to hack the streaming flag > because for SQL API's, we don't have a good reason to disable the > streaming in this way. I guess if we had a condition related to > reaching CONSISTENT snapshot during streaming then we won't need to > hack the streaming flag in this way. 
Once we reach the CONSISTENT > snapshot state, we come out of the creation of a replication slot (see > how we use DecodingContextReady to achieve that) phase. So, I feel we > should remove the ctx->streaming setting to false and add a CONSISTENT > snapshot check during streaming unless you have a reason for not doing > so. I was worried about the point that streaming on/off is sent by the subscriber on START REPLICATION, not on CREATE REPLICATION SLOT, so if we keep streaming on during create then it may not be right. But, I agree with your point that it's better we can avoid streaming during slot creation by CONSISTENT snapshot check instead of disabling this way. And, anyways as soon as we reach the consistent snapshot we will stop processing further records so we will not attempt to stream during slot creation. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Amit Kapila
Date:
On Mon, Jul 13, 2020 at 2:32 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Mon, Jul 13, 2020 at 11:10 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > I think you can refer to commit message as well for that "We however > > must explicitly disable streaming replication during replication slot > > creation, even if the plugin supports it. We don't need to replicate > > the changes accumulated during this phase, and moreover, we don't have > > a replication connection open so we don't have where to send the data > > anyway.". I don't think this is a good way to hack the streaming flag > > because for SQL API's, we don't have a good reason to disable the > > streaming in this way. I guess if we had a condition related to > > reaching CONSISTENT snapshot during streaming then we won't need to > > hack the streaming flag in this way. Once we reach the CONSISTENT > > snapshot state, we come out of the creation of a replication slot (see > > how we use DecodingContextReady to achieve that) phase. So, I feel we > > should remove the ctx->streaming setting to false and add a CONSISTENT > > snapshot check during streaming unless you have a reason for not doing > > so. > > I was worried about the point that streaming on/off is sent by the > subscriber on START REPLICATION, not on CREATE REPLICATION SLOT, so if > we keep streaming on during create then it may not be right. > Then, how is that used on the publisher-side? AFAICS, the streaming is enabled based on whether streaming callbacks are provided and we do that in 0003-Extend-the-logical-decoding-output-plugin-API-wi patch. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Dilip Kumar
Date:
On Mon, Jul 13, 2020 at 2:56 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Mon, Jul 13, 2020 at 2:32 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > On Mon, Jul 13, 2020 at 11:10 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > > > I think you can refer to commit message as well for that "We however > > > must explicitly disable streaming replication during replication slot > > > creation, even if the plugin supports it. We don't need to replicate > > > the changes accumulated during this phase, and moreover, we don't have > > > a replication connection open so we don't have where to send the data > > > anyway.". I don't think this is a good way to hack the streaming flag > > > because for SQL API's, we don't have a good reason to disable the > > > streaming in this way. I guess if we had a condition related to > > > reaching CONSISTENT snapshot during streaming then we won't need to > > > hack the streaming flag in this way. Once we reach the CONSISTENT > > > snapshot state, we come out of the creation of a replication slot (see > > > how we use DecodingContextReady to achieve that) phase. So, I feel we > > > should remove the ctx->streaming setting to false and add a CONSISTENT > > > snapshot check during streaming unless you have a reason for not doing > > > so. > > > > I was worried about the point that streaming on/off is sent by the > > subscriber on START REPLICATION, not on CREATE REPLICATION SLOT, so if > > we keep streaming on during create then it may not be right. > > > > Then, how is that used on the publisher-side? AFAICS, the streaming > is enabled based on whether streaming callbacks are provided and we do > that in 0003-Extend-the-logical-decoding-output-plugin-API-wi patch. Basically, first, we enable based on whether we have the callbacks or not but later once we get the START REPLICATION command from the subscriber then we set it to false if the streaming is not enabled from the subscriber side. You can refer below code in patch 0007. pgoutput_startup { parse_output_parameters(ctx->output_plugin_options, &data->protocol_version, - &data->publication_names); + &data->publication_names, + &enable_streaming); /* Check if we support requested protocol */ if (data->protocol_version > LOGICALREP_PROTO_VERSION_NUM) @@ -222,6 +284,27 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt, (errcode(ERRCODE_INVALID_PARAMETER_VALUE), errmsg("publication_names parameter missing"))); + /* + * Decide whether to enable streaming. It is disabled by default, in + * which case we just update the flag in decoding context. Otherwise + * we only allow it with sufficient version of the protocol, and when + * the output plugin supports it. + */ + if (!enable_streaming) + ctx->streaming = false; } -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Amit Kapila
Date:
On Mon, Jul 13, 2020 at 3:04 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Mon, Jul 13, 2020 at 2:56 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > On Mon, Jul 13, 2020 at 2:32 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > > > On Mon, Jul 13, 2020 at 11:10 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > > > > > > I think you can refer to commit message as well for that "We however > > > > must explicitly disable streaming replication during replication slot > > > > creation, even if the plugin supports it. We don't need to replicate > > > > the changes accumulated during this phase, and moreover, we don't have > > > > a replication connection open so we don't have where to send the data > > > > anyway.". I don't think this is a good way to hack the streaming flag > > > > because for SQL API's, we don't have a good reason to disable the > > > > streaming in this way. I guess if we had a condition related to > > > > reaching CONSISTENT snapshot during streaming then we won't need to > > > > hack the streaming flag in this way. Once we reach the CONSISTENT > > > > snapshot state, we come out of the creation of a replication slot (see > > > > how we use DecodingContextReady to achieve that) phase. So, I feel we > > > > should remove the ctx->streaming setting to false and add a CONSISTENT > > > > snapshot check during streaming unless you have a reason for not doing > > > > so. > > > > > > I was worried about the point that streaming on/off is sent by the > > > subscriber on START REPLICATION, not on CREATE REPLICATION SLOT, so if > > > we keep streaming on during create then it may not be right. > > > > > > > Then, how is that used on the publisher-side? AFAICS, the streaming > > is enabled based on whether streaming callbacks are provided and we do > > that in 0003-Extend-the-logical-decoding-output-plugin-API-wi patch. > > Basically, first, we enable based on whether we have the callbacks or > not but later once we get the START REPLICATION command from the > subscriber then we set it to false if the streaming is not enabled > from the subscriber side. You can refer below code in patch 0007. > > pgoutput_startup > { > parse_output_parameters(ctx->output_plugin_options, > &data->protocol_version, > - &data->publication_names); > + &data->publication_names, > + &enable_streaming); > /* Check if we support requested protocol */ > if (data->protocol_version > LOGICALREP_PROTO_VERSION_NUM) > @@ -222,6 +284,27 @@ pgoutput_startup(LogicalDecodingContext *ctx, > OutputPluginOptions *opt, > (errcode(ERRCODE_INVALID_PARAMETER_VALUE), > errmsg("publication_names parameter missing"))); > + /* > + * Decide whether to enable streaming. It is disabled by default, in > + * which case we just update the flag in decoding context. Otherwise > + * we only allow it with sufficient version of the protocol, and when > + * the output plugin supports it. > + */ > + if (!enable_streaming) > + ctx->streaming = false; > } > Okay, in that case, we can do both enable and disable streaming in this function itself rather than allow the caller to later modify it. I suggest similarly we can enable/disable it for SQL API in pg_decode_startup via output_plugin_options. This way it will look consistent for both SQL APIs and for command-based replication. If we can do so, then probably adding an Assert for Consistent Snapshot while performing streaming should be okay. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
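For the SQL API side, the startup callback of a plugin such as test_decoding could honour an output plugin option along these lines. This is a sketch only: the option name and the surrounding setup are assumptions, and only the option-parsing part is shown.

static void
pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
				  bool is_init)
{
	ListCell   *option;
	bool		enable_streaming = false;

	foreach(option, ctx->output_plugin_options)
	{
		DefElem    *elem = lfirst(option);

		if (strcmp(elem->defname, "stream-changes") == 0)
			enable_streaming = defGetBoolean(elem);
	}

	/*
	 * ctx->streaming starts out true only if the plugin provides the stream
	 * callbacks; additionally require the caller to opt in.
	 */
	ctx->streaming &= enable_streaming;
}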
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Dilip Kumar
Date:
On Mon, Jul 13, 2020 at 4:00 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Mon, Jul 13, 2020 at 3:04 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > On Mon, Jul 13, 2020 at 2:56 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > On Mon, Jul 13, 2020 at 2:32 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > > > > > On Mon, Jul 13, 2020 at 11:10 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > > > > > > > > > I think you can refer to commit message as well for that "We however > > > > > must explicitly disable streaming replication during replication slot > > > > > creation, even if the plugin supports it. We don't need to replicate > > > > > the changes accumulated during this phase, and moreover, we don't have > > > > > a replication connection open so we don't have where to send the data > > > > > anyway.". I don't think this is a good way to hack the streaming flag > > > > > because for SQL API's, we don't have a good reason to disable the > > > > > streaming in this way. I guess if we had a condition related to > > > > > reaching CONSISTENT snapshot during streaming then we won't need to > > > > > hack the streaming flag in this way. Once we reach the CONSISTENT > > > > > snapshot state, we come out of the creation of a replication slot (see > > > > > how we use DecodingContextReady to achieve that) phase. So, I feel we > > > > > should remove the ctx->streaming setting to false and add a CONSISTENT > > > > > snapshot check during streaming unless you have a reason for not doing > > > > > so. > > > > > > > > I was worried about the point that streaming on/off is sent by the > > > > subscriber on START REPLICATION, not on CREATE REPLICATION SLOT, so if > > > > we keep streaming on during create then it may not be right. > > > > > > > > > > Then, how is that used on the publisher-side? AFAICS, the streaming > > > is enabled based on whether streaming callbacks are provided and we do > > > that in 0003-Extend-the-logical-decoding-output-plugin-API-wi patch. > > > > Basically, first, we enable based on whether we have the callbacks or > > not but later once we get the START REPLICATION command from the > > subscriber then we set it to false if the streaming is not enabled > > from the subscriber side. You can refer below code in patch 0007. > > > > pgoutput_startup > > { > > parse_output_parameters(ctx->output_plugin_options, > > &data->protocol_version, > > - &data->publication_names); > > + &data->publication_names, > > + &enable_streaming); > > /* Check if we support requested protocol */ > > if (data->protocol_version > LOGICALREP_PROTO_VERSION_NUM) > > @@ -222,6 +284,27 @@ pgoutput_startup(LogicalDecodingContext *ctx, > > OutputPluginOptions *opt, > > (errcode(ERRCODE_INVALID_PARAMETER_VALUE), > > errmsg("publication_names parameter missing"))); > > + /* > > + * Decide whether to enable streaming. It is disabled by default, in > > + * which case we just update the flag in decoding context. Otherwise > > + * we only allow it with sufficient version of the protocol, and when > > + * the output plugin supports it. > > + */ > > + if (!enable_streaming) > > + ctx->streaming = false; > > } > > > > Okay, in that case, we can do both enable and disable streaming in > this function itself rather than allow the caller to later modify it. > I suggest similarly we can enable/disable it for SQL API in > pg_decode_startup via output_plugin_options. This way it will look > consistent for both SQL APIs and for command-based replication. 
If we > can do so, then probably adding an Assert for Consistent Snapshot > while performing streaming should be okay. Sounds good to me. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Amit Kapila
Date:
On Mon, Jul 13, 2020 at 4:09 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Mon, Jul 13, 2020 at 4:00 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > Okay, in that case, we can do both enable and disable streaming in > > this function itself rather than allow the caller to later modify it. > > I suggest similarly we can enable/disable it for SQL API in > > pg_decode_startup via output_plugin_options. This way it will look > > consistent for both SQL APIs and for command-based replication. If we > > can do so, then probably adding an Assert for Consistent Snapshot > > while performing streaming should be okay. > > Sounds good to me. >

Please find the latest patches. I have made changes only in the subscriber-side patches (0007 and 0008 as per the current patch-set). The main changes are:
1. As discussed above, remove SendFeedback call from apply_handle_stream_commit
2. In SharedFilesetInit, ensure to register callback once
3. In stream_open_file, change slight handling around MemoryContexts
4. Merged the subscriber-side patches.
5. Added/Edited comments in 0007 and 0008.

-- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
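Regarding point 2 above, the generic register-once shape is roughly the sketch below; all names here are illustrative placeholders, not the patch's actual code:

static bool cleanup_registered = false;

/* Hypothetical cleanup hook; the signature matches before_shmem_exit(). */
static void
stream_files_cleanup(int code, Datum arg)
{
	/* delete any temporary stream files here */
}

static void
ensure_cleanup_registered(void)
{
	if (!cleanup_registered)
	{
		before_shmem_exit(stream_files_cleanup, (Datum) 0);
		cleanup_registered = true;
	}
}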
Attachment
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Dilip Kumar
Date:
On Tue, Jul 14, 2020 at 5:40 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Mon, Jul 13, 2020 at 4:09 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > On Mon, Jul 13, 2020 at 4:00 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > > > Okay, in that case, we can do both enable and disable streaming in > > > this function itself rather than allow the caller to later modify it. > > > I suggest similarly we can enable/disable it for SQL API in > > > pg_decode_startup via output_plugin_options. This way it will look > > > consistent for both SQL APIs and for command-based replication. If we > > > can do so, then probably adding an Assert for Consistent Snapshot > > > while performing streaming should be okay. > > > > Sounds good to me. > > > > Please find the latest patches. I have made changes only in the > subscriber-side patches (0007 and 0008 as per the current patch-set). > The main changes are: > 1. As discussed above, remove SendFeedback call from apply_handle_stream_commit > 2. In SharedFilesetInit, ensure to register callback once > 3. In stream_open_file, change slight handling around MemoryContexts > 4. Merged the subscriber-side patches. > 5. Added/Edited comments in 0007 and 0008.

I have reviewed your changes and those look good to me; please find the latest version of the patch set. The major changes:
- Fixed a couple of the review comments suggested upthread, in 0003 and 0005.
- Handle the case of stopping streaming until we reach the start_decoding_at LSN, in 0005.
- Simplified 0006 by avoiding sending transactions with incomplete changes, and added a comment atop ReorderBufferLargestTopTXN.
- Moved 0010 to 0007 and handled the pending comments in the same.
- In 0009, fixed a couple of the defects mentioned above, plus one additional defect: if we do ALTER SUBSCRIPTION to turn streaming off/on, it was not working.
- In 0009, send the origin id.

-- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Attachment
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Dilip Kumar
Date:
On Mon, Jul 13, 2020 at 4:00 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Mon, Jul 13, 2020 at 3:04 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > On Mon, Jul 13, 2020 at 2:56 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > On Mon, Jul 13, 2020 at 2:32 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > > > > > On Mon, Jul 13, 2020 at 11:10 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > > > > > > > > > I think you can refer to commit message as well for that "We however > > > > > must explicitly disable streaming replication during replication slot > > > > > creation, even if the plugin supports it. We don't need to replicate > > > > > the changes accumulated during this phase, and moreover, we don't have > > > > > a replication connection open so we don't have where to send the data > > > > > anyway.". I don't think this is a good way to hack the streaming flag > > > > > because for SQL API's, we don't have a good reason to disable the > > > > > streaming in this way. I guess if we had a condition related to > > > > > reaching CONSISTENT snapshot during streaming then we won't need to > > > > > hack the streaming flag in this way. Once we reach the CONSISTENT > > > > > snapshot state, we come out of the creation of a replication slot (see > > > > > how we use DecodingContextReady to achieve that) phase. So, I feel we > > > > > should remove the ctx->streaming setting to false and add a CONSISTENT > > > > > snapshot check during streaming unless you have a reason for not doing > > > > > so. > > > > > > > > I was worried about the point that streaming on/off is sent by the > > > > subscriber on START REPLICATION, not on CREATE REPLICATION SLOT, so if > > > > we keep streaming on during create then it may not be right. > > > > > > > > > > Then, how is that used on the publisher-side? AFAICS, the streaming > > > is enabled based on whether streaming callbacks are provided and we do > > > that in 0003-Extend-the-logical-decoding-output-plugin-API-wi patch. > > > > Basically, first, we enable based on whether we have the callbacks or > > not but later once we get the START REPLICATION command from the > > subscriber then we set it to false if the streaming is not enabled > > from the subscriber side. You can refer below code in patch 0007. > > > > pgoutput_startup > > { > > parse_output_parameters(ctx->output_plugin_options, > > &data->protocol_version, > > - &data->publication_names); > > + &data->publication_names, > > + &enable_streaming); > > /* Check if we support requested protocol */ > > if (data->protocol_version > LOGICALREP_PROTO_VERSION_NUM) > > @@ -222,6 +284,27 @@ pgoutput_startup(LogicalDecodingContext *ctx, > > OutputPluginOptions *opt, > > (errcode(ERRCODE_INVALID_PARAMETER_VALUE), > > errmsg("publication_names parameter missing"))); > > + /* > > + * Decide whether to enable streaming. It is disabled by default, in > > + * which case we just update the flag in decoding context. Otherwise > > + * we only allow it with sufficient version of the protocol, and when > > + * the output plugin supports it. > > + */ > > + if (!enable_streaming) > > + ctx->streaming = false; > > } > > > > Okay, in that case, we can do both enable and disable streaming in > this function itself rather than allow the caller to later modify it. > I suggest similarly we can enable/disable it for SQL API in > pg_decode_startup via output_plugin_options. This way it will look > consistent for both SQL APIs and for command-based replication. 
If we > can do so, then probably adding an Assert for Consistent Snapshot > while performing streaming should be okay. Done this way in the latest patch set. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Dilip Kumar
Date:
On Mon, Jul 13, 2020 at 10:47 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Sun, Jul 12, 2020 at 9:56 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > On Mon, Jul 6, 2020 at 11:43 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > > > On Mon, Jul 6, 2020 at 11:31 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > > On Sun, Jul 5, 2020 at 4:47 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > > > > > > > On Sat, Jul 4, 2020 at 11:35 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > > > > > > > > > > > 9. > > > > > > +ReorderBufferHandleConcurrentAbort(ReorderBuffer *rb, ReorderBufferTXN *txn, > > > > > > { > > > > > > .. > > > > > > + ReorderBufferToastReset(rb, txn); > > > > > > + if (specinsert != NULL) > > > > > > + ReorderBufferReturnChange(rb, specinsert); > > > > > > .. > > > > > > } > > > > > > > > > > > > Why do we need to do these here when we wouldn't have been done for > > > > > > any exception other than ERRCODE_TRANSACTION_ROLLBACK? > > > > > > > > > > Because we are handling this exception "ERRCODE_TRANSACTION_ROLLBACK" > > > > > gracefully and we are continuing with further decoding so we need to > > > > > return this change back. > > > > > > > > > > > > > Okay, then I suggest we should do these before calling stream_stop and > > > > also move ReorderBufferResetTXN after calling stream_stop to follow a > > > > pattern similar to try block unless there is a reason for not doing > > > > so. Also, it would be good if we can initialize specinsert with NULL > > > > after returning the change as we are doing at other places. > > > > > > Okay > > > > > > > > > 10. I have got the below failure once. I have not investigated this > > > > > > in detail as the patch is still under progress. See, if you have any > > > > > > idea? > > > > > > # Failed test 'check extra columns contain local defaults' > > > > > > # at t/013_stream_subxact_ddl_abort.pl line 81. > > > > > > # got: '2|0' > > > > > > # expected: '1000|500' > > > > > > # Looks like you failed 1 test of 2. > > > > > > make[2]: *** [check] Error 1 > > > > > > make[1]: *** [check-subscription-recurse] Error 2 > > > > > > make[1]: *** Waiting for unfinished jobs.... > > > > > > make: *** [check-world-src/test-recurse] Error 2 > > > > > > > > > > Even I got the failure once and after that, it did not reproduce. I > > > > > have executed it multiple time but it did not reproduce again. Are > > > > > you able to reproduce it consistently? > > > > > > > > > > > > > No, I am also not able to reproduce it consistently but I think this > > > > can fail if a subscriber sends the replay_location before actually > > > > replaying the changes. First, I thought that extra send_feedback we > > > > have in apply_handle_stream_commit might have caused this but I guess > > > > that can't happen because we need the commit time location for that > > > > and we are storing the same at the end of apply_handle_stream_commit > > > > after applying all messages. I am not sure what is going on here. I > > > > think we somehow need to reproduce this or some variant of this test > > > > consistently to find the root cause. > > > > > > And I think it appeared first time for me, so maybe either induced > > > from past few versions so some changes in the last few versions might > > > have exposed it. I have noticed that almost 50% of the time I am able > > > to reproduce after the clean build so I can trace back from which > > > version it started appearing that way it will be easy to narrow down. 
> > > > I think the reason for the failure is that we are not setting > > remote_final_lsn, in the streaming mode. I have put multiple logs and > > executed in log and from logs it appeared that some of the logical wal > > did not get replayed due to below check in > > should_apply_changes_for_rel. > > return (rel->state == SUBREL_STATE_READY || (rel->state == > > SUBREL_STATE_SYNCDONE && rel->statelsn <= remote_final_lsn)); > > > > I still need to do the detailed analysis that why does this fail in > > some cases, basically, most of the time the rel->state is > > SUBREL_STATE_READY so this check passes but whenever the state is > > SUBREL_STATE_SYNCDONE it failed because we never update > > remote_final_lsn. I will try to set this value in > > apply_handle_stream_commit and see whether it ever fails or not. > > I have verified that after setting the remote_final_lsn in the > apply_handle_stream_commit, I don't see that regression failure in > over 70 runs whereas without that change it failed 6 times in 50 runs. > Apart from this, I have noticed one more thing related to the same > point. Basically, in the apply_handle_commit, we are calling > process_syncing_tables whereas we are not calling the same in > apply_handle_stream_commit. I have set the remote_final_lsn as well as called process_syncing_tables, in apply_handle_stream_commit. Please see the latest patch set v33. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
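For reference, the shape of the fix described here, sketched against the patch series' apply_handle_stream_commit(). The helper names (logicalrep_read_stream_commit, apply_spooled_messages) are the ones used by the patch set; local commit, origin tracking and feedback handling are elided, so treat this as an outline rather than the actual code.

static void
apply_handle_stream_commit(StringInfo s)
{
    TransactionId xid;
    LogicalRepCommitData commit_data;

    xid = logicalrep_read_stream_commit(s, &commit_data);

    /*
     * Make the commit LSN visible to should_apply_changes_for_rel() before
     * replaying the spooled changes, just like apply_handle_begin() does
     * for non-streamed transactions.
     */
    remote_final_lsn = commit_data.commit_lsn;

    /* Replay the changes spooled for this streamed transaction. */
    apply_spooled_messages(xid, commit_data.commit_lsn);

    /* ... commit the local transaction, advance origin lsn/timestamp ... */

    /* Keep table synchronization in step, as in apply_handle_commit(). */
    process_syncing_tables(commit_data.end_lsn);
}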
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Ajin Cherian
Date:
On Wed, Jul 15, 2020 at 2:05 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
Please see the
latest patch set v33.
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
I have a minor comment. You've defined a new function ReorderBufferStartStreaming() but the function doesn't actually start streaming but is used to find out if you can start streaming and it returns a boolean. Can't you name it accordingly?
Probably ReorderBufferCanStartStreaming(). I understand that it internally calls ReorderBufferCanStream() which is similar sounding but I think that should not matter.
regards,
Ajin Cherian
Fujitsu Australia
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Amit Kapila
Date:
On Wed, Jul 15, 2020 at 4:51 PM Ajin Cherian <itsajin@gmail.com> wrote: > > On Wed, Jul 15, 2020 at 2:05 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: >> >> Please see the >> latest patch set v33. >> >> >> > > I have a minor comment. You've defined a new function ReorderBufferStartStreaming() but the function doesn't actually start streaming but is used to find out if you can start streaming and it returns a boolean. Can't you name it accordingly? > Probably ReorderBufferCanStartStreaming(). I understand that it internally calls ReorderBufferCanStream() which is similar sounding but I think that should not matter. > +1. I am actually editing some of the patches and I have already named it as you are suggesting. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Amit Kapila
Date:
On Wed, Jul 15, 2020 at 9:29 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > I have reviewed your changes and those look good to me, please find > the latest version of the patch set. > I have done an additional round of review and below are the changes I made in the attached patch-set. 1. Changed comments in 0002. 2. In 0005, apart from changing a few comments and function name, I have changed below code: + if (ReorderBufferCanStream(rb) && + !SnapBuildXactNeedsSkip(builder, ctx->reader->ReadRecPtr)) Here, I think it is better to compare it with EndRecPtr. I feel in boundary case the next record could be the same as start_decoding_at, so why to avoid streaming in that case? 3. In 0006, made below changes: a. Removed function ReorderBufferFreeChange and added a new parameter in ReorderBufferReturnChange to achieve the same purpose. b. Changed quite a few comments, function names, added additional Asserts, and few other cosmetic changes. 4. In 0007, made below changes: a. Removed the unnecessary change in .gitignore b. Changed the newly added option name to "stream-change". Apart from above, I have merged patches 0004, 0005, 0006 and 0007 as those seems one functionality to me. For the sake of review, the patch-set that contains merged patches is attached separately as v34-combined. Let me know what you think of the changes? -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Attachment
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Dilip Kumar
Date:
On Wed, Jul 15, 2020 at 6:59 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Wed, Jul 15, 2020 at 9:29 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > > > I have reviewed your changes and those look good to me, please find > > the latest version of the patch set. > > > > I have done an additional round of review and below are the changes I > made in the attached patch-set. > 1. Changed comments in 0002. > 2. In 0005, apart from changing a few comments and function name, I > have changed below code: > + if (ReorderBufferCanStream(rb) && > + !SnapBuildXactNeedsSkip(builder, ctx->reader->ReadRecPtr)) > Here, I think it is better to compare it with EndRecPtr. I feel in > boundary case the next record could be the same as start_decoding_at, > so why to avoid streaming in that case? Make sense to me > 3. In 0006, made below changes: > a. Removed function ReorderBufferFreeChange and added a new > parameter in ReorderBufferReturnChange to achieve the same purpose. > b. Changed quite a few comments, function names, added additional > Asserts, and few other cosmetic changes. > 4. In 0007, made below changes: > a. Removed the unnecessary change in .gitignore > b. Changed the newly added option name to "stream-change". > > Apart from above, I have merged patches 0004, 0005, 0006 and 0007 as > those seems one functionality to me. For the sake of review, the > patch-set that contains merged patches is attached separately as > v34-combined. > > Let me know what you think of the changes? I have reviewed the changes and looks fine to me. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Amit Kapila
Date:
On Thu, Jul 16, 2020 at 12:23 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Wed, Jul 15, 2020 at 6:59 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > Let me know what you think of the changes? > > I have reviewed the changes and looks fine to me. > Thanks, I am planning to start committing a few of the infrastructure patches (especially first two) by early next week as we have resolved all the open issues and done an extensive review of the entire patch-set. In the attached version, there is a slight change in one of the commit messages as compared to the previous version. I would like to describe in brief the first two patches for the sake of convenience. Let me know if you or anyone else sees any problems with these. The first patch in the series allows us to WAL-log subtransaction and top-level XID association. The logical decoding infrastructure needs to know which top-level transaction the subxact belongs to, in order to decode all the changes. Until now that might be delayed until commit, due to the caching (GPROC_MAX_CACHED_SUBXIDS), preventing features requiring incremental decoding. So we also write the assignment info into WAL immediately, as part of the next WAL record (to minimize overhead) only when *wal_level=logical*. We can not remove the existing XLOG_XACT_ASSIGNMENT WAL as that is required for avoiding overflow in the hot standby snapshot. The second patch writes WAL for invalidations at command end with wal_level=logical. When wal_level=logical, write invalidations at command end into WAL so that decoding can use this information. This patch is required to allow the streaming of in-progress transactions in logical decoding. We still add the invalidations to the cache and write them to WAL at commit time in RecordTransactionCommit(). This uses the existing XLOG_INVALIDATIONS xlog record type, from the RM_STANDBY_ID resource manager (see LogStandbyInvalidations for details). So existing code relying on those invalidations (e.g. redo) does not need to be changed. The invalidations written at command end uses a new xlog record type XLOG_XACT_INVALIDATIONS, from RM_XACT_ID resource manager. See LogLogicalInvalidations for details. These new xlog records are ignored by existing redo procedures, which still rely on the invalidations written to commit records. The invalidations are decoded and accumulated in top-transaction, and then executed during replay. This obviates the need to decode the invalidations as part of a commit record. The performance testing has shown that there is no performance penalty with either of the patches but there is some additional WAL which in most cases is 2-5% but in worst cases and for some specific DDL's it is up to 15% with the second patch, however, that happens at wal_level=logical only. We have considered an alternative to blow up all caches on any DDL in WALSenders and that will have both CPU and network overhead. For detailed results and analysis see [1][2]. [1] - https://www.postgresql.org/message-id/CAKYtNAqWkPpPFrdEbpPrCan3G_QAcankZarRKKd7cj6vQigM7w%40mail.gmail.com [2] - https://www.postgresql.org/message-id/CAA4eK1L3PoiBw6uogB7jD5rmdT-GmEF4kOEccS1AWKuBhSkQkQ%40mail.gmail.com -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Attachment
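To make the second patch's mechanism a bit more concrete, here is a simplified sketch of what a LogLogicalInvalidations()-style routine boils down to. This is not the actual patch code; GetCurrentCommandInvalidations() is a made-up placeholder standing in for however the pending messages of the current command are collected.

void
LogLogicalInvalidations(void)
{
    xl_xact_invals xlrec;
    SharedInvalidationMessage *msgs;
    int         nmsgs;

    /* Only needed for logical decoding. */
    if (!XLogLogicalInfoActive())
        return;

    /* Hypothetical helper: messages accumulated since the last command end. */
    nmsgs = GetCurrentCommandInvalidations(&msgs);
    if (nmsgs == 0)
        return;

    xlrec.nmsgs = nmsgs;

    XLogBeginInsert();
    XLogRegisterData((char *) &xlrec, MinSizeOfXactInvals);
    XLogRegisterData((char *) msgs,
                     nmsgs * sizeof(SharedInvalidationMessage));

    /* New record type; ignored by redo, consumed only by logical decoding. */
    XLogInsert(RM_XACT_ID, XLOG_XACT_INVALIDATIONS);
}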
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Dilip Kumar
Date:
On Thu, Jul 16, 2020 at 4:25 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Thu, Jul 16, 2020 at 12:23 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > On Wed, Jul 15, 2020 at 6:59 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > > > Let me know what you think of the changes? > > > > I have reviewed the changes and looks fine to me. > > > > Thanks, I am planning to start committing a few of the infrastructure > patches (especially first two) by early next week as we have resolved > all the open issues and done an extensive review of the entire > patch-set. In the attached version, there is a slight change in one > of the commit messages as compared to the previous version. I would > like to describe in brief the first two patches for the sake of > convenience. Let me know if you or anyone else sees any problems with > these. > > The first patch in the series allows us to WAL-log subtransaction and > top-level XID association. The logical decoding infrastructure needs > to know which top-level > transaction the subxact belongs to, in order to decode all the > changes. Until now that might be delayed until commit, due to the > caching (GPROC_MAX_CACHED_SUBXIDS), preventing features requiring > incremental decoding. So we also write the assignment info into WAL > immediately, as part of the next WAL record (to minimize overhead) > only when *wal_level=logical*. We can not remove the existing > XLOG_XACT_ASSIGNMENT WAL as that is required for avoiding overflow in > the hot standby snapshot. > > The second patch writes WAL for invalidations at command end with > wal_level=logical. When wal_level=logical, write invalidations at > command end into WAL so that decoding can use this information. This > patch is required to allow the streaming of in-progress transactions > in logical decoding. We still add the invalidations to the cache and > write them to WAL at commit time in RecordTransactionCommit(). This > uses the existing XLOG_INVALIDATIONS xlog record type, from the > RM_STANDBY_ID resource manager (see LogStandbyInvalidations for > details). So existing code relying on those invalidations (e.g. redo) > does not need to be changed. The invalidations written at command end > uses a new xlog record type XLOG_XACT_INVALIDATIONS, from RM_XACT_ID > resource manager. See LogLogicalInvalidations for details. These new > xlog records are ignored by existing redo procedures, which still rely > on the invalidations written to commit records. The invalidations are > decoded and accumulated in top-transaction, and then executed during > replay. This obviates the need to decode the invalidations as part of > a commit record. > > The performance testing has shown that there is no performance penalty > with either of the patches but there is some additional WAL which in > most cases is 2-5% but in worst cases and for some specific DDL's it > is up to 15% with the second patch, however, that happens at > wal_level=logical only. We have considered an alternative to blow up > all caches on any DDL in WALSenders and that will have both CPU and > network overhead. For detailed results and analysis see [1][2]. > > [1] - https://www.postgresql.org/message-id/CAKYtNAqWkPpPFrdEbpPrCan3G_QAcankZarRKKd7cj6vQigM7w%40mail.gmail.com > [2] - https://www.postgresql.org/message-id/CAA4eK1L3PoiBw6uogB7jD5rmdT-GmEF4kOEccS1AWKuBhSkQkQ%40mail.gmail.com > The patch set required to rebase after committing the binary format option support in the create subscription command. 
I have rebased the patch set on the latest head and also added a test case to test streaming in binary format. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Attachment
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Amit Kapila
Date:
On Mon, Jul 20, 2020 at 12:01 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Thu, Jul 16, 2020 at 4:25 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > On Thu, Jul 16, 2020 at 12:23 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > > > On Wed, Jul 15, 2020 at 6:59 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > > > > > > Let me know what you think of the changes? > > > > > > I have reviewed the changes and looks fine to me. > > > > > > > Thanks, I am planning to start committing a few of the infrastructure > > patches (especially first two) by early next week as we have resolved > > all the open issues and done an extensive review of the entire > > patch-set. In the attached version, there is a slight change in one > > of the commit messages as compared to the previous version. I would > > like to describe in brief the first two patches for the sake of > > convenience. Let me know if you or anyone else sees any problems with > > these. > > > > The first patch in the series allows us to WAL-log subtransaction and > > top-level XID association. The logical decoding infrastructure needs > > to know which top-level > > transaction the subxact belongs to, in order to decode all the > > changes. Until now that might be delayed until commit, due to the > > caching (GPROC_MAX_CACHED_SUBXIDS), preventing features requiring > > incremental decoding. So we also write the assignment info into WAL > > immediately, as part of the next WAL record (to minimize overhead) > > only when *wal_level=logical*. We can not remove the existing > > XLOG_XACT_ASSIGNMENT WAL as that is required for avoiding overflow in > > the hot standby snapshot. > > Pushed, this patch. > > > > The patch set required to rebase after committing the binary format > option support in the create subscription command. I have rebased the > patch set on the latest head and also added a test case to test > streaming in binary format. > While going through commit 9de77b5453, I noticed below change: @@ -424,6 +424,10 @@ libpqrcv_startstreaming(WalReceiverConn *conn, PQfreemem(pubnames_literal); pfree(pubnames_str); + if (options->proto.logical.binary && + PQserverVersion(conn->streamConn) >= 140000) + appendStringInfoString(&cmd, ", binary 'true'"); + Now, the similar change in this patch series is as below: @@ -408,6 +408,9 @@ libpqrcv_startstreaming(WalReceiverConn *conn, appendStringInfo(&cmd, "proto_version '%u'", options->proto.logical.proto_version); + if (options->proto.logical.streaming) + appendStringInfo(&cmd, ", streaming 'on'"); + I think we also need a version check similar to commit 9de77b5453 to ensure that we send the new option only when connected to a newer version (>=14) primary server. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
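Concretely, the adjustment being suggested amounts to something like the following in libpqrcv_startstreaming(), mirroring the binary-option hunk from commit 9de77b5453 (a sketch, not the final patch code):

    /* Ask for streaming only if requested and the server is new enough. */
    if (options->proto.logical.streaming &&
        PQserverVersion(conn->streamConn) >= 140000)
        appendStringInfoString(&cmd, ", streaming 'on'");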
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Dilip Kumar
Date:
On Mon, Jul 20, 2020 at 2:15 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Mon, Jul 20, 2020 at 12:01 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > On Thu, Jul 16, 2020 at 4:25 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > On Thu, Jul 16, 2020 at 12:23 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > > > > > On Wed, Jul 15, 2020 at 6:59 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > > > > > > > > > Let me know what you think of the changes? > > > > > > > > I have reviewed the changes and looks fine to me. > > > > > > > > > > Thanks, I am planning to start committing a few of the infrastructure > > > patches (especially first two) by early next week as we have resolved > > > all the open issues and done an extensive review of the entire > > > patch-set. In the attached version, there is a slight change in one > > > of the commit messages as compared to the previous version. I would > > > like to describe in brief the first two patches for the sake of > > > convenience. Let me know if you or anyone else sees any problems with > > > these. > > > > > > The first patch in the series allows us to WAL-log subtransaction and > > > top-level XID association. The logical decoding infrastructure needs > > > to know which top-level > > > transaction the subxact belongs to, in order to decode all the > > > changes. Until now that might be delayed until commit, due to the > > > caching (GPROC_MAX_CACHED_SUBXIDS), preventing features requiring > > > incremental decoding. So we also write the assignment info into WAL > > > immediately, as part of the next WAL record (to minimize overhead) > > > only when *wal_level=logical*. We can not remove the existing > > > XLOG_XACT_ASSIGNMENT WAL as that is required for avoiding overflow in > > > the hot standby snapshot. > > > > > Pushed, this patch. > > > > > > > > The patch set required to rebase after committing the binary format > > option support in the create subscription command. I have rebased the > > patch set on the latest head and also added a test case to test > > streaming in binary format. > > > > While going through commit 9de77b5453, I noticed below change: > > @@ -424,6 +424,10 @@ libpqrcv_startstreaming(WalReceiverConn *conn, > PQfreemem(pubnames_literal); > pfree(pubnames_str); > > + if (options->proto.logical.binary && > + PQserverVersion(conn->streamConn) >= 140000) > + appendStringInfoString(&cmd, ", binary 'true'"); > + > > Now, the similar change in this patch series is as below: > > @@ -408,6 +408,9 @@ libpqrcv_startstreaming(WalReceiverConn *conn, > appendStringInfo(&cmd, "proto_version '%u'", > options->proto.logical.proto_version); > > + if (options->proto.logical.streaming) > + appendStringInfo(&cmd, ", streaming 'on'"); > + > > I think we also need a version check similar to commit 9de77b5453 to > ensure that we send the new option only when connected to a newer > version (>=14) primary server. I have changed that in the attached patch. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Attachment
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Dilip Kumar
Date:
On Mon, Jul 20, 2020 at 4:41 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Mon, Jul 20, 2020 at 2:15 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > On Mon, Jul 20, 2020 at 12:01 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > > > On Thu, Jul 16, 2020 at 4:25 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > > On Thu, Jul 16, 2020 at 12:23 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > > > > > > > On Wed, Jul 15, 2020 at 6:59 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > > > > > > > > > > > > Let me know what you think of the changes? > > > > > > > > > > I have reviewed the changes and looks fine to me. > > > > > > > > > > > > > Thanks, I am planning to start committing a few of the infrastructure > > > > patches (especially first two) by early next week as we have resolved > > > > all the open issues and done an extensive review of the entire > > > > patch-set. In the attached version, there is a slight change in one > > > > of the commit messages as compared to the previous version. I would > > > > like to describe in brief the first two patches for the sake of > > > > convenience. Let me know if you or anyone else sees any problems with > > > > these. > > > > > > > > The first patch in the series allows us to WAL-log subtransaction and > > > > top-level XID association. The logical decoding infrastructure needs > > > > to know which top-level > > > > transaction the subxact belongs to, in order to decode all the > > > > changes. Until now that might be delayed until commit, due to the > > > > caching (GPROC_MAX_CACHED_SUBXIDS), preventing features requiring > > > > incremental decoding. So we also write the assignment info into WAL > > > > immediately, as part of the next WAL record (to minimize overhead) > > > > only when *wal_level=logical*. We can not remove the existing > > > > XLOG_XACT_ASSIGNMENT WAL as that is required for avoiding overflow in > > > > the hot standby snapshot. > > > > > > > > Pushed, this patch. > > > > > > > > > > > > The patch set required to rebase after committing the binary format > > > option support in the create subscription command. I have rebased the > > > patch set on the latest head and also added a test case to test > > > streaming in binary format. > > > > > > > While going through commit 9de77b5453, I noticed below change: > > > > @@ -424,6 +424,10 @@ libpqrcv_startstreaming(WalReceiverConn *conn, > > PQfreemem(pubnames_literal); > > pfree(pubnames_str); > > > > + if (options->proto.logical.binary && > > + PQserverVersion(conn->streamConn) >= 140000) > > + appendStringInfoString(&cmd, ", binary 'true'"); > > + > > > > Now, the similar change in this patch series is as below: > > > > @@ -408,6 +408,9 @@ libpqrcv_startstreaming(WalReceiverConn *conn, > > appendStringInfo(&cmd, "proto_version '%u'", > > options->proto.logical.proto_version); > > > > + if (options->proto.logical.streaming) > > + appendStringInfo(&cmd, ", streaming 'on'"); > > + > > > > I think we also need a version check similar to commit 9de77b5453 to > > ensure that we send the new option only when connected to a newer > > version (>=14) primary server. > > I have changed that in the attached patch. There was one warning in release mode in the last version in 0004 so attaching a new version. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Attachment
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Ajin Cherian
Date:
On Mon, Jul 20, 2020 at 11:16 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
There was one warning in release mode in the last version in 0004 so
attaching a new version.
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Hello,
I have tried to rework the patch that adds stats for the streaming of logical replication, basing it on the new logical replication stats framework developed by Masahiko-san and rebased by Amit in [1]. This uses v38 of the streaming patch set as well as v1 of the stats framework patch as its base. I will rebase this as the stats framework is updated. Let me know if you have any comments.
regards,
Ajin Cherian
Fujitsu Australia
Attachment
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Amit Kapila
Date:
On Mon, Jul 20, 2020 at 6:46 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > There was one warning in release mode in the last version in 0004 so > attaching a new version. > Today, I was reviewing patch v38-0001-WAL-Log-invalidations-at-command-end-with-wal_le and found a small problem with it. + /* + * Execute the invalidations for xid-less transactions, + * otherwise, accumulate them so that they can be processed at + * the commit time. + */ + if (!ctx->fast_forward) + { + if (TransactionIdIsValid(xid)) + { + ReorderBufferAddInvalidations(reorder, xid, buf->origptr, + invals->nmsgs, invals->msgs); + ReorderBufferXidSetCatalogChanges(ctx->reorder, xid, + buf->origptr); + } I think we need to set ReorderBufferXidSetCatalogChanges even when ctx->fast-forward is true because we are dependent on that flag for snapshot build (see SnapBuildCommitTxn). We are already doing the same way in DecodeCommit where even though we skip adding invalidations for fast-forward cases but we do set the flag to indicate that this txn has catalog changes. Is there any reason to do things differently here? -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
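The restructuring being proposed here would look roughly like this. It is illustrative and based on the hunk quoted above, with the same variable names; the xid-less branch that executes the invalidations immediately is elided.

    if (!ctx->fast_forward && TransactionIdIsValid(xid))
        ReorderBufferAddInvalidations(reorder, xid, buf->origptr,
                                      invals->nmsgs, invals->msgs);

    /*
     * Mark the transaction as containing catalog changes even in
     * fast-forward mode, because snapshot building relies on this flag
     * (see SnapBuildCommitTxn), just as DecodeCommit already does.
     */
    if (TransactionIdIsValid(xid))
        ReorderBufferXidSetCatalogChanges(ctx->reorder, xid, buf->origptr);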
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Dilip Kumar
Date:
On Wed, Jul 22, 2020 at 9:18 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Mon, Jul 20, 2020 at 6:46 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > There was one warning in release mode in the last version in 0004 so > > attaching a new version. > > > > Today, I was reviewing patch > v38-0001-WAL-Log-invalidations-at-command-end-with-wal_le and found a > small problem with it. > > + /* > + * Execute the invalidations for xid-less transactions, > + * otherwise, accumulate them so that they can be processed at > + * the commit time. > + */ > + if (!ctx->fast_forward) > + { > + if (TransactionIdIsValid(xid)) > + { > + ReorderBufferAddInvalidations(reorder, xid, buf->origptr, > + invals->nmsgs, invals->msgs); > + ReorderBufferXidSetCatalogChanges(ctx->reorder, xid, > + buf->origptr); > + } > > I think we need to set ReorderBufferXidSetCatalogChanges even when > ctx->fast-forward is true because we are dependent on that flag for > snapshot build (see SnapBuildCommitTxn). We are already doing the > same way in DecodeCommit where even though we skip adding > invalidations for fast-forward cases but we do set the flag to > indicate that this txn has catalog changes. Is there any reason to do > things differently here? I think it is wrong, we should set the ReorderBufferXidSetCatalogChanges, even if it is in fast-forward mode. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Attachment
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Amit Kapila
Date:
On Wed, Jul 22, 2020 at 10:20 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Wed, Jul 22, 2020 at 9:18 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > On Mon, Jul 20, 2020 at 6:46 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > > > There was one warning in release mode in the last version in 0004 so > > > attaching a new version. > > > > > > > Today, I was reviewing patch > > v38-0001-WAL-Log-invalidations-at-command-end-with-wal_le and found a > > small problem with it. > > > > + /* > > + * Execute the invalidations for xid-less transactions, > > + * otherwise, accumulate them so that they can be processed at > > + * the commit time. > > + */ > > + if (!ctx->fast_forward) > > + { > > + if (TransactionIdIsValid(xid)) > > + { > > + ReorderBufferAddInvalidations(reorder, xid, buf->origptr, > > + invals->nmsgs, invals->msgs); > > + ReorderBufferXidSetCatalogChanges(ctx->reorder, xid, > > + buf->origptr); > > + } > > > > I think we need to set ReorderBufferXidSetCatalogChanges even when > > ctx->fast-forward is true because we are dependent on that flag for > > snapshot build (see SnapBuildCommitTxn). We are already doing the > > same way in DecodeCommit where even though we skip adding > > invalidations for fast-forward cases but we do set the flag to > > indicate that this txn has catalog changes. Is there any reason to do > > things differently here? > > I think it is wrong, we should set the > ReorderBufferXidSetCatalogChanges, even if it is in fast-forward mode. > Thanks for the change. I have one more minor comment in the patch 0001-WAL-Log-invalidations-at-command-end-with-wal_le. /* + * Invalidations logged with wal_level=logical. + */ +typedef struct xl_xact_invalidations +{ + int nmsgs; /* number of shared inval msgs */ + SharedInvalidationMessage msgs[FLEXIBLE_ARRAY_MEMBER]; +} xl_xact_invalidations; I see that we already have a structure xl_xact_invals in the code which has the same members, so I think it is better to use that instead of defining a new one. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Dilip Kumar
Date:
On Wed, Jul 22, 2020 at 4:20 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Wed, Jul 22, 2020 at 10:20 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > On Wed, Jul 22, 2020 at 9:18 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > On Mon, Jul 20, 2020 at 6:46 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > > > > > There was one warning in release mode in the last version in 0004 so > > > > attaching a new version. > > > > > > > > > > Today, I was reviewing patch > > > v38-0001-WAL-Log-invalidations-at-command-end-with-wal_le and found a > > > small problem with it. > > > > > > + /* > > > + * Execute the invalidations for xid-less transactions, > > > + * otherwise, accumulate them so that they can be processed at > > > + * the commit time. > > > + */ > > > + if (!ctx->fast_forward) > > > + { > > > + if (TransactionIdIsValid(xid)) > > > + { > > > + ReorderBufferAddInvalidations(reorder, xid, buf->origptr, > > > + invals->nmsgs, invals->msgs); > > > + ReorderBufferXidSetCatalogChanges(ctx->reorder, xid, > > > + buf->origptr); > > > + } > > > > > > I think we need to set ReorderBufferXidSetCatalogChanges even when > > > ctx->fast-forward is true because we are dependent on that flag for > > > snapshot build (see SnapBuildCommitTxn). We are already doing the > > > same way in DecodeCommit where even though we skip adding > > > invalidations for fast-forward cases but we do set the flag to > > > indicate that this txn has catalog changes. Is there any reason to do > > > things differently here? > > > > I think it is wrong, we should set the > > ReorderBufferXidSetCatalogChanges, even if it is in fast-forward mode. > > > > Thanks for the change. I have one more minor comment in the patch > 0001-WAL-Log-invalidations-at-command-end-with-wal_le. > > /* > + * Invalidations logged with wal_level=logical. > + */ > +typedef struct xl_xact_invalidations > +{ > + int nmsgs; /* number of shared inval msgs */ > + SharedInvalidationMessage msgs[FLEXIBLE_ARRAY_MEMBER]; > +} xl_xact_invalidations; > > I see that we already have a structure xl_xact_invals in the code > which has the same members, so I think it is better to use that > instead of defining a new one. You are right. I have changed it. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Attachment
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Amit Kapila
Date:
On Wed, Jul 22, 2020 at 4:55 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > You are right. I have changed it. > Thanks, I have pushed the second patch in this series which is 0001-WAL-Log-invalidations-at-command-end-with-wal_le in your latest patch. I will continue working on remaining patches. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Amit Kapila
Date:
On Thu, Jul 23, 2020 at 11:31 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Wed, Jul 22, 2020 at 4:55 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > You are right. I have changed it. > > > > Thanks, I have pushed the second patch in this series which is > 0001-WAL-Log-invalidations-at-command-end-with-wal_le in your latest > patch. I will continue working on remaining patches. > I have reviewed and made a number of changes in the next patch which extends the logical decoding output plugin API with stream methods. (v41-0001-Extend-the-logical-decoding-output-plugin-API-wi). 1. I think we need handling of include_xids and include_timestamp but not skip_empty_xacts in the new APIs, as of now, none of the options were respected. We need 'include_xids' handling because we need to include xid with stream messages and similarly 'include_timestamp' for stream commit messages. OTOH, I think we never use streaming mode for empty xacts, so we don't need to bother about skip_empty_xacts in streaming APIs. 2. Then I made a number of changes in documentation, comments, and other cosmetic changes. Kindly review/test and let me know if you see any problems with the above changes. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Attachment
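As an illustration of point 1, the include_xids handling in the new stream callbacks can be as simple as the following test_decoding-style stream_change callback. This is a sketch rather than the exact patch code; include_timestamp would be honoured analogously in the stream commit callback.

static void
pg_decode_stream_change(LogicalDecodingContext *ctx,
                        ReorderBufferTXN *txn,
                        Relation relation,
                        ReorderBufferChange *change)
{
    TestDecodingData *data = ctx->output_plugin_private;

    OutputPluginPrepareWrite(ctx, true);
    if (data->include_xids)
        appendStringInfo(ctx->out, "streaming change for TXN %u", txn->xid);
    else
        appendStringInfoString(ctx->out, "streaming change for transaction");
    OutputPluginWrite(ctx, true);
}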
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Dilip Kumar
Date:
On Fri, Jul 24, 2020 at 5:05 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Thu, Jul 23, 2020 at 11:31 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > On Wed, Jul 22, 2020 at 4:55 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > > > You are right. I have changed it. > > > > > > > Thanks, I have pushed the second patch in this series which is > > 0001-WAL-Log-invalidations-at-command-end-with-wal_le in your latest > > patch. I will continue working on remaining patches. > > > > I have reviewed and made a number of changes in the next patch which > extends the logical decoding output plugin API with stream methods. > (v41-0001-Extend-the-logical-decoding-output-plugin-API-wi). > > 1. I think we need handling of include_xids and include_timestamp but > not skip_empty_xacts in the new APIs, as of now, none of the options > were respected. We need 'include_xids' handling because we need to > include xid with stream messages and similarly 'include_timestamp' for > stream commit messages. OTOH, I think we never use streaming mode for > empty xacts, so we don't need to bother about skip_empty_xacts in > streaming APIs. > 2. Then I made a number of changes in documentation, comments, and > other cosmetic changes. > > Kindly review/test and let me know if you see any problems with the > above changes. Your changes look fine to me. Additionally, I have changed a test case of getting the streaming changes in 0002. Instead of just showing the count, I am showing that the transaction is actually streaming. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Attachment
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Amit Kapila
Date:
On Fri, Jul 24, 2020 at 7:17 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > Your changes look fine to me. Additionally, I have changed a test > case of getting the streaming changes in 0002. Instead of just > showing the count, I am showing that the transaction is actually > streaming. > If you want to show the changes then there is no need to display 157 rows; probably a few (10-15) should be sufficient. If we can do that by increasing the size of the row then good, otherwise, I think it is better to retain the test to display the count. Today, I have again looked at the first patch (v42-0001-Extend-the-logical-decoding-output-plugin-API-wi) and didn't find any more problems with it, so I am planning to commit the same unless you or someone else wants to add more to it. Just for ease of others, "the next patch extends the logical decoding output plugin API with stream methods". It adds seven methods to the output plugin API, adding support for streaming changes for large in-progress transactions. The methods are stream_start, stream_stop, stream_abort, stream_commit, stream_change, stream_message, and stream_truncate. Most of this is a simple extension of the existing methods, with the semantic difference that the transaction (or subtransaction) is incomplete and may be aborted later (which is something the regular API does not really need to deal with). This also extends the 'test_decoding' plugin, implementing these new stream methods. The stream_start/stream_stop methods are used to demarcate a chunk of changes streamed for a particular toplevel transaction. This commit simply adds these new APIs; the upcoming patch to "allow the streaming mode in ReorderBuffer" will use these APIs. -- With Regards, Amit Kapila.
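For readers following along, an output plugin opts into the new streaming mode simply by filling in the corresponding callbacks. A sketch of a test_decoding-style _PG_output_plugin_init() is shown below; the callback field names follow the existing _cb convention and the non-streaming callbacks are abbreviated.

void
_PG_output_plugin_init(OutputPluginCallbacks *cb)
{
    cb->startup_cb = pg_decode_startup;
    cb->begin_cb = pg_decode_begin_txn;
    cb->change_cb = pg_decode_change;
    cb->commit_cb = pg_decode_commit_txn;
    cb->shutdown_cb = pg_decode_shutdown;
    /* ... truncate_cb, message_cb, filter_by_origin_cb, etc. ... */

    /* Streaming of large in-progress transactions. */
    cb->stream_start_cb = pg_decode_stream_start;
    cb->stream_stop_cb = pg_decode_stream_stop;
    cb->stream_abort_cb = pg_decode_stream_abort;
    cb->stream_commit_cb = pg_decode_stream_commit;
    cb->stream_change_cb = pg_decode_stream_change;
    cb->stream_message_cb = pg_decode_stream_message;
    cb->stream_truncate_cb = pg_decode_stream_truncate;
}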
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Dilip Kumar
Date:
On Sat, Jul 25, 2020 at 5:08 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Fri, Jul 24, 2020 at 7:17 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > Your changes look fine to me. Additionally, I have changed a test > > case of getting the streaming changes in 0002. Instead of just > > showing the count, I am showing that the transaction is actually > > streaming. > > > > If you want to show the changes then there is no need to display 157 > rows probably a few (10-15) should be sufficient. If we can do that > by increasing the size of the row then good, otherwise, I think it is > better to retain the test to display the count. I think in existing test cases also we are displaying multiple lines e.g. toast.out is showing 235 rows. But maybe I will try to reduce it to the less number of rows. > Today, I have again looked at the first patch > (v42-0001-Extend-the-logical-decoding-output-plugin-API-wi) and didn't > find any more problems with it so planning to commit the same unless > you or someone else want to add more to it. Just for ease of others, > "the next patch extends the logical decoding output plugin API with > stream methods". It adds seven methods to the output plugin API, > adding support for streaming changes for large in-progress > transactions. The methods are stream_start, stream_stop, stream_abort, > stream_commit, stream_change, stream_message, and stream_truncate. > Most of this is a simple extension of the existing methods, with the > semantic difference that the transaction (or subtransaction) is > incomplete and may be aborted later (which is something the regular > API does not really need to deal with). > > This also extends the 'test_decoding' plugin, implementing these new > stream methods. The stream_start/start_stop are used to demarcate a > chunk of changes streamed for a particular toplevel transaction. > > This commit simply adds these new APIs and the upcoming patch to > "allow the streaming mode in ReorderBuffer" will use these APIs. LGTM -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Dilip Kumar
Date:
On Sun, Jul 26, 2020 at 11:04 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Sat, Jul 25, 2020 at 5:08 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > On Fri, Jul 24, 2020 at 7:17 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > > > Your changes look fine to me. Additionally, I have changed a test > > > case of getting the streaming changes in 0002. Instead of just > > > showing the count, I am showing that the transaction is actually > > > streaming. > > > > > > > If you want to show the changes then there is no need to display 157 > > rows probably a few (10-15) should be sufficient. If we can do that > > by increasing the size of the row then good, otherwise, I think it is > > better to retain the test to display the count. > > I think in existing test cases also we are displaying multiple lines > e.g. toast.out is showing 235 rows. But maybe I will try to reduce it > to the less number of rows. Changed, now only 27 rows. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Attachment
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Amit Kapila
Date:
On Sun, Jul 26, 2020 at 11:04 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > Today, I have again looked at the first patch > > (v42-0001-Extend-the-logical-decoding-output-plugin-API-wi) and didn't > > find any more problems with it so planning to commit the same unless > > you or someone else want to add more to it. Just for ease of others, > > "the next patch extends the logical decoding output plugin API with > > stream methods". It adds seven methods to the output plugin API, > > adding support for streaming changes for large in-progress > > transactions. The methods are stream_start, stream_stop, stream_abort, > > stream_commit, stream_change, stream_message, and stream_truncate. > > Most of this is a simple extension of the existing methods, with the > > semantic difference that the transaction (or subtransaction) is > > incomplete and may be aborted later (which is something the regular > > API does not really need to deal with). > > > > This also extends the 'test_decoding' plugin, implementing these new > > stream methods. The stream_start/start_stop are used to demarcate a > > chunk of changes streamed for a particular toplevel transaction. > > > > This commit simply adds these new APIs and the upcoming patch to > > "allow the streaming mode in ReorderBuffer" will use these APIs. > > LGTM > Pushed. Feel free to submit the remaining patches. -- With Regards, Amit Kapila.
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Dilip Kumar
Date:
On Tue, Jul 28, 2020 at 9:52 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Sun, Jul 26, 2020 at 11:04 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > > Today, I have again looked at the first patch > > > (v42-0001-Extend-the-logical-decoding-output-plugin-API-wi) and didn't > > > find any more problems with it so planning to commit the same unless > > > you or someone else want to add more to it. Just for ease of others, > > > "the next patch extends the logical decoding output plugin API with > > > stream methods". It adds seven methods to the output plugin API, > > > adding support for streaming changes for large in-progress > > > transactions. The methods are stream_start, stream_stop, stream_abort, > > > stream_commit, stream_change, stream_message, and stream_truncate. > > > Most of this is a simple extension of the existing methods, with the > > > semantic difference that the transaction (or subtransaction) is > > > incomplete and may be aborted later (which is something the regular > > > API does not really need to deal with). > > > > > > This also extends the 'test_decoding' plugin, implementing these new > > > stream methods. The stream_start/start_stop are used to demarcate a > > > chunk of changes streamed for a particular toplevel transaction. > > > > > > This commit simply adds these new APIs and the upcoming patch to > > > "allow the streaming mode in ReorderBuffer" will use these APIs. > > > > LGTM > > > > Pushed. Feel free to submit the remaining patches. Thanks, please find the rebased patch set. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Attachment
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Ajin Cherian
Date:
On Wed, Jul 29, 2020 at 3:16 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
Thanks, please find the rebased patch set.
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
I was running some tests on this patch. I was generally trying to see how the patch affects logical replication when doing bulk inserts. This issue has been raised in the past; see, e.g., [1].
My test setup is:
1. Two postgres servers running - A and B
2. Create a pgbench setup on A. (pgbench -i -s 5 postgres)
3. Replicate the 3 tables (schema only) on B.
4. Three publishers on A for the 3 tables of pgbench: pgbench_accounts, pgbench_branches and pgbench_tellers.
5. Three subscribers on B for the same tables. (streaming on and off based on the scenarios described below)
Run pgbench with: pgbench -c 4 -T 100 postgres
While pgbench is running, do a bulk insert on some other table not in the publication list (say t1): INSERT INTO t1 (select i FROM generate_series(1,10000000) i);
Four scenarios:
1. Pgbench with logical replication enabled without bulk insert
Avg TPS (out of 10 runs): 641 TPS
2. Pgbench without logical replication enabled with bulk insert (no pub/sub)
Avg TPS (out of 10 runs): 665 TPS
3. Pgbench with logical replication enabled with bulk insert
Avg TPS (out of 10 runs): 278 TPS
4. Pgbench with logical replication streaming on with bulk insert
Avg TPS (out of 10 runs): 440 TPS
As you can see, the bulk inserts, although on a totally unaffected table, do impact the TPS. But what is good is that enabling streaming improves the TPS (about a 58% improvement).
regards,
Ajin Cherian
Fujitsu Australia
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Amit Kapila
Date:
On Thu, Jul 30, 2020 at 12:28 PM Ajin Cherian <itsajin@gmail.com> wrote: > > I was running some tests on this patch. I was generally trying to see how the patch affects logical replication when doing bulk inserts. This issue has been raised in the past, for eg: this [1]. > My test setup is: > 1. Two postgres servers running - A and B > 2. Create a pgbench setup on A. (pgbench -i -s 5 postgres) > 3. replicate the 3 tables (schema only) on B. > 4. Three publishers on A for the 3 tables of pgbench; pgbench_accounts, pgbench_branches and pgbench_tellers; > 5. Three subscribers on B for the same tables. (streaming on and off based on the scenarios described below) > > run pgbench with : pgbench -c 4 -T 100 postgres > While pgbench is running, do a bulk insert on some other table not in the publication list (say t1); INSERT INTO t1 (select i FROM generate_series(1,10000000) i); > > Four scenarios: > 1. Pgbench with logical replication enabled without bulk insert > Avg TPS (out of 10 runs): 641 TPS > 2. Pgbench without logical replication enabled with bulk insert (no pub/sub) > Avg TPS (out of 10 runs): 665 TPS > 3. Pgbench with logical replication enabled with bulk insert > Avg TPS (out of 10 runs): 278 TPS > 4. Pgbench with logical replication streaming on with bulk insert > Avg TPS (out of 10 runs): 440 TPS > > As you can see, the bulk inserts, although on a totally unaffected table, do impact the TPS. But what is good is that, enabling streaming improves the TPS (about 58% improvement) > Thanks for doing these tests; it is a good win, and probably the reason is that after the patch we won't serialize such big transactions (as shown in Konstantin's email [1]) and they will simply be skipped. Basically, it will try to stream such transactions and will skip them as they are not required to be sent. [1] - https://www.postgresql.org/message-id/5f5143cc-9f73-3909-3ef7-d3895cc6cc90%40postgrespro.ru -- With Regards, Amit Kapila.
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Ajin Cherian
Date:
Attaching an updated patch for the stats for streaming, based on v2 of Sawada-san's replication slot stats framework and v44 of this patch series. This is one patch that has both the stats framework from Sawada-san (1) as well as my update for streaming, so it can be applied easily on top of v44.
regards,
Ajin Cherian
Fujitsu Australia
Attachment
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Amit Kapila
Date:
On Wed, Jul 29, 2020 at 10:46 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > Thanks, please find the rebased patch set. > Few comments on v44-0001-Implement-streaming-mode-in-ReorderBuffer: ============================================================ 1. +-- streaming with subxact, nothing in main +BEGIN; +savepoint s1; +SELECT 'msg5' FROM pg_logical_emit_message(true, 'test', repeat('a', 50)); +INSERT INTO stream_test SELECT repeat('a', 2000) || g.i FROM generate_series(1, 35) g(i); +TRUNCATE table stream_test; +rollback to s1; +INSERT INTO stream_test SELECT repeat('a', 10) || g.i FROM generate_series(1, 20) g(i); +COMMIT; Is the above comment true? Because it seems to me that Insert is getting streamed in the main transaction. 2. +<programlisting> +postgres[33712]=#* SELECT * FROM pg_logical_slot_get_changes('test_slot', NULL, NULL, 'stream-changes', '1'); + lsn | xid | data +-----------+-----+-------------------------------------------------- + 0/16B21F8 | 503 | opening a streamed block for transaction TXN 503 + 0/16B21F8 | 503 | streaming change for TXN 503 + 0/16B2300 | 503 | streaming change for TXN 503 + 0/16B2408 | 503 | streaming change for TXN 503 + 0/16BEBA0 | 503 | closing a streamed block for transaction TXN 503 + 0/16B21F8 | 503 | opening a streamed block for transaction TXN 503 + 0/16BECA8 | 503 | streaming change for TXN 503 + 0/16BEDB0 | 503 | streaming change for TXN 503 + 0/16BEEB8 | 503 | streaming change for TXN 503 + 0/16BEBA0 | 503 | closing a streamed block for transaction TXN 503 +(10 rows) +</programlisting> + </para> + Is the above example correct? Because we should include XID in the stream message only when include_xids option is specified. 3. /* - * Queue a change into a transaction so it can be replayed upon commit. + * Record the partial change for the streaming of in-progress transactions. We + * can stream only complete changes so if we have a partial change like toast + * table insert or speculative then we mark such a 'txn' so that it can't be + * streamed. /speculative then/speculative insert then 4. I think we can explain the problems (like we can see the wrong tuple or see two versions of the same tuple or whatever else wrong can happen, if possible with some example) related to concurrent aborts somewhere in comments. -- With Regards, Amit Kapila.
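Regarding point 3, the idea is roughly the following: toast-chunk inserts and unconfirmed speculative inserts mark the toplevel transaction as having a partial change, which blocks streaming until the change is complete again. This is a condensed sketch; the flag name and the exact conditions for setting and clearing it are illustrative, not the patch's final code.

static void
ReorderBufferProcessPartialChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
                                  ReorderBufferChange *change,
                                  bool toast_insert)
{
    ReorderBufferTXN *toptxn = txn->toptxn ? txn->toptxn : txn;

    if (toast_insert ||
        change->action == REORDER_BUFFER_CHANGE_INTERNAL_SPEC_INSERT)
    {
        /* Incomplete change: do not stream until it is completed. */
        toptxn->txn_flags |= RBTXN_HAS_PARTIAL_CHANGE;
    }
    else if (change->action == REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM ||
             change->action == REORDER_BUFFER_CHANGE_INSERT ||
             change->action == REORDER_BUFFER_CHANGE_UPDATE ||
             change->action == REORDER_BUFFER_CHANGE_DELETE)
    {
        /* A complete row change ends the pending partial change. */
        toptxn->txn_flags &= ~RBTXN_HAS_PARTIAL_CHANGE;
    }
}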
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Dilip Kumar
Date:
On Tue, Aug 4, 2020 at 10:12 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Wed, Jul 29, 2020 at 10:46 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > Thanks, please find the rebased patch set. > > > > Few comments on v44-0001-Implement-streaming-mode-in-ReorderBuffer: > ============================================================ > 1. > +-- streaming with subxact, nothing in main > +BEGIN; > +savepoint s1; > +SELECT 'msg5' FROM pg_logical_emit_message(true, 'test', repeat('a', 50)); > +INSERT INTO stream_test SELECT repeat('a', 2000) || g.i FROM > generate_series(1, 35) g(i); > +TRUNCATE table stream_test; > +rollback to s1; > +INSERT INTO stream_test SELECT repeat('a', 10) || g.i FROM > generate_series(1, 20) g(i); > +COMMIT; > > Is the above comment true? Because it seems to me that Insert is > getting streamed in the main transaction. Changed the comments. > 2. > +<programlisting> > +postgres[33712]=#* SELECT * FROM > pg_logical_slot_get_changes('test_slot', NULL, NULL, 'stream-changes', > '1'); > + lsn | xid | data > +-----------+-----+-------------------------------------------------- > + 0/16B21F8 | 503 | opening a streamed block for transaction TXN 503 > + 0/16B21F8 | 503 | streaming change for TXN 503 > + 0/16B2300 | 503 | streaming change for TXN 503 > + 0/16B2408 | 503 | streaming change for TXN 503 > + 0/16BEBA0 | 503 | closing a streamed block for transaction TXN 503 > + 0/16B21F8 | 503 | opening a streamed block for transaction TXN 503 > + 0/16BECA8 | 503 | streaming change for TXN 503 > + 0/16BEDB0 | 503 | streaming change for TXN 503 > + 0/16BEEB8 | 503 | streaming change for TXN 503 > + 0/16BEBA0 | 503 | closing a streamed block for transaction TXN 503 > +(10 rows) > +</programlisting> > + </para> > + > > Is the above example correct? Because we should include XID in the > stream message only when include_xids option is specified. include_xids is true if we don't set it to false explicitly > 3. > /* > - * Queue a change into a transaction so it can be replayed upon commit. > + * Record the partial change for the streaming of in-progress transactions. We > + * can stream only complete changes so if we have a partial change like toast > + * table insert or speculative then we mark such a 'txn' so that it can't be > + * streamed. > > /speculative then/speculative insert then Done > 4. I think we can explain the problems (like we can see the wrong > tuple or see two versions of the same tuple or whatever else wrong can > happen, if possible with some example) related to concurrent aborts > somewhere in comments. Done -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Attachment
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Amit Kapila
Date:
On Tue, Aug 4, 2020 at 12:42 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Tue, Aug 4, 2020 at 10:12 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > 4. I think we can explain the problems (like we can see the wrong > > tuple or see two versions of the same tuple or whatever else wrong can > > happen, if possible with some example) related to concurrent aborts > > somewhere in comments. > > Done > I have slightly modified the comment added for the above point and apart from that added/modified a few comments at other places. I have also slightly edited the commit message. @@ -2196,6 +2778,7 @@ ReorderBufferAddNewTupleCids(ReorderBuffer *rb, TransactionId xid, change->lsn = lsn; change->txn = txn; change->action = REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID; + change->txn = txn; This change is not required as the same information is assigned a few lines before. So, I have removed this change as well. Let me know what you think of the above changes? Can we add a test for incomplete changes (probably with toast insertion but we can do it for spec_insert case as well) in ReorderBuffer such that it needs to first serialize the changes and then stream it? I have manually verified such scenarios but it is good to have the test for the same. -- With Regards, Amit Kapila.
Attachment
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Dilip Kumar
Date:
On Wed, Aug 5, 2020 at 6:25 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Tue, Aug 4, 2020 at 12:42 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > On Tue, Aug 4, 2020 at 10:12 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > > 4. I think we can explain the problems (like we can see the wrong > > > tuple or see two versions of the same tuple or whatever else wrong can > > > happen, if possible with some example) related to concurrent aborts > > > somewhere in comments. > > > > Done > > > > I have slightly modified the comment added for the above point and > apart from that added/modified a few comments at other places. I have > also slightly edited the commit message. > > @@ -2196,6 +2778,7 @@ ReorderBufferAddNewTupleCids(ReorderBuffer *rb, > TransactionId xid, > change->lsn = lsn; > change->txn = txn; > change->action = REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID; > + change->txn = txn; > > This change is not required as the same information is assigned a few > lines before. So, I have removed this change as well. Let me know > what you think of the above changes? Changes look fine to me. > Can we add a test for incomplete changes (probably with toast > insertion but we can do it for spec_insert case as well) in > ReorderBuffer such that it needs to first serialize the changes and > then stream it? I have manually verified such scenarios but it is > good to have the test for the same. I have added a new test for the same in the stream.sql file. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Attachment
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Amit Kapila
Date:
On Wed, Aug 5, 2020 at 7:37 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Wed, Aug 5, 2020 at 6:25 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > Can we add a test for incomplete changes (probably with toast > > insertion but we can do it for spec_insert case as well) in > > ReorderBuffer such that it needs to first serialize the changes and > > then stream it? I have manually verified such scenarios but it is > > good to have the test for the same. > > I have added a new test for the same in the stream.sql file. > Thanks, I have slightly changed the test so that we can consume DDL changes separately. I have made a number of other adjustments like changing few more comments (to make them consistent with nearby comments), removed unnecessary inclusion of header file, ran pgindent. The next patch (v47-0001-Implement-streaming-mode-in-ReorderBuffer) in this series looks good to me. I am planning to push it after one more read-through unless you or anyone else has any comments on the same. The patch I am talking about has the following functionality: Implement streaming mode in ReorderBuffer. Instead of serializing the transaction to disk after reaching the logical_decoding_work_mem limit in memory, we consume the changes we have in memory and invoke stream API methods added by commit 45fdc9738b. However, sometimes if we have incomplete toast or speculative insert we spill to the disk because we can't stream till we have the complete tuple. And, as soon as we get the complete tuple we stream the transaction including the serialized changes. Now that we can stream in-progress transactions, the concurrent aborts may cause failures when the output plugin consults catalogs (both system and user-defined). We handle such failures by returning ERRCODE_TRANSACTION_ROLLBACK sqlerrcode from system table scan APIs to the backend or WALSender decoding a specific uncommitted transaction. The decoding logic on the receipt of such a sqlerrcode aborts the decoding of the current transaction and continues with the decoding of other transactions. We also provide a new option via SQL APIs to fetch the changes being streamed. This patch's functionality can be independently verified by SQL APIs -- With Regards, Amit Kapila.
Attachment
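As a hedged illustration of the last sentence above (verifying the streaming behaviour through the SQL APIs alone), the following sketch uses the test_decoding plugin and the stream-changes option added by this series; the table name, slot name and row counts are invented:

-- Keep the decoding memory limit small so the transaction gets streamed.
SET logical_decoding_work_mem = '64kB';

SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');

CREATE TABLE stream_demo (id int, payload text);

BEGIN;
INSERT INTO stream_demo
    SELECT g.i, repeat('x', 2000) FROM generate_series(1, 500) g(i);
COMMIT;

-- With 'stream-changes' enabled the output contains the opening/closing
-- streamed-block messages rather than a single batch emitted at commit.
SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL,
    'stream-changes', '1');

SELECT pg_drop_replication_slot('regression_slot');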
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Dilip Kumar
Date:
On Thu, Aug 6, 2020 at 2:43 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Wed, Aug 5, 2020 at 7:37 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > On Wed, Aug 5, 2020 at 6:25 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > > Can we add a test for incomplete changes (probably with toast > > > insertion but we can do it for spec_insert case as well) in > > > ReorderBuffer such that it needs to first serialize the changes and > > > then stream it? I have manually verified such scenarios but it is > > > good to have the test for the same. > > > > I have added a new test for the same in the stream.sql file. > > > > Thanks, I have slightly changed the test so that we can consume DDL > changes separately. I have made a number of other adjustments like > changing few more comments (to make them consistent with nearby > comments), removed unnecessary inclusion of header file, ran pgindent. > The next patch (v47-0001-Implement-streaming-mode-in-ReorderBuffer) in > this series looks good to me. I am planning to push it after one more > read-through unless you or anyone else has any comments on the same. > The patch I am talking about has the following functionality: > > Implement streaming mode in ReorderBuffer. Instead of serializing the > transaction to disk after reaching the logical_decoding_work_mem limit > in memory, we consume the changes we have in memory and invoke stream > API methods added by commit 45fdc9738b. However, sometimes if we have > incomplete toast or speculative insert we spill to the disk because we > can't stream till we have the complete tuple. And, as soon as we get > the complete tuple we stream the transaction including the serialized > changes. Now that we can stream in-progress transactions, the > concurrent aborts may cause failures when the output plugin consults > catalogs (both system and user-defined). We handle such failures by > returning ERRCODE_TRANSACTION_ROLLBACK sqlerrcode from system table > scan APIs to the backend or WALSender decoding a specific uncommitted > transaction. The decoding logic on the receipt of such a sqlerrcode > aborts the decoding of the current transaction and continues with the > decoding of other transactions. We also provide a new option via SQL > APIs to fetch the changes being streamed. > > This patch's functionality can be independently verified by SQL APIs Your changes look fine to me. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Amit Kapila
Date:
On Fri, Aug 7, 2020 at 2:04 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Thu, Aug 6, 2020 at 2:43 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > .. > > This patch's functionality can be independently verified by SQL APIs > > Your changes look fine to me. > I have pushed that patch last week and attached are the remaining patches. I have made a few changes in the next patch 0001-Extend-the-BufFile-interface.patch and have some comments on it which are as below: 1. case SEEK_END: - /* could be implemented, not needed currently */ + + /* + * Get the file size of the last file to get the last offset of + * that file. + */ + newFile = file->numFiles - 1; + newOffset = FileSize(file->files[file->numFiles - 1]); + if (newOffset < 0) + ereport(ERROR, + (errcode_for_file_access(), + errmsg("could not determine size of temporary file \"%s\" from BufFile \"%s\": %m", + FilePathName(file->files[file->numFiles - 1]), + file->name))); + break; break; There is no need for multiple breaks in the above code. I have fixed this one in the attached patch. 2. +void +BufFileTruncateShared(BufFile *file, int fileno, off_t offset) +{ + int newFile = file->numFiles; + off_t newOffset = file->curOffset; + char segment_name[MAXPGPATH]; + int i; + + /* Loop over all the files upto the fileno which we want to truncate. */ + for (i = file->numFiles - 1; i >= fileno; i--) + { + /* + * Except the fileno, we can directly delete other files. If the + * offset is 0 then we can delete the fileno file as well unless it is + * the first file. + */ + if ((i != fileno || offset == 0) && fileno != 0) + { + SharedSegmentName(segment_name, file->name, i); + FileClose(file->files[i]); + if (!SharedFileSetDelete(file->fileset, segment_name, true)) + ereport(ERROR, + (errcode_for_file_access(), + errmsg("could not delete shared fileset \"%s\": %m", + segment_name))); + newFile--; + newOffset = MAX_PHYSICAL_FILESIZE; + } + else + { + if (FileTruncate(file->files[i], offset, + WAIT_EVENT_BUFFILE_TRUNCATE) < 0) + ereport(ERROR, + (errcode_for_file_access(), + errmsg("could not truncate file \"%s\": %m", + FilePathName(file->files[i])))); + + newOffset = offset; + } + } + + file->numFiles = newFile; + file->curOffset = newOffset; +} In the end, you have only set 'numFiles' and 'curOffset' members of BufFile and left others. I think other members like 'curFile' also need to be set especially for the case where we have deleted segments at the end, also, shouldn't we need to set 'pos' and 'nbytes' as we do in BufFileSeek. If there is some reason that we don't to set these other members then maybe it is better to add a comment to make it clear. Another thing we need to think here whether we need to flush the buffer data for the dirty buffer? Consider a case where we truncate the file up to a position that falls in the buffer. Now we might truncate the file and part of buffer contents will become invalid, next time if we flush such a buffer then the file can contain the garbage or maybe this will be handled if we update the position in buffer appropriately but all of this should be explained in comments. If what I said is correct, then we still can skip buffer flush in some cases as we do in BufFileSeek. Also, consider if we need to do other handling (convert seek to "start of next seg" to "end of last seg") as we do after changing the seek position in BufFileSeek. 3. /* * Initialize a space for temporary files that can be opened by other backends. * Other backends must attach to it before accessing it. 
Associate this * SharedFileSet with 'seg'. Any contained files will be deleted when the * last backend detaches. * * We can also use this interface if the temporary files are used only by * single backend but the files need to be opened and closed multiple times * and also the underlying files need to survive across transactions. For * such cases, dsm segment 'seg' should be passed as NULL. We remove such * files on proc exit. * * Files will be distributed over the tablespaces configured in * temp_tablespaces. * * Under the covers the set is one or more directories which will eventually * be deleted when there are no backends attached. */ void SharedFileSetInit(SharedFileSet *fileset, dsm_segment *seg) { .. I think we can remove the part of the above comment after 'eventually be deleted' (see last sentence in comment) because now the files can be removed in more than one way and we have explained that in the comments before this last sentence of the comment. If you can rephrase it differently to cover the other case as well, then that is fine too. -- With Regards, Amit Kapila.
Attachment
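The delete-versus-truncate decision reviewed in comment 2 above can be pictured with a small standalone sketch; this is plain C with invented names, not the BufFile code itself, and it only prints the plan that the quoted loop would carry out:

#include <stdio.h>

/*
 * Sketch: truncating a multi-segment temporary file back to (fileno, offset).
 * Segments past the target are removed outright; the target segment is
 * truncated, unless the offset is 0, in which case it too can be removed,
 * except for segment 0 which must always remain.
 */
static void
truncate_plan(int numFiles, int fileno, long offset)
{
    for (int i = numFiles - 1; i >= fileno; i--)
    {
        if (i > fileno || (offset == 0 && i != 0))
            printf("segment %d: close and delete\n", i);
        else
            printf("segment %d: truncate to offset %ld\n", i, offset);
    }
}

int
main(void)
{
    truncate_plan(4, 1, 4096);  /* cut into the middle of segment 1 */
    truncate_plan(4, 2, 0);     /* cut exactly at the start of segment 2 */
    truncate_plan(3, 0, 0);     /* segment 0 is only ever truncated */
    return 0;
}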
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Amit Kapila
Date:
On Thu, Aug 13, 2020 at 12:08 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Fri, Aug 7, 2020 at 2:04 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > On Thu, Aug 6, 2020 at 2:43 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > .. > > > This patch's functionality can be independently verified by SQL APIs > > > > Your changes look fine to me. > > > > I have pushed that patch last week and attached are the remaining > patches. I have made a few changes in the next patch > 0001-Extend-the-BufFile-interface.patch and have some comments on it > which are as below: > Few more comments on the latest patches: v48-0002-Add-support-for-streaming-to-built-in-replicatio 1. It appears to me that we don't remove the temporary folders created by the apply worker. So, we have folders like pgsql_tmp15324.0.sharedfileset in base/pgsql_tmp directory even when the apply worker exits. I think we can remove these by calling PathNameDeleteTemporaryDir in SharedFileSetUnregister while removing the fileset from registered filesetlist. 2. +typedef struct SubXactInfo +{ + TransactionId xid; /* XID of the subxact */ + int fileno; /* file number in the buffile */ + off_t offset; /* offset in the file */ +} SubXactInfo; + +static uint32 nsubxacts = 0; +static uint32 nsubxacts_max = 0; +static SubXactInfo *subxacts = NULL; +static TransactionId subxact_last = InvalidTransactionId; Will it be better if we move all the subxact related variables (like nsubxacts, nsubxacts_max and subxact_last) inside SubXactInfo struct as all the information anyway is related to sub-transactions? 3. + /* + * If there is no subtransaction then nothing to do, but if already have + * subxact file then delete that. + */ extra space before 'but' in the above sentence is not required. v48-0001-Extend-the-BufFile-interface 4. - * SharedFileSets can also be used by backends when the temporary files need - * to be opened/closed multiple times and the underlying files need to survive + * SharedFileSets can be used by backends when the temporary files need to be + * opened/closed multiple times and the underlying files need to survive * across transactions. * No need of 'also' in the above sentence. -- With Regards, Amit Kapila.
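One possible grouping for comment 2 above, sketched with the field names from the quoted snippet; the wrapper struct name and initializer are invented for illustration and the types assume the usual PostgreSQL headers:

#include "postgres.h"

typedef struct SubXactInfo
{
    TransactionId xid;          /* XID of the subxact */
    int           fileno;       /* file number in the buffile */
    off_t         offset;       /* offset in the file */
} SubXactInfo;

/* All subxact bookkeeping for the transaction being applied, in one place. */
typedef struct ApplySubXactData
{
    uint32        nsubxacts;        /* number of sub-transactions recorded */
    uint32        nsubxacts_max;    /* allocated size of the subxacts array */
    TransactionId subxact_last;     /* xid of the last sub-transaction seen */
    SubXactInfo  *subxacts;         /* per-subxact positions in the changes file */
} ApplySubXactData;

static ApplySubXactData subxact_data = {0, 0, InvalidTransactionId, NULL};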
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Thomas Munro
Date:
On Thu, Aug 13, 2020 at 6:38 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > I have pushed that patch last week and attached are the remaining > patches. I have made a few changes in the next patch > 0001-Extend-the-BufFile-interface.patch and have some comments on it > which are as below: Hi Amit, I noticed that Konstantin Knizhnik's CF entry 2386 calls table_scan_XXX() functions from an extension, namely contrib/auto_explain, and started failing to build on Windows after commit 7259736a. This seems to be due to the new global variables CheckXidAlive and bsysscan, which probably need PGDLLIMPORT if they are accessed from inline functions that are part of the API that we expect extensions to be allowed to call.
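For readers not familiar with the Windows build issue above: backend globals that extension code references have to be exported explicitly. The fix being suggested amounts to marking the two declarations, roughly like this (a sketch of the declarations only, not a standalone file):

/* Export the globals so Windows builds of extensions can reference them. */
extern PGDLLIMPORT TransactionId CheckXidAlive;
extern PGDLLIMPORT bool bsysscan;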
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Amit Kapila
Date:
On Fri, Aug 14, 2020 at 10:11 AM Thomas Munro <thomas.munro@gmail.com> wrote: > > On Thu, Aug 13, 2020 at 6:38 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > I have pushed that patch last week and attached are the remaining > > patches. I have made a few changes in the next patch > > 0001-Extend-the-BufFile-interface.patch and have some comments on it > > which are as below: > > Hi Amit, > > I noticed that Konstantin Knizhnik's CF entry 2386 calls > table_scan_XXX() functions from an extension, namely > contrib/auto_explain, and started failing to build on Windows after > commit 7259736a. This seems to be due to the new global variables > CheckXidAlive and bsysscan, which probably need PGDLLIMPORT if they > are accessed from inline functions that are part of the API that we > expect extensions to be allowed to call. > Yeah, that makes sense. I will take care of that later today or tomorrow. We have not noticed that because currently none of the extensions is using those functions. BTW, I noticed that after failure, the next run is green, why so? Is the next run not on windows? -- With Regards, Amit Kapila.
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Thomas Munro
Date:
On Fri, Aug 14, 2020 at 6:14 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > Yeah, that makes sense. I will take care of that later today or > tomorrow. We have not noticed that because currently none of the > extensions is using those functions. BTW, I noticed that after > failure, the next run is green, why so? Is the next run not on > windows? The three cfbot results are for applying the patch, testing on Windows and testing on Ubuntu in that order. It's not at all clear and I'll probably find a better way to display it when I get around to adding some more operating systems, maybe with some OS icons or something like that...
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Amit Kapila
Date:
On Sat, Aug 15, 2020 at 4:14 AM Thomas Munro <thomas.munro@gmail.com> wrote: > > On Fri, Aug 14, 2020 at 6:14 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > Yeah, that makes sense. I will take care of that later today or > > tomorrow. We have not noticed that because currently none of the > > extensions is using those functions. BTW, I noticed that after > > failure, the next run is green, why so? Is the next run not on > > windows? > > The three cfbot results are for applying the patch, testing on Windows > and testing on Ubuntu in that order. It's not at all clear and I'll > probably find a better way to display it when I get around to adding > some more operating systems, maybe with some OS icons or something > like that... > Good to know, anyway, I have pushed a patch to mark those variables with PGDLLIMPORT. -- With Regards, Amit Kapila.
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Dilip Kumar
Date:
On Thu, Aug 13, 2020 at 12:08 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Fri, Aug 7, 2020 at 2:04 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > On Thu, Aug 6, 2020 at 2:43 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > .. > > > This patch's functionality can be independently verified by SQL APIs > > > > Your changes look fine to me. > > > > I have pushed that patch last week and attached are the remaining > patches. I have made a few changes in the next patch > 0001-Extend-the-BufFile-interface.patch and have some comments on it > which are as below: > > 1. > case SEEK_END: > - /* could be implemented, not needed currently */ > + > + /* > + * Get the file size of the last file to get the last offset of > + * that file. > + */ > + newFile = file->numFiles - 1; > + newOffset = FileSize(file->files[file->numFiles - 1]); > + if (newOffset < 0) > + ereport(ERROR, > + (errcode_for_file_access(), > + errmsg("could not determine size of temporary file \"%s\" from > BufFile \"%s\": %m", > + FilePathName(file->files[file->numFiles - 1]), > + file->name))); > + break; > break; > > There is no need for multiple breaks in the above code. I have fixed > this one in the attached patch. Ok. > 2. > +void > +BufFileTruncateShared(BufFile *file, int fileno, off_t offset) > +{ > + int newFile = file->numFiles; > + off_t newOffset = file->curOffset; > + char segment_name[MAXPGPATH]; > + int i; > + > + /* Loop over all the files upto the fileno which we want to truncate. */ > + for (i = file->numFiles - 1; i >= fileno; i--) > + { > + /* > + * Except the fileno, we can directly delete other files. If the > + * offset is 0 then we can delete the fileno file as well unless it is > + * the first file. > + */ > + if ((i != fileno || offset == 0) && fileno != 0) > + { > + SharedSegmentName(segment_name, file->name, i); > + FileClose(file->files[i]); > + if (!SharedFileSetDelete(file->fileset, segment_name, true)) > + ereport(ERROR, > + (errcode_for_file_access(), > + errmsg("could not delete shared fileset \"%s\": %m", > + segment_name))); > + newFile--; > + newOffset = MAX_PHYSICAL_FILESIZE; > + } > + else > + { > + if (FileTruncate(file->files[i], offset, > + WAIT_EVENT_BUFFILE_TRUNCATE) < 0) > + ereport(ERROR, > + (errcode_for_file_access(), > + errmsg("could not truncate file \"%s\": %m", > + FilePathName(file->files[i])))); > + > + newOffset = offset; > + } > + } > + > + file->numFiles = newFile; > + file->curOffset = newOffset; > +} > > In the end, you have only set 'numFiles' and 'curOffset' members of > BufFile and left others. I think other members like 'curFile' also > need to be set especially for the case where we have deleted segments > at the end, Yes this must be set. also, shouldn't we need to set 'pos' and 'nbytes' as we do > in BufFileSeek. If there is some reason that we don't to set these > other members then maybe it is better to add a comment to make it > clear. IMHO, we can directly call the BufFileFlush, this will reset the pos and nbytes and we can directly set the absolute location of the curOffset. Next time BufFileRead/BufFileWrite reread the buffer so everything will be fine. > Another thing we need to think here whether we need to flush the > buffer data for the dirty buffer? Consider a case where we truncate > the file up to a position that falls in the buffer. 
Now we might > truncate the file and part of buffer contents will become invalid, > next time if we flush such a buffer then the file can contain the > garbage or maybe this will be handled if we update the position in > buffer appropriately but all of this should be explained in comments. > If what I said is correct, then we still can skip buffer flush in some > cases as we do in BufFileSeek. I think all the cases we can flush the buffer and reset the pos and nbytes. Also, consider if we need to do other > handling (convert seek to "start of next seg" to "end of last seg") as > we do after changing the seek position in BufFileSeek. We also do this when we truncate complete file, see this + if ((i != fileno || offset == 0) && fileno != 0) + { + SharedSegmentName(segment_name, file->name, i); + FileClose(file->files[i]); + if (!SharedFileSetDelete(file->fileset, segment_name, true)) + ereport(ERROR, + (errcode_for_file_access(), + errmsg("could not delete shared fileset \"%s\": %m", + segment_name))); + newFile--; + newOffset = MAX_PHYSICAL_FILESIZE; + } > 3. > /* > * Initialize a space for temporary files that can be opened by other backends. > * Other backends must attach to it before accessing it. Associate this > * SharedFileSet with 'seg'. Any contained files will be deleted when the > * last backend detaches. > * > * We can also use this interface if the temporary files are used only by > * single backend but the files need to be opened and closed multiple times > * and also the underlying files need to survive across transactions. For > * such cases, dsm segment 'seg' should be passed as NULL. We remove such > * files on proc exit. > * > * Files will be distributed over the tablespaces configured in > * temp_tablespaces. > * > * Under the covers the set is one or more directories which will eventually > * be deleted when there are no backends attached. > */ > void > SharedFileSetInit(SharedFileSet *fileset, dsm_segment *seg) > { > .. > > I think we can remove the part of the above comment after 'eventually > be deleted' (see last sentence in comment) because now the files can > be removed in more than one way and we have explained that in the > comments before this last sentence of the comment. If you can rephrase > it differently to cover the other case as well, then that is fine too. I think it makes sense to remove, so I have removed it. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Attachment
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Dilip Kumar
Date:
On Thu, Aug 13, 2020 at 6:47 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Thu, Aug 13, 2020 at 12:08 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > On Fri, Aug 7, 2020 at 2:04 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > > > On Thu, Aug 6, 2020 at 2:43 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > .. > > > > This patch's functionality can be independently verified by SQL APIs > > > > > > Your changes look fine to me. > > > > > > > I have pushed that patch last week and attached are the remaining > > patches. I have made a few changes in the next patch > > 0001-Extend-the-BufFile-interface.patch and have some comments on it > > which are as below: > > > > Few more comments on the latest patches: > v48-0002-Add-support-for-streaming-to-built-in-replicatio > 1. It appears to me that we don't remove the temporary folders created > by the apply worker. So, we have folders like > pgsql_tmp15324.0.sharedfileset in base/pgsql_tmp directory even when > the apply worker exits. I think we can remove these by calling > PathNameDeleteTemporaryDir in SharedFileSetUnregister while removing > the fileset from registered filesetlist. I think we need to call SharedFileSetDeleteAll(input_fileset), from SharedFileSetUnregister, so that all the directories created for this fileset are removed > 2. > +typedef struct SubXactInfo > +{ > + TransactionId xid; /* XID of the subxact */ > + int fileno; /* file number in the buffile */ > + off_t offset; /* offset in the file */ > +} SubXactInfo; > + > +static uint32 nsubxacts = 0; > +static uint32 nsubxacts_max = 0; > +static SubXactInfo *subxacts = NULL; > +static TransactionId subxact_last = InvalidTransactionId; > > Will it be better if we move all the subxact related variables (like > nsubxacts, nsubxacts_max and subxact_last) inside SubXactInfo struct > as all the information anyway is related to sub-transactions? I have moved them all to a structure. > 3. > + /* > + * If there is no subtransaction then nothing to do, but if already have > + * subxact file then delete that. > + */ > > extra space before 'but' in the above sentence is not required. Fixed > v48-0001-Extend-the-BufFile-interface > 4. > - * SharedFileSets can also be used by backends when the temporary files need > - * to be opened/closed multiple times and the underlying files need to survive > + * SharedFileSets can be used by backends when the temporary files need to be > + * opened/closed multiple times and the underlying files need to survive > * across transactions. > * > > No need of 'also' in the above sentence. Fixed -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Dilip Kumar
Date:
On Sat, Aug 15, 2020 at 3:32 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Thu, Aug 13, 2020 at 6:47 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > On Thu, Aug 13, 2020 at 12:08 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > On Fri, Aug 7, 2020 at 2:04 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > > > > > On Thu, Aug 6, 2020 at 2:43 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > > .. > > > > > This patch's functionality can be independently verified by SQL APIs > > > > > > > > Your changes look fine to me. > > > > > > > > > > I have pushed that patch last week and attached are the remaining > > > patches. I have made a few changes in the next patch > > > 0001-Extend-the-BufFile-interface.patch and have some comments on it > > > which are as below: > > > > > > > Few more comments on the latest patches: > > v48-0002-Add-support-for-streaming-to-built-in-replicatio > > 1. It appears to me that we don't remove the temporary folders created > > by the apply worker. So, we have folders like > > pgsql_tmp15324.0.sharedfileset in base/pgsql_tmp directory even when > > the apply worker exits. I think we can remove these by calling > > PathNameDeleteTemporaryDir in SharedFileSetUnregister while removing > > the fileset from registered filesetlist. > > I think we need to call SharedFileSetDeleteAll(input_fileset), from > SharedFileSetUnregister, so that all the directories created for this > fileset are removed > > > 2. > > +typedef struct SubXactInfo > > +{ > > + TransactionId xid; /* XID of the subxact */ > > + int fileno; /* file number in the buffile */ > > + off_t offset; /* offset in the file */ > > +} SubXactInfo; > > + > > +static uint32 nsubxacts = 0; > > +static uint32 nsubxacts_max = 0; > > +static SubXactInfo *subxacts = NULL; > > +static TransactionId subxact_last = InvalidTransactionId; > > > > Will it be better if we move all the subxact related variables (like > > nsubxacts, nsubxacts_max and subxact_last) inside SubXactInfo struct > > as all the information anyway is related to sub-transactions? > > I have moved them all to a structure. > > > 3. > > + /* > > + * If there is no subtransaction then nothing to do, but if already have > > + * subxact file then delete that. > > + */ > > > > extra space before 'but' in the above sentence is not required. > > Fixed > > > v48-0001-Extend-the-BufFile-interface > > 4. > > - * SharedFileSets can also be used by backends when the temporary files need > > - * to be opened/closed multiple times and the underlying files need to survive > > + * SharedFileSets can be used by backends when the temporary files need to be > > + * opened/closed multiple times and the underlying files need to survive > > * across transactions. > > * > > > > No need of 'also' in the above sentence. > > Fixed > In last patch v49-0001, there is one issue, Basically, I have called BufFileFlush in all the cases. But, ideally, we can not call this if the underlying files are deleted/truncated because those files/blocks might not exist now. So I think if the truncate position is within the same buffer we just need to adjust the buffer, otherwise we just need to set the currFile and currOffset to the absolute number and set the pos and nbytes 0. Attached patch fixes this issue. 
+ errmsg("could not truncate file \"%s\": %m", + FilePathName(file->files[i])))); + curOffset = offset; + } + } + + /* Otherwise, must reposition buffer, so flush any dirty data */ + BufFileFlush(file); + -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Attachment
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Amit Kapila
Date:
On Mon, Aug 17, 2020 at 6:29 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > In last patch v49-0001, there is one issue, Basically, I have called > BufFileFlush in all the cases. But, ideally, we can not call this if > the underlying files are deleted/truncated because those files/blocks > might not exist now. So I think if the truncate position is within > the same buffer we just need to adjust the buffer, otherwise we just > need to set the currFile and currOffset to the absolute number and set > the pos and nbytes 0. Attached patch fixes this issue. > Few comments on the latest patch v50-0001-Extend-the-BufFile-interface 1. + + /* + * If the truncate point is within existing buffer then we can just + * adjust pos-within-buffer, without flushing buffer. Otherwise, + * we don't need to do anything because we have already deleted/truncated + * the underlying files. + */ + if (curFile == file->curFile && + curOffset >= file->curOffset && + curOffset <= file->curOffset + file->nbytes) + { + file->pos = (int) (curOffset - file->curOffset); + return; + } I think in this case you have set the position correctly but what about file->nbytes? In BufFileSeek, it was okay not to update 'nbytes' because the contents of the buffer are still valid but I don't think the same is true here. 2. + int curFile = file->curFile; + off_t curOffset = file->curOffset; I find the previous naming (newFile, newOffset) was better as it distinguishes them from BufFile variables. 3. +void +SharedFileSetUnregister(SharedFileSet *input_fileset) +{ .. + /* Delete all files in the set */ + SharedFileSetDeleteAll(input_fileset); .. } I am not sure if this is completely correct because we call this function (SharedFileSetUnregister) from BufFileDeleteShared which would have already removed all the required files. This raises the question in my mind whether it is correct to call SharedFileSetUnregister from BufFileDeleteShared from the API perspective as one might not want to remove the entire fileset at that point of time. It will work for your use case (where while removing buffile you also want to remove the entire fileset) but not sure if it is generic enough. For your case, I wonder if we can directly call SharedFileSetDeleteAll and we can have a call like SharedFileSetUnregister which will be called from it. -- With Regards, Amit Kapila.
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Amit Kapila
Date:
On Wed, Aug 19, 2020 at 10:10 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Mon, Aug 17, 2020 at 6:29 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > > > In last patch v49-0001, there is one issue, Basically, I have called > > BufFileFlush in all the cases. But, ideally, we can not call this if > > the underlying files are deleted/truncated because those files/blocks > > might not exist now. So I think if the truncate position is within > > the same buffer we just need to adjust the buffer, otherwise we just > > need to set the currFile and currOffset to the absolute number and set > > the pos and nbytes 0. Attached patch fixes this issue. > > > > Few comments on the latest patch v50-0001-Extend-the-BufFile-interface > 1. > + > + /* > + * If the truncate point is within existing buffer then we can just > + * adjust pos-within-buffer, without flushing buffer. Otherwise, > + * we don't need to do anything because we have already deleted/truncated > + * the underlying files. > + */ > + if (curFile == file->curFile && > + curOffset >= file->curOffset && > + curOffset <= file->curOffset + file->nbytes) > + { > + file->pos = (int) (curOffset - file->curOffset); > + return; > + } > > I think in this case you have set the position correctly but what > about file->nbytes? In BufFileSeek, it was okay not to update 'nbytes' > because the contents of the buffer are still valid but I don't think > the same is true here. > I think you need to set 'nbytes' to curOffset as per your current patch as that is the new size of the file. --- a/src/backend/storage/file/buffile.c +++ b/src/backend/storage/file/buffile.c @@ -912,6 +912,7 @@ BufFileTruncateShared(BufFile *file, int fileno, off_t offset) curOffset <= file->curOffset + file->nbytes) { file->pos = (int) (curOffset - file->curOffset); + file->nbytes = (int) curOffset; return; } Also, what about file 'numFiles', that can also change due to the removal of certain files, shouldn't that be also set in this case? -- With Regards, Amit Kapila.
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Dilip Kumar
Date:
On Wed, Aug 19, 2020 at 10:11 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Mon, Aug 17, 2020 at 6:29 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > > > In last patch v49-0001, there is one issue, Basically, I have called > > BufFileFlush in all the cases. But, ideally, we can not call this if > > the underlying files are deleted/truncated because those files/blocks > > might not exist now. So I think if the truncate position is within > > the same buffer we just need to adjust the buffer, otherwise we just > > need to set the currFile and currOffset to the absolute number and set > > the pos and nbytes 0. Attached patch fixes this issue. > > > > Few comments on the latest patch v50-0001-Extend-the-BufFile-interface > 1. > + > + /* > + * If the truncate point is within existing buffer then we can just > + * adjust pos-within-buffer, without flushing buffer. Otherwise, > + * we don't need to do anything because we have already deleted/truncated > + * the underlying files. > + */ > + if (curFile == file->curFile && > + curOffset >= file->curOffset && > + curOffset <= file->curOffset + file->nbytes) > + { > + file->pos = (int) (curOffset - file->curOffset); > + return; > + } > > I think in this case you have set the position correctly but what > about file->nbytes? In BufFileSeek, it was okay not to update 'nbytes' > because the contents of the buffer are still valid but I don't think > the same is true here. Right, I think we need to set nbytes to new file->pos as shown below > + file->pos = (int) (curOffset - file->curOffset); > file->nbytes = file->pos > 2. > + int curFile = file->curFile; > + off_t curOffset = file->curOffset; > > I find the previous naming (newFile, newOffset) was better as it > distinguishes them from BufFile variables. Ok > 3. > +void > +SharedFileSetUnregister(SharedFileSet *input_fileset) > +{ > .. > + /* Delete all files in the set */ > + SharedFileSetDeleteAll(input_fileset); > .. > } > > I am not sure if this is completely correct because we call this > function (SharedFileSetUnregister) from BufFileDeleteShared which > would have already removed all the required files. This raises the > question in my mind whether it is correct to call > SharedFileSetUnregister from BufFileDeleteShared from the API > perspective as one might not want to remove the entire fileset at that > point of time. It will work for your use case (where while removing > buffile you also want to remove the entire fileset) but not sure if it > is generic enough. For your case, I wonder if we can directly call > SharedFileSetDeleteAll and we can have a call like > SharedFileSetUnregister which will be called from it. Yeah this make more sense to me that we can directly call SharedFileSetDeleteAll, instead of calling BufFileDeleteShared and we can call SharedFileSetUnregister from SharedFileSetDeleteAll. I will make these changes and send the patch after some testing. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
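Pulling together the two fragments agreed on in the exchange above, here is a standalone sketch of the in-buffer case (not the committed code; the struct carries only the BufFile fields the example needs):

#include <stdbool.h>

struct minibuf
{
    int  curFile;   /* segment currently loaded in the buffer */
    long curOffset; /* offset of the buffer within that segment */
    int  pos;       /* next read/write position within the buffer */
    int  nbytes;    /* number of valid bytes in the buffer */
};

/*
 * If the truncate point (newFile, newOffset) falls inside the buffered
 * window, pull both pos and nbytes back to it so a later flush cannot
 * write past the new end of the file; otherwise the caller repositions.
 */
static bool
clip_buffer_to_truncate_point(struct minibuf *f, int newFile, long newOffset)
{
    if (newFile == f->curFile &&
        newOffset >= f->curOffset &&
        newOffset <= f->curOffset + f->nbytes)
    {
        f->pos = (int) (newOffset - f->curOffset);
        f->nbytes = f->pos;
        return true;
    }
    return false;
}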
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Dilip Kumar
Date:
On Wed, Aug 19, 2020 at 12:20 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Wed, Aug 19, 2020 at 10:10 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > On Mon, Aug 17, 2020 at 6:29 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > > > > > > In last patch v49-0001, there is one issue, Basically, I have called > > > BufFileFlush in all the cases. But, ideally, we can not call this if > > > the underlying files are deleted/truncated because those files/blocks > > > might not exist now. So I think if the truncate position is within > > > the same buffer we just need to adjust the buffer, otherwise we just > > > need to set the currFile and currOffset to the absolute number and set > > > the pos and nbytes 0. Attached patch fixes this issue. > > > > > > > Few comments on the latest patch v50-0001-Extend-the-BufFile-interface > > 1. > > + > > + /* > > + * If the truncate point is within existing buffer then we can just > > + * adjust pos-within-buffer, without flushing buffer. Otherwise, > > + * we don't need to do anything because we have already deleted/truncated > > + * the underlying files. > > + */ > > + if (curFile == file->curFile && > > + curOffset >= file->curOffset && > > + curOffset <= file->curOffset + file->nbytes) > > + { > > + file->pos = (int) (curOffset - file->curOffset); > > + return; > > + } > > > > I think in this case you have set the position correctly but what > > about file->nbytes? In BufFileSeek, it was okay not to update 'nbytes' > > because the contents of the buffer are still valid but I don't think > > the same is true here. > > > > I think you need to set 'nbytes' to curOffset as per your current > patch as that is the new size of the file. > --- a/src/backend/storage/file/buffile.c > +++ b/src/backend/storage/file/buffile.c > @@ -912,6 +912,7 @@ BufFileTruncateShared(BufFile *file, int fileno, > off_t offset) > curOffset <= file->curOffset + file->nbytes) > { > file->pos = (int) (curOffset - file->curOffset); > + file->nbytes = (int) curOffset; > return; > } > > Also, what about file 'numFiles', that can also change due to the > removal of certain files, shouldn't that be also set in this case Right, we need to set the numFile. I will fix this as well. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Dilip Kumar
Date:
On Wed, Aug 19, 2020 at 1:35 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Wed, Aug 19, 2020 at 12:20 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > On Wed, Aug 19, 2020 at 10:10 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > On Mon, Aug 17, 2020 at 6:29 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > > > > > > > > > In last patch v49-0001, there is one issue, Basically, I have called > > > > BufFileFlush in all the cases. But, ideally, we can not call this if > > > > the underlying files are deleted/truncated because those files/blocks > > > > might not exist now. So I think if the truncate position is within > > > > the same buffer we just need to adjust the buffer, otherwise we just > > > > need to set the currFile and currOffset to the absolute number and set > > > > the pos and nbytes 0. Attached patch fixes this issue. > > > > > > > > > > Few comments on the latest patch v50-0001-Extend-the-BufFile-interface > > > 1. > > > + > > > + /* > > > + * If the truncate point is within existing buffer then we can just > > > + * adjust pos-within-buffer, without flushing buffer. Otherwise, > > > + * we don't need to do anything because we have already deleted/truncated > > > + * the underlying files. > > > + */ > > > + if (curFile == file->curFile && > > > + curOffset >= file->curOffset && > > > + curOffset <= file->curOffset + file->nbytes) > > > + { > > > + file->pos = (int) (curOffset - file->curOffset); > > > + return; > > > + } > > > > > > I think in this case you have set the position correctly but what > > > about file->nbytes? In BufFileSeek, it was okay not to update 'nbytes' > > > because the contents of the buffer are still valid but I don't think > > > the same is true here. > > > > > > > I think you need to set 'nbytes' to curOffset as per your current > > patch as that is the new size of the file. > > --- a/src/backend/storage/file/buffile.c > > +++ b/src/backend/storage/file/buffile.c > > @@ -912,6 +912,7 @@ BufFileTruncateShared(BufFile *file, int fileno, > > off_t offset) > > curOffset <= file->curOffset + file->nbytes) > > { > > file->pos = (int) (curOffset - file->curOffset); > > + file->nbytes = (int) curOffset; > > return; > > } > > > > Also, what about file 'numFiles', that can also change due to the > > removal of certain files, shouldn't that be also set in this case > > Right, we need to set the numFile. I will fix this as well. I think there are a couple of more problems in the truncate APIs, basically, if the curFile and curOffset are already smaller than the truncate location the truncate should not change that. So the truncate should only change the curFile and curOffset if it is truncating the part of the file where the curFile or curOffset is pointing. I will work on those along with your other comments and submit the updated patch. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Amit Kapila
Date:
On Thu, Aug 20, 2020 at 1:41 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Wed, Aug 19, 2020 at 1:35 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > On Wed, Aug 19, 2020 at 12:20 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > On Wed, Aug 19, 2020 at 10:10 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > > On Mon, Aug 17, 2020 at 6:29 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > > > > > > > > > > > > In last patch v49-0001, there is one issue, Basically, I have called > > > > > BufFileFlush in all the cases. But, ideally, we can not call this if > > > > > the underlying files are deleted/truncated because those files/blocks > > > > > might not exist now. So I think if the truncate position is within > > > > > the same buffer we just need to adjust the buffer, otherwise we just > > > > > need to set the currFile and currOffset to the absolute number and set > > > > > the pos and nbytes 0. Attached patch fixes this issue. > > > > > > > > > > > > > Few comments on the latest patch v50-0001-Extend-the-BufFile-interface > > > > 1. > > > > + > > > > + /* > > > > + * If the truncate point is within existing buffer then we can just > > > > + * adjust pos-within-buffer, without flushing buffer. Otherwise, > > > > + * we don't need to do anything because we have already deleted/truncated > > > > + * the underlying files. > > > > + */ > > > > + if (curFile == file->curFile && > > > > + curOffset >= file->curOffset && > > > > + curOffset <= file->curOffset + file->nbytes) > > > > + { > > > > + file->pos = (int) (curOffset - file->curOffset); > > > > + return; > > > > + } > > > > > > > > I think in this case you have set the position correctly but what > > > > about file->nbytes? In BufFileSeek, it was okay not to update 'nbytes' > > > > because the contents of the buffer are still valid but I don't think > > > > the same is true here. > > > > > > > > > > I think you need to set 'nbytes' to curOffset as per your current > > > patch as that is the new size of the file. > > > --- a/src/backend/storage/file/buffile.c > > > +++ b/src/backend/storage/file/buffile.c > > > @@ -912,6 +912,7 @@ BufFileTruncateShared(BufFile *file, int fileno, > > > off_t offset) > > > curOffset <= file->curOffset + file->nbytes) > > > { > > > file->pos = (int) (curOffset - file->curOffset); > > > + file->nbytes = (int) curOffset; > > > return; > > > } > > > > > > Also, what about file 'numFiles', that can also change due to the > > > removal of certain files, shouldn't that be also set in this case > > > > Right, we need to set the numFile. I will fix this as well. > > I think there are a couple of more problems in the truncate APIs, > basically, if the curFile and curOffset are already smaller than the > truncate location the truncate should not change that. So the > truncate should only change the curFile and curOffset if it is > truncating the part of the file where the curFile or curOffset is > pointing. > Right, I think this can happen if one has changed those by BufFileSeek before doing truncate. We should fix that case as well. > I will work on those along with your other comments and > submit the updated patch. > Thanks. -- With Regards, Amit Kapila.
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Dilip Kumar
Date:
On Thu, Aug 20, 2020 at 2:30 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Thu, Aug 20, 2020 at 1:41 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > On Wed, Aug 19, 2020 at 1:35 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > > > On Wed, Aug 19, 2020 at 12:20 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > > On Wed, Aug 19, 2020 at 10:10 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > > > > On Mon, Aug 17, 2020 at 6:29 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > > > > > > > > > > > > > > > In last patch v49-0001, there is one issue, Basically, I have called > > > > > > BufFileFlush in all the cases. But, ideally, we can not call this if > > > > > > the underlying files are deleted/truncated because those files/blocks > > > > > > might not exist now. So I think if the truncate position is within > > > > > > the same buffer we just need to adjust the buffer, otherwise we just > > > > > > need to set the currFile and currOffset to the absolute number and set > > > > > > the pos and nbytes 0. Attached patch fixes this issue. > > > > > > > > > > > > > > > > Few comments on the latest patch v50-0001-Extend-the-BufFile-interface > > > > > 1. > > > > > + > > > > > + /* > > > > > + * If the truncate point is within existing buffer then we can just > > > > > + * adjust pos-within-buffer, without flushing buffer. Otherwise, > > > > > + * we don't need to do anything because we have already deleted/truncated > > > > > + * the underlying files. > > > > > + */ > > > > > + if (curFile == file->curFile && > > > > > + curOffset >= file->curOffset && > > > > > + curOffset <= file->curOffset + file->nbytes) > > > > > + { > > > > > + file->pos = (int) (curOffset - file->curOffset); > > > > > + return; > > > > > + } > > > > > > > > > > I think in this case you have set the position correctly but what > > > > > about file->nbytes? In BufFileSeek, it was okay not to update 'nbytes' > > > > > because the contents of the buffer are still valid but I don't think > > > > > the same is true here. > > > > > > > > > > > > > I think you need to set 'nbytes' to curOffset as per your current > > > > patch as that is the new size of the file. > > > > --- a/src/backend/storage/file/buffile.c > > > > +++ b/src/backend/storage/file/buffile.c > > > > @@ -912,6 +912,7 @@ BufFileTruncateShared(BufFile *file, int fileno, > > > > off_t offset) > > > > curOffset <= file->curOffset + file->nbytes) > > > > { > > > > file->pos = (int) (curOffset - file->curOffset); > > > > + file->nbytes = (int) curOffset; > > > > return; > > > > } > > > > > > > > Also, what about file 'numFiles', that can also change due to the > > > > removal of certain files, shouldn't that be also set in this case > > > > > > Right, we need to set the numFile. I will fix this as well. > > > > I think there are a couple of more problems in the truncate APIs, > > basically, if the curFile and curOffset are already smaller than the > > truncate location the truncate should not change that. So the > > truncate should only change the curFile and curOffset if it is > > truncating the part of the file where the curFile or curOffset is > > pointing. > > > > Right, I think this can happen if one has changed those by BufFileSeek > before doing truncate. We should fix that case as well. Right. > > I will work on those along with your other comments and > > submit the updated patch. I have fixed this in the attached patch along with your other comments. 
I have also attached a contrib module that is just used for testing the truncate API. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Attachment
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Amit Kapila
Date:
On Thu, Aug 20, 2020 at 5:42 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Thu, Aug 20, 2020 at 2:30 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > Right, I think this can happen if one has changed those by BufFileSeek > > before doing truncate. We should fix that case as well. > > Right. > > > > I will work on those along with your other comments and > > > submit the updated patch. > > I have fixed this in the attached patch along with your other > comments. I have also attached a contrib module that is just used for > testing the truncate API. > Few comments: ============== +void +BufFileTruncateShared(BufFile *file, int fileno, off_t offset) { .. + if ((i != fileno || offset == 0) && i != 0) + { + SharedSegmentName(segment_name, file->name, i); + FileClose(file->files[i]); + if (!SharedFileSetDelete(file->fileset, segment_name, true)) + ereport(ERROR, + (errcode_for_file_access(), + errmsg("could not delete shared fileset \"%s\": %m", + segment_name))); + numFiles--; + newOffset = MAX_PHYSICAL_FILESIZE; + + if (i == fileno) + newFile--; + } Here, shouldn't it be i <= fileno? Because we need to move back the curFile up to newFile whenever curFile is greater than newFile 2. + /* + * If the new location is smaller then the current location in file then + * we need to set the curFile and the curOffset to the new values and also + * reset the pos and nbytes. Otherwise nothing to do. + */ + else if ((newFile < file->curFile) || + newOffset < file->curOffset + file->pos) + { + file->curFile = newFile; + file->curOffset = newOffset; + file->pos = 0; + file->nbytes = 0; + } Shouldn't there be && instead of || because if newFile is greater than curFile then there is no meaning to update it? -- With Regards, Amit Kapila.
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Amit Kapila
Date:
On Fri, Aug 21, 2020 at 9:14 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Thu, Aug 20, 2020 at 5:42 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > On Thu, Aug 20, 2020 at 2:30 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > > > Right, I think this can happen if one has changed those by BufFileSeek > > > before doing truncate. We should fix that case as well. > > > > Right. > > > > > > I will work on those along with your other comments and > > > > submit the updated patch. > > > > I have fixed this in the attached patch along with your other > > comments. I have also attached a contrib module that is just used for > > testing the truncate API. > > > > Few comments: > ============== > +void > +BufFileTruncateShared(BufFile *file, int fileno, off_t offset) > { > .. > + if ((i != fileno || offset == 0) && i != 0) > + { > + SharedSegmentName(segment_name, file->name, i); > + FileClose(file->files[i]); > + if (!SharedFileSetDelete(file->fileset, segment_name, true)) > + ereport(ERROR, > + (errcode_for_file_access(), > + errmsg("could not delete shared fileset \"%s\": %m", > + segment_name))); > + numFiles--; > + newOffset = MAX_PHYSICAL_FILESIZE; > + > + if (i == fileno) > + newFile--; > + } > > Here, shouldn't it be i <= fileno? Because we need to move back the > curFile up to newFile whenever curFile is greater than newFile > > 2. > + /* > + * If the new location is smaller then the current location in file then > + * we need to set the curFile and the curOffset to the new values and also > + * reset the pos and nbytes. Otherwise nothing to do. > + */ > + else if ((newFile < file->curFile) || > + newOffset < file->curOffset + file->pos) > + { > + file->curFile = newFile; > + file->curOffset = newOffset; > + file->pos = 0; > + file->nbytes = 0; > + } > > Shouldn't there be && instead of || because if newFile is greater than > curFile then there is no meaning to update it? > Wait, actually, it is not clear to me which case second condition (newOffset < file->curOffset + file->pos) is trying to cover, so I can't recommend anything for this. Can you please explain to me why you have added the second condition in the above check? -- With Regards, Amit Kapila.
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Amit Kapila
Date:
On Fri, Aug 21, 2020 at 9:14 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Thu, Aug 20, 2020 at 5:42 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > On Thu, Aug 20, 2020 at 2:30 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > > > Right, I think this can happen if one has changed those by BufFileSeek > > > before doing truncate. We should fix that case as well. > > > > Right. > > > > > > I will work on those along with your other comments and > > > > submit the updated patch. > > > > I have fixed this in the attached patch along with your other > > comments. I have also attached a contrib module that is just used for > > testing the truncate API. > > > > Few comments: > ============== > +void > +BufFileTruncateShared(BufFile *file, int fileno, off_t offset) > { > .. > + if ((i != fileno || offset == 0) && i != 0) > + { > + SharedSegmentName(segment_name, file->name, i); > + FileClose(file->files[i]); > + if (!SharedFileSetDelete(file->fileset, segment_name, true)) > + ereport(ERROR, > + (errcode_for_file_access(), > + errmsg("could not delete shared fileset \"%s\": %m", > + segment_name))); > + numFiles--; > + newOffset = MAX_PHYSICAL_FILESIZE; > + > + if (i == fileno) > + newFile--; > + } > > Here, shouldn't it be i <= fileno? Because we need to move back the > curFile up to newFile whenever curFile is greater than newFile > I think now I have understood why you have added this condition but probably a comment on the lines "This is required to indicate that we have removed the given fileno" would be better for future readers. -- With Regards, Amit Kapila.
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Dilip Kumar
Date:
On Fri, Aug 21, 2020 at 9:14 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Thu, Aug 20, 2020 at 5:42 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > On Thu, Aug 20, 2020 at 2:30 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > > > Right, I think this can happen if one has changed those by BufFileSeek > > > before doing truncate. We should fix that case as well. > > > > Right. > > > > > > I will work on those along with your other comments and > > > > submit the updated patch. > > > > I have fixed this in the attached patch along with your other > > comments. I have also attached a contrib module that is just used for > > testing the truncate API. > > > > Few comments: > ============== > +void > +BufFileTruncateShared(BufFile *file, int fileno, off_t offset) > { > .. > + if ((i != fileno || offset == 0) && i != 0) > + { > + SharedSegmentName(segment_name, file->name, i); > + FileClose(file->files[i]); > + if (!SharedFileSetDelete(file->fileset, segment_name, true)) > + ereport(ERROR, > + (errcode_for_file_access(), > + errmsg("could not delete shared fileset \"%s\": %m", > + segment_name))); > + numFiles--; > + newOffset = MAX_PHYSICAL_FILESIZE; > + > + if (i == fileno) > + newFile--; > + } > > Here, shouldn't it be i <= fileno? Because we need to move back the > curFile up to newFile whenever curFile is greater than newFile +/* Loop over all the files upto the fileno which we want to truncate. */ +for (i = file->numFiles - 1; i >= fileno; i--) Because the above loop only goes down to fileno, I feel there is no point in that check or any assert. > 2. > + /* > + * If the new location is smaller then the current location in file then > + * we need to set the curFile and the curOffset to the new values and also > + * reset the pos and nbytes. Otherwise nothing to do. > + */ > + else if ((newFile < file->curFile) || > + newOffset < file->curOffset + file->pos) > + { > + file->curFile = newFile; > + file->curOffset = newOffset; > + file->pos = 0; > + file->nbytes = 0; > + } > > Shouldn't there be && instead of || because if newFile is greater than > curFile then there is no meaning to update it? I think this condition is wrong; it should be else if ((newFile < file->curFile) || ((newFile == file->curFile) && (newOffset < file->curOffset + file->pos))) Basically, either the new file is smaller, or if it is the same then the new offset should be smaller. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
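The corrected condition above is easier to read as a standalone predicate; this is an illustration only, with an invented helper name and plain C types:

#include <stdbool.h>

/* True when the truncate point lies before the current logical position. */
static bool
truncate_point_before_position(int newFile, long newOffset,
                               int curFile, long curOffset, int pos)
{
    return newFile < curFile ||
        (newFile == curFile && newOffset < curOffset + pos);
}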
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Dilip Kumar
Date:
On Fri, Aug 21, 2020 at 10:20 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Fri, Aug 21, 2020 at 9:14 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > On Thu, Aug 20, 2020 at 5:42 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > > > On Thu, Aug 20, 2020 at 2:30 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > > > > > > Right, I think this can happen if one has changed those by BufFileSeek > > > > before doing truncate. We should fix that case as well. > > > > > > Right. > > > > > > > > I will work on those along with your other comments and > > > > > submit the updated patch. > > > > > > I have fixed this in the attached patch along with your other > > > comments. I have also attached a contrib module that is just used for > > > testing the truncate API. > > > > > > > Few comments: > > ============== > > +void > > +BufFileTruncateShared(BufFile *file, int fileno, off_t offset) > > { > > .. > > + if ((i != fileno || offset == 0) && i != 0) > > + { > > + SharedSegmentName(segment_name, file->name, i); > > + FileClose(file->files[i]); > > + if (!SharedFileSetDelete(file->fileset, segment_name, true)) > > + ereport(ERROR, > > + (errcode_for_file_access(), > > + errmsg("could not delete shared fileset \"%s\": %m", > > + segment_name))); > > + numFiles--; > > + newOffset = MAX_PHYSICAL_FILESIZE; > > + > > + if (i == fileno) > > + newFile--; > > + } > > > > Here, shouldn't it be i <= fileno? Because we need to move back the > > curFile up to newFile whenever curFile is greater than newFile > > > > I think now I have understood why you have added this condition but > probably a comment on the lines "This is required to indicate that we > have removed the given fileno" would be better for future readers. Okay. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Amit Kapila
Date:
On Fri, Aug 21, 2020 at 10:33 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Fri, Aug 21, 2020 at 9:14 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > 2. > > + /* > > + * If the new location is smaller then the current location in file then > > + * we need to set the curFile and the curOffset to the new values and also > > + * reset the pos and nbytes. Otherwise nothing to do. > > + */ > > + else if ((newFile < file->curFile) || > > + newOffset < file->curOffset + file->pos) > > + { > > + file->curFile = newFile; > > + file->curOffset = newOffset; > > + file->pos = 0; > > + file->nbytes = 0; > > + } > > > > Shouldn't there be && instead of || because if newFile is greater than > > curFile then there is no meaning to update it? > > I think this condition is wrong it should be, > > else if ((newFile < file->curFile) || ((newFile == file->curFile) && > (newOffset < file->curOffset + file->pos) > > Basically, either new file is smaller otherwise if it is the same > then-new offset should be smaller. > I think we don't need to use file->pos for that as that is required only for the current buffer, otherwise, such a condition should suffice the need. However, I was not happy with the way code and conditions were arranged in BufFileTruncateShared, so I have re-arranged them and change quite a few comments in that API. Apart from that I have updated the docs and ran pgindent for the first patch. Do let me know if you have any more comments on the first patch? -- With Regards, Amit Kapila.
Attachment
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Dilip Kumar
Date:
On Fri, Aug 21, 2020 at 3:13 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Fri, Aug 21, 2020 at 10:33 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > On Fri, Aug 21, 2020 at 9:14 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > 2. > > > + /* > > > + * If the new location is smaller then the current location in file then > > > + * we need to set the curFile and the curOffset to the new values and also > > > + * reset the pos and nbytes. Otherwise nothing to do. > > > + */ > > > + else if ((newFile < file->curFile) || > > > + newOffset < file->curOffset + file->pos) > > > + { > > > + file->curFile = newFile; > > > + file->curOffset = newOffset; > > > + file->pos = 0; > > > + file->nbytes = 0; > > > + } > > > > > > Shouldn't there be && instead of || because if newFile is greater than > > > curFile then there is no meaning to update it? > > > > I think this condition is wrong it should be, > > > > else if ((newFile < file->curFile) || ((newFile == file->curFile) && > > (newOffset < file->curOffset + file->pos) > > > > Basically, either new file is smaller otherwise if it is the same > > then-new offset should be smaller. > > > > I think we don't need to use file->pos for that as that is required > only for the current buffer, otherwise, such a condition should > suffice the need. However, I was not happy with the way code and > conditions were arranged in BufFileTruncateShared, so I have > re-arranged them and change quite a few comments in that API. Apart > from that I have updated the docs and ran pgindent for the first > patch. Do let me know if you have any more comments on the first > patch? I have reviewed and tested the patch and the changes look fine to me. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Amit Kapila
Date:
On Fri, Aug 21, 2020 at 5:10 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > I have reviewed and tested the patch and the changes look fine to me. > Thanks, I will push the next patch early next week (by Tuesday) unless you or someone else has any more comments on it. The summary of the patch (v52-0001-Extend-the-BufFile-interface, attached with my previous email) I am planning to push is: "It extends the BufFile interface to support temporary files that can be used by the single backend when the corresponding files need to be survived across the transaction and need to be opened and closed multiple times. Such files need to be created as a member of a SharedFileSet. We have implemented the interface for BufFileTruncate to allow files to be truncated up to a particular offset and extended the BufFileSeek API to support SEEK_END case. We have also added an option to provide a mode while opening the shared BufFiles instead of always opening in read-only mode. These enhancements in BufFile interface are required for the upcoming patch to allow the replication apply worker, to properly handle streamed in-progress transactions." -- With Regards, Amit Kapila.
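As a rough illustration of the summary above, the extended interface might be used from an apply worker along these lines. The signatures follow my reading of the thread (the mode argument to BufFileOpenShared, SEEK_END support in BufFileSeek, and BufFileTruncateShared) and may differ in detail from the committed code; it compiles only inside the backend.

#include "postgres.h"

#include <fcntl.h>

#include "storage/buffile.h"
#include "storage/sharedfileset.h"

/*
 * Illustrative only: reopen a per-transaction spool file, append to it,
 * then truncate it back to a remembered offset (e.g. on subxact abort).
 */
static void
append_then_trim(SharedFileSet *fileset, const char *name, off_t keep_upto)
{
    /* Reopen an existing shared BufFile in read-write mode. */
    BufFile    *file = BufFileOpenShared(fileset, name, O_RDWR);

    /* Position at the end before appending more spooled changes. */
    if (BufFileSeek(file, 0, 0, SEEK_END) != 0)
        elog(ERROR, "could not seek to end of file \"%s\"", name);

    /* ... BufFileWrite() the new changes here ... */

    /* Discard everything past keep_upto in segment 0. */
    BufFileTruncateShared(file, 0, keep_upto);

    BufFileClose(file);
}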
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Dilip Kumar
Date:
On Sat, Aug 22, 2020 at 8:38 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Fri, Aug 21, 2020 at 5:10 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > I have reviewed and tested the patch and the changes look fine to me. > > > > Thanks, I will push the next patch early next week (by Tuesday) unless > you or someone else has any more comments on it. The summary of the > patch (v52-0001-Extend-the-BufFile-interface, attached with my > previous email) I am planning to push is: "It extends the BufFile > interface to support temporary files that can be used by the single > backend when the corresponding files need to be survived across the > transaction and need to be opened and closed multiple times. Such > files need to be created as a member of a SharedFileSet. We have > implemented the interface for BufFileTruncate to allow files to be > truncated up to a particular offset and extended the BufFileSeek API > to support SEEK_END case. We have also added an option to provide a > mode while opening the shared BufFiles instead of always opening in > read-only mode. These enhancements in BufFile interface are required > for the upcoming patch to allow the replication apply worker, to > properly handle streamed in-progress transactions." While reviewing 0002, I realized that instead of using individual shared fileset for each transaction, we can use just one common shared file set. We can create individual buffile under one shared fileset and whenever a transaction commits/aborts we can just delete its buffile and the shared fileset can stay. I have attached a POC patch for this idea and if we agree with this approach then I will prepare a final patch in a couple of days. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Attachment
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Amit Kapila
Date:
On Mon, Aug 24, 2020 at 9:41 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Sat, Aug 22, 2020 at 8:38 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > On Fri, Aug 21, 2020 at 5:10 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > > > I have reviewed and tested the patch and the changes look fine to me. > > > > > > > Thanks, I will push the next patch early next week (by Tuesday) unless > > you or someone else has any more comments on it. The summary of the > > patch (v52-0001-Extend-the-BufFile-interface, attached with my > > previous email) I am planning to push is: "It extends the BufFile > > interface to support temporary files that can be used by the single > > backend when the corresponding files need to be survived across the > > transaction and need to be opened and closed multiple times. Such > > files need to be created as a member of a SharedFileSet. We have > > implemented the interface for BufFileTruncate to allow files to be > > truncated up to a particular offset and extended the BufFileSeek API > > to support SEEK_END case. We have also added an option to provide a > > mode while opening the shared BufFiles instead of always opening in > > read-only mode. These enhancements in BufFile interface are required > > for the upcoming patch to allow the replication apply worker, to > > properly handle streamed in-progress transactions." > > While reviewing 0002, I realized that instead of using individual > shared fileset for each transaction, we can use just one common shared > file set. We can create individual buffile under one shared fileset > and whenever a transaction commits/aborts we can just delete its > buffile and the shared fileset can stay. > I think the existing design is superior as it allows the flexibility to create transaction files in different temp_tablespaces which is quite important to consider as we know the files will be created only for large transactions. Once we fix the sharedfileset for a worker all the files will be created in the temp_tablespaces chosen for the first time apply worker creates it even if it got changed at some later point of time (user can change its value and then do reload config which I think will impact the worker settings as well). This all can happen because we set the tablespaces at the time of SharedFileSetInit. The other relatively smaller thing which I don't like is that we always need to create a buffile for subxact even though we don't need it. We might be able to find some solution for this but I guess the previous point is what bothers me more. -- With Regards, Amit Kapila.
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Dilip Kumar
Date:
On Tue, Aug 25, 2020 at 9:31 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Mon, Aug 24, 2020 at 9:41 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > On Sat, Aug 22, 2020 at 8:38 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > On Fri, Aug 21, 2020 at 5:10 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > > > > > I have reviewed and tested the patch and the changes look fine to me. > > > > > > > > > > Thanks, I will push the next patch early next week (by Tuesday) unless > > > you or someone else has any more comments on it. The summary of the > > > patch (v52-0001-Extend-the-BufFile-interface, attached with my > > > previous email) I am planning to push is: "It extends the BufFile > > > interface to support temporary files that can be used by the single > > > backend when the corresponding files need to be survived across the > > > transaction and need to be opened and closed multiple times. Such > > > files need to be created as a member of a SharedFileSet. We have > > > implemented the interface for BufFileTruncate to allow files to be > > > truncated up to a particular offset and extended the BufFileSeek API > > > to support SEEK_END case. We have also added an option to provide a > > > mode while opening the shared BufFiles instead of always opening in > > > read-only mode. These enhancements in BufFile interface are required > > > for the upcoming patch to allow the replication apply worker, to > > > properly handle streamed in-progress transactions." > > > > While reviewing 0002, I realized that instead of using individual > > shared fileset for each transaction, we can use just one common shared > > file set. We can create individual buffile under one shared fileset > > and whenever a transaction commits/aborts we can just delete its > > buffile and the shared fileset can stay. > > > > I think the existing design is superior as it allows the flexibility > to create transaction files in different temp_tablespaces which is > quite important to consider as we know the files will be created only > for large transactions. Once we fix the sharedfileset for a worker all > the files will be created in the temp_tablespaces chosen for the first > time apply worker creates it even if it got changed at some later > point of time (user can change its value and then do reload config > which I think will impact the worker settings as well). This all can > happen because we set the tablespaces at the time of > SharedFileSetInit. Yeah, I agree with this point, that if we use the single shared fileset then it will always use the same tablespace for all the streaming transactions. And, we might get the benefit of concurrent I/O if we use different tablespaces as we are not immediately flushing the files to the disk. > The other relatively smaller thing which I don't like is that we > always need to create a buffile for subxact even though we don't need > it. We might be able to find some solution for this but I guess the > previous point is what bothers me more. Yeah, if we go this way we might need to find some solution to this. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
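To make the trade-off concrete, here is a sketch of the per-transaction arrangement being defended here. The struct and function names are hypothetical, and whether SharedFileSetInit() accepts a NULL dsm_segment for backend-local use follows my reading of the patch series rather than the committed API.

/*
 * Hypothetical apply-worker fragment: one SharedFileSet per streamed
 * transaction means SharedFileSetInit() re-reads temp_tablespaces for each
 * new large transaction, so a changed setting takes effect for later
 * transactions; a single worker-lifetime fileset would pin the value
 * captured the first time.
 */
typedef struct StreamXidEntry
{
    TransactionId   xid;
    SharedFileSet  *stream_fileset;     /* changes spool for this xact */
} StreamXidEntry;

static void
ensure_stream_fileset(StreamXidEntry *ent, MemoryContext long_lived_ctx)
{
    if (ent->stream_fileset == NULL)
    {
        ent->stream_fileset = MemoryContextAlloc(long_lived_ctx,
                                                 sizeof(SharedFileSet));

        /* Captures the temp_tablespaces setting in effect right now. */
        SharedFileSetInit(ent->stream_fileset, NULL);
    }
}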
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Amit Kapila
Date:
On Tue, Aug 25, 2020 at 10:41 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Tue, Aug 25, 2020 at 9:31 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > I think the existing design is superior as it allows the flexibility > > to create transaction files in different temp_tablespaces which is > > quite important to consider as we know the files will be created only > > for large transactions. Once we fix the sharedfileset for a worker all > > the files will be created in the temp_tablespaces chosen for the first > > time apply worker creates it even if it got changed at some later > > point of time (user can change its value and then do reload config > > which I think will impact the worker settings as well). This all can > > happen because we set the tablespaces at the time of > > SharedFileSetInit. > > Yeah, I agree with this point, that if we use the single shared > fileset then it will always use the same tablespace for all the > streaming transactions. And, we might get the benefit of concurrent > I/O if we use different tablespaces as we are not immediately flushing > the files to the disk. > Okay, so let's retain the original approach then. I have made a few cosmetic modifications in the first two patches which include updating docs, comments, slightly modify the commit message, and change the code to match the nearby code. One change which you might have a different opinion is below: + case WAIT_EVENT_LOGICAL_CHANGES_READ: + event_name = "ReorderLogicalChangesRead"; + break; + case WAIT_EVENT_LOGICAL_CHANGES_WRITE: + event_name = "ReorderLogicalChangesWrite"; + break; + case WAIT_EVENT_LOGICAL_SUBXACT_READ: + event_name = "ReorderLogicalSubxactRead"; + break; + case WAIT_EVENT_LOGICAL_SUBXACT_WRITE: + event_name = "ReorderLogicalSubxactWrite"; + break; Why do we want to name these events starting with name as Reorder*? I think these are used in subscriber-side, so no need to use the word Reorder, so I have removed it from the attached patch. I am planning to push the first patch (v53-0001-Extend-the-BufFile-interface) in this series tomorrow unless you have any comments on the same. -- With Regards, Amit Kapila.
Attachment
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Dilip Kumar
Date:
On Tue, Aug 25, 2020 at 6:27 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Tue, Aug 25, 2020 at 10:41 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > On Tue, Aug 25, 2020 at 9:31 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > > > I think the existing design is superior as it allows the flexibility > > > to create transaction files in different temp_tablespaces which is > > > quite important to consider as we know the files will be created only > > > for large transactions. Once we fix the sharedfileset for a worker all > > > the files will be created in the temp_tablespaces chosen for the first > > > time apply worker creates it even if it got changed at some later > > > point of time (user can change its value and then do reload config > > > which I think will impact the worker settings as well). This all can > > > happen because we set the tablespaces at the time of > > > SharedFileSetInit. > > > > Yeah, I agree with this point, that if we use the single shared > > fileset then it will always use the same tablespace for all the > > streaming transactions. And, we might get the benefit of concurrent > > I/O if we use different tablespaces as we are not immediately flushing > > the files to the disk. > > > > Okay, so let's retain the original approach then. I have made a few > cosmetic modifications in the first two patches which include updating > docs, comments, slightly modify the commit message, and change the > code to match the nearby code. One change which you might have a > different opinion is below: > > + case WAIT_EVENT_LOGICAL_CHANGES_READ: > + event_name = "ReorderLogicalChangesRead"; > + break; > + case WAIT_EVENT_LOGICAL_CHANGES_WRITE: > + event_name = "ReorderLogicalChangesWrite"; > + break; > + case WAIT_EVENT_LOGICAL_SUBXACT_READ: > + event_name = "ReorderLogicalSubxactRead"; > + break; > + case WAIT_EVENT_LOGICAL_SUBXACT_WRITE: > + event_name = "ReorderLogicalSubxactWrite"; > + break; > > Why do we want to name these events starting with name as Reorder*? I > think these are used in subscriber-side, so no need to use the word > Reorder, so I have removed it from the attached patch. I am planning > to push the first patch (v53-0001-Extend-the-BufFile-interface) in > this series tomorrow unless you have any comments on the same. Your changes in 0001 and 0002, looks fine to me. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Jeff Janes
Date:
On Tue, Aug 25, 2020 at 8:58 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
I am planning
to push the first patch (v53-0001-Extend-the-BufFile-interface) in
this series tomorrow unless you have any comments on the same.
I'm getting compiler warnings now, src/backend/storage/file/sharedfileset.c line 288 needs to be:
bool found PG_USED_FOR_ASSERTS_ONLY = false;
Cheers,
Jeff
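The pattern behind the warning is simply a local variable that is read nowhere except inside Assert(). A minimal sketch (not the actual sharedfileset.c code) of why PG_USED_FOR_ASSERTS_ONLY is needed in builds without --enable-cassert:

static void
assert_fileset_is_registered(List *filesetlist, SharedFileSet *target)
{
    bool        found PG_USED_FOR_ASSERTS_ONLY = false;
    ListCell   *lc;

    foreach(lc, filesetlist)
    {
        if ((SharedFileSet *) lfirst(lc) == target)
        {
            found = true;
            break;
        }
    }

    /*
     * Only reader of "found"; without the marker this becomes an unused
     * variable in non-assert builds and the compiler warns.
     */
    Assert(found);
}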
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Amit Kapila
Date:
On Wed, Aug 26, 2020 at 11:22 PM Jeff Janes <jeff.janes@gmail.com> wrote: > > > On Tue, Aug 25, 2020 at 8:58 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > >> >> I am planning >> to push the first patch (v53-0001-Extend-the-BufFile-interface) in >> this series tomorrow unless you have any comments on the same. > > > > I'm getting compiler warnings now, src/backend/storage/file/sharedfileset.c line 288 needs to be: > > bool found PG_USED_FOR_ASSERTS_ONLY = false; > Thanks for the report. Tom Lane has already fixed this [1]. [1] - https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=e942af7b8261cd8070d0eeaf518dbc1a664859fd -- With Regards, Amit Kapila.
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Dilip Kumar
Date:
On Thu, Aug 27, 2020 at 11:16 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Wed, Aug 26, 2020 at 11:22 PM Jeff Janes <jeff.janes@gmail.com> wrote: > > > > > > On Tue, Aug 25, 2020 at 8:58 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > >> > >> I am planning > >> to push the first patch (v53-0001-Extend-the-BufFile-interface) in > >> this series tomorrow unless you have any comments on the same. > > > > > > > > I'm getting compiler warnings now, src/backend/storage/file/sharedfileset.c line 288 needs to be: > > > > bool found PG_USED_FOR_ASSERTS_ONLY = false; > > > > Thanks for the report. Tom Lane has already fixed this [1]. > > [1] - https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=e942af7b8261cd8070d0eeaf518dbc1a664859fd As discussed, I have added a another test case for covering the out of order subtransaction rollback scenario. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Attachment
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Neha Sharma
Date:
Hi,
I have done a code coverage analysis on the latest patches (v53); the report is below.
The files where the coverage changed between the three builds are the ones whose columns differ.
OS: Ubuntu 18.04
Patch applied on commit: 77c7267c37f7fa8e5e48abda4798afdbecb2b95a
Coverage (%Line / %Function) per file:

File Name | Without logical decoding patch | On v53 (2,3,4,5) patch | Without v53-0003 patch
src/backend/access/transam/xact.c | 86.2 / 92.9 | 86.2 / 92.9 | 86.2 / 92.9
src/backend/access/transam/xloginsert.c | 90.2 / 94.1 | 90.2 / 94.1 | 90.2 / 94.1
src/backend/access/transam/xlogreader.c | 73.3 / 93.3 | 73.8 / 93.3 | 73.8 / 93.3
src/backend/replication/logical/decode.c | 93.4 / 100 | 93.4 / 100 | 93.4 / 100
src/backend/access/rmgrdesc/xactdesc.c | 54.4 / 63.6 | 54.4 / 63.6 | 54.4 / 63.6
src/backend/replication/logical/reorderbuffer.c | 93.4 / 96.7 | 93.4 / 96.7 | 93.4 / 96.7
src/backend/utils/cache/inval.c | 98.1 / 100 | 98.1 / 100 | 98.1 / 100
contrib/test_decoding/test_decoding.c | 86.8 / 95.2 | 86.8 / 95.2 | 86.8 / 95.2
src/backend/replication/logical/logical.c | 90.9 / 93.5 | 90.9 / 93.5 | 91.8 / 93.5
src/backend/access/heap/heapam.c | 86.1 / 94.5 | 86.1 / 94.5 | 86.1 / 94.5
src/backend/access/index/genam.c | 90.7 / 91.7 | 91.2 / 91.7 | 91.2 / 91.7
src/backend/access/table/tableam.c | 90.6 / 100 | 90.6 / 100 | 90.6 / 100
src/backend/utils/time/snapmgr.c | 81.1 / 98.1 | 80.2 / 98.1 | 81.1 / 98.1
src/include/access/tableam.h | 92.5 / 100 | 92.5 / 100 | 92.5 / 100
src/backend/access/heap/heapam_visibility.c | 77.8 / 100 | 77.8 / 100 | 77.8 / 100
src/backend/replication/walsender.c | 90.5 / 97.8 | 90.5 / 97.8 | 90.9 / 100
src/backend/catalog/pg_subscription.c | 96 / 100 | 96 / 100 | 96 / 100
src/backend/commands/subscriptioncmds.c | 93.2 / 90 | 92.7 / 90 | 92.7 / 90
src/backend/postmaster/pgstat.c | 64.2 / 85.1 | 63.9 / 85.1 | 64.6 / 86.1
src/backend/replication/libpqwalreceiver/libpqwalreceiver.c | 82.4 / 95 | 82.5 / 95 | 83.6 / 95
src/backend/replication/logical/proto.c | 93.5 / 91.3 | 93.7 / 93.3 | 93.7 / 93.3
src/backend/replication/logical/worker.c | 91.6 / 96 | 91.5 / 97.4 | 91.9 / 97.4
src/backend/replication/pgoutput/pgoutput.c | 81.9 / 100 | 85.5 / 100 | 86.2 / 100
src/backend/replication/slotfuncs.c | 93 / 93.8 | 93 / 93.8 | 93 / 93.8
src/include/pgstat.h | 100 / - | 100 / - | 100 / -
src/backend/replication/logical/logicalfuncs.c | 87.1 / 90 | 87.1 / 90 | 87.1 / 90
src/backend/storage/file/buffile.c | 68.3 / 85 | 69.6 / 85 | 69.6 / 85
src/backend/storage/file/fd.c | 81.1 / 93 | 81.1 / 93 | 81.1 / 93
src/backend/storage/file/sharedfileset.c | 77.7 / 90.9 | 93.2 / 100 | 93.2 / 100
src/backend/utils/sort/logtape.c | 94.4 / 100 | 94.4 / 100 | 94.4 / 100
src/backend/utils/sort/sharedtuplestore.c | 90.1 / 90.9 | 90.1 / 90.9 | 90.1 / 90.9
Thanks.
--
Regards,
Neha Sharma
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Amit Kapila
Date:
On Fri, Aug 28, 2020 at 2:18 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > As discussed, I have added a another test case for covering the out of > order subtransaction rollback scenario. > +# large (streamed) transaction with out of order subtransaction ROLLBACKs +$node_publisher->safe_psql('postgres', q{ How about writing a comment as: "large (streamed) transaction with subscriber receiving out of order subtransaction ROLLBACKs"? I have reviewed and modified the number of things in the attached patch: 1. In apply_handle_origin, improved the check streamed xacts. 2. In apply_handle_stream_commit() while applying changes in the loop, added CHECK_FOR_INTERRUPTS. 3. In DEBUG messages, print the path with double-quotes as we are doing in all other places. 4. + /* + * Exit if streaming option is changed. The launcher will start new + * worker. + */ + if (newsub->stream != MySubscription->stream) + { + ereport(LOG, + (errmsg("logical replication apply worker for subscription \"%s\" will " + "restart because subscription's streaming option were changed", + MySubscription->name))); + + proc_exit(0); + } + We don't need a separate check like this. I have merged this into one of the existing checks. 5. subxact_info_write() { .. + if (subxact_data.nsubxacts == 0) + { + if (ent->subxact_fileset) + { + cleanup_subxact_info(); + BufFileDeleteShared(ent->subxact_fileset, path); + pfree(ent->subxact_fileset); + ent->subxact_fileset = NULL; + } I don't think it is right to use BufFileDeleteShared interface here because it won't perform SharedFileSetUnregister which means if after above code execution is the server exits it will crash in SharedFileSetDeleteOnProcExit which will try to access already deleted fileset entry. Fixed this by calling SharedFileSetDeleteAll() instead. The another related problem is that in function SharedFileSetDeleteOnProcExit, it tries to delete the list element while traversing the list with 'foreach' construct which makes the behavior of list traversal unpredictable. I have fixed this in a separate patch v54-0001-Fix-the-SharedFileSetUnregister-API, if you are fine with this, I would like to commit this as this fixes a problem in the existing commit 808e13b282. 6. Function stream_cleanup_files() contains a missing_ok argument which is not used so removed it. 7. In pgoutput.c, change the ordering of functions to make them consistent with their declaration. 8. typedef struct RelationSyncEntry { Oid relid; /* relation oid */ + TransactionId xid; /* transaction that created the record */ Removed above parameter as this doesn't seem to be required as per the new design in the patch. Apart from above, I have added/changed quite a few comments and a few other cosmetic changes. Kindly review and let me know what do you think about the changes? One more comment for which I haven't done anything yet. +static void +set_schema_sent_in_streamed_txn(RelationSyncEntry *entry, TransactionId xid) +{ + MemoryContext oldctx; + + oldctx = MemoryContextSwitchTo(CacheMemoryContext); + + entry->streamed_txns = lappend_int(entry->streamed_txns, xid); Is it a good idea to append xid with lappend_int? Won't we need something equivalent for uint32? If so, I think we have a couple of options (a) use lcons method and accordingly append the pointer to xid, I think we need to allocate memory for xid if we want to use this idea or (b) use an array instead. What do you think? -- With Regards, Amit Kapila.
Attachment
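Regarding the list-traversal problem in point 5 above: the usual idiom in pg_list.h for removing the element the loop is currently on is foreach_delete_current(). A sketch of that idiom is below; it is illustrative only and not necessarily the shape of the committed fix for SharedFileSetDeleteOnProcExit.

static void
delete_all_filesets(List **filesetlist)
{
    ListCell   *lc;

    foreach(lc, *filesetlist)
    {
        SharedFileSet *fileset = (SharedFileSet *) lfirst(lc);

        /* Remove all files belonging to this fileset. */
        SharedFileSetDeleteAll(fileset);

        /* Delete the current cell without confusing the iterator. */
        *filesetlist = foreach_delete_current(*filesetlist, lc);
    }
}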
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Dilip Kumar
Date:
On Sat, Aug 29, 2020 at 5:18 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Fri, Aug 28, 2020 at 2:18 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > As discussed, I have added a another test case for covering the out of > > order subtransaction rollback scenario. > > > > +# large (streamed) transaction with out of order subtransaction ROLLBACKs > +$node_publisher->safe_psql('postgres', q{ > > How about writing a comment as: "large (streamed) transaction with > subscriber receiving out of order subtransaction ROLLBACKs"? I have fixed and merged with 0002. > I have reviewed and modified the number of things in the attached patch: > 1. In apply_handle_origin, improved the check streamed xacts. > 2. In apply_handle_stream_commit() while applying changes in the loop, > added CHECK_FOR_INTERRUPTS. > 3. In DEBUG messages, print the path with double-quotes as we are > doing in all other places. > 4. > + /* > + * Exit if streaming option is changed. The launcher will start new > + * worker. > + */ > + if (newsub->stream != MySubscription->stream) > + { > + ereport(LOG, > + (errmsg("logical replication apply worker for subscription \"%s\" will " > + "restart because subscription's streaming option were changed", > + MySubscription->name))); > + > + proc_exit(0); > + } > + > We don't need a separate check like this. I have merged this into one > of the existing checks. > 5. > subxact_info_write() > { > .. > + if (subxact_data.nsubxacts == 0) > + { > + if (ent->subxact_fileset) > + { > + cleanup_subxact_info(); > + BufFileDeleteShared(ent->subxact_fileset, path); > + pfree(ent->subxact_fileset); > + ent->subxact_fileset = NULL; > + } > > I don't think it is right to use BufFileDeleteShared interface here > because it won't perform SharedFileSetUnregister which means if after > above code execution is the server exits it will crash in > SharedFileSetDeleteOnProcExit which will try to access already deleted > fileset entry. Fixed this by calling SharedFileSetDeleteAll() instead. > The another related problem is that in function > SharedFileSetDeleteOnProcExit, it tries to delete the list element > while traversing the list with 'foreach' construct which makes the > behavior of list traversal unpredictable. I have fixed this in a > separate patch v54-0001-Fix-the-SharedFileSetUnregister-API, if you > are fine with this, I would like to commit this as this fixes a > problem in the existing commit 808e13b282. > 6. Function stream_cleanup_files() contains a missing_ok argument > which is not used so removed it. > 7. In pgoutput.c, change the ordering of functions to make them > consistent with their declaration. > 8. > typedef struct RelationSyncEntry > { > Oid relid; /* relation oid */ > + TransactionId xid; /* transaction that created the record */ > > Removed above parameter as this doesn't seem to be required as per the > new design in the patch. > > Apart from above, I have added/changed quite a few comments and a few > other cosmetic changes. Kindly review and let me know what do you > think about the changes? I have reviewed your changes and look fine to me. And the bug fix in 0001 also looks fine. > One more comment for which I haven't done anything yet. > +static void > +set_schema_sent_in_streamed_txn(RelationSyncEntry *entry, TransactionId xid) > +{ > + MemoryContext oldctx; > + > + oldctx = MemoryContextSwitchTo(CacheMemoryContext); > + > + entry->streamed_txns = lappend_int(entry->streamed_txns, xid); > Is it a good idea to append xid with lappend_int? 
Won't we need > something equivalent for uint32? If so, I think we have a couple of > options (a) use lcons method and accordingly append the pointer to > xid, I think we need to allocate memory for xid if we want to use this > idea or (b) use an array instead. What do you think? BTW, OID is internally mapped to uint32, but using lappend_oid might not look good. So maybe we can provide an option for lappend_uint32? Using an array is also not a bad idea. Providing lappend_uint32 option looks more appealing to me. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Attachment
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Amit Kapila
Date:
On Sun, Aug 30, 2020 at 2:43 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Sat, Aug 29, 2020 at 5:18 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > One more comment for which I haven't done anything yet. > > +static void > > +set_schema_sent_in_streamed_txn(RelationSyncEntry *entry, TransactionId xid) > > +{ > > + MemoryContext oldctx; > > + > > + oldctx = MemoryContextSwitchTo(CacheMemoryContext); > > + > > + entry->streamed_txns = lappend_int(entry->streamed_txns, xid); > > > Is it a good idea to append xid with lappend_int? Won't we need > > something equivalent for uint32? If so, I think we have a couple of > > options (a) use lcons method and accordingly append the pointer to > > xid, I think we need to allocate memory for xid if we want to use this > > idea or (b) use an array instead. What do you think? > > BTW, OID is internally mapped to uint32, but using lappend_oid might > not look good. So maybe we can provide an option for lappend_uint32? > Using an array is also not a bad idea. Providing lappend_uint32 > option looks more appealing to me. > I thought about this again and I feel it might be okay to use it for our case as after storing it in T_IntList, we primarily fetch it for comparison with TrasnactionId (uint32), so this shouldn't create any problem. I feel we can just discuss this in a separate thread and check the opinion of others, what do you think? Another comment: +cleanup_rel_sync_cache(TransactionId xid, bool is_commit) +{ + HASH_SEQ_STATUS hash_seq; + RelationSyncEntry *entry; + + Assert(RelationSyncCache != NULL); + + hash_seq_init(&hash_seq, RelationSyncCache); + while ((entry = hash_seq_search(&hash_seq)) != NULL) + { + if (is_commit) + entry->schema_sent = true; How is it correct to set 'entry->schema_sent' for all the entries in RelationSyncCache? Consider a case where due to invalidation in an unrelated transaction we have set the flag schema_sent for a particular relation 'r1' as 'false' and that transaction is executed before the current streamed transaction for which we are performing commit and called this function. It will set the flag for unrelated entry in this case 'r1' which doesn't seem correct to me. Or, if this is correct, it would be a good idea to write some comments about it. -- With Regards, Amit Kapila.
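A minimal sketch of the approach Amit is okay with here: the xid goes into a T_IntList via lappend_int and is only ever read back for comparison with a TransactionId, with a cast to uint32 to keep the comparison well-defined. Function and field names follow the hunks quoted earlier in the thread; this is illustrative, not the committed code.

static void
set_schema_sent_in_streamed_txn(RelationSyncEntry *entry, TransactionId xid)
{
    MemoryContext oldctx = MemoryContextSwitchTo(CacheMemoryContext);

    /* Remember that this relation's schema was sent in this streamed xact. */
    entry->streamed_txns = lappend_int(entry->streamed_txns, (int) xid);

    MemoryContextSwitchTo(oldctx);
}

static bool
get_schema_sent_in_streamed_txn(RelationSyncEntry *entry, TransactionId xid)
{
    ListCell   *lc;

    foreach(lc, entry->streamed_txns)
    {
        if (xid == (uint32) lfirst_int(lc))
            return true;
    }

    return false;
}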
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Amit Kapila
Date:
On Mon, Aug 31, 2020 at 10:49 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Sun, Aug 30, 2020 at 2:43 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > Another comment: > > +cleanup_rel_sync_cache(TransactionId xid, bool is_commit) > +{ > + HASH_SEQ_STATUS hash_seq; > + RelationSyncEntry *entry; > + > + Assert(RelationSyncCache != NULL); > + > + hash_seq_init(&hash_seq, RelationSyncCache); > + while ((entry = hash_seq_search(&hash_seq)) != NULL) > + { > + if (is_commit) > + entry->schema_sent = true; > > How is it correct to set 'entry->schema_sent' for all the entries in > RelationSyncCache? Consider a case where due to invalidation in an > unrelated transaction we have set the flag schema_sent for a > particular relation 'r1' as 'false' and that transaction is executed > before the current streamed transaction for which we are performing > commit and called this function. It will set the flag for unrelated > entry in this case 'r1' which doesn't seem correct to me. Or, if this > is correct, it would be a good idea to write some comments about it. > Few more comments: 1. +my $appname = 'tap_sub'; +$node_subscriber->safe_psql('postgres', +"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub" +); In most of the tests, we are using the above statement to create a subscription. Don't we need (streaming = 'on') parameter while creating a subscription? Is there a reason for not doing so in this patch itself? 2. 009_stream_simple.pl +# Insert, update and delete enough rows to exceed the 64kB limit. +$node_publisher->safe_psql('postgres', q{ +BEGIN; +INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i); +UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0; +DELETE FROM test_tab WHERE mod(a,3) = 0; +COMMIT; +}); How much above this data is 64kB limit? I just wanted to see that it should not be on borderline and then due to some alignment issues the streaming doesn't happen on some machines? Also, how such a test ensures that the streaming has happened because the way we are checking results, won't it be the same for the non-streaming case as well? 3. +# Change the local values of the extra columns on the subscriber, +# update publisher, and check that subscriber retains the expected +# values +$node_subscriber->safe_psql('postgres', "UPDATE test_tab SET c = 'epoch'::timestamptz + 987654321 * interval '1s'"); +$node_publisher->safe_psql('postgres', "UPDATE test_tab SET b = md5(a::text)"); + +wait_for_caught_up($node_publisher, $appname); + +$result = + $node_subscriber->safe_psql('postgres', "SELECT count(*), count(extract(epoch from c) = 987654321), count(d = 999) FROM test_tab"); +is($result, qq(3334|3334|3334), 'check extra columns contain locally changed data'); Again, how this test is relevant to streaming mode? 4. I have checked that in one of the previous patches, we have a test v53-0004-Add-TAP-test-for-streaming-vs.-DDL which contains a test case quite similar to what we have in v55-0002-Add-support-for-streaming-to-built-in-logical-re/013_stream_subxact_ddl_abort. If there is any difference that can cover more scenarios then can we consider merging them into one test? Apart from the above, I have made a few changes in the attached patch which are mainly to simplify the code at one place, added/edited few comments, some other cosmetic changes, and renamed the test case files as the initials of their name were matching other tests in the similar directory. -- With Regards, Amit Kapila.
Attachment
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Amit Kapila
Date:
On Mon, Aug 31, 2020 at 1:24 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > 2. > 009_stream_simple.pl > +# Insert, update and delete enough rows to exceed the 64kB limit. > +$node_publisher->safe_psql('postgres', q{ > +BEGIN; > +INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i); > +UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0; > +DELETE FROM test_tab WHERE mod(a,3) = 0; > +COMMIT; > +}); > > How much above this data is 64kB limit? I just wanted to see that it > should not be on borderline and then due to some alignment issues the > streaming doesn't happen on some machines? > I think we should find similar information for other tests added by the patch as well. Few other comments: =================== +sub wait_for_caught_up +{ + my ($node, $appname) = @_; + + $node->poll_query_until('postgres', +"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';" + ) or die "Timed ou The patch has added this in all the test files if it is used in so many tests then we need to add this in some generic place (PostgresNode.pm) but actually, I am not sure if need this at all. Why can't the existing wait_for_catchup in PostgresNode.pm serve the same purpose. 2. In system_views.sql, -- All columns of pg_subscription except subconninfo are readable. REVOKE ALL ON pg_subscription FROM public; GRANT SELECT (subdbid, subname, subowner, subenabled, subbinary, subslotname, subpublications) ON pg_subscription TO public; Here, we need to update for substream column as well. 3. Update describeSubscriptions() to show the 'substream' value in \dRs. 4. Also, lets add few tests in subscription.sql as we have added 'binary' option in commit 9de77b5453. 5. I think we can merge pg_dump related changes (the last version posted in mail thread is v53-0005-Add-streaming-option-in-pg_dump) in the main patch, one minor comment on pg_dump related changes @@ -4358,6 +4369,8 @@ dumpSubscription(Archive *fout, SubscriptionInfo *subinfo) if (strcmp(subinfo->subbinary, "t") == 0) appendPQExpBuffer(query, ", binary = true"); + if (strcmp(subinfo->substream, "f") != 0) + appendPQExpBuffer(query, ", streaming = on"); if (strcmp(subinfo->subsynccommit, "off") != 0) appendPQExpBuffer(query, ", synchronous_commit = %s", fmtId(subinfo->subsynccommit)); Keep one line space between substream and subsynccommit option code to keep it consistent with nearby code. -- With Regards, Amit Kapila.
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Dilip Kumar
Date:
On Mon, Aug 31, 2020 at 1:24 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Mon, Aug 31, 2020 at 10:49 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > On Sun, Aug 30, 2020 at 2:43 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > > > > Another comment: > > > > +cleanup_rel_sync_cache(TransactionId xid, bool is_commit) > > +{ > > + HASH_SEQ_STATUS hash_seq; > > + RelationSyncEntry *entry; > > + > > + Assert(RelationSyncCache != NULL); > > + > > + hash_seq_init(&hash_seq, RelationSyncCache); > > + while ((entry = hash_seq_search(&hash_seq)) != NULL) > > + { > > + if (is_commit) > > + entry->schema_sent = true; > > > > How is it correct to set 'entry->schema_sent' for all the entries in > > RelationSyncCache? Consider a case where due to invalidation in an > > unrelated transaction we have set the flag schema_sent for a > > particular relation 'r1' as 'false' and that transaction is executed > > before the current streamed transaction for which we are performing > > commit and called this function. It will set the flag for unrelated > > entry in this case 'r1' which doesn't seem correct to me. Or, if this > > is correct, it would be a good idea to write some comments about it. Yeah, this is wrong, I have fixed this issue in the attached patch and also added a new test for the same. > Few more comments: > 1. > +my $appname = 'tap_sub'; > +$node_subscriber->safe_psql('postgres', > +"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr > application_name=$appname' PUBLICATION tap_pub" > +); > > In most of the tests, we are using the above statement to create a > subscription. Don't we need (streaming = 'on') parameter while > creating a subscription? Is there a reason for not doing so in this > patch itself? I have changed this. > 2. > 009_stream_simple.pl > +# Insert, update and delete enough rows to exceed the 64kB limit. > +$node_publisher->safe_psql('postgres', q{ > +BEGIN; > +INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i); > +UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0; > +DELETE FROM test_tab WHERE mod(a,3) = 0; > +COMMIT; > +}); > > How much above this data is 64kB limit? I just wanted to see that it > should not be on borderline and then due to some alignment issues the > streaming doesn't happen on some machines? Also, how such a test > ensures that the streaming has happened because the way we are > checking results, won't it be the same for the non-streaming case as > well? Only for this case, or you mean for all the tests? > 3. > +# Change the local values of the extra columns on the subscriber, > +# update publisher, and check that subscriber retains the expected > +# values > +$node_subscriber->safe_psql('postgres', "UPDATE test_tab SET c = > 'epoch'::timestamptz + 987654321 * interval '1s'"); > +$node_publisher->safe_psql('postgres', "UPDATE test_tab SET b = md5(a::text)"); > + > +wait_for_caught_up($node_publisher, $appname); > + > +$result = > + $node_subscriber->safe_psql('postgres', "SELECT count(*), > count(extract(epoch from c) = 987654321), count(d = 999) FROM > test_tab"); > +is($result, qq(3334|3334|3334), 'check extra columns contain locally > changed data'); > > Again, how this test is relevant to streaming mode? I agree, it is not specific to the streaming. > 4. 
I have checked that in one of the previous patches, we have a test > v53-0004-Add-TAP-test-for-streaming-vs.-DDL which contains a test case > quite similar to what we have in > v55-0002-Add-support-for-streaming-to-built-in-logical-re/013_stream_subxact_ddl_abort. > If there is any difference that can cover more scenarios then can we > consider merging them into one test? I will have a look. > Apart from the above, I have made a few changes in the attached patch > which are mainly to simplify the code at one place, added/edited few > comments, some other cosmetic changes, and renamed the test case files > as the initials of their name were matching other tests in the similar > directory. Changes look fine to me except this + + /* the value must be on/off */ + if (strcmp(strVal(defel->arg), "on") && strcmp(strVal(defel->arg), "off")) + ereport(ERROR, + (errcode(ERRCODE_INVALID_PARAMETER_VALUE), + errmsg("invalid streaming value"))); + + /* enable streaming if it's 'on' */ + *enable_streaming = (strcmp(strVal(defel->arg), "on") == 0); I mean for streaming why we need to handle differently than the other surrounding code for example "binary" option. Apart from that for testing 0001, I have added a new test in the attached contrib. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Attachment
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Neha Sharma
Date:
Hi Amit/Dilip,
I have tested a few scenarios on top of the v56 patches in which the replication worker still had a few subtransactions in an uncommitted state when the publisher server was restarted.
No crashes or data discrepancies were observed; the verified test scenarios are attached.
Data Setup:
Publication Server postgresql.conf:
wal_level = logical
max_wal_senders = 10
max_replication_slots = 15
wal_log_hints = on
hot_standby_feedback = on
wal_receiver_status_interval = 1
listen_addresses='*'
log_min_messages=debug1
wal_sender_timeout = 0
logical_decoding_work_mem=64kB
Subscription Server postgresql.conf:
wal_level = logical
max_wal_senders = 10
max_replication_slots = 15
wal_log_hints = on
hot_standby_feedback = on
wal_receiver_status_interval = 1
listen_addresses='*'
log_min_messages=debug1
wal_sender_timeout = 0
logical_decoding_work_mem=64kB
port=5433
Initial setup:
Publication Server:
create table t(a int PRIMARY KEY ,b text);
CREATE OR REPLACE FUNCTION large_val() RETURNS TEXT LANGUAGE SQL AS 'select array_agg(md5(g::text))::text from generate_series(1, 256) g';
create publication test_pub for table t with(PUBLISH='insert,delete,update,truncate');
alter table t replica identity FULL ;
insert into t values (generate_series(1,20),large_val()) ON CONFLICT (a) DO UPDATE SET a=EXCLUDED.a*300;
Subscription server:
create table t(a int,b text);
create subscription test_sub CONNECTION 'host=localhost port=5432 dbname=postgres user=edb' PUBLICATION test_pub WITH ( slot_name = test_slot_sub1,streaming=on);
Thanks.
--
Regards,
Neha Sharma
Attachment
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Amit Kapila
Date:
On Mon, Aug 31, 2020 at 7:28 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Mon, Aug 31, 2020 at 1:24 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > On Mon, Aug 31, 2020 at 10:49 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > On Sun, Aug 30, 2020 at 2:43 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > > > > > > > Another comment: > > > > > > +cleanup_rel_sync_cache(TransactionId xid, bool is_commit) > > > +{ > > > + HASH_SEQ_STATUS hash_seq; > > > + RelationSyncEntry *entry; > > > + > > > + Assert(RelationSyncCache != NULL); > > > + > > > + hash_seq_init(&hash_seq, RelationSyncCache); > > > + while ((entry = hash_seq_search(&hash_seq)) != NULL) > > > + { > > > + if (is_commit) > > > + entry->schema_sent = true; > > > > > > How is it correct to set 'entry->schema_sent' for all the entries in > > > RelationSyncCache? Consider a case where due to invalidation in an > > > unrelated transaction we have set the flag schema_sent for a > > > particular relation 'r1' as 'false' and that transaction is executed > > > before the current streamed transaction for which we are performing > > > commit and called this function. It will set the flag for unrelated > > > entry in this case 'r1' which doesn't seem correct to me. Or, if this > > > is correct, it would be a good idea to write some comments about it. > > Yeah, this is wrong, I have fixed this issue in the attached patch > and also added a new test for the same. > In functions cleanup_rel_sync_cache and get_schema_sent_in_streamed_txn, lets cast the result of lfirst_int to uint32 as suggested by Tom [1]. Also, lets keep the way we compare xids consistent in both functions, i.e, if (xid == lfirst_int(lc)). The behavior tested by the test case added for this is not clear primarily because of comments. +++ b/src/test/subscription/t/021_stream_schema.pl @@ -0,0 +1,80 @@ +# Test behavior with streaming transaction exceeding logical_decoding_work_mem ... +# large (streamed) transaction with DDL, DML and ROLLBACKs +$node_publisher->safe_psql('postgres', q{ +BEGIN; +ALTER TABLE test_tab ADD COLUMN c INT; +INSERT INTO test_tab SELECT i, md5(i::text), i FROM generate_series(3,3000) s(i); +ALTER TABLE test_tab ADD COLUMN d INT; +COMMIT; +}); + +# large (streamed) transaction with DDL, DML and ROLLBACKs +$node_publisher->safe_psql('postgres', q{ +BEGIN; +INSERT INTO test_tab SELECT i, md5(i::text), i, i FROM generate_series(3001,3005) s(i); +COMMIT; +}); +wait_for_caught_up($node_publisher, $appname); I understand that how this test will test the functionality related to schema_sent stuff but neither the comments atop of file nor atop the test case explains it clearly. > > Few more comments: > > > 2. > > 009_stream_simple.pl > > +# Insert, update and delete enough rows to exceed the 64kB limit. > > +$node_publisher->safe_psql('postgres', q{ > > +BEGIN; > > +INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i); > > +UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0; > > +DELETE FROM test_tab WHERE mod(a,3) = 0; > > +COMMIT; > > +}); > > > > How much above this data is 64kB limit? I just wanted to see that it > > should not be on borderline and then due to some alignment issues the > > streaming doesn't happen on some machines? Also, how such a test > > ensures that the streaming has happened because the way we are > > checking results, won't it be the same for the non-streaming case as > > well? > > Only for this case, or you mean for all the tests? 
> It is better to do it for all tests and I have clarified this in my next email sent yesterday [2] where I have raised a few more comments as well. I hope you have not missed that email. > > 3. > > +# Change the local values of the extra columns on the subscriber, > > +# update publisher, and check that subscriber retains the expected > > +# values > > +$node_subscriber->safe_psql('postgres', "UPDATE test_tab SET c = > > 'epoch'::timestamptz + 987654321 * interval '1s'"); > > +$node_publisher->safe_psql('postgres', "UPDATE test_tab SET b = md5(a::text)"); > > + > > +wait_for_caught_up($node_publisher, $appname); > > + > > +$result = > > + $node_subscriber->safe_psql('postgres', "SELECT count(*), > > count(extract(epoch from c) = 987654321), count(d = 999) FROM > > test_tab"); > > +is($result, qq(3334|3334|3334), 'check extra columns contain locally > > changed data'); > > > > Again, how this test is relevant to streaming mode? > > I agree, it is not specific to the streaming. > > > Apart from the above, I have made a few changes in the attached patch > > which are mainly to simplify the code at one place, added/edited few > > comments, some other cosmetic changes, and renamed the test case files > > as the initials of their name were matching other tests in the similar > > directory. > > Changes look fine to me except this > > + > > + /* the value must be on/off */ > + if (strcmp(strVal(defel->arg), "on") && strcmp(strVal(defel->arg), "off")) > + ereport(ERROR, > + (errcode(ERRCODE_INVALID_PARAMETER_VALUE), > + errmsg("invalid streaming value"))); > + > + /* enable streaming if it's 'on' */ > + *enable_streaming = (strcmp(strVal(defel->arg), "on") == 0); > > I mean for streaming why we need to handle differently than the other > surrounding code for example "binary" option. > Hmm, I think the code changed by me is to make it look similar to the binary option. The code you have quoted above is from the patch version prior to what I have sent. See the code snippet after my changes: @@ -182,6 +222,16 @@ parse_output_parameters(List *options, uint32 *protocol_version, *binary = defGetBoolean(defel); } + else if (strcmp(defel->defname, "streaming") == 0) + { + if (streaming_given) + ereport(ERROR, + (errcode(ERRCODE_SYNTAX_ERROR), + errmsg("conflicting or redundant options"))); + streaming_given = true; + + *enable_streaming = defGetBoolean(defel); + } This looks exactly similar to the binary option. Can you please check it once again and confirm back? [1] - https://www.postgresql.org/message-id/3955127.1598880523%40sss.pgh.pa.us [2] - https://www.postgresql.org/message-id/CAA4eK1JjrcK6bk%2Bur3J%2BkLsfz4%2BipJFN7VcRd3cXr4gG5ZWWig%40mail.gmail.com -- With Regards, Amit Kapila.
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Amit Kapila
Date:
On Mon, Aug 31, 2020 at 10:27 PM Neha Sharma <neha.sharma@enterprisedb.com> wrote: > > Hi Amit/Dilip, > > I have tested a few scenarios on top of the v56 patches, where the replication worker still had few subtransactions in uncommitted state and we restart the publisher server. > No crash or data discrepancies were observed, attached are the test scenarios verified. > Thanks, I have pushed the fix (https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=4ab77697f67aa5b90b032b9175b46901859da6d7). -- With Regards, Amit Kapila.
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Amit Kapila
Date:
On Tue, Sep 1, 2020 at 9:28 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Mon, Aug 31, 2020 at 7:28 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > On Mon, Aug 31, 2020 at 1:24 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > In functions cleanup_rel_sync_cache and > get_schema_sent_in_streamed_txn, lets cast the result of lfirst_int to > uint32 as suggested by Tom [1]. Also, lets keep the way we compare > xids consistent in both functions, i.e, if (xid == lfirst_int(lc)). > Fixed this in the attached patch. > The behavior tested by the test case added for this is not clear > primarily because of comments. > > +++ b/src/test/subscription/t/021_stream_schema.pl > @@ -0,0 +1,80 @@ > +# Test behavior with streaming transaction exceeding logical_decoding_work_mem > ... > +# large (streamed) transaction with DDL, DML and ROLLBACKs > +$node_publisher->safe_psql('postgres', q{ > +BEGIN; > +ALTER TABLE test_tab ADD COLUMN c INT; > +INSERT INTO test_tab SELECT i, md5(i::text), i FROM > generate_series(3,3000) s(i); > +ALTER TABLE test_tab ADD COLUMN d INT; > +COMMIT; > +}); > + > +# large (streamed) transaction with DDL, DML and ROLLBACKs > +$node_publisher->safe_psql('postgres', q{ > +BEGIN; > +INSERT INTO test_tab SELECT i, md5(i::text), i, i FROM > generate_series(3001,3005) s(i); > +COMMIT; > +}); > +wait_for_caught_up($node_publisher, $appname); > > I understand that how this test will test the functionality related to > schema_sent stuff but neither the comments atop of file nor atop the > test case explains it clearly. > Added comments for this test. > > > Few more comments: > > > > > > 2. > > > 009_stream_simple.pl > > > +# Insert, update and delete enough rows to exceed the 64kB limit. > > > +$node_publisher->safe_psql('postgres', q{ > > > +BEGIN; > > > +INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i); > > > +UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0; > > > +DELETE FROM test_tab WHERE mod(a,3) = 0; > > > +COMMIT; > > > +}); > > > > > > How much above this data is 64kB limit? I just wanted to see that it > > > should not be on borderline and then due to some alignment issues the > > > streaming doesn't happen on some machines? Also, how such a test > > > ensures that the streaming has happened because the way we are > > > checking results, won't it be the same for the non-streaming case as > > > well? > > > > Only for this case, or you mean for all the tests? > > > I have not done this yet. > It is better to do it for all tests and I have clarified this in my > next email sent yesterday [2] where I have raised a few more comments > as well. I hope you have not missed that email. > > > > 3. > > > +# Change the local values of the extra columns on the subscriber, > > > +# update publisher, and check that subscriber retains the expected > > > +# values > > > +$node_subscriber->safe_psql('postgres', "UPDATE test_tab SET c = > > > 'epoch'::timestamptz + 987654321 * interval '1s'"); > > > +$node_publisher->safe_psql('postgres', "UPDATE test_tab SET b = md5(a::text)"); > > > + > > > +wait_for_caught_up($node_publisher, $appname); > > > + > > > +$result = > > > + $node_subscriber->safe_psql('postgres', "SELECT count(*), > > > count(extract(epoch from c) = 987654321), count(d = 999) FROM > > > test_tab"); > > > +is($result, qq(3334|3334|3334), 'check extra columns contain locally > > > changed data'); > > > > > > Again, how this test is relevant to streaming mode? > > > > I agree, it is not specific to the streaming. 
> > I think we can leave this as of now. After committing the stats patches by Sawada-San and Ajin, we might be able to improve this test. > +sub wait_for_caught_up > +{ > + my ($node, $appname) = @_; > + > + $node->poll_query_until('postgres', > +"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication > WHERE application_name = '$appname';" > + ) or die "Timed ou > > The patch has added this in all the test files if it is used in so > many tests then we need to add this in some generic place > (PostgresNode.pm) but actually, I am not sure if need this at all. Why > can't the existing wait_for_catchup in PostgresNode.pm serve the same > purpose. > Changed as per this suggestion. > 2. > In system_views.sql, > > -- All columns of pg_subscription except subconninfo are readable. > REVOKE ALL ON pg_subscription FROM public; > GRANT SELECT (subdbid, subname, subowner, subenabled, subbinary, > subslotname, subpublications) > ON pg_subscription TO public; > > Here, we need to update for substream column as well. > Fixed. > 3. Update describeSubscriptions() to show the 'substream' value in \dRs. > > 4. Also, lets add few tests in subscription.sql as we have added > 'binary' option in commit 9de77b5453. > Fixed both the above comments. > 5. I think we can merge pg_dump related changes (the last version > posted in mail thread is v53-0005-Add-streaming-option-in-pg_dump) in > the main patch, one minor comment on pg_dump related changes > @@ -4358,6 +4369,8 @@ dumpSubscription(Archive *fout, SubscriptionInfo *subinfo) > if (strcmp(subinfo->subbinary, "t") == 0) > appendPQExpBuffer(query, ", binary = true"); > > + if (strcmp(subinfo->substream, "f") != 0) > + appendPQExpBuffer(query, ", streaming = on"); > if (strcmp(subinfo->subsynccommit, "off") != 0) > appendPQExpBuffer(query, ", synchronous_commit = %s", > fmtId(subinfo->subsynccommit)); > > Keep one line space between substream and subsynccommit option code to > keep it consistent with nearby code. > Changed as per this suggestion. I have fixed all the comments except the below comments. 1. verify the size of various tests to ensure that it is above logical_decoding_work_mem. 2. I have checked that in one of the previous patches, we have a test v53-0004-Add-TAP-test-for-streaming-vs.-DDL which contains a test case quite similar to what we have in v55-0002-Add-support-for-streaming-to-built-in-logical-re/013_stream_subxact_ddl_abort. If there is any difference that can cover more scenarios then can we consider merging them into one test? 3. +# Change the local values of the extra columns on the subscriber, +# update publisher, and check that subscriber retains the expected +# values +$node_subscriber->safe_psql('postgres', "UPDATE test_tab SET c = 'epoch'::timestamptz + 987654321 * interval '1s'"); +$node_publisher->safe_psql('postgres', "UPDATE test_tab SET b = md5(a::text)"); + +wait_for_caught_up($node_publisher, $appname); + +$result = + $node_subscriber->safe_psql('postgres', "SELECT count(*), count(extract(epoch from c) = 987654321), count(d = 999) FROM test_tab"); +is($result, qq(3334|3334|3334), 'check extra columns contain locally changed data'); Again, how this test is relevant to streaming mode? 4. Apart from the above, I think we should think of minimizing the test cases which can be committed with the base patch. We can later add more tests. Kindly verify the changes. -- With Regards, Amit Kapila.
Attachment
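Concretely, the "one line space" suggestion for the pg_dump hunk quoted above amounts to formatting dumpSubscription() roughly like this (a sketch of the quoted diff with the spacing applied, not the committed code):

    if (strcmp(subinfo->subbinary, "t") == 0)
        appendPQExpBuffer(query, ", binary = true");

    if (strcmp(subinfo->substream, "f") != 0)
        appendPQExpBuffer(query, ", streaming = on");

    if (strcmp(subinfo->subsynccommit, "off") != 0)
        appendPQExpBuffer(query, ", synchronous_commit = %s",
                          fmtId(subinfo->subsynccommit));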
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Amit Kapila
Date:
On Tue, Sep 1, 2020 at 8:33 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > I have fixed all the comments except .. > 3. +# Change the local values of the extra columns on the subscriber, > +# update publisher, and check that subscriber retains the expected > +# values > +$node_subscriber->safe_psql('postgres', "UPDATE test_tab SET c = > 'epoch'::timestamptz + 987654321 * interval '1s'"); > +$node_publisher->safe_psql('postgres', "UPDATE test_tab SET b = md5(a::text)"); > + > +wait_for_caught_up($node_publisher, $appname); > + > +$result = > + $node_subscriber->safe_psql('postgres', "SELECT count(*), > count(extract(epoch from c) = 987654321), count(d = 999) FROM > test_tab"); > +is($result, qq(3334|3334|3334), 'check extra columns contain locally > changed data'); > > Again, how this test is relevant to streaming mode? > I think we can keep this test in one of the newly added tests say in 015_stream_simple.pl to ensure that after streaming transaction, the non-streaming one behaves expectedly. So we can change the comment as "Change the local values of the extra columns on the subscriber, update publisher, and check that subscriber retains the expected values. This is to ensure that non-streaming transactions behave properly after a streaming transaction." We can remove this test from the other two places 016_stream_subxact.pl and 020_stream_binary.pl. > 4. Apart from the above, I think we should think of minimizing the > test cases which can be committed with the base patch. We can later > add more tests. > We can combine the tests in 015_stream_simple.pl and 020_stream_binary.pl as I can't see a good reason to keep them separate. Then, I think we can keep only this part with the main patch and extract other tests into a separate patch. Basically, we can commit the basic tests with the main patch and then keep the advanced tests separately. I am afraid that there are some tests that don't add much value so we can review them separately. One minor comment for option 'streaming = on', spacing-wise it should be consistent in all the tests. Similarly, we can combine 017_stream_ddl.pl and 021_stream_schema.pl as both contains similar tests. As per the above suggestion, this will be in a separate patch though. If you agree with the above suggestions then kindly make these adjustments and send the updated patch. -- With Regards, Amit Kapila.
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Dilip Kumar
Date:
On Wed, Sep 2, 2020 at 10:55 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Tue, Sep 1, 2020 at 8:33 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > I have fixed all the comments except > .. > > 3. +# Change the local values of the extra columns on the subscriber, > > +# update publisher, and check that subscriber retains the expected > > +# values > > +$node_subscriber->safe_psql('postgres', "UPDATE test_tab SET c = > > 'epoch'::timestamptz + 987654321 * interval '1s'"); > > +$node_publisher->safe_psql('postgres', "UPDATE test_tab SET b = md5(a::text)"); > > + > > +wait_for_caught_up($node_publisher, $appname); > > + > > +$result = > > + $node_subscriber->safe_psql('postgres', "SELECT count(*), > > count(extract(epoch from c) = 987654321), count(d = 999) FROM > > test_tab"); > > +is($result, qq(3334|3334|3334), 'check extra columns contain locally > > changed data'); > > > > Again, how this test is relevant to streaming mode? > > > > I think we can keep this test in one of the newly added tests say in > 015_stream_simple.pl to ensure that after streaming transaction, the > non-streaming one behaves expectedly. So we can change the comment as > "Change the local values of the extra columns on the subscriber, > update publisher, and check that subscriber retains the expected > values. This is to ensure that non-streaming transactions behave > properly after a streaming transaction." > > We can remove this test from the other two places > 016_stream_subxact.pl and 020_stream_binary.pl. > > > 4. Apart from the above, I think we should think of minimizing the > > test cases which can be committed with the base patch. We can later > > add more tests. > > > > We can combine the tests in 015_stream_simple.pl and > 020_stream_binary.pl as I can't see a good reason to keep them > separate. Then, I think we can keep only this part with the main patch > and extract other tests into a separate patch. Basically, we can > commit the basic tests with the main patch and then keep the advanced > tests separately. I am afraid that there are some tests that don't add > much value so we can review them separately. Fixed > One minor comment for option 'streaming = on', spacing-wise it should > be consistent in all the tests. > > Similarly, we can combine 017_stream_ddl.pl and 021_stream_schema.pl > as both contains similar tests. As per the above suggestion, this will > be in a separate patch though. > > If you agree with the above suggestions then kindly make these > adjustments and send the updated patch. Done that way. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Attachment
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Dilip Kumar
Date:
On Tue, Sep 1, 2020 at 8:33 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Tue, Sep 1, 2020 at 9:28 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > On Mon, Aug 31, 2020 at 7:28 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > > > On Mon, Aug 31, 2020 at 1:24 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > In functions cleanup_rel_sync_cache and > > get_schema_sent_in_streamed_txn, lets cast the result of lfirst_int to > > uint32 as suggested by Tom [1]. Also, lets keep the way we compare > > xids consistent in both functions, i.e, if (xid == lfirst_int(lc)). > > > > Fixed this in the attached patch. > > > The behavior tested by the test case added for this is not clear > > primarily because of comments. > > > > +++ b/src/test/subscription/t/021_stream_schema.pl > > @@ -0,0 +1,80 @@ > > +# Test behavior with streaming transaction exceeding logical_decoding_work_mem > > ... > > +# large (streamed) transaction with DDL, DML and ROLLBACKs > > +$node_publisher->safe_psql('postgres', q{ > > +BEGIN; > > +ALTER TABLE test_tab ADD COLUMN c INT; > > +INSERT INTO test_tab SELECT i, md5(i::text), i FROM > > generate_series(3,3000) s(i); > > +ALTER TABLE test_tab ADD COLUMN d INT; > > +COMMIT; > > +}); > > + > > +# large (streamed) transaction with DDL, DML and ROLLBACKs > > +$node_publisher->safe_psql('postgres', q{ > > +BEGIN; > > +INSERT INTO test_tab SELECT i, md5(i::text), i, i FROM > > generate_series(3001,3005) s(i); > > +COMMIT; > > +}); > > +wait_for_caught_up($node_publisher, $appname); > > > > I understand that how this test will test the functionality related to > > schema_sent stuff but neither the comments atop of file nor atop the > > test case explains it clearly. > > > > Added comments for this test. > > > > > Few more comments: > > > > > > > > > 2. > > > > 009_stream_simple.pl > > > > +# Insert, update and delete enough rows to exceed the 64kB limit. > > > > +$node_publisher->safe_psql('postgres', q{ > > > > +BEGIN; > > > > +INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i); > > > > +UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0; > > > > +DELETE FROM test_tab WHERE mod(a,3) = 0; > > > > +COMMIT; > > > > +}); > > > > > > > > How much above this data is 64kB limit? I just wanted to see that it > > > > should not be on borderline and then due to some alignment issues the > > > > streaming doesn't happen on some machines? Also, how such a test > > > > ensures that the streaming has happened because the way we are > > > > checking results, won't it be the same for the non-streaming case as > > > > well? > > > > > > Only for this case, or you mean for all the tests? > > > > > > > I have not done this yet. Most of the test cases are generating above 100kb and a few are around 72kb, Please find the test case wise data size. 015 - 200kb 016 - 150kb 017 - 72kb 018 - 72kb before first rollback to sb and total ~100kb 019 - 76kb before first rollback to sb and total ~100kb 020 - 150kb 021 - 100kb > > It is better to do it for all tests and I have clarified this in my > > next email sent yesterday [2] where I have raised a few more comments > > as well. I hope you have not missed that email. I saw that I think I replied to this before seeing that. > > > > 3. 
> > > > +# Change the local values of the extra columns on the subscriber, > > > > +# update publisher, and check that subscriber retains the expected > > > > +# values > > > > +$node_subscriber->safe_psql('postgres', "UPDATE test_tab SET c = > > > > 'epoch'::timestamptz + 987654321 * interval '1s'"); > > > > +$node_publisher->safe_psql('postgres', "UPDATE test_tab SET b = md5(a::text)"); > > > > + > > > > +wait_for_caught_up($node_publisher, $appname); > > > > + > > > > +$result = > > > > + $node_subscriber->safe_psql('postgres', "SELECT count(*), > > > > count(extract(epoch from c) = 987654321), count(d = 999) FROM > > > > test_tab"); > > > > +is($result, qq(3334|3334|3334), 'check extra columns contain locally > > > > changed data'); > > > > > > > > Again, how this test is relevant to streaming mode? > > > > > > I agree, it is not specific to the streaming. > > > > > I think we can leave this as of now. After committing the stats > patches by Sawada-San and Ajin, we might be able to improve this test. Make sense to me. > > +sub wait_for_caught_up > > +{ > > + my ($node, $appname) = @_; > > + > > + $node->poll_query_until('postgres', > > +"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication > > WHERE application_name = '$appname';" > > + ) or die "Timed ou > > > > The patch has added this in all the test files if it is used in so > > many tests then we need to add this in some generic place > > (PostgresNode.pm) but actually, I am not sure if need this at all. Why > > can't the existing wait_for_catchup in PostgresNode.pm serve the same > > purpose. > > > > Changed as per this suggestion. Okay. > > 2. > > In system_views.sql, > > > > -- All columns of pg_subscription except subconninfo are readable. > > REVOKE ALL ON pg_subscription FROM public; > > GRANT SELECT (subdbid, subname, subowner, subenabled, subbinary, > > subslotname, subpublications) > > ON pg_subscription TO public; > > > > Here, we need to update for substream column as well. > > > > Fixed. LGTM > > 3. Update describeSubscriptions() to show the 'substream' value in \dRs. > > > > 4. Also, lets add few tests in subscription.sql as we have added > > 'binary' option in commit 9de77b5453. > > > > Fixed both the above comments. Ok > > 5. I think we can merge pg_dump related changes (the last version > > posted in mail thread is v53-0005-Add-streaming-option-in-pg_dump) in > > the main patch, one minor comment on pg_dump related changes > > @@ -4358,6 +4369,8 @@ dumpSubscription(Archive *fout, SubscriptionInfo *subinfo) > > if (strcmp(subinfo->subbinary, "t") == 0) > > appendPQExpBuffer(query, ", binary = true"); > > > > + if (strcmp(subinfo->substream, "f") != 0) > > + appendPQExpBuffer(query, ", streaming = on"); > > if (strcmp(subinfo->subsynccommit, "off") != 0) > > appendPQExpBuffer(query, ", synchronous_commit = %s", > > fmtId(subinfo->subsynccommit)); > > > > Keep one line space between substream and subsynccommit option code to > > keep it consistent with nearby code. > > > > Changed as per this suggestion. Ok > I have fixed all the comments except the below comments. > 1. verify the size of various tests to ensure that it is above > logical_decoding_work_mem. > 2. I have checked that in one of the previous patches, we have a test > v53-0004-Add-TAP-test-for-streaming-vs.-DDL which contains a test case > quite similar to what we have in > v55-0002-Add-support-for-streaming-to-built-in-logical-re/013_stream_subxact_ddl_abort. 
> If there is any difference that can cover more scenarios then can we > consider merging them into one test? > 3. +# Change the local values of the extra columns on the subscriber, > +# update publisher, and check that subscriber retains the expected > +# values > +$node_subscriber->safe_psql('postgres', "UPDATE test_tab SET c = > 'epoch'::timestamptz + 987654321 * interval '1s'"); > +$node_publisher->safe_psql('postgres', "UPDATE test_tab SET b = md5(a::text)"); > + > +wait_for_caught_up($node_publisher, $appname); > + > +$result = > + $node_subscriber->safe_psql('postgres', "SELECT count(*), > count(extract(epoch from c) = 987654321), count(d = 999) FROM > test_tab"); > +is($result, qq(3334|3334|3334), 'check extra columns contain locally > changed data'); > > Again, how this test is relevant to streaming mode? > 4. Apart from the above, I think we should think of minimizing the > test cases which can be committed with the base patch. We can later > add more tests. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Amit Kapila
Date:
On Wed, Sep 2, 2020 at 3:41 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Wed, Sep 2, 2020 at 10:55 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > > > We can combine the tests in 015_stream_simple.pl and > > 020_stream_binary.pl as I can't see a good reason to keep them > > separate. Then, I think we can keep only this part with the main patch > > and extract other tests into a separate patch. Basically, we can > > commit the basic tests with the main patch and then keep the advanced > > tests separately. I am afraid that there are some tests that don't add > > much value so we can review them separately. > > Fixed > I have slightly adjusted this test and ran pgindent on the patch. I am planning to push this tomorrow unless you have more comments. -- With Regards, Amit Kapila.
Attachment
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Dilip Kumar
Date:
On Wed, Sep 2, 2020 at 7:19 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
On Wed, Sep 2, 2020 at 3:41 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Wed, Sep 2, 2020 at 10:55 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > >
> >
> > We can combine the tests in 015_stream_simple.pl and
> > 020_stream_binary.pl as I can't see a good reason to keep them
> > separate. Then, I think we can keep only this part with the main patch
> > and extract other tests into a separate patch. Basically, we can
> > commit the basic tests with the main patch and then keep the advanced
> > tests separately. I am afraid that there are some tests that don't add
> > much value so we can review them separately.
>
> Fixed
>
I have slightly adjusted this test and ran pgindent on the patch. I am
planning to push this tomorrow unless you have more comments.
Looks good to me.
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
"Bossart, Nathan"
Date:
I noticed a small compiler warning for this.

diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 812aca8011..88d3444c39 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -199,7 +199,7 @@ typedef struct ApplySubXactData
 static ApplySubXactData subxact_data = {0, 0, InvalidTransactionId, NULL};
 
 static void subxact_filename(char *path, Oid subid, TransactionId xid);
-static void changes_filename(char *path, Oid subid, TransactionId xid);
+static inline void changes_filename(char *path, Oid subid, TransactionId xid);
 
 /*
  * Information about subtransactions of a given toplevel transaction.

Nathan
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Amit Kapila
Date:
On Fri, Sep 4, 2020 at 3:10 AM Bossart, Nathan <bossartn@amazon.com> wrote: > > I noticed a small compiler warning for this. > > diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c > index 812aca8011..88d3444c39 100644 > --- a/src/backend/replication/logical/worker.c > +++ b/src/backend/replication/logical/worker.c > @@ -199,7 +199,7 @@ typedef struct ApplySubXactData > static ApplySubXactData subxact_data = {0, 0, InvalidTransactionId, NULL}; > > static void subxact_filename(char *path, Oid subid, TransactionId xid); > -static void changes_filename(char *path, Oid subid, TransactionId xid); > +static inline void changes_filename(char *path, Oid subid, TransactionId xid); > Thanks for the report, I'll take care of this. I think the nearby similar function subxact_filename() should also be inline. -- With Regards, Amit Kapila.
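With Amit's follow-up applied, the two forward declarations in worker.c would presumably end up looking like this (an illustrative sketch; the definitive change is whatever lands in the tree):

    static inline void subxact_filename(char *path, Oid subid, TransactionId xid);
    static inline void changes_filename(char *path, Oid subid, TransactionId xid);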
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Amit Kapila
Date:
On Tue, Sep 1, 2020 at 8:33 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Tue, Sep 1, 2020 at 9:28 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > I have fixed all the comments except the below comments. > 1. verify the size of various tests to ensure that it is above > logical_decoding_work_mem. > 2. I have checked that in one of the previous patches, we have a test > v53-0004-Add-TAP-test-for-streaming-vs.-DDL which contains a test case > quite similar to what we have in > v55-0002-Add-support-for-streaming-to-built-in-logical-re/013_stream_subxact_ddl_abort. > If there is any difference that can cover more scenarios then can we > consider merging them into one test? > I have compared these two tests and found that the only thing additional in the test case present in v53-0004-Add-TAP-test-for-streaming-vs.-DDL was that it was performing few savepoints and DMLs after doing the first rollback to savepoint and I included that in one of the existing tests in 018_stream_subxact_abort.pl. I have added one test for Rollback, changed few messages, removed one test case which was not making any sense in the patch. See attached and let me know what you think about it? -- With Regards, Amit Kapila.
Attachment
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Dilip Kumar
Date:
On Sat, 5 Sep 2020 at 4:02 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
On Tue, Sep 1, 2020 at 8:33 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Tue, Sep 1, 2020 at 9:28 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> I have fixed all the comments except the below comments.
> 1. verify the size of various tests to ensure that it is above
> logical_decoding_work_mem.
> 2. I have checked that in one of the previous patches, we have a test
> v53-0004-Add-TAP-test-for-streaming-vs.-DDL which contains a test case
> quite similar to what we have in
> v55-0002-Add-support-for-streaming-to-built-in-logical-re/013_stream_subxact_ddl_abort.
> If there is any difference that can cover more scenarios then can we
> consider merging them into one test?
>
I have compared these two tests and found that the only thing
additional in the test case present in
v53-0004-Add-TAP-test-for-streaming-vs.-DDL was that it was performing
few savepoints and DMLs after doing the first rollback to savepoint
and I included that in one of the existing tests in
018_stream_subxact_abort.pl. I have added one test for Rollback,
changed few messages, removed one test case which was not making any
sense in the patch. See attached and let me know what you think about
it?
I have reviewed the changes and they look fine to me.
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Amit Kapila
Date:
On Sat, Sep 5, 2020 at 8:55 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > I have reviewed the changes and looks fine to me. > Thanks, I have pushed the last patch. Let's wait for a day or so to see the buildfarm reports and then we can probably close this CF entry. I am aware that we have one patch related to stats still pending but I think we can tackle it along with the spill stats patch which is being discussed in a different thread [1]. Do let me know if I have missed anything? [1] - https://www.postgresql.org/message-id/CAA4eK1JBqQh9cBKjO-nKOOE%3D7f6ONDCZp0TJZfn4VsQqRZ%2BuYA%40mail.gmail.com -- With Regards, Amit Kapila.
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Dilip Kumar
Date:
On Mon, Sep 7, 2020 at 12:00 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
On Sat, Sep 5, 2020 at 8:55 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
>
> I have reviewed the changes and looks fine to me.
>
Thanks, I have pushed the last patch. Let's wait for a day or so to
see the buildfarm reports and then we can probably close this CF
entry.
Thanks.
I am aware that we have one patch related to stats still
pending but I think we can tackle it along with the spill stats patch
which is being discussed in a different thread [1]. Do let me know if
I have missed anything?
[1] - https://www.postgresql.org/message-id/CAA4eK1JBqQh9cBKjO-nKOOE%3D7f6ONDCZp0TJZfn4VsQqRZ%2BuYA%40mail.gmail.com
Sounds good to me.
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Amit Kapila
Date:
On Mon, Sep 7, 2020 at 12:57 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Mon, Sep 7, 2020 at 12:00 PM Amit Kapila <amit.kapila16@gmail.com> wrote: >> >> On Sat, Sep 5, 2020 at 8:55 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: >> > >> > >> > I have reviewed the changes and looks fine to me. >> > >> >> Thanks, I have pushed the last patch. Let's wait for a day or so to >> see the buildfarm reports and then we can probably close this CF >> entry. > > > Thanks. > I have updated the status of CF entry as committed now. -- With Regards, Amit Kapila.
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Tomas Vondra
Date:
Hi, while looking at the streaming code I noticed two minor issues: 1) logicalrep_read_stream_stop is never defined/called, so the prototype in logicalproto.h is unnecessary 2) minor typo in one of the comments Patch attached. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachment
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Dilip Kumar
Date:
On Wed, Sep 9, 2020 at 2:13 PM Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
Hi,
while looking at the streaming code I noticed two minor issues:
1) logicalrep_read_stream_stop is never defined/called, so the prototype
in logicalproto.h is unnecessary
Yeah, right.
2) minor typo in one of the comments
Patch attached.
Looks good to me.
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Amit Kapila
Date:
On Wed, Sep 9, 2020 at 2:13 PM Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote: > > Hi, > > while looking at the streaming code I noticed two minor issues: > > 1) logicalrep_read_stream_stop is never defined/called, so the prototype > in logicalproto.h is unnecessary > > 2) minor typo in one of the comments > > Patch attached. > LGTM. -- With Regards, Amit Kapila.
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Amit Kapila
Date:
On Wed, Sep 9, 2020 at 2:26 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Wed, Sep 9, 2020 at 2:13 PM Tomas Vondra > <tomas.vondra@2ndquadrant.com> wrote: > > > > Hi, > > > > while looking at the streaming code I noticed two minor issues: > > > > 1) logicalrep_read_stream_stop is never defined/called, so the prototype > > in logicalproto.h is unnecessary > > > > 2) minor typo in one of the comments > > > > Patch attached. > > > > LGTM. > Pushed. -- With Regards, Amit Kapila.
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Tom Lane
Date:
Amit Kapila <amit.kapila16@gmail.com> writes: > Pushed. Observe the following reports: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=idiacanthus&dt=2020-09-13%2016%3A54%3A03 https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=desmoxytes&dt=2020-09-10%2009%3A08%3A03 https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=komodoensis&dt=2020-09-05%2020%3A22%3A02 https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=dragonet&dt=2020-09-04%2001%3A52%3A03 https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=dragonet&dt=2020-09-03%2020%3A54%3A04 These are all on HEAD, and all within the last ten days, and I see nothing comparable in any branch before that. So it's hard to avoid the conclusion that somebody broke something about ten days ago. None of these animals provided gdb backtraces; but we do have a built-in trace from several, and they all look like pgoutput.so is trying to list_free() garbage, somewhere inside a relcache invalidation/rebuild scenario: TRAP: FailedAssertion("list->length > 0", File: "/home/bf/build/buildfarm-idiacanthus/HEAD/pgsql.build/../pgsql/src/backend/nodes/list.c",Line: 68) postgres: publisher: walsender bf [local] idle(ExceptionalCondition+0x57)[0x9081f7] postgres: publisher: walsender bf [local] idle[0x6bcc70] postgres: publisher: walsender bf [local] idle(list_free+0x11)[0x6bdc01] /home/bf/build/buildfarm-idiacanthus/HEAD/pgsql.build/tmp_install/home/bf/build/buildfarm-idiacanthus/HEAD/inst/lib/postgresql/pgoutput.so(+0x35d8)[0x7fa4c5a6f5d8] postgres: publisher: walsender bf [local] idle(LocalExecuteInvalidationMessage+0x15b)[0x8f0cdb] postgres: publisher: walsender bf [local] idle(ReceiveSharedInvalidMessages+0x4b)[0x7bca0b] postgres: publisher: walsender bf [local] idle(LockRelationOid+0x56)[0x7c19e6] postgres: publisher: walsender bf [local] idle(relation_open+0x1c)[0x4a2d0c] postgres: publisher: walsender bf [local] idle(table_open+0x6)[0x524486] postgres: publisher: walsender bf [local] idle[0x9017f2] postgres: publisher: walsender bf [local] idle[0x8fabd4] postgres: publisher: walsender bf [local] idle[0x8fa58a] postgres: publisher: walsender bf [local] idle(RelationCacheInvalidateEntry+0xaf)[0x8fbdbf] postgres: publisher: walsender bf [local] idle(LocalExecuteInvalidationMessage+0xec)[0x8f0c6c] postgres: publisher: walsender bf [local] idle(ReceiveSharedInvalidMessages+0xcb)[0x7bca8b] postgres: publisher: walsender bf [local] idle(LockRelationOid+0x56)[0x7c19e6] postgres: publisher: walsender bf [local] idle(relation_open+0x1c)[0x4a2d0c] postgres: publisher: walsender bf [local] idle(table_open+0x6)[0x524486] postgres: publisher: walsender bf [local] idle[0x8ee8b0] 010_truncate.pl itself hasn't changed meaningfully in a good long time. However, I see that 464824323 added a whole boatload of code to pgoutput.c, and the timing is right for that commit to be the culprit, so that's what I'm betting on. Probably this requires a relcache inval at the wrong time; although we have recent passes from CLOBBER_CACHE_ALWAYS animals, so that can't be the whole triggering condition. I wonder whether it is relevant that all of the complaining animals are JIT-enabled. regards, tom lane
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Tom Lane
Date:
I wrote: > Probably this requires a relcache inval at the wrong time; > although we have recent passes from CLOBBER_CACHE_ALWAYS animals, > so that can't be the whole triggering condition. I wonder whether > it is relevant that all of the complaining animals are JIT-enabled. Hmmm ... I take that back. hyrax has indeed passed since this went in, but *it doesn't run any TAP tests*. So the buildfarm offers no information about whether the replication tests work under CLOBBER_CACHE_ALWAYS. Realizing that, I built an installation that way and tried to run the subscription tests. Results so far: * Running 010_truncate.pl by itself passed for me. So there's still some unexplained factor needed to trigger the buildfarm failures. (I'm wondering about concurrent autovacuum activity now...) * Starting over, it appears that 001_rep_changes.pl almost immediately gets into an infinite loop. It does not complete the third test step, rather infinitely waiting for progress to be made. The publisher log shows a repeating loop like 2020-09-13 21:16:05.734 EDT [928529] tap_sub LOG: could not send data to client: Broken pipe 2020-09-13 21:16:05.734 EDT [928529] tap_sub CONTEXT: slot "tap_sub", output plugin "pgoutput", in the commit callback,associated LSN 0/1660628 2020-09-13 21:16:05.843 EDT [928581] 001_rep_changes.pl LOG: statement: SELECT pg_current_wal_lsn() <= replay_lsn AND state= 'streaming' FROM pg_catalog.pg_stat_replication WHERE application_name = 'tap_sub'; 2020-09-13 21:16:05.861 EDT [928582] tap_sub LOG: statement: SELECT pg_catalog.set_config('search_path', '', false); 2020-09-13 21:16:05.929 EDT [928582] tap_sub LOG: received replication command: IDENTIFY_SYSTEM 2020-09-13 21:16:05.930 EDT [928582] tap_sub LOG: received replication command: START_REPLICATION SLOT "tap_sub" LOGICAL0/1652820 (proto_version '2', publication_names '"tap_pub","tap_pub_ins_only"') 2020-09-13 21:16:05.930 EDT [928582] tap_sub LOG: starting logical decoding for slot "tap_sub" 2020-09-13 21:16:05.930 EDT [928582] tap_sub DETAIL: Streaming transactions committing after 0/1652820, reading WAL from0/1651B20. 2020-09-13 21:16:05.930 EDT [928582] tap_sub LOG: logical decoding found consistent point at 0/1651B20 2020-09-13 21:16:05.930 EDT [928582] tap_sub DETAIL: There are no running transactions. 
2020-09-13 21:16:21.560 EDT [928600] 001_rep_changes.pl LOG: statement: SELECT pg_current_wal_lsn() <= replay_lsn AND state= 'streaming' FROM pg_catalog.pg_stat_replication WHERE application_name = 'tap_sub'; 2020-09-13 21:16:37.291 EDT [928610] 001_rep_changes.pl LOG: statement: SELECT pg_current_wal_lsn() <= replay_lsn AND state= 'streaming' FROM pg_catalog.pg_stat_replication WHERE application_name = 'tap_sub'; 2020-09-13 21:16:52.959 EDT [928627] 001_rep_changes.pl LOG: statement: SELECT pg_current_wal_lsn() <= replay_lsn AND state= 'streaming' FROM pg_catalog.pg_stat_replication WHERE application_name = 'tap_sub'; 2020-09-13 21:17:06.866 EDT [928636] tap_sub LOG: statement: SELECT pg_catalog.set_config('search_path', '', false); 2020-09-13 21:17:06.934 EDT [928636] tap_sub LOG: received replication command: IDENTIFY_SYSTEM 2020-09-13 21:17:06.934 EDT [928636] tap_sub LOG: received replication command: START_REPLICATION SLOT "tap_sub" LOGICAL0/1652820 (proto_version '2', publication_names '"tap_pub","tap_pub_ins_only"') 2020-09-13 21:17:06.934 EDT [928636] tap_sub ERROR: replication slot "tap_sub" is active for PID 928582 2020-09-13 21:17:07.811 EDT [928638] tap_sub LOG: statement: SELECT pg_catalog.set_config('search_path', '', false); 2020-09-13 21:17:07.880 EDT [928638] tap_sub LOG: received replication command: IDENTIFY_SYSTEM 2020-09-13 21:17:07.881 EDT [928638] tap_sub LOG: received replication command: START_REPLICATION SLOT "tap_sub" LOGICAL0/1652820 (proto_version '2', publication_names '"tap_pub","tap_pub_ins_only"') 2020-09-13 21:17:07.881 EDT [928638] tap_sub ERROR: replication slot "tap_sub" is active for PID 928582 2020-09-13 21:17:08.618 EDT [928641] 001_rep_changes.pl LOG: statement: SELECT pg_current_wal_lsn() <= replay_lsn AND state= 'streaming' FROM pg_catalog.pg_stat_replication WHERE application_name = 'tap_sub'; 2020-09-13 21:17:08.753 EDT [928642] tap_sub LOG: statement: SELECT pg_catalog.set_config('search_path', '', false); 2020-09-13 21:17:08.821 EDT [928642] tap_sub LOG: received replication command: IDENTIFY_SYSTEM 2020-09-13 21:17:08.821 EDT [928642] tap_sub LOG: received replication command: START_REPLICATION SLOT "tap_sub" LOGICAL0/1652820 (proto_version '2', publication_names '"tap_pub","tap_pub_ins_only"') 2020-09-13 21:17:08.821 EDT [928642] tap_sub ERROR: replication slot "tap_sub" is active for PID 928582 2020-09-13 21:17:09.689 EDT [928645] tap_sub LOG: statement: SELECT pg_catalog.set_config('search_path', '', false); 2020-09-13 21:17:09.756 EDT [928645] tap_sub LOG: received replication command: IDENTIFY_SYSTEM 2020-09-13 21:17:09.757 EDT [928645] tap_sub LOG: received replication command: START_REPLICATION SLOT "tap_sub" LOGICAL0/1652820 (proto_version '2', publication_names '"tap_pub","tap_pub_ins_only"') 2020-09-13 21:17:09.757 EDT [928645] tap_sub ERROR: replication slot "tap_sub" is active for PID 928582 2020-09-13 21:17:09.841 EDT [928582] tap_sub LOG: could not send data to client: Broken pipe 2020-09-13 21:17:09.841 EDT [928582] tap_sub CONTEXT: slot "tap_sub", output plugin "pgoutput", in the commit callback,associated LSN 0/1660628 while the subscriber is repeating 2020-09-13 21:15:01.598 EDT [928528] LOG: logical replication apply worker for subscription "tap_sub" has started 2020-09-13 21:16:02.178 EDT [928528] ERROR: terminating logical replication worker due to timeout 2020-09-13 21:16:02.179 EDT [920797] LOG: background worker "logical replication worker" (PID 928528) exited with exit code1 2020-09-13 21:16:02.606 EDT [928571] LOG: 
logical replication apply worker for subscription "tap_sub" has started 2020-09-13 21:16:03.117 EDT [928571] ERROR: could not start WAL streaming: ERROR: replication slot "tap_sub" is activefor PID 928529 2020-09-13 21:16:03.118 EDT [920797] LOG: background worker "logical replication worker" (PID 928571) exited with exit code1 2020-09-13 21:16:03.544 EDT [928574] LOG: logical replication apply worker for subscription "tap_sub" has started 2020-09-13 21:16:04.053 EDT [928574] ERROR: could not start WAL streaming: ERROR: replication slot "tap_sub" is activefor PID 928529 2020-09-13 21:16:04.054 EDT [920797] LOG: background worker "logical replication worker" (PID 928574) exited with exit code1 2020-09-13 21:16:04.479 EDT [928576] LOG: logical replication apply worker for subscription "tap_sub" has started 2020-09-13 21:16:04.990 EDT [928576] ERROR: could not start WAL streaming: ERROR: replication slot "tap_sub" is activefor PID 928529 2020-09-13 21:16:04.990 EDT [920797] LOG: background worker "logical replication worker" (PID 928576) exited with exit code1 2020-09-13 21:16:05.415 EDT [928579] LOG: logical replication apply worker for subscription "tap_sub" has started 2020-09-13 21:17:05.994 EDT [928579] ERROR: terminating logical replication worker due to timeout I'm out of patience to investigate this for tonight, but there is something extremely broken here; maybe more than one something. regards, tom lane
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Amit Kapila
Date:
On Mon, Sep 14, 2020 at 3:08 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: > > Amit Kapila <amit.kapila16@gmail.com> writes: > > Pushed. > > Observe the following reports: > > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=idiacanthus&dt=2020-09-13%2016%3A54%3A03 > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=desmoxytes&dt=2020-09-10%2009%3A08%3A03 > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=komodoensis&dt=2020-09-05%2020%3A22%3A02 > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=dragonet&dt=2020-09-04%2001%3A52%3A03 > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=dragonet&dt=2020-09-03%2020%3A54%3A04 > > These are all on HEAD, and all within the last ten days, and I see > nothing comparable in any branch before that. So it's hard to avoid > the conclusion that somebody broke something about ten days ago. > I'll analyze these reports. -- With Regards, Amit Kapila.
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Tom Lane
Date:
I wrote: > * Starting over, it appears that 001_rep_changes.pl almost immediately > gets into an infinite loop. It does not complete the third test step, > rather infinitely waiting for progress to be made. Ah, looking closer, the problem is that wal_receiver_timeout = 60s is too short when the sender is using CCA. It times out before we can get through the needed data transmission. regards, tom lane
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Amit Kapila
Date:
On Mon, Sep 14, 2020 at 3:08 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: > > Amit Kapila <amit.kapila16@gmail.com> writes: > > Pushed. > > Observe the following reports: > > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=idiacanthus&dt=2020-09-13%2016%3A54%3A03 > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=desmoxytes&dt=2020-09-10%2009%3A08%3A03 > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=komodoensis&dt=2020-09-05%2020%3A22%3A02 > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=dragonet&dt=2020-09-04%2001%3A52%3A03 > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=dragonet&dt=2020-09-03%2020%3A54%3A04 > > These are all on HEAD, and all within the last ten days, and I see > nothing comparable in any branch before that. So it's hard to avoid > the conclusion that somebody broke something about ten days ago. > > None of these animals provided gdb backtraces; but we do have a built-in > trace from several, and they all look like pgoutput.so is trying to > list_free() garbage, somewhere inside a relcache invalidation/rebuild > scenario: > Yeah, this is right, and here is some initial analysis. It seems to be failing in below code: rel_sync_cache_relation_cb(){ ...list_free(entry->streamed_txns);..} This list can have elements only in 'streaming' mode (need to enable 'streaming' with Create Subscription command) whereas none of the tests in 010_truncate.pl is using 'streaming', so this list should be empty (NULL). The two different assertion failures shown in BF reports in list_free code are as below: Assert(list->length > 0); Assert(list->length <= list->max_length); It seems to me that this list is not initialized properly when it is not used or maybe that is true in some special circumstances because we initialize it in get_rel_sync_entry(). I am not sure if CCI build is impacting this in some way. -- With Regards, Amit Kapila.
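For context, the callback in which both assertions fire has roughly this shape (a simplified sketch based on the description in this thread, not the patch verbatim); the crash scenario is that it runs against a RelationSyncEntry whose streamed_txns field has not been initialized yet:

    static void
    rel_sync_cache_relation_cb(Datum arg, Oid relid)
    {
        RelationSyncEntry *entry;

        /* The cache may not exist yet (or anymore). */
        if (RelationSyncCache == NULL)
            return;

        entry = (RelationSyncEntry *) hash_search(RelationSyncCache,
                                                  (void *) &relid,
                                                  HASH_FIND, NULL);

        if (entry != NULL)
        {
            /* Force the schema to be resent and forget any streamed xacts. */
            entry->schema_sent = false;
            list_free(entry->streamed_txns);    /* asserts if this field holds garbage */
            entry->streamed_txns = NIL;
        }
    }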
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Dilip Kumar
Date:
On Mon, Sep 14, 2020 at 8:48 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Mon, Sep 14, 2020 at 3:08 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: > > > > Amit Kapila <amit.kapila16@gmail.com> writes: > > > Pushed. > > > > Observe the following reports: > > > > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=idiacanthus&dt=2020-09-13%2016%3A54%3A03 > > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=desmoxytes&dt=2020-09-10%2009%3A08%3A03 > > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=komodoensis&dt=2020-09-05%2020%3A22%3A02 > > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=dragonet&dt=2020-09-04%2001%3A52%3A03 > > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=dragonet&dt=2020-09-03%2020%3A54%3A04 > > > > These are all on HEAD, and all within the last ten days, and I see > > nothing comparable in any branch before that. So it's hard to avoid > > the conclusion that somebody broke something about ten days ago. > > > > None of these animals provided gdb backtraces; but we do have a built-in > > trace from several, and they all look like pgoutput.so is trying to > > list_free() garbage, somewhere inside a relcache invalidation/rebuild > > scenario: > > > > Yeah, this is right, and here is some initial analysis. It seems to be > failing in below code: > rel_sync_cache_relation_cb(){ ...list_free(entry->streamed_txns);..} > > This list can have elements only in 'streaming' mode (need to enable > 'streaming' with Create Subscription command) whereas none of the > tests in 010_truncate.pl is using 'streaming', so this list should be > empty (NULL). The two different assertion failures shown in BF reports > in list_free code are as below: > Assert(list->length > 0); > Assert(list->length <= list->max_length); > > It seems to me that this list is not initialized properly when it is > not used or maybe that is true in some special circumstances because > we initialize it in get_rel_sync_entry(). I am not sure if CCI build > is impacting this in some way. Even I have analyzed this but did not find any reason why the streamed_txns list should be anything other than NULL. The only thing is we are initializing the entry->streamed_txns to NULL and the list free is checking "if (list == NIL)" then return. However IMHO, that should not be an issue becase NIL is defined as (List*) NULL. I am doing further testing and investigation. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Amit Kapila
Date:
On Mon, Sep 14, 2020 at 1:23 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Mon, Sep 14, 2020 at 8:48 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > Yeah, this is right, and here is some initial analysis. It seems to be > > failing in below code: > > rel_sync_cache_relation_cb(){ ...list_free(entry->streamed_txns);..} > > > > This list can have elements only in 'streaming' mode (need to enable > > 'streaming' with Create Subscription command) whereas none of the > > tests in 010_truncate.pl is using 'streaming', so this list should be > > empty (NULL). The two different assertion failures shown in BF reports > > in list_free code are as below: > > Assert(list->length > 0); > > Assert(list->length <= list->max_length); > > > > It seems to me that this list is not initialized properly when it is > > not used or maybe that is true in some special circumstances because > > we initialize it in get_rel_sync_entry(). I am not sure if CCI build > > is impacting this in some way. > > > Even I have analyzed this but did not find any reason why the > streamed_txns list should be anything other than NULL. The only thing > is we are initializing the entry->streamed_txns to NULL and the list > free is checking "if (list == NIL)" then return. However IMHO, that > should not be an issue becase NIL is defined as (List*) NULL. > Yeah, that is not the issue but it is better to initialize it with NIL for the sake of consistency. The basic issue here was we were trying to open/lock the relation(s) before initializing this list. Now, when we process the invalidations during open relation, we try to access this list in rel_sync_cache_relation_cb and that leads to assertion failure. I have reproduced the exact scenario of 010_truncate.pl via debugger. Basically, the backend on publisher has sent the invalidation after truncating the relation 'tab1' and while processing the truncate message if WALSender receives that message exactly after creating the RelSyncEntry for 'tab1', the Assertion shown in BF can be reproduced. The attached patch will fix the issue. What do you think? -- With Regards, Amit Kapila.
Attachment
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Dilip Kumar
Date:
On Mon, Sep 14, 2020 at 4:50 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Mon, Sep 14, 2020 at 1:23 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > On Mon, Sep 14, 2020 at 8:48 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > > > Yeah, this is right, and here is some initial analysis. It seems to be > > > failing in below code: > > > rel_sync_cache_relation_cb(){ ...list_free(entry->streamed_txns);..} > > > > > > This list can have elements only in 'streaming' mode (need to enable > > > 'streaming' with Create Subscription command) whereas none of the > > > tests in 010_truncate.pl is using 'streaming', so this list should be > > > empty (NULL). The two different assertion failures shown in BF reports > > > in list_free code are as below: > > > Assert(list->length > 0); > > > Assert(list->length <= list->max_length); > > > > > > It seems to me that this list is not initialized properly when it is > > > not used or maybe that is true in some special circumstances because > > > we initialize it in get_rel_sync_entry(). I am not sure if CCI build > > > is impacting this in some way. > > > > > > Even I have analyzed this but did not find any reason why the > > streamed_txns list should be anything other than NULL. The only thing > > is we are initializing the entry->streamed_txns to NULL and the list > > free is checking "if (list == NIL)" then return. However IMHO, that > > should not be an issue becase NIL is defined as (List*) NULL. > > > > Yeah, that is not the issue but it is better to initialize it with NIL > for the sake of consistency. The basic issue here was we were trying > to open/lock the relation(s) before initializing this list. Now, when > we process the invalidations during open relation, we try to access > this list in rel_sync_cache_relation_cb and that leads to assertion > failure. I have reproduced the exact scenario of 010_truncate.pl via > debugger. Basically, the backend on publisher has sent the > invalidation after truncating the relation 'tab1' and while processing > the truncate message if WALSender receives that message exactly after > creating the RelSyncEntry for 'tab1', the Assertion shown in BF can be > reproduced. Yeah, this is an issue and I am also able to reproduce this manually using gdb. Basically, I have inserted some data in publication table and after that, I stopped in get_rel_sync_entry after creating the reentry and before calling GetRelationPublications. Meanwhile, I have truncated this table and then it hit the same issue you pointed here. > The attached patch will fix the issue. What do you think? The patch looks good to me and fixing the reported issue. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Tom Lane
Date:
Amit Kapila <amit.kapila16@gmail.com> writes:
> The attached patch will fix the issue. What do you think?

I think it'd be cleaner to separate the initialization of a new entry from validation altogether, along the lines of

    /* Find cached function info, creating if not found */
    oldctx = MemoryContextSwitchTo(CacheMemoryContext);
    entry = (RelationSyncEntry *) hash_search(RelationSyncCache,
                                              (void *) &relid,
                                              HASH_ENTER, &found);
    MemoryContextSwitchTo(oldctx);
    Assert(entry != NULL);

    if (!found)
    {
        /* immediately make a new entry valid enough to satisfy callbacks */
        entry->schema_sent = false;
        entry->streamed_txns = NIL;
        entry->replicate_valid = false;
        /* are there any other fields we should clear here for safety??? */
    }

    /* Fill it in if not valid */
    if (!entry->replicate_valid)
    {
        List    *pubids = GetRelationPublications(relid);
        ...

BTW, unless someone has changed the behavior of dynahash when I wasn't looking, those MemoryContextSwitchTos shown above are useless.

Also, why does the comment refer to a "function" entry?

			regards, tom lane
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Amit Kapila
Date:
On Mon, Sep 14, 2020 at 9:42 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > > Amit Kapila <amit.kapila16@gmail.com> writes: > > The attached patch will fix the issue. What do you think? > > I think it'd be cleaner to separate the initialization of a new entry from > validation altogether, along the lines of > > /* Find cached function info, creating if not found */ > oldctx = MemoryContextSwitchTo(CacheMemoryContext); > entry = (RelationSyncEntry *) hash_search(RelationSyncCache, > (void *) &relid, > HASH_ENTER, &found); > MemoryContextSwitchTo(oldctx); > Assert(entry != NULL); > > if (!found) > { > /* immediately make a new entry valid enough to satisfy callbacks */ > entry->schema_sent = false; > entry->streamed_txns = NIL; > entry->replicate_valid = false; > /* are there any other fields we should clear here for safety??? */ > } > If we want to separate validation then we need to initialize other fields like 'pubactions' and 'publish_as_relid' as well. I think it will be better to arrange it the way you are suggesting. So, I will change it along with other fields that required initialization. > /* Fill it in if not valid */ > if (!entry->replicate_valid) > { > List *pubids = GetRelationPublications(relid); > ... > > BTW, unless someone has changed the behavior of dynahash when I > wasn't looking, those MemoryContextSwitchTos shown above are useless. > As far as I can see they are useless in this case but I think they might be required in case the user provides its own allocator function (using HASH_ALLOC). So, we can probably remove those from here? > Also, why does the comment refer to a "function" entry? > It should be "relation" instead. I'll take care of changing this as well. -- With Regards, Amit Kapila.
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Tom Lane
Date:
Amit Kapila <amit.kapila16@gmail.com> writes: > On Mon, Sep 14, 2020 at 9:42 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: >> BTW, unless someone has changed the behavior of dynahash when I >> wasn't looking, those MemoryContextSwitchTos shown above are useless. > As far as I can see they are useless in this case but I think they > might be required in case the user provides its own allocator function > (using HASH_ALLOC). So, we can probably remove those from here? You could imagine writing a HASH_ALLOC allocator whose behavior varies depending on CurrentMemoryContext, but it seems like a pretty foolish/fragile way to do it. In any case I can think of, the hash table lives in one specific context and you really really do not want parts of it spread across other contexts. dynahash.c is not going to look kindly on pieces of what it is managing disappearing from under it. (To be clear, objects that the hash entries contain pointers to are a different question. But the hash entries themselves have to have exactly the same lifespan as the hash table.) regards, tom lane
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Amit Kapila
Date:
On Tue, Sep 15, 2020 at 8:38 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: > > Amit Kapila <amit.kapila16@gmail.com> writes: > > On Mon, Sep 14, 2020 at 9:42 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > >> BTW, unless someone has changed the behavior of dynahash when I > >> wasn't looking, those MemoryContextSwitchTos shown above are useless. > > > As far as I can see they are useless in this case but I think they > > might be required in case the user provides its own allocator function > > (using HASH_ALLOC). So, we can probably remove those from here? > > You could imagine writing a HASH_ALLOC allocator whose behavior > varies depending on CurrentMemoryContext, but it seems like a > pretty foolish/fragile way to do it. In any case I can think of, > the hash table lives in one specific context and you really > really do not want parts of it spread across other contexts. > dynahash.c is not going to look kindly on pieces of what it > is managing disappearing from under it. > I agree that doesn't make sense. I have fixed all the comments discussed in the attached patch. -- With Regards, Amit Kapila.
Attachment
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Amit Kapila
Date:
On Tue, Sep 15, 2020 at 10:17 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Tue, Sep 15, 2020 at 8:38 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: > > > As far as I can see they are useless in this case but I think they > > > might be required in case the user provides its own allocator function > > > (using HASH_ALLOC). So, we can probably remove those from here? > > > > You could imagine writing a HASH_ALLOC allocator whose behavior > > varies depending on CurrentMemoryContext, but it seems like a > > pretty foolish/fragile way to do it. In any case I can think of, > > the hash table lives in one specific context and you really > > really do not want parts of it spread across other contexts. > > dynahash.c is not going to look kindly on pieces of what it > > is managing disappearing from under it. > > > > I agree that doesn't make sense. I have fixed all the comments > discussed in the attached patch. > Pushed. -- With Regards, Amit Kapila.
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Noah Misch
Date:
On Mon, Sep 07, 2020 at 12:00:41PM +0530, Amit Kapila wrote: > Thanks, I have pushed the last patch. Let's wait for a day or so to > see the buildfarm reports https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=sungazer&dt=2020-09-08%2006%3A24%3A14 failed the new 015_stream.pl test with the subscriber looping like this: 2020-09-08 11:22:49.848 UTC [13959252:1] LOG: logical replication apply worker for subscription "tap_sub" has started 2020-09-08 11:22:54.045 UTC [13959252:2] ERROR: could not open temporary file "16393-510.changes.0" from BufFile "16393-510.changes": No such file or directory 2020-09-08 11:22:54.055 UTC [7602182:1] LOG: logical replication apply worker for subscription "tap_sub" has started 2020-09-08 11:22:54.101 UTC [31785284:4] LOG: background worker "logical replication worker" (PID 13959252) exited with exit code 1 2020-09-08 11:23:01.142 UTC [7602182:2] ERROR: could not open temporary file "16393-510.changes.0" from BufFile "16393-510.changes": No such file or directory ... What happened there?
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Amit Kapila
Date:
On Mon, Nov 30, 2020 at 3:14 AM Noah Misch <noah@leadboat.com> wrote: > > On Mon, Sep 07, 2020 at 12:00:41PM +0530, Amit Kapila wrote: > > Thanks, I have pushed the last patch. Let's wait for a day or so to > > see the buildfarm reports > > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=sungazer&dt=2020-09-08%2006%3A24%3A14 > failed the new 015_stream.pl test with the subscriber looping like this: > I will look into this. -- With Regards, Amit Kapila.
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Amit Kapila
Date:
On Mon, Nov 30, 2020 at 3:14 AM Noah Misch <noah@leadboat.com> wrote: > > On Mon, Sep 07, 2020 at 12:00:41PM +0530, Amit Kapila wrote: > > Thanks, I have pushed the last patch. Let's wait for a day or so to > > see the buildfarm reports > > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=sungazer&dt=2020-09-08%2006%3A24%3A14 > failed the new 015_stream.pl test with the subscriber looping like this: > > 2020-09-08 11:22:49.848 UTC [13959252:1] LOG: logical replication apply worker for subscription "tap_sub" has started > 2020-09-08 11:22:54.045 UTC [13959252:2] ERROR: could not open temporary file "16393-510.changes.0" from BufFile "16393-510.changes": No such file or directory > 2020-09-08 11:22:54.055 UTC [7602182:1] LOG: logical replication apply worker for subscription "tap_sub" has started > 2020-09-08 11:22:54.101 UTC [31785284:4] LOG: background worker "logical replication worker" (PID 13959252) exited with exit code 1 > 2020-09-08 11:23:01.142 UTC [7602182:2] ERROR: could not open temporary file "16393-510.changes.0" from BufFile "16393-510.changes": No such file or directory > ... > > What happened there? > What is going on here is that the expected streaming file is missing. Normally, the first time we send a stream of changes (some percentage of transaction changes) we create the streaming file, and then in respective streams we just keep on writing in that file the changes we receive from the publisher, and on commit, we read that file and apply all the changes. The above kind of error can happen due to the following reasons: (a) the first time we sent the stream and created the file and that got removed before the second stream reached the subscriber. (b) from the publisher-side, we never sent the indication that it is the first stream and the subscriber directly tries to open the file thinking it is already there. Now, the publisher and subscriber log doesn't directly indicate any of the above problems but I have some observations. The subscriber log indicates that before the apply worker exits due to an error the new apply worker gets started. We delete the streaming-related temporary files on proc_exit, so one possibility could have been that the new apply worker has created the streaming file which the old apply worker has removed but that is not possible because we always create these temp-files by having procid in the path. The other thing I observed in the code is that we can mark the transaction as streamed (via ReorderBufferTruncateTxn) if we try to stream a transaction that has no changes the first time we try to stream the transaction. This would lead to symptom (b) because the second-time when there are more changes we would stream the changes as it is not the first time. However, this shouldn't happen because we never pick-up a transaction to stream which has no changes. I can try to fix the code here such that we don't mark the transaction as streamed unless we have streamed at least one change but I don't see how it is related to this particular test failure. I am not sure why this failure has not recurred since it occurred a few months back; it's probably a timing issue. I have fixed a few timing issues related to this feature in the last month or so, but I am not able to come up with a theory for whether any of those would have fixed this problem. -- With Regards, Amit Kapila.
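To make the create-versus-append protocol described above concrete, here is a toy stand-in that uses plain stdio in place of the SharedFileSet/BufFile machinery (the file naming and error handling are simplified; this is not the actual worker code). The point is that only the first stream may create the changes file; every later stream assumes it already exists, which is exactly what produces the "could not open temporary file" error when it does not:

    #include <stdio.h>
    #include <stdlib.h>

    /* Apply-side handling of one stream chunk for transaction "xid". */
    FILE *
    stream_open_changes(unsigned int subid, unsigned int xid, int first_segment)
    {
        char    path[256];
        FILE   *fp;

        snprintf(path, sizeof(path), "%u-%u.changes", subid, xid);

        if (first_segment)
            return fopen(path, "w");    /* first stream: create the file */

        /* later streams: the file must already exist; "r+" fails otherwise */
        fp = fopen(path, "r+");
        if (fp == NULL)
        {
            fprintf(stderr, "could not open temporary file \"%s\"\n", path);
            exit(1);
        }
        fseek(fp, 0, SEEK_END);         /* append the new changes at the end */
        return fp;                      /* at commit, the file is read back and applied */
    }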
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Dilip Kumar
Date:
On Mon, Nov 30, 2020 at 6:49 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Mon, Nov 30, 2020 at 3:14 AM Noah Misch <noah@leadboat.com> wrote: > > > > On Mon, Sep 07, 2020 at 12:00:41PM +0530, Amit Kapila wrote: > > > Thanks, I have pushed the last patch. Let's wait for a day or so to > > > see the buildfarm reports > > > > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=sungazer&dt=2020-09-08%2006%3A24%3A14 > > failed the new 015_stream.pl test with the subscriber looping like this: > > > > 2020-09-08 11:22:49.848 UTC [13959252:1] LOG: logical replication apply worker for subscription "tap_sub" has started > > 2020-09-08 11:22:54.045 UTC [13959252:2] ERROR: could not open temporary file "16393-510.changes.0" from BufFile "16393-510.changes":No such file or directory > > 2020-09-08 11:22:54.055 UTC [7602182:1] LOG: logical replication apply worker for subscription "tap_sub" has started > > 2020-09-08 11:22:54.101 UTC [31785284:4] LOG: background worker "logical replication worker" (PID 13959252) exited withexit code 1 > > 2020-09-08 11:23:01.142 UTC [7602182:2] ERROR: could not open temporary file "16393-510.changes.0" from BufFile "16393-510.changes":No such file or directory > > ... > > > > What happened there? > > > > What is going on here is that the expected streaming file is missing. > Normally, the first time we send a stream of changes (some percentage > of transaction changes) we create the streaming file, and then in > respective streams we just keep on writing in that file the changes we > receive from the publisher, and on commit, we read that file and apply > all the changes. > > The above kind of error can happen due to the following reasons: (a) > the first time we sent the stream and created the file and that got > removed before the second stream reached the subscriber. (b) from the > publisher-side, we never sent the indication that it is the first > stream and the subscriber directly tries to open the file thinking it > is already there. > > Now, the publisher and subscriber log doesn't directly indicate any of > the above problems but I have some observations. > > The subscriber log indicates that before the apply worker exits due to > an error the new apply worker gets started. We delete the > streaming-related temporary files on proc_exit, so one possibility > could have been that the new apply worker has created the streaming > file which the old apply worker has removed but that is not possible > because we always create these temp-files by having procid in the > path. Yeah, and I have tried to test on this line, basically, after the streaming has started I have set the binary=on. Now using gdb I have made the worker wait before it deletes the temp file and meanwhile the new worker started and it worked properly as expected. > The other thing I observed in the code is that we can mark the > transaction as streamed (via ReorderBufferTruncateTxn) if we try to > stream a transaction that has no changes the first time we try to > stream the transaction. This would lead to symptom (b) because the > second-time when there are more changes we would stream the changes as > it is not the first time. However, this shouldn't happen because we > never pick-up a transaction to stream which has no changes. I can try > to fix the code here such that we don't mark the transaction as > streamed unless we have streamed at least one change but I don't see > how it is related to this particular test failure. 
Yeah, this can be improved, but as you mentioned, we never select an empty transaction for streaming, so this case should not occur. I will perform some testing/review around this and report. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Amit Kapila
Date:
On Tue, Dec 1, 2020 at 11:38 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Mon, Nov 30, 2020 at 6:49 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > What is going on here is that the expected streaming file is missing. > > Normally, the first time we send a stream of changes (some percentage > > of transaction changes) we create the streaming file, and then in > > respective streams we just keep on writing in that file the changes we > > receive from the publisher, and on commit, we read that file and apply > > all the changes. > > > > The above kind of error can happen due to the following reasons: (a) > > the first time we sent the stream and created the file and that got > > removed before the second stream reached the subscriber. (b) from the > > publisher-side, we never sent the indication that it is the first > > stream and the subscriber directly tries to open the file thinking it > > is already there. > > > > Now, the publisher and subscriber log doesn't directly indicate any of > > the above problems but I have some observations. > > > > The subscriber log indicates that before the apply worker exits due to > > an error the new apply worker gets started. We delete the > > streaming-related temporary files on proc_exit, so one possibility > > could have been that the new apply worker has created the streaming > > file which the old apply worker has removed but that is not possible > > because we always create these temp-files by having procid in the > > path. > > Yeah, and I have tried to test on this line, basically, after the > streaming has started I have set the binary=on. Now using gdb I have > made the worker wait before it deletes the temp file and meanwhile the > new worker started and it worked properly as expected. > > > The other thing I observed in the code is that we can mark the > > transaction as streamed (via ReorderBufferTruncateTxn) if we try to > > stream a transaction that has no changes the first time we try to > > stream the transaction. This would lead to symptom (b) because the > > second-time when there are more changes we would stream the changes as > > it is not the first time. However, this shouldn't happen because we > > never pick-up a transaction to stream which has no changes. I can try > > to fix the code here such that we don't mark the transaction as > > streamed unless we have streamed at least one change but I don't see > > how it is related to this particular test failure. > > Yeah, this can be improved but as you mentioned that we never select > an empty transaction for streaming so this case should not occur. I > will perform some testing/review around this and report. > On further thinking about this point, I think the message seen on the subscriber [1] won't occur if we missed the first stream. This is because we always check the value of fileset from the stream hash table (xidhash), and it won't be there if we directly send the second stream, which would have led to a different kind of problem (probably a crash). This symptom seems to be due to reason (a) mentioned above, unless we are missing something else. Now, I am not sure how the file can be removed while the corresponding entry in the hash table (xidhash) is still present. The only reasons that come to mind are that some other process cleaned the pgsql_tmp directory thinking these temporary files are not required, or someone removed it manually; neither of those seems like a plausible reason. 
[1] - ERROR: could not open temporary file "16393-510.changes.0" from BufFile "16393-510.changes": No such file or directory -- With Regards, Amit Kapila.
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Dilip Kumar
Date:
On Tue, Dec 1, 2020 at 11:38 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Mon, Nov 30, 2020 at 6:49 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > On Mon, Nov 30, 2020 at 3:14 AM Noah Misch <noah@leadboat.com> wrote: > > > > > > On Mon, Sep 07, 2020 at 12:00:41PM +0530, Amit Kapila wrote: > > > > Thanks, I have pushed the last patch. Let's wait for a day or so to > > > > see the buildfarm reports > > > > > > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=sungazer&dt=2020-09-08%2006%3A24%3A14 > > > failed the new 015_stream.pl test with the subscriber looping like this: > > > > > > 2020-09-08 11:22:49.848 UTC [13959252:1] LOG: logical replication apply worker for subscription "tap_sub" has started > > > 2020-09-08 11:22:54.045 UTC [13959252:2] ERROR: could not open temporary file "16393-510.changes.0" from BufFile "16393-510.changes":No such file or directory > > > 2020-09-08 11:22:54.055 UTC [7602182:1] LOG: logical replication apply worker for subscription "tap_sub" has started > > > 2020-09-08 11:22:54.101 UTC [31785284:4] LOG: background worker "logical replication worker" (PID 13959252) exitedwith exit code 1 > > > 2020-09-08 11:23:01.142 UTC [7602182:2] ERROR: could not open temporary file "16393-510.changes.0" from BufFile "16393-510.changes":No such file or directory > > > ... > > > > > > What happened there? > > > > > > > What is going on here is that the expected streaming file is missing. > > Normally, the first time we send a stream of changes (some percentage > > of transaction changes) we create the streaming file, and then in > > respective streams we just keep on writing in that file the changes we > > receive from the publisher, and on commit, we read that file and apply > > all the changes. > > > > The above kind of error can happen due to the following reasons: (a) > > the first time we sent the stream and created the file and that got > > removed before the second stream reached the subscriber. (b) from the > > publisher-side, we never sent the indication that it is the first > > stream and the subscriber directly tries to open the file thinking it > > is already there. > > > > Now, the publisher and subscriber log doesn't directly indicate any of > > the above problems but I have some observations. > > > > The subscriber log indicates that before the apply worker exits due to > > an error the new apply worker gets started. We delete the > > streaming-related temporary files on proc_exit, so one possibility > > could have been that the new apply worker has created the streaming > > file which the old apply worker has removed but that is not possible > > because we always create these temp-files by having procid in the > > path. > > Yeah, and I have tried to test on this line, basically, after the > streaming has started I have set the binary=on. Now using gdb I have > made the worker wait before it deletes the temp file and meanwhile the > new worker started and it worked properly as expected. > > > The other thing I observed in the code is that we can mark the > > transaction as streamed (via ReorderBufferTruncateTxn) if we try to > > stream a transaction that has no changes the first time we try to > > stream the transaction. This would lead to symptom (b) because the > > second-time when there are more changes we would stream the changes as > > it is not the first time. However, this shouldn't happen because we > > never pick-up a transaction to stream which has no changes. 
I can try > > to fix the code here such that we don't mark the transaction as > > streamed unless we have streamed at least one change but I don't see > > how it is related to this particular test failure. > > Yeah, this can be improved but as you mentioned that we never select > an empty transaction for streaming so this case should not occur. I > will perform some testing/review around this and report. I have executed "make check" in a loop with only this file. I have repeated it 5000 times with no failure. I am wondering whether we should try running it in a loop on the same machine where it failed once? -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Amit Kapila
Date:
On Wed, Dec 2, 2020 at 1:20 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Tue, Dec 1, 2020 at 11:38 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > On Mon, Nov 30, 2020 at 6:49 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > On Mon, Nov 30, 2020 at 3:14 AM Noah Misch <noah@leadboat.com> wrote: > > > > > > > > On Mon, Sep 07, 2020 at 12:00:41PM +0530, Amit Kapila wrote: > > > > > Thanks, I have pushed the last patch. Let's wait for a day or so to > > > > > see the buildfarm reports > > > > > > > > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=sungazer&dt=2020-09-08%2006%3A24%3A14 > > > > failed the new 015_stream.pl test with the subscriber looping like this: > > > > > > > > 2020-09-08 11:22:49.848 UTC [13959252:1] LOG: logical replication apply worker for subscription "tap_sub" has started > > > > 2020-09-08 11:22:54.045 UTC [13959252:2] ERROR: could not open temporary file "16393-510.changes.0" from BufFile"16393-510.changes": No such file or directory > > > > 2020-09-08 11:22:54.055 UTC [7602182:1] LOG: logical replication apply worker for subscription "tap_sub" has started > > > > 2020-09-08 11:22:54.101 UTC [31785284:4] LOG: background worker "logical replication worker" (PID 13959252) exitedwith exit code 1 > > > > 2020-09-08 11:23:01.142 UTC [7602182:2] ERROR: could not open temporary file "16393-510.changes.0" from BufFile"16393-510.changes": No such file or directory > > > > ... > > > > > > > > What happened there? > > > > > > > > > > What is going on here is that the expected streaming file is missing. > > > Normally, the first time we send a stream of changes (some percentage > > > of transaction changes) we create the streaming file, and then in > > > respective streams we just keep on writing in that file the changes we > > > receive from the publisher, and on commit, we read that file and apply > > > all the changes. > > > > > > The above kind of error can happen due to the following reasons: (a) > > > the first time we sent the stream and created the file and that got > > > removed before the second stream reached the subscriber. (b) from the > > > publisher-side, we never sent the indication that it is the first > > > stream and the subscriber directly tries to open the file thinking it > > > is already there. > > > > > I have executed "make check" in the loop with only this file. I have > repeated it 5000 times but no failure, I am wondering shall we try to > execute in the same machine in a loop where it failed once? > Yes, that might help. Noah, would it be possible for you to try that out, and if it failed then probably get the stack trace of subscriber? If we are able to reproduce it then we can add elogs in functions SharedFileSetInit, BufFileCreateShared, BufFileOpenShared, and SharedFileSetDeleteAll to print the paths to see if we are sometimes unintentionally removing some files. I have checked the code and there doesn't appear to be any such problems but I might be missing something. -- With Regards, Amit Kapila.
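As an illustration of the kind of instrumentation being suggested (not an actual patch), one could temporarily log the fileset pointer and logical file name near the top of those functions; assuming the usual fileset and name parameters, something like:

    /* hypothetical debug logging; "fileset" and "name" are the function's arguments */
    elog(LOG, "BufFileCreateShared: fileset %p, file \"%s\" (pid %d)",
         (void *) fileset, name, MyProcPid);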
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Noah Misch
Date:
On Wed, Dec 02, 2020 at 01:50:25PM +0530, Amit Kapila wrote: > On Wed, Dec 2, 2020 at 1:20 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > On Mon, Nov 30, 2020 at 6:49 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > On Mon, Nov 30, 2020 at 3:14 AM Noah Misch <noah@leadboat.com> wrote: > > > > > On Mon, Sep 07, 2020 at 12:00:41PM +0530, Amit Kapila wrote: > > > > > > Thanks, I have pushed the last patch. Let's wait for a day or so to > > > > > > see the buildfarm reports > > > > > > > > > > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=sungazer&dt=2020-09-08%2006%3A24%3A14 > > > > > failed the new 015_stream.pl test with the subscriber looping like this: > > > > > > > > > > 2020-09-08 11:22:49.848 UTC [13959252:1] LOG: logical replication apply worker for subscription "tap_sub" hasstarted > > > > > 2020-09-08 11:22:54.045 UTC [13959252:2] ERROR: could not open temporary file "16393-510.changes.0" from BufFile"16393-510.changes": No such file or directory > > > > > 2020-09-08 11:22:54.055 UTC [7602182:1] LOG: logical replication apply worker for subscription "tap_sub" has started > > > > > 2020-09-08 11:22:54.101 UTC [31785284:4] LOG: background worker "logical replication worker" (PID 13959252) exitedwith exit code 1 > > > > > 2020-09-08 11:23:01.142 UTC [7602182:2] ERROR: could not open temporary file "16393-510.changes.0" from BufFile"16393-510.changes": No such file or directory > > > > > ... > > > > The above kind of error can happen due to the following reasons: (a) > > > > the first time we sent the stream and created the file and that got > > > > removed before the second stream reached the subscriber. (b) from the > > > > publisher-side, we never sent the indication that it is the first > > > > stream and the subscriber directly tries to open the file thinking it > > > > is already there. Further testing showed it was a file location problem, not a deletion problem. The worker tried to open base/pgsql_tmp/pgsql_tmp9896408.1.sharedfileset/16393-510.changes.0, but these were the files actually existing: [nm@power-aix 0:2 2020-12-08T13:56:35 64gcc 0]$ ls -la $(find src/test/subscription/tmp_check -name '*sharedfileset*') src/test/subscription/tmp_check/t_015_stream_subscriber_data/pgdata/base/pgsql_tmp/pgsql_tmp9896408.0.sharedfileset: total 408 drwx------ 2 nm usr 256 Dec 08 03:20 . drwx------ 4 nm usr 256 Dec 08 03:20 .. -rw------- 1 nm usr 207806 Dec 08 03:20 16393-510.changes.0 src/test/subscription/tmp_check/t_015_stream_subscriber_data/pgdata/base/pgsql_tmp/pgsql_tmp9896408.1.sharedfileset: total 0 drwx------ 2 nm usr 256 Dec 08 03:20 . drwx------ 4 nm usr 256 Dec 08 03:20 .. -rw------- 1 nm usr 0 Dec 08 03:20 16393-511.changes.0 > > I have executed "make check" in the loop with only this file. I have > > repeated it 5000 times but no failure, I am wondering shall we try to > > execute in the same machine in a loop where it failed once? > > Yes, that might help. Noah, would it be possible for you to try that The problem is xidhash using strcmp() to compare keys; it needs memcmp(). For this to matter, xidhash must contain more than one element. Existing tests rarely exercise the multi-element scenario. Under heavy load, on this system, the test publisher can have two active transactions at once, in which case it does exercise multi-element xidhash. (The publisher is sensitive to timing, but the subscriber is not; once WAL contains interleaved records of two XIDs, the subscriber fails every time.) 
This would be much harder to reproduce on a little-endian system, where strcmp(&xid, &xid_plus_one)!=0. On big-endian, every small XID has zero in the first octet; they all look like empty strings. The attached patch has the one-line fix and some test suite changes that make this reproduce frequently on any big-endian system. I'm currently planning to drop the test suite changes from the commit, but I could keep them if folks like them. (They'd need more comments and timeout handling.)
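A tiny standalone program makes the endianness point observable: treat two small XIDs as C strings and as fixed-size binary keys. On a big-endian host the first byte of each is zero, so strcmp() sees two empty strings and reports a match, while memcmp() distinguishes them on any host (illustration only; the real fix is simply to compare the keys with memcmp semantics):

    #include <stdio.h>
    #include <string.h>
    #include <stdint.h>

    int
    main(void)
    {
        uint32_t    xid = 510;
        uint32_t    xid_plus_one = 511;

        /* strcmp stops at the first zero byte: on big-endian both look empty */
        printf("strcmp: %d\n", strcmp((const char *) &xid,
                                      (const char *) &xid_plus_one));

        /* memcmp compares all four bytes and tells them apart regardless */
        printf("memcmp: %d\n", memcmp(&xid, &xid_plus_one, sizeof(xid)));

        return 0;
    }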
Attachment
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Amit Kapila
Date:
On Wed, Dec 9, 2020 at 2:56 PM Noah Misch <noah@leadboat.com> wrote: > > Further testing showed it was a file location problem, not a deletion problem. > The worker tried to open > base/pgsql_tmp/pgsql_tmp9896408.1.sharedfileset/16393-510.changes.0, but these > were the files actually existing: > > [nm@power-aix 0:2 2020-12-08T13:56:35 64gcc 0]$ ls -la $(find src/test/subscription/tmp_check -name '*sharedfileset*') > src/test/subscription/tmp_check/t_015_stream_subscriber_data/pgdata/base/pgsql_tmp/pgsql_tmp9896408.0.sharedfileset: > total 408 > drwx------ 2 nm usr 256 Dec 08 03:20 . > drwx------ 4 nm usr 256 Dec 08 03:20 .. > -rw------- 1 nm usr 207806 Dec 08 03:20 16393-510.changes.0 > > src/test/subscription/tmp_check/t_015_stream_subscriber_data/pgdata/base/pgsql_tmp/pgsql_tmp9896408.1.sharedfileset: > total 0 > drwx------ 2 nm usr 256 Dec 08 03:20 . > drwx------ 4 nm usr 256 Dec 08 03:20 .. > -rw------- 1 nm usr 0 Dec 08 03:20 16393-511.changes.0 > > > > I have executed "make check" in the loop with only this file. I have > > > repeated it 5000 times but no failure, I am wondering shall we try to > > > execute in the same machine in a loop where it failed once? > > > > Yes, that might help. Noah, would it be possible for you to try that > > The problem is xidhash using strcmp() to compare keys; it needs memcmp(). For > this to matter, xidhash must contain more than one element. Existing tests > rarely exercise the multi-element scenario. Under heavy load, on this system, > the test publisher can have two active transactions at once, in which case it > does exercise multi-element xidhash. (The publisher is sensitive to timing, > but the subscriber is not; once WAL contains interleaved records of two XIDs, > the subscriber fails every time.) This would be much harder to reproduce on a > little-endian system, where strcmp(&xid, &xid_plus_one)!=0. On big-endian, > every small XID has zero in the first octet; they all look like empty strings. > Your analysis is correct. > The attached patch has the one-line fix and some test suite changes that make > this reproduce frequently on any big-endian system. I'm currently planning to > drop the test suite changes from the commit, but I could keep them if folks > like them. (They'd need more comments and timeout handling.) > I think it is better to keep this test which can always test multiple streams on the subscriber. Thanks for working on this. -- With Regards, Amit Kapila.
HASH_BLOBS hazards (was Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions)
From
Tom Lane
Date:
Amit Kapila <amit.kapila16@gmail.com> writes: > On Wed, Dec 9, 2020 at 2:56 PM Noah Misch <noah@leadboat.com> wrote: >> The problem is xidhash using strcmp() to compare keys; it needs memcmp(). > Your analysis is correct. Sorry for not having noticed this thread before. Noah's fix is clearly correct, and I have no objection to the added test case. But what jumps out at me here is that this sort of error seems way too easy to make, and evidently way too hard to detect. What can we do to make it more obvious if one has incorrectly used or omitted HASH_BLOBS? Both directions of error might easily escape notice on little-endian hardware. I thought of a few ideas, all of which have drawbacks: 1. Invert the sense of the flag, ie HASH_BLOBS becomes the default. This seems to just move the problem somewhere else, besides which it'd require touching an awful lot of callers, and would silently break third-party callers. 2. Don't allow a default: invent a new HASH_STRING flag, and require that hash_create() calls specify exactly one of HASH_BLOBS, HASH_STRING, or HASH_FUNCTION. This doesn't completely fix the hazard of mindless-copy-and-paste, but I think it might make it a little more obvious. Still requires touching a lot of calls. 3. Add some sort of heuristic restriction on keysize. A keysize that's only 4 or 8 bytes almost certainly is not a string. This doesn't give us much traction for larger keysizes, though. 4. Disallow empty string keys, ie something like "Assert(s_len > 0)" in string_hash(). I think we could get away with that given that SQL disallows empty identifiers. However, it would only help to catch one direction of error (omitting HASH_BLOBS), and it would only help on big-endian hardware, which is getting harder to find. Still, we could hope that the buildfarm would detect errors. There might be some more options. Also, some of these ideas could be applied in combination. A quick count of grep hits suggest that the large majority of existing hash_create() calls use HASH_BLOBS, and there might be only order-of-ten calls that would need to be touched if we required an explicit HASH_STRING flag. So option #2 is seeming kind of attractive. Maybe that together with an assertion that string keys have to exceed 8 or 16 bytes would be enough protection. Also, this census now suggests to me that the opposite problem (copy-and-paste HASH_BLOBS when you meant string keys) might be a real hazard, since so many of the existing prototypes that you might copy have HASH_BLOBS. I'm not sure if there's much to be done for this case though. A small saving grace is that it seems relatively likely that you'd notice a functional problem pretty quickly with this type of mistake, since lookups would tend to fail due to trailing garbage after your lookup string. A different angle we could think about is that the name "HASH_BLOBS" is kind of un-obvious. Maybe we should deprecate that spelling in favor of something like "HASH_BINARY". Thoughts? regards, tom lane
Re: HASH_BLOBS hazards (was Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions)
From
Noah Misch
Date:
On Sun, Dec 13, 2020 at 11:49:31AM -0500, Tom Lane wrote: > But what jumps out at me here is that this sort of error seems way > too easy to make, and evidently way too hard to detect. What can we > do to make it more obvious if one has incorrectly used or omitted > HASH_BLOBS? Both directions of error might easily escape notice on > little-endian hardware. > > I thought of a few ideas, all of which have drawbacks: > > 1. Invert the sense of the flag, ie HASH_BLOBS becomes the default. > This seems to just move the problem somewhere else, besides which > it'd require touching an awful lot of callers, and would silently > break third-party callers. > > 2. Don't allow a default: invent a new HASH_STRING flag, and > require that hash_create() calls specify exactly one of HASH_BLOBS, > HASH_STRING, or HASH_FUNCTION. This doesn't completely fix the > hazard of mindless-copy-and-paste, but I think it might make it > a little more obvious. Still requires touching a lot of calls. I like (2), for making the bug harder and for greppability. Probably pluralize it to HASH_STRINGS, for the parallel with HASH_BLOBS. > 3. Add some sort of heuristic restriction on keysize. A keysize > that's only 4 or 8 bytes almost certainly is not a string. > This doesn't give us much traction for larger keysizes, though. > > 4. Disallow empty string keys, ie something like "Assert(s_len > 0)" > in string_hash(). I think we could get away with that given that > SQL disallows empty identifiers. However, it would only help to > catch one direction of error (omitting HASH_BLOBS), and it would > only help on big-endian hardware, which is getting harder to find. > Still, we could hope that the buildfarm would detect errors. It's nontrivial to confirm that the empty-string key can't happen for a given hash table. (In contrast, what (3) asserts on is usually a compile-time constant.) I would stop short of adding (4), though it could be okay. > A quick count of grep hits suggest that the large majority of > existing hash_create() calls use HASH_BLOBS, and there might be > only order-of-ten calls that would need to be touched if we > required an explicit HASH_STRING flag. So option #2 is seeming > kind of attractive. Maybe that together with an assertion that > string keys have to exceed 8 or 16 bytes would be enough protection. Agreed. I expect (2) gives most of the benefit. Requiring 8-byte capacity should be harmless, and most architectures can zero 8 bytes in one instruction. Requiring more bytes trades specificity for sensitivity. > A different angle we could think about is that the name "HASH_BLOBS" > is kind of un-obvious. Maybe we should deprecate that spelling in > favor of something like "HASH_BINARY". With (2) in place, I wouldn't worry about renaming HASH_BLOBS. It's hard to confuse with HASH_STRINGS or HASH_FUNCTION. If anything, HASH_BLOBS conveys something more specific. HASH_FUNCTION cases see binary data, but that data has structure that promotes it out of "blob" status.
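Under option (2) with the pluralized flag, every call site would have to say which kind of key it has. Roughly, with hypothetical entry struct names and sizes (a sketch of the proposed convention, not the eventual patch):

    HASHCTL    ctl;

    /* binary key (e.g. a TransactionId): HASH_BLOBS, compared with memcmp() */
    ctl.keysize = sizeof(TransactionId);
    ctl.entrysize = sizeof(StreamXidEnt);   /* hypothetical entry struct */
    xidhash = hash_create("StreamXidHash", 1024, &ctl,
                          HASH_ELEM | HASH_BLOBS);

    /* NUL-terminated string key: HASH_STRINGS, and keysize must exceed 8 */
    ctl.keysize = NAMEDATALEN;
    ctl.entrysize = sizeof(NamedEnt);       /* hypothetical entry struct */
    namehash = hash_create("names", 128, &ctl,
                           HASH_ELEM | HASH_STRINGS);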
Re: HASH_BLOBS hazards (was Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions)
From
Peter Eisentraut
Date:
On 2020-12-13 17:49, Tom Lane wrote: > 2. Don't allow a default: invent a new HASH_STRING flag, and > require that hash_create() calls specify exactly one of HASH_BLOBS, > HASH_STRING, or HASH_FUNCTION. This doesn't completely fix the > hazard of mindless-copy-and-paste, but I think it might make it > a little more obvious. Still requires touching a lot of calls. I think this sounds best, and also expand the documentation of these flags a bit. -- Peter Eisentraut 2ndQuadrant, an EDB company https://www.2ndquadrant.com/
Re: HASH_BLOBS hazards (was Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions)
From
Amit Kapila
Date:
On Mon, Dec 14, 2020 at 1:36 AM Noah Misch <noah@leadboat.com> wrote: > > On Sun, Dec 13, 2020 at 11:49:31AM -0500, Tom Lane wrote: > > But what jumps out at me here is that this sort of error seems way > > too easy to make, and evidently way too hard to detect. What can we > > do to make it more obvious if one has incorrectly used or omitted > > HASH_BLOBS? Both directions of error might easily escape notice on > > little-endian hardware. > > > > I thought of a few ideas, all of which have drawbacks: > > > > 1. Invert the sense of the flag, ie HASH_BLOBS becomes the default. > > This seems to just move the problem somewhere else, besides which > > it'd require touching an awful lot of callers, and would silently > > break third-party callers. > > > > 2. Don't allow a default: invent a new HASH_STRING flag, and > > require that hash_create() calls specify exactly one of HASH_BLOBS, > > HASH_STRING, or HASH_FUNCTION. This doesn't completely fix the > > hazard of mindless-copy-and-paste, but I think it might make it > > a little more obvious. Still requires touching a lot of calls. > > I like (2), for making the bug harder and for greppability. Probably > pluralize it to HASH_STRINGS, for the parallel with HASH_BLOBS. > > > 3. Add some sort of heuristic restriction on keysize. A keysize > > that's only 4 or 8 bytes almost certainly is not a string. > > This doesn't give us much traction for larger keysizes, though. > > > > 4. Disallow empty string keys, ie something like "Assert(s_len > 0)" > > in string_hash(). I think we could get away with that given that > > SQL disallows empty identifiers. However, it would only help to > > catch one direction of error (omitting HASH_BLOBS), and it would > > only help on big-endian hardware, which is getting harder to find. > > Still, we could hope that the buildfarm would detect errors. > > It's nontrivial to confirm that the empty-string key can't happen for a given > hash table. (In contrast, what (3) asserts on is usually a compile-time > constant.) I would stop short of adding (4), though it could be okay. > > > A quick count of grep hits suggest that the large majority of > > existing hash_create() calls use HASH_BLOBS, and there might be > > only order-of-ten calls that would need to be touched if we > > required an explicit HASH_STRING flag. So option #2 is seeming > > kind of attractive. Maybe that together with an assertion that > > string keys have to exceed 8 or 16 bytes would be enough protection. > > Agreed. I expect (2) gives most of the benefit. Requiring 8-byte capacity > should be harmless, and most architectures can zero 8 bytes in one > instruction. Requiring more bytes trades specificity for sensitivity. > +1. I also think in most cases (2) would be sufficient to avoid such bugs. Adding restriction on string size might annoy some out-of-core user which is already using small strings. However, adding an 8-byte restriction on string size would be still okay. -- With Regards, Amit Kapila.
Re: HASH_BLOBS hazards (was Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions)
From
Tom Lane
Date:
Noah Misch <noah@leadboat.com> writes: > On Sun, Dec 13, 2020 at 11:49:31AM -0500, Tom Lane wrote: >> A quick count of grep hits suggest that the large majority of >> existing hash_create() calls use HASH_BLOBS, and there might be >> only order-of-ten calls that would need to be touched if we >> required an explicit HASH_STRING flag. So option #2 is seeming >> kind of attractive. Maybe that together with an assertion that >> string keys have to exceed 8 or 16 bytes would be enough protection. > Agreed. I expect (2) gives most of the benefit. Requiring 8-byte capacity > should be harmless, and most architectures can zero 8 bytes in one > instruction. Requiring more bytes trades specificity for sensitivity. Attached is a proposed patch that requires HASH_STRINGS to be stated explicitly (in the event, there are 13 callers needing that) and insists on keysize > 8 for string keys. In examining the now-easily-visible uses of string keys, almost all of them are using NAMEDATALEN-sized keys, or in a few places larger values. Only two are smaller: 1. ShmemIndex uses SHMEM_INDEX_KEYSIZE, which is only set to 48. 2. ResetUnloggedRelationsInDbspaceDir is using OIDCHARS + 1, because it stores relfilenode OIDs as strings. That seems pretty damfool to me, so I'm inclined to change it to store binary OIDs instead; those'd be a third the size (or probably a quarter the size after alignment padding) and likely faster to hash or compare. But I didn't do that here, since it's still more than 8. (I did whack it upside the head to the extent of not storing its temporary hash table in CacheMemoryContext.) So it seems to me that insisting on keysize > 8 is fine. There are a couple of other API oddities that maybe we should think about while we're here: * Should we just have a blanket insistence that all callers supply HASH_ELEM? The default sizes that dynahash.c uses without that are undocumented and basically useless. We're already asserting that in the HASH_BLOBS path, which is the majority use-case, and this patch now asserts it for HASH_STRINGS too. * The coding convention that the HASHCTL argument struct should be pre-zeroed seems to have been ignored at a lot of call sites. I added a memset call to a couple of callers that I was touching in this patch, but I'm having second thoughts about that. Maybe we should just rip out all those memsets as pointless, since there's basically no case where you'd use the memset to fill a field that you meant to pass as zero. The fact that hash_create() doesn't read fields it's not told to by a flag means we should not need the memsets to avoid uninitialized-memory reads. 
regards, tom lane diff --git a/contrib/dblink/dblink.c b/contrib/dblink/dblink.c index 2dc9e44ae6..8b17fb06eb 100644 --- a/contrib/dblink/dblink.c +++ b/contrib/dblink/dblink.c @@ -2604,10 +2604,12 @@ createConnHash(void) { HASHCTL ctl; + memset(&ctl, 0, sizeof(ctl)); ctl.keysize = NAMEDATALEN; ctl.entrysize = sizeof(remoteConnHashEnt); - return hash_create("Remote Con hash", NUMCONN, &ctl, HASH_ELEM); + return hash_create("Remote Con hash", NUMCONN, &ctl, + HASH_ELEM | HASH_STRINGS); } static void diff --git a/contrib/tablefunc/tablefunc.c b/contrib/tablefunc/tablefunc.c index 85986ec24a..ec7819ca77 100644 --- a/contrib/tablefunc/tablefunc.c +++ b/contrib/tablefunc/tablefunc.c @@ -726,7 +726,7 @@ load_categories_hash(char *cats_sql, MemoryContext per_query_ctx) crosstab_hash = hash_create("crosstab hash", INIT_CATS, &ctl, - HASH_ELEM | HASH_CONTEXT); + HASH_ELEM | HASH_STRINGS | HASH_CONTEXT); /* Connect to SPI manager */ if ((ret = SPI_connect()) < 0) diff --git a/src/backend/commands/prepare.c b/src/backend/commands/prepare.c index 4b18be5b27..5ba7c2eb3c 100644 --- a/src/backend/commands/prepare.c +++ b/src/backend/commands/prepare.c @@ -414,7 +414,7 @@ InitQueryHashTable(void) prepared_queries = hash_create("Prepared Queries", 32, &hash_ctl, - HASH_ELEM); + HASH_ELEM | HASH_STRINGS); } /* diff --git a/src/backend/nodes/extensible.c b/src/backend/nodes/extensible.c index ab04459c55..2fe89fd361 100644 --- a/src/backend/nodes/extensible.c +++ b/src/backend/nodes/extensible.c @@ -51,7 +51,8 @@ RegisterExtensibleNodeEntry(HTAB **p_htable, const char *htable_label, ctl.keysize = EXTNODENAME_MAX_LEN; ctl.entrysize = sizeof(ExtensibleNodeEntry); - *p_htable = hash_create(htable_label, 100, &ctl, HASH_ELEM); + *p_htable = hash_create(htable_label, 100, &ctl, + HASH_ELEM | HASH_STRINGS); } if (strlen(extnodename) >= EXTNODENAME_MAX_LEN) diff --git a/src/backend/storage/file/reinit.c b/src/backend/storage/file/reinit.c index 0c2094f766..f21ab67ae4 100644 --- a/src/backend/storage/file/reinit.c +++ b/src/backend/storage/file/reinit.c @@ -175,7 +175,9 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op) memset(&ctl, 0, sizeof(ctl)); ctl.keysize = sizeof(unlogged_relation_entry); ctl.entrysize = sizeof(unlogged_relation_entry); - hash = hash_create("unlogged hash", 32, &ctl, HASH_ELEM); + ctl.hcxt = CurrentMemoryContext; + hash = hash_create("unlogged hash", 32, &ctl, + HASH_ELEM | HASH_STRINGS | HASH_CONTEXT); /* Scan the directory. */ dbspace_dir = AllocateDir(dbspacedirname); diff --git a/src/backend/storage/ipc/shmem.c b/src/backend/storage/ipc/shmem.c index 97716f6aef..0afd87e075 100644 --- a/src/backend/storage/ipc/shmem.c +++ b/src/backend/storage/ipc/shmem.c @@ -292,7 +292,6 @@ void InitShmemIndex(void) { HASHCTL info; - int hash_flags; /* * Create the shared memory shmem index. @@ -302,13 +301,14 @@ InitShmemIndex(void) * initializing the ShmemIndex itself. The special "ShmemIndex" hash * table name will tell ShmemInitStruct to fake it. */ + memset(&info, 0, sizeof(info)); info.keysize = SHMEM_INDEX_KEYSIZE; info.entrysize = sizeof(ShmemIndexEnt); - hash_flags = HASH_ELEM; ShmemIndex = ShmemInitHash("ShmemIndex", SHMEM_INDEX_SIZE, SHMEM_INDEX_SIZE, - &info, hash_flags); + &info, + HASH_ELEM | HASH_STRINGS); } /* @@ -329,6 +329,10 @@ InitShmemIndex(void) * whose maximum size is certain, this should be equal to max_size; that * ensures that no run-time out-of-shared-memory failures can occur. 
* + * *infoP and hash_flags should specify at least the entry sizes and key + * comparison semantics (see hash_create()). Flag bits specific to + * shared-memory hash tables are added here. + * * Note: before Postgres 9.0, this function returned NULL for some failure * cases. Now, it always throws error instead, so callers need not check * for NULL. diff --git a/src/backend/utils/adt/jsonfuncs.c b/src/backend/utils/adt/jsonfuncs.c index 12557ce3af..be0a45b55e 100644 --- a/src/backend/utils/adt/jsonfuncs.c +++ b/src/backend/utils/adt/jsonfuncs.c @@ -3446,7 +3446,7 @@ get_json_object_as_hash(char *json, int len, const char *funcname) tab = hash_create("json object hashtable", 100, &ctl, - HASH_ELEM | HASH_CONTEXT); + HASH_ELEM | HASH_STRINGS | HASH_CONTEXT); state = palloc0(sizeof(JHashState)); sem = palloc0(sizeof(JsonSemAction)); @@ -3838,7 +3838,7 @@ populate_recordset_object_start(void *state) _state->json_hash = hash_create("json object hashtable", 100, &ctl, - HASH_ELEM | HASH_CONTEXT); + HASH_ELEM | HASH_STRINGS | HASH_CONTEXT); } static void diff --git a/src/backend/utils/adt/ruleutils.c b/src/backend/utils/adt/ruleutils.c index ad582f99a5..87a3154c1a 100644 --- a/src/backend/utils/adt/ruleutils.c +++ b/src/backend/utils/adt/ruleutils.c @@ -3471,7 +3471,7 @@ set_rtable_names(deparse_namespace *dpns, List *parent_namespaces, names_hash = hash_create("set_rtable_names names", list_length(dpns->rtable), &hash_ctl, - HASH_ELEM | HASH_CONTEXT); + HASH_ELEM | HASH_STRINGS | HASH_CONTEXT); /* Preload the hash table with names appearing in parent_namespaces */ foreach(lc, parent_namespaces) { diff --git a/src/backend/utils/fmgr/dfmgr.c b/src/backend/utils/fmgr/dfmgr.c index bd779fdaf7..e83e30defe 100644 --- a/src/backend/utils/fmgr/dfmgr.c +++ b/src/backend/utils/fmgr/dfmgr.c @@ -686,7 +686,7 @@ find_rendezvous_variable(const char *varName) rendezvousHash = hash_create("Rendezvous variable hash", 16, &ctl, - HASH_ELEM); + HASH_ELEM | HASH_STRINGS); } /* Find or create the hashtable entry for this varName */ diff --git a/src/backend/utils/hash/dynahash.c b/src/backend/utils/hash/dynahash.c index d14d875c93..07cae638df 100644 --- a/src/backend/utils/hash/dynahash.c +++ b/src/backend/utils/hash/dynahash.c @@ -30,11 +30,12 @@ * dynahash.c provides support for these types of lookup keys: * * 1. Null-terminated C strings (truncated if necessary to fit in keysize), - * compared as though by strcmp(). This is the default behavior. + * compared as though by strcmp(). This is selected by specifying the + * HASH_STRINGS flag to hash_create. * * 2. Arbitrary binary data of size keysize, compared as though by memcmp(). * (Caller must ensure there are no undefined padding bits in the keys!) - * This is selected by specifying HASH_BLOBS flag to hash_create. + * This is selected by specifying the HASH_BLOBS flag to hash_create. * * 3. More complex key behavior can be selected by specifying user-supplied * hashing, comparison, and/or key-copying functions. At least a hashing @@ -47,8 +48,8 @@ * locks. * - Shared memory hashes are allocated in a fixed size area at startup and * are discoverable by name from other processes. - * - Because entries don't need to be moved in the case of hash conflicts, has - * better performance for large entries + * - Because entries don't need to be moved in the case of hash conflicts, + * dynahash has better performance for large entries. * - Guarantees stable pointers to entries. 
* * Portions Copyright (c) 1996-2020, PostgreSQL Global Development Group @@ -316,6 +317,12 @@ string_compare(const char *key1, const char *key2, Size keysize) * *info: additional table parameters, as indicated by flags * flags: bitmask indicating which parameters to take from *info * + * The flags value must include exactly one of HASH_STRINGS, HASH_BLOBS, + * or HASH_FUNCTION, to define the key hashing semantics (C strings, + * binary blobs, or custom, respectively). Callers specifying a custom + * hash function will likely also want to use HASH_COMPARE, and perhaps + * also HASH_KEYCOPY, to control key comparison and copying. + * * Note: for a shared-memory hashtable, nelem needs to be a pretty good * estimate, since we can't expand the table on the fly. But an unshared * hashtable can be expanded on-the-fly, so it's better for nelem to be @@ -370,9 +377,13 @@ hash_create(const char *tabname, long nelem, HASHCTL *info, int flags) * Select the appropriate hash function (see comments at head of file). */ if (flags & HASH_FUNCTION) + { + Assert(!(flags & (HASH_BLOBS | HASH_STRINGS))); hashp->hash = info->hash; + } else if (flags & HASH_BLOBS) { + Assert(!(flags & HASH_STRINGS)); /* We can optimize hashing for common key sizes */ Assert(flags & HASH_ELEM); if (info->keysize == sizeof(uint32)) @@ -381,17 +392,30 @@ hash_create(const char *tabname, long nelem, HASHCTL *info, int flags) hashp->hash = tag_hash; } else - hashp->hash = string_hash; /* default hash function */ + { + /* + * string_hash used to be considered the default hash method, and in a + * non-assert build it effectively still is. But we now consider it + * an assertion error to not say HASH_STRINGS explicitly. To help + * catch mistaken usage of HASH_STRINGS, we also insist on a + * reasonably long string length: if the keysize is only 4 or 8 bytes, + * it's almost certainly an integer or pointer not a string. + */ + Assert(flags & HASH_STRINGS); + Assert(flags & HASH_ELEM); + Assert(info->keysize > 8); + + hashp->hash = string_hash; + } /* * If you don't specify a match function, it defaults to string_compare if - * you used string_hash (either explicitly or by default) and to memcmp - * otherwise. + * you used string_hash, and to memcmp otherwise. * * Note: explicitly specifying string_hash is deprecated, because this * might not work for callers in loadable modules on some platforms due to * referencing a trampoline instead of the string_hash function proper. - * Just let it default, eh? + * Specify HASH_STRINGS instead. 
*/ if (flags & HASH_COMPARE) hashp->match = info->match; diff --git a/src/backend/utils/mmgr/portalmem.c b/src/backend/utils/mmgr/portalmem.c index ec6f80ee99..a382c4219b 100644 --- a/src/backend/utils/mmgr/portalmem.c +++ b/src/backend/utils/mmgr/portalmem.c @@ -111,6 +111,7 @@ EnablePortalManager(void) "TopPortalContext", ALLOCSET_DEFAULT_SIZES); + memset(&ctl, 0, sizeof(ctl)); ctl.keysize = MAX_PORTALNAME_LEN; ctl.entrysize = sizeof(PortalHashEnt); @@ -119,7 +120,7 @@ EnablePortalManager(void) * create, initially */ PortalHashTable = hash_create("Portal hash", PORTALS_PER_USER, - &ctl, HASH_ELEM); + &ctl, HASH_ELEM | HASH_STRINGS); } /* diff --git a/src/include/utils/hsearch.h b/src/include/utils/hsearch.h index bebf89b3c4..666ad33567 100644 --- a/src/include/utils/hsearch.h +++ b/src/include/utils/hsearch.h @@ -82,7 +82,8 @@ typedef struct HASHCTL #define HASH_PARTITION 0x0001 /* Hashtable is used w/partitioned locking */ #define HASH_SEGMENT 0x0002 /* Set segment size */ #define HASH_DIRSIZE 0x0004 /* Set directory size (initial and max) */ -#define HASH_ELEM 0x0010 /* Set keysize and entrysize */ +#define HASH_ELEM 0x0008 /* Set keysize and entrysize */ +#define HASH_STRINGS 0x0010 /* Select support functions for string keys */ #define HASH_BLOBS 0x0020 /* Select support functions for binary keys */ #define HASH_FUNCTION 0x0040 /* Set user defined hash function */ #define HASH_COMPARE 0x0080 /* Set user defined comparison function */ @@ -119,7 +120,8 @@ typedef struct * * Note: It is deprecated for callers of hash_create to explicitly specify * string_hash, tag_hash, uint32_hash, or oid_hash. Just set HASH_BLOBS or - * not. Use HASH_FUNCTION only when you want something other than those. + * HASH_STRINGS. Use HASH_FUNCTION only when you want something other than + * one of these. */ extern HTAB *hash_create(const char *tabname, long nelem, HASHCTL *info, int flags); diff --git a/src/pl/plperl/plperl.c b/src/pl/plperl/plperl.c index 4de756455d..60f5d66264 100644 --- a/src/pl/plperl/plperl.c +++ b/src/pl/plperl/plperl.c @@ -586,7 +586,7 @@ select_perl_context(bool trusted) interp_desc->query_hash = hash_create("PL/Perl queries", 32, &hash_ctl, - HASH_ELEM); + HASH_ELEM | HASH_STRINGS); } /* diff --git a/src/timezone/pgtz.c b/src/timezone/pgtz.c index 3f0fb51e91..5240cab022 100644 --- a/src/timezone/pgtz.c +++ b/src/timezone/pgtz.c @@ -211,7 +211,7 @@ init_timezone_hashtable(void) timezone_cache = hash_create("Timezones", 4, &hash_ctl, - HASH_ELEM); + HASH_ELEM | HASH_STRINGS); if (!timezone_cache) return false;
Re: HASH_BLOBS hazards (was Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions)
From
Tom Lane
Date:
I wrote: > There are a couple of other API oddities that maybe we should think > about while we're here: > * Should we just have a blanket insistence that all callers supply > HASH_ELEM? The default sizes that dynahash.c uses without that are > undocumented and basically useless. We're already asserting that > in the HASH_BLOBS path, which is the majority use-case, and this > patch now asserts it for HASH_STRINGS too. Here's a follow-up patch for that part, which also tries to respond a bit to Heikki's complaint about skimpy documentation. While at it, I const-ified the HASHCTL argument, since there's no need for hash_create to modify that. regards, tom lane diff --git a/src/backend/utils/hash/dynahash.c b/src/backend/utils/hash/dynahash.c index 07cae638df..49f21b77bb 100644 --- a/src/backend/utils/hash/dynahash.c +++ b/src/backend/utils/hash/dynahash.c @@ -317,11 +317,20 @@ string_compare(const char *key1, const char *key2, Size keysize) * *info: additional table parameters, as indicated by flags * flags: bitmask indicating which parameters to take from *info * - * The flags value must include exactly one of HASH_STRINGS, HASH_BLOBS, + * The flags value *must* include HASH_ELEM. (Formerly, this was nominally + * optional, but the default keysize and entrysize values were useless.) + * The flags value must also include exactly one of HASH_STRINGS, HASH_BLOBS, * or HASH_FUNCTION, to define the key hashing semantics (C strings, * binary blobs, or custom, respectively). Callers specifying a custom * hash function will likely also want to use HASH_COMPARE, and perhaps * also HASH_KEYCOPY, to control key comparison and copying. + * Another often-used flag is HASH_CONTEXT, to allocate the hash table + * under info->hcxt rather than under TopMemoryContext; the default + * behavior is only suitable for session-lifespan hash tables. + * Other flags bits are special-purpose and seldom used. + * + * Fields in *info are read only when the associated flags bit is set. + * It is not necessary to initialize other fields of *info. * * Note: for a shared-memory hashtable, nelem needs to be a pretty good * estimate, since we can't expand the table on the fly. But an unshared @@ -330,11 +339,19 @@ string_compare(const char *key1, const char *key2, Size keysize) * large nelem will penalize hash_seq_search speed without buying much. */ HTAB * -hash_create(const char *tabname, long nelem, HASHCTL *info, int flags) +hash_create(const char *tabname, long nelem, const HASHCTL *info, int flags) { HTAB *hashp; HASHHDR *hctl; + /* + * Hash tables now allocate space for key and data, but you have to say + * how much space to allocate. + */ + Assert(flags & HASH_ELEM); + Assert(info->keysize > 0); + Assert(info->entrysize >= info->keysize); + /* * For shared hash tables, we have a local hash header (HTAB struct) that * we allocate in TopMemoryContext; all else is in shared memory. @@ -385,7 +402,6 @@ hash_create(const char *tabname, long nelem, HASHCTL *info, int flags) { Assert(!(flags & HASH_STRINGS)); /* We can optimize hashing for common key sizes */ - Assert(flags & HASH_ELEM); if (info->keysize == sizeof(uint32)) hashp->hash = uint32_hash; else @@ -402,7 +418,6 @@ hash_create(const char *tabname, long nelem, HASHCTL *info, int flags) * it's almost certainly an integer or pointer not a string. 
*/ Assert(flags & HASH_STRINGS); - Assert(flags & HASH_ELEM); Assert(info->keysize > 8); hashp->hash = string_hash; @@ -529,16 +544,9 @@ hash_create(const char *tabname, long nelem, HASHCTL *info, int flags) hctl->dsize = info->dsize; } - /* - * hash table now allocates space for key and data but you have to say how - * much space to allocate - */ - if (flags & HASH_ELEM) - { - Assert(info->entrysize >= info->keysize); - hctl->keysize = info->keysize; - hctl->entrysize = info->entrysize; - } + /* remember the entry sizes, too */ + hctl->keysize = info->keysize; + hctl->entrysize = info->entrysize; /* make local copies of heavily-used constant fields */ hashp->keysize = hctl->keysize; @@ -617,10 +625,6 @@ hdefault(HTAB *hashp) hctl->dsize = DEF_DIRSIZE; hctl->nsegs = 0; - /* rather pointless defaults for key & entry size */ - hctl->keysize = sizeof(char *); - hctl->entrysize = 2 * sizeof(char *); - hctl->num_partitions = 0; /* not partitioned */ /* table has no fixed maximum size */ diff --git a/src/include/utils/hsearch.h b/src/include/utils/hsearch.h index 666ad33567..c3daaae92b 100644 --- a/src/include/utils/hsearch.h +++ b/src/include/utils/hsearch.h @@ -124,7 +124,7 @@ typedef struct * one of these. */ extern HTAB *hash_create(const char *tabname, long nelem, - HASHCTL *info, int flags); + const HASHCTL *info, int flags); extern void hash_destroy(HTAB *hashp); extern void hash_stats(const char *where, HTAB *hashp); extern void *hash_search(HTAB *hashp, const void *keyPtr, HASHACTION action,
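For illustration, a call site under the tightened contract follows the same shape as the callers converted by this patch: a stack HASHCTL with only the flagged fields set (no memset), HASH_ELEM always present, and exactly one of HASH_BLOBS, HASH_STRINGS, or HASH_FUNCTION. The names below (MyEntry, my_hash, init_my_hash) are made up for the sketch; only the hash_create() API shape comes from the patch.

    #include "postgres.h"

    #include "utils/hsearch.h"

    typedef struct MyEntry
    {
        Oid         key;            /* hash key; must be the first field */
        int         count;
    } MyEntry;

    static HTAB *my_hash = NULL;

    static void
    init_my_hash(void)
    {
        HASHCTL     ctl;            /* no memset: only flagged fields are read */

        ctl.keysize = sizeof(Oid);
        ctl.entrysize = sizeof(MyEntry);

        /*
         * HASH_ELEM is now required, and exactly one of HASH_BLOBS,
         * HASH_STRINGS, or HASH_FUNCTION must accompany it.  An Oid key is
         * binary data, so HASH_BLOBS is the right choice here.
         */
        my_hash = hash_create("my example hash",   /* name need not persist */
                              128,                 /* initial size hint */
                              &ctl,                /* now a const HASHCTL * */
                              HASH_ELEM | HASH_BLOBS);
    }

Without HASH_CONTEXT the table is allocated under TopMemoryContext, which per the new header comment is only suitable for session-lifespan hash tables.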
Re: HASH_BLOBS hazards (was Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions)
From
Tom Lane
Date:
Here's a rolled-up patch that does some further documentation work and gets rid of the unnecessary memset's as well. regards, tom lane diff --git a/contrib/dblink/dblink.c b/contrib/dblink/dblink.c index 2dc9e44ae6..651227f510 100644 --- a/contrib/dblink/dblink.c +++ b/contrib/dblink/dblink.c @@ -2607,7 +2607,8 @@ createConnHash(void) ctl.keysize = NAMEDATALEN; ctl.entrysize = sizeof(remoteConnHashEnt); - return hash_create("Remote Con hash", NUMCONN, &ctl, HASH_ELEM); + return hash_create("Remote Con hash", NUMCONN, &ctl, + HASH_ELEM | HASH_STRINGS); } static void diff --git a/contrib/pg_stat_statements/pg_stat_statements.c b/contrib/pg_stat_statements/pg_stat_statements.c index 70cfdb2c9d..2f00344b7f 100644 --- a/contrib/pg_stat_statements/pg_stat_statements.c +++ b/contrib/pg_stat_statements/pg_stat_statements.c @@ -567,7 +567,6 @@ pgss_shmem_startup(void) pgss->stats.dealloc = 0; } - memset(&info, 0, sizeof(info)); info.keysize = sizeof(pgssHashKey); info.entrysize = sizeof(pgssEntry); pgss_hash = ShmemInitHash("pg_stat_statements hash", diff --git a/contrib/postgres_fdw/connection.c b/contrib/postgres_fdw/connection.c index ab3226287d..66581e5414 100644 --- a/contrib/postgres_fdw/connection.c +++ b/contrib/postgres_fdw/connection.c @@ -119,14 +119,11 @@ GetConnection(UserMapping *user, bool will_prep_stmt) { HASHCTL ctl; - MemSet(&ctl, 0, sizeof(ctl)); ctl.keysize = sizeof(ConnCacheKey); ctl.entrysize = sizeof(ConnCacheEntry); - /* allocate ConnectionHash in the cache context */ - ctl.hcxt = CacheMemoryContext; ConnectionHash = hash_create("postgres_fdw connections", 8, &ctl, - HASH_ELEM | HASH_BLOBS | HASH_CONTEXT); + HASH_ELEM | HASH_BLOBS); /* * Register some callback functions that manage connection cleanup. diff --git a/contrib/postgres_fdw/shippable.c b/contrib/postgres_fdw/shippable.c index 3433c19712..b4766dc5ff 100644 --- a/contrib/postgres_fdw/shippable.c +++ b/contrib/postgres_fdw/shippable.c @@ -93,7 +93,6 @@ InitializeShippableCache(void) HASHCTL ctl; /* Create the hash table. */ - MemSet(&ctl, 0, sizeof(ctl)); ctl.keysize = sizeof(ShippableCacheKey); ctl.entrysize = sizeof(ShippableCacheEntry); ShippableCacheHash = diff --git a/contrib/tablefunc/tablefunc.c b/contrib/tablefunc/tablefunc.c index 85986ec24a..e9a9741154 100644 --- a/contrib/tablefunc/tablefunc.c +++ b/contrib/tablefunc/tablefunc.c @@ -714,7 +714,6 @@ load_categories_hash(char *cats_sql, MemoryContext per_query_ctx) MemoryContext SPIcontext; /* initialize the category hash table */ - MemSet(&ctl, 0, sizeof(ctl)); ctl.keysize = MAX_CATNAME_LEN; ctl.entrysize = sizeof(crosstab_HashEnt); ctl.hcxt = per_query_ctx; @@ -726,7 +725,7 @@ load_categories_hash(char *cats_sql, MemoryContext per_query_ctx) crosstab_hash = hash_create("crosstab hash", INIT_CATS, &ctl, - HASH_ELEM | HASH_CONTEXT); + HASH_ELEM | HASH_STRINGS | HASH_CONTEXT); /* Connect to SPI manager */ if ((ret = SPI_connect()) < 0) diff --git a/src/backend/access/gist/gistbuildbuffers.c b/src/backend/access/gist/gistbuildbuffers.c index 4ad67c88b4..217c199a14 100644 --- a/src/backend/access/gist/gistbuildbuffers.c +++ b/src/backend/access/gist/gistbuildbuffers.c @@ -76,7 +76,6 @@ gistInitBuildBuffers(int pagesPerBuffer, int levelStep, int maxLevel) * nodeBuffersTab hash is association between index blocks and it's * buffers. 
*/ - memset(&hashCtl, 0, sizeof(hashCtl)); hashCtl.keysize = sizeof(BlockNumber); hashCtl.entrysize = sizeof(GISTNodeBuffer); hashCtl.hcxt = CurrentMemoryContext; diff --git a/src/backend/access/hash/hashpage.c b/src/backend/access/hash/hashpage.c index a664ecf494..c77a189907 100644 --- a/src/backend/access/hash/hashpage.c +++ b/src/backend/access/hash/hashpage.c @@ -1363,7 +1363,6 @@ _hash_finish_split(Relation rel, Buffer metabuf, Buffer obuf, Bucket obucket, bool found; /* Initialize hash tables used to track TIDs */ - memset(&hash_ctl, 0, sizeof(hash_ctl)); hash_ctl.keysize = sizeof(ItemPointerData); hash_ctl.entrysize = sizeof(ItemPointerData); hash_ctl.hcxt = CurrentMemoryContext; diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c index 39e33763df..65942cc428 100644 --- a/src/backend/access/heap/rewriteheap.c +++ b/src/backend/access/heap/rewriteheap.c @@ -266,7 +266,6 @@ begin_heap_rewrite(Relation old_heap, Relation new_heap, TransactionId oldest_xm state->rs_cxt = rw_cxt; /* Initialize hash tables used to track update chains */ - memset(&hash_ctl, 0, sizeof(hash_ctl)); hash_ctl.keysize = sizeof(TidHashKey); hash_ctl.entrysize = sizeof(UnresolvedTupData); hash_ctl.hcxt = state->rs_cxt; @@ -824,7 +823,6 @@ logical_begin_heap_rewrite(RewriteState state) state->rs_begin_lsn = GetXLogInsertRecPtr(); state->rs_num_rewrite_mappings = 0; - memset(&hash_ctl, 0, sizeof(hash_ctl)); hash_ctl.keysize = sizeof(TransactionId); hash_ctl.entrysize = sizeof(RewriteMappingFile); hash_ctl.hcxt = state->rs_cxt; diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c index 32a3099c1f..e0ca3859a9 100644 --- a/src/backend/access/transam/xlogutils.c +++ b/src/backend/access/transam/xlogutils.c @@ -113,7 +113,6 @@ log_invalid_page(RelFileNode node, ForkNumber forkno, BlockNumber blkno, /* create hash table when first needed */ HASHCTL ctl; - memset(&ctl, 0, sizeof(ctl)); ctl.keysize = sizeof(xl_invalid_page_key); ctl.entrysize = sizeof(xl_invalid_page); diff --git a/src/backend/catalog/pg_enum.c b/src/backend/catalog/pg_enum.c index 6a2c6685a0..f2e7bab62a 100644 --- a/src/backend/catalog/pg_enum.c +++ b/src/backend/catalog/pg_enum.c @@ -188,7 +188,6 @@ init_enum_blacklist(void) { HASHCTL hash_ctl; - memset(&hash_ctl, 0, sizeof(hash_ctl)); hash_ctl.keysize = sizeof(Oid); hash_ctl.entrysize = sizeof(Oid); hash_ctl.hcxt = TopTransactionContext; diff --git a/src/backend/catalog/pg_inherits.c b/src/backend/catalog/pg_inherits.c index 17f37eb39f..5c3c78a0e6 100644 --- a/src/backend/catalog/pg_inherits.c +++ b/src/backend/catalog/pg_inherits.c @@ -171,7 +171,6 @@ find_all_inheritors(Oid parentrelId, LOCKMODE lockmode, List **numparents) *rel_numparents; ListCell *l; - memset(&ctl, 0, sizeof(ctl)); ctl.keysize = sizeof(Oid); ctl.entrysize = sizeof(SeenRelsEntry); ctl.hcxt = CurrentMemoryContext; diff --git a/src/backend/commands/async.c b/src/backend/commands/async.c index c0763c63e2..e04afd9963 100644 --- a/src/backend/commands/async.c +++ b/src/backend/commands/async.c @@ -2375,7 +2375,6 @@ AddEventToPendingNotifies(Notification *n) ListCell *l; /* Create the hash table */ - MemSet(&hash_ctl, 0, sizeof(hash_ctl)); hash_ctl.keysize = sizeof(Notification *); hash_ctl.entrysize = sizeof(NotificationHash); hash_ctl.hash = notification_hash; diff --git a/src/backend/commands/prepare.c b/src/backend/commands/prepare.c index 4b18be5b27..89087a7be3 100644 --- a/src/backend/commands/prepare.c +++ b/src/backend/commands/prepare.c @@ -406,15 
+406,13 @@ InitQueryHashTable(void) { HASHCTL hash_ctl; - MemSet(&hash_ctl, 0, sizeof(hash_ctl)); - hash_ctl.keysize = NAMEDATALEN; hash_ctl.entrysize = sizeof(PreparedStatement); prepared_queries = hash_create("Prepared Queries", 32, &hash_ctl, - HASH_ELEM); + HASH_ELEM | HASH_STRINGS); } /* diff --git a/src/backend/commands/sequence.c b/src/backend/commands/sequence.c index 632b34af61..fa2eea8af2 100644 --- a/src/backend/commands/sequence.c +++ b/src/backend/commands/sequence.c @@ -1087,7 +1087,6 @@ create_seq_hashtable(void) { HASHCTL ctl; - memset(&ctl, 0, sizeof(ctl)); ctl.keysize = sizeof(Oid); ctl.entrysize = sizeof(SeqTableData); diff --git a/src/backend/executor/execPartition.c b/src/backend/executor/execPartition.c index 86594bd056..97bfc8bd71 100644 --- a/src/backend/executor/execPartition.c +++ b/src/backend/executor/execPartition.c @@ -521,7 +521,6 @@ ExecHashSubPlanResultRelsByOid(ModifyTableState *mtstate, HTAB *htab; int i; - memset(&ctl, 0, sizeof(ctl)); ctl.keysize = sizeof(Oid); ctl.entrysize = sizeof(SubplanResultRelHashElem); ctl.hcxt = CurrentMemoryContext; diff --git a/src/backend/nodes/extensible.c b/src/backend/nodes/extensible.c index ab04459c55..3a6cfc44d3 100644 --- a/src/backend/nodes/extensible.c +++ b/src/backend/nodes/extensible.c @@ -47,11 +47,11 @@ RegisterExtensibleNodeEntry(HTAB **p_htable, const char *htable_label, { HASHCTL ctl; - memset(&ctl, 0, sizeof(HASHCTL)); ctl.keysize = EXTNODENAME_MAX_LEN; ctl.entrysize = sizeof(ExtensibleNodeEntry); - *p_htable = hash_create(htable_label, 100, &ctl, HASH_ELEM); + *p_htable = hash_create(htable_label, 100, &ctl, + HASH_ELEM | HASH_STRINGS); } if (strlen(extnodename) >= EXTNODENAME_MAX_LEN) diff --git a/src/backend/optimizer/util/predtest.c b/src/backend/optimizer/util/predtest.c index 0edd873dca..d6e83e5f8e 100644 --- a/src/backend/optimizer/util/predtest.c +++ b/src/backend/optimizer/util/predtest.c @@ -1982,7 +1982,6 @@ lookup_proof_cache(Oid pred_op, Oid clause_op, bool refute_it) /* First time through: initialize the hash table */ HASHCTL ctl; - MemSet(&ctl, 0, sizeof(ctl)); ctl.keysize = sizeof(OprProofCacheKey); ctl.entrysize = sizeof(OprProofCacheEntry); OprProofCacheHash = hash_create("Btree proof lookup cache", 256, diff --git a/src/backend/optimizer/util/relnode.c b/src/backend/optimizer/util/relnode.c index 76245c1ff3..9c9a738c80 100644 --- a/src/backend/optimizer/util/relnode.c +++ b/src/backend/optimizer/util/relnode.c @@ -400,7 +400,6 @@ build_join_rel_hash(PlannerInfo *root) ListCell *l; /* Create the hash table */ - MemSet(&hash_ctl, 0, sizeof(hash_ctl)); hash_ctl.keysize = sizeof(Relids); hash_ctl.entrysize = sizeof(JoinHashEntry); hash_ctl.hash = bitmap_hash; diff --git a/src/backend/parser/parse_oper.c b/src/backend/parser/parse_oper.c index 6613a3a8f8..e72d3676f1 100644 --- a/src/backend/parser/parse_oper.c +++ b/src/backend/parser/parse_oper.c @@ -999,7 +999,6 @@ find_oper_cache_entry(OprCacheKey *key) /* First time through: initialize the hash table */ HASHCTL ctl; - MemSet(&ctl, 0, sizeof(ctl)); ctl.keysize = sizeof(OprCacheKey); ctl.entrysize = sizeof(OprCacheEntry); OprCacheHash = hash_create("Operator lookup cache", 256, diff --git a/src/backend/partitioning/partdesc.c b/src/backend/partitioning/partdesc.c index 9a292290ed..5b0a15ac0b 100644 --- a/src/backend/partitioning/partdesc.c +++ b/src/backend/partitioning/partdesc.c @@ -286,13 +286,13 @@ CreatePartitionDirectory(MemoryContext mcxt) PartitionDirectory pdir; HASHCTL ctl; - MemSet(&ctl, 0, sizeof(HASHCTL)); + pdir = 
palloc(sizeof(PartitionDirectoryData)); + pdir->pdir_mcxt = mcxt; + ctl.keysize = sizeof(Oid); ctl.entrysize = sizeof(PartitionDirectoryEntry); ctl.hcxt = mcxt; - pdir = palloc(sizeof(PartitionDirectoryData)); - pdir->pdir_mcxt = mcxt; pdir->pdir_hash = hash_create("partition directory", 256, &ctl, HASH_ELEM | HASH_BLOBS | HASH_CONTEXT); diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c index 7e28944d2f..ed127a1032 100644 --- a/src/backend/postmaster/autovacuum.c +++ b/src/backend/postmaster/autovacuum.c @@ -2043,7 +2043,6 @@ do_autovacuum(void) pg_class_desc = CreateTupleDescCopy(RelationGetDescr(classRel)); /* create hash table for toast <-> main relid mapping */ - MemSet(&ctl, 0, sizeof(ctl)); ctl.keysize = sizeof(Oid); ctl.entrysize = sizeof(av_relation); diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c index 429c8010ef..a62c6d4d0a 100644 --- a/src/backend/postmaster/checkpointer.c +++ b/src/backend/postmaster/checkpointer.c @@ -1161,7 +1161,6 @@ CompactCheckpointerRequestQueue(void) skip_slot = palloc0(sizeof(bool) * CheckpointerShmem->num_requests); /* Initialize temporary hash table */ - MemSet(&ctl, 0, sizeof(ctl)); ctl.keysize = sizeof(CheckpointerRequest); ctl.entrysize = sizeof(struct CheckpointerSlotMapping); ctl.hcxt = CurrentMemoryContext; diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c index 7c75a25d21..6b60f293e9 100644 --- a/src/backend/postmaster/pgstat.c +++ b/src/backend/postmaster/pgstat.c @@ -1265,7 +1265,6 @@ pgstat_collect_oids(Oid catalogid, AttrNumber anum_oid) HeapTuple tup; Snapshot snapshot; - memset(&hash_ctl, 0, sizeof(hash_ctl)); hash_ctl.keysize = sizeof(Oid); hash_ctl.entrysize = sizeof(Oid); hash_ctl.hcxt = CurrentMemoryContext; @@ -1815,7 +1814,6 @@ pgstat_init_function_usage(FunctionCallInfo fcinfo, /* First time through - initialize function stat table */ HASHCTL hash_ctl; - memset(&hash_ctl, 0, sizeof(hash_ctl)); hash_ctl.keysize = sizeof(Oid); hash_ctl.entrysize = sizeof(PgStat_BackendFunctionEntry); pgStatFunctions = hash_create("Function stat entries", @@ -1975,7 +1973,6 @@ get_tabstat_entry(Oid rel_id, bool isshared) { HASHCTL ctl; - memset(&ctl, 0, sizeof(ctl)); ctl.keysize = sizeof(Oid); ctl.entrysize = sizeof(TabStatHashEntry); @@ -4994,7 +4991,6 @@ reset_dbentry_counters(PgStat_StatDBEntry *dbentry) dbentry->stat_reset_timestamp = GetCurrentTimestamp(); dbentry->stats_timestamp = 0; - memset(&hash_ctl, 0, sizeof(hash_ctl)); hash_ctl.keysize = sizeof(Oid); hash_ctl.entrysize = sizeof(PgStat_StatTabEntry); dbentry->tables = hash_create("Per-database table", @@ -5423,7 +5419,6 @@ pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep) /* * Create the DB hashtable */ - memset(&hash_ctl, 0, sizeof(hash_ctl)); hash_ctl.keysize = sizeof(Oid); hash_ctl.entrysize = sizeof(PgStat_StatDBEntry); hash_ctl.hcxt = pgStatLocalContext; @@ -5608,7 +5603,6 @@ pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep) break; } - memset(&hash_ctl, 0, sizeof(hash_ctl)); hash_ctl.keysize = sizeof(Oid); hash_ctl.entrysize = sizeof(PgStat_StatTabEntry); hash_ctl.hcxt = pgStatLocalContext; diff --git a/src/backend/replication/logical/relation.c b/src/backend/replication/logical/relation.c index 07aa52977f..f4dbbbe2dd 100644 --- a/src/backend/replication/logical/relation.c +++ b/src/backend/replication/logical/relation.c @@ -111,7 +111,6 @@ logicalrep_relmap_init(void) ALLOCSET_DEFAULT_SIZES); /* Initialize the relation hash table. 
*/ - MemSet(&ctl, 0, sizeof(ctl)); ctl.keysize = sizeof(LogicalRepRelId); ctl.entrysize = sizeof(LogicalRepRelMapEntry); ctl.hcxt = LogicalRepRelMapContext; @@ -120,7 +119,6 @@ logicalrep_relmap_init(void) HASH_ELEM | HASH_BLOBS | HASH_CONTEXT); /* Initialize the type hash table. */ - MemSet(&ctl, 0, sizeof(ctl)); ctl.keysize = sizeof(Oid); ctl.entrysize = sizeof(LogicalRepTyp); ctl.hcxt = LogicalRepRelMapContext; @@ -606,7 +604,6 @@ logicalrep_partmap_init(void) ALLOCSET_DEFAULT_SIZES); /* Initialize the relation hash table. */ - MemSet(&ctl, 0, sizeof(ctl)); ctl.keysize = sizeof(Oid); /* partition OID */ ctl.entrysize = sizeof(LogicalRepPartMapEntry); ctl.hcxt = LogicalRepPartMapContext; diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c index 15dc51a94d..7359fa9df2 100644 --- a/src/backend/replication/logical/reorderbuffer.c +++ b/src/backend/replication/logical/reorderbuffer.c @@ -1619,8 +1619,6 @@ ReorderBufferBuildTupleCidHash(ReorderBuffer *rb, ReorderBufferTXN *txn) if (!rbtxn_has_catalog_changes(txn) || dlist_is_empty(&txn->tuplecids)) return; - memset(&hash_ctl, 0, sizeof(hash_ctl)); - hash_ctl.keysize = sizeof(ReorderBufferTupleCidKey); hash_ctl.entrysize = sizeof(ReorderBufferTupleCidEnt); hash_ctl.hcxt = rb->context; @@ -4116,7 +4114,6 @@ ReorderBufferToastInitHash(ReorderBuffer *rb, ReorderBufferTXN *txn) Assert(txn->toast_hash == NULL); - memset(&hash_ctl, 0, sizeof(hash_ctl)); hash_ctl.keysize = sizeof(Oid); hash_ctl.entrysize = sizeof(ReorderBufferToastEnt); hash_ctl.hcxt = rb->context; diff --git a/src/backend/replication/logical/tablesync.c b/src/backend/replication/logical/tablesync.c index 1904f3471c..6259606537 100644 --- a/src/backend/replication/logical/tablesync.c +++ b/src/backend/replication/logical/tablesync.c @@ -372,7 +372,6 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn) { HASHCTL ctl; - memset(&ctl, 0, sizeof(ctl)); ctl.keysize = sizeof(Oid); ctl.entrysize = sizeof(struct tablesync_start_time_mapping); last_start_times = hash_create("Logical replication table sync worker start times", diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c index 9c997aed83..49d25b02d7 100644 --- a/src/backend/replication/pgoutput/pgoutput.c +++ b/src/backend/replication/pgoutput/pgoutput.c @@ -867,22 +867,18 @@ static void init_rel_sync_cache(MemoryContext cachectx) { HASHCTL ctl; - MemoryContext old_ctxt; if (RelationSyncCache != NULL) return; /* Make a new hash table for the cache */ - MemSet(&ctl, 0, sizeof(ctl)); ctl.keysize = sizeof(Oid); ctl.entrysize = sizeof(RelationSyncEntry); ctl.hcxt = cachectx; - old_ctxt = MemoryContextSwitchTo(cachectx); RelationSyncCache = hash_create("logical replication output relation cache", 128, &ctl, HASH_ELEM | HASH_CONTEXT | HASH_BLOBS); - (void) MemoryContextSwitchTo(old_ctxt); Assert(RelationSyncCache != NULL); diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c index ad0d1a9abc..c5e8707151 100644 --- a/src/backend/storage/buffer/bufmgr.c +++ b/src/backend/storage/buffer/bufmgr.c @@ -2505,7 +2505,6 @@ InitBufferPoolAccess(void) memset(&PrivateRefCountArray, 0, sizeof(PrivateRefCountArray)); - MemSet(&hash_ctl, 0, sizeof(hash_ctl)); hash_ctl.keysize = sizeof(int32); hash_ctl.entrysize = sizeof(PrivateRefCountEntry); diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c index 6ffd7b3306..cd3475e9e1 100644 --- a/src/backend/storage/buffer/localbuf.c 
+++ b/src/backend/storage/buffer/localbuf.c @@ -465,7 +465,6 @@ InitLocalBuffers(void) } /* Create the lookup hash table */ - MemSet(&info, 0, sizeof(info)); info.keysize = sizeof(BufferTag); info.entrysize = sizeof(LocalBufferLookupEnt); diff --git a/src/backend/storage/file/reinit.c b/src/backend/storage/file/reinit.c index 0c2094f766..8700f7f19a 100644 --- a/src/backend/storage/file/reinit.c +++ b/src/backend/storage/file/reinit.c @@ -30,7 +30,7 @@ static void ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, typedef struct { - char oid[OIDCHARS + 1]; + Oid reloid; /* hash key */ } unlogged_relation_entry; /* @@ -172,10 +172,11 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op) * need to be reset. Otherwise, this cleanup operation would be * O(n^2). */ - memset(&ctl, 0, sizeof(ctl)); - ctl.keysize = sizeof(unlogged_relation_entry); + ctl.keysize = sizeof(Oid); ctl.entrysize = sizeof(unlogged_relation_entry); - hash = hash_create("unlogged hash", 32, &ctl, HASH_ELEM); + ctl.hcxt = CurrentMemoryContext; + hash = hash_create("unlogged relation OIDs", 32, &ctl, + HASH_ELEM | HASH_BLOBS | HASH_CONTEXT); /* Scan the directory. */ dbspace_dir = AllocateDir(dbspacedirname); @@ -198,9 +199,8 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op) * Put the OID portion of the name into the hash table, if it * isn't already. */ - memset(ent.oid, 0, sizeof(ent.oid)); - memcpy(ent.oid, de->d_name, oidchars); - hash_search(hash, &ent, HASH_ENTER, NULL); + ent.reloid = atooid(de->d_name); + (void) hash_search(hash, &ent, HASH_ENTER, NULL); } /* Done with the first pass. */ @@ -224,7 +224,6 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op) { ForkNumber forkNum; int oidchars; - bool found; unlogged_relation_entry ent; /* Skip anything that doesn't look like a relation data file. */ @@ -238,14 +237,10 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op) /* * See whether the OID portion of the name shows up in the hash - * table. + * table. If so, nuke it! */ - memset(ent.oid, 0, sizeof(ent.oid)); - memcpy(ent.oid, de->d_name, oidchars); - hash_search(hash, &ent, HASH_FIND, &found); - - /* If so, nuke it! */ - if (found) + ent.reloid = atooid(de->d_name); + if (hash_search(hash, &ent, HASH_FIND, NULL)) { snprintf(rm_path, sizeof(rm_path), "%s/%s", dbspacedirname, de->d_name); diff --git a/src/backend/storage/ipc/shmem.c b/src/backend/storage/ipc/shmem.c index 97716f6aef..b0fc9f160d 100644 --- a/src/backend/storage/ipc/shmem.c +++ b/src/backend/storage/ipc/shmem.c @@ -292,7 +292,6 @@ void InitShmemIndex(void) { HASHCTL info; - int hash_flags; /* * Create the shared memory shmem index. @@ -304,11 +303,11 @@ InitShmemIndex(void) */ info.keysize = SHMEM_INDEX_KEYSIZE; info.entrysize = sizeof(ShmemIndexEnt); - hash_flags = HASH_ELEM; ShmemIndex = ShmemInitHash("ShmemIndex", SHMEM_INDEX_SIZE, SHMEM_INDEX_SIZE, - &info, hash_flags); + &info, + HASH_ELEM | HASH_STRINGS); } /* @@ -329,6 +328,11 @@ InitShmemIndex(void) * whose maximum size is certain, this should be equal to max_size; that * ensures that no run-time out-of-shared-memory failures can occur. * + * *infoP and hash_flags should specify at least the entry sizes and key + * comparison semantics (see hash_create()). Flag bits and values specific + * to shared-memory hash tables are added here, except that callers may + * choose to specify HASH_PARTITION and/or HASH_FIXED_SIZE. 
+ * * Note: before Postgres 9.0, this function returned NULL for some failure * cases. Now, it always throws error instead, so callers need not check * for NULL. diff --git a/src/backend/storage/ipc/standby.c b/src/backend/storage/ipc/standby.c index 52b2809dac..4ea3cf1f5c 100644 --- a/src/backend/storage/ipc/standby.c +++ b/src/backend/storage/ipc/standby.c @@ -81,7 +81,6 @@ InitRecoveryTransactionEnvironment(void) * Initialize the hash table for tracking the list of locks held by each * transaction. */ - memset(&hash_ctl, 0, sizeof(hash_ctl)); hash_ctl.keysize = sizeof(TransactionId); hash_ctl.entrysize = sizeof(RecoveryLockListsEntry); RecoveryLockLists = hash_create("RecoveryLockLists", diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c index d86566f455..53472dd21e 100644 --- a/src/backend/storage/lmgr/lock.c +++ b/src/backend/storage/lmgr/lock.c @@ -419,7 +419,6 @@ InitLocks(void) * Allocate hash table for LOCK structs. This stores per-locked-object * information. */ - MemSet(&info, 0, sizeof(info)); info.keysize = sizeof(LOCKTAG); info.entrysize = sizeof(LOCK); info.num_partitions = NUM_LOCK_PARTITIONS; diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c index 108e652179..26bcce9735 100644 --- a/src/backend/storage/lmgr/lwlock.c +++ b/src/backend/storage/lmgr/lwlock.c @@ -342,7 +342,6 @@ init_lwlock_stats(void) ALLOCSET_DEFAULT_SIZES); MemoryContextAllowInCriticalSection(lwlock_stats_cxt, true); - MemSet(&ctl, 0, sizeof(ctl)); ctl.keysize = sizeof(lwlock_stats_key); ctl.entrysize = sizeof(lwlock_stats); ctl.hcxt = lwlock_stats_cxt; diff --git a/src/backend/storage/lmgr/predicate.c b/src/backend/storage/lmgr/predicate.c index 8a365b400c..e42e131543 100644 --- a/src/backend/storage/lmgr/predicate.c +++ b/src/backend/storage/lmgr/predicate.c @@ -1096,7 +1096,6 @@ InitPredicateLocks(void) * Allocate hash table for PREDICATELOCKTARGET structs. This stores * per-predicate-lock-target information. */ - MemSet(&info, 0, sizeof(info)); info.keysize = sizeof(PREDICATELOCKTARGETTAG); info.entrysize = sizeof(PREDICATELOCKTARGET); info.num_partitions = NUM_PREDICATELOCK_PARTITIONS; @@ -1129,7 +1128,6 @@ InitPredicateLocks(void) * Allocate hash table for PREDICATELOCK structs. This stores per * xact-lock-of-a-target information. */ - MemSet(&info, 0, sizeof(info)); info.keysize = sizeof(PREDICATELOCKTAG); info.entrysize = sizeof(PREDICATELOCK); info.hash = predicatelock_hash; @@ -1212,7 +1210,6 @@ InitPredicateLocks(void) * Allocate hash table for SERIALIZABLEXID structs. This stores per-xid * information for serializable transactions which have accessed data. 
*/ - MemSet(&info, 0, sizeof(info)); info.keysize = sizeof(SERIALIZABLEXIDTAG); info.entrysize = sizeof(SERIALIZABLEXID); @@ -1853,7 +1850,6 @@ CreateLocalPredicateLockHash(void) /* Initialize the backend-local hash table of parent locks */ Assert(LocalPredicateLockHash == NULL); - MemSet(&hash_ctl, 0, sizeof(hash_ctl)); hash_ctl.keysize = sizeof(PREDICATELOCKTARGETTAG); hash_ctl.entrysize = sizeof(LOCALPREDICATELOCK); LocalPredicateLockHash = hash_create("Local predicate lock", diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c index dcc09df0c7..072bdd118f 100644 --- a/src/backend/storage/smgr/smgr.c +++ b/src/backend/storage/smgr/smgr.c @@ -154,7 +154,6 @@ smgropen(RelFileNode rnode, BackendId backend) /* First time through: initialize the hash table */ HASHCTL ctl; - MemSet(&ctl, 0, sizeof(ctl)); ctl.keysize = sizeof(RelFileNodeBackend); ctl.entrysize = sizeof(SMgrRelationData); SMgrRelationHash = hash_create("smgr relation table", 400, diff --git a/src/backend/storage/sync/sync.c b/src/backend/storage/sync/sync.c index 1d635d596c..a49588f6b9 100644 --- a/src/backend/storage/sync/sync.c +++ b/src/backend/storage/sync/sync.c @@ -150,7 +150,6 @@ InitSync(void) ALLOCSET_DEFAULT_SIZES); MemoryContextAllowInCriticalSection(pendingOpsCxt, true); - MemSet(&hash_ctl, 0, sizeof(hash_ctl)); hash_ctl.keysize = sizeof(FileTag); hash_ctl.entrysize = sizeof(PendingFsyncEntry); hash_ctl.hcxt = pendingOpsCxt; diff --git a/src/backend/tsearch/ts_typanalyze.c b/src/backend/tsearch/ts_typanalyze.c index 2eed0cd137..19e9611a3a 100644 --- a/src/backend/tsearch/ts_typanalyze.c +++ b/src/backend/tsearch/ts_typanalyze.c @@ -180,7 +180,6 @@ compute_tsvector_stats(VacAttrStats *stats, * worry about overflowing the initial size. Also we don't need to pay any * attention to locking and memory management. */ - MemSet(&hash_ctl, 0, sizeof(hash_ctl)); hash_ctl.keysize = sizeof(LexemeHashKey); hash_ctl.entrysize = sizeof(TrackItem); hash_ctl.hash = lexeme_hash; diff --git a/src/backend/utils/adt/array_typanalyze.c b/src/backend/utils/adt/array_typanalyze.c index 4912cabc61..cb2a834193 100644 --- a/src/backend/utils/adt/array_typanalyze.c +++ b/src/backend/utils/adt/array_typanalyze.c @@ -277,7 +277,6 @@ compute_array_stats(VacAttrStats *stats, AnalyzeAttrFetchFunc fetchfunc, * worry about overflowing the initial size. Also we don't need to pay any * attention to locking and memory management. 
*/ - MemSet(&elem_hash_ctl, 0, sizeof(elem_hash_ctl)); elem_hash_ctl.keysize = sizeof(Datum); elem_hash_ctl.entrysize = sizeof(TrackItem); elem_hash_ctl.hash = element_hash; @@ -289,7 +288,6 @@ compute_array_stats(VacAttrStats *stats, AnalyzeAttrFetchFunc fetchfunc, HASH_ELEM | HASH_FUNCTION | HASH_COMPARE | HASH_CONTEXT); /* hashtable for array distinct elements counts */ - MemSet(&count_hash_ctl, 0, sizeof(count_hash_ctl)); count_hash_ctl.keysize = sizeof(int); count_hash_ctl.entrysize = sizeof(DECountItem); count_hash_ctl.hcxt = CurrentMemoryContext; diff --git a/src/backend/utils/adt/jsonfuncs.c b/src/backend/utils/adt/jsonfuncs.c index 12557ce3af..7a25415078 100644 --- a/src/backend/utils/adt/jsonfuncs.c +++ b/src/backend/utils/adt/jsonfuncs.c @@ -3439,14 +3439,13 @@ get_json_object_as_hash(char *json, int len, const char *funcname) JsonLexContext *lex = makeJsonLexContextCstringLen(json, len, GetDatabaseEncoding(), true); JsonSemAction *sem; - memset(&ctl, 0, sizeof(ctl)); ctl.keysize = NAMEDATALEN; ctl.entrysize = sizeof(JsonHashEntry); ctl.hcxt = CurrentMemoryContext; tab = hash_create("json object hashtable", 100, &ctl, - HASH_ELEM | HASH_CONTEXT); + HASH_ELEM | HASH_STRINGS | HASH_CONTEXT); state = palloc0(sizeof(JHashState)); sem = palloc0(sizeof(JsonSemAction)); @@ -3831,14 +3830,13 @@ populate_recordset_object_start(void *state) return; /* Object at level 1: set up a new hash table for this object */ - memset(&ctl, 0, sizeof(ctl)); ctl.keysize = NAMEDATALEN; ctl.entrysize = sizeof(JsonHashEntry); ctl.hcxt = CurrentMemoryContext; _state->json_hash = hash_create("json object hashtable", 100, &ctl, - HASH_ELEM | HASH_CONTEXT); + HASH_ELEM | HASH_STRINGS | HASH_CONTEXT); } static void diff --git a/src/backend/utils/adt/pg_locale.c b/src/backend/utils/adt/pg_locale.c index b6d05ac98d..c39d67645c 100644 --- a/src/backend/utils/adt/pg_locale.c +++ b/src/backend/utils/adt/pg_locale.c @@ -1297,7 +1297,6 @@ lookup_collation_cache(Oid collation, bool set_flags) /* First time through, initialize the hash table */ HASHCTL ctl; - memset(&ctl, 0, sizeof(ctl)); ctl.keysize = sizeof(Oid); ctl.entrysize = sizeof(collation_cache_entry); collation_cache = hash_create("Collation cache", 100, &ctl, diff --git a/src/backend/utils/adt/ri_triggers.c b/src/backend/utils/adt/ri_triggers.c index 02b1a3868f..5ab134a853 100644 --- a/src/backend/utils/adt/ri_triggers.c +++ b/src/backend/utils/adt/ri_triggers.c @@ -2540,7 +2540,6 @@ ri_InitHashTables(void) { HASHCTL ctl; - memset(&ctl, 0, sizeof(ctl)); ctl.keysize = sizeof(Oid); ctl.entrysize = sizeof(RI_ConstraintInfo); ri_constraint_cache = hash_create("RI constraint cache", @@ -2552,14 +2551,12 @@ ri_InitHashTables(void) InvalidateConstraintCacheCallBack, (Datum) 0); - memset(&ctl, 0, sizeof(ctl)); ctl.keysize = sizeof(RI_QueryKey); ctl.entrysize = sizeof(RI_QueryHashEntry); ri_query_cache = hash_create("RI query cache", RI_INIT_QUERYHASHSIZE, &ctl, HASH_ELEM | HASH_BLOBS); - memset(&ctl, 0, sizeof(ctl)); ctl.keysize = sizeof(RI_CompareKey); ctl.entrysize = sizeof(RI_CompareHashEntry); ri_compare_cache = hash_create("RI compare cache", diff --git a/src/backend/utils/adt/ruleutils.c b/src/backend/utils/adt/ruleutils.c index ad582f99a5..7d4443e807 100644 --- a/src/backend/utils/adt/ruleutils.c +++ b/src/backend/utils/adt/ruleutils.c @@ -3464,14 +3464,14 @@ set_rtable_names(deparse_namespace *dpns, List *parent_namespaces, * We use a hash table to hold known names, so that this process is O(N) * not O(N^2) for N names. 
*/ - MemSet(&hash_ctl, 0, sizeof(hash_ctl)); hash_ctl.keysize = NAMEDATALEN; hash_ctl.entrysize = sizeof(NameHashEntry); hash_ctl.hcxt = CurrentMemoryContext; names_hash = hash_create("set_rtable_names names", list_length(dpns->rtable), &hash_ctl, - HASH_ELEM | HASH_CONTEXT); + HASH_ELEM | HASH_STRINGS | HASH_CONTEXT); + /* Preload the hash table with names appearing in parent_namespaces */ foreach(lc, parent_namespaces) { diff --git a/src/backend/utils/cache/attoptcache.c b/src/backend/utils/cache/attoptcache.c index 05ac366b40..934a84e03f 100644 --- a/src/backend/utils/cache/attoptcache.c +++ b/src/backend/utils/cache/attoptcache.c @@ -79,7 +79,6 @@ InitializeAttoptCache(void) HASHCTL ctl; /* Initialize the hash table. */ - MemSet(&ctl, 0, sizeof(ctl)); ctl.keysize = sizeof(AttoptCacheKey); ctl.entrysize = sizeof(AttoptCacheEntry); AttoptCacheHash = diff --git a/src/backend/utils/cache/evtcache.c b/src/backend/utils/cache/evtcache.c index 0427795395..0877bc7e0e 100644 --- a/src/backend/utils/cache/evtcache.c +++ b/src/backend/utils/cache/evtcache.c @@ -118,7 +118,6 @@ BuildEventTriggerCache(void) EventTriggerCacheState = ETCS_REBUILD_STARTED; /* Create new hash table. */ - MemSet(&ctl, 0, sizeof(ctl)); ctl.keysize = sizeof(EventTriggerEvent); ctl.entrysize = sizeof(EventTriggerCacheEntry); ctl.hcxt = EventTriggerCacheContext; diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c index 66393becfb..3bd5e18042 100644 --- a/src/backend/utils/cache/relcache.c +++ b/src/backend/utils/cache/relcache.c @@ -1607,7 +1607,6 @@ LookupOpclassInfo(Oid operatorClassOid, /* First time through: initialize the opclass cache */ HASHCTL ctl; - MemSet(&ctl, 0, sizeof(ctl)); ctl.keysize = sizeof(Oid); ctl.entrysize = sizeof(OpClassCacheEnt); OpClassCache = hash_create("Operator class cache", 64, @@ -3775,7 +3774,6 @@ RelationCacheInitialize(void) /* * create hashtable that indexes the relcache */ - MemSet(&ctl, 0, sizeof(ctl)); ctl.keysize = sizeof(Oid); ctl.entrysize = sizeof(RelIdCacheEnt); RelationIdCache = hash_create("Relcache by OID", INITRELCACHESIZE, diff --git a/src/backend/utils/cache/relfilenodemap.c b/src/backend/utils/cache/relfilenodemap.c index 0dbdbff603..38e6379974 100644 --- a/src/backend/utils/cache/relfilenodemap.c +++ b/src/backend/utils/cache/relfilenodemap.c @@ -110,17 +110,15 @@ InitializeRelfilenodeMap(void) relfilenode_skey[0].sk_attno = Anum_pg_class_reltablespace; relfilenode_skey[1].sk_attno = Anum_pg_class_relfilenode; - /* Initialize the hash table. */ - MemSet(&ctl, 0, sizeof(ctl)); - ctl.keysize = sizeof(RelfilenodeMapKey); - ctl.entrysize = sizeof(RelfilenodeMapEntry); - ctl.hcxt = CacheMemoryContext; - /* * Only create the RelfilenodeMapHash now, so we don't end up partially * initialized when fmgr_info_cxt() above ERRORs out with an out of memory * error. */ + ctl.keysize = sizeof(RelfilenodeMapKey); + ctl.entrysize = sizeof(RelfilenodeMapEntry); + ctl.hcxt = CacheMemoryContext; + RelfilenodeMapHash = hash_create("RelfilenodeMap cache", 64, &ctl, HASH_ELEM | HASH_BLOBS | HASH_CONTEXT); diff --git a/src/backend/utils/cache/spccache.c b/src/backend/utils/cache/spccache.c index e0c3c1b1c1..c8387e2541 100644 --- a/src/backend/utils/cache/spccache.c +++ b/src/backend/utils/cache/spccache.c @@ -79,7 +79,6 @@ InitializeTableSpaceCache(void) HASHCTL ctl; /* Initialize the hash table. 
*/ - MemSet(&ctl, 0, sizeof(ctl)); ctl.keysize = sizeof(Oid); ctl.entrysize = sizeof(TableSpaceCacheEntry); TableSpaceCacheHash = diff --git a/src/backend/utils/cache/ts_cache.c b/src/backend/utils/cache/ts_cache.c index f9f7912cb8..a2867fac7d 100644 --- a/src/backend/utils/cache/ts_cache.c +++ b/src/backend/utils/cache/ts_cache.c @@ -117,7 +117,6 @@ lookup_ts_parser_cache(Oid prsId) /* First time through: initialize the hash table */ HASHCTL ctl; - MemSet(&ctl, 0, sizeof(ctl)); ctl.keysize = sizeof(Oid); ctl.entrysize = sizeof(TSParserCacheEntry); TSParserCacheHash = hash_create("Tsearch parser cache", 4, @@ -215,7 +214,6 @@ lookup_ts_dictionary_cache(Oid dictId) /* First time through: initialize the hash table */ HASHCTL ctl; - MemSet(&ctl, 0, sizeof(ctl)); ctl.keysize = sizeof(Oid); ctl.entrysize = sizeof(TSDictionaryCacheEntry); TSDictionaryCacheHash = hash_create("Tsearch dictionary cache", 8, @@ -365,7 +363,6 @@ init_ts_config_cache(void) { HASHCTL ctl; - MemSet(&ctl, 0, sizeof(ctl)); ctl.keysize = sizeof(Oid); ctl.entrysize = sizeof(TSConfigCacheEntry); TSConfigCacheHash = hash_create("Tsearch configuration cache", 16, diff --git a/src/backend/utils/cache/typcache.c b/src/backend/utils/cache/typcache.c index 5883fde367..1e331098c0 100644 --- a/src/backend/utils/cache/typcache.c +++ b/src/backend/utils/cache/typcache.c @@ -341,7 +341,6 @@ lookup_type_cache(Oid type_id, int flags) /* First time through: initialize the hash table */ HASHCTL ctl; - MemSet(&ctl, 0, sizeof(ctl)); ctl.keysize = sizeof(Oid); ctl.entrysize = sizeof(TypeCacheEntry); TypeCacheHash = hash_create("Type information cache", 64, @@ -1874,7 +1873,6 @@ assign_record_type_typmod(TupleDesc tupDesc) /* First time through: initialize the hash table */ HASHCTL ctl; - MemSet(&ctl, 0, sizeof(ctl)); ctl.keysize = sizeof(TupleDesc); /* just the pointer */ ctl.entrysize = sizeof(RecordCacheEntry); ctl.hash = record_type_typmod_hash; diff --git a/src/backend/utils/fmgr/dfmgr.c b/src/backend/utils/fmgr/dfmgr.c index bd779fdaf7..adb31e109f 100644 --- a/src/backend/utils/fmgr/dfmgr.c +++ b/src/backend/utils/fmgr/dfmgr.c @@ -680,13 +680,12 @@ find_rendezvous_variable(const char *varName) { HASHCTL ctl; - MemSet(&ctl, 0, sizeof(ctl)); ctl.keysize = NAMEDATALEN; ctl.entrysize = sizeof(rendezvousHashEntry); rendezvousHash = hash_create("Rendezvous variable hash", 16, &ctl, - HASH_ELEM); + HASH_ELEM | HASH_STRINGS); } /* Find or create the hashtable entry for this varName */ diff --git a/src/backend/utils/fmgr/fmgr.c b/src/backend/utils/fmgr/fmgr.c index 2681b7fbc6..fa5f7ac615 100644 --- a/src/backend/utils/fmgr/fmgr.c +++ b/src/backend/utils/fmgr/fmgr.c @@ -565,7 +565,6 @@ record_C_func(HeapTuple procedureTuple, { HASHCTL hash_ctl; - MemSet(&hash_ctl, 0, sizeof(hash_ctl)); hash_ctl.keysize = sizeof(Oid); hash_ctl.entrysize = sizeof(CFuncHashTabEntry); CFuncHash = hash_create("CFuncHash", diff --git a/src/backend/utils/hash/dynahash.c b/src/backend/utils/hash/dynahash.c index d14d875c93..fbd849b8f7 100644 --- a/src/backend/utils/hash/dynahash.c +++ b/src/backend/utils/hash/dynahash.c @@ -30,11 +30,12 @@ * dynahash.c provides support for these types of lookup keys: * * 1. Null-terminated C strings (truncated if necessary to fit in keysize), - * compared as though by strcmp(). This is the default behavior. + * compared as though by strcmp(). This is selected by specifying the + * HASH_STRINGS flag to hash_create. * * 2. Arbitrary binary data of size keysize, compared as though by memcmp(). 
* (Caller must ensure there are no undefined padding bits in the keys!) - * This is selected by specifying HASH_BLOBS flag to hash_create. + * This is selected by specifying the HASH_BLOBS flag to hash_create. * * 3. More complex key behavior can be selected by specifying user-supplied * hashing, comparison, and/or key-copying functions. At least a hashing @@ -47,8 +48,8 @@ * locks. * - Shared memory hashes are allocated in a fixed size area at startup and * are discoverable by name from other processes. - * - Because entries don't need to be moved in the case of hash conflicts, has - * better performance for large entries + * - Because entries don't need to be moved in the case of hash conflicts, + * dynahash has better performance for large entries. * - Guarantees stable pointers to entries. * * Portions Copyright (c) 1996-2020, PostgreSQL Global Development Group @@ -316,6 +317,28 @@ string_compare(const char *key1, const char *key2, Size keysize) * *info: additional table parameters, as indicated by flags * flags: bitmask indicating which parameters to take from *info * + * The flags value *must* include HASH_ELEM. (Formerly, this was nominally + * optional, but the default keysize and entrysize values were useless.) + * The flags value must also include exactly one of HASH_STRINGS, HASH_BLOBS, + * or HASH_FUNCTION, to define the key hashing semantics (C strings, + * binary blobs, or custom, respectively). Callers specifying a custom + * hash function will likely also want to use HASH_COMPARE, and perhaps + * also HASH_KEYCOPY, to control key comparison and copying. + * Another often-used flag is HASH_CONTEXT, to allocate the hash table + * under info->hcxt rather than under TopMemoryContext; the default + * behavior is only suitable for session-lifespan hash tables. + * Other flags bits are special-purpose and seldom used, except for those + * associated with shared-memory hash tables, for which see ShmemInitHash(). + * + * Fields in *info are read only when the associated flags bit is set. + * It is not necessary to initialize other fields of *info. + * Neither tabname nor *info need persist after the hash_create() call. + * + * Note: It is deprecated for callers of hash_create() to explicitly specify + * string_hash, tag_hash, uint32_hash, or oid_hash. Just set HASH_BLOBS or + * HASH_STRINGS. Use HASH_FUNCTION only when you want something other than + * one of these. + * * Note: for a shared-memory hashtable, nelem needs to be a pretty good * estimate, since we can't expand the table on the fly. But an unshared * hashtable can be expanded on-the-fly, so it's better for nelem to be @@ -323,11 +346,19 @@ string_compare(const char *key1, const char *key2, Size keysize) * large nelem will penalize hash_seq_search speed without buying much. */ HTAB * -hash_create(const char *tabname, long nelem, HASHCTL *info, int flags) +hash_create(const char *tabname, long nelem, const HASHCTL *info, int flags) { HTAB *hashp; HASHHDR *hctl; + /* + * Hash tables now allocate space for key and data, but you have to say + * how much space to allocate. + */ + Assert(flags & HASH_ELEM); + Assert(info->keysize > 0); + Assert(info->entrysize >= info->keysize); + /* * For shared hash tables, we have a local hash header (HTAB struct) that * we allocate in TopMemoryContext; all else is in shared memory. @@ -370,28 +401,43 @@ hash_create(const char *tabname, long nelem, HASHCTL *info, int flags) * Select the appropriate hash function (see comments at head of file). 
*/ if (flags & HASH_FUNCTION) + { + Assert(!(flags & (HASH_BLOBS | HASH_STRINGS))); hashp->hash = info->hash; + } else if (flags & HASH_BLOBS) { + Assert(!(flags & HASH_STRINGS)); /* We can optimize hashing for common key sizes */ - Assert(flags & HASH_ELEM); if (info->keysize == sizeof(uint32)) hashp->hash = uint32_hash; else hashp->hash = tag_hash; } else - hashp->hash = string_hash; /* default hash function */ + { + /* + * string_hash used to be considered the default hash method, and in a + * non-assert build it effectively still is. But we now consider it + * an assertion error to not say HASH_STRINGS explicitly. To help + * catch mistaken usage of HASH_STRINGS, we also insist on a + * reasonably long string length: if the keysize is only 4 or 8 bytes, + * it's almost certainly an integer or pointer not a string. + */ + Assert(flags & HASH_STRINGS); + Assert(info->keysize > 8); + + hashp->hash = string_hash; + } /* * If you don't specify a match function, it defaults to string_compare if - * you used string_hash (either explicitly or by default) and to memcmp - * otherwise. + * you used string_hash, and to memcmp otherwise. * * Note: explicitly specifying string_hash is deprecated, because this * might not work for callers in loadable modules on some platforms due to * referencing a trampoline instead of the string_hash function proper. - * Just let it default, eh? + * Specify HASH_STRINGS instead. */ if (flags & HASH_COMPARE) hashp->match = info->match; @@ -505,16 +551,9 @@ hash_create(const char *tabname, long nelem, HASHCTL *info, int flags) hctl->dsize = info->dsize; } - /* - * hash table now allocates space for key and data but you have to say how - * much space to allocate - */ - if (flags & HASH_ELEM) - { - Assert(info->entrysize >= info->keysize); - hctl->keysize = info->keysize; - hctl->entrysize = info->entrysize; - } + /* remember the entry sizes, too */ + hctl->keysize = info->keysize; + hctl->entrysize = info->entrysize; /* make local copies of heavily-used constant fields */ hashp->keysize = hctl->keysize; @@ -593,10 +632,6 @@ hdefault(HTAB *hashp) hctl->dsize = DEF_DIRSIZE; hctl->nsegs = 0; - /* rather pointless defaults for key & entry size */ - hctl->keysize = sizeof(char *); - hctl->entrysize = 2 * sizeof(char *); - hctl->num_partitions = 0; /* not partitioned */ /* table has no fixed maximum size */ diff --git a/src/backend/utils/mmgr/portalmem.c b/src/backend/utils/mmgr/portalmem.c index ec6f80ee99..283dfe2d9e 100644 --- a/src/backend/utils/mmgr/portalmem.c +++ b/src/backend/utils/mmgr/portalmem.c @@ -119,7 +119,7 @@ EnablePortalManager(void) * create, initially */ PortalHashTable = hash_create("Portal hash", PORTALS_PER_USER, - &ctl, HASH_ELEM); + &ctl, HASH_ELEM | HASH_STRINGS); } /* diff --git a/src/backend/utils/time/combocid.c b/src/backend/utils/time/combocid.c index 4ee9ef0ffe..9626f98100 100644 --- a/src/backend/utils/time/combocid.c +++ b/src/backend/utils/time/combocid.c @@ -223,7 +223,6 @@ GetComboCommandId(CommandId cmin, CommandId cmax) sizeComboCids = CCID_ARRAY_SIZE; usedComboCids = 0; - memset(&hash_ctl, 0, sizeof(hash_ctl)); hash_ctl.keysize = sizeof(ComboCidKeyData); hash_ctl.entrysize = sizeof(ComboCidEntryData); hash_ctl.hcxt = TopTransactionContext; diff --git a/src/include/utils/hsearch.h b/src/include/utils/hsearch.h index bebf89b3c4..13c6602217 100644 --- a/src/include/utils/hsearch.h +++ b/src/include/utils/hsearch.h @@ -64,25 +64,36 @@ typedef struct HTAB HTAB; /* Only those fields indicated by hash_flags need be set */ typedef struct 
HASHCTL { + /* Used if HASH_PARTITION flag is set: */ long num_partitions; /* # partitions (must be power of 2) */ + /* Used if HASH_SEGMENT flag is set: */ long ssize; /* segment size */ + /* Used if HASH_DIRSIZE flag is set: */ long dsize; /* (initial) directory size */ long max_dsize; /* limit to dsize if dir size is limited */ + /* Used if HASH_ELEM flag is set (which is now required): */ Size keysize; /* hash key length in bytes */ Size entrysize; /* total user element size in bytes */ + /* Used if HASH_FUNCTION flag is set: */ HashValueFunc hash; /* hash function */ + /* Used if HASH_COMPARE flag is set: */ HashCompareFunc match; /* key comparison function */ + /* Used if HASH_KEYCOPY flag is set: */ HashCopyFunc keycopy; /* key copying function */ + /* Used if HASH_ALLOC flag is set: */ HashAllocFunc alloc; /* memory allocator */ + /* Used if HASH_CONTEXT flag is set: */ MemoryContext hcxt; /* memory context to use for allocations */ + /* Used if HASH_SHARED_MEM flag is set: */ HASHHDR *hctl; /* location of header in shared mem */ } HASHCTL; -/* Flags to indicate which parameters are supplied */ +/* Flag bits for hash_create; most indicate which parameters are supplied */ #define HASH_PARTITION 0x0001 /* Hashtable is used w/partitioned locking */ #define HASH_SEGMENT 0x0002 /* Set segment size */ #define HASH_DIRSIZE 0x0004 /* Set directory size (initial and max) */ -#define HASH_ELEM 0x0010 /* Set keysize and entrysize */ +#define HASH_ELEM 0x0008 /* Set keysize and entrysize (now required!) */ +#define HASH_STRINGS 0x0010 /* Select support functions for string keys */ #define HASH_BLOBS 0x0020 /* Select support functions for binary keys */ #define HASH_FUNCTION 0x0040 /* Set user defined hash function */ #define HASH_COMPARE 0x0080 /* Set user defined comparison function */ @@ -93,7 +104,6 @@ typedef struct HASHCTL #define HASH_ATTACH 0x1000 /* Do not initialize hctl */ #define HASH_FIXED_SIZE 0x2000 /* Initial size is a hard limit */ - /* max_dsize value to indicate expansible directory */ #define NO_MAX_DSIZE (-1) @@ -116,13 +126,9 @@ typedef struct /* * prototypes for functions in dynahash.c - * - * Note: It is deprecated for callers of hash_create to explicitly specify - * string_hash, tag_hash, uint32_hash, or oid_hash. Just set HASH_BLOBS or - * not. Use HASH_FUNCTION only when you want something other than those. */ extern HTAB *hash_create(const char *tabname, long nelem, - HASHCTL *info, int flags); + const HASHCTL *info, int flags); extern void hash_destroy(HTAB *hashp); extern void hash_stats(const char *where, HTAB *hashp); extern void *hash_search(HTAB *hashp, const void *keyPtr, HASHACTION action, diff --git a/src/pl/plperl/plperl.c b/src/pl/plperl/plperl.c index 4de756455d..6299adf71a 100644 --- a/src/pl/plperl/plperl.c +++ b/src/pl/plperl/plperl.c @@ -458,7 +458,6 @@ _PG_init(void) /* * Create hash tables. 
*/ - memset(&hash_ctl, 0, sizeof(hash_ctl)); hash_ctl.keysize = sizeof(Oid); hash_ctl.entrysize = sizeof(plperl_interp_desc); plperl_interp_hash = hash_create("PL/Perl interpreters", @@ -466,7 +465,6 @@ _PG_init(void) &hash_ctl, HASH_ELEM | HASH_BLOBS); - memset(&hash_ctl, 0, sizeof(hash_ctl)); hash_ctl.keysize = sizeof(plperl_proc_key); hash_ctl.entrysize = sizeof(plperl_proc_ptr); plperl_proc_hash = hash_create("PL/Perl procedures", @@ -580,13 +578,12 @@ select_perl_context(bool trusted) { HASHCTL hash_ctl; - memset(&hash_ctl, 0, sizeof(hash_ctl)); hash_ctl.keysize = NAMEDATALEN; hash_ctl.entrysize = sizeof(plperl_query_entry); interp_desc->query_hash = hash_create("PL/Perl queries", 32, &hash_ctl, - HASH_ELEM); + HASH_ELEM | HASH_STRINGS); } /* diff --git a/src/pl/plpgsql/src/pl_comp.c b/src/pl/plpgsql/src/pl_comp.c index b610b28d70..555da952e1 100644 --- a/src/pl/plpgsql/src/pl_comp.c +++ b/src/pl/plpgsql/src/pl_comp.c @@ -2567,7 +2567,6 @@ plpgsql_HashTableInit(void) /* don't allow double-initialization */ Assert(plpgsql_HashTable == NULL); - memset(&ctl, 0, sizeof(ctl)); ctl.keysize = sizeof(PLpgSQL_func_hashkey); ctl.entrysize = sizeof(plpgsql_HashEnt); plpgsql_HashTable = hash_create("PLpgSQL function hash", diff --git a/src/pl/plpgsql/src/pl_exec.c b/src/pl/plpgsql/src/pl_exec.c index ccbc50fc45..112f6ab0ae 100644 --- a/src/pl/plpgsql/src/pl_exec.c +++ b/src/pl/plpgsql/src/pl_exec.c @@ -4058,7 +4058,6 @@ plpgsql_estate_setup(PLpgSQL_execstate *estate, { estate->simple_eval_estate = simple_eval_estate; /* Private cast hash just lives in function's main context */ - memset(&ctl, 0, sizeof(ctl)); ctl.keysize = sizeof(plpgsql_CastHashKey); ctl.entrysize = sizeof(plpgsql_CastHashEntry); ctl.hcxt = CurrentMemoryContext; @@ -4077,7 +4076,6 @@ plpgsql_estate_setup(PLpgSQL_execstate *estate, shared_cast_context = AllocSetContextCreate(TopMemoryContext, "PLpgSQL cast info", ALLOCSET_DEFAULT_SIZES); - memset(&ctl, 0, sizeof(ctl)); ctl.keysize = sizeof(plpgsql_CastHashKey); ctl.entrysize = sizeof(plpgsql_CastHashEntry); ctl.hcxt = shared_cast_context; diff --git a/src/pl/plpython/plpy_plpymodule.c b/src/pl/plpython/plpy_plpymodule.c index 7f54d093ac..0365acc95b 100644 --- a/src/pl/plpython/plpy_plpymodule.c +++ b/src/pl/plpython/plpy_plpymodule.c @@ -214,7 +214,6 @@ PLy_add_exceptions(PyObject *plpy) PLy_exc_spi_error = PLy_create_exception("plpy.SPIError", NULL, NULL, "SPIError", plpy); - memset(&hash_ctl, 0, sizeof(hash_ctl)); hash_ctl.keysize = sizeof(int); hash_ctl.entrysize = sizeof(PLyExceptionEntry); PLy_spi_exceptions = hash_create("PL/Python SPI exceptions", 256, diff --git a/src/pl/plpython/plpy_procedure.c b/src/pl/plpython/plpy_procedure.c index 1f05c633ef..b7c0b5cebe 100644 --- a/src/pl/plpython/plpy_procedure.c +++ b/src/pl/plpython/plpy_procedure.c @@ -34,7 +34,6 @@ init_procedure_caches(void) { HASHCTL hash_ctl; - memset(&hash_ctl, 0, sizeof(hash_ctl)); hash_ctl.keysize = sizeof(PLyProcedureKey); hash_ctl.entrysize = sizeof(PLyProcedureEntry); PLy_procedure_cache = hash_create("PL/Python procedures", 32, &hash_ctl, diff --git a/src/pl/tcl/pltcl.c b/src/pl/tcl/pltcl.c index a3a2dc8e89..e11837559d 100644 --- a/src/pl/tcl/pltcl.c +++ b/src/pl/tcl/pltcl.c @@ -439,7 +439,6 @@ _PG_init(void) /************************************************************ * Create the hash table for working interpreters ************************************************************/ - memset(&hash_ctl, 0, sizeof(hash_ctl)); hash_ctl.keysize = sizeof(Oid); hash_ctl.entrysize = sizeof(pltcl_interp_desc); 
pltcl_interp_htab = hash_create("PL/Tcl interpreters", @@ -450,7 +449,6 @@ _PG_init(void) /************************************************************ * Create the hash table for function lookup ************************************************************/ - memset(&hash_ctl, 0, sizeof(hash_ctl)); hash_ctl.keysize = sizeof(pltcl_proc_key); hash_ctl.entrysize = sizeof(pltcl_proc_ptr); pltcl_proc_htab = hash_create("PL/Tcl functions", diff --git a/src/timezone/pgtz.c b/src/timezone/pgtz.c index 3f0fb51e91..4a360f5077 100644 --- a/src/timezone/pgtz.c +++ b/src/timezone/pgtz.c @@ -203,15 +203,13 @@ init_timezone_hashtable(void) { HASHCTL hash_ctl; - MemSet(&hash_ctl, 0, sizeof(hash_ctl)); - hash_ctl.keysize = TZ_STRLEN_MAX + 1; hash_ctl.entrysize = sizeof(pg_tz_cache); timezone_cache = hash_create("Timezones", 4, &hash_ctl, - HASH_ELEM); + HASH_ELEM | HASH_STRINGS); if (!timezone_cache) return false;
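For contrast with the binary-key case, a string-keyed table under the same rules mirrors callers such as the prepared-statement and portal hashes converted above; the names here are hypothetical. Note that the new assertion insists on keysize > 8 for HASH_STRINGS, on the theory that a 4- or 8-byte "string" key is almost certainly a mistaken integer or pointer.

    #include "postgres.h"

    #include "utils/hsearch.h"

    #define MY_NAME_LEN 64              /* must be > 8 to satisfy the new Assert */

    typedef struct MyNamedEntry
    {
        char        name[MY_NAME_LEN];  /* null-terminated key; must be first */
        void       *payload;
    } MyNamedEntry;

    static HTAB *my_named_hash = NULL;

    static void
    init_my_named_hash(void)
    {
        HASHCTL     ctl;

        ctl.keysize = MY_NAME_LEN;
        ctl.entrysize = sizeof(MyNamedEntry);

        /* HASH_STRINGS selects string_hash and strcmp-style key comparison. */
        my_named_hash = hash_create("my named hash", 32, &ctl,
                                    HASH_ELEM | HASH_STRINGS);
    }

Keys longer than keysize - 1 are truncated, per the dynahash header comment, so the key buffer must be sized for the longest expected name.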
Re: HASH_BLOBS hazards (was Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions)
From
Noah Misch
Date:
On Mon, Dec 14, 2020 at 01:59:03PM -0500, Tom Lane wrote: > * Should we just have a blanket insistence that all callers supply > HASH_ELEM? The default sizes that dynahash.c uses without that are > undocumented and basically useless. +1 > we should just rip out all those memsets as pointless, since there's > basically no case where you'd use the memset to fill a field that > you meant to pass as zero. The fact that hash_create() doesn't > read fields it's not told to by a flag means we should not need > the memsets to avoid uninitialized-memory reads. On Mon, Dec 14, 2020 at 06:55:20PM -0500, Tom Lane wrote: > Here's a rolled-up patch that does some further documentation work > and gets rid of the unnecessary memset's as well. +1 on removing the memset() calls. That said, it's not a big deal if more creep in over time; it doesn't qualify as a project policy violation. > @@ -329,6 +328,11 @@ InitShmemIndex(void) > * whose maximum size is certain, this should be equal to max_size; that > * ensures that no run-time out-of-shared-memory failures can occur. > * > + * *infoP and hash_flags should specify at least the entry sizes and key s/should/must/
Re: HASH_BLOBS hazards (was Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions)
From
Tom Lane
Date:
Noah Misch <noah@leadboat.com> writes: > On Mon, Dec 14, 2020 at 01:59:03PM -0500, Tom Lane wrote: >> Here's a rolled-up patch that does some further documentation work >> and gets rid of the unnecessary memset's as well. > +1 on removing the memset() calls. That said, it's not a big deal if more > creep in over time; it doesn't qualify as a project policy violation. Right, that part is just neatnik-ism. Neither the calls with memset nor the ones without are buggy. >> + * *infoP and hash_flags should specify at least the entry sizes and key > s/should/must/ OK; thanks for reviewing! regards, tom lane
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Amit Kapila
Date:
Tom Lane has raised a complaint on pgsql-committers [1] about one of the commits related to this work [2]. The new buildfarm member wrasse is showing the warning: "/export/home/nm/farm/studio64v12_6/HEAD/pgsql.build/../pgsql/src/backend/replication/logical/reorderbuffer.c", line 2510: Warning: Likely null pointer dereference (*(curtxn+272)): ReorderBufferProcessTXN The warning is for the line: curtxn->concurrent_abort = true; Now, we can simply fix this warning by adding an if check like: if (curtxn) curtxn->concurrent_abort = true; However, on further discussion, it seems that is not sufficient here, because the callbacks can throw the error code handled by the surrounding code (ERRCODE_TRANSACTION_ROLLBACK, where we set the concurrent_abort flag) for a completely different scenario. I think we need a stronger check here to ensure that we set the concurrent-abort flag and do the other work in that branch only when we are decoding non-committed xacts. The idea I have is to additionally check that we are decoding a streaming or prepared transaction (the same check as we have for setting curtxn), or we can check whether CheckXidAlive is a valid transaction id. What do you think? [1] - https://www.postgresql.org/message-id/2752962.1619568098%40sss.pgh.pa.us [2] - https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=7259736a6e5b7c7588fff9578370736a6648acbb -- With Regards, Amit Kapila.
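For readers following along, here is a sketch of what the stronger check could look like inside ReorderBufferProcessTXN's error handler. It illustrates the idea described above rather than the committed fix; errdata, stream_started, curtxn, and rbtxn_prepared() follow reorderbuffer.c, but the exact condition is an assumption.

    /*
     * Illustrative sketch only (inside the PG_CATCH block of
     * ReorderBufferProcessTXN): treat ERRCODE_TRANSACTION_ROLLBACK as a
     * concurrent abort only while decoding an uncommitted transaction,
     * i.e. a streamed or prepared one; otherwise just re-throw.
     */
    if (errdata->sqlerrcode == ERRCODE_TRANSACTION_ROLLBACK &&
        (stream_started || rbtxn_prepared(txn)))
    {
        /* curtxn is only set for streamed/prepared xacts, so it is non-NULL here */
        curtxn->concurrent_abort = true;

        /* ... proceed with the existing concurrent-abort cleanup ... */
    }
    else
    {
        /* Not a concurrent abort: propagate the original error. */
        ReThrowError(errdata);
    }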
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Dilip Kumar
Date:
On Wed, Apr 28, 2021 at 11:00 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > Tom Lane has raised a complaint on pgsql-committers [1] about one of > the commits related to this work [2]. The new buildfarm member wrasse is showing > the warning: > > "/export/home/nm/farm/studio64v12_6/HEAD/pgsql.build/../pgsql/src/backend/replication/logical/reorderbuffer.c", > line 2510: Warning: Likely null pointer dereference (*(curtxn+272)): > ReorderBufferProcessTXN > > The warning is for the line: > curtxn->concurrent_abort = true; > > Now, we can simply fix this warning by adding an if check like: > if (curtxn) > curtxn->concurrent_abort = true; > > However, on further discussion, it seems that is not sufficient here, > because the callbacks can throw the error code handled by the surrounding > code (ERRCODE_TRANSACTION_ROLLBACK, where we set the concurrent_abort flag) > for a completely different scenario. I think we need a stronger check here > to ensure that we set the concurrent-abort flag and do the other work in > that branch only when we are decoding non-committed xacts. That makes sense. > The idea I have is to additionally check that we are decoding > a streaming or prepared transaction (the same check as we have for > setting curtxn), or we can check whether CheckXidAlive is a valid > transaction id. What do you think? I think a check based on CheckXidAlive looks good to me. This will protect against cases where a similar error is raised from some other path, as you mentioned above. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Amit Kapila
Date:
On Wed, Apr 28, 2021 at 11:03 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Wed, Apr 28, 2021 at 11:00 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > The idea I have is to additionally check that we are decoding > > a streaming or prepared transaction (the same check as we have for > > setting curtxn), or we can check whether CheckXidAlive is a valid > > transaction id. What do you think? > > I think a check based on CheckXidAlive looks good to me. This will > protect against cases where a similar error is raised from some other path, > as you mentioned above. We can't use CheckXidAlive because it is reset by that time. So, I used the other approach, which led to the attached. -- With Regards, Amit Kapila.
Attachment
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Dilip Kumar
Date:
On Fri, Apr 30, 2021 at 3:01 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Wed, Apr 28, 2021 at 11:03 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > On Wed, Apr 28, 2021 at 11:00 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > The idea I have is to additionally check that we are decoding > > > a streaming or prepared transaction (the same check as we have for > > > setting curtxn), or we can check whether CheckXidAlive is a valid > > > transaction id. What do you think? > > > > I think a check based on CheckXidAlive looks good to me. This will > > protect against cases where a similar error is raised from some other path, > > as you mentioned above. > > > > We can't use CheckXidAlive because it is reset by that time. Right. > So, I used the other approach, which led to the attached. The patch looks fine to me. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Amit Kapila
Date:
On Fri, Apr 30, 2021 at 7:43 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > So, I > > used the other approach which led to the attached. > > The patch looks fine to me. > Thanks, pushed! -- With Regards, Amit Kapila.
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
From
Dilip Kumar
Date:
On Thu, May 6, 2021 at 9:01 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Fri, Apr 30, 2021 at 7:43 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > So, I > > > used the other approach which led to the attached. > > > > The patch looks fine to me. > > > > Thanks, pushed! Thanks! -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com