Thread: logical decoding : exceeded maxAllocatedDescs for .spill files
Hi,

I reproduced the error "exceeded maxAllocatedDescs (492) while trying to open file ...", which was also discussed in the thread [1]. This issue is similar but not exactly the same as [1]. In [1], the file for which this error showed up was "pg_logical/mappings/map....", while here it's the .spill file. And here the issue, in short, seems to be: the .spill file does not get closed there and then, unlike in [1] where there was a file descriptor leak.

I could reproduce it using a transaction containing a long series of sub-transactions (it could possibly be reproduced without sub-transactions, but looking at the code I could easily come up with this script using sub-transactions):

create table tab(id int);

-- Function that does huge changes in a single transaction
create or replace function f(id int) returns int as $$
begin
    -- Iterate this more than 492 times (max transient file descriptors PG would typically allow).
    -- This will create that many sub-transactions due to presence of exception block.
    FOR i IN 1..600 LOOP
        BEGIN
            -- Iterate more than 4096 times (so that changes spill to disk: max_changes_in_memory)
            FOR j IN 1..5000 LOOP
                insert into tab values (1);
            END LOOP;
        EXCEPTION when division_by_zero then perform 'dummy';
        END;
    END LOOP;
    return id;
end $$ language plpgsql;

SELECT * FROM pg_create_logical_replication_slot('logical', 'test_decoding');

begin;
select f(1); -- Do huge changes in a single transaction
commit;

\! pg_recvlogical -d postgres --slot=logical --verbose --start -f -
pg_recvlogical: starting log streaming at 0/0 (slot logical)
pg_recvlogical: streaming initiated
pg_recvlogical: confirming write up to 0/0, flush to 0/0 (slot logical)
BEGIN 1869
pg_recvlogical: confirming write up to 0/1B6D6E38, flush to 0/1B6D6E38 (slot logical)
pg_recvlogical: error: unexpected termination of replication stream: ERROR:  exceeded maxAllocatedDescs (492) while trying to open file "pg_replslot/logical/xid-2362-lsn-0-24000000.spill"
pg_recvlogical: disconnected; waiting 5 seconds to try again

Looking at the code, what might be happening is: ReorderBufferIterTXNInit() => ReorderBufferRestoreChanges() opens the files, but leaves them open if end of file is not reached. Eventually, if end of file is reached, the file gets closed. The function returns without closing the file descriptor if the reorder buffer changes being restored exceed max_changes_in_memory; presumably the rest of the changes get restored later in another ReorderBufferRestoreChanges() call. But meanwhile, if a lot of such files are left open, we can run out of the maximum number of files that PG decides to keep open (it has some logic that takes into account how many files the system allows to be open, and how many are already open).

Offhand, what I am thinking is: we need to close the file descriptor before returning from ReorderBufferRestoreChanges(), and keep track of the file offset and file path, so that next time we can resume reading from there.

Comments?

[1] https://www.postgresql.org/message-id/flat/738a590a-2ce5-9394-2bef-7b1caad89b37%402ndquadrant.com

--
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company
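To make the proposal concrete, here is a minimal sketch of that idea in plain POSIX terms. The names and structure are purely illustrative (this is not the reorderbuffer code or the eventual patch): the restore state remembers the spill file's path and offset, and each batch reopens the file, reads from the saved offset with pread(), and closes it again so no descriptor stays allocated between batches.

/*
 * Illustrative sketch only -- hypothetical names, not PostgreSQL code.
 * Instead of keeping the spill file's fd open across calls, remember
 * the path and offset, and reopen/pread/close for every batch of
 * restored changes.
 */
#include <fcntl.h>
#include <unistd.h>

typedef struct SpillReadState
{
    char    path[1024];     /* spill file path */
    off_t   offset;         /* where the next batch starts */
} SpillReadState;

static ssize_t
restore_next_batch(SpillReadState *state, char *buf, size_t len)
{
    int         fd;
    ssize_t     nread;

    fd = open(state->path, O_RDONLY);
    if (fd < 0)
        return -1;

    nread = pread(fd, buf, len, state->offset);
    if (nread > 0)
        state->offset += nread;     /* resume here next time */

    close(fd);                      /* no fd stays open between batches */
    return nread;
}

The extra open/close per batch is exactly the cost that gets debated in the replies below.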
Re: logical decoding : exceeded maxAllocatedDescs for .spill files
From: Alvaro Herrera from 2ndQuadrant
On 2019-Sep-11, Amit Khandekar wrote: > I reproduced the error "exceeded maxAllocatedDescs (492) while trying > to open file ...", which was also discussed about in the thread [1]. > This issue is similar but not exactly the same as [1]. In [1], the > file for which this error used to show up was > "pg_logical/mappings/map...." , while here it's the .spill file. And > here the issue , in short, seems to be : The .spill file does not get > closed there and then, unlike in [1] where there was a file descriptor > leak. Uh-oh :-( Thanks for the reproducer -- I confirm it breaks things. > Looking at the code, what might be happening is, > ReorderBufferIterTXNInit()=>ReorderBufferRestoreChanges() opens the > files, but leaves them open if end of file is not reached. Eventually > if end of file is reached, it gets closed. The function returns back > without closing the file descriptor if reorder buffer changes being > restored are more than max_changes_in_memory. Probably later on, the > rest of the changes get restored in another > ReorderBufferRestoreChanges() call. But meanwhile, if there are a lot > of such files opened, we can run out of the max files that PG decides > to keep open (it has some logic that takes into account system files > allowed to be open, and already opened). Makes sense. > Offhand, what I am thinking is, we need to close the file descriptor > before returning from ReorderBufferRestoreChanges(), and keep track of > the file offset and file path, so that next time we can resume reading > from there. I think doing this all the time would make restore very slow -- there's a reason we keep the files open, after all. It would be better if we can keep the descriptors open as much as possible, and only close them if there's trouble. I was under the impression that using OpenTransientFile was already taking care of that, but that's evidently not the case. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Wed, Sep 11, 2019 at 09:51:40AM -0300, Alvaro Herrera from 2ndQuadrant wrote: >On 2019-Sep-11, Amit Khandekar wrote: > >> I reproduced the error "exceeded maxAllocatedDescs (492) while trying >> to open file ...", which was also discussed about in the thread [1]. >> This issue is similar but not exactly the same as [1]. In [1], the >> file for which this error used to show up was >> "pg_logical/mappings/map...." , while here it's the .spill file. And >> here the issue , in short, seems to be : The .spill file does not get >> closed there and then, unlike in [1] where there was a file descriptor >> leak. > >Uh-oh :-( Thanks for the reproducer -- I confirm it breaks things. > >> Looking at the code, what might be happening is, >> ReorderBufferIterTXNInit()=>ReorderBufferRestoreChanges() opens the >> files, but leaves them open if end of file is not reached. Eventually >> if end of file is reached, it gets closed. The function returns back >> without closing the file descriptor if reorder buffer changes being >> restored are more than max_changes_in_memory. Probably later on, the >> rest of the changes get restored in another >> ReorderBufferRestoreChanges() call. But meanwhile, if there are a lot >> of such files opened, we can run out of the max files that PG decides >> to keep open (it has some logic that takes into account system files >> allowed to be open, and already opened). > >Makes sense. > >> Offhand, what I am thinking is, we need to close the file descriptor >> before returning from ReorderBufferRestoreChanges(), and keep track of >> the file offset and file path, so that next time we can resume reading >> from there. > >I think doing this all the time would make restore very slow -- there's a >reason we keep the files open, after all. How much slower? It certainly will have a hit, but maybe it's negligible compared to all the other stuff happening in this code? >It would be better if we can keep the descriptors open as much as >possible, and only close them if there's trouble. I was under the >impression that using OpenTransientFile was already taking care of that, >but that's evidently not the case. > I don't see how the current API could do that transparently - it does track the files, but the user only gets a file descriptor. With just a file descriptor, how could the code know to do reopen/seek when it's going just through the regular fopen/fclose? Anyway, I agree we need to do something, to fix this corner case (many serialized in-progress transactions). ISTM we have two options - either do something in the context of reorderbuffer.c, or extend the transient file API somehow. I'd say the second option is the right thing going forward, because it does allow doing it transparently and without leaking details about maxAllocatedDescs etc. There are two issues, though - it does require changes / extensions to the API, and it's not backpatchable. So maybe we should start with the localized fix in reorderbuffer, and I agree tracking offset seems reasonable. As a sidenote - in the other thread about streaming, one of the patches does change how we log subxact assignments. In the end, this allows using just a single file for the top-level transaction, instead of having one file per subxact. That would also solve this. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 2019-Sep-12, Tomas Vondra wrote: > On Wed, Sep 11, 2019 at 09:51:40AM -0300, Alvaro Herrera from 2ndQuadrant wrote: > > On 2019-Sep-11, Amit Khandekar wrote: > > I think doing this all the time would make restore very slow -- there's a > > reason we keep the files open, after all. > > How much slower? It certainly will have a hit, but maybe it's negligible > compared to all the other stuff happening in this code? I dunno -- that's a half-assed guess based on there being many more syscalls, and on the fact that the API is how it is in the first place. (Andres did a lot of perf benchmarking and tweaking back when he was writing this ... I just point out that I have a colleague that had to invent *a lot* of new MemCxt infrastructure in order to make some of Andres' perf optimizations cleaner, just as a semi-related data point. Anyway, I digress.) Anyway, such a fix would pessimize all cases, including every single case that works today (which evidently is almost every single usage of this feature, since we hadn't heard of this problem until yesterday), in order to solve a problem that you can only find in very rare ones. Another point of view is that we should make it work first, then make it fast. But the point remains that it works fine and fast for 99.99% of cases. > > It would be better if we can keep the descriptors open as much as > > possible, and only close them if there's trouble. I was under the > > impression that using OpenTransientFile was already taking care of that, > > but that's evidently not the case. > > I don't see how the current API could do that transparently - it does > track the files, but the user only gets a file descriptor. With just a > file descriptor, how could the code know to do reopen/seek when it's going > just through the regular fopen/fclose? Yeah, I don't know what was in Amit's mind, but it seemed obvious to me that such a fix required changing that API so that a seekpos is kept together with the fd. ReorderBufferRestoreChange is static in reorderbuffer.c so it's not a problem to change its ABI. I agree with trying to get a reorderbuffer-localized back-patchable fix first, then we how to improve from that. > As a sidenote - in the other thread about streaming, one of the patches > does change how we log subxact assignments. In the end, this allows using > just a single file for the top-level transaction, instead of having one > file per subxact. That would also solve this. :-( While I would love to get that patch done and get rid of the issue, it's not a backpatchable fix either. ... however, it does mean we maybe can get away with the reorderbuffer.c-local fix, and then just use your streaming stuff to get rid of the perf problem forever? -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Thu, Sep 12, 2019 at 5:31 AM Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote: > I don't see how the current API could do that transparently - it does > track the files, but the user only gets a file descriptor. With just a > file descriptor, how could the code know to do reopen/seek when it's going > just through the regular fopen/fclose? > > Anyway, I agree we need to do something, to fix this corner case (many > serialized in-progress transactions). ISTM we have two options - either do > something in the context of reorderbuffer.c, or extend the transient file > API somehow. I'd say the second option is the right thing going forward, > because it does allow doing it transparently and without leaking details > about maxAllocatedDescs etc. There are two issues, though - it does > require changes / extensions to the API, and it's not backpatchable. > > So maybe we should start with the localized fix in reorderbuffer, and I > agree tracking offset seems reasonable. We've already got code that knows how to track this sort of thing. You just need to go through the File abstraction (PathNameOpenFile or PathNameOpenFilePerm or OpenTemporaryFile) rather than using the functions that deal directly with fds (OpenTransientFile, BasicOpenFile, etc.). It seems like it would be better to reuse the existing VFD layer than to invent a whole new one specific to logical replication. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
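For readers unfamiliar with that layer, here is a minimal sketch of what reading a spill file through the VFD API in fd.c could look like. The signatures match the PostgreSQL 12-era storage/fd.h (PathNameOpenFile, FileRead with an explicit offset, FileClose), but the function itself and its error handling are illustrative, not the eventual patch:

/*
 * Illustrative sketch: reading a spill file through the VFD layer
 * instead of OpenTransientFile().  VFDs are LRU-managed by fd.c, so
 * kernel fds are closed automatically when max_safe_fds would be
 * exceeded.  FileRead() takes an explicit offset (PostgreSQL 12-era
 * API), so the caller tracks the position itself.
 */
#include "postgres.h"

#include <fcntl.h>

#include "storage/fd.h"

static void
read_spill_file(const char *path)
{
    File    vfd;
    off_t   offset = 0;         /* caller-side position tracking */
    char    buf[8192];
    int     nread;

    vfd = PathNameOpenFile(path, O_RDONLY | PG_BINARY);
    if (vfd < 0)
        elog(ERROR, "could not open file \"%s\"", path);

    /* last argument is the wait-event id; 0 here for illustration */
    while ((nread = FileRead(vfd, buf, sizeof(buf), offset, 0)) > 0)
        offset += nread;        /* remember where the next read resumes */

    if (nread < 0)
        elog(ERROR, "could not read file \"%s\"", path);

    FileClose(vfd);
}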
Hi, On 2019-09-12 09:41:02 -0400, Robert Haas wrote: > On Thu, Sep 12, 2019 at 5:31 AM Tomas Vondra > <tomas.vondra@2ndquadrant.com> wrote: > > I don't see how the current API could do that transparently - it does > > track the files, but the user only gets a file descriptor. With just a > > file descriptor, how could the code know to do reopen/seek when it's going > > just through the regular fopen/fclose? > > > > Anyway, I agree we need to do something, to fix this corner case (many > > serialized in-progress transactions). ISTM we have two options - either do > > something in the context of reorderbuffer.c, or extend the transient file > > API somehow. I'd say the second option is the right thing going forward, > > because it does allow doing it transparently and without leaking details > > about maxAllocatedDescs etc. There are two issues, though - it does > > require changes / extensions to the API, and it's not backpatchable. > > > > So maybe we should start with the localized fix in reorderbuffer, and I > > agree tracking offset seems reasonable. > > We've already got code that knows how to track this sort of thing. > You just need to go through the File abstraction (PathNameOpenFile or > PathNameOpenFilePerm or OpenTemporaryFile) rather than using the > functions that deal directly with fds (OpenTransientFile, > BasicOpenFile, etc.). It seems like it would be better to reuse the > existing VFD layer than to invent a whole new one specific to logical > replication. Yea, I agree that that is the right fix. Greetings, Andres Freund
Hi, On 2019-09-12 09:57:55 -0300, Alvaro Herrera wrote: > On 2019-Sep-12, Tomas Vondra wrote: > > > On Wed, Sep 11, 2019 at 09:51:40AM -0300, Alvaro Herrera from 2ndQuadrant wrote: > > > On 2019-Sep-11, Amit Khandekar wrote: > > > > I think doing this all the time would make restore very slow -- there's a > > > reason we keep the files open, after all. > > > > How much slower? It certainly will have a hit, but maybe it's negligible > > compared to all the other stuff happening in this code? I'd expect it to be significant. > > As a sidenote - in the other thread about streaming, one of the patches > > does change how we log subxact assignments. In the end, this allows using > > just a single file for the top-level transaction, instead of having one > > file per subxact. That would also solve this. Uhm, how is rollback to savepoint going to be handled in that case? I don't think it's great to just retain space for all rolled back subtransactions. Greetings, Andres Freund
On Thu, Sep 12, 2019 at 11:34:01AM -0700, Andres Freund wrote: >Hi, > >On 2019-09-12 09:57:55 -0300, Alvaro Herrera wrote: >> On 2019-Sep-12, Tomas Vondra wrote: >> >> > On Wed, Sep 11, 2019 at 09:51:40AM -0300, Alvaro Herrera from 2ndQuadrant wrote: >> > > On 2019-Sep-11, Amit Khandekar wrote: >> >> > > I think doing this all the time would make restore very slow -- there's a >> > > reason we keep the files open, after all. >> > >> > How much slower? It certainly will have a hit, but maybe it's negligible >> > compared to all the other stuff happening in this code? > >I'd expect it to be significant. > > >> > As a sidenote - in the other thread about streaming, one of the patches >> > does change how we log subxact assignments. In the end, this allows using >> > just a single file for the top-level transaction, instead of having one >> > file per subxact. That would also solve this. > >Uhm, how is rollback to savepoint going to be handled in that case? I >don't think it's great to just retain space for all rolled back >subtransactions. > The code can just do ftruncate() to the position of the subtransaction. That's what the patch [1] does. [1] https://commitfest.postgresql.org/24/1927/ regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
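A rough sketch of that idea, with hypothetical names (the real change lives in the reorderbuffer serialization code of the referenced patch): record the spill file offset when a subtransaction starts, and truncate back to it if that subtransaction rolls back.

/*
 * Illustrative sketch only (hypothetical names): with a single spill
 * file per top-level transaction, rolling back to a savepoint can
 * simply truncate the file back to the offset recorded when the
 * subtransaction started, so no space is retained for its changes.
 */
#include <sys/types.h>
#include <unistd.h>

static int
discard_subxact_changes(int spill_fd, off_t subxact_start_offset)
{
    /* drop everything the aborted subxact appended to the spill file */
    return ftruncate(spill_fd, subxact_start_offset);
}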
On Thu, 12 Sep 2019 at 19:11, Robert Haas <robertmhaas@gmail.com> wrote: > > On Thu, Sep 12, 2019 at 5:31 AM Tomas Vondra > <tomas.vondra@2ndquadrant.com> wrote: > > I don't see how the current API could do that transparently - it does > > track the files, but the user only gets a file descriptor. With just a > > file descriptor, how could the code know to do reopen/seek when it's going > > just through the regular fopen/fclose? > > > > Anyway, I agree we need to do something, to fix this corner case (many > > serialized in-progress transactions). ISTM we have two options - either do > > something in the context of reorderbuffer.c, or extend the transient file > > API somehow. I'd say the second option is the right thing going forward, > > because it does allow doing it transparently and without leaking details > > about maxAllocatedDescs etc. There are two issues, though - it does > > require changes / extensions to the API, and it's not backpatchable. > > > > So maybe we should start with the localized fix in reorderbuffer, and I > > agree tracking offset seems reasonable. > > We've already got code that knows how to track this sort of thing. You mean tracking excess kernel fds right ? Yeah, we can use VFDs so that excess fds are automatically closed. But Alvaro seems to be talking in context of tracking of file seek position. VFD does not have a mechanism to track file offsets if one of the vfd cached file is closed and reopened. Robert, are you suggesting to add this capability to VFD ? I agree that we could do it, but for back-patching, offhand I couldn't think of a simpler way. -- Thanks, -Amit Khandekar EnterpriseDB Corporation The Postgres Database Company
Amit Khandekar <amitdkhan.pg@gmail.com> writes: > You mean tracking excess kernel fds right ? Yeah, we can use VFDs so > that excess fds are automatically closed. But Alvaro seems to be > talking in context of tracking of file seek position. VFD does not > have a mechanism to track file offsets if one of the vfd cached file > is closed and reopened. Hm. It used to, but somebody got rid of that on the theory that we could use pread/pwrite instead. I'm inclined to think that that was the right tradeoff, but it'd mean that getting logical decoding to adhere to the VFD API requires extra work to track file position on the caller side. Again, though, the advice that's been given here is that we should fix logical decoding to use the VFD API as it stands, not change that API. I concur with that. regards, tom lane
On Fri, Sep 13, 2019 at 12:14 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > Amit Khandekar <amitdkhan.pg@gmail.com> writes: > > You mean tracking excess kernel fds right ? Yeah, we can use VFDs so > > that excess fds are automatically closed. But Alvaro seems to be > > talking in context of tracking of file seek position. VFD does not > > have a mechanism to track file offsets if one of the vfd cached file > > is closed and reopened. > > Hm. It used to, but somebody got rid of that on the theory that > we could use pread/pwrite instead. I'm inclined to think that that > was the right tradeoff, but it'd mean that getting logical decoding > to adhere to the VFD API requires extra work to track file position > on the caller side. Oops. I forgot that we'd removed that. > Again, though, the advice that's been given here is that we should > fix logical decoding to use the VFD API as it stands, not change > that API. I concur with that. A reasonable position. So I guess logical decoding has to track the file position itself, but perhaps use the VFD layer for managing FD pooling. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Fri, 13 Sep 2019 at 22:01, Robert Haas <robertmhaas@gmail.com> wrote:
> On Fri, Sep 13, 2019 at 12:14 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > Again, though, the advice that's been given here is that we should
> > fix logical decoding to use the VFD API as it stands, not change
> > that API. I concur with that.
>
> A reasonable position. So I guess logical decoding has to track the
> file position itself, but perhaps use the VFD layer for managing FD
> pooling.

Yeah, something like the attached patch. I think this tracking of offsets would have been cleaner if we added in-built support in VFD. But yeah, for the back branches at least, we need to handle it outside of VFD. Or maybe we would add it if we find one more use-case.

--
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company
Attachment
Amit Khandekar <amitdkhan.pg@gmail.com> writes: > On Fri, 13 Sep 2019 at 22:01, Robert Haas <robertmhaas@gmail.com> wrote: >> On Fri, Sep 13, 2019 at 12:14 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: >>> Again, though, the advice that's been given here is that we should >>> fix logical decoding to use the VFD API as it stands, not change >>> that API. I concur with that. >> A reasonable position. So I guess logical decoding has to track the >> file position itself, but perhaps use the VFD layer for managing FD >> pooling. > Yeah, something like the attached patch. I think this tracking of > offsets would have been cleaner if we add in-built support in VFD. But > yeah, for bank branches at least, we need to handle it outside of VFD. > Or may be we would add it if we find one more use-case. Again, we had that and removed it, for what seem to me to be solid reasons. It adds cycles when we're forced to close/reopen a file, and it also adds failure modes that we could do without (ie, failure of either the ftell or the lseek, which are particularly nasty because they shouldn't happen according to the VFD abstraction). I do not think there is going to be any argument strong enough to make us put it back, especially not for non-mainstream callers like logical decoding. regards, tom lane
Hi, On 2019-09-14 14:34:21 -0400, Tom Lane wrote: > Amit Khandekar <amitdkhan.pg@gmail.com> writes: > > On Fri, 13 Sep 2019 at 22:01, Robert Haas <robertmhaas@gmail.com> wrote: > >> On Fri, Sep 13, 2019 at 12:14 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > >>> Again, though, the advice that's been given here is that we should > >>> fix logical decoding to use the VFD API as it stands, not change > >>> that API. I concur with that. > > >> A reasonable position. So I guess logical decoding has to track the > >> file position itself, but perhaps use the VFD layer for managing FD > >> pooling. > > > Yeah, something like the attached patch. I think this tracking of > > offsets would have been cleaner if we add in-built support in VFD. But > > yeah, for bank branches at least, we need to handle it outside of VFD. > > Or may be we would add it if we find one more use-case. > > Again, we had that and removed it, for what seem to me to be solid > reasons. It adds cycles when we're forced to close/reopen a file, > and it also adds failure modes that we could do without (ie, failure > of either the ftell or the lseek, which are particularly nasty because > they shouldn't happen according to the VFD abstraction). I do not > think there is going to be any argument strong enough to make us > put it back, especially not for non-mainstream callers like logical > decoding. Yea, I think that's the right call. Avoiding kernel seeks is quite worthwhile, and we shouldn't undo it just because of this usecase. And that'll become more and more important performance-wise (and has already done so, with all the intel fixes making syscalls much slower). I could see an argument for adding a separate generic layer providing position tracking ontop of the VFD abstraction however. Seems quite possible that there's some other parts of the system that could benefit from using VFDs rather than plain fds. And they'd probably also need the positional tracking. Greetings, Andres Freund
On Tue, 17 Sep 2019 at 21:19, Andres Freund <andres@anarazel.de> wrote: > On 2019-09-14 14:34:21 -0400, Tom Lane wrote: > > Amit Khandekar <amitdkhan.pg@gmail.com> writes: > > > Yeah, something like the attached patch. I think this tracking of > > > offsets would have been cleaner if we add in-built support in VFD. But > > > yeah, for bank branches at least, we need to handle it outside of VFD. > > > Or may be we would add it if we find one more use-case. > > > > Again, we had that and removed it, for what seem to me to be solid > > reasons. It adds cycles when we're forced to close/reopen a file, > > and it also adds failure modes that we could do without (ie, failure > > of either the ftell or the lseek, which are particularly nasty because > > they shouldn't happen according to the VFD abstraction). Ok. So you mean, when the caller would call FileRead() for sequential reading, underneath VFD would do a pread(), but if pread() returns error, the errno can belong to read() or it might as well belong to lseek(). If it's due to lseek(), it's not expected from the caller because for the caller it's just a sequential read. Yeah, makes sense. >> I do not > > think there is going to be any argument strong enough to make us > > put it back, especially not for non-mainstream callers like logical > > decoding. Ok. Also, more about putting back is in the below comments ... > > Yea, I think that's the right call. Avoiding kernel seeks is quite > worthwhile, and we shouldn't undo it just because of this usecase. And > that'll become more and more important performance-wise (and has already > done so, with all the intel fixes making syscalls much slower). By the way, I was not thinking about adding back the read() and lseek() calls. I was saying we continue to use the pread() call, so it's just a single system call. FileReadAt(..., offset) would do pread() with user-supplied offset, and FileRead() would do pread() using internally tracked offset. So for the user, FileReadAt() is like pread(), and FileRead() would be like read(). But I agree with Tom's objection about having to unnecessarily handle lseek error codes. > > I could see an argument for adding a separate generic layer providing > position tracking ontop of the VFD abstraction however. Seems quite > possible that there's some other parts of the system that could benefit > from using VFDs rather than plain fds. And they'd probably also need the > positional tracking. Yeah, that also could be done. Probably, for now at least, what everyone seems to agree is to take my earlier attached patch forward. I am going to see if I can add a TAP test for the patch, and will add the patch into the commitfest soon. -- Thanks, -Amit Khandekar EnterpriseDB Corporation The Postgres Database Company
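A sketch of the API shape being described here. This is not an existing fd.c interface; the sequential variant is renamed FileReadSeq to avoid clashing with the real FileRead. Both calls boil down to a single positioned read underneath (the existing FileRead with an explicit offset, as in the PostgreSQL 12-era API), so no lseek() and no extra failure mode is introduced:

/*
 * Hypothetical sketch -- NOT part of fd.c.  A thin wrapper keeps the
 * current position, so FileReadSeq() behaves like read() while
 * FileReadAt() behaves like pread(); both issue one positioned read
 * via the existing FileRead(file, buf, amount, offset, wait_event).
 */
#include "postgres.h"
#include "storage/fd.h"

typedef struct PositionedFile
{
    File    vfd;            /* VFD handle from PathNameOpenFile() */
    off_t   pos;            /* internally tracked offset */
} PositionedFile;

static int
FileReadAt(PositionedFile *f, char *buf, int amount, off_t offset)
{
    /* like pread(): caller-supplied offset, tracked position unchanged */
    return FileRead(f->vfd, buf, amount, offset, 0);
}

static int
FileReadSeq(PositionedFile *f, char *buf, int amount)
{
    /* like read(): use and advance the tracked offset */
    int     nread = FileRead(f->vfd, buf, amount, f->pos, 0);

    if (nread > 0)
        f->pos += nread;
    return nread;
}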
On Wed, 18 Sep 2019 at 12:24, Amit Khandekar <amitdkhan.pg@gmail.com> wrote: > Probably, for now at least, what everyone seems to agree is to take my > earlier attached patch forward. > > I am going to see if I can add a TAP test for the patch, and will add > the patch into the commitfest soon. Attached is an updated patch v2. Has a new test scenario added in contrib/test_decoding/sql/spill test, and some minor code cleanup. Going to add this into Nov commitfest. -- Thanks, -Amit Khandekar EnterpriseDB Corporation The Postgres Database Company
Attachment
On Thu, Oct 3, 2019 at 4:48 PM Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
>
> On Wed, 18 Sep 2019 at 12:24, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
> > Probably, for now at least, what everyone seems to agree is to take my
> > earlier attached patch forward.
> >
> > I am going to see if I can add a TAP test for the patch, and will add
> > the patch into the commitfest soon.
>
> Attached is an updated patch v2.
>

I see that you have made changes in ReorderBufferRestoreChanges to use PathNameOpenFile, but not in ReorderBufferSerializeTXN. Is there a reason for the same? In my test environment, with the test provided by you, I got the error (reported in this thread) via ReorderBufferSerializeTXN. See call stack below:

!errfinish(int dummy=0, ...)  Line 441  C
!OpenTransientFilePerm(const char * fileName=0x012deeac, int fileFlags=33033, unsigned short fileMode=384)  Line 2272 + 0x57 bytes  C
!OpenTransientFile(const char * fileName=0x012deeac, int fileFlags=33033)  Line 2256 + 0x15 bytes  C
!ReorderBufferSerializeTXN(ReorderBuffer * rb=0x01ee4d80, ReorderBufferTXN * txn=0x1f9a6ce8)  Line 2302 + 0x11 bytes  C
!ReorderBufferIterTXNInit(ReorderBuffer * rb=0x01ee4d80, ReorderBufferTXN * txn=0x01f08f80)  Line 1044 + 0xd bytes  C

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Mon, 18 Nov 2019 at 17:20, Amit Kapila <amit.kapila16@gmail.com> wrote: > I see that you have made changes in ReorderBufferRestoreChanges to use > PathNameOpenFile, but not in ReorderBufferSerializeTXN. Is there a > reason for the same? In my test environment, with the test provided > by you, I got the error (reported in this thread) via > ReorderBufferSerializeTXN. You didn't get this error with the patch applied, did you ? If you were debugging this without the patch applied, I suspect that the reason why ReorderBufferSerializeTXN() => OpenTransientFile() is generating this error is because the max limit must be already crossed because of earlier calls to ReorderBufferRestoreChanges(). Note that in ReorderBufferSerializeTXN(), OpenTransientFile() is sufficient because the code in that function has made sure the fd gets closed there itself. If you are getting this error even with the patch applied, then this needs investigation. -- Thanks, -Amit Khandekar EnterpriseDB Corporation The Postgres Database Company
On Mon, Nov 18, 2019 at 5:20 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Thu, Oct 3, 2019 at 4:48 PM Amit Khandekar <amitdkhan.pg@gmail.com> wrote: > > > > On Wed, 18 Sep 2019 at 12:24, Amit Khandekar <amitdkhan.pg@gmail.com> wrote: > > > Probably, for now at least, what everyone seems to agree is to take my > > > earlier attached patch forward. > > > > > > I am going to see if I can add a TAP test for the patch, and will add > > > the patch into the commitfest soon. > > > > Attached is an updated patch v2. > > > > I see that you have made changes in ReorderBufferRestoreChanges to use > PathNameOpenFile, but not in ReorderBufferSerializeTXN. Is there a > reason for the same? > I have one more question regarding this patch. It seems to me that the files opened via OpenTransientFile or OpenTemporaryFile are automatically closed at transaction end(abort), but that doesn't seem to be the case for files opened with PathNameOpenFile. See AtEOXact_Files and AtEOSubXact_Files. So, now with the change proposed by this patch, don't we need to deal it in some other way? -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Mon, 18 Nov 2019 at 17:52, Amit Kapila <amit.kapila16@gmail.com> wrote: > I have one more question regarding this patch. It seems to me that > the files opened via OpenTransientFile or OpenTemporaryFile are > automatically closed at transaction end(abort), but that doesn't seem > to be the case for files opened with PathNameOpenFile. See > AtEOXact_Files and AtEOSubXact_Files. So, now with the change > proposed by this patch, don't we need to deal it in some other way? For the API's that use VFDs (like PathNameOpenFile), the files opened are always recorded in the VfdCache array. So it is not required to do the cleanup at (sub)transaction end, because the kernel fds get closed dynamically in ReleaseLruFiles() whenever they reach max_safe_fds limit. So if a transaction aborts, the fds might remain open, but those will get cleaned up whenever we require more fds, through ReleaseLruFiles(). Whereas, for files opened through OpenTransientFile(), VfdCache is not involved, so this needs transaction end cleanup. -- Thanks, -Amit Khandekar EnterpriseDB Corporation The Postgres Database Company
On Mon, Nov 18, 2019 at 5:50 PM Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
>
> On Mon, 18 Nov 2019 at 17:20, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > I see that you have made changes in ReorderBufferRestoreChanges to use
> > PathNameOpenFile, but not in ReorderBufferSerializeTXN. Is there a
> > reason for the same? In my test environment, with the test provided
> > by you, I got the error (reported in this thread) via
> > ReorderBufferSerializeTXN.
>
> You didn't get this error with the patch applied, did you ?
>

No, I got this before applying the patch. However, after applying the patch, I got below error in the same test:

postgres=# SELECT 1 from pg_logical_slot_get_changes('regression_slot', NULL,NULL) LIMIT 1;
ERROR:  could not read from reorderbuffer spill file: Invalid argument

It seems to me that the FileRead API used in the patch can return value < 0 on EOF. See the API usage in BufFileLoadBuffer. I got this error on a Windows machine, and in the server log the message was "LOG: unrecognized win32 error code: 38", which indicates "Reached the end of the file."

> If you were debugging this without the patch applied, I suspect that
> the reason why ReorderBufferSerializeTXN() => OpenTransientFile() is
> generating this error is because the max limit must be already crossed
> because of earlier calls to ReorderBufferRestoreChanges().
>
> Note that in ReorderBufferSerializeTXN(), OpenTransientFile() is
> sufficient because the code in that function has made sure the fd gets
> closed there itself.
>

Okay, then we might not need it there, but we should at least add a comment in ReorderBufferRestoreChanges to explain why we have used a different function to operate on the file at that place.

>
> For the API's that use VFDs (like PathNameOpenFile), the files opened
> are always recorded in the VfdCache array. So it is not required to do
> the cleanup at (sub)transaction end, because the kernel fds get closed
> dynamically in ReleaseLruFiles() whenever they reach max_safe_fds
> limit. So if a transaction aborts, the fds might remain open, but
> those will get cleaned up whenever we require more fds, through
> ReleaseLruFiles(). Whereas, for files opened through
> OpenTransientFile(), VfdCache is not involved, so this needs
> transaction end cleanup.
>

Have you tried by injecting some error? After getting the error mentioned above in email, when I retried the same query, I got the below message.

postgres=# SELECT 1 from pg_logical_slot_get_changes('regression_slot', NULL,NULL) LIMIT 1;
ERROR:  could not remove file "pg_replslot/regression_slot/xid-1693-lsn-0-18000000.spill" during removal of pg_replslot/regression_slot/xid*: Permission denied

And, then I tried to drop the replication slot and I got below error.

postgres=# SELECT * FROM pg_drop_replication_slot('regression_slot');
ERROR:  could not rename file "pg_replslot/regression_slot" to "pg_replslot/regression_slot.tmp": Permission denied

It might be something related to Windows, but you can once try by injecting some error after reading a few files in the code path and see the behavior.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Tue, 19 Nov 2019 at 14:07, Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Mon, Nov 18, 2019 at 5:50 PM Amit Khandekar <amitdkhan.pg@gmail.com> wrote: > > > > On Mon, 18 Nov 2019 at 17:20, Amit Kapila <amit.kapila16@gmail.com> wrote: > > > I see that you have made changes in ReorderBufferRestoreChanges to use > > > PathNameOpenFile, but not in ReorderBufferSerializeTXN. Is there a > > > reason for the same? In my test environment, with the test provided > > > by you, I got the error (reported in this thread) via > > > ReorderBufferSerializeTXN. > > > > You didn't get this error with the patch applied, did you ? > > > > No, I got this before applying the patch. However, after applying the > patch, I got below error in the same test: > > postgres=# SELECT 1 from > pg_logical_slot_get_changes('regression_slot', NULL,NULL) LIMIT 1; > ERROR: could not read from reorderbuffer spill file: Invalid argument > > It seems to me that FileRead API used in the patch can return value < > 0 on EOF. See the API usage in BufFileLoadBuffer. I got this error > on a windows machine and in the server log the message was "LOG: > unrecognized win32 error code: 38" which indicates "Reached the end of > the file." On Windows, it is documented that ReadFile() (which is called by pg_pread) will return false on EOF but only when the file is open for asynchronous reads/writes. But here we are just dealing with usual synchronous reads. So pg_pread() code should indeed return 0 on EOF on Windows. Not yet able to figure out how FileRead() managed to return this error on Windows. But from your symptoms, it does look like pg_pread()=>ReadFile() returned false (despite doing asynchronous reads), and so _dosmaperr() gets called, and then it does not find the eof error in doserrors[], so the "unrecognized win32 error code" message is printed. May have to dig up more on this. > > > If you were debugging this without the patch applied, I suspect that > > the reason why ReorderBufferSerializeTXN() => OpenTransientFile() is > > generating this error is because the max limit must be already crossed > > because of earlier calls to ReorderBufferRestoreChanges(). > > > > Note that in ReorderBufferSerializeTXN(), OpenTransientFile() is > > sufficient because the code in that function has made sure the fd gets > > closed there itself. > > > > Okay, then we might not need it there, but we should at least add a > comment in ReorderBufferRestoreChanges to explain why we have used a > different function to operate on the file at that place. Yeah, that might make sense. > > > > > For the API's that use VFDs (like PathNameOpenFile), the files opened > > are always recorded in the VfdCache array. So it is not required to do > > the cleanup at (sub)transaction end, because the kernel fds get closed > > dynamically in ReleaseLruFiles() whenever they reach max_safe_fds > > limit. So if a transaction aborts, the fds might remain open, but > > those will get cleaned up whenever we require more fds, through > > ReleaseLruFiles(). Whereas, for files opened through > > OpenTransientFile(), VfdCache is not involved, so this needs > > transaction end cleanup. > > > > Have you tried by injecting some error? After getting the error > mentioned above in email, when I retried the same query, I got the > below message. 
> > postgres=# SELECT 1 from > pg_logical_slot_get_changes('regression_slot', NULL,NULL) LIMIT 1; > ERROR: could not remove file > "pg_replslot/regression_slot/xid-1693-lsn-0-18000000.spill" during > removal of pg_replslot/regression_slot/xid*: Permission denied > > And, then I tried to drop the replication slot and I got below error. > postgres=# SELECT * FROM pg_drop_replication_slot('regression_slot'); > ERROR: could not rename file "pg_replslot/regression_slot" to > "pg_replslot/regression_slot.tmp": Permission denied > > It might be something related to Windows Oh ok, I missed the fact that on Windows we can't delete the files that are already open, unlike Linux/Unix. I guess, I may have to use FD_CLOSE_AT_EOXACT flags; or simply use OpenTemporaryFile(). I wonder though if this same issue might come up for the other use-case of PathNameOpenFile() : logical_rewrite_log_mapping(). > but you can once try by > injecting some error after reading a few files in the code path and > see the behavior. Yeah, will check the behaviour, although on Linux, I think I won't get this error. But yes, like I mentioned above, I think we might have to arrange for something. -- Thanks, -Amit Khandekar EnterpriseDB Corporation The Postgres Database Company
On Wed, Nov 20, 2019 at 12:28 AM Amit Khandekar <amitdkhan.pg@gmail.com> wrote: > On Tue, 19 Nov 2019 at 14:07, Amit Kapila <amit.kapila16@gmail.com> wrote: > > No, I got this before applying the patch. However, after applying the > > patch, I got below error in the same test: > > > > postgres=# SELECT 1 from > > pg_logical_slot_get_changes('regression_slot', NULL,NULL) LIMIT 1; > > ERROR: could not read from reorderbuffer spill file: Invalid argument > > > > It seems to me that FileRead API used in the patch can return value < > > 0 on EOF. See the API usage in BufFileLoadBuffer. I got this error > > on a windows machine and in the server log the message was "LOG: > > unrecognized win32 error code: 38" which indicates "Reached the end of > > the file." > > On Windows, it is documented that ReadFile() (which is called by > pg_pread) will return false on EOF but only when the file is open for > asynchronous reads/writes. But here we are just dealing with usual > synchronous reads. So pg_pread() code should indeed return 0 on EOF on > Windows. Not yet able to figure out how FileRead() managed to return > this error on Windows. But from your symptoms, it does look like > pg_pread()=>ReadFile() returned false (despite doing asynchronous > reads), and so _dosmaperr() gets called, and then it does not find the > eof error in doserrors[], so the "unrecognized win32 error code" > message is printed. May have to dig up more on this. Hmm. See also this report: https://www.postgresql.org/message-id/flat/CABuU89MfEvJE%3DWif%2BHk7SCqjSOF4rhgwJWW6aR3hjojpGqFbjQ%40mail.gmail.com
Re: logical decoding : exceeded maxAllocatedDescs for .spill files
From: Juan José Santamaría Flecha
On Tue, Nov 19, 2019 at 12:49 PM Thomas Munro <thomas.munro@gmail.com> wrote:
On Wed, Nov 20, 2019 at 12:28 AM Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
> On Tue, 19 Nov 2019 at 14:07, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > No, I got this before applying the patch. However, after applying the
> > patch, I got below error in the same test:
> >
> > postgres=# SELECT 1 from
> > pg_logical_slot_get_changes('regression_slot', NULL,NULL) LIMIT 1;
> > ERROR: could not read from reorderbuffer spill file: Invalid argument
> >
> > It seems to me that FileRead API used in the patch can return value <
> > 0 on EOF. See the API usage in BufFileLoadBuffer. I got this error
> > on a windows machine and in the server log the message was "LOG:
> > unrecognized win32 error code: 38" which indicates "Reached the end of
> > the file."
>
> On Windows, it is documented that ReadFile() (which is called by
> pg_pread) will return false on EOF but only when the file is open for
> asynchronous reads/writes. But here we are just dealing with usual
> synchronous reads. So pg_pread() code should indeed return 0 on EOF on
> Windows. Not yet able to figure out how FileRead() managed to return
> this error on Windows. But from your symptoms, it does look like
> pg_pread()=>ReadFile() returned false (despite doing asynchronous
> reads), and so _dosmaperr() gets called, and then it does not find the
> eof error in doserrors[], so the "unrecognized win32 error code"
> message is printed. May have to dig up more on this.
Hmm. See also this report:
https://www.postgresql.org/message-id/flat/CABuU89MfEvJE%3DWif%2BHk7SCqjSOF4rhgwJWW6aR3hjojpGqFbjQ%40mail.gmail.com
The files from pgwin32_open() are open for synchronous access, while pg_pread() uses the asynchronous functionality to offset the read. Under these circumstances, a read past EOF will return ERROR_HANDLE_EOF (38), as explained in:
https://devblogs.microsoft.com/oldnewthing/20150121-00/?p=44863
Regards,
Juan José Santamaría Flecha
On Wed, Nov 20, 2019 at 1:14 AM Juan José Santamaría Flecha <juanjo.santamaria@gmail.com> wrote:
> On Tue, Nov 19, 2019 at 12:49 PM Thomas Munro <thomas.munro@gmail.com> wrote:
>> On Wed, Nov 20, 2019 at 12:28 AM Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
>> > On Windows, it is documented that ReadFile() (which is called by
>> > pg_pread) will return false on EOF but only when the file is open for
>> > asynchronous reads/writes. But here we are just dealing with usual
>> > synchronous reads. So pg_pread() code should indeed return 0 on EOF on
>> > Windows. Not yet able to figure out how FileRead() managed to return
>> > this error on Windows. But from your symptoms, it does look like
>> > pg_pread()=>ReadFile() returned false (despite doing asynchronous
>> > reads), and so _dosmaperr() gets called, and then it does not find the
>> > eof error in doserrors[], so the "unrecognized win32 error code"
>> > message is printed. May have to dig up more on this.
>>
>> Hmm. See also this report:
>>
>> https://www.postgresql.org/message-id/flat/CABuU89MfEvJE%3DWif%2BHk7SCqjSOF4rhgwJWW6aR3hjojpGqFbjQ%40mail.gmail.com
>>
>
> The files from pgwin32_open() are open for synchronous access, while pg_pread() uses the asynchronous functionality to offset the read. Under these circumstances, a read past EOF will return ERROR_HANDLE_EOF (38), as explained in:

Oh, thanks.

> https://devblogs.microsoft.com/oldnewthing/20150121-00/?p=44863

!?!

Amit, since it looks like you are Windows-enabled and have a repro, would you mind confirming that this fixes the problem?

--- a/src/port/pread.c
+++ b/src/port/pread.c
@@ -41,6 +41,9 @@ pg_pread(int fd, void *buf, size_t size, off_t offset)
 	overlapped.Offset = offset;
 	if (!ReadFile(handle, buf, size, &result, &overlapped))
 	{
+		if (GetLastError() == ERROR_HANDLE_EOF)
+			return 0;
+
 		_dosmaperr(GetLastError());
 		return -1;
 	}
On Wed, Nov 20, 2019 at 7:58 AM Thomas Munro <thomas.munro@gmail.com> wrote: > On Wed, Nov 20, 2019 at 1:14 AM Juan José Santamaría Flecha > > https://devblogs.microsoft.com/oldnewthing/20150121-00/?p=44863 > > !?! One thing I don't understand (besides, apparently, the documentation): how did this problem escape detection by check-world for such a long time? Surely we expect to hit the end of various temporary files in various tests. Is it intermittent, or dependent on Windows version, or something like that?
On Wed, 20 Nov 2019 at 01:05, Thomas Munro <thomas.munro@gmail.com> wrote:
>
> On Wed, Nov 20, 2019 at 7:58 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> > On Wed, Nov 20, 2019 at 1:14 AM Juan José Santamaría Flecha
> > > https://devblogs.microsoft.com/oldnewthing/20150121-00/?p=44863
> >
> > !?!

Thanks Juan and Thomas for pointing to these links where this was already discussed.

> One thing I don't understand (besides, apparently, the documentation):
> how did this problem escape detection by check-world for such a long
> time? Surely we expect to hit the end of various temporary files in
> various tests. Is it intermittent, or dependent on Windows version,
> or something like that?

Possibly there aren't any callers who try to pread() at end-of-file using FileRead/pg_pread:

- mdread() seems to read from an offset which it knows is inside the end of file, including the whole BLCKSZ.
- BufFileLoadBuffer() seems to deliberately ignore FileRead()'s return value if it is -1:
      if (file->nbytes < 0)
          file->nbytes = 0;
- XLogPageRead() also seems to know that the offset is a valid offset.

--
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company
On Wed, Nov 20, 2019 at 12:28 AM Thomas Munro <thomas.munro@gmail.com> wrote:
>
> On Wed, Nov 20, 2019 at 1:14 AM Juan José Santamaría Flecha
> <juanjo.santamaria@gmail.com> wrote:
> > On Tue, Nov 19, 2019 at 12:49 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> >> On Wed, Nov 20, 2019 at 12:28 AM Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
> >> > On Windows, it is documented that ReadFile() (which is called by
> >> > pg_pread) will return false on EOF but only when the file is open for
> >> > asynchronous reads/writes. But here we are just dealing with usual
> >> > synchronous reads. So pg_pread() code should indeed return 0 on EOF on
> >> > Windows. Not yet able to figure out how FileRead() managed to return
> >> > this error on Windows. But from your symptoms, it does look like
> >> > pg_pread()=>ReadFile() returned false (despite doing asynchronous
> >> > reads), and so _dosmaperr() gets called, and then it does not find the
> >> > eof error in doserrors[], so the "unrecognized win32 error code"
> >> > message is printed. May have to dig up more on this.
> >>
> >> Hmm. See also this report:
> >>
> >> https://www.postgresql.org/message-id/flat/CABuU89MfEvJE%3DWif%2BHk7SCqjSOF4rhgwJWW6aR3hjojpGqFbjQ%40mail.gmail.com
> >>
> >
> > The files from pgwin32_open() are open for synchronous access, while pg_pread() uses the asynchronous functionality to offset the read. Under these circumstances, a read past EOF will return ERROR_HANDLE_EOF (38), as explained in:
>
> Oh, thanks.
>
> > https://devblogs.microsoft.com/oldnewthing/20150121-00/?p=44863
>
> !?!
>
> Amit, since it looks like you are Windows-enabled and have a repro,
> would you mind confirming that this fixes the problem?
>
> --- a/src/port/pread.c
> +++ b/src/port/pread.c
> @@ -41,6 +41,9 @@ pg_pread(int fd, void *buf, size_t size, off_t offset)
>  	overlapped.Offset = offset;
>  	if (!ReadFile(handle, buf, size, &result, &overlapped))
>  	{
> +		if (GetLastError() == ERROR_HANDLE_EOF)
> +			return 0;
> +
>  		_dosmaperr(GetLastError());
>  		return -1;
>  	}

Yes, this works for me.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Wed, Nov 20, 2019 at 4:54 PM Amit Khandekar <amitdkhan.pg@gmail.com> wrote: > - BufFileLoadBuffer() seems to deliberately ignore FileRead()'s return > value if it is -1 > if (file->nbytes < 0) file->nbytes = 0; Ok, that's a different problem we need to fix then. But it does explain how we didn't know. And sure enough there is "unrecognized win32 error code: 38" LOG-spam on the build farm, at places where tuplestores are expected: https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=whelk&dt=2019-11-20%2002%3A41%3A41&stg=check
On Tue, Nov 19, 2019 at 4:58 PM Amit Khandekar <amitdkhan.pg@gmail.com> wrote: > > On Tue, 19 Nov 2019 at 14:07, Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > On Mon, Nov 18, 2019 at 5:50 PM Amit Khandekar <amitdkhan.pg@gmail.com> wrote: > > > > > > For the API's that use VFDs (like PathNameOpenFile), the files opened > > > are always recorded in the VfdCache array. So it is not required to do > > > the cleanup at (sub)transaction end, because the kernel fds get closed > > > dynamically in ReleaseLruFiles() whenever they reach max_safe_fds > > > limit. So if a transaction aborts, the fds might remain open, but > > > those will get cleaned up whenever we require more fds, through > > > ReleaseLruFiles(). Whereas, for files opened through > > > OpenTransientFile(), VfdCache is not involved, so this needs > > > transaction end cleanup. > > > > > > > Have you tried by injecting some error? After getting the error > > mentioned above in email, when I retried the same query, I got the > > below message. > > > > postgres=# SELECT 1 from > > pg_logical_slot_get_changes('regression_slot', NULL,NULL) LIMIT 1; > > ERROR: could not remove file > > "pg_replslot/regression_slot/xid-1693-lsn-0-18000000.spill" during > > removal of pg_replslot/regression_slot/xid*: Permission denied > > > > And, then I tried to drop the replication slot and I got below error. > > postgres=# SELECT * FROM pg_drop_replication_slot('regression_slot'); > > ERROR: could not rename file "pg_replslot/regression_slot" to > > "pg_replslot/regression_slot.tmp": Permission denied > > > > It might be something related to Windows > > Oh ok, I missed the fact that on Windows we can't delete the files > that are already open, unlike Linux/Unix. > I guess, I may have to use FD_CLOSE_AT_EOXACT flags; or simply use > OpenTemporaryFile(). > I think setting FD_CLOSE_AT_EOXACT won't work unless you also set have_xact_temporary_files because it checks that flag in CleanupTempFiles. Also, OpenTemporaryFile() doesn't take the input file path, so how will you use it? > I wonder though if this same issue might come up > for the other use-case of PathNameOpenFile() : > logical_rewrite_log_mapping(). > It is possible, but I haven't tested that path. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Wed, Nov 20, 2019 at 5:44 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Wed, Nov 20, 2019 at 12:28 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> > +		if (GetLastError() == ERROR_HANDLE_EOF)
> > +			return 0;
> Yes, this works for me.

Thanks, pushed.
On Tue, Nov 19, 2019 at 4:58 PM Amit Khandekar <amitdkhan.pg@gmail.com> wrote: > > On Tue, 19 Nov 2019 at 14:07, Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > Have you tried by injecting some error? After getting the error > > mentioned above in email, when I retried the same query, I got the > > below message. > > > > postgres=# SELECT 1 from > > pg_logical_slot_get_changes('regression_slot', NULL,NULL) LIMIT 1; > > ERROR: could not remove file > > "pg_replslot/regression_slot/xid-1693-lsn-0-18000000.spill" during > > removal of pg_replslot/regression_slot/xid*: Permission denied > > > > And, then I tried to drop the replication slot and I got below error. > > postgres=# SELECT * FROM pg_drop_replication_slot('regression_slot'); > > ERROR: could not rename file "pg_replslot/regression_slot" to > > "pg_replslot/regression_slot.tmp": Permission denied > > > > It might be something related to Windows > > Oh ok, I missed the fact that on Windows we can't delete the files > that are already open, unlike Linux/Unix. > See comment in pgunlink() "We need to loop because even though PostgreSQL uses flags that allow unlink while the file is open, other applications might have the file open without those flags.". Can you once see if there is any flag that you have missed to pass to allow this? If there is nothing we can do about it, then we might need to use some different API or maybe define a new API that can handle this. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Wed, 20 Nov 2019 at 13:10, Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Tue, Nov 19, 2019 at 4:58 PM Amit Khandekar <amitdkhan.pg@gmail.com> wrote: > > > > On Tue, 19 Nov 2019 at 14:07, Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > > > Have you tried by injecting some error? After getting the error > > > mentioned above in email, when I retried the same query, I got the > > > below message. > > > > > > postgres=# SELECT 1 from > > > pg_logical_slot_get_changes('regression_slot', NULL,NULL) LIMIT 1; > > > ERROR: could not remove file > > > "pg_replslot/regression_slot/xid-1693-lsn-0-18000000.spill" during > > > removal of pg_replslot/regression_slot/xid*: Permission denied > > > > > > And, then I tried to drop the replication slot and I got below error. > > > postgres=# SELECT * FROM pg_drop_replication_slot('regression_slot'); > > > ERROR: could not rename file "pg_replslot/regression_slot" to > > > "pg_replslot/regression_slot.tmp": Permission denied > > > > > > It might be something related to Windows > > > > Oh ok, I missed the fact that on Windows we can't delete the files > > that are already open, unlike Linux/Unix. > > > > See comment in pgunlink() "We need to loop because even though > PostgreSQL uses flags that allow unlink while the file is open, other > applications might have the file > open without those flags.". Can you once see if there is any flag > that you have missed to pass to allow this? If there is nothing we > can do about it, then we might need to use some different API or maybe > define a new API that can handle this. Hmm, looks like there is one such flag: FILE_SHARE_DELETE. When file is opened with this flag, other processes can delete as well as rename the file. But it turns out that in pgwin32_open(), we already use FILE_SHARE_DELETE. So, this is again confusing. -- Thanks, -Amit Khandekar EnterpriseDB Corporation The Postgres Database Company
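For reference, this is roughly what the sharing flags look like at the Win32 level. This is a standalone sketch, not pgwin32_open() itself: the third argument to CreateFileA is the share mode, and omitting FILE_SHARE_DELETE there is what would normally block deletes and renames while a handle is open.

/*
 * Standalone Win32 sketch (not pgwin32_open()).  FILE_SHARE_DELETE in
 * the share mode is what allows other handles to delete or rename the
 * file while this one is open; pgwin32_open() already passes it,
 * which is why the remaining "Permission denied" failures are
 * surprising.
 */
#include <windows.h>

HANDLE
open_shared_for_read(const char *path)
{
    return CreateFileA(path,
                       GENERIC_READ,
                       FILE_SHARE_READ | FILE_SHARE_WRITE | FILE_SHARE_DELETE,
                       NULL,                    /* default security */
                       OPEN_EXISTING,
                       FILE_ATTRIBUTE_NORMAL,
                       NULL);                   /* no template file */
}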
On Wed, 20 Nov 2019 at 13:10, Amit Kapila <amit.kapila16@gmail.com> wrote: > See comment in pgunlink() "We need to loop because even though > PostgreSQL uses flags that allow unlink while the file is open, other > applications might have the file > open without those flags.". Can you once see if there is any flag > that you have missed to pass to allow this? > If there is nothing we > can do about it, then we might need to use some different API or maybe > define a new API that can handle this. There were objections against modifying the vfd api only for this replication-related use-case. Having a new API will require all the changes required to enable the virtual FDs feature that we need from vfd. If nothing works out from the FILE_SHARE_DELETE thing, I am thinking, we can use VFD, plus we can keep track of per-subtransaction vfd handles, and do something similar to AtEOSubXact_Files(). -- Thanks, -Amit Khandekar EnterpriseDB Corporation The Postgres Database Company
Re: logical decoding : exceeded maxAllocatedDescs for .spill files
From
Juan José Santamaría Flecha
Date:
On Wed, Nov 20, 2019 at 9:48 AM Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
On Wed, 20 Nov 2019 at 13:10, Amit Kapila <amit.kapila16@gmail.com> wrote:
> See comment in pgunlink() "We need to loop because even though
> PostgreSQL uses flags that allow unlink while the file is open, other
> applications might have the file
> open without those flags.". Can you once see if there is any flag
> that you have missed to pass to allow this?
> If there is nothing we
> can do about it, then we might need to use some different API or maybe
> define a new API that can handle this.
There were objections against modifying the vfd api only for this
replication-related use-case. Having a new API will require all the
changes required to enable the virtual FDs feature that we need from
vfd. If nothing works out from the FILE_SHARE_DELETE thing, I am
thinking, we can use VFD, plus we can keep track of per-subtransaction
vfd handles, and do something similar to AtEOSubXact_Files().
The comment about "other applications might have the file open without those flags." is surely due to systems working with an antivirus touching Postgres files.
I was not able to reproduce the Permission denied error with current HEAD until I opened another CMD inside the "pg_replslot/regression_slot" folder. That will be problematic: is the deletion of the folder actually needed?
Regards,
Juan José Santamaría Flecha
On 2019-Nov-20, Juan José Santamaría Flecha wrote: > I was not able to reproduce the Permission denied error with current HEAD, > until I opened another CMD inside the "pg_replslot/regression_slot" folder. > This will be problematic, is the deletion of the folder actually needed? Yes :-( The code assumes that if the directory is there, then it's valid. Trying to remove that assumption is probably a more invasive fix. I think ReplicationSlotDropAcquired is too pessimistic (no recourse if the rename fails) and too optimistic (this will almost never happen). We could change it so that the rename is retried a few times, and avoid the failure. (Naturally, the rmtree should also be retried.) The code seems written with the POSIX semantics in mind, but it seems easy to improve. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
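For illustration, a rough POSIX-flavoured sketch of the retry idea; the attempt count and the 100 ms delay are arbitrary assumptions here, not values from the server code.

#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static int
rename_with_retries(const char *from, const char *to, int attempts)
{
    int         i;

    for (i = 0; i < attempts; i++)
    {
        if (rename(from, to) == 0)
            return 0;

        /*
         * Typical transient causes on Windows: another process (antivirus,
         * a shell sitting in the directory) briefly holds it open.
         */
        fprintf(stderr, "rename %s -> %s failed (%s), retrying\n",
                from, to, strerror(errno));
        usleep(100 * 1000);     /* wait 100 ms before the next attempt */
    }
    return -1;
}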
On Wed, 20 Nov 2019 at 19:24, Alvaro Herrera <alvherre@2ndquadrant.com> wrote: > > On 2019-Nov-20, Juan José Santamaría Flecha wrote: > > > I was not able to reproduce the Permission denied error with current HEAD, > > until I opened another CMD inside the "pg_replslot/regression_slot" folder. > > This will be problematic, is the deletion of the folder actually needed? > > Yes :-( The code assumes that if the directory is there, then it's > valid. Trying to remove that assumption is probably a more invasive > fix. > > I think ReplicationSlotDropAcquired is too pessimistic (no recourse if > the rename fails) and too optimistic (this will almost never happen). > We could change it so that the rename is retried a few times, and avoid > the failure. (Naturally, the rmtree should also be retried.) The code > seems written with the POSIX semantics in mind, but it seems easy to > improve. Just to be clear, there are two issues being discussed here : 1. Issue with the patch, where pg_replslot/slotname/xid-*.spill files can't be removed because the same backend process has left these files opened because of an abort. This is happening despite the file being opened using FILE_SHARE_DELETE flag. I am going to investigate (possibly the flag is not applicable in case a single process is involved) 2. This existing issue where pg_replslot/slotname directory removal will fail if someone else is accessing this directory. This has nothing to do with the patch. -- Thanks, -Amit Khandekar EnterpriseDB Corporation The Postgres Database Company
On Wed, Nov 20, 2019 at 5:41 PM Juan José Santamaría Flecha <juanjo.santamaria@gmail.com> wrote: > > On Wed, Nov 20, 2019 at 9:48 AM Amit Khandekar <amitdkhan.pg@gmail.com> wrote: >> >> On Wed, 20 Nov 2019 at 13:10, Amit Kapila <amit.kapila16@gmail.com> wrote: >> > See comment in pgunlink() "We need to loop because even though >> > PostgreSQL uses flags that allow unlink while the file is open, other >> > applications might have the file >> > open without those flags.". Can you once see if there is any flag >> > that you have missed to pass to allow this? >> >> > If there is nothing we >> > can do about it, then we might need to use some different API or maybe >> > define a new API that can handle this. >> >> There were objections against modifying the vfd api only for this >> replication-related use-case. Having a new API will require all the >> changes required to enable the virtual FDs feature that we need from >> vfd. If nothing works out from the FILE_SHARE_DELETE thing, I am >> thinking, we can use VFD, plus we can keep track of per-subtransaction >> vfd handles, and do something similar to AtEOSubXact_Files(). >> > > The comment about "other applications might have the file open without those flags." is surely due to systems working withan antivirus touching Postgres files. > > I was not able to reproduce the Permission denied error with current HEAD, > I am not sure what exactly you tried. Can you share the steps and your environment details? -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Wed, Nov 20, 2019 at 2:18 PM Amit Khandekar <amitdkhan.pg@gmail.com> wrote: > > On Wed, 20 Nov 2019 at 13:10, Amit Kapila <amit.kapila16@gmail.com> wrote: > > See comment in pgunlink() "We need to loop because even though > > PostgreSQL uses flags that allow unlink while the file is open, other > > applications might have the file > > open without those flags.". Can you once see if there is any flag > > that you have missed to pass to allow this? > > > If there is nothing we > > can do about it, then we might need to use some different API or maybe > > define a new API that can handle this. > > There were objections against modifying the vfd api only for this > replication-related use-case. Having a new API will require all the > changes required to enable the virtual FDs feature that we need from > vfd. If nothing works out from the FILE_SHARE_DELETE thing, > While experimenting with FILE_SHARE_DELETE, I think you can once try to open/close the file before unlink. If you read the specs [1] of this flag, it seems they allow you to open the file with delete access even when it is already opened by someone else. I am not sure if that is helpful, but at least we can try out. > I am > thinking, we can use VFD, plus we can keep track of per-subtransaction > vfd handles, and do something similar to AtEOSubXact_Files(). > I think if we can't make the current API work, then it is better to sketch the design for this approach and probably the design of new API using existing infrastructure. Then we can see which approach people prefer. [1] - Enables subsequent open operations on a file or device to request delete access. Otherwise, other processes cannot open the file or device if they request delete access. (https://docs.microsoft.com/en-us/windows/win32/api/fileapi/nf-fileapi-createfilea) -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
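As an illustration of the sharing rule quoted above, a small Windows-only sketch: the delete succeeds while handles are still open only because every open handle granted FILE_SHARE_DELETE, and even then the file merely becomes delete-pending until the last handle is closed. The file name is made up and error handling is minimal.

#include <windows.h>
#include <stdio.h>

int
main(void)
{
    const char *path = "demo.spill";    /* hypothetical file name */
    HANDLE      h1,
                h2;

    h1 = CreateFileA(path, GENERIC_WRITE,
                     FILE_SHARE_READ | FILE_SHARE_WRITE | FILE_SHARE_DELETE,
                     NULL, CREATE_ALWAYS, FILE_ATTRIBUTE_NORMAL, NULL);
    h2 = CreateFileA(path, GENERIC_READ,
                     FILE_SHARE_READ | FILE_SHARE_WRITE | FILE_SHARE_DELETE,
                     NULL, OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);
    if (h1 == INVALID_HANDLE_VALUE || h2 == INVALID_HANDLE_VALUE)
        return 1;

    /* Succeeds: both existing handles allowed delete sharing. */
    if (!DeleteFileA(path))
        printf("DeleteFileA failed: %lu\n", (unsigned long) GetLastError());

    /* The file is now "delete pending" and disappears only here. */
    CloseHandle(h2);
    CloseHandle(h1);
    return 0;
}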
Re: logical decoding : exceeded maxAllocatedDescs for .spill files
From
Juan José Santamaría Flecha
Date:
On Thu, Nov 21, 2019 at 5:02 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
On Wed, Nov 20, 2019 at 5:41 PM Juan José Santamaría Flecha
<juanjo.santamaria@gmail.com> wrote:
>
> On Wed, Nov 20, 2019 at 9:48 AM Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
>>
>> On Wed, 20 Nov 2019 at 13:10, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> > See comment in pgunlink() "We need to loop because even though
>> > PostgreSQL uses flags that allow unlink while the file is open, other
>> > applications might have the file
>> > open without those flags.". Can you once see if there is any flag
>> > that you have missed to pass to allow this?
>>
>> > If there is nothing we
>> > can do about it, then we might need to use some different API or maybe
>> > define a new API that can handle this.
>>
>> There were objections against modifying the vfd api only for this
>> replication-related use-case. Having a new API will require all the
>> changes required to enable the virtual FDs feature that we need from
>> vfd. If nothing works out from the FILE_SHARE_DELETE thing, I am
>> thinking, we can use VFD, plus we can keep track of per-subtransaction
>> vfd handles, and do something similar to AtEOSubXact_Files().
>>
>
> The comment about "other applications might have the file open without those flags." is surely due to systems working with an antivirus touching Postgres files.
>
> I was not able to reproduce the Permission denied error with current HEAD,
>
I am not sure what exactly you tried. Can you share the steps and
your environment details?
Sure, I was trying to reproduce the Permission denied error after the ERROR_HANDLE_EOF fix.
1. Using a clean environment [1], the spill.sql script produces the expected output.
2. I manually injected a negative value for readBytes after @@ -2611,10 +2627,11 @@ ReorderBufferRestoreChanges(ReorderBuffer *rb, ReorderBufferTXN *txn. In doing so, pg_logical_slot_get_changes() failed, but subsequent executions did not run into Permission denied.
3. During the cleanup of some of the tests, pg_drop_replication_slot() failed because the "pg_replslot/regression_slot" folder was in use.
[1] Win10 (1903) MSVC 19.22.27905
Regards,
Juan José Santamaría Flecha
On Thu, Nov 21, 2019 at 8:32 PM Juan José Santamaría Flecha <juanjo.santamaria@gmail.com> wrote: > > On Thu, Nov 21, 2019 at 5:02 AM Amit Kapila <amit.kapila16@gmail.com> wrote: >> >> On Wed, Nov 20, 2019 at 5:41 PM Juan José Santamaría Flecha >> <juanjo.santamaria@gmail.com> wrote: >> > >> > On Wed, Nov 20, 2019 at 9:48 AM Amit Khandekar <amitdkhan.pg@gmail.com> wrote: >> >> >> >> On Wed, 20 Nov 2019 at 13:10, Amit Kapila <amit.kapila16@gmail.com> wrote: >> >> > See comment in pgunlink() "We need to loop because even though >> >> > PostgreSQL uses flags that allow unlink while the file is open, other >> >> > applications might have the file >> >> > open without those flags.". Can you once see if there is any flag >> >> > that you have missed to pass to allow this? >> >> >> >> > If there is nothing we >> >> > can do about it, then we might need to use some different API or maybe >> >> > define a new API that can handle this. >> >> >> >> There were objections against modifying the vfd api only for this >> >> replication-related use-case. Having a new API will require all the >> >> changes required to enable the virtual FDs feature that we need from >> >> vfd. If nothing works out from the FILE_SHARE_DELETE thing, I am >> >> thinking, we can use VFD, plus we can keep track of per-subtransaction >> >> vfd handles, and do something similar to AtEOSubXact_Files(). >> >> >> > >> > The comment about "other applications might have the file open without those flags." is surely due to systems workingwith an antivirus touching Postgres files. >> > >> > I was not able to reproduce the Permission denied error with current HEAD, >> > >> >> I am not sure what exactly you tried. Can you share the steps and >> your environment details? >> > > Sure, I was trying to reproduce the Permission denied error after the ERROR_HANDLE_EOF fix. > Have you tried before that fix , if not, can you once try by temporarily reverting that fix in your environment and share the output of each step? After you get the error due to EOF, check that you have .spill files in pg_replslot/<slot_name>/ and then again try to get changes by pg_logical_slot_get_changes(). If you want, you can use the test provided in Amit Khandekar's patch. > > [1] Win10 (1903) MSVC 19.22.27905 > I have tested this on Windows7. I am not sure if it is due to a different version of windows, but I think we can't rule out that possibility. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Fri, 22 Nov 2019 at 09:08, Amit Kapila <amit.kapila16@gmail.com> wrote: > Have you tried before that fix , if not, can you once try by > temporarily reverting that fix in your environment and share the > output of each step? After you get the error due to EOF, check that > you have .spill files in pg_replslot/<slot_name>/ and then again try > to get changes by pg_logical_slot_get_changes(). If you want, you > can use the test provided in Amit Khandekar's patch. On my Linux machine, I added elog() in ReorderBufferRestoreChanges(), just after FileRead() returns 0. This results in error. But the thing is, in ReorderBufferCommit(), the error is already handled using PG_CATCH : PG_CATCH(); { ..... AbortCurrentTransaction(); ....... if (using_subtxn) RollbackAndReleaseCurrentSubTransaction(); ........ ........ /* remove potential on-disk data, and deallocate */ ReorderBufferCleanupTXN(rb, txn); } So ReorderBufferCleanupTXN() removes all the .spill files using unlink(). And on Windows, what should happen is : unlink() should succeed because the file is opened using FILE_SHARE_DELETE. But the files should still remain there because these are still open. It is just marked for deletion until there is no one having opened the file. That is what is my conclusion from running a sample attached program test.c . But what you are seeing is "Permission denied" errors. Not sure why unlink() is failing. The thing that is still a problem is : On Windows, if the file remains open, and later even when the unlink() succeeds, the file will be left there until it is closed. So subsequent operations will open the same old file. Not sure what happens if we open a file that is marked for deletion. - Thanks, -Amit Khandekar EnterpriseDB Corporation The Postgres Database Company
Attachment
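A sketch along the lines of the experiment described above (not the actual attached test.c): unlink a still-open file twice on Windows. The first delete succeeds but only marks the file delete-pending; the second one fails with error 5 (ERROR_ACCESS_DENIED), which is what the backend surfaces as "Permission denied". The file name is made up.

#include <windows.h>
#include <stdio.h>

int
main(void)
{
    const char *path = "xid-demo.spill";    /* hypothetical file name */
    HANDLE      h;
    BOOL        ok;

    h = CreateFileA(path, GENERIC_WRITE,
                    FILE_SHARE_READ | FILE_SHARE_WRITE | FILE_SHARE_DELETE,
                    NULL, CREATE_ALWAYS, FILE_ATTRIBUTE_NORMAL, NULL);
    if (h == INVALID_HANDLE_VALUE)
        return 1;

    /* First unlink: succeeds, file becomes delete-pending (handle still open). */
    ok = DeleteFileA(path);
    printf("first DeleteFileA: %d\n", ok);

    /* Second unlink on the delete-pending file: fails with error 5. */
    ok = DeleteFileA(path);
    printf("second DeleteFileA: %d, error %lu\n", ok, (unsigned long) GetLastError());

    CloseHandle(h);             /* only now is the file actually removed */
    return 0;
}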
On Fri, Nov 22, 2019 at 11:00 AM Amit Khandekar <amitdkhan.pg@gmail.com> wrote: > > On Fri, 22 Nov 2019 at 09:08, Amit Kapila <amit.kapila16@gmail.com> wrote: > > Have you tried before that fix , if not, can you once try by > > temporarily reverting that fix in your environment and share the > > output of each step? After you get the error due to EOF, check that > > you have .spill files in pg_replslot/<slot_name>/ and then again try > > to get changes by pg_logical_slot_get_changes(). If you want, you > > can use the test provided in Amit Khandekar's patch. > > On my Linux machine, I added elog() in ReorderBufferRestoreChanges(), > just after FileRead() returns 0. This results in error. But the thing is, in > ReorderBufferCommit(), the error is already handled using PG_CATCH : > > PG_CATCH(); > { > ..... > AbortCurrentTransaction(); > ....... > if (using_subtxn) > RollbackAndReleaseCurrentSubTransaction(); > ........ > ........ > /* remove potential on-disk data, and deallocate */ > ReorderBufferCleanupTXN(rb, txn); > } > > So ReorderBufferCleanupTXN() removes all the .spill files using unlink(). > > And on Windows, what should happen is : unlink() should succeed > because the file is opened using FILE_SHARE_DELETE. But the files > should still remain there because these are still open. It is just > marked for deletion until there is no one having opened the file. That > is what is my conclusion from running a sample attached program test.c > I think this is exactly the reason for the problem. In my test [1], the error "permission denied" occurred when I second time executed pg_logical_slot_get_changes() which means on first execution the unlink would have been successful but the files are still not removed as they were not closed. Then on second execution, it gets an error "Permission denied" when it again tries to unlink files via ReorderBufferCleanupSerializedTXNs(). . > But what you are seeing is "Permission denied" errors. Not sure why > unlink() is failing. > In your test program, if you try to unlink the file second time, you should see the error "Permission denied". [1] - https://www.postgresql.org/message-id/CAA4eK1%2Bcey6i6a0zD9kk_eaDXb4RPNZqu4UwXO9LbHAgMpMBkg%40mail.gmail.com -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Fri, 22 Nov 2019 at 4:26 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
On Fri, Nov 22, 2019 at 11:00 AM Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
>
> On Fri, 22 Nov 2019 at 09:08, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > Have you tried before that fix , if not, can you once try by
> > temporarily reverting that fix in your environment and share the
> > output of each step? After you get the error due to EOF, check that
> > you have .spill files in pg_replslot/<slot_name>/ and then again try
> > to get changes by pg_logical_slot_get_changes(). If you want, you
> > can use the test provided in Amit Khandekar's patch.
>
> On my Linux machine, I added elog() in ReorderBufferRestoreChanges(),
> just after FileRead() returns 0. This results in error. But the thing is, in
> ReorderBufferCommit(), the error is already handled using PG_CATCH :
>
> PG_CATCH();
> {
> .....
> AbortCurrentTransaction();
> .......
> if (using_subtxn)
> RollbackAndReleaseCurrentSubTransaction();
> ........
> ........
> /* remove potential on-disk data, and deallocate */
> ReorderBufferCleanupTXN(rb, txn);
> }
>
> So ReorderBufferCleanupTXN() removes all the .spill files using unlink().
>
> And on Windows, what should happen is : unlink() should succeed
> because the file is opened using FILE_SHARE_DELETE. But the files
> should still remain there because these are still open. It is just
> marked for deletion until there is no one having opened the file. That
> is what is my conclusion from running a sample attached program test.c
>
I think this is exactly the reason for the problem. In my test [1],
the error "permission denied" occurred when I second time executed
pg_logical_slot_get_changes() which means on first execution the
unlink would have been successful but the files are still not removed
as they were not closed. Then on second execution, it gets an error
"Permission denied" when it again tries to unlink files via
ReorderBufferCleanupSerializedTXNs().
.
> But what you are seeing is "Permission denied" errors. Not sure why
> unlink() is failing.
>
In your test program, if you try to unlink the file second time, you
should see the error "Permission denied".
I tested using the sample program and indeed I got the error 5 (access denied) when I called unlink the second time.
[1] - https://www.postgresql.org/message-id/CAA4eK1%2Bcey6i6a0zD9kk_eaDXb4RPNZqu4UwXO9LbHAgMpMBkg%40mail.gmail.com
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company
Re: logical decoding : exceeded maxAllocatedDescs for .spill files
From
Juan José Santamaría Flecha
Date:
On Fri, Nov 22, 2019 at 4:38 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
On Thu, Nov 21, 2019 at 8:32 PM Juan José Santamaría Flecha
<juanjo.santamaria@gmail.com> wrote:
>
> [1] Win10 (1903) MSVC 19.22.27905
>
I have tested this on Windows7. I am not sure if it is due to a
different version of windows, but I think we can't rule out that
possibility.
This seems to be the case. The unexpected behaviour is on my end: my system behaves as described for FILE_DISPOSITION_POSIX_SEMANTICS [1].
The expected behaviour is what you have already diagnosed.
[1] https://docs.microsoft.com/es-es/windows-hardware/drivers/ddi/ntddk/ns-ntddk-_file_disposition_information_ex
Regards,
Juan José Santamaría Flecha
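For reference, the Win32-level way to request those POSIX delete semantics explicitly looks roughly like the sketch below (it assumes Windows 10 and a recent SDK; this only illustrates the flag the linked document describes, it is not code from the patch).

#include <windows.h>

static BOOL
unlink_posix_semantics(const char *path)
{
    HANDLE      h;
    FILE_DISPOSITION_INFO_EX fdi;
    BOOL        ok;

    h = CreateFileA(path, DELETE,
                    FILE_SHARE_READ | FILE_SHARE_WRITE | FILE_SHARE_DELETE,
                    NULL, OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);
    if (h == INVALID_HANDLE_VALUE)
        return FALSE;

    /*
     * With POSIX semantics the name is removed from the directory as soon as
     * this handle is closed, even if other FILE_SHARE_DELETE handles are
     * still open on the file.
     */
    fdi.Flags = FILE_DISPOSITION_FLAG_DELETE | FILE_DISPOSITION_FLAG_POSIX_SEMANTICS;
    ok = SetFileInformationByHandle(h, FileDispositionInfoEx, &fdi, sizeof(fdi));
    CloseHandle(h);
    return ok;
}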
On Fri, Nov 22, 2019 at 7:38 PM Amit Khandekar <amitdkhan.pg@gmail.com> wrote: > > On Fri, 22 Nov 2019 at 4:26 PM, Amit Kapila <amit.kapila16@gmail.com> wrote: >> >> I think this is exactly the reason for the problem. In my test [1], >> the error "permission denied" occurred when I second time executed >> pg_logical_slot_get_changes() which means on first execution the >> unlink would have been successful but the files are still not removed >> as they were not closed. Then on second execution, it gets an error >> "Permission denied" when it again tries to unlink files via >> ReorderBufferCleanupSerializedTXNs(). >> >> >> . >> > But what you are seeing is "Permission denied" errors. Not sure why >> > unlink() is failing. >> > >> >> In your test program, if you try to unlink the file second time, you >> should see the error "Permission denied". > > I tested using the sample program and indeed I got the error 5 (access denied) when I called unlink the second time. > So, what is the next step here? How about if we somehow check whether the file exists before doing unlink, say by using stat? If that doesn't work, I think we might need to go in the direction of tracking file handles in some way, so that they can be closed during an abort. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Tue, 26 Nov 2019 at 10:49, Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Fri, Nov 22, 2019 at 7:38 PM Amit Khandekar <amitdkhan.pg@gmail.com> wrote: > > > > On Fri, 22 Nov 2019 at 4:26 PM, Amit Kapila <amit.kapila16@gmail.com> wrote: > >> > >> I think this is exactly the reason for the problem. In my test [1], > >> the error "permission denied" occurred when I second time executed > >> pg_logical_slot_get_changes() which means on first execution the > >> unlink would have been successful but the files are still not removed > >> as they were not closed. Then on second execution, it gets an error > >> "Permission denied" when it again tries to unlink files via > >> ReorderBufferCleanupSerializedTXNs(). > >> > >> > >> . > >> > But what you are seeing is "Permission denied" errors. Not sure why > >> > unlink() is failing. > >> > > >> > >> In your test program, if you try to unlink the file second time, you > >> should see the error "Permission denied". > > > > I tested using the sample program and indeed I got the error 5 (access denied) when I called unlink the second time. > > > > So, what is the next step here? How about if we somehow check whether > the file exists before doing unlink, say by using stat? But the thing is, the behaviour is so much in a grey area, that we cannot reliably say for instance that when stat() says there is no such file, there is indeed no such file, and if we re-create the same file when it is still open, it is always going to open a new file, etc. > If that doesn't work, I think we might need to go in the direction of tracking > file handles in some way, so that they can be closed during an abort. Yeah, that is one way. I am still working on different approaches. WIll get back with proposals. -- Thanks, -Amit Khandekar EnterpriseDB Corporation The Postgres Database Company
On Tue, Nov 26, 2019 at 11:19 AM Amit Khandekar <amitdkhan.pg@gmail.com> wrote: > > On Tue, 26 Nov 2019 at 10:49, Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > So, what is the next step here? How about if we somehow check whether > > the file exists before doing unlink, say by using stat? > But the thing is, the behaviour is so much in a grey area, that we > cannot reliably say for instance that when stat() says there is no > such file, there is indeed no such file, > Why so? > and if we re-create the same > file when it is still open, it is always going to open a new file, > etc. > Yeah, or maybe even if we don't create with the same name, there is always be some dangling file which again doesn't sound like a good thing. > > If that doesn't work, I think we might need to go in the direction of tracking > > file handles in some way, so that they can be closed during an abort. > Yeah, that is one way. I am still working on different approaches. > WIll get back with proposals. > Fair enough. See, if you can also consider an approach that is local to ReorderBuffer module wherein we can track those handles in ReorderBufferTxn or some other place local to that module. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Tue, 26 Nov 2019 at 12:10, Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Tue, Nov 26, 2019 at 11:19 AM Amit Khandekar <amitdkhan.pg@gmail.com> wrote: > > > > On Tue, 26 Nov 2019 at 10:49, Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > > > So, what is the next step here? How about if we somehow check whether > > > the file exists before doing unlink, say by using stat? > > But the thing is, the behaviour is so much in a grey area, that we > > cannot reliably say for instance that when stat() says there is no > > such file, there is indeed no such file, > > > > Why so? This was just an example. What I meant was, we are really not sure of the behaviour of file operations when the file is in this state. unlink() returning "Permission denied" when called twice is itself weird enough. > > > and if we re-create the same > > file when it is still open, it is always going to open a new file, > > etc. > > > > Yeah, or maybe even if we don't create with the same name, there is > always be some dangling file which again doesn't sound like a good > thing. Right. > > > > If that doesn't work, I think we might need to go in the direction of tracking > > > file handles in some way, so that they can be closed during an abort. > > Yeah, that is one way. I am still working on different approaches. > > WIll get back with proposals. > > > > Fair enough. See, if you can also consider an approach that is local > to ReorderBuffer module wherein we can track those handles in > ReorderBufferTxn or some other place local to that module. What I found was : We do attempt to close the opened vfds in the PG_CATCH block. In ReorderBufferCommit(), ReorderBufferIterTXNFinish is called both in PG_TRY and PG_CATCH. This closes all the opened vfds. But the issue is : if the ereport() occurs inside ReorderBufferIterTXNInit(), then iterstate is still NULL. So in PG_CATCH, ReorderBufferIterTXNFinish() is not called, so the vfds in state->entries[] remain open. We can have &iterstate passed to ReorderBufferIterTXNInit() as another argument, and initialize it first thing inside the function. This way, it will never be NULL. But need to be careful about the possibility of having a iterstate in a half-cooked state, so cleanup might use some uninitialized handles. Will work on it. At least, we can make sure the iterstate->entries handle doesn't have junk values. -- Thanks, -Amit Khandekar EnterpriseDB Corporation The Postgres Database Company
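A simplified, self-contained analogue of that idea, using made-up names (IterState, iter_init, iter_finish) rather than the actual reorderbuffer code: the init function publishes its state through the out parameter before doing anything that can fail, and the cleanup function tolerates a partially initialized state, so the caller's error path can always close whatever was opened.

#include <stdio.h>
#include <stdlib.h>

typedef struct IterState
{
    FILE      **entries;        /* open per-transaction spill files */
    int         nentries;
} IterState;

static int
iter_init(IterState **iter_state, const char **paths, int npaths)
{
    IterState  *state = calloc(1, sizeof(IterState));
    int         i;

    *iter_state = state;        /* publish before anything can fail */
    if (state == NULL)
        return -1;

    state->entries = calloc((size_t) npaths, sizeof(FILE *));
    if (state->entries == NULL)
        return -1;

    for (i = 0; i < npaths; i++)
    {
        state->entries[i] = fopen(paths[i], "rb");
        if (state->entries[i] == NULL)
            return -1;          /* caller's cleanup still sees what is open */
        state->nentries++;
    }
    return 0;
}

static void
iter_finish(IterState *state)
{
    int         i;

    if (state == NULL)          /* init never managed to publish anything */
        return;
    for (i = 0; i < state->nentries; i++)
        fclose(state->entries[i]);
    free(state->entries);
    free(state);
}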
On Wed, 27 Nov 2019 at 14:16, Amit Khandekar <amitdkhan.pg@gmail.com> wrote: > What I found was : We do attempt to close the opened vfds in the > PG_CATCH block. In ReorderBufferCommit(), ReorderBufferIterTXNFinish > is called both in PG_TRY and PG_CATCH. This closes all the opened > vfds. But the issue is : if the ereport() occurs inside > ReorderBufferIterTXNInit(), then iterstate is still NULL. So in > PG_CATCH, ReorderBufferIterTXNFinish() is not called, so the vfds in > state->entries[] remain open. > > We can have &iterstate passed to ReorderBufferIterTXNInit() as another > argument, and initialize it first thing inside the function. This way, > it will never be NULL. But need to be careful about the possibility of > having a iterstate in a half-cooked state, so cleanup might use some > uninitialized handles. Will work on it. At least, we can make sure the > iterstate->entries handle doesn't have junk values. Done as stated above; attached v3 patch. I have verified that the file handles do get closed in PG_CATCH block via ReorderBufferIterTXNFinish(). -- Thanks, -Amit Khandekar EnterpriseDB Corporation The Postgres Database Company
Attachment
On Tue, Dec 3, 2019 at 11:10 AM Amit Khandekar <amitdkhan.pg@gmail.com> wrote: > > On Wed, 27 Nov 2019 at 14:16, Amit Khandekar <amitdkhan.pg@gmail.com> wrote: > > What I found was : We do attempt to close the opened vfds in the > > PG_CATCH block. In ReorderBufferCommit(), ReorderBufferIterTXNFinish > > is called both in PG_TRY and PG_CATCH. This closes all the opened > > vfds. But the issue is : if the ereport() occurs inside > > ReorderBufferIterTXNInit(), then iterstate is still NULL. So in > > PG_CATCH, ReorderBufferIterTXNFinish() is not called, so the vfds in > > state->entries[] remain open. > > > > We can have &iterstate passed to ReorderBufferIterTXNInit() as another > > argument, and initialize it first thing inside the function. This way, > > it will never be NULL. But need to be careful about the possibility of > > having a iterstate in a half-cooked state, so cleanup might use some > > uninitialized handles. Will work on it. At least, we can make sure the > > iterstate->entries handle doesn't have junk values. > > Done as stated above; attached v3 patch. I have verified that the file > handles do get closed in PG_CATCH block via > ReorderBufferIterTXNFinish(). > I couldn't reproduce the original problem (on HEAD) reported with the test case in the patch. So, I can't verify the fix. I think it is because of recent commits cec2edfa7859279f36d2374770ca920c59c73dd8 and 9290ad198b15d6b986b855d2a58d087a54777e87. It seems you need to either change the value of logical_decoding_work_mem or change the test in some way so that the original problem can be reproduced. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Thu, Dec 5, 2019 at 4:20 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Tue, Dec 3, 2019 at 11:10 AM Amit Khandekar <amitdkhan.pg@gmail.com> wrote: > > > > > > Done as stated above; attached v3 patch. I have verified that the file > > handles do get closed in PG_CATCH block via > > ReorderBufferIterTXNFinish(). > > > > I couldn't reproduce the original problem (on HEAD) reported with the > test case in the patch. So, I can't verify the fix. I think it is > because of recent commits cec2edfa7859279f36d2374770ca920c59c73dd8 and > 9290ad198b15d6b986b855d2a58d087a54777e87. It seems you need to either > change the value of logical_decoding_work_mem or change the test in > some way so that the original problem can be reproduced. > Few comments: ---------------------- 1. + /* Now that the state fields are initialized, it is safe to return it. */ + *iter_state = state; + /* allocate heap */ state->heap = binaryheap_allocate(state->nr_txns, ReorderBufferIterCompare, Is there a reason for not initializing iter_state after binaryheap_allocate? If we do so, then we don't need additional check you have added in ReorderBufferIterTXNFinish. 2. /* No harm in resetting the offset even in case of failure */ file->curOffset = 0; The above comment is not clear because you are not setting it in case of error rather this is a success path. 3. + * + * Note: The iterator state is returned through iter_state parameter rather + * than the function's return value. This is because the state gets cleaned up + * in a PG_CATCH block, so we want to make sure the caller gets back the state + * even if this function throws an exception, so that the state resources can + * be cleaned up. How about changing it slightly as follows to make it more clear. "Note: The iterator state is returned through iter_state parameter rather than the function's return value. This is because the state gets cleaned up in a PG_CATCH block in the caller, so we want to make sure the caller gets back the state even if this function throws an exception." 4. I think we should also check how much time increase will happen for test_decoding regression test after the test added by this patch? 5. One naive question about the usage of PathNameOpenFile(). When it reaches the max limit, it will automatically close one of the files, but how will that be reflected in the data structure (TXNEntryFile) you are managing. Basically, when PathNameOpenFile closes some file, how will the corresponding vfd in TXNEntryFile be changed. Because if it is not changed, then won't it start pointing to some wrong filehandle? -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Fri, 6 Dec 2019 at 15:40, Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Thu, Dec 5, 2019 at 4:20 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > On Tue, Dec 3, 2019 at 11:10 AM Amit Khandekar <amitdkhan.pg@gmail.com> wrote: > > > > > > > > > Done as stated above; attached v3 patch. I have verified that the file > > > handles do get closed in PG_CATCH block via > > > ReorderBufferIterTXNFinish(). > > > > > > > I couldn't reproduce the original problem (on HEAD) reported with the > > test case in the patch. So, I can't verify the fix. I think it is > > because of recent commits cec2edfa7859279f36d2374770ca920c59c73dd8 and > > 9290ad198b15d6b986b855d2a58d087a54777e87. It seems you need to either > > change the value of logical_decoding_work_mem or change the test in > > some way so that the original problem can be reproduced. Yeah, it does seem like the commit cec2edfa78592 must have caused the test to not reproduce on your env, although the test does fail for me still. Setting logical_decoding_work_mem to a low value does sound like a good idea. Will work on it. > > > > Few comments: > ---------------------- > > 1. > + /* Now that the state fields are initialized, it is safe to return it. */ > + *iter_state = state; > + > /* allocate heap */ > state->heap = > binaryheap_allocate(state->nr_txns, > ReorderBufferIterCompare, > > Is there a reason for not initializing iter_state after > binaryheap_allocate? If we do so, then we don't need additional check > you have added in ReorderBufferIterTXNFinish. If iter_state is initialized *after* binaryheap_allocate, then we won't be able to close the vfds if binaryheap_allocate() ereports(). > > 2. > /* No harm in resetting the offset even in case of failure */ > file->curOffset = 0; > > The above comment is not clear because you are not setting it in case > of error rather this is a success path. I meant, even if PathNameOpenFile() failed, it is ok to set file->curOffset to 0, so need not set it only in case of *fd >= 0. In most of the cases, fd would be valid, so just set file->curOffset to 0 always. > > 3. > + * > + * Note: The iterator state is returned through iter_state parameter rather > + * than the function's return value. This is because the state gets cleaned up > + * in a PG_CATCH block, so we want to make sure the caller gets back the state > + * even if this function throws an exception, so that the state resources can > + * be cleaned up. > > How about changing it slightly as follows to make it more clear. > "Note: The iterator state is returned through iter_state parameter > rather than the function's return value. This is because the state > gets cleaned up in a PG_CATCH block in the caller, so we want to make > sure the caller gets back the state even if this function throws an > exception." Agreed. Will do that in the next patch version. > > 4. I think we should also check how much time increase will happen for > test_decoding regression test after the test added by this patch? Yeah, it currently takes noticeably longer compared to the others. Let's see if setting logical_decoding_work_mem to a min value allows us to reproduce the test with much lesser number of inserts. > > 5. One naive question about the usage of PathNameOpenFile(). When it > reaches the max limit, it will automatically close one of the files, > but how will that be reflected in the data structure (TXNEntryFile) > you are managing. Basically, when PathNameOpenFile closes some file, > how will the corresponding vfd in TXNEntryFile be changed. 
Because if > it is not changed, then won't it start pointing to some wrong > filehandle? In PathNameOpenFile(), excess kernel fds could be closed (ReleaseLruFiles). But with that, the vfds themselves don't get invalidated. Only the underlying kernel fd gets closed, and the vfd->fd is marked VFD_CLOSED. The vfd array element remains valid (a non-null vfd->fileName means the vfd slot is valid; check FileIsValid). So later, when FileRead(vfd1) is called and that vfd1 happens to be the one that had got it's kernel fd closed, it gets opened again through FileAccess(). -- Thanks, -Amit Khandekar EnterpriseDB Corporation The Postgres Database Company
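To illustrate that behaviour, here is a toy, self-contained analogue of a virtual file descriptor slot: the kernel fd can be released under fd pressure, but the slot keeps the path and the current offset, and the next read transparently reopens the file and seeks back. Names (VirtualFd, vfd_access, and so on) are invented for the sketch; this is not the fd.c implementation.

#include <fcntl.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

typedef struct VirtualFd
{
    char        path[256];      /* non-empty means the slot is valid */
    int         fd;             /* -1 when the kernel fd has been released */
    off_t       cur_offset;     /* where the next read should continue */
} VirtualFd;

static int
vfd_open(VirtualFd *vfd, const char *path)
{
    snprintf(vfd->path, sizeof(vfd->path), "%s", path);
    vfd->fd = open(path, O_RDONLY);
    vfd->cur_offset = 0;
    return (vfd->fd < 0) ? -1 : 0;
}

/* Reopen the kernel fd if it was released, and seek to the saved offset. */
static int
vfd_access(VirtualFd *vfd)
{
    if (vfd->fd < 0)
    {
        vfd->fd = open(vfd->path, O_RDONLY);
        if (vfd->fd < 0)
            return -1;
        lseek(vfd->fd, vfd->cur_offset, SEEK_SET);
    }
    return 0;
}

static ssize_t
vfd_read(VirtualFd *vfd, void *buf, size_t len)
{
    ssize_t     nread;

    if (vfd_access(vfd) < 0)
        return -1;
    nread = read(vfd->fd, buf, len);
    if (nread > 0)
        vfd->cur_offset += nread;   /* remember how far we have read */
    return nread;
}

/* Under fd pressure, release only the kernel fd; the slot stays usable. */
static void
vfd_release_kernel_fd(VirtualFd *vfd)
{
    if (vfd->fd >= 0)
    {
        close(vfd->fd);
        vfd->fd = -1;
    }
}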
On Fri, Dec 6, 2019 at 5:00 PM Amit Khandekar <amitdkhan.pg@gmail.com> wrote: > > On Fri, 6 Dec 2019 at 15:40, Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > Few comments: > > ---------------------- > > > > 1. > > + /* Now that the state fields are initialized, it is safe to return it. */ > > + *iter_state = state; > > + > > /* allocate heap */ > > state->heap = > > binaryheap_allocate(state->nr_txns, > > ReorderBufferIterCompare, > > > > Is there a reason for not initializing iter_state after > > binaryheap_allocate? If we do so, then we don't need additional check > > you have added in ReorderBufferIterTXNFinish. > > If iter_state is initialized *after* binaryheap_allocate, then we > won't be able to close the vfds if binaryheap_allocate() ereports(). > Is it possible to have vfds opened before binaryheap_allocate(), if so how? > > > > 5. One naive question about the usage of PathNameOpenFile(). When it > > reaches the max limit, it will automatically close one of the files, > > but how will that be reflected in the data structure (TXNEntryFile) > > you are managing. Basically, when PathNameOpenFile closes some file, > > how will the corresponding vfd in TXNEntryFile be changed. Because if > > it is not changed, then won't it start pointing to some wrong > > filehandle? > > In PathNameOpenFile(), excess kernel fds could be closed > (ReleaseLruFiles). But with that, the vfds themselves don't get > invalidated. Only the underlying kernel fd gets closed, and the > vfd->fd is marked VFD_CLOSED. The vfd array element remains valid (a > non-null vfd->fileName means the vfd slot is valid; check > FileIsValid). So later, when FileRead(vfd1) is called and that vfd1 > happens to be the one that had got it's kernel fd closed, it gets > opened again through FileAccess(). > I was under impression that once the fd is closed due to excess kernel fds that are opened, the slot in VfdCache array could be resued by someone else, but on closer inspection that is not true. It will be only available for reuse after we explicitly call FileClose, right? -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Wed, Nov 20, 2019 at 1:14 AM Juan José Santamaría Flecha <juanjo.santamaria@gmail.com> wrote: > On Tue, Nov 19, 2019 at 12:49 PM Thomas Munro <thomas.munro@gmail.com> wrote: >> On Wed, Nov 20, 2019 at 12:28 AM Amit Khandekar <amitdkhan.pg@gmail.com> wrote: >> > On Windows, it is documented that ReadFile() (which is called by >> > pg_pread) will return false on EOF but only when the file is open for >> > asynchronous reads/writes. But here we are just dealing with usual >> > synchronous reads. So pg_pread() code should indeed return 0 on EOF on >> > Windows. Not yet able to figure out how FileRead() managed to return >> > this error on Windows. But from your symptoms, it does look like >> > pg_pread()=>ReadFile() returned false (despite doing asynchronous >> > reads), and so _dosmaperr() gets called, and then it does not find the >> > eof error in doserrors[], so the "unrecognized win32 error code" >> > message is printed. May have to dig up more on this. >> >> Hmm. See also this report: >> >> https://www.postgresql.org/message-id/flat/CABuU89MfEvJE%3DWif%2BHk7SCqjSOF4rhgwJWW6aR3hjojpGqFbjQ%40mail.gmail.com > > The files from pgwin32_open() are open for synchronous access, while pg_pread() uses the asynchronous functionality tooffset the read. Under these circunstances, a read past EOF will return ERROR_HANDLE_EOF (38), as explained in: > > https://devblogs.microsoft.com/oldnewthing/20150121-00/?p=44863 FWIW, I sent a pull request to see if the MicrosoftDocs project agrees that the ReadFile page is misleading on this point: https://github.com/MicrosoftDocs/sdk-api/pull/7
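For context, a pread-style wrapper on Windows looks roughly like the sketch below: the read offset is passed through an OVERLAPPED structure even on a synchronous handle, and in that mode reading past end of file makes ReadFile() fail with ERROR_HANDLE_EOF, which has to be mapped back to "0 bytes read". This only illustrates the behaviour being discussed; it is not the actual src/port code.

#include <windows.h>
#include <string.h>

static int
pread_win32(HANDLE h, void *buf, DWORD len, unsigned long long offset, DWORD *nread)
{
    OVERLAPPED  ov;

    memset(&ov, 0, sizeof(ov));
    ov.Offset = (DWORD) offset;
    ov.OffsetHigh = (DWORD) (offset >> 32);

    if (!ReadFile(h, buf, len, nread, &ov))
    {
        if (GetLastError() == ERROR_HANDLE_EOF)
        {
            *nread = 0;         /* a read past EOF is just EOF, not an error */
            return 0;
        }
        return -1;              /* genuine I/O error */
    }
    return 0;
}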
On Sat, 7 Dec 2019 at 11:37, Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Fri, Dec 6, 2019 at 5:00 PM Amit Khandekar <amitdkhan.pg@gmail.com> wrote: > > > > On Fri, 6 Dec 2019 at 15:40, Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > > > Few comments: > > > ---------------------- > > > > > > 1. > > > + /* Now that the state fields are initialized, it is safe to return it. */ > > > + *iter_state = state; > > > + > > > /* allocate heap */ > > > state->heap = > > > binaryheap_allocate(state->nr_txns, > > > ReorderBufferIterCompare, > > > > > > Is there a reason for not initializing iter_state after > > > binaryheap_allocate? If we do so, then we don't need additional check > > > you have added in ReorderBufferIterTXNFinish. > > > > If iter_state is initialized *after* binaryheap_allocate, then we > > won't be able to close the vfds if binaryheap_allocate() ereports(). > > > > Is it possible to have vfds opened before binaryheap_allocate(), if so how? No it does not look possible for the vfds to be opened before binaryheap_allocate(). But actually, the idea behind placing the iter_state at the place where I put is that, we should return back the iter_state at the *earliest* place in the code where it is safe to return. > > I couldn't reproduce the original problem (on HEAD) reported with the > > test case in the patch. So, I can't verify the fix. I think it is > > because of recent commits cec2edfa7859279f36d2374770ca920c59c73dd8 and > > 9290ad198b15d6b986b855d2a58d087a54777e87. It seems you need to either > > change the value of logical_decoding_work_mem or change the test in > > some way so that the original problem can be reproduced. Amit Khandekar <amitdkhan.pg@gmail.com> wrote: > Yeah, it does seem like the commit cec2edfa78592 must have caused the > test to not reproduce on your env, although the test does fail for me > still. Setting logical_decoding_work_mem to a low value does sound > like a good idea. Will work on it. I checked that setting logical_decoding_work_mem to its min value (64KB) causes early serialization. So I think if you set this, you should be able to reproduce with the spill.sql test that has the new testcase. I will anyway set this value in the test. Also check below ... >> 4. I think we should also check how much time increase will happen for >> test_decoding regression test after the test added by this patch? Amit Khandekar <amitdkhan.pg@gmail.com> wrote: > Yeah, it currently takes noticeably longer compared to the others. > Let's see if setting logical_decoding_work_mem to a min value allows > us to reproduce the test with much lesser number of inserts. The test in the patch takes around 20 seconds, as compared to the max time of 2 seconds any of the other tests take in that test suite. But if we set the max_files_per_process to a very low value (say 26) as against the default 1000, we can reproduce the issue with as low as 20 sub-transactions as against the 600 that I used in spill.sql test. And with this, the test runs in around 4 seconds, so this is good. But the problem is : max_files_per_process needs server restart. So either we have to shift this test to src/test/recovery in one of the logical_decoding test, or retain it in contrib/test_decoding and let it run for 20 seconds. Let me know if you figure out any other approach. > > > > > > > 5. One naive question about the usage of PathNameOpenFile(). 
When it > > > reaches the max limit, it will automatically close one of the files, > > > but how will that be reflected in the data structure (TXNEntryFile) > > > you are managing. Basically, when PathNameOpenFile closes some file, > > > how will the corresponding vfd in TXNEntryFile be changed. Because if > > > it is not changed, then won't it start pointing to some wrong > > > filehandle? > > > > In PathNameOpenFile(), excess kernel fds could be closed > > (ReleaseLruFiles). But with that, the vfds themselves don't get > > invalidated. Only the underlying kernel fd gets closed, and the > > vfd->fd is marked VFD_CLOSED. The vfd array element remains valid (a > > non-null vfd->fileName means the vfd slot is valid; check > > FileIsValid). So later, when FileRead(vfd1) is called and that vfd1 > > happens to be the one that had got it's kernel fd closed, it gets > > opened again through FileAccess(). > > > > I was under impression that once the fd is closed due to excess kernel > fds that are opened, the slot in VfdCache array could be resued by > someone else, but on closer inspection that is not true. It will be > only available for reuse after we explicitly call FileClose, right? Yes, that's right. -- Thanks, -Amit Khandekar EnterpriseDB Corporation The Postgres Database Company
On Wed, Dec 11, 2019 at 4:17 PM Amit Khandekar <amitdkhan.pg@gmail.com> wrote: > > On Sat, 7 Dec 2019 at 11:37, Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > On Fri, Dec 6, 2019 at 5:00 PM Amit Khandekar <amitdkhan.pg@gmail.com> wrote: > > > > > > On Fri, 6 Dec 2019 at 15:40, Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > > 1. > > > > + /* Now that the state fields are initialized, it is safe to return it. */ > > > > + *iter_state = state; > > > > + > > > > /* allocate heap */ > > > > state->heap = > > > > binaryheap_allocate(state->nr_txns, > > > > ReorderBufferIterCompare, > > > > > > > > Is there a reason for not initializing iter_state after > > > > binaryheap_allocate? If we do so, then we don't need additional check > > > > you have added in ReorderBufferIterTXNFinish. > > > > > > If iter_state is initialized *after* binaryheap_allocate, then we > > > won't be able to close the vfds if binaryheap_allocate() ereports(). > > > > > > > Is it possible to have vfds opened before binaryheap_allocate(), if so how? > No it does not look possible for the vfds to be opened before > binaryheap_allocate(). But actually, the idea behind placing the > iter_state at the place where I put is that, we should return back the > iter_state at the *earliest* place in the code where it is safe to > return. > Sure, I get that point, but it seems it is equally good to do this after binaryheap_allocate(). It will be sligthly better because if there is any error in binaryheap_allocate, then we don't need to even call ReorderBufferIterTXNFinish(). > > >> 4. I think we should also check how much time increase will happen for > >> test_decoding regression test after the test added by this patch? > Amit Khandekar <amitdkhan.pg@gmail.com> wrote: > > Yeah, it currently takes noticeably longer compared to the others. > > Let's see if setting logical_decoding_work_mem to a min value allows > > us to reproduce the test with much lesser number of inserts. > > The test in the patch takes around 20 seconds, as compared to the max > time of 2 seconds any of the other tests take in that test suite. > > But if we set the max_files_per_process to a very low value (say 26) > as against the default 1000, we can reproduce the issue with as low as > 20 sub-transactions as against the 600 that I used in spill.sql test. > And with this, the test runs in around 4 seconds, so this is good. > Do you get 4s even after setting the minimum value of logical_decoding_work_mem? I think 4s is also too much for this test which is going to test one extreme scenario. Can you try with some bigger rows or something else to reduce this time? I think if we can get it close to 2s or whatever maximum time taken by any other logical decoding tests, then good, otherwise, it doesn't seem like a good idea to add this test. > But > the problem is : max_files_per_process needs server restart. So either > we have to shift this test to src/test/recovery in one of the > logical_decoding test, or retain it in contrib/test_decoding and let > it run for 20 seconds. Let me know if you figure out any other > approach. > I don't think 20s for one test is acceptable. So, we should just give up that path, you can try what I suggested above, if we can reduce the test time, then good, otherwise, I suggest to drop the test from the patch. Having said that, we should try our best to reduce the test time as it will be good if we can have such a test. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Thu, 12 Dec 2019 at 09:49, Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Wed, Dec 11, 2019 at 4:17 PM Amit Khandekar <amitdkhan.pg@gmail.com> wrote: > > > > On Sat, 7 Dec 2019 at 11:37, Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > On Fri, Dec 6, 2019 at 5:00 PM Amit Khandekar <amitdkhan.pg@gmail.com> wrote: > > > > > > > > On Fri, 6 Dec 2019 at 15:40, Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > > > > 1. > > > > > + /* Now that the state fields are initialized, it is safe to return it. */ > > > > > + *iter_state = state; > > > > > + > > > > > /* allocate heap */ > > > > > state->heap = > > > > > binaryheap_allocate(state->nr_txns, > > > > > ReorderBufferIterCompare, > > > > > > > > > > Is there a reason for not initializing iter_state after > > > > > binaryheap_allocate? If we do so, then we don't need additional check > > > > > you have added in ReorderBufferIterTXNFinish. > > > > > > > > If iter_state is initialized *after* binaryheap_allocate, then we > > > > won't be able to close the vfds if binaryheap_allocate() ereports(). > > > > > > > > > > Is it possible to have vfds opened before binaryheap_allocate(), if so how? > > No it does not look possible for the vfds to be opened before > > binaryheap_allocate(). But actually, the idea behind placing the > > iter_state at the place where I put is that, we should return back the > > iter_state at the *earliest* place in the code where it is safe to > > return. > > > > Sure, I get that point, but it seems it is equally good to do this > after binaryheap_allocate(). It will be sligthly better because if > there is any error in binaryheap_allocate, then we don't need to even > call ReorderBufferIterTXNFinish(). All right. WIll do that. > > > > > >> 4. I think we should also check how much time increase will happen for > > >> test_decoding regression test after the test added by this patch? > > Amit Khandekar <amitdkhan.pg@gmail.com> wrote: > > > Yeah, it currently takes noticeably longer compared to the others. > > > Let's see if setting logical_decoding_work_mem to a min value allows > > > us to reproduce the test with much lesser number of inserts. > > > > The test in the patch takes around 20 seconds, as compared to the max > > time of 2 seconds any of the other tests take in that test suite. > > > > But if we set the max_files_per_process to a very low value (say 26) > > as against the default 1000, we can reproduce the issue with as low as > > 20 sub-transactions as against the 600 that I used in spill.sql test. > > And with this, the test runs in around 4 seconds, so this is good. > > > > Do you get 4s even after setting the minimum value of > logical_decoding_work_mem? I think 4s is also too much for this test > which is going to test one extreme scenario. Can you try with some > bigger rows or something else to reduce this time? I think if we can > get it close to 2s or whatever maximum time taken by any other logical > decoding tests, then good, otherwise, it doesn't seem like a good idea > to add this test. > > > But > > the problem is : max_files_per_process needs server restart. So either > > we have to shift this test to src/test/recovery in one of the > > logical_decoding test, or retain it in contrib/test_decoding and let > > it run for 20 seconds. Let me know if you figure out any other > > approach. > > > > I don't think 20s for one test is acceptable. 
So, we should just give > up that path, you can try what I suggested above, if we can reduce the > test time, then good, otherwise, I suggest to drop the test from the > patch. Having said that, we should try our best to reduce the test > time as it will be good if we can have such a test. I tried; it is actually roughly 3.4 seconds. Note that reducing logical_decoding_work_mem does not reduce the test time. It causes the serialization to start early, and so increases the chance of reproducing the problem. During restore of the serialized data, we still use max_changes_in_memory. So max_changes_in_memory is the one that allows us to reduce the number of transactions required, so we can cut down on the outer loop iterations and make the test finish much earlier. But also note that, we can't use the test suite in contrib/test_decoding, because max_changes_in_memory needs server restart. So we need to shift this test to src/test/recovery. And there, I guess it is not that critical for the testcase to be very quick because the tests in general are much slower than the ones in contrib/test_decoding, although it would be nice to make it fast. What I propose is to modify max_changes_in_memory, do a server restart (which takes hardly a sec), run the testcase (3.5 sec) and then restart after resetting the guc. So totally it will be around 4-5 seconds. -- Thanks, -Amit Khandekar EnterpriseDB Corporation The Postgres Database Company
On Thu, 12 Dec 2019 at 11:34, Amit Khandekar <amitdkhan.pg@gmail.com> wrote: > So max_changes_in_memory is the one > that allows us to reduce the number of transactions required, so we > can cut down on the outer loop iterations and make the test finish > much earlier. > > But also note that, we can't use the test suite in > contrib/test_decoding, because max_changes_in_memory needs server > restart. So we need to shift this test to src/test/recovery. And > there, I guess it is not that critical for the testcase to be very > quick because the tests in general are much slower than the ones in > contrib/test_decoding, although it would be nice to make it fast. What > I propose is to modify max_changes_in_memory, do a server restart > (which takes hardly a sec), run the testcase (3.5 sec) and then > restart after resetting the guc. So totally it will be around 4-5 > seconds. Sorry I meant max_files_per_process. We need to reduce max_files_per_process, so that it causes max_safe_fds to be reduced, and so only a few transactions are sufficient to reproduce the problem, because the reserveAllocatedDesc() will return false much sooner due to low max_safe_fds. -- Thanks, -Amit Khandekar EnterpriseDB Corporation The Postgres Database Company
On Thu, Dec 12, 2019 at 11:53 AM Amit Khandekar <amitdkhan.pg@gmail.com> wrote: > > On Thu, 12 Dec 2019 at 11:34, Amit Khandekar <amitdkhan.pg@gmail.com> wrote: > > > So max_changes_in_memory is the one > > that allows us to reduce the number of transactions required, so we > > can cut down on the outer loop iterations and make the test finish > > much earlier. > > > > > But also note that, we can't use the test suite in > > contrib/test_decoding, because max_changes_in_memory needs server > > restart. So we need to shift this test to src/test/recovery. And > > there, I guess it is not that critical for the testcase to be very > > quick because the tests in general are much slower than the ones in > > contrib/test_decoding, although it would be nice to make it fast. What > > I propose is to modify max_changes_in_memory, do a server restart > > (which takes hardly a sec), run the testcase (3.5 sec) and then > > restart after resetting the guc. So totally it will be around 4-5 > > seconds. > > Sorry I meant max_files_per_process. > Okay, what time other individual tests take in that directory on your machine? How about providing a separate test patch for this case so that I can also test it? I think we can take the opinion of others as well if they are fine with adding this test, otherwise, we can go ahead with the main patch. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Thu, 12 Dec 2019 at 14:18, Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Thu, Dec 12, 2019 at 11:53 AM Amit Khandekar <amitdkhan.pg@gmail.com> wrote: > > > > On Thu, 12 Dec 2019 at 11:34, Amit Khandekar <amitdkhan.pg@gmail.com> wrote: > > > > > So max_changes_in_memory is the one > > > that allows us to reduce the number of transactions required, so we > > > can cut down on the outer loop iterations and make the test finish > > > much earlier. > > > > > > > > But also note that, we can't use the test suite in > > > contrib/test_decoding, because max_changes_in_memory needs server > > > restart. So we need to shift this test to src/test/recovery. And > > > there, I guess it is not that critical for the testcase to be very > > > quick because the tests in general are much slower than the ones in > > > contrib/test_decoding, although it would be nice to make it fast. What > > > I propose is to modify max_changes_in_memory, do a server restart > > > (which takes hardly a sec), run the testcase (3.5 sec) and then > > > restart after resetting the guc. So totally it will be around 4-5 > > > seconds. > > > > Sorry I meant max_files_per_process. > > > > Okay, what time other individual tests take in that directory on your > machine? For src/test/recovery directory, on average, a test takes about 4-5 seconds. > How about providing a separate test patch for this case so > that I can also test it? Attached is a v4 patch that also addresses your code comments so far. I have included the test case in 006_logical_decoding.pl. I observed that the test case just adds only about 0.5 to 1 sec time. Please verify on your env also, and also whether the test reproduces the issue without the code changes. > I think we can take the opinion of others as > well if they are fine with adding this test, otherwise, we can go > ahead with the main patch. Sure. -- Thanks, -Amit Khandekar EnterpriseDB Corporation The Postgres Database Company
On Thu, Dec 12, 2019 at 9:50 PM Amit Khandekar <amitdkhan.pg@gmail.com> wrote: > > Attached is a v4 patch that also addresses your code comments so far. > I have included the test case in 006_logical_decoding.pl. I observed > that the test case just adds only about 0.5 to 1 sec time. Please > verify on your env also, and also whether the test reproduces the > issue without the code changes. > It takes roughly the same time on my machine as well. I have checked on Windows as well, it increases the time from 14 to 16 (17) seconds for this test. I don't think this is any big increase considering the timing of other tests and it would be good to have a test for such boundary conditions. I have slightly change the comments in the patch and ran pgindent. Attached, find the patch with a proposed commit message. I have also made minor changes related to below code in patch: - else if (readBytes != sizeof(ReorderBufferDiskChange)) + + file->curOffset += readBytes; + + if (readBytes != sizeof(ReorderBufferDiskChange)) Why the size is added before the error check? I think it should be after that check, so changed accordingly. Similarly, I don't see why we need to change 'else if' to 'if' in this code, so changed back. I think we need to change/tweak the test for back branches as there we don't have logical_decoding_work_mem. Can you please look into that and see if you can run perltidy for the test file. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Sat, 14 Dec 2019 at 11:59, Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Thu, Dec 12, 2019 at 9:50 PM Amit Khandekar <amitdkhan.pg@gmail.com> wrote: > > > > Attached is a v4 patch that also addresses your code comments so far. > > I have included the test case in 006_logical_decoding.pl. I observed > > that the test case just adds only about 0.5 to 1 sec time. Please > > verify on your env also, and also whether the test reproduces the > > issue without the code changes. > > > > It takes roughly the same time on my machine as well. I have checked > on Windows as well, it increases the time from 14 to 16 (17) seconds > for this test. I don't think this is any big increase considering the > timing of other tests and it would be good to have a test for such > boundary conditions. I have slightly change the comments in the patch > and ran pgindent. Attached, find the patch with a proposed commit > message. > > I have also made minor changes related to below code in patch: > - else if (readBytes != sizeof(ReorderBufferDiskChange)) > + > + file->curOffset += readBytes; > + > + if (readBytes != > sizeof(ReorderBufferDiskChange)) > > Why the size is added before the error check? The logic was : even though it's an error that the readBytes does not match the expected size, the file read is successful so update the vfd offset as early as possible. In our case, this might not matter much, but who knows, in the future, in the exception block (say, in ReorderBufferIterTXNFinish, someone assumes that the file offset is correct and does something with that, then we will get in trouble, although I agree that it's very unlikely. But IMO, because we want to simulate the file offset support in vfd, we should update the file offset immediately after a file read is known to have succeeded. > I think it should be > after that check, so changed accordingly. Similarly, I don't see why > we need to change 'else if' to 'if' in this code, so changed back. Since for adding the size before the error check I had to remove the else-if, so to be consistent, I removed the else-if at surrounding places also. > > I think we need to change/tweak the test for back branches as there we > don't have logical_decoding_work_mem. Can you please look into that Yeah, I believe we need to backport up to PG 9.4 where logical decoding was introduced, so I am first trying out with 9.4 branch. > and see if you can run perltidy for the test file. Hmm, I tried perltidy, and it seems to mostly add a space after ( and a space before ) if there's already; so "('postgres'," is replaced by "(<space> 'postgres',". And this is going to be inconsistent with other places. And it replaces tab with spaces. Do you think we should try perltidy, or have we before been using this tool for the tap tests ? -- Thanks, -Amit Khandekar EnterpriseDB Corporation The Postgres Database Company
On Mon, Dec 16, 2019 at 3:26 PM Amit Khandekar <amitdkhan.pg@gmail.com> wrote: > > On Sat, 14 Dec 2019 at 11:59, Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > I have also made minor changes related to below code in patch: > > - else if (readBytes != sizeof(ReorderBufferDiskChange)) > > + > > + file->curOffset += readBytes; > > + > > + if (readBytes != > > sizeof(ReorderBufferDiskChange)) > > > > Why the size is added before the error check? > The logic was : even though it's an error that the readBytes does not > match the expected size, the file read is successful so update the vfd > offset as early as possible. In our case, this might not matter much, > but who knows, in the future, in the exception block (say, in > ReorderBufferIterTXNFinish, someone assumes that the file offset is > correct and does something with that, then we will get in trouble, > although I agree that it's very unlikely. > I am not sure if there is any such need, but even if it is there, I think updating after a *short* read (read less than expected) doesn't seem like a good idea because there is clearly some problem with the read call. Also, in the case below that case where we read the actual change data, the offset is updated after the check of *short* read. I don't see any advantage in such an inconsistency. I still feel it is better to update the offset after all error checks. > > > and see if you can run perltidy for the test file. > Hmm, I tried perltidy, and it seems to mostly add a space after ( and > a space before ) if there's already; so "('postgres'," is replaced by > "(<space> 'postgres',". And this is going to be inconsistent with > other places. And it replaces tab with spaces. Do you think we should > try perltidy, or have we before been using this tool for the tap tests > ? > See text in src/test/perl/README (Note that all tests and test tools should have perltidy run on them before patches are submitted, using perltidy - profile=src/tools/pgindent/perltidyrc). It is recommended to use perltidy. Now, if it is making the added code inconsistent with nearby code, then I suggest to leave it. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Mon, 16 Dec 2019 at 16:52, Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Mon, Dec 16, 2019 at 3:26 PM Amit Khandekar <amitdkhan.pg@gmail.com> wrote: > > > > On Sat, 14 Dec 2019 at 11:59, Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > I have also made minor changes related to below code in patch: > > > - else if (readBytes != sizeof(ReorderBufferDiskChange)) > > > + > > > + file->curOffset += readBytes; > > > + > > > + if (readBytes != > > > sizeof(ReorderBufferDiskChange)) > > > > > > Why the size is added before the error check? > > The logic was : even though it's an error that the readBytes does not > > match the expected size, the file read is successful so update the vfd > > offset as early as possible. In our case, this might not matter much, > > but who knows, in the future, in the exception block (say, in > > ReorderBufferIterTXNFinish, someone assumes that the file offset is > > correct and does something with that, then we will get in trouble, > > although I agree that it's very unlikely. > > > > I am not sure if there is any such need, but even if it is there, I > think updating after a *short* read (read less than expected) doesn't > seem like a good idea because there is clearly some problem with the > read call. Also, in the case below that case where we read the actual > change data, the offset is updated after the check of *short* read. I > don't see any advantage in such an inconsistency. I still feel it is > better to update the offset after all error checks. Ok, no problem; I don't see any harm in doing the updates after the size checks. By the way, the backport patch is turning out to be simpler. It's because in pre-12 versions, the file offset is part of the Vfd structure, so all the offset handling is not required. > > > > > > and see if you can run perltidy for the test file. > > Hmm, I tried perltidy, and it seems to mostly add a space after ( and > > a space before ) if there's already; so "('postgres'," is replaced by > > "(<space> 'postgres',". And this is going to be inconsistent with > > other places. And it replaces tab with spaces. Do you think we should > > try perltidy, or have we before been using this tool for the tap tests > > ? > > > > See text in src/test/perl/README (Note that all tests and test tools > should have perltidy run on them before patches are submitted, using > perltidy - profile=src/tools/pgindent/perltidyrc). It is recommended > to use perltidy. > > Now, if it is making the added code inconsistent with nearby code, > then I suggest to leave it. In many places, it is becoming inconsistent, but will see if there are some places where it does make sense and does not break consistency. -- Thanks, -Amit Khandekar EnterpriseDB Corporation The Postgres Database Company
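[Editor's note] To make the agreed ordering concrete, here is a minimal sketch of the read step inside ReorderBufferRestoreChanges() as the master/v12 patch shapes it. This is a sketch reconstructed from the fragments quoted above, not the committed code; the vfd member name, rb->outbuf, and the WAIT_EVENT_REORDER_BUFFER_READ wait event are assumptions.

    /* (fragment from inside the restore loop; readBytes is an int) */
    readBytes = FileRead(file->vfd, rb->outbuf,
                         sizeof(ReorderBufferDiskChange),
                         file->curOffset, WAIT_EVENT_REORDER_BUFFER_READ);

    if (readBytes == 0)
    {
        /* end of this spill segment: release the VFD before moving on */
        FileClose(file->vfd);
        break;
    }

    if (readBytes < 0)
        ereport(ERROR,
                (errcode_for_file_access(),
                 errmsg("could not read from reorderbuffer spill file: %m")));

    /* a short read is an error; report it before touching the offset */
    if (readBytes != sizeof(ReorderBufferDiskChange))
        ereport(ERROR,
                (errcode_for_file_access(),
                 errmsg("could not read from reorderbuffer spill file: read %d instead of %u bytes",
                        readBytes, (uint32) sizeof(ReorderBufferDiskChange))));

    /* only now advance the tracked position, per the review above */
    file->curOffset += readBytes;

Because the position lives in the caller's bookkeeping rather than in a kept-open kernel descriptor, the VFD layer is free to close and later reopen the spill file whenever the backend runs short of file descriptors, which is the point of the fix.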
On Tue, 17 Dec 2019 at 17:40, Amit Khandekar <amitdkhan.pg@gmail.com> wrote: > By the way, the backport patch is turning out to be simpler. It's > because in pre-12 versions, the file offset is part of the Vfd > structure, so all the offset handling is not required. Please have a look at the attached backport patch for PG 11. branch. Once you are ok with the patch, I will port it on other branches. Note that in the patch, wherever applicable I have renamed the fd variable to vfd to signify that it is a vfd, and not the kernel fd. If we don't do the renaming, the patch would be still smaller, but I think the renaming makes sense. The recovery TAP tests don't seem to be there on 9.4 and 9.5 branch, so I think it's ok to not have any tests with the patches on these branches that don't have the tap tests. -- Thanks, -Amit Khandekar EnterpriseDB Corporation The Postgres Database Company
On Wed, Dec 18, 2019 at 12:34 PM Amit Khandekar <amitdkhan.pg@gmail.com> wrote: > > On Tue, 17 Dec 2019 at 17:40, Amit Khandekar <amitdkhan.pg@gmail.com> wrote: > > By the way, the backport patch is turning out to be simpler. It's > > because in pre-12 versions, the file offset is part of the Vfd > > structure, so all the offset handling is not required. > > Please have a look at the attached backport patch for PG 11. branch. > Once you are ok with the patch, I will port it on other branches. > Note that in the patch, wherever applicable I have renamed the fd > variable to vfd to signify that it is a vfd, and not the kernel fd. If > we don't do the renaming, the patch would be still smaller, but I > think the renaming makes sense. > The other usage of PathNameOpenFile in md.c is already using 'fd' as a variable name (also, if you see example in fd.h, that also uses fd as variable name), so I don't see any problem with using fd especially if that leads to lesser changes. Apart from that, your patch LGTM. > The recovery TAP tests don't seem to be there on 9.4 and 9.5 branch, > so I think it's ok to not have any tests with the patches on these > branches that don't have the tap tests. > Yeah, that is fine. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Thu, 19 Dec 2019 at 11:59, Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Wed, Dec 18, 2019 at 12:34 PM Amit Khandekar <amitdkhan.pg@gmail.com> wrote: > > > > On Tue, 17 Dec 2019 at 17:40, Amit Khandekar <amitdkhan.pg@gmail.com> wrote: > > > By the way, the backport patch is turning out to be simpler. It's > > > because in pre-12 versions, the file offset is part of the Vfd > > > structure, so all the offset handling is not required. > > > > Please have a look at the attached backport patch for PG 11. branch. > > Once you are ok with the patch, I will port it on other branches. > > Note that in the patch, wherever applicable I have renamed the fd > > variable to vfd to signify that it is a vfd, and not the kernel fd. If > > we don't do the renaming, the patch would be still smaller, but I > > think the renaming makes sense. > > > > The other usage of PathNameOpenFile in md.c is already using 'fd' as a > variable name (also, if you see example in fd.h, that also uses fd as > variable name), so I don't see any problem with using fd especially if > that leads to lesser changes. Ok. I have retained fd name. > Apart from that, your patch LGTM. Attached are the patches from master back up to 94 branch. PG 9.4 and 9.5 have a common patch to be applied : pg94_95_use_vfd_for_logrep.patch From PG 9.6 onwards, each version has a separate patch. For PG 9.6, there is no logical decoding perl test file. So I have made a new file 006_logical_decoding_spill.pl that has only the specific testcase. Also, for building the test_decoding.so, I had to add the EXTRA_INSTALL=contrib/test_decoding line in the src/test/recovery/Makefile, because this is the first time we are using the plugin in the 9.6 tap test. From PG 10 onwards, pgstat_report_*() calls around read() are removed in the patch, because FileRead() itself reports the wait events. From PG 12 onwards, the vfd offset handling had to be added, because the offset is not present in Vfd structure. In master, logical_decoding_work_mem is used in the test file. -- Thanks, -Amit Khandekar EnterpriseDB Corporation The Postgres Database Company
On Fri, Dec 20, 2019 at 9:31 AM Amit Khandekar <amitdkhan.pg@gmail.com> wrote: > Attached are the patches from master back up to 94 branch. > > PG 9.4 and 9.5 have a common patch to be applied : > pg94_95_use_vfd_for_logrep.patch > From PG 9.6 onwards, each version has a separate patch. > > For PG 9.6, there is no logical decoding perl test file. So I have > made a new file 006_logical_decoding_spill.pl that has only the > specific testcase. Also, for building the test_decoding.so, I had to > add the EXTRA_INSTALL=contrib/test_decoding line in the > src/test/recovery/Makefile, because this is the first time we are > using the plugin in the 9.6 tap test. > I am not sure if we need to go that far for 9.6 branch. If the other tests for logical decoding are not present, then I don't see why we need to create a new test file for this test only. Also, I think this will make the patch the same for 9.4,9.5 and 9.6. > From PG 10 onwards, pgstat_report_*() calls around read() are removed > in the patch, because FileRead() itself reports the wait events. > Why there are different patches for 10 and 11? We should try to minimize the difference between patches in different branches. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Tue, 24 Dec 2019 at 10:41, Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Fri, Dec 20, 2019 at 9:31 AM Amit Khandekar <amitdkhan.pg@gmail.com> wrote: > > Attached are the patches from master back up to 94 branch. > > > > PG 9.4 and 9.5 have a common patch to be applied : > > pg94_95_use_vfd_for_logrep.patch > > From PG 9.6 onwards, each version has a separate patch. > > > > For PG 9.6, there is no logical decoding perl test file. So I have > > made a new file 006_logical_decoding_spill.pl that has only the > > specific testcase. Also, for building the test_decoding.so, I had to > > add the EXTRA_INSTALL=contrib/test_decoding line in the > > src/test/recovery/Makefile, because this is the first time we are > > using the plugin in the 9.6 tap test. > > > > I am not sure if we need to go that far for 9.6 branch. If the other > tests for logical decoding are not present, then I don't see why we > need to create a new test file for this test only. Also, I think this > will make the patch the same for 9.4,9.5 and 9.6. Ok. I tested pg94_95_use_vfd_for_logrep.patch for 9.6 branch, and it works there. So please use this patch for all the three branches. > > > From PG 10 onwards, pgstat_report_*() calls around read() are removed > > in the patch, because FileRead() itself reports the wait events. > > > > Why there are different patches for 10 and 11? For PG10, OpenTransientFile() and PathNameOpenFile() each have an extra parameter for specifying file creation modes such as S_IRUSR or S_IWUSR. For 11, these functions don't accept the flags, rather the file is always opened with PG_FILE_MODE_OWNER. Because of these differences in the calls, the PG 10 patch does not apply on 11. > > We should try to minimize the difference between patches in different branches. Ok. -- Thanks, -Amit Khandekar EnterpriseDB Corporation The Postgres Database Company
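[Editor's note] For anyone comparing the branch patches, a hedged sketch of the call that differs; the variable name and flag choices here are illustrative, not copied from the attachments:

    /* PG 10: PathNameOpenFile() still takes a file-creation mode
     * (irrelevant when only reading, so 0 is fine) */
    fd = PathNameOpenFile(path, O_RDONLY | PG_BINARY, 0);

    /* PG 11: the mode argument is gone; PG_FILE_MODE_OWNER is implied */
    fd = PathNameOpenFile(path, O_RDONLY | PG_BINARY);

The surrounding logic is shared, which is why the 10 and 11 patches differ essentially only in the open calls.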
On Tue, Dec 24, 2019 at 2:31 PM Amit Khandekar <amitdkhan.pg@gmail.com> wrote: > > On Tue, 24 Dec 2019 at 10:41, Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > On Fri, Dec 20, 2019 at 9:31 AM Amit Khandekar <amitdkhan.pg@gmail.com> wrote: > > > Attached are the patches from master back up to 94 branch. > > > > > > PG 9.4 and 9.5 have a common patch to be applied : > > > pg94_95_use_vfd_for_logrep.patch > > > From PG 9.6 onwards, each version has a separate patch. > > > > > > For PG 9.6, there is no logical decoding perl test file. So I have > > > made a new file 006_logical_decoding_spill.pl that has only the > > > specific testcase. Also, for building the test_decoding.so, I had to > > > add the EXTRA_INSTALL=contrib/test_decoding line in the > > > src/test/recovery/Makefile, because this is the first time we are > > > using the plugin in the 9.6 tap test. > > > > > > > I am not sure if we need to go that far for 9.6 branch. If the other > > tests for logical decoding are not present, then I don't see why we > > need to create a new test file for this test only. Also, I think this > > will make the patch the same for 9.4,9.5 and 9.6. > > Ok. I tested pg94_95_use_vfd_for_logrep.patch for 9.6 branch, and it > works there. So please use this patch for all the three branches. > Pushed! -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Thu, Jan 2, 2020 at 5:44 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Tue, Dec 24, 2019 at 2:31 PM Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
> >
> >
> > Ok. I tested pg94_95_use_vfd_for_logrep.patch for 9.6 branch, and it
> > works there. So please use this patch for all the three branches.
> >
> Pushed!

I see one failure in REL_10_STABLE [1] which seems to be due to this commit:
-------------------
t/006_logical_decoding.pl (Wstat: 7424 Tests: 10 Failed: 0)
Non-zero exit status: 29
Parse errors: Bad plan. You planned 11 tests but ran 10.
Files=14, Tests=122, 1968 wallclock secs ( 0.10 usr 0.03 sys + 19.00 cusr 21.98 csys = 41.11 CPU)
Result: FAIL
Makefile:19: recipe for target 'check' failed
make: *** [check] Error 1
See below snippet from 006_logical_decoding_master.log
..
..
2020-01-03 01:30:48.254 UTC [12189836:9] t/006_logical_decoding.pl STATEMENT: SELECT data from pg_logical_slot_get_changes('test_slot', NULL, NULL)
WHERE data LIKE '%INSERT%' ORDER BY lsn LIMIT 1;
2020-01-03 01:30:51.990 UTC [6882174:3] LOG: server process (PID 12189836) was terminated by signal 11
2020-01-03 01:30:51.990 UTC [6882174:4] DETAIL: Failed process was running: SELECT data from pg_logical_slot_get_changes('test_slot', NULL, NULL)
WHERE data LIKE '%INSERT%' ORDER BY lsn LIMIT 1;
2020-01-03 01:30:51.990 UTC [6882174:5] LOG: terminating any other active server processes
The strange thing is that the same test passes on master on the same machine [2], and for 10 as well it passes on other machines, so I am not sure what could cause this. Any clue?
On Fri, Jan 3, 2020 at 8:29 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Thu, Jan 2, 2020 at 5:44 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > On Tue, Dec 24, 2019 at 2:31 PM Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
> > >
> > >
> > > Ok. I tested pg94_95_use_vfd_for_logrep.patch for 9.6 branch, and it
> > > works there. So please use this patch for all the three branches.
> > >
> > Pushed!
>
> I see one failure in REL_10_STABLE [1] which seems to be due to this commit:

I tried this test on my CentOS and Power8 machine more than 50 times, but couldn't reproduce it. So, adding Noah to see if he can try this test [1] on his machine (tern) and get a stack trace or some other information?
[1] - make -C src/test/recovery/ check PROVE_TESTS=t/006_logical_decoding.pl
On Fri, 3 Jan 2020 at 10:19, Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Fri, Jan 3, 2020 at 8:29 AM Amit Kapila <amit.kapila16@gmail.com> wrote: >> >> On Thu, Jan 2, 2020 at 5:44 PM Amit Kapila <amit.kapila16@gmail.com> wrote: >>> >>> On Tue, Dec 24, 2019 at 2:31 PM Amit Khandekar <amitdkhan.pg@gmail.com> wrote: >>> > >>> > >>> > Ok. I tested pg94_95_use_vfd_for_logrep.patch for 9.6 branch, and it >>> > works there. So please use this patch for all the three branches. >>> > >>> >>> Pushed! >> >> >> >> I see one failure in REL_10_STABLE [1] which seems to be due to this commit: >> > > I tried this test on my CentOs and Power8 machine more than 50 times, but couldn't reproduce it. So, adding Noah to seeif he can try this test [1] on his machine (tern) and get stack track or some other information? > > [1] - make -C src/test/recovery/ check PROVE_TESTS=t/006_logical_decoding.pl I also tested multiple times using PG 10 branch; also tried to inject an error so that PG_CATCH related code also gets covered, but unfortunately didn't get the crash on my machine. I guess, we will have to somehow get the stacktrace. > > -- > With Regards, > Amit Kapila. > EnterpriseDB: http://www.enterprisedb.com -- Thanks, -Amit Khandekar EnterpriseDB Corporation The Postgres Database Company
On Fri, Jan 03, 2020 at 02:20:09PM +0530, Amit Khandekar wrote: > On Fri, 3 Jan 2020 at 10:19, Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Fri, Jan 3, 2020 at 8:29 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > >> I see one failure in REL_10_STABLE [1] which seems to be due to this commit: > > > > I tried this test on my CentOS and Power8 machine more than 50 times, but couldn't reproduce it. So, adding Noah to see if he can try this test [1] on his machine (tern) and get a stack trace or some other information? > > > > [1] - make -C src/test/recovery/ check PROVE_TESTS=t/006_logical_decoding.pl > > I also tested multiple times using PG 10 branch; also tried to inject > an error so that PG_CATCH related code also gets covered, but > unfortunately didn't get the crash on my machine. I guess, we will > have to somehow get the stacktrace. I have buildfarm member tern running this test in a loop. In the 290 iterations so far, it hasn't failed. I'll leave it running for another week or so. The buildfarm client can capture stack traces, but it currently doesn't do so for TAP test suites (search the client code for get_stack_trace). If someone feels like writing a fix for that, it would be a nice improvement. Perhaps, rather than having the client code know all the locations where core files might appear, failed runs should walk the test directory tree for core files?
On Sun, Jan 5, 2020 at 12:21 AM Noah Misch <noah@leadboat.com> wrote: > > On Fri, Jan 03, 2020 at 02:20:09PM +0530, Amit Khandekar wrote: > > On Fri, 3 Jan 2020 at 10:19, Amit Kapila <amit.kapila16@gmail.com> wrote: > > > On Fri, Jan 3, 2020 at 8:29 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > >> I see one failure in REL_10_STABLE [1] which seems to be due to this commit: > > > > > > I tried this test on my CentOs and Power8 machine more than 50 times, but couldn't reproduce it. So, adding Noah tosee if he can try this test [1] on his machine (tern) and get stack track or some other information? > > > > > > [1] - make -C src/test/recovery/ check PROVE_TESTS=t/006_logical_decoding.pl > > > > I also tested multiple times using PG 10 branch; also tried to inject > > an error so that PG_CATCH related code also gets covered, but > > unfortunately didn't get the crash on my machine. I guess, we will > > have to somehow get the stacktrace. > > I have buildfarm member tern running this test in a loop. In the 290 > iterations so far, it hasn't failed. I've leave it running for another week > or so. > Okay, thanks! FYI, your other machine 'mandril' also exhibits the exact same behavior and on v10. Both the machines (tern and mandril) seem to have the same specs which seems to be the reason that they are failing in the same way. The thing that bothers me is that the fix and test are the same for v11 and test passes for v11 on both machines. Does this indicate any random behavior or maybe some other bug in v10 which is discovered by this test? -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Sun, 5 Jan 2020 at 00:21, Noah Misch <noah@leadboat.com> wrote: > The buildfarm client can capture stack traces, but it currently doesn't do so > for TAP test suites (search the client code for get_stack_trace). If someone > feels like writing a fix for that, it would be a nice improvement. Perhaps, > rather than having the client code know all the locations where core files > might appear, failed runs should walk the test directory tree for core files? I think this might end up having the same code to walk the directory spread out on multiple files. Instead, I think in the build script, in get_stack_trace(), we can do an equivalent of "find <inputdir> -name "*core*" , as against the current way in which it looks for core files only in the specific data directory. So get_stack_trace(bindir, datadir) would change to get_stack_trace(bindir, input_dir) where input_dir can be any directory that can contain multiple data directories. E.g. a recovery test can create multiple instances so there would be multiple data directories inside the test directory. Noah, is it possible to run a patch'ed build script once I submit a patch, so that we can quickly get the stack trace ? I mean, can we do this before getting the patch committed ? I guess, we can run the build script with a single branch specified, right ? -- Thanks, -Amit Khandekar EnterpriseDB Corporation The Postgres Database Company
On Sun, Jan 5, 2020 at 10:29 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Sun, Jan 5, 2020 at 12:21 AM Noah Misch <noah@leadboat.com> wrote: > > > > On Fri, Jan 03, 2020 at 02:20:09PM +0530, Amit Khandekar wrote: > > > On Fri, 3 Jan 2020 at 10:19, Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > On Fri, Jan 3, 2020 at 8:29 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > >> I see one failure in REL_10_STABLE [1] which seems to be due to this commit: > > > > > > > > I tried this test on my CentOs and Power8 machine more than 50 times, but couldn't reproduce it. So, adding Noahto see if he can try this test [1] on his machine (tern) and get stack track or some other information? > > > > > > > > [1] - make -C src/test/recovery/ check PROVE_TESTS=t/006_logical_decoding.pl > > > > > > I also tested multiple times using PG 10 branch; also tried to inject > > > an error so that PG_CATCH related code also gets covered, but > > > unfortunately didn't get the crash on my machine. I guess, we will > > > have to somehow get the stacktrace. > > > > I have buildfarm member tern running this test in a loop. In the 290 > > iterations so far, it hasn't failed. I've leave it running for another week > > or so. > > > > Okay, thanks! FYI, your other machine 'mandril' also exhibits the > exact same behavior and on v10. Both the machines (tern and mandril) > seem to have the same specs which seems to be the reason that they are > failing in the same way. The thing that bothers me is that the fix > and test are the same for v11 and test passes for v11 on both > machines. Does this indicate any random behavior or maybe some other > bug in v10 which is discovered by this test? > Another thing to notice here is that on buildfarm 'tern' (for v10), it is getting reproduced, whereas when you ran it independently, then the problem is not reproduced even after so many runs. What could be the difference which is causing this? -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Wed, Jan 08, 2020 at 02:50:53PM +0530, Amit Khandekar wrote: > On Sun, 5 Jan 2020 at 00:21, Noah Misch <noah@leadboat.com> wrote: > > The buildfarm client can capture stack traces, but it currently doesn't do so > > for TAP test suites (search the client code for get_stack_trace). If someone > > feels like writing a fix for that, it would be a nice improvement. Perhaps, > > rather than having the client code know all the locations where core files > > might appear, failed runs should walk the test directory tree for core files? > > I think this might end up having the same code to walk the directory > spread out on multiple files. Instead, I think in the build script, in > get_stack_trace(), we can do an equivalent of "find <inputdir> -name > "*core*" , as against the current way in which it looks for core files > only in the specific data directory. Agreed. > Noah, is it possible to run a patch'ed build script once I submit a > patch, so that we can quickly get the stack trace ? I mean, can we do > this before getting the patch committed ? I guess, we can run the > build script with a single branch specified, right ? Yes to all questions, but it would not have helped in this case. First, v10 deletes PostgresNode base directories at the end of this test file, despite the failure[1]. Second, the stack trace was minimal: (gdb) bt #0 0xd011119c in extend_brk () from /usr/lib/libc.a(shr.o) Even so, a web search for "extend_brk" led to the answer. By default, 32-bit AIX binaries get only 256M of RAM for stack and sbrk. The new regression test used more than that, hence this crash. Setting LDR_CNTRL=MAXDATA=0x80000000 in the environment cured the crash. I've put that in the buildfarm member configuration and started a new run. (PostgreSQL documentation actually covers this problem: https://www.postgresql.org/docs/devel/installation-platform-notes.html#INSTALLATION-NOTES-AIX) [1] It has the all_tests_passing() logic in an attempt to stop this. I'm guessing it didn't help because the file failed by calling die "connection error: ...", not by reporting a failure to Test::More via ok(0) or similar.
Noah Misch <noah@leadboat.com> writes: > Even so, a web search for "extend_brk" led to the answer. By default, 32-bit > AIX binaries get only 256M of RAM for stack and sbrk. The new regression test > used more than that, hence this crash. Hm, so (1) Why did we get a crash and not some more-decipherable out-of-resources error? Can we improve that experience? (2) Should we be dialing back the resource consumption of this test? Even on machines where it doesn't fail outright, I'd imagine that it's costing a lot of buildfarm cycles. Is it actually worth that? regards, tom lane
On Thu, Jan 9, 2020 at 11:15 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: > > Noah Misch <noah@leadboat.com> writes: > > Even so, a web search for "extend_brk" led to the answer. By default, 32-bit > > AIX binaries get only 256M of RAM for stack and sbrk. The new regression test > > used more than that, hence this crash. > > Hm, so > > (1) Why did we get a crash and not some more-decipherable out-of-resources > error? Can we improve that experience? > > (2) Should we be dialing back the resource consumption of this test? > In HEAD, we have a guc variable 'logical_decoding_work_mem' by which we can control the memory usage of changes and we have used that, but for back branches, we don't have such a control. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Thu, Jan 9, 2020 at 11:15 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: > > Noah Misch <noah@leadboat.com> writes: > > Even so, a web search for "extend_brk" led to the answer. By default, 32-bit > > AIX binaries get only 256M of RAM for stack and sbrk. The new regression test > > used more than that, hence this crash. > > Hm, so > > (1) Why did we get a crash and not some more-decipherable out-of-resources > error? Can we improve that experience? > > (2) Should we be dialing back the resource consumption of this test? > Even on machines where it doesn't fail outright, I'd imagine that it's > costing a lot of buildfarm cycles. Is it actually worth that? > After the latest changes by Noah, the tern and mandrill both are green. I will revert the test added by this patch unless there is some strong argument to keep it. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On 2020-Jan-09, Amit Kapila wrote: > In HEAD, we have a guc variable 'logical_decoding_work_mem' by which > we can control the memory usage of changes and we have used that, but > for back branches, we don't have such a control. > After the latest changes by Noah, the tern and mandrill both are > green. I will revert the test added by this patch unless there is > some strong argument to keep it. Hmm, so why not revert the test only in the back branches, given that it's not so onerous in master? -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Alvaro Herrera <alvherre@2ndquadrant.com> writes:
> Hmm, so why not revert the test only in the back branches, given that
> it's not so onerous in master?

I grow tired of repeating myself, but: it's purely accidental that this test passes in master for the existing set of buildfarm members. If I have to do so to prove my point, I will set up a buildfarm member that uses USE_NAMED_POSIX_SEMAPHORES, and then insist that the patch cope with that.

But the real issue is that the test is abusing max_files_per_process to do something it was never intended for. What it was intended for, and works well at, is to constrain the total FD consumption of a collection of backends. It doesn't work well to constrain the maximum allocatedDescs consumption, because there's too much variability in our demand for other FDs. If we feel that we should have a test that is constraining that, then we need to invent some other mechanism to do it with. If we're not willing to invent an appropriate mechanism to support the test, then we should drop the test, because a half-baked test is worse than none.

An appropriate mechanism, perhaps, would be some way to constrain max_safe_fds directly, without any platform- or environment-dependent effects in the way. It could be as simple as

 	/*
 	 * Take off the FDs reserved for system() etc.
 	 */
 	max_safe_fds -= NUM_RESERVED_FDS;

+	/*
+	 * Apply debugging limit, if defined.
+	 */
+#ifdef MAX_SAFE_FDS_LIMIT
+	max_safe_fds = Min(max_safe_fds, MAX_SAFE_FDS_LIMIT);
+#endif
+
 	/*
 	 * Make sure we still have enough to get by.
 	 */

and then somebody who was concerned about this could run a buildfarm member with "-DMAX_SAFE_FDS_LIMIT=10" or so.

regards, tom lane
On 2020-Jan-09, Tom Lane wrote: > Alvaro Herrera <alvherre@2ndquadrant.com> writes: > > Hmm, so why not revert the test only in the back branches, given that > > it's not so onerous in master? > > I grow tired of repeating myself, but: it's purely accidental that this > test passes in master for the existing set of buildfarm members. Oh, I forgot we had that problem. I agree with reverting the test, rather than building all the extra functionality needed to make it more stable, in that case. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Amit Kapila <amit.kapila16@gmail.com> writes: > On Thu, Jan 9, 2020 at 11:15 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: >> Noah Misch <noah@leadboat.com> writes: >>> Even so, a web search for "extend_brk" led to the answer. By default, 32-bit >>> AIX binaries get only 256M of RAM for stack and sbrk. The new regression test >>> used more than that, hence this crash. >> Hm, so >> (1) Why did we get a crash and not some more-decipherable out-of-resources >> error? Can we improve that experience? >> (2) Should we be dialing back the resource consumption of this test? > In HEAD, we have a guc variable 'logical_decoding_work_mem' by which > we can control the memory usage of changes and we have used that, but > for back branches, we don't have such a control. I poked into this a bit more by running the src/test/recovery tests under restrictive ulimit settings. I used ulimit -s 1024 ulimit -v 250000 (At least on my 64-bit RHEL6 box, reducing ulimit -v much below this causes initdb to fail, apparently because the post-bootstrap process tries to load all our tsearch and encoding conversion shlibs at once, and it hasn't got enough VM space to do so. Someday we may have to improve that.) I did not manage to duplicate Noah's crash this way. What I see in the v10 branch is that the new 006_logical_decoding.pl test fails, but with a clean "out of memory" error. The memory map dump that that produces fingers the culprit pretty unambiguously: ... ReorderBuffer: 223302560 total in 26995 blocks; 7056 free (3 chunks); 223295504 used ReorderBufferByXid: 24576 total in 2 blocks; 11888 free (3 chunks); 12688 used Slab: TXN: 8192 total in 1 blocks; 5208 free (21 chunks); 2984 used Slab: Change: 2170880 total in 265 blocks; 2800 free (35 chunks); 2168080 used ... Grand total: 226714720 bytes in 27327 blocks; 590888 free (785 chunks); 226123832 used The test case is only inserting 50K fairly-short rows, so this seems like an unreasonable amount of memory to be consuming for that; and even if you think it's reasonable, it clearly isn't going to scale to large production transactions. Now, the good news is that v11 and later get through 006_logical_decoding.pl just fine under the same restriction. So we did something in v11 to fix this excessive memory consumption. However, unless we're willing to back-port whatever that was, this test case is clearly consuming excessive resources for the v10 branch. We're not out of the woods either. I also observe that v12 and HEAD fall over, under these same test conditions, with a stack-overflow error in the 012_subtransactions.pl test. This seems to be due to somebody's decision to use a heavily recursive function to generate a bunch of subtransactions. Is there a good reason for hs_subxids() to use recursion instead of a loop? If there is, what's the value of using 201 levels rather than, say, 10? Anyway it remains unclear why Noah's machine got a crash instead of something more user-friendly. But the reason why it's only in the v10 branch seems non-mysterious. regards, tom lane
I wrote: > ReorderBuffer: 223302560 total in 26995 blocks; 7056 free (3 chunks); 223295504 used > The test case is only inserting 50K fairly-short rows, so this seems > like an unreasonable amount of memory to be consuming for that; and > even if you think it's reasonable, it clearly isn't going to scale > to large production transactions. > Now, the good news is that v11 and later get through > 006_logical_decoding.pl just fine under the same restriction. > So we did something in v11 to fix this excessive memory consumption. > However, unless we're willing to back-port whatever that was, this > test case is clearly consuming excessive resources for the v10 branch. I dug around a little in the git history for backend/replication/logical/, and while I find several commit messages mentioning memory leaks and faulty spill logic, they all claim to have been back-patched as far as 9.4. It seems reasonably likely to me that this result is telling us about an actual bug, ie, faulty back-patching of one or more of those fixes into v10 and perhaps earlier branches. I don't know this code well enough to take point on looking for the problem, though. regards, tom lane
On Thu, Jan 09, 2020 at 12:45:41AM -0500, Tom Lane wrote: > Noah Misch <noah@leadboat.com> writes: > > Even so, a web search for "extend_brk" led to the answer. By default, 32-bit > > AIX binaries get only 256M of RAM for stack and sbrk. The new regression test > > used more than that, hence this crash. > > Hm, so > > (1) Why did we get a crash and not some more-decipherable out-of-resources > error? Can we improve that experience? By default, 32-bit AIX binaries have maxdata:0x00000000. Specifying maxdata:0x10000000 provides the same 256M of RAM, yet it magically changes the SIGSEGV to ENOMEM: $ OBJECT_MODE=32 gcc maxdata.c && ./a.out Segmentation fault $ OBJECT_MODE=32 gcc -Wl,-bmaxdata:0x00000000 maxdata.c && ./a.out Segmentation fault $ OBJECT_MODE=32 gcc -Wl,-bmaxdata:0x10000000 maxdata.c && ./a.out done at 255 MiB: Not enough space We could add -Wl,-bmaxdata:0x10000000 (or a higher value) to LDFLAGS when building for 32-bit AIX. > (2) Should we be dialing back the resource consumption of this test? > Even on machines where it doesn't fail outright, I'd imagine that it's > costing a lot of buildfarm cycles. Is it actually worth that? The test's resource usage, being quite low, should not be a factor in the test's fate. On my usual development machine, the entire 006_logical_decoding.pl file takes just 3s and ~250 MiB of RAM.
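[Editor's note] Noah's maxdata.c was not posted; the sketch below is only a guess at what such a probe looks like (grab the data segment a megabyte at a time until allocation fails, then report how far it got), but it reproduces the shape of the output shown above:

    #include <errno.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    int
    main(void)
    {
        size_t      mib = 0;

        for (;;)
        {
            char       *p = malloc(1024 * 1024);

            if (p == NULL)
            {
                printf("done at %zu MiB: %s\n", mib, strerror(errno));
                return 0;
            }
            memset(p, 0, 1024 * 1024);  /* touch the pages so they really count */
            mib++;
        }
    }

With the default maxdata setting the loop dies before it can report anything, while linking with -Wl,-bmaxdata:0x10000000 lets malloc() fail cleanly, matching the "Segmentation fault" versus "done at 255 MiB: Not enough space" outputs above.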
On Fri, Jan 10, 2020 at 6:10 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: > > I wrote: > > ReorderBuffer: 223302560 total in 26995 blocks; 7056 free (3 chunks); 223295504 used > > > The test case is only inserting 50K fairly-short rows, so this seems > > like an unreasonable amount of memory to be consuming for that; and > > even if you think it's reasonable, it clearly isn't going to scale > > to large production transactions. > > > Now, the good news is that v11 and later get through > > 006_logical_decoding.pl just fine under the same restriction. > > So we did something in v11 to fix this excessive memory consumption. > > However, unless we're willing to back-port whatever that was, this > > test case is clearly consuming excessive resources for the v10 branch. > > I dug around a little in the git history for backend/replication/logical/, > and while I find several commit messages mentioning memory leaks and > faulty spill logic, they all claim to have been back-patched as far > as 9.4. > > It seems reasonably likely to me that this result is telling us about > an actual bug, ie, faulty back-patching of one or more of those fixes > into v10 and perhaps earlier branches. > I think it would be good to narrow down this problem, but it seems we can do this separately. I think to avoid forgetting about this, can we track it somewhere as an open issue (In Older Bugs section of PostgreSQL 12 Open Items or some other place)? It seems to me that this test has found a problem in back-branches, so we might want to keep it after removing the max_files_per_process restriction. However, unless we narrow down this memory leak it is not a good idea to keep it at least not in v10. So, we have the below options: (a) remove this test entirely from all branches and once we found the memory leak problem in back-branches, then consider adding it again without max_files_per_process restriction. (b) keep this test without max_files_per_process restriction till v11 and once the memory leak issue in v10 is found, we can back-patch to v10 as well. Suggestions? > If I have to do so to prove my point, I will set up a buildfarm member > that uses USE_NAMED_POSIX_SEMAPHORES, and then insist that the patch > cope with that. > Shall we document that under USE_NAMED_POSIX_SEMAPHORES, we consume additional fd? I thought about it because the minimum limit for max_files_per_process is 25 and the system won't even start if someone has used a platform where USE_NAMED_POSIX_SEMAPHORES is enabled. Also, if it would have been explicitly mentioned, then I think this test wouldn't have tried to become so optimistic about max_files_per_process. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Noah Misch <noah@leadboat.com> writes: > On Thu, Jan 09, 2020 at 12:45:41AM -0500, Tom Lane wrote: >> (1) Why did we get a crash and not some more-decipherable out-of-resources >> error? Can we improve that experience? > By default, 32-bit AIX binaries have maxdata:0x00000000. Specifying > maxdata:0x10000000 provides the same 256M of RAM, yet it magically changes the > SIGSEGV to ENOMEM: > ... > We could add -Wl,-bmaxdata:0x10000000 (or a higher value) to LDFLAGS when > building for 32-bit AIX. +1, seems like that would improve matters considerably on that platform. >> (2) Should we be dialing back the resource consumption of this test? >> Even on machines where it doesn't fail outright, I'd imagine that it's >> costing a lot of buildfarm cycles. Is it actually worth that? > The test's resource usage, being quite low, should not be a factor in the > test's fate. On my usual development machine, the entire > 006_logical_decoding.pl file takes just 3s and ~250 MiB of RAM. Yeah, as I noted downthread, it appears that initdb itself can't succeed with less than ~250MB these days. My old-school self feels like that's excessive, but I must admit I'm not motivated to go reduce it right now. But I think it's a clear win to fail with "out of memory" rather than "SIGSEGV", so I think we ought to adjust the AIX build options as you suggest. regards, tom lane
On Fri, Jan 10, 2020 at 9:31 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Fri, Jan 10, 2020 at 6:10 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: > > > > I wrote: > > > ReorderBuffer: 223302560 total in 26995 blocks; 7056 free (3 chunks); 223295504 used > > > > > The test case is only inserting 50K fairly-short rows, so this seems > > > like an unreasonable amount of memory to be consuming for that; and > > > even if you think it's reasonable, it clearly isn't going to scale > > > to large production transactions. > > > > > Now, the good news is that v11 and later get through > > > 006_logical_decoding.pl just fine under the same restriction. > > > So we did something in v11 to fix this excessive memory consumption. > > > However, unless we're willing to back-port whatever that was, this > > > test case is clearly consuming excessive resources for the v10 branch. > > > > I dug around a little in the git history for backend/replication/logical/, > > and while I find several commit messages mentioning memory leaks and > > faulty spill logic, they all claim to have been back-patched as far > > as 9.4. > > > > It seems reasonably likely to me that this result is telling us about > > an actual bug, ie, faulty back-patching of one or more of those fixes > > into v10 and perhaps earlier branches. > > > > I think it would be good to narrow down this problem, but it seems we > can do this separately. I think to avoid forgetting about this, can > we track it somewhere as an open issue (In Older Bugs section of > PostgreSQL 12 Open Items or some other place)? > > It seems to me that this test has found a problem in back-branches, so > we might want to keep it after removing the max_files_per_process > restriction. However, unless we narrow down this memory leak it is > not a good idea to keep it at least not in v10. So, we have the below > options: > (a) remove this test entirely from all branches and once we found the > memory leak problem in back-branches, then consider adding it again > without max_files_per_process restriction. > (b) keep this test without max_files_per_process restriction till v11 > and once the memory leak issue in v10 is found, we can back-patch to > v10 as well. > I am planning to go with option (a) and attached are patches to revert the entire test on HEAD and back branches. I am planning to commit these by Tuesday unless someone has a better idea. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Amit Kapila <amit.kapila16@gmail.com> writes: > On Fri, Jan 10, 2020 at 9:31 AM Amit Kapila <amit.kapila16@gmail.com> wrote: >> ... So, we have the below >> options: >> (a) remove this test entirely from all branches and once we found the >> memory leak problem in back-branches, then consider adding it again >> without max_files_per_process restriction. >> (b) keep this test without max_files_per_process restriction till v11 >> and once the memory leak issue in v10 is found, we can back-patch to >> v10 as well. > I am planning to go with option (a) and attached are patches to revert > the entire test on HEAD and back branches. I am planning to commit > these by Tuesday unless someone has a better idea. Makes sense to me. We've certainly found out something interesting from this test, but not what it was expecting to find ;-). I think that there could be scope for two sorts of successor tests: * I still like my idea of directly constraining max_safe_fds through some sort of debug option. But to my mind, we want to run the entire regression suite with that restriction, not just one small test. * The seeming bug in v10 suggests that we aren't testing large enough logical-decoding cases, or at least aren't noticing leaks in that area. I'm not sure what a good design is for testing that. I'm not thrilled with just using a larger (and slower) test case, but it's not clear to me how else to attack it. regards, tom lane
On Sat, Jan 11, 2020 at 11:16 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: > > Amit Kapila <amit.kapila16@gmail.com> writes: > > On Fri, Jan 10, 2020 at 9:31 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > >> ... So, we have the below > >> options: > >> (a) remove this test entirely from all branches and once we found the > >> memory leak problem in back-branches, then consider adding it again > >> without max_files_per_process restriction. > >> (b) keep this test without max_files_per_process restriction till v11 > >> and once the memory leak issue in v10 is found, we can back-patch to > >> v10 as well. > > > I am planning to go with option (a) and attached are patches to revert > > the entire test on HEAD and back branches. I am planning to commit > > these by Tuesday unless someone has a better idea. > > Makes sense to me. We've certainly found out something interesting > from this test, but not what it was expecting to find ;-). I think > that there could be scope for two sorts of successor tests: > > * I still like my idea of directly constraining max_safe_fds through > some sort of debug option. But to my mind, we want to run the entire > regression suite with that restriction, not just one small test. > Good idea. > * The seeming bug in v10 suggests that we aren't testing large enough > logical-decoding cases, or at least aren't noticing leaks in that > area. I'm not sure what a good design is for testing that. I'm not > thrilled with just using a larger (and slower) test case, but it's > not clear to me how else to attack it. > It is not clear to me either at this stage, but I think we can decide that after chasing the issue in v10. My current plan is to revert this test and make a note of the memory leak problem found (probably track in Older Bugs section of PostgreSQL 12 Open Items). I think once we found the issue id v10, we might be in a better position to decide if the test on the lines of the current test would make sense or we need something else. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Thu, Jan 09, 2020 at 07:40:12PM -0500, Tom Lane wrote: >I wrote: >> ReorderBuffer: 223302560 total in 26995 blocks; 7056 free (3 chunks); 223295504 used > >> The test case is only inserting 50K fairly-short rows, so this seems >> like an unreasonable amount of memory to be consuming for that; and >> even if you think it's reasonable, it clearly isn't going to scale >> to large production transactions. > >> Now, the good news is that v11 and later get through >> 006_logical_decoding.pl just fine under the same restriction. >> So we did something in v11 to fix this excessive memory consumption. >> However, unless we're willing to back-port whatever that was, this >> test case is clearly consuming excessive resources for the v10 branch. > >I dug around a little in the git history for backend/replication/logical/, >and while I find several commit messages mentioning memory leaks and >faulty spill logic, they all claim to have been back-patched as far >as 9.4. > >It seems reasonably likely to me that this result is telling us about >an actual bug, ie, faulty back-patching of one or more of those fixes >into v10 and perhaps earlier branches. > >I don't know this code well enough to take point on looking for the >problem, though. > Well, one thing we did in 11 is introduction of the Generation context. In 10 we're still stashing all tuple data into the main AllocSet. I wonder if backporting a4ccc1cef5a04cc054af83bc4582a045d5232cb3 and a couple of follow-up fixes would make the issue go away. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Tomas Vondra <tomas.vondra@2ndquadrant.com> writes: > On Thu, Jan 09, 2020 at 07:40:12PM -0500, Tom Lane wrote: >> It seems reasonably likely to me that this result is telling us about >> an actual bug, ie, faulty back-patching of one or more of those fixes >> into v10 and perhaps earlier branches. > Well, one thing we did in 11 is introduction of the Generation context. > In 10 we're still stashing all tuple data into the main AllocSet. I > wonder if backporting a4ccc1cef5a04cc054af83bc4582a045d5232cb3 and a > couple of follow-up fixes would make the issue go away. Hm. I'm loath to back-port Generation contexts. But looking at a4ccc1cef5a04cc054af83bc4582a045d5232cb3, I see that (a) the commit message mentions space savings, but (b) the replaced code in reorderbuffer.c doesn't look like it really would move the needle much in that regard. The old code had a one-off slab allocator that we got rid of, but I don't see any actual leak there ... remind me where the win came from, exactly? regards, tom lane
On Sat, Jan 11, 2020 at 10:53:57PM -0500, Tom Lane wrote: >Tomas Vondra <tomas.vondra@2ndquadrant.com> writes: >> On Thu, Jan 09, 2020 at 07:40:12PM -0500, Tom Lane wrote: >>> It seems reasonably likely to me that this result is telling us about >>> an actual bug, ie, faulty back-patching of one or more of those fixes >>> into v10 and perhaps earlier branches. > >> Well, one thing we did in 11 is introduction of the Generation context. >> In 10 we're still stashing all tuple data into the main AllocSet. I >> wonder if backporting a4ccc1cef5a04cc054af83bc4582a045d5232cb3 and a >> couple of follow-up fixes would make the issue go away. > >Hm. I'm loath to back-port Generation contexts. Yeah, I agree. My suggestion was to try backpatching it and see if it resolves the issue. >But looking at >a4ccc1cef5a04cc054af83bc4582a045d5232cb3, I see that (a) the >commit message mentions space savings, but (b) the replaced code >in reorderbuffer.c doesn't look like it really would move the needle >much in that regard. The old code had a one-off slab allocator >that we got rid of, but I don't see any actual leak there ... >remind me where the win came from, exactly? > Well, the problem is that in 10 we allocate tuple data in the main memory ReorderBuffer context, and when the transaction gets decoded we pfree() it. But in AllocSet that only moves the data to the freelists, it does not release it entirely. So with the right allocation pattern (sufficiently diverse chunk sizes) this can easily result in allocation of large amount of memory that is never released. I don't know if this is what's happening in this particular test, but I wouldn't be surprised by it. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
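[Editor's note] As background on the retention behaviour Tomas describes, here is a self-contained toy model; it is deliberately not aset.c, and every name in it (pool_alloc, pool_free, the size classes) is invented for illustration. Freed chunks are only parked on a per-size freelist owned by the pool, so the pool's footprint stays at its high-water mark even when nothing is live:

    #include <stdio.h>
    #include <stdlib.h>

    #define NCLASSES 12             /* power-of-two size classes: 8 B .. 16 KB */

    typedef struct chunk
    {
        struct chunk *next;
    } chunk;

    static chunk   *freelist[NCLASSES];
    static size_t   total_from_os;  /* bytes ever obtained from malloc */

    static int
    size_class(size_t sz)
    {
        int     c = 0;
        size_t  cap = 8;

        while (cap < sz && c < NCLASSES - 1)
        {
            cap <<= 1;
            c++;
        }
        return c;
    }

    static void *
    pool_alloc(size_t sz)
    {
        int     c = size_class(sz);

        if (freelist[c] != NULL)
        {
            chunk  *ch = freelist[c];

            freelist[c] = ch->next;
            return ch;
        }
        total_from_os += (size_t) 8 << c;
        return malloc((size_t) 8 << c);
    }

    static void
    pool_free(void *p, size_t sz)
    {
        chunk  *ch = (chunk *) p;
        int     c = size_class(sz);

        ch->next = freelist[c];
        freelist[c] = ch;           /* parked on the freelist, never given back */
    }

    int
    main(void)
    {
        enum { N = 20000 };
        static void *live[N];

        /* build up a peak of mixed-size allocations (think: decoded tuple data) */
        for (int i = 0; i < N; i++)
            live[i] = pool_alloc((size_t) 16 << (i % 8));

        /* decoding finishes and everything is "pfree"d ... */
        for (int i = 0; i < N; i++)
            pool_free(live[i], (size_t) 16 << (i % 8));

        /* ... yet the pool still owns its peak footprint */
        printf("live chunks: 0, still held by the pool: %zu bytes\n", total_from_os);
        return 0;
    }

The Generation context added in v11 frees an entire block as soon as every chunk on it is dead, which is consistent with v11 getting through the same test. Whether this retention pattern is actually what bites the v10 test is questioned just below, so treat this only as background on the mechanism.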
Tomas Vondra <tomas.vondra@2ndquadrant.com> writes: > On Sat, Jan 11, 2020 at 10:53:57PM -0500, Tom Lane wrote: >> remind me where the win came from, exactly? > Well, the problem is that in 10 we allocate tuple data in the main > memory ReorderBuffer context, and when the transaction gets decoded we > pfree() it. But in AllocSet that only moves the data to the freelists, > it does not release it entirely. So with the right allocation pattern > (sufficiently diverse chunk sizes) this can easily result in allocation > of large amount of memory that is never released. > I don't know if this is what's happening in this particular test, but I > wouldn't be surprised by it. Nah, don't think I believe that: the test inserts a bunch of tuples, but they look like they will all be *exactly* the same size. CREATE TABLE decoding_test(x integer, y text); ... FOR i IN 1..10 LOOP BEGIN INSERT INTO decoding_test(x) SELECT generate_series(1,5000); EXCEPTION when division_by_zero then perform 'dummy'; END; regards, tom lane
On Sun, Jan 12, 2020 at 8:18 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Sat, Jan 11, 2020 at 11:16 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: > > > > > * The seeming bug in v10 suggests that we aren't testing large enough > > logical-decoding cases, or at least aren't noticing leaks in that > > area. I'm not sure what a good design is for testing that. I'm not > > thrilled with just using a larger (and slower) test case, but it's > > not clear to me how else to attack it. > > > > It is not clear to me either at this stage, but I think we can decide > that after chasing the issue in v10. My current plan is to revert > this test and make a note of the memory leak problem found (probably > track in Older Bugs section of PostgreSQL 12 Open Items). > Pushed the revert and added an open item in the 'Older Bugs' section of PostgreSQL 12 Open Items [1]. [1] - https://wiki.postgresql.org/wiki/PostgreSQL_12_Open_Items -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Tue, Jan 14, 2020 at 9:58 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Sun, Jan 12, 2020 at 8:18 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > On Sat, Jan 11, 2020 at 11:16 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: > > > > > > > > * The seeming bug in v10 suggests that we aren't testing large enough > > > logical-decoding cases, or at least aren't noticing leaks in that > > > area. I'm not sure what a good design is for testing that. I'm not > > > thrilled with just using a larger (and slower) test case, but it's > > > not clear to me how else to attack it. > > > > > > > It is not clear to me either at this stage, but I think we can decide > > that after chasing the issue in v10. My current plan is to revert > > this test and make a note of the memory leak problem found (probably > > track in Older Bugs section of PostgreSQL 12 Open Items). > > > > Pushed the revert > Sidewinder is green now on back branches. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Forking thread "logical decoding : exceeded maxAllocatedDescs for .spill files" for this side issue: On Wed, Jan 08, 2020 at 09:37:04PM -0800, Noah Misch wrote: > v10 > deletes PostgresNode base directories at the end of this test file, despite > the failure[1]. > [1] It has the all_tests_passing() logic in an attempt to stop this. I'm > guessing it didn't help because the file failed by calling die "connection > error: ...", not by reporting a failure to Test::More via ok(0) or similar. That is what happened. We should test the exit status to decide whether to keep temporaries, as attached. PostgresNode does that, since commit 90627cf (thread https://postgr.es/m/flat/6205.1492883490%40sss.pgh.pa.us). That thread already discussed $SUBJECT[1] and the __DIE__ handler being redundant[2]. I plan to back-patch, since it's most useful for v10 and v9.6. [1] https://postgr.es/m/CAMsr+YFyFU=+MVFZqhthfMW22x5-h517e6ck6ET+DT+X4bUO7g@mail.gmail.com [2] https://postgr.es/m/FEA925B2-C3AE-4BA9-9194-5F5616AD0794@yesql.se
Attachment
> On 2 Feb 2020, at 18:01, Noah Misch <noah@leadboat.com> wrote: > > Forking thread "logical decoding : exceeded maxAllocatedDescs for .spill > files" for this side issue: Thanks, I hadn't seen this. > On Wed, Jan 08, 2020 at 09:37:04PM -0800, Noah Misch wrote: >> v10 >> deletes PostgresNode base directories at the end of this test file, despite >> the failure[1]. > >> [1] It has the all_tests_passing() logic in an attempt to stop this. I'm >> guessing it didn't help because the file failed by calling die "connection >> error: ...", not by reporting a failure to Test::More via ok(0) or similar. > > That is what happened. We should test the exit status to decide whether to > keep temporaries, as attached. PostgresNode does that, since commit 90627cf > (thread https://postgr.es/m/flat/6205.1492883490%40sss.pgh.pa.us). That > thread already discussed $SUBJECT[1] and the __DIE__ handler being > redundant[2]. I plan to back-patch, since it's most useful for v10 and v9.6. I'm travelling and haven't been able to test, but this makes sense from reading. +1 on backpatching. cheers ./daniel
On Sun, Jan 12, 2020 at 9:51 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:
> > On Sat, Jan 11, 2020 at 10:53:57PM -0500, Tom Lane wrote:
> >> remind me where the win came from, exactly?
>
> > Well, the problem is that in 10 we allocate tuple data in the main
> > memory ReorderBuffer context, and when the transaction gets decoded we
> > pfree() it. But in AllocSet that only moves the data to the freelists,
> > it does not release it entirely. So with the right allocation pattern
> > (sufficiently diverse chunk sizes) this can easily result in allocation
> > of large amount of memory that is never released.
>
> > I don't know if this is what's happening in this particular test, but I
> > wouldn't be surprised by it.
>
> Nah, don't think I believe that: the test inserts a bunch of tuples,
> but they look like they will all be *exactly* the same size.
>
> CREATE TABLE decoding_test(x integer, y text);
> ...
>
>   FOR i IN 1..10 LOOP
>     BEGIN
>       INSERT INTO decoding_test(x) SELECT generate_series(1,5000);
>     EXCEPTION
>       when division_by_zero then perform 'dummy';
>     END;
>
I performed the same test in pg11 and reproduced the issue on the
commit prior to a4ccc1cef5a04 (Generational memory allocator).

ulimit -s 1024
ulimit -v 300000

wal_level = logical
max_replication_slots = 4

And executed the following code snippet (shared by Amit Khandekar
earlier in the thread).

SELECT pg_create_logical_replication_slot('test_slot', 'test_decoding');

CREATE TABLE decoding_test(x integer, y text);
do $$
BEGIN
    FOR i IN 1..10 LOOP
        BEGIN
            INSERT INTO decoding_test(x) SELECT generate_series(1,3000);
        EXCEPTION
            when division_by_zero then perform 'dummy';
        END;
    END LOOP;
END $$;

SELECT data from pg_logical_slot_get_changes('test_slot', NULL, NULL) LIMIT 10;

I got the following error:
ERROR:  out of memory
DETAIL:  Failed on request of size 8208.

After that, I applied the "Generational memory allocator" patch and
that solved the issue. From the error message, it is evident that the
underlying code is trying to allocate a MaxTupleSize memory for each
tuple. So, I re-introduced the following lines (which are removed by
a4ccc1cef5a04) on top of the patch:

--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -417,6 +417,9 @@ ReorderBufferGetTupleBuf(ReorderBuffer *rb, Size tuple_len)
 
 	alloc_len = tuple_len + SizeofHeapTupleHeader;
 
+	if (alloc_len < MaxHeapTupleSize)
+		alloc_len = MaxHeapTupleSize;

And, the issue got reproduced with the same error:
WARNING:  problem in Generation Tuples: number of free chunks 0 in
block 0x7fe9e9e74010 exceeds 1018 allocated
.....
ERROR:  out of memory
DETAIL:  Failed on request of size 8208.

I don't understand the code well enough to comment whether we can
back-patch only this part of the code. But, this seems to allocate a
huge amount of memory per chunk although the tuple is small. Thoughts?

--
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com
On Tue, Feb 4, 2020 at 10:15 AM Kuntal Ghosh <kuntalghosh.2007@gmail.com> wrote: > > On Sun, Jan 12, 2020 at 9:51 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: > > > > Tomas Vondra <tomas.vondra@2ndquadrant.com> writes: > > > On Sat, Jan 11, 2020 at 10:53:57PM -0500, Tom Lane wrote: > > >> remind me where the win came from, exactly? > > > > > Well, the problem is that in 10 we allocate tuple data in the main > > > memory ReorderBuffer context, and when the transaction gets decoded we > > > pfree() it. But in AllocSet that only moves the data to the freelists, > > > it does not release it entirely. So with the right allocation pattern > > > (sufficiently diverse chunk sizes) this can easily result in allocation > > > of large amount of memory that is never released. > > > > > I don't know if this is what's happening in this particular test, but I > > > wouldn't be surprised by it. > > > > Nah, don't think I believe that: the test inserts a bunch of tuples, > > but they look like they will all be *exactly* the same size. > > > > CREATE TABLE decoding_test(x integer, y text); > > ... > > > > FOR i IN 1..10 LOOP > > BEGIN > > INSERT INTO decoding_test(x) SELECT generate_series(1,5000); > > EXCEPTION > > when division_by_zero then perform 'dummy'; > > END; > > > I performed the same test in pg11 and reproduced the issue on the > commit prior to a4ccc1cef5a04 (Generational memory allocator). > > ulimit -s 1024 > ulimit -v 300000 > > wal_level = logical > max_replication_slots = 4 > > And executed the following code snippet (shared by Amit Khandekar > earlier in the thread). > .. > > SELECT data from pg_logical_slot_get_changes('test_slot', NULL, NULL) LIMIT 10; > > I got the following error: > ERROR: out of memory > DETAIL: Failed on request of size 8208. > > After that, I applied the "Generational memory allocator" patch and > that solved the issue. From the error message, it is evident that the > underlying code is trying to allocate a MaxTupleSize memory for each > tuple. So, I re-introduced the following lines (which are removed by > a4ccc1cef5a04) on top of the patch: > > --- a/src/backend/replication/logical/reorderbuffer.c > +++ b/src/backend/replication/logical/reorderbuffer.c > @@ -417,6 +417,9 @@ ReorderBufferGetTupleBuf(ReorderBuffer *rb, Size tuple_len) > > alloc_len = tuple_len + SizeofHeapTupleHeader; > > + if (alloc_len < MaxHeapTupleSize) > + alloc_len = MaxHeapTupleSize; > > And, the issue got reproduced with the same error: > WARNING: problem in Generation Tuples: number of free chunks 0 in > block 0x7fe9e9e74010 exceeds 1018 allocated > ..... > ERROR: out of memory > DETAIL: Failed on request of size 8208. > > I don't understand the code well enough to comment whether we can > back-patch only this part of the code. > I don't think we can just back-patch that part of code as it is linked to the way we are maintaining a cache (~8MB) for frequently allocated objects. See the comments around the definition of max_cached_tuplebufs. But probably, we can do something once we reach such a limit, basically, once we know that we have already allocated max_cached_tuplebufs number of tuples of size MaxHeapTupleSize, we don't need to allocate more of that size. Does this make sense? -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
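A self-contained sketch of that rule follows. The helper choose_alloc_len() and the capitalized constants are stand-ins invented for the example; in the real code the decision would live in ReorderBufferGetTupleBuf() alongside the existing slab-cache bookkeeping, and the actual patch may differ in detail.

#include <stddef.h>
#include <stdio.h>

/* stand-ins for the backend's constants, for illustration only */
#define MAX_HEAP_TUPLE_SIZE         8160        /* MaxHeapTupleSize on 8 kB pages */
#define SIZEOF_HEAP_TUPLE_HEADER    23          /* SizeofHeapTupleHeader */
#define MAX_CACHED_TUPLEBUFS        (4096 * 2)  /* max_cached_tuplebufs */

/*
 * Proposed rule: round requests up to the full MaxHeapTupleSize only while
 * the slab cache of full-size buffers can still grow; once it is full, any
 * additional full-size buffer would just be pfree'd back to aset.c on
 * release, so allocate only what the tuple actually needs.
 */
static size_t
choose_alloc_len(size_t tuple_len, size_t nr_cached_tuplebufs)
{
    size_t      alloc_len = tuple_len + SIZEOF_HEAP_TUPLE_HEADER;

    if (alloc_len < MAX_HEAP_TUPLE_SIZE &&
        nr_cached_tuplebufs < MAX_CACHED_TUPLEBUFS)
        alloc_len = MAX_HEAP_TUPLE_SIZE;

    return alloc_len;
}

int
main(void)
{
    printf("cache not yet full: %zu bytes\n", choose_alloc_len(100, 0));
    printf("cache full:         %zu bytes\n",
           choose_alloc_len(100, MAX_CACHED_TUPLEBUFS));
    return 0;
}

The idea is simply that full-size allocations only pay off while they can
still end up in the slab cache; beyond that point they are pure overhead for
small tuples.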
On Tue, Feb 4, 2020 at 2:40 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > I don't think we can just back-patch that part of code as it is linked > to the way we are maintaining a cache (~8MB) for frequently allocated > objects. See the comments around the definition of > max_cached_tuplebufs. But probably, we can do something once we reach > such a limit, basically, once we know that we have already allocated > max_cached_tuplebufs number of tuples of size MaxHeapTupleSize, we > don't need to allocate more of that size. Does this make sense? > Yeah, this makes sense. I've attached a patch that implements the same. It solves the problem reported earlier. This solution will at least slow down the process of going OOM even for very small sized tuples. -- Thanks & Regards, Kuntal Ghosh EnterpriseDB: http://www.enterprisedb.com
Attachment
Hi,

On 2020-02-04 10:15:01 +0530, Kuntal Ghosh wrote:
> And, the issue got reproduced with the same error:
> WARNING:  problem in Generation Tuples: number of free chunks 0 in
> block 0x7fe9e9e74010 exceeds 1018 allocated

That seems like a problem in generation.c - because this should be
unreachable, I think?

Tomas?

Regards,

Andres
Hi,

On 2020-02-04 10:15:01 +0530, Kuntal Ghosh wrote:
> I performed the same test in pg11 and reproduced the issue on the
> commit prior to a4ccc1cef5a04 (Generational memory allocator).
>
> ulimit -s 1024
> ulimit -v 300000
>
> wal_level = logical
> max_replication_slots = 4
>
> [...]
>
> After that, I applied the "Generational memory allocator" patch and
> that solved the issue. From the error message, it is evident that the
> underlying code is trying to allocate a MaxTupleSize memory for each
> tuple. So, I re-introduced the following lines (which are removed by
> a4ccc1cef5a04) on top of the patch:
>
> --- a/src/backend/replication/logical/reorderbuffer.c
> +++ b/src/backend/replication/logical/reorderbuffer.c
> @@ -417,6 +417,9 @@ ReorderBufferGetTupleBuf(ReorderBuffer *rb, Size tuple_len)
>
>  	alloc_len = tuple_len + SizeofHeapTupleHeader;
>
> +	if (alloc_len < MaxHeapTupleSize)
> +		alloc_len = MaxHeapTupleSize;

Maybe I'm being slow here - but what does this actually prove? Before
the generation contexts were introduced we avoided fragmentation (which
would make things unusably slow) using a brute force method (namely
forcing all tuple allocations to be of the same/maximum size).

Which means that yes, we'll need more memory than necessary. Do you
think you see anything but that here?

It's good that the situation is better now, but I don't think this
means we necessarily need to backpatch something nontrivial?

Greetings,

Andres Freund
On Fri, Feb 07, 2020 at 10:33:48AM -0800, Andres Freund wrote:
>Hi,
>
>On 2020-02-04 10:15:01 +0530, Kuntal Ghosh wrote:
>> And, the issue got reproduced with the same error:
>> WARNING:  problem in Generation Tuples: number of free chunks 0 in
>> block 0x7fe9e9e74010 exceeds 1018 allocated
>
>That seems like a problem in generation.c - because this should be
>unreachable, I think?
>
>Tomas?

That's rather strange. How could we print this message? The code looks
like this:

    if (block->nfree >= block->nchunks)
        elog(WARNING, "problem in Generation %s: number of free chunks %d in block %p exceeds %d allocated",
             name, block->nfree, block, block->nchunks);

so this says 0 >= 1018. Or am I missing something?

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Hi,

On 2020-01-11 23:20:56 -0500, Tom Lane wrote:
> Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:
> Nah, don't think I believe that: the test inserts a bunch of tuples,
> but they look like they will all be *exactly* the same size.
>
> CREATE TABLE decoding_test(x integer, y text);
> ...
>
>   FOR i IN 1..10 LOOP
>     BEGIN
>       INSERT INTO decoding_test(x) SELECT generate_series(1,5000);
>     EXCEPTION
>       when division_by_zero then perform 'dummy';
>     END;

I think the reason this triggers higher memory usage in older versions is
that, before commit

cec2edfa7859279f36d2374770ca920c59c73dd8
Author: Amit Kapila <akapila@postgresql.org>
Date:   2019-11-16 17:49:33 +0530

    Add logical_decoding_work_mem to limit ReorderBuffer memory usage.

we enforced how many changes to keep in memory (vs on disk)

/*
 * Maximum number of changes kept in memory, per transaction. After that,
 * changes are spooled to disk.
 *
 * The current value should be sufficient to decode the entire transaction
 * without hitting disk in OLTP workloads, while starting to spool to disk in
 * other workloads reasonably fast.
 *
 * At some point in the future it probably makes sense to have a more elaborate
 * resource management here, but it's not entirely clear what that would look
 * like.
 */
static const Size max_changes_in_memory = 4096;

on a per-transaction basis. And that subtransactions are *different*
transactions for that purpose (as they can be rolled back separately).

As the test generates loads of records for different subtransactions, they
each end up having quite a few changes (including the tuples pointed to!)
in memory at the same time.

Due to the way the limit of 4096 interacts with the 5000 rows inserted
above, we only hit the out of memory error when loading. That's because
when decoding (before the commit has been seen), we spill after 4096
changes:

2020-02-07 11:18:22.399 PST [1136134][3/2] DEBUG:  spill 4096 changes in XID 585 to disk
2020-02-07 11:18:22.419 PST [1136134][3/2] DEBUG:  spill 4096 changes in XID 586 to disk
2020-02-07 11:18:22.431 PST [1136134][3/2] DEBUG:  spill 4096 changes in XID 587 to disk
2020-02-07 11:18:22.443 PST [1136134][3/2] DEBUG:  spill 4096 changes in XID 588 to disk
2020-02-07 11:18:22.454 PST [1136134][3/2] DEBUG:  spill 4096 changes in XID 589 to disk
2020-02-07 11:18:22.465 PST [1136134][3/2] DEBUG:  spill 4096 changes in XID 590 to disk
2020-02-07 11:18:22.477 PST [1136134][3/2] DEBUG:  spill 4096 changes in XID 591 to disk
2020-02-07 11:18:22.488 PST [1136134][3/2] DEBUG:  spill 4096 changes in XID 592 to disk
2020-02-07 11:18:22.499 PST [1136134][3/2] DEBUG:  spill 4096 changes in XID 593 to disk
2020-02-07 11:18:22.511 PST [1136134][3/2] DEBUG:  spill 4096 changes in XID 594 to disk

so there are 5000 - 4096 changes in memory for each of them, times 10. But
when actually calling the output plugin (at the commit record), we start
with loading changes back into memory from the start of each
subtransaction. That first entails spilling the tail of that transaction
to disk, and then loading the start:

2020-02-07 11:18:22.515 PST [1136134][3/2] DEBUG:  StartSubTransaction(1) name: unnamed; blockState: STARTED; state: INPROGR, xid/subid/cid: 0/1/0
2020-02-07 11:18:22.515 PST [1136134][3/2] DEBUG:  StartSubTransaction(2) name: replay; blockState: SUB BEGIN; state: INPROGR, xid/subid/cid: 0/2/0
2020-02-07 11:18:22.515 PST [1136134][3/2] DEBUG:  spill 904 changes in XID 585 to disk
2020-02-07 11:18:22.524 PST [1136134][3/2] DEBUG:  restored 4096 changes in XID 585 into memory
2020-02-07 11:18:22.524 PST [1136134][3/2] DEBUG:  spill 904 changes in XID 586 to disk
2020-02-07 11:18:22.534 PST [1136134][3/2] DEBUG:  restored 4096 changes in XID 586 into memory
2020-02-07 11:18:22.534 PST [1136134][3/2] DEBUG:  spill 904 changes in XID 587 to disk
2020-02-07 11:18:22.544 PST [1136134][3/2] DEBUG:  restored 4096 changes in XID 587 into memory
2020-02-07 11:18:22.544 PST [1136134][3/2] DEBUG:  spill 904 changes in XID 588 to disk
2020-02-07 11:18:22.554 PST [1136134][3/2] DEBUG:  restored 4096 changes in XID 588 into memory
2020-02-07 11:18:22.554 PST [1136134][3/2] DEBUG:  spill 904 changes in XID 589 to disk
TopMemoryContext: 161440 total in 7 blocks; 80240 free (68 chunks); 81200 used
...

Because each transaction has 4096 changes in memory, we actually need more
memory here than we did during the decoding phase, where all but the
"current" subtransaction only have 5000 - 4096 changes in memory.

If we instead change the test to insert 4096*2 - 1 tuples each, we run out
of memory earlier:

2020-02-07 11:23:20.540 PST [1136134][3/12] DEBUG:  spill 4096 changes in XID 610 to disk
2020-02-07 11:23:20.565 PST [1136134][3/12] DEBUG:  spill 4096 changes in XID 611 to disk
2020-02-07 11:23:20.587 PST [1136134][3/12] DEBUG:  spill 4096 changes in XID 612 to disk
2020-02-07 11:23:20.608 PST [1136134][3/12] DEBUG:  spill 4096 changes in XID 613 to disk
2020-02-07 11:23:20.630 PST [1136134][3/12] DEBUG:  spill 4096 changes in XID 614 to disk
TopMemoryContext: 161440 total in 7 blocks; 79264 free (82 chunks); 82176 used
...
2020-02-07 11:23:20.655 PST [1136134][3/12] ERROR:  out of memory
2020-02-07 11:23:20.655 PST [1136134][3/12] DETAIL:  Failed on request of size 8208.
2020-02-07 11:23:20.655 PST [1136134][3/12] STATEMENT:  SELECT * FROM pg_logical_slot_peek_changes('regression_slot', NULL,NULL);

The reason the per-subxact enforcement of max_changes_in_memory isn't as
noticeable in 11 is that there we have the generational context, which
means that each of the 4096*10 tuples we have in memory doesn't allocate
MaxHeapTupleSize, but instead something like ~30 bytes.

I wonder if we, in the back branches that don't have the generation
context, should just reduce the size of slab-allocated tuples to ~1024
bytes instead of MaxHeapTupleSize. That's an almost trivial change, as we
already have to support tuples above that limit (in cases where the
oldtuple in an update/delete contains toasted columns that we "inlined").
POC for that attached.

Greetings,

Andres Freund
Attachment
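To put rough numbers on the decode-versus-replay difference described above, here is a back-of-the-envelope model (a hypothetical standalone program, not part of the attached POC; it assumes MaxHeapTupleSize is 8160 bytes on 8 kB pages and ignores per-change bookkeeping overhead):

#include <stdio.h>

int
main(void)
{
    const long  max_changes_in_memory = 4096;   /* per (sub)transaction */
    const long  rows_per_subxact = 5000;        /* from the test case above */
    const long  nsubxacts = 10;
    const long  max_heap_tuple_size = 8160;     /* assumed MaxHeapTupleSize */

    /*
     * While decoding, each subtransaction spills to disk once it has
     * accumulated 4096 changes, so only the remainder stays in memory.
     */
    long        decode_changes =
        (rows_per_subxact - max_changes_in_memory) * nsubxacts;

    /*
     * While replaying (at the commit record), the iterator keeps up to
     * 4096 restored changes per subtransaction in memory at the same time.
     */
    long        replay_changes = max_changes_in_memory * nsubxacts;

    printf("decoding:  %ld changes, ~%ld MB if each is rounded up to MaxHeapTupleSize\n",
           decode_changes, decode_changes * max_heap_tuple_size / (1024 * 1024));
    printf("replaying: %ld changes, ~%ld MB if each is rounded up to MaxHeapTupleSize\n",
           replay_changes, replay_changes * max_heap_tuple_size / (1024 * 1024));
    return 0;
}

That is several times more tuple memory at replay time than at decode time,
which would be consistent with the out-of-memory error only appearing while
the changes are being loaded back in.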
Hi, On 2020-02-07 20:02:01 +0100, Tomas Vondra wrote: > On Fri, Feb 07, 2020 at 10:33:48AM -0800, Andres Freund wrote: > > Hi, > > > > On 2020-02-04 10:15:01 +0530, Kuntal Ghosh wrote: > > > And, the issue got reproduced with the same error: > > > WARNING: problem in Generation Tuples: number of free chunks 0 in > > > block 0x7fe9e9e74010 exceeds 1018 allocated > > > > That seems like a problem in generation.c - because this should be > > unreachable, I think? > That's rather strange. How could we print this message? The code looks > like this > > if (block->nfree >= block->nchunks) > elog(WARNING, "problem in Generation %s: number of free chunks %d in block %p exceeds %d allocated", > name, block->nfree, block, block->nchunks); > > so this says 0 >= 1018. Or am I missing something? Indeed, it's pretty weird. I can't reproduce it either. Kuntal, which exact git version did you repro this on? What precise settings / platform? Greetings, Andres Freund
On Sat, Feb 8, 2020 at 12:10 AM Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> On 2020-02-04 10:15:01 +0530, Kuntal Ghosh wrote:
> > I performed the same test in pg11 and reproduced the issue on the
> > commit prior to a4ccc1cef5a04 (Generational memory allocator).
> >
> > ulimit -s 1024
> > ulimit -v 300000
> >
> > wal_level = logical
> > max_replication_slots = 4
> >
> > [...]
> >
> > After that, I applied the "Generational memory allocator" patch and
> > that solved the issue. From the error message, it is evident that the
> > underlying code is trying to allocate a MaxTupleSize memory for each
> > tuple. So, I re-introduced the following lines (which are removed by
> > a4ccc1cef5a04) on top of the patch:
> >
> > --- a/src/backend/replication/logical/reorderbuffer.c
> > +++ b/src/backend/replication/logical/reorderbuffer.c
> > @@ -417,6 +417,9 @@ ReorderBufferGetTupleBuf(ReorderBuffer *rb, Size tuple_len)
> >
> >  	alloc_len = tuple_len + SizeofHeapTupleHeader;
> >
> > +	if (alloc_len < MaxHeapTupleSize)
> > +		alloc_len = MaxHeapTupleSize;
>
> Maybe I'm being slow here - but what does this actually prove? Before
> the generation contexts were introduced we avoided fragmentation (which
> would make things unusably slow) using a brute force method (namely
> forcing all tuple allocations to be of the same/maximum size).
>

It seems for this we formed a cache of max_cached_tuplebufs number of
objects and we don't need to allocate more than that number of tuples of
size MaxHeapTupleSize because we will anyway return that memory to aset.c.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Hello,

On Sat, Feb 8, 2020 at 1:18 AM Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> On 2020-02-07 20:02:01 +0100, Tomas Vondra wrote:
> > On Fri, Feb 07, 2020 at 10:33:48AM -0800, Andres Freund wrote:
> > > Hi,
> > >
> > > On 2020-02-04 10:15:01 +0530, Kuntal Ghosh wrote:
> > > > And, the issue got reproduced with the same error:
> > > > WARNING:  problem in Generation Tuples: number of free chunks 0 in
> > > > block 0x7fe9e9e74010 exceeds 1018 allocated
> > >
> > > That seems like a problem in generation.c - because this should be
> > > unreachable, I think?
>
> > That's rather strange. How could we print this message? The code looks
> > like this
> >
> > if (block->nfree >= block->nchunks)
> >     elog(WARNING, "problem in Generation %s: number of free chunks %d in block %p exceeds %d allocated",
> >          name, block->nfree, block, block->nchunks);
> >
> > so this says 0 >= 1018. Or am I missing something?
>
> Indeed, it's pretty weird. I can't reproduce it either. Kuntal, which
> exact git version did you repro this on? What precise settings /
> platform?
>
I've used the following steps:

Platform:
OS: Ubuntu 64-bit 18.04.2
VMWare Fusion Version 8.5.10 on my MacOS 10.14.6
16GB RAM with 8 core processors

git checkout -b pg11 remotes/origin/REL_11_STABLE
git reset --hard a4ccc1cef5a04cc (Generational memory allocator)
git apply Set-alloc_len-as-MaxHeapTupleSize.patch

Then, I performed the same test. I've attached the test.sql file for the
same. With this test, I wanted to check how the generational memory
allocator patch is solving the issue. As reported by you as well earlier
in the thread, each of the 4096*10 in-memory tuples doesn't allocate
MaxHeapTupleSize, but instead something like ~30 bytes with this patch.
IMHO, this is the change that is making the difference. If we allocate
MaxHeapTupleSize instead for all the tuples, we'll encounter the same
out-of-memory issue.

I haven't looked into the issue in the generational tuple context yet.
It's possible that the changes I've done in the attached patch don't make
sense and break the logic of the generational memory allocator. :-)

--
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com
Attachment
On Sun, Feb 9, 2020 at 9:18 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> It seems for this we formed a cache of max_cached_tuplebufs number of
> objects and we don't need to allocate more than that number of tuples
> of size MaxHeapTupleSize because we will anyway return that memory to
> aset.c.
>
In the approach suggested by Amit (approach 1), once we allocate the
max_cached_tuplebufs number of MaxHeapTupleSize, we can use the actual
length of the tuple for allocating memory. So, if we have m
subtransactions, the memory usage at worst case will be,

(max_cached_tuplebufs * MaxHeapTupleSize) cache +
(Maximum changes in a subtransaction before spilling) * m * (Actual tuple size)

= 64 MB cache + 4095 * m * (Actual tuple size)

In the approach suggested by Andres (approach 2), we're going to
reduce the size of a cached tuple to 1024 bytes. So, if we have m
sub-transactions, the memory usage at worst case will be,

(max_cached_tuplebufs * 1024 bytes) cache +
(Maximum changes in a subtransaction before spilling) * m * 1024 bytes

= 8 MB cache + 4095 * m * 1024 (considering the size of the tuple is
less than 1024 bytes)

Once the cache is filled, for 1000 sub-transactions operating on tuple
size, say 100 bytes, approach 1 will allocate 390 MB of memory
(approx.) whereas approach 2 will allocate 4GB of memory
approximately. If there is no obvious error that I'm missing, I think
we should implement the first approach.

--
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com
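As a quick sanity check of those two worst-case figures, a throwaway calculation (a hypothetical standalone program; the 8160-byte MaxHeapTupleSize and the 4096 * 2 cache size are assumptions chosen to match the 64 MB / 8 MB cache figures above) reproduces them:

#include <stdio.h>

int
main(void)
{
    const double    mb = 1024.0 * 1024.0;
    const double    max_cached_tuplebufs = 4096 * 2;    /* assumed cache size */
    const double    max_heap_tuple_size = 8160;         /* assumed MaxHeapTupleSize */
    const double    changes_before_spill = 4095;
    const double    m = 1000;                           /* subtransactions */
    const double    tuple_size = 100;                   /* bytes */

    /* approach 1: full-size cache, then actual tuple sizes */
    printf("approach 1: %.0f MB cache + %.0f MB\n",
           max_cached_tuplebufs * max_heap_tuple_size / mb,
           changes_before_spill * m * tuple_size / mb);

    /* approach 2: everything rounded up to 1024 bytes */
    printf("approach 2: %.0f MB cache + %.0f MB\n",
           max_cached_tuplebufs * 1024 / mb,
           changes_before_spill * m * 1024 / mb);

    return 0;
}

This prints roughly 64 MB + 391 MB for approach 1 and 8 MB + 3999 MB for
approach 2, which lines up with the ~390 MB versus ~4 GB comparison above.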
On Fri, Feb 7, 2020 at 5:32 PM Kuntal Ghosh <kuntalghosh.2007@gmail.com> wrote: > > On Tue, Feb 4, 2020 at 2:40 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > I don't think we can just back-patch that part of code as it is linked > > to the way we are maintaining a cache (~8MB) for frequently allocated > > objects. See the comments around the definition of > > max_cached_tuplebufs. But probably, we can do something once we reach > > such a limit, basically, once we know that we have already allocated > > max_cached_tuplebufs number of tuples of size MaxHeapTupleSize, we > > don't need to allocate more of that size. Does this make sense? > > > > Yeah, this makes sense. I've attached a patch that implements the > same. It solves the problem reported earlier. This solution will at > least slow down the process of going OOM even for very small sized > tuples. > The patch seems to be in right direction and the test at my end shows that it resolves the issue. One minor comment: * those. Thus always allocate at least MaxHeapTupleSize. Note that tuples * generated for oldtuples can be bigger, as they don't have out-of-line * toast columns. + * + * But, if we've already allocated the memory required for building the + * cache later, we don't have to allocate memory more than the size of the + * tuple. */ How about modifying the existing comment as: "Most tuples are below MaxHeapTupleSize, so we use a slab allocator for those. Thus always allocate at least MaxHeapTupleSize till the slab cache is filled. Note that tuples generated for oldtuples can be bigger, as they don't have out-of-line toast columns."? Have you tested this in 9.6 and 9.5? -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Fri, Feb 14, 2020 at 4:06 PM Kuntal Ghosh <kuntalghosh.2007@gmail.com> wrote: > > On Sun, Feb 9, 2020 at 9:18 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > It seems for this we formed a cache of max_cached_tuplebufs number of > > objects and we don't need to allocate more than that number of tuples > > of size MaxHeapTupleSize because we will anyway return that memory to > > aset.c. > > > In the approach suggested by Amit (approach 1), once we allocate the > max_cached_tuplebufs number of MaxHeapTupleSize, we can use the actual > length of the tuple for allocating memory. So, if we have m > subtransactions, the memory usage at worst case will be, > > (max_cached_tuplebufs * MaxHeapTupleSize) cache + > (Maximum changes in a subtransaction before spilling) * m * (Actual tuple size) > > = 64 MB cache + 4095 * m * (Actual tuple size) > > In the approach suggested by Andres (approach 2), we're going to > reduce the size of a cached tuple to 1024 bytes. So, if we have m > sub-transactions, the memory usage at worst case will be, > > (max_cached_tuplebufs * 1024 bytes) cache + (Maximum changes in a > subtransaction before spilling) * m * 1024 bytes > > = 8 MB cache + 4095 * m * 1024 (considering the size of the tuple is > less than 1024 bytes) > > Once the cache is filled, for 1000 sub-transactions operating on tuple > size, say 100 bytes, approach 1 will allocate 390 MB of memory > (approx.) whereas approach 2 will allocate 4GB of memory > approximately. If there is no obvious error that I'm missing, I think > we should implement the first approach. > Your calculation seems correct to me. So, I think we should proceed with the patch written by you. Andres, any objections on proceeding with Kuntal's patch for back-branches (10, 9.6 and 9.5)? -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On 2020-02-18 11:20:17 +0530, Amit Kapila wrote:
> Andres, any objections on proceeding with Kuntal's patch for
> back-branches (10, 9.6 and 9.5)?

Yes. In my past experiments that led to *terrible* allocator performance
due to fragmentation. Like, up to 90% of the time spent in aset.c. Try a
workload with a number of overlapping transactions that have different
tuple sizes.

I'm not even sure it's the right thing to do anything in the back
branches, to be honest. If somebody hits this badly they likely have done
so before, and they at least have the choice to upgrade, but if we
regress performance for more people...
On Tue, Feb 18, 2020 at 11:33 AM Andres Freund <andres@anarazel.de> wrote:
>
> On 2020-02-18 11:20:17 +0530, Amit Kapila wrote:
> > Andres, any objections on proceeding with Kuntal's patch for
> > back-branches (10, 9.6 and 9.5)?
>
> Yes. In my past experiments that led to *terrible* allocator
> performance due to fragmentation. Like, up to 90% of the time spent in
> aset.c. Try a workload with a number of overlapping transactions that
> have different tuple sizes.
>

I thought the slab cache would have addressed that. But it is possible
that, if there are many such overlapping transactions, this might lead to
a performance regression. OTOH, the current code might also lead to worse
performance for transactions with multiple subtransactions, as they would
frequently need to malloc.

> I'm not even sure it's the right thing to do anything in the back
> branches, to be honest. If somebody hits this badly they likely have done
> so before, and they at least have the choice to upgrade, but if we
> regress performance for more people...

I can see that for some cases the current code might give better
performance, but OTOH, consuming memory at a high rate in some other
cases is not good either. But you are right that we can always ask such
users to upgrade (which, again, is sometimes painful for some users), so
maybe the right thing is to do nothing here.

Does anyone else have an opinion on this?

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com