Re: tableam vs. TOAST - Mailing list pgsql-hackers
From: Ashutosh Sharma
Subject: Re: tableam vs. TOAST
Msg-id: CAE9k0PnzB-+FabrUs_=OZ+3fz4+C4hkGcZf=52ruaisTgueWNA@mail.gmail.com
In response to: Re: tableam vs. TOAST (Prabhat Sahu <prabhat.sahu@enterprisedb.com>)
Responses: Re: tableam vs. TOAST
List: pgsql-hackers
On Thu, Nov 7, 2019 at 10:57 AM Prabhat Sahu <prabhat.sahu@enterprisedb.com> wrote:
>
> On Tue, Nov 5, 2019 at 4:48 PM Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
>>
>> From the stack trace shared by Prabhat, I understand that the checkpointer process panicked due to one of the following two reasons:
>>
>> 1) The fsync() failed on the first attempt itself, and the reason for the failure was not that the file had been dropped or truncated, i.e. fsync() failed with an error other than ENOENT. Refer to ProcessSyncRequests() for details, esp. the code inside the "for (failures = 0; !entry->canceled; failures++)" loop.
>>
>> 2) The first attempt to fsync() failed with ENOENT because, just before the fsync function was called, the file being synced was either dropped or truncated. When this happened, the checkpointer process called AbsorbSyncRequests() to update the entry for the deleted file in the hash table, but it seems AbsorbSyncRequests() failed to do so, and that's why "entry->canceled" couldn't be set to true. Due to this, fsync() was performed on the same file twice, and it failed both times. As the checkpointer process doesn't expect fsync on the same file to fail twice, it panicked. Again, please check ProcessSyncRequests() for details, esp. the code inside the "for (failures = 0; !entry->canceled; failures++)" loop.
>>
>> Now, the point of discussion here is: which one of the above two reasons could be the cause of the panic? According to me, point #2 doesn't look like a possible reason. The reason being: just before a file is unlinked, the backend first sends a SYNC_FORGET_REQUEST to the checkpointer process, which marks the entry for this file in the hash table as cancelled, and then removes the file.
>> So, with this understanding, it is hard to believe that once the first fsync() for a file has failed with ENOENT, a call to AbsorbSyncRequests() made immediately after that wouldn't update the entry for this file in the hash table, because the backend only removes the file once it has successfully sent the SYNC_FORGET_REQUEST for that file to the checkpointer process. See mdunlinkfork() -> register_forget_request() for details on this.
>>
>> So, I think the first point that I mentioned above could be the probable reason for the checkpointer process getting panicked. But, having said all that, it would be good to have some evidence for it, which can be confirmed by inspecting the server logfile.
>>
>> Prabhat, is it possible for you to re-run the test-case with log_min_messages set to DEBUG1 and save the logfile for the test-case that crashes? This would be helpful in knowing whether fsync was performed once or twice, i.e. whether point #1 or point #2 is the reason for the panic.
>
> I have run the same testcases with and without the patch multiple times with the debug option (log_min_messages = DEBUG1), but this time I am not able to reproduce the crash.

Okay, no problem. Thanks for re-running the test-cases.

@Robert, Prabhat and I have tried running the test-cases that caused the checkpointer process to crash earlier multiple times, but we are not able to reproduce it, either with or without the patch. However, from the stack trace shared earlier by Prabhat, it is clear that the checkpointer process panicked due to an fsync failure; there is just no further data to know the exact reason for that failure. From the code of the checkpointer process (basically the function that processes fsync requests), it is understood that the checkpointer process can PANIC due to one of the following two reasons:

1) The fsync call made by the checkpointer process failed with an error other than ENOENT.
2) The fsync call made by the checkpointer process failed with ENOENT, which caused it to invoke AbsorbSyncRequests() to update the entry for the deleted file in the hash table (basically, to mark the entry as cancelled). But it seems it couldn't do so, either because a) possibly no SYNC_FORGET_REQUEST was sent by the backend to the checkpointer process, or b) the request was sent but for some reason the checkpointer process couldn't absorb it. This caused the checkpointer process to perform fsync on the same file once again, which is bound to fail, resulting in a panic.

Now, if the checkpointer process panicked due to reason #1, then I don't think it has anything to do with postgres, because postgres only handles the case where fsync fails with ENOENT. If the checkpointer process panicked due to reason #2, then possibly there is some bug in the postgres code, which I assume would have to be a problem with the way the backend sends fsync requests to the checkpointer for deleted files, or with the way the checkpointer handles those requests.

At least for me, it is hard to believe that reason #2 could be the cause of the panic here, for the reason that before a file is unlinked by the backend, it first sends a SYNC_FORGET_REQUEST to the checkpointer process; only when this is done successfully does the backend remove the file. So, with this understanding, it is hard to believe that once the first fsync() for a file has failed with ENOENT, a call to AbsorbSyncRequests() made immediately after that wouldn't update the entry for this file in the hash table.

And even if reason #2 is the cause of this failure, I don't think it has anything to do with your changes. Although I haven't studied your patches in detail, considering the purpose of the patch, and from a quick look, it doesn't seem to change anything in the area of the code that might be causing this crash.
--
With Regards,
Ashutosh Sharma
EnterpriseDB: http://www.enterprisedb.com

>> Thanks,
>>
>> --
>> With Regards,
>> Ashutosh Sharma
>> EnterpriseDB: http://www.enterprisedb.com
>>
>> On Thu, Oct 31, 2019 at 10:26 AM Prabhat Sahu <prabhat.sahu@enterprisedb.com> wrote:
>>>
>>> On Wed, Oct 30, 2019 at 9:46 PM Robert Haas <robertmhaas@gmail.com> wrote:
>>>>
>>>> On Wed, Oct 30, 2019 at 3:49 AM Prabhat Sahu <prabhat.sahu@enterprisedb.com> wrote:
>>>>>
>>>>> While testing the Toast patch (PG + v7 patch) I found the below server crash.
>>>>> System configuration:
>>>>> VCPUs: 4, RAM: 8GB, Storage: 320GB
>>>>>
>>>>> This issue is not frequently reproducible; we need to repeat the same testcase multiple times.
>>>>
>>>> I wonder if this is an independent bug, because the backtrace doesn't look like it's related to the stuff this is changing. Your report doesn't specify whether you can also reproduce the problem without the patch, which is something that you should always check before reporting a bug in a particular patch.
>>>
>>> Hi Robert,
>>>
>>> My sincere apologies that I have not mentioned the issue in more detail.
>>> I have run the same case against both PG HEAD and HEAD + Patch multiple times (7, 10, 20 nos), and as I found, this issue was not failing on HEAD while the same case is reproducible on HEAD + Patch (again, I was not sure whether the backtrace is related to the patch or not).
>>>
>>>> --
>>>> Robert Haas
>>>> EnterpriseDB: http://www.enterprisedb.com
>>>> The Enterprise PostgreSQL Company
>>>
>>> --
>>> With Regards,
>>> Prabhat Kumar Sahu
>>> Skype ID: prabhat.sahu1984
>>> EnterpriseDB Software India Pvt. Ltd.
>>> The Postgres Database Company
>
> --
> With Regards,
> Prabhat Kumar Sahu
> Skype ID: prabhat.sahu1984
> EnterpriseDB Software India Pvt. Ltd.
> The Postgres Database Company