Re: tableam vs. TOAST - Mailing list pgsql-hackers

From Prabhat Sahu
Subject Re: tableam vs. TOAST
Date
Msg-id CANEvxPpKFwhaddt0TSn7tmBJjgY4SNs=AB61Rx5f7M6EaLXcPA@mail.gmail.com
In response to Re: tableam vs. TOAST  (Ashutosh Sharma <ashu.coek88@gmail.com>)
Responses Re: tableam vs. TOAST
List pgsql-hackers


On Tue, Nov 5, 2019 at 4:48 PM Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
From the stack trace shared by Prabhat, I understand that the checkpointer process panicked due to one of the following two reasons:

1) The fsync() failed on the very first attempt, and the failure was not due to the file being dropped or truncated, i.e. fsync() failed with an error other than ENOENT. Refer to ProcessSyncRequests() for details, especially the code inside the for (failures = 0; !entry->canceled; failures++) loop.

2) The first attempt to fsync() failed with ENOENT because, just before the fsync function was called, the file being synced got dropped or truncated. When this happened, the checkpointer process called AbsorbSyncRequests() to update the entry for the deleted file in the hash table, but it seems AbsorbSyncRequests() failed to do so, which is why entry->canceled was never set to true. As a result, fsync() was performed on the same file twice, and it failed both times. Since the checkpointer process doesn't expect an fsync on the same file to fail twice, it panicked. Again, please check ProcessSyncRequests(), especially the code inside the for (failures = 0; !entry->canceled; failures++) loop.
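For reference, here is a simplified paraphrase of that retry loop in src/backend/storage/sync/sync.c (statistics bookkeeping and the DEBUG1 retry message are omitted; this is not the exact source):

    for (failures = 0; !entry->canceled; failures++)
    {
        /* Ask the sync handler for this file to fsync it. */
        if (syncsw[entry->tag.handler].sync_syncfiletag(&entry->tag, path) == 0)
            break;              /* success -- out of the retry loop */

        /*
         * ENOENT is tolerated exactly once, because the file may have
         * been dropped or truncated after the request was queued.  Any
         * other errno, or a second failure on the same file, escalates
         * through data_sync_elevel(), which promotes ERROR to PANIC
         * unless data_sync_retry is enabled.
         */
        if (!FILE_POSSIBLY_DELETED(errno) || failures > 0)
            ereport(data_sync_elevel(ERROR),
                    (errcode_for_file_access(),
                     errmsg("could not fsync file \"%s\": %m", path)));

        /*
         * Absorb pending requests; if a SYNC_FORGET_REQUEST for this
         * file has arrived, entry->canceled becomes true and the loop
         * exits instead of retrying.
         */
        AbsorbSyncRequests();
    }

Reason #1 is the first iteration taking the data_sync_elevel() path with a non-ENOENT errno; reason #2 would require AbsorbSyncRequests() to miss the cancellation so that a second iteration fails again.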

Now, the point of discussion here is: which of the above two reasons could be the cause of the panic? To me, point #2 doesn't look like a plausible reason. That's because, just before a file is unlinked, the backend first sends a SYNC_FORGET_REQUEST to the checkpointer process, which marks the entry for this file in the hash table as cancelled, and only then removes the file. With that understanding, it is hard to believe that, once the first fsync() for a file has failed with ENOENT, a call to AbsorbSyncRequests() made immediately afterwards would fail to update the entry for this file in the hash table, because the backend only removes the file once it has successfully sent the SYNC_FORGET_REQUEST for that file to the checkpointer process. See mdunlinkfork() -> register_forget_request() for details on this.
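Schematically, the relevant part of src/backend/storage/smgr/md.c looks like this (a simplified paraphrase; path handling, error checks, and the truncate-instead-of-unlink case are omitted):

    static void
    mdunlinkfork(RelFileNodeBackend rnode, ForkNumber forkNum, bool isRedo)
    {
        char   *path = relpath(rnode, forkNum);

        /*
         * First, tell the checkpointer to forget any pending fsync
         * request for the first segment.  The underlying
         * RegisterSyncRequest() call retries if the request queue is
         * full, so the forget request is reliably handed over before
         * we proceed.
         */
        register_forget_request(rnode, forkNum, 0 /* first seg */ );

        /* Only after that is the file actually removed from disk. */
        unlink(path);

        /* ... remaining segments are handled similarly ... */
    }

So by the time the file can disappear from disk, its cancellation is already queued for (or absorbed by) the checkpointer.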

So, I think the first point mentioned above is the probable reason for the checkpointer process panicking. But, having said all that, it would be good to have some evidence for it, which can be obtained by inspecting the server logfile.

Prabhat, is it possible for you to re-run the test case with log_min_messages set to DEBUG1 and save the logfile for the run that crashes? This would be helpful in knowing whether the fsync was performed just once or twice, i.e. whether point #1 or point #2 is the reason for the panic.
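(That means setting, in postgresql.conf:

    log_min_messages = DEBUG1

If I'm reading sync.c correctly, the first, tolerated ENOENT failure is logged at DEBUG1 as "could not fsync file ... but retrying", so that message followed by the PANIC would point to #2, whereas a PANIC with no preceding retry message would point to #1.)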

I have run the same test cases with and without the patch multiple times with the debug option (log_min_messages = DEBUG1), but this time I am not able to reproduce the crash.

Thanks,

--
With Regards,
Ashutosh Sharma
EnterpriseDB: http://www.enterprisedb.com

On Thu, Oct 31, 2019 at 10:26 AM Prabhat Sahu <prabhat.sahu@enterprisedb.com> wrote:


On Wed, Oct 30, 2019 at 9:46 PM Robert Haas <robertmhaas@gmail.com> wrote:
On Wed, Oct 30, 2019 at 3:49 AM Prabhat Sahu <prabhat.sahu@enterprisedb.com> wrote:
While testing the TOAST patch (PG + v7 patch) I found the below server crash.
System configuration:
vCPUs: 4, RAM: 8 GB, Storage: 320 GB

This issue is not frequently reproducible; we need to repeat the same test case multiple times.

I wonder if this is an independent bug, because the backtrace doesn't look like it's related to the stuff this is changing. Your report doesn't specify whether you can also reproduce the problem without the patch, which is something that you should always check before reporting a bug in a particular patch.
 
Hi Robert,

My sincere apologies for not having described the issue in more detail.
I have run the same case against both PG HEAD and HEAD + Patch multiple times (7, 10, 20 runs), and
I found that the issue does not occur on HEAD, while the same case is reproducible on HEAD + Patch (again, I was not sure whether the backtrace is related to the patch or not).


 
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


--

With Regards,

Prabhat Kumar Sahu
Skype ID: prabhat.sahu1984
EnterpriseDB Software India Pvt. Ltd.

The Postgres Database Company
