On Sun, Feb 23, 2014 at 9:49 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Jeff Janes <jeff.janes@gmail.com> writes:
>> On Sunday, February 23, 2014, Scott Marlowe <scott.marlowe@gmail.com> wrote:
>>> I'm guessing that this is so that it can be rolled back. Unlink is
>>> likely issued at commit;
>
>> I would hope that ftruncate is issued at commit as well. That doesn't
>> sound undoable.
>
> It's more subtle than that. I'm too lazy to look at the comments in md.c
> right now, but basically the reason for not doing an instant unlink is
> to ensure that if a relation is truncated and then re-extended, open file
> pointers held by other backends will still be valid. The ftruncate is
> done to ensure that allocated disk space goes away as soon as that's safe
> (i.e., at commit of the truncation); but immediate unlink would require
> forcing more cross-backend synchronization than we want to have.
>
> If memory serves, the inode should get removed during the next checkpoint.

I was moments away from commenting to say that I had traced the flow
of the code to md.c and found the comments there quite illuminating. I
wonder if there is a different way to solve the underlying issue
without relying on ftruncate (which seems to be somewhat expensive).
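
For anyone following the thread, here is a minimal Python sketch of the
POSIX semantics being discussed. It is not PostgreSQL code and does not
reflect what md.c actually does; the function name, the two "backend"
descriptors, and the 8192-byte block size are my own choices for
illustration. It assumes a POSIX filesystem, where a file name can be
unlinked while descriptors on it are still open.

```python
import os
import tempfile


def truncate_then_unlink_demo():
    """Hypothetical demo: a second descriptor survives ftruncate,
    re-extension, and a later unlink of the file name."""
    block = 8192  # illustrative block size, chosen for this sketch

    fd_a, path = tempfile.mkstemp()      # "backend A" creates the file
    fd_b = os.open(path, os.O_RDWR)      # "backend B" opens it separately
    result = {}
    try:
        os.write(fd_a, b"x" * block)
        result["size_before"] = os.fstat(fd_b).st_size

        # Truncate in place: the file shrinks to zero bytes, but the name
        # and inode remain, so B's descriptor still refers to the same file.
        os.ftruncate(fd_a, 0)
        result["size_after_truncate"] = os.fstat(fd_b).st_size

        # Re-extend the same file; B sees the new contents through its
        # existing descriptor without reopening anything.
        os.lseek(fd_a, 0, os.SEEK_SET)
        os.write(fd_a, b"y" * block)
        result["seen_by_b"] = os.read(fd_b, 4)

        # Deferred unlink: the name disappears, but the inode lives on
        # until the last open descriptor is closed.
        os.unlink(path)
        result["name_exists_after_unlink"] = os.path.exists(path)
        result["size_after_unlink"] = os.fstat(fd_b).st_size
    finally:
        os.close(fd_a)
        os.close(fd_b)
        if os.path.exists(path):
            os.unlink(path)
    return result


if __name__ == "__main__":
    for key, value in truncate_then_unlink_demo().items():
        print(key, value)
```

The sketch only shows why truncating first and unlinking later keeps
other processes' descriptors usable; it says nothing about how costly
ftruncate is on any particular filesystem.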
--
Jon