Thread: BUG #16663: DROP INDEX did not free up disk space: idle connection holds files marked as deleted

The following bug has been logged on the website:

Bug reference:      16663
Logged by:          Denis Patron
Email address:      denis.patron@previnet.it
PostgreSQL version: 11.9
Operating system:   CentOS 7
Description:

I have an index which, at the file system level, is made up of multiple
segments (files <id>.1, <id>.2, etc.). When I DROP INDEX, the index is
dropped in PostgreSQL, but at the file system level the segments are
marked as "deleted". If I check with the lsof command, I see that the
segments are in use by an idle connection. This does not happen if the
index is formed of only one segment (in my case <1 GB). How can I
prevent this?
thanks


This is not a bug.

At Fri, 09 Oct 2020 13:24:15 +0000, PG Bug reporting form <noreply@postgresql.org> wrote in 
> The following bug has been logged on the website:
> 
> Bug reference:      16663
> Logged by:          Denis Patron
> Email address:      denis.patron@previnet.it
> PostgreSQL version: 11.9
> Operating system:   CentOS 7
> Description:        
> 
> I have an index which, at the file system level, is made up of multiple
> segments (files <id>.1, <id>.2, etc.). When I DROP INDEX, the index is
> dropped in PostgreSQL, but at the file system level the segments are
> marked as "deleted". If I check with the lsof command, I see that the
> segments are in use by an idle connection. This does not happen if the
> index is formed of only one segment (in my case <1 GB). How can I
> prevent this?
> thanks

References to deleted files will disappear at the beginning of the next
transaction.

When a relation, including an index, is dropped, the first segment file
(named "<id>", with no numeric suffix) is truncated but left behind, so
the file is not shown as "(deleted)" in lsof output.

The next checkpoint removes the first segment.
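
In other words, drop-time processing currently has roughly this shape
(a minimal sketch, not the actual code in src/backend/storage/smgr/md.c;
register_unlink_at_next_checkpoint is a made-up stand-in for the real
deferred-unlink machinery):

    #include <stdio.h>
    #include <unistd.h>

    #define MAXPGPATH 1024

    /* Hypothetical stand-in for md.c's "unlink at next checkpoint". */
    static void
    register_unlink_at_next_checkpoint(const char *path)
    {
        (void) path;            /* the checkpointer unlinks it later */
    }

    static void
    sketch_unlink_relation(const char *path)
    {
        char        segpath[MAXPGPATH];
        int         segno;

        /* Segment 0: truncate now, so "<id>" never shows up as
         * "(deleted)", and leave the actual unlink to the checkpoint. */
        (void) truncate(path, 0);
        register_unlink_at_next_checkpoint(path);

        /*
         * Segments 1..n: unlink immediately.  If another backend still
         * has "<id>.n" open, the inode -- and its disk space -- survives
         * until that backend closes its descriptor, which is what lsof
         * reports as "(deleted)".
         */
        for (segno = 1;; segno++)
        {
            snprintf(segpath, sizeof(segpath), "%s.%d", path, segno);
            if (unlink(segpath) < 0)
                break;          /* assume no more segments */
        }
    }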

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Hi,

On 2020-10-14 12:05:10 +0900, Kyotaro Horiguchi wrote:
> This is not a bug.
>
> References to deleted files will disappear at the beginning of the next
> transaction.
>
> When a relation, including an index, is dropped, the first segment file
> (named "<id>", with no numeric suffix) is truncated but left behind, so
> the file is not shown as "(deleted)" in lsof output.

I think we should consider either occasionally sending a sinval catchup
interrupt to backends that have been idle for a while, or using a timer
to limit the maximum time until we process sinvals. Just having to wait
until all backends become busy and process sinval events doesn't really
seem like a good approach to me.
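
For illustration, the timer variant might look something like this (a
hand-wavy sketch on top of the existing timeout infrastructure; nothing
here is from a real patch, and the names and the 60-second figure are
invented):

    #include "postgres.h"

    #include <signal.h>

    #include "miscadmin.h"
    #include "storage/latch.h"
    #include "utils/inval.h"
    #include "utils/timeout.h"

    /* Flag set by the hypothetical timeout; checked in the idle loop. */
    static volatile sig_atomic_t idle_sinval_timeout_pending = false;
    static TimeoutId idle_sinval_timeout_id;

    /* Timeout handlers run in signal context: just set a flag + latch. */
    static void
    idle_sinval_timeout_handler(void)
    {
        idle_sinval_timeout_pending = true;
        SetLatch(MyLatch);
    }

    /* Once at backend start: grab a dynamically assigned timeout slot. */
    static void
    setup_idle_sinval_timeout(void)
    {
        idle_sinval_timeout_id = RegisterTimeout(USER_TIMEOUT,
                                                 idle_sinval_timeout_handler);
    }

    /* Each time the backend goes idle: arm the timer (this re-arming is
     * the per-transaction overhead such an approach would add). */
    static void
    arm_idle_sinval_timeout(void)
    {
        enable_timeout_after(idle_sinval_timeout_id, 60 * 1000); /* 60s */
    }

    /* In the idle loop, after the latch wakes us up. */
    static void
    maybe_catch_up_on_sinval(void)
    {
        if (idle_sinval_timeout_pending)
        {
            idle_sinval_timeout_pending = false;
            AcceptInvalidationMessages();   /* drain our sinval queue */
        }
    }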

Regards,

Andres



Andres Freund wrote
> I think we should consider either occasionally sending a sinval catchup
> interrupt to backends that have been idle for a while, or using a timer
> to limit the maximum time until we process sinvals. Just having to wait
> until all backends become busy and process sinval events doesn't really
> seem like a good approach to me.


Thanks for replying.
The problem is that I have a very large database, with indexes of up to
70 GB. While I rebuild the indexes concurrently, if an idle connection
was using the index in question, the segment files (<id>.1, <id>.2,
etc.) of the index remain in the filesystem (marked as deleted) until
the idle connection that is holding them runs another transaction. This
means I can have hundreds of GB of space occupied by files marked
"deleted", and that for hours. The risk is running out of free space.






On Wed, Oct 14, 2020 at 5:35 PM Andres Freund <andres@anarazel.de> wrote:
> On 2020-10-14 12:05:10 +0900, Kyotaro Horiguchi wrote:
> > When a relation, including an index, is dropped, the first segment file
> > (named "<id>", with no numeric suffix) is truncated but left behind, so
> > the file is not shown as "(deleted)" in lsof output.
>
> I think we should consider either occasionally sending a sinval catchup
> interrupt to backends that have been idle for a while, or using a timer
> to limit the maximum time until we process sinvals. Just having to wait
> until all backends become busy and process sinval events doesn't really
> seem like a good approach to me.

Oops, I also replied to this but now I see that I accidentally replied
only to Horiguchi-san and not the list!  I was thinking that we should
perhaps consider truncating the files to give back the disk space (as
we do for the first segment), so that it doesn't matter so much how
long other backends take to process SHAREDINVALSMGR_ID, close their
descriptors and release the inode.



Thomas Munro <thomas.munro@gmail.com> writes:
> On Wed, Oct 14, 2020 at 5:35 PM Andres Freund <andres@anarazel.de> wrote:
>> I think we should consider either occasionally sending a sinval catchup
>> interrupt to backends that have been idle for a while, or using a timer
>> to limit the maximum time until we process sinvals. Just having to wait
>> until all backends become busy and process sinval events doesn't really
>> seem like a good approach to me.

> Oops, I also replied to this but now I see that I accidentally replied
> only to Horiguchi-san and not the list!  I was thinking that we should
> perhaps consider truncating the files to give back the disk space (as
> we do for the first segment), so that it doesn't matter so much how
> long other backends take to process SHAREDINVALSMGR_ID, close their
> descriptors and release the inode.

+1, I was also thinking that.  It'd be pretty easy to fit into the
existing system structure (I think, without having looked at the relevant
code lately), and it would not add any overhead to normal processing.
Installing a timeout to handle this per Andres' idea inevitably *would*
add overhead.

            regards, tom lane



On Thu, Oct 15, 2020 at 8:15 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Thomas Munro <thomas.munro@gmail.com> writes:
> > On Wed, Oct 14, 2020 at 5:35 PM Andres Freund <andres@anarazel.de> wrote:
> >> I think we should consider either occasionally sending a sinval catchup
> >> interrupt to backends that have been idle for a while, or using a timer
> >> to limit the maximum time until we process sinvals. Just having to wait
> >> until all backends become busy and process sinval events doesn't really
> >> seem like a good approach to me.
>
> > Oops, I also replied to this but now I see that I accidentally replied
> > only to Horiguchi-san and not the list!  I was thinking that we should
> > perhaps consider truncating the files to give back the disk space (as
> > we do for the first segment), so that it doesn't matter so much how
> > long other backends take to process SHAREDINVALSMGR_ID, close their
> > descriptors and release the inode.
>
> +1, I was also thinking that.  It'd be pretty easy to fit into the
> existing system structure (I think, without having looked at the relevant
> code lately), and it would not add any overhead to normal processing.
> Installing a timeout to handle this per Andres' idea inevitably *would*
> add overhead.

Alright, here is a first swing at making our behaviour more consistent
in two ways:

1.  The first segment should be truncated even in recovery.
2.  Later segments should be truncated on commit.

I don't know why the existing coding decides not to try to unlink the
later segments if the truncate of segment 0 failed.  We already
committed, we should plough on.

Attachment
Ouch. You beat me to it.

At Thu, 15 Oct 2020 14:26:36 +1300, Thomas Munro <thomas.munro@gmail.com> wrote in 
> On Thu, Oct 15, 2020 at 8:15 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > Thomas Munro <thomas.munro@gmail.com> writes:
> > > On Wed, Oct 14, 2020 at 5:35 PM Andres Freund <andres@anarazel.de> wrote:
> > >> I think we should consider either occasionally sending a sinval catchup
> > >> interrupt to backends that have been idle for a while, or using a timer
> > >> to limit the maximum time until we process sinvals. Just having to wait
> > >> until all backends become busy and process sinval events doesn't really
> > >> seem like a good approach to me.
> >
> > > Oops, I also replied to this but now I see that I accidentally replied
> > > only to Horiguchi-san and not the list!  I was thinking that we should
> > > perhaps consider truncating the files to give back the disk space (as
> > > we do for the first segment), so that it doesn't matter so much how
> > > long other backends take to process SHAREDINVALSMGR_ID, close their
> > > descriptors and release the inode.
> >
> > +1, I was also thinking that.  It'd be pretty easy to fit into the
> > existing system structure (I think, without having looked at the relevant
> > code lately), and it would not add any overhead to normal processing.
> > Installing a timeout to handle this per Andres' idea inevitably *would*
> > add overhead.
> 
> Alright, here is a first swing at making our behaviour more consistent
> in two ways:
> 
> 1.  The first segment should be truncated even in recovery.
> 2.  Later segments should be truncated on commit.
> 
> I don't know why the existing coding decides not to try to unlink the
> later segments if the truncate of segment 0 failed.  We already
> committed, we should plough on.

I was trying almost the same thing, except for how the error message
for truncation is emitted, and mine does not try to unlink a following
segment if its truncation ends with ENOENT.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Thanks for the patch.
Do you think it can be included in the next minor releases, or will the
only solution be to recompile?
regards
Denis






On Thu, Oct 15, 2020 at 8:20 PM denni.pat <denni.pat@gmail.com> wrote:
> Thanks for the patch.
> Do you think it can be included in the next minor releases, or will the
> only solution be to recompile?

I would vote +1 for back-patching a fix for this problem (that is,
pushing it into the minor releases), because I agree that it's very
arguably a bug that we treat the segments differently, and looking
around I do see reports of people having to terminate processes to get
their disk space back.  I'd definitely want a consensus on that plan
from some experienced reviewers and testers, though.  For anyone
wanting to test this, you might want to set RELSEG_SIZE to a smaller
number in src/include/pg_config.h.
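
For example, assuming a from-source build, after running configure you
could edit the generated header like this (the default of 131072 8 KB
blocks is 1 GB per segment; 16 blocks makes every relation over 128 KB
multi-segment, so the behaviour is easy to reproduce):

    /* src/include/pg_config.h: shrink segments for testing only.
     * Default: 131072 blocks * 8 KB = 1 GB; 16 blocks = 128 KB. */
    #define RELSEG_SIZE 16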



Thomas,
I looked into the patch and I think it's worth committing and back-patching.
BTW I noticed that sometimes the same comparisons are done twice, so I
made a very minor refactoring of the code. PFA v2 of the patch, if you
don't mind.
As for the question of what to do with the additional segments if the
first one fails to be truncated, I don't consider myself experienced
enough, and someone else's independent opinion is very much welcome.

--
Best regards,
Pavel Borisov

Postgres Professional: http://postgrespro.com
Attachment