Thread: BUG #15427: DROP INDEX did not free up disk space

BUG #15427: DROP INDEX did not free up disk space

From

PG Bug reporting form

Date:

12 October 2018, 03:27:28

The following bug has been logged on the website:

Bug reference:      15427
Logged by:          Andrew P
Email address:      ap@zip.com.au
PostgreSQL version: 10.5
Operating system:   Debian Sarge (9)
Description:

Performed a DROP INDEX to free up disk space but, whilst the command
succeeded the disk space was not freed up.

The index was:

CREATE INDEX radlelink_rlreid_idx ON radlelink (rlid, reid);

Config of table:

CREATE TABLE radlelink (
        rlid            BIGINT REFERENCES radlogins (id) ON DELETE
CASCADE,
        reid            BIGINT REFERENCES radextra (id) ON DELETE CASCADE
);

Ran 'lsof -nP +L1' on the server as per suggestion on postgresql irc channel
and this was the (abridged output):

COMMAND   PID     USER   FD   TYPE DEVICE   SIZE/OFF NLINK    NODE NAME
postgres 2633 postgres  197u   REG  252,5   16777216     0 3410028
/var/lib/postgresql/10/main/pg_wal/00000001000003960000006E (deleted)
postgres 2633 postgres  325u   REG  252,5          0     0 3409445
/var/lib/postgresql/10/main/base/16562/17437 (deleted)
postgres 2633 postgres  326u   REG  252,5 1073741824     0 3409702
/var/lib/postgresql/10/main/base/16562/17437.1 (deleted)
postgres 2633 postgres  327u   REG  252,5 1073741824     0 3409704
/var/lib/postgresql/10/main/base/16562/17437.2 (deleted)
postgres 2633 postgres  328u   REG  252,5 1073741824     0 3409705
/var/lib/postgresql/10/main/base/16562/17437.3 (deleted)
postgres 2633 postgres  329u   REG  252,5 1073741824     0 3409452
/var/lib/postgresql/10/main/base/16562/17437.4 (deleted)
postgres 2633 postgres  330u   REG  252,5 1073741824     0 3409709
/var/lib/postgresql/10/main/base/16562/17437.5 (deleted)
postgres 2633 postgres  331u   REG  252,5 1073741824     0 3409710
/var/lib/postgresql/10/main/base/16562/17437.6 (deleted)
postgres 2633 postgres  332u   REG  252,5 1073741824     0 3409706
/var/lib/postgresql/10/main/base/16562/17437.7 (deleted)
postgres 2633 postgres  333u   REG  252,5 1073741824     0 3409722
/var/lib/postgresql/10/main/base/16562/17437.8 (deleted)
postgres 2633 postgres  334u   REG  252,5 1073741824     0 3409724
/var/lib/postgresql/10/main/base/16562/17437.9 (deleted)
postgres 2633 postgres  335u   REG  252,5 1073741824     0 3409716
/var/lib/postgresql/10/main/base/16562/17437.10 (deleted)
...
postgres 2633 postgres  403u   REG  252,5 1073741824     0 3409970
/var/lib/postgresql/10/main/base/16562/17437.78 (deleted)
postgres 2633 postgres  404u   REG  252,5 1073741824     0 3409976
/var/lib/postgresql/10/main/base/16562/17437.79 (deleted)
postgres 2633 postgres  405u   REG  252,5 1073741824     0 3409969
/var/lib/postgresql/10/main/base/16562/17437.80 (deleted)
postgres 2633 postgres  406u   REG  252,5 1073741824     0 3409983
/var/lib/postgresql/10/main/base/16562/17437.81 (deleted)
postgres 2633 postgres  407u   REG  252,5 1073741824     0 3409984
/var/lib/postgresql/10/main/base/16562/17437.82 (deleted)
postgres 2633 postgres  408u   REG  252,5 1073741824     0 3409981
/var/lib/postgresql/10/main/base/16562/17437.83 (deleted)
postgres 2633 postgres  409u   REG  252,5 1073741824     0 3410002
/var/lib/postgresql/10/main/base/16562/17437.84 (deleted)
postgres 2633 postgres  410u   REG  252,5 1073741824     0 3410012
/var/lib/postgresql/10/main/base/16562/17437.85 (deleted)
postgres 2633 postgres  411u   REG  252,5 1073741824     0 3409991
/var/lib/postgresql/10/main/base/16562/17437.86 (deleted)
postgres 2633 postgres  412u   REG  252,5 1073741824     0 3410025
/var/lib/postgresql/10/main/base/16562/17437.87 (deleted)
postgres 2633 postgres  413u   REG  252,5  264241152     0 3410026
/var/lib/postgresql/10/main/base/16562/17437.88 (deleted)

PID 2633 was:
postgres  2633 12.5  0.4 305868 158632 ?       Ss   Oct09 544:04 postgres:
10/main: postgres radlogs [local] idle

I executed 'SELECT 1;' in that session (it was a psql session) and 5s later
it returned, the files were closed and the disk space freed.

There were also two autovacuum processes up at the time:
postgres 29102  2.4  0.2 355320 76724 ?        Ss   13:09   0:45 postgres:
10/main: autovacuum worker process   radlogs
postgres 29378 16.0  0.1 355312 37976 ?        Ss   13:40   0:00 postgres:
10/main: autovacuum worker process   radlogs

Hope this helps.

AP

Re: BUG #15427: DROP INDEX did not free up disk space

From

Andres Freund

Date:

12 October 2018, 03:37:14

Hi,

On 2018-10-12 03:27:28 +0000, PG Bug reporting form wrote:
> The following bug has been logged on the website:
> 
> Bug reference:      15427
> Logged by:          Andrew P
> Email address:      ap@zip.com.au
> PostgreSQL version: 10.5
> Operating system:   Debian Sarge (9)
> Description:        
> 
> Performed a DROP INDEX to free up disk space but, whilst the command
> succeeded the disk space was not freed up.
> 
> The index was:
> 
> CREATE INDEX radlelink_rlreid_idx ON radlelink (rlid, reid);
> 
> Config of table:
> 
> CREATE TABLE radlelink (
>         rlid            BIGINT REFERENCES radlogins (id) ON DELETE
> CASCADE,
>         reid            BIGINT REFERENCES radextra (id) ON DELETE CASCADE
> );
> 
> Ran 'lsof -nP +L1' on the server as per suggestion on postgresql irc channel
> and this was the (abridged output):
> 
> COMMAND   PID     USER   FD   TYPE DEVICE   SIZE/OFF NLINK    NODE NAME
> postgres 2633 postgres  197u   REG  252,5   16777216     0 3410028
> /var/lib/postgresql/10/main/pg_wal/00000001000003960000006E (deleted)
> postgres 2633 postgres  325u   REG  252,5          0     0 3409445
> /var/lib/postgresql/10/main/base/16562/17437 (deleted)
> postgres 2633 postgres  326u   REG  252,5 1073741824     0 3409702
> /var/lib/postgresql/10/main/base/16562/17437.1 (deleted)
> postgres 2633 postgres  327u   REG  252,5 1073741824     0 3409704
> /var/lib/postgresql/10/main/base/16562/17437.2 (deleted)
> postgres 2633 postgres  328u   REG  252,5 1073741824     0 3409705
> /var/lib/postgresql/10/main/base/16562/17437.3 (deleted)
> postgres 2633 postgres  329u   REG  252,5 1073741824     0 3409452
> /var/lib/postgresql/10/main/base/16562/17437.4 (deleted)
> postgres 2633 postgres  330u   REG  252,5 1073741824     0 3409709
> /var/lib/postgresql/10/main/base/16562/17437.5 (deleted)
> postgres 2633 postgres  331u   REG  252,5 1073741824     0 3409710
> /var/lib/postgresql/10/main/base/16562/17437.6 (deleted)
> postgres 2633 postgres  332u   REG  252,5 1073741824     0 3409706
> /var/lib/postgresql/10/main/base/16562/17437.7 (deleted)
> postgres 2633 postgres  333u   REG  252,5 1073741824     0 3409722
> /var/lib/postgresql/10/main/base/16562/17437.8 (deleted)
> postgres 2633 postgres  334u   REG  252,5 1073741824     0 3409724
> /var/lib/postgresql/10/main/base/16562/17437.9 (deleted)
> postgres 2633 postgres  335u   REG  252,5 1073741824     0 3409716
> /var/lib/postgresql/10/main/base/16562/17437.10 (deleted)
> ...
> postgres 2633 postgres  403u   REG  252,5 1073741824     0 3409970
> /var/lib/postgresql/10/main/base/16562/17437.78 (deleted)
> postgres 2633 postgres  404u   REG  252,5 1073741824     0 3409976
> /var/lib/postgresql/10/main/base/16562/17437.79 (deleted)
> postgres 2633 postgres  405u   REG  252,5 1073741824     0 3409969
> /var/lib/postgresql/10/main/base/16562/17437.80 (deleted)
> postgres 2633 postgres  406u   REG  252,5 1073741824     0 3409983
> /var/lib/postgresql/10/main/base/16562/17437.81 (deleted)
> postgres 2633 postgres  407u   REG  252,5 1073741824     0 3409984
> /var/lib/postgresql/10/main/base/16562/17437.82 (deleted)
> postgres 2633 postgres  408u   REG  252,5 1073741824     0 3409981
> /var/lib/postgresql/10/main/base/16562/17437.83 (deleted)
> postgres 2633 postgres  409u   REG  252,5 1073741824     0 3410002
> /var/lib/postgresql/10/main/base/16562/17437.84 (deleted)
> postgres 2633 postgres  410u   REG  252,5 1073741824     0 3410012
> /var/lib/postgresql/10/main/base/16562/17437.85 (deleted)
> postgres 2633 postgres  411u   REG  252,5 1073741824     0 3409991
> /var/lib/postgresql/10/main/base/16562/17437.86 (deleted)
> postgres 2633 postgres  412u   REG  252,5 1073741824     0 3410025
> /var/lib/postgresql/10/main/base/16562/17437.87 (deleted)
> postgres 2633 postgres  413u   REG  252,5  264241152     0 3410026
> /var/lib/postgresql/10/main/base/16562/17437.88 (deleted)
> 
> PID 2633 was:
> postgres  2633 12.5  0.4 305868 158632 ?       Ss   Oct09 544:04 postgres:
> 10/main: postgres radlogs [local] idle
> 
> I executed 'SELECT 1;' in that session (it was a psql session) and 5s later
> it returned, the files were closed and the disk space freed.
> 
> There were also two autovacuum processes up at the time:
> postgres 29102  2.4  0.2 355320 76724 ?        Ss   13:09   0:45 postgres:
> 10/main: autovacuum worker process   radlogs
> postgres 29378 16.0  0.1 355312 37976 ?        Ss   13:40   0:00 postgres:
> 10/main: autovacuum worker process   radlogs
> 
> Hope this helps.

The problem here is that even though we send a invalidation message to
all backends, there's nothing to force backends to process invalidation
messages in time if they're idle. Sure, at some point, when
independently enough inval messages have been created, we'll send out a
catchup interrupt.  But that's not necessarily going to be that soon in
a production database.

ISTM that we need to force catchup interrupts in a few cases, like when
smgr invals for truncation.

Greetings,

Andres Freund

Re: BUG #15427: DROP INDEX did not free up disk space

From

Tom Lane

Date:

12 October 2018, 03:57:16

Andres Freund <andres@anarazel.de> writes:
> On 2018-10-12 03:27:28 +0000, PG Bug reporting form wrote:
>> Performed a DROP INDEX to free up disk space but, whilst the command
>> succeeded the disk space was not freed up.

> The problem here is that even though we send a invalidation message to
> all backends, there's nothing to force backends to process invalidation
> messages in time if they're idle.

Uh, what's that got to do with it?

My recollection (though I've not looked at the code) is that we
truncate the file to ensure disk space is freed, whether or not
all backends have closed their FDs for the file.  This is quite
independent of the sinval mechanism.

I'm suspicious that this report reflects some weird behavior
of a non-mainline filesystem ...

            regards, tom lane

Re: BUG #15427: DROP INDEX did not free up disk space

From

Andres Freund

Date:

12 October 2018, 04:09:58

Hi,

On 2018-10-11 23:57:16 -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > On 2018-10-12 03:27:28 +0000, PG Bug reporting form wrote:
> >> Performed a DROP INDEX to free up disk space but, whilst the command
> >> succeeded the disk space was not freed up.
> 
> > The problem here is that even though we send a invalidation message to
> > all backends, there's nothing to force backends to process invalidation
> > messages in time if they're idle.
> 
> Uh, what's that got to do with it?

If you look at the bugreport: As soon as the op, on my suggestion,
triggered sinval processing (by issuing a SELECT 1;) the space was
freed. So clearly the open FDs were part of the problem.

> My recollection (though I've not looked at the code) is that we
> truncate the file to ensure disk space is freed, whether or not
> all backends have closed their FDs for the file.

That's not quite how we do it tho. In mdunlinkfork() we truncate the
*first* segment, but all further segments are unlink()ed directly. For
that to work, sinvals need to be processed. So

> This is quite independent of the sinval mechanism.

is clearly currently not correct.

I'd argue that even if we were to change mdunlinkfork() so it always
truncates and then unlinks, we should *still* make sure that sinval
messages for things like smgrdounlinkfork() are processed in some
bounded time. Consider e.g. that you might drop a table + tablespace and
then would want to unmount - this'd prevent that.

Greetings,

Andres Freund

Re: BUG #15427: DROP INDEX did not free up disk space

From

Tom Lane

Date:

12 October 2018, 04:33:14

Andres Freund <andres@anarazel.de> writes:
> On 2018-10-11 23:57:16 -0400, Tom Lane wrote:
>> Uh, what's that got to do with it?

> If you look at the bugreport: As soon as the op, on my suggestion,
> triggered sinval processing (by issuing a SELECT 1;) the space was
> freed. So clearly the open FDs were part of the problem.

TBH, I think the space-freeup was more likely driven off a background
checkpoint completion, which is where the truncation happens.

            regards, tom lane

Re: BUG #15427: DROP INDEX did not free up disk space

From

Andres Freund

Date:

12 October 2018, 04:51:48

Hi,

On 2018-10-12 00:33:14 -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > On 2018-10-11 23:57:16 -0400, Tom Lane wrote:
> >> Uh, what's that got to do with it?
>
> > If you look at the bugreport: As soon as the op, on my suggestion,
> > triggered sinval processing (by issuing a SELECT 1;) the space was
> > freed. So clearly the open FDs were part of the problem.
>
> TBH, I think the space-freeup was more likely driven off a background
> checkpoint completion, which is where the truncation happens.

Uh, as I wrote, mdunlinkfork(), which backs dropping an index via
index_drop()->RelationDropStorage() and then
smgrDoPendingDeletes()->smgrdounlinkall()->mdunlink()->mdunlinkfork(),
unlinks all segments beyond the first itself:

static void
mdunlinkfork(RelFileNodeBackend rnode, ForkNumber forkNum, bool isRedo)
{
    char       *path;
    int            ret;

    path = relpath(rnode, forkNum);

    /*
     * Delete or truncate the first segment.
     */
    if (isRedo || forkNum != MAIN_FORKNUM || RelFileNodeBackendIsTemp(rnode))
    {
        ret = unlink(path);
        if (ret < 0 && errno != ENOENT)
            ereport(WARNING,
                    (errcode_for_file_access(),
                     errmsg("could not remove file \"%s\": %m", path)));
    }
    else
    {
        /* truncate(2) would be easier here, but Windows hasn't got it */
        int            fd;

        fd = OpenTransientFile(path, O_RDWR | PG_BINARY);
        if (fd >= 0)
        {
            int            save_errno;

            ret = ftruncate(fd, 0);
            save_errno = errno;
            CloseTransientFile(fd);
            errno = save_errno;
        }
        else
            ret = -1;
        if (ret < 0 && errno != ENOENT)
            ereport(WARNING,
                    (errcode_for_file_access(),
                     errmsg("could not truncate file \"%s\": %m", path)));

        /* Register request to unlink first segment later */
        register_unlink(rnode);
    }

    /*
     * Delete any additional segments.
     */
    if (ret >= 0)
    {
        char       *segpath = (char *) palloc(strlen(path) + 12);
        BlockNumber segno;

        /*
         * Note that because we loop until getting ENOENT, we will correctly
         * remove all inactive segments as well as active ones.
         */
        for (segno = 1;; segno++)
        {
            sprintf(segpath, "%s.%u", path, segno);
            if (unlink(segpath) < 0)
            {
                /* ENOENT is expected after the last segment... */
                if (errno != ENOENT)
                    ereport(WARNING,
                            (errcode_for_file_access(),
                             errmsg("could not remove file \"%s\": %m", segpath)));
                break;
            }
        }
        pfree(segpath);
    }

    pfree(path);
}

As you clearly can see, unlink() is called directly here for anything
but the first segment (which is registered to be unlinked in
checkpointer via register_unlink()).

Note that the checkpointer based machinery doesn't even *support*
unlinking anything beyond the first segment:

void
mdpostckpt(void)
{
...
    while (pendingUnlinks != NIL)
...
        /* Unlink the file */
        path = relpathperm(entry->rnode, MAIN_FORKNUM);
        if (unlink(path) < 0)

there's no segment handling here.


You're right that mdtruncate() leaves later segments around in a
truncated manner.  But that's because of an orthogonal concern:
 *    The full and partial segments are collectively the "active" segments.
 *    Inactive segments are those that once contained data but are currently
 *    not needed because of an mdtruncate() operation.  The reason for leaving
 *    them present at size zero, rather than unlinking them, is that other
 *    backends and/or the checkpointer might be holding open file references to
 *    such segments.  If the relation expands again after mdtruncate(), such
 *    that a deactivated segment becomes active again, it is important that
 *    such file references still be valid --- else data might get written
 *    out to an unlinked old copy of a segment file that will eventually
 *    disappear.

that doesn't apply to dropping relations.

Greetings,

Andres Freund