Thread: BUG #15427: DROP INDEX did not free up disk space
The following bug has been logged on the website: Bug reference: 15427 Logged by: Andrew P Email address: ap@zip.com.au PostgreSQL version: 10.5 Operating system: Debian Sarge (9) Description: Performed a DROP INDEX to free up disk space but, whilst the command succeeded the disk space was not freed up. The index was: CREATE INDEX radlelink_rlreid_idx ON radlelink (rlid, reid); Config of table: CREATE TABLE radlelink ( rlid BIGINT REFERENCES radlogins (id) ON DELETE CASCADE, reid BIGINT REFERENCES radextra (id) ON DELETE CASCADE ); Ran 'lsof -nP +L1' on the server as per suggestion on postgresql irc channel and this was the (abridged output): COMMAND PID USER FD TYPE DEVICE SIZE/OFF NLINK NODE NAME postgres 2633 postgres 197u REG 252,5 16777216 0 3410028 /var/lib/postgresql/10/main/pg_wal/00000001000003960000006E (deleted) postgres 2633 postgres 325u REG 252,5 0 0 3409445 /var/lib/postgresql/10/main/base/16562/17437 (deleted) postgres 2633 postgres 326u REG 252,5 1073741824 0 3409702 /var/lib/postgresql/10/main/base/16562/17437.1 (deleted) postgres 2633 postgres 327u REG 252,5 1073741824 0 3409704 /var/lib/postgresql/10/main/base/16562/17437.2 (deleted) postgres 2633 postgres 328u REG 252,5 1073741824 0 3409705 /var/lib/postgresql/10/main/base/16562/17437.3 (deleted) postgres 2633 postgres 329u REG 252,5 1073741824 0 3409452 /var/lib/postgresql/10/main/base/16562/17437.4 (deleted) postgres 2633 postgres 330u REG 252,5 1073741824 0 3409709 /var/lib/postgresql/10/main/base/16562/17437.5 (deleted) postgres 2633 postgres 331u REG 252,5 1073741824 0 3409710 /var/lib/postgresql/10/main/base/16562/17437.6 (deleted) postgres 2633 postgres 332u REG 252,5 1073741824 0 3409706 /var/lib/postgresql/10/main/base/16562/17437.7 (deleted) postgres 2633 postgres 333u REG 252,5 1073741824 0 3409722 /var/lib/postgresql/10/main/base/16562/17437.8 (deleted) postgres 2633 postgres 334u REG 252,5 1073741824 0 3409724 /var/lib/postgresql/10/main/base/16562/17437.9 (deleted) postgres 2633 postgres 335u REG 252,5 1073741824 0 3409716 /var/lib/postgresql/10/main/base/16562/17437.10 (deleted) ... postgres 2633 postgres 403u REG 252,5 1073741824 0 3409970 /var/lib/postgresql/10/main/base/16562/17437.78 (deleted) postgres 2633 postgres 404u REG 252,5 1073741824 0 3409976 /var/lib/postgresql/10/main/base/16562/17437.79 (deleted) postgres 2633 postgres 405u REG 252,5 1073741824 0 3409969 /var/lib/postgresql/10/main/base/16562/17437.80 (deleted) postgres 2633 postgres 406u REG 252,5 1073741824 0 3409983 /var/lib/postgresql/10/main/base/16562/17437.81 (deleted) postgres 2633 postgres 407u REG 252,5 1073741824 0 3409984 /var/lib/postgresql/10/main/base/16562/17437.82 (deleted) postgres 2633 postgres 408u REG 252,5 1073741824 0 3409981 /var/lib/postgresql/10/main/base/16562/17437.83 (deleted) postgres 2633 postgres 409u REG 252,5 1073741824 0 3410002 /var/lib/postgresql/10/main/base/16562/17437.84 (deleted) postgres 2633 postgres 410u REG 252,5 1073741824 0 3410012 /var/lib/postgresql/10/main/base/16562/17437.85 (deleted) postgres 2633 postgres 411u REG 252,5 1073741824 0 3409991 /var/lib/postgresql/10/main/base/16562/17437.86 (deleted) postgres 2633 postgres 412u REG 252,5 1073741824 0 3410025 /var/lib/postgresql/10/main/base/16562/17437.87 (deleted) postgres 2633 postgres 413u REG 252,5 264241152 0 3410026 /var/lib/postgresql/10/main/base/16562/17437.88 (deleted) PID 2633 was: postgres 2633 12.5 0.4 305868 158632 ? Ss Oct09 544:04 postgres: 10/main: postgres radlogs [local] idle I executed 'SELECT 1;' in that session (it was a psql session) and 5s later it returned, the files were closed and the disk space freed. There were also two autovacuum processes up at the time: postgres 29102 2.4 0.2 355320 76724 ? Ss 13:09 0:45 postgres: 10/main: autovacuum worker process radlogs postgres 29378 16.0 0.1 355312 37976 ? Ss 13:40 0:00 postgres: 10/main: autovacuum worker process radlogs Hope this helps. AP
Hi, On 2018-10-12 03:27:28 +0000, PG Bug reporting form wrote: > The following bug has been logged on the website: > > Bug reference: 15427 > Logged by: Andrew P > Email address: ap@zip.com.au > PostgreSQL version: 10.5 > Operating system: Debian Sarge (9) > Description: > > Performed a DROP INDEX to free up disk space but, whilst the command > succeeded the disk space was not freed up. > > The index was: > > CREATE INDEX radlelink_rlreid_idx ON radlelink (rlid, reid); > > Config of table: > > CREATE TABLE radlelink ( > rlid BIGINT REFERENCES radlogins (id) ON DELETE > CASCADE, > reid BIGINT REFERENCES radextra (id) ON DELETE CASCADE > ); > > Ran 'lsof -nP +L1' on the server as per suggestion on postgresql irc channel > and this was the (abridged output): > > COMMAND PID USER FD TYPE DEVICE SIZE/OFF NLINK NODE NAME > postgres 2633 postgres 197u REG 252,5 16777216 0 3410028 > /var/lib/postgresql/10/main/pg_wal/00000001000003960000006E (deleted) > postgres 2633 postgres 325u REG 252,5 0 0 3409445 > /var/lib/postgresql/10/main/base/16562/17437 (deleted) > postgres 2633 postgres 326u REG 252,5 1073741824 0 3409702 > /var/lib/postgresql/10/main/base/16562/17437.1 (deleted) > postgres 2633 postgres 327u REG 252,5 1073741824 0 3409704 > /var/lib/postgresql/10/main/base/16562/17437.2 (deleted) > postgres 2633 postgres 328u REG 252,5 1073741824 0 3409705 > /var/lib/postgresql/10/main/base/16562/17437.3 (deleted) > postgres 2633 postgres 329u REG 252,5 1073741824 0 3409452 > /var/lib/postgresql/10/main/base/16562/17437.4 (deleted) > postgres 2633 postgres 330u REG 252,5 1073741824 0 3409709 > /var/lib/postgresql/10/main/base/16562/17437.5 (deleted) > postgres 2633 postgres 331u REG 252,5 1073741824 0 3409710 > /var/lib/postgresql/10/main/base/16562/17437.6 (deleted) > postgres 2633 postgres 332u REG 252,5 1073741824 0 3409706 > /var/lib/postgresql/10/main/base/16562/17437.7 (deleted) > postgres 2633 postgres 333u REG 252,5 1073741824 0 3409722 > /var/lib/postgresql/10/main/base/16562/17437.8 (deleted) > postgres 2633 postgres 334u REG 252,5 1073741824 0 3409724 > /var/lib/postgresql/10/main/base/16562/17437.9 (deleted) > postgres 2633 postgres 335u REG 252,5 1073741824 0 3409716 > /var/lib/postgresql/10/main/base/16562/17437.10 (deleted) > ... > postgres 2633 postgres 403u REG 252,5 1073741824 0 3409970 > /var/lib/postgresql/10/main/base/16562/17437.78 (deleted) > postgres 2633 postgres 404u REG 252,5 1073741824 0 3409976 > /var/lib/postgresql/10/main/base/16562/17437.79 (deleted) > postgres 2633 postgres 405u REG 252,5 1073741824 0 3409969 > /var/lib/postgresql/10/main/base/16562/17437.80 (deleted) > postgres 2633 postgres 406u REG 252,5 1073741824 0 3409983 > /var/lib/postgresql/10/main/base/16562/17437.81 (deleted) > postgres 2633 postgres 407u REG 252,5 1073741824 0 3409984 > /var/lib/postgresql/10/main/base/16562/17437.82 (deleted) > postgres 2633 postgres 408u REG 252,5 1073741824 0 3409981 > /var/lib/postgresql/10/main/base/16562/17437.83 (deleted) > postgres 2633 postgres 409u REG 252,5 1073741824 0 3410002 > /var/lib/postgresql/10/main/base/16562/17437.84 (deleted) > postgres 2633 postgres 410u REG 252,5 1073741824 0 3410012 > /var/lib/postgresql/10/main/base/16562/17437.85 (deleted) > postgres 2633 postgres 411u REG 252,5 1073741824 0 3409991 > /var/lib/postgresql/10/main/base/16562/17437.86 (deleted) > postgres 2633 postgres 412u REG 252,5 1073741824 0 3410025 > /var/lib/postgresql/10/main/base/16562/17437.87 (deleted) > postgres 2633 postgres 413u REG 252,5 264241152 0 3410026 > /var/lib/postgresql/10/main/base/16562/17437.88 (deleted) > > PID 2633 was: > postgres 2633 12.5 0.4 305868 158632 ? Ss Oct09 544:04 postgres: > 10/main: postgres radlogs [local] idle > > I executed 'SELECT 1;' in that session (it was a psql session) and 5s later > it returned, the files were closed and the disk space freed. > > There were also two autovacuum processes up at the time: > postgres 29102 2.4 0.2 355320 76724 ? Ss 13:09 0:45 postgres: > 10/main: autovacuum worker process radlogs > postgres 29378 16.0 0.1 355312 37976 ? Ss 13:40 0:00 postgres: > 10/main: autovacuum worker process radlogs > > Hope this helps. The problem here is that even though we send a invalidation message to all backends, there's nothing to force backends to process invalidation messages in time if they're idle. Sure, at some point, when independently enough inval messages have been created, we'll send out a catchup interrupt. But that's not necessarily going to be that soon in a production database. ISTM that we need to force catchup interrupts in a few cases, like when smgr invals for truncation. Greetings, Andres Freund
Andres Freund <andres@anarazel.de> writes: > On 2018-10-12 03:27:28 +0000, PG Bug reporting form wrote: >> Performed a DROP INDEX to free up disk space but, whilst the command >> succeeded the disk space was not freed up. > The problem here is that even though we send a invalidation message to > all backends, there's nothing to force backends to process invalidation > messages in time if they're idle. Uh, what's that got to do with it? My recollection (though I've not looked at the code) is that we truncate the file to ensure disk space is freed, whether or not all backends have closed their FDs for the file. This is quite independent of the sinval mechanism. I'm suspicious that this report reflects some weird behavior of a non-mainline filesystem ... regards, tom lane
Hi, On 2018-10-11 23:57:16 -0400, Tom Lane wrote: > Andres Freund <andres@anarazel.de> writes: > > On 2018-10-12 03:27:28 +0000, PG Bug reporting form wrote: > >> Performed a DROP INDEX to free up disk space but, whilst the command > >> succeeded the disk space was not freed up. > > > The problem here is that even though we send a invalidation message to > > all backends, there's nothing to force backends to process invalidation > > messages in time if they're idle. > > Uh, what's that got to do with it? If you look at the bugreport: As soon as the op, on my suggestion, triggered sinval processing (by issuing a SELECT 1;) the space was freed. So clearly the open FDs were part of the problem. > My recollection (though I've not looked at the code) is that we > truncate the file to ensure disk space is freed, whether or not > all backends have closed their FDs for the file. That's not quite how we do it tho. In mdunlinkfork() we truncate the *first* segment, but all further segments are unlink()ed directly. For that to work, sinvals need to be processed. So > This is quite independent of the sinval mechanism. is clearly currently not correct. I'd argue that even if we were to change mdunlinkfork() so it always truncates and then unlinks, we should *still* make sure that sinval messages for things like smgrdounlinkfork() are processed in some bounded time. Consider e.g. that you might drop a table + tablespace and then would want to unmount - this'd prevent that. Greetings, Andres Freund
Andres Freund <andres@anarazel.de> writes: > On 2018-10-11 23:57:16 -0400, Tom Lane wrote: >> Uh, what's that got to do with it? > If you look at the bugreport: As soon as the op, on my suggestion, > triggered sinval processing (by issuing a SELECT 1;) the space was > freed. So clearly the open FDs were part of the problem. TBH, I think the space-freeup was more likely driven off a background checkpoint completion, which is where the truncation happens. regards, tom lane
Hi, On 2018-10-12 00:33:14 -0400, Tom Lane wrote: > Andres Freund <andres@anarazel.de> writes: > > On 2018-10-11 23:57:16 -0400, Tom Lane wrote: > >> Uh, what's that got to do with it? > > > If you look at the bugreport: As soon as the op, on my suggestion, > > triggered sinval processing (by issuing a SELECT 1;) the space was > > freed. So clearly the open FDs were part of the problem. > > TBH, I think the space-freeup was more likely driven off a background > checkpoint completion, which is where the truncation happens. Uh, as I wrote, mdunlinkfork(), which backs dropping an index via index_drop()->RelationDropStorage() and then smgrDoPendingDeletes()->smgrdounlinkall()->mdunlink()->mdunlinkfork(), unlinks all segments beyond the first itself: static void mdunlinkfork(RelFileNodeBackend rnode, ForkNumber forkNum, bool isRedo) { char *path; int ret; path = relpath(rnode, forkNum); /* * Delete or truncate the first segment. */ if (isRedo || forkNum != MAIN_FORKNUM || RelFileNodeBackendIsTemp(rnode)) { ret = unlink(path); if (ret < 0 && errno != ENOENT) ereport(WARNING, (errcode_for_file_access(), errmsg("could not remove file \"%s\": %m", path))); } else { /* truncate(2) would be easier here, but Windows hasn't got it */ int fd; fd = OpenTransientFile(path, O_RDWR | PG_BINARY); if (fd >= 0) { int save_errno; ret = ftruncate(fd, 0); save_errno = errno; CloseTransientFile(fd); errno = save_errno; } else ret = -1; if (ret < 0 && errno != ENOENT) ereport(WARNING, (errcode_for_file_access(), errmsg("could not truncate file \"%s\": %m", path))); /* Register request to unlink first segment later */ register_unlink(rnode); } /* * Delete any additional segments. */ if (ret >= 0) { char *segpath = (char *) palloc(strlen(path) + 12); BlockNumber segno; /* * Note that because we loop until getting ENOENT, we will correctly * remove all inactive segments as well as active ones. */ for (segno = 1;; segno++) { sprintf(segpath, "%s.%u", path, segno); if (unlink(segpath) < 0) { /* ENOENT is expected after the last segment... */ if (errno != ENOENT) ereport(WARNING, (errcode_for_file_access(), errmsg("could not remove file \"%s\": %m", segpath))); break; } } pfree(segpath); } pfree(path); } As you clearly can see, unlink() is called directly here for anything but the first segment (which is registered to be unlinked in checkpointer via register_unlink()). Note that the checkpointer based machinery doesn't even *support* unlinking anything beyond the first segment: void mdpostckpt(void) { ... while (pendingUnlinks != NIL) ... /* Unlink the file */ path = relpathperm(entry->rnode, MAIN_FORKNUM); if (unlink(path) < 0) there's no segment handling here. You're right that mdtruncate() leaves later segments around in a truncated manner. But that's because of an orthogonal concern: * The full and partial segments are collectively the "active" segments. * Inactive segments are those that once contained data but are currently * not needed because of an mdtruncate() operation. The reason for leaving * them present at size zero, rather than unlinking them, is that other * backends and/or the checkpointer might be holding open file references to * such segments. If the relation expands again after mdtruncate(), such * that a deactivated segment becomes active again, it is important that * such file references still be valid --- else data might get written * out to an unlinked old copy of a segment file that will eventually * disappear. that doesn't apply to dropping relations. Greetings, Andres Freund