Thread: Breakage with VACUUM ANALYSE + partitions
Hi,

I've not determined what's causing the following issue, but this is the
simplest reproducible test case I've managed to create so far in 9.6
latest git master:

postgresql.conf:
shared_buffers = 256MB

createdb pgbench
pgbench -i -s 80 pgbench

psql pgbench
\pset pager off
CREATE TABLE pgbench_accounts_1 (CHECK (bid % 2 = 0)) INHERITS (pgbench_accounts);
CREATE TABLE pgbench_accounts_2 (CHECK (bid % 2 = 1)) INHERITS (pgbench_accounts);
WITH del AS (DELETE FROM pgbench_accounts WHERE bid % 2 = 0 RETURNING *) INSERT INTO pgbench_accounts_1 SELECT * FROM del;
WITH del AS (DELETE FROM pgbench_accounts WHERE bid % 2 = 1 RETURNING *) INSERT INTO pgbench_accounts_2 SELECT * FROM del;
VACUUM ANALYSE;
EXPLAIN ANALYSE SELECT count(*) FROM pgbench_accounts;

This last statement produces:
ERROR:  could not read block 0 in file "base/35160/35173": read only 0 of 8192 bytes

Looking at the file, I get:

  File: ‘35173’
  Size: 0          Blocks: 0          IO Block: 4096   regular empty file
Device: 801h/2049d  Inode: 4504115    Links: 1
Access: (0600/-rw-------)  Uid: (1000001/ thom)   Gid: (1000001/ thom)
Access: 2016-03-20 19:37:50.433485085 +0000
Modify: 2016-03-20 19:39:49.422959222 +0000
Change: 2016-03-20 19:39:49.422959222 +0000
 Birth: -

This is the same file info I get if I just do a regular VACUUM instead,
but that doesn't result in the error.  And if I follow that up with a
VACUUM ANALYSE, the error still doesn't occur.

The base directory also shows this listing for files named after the
affected filenode:

-rw------- 1 thom thom     0 Mar 20 19:39 35173
-rw------- 1 thom thom     0 Mar 20 19:39 35173.1
-rw------- 1 thom thom 16384 Mar 20 19:39 35173_fsm
-rw------- 1 thom thom     0 Mar 20 19:39 35173_vm

After I see the error, I get the same error again if I then try to run a
plain VACUUM.

The relfilenode 35173 shown in this error message corresponds to the
parent pgbench_accounts table.

If I reduce the scale factor to 50, or increase shared_buffers to 512MB,
there's no error.  If, instead of VACUUM ANALYSE, I just use VACUUM, or
just ANALYSE, the error doesn't happen.

System details:

OS: Linux Mint Debian Edition (Wheezy 7.0)
Linux kernel: 3.11.8
RAM: 16GB
Filesystem: ext4 (rw,noatime,discard,errors=remount-ro,data=ordered)

Regards

Thom
On Sun, Mar 20, 2016 at 3:55 PM, Thom Brown <thom@linux.com> wrote:
> I've not determined what's causing the following issue, but this is
> the simplest reproducible test case I've managed to create so far in
> 9.6 latest git master:
>
> postgresql.conf:
> shared_buffers = 256MB
>
> createdb pgbench
> pgbench -i -s 80 pgbench
>
> psql pgbench
> \pset pager off
> CREATE TABLE pgbench_accounts_1 (CHECK (bid % 2 = 0)) INHERITS
> (pgbench_accounts);
> CREATE TABLE pgbench_accounts_2 (CHECK (bid % 2 = 1)) INHERITS
> (pgbench_accounts);
> WITH del AS (DELETE FROM pgbench_accounts WHERE bid % 2 = 0
> RETURNING *) INSERT INTO pgbench_accounts_1 SELECT * FROM del;
> WITH del AS (DELETE FROM pgbench_accounts WHERE bid % 2 = 1
> RETURNING *) INSERT INTO pgbench_accounts_2 SELECT * FROM del;
> VACUUM ANALYSE;
> EXPLAIN ANALYSE SELECT count(*) FROM pgbench_accounts;
>
> This last statement produces:
> ERROR:  could not read block 0 in file "base/35160/35173": read only
> 0 of 8192 bytes

Hmm.  The natural thing to suspect here would be a problem with the
freeze map changes.  But I tried this and I couldn't reproduce it.
Any chance you can try this on each of the following commits?

fd31cd265138019dcccc9b5fe53043670898bc9f
77a1d1e79892a20ed15a67be42b96949b8546bf6
a892234f830e832110f63fc0a2afce2fb21d1584
68c521eb92c3515e3306f51a7fd3f32d16c97524

Make sure to re-initdb each time.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Thu, Mar 24, 2016 at 4:20 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Sun, Mar 20, 2016 at 3:55 PM, Thom Brown <thom@linux.com> wrote:
>> I've not determined what's causing the following issue, but this is
>> the simplest reproducible test case I've managed to create so far in
>> 9.6 latest git master:
>>
>> postgresql.conf:
>> shared_buffers = 256MB
>>
>> createdb pgbench
>> pgbench -i -s 80 pgbench
>>
>> psql pgbench
>> \pset pager off
>> CREATE TABLE pgbench_accounts_1 (CHECK (bid % 2 = 0)) INHERITS
>> (pgbench_accounts);
>> CREATE TABLE pgbench_accounts_2 (CHECK (bid % 2 = 1)) INHERITS
>> (pgbench_accounts);
>> WITH del AS (DELETE FROM pgbench_accounts WHERE bid % 2 = 0
>> RETURNING *) INSERT INTO pgbench_accounts_1 SELECT * FROM del;
>> WITH del AS (DELETE FROM pgbench_accounts WHERE bid % 2 = 1
>> RETURNING *) INSERT INTO pgbench_accounts_2 SELECT * FROM del;
>> VACUUM ANALYSE;
>> EXPLAIN ANALYSE SELECT count(*) FROM pgbench_accounts;
>>
>> This last statement produces:
>> ERROR:  could not read block 0 in file "base/35160/35173": read only
>> 0 of 8192 bytes
>
> Hmm.  The natural thing to suspect here would be a problem with the
> freeze map changes.  But I tried this and I couldn't reproduce it.
> Any chance you can try this on each of the following commits?
>
> fd31cd265138019dcccc9b5fe53043670898bc9f
> 77a1d1e79892a20ed15a67be42b96949b8546bf6
> a892234f830e832110f63fc0a2afce2fb21d1584
> 68c521eb92c3515e3306f51a7fd3f32d16c97524
>
> Make sure to re-initdb each time.

I am able to reproduce the problem with latest HEAD (473b932) code.

During the VACUUM ANALYSE operation, vacuum truncates the entire table to
0 bytes (because no records are present in the table) and invalidates the
smgr relation.  During the analyse step, an smgr segment with segno = 1 is
also opened, at the location shown in the following backtrace; in this
case the smgr relation is not invalidated.
#0  _mdfd_openseg (reln=0x1b8d790, forknum=MAIN_FORKNUM, segno=1, oflags=0) at md.c:1736
#1  0x00000000007fa032 in _mdfd_getseg (reln=0x1b8d790, forknum=MAIN_FORKNUM, blkno=131147, skipFsync=0 '\000', behavior=EXTENSION_RETURN_NULL) at md.c:1815
#2  0x00000000007f8449 in mdwriteback (reln=0x1b8d790, forknum=MAIN_FORKNUM, blocknum=131147, nblocks=1) at md.c:686
#3  0x00000000007faca3 in smgrwriteback (reln=0x1b8d790, forknum=MAIN_FORKNUM, blocknum=131147, nblocks=1) at smgr.c:663
#4  0x00000000007c5211 in IssuePendingWritebacks (context=0xe70fc0 <BackendWritebackContext>) at bufmgr.c:4112
#5  0x00000000007c5071 in ScheduleBufferTagForWriteback (context=0xe70fc0 <BackendWritebackContext>, tag=0x7ff18deaa180) at bufmgr.c:4041
#6  0x00000000007c0a9d in BufferAlloc (smgr=0x1b8d490, relpersistence=112 'p', forkNum=MAIN_FORKNUM, blockNum=247, strategy=0x1bded48, foundPtr=0x7ffea3f0f34f "") at bufmgr.c:1134
#7  0x00000000007bff58 in ReadBuffer_common (smgr=0x1b8d490, relpersistence=112 'p', forkNum=MAIN_FORKNUM, blockNum=247, mode=RBM_NORMAL, strategy=0x1bded48, hit=0x7ffea3f0f43b "") at bufmgr.c:744
#8  0x00000000007bfd6e in ReadBufferExtended (reln=0x7ff19dc69280, forkNum=MAIN_FORKNUM, blockNum=247, mode=RBM_NORMAL, strategy=0x1bded48) at bufmgr.c:663
#9  0x00000000005e09f7 in acquire_sample_rows (onerel=0x7ff19dc69280, elevel=13, rows=0x1be7f28, targrows=15000, totalrows=0x7ffea3f0f598, totaldeadrows=0x7ffea3f0f590) at analyze.c:1025
#10 0x00000000005e1877 in acquire_inherited_sample_rows (onerel=0x7ff19dc76a78, elevel=13, rows=0x1be7f28, targrows=30000, totalrows=0x7ffea3f0f780, totaldeadrows=0x7ffea3f0f778) at analyze.c:1410
#11 0x00000000005dfa68 in do_analyze_rel (onerel=0x7ff19dc76a78, options=3, params=0x7ffea3f0fab0, va_cols=0x0, acquirefunc=0x5e08ca <acquire_sample_rows>, relpages=0, inh=1 '\001', in_outer_xact=0 '\000', elevel=13) at analyze.c:486
#12 0x00000000005df276 in analyze_rel (relid=16463, relation=0x0, options=3, params=0x7ffea3f0fab0, va_cols=0x0, in_outer_xact=0 '\000', bstrategy=0x1bded48) at analyze.c:264

So further operations on the table use the already constructed smgr
relation and assume that there are RELSEG_SIZE blocks in the first
segment, and try to scan accordingly.  But there are 0 pages in the
table, thus it produces the error.

The issue doesn't occur from another session.  For the same reason, if we
do only a vacuum operation, the error does not occur.

Regards,
Hari Babu
Fujitsu Australia
On Thu, Mar 24, 2016 at 12:59 AM, Haribabu Kommi
<kommi.haribabu@gmail.com> wrote:
> So further operations on the table use the already constructed smgr
> relation and assume that there are RELSEG_SIZE blocks in the first
> segment, and try to scan accordingly.  But there are 0 pages in the
> table, thus it produces the error.
>
> The issue doesn't occur from another session.  For the same reason, if we
> do only a vacuum operation, the error does not occur.

Yeah, I had a suspicion that this might have to do with invalidation
messages based on Thom's description, but I think we still need to
track down which commit is at fault.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Thu, Mar 24, 2016 at 9:40 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Thu, Mar 24, 2016 at 12:59 AM, Haribabu Kommi
> <kommi.haribabu@gmail.com> wrote:
>> So further operations on the table use the already constructed smgr
>> relation and assume that there are RELSEG_SIZE blocks in the first
>> segment, and try to scan accordingly.  But there are 0 pages in the
>> table, thus it produces the error.
>>
>> The issue doesn't occur from another session.  For the same reason, if we
>> do only a vacuum operation, the error does not occur.
>
> Yeah, I had a suspicion that this might have to do with invalidation
> messages based on Thom's description, but I think we still need to
> track down which commit is at fault.

I could reproduce the failure on Linux, not on OSX, and bisecting the
failure, the first bad commit is this one:

commit: 428b1d6b29ca599c5700d4bc4f4ce4c5880369bf
author: Andres Freund <andres@anarazel.de>
date: Thu, 10 Mar 2016 17:04:34 -0800
Allow to trigger kernel writeback after a configurable number of writes.

The failure is a little bit sporadic: based on my tests, 1 or 2 runs out
of 10 could pass, so a good commit was recognized as such only after
passing the SQL sequence sent by Thom 5 times in a row.  I also did some
manual tests, and those point to this commit as well.

I am adding Fabien and Andres in CC for some feedback.
--
Michael
On 25 March 2016 at 12:41, Michael Paquier <michael.paquier@gmail.com> wrote:
> On Thu, Mar 24, 2016 at 9:40 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Thu, Mar 24, 2016 at 12:59 AM, Haribabu Kommi
>> <kommi.haribabu@gmail.com> wrote:
>>> So further operations on the table use the already constructed smgr
>>> relation and assume that there are RELSEG_SIZE blocks in the first
>>> segment, and try to scan accordingly.  But there are 0 pages in the
>>> table, thus it produces the error.
>>>
>>> The issue doesn't occur from another session.  For the same reason, if we
>>> do only a vacuum operation, the error does not occur.
>>
>> Yeah, I had a suspicion that this might have to do with invalidation
>> messages based on Thom's description, but I think we still need to
>> track down which commit is at fault.
>
> I could reproduce the failure on Linux, not on OSX, and bisecting the
> failure, the first bad commit is this one:
> commit: 428b1d6b29ca599c5700d4bc4f4ce4c5880369bf
> author: Andres Freund <andres@anarazel.de>
> date: Thu, 10 Mar 2016 17:04:34 -0800
> Allow to trigger kernel writeback after a configurable number of writes.
>
> The failure is a little bit sporadic: based on my tests, 1 or 2 runs out
> of 10 could pass, so a good commit was recognized as such only after
> passing the SQL sequence sent by Thom 5 times in a row.  I also did some
> manual tests, and those point to this commit as well.

I've had to adapt the test with waits in it to get it to reproduce the
failure more reliably.  I've just scattered it with sleeps:

\pset pager off
select pg_sleep(1);
CREATE TABLE pgbench_accounts_1 (CHECK (bid % 2 = 0)) INHERITS (pgbench_accounts);
select pg_sleep(1);
CREATE TABLE pgbench_accounts_2 (CHECK (bid % 2 = 1)) INHERITS (pgbench_accounts);
select pg_sleep(1);
WITH del AS (DELETE FROM pgbench_accounts WHERE bid % 2 = 0 RETURNING *) INSERT INTO pgbench_accounts_1 SELECT * FROM del;
select pg_sleep(1);
WITH del AS (DELETE FROM pgbench_accounts WHERE bid % 2 = 1 RETURNING *) INSERT INTO pgbench_accounts_2 SELECT * FROM del;
select pg_sleep(1);
VACUUM ANALYSE;
select pg_sleep(3);
EXPLAIN ANALYSE SELECT count(*) FROM pgbench_accounts;

But this brings me to the same commit you found.

Thom
On Fri, Mar 25, 2016 at 8:41 AM, Michael Paquier
<michael.paquier@gmail.com> wrote:
> On Thu, Mar 24, 2016 at 9:40 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Thu, Mar 24, 2016 at 12:59 AM, Haribabu Kommi
>> <kommi.haribabu@gmail.com> wrote:
>>> So further operations on the table use the already constructed smgr
>>> relation and assume that there are RELSEG_SIZE blocks in the first
>>> segment, and try to scan accordingly.  But there are 0 pages in the
>>> table, thus it produces the error.
>>>
>>> The issue doesn't occur from another session.  For the same reason, if we
>>> do only a vacuum operation, the error does not occur.
>>
>> Yeah, I had a suspicion that this might have to do with invalidation
>> messages based on Thom's description, but I think we still need to
>> track down which commit is at fault.
>
> I could reproduce the failure on Linux, not on OSX, and bisecting the
> failure, the first bad commit is this one:
> commit: 428b1d6b29ca599c5700d4bc4f4ce4c5880369bf
> author: Andres Freund <andres@anarazel.de>
> date: Thu, 10 Mar 2016 17:04:34 -0800
> Allow to trigger kernel writeback after a configurable number of writes.
>
> The failure is a little bit sporadic: based on my tests, 1 or 2 runs out
> of 10 could pass, so a good commit was recognized as such only after
> passing the SQL sequence sent by Thom 5 times in a row.  I also did some
> manual tests, and those point to this commit as well.
>
> I am adding Fabien and Andres in CC for some feedback.

Gosh, that's surprising.  I wonder if that just revealed an underlying
issue rather than creating it.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 2016-03-25 12:02:05 -0400, Robert Haas wrote:
> On Fri, Mar 25, 2016 at 8:41 AM, Michael Paquier
> <michael.paquier@gmail.com> wrote:
> > On Thu, Mar 24, 2016 at 9:40 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> >> On Thu, Mar 24, 2016 at 12:59 AM, Haribabu Kommi
> >> <kommi.haribabu@gmail.com> wrote:
> >>> So further operations on the table use the already constructed smgr
> >>> relation and assume that there are RELSEG_SIZE blocks in the first
> >>> segment, and try to scan accordingly.  But there are 0 pages in the
> >>> table, thus it produces the error.
> >>>
> >>> The issue doesn't occur from another session.  For the same reason, if we
> >>> do only a vacuum operation, the error does not occur.
> >>
> >> Yeah, I had a suspicion that this might have to do with invalidation
> >> messages based on Thom's description, but I think we still need to
> >> track down which commit is at fault.
> >
> > I could reproduce the failure on Linux, not on OSX, and bisecting the
> > failure, the first bad commit is this one:
> > commit: 428b1d6b29ca599c5700d4bc4f4ce4c5880369bf
> > author: Andres Freund <andres@anarazel.de>
> > date: Thu, 10 Mar 2016 17:04:34 -0800
> > Allow to trigger kernel writeback after a configurable number of writes.
> >
> > The failure is a little bit sporadic: based on my tests, 1 or 2 runs out
> > of 10 could pass, so a good commit was recognized as such only after
> > passing the SQL sequence sent by Thom 5 times in a row.  I also did some
> > manual tests, and those point to this commit as well.
> >
> > I am adding Fabien and Andres in CC for some feedback.
>
> Gosh, that's surprising.  I wonder if that just revealed an underlying
> issue rather than creating it.

I think that might be the case, but I'm not entirely sure yet.

It appears to me that the current backend - others don't show the
problem - still has the first segment of pgbench_accounts open (in the
md.c mdfdvec sense), likely because there were remaining flush requests.
Thus, when mdnblocks() is called to get the size of the relation, we
return the size of the first segment (131072 blocks) plus the size of
the second segment (0, which doesn't exist).  That then leads to this
error.

I don't really understand yet how the "open segment" thing happens.
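For reference, the size computation described above behaves roughly as
sketched below.  This is a standalone, simplified illustration of the
cached-segment arithmetic, not the actual md.c code: the Seg type and its
fields are invented stand-ins for the private segment chain, and
RELSEG_SIZE here is just the default of 131072 8kB blocks per 1GB segment.

#include <stdint.h>

typedef uint32_t BlockNumber;
#define RELSEG_SIZE 131072          /* default: 1GB segments of 8kB blocks */

typedef struct Seg
{
    struct Seg *next;               /* next cached segment, if one was opened */
    BlockNumber actual_blocks;      /* what an lseek()/stat() would report */
} Seg;

/*
 * Sketch: every cached segment before the last one is assumed to be full
 * (RELSEG_SIZE blocks); only the last segment's real length is checked.
 */
static BlockNumber
sketch_nblocks(const Seg *chain)
{
    BlockNumber segno = 0;
    const Seg  *v = chain;

    while (v->next != NULL)
    {
        segno++;                    /* earlier segment: assumed full */
        v = v->next;
    }
    return segno * (BlockNumber) RELSEG_SIZE + v->actual_blocks;
}

With a stale cache entry for segment 1 still chained behind a segment 0
that has been truncated to zero blocks, this arithmetic yields 131072 + 0
instead of 0, which matches the numbers quoted above.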
On 2016-03-25 12:02:05 -0400, Robert Haas wrote:
> Gosh, that's surprising.  I wonder if that just revealed an underlying
> issue rather than creating it.

I think that's the case; it's just somewhat unlikely to hit in other
cases.

If SMgrRelation->md_fd[n] is an empty relation and mdread() or another
routine asks for a block in the second segment, that will error out.
But even if the first segment is zero bytes, _mdfd_getseg() will
dutifully try to open the second segment, which will succeed in the case
of a truncated relation, because we leave the truncated segment file in
place.

ISTM that _mdfd_getseg() had better check the length of the last segment
before opening the next one?

Regards,

Andres
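In rough terms, and reusing the toy Seg type from the sketch above, the
suggested check could look something like the following.  This is only an
illustration of the idea, not the real _mdfd_getseg() code; in the real
function the "missing" case would be governed by the caller's extension
behaviour (e.g. EXTENSION_RETURN_NULL) rather than a bare NULL.

/*
 * Sketch: before moving on to segment N+1, verify that segment N is
 * actually full.  If it is shorter than RELSEG_SIZE, the relation ends
 * there, so a leftover (truncated) file for segment N+1 must not be
 * treated as part of the relation.
 */
static Seg *
sketch_getseg(Seg *chain, BlockNumber target_block)
{
    BlockNumber target_segno = target_block / RELSEG_SIZE;
    BlockNumber segno = 0;
    Seg        *v = chain;

    while (segno < target_segno)
    {
        if (v->actual_blocks < (BlockNumber) RELSEG_SIZE)
            return NULL;            /* relation ends before the target segment */
        if (v->next == NULL)
            return NULL;            /* sketch: the real code would try to open it */
        v = v->next;
        segno++;
    }
    return v;
}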
On Sat, Mar 26, 2016 at 01:39:42PM +0100, Andres Freund wrote:
> On 2016-03-25 12:02:05 -0400, Robert Haas wrote:
> > Gosh, that's surprising.  I wonder if that just revealed an underlying
> > issue rather than creating it.
>
> I think that's the case; it's just somewhat unlikely to hit in other
> cases.
>
> If SMgrRelation->md_fd[n] is an empty relation and mdread() or another
> routine asks for a block in the second segment, that will error out.
> But even if the first segment is zero bytes, _mdfd_getseg() will
> dutifully try to open the second segment, which will succeed in the case
> of a truncated relation, because we leave the truncated segment file in
> place.
>
> ISTM that _mdfd_getseg() had better check the length of the last segment
> before opening the next one?

[This is a generic notification.]

The above-described topic is currently a PostgreSQL 9.6 open item.
Andres, since you committed the patch believed to have created it, you
own this open item.  If that responsibility lies elsewhere, please let
us know whose responsibility it is to fix this.  Since new open items
may be discovered at any time and I want to plan to have them all fixed
well in advance of the ship date, I will appreciate your efforts toward
speedy resolution.  Please present, within 72 hours, a plan to fix the
defect within seven days of this message.  Thanks.

Non-generic addendum: granted, s/created/unearthed/ is looking more precise.
On Sat, Mar 26, 2016 at 8:39 AM, Andres Freund <andres@anarazel.de> wrote:
> On 2016-03-25 12:02:05 -0400, Robert Haas wrote:
>> Gosh, that's surprising.  I wonder if that just revealed an underlying
>> issue rather than creating it.
>
> I think that's the case; it's just somewhat unlikely to hit in other
> cases.
>
> If SMgrRelation->md_fd[n] is an empty relation and mdread() or another
> routine asks for a block in the second segment, that will error out.
> But even if the first segment is zero bytes, _mdfd_getseg() will
> dutifully try to open the second segment, which will succeed in the case
> of a truncated relation, because we leave the truncated segment file in
> place.
>
> ISTM that _mdfd_getseg() had better check the length of the last segment
> before opening the next one?

Well, I agree that it's pretty strange that _mdfd_getseg() makes no
such check, but I still don't think I understand what's going on here.
Backends shouldn't be requesting nonexistent blocks from a relation -
higher-level safeguards, like holding AccessExclusiveLock before
trying to complete a DROP or TRUNCATE, are supposed to prevent that.
If this patch is causing us to hold onto smgr references to a relation
on which we no longer hold locks, I think that's irretrievably broken
and should be reverted.  I really doubt this will be the only thing
that goes wrong if you do that.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Mon, Apr 04, 2016 at 12:45:25PM -0400, Robert Haas wrote:
> Backends shouldn't be requesting nonexistent blocks from a relation -
> higher-level safeguards, like holding AccessExclusiveLock before
> trying to complete a DROP or TRUNCATE, are supposed to prevent that.
> If this patch is causing us to hold onto smgr references to a relation
> on which we no longer hold locks, I think that's irretrievably broken
> and should be reverted.  I really doubt this will be the only thing
> that goes wrong if you do that.

+1
On 2016-04-04 12:45:25 -0400, Robert Haas wrote:
> Well, I agree that it's pretty strange that _mdfd_getseg() makes no
> such check, but I still don't think I understand what's going on here.
> Backends shouldn't be requesting nonexistent blocks from a relation -
> higher-level safeguards, like holding AccessExclusiveLock before
> trying to complete a DROP or TRUNCATE, are supposed to prevent that.

I don't think that's really true: we write blocks back without holding
any sort of relation-level lock, and thus do _mdfd_getseg()-type
accesses as well.  And we're not really "requesting nonexistent blocks"
- the segments are just opened to get the associated file descriptor,
and they're opened with EXTENSION_RETURN_NULL.  It turns out to be
important performance-wise to reuse fds when triggering kernel
writeback.

> If this patch is causing us to hold onto smgr references to a relation
> on which we no longer hold locks, I think that's irretrievably broken
> and should be reverted.  I really doubt this will be the only thing
> that goes wrong if you do that.

As we have already done that for writes for a long time, I'm a bit
surprised by that statement.

Greetings,

Andres Freund
On Mon, Apr 11, 2016 at 12:50 PM, Andres Freund <andres@anarazel.de> wrote:
> On 2016-04-04 12:45:25 -0400, Robert Haas wrote:
>> Well, I agree that it's pretty strange that _mdfd_getseg() makes no
>> such check, but I still don't think I understand what's going on here.
>> Backends shouldn't be requesting nonexistent blocks from a relation -
>> higher-level safeguards, like holding AccessExclusiveLock before
>> trying to complete a DROP or TRUNCATE, are supposed to prevent that.
>
> I don't think that's really true: we write blocks back without holding
> any sort of relation-level lock, and thus do _mdfd_getseg()-type
> accesses as well.

You're right, but I think that's more because I didn't say it
correctly than because you haven't done something novel.  DROP and
relation truncation know about shared buffers, and they clear out any
blocks that might be affected as part of the operation, which means
that no other backend will see them after they are gone.  The lock
makes sure that no other references can be added while we're busy
removing any that are already there.  So I think that there is
currently an invariant that any block we are attempting to access
should actually still exist.

It sounds like these references are sticking around in backend-private
memory, which means they are neither protected by locks nor able to be
cleared out on drop or truncate.  I think that's a new thing, and a
bit scary.

> And we're not really "requesting nonexistent blocks" - the segments are
> just opened to get the associated file descriptor, and they're opened
> with EXTENSION_RETURN_NULL.  It turns out to be important
> performance-wise to reuse fds when triggering kernel writeback.

The possibly-saving grace here, I suppose, is that the references
we're worried about are just being used to issue hints to the
operating system.  So I guess if we send a hint on a wrong block or
skip sending a hint altogether because of some failure, no harm done,
as long as we don't error out.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 2016-04-11 13:04:48 -0400, Robert Haas wrote:
> You're right, but I think that's more because I didn't say it
> correctly than because you haven't done something novel.

Could be.

> DROP and relation truncation know about shared buffers, and they clear
> out any blocks that might be affected as part of the operation, which
> means that no other backend will see them after they are gone.  The
> lock makes sure that no other references can be added while we're busy
> removing any that are already there.  So I think that there is
> currently an invariant that any block we are attempting to access
> should actually still exist.

Note that we're not actually accessing any blocks, we're just opening a
segment to get the associated file descriptor.

> It sounds like these references are sticking around in backend-private
> memory, which means they are neither protected by locks nor able to be
> cleared out on drop or truncate.  I think that's a new thing, and a
> bit scary.

True.  But how else would you batch flush requests in a sorted manner,
without re-opening file descriptors?  And that's pretty essential for
performance.

I can think of a number of relatively easy ways to address this:
1) Just zap (or issue?) all pending flush requests when getting an
   smgrinval/smgrclosenode
2) Do 1), but filter for the closed relnode
3) Actually handle the case of the last open segment not being
   RELSEG_SIZE properly in _mdfd_getseg() - mdnblocks() does so.

I'm kind of inclined to do both 3) and 1).

> The possibly-saving grace here, I suppose, is that the references
> we're worried about are just being used to issue hints to the
> operating system.

Indeed.

> So I guess if we send a hint on a wrong block or
> skip sending a hint altogether because of some failure, no harm done,
> as long as we don't error out.

Which the writeback code is careful not to do; afaics it's just the
"already open segment" issue making problems here.

- Andres
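As an illustration only, options 1) and 2) boil down to something like
the following.  This is a toy sketch, not the actual bufmgr/writeback
code; the structure and field names (PendingFlush, FlushQueue, a plain
integer relfilenode) are invented stand-ins for the real
pending-writeback state.

/*
 * Toy sketch of "zap pending flush requests on smgr invalidation".
 * Option 1: drop everything (pass relfilenode = 0).
 * Option 2: drop only the requests that target the invalidated relation.
 */
typedef struct PendingFlush
{
    unsigned int relfilenode;       /* stand-in for the full buffer tag */
    unsigned int blocknum;
} PendingFlush;

typedef struct FlushQueue
{
    int          nr_pending;
    PendingFlush pending[128];
} FlushQueue;

static void
sketch_forget_pending_flushes(FlushQueue *q, unsigned int relfilenode)
{
    int i;
    int keep = 0;

    for (i = 0; i < q->nr_pending; i++)
    {
        /* keep requests for other relations unless we're dropping everything */
        if (relfilenode != 0 && q->pending[i].relfilenode != relfilenode)
            q->pending[keep++] = q->pending[i];
    }
    q->nr_pending = keep;
}

Calling such a routine from the smgr invalidation path would ensure that
no queued hint can later be issued against a segment that has been
truncated or dropped in the meantime.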
Hello Andres,

> I can think of a number of relatively easy ways to address this:
> 1) Just zap (or issue?) all pending flush requests when getting an
>    smgrinval/smgrclosenode
> 2) Do 1), but filter for the closed relnode
> 3) Actually handle the case of the last open segment not being
>    RELSEG_SIZE properly in _mdfd_getseg() - mdnblocks() does so.
>
> I'm kind of inclined to do both 3) and 1).

Alas, I'm a little bit outside my area of expertise.  Just for
suggestion purposes, and possibly totally wrong (please accept my
apologies if this is the case), here are the ideas I had while thinking
about these issues; some may be quite close to your suggestions above:

- Keep track of file/segment closings/reopenings (say with a counter),
  so that if a flush is about a file/segment which has been closed or
  reopened, it is just skipped.  I'm not sure this is enough, because
  one process may do the file cleaning and another may want to flush,
  although I guess there is some kind of locking/shared data structure
  to manage such interactions between pg processes.

- Because of the "empty the bucket when filled" behaviour of the current
  implementation, a buffer to be flushed may be kept for a very long
  time in the bucket.  I think that flushing advice becomes stale after
  a while and should not be issued (the buffer may have been written
  again, ...), or the bucket should be flushed after a while even if not
  full.

Also, a detail in pg_flush_data: there is a series of three if/endif
blocks, but it is not obvious to me from the source whether they are
mutually exclusive, so I was wondering whether the code could try
several flushing approaches one after the other.  This is probably not
the case, but I would be more at ease with an if/elif/elif/endif
structure there, so that it is clearer.

--
Fabien.
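The first idea above - skipping stale flush advice via a counter - could
be sketched roughly as follows.  This is purely illustrative and assumes
nothing about the real writeback code; TrackedFile, FlushAdvice and the
generation field are invented names.

/*
 * Toy sketch of the "reopen counter" idea: advice queued against an
 * earlier incarnation of the file is recognised and skipped instead of
 * being issued.
 */
typedef struct TrackedFile
{
    int          fd;
    unsigned int generation;        /* bumped on every close/reopen */
} TrackedFile;

typedef struct FlushAdvice
{
    TrackedFile *file;
    unsigned int generation;        /* captured when the advice was queued */
    long         offset;
    long         nbytes;
} FlushAdvice;

static void
sketch_issue_flush(const FlushAdvice *adv)
{
    /* The file has been closed or reopened since this was queued: skip. */
    if (adv->generation != adv->file->generation)
        return;

    /*
     * Otherwise the hint would be issued here, e.g. via sync_file_range()
     * or posix_fadvise(), depending on the platform.
     */
}

In a multi-process setting the counter (and the tracked-file entry it
lives in) would of course have to be kept in shared state, which is the
caveat raised above.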
On Mon, Apr 11, 2016 at 1:17 PM, Andres Freund <andres@anarazel.de> wrote:
> I can think of a number of relatively easy ways to address this:
> 1) Just zap (or issue?) all pending flush requests when getting an
>    smgrinval/smgrclosenode
> 2) Do 1), but filter for the closed relnode
> 3) Actually handle the case of the last open segment not being
>    RELSEG_SIZE properly in _mdfd_getseg() - mdnblocks() does so.
>
> I'm kind of inclined to do both 3) and 1).

#3 seems like it's probably about 15 years overdue, so let's do that
anyway.

I don't quite understand why #1 fixes the problem - not because I
doubt you, but just because I haven't studied it much.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Tue, Apr 12, 2016 at 09:00:57AM -0400, Robert Haas wrote:
> On Mon, Apr 11, 2016 at 1:17 PM, Andres Freund <andres@anarazel.de> wrote:
> > I can think of a number of relatively easy ways to address this:
> > 1) Just zap (or issue?) all pending flush requests when getting an
> >    smgrinval/smgrclosenode
> > 2) Do 1), but filter for the closed relnode
> > 3) Actually handle the case of the last open segment not being
> >    RELSEG_SIZE properly in _mdfd_getseg() - mdnblocks() does so.
> >
> > I'm kind of inclined to do both 3) and 1).

Andres, you own the underlying open item, right?  Do you plan to
implement this fix you have outlined?  If so, when?

> #3 seems like it's probably about 15 years overdue, so let's do that
> anyway.
>
> I don't quite understand why #1 fixes the problem - not because I
> doubt you, but just because I haven't studied it much.
On 2016-04-16 18:23:01 -0400, Noah Misch wrote:
> On Tue, Apr 12, 2016 at 09:00:57AM -0400, Robert Haas wrote:
> > On Mon, Apr 11, 2016 at 1:17 PM, Andres Freund <andres@anarazel.de> wrote:
> > > I can think of a number of relatively easy ways to address this:
> > > 1) Just zap (or issue?) all pending flush requests when getting an
> > >    smgrinval/smgrclosenode
> > > 2) Do 1), but filter for the closed relnode
> > > 3) Actually handle the case of the last open segment not being
> > >    RELSEG_SIZE properly in _mdfd_getseg() - mdnblocks() does so.
> > >
> > > I'm kind of inclined to do both 3) and 1).
>
> Andres, you own the underlying open item, right?

Right.

> Do you plan to implement this fix you have outlined?  If so, when?

Yes, I plan to next week.

- Andres
On Sat, Apr 16, 2016 at 03:28:23PM -0700, Andres Freund wrote:
> On 2016-04-16 18:23:01 -0400, Noah Misch wrote:
> > On Tue, Apr 12, 2016 at 09:00:57AM -0400, Robert Haas wrote:
> > > On Mon, Apr 11, 2016 at 1:17 PM, Andres Freund <andres@anarazel.de> wrote:
> > > > I can think of a number of relatively easy ways to address this:
> > > > 1) Just zap (or issue?) all pending flush requests when getting an
> > > >    smgrinval/smgrclosenode
> > > > 2) Do 1), but filter for the closed relnode
> > > > 3) Actually handle the case of the last open segment not being
> > > >    RELSEG_SIZE properly in _mdfd_getseg() - mdnblocks() does so.
> > > >
> > > > I'm kind of inclined to do both 3) and 1).
> >
> > Andres, you own the underlying open item, right?
>
> Right.
>
> > Do you plan to implement this fix you have outlined?  If so, when?
>
> Yes, I plan to next week.

Completion by the end of next week will work.  Thanks.  At that time
(2016-04-24T00:00 UTC), this would otherwise have spent twenty-three
days as an open item.  Would you inform this thread if you discover
reasons to think your plan will no longer succeed?