Thread: Breakage with VACUUM ANALYSE + partitions

Breakage with VACUUM ANALYSE + partitions

From
Thom Brown
Date:
Hi,

I've not determined what's causing the following issue, but this is
the simplest reproducible test case I've managed to create so far in
9.6 latest git master:

postgresql.conf:
  shared_buffers = 256MB

createdb pgbench
pgbench -i -s 80 pgbench

psql pgbench
  \pset pager off
  CREATE TABLE pgbench_accounts_1 (CHECK (bid % 2 = 0)) INHERITS
(pgbench_accounts);
  CREATE TABLE pgbench_accounts_2 (CHECK (bid % 2 = 1)) INHERITS
(pgbench_accounts);
  WITH del AS (DELETE FROM pgbench_accounts WHERE bid % 2 = 0
RETURNING *) INSERT INTO pgbench_accounts_1 SELECT * FROM del;
  WITH del AS (DELETE FROM pgbench_accounts WHERE bid % 2 = 1
RETURNING *) INSERT INTO pgbench_accounts_2 SELECT * FROM del;
  VACUUM ANALYSE;
  EXPLAIN ANALYSE SELECT count(*) FROM pgbench_accounts;

This last statement produces:
  ERROR:  could not read block 0 in file "base/35160/35173": read only
0 of 8192 bytes

Looking at the file, I get:

  File: ‘35173’
  Size: 0             Blocks: 0          IO Block: 4096   regular empty file
Device: 801h/2049d    Inode: 4504115     Links: 1
Access: (0600/-rw-------)  Uid: (1000001/    thom)   Gid: (1000001/    thom)
Access: 2016-03-20 19:37:50.433485085 +0000
Modify: 2016-03-20 19:39:49.422959222 +0000
Change: 2016-03-20 19:39:49.422959222 +0000
 Birth: -

This is the same file info I get if I just do a regular VACUUM
instead, but that doesn't result in the error.  And if I follow that
up with a VACUUM ANALYSE, the error still doesn't occur.

The base directory also shows this listing for files named after the
affected filenode:

-rw------- 1 thom thom     0 Mar 20 19:39 35173
-rw------- 1 thom thom     0 Mar 20 19:39 35173.1
-rw------- 1 thom thom 16384 Mar 20 19:39 35173_fsm
-rw------- 1 thom thom     0 Mar 20 19:39 35173_vm

After I see the error, I get the same error again if I then try to run
plain VACUUM.

The oid 35173 shown in this error message references the parent
pgbench_accounts table.

If I reduce the scale factor to 50, or increase shared_buffers to
512MB, there's no error.  If, instead of VACUUM ANALYSE, I just use
VACUUM, or just ANALYSE,  the error doesn't happen.

System details:
OS: Linux Mint Debian Edition (Wheezy 7.0)
Linux Kernel: 3.11.8
RAM: 16GB
Filesystem: ext4 (rw,noatime,discard,errors=remount-ro,data=ordered)

Regards,

Thom

Re: Breakage with VACUUM ANALYSE + partitions

From
Robert Haas
Date:
On Sun, Mar 20, 2016 at 3:55 PM, Thom Brown <thom@linux.com> wrote:
> I've not determined what's causing the following issue, but this is
> the simplest reproducible test case I've managed to create so far in
> 9.6 latest git master:
>
> postgresql.conf:
>   shared_buffers = 256MB
>
> createdb pgbench
> pgbench -i -s 80 pgbench
>
> psql pgbench
>   \pset pager off
>   CREATE TABLE pgbench_accounts_1 (CHECK (bid % 2 = 0)) INHERITS
> (pgbench_accounts);
>   CREATE TABLE pgbench_accounts_2 (CHECK (bid % 2 = 1)) INHERITS
> (pgbench_accounts);
>   WITH del AS (DELETE FROM pgbench_accounts WHERE bid % 2 = 0
> RETURNING *) INSERT INTO pgbench_accounts_1 SELECT * FROM del;
>   WITH del AS (DELETE FROM pgbench_accounts WHERE bid % 2 = 1
> RETURNING *) INSERT INTO pgbench_accounts_2 SELECT * FROM del;
>   VACUUM ANALYSE;
>   EXPLAIN ANALYSE SELECT count(*) FROM pgbench_accounts;
>
> This last statement produces:
>   ERROR:  could not read block 0 in file "base/35160/35173": read only
> 0 of 8192 bytes

Hmm.  The natural thing to suspect here would be a problem with the
freeze map changes.  But I tried this and I couldn't reproduce it.
Any chance you can try this on each of the following commits?

fd31cd265138019dcccc9b5fe53043670898bc9f
77a1d1e79892a20ed15a67be42b96949b8546bf6
a892234f830e832110f63fc0a2afce2fb21d1584
68c521eb92c3515e3306f51a7fd3f32d16c97524

Make sure to re-initdb each time.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Breakage with VACUUM ANALYSE + partitions

From
Haribabu Kommi
Date:
On Thu, Mar 24, 2016 at 4:20 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Sun, Mar 20, 2016 at 3:55 PM, Thom Brown <thom@linux.com> wrote:
>> I've not determined what's causing the following issue, but this is
>> the simplest reproducible test case I've managed to create so far in
>> 9.6 latest git master:
>>
>> postgresql.conf:
>>   shared_buffers = 256MB
>>
>> createdb pgbench
>> pgbench -i -s 80 pgbench
>>
>> psql pgbench
>>   \pset pager off
>>   CREATE TABLE pgbench_accounts_1 (CHECK (bid % 2 = 0)) INHERITS
>> (pgbench_accounts);
>>   CREATE TABLE pgbench_accounts_2 (CHECK (bid % 2 = 1)) INHERITS
>> (pgbench_accounts);
>>   WITH del AS (DELETE FROM pgbench_accounts WHERE bid % 2 = 0
>> RETURNING *) INSERT INTO pgbench_accounts_1 SELECT * FROM del;
>>   WITH del AS (DELETE FROM pgbench_accounts WHERE bid % 2 = 1
>> RETURNING *) INSERT INTO pgbench_accounts_2 SELECT * FROM del;
>>   VACUUM ANALYSE;
>>   EXPLAIN ANALYSE SELECT count(*) FROM pgbench_accounts;
>>
>> This last statement produces:
>>   ERROR:  could not read block 0 in file "base/35160/35173": read only
>> 0 of 8192 bytes
>
> Hmm.  The natural thing to suspect here would be a problem with the
> freeze map changes.  But I tried this and I couldn't reproduce it.
> Any chance you can try this on each of the following commits?
>
> fd31cd265138019dcccc9b5fe53043670898bc9f
> 77a1d1e79892a20ed15a67be42b96949b8546bf6
> a892234f830e832110f63fc0a2afce2fb21d1584
> 68c521eb92c3515e3306f51a7fd3f32d16c97524
>
> Make sure to re-initdb each time.

I am able to reproduce the problem with the latest HEAD code (473b932).

During the vacuum part of the VACUUM ANALYSE operation, vacuum truncates
the entire table to 0 bytes, because no records are present in the table,
and invalidates the smgr relation.

During the analyse part, the smgr relation is constructed again, with
segno = 1, at the following location. In this case the smgr relation is
not invalidated:

#0  _mdfd_openseg (reln=0x1b8d790, forknum=MAIN_FORKNUM, segno=1,
oflags=0) at md.c:1736
#1  0x00000000007fa032 in _mdfd_getseg (reln=0x1b8d790,
forknum=MAIN_FORKNUM, blkno=131147, skipFsync=0 '\000',
behavior=EXTENSION_RETURN_NULL) at md.c:1815
#2  0x00000000007f8449 in mdwriteback (reln=0x1b8d790,
forknum=MAIN_FORKNUM, blocknum=131147, nblocks=1) at md.c:686
#3  0x00000000007faca3 in smgrwriteback (reln=0x1b8d790,
forknum=MAIN_FORKNUM, blocknum=131147, nblocks=1) at smgr.c:663
#4  0x00000000007c5211 in IssuePendingWritebacks (context=0xe70fc0
<BackendWritebackContext>) at bufmgr.c:4112
#5  0x00000000007c5071 in ScheduleBufferTagForWriteback
(context=0xe70fc0 <BackendWritebackContext>, tag=0x7ff18deaa180) at
bufmgr.c:4041
#6  0x00000000007c0a9d in BufferAlloc (smgr=0x1b8d490,
relpersistence=112 'p', forkNum=MAIN_FORKNUM, blockNum=247,
strategy=0x1bded48, foundPtr=0x7ffea3f0f34f "") at bufmgr.c:1134
#7  0x00000000007bff58 in ReadBuffer_common (smgr=0x1b8d490,
relpersistence=112 'p', forkNum=MAIN_FORKNUM, blockNum=247,
mode=RBM_NORMAL, strategy=0x1bded48, hit=0x7ffea3f0f43b "")
    at bufmgr.c:744
#8  0x00000000007bfd6e in ReadBufferExtended (reln=0x7ff19dc69280,
forkNum=MAIN_FORKNUM, blockNum=247, mode=RBM_NORMAL,
strategy=0x1bded48) at bufmgr.c:663
#9  0x00000000005e09f7 in acquire_sample_rows (onerel=0x7ff19dc69280,
elevel=13, rows=0x1be7f28, targrows=15000, totalrows=0x7ffea3f0f598,
totaldeadrows=0x7ffea3f0f590)
    at analyze.c:1025
#10 0x00000000005e1877 in acquire_inherited_sample_rows
(onerel=0x7ff19dc76a78, elevel=13, rows=0x1be7f28, targrows=30000,
totalrows=0x7ffea3f0f780, totaldeadrows=0x7ffea3f0f778)
    at analyze.c:1410
#11 0x00000000005dfa68 in do_analyze_rel (onerel=0x7ff19dc76a78,
options=3, params=0x7ffea3f0fab0, va_cols=0x0, acquirefunc=0x5e08ca
<acquire_sample_rows>, relpages=0, inh=1 '\001',
    in_outer_xact=0 '\000', elevel=13) at analyze.c:486
#12 0x00000000005df276 in analyze_rel (relid=16463, relation=0x0,
options=3, params=0x7ffea3f0fab0, va_cols=0x0, in_outer_xact=0 '\000',
bstrategy=0x1bded48) at analyze.c:264


So further operations on the table use the already-constructed smgr relation,
assume that there are RELSEG_SIZE blocks in the relation, and try
to scan it. But there are 0 pages in the table, so it produces the error.

The issue doesn't occur from another session. For the same reason,
the error does not occur if we do only a plain vacuum operation.

Regards,
Hari Babu
Fujitsu Australia

Re: Breakage with VACUUM ANALYSE + partitions

From
Robert Haas
Date:
On Thu, Mar 24, 2016 at 12:59 AM, Haribabu Kommi
<kommi.haribabu@gmail.com> wrote:
> So further operations on the table uses the already constructed smgr relation
> and treats that there are RELSEG_SIZE number of blocks in the page and try
> to do the scan. But there are 0 pages in the table thus it produces the error.
>
> The issue doesn't occur from another session. Because of this reason only
> if we do only vacuum operation, the error not occurred.

Yeah, I had a suspicion that this might have to do with invalidation
messages based on Thom's description, but I think we still need to
track down which commit is at fault.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Breakage with VACUUM ANALYSE + partitions

From
Michael Paquier
Date:
On Thu, Mar 24, 2016 at 9:40 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Thu, Mar 24, 2016 at 12:59 AM, Haribabu Kommi
> <kommi.haribabu@gmail.com> wrote:
>> So further operations on the table uses the already constructed smgr relation
>> and treats that there are RELSEG_SIZE number of blocks in the page and try
>> to do the scan. But there are 0 pages in the table thus it produces the error.
>>
>> The issue doesn't occur from another session. Because of this reason only
>> if we do only vacuum operation, the error not occurred.
>
> Yeah, I had a suspicion that this might have to do with invalidation
> messages based on Thom's description, but I think we still need to
> track down which commit is at fault.

I could reproduce the failure on Linux, not on OSX, and bisecting the
failure, the first bad commit is this one:
commit: 428b1d6b29ca599c5700d4bc4f4ce4c5880369bf
author: Andres Freund <andres@anarazel.de>
date: Thu, 10 Mar 2016 17:04:34 -0800
Allow to trigger kernel writeback after a configurable number of writes.

The failure is a little bit sporadic; based on my tests, 1 or 2 runs out
of 10 could pass, so a commit was only recognized as good after
passing the SQL sequence sent by Thom 5 times in a row. I also did
some manual tests, and those point to this commit as well.

I am adding Fabien and Andres in CC for some feedback.
--
Michael

Re: Breakage with VACUUM ANALYSE + partitions

From
Thom Brown
Date:
On 25 March 2016 at 12:41, Michael Paquier <michael.paquier@gmail.com> wrote:
> On Thu, Mar 24, 2016 at 9:40 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Thu, Mar 24, 2016 at 12:59 AM, Haribabu Kommi
>> <kommi.haribabu@gmail.com> wrote:
>>> So further operations on the table uses the already constructed smgr relation
>>> and treats that there are RELSEG_SIZE number of blocks in the page and try
>>> to do the scan. But there are 0 pages in the table thus it produces the error.
>>>
>>> The issue doesn't occur from another session. Because of this reason only
>>> if we do only vacuum operation, the error not occurred.
>>
>> Yeah, I had a suspicion that this might have to do with invalidation
>> messages based on Thom's description, but I think we still need to
>> track down which commit is at fault.
>
> I could reproduce the failure on Linux, not on OSX, and bisecting the
> failure, the first bad commit is this one:
> commit: 428b1d6b29ca599c5700d4bc4f4ce4c5880369bf
> author: Andres Freund <andres@anarazel.de>
> date: Thu, 10 Mar 2016 17:04:34 -0800
> Allow to trigger kernel writeback after a configurable number of writes.
>
> The failure is a little bit sporadic, based on my tests 1/2 runs out
> of 10 could pass, so one good commit was recognized as such after
> passing the SQL sequence sent by Thom 5 times in a row. I also did
> some manual tests and those are pointing to this commit as well.

I've had to adapt the test by adding waits to get it to work more
reliably.  I've just scattered it with sleeps:

\pset pager off
select pg_sleep(1);
CREATE TABLE pgbench_accounts_1 (CHECK (bid % 2 = 0)) INHERITS
(pgbench_accounts);
select pg_sleep(1);
CREATE TABLE pgbench_accounts_2 (CHECK (bid % 2 = 1)) INHERITS
(pgbench_accounts);
select pg_sleep(1);
WITH del AS (DELETE FROM pgbench_accounts WHERE bid % 2 = 0 RETURNING
*) INSERT INTO pgbench_accounts_1 SELECT * FROM del;
select pg_sleep(1);
WITH del AS (DELETE FROM pgbench_accounts WHERE bid % 2 = 1 RETURNING
*) INSERT INTO pgbench_accounts_2 SELECT * FROM del;
select pg_sleep(1);
VACUUM ANALYSE;
select pg_sleep(3);
EXPLAIN ANALYSE SELECT count(*) FROM pgbench_accounts;

But this brings me to the same commit you found.

Thom

Re: Breakage with VACUUM ANALYSE + partitions

From
Robert Haas
Date:
On Fri, Mar 25, 2016 at 8:41 AM, Michael Paquier
<michael.paquier@gmail.com> wrote:
> On Thu, Mar 24, 2016 at 9:40 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Thu, Mar 24, 2016 at 12:59 AM, Haribabu Kommi
>> <kommi.haribabu@gmail.com> wrote:
>>> So further operations on the table uses the already constructed smgr relation
>>> and treats that there are RELSEG_SIZE number of blocks in the page and try
>>> to do the scan. But there are 0 pages in the table thus it produces the error.
>>>
>>> The issue doesn't occur from another session. Because of this reason only
>>> if we do only vacuum operation, the error not occurred.
>>
>> Yeah, I had a suspicion that this might have to do with invalidation
>> messages based on Thom's description, but I think we still need to
>> track down which commit is at fault.
>
> I could reproduce the failure on Linux, not on OSX, and bisecting the
> failure, the first bad commit is this one:
> commit: 428b1d6b29ca599c5700d4bc4f4ce4c5880369bf
> author: Andres Freund <andres@anarazel.de>
> date: Thu, 10 Mar 2016 17:04:34 -0800
> Allow to trigger kernel writeback after a configurable number of writes.
>
> The failure is a little bit sporadic, based on my tests 1/2 runs out
> of 10 could pass, so one good commit was recognized as such after
> passing the SQL sequence sent by Thom 5 times in a row. I also did
> some manual tests and those are pointing to this commit as well.
>
> I am adding Fabien and Andres in CC for some feedback.

Gosh, that's surprising.  I wonder if that just revealed an underlying
issue rather than creating it.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Breakage with VACUUM ANALYSE + partitions

From
Andres Freund
Date:
On 2016-03-25 12:02:05 -0400, Robert Haas wrote:
> On Fri, Mar 25, 2016 at 8:41 AM, Michael Paquier
> <michael.paquier@gmail.com> wrote:
> > On Thu, Mar 24, 2016 at 9:40 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> >> On Thu, Mar 24, 2016 at 12:59 AM, Haribabu Kommi
> >> <kommi.haribabu@gmail.com> wrote:
> >>> So further operations on the table uses the already constructed smgr relation
> >>> and treats that there are RELSEG_SIZE number of blocks in the page and try
> >>> to do the scan. But there are 0 pages in the table thus it produces the error.
> >>>
> >>> The issue doesn't occur from another session. Because of this reason only
> >>> if we do only vacuum operation, the error not occurred.
> >>
> >> Yeah, I had a suspicion that this might have to do with invalidation
> >> messages based on Thom's description, but I think we still need to
> >> track down which commit is at fault.
> >
> > I could reproduce the failure on Linux, not on OSX, and bisecting the
> > failure, the first bad commit is this one:
> > commit: 428b1d6b29ca599c5700d4bc4f4ce4c5880369bf
> > author: Andres Freund <andres@anarazel.de>
> > date: Thu, 10 Mar 2016 17:04:34 -0800
> > Allow to trigger kernel writeback after a configurable number of writes.
> >
> > The failure is a little bit sporadic, based on my tests 1/2 runs out
> > of 10 could pass, so one good commit was recognized as such after
> > passing the SQL sequence sent by Thom 5 times in a row. I also did
> > some manual tests and those are pointing to this commit as well.
> >
> > I am adding Fabien and Andres in CC for some feedback.
>
> Gosh, that's surprising.  I wonder if that just revealed an underlying
> issue rather than creating it.

I think that might be the case, but I'm not entirely sure yet. It
appears to me that the current backend - others don't show the problem -
still has the first segment of pgbench_accounts open (in the md.c
mdfdvec sense); likely because there were remaining flush
requests. Thus, when mdnblocks is called to get the size of the relation
we return the size of the first segment (131072) plus the size of the
second segment (0, doesn't exist). That then leads to this error.

I don't really understand yet how the "open segment" thing happens.

Re: Breakage with VACUUM ANALYSE + partitions

From
Andres Freund
Date:
On 2016-03-25 12:02:05 -0400, Robert Haas wrote:
> Gosh, that's surprising.  I wonder if that just revealed an underlying
> issue rather than creating it.

I think that's the case; it's just somewhat unlikely to hit in other
cases.

If SMgrRelation->md_fd[n] is an empty relation and mdread() or another
routine asks for a block in the second segment, that will error
out. But even if the first segment is zero bytes, _mdfd_getseg() will
dutifully try to open the second segment, which will succeed in the case
of a truncated relation, because we leave the truncated segment in
place.

ISTM that _mdfd_getseg had better check the length of the last segment
before opening the next one?

Regards,

Andres

Re: Breakage with VACUUM ANALYSE + partitions

From
Noah Misch
Date:
On Sat, Mar 26, 2016 at 01:39:42PM +0100, Andres Freund wrote:
> On 2016-03-25 12:02:05 -0400, Robert Haas wrote:
> > Gosh, that's surprising.  I wonder if that just revealed an underlying
> > issue rather than creating it.
> 
> I think that's the case; it's just somewhat unlikely to hit in other
> cases.
> 
> If SMgrRelation->md_fd[n] is an empty relation, and mdread() or another
> routine is asking for a block in the second segment - which will error
> out. But even if the first segment is zero bytes, _mdfd_getseg() will
> dutifully try to open the second segment. Which will succeed in the case
> of a truncated relation, because we leave the truncated segment in
> place.
> 
> ISTM that _mdfd_getseg better check the length of the last segment
> before opening the next one?

[This is a generic notification.]

The above-described topic is currently a PostgreSQL 9.6 open item.  Andres,
since you committed the patch believed to have created it, you own this open
item.  If that responsibility lies elsewhere, please let us know whose
responsibility it is to fix this.  Since new open items may be discovered at
any time and I want to plan to have them all fixed well in advance of the ship
date, I will appreciate your efforts toward speedy resolution.  Please
present, within 72 hours, a plan to fix the defect within seven days of this
message.  Thanks.

Non-generic addendum: granted, s/created/unearthed/ is looking more precise.



Re: Breakage with VACUUM ANALYSE + partitions

From
Robert Haas
Date:
On Sat, Mar 26, 2016 at 8:39 AM, Andres Freund <andres@anarazel.de> wrote:
> On 2016-03-25 12:02:05 -0400, Robert Haas wrote:
>> Gosh, that's surprising.  I wonder if that just revealed an underlying
>> issue rather than creating it.
>
> I think that's the case; it's just somewhat unlikely to hit in other
> cases.
>
> If SMgrRelation->md_fd[n] is an empty relation, and mdread() or another
> routine is asking for a block in the second segment - which will error
> out. But even if the first segment is zero bytes, _mdfd_getseg() will
> dutifully try to open the second segment. Which will succeed in the case
> of a truncated relation, because we leave the truncated segment in
> place.
>
> ISTM that _mdfd_getseg better check the length of the last segment
> before opening the next one?

Well, I agree that it's pretty strange that _mdfd_getseg() makes no
such check, but I still don't think I understand what's going on here.
Backends shouldn't be requesting nonexistent blocks from a relation -
higher-level safeguards, like holding AccessExclusiveLock before
trying to complete a DROP or TRUNCATE - are supposed to prevent that.
If this patch is causing us to hold onto smgr references to a relation
on which we no longer hold locks, I think that's irretrievably broken
and should be reverted.  I really doubt this will be the only thing
that goes wrong if you do that.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Breakage with VACUUM ANALYSE + partitions

From
Noah Misch
Date:
On Mon, Apr 04, 2016 at 12:45:25PM -0400, Robert Haas wrote:
> Backends shouldn't be requesting nonexistent blocks from a relation -
> higher-level safeguards, like holding AccessExclusiveLock before
> trying to complete a DROP or TRUNCATE - are supposed to prevent that.
> If this patch is causing us to hold onto smgr references to a relation
> on which we no longer hold locks, I think that's irretrievably broken
> and should be reverted.  I really doubt this will be the only thing
> that goes wrong if you do that.

+1



Re: [HACKERS] Breakage with VACUUM ANALYSE + partitions

From
Andres Freund
Date:
On 2016-04-04 12:45:25 -0400, Robert Haas wrote:
> Well, I agree that it's pretty strange that _mdfd_getseg() makes no
> such check, but I still don't think I understand what's going on here.
> Backends shouldn't be requesting nonexistent blocks from a relation -
> higher-level safeguards, like holding AccessExclusiveLock before
> trying to complete a DROP or TRUNCATE - are supposed to prevent that.

I don't think that's really true: We write blocks back without holding
any sort of relation level locks; and thus do _mdfd_getseg() type
accesses as well.

And we're not really "requesting nonexistent blocks" - the segments are
just opened to get the associated file descriptor, and they're opened
with EXTENSION_RETURN_NULL. It turns out to be important
performance-wise to reuse fd's when triggering kernel writeback.

> If this patch is causing us to hold onto smgr references to a relation
> on which we no longer hold locks, I think that's irretrievably broken
> and should be reverted.  I really doubt this will be the only thing
> that goes wrong if you do that.

As we already have done that for writes for a long time, I'm a bit
surprised about that statement.

Greetings,

Andres Freund



Re: [HACKERS] Breakage with VACUUM ANALYSE + partitions

From
Robert Haas
Date:
On Mon, Apr 11, 2016 at 12:50 PM, Andres Freund <andres@anarazel.de> wrote:
> On 2016-04-04 12:45:25 -0400, Robert Haas wrote:
>> Well, I agree that it's pretty strange that _mdfd_getseg() makes no
>> such check, but I still don't think I understand what's going on here.
>> Backends shouldn't be requesting nonexistent blocks from a relation -
>> higher-level safeguards, like holding AccessExclusiveLock before
>> trying to complete a DROP or TRUNCATE - are supposed to prevent that.
>
> I don't think that's really true: We write blocks back without holding
> any sort of relation level locks; and thus do _mdfd_getseg() type
> accesses as well.

You're right, but I think that's more because I didn't say it
correctly than because you haven't done something novel.  DROP and
relation truncation know about shared buffers, and they go clear
blocks that that might be affected from it as part of the truncate
operation, which means that no other backend will see them after they
are gone.  The lock makes sure that no other references can be added
while we're busy removing any that are already there.  So I think that
there is currently an invariant that any block we are attempting to
access should actually still exist.  It sounds like these references
are sticking around in backend-private memory, which means they are
neither protected by locks nor able to be cleared out on drop or
truncate.  I think that's a new thing, and a bit scary.

> And we're not really "requesting nonexistant blocks" - the segments are
> just opened to get the associated file descriptor, and they're opened
> with EXTENSION_RETURN_NULL. It turns out to be important
> performance-wise to reuse fd's when triggering kernel writeback.

The possibly-saving grace here, I suppose, is that the references
we're worried about are just being used to issue hints to the
operating system.  So I guess if we sent a hint on a wrong block or
skip sending a hint altogether because of some failure, no harm done,
as long as we don't error out.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Breakage with VACUUM ANALYSE + partitions

From
Andres Freund
Date:
On 2016-04-11 13:04:48 -0400, Robert Haas wrote:
> You're right, but I think that's more because I didn't say it
> correctly than because you haven't done something novel.

Could be.


> DROP and
> relation truncation know about shared buffers, and they go clear
> blocks that that might be affected from it as part of the truncate
> operation, which means that no other backend will see them after they
> are gone.  The lock makes sure that no other references can be added
> while we're busy removing any that are already there.  So I think that
> there is currently an invariant that any block we are attempting to
> access should actually still exist.

Note that we're not actually accessing any blocks, we're just opening a
segment to get the associated file descriptor.


> It sounds like these references are sticking around in backend-private
> memory, which means they are neither protected by locks nor able to be
> cleared out on drop or truncate.  I think that's a new thing, and a
> bit scary.

True. But how would you otherwise batch flush requests in a sorted
manner, without re-opening file descriptors? And that's
pretty essential for performance.

I can think of a number of relatively easy ways to address this:
1) Just zap (or issue?) all pending flush requests when getting an
   smgrinval/smgrclosenode
2) Do 1), but filter for the closed relnode
3) Actually handle the case of the last open segment not being
   RELSEG_SIZE properly in _mdfd_getseg() - mdnblocks() does so.

I'm kind of inclined to do both 3) and 1).

> The possibly-saving grace here, I suppose, is that the references
> we're worried about are just being used to issue hints to the
> operating system.

Indeed.

> So I guess if we sent a hint on a wrong block or
> skip sending a hint altogether because of some failure, no harm done,
> as long as we don't error out.

Which the writeback code is careful not to do; AFAICS it's just the
"already open segment" issue causing problems here.

- Andres



Re: [HACKERS] Breakage with VACUUM ANALYSE + partitions

From
Fabien COELHO
Date:
Hello Andres,

> I can think of a number of relatively easy ways to address this:
> 1) Just zap (or issue?) all pending flush requests when getting an
>   smgrinval/smgrclosenode
> 2) Do 1), but filter for the closed relnode
> 3) Actually handle the case of the last open segment not being
>   RELSEG_SIZE properly in _mdfd_getseg() - mdnblocks() does so.
>
> I'm kind of inclined to do both 3) and 1).

Alas, I'm a little bit outside my area of expertise. Just for suggestion
purposes, possibly totally wrong (please accept my apology if this is the
case), here are the ideas I had while thinking about these issues; some may
be quite close to your suggestions above:

 - keep track of file/segment closings/reopenings (say with a counter), so
   that if a flush is about a file/segment which has been closed or
   reopened, it is just skipped. I'm not sure this is enough, because one
   process may do the file cleaning and another may want to flush, although
   I guess there is some kind of locking/shared data structures to manage
   such interactions between pg processes.

 - because of the "empty the bucket when filled" behavior of the current
   implementation, a buffer to be flushed may be kept for a very
   long time in the bucket. I think that flushing advice becomes stale
   after a while, so it should not be issued (the buffer may have been
   written again, ...), or the bucket should be flushed after a while
   even if not full.

Also, a detail in "pg_flush_data": there is a series of three if/endif
blocks, but it is not obvious to me whether they are mutually exclusive
while looking at the source, so I was wondering whether the code could try
several flushing approaches one after the other... This is probably not
the case, but I would be more at ease with an if/elif/elif/endif structure
there so that it is clearer.

-- 
Fabien.



Re: [HACKERS] Breakage with VACUUM ANALYSE + partitions

From
Robert Haas
Date:
On Mon, Apr 11, 2016 at 1:17 PM, Andres Freund <andres@anarazel.de> wrote:
> I can think of a number of relatively easy ways to address this:
> 1) Just zap (or issue?) all pending flush requests when getting an
>    smgrinval/smgrclosenode
> 2) Do 1), but filter for the closed relnode
> 3) Actually handle the case of the last open segment not being
>    RELSEG_SIZE properly in _mdfd_getseg() - mdnblocks() does so.
>
> I'm kind of inclined to do both 3) and 1).

#3 seems like it's probably about 15 years overdue, so let's do that anyway.

I don't quite understand why #1 fixes the problem - not because I
doubt you, but just because I haven't studied it much.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Breakage with VACUUM ANALYSE + partitions

From
Noah Misch
Date:
On Tue, Apr 12, 2016 at 09:00:57AM -0400, Robert Haas wrote:
> On Mon, Apr 11, 2016 at 1:17 PM, Andres Freund <andres@anarazel.de> wrote:
> > I can think of a number of relatively easy ways to address this:
> > 1) Just zap (or issue?) all pending flush requests when getting an
> >    smgrinval/smgrclosenode
> > 2) Do 1), but filter for the closed relnode
> > 3) Actually handle the case of the last open segment not being
> >    RELSEG_SIZE properly in _mdfd_getseg() - mdnblocks() does so.
> >
> > I'm kind of inclined to do both 3) and 1).

Andres, you own the underlying open item, right?  Do you plan to implement
this fix you have outlined?  If so, when?

> #3 seems like it's probably about 15 years overdue, so let's do that anyway.
> 
> I don't quite understand why #1 fixes the problem - not because I
> doubt you, but just because I haven't studied it much.



Re: Breakage with VACUUM ANALYSE + partitions

From
Andres Freund
Date:
On 2016-04-16 18:23:01 -0400, Noah Misch wrote:
> On Tue, Apr 12, 2016 at 09:00:57AM -0400, Robert Haas wrote:
> > On Mon, Apr 11, 2016 at 1:17 PM, Andres Freund <andres@anarazel.de> wrote:
> > > I can think of a number of relatively easy ways to address this:
> > > 1) Just zap (or issue?) all pending flush requests when getting an
> > >    smgrinval/smgrclosenode
> > > 2) Do 1), but filter for the closed relnode
> > > 3) Actually handle the case of the last open segment not being
> > >    RELSEG_SIZE properly in _mdfd_getseg() - mdnblocks() does so.
> > >
> > > I'm kind of inclined to do both 3) and 1).
> 
> Andres, you own the underlying open item, right?

Right.

> Do you plan to implement this fix you have outlined?  If so, when?

Yes, I plan to next week.

- Andres



Re: Breakage with VACUUM ANALYSE + partitions

From
Noah Misch
Date:
On Sat, Apr 16, 2016 at 03:28:23PM -0700, Andres Freund wrote:
> On 2016-04-16 18:23:01 -0400, Noah Misch wrote:
> > On Tue, Apr 12, 2016 at 09:00:57AM -0400, Robert Haas wrote:
> > > On Mon, Apr 11, 2016 at 1:17 PM, Andres Freund <andres@anarazel.de> wrote:
> > > > I can think of a number of relatively easy ways to address this:
> > > > 1) Just zap (or issue?) all pending flush requests when getting an
> > > >    smgrinval/smgrclosenode
> > > > 2) Do 1), but filter for the closed relnode
> > > > 3) Actually handle the case of the last open segment not being
> > > >    RELSEG_SIZE properly in _mdfd_getseg() - mdnblocks() does so.
> > > >
> > > > I'm kind of inclined to do both 3) and 1).
> > 
> > Andres, you own the underlying open item, right?
> 
> Right.
> 
> > Do you plan to implement this fix you have outlined?  If so, when?
> 
> Yes, I plan to next week.

Completion by the end of next week will work.  Thanks.  At that time
(2016-04-24T00:00 UTC), this would otherwise have spent twenty-three days as
an open item.  Would you inform this thread if you discover reasons to think
your plan will no longer succeed?