Thread: Re: Allow CLUSTER, VACUUM FULL and REINDEX to change tablespace on the fly

Re: Allow CLUSTER, VACUUM FULL and REINDEX to change tablespace on the fly

From
Alexey Kondratov
Date:
On 2021-01-28 00:36, Alvaro Herrera wrote:
> On 2021-Jan-28, Alexey Kondratov wrote:
> 
>> I have read more about lock levels and ShareLock should prevent any 
>> kind of
>> physical modification of indexes. We already hold ShareLock doing
>> find_all_inheritors(), which is higher than ShareUpdateExclusiveLock, 
>> so
>> using ShareLock seems to be safe here, but I will look on it closer.
> 
> You can look at lock.c where LockConflicts[] is; that would tell you
> that ShareLock indeed conflicts with ShareUpdateExclusiveLock ... but 
> it
> does not conflict with itself!  So it would be possible to have more
> than one process doing this thing at the same time, which surely makes
> no sense.
> 

Thanks for the explanation and pointing me to the LockConflicts[]. This 
is a good reference.

> 
> I didn't look at the patch closely enough to understand why you're
> trying to do something like CLUSTER, VACUUM FULL or REINDEX without
> holding full AccessExclusiveLock on the relation.  But do keep in mind
> that once you hold a lock on a relation, trying to grab a weaker lock
> afterwards is pretty pointless.
> 

No, you are right, we are doing REINDEX with AccessExclusiveLock as it 
was before. This part is a more specific one. It only applies to 
partitioned indexes, which do not hold any data, so we do not reindex 
them directly, only their leafs. However, if we are doing a TABLESPACE 
change, we have to record it in their pg_class entry, so all future leaf 
partitions were created in the proper tablespace.

That way, we open partitioned index relation only for a reference, i.e. 
read-only, but modify pg_class entry under a proper lock 
(RowExclusiveLock). That's why I thought that ShareLock will be enough.

IIUC, 'ALTER TABLE ... SET TABLESPACE' uses AccessExclusiveLock even for 
relations with no storage, since AlterTableGetLockLevel() chooses it if 
AT_SetTableSpace is met. This is very similar to our case, so probably 
we should do the same?

Actually it is not completely clear for me why ShareUpdateExclusiveLock 
is sufficient for newly added SetRelationTableSpace() as Michael wrote 
in the comment.


Regards
-- 
Alexey Kondratov

Postgres Professional https://www.postgrespro.com
Russian Postgres Company



Re: Allow CLUSTER, VACUUM FULL and REINDEX to change tablespace on the fly

From
Alexey Kondratov
Date:
On 2021-01-28 14:42, Alexey Kondratov wrote:
> On 2021-01-28 00:36, Alvaro Herrera wrote:

>> I didn't look at the patch closely enough to understand why you're
>> trying to do something like CLUSTER, VACUUM FULL or REINDEX without
>> holding full AccessExclusiveLock on the relation.  But do keep in mind
>> that once you hold a lock on a relation, trying to grab a weaker lock
>> afterwards is pretty pointless.
>> 
> 
> No, you are right, we are doing REINDEX with AccessExclusiveLock as it
> was before. This part is a more specific one. It only applies to
> partitioned indexes, which do not hold any data, so we do not reindex
> them directly, only their leafs. However, if we are doing a TABLESPACE
> change, we have to record it in their pg_class entry, so all future
> leaf partitions were created in the proper tablespace.
> 
> That way, we open partitioned index relation only for a reference,
> i.e. read-only, but modify pg_class entry under a proper lock
> (RowExclusiveLock). That's why I thought that ShareLock will be
> enough.
> 
> IIUC, 'ALTER TABLE ... SET TABLESPACE' uses AccessExclusiveLock even
> for relations with no storage, since AlterTableGetLockLevel() chooses
> it if AT_SetTableSpace is met. This is very similar to our case, so
> probably we should do the same?
> 
> Actually it is not completely clear for me why
> ShareUpdateExclusiveLock is sufficient for newly added
> SetRelationTableSpace() as Michael wrote in the comment.
> 

Changed patch to use AccessExclusiveLock in this part for now. This is 
what 'ALTER TABLE/INDEX ... SET TABLESPACE' and 'REINDEX' usually do. 
Anyway, all real leaf partitions are processed in the independent 
transactions later.

Also changed some doc/comment parts Justin pointed me to.

>> +      then all "mapped" and system relations will be skipped and a 
>> single
>> +      <literal>WARNING</literal> will be generated. Indexes on TOAST 
>> tables
>> +      are reindexed, but not moved the new tablespace.
> 
> moved *to* the new tablespace.
> 

Fixed.

> 
> I don't know if that needs to be said at all.  We talked about it a lot 
> to
> arrive at the current behavior, but I think that's only due to the 
> difficulty
> of correcting the initial mistake.
> 

I do not think that it will be a big deal to move indexes on TOAST 
tables as well. I just thought that since 'ALTER TABLE/INDEX ... SET 
TABLESPACE' only moves them together with host table, we also should not 
do that. Yet, I am ready to change this logic if requested.


Regards
-- 
Alexey Kondratov

Postgres Professional https://www.postgrespro.com
Russian Postgres Company
Attachment

Re: Allow CLUSTER, VACUUM FULL and REINDEX to change tablespace on the fly

From
Alexey Kondratov
Date:
On 2021-01-30 05:23, Michael Paquier wrote:
> On Fri, Jan 29, 2021 at 08:56:47PM +0300, Alexey Kondratov wrote:
>> On 2021-01-28 14:42, Alexey Kondratov wrote:
>>> No, you are right, we are doing REINDEX with AccessExclusiveLock as 
>>> it
>>> was before. This part is a more specific one. It only applies to
>>> partitioned indexes, which do not hold any data, so we do not reindex
>>> them directly, only their leafs. However, if we are doing a 
>>> TABLESPACE
>>> change, we have to record it in their pg_class entry, so all future
>>> leaf partitions were created in the proper tablespace.
>>> 
>>> That way, we open partitioned index relation only for a reference,
>>> i.e. read-only, but modify pg_class entry under a proper lock
>>> (RowExclusiveLock). That's why I thought that ShareLock will be
>>> enough.
>>> 
>>> IIUC, 'ALTER TABLE ... SET TABLESPACE' uses AccessExclusiveLock even
>>> for relations with no storage, since AlterTableGetLockLevel() chooses
>>> it if AT_SetTableSpace is met. This is very similar to our case, so
>>> probably we should do the same?
>>> 
>>> Actually it is not completely clear for me why
>>> ShareUpdateExclusiveLock is sufficient for newly added
>>> SetRelationTableSpace() as Michael wrote in the comment.
> 
> Nay, it was not fine.  That's something Alvaro has mentioned, leading
> to 2484329.  This also means that the main patch of this thread should
> refresh the comments at the top of CheckRelationTableSpaceMove() and
> SetRelationTableSpace() to mention that this is used by REINDEX
> CONCURRENTLY with a lower lock.
> 

Hm, IIUC, REINDEX CONCURRENTLY doesn't use either of them. It directly 
uses index_create() with a proper tablespaceOid instead of 
SetRelationTableSpace(). And its checks structure is more restrictive 
even without tablespace change, so it doesn't use 
CheckRelationTableSpaceMove().

>> Changed patch to use AccessExclusiveLock in this part for now. This is 
>> what
>> 'ALTER TABLE/INDEX ... SET TABLESPACE' and 'REINDEX' usually do. 
>> Anyway, all
>> real leaf partitions are processed in the independent transactions 
>> later.
> 
> +       if (partkind == RELKIND_PARTITIONED_INDEX)
> +       {
> +           Relation iRel = index_open(partoid, AccessExclusiveLock);
> +
> +           if (CheckRelationTableSpaceMove(iRel, 
> params->tablespaceOid))
> +               SetRelationTableSpace(iRel,
> +                                     params->tablespaceOid,
> +                                     InvalidOid);
> +           index_close(iRel, NoLock);
> Are you sure that this does not represent a risk of deadlocks as EAL
> is not taken consistently across all the partitions?  A second issue
> here is that this breaks the assumption of REINDEX CONCURRENTLY kicked
> on partitioned relations that should use ShareUpdateExclusiveLock for
> all its steps.  This would make the first transaction invasive for the
> user, but we don't want that.
> 
> This makes me really wonder if we would not be better to restrict this
> operation for partitioned relation as part of REINDEX as a first step.
> Another thing, mentioned upthread, is that we could do this part of
> the switch at the last transaction, or we could silently *not* do the
> switch for partitioned indexes in the flow of REINDEX, letting users
> handle that with an extra ALTER TABLE SET TABLESPACE once REINDEX has
> finished on all the partitions, cascading the command only on the
> partitioned relation of a tree.  It may be interesting to look as well
> at if we could lower the lock used for partitioned relations with
> ALTER TABLE SET TABLESPACE from AEL to SUEL, choosing AEL only if at
> least one partition with storage is involved in the command,
> CheckRelationTableSpaceMove() discarding anything that has no need to
> change.
> 

I am not sure right now, so I split previous patch into two parts:

0001: Adds TABLESPACE into REINDEX with tests, doc and all the stuff we 
did before with the only exception that it doesn't move partitioned 
indexes into the new tablespace.

Basically, it implements this option "we could silently *not* do the 
switch for partitioned indexes in the flow of REINDEX, letting users 
handle that with an extra ALTER TABLE SET TABLESPACE once REINDEX has 
finished". It probably makes sense, since we are doing tablespace change 
altogether with index relation rewrite and don't touch relations without 
storage. Doing ALTER INDEX ... SET TABLESPACE will be almost cost-less 
on them, since they do not hold any data.

0002: Implements the remaining part where pg_class entry is also changed 
for partitioned indexes. I think that we should think more about it, 
maybe it is not so dangerous and proper locking strategy could be 
achieved.


Regards
-- 
Alexey Kondratov

Postgres Professional https://www.postgrespro.com
Russian Postgres Company
Attachment

Re: Allow CLUSTER, VACUUM FULL and REINDEX to change tablespace on the fly

From
Alexey Kondratov
Date:
On 2021-02-03 09:37, Michael Paquier wrote:
> On Tue, Feb 02, 2021 at 10:32:19AM +0900, Michael Paquier wrote:
>> On Mon, Feb 01, 2021 at 06:28:57PM +0300, Alexey Kondratov wrote:
>> > Hm, IIUC, REINDEX CONCURRENTLY doesn't use either of them. It directly uses
>> > index_create() with a proper tablespaceOid instead of
>> > SetRelationTableSpace(). And its checks structure is more restrictive even
>> > without tablespace change, so it doesn't use CheckRelationTableSpaceMove().
>> 
>> Sure.  I have not checked the patch in details, but even with that it
>> would be much safer to me if we apply the same sanity checks
>> everywhere.  That's less potential holes to worry about.
> 
> Thanks Alexey for the new patch.  I have been looking at the main
> patch in details.
> 
>     /*
> -    * Don't allow reindex on temp tables of other backends ... their 
> local
> -    * buffer manager is not going to cope.
> +    * We don't support moving system relations into different 
> tablespaces
> +    * unless allow_system_table_mods=1.
>      */
> If you remove the check on RELATION_IS_OTHER_TEMP() in
> reindex_index(), you would allow the reindex of a temp relation owned
> by a different session if its tablespace is not changed, so this
> cannot be removed.
> 
> +        !allowSystemTableMods && IsSystemRelation(iRel))
>          ereport(ERROR,
> -                (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
> -                 errmsg("cannot reindex temporary tables of other 
> sessions")));
> +                (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
> +                 errmsg("permission denied: \"%s\" is a system 
> catalog",
> +                        RelationGetRelationName(iRel))));
> Indeed, a system relation with a relfilenode should be allowed to move
> under allow_system_table_mods.  I think that we had better move this
> check into CheckRelationTableSpaceMove() instead of reindex_index() to
> centralize the logic.  ALTER TABLE does this business in
> RangeVarCallbackForAlterRelation(), but our code path opening the
> relation is different for the non-concurrent case.
> 
> +       if (OidIsValid(params->tablespaceOid) &&
> +           IsSystemClass(relid, classtuple))
> +       {
> +           if (!allowSystemTableMods)
> +           {
> +               /* Skip all system relations, if not 
> allowSystemTableMods *
> I don't see the need for having two warnings here to say the same
> thing if a relation is mapped or not mapped, so let's keep it simple.
> 

Yeah, I just wanted to separate mapped and system relations, but 
probably it is too complicated.

> 
> I have found that the test suite was rather messy in its
> organization.  Table creations were done first with a set of tests not
> really ordered, so that was really hard to follow.  This has also led
> to a set of tests that were duplicated, while other tests have been
> missed, mainly some cross checks for the concurrent and non-concurrent
> behaviors.  I have reordered the whole so as tests on catalogs, normal
> tables and partitions are done separately with relations created and
> dropped for each set.  Partitions use a global check for tablespaces
> and relfilenodes after one concurrent reindex (didn't see the point in
> doubling with the non-concurrent case as the same code path to select
> the relations from the partition tree is taken).  An ACL test has been
> added at the end.
> 
> The case of partitioned indexes was kind of interesting and I thought
> about that a couple of days, and I took the decision to ignore
> relations that have no storage as you did, documenting that ALTER
> TABLE can be used to update the references of the partitioned
> relations.  The command is still useful with this behavior, and the
> tests I have added track that.
> 
> Finally, I have reworked the docs, separating the limitations related
> to system catalogs and partitioned relations, to be more consistent
> with the notes at the end of the page.
> 

Thanks for working on this.

+    if (tablespacename != NULL)
+    {
+        params.tablespaceOid = get_tablespace_oid(tablespacename, false);
+
+        /* Check permissions except when moving to database's default */
+        if (OidIsValid(params.tablespaceOid) &&

This check for OidIsValid() seems to be excessive, since you moved the 
whole ACL check under 'if (tablespacename != NULL)' here.

+            params.tablespaceOid != MyDatabaseTableSpace)
+        {
+            AclResult    aclresult;


+CREATE INDEX regress_tblspace_test_tbl_idx ON regress_tblspace_test_tbl 
(num1);
+-- move to global tablespace move fails

Maybe 'move to global tablespace, fail', just to match a style of the 
previous comments.

+REINDEX (TABLESPACE pg_global) INDEX regress_tblspace_test_tbl_idx;


+SELECT relid, parentrelid, level FROM 
pg_partition_tree('tbspace_reindex_part_index')
+  ORDER BY relid, level;
+SELECT relid, parentrelid, level FROM 
pg_partition_tree('tbspace_reindex_part_index')
+  ORDER BY relid, level;

Why do you do the same twice in a row? It looks like a typo. Maybe it 
was intended to be called for partitioned table AND index.


Regards
-- 
Alexey Kondratov

Postgres Professional https://www.postgrespro.com
Russian Postgres Company



Free port choosing freezes when PostgresNode::use_tcp is used on BSD systems

From
Alexey Kondratov
Date:
Hi Hackers,

Inside PostgresNode.pm there is a free port choosing routine --- 
get_free_port(). The comment section there says:

    # On non-Linux, non-Windows kernels, binding to 127.0.0/24 addresses
    # other than 127.0.0.1 might fail with EADDRNOTAVAIL.

And this is an absolute true, on BSD-like systems (macOS and FreeBSD 
tested) it hangs on looping through the entire ports range over and over 
when $PostgresNode::use_tcp = 1 is set, since bind fails with:

# Checking port 52208
# bind: 127.0.0.1 52208
# bind: 127.0.0.2 52208
bind: Can't assign requested address

To reproduce just apply reproduce.diff and try to run 'make -C 
src/bin/pg_ctl check'.

This is not a case with standard Postgres tests, since TestLib.pm 
chooses unix sockets automatically everywhere outside Windows. However, 
we got into this problem when tried to run a custom tap test that 
required TCP for stable running.

That way, if it really could happen why not to just skip binding to 
127.0.0/24 addresses other than 127.0.0.1 outside of Linux/Windows as 
per attached patch_PostgresNode.diff?


Regards
-- 
Alexey Kondratov

Postgres Professional https://www.postgrespro.com
Russian Postgres Company
Attachment
Alexey Kondratov <a.kondratov@postgrespro.ru> writes:
> And this is an absolute true, on BSD-like systems (macOS and FreeBSD 
> tested) it hangs on looping through the entire ports range over and over 
> when $PostgresNode::use_tcp = 1 is set, since bind fails with:

Hm.

> That way, if it really could happen why not to just skip binding to 
> 127.0.0/24 addresses other than 127.0.0.1 outside of Linux/Windows as 
> per attached patch_PostgresNode.diff?

That patch seems wrong, or at least it's ignoring the advice immediately
above about binding to 0.0.0.0 only on Windows.

I wonder whether we could get away with just replacing the $use_tcp
test with $TestLib::windows_os.  It's not really apparent to me
why we should care about 127.0.0.not-1 on Unix-oid systems.

            regards, tom lane



On 4/19/21 7:22 PM, Tom Lane wrote:
> Alexey Kondratov <a.kondratov@postgrespro.ru> writes:
>> And this is an absolute true, on BSD-like systems (macOS and FreeBSD 
>> tested) it hangs on looping through the entire ports range over and over 
>> when $PostgresNode::use_tcp = 1 is set, since bind fails with:
> Hm.
>
>> That way, if it really could happen why not to just skip binding to 
>> 127.0.0/24 addresses other than 127.0.0.1 outside of Linux/Windows as 
>> per attached patch_PostgresNode.diff?
> That patch seems wrong, or at least it's ignoring the advice immediately
> above about binding to 0.0.0.0 only on Windows.
>
> I wonder whether we could get away with just replacing the $use_tcp
> test with $TestLib::windows_os.  It's not really apparent to me
> why we should care about 127.0.0.not-1 on Unix-oid systems.
>
>             


Yeah


The comment is a bit strange anyway - Cygwin is actually going to use
Unix sockets, not TCP.


I think I would just change the test to this: $use_tcp &&
$TestLib::windows_os.


cheers


andrew


--
Andrew Dunstan
EDB: https://www.enterprisedb.com




Andrew Dunstan <andrew@dunslane.net> writes:
> On 4/19/21 7:22 PM, Tom Lane wrote:
>> I wonder whether we could get away with just replacing the $use_tcp
>> test with $TestLib::windows_os.  It's not really apparent to me
>> why we should care about 127.0.0.not-1 on Unix-oid systems.

> Yeah
> The comment is a bit strange anyway - Cygwin is actually going to use
> Unix sockets, not TCP.
> I think I would just change the test to this: $use_tcp &&
> $TestLib::windows_os.

Works for me, but we need to revise the comment to match.

            regards, tom lane



Re: Free port choosing freezes when PostgresNode::use_tcp is used on BSD systems

From
Alexey Kondratov
Date:
On 2021-04-20 18:03, Tom Lane wrote:
> Andrew Dunstan <andrew@dunslane.net> writes:
>> On 4/19/21 7:22 PM, Tom Lane wrote:
>>> I wonder whether we could get away with just replacing the $use_tcp
>>> test with $TestLib::windows_os.  It's not really apparent to me
>>> why we should care about 127.0.0.not-1 on Unix-oid systems.
> 
>> Yeah
>> The comment is a bit strange anyway - Cygwin is actually going to use
>> Unix sockets, not TCP.
>> I think I would just change the test to this: $use_tcp &&
>> $TestLib::windows_os.
> 
> Works for me, but we need to revise the comment to match.
> 

Then it could be somewhat like that, I guess.


Regards
-- 
Alexey Kondratov

Postgres Professional https://www.postgrespro.com
Russian Postgres Company
Attachment
On 4/20/21 6:49 PM, Alexey Kondratov wrote:
> On 2021-04-20 18:03, Tom Lane wrote:
>> Andrew Dunstan <andrew@dunslane.net> writes:
>>> On 4/19/21 7:22 PM, Tom Lane wrote:
>>>> I wonder whether we could get away with just replacing the $use_tcp
>>>> test with $TestLib::windows_os.  It's not really apparent to me
>>>> why we should care about 127.0.0.not-1 on Unix-oid systems.
>>
>>> Yeah
>>> The comment is a bit strange anyway - Cygwin is actually going to use
>>> Unix sockets, not TCP.
>>> I think I would just change the test to this: $use_tcp &&
>>> $TestLib::windows_os.
>>
>> Works for me, but we need to revise the comment to match.
>>
>
> Then it could be somewhat like that, I guess.
>
>
>


pushed with slight edits.


Thanks.


cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com