Re: Add parallelism and glibc dependent only options to reindexdb - Mailing list pgsql-hackers

From Julien Rouhaud
Subject Re: Add parallelism and glibc dependent only options to reindexdb
Date
Msg-id CAOBaU_Y4zP=kYiXkh2sBgV9=k0XmTb4DwY9hS4H=AnZrDTWXuA@mail.gmail.com
Whole thread Raw
In response to Re: Add parallelism and glibc dependent only options to reindexdb  (Michael Paquier <michael@paquier.xyz>)
Responses Re: Add parallelism and glibc dependent only options to reindexdb  (Michael Paquier <michael@paquier.xyz>)
List pgsql-hackers
On Thu, Jul 25, 2019 at 12:03 PM Michael Paquier <michael@paquier.xyz> wrote:
>
> On Thu, Jul 25, 2019 at 12:12:40PM +0300, Sergei Kornilov wrote:
> > Thank you, v9 code and behavior looks good for me. Builds cleanly,
> > works with older releases. I'll mark Ready to Commiter.
>
> The restriction with --jobs and --system is not documented, and that's
> good to have from the start.

That's documented in the patch:
+        Note that this option is ignored with the <option>--index</option>
+        option to prevent deadlocks when processing multiple indexes from
+        the same relation, and incompatible with the <option>--system</option>
+        option.

The restriction with --jobs and --concurrently is indeed not
specifically documented in reindexdb.sgml, there's only a mention in
reindex.sgml:

    <command>REINDEX SYSTEM</command> does not support
    <command>CONCURRENTLY</command> since system catalogs cannot be reindexed

The behavior doesn't really change with this patch, though we could
enhance the documentation.

>  This actually makes the parallel job
> handling with --index inconsistent as we mask parallelism in this case
> by enforcing the use of one connection.  I think that we should
> revisit the interactions with --index and --jobs actually, because,
> assuming that users never read the documentation, they may actually be
> surprised to see that something like --index idx1 .. --index idxN
> --jobs=N does not lead to any improvements at all, until they find out
> the reason why.

The problem is that a user doing something like:

reindexdb -j48 -i some_index -S s1 -S s2 -S s3....

will probably be disappointed to learn that he has to run a specific
command for the index(es) that should be reindexed.  Maybe we can
issue a warning that parallelism isn't used when an index list is
processed and user asked for multiple jobs?

> Thinking deeper about it, there is also a point of gathering first all
> the relations if one associates --schemas and --tables in the same
> call of reindexdb and then pass down a list of decomposed relations
> which are processed in parallel.  The code as currently presented is
> rather straight-forward, and I don't think that this is worth the
> extra complication, but this was not mentioned until now on this
> thread :)

+1


> For the non-parallel case in reindex_one_database(), I would add an
> Assert on REINDEX_DATABASE and REINDEX_SYSTEM with a comment to
> mention that a NULL list of objects can just be passed down only in
> those two cases when the single-object list with the database name is
> built.

Something like that?

    if (!parallel)
    {
-       if (user_list == NULL)
+       /*
+        * Database-wide and system catalogs processing should omit the list
+        * of objects to process.
+        */
+       if (process_type == REINDEX_DATABASE || process_type == REINDEX_SYSTEM)
        {
+           Assert(user_list == NULL);
+
            process_list = pg_malloc0(sizeof(SimpleStringList));
            simple_string_list_append(process_list, PQdb(conn));
        }

There's another assert  after the else-parallel that checks that a
list is present, so there's no need to also check it here.

I don't send a new patch since the --index wanted behavior is not clear yet.



pgsql-hackers by date:

Previous
From: Amit Kapila
Date:
Subject: Re: POC: Cleaning up orphaned files using undo logs
Next
From: Kyotaro Horiguchi
Date:
Subject: Re: Index Skip Scan