Re: Add parallelism and glibc dependent only options to reindexdb - Mailing list pgsql-hackers

From Julien Rouhaud
Subject Re: Add parallelism and glibc dependent only options to reindexdb
Date
Msg-id CAOBaU_Z4_JcLpd9yVzag+7AEx9oaYdh-1gr5PjL6oNB+sRvCYA@mail.gmail.com
In response to Re: Add parallelism and glibc dependent only options to reindexdb  (Michael Paquier <michael@paquier.xyz>)
List pgsql-hackers
On Mon, Jul 1, 2019 at 10:55 AM Michael Paquier <michael@paquier.xyz> wrote:
>
> On Sun, Jun 30, 2019 at 11:45:47AM +0200, Julien Rouhaud wrote:
> > With glibc 2.28 coming, all users will have to reindex almost
> > every index after a glibc upgrade to guarantee the absence of
> > corruption.  Unfortunately, reindexdb is not ideal for that, as it
> > processes everything over a single connection and cannot skip
> > indexes that don't depend on a glibc collation.
>
> We have seen that, with a database of up to 100GB and a schema we
> work on, we end up cutting the reindex time from 30 minutes down to
> a couple of minutes.  Julien, what were the actual numbers?

I did my benchmarking on a fairly ideal database: a large number of
tables with varied sets of indexes, 75 GB in total.  This was done on
my laptop, which has 6 hyperthreaded cores (and crappy I/O), keeping
the default max_parallel_maintenance_workers = 2.

A naive reindexdb took approximately 33 minutes.  Filtering the list
of indexes brought that down to slightly less than 15 minutes, though
of course each database will have a different ratio there.

Then, keeping the --glibc-dependent filter and using different levels of parallelism:

-j1: ~ 14:50
-j3: ~ 7:30
-j6: ~ 5:23
-j8: ~ 4:45

That's pretty much the kind of results I was expecting given the
hardware I used.

> > PFA a patchset to add parallelism to reindexdb (reusing the
> > infrastructure in vacuumdb with some additions) and an option to
> > discard indexes that don't depend on glibc (without any specific
> > collation filtering or glibc version detection), with updated
> > regression tests.  Note that this should be applied on top of the
> > existing reindexdb cleanup & refactoring patch
> > (https://commitfest.postgresql.org/23/2115/).
>
> Please note that patch 0003 does not seem to apply correctly on HEAD
> as of c74d49d4.

Yes, this is because the patchset has to be applied on top of the
reindexdb refactoring patch mentioned above.  It's unfortunate that we
don't have a good way to deal with that kind of dependency, as it also
breaks Thomas' cfbot :(

> - 0003 begins to be the actual fancy thing with the addition of a
> --jobs option into reindexdb.  The main issue here which should be
> discussed is that when it comes to reindex of tables, you basically
> are not going to have any conflicts between the objects manipulated.
> However, if you wish to do a reindex on a set of indexes then things
> get trickier, as it is necessary to group items per table so that
> multiple connections do not conflict with each other when attempting
> to work on multiple indexes of the same table.  What this patch does is
> to select the set of indexes which need to be worked on (see the
> addition of a cell in ParallelSlot), and then does a kind of
> pre-planning, assigning each item to a connection slot so that each
> connection knows from the beginning which items it needs to process.
> This is quite different from vacuumdb, where a new item is handed
> out only to a free connection taken from a single list.  I'd
> personally prefer to keep the facility in parallel.c so that it is
> only execution-dependent and we have no pre-planning.  This would
> require keeping an array of lists within reindexdb.c, one list per
> connection, which feels more natural.

My fear here is that this approach would add extra complexity,
especially in having to deal with free-connection handling both in
GetIdleSlot() and in the main reindexdb loop.  Also, the pre-planning
allows us to start processing the biggest tables first, which
optimises the overall runtime.


