Thread: Proposal: speeding up GIN build with parallel workers

Proposal: speeding up GIN build with parallel workers

From
"Constantin S. Pan"
Date:
Hi, Hackers.

The task of building a GIN index can require lots of time and eats
100% CPU, but we could easily make it use more than 100%, especially
since we now have parallel workers in Postgres.

The process of building GIN looks like this:

1. Accumulate a batch of index records into an rbtree in maintenance
work memory.

2. Dump the batch to disk.

3. Repeat.
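
For illustration, a rough sketch of that loop in C (pseudocode: the
accumulator calls are real gin internals, but the dump helper and the
elided arguments are only illustrative):

    /* pseudocode sketch of the sequential build loop */
    while ((tup = heap_getnext(scan, ForwardScanDirection)) != NULL)
    {
        /* 1. extract keys, add (key, ItemPointer) pairs to the rbtree */
        ginInsertBAEntries(&buildstate->accum, &tup->t_self, ...);

        /* 2. once maintenance_work_mem is exhausted, dump the batch */
        if (buildstate->accum.allocatedMemory >=
            (Size) maintenance_work_mem * 1024L)
        {
            ginDumpAccumulator(buildstate);  /* write posting lists */
            ginInitBA(&buildstate->accum);   /* 3. reset and repeat */
        }
    }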

I have a draft implementation which divides the whole process between
N parallel workers, see the patch attached. Instead of a full scan of
the relation, I give each worker a range of blocks to read.
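
For example, a static split of the relation's blocks could look like
this (a standalone sketch with hypothetical names; the patch's actual
distribution scheme may differ):

    #include <stdio.h>

    /* Give worker i a contiguous slice of the relation's nblocks blocks. */
    static void
    worker_block_range(unsigned nblocks, int nworkers, int i,
                       unsigned *first, unsigned *count)
    {
        unsigned base = nblocks / nworkers;
        unsigned rest = nblocks % nworkers;

        /* spread the remainder over the first 'rest' workers */
        *first = i * base + (i < (int) rest ? (unsigned) i : rest);
        *count = base + (i < (int) rest ? 1 : 0);
    }

    int
    main(void)
    {
        for (int i = 0; i < 4; i++)
        {
            unsigned first, count;

            worker_block_range(1000003, 4, i, &first, &count);
            printf("worker %d: blocks [%u, %u)\n", i, first, first + count);
        }
        return 0;
    }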

This speeds up the first step N times, but slows down the second one,
because when multiple workers dump item pointers for the same key, each
of them has to read and decode the results of the previous one. That is
a huge waste, but there is an idea on how to eliminate it.

When it comes to dumping the next batch, a worker does not do it
independently. Instead, it (and every other worker) sends the
accumulated index records to the parent (backend) in ascending key
order. The backend, which receives the records from the workers through
shared memory, can merge them and dump each of them once, without the
need to reread the records N-1 times.
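
For illustration, a standalone sketch of that merge, with plain ints
standing in for GIN keys; in the patch each key would carry a posting
list, and those get merged at the marked point:

    #include <stdio.h>

    /* Each worker delivers its keys in ascending order; the backend
     * repeatedly takes the smallest key across all streams and dumps
     * it exactly once. */
    typedef struct { const int *keys; int n, pos; } Stream;

    static int
    next_min(Stream *s, int nstreams, int *min_out)
    {
        int found = 0;

        for (int i = 0; i < nstreams; i++)
            if (s[i].pos < s[i].n &&
                (!found || s[i].keys[s[i].pos] < *min_out))
            {
                *min_out = s[i].keys[s[i].pos];
                found = 1;
            }
        return found;
    }

    int
    main(void)
    {
        const int w0[] = {1, 4, 9};
        const int w1[] = {2, 4, 7};
        const int w2[] = {4, 9, 11};
        Stream s[3] = {{w0, 3, 0}, {w1, 3, 0}, {w2, 3, 0}};
        int key;

        while (next_min(s, 3, &key))
        {
            int sources = 0;

            /* consume the key from every stream that has it -- this is
             * where the workers' posting lists would be merged */
            for (int i = 0; i < 3; i++)
                if (s[i].pos < s[i].n && s[i].keys[s[i].pos] == key)
                {
                    s[i].pos++;
                    sources++;
                }
            printf("dump key %d (from %d workers)\n", key, sources);
        }
        return 0;
    }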

In its current state the implementation is just a proof of concept
and has all the configuration hardcoded, but it already works as is,
though it does not speed up the build process more than 4 times on my
configuration (12 CPUs). There is also a problem with temporary tables,
for which parallel mode does not work.

Please leave your feedback.

Regards,

Constantin S. Pan
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: Proposal: speeding up GIN build with parallel workers

From
Peter Geoghegan
Date:
On Fri, Jan 15, 2016 at 2:38 PM, Constantin S. Pan <kvapen@gmail.com> wrote:
> I have a draft implementation which divides the whole process between
> N parallel workers, see the patch attached. Instead of a full scan of
> the relation, I give each worker a range of blocks to read.

I am currently working on a patch that allows B-Tree index builds to
be performed in parallel. I think I'm a week or two away from posting
it.

Even without parallelism, wouldn't it be better if GIN indexes were
built using tuplesort? I know way way less about the gin am than the
nbtree am, but I imagine that a prominent cost for GIN index builds is
constructing the main B-Tree (the one that's constructed over key
values) itself. Couldn't tuplesort.c be adapted to cover this case?
That would be much faster in general, particularly with the recent
addition of abbreviated keys, while also leaving a clear path forward
to performing the build in parallel.

I understand that a long term ambition for the gin am is to merge it
with nbtree, to almost automatically benefit from enhancements, and to
reduce the maintenance burden of each.

-- 
Peter Geoghegan



Re: Proposal: speeding up GIN build with parallel workers

From
"Constantin S. Pan"
Date:
On Fri, 15 Jan 2016 15:29:51 -0800
Peter Geoghegan <pg@heroku.com> wrote:

> On Fri, Jan 15, 2016 at 2:38 PM, Constantin S. Pan <kvapen@gmail.com>
> wrote:
> Even without parallelism, wouldn't it be better if GIN indexes were
> built using tuplesort? I know way way less about the gin am than the
> nbtree am, but I imagine that a prominent cost for GIN index builds is
> constructing the main B-Tree (the one that's constructed over key
> values) itself. Couldn't tuplesort.c be adapted to cover this case?
> That would be much faster in general, particularly with the recent
> addition of abbreviated keys, while also leaving a clear path forward
> to performing the build in parallel.

While building GIN we need a quick way to update the posting list for
a given key; this is where the rbtree comes to the rescue. Using
tuplesort would require a completely different approach to building the
index: dump (key, itempointer) pairs into a tuplesort heap, then sort
the heap and merge the itempointers for equal key values.

Both the rbtree and sorting require N log N operations, and abbreviated
keys will not help here, because GIN is used in cases where there are
lots of repeated keys. The benefit of tuplesort is that it would be
better for huge data that does not fit into memory, but on the other
hand it would need twice as much free disk space for sorting as the
data itself takes. Are we ready for such a cost?

I think we have to experiment with both approaches, and see how it goes.

What are your thoughts?

Regards,

Constantin S. Pan
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: Proposal: speeding up GIN build with parallel workers

From
Jeff Janes
Date:
On Fri, Jan 15, 2016 at 3:29 PM, Peter Geoghegan <pg@heroku.com> wrote:
> On Fri, Jan 15, 2016 at 2:38 PM, Constantin S. Pan <kvapen@gmail.com> wrote:
>> I have a draft implementation which divides the whole process between
>> N parallel workers, see the patch attached. Instead of a full scan of
>> the relation, I give each worker a range of blocks to read.
>
> I am currently working on a patch that allows B-Tree index builds to
> be performed in parallel. I think I'm a week or two away from posting
> it.
>
> Even without parallelism, wouldn't it be better if GIN indexes were
> built using tuplesort? I know way way less about the gin am than the
> nbtree am, but I imagine that a prominent cost for GIN index builds is
> constructing the main B-Tree (the one that's constructed over key
> values) itself. Couldn't tuplesort.c be adapted to cover this case?
> That would be much faster in general, particularly with the recent
> addition of abbreviated keys, while also leaving a clear path forward
> to performing the build in parallel.

I think it would take a lot of changes to tuple sort to make this an
almost-always win.

In the general case each GIN key occurs in many tuples, and the
in-memory rbtree is good at compressing the tid list for each key to
maximize the amount of data that can be in memory (although perhaps it
could be even better, as it doesn't use varbyte encoding on the tid
list the way the posting lists on disk do--it could do so in the bulk
build case, where it receives the tid in order, but not feasibly in
the pending-list clean-up case, where it doesn't get them in order).
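
For reference, a minimal standalone sketch of the varbyte idea: store
deltas of the ascending sequence, 7 data bits per byte plus a
continuation bit. This shows the general technique, not GIN's exact
on-disk posting-list format:

    #include <stdint.h>
    #include <stdio.h>

    /* Varbyte-encode an ascending sequence as deltas.  Small deltas
     * (dense tid lists) take a single byte each. */
    static int
    varbyte_encode(const uint64_t *vals, int n, unsigned char *out)
    {
        uint64_t prev = 0;
        int len = 0;

        for (int i = 0; i < n; i++)
        {
            uint64_t delta = vals[i] - prev;   /* requires ascending input */

            prev = vals[i];
            while (delta >= 0x80)
            {
                out[len++] = (unsigned char) (delta & 0x7F) | 0x80;
                delta >>= 7;
            }
            out[len++] = (unsigned char) delta;
        }
        return len;
    }

    int
    main(void)
    {
        uint64_t tids[] = {1000, 1002, 1003, 1500, 90000};
        unsigned char buf[64];

        /* 5 values encode to 9 bytes instead of 5 * 8 */
        printf("5 values -> %d bytes\n", varbyte_encode(tids, 5, buf));
        return 0;
    }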

When I was testing building an index on a column of text identifiers
where there were a couple million identifiers occurring a few dozen
times each, building it as a gin index (using btree_gin so that plain
text columns could be indexed) was quite a bit faster than building it
as a regular btree index using tuple sort.  I didn't really
investigate this difference, but I assume it was due to the better
memory usage leading to less IO.

(Disclaimer: this was on those identifiers which had little difference
in the first 8 bytes, so they didn't really benefit from the
abbreviated keys.)

So in addition to shimming (key, tid) pairs to look like something
that can be fed to tuple sort, I think we would also need to make
tuple sort do on-the-fly aggregation when it finds adjacent equal sort
columns, so it can make (key,tid[]) structures rather than (key,tid).
Without that, I think using tuple sort would be a loss at least as
often as it would be a win.
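
A standalone sketch of that aggregation step, with qsort standing in
for tuplesort and (int key, long tid) pairs standing in for real index
entries: after sorting, each run of equal keys collapses into one
(key, tid[]) posting list.

    #include <stdio.h>
    #include <stdlib.h>

    typedef struct { int key; long tid; } Pair;

    static int
    cmp_pair(const void *a, const void *b)
    {
        const Pair *pa = a, *pb = b;

        if (pa->key != pb->key)
            return (pa->key > pb->key) - (pa->key < pb->key);
        return (pa->tid > pb->tid) - (pa->tid < pb->tid);
    }

    int
    main(void)
    {
        /* (key, tid) pairs in heap order, as the sort would see them */
        Pair p[] = {{7, 30}, {2, 10}, {7, 11}, {2, 42}, {7, 25}};
        int n = 5;

        qsort(p, n, sizeof(Pair), cmp_pair);

        /* on-the-fly aggregation: one (key, tid[]) per run of equal keys */
        for (int i = 0; i < n;)
        {
            int j = i;

            printf("key %d -> tids:", p[i].key);
            while (j < n && p[j].key == p[i].key)
                printf(" %ld", p[j++].tid);
            printf("\n");
            i = j;
        }
        return 0;
    }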

But I do think this, with the aggregation, would be worth it despite
the amount of work involved.  In addition to automatically benefiting
from parallel sorts and any other future sort improvements, I think it
would also generate better indexes.  I have a theory that a problem
with very large gin indexes is that repeatedly building
maintenance_work_mem worth of rbtree entries and then merging them to
disk yields highly fragmented btrees, in which logically adjacent
key-space does not occupy physically nearby leaf pages, which can then
cause problems either with access that follows a pattern (like range
scans, except gin indexes can't really do those anyway) or with
further bulk operations.

So I agree that a better long-term approach would probably be to
make gin index builds (at least the bulk ones, I don't know about the
pending-list cleanup) use a tuple sort.  But it looks like
Constantin has already done most of the work on his current proposal,
and no one has volunteered to do the work of converting it to tuple
sort instead.

Cheers,

Jeff



Re: Proposal: speeding up GIN build with parallel workers

From
Peter Geoghegan
Date:
On Sun, Jan 17, 2016 at 12:03 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
> I think it would take a lot of changes to tuple sort to make this an
> almost-always win.
>
> In the general case each GIN key occurs in many tuples, and the
> in-memory rbtree is good at compressing the tid list for each key to
> maximize the amount of data that can be in memory (although perhaps it
> could be even better, as it doesn't use varbyte encoding on the tid
> list the way the posting lists on disk do--it could do so in the bulk
> build case, where it receives the tid in order, but not feasibly in
> the pending-list clean-up case, where it doesn't get them in order).

Interesting analysis.

> When I was testing building an index on a column of text identifiers
> where there were a couple million identifiers occurring a few dozen
> times each, building it as a gin index (using btree_gin so that plain
> text columns could be indexed) was quite a bit faster than building it
> as a regular btree index using tuple sort.  I didn't really
> investigate this difference, but I assume it was due to the better
> memory usage leading to less IO.
>
> (Disclaimer: this was on those identifiers which had little difference
> in the first 8 bytes, so they didn't really benefit from the
> abbreviated keys.)

Sorry for going on about it, but I think you'd be surprised how
effective abbreviated keys are even in seemingly marginal cases.

> So I agree that a better long-term approach would probably be to
> make gin index builds (at least the bulk ones, I don't know about the
> pending-list cleanup) use a tuple sort.  But it looks like
> Constantin has already done most of the work on his current proposal,
> and no one has volunteered to do the work of converting it to tuple
> sort instead.

I'm not going to stand in the way of incremental progress,
particularly given that this looks to be a simple patch that doesn't
commit us to anything. I suspect that we should consolidate GIN and
nbtree at some point, though. I think that there are some useful
general consequences for performance that come from consolidation. For
example, my ongoing work on external sorting makes it use much of the
same infrastructure as internal sorting. Now, external sorts
automatically benefit from optimizations to internal sorting, like the
"onlyKey" quicksort optimization.

-- 
Peter Geoghegan



Re: Proposal: speeding up GIN build with parallel workers

From
Robert Haas
Date:
On Fri, Jan 15, 2016 at 5:38 PM, Constantin S. Pan <kvapen@gmail.com> wrote:
> In its current state the implementation is just a proof of concept
> and has all the configuration hardcoded, but it already works as is,
> though it does not speed up the build process more than 4 times on my
> configuration (12 CPUs). There is also a problem with temporary tables,
> for which parallel mode does not work.

I have yet to see a case where parallel query, with any current or
pending patch, gets more than about a 4x speedup.  I can't think of
any reason that there should be a wall at 4x, and I'm not sure we're
hitting the wall there for the same reason in all cases.  But your
observation corresponds to my experience.

I hope we eventually figure out how to make that much better, but I
wouldn't feel too bad about being at that spot now.  4x faster is
still a lot faster.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [WIP] speeding up GIN build with parallel workers

From
"Constantin S. Pan"
Date:
On Sat, 16 Jan 2016 01:38:39 +0300
"Constantin S. Pan" <kvapen@gmail.com> wrote:

> The task of building a GIN index can require lots of time and eats
> 100% CPU, but we could easily make it use more than 100%, especially
> since we now have parallel workers in Postgres.
>
> The process of building GIN looks like this:
>
> 1. Accumulate a batch of index records into an rbtree in maintenance
> work memory.
>
> 2. Dump the batch to disk.
>
> 3. Repeat.
>
> I have a draft implementation which divides the whole process between
> N parallel workers, see the patch attached. Instead of a full scan of
> the relation, I give each worker a range of blocks to read.
>
> This speeds up the first step N times, but slows down the second one,
> because when multiple workers dump item pointers for the same key,
> each of them has to read and decode the results of the previous one.
> That is a huge waste, but there is an idea on how to eliminate it.
>
> When it comes to dumping the next batch, a worker does not do it
> independently. Instead, it (and every other worker) sends the
> accumulated index records to the parent (backend) in ascending key
> order. The backend, which receives the records from the workers
> through shared memory, can merge them and dump each of them once,
> without the need to reread the records N-1 times.
>
> In its current state the implementation is just a proof of concept
> and has all the configuration hardcoded, but it already works as
> is, though it does not speed up the build process more than 4 times
> on my configuration (12 CPUs). There is also a problem with temporary
> tables, for which parallel mode does not work.

Hey Hackers!

I have made some progress on the proposal (see the attached patch):

0. Moved some repeated code to functions (e.g. ginDumpAccumulator,
ginbuildCommon).

1. Implemented merging of the results on the backend.

2. Disabled the use of parallel mode when creating an index on temporary
tables. There is no point in using parallel mode for temporary tables
anyway, right?

3. Added a GUC parameter to control the number of workers for GIN
building.

4. Hit the 8x speedup limit. Did some analysis of the reasons (see the
attached plot or the data file).

In order to analyze the performance issues, I did the following:

    create table t (k int, v int[]);

    create or replace
    function randarray(width int, low int, high int)
    returns int[] as
    $$
        select array(select (random()*(high-low) + low)::int
        from generate_series(1,width))
    $$ language sql;

    insert into t select k, randarray(3000, 0, 100000)
    from generate_series(1, 100000) as k;

    create index t_v_idx on t using gin (v);

This creates 100000 arrays of 3000 random numbers each. The random
numbers are in the range [0, 100000]. Then I measure how long the gin
building steps take. There are two steps: scan and merge.

The results show that the 'scan' step is sped up perfectly, but the
'merge' step takes longer as you increase the number of workers. The
profiler shows that the bottleneck here is ginMergeItemPointers(), which
I use to merge the results.

Also, I encountered a problem with workers deadlocking during
heap_open, but that seems to have been resolved by Robert Haas in his
commit regarding group locking.

Please leave your feedback!

Regards,

Constantin S. Pan
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Re: [WIP] speeding up GIN build with parallel workers

From
Oleg Bartunov
Date:


On Wed, Feb 17, 2016 at 6:55 PM, Constantin S. Pan <kvapen@gmail.com> wrote:
> I have made some progress on the proposal (see the attached patch):
> [...]
> Please leave your feedback!

My feedback is (Mac OS X 10.11.3)

set gin_parallel_workers=2;
create index message_body_idx on messages using gin(body_tsvector);
LOG:  worker process: parallel worker for PID 5689 (PID 6906) was terminated by signal 11: Segmentation fault
LOG:  terminating any other active server processes
WARNING:  terminating connection because of crash of another server process
DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
HINT:  In a moment you should be able to reconnect to the database and repeat your command.
WARNING:  terminating connection because of crash of another server process
DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
HINT:  In a moment you should be able to reconnect to the database and repeat your command.
WARNING:  terminating connection because of crash of another server process
DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
HINT:  In a moment you should be able to reconnect to the database and repeat your command.
server closed the connection unexpectedly
    This probably means the server terminated abnormally
    before or while processing the request.
The connection to the server was lost. Attempting reset: FATAL:  the database system is in recovery mode
Failed.



Re: [WIP] speeding up GIN build with parallel workers

From
Peter Geoghegan
Date:
On Wed, Feb 17, 2016 at 7:55 AM, Constantin S. Pan <kvapen@gmail.com> wrote:
> 4. Hit the 8x speedup limit. Did some analysis of the reasons (see the
> attached plot or the data file).

Did you actually compare this to the master branch? I wouldn't like to
assume that the one worker case was equivalent. Obviously that's the
really interesting baseline.


-- 
Peter Geoghegan



Re: [WIP] speeding up GIN build with parallel workers

From
"Constantin S. Pan"
Date:
I was testing with the number of workers starting at 0. The 0 case is
(in theory) equivalent to the master branch. But I should certainly
compare to the master branch, you are right. Will do that shortly.

On Wed, 17 Feb 2016 12:26:05 -0800
Peter Geoghegan <pg@heroku.com> wrote:

> On Wed, Feb 17, 2016 at 7:55 AM, Constantin S. Pan <kvapen@gmail.com>
> wrote:
> > 4. Hit the 8x speedup limit. Did some analysis of the reasons (see
> > the attached plot or the data file).
> 
> Did you actually compare this to the master branch? I wouldn't like to
> assume that the one worker case was equivalent. Obviously that's the
> really interesting baseline.
> 
> 



Re: [WIP] speeding up GIN build with parallel workers

From
"Constantin S. Pan"
Date:
On Wed, 17 Feb 2016 23:01:47 +0300
Oleg Bartunov <obartunov@gmail.com> wrote:

> My feedback is (Mac OS X 10.11.3)
>
> set gin_parallel_workers=2;
> create index message_body_idx on messages using gin(body_tsvector);
> LOG:  worker process: parallel worker for PID 5689 (PID 6906) was
> terminated by signal 11: Segmentation fault

Fixed this, try the new patch. The bug was in incorrect handling
of some GIN categories.

Re: [WIP] speeding up GIN build with parallel workers

From
David Steele
Date:
On 2/18/16 10:10 AM, Constantin S. Pan wrote:
> On Wed, 17 Feb 2016 23:01:47 +0300
> Oleg Bartunov <obartunov@gmail.com> wrote:
>
>> My feedback is (Mac OS X 10.11.3)
>>
>> set gin_parallel_workers=2;
>> create index message_body_idx on messages using gin(body_tsvector);
>> LOG:  worker process: parallel worker for PID 5689 (PID 6906) was
>> terminated by signal 11: Segmentation fault
>
> Fixed this, try the new patch. The bug was in incorrect handling
> of some GIN categories.

Oleg, it looks like Constantin has updated the patch to address the issue
you were seeing.  Do you have time to retest and review?

Thanks,
-- 
-David
david@pgmasters.net



Re: [WIP] speeding up GIN build with parallel workers

From
"Constantin S. Pan"
Date:
On Mon, 14 Mar 2016 08:42:26 -0400
David Steele <david@pgmasters.net> wrote:

> On 2/18/16 10:10 AM, Constantin S. Pan wrote:
> > On Wed, 17 Feb 2016 23:01:47 +0300
> > Oleg Bartunov <obartunov@gmail.com> wrote:
> >
> >> My feedback is (Mac OS X 10.11.3)
> >>
> >> set gin_parallel_workers=2;
> >> create index message_body_idx on messages using gin(body_tsvector);
> >> LOG:  worker process: parallel worker for PID 5689 (PID 6906) was
> >> terminated by signal 11: Segmentation fault
> >
> > Fixed this, try the new patch. The bug was in incorrect handling
> > of some GIN categories.
>
> Oleg, it looks like Constantin has updated the patch to address the
> issue you were seeing.  Do you have time to retest and review?
>
> Thanks,

Actually, there has been some progress since then. The patch is
attached.

1. Added another GUC parameter for changing the amount of
shared memory for parallel GIN workers.

2. Changed the way results are merged. It uses a shared memory
message queue now.

3. Tested on some real data (GIN index on email message body
tsvectors). Here are the timings for different values of
'gin_shared_mem' and 'gin_parallel_workers' on a 4-CPU
machine. Seems 'gin_shared_mem' has no visible effect.

wnum mem(MB) time(s)
   0      16     247
   1      16     256
   2      16     126
   4      16      89
   0      32     247
   1      32     270
   2      32     123
   4      32      92
   0      64     254
   1      64     272
   2      64     123
   4      64      88
   0     128     250
   1     128     263
   2     128     126
   4     128      85
   0     256     247
   1     256     269
   2     256     130
   4     256      88
   0     512     257
   1     512     275
   2     512     129
   4     512      92
   0    1024     255
   1    1024     273
   2    1024     130
   4    1024      90

On Wed, 17 Feb 2016 12:26:05 -0800
Peter Geoghegan <pg@heroku.com> wrote:

> On Wed, Feb 17, 2016 at 7:55 AM, Constantin S. Pan <kvapen@gmail.com>
> wrote:
> > 4. Hit the 8x speedup limit. Did some analysis of the reasons (see
> > the attached plot or the data file).
>
> Did you actually compare this to the master branch? I wouldn't like to
> assume that the one worker case was equivalent. Obviously that's the
> really interesting baseline.

Compared with the master branch. The case of 0 workers is
indeed equivalent to the master branch.

Regards,
Constantin

Re: [WIP] speeding up GIN build with parallel workers

From
Amit Kapila
Date:
On Wed, Mar 16, 2016 at 5:41 AM, Constantin S. Pan <kvapen@gmail.com> wrote:
> Actually, there has been some progress since then. The patch is
> attached.
> [...]
> 3. Tested on some real data (GIN index on email message body
> tsvectors). Here are the timings for different values of
> 'gin_shared_mem' and 'gin_parallel_workers' on a 4-CPU
> machine. Seems 'gin_shared_mem' has no visible effect.
>
> wnum mem(MB) time(s)
>    0      16     247
>    1      16     256


It seems from your data that with 1 worker you are always seeing a slowdown; have you investigated the reason for it?

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: [WIP] speeding up GIN build with parallel workers

From
"Constantin S. Pan"
Date:
On Wed, 16 Mar 2016 12:14:51 +0530
Amit Kapila <amit.kapila16@gmail.com> wrote:

> On Wed, Mar 16, 2016 at 5:41 AM, Constantin S. Pan <kvapen@gmail.com>
> wrote:
> 
> > On Mon, 14 Mar 2016 08:42:26 -0400
> > David Steele <david@pgmasters.net> wrote:
> >
> > > On 2/18/16 10:10 AM, Constantin S. Pan wrote:
> > > > On Wed, 17 Feb 2016 23:01:47 +0300
> > > > Oleg Bartunov <obartunov@gmail.com> wrote:
> > > >
> > > >> My feedback is (Mac OS X 10.11.3)
> > > >>
> > > >> set gin_parallel_workers=2;
> > > >> create index message_body_idx on messages using
> > > >> gin(body_tsvector); LOG:  worker process: parallel worker for
> > > >> PID 5689 (PID 6906) was terminated by signal 11: Segmentation
> > > >> fault
> > > >
> > > > Fixed this, try the new patch. The bug was in incorrect handling
> > > > of some GIN categories.
> > >
> > > Oleg, it looks like Constantin has updated the patch to address the
> > > issue you were seeing.  Do you have time to retest and review?
> > >
> > > Thanks,
> >
> > Actually, there was some progress since. The patch is
> > attached.
> >
> > 1. Added another GUC parameter for changing the amount of
> > shared memory for parallel GIN workers.
> >
> > 2. Changed the way results are merged. It uses shared memory
> > message queue now.
> >
> > 3. Tested on some real data (GIN index on email message body
> > tsvectors). Here are the timings for different values of
> > 'gin_shared_mem' and 'gin_parallel_workers' on a 4-CPU
> > machine. Seems 'gin_shared_mem' has no visible effect.
> >
> > wnum mem(MB) time(s)
> >    0      16     247
> >    1      16     256
> >
> 
> 
> It seems from your data that with 1 worker you are always seeing a
> slowdown; have you investigated the reason for it?
> 
> With Regards,
> Amit Kapila.
> EnterpriseDB: http://www.enterprisedb.com

That slowdown is expected. It slows down because with 1 worker it
has to transfer the results from the worker to the backend.

The backend just waits for the results from the workers and merges them
(in case wnum > 0). So the 1-worker configuration should never be used,
because it is as sequential as the 0-worker, but adds data transfer.

Regards,

Constantin S. Pan
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: [WIP] speeding up GIN build with parallel workers

From
Peter Geoghegan
Date:
On Wed, Mar 16, 2016 at 2:25 AM, Constantin S. Pan <kvapen@gmail.com> wrote:
> The backend just waits for the results from the workers and merges them
> (in case wnum > 0). So the 1-worker configuration should never be used,
> because it is as sequential as the 0-worker, but adds data transfer.

This is why I wanted an easy way of atomically guaranteeing some
number of workers (typically 2), or not using parallelism at all. I
think the parallel worker API should offer a simple way to do that in
cases like this, where having only 1 worker is never going to win.

-- 
Peter Geoghegan



Re: [WIP] speeding up GIN build with parallel workers

From
"Constantin S. Pan"
Date:
On Wed, 16 Mar 2016 02:43:47 -0700
Peter Geoghegan <pg@heroku.com> wrote:

> On Wed, Mar 16, 2016 at 2:25 AM, Constantin S. Pan <kvapen@gmail.com>
> wrote:
> > The backend just waits for the results from the workers and merges
> > them (in case wnum > 0). So the 1-worker configuration should never
> > be used, because it is as sequential as the 0-worker, but adds data
> > transfer.
> 
> This is why I wanted an easy way of atomically guaranteeing some
> number of workers (typically 2), or not using parallelism at all. I
> think the parallel worker API should offer a simple way to do that in
> cases like this, where having only 1 worker is never going to win.

Well, we can check the number of workers actually launched and revert
back to the single-backend way when there are fewer than 2 workers. Let
me code that in.

Regards,

Constantin S. Pan
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: [WIP] speeding up GIN build with parallel workers

From
Amit Kapila
Date:
On Wed, Mar 16, 2016 at 2:55 PM, Constantin S. Pan <kvapen@gmail.com> wrote:
>
> On Wed, 16 Mar 2016 12:14:51 +0530
> Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> > On Wed, Mar 16, 2016 at 5:41 AM, Constantin S. Pan <kvapen@gmail.com>
> > wrote:
> > > 3. Tested on some real data (GIN index on email message body
> > > tsvectors). Here are the timings for different values of
> > > 'gin_shared_mem' and 'gin_parallel_workers' on a 4-CPU
> > > machine. Seems 'gin_shared_mem' has no visible effect.
> > >
> > > wnum mem(MB) time(s)
> > >    0      16     247
> > >    1      16     256
> > >
> >
> >
> > It seems from your data that with 1 worker you are always seeing a
> > slowdown; have you investigated the reason for it?
> >
>
> That slowdown is expected. It slows down because with 1 worker it
> has to transfer the results from the worker to the backend.
>
> The backend just waits for the results from the workers and merges them
> (in case wnum > 0).

Why does the backend just wait? Why can't it do the same work as any worker does?  In general, for other parallelism features the backend also behaves the same way as a worker, producing results when results from the workers are not available.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: [WIP] speeding up GIN build with parallel workers

From
"Constantin S. Pan"
Date:
On Wed, 16 Mar 2016 18:08:38 +0530
Amit Kapila <amit.kapila16@gmail.com> wrote:

> On Wed, Mar 16, 2016 at 2:55 PM, Constantin S. Pan <kvapen@gmail.com>
> wrote:
> >
> > On Wed, 16 Mar 2016 12:14:51 +0530
> > Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > > On Wed, Mar 16, 2016 at 5:41 AM, Constantin S. Pan
> > > <kvapen@gmail.com> wrote:
> > > > 3. Tested on some real data (GIN index on email message body
> > > > tsvectors). Here are the timings for different values of
> > > > 'gin_shared_mem' and 'gin_parallel_workers' on a 4-CPU
> > > > machine. Seems 'gin_shared_mem' has no visible effect.
> > > >
> > > > wnum mem(MB) time(s)
> > > >    0      16     247
> > > >    1      16     256
> > > >
> > >
> > >
> > > It seems from your data that with 1 worker you are always seeing
> > > a slowdown; have you investigated the reason for it?
> > >
> >
> > That slowdown is expected. It slows down because with 1 worker it
> > has to transfer the results from the worker to the backend.
> >
> > The backend just waits for the results from the workers and merges
> > them (in case wnum > 0).
> 
> Why does the backend just wait? Why can't it do the same work as any
> worker does?  In general, for other parallelism features the backend
> also behaves the same way as a worker, producing results when results
> from the workers are not available.

We can make the backend do the same work as any worker, but that
will complicate the code for less than a 2% performance boost.
This issue matters even less as you increase the
number of workers N, since you save only 1/N-th of the
transfer time.

Is the backend waiting for the workers really such a bad practice
that it should be avoided at all costs, or can we leave it as is?

Regards,

Constantin S. Pan
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: [WIP] speeding up GIN build with parallel workers

From
Amit Kapila
Date:
On Wed, Mar 16, 2016 at 7:50 PM, Constantin S. Pan <kvapen@gmail.com> wrote:
>
> On Wed, 16 Mar 2016 18:08:38 +0530
> Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> >
> > Why does the backend just wait? Why can't it do the same work as any
> > worker does?  In general, for other parallelism features the backend
> > also behaves the same way as a worker, producing results when
> > results from the workers are not available.
>
> > We can make the backend do the same work as any worker, but that
> > will complicate the code for less than a 2% performance boost.

Why do you think it will be just 2%?  I think for the single-worker case it should be much more, as the master backend will be less busy consuming tuples from the tuple queue.  I can't say much about code complexity, as I haven't yet looked carefully at the logic of the patch, but we didn't find much difficulty while doing it for parallel scans.  One commit which might help you understand how heap scans are currently parallelised is ee7ca559fcf404f9a3bd99da85c8f4ea9fbc2e92; you can see if that helps you in any way in writing a generic API for GIN parallel builds.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: [WIP] speeding up GIN build with parallel workers

From
"Constantin S. Pan"
Date:
On Thu, 17 Mar 2016 13:21:32 +0530
Amit Kapila <amit.kapila16@gmail.com> wrote:

> On Wed, Mar 16, 2016 at 7:50 PM, Constantin S. Pan <kvapen@gmail.com>
> wrote:
> >
> > On Wed, 16 Mar 2016 18:08:38 +0530
> > Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > >
> > > Why does the backend just wait? Why can't it do the same work as
> > > any worker does?  In general, for other parallelism features the
> > > backend also behaves the same way as a worker, producing results
> > > when results from the workers are not available.
> >
> > We can make the backend do the same work as any worker, but that
> > will complicate the code for less than a 2% performance boost.
> 
> Why do you think it will be just 2%?  I think for the single-worker
> case it should be much more, as the master backend will be less busy
> consuming tuples from the tuple queue.  I can't say much about
> code complexity, as I haven't yet looked carefully at the logic of the
> patch, but we didn't find much difficulty while doing it for parallel
> scans.  One commit which might help you understand how heap scans are
> currently parallelised is
> ee7ca559fcf404f9a3bd99da85c8f4ea9fbc2e92; you can see if that helps
> you in any way in writing a generic API for GIN parallel builds.

I looked at the timing details some time ago, which showed
that the backend spent about 1% of total time on data
transfer from 1 worker, and 3% on transfer and merging from
2 workers. So if we use (active backend + 1 worker) instead
of (passive backend + 2 workers), we still have to spend
1.5% on transfer and merging.

Or we can look at these measurements (from yesterday's
message):

wnum mem(MB) time(s)
   0      16     247
   1      16     256
   2      16     126

If 2 workers didn't have to transfer and merge their
results, they would have finished in 247 / 2 = 123.5
seconds. But the transfer and merging took another 2.5
seconds. The merging takes a little longer than the
transfer. If we now use backend+worker we get rid of 1
transfer, but still have to do 1 transfer and then merge, so
we will save less than a quarter of those 2.5 seconds.

In other words, we gain almost nothing by teaching the
backend how to be a worker.

Regards,

Constantin S. Pan
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: [WIP] speeding up GIN build with parallel workers

From
Amit Kapila
Date:
On Thu, Mar 17, 2016 at 2:56 PM, Constantin S. Pan <kvapen@gmail.com> wrote:
>
> On Thu, 17 Mar 2016 13:21:32 +0530
> Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> > On Wed, Mar 16, 2016 at 7:50 PM, Constantin S. Pan <kvapen@gmail.com>
> > wrote:
> > >
> > > On Wed, 16 Mar 2016 18:08:38 +0530
> > > Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > >
> > > > Why does the backend just wait? Why can't it do the same work
> > > > as any worker does?  In general, for other parallelism features
> > > > the backend also behaves the same way as a worker, producing
> > > > results when results from the workers are not available.
> > >
> > > We can make the backend do the same work as any worker, but that
> > > will complicate the code for less than a 2% performance boost.
> >
> > Why do you think it will be just 2%?  I think for the single-worker
> > case it should be much more, as the master backend will be less busy
> > consuming tuples from the tuple queue.  I can't say much about
> > code complexity, as I haven't yet looked carefully at the logic of
> > the patch, but we didn't find much difficulty while doing it for
> > parallel scans.  One commit which might help you understand how heap
> > scans are currently parallelised is
> > ee7ca559fcf404f9a3bd99da85c8f4ea9fbc2e92; you can see if that helps
> > you in any way in writing a generic API for GIN parallel builds.
>
> I looked at the timing details some time ago, which showed
> that the backend spent about 1% of total time on data
> transfer from 1 worker, and 3% on transfer and merging from
> 2 workers. So if we use (active backend + 1 worker) instead
> of (passive backend + 2 workers), we still have to spend
> 1.5% on transfer and merging.
>

I think the comparison here should be between (active backend + 1 worker) and (passive backend + 1 worker), or (active backend + 2 workers) and (passive backend + 2 workers).  I don't think it is a good assumption that workers are always freely available and that you can use them as and when required for any operation.

>
> Or we can look at these measurements (from yesterday's
> message):
>
> wnum mem(MB) time(s)
>    0      16     247
>    1      16     256
>    2      16     126
>
> If 2 workers didn't have to transfer and merge their
> results, they would have finished in 247 / 2 = 123.5
> seconds. But the transfer and merging took another 2.5
> seconds. The merging takes a little longer than the
> transfer. If we now use backend+worker we get rid of 1
> transfer, but still have to do 1 transfer and then merge, so
> we will save less than a quarter of those 2.5 seconds.
>

If I understand the above data correctly, it seems to indicate that the majority of the work is done in processing the data, so I think it would be better if the master and the workers can work together.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: [WIP] speeding up GIN build with parallel workers

From
Robert Haas
Date:
On Thu, Mar 17, 2016 at 11:42 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> I think the comparison here should be between (active backend + 1
> worker) and (passive backend + 1 worker), or (active backend + 2
> workers) and (passive backend + 2 workers).  I don't think it is a
> good assumption that workers are always freely available and that you
> can use them as and when required for any operation.

Strong +1.  The pool of background workers is necessarily quite
limited and you can't just gobble them up.  I'm not saying that it's
absolutely essential that the leader can also participate, but saying
that 1 active leader + 1 worker is only 2% faster than 1 passive
leader + 2 workers is not comparing apples to apples.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [WIP] speeding up GIN build with parallel workers

From
Dmitry Ivanov
Date:
Hi Constantin,

I did a quick review of your patch, and here are my comments:

- This patch applies cleanly to the current HEAD (61d2ebdbf91).

- Code compiles without warnings.

- Currently there's no documentation regarding the parallel gin build
feature and the provided GUC variables.

- Built indexes seem to work normally.


Performance
-----------

I've made a few runs on my laptop (i7-4710HQ, default postgresql.conf), here 
are the results:
    workers    avg. time (s)
    0            412
    4            133
    8            81

Looks like 8 workers & a backend do the job 5x faster than a sole
backend, which is good!


Code style
----------

There are some things that you've probably overlooked, such as:

> task->heap_oid = heap->rd_id;
> task->index_oid = index->rd_id;

You could replace direct access to the 'rd_id' field with the RelationGetRelid
macro.

> static void ginDumpEntry(GinState *ginstate,
>                          volatile WorkerResult *r

Parameter 'r' is unused, you could remove it.

Some of the functions and pieces of code that you've added do not comply
with the formatting conventions, e.g.

> static int claimSomeBlocks(...
> static GinEntryStack *pushEntry(

>> // The message consists of 2 or 3 parts. iovec allows us to send them as

etc.

Keep up the good work!



Re: [WIP] speeding up GIN build with parallel workers

From
"Constantin S. Pan"
Date:
On Fri, 18 Mar 2016 20:40:16 +0300
Dmitry Ivanov <d.ivanov@postgrespro.ru> wrote:

> Hi Constantin,
> 
> I did a quick review of your patch, and here are my comments:
> 
> - This patch applies cleanly to the current HEAD (61d2ebdbf91).
> 
> - Code compiles without warnings.
> 
> - Currently there's no documentation regarding the parallel gin build
> feature and the provided GUC variables.
> 
> - Built indexes seem to work normally.
> 
> 
> Performance
> -----------
> 
> I've made a few runs on my laptop (i7-4710HQ, default
> postgresql.conf), here are the results:
> 
>     workers    avg. time (s)
>     0            412
>     4            133
>     8            81
> 
> Looks like 8 workers & a backend do the job 5x faster than a
> sole backend, which is good!
> 
> 
> Code style
> ----------
> 
> There are some things that you've probably overlooked, such as:
> 
> > task->heap_oid = heap->rd_id;
> > task->index_oid = index->rd_id;
> 
> You could replace direct access to the 'rd_id' field with the
> RelationGetRelid macro.
> 
> > static void ginDumpEntry(GinState *ginstate,
> >                          volatile
> > WorkerResult *r
> 
> Parameter 'r' is unused, you could remove it.
> 
> Some of the functions and pieces of code that you've added do not
> comply with the formatting conventions, e.g.
> 
> > static int claimSomeBlocks(...
> > static GinEntryStack *pushEntry(
> 
> >> // The message consists of 2 or 3 parts. iovec allows us to send
> >> them as
> 
> etc.
> 
> Keep up the good work!

Hi Dmitry,

Thank you for the review. I am working on a new version that will fix
the code style and also involve the backend in the process as one of
the workers.



Re: [WIP] speeding up GIN build with parallel workers

From
"Constantin S. Pan"
Date:
On Fri, 18 Mar 2016 20:40:16 +0300
Dmitry Ivanov <d.ivanov@postgrespro.ru> wrote:

> - Currently there's no documentation regarding the parallel gin build
> feature and the provided GUC variables.

> You could replace direct access to the 'rd_id' field with the
> RelationGetRelid macro.

> Parameter 'r' is unused, you could remove it.

> Some of the functions and pieces of code that you've added do not
> comply with the formatting conventions, e.g.

On Fri, 18 Mar 2016 11:18:46 -0400
Robert Haas <robertmhaas@gmail.com> wrote:

> Strong +1.  The pool of background workers is necessarily quite
> limited and you can't just gobble them up.  I'm not saying that it's
> absolutely essential that the leader can also participate, but saying
> that 1 active leader + 1 worker is only 2% faster than 1 passive
> leader + 2 workers is not comparing apples to apples.

On Fri, 18 Mar 2016 09:12:59 +0530
Amit Kapila <amit.kapila16@gmail.com> wrote:

> If I understand the above data correctly, it seems to indicate that
> the majority of the work is done in processing the data, so I think
> it would be better if the master and the workers can work together.

Thank you for the support and feedback! Fixed all of these in the new
version. Also added new queries to the 'gin' regression test, which
immediately helped me catch one silly bug. The new patch is
attached.

Regards,

Constantin S. Pan
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Re: [PATCH] speeding up GIN build with parallel workers

From
"Constantin S. Pan"
Date:
Here is a new version of the patch, which:

1. Fixes some minor stylistic issues.

2. Uses binaryheap (instead of a custom ugly stack) for merging.

Regards,

Constantin S. Pan
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [PATCH] speeding up GIN build with parallel workers

From
Robert Haas
Date:
On Fri, Apr 8, 2016 at 10:04 AM, Constantin S. Pan <kvapen@gmail.com> wrote:
> Here is a new version of the patch, which:
>
> 1. Fixes some minor stylistic issues.
>
> 2. Uses binaryheap (instead of a custom ugly stack) for merging.

I think we need to push this patch out to 9.7.  This code has had a
little review here and there from various people, but clearly not
enough to push it into the tree at the very last minute.  I don't
think we even have enough discussion to conclude that things like the
gin_shared_mem GUCs are good ideas rather than bad ones.

Also, I personally find this code to be extremely low on comments.
There's barely any explanation of what the overall methodology is
here.  Most of the functions do not have a header comment explaining
what they do.  The code hasn't been run through pgindent.  There's no
check to see whether the computation that will be done inside the GIN
index is parallel-safe; what if it's an expression index on an unsafe
function?  Opening the heap and index with no lock in the worker is
wrong; the worker should use the same lock mode as the leader.

That's just on a quick read-through; I'm sure there's more.  I'm going
to move this to the next CommitFest.  Hopefully someone will volunteer
to do some serious review of this, because the feature sounds cool.
Possibly that person will even be me.  But it will not be today.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Proposal: speeding up GIN build with parallel workers

From
Heikki Linnakangas
Date:
On 01/17/2016 10:03 PM, Jeff Janes wrote:
> On Fri, Jan 15, 2016 at 3:29 PM, Peter Geoghegan <pg@heroku.com> wrote:
>> On Fri, Jan 15, 2016 at 2:38 PM, Constantin S. Pan <kvapen@gmail.com> wrote:
>>> I have a draft implementation which divides the whole process between
>>> N parallel workers, see the patch attached. Instead of a full scan of
>>> the relation, I give each worker a range of blocks to read.
>>
>> I am currently working on a patch that allows B-Tree index builds to
>> be performed in parallel. I think I'm a week or two away from posting
>> it.
>>
>> Even without parallelism, wouldn't it be better if GIN indexes were
>> built using tuplesort? I know way way less about the gin am than the
>> nbtree am, but I imagine that a prominent cost for GIN index builds is
>> constructing the main B-Tree (the one that's constructed over key
>> values) itself. Couldn't tuplesort.c be adapted to cover this case?
>> That would be much faster in general, particularly with the recent
>> addition of abbreviated keys, while also leaving a clear path forward
>> to performing the build in parallel.
>
> I think it would take a lot of changes to tuple sort to make this an
> almost-always win.
>
> In the general case each GIN key occurs in many tuples, and the
> in-memory rbtree is good at compressing the tid list for each key to
> maximize the amount of data that can be in memory (although perhaps it
> could be even better, as it doesn't use varbyte encoding on the tid
> list the way the posting lists on disk do--it could do so in the bulk
> build case, where it receives the tid in order, but not feasibly in
> the pending-list clean-up case, where it doesn't get them in order).
>
> When I was testing building an index on a column of text identifiers
> where there were a couple million identifiers occurring a few dozen
> times each, building it as a gin index (using btree_gin so that plain
> text columns could be indexed) was quite a bit faster than building it
> as a regular btree index using tuple sort.  I didn't really
> investigate this difference, but I assume it was due to the better
> memory usage leading to less IO.

I've been wondering about this too, using tuplesort to build GIN 
indexes, for a long time. Surely the highly-optimized quicksort+merge 
code in tuplesort.c is faster than maintaining a red-black tree? Or so I 
thought.

I wrote a quick prototype of using tuplesort.c for GIN index build. I 
tested it with a 500 MB table of integer arrays, so that the sort fit 
completely in memory. It's a lot slower than the current code. It turns 
out eliminating the duplicates early is really really important.

So we want to keep the red-black tree, to eliminate the duplicates. Or 
add that capability to tuplesort.c, which might speed up Sort+Unique and 
aggregates in general, but that's a big effort.

However, I still wonder, if we shouldn't use a merge approach when the 
tree doesn't fit in memory, like tuplesort.c does. Currently, when the 
tree is full, we flush it out to the index, by inserting all the 
entries. That can get quite expensive, I/O-wise. It also generates more 
WAL, compared to writing each page only once.

If we flushed the tree to a tape instead, then we could perhaps use the 
machinery that Peter's parallel B-tree patch is adding to tuplesort.c, 
to merge the tapes. I'm not sure if that works out, but I think it's 
worth some experimentation.

> But I do think this, with the aggregation, would be worth it despite
> the amount of work involved.  In addition to automatically benefiting
> from parallel sorts and any other future sort improvements, I think it
> would also generate better indexes.  I have a theory that a problem
> with very large gin indexes is that repeatedly building
> maintenance_work_mem worth of rbtree entries and then merging them to
> disk yields highly fragmented btrees, in which logically adjacent
> key-space does not occupy physically nearby leaf pages, which can then
> cause problems either with access that follows a pattern (like range
> scans, except gin indexes can't really do those anyway) or with
> further bulk operations.

Yeah, there's that too.

- Heikki




Re: Proposal: speeding up GIN build with parallel workers

From
Michael Paquier
Date:
On Wed, Sep 14, 2016 at 3:48 PM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> If we flushed the tree to a tape instead, then we could perhaps use the
> machinery that Peter's parallel B-tree patch is adding to tuplesort.c, to
> merge the tapes. I'm not sure if that works out, but I think it's worth some
> experimentation.

Marking as returned with feedback, mainly per this comment.
-- 
Michael