Thread: Proposal: speeding up GIN build with parallel workers
Hi, Hackers!

The task of building GIN can require lots of time and eats 100% CPU,
but we could easily make it use more than 100%, especially since we now
have parallel workers in postgres.

The process of building GIN looks like this:

1. Accumulate a batch of index records into an rbtree in maintenance
work memory.

2. Dump the batch to disk.

3. Repeat.

I have a draft implementation which divides the whole process between
N parallel workers, see the patch attached. Instead of a full scan of
the relation, I give each worker a range of blocks to read.

This speeds up the first step N times, but slows down the second one,
because when multiple workers dump item pointers for the same key, each
of them has to read and decode the results of the previous one. That is
a huge waste, but there is an idea on how to eliminate it.

When it comes to dumping the next batch, a worker does not do it
independently. Instead, it (and every other worker) sends the
accumulated index records to the parent (backend) in ascending key
order. The backend, which receives the records from the workers through
shared memory, can merge them and dump each of them once, without the
need to reread the records N-1 times.

In its current state the implementation is just a proof of concept and
it has all the configuration hardcoded, but it already works as is,
though it does not speed up the build process more than 4 times on my
configuration (12 CPUs). There is also a problem with temporary tables,
for which the parallel mode does not work.

Please leave your feedback.

Regards,

Constantin S. Pan
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
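To make the block-range split concrete, here is a minimal standalone C
sketch; the type and function names are invented for illustration and
are not taken from the attached patch:

/* Illustrative only: give each of nworkers a contiguous range of the
 * relation's nblocks heap blocks to scan. */
#include <stdint.h>

typedef uint32_t BlockNumber;

typedef struct GinBuildTask
{
    BlockNumber start;          /* first block this worker reads */
    BlockNumber end;            /* one past the last block it reads */
} GinBuildTask;

static void
assign_block_ranges(BlockNumber nblocks, int nworkers, GinBuildTask *tasks)
{
    BlockNumber per_worker = (nblocks + nworkers - 1) / nworkers;
    int         i;

    for (i = 0; i < nworkers; i++)
    {
        BlockNumber lo = (BlockNumber) i * per_worker;
        BlockNumber hi = lo + per_worker;

        tasks[i].start = (lo < nblocks) ? lo : nblocks;
        tasks[i].end = (hi < nblocks) ? hi : nblocks;
    }
}

Each worker then scans only its own [start, end) range and builds its
private rbtree from what it finds there.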
Attachment
On Fri, Jan 15, 2016 at 2:38 PM, Constantin S. Pan <kvapen@gmail.com> wrote:
> I have a draft implementation which divides the whole process between
> N parallel workers, see the patch attached. Instead of a full scan of
> the relation, I give each worker a range of blocks to read.

I am currently working on a patch that allows B-Tree index builds to be
performed in parallel. I think I'm a week or two away from posting it.

Even without parallelism, wouldn't it be better if GIN indexes were
built using tuplesort? I know way way less about the gin am than the
nbtree am, but I imagine that a prominent cost for GIN index builds is
constructing the main B-Tree (the one that's constructed over key
values) itself. Couldn't tuplesort.c be adapted to cover this case?
That would be much faster in general, particularly with the recent
addition of abbreviated keys, while also leaving a clear path forward
to performing the build in parallel.

I understand that a long term ambition for the gin am is to merge it
with nbtree, to almost automatically benefit from enhancements, and to
reduce the maintenance burden of each.

--
Peter Geoghegan
On Fri, 15 Jan 2016 15:29:51 -0800
Peter Geoghegan <pg@heroku.com> wrote:
> Even without parallelism, wouldn't it be better if GIN indexes were
> built using tuplesort? I know way way less about the gin am than the
> nbtree am, but I imagine that a prominent cost for GIN index builds is
> constructing the main B-Tree (the one that's constructed over key
> values) itself. Couldn't tuplesort.c be adapted to cover this case?
> That would be much faster in general, particularly with the recent
> addition of abbreviated keys, while also leaving a clear path forward
> to performing the build in parallel.

While building GIN we need a quick way to update the posting list of
the same key, and this is where the rbtree comes to the rescue. Using
tuplesort would require a completely different approach to building the
index: dump (key, itempointer) pairs into a tuplesort heap, then sort
the heap and merge the itempointers for the same key values.

Both the rbtree and sorting require N log N operations, and abbreviated
keys will not help here, because GIN is used for the case where there
are lots of repeated keys. The benefit of tuplesort is that it would be
better for huge data that does not fit into memory, but on the other
hand it would need twice as much free disk space for sorting as the
data itself takes. Are we ready for such a cost?

I think we have to experiment with both approaches, and see how it
goes. What are your thoughts?

Regards,

Constantin S. Pan
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
On Fri, Jan 15, 2016 at 3:29 PM, Peter Geoghegan <pg@heroku.com> wrote:
> On Fri, Jan 15, 2016 at 2:38 PM, Constantin S. Pan <kvapen@gmail.com> wrote:
>> I have a draft implementation which divides the whole process between
>> N parallel workers, see the patch attached. Instead of a full scan of
>> the relation, I give each worker a range of blocks to read.
>
> I am currently working on a patch that allows B-Tree index builds to
> be performed in parallel. I think I'm a week or two away from posting
> it.
>
> Even without parallelism, wouldn't it be better if GIN indexes were
> built using tuplesort? I know way way less about the gin am than the
> nbtree am, but I imagine that a prominent cost for GIN index builds is
> constructing the main B-Tree (the one that's constructed over key
> values) itself. Couldn't tuplesort.c be adapted to cover this case?
> That would be much faster in general, particularly with the recent
> addition of abbreviated keys, while also leaving a clear path forward
> to performing the build in parallel.

I think it would take a lot of changes to tuplesort to make this an
almost-always win.

In the general case each GIN key occurs in many tuples, and the
in-memory rbtree is good at compressing the tid list for each key to
maximize the amount of data that can be in memory (although perhaps it
could be even better, as it doesn't use varbyte encoding on the tid
list the way the posting lists on disk do--it could do so in the bulk
build case, where it receives the tids in order, but not feasibly in
the pending-list clean-up case, where it doesn't get them in order).

When I was testing building an index on a column of text identifiers
where there were a couple million identifiers occurring a few dozen
times each, building it as a gin index (using btree_gin so that plain
text columns could be indexed) was quite a bit faster than building it
as a regular btree index using tuplesort. I didn't really investigate
this difference, but I assume it was due to the better memory usage
leading to less IO.

(Disclaimer: this was on identifiers which had little difference in the
first 8 bytes, so they didn't really benefit from the abbreviated
keys.)

So in addition to shimming (key, tid) pairs to look like something that
can be fed to tuplesort, I think we would also need to make tuplesort
do on-the-fly aggregation when it finds adjacent equal sort columns, so
it can make (key, tid[]) structures rather than (key, tid). Without
that, I think using tuplesort would be a loss at least as often as it
would be a win.

But I do think this with aggregation would be worth it despite the
amount of work involved. In addition to automatically benefiting from
parallel sorts and any other future sort improvements, I think it would
also generate better indexes. I have a theory that a problem with very
large gin indexes is that repeatedly building maintenance_work_mem
worth of rbtree entries and then merging them to disk yields highly
fragmented btrees, in which logically adjacent key-space does not
occupy physically nearby leaf pages, which can then cause problems
either with access that follows a pattern (like range scans, except gin
indexes can't really do those anyway) or with further bulk operations.

So I agree that a better long-term approach would probably be to make
gin index builds (at least the bulk ones, I don't know about the
pending-list cleanup) use a tuplesort.
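To make the on-the-fly aggregation idea concrete, here is a rough
standalone C sketch with invented types: given a stream of (key, tid)
pairs already sorted by key and then tid, adjacent equal keys collapse
into a single posting entry:

/* Illustrative only: collapse a (key, tid) stream, sorted by key then
 * tid, into (key, tid-array) entries. Error checks omitted. */
#include <stdint.h>
#include <stdlib.h>

typedef struct KeyTid { int64_t key; uint64_t tid; } KeyTid;

typedef struct PostingEntry
{
    int64_t   key;
    uint64_t *tids;
    size_t    ntids;
    size_t    cap;
} PostingEntry;

static void
emit(PostingEntry *e)
{
    /* here the entry would be written out as one (key, posting list) */
}

static void
aggregate_sorted(const KeyTid *in, size_t n)
{
    PostingEntry cur = {0};

    for (size_t i = 0; i < n; i++)
    {
        if (i == 0 || in[i].key != cur.key)
        {
            if (i != 0)
                emit(&cur);         /* previous key is complete */
            cur.key = in[i].key;
            cur.ntids = 0;          /* reuse the tid buffer */
        }
        if (cur.ntids == cur.cap)
        {
            cur.cap = cur.cap ? cur.cap * 2 : 8;
            cur.tids = realloc(cur.tids, cur.cap * sizeof(uint64_t));
        }
        cur.tids[cur.ntids++] = in[i].tid;
    }
    if (n > 0)
        emit(&cur);
    free(cur.tids);
}

That is essentially what the rbtree gives the current build for free,
which is why plain tuplesort without this step tends to lose.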
But it looks like Constantin has already done most of the work on his
current proposal, and no one has volunteered to do the work on
converting it to tuplesort instead.

Cheers,

Jeff
On Sun, Jan 17, 2016 at 12:03 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
> I think it would take a lot of changes to tuplesort to make this an
> almost-always win.
>
> In the general case each GIN key occurs in many tuples, and the
> in-memory rbtree is good at compressing the tid list for each key to
> maximize the amount of data that can be in memory.

Interesting analysis.

> When I was testing building an index on a column of text identifiers
> where there were a couple million identifiers occurring a few dozen
> times each, building it as a gin index (using btree_gin so that plain
> text columns could be indexed) was quite a bit faster than building it
> as a regular btree index using tuplesort. I didn't really investigate
> this difference, but I assume it was due to the better memory usage
> leading to less IO.
>
> (Disclaimer: this was on identifiers which had little difference in
> the first 8 bytes, so they didn't really benefit from the abbreviated
> keys.)

Sorry for going on about it, but I think you'd be surprised how
effective abbreviated keys are even in seemingly marginal cases.

> So I agree that a better long-term approach would probably be to make
> gin index builds (at least the bulk ones, I don't know about the
> pending-list cleanup) use a tuplesort. But it looks like Constantin
> has already done most of the work on his current proposal, and no one
> has volunteered to do the work on converting it to tuplesort instead.

I'm not going to stand in the way of incremental progress, particularly
given that this looks to be a simple patch that doesn't commit us to
anything.

I suspect that we should consolidate GIN and nbtree at some point,
though. I think that there are some useful general consequences for
performance that come from consolidation. For example, my ongoing work
on external sorting makes it use much of the same infrastructure as
internal sorting. Now, external sorts automatically benefit from
optimizations to internal sorting, like the "onlyKey" quicksort
optimization.

--
Peter Geoghegan
On Fri, Jan 15, 2016 at 5:38 PM, Constantin S. Pan <kvapen@gmail.com> wrote:
> In its current state the implementation is just a proof of concept
> and it has all the configuration hardcoded, but it already works as
> is, though it does not speed up the build process more than 4 times
> on my configuration (12 CPUs). There is also a problem with temporary
> tables, for which the parallel mode does not work.

I have yet to see a case where parallel query, with any current or
pending patch, gets more than about a 4x speedup. I can't think of any
reason that there should be a wall at 4x, and I'm not sure we're
hitting the wall there for the same reason in all cases. But your
observation corresponds to my experience. I hope we eventually figure
out how to make that much better, but I wouldn't feel too bad about
being at that spot now. 4x faster is still a lot faster.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Sat, 16 Jan 2016 01:38:39 +0300
"Constantin S. Pan" <kvapen@gmail.com> wrote:
> I have a draft implementation which divides the whole process between
> N parallel workers, see the patch attached. Instead of a full scan of
> the relation, I give each worker a range of blocks to read.
>
> In its current state the implementation is just a proof of concept
> and it has all the configuration hardcoded, but it already works as
> is, though it does not speed up the build process more than 4 times
> on my configuration (12 CPUs). There is also a problem with temporary
> tables, for which the parallel mode does not work.

Hey Hackers!

I have made some progress on the proposal (see the attached patch):

0. Moved some repeated code to functions (e.g. ginDumpAccumulator,
ginbuildCommon).

1. Implemented results merging on the backend.

2. Disabled the usage of parallel mode when creating an index on
temporary tables. No point in using parallel mode for temporary tables
anyway, right?

3. Added a GUC parameter to control the number of workers for GIN
building.

4. Hit the 8x speedup limit. Made some analysis of the reasons (see the
attached plot or the data file).

In order to analyze the performance issues, I have done the following:

create table t (k int, v int[]);

create or replace
function randarray(width int, low int, high int)
returns int[] as
$$
    select array(select (random()*(high-low) + low)::int
    from generate_series(1,width))
$$ language sql;

insert into t select k, randarray(3000, 0, 100000)
from generate_series(1, 100000) as k;

create index t_v_idx on t using gin (v);

This creates 100000 arrays of 3000 random numbers each. The random
numbers are in the range [0, 100000]. Then I measure how long the gin
building steps take. There are two steps: scan and merge.

The results show that the 'scan' step is sped up perfectly. But the
'merge' step takes longer as you increase the number of workers. The
profiler shows that the bottleneck here is ginMergeItemPointers(), which
I use to merge the results.

Also, I did encounter the problem with workers deadlocking during
heap_open, but that seems to have been resolved by Robert Haas in his
commit regarding group locking.

Please leave your feedback!

Regards,

Constantin S. Pan
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
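For reference, the merge that dominates the profile is essentially a
linear merge of two sorted item-pointer lists, along the lines of this
simplified standalone sketch (the real ginMergeItemPointers() works on
ItemPointerData; plain integers stand in for tids here). Each time a
worker's batch is merged into an existing posting list, everything
accumulated so far gets re-read:

/* Simplified sketch of merging two sorted item-pointer lists.
 * Cost is O(na + nb) on every call, which is why repeatedly merging
 * per-worker results into one growing list gets expensive. */
#include <stddef.h>
#include <stdint.h>

static size_t
merge_item_pointers(const uint64_t *a, size_t na,
                    const uint64_t *b, size_t nb,
                    uint64_t *dst)
{
    size_t i = 0, j = 0, k = 0;

    while (i < na && j < nb)
    {
        if (a[i] < b[j])
            dst[k++] = a[i++];
        else if (a[i] > b[j])
            dst[k++] = b[j++];
        else
        {
            dst[k++] = a[i++];  /* keep only one copy of a duplicate */
            j++;
        }
    }
    while (i < na)
        dst[k++] = a[i++];
    while (j < nb)
        dst[k++] = b[j++];
    return k;
}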
Attachment
On Wed, Feb 17, 2016 at 6:55 PM, Constantin S. Pan <kvapen@gmail.com> wrote:
> I have made some progress on the proposal (see the attached patch):
> [...]
> 4. Hit the 8x speedup limit. Made some analysis of the reasons (see the
> attached plot or the data file).
> [...]
> The results show that the 'scan' step is sped up perfectly. But the
> 'merge' step takes longer as you increase the number of workers. The
> profiler shows that the bottleneck here is ginMergeItemPointers(), which
> I use to merge the results.
> [...]
> Please leave your feedback!
My feedback is (Mac OS X 10.11.3)
set gin_parallel_workers=2;
create index message_body_idx on messages using gin(body_tsvector);
LOG: worker process: parallel worker for PID 5689 (PID 6906) was terminated by signal 11: Segmentation fault
LOG: terminating any other active server processes
WARNING: terminating connection because of crash of another server process
DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
HINT: In a moment you should be able to reconnect to the database and repeat your command.
WARNING: terminating connection because of crash of another server process
DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
HINT: In a moment you should be able to reconnect to the database and repeat your command.
WARNING: terminating connection because of crash of another server process
DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
HINT: In a moment you should be able to reconnect to the database and repeat your command.
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: FATAL: the database system is in recovery mode
Failed.
On Wed, Feb 17, 2016 at 7:55 AM, Constantin S. Pan <kvapen@gmail.com> wrote:
> 4. Hit the 8x speedup limit. Made some analysis of the reasons (see the
> attached plot or the data file).

Did you actually compare this to the master branch? I wouldn't like to
assume that the one worker case was equivalent. Obviously that's the
really interesting baseline.

--
Peter Geoghegan
I was testing with the number of workers starting at 0. The 0 case is
(in theory) equivalent to the master branch. But I should certainly
compare to the master branch, you are right. Will do that shortly.

On Wed, 17 Feb 2016 12:26:05 -0800
Peter Geoghegan <pg@heroku.com> wrote:
> Did you actually compare this to the master branch? I wouldn't like to
> assume that the one worker case was equivalent. Obviously that's the
> really interesting baseline.
On Wed, 17 Feb 2016 23:01:47 +0300
Oleg Bartunov <obartunov@gmail.com> wrote:
> My feedback is (Mac OS X 10.11.3)
>
> set gin_parallel_workers=2;
> create index message_body_idx on messages using gin(body_tsvector);
> LOG: worker process: parallel worker for PID 5689 (PID 6906) was
> terminated by signal 11: Segmentation fault

Fixed this, try the new patch. The bug was in incorrect handling of
some GIN categories.
Attachment
On 2/18/16 10:10 AM, Constantin S. Pan wrote:
> On Wed, 17 Feb 2016 23:01:47 +0300
> Oleg Bartunov <obartunov@gmail.com> wrote:
>
>> My feedback is (Mac OS X 10.11.3)
>>
>> set gin_parallel_workers=2;
>> create index message_body_idx on messages using gin(body_tsvector);
>> LOG: worker process: parallel worker for PID 5689 (PID 6906) was
>> terminated by signal 11: Segmentation fault
>
> Fixed this, try the new patch. The bug was in incorrect handling
> of some GIN categories.

Oleg, it looks like Constantin has updated the patch to address the
issue you were seeing. Do you have time to retest and review?

Thanks,
--
-David
david@pgmasters.net
On Mon, 14 Mar 2016 08:42:26 -0400
David Steele <david@pgmasters.net> wrote:
> Oleg, it looks like Constantin has updated the patch to address the
> issue you were seeing. Do you have time to retest and review?

Actually, there has been some progress since. The patch is attached.

1. Added another GUC parameter for changing the amount of shared memory
for parallel GIN workers.

2. Changed the way results are merged. It uses a shared memory message
queue now.

3. Tested on some real data (GIN index on email message body
tsvectors). Here are the timings for different values of
'gin_shared_mem' and 'gin_parallel_workers' on a 4-CPU machine. Seems
'gin_shared_mem' has no visible effect.

wnum  mem(MB)  time(s)
   0       16      247
   1       16      256
   2       16      126
   4       16       89
   0       32      247
   1       32      270
   2       32      123
   4       32       92
   0       64      254
   1       64      272
   2       64      123
   4       64       88
   0      128      250
   1      128      263
   2      128      126
   4      128       85
   0      256      247
   1      256      269
   2      256      130
   4      256       88
   0      512      257
   1      512      275
   2      512      129
   4      512       92
   0     1024      255
   1     1024      273
   2     1024      130
   4     1024       90

On Wed, 17 Feb 2016 12:26:05 -0800
Peter Geoghegan <pg@heroku.com> wrote:
> Did you actually compare this to the master branch? I wouldn't like to
> assume that the one worker case was equivalent. Obviously that's the
> really interesting baseline.

Compared with the master branch. The case of 0 workers is indeed
equivalent to the master branch.

Regards,
Constantin
Attachment
On Wed, Mar 16, 2016 at 5:41 AM, Constantin S. Pan <kvapen@gmail.com> wrote:
> 3. Tested on some real data (GIN index on email message body
> tsvectors). Here are the timings for different values of
> 'gin_shared_mem' and 'gin_parallel_workers' on a 4-CPU machine. Seems
> 'gin_shared_mem' has no visible effect.
>
> wnum  mem(MB)  time(s)
>    0       16      247
>    1       16      256
It seems from your data that with 1 worker you are always seeing a
slowdown; have you investigated the reason for that?
On Wed, 16 Mar 2016 12:14:51 +0530
Amit Kapila <amit.kapila16@gmail.com> wrote:
> It seems from your data that with 1 worker you are always seeing a
> slowdown; have you investigated the reason for that?

That slowdown is expected. It slows down because with 1 worker it has
to transfer the results from the worker to the backend.

The backend just waits for the results from the workers and merges them
(in case wnum > 0). So the 1-worker configuration should never be used,
because it is as sequential as the 0-worker one, but adds data
transfer.

Regards,

Constantin S. Pan
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
On Wed, Mar 16, 2016 at 2:25 AM, Constantin S. Pan <kvapen@gmail.com> wrote:
> The backend just waits for the results from the workers and merges them
> (in case wnum > 0). So the 1-worker configuration should never be used,
> because it is as sequential as the 0-worker one, but adds data transfer.

This is why I wanted an easy way of atomically guaranteeing some number
of workers (typically 2), or not using parallelism at all. I think the
parallel worker API should offer a simple way to do that in cases like
this, where having only 1 worker is never going to win.

--
Peter Geoghegan
On Wed, 16 Mar 2016 02:43:47 -0700
Peter Geoghegan <pg@heroku.com> wrote:
> This is why I wanted an easy way of atomically guaranteeing some
> number of workers (typically 2), or not using parallelism at all. I
> think the parallel worker API should offer a simple way to do that in
> cases like this, where having only 1 worker is never going to win.

Well, we can check the number of workers actually launched and revert
to the single-backend way when there are fewer than 2 workers. Let me
code that in.

Regards,

Constantin S. Pan
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
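A sketch of that fallback, as a fragment rather than runnable code: it
assumes the 9.6-era parallel-context API, and ginParallelBuildMain and
ginbuildCommon are stand-ins for the patch's own worker entry point and
sequential build path:

/* Launch as many workers as we can; if fewer than 2 actually start,
 * parallelism only adds transfer overhead, so fall back. */
EnterParallelMode();
pcxt = CreateParallelContext(ginParallelBuildMain, gin_parallel_workers);
InitializeParallelDSM(pcxt);
LaunchParallelWorkers(pcxt);

if (pcxt->nworkers_launched < 2)
{
    DestroyParallelContext(pcxt);
    ExitParallelMode();
    ginbuildCommon(heap, index, indexInfo);     /* single-backend path */
    return;
}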
On Wed, Mar 16, 2016 at 2:55 PM, Constantin S. Pan <kvapen@gmail.com> wrote:
>
> On Wed, 16 Mar 2016 12:14:51 +0530
> Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> > On Wed, Mar 16, 2016 at 5:41 AM, Constantin S. Pan <kvapen@gmail.com>
> > wrote:
> > > 3. Tested on some real data (GIN index on email message body
> > > tsvectors). Here are the timings for different values of
> > > 'gin_shared_mem' and 'gin_parallel_workers' on a 4-CPU
> > > machine. Seems 'gin_shared_mem' has no visible effect.
> > >
> > > wnum  mem(MB)  time(s)
> > >    0       16      247
> > >    1       16      256
> > >
> >
> >
> > It seems from your data that with 1 worker you are always seeing a
> > slowdown; have you investigated the reason for that?
> >
>
> That slowdown is expected. It slows down because with 1 worker it
> has to transfer the results from the worker to the backend.
>
> The backend just waits for the results from the workers and merges them
> (in case wnum > 0).
>
Why does the backend just wait? Why can't it do the same work as any
worker does? In general, for other parallelism features the backend
also behaves the same way as a worker in producing results if results
from workers are not available.
On Wed, 16 Mar 2016 18:08:38 +0530
Amit Kapila <amit.kapila16@gmail.com> wrote:
> Why does the backend just wait? Why can't it do the same work as any
> worker does? In general, for other parallelism features the backend
> also behaves the same way as a worker in producing results if results
> from workers are not available.

We can make the backend do the same work as any worker, but that will
complicate the code for less than a 2% performance boost. This
performance issue matters even less as you increase the number of
workers N, since you save only 1/N-th of the transfer time.

Is the backend waiting for workers a really bad practice that should be
avoided at all costs, or can we leave it as is?

Regards,

Constantin S. Pan
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
On Wed, Mar 16, 2016 at 7:50 PM, Constantin S. Pan <kvapen@gmail.com> wrote:
>
> On Wed, 16 Mar 2016 18:08:38 +0530
> Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> >
> > Why does the backend just wait? Why can't it do the same work as
> > any worker does? In general, for other parallelism features the
> > backend also behaves the same way as a worker in producing results
> > if results from workers are not available.
>
> We can make the backend do the same work as any worker, but that
> will complicate the code for less than a 2% performance boost.
Why do you think it will be just 2%? I think for the single-worker case
it should be much more, as the master backend will be less busy
consuming tuples from the tuple queue. I can't say much about code
complexity, as I haven't yet looked carefully at the logic of the
patch, but we didn't find much difficulty while doing it for parallel
scans. One of the commits which might help you in understanding how
heap scans are currently parallelised is
ee7ca559fcf404f9a3bd99da85c8f4ea9fbc2e92; you can see if that can help
you in any way in writing a generic API for GIN parallel builds.
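The leader-as-worker pattern from parallel heap scans looks roughly
like this (fragment; ginScanWorker is a stand-in for whatever routine
the patch's workers run):

/* After launching the workers, the leader runs the same scan loop
 * itself instead of sitting idle, then collects everyone's results. */
LaunchParallelWorkers(pcxt);

ginScanWorker(task);                    /* leader does a worker's share */

WaitForParallelWorkersToFinish(pcxt);   /* then merge all the results */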
On Thu, 17 Mar 2016 13:21:32 +0530
Amit Kapila <amit.kapila16@gmail.com> wrote:
> Why do you think it will be just 2%? I think for the single-worker
> case it should be much more, as the master backend will be less busy
> consuming tuples from the tuple queue.

I looked at the timing details some time ago, which showed that the
backend spent about 1% of total time on data transfer from 1 worker,
and 3% on transfer and merging from 2 workers. So if we use (active
backend + 1 worker) instead of (passive backend + 2 workers), we still
have to spend 1.5% on transfer and merging.

Or we can look at these measurements (from yesterday's message):

wnum  mem(MB)  time(s)
   0       16      247
   1       16      256
   2       16      126

If 2 workers didn't have to transfer and merge their results, they
would have finished in 247 / 2 = 123.5 seconds. But the transfer and
merging took another 2.5 seconds. The merging takes a little longer
than the transfer. If we now use backend+worker we get rid of 1
transfer, but still have to do 1 transfer and then merge, so we will
save less than a quarter of those 2.5 seconds. In other words, we gain
almost nothing by teaching the backend how to be a worker.

Regards,

Constantin S. Pan
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
On Thu, Mar 17, 2016 at 2:56 PM, Constantin S. Pan <kvapen@gmail.com> wrote:
>
> I looked at the timing details some time ago, which showed
> that the backend spent about 1% of total time on data
> transfer from 1 worker, and 3% on transfer and merging from
> 2 workers. So if we use (active backend + 1 worker) instead
> of (passive backend + 2 workers), we still have to spend
> 1.5% on transfer and merging.
>
I think here the comparison should be between the case of (active
backend + 1 worker) and (passive backend + 1 worker), or (active
backend + 2 workers) and (passive backend + 2 workers). I don't think
it is a good assumption that workers are always freely available and
you can use them as and when required for any operation.
>
> Or we can look at these measurements (from yesterday's
> message):
>
> wnum  mem(MB)  time(s)
>    0       16      247
>    1       16      256
>    2       16      126
>
> If 2 workers didn't have to transfer and merge their
> results, they would have finished in 247 / 2 = 123.5
> seconds. But the transfer and merging took another 2.5
> seconds. The merging takes a little longer than the
> transfer. If we now use backend+worker we get rid of 1
> transfer, but still have to do 1 transfer and then merge, so
> we will save less than a quarter of those 2.5 seconds.
>
If I understand the above data correctly, then it seems to indicate
that the majority of the work is done in processing the data, so I
think it should be better if the master and workers can both work
together.
On Thu, Mar 17, 2016 at 11:42 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> I think here the comparison should be between the case of (active
> backend + 1 worker) and (passive backend + 1 worker), or (active
> backend + 2 workers) and (passive backend + 2 workers). I don't think
> it is a good assumption that workers are always freely available and
> you can use them as and when required for any operation.

Strong +1. The pool of background workers is necessarily quite limited
and you can't just gobble them up. I'm not saying that it's absolutely
essential that the leader can also participate, but saying that 1
active leader + 1 worker is only 2% faster than 1 passive leader + 2
workers is not comparing apples to apples.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Hi Constantin,

I did a quick review of your patch, and here are my comments:

- This patch applies cleanly to the current HEAD (61d2ebdbf91).

- Code compiles without warnings.

- Currently there's no documentation regarding the parallel gin build
feature and the provided GUC variables.

- Built indexes seem to work normally.

Performance
-----------

I've made a few runs on my laptop (i7-4710HQ, default postgresql.conf),
here are the results:

workers    avg. time (s)
      0              412
      4              133
      8               81

Looks like 8 workers & a backend do the job 5x faster than a sole
backend, which is good!

Code style
----------

There are some things that you've probably overlooked, such as:

> task->heap_oid = heap->rd_id;
> task->index_oid = index->rd_id;

You could replace direct access to the 'rd_id' field with the
RelationGetRelid macro.

> static void ginDumpEntry(GinState *ginstate,
>                          volatile WorkerResult *r

Parameter 'r' is unused, you could remove it.

Some of the functions and pieces of code that you've added do not
comply with the formatting conventions, e.g.

> static int claimSomeBlocks(...
> static GinEntryStack *pushEntry(

>> // The message consists of 2 or 3 parts. iovec allows us to send
>> them as

etc.

Keep up the good work!
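The rd_id fix would presumably be just:

task->heap_oid = RelationGetRelid(heap);
task->index_oid = RelationGetRelid(index);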
On Fri, 18 Mar 2016 20:40:16 +0300
Dmitry Ivanov <d.ivanov@postgrespro.ru> wrote:
> I did a quick review of your patch, and here are my comments:
> [...]
> Keep up the good work!

Hi Dmitry,

Thank you for the review. I am working on a new version that will fix
the code style and also involve the backend in the process as one of
the workers.
On Fri, 18 Mar 2016 20:40:16 +0300
Dmitry Ivanov <d.ivanov@postgrespro.ru> wrote:
> - Currently there's no documentation regarding the parallel gin build
> feature and the provided GUC variables.
>
> You could replace direct access to the 'rd_id' field with the
> RelationGetRelid macro.
>
> Parameter 'r' is unused, you could remove it.
>
> Some of the functions and pieces of code that you've added do not
> comply with the formatting conventions, e.g.

On Fri, 18 Mar 2016 11:18:46 -0400
Robert Haas <robertmhaas@gmail.com> wrote:
> Strong +1. The pool of background workers is necessarily quite
> limited and you can't just gobble them up. I'm not saying that it's
> absolutely essential that the leader can also participate, but saying
> that 1 active leader + 1 worker is only 2% faster than 1 passive
> leader + 2 workers is not comparing apples to apples.

On Fri, 18 Mar 2016 09:12:59 +0530
Amit Kapila <amit.kapila16@gmail.com> wrote:
> If I understand the above data correctly, then it seems to indicate
> that the majority of the work is done in processing the data, so I
> think it should be better if the master and workers can both work
> together.

Thank you for the support and feedback! Fixed all of these in the new
version. Also added new queries to the 'gin' regression test, which
immediately helped me catch one silly bug. The new patch is attached.

Regards,

Constantin S. Pan
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Attachment
Here is a new version of the patch, which:

1. Fixes some minor stylistic issues.

2. Uses binaryheap (instead of a custom ugly stack) for merging.

Regards,

Constantin S. Pan
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
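For reference, a k-way merge over binaryheap follows the usual pattern;
in this fragment-level sketch, GinStream and its helper functions are
invented stand-ins for the patch's per-worker result streams:

/* Heap of per-worker streams, ordered by each stream's current key
 * (stream_compare inverted so the heap behaves as a min-heap). Pop the
 * smallest, consume one entry, and reinsert while data remains. */
binaryheap *heap = binaryheap_allocate(nworkers, stream_compare, NULL);

for (int i = 0; i < nworkers; i++)
    if (stream_has_next(&streams[i]))
        binaryheap_add_unordered(heap, PointerGetDatum(&streams[i]));
binaryheap_build(heap);

while (!binaryheap_empty(heap))
{
    GinStream  *s = (GinStream *) DatumGetPointer(binaryheap_first(heap));

    consume_entry(s);           /* merge/dump this stream's current key */
    if (stream_has_next(s))
        binaryheap_replace_first(heap, PointerGetDatum(s));
    else
        (void) binaryheap_remove_first(heap);
}

Re-sifting the top element costs O(log N) per entry instead of
re-scanning the other streams, which is presumably the point of
replacing the custom stack.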
Attachment
On Fri, Apr 8, 2016 at 10:04 AM, Constantin S. Pan <kvapen@gmail.com> wrote:
> Here is a new version of the patch, which:
>
> 1. Fixes some minor stylistic issues.
>
> 2. Uses binaryheap (instead of a custom ugly stack) for merging.

I think we need to push this patch out to 9.7. This code has had a
little review here and there from various people, but clearly not
enough to push it into the tree at the very last minute. I don't think
we even have enough discussion to conclude that things like the
gin_shared_mem GUCs are good ideas rather than bad ones.

Also, I personally find this code to be extremely low on comments.
There's barely any explanation of what the overall methodology is here.
Most of the functions do not have a header comment explaining what they
do. The code hasn't been run through pgindent. There's no check to see
whether the computation that will be done inside the GIN index is
parallel-safe; what if it's an expression index on an unsafe function?
Opening the heap and index with no lock in the worker is wrong; the
worker should use the same lock mode as the leader. That's just on a
quick read-through; I'm sure there's more.

I'm going to move this to the next CommitFest. Hopefully someone will
volunteer to do some serious review of this, because the feature sounds
cool. Possibly that person will even be me. But it will not be today.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 01/17/2016 10:03 PM, Jeff Janes wrote:
> On Fri, Jan 15, 2016 at 3:29 PM, Peter Geoghegan <pg@heroku.com> wrote:
>> Even without parallelism, wouldn't it be better if GIN indexes were
>> built using tuplesort?
>
> I think it would take a lot of changes to tuplesort to make this an
> almost-always win.
>
> When I was testing building an index on a column of text identifiers
> where there were a couple million identifiers occurring a few dozen
> times each, building it as a gin index (using btree_gin so that plain
> text columns could be indexed) was quite a bit faster than building it
> as a regular btree index using tuplesort. I didn't really investigate
> this difference, but I assume it was due to the better memory usage
> leading to less IO.

I've been wondering about this too, using tuplesort to build GIN
indexes, for a long time. Surely the highly-optimized quicksort+merge
code in tuplesort.c is faster than maintaining a red-black tree? Or so
I thought.

I wrote a quick prototype of using tuplesort.c for GIN index build. I
tested it with a 500 MB table of integer arrays, so that the sort fit
completely in memory. It's a lot slower than the current code. It turns
out eliminating the duplicates early is really really important. So we
want to keep the red-black tree, to eliminate the duplicates. Or add
that capability to tuplesort.c, which might speed up Sort+Unique and
aggregates in general, but that's a big effort.

However, I still wonder if we shouldn't use a merge approach when the
tree doesn't fit in memory, like tuplesort.c does. Currently, when the
tree is full, we flush it out to the index, by inserting all the
entries. That can get quite expensive, I/O-wise. It also generates more
WAL, compared to writing each page only once. If we flushed the tree to
a tape instead, then we could perhaps use the machinery that Peter's
parallel B-tree patch is adding to tuplesort.c, to merge the tapes. I'm
not sure if that works out, but I think it's worth some
experimentation.
> But I do think this with aggregation would be worth it despite the
> amount of work involved. In addition to automatically benefiting from
> parallel sorts and any other future sort improvements, I think it
> would also generate better indexes. I have a theory that a problem
> with very large gin indexes is that repeatedly building
> maintenance_work_mem worth of rbtree entries and then merging them to
> disk yields highly fragmented btrees, in which logically adjacent
> key-space does not occupy physically nearby leaf pages, which can
> then cause problems either with access that follows a pattern (like
> range scans, except gin indexes can't really do those anyway) or with
> further bulk operations.

Yeah, there's that too.

- Heikki
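Heikki's flush-to-tape variant, sketched at the pseudocode level with
made-up function names:

/* Hypothetical sketch of the merge-runs idea: dump each memory-full
 * rbtree as one sorted run, then build the index in a single ordered
 * merge pass, instead of inserting into the index at every flush. */
while (heap_scan_has_more_tuples())
{
    accumulate_batch_into_rbtree();
    if (maintenance_work_mem_exceeded())
        dump_rbtree_as_sorted_run();    /* sequential write to a tape */
}
dump_rbtree_as_sorted_run();            /* final partial run */

merge_runs_and_build_index();           /* k-way merge; each index page
                                         * is written (and WAL-logged)
                                         * only once */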
On Wed, Sep 14, 2016 at 3:48 PM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> If we flushed the tree to a tape instead, then we could perhaps use
> the machinery that Peter's parallel B-tree patch is adding to
> tuplesort.c, to merge the tapes. I'm not sure if that works out, but I
> think it's worth some experimentation.

Marking as returned with feedback per mainly this comment.

--
Michael