Re: Parallel Seq Scan - Mailing list pgsql-hackers
From | Haribabu Kommi
Subject | Re: Parallel Seq Scan
Date |
Msg-id | CAJrrPGfvkpMqXcOr-xnWSCX4pbVVDaVY_c_R8H2UcRODagwgMg@mail.gmail.com
In response to | Re: Parallel Seq Scan (Robert Haas <robertmhaas@gmail.com>)
Responses | Re: Parallel Seq Scan
List | pgsql-hackers
On Sat, Sep 19, 2015 at 1:45 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Fri, Sep 18, 2015 at 4:03 AM, Haribabu Kommi
> <kommi.haribabu@gmail.com> wrote:
>> On Thu, Sep 3, 2015 at 8:21 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>>>
>>> Attached, find the rebased version of patch.
>>
>> Here are the performance test results:
>
> Thanks, this is really interesting. I'm very surprised by how much
> kernel overhead this shows. I wonder where that's coming from. The
> writes to and reads from the shm_mq shouldn't need to touch the kernel
> at all except for page faults; that's why I chose this form of IPC.
> It could be that the signals which are sent for flow control are
> chewing up a lot of cycles, but if that's the problem, it's not very
> clear from here. copy_user_generic_string doesn't sound like
> something related to signals. And why all the kernel time in
> _spin_lock? Maybe perf -g would help us tease out where this kernel
> time is coming from.

The copy_user_generic_string time comes from file read operations. In my test, shared_buffers was 12GB with a table size of 18GB. To reduce the use of copy_user_generic_string, I loaded all the pages into shared buffers and repeated the tests with both 12GB and 20GB shared_buffers settings. The _spin_lock calls come from the signals generated by the workers; with the increase of the tuple queue size, there is a change in kernel system call usage. I have attached the perf reports, collected with the -g option, for your reference.

> Some of this may be due to rapid context switching. Suppose the
> master process is the bottleneck. Then each worker will fill up the
> queue and go to sleep. When the master reads a tuple, the worker has
> to wake up and write a tuple, and then it goes back to sleep. This
> might be an indication that we need a bigger shm_mq size. I think
> that would be worth experimenting with: if we double or quadruple or
> increase by 10x the queue size, what happens to performance?

I tried multiply factors of 1, 2, 4, 8 and 10 for the tuple queue size and collected performance readings. Summary of the results:

- There is not much change in the low-selectivity cases as the tuple queue size increases.
- Up to a selectivity of 1.5 million rows, the query execution time with any tuple queue size is: 8 workers < 4 workers < 2 workers.
- With a tuple queue multiply factor of 4 (i.e. 4 * tuple queue size) and a selectivity greater than 1.5 million rows: 4 workers < 2 workers < 8 workers.
- With a tuple queue multiply factor of 8 or 10 and a selectivity greater than 1.5 million rows: 2 workers < 4 workers < 8 workers.
- From the above readings, increasing the tuple queue size benefits smaller worker counts more than larger ones.
- Maybe the tuple queue size could be calculated automatically from the selectivity, the average tuple width and the number of workers; a rough sketch of such a heuristic is given below.
- When the buffers are loaded into shared_buffers using the prewarm utility, not much scaling is visible as the number of workers increases.

The performance report is attached for your reference.
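To make the automatic-sizing idea above concrete, here is a rough sketch of what such a heuristic might look like. This is only an illustration for discussion, not code from the patch: the constant names, the planner inputs (plan_rows, plan_width), the 1% buffering factor and the clamping bounds are all assumptions chosen for the example.

#include "postgres.h"

/*
 * Illustrative sketch only: choose a per-worker tuple queue size from the
 * planner's row estimate, the average tuple width and the worker count.
 */
#define TUPLE_QUEUE_BASE_SIZE   ((Size) 65536)
#define TUPLE_QUEUE_MAX_SIZE    (10 * TUPLE_QUEUE_BASE_SIZE)

static Size
choose_tuple_queue_size(double plan_rows, int plan_width, int nworkers)
{
    /* Rows each worker is expected to push back to the master. */
    double      rows_per_worker = plan_rows / Max(nworkers, 1);

    /* Buffer roughly 1% of that stream's bytes before a worker must sleep. */
    double      wanted = rows_per_worker * plan_width * 0.01;

    /* Clamp so both very low and very high selectivity stay bounded. */
    if (wanted < TUPLE_QUEUE_BASE_SIZE)
        return TUPLE_QUEUE_BASE_SIZE;
    if (wanted > TUPLE_QUEUE_MAX_SIZE)
        return TUPLE_QUEUE_MAX_SIZE;
    return (Size) wanted;
}

With something along these lines, a high-selectivity scan run with few workers would get a larger queue (matching the readings above, where bigger queues helped smaller worker counts), while low-selectivity scans would stay at the base size.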
Apart from the performance, I have the following observations.

Workers are started irrespective of the system load. Suppose a user configures 16 workers, but because of a sudden increase in system load only 2 or 3 CPUs are idle. In this case, if a query eligible for parallel seq scan is executed, the backend may still start 16 workers, which can increase overall system usage and degrade the performance of the other backend sessions.

If a query has two parallel seq scan plan nodes, how will the workers be distributed across the two nodes? Currently parallel_seqscan_degree is applied per plan node; even if we change it to per query, I think we need worker distribution logic instead of letting a single plan node use all the workers.

A select with a LIMIT clause can perform worse with parallel seq scan in some scenarios, because of its very low selectivity compared to a plain seq scan. It would be better to document this, so that users can take the necessary actions for queries with a LIMIT clause.

Regards,
Hari Babu
Fujitsu Australia
Attachment