Re: Parallel Seq Scan - Mailing list pgsql-hackers
From | Haribabu Kommi
Subject | Re: Parallel Seq Scan
Date |
Msg-id | CAJrrPGfvkpMqXcOr-xnWSCX4pbVVDaVY_c_R8H2UcRODagwgMg@mail.gmail.com
In response to | Re: Parallel Seq Scan (Robert Haas <robertmhaas@gmail.com>)
Responses | Re: Parallel Seq Scan
List | pgsql-hackers
On Sat, Sep 19, 2015 at 1:45 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Fri, Sep 18, 2015 at 4:03 AM, Haribabu Kommi
> <kommi.haribabu@gmail.com> wrote:
>> On Thu, Sep 3, 2015 at 8:21 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>>>
>>> Attached, find the rebased version of patch.
>>
>> Here are the performance test results:
>
> Thanks, this is really interesting. I'm very surprised by how much
> kernel overhead this shows. I wonder where that's coming from. The
> writes to and reads from the shm_mq shouldn't need to touch the kernel
> at all except for page faults; that's why I chose this form of IPC.
> It could be that the signals which are sent for flow control are
> chewing up a lot of cycles, but if that's the problem, it's not very
> clear from here. copy_user_generic_string doesn't sound like
> something related to signals. And why all the kernel time in
> _spin_lock? Maybe perf -g would help us tease out where this kernel
> time is coming from.

The copy_user_generic_string time comes from file read operations. In my test, shared_buffers was 12GB with a table size of 18GB. To reduce the use of copy_user_generic_string, I loaded all the pages into shared buffers and repeated the tests with both 12GB and 20GB shared_buffers settings. The _spin_lock calls come from the signals generated by the workers; with the increase of the tuple queue size, there is a change in kernel system call usage. I have attached the perf reports, collected with the -g option, for your reference.

> Some of this may be due to rapid context switching. Suppose the
> master process is the bottleneck. Then each worker will fill up the
> queue and go to sleep. When the master reads a tuple, the worker has
> to wake up and write a tuple, and then it goes back to sleep. This
> might be an indication that we need a bigger shm_mq size. I think
> that would be worth experimenting with: if we double or quadruple or
> increase by 10x the queue size, what happens to performance?

I tried multiply factors of 1, 2, 4, 8 and 10 for the tuple queue size and collected performance readings. Summary of the results:

- There is not much change in the low-selectivity cases as the tuple queue size increases.
- Up to a selectivity of 1.5 million rows, the query execution time with any tuple queue size is: 8 workers < 4 workers < 2 workers.
- With a tuple queue multiply factor of 4 (i.e. 4 * tuple queue size) and a selectivity greater than 1.5 million rows: 4 workers < 2 workers < 8 workers.
- With a tuple queue multiply factor of 8 or 10 and a selectivity greater than 1.5 million rows: 2 workers < 4 workers < 8 workers.
- From the above readings, increasing the tuple queue size benefits smaller worker counts more than larger ones.
- Maybe the tuple queue size could be calculated automatically from the selectivity, the average tuple width and the number of workers; a rough sketch of such a heuristic is given below.
- When the buffers are loaded into shared_buffers using the prewarm utility, not much scaling is visible as the number of workers increases.

The performance report is attached for your reference.
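To make the automatic-sizing idea above concrete, here is a rough sketch of what such a heuristic might look like. This is only an illustration for discussion, not code from the patch: the constant names, the planner inputs (plan_rows, plan_width), the 1% buffering factor and the clamping bounds are all assumptions chosen for the example.

#include "postgres.h"

/*
 * Illustrative sketch only: choose a per-worker tuple queue size from the
 * planner's row estimate, the average tuple width and the worker count.
 */
#define TUPLE_QUEUE_BASE_SIZE   ((Size) 65536)
#define TUPLE_QUEUE_MAX_SIZE    (10 * TUPLE_QUEUE_BASE_SIZE)

static Size
choose_tuple_queue_size(double plan_rows, int plan_width, int nworkers)
{
    /* Rows each worker is expected to push back to the master. */
    double      rows_per_worker = plan_rows / Max(nworkers, 1);

    /* Buffer roughly 1% of that stream's bytes before a worker must sleep. */
    double      wanted = rows_per_worker * plan_width * 0.01;

    /* Clamp so both very low and very high selectivity stay bounded. */
    if (wanted < TUPLE_QUEUE_BASE_SIZE)
        return TUPLE_QUEUE_BASE_SIZE;
    if (wanted > TUPLE_QUEUE_MAX_SIZE)
        return TUPLE_QUEUE_MAX_SIZE;
    return (Size) wanted;
}

With something along these lines, a high-selectivity scan run with few workers would get a larger queue (matching the readings above, where bigger queues helped smaller worker counts), while low-selectivity scans would stay at the base size.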
Apart from the performance, I have the following observations.

Workers are started irrespective of the system load. Suppose a user configures 16 workers, but because of a sudden increase in system load only 2 or 3 CPUs are idle. In this case, if a query eligible for parallel seq scan is executed, the backend may still start 16 workers, which can increase overall system usage and degrade the performance of the other backend sessions.

If a query has two parallel seq scan plan nodes, how will the workers be distributed across the two nodes? Currently parallel_seqscan_degree is applied per plan node; even if we change it to per query, I think we need worker distribution logic instead of letting a single plan node use all the workers.

A select with a LIMIT clause can perform worse with parallel seq scan in some scenarios, because of its very low selectivity compared to a plain seq scan. It would be better to document this, so that users can take the necessary actions for queries with a LIMIT clause.

Regards,
Hari Babu
Fujitsu Australia
Attachment