Thread: Support Parallel Query Execution in Executor

Support Parallel Query Execution in Executor

From
"Qingqing Zhou"
Date:
I have written some experimental code of doing master-slave seqscan in
PostgreSQL. During the work, I feel we had enough infrastructure to support
parallel query execution.

What I did is adding a new node PARA and plug it above the node that we want
to execute in parallel. In this stage, a PARA node is just a SeqScan node,
which is:

typedef struct Para
{
    /* TODO: add a union to put all nodes supporting parallelism here */
    SeqScan     scan;
    /* Split / Merge / Redistribute */
    ParaType    type;
    /* TODO: other possible parameters */
} Para;

At the execution, the master (the process who receives the query) will wake
up a slave process (an idle ordinary backend) and the slave will pass the
scan results to the master via a shared memory communication-buffer. In
details, the execution is like this:

Master process:
1. PARA init: wake up a slave, pass the queryTree and outerPlan(planTree) to
it by nodeToString();
2. PARA exec:
   get an item from the communication-buffer;
   if item is a valid tuple
       return item;
   else
       handle other types of item;    /* execution done/error */

3. PARA end: do some cleanup.

As we can see from the PARA init stage, even with the most simple PARA node,
it is easy to support inter-node parallelism.

Slave process (use similar code for autovacuum process):
1. Get queryTree and planTree;
2. Redirect the destReceiver to the communication-buffer;
3. Encapsulate them in an executor and run;
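The master/slave protocol above could be sketched with a ring buffer like the following. This is a toy in-process illustration, not actual shared-memory code: the names (ParaChannel, para_put, para_get, the ITEM_* kinds) are invented here, and real code would need locking and process wakeup on top of it.

```c
#include <assert.h>
#include <string.h>

#define CHANNEL_SLOTS 8
#define ITEM_TUPLE 1            /* slot carries a valid tuple */
#define ITEM_DONE  2            /* slave finished the scan */
#define ITEM_ERROR 3            /* slave hit an error */

typedef struct ParaItem
{
    int  kind;                  /* ITEM_TUPLE / ITEM_DONE / ITEM_ERROR */
    char payload[64];           /* flattened tuple data (toy size) */
} ParaItem;

typedef struct ParaChannel
{
    ParaItem slots[CHANNEL_SLOTS];
    int      head;              /* next slot the master reads */
    int      tail;              /* next slot the slave writes */
    int      count;             /* filled slots */
} ParaChannel;

/* Slave side: enqueue one item; returns 0 if the buffer is full. */
static int
para_put(ParaChannel *ch, int kind, const char *data)
{
    if (ch->count == CHANNEL_SLOTS)
        return 0;               /* caller must wait and retry */
    ch->slots[ch->tail].kind = kind;
    strncpy(ch->slots[ch->tail].payload, data ? data : "",
            sizeof(ch->slots[ch->tail].payload) - 1);
    ch->tail = (ch->tail + 1) % CHANNEL_SLOTS;
    ch->count++;
    return 1;
}

/* Master side (PARA exec): dequeue one item; returns 0 if empty. */
static int
para_get(ParaChannel *ch, ParaItem *out)
{
    if (ch->count == 0)
        return 0;               /* caller must wait and retry */
    *out = ch->slots[ch->head];
    ch->head = (ch->head + 1) % CHANNEL_SLOTS;
    ch->count--;
    return 1;
}
```

The master loops on para_get until it sees ITEM_DONE or ITEM_ERROR, matching step 2 of the PARA exec description.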

The query plan is like this:
TEST=# explain select max(a), max(b) from t;
                             QUERY PLAN
----------------------------------------------------------------------
 Aggregate  (cost=7269.01..7269.02 rows=1 width=53)
   ->  Para [Split = 1] (cost=10.00..5879.00 rows=278000 width=53)
         ->  Seq Scan on T  (cost=0.00..5879.00 rows=278000 width=53)
(3 rows)

There are some problems I haven't addressed yet. The most difficult one for
me is xid assignment: master and slaves should see an identical view, and
the key is the xid. I am not sure of the correct solution to this problem.
We may use the same xid, or a contiguous range of xids, for master and
slaves. There are other problems, like the login problem (the master and
slaves should act as the same user), the elog message passing, etc. These
are also important, but I think we are able to handle them without any
problem.

I haven't touched the most difficult part, the parallel query optimizer. But
thanks to the two-phase parallel optimization technique, this part can be
treated like the geqo optimizer: without enough evidence of benefit, we
don't enable parallel query execution.

Are there any show-stopping reasons not to do this?

Regards,
Qingqing


Re: Support Parallel Query Execution in Executor

From
Martijn van Oosterhout
Date:
On Thu, Apr 06, 2006 at 06:28:33PM +0800, Qingqing Zhou wrote:
> I have written some experimental code of doing master-slave seqscan in
> PostgreSQL. During the work, I feel we had enough infrastructure to support
> parallel query execution.

Good work. One question though: temporary tables. Currently there is no
way to get a consistent view of a temporary table from another process.

You probably have to avoid shipping off non-immutable function calls.

Is it possible to deadlock? (I don't see how, but I may not be looking
hard enough.)

Hardest work would be finding what can be safely farmed off.

Nice work though, I hadn't thought of this approach.

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
> tool for doing 5% of the work and then sitting around waiting for someone
> else to do the other 95% so you can sue them.

Re: Support Parallel Query Execution in Executor

From
"Jonah H. Harris"
Date:
On 4/6/06, Qingqing Zhou <zhouqq@cs.toronto.edu> wrote:
> I have written some experimental code of doing master-slave seqscan in
> PostgreSQL. During the work, I feel we had enough infrastructure to support
> parallel query execution.

Great work!  I had looked into this a little bit and came to the same
ideas/problems you did, but none of them seemed insurmountable at all.
I'd be interested in working with you on this if you'd like.

I'm interested in hearing Tom's suggestions on this topic too.

--
Jonah H. Harris, Database Internals Architect
EnterpriseDB Corporation
732.331.1324


Re: Support Parallel Query Execution in Executor

From
Andrew Dunstan
Date:
Jonah H. Harris wrote:
> On 4/6/06, Qingqing Zhou <zhouqq@cs.toronto.edu> wrote:
>   
>> I have written some experimental code of doing master-slave seqscan in
>> PostgreSQL. During the work, I feel we had enough infrastructure to support
>> parallel query execution.
>>     
>
> Great work!  I had looked into this a little bit and came to the same
> ideas/problems you did, but none of them seemed insurmountable at all.
>  I'd be interested in working with you on this if you'd like.
>
> I'm interested in hearing Tom's suggestions on this topic too.
>
>   
From time to time there is (often ill-informed) discussion about making 
postgres multi-threaded - this strikes me as a possible candidate area 
for multithreading, and maybe that would be a way to overcome some of 
the problems mentioned.

Just a thought

cheers

andrew


Re: Support Parallel Query Execution in Executor

From
Thomas Hallgren
Date:
Hear hear ;-)

Regards,
Thomas Hallgren

Andrew Dunstan wrote:
> Jonah H. Harris wrote:
>> On 4/6/06, Qingqing Zhou <zhouqq@cs.toronto.edu> wrote:
>>  
>>> I have written some experimental code of doing master-slave seqscan in
>>> PostgreSQL. During the work, I feel we had enough infrastructure to 
>>> support
>>> parallel query execution.
>>>     
>>
>> Great work!  I had looked into this a little bit and came to the same
>> ideas/problems you did, but none of them seemed insurmountable at all.
>>  I'd be interested in working with you on this if you'd like.
>>
>> I'm interested in hearing Tom's suggestions on this topic too.
>>
>>   
> 
>  From time to time there is (often ill-informed) discussion about making 
> postgres multi-threaded - this strikes me as a possible candidate area 
> for multithreading, and maybe that would be a way to overcome some of 
> the problems mentioned.
> 
> Just a thought
> 
> cheers
> 
> andrew
> 
> 



Re: Support Parallel Query Execution in Executor

From
"Qingqing Zhou"
Date:
""Jonah H. Harris"" <jonah.harris@gmail.com> wrote
>
> Great work!  I had looked into this a little bit and came to the same
> ideas/problems you did, but none of them seemed insurmountable at all.
>  I'd be interested in working with you on this if you'd like.
>

Yes, I am happy to work with anyone on this topic. The plan in mind is like
this:
(1) stabilize the master-slave seqscan: solve all the problems left;
(2) parallelize the seqscan: AFAICS, this should not be very difficult
based on 1, and may only need some scan portion assignment;
(3) add an indexscan or one or two other node types to the master-slave
solution: this is in order to make the framework extensible;
(4) parallelize these nodes - this will be a big chunk of the job;
(5) add two-phase optimization to the server - we have to consider
partitioned tables in this stage, yet another big chunk of the job;

Regards,
Qingqing




Re: Support Parallel Query Execution in Executor

From
"Mike Rylander"
Date:
On 4/6/06, Qingqing Zhou <zhouqq@cs.toronto.edu> wrote:
>
> ""Jonah H. Harris"" <jonah.harris@gmail.com> wrote
> >
> > Great work!  I had looked into this a little bit and came to the same
> > ideas/problems you did, but none of them seemed insurmountable at all.
> >  I'd be interested in working with you on this if you'd like.
> >

First, I want to second Jonah's enthusiasm.  This is very exciting!

>
> Yes, I am happy to work with anyone on the topic. The plan in mind is like
> this:
> (1) stabilize the master-slave seqscan: solve all the problems left;
> (2) parallelize the seqscan: AFAICS, this should not be very difficult
> based on 1, and may only need some scan portion assignment;

This is really only a gut feeling for me (it can't be otherwise, since
we can't yet test), but I think parallelizing a single seqscan is
pretty much guaranteed to do nothing, because seqscans, especially on
large tables, are IO bound.

There was a plan some time ago (during the 8.0 beta, I think) to allow
multiple seqscans from different queries to join each other, such that
scans that begin later start scanning the table at the point, or just
behind the point, that the first running scan is already at.  That
plan would reduce IO contention, and buffer and OS cache thrashing, by
having multiple readers pull from the same hose.
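That joined-scan idea can be sketched with a tiny helper. This is purely illustrative (syncscan_next_block is a name invented here, not PostgreSQL code): a late-starting scan begins at the block the running scan is on and wraps around, so it still visits every block exactly once while sharing the same read stream.

```c
#include <assert.h>

/* Block visited at step 'step' by a scan that joined an in-progress
 * scan at block 'start', in a relation of 'nblocks' blocks.  The scan
 * proceeds forward and wraps around to cover the blocks it skipped. */
static unsigned int
syncscan_next_block(unsigned int start, unsigned int step, unsigned int nblocks)
{
    return (start + step) % nblocks;
}
```

Both scans then request the same pages at nearly the same time, which is what reduces the cache thrashing described above.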

I can't see how asking for more than one stream from the same file
would do anything but increase both cache thrashing and IO bandwidth
contention.  Am I missing something here?

> (3) add an indexscan or one or two other node types to the master-slave
> solution: this is in order to make the framework extensible;
> (4) parallelize these nodes - this will be a big chunk of the job;

Now that could be a _big_ win!  Especially if tablespaces are used to
balance commonly combined tables and indexes.

> (5) add two-phase optimization to the server - we have to consider
> partitioned tables in this stage, yet another big chunk of the job;
>

Same here.  This would be a place where parallel seqscans of different
tables (instead of multi-headed scan of one table) could buy you a
lot, especially with proper tablespace use.

Thanks again, Qingqing, for the work on this.  I'm very excited about
where this could go. :)

> Regards,
> Qingqing
>
>
>


--
Mike Rylander
mrylander@gmail.com
GPLS -- PINES Development
Database Developer
http://open-ils.org


Re: Support Parallel Query Execution in Executor

From
Josh Berkus
Date:
Qingqing,

> First, I want to second Jonah's enthusiasm.  This is very exciting!

Me, three!  I didn't think this was ever going to come to Postgres absent 
major corporate funding.

> This is really only a gut feeling for me (it can't be otherwise, since
> we can't yet test), but I think parallelizing a single seqscan is
> pretty much guaranteed to do nothing, because seqscans, especially on
> large tables, are IO bound.

Actually, not true.  Our current seqscan performance suffers from
producer-consumer fluctuation.  Greenplum and Sun did a whole bunch of
testing on this.

Basically reading a large table off disk does this:

read some table while not processing
process in cpu while not reading
read some more table while not processing
process some more in cpu while not reading
etc.
resulting in an I/O throughput graph that looks like:

    *       *       *
   * *    * *    * *
  *    * *    * *    *
 *       *       *      *

The really annoying part about this, for me personally, is that the peaks 
are significantly faster than comparable commercial DBMSes ... but our 
average is far less.   So even on a single seq scan, parallel query 
execution would make a significant difference in performance, possibly as 
much as +75% on seq scans of large tables.

-- 
--Josh

Josh Berkus
Aglio Database Solutions
San Francisco


Re: Support Parallel Query Execution in Executor

From
"Jonah H. Harris"
Date:
On 4/7/06, Josh Berkus <josh@agliodbs.com> wrote:
> So even on a single seq scan, parallel query
> execution would make a significant difference
> in performance, possibly as
> much as +75% on seq scans of large tables.

I've been looking at several commercial systems which employ dynamic
partitioning for sequential scans of large tables.  While 75% sounds a
little optimistic, there is most definitely a sizable performance
increase.


--
Jonah H. Harris, Database Internals Architect
EnterpriseDB Corporation
732.331.1324


Re: Support Parallel Query Execution in Executor

From
Tom Lane
Date:
Josh Berkus <josh@agliodbs.com> writes:
> Basically reading a large table off disk does this:

> read some table while not processing
> process in cpu while not reading
> read some more table while not processing
> process some more in cpu while not reading
> etc.
> resulting in an I/O throughput graph that looks like:

>     *       *       *
>    * *    * *    * *
>   *    * *    * *    *
>  *       *       *      *

Interesting ...

> The really annoying part about this, for me personally, is that the peaks 
> are significantly faster than comparable commercial DBMSes ... but our 
> average is far less.   So even on a single seq scan, parallel query 
> execution would make a significant difference in performance, possibly as 
> much as +75% on seq scans of large tables.

... but I'm failing to follow where it says that parallel processing 
will fix that.  All I can foresee in that direction is extra data
transfer costs, bought at the price of portability and locking headaches.
        regards, tom lane


Re: Support Parallel Query Execution in Executor

From
"Jonah H. Harris"
Date:
On 4/8/06, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> ... but I'm failing to follow where it says that parallel processing
> will fix that.  All I can foresee in that direction is extra data
> transfer costs, bought at the price of portability and locking headaches.

I don't think it's any less portable than the system is now; it's just
enabling multiple slave processes to participate in scans and
processing (parallel query, parallel index builds, parallel sorts,
...).  Likewise, the additional I/O cost isn't that much of an issue
because systems which really take advantage of this type of parallel
processing have large bandwidth I/O arrays anyway.

I didn't even want to mention that EVERY other database I know of
(Oracle, DB2, Sybase, SQL Server, Ingres, Bizgres MPP, MaxDB) supports
this, but it's a pretty obvious win for many environments.


--
Jonah H. Harris, Database Internals Architect
EnterpriseDB Corporation
732.331.1324


Re: Support Parallel Query Execution in Executor

From
Markus Schiltknecht
Date:
Hi,

On Sat, 2006-04-08 at 13:16 -0400, Jonah H. Harris wrote:
> On 4/8/06, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > ... but I'm failing to follow where it says that parallel processing
> > will fix that.  All I can foresee in that direction is extra data
> > transfer costs, bought at the price of portability and locking headaches.
> 
> I don't think it's any less portable than the system is now;

ACK. As long as processes, signals and shared memory are used this could
be as portable as PostgreSQL is now.

> It's just
> enabling multiple slave processes to participate in scans and
> processing (parallel query, parallel index builds, parallel sorts,
> ...)  Likewise, the additional I/O cost isn't that much of an issue
> because systems which really take advantage of this type of parallel
> processing have large bandwidth I/O arrays anyway.

Ehm.. which additional I/O cost? Or do you count inter-process
communication as I/O?

I'd like to help teach PostgreSQL the art of parallel query execution. I
have roughly implemented an internal message passing system on a shared
memory basis. This allows all pg-processes to send messages to each other,
identified by PID. Upon receiving a message, a process gets a SIGUSR2 (as
far as I can remember now).

On the positive side are:
- portable, as I'm using only shmem and signals
- fast (not much copying around of the data in memory)

One downside I see with my approach is:
- the shared memory buffer is limited in size, so it can simply be 'full'
and no more messages can be sent before another process reads its messages
and frees the memory for other messages.

In case you're interested I'll compile a patch for review.

Regards

Markus




Re: Support Parallel Query Execution in Executor

From
"Jonah H. Harris"
Date:
On 4/8/06, Markus Schiltknecht <markus@bluegap.ch> wrote:
> ACK. As long as processes, signals and shared memory are used this could
> be as portable as PostgreSQL is now.

This is certainly the case.

> Ehm.. which additional I/O cost? Or do you count inter-process
> communication as I/O?

Inter-process will add a minimal amount, but if it's done correctly,
you'll still end up ahead.

> I'd like to help teach PostgreSQL the art of parallel query
> execution.

Cool, we're not at the communication-level yet, but your help would be
appreciated.

> In case you're interested I'll compile a patch for review.

Surely!

--
Jonah H. Harris, Database Internals Architect
EnterpriseDB Corporation
732.331.1324


Re: Support Parallel Query Execution in Executor

From
Tom Lane
Date:
"Jonah H. Harris" <jonah.harris@gmail.com> writes:
> On 4/8/06, Markus Schiltknecht <markus@bluegap.ch> wrote:
>> Ehm.. which additional I/O cost? Or do you count inter-process
>> communication as I/O?

> Inter-process will add a minimal amount, but if it's done correctly,
> you'll still end up ahead.

This is exactly the bit of optimism I was questioning.  We've already
been sweating blood trying to reduce multiprocessor contention on data
structures in which collisions ought to be avoidable (ie, buffer arrays
where you hope not everyone is hitting the same buffer at once).  I
think passing large volumes of data between different processes is going
to incur quite a lot of locking overhead, pipeline stalls for cache line
transfers, etc, etc, because heavy contention for the transfer buffer is
simply not going to be avoidable.
        regards, tom lane


Re: Support Parallel Query Execution in Executor

From
"Jonah H. Harris"
Date:
On 4/8/06, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> passing large volumes of data between different processes is going
> to incur quite a lot of locking overhead, pipeline stalls for cache line
> transfers, etc, etc, because heavy contention for the transfer buffer is
> simply not going to be avoidable.

I don't think anyone believes there isn't going to be contention
somewhere.  Anyone that's developed a parallel system acknowledges
there's a price to pay with parallelism (in complexity) which requires
good algorithms, strategy, and design.

I may certainly be misunderstanding you, but it seems like you want to
avoid parallelism altogether.  Aside from PostgreSQL, the only systems
I know of which don't support parallelism in query execution are MySQL
and Firebird, but I don't see competing at the low end as a compelling
argument.  Similarly, I don't think every major vendor put money into
developing something that isn't useful.

I do agree that there are many times you won't want to have parallel
query.  The reasons for employing parallelism are very dependent on the
environment and application it's used in, so it's certainly going to
be up to the user to decide whether they want to use it or not.

We're currently playing with a proof-of-concept to find issues before
proposing a design.  This doesn't affect the community in any way.
Rather than negativity, I'd really like to see your suggestions on
avoiding contention as much as possible; your ideas may certainly get
us past several obstacles more quickly.

--
Jonah H. Harris, Database Internals Architect
EnterpriseDB Corporation
732.331.1324


Re: Support Parallel Query Execution in Executor

From
"Gregory Maxwell"
Date:
On 4/8/06, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> This is exactly the bit of optimism I was questioning.  We've already
> been sweating blood trying to reduce multiprocessor contention on data
> structures in which collisions ought to be avoidable (ie, buffer arrays
> where you hope not everyone is hitting the same buffer at once).  I
> think passing large volumes of data between different processes is going
> to incur quite a lot of locking overhead, pipeline stalls for cache line
> transfers, etc, etc, because heavy contention for the transfer buffer is
> simply not going to be avoidable.

We should consider true parallel execution and overlapping execution
with I/O as distinct cases.

For example, one case made in this thread involved bursty performance
with seqscans presumably because the I/O was stalling while processing
was being performed.  In general this can be avoided without parallel
execution through the use of non-blocking I/O and making an effort to
keep the request pipeline full.

There are other cases where it is useful to perform parallel I/O
without parallel processing.  For example, a query that will perform
an index lookup per row can benefit from running some number of those
lookups in parallel in order to hide the lookup latency and give the
OS and disk elevators a chance to make the random accesses a little
more orderly. This can be accomplished without true parallel
processing. (Perhaps PG does this already?)

Parallel execution to get access to more CPU and memory bandwidth is a
fine thing, and worth the costs in many cases... but it shouldn't be
used as an easy way to get parallel IO without careful consideration.


Re: Support Parallel Query Execution in Executor

From
"Luke Lonergan"
Date:
Greg,

On 4/8/06 5:43 PM, "Gregory Maxwell" <gmaxwell@gmail.com> wrote:

> For example, one case made in this thread involved bursty performance
> with seqscans presumably because the I/O was stalling while processing
> was being performed.  In general this can be avoided without parallel
> execution through the use of non-blocking I/O and making an effort to
> keep the request pipeline full.

I agree - there is a real continuing need for overlapped scan and execution,
though I'm not sure quite how to get it efficiently using inter-process
communication.  The AIO interface is nicely proven at this point, but to
Tom's point, it's non-portable.

What we see now is that the executor is CPU bound at about 300MB/s scan rate
using modern Opteron CPUs.  Using some kind of overlap strategy would help
this a lot - using programmable readahead to stage I/O requests
asynchronously from the scan node would be better than OS hints like
fadvise(), but the tuning would be tricky IMO.

- Luke




Re: Support Parallel Query Execution in Executor

From
Myron Scott
Date:
Gregory Maxwell wrote:
> We should consider true parallel execution and overlapping execution
> with I/O as distinct cases.
>
> For example, one case made in this thread involved bursty performance
> with seqscans presumably because the I/O was stalling while processing
> was being performed.  In general this can be avoided without parallel
> execution through the use of non-blocking I/O and making an effort to
> keep the request pipeline full.
>
> There are other cases where it is useful to perform parallel I/O
> without parallel processing.  For example, a query that will perform
> an index lookup per row can benefit from running some number of those
> lookups in parallel in order to hide the lookup latency and give the
> OS and disk elevators a chance to make the random accesses a little
> more orderly. This can be accomplished without true parallel
> processing. (Perhaps PG does this already?)
>
>   

I have done some testing more along these lines with an old fork of
postgres code (2001).  In my tests, I used a thread to delegate out the
actual heap scan of the SeqScan.  The job of the "slave" thread was to
fault in buffer pages and determine the time validity of the tuples.
ItemPointers are passed back to the "master" thread via a common memory
area guarded by mutex locking.  The master thread is then responsible for
converting the ItemPointers to HeapTuples and finishing the execution run.
I added a little hack to the buffer code to force pages read into the
buffer to stay at the back of the free buffer list until the master thread
has had a chance to use it.  These are the parameters of my test table.

Pages 9459:  ; Tup 961187: Live 673029, Dead 288158

Average tuple size is 70 bytes

create table test (rand int, message varchar(256))

So far I've done a couple of runs with a single query on a 2-processor
machine, with the following results via dtrace.

select * from test;

CPU     ID                    FUNCTION:NAME
  1  46218            ExecEndSeqScan:return Inline scan time 81729
  0  46216   ExecEndDelegatedSeqScan:return Delegated scan time 59903
  0  46218            ExecEndSeqScan:return Inline scan time 95708
  0  46216   ExecEndDelegatedSeqScan:return Delegated scan time 58255
  0  46218            ExecEndSeqScan:return Inline scan time 79028
  0  46216   ExecEndDelegatedSeqScan:return Delegated scan time 50500

On average, a 34% decrease in total time using the delegated scan.

A very crude, simple test but I think it shows some promise.

I know I used threads, but you could probably just as easily use a slave
process and pass ItemPointers via pipes or shared memory.

Thanks,

Myron Scott





Re: Support Parallel Query Execution in Executor

From
"Luke Lonergan"
Date:
Myron,

First, this sounds really good!

On 4/8/06 9:54 PM, "Myron Scott" <lister@sacadia.com> wrote:

>  I added a little hack to the buffer
> code to force
> pages read into the buffer to stay at the back of the free buffer list
> until the master
> thread has had a chance to use it.

This is the part I'm curious about - is this using the shared_buffers region
in a circular buffer fashion to store pre-fetched pages?

One thing I've wondered about is: how much memory is required to get
efficient overlap?  Did you find that you had to tune the amount of buffer
memory to get the performance to work out?

- Luke




Re: Support Parallel Query Execution in Executor

From
Hannu Krosing
Date:
On a fine day, Sat, 2006-04-08 at 18:47, Luke Lonergan wrote:

> I agree - there is a real continuing need for overlapped scan and execution,
> though I'm not sure quite how to get it efficiently using inter-process
> communication.  The AIO interface is nicely proven at this point, but to
> Tom's point, it's non-portable.
> 
> What we see now is that the executor is CPU bound at about 300MB/s scan rate
> using modern Opteron CPUs.  

Where exactly does the CPU spend its time at 300MB/s ? Is it mainly I/O
related processing,  processing pages into in-memory tuples or something
else ?

--------------
Hannu




Re: Support Parallel Query Execution in Executor

From
"Qingqing Zhou"
Date:
"Tom Lane" <tgl@sss.pgh.pa.us> wrote
>
> This is exactly the bit of optimism I was questioning.  We've already
> been sweating blood trying to reduce multiprocessor contention on data
> structures in which collisions ought to be avoidable (ie, buffer arrays
> where you hope not everyone is hitting the same buffer at once).  I
> think passing large volumes of data between different processes is going
> to incur quite a lot of locking overhead, pipeline stalls for cache line
> transfers, etc, etc, because heavy contention for the transfer buffer is
> simply not going to be avoidable.
>

Yes, there is overhead. Let's see how we can tackle the issues one by one:

* lock contention of transfer buffer *
A general way is partitioning. We can try to alleviate the lock contention 
by sharing transfer buffer only between each two processes (master and 
slave).

* large amount of data *
Thinking of the simplest master-slave seqscan scenario, I am copying the
HeapTuple so far (like the code in sort read/write tuple), but potentially
we can pass only the offset and the pinned page, which will require the
slave process to keep the page pinned a little longer, say every 5 pages. I
didn't see any application that causes Postgres to pin a lot of pages, so
this might be one step to reduce the data volume.
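The pass-the-offset idea can be sketched as a back-of-the-envelope count. This is a toy illustration (count_page_pins is a name invented here, not PostgreSQL code) of why shipping (page, offset) pairs is cheaper: each page crossing costs one pin, instead of one tuple copy per row.

```c
#include <assert.h>

/* Given the page number each scanned tuple lives on, in scan order,
 * count how many pin operations the slave would perform if it keeps a
 * page pinned until the master has consumed all tuples on it. */
static int
count_page_pins(const unsigned int *tupleblocks, int ntuples)
{
    int pins = 0;

    for (int i = 0; i < ntuples; i++)
        if (i == 0 || tupleblocks[i] != tupleblocks[i - 1])
            pins++;             /* new page: one pin instead of N copies */
    return pins;
}
```

With the ~100 tuples per page of the test table above, that is roughly a hundredfold reduction in transfer items compared with copying every HeapTuple.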

On the other hand, why do we need parallel execution anyway? Because if we
have a 4-way SMP box with a strong I/O device, we may want to trade more
resources for better performance on some queries.


Regards,
Qingqing 




Re: Support Parallel Query Execution in Executor

From
Myron Scott
Date:
On Apr 8, 2006, at 10:29 PM, Luke Lonergan wrote:

> Myron,
>
> First, this sounds really good!
>
> On 4/8/06 9:54 PM, "Myron Scott" <lister@sacadia.com> wrote:
>
>>  I added a little hack to the buffer
>> code to force
>> pages read into the buffer to stay at the back of the free buffer  
>> list
>> until the master
>> thread has had a chance to use it.
>
> This is the part I'm curious about - is this using the  
> shared_buffers region
> in a circular buffer fashion to store pre-fetched pages?

Yes.  That is basically what the slave thread is trying to do, as well as
weed out any tuples/pages that don't need to be looked at due to dead
tuples.  I did several things to try and ensure that a buffer needed by
the master thread would not be pulled out of the buffer pool before it
was seen by the master.  I wanted to do this without holding the buffer
pinned, so I made the change to the buffer free list to do this.

static void
AddBufferToFreelist(BufferDesc *bf)
{
        S_LOCK(&SLockArray[FreeBufMgrLock]);
        int movebehind = SharedFreeList->freePrev;
        /* find the right spot with bias */
        while ( BufferDescriptors[movebehind].bias > bf->bias ) {
                movebehind = BufferDescriptors[movebehind].freePrev;
        }
...

The bias number is removed the next time the buffer is pulled out of the
free list.

Also, I force an ItemPointer transfer when the ItemPointer transfer list
is full (currently 4096) or 10% of the buffer pool has been affected by
the slave thread.  Lastly, if the slave thread gets too far ahead of the
master thread, it waits for the master to catch up.  To my knowledge,
this hasn't happened yet.

>
> One thing I've wondered about is: how much memory is required to get
> efficient overlap?  Did you find that you had to tune the amount of  
> buffer
> memory to get the performance to work out?

I haven't done much tuning yet.  I think there is an optimal balance
that I most likely haven't found yet.

Myron Scott



Re: Support Parallel Query Execution in Executor

From
Martijn van Oosterhout
Date:
On Sun, Apr 09, 2006 at 08:23:36AM -0700, Myron Scott wrote:
> >This is the part I'm curious about - is this using the
> >shared_buffers region
> >in a circular buffer fashion to store pre-fetched pages?
>
> Yes.  That is basically what the slave thread is trying to do.  As
> well as weed out
> any tuples/pages that don't need to be looked at due to dead tuples.
> I did several things to try and insure that a buffer needed by the
> master thread
> would not be pulled out of the buffer pool before it was seen by the
> master.
> I wanted to do this without holding the buffer pinned, so I did the
> change to
> the buffer free list to do this.

Is this necessary? I mean, what's the chance that a page might get
thrown out early? And if so, what's the chance that page will still be
in the OS cache?

The cost of fetching a page from the OS is not really much of an
overhead, so I'd like to know how much benefit these buffer cache hacks
actually produce.

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/

Re: Support Parallel Query Execution in Executor

From
Tom Lane
Date:
Myron Scott <lister@sacadia.com> writes:
> Gregory Maxwell wrote:
>> There are other cases where it is useful to perform parallel I/O
>> without parallel processing..

> I have done some testing more along these lines with an old fork of
> postgres code (2001).  In my tests, I used a thread to delegate out
> the actual heap scan of the SeqScan.  The job of the "slave" thread
> was to fault in buffer pages and determine the time validity of
> the tuples.  ItemPointers are passed back to the "master" thread via a
> common memory area guarded by mutex locking.

I was considering a variant idea in the shower this morning: suppose
that we invent one or more "background reader" processes that have
basically the same infrastructure as the background writer, but have
the responsibility of causing buffer reads to happen at useful times
(whereas the writer causes writes to happen at useful times).  The
idea would be for backends to signal the readers when they know they
will need a given block soon, and then hopefully when they need it
it'll already be in shared buffers.  For instance, in a seqscan it'd be
pretty trivial to request block N+1 just after reading block N, and then
doing our actual processing on block N while (we hope) some reader
process is faulting in N+1.  Bitmap indexscans could use this structure
too; I'm less sure about whether plain indexscans could do much with it
though.

The major issues I can see are:

1. We'd need a shared-memory queue of read requests, probably much like
the queue of fsync requests.  We've already seen problems with
contention for the fsync queue, IIRC, and that's used much less heavily
than the read request queue would be.  So there might be some
performance issues with getting the block requests sent over to the
readers.

2. There are some low-level assumptions that no one reads in pages of
a relation without having some kind of lock on the relation (consider
eg the case where the relation is being dropped).  A bgwriter-like
process wouldn't be able to hold lmgr locks, and we wouldn't really want
it to be thrashing the lmgr shared data structures for each read anyway.
So you'd have to design some interlock to guarantee that no backend
abandons a query (and releases its own lmgr locks) while an async read
request it made is still pending.  Ugh.
        regards, tom lane


Re: Support Parallel Query Execution in Executor

From
Tom Lane
Date:
"Gregory Maxwell" <gmaxwell@gmail.com> writes:
> For example, one case made in this thread involved bursty performance
> with seqscans presumably because the I/O was stalling while processing
> was being performed.

Actually, the question that that raised in my mind is "why isn't the
kernel doing read-ahead properly?"  When we're doing nonsequential
access like an indexscan, it's unsurprising that the kernel can't guess
which block we need next, but in a plain seqscan you'd certainly expect
the read-ahead algorithm to kick in and ensure that the next block is
fetched before we need it.

So before we go inventing complicated bits of code with lots of added
overhead, we should first find out exactly why the system doesn't
already work the way it's supposed to.
        regards, tom lane


Re: Support Parallel Query Execution in Executor

From
"Jonah H. Harris"
Date:
On 4/9/06, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I was considering a variant idea in the shower this morning: suppose
> that we invent one or more "background reader" processes that have
> basically the same infrastructure as the background writer, but have
> the responsibility of causing buffer reads to happen at useful times
> (whereas the writer causes writes to happen at useful times).  The
> idea would be for backends to signal the readers when they know they
> will need a given block soon, and then hopefully when they need it
> it'll already be in shared buffers.

This is sort of what I'm playing with.  There are N-number of backends
which are configured at startup and are available solely for parallel
processing.  If you have parallelism enabled and you have a query
which the planner says can take advantage of it, your backend becomes
a sort of "coordinator" and signals the number of backends defined by
(right now) a global parallelism degree parameter for participation in
query execution.

> The major issues I can see are:
>
> 1. We'd need a shared-memory queue of read requests, probably much like
> the queue of fsync requests.  We've already seen problems with
> contention for the fsync queue, IIRC, and that's used much less heavily
> than the read request queue would be.  So there might be some
> performance issues with getting the block requests sent over to the
> readers.

Yes, I was looking at this contention as well.  However, I haven't
looked at the fsync request queue in detail, but I'm going to play
around with a simple initial semaphore test such that if the queue is
full, the query will just be processed serially.  I'll also take a
look at the fsync queue and see how it's working.

While it's down the road a ways, I'm really looking forward to seeing
parallel index builds, sorts, and joins.  In my opinion, all of our
combined ideas and system knowledge are really going to pay off big
time on this one.

--
Jonah H. Harris, Database Internals Architect
EnterpriseDB Corporation
732.331.1324


Re: Support Parallel Query Execution in Executor

From
Myron Scott
Date:
On Apr 9, 2006, at 9:26 AM, Martijn van Oosterhout wrote:

> On Sun, Apr 09, 2006 at 08:23:36AM -0700, Myron Scott wrote:
>>> This is the part I'm curious about - is this using the
>>> shared_buffers region
>>> in a circular buffer fashion to store pre-fetched pages?
>>
>> Yes.  That is basically what the slave thread is trying to do, as
>> well as weed out any tuples/pages that don't need to be looked at
>> due to dead tuples.  I did several things to try and ensure that a
>> buffer needed by the master thread would not be pulled out of the
>> buffer pool before it was seen by the master.  I wanted to do this
>> without holding the buffer pinned, so I made the change to the
>> buffer free list to do this.
>
> Is this necessary? I mean, what's the chance that a page might get
> thrown out early? And if so, what's the chance that page will still be
> in the OS cache?
>
> The cost of fetching a page from the OS is not really much of an
> overhead, so I'd like to know how much benefit these buffer cache hacks
> actually produce.
>

You may be right on this one.  I wanted to ensure that I didn't lose
pages I needed.  I may have just added a belt to my suspenders.  I'll
add a switch to turn it off and on and try to devise a test to see
what the costs either way are.


Myron Scott


Re: Support Parallel Query Execution in Executor

From
"Gregory Maxwell"
Date:
On 4/9/06, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> "Gregory Maxwell" <gmaxwell@gmail.com> writes:
> > For example, one case made in this thread involved bursty performance
> > with seqscans presumably because the I/O was stalling while processing
> > was being performed.
>
> Actually, the question that that raised in my mind is "why isn't the
> kernel doing read-ahead properly?"  When we're doing nonsequential
> access like an indexscan, it's unsurprising that the kernel can't guess
> which block we need next, but in a plain seqscan you'd certainly expect
> the read-ahead algorithm to kick in and ensure that the next block is
> fetched before we need it.
>
> So before we go inventing complicated bits of code with lots of added
> overhead, we should first find out exactly why the system doesn't
> already work the way it's supposed to.

But is that really the behavior we should expect?

How much memory can we expect the OS to spend on opportunistic
read-in? How much disk access should be spent on a guess? There is an
intrinsic tradeoff here, applications tend to be bursty so just
because you're reading a lot now doesn't mean you'll continue... and
the filesystem will have fragmentation, so a failed guess can
translate into a lot of pointless seeking.

As I recall, in Linux 2.6 you have something like a max of 128KB of
readahead. Given that and a disk subsystem that reads at 200MB/sec you
can't spend more than 600us processing before requesting enough
additional blocks to put the disk back into readahead or you will stall
the disk.  Stalling the disk costs more than you'd expect: due to FS
fragmentation, there can be terrific gains from allowing the OS and
disk to issue reads out of order from a large request queue.

It would probably be reasonable to say that the OS should be using
much larger readahead buffers, especially on systems with fast disk
subsystems... But that doesn't come for free and can slaughter
performance for many workloads (consider: what if it was triggering
5MB of file-oriented read-ahead for every index scan seek we did?).
There is an adaptive readahead patch for Linux which should improve
things (http://lwn.net/Articles/176279/ and if you google around there
are some benchmarks) but I doubt that even that would be able to keep
a 200MB/sec+ disk subsystem saturated with the sort of access patterns
PG has...

To address this in a cross-platform way will be a challenge. I doubt
Linux is alone in having skimpy readahead (because big readahead
translates into huge losses if you get it wrong).

Given this information, a stupid 'fault-in' process should probably
give huge gains for seqscans... but I think the extra work required to
find a solution which is also useful for index operations is probably
worth it as well.


Re: Support Parallel Query Execution in Executor

From
Tom Lane
Date:
"Jonah H. Harris" <jonah.harris@gmail.com> writes:
> On 4/9/06, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> I was considering a variant idea in the shower this morning: suppose
>> that we invent one or more "background reader" processes that have
>> basically the same infrastructure as the background writer, but have
>> the responsibility of causing buffer reads to happen at useful times

> This is sort of what I'm playing with.  There are N-number of backends
> which are configured at startup and are available solely for parallel
> processing.

That's not remotely the same thing: a backend is a very different animal
from a bgwriter.  In particular, bgwriter (and bgreaders if we had 'em)
aren't database-specific, don't need to think about permission checking
as they don't execute on behalf of particular users, don't have syscaches
to keep in-sync with everything else, etc etc.
        regards, tom lane


Re: Support Parallel Query Execution in Executor

From
Tom Lane
Date:
"Gregory Maxwell" <gmaxwell@gmail.com> writes:
> On 4/9/06, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> So before we go inventing complicated bits of code with lots of added
>> overhead, we should first find out exactly why the system doesn't
>> already work the way it's supposed to.

> But is that really the behavior we should expect?

Certainly.  If the OS has readahead logic at all, it ought to think that
a seqscan of a large table qualifies.  Your arguments seem to question
whether readahead is useful at all --- but they would apply *just as
well* to an app doing its own readahead, which is what is really
getting proposed in this thread.

Before we go replacing a standard OS-level facility with our own
version, we need to have a much clearer idea of why the OS isn't getting
the job done for us.  Otherwise we're likely to write a large amount of
code and find out that it doesn't work very well either.
        regards, tom lane


Re: Support Parallel Query Execution in Executor

From
"Jonah H. Harris"
Date:
On 4/9/06, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> That's not remotely the same thing: a backend is a very different animal
> from a bgwriter.

Yes, I know.  Had I known it would've been picked apart, I would've
described it better.

> In particular, bgwriter (and bgreaders if we had 'em)
> aren't database-specific, don't need to think about permission checking
> as they don't execute on behalf of particular users, don't have syscaches
> to keep in-sync with everything else, etc etc.

In the current proof-of-concept, there is a backend (manager) and
several reader-like processes; however, in this case, the readers
perform actual execution on behalf of the user.

The reader-like processes or "execution agents", whatever you want to
call them, are limited to the postgres database only because they've
been coded that way... but there's nothing that actually restricts
them from being used for any database.

--
Jonah H. Harris, Database Internals Architect
EnterpriseDB Corporation
732.331.1324


Re: Support Parallel Query Execution in Executor

From
Tom Lane
Date:
"Jonah H. Harris" <jonah.harris@gmail.com> writes:
> The reader-like processes or "execution agents", whatever you want to
> call them, are limited to the postgres database only because they've
> been coded that way... but there's nothing that actually restricts
> them from being used for any database.

Oh?  Anything much above the level of bufmgr is going to be dependent on
relcache, syscache, etc, and those are all definitely database-specific.
You can't just retarget a backend to operate in another database, at
least not without major changes in that infrastructure.
        regards, tom lane


Re: Support Parallel Query Execution in Executor

From
"Jonah H. Harris"
Date:
On 4/9/06, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Oh?  Anything much above the level of bufmgr is going to be dependent on
> relcache, syscache, etc, and those are all definitely database-specific.
> You can't just retarget a backend to operate in another database, at
> least not without major changes in that infrastructure.

It's not that complicated and it may not prove to be the best method,
but it's worth a try.  When I get done playing with it, I'll send it
to ya to check out.


--
Jonah H. Harris, Database Internals Architect
EnterpriseDB Corporation
732.331.1324


Re: Support Parallel Query Execution in Executor

From
Alvaro Herrera
Date:
Jonah H. Harris wrote:
> On 4/9/06, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > Oh?  Anything much above the level of bufmgr is going to be dependent on
> > relcache, syscache, etc, and those are all definitely database-specific.
> > You can't just retarget a backend to operate in another database, at
> > least not without major changes in that infrastructure.
> 
> It's not that complicated and it may not prove to be the best method,
> but it's worth a try.  When I get done playing with it, I'll send it
> to ya to check out.

Hmm, if it turns out to work, I'm interested.  This is something that
definitely could be of some help to autovacuum.  Please post to
pgsql-patches.

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


Re: Support Parallel Query Execution in Executor

From
lister@sacadia.com
Date:
I am working with Solaris on SPARC almost exclusively and I believe Josh said that Sun was the one who found the
bursty behavior with scans.  Has it been confirmed that this is the case on all/most platforms?

Myron Scott
-----Original Message-----

From:  Tom Lane <tgl@sss.pgh.pa.us>
Subj:  Re: [HACKERS] Support Parallel Query Execution in Executor
Date:  Sun Apr 9, 2006 11:48 am
Size:  1K
To:  Gregory Maxwell <gmaxwell@gmail.com>
cc:  pgsql-hackers@postgresql.org

"Gregory Maxwell" <gmaxwell@gmail.com> writes:
> On 4/9/06, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> So before we go inventing complicated bits of code with lots of added
>> overhead, we should first find out exactly why the system doesn't
>> already work the way it's supposed to.

> But is that really the behavior we should expect?

Certainly.  If the OS has readahead logic at all, it ought to think that
a seqscan of a large table qualifies.  Your arguments seem to question
whether readahead is useful at all --- but they would apply *just as
well* to an app doing its own readahead, which is what is really
getting proposed in this thread.

Before we go replacing a standard OS-level facility with our own
version, we need to have a much clearer idea of why the OS isn't getting
the job done for us.  Otherwise we're likely to write a large amount of
code and find out that it doesn't work very well either.
        regards, tom lane





Re: Support Parallel Query Execution in Executor

From
"Luke Lonergan"
Date:
Myron,

On 4/9/06 1:18 PM, "lister@sacadia.com" <lister@sacadia.com> wrote:

> I am working with Solaris on SPARC almost exclusively and I believe Josh said
> that Sun was the one who found the bursty behavior with scans.  Has it been
> confirmed that this is the case on all/most platforms?

It was our folks at Greenplum that found this behavior.  It is particularly
bad on Solaris because of the poor OS readahead on UFS.  You can modify UFS
readahead behavior using OS tuning parameters as mentioned previously in the
thread and contained in Jignesh Shaw's blog at Sun (we worked with Jignesh
and Sun for about a year to get this information).

Interestingly, the use of POSIX fadvise() to indicate the nature of the I/O
as sequential makes a big positive difference in many cases, but it hurts a
lot in some other workloads where there is seek I/O and the sequential
pattern was set.

I think the best solution will use a more direct overlap approach in the
executor.

- Luke   




Re: Support Parallel Query Execution in Executor

From
"Gregory Maxwell"
Date:
On 4/9/06, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Certainly.  If the OS has readahead logic at all, it ought to think that
> a seqscan of a large table qualifies.  Your arguments seem to question
> whether readahead is useful at all --- but they would apply *just as
> well* to an app doing its own readahead, which is what is really
> getting proposed in this thread.

We know we're going to read the whole table; the OS doesn't. We can be
confident that we will not use our read-ahead when we're really doing
random accesses.

The OS has to deal with many applications with many workloads running
on a wider spectrum of hardware. It's important that it does the right
thing, but probably more important that it doesn't do the wrong thing.
This encourages erring on the side of small readahead.

> Before we go replacing a standard OS-level facility with our own
> version, we need to have a much clearer idea of why the OS isn't getting
> the job done for us.  Otherwise we're likely to write a large amount of
> code and find out that it doesn't work very well either.

That's a fair position...  It would be useful to know how much
readahead PG needs in order to keep a high-speed disk subsystem
saturated. This would involve profiling how frequently PG requests
data, how much it requests when running out of a hot cache. We could
then say that the OS would need to readahead xMB to keep a yMB/s disk
subsystem saturated.

It would be good to know how much FreeBSD will read ahead...

It might also be interesting for someone with the right testing rig on
linux to try the adaptive
readahead patch to see if that improves PG's ability to keep the disk busy.


Re: Support Parallel Query Execution in Executor

From
"Luke Lonergan"
Date:
Gregory,

On 4/9/06 1:36 PM, "Gregory Maxwell" <gmaxwell@gmail.com> wrote:

> It might also be interesting for someone with the right testing rig on
> linux to try the adaptive
> readahead patch to see if that improves PG's ability to keep the disk busy.

"the" adaptive readahead patch?  Did I miss one?

We will happily test experimental patches that improve I/O utilization.
We have an assortment of gear with high speed I/O, mostly Linux now.

- Luke 




Re: Support Parallel Query Execution in Executor

From
"Luke Lonergan"
Date:
Tom,

On 4/9/06 9:27 AM, "Tom Lane" <tgl@sss.pgh.pa.us> wrote:

> 2. There are some low-level assumptions that no one reads in pages of
> a relation without having some kind of lock on the relation (consider
> eg the case where the relation is being dropped).  A bgwriter-like
> process wouldn't be able to hold lmgr locks, and we wouldn't really want
> it to be thrashing the lmgr shared data structures for each read anyway.
> So you'd have to design some interlock to guarantee that no backend
> abandons a query (and releases its own lmgr locks) while an async read
> request it made is still pending.  Ugh.

Does this lead us right back to planning for the appropriate amount of
readahead at plan time?  We could consider a "page range" lock at that point
instead of locking each individual page.

It seems that the planner might have enough information to identify the
times when this approach would be beneficial (low selectivity aggregate,
etc), and to allocate sufficient resources to accommodate the overlap.  The
planner specific knowledge would be the main reason this would work better
than the OS built-in dynamic readahead algorithms.

- Luke 




Re: Support Parallel Query Execution in Executor

From
"Gregory Maxwell"
Date:
On 4/9/06, Luke Lonergan <llonergan@greenplum.com> wrote:
> Gregory,
>
> On 4/9/06 1:36 PM, "Gregory Maxwell" <gmaxwell@gmail.com> wrote:
>
> > It might also be interesting for someone with the right testing rig on
> > linux to try the adaptive
> > readahead patch to see if that improves PG's ability to keep the disk busy.
>
> "the" adaptive readahead patch?  Did I miss one?
>
> We will happily test experimental patches that improve I/O utilitization.
> We have an assortment of gear with high speed I/O, mostly Linux now.

Linux kernel patch, I'd mentioned it in a prior post.

http://www.vanheusden.com/ara/

It increases Linux's maximum readahead from 128K to 1MB... and it
should be smart enough that you could crank it up further without too
much risk of hurting performance elsewhere.

If PG's bottlenecked on seqscan due to insufficient readahead, I'd
expect this to show an improvement...  although I am still somewhat
doubtful that it'll be enough to keep the disk saturated if PG's
behavior is highly unoptimal.


Re: Support Parallel Query Execution in Executor

From
Tom Lane
Date:
"Luke Lonergan" <llonergan@greenplum.com> writes:
> On 4/9/06 9:27 AM, "Tom Lane" <tgl@sss.pgh.pa.us> wrote:
>> 2. There are some low-level assumptions that no one reads in pages of
>> a relation without having some kind of lock on the relation (consider
>> eg the case where the relation is being dropped).  A bgwriter-like
>> process wouldn't be able to hold lmgr locks, and we wouldn't really want
>> it to be thrashing the lmgr shared data structures for each read anyway.
>> So you'd have to design some interlock to guarantee that no backend
>> abandons a query (and releases its own lmgr locks) while an async read
>> request it made is still pending.  Ugh.

> Does this lead us right back to planning for the appropriate amount of
> readahead at plan time?  We could consider a "page range" lock at that point
> instead of locking each individual page.

No, you're missing my point entirely.  What's bothering me is the
prospect of a "bgreader" process taking actions that are only safe
because of a lock that is held by a different process --- changing the
granularity of that lock doesn't get you out of trouble.

Here's a detailed scenario:
1. Backend X reads page N of a table T, queues a request for N+1.
2. While processing page N, backend X gets an error and aborts
   its transaction, thereby dropping all its lmgr locks.
3. Backend Y executes a DROP or TRUNCATE on T, which it can
   now do because there's no lock held anywhere on T.  There
   are actually two interesting sub-phases of this:
   3a. Kill any shared buffers holding pages to be deleted.
   3b. Physically drop or truncate the OS file.
4. Bgreader tries to execute the pending read request.  Ooops.

If step 4 happens after 3b, the bgreader gets an error.  Maybe we could
kluge that to not cause any serious problems, but the nastier case is
where 4 happens between 3a and 3b --- the bgreader sees nothing wrong,
but it's now loaded a shared buffer that must *not* be there.

Having thought more about this, there may be a solution possible using
tighter integration with the bufmgr.  The bufmgr already has a notion of
a buffer page being "read busy".  Maybe, rather than just pushing a read
request into a separate queue somewhere, the requestor has to assign a
shared buffer for the page it wants and put the buffer into read-busy
state, but then pass the request to perform the physical read off to
someone else.  The advantage of this is that there's state that step 3a
can see telling it that a conflicting read is pending, and it just needs
to wait for the read to finish before killing the buffer.

Bottom line seems to be: just as the bgwriter is pretty intimately tied
to bufmgr, bgreaders would have to be as well.
        regards, tom lane


Re: Support Parallel Query Execution in Executor

From
Markus Schiltknecht
Date:
Hi,

On Sun, 2006-04-09 at 15:11 -0400, Tom Lane wrote:
> You can't just retarget a backend to operate in another database, at
> least not without major changes in that infrastructure. 

Why not? What would be needed to retarget a backend to operate in
another database?

Such a retargetting would be helpful in several cases:
- pre-forking of backends (well, depends on efficiency of
retargetting..)
- slave backends for parallel query execution
- slave backends for replaying replicated transactions

I'm using backends operating without a client connection, with superuser
privileges only. But those are still connected to one specific database.
It would really be helpful if I could retarget them as needed.

Regards

Markus




Re: Support Parallel Query Execution in Executor

From
"Qingqing Zhou"
Date:
"Markus Schiltknecht" <markus@bluegap.ch> wrote
> Hi,
>
> On Sun, 2006-04-09 at 15:11 -0400, Tom Lane wrote:
> > You can't just retarget a backend to operate in another database, at
> > least not without major changes in that infrastructure.
>
> Why not? What would be needed to retarget a backend to operate in
> another database?
>

As Tom pointed out, without big changes, a backend on database "D1" can't
connect to "D2". This is because connecting to a database initializes a
lot of variables, so to reconnect to another one on the fly you would
have to change these variables one by one.

Regards,
Qingqing




Re: Support Parallel Query Execution in Executor

From
Simon Riggs
Date:
On Sun, 2006-04-09 at 17:11 +0800, Qingqing Zhou wrote:
> "Tom Lane" <tgl@sss.pgh.pa.us> wrote
> >
> > This is exactly the bit of optimism I was questioning.  We've already
> > been sweating blood trying to reduce multiprocessor contention on data
> > structures in which collisions ought to be avoidable (ie, buffer arrays
> > where you hope not everyone is hitting the same buffer at once).  I
> > think passing large volumes of data between different processes is going
> > to incur quite a lot of locking overhead, pipeline stalls for cache line
> > transfers, etc, etc, because heavy contention for the transfer buffer is
> > simply not going to be avoidable.
> >
> 
> Yes, there is overhead. Let's say how can we tackle them one by one:
> 
> * lock contention of transfer buffer *

I think it would be useful to think about exactly what type of
query/activity we are looking to improve performance for. That way we
can understand the benefit of this proposal and take some baseline
measurements to analyse what is happening for those cases.

For example, a CREATE INDEX will require most if not all rows. Whereas a
SeqScan may have a selectivity of 0.001 or below in some cases. The
SeqScan case might also vary considerably in the amount of CPU required
to identify qualifying rows and that CPU might benefit from
parallelisation also.

e.g. A low-selectivity SeqScan is a good case for fairly simple
performance improvement, since the output from that executor node is
going to be much less than its input and there is less need to
parallelize the complete tree. A parallel slave node might then be
placed in charge of running a complete SeqScan node on an agreed section
of a table, rather than simply performing I/O for another query.

Best Regards, Simon Riggs



Re: Support Parallel Query Execution in Executor

From
Markus Schiltknecht
Date:
Hi Qingqing,

On Mon, 2006-04-10 at 16:38 +0800, Qingqing Zhou wrote:
> > Why not? What would be needed to retarget a backend to operate in
> > another database?
>
> As Tom pointed out, without big change, a backend on database "D1" can't
> connect to "D2". This is because to connect to a database, we need to
> initialize a lot of variables. So when you reconnect to another one on the
> fly, you have to change these variables one by one.

Sure, the question is: what is needed to retarget a backend?

IMHO, in all of the three examples mentioned up-thread, it would be
better to retarget an existing backend, instead of forking a new one,
because that would perform better and/or save some resources.

Please do also consider that for parallel query execution, as well as
for replaying remote transactions, access rights checking and any client
connection setup could be omitted. Such special processing backends could
easily be run with superuser privileges and without a client connection.

Regards

Markus




Re: Support Parallel Query Execution in Executor

From
Martijn van Oosterhout
Date:
On Sun, Apr 09, 2006 at 05:54:56PM -0400, Tom Lane wrote:
> Here's a detailed scenario:
>
>     1. Backend X reads page N of a table T, queues a request for N+1.
>     2. While processing page N, backend X gets an error and aborts
>        its transaction, thereby dropping all its lmgr locks.
>     3. Backend Y executes a DROP or TRUNCATE on T, which it can
>        now do because there's no lock held anywhere on T.  There
>        are actually two interesting sub-phases of this:
>        3a.  Kill any shared buffers holding pages to be deleted.
>        3b.  Physically drop or truncate the OS file.
>     4. Bgreader tries to execute the pending read request.  Ooops.
>
> If step 4 happens after 3b, the bgreader gets an error.  Maybe we could
> kluge that to not cause any serious problems, but the nastier case is
> where 4 happens between 3a and 3b --- the bgreader sees nothing wrong,
> but it's now loaded a shared buffer that must *not* be there.

The way I dealt with this when playing with async i/o for postgres was
to have the initiator mark the buffer I/O-in-progress then send the
request, just like you suggest. This would cause other people to wait
on it. When the request later finished you processed it as usual and
backends waiting would continue.

There appear to be two separate cases here though: one is to just farm
out the read request to another process (basically aio), the other is
to do actual processing there. The latter is obviously far more useful
but requires a fair bit more infrastructure.

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/

Re: Support Parallel Query Execution in Executor

From
"Qingqing Zhou"
Date:
"Markus Schiltknecht" <markus@bluegap.ch> wrote
> Hi Qingqing,
>
> >
> > As Tom pointed out, without big change, a backend on database "D1" can't
> > connect to "D2". This is because to connect to a database, we need to
> > initialize a lot of variables. So when you reconnect to another one on
the
> > fly, you have to change these variables one by one.
>
> Sure, the question is: what is needed to retarget a backend?
>
Check the code in InitPostgres(). These global variables are scattered in many
places, so I am not sure if it is easy to write clean code to clear up these
variables. But if you can come up with a patch to do reconnect without
disconnect, that will be cool.

> IMHO, in all of the three examples mentioned up-thread, it would be
> better to retarget an existing backend, instead of forking a new one,
> because that would perform better and/or save some resources.
>
That's for sure.

> Please do also consider, that for parallel query execution as well as
> for replaying remote transactions access rights checking and any client
> connection setup could be omitted. Such special processing backeds could
> easily be run with superuser privileges and without a client connection.
>
This is a problem for parallel query execution. I suspect we can't run with
superuser privileges, now or in the future. For now, I am not clear whether
there is any ACL check at the query execution stage -- there seems not to
be, at least for DML. For the future, I am pretty sure this is not the
way -- the mere possibility of future support for audit or MAC makes the
reason clear.

Regards,
Qingqing




Re: Support Parallel Query Execution in Executor

From
Hannu Krosing
Date:
On Sun, 2006-04-09 at 18:26, Martijn van Oosterhout wrote:

> The cost of fetching a page from the OS is not really much of an
> overhead, 

Have you tested this ?

I remember having a case where a data load using COPY into a table with
several indexes ran an order of magnitude faster with a bigger work-mem.
It was with a computer with enough memory to fit the whole working set
(table + indexes) into OS cache, thus I can't think of any other
overhead except reading the index pages repeatedly from OS cache to
shared mem buffers.

I agree that a real disk read is much slower than OS cache read, but
even an OS cache read is still much slower than no read at all. Sure
it's not nearly as bad on seqscans, but it still takes time.

> so I'd like to know how much benefit these buffer cache hacks
> actually produce.

-------------------
Hannu




Re: Support Parallel Query Execution in Executor

From
Markus Schiltknecht
Date:
On Mon, 2006-04-10 at 17:22 +0800, Qingqing Zhou wrote:
> Check the code InitPostgres(). These global varaibles are scattered in many
> places, so I am not sure if it is easy to write clean code to clear up these
> variables. But if you can come up with a patch to do reconnect without
> disconnect, that will be cool.

Yes, that's where I've been playing around already. Not with much
success until now, though. :-(

> This is a problem for parallel query execution. I suspect we can't run with
> superuser privileges for now and for future. For now, I am not clear if
> there is any ACL check at query execution stage -- seems should not be, at
> least for DML. For future, I am pretty clear that this is not the way -- the
> mere possibility of future support of audit or MAC will bring in mind the
> reason.

(What's audit or MAC?)

Well, then let such a slave backend inherit access rights from its
caller. You'd still save rechecking of passwords and such. And you save
the client connection.

I'll give the retargetting code another try. If anybody has some hints
on what stumbling blocks and potholes to watch out for on the way
there...

Regards

Markus



Re: Support Parallel Query Execution in Executor

From
Alvaro Herrera
Date:
Markus Schiltknecht wrote:
> On Mon, 2006-04-10 at 17:22 +0800, Qingqing Zhou wrote:
> > Check the code InitPostgres(). These global varaibles are scattered in many
> > places, so I am not sure if it is easy to write clean code to clear up these
> > variables. But if you can come up with a patch to do reconnect without
> > disconnect, that will be cool.
> 
> Yes, that's where I've been playing around already. Not with much
> success until now, though. :-(

An idea arising in chat with Joshua Drake:  the retargetting code, if it
turns out to work and not be excessively expensive, could also be useful
to implement a server-side "connection pooling" of sorts: the postmaster
could keep idle backends and retarget them to a database that receives
an incoming connection.  However, we'd also need a mechanism to clean
all backend state previous to reusing a connection, to leave it "as
new" (no prepared statements, WITH HOLD cursors, etc.)

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


Re: Support Parallel Query Execution in Executor

From
Markus Schiltknecht
Date:
Hi,

On Mon, 2006-04-10 at 13:36 -0400, Alvaro Herrera wrote:
> An idea arising in chat with Joshua Drake:  the retargetting code, if it
> turns out to work and not be excessively expensive, could also be useful
> to implement a server-side "connection pooling" of sorts: the postmaster
> could keep idle backends and retarget them to a database that receives
> an incoming connection.  However, we'd also need a mechanism to clean
> all backend state previous to reusing a connection, to leave it "as
> new" (no prepared statements, WITH HOLD cursors, etc.)

I suspect that's what I was calling 'pre-forking'. Please also note that
this would be the most difficult of my three examples to implement.

Regards

Markus




Re: Support Parallel Query Execution in Executor

From
"Luke Lonergan"
Date:
Hannu,

On 4/10/06 2:23 AM, "Hannu Krosing" <hannu@skype.net> wrote:

>> The cost of fetching a page from the OS is not really much of an
>> overhead,
> 
> Have you tested this ?

I have - the overhead of fetching a page from Linux I/O cache to buffer
cache is about an additional 20% over fetching it directly from buffer cache
on PG 7.4.

Haven't tested it on 8.1.

- Luke




Re: Support Parallel Query Execution in Executor

From
"Luke Lonergan"
Date:
Gregory,

On 4/9/06 2:04 PM, "Gregory Maxwell" <gmaxwell@gmail.com> wrote:

> It increases Linux's maximum readahead from 128K to 1meg .. and it
> should be smart enough that you could crank it up further without too
> much risk of hurting performance elsewhere.

Interesting - we are now routinely using the "blockdev --setra" to set the
block device readahead and our common setting now is 16MB, so I will look at
how the behavior of this adaptive readahead affects normal query workload.

From the patch:
     Normally, the kernel uses a stock readahead logic that is well
     understood and well tuned. This option enables a much complex and
     feature rich one. It is more aggressive and memory efficient in
     doing readahead, and supports some less-common access patterns such
     as reading backward and reading sparsely. However, due to the great
     diversity of real world applications, it might not fit everyone.

- Luke




Re: Support Parallel Query Execution in Executor

From
Myron Scott
Date:
On Mon, 2006-04-10 at 02:16, Martijn van Oosterhout wrote:

> The appears to be two seperate cases here though, one is to just farm
> out the read request to another process (basically aio), the other is
> to do actual processing there. The latter is obviously for more useful
> but requires a fair bit more infrastructure.
>

I ran some tests to see where time is spent during SeqScans.  I did the
following.

        tester=# vacuum analyze verbose test;
        INFO:  vacuuming "public.test"
        INFO:  "test": found 0 removable, 727960 nonremovable row
        versions in 5353 pages
        DETAIL:  0 dead row versions cannot be removed yet.
        There were 0 unused item pointers.
        0 pages are entirely empty.
        CPU 0.18s/0.27u sec elapsed 0.91 sec.
        INFO:  analyzing "public.test"
        INFO:  "test": scanned 3000 of 5353 pages, containing 407952
        live rows and 0 dead rows; 3000 rows in sample, 727922 estimated
        total rows
        VACUUM
        tester=# select version();
                                            version
        -------------------------------------------------------------------------------
         PostgreSQL 8.2devel on sparc-sun-solaris2.11, compiled by GCC
        gcc (GCC) 3.3.2
        (1 row)

        tester=# select count(random) from test;
         count
        --------
         727960
        (1 row)


With the following dtrace results...

        # ./probediff2.d 514607
        dtrace: script './probediff2.d' matched 10 probes
        CPU     ID                    FUNCTION:NAME
          0  46811            ExecEndSeqScan:return scan time 20406
        ^C

          smgrread
        641566800
          Virtualized - smgrread
        439798800
          smgread - Call Count
        5353
          HeapTupleSatisfiesSnapshot
        6735471000
          Virtualized - HeapTupleSatisfiesSnapshot
        3516556800
          HeapTupleSatisfiesSnapshot - Call Count
        727960
          Virtualized - ReadBuffer
        558230600
          ReadBuffer
        864931000
          Virtualized - ExecutePlan
        7331181400
          Virtualized - ExecSeqScan
        7331349600
          ExecutePlan
        20405943000
          ExecSeqScan
        20406161000


The virtualized times are supposed to be actual time spent on the CPU,
with the time spent in the probe factored out.  It seems here that half
the time in SeqScan is spent validating the tuples, as opposed to 1/10th
doing I/O.  I'm not sure that just farming out read I/O is going to be
all that helpful in this situation.  That's why I think it's a good idea
to create a slave process that prefetches pages and transfers valid
ItemPointers to the master.  There may not be much to be gained on
simple SeqScans; however, in complex queries that include a SeqScan, you
may gain a lot by offloading this work onto a slave thread.  A table
with TOAST'ed attributes comes to mind.  The slave thread could be
working away on the rest of the table while the master is
PG_DETOAST_DATUM'ing the attributes for transmission back to the client
or additional processing.  Am I missing something in this analysis?

I've attached my dtrace script.

Myron Scott


Attachment

Re: Support Parallel Query Execution in Executor

From
Greg Stark
Date:
Alvaro Herrera <alvherre@commandprompt.com> writes:

> An idea arising in chat with Joshua Drake:  the retargetting code, if it
> turns out to work and not be excessively expensive, could also be useful
> to implement a server-side "connection pooling" of sorts: the postmaster
> could keep idle backends and retarget them to a database that receives
> an incoming connection.  However, we'd also need a mechanism to clean
> all backend state previous to reusing a connection, to leave it "as
> new" (no prepared statements, WITH HOLD cursors, etc.)

Isn't all that work pretty much exactly the main cost of starting a new
backend?


-- 
greg



Re: Support Parallel Query Execution in Executor

From
Alvaro Herrera
Date:
Greg Stark wrote:
> Alvaro Herrera <alvherre@commandprompt.com> writes:
> 
> > An idea arising in chat with Joshua Drake:  the retargetting code, if it
> > turns out to work and not be excessively expensive, could also be useful
> > to implement a server-side "connection pooling" of sorts: the postmaster
> > could keep idle backends and retarget them to a database that receives
> > an incoming connection.  However, we'd also need a mechanism to clean
> > all backend state previous to reusing a connection, to leave it "as
> > new" (no prepared statements, WITH HOLD cursors, etc.)
> 
> Isn't all that work pretty much exactly the main cost of starting a new
> backend?

On Linux and other systems where fork() has negligible cost, maybe; but
on Windows and Solaris, it's certainly not.

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.


Re: Support Parallel Query Execution in Executor

From
Greg Stark
Date:
Simon Riggs <simon@2ndquadrant.com> writes:

> I think it would be useful to think about exactly what type of
> query/activity we are looking to improve the performance on. That way we
> can understand the benefit of this proposal and take some baseline
> measurements to analyse what is happening for those cases.

I find the focus on sequential scans, index scans, etc. quite odd when you're
discussing parallel query processing. The whole goal of parallel query
processing is to bring more *cpu* to bear on the problem. That's going to be
most relevant when you're cpu bound, not i/o bound.

The queries I would expect to be helped most by parallel query processing are
queries that involve sorting. For example, a big merge join with two sorts on
either side could perform the two sorts simultaneously. If they provide the
results of the final pass to a third thread it can execute the merge join and
the rest of the query plan while the sorts are still executing on two other
processors.


-- 
greg



Re: Support Parallel Query Execution in Executor

From
Myron Scott
Date:
On Tue, 2006-04-11 at 07:47, Myron Scott wrote:
> client 
> or additional processing.  Am I missing something in this analysis?
> 
> I've attached my dtrace script.
> 

To answer my own question, I suppose my processors are
relatively slow compared to most setups.

Myron Scott



Re: Support Parallel Query Execution in Executor

From
Greg Stark
Date:
Alvaro Herrera <alvherre@commandprompt.com> writes:

> Greg Stark wrote:
> > Alvaro Herrera <alvherre@commandprompt.com> writes:
> > 
> > > An idea arising in chat with Joshua Drake:  the retargetting code, if it
> > > turns out to work and not be excessively expensive, could also be useful
> > > to implement a server-side "connection pooling" of sorts: the postmaster
> > > could keep idle backends and retarget them to a database that receives
> > > an incoming connection.  However, we'd also need a mechanism to clean
> > > all backend state previous to reusing a connection, to leave it "as
> > > new" (no prepared statements, WITH HOLD cursors, etc.)
> > 
> > Isn't all that work pretty much exactly the main cost of starting a new
> > backend?
> 
> On Linux and other systems were fork() has negligible cost, maybe; but
> on Windows and Solaris, it's certainly not.

Even on Solaris I'm sure parsing and preparing plans for all the queries,
building up the system table cache for all the objects in the database, and
so on are much more expensive than fork(). I wouldn't be surprised if even
on Windows it was still a pretty close race.

-- 
greg



Re: Support Parallel Query Execution in Executor

From
Alvaro Herrera
Date:
Greg Stark wrote:

> Even on Solaris I'm sure parsing and preparing plans for all the queries,
> building up the system table cache for all the objects in the database, and so
> on are much much more expensive than fork(). I wouldn't be surprised if even
> on windows it was still a pretty close race.

Parsing/planning what queries?  Regarding system caches, they are
populated from a cache file; they are not read from the catalogs each
time.

But until we see a patch implementing the idea, this is all very
theoretical and probably wrong.

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


Re: Support Parallel Query Execution in Executor

From
"Joshua D. Drake"
Date:
Greg Stark wrote:
> Simon Riggs <simon@2ndquadrant.com> writes:
> 
>> I think it would be useful to think about exactly what type of
>> query/activity we are looking to improve the performance on. That way we
>> can understand the benefit of this proposal and take some baseline
>> measurements to analyse what is happening for those cases.
> 
> I find the focus on sequential scans, index scans, etc. quite odd when you're
> discussing parallel query processing. The whole goal of parallel query
> processing is to bring more *cpu* to bear on the problem. That's going to be
> most relevant when you're cpu bound, not i/o bound.
> 
> The queries I would expect to be helped most by parallel query processing are
> queries that involve sorting. For example, a big merge join with two sorts on
> either side could perform the two sorts simultaneously. If they provide the
> results of the final pass to a third thread it can execute the merge join and
> the rest of the query plan while the sorts are still executing on two other
> processors.

Well, I am going to go out on a limb here and guess that sequential
scans and index scans are still very relevant, because the CPU could be
bound by the scan (either one) depending on the type of query being
performed.

This isn't really about being I/O bound so much as about the kinds of
accesses to the data we have to deal with during query processing.

You are correct that parallel query processing helps multiple-sort
queries, but those sorts may or may not hit an index.

Sincerely,

Joshua D. Drake


> 
> 


-- 
   === The PostgreSQL Company: Command Prompt, Inc. ===
Sales/Support: +1.503.667.4564 || 24x7/Emergency: +1.800.492.2240
Providing the most comprehensive PostgreSQL solutions since 1997
http://www.commandprompt.com/




Re: Support Parallel Query Execution in Executor

From
Greg Stark
Date:
"Joshua D. Drake" <jd@commandprompt.com> writes:

> Well I am go out on a limb here and gather to guess that sequential scans and
> index scans are still very relevant because the CPU could be bound by the scan
> (either one) based on the type of query being performed.
> 
> This doesn't really have anything to do with being IO bound as to the type of
> accesses to the data we have to deal with in regards to query processing.

It has everything to do with being i/o bound. The only way having two
processors perform part of an index or sequential scan would help is if your
disk subsystem is capable of providing data faster than a single processor is
capable of requesting it.

That's only going to be true for very high end systems with multiple raid
controllers and dozens of spindles.

On the other hand even moderately sized dual-core systems could probably
benefit from being able to perform multiple cpu-intensive operations
simultaneously.

-- 
greg



OS cached buffers (was: Support Parallel Query Execution in Executor)

From
"Jim C. Nasby"
Date:
On Mon, Apr 10, 2006 at 12:02:56PM -0700, Luke Lonergan wrote:
> Hannu,
> 
> On 4/10/06 2:23 AM, "Hannu Krosing" <hannu@skype.net> wrote:
> 
> >> The cost of fetching a page from the OS is not really much of an
> >> overhead,
> > 
> > Have you tested this ?
> 
> I have - the overhead of fetching a page from Linux I/O cache to buffer
> cache is about an additional 20% over fetching it directly from buffer cache
> on PG 7.4.

Is there any practical way to tell the difference between a page coming
from the OS cache and one coming from disk? Or maybe, for a set of pages,
an estimate of how many came from cache vs. disk? There are some areas
where having this information would be very useful, such as for vacuum
delay. It would make tuning much easier, and it would also give us some
insight into how heavily loaded the disks are, which would also be useful
info for vacuum to have (so we could adjust vacuum_cost_delay
dynamically based on load).
-- 
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461


Re: Support Parallel Query Execution in Executor

From
"Jim C. Nasby"
Date:
On Mon, Apr 10, 2006 at 05:22:13PM +0800, Qingqing Zhou wrote:
> 
> "Markus Schiltknecht" <markus@bluegap.ch> wrote
> > Hi Qingqing,
> >
> > >
> > > As Tom pointed out, without big change, a backend on database "D1" can't
> > > connect to "D2". This is because to connect to a database, we need to
> > > initialize a lot of variables. So when you reconnect to another one on
> the
> > > fly, you have to change these variables one by one.
> >
> > Sure, the question is: what is needed to retarget a backend?
> >
> Check the code InitPostgres(). These global varaibles are scattered in many
> places, so I am not sure if it is easy to write clean code to clear up these
> variables. But if you can come up with a patch to do reconnect without
> disconnect, that will be cool.

Something else to consider: most queries that would benefit from
parallel execution are expensive enough that the cost of spawning some
new backends wouldn't be that big a deal, so perhaps for an initial
version it would be best to KISS and just spawn parallel execution
backends as needed.
-- 
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461


Re: Support Parallel Query Execution in Executor

From
"Jim C. Nasby"
Date:
On Mon, Apr 10, 2006 at 01:36:45PM -0400, Alvaro Herrera wrote:
> Markus Schiltknecht wrote:
> > On Mon, 2006-04-10 at 17:22 +0800, Qingqing Zhou wrote:
> > > Check the code InitPostgres(). These global varaibles are scattered in many
> > > places, so I am not sure if it is easy to write clean code to clear up these
> > > variables. But if you can come up with a patch to do reconnect without
> > > disconnect, that will be cool.
> > 
> > Yes, that's where I've been playing around already. Not with much
> > success until now, though. :-(
> 
> An idea arising in chat with Joshua Drake:  the retargetting code, if it
> turns out to work and not be excessively expensive, could also be useful
> to implement a server-side "connection pooling" of sorts: the postmaster
> could keep idle backends and retarget them to a database that receives
> an incoming connection.  However, we'd also need a mechanism to clean
> all backend state previous to reusing a connection, to leave it "as
> new" (no prepared statements, WITH HOLD cursors, etc.)

Oracle allows you to essentially re-connect to an existing connection by
saving connection state when a connection goes back to the connection
pool. Essentially, you connect, and then re-authenticate as a different
user. That user has a specific environment associated with it which is
then pulled in. This makes it reasonable to use Oracle's built-in code
for handling permissions, etc; you just give each system user a database
account. While this sounds scary to folk that haven't used it, it's
actually safer than rolling your own authentication and security
mechanism (which will likely have bugs in it) and having your middleware
connect to the database with some dedicated account.
-- 
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461


Re: Support Parallel Query Execution in Executor

From
"Joshua D. Drake"
Date:
> That's only going to be true for very high end systems with multiple raid
> controllers and dozens of spindles.

Not true. I have a system right now that would benefit. My database is
only 500 megs and I have 2 gigs of RAM and two processors... I only have
a RAID 1, but that is OK because most things are cached in memory
anyway.

> 
> On the other hand even moderately sized dual-core systems could probably
> benefit from being able to perform multiple cpu-intensive operations
> simultaneously.

See above :)

Joshua D. Drake

> 
-- 
   === The PostgreSQL Company: Command Prompt, Inc. ===
Sales/Support: +1.503.667.4564 || 24x7/Emergency: +1.800.492.2240
Providing the most comprehensive PostgreSQL solutions since 1997
http://www.commandprompt.com/






Re: Support Parallel Query Execution in Executor

From
Bruce Momjian
Date:
Enhanced TODO:

* Experiment with multi-threaded backend for better resource utilization

  This would allow a single query to make use of multiple CPUs or
  multiple I/O channels simultaneously.  One idea is to create a
  background reader that can pre-fetch sequential and index scan
  pages needed by other backends.  This could be expanded to allow
  concurrent reads from multiple devices in a partitioned table.

The issue of parallelism is basically a sliding scale, meaning what do you
want to do concurrently.  In this case, we are trying to do I/O and CPU
work concurrently.  Another idea is to do I/O for partitioned tables
concurrently, and of course there are many CPU-concurrent cases, like
sorting.

---------------------------------------------------------------------------

Tom Lane wrote:
> Myron Scott <lister@sacadia.com> writes:
> > Gregory Maxwell wrote:
> >> There are other cases where it is useful to perform parallel I/O
> >> without parallel processing..
> 
> > I have done some testing more along these lines with an old fork of
> > postgres code (2001).  In my tests, I used a thread to delegate out
> > the actual heap scan of the SeqScan.  The job of the "slave" thread
> > the was to fault in buffer pages and determine the time validity of
> > the tuples.  ItemPointers are passed back to the "master" thread via a
> > common memory area guarded by mutex locking.
> 
> I was considering a variant idea in the shower this morning: suppose
> that we invent one or more "background reader" processes that have
> basically the same infrastructure as the background writer, but have
> the responsibility of causing buffer reads to happen at useful times
> (whereas the writer causes writes to happen at useful times).  The
> idea would be for backends to signal the readers when they know they
> will need a given block soon, and then hopefully when they need it
> it'll already be in shared buffers.  For instance, in a seqscan it'd be
> pretty trivial to request block N+1 just after reading block N, and then
> doing our actual processing on block N while (we hope) some reader
> process is faulting in N+1.  Bitmap indexscans could use this structure
> too; I'm less sure about whether plain indexscans could do much with it
> though.
> 
> The major issues I can see are:
> 
> 1. We'd need a shared-memory queue of read requests, probably much like
> the queue of fsync requests.  We've already seen problems with
> contention for the fsync queue, IIRC, and that's used much less heavily
> than the read request queue would be.  So there might be some
> performance issues with getting the block requests sent over to the
> readers.
> 
> 2. There are some low-level assumptions that no one reads in pages of
> a relation without having some kind of lock on the relation (consider
> eg the case where the relation is being dropped).  A bgwriter-like
> process wouldn't be able to hold lmgr locks, and we wouldn't really want
> it to be thrashing the lmgr shared data structures for each read anyway.
> So you'd have to design some interlock to guarantee that no backend
> abandons a query (and releases its own lmgr locks) while an async read
> request it made is still pending.  Ugh.
> 
>             regards, tom lane
> 
> ---------------------------(end of broadcast)---------------------------
> TIP 5: don't forget to increase your free space map settings
> 

--
  Bruce Momjian   http://candle.pha.pa.us
  EnterpriseDB    http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +


Re: OS cached buffers (was: Support Parallel Query Execution

From
Bruce Momjian
Date:
Jim C. Nasby wrote:
> On Mon, Apr 10, 2006 at 12:02:56PM -0700, Luke Lonergan wrote:
> > Hannu,
> > 
> > On 4/10/06 2:23 AM, "Hannu Krosing" <hannu@skype.net> wrote:
> > 
> > >> The cost of fetching a page from the OS is not really much of an
> > >> overhead,
> > > 
> > > Have you tested this ?
> > 
> > I have - the overhead of fetching a page from Linux I/O cache to buffer
> > cache is about an additional 20% over fetching it directly from buffer cache
> > on PG 7.4.
> 
> Is there any pratcical way to tell the difference between a page comming
> from the OS cache and one comming from disk? Or maybe for a set of pages
> an estimate on how many came from cache vs disk? There's some areas
> where having this information would be very useful, such as for vacuum
> delay. It would make tuning much easier, and it would also give us some
> insight on how heavily loaded disks were, which would also be useful
> info for vacuum to have (so we could adjust vacuum_cost_delay
> dynamically based on load).

getrusage() returns:

        !       0.000062 elapsed 0.000000 user 0.000062 system sec
        !       [0.000000 user 0.009859 sys total]
        !       0/0 [19/2] filesystem blocks in/out
        !       0/0 [0/0] page faults/reclaims, 0 [0] swaps
        !       0 [0] signals rcvd, 0/0 [4/5] messages rcvd/sent
        !       0/0 [23/6] voluntary/involuntary context switches

but I don't see anything in there that would show kernel cache vs. disk
I/O.  In fact, there is usually little connection in the kernel between
an I/O request and the process that requests it.

--
  Bruce Momjian   http://candle.pha.pa.us
  EnterpriseDB    http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +


Re: OS cached buffers (was: Support Parallel Query Execution

From
"Jim C. Nasby"
Date:
On Thu, Apr 13, 2006 at 03:38:04PM -0400, Bruce Momjian wrote:
> > Is there any pratcical way to tell the difference between a page comming
> > from the OS cache and one comming from disk? Or maybe for a set of pages
> > an estimate on how many came from cache vs disk? There's some areas
> > where having this information would be very useful, such as for vacuum
> > delay. It would make tuning much easier, and it would also give us some
> > insight on how heavily loaded disks were, which would also be useful
> > info for vacuum to have (so we could adjust vacuum_cost_delay
> > dynamically based on load).
> 
> getrusage() returns:
> 
>         !       0.000062 elapsed 0.000000 user 0.000062 system sec
>         !       [0.000000 user 0.009859 sys total]
>         !       0/0 [19/2] filesystem blocks in/out
>         !       0/0 [0/0] page faults/reclaims, 0 [0] swaps
>         !       0 [0] signals rcvd, 0/0 [4/5] messages rcvd/sent
>         !       0/0 [23/6] voluntary/involuntary context switches
> 
> but I don't see anything in there that would show kernel cache vs. disk
> I/O.  In fact, there is usually little connection in the kernel between
> an I/O request and the process that requests it.

Yeah, my assumption has been that the only way to tell the difference
would be by timing, but I don't know how practical that is. Since
gettimeofday() or whatever EXPLAIN ANALYZE uses is apparently very
expensive, perhaps there's some other alternative. Perhaps the timing
info in getrusage() would work for this. Another idea is setting an
alarm for a fairly short period before making the I/O request. If the
request comes back before the alarm fires, the data must have been in
the OS cache.

Another thought is that any I/O request that goes to disk would most
likely put the process requesting the I/O to sleep, but a request being
served out of cache might not do that. Perhaps there's some way to
recognize that.

Or maybe a better tack would be to develop a patch for as many OSes as
possible that would tell the caller whether an I/O request came out of
cache or not.
-- 
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461


Re: OS cached buffers (was: Support Parallel Query Execution

From
Bruce Momjian
Date:
Jim C. Nasby wrote:
> Yeah, my assumption has been that the only way to tell the difference
> would be by timing, but I don't know how practical that is. Since
> gettime() or whatever EXPLAIN ANALYZE uses is apparently very expensive,
> perhaps there's some other alternative. Perhapse the timing info in
> getrusage would work for this. Another idea is setting an alarm for a
> fairly short period before making the IO request. If the request comes
> back before the alarm fires, the data must have been in the OS cache.
> 
> Another thought is that any IO request that goes to disk would most
> likely put the process requesting the IO to sleep, but a request being
> served out of cache might not do that. Perhaps there's some way to
> recognize that.
> 
> Or maybe a better track would be to develop a patch for as many OSes as
> possible that would tell the caller if an IO request came out of cache
> or not.

Or give up, which seems the most fruitful approach.

--
  Bruce Momjian   http://candle.pha.pa.us
  EnterpriseDB    http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +