Thread: Support Parallel Query Execution in Executor
I have written some experimental code doing a master-slave seqscan in PostgreSQL. During the work, I came to feel we have enough infrastructure to support parallel query execution. What I did is add a new node, PARA, and plug it in above the node that we want to execute in parallel. At this stage, a PARA node is just a SeqScan node:

    typedef struct Para
    {
        /* TODO: add a union to put all nodes supporting parallelism here */
        SeqScan     scan;

        /* Split / Merge / Redistribute */
        ParaType    type;

        /* TODO: other possible parameters */
    } Para;

At execution, the master (the process that receives the query) wakes up a slave process (an idle ordinary backend), and the slave passes its scan results to the master via a shared-memory communication buffer. In detail, the execution goes like this:

Master process:
1. PARA init: wake up a slave; pass the queryTree and outerPlan(planTree) to it by nodeToString();
2. PARA exec: get an item from the communication buffer;
       if the item is a valid tuple
           return the item;
       else
           handle other types of item;  /* execution done / error */
3. PARA end: do some cleanup.

As we can see from the PARA init stage, even with this most simple PARA node it is easy to support inter-node parallelism.

Slave process (uses code similar to the autovacuum process):
1. Get the queryTree and planTree;
2. Redirect the destReceiver to the communication buffer;
3. Encapsulate them in an executor and run.

The query plan looks like this:

TEST=# explain select max(a), max(b) from t;
                               QUERY PLAN
----------------------------------------------------------------------
 Aggregate  (cost=7269.01..7269.02 rows=1 width=53)
   ->  Para [Split = 1]  (cost=10.00..5879.00 rows=278000 width=53)
         ->  Seq Scan on T  (cost=0.00..5879.00 rows=278000 width=53)
(3 rows)

There are some problems I haven't addressed yet. The most difficult one for me is xid assignment: the master and slaves should see an identical view, and the key to that is the xid. I am not sure of the correct solution to this problem.
We may use the same xid, or a contiguous range of xids, for the master and slaves. There are other problems, like the login problem (the master and slaves should act as the same user) and elog message passing, which are also important, but I think we can handle them without much trouble.

I haven't touched the most difficult part, the parallel query optimizer. But thanks to the two-phase parallel optimization technique, this part can be treated like the geqo optimizer: without enough evidence, we simply don't enable parallel query execution.

Are there any show-stopping reasons not to do this?

Regards,
Qingqing
On Thu, Apr 06, 2006 at 06:28:33PM +0800, Qingqing Zhou wrote:
> I have written some experimental code of doing master-slave seqscan in
> PostgreSQL. During the work, I feel we had enough infrastructure to support
> parallel query execution.

Good work. One question though: temporary tables. Currently there is no way to get a consistent view of a temporary table from another process. You'd probably also have to avoid shipping off non-immutable function calls. Is it possible to deadlock? (I don't see how, but I may not be looking hard enough.) The hardest work would be finding what can be safely farmed off.

Nice work though, I hadn't thought of this approach.

Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
> Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
> tool for doing 5% of the work and then sitting around waiting for someone
> else to do the other 95% so you can sue them.
On 4/6/06, Qingqing Zhou <zhouqq@cs.toronto.edu> wrote:
> I have written some experimental code of doing master-slave seqscan in
> PostgreSQL. During the work, I feel we had enough infrastructure to support
> parallel query execution.

Great work! I had looked into this a little bit and came to the same ideas/problems you did, but none of them seemed insurmountable at all. I'd be interested in working with you on this if you'd like.

I'm interested in hearing Tom's suggestions on this topic too.

--
Jonah H. Harris, Database Internals Architect
EnterpriseDB Corporation
732.331.1324
Jonah H. Harris wrote: > On 4/6/06, Qingqing Zhou <zhouqq@cs.toronto.edu> wrote: > >> I have written some experimental code of doing master-slave seqscan in >> PostgreSQL. During the work, I feel we had enough infrastructure to support >> parallel query execution. >> > > Great work! I had looked into this a little bit and came to the same > ideas/problems you did, but none of them seemed insurmountable at all. > I'd be interested in working with you on this if you'd like. > > I'm interested in hearing Tom's suggestions on this topic too. > > From time to time there is (often ill-informed) discussion about making postgres multi-threaded - this strikes me as a possible candidate area for multithreading, and maybe that would be a way to overcome some of the problems mentioned. Just a thought cheers andrew
Hear hear ;-)

Regards,
Thomas Hallgren

Andrew Dunstan wrote:
> Jonah H. Harris wrote:
>> On 4/6/06, Qingqing Zhou <zhouqq@cs.toronto.edu> wrote:
>>
>>> I have written some experimental code of doing master-slave seqscan in
>>> PostgreSQL. During the work, I feel we had enough infrastructure to
>>> support
>>> parallel query execution.
>>>
>>
>> Great work!  I had looked into this a little bit and came to the same
>> ideas/problems you did, but none of them seemed insurmountable at all.
>> I'd be interested in working with you on this if you'd like.
>>
>> I'm interested in hearing Tom's suggestions on this topic too.
>>
>
> From time to time there is (often ill-informed) discussion about making
> postgres multi-threaded - this strikes me as a possible candidate area
> for multithreading, and maybe that would be a way to overcome some of
> the problems mentioned.
>
> Just a thought
>
> cheers
>
> andrew
"Jonah H. Harris" <jonah.harris@gmail.com> wrote
>
> Great work!  I had looked into this a little bit and came to the same
> ideas/problems you did, but none of them seemed insurmountable at all.
> I'd be interested in working with you on this if you'd like.
>

Yes, I am happy to work with anyone on this topic. The plan I have in mind goes like this:
(1) stabilize the master-slave seqscan: solve all the remaining problems;
(2) parallelize the seqscan: AFAICS this should not be very difficult given (1); it may only need some scan-portion assignment;
(3) add an indexscan or one or two other node types to the master-slave solution, in order to make the framework extensible;
(4) parallelize these nodes - this will be a big chunk of work;
(5) add two-phase optimization to the server - we have to consider partitioned tables at this stage; yet another big chunk of work.

Regards,
Qingqing
On 4/6/06, Qingqing Zhou <zhouqq@cs.toronto.edu> wrote:
>
> "Jonah H. Harris" <jonah.harris@gmail.com> wrote
> >
> > Great work!  I had looked into this a little bit and came to the same
> > ideas/problems you did, but none of them seemed insurmountable at all.
> > I'd be interested in working with you on this if you'd like.
>

First, I want to second Jonah's enthusiasm. This is very exciting!

> Yes, I am happy to work with anyone on the topic. The plan in mind is like
> this:
> (1) stable the master-slave seqscan: solve all the problems left;
> (2) parallize the seqscan: AFAICS, this should not very difficult based on
> 1, may only need some scan portition assignment;

This is really only a gut feeling for me (it can't be otherwise, since we can't yet test), but I think parallelizing a single seqscan is pretty much guaranteed to do nothing, because seqscans, especially on large tables, are I/O bound.

There was a plan some time ago (during the 8.0 beta, I think) to allow multiple seqscans from different queries to join each other, such that scans that begin later start scanning the table at the point, or just behind the point, that the first running scan has already reached. That plan would reduce I/O contention, and buffer and OS cache thrashing, by having multiple readers pull from the same hose. I can't see how asking for more than one stream from the same file would do anything but increase both cache thrashing and I/O bandwidth contention. Am I missing something here?

> (3) add an indexscan or other one or two node type to master-slave
> solution: this is in order to make the framework extensible;
> (4) parallize these node - this will be a big chunk of job;

Now that could be a _big_ win! Especially if tablespaces are used to balance commonly combined tables and indexes.

> (5) add a two-phase optimization to the server - we have to consider the
> partitioned table in this stage, yet another big chunk of job;

Same here.
This would be a place where parallel seqscans of different tables (instead of a multi-headed scan of one table) could buy you a lot, especially with proper tablespace use.

Thanks again, Qingqing, for the work on this. I'm very excited about where this could go. :)

> Regards,
> Qingqing

--
Mike Rylander
mrylander@gmail.com
GPLS -- PINES Development
Database Developer
http://open-ils.org
Qingqing,

> First, I want to second Jonah's enthusiasm. This is very exciting!

Me, three! I didn't think this was ever going to come to Postgres absent major corporate funding.

> This is really only a gut feeling for me (it can't be otherwise, since
> we can't yet test), but I think parallelizing a single seqscan is
> pretty much guaranteed to do nothing, because seqscans, especially on
> large tables, are IO bound.

Actually, not true. Our current seqscan performance suffers from producer-consumer fluctuation. GreenPlum and Sun did a whole bunch of testing on this. Basically, reading a large table off disk does this:

    read some table while not processing
    process in cpu while not reading
    read some more table while not processing
    process some more in cpu while not reading
    etc.

resulting in an I/O throughput graph that looks like:

      *       *       *
     * *     * *     * *
    *   *   *   *   *   *
    *   *   *   *   *   *

The really annoying part about this, for me personally, is that the peaks are significantly faster than comparable commercial DBMSes ... but our average is far less. So even on a single seq scan, parallel query execution would make a significant difference in performance, possibly as much as +75% on seq scans of large tables.

--
--Josh

Josh Berkus
Aglio Database Solutions
San Francisco
On 4/7/06, Josh Berkus <josh@agliodbs.com> wrote: > So even on a single seq scan, parallel query > execution would make a significant difference > in performance, possibly as > much as +75% on seq scans of large tables. I've been looking at several commercial systems which employ dynamic partitioning for sequential scans of large tables. While 75% sounds a little optimistic, there is most definitely a sizable performance increase. -- Jonah H. Harris, Database Internals Architect EnterpriseDB Corporation 732.331.1324
Josh Berkus <josh@agliodbs.com> writes:
> Basically, reading a large table off disk does this:
>     read some table while not processing
>     process in cpu while not reading
>     read some more table while not processing
>     process some more in cpu while not reading
>     etc.
> resulting in an I/O throughput graph that looks like:
>       *       *       *
>      * *     * *     * *
>     *   *   *   *   *   *
>     *   *   *   *   *   *

Interesting ...

> The really annoying part about this, for me personally, is that the peaks
> are significantly faster than comparable commercial DBMSes ... but our
> average is far less. So even on a single seq scan, parallel query
> execution would make a significant difference in performance, possibly as
> much as +75% on seq scans of large tables.

... but I'm failing to follow where it says that parallel processing will fix that. All I can foresee in that direction is extra data transfer costs, bought at the price of portability and locking headaches.

regards, tom lane
On 4/8/06, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> ... but I'm failing to follow where it says that parallel processing
> will fix that. All I can foresee in that direction is extra data
> transfer costs, bought at the price of portability and locking headaches.

I don't think it's any less portable than the system is now; it's just enabling multiple slave processes to participate in scans and processing (parallel query, parallel index builds, parallel sorts, ...). Likewise, the additional I/O cost isn't that much of an issue, because systems which really take advantage of this type of parallel processing have large-bandwidth I/O arrays anyway.

I didn't even want to mention that EVERY other database I know of (Oracle, DB2, Sybase, SQL Server, Ingres, Bizgres MPP, MaxDB) supports this, but it's a pretty obvious win for many environments.

--
Jonah H. Harris, Database Internals Architect
EnterpriseDB Corporation
732.331.1324
Hi,

On Sat, 2006-04-08 at 13:16 -0400, Jonah H. Harris wrote:
> On 4/8/06, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > ... but I'm failing to follow where it says that parallel processing
> > will fix that. All I can foresee in that direction is extra data
> > transfer costs, bought at the price of portability and locking headaches.
>
> I don't think it's any less portable than the system is now;

ACK. As long as processes, signals and shared memory are used, this could be as portable as PostgreSQL is now.

> It's just
> enabling multiple slave processes to participate in scans and
> processing (parallel query, parallel index builds, parallel sorts,
> ...)  Likewise, the additional I/O cost isn't that much of an issue
> because systems which really take advantage of this type of parallel
> processing have large bandwidth I/O arrays anyway.

Ehm.. which additional I/O cost? Or do you count inter-process communication as I/O?

I'd like to help teach PostgreSQL the art of parallel query execution. I have roughly implemented an internal message-passing system on a shared memory basis. It allows all pg-processes to send messages to each other, addressed by PID. Upon receiving a message, a process gets a SIGUSR2 (as far as I can remember now).

On the positive side:
- portable, as I'm using only shmem and signals
- fast (not much copying around of the data in memory)

One downside I see with my approach:
- the shared memory buffer is limited in size, so it can simply be 'full', and no more messages can be sent before another process reads its messages and frees the memory for other messages.

In case you're interested I'll compile a patch for review.

Regards

Markus
On 4/8/06, Markus Schiltknecht <markus@bluegap.ch> wrote: > ACK. As long as processes, signals and shared memory are used this could > be as portable as PostgreSQL is now. This is certainly the case. > Ehm.. which additional I/O cost? Or do you count inter-process > communication to I/O? Inter-process will add a minimal amount, but if it's done correctly, you'll still end up ahead. > I'd like to help teaching PostgreSQL the art of parallel query > execution. Cool, we're not at the communication-level yet, but your help would be appreciated. > In case you're interested I'll compile a patch for review. Surely! -- Jonah H. Harris, Database Internals Architect EnterpriseDB Corporation 732.331.1324
"Jonah H. Harris" <jonah.harris@gmail.com> writes: > On 4/8/06, Markus Schiltknecht <markus@bluegap.ch> wrote: >> Ehm.. which additional I/O cost? Or do you count inter-process >> communication to I/O? > Inter-process will add a minimal amount, but if it's done correctly, > you'll still end up ahead. This is exactly the bit of optimism I was questioning. We've already been sweating blood trying to reduce multiprocessor contention on data structures in which collisions ought to be avoidable (ie, buffer arrays where you hope not everyone is hitting the same buffer at once). I think passing large volumes of data between different processes is going to incur quite a lot of locking overhead, pipeline stalls for cache line transfers, etc, etc, because heavy contention for the transfer buffer is simply not going to be avoidable. regards, tom lane
On 4/8/06, Tom Lane <tgl@sss.pgh.pa.us> wrote: > passing large volumes of data between different processes is going > to incur quite a lot of locking overhead, pipeline stalls for cache line > transfers, etc, etc, because heavy contention for the transfer buffer is > simply not going to be avoidable. I don't think anyone believes there isn't going to be contention somewhere. Anyone that's developed a parallel system acknowledges there's a price to pay with parallelism (in complexity) which requires good algorithms, strategy, and design. I may certainly be misunderstanding you, but it seems like you want to avoid parallelism altogether. Aside from PostgreSQL, the only systems I know of which don't support parallelism in query execution are MySQL and Firebird, but I don't see competing at the low end as a compelling argument. Similarly, I don't think every major vendor put money into developing something that isn't useful. I do agree that there are many times you won't want to have parallel query. The reasons for employing parallelism is very dependent on the environment and application it's used in, so it's certainly going to be up to the user to decide whether they want to use it or not. We're currently playing with a proof-of-concept to find issues before proposing a design. This doesn't affect the community in any way. Rather than negativity, I'd really like to see your suggestions on avoiding contention as much as possible; your ideas may certainly get us past several obstacles more quickly. -- Jonah H. Harris, Database Internals Architect EnterpriseDB Corporation 732.331.1324
On 4/8/06, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> This is exactly the bit of optimism I was questioning.  We've already
> been sweating blood trying to reduce multiprocessor contention on data
> structures in which collisions ought to be avoidable (ie, buffer arrays
> where you hope not everyone is hitting the same buffer at once).  I
> think passing large volumes of data between different processes is going
> to incur quite a lot of locking overhead, pipeline stalls for cache line
> transfers, etc, etc, because heavy contention for the transfer buffer is
> simply not going to be avoidable.

We should consider true parallel execution and overlapping execution with I/O as distinct cases.

For example, one case made in this thread involved bursty performance with seqscans, presumably because the I/O was stalling while processing was being performed. In general this can be avoided without parallel execution through the use of non-blocking I/O and making an effort to keep the request pipeline full.

There are other cases where it is useful to perform parallel I/O without parallel processing. For example, a query that will perform an index lookup per row can benefit from running some number of those lookups in parallel, in order to hide the lookup latency and give the OS and disk elevators a chance to make the random accesses a little more orderly. This can be accomplished without true parallel processing. (Perhaps PG does this already?)

Parallel execution to get access to more CPU and memory bandwidth is a fine thing, and worth the costs in many cases... but it shouldn't be used as an easy way to get parallel I/O without careful consideration.
Greg, On 4/8/06 5:43 PM, "Gregory Maxwell" <gmaxwell@gmail.com> wrote: > For example, one case made in this thread involved bursty performance > with seqscans presumably because the I/O was stalling while processing > was being performed. In general this can be avoided without parallel > execution through the use of non-blocking I/O and making an effort to > keep the request pipeline full. I agree - there is a real continuing need for overlapped scan and execution, though I'm not sure quite how to get it efficiently using inter-process communication. The AIO interface is nicely proven at this point, but to Tom's point, it's non-portable. What we see now is that the executor is CPU bound at about 300MB/s scan rate using modern Opteron CPUs. Using some kind of overlap strategy would help this a lot - using programmable readahead to stage I/O requests asynchronously from the scan node would be better than OS hints like fadvise(), but the tuning would be tricky IMO. - Luke
Gregory Maxwell wrote:
> We should consider true parallel execution and overlapping execution
> with I/O as distinct cases.
>
> For example, one case made in this thread involved bursty performance
> with seqscans presumably because the I/O was stalling while processing
> was being performed. In general this can be avoided without parallel
> execution through the use of non-blocking I/O and making an effort to
> keep the request pipeline full.
>
> There are other cases where it is useful to perform parallel I/O
> without parallel processing.. for example: a query that will perform
> an index lookup per row can benefit from running some number of those
> lookups in parallel in order to hide the lookup latency and give the
> OS and disk elevators a chance to make the random accesses a little
> more orderly. This can be accomplished without true parallel
> processing. (Perhaps PG does this already?)
>

I have done some testing more along these lines with an old fork of postgres code (2001). In my tests, I used a thread to delegate out the actual heap scan of the SeqScan. The job of the "slave" thread was to fault in buffer pages and determine the time validity of the tuples. ItemPointers are passed back to the "master" thread via a common memory area guarded by mutex locking. The master thread is then responsible for converting the ItemPointers to HeapTuples and finishing the execution run. I added a little hack to the buffer code to force pages read into the buffer to stay at the back of the free buffer list until the master thread has had a chance to use them.

These are the parameters of my test table:

    Pages 9459; Tuples 961187: Live 673029, Dead 288158
    Average tuple size is 70 bytes
    create table test (rand int, message varchar(256))

So far I've done a couple of runs with a single query on a 2-processor machine, with the following results via dtrace.
select * from test;

    CPU     ID                       FUNCTION:NAME
      1  46218           ExecEndSeqScan:return  Inline scan time    81729
      0  46216  ExecEndDelegatedSeqScan:return  Delegated scan time 59903
      0  46218           ExecEndSeqScan:return  Inline scan time    95708
      0  46216  ExecEndDelegatedSeqScan:return  Delegated scan time 58255
      0  46218           ExecEndSeqScan:return  Inline scan time    79028
      0  46216  ExecEndDelegatedSeqScan:return  Delegated scan time 50500

That's an average 34% decrease in total time using the delegated scan. A very crude, simple test, but I think it shows some promise. I know I used threads, but you could probably just as easily use a slave process and pass ItemPointers via pipes or shared memory.

Thanks,

Myron Scott
Myron, First, this sounds really good! On 4/8/06 9:54 PM, "Myron Scott" <lister@sacadia.com> wrote: > I added a little hack to the buffer > code to force > pages read into the buffer to stay at the back of the free buffer list > until the master > thread has had a chance to use it. This is the part I'm curious about - is this using the shared_buffers region in a circular buffer fashion to store pre-fetched pages? One thing I've wondered about is: how much memory is required to get efficient overlap? Did you find that you had to tune the amount of buffer memory to get the performance to work out? - Luke
On Sat, 2006-04-08 at 18:47, Luke Lonergan wrote:
> I agree - there is a real continuing need for overlapped scan and execution,
> though I'm not sure quite how to get it efficiently using inter-process
> communication. The AIO interface is nicely proven at this point, but to
> Tom's point, it's non-portable.
>
> What we see now is that the executor is CPU bound at about 300MB/s scan rate
> using modern Opteron CPUs.

Where exactly does the CPU spend its time at 300MB/s? Is it mainly I/O-related processing, processing pages into in-memory tuples, or something else?

Hannu
"Tom Lane" <tgl@sss.pgh.pa.us> wrote
>
> This is exactly the bit of optimism I was questioning.  We've already
> been sweating blood trying to reduce multiprocessor contention on data
> structures in which collisions ought to be avoidable (ie, buffer arrays
> where you hope not everyone is hitting the same buffer at once).  I
> think passing large volumes of data between different processes is going
> to incur quite a lot of locking overhead, pipeline stalls for cache line
> transfers, etc, etc, because heavy contention for the transfer buffer is
> simply not going to be avoidable.
>

Yes, there is overhead. Let's see how we can tackle the problems one by one:

* Lock contention on the transfer buffer: a general technique is partitioning. We can try to alleviate the lock contention by sharing each transfer buffer between only two processes (the master and one slave).

* Large data volume: in the simplest master-slave seqscan scenario, I am copying the HeapTuple so far (like the code in sort read/write tuple), but potentially we could pass only the offset together with the pinned page, which would require the slave process to hold the page pinned a little longer, say for every 5 pages. I haven't seen any application that causes Postgres to pin many pages at once, so this might be one step toward reducing the data volume.

On the other hand, why do we need parallel execution anyway? Because if we have a 4-way SMP box with a strong I/O device, we may want to trade more resources for better performance on some queries.

Regards,
Qingqing
On Apr 8, 2006, at 10:29 PM, Luke Lonergan wrote:
> Myron,
>
> First, this sounds really good!
>
> On 4/8/06 9:54 PM, "Myron Scott" <lister@sacadia.com> wrote:
>
>> I added a little hack to the buffer code to force
>> pages read into the buffer to stay at the back of the free buffer list
>> until the master thread has had a chance to use it.
>
> This is the part I'm curious about - is this using the shared_buffers
> region in a circular buffer fashion to store pre-fetched pages?

Yes. That is basically what the slave thread is trying to do, as well as weed out any tuples/pages that don't need to be looked at due to dead tuples.

I did several things to try and ensure that a buffer needed by the master thread would not be pulled out of the buffer pool before it was seen by the master. I wanted to do this without holding the buffer pinned, so I made this change to the buffer free list:

    static void
    AddBufferToFreelist(BufferDesc *bf)
    {
        S_LOCK(&SLockArray[FreeBufMgrLock]);
        int movebehind = SharedFreeList->freePrev;

        /* find the right spot with bias */
        while (BufferDescriptors[movebehind].bias > bf->bias)
        {
            movebehind = BufferDescriptors[movebehind].freePrev;
        }
        ...

The bias number is removed the next time the buffer is pulled out of the free list. Also, I force an ItemPointer transfer when the ItemPointer transfer list is full (currently 4096) or when 10% of the buffer pool has been affected by the slave thread. Lastly, if the slave thread gets too far ahead of the master thread, it waits for the master to catch up. To my knowledge, this hasn't happened yet.

> One thing I've wondered about is: how much memory is required to get
> efficient overlap?  Did you find that you had to tune the amount of
> buffer memory to get the performance to work out?

I haven't done much tuning yet. I think there is an optimal balance that I most likely haven't found yet.

Myron Scott
On Sun, Apr 09, 2006 at 08:23:36AM -0700, Myron Scott wrote:
> >This is the part I'm curious about - is this using the
> >shared_buffers region
> >in a circular buffer fashion to store pre-fetched pages?
>
> Yes. That is basically what the slave thread is trying to do. As
> well as weed out
> any tuples/pages that don't need to be looked at due to dead tuples.
> I did several things to try and insure that a buffer needed by the
> master thread
> would not be pulled out of the buffer pool before it was seen by the
> master.
> I wanted to do this without holding the buffer pinned, so I did the
> change to
> the buffer free list to do this.

Is this necessary? I mean, what's the chance that a page might get thrown out early? And if so, what's the chance that the page will still be in the OS cache? The cost of fetching a page from the OS is not really much of an overhead, so I'd like to know how much benefit these buffer-cache hacks actually produce.

Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
Myron Scott <lister@sacadia.com> writes: > Gregory Maxwell wrote: >> There are other cases where it is useful to perform parallel I/O >> without parallel processing.. > I have done some testing more along these lines with an old fork of > postgres code (2001). In my tests, I used a thread to delegate out > the actual heap scan of the SeqScan. The job of the "slave" thread > the was to fault in buffer pages and determine the time validity of > the tuples. ItemPointers are passed back to the "master" thread via a > common memory area guarded by mutex locking. I was considering a variant idea in the shower this morning: suppose that we invent one or more "background reader" processes that have basically the same infrastructure as the background writer, but have the responsibility of causing buffer reads to happen at useful times (whereas the writer causes writes to happen at useful times). The idea would be for backends to signal the readers when they know they will need a given block soon, and then hopefully when they need it it'll already be in shared buffers. For instance, in a seqscan it'd be pretty trivial to request block N+1 just after reading block N, and then doing our actual processing on block N while (we hope) some reader process is faulting in N+1. Bitmap indexscans could use this structure too; I'm less sure about whether plain indexscans could do much with it though. The major issues I can see are: 1. We'd need a shared-memory queue of read requests, probably much like the queue of fsync requests. We've already seen problems with contention for the fsync queue, IIRC, and that's used much less heavily than the read request queue would be. So there might be some performance issues with getting the block requests sent over to the readers. 2. There are some low-level assumptions that no one reads in pages of a relation without having some kind of lock on the relation (consider eg the case where the relation is being dropped). 
A bgwriter-like process wouldn't be able to hold lmgr locks, and we wouldn't really want it to be thrashing the lmgr shared data structures for each read anyway. So you'd have to design some interlock to guarantee that no backend abandons a query (and releases its own lmgr locks) while an async read request it made is still pending. Ugh. regards, tom lane
"Gregory Maxwell" <gmaxwell@gmail.com> writes: > For example, one case made in this thread involved bursty performance > with seqscans presumably because the I/O was stalling while processing > was being performed. Actually, the question that that raised in my mind is "why isn't the kernel doing read-ahead properly?" When we're doing nonsequential access like an indexscan, it's unsurprising that the kernel can't guess which block we need next, but in a plain seqscan you'd certainly expect the read-ahead algorithm to kick in and ensure that the next block is fetched before we need it. So before we go inventing complicated bits of code with lots of added overhead, we should first find out exactly why the system doesn't already work the way it's supposed to. regards, tom lane
On 4/9/06, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I was considering a variant idea in the shower this morning: suppose
> that we invent one or more "background reader" processes that have
> basically the same infrastructure as the background writer, but have
> the responsibility of causing buffer reads to happen at useful times
> (whereas the writer causes writes to happen at useful times). The
> idea would be for backends to signal the readers when they know they
> will need a given block soon, and then hopefully when they need it
> it'll already be in shared buffers.

This is sort of what I'm playing with. There are N backends which are configured at startup and are available solely for parallel processing. If you have parallelism enabled and you have a query which the planner says can take advantage of it, your backend becomes a sort of "coordinator" and signals the number of backends defined by (right now) a global parallelism-degree parameter for participation in query execution.

> The major issues I can see are:
>
> 1. We'd need a shared-memory queue of read requests, probably much like
> the queue of fsync requests. We've already seen problems with
> contention for the fsync queue, IIRC, and that's used much less heavily
> than the read request queue would be. So there might be some
> performance issues with getting the block requests sent over to the
> readers.

Yes, I was looking at this contention as well. I haven't looked at the fsync request queue in detail yet, but I'm going to play around with a simple initial semaphore test such that if the queue is full, the query will just be processed serially. I'll also take a look at the fsync queue and see how it's working.

While it's down the road a ways, I'm really looking forward to seeing parallel index builds, sorts, and joins. In my opinion, all of our combined ideas and system knowledge are really going to pay off big time on this one.

--
Jonah H. Harris, Database Internals Architect
EnterpriseDB Corporation
732.331.1324
On Apr 9, 2006, at 9:26 AM, Martijn van Oosterhout wrote:
> On Sun, Apr 09, 2006 at 08:23:36AM -0700, Myron Scott wrote:
>>> This is the part I'm curious about - is this using the shared_buffers
>>> region in a circular buffer fashion to store pre-fetched pages?
>>
>> Yes. That is basically what the slave thread is trying to do, as well
>> as weed out any tuples/pages that don't need to be looked at due to
>> dead tuples. I did several things to try and ensure that a buffer
>> needed by the master thread would not be pulled out of the buffer pool
>> before it was seen by the master. I wanted to do this without holding
>> the buffer pinned, so I made the change to the buffer free list to do
>> this.
>
> Is this necessary? I mean, what's the chance that a page might get
> thrown out early? And if so, what's the chance that page will still be
> in the OS cache?
>
> The cost of fetching a page from the OS is not really much of an
> overhead, so I'd like to know how much benefit these buffer cache hacks
> actually produce.

You may be right on this one. I wanted to ensure that I didn't lose pages I needed. I may have just added a belt to my suspenders. I'll add a switch to turn it off and on and try to devise a test to see what the costs are either way.

Myron Scott
On 4/9/06, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> "Gregory Maxwell" <gmaxwell@gmail.com> writes:
>> For example, one case made in this thread involved bursty performance
>> with seqscans presumably because the I/O was stalling while processing
>> was being performed.
>
> Actually, the question that that raised in my mind is "why isn't the
> kernel doing read-ahead properly?" When we're doing nonsequential
> access like an indexscan, it's unsurprising that the kernel can't guess
> which block we need next, but in a plain seqscan you'd certainly expect
> the read-ahead algorithm to kick in and ensure that the next block is
> fetched before we need it.
>
> So before we go inventing complicated bits of code with lots of added
> overhead, we should first find out exactly why the system doesn't
> already work the way it's supposed to.

But is that really the behavior we should expect? How much memory can we expect the OS to spend on opportunistic read-in? How much disk access should be spent on a guess? There is an intrinsic tradeoff here: applications tend to be bursty, so just because you're reading a lot now doesn't mean you'll continue... and the filesystem will have fragmentation, so a failed guess can translate into a lot of pointless seeking.

As I recall, in Linux 2.6 you have something like a maximum of 128KB of readahead. Given that and a disk subsystem that reads at 200MB/sec, you can't spend more than about 600us processing before requesting enough additional blocks to put the disk back into readahead, or you will stall the disk. Stalling the disk costs more than you'd expect: due to FS fragmentation there can be terrific gains from allowing the OS and disk to issue reads out of order from a large request queue.

It would probably be reasonable to say that the OS should be using much larger readahead buffers, especially on systems with fast disk subsystems... But that doesn't come for free and can slaughter performance for many workloads (consider: what if it triggered 5MB of file-oriented read-ahead for every index scan seek we did?). There is an adaptive readahead patch for Linux which should improve things (http://lwn.net/Articles/176279/ and if you google around there are some benchmarks), but I doubt that even that would be able to keep a 200MB/sec+ disk subsystem saturated with the sort of access patterns PG has...

Addressing this in a cross-platform way will be a challenge. I doubt Linux is alone in having skimpy readahead (because big readahead translates into huge losses if you get it wrong). Given this information, a stupid 'fault-in' process should probably give huge gains for seqscans... but I think the extra work required to find a solution which is also useful for index operations is probably worth it as well.
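[The back-of-envelope math in the message above can be checked with a tiny helper (hypothetical, not from the thread): with R bytes of kernel readahead and a disk streaming B bytes/sec, a scan may spend at most R/B seconds of CPU per readahead window before the disk goes idle.]

```c
#include <stdint.h>

/* Returns the per-window processing budget in microseconds before the
 * disk stalls, given the kernel readahead size and disk throughput. */
uint64_t stall_budget_us(uint64_t readahead_bytes, uint64_t disk_bytes_per_sec)
{
    return readahead_bytes * 1000000ULL / disk_bytes_per_sec;
}
```

For 128KB of readahead at 200MB/sec this gives 625us, consistent with the ~600us figure cited above; raising readahead to 1MB stretches the budget to 5ms.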
"Jonah H. Harris" <jonah.harris@gmail.com> writes:
> On 4/9/06, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> I was considering a variant idea in the shower this morning: suppose
>> that we invent one or more "background reader" processes that have
>> basically the same infrastructure as the background writer, but have
>> the responsibility of causing buffer reads to happen at useful times

> This is sort of what I'm playing with. There are N backends which are
> configured at startup and are available solely for parallel processing.

That's not remotely the same thing: a backend is a very different animal from a bgwriter. In particular, bgwriter (and bgreaders if we had 'em) aren't database-specific, don't need to think about permission checking as they don't execute on behalf of particular users, don't have syscaches to keep in sync with everything else, etc etc.

regards, tom lane
"Gregory Maxwell" <gmaxwell@gmail.com> writes: > On 4/9/06, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> So before we go inventing complicated bits of code with lots of added >> overhead, we should first find out exactly why the system doesn't >> already work the way it's supposed to. > But is that really the behavior we should expect? Certainly. If the OS has readahead logic at all, it ought to think that a seqscan of a large table qualifies. Your arguments seem to question whether readahead is useful at all --- but they would apply *just as well* to an app doing its own readahead, which is what is really getting proposed in this thread. Before we go replacing a standard OS-level facility with our own version, we need to have a much clearer idea of why the OS isn't getting the job done for us. Otherwise we're likely to write a large amount of code and find out that it doesn't work very well either. regards, tom lane
On 4/9/06, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> That's not remotely the same thing: a backend is a very different animal
> from a bgwriter.

Yes, I know. Had I known it would've been picked apart, I would've described it better.

> In particular, bgwriter (and bgreaders if we had 'em)
> aren't database-specific, don't need to think about permission checking
> as they don't execute on behalf of particular users, don't have syscaches
> to keep in-sync with everything else, etc etc.

In the current proof-of-concept, there is a backend (manager) and several reader-like processes; however, in this case, the readers perform actual execution on behalf of the user. The reader-like processes or "execution agents", whatever you want to call them, are limited to the postgres database only because they've been coded that way... but there's nothing that actually restricts them from being used for any database.

--
Jonah H. Harris, Database Internals Architect
EnterpriseDB Corporation
732.331.1324
"Jonah H. Harris" <jonah.harris@gmail.com> writes: > The reader-like processes or "execution agents", whatever you want to > call them, are limited to the postgres database only because they've > been coded that way... but there's nothing that actually restricts > them from being used for any database. Oh? Anything much above the level of bufmgr is going to be dependent on relcache, syscache, etc, and those are all definitely database-specific. You can't just retarget a backend to operate in another database, at least not without major changes in that infrastructure. regards, tom lane
On 4/9/06, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Oh? Anything much above the level of bufmgr is going to be dependent on > relcache, syscache, etc, and those are all definitely database-specific. > You can't just retarget a backend to operate in another database, at > least not without major changes in that infrastructure. It's not that complicated and it may not prove to be the best method, but it's worth a try. When I get done playing with it, I'll send it to ya to check out. -- Jonah H. Harris, Database Internals Architect EnterpriseDB Corporation 732.331.1324
Jonah H. Harris wrote: > On 4/9/06, Tom Lane <tgl@sss.pgh.pa.us> wrote: > > Oh? Anything much above the level of bufmgr is going to be dependent on > > relcache, syscache, etc, and those are all definitely database-specific. > > You can't just retarget a backend to operate in another database, at > > least not without major changes in that infrastructure. > > It's not that complicated and it may not prove to be the best method, > but it's worth a try. When I get done playing with it, I'll send it > to ya to check out. Hmm, if it turns out to work, I'm interested. This is something that definitely could be of some help to autovacuum. Please post to pgsql-patches. -- Alvaro Herrera http://www.CommandPrompt.com/ PostgreSQL Replication, Consulting, Custom Development, 24x7 support
I am working with Solaris on SPARC almost exclusively, and I believe Josh said that Sun was the one who found the bursty behavior with scans. Has it been confirmed that this is the case on all/most platforms?

Myron Scott
Myron, On 4/9/06 1:18 PM, "lister@sacadia.com" <lister@sacadia.com> wrote: > I am working with Solaris on SPARC almost exclusively and I believe Josh said > that Sun was the one who found the bursty behavior with scans. Has it been > confirmed that this is the case on all/most platforms? It was our folks at Greenplum that found this behavior. It is particularly bad on Solaris because of the poor OS readahead on UFS. You can modify UFS readahead behavior using OS tuning parameters as mentioned previously in the thread and contained in Jignesh Shaw's blog at Sun (we worked with Jignesh and Sun for about a year to get this information). Interestingly, the use of POSIX fadvise() to indicate the nature of the I/O as sequential makes a big positive difference in many cases, but it hurts a lot in some other workloads where there is seek I/O and the sequential pattern was set. I think the best solution will use a more direct overlap approach in the executor. - Luke
On 4/9/06, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Certainly. If the OS has readahead logic at all, it ought to think that
> a seqscan of a large table qualifies. Your arguments seem to question
> whether readahead is useful at all --- but they would apply *just as
> well* to an app doing its own readahead, which is what is really
> getting proposed in this thread.

We know we're going to read the whole table; the OS doesn't. We can be confident that we will not use our read-ahead when we're really doing random accesses. The OS has to deal with many applications with many workloads running on a wider spectrum of hardware. It's important that it does the right thing, but probably more important that it doesn't do the wrong thing. This encourages erring on the side of small readahead.

> Before we go replacing a standard OS-level facility with our own
> version, we need to have a much clearer idea of why the OS isn't getting
> the job done for us. Otherwise we're likely to write a large amount of
> code and find out that it doesn't work very well either.

That's a fair position... It would be useful to know how much readahead PG needs in order to keep a high speed disk subsystem saturated. This would involve profiling how frequently PG requests data, and how much it requests when running out of a hot cache. We could then say that the OS would need to read ahead x MB to keep a y MB/s disk subsystem saturated. It would be good to know how much FreeBSD will read ahead... It might also be interesting for someone with the right testing rig on Linux to try the adaptive readahead patch to see if that improves PG's ability to keep the disk busy.
Gregory,

On 4/9/06 1:36 PM, "Gregory Maxwell" <gmaxwell@gmail.com> wrote:
> It might also be interesting for someone with the right testing rig on
> Linux to try the adaptive readahead patch to see if that improves PG's
> ability to keep the disk busy.

"The" adaptive readahead patch? Did I miss one?

We will happily test experimental patches that improve I/O utilization. We have an assortment of gear with high speed I/O, mostly Linux now.

- Luke
Tom, On 4/9/06 9:27 AM, "Tom Lane" <tgl@sss.pgh.pa.us> wrote: > 2. There are some low-level assumptions that no one reads in pages of > a relation without having some kind of lock on the relation (consider > eg the case where the relation is being dropped). A bgwriter-like > process wouldn't be able to hold lmgr locks, and we wouldn't really want > it to be thrashing the lmgr shared data structures for each read anyway. > So you'd have to design some interlock to guarantee that no backend > abandons a query (and releases its own lmgr locks) while an async read > request it made is still pending. Ugh. Does this lead us right back to planning for the appropriate amount of readahead at plan time? We could consider a "page range" lock at that point instead of locking each individual page. It seems that the planner might have enough information to identify the times when this approach would be beneficial (low selectivity aggregate, etc), and to allocate sufficient resources to accommodate the overlap. The planner specific knowledge would be the main reason this would work better than the OS built-in dynamic readahead algorithms. - Luke
On 4/9/06, Luke Lonergan <llonergan@greenplum.com> wrote:
> Gregory,
>
> On 4/9/06 1:36 PM, "Gregory Maxwell" <gmaxwell@gmail.com> wrote:
>> It might also be interesting for someone with the right testing rig on
>> Linux to try the adaptive readahead patch to see if that improves PG's
>> ability to keep the disk busy.
>
> "The" adaptive readahead patch? Did I miss one?
>
> We will happily test experimental patches that improve I/O utilization.
> We have an assortment of gear with high speed I/O, mostly Linux now.

It's a Linux kernel patch; I'd mentioned it in a prior post. http://www.vanheusden.com/ara/

It increases Linux's maximum readahead from 128K to 1MB... and it should be smart enough that you could crank it up further without too much risk of hurting performance elsewhere. If PG's bottlenecked on seqscan due to insufficient readahead, I'd expect this to show an improvement... although I am still somewhat doubtful that it'll be enough to keep the disk saturated if PG's behavior is highly suboptimal.
"Luke Lonergan" <llonergan@greenplum.com> writes:
> On 4/9/06 9:27 AM, "Tom Lane" <tgl@sss.pgh.pa.us> wrote:
>> 2. There are some low-level assumptions that no one reads in pages of
>> a relation without having some kind of lock on the relation (consider
>> eg the case where the relation is being dropped). A bgwriter-like
>> process wouldn't be able to hold lmgr locks, and we wouldn't really want
>> it to be thrashing the lmgr shared data structures for each read anyway.
>> So you'd have to design some interlock to guarantee that no backend
>> abandons a query (and releases its own lmgr locks) while an async read
>> request it made is still pending. Ugh.

> Does this lead us right back to planning for the appropriate amount of
> readahead at plan time? We could consider a "page range" lock at that point
> instead of locking each individual page.

No, you're missing my point entirely. What's bothering me is the prospect of a "bgreader" process taking actions that are only safe because of a lock that is held by a different process --- changing the granularity of that lock doesn't get you out of trouble. Here's a detailed scenario:

1. Backend X reads page N of a table T, queues a request for N+1.
2. While processing page N, backend X gets an error and aborts its transaction, thereby dropping all its lmgr locks.
3. Backend Y executes a DROP or TRUNCATE on T, which it can now do because there's no lock held anywhere on T. There are actually two interesting sub-phases of this:
   3a. Kill any shared buffers holding pages to be deleted.
   3b. Physically drop or truncate the OS file.
4. Bgreader tries to execute the pending read request. Oops.

If step 4 happens after 3b, the bgreader gets an error. Maybe we could kluge that to not cause any serious problems, but the nastier case is where 4 happens between 3a and 3b --- the bgreader sees nothing wrong, but it's now loaded a shared buffer that must *not* be there.
Having thought more about this, there may be a solution possible using tighter integration with the bufmgr. The bufmgr already has a notion of a buffer page being "read busy". Maybe, rather than just pushing a read request into a separate queue somewhere, the requestor has to assign a shared buffer for the page it wants and put the buffer into read-busy state, but then pass the request to perform the physical read off to someone else. The advantage of this is that there's state that step 3a can see telling it that a conflicting read is pending, and it just needs to wait for the read to finish before killing the buffer. Bottom line seems to be: just as the bgwriter is pretty intimately tied to bufmgr, bgreaders would have to be as well. regards, tom lane
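[The interlock sketched above — the backend marks a buffer "I/O in progress" before handing the physical read to a bgreader, so step 3a of the DROP scenario can see the pending read and wait it out rather than evicting the buffer underneath it — might look roughly like this. Names are invented; this is not the real bufmgr.]

```c
#include <pthread.h>
#include <stdbool.h>

typedef struct Buffer {
    bool            io_in_progress;
    pthread_mutex_t lock;
    pthread_cond_t  io_done;
} Buffer;

void buf_init(Buffer *b)
{
    b->io_in_progress = false;
    pthread_mutex_init(&b->lock, NULL);
    pthread_cond_init(&b->io_done, NULL);
}

/* Backend: claim the buffer before queuing the read request, so the
 * pending I/O is visible shared state rather than a private promise. */
void buf_start_io(Buffer *b)
{
    pthread_mutex_lock(&b->lock);
    b->io_in_progress = true;
    pthread_mutex_unlock(&b->lock);
}

/* Bgreader: signal completion after the physical read finishes. */
void buf_finish_io(Buffer *b)
{
    pthread_mutex_lock(&b->lock);
    b->io_in_progress = false;
    pthread_cond_broadcast(&b->io_done);
    pthread_mutex_unlock(&b->lock);
}

/* DROP/TRUNCATE path (step 3a): wait for any pending read before
 * invalidating the buffer, so the file unlink in 3b cannot race a
 * bgreader loading a page that must not be there. */
void buf_wait_for_io(Buffer *b)
{
    pthread_mutex_lock(&b->lock);
    while (b->io_in_progress)
        pthread_cond_wait(&b->io_done, &b->lock);
    pthread_mutex_unlock(&b->lock);
}
```

This is what ties the bgreader to the bufmgr: the "read busy" flag lives in the shared buffer header, not in the reader's private queue.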
Hi,

On Sun, 2006-04-09 at 15:11 -0400, Tom Lane wrote:
> You can't just retarget a backend to operate in another database, at
> least not without major changes in that infrastructure.

Why not? What would be needed to retarget a backend to operate in another database? Such retargeting would be helpful in several cases:

- pre-forking of backends (well, depends on the efficiency of retargeting...)
- slave backends for parallel query execution
- slave backends for replaying replicated transactions

I'm using backends operating without a client connection, with superuser privileges only. But those are still connected to one specific database. It would really be helpful if I could retarget them as needed.

Regards
Markus
"Markus Schiltknecht" <markus@bluegap.ch> wrote > Hi, > > On Sun, 2006-04-09 at 15:11 -0400, Tom Lane wrote: > > You can't just retarget a backend to operate in another database, at > > least not without major changes in that infrastructure. > > Why not? What would be needed to retarget a backend to operate in > another database? > As Tom pointed out, without big change, a backend on database "D1" can't connect to "D2". This is because to connect to a database, we need to initialize a lot of variables. So when you reconnect to another one on the fly, you have to change these variables one by one. Regards, Qingqing
On Sun, 2006-04-09 at 17:11 +0800, Qingqing Zhou wrote: > "Tom Lane" <tgl@sss.pgh.pa.us> wrote > > > > This is exactly the bit of optimism I was questioning. We've already > > been sweating blood trying to reduce multiprocessor contention on data > > structures in which collisions ought to be avoidable (ie, buffer arrays > > where you hope not everyone is hitting the same buffer at once). I > > think passing large volumes of data between different processes is going > > to incur quite a lot of locking overhead, pipeline stalls for cache line > > transfers, etc, etc, because heavy contention for the transfer buffer is > > simply not going to be avoidable. > > > > Yes, there is overhead. Let's say how can we tackle them one by one: > > * lock contention of transfer buffer * I think it would be useful to think about exactly what type of query/activity we are looking to improve the performance on. That way we can understand the benefit of this proposal and take some baseline measurements to analyse what is happening for those cases. For example, a CREATE INDEX will require most if not all rows. Whereas a SeqScan may have a selectivity of 0.001 or below in some cases. The SeqScan case might also vary considerably in the amount of CPU required to identify qualifying rows and that CPU might benefit from parallelisation also. e.g. A low-selectivity SeqScan is a good case for fairly simple performance improvement, since the output from that executor node is going to be much less than its input and there is less need to parallelize the complete tree. A parallel slave node might then be placed in charge of running a complete SeqScan node on an agreed section of a table, rather than simply performing I/O for another query. Best Regards, Simon Riggs
Hi Qingqing,

On Mon, 2006-04-10 at 16:38 +0800, Qingqing Zhou wrote:
>> Why not? What would be needed to retarget a backend to operate in
>> another database?
>
> As Tom pointed out, without big changes, a backend on database "D1" can't
> connect to "D2". This is because to connect to a database, we need to
> initialize a lot of variables. So when you reconnect to another one on
> the fly, you have to change these variables one by one.

Sure, the question is: what is needed to retarget a backend?

IMHO, in all three of the examples mentioned up-thread, it would be better to retarget an existing backend than to fork a new one, because that would perform better and/or save some resources.

Please also consider that for parallel query execution, as well as for replaying remote transactions, access rights checking and any client connection setup could be omitted. Such special processing backends could easily be run with superuser privileges and without a client connection.

Regards
Markus
On Sun, Apr 09, 2006 at 05:54:56PM -0400, Tom Lane wrote:
> Here's a detailed scenario:
>
> 1. Backend X reads page N of a table T, queues a request for N+1.
> 2. While processing page N, backend X gets an error and aborts
>    its transaction, thereby dropping all its lmgr locks.
> 3. Backend Y executes a DROP or TRUNCATE on T, which it can
>    now do because there's no lock held anywhere on T. There
>    are actually two interesting sub-phases of this:
>    3a. Kill any shared buffers holding pages to be deleted.
>    3b. Physically drop or truncate the OS file.
> 4. Bgreader tries to execute the pending read request. Oops.
>
> If step 4 happens after 3b, the bgreader gets an error. Maybe we could
> kluge that to not cause any serious problems, but the nastier case is
> where 4 happens between 3a and 3b --- the bgreader sees nothing wrong,
> but it's now loaded a shared buffer that must *not* be there.

The way I dealt with this when playing with async I/O for postgres was to have the initiator mark the buffer I/O-in-progress and then send the request, just like you suggest. This would cause other people to wait on it. When the request later finished, it was processed as usual and waiting backends would continue.

There appear to be two separate cases here though: one is to just farm out the read request to another process (basically aio), the other is to do actual processing there. The latter is obviously far more useful but requires a fair bit more infrastructure.

Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
> Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
> tool for doing 5% of the work and then sitting around waiting for someone
> else to do the other 95% so you can sue them.
"Markus Schiltknecht" <markus@bluegap.ch> wrote
> Hi Qingqing,
>
>> As Tom pointed out, without big changes, a backend on database "D1" can't
>> connect to "D2". This is because to connect to a database, we need to
>> initialize a lot of variables. So when you reconnect to another one on
>> the fly, you have to change these variables one by one.
>
> Sure, the question is: what is needed to retarget a backend?

Check the code in InitPostgres(). These global variables are scattered in many places, so I am not sure it is easy to write clean code to clear them up. But if you can come up with a patch to do reconnect without disconnect, that will be cool.

> IMHO, in all of the three examples mentioned up-thread, it would be
> better to retarget an existing backend, instead of forking a new one,
> because that would perform better and/or save some resources.

That's for sure.

> Please do also consider that for parallel query execution as well as
> for replaying remote transactions, access rights checking and any client
> connection setup could be omitted. Such special processing backends could
> easily be run with superuser privileges and without a client connection.

This is a problem for parallel query execution. I suspect we can't run with superuser privileges, now or in the future. For now, I am not clear whether there is any ACL check at the query execution stage -- it seems there should not be, at least for DML. For the future, I am pretty sure this is not the way to go -- the mere possibility of future support for auditing or MAC makes the reason clear.

Regards,
Qingqing
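[The InitPostgres() discussion above can be caricatured in a toy model: gather the per-database state that the real code scatters into globals behind one struct, so switching databases becomes "release, then reinitialize". Everything here is hypothetical; the real backend state is vastly larger than this, which is exactly the difficulty being discussed.]

```c
#include <string.h>
#include <stdbool.h>

typedef struct BackendState {
    char db_name[64];
    bool caches_valid;   /* stands in for relcache/syscache contents */
    int  open_cursors;   /* stands in for other live session resources */
} BackendState;

/* In a real backend this would flush caches, close smgr files,
 * release locks, etc. */
static void release_db_state(BackendState *s)
{
    s->caches_valid = false;
    s->db_name[0] = '\0';
}

/* Retarget the backend to a new database.  Refuses to switch while
 * session resources are still live, mirroring the "leave it as new"
 * cleanup problem raised later in the thread. */
bool retarget(BackendState *s, const char *new_db)
{
    if (s->open_cursors > 0)
        return false;            /* must clean up first */
    release_db_state(s);
    strncpy(s->db_name, new_db, sizeof(s->db_name) - 1);
    s->db_name[sizeof(s->db_name) - 1] = '\0';
    s->caches_valid = true;      /* stands in for rebuilding the caches */
    return true;
}
```

The point of the toy: retargeting is cheap only if every piece of per-database state is reachable from one place, which is precisely what InitPostgres() does not give you today.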
On Sunday, 2006-04-09 at 18:26, Martijn van Oosterhout wrote:
> The cost of fetching a page from the OS is not really much of an
> overhead,

Have you tested this?

I remember having a case where a data load using COPY into a table with several indexes ran an order of magnitude faster with bigger work_mem. It was on a computer with enough memory to fit the whole working set (table + indexes) into the OS cache, so I can't think of any other overhead except reading the index pages repeatedly from the OS cache into shared memory buffers.

I agree that a real disk read is much slower than an OS cache read, but even an OS cache read is still much slower than no read at all. Sure, it's not nearly as bad on seqscans, but it still takes time.

> so I'd like to know how much benefit these buffer cache hacks
> actually produce.

-------------------
Hannu
On Mon, 2006-04-10 at 17:22 +0800, Qingqing Zhou wrote:
> Check the code InitPostgres(). These global variables are scattered in
> many places, so I am not sure if it is easy to write clean code to clear
> up these variables. But if you can come up with a patch to do reconnect
> without disconnect, that will be cool.

Yes, that's where I've been playing around already. Not with much success until now, though. :-(

> This is a problem for parallel query execution. I suspect we can't run
> with superuser privileges for now and for future. For now, I am not
> clear if there is any ACL check at query execution stage -- seems should
> not be, at least for DML. For future, I am pretty clear that this is not
> the way -- the mere possibility of future support of audit or MAC will
> bring in mind the reason.

(What's audit or MAC?)

Well, then let such a slave backend inherit access rights from its caller. You'd still save rechecking of passwords and such. And you save the client connection.

I'll give the retargeting code another try. If anybody has hints on what stumbling blocks and potholes to watch out for on the way there...

Regards
Markus
Markus Schiltknecht wrote:
> On Mon, 2006-04-10 at 17:22 +0800, Qingqing Zhou wrote:
>> Check the code InitPostgres(). These global variables are scattered in
>> many places, so I am not sure if it is easy to write clean code to
>> clear up these variables. But if you can come up with a patch to do
>> reconnect without disconnect, that will be cool.
>
> Yes, that's where I've been playing around already. Not with much
> success until now, though. :-(

An idea arising in chat with Joshua Drake: the retargeting code, if it turns out to work and not be excessively expensive, could also be useful to implement a server-side "connection pooling" of sorts: the postmaster could keep idle backends and retarget them to a database that receives an incoming connection. However, we'd also need a mechanism to clean all backend state before reusing a connection, to leave it "as new" (no prepared statements, WITH HOLD cursors, etc.)

--
Alvaro Herrera http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Hi, On Mon, 2006-04-10 at 13:36 -0400, Alvaro Herrera wrote: > An idea arising in chat with Joshua Drake: the retargetting code, if it > turns out to work and not be excessively expensive, could also be useful > to implement a server-side "connection pooling" of sorts: the postmaster > could keep idle backends and retarget them to a database that receives > an incoming connection. However, we'd also need a mechanism to clean > all backend state previous to reusing a connection, to leave it "as > new" (no prepared statements, WITH HOLD cursors, etc.) I suspect that's what I was calling 'pre-forking'. Please also note that this would be the most difficult of my three examples (by difficulty of implementation). Regards Markus
Hannu, On 4/10/06 2:23 AM, "Hannu Krosing" <hannu@skype.net> wrote: >> The cost of fetching a page from the OS is not really much of an >> overhead, > > Have you tested this ? I have - the overhead of fetching a page from Linux I/O cache to buffer cache is about an additional 20% over fetching it directly from buffer cache on PG 7.4. Haven't tested it on 8.1. - Luke
Gregory,

On 4/9/06 2:04 PM, "Gregory Maxwell" <gmaxwell@gmail.com> wrote:
> It increases Linux's maximum readahead from 128K to 1meg .. and it
> should be smart enough that you could crank it up further without too
> much risk of hurting performance elsewhere.

Interesting - we now routinely use "blockdev --setra" to set the block device readahead, and our common setting now is 16MB, so I will look at how the behavior of this adaptive readahead affects normal query workload. From the patch:

  Normally, the kernel uses a stock readahead logic that is well
  understood and well tuned. This option enables a much more complex and
  feature-rich one. It is more aggressive and memory efficient in doing
  readahead, and supports some less-common access patterns such as
  reading backward and reading sparsely. However, due to the great
  diversity of real world applications, it might not fit everyone.

- Luke
On Mon, 2006-04-10 at 02:16, Martijn van Oosterhout wrote: > There appear to be two separate cases here though, one is to just farm > out the read request to another process (basically aio), the other is > to do actual processing there. The latter is obviously far more useful > but requires a fair bit more infrastructure.

I ran some tests to see where time is spent during SeqScans. I did the following.

tester=# vacuum analyze verbose test;
INFO: vacuuming "public.test"
INFO: "test": found 0 removable, 727960 nonremovable row versions in 5353 pages
DETAIL: 0 dead row versions cannot be removed yet.
There were 0 unused item pointers.
0 pages are entirely empty.
CPU 0.18s/0.27u sec elapsed 0.91 sec.
INFO: analyzing "public.test"
INFO: "test": scanned 3000 of 5353 pages, containing 407952 live rows and 0 dead rows; 3000 rows in sample, 727922 estimated total rows
VACUUM

tester=# select version();
                                    version
-------------------------------------------------------------------------------
 PostgreSQL 8.2devel on sparc-sun-solaris2.11, compiled by GCC gcc (GCC) 3.3.2
(1 row)

tester=# select count(random) from test;
 count
--------
 727960
(1 row)

With the following dtrace results...

# ./probediff2.d 514607
dtrace: script './probediff2.d' matched 10 probes
CPU     ID              FUNCTION:NAME
  0  46811  ExecEndSeqScan:return scan time 20406
^C
smgrread                                    641566800
Virtualized - smgrread                      439798800
smgrread - Call Count                            5353
HeapTupleSatisfiesSnapshot                 6735471000
Virtualized - HeapTupleSatisfiesSnapshot   3516556800
HeapTupleSatisfiesSnapshot - Call Count        727960
Virtualized - ReadBuffer                    558230600
ReadBuffer                                  864931000
Virtualized - ExecutePlan                  7331181400
Virtualized - ExecSeqScan                  7331349600
ExecutePlan                               20405943000
ExecSeqScan                               20406161000

The virtualized times are supposed to be actual time spent on the CPU with the time spent in the probe factored out. It seems here that half the time in SeqScan is spent validating the tuples, as opposed to about 1/10th doing I/O.
I'm not sure that just farming out read IO is going to be all that helpful in this situation. That's why I think it's a good idea to create a slave process that prefetches pages and transfers valid ItemPointers to the master. There may not be much to be gained on simple SeqScans; however, in complex queries that include a SeqScan, you may gain a lot by offloading this work onto a slave thread. A table with TOAST'ed attributes comes to mind. The slave thread could be working away on the rest of the table while the master is PG_DETOAST_DATUM'ing the attributes for transmission back to the client or additional processing. Am I missing something in this analysis? I've attached my dtrace script. Myron Scott
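Myron's handoff scheme - a slave scanning the heap and passing ItemPointers to the master through a mutex-guarded common memory area - could be sketched roughly as the ring buffer below. This is a standalone illustration, not code from his fork: the `TidQueue` name, the fixed queue size, and packing an ItemPointer into a 64-bit value are all assumptions made for the sketch.

```c
#include <pthread.h>
#include <stdbool.h>
#include <stdint.h>

#define TID_QUEUE_SIZE 64

/* Hypothetical stand-in for an ItemPointer: block number and line
 * offset packed into one 64-bit value. */
typedef uint64_t tid_t;

typedef struct TidQueue
{
    tid_t           items[TID_QUEUE_SIZE];
    int             head, tail, count;
    bool            done;           /* slave has finished the scan */
    pthread_mutex_t lock;
    pthread_cond_t  not_empty;
    pthread_cond_t  not_full;
} TidQueue;

void
tidq_init(TidQueue *q)
{
    q->head = q->tail = q->count = 0;
    q->done = false;
    pthread_mutex_init(&q->lock, NULL);
    pthread_cond_init(&q->not_empty, NULL);
    pthread_cond_init(&q->not_full, NULL);
}

/* Slave side: block until there is room, then append one ItemPointer
 * that has already passed the visibility check. */
void
tidq_push(TidQueue *q, tid_t tid)
{
    pthread_mutex_lock(&q->lock);
    while (q->count == TID_QUEUE_SIZE)
        pthread_cond_wait(&q->not_full, &q->lock);
    q->items[q->tail] = tid;
    q->tail = (q->tail + 1) % TID_QUEUE_SIZE;
    q->count++;
    pthread_cond_signal(&q->not_empty);
    pthread_mutex_unlock(&q->lock);
}

/* Slave side: mark the scan complete so the master stops waiting. */
void
tidq_finish(TidQueue *q)
{
    pthread_mutex_lock(&q->lock);
    q->done = true;
    pthread_cond_broadcast(&q->not_empty);
    pthread_mutex_unlock(&q->lock);
}

/* Master side: fetch the next ItemPointer; returns false once the
 * queue is drained and the slave has finished. */
bool
tidq_pop(TidQueue *q, tid_t *tid)
{
    pthread_mutex_lock(&q->lock);
    while (q->count == 0 && !q->done)
        pthread_cond_wait(&q->not_empty, &q->lock);
    if (q->count == 0)
    {
        pthread_mutex_unlock(&q->lock);
        return false;
    }
    *tid = q->items[q->head];
    q->head = (q->head + 1) % TID_QUEUE_SIZE;
    q->count--;
    pthread_cond_signal(&q->not_full);
    pthread_mutex_unlock(&q->lock);
    return true;
}
```

The condition-variable pair gives the flow control the shared communication-buffer needs: the slave stalls when the master falls behind, and the master sleeps instead of spinning when the queue is empty.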
Alvaro Herrera <alvherre@commandprompt.com> writes: > An idea arising in chat with Joshua Drake: the retargetting code, if it > turns out to work and not be excessively expensive, could also be useful > to implement a server-side "connection pooling" of sorts: the postmaster > could keep idle backends and retarget them to a database that receives > an incoming connection. However, we'd also need a mechanism to clean > all backend state previous to reusing a connection, to leave it "as > new" (no prepared statements, WITH HOLD cursors, etc.) Isn't all that work pretty much exactly the main cost of starting a new backend? -- greg
Greg Stark wrote: > Alvaro Herrera <alvherre@commandprompt.com> writes: > > > An idea arising in chat with Joshua Drake: the retargetting code, if it > > turns out to work and not be excessively expensive, could also be useful > > to implement a server-side "connection pooling" of sorts: the postmaster > > could keep idle backends and retarget them to a database that receives > > an incoming connection. However, we'd also need a mechanism to clean > > all backend state previous to reusing a connection, to leave it "as > > new" (no prepared statements, WITH HOLD cursors, etc.) > > Isn't all that work pretty much exactly the main cost of starting a new > backend? On Linux and other systems where fork() has negligible cost, maybe; but on Windows and Solaris, it's certainly not. -- Alvaro Herrera http://www.CommandPrompt.com/ The PostgreSQL Company - Command Prompt, Inc.
Simon Riggs <simon@2ndquadrant.com> writes: > I think it would be useful to think about exactly what type of > query/activity we are looking to improve the performance on. That way we > can understand the benefit of this proposal and take some baseline > measurements to analyse what is happening for those cases. I find the focus on sequential scans, index scans, etc. quite odd when you're discussing parallel query processing. The whole goal of parallel query processing is to bring more *cpu* to bear on the problem. That's going to be most relevant when you're cpu bound, not i/o bound. The queries I would expect to be helped most by parallel query processing are queries that involve sorting. For example, a big merge join with two sorts on either side could perform the two sorts simultaneously. If they provide the results of the final pass to a third thread it can execute the merge join and the rest of the query plan while the sorts are still executing on two other processors. -- greg
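Greg's example - two sorts running simultaneously, feeding a merge join - can be sketched with threads standing in for the extra processors. This is a toy model under obvious assumptions: the worker and function names are invented, integers stand in for tuples, and a real executor would use tuplesort and proper tuple slots rather than qsort over arrays.

```c
#include <pthread.h>
#include <stddef.h>
#include <stdlib.h>

typedef struct SortTask
{
    int    *vals;
    size_t  n;
} SortTask;

static int
cmp_int(const void *a, const void *b)
{
    int x = *(const int *) a, y = *(const int *) b;
    return (x > y) - (x < y);
}

/* Worker: sort one join input in its own thread (a real executor
 * would run this in a separate backend process). */
static void *
sort_worker(void *arg)
{
    SortTask *t = (SortTask *) arg;
    qsort(t->vals, t->n, sizeof(int), cmp_int);
    return NULL;
}

/* Sort both join inputs concurrently, then count equi-join matches
 * with a standard merge join over the sorted runs. */
size_t
parallel_merge_join_count(int *left, size_t nl, int *right, size_t nr)
{
    pthread_t lt, rt;
    SortTask  ltask = {left, nl};
    SortTask  rtask = {right, nr};
    size_t    i = 0, j = 0, matches = 0;

    pthread_create(&lt, NULL, sort_worker, &ltask);
    pthread_create(&rt, NULL, sort_worker, &rtask);
    pthread_join(lt, NULL);
    pthread_join(rt, NULL);

    while (i < nl && j < nr)
    {
        if (left[i] < right[j])
            i++;
        else if (left[i] > right[j])
            j++;
        else
        {
            /* count the cross product of the two equal runs */
            size_t ri = i, rj = j;

            while (ri < nl && left[ri] == left[i]) ri++;
            while (rj < nr && right[rj] == right[j]) rj++;
            matches += (ri - i) * (rj - j);
            i = ri;
            j = rj;
        }
    }
    return matches;
}
```

The point of the sketch is the shape of the plan, not the joins themselves: the two sorts are independent, so they parallelize trivially, and the merge step only needs each input's sorted output, so it could start consuming as soon as each sort's final pass begins.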
On Tue, 2006-04-11 at 07:47, Myron Scott wrote: > client > or additional processing. Am I missing something in this analysis? > > I've attached my dtrace script. > To answer my own question, I suppose my processors are relatively slow compared to most setups. Myron Scott
Alvaro Herrera <alvherre@commandprompt.com> writes: > Greg Stark wrote: > > Alvaro Herrera <alvherre@commandprompt.com> writes: > > > > > An idea arising in chat with Joshua Drake: the retargetting code, if it > > > turns out to work and not be excessively expensive, could also be useful > > > to implement a server-side "connection pooling" of sorts: the postmaster > > > could keep idle backends and retarget them to a database that receives > > > an incoming connection. However, we'd also need a mechanism to clean > > > all backend state previous to reusing a connection, to leave it "as > > > new" (no prepared statements, WITH HOLD cursors, etc.) > > > > Isn't all that work pretty much exactly the main cost of starting a new > > backend? > > On Linux and other systems were fork() has negligible cost, maybe; but > on Windows and Solaris, it's certainly not. Even on Solaris I'm sure parsing and preparing plans for all the queries, building up the system table cache for all the objects in the database, and so on are much much more expensive than fork(). I wouldn't be surprised if even on windows it was still a pretty close race. -- greg
Greg Stark wrote: > Even on Solaris I'm sure parsing and preparing plans for all the queries, > building up the system table cache for all the objects in the database, and so > on are much much more expensive than fork(). I wouldn't be surprised if even > on windows it was still a pretty close race. Parsing/planning what queries? Regarding system caches, they are populated from a cache file; they are not read from the catalogs each time. But while we don't see a patch implementing the idea, this is all very theoretical and probably wrong. -- Alvaro Herrera http://www.CommandPrompt.com/ PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Greg Stark wrote: > Simon Riggs <simon@2ndquadrant.com> writes: > >> I think it would be useful to think about exactly what type of >> query/activity we are looking to improve the performance on. That way we >> can understand the benefit of this proposal and take some baseline >> measurements to analyse what is happening for those cases. > > I find the focus on sequential scans, index scans, etc. quite odd when you're > discussing parallel query processing. The whole goal of parallel query > processing is to bring more *cpu* to bear on the problem. That's going to be > most relevant when you're cpu bound, not i/o bound. > > The queries I would expect to be helped most by parallel query processing are > queries that involve sorting. For example, a big merge join with two sorts on > either side could perform the two sorts simultaneously. If they provide the > results of the final pass to a third thread it can execute the merge join and > the rest of the query plan while the sorts are still executing on two other > processors. Well, I am going to go out on a limb here and guess that sequential scans and index scans are still very relevant, because the CPU could be bound by the scan (either one) depending on the type of query being performed. This doesn't really have anything to do with being IO bound, but rather with the type of accesses to the data we have to deal with in query processing. You are correct about parallel query processing helping multiple-sort queries, but those sorts may or may not hit an index. Sincerely, Joshua D. Drake > > -- === The PostgreSQL Company: Command Prompt, Inc. === Sales/Support: +1.503.667.4564 || 24x7/Emergency: +1.800.492.2240 Providing the most comprehensive PostgreSQL solutions since 1997 http://www.commandprompt.com/
"Joshua D. Drake" <jd@commandprompt.com> writes: > Well I am go out on a limb here and gather to guess that sequential scans and > index scans are still very relevant because the CPU could be bound by the scan > (either one) based on the type of query being performed. > > This doesn't really have anything to do with being IO bound as to the type of > accesses to the data we have to deal with in regards to query processing. It has everything to do with being i/o bound. The only way having two processors perform part of an index or sequential scan would help is if your disk subsystem is capable of providing data faster than a single processor is capable of requesting it. That's only going to be true for very high end systems with multiple raid controllers and dozens of spindles. On the other hand even moderately sized dual-core systems could probably benefit from being able to perform multiple cpu-intensive operations simultaneously. -- greg
On Mon, Apr 10, 2006 at 12:02:56PM -0700, Luke Lonergan wrote: > Hannu, > > On 4/10/06 2:23 AM, "Hannu Krosing" <hannu@skype.net> wrote: > > >> The cost of fetching a page from the OS is not really much of an > >> overhead, > > > > Have you tested this ? > > I have - the overhead of fetching a page from Linux I/O cache to buffer > cache is about an additional 20% over fetching it directly from buffer cache > on PG 7.4. Is there any practical way to tell the difference between a page coming from the OS cache and one coming from disk? Or maybe for a set of pages an estimate on how many came from cache vs disk? There's some areas where having this information would be very useful, such as for vacuum delay. It would make tuning much easier, and it would also give us some insight on how heavily loaded disks were, which would also be useful info for vacuum to have (so we could adjust vacuum_cost_delay dynamically based on load). -- Jim C. Nasby, Sr. Engineering Consultant jnasby@pervasive.com Pervasive Software http://pervasive.com work: 512-231-6117 vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461
On Mon, Apr 10, 2006 at 05:22:13PM +0800, Qingqing Zhou wrote: > > "Markus Schiltknecht" <markus@bluegap.ch> wrote > > Hi Qingqing, > > > > > > > > As Tom pointed out, without big change, a backend on database "D1" can't > > > connect to "D2". This is because to connect to a database, we need to > > > initialize a lot of variables. So when you reconnect to another one on > the > > > fly, you have to change these variables one by one. > > > > Sure, the question is: what is needed to retarget a backend? > > > Check the code InitPostgres(). These global variables are scattered in many > places, so I am not sure if it is easy to write clean code to clear up these > variables. But if you can come up with a patch to do reconnect without > disconnect, that will be cool. Something else to consider: most queries that would benefit from parallel execution are expensive enough that the cost of spawning some new backends wouldn't be that big a deal, so perhaps for an initial version it would be best to KISS and just spawn parallel execution backends as needed. -- Jim C. Nasby, Sr. Engineering Consultant jnasby@pervasive.com Pervasive Software http://pervasive.com work: 512-231-6117 vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461
On Mon, Apr 10, 2006 at 01:36:45PM -0400, Alvaro Herrera wrote: > Markus Schiltknecht wrote: > > On Mon, 2006-04-10 at 17:22 +0800, Qingqing Zhou wrote: > > > Check the code InitPostgres(). These global variables are scattered in many > > > places, so I am not sure if it is easy to write clean code to clear up these > > > variables. But if you can come up with a patch to do reconnect without > > > disconnect, that will be cool. > > > > Yes, that's where I've been playing around already. Not with much > > success until now, though. :-( > > An idea arising in chat with Joshua Drake: the retargetting code, if it > turns out to work and not be excessively expensive, could also be useful > to implement a server-side "connection pooling" of sorts: the postmaster > could keep idle backends and retarget them to a database that receives > an incoming connection. However, we'd also need a mechanism to clean > all backend state previous to reusing a connection, to leave it "as > new" (no prepared statements, WITH HOLD cursors, etc.) Oracle allows you to essentially re-connect to an existing connection by saving connection state when a connection goes back to the connection pool. Essentially, you connect, and then re-authenticate as a different user. That user has a specific environment associated with it which is then pulled in. This makes it reasonable to use Oracle's built-in code for handling permissions, etc; you just give each system user a database account. While this sounds scary to folk that haven't used it, it's actually safer than rolling your own authentication and security mechanism (which will likely have bugs in it) and having your middleware connect to the database with some dedicated account. -- Jim C. Nasby, Sr. Engineering Consultant jnasby@pervasive.com Pervasive Software http://pervasive.com work: 512-231-6117 vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461
> That's only going to be true for very high end systems with multiple raid > controllers and dozens of spindles. Not true. I have a system right now that would benefit. My database is only 500 megs and I have 2 gigs of RAM and two processors... I only have a raid 1, but that is OK because most things are cached in memory anyway. > > On the other hand even moderately sized dual-core systems could probably > benefit from being able to perform multiple cpu-intensive operations > simultaneously. See above :) Joshua D. Drake > -- === The PostgreSQL Company: Command Prompt, Inc. === Sales/Support: +1.503.667.4564 || 24x7/Emergency: +1.800.492.2240 Providing the most comprehensive PostgreSQL solutions since 1997 http://www.commandprompt.com/
Enhanced TODO:

* Experiment with multi-threaded backend for better resource utilization

  This would allow a single query to make use of multiple CPUs or multiple I/O channels simultaneously. One idea is to create a background reader that can pre-fetch sequential and index scan pages needed by other backends. This could be expanded to allow concurrent reads from multiple devices in a partitioned table.

The issue of parallelism is basically a sliding scale, meaning what do you want to do concurrently. In this case, we are trying to do I/O and CPU concurrently. Another idea is to do I/O for partitioned tables concurrently, and of course there are many CPU-concurrent cases like sorting. --------------------------------------------------------------------------- Tom Lane wrote: > Myron Scott <lister@sacadia.com> writes: > > Gregory Maxwell wrote: > >> There are other cases where it is useful to perform parallel I/O > >> without parallel processing.. > > > I have done some testing more along these lines with an old fork of > > postgres code (2001). In my tests, I used a thread to delegate out > > the actual heap scan of the SeqScan. The job of the "slave" thread > > was to fault in buffer pages and determine the time validity of > > the tuples. ItemPointers are passed back to the "master" thread via a > > common memory area guarded by mutex locking. > > I was considering a variant idea in the shower this morning: suppose > that we invent one or more "background reader" processes that have > basically the same infrastructure as the background writer, but have > the responsibility of causing buffer reads to happen at useful times > (whereas the writer causes writes to happen at useful times). The > idea would be for backends to signal the readers when they know they > will need a given block soon, and then hopefully when they need it > it'll already be in shared buffers.
> For instance, in a seqscan it'd be > pretty trivial to request block N+1 just after reading block N, and then > doing our actual processing on block N while (we hope) some reader > process is faulting in N+1. Bitmap indexscans could use this structure > too; I'm less sure about whether plain indexscans could do much with it > though. > > The major issues I can see are: > > 1. We'd need a shared-memory queue of read requests, probably much like > the queue of fsync requests. We've already seen problems with > contention for the fsync queue, IIRC, and that's used much less heavily > than the read request queue would be. So there might be some > performance issues with getting the block requests sent over to the > readers. > > 2. There are some low-level assumptions that no one reads in pages of > a relation without having some kind of lock on the relation (consider > eg the case where the relation is being dropped). A bgwriter-like > process wouldn't be able to hold lmgr locks, and we wouldn't really want > it to be thrashing the lmgr shared data structures for each read anyway. > So you'd have to design some interlock to guarantee that no backend > abandons a query (and releases its own lmgr locks) while an async read > request it made is still pending. Ugh. > > regards, tom lane -- Bruce Momjian http://candle.pha.pa.us EnterpriseDB http://www.enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
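Tom's "request block N+1 while processing block N" loop might look roughly like the sketch below. This is not his proposed design: instead of a background reader process with a shared-memory request queue, it uses posix_fadvise(POSIX_FADV_WILLNEED) as a stand-in, which asks the kernel to start the read without blocking the caller. The BLCKSZ constant and both function names are illustrative.

```c
#include <fcntl.h>
#include <unistd.h>

#define BLCKSZ 8192

/* Hypothetical hook: hand block blkno to a "background reader".
 * Here the kernel plays that role via an asynchronous readahead
 * hint; the call returns immediately. */
static void
prefetch_block(int fd, long blkno)
{
#ifdef POSIX_FADV_WILLNEED
    (void) posix_fadvise(fd, (off_t) blkno * BLCKSZ, BLCKSZ,
                         POSIX_FADV_WILLNEED);
#else
    (void) fd;
    (void) blkno;
#endif
}

/* Seqscan skeleton: request block N+1 just before processing block N,
 * so the read of N+1 overlaps the CPU work on N.  Returns the number
 * of bytes read. */
long
prefetching_seqscan(int fd, long nblocks)
{
    char buf[BLCKSZ];
    long total = 0;
    long blkno;

    for (blkno = 0; blkno < nblocks; blkno++)
    {
        ssize_t got;

        if (blkno + 1 < nblocks)
            prefetch_block(fd, blkno + 1);

        got = pread(fd, buf, BLCKSZ, (off_t) blkno * BLCKSZ);
        if (got <= 0)
            break;
        total += got;       /* real code would scan tuples here */
    }
    return total;
}
```

A fadvise-style hint sidesteps Tom's issue #2 (the lmgr interlock) because nothing is read into shared buffers on the backend's behalf, but it also only moves data into the kernel cache, so the ~20% copy-in overhead Luke measured would remain.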
Jim C. Nasby wrote: > On Mon, Apr 10, 2006 at 12:02:56PM -0700, Luke Lonergan wrote: > > Hannu, > > > > On 4/10/06 2:23 AM, "Hannu Krosing" <hannu@skype.net> wrote: > > > > >> The cost of fetching a page from the OS is not really much of an > >> overhead, > > > > > > Have you tested this ? > > > > I have - the overhead of fetching a page from Linux I/O cache to buffer > > cache is about an additional 20% over fetching it directly from buffer cache > > on PG 7.4. > > Is there any practical way to tell the difference between a page coming > from the OS cache and one coming from disk? Or maybe for a set of pages > an estimate on how many came from cache vs disk? There's some areas > where having this information would be very useful, such as for vacuum > delay. It would make tuning much easier, and it would also give us some > insight on how heavily loaded disks were, which would also be useful > info for vacuum to have (so we could adjust vacuum_cost_delay > dynamically based on load).

getrusage() returns:

! 0.000062 elapsed 0.000000 user 0.000062 system sec
! [0.000000 user 0.009859 sys total]
! 0/0 [19/2] filesystem blocks in/out
! 0/0 [0/0] page faults/reclaims, 0 [0] swaps
! 0 [0] signals rcvd, 0/0 [4/5] messages rcvd/sent
! 0/0 [23/6] voluntary/involuntary context switches

but I don't see anything in there that would show kernel cache vs. disk I/O. In fact, there is usually little connection in the kernel between an I/O request and the process that requests it. -- Bruce Momjian http://candle.pha.pa.us EnterpriseDB http://www.enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
On Thu, Apr 13, 2006 at 03:38:04PM -0400, Bruce Momjian wrote: > > Is there any practical way to tell the difference between a page coming > > from the OS cache and one coming from disk? Or maybe for a set of pages > > an estimate on how many came from cache vs disk? There's some areas > > where having this information would be very useful, such as for vacuum > > delay. It would make tuning much easier, and it would also give us some > > insight on how heavily loaded disks were, which would also be useful > > info for vacuum to have (so we could adjust vacuum_cost_delay > > dynamically based on load). > > getrusage() returns: > > ! 0.000062 elapsed 0.000000 user 0.000062 system sec > ! [0.000000 user 0.009859 sys total] > ! 0/0 [19/2] filesystem blocks in/out > ! 0/0 [0/0] page faults/reclaims, 0 [0] swaps > ! 0 [0] signals rcvd, 0/0 [4/5] messages rcvd/sent > ! 0/0 [23/6] voluntary/involuntary context switches > > but I don't see anything in there that would show kernel cache vs. disk > I/O. In fact, there is usually little connection in the kernel between > an I/O request and the process that requests it. Yeah, my assumption has been that the only way to tell the difference would be by timing, but I don't know how practical that is. Since gettime() or whatever EXPLAIN ANALYZE uses is apparently very expensive, perhaps there's some other alternative. Perhaps the timing info in getrusage would work for this. Another idea is setting an alarm for a fairly short period before making the IO request. If the request comes back before the alarm fires, the data must have been in the OS cache. Another thought is that any IO request that goes to disk would most likely put the process requesting the IO to sleep, but a request being served out of cache might not do that. Perhaps there's some way to recognize that. Or maybe a better tack would be to develop a patch for as many OSes as possible that would tell the caller if an IO request came out of cache or not. -- Jim C. Nasby, Sr. Engineering Consultant jnasby@pervasive.com Pervasive Software http://pervasive.com work: 512-231-6117 vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461
Jim C. Nasby wrote: > Yeah, my assumption has been that the only way to tell the difference > would be by timing, but I don't know how practical that is. Since > gettime() or whatever EXPLAIN ANALYZE uses is apparently very expensive, > perhaps there's some other alternative. Perhapse the timing info in > getrusage would work for this. Another idea is setting an alarm for a > fairly short period before making the IO request. If the request comes > back before the alarm fires, the data must have been in the OS cache. > > Another thought is that any IO request that goes to disk would most > likely put the process requesting the IO to sleep, but a request being > served out of cache might not do that. Perhaps there's some way to > recognize that. > > Or maybe a better track would be to develop a patch for as many OSes as > possible that would tell the caller if an IO request came out of cache > or not. Or give up, which seems the most fruitful approach. -- Bruce Momjian http://candle.pha.pa.us EnterpriseDB http://www.enterprisedb.com + If your life is a hard drive, Christ can be your backup. +