Thread: Piggybacking vacuum I/O
I've been looking at the way we do vacuums.

The fundamental performance issue is that a vacuum generates nheapblocks+nindexblocks+ndirtyblocks I/Os. Vacuum cost delay helps to spread the cost like part payment, but the total is the same. In an I/O bound system, the extra I/O directly leads to less throughput.

Therefore, we need to do less I/O. Dead space map helps by allowing us to skip blocks that don't need vacuuming, reducing the # of I/Os to 2*ndirtyblocks+nindexblocks. That's great, but it doesn't help us if the dead tuples are spread uniformly.

If we could piggyback the vacuum I/Os to the I/Os that we're doing anyway, vacuum wouldn't ideally have to issue any I/O of its own. I've tried to figure out a way to do that.

Vacuum is done in 3 phases:

1. Scan heap
2. Vacuum index
3. Vacuum heap

Instead of doing a sequential scan, we could perform the 1st phase by watching the buffer pool, scanning blocks for dead tuples when they're in memory and keeping track of which pages we've seen. When all pages have been seen, the tid list is sorted and 1st phase is done.

In theory, the index vacuum could also be done that way, but let's assume for now that indexes would be scanned like they are currently.

The 3rd phase can be performed similarly to the 1st phase. Whenever a page enters the buffer pool, we check the tid list and remove any matching tuples from the page. When the list is empty, vacuum is complete.

Of course, there are some issues in the design as described above. For example, the vacuum might take a long time if there are cold spots in the table. In fact, a block full of dead tuples might never be visited again.

A variation of the scheme would be to keep scanning pages that are in cache, until the tid list reaches a predefined size, instead of keeping track of which pages have already been seen. That would deal better with tables with hot and cold spots, but it couldn't advance the relfrozenxid because there would be no guarantee that all pages are visited. Also, we could start 1st phase of the next vacuum, while we're still in the 3rd phase of previous one.

Also, after we've seen 95% of the pages or a timeout expires, we could fetch the rest of them with random I/O to let the vacuum finish.

I'm not sure how exactly this would be implemented. Perhaps bgwriter or autovacuum would do it, or a new background process. Presumably the process would need access to the relcache.

One issue is that if we're trying to vacuum every table simultaneously this way, we'll need more overall memory for the tid lists. I'm hoping there's a way to implement this without requiring shared memory for the tid lists; that would make the memory management a nightmare. Also, we'd need changes to the bufmgr API to support this.

This would work nicely with the DSM. The list of pages that need to be visited in phase 1 could be initialized from the DSM, largely avoiding the problem with cold spots.

Any thoughts before I start experimenting?

--
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
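To make the piggybacked 3rd phase concrete, here is a rough, self-contained sketch. Everything in it (DeadTid, finish_phase1, piggyback_vacuum_page) is a made-up name for illustration only; a real implementation would hang off new bufmgr hooks and use PostgreSQL's ItemPointer and buffer machinery, none of which is shown.

    #include <stdint.h>
    #include <stdlib.h>

    /* Hypothetical dead-tuple identifier: heap block number + line pointer. */
    typedef struct DeadTid
    {
        uint32_t    blkno;
        uint16_t    offnum;
    } DeadTid;

    static int
    deadtid_cmp(const void *a, const void *b)
    {
        const DeadTid *x = a;
        const DeadTid *y = b;

        if (x->blkno != y->blkno)
            return (x->blkno < y->blkno) ? -1 : 1;
        if (x->offnum != y->offnum)
            return (x->offnum < y->offnum) ? -1 : 1;
        return 0;
    }

    /* End of the 1st phase: once every page has been seen, sort the list. */
    static void
    finish_phase1(DeadTid *tids, size_t ntids)
    {
        qsort(tids, ntids, sizeof(DeadTid), deadtid_cmp);
    }

    /*
     * 3rd phase, piggybacked: imagined to be called whenever a heap page
     * enters the buffer pool.  Returns how many entries of the (already
     * index-cleaned) tid list refer to this page; the caller would remove
     * those line pointers and mark the entries done.  Vacuum is complete
     * when no entries remain.
     */
    static size_t
    piggyback_vacuum_page(uint32_t blkno, const DeadTid *tids, size_t ntids)
    {
        size_t      hits = 0;

        for (size_t i = 0; i < ntids; i++)
        {
            if (tids[i].blkno != blkno)
                continue;
            /* ... remove line pointer tids[i].offnum from the page here ... */
            hits++;
        }
        return hits;
    }

The linear scan is only for brevity; with the list sorted by finish_phase1(), the per-page lookup would binary-search for the block's first entry.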
On Mon, 2007-01-22 at 13:41 +0000, Heikki Linnakangas wrote:

> Any thoughts before I start experimenting?

Probably only to detail the various use cases we are discussing. My thoughts on various use cases are:

- small table with frequent update/delete, heap and indexes all/mostly cached
  e.g. Counter tables, DBT2: District/Warehouse TPC-C, pgbench: Branches/Tellers

  Current VACUUM works well in this situation, since the only I/O incurred is the WAL written for the VACUUM. VACUUM very cheap even if not in cache because of sequential I/O. Keeping track of whether there are hot spots in these tables seems like a waste of cycles and could potentially introduce contention and hence reduce performance. These need to be very frequently VACUUMed, even when other VACUUMs are required.

  My current view: just need multiple concurrent autovacuum processes.

- large table with severe hotspots
  e.g. DBT2: NewOrder, larger queue-style tables

  The hotspots are likely to be in cache and the not-so-hotspots might or might not be in cache, but we don't care either way. DSM concept works well for this case, since we are able to avoid lots of I/O by appropriate book-keeping. Works well for removing rows after a file-scan DELETE, as well as for DELETE or UPDATE hot spots.

  My current view: DSM would be great for this.

- large table with few hotspots
  e.g. DBT2: Stock, pgbench: Accounts, most Customer tables

  Current VACUUM works very badly in this case, since updates are sparsely distributed across the table. DSM wouldn't help either unless we differentiate between few/many updates to a block.

  My current view: Piggyback concept seems on the right track, but clearly needs further thought.

Currently we have only one technique for garbage collection, plus one process to perform it. We need multiple techniques executed by multiple processes, when required, plus some way of automatically selecting which is appropriate depending upon the use case. Yes, automatic :-)

DSM and this piggyback idea need not be thought of as competing techniques.

--
Simon Riggs
EnterpriseDB   http://www.enterprisedb.com
On Mon, Jan 22, 2007 at 02:51:47PM +0000, Heikki Linnakangas wrote:

> I've been looking at the way we do vacuums.
>
> The fundamental performance issue is that a vacuum generates
> nheapblocks+nindexblocks+ndirtyblocks I/Os. Vacuum cost delay helps to
> spread the cost like part payment, but the total is the same. In an I/O
> bound system, the extra I/O directly leads to less throughput.
>
> Therefore, we need to do less I/O. Dead space map helps by allowing us
> to skip blocks that don't need vacuuming, reducing the # of I/Os to
> 2*ndirtyblocks+nindexblocks. That's great, but it doesn't help us if the
> dead tuples are spread uniformly.
>
> If we could piggyback the vacuum I/Os to the I/Os that we're doing
> anyway, vacuum wouldn't ideally have to issue any I/O of its own. I've
> tried to figure out a way to do that.
>
> Vacuum is done in 3 phases:
>
> 1. Scan heap
> 2. Vacuum index
> 3. Vacuum heap
>
> Instead of doing a sequential scan, we could perform the 1st phase by
> watching the buffer pool, scanning blocks for dead tuples when they're
> in memory and keeping track of which pages we've seen. When all pages
> have been seen, the tid list is sorted and 1st phase is done.
>
> In theory, the index vacuum could also be done that way, but let's
> assume for now that indexes would be scanned like they are currently.
>
> The 3rd phase can be performed similarly to the 1st phase. Whenever a
> page enters the buffer pool, we check the tid list and remove any
> matching tuples from the page. When the list is empty, vacuum is complete.

Is there any real reason to demark the start and end of a vacuum? Why not just go to a continuous process? One possibility is to keep a list of TIDs for each phase, though that could prove tricky with multiple indexes.

> A variation of the scheme would be to keep scanning pages that are in
> cache, until the tid list reaches a predefined size, instead of keeping
> track of which pages have already been seen. That would deal better with
> tables with hot and cold spots, but it couldn't advance the relfrozenxid
> because there would be no guarantee that all pages are visited. Also, we
> could start 1st phase of the next vacuum, while we're still in the 3rd
> phase of previous one.

What if we tracked freeze status on a per-page basis? Perhaps track the minimum XID that's on each page. That would allow us to ensure that we freeze pages that are approaching XID wrap.

--
Jim Nasby                                            jim@nasby.net
EnterpriseDB      http://enterprisedb.com      512.569.9461 (cell)
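As a purely hypothetical illustration of the per-page minimum-XID idea above (made-up structure and names; note that real code would need wraparound-aware XID comparisons rather than the plain "<" used here):

    #include <stdint.h>
    #include <stdbool.h>

    typedef uint32_t TransactionId;     /* xids are 32-bit, as in PostgreSQL */

    /*
     * Hypothetical per-page book-keeping for the freeze idea: remember the
     * oldest unfrozen XID seen on each heap page.  Zero means "no entry".
     */
    typedef struct PageFreezeMap
    {
        uint32_t        npages;
        TransactionId  *min_xid;        /* one entry per heap page */
    } PageFreezeMap;

    /* Called whenever a tuple stamped with "xid" lands on page "blkno".
     * Real code would need modular, wraparound-aware XID comparison
     * (TransactionIdPrecedes-style), not the plain "<" shown here. */
    static void
    page_note_xid(PageFreezeMap *map, uint32_t blkno, TransactionId xid)
    {
        if (map->min_xid[blkno] == 0 || xid < map->min_xid[blkno])
            map->min_xid[blkno] = xid;
    }

    /* Does this page need a freezing visit before the given limit? */
    static bool
    page_needs_freeze(const PageFreezeMap *map, uint32_t blkno,
                      TransactionId freeze_limit)
    {
        return map->min_xid[blkno] != 0 && map->min_xid[blkno] < freeze_limit;
    }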
Heikki Linnakangas <heikki@enterprisedb.com> wrote:

> Vacuum is done in 3 phases:
>
> 1. Scan heap
> 2. Vacuum index
> 3. Vacuum heap

> A variation of the scheme would be to keep scanning pages that are in
> cache, until the tid list reaches a predefined size, instead of keeping
> track of which pages have already been seen. That would deal better with
> tables with hot and cold spots, but it couldn't advance the relfrozenxid
> because there would be no guarantee that all pages are visited. Also, we
> could start 1st phase of the next vacuum, while we're still in the 3rd
> phase of previous one.

ISTM it is another DSM that has tuple-level accuracy, not page-level. One of the benefits is that we can skip the 1st phase of vacuum: we will have a TID list of dead tuples at the start of the vacuum, so we can start from the 2nd phase.

I have another idea for the use of TID lists -- store the TIDs after the 1st or 2nd phase, and exit the vacuum. At the next vacuum, we will do both the previous 3rd phase and the new 1st phase at once, so that I/Os are reduced to (ndirtyblocks + nindexblocks) from (2*ndirtyblocks + nindexblocks) on average. We already use a similar method in vacuuming btree indexes to collect recyclable empty pages.

I think piggybacking of I/Os is very useful. The buffer manager helps us fold up some of the I/Os, but explicit orders are more effective.

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center
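A sketch of how the stored TID list could be folded into the next vacuum's heap pass so that each dirty page is read only once. The types and the function below (Tid, TidList, combined_pass_page) are hypothetical, and the actual page manipulation is elided.

    #include <stdint.h>
    #include <stddef.h>

    typedef struct Tid
    {
        uint32_t    blkno;
        uint16_t    offnum;
    } Tid;

    typedef struct TidList
    {
        Tid        *tids;
        size_t      ntids;
    } TidList;

    /*
     * Combined pass over one dirty heap page: finish the previous vacuum's
     * 3rd phase and do this vacuum's 1st phase in the same page read, giving
     * roughly ndirtyblocks + nindexblocks I/Os per cycle instead of
     * 2*ndirtyblocks + nindexblocks.
     */
    static void
    combined_pass_page(uint32_t blkno, const TidList *prev_dead, TidList *new_dead)
    {
        /* previous vacuum's 3rd phase: drop line pointers recorded for blkno */
        for (size_t i = 0; i < prev_dead->ntids; i++)
        {
            if (prev_dead->tids[i].blkno == blkno)
            {
                /* ... remove prev_dead->tids[i].offnum from the page ... */
            }
        }

        /* this vacuum's 1st phase: scan the page and append any newly found
         * dead tuples to new_dead (collection details elided) */
        (void) new_dead;
    }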
On 1/22/07, Heikki Linnakangas <heikki@enterprisedb.com> wrote:
I've been looking at the way we do vacuums.
The fundamental performance issue is that a vacuum generates
nheapblocks+nindexblocks+ndirtyblocks I/Os. Vacuum cost delay helps to
spread the cost like part payment, but the total is the same. In an I/O
bound system, the extra I/O directly leads to less throughput.
Another source of I/O is perhaps the CLOG read/writes for checking transaction status. If we are talking about large tables like accounts in pgbench or customer/stock in DBT2, the tables are vacuumed much later than the actual UPDATEs. I don't have any numbers to prove yet, but my sense is that CLOG pages holding the status of many of the transactions might have been already flushed out of the cache and require an I/O. Since the default CLOG SLRU buffers is set to 8, there could be severe CLOG SLRU thrashing during VACUUM as the transaction ids will be all random in a heap page.
Would it help to set the status of the XMIN/XMAX of tuples early enough such that the heap page is still in the buffer cache, but late enough such that the XMIN/XMAX transactions are finished ? How about doing it when the bgwriter is about to write the page to disk ? Assuming few seconds of life of a heap page in the buffer cache, hopefully most of the XMIN/XMAX transactions should have completed and bgwriter can set XMIN(XMAX)_COMMITTED or XMIN(XMAX)_INVALID for most of the tuples in the page. This would save us CLOG I/Os later, either during subsequent access to the tuple and/or vacuum.
Any thoughts ?
Thanks,
Pavan
EnterpriseDB http://www.enterprisedb.com
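A purely illustrative sketch of the "set hint bits just before the page is written out" idea; only the xmin side is shown. The tuple layout, the bit names and the clog-lookup stub are simplified stand-ins rather than the real htup/clog code, and, as discussed further down the thread, doing this from the bgwriter raises deadlock concerns precisely because of that clog lookup.

    #include <stdint.h>

    typedef uint32_t TransactionId;

    /* simplified stand-ins for the real infomask hint bits */
    #define HINT_XMIN_COMMITTED  0x0001
    #define HINT_XMIN_INVALID    0x0002

    typedef enum { XACT_IN_PROGRESS, XACT_COMMITTED, XACT_ABORTED } XactStatus;

    /* stub for a clog lookup; this is exactly the call that makes the idea
     * awkward for a process that is flushing dirty pages */
    static XactStatus
    lookup_xact_status(TransactionId xid)
    {
        (void) xid;
        return XACT_IN_PROGRESS;
    }

    /* simplified tuple header: just the inserting xid and the hint-bit mask */
    typedef struct TupleHeader
    {
        TransactionId   xmin;
        uint16_t        infomask;
    } TupleHeader;

    /* Walk the tuples of a page that is about to be written out and set the
     * xmin hint bits for transactions whose fate is already known. */
    static void
    set_hint_bits_before_write(TupleHeader *tuples, int ntuples)
    {
        for (int i = 0; i < ntuples; i++)
        {
            TupleHeader *tup = &tuples[i];

            if (tup->infomask & (HINT_XMIN_COMMITTED | HINT_XMIN_INVALID))
                continue;                   /* already hinted */

            switch (lookup_xact_status(tup->xmin))
            {
                case XACT_COMMITTED:
                    tup->infomask |= HINT_XMIN_COMMITTED;
                    break;
                case XACT_ABORTED:
                    tup->infomask |= HINT_XMIN_INVALID;
                    break;
                case XACT_IN_PROGRESS:
                    break;                  /* leave for a later visit */
            }
        }
    }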
Pavan Deolasee wrote:

> Another source of I/O is perhaps the CLOG read/writes for checking
> transaction status. If we are talking about large tables like accounts in
> pgbench or customer/stock in DBT2, the tables are vacuumed much later than
> the actual UPDATEs. I don't have any numbers to prove yet, but my sense is
> that CLOG pages holding the status of many of the transactions might have
> been already flushed out of the cache and require an I/O. Since the default
> CLOG SLRU buffers is set to 8, there could be severe CLOG SLRU thrashing
> during VACUUM as the transaction ids will be all random in a heap page.

8 log pages hold 8*8192*4=262144 transactions. If the active set of transactions is larger than that, the OS cache will probably hold more clog pages. I guess you could end up doing some I/O on clog on a vacuum of a big table, if you have a high transaction rate and vacuum infrequently...

> Would it help to set the status of the XMIN/XMAX of tuples early enough
> such that the heap page is still in the buffer cache, but late enough such
> that the XMIN/XMAX transactions are finished ? How about doing it when the
> bgwriter is about to write the page to disk ? Assuming few seconds of life
> of a heap page in the buffer cache, hopefully most of the XMIN/XMAX
> transactions should have completed and bgwriter can set XMIN(XMAX)_COMMITTED
> or XMIN(XMAX)_INVALID for most of the tuples in the page. This would save us
> CLOG I/Os later, either during subsequent access to the tuple and/or vacuum.

Yeah, we could do that. First I'd like to see some more evidence that clog thrashing is a problem, though.

--
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
"Pavan Deolasee" <pavan.deolasee@gmail.com> writes: > Would it help to set the status of the XMIN/XMAX of tuples early enough such > that the heap page is still in the buffer cache, but late enough such that > the XMIN/XMAX transactions are finished ? How about doing it when the > bgwriter is about to write the page to disk ? No. The bgwriter would then become subject to deadlocks because it would be needing to read in clog pages before it could flush out dirty pages. In any case, if the table is in active use then some passing backend has probably updated the bits already ... regards, tom lane
On 1/23/07, Tom Lane <tgl@sss.pgh.pa.us> wrote:
"Pavan Deolasee" <pavan.deolasee@gmail.com> writes:
> Would it help to set the status of the XMIN/XMAX of tuples early enough such
> that the heap page is still in the buffer cache, but late enough such that
> the XMIN/XMAX transactions are finished ? How about doing it when the
> bgwriter is about to write the page to disk ?
No. The bgwriter would then become subject to deadlocks because it
would be needing to read in clog pages before it could flush out
dirty pages. In any case, if the table is in active use then some
passing backend has probably updated the bits already ...
Well, let me collect some evidence. If we figure out that there is indeed a
CLOG buffer thrash at VACUUM time, I am sure we would be able to solve
the problem one way or the other.
IMHO this case would be more applicable to the very large tables where the
UPDATEd rows are not accessed again for a long time. And hence the hint bits
might not have been updated.
Thanks,
Pavan
--
EnterpriseDB http://www.enterprisedb.com
On 1/23/07, Heikki Linnakangas <heikki@enterprisedb.com> wrote:
Pavan Deolasee wrote:
> Another source of I/O is perhaps the CLOG read/writes for checking
> transaction status. If we are talking about large tables like accounts in
> pgbench or customer/stock in DBT2, the tables are vacuumed much later than
> the actual UPDATEs. I don't have any numbers to prove yet, but my sense is
> that CLOG pages holding the status of many of the transactions might have
> been already flushed out of the cache and require an I/O. Since the default
> CLOG SLRU buffers is set to 8, there could be severe CLOG SLRU thrashing
> during VACUUM as the transaction ids will be all random in a heap page.
8 log pages hold 8*8192*4=262144 transactions. If the active set of
transactions is larger than that, the OS cache will probably hold more
clog pages. I guess you could end up doing some I/O on clog on a vacuum
of a big table, if you have a high transaction rate and vacuum
infrequently...
On a typical desktop class 2 CPU Dell machine, we have seen pgbench
clocking more than 1500 tps. That implies CLOG would get filled up in less
than 262144/1500=174 seconds. VACUUM on accounts table takes much
longer to trigger.
> Would it help to set the status of the XMIN/XMAX of tuples early enough
> such
> that the heap page is still in the buffer cache, but late enough such that
> the XMIN/XMAX transactions are finished ?
Yeah, we could do that. First I'd like to see some more evidence that
clog thrashing is a problem, though.
Here are some numbers from a 4 hour DBT2 run with 270 warehouses and 50
connections.
2007-01-23 07:40:30 PST_17428 LOG: vacuuming "public.warehouse"
2007-01-23 07:40:30 PST_17428 LOG: CLOG r(1), w(0)- vacuum start
2007-01-23 07:40:30 PST_17428 LOG: CLOG r(1), w(0)- vacuum end
2007-01-23 07:40:30 PST_17428 LOG: "warehouse": found 1214 removable, 2275 nonremovable row versions in 111 pages
<snip>
2007-01-23 11:11:43 PST_30356 LOG: vacuuming "public.stock"
2007-01-23 11:11:43 PST_30356 LOG: CLOG r(545323), w(91)- vacuum start
2007-01-23 12:03:14 PST_30356 LOG: CLOG r(1181851), w(133) - vacuum end
2007-01-23 12:03:14 PST_30356 LOG: "stock": found 5645264 removable, 27003788 nonremovable row versions in 1554697 pages
I have just counted the number of read/write calls on the CLOG blocks. As you can see the total number of CLOG reads jumped from 545323 to 1181851 i.e.
1181851 - 545323 = 636528 CLOG block reads for 1554697 pages of stock table.
This figure is only indicative since it also includes the CLOG block reads which would
have happened as part of other backend operations (VACUUM took almost 55 minutes to
complete). Still in the first 210 minutes of the run, the total reads were only 545323. So
most of the 636528 reads in the next 55 minutes can be attributed to VACUUM.
The writes are very small though, maybe because most of the CLOG pages are accessed
read-only. A simple patch that I used to get these numbers is attached.
Thanks,
Pavan
"Pavan Deolasee" <pavan.deolasee@gmail.com> writes: > On a typical desktop class 2 CPU Dell machine, we have seen pgbench > clocking more than 1500 tps. Only if you had fsync off, or equivalently a disk drive that lies about write-complete. You could possibly achieve such rates in a non-broken configuration with a battery-backed write cache, but that's not "typical desktop" kit. In any case, you ignored Heikki's point that the PG shared memory pages holding CLOG are unlikely to be the sole level of caching, if the update rate is that high. The kernel will have some pages too. And even if we thought not, wouldn't bumping the size of the clog cache be a far simpler solution offering benefit for more things than just this? regards, tom lane
On 1/24/07, Tom Lane <tgl@sss.pgh.pa.us> wrote:
-- "Pavan Deolasee" <pavan.deolasee@gmail.com> writes:
> On a typical desktop class 2 CPU Dell machine, we have seen pgbench
> clocking more than 1500 tps.
Only if you had fsync off, or equivalently a disk drive that lies about
write-complete. You could possibly achieve such rates in a non-broken
configuration with a battery-backed write cache, but that's not "typical
desktop" kit.
Maybe I was too vague about the machine/test. It's probably not a
"typical desktop" machine since it has better storage: a two-disk
RAID 0 configuration for data, and a dedicated disk for xlog. I remember
running with 50 clients and a scaling factor of 50, 1 GB of shared buffers,
autovacuum turned on with default parameters, and the rest at the default
configuration. I don't think I had explicitly turned fsync off.
In any case, you ignored Heikki's point that the PG shared memory pages
holding CLOG are unlikely to be the sole level of caching, if the update
rate is that high. The kernel will have some pages too. And even if we
thought not, wouldn't bumping the size of the clog cache be a far
simpler solution offering benefit for more things than just this?
Yes. Maybe what Heikki said is true, but we don't know for sure.
Wouldn't bumping the cache size just delay the problem a bit?
Especially with an even larger table and a very high-end machine/storage
which can clock very high transactions per minute?
Anyway, if we agree that there is a problem, the solution could be
as simple as increasing the cache size, as you suggested.
Thanks,
Pavan
EnterpriseDB http://www.enterprisedb.com
Pavan Deolasee wrote:

> I have just counted the number of read/write calls on the CLOG blocks. As
> you can see the total number of CLOG reads jumped from 545323 to 1181851
> i.e. 1181851 - 545323 = 636528 CLOG block reads for 1554697 pages of stock
> table.

Hmm. So there is some activity there. Could you modify the patch to count how many of those reads came from the OS cache? I'm thinking of doing a gettimeofday() call before and after read, and counting how many calls finished in less than, say, 1 ms. Also, summing up the total time spent in reads would be interesting.

Or, would it be possible to put the clog on a different drive, and use iostat to get the numbers?

> This figure is only indicative since it also includes the CLOG block reads
> which would have happened as part of other backend operations (VACUUM took
> almost 55 minutes to complete). Still in the first 210 minutes of the run,
> the total reads were only 545323. So most of the 636528 reads in the next
> 55 minutes can be attributed to VACUUM.

Actually, clog reads during normal activity are even worse.

--
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
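The kind of instrumentation being asked for might look roughly like this (hypothetical counters and threshold, not the actual patch):

    #include <sys/types.h>
    #include <sys/time.h>
    #include <unistd.h>
    #include <stdint.h>

    /* Hypothetical counters, to be dumped at vacuum start/end like the
     * r()/w() figures above. */
    static uint64_t clog_reads = 0;
    static uint64_t clog_slow_reads = 0;        /* longer than the threshold  */
    static double   clog_read_secs = 0.0;       /* total time spent in read() */

    #define SLOW_READ_THRESHOLD_SECS 0.001      /* ~1 ms: likely a physical read */

    /* Drop-in replacement for the read() call in the slru physical-read path. */
    static ssize_t
    timed_read(int fd, void *buf, size_t count)
    {
        struct timeval before, after;
        ssize_t     nread;
        double      elapsed;

        gettimeofday(&before, NULL);
        nread = read(fd, buf, count);
        gettimeofday(&after, NULL);

        elapsed = (after.tv_sec - before.tv_sec) +
                  (after.tv_usec - before.tv_usec) / 1000000.0;

        clog_reads++;
        clog_read_secs += elapsed;
        if (elapsed > SLOW_READ_THRESHOLD_SECS)
            clog_slow_reads++;

        return nread;
    }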
On Wed, 2007-01-24 at 09:32 +0530, Pavan Deolasee wrote:

> On a typical desktop class 2 CPU Dell machine, we have seen pgbench
> clocking more than 1500 tps. That implies CLOG would get filled up in
> less than 262144/1500=174 seconds. VACUUM on accounts table takes much
> longer to trigger.

You assume that all of the top-level transactions have no subtransactions. On that test, subtransactions are in use because of the EXCEPTION clause in the PL/pgSQL used. That should at least double the number of Xids.

> So most of the 636528 reads in the next 55 minutes can be attributed to
> VACUUM.

A similar argument might also be applied to subtrans, so a similar investigation seems worthwhile. Subtrans has space for fewer Xids than clog, BTW. OTOH, I do think that 99% of that will not cause I/O.

--
Simon Riggs
EnterpriseDB   http://www.enterprisedb.com
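For reference, the capacity gap between clog and subtrans follows from the per-xid storage sizes, assuming 8 KB pages:

    clog:     2 bits per xid  -> 8192 * 4 = 32768 xids per page
              (so 8 buffers cover the 262144 figure used above)
    subtrans: 4 bytes per xid -> 8192 / 4 =  2048 xids per page

so the same number of buffers covers roughly 16 times fewer xids in pg_subtrans than in pg_clog.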
On 1/24/07, Heikki Linnakangas <heikki@enterprisedb.com> wrote:
Pavan Deolasee wrote:
> I have just counted the number of read/write calls on the CLOG blocks. As
> you can
> see the total number of CLOG reads jumped from 545323 to 1181851 i.e.
> 1181851 - 545323 = 636528 CLOG block reads for 1554697 pages of stock
> table.
Hmm. So there is some activity there. Could you modify the patch to
count how many of those reads came from OS cache? I'm thinking of doing
a gettimeofday() call before and after read, and counting how many
calls finished in less than say < 1 ms. Also, summing up the total time
spent in reads would be interesting.
Here are some more numbers. I ran two tests of 4 hour each with CLOG cache
size set to 8 blocks (default) and 16 blocks. I counted the number of read() calls
and specifically those read() calls which took more than 0.5 ms to complete.
As you guessed, almost 99% of the reads complete in less than 0.5 ms, but
the total read() time is still more than 1% of the duration of the test. Is it
worth optimizing ?
CLOG (16 blocks)
reads(743317), writes(84), reads > 0.5 ms (5171), time reads (186s), time reads > 0.5 ms(175s)
CLOG (8 blocks)
reads(1155917), writes(119), reads > 0.5 ms (4040), time reads (146s), time reads > 0.5 ms(130s)
(amused to see increase in the total read time with 16 blocks)
Also is it worth optimizing on the total read() system calls which might not cause physical I/O, but still consume CPU ?
Thanks,
Pavan
EnterpriseDB http://www.enterprisedb.com
Pavan Deolasee wrote:

> On 1/24/07, Heikki Linnakangas <heikki@enterprisedb.com> wrote:
>> Hmm. So there is some activity there. Could you modify the patch to
>> count how many of those reads came from the OS cache? I'm thinking of
>> doing a gettimeofday() call before and after read, and counting how many
>> calls finished in less than, say, 1 ms. Also, summing up the total time
>> spent in reads would be interesting.
>
> Here are some more numbers. I ran two tests of 4 hour each with CLOG cache
> size set to 8 blocks (default) and 16 blocks. I counted the number of read()
> calls and specifically those read() calls which took more than 0.5 ms to
> complete. As you guessed, almost 99% of the reads complete in less than
> 0.5 ms, but the total read() time is still more than 1% of the duration of
> the test. Is it worth optimizing ?

Probably not. I wouldn't trust that 1% of test duration figure too much, gettimeofday() has some overhead of its own...

> CLOG (16 blocks)
> reads(743317), writes(84), reads > 0.5 ms (5171), time reads (186s), time reads > 0.5 ms(175s)
>
> CLOG (8 blocks)
> reads(1155917), writes(119), reads > 0.5 ms (4040), time reads (146s), time reads > 0.5 ms(130s)
>
> (amused to see increase in the total read time with 16 blocks)

Hmm. That's surprising.

> Also is it worth optimizing on the total read() system calls which might
> not cause physical I/O, but still consume CPU ?

I don't think it's worth it, but now that we're talking about it: What I'd like to do to all the slru files is to replace the custom buffer management with mmapping the whole file, and letting the OS take care of it. We would get rid of some guc variables, the OS would tune the amount of memory used for clog/subtrans dynamically, and we would avoid the memory copying. And I'd like to do the same for WAL.

--
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
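As a rough illustration of the mmap idea, here is a minimal sketch that maps a single clog segment file read-only and lets the kernel cache it. Path handling, writes, and extending the file while it is mapped (the problem raised in the follow-up below) are all left out.

    #include <fcntl.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>
    #include <stddef.h>

    /* Sketch only: map one clog segment file read-only. */
    static const char *
    map_clog_segment(const char *path, size_t *len)
    {
        int         fd;
        struct stat st;
        void       *p;

        fd = open(path, O_RDONLY);
        if (fd < 0)
            return NULL;
        if (fstat(fd, &st) < 0)
        {
            close(fd);
            return NULL;
        }
        p = mmap(NULL, (size_t) st.st_size, PROT_READ, MAP_SHARED, fd, 0);
        close(fd);                      /* the mapping survives the close() */
        if (p == MAP_FAILED)
            return NULL;

        *len = (size_t) st.st_size;
        return p;
    }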
On 1/25/07, Heikki Linnakangas <heikki@enterprisedb.com> wrote:
Pavan Deolasee wrote:
>
> Also is it worth optimizing on the total read() system calls which might
> not
> cause physical I/O, but
> still consume CPU ?
I don't think it's worth it, but now that we're talking about it: What
I'd like to do to all the slru files is to replace the custom buffer
management with mmapping the whole file, and letting the OS take care of
it. We would get rid of some guc variables, the OS would tune the amount
of memory used for clog/subtrans dynamically, and we would avoid the
memory copying. And I'd like to do the same for WAL.
Yes, we can do that. One problem though is that mmapping wouldn't work when
the CLOG file is extended and some of the backends may not see the extended
portion. But maybe we can start with a sufficiently large initialized file
and mmap the whole file.
Another simpler solution for VACUUM would be to read the entire CLOG file
in local memory. Most of the transaction status queries can be satisfied from
this local copy and the normal CLOG is consulted only when the status is
unknown (TRANSACTION_STATUS_IN_PROGRESS)
Thanks,
Pavan
EnterpriseDB http://www.enterprisedb.com
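A minimal sketch of the local-copy lookup, assuming the buffer holds the clog contents read into backend-local memory at vacuum start and the usual encoding of two status bits per transaction, four transactions per byte; segment-file boundaries and everything else are glossed over.

    #include <stdint.h>

    typedef uint32_t TransactionId;

    /* The four possible two-bit states stored in clog. */
    #define XID_INPROGRESS   0x00
    #define XID_COMMITTED    0x01
    #define XID_ABORTED      0x02
    #define XID_SUBCOMMITTED 0x03

    /*
     * "buf" holds the clog contents read into backend-local memory at the
     * start of the vacuum.  A result of XID_INPROGRESS would send us back
     * to the shared clog.
     */
    static int
    local_clog_status(const uint8_t *buf, TransactionId xid)
    {
        uint32_t    byteno = xid / 4;            /* 4 transactions per byte */
        uint32_t    bshift = (xid % 4) * 2;      /* 2 status bits each      */

        return (buf[byteno] >> bshift) & 0x03;
    }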
Pavan Deolasee wrote:

> Another simpler solution for VACUUM would be to read the entire CLOG file
> in local memory. Most of the transaction status queries can be satisfied
> from this local copy and the normal CLOG is consulted only when the status
> is unknown (TRANSACTION_STATUS_IN_PROGRESS)

The clog is only for finished (committed/aborted/crashed) transactions. If a transaction is in progress, the clog is never consulted. Anyway, that'd only be reasonable for vacuums, and I'm actually more worried if we had normal backends thrashing the clog buffers.

--
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
Is there a TODO here?

---------------------------------------------------------------------------

Heikki Linnakangas wrote:
> Pavan Deolasee wrote:
>> Another simpler solution for VACUUM would be to read the entire CLOG file
>> in local memory. Most of the transaction status queries can be satisfied
>> from this local copy and the normal CLOG is consulted only when the status
>> is unknown (TRANSACTION_STATUS_IN_PROGRESS)
>
> The clog is only for finished (committed/aborted/crashed) transactions.
> If a transaction is in progress, the clog is never consulted. Anyway,
> that'd only be reasonable for vacuums, and I'm actually more worried if
> we had normal backends thrashing the clog buffers.

--
Bruce Momjian   bruce@momjian.us
EnterpriseDB    http://www.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +
I'd like to see still more evidence that it's a problem before we start changing that piece of code. It has served us well for years.

Bruce Momjian wrote:
> Is there a TODO here?
>
> Heikki Linnakangas wrote:
>> Pavan Deolasee wrote:
>>> Another simpler solution for VACUUM would be to read the entire CLOG file
>>> in local memory. Most of the transaction status queries can be satisfied
>>> from this local copy and the normal CLOG is consulted only when the status
>>> is unknown (TRANSACTION_STATUS_IN_PROGRESS)
>> The clog is only for finished (committed/aborted/crashed) transactions.
>> If a transaction is in progress, the clog is never consulted. Anyway,
>> that'd only be reasonable for vacuums, and I'm actually more worried if
>> we had normal backends thrashing the clog buffers.

--
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
Heikki Linnakangas wrote:

> I'd like to see still more evidence that it's a problem before we start
> changing that piece of code. It has served us well for years.

So the TODO could be "investigate whether caching pg_clog and/or pg_subtrans in local memory can be useful for vacuum performance".

> Bruce Momjian wrote:
>> Is there a TODO here?
>>
>> Heikki Linnakangas wrote:
>>> Pavan Deolasee wrote:
>>>> Another simpler solution for VACUUM would be to read the entire CLOG file
>>>> in local memory. Most of the transaction status queries can be satisfied
>>>> from this local copy and the normal CLOG is consulted only when the status
>>>> is unknown (TRANSACTION_STATUS_IN_PROGRESS)
>>> The clog is only for finished (committed/aborted/crashed) transactions.
>>> If a transaction is in progress, the clog is never consulted. Anyway,
>>> that'd only be reasonable for vacuums, and I'm actually more worried if
>>> we had normal backends thrashing the clog buffers.

--
Alvaro Herrera                                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.
On 1/26/07, Alvaro Herrera <alvherre@commandprompt.com> wrote:
Heikki Linnakangas wrote:
> I'd like to see still more evidence that it's a problem before we start
> changing that piece of code. It has served us well for years.
So the TODO could be "investigate whether caching pg_clog and/or
pg_subtrans in local memory can be useful for vacuum performance".
As Heikki suggested, we should also investigate the same for normal
backends as well.
It would also be interesting to investigate whether early setting of hint bits
can reduce subsequent writes of blocks. A typical case would be a large table
being updated heavily for a while, followed by SELECT queries. The SELECT
queries would set hint bits for the previously UPDATEd tuples (old and new
versions) and thus cause subsequent writes of those blocks for what could
have been read-only queries.
Thanks,
Pavan
EnterpriseDB http://www.enterprisedb.com
Pavan Deolasee wrote:

> On 1/26/07, Alvaro Herrera <alvherre@commandprompt.com> wrote:
>> Heikki Linnakangas wrote:
>>> I'd like to see still more evidence that it's a problem before we start
>>> changing that piece of code. It has served us well for years.
>>
>> So the TODO could be "investigate whether caching pg_clog and/or
>> pg_subtrans in local memory can be useful for vacuum performance".
>
> As Heikki suggested, we should also investigate the same for normal
> backends as well.

Maybe. An idea that comes to mind is to never cache the latest page, since it'll most likely result in extra reads anyway because there'll be a lot of IN_PROGRESS transactions. Problem to solve: how much memory to dedicate to this? Could we mmap() portions of the pg_clog segment, so that the page could be shared across backends instead of allocating them for each?

> It would also be interesting to investigate whether early setting of
> hint bits can reduce subsequent writes of blocks. A typical case would
> be a large table being updated heavily for a while, followed by SELECT
> queries. The SELECT queries would set hint bits for the previously
> UPDATEd tuples (old and new versions) and thus cause subsequent
> writes of those blocks for what could have been read-only queries.

This has been suggested before, but I don't see how this could work. How does the UPDATE transaction go back to the pages it wrote to update the hint bits, _after_ it committed?

Maybe have the bgwriter update hint bits as it evicts pages out of the cache? It could result in pg_clog read traffic for each page that needs eviction; not such a hot idea. I don't see how this is related to the above proposal though.

--
Alvaro Herrera                                http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
On 1/26/07, Alvaro Herrera <alvherre@commandprompt.com> wrote:
Maybe have the bgwriter update hint bits as it evicts pages out of the
cache? It could result in pg_clog read traffic for each page that needs
eviction; not such a hot idea.
I thought that once we enhance clog so that there are no clog reads, the
bgwriter would be able to update hint bits without getting into any deadlock
with pg_clog reads.

Maybe we can have this as a separate TODO.
Thanks,
Pavan
EnterpriseDB http://www.enterprisedb.com
Heikki Linnakangas <heikki@enterprisedb.com> writes:

> I'd like to see still more evidence that it's a problem before we start
> changing that piece of code. It has served us well for years.

What I see here is mostly evidence suggesting that we should consider raising NUM_CLOG_BUFFERS, rather than anything more invasive.

			regards, tom lane
Alvaro Herrera <alvherre@commandprompt.com> writes:

> Pavan Deolasee wrote:
>> It would also be interesting to investigate whether early setting of
>> hint bits can reduce subsequent writes of blocks. A typical case would
>> be a large table being updated heavily for a while, followed by SELECT
>> queries. The SELECT queries would set hint bits for the previously
>> UPDATEd tuples (old and new versions) and thus cause subsequent
>> writes of those blocks for what could have been read-only queries.
>
> This has been suggested before, but I don't see how this could work.
> How does the UPDATE transaction go back to the pages it wrote to update
> the hint bits, _after_ it committed?

I think what he's suggesting is deliberately not updating the hint bits during a SELECT ... but that's surely misguided. If you don't set the hint bit after discovering the transaction commit state, then the next visitor of the tuple will have to repeat the clog lookup, meaning that any such policy greatly increases clog read traffic and contention.

			regards, tom lane
On 1/26/07, Tom Lane <tgl@sss.pgh.pa.us> wrote:
I think what he's suggesting is deliberately not updating the hint bits
during a SELECT ...
No, I was suggesting doing it in bgwriter so that we may not need to do that
during a SELECT. Of course, we need to investigate more and have numbers to
prove the need. Also you have already expressed concerns that doing so in
bgwriter is deadlock-prone. So there is certainly more work needed for any
such scheme to work.
Thanks,
Pavan
--
EnterpriseDB http://www.enterprisedb.com
Tom Lane wrote:

> Heikki Linnakangas <heikki@enterprisedb.com> writes:
>> I'd like to see still more evidence that it's a problem before we start
>> changing that piece of code. It has served us well for years.
>
> What I see here is mostly evidence suggesting that we should consider
> raising NUM_CLOG_BUFFERS, rather than anything more invasive.

Added to TODO:

* Consider increasing NUM_CLOG_BUFFERS

--
Bruce Momjian   bruce@momjian.us
EnterpriseDB    http://www.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +
Pavan Deolasee wrote:

> On 1/26/07, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> I think what he's suggesting is deliberately not updating the hint bits
>> during a SELECT ...
>
> No, I was suggesting doing it in bgwriter so that we may not need to do that
> during a SELECT. Of course, we need to investigate more and have numbers to
> prove the need. Also you have already expressed concerns that doing so in
> bgwriter is deadlock-prone. So there is certainly more work needed for any
> such scheme to work.

Added to TODO:

* Consider having the background writer update the transaction status hint bits before writing out the page

--
Bruce Momjian   bruce@momjian.us
EnterpriseDB    http://www.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +