Thread: Piggybacking vacuum I/O

Piggybacking vacuum I/O

From
Heikki Linnakangas
Date:
I've been looking at the way we do vacuums.

The fundamental performance issue is that a vacuum generates 
nheapblocks+nindexblocks+ndirtyblocks I/Os. Vacuum cost delay helps to 
spread the cost like part payment, but the total is the same. In an I/O 
bound system, the extra I/O directly leads to less throughput.

Therefore, we need to do less I/O. Dead space map helps by allowing us 
to skip blocks that don't need vacuuming, reducing the # of I/Os to 
2*ndirtyblocks+nindexblocks. That's great, but it doesn't help us if the 
dead tuples are spread uniformly.

If we could piggyback the vacuum I/Os to the I/Os that we're doing 
anyway, vacuum wouldn't ideally have to issue any I/O of its own. I've 
tried to figure out a way to do that.

Vacuum is done in 3 phases:

1. Scan heap
2. Vacuum index
3. Vacuum heap

Instead of doing a sequential scan, we could perform the 1st phase by 
watching the buffer pool, scanning blocks for dead tuples when they're 
in memory and keeping track of which pages we've seen. When all pages 
have been seen, the tid list is sorted and 1st phase is done.
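
Very roughly, the bookkeeping for the 1st phase could look something like the 
toy sketch below. All the type and function names are invented just to 
illustrate the idea; the real thing would hang off the buffer manager and use 
the normal heap routines to find the dead line pointers on a page.

#include <stdbool.h>
#include <stdlib.h>

typedef struct { unsigned blkno; unsigned offnum; } TidEntry;

typedef struct
{
    bool     *page_seen;    /* one flag per heap block */
    unsigned  nblocks;
    unsigned  nseen;
    TidEntry *dead;         /* dead-tuple TIDs collected so far */
    unsigned  ndead;
    unsigned  maxdead;
} PiggybackScan;

/* Called whenever a page of the target relation is found resident in
 * the buffer pool; dead_offsets are the dead line pointers on it.
 * If the list fills up, a real implementation would have to spill or
 * run the index/heap passes on the batch collected so far. */
static void
piggyback_note_page(PiggybackScan *scan, unsigned blkno,
                    const unsigned *dead_offsets, unsigned ndead_on_page)
{
    unsigned i;

    if (scan->page_seen[blkno])
        return;                 /* already harvested this block */

    for (i = 0; i < ndead_on_page && scan->ndead < scan->maxdead; i++)
    {
        scan->dead[scan->ndead].blkno = blkno;
        scan->dead[scan->ndead].offnum = dead_offsets[i];
        scan->ndead++;
    }

    scan->page_seen[blkno] = true;
    scan->nseen++;
}

static int
tid_cmp(const void *a, const void *b)
{
    const TidEntry *ta = a;
    const TidEntry *tb = b;

    if (ta->blkno != tb->blkno)
        return ta->blkno < tb->blkno ? -1 : 1;
    return (int) ta->offnum - (int) tb->offnum;
}

/* The 1st phase is done once every block has been seen; sort the list. */
static bool
piggyback_phase1_done(PiggybackScan *scan)
{
    if (scan->nseen < scan->nblocks)
        return false;
    qsort(scan->dead, scan->ndead, sizeof(TidEntry), tid_cmp);
    return true;
}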

In theory, the index vacuum could also be done that way, but let's 
assume for now that indexes would be scanned like they are currently.

The 3rd phase can be performed similarly to the 1st phase. Whenever a 
page enters the buffer pool, we check the tid list and remove any 
matching tuples from the page. When the list is empty, vacuum is complete.
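
Again only as a sketch, with the same caveats as above, the 3rd-phase hook 
could be shaped like this:

#include <stdbool.h>

typedef struct { unsigned blkno; unsigned offnum; bool done; } DeadTid;

typedef struct
{
    DeadTid  *tids;         /* sorted dead TIDs from the 1st phase */
    unsigned  ntids;
    unsigned  nremaining;   /* TIDs not yet removed from the heap */
} PiggybackHeapVacuum;

/* Called when a page of the relation shows up in the buffer pool.
 * remove_tuple() stands in for whatever actually prunes the item; a
 * real implementation would binary-search the sorted list instead of
 * scanning it. */
static void
piggyback_vacuum_page(PiggybackHeapVacuum *vac, unsigned blkno,
                      void (*remove_tuple)(unsigned blkno, unsigned offnum))
{
    unsigned i;

    for (i = 0; i < vac->ntids; i++)
    {
        if (vac->tids[i].blkno != blkno || vac->tids[i].done)
            continue;
        remove_tuple(blkno, vac->tids[i].offnum);
        vac->tids[i].done = true;
        vac->nremaining--;
    }
}

/* The 3rd phase, and the whole vacuum, is complete when nothing remains. */
static bool
piggyback_vacuum_done(const PiggybackHeapVacuum *vac)
{
    return vac->nremaining == 0;
}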

Of course, there are some issues in the design as described above. For 
example, the vacuum might take a long time if there are cold spots in the 
table. In fact, a block full of dead tuples might never be visited again.

A variation of the scheme would be to keep scanning pages that are in 
cache, until the tid list reaches a predefined size, instead of keeping 
track of which pages have already been seen. That would deal better with 
tables with hot and cold spots, but it couldn't advance relfrozenxid 
because there would be no guarantee that all pages are visited. Also, we 
could start 1st phase of the next vacuum, while we're still in the 3rd 
phase of previous one.

Also, after we've seen 95% of the pages or a timeout expires, we could 
fetch the rest of them with random I/O to let the vacuum finish.

I'm not sure how exactly this would be implemented. Perhaps bgwriter or 
autovacuum would do it, or a new background process. Presumably the 
process would need access to relcache.

One issue is that if we're trying to vacuum every table simultaneously 
this way, we'll need more memory overall for the tid lists. I'm hoping 
there's a way to implement this without keeping the tid lists in shared 
memory, because that would make memory management a nightmare. Also, we'd 
need changes to the bufmgr API to support this.

This would work nicely with the DSM. The list of pages that need to be 
visited in phase 1 could be initialized from the DSM, largely avoiding 
the problem with cold spots.

Any thoughts before I start experimenting?

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: Piggybacking vacuum I/O

From
"Simon Riggs"
Date:
On Mon, 2007-01-22 at 13:41 +0000, Heikki Linnakangas wrote:
> Any thoughts before I start experimenting?

Probably only to detail the various use cases we are discussing.

My thoughts on various use cases are:

- small table with frequent update/delete, heap and indexes all/mostly
cached
e.g. Counter tables, DBT2: District/Warehouse TPC-C, pgbench:
Branches/Tellers
Current VACUUM works well in this situation, since the only I/O incurred
is the WAL written for the VACUUM. VACUUM is very cheap even if not in
cache because of sequential I/O. Keeping track of whether there are hot
spots in these tables seems like a waste of cycles and could potentially
introduce contention and hence reduce performance. These need to be very
frequently VACUUMed, even when other VACUUMs are required.
My current view: just need multiple concurrent autovacuum processes.

- large table with severe hotspots
e.g. DBT2: NewOrder, larger queue-style tables
The hotspots are likely to be in cache and the not-so-hotspots might or
might not be in cache, but we don't care either way. DSM concept works
well for this case, since we are able to avoid lots of I/O by
appropriate book-keeping. Works well for removing rows after a file-scan
DELETE, as well as for DELETE or UPDATE hot spots. 
My current view: DSM would be great for this

- large table with few hotspots
e.g. DBT2: Stock, pgbench: Accounts, most Customer tables
Current VACUUM works very badly in this case, since updates are sparsely
distributed across the table. DSM wouldn't help either, unless we
differentiate between few/many updates to a block. 
My current view: Piggyback concept seems on the right track, but clearly
needs further thought.

Currently we have only one technique for garbage collection, plus one
process to perform it. We need multiple techniques executed by multiple
processes, when required, plus some way of automatically selecting which
is appropriate depending upon the use case. Yes, automatic :-)

DSM and this piggyback idea need not be thought of as competing
techniques.

--  Simon Riggs              EnterpriseDB   http://www.enterprisedb.com




Re: Piggybacking vacuum I/O

From
"Jim C. Nasby"
Date:
On Mon, Jan 22, 2007 at 02:51:47PM +0000, Heikki Linnakangas wrote:
> I've been looking at the way we do vacuums.
> 
> The fundamental performance issue is that a vacuum generates 
> nheapblocks+nindexblocks+ndirtyblocks I/Os. Vacuum cost delay helps to 
> spread the cost like part payment, but the total is the same. In an I/O 
> bound system, the extra I/O directly leads to less throughput.
> 
> Therefore, we need to do less I/O. Dead space map helps by allowing us 
> to skip blocks that don't need vacuuming, reducing the # of I/Os to 
> 2*ndirtyblocks+nindexblocks. That's great, but it doesn't help us if the 
> dead tuples are spread uniformly.
> 
> If we could piggyback the vacuum I/Os to the I/Os that we're doing 
> anyway, vacuum wouldn't ideally have to issue any I/O of its own. I've 
> tried to figure out a way to do that.
> 
> Vacuum is done in 3 phases:
> 
> 1. Scan heap
> 2. Vacuum index
> 3. Vacuum heap


> Instead of doing a sequential scan, we could perform the 1st phase by 
> watching the buffer pool, scanning blocks for dead tuples when they're 
> in memory and keeping track of which pages we've seen. When all pages 
> have been seen, the tid list is sorted and 1st phase is done.
> 
> In theory, the index vacuum could also be done that way, but let's 
> assume for now that indexes would be scanned like they are currently.
> 
> The 3rd phase can be performed similarly to the 1st phase. Whenever a 
> page enters the buffer pool, we check the tid list and remove any 
> matching tuples from the page. When the list is empty, vacuum is complete.

Is there any real reason to demarcate the start and end of a vacuum? Why
not just make it a continuous process? One possibility is to keep a list
of TIDs for each phase, though that could prove tricky with multiple
indexes.

> A variation of the scheme would be to keep scanning pages that are in 
> cache, until the tid list reaches a predefined size, instead of keeping 
> track of which pages have already been seen. That would deal better with 
> tables with hot and cold spots, but it couldn't advance the relfrozenid 
> because there would be no guarantee that all pages are visited. Also, we 
> could start 1st phase of the next vacuum, while we're still in the 3rd 
> phase of previous one.
What if we tracked freeze status on a per-page basis? Perhaps track the
minimum XID that's on each page. That would allow us to ensure that we
freeze pages that are approaching XID wrap.
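
Something like the following toy structure is what I have in mind; the names
are invented, and a real version would of course use wraparound-aware XID
comparisons:

#include <stdbool.h>
#include <stdint.h>

typedef struct
{
    uint32_t *min_xid;      /* lowest XID seen on each heap block; 0 = unknown */
    unsigned  nblocks;
} PageFreezeMap;

/* Update the map whenever a tuple with this xmin lands on blkno. */
static void
freeze_map_note_xid(PageFreezeMap *map, unsigned blkno, uint32_t xmin)
{
    if (map->min_xid[blkno] == 0 || xmin < map->min_xid[blkno])
        map->min_xid[blkno] = xmin;
}

/* Does this page need freezing before the cutoff overtakes it?
 * (The plain '<' here is a toy; real XID comparison handles wrap.) */
static bool
freeze_map_page_needs_freeze(const PageFreezeMap *map, unsigned blkno,
                             uint32_t freeze_cutoff_xid)
{
    return map->min_xid[blkno] != 0 &&
           map->min_xid[blkno] < freeze_cutoff_xid;
}
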
-- 
Jim Nasby                                            jim@nasby.net
EnterpriseDB      http://enterprisedb.com      512.569.9461 (cell)


Re: Piggybacking vacuum I/O

From
ITAGAKI Takahiro
Date:
Heikki Linnakangas <heikki@enterprisedb.com> wrote:

> Vacuum is done in 3 phases:
> 1. Scan heap
> 2. Vacuum index
> 3. Vacuum heap

> A variation of the scheme would be to keep scanning pages that are in 
> cache, until the tid list reaches a predefined size, instead of keeping 
> track of which pages have already been seen. That would deal better with 
> tables with hot and cold spots, but it couldn't advance the relfrozenid 
> because there would be no guarantee that all pages are visited. Also, we 
> could start 1st phase of the next vacuum, while we're still in the 3rd 
> phase of previous one.

ISTM it is another DSM with tuple-level accuracy rather than page-level
accuracy. One of the benefits is that we can skip the 1st phase of vacuum;
we would have a TID list of dead tuples at the start of vacuum, so we can
start from the 2nd phase.

I have another idea for the use of TID lists -- store the TIDs after the 1st
or 2nd phase, and exit the vacuum. At the next vacuum, we would do both
the previous 3rd phase and the new 1st phase at once, so that the I/Os are
reduced to (ndirtyblocks + nindexblocks) from (2*ndirtyblocks + nindexblocks)
on average. We already use a similar method in vacuuming btree indexes
to collect recyclable empty pages.
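
Just to illustrate the shape of the combined pass (invented names, not a
real patch):

typedef struct { unsigned blkno; unsigned offnum; } Tid;

typedef struct
{
    const Tid *saved;        /* dead TIDs stored by the previous vacuum */
    unsigned   nsaved;
    Tid       *collected;    /* dead TIDs found for the current vacuum */
    unsigned   ncollected;
    unsigned   maxcollected;
} CombinedPass;

/* One visit to a heap block does the previous vacuum's deferred 3rd
 * phase and the current vacuum's 1st phase in the same read, so each
 * dirty block is read once instead of twice. */
static void
combined_pass_block(CombinedPass *cp, unsigned blkno,
                    void (*remove_tuple)(unsigned blkno, unsigned offnum),
                    const unsigned *new_dead_offsets, unsigned nnew)
{
    unsigned i;

    /* previous vacuum's 3rd phase for this block */
    for (i = 0; i < cp->nsaved; i++)
        if (cp->saved[i].blkno == blkno)
            remove_tuple(blkno, cp->saved[i].offnum);

    /* current vacuum's 1st phase for this block */
    for (i = 0; i < nnew && cp->ncollected < cp->maxcollected; i++)
    {
        cp->collected[cp->ncollected].blkno = blkno;
        cp->collected[cp->ncollected].offnum = new_dead_offsets[i];
        cp->ncollected++;
    }
}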

I think piggybacking of I/Os is very useful. The buffer manager helps us
fold up some of the I/Os, but explicit requests are more effective.

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center




Re: Piggybacking vacuum I/O

From
"Pavan Deolasee"
Date:


On 1/22/07, Heikki Linnakangas <heikki@enterprisedb.com> wrote:
I've been looking at the way we do vacuums.

The fundamental performance issue is that a vacuum generates
nheapblocks+nindexblocks+ndirtyblocks I/Os. Vacuum cost delay helps to
spread the cost like part payment, but the total is the same. In an I/O
bound system, the extra I/O directly leads to less throughput.


Another source of I/O is perhaps the CLOG reads/writes for checking transaction status. If we are talking about large tables like accounts in pgbench or customer/stock in DBT2, the tables are vacuumed much later than the actual UPDATEs. I don't have any numbers to prove it yet, but my sense is that the CLOG pages holding the status of many of those transactions might already have been flushed out of the cache and would require an I/O. Since the default number of CLOG SLRU buffers is 8, there could be severe CLOG SLRU thrashing during VACUUM, as the transaction ids on a heap page will be effectively random.

Would it help to set the status of the XMIN/XMAX of tuples early enough such that the heap page is still in the buffer cache, but late enough such that the XMIN/XMAX transactions are finished? How about doing it when the bgwriter is about to write the page to disk? Assuming a few seconds of life for a heap page in the buffer cache, hopefully most of the XMIN/XMAX transactions will have completed, and the bgwriter can set XMIN(XMAX)_COMMITTED or XMIN(XMAX)_INVALID for most of the tuples in the page. This would save us CLOG I/Os later, either during subsequent access to the tuple and/or during vacuum.
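
To illustrate what I mean, here is a stripped-down sketch of the kind of pre-write pass the bgwriter could make. All the names are invented, and xid_status() merely stands in for the clog lookup:

#include <stdbool.h>
#include <stdint.h>

typedef enum { XS_IN_PROGRESS, XS_COMMITTED, XS_ABORTED } XidStatus;

typedef struct
{
    uint32_t xmin;
    uint32_t xmax;              /* 0 if the tuple was never deleted/updated */
    bool     xmin_committed, xmin_invalid;
    bool     xmax_committed, xmax_invalid;
} ToyTuple;

/* Walk the tuples of a dirty heap page just before it is written out,
 * filling in hint bits for transactions that have already finished. */
static void
set_hint_bits_before_write(ToyTuple *tuples, unsigned ntuples,
                           XidStatus (*xid_status)(uint32_t xid))
{
    unsigned i;

    for (i = 0; i < ntuples; i++)
    {
        ToyTuple *tup = &tuples[i];

        if (!tup->xmin_committed && !tup->xmin_invalid)
        {
            XidStatus s = xid_status(tup->xmin);

            if (s == XS_COMMITTED)
                tup->xmin_committed = true;
            else if (s == XS_ABORTED)
                tup->xmin_invalid = true;
            /* still in progress: leave it for a later visitor */
        }

        if (tup->xmax != 0 && !tup->xmax_committed && !tup->xmax_invalid)
        {
            XidStatus s = xid_status(tup->xmax);

            if (s == XS_COMMITTED)
                tup->xmax_committed = true;
            else if (s == XS_ABORTED)
                tup->xmax_invalid = true;
        }
    }
}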

Any thoughts?

Thanks,
Pavan

EnterpriseDB     http://www.enterprisedb.com

Re: Piggybacking vacuum I/O

From
Heikki Linnakangas
Date:
Pavan Deolasee wrote:
> Another source of I/O is perhaps the CLOG read/writes for checking
> transaction status. If we are talking about large tables like accounts in
> pgbench or customer/stock in DBT2, the tables are vacuumed much later than
> the actual UPDATEs. I don't have any numbers to prove yet, but my sense is
> that CLOG pages holding the status of many of the transactions might have
> been already flushed out of the cache and require an I/O. Since the default
> CLOG SLRU buffers is set to 8, there could be severe CLOG SLRU thrashing
> during VACUUM as the transaction ids will be all random in a heap page.

8 log pages hold 8*8192*4=262144 transactions. If the active set of 
transactions is larger than that, the OS cache will probably hold more 
clog pages. I guess you could end up doing some I/O on clog on a vacuum 
of a big table, if you have a high transaction rate and vacuum 
infrequently...

> Would it help to set the status of the XMIN/XMAX of tuples early enough 
> such
> that the heap page is still in the buffer cache, but late enough such that
> the XMIN/XMAX transactions are finished ? How about doing it when the
> bgwriter is about to write the page to disk ? Assuming few seconds of life
> of a heap page in the buffer cache, hopefully most of the XMIN/XMAX
> transactions should have completed and bgwriter can set 
> XMIN(XMAX)_COMMITTED
> or XMIN(XMAX)_INVALID for most of the tuples in the page. This would 
> save us
> CLOG I/Os  later, either during subsequent access to the tuple and/or
> vacuum.

Yeah, we could do that. First I'd like to see some more evidence that 
clog thrashing is a problem, though.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: Piggybacking vacuum I/O

From
Tom Lane
Date:
"Pavan Deolasee" <pavan.deolasee@gmail.com> writes:
> Would it help to set the status of the XMIN/XMAX of tuples early enough such
> that the heap page is still in the buffer cache, but late enough such that
> the XMIN/XMAX transactions are finished ? How about doing it when the
> bgwriter is about to write the page to disk ?

No.  The bgwriter would then become subject to deadlocks because it
would be needing to read in clog pages before it could flush out
dirty pages.  In any case, if the table is in active use then some
passing backend has probably updated the bits already ...
        regards, tom lane


Re: Piggybacking vacuum I/O

From
"Pavan Deolasee"
Date:

On 1/23/07, Tom Lane <tgl@sss.pgh.pa.us> wrote:
"Pavan Deolasee" <pavan.deolasee@gmail.com> writes:
> Would it help to set the status of the XMIN/XMAX of tuples early enough such
> that the heap page is still in the buffer cache, but late enough such that
> the XMIN/XMAX transactions are finished ? How about doing it when the
> bgwriter is about to write the page to disk ?

No.  The bgwriter would then become subject to deadlocks because it
would be needing to read in clog pages before it could flush out
dirty pages.  In any case, if the table is in active use then some
passing backend has probably updated the bits already ...

Well, let me collect some evidence. If we figure out that there is indeed
CLOG buffer thrashing at VACUUM time, I am sure we would be able to solve
the problem one way or the other.

IMHO this case would be more applicable to very large tables where the
UPDATEd rows are not accessed again for a long time, and hence the hint
bits might not have been set.

Thanks,
Pavan




--

EnterpriseDB     http://www.enterprisedb.com

Re: Piggybacking vacuum I/O

From
"Pavan Deolasee"
Date:

On 1/23/07, Heikki Linnakangas <heikki@enterprisedb.com> wrote:
Pavan Deolasee wrote:
> Another source of I/O is perhaps the CLOG read/writes for checking
> transaction status. If we are talking about large tables like accounts in
> pgbench or customer/stock in DBT2, the tables are vacuumed much later than
> the actual UPDATEs. I don't have any numbers to prove yet, but my sense is
> that CLOG pages holding the status of many of the transactions might have
> been already flushed out of the cache and require an I/O. Since the default
> CLOG SLRU buffers is set to 8, there could be severe CLOG SLRU thrashing
> during VACUUM as the transaction ids will be all random in a heap page.

8 log pages hold 8*8192*4=262144 transactions. If the active set of
transactions is larger than that, the OS cache will probably hold more
clog pages. I guess you could end up doing some I/O on clog on a vacuum
of a big table, if you have a high transaction rate and vacuum
infrequently...

On a typical desktop class 2 CPU Dell machine, we have seen pgbench
clocking more than 1500 tps. That implies CLOG would get filled up in less
than 262144/1500=174 seconds. VACUUM on accounts table takes much
longer to trigger.

> Would it help to set the status of the XMIN/XMAX of tuples early enough
> such
> that the heap page is still in the buffer cache, but late enough such that
> the XMIN/XMAX transactions are finished ?
Yeah, we could do that. First I'd like to see some more evidence that
clog trashing is a problem, though.

Here are some numbers from a 4 hour DBT2 run with 270 warehouses and 50
connections.
2007-01-23 07:40:30 PST_17428 LOG:  vacuuming "public.warehouse"
2007-01-23 07:40:30 PST_17428 LOG: CLOG r(1), w(0)- vacuum start
2007-01-23 07:40:30 PST_17428 LOG: CLOG r(1), w(0)- vacuum end
2007-01-23 07:40:30 PST_17428 LOG: "warehouse": found 1214 removable, 2275 nonremovable row versions in 111 pages

<snip>

2007-01-23 11:11:43 PST_30356 LOG: vacuuming "public.stock "
2007-01-23 11:11:43 PST_30356 LOG: CLOG r(545323), w(91)- vacuum start
2007-01-23 12:03:14 PST_30356 LOG: CLOG r(1181851), w(133) - vacuum end
2007-01-23 12:03:14 PST_30356 LOG: "stock": found 5645264 removable, 27003788 nonremovable row versions in 1554697 pages

I have just counted the number of read/write calls on the CLOG blocks. As you can
see, the total number of CLOG reads jumped from 545323 to 1181851, i.e.
1181851 - 545323 = 636528 CLOG block reads for the 1554697 pages of the stock table.
This figure is only indicative, since it also includes the CLOG block reads which would
have happened as part of other backend operations (VACUUM took almost 55 minutes to
complete). Still, in the first 210 minutes of the run the total reads were only 545323, so
most of the 636528 reads in the next 55 minutes can be attributed to VACUUM.

The number of writes is very small though, maybe because most of the CLOG pages are accessed
read-only. A simple patch that I used to get these numbers is attached.

Thanks,
Pavan

EnterpriseDB     http://www.enterprisedb.com

Re: Piggybacking vacuum I/O

From
Tom Lane
Date:
"Pavan Deolasee" <pavan.deolasee@gmail.com> writes:
> On a typical desktop class 2 CPU Dell machine, we have seen pgbench
> clocking more than 1500 tps.

Only if you had fsync off, or equivalently a disk drive that lies about
write-complete.  You could possibly achieve such rates in a non-broken
configuration with a battery-backed write cache, but that's not "typical
desktop" kit.

In any case, you ignored Heikki's point that the PG shared memory pages
holding CLOG are unlikely to be the sole level of caching, if the update
rate is that high.  The kernel will have some pages too.  And even if we
thought not, wouldn't bumping the size of the clog cache be a far
simpler solution offering benefit for more things than just this?
        regards, tom lane


Re: Piggybacking vacuum I/O

From
"Pavan Deolasee"
Date:

On 1/24/07, Tom Lane <tgl@sss.pgh.pa.us> wrote:
"Pavan Deolasee" <pavan.deolasee@gmail.com> writes:
> On a typical desktop class 2 CPU Dell machine, we have seen pgbench
> clocking more than 1500 tps.

Only if you had fsync off, or equivalently a disk drive that lies about
write-complete.  You could possibly achieve such rates in a non-broken
configuration with a battery-backed write cache, but that's not "typical
desktop" kit.

Maybe I was too vague about the machine/test. It's probably not a
"typical desktop" machine, since it has better storage: a two-disk
RAID 0 configuration for data, and a dedicated disk for xlog. I remember
running with 50 clients and a scaling factor of 50, 1 GB shared_buffers,
autovacuum turned on with default parameters, and the rest of the
configuration left at its defaults. I don't think I had explicitly turned fsync off.
 
In any case, you ignored Heikki's point that the PG shared memory pages
holding CLOG are unlikely to be the sole level of caching, if the update
rate is that high.  The kernel will have some pages too.  And even if we
thought not, wouldn't bumping the size of the clog cache be a far
simpler solution offering benefit for more things than just this?

Yes, maybe what Heikki said is true, but we don't know for sure.
Wouldn't bumping the cache size just delay the problem a bit?
Especially with an even larger table and a very high-end machine/storage
setup which can clock very high transactions per minute?

Anyway, if we agree that there is a problem, the solution could be
as simple as increasing the cache size, as you suggested.

Thanks,
Pavan

--

EnterpriseDB     http://www.enterprisedb.com

Re: Piggybacking vacuum I/O

From
Heikki Linnakangas
Date:
Pavan Deolasee wrote:
> I have just counted the number of read/write calls on the CLOG blocks. As
> you can
> see the total number of CLOG reads jumped from 545323 to 1181851 i.e.
> 1181851 - 545323 = 636528 CLOG block reads for 1554697 pages of stock 
> table.

Hmm. So there is some activity there. Could you modify the patch to 
count how many of those reads came from the OS cache? I'm thinking of doing 
a gettimeofday() call before and after the read, and counting how many 
calls finish in, say, less than 1 ms. Also, summing up the total time 
spent in reads would be interesting.
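
Something along these lines, just to be concrete (illustrative only, not 
meant as a patch):

#include <sys/time.h>
#include <unistd.h>

static long long total_read_usec = 0;
static long      reads_total = 0;
static long      reads_slow = 0;    /* likely real disk I/O */

static ssize_t
timed_read(int fd, void *buf, size_t len)
{
    struct timeval before, after;
    ssize_t        rc;
    long long      usec;

    gettimeofday(&before, NULL);
    rc = read(fd, buf, len);
    gettimeofday(&after, NULL);

    usec = (after.tv_sec - before.tv_sec) * 1000000LL +
           (after.tv_usec - before.tv_usec);

    reads_total++;
    total_read_usec += usec;
    if (usec >= 1000)           /* >= 1 ms: probably hit the disk */
        reads_slow++;

    return rc;
}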

Or, would it be possible to put the clog on a different drive, and use 
iostat to get the numbers?

> This figure is only indicative since it also includes the CLOG block reads
> which would
> have happened as part of other backend operations (VACUUM took almost 55
> minutes to
> complete). Still in the first 210 minutes of the run, the total reads were
> only 545323. So
> most of the 636528 reads in the next 55 minutes can be attributed to 
> VACUUM.

Actually, clog reads during normal activity are even worse.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: Piggybacking vacuum I/O

From
"Simon Riggs"
Date:
On Wed, 2007-01-24 at 09:32 +0530, Pavan Deolasee wrote:

> On a typical desktop class 2 CPU Dell machine, we have seen pgbench
> clocking more than 1500 tps. That implies CLOG would get filled up in
> less
> than 262144/1500=174 seconds. VACUUM on accounts table takes much 
> longer to trigger.

You assume that all of the top level transactions have no
subtransactions. On that test, subtransactions are in use because of the
EXCEPTION clause in the PL/pgSQL used. That should at least double the
number of Xids.

>  So
> most of the 636528 reads in the next 55 minutes can be attributed to
> VACUUM.

A similar argument might also be applied to subtrans, so a similar
investigation seems worthwhile. Subtrans has space for fewer Xids than
clog, BTW.

OTOH, I do think that 99% of that will not cause I/O.

--  Simon Riggs              EnterpriseDB   http://www.enterprisedb.com




Re: Piggybacking vacuum I/O

From
"Pavan Deolasee"
Date:

On 1/24/07, Heikki Linnakangas <heikki@enterprisedb.com> wrote:
Pavan Deolasee wrote:
> I have just counted the number of read/write calls on the CLOG blocks. As
> you can
> see the total number of CLOG reads jumped from 545323 to 1181851 i.e.
> 1181851 - 545323 = 636528 CLOG block reads for 1554697 pages of stock
> table.

Hmm. So there is some activity there. Could you modify the patch to
count how many of those reads came from OS cache? I'm thinking of doing
a gettimeofday() call before and after read, and counting how many
calls finished in less than say < 1 ms. Also, summing up the total time
spent in reads would be interesting.

Here are some more numbers. I ran two tests of 4 hours each, with the CLOG cache
size set to 8 blocks (default) and 16 blocks. I counted the number of read() calls,
and specifically those read() calls which took more than 0.5 ms to complete.
As you guessed, almost 99% of the reads complete in less than 0.5 ms, but
the total read() time is still more than 1% of the duration of the test. Is it
worth optimizing?

CLOG (16 blocks)
reads(743317), writes(84), reads > 0.5 ms (5171), time reads (186s), time reads > 0.5 ms(175s)

CLOG (8 blocks)
reads(1155917), writes(119), reads > 0.5 ms (4040), time reads (146s), time reads > 0.5 ms(130s)

(amused to see increase in the total read time with 16 blocks)

Also, is it worth optimizing away the read() system calls which might not cause physical I/O, but
still consume CPU?

Thanks,
Pavan

EnterpriseDB     http://www.enterprisedb.com

Re: Piggybacking vacuum I/O

From
Heikki Linnakangas
Date:
Pavan Deolasee wrote:
> On 1/24/07, Heikki Linnakangas <heikki@enterprisedb.com> wrote:
>> Hmm. So there is some activity there. Could you modify the patch to
>> count how many of those reads came from OS cache? I'm thinking of doing
>> a gettimeofday() call before and after read, and counting how many
>> calls finished in less than say < 1 ms. Also, summing up the total time
>> spent in reads would be interesting.
> 
> Here are some more numbers. I ran two tests of 4 hour each with CLOG cache
> size set to 8 blocks (default) and 16 blocks. I counted the number of 
> read()
> calls
> and specifically those read() calls which took more than 0.5 ms to 
> complete.
> As you guessed, almost 99% of the reads complete in less than 0.5 ms, but
> the total read() time is still more than 1% of the duration of the test. Is
> it
> worth optimizing ?

Probably not. I wouldn't trust that 1%-of-test-duration figure too much; 
gettimeofday() has some overhead of its own...

> CLOG (16 blocks)
> reads(743317), writes(84), reads > 0.5 ms (5171), time reads (186s), time
> reads > 0.5 ms(175s)
> 
> CLOG (8 blocks)
> reads(1155917), writes(119), reads > 0.5 ms (4040), time reads (146s), time
> reads > 0.5 ms(130s)
> 
> (amused to see increase in the total read time with 16 blocks)

Hmm. That's surprising.

> Also is it worth optimizing on the total read() system calls which might 
> not
> cause physical I/O, but
> still consume CPU ?

I don't think it's worth it, but now that we're talking about it: What 
I'd like to do to all the slru files is to replace the custom buffer 
management with mmapping the whole file, and letting the OS take care of 
it. We would get rid of some guc variables, the OS would tune the amount 
of memory used for clog/subtrans dynamically, and we would avoid the 
memory copying. And I'd like to do the same for WAL.
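
To sketch what I mean (illustrative only; error handling, file extension 
and the real clog bit layout are all glossed over):

#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Map a whole slru-style file and let the OS do all the caching. */
static unsigned char *
map_whole_file(const char *path, size_t *len_out)
{
    int         fd = open(path, O_RDWR);
    struct stat st;
    void       *p;

    if (fd < 0)
        return NULL;
    if (fstat(fd, &st) < 0)
    {
        close(fd);
        return NULL;
    }

    p = mmap(NULL, (size_t) st.st_size, PROT_READ | PROT_WRITE,
             MAP_SHARED, fd, 0);
    close(fd);                  /* the mapping survives the close */

    if (p == MAP_FAILED)
        return NULL;

    *len_out = (size_t) st.st_size;
    return (unsigned char *) p;
}

/* clog packs four 2-bit status values per byte; the bit order shown
 * here is schematic only. */
static int
clog_status_from_map(const unsigned char *map, unsigned long long xid)
{
    unsigned char byte = map[xid / 4];

    return (byte >> ((unsigned) (xid % 4) * 2)) & 0x3;
}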

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: Piggybacking vacuum I/O

From
"Pavan Deolasee"
Date:

On 1/25/07, Heikki Linnakangas <heikki@enterprisedb.com> wrote:
Pavan Deolasee wrote:
>
> Also is it worth optimizing on the total read() system calls which might
> not
> cause physical I/O, but
> still consume CPU ?

I don't think it's worth it, but now that we're talking about it: What
I'd like to do to all the slru files is to replace the custom buffer
management with mmapping the whole file, and letting the OS take care of
it. We would get rid of some guc variables, the OS would tune the amount
of memory used for clog/subtrans dynamically, and we would avoid the
memory copying. And I'd like to do the same for WAL.

Yes, we can do that. One problem, though, is that mmapping wouldn't work well when
the CLOG file is extended, because some of the backends may not see the extended
portion. But maybe we can start with a sufficiently large initialized file and
mmap the whole file.

Another, simpler solution for VACUUM would be to read the entire CLOG file
into local memory. Most of the transaction status queries could be satisfied from
this local copy, and the normal CLOG would be consulted only when the status is
unknown (TRANSACTION_STATUS_IN_PROGRESS).
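
Roughly like this, as a sketch only (invented names, schematic bit layout):

#include <stdio.h>
#include <stdlib.h>

typedef struct
{
    unsigned char *bytes;       /* 2 bits of status per xid, 4 xids per byte */
    size_t         len;
} LocalClog;

/* Slurp a clog segment into backend-local memory once, at the start
 * of the vacuum. */
static int
local_clog_load(LocalClog *lc, const char *path)
{
    FILE *fp = fopen(path, "rb");
    long  sz;

    if (fp == NULL)
        return -1;
    fseek(fp, 0, SEEK_END);
    sz = ftell(fp);
    rewind(fp);

    lc->bytes = malloc((size_t) sz);
    if (lc->bytes == NULL)
    {
        fclose(fp);
        return -1;
    }
    lc->len = fread(lc->bytes, 1, (size_t) sz, fp);
    fclose(fp);
    return 0;
}

/* Returns the 2-bit status from the local copy, or -1 if the xid is
 * beyond the copy and the shared clog must be consulted instead. */
static int
local_clog_status(const LocalClog *lc, unsigned long long xid)
{
    size_t byteno = (size_t) (xid / 4);

    if (byteno >= lc->len)
        return -1;
    return (lc->bytes[byteno] >> ((unsigned) (xid % 4) * 2)) & 0x3;
}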

Thanks,
Pavan

--

EnterpriseDB     http://www.enterprisedb.com

Re: Piggybacking vacuum I/O

From
Heikki Linnakangas
Date:
Pavan Deolasee wrote:
> Another simpler solution for VACUUM would be to read the entire CLOG file
> in local memory. Most of the transaction status queries can be satisfied
> from
> this local copy and the normal CLOG is consulted only when the status is
> unknown (TRANSACTION_STATUS_IN_PROGRESS)

The clog is only for finished (committed/aborted/crashed) transactions.
If a transaction is in progress, the clog is never consulted. Anyway,
that'd only be reasonable for vacuums, and I'm actually more worried if
we had normal backends thrashing the clog buffers.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com



Re: Piggybacking vacuum I/O

From
Bruce Momjian
Date:
Is there a TODO here?

---------------------------------------------------------------------------

Heikki Linnakangas wrote:
> Pavan Deolasee wrote:
> > Another simpler solution for VACUUM would be to read the entire CLOG file
> > in local memory. Most of the transaction status queries can be satisfied
> > from
> > this local copy and the normal CLOG is consulted only when the status is
> > unknown (TRANSACTION_STATUS_IN_PROGRESS)
> 
> The clog is only for finished (committed/aborted/crashed) transactions.
> If a transaction is in progress, the clog is never consulted. Anyway,
> that'd only be reasonable for vacuums, and I'm actually more worried if
> we had normal backends thrashing the clog buffers.
> 
> -- 
>    Heikki Linnakangas
>    EnterpriseDB   http://www.enterprisedb.com
> 
> 
> ---------------------------(end of broadcast)---------------------------
> TIP 4: Have you searched our list archives?
> 
>                http://archives.postgresql.org

--  Bruce Momjian   bruce@momjian.us EnterpriseDB    http://www.enterprisedb.com
 + If your life is a hard drive, Christ can be your backup. +


Re: Piggybacking vacuum I/O

From
Heikki Linnakangas
Date:
I'd like to see still more evidence that it's a problem before we start 
changing that piece of code. It has served us well for years.

Bruce Momjian wrote:
> Is there a TODO here?
> 
> ---------------------------------------------------------------------------
> 
> Heikki Linnakangas wrote:
>> Pavan Deolasee wrote:
>>> Another simpler solution for VACUUM would be to read the entire CLOG file
>>> in local memory. Most of the transaction status queries can be satisfied
>>> from
>>> this local copy and the normal CLOG is consulted only when the status is
>>> unknown (TRANSACTION_STATUS_IN_PROGRESS)
>> The clog is only for finished (committed/aborted/crashed) transactions.
>> If a transaction is in progress, the clog is never consulted. Anyway,
>> that'd only be reasonable for vacuums, and I'm actually more worried if
>> we had normal backends thrashing the clog buffers.


--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: Piggybacking vacuum I/O

From
Alvaro Herrera
Date:
Heikki Linnakangas wrote:
> I'd like to see still more evidence that it's a problem before we start 
> changing that piece of code. It has served us well for years.

So the TODO could be "investigate whether caching pg_clog and/or
pg_subtrans in local memory can be useful for vacuum performance".

> Bruce Momjian wrote:
> >Is there a TODO here?
> >
> >---------------------------------------------------------------------------
> >
> >Heikki Linnakangas wrote:
> >>Pavan Deolasee wrote:
> >>>Another simpler solution for VACUUM would be to read the entire CLOG file
> >>>in local memory. Most of the transaction status queries can be satisfied
> >>>from
> >>>this local copy and the normal CLOG is consulted only when the status is
> >>>unknown (TRANSACTION_STATUS_IN_PROGRESS)
> >>The clog is only for finished (committed/aborted/crashed) transactions.
> >>If a transaction is in progress, the clog is never consulted. Anyway,
> >>that'd only be reasonable for vacuums, and I'm actually more worried if
> >>we had normal backends thrashing the clog buffers.


-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.


Re: Piggybacking vacuum I/O

From
"Pavan Deolasee"
Date:

On 1/26/07, Alvaro Herrera <alvherre@commandprompt.com> wrote:
Heikki Linnakangas wrote:
> I'd like to see still more evidence that it's a problem before we start
> changing that piece of code. It has served us well for years.

So the TODO could be "investigate whether caching pg_clog and/or
pg_subtrans in local memory can be useful for vacuum performance".


As Heikki suggested, we should investigate the same for normal
backends as well.

It would also be interesting to investigate whether early setting of hint bits
can reduce subsequent writes of blocks. A typical case would be a large table
being updated heavily for a while, followed by SELECT queries. The SELECT
queries would set hint bits for the previously UPDATEd tuples (old and new
versions) and thus cause subsequent writes of those blocks for what could
have been read-only queries.

Thanks,
Pavan

EnterpriseDB     http://www.enterprisedb.com

Re: Piggybacking vacuum I/O

From
Alvaro Herrera
Date:
Pavan Deolasee wrote:
> On 1/26/07, Alvaro Herrera <alvherre@commandprompt.com> wrote:
> >
> >Heikki Linnakangas wrote:
> >> I'd like to see still more evidence that it's a problem before we start
> >> changing that piece of code. It has served us well for years.
> >
> >So the TODO could be "investigate whether caching pg_clog and/or
> >pg_subtrans in local memory can be useful for vacuum performance".
> >
> As  Heikki suggested, we should also investigate the same for normal
> backends as well.

Maybe.  An idea that comes to mind is to never cache the latest page,
since it'll most likely result in extra reads anyway because there'll be
a lot of IN_PROGRESS transactions.

Problem to solve: how much memory to dedicate to this?  Could we mmap()
portions of the pg_clog segment, so that the page could be shared across
backends instead of allocating them for each?

> It would also be interesting to investigate whether early setting of
> hint bits can reduce subsequent writes of blocks. A typical case would
> be a large table being updated heavily for a while, followed by SELECT
> queries. The SELECT queries would set hint bits for the previously
> UPDATEd  tuples (old and new versions) and thus cause subsequent
> writes of those blocks for what could have been read-only queries.

This has been suggested before, but I don't see how this could work.
How does the UPDATE transaction go back to the pages it wrote to update
the hint bits, _after_ it committed?

Maybe have the bgwriter update hint bits as it evicts pages out of the
cache?  It could result in pg_clog read traffic for each page that needs
eviction; not such a hot idea.

I don't see how this is related to the above proposal though.

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


Re: Piggybacking vacuum I/O

From
"Pavan Deolasee"
Date:

On 1/26/07, Alvaro Herrera <alvherre@commandprompt.com> wrote:

Maybe have the bgwriter update hint bits as it evicts pages out of the
cache?  It could result in pg_clog read traffic for each page that needs
eviction; not such a hot idea.

I thought that once we enhance the clog so that there are no clog reads, the
bgwriter would be able to update hint bits without getting into any deadlock
on pg_clog reads.

Maybe we can have this as a separate TODO.

Thanks,
Pavan

--

EnterpriseDB     http://www.enterprisedb.com

Re: Piggybacking vacuum I/O

From
Tom Lane
Date:
Heikki Linnakangas <heikki@enterprisedb.com> writes:
> I'd like to see still more evidence that it's a problem before we start 
> changing that piece of code. It has served us well for years.

What I see here is mostly evidence suggesting that we should consider
raising NUM_CLOG_BUFFERS, rather than anything more invasive.
        regards, tom lane


Re: Piggybacking vacuum I/O

From
Tom Lane
Date:
Alvaro Herrera <alvherre@commandprompt.com> writes:
> Pavan Deolasee wrote:
>> It would also be interesting to investigate whether early setting of
>> hint bits can reduce subsequent writes of blocks. A typical case would
>> be a large table being updated heavily for a while, followed by SELECT
>> queries. The SELECT queries would set hint bits for the previously
>> UPDATEd  tuples (old and new versions) and thus cause subsequent
>> writes of those blocks for what could have been read-only queries.

> This has been suggested before, but I don't see how this could work.
> How does the UPDATE transaction go back to the pages it wrote to update
> the hint bits, _after_ it committed?

I think what he's suggesting is deliberately not updating the hint bits
during a SELECT ... but that's surely misguided.  If you don't set the
hint bit after discovering the transaction commit state, then the next
visitor of the tuple will have to repeat the clog lookup, meaning that
any such policy greatly increases clog read traffic and contention.
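
To spell out the control flow with toy types (purely illustrative):

#include <stdbool.h>
#include <stdint.h>

typedef struct
{
    uint32_t xmin;
    bool     xmin_committed;     /* the hint bit */
} MiniTuple;

/* Once a clog lookup has resolved xmin, the hint bit remembers the
 * answer; skip setting it and every later visitor repeats the lookup. */
static bool
xmin_is_committed(MiniTuple *tup,
                  bool (*clog_says_committed)(uint32_t xid))
{
    if (tup->xmin_committed)
        return true;                  /* no clog traffic at all */

    if (clog_says_committed(tup->xmin))
    {
        tup->xmin_committed = true;   /* save the next visitor the lookup */
        return true;
    }
    return false;                     /* aborted or still in progress */
}
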
        regards, tom lane


Re: Piggybacking vacuum I/O

From
"Pavan Deolasee"
Date:

On 1/26/07, Tom Lane <tgl@sss.pgh.pa.us> wrote:

I think what he's suggesting is deliberately not updating the hint bits
during a SELECT ...

No, I was suggesting doing it in the bgwriter so that we may not need to do that during
a SELECT. Of course, we need to investigate more and have numbers to prove
the need. Also, you have already expressed concerns that doing so in the bgwriter is deadlock
prone, so there is certainly more work needed for any such scheme to work.

Thanks,
Pavan

--

EnterpriseDB     http://www.enterprisedb.com

Re: Piggybacking vacuum I/O

From
Bruce Momjian
Date:
Tom Lane wrote:
> Heikki Linnakangas <heikki@enterprisedb.com> writes:
> > I'd like to see still more evidence that it's a problem before we start 
> > changing that piece of code. It has served us well for years.
> 
> What I see here is mostly evidence suggesting that we should consider
> raising NUM_CLOG_BUFFERS, rather than anything more invasive.

Added to TODO:
* Consider increasing NUM_CLOG_BUFFERS

--  Bruce Momjian   bruce@momjian.us EnterpriseDB    http://www.enterprisedb.com
 + If your life is a hard drive, Christ can be your backup. +


Re: Piggybacking vacuum I/O

From
Bruce Momjian
Date:
Pavan Deolasee wrote:
> On 1/26/07, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >
> >
> > I think what he's suggesting is deliberately not updating the hint bits
> > during a SELECT ...
> 
> 
> No, I was suggesting doing it in bgwriter so that we may not need to that
> during
> a SELECT. Of course, we need to investigate more and have numbers to prove
> the need. Also you have already expressed concerns that doing so in bgwriter
> is deadlock
> prone. So there is certainly more work needed for any such scheme to work.

Added to TODO:
* Consider having the background writer update the transaction status hint bits before writing out the page

--  Bruce Momjian   bruce@momjian.us EnterpriseDB    http://www.enterprisedb.com
 + If your life is a hard drive, Christ can be your backup. +