Thread: Some ideas about Vacuum

Some ideas about Vacuum

From
"Gokulakannan Somasundaram"
Date:
Hi,
    Maybe I am reposting something which has already been discussed to death in this forum. I made a search in the
archives and I couldn't find anything immediately.
    With my relatively small experience in Performance Testing and Tuning, one of the rules of thumb for getting
performance is "Don't do it, if you don't need to do it". When we look at clearing the older versions of tuples from
our tables in PostgreSQL, we can't stop thinking about how it is done in other databases. When we compare the Oracle
undo log approach with the PostgreSQL Vacuum approach, the pattern looks very similar to C++ memory de-allocation and
Java garbage collection.
    So, as you may all know, the thing which worries us about Vacuum is that it is going to places where it need not
go. That's when we start thinking about the Dead Space Map. This dead space map, if implemented correctly, would guide
Vacuum to look only at places where there was some activity of Delete/Update/Insert after the last Vacuum. This is
accomplished at the cost of some very small overhead to Inserts/Deletes/Updates.
    The Dead Space Map is like an undo log, if we think its role is to get rid of the older versions of data. Instead
of moving the tuples to a separate location, it guides the Vacuum process to do the cleanup task. Maybe we can even
think of something like a Dead Space Log, which need not be a bitmap. In this log, transactions might enter their
transaction ids and ctids, which can then be scanned by the Vacuum process. While this might take more space, it comes
with less contention compared to a Dead Space Map. To me, as far as I can think of, the only advantage of the Dead
Space Map over a Dead Space Log is the disk space.
    It just strikes me that the WAL log is already doing just that. I think you can follow my thought-line. If we can
ask the Vacuum process to scan the WAL log, it can get all the relevant details on where it needs to go. One
optimization that can be placed here is to somehow make the archiver do a double job of helping Vacuum while doing the
archiving. For people who have switched off archiving, this might not be a benefit.
    One main restriction it places on the WAL logs is that a WAL log needs to be archived only after all the
transactions in it complete. In other words, WAL logs need to be given enough space to survive the longest transaction
of the database. It is possible to avoid this situation by asking the Vacuum process to take the necessary information
out of the WAL log, store it somewhere, and wait for the long-running transaction to complete.
    The information of interest in the WAL is only the table inserts/updates/deletes. So if everyone accepts that this
is a good idea till this point, there is a point in reading further.
    Ultimately, what has been achieved till now is that we have turned the sequential scans made by the Vacuum process
on each table into a few random I/Os. Of course there are optimizations possible to group the random I/Os and find
some sequential I/O out of them. But still we need to do a full index scan for all those indexes out there. HOT might
have saved some work over there. But I am pessimistic here and wondering how it could be improved. So it just strikes
me that we can do the same thing which we did with the tables: convert a seq scan of the entire table into a random
scan of a few blocks. We can read the necessary tuple information from the tuples, group them, and hit the index in
just those blocks and clean it up.
    I can already hear people saying that it is not always possible to go back to the index from the table. There is
this culprit called unstable function based indexes. The structure stops us from going back to the index from the
table. So currently we should restrict the above said approach to only normal indexes (not the function based ones).
I hope it would still give a good benefit.
    Of course Vacuum can convert the few random scans into a seq scan, if required, by referring to table statistics.

    Thoughts about the idea?

Thanks,
Gokul.

P.S.: Let the objections/opposing views have a subtle reduction in their harshness.
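P.S. 2: To make the Dead Space Log idea a little more concrete, an entry might look roughly like this (just a sketch;
the struct and field names are invented, only the PostgreSQL types are real):

    typedef struct DeadSpaceLogEntry
    {
        TransactionId   xid;        /* transaction that deleted/updated the tuple */
        RelFileNode     rnode;      /* relation the dead tuple lives in */
        ItemPointerData ctid;       /* block number and offset of the dead tuple */
    } DeadSpaceLogEntry;

Vacuum would read these entries, sort them by relation and block number, and then visit only the heap blocks that
actually contain dead tuples.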

Re: Some ideas about Vacuum

From
Markus Schiltknecht
Date:
Hi,

Gokulakannan Somasundaram wrote:
> If we can ask the Vacuum process to scan 
> the WAL log, it can get all the relevant details on where it needs to 
> go.

You seem to be assuming that only few tuples have changed between 
vacuums, so that WAL could quickly guide the VACUUM processes to the 
areas where cleaning is necessary.

Let's drop that assumption, because by default, autovacuum_scale_factor 
is 20%, so a VACUUM process normally kicks in after 20% of tuples 
changed (disk space is cheap, I/O isn't). Additionally, there's a 
default nap time of one minute - and VACUUM is forced to take at least 
that much of a nap.
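
(For reference, autovacuum decides to vacuum a table roughly when

    dead tuples > autovacuum_vacuum_threshold + autovacuum_vacuum_scale_factor * reltuples

so with the defaults of 50 and 0.2 that's on the order of 200,000 dead tuples for a one-million-row table.)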

So it's easily possible to have more dead tuples than live ones. In such 
cases, scanning the WAL can easily take *longer* than scanning the 
table, because the amount of WAL to read would be bigger.

>     One main restriction it places on the WAL Logs is that the WAL Log 
> needs to be archived only after all the transactions in it completes. In 
> other words, WAL logs need to be given enough space, to survive the 
> longest transaction of the database. It is possible to avoid this 
> situation by asking the Vacuum process to take the necessary information 
> out of WAL log and store it somewhere and wait for the long running 
> transaction to complete.

That would result in even more I/O...

>     The information of interest in WAL is only the table 
> inserts/updates/deletes. So if everyone accepts that this is a good 
> idea, till this point, there is a point in reading further.

Well, that's the information of interest, the question is where to store 
that information. Maintaining a dead space map looks a lot cheaper to 
me, than relying on the WAL to store that information.

>     Ultimately, what has been achieved till now is that we have made the 
> sequential scans made by the Vacuum process on each table into a few 
> random i/os. Of course there are optimizations possible to group the 
> random i/os and find some sequential i/o out of it. But still we need to 
> do a full index scan for all those indexes out there. HOT might have 
> saved some work over there. But i am pessimistic here and wondering how 
> it could have been improved. So it just strikes me, we can do the same 
> thing which we did just with the tables. Convert a seq scan of the 
> entire table into a random scan of few blocks. We can read the necessary 
> tuple information from the tuples, group them and hit at the index in 
> just those blocks and clean it up.

Sorry, I don't quite get what you are talking about here. What do 
indexes have to do with dead space? Why not just keep acting on the 
block level?

>    I can already hear people, saying that it is not always possible to 
> go back to index from table. There is this culprit called unstable 
> function based indexes.

No, there's no such thing. Citing [1]: "All functions and operators used 
in an index definition must be "immutable", that is, their results must 
depend only on their arguments and never on any outside influence".

Of course, you can mark any function IMMUTABLE and get unstable function 
based indexes, but that turns into a giant foot gun very quickly.

> P.S.:  Let the objections/opposing views have a subtle reduction in its 
> harshness.

I'm just pointing at things that are in conflict with my knowledge, 
assumptions and beliefs, all of which might be erroneous, plain wrong or 
completely mad. ;-)

Regards

Markus

[1]: the Very Fine Postgres Manual on CREATE INDEX:
http://www.postgresql.org/docs/8.3/static/sql-createindex.html


Re: Some ideas about Vacuum

From
"Gokulakannan Somasundaram"
Date:

So it's easily possible to have more dead tuples than live ones. In such
cases, scanning the WAL can easily take *longer* than scanning the
table, because the amount of WAL to read would be bigger.

Yes... I made a wrong assumption there, so the idea is totally useless.

Thanks,
Gokul.

Re: Some ideas about Vacuum

From
Gregory Stark
Date:
"Markus Schiltknecht" <markus@bluegap.ch> writes:

> Hi,
>
> Gokulakannan Somasundaram wrote:
>> If we can ask the Vacuum process to scan the WAL log, it can get all the
>> relevant details on where it needs to go.

That's an interesting thought. I think your caveats are right but with some
more work it might be possible to work it out. For example if a background
process processed the WAL and accumulated an array of possibly-dead tuples to
process in batch. It would wait whenever it sees an xid which isn't yet past
globalxmin, and keep accumulating until it has enough to make it worthwhile
doing a pass.
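
Very roughly, such a background process might look like this (hand-waving; the WAL-reading helpers and record fields
are invented for the sketch and are not existing APIs, only TransactionIdPrecedes() is real):

    typedef struct DeadTupleCand
    {
        RelFileNode     rnode;      /* relation the possibly-dead tuple is in */
        ItemPointerData tid;        /* its old ctid */
        TransactionId   xmax;       /* xid that deleted or updated it */
    } DeadTupleCand;

    for (;;)
    {
        WalRecord *rec = read_next_wal_record();        /* invented */

        if (record_is_heap_delete_or_update(rec))       /* invented */
            remember_candidate(rec->rnode, rec->old_tid, rec->xid);

        /*
         * Only act once there are enough candidates to be worth a pass,
         * and only on candidates whose deleting xid is already older
         * than every snapshot in the system.
         */
        if (candidate_count() >= BATCH_SIZE &&
            TransactionIdPrecedes(oldest_candidate_xmax(), get_global_xmin()))
            vacuum_candidate_batch();   /* sort by (rel, block) and prune those blocks */
    }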

I think a bigger issue with this approach is that it ties all your tables
together. You can't process one table frequently while some other table has
some long-lived deleted tuples.

I'm also not sure it really buys us anything over having a second
dead-space-map data structure. The WAL is much larger and serves other
purposes which would limit what we can do with it.

> You seem to be assuming that only few tuples have changed between vacuums, so
> that WAL could quickly guide the VACUUM processes to the areas where cleaning
> is necessary.
>
> Let's drop that assumption, because by default, autovacuum_scale_factor is 20%,
> so a VACUUM process normally kicks in after 20% of tuples changed (disk space
> is cheap, I/O isn't). Additionally, there's a default nap time of one minute -
> and VACUUM is forced to take at least that much of a nap.

I think this is exactly backwards. The goal should be to improve vacuum, then
adjust the autovacuum_scale_factor as low as we can. As vacuum gets cheaper
the scale factor can go lower and lower. We shouldn't allow the existing
autovacuum behaviour to control the way vacuum works.

As a side point, "disk is cheap, I/O isn't" is a weird statement. The more
disk you use the more I/O you'll have to do to work with the data. I still
maintain the default autovacuum_scale_factor is *far* too liberal. If I had my
druthers it would be 5%. But that's mostly informed by TPCC experience, in
real life the actual value will vary depending on the width of your records
and the relative length of your transactions versus transaction rate. The TPCC
experience is with ~ 400 byte records and many short transactions.

--  Gregory Stark EnterpriseDB          http://www.enterprisedb.com Ask me about EnterpriseDB's On-Demand Production
Tuning


Re: Some ideas about Vacuum

From
Simon Riggs
Date:
On Wed, 2008-01-09 at 15:10 +0000, Gregory Stark wrote:

> The goal should be to improve vacuum, then
> adjust the autovacuum_scale_factor as low as we can. As vacuum gets
> cheaper the scale factor can go lower and lower. We shouldn't allow
> the existing autovacuum behaviour to control the way vacuum works. 

Very much agreed.

--  Simon Riggs 2ndQuadrant  http://www.2ndQuadrant.com



Re: Some ideas about Vacuum

From
Markus Schiltknecht
Date:
Hi,

Gregory Stark wrote:
> That's an interesting thought. I think your caveats are right but with some
> more work it might be possible to work it out. For example if a background
> process processed the WAL and accumulated an array of possibly-dead tuples to
> process in batch. It would wait whenever it sees an xid which isn't yet past
> globalxmin, and keep accumulating until it has enough to make it worthwhile
> doing a pass.

I don't understand why one would want to go via the WAL, that only 
creates needless I/O. Better accumulate the data right away, during the 
inserts, updates and deletes.  Spilling the accumulated data to disk, if 
absolutely required, would presumably still result in less I/O.

> I think a bigger issue with this approach is that it ties all your tables
> together. You can't process one table frequently while some other table has
> some long-lived deleted tuples.

Don't use the WAL as the source of that information and that's issue's gone.

> I'm also not sure it really buys us anything over having a second
> dead-space-map data structure. The WAL is much larger and serves other
> purposes which would limit what we can do with it.

Exactly.

>> You seem to be assuming that only few tuples have changed between vacuums, so
>> that WAL could quickly guide the VACUUM processes to the areas where cleaning
>> is necessary.
>>
>> Let's drop that assumption, because by default, autovacuum_scale_factor is 20%,
>> so a VACUUM process normally kicks in after 20% of tuples changed (disk space
>> is cheap, I/O isn't). Additionally, there's a default nap time of one minute -
>> and VACUUM is forced to take at least that much of a nap.
> 
> I think this is exactly backwards. The goal should be to improve vacuum, then
> adjust the autovacuum_scale_factor as low as we can. As vacuum gets cheaper
> the scale factor can go lower and lower.

But you can't lower it endlessly; it's still a compromise, because it 
also means reducing the number of tuples being cleaned per scan, which 
works against the goal of minimizing the overall I/O cost of vacuuming.

> We shouldn't allow the existing
> autovacuum behaviour to control the way vacuum works.

That's a point.

> As a side point, "disk is cheap, I/O isn't" is a weird statement. The more
> disk you use the more I/O you'll have to do to work with the data.

That's only true, as long as you need *all* your data to work with it.

> I still
> maintain the default autovacuum_scale_factor is *far* too liberal. If I had my
> druthers it would be 5%. But that's mostly informed by TPCC experience, in
> real life the actual value will vary depending on the width of your records
> and the relative length of your transactions versus transaction rate. The TPCC
> experience is with ~ 400 byte records and many short transactions.

Hm.. 5% vs 20% would mean 4x as many vacuum scans, but only a 15% growth 
in size (105% vs 120%), right? Granted, those 15% are also taken from 
memory and caches, resulting in additional I/O...  Still, these numbers 
surprise me. Or am I missing something?

Regards

Markus



Re: Some ideas about Vacuum

From
"Gokulakannan Somasundaram"
Date:
Markus,
        I was re-thinking about what you said. I feel that if we read the WAL through the archiver (where the
archiver is switched on), which anyway reads the entire WAL log, it might save some CPU cycles off updates, inserts
and deletes.
        The question is about reducing I/O, and I have no doubt about that. But if we create the WAL log on a
separate disk and we make Vacuum scan through it (in case the archiver is absent), it would reduce the I/O off the
disk containing the data. Essentially the I/O effects are separated. We might end up doing more I/O, but it would not
affect the OLTP transactions.
        I would also like to clarify one more thing. I am not asking to remove the DSM approach. I am just thinking
of creating the DSM by reading through the WAL logs, instead of asking the inserts, updates and deletes to do the DSM
creation.
        Of course, if a person places both WAL logs and data files on the same disk drives, this would reduce
performance. But can we take that hit?
        I think what Gregory is getting at is, "if we schedule the Vacuum after 20% of table changes, then we
essentially say we need 120% of the disk space and hence our select operations might end up doing more I/Os."
        Please put forward your suggestions.

Hi All,

Essentially, concluding:
a) If there is an archiver running, we are putting slightly more CPU cycles on the archiver to help form the DSM.
b) If there is no archiver and the DBA places the WAL on a separate disk, Vacuum will do more I/O on that disk to
form the DSM.
c) If someone has neither enabled the archiver nor is ready to spare a disk for the WAL, this approach reduces the
performance of that setup.
    Are my conclusions right?
    If they are, what percentage of setups does the third case constitute? (Field experts out there!!)
    If the percentage is large, we should stop this line of thinking.

Thanks,
Gokul.

Re: Some ideas about Vacuum

From
Markus Schiltknecht
Date:
Hi,

Gokulakannan Somasundaram wrote:
> But i am just thinking of creating the DSM 
> by reading through the WAL Logs, instead of asking the Inserts, updates 
> and deletes to do the DSM creation.

What's the advantage of that? What's wrong with collecting the 
information for DSM at transaction processing time? The overhead is 
certainly smaller than the overhead for doing it later on.

>         I think what Gregory is coming at is, "if we schedule the Vacuum 
> after 20% of table changes, then we essentially say we need 120% of the 
> disk space and hence our select operations might end up doing more I/Os."

Well, full sequential scans end up doing more I/O, but not index scans 
typical for OLTP. So if autovacuum is the only thing doing full 
sequential scans, you'd better reduce the number of full scans, instead 
of saving only some percentage per scan, no?

Of course, depending on how much of your table fits in ram, you also 
need to consider the space savings in RAM...  However, I'm assuming a 
reasonably low ratio of RAM size vs table size.

Regards

Markus



Re: Some ideas about Vacuum

From
"Gokulakannan Somasundaram"
Date:


On Jan 10, 2008 3:43 PM, Markus Schiltknecht <markus@bluegap.ch> wrote:
Hi,

Gokulakannan Somasundaram wrote:
> But i am just thinking of creating the DSM
> by reading through the WAL Logs, instead of asking the Inserts, updates
> and deletes to do the DSM creation.

What's the advantage of that? What's wrong with collecting the
information for DSM at transaction processing time? The overhead is
certainly smaller than the overhead for doing it later on.
 
The overhead is because of the contention. Am I missing something here? While Vacuum is reading the DSM, other operations may not be able to update the bits. We need to put the DSM in shared memory if all the processes are going to update it, whereas if Vacuum is going to form the DSM, it might well be in process-local memory. I can think of things like false sharing which might be avoided. But I think the main issue is contention.
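
To illustrate what I mean by contention: with a shared DSM, every UPDATE/DELETE would have to do something like the
following, whereas a Vacuum-built map needs no locking at all (the lock name and helper are invented here, this is
not the actual DSM design):

    /* foreground backend, on every UPDATE/DELETE */
    LWLockAcquire(DeadSpaceMapLock, LW_EXCLUSIVE);   /* invented lock */
    dsm_set_bit(shared_dsm, relid, blockno);         /* invented helper */
    LWLockRelease(DeadSpaceMapLock);

    /* Vacuum building the map in its own local memory */
    dsm_set_bit(local_dsm, relid, blockno);          /* no lock needed */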
 


>         I think what Gregory is coming at is, "if we schedule the Vacuum
> after 20% of table changes, then we essentially say we need 120% of the
> disk space and hence our select operations might end up doing more I/Os."

Well, full sequential scans end up doing more I/O, but not index scans
typical for OLTP. So if autovacuum is the only thing doing full
sequential scans, you'd better reduce the number of full scans, instead
of saving only some percentage per scan, no?

Even in indexes, we might end up reading dead tuples. We would mark them with LP_DEAD. So the overhead is less, but it's there. Of course it's natural to think of some background jobs running during OLTP, and they will be affected.
 


Of course, depending on how much of your table fits in ram, you also
need to consider the space savings in RAM...  However, I'm assuming a
reasonably low ratio of RAM size vs table size.

That's another one.

Thanks,
Gokul.

Re: Some ideas about Vacuum

From
Markus Schiltknecht
Date:
Hi,

Gokulakannan Somasundaram wrote:
> because of the contention. Am i missing something 
> here? While Vacuum is reading the DSM, operations may not be able to 
> update the bits. We need to put the DSM in shared memory, if all the 
> processes are going to update it, whereas if Vacuum is going to form the 
> DSM, then it might well be in the process local memory.  I can think of 
> things like False sharing which might be avoided. But i think the main 
> stuff is contention.

Ah, I begin to understand where you are coming from now, yes. However, 
(ab-)using the WAL and archiver still doesn't look like a good idea to me.

> Even in indexes, we might end up reading dead tuples. We would mark them 
> with LP_DEAD. So the overhead is less, but it's there.

That's a good point, yes.

> Of course it's 
> natural to think of some background jobs running during OLTP, and they will be 
> affected

Agreed.

Regards

Markus



Re: Some ideas about Vacuum

From
"Gokulakannan Somasundaram"
Date:

Sorry Greg, I missed reading this part before.

On Jan 9, 2008 8:40 PM, Gregory Stark <stark@enterprisedb.com> wrote:

"Markus Schiltknecht" <markus@bluegap.ch> writes:

> Hi,
>
> Gokulakannan Somasundaram wrote:
>> If we can ask the Vacuum process to scan the WAL log, it can get all the
>> relevant details on where it needs to go.

That's an interesting thought. I think your caveats are right but with some
more work it might be possible to work it out. For example if a background
process processed the WAL and accumulated an array of possibly-dead tuples to
process in batch. It would wait whenever it sees an xid which isn't yet past
globalxmin, and keep accumulating until it has enough to make it worthwhile
doing a pass.

I think a bigger issue with this approach is that it ties all your tables
together. You can't process one table frequently while some other table has
some long-lived deleted tuples. 

I am not able to clearly understand what you are saying here. It ties all the tables together, yes. There are two options here:
a) Do we really need to do Vacuum table by table? Say we read 'n' WAL segments and accumulate the data. We could sort the result by relation name and block number and then go ahead with the Vacuum. In this way, Vacuum only works at the database level. Why do we need to process one table frequently?
b) We can create DSMs for each table separately and Vacuum will use the WAL information to update them. In this way, we can Vacuum table-wise.
 


I'm also not sure it really buys us anything over having a second
dead-space-map data structure. The WAL is much larger and serves other
purposes which would limit what we can do with it.
Ok. One obvious advantage is that it saves the contention over the DSM for the DML operations and the Vacuum process. Since the Vacuum process is going to have much more information on what has happened in the database, it becomes possible to build some new structures. For example, I have been thinking of changing our current index structure in such a way that it won't hold any duplicate tuples for different versions of data. Whenever there is an update, only the indexes relevant to the columns changed will get updated. Vacuum has to play the role of changing the tid the index tuple points to, whenever it vacuums an older version.
It would be possible to create such structures, which can be synched asynchronously. Another example would be Asynchronous Materialized Views.

But pushing those future plans aside, don't you think this would reduce the contention which would otherwise be faced by the DML operations?

Thanks,
Gokul.


Re: Some ideas about Vacuum

From
Markus Schiltknecht
Date:
Hi,

Gokulakannan Somasundaram wrote:
>     I'm also not sure it really buys us anything over having a second
>     dead-space-map data structure. The WAL is much larger and serves other
>     purposes which would limit what we can do with it.
> 
> Ok. One obvious advantage is that it saves the contention over DSM for 
> the DML operations and Vacuum process.

Do you have evidence of that contention being so bad that it 
justifies the additional WAL reading from disk? (Assuming no WAL archiving).

IMO we can get about any granularity we want for DSM update locking, 
depending on how we arrange the DSM bits.

> Since Vacuum process is going to 
> have much more information on what has happened in the database,

Why should that be? IMO, collecting the information at transaction time 
can give you exactly the same information, if not more or better 
information.

> it is 
> possible for some new structures. For example i have been thinking of 
> changing our current index structure in such a way, it won't hold any 
> duplicate tuples for different versions of data. Whenever there is a 
> update, only the indexes relevant to the columns changed will get 
> updated. The Vacuum has to play the role of changing the tid, the index 
> tuple points to, whenever it vacuums a older version.

Huh? The index would then point to the old tuple only, until a VACUUM 
comes by, right? How are following transactions expected to find the new 
tuple before that VACUUMing?

Regards

Markus


Re: Some ideas about Vacuum

From
Tom Lane
Date:
Markus Schiltknecht <markus@bluegap.ch> writes:
>> Since Vacuum process is going to 
>> have much more information on what has happened in the database,

> Why should that be? IMO, collecting the information at transaction time 
> can give you exactly the same information, if not more or better 
> information.

Well, one of the principal arguments for having VACUUM at all is that it
off-loads required maintenance effort from foreground transaction code
paths.  I'm not really going to be in favor of solutions that put more
work into the transaction code paths (HOT already did more of that than
I would like :-().  OTOH, I agree that scanning the WAL log doesn't
really sound like something well-matched to this problem either.
        regards, tom lane


Re: Some ideas about Vacuum

From
Markus Schiltknecht
Date:
Hi,

Tom Lane wrote:
> Well, one of the principal arguments for having VACUUM at all is that it
> off-loads required maintenance effort from foreground transaction code
> paths.

Off-loading doesn't mean we don't have to do the work, so it obviously 
is a compromise.

AFAICT, having to write some DSM blocks from foreground transaction code 
paths may well be worth it overall, if it saves VACUUM from doing much 
more I/O.

Especially if the bgwriter can defer the I/O to after commit time (which 
I'm thinking of as another form of off-loading work from foreground 
transaction code).

Regards

Markus


Re: Some ideas about Vacuum

From
"Gokulakannan Somasundaram"
Date:

Hi,

Please find my answers inline

Do you have evidence of that contention being so bad that it
justifies the additional WAL reading from disk? (Assuming no WAL archiving).
In a broader sense, the DSM is a bitmap index with some optimizations placed to make the updates more effective. As you may know, the design of a bitmap index doesn't scale very well with concurrency. If you pack more information into a little space, then I feel it might affect concurrency. Let us discuss it in detail.
The DSM, I believe, plans to achieve the following objectives:
a) To find out the blocks which are to be Vacuumed
b) To find out the blocks where freezing is required
c) To find out the blocks which are visible to everyone.

The DSM might get split into multiple maps like visibility maps (already proposed by Heikki), vacuum maps and freezing maps.  When inserts happen, the map has to get extended, and it has to lock the block to extend the map. Say a DSM block corresponds to some 60K data blocks. Then any updates/deletes happening over those blocks have to wait for that time. This is just an example which I can think of off-hand. Maybe the people who are implementing it can throw more light on the synchronization points.
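
(Roughly, assuming one bit per heap block and 8K pages: a single DSM page would hold 8192 * 8 = 65536 bits, i.e. it
would cover 65536 heap blocks or about 512MB of table, so a lock held on one DSM page while it is read or extended
can stall activity on a lot of heap blocks at once.)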

IMO we can get about any granularity we want for DSM update locking,
depending on how we arrange the DSM bits.
I can't understand this exactly. 


> Since Vacuum process is going to
> have much more information on what has happened in the database,

Why should that be? IMO, collecting the information at transaction time
can give you exactly the same information, if not more or better
information.

My argument is: if we have already collected that information in the WAL, why should we collect it again?
 
> it is
> possible for some new structures. For example i have been thinking of
> changing our current index structure in such a way, it won't hold any
> duplicate tuples for different versions of data. Whenever there is a
> update, only the indexes relevant to the columns changed will get
> updated. The Vacuum has to play the role of changing the tid, the index
> tuple points to, whenever it vacuums a older version.

Huh? The index would then point to the old tuple only, until a VACUUM
comes by, right. How are following transactions expected to find the new
tuple before that VACUUMing?
You are right. We have already discussed this. In the Vacuum approach, we travel forward in time: we catch the oldest version and go to the new version by following the ctid in the old tuple. In the undo log approach, it is the reverse: we go to the latest version and travel back in time. It's interesting to see how the theory of relativity has got applied in database science, right?

So say we have 'n' versions of the same data in the index. Right now we have 'n' index tuples which point to 'n' blocks in the heap; we would read all the 'n' index tuples and go to all the versions of the data in the table. If this changes, there will be one index tuple, which would point to the oldest heap tuple, and from there we will navigate to all the newer tuples. The advantage is obvious: the index is going to be smaller, and updates will not touch an index unless the data in it has changed.
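
In rough pseudo-C, the lookup I have in mind would go something like this (purely a sketch of the idea; the helper
names are invented and this is not working code, though ItemPointerEquals() and the t_ctid chain are real):

    /* the single index entry points at the oldest heap version */
    ItemPointerData tid = index_entry_heap_tid;

    for (;;)
    {
        HeapTupleData tup;

        fetch_heap_tuple(rel, &tid, &tup);          /* invented helper */

        if (tuple_visible_to(&tup, snapshot))       /* invented helper */
            return tup;                             /* this is the version we want */

        /* the newest version points to itself; nothing newer to follow */
        if (ItemPointerEquals(&tup.t_self, &tup.t_data->t_ctid))
            break;

        tid = tup.t_data->t_ctid;                   /* travel forward in time */
    }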

Hope I was clear. Please revert back in case I am not.

Thanks,
Gokul.

Re: Some ideas about Vacuum

From
"Gokulakannan Somasundaram"
Date:
One more application of the same is Asynchronous Materialized Views. I hope you agree that asynchronous
materialized views have to get updated only through the WAL. If the WAL can be used for that purpose, why can't we
multiplex it?

Thanks,
Gokul.

Re: Some ideas about Vacuum

From
"Gokulakannan Somasundaram"
Date:
> Well, one of the principal arguments for having VACUUM at all is that it
> off-loads required maintenance effort from foreground transaction code
> paths.  I'm not really going to be in favor of solutions that put more
> work into the transaction code paths (HOT already did more of that than
> I would like :-().  OTOH, I agree that scanning the WAL log doesn't
> really sound like something well-matched to this problem either.

Tom, don't you like the idea of building some more structures around the WAL, like Asynchronous Materialized Views?
Indexes, if implemented as stated, would remove the HOT code from the path of the transaction (as you may know). I am
also slightly doubtful of the argument that doing full-table scans and full index scans for Vacuum is efficient. Can
you please advise me on why we should not use a read-only operation on the WAL log?

Thanks,
Gokul.

Re: Some ideas about Vacuum

From
"Heikki Linnakangas"
Date:
Gokulakannan Somasundaram wrote:
>>
>> Well, one of the principal arguments for having VACUUM at all is that it
>> off-loads required maintenance effort from foreground transaction code
>> paths.  I'm not really going to be in favor of solutions that put more
>> work into the transaction code paths (HOT already did more of that than
>> I would like :-().  OTOH, I agree that scanning the WAL log doesn't
>> really sound like something well-matched to this problem either.
>>
> 
> Tom, Don't you like the idea of building some more structures around WAL,
> like Asynchronous Materialized views. Indexes, if implemented as  stated,
> would remove the HOT code in the path of the transaction(as you may know).
> I am also slightly doubtful of the argument, that doing full-table scans and
> full index scans for Vacuum is efficient. Can you please advise me on why we
> should not use a read only operation on WAL log ?

I haven't been paying close attention to this thread, but there is a 
couple general issues with using the WAL for this kind of things. First 
of all, one extremely cool feature of PostgreSQL is that transaction 
size is not limited by WAL space, unlike on many other DBMSs. I think 
many of the proposed ideas of reading WAL would require us to keep all 
WAL available back to the beginning of the oldest running transaction.

Another issue is that reading WAL is inherently not very scalable. 
There's only one WAL for the whole cluster, and it needs to be read 
sequentially, so it can easily become a bottleneck on large systems.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: Some ideas about Vacuum

From
"Gokulakannan Somasundaram"
Date:

I haven't been paying close attention to this thread, but there is a
couple general issues with using the WAL for this kind of things. First
of all, one extremely cool feature of PostgreSQL is that transaction
size is not limited by WAL space, unlike on many other DBMSs. I think
many of the proposed ideas of reading WAL would require us to keep all
WAL available back to the beginning of the oldest running transaction.

Initially I thought this might be required. But the current idea is that Vacuum is going to maintain a DSM per relation and will update it once the WAL segment is switched. So if WAL logging is currently happening in segment 2, the first segment can be scanned to update the DSM.

Another issue is that reading WAL is inherently not very scalable.
There's only one WAL for the whole cluster, and it needs to be read
sequentially, so it can easily become a bottleneck on large systems.

Let me try to understand what would become a problem here. We are going to have only one process which would open this WAL (one segment at a time) and update the DSMs. The limitation would be that we should have finished reading a segment before the WAL segments wrap around and it gets recycled. What else do you think would be the problem?

Thanks,
Gokul.

Re: Some ideas about Vacuum

From
Alvaro Herrera
Date:
Heikki Linnakangas escribió:

> Another issue is that reading WAL is inherently not very scalable. There's 
> only one WAL for the whole cluster, and it needs to be read sequentially, 
> so it can easily become a bottleneck on large systems.

I have wondered why we do it this way.  Is there a problem with
having one WAL per database, and another for general operations?  This
last WAL would have changes to shared tables, as well as global stuff
like "create database" or "create tablespace".

Of course, it means a lot more files, and a PITR setup is a bit more
complex.

One obvious problem is that it is no longer true that you have a "no
seek" disk head.  But is there much use of that, these days?  People
have either a big RAID on which the WAL resides alongside all the data; or, on
low-cost systems, the whole thing is on a single disk or a small RAID.

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.


Re: Some ideas about Vacuum

From
Tom Lane
Date:
Alvaro Herrera <alvherre@commandprompt.com> writes:
> Heikki Linnakangas escribió:
>> Another issue is that reading WAL is inherently not very scalable. There's 
>> only one WAL for the whole cluster, and it needs to be read sequentially, 
>> so it can easily become a bottleneck on large systems.

> I have wondered why do we do it this way.  Is there a problem with
> having one WAL per database, and another for general operations?  This
> last WAL would have changes to shared tables, as well as global stuff
> like "create database" or "create tablespace".

It would only be useful to have one per spindle-dedicated-to-WAL, so
tying the division to databases doesn't seem like it'd be a good idea.
        regards, tom lane


Re: Some ideas about Vacuum

From
Alvaro Herrera
Date:
Tom Lane escribió:
> Alvaro Herrera <alvherre@commandprompt.com> writes:
> > Heikki Linnakangas escribió:
> >> Another issue is that reading WAL is inherently not very scalable. There's 
> >> only one WAL for the whole cluster, and it needs to be read sequentially, 
> >> so it can easily become a bottleneck on large systems.
> 
> > I have wondered why do we do it this way.  Is there a problem with
> > having one WAL per database, and another for general operations?  This
> > last WAL would have changes to shared tables, as well as global stuff
> > like "create database" or "create tablespace".
> 
> It would only be useful to have one per spindle-dedicated-to-WAL, so
> tying the division to databases doesn't seem like it'd be a good idea.

Keep in mind that there are claims that a write-cache-enabled
battery-backed RAID controller negates the effect of a separate spindle.
Also, these days people are recommending keeping the WAL on a mirrored disk,
so you would have to have a mirrored pair for every WAL stream, which
starts to sound unworkable.

My point, rather, is that with this sort of setup it would be easier to
do per-database PITR shipping, and one database's WAL activity would not
affect another's (thus hosting providers are happier -- high-rate
customer A need not affect low-budget customer B).

A totally separate consideration is that of LWLock contention.

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


Re: Some ideas about Vacuum

From
Tom Lane
Date:
Alvaro Herrera <alvherre@commandprompt.com> writes:
> Tom Lane escribió:
>> It would only be useful to have one per spindle-dedicated-to-WAL, so
>> tying the division to databases doesn't seem like it'd be a good idea.

> Keep in mind that there are claims that a write-cache-enabled
> battery-backed RAID controller negates the effect of a separate spindle.

Possibly true, but if that's the underlying hardware then there's no
performance benefit in breaking WAL up at all, no?

> My point, rather, is that with this sort of setup it would be easier to
> do per-database PITR shipping, and one database's WAL activity would not
> affect another's (thus hosting providers are happier -- high-rate
> customer A need not affect low-budget customer B).

You won't get far with that because of the shared catalogs.  In
particular, most DDL operations these days touch pg_shdepend ...
        regards, tom lane


Re: Some ideas about Vacuum

From
Alvaro Herrera
Date:
Tom Lane escribió:
> Alvaro Herrera <alvherre@commandprompt.com> writes:
> > Tom Lane escribió:
> >> It would only be useful to have one per spindle-dedicated-to-WAL, so
> >> tying the division to databases doesn't seem like it'd be a good idea.
> 
> > Keep in mind that there are claims that a write-cache-enabled
> > battery-backed RAID controller negates the effect of a separate spindle.
> 
> Possibly true, but if that's the underlying hardware then there's no
> performance benefit in breaking WAL up at all, no?

Selective PITR shipping.

> > My point, rather, is that with this sort of setup it would be easier to
> > do per-database PITR shipping, and one database's WAL activity would not
> > affect another's (thus hosting providers are happier -- high-rate
> > customer A need not affect low-budget customer B).
> 
> You won't get far with that because of the shared catalogs.  In
> particular, most DDL operations these days touch pg_shdepend ...

That's why you log shared activity to another WAL stream, and ship that
to everyone, while the other databases' WAL streams are shipped only to
the interested slaves.

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.


Re: Some ideas about Vacuum

From
Gregory Stark
Date:
"Tom Lane" <tgl@sss.pgh.pa.us> writes:

> Alvaro Herrera <alvherre@commandprompt.com> writes:
>> Heikki Linnakangas escribió:
>>> Another issue is that reading WAL is inherently not very scalable. There's
>>> only one WAL for the whole cluster, and it needs to be read sequentially,
>>> so it can easily become a bottleneck on large systems.
>
>> I have wondered why do we do it this way.  Is there a problem with
>> having one WAL per database, and another for general operations?  This
>> last WAL would have changes to shared tables, as well as global stuff
>> like "create database" or "create tablespace".
>
> It would only be useful to have one per spindle-dedicated-to-WAL, so
> tying the division to databases doesn't seem like it'd be a good idea.

I think one-per-database would help if you had a very particular type of
application which had a lot of equally busy databases. In general to eliminate
the bottleneck I think you would need to be able to break them up by process.
So two processes writing to the same table would be able to write to different
WAL logs.

That sounds hard but I'm not sure. It may not be as bad as it sounds.

--  Gregory Stark EnterpriseDB          http://www.enterprisedb.com Ask me about EnterpriseDB's Slony Replication
support!


Re: Some ideas about Vacuum

From
"Guillaume Smet"
Date:
On Jan 16, 2008 6:12 PM, Alvaro Herrera <alvherre@commandprompt.com> wrote:
> Tom Lane escribió:
> > Possibly true, but if that's the underlying hardware then there's no
> > performance benefit in breaking WAL up at all, no?
>
> Selective PITR shipping.

If it was possible to launch a PITR only on a given database, that
could be a great feature too. We have at least one customer who runs
every database in a separate cluster to be able to do PITR on only one
database if needed (for example if someone executed a DROP TABLE by
mistake).

--
Guillaume


Re: Some ideas about Vacuum

From
Greg Smith
Date:
On Wed, 16 Jan 2008, Alvaro Herrera wrote:

> Keep in mind that there are claims that a write-cache-enabled
> battery-backed RAID controller negates the effect of a separate spindle.

"Negates" is a bit strong; there's still some performance advantage on 
systems that write a serious amount of data.  It's certainly true that a 
BBC controller greatly reduces the need for a separate spindle.

It can be handy to keep it separate anyway because it makes it trivial to 
track WAL I/O vs. database I/O.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD


Re: Some ideas about Vacuum

From
"Kevin Grittner"
Date:
>>> On Wed, Jan 16, 2008 at 11:40 AM, in message
<Pine.GSO.4.64.0801161233260.20980@westnet.com>, Greg Smith
<gsmith@gregsmith.com> wrote:
> On Wed, 16 Jan 2008, Alvaro Herrera wrote:
>
>> Keep in mind that there are claims that a write-cache-enabled
>> battery-backed RAID controller negates the effect of a separate spindle.
>
> "Negates" is a bit strong; there's still some performance advantage on
> systems that write a serious amount of data.  It's certainly true that a
> BBC controller greatly reduces the need for a separate spindle.
I haven't seen any benchmarks on the list or in our environment
where the separate spindles gave more than a 1% increase in
performance when using a good-quality BBC controller.  Do you have
results that show more of a difference?  Can you share them?
-Kevin




Re: Some ideas about Vacuum

From
"Heikki Linnakangas"
Date:
Guillaume Smet wrote:
> On Jan 16, 2008 6:12 PM, Alvaro Herrera <alvherre@commandprompt.com> wrote:
>> Tom Lane escribió:
>>> Possibly true, but if that's the underlying hardware then there's no
>>> performance benefit in breaking WAL up at all, no?
>> Selective PITR shipping.
> 
> If it was possible to launch a PITR only on a given database, that
> could be a great feature too. We have at least one customer who runs
> every database in a separate cluster to be able to do PITR on only one
> database if needed (for example if someone executed a DROP TABLE by
> mistake).

Yeah, it sure would be nice.

I don't think it's going to work too well, though, not without major 
changes at least. What would happen when you restore a PITR backup of 
just one database? Would the other databases still be there in the 
restored cluster? What state would they be in? After restoring one 
database, and doing some stuff on it, could you ever "merge" those 
changes with the rest of the cluster?

Mind you, there's more things shared between databases than the shared 
catalogs. clog for example.

It might be useful for creating read-only copies of a master database, 
but I don't see it being very useful/possible in general.

For more usefulness, we'd need to keep databases more separate from each 
other than we do now. Databases would need to have their own transaction 
counters, for example. Shared relations would obviously need major 
changes for that to work. If we ultimately could separate databases so 
that you could take a filesystem copy of a single database, and restore 
it to another cluster, then per-database WAL and PITR would work.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: Some ideas about Vacuum

From
"Gokulakannan Somasundaram"
Date:

For more usefulness, we'd need to keep databases more separate from each
other than we do now. Databases would need to have their own transaction
counters, for example. Shared relations would obviously need major
changes for that to work. If we ultimately could separate databases so
that you could take a filesystem copy of a single database, and restore
it to another cluster, then per-database WAL and PITR would work.

I agree that we can't have a separate WAL per database. It looks like it makes more sense to create a separate database cluster, instead of adding one more database, if we want to make better use of the available horse power and we don't have cross-database queries.

Thanks,
Gokul.

Re: Some ideas about Vacuum

From
"Guillaume Smet"
Date:
On Jan 16, 2008 7:41 PM, Heikki Linnakangas <heikki@enterprisedb.com> wrote:
> I don't think it's going to work too well, though, not without major
> changes at least.

Well, I know it's really not doable with the current behaviour of WAL.
I just wanted to point out this feature request because we have had it a few
times, having one cluster per database is not really smart, and it
wasn't too far from the subject.

> What would happen when you restore a PITR backup of
> just one database? Would the other databases still be there in the
> restored cluster?

In a perfect world, you should be able to trigger the PITR for only
one database of the cluster with the other databases still running.
It's especially interesting on a shared database server.

--
Guillaume


Re: Some ideas about Vacuum

From
Alvaro Herrera
Date:
Heikki Linnakangas escribió:

> I don't think it's going to work too well, though, not without major 
> changes at least. What would happen when you restore a PITR backup of just 
> one database? Would the other databases still be there in the restored 
> cluster? What state would they be in? After restoring one database, and 
> doing some stuff on it, could you ever "merge" those changes with the rest 
> of the cluster?

Well, a PITR slave, after you change it, cannot be brought in sync with
the master.  This is not different.

If you replicate a single database's stream, the other databases should
not be there.  My idea is that a slave could request multiple databases'
streams.  The ability to do it is needed anyway, to follow both the
basic database stream and the shared stream.

> Mind you, there's more things shared between databases than the shared 
> catalogs. clog for example.

Sure --- my original proposal mentioned the use of the shared WAL stream
for global objects (though I didn't mention pg_clog, but surely it had
better be there).

> For more usefulness, we'd need to keep databases more separate from each 
> other than we do now. Databases would need to have their own transaction 
> counters, for example.

Hmm, why?  Perhaps you are right but I don't see the reason.

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


Re: Some ideas about Vacuum

From
"Heikki Linnakangas"
Date:
Alvaro Herrera wrote:
> Heikki Linnakangas escribió:
>> For more usefulness, we'd need to keep databases more separate from each 
>> other than we do now. Databases would need to have their own transaction 
>> counters, for example.
> 
> Hmm, why?  Perhaps you are right but I don't see the reason.

If each database was stand-alone, you would need only one base backup 
and WAL per database to restore, instead of base backup and WAL of the 
database, and base backup and WAL of shared stuff. You could backup one 
database in cluster, restore it somewhere else, and later copy it back 
to the original cluster. You could back up one database at a time, and 
restore the whole cluster from the N per-database backups.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: Some ideas about Vacuum

From
Greg Smith
Date:
On Wed, 16 Jan 2008, Kevin Grittner wrote:

> I haven't seen any benchmarks on the list or in our environment
> where the separate spindles gave more than a 1% increase in
> performance when using a good-quality BBC controller.

Well, even 1% isn't nothing, which is the main point I was making--it 
doesn't completely remove the gain, just reduces it a lot.  If you wanted 
to see a bigger difference you could simulate a workload with lots of 
clients doing short transactions.

The biggest gain in having a separate WAL isn't as obvious in gross 
performance measurements.  It's what happens to worst-case performance for 
the transactions just after a checkpoint, when there is a burst of more 
full page writes (they normally settle down as the most popular pages get 
written).  That's the spot where you're most likely to run into a WAL 
bottleneck that just having a BBC doesn't completely eliminate.

> Do you have results that show more of a difference?  Can you share them?

I wasn't trying to quantify this particular number and it would take a bit 
just to figure out what I could and couldn't share.  Expanding on the 
above, though, if you look some of the recent public benchmarks like 
http://www.kaltenbrunner.cc/blog/index.php?/archives/21-8.3-vs.-8.2-a-simple-benchmark.html 
you'll see Stefan was able to hit around 4000 TPS on that test system. 
Now, if you had a typical 256MB BBC (it's 512MB there) and full-page 
writes are dumping 8K each, that means you can fit 32768 of them before 
you blow your cache and the disks really have to keep up--and the WAL 
doesn't get the whole cache to itself.  The first 10-20 seconds after a 
checkpoint on such a system are kind of interesting to zoom in on.  If the 
WAL has to fight for seek time with database reads during that period (DB 
writes will still be mostly cached by the OS just after a checkpoint) it 
can be messy compared to what you get with a dedicated WAL.  But that will 
average out to a minimal effect on TPS over the course of the test.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD