Some ideas about Vacuum - Mailing list pgsql-hackers

From Gokulakannan Somasundaram
Subject Some ideas about Vacuum
Date
Msg-id 9362e74e0801090239y4fde9d93tac044907df842c39@mail.gmail.com
Whole thread Raw
Responses Re: Some ideas about Vacuum
List pgsql-hackers
Hi,<br />    May be i am reposting something which has been discussed to end in this forum. I have made a search in the
archivesand i couldn't find any immediately. <br />    With my relatively small experience in Performance Testing and
Tuning,one of the rules of thumb for getting Performance is "Don't do it, if you don't need to do it". When we look at
clearingthe older versions of tuples from our tables in PostgreSQL, we can't stop thinking about how it is done in
otherdatabases. When we compare the Oracle Undo Log approach with the Postgresql Vacuum approach, the pattern looks
verysimilar to C++ memory de-allocation and Java garbage collection. <br />      So, as you may all know, the thing
whichworries us about Vacuum is that it is going to places where it need not goto. That's when we are thinking about
Deadspace Map. This dead space map is a map, if implemented correctly, would guide Vacuum to go and only look at places
wherethere was some activity of Delete/Update/Insert after the last Vacuum. This is accomplished at the cost of some
verysmall overhead to Inserts/Deletes/Updates. <br />      Dead space Map is like an undo-log, if we think its role is
toget rid of the older versions of data. Instead of moving the tuples to separate location, it guides the Vacuum
processto do the cleanup task. May be we can even think of something like Dead space log, which may not be a bitmap. In
thislog, transactions might enter their transaction ids and ctids, which can be scanned by the Vacuum process. While
thismight take more space, it is with lesser contention, while compared to Dead space Map. To me, as far as i can think
of,the only advantage of Dead space Map over Dead space log is the disk space. <br />     It just strikes me that WAL
logis already doing just that. I think you can follow my thought-line. If we can ask the Vacuum process to scan the WAL
log,it can get all the relevant details on where it needs to go. One optimization, that can be placed here is to
somehowmake the archiver do a double-job of helping the Vacuum, while doing the archiving. For people, who have
switchedoff archiving, this might not be a benefit. <br />    One main restriction it places on the WAL Logs is that
theWAL Log needs to be archived only after all the transactions in it completes. In other words, WAL logs need to be
givenenough space, to survive the longest transaction of the database. It is possible to avoid this situation by asking
theVacuum process to take the necessary information out of WAL log and store it somewhere and wait for the long running
transactionto complete. <br />    The information of interest in WAL is only the table inserts/updates/deletes. So if
everyoneaccepts that this is a good idea, till this point, there is a point in reading further.<br />    Ultimately,
whathas been achieved till now is that we have made the sequential scans made by the Vacuum process on each table into
afew random i/os. Of course there are optimizations possible to group the random i/os and find some sequential i/o out
ofit. But still we need to do a full index scan for all those indexes out there. HOT might have saved some work over
there.But i am pessimistic here and wondering how it could have been improved. So it just strikes me, we can do the
samething which we did just with the tables. Convert a seq scan of the entire table into a random scan of few blocks.
Wecan read the necessary tuple information from the tuples, group them and hit at the index in just those blocks and
cleanit up. <br />   I can already hear people, saying that it is not always possible to go back to index from table.
Thereis this culprit called unstable function based indexes. The structure stops us from going back to index from
table.So currently we should restrict the above said approach to only normal indexes(not the function based ones). I
hopeit would still give a good benefit. <br />   Of course Vacuum can convert the few random scans into a seq scan, if
requiredby referring to table statistics.<br /><br />   Thoughts about the idea????<br /><br />Thanks,<br />Gokul.<br
/><br/>P.S.:  Let the objections/opposing views have a subtle reduction in its harshness. <br /> 

pgsql-hackers by date:

Previous
From: "Ilya A. Kovalenko"
Date:
Subject: operator suggest " interval / interval = numeric"
Next
From: Michael Akinde
Date:
Subject: Re: VACUUM FULL out of memory