Thread: remove flatfiles.c
This patch removes flatfiles.c for good. It doesn't change the keeping of locks in dbcommands.c and user.c, because at least some of them are still required. Regarding sync commits that previously happened and now won't, I think the only case worth worrying about is the one in vacuum.c. Do we need a ForceSyncCommit() in there? I'm not sure if vacuum itself already forces sync commit. -- Alvaro Herrera http://www.CommandPrompt.com/ The PostgreSQL Company - Command Prompt, Inc.
Alvaro Herrera <alvherre@commandprompt.com> writes: > This patch removes flatfiles.c for good. Aw, you beat me to it. > Regarding sync commits that previously happened and now won't, I think the > only case worth worrying about is the one in vacuum.c. Do we need a > ForceSyncCommit() in there? I'm not sure if vacuum itself already > forces sync commit. Hmm, I had been assuming we wouldn't need that anymore. regards, tom lane
Tom Lane wrote: > Alvaro Herrera <alvherre@commandprompt.com> writes: > > Regarding sync commits that previously happened and now won't, I think the > > only case worth worrying about is the one in vacuum.c. Do we need a > > ForceSyncCommit() in there? I'm not sure if vacuum itself already > > forces sync commit. > > Hmm, I had been assuming we wouldn't need that anymore. The comment in user.c and dbcommands.c says

    /*
     * Force synchronous commit, thus minimizing the window between
     * creation of the database files and commit of the transaction. If
     * we crash before committing, we'll have a DB that's taking up disk
     * space but is not in pg_database, which is not good.
     */
    ForceSyncCommit();

so I think those ones are still necessary. There's another call in RenameDatabase() which I don't think needs a sync commit (because it won't change the dir name), and one in vacuum.c:

      /*
 !    * If we were able to advance datfrozenxid, mark the flat-file copy of
 !    * pg_database for update at commit, and see if we can truncate pg_clog.
 !    * Also force update if the shared XID-wrap-limit info is stale.
      */
      if (dirty || !TransactionIdLimitIsValid())
 -    {
 -        database_file_update_needed();
          vac_truncate_clog(newFrozenXid);
 -    }
  }

AFAICT this doesn't need a sync commit. (Right now, VACUUM FULL forces one, but lazy vacuum doesn't). -- Alvaro Herrera http://www.CommandPrompt.com/ PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Alvaro Herrera <alvherre@commandprompt.com> writes: > Tom Lane wrote: >> Hmm, I had been assuming we wouldn't need that anymore. > The comment in user.c and dbcommands.c says [...] > so I think those ones are still necessary. Yeah, after a look through the code I think you can trust the associated comments: if it says it needs sync commit, put in ForceSyncCommit, else we don't need it. regards, tom lane
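For context: the window ForceSyncCommit() closes is the same one users can open deliberately with the synchronous_commit setting (8.3 and later); ForceSyncCommit() is the internal per-transaction override that keeps operations like CREATE DATABASE synchronous regardless of that setting. A minimal illustration of the user-visible side, with a hypothetical audit_log table:

    BEGIN;
    -- Async commit: lower commit latency, but the transaction can be lost
    -- if the server crashes before the WAL is flushed. ForceSyncCommit()
    -- exists so that critical internal operations ignore this setting.
    SET LOCAL synchronous_commit = off;
    INSERT INTO audit_log VALUES (now(), 'low-latency, possibly lost on crash');
    COMMIT;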
On Mon, 2009-08-31 at 18:53 -0400, Alvaro Herrera wrote: > Regarding sync commits that previously happened and now won't, I think the > only case worth worrying about is the one in vacuum.c. Do we need a > ForceSyncCommit() in there? I'm not sure if vacuum itself already > forces sync commit. VACUUM FULL requires ForceSyncCommit(). Not sure why removing them elsewhere is important? Getting robustness wrong is a big, bad thing and this opens us to future error. We already tuned VACUUM so it does very little if it has no work to do, why would one extra I/O improve things so much? If it ain't broke... VACUUM does so many things that I'd rather have it all safely on disk. I'd feel happier with the rule "VACUUM always sync commits", so we all remember it and can rely upon it to be the same from release to release. -- Simon Riggs www.2ndQuadrant.com
Simon Riggs <simon@2ndQuadrant.com> writes: > VACUUM does so many things that I'd rather have it all safely on disk. > I'd feel happier with the rule "VACUUM always sync commits", so we all > remember it and can rely upon it to be the same from release to release. Non-FULL vacuum has *never* done a sync commit, except in the unusual corner case that it moves the database's datfrozenxid, which is a corner case that didn't even exist until fairly recently. I think the argument that we should have it force sync for no reason whatsoever is silly. We get beat up on a regular basis about "spikes" in response time; why would you want to have vacuum creating one when it doesn't need to? As for the FULL case, the sync commit is to try to protect a horribly unsafe kluge that should go away entirely (if vacuum full itself doesn't go away entirely). That's hardly something I want to institutionalize either. regards, tom lane
On Tue, 2009-09-01 at 09:58 -0400, Tom Lane wrote: > We get beat up on a regular basis about "spikes" in response time; > why would you want to have vacuum creating one when it doesn't need > to? If one I/O on a background utility can cause such a spike, we are in serious shitake. I would be more comfortable if the various important things VACUUM does were protected by sync commit. I see no reason to optimise away one I/O just because we might theoretically do so. Any mistake in the theory and we are exposed. Why take the risk? We do many things to check and secure our data, why not this one? If this was suggested separately as an optimisation you'd laugh and say why bother? -- Simon Riggs www.2ndQuadrant.com
Simon Riggs <simon@2ndQuadrant.com> writes: > On Tue, 2009-09-01 at 09:58 -0400, Tom Lane wrote: >> We get beat up on a regular basis about "spikes" in response time; >> why would you want to have vacuum creating one when it doesn't need >> to? > If one I/O on a background utility can cause such a spike, we are in > serious shitake. I would be more comfortable if the various important > things VACUUM does were protected by sync commit. I see no reason to > optimise away one I/O just because we might theoretically do so. Any > mistake in the theory and we are exposed. Why take the risk? *WHAT* risk? Most vacuums do not do a sync commit, and never have. regards, tom lane
On Tue, Sep 1, 2009 at 2:58 PM, Tom Lane<tgl@sss.pgh.pa.us> wrote: > We get beat up on a regular basis about "spikes" in response time; > why would you want to have vacuum creating one when it doesn't need to? Isn't this sync commit just going to do the same thing that the wal writer is going to do in at most 200ms anyways? > As for the FULL case, the sync commit is to try to protect a horribly > unsafe kluge that should go away entirely (if vacuum full itself doesn't > go away entirely). I'm all for throwing away VACUUM FULL btw. I was thinking of proposing that we replace it with something like CLUSTER which just rewrites the tuples in the order it finds them. The use cases where VACUUM FULL wins currently are where storing two copies of the table and its indexes concurrently just isn't practical. Also perhaps tables where there are too many large indexes to make rebuilding them all in one maintenance window practical. I don't see any way to address these problems without something as complex as xvac and moved_in/moved_off and without the index bloat problems. I think we could improve the i/o access patterns we have currently which make vacuum full so slow, but the fundamental problems would remain. So the question is whether those use cases are worth keeping our existing vacuum full for or whether we could do without it and just recommend partitioning for people with tables large enough to make table rewrites impractical. -- greg http://mit.edu/~gsstark/resume.pdf
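For reference, the CLUSTER-style rewrite Greg describes already exists as a manual workaround; a sketch with hypothetical table and index names (both forms rewrite the whole table, so they need disk room for a second copy of the table plus its indexes):

    -- Rewrite the table in index order, rebuilding every index as it goes.
    CLUSTER bigtable USING bigtable_pkey;

    -- A rewriting ALTER TABLE compacts the table as a side effect; at the
    -- time of this thread (8.4), even a no-op type change forced a rewrite.
    ALTER TABLE bigtable ALTER COLUMN id TYPE integer;  -- id assumed already integer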
Greg Stark wrote: > The use cases where VACUUM FULL wins currently are where storing two > copies of the table and its indexes concurrently just isn't practical. Yeah, but then do you really need to use VACUUM FULL? If that's really a problem then there ain't that many dead tuples around. > Also perhaps tables where there are too many large indexes to make > rebuilding them all in one maintenance window practical. If that's the concern maybe we oughta do something about concurrently re-creating those indexes somehow. Plain REINDEX doesn't work of course, but maybe we can do some trick with creating a new index and dropping the original one afterwards. -- Alvaro Herrera http://www.CommandPrompt.com/ PostgreSQL Replication, Consulting, Custom Development, 24x7 support
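The create-then-drop trick Alvaro mentions is already expressible, with the caveat that it doesn't cover indexes backing UNIQUE or PRIMARY KEY constraints; a minimal sketch with hypothetical names:

    -- Build the replacement without blocking writes, then swap it in.
    CREATE INDEX CONCURRENTLY bigtable_col_idx_new ON bigtable (col);
    DROP INDEX bigtable_col_idx;  -- brief exclusive lock
    ALTER INDEX bigtable_col_idx_new RENAME TO bigtable_col_idx;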
On Wed, Sep 2, 2009 at 12:01 AM, Alvaro Herrera<alvherre@commandprompt.com> wrote: >> The use cases where VACUUM FULL wins currently are where storing two >> copies of the table and its indexes concurrently just isn't practical. > > Yeah, but then do you really need to use VACUUM FULL? If that's really > a problem then there ain't that many dead tuples around. That's what I want to believe. But picture if you have, say a 1-terabyte table which is 50% dead tuples and you don't have a spare 1-terabytes to rewrite the whole table. >> Also perhaps tables where there are too many large indexes to make >> rebuilding them all in one maintenance window practical. > > If that's the concern maybe we oughta do something about concurrently > re-creating those indexes somehow. Plain REINDEX doesn't work of > course, but maybe we can do some trick with creating a new index and > dropping the original one afterwards. Well that doesn't really work if you want to rewrite the table. CLUSTER has to rebuild all the indexes when it's done. I think the solution for both of these is actually partitioning. The bottom line is that having a single table which contains very large amounts of data is awkward to maintain. -- greg http://mit.edu/~gsstark/resume.pdf
Greg Stark <gsstark@mit.edu> writes: > On Wed, Sep 2, 2009 at 12:01 AM, Alvaro > Herrera<alvherre@commandprompt.com> wrote: >>> The use cases where VACUUM FULL wins currently are where storing two >>> copies of the table and its indexes concurrently just isn't practical. >> >> Yeah, but then do you really need to use VACUUM FULL? If that's really >> a problem then there ain't that many dead tuples around. > That's what I want to believe. But picture if you have, say a > 1-terabyte table which is 50% dead tuples and you don't have a spare > 1-terabytes to rewrite the whole table. But trying to VACUUM FULL that table is going to be horridly painful too, and you'll still have bloated indexes afterwards. You might as well just live with the 50% waste, especially since if you did a full-table update once you'll probably do it again sometime. I'm having a hard time believing that VACUUM FULL really has any interesting use-case anymore. regards, tom lane
Greg Stark wrote: > > That's what I want to believe. But picture if you have, say a > 1-terabyte table which is 50% dead tuples and you don't have a spare > 1-terabytes to rewrite the whole table. Could one hypothetically do

    update bigtable set pk = pk where ctid in (select ctid from bigtable order by ctid desc limit 100);
    vacuum;

and repeat until max(ctid) is small enough? Sure, it'll take longer than vacuum full; but at first glance it seems lightweight enough to do even on a live, heavily accessed table. IIRC I tried something like this once, and it worked to some extent, but after a few loops didn't shrink the table as much as I had expected.
Ron Mayer wrote: > Greg Stark wrote: > > > > That's what I want to believe. But picture if you have, say a > > 1-terabyte table which is 50% dead tuples and you don't have a spare > > 1-terabytes to rewrite the whole table. > > Could one hypothetically do > update bigtable set pk = pk where ctid in (select ctid from bigtable order by ctid desc limit 100); > vacuum; > and repeat until max(ctid) is small enough? I remember Hannu Krosing said they used something like that to shrink really bloated tables. Maybe we should try to explicitly support a mechanism that worked in that fashion. I think I tried it at some point and found that the problem with it was that ctid was too limited in what it was able to do. The neat thing is that now that we have the visibility fork, vacuum needn't scan the whole table each time. -- Alvaro Herrera http://www.CommandPrompt.com/ The PostgreSQL Company - Command Prompt, Inc.
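A sketch of how the technique in the two messages above could be driven repeatedly, assuming a hypothetical bigtable with primary key pk (VACUUM can't run inside a function, so it stays in the outer loop, e.g. a shell script):

    -- One shrink step: rewrite the n rows nearest the end of the heap so a
    -- following VACUUM can truncate the tail.
    CREATE FUNCTION shrink_step(integer) RETURNS void AS $$
        UPDATE bigtable SET pk = pk
         WHERE ctid IN (SELECT ctid FROM bigtable ORDER BY ctid DESC LIMIT $1);
    $$ LANGUAGE sql;

    -- Repeat until max(ctid) stops falling:
    SELECT shrink_step(100);
    VACUUM bigtable;
    SELECT max(ctid) FROM bigtable;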
On Tue, Sep 1, 2009 at 7:42 PM, Tom Lane<tgl@sss.pgh.pa.us> wrote: > Greg Stark <gsstark@mit.edu> writes: >> On Wed, Sep 2, 2009 at 12:01 AM, Alvaro >> Herrera<alvherre@commandprompt.com> wrote: >>>> The use cases where VACUUM FULL wins currently are where storing two >>>> copies of the table and its indexes concurrently just isn't practical. >>> >>> Yeah, but then do you really need to use VACUUM FULL? If that's really >>> a problem then there ain't that many dead tuples around. > >> That's what I want to believe. But picture if you have, say a >> 1-terabyte table which is 50% dead tuples and you don't have a spare >> 1-terabytes to rewrite the whole table. > > But trying to VACUUM FULL that table is going to be horridly painful > too, and you'll still have bloated indexes afterwards. You might as > well just live with the 50% waste, especially since if you did a > full-table update once you'll probably do it again sometime. > > I'm having a hard time believing that VACUUM FULL really has any > interesting use-case anymore. What if your large table doesn't have an index? Then there's no way to cluster. I'm a bit skeptical about partitioning as a solution, too. The planner is just not clever enough with partitioned tables, yet. ...Robert
On Tue, Sep 1, 2009 at 9:29 PM, Alvaro Herrera<alvherre@commandprompt.com> wrote: > Ron Mayer wrote: >> Greg Stark wrote: >> > >> > That's what I want to believe. But picture if you have, say a >> > 1-terabyte table which is 50% dead tuples and you don't have a spare >> > 1-terabytes to rewrite the whole table. >> >> Could one hypothetically do >> update bigtable set pk = pk where ctid in (select ctid from bigtable order by ctid desc limit 100); >> vacuum; >> and repeat until max(ctid) is small enough? > > I remember Hannu Krosing said they used something like that to shrink > really bloated tables. Maybe we should try to explicitely support a > mechanism that worked in that fashion. I think I tried it at some point > and found that the problem with it was that ctid was too limited in what > it was able to do. I think a way to incrementally shrink large tables would be enormously beneficial. Maybe vacuum could try to do a bit of that each time it runs. ...Robert
Robert Haas wrote: > On Tue, Sep 1, 2009 at 7:42 PM, Tom Lane<tgl@sss.pgh.pa.us> wrote: > > But trying to VACUUM FULL that table is going to be horridly painful > > too, and you'll still have bloated indexes afterwards. You might as > > well just live with the 50% waste, especially since if you did a > > full-table update once you'll probably do it again sometime. > > > > I'm having a hard time believing that VACUUM FULL really has any > > interesting use-case anymore. > > What if your large table doesn't have an index? Then there's no way to cluster. But there's nothing saying we cannot provide a version of CLUSTER that does not follow any index and just copies the live tuples. -- Alvaro Herrera http://www.CommandPrompt.com/ PostgreSQL Replication, Consulting, Custom Development, 24x7 support
On Tue, Sep 1, 2009 at 10:58 PM, Alvaro Herrera<alvherre@commandprompt.com> wrote: > Robert Haas escribió: >> On Tue, Sep 1, 2009 at 7:42 PM, Tom Lane<tgl@sss.pgh.pa.us> wrote: > >> > But trying to VACUUM FULL that table is going to be horridly painful >> > too, and you'll still have bloated indexes afterwards. You might as >> > well just live with the 50% waste, especially since if you did a >> > full-table update once you'll probably do it again sometime. >> > >> > I'm having a hard time believing that VACUUM FULL really has any >> > interesting use-case anymore. >> >> What if your large table doesn't have an index? Then there's no way to cluster. > > But there's nothing saying we cannot provide a version of CLUSTER that > does not follow any index and just copies the live tuples. Agreed. ...Robert
On Tue, Sep 1, 2009 at 19:34, Greg Stark <gsstark@mit.edu> wrote:
> On Wed, Sep 2, 2009 at 12:01 AM, Alvaro Herrera<alvherre@commandprompt.com> wrote:
>>> The use cases where VACUUM FULL wins currently are where storing two
>>> copies of the table and its indexes concurrently just isn't practical.
>>
>> Yeah, but then do you really need to use VACUUM FULL? If that's really
>> a problem then there ain't that many dead tuples around.
>
> That's what I want to believe. But picture if you have, say a
> 1-terabyte table which is 50% dead tuples and you don't have a spare
> 1-terabytes to rewrite the whole table.

It would be interesting if there was something between VACUUM FULL and CLUSTER which could, say, work on a single 1GB segment at a time in a manner similar to cluster.

You would still end up with index bloat like vacuum full, though perhaps not as bad, but shuffling around the tuples should be faster.

The idea here is that the files can be truncated individually. Two 500MB files is pretty much the same as a single 1GB file on disk.

Of course, I'm hand waving and don't have the technical expertise to figure out if it can be done easily within PostgreSQL.
On Tue, Sep 1, 2009 at 9:55 PM, Robert Haas<robertmhaas@gmail.com> wrote: > > I'm a bit skeptical about partitioning as a solution, too. The > planner is just not clever enough with partitioned tables, yet. Analyzing and vacuuming a *very* big table, and even scanning a huge index, is no joke either... and yes, the planner is not very clever about partitioning, and certainly that is something we need to fix, not something we have to live with... not that it will be easy, but hey! we have very brilliant people here (you being one of them) -- Regards, Jaime Casanova PostgreSQL support and training Systems consulting and development Guayaquil - Ecuador Cel. +59387171157
On Wed, Sep 2, 2009 at 6:30 AM, Jaime Casanova<jcasanov@systemguards.com.ec> wrote: > On Tue, Sep 1, 2009 at 9:55 PM, Robert Haas<robertmhaas@gmail.com> wrote: >> >> I'm a bit skeptical about partitioning as a solution, too. The >> planner is just not clever enough with partitioned tables, yet. Yeah, we need to fix that :) I think we're already reaching the point where the pains of dealing with partitioned tables are usually less than the pains of dealing with VACUUM FULL. > Analyzing and vacuuming a *very* big table, and even scanning a huge index, > is no joke either... Hm, not sure I see this. The sample size for Analyze is not dependent on the size of the table. Only on the stats_target. And vacuum with the VM is now going to be dependent only on the number of updates to the table, not on the size of the table. The problem use cases we have today are only when you really do have enough dead space to clean up that you want to compact the file -- but not so much that it's worth rewriting the whole table using CLUSTER or ALTER TABLE. Perhaps we should go one version with a enable_legacy_full_vacuum which defaults to off. That would at least let us hear about use cases where people are unhappy with a replacement. I did start a while ago on a replacement which used the existing rewrite mechanism to do the equivalent of cluster without changing the ordering. I forget where I left that but I could go back and look at it. I'll be busy for the next few weeks though so it won't be right away. -- greg http://mit.edu/~gsstark/resume.pdf
All, > I'm having a hard time believing that VACUUM FULL really has any > interesting use-case anymore. Basically, for: a) people who don't understand CLUSTER (easily fixed, simply create a VACUUM FULL command which just does CLUSTER on the primary key) b) people who are completely out of space on disk and are trying to shrink the database to free up space. For (b), I think it's OK to just tell those people that they need to move the database files or find something else to delete. Most of the time, they have to do that *anyway* in order for VACUUM FULL to work, since the transaction log is on the same disk. We just need a little more documentation, is all. > The problem use cases we have today are only when you really do have > enough dead space to clean up that you want to compact the file -- but > not so much that it's worth rewriting the whole table using CLUSTER or > ALTER TABLE. I haven't seen this use-case in the field. I'm not sure that it actually exists. Anyone run across a case where this made sense? Recently I actually had a client dump and reload their database rather than running VACUUM FULL; a reload took 4 hours but VACUUM FULL took more than 18. > Perhaps we should go one version with a enable_legacy_full_vacuum > which defaults to off. That would at least let us hear about use cases > where people are unhappy with a replacement. I think we do need to do this, just because people won't have changed their admin scripts. But the goal should be to dump VACUUM FULL entirely by 8.6 if we *don't* get serious use-cases. > I did start a while ago on a replacement which used the existing > rewrite mechanism to do the equivalent of cluster without changing the > ordering. I forget where I left that but I could go back and look at > it. I'll be busy for the next few weeks though so it won't be right > away. This would be very helpful. -- Josh Berkus PostgreSQL Experts Inc. www.pgexperts.com
On Wed, 2009-09-02 at 10:41 -0700, Josh Berkus wrote: > All, > > > > I'm having a hard time believing that VACUUM FULL really has any > > interesting use-case anymore. > > Basically, for: > a) people who don't understand CLUSTER (easily fixed, simply create a > VACUUM FULL command which just does CLUSTER on the primary key) > > b) people who are completely out of space on disk and are trying to > shrink the database to free up space. > > For (b), I think it's OK to just tell those people that they need to > move the database files or find something else to delete. Most of the > time, they have to do that *anyway* in order for VACUUM FULL to work, > since the transaction log is on the same disk. We just need a little > more documentation, is all. Right. > > > The problem use cases we have today are only when you really do have > > enough dead space to clean up that you want to compact the file -- but > > not so much that it's worth rewriting the whole table using CLUSTER or > > ALTER TABLE. > > I haven't seen this use-case in the field. I'm not sure that it > actually exists. Anyone run across a case where this made sense? > No. > Recently I actually had a client dump and reload their database rather > than running VACUUM FULL; a reload took 4 hours but VACUUM FULL took > more than 18. > Exactly. > > Perhaps we should go one version with a enable_legacy_full_vacuum > > which defaults to off. That would at least let us hear about use cases > > where people are unhappy with a replacement. > > I think we do need to do this, just because people won't have changed > their admin scripts. But the goal should be to dump VACUUM FULL > entirely by 8.6 if we *don't* get serious use-cases. > Agreed, but I think we shouldn't even put it in the postgresql.conf by default. Just document that it exists. Settings for the sake of settings (even ones that may have a corner case) seem to confuse users. Joshua D. Drake -- PostgreSQL.org Major Contributor Command Prompt, Inc: http://www.commandprompt.com/ - 503.667.4564 Consulting, Training, Support, Custom Development, Engineering
On Wed, Sep 2, 2009 at 6:41 PM, Josh Berkus<josh@agliodbs.com> wrote: > All, > > >> I'm having a hard time believing that VACUUM FULL really has any >> interesting use-case anymore. > > Basically, for: > a) people who don't understand CLUSTER (easily fixed, simply create a > VACUUM FULL command which just does CLUSTER on the primary key) I don't think we want to cluster on the primary key. I think we just want to rewrite the table keeping the same physical ordering. >> The problem use cases we have today are only when you really do have >> enough dead space to clean up that you want to compact the file -- but >> not so much that it's worth rewriting the whole table using CLUSTER or >> ALTER TABLE. > > I haven't seen this use-case in the field. I'm not sure that it > actually exists. Anyone run across a case where this made sense? Well I've certainly seen people whose disks are more than 50% full. They tend to be the same people who want to compact their tables. I can't say whether any of them had a single table with associated indexes that were taking up more than 50% but it's not uncommon to have a single table that dominates your database. > Recently I actually had a client dump and reload their database rather > than running VACUUM FULL; a reload took 4 hours but VACUUM FULL took > more than 18. > >> Perhaps we should go one version with a enable_legacy_full_vacuum >> which defaults to off. That would at least let us hear about use cases >> where people are unhappy with a replacement. > > I think we do need to do this, just because people won't have changed > their admin scripts. But the goal should be to dump VACUUM FULL > entirely by 8.6 if we *don't* get serious use-cases. We could deal with the admin scripts by making VACUUM FULL do the new behaviour. But I actually don't really like that. I would prefer to break VACUUM FULL since anyone doing it routinely is probably mistaken. We could name the command something which is more descriptive like VACUUM REWRITE or VACUUM REBUILD or something like that. -- greg http://mit.edu/~gsstark/resume.pdf
Greg Stark <gsstark@mit.edu> wrote: > I don't think we want to cluster on the primary key. I think we just > want to rewrite the table keeping the same physical ordering. Well if that's what you want to do, couldn't you do something like?:

Lock the table.
Drop all indexes.
Pass over the heap with two pointers, one to the first available empty space and one to the first non-dead row past that, and move inside the existing file.
Rebuild the indexes.
Release the lock.

-Kevin
Greg, > I don't think we want to cluster on the primary key. I think we just > want to rewrite the table keeping the same physical ordering. Agreed. > Well I've certainly seen people whose disks are more than 50% full. > They tend to be the same people who want to compact their tables. I > can't say whether any of them had a single table with associated > indexes that were taking up more than 50% but it's not uncommon to > have a single table that dominates your database. Those people would also need the tables involved to be fairly small, or to be able to afford a lot of downtime. VACUUM FULL on a 100GB table with current commodity servers can take upwards of 8 hours. I really think the set of people who have more available downtime than disk space is a vanishingly small group. However, I'll do a survey. Why not? > We could deal with the admin scripts by making VACUUM FULL do the new > behaviour. But I actually don't really like that. I would prefer to > break VACUUM FULL since anyone doing it routinely is probably > mistaken. We could name the command something which is more > descriptive like VACUUM REWRITE or VACUUM REBUILD or something like > that. Agreed. I like VACUUM REWRITE, as it makes it fairly clear what's going on. -- Josh Berkus PostgreSQL Experts Inc. www.pgexperts.com
On Wed, 2009-09-02 at 11:01 -0700, Josh Berkus wrote: > Greg, > > > I don't think we want to cluster on the primary key. I think we just > > want to rewrite the table keeping the same physical ordering. > > Agreed. Are we sure about that? I would argue that the majority of users out there (think Django and other Web*Frameworks) are all searching primarily by primary key + other anyway. We could always offer the ability to vacuum full (cluster) on index foo but default to the primary key. Joshua D. Drake -- PostgreSQL.org Major Contributor Command Prompt, Inc: http://www.commandprompt.com/ - 503.667.4564 Consulting, Training, Support, Custom Development, Engineering
On Wed, Sep 2, 2009 at 1:52 PM, Greg Stark<gsstark@mit.edu> wrote: > We could deal with the admin scripts by making VACUUM FULL do the new > behaviour. But I actually don't really like that. I wold prefer to > break VACUUM FULL since anyone doing it routinely is probably > mistaken. So I have a script that goes and finds bloated tables and runs VACUUM FULL on them in the middle of the night if the bloat passes a certain threshold. The tables are small enough and the number of users is low enough that this doesn't cause any problems for me. I'm OK if the name of the command changes, but I'd like there to be a command that I can pass a table name to and get my table debloated without having to make any follow-on decisions (such as picking an index to cluster by). ...Robert
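A rough sketch of the kind of check such a script might run, using only planner statistics (so it is an estimate, and only as fresh as the last ANALYZE; real tools of the era such as check_postgres use more careful width-based math):

    -- Tables whose live-tuples-per-page ratio looks unusually low for
    -- their row width are candidates for debloating.
    SELECT c.relname,
           c.relpages,
           c.reltuples,
           round((c.reltuples / nullif(c.relpages, 0))::numeric, 1) AS tuples_per_page
      FROM pg_class c
      JOIN pg_namespace n ON n.oid = c.relnamespace
     WHERE c.relkind = 'r'
       AND n.nspname NOT IN ('pg_catalog', 'information_schema')
     ORDER BY c.relpages DESC
     LIMIT 20;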
Greg Stark <gsstark@mit.edu> writes: > On Wed, Sep 2, 2009 at 6:41 PM, Josh Berkus<josh@agliodbs.com> wrote: >>> Perhaps we should go one version with a enable_legacy_full_vacuum >>> which defaults to off. That would at least let us hear about use cases >>> where people are unhappy with a replacement. >> >> I think we do need to do this, just because people won't have changed >> their admin scripts. But the goal should be to dump VACUUM FULL >> entirely by 8.6 if we *don't* get serious use-cases. > We could deal with the admin scripts by making VACUUM FULL do the new > behaviour. But I actually don't really like that. I would prefer to > break VACUUM FULL since anyone doing it routinely is probably > mistaken. We could name the command something which is more > descriptive like VACUUM REWRITE or VACUUM REBUILD or something like > that. What's wrong with just ignoring the FULL option? It's a reserved word anyway because of FULL OUTER JOINs, so there's no syntactic benefit to be had from eliminating it from the VACUUM syntax. regards, tom lane
Robert Haas <robertmhaas@gmail.com> writes: > So I have a script that goes and finds bloated tables and runs VACUUM > FULL on them in the middle of the night if the bloat passes a certain > threshold. The tables are small enough and the number of users is low > enough that this doesn't cause any problems for me. I'm OK if the > name of the command changes, but I'd like there to be a command that I > can pass a table name to and get my table debloated without having to > make any follow-on decisions (such as picking an index to cluster by). I think we *should* have a command that works like CLUSTER except it just seqscans the source table without depending on any particular index. Whether it's called VACUUM FULL or something else is a detail. regards, tom lane
On Wed, Sep 2, 2009 at 6:57 PM, Kevin Grittner<Kevin.Grittner@wicourts.gov> wrote: > Greg Stark <gsstark@mit.edu> wrote: > >> I don't think we want to cluster on the primary key. I think we just >> want to rewrite the table keeping the same physical ordering. > > Well if that's what you want to do, couldn't you do something like?: > > Lock the table. > Drop all indexes. > Pass over the heap with two pointers, one to the first available empty > space and one to the first non-dead row past that, and move inside the > existing file. > Rebuild the indexes. > Release the lock. Well dropping the indexes and moving tuples are both "hard" if you care about crash-safety and transactional integrity. The way we rewrite tables now is:

Lock table.
Create new filenode.
Scan old table and copy each record into the new filenode keeping update chains intact.
Rebuild all indexes for the table (using a similar strategy with new relfilenodes).
Commit the transaction.

If the transaction aborts at any point you still have the old pg_class record which points to the old relfilenode and all the old indexes are still valid. We have all the pieces we need to do this, it's just a matter of putting them together with a command to call them. A big part of why VACUUM FULL is annoying is the complexity of moving tuples in place. VACUUM FULL has to mark the old tuples and the new copies with its xid. It can't truncate the relation until it commits that xid. Actually I wonder how much performance improvement would come on normal DML just from not having to check xvac in the visibility checks. It's probably not much but... -- greg http://mit.edu/~gsstark/resume.pdf
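The relfilenode swap Greg describes is visible from SQL, which makes the crash-safety story easy to see; a quick check with hypothetical names:

    SELECT relfilenode FROM pg_class WHERE relname = 'bigtable';
    CLUSTER bigtable USING bigtable_pkey;
    -- The table now points at a brand-new file; the old one is only
    -- unlinked once the transaction commits.
    SELECT relfilenode FROM pg_class WHERE relname = 'bigtable';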
On Wed, Sep 2, 2009 at 2:31 PM, Tom Lane<tgl@sss.pgh.pa.us> wrote: > Greg Stark <gsstark@mit.edu> writes: >> On Wed, Sep 2, 2009 at 6:41 PM, Josh Berkus<josh@agliodbs.com> wrote: >>>> Perhaps we should go one version with a enable_legacy_full_vacuum >>>> which defaults to off. That would at least let us hear about use cases >>>> where people are unhappy with a replacement. >>> >>> I think we do need to do this, just because people won't have changed >>> their admin scripts. But the goal should be to dump VACUUM FULL >>> entirely by 8.6 if we *don't* get serious use-cases. > >> We could deal with the admin scripts by making VACUUM FULL do the new >> behaviour. But I actually don't really like that. I wold prefer to >> break VACUUM FULL since anyone doing it routinely is probably >> mistaken. We could name the command something which is more >> descriptive like VACUUM REWRITE or VACUUM REBUILD or something like >> that. > > What's wrong with just ignoring the FULL option? It's a reserved > word anyway because of FULL OUTER JOINs, so there's no syntactic > benefit to be had from eliminating it from the VACUUM syntax. Silent behavior changes are usually a bad idea. ...Robert
On Wed, Sep 2, 2009 at 2:54 PM, Tom Lane<tgl@sss.pgh.pa.us> wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> So I have a script that goes and finds bloated tables and runs VACUUM >> FULL on them in the middle of the night if the bloat passes a certain >> threshold. The tables are small enough and the number of users is low >> enough that this doesn't cause any problems for me. I'm OK if the >> name of the command changes, but I'd like there to be a command that I >> can pass a table name to and get my table debloated without having to >> make any follow-on decisions (such as picking an index to cluster by). > > I think we *should* have a command that works like CLUSTER except it just > seqscans the source table without depending on any particular index. > Whether it's called VACUUM FULL or something else is a detail. Yeah. We could do this by extending the syntax for cluster (e.g. CLUSTER [VERBOSE] tablename [USING indexname | WITH NO INDEX]), but I'm not sure whether that has any real advantage over just using the existing command name. I confess to being a little fuzzy on the details of how this implementation (seq-scanning the source table for live tuples) is different/better from the current VACUUM FULL implementation. Can someone fill me in? ...Robert
On Wed, Sep 2, 2009 at 8:10 PM, Robert Haas<robertmhaas@gmail.com> wrote: > I confess to being a little fuzzy on the details of how this > implementation (seq-scanning the source table for live tuples) is > different/better from the current VACUUM FULL implementation. Can > someone fill me in? VACUUM FULL is a *lot* more complex. It scans pages *backwards* from the end (which does wonderful things on rotating media). Marks each live tuple it finds as "moved off", finds a new place for it (using the free space map I think?). Inserts the tuple on the new page, marks it "moved in", and updates the indexes. Then it commits the transaction but keeps the lock. Then it has to vacuum all the indexes of the references to the old tuples at the end of the table. I think it has to commit that too before it can finally truncate the table. The backwards scan is awful for rotating media. The reading from the end and writing to the beginning is bad too, though hopefully the cache can help that. A lot of the complexity comes in from other parts of the system that have to be aware of tuples that have been "moved off" or "moved in". They have to be able to check whether the vacuum committed or not. That reminds me there was another proposal to do an "online" vacuum full similar to our concurrent index builds. Do noop-updates to tuples at the end of the table, hopefully finding space for them earlier in the table. Wait until those transactions are no longer visible to anyone else and then truncate. (Actually I think you could just not do anything and let regular lazy vacuum do the truncate). That might be a good practical alternative for sites where copying their entire table isn't practical. -- greg http://mit.edu/~gsstark/resume.pdf
Greg Stark <gsstark@mit.edu> writes: > The backwards scan is awful for rotating media. The reading from the > end and writing to the beginning is bad too, though hopefully the > cache can help that. Yeah. And all that pales in comparison to what happens in the indexes. You have to insert index entries (retail) for each moved-in tuple, then after doing the intermediate commit you run around and remove the index entries for the moved-off tuples. Lots of nonsequential access to insert the entries. The cleanup isn't so bad --- it's comparable to what regular lazy VACUUM has to do --- but that's just one step in a very expensive process. regards, tom lane
Greg Stark <gsstark@mit.edu> writes: > It scans pages *backwards* from the end (which does wonderful things > on rotating media). Marks each live tuple it finds as "moved off", > finds a new place for it (using the free space map I think?). BTW, VACUUM FULL doesn't use the free space map --- that code predates the FSM by a lot. It builds its own map of free space during its initial lazy-VACUUM-equivalent scan that just removes dead tuples. While I don't think this hurts performance any, I have seen reports of VACUUM FULL failing outright on large tables because it runs out of memory for this map. So that's still another way in which it's not actually all that useful for huge tables. regards, tom lane
Tom Lane wrote: > Greg Stark <gsstark@mit.edu> writes: > > It scans pages *backwards* from the end (which does wonderful things > > on rotating media). Marks each live tuple it finds as "moved off", > > finds a new place for it (using the free space map I think?). > > BTW, VACUUM FULL doesn't use the free space map --- that code predates > the FSM by a lot. It builds its own map of free space during its > initial lazy-VACUUM-equivalent scan that just removes dead tuples. Another weird consequence of this is that it bails out if it finds a tuple larger than it can fit in one of the earlier pages; if there's dead space to be compacted before that, it's not compacted. -- Alvaro Herrera http://www.CommandPrompt.com/ The PostgreSQL Company - Command Prompt, Inc.
On Wed, Sep 2, 2009 at 3:30 PM, Greg Stark<gsstark@mit.edu> wrote: > On Wed, Sep 2, 2009 at 8:10 PM, Robert Haas<robertmhaas@gmail.com> wrote: >> I confess to being a little fuzzy on the details of how this >> implementation (seq-scanning the source table for live tuples) is >> different/better from the current VACUUM FULL implementation. Can >> someone fill me in? > > > VACUUM FULL is a *lot* more complex. > > It scans pages *backwards* from the end (which does wonderful things > on rotating media). Marks each live tuple it finds as "moved off", > finds a new place for it (using the free space map I think?). Inserts > the tuple on the new page, marks it "moved in", and updates the > indexes. > > Then it commits the transaction but keeps the lock. Then it has to > vacuum all the indexes of the references to the old tuples at the end > of the table. I think it has to commit that too before it can finally > truncate the table. > > The backwards scan is awful for rotating media. The reading from the > end and writing to the beginning is bad too, though hopefully the > cache can help that. > > A lot of the complexity comes in from other parts of the system that > have to be aware of tuples that have been "moved off" or "moved in". > They have to be able to check whether the vacuum committed or not. Ugh. > That reminds me there was another proposal to do an "online" vacuum > full similar to our concurrent index builds. Do noop-updates to tuples > at the end of the table, hopefully finding space for them earlier in > the table. Wait until those transactions are no longer visible to > anyone else and then truncate. (Actually I think you could just not do > anything and let regular lazy vacuum do the truncate). That might be a > good practical alternative for sites where copying their entire table > isn't practical. I don't have a strong opinion about whether it's better to take an exclusive lock on everything and reorganize freely, or whether it's better to try to use MVCC semantics to reduce the locking impact (at a cost of difficulty in the presence of long-running transactions). But I think it would be really nice to have an incremental way to reduce table bloat. Locking a table (or even better, part of a table) for a couple of seconds once an hour for several days or weeks figures to be practical in some (many?) environments where locking a table for minutes or hours is not. ...Robert
Alvaro Herrera <alvherre@commandprompt.com> writes: > Another weird consequence of this is that it bails out if it finds a > tuple larger than it can fit in one of the earlier pages; if there's > dead space to be compacted before that, it's not compacted. I don't find a lot wrong with that. The code defines its purpose as being to shorten the table file length. Once it hits a page that can't be emptied, it cannot shorten the file any further, so why shouldn't it stop? regards, tom lane
Tom Lane wrote: > Alvaro Herrera <alvherre@commandprompt.com> writes: > > Another weird consequence of this is that it bails out if it finds a > > tuple larger than it can fit in one of the earlier pages; if there's > > dead space to be compacted before that, it's not compacted. > > I don't find a lot wrong with that. The code defines its purpose as > being to shorten the table file length. Once it hits a page that > can't be emptied, it cannot shorten the file any further, so why > shouldn't it stop? All that work, and it wasn't capable of defragging the other pages? At the very least it could register them in the FSM. -- Alvaro Herrera http://www.CommandPrompt.com/ PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Alvaro Herrera <alvherre@commandprompt.com> writes: > Tom Lane wrote: >> I don't find a lot wrong with that. The code defines its purpose as >> being to shorten the table file length. Once it hits a page that >> can't be emptied, it cannot shorten the file any further, so why >> shouldn't it stop? > All that work, and it wasn't capable of defragging the other pages? At > the very least it could register them in the FSM. You mean like vac_update_fsm() ? regards, tom lane
Tom Lane wrote: > Alvaro Herrera <alvherre@commandprompt.com> writes: > > Tom Lane wrote: > >> I don't find a lot wrong with that. The code defines its purpose as > >> being to shorten the table file length. Once it hits a page that > >> can't be emptied, it cannot shorten the file any further, so why > >> shouldn't it stop? > > > All that work, and it wasn't capable of defragging the other pages? At > > the very least it could register them in the FSM. > > You mean like vac_update_fsm() ? Huh :-) -- Alvaro Herrera http://www.CommandPrompt.com/ PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Robert Haas wrote:
> On Tue, Sep 1, 2009 at 9:29 PM, Alvaro Herrera<alvherre@commandprompt.com> wrote:
>> Ron Mayer wrote:
>>> Greg Stark wrote:
>>>> That's what I want to believe. But picture if you have, say a
>>>> 1-terabyte table which is 50% dead tuples and you don't have a spare
>>>> 1-terabytes to rewrite the whole table.
>>> Could one hypothetically do
>>> update bigtable set pk = pk where ctid in (select ctid from bigtable order by ctid desc limit 100);
>>> vacuum;
>>> and repeat until max(ctid) is small enough?
>> I remember Hannu Krosing said they used something like that to shrink
>> really bloated tables. Maybe we should try to explicitly support a
>> mechanism that worked in that fashion. I think I tried it at some point
>> and found that the problem with it was that ctid was too limited in what
>> it was able to do.
>
> I think a way to incrementally shrink large tables would be enormously
> beneficial. Maybe vacuum could try to do a bit of that each time it
> runs.

Yet when I try it now, I'm having trouble making it work. Would you expect the ctid to be going down in the psql session shown below? I wonder why it isn't.

regression=# create table shrink_test as select * from tenk1;
SELECT
regression=# delete from shrink_test where (unique2 % 2) = 0;
DELETE 5000
regression=# create index "shrink_test(unique1)" on shrink_test(unique1);
CREATE INDEX
regression=# select max(ctid) from shrink_test;
   max
----------
 (333,10)
(1 row)

regression=# update shrink_test set unique1=unique1 where ctid in (select ctid from shrink_test order by ctid desc limit 100);
UPDATE 100
regression=# vacuum shrink_test;
VACUUM
regression=# select max(ctid) from shrink_test;
   max
----------
 (333,21)
(1 row)

regression=# update shrink_test set unique1=unique1 where ctid in (select ctid from shrink_test order by ctid desc limit 100);
UPDATE 100
regression=# vacuum shrink_test;
VACUUM
regression=# select max(ctid) from shrink_test;
   max
----------
 (333,27)
(1 row)

regression=# update shrink_test set unique1=unique1 where ctid in (select ctid from shrink_test order by ctid desc limit 100);
UPDATE 100
regression=# vacuum shrink_test;
VACUUM
regression=# select max(ctid) from shrink_test;
   max
----------
 (333,33)
(1 row)
On Wed, Sep 2, 2009 at 11:55 PM, Ron Mayer<rm_pg@cheapcomplexdevices.com> wrote: > Yet when I try it now, I'm having trouble making it work. > Would you expect the ctid to be going down in the psql session > shown below? I wonder why it isn't. Even before HOT we preferentially tried to put updated tuples on the same page they were on before. On pre-8.3 if you did these updates *without* the vacuum they would eventually be forced to find a new page and hopefully would find one earlier in the table. On 8.4 HOT will (hopefully) prevent even that from working. Unless you have a long-running transaction in the background it will clean up the old tuples in the chain on the page each time the page fills up. You've deleted half the tuples on the page so the updates will always fit in that space. Eventually you'll hit the maximum number of tuples allowed on the page dead or alive. But the vacuums are defeating that too. A special purpose command could work around all of this. -- greg http://mit.edu/~gsstark/resume.pdf
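One way to confirm Greg's explanation against Ron's session is the statistics view that came with HOT in 8.3:

    -- If n_tup_hot_upd tracks n_tup_upd, the no-op updates are staying on
    -- their original pages and the tail of the table can never empty out.
    SELECT n_tup_upd, n_tup_hot_upd
      FROM pg_stat_user_tables
     WHERE relname = 'shrink_test';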
On Wed, Sep 2, 2009 at 8:45 PM, Tom Lane<tgl@sss.pgh.pa.us> wrote: > Greg Stark <gsstark@mit.edu> writes: >> The backwards scan is awful for rotating media. The reading from the >> end and writing to the beginning is bad too, though hopefully the >> cache can help that. > > Yeah. And all that pales in comparison to what happens in the indexes. > You have to insert index entries (retail) for each moved-in tuple, Hm, that could be addressed by buffering index inserts in backend local memory. That's something Heikki proposed a long time ago primarily for improving bulk data loading. Basically it would be a retail version of the bulk loader that we saw at the 10th anniversary where you merge a sorted list into the index. You would still have to flush the buffer at transaction commit but even if it only buffered a few dozen tuples if they're in the same region of the index it would be a win. In this case it could probably buffer hundreds and merge them all into the index en masse. -- greg http://mit.edu/~gsstark/resume.pdf
On Tue, Sep 01, 2009 at 07:42:56PM -0400, Tom Lane wrote: > Greg Stark <gsstark@mit.edu> writes: > > On Wed, Sep 2, 2009 at 12:01 AM, Alvaro > > Herrera<alvherre@commandprompt.com> wrote: > >>> The use cases where VACUUM FULL wins currently are where storing two > >>> copies of the table and its indexes concurrently just isn't practical. > >> > >> Yeah, but then do you really need to use VACUUM FULL? If that's really > >> a problem then there ain't that many dead tuples around. > > > That's what I want to believe. But picture if you have, say a > > 1-terabyte table which is 50% dead tuples and you don't have a spare > > 1-terabytes to rewrite the whole table. > > But trying to VACUUM FULL that table is going to be horridly painful > too, and you'll still have bloated indexes afterwards. You might as > well just live with the 50% waste, especially since if you did a > full-table update once you'll probably do it again sometime. > > I'm having a hard time believing that VACUUM FULL really has any > interesting use-case anymore. I have a client who uses temp tables heavily, hundreds of thousands of creates and drops per day. They also have long running queries. The only thing that keeps catalog bloat somewhat in check is vacuum full on bloated catalogs a few times a day. Without that pg_class, pg_attribute etc quickly balloon to thousands of pages. -dg -- David Gould daveg@sonic.net 510 536 1443 510 282 0869 If simplicity worked, the world would be overrun with insects.
daveg wrote: > On Tue, Sep 01, 2009 at 07:42:56PM -0400, Tom Lane wrote: > >> Greg Stark <gsstark@mit.edu> writes: >> >>> On Wed, Sep 2, 2009 at 12:01 AM, Alvaro >>> Herrera<alvherre@commandprompt.com> wrote: >>> >>>>> The use cases where VACUUM FULL wins currently are where storing two >>>>> copies of the table and its indexes concurrently just isn't practical. >>>>> >>>> Yeah, but then do you really need to use VACUUM FULL? If that's really >>>> a problem then there ain't that many dead tuples around. >>>> >>> That's what I want to believe. But picture if you have, say a >>> 1-terabyte table which is 50% dead tuples and you don't have a spare >>> 1-terabytes to rewrite the whole table. >>> >> But trying to VACUUM FULL that table is going to be horridly painful >> too, and you'll still have bloated indexes afterwards. You might as >> well just live with the 50% waste, especially since if you did a >> full-table update once you'll probably do it again sometime. >> >> I'm having a hard time believing that VACUUM FULL really has any >> interesting use-case anymore. >> > > I have a client who uses temp tables heavily, hundreds of thousands of creates > and drops per day. They also have long running queries. The only thing that > keeps catalog bloat somewhat in check is vacuum full on bloated catalogs > a few times a day. Without that pg_class, pg_attribute etc quickly balloon to > thousands of pages. > > > That's a rate of more than one create and drop per second. How does your client handle the fact that VACUUM FULL will exclusively lock those catalog tables? Without knowing more, it looks like a bit of a design issue. cheers andrew
On Thu, Sep 03, 2009 at 07:57:25PM -0400, Andrew Dunstan wrote: > daveg wrote: > >On Tue, Sep 01, 2009 at 07:42:56PM -0400, Tom Lane wrote: > >>I'm having a hard time believing that VACUUM FULL really has any > >>interesting use-case anymore. > > > >I have a client who uses temp tables heavily, hundreds of thousands of > >creates > >and drops per day. They also have long running queries. The only thing that > >keeps catalog bloat somewhat in check is vacuum full on bloated catalogs > >a few times a day. Without that pg_class, pg_attribute etc quickly balloon > >to thousands of pages. > > That's a rate of more than one create and drop per second. How does your > client handle the fact that VACUUM FULL will exclusively lock those > catalog tables? Without knowing more, it looks like a bit of a design issue. I'd say it is several per second. They wait for the catalog locks sometimes. This is not an interactive application so that is somewhat acceptable. It also occasionally causes deadlocks which is less agreeable. There are various reasons for the heavy use of temps, mainly having to do with loading external feeds or reusing intermediate query results in a series of queries. It would be great if there was a way to have temp tables that did not get cataloged, eg local cache only. -dg -- David Gould daveg@sonic.net 510 536 1443 510 282 0869 If simplicity worked, the world would be overrun with insects.
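A sketch of how the catalog growth daveg describes can be watched over time (relpages and reltuples are only as fresh as the last VACUUM or ANALYZE of each catalog):

    SELECT relname, relpages, reltuples
      FROM pg_class
     WHERE relname IN ('pg_class', 'pg_attribute', 'pg_type', 'pg_depend')
     ORDER BY relpages DESC;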
All, >>> I have a client who uses temp tables heavily, hundreds of thousands of >>> creates >>> and drops per day. They also have long running queries. The only thing that >>> keeps catalog bloat somewhat in check is vacuum full on bloated catalogs >>> a few times a day. Without that pg_class, pg_attribute etc quickly balloon to >>> thousands of pages. Actually, this is a good point ... if we dropped VACUUM FULL, we'd need to also be able to call CLUSTER (or VACUUM REWRITE) on the system catalogs. -- Josh Berkus PostgreSQL Experts Inc. www.pgexperts.com
Josh Berkus <josh@agliodbs.com> writes: >>> I have a client who uses temp tables heavily, hundreds of thousands of >>> creates >>> and drops per day. They also have long running queries. The only >>> thing that >>> keeps catalog bloat somewhat in check is vacuum full on bloated catalogs >>> a few times a day. Without that pg_class, pg_attribute etc quickly balloon >>> to thousands of pages. > Actually, this is a good point ... if we dropped VACUUM FULL, we'd need > to also be able to call CLUSTER (or VACUUM REWRITE) on the system catalogs. I don't think I believe the claim above that vacuum full is actually necessary. Reasonably aggressive regular vacuuming ought to do it. We used to have a bug that caused row deletions during backend shutdown to not get reported to the stats collector; which had the effect that dead catalog entries for temp tables didn't get counted, and so autovac didn't hit the catalogs often enough, and so you'd get bloat in exactly this scenario. I suspect the claim that manual vacuum full is necessary is based on obsolete experience from before that bug got stomped. It's hardly an ideal solution anyway given what an exclusive lock on pg_class will do to the rest of the system --- and a cluster-like cleanup won't be any better about that. regards, tom lane
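The "reasonably aggressive regular vacuuming" Tom suggests can be approximated with scheduled plain vacuums of the hot catalogs; unlike VACUUM FULL these take only ordinary locks (superuser required):

    -- Cheap enough to run frequently, e.g. from cron, on a temp-table-heavy cluster.
    VACUUM ANALYZE pg_catalog.pg_class;
    VACUUM ANALYZE pg_catalog.pg_attribute;
    VACUUM ANALYZE pg_catalog.pg_depend;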
On Fri, Sep 4, 2009 at 2:48 PM, Tom Lane<tgl@sss.pgh.pa.us> wrote: > Josh Berkus <josh@agliodbs.com> writes: >>>> I have a client who uses temp tables heavily, hundreds of thousands of >>>> creates >>>> and drops per day. They also have long running queries. The only >>>> thing that >>>> keeps catalog bloat somewhat in check is vacuum full on bloated catalogs >>>> a few times a day. With > >> Actually, this is a good point ... if we dropped VACUUM FULL, we'd need >> to also be able to call CLUSTER (or VACUUM REWRITE) on the system catalogs. > > I don't think I believe the claim above that vacuum full is actually > necessary. Reasonably aggressive regular vacuuming ought to do it. > > We used to have a bug that caused row deletions during backend shutdown > to not get reported to the stats collector; which had the effect that > dead catalog entries for temp tables didn't get counted, and so autovac > didn't hit the catalogs often enough, and so you'd get bloat in exactly > this scenario. I suspect the claim that manual vacuum full is necessary > is based on obsolete experience from before that bug got stomped. > It's hardly an ideal solution anyway given what an exclusive lock on > pg_class will do to the rest of the system --- and a cluster-like > cleanup won't be any better about that. I'm confused. Are you saying that pg_class will never get bloated, so we don't need a way to debloat it? I realize that with HOT bloat is much less of a problem than it used to be, but surely it's not altogether impossible... ...Robert
Tom Lane wrote: > Josh Berkus <josh@agliodbs.com> writes: > >>>> I have a client who uses temp tables heavily, hundreds of thousands of >>>> creates >>>> and drops per day. They also have long running queries. The only >>>> thing that >>>> keeps catalog bloat somewhat in check is vacuum full on bloated catalogs >>>> a few times a day. With >>>> > > >> Actually, this is a good point ... if we dropped VACUUM FULL, we'd need >> to also be able to call CLUSTER (or VACUUM REWRITE) on the system catalogs. >> > > I don't think I believe the claim above that vacuum full is actually > necessary. Reasonably aggressive regular vacuuming ought to do it. > How about setting a non-100% fillfactor on catalog tables? Maybe by default? That would also avoid most of the bloat, wouldn't it? > We used to have a bug that caused row deletions during backend shutdown > to not get reported to the stats collector; which had the effect that > dead catalog entries for temp tables didn't get counted, and so autovac > didn't hit the catalogs often enough, and so you'd get bloat in exactly > this scenario. I suspect the claim that manual vacuum full is necessary > is based on obsolete experience from before that bug got stomped. > It's hardly an ideal solution anyway given what an exclusive lock on > pg_class will do to the rest of the system --- and a cluster-like > cleanup won't be any better about that. > > regards, tom lane > > -- Bible has answers for everything. Proof: "But let your communication be, Yea, yea; Nay, nay: for whatsoever is more than these cometh of evil." (Matthew 5:37) - basics of digital technology. "May your kingdom come" - superficial description of plate tectonics ---------------------------------- Zoltán Böszörményi Cybertec Schönig & Schönig GmbH http://www.postgresql.at/
Robert Haas <robertmhaas@gmail.com> writes: > I'm confused. Are you saying that pg_class will never get bloated, so > we don't need a way to debloat it? I realize that with HOT bloat is > much less of a problem than it used to be, but surely it's not > altogether impossible... Well, it's certainly *possible*, I'm just questioning the assertion that it's necessarily a common situation. regards, tom lane
On Fri, 2009-09-04 at 15:10 -0400, Tom Lane wrote: > Robert Haas <robertmhaas@gmail.com> writes: > > I'm confused. Are you saying that pg_class will never get bloated, so > > we don't need a way to debloat it? I realize that with HOT bloat is > > much less of a problem than it used to be, but surely it's not > > altogether impossible... > > Well, it's certainly *possible*, I'm just questioning the assertion that > it's necessarily a common situation. Depends on your definition of common. It is very easy for someone to blow away their vacuum settings in such a way that it will become bloated pretty quick. Joshua D. Drake > > regards, tom lane > -- PostgreSQL.org Major Contributor Command Prompt, Inc: http://www.commandprompt.com/ - 503.667.4564 Consulting, Training, Support, Custom Development, Engineering
Joshua D. Drake wrote: > On Fri, 2009-09-04 at 15:10 -0400, Tom Lane wrote: > > Robert Haas <robertmhaas@gmail.com> writes: > > > I'm confused. Are you saying that pg_class will never get bloated, so > > > we don't need a way to debloat it? I realize that with HOT bloat is > > > much less of a problem than it used to be, but surely it's not > > > altogether impossible... > > > > Well, it's certainly *possible*, I'm just questioning the assertion that > > it's necessarily a common situation. > > Depends on your definition of common. It is very easy for someone to > blow away their vacuum settings in such a way that it will become > bloated pretty quick. No problem, just CLUSTER that table same as today. -- Alvaro Herrera http://www.CommandPrompt.com/ The PostgreSQL Company - Command Prompt, Inc.
Alvaro Herrera <alvherre@commandprompt.com> writes: > No problem, just CLUSTER that table same as today. Uh, no, that was Josh's point: you can't CLUSTER pg_class, because you can't change its relfilenode. If you do, backends won't know where to read pg_class to find out its relfilenode. I was wondering whether maintenance operations like "vacuum rewrite" could get away with filling a new table file and then moving it into place with rename(2), which is guaranteed atomic (at least on sane filesystems). The idea doesn't work right off because (1) you need to atomically install the updated indexes too, and (2) the table might span more than one segment file. But maybe we could think of something. regards, tom lane
I wrote: > Alvaro Herrera <alvherre@commandprompt.com> writes: >> No problem, just CLUSTER that table same as today. > Uh, no, that was Josh's point: you can't CLUSTER pg_class, because you > can't change its relfilenode. If you do, backends won't know where to > read pg_class to find out its relfilenode. > I was wondering whether maintenance operations like "vacuum rewrite" > could get away with filling a new table file and then moving it into > place with rename(2), which is guaranteed atomic (at least on sane > filesystems). The idea doesn't work right off because (1) you need > to atomically install the updated indexes too, and (2) the table > might span more than one segment file. But maybe we could think of > something. Hmm ... reading that over again, it seems like there is a pretty obvious solution. The sticking point --- not only for pg_class, but for shared catalogs such as pg_database --- is the lack of a way to track relfilenode if it ever changes. What if we keep the relfilenode of these critical tables someplace else? For instance, we could have a "map" file in each database holding the relfilenode of pg_class, and one in $PGDATA/global holding the relfilenodes of the shared catalogs and indexes. It'd be possible to update a map file atomically via the rename(2) trick. Then we teach relcache or some similar place to believe the map files over the contents of pg_class. This looks sort of like a reversion to flat files, but it has a couple of saving graces: 1. The set of entries is fixed and small, so there's no performance issue looming with database growth. 2. We could not synchronize updates with transaction commit, which was always the real Achilles' heel of the flat files. But I think we don't need to, if we restrict the set of operations that can change the relfilenodes of critical tables to "maintenance" operations that only rewrite the table contents and don't make any logical changes in the contents. Given that restriction, transaction commit isn't actually important; it will be the act of moving the updated map file into place that effectively is commit for these operations. If you crash after that, your change is still effective. Thoughts? regards, tom lane
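To make the rename(2) trick concrete, here is a minimal sketch in C of the atomic map-file update, assuming a hypothetical file name "pg_relmap"; a real implementation would pick its own name, WAL-log the change, and probably fsync the containing directory after the rename as well:

    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    /*
     * Atomically replace the (hypothetical) relation map file: write the
     * new contents to a temp file, fsync it, then rename() it into place.
     * rename(2) is atomic on POSIX filesystems, so a reader sees either
     * the old map or the new one, never a torn mixture.  The rename is
     * the effective "commit" of the maintenance operation.
     */
    static int
    update_map_file(const char *dir, const void *data, size_t len)
    {
        char    tmppath[1024];
        char    mappath[1024];
        int     fd;

        snprintf(tmppath, sizeof(tmppath), "%s/pg_relmap.tmp", dir);
        snprintf(mappath, sizeof(mappath), "%s/pg_relmap", dir);

        fd = open(tmppath, O_WRONLY | O_CREAT | O_TRUNC, 0600);
        if (fd < 0)
            return -1;

        /* data must be safely on disk before we dare rename */
        if (write(fd, data, len) != (ssize_t) len || fsync(fd) != 0)
        {
            close(fd);
            unlink(tmppath);
            return -1;
        }
        close(fd);

        if (rename(tmppath, mappath) != 0)
        {
            unlink(tmppath);
            return -1;
        }
        return 0;
    }

Crash before the rename leaves the old map in place; crash after it leaves the new one, which matches the "commit happens at the rename" semantics described above.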
Hi, Tom Lane <tgl@sss.pgh.pa.us> writes: > Alvaro Herrera <alvherre@commandprompt.com> writes: >> No problem, just CLUSTER that table same as today. > > Uh, no, that was Josh's point: you can't CLUSTER pg_class, because you > can't change its relfilenode. If you do, backends won't know where to > read pg_class to find out its relfilenode. Why can't MVCC apply here? You'd have two versions of the pg_class entry that has just been CLUSTERed, and you keep the old relfilenode around too. MVCC applies, and you teach vacuum to clean out the old file when cleaning out the no-longer-visible tuple. Or you take the necessary locks to protect the operation. No doubt there's some obvious reason why it can't be done this way, and I'll regret asking :) Regards. -- dim
Dimitri Fontaine wrote: > Hi, > > Tom Lane <tgl@sss.pgh.pa.us> writes: > > Alvaro Herrera <alvherre@commandprompt.com> writes: > >> No problem, just CLUSTER that table same as today. > > > > Uh, no, that was Josh's point: you can't CLUSTER pg_class, because you > > can't change its relfilenode. If you do, backends won't know where to > > read pg_class to find out its relfilenode. > > Why can't MVCC apply here? You'd have two versions of the pg_class entry > that has just been CLUSTERed, and you keep the old relfilenode around > too. MVCC applies, and you teach vacuum to clean out the old file when > cleaning out the no-longer-visible tuple. It just doesn't work. pg_class (and various other rels) are special because they are needed to bootstrap the catalog system. See RelationCacheInitializePhase3. It wouldn't be possible to figure out what pg_class's relfilenode is until you have read it from pg_class, which is precisely what we're trying to do. -- Alvaro Herrera http://www.CommandPrompt.com/ The PostgreSQL Company - Command Prompt, Inc.
Alvaro Herrera <alvherre@commandprompt.com> writes: > Dimitri Fontaine wrote: >> Why can't MVCC apply here? You'd have two versions of the pg_class entry >> that has just been CLUSTERed, and you keep the old relfilenode around >> too. MVCC applies, and you teach vacuum to clean out the old file when >> cleaning out the no-longer-visible tuple. > > It just doesn't work. pg_class (and various other rels) are special > because they are needed to bootstrap the catalog system. See > RelationCacheInitializePhase3. It wouldn't be possible to figure out > what pg_class's relfilenode is until you have read it from pg_class, which > is precisely what we're trying to do. Well, at bootstrap time I guess no one is able to disturb the system by issuing a concurrent CLUSTER pg_class; call. Once started, do those rels still need to have a special behavior? I guess I'm being dense, will now let people in the know find a solution... -- dim
Dimitri Fontaine wrote: > Alvaro Herrera <alvherre@commandprompt.com> writes: > > Dimitri Fontaine wrote: > >> Why can't MVCC apply here? You'd have two versions of the pg_class entry > >> that has just been CLUSTERed, and you keep the old relfilenode around > >> too. MVCC applies, and you teach vacuum to clean out the old file when > >> cleaning out the no-longer-visible tuple. > > > > It just doesn't work. pg_class (and various other rels) are special > > because they are needed to bootstrap the catalog system. See > > RelationCacheInitializePhase3. It wouldn't be possible to figure out > > what pg_class's relfilenode is until you have read it from pg_class, which > > is precisely what we're trying to do. > > Well, at bootstrap time I guess no one is able to disturb the system by > issuing a concurrent CLUSTER pg_class; call. Once started, do those rels > still need to have a special behavior? The relcache needs to be bootstrapped more than once, not just at initdb's bootstrap. (I guess you could try a breakpoint in formrdesc) > I guess I'm being dense, will now let people in the know find a solution... Yeah, well, Tom just posted a possible solution :-) -- Alvaro Herrera http://www.CommandPrompt.com/ PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Dimitri Fontaine <dfontaine@hi-media.com> writes: > Well at bootstrap time I guess noone is able to disturb the system by > placing a concurrent CLUSTER pg_class; call. Once started, do those rels > still need to have a special behavior? It doesn't matter, if you fail to get past bootstrap because you couldn't find pg_class. The existing design for this is absolutely dependent on the fact that pg_class has a fixed relfilenode = fixed physical file name. MVCC has nothing to do with it. See my followon message for a sketch of a possible solution. Basically it's pushing the fixed file name over to another place ... regards, tom lane
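For concreteness, here is one plausible shape for that "other place" --- purely a sketch; the field names, the fixed-size array, and the CRC below are illustrative assumptions, not anything from an actual patch:

    #include <stdint.h>

    /*
     * Hypothetical on-disk layout for the per-database (or shared)
     * relation map.  Fixed-size, so the whole file can be rewritten and
     * rename()d atomically, and checksummed so a torn or stale copy can
     * be detected at startup.
     */
    #define RELMAP_MAGIC    0x592717        /* arbitrary format stamp */
    #define MAX_MAPPINGS    62              /* enough for the nailed catalogs */

    typedef uint32_t Oid;                   /* as in postgres_ext.h */

    typedef struct RelMapping
    {
        Oid         mapped_oid;             /* OID of a mapped catalog or index */
        Oid         filenode;               /* its current relfilenode */
    } RelMapping;

    typedef struct RelMapFile
    {
        int32_t     magic;                  /* format identifier */
        int32_t     num_mappings;           /* entries actually in use */
        RelMapping  mappings[MAX_MAPPINGS];
        uint32_t    crc;                    /* CRC of all of the above */
    } RelMapFile;

Keeping the file a fixed, small size is what makes the write-then-rename update cheap and makes "read the map before you can read pg_class" a trivial bootstrap step.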
Alvaro Herrera <alvherre@commandprompt.com> writes: > The relcache needs to be bootstrapped more than once, not just at > initdb's bootstrap. (I guess you could try a breakpoint in formrdesc)

OK, so in RelationCacheInitializePhase3 we have formrdesc calls:

    formrdesc("pg_class", false, true, Natts_pg_class, Desc_pg_class);

That will do this:

    /*
     * initialize relation id from info in att array (my, this is ugly)
     */
    RelationGetRelid(relation) = relation->rd_att->attrs[0]->attrelid;
    relation->rd_rel->relfilenode = RelationGetRelid(relation);

And this uses data from pg_attribute.h, which looks like this:

    #define Schema_pg_class \
    { 1259, {"relname"}, 19, -1, 0, NAMEDATALEN, 1, 0, -1, -1, false, 'p', 'c', true, false, false, true, 0, { 0 } }, \

So in the source I'm reading, pg_class's relfilenode is 1259; the system knows how to get there, then reads real values instead of the bootstrap-mode dummy ones. Except that bootstrap mode is used at initdb, at system startup, and any time there's a cache miss? So to be able to CLUSTER we have to find a way for a cache miss to get the file named 1259 or the new value, whatever it is after CLUSTER changed it. I guess symlinks are not allowed?

> Yeah, well, Tom just posted a possible solution :-)

Flat files are back? Really?

Regards, -- dim
On Fri, Sep 4, 2009 at 4:01 PM, Tom Lane<tgl@sss.pgh.pa.us> wrote: > I wrote: >> Alvaro Herrera <alvherre@commandprompt.com> writes: >>> No problem, just CLUSTER that table same as today. > >> Uh, no, that was Josh's point: you can't CLUSTER pg_class, because you >> can't change its relfilenode. If you do, backends won't know where to >> read pg_class to find out its relfilenode. > >> I was wondering whether maintenance operations like "vacuum rewrite" >> could get away with filling a new table file and then moving it into >> place with rename(2), which is guaranteed atomic (at least on sane >> filesystems). The idea doesn't work right off because (1) you need >> to atomically install the updated indexes too, and (2) the table >> might span more than one segment file. But maybe we could think of >> something. > > Hmm ... reading that over again, it seems like there is a pretty > obvious solution. The sticking point --- not only for pg_class, > but for shared catalogs such as pg_database --- is the lack of a > way to track relfilenode if it ever changes. What if we keep > the relfilenode of these critical tables someplace else? For > instance, we could have a "map" file in each database holding > the relfilenode of pg_class, and one in $PGDATA/global holding > the relfilenodes of the shared catalogs and indexes. It'd be > possible to update a map file atomically via the rename(2) trick. > Then we teach relcache or some similar place to believe the map > files over the contents of pg_class. > > This looks sort of like a reversion to flat files, but it has a couple > of saving graces: > 1. The set of entries is fixed and small, so there's no performance > issue looming with database growth. > 2. We could not synchronize updates with transaction commit, which was > always the real Achilles' heel of the flat files. But I think we don't > need to, if we restrict the set of operations that can change the > relfilenodes of critical tables to "maintenance" operations that only > rewrite the table contents and don't make any logical changes in the > contents. Given that restriction, transaction commit isn't actually > important; it will be the act of moving the updated map file into place > that effectively is commit for these operations. If you crash after > that, your change is still effective. This doesn't seem totally horrible. But, before you go do it, do we have a clearly-defined plan for the rest of the project? Because we only need this if we're absolutely confident that rewriting the table in place is just not an option worth keeping around. It's unclear to me that everyone is convinced of that, and even if they are, it's unclear to me what we plan to implement instead. ...Robert
Robert Haas <robertmhaas@gmail.com> writes: > On Fri, Sep 4, 2009 at 4:01 PM, Tom Lane<tgl@sss.pgh.pa.us> wrote: >> Hmm ... reading that over again, it seems like there is a pretty >> obvious solution. > This doesn't seem totally horrible. But, before you go do it, do we > have a clearly-defined plan for the rest of the project? Rest of what project? Removing vacuum full isn't a necessary component of that. It would enable doing CLUSTER on pg_class, and it would eliminate the crock of REINDEX having to reindex shared indexes in-place. It could probably be justified even without any changes in our approach to vacuum. > ... only need this if we're absolutely confident that rewriting the table > in place is just not an option worth keeping around. It's unclear to > me that everyone is convinced of that, and even if they are, it's > unclear to me what we plan to implement instead. I thought we were pretty well agreed that a seqscan variant of CLUSTER would be worth doing. Whether we take the next step by eliminating vacuum full is a different question, but the shape of the substitute seems perfectly clear. regards, tom lane
On Fri, Sep 4, 2009 at 9:37 PM, Tom Lane<tgl@sss.pgh.pa.us> wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> On Fri, Sep 4, 2009 at 4:01 PM, Tom Lane<tgl@sss.pgh.pa.us> wrote: >>> Hmm ... reading that over again, it seems like there is a pretty >>> obvious solution. > >> This doesn't seem totally horrible. But, before you go do it, do we >> have a clearly-defined plan for the rest of the project? > > Rest of what project? Removing vacuum full isn't a necessary component > of that. It would enable doing CLUSTER on pg_class, and it would > eliminate the crock of REINDEX having to reindex shared indexes > in-place. It could probably be justified even without any changes in > our approach to vacuum. OK, I'm sold. >> ... only need this if we're absolutely confident that rewriting the table >> in place is just not an option worth keeping around. It's unclear to >> me that everyone is convinced of that, and even if they are, it's >> unclear to me what we plan to implement instead. > > I thought we were pretty well agreed that a seqscan variant of > CLUSTER would be worth doing. Whether we take the next step by > eliminating vacuum full is a different question, but the shape of > the substitute seems perfectly clear. Well, there were some other ideas discussed, but perhaps that's the only one that had a clear consensus. ...Robert
>> That's what I want to believe. But picture if you have, say a >> 1-terabyte table which is 50% dead tuples and you don't have a spare >> 1-terabytes to rewrite the whole table. > >But trying to VACUUM FULL that table is going to be horridly painful >too, and you'll still have bloated indexes afterwards. You might as >well just live with the 50% waste, especially since if you did a >full-table update once you'll probably do it again sometime. > >I'm having a hard time believing that VACUUM FULL really has any >interesting use-case anymore. This was almost exactly the scenario I faced recently. A production database unexpectedly filled up its partition. On investigation, we found a developer had added a component to the application that updated every row in one table each day, exhausting the free space map. Over time, most of the tables in the system had grown to contain 50-70% dead tuples. The owner of the system was understandably reluctant to dump and restore the system, and there wasn't enough space left on the system to rewrite any of the large tables. In the end, I dropped a table (the one owned by the offending developer... 8-), and this gave me just enough space to VACUUM FULL one table at a time. -- Andrew McNamara, Senior Developer, Object Craft http://www.object-craft.com.au/
Back in September I wrote: > ... The sticking point --- not only for pg_class, > but for shared catalogs such as pg_database --- is the lack of a > way to track relfilenode if it ever changes. What if we keep > the relfilenode of these critical tables someplace else? For > instance, we could have a "map" file in each database holding > the relfilenode of pg_class, and one in $PGDATA/global holding > the relfilenodes of the shared catalogs and indexes. It'd be > possible to update a map file atomically via the rename(2) trick. > Then we teach relcache or some similar place to believe the map > files over the contents of pg_class. Thinking about this some more, I can see one small disadvantage: for the relations that we use the map file for, pg_class.relfilenode would not be trustworthy. This would not affect most of the system internals (which will be looking at the relcache's copy, which would be kept valid by the relcache code). But it would affect user queries, such as for example attempts to use contrib/oid2name to identify a file on-disk. The main case where pg_class.relfilenode would be likely to be out-of-sync is for shared catalogs. We could keep it up to date in most cases for local catalogs, but there's no hope of reaching into other databases' pg_class when a shared catalog is relocated. What I'd suggest doing about this is: (1) Store zero in pg_class.relfilenode for those catalogs for which the map is used. This at least makes it obvious that the value you're looking at isn't valid. (2) Provide a SQL function to extract the real relfilenode of any specified pg_class entry. We'd have to modify oid2name and pg_dump to know to use the function instead of looking at the column. There might be some other client-side code that would be broken until it got taught about the function, but hopefully not much. Thoughts? regards, tom lane
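For what it's worth, a sketch of what the backend half of (2) could look like --- the function name is an assumption here, and it leans on the premise that the relcache's copy of the relfilenode is kept authoritative by the map machinery:

    #include "postgres.h"

    #include "access/heapam.h"
    #include "fmgr.h"
    #include "storage/lock.h"
    #include "utils/rel.h"

    PG_MODULE_MAGIC;                    /* so the sketch can be tried as a module */
    PG_FUNCTION_INFO_V1(pg_relation_filenode);

    /*
     * Hypothetical SQL-callable function: report the real relfilenode of
     * the given relation, even when pg_class.relfilenode stores zero
     * because the relation is tracked in the map file.  We simply open
     * the relcache entry, whose rd_node the map machinery keeps valid.
     */
    Datum
    pg_relation_filenode(PG_FUNCTION_ARGS)
    {
        Oid         relid = PG_GETARG_OID(0);
        Relation    rel;
        Oid         result;

        rel = relation_open(relid, AccessShareLock);
        result = rel->rd_node.relNode;
        relation_close(rel, AccessShareLock);

        PG_RETURN_OID(result);
    }

With something along these lines, oid2name and pg_dump could call, say, pg_relation_filenode('pg_class'::regclass) instead of reading the column directly.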
On Sun, Jan 31, 2010 at 3:36 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Back in September I wrote: >> ... The sticking point --- not only for pg_class, >> but for shared catalogs such as pg_database --- is the lack of a >> way to track relfilenode if it ever changes. What if we keep >> the relfilenode of these critical tables someplace else? For >> instance, we could have a "map" file in each database holding >> the relfilenode of pg_class, and one in $PGDATA/global holding >> the relfilenodes of the shared catalogs and indexes. It'd be >> possible to update a map file atomically via the rename(2) trick. >> Then we teach relcache or some similar place to believe the map >> files over the contents of pg_class. > > Thinking about this some more, I can see one small disadvantage: > for the relations that we use the map file for, pg_class.relfilenode > would not be trustworthy. This would not affect most of the system > internals (which will be looking at the relcache's copy, which would > be kept valid by the relcache code). But it would affect user queries, > such as for example attempts to use contrib/oid2name to identify a > file on-disk. The main case where pg_class.relfilenode would be > likely to be out-of-sync is for shared catalogs. We could keep it > up to date in most cases for local catalogs, but there's no hope > of reaching into other databases' pg_class when a shared catalog > is relocated. > > What I'd suggest doing about this is: > > (1) Store zero in pg_class.relfilenode for those catalogs for which > the map is used. This at least makes it obvious that the value > you're looking at isn't valid. > > (2) Provide a SQL function to extract the real relfilenode of any > specified pg_class entry. We'd have to modify oid2name and > pg_dump to know to use the function instead of looking at the > column. > > There might be some other client-side code that would be broken > until it got taught about the function, but hopefully not much. > > Thoughts? Seems reasonable to me (assuming there's no way to avoid changing the relfilenode, which I assume is the case but don't actually know the code well enough to say with certainty). ...Robert