Thread: CLUSTER and synchronized scans and pg_dump et al
It occurred to me the other day that synchronized scans could play havoc with clustered tables. When you dump and reload a table even if it was recently clustered if any other sequential scans are happening in the system at the time you dump it the dump could shuffle the records out of order. Now the records would still be effectively ordered for most purposes but our statistics can't detect that. Since the correlation would be poor the restored database would have markedly different statistics showing virtually no correlation on the clustered column. Perhaps we should have some form of escape hatch for pg_dump to request real physical order when dumping clustered tables. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com Ask me about EnterpriseDB's 24x7 Postgres support!
Gregory Stark <stark@enterprisedb.com> writes: > Perhaps we should have some form of escape hatch for pg_dump to request real > physical order when dumping clustered tables. Yeah, Rae Steining was complaining to me about that off-list a few weeks ago. The whole syncscan behavior risks breaking many apps that "always worked before", even if they were disregarding the letter of the SQL spec. Maybe a GUC variable to enable/disable syncscan? regards, tom lane
On Jan 27, 2008 6:45 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Yeah, Rae Steining was complaining to me about that off-list a few weeks > ago. The whole syncscan behavior risks breaking many apps that "always > worked before", even if they were disregarding the letter of the SQL spec. > > Maybe a GUC variable to enable/disable syncscan? I'm not sure it's really a good reason for that because it's just a matter of time for them to be broken anyway. But it seems at least a good idea to have a way to build reproducible test cases on production boxes without being perturbed by the other scans running. Would it need a restart and be a global GUC variable or could it be set temporarily per session? -- Guillaume
"Guillaume Smet" <guillaume.smet@gmail.com> writes: > Would it need a restart and be a global GUC variable or > could it be set temporarily per session? It could be PGC_USERSET, afaics. regards, tom lane
On Jan 27, 2008 7:14 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > It could be PGC_USERSET, afaics. If so, it seems like a good idea even if it's just for debugging purposes. -- Guillaume
"Guillaume Smet" <guillaume.smet@gmail.com> writes: >>> Maybe a GUC variable to enable/disable syncscan? > If so, it seems like a good idea even if it's just for debugging purposes. Do we have nominations for a name? The first idea that comes to mind is "synchronized_scanning" (defaulting to ON). Also, does anyone object to making pg_dump just disable it unconditionally? Greg's original gripe only mentioned the case of clustered tables, but it'd be kind of a pain to make pg_dump turn it on and off again for different tables. And I could see people complaining about pg_dump failing to preserve row order even in unclustered tables. regards, tom lane
On Jan 27, 2008 7:37 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Also, does anyone object to making pg_dump just disable it > unconditionally? Greg's original gripe only mentioned the case of > clustered tables, but it'd be kind of a pain to make pg_dump turn it > on and off again for different tables. And I could see people > complaining about pg_dump failing to preserve row order even in > unclustered tables. +1. I don't think it's worth it to enable it for non clustered tables and it's always better to keep the order if we can. -- Guillaume
2008/1/28, Tom Lane <tgl@sss.pgh.pa.us>: > > Do we have nominations for a name? The first idea that comes to mind > is "synchronized_scanning" (defaulting to ON). "synchronized_sequential_scans" is a bit long, but contains the keyword "sequential scans", which will ring a bell with many, more so than "synchronized_scanning". Markus
On Jan 27, 2008 9:07 PM, Markus Bertheau <mbertheau.pg@googlemail.com> wrote: > 2008/1/28, Tom Lane <tgl@sss.pgh.pa.us>: > > > > Do we have nominations for a name? The first idea that comes to mind > > is "synchronized_scanning" (defaulting to ON). > > "synchronized_sequential_scans" is a bit long, but contains the > keyword "sequential scans", which will ring a bell with many, more so > than "synchronized_scanning". synchronize_seqscans?
Guillaume Smet wrote: > On Jan 27, 2008 9:07 PM, Markus Bertheau > <mbertheau.pg@googlemail.com> wrote: >> 2008/1/28, Tom Lane <tgl@sss.pgh.pa.us>: >>> Do we have nominations for a name? The first idea that comes to >>> mind is "synchronized_scanning" (defaulting to ON). >> "synchronized_sequential_scans" is a bit long, but contains the >> keyword "sequential scans", which will ring a bell with many, more >> so than "synchronized_scanning". > > synchronize_seqscans? How about enable_syncscan, or enable_seqscan_sync? It's not strictly something the influences the planner, but maybe it's similar enough to justify a similar naming? regards, Florian Pflug
Hi Florian, Glad to see you back! On Jan 28, 2008 3:25 PM, Florian G. Pflug <fgp@phlo.org> wrote: > How about enable_syncscan, or enable_seqscan_sync? It's not strictly > something the influences the planner, but maybe it's similar enough to > justify a similar naming? It was my first idea but I didn't propose it as it's really a different thing IMHO. enable_* variables don't change the way PostgreSQL really does the job as synchronize_scans (or whatever the name will be) does. And it's not very consistent with the other GUC variables (most of them could have "enable" in their name) but we limited the usage of enable_* to planner variables. I don't know if it's on purpose though. -- Guillaume
>>> On Sun, Jan 27, 2008 at 9:02 AM, in message <87odb7s45i.fsf@oxford.xeocode.com>, Gregory Stark <stark@enterprisedb.com> wrote: > Perhaps we should have some form of escape hatch for pg_dump to request real > physical order when dumping clustered tables. It would seem reasonable to me for pg_dump to use ORDER BY to select data from clustered tables. I don't see a general case for worrying about the order of rows returned by queries which lack an ORDER BY clause. -Kevin
Kevin Grittner wrote: >>>> On Sun, Jan 27, 2008 at 9:02 AM, in message >>>> > <87odb7s45i.fsf@oxford.xeocode.com>, Gregory Stark <stark@enterprisedb.com> > wrote: > > >> Perhaps we should have some form of escape hatch for pg_dump to request real >> physical order when dumping clustered tables. >> > > It would seem reasonable to me for pg_dump to use ORDER BY to select > data from clustered tables. > > I don't see a general case for worrying about the order of rows > returned by queries which lack an ORDER BY clause. > > > What will be the performance hit from doing that? cheers andrew
>>> On Mon, Jan 28, 2008 at 9:00 AM, in message <479DEDF5.4090909@dunslane.net>, Andrew Dunstan <andrew@dunslane.net> wrote: > Kevin Grittner wrote: >>>>> On Sun, Jan 27, 2008 at 9:02 AM, in message >> <87odb7s45i.fsf@oxford.xeocode.com>, Gregory Stark <stark@enterprisedb.com> >> wrote: >> >>> Perhaps we should have some form of escape hatch for pg_dump to request real >>> physical order when dumping clustered tables. >> >> It would seem reasonable to me for pg_dump to use ORDER BY to select >> data from clustered tables. > > What will be the performance hit from doing that? If the rows actually are in order of the clustered index, it shouldn't add much more than the time needed to sequentially pass the clustered index, should it? Even so, perhaps there should be a command-line option on pg_dump to control whether it does this. -Kevin
"Guillaume Smet" <guillaume.smet@gmail.com> writes: > It was my first idea but I didn't propose it as it's really a > different thing IMHO. enable_* variables don't change the way > PostgreSQL really does the job as synchronize_scans (or whatever the > name will be) does. > And it's not very consistent with the other GUC variables (most of > them could have "enable" in their name) but we limited the usage of > enable_* to planner variables. I don't know if it's on purpose though. Yeah, it is a more or less deliberate policy to use enable_ only for planner control variables, which this one certainly isn't. I seem to recall an argument also that prefixing enable_ is just noise; it doesn't add anything to your understanding of what the variable does. So far I think "synchronize_seqscans" is the best proposal. regards, tom lane
Andrew Dunstan <andrew@dunslane.net> writes: > Kevin Grittner wrote: >> It would seem reasonable to me for pg_dump to use ORDER BY to select >> data from clustered tables. > What will be the performance hit from doing that? That worries me too. Also, in general pg_dump's charter is to reproduce the state of the database as best it can, not to "improve" it. regards, tom lane
On Jan 28, 2008, at 8:36 AM, Tom Lane wrote: > Andrew Dunstan <andrew@dunslane.net> writes: >> Kevin Grittner wrote: >>> It would seem reasonable to me for pg_dump to use ORDER BY to select >>> data from clustered tables. > >> What will be the performance hit from doing that? > > That worries me too. Also, in general pg_dump's charter is to > reproduce > the state of the database as best it can, not to "improve" it. One common use of cluster around here is to act as a faster version of vacuum full when there's a lot of dead rows in a table. There's no intent to keep the table clustered on that index, and the cluster flag isn't removed with alter table (why bother, the only thing it affects is the cluster command). I'm guessing that's not unusual, and it'd lead to sorting tables as part of pg_dump. Cheers, Steve
>>> On Mon, Jan 28, 2008 at 10:36 AM, in message <3001.1201538162@sss.pgh.pa.us>, Tom Lane <tgl@sss.pgh.pa.us> wrote: > in general pg_dump's charter is to reproduce > the state of the database as best it can, not to "improve" it. Seems that I've often seen it recommended as a way to eliminate bloat. It seems like there are some practical use cases where it would be a pain to have to do a CLUSTER right on the heels of having used pg_dump to psql. This does seem like the right way to do it where a user really wants to maintain the physical sequence; my biggest concern is that CLUSTER is sometimes used to eliminate bloat, and there is no real interest in maintaining that sequence later. I'd bet that people generally do not alter the table to remove the clustered index choice, so this option could be rather painful somewhere downstream, when the sequence has become pretty random. Maybe it would make sense if it was not the default, and the issues were properly documented under the description of the option? -Kevin
Steve Atkins wrote: > On Jan 28, 2008, at 8:36 AM, Tom Lane wrote: > >> Andrew Dunstan <andrew@dunslane.net> writes: >>> Kevin Grittner wrote: >>>> It would seem reasonable to me for pg_dump to use ORDER BY to select >>>> data from clustered tables. >> >>> What will be the performance hit from doing that? >> >> That worries me too. Also, in general pg_dump's charter is to reproduce >> the state of the database as best it can, not to "improve" it. > > One common use of cluster around here is to act as a faster version > of vacuum full when there's a lot of dead rows in a table. There's no > intent to keep the table clustered on that index, and the cluster flag > isn't removed with alter table (why bother, the only thing it affects is > the cluster command). > > I'm guessing that's not unusual, and it'd lead to sorting tables as part > of pg_dump. I've done that too - and every time I typed that "CLUSTER ... " I thought why, oh why isn't there something like REWRITE TABLE <table>", which would work just like CLUSTER, but without the sorting ;-) Maybe something to put on the TODO list... We might even call it "VACCUM REWRITE" ;-) regards, Florian Pflug
On Sun, 2008-01-27 at 12:45 -0500, Tom Lane wrote: > Maybe a GUC variable to enable/disable syncscan? The first iterations of the patch included a GUC. I don't have any objection to re-introducing a GUC to enable/disable it. However, I would suggest that it defaults to "on", because: 1. There aren't many cases where you'd want it to be off, and this particular case with pg_dump is the best one that I've heard of (thanks Greg). We want people who install 8.3 to see a boost without lots of tuning, if possible. 2. It only turns on for tables over 25% of shared buffers anyway. Introducing GUCs reintroduces the same questions that were discussed before. 1. Should the 25% figure be tunable as well? 2. Remember that the 25% figure is also tied to Simon and Heikki's buffer recycling patch (buffer ring patch). Should they be configurable independently? Should they be tied together, but configurable? The simplest solution, in my opinion, is something like: large_scan_threshold = 0.25 # set to -1 to disable Where a scan of any table larger than (large_scan_threshold * shared_buffers) employs both synchronized scans and buffer recycling. We may implement other large scan strategies in the future. Regards,Jeff Davis
On Sun, 2008-01-27 at 15:02 +0000, Gregory Stark wrote: > It occurred to me the other day that synchronized scans could play havoc with > clustered tables. When you dump and reload a table even if it was recently > clustered if any other sequential scans are happening in the system at the > time you dump it the dump could shuffle the records out of order. > > Now the records would still be effectively ordered for most purposes but our > statistics can't detect that. Since the correlation would be poor the restored > database would have markedly different statistics showing virtually no > correlation on the clustered column. > > Perhaps we should have some form of escape hatch for pg_dump to request real > physical order when dumping clustered tables. > Thank you for bringing this up, it's an interesting point. Keep in mind that this only matters if you are actually running pg_dump concurrently with another scan, because a scan will reset the starting point after completing. Regards,Jeff Davis
On Sun, 2008-01-27 at 13:37 -0500, Tom Lane wrote: > Also, does anyone object to making pg_dump just disable it > unconditionally? Greg's original gripe only mentioned the case of > clustered tables, but it'd be kind of a pain to make pg_dump turn it > on and off again for different tables. And I could see people > complaining about pg_dump failing to preserve row order even in > unclustered tables. > If you are running pg_dump, that increases the likelihood that multiple sequential scans will be reading the same large table at the same time. Sync scans prevent that additional scan from bringing your active database to a halt during your dump (due to horrible seeking and poor cache efficiency). I think that pg_dump is a good use case for synchronized scans. Assuming it doesn't hold up 8.3, I think it's worthwhile to consider only disabling it for tables that have been clustered. That being said, this isn't a strong objection. Having a painless 8.3 release is the top priority, of course. Regards,Jeff Davis