Thread: Freeze avoidance of very large table.
Hi all,

I'd like to propose a read-only table feature to avoid full scans of very large tables. A WIP patch is attached.

- Background
Postgres can keep a tuple forever by freezing it, but freezing tuples requires scanning the whole table, which can hurt system performance, especially in very large databases. There is no command that guarantees a whole table has been completely frozen, so Postgres must keep freezing tuples even if the table has not been written to at all. We need a DDL command that ensures all tuples are frozen and marks the table as read-only, as one way to avoid full scans of very large tables. This topic has been discussed before, in a proposal by Simon.

- Feature
I tried to implement this feature as ALTER TABLE ... SET READ ONLY and SET READ WRITE. The attached WIP patch does the following:
* Adds a new column relreadonly to pg_class.
* Adds new syntax ALTER TABLE ... SET READ ONLY and ALTER TABLE ... SET READ WRITE.
* When marking a table read-only, all tuples of the table are frozen in one pass while holding ShareLock (like VACUUM FREEZE), and then pg_class.relreadonly is set to true.
* When un-marking, pg_class.relreadonly is simply set back to false.
* If the table has a TOAST table, the TOAST table is marked at the same time.
* Writes to and vacuums of a read-only table are rejected or ignored: INSERT, UPDATE, DELETE, explicit VACUUM, autovacuum.

There are a few remaining, non-critical problems:
* Freezing all tuples is quite similar to VACUUM FREEZE, but calling lazy_vacuum_rel() would be overkill, I think.
* The lock level needs more consideration.

Please give me feedback.

Regards,

-------
Sawada Masahiko
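A rough sketch of the write-path gate such a patch implies — relreadonly is the proposed new column, and the hook point and error details here are assumptions, not taken from the attached patch:

    /* Hypothetical check at the top of heap_insert(), heap_update()
     * and heap_delete(), rejecting DML on a read-only table. */
    if (relation->rd_rel->relreadonly)
        ereport(ERROR,
                (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
                 errmsg("cannot modify read-only table \"%s\"",
                        RelationGetRelationName(relation))));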
On 4/3/15 12:59 AM, Sawada Masahiko wrote:
> + case HEAPTUPLE_LIVE:
> + case HEAPTUPLE_RECENTLY_DEAD:
> + case HEAPTUPLE_INSERT_IN_PROGRESS:
> + case HEAPTUPLE_DELETE_IN_PROGRESS:
> +     if (heap_prepare_freeze_tuple(tuple.t_data, freezelimit,
> +                                   mxactcutoff, &frozen[nfrozen]))
> +         frozen[nfrozen++].offset = offnum;
> +     break;

This doesn't seem safe enough to me. Can't there be tuples that are still new enough that they can't be frozen, and are still live? I don't think it's safe to leave tuples as dead either, even if they're hinted; the hint may not be written. Also, the patch seems to completely ignore actually freezing the TOAST relation; I can't see how that's safe.

I'd feel a heck of a lot safer if, any time heap_prepare_freeze_tuple returned false, we did a second check on the tuple to ensure it was truly frozen.

Somewhat related... instead of forcing the freeze to happen synchronously, can't we set this up so a table is in one of three states? Read/Write, Read Only, Frozen. AT_SetReadOnly and AT_SetReadWrite would simply change to the appropriate state, and all the vacuum infrastructure would continue to process those tables as it does today. lazy_vacuum_rel would become responsible for tracking whether there were any non-frozen tuples if it was also attempting a freeze. If it discovered there were none, AND the table was marked ReadOnly, then it would change the table state to Frozen and set relfrozenxid = InvalidTransactionId and relminmxid = InvalidMultiXactId. AT_SetReadWrite could change relfrozenxid to its own Xid as an optimization. Doing it that way leaves all the complicated vacuum code in one place, and would eliminate concerns about race conditions with still-running transactions, etc.

BTW, you also need to put things in place to ensure it's impossible to unfreeze a tuple in a relation that's marked ReadOnly or Frozen. I'm not sure what the right way to do that would be.

--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com
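A sketch of the end-of-vacuum promotion described above; the state names, counter field, and helper functions are illustrative assumptions, not code from any patch in this thread:

    /*
     * Hypothetical tail of lazy_vacuum_rel(): if this vacuum also tried to
     * freeze, found no tuple it could not freeze, and the table stayed
     * ReadOnly throughout, promote it to Frozen and stop wraparound
     * tracking for it.
     */
    if (scan_all && vacrelstats->unfrozen_tuples == 0 &&
        relation_state(onerel) == RELSTATE_READONLY &&     /* assumed helper */
        !state_changed_during_vacuum)
    {
        set_relation_state(onerel, RELSTATE_FROZEN);       /* assumed helper */
        /* InvalidTransactionId/InvalidMultiXactId mean "nothing to freeze" */
        vac_update_relstats_frozen(onerel,
                                   InvalidTransactionId,   /* relfrozenxid */
                                   InvalidMultiXactId);    /* relminmxid */
    }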
On 4/4/15 5:10 PM, Jim Nasby wrote:
> BTW, you also need to put things in place to ensure it's impossible to
> unfreeze a tuple in a relation that's marked ReadOnly or Frozen. I'm not
> sure what the right way to do that would be.

Answering my own question... I think visibilitymap_clear() would be the right place. AFAICT this is basically as critical as clearing the VM, and that function has the Relation, so it can see what mode the relation is in.

There is another possibility here, too. We can completely divorce a ReadOnly mode (which I think is useful for other things besides freezing) from the question of whether we need to force-freeze a relation, if we create a FrozenMap, similar to the visibility map. This has the added advantage of helping freeze scans on relations that are not ReadOnly, in the case of tables that are insert-mostly or any other pattern where most pages stay all-frozen.

Prior to the visibility map this would have been a rather daunting project, but I believe this could piggyback on the VM code rather nicely. Any time you clear the VM you clearly must clear the FrozenMap as well. The logic for setting the FM is clearly different, but that would be entirely self-contained to vacuum. Unlike the VM, I don't see any point to marking special bits in the page itself for the FM.

It would be nice if each bit in the FM covered multiple pages, but that can be optimized later.

--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com
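The invariant proposed here (an FM bit can only be set where the VM bit is set) could be maintained by pairing the clears wherever a heap page is dirtied — a sketch only, with frozenmap_clear() assumed to mirror the 9.4-era VM API:

    /* Wherever heap code clears the all-visible state of a page, the
     * (hypothetical) frozen-map bit for that page must be cleared too. */
    if (PageIsAllVisible(page))
    {
        PageClearAllVisible(page);
        visibilitymap_clear(relation, BufferGetBlockNumber(buffer), vmbuffer);
        frozenmap_clear(relation, BufferGetBlockNumber(buffer), fmbuffer); /* assumption */
    }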
On Sat, Apr 4, 2015 at 3:10 PM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote:
> On 4/3/15 12:59 AM, Sawada Masahiko wrote:
>> + case HEAPTUPLE_LIVE:
>> + case HEAPTUPLE_RECENTLY_DEAD:
>> + case HEAPTUPLE_INSERT_IN_PROGRESS:
>> + case HEAPTUPLE_DELETE_IN_PROGRESS:
>> +     if (heap_prepare_freeze_tuple(tuple.t_data, freezelimit,
>> +                                   mxactcutoff, &frozen[nfrozen]))
>> +         frozen[nfrozen++].offset = offnum;
>> +     break;
>
> This doesn't seem safe enough to me. Can't there be tuples that are still new enough that they can't be frozen, and are still live?
Yep. I've set a table to read only while it contained unfreezable tuples, and the tuples remain unfrozen, yet the read-only action claims to have succeeded.
> Somewhat related... instead of forcing the freeze to happen synchronously, can't we set this up so a table is in one of three states? Read/Write, Read Only, Frozen. [...] Doing it that way leaves all the complicated vacuum code in one place, and would eliminate concerns about race conditions with still-running transactions, etc.
+1 here as well. I might want to set tables to read only for reasons other than to avoid repeated freezing (after all, the name of the command suggests it is a general-purpose thing), and wouldn't want to automatically trigger a vacuum freeze in the process.
Cheers,
Jeff
On Sun, Apr 5, 2015 at 8:21 AM, Jeff Janes <jeff.janes@gmail.com> wrote:
> On Sat, Apr 4, 2015 at 3:10 PM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote:
>> Somewhat related... instead of forcing the freeze to happen synchronously,
>> can't we set this up so a table is in one of three states? Read/Write,
>> Read Only, Frozen. [...]
>
> +1 here as well. I might want to set tables to read only for reasons other
> than to avoid repeated freezing (after all, the name of the command
> suggests it is a general-purpose thing), and wouldn't want to automatically
> trigger a vacuum freeze in the process.

Thank you for comments.

I agree with the three states: Read/Write, ReadOnly and Frozen. But I'm not sure when we should freeze tuples, e.g., by scanning the whole table. I think any changes to a table would be completely ignored/restricted if the table is marked ReadOnly, and that marking would not be accompanied by freezing tuples; it would just mark the table ReadOnly. A Frozen table ensures that all of its tuples have been completely frozen, so that also needs a whole-table scan. I.e., we would need to scan the whole table twice, right?

> There is another possibility here, too. We can completely divorce a
> ReadOnly mode (which I think is useful for other things besides freezing)
> from the question of whether we need to force-freeze a relation, if we
> create a FrozenMap, similar to the visibility map. [...]

I was actually thinking of this idea (the FM) to avoid freezing all tuples. As you said, it might not be a good idea (or might be overkill) for avoidance of repeated freezing to be the reason to set a table read-only. I'm attempting to design the FM to avoid freezing relations as well. Is it enough for each bit of the FM to indicate that the corresponding page is completely frozen?

Regards,

-------
Sawada Masahiko
On 4/6/15 1:46 AM, Sawada Masahiko wrote:
> I agree with the three states: Read/Write, ReadOnly and Frozen. But I'm not
> sure when we should freeze tuples, e.g., by scanning the whole table. [...]
> I.e., we would need to scan the whole table twice, right?

No. You would be free to set a table as ReadOnly any time you wanted, without scanning anything. All that setting does is disable any DML on the table.

The Frozen state would only be set by the vacuum code, IFF:
- The table state is ReadOnly *at the start of vacuum* and did not change during vacuum
- Vacuum ensured that there were no un-frozen tuples in the table

That does not necessitate 2 scans.

> I was actually thinking of this idea (the FM) to avoid freezing all tuples.
> [...] Is it enough for each bit of the FM to indicate that the
> corresponding page is completely frozen?

If I'm understanding your implied question correctly, I don't think there would actually be any relationship between the FM and marking ReadOnly. It would come into play if we wanted to do the Frozen state, but if we have the FM, marking an entire relation as Frozen becomes a lot less useful. What's going to happen with a VACUUM FREEZE once we have the FM is that vacuum will be able to skip reading pages if they are all-visible *and* the FM shows them as frozen, whereas today we can't use the VM to skip pages if scan_all is true.

For simplicity, I would start out with each FM bit representing a single page. That means the FM would be very similar in operation to the VM; the only difference would be when a bit in the FM was set. I would absolutely split this into 2 patches as well: one for ReadOnly (skipping the Frozen status for now), and one for the FM.

When I looked at the VM code briefly it occurred to me that it might be quite difficult to have 1 FM bit represent multiple pages. The issue is the locking necessary between VACUUM and clearing an FM bit. In the VM that's handled by the cleanup lock, but that will only work at a page level. We'd need something to ensure that nothing came in and performed DML while the vacuum code was getting ready to set an FM bit. There are probably several ways this could be accomplished, but I think it would be foolish to try to do anything about it in the initial patch, especially because it's only supposition that there would be much benefit to having multiple pages per bit.
--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com
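For reference, the per-page bit addressing that visibilitymap.c used at the time (and which a one-bit-per-page FM could copy directly) looks like this:

    /* From src/backend/access/heap/visibilitymap.c (9.4-era): each map
     * page holds MAPSIZE bytes, i.e. MAPSIZE * 8 heap-page bits. */
    #define MAPSIZE (BLCKSZ - MAXALIGN(SizeOfPageHeaderData))
    #define HEAPBLOCKS_PER_BYTE 8
    #define HEAPBLOCKS_PER_PAGE (MAPSIZE * HEAPBLOCKS_PER_BYTE)

    /* Map a heap block number to a map block, byte, and bit. */
    #define HEAPBLK_TO_MAPBLOCK(x) ((x) / HEAPBLOCKS_PER_PAGE)
    #define HEAPBLK_TO_MAPBYTE(x)  (((x) % HEAPBLOCKS_PER_PAGE) / HEAPBLOCKS_PER_BYTE)
    #define HEAPBLK_TO_MAPBIT(x)   ((x) % HEAPBLOCKS_PER_BYTE)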
On Mon, Apr 6, 2015 at 10:17 PM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote:
> No. You would be free to set a table as ReadOnly any time you wanted,
> without scanning anything. All that setting does is disable any DML on
> the table.
>
> The Frozen state would only be set by the vacuum code, IFF:
> - The table state is ReadOnly *at the start of vacuum* and did not change
>   during vacuum
> - Vacuum ensured that there were no un-frozen tuples in the table
>
> That does not necessitate 2 scans.

I understood this concept, and have a question, as I wrote below.

> For simplicity, I would start out with each FM bit representing a single
> page. [...] I would absolutely split this into 2 patches as well: one for
> ReadOnly (skipping the Frozen status for now), and one for the FM. [...]

Yes, I will separate the patch into two patches.

I'd like to confirm whether what I'm thinking is correct here. In the first patch, each FM bit would represent a single page and indicate whether all tuples of that page have been completely frozen.

The second patch would add the 3 states and the read-only table, which disables any write to the table. The trigger that changes state from Read/Write to ReadOnly is ALTER TABLE SET READ ONLY. And the trigger that changes from ReadOnly to Frozen is vacuum, only when the table was marked ReadOnly at the time the vacuum started *and* the vacuum did not freeze any tuple (including pages skipped via the FM). If we support the FM, we would be able to avoid repeatedly freezing the whole table even if the table has not been marked ReadOnly.

In order to change state to Frozen, we need to run VACUUM FREEZE or wait for autovacuum. Generally, the cutoff-xid threshold differs between VACUUM (and autovacuum) and VACUUM FREEZE, so we could not expect a plain explicit vacuum or autovacuum to change the state. Inevitably, we would need to run both ALTER TABLE SET READ ONLY and VACUUM FREEZE to change the state to Frozen. I think we should also add DDL that does both freezing tuples and changing the state in one pass, like ALTER TABLE SET READ ONLY WITH FREEZE or ALTER TABLE SET FROZEN.

Regards,

-------
Sawada Masahiko
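A sketch of how the FM-based skip mentioned above could look in the vacuum scan loop, by analogy with the VM skip — frozenmap_test() is an assumed counterpart of the 9.4-era visibilitymap_test():

    /* Hypothetical test in lazy_scan_heap(): even in a wraparound
     * (scan_all) vacuum, a page whose FM bit is set holds only frozen
     * tuples and can be skipped safely. */
    if (scan_all && frozenmap_test(onerel, blkno, &fmbuffer))
        continue;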
On 4/6/15 11:12 AM, Sawada Masahiko wrote:
> I'd like to confirm whether what I'm thinking is correct here. In the first
> patch, each FM bit would represent a single page and indicate whether all
> tuples of that page have been completely frozen.

Yes.

> The second patch would add the 3 states and the read-only table, which
> disables any write to the table. [...]

Actually, I would start simply with ReadOnly and ReadWrite.

As I understand it, the goal here is to prevent huge amounts of periodic freeze work due to XID wraparound. I don't think we need the Frozen state to accomplish that.

With a single bit per page in the Frozen Map, checking a 800GB table would require reading a mere 100MB of FM. That's pretty tiny, and largely accomplishes the goal.

Obviously it would be nice to eliminate even that 100MB read, but I suggest you leave that for a 3rd patch. I think you'll find that just getting the first 2 accomplished will be a significant amount of work.

Also, note that you don't really even need the ReadOnly patch. As long as you're not actually touching the table at all, the FM will eventually read as everything is frozen; that gets you 80% of the way there. So I'd suggest starting with the FM, then doing ReadOnly, and only then attempting to add the Frozen state.

--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com
On Mon, Apr 06, 2015 at 12:07:47PM -0500, Jim Nasby wrote:
> With a single bit per page in the Frozen Map, checking a 800GB table
> would require reading a mere 100MB of FM. That's pretty tiny, and
> largely accomplishes the goal.

Hi,

I may have my math wrong, but 800GB is ~100M pages, so ~100M bits, which is 12.5MB and not 100MB.

Regards,
Ken
On 4/6/15 12:29 PM, ktm@rice.edu wrote:
> I may have my math wrong, but 800GB is ~100M pages, so ~100M bits, which
> is 12.5MB and not 100MB.

Doh! 8 bits per byte and all that...

--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com
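Spelling out the corrected arithmetic (assuming 8 kB pages and one FM bit per heap page):

    /* 800 GB of heap at 8 kB/page is ~100M pages; one bit per page
     * packs 8 pages per FM byte, i.e. ~12.5 MB of FM. */
    uint64 heap_bytes = UINT64CONST(800) * 1024 * 1024 * 1024;
    uint64 n_pages    = heap_bytes / (8 * 1024);  /* 104,857,600 pages */
    uint64 fm_bytes   = n_pages / 8;              /* 13,107,200 bytes ~ 12.5 MB */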
On 04/06/2015 10:07 AM, Jim Nasby wrote:
> Actually, I would start simply with ReadOnly and ReadWrite.
> [...]
> So I'd suggest starting with the FM, then doing ReadOnly, and only then
> attempting to add the Frozen state.

+1

There was some reason why we didn't have a Freeze Map before, though; IIRC these were the problems:

1. It would need to be sync'd to disk and/or WAL-logged.
2. Every time a page is modified, the map would need to get updated.
3. Yet Another Relation File (not inconsequential for the cases we're discussing).

Also, given that the Visibility Map necessarily needs to be a superset of the Frozen Map, maybe combining them in some way would make sense.

I agree with Jim that if we have a trustworthy Frozen Map, having a ReadOnly flag is of marginal value, unless such a ReadOnly flag allowed us to skip updating the individual row XIDs entirely. I can think of some ways to do that, but they have severe tradeoffs.

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
Josh Berkus wrote:
> I agree with Jim that if we have a trustworthy Frozen Map, having a
> ReadOnly flag is of marginal value, unless such a ReadOnly flag allowed
> us to skip updating the individual row XIDs entirely. I can think of
> some ways to do that, but they have severe tradeoffs.

If you're thinking that the READ ONLY flag is only useful for freezing, then yeah, maybe it's of marginal value. But in the foreign key constraint area, consider that you could have tables with frequently-referenced PKs marked READ ONLY -- then you don't need to acquire row locks when inserting/updating rows in the referencing tables. That might give you a good performance benefit that's not in any way related to freezing, as well as reducing your multixact consumption rate.

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 04/06/2015 11:35 AM, Alvaro Herrera wrote:
> If you're thinking that the READ ONLY flag is only useful for freezing,
> then yeah, maybe it's of marginal value. But in the foreign key
> constraint area, consider that you could have tables with
> frequently-referenced PKs marked READ ONLY -- then you don't need to
> acquire row locks when inserting/updating rows in the referencing
> tables. [...]

Hmmmm. Yeah, that would make it worthwhile, although it would be a fairly obscure bit of performance optimization for anyone not on this list ;-)

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
On 4/6/15 1:28 PM, Josh Berkus wrote:
> There was some reason why we didn't have a Freeze Map before, though;
> IIRC these were the problems:
>
> 1. It would need to be sync'd to disk and/or WAL-logged.

Same as the VM.

> 2. Every time a page is modified, the map would need to get updated.

Not every time; just the first time, if the FM bit for the page was set. It would only be set by vacuum, just like the VM.

> 3. Yet Another Relation File (not inconsequential for the cases we're
> discussing).

Sure, which is why I think it might be interesting to either allow for more than one page per bit, or perhaps some form of compression. That said, I don't think it's worth worrying about too much, because it's still a 64,000:1 ratio with 8k pages. If you use 32k pages it becomes 256,000:1, or 4GB of FM for 1PB of heap.

> Also, given that the Visibility Map necessarily needs to be a superset
> of the Frozen Map, maybe combining them in some way would make sense.

The thing is, I think in many workloads the patterns here will actually be radically different, in that it's way easier to get a page to be all-visible than it is to freeze it. Perhaps there's something we can do here when we look at other ways to reduce space usage for the FM (and maybe the VM too), but I don't think now is the time to put effort into this.

> I agree with Jim that if we have a trustworthy Frozen Map, having a
> ReadOnly flag is of marginal value [...]

Aside from Alvaro's points, I think many users would find it useful as an easy way to ensure no one is writing to a table, which could be valuable for any number of reasons. As long as the patch isn't too complicated, I don't see a reason not to do it.

--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com
<p dir="ltr"><br /> On 6 Apr 2015 09:17, "Jim Nasby" <<a href="mailto:Jim.Nasby@bluetreble.com">Jim.Nasby@bluetreble.com</a>>wrote:<br /> ><br /> > <br /> > No. You wouldbe free to set a table as ReadOnly any time you wanted, without scanning anything. All that setting does is disableany DML on the table.<br /> ><br /> > The Frozen state would only be set by the vacuum code, IFF:<br /> >- The table state is ReadOnly *at the start of vacuum* and did not change during vacuum<br /> > - Vacuum ensured thatthere were no un-frozen tuples in the table<br /> ><br /> > That does not necessitate 2 scans.<p dir="ltr">Thisis exactly what I would suggest.<p dir="ltr">Only I would suggest thinking of it in terms of two orthogonalboolean flags rather than three states. It's easier to reason about whether a table has a specific property thantrying to control a state machine in a predefined pathway.<p dir="ltr">So I would say the two flags are: <br /> READONLY:guarantees nothing can be dirtied<br /> ALLFROZEN: guarantees no unfrozen tuples are present<p dir="ltr">In practiceyou can't have the later without the former since vacuum can't know everything is frozen unless it knows nobody isinserting. But perhaps there will be cases in the future where that's not true.<p dir="ltr">Incidentally there are numberof other optimisations tat over had in mind that are only possible on frozen read-only tables:<p dir="ltr">1) Compression:compress the pages and pack them one after the other. Build a new fork with offsets for each page.<p dir="ltr">2)Automatic partition elimination where the statistics track the minimum and maximum value per partition (and numberof tuples) and treat then as implicit constraints. In particular it would magically make read only empty parent partitionsbe excluded regardless of the where clause.
On 4/6/15 5:18 PM, Greg Stark wrote:
> Only I would suggest thinking of it in terms of two orthogonal boolean
> flags rather than three states. It's easier to reason about whether a
> table has a specific property than trying to control a state machine in
> a predefined pathway.
>
> So I would say the two flags are:
> READONLY: guarantees nothing can be dirtied
> ALLFROZEN: guarantees no unfrozen tuples are present
>
> In practice you can't have the latter without the former, since vacuum
> can't know everything is frozen unless it knows nobody is inserting. But
> perhaps there will be cases in the future where that's not true.

I'm not so sure about that. There's a logical state progression here (see below). ISTM it's easier to just enforce that in one place instead of a bunch of places having to check multiple conditions. But I'm not wed to a single field.

> Incidentally, there are a number of other optimisations that I've had in
> mind that are only possible on frozen read-only tables:
>
> 1) Compression: compress the pages and pack them one after the other.
> Build a new fork with offsets for each page.
>
> 2) Automatic partition elimination, where the statistics track the
> minimum and maximum value per partition (and number of tuples) and treat
> them as implicit constraints. In particular it would magically make
> read-only empty parent partitions be excluded regardless of the where
> clause.

AFAICT neither of those actually requires ALLFROZEN, no? You'll need to uncompact and re-compact for #1 when you actually freeze (which maybe isn't worth it), but freezing isn't absolutely required. #2 would only require that everything in the relation is visible, not frozen.

I think there's value here to having an ALLVISIBLE state as well as ALLFROZEN.

--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com
On Tue, Apr 7, 2015 at 7:53 AM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote:
> I'm not so sure about that. There's a logical state progression here (see
> below). ISTM it's easier to just enforce that in one place instead of a
> bunch of places having to check multiple conditions. But I'm not wed to a
> single field.
> [...]
> I think there's value here to having an ALLVISIBLE state as well as
> ALLFROZEN.

Based on the many suggestions, I'm going to deal with the FM first, as one patch. It would be a simple mechanism, similar to the VM. In that first patch:
- Each bit of the FM represents a single page
- A bit is set only by vacuum
- A bit is unset by inserts, updates and deletes

Second, I'll deal with a simple read-only table with 2 states, Read/Write (default) and ReadOnly, as one patch. ISTM that having the Frozen state needs more discussion. A read-only table just disallows any update of the table, controlled by a read-only flag in pg_class, with DDL commands like ALTER TABLE SET READ ONLY / READ WRITE to change the status.

Also, as Alvaro suggested, a read-only table is relevant not only to freezing but also to performance optimization. I'll consider including that when I deal with the read-only table.

Regards,

-------
Sawada Masahiko
On Tue, Apr 7, 2015 at 11:22 AM, Sawada Masahiko <sawada.mshk@gmail.com> wrote: > On Tue, Apr 7, 2015 at 7:53 AM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote: >> On 4/6/15 5:18 PM, Greg Stark wrote: >>> >>> Only I would suggest thinking of it in terms of two orthogonal boolean >>> flags rather than three states. It's easier to reason about whether a >>> table has a specific property than trying to control a state machine in >>> a predefined pathway. >>> >>> So I would say the two flags are: >>> READONLY: guarantees nothing can be dirtied >>> ALLFROZEN: guarantees no unfrozen tuples are present >>> >>> In practice you can't have the later without the former since vacuum >>> can't know everything is frozen unless it knows nobody is inserting. But >>> perhaps there will be cases in the future where that's not true. >> >> >> I'm not so sure about that. There's a logical state progression here (see >> below). ISTM it's easier to just enforce that in one place instead of a >> bunch of places having to check multiple conditions. But, I'm not wed to a >> single field. >> >>> Incidentally there are number of other optimisations tat over had in >>> mind that are only possible on frozen read-only tables: >>> >>> 1) Compression: compress the pages and pack them one after the other. >>> Build a new fork with offsets for each page. >>> >>> 2) Automatic partition elimination where the statistics track the >>> minimum and maximum value per partition (and number of tuples) and treat >>> then as implicit constraints. In particular it would magically make read >>> only empty parent partitions be excluded regardless of the where clause. >> >> >> AFAICT neither of those actually requires ALLFROZEN, no? You'll need to >> uncompact and re-compact for #1 when you actually freeze (which maybe isn't >> worth it), but freezing isn't absolutely required. #2 would only require >> that everything in the relation is visible; not frozen. >> >> I think there's value here to having an ALLVISIBLE state as well as >> ALLFROZEN. >> > > Based on may suggestions, I'm going to deal with FM at first as one > patch. It would be simply mechanism and similar to VM, at first patch. > - Each bit of FM represent single page > - The bit is set only by vacuum > - The bit is un-set by inserting and updating and deleting > > At second, I'll deal with simply read-only table and 2 states, > Read/Write(default) and ReadOnly as one patch. ITSM the having the > Frozen state needs to more discussion. read-only table just allow us > to disable any updating table, and it's controlled by read-only flag > pg_class has. And DDL command which changes these status is like ALTER > TABLE SET READ ONLY, or READ WRITE. > Also as Alvaro's suggested, the read-only table affect not only > freezing table but also performance optimization. I'll consider > including them when I deal with read-only table. > Attached WIP patch adds Frozen Map which enables us to avoid whole table vacuuming even when full scan is required: preventing XID wraparound failures. Frozen Map is a bitmap with one bit per heap page, and quite similar to Visibility Map. A set bit means that all tuples on heap page are completely frozen, therefore we don't need to do vacuum freeze that page. A bit is set when vacuum(or autovacuum) figures out that all tuples on corresponding heap page are completely frozen, and a bit is cleared when INSERT and UPDATE(only new heap page) are executed. 
The current patch adds a new source file, src/backend/access/heap/frozenmap.c, which is quite similar to visibilitymap.c. They have similar code but are separated for now. I could refactor this source code, e.g. into a common bitmap.c, if needed. Also, when skipping vacuum using the visibility map we only skip runs of at least SKIP_PAGES_THRESHOLD consecutive pages, but such a mechanism is not yet in the frozen map. Please give me feedback. Regards, ------- Sawada Masahiko
Attachment
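For readers not familiar with visibilitymap.c, here is a minimal sketch of the API shape such a frozenmap.c would mirror. frozenmap_test() and frozenmap_set() appear in the patch snippets quoted later in this thread; the other names and all signatures here are assumptions modeled on the visibility map's entry points, not the actual patch:

    /* Return true if the FM bit for heapBlk is set, i.e. every tuple on
     * the heap page is known frozen. */
    extern bool frozenmap_test(Relation rel, BlockNumber heapBlk,
                               Buffer *fmbuf);

    /* Pin the FM page covering heapBlk into *fmbuf for a later set or
     * clear, so no I/O happens while the heap buffer is locked. */
    extern void frozenmap_pin(Relation rel, BlockNumber heapBlk,
                              Buffer *fmbuf);

    /* Set the FM bit for heapBlk. Per the design above, only VACUUM
     * would call this; recptr is the WAL position of the change, as with
     * visibilitymap_set(). */
    extern void frozenmap_set(Relation rel, BlockNumber heapBlk,
                              Buffer heapBuf, XLogRecPtr recptr,
                              Buffer fmBuf);

    /* Clear the FM bit for heapBlk; INSERT and UPDATE (and, per the later
     * discussion, DELETE) would call this when dirtying the page. */
    extern void frozenmap_clear(Relation rel, BlockNumber heapBlk,
                                Buffer fmbuf);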
On Mon, Apr 20, 2015 at 04:45:34PM +0900, Sawada Masahiko wrote: > Attached WIP patch adds Frozen Map which enables us to avoid whole > table vacuuming even when full scan is required: preventing XID > wraparound failures. > > Frozen Map is a bitmap with one bit per heap page, and quite similar > to Visibility Map. A set bit means that all tuples on heap page are > completely frozen, therefore we don't need to do vacuum freeze that > page. > A bit is set when vacuum(or autovacuum) figures out that all tuples on > corresponding heap page are completely frozen, and a bit is cleared > when INSERT and UPDATE(only new heap page) are executed. So, this patch avoids reading the all-frozen pages if they have not been modified since the last VACUUM FREEZE? Since they are already frozen, the running VACUUM FREEZE will not modify those pages or generate WAL, so is it really worth maintaining a new per-page bitmap just to avoid the sequential scan of tables every 200M transactions? I would like to see us reduce the need for VACUUM FREEZE, rather than go in this direction. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + Everyone has their own god. +
On 4/20/15 1:48 PM, Bruce Momjian wrote: > On Mon, Apr 20, 2015 at 04:45:34PM +0900, Sawada Masahiko wrote: >> Attached WIP patch adds Frozen Map which enables us to avoid whole >> table vacuuming even when full scan is required: preventing XID >> wraparound failures. >> >> Frozen Map is a bitmap with one bit per heap page, and quite similar >> to Visibility Map. A set bit means that all tuples on heap page are >> completely frozen, therefore we don't need to do vacuum freeze that >> page. >> A bit is set when vacuum(or autovacuum) figures out that all tuples on >> corresponding heap page are completely frozen, and a bit is cleared >> when INSERT and UPDATE(only new heap page) are executed. > > So, this patch avoids reading the all-frozen pages if it has not been > modified since the last VACUUM FREEZE? Since it is already frozen, the > running VACUUM FREEZE will not modify the page or generate WAL, so is it > really worth maintaining a new per-page bitmap just to avoid the > sequential scan of tables every 200MB transactions? I would like to see > us reduce the need for VACUUM FREEZE, rather than go in this direction. How would you propose we do that? I also think there's better ways we could handle *all* our cleanup work. Tuples have a definite lifespan, and there's potentially a lot of efficiency to be gained if we could track tuples through their stages of life... but I don't see any easy ways to do that. -- Jim Nasby, Data Architect, Blue Treble Consulting Data in Trouble? Get it in Treble! http://BlueTreble.com
On Mon, Apr 20, 2015 at 01:59:17PM -0500, Jim Nasby wrote: > On 4/20/15 1:48 PM, Bruce Momjian wrote: > >On Mon, Apr 20, 2015 at 04:45:34PM +0900, Sawada Masahiko wrote: > >>Attached WIP patch adds Frozen Map which enables us to avoid whole > >>table vacuuming even when full scan is required: preventing XID > >>wraparound failures. > >> > >>Frozen Map is a bitmap with one bit per heap page, and quite similar > >>to Visibility Map. A set bit means that all tuples on heap page are > >>completely frozen, therefore we don't need to do vacuum freeze that > >>page. > >>A bit is set when vacuum(or autovacuum) figures out that all tuples on > >>corresponding heap page are completely frozen, and a bit is cleared > >>when INSERT and UPDATE(only new heap page) are executed. > > > >So, this patch avoids reading the all-frozen pages if it has not been > >modified since the last VACUUM FREEZE? Since it is already frozen, the > >running VACUUM FREEZE will not modify the page or generate WAL, so is it > >really worth maintaining a new per-page bitmap just to avoid the > >sequential scan of tables every 200MB transactions? I would like to see > >us reduce the need for VACUUM FREEZE, rather than go in this direction. > > How would you propose we do that? > > I also think there's better ways we could handle *all* our cleanup > work. Tuples have a definite lifespan, and there's potentially a lot > of efficiency to be gained if we could track tuples through their > stages of life... but I don't see any easy ways to do that. See the TODO list: https://wiki.postgresql.org/wiki/Todoo Avoid the requirement of freezing pages that are infrequently modified -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + Everyone has their own god. +
On 4/20/15 2:09 PM, Bruce Momjian wrote: > On Mon, Apr 20, 2015 at 01:59:17PM -0500, Jim Nasby wrote: >> On 4/20/15 1:48 PM, Bruce Momjian wrote: >>> On Mon, Apr 20, 2015 at 04:45:34PM +0900, Sawada Masahiko wrote: >>>> Attached WIP patch adds Frozen Map which enables us to avoid whole >>>> table vacuuming even when full scan is required: preventing XID >>>> wraparound failures. >>>> >>>> Frozen Map is a bitmap with one bit per heap page, and quite similar >>>> to Visibility Map. A set bit means that all tuples on heap page are >>>> completely frozen, therefore we don't need to do vacuum freeze that >>>> page. >>>> A bit is set when vacuum(or autovacuum) figures out that all tuples on >>>> corresponding heap page are completely frozen, and a bit is cleared >>>> when INSERT and UPDATE(only new heap page) are executed. >>> >>> So, this patch avoids reading the all-frozen pages if it has not been >>> modified since the last VACUUM FREEZE? Since it is already frozen, the >>> running VACUUM FREEZE will not modify the page or generate WAL, so is it >>> really worth maintaining a new per-page bitmap just to avoid the >>> sequential scan of tables every 200MB transactions? I would like to see >>> us reduce the need for VACUUM FREEZE, rather than go in this direction. >> >> How would you propose we do that? >> >> I also think there's better ways we could handle *all* our cleanup >> work. Tuples have a definite lifespan, and there's potentially a lot >> of efficiency to be gained if we could track tuples through their >> stages of life... but I don't see any easy ways to do that. > > See the TODO list: > > https://wiki.postgresql.org/wiki/Todo > o Avoid the requirement of freezing pages that are infrequently > modified Right, but do you have a proposal for how that would actually happen? Perhaps I'm mis-understanding you, but it sounded like you were opposed to this patch because it doesn't do anything to avoid the need to freeze. My point is that no one has any good ideas on how to avoid freezing, and I think it's a safe bet that any ideas people do come up with there will be a lot more invasive than a FrozenMap is. While not perfect, a FrozenMap is something we can do today, without a lot of effort, and it will provide definite value for any tables that have a "good" amount of frozen pages. Without performance testing, we don't know what "good" actually looks like, but we can't test without a patch (which we now have). Assuming performance numbers look good I think it would be folly to reject this patch in the hopes that eventually we'll have some way to avoid the need to freeze. -- Jim Nasby, Data Architect, Blue Treble Consulting Data in Trouble? Get it in Treble! http://BlueTreble.com
On Mon, Apr 20, 2015 at 03:58:19PM -0500, Jim Nasby wrote: > >>I also think there's better ways we could handle *all* our cleanup > >>work. Tuples have a definite lifespan, and there's potentially a lot > >>of efficiency to be gained if we could track tuples through their > >>stages of life... but I don't see any easy ways to do that. > > > >See the TODO list: > > > > https://wiki.postgresql.org/wiki/Todo > > o Avoid the requirement of freezing pages that are infrequently > > modified > > Right, but do you have a proposal for how that would actually happen? > > Perhaps I'm mis-understanding you, but it sounded like you were > opposed to this patch because it doesn't do anything to avoid the > need to freeze. My point is that no one has any good ideas on how to > avoid freezing, and I think it's a safe bet that any ideas people do > come up with there will be a lot more invasive than a FrozenMap is. Didn't you think any of the TODO threads had workable solutions? And don't expect adding an additional file per relation will be zero cost --- added over the lifetime of 200M transactions, I question if this approach would be a win. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + Everyone has their own god. +
On 4/20/15 2:45 AM, Sawada Masahiko wrote: > Current patch adds new source file src/backend/access/heap/frozenmap.c > which is quite similar to visibilitymap.c. They have similar code but > are separated for now. I do refactoring these source code like adding > bitmap.c, if needed. My feeling is we'd definitely want this refactored; it looks to be a whole lot of duplicated code. But before working on that we should get consensus that a FrozenMap is a good idea. Are there any meaningful differences between the two, besides the obvious name changes? I think there's also a bunch of XLOG stuff that could be refactored too... > Also, when skipping vacuum by visibility map, we can skip at least > SKIP_PAGE_THESHOLD consecutive page, but such mechanism is not in > frozen map. That's probably something else that can be factored out, since it's basically the same logic. I suspect we just need to && some of the checks so we're looking at both FM and VM at the same time. Other comments... It would be nice if we didn't need another page bit for FM; do you see any reasonable way that could happen? + * If we didn't pin the visibility(and frozen) map page and the page has + * become all visible(and frozen) while we were busy locking the buffer, + * or during some subsequent window during which we had it unlocked, + * we'll have to unlock and re-lock, to avoid holding the buffer lock + * across an I/O. That's a bit unfortunate, especially since we'll now + * have to recheck whether the tuple has been locked or updated under us, + * but hopefully it won't happen very often. */ s/(and frozen)/ or frozen/ + * Reply XLOG_HEAP3_FROZENMAP record. s/Reply/Replay/ + /* + * XLogReplayBufferExtended locked the buffer. But frozenmap_set + * will handle locking itself. + */ + LockBuffer(fmbuffer, BUFFER_LOCK_UNLOCK); Doesn't this create a race condition? Are you sure the bit in finish_heap_swap() is safe? If so, we should add the same for the visibility map too (it certainly better be all visible if it's frozen...) + /* + * Current block is all-visible. + * If frozen map represents that it's all frozen and this + * function is called for freezing tuples, we can skip to + * vacuum block. + */ I would state this as "Even if scan_all is true, we can skip blocks that are marked as frozen." + if (frozenmap_test(onerel, blkno, &fmbuffer) && scan_all) I suspect it's faster to reverse those tests (scan_all && frozenmap_test())... but why do we even need to look at scan_all? AFAICT if a block is frozen we can skip it unconditionally. + /* + * If the un-frozen tuple is remaining in current page and + * current page is marked as ALL_FROZEN, we should clear it. + */ That needs to NEVER happen. If it does then we're going to consider tuples as visible/frozen that shouldn't be. We should probably throw an error here, because it means the heap is now corrupted. At the minimum it needs to be an assert(). Note that I haven't reviewed all the logic in detail at this point. If this ends up being refactored it'll be a lot easier to spot logic problems, so I'll hold off on that for now. -- Jim Nasby, Data Architect, Blue Treble Consulting Data in Trouble? Get it in Treble! http://BlueTreble.com
On 4/20/15 4:13 PM, Bruce Momjian wrote: > On Mon, Apr 20, 2015 at 03:58:19PM -0500, Jim Nasby wrote: >>>> I also think there's better ways we could handle *all* our cleanup >>>> work. Tuples have a definite lifespan, and there's potentially a lot >>>> of efficiency to be gained if we could track tuples through their >>>> stages of life... but I don't see any easy ways to do that. >>> >>> See the TODO list: >>> >>> https://wiki.postgresql.org/wiki/Todo >>> o Avoid the requirement of freezing pages that are infrequently >>> modified >> >> Right, but do you have a proposal for how that would actually happen? >> >> Perhaps I'm mis-understanding you, but it sounded like you were >> opposed to this patch because it doesn't do anything to avoid the >> need to freeze. My point is that no one has any good ideas on how to >> avoid freezing, and I think it's a safe bet that any ideas people do >> come up with there will be a lot more invasive than a FrozenMap is. > > Didn't you think any of the TODO threads had workable solutions? And I didn't realize there were threads there. The first three are discussion around the idea of eliminating the need to freeze based on a page already being all visible. No patches. http://www.postgresql.org/message-id/CA+TgmoaEmnoLZmVbb8gvY69NA8zw9BWpiZ9+TLz-LnaBOZi7JA@mail.gmail.com has a WIP patch that goes the route of using a tuple flag to indicate frozen, but also raises a lot of concerns about visibility, because it means we'd stop using FrozenXID. That impacts a large amount of code. There were some followup patches as well as a bunch of discussion of how to make it visible that a tuple was frozen or not. That thread died in January 2014. The fifth thread is XID to LSN mapping. AFAICT this has a significant drawback in that it breaks page compatibility, meaning no pg_upgrade. It ends 5/14/2014 with this comment: "Well, Heikki was saying on another thread that he had kind of gotten cold feet about this, so I gather he's not planning to pursue it. Not sure if I understood that correctly. If so, I guess it depends on whether someone else can pick it up, but we might first want to establish why he got cold feet and how worrying those problems seem to other people." - http://www.postgresql.org/message-id/CA+TgmoYoN8LzSuaffUaEkyV8Mhv1wi=ZLBXQ3VOfEZNO1dbw9Q@mail.gmail.com So work was done on two alternative approaches, and then abandoned. Both of those approaches might still be valid, but seem to need more work. They're also higher risk because they're changing MVCC at a very fundamental level. As I mentioned, I think there's a lot better stuff we could be doing about tuple lifetime, but there's no easy fixes to be had. This patch solves a problem today, using a concept that's now well proven (visibility map). If we had something more sophisticated being developed then I'd be inclined not to pursue this patch, but that's not the case. Perhaps others can elaborate on where those two patches are at... > don't expect adding an additional file per relation will be zero cost > --- added over the lifetime of 200M transactions, I question if this > approach would be a win. Can you elaborate on this? I don't see how the number of transactions would come into play, but the overhead here is not large; the FrozenMap would be the same size as the VM map, which is 1/64,000th as large as the heap. So a 64G table means a 1M FM. That doesn't seem very expensive. -- Jim Nasby, Data Architect, Blue Treble Consulting Data in Trouble? Get it in Treble! http://BlueTreble.com
On 04/20/2015 02:13 PM, Bruce Momjian wrote: > Didn't you think any of the TODO threads had workable solutions? And > don't expect adding an additional file per relation will be zero cost > --- added over the lifetime of 200M transactions, I question if this > approach would be a win. Well, the only real way to test that is a prototype, no? -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
On 2015-04-20 17:13:29 -0400, Bruce Momjian wrote: > Didn't you think any of the TODO threads had workable solutions? And > don't expect adding an additional file per relation will be zero cost > --- added over the lifetime of 200M transactions, I question if this > approach would be a win. Note that normally you'd not run with a 200M transaction freeze max age on a busy server. Rather around a magnitude more. Think about this being used on a time partitioned table. Right now all the partitions have to be fully rescanned on a regular basis - quite painful. With something like this normally only the newest partitions will have to be. Greetings, Andres Freund
On Tue, Apr 21, 2015 at 7:00 AM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote:
> On 4/20/15 2:45 AM, Sawada Masahiko wrote:
>>
>> Current patch adds new source file src/backend/access/heap/frozenmap.c
>> which is quite similar to visibilitymap.c. They have similar code but
>> are separated for now. I do refactoring these source code like adding
>> bitmap.c, if needed.
>
Thank you for having a look at this patch.
>
> My feeling is we'd definitely want this refactored; it looks to be a whole
> lot of duplicated code. But before working on that we should get consensus
> that a FrozenMap is a good idea.
Yes, we need to get consensus about the FrozenMap before starting work on it.
In addition to the comments you pointed out, I noticed one problem I should address: a FrozenMap bit needs to be cleared on deletion as well (i.e. when xmax is set).
For now, a page marked frozen could still contain dead tuples, but I'm thinking of changing this so that a frozen page guarantees the page is all frozen *and* all visible.
> Are there any meaningful differences between the two, besides the obvious
> name changes?
No, there aren't.
> I think there's also a bunch of XLOG stuff that could be refactored too...
I agree with you.
>> Also, when skipping vacuum by visibility map, we can skip at least
>> SKIP_PAGE_THESHOLD consecutive page, but such mechanism is not in
>> frozen map.
>
>
> That's probably something else that can be factored out, since it's
> basically the same logic. I suspect we just need to && some of the checks so
> we're looking at both FM and VM at the same time.
The FrozenMap is used to skip scanning only during anti-wraparound vacuum or when freezing all tuples (i.e. when scan_all is true); a sketch of the combined check Jim suggests appears after this mail.
Normal vacuum uses only the VM; it doesn't use the FM for now.
> Other comments...
>
> It would be nice if we didn't need another page bit for FM; do you see any
> reasonable way that could happen?
We may be able to remove the FM page bit from the page header, but I'm not sure we can do that.
> + * If we didn't pin the visibility(and frozen) map page and the page
> has
> + * become all visible(and frozen) while we were busy locking the
> buffer,
> + * or during some subsequent window during which we had it unlocked,
> + * we'll have to unlock and re-lock, to avoid holding the buffer
> lock
> + * across an I/O. That's a bit unfortunate, especially since we'll
> now
> + * have to recheck whether the tuple has been locked or updated
> under us,
> + * but hopefully it won't happen very often.
> */
>
> s/(and frozen)/ or frozen/
>
>
> + * Reply XLOG_HEAP3_FROZENMAP record.
> s/Reply/Replay/
Understood.
>
> + /*
> + * XLogReplayBufferExtended locked the buffer. But
> frozenmap_set
> + * will handle locking itself.
> + */
> + LockBuffer(fmbuffer, BUFFER_LOCK_UNLOCK);
>
> Doesn't this create a race condition?
>
>
> Are you sure the bit in finish_heap_swap() is safe? If so, we should add the
> same the same for the visibility map too (it certainly better be all visible
> if it's frozen...)
We cannot ensure a page is all visible even after VACUUM FULL, because dead tuples can remain, e.g. when another process inserts and then updates the same tuple in the same transaction before the VACUUM FULL.
I had been thinking that the FrozenMap was free of any influence from delete operations. But as I said at the top of this mail, a FrozenMap bit needs to be cleared on deletion.
So I will remove the related code, as you mentioned.
>
>
>
> + /*
> + * Current block is all-visible.
> + * If frozen map represents that it's all frozen and
> this
> + * function is called for freezing tuples, we can
> skip to
> + * vacuum block.
> + */
>
> I would state this as "Even if scan_all is true, we can skip blocks that are
> marked as frozen."
>
> + if (frozenmap_test(onerel, blkno, &fmbuffer) &&
> scan_all)
>
> I suspect it's faster to reverse those tests (scan_all &&
> frozenmap_test())... but why do we even need to look at scan_all? AFAICT if
> a block as frozen we can skip it unconditionally.
In the current patch, a tuple which is frozen and dead could remain in a page marked all frozen;
i.e., it is possible for a page to be marked frozen while not being all visible.
But I'm thinking of changing that.
>
>
> + /*
> + * If the un-frozen tuple is remaining in current
> page and
> + * current page is marked as ALL_FROZEN, we should
> clear it.
> + */
>
> That needs to NEVER happen. If it does then we're going to consider tuples
> as visible/frozen that shouldn't be. We should probably throw an error here,
> because it means the heap is now corrupted. At the minimum it needs to be an
> assert().
I understood. I'll fix it.
> Note that I haven't reviewed all the logic in detail at this point. If this
> ends up being refactored it'll be a lot easier to spot logic problems, so
> I'll hold off on that for now.
Understood; we need to get consensus first.
Regards,
-------
Sawada Masahiko
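Regarding the combined FM-and-VM check discussed in the mail above, here is a hedged sketch of the per-block skip test in lazy_scan_heap(). Variable and function names follow the snippets quoted in this thread; the real loop would also need the SKIP_PAGES_THRESHOLD run-length logic, which is omitted here:

    /* Sketch only: inside the per-block loop of lazy_scan_heap(). */
    if (scan_all)
    {
        /* Anti-wraparound scan: blocks the FM records as all frozen
         * need no freezing work, so they may be skipped. */
        if (frozenmap_test(onerel, blkno, &fmbuffer))
            continue;
    }
    else
    {
        /* Normal vacuum: may skip all-visible blocks, as today. */
        if (visibilitymap_test(onerel, blkno, &vmbuffer))
            continue;
    }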
On 2015-04-21 23:59:45 +0900, Sawada Masahiko wrote: > The page as frozen could have the dead tuple for now, but I think to change > to that the frozen page guarantees that page is all frozen *and* all > visible. It shouldn't. That'd potentially cause corruption after a wraparound. A tuple's visibility might change due to that. Greetings, Andres Freund
On Wed, Apr 22, 2015 at 12:02 AM, Andres Freund <andres@anarazel.de> wrote: > On 2015-04-21 23:59:45 +0900, Sawada Masahiko wrote: >> The page as frozen could have the dead tuple for now, but I think to change >> to that the frozen page guarantees that page is all frozen *and* all >> visible. > > It shouldn't. That'd potentially cause corruption after a wraparound. A > tuple's visibility might change due to that. A page marked frozen could have some dead tuples, right? I think we should clear the FrozenMap bit (and the page header flag) on delete operations, with the bit being set only by vacuum. Accordingly, a page marked frozen would then guarantee that it is all frozen and all visible? Regards, ------- Sawada Masahiko
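A hedged sketch of what clearing-on-delete could look like in heap_delete(), modeled on how the visibility map bit is cleared there today. PageIsAllFrozen/PageClearAllFrozen stand in for the patch's page-header flag and are assumptions here, not the patch's actual names:

    /* Sketch only: in heap_delete(), heap buffer exclusively locked. */
    if (PageIsAllVisible(page))
    {
        all_visible_cleared = true;
        PageClearAllVisible(page);
        visibilitymap_clear(relation, BufferGetBlockNumber(buffer),
                            vmbuffer);
    }

    /* New: a delete invalidates the all-frozen state as well, since the
     * deleted tuple's xmax is not frozen. */
    if (PageIsAllFrozen(page))
    {
        PageClearAllFrozen(page);
        frozenmap_clear(relation, BufferGetBlockNumber(buffer),
                        fmbuffer);
    }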
On 2015-04-22 00:15:53 +0900, Sawada Masahiko wrote: > On Wed, Apr 22, 2015 at 12:02 AM, Andres Freund <andres@anarazel.de> wrote: > > On 2015-04-21 23:59:45 +0900, Sawada Masahiko wrote: > >> The page as frozen could have the dead tuple for now, but I think to change > >> to that the frozen page guarantees that page is all frozen *and* all > >> visible. > > > > It shouldn't. That'd potentially cause corruption after a wraparound. A > > tuple's visibility might change due to that. > > The page as frozen could have some dead tuples, right? Well, we right now don't really freeze pages, but tuples. But in what you described above that could happen. > I think we should to clear a bit of FrozenMap (and flag of page > header) on delete operation, and a bit is set only by vacuum. Yes. > So accordingly, the page as frozen guarantees that all frozen and all > visible? I think that's how it has to be, yes. I *do* wonder if we shouldn't redefine the VM to also contain information about the frozenness. Having two identically structured maps that'll often both have to be touched at the same time isn't nice. Neither is adding another fork. Given the size of the files pg_upgrade could be made to rewrite them. The bigger question is probably how bad that'd be for index-only efficiency. Greetings, Andres Freund
On Mon, Apr 20, 2015 at 7:59 PM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote: > http://www.postgresql.org/message-id/CA+TgmoaEmnoLZmVbb8gvY69NA8zw9BWpiZ9+TLz-LnaBOZi7JA@mail.gmail.com > has a WIP patch that goes the route of using a tuple flag to indicate > frozen, but also raises a lot of concerns about visibility, because it means > we'd stop using FrozenXID. That impacts a large amount of code. There were > some followup patches as well as a bunch of discussion of how to make it > visible that a tuple was frozen or not. That thread died in January 2014. Actually, this change has already been made, so it's not so much of a to-do as a was-done. See commit 37484ad2aacef5ec794f4dd3d5cf814475180a78. The immediate thing we got out of that change is that when CLUSTER or VACUUM FULL rewrite a table, they now freeze all of the tuples using this method. See commits 3cff1879f8d03cb729368722ca823a4bf74c0cac and af2543e884db06c0beb75010218cd88680203b86. Previously, CLUSTER or VACUUM FULL would not freeze anything, which meant that people who tried to use VACUUM FULL to recover from XID wraparound problems got nowhere, and even people who knew when to use which tool could end up having to VACUUM FULL and then VACUUM FREEZE afterward, rewriting the table twice, an annoyance. It's possible that we could use this infrastructure to freeze more aggressively in other circumstances. For example, perhaps VACUUM should freeze any page it intends to mark all-visible. That's not a guaranteed win, because it might increase WAL volume: setting a page all-visible does not emit an FPI for that page, but freezing any tuple on it would, if the page hasn't otherwise been modified since the last checkpoint. Even if that were no issue, the freezing itself must be WAL-logged. But if we could somehow get to a place where all-visible => frozen, then autovacuum would never need to visit all-visible pages, a huge win. We could also attack the problem from the other end. Instead of trying to set the bits on the individual tuples, we could decide that whenever a page is marked all-visible, we regard it as frozen regardless of the bits set or not set on the individual tuples. Anybody who wants to modify the page must freeze any unfrozen tuples "for real" before clearing the visibility map bit. This would have the same end result as the previous idea: all-visible would essentially imply frozen, and autovacuum could ignore those pages categorically. I'm not saying those ideas don't have problems, because they do. But I think they are worth further exploring. The main reason I gave up on that is because Heikki was working on the XID-to-LSN mapping stuff. That seemed like a better approach than either of the above, so as long as Heikki was working on that, there wasn't much reason to pursue more lowbrow approaches. Clearly, though, we need to do something about this. Freezing is a big problem for lots of users. All that having been said, I don't think adding a new fork is a good approach. We already have problems pretty commonly where our customers complain about running out of inodes. Adding another fork for every table would exacerbate that problem considerably. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
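For reference, the commit Robert mentions implements freezing by setting a combination of existing hint bits instead of overwriting xmin, so the original xmin survives for forensic purposes. The marker in htup_details.h is defined along these lines (bit values as in the PostgreSQL sources of this era):

    #define HEAP_XMIN_COMMITTED     0x0100  /* t_xmin committed */
    #define HEAP_XMIN_INVALID       0x0200  /* t_xmin invalid/aborted */
    /* Both bits at once would otherwise be a contradiction, so the
     * combination is reused to mean "xmin is frozen". */
    #define HEAP_XMIN_FROZEN        (HEAP_XMIN_COMMITTED | HEAP_XMIN_INVALID)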
On 2015-04-21 16:21:47 -0400, Robert Haas wrote: > All that having been said, I don't think adding a new fork is a good > approach. We already have problems pretty commonly where our > customers complain about running out of inodes. Adding another fork > for every table would exacerbate that problem considerably. Really? These days? There's good arguments against another fork (increased number of fsyncs, more stat calls, increased number of file handles, more WAL logging, ...), but the number of inodes themselves seems like something halfway recent filesystems should handle. Greetings, Andres Freund
On Tue, Apr 21, 2015 at 4:27 PM, Andres Freund <andres@anarazel.de> wrote: > On 2015-04-21 16:21:47 -0400, Robert Haas wrote: >> All that having been said, I don't think adding a new fork is a good >> approach. We already have problems pretty commonly where our >> customers complain about running out of inodes. Adding another fork >> for every table would exacerbate that problem considerably. > > Really? These days? There's good arguments against another fork > (increased number of fsyncs, more stat calls, increased number of file > handles, more WAL logging, ...), but the number of inodes themselves > seems like something halfway recent filesystems should handle. Not making it up... -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 4/21/15 3:21 PM, Robert Haas wrote: > It's possible that we could use this infrastructure to freeze more > aggressively in other circumstances. For example, perhaps VACUUM > should freeze any page it intends to mark all-visible. That's not a > guaranteed win, because it might increase WAL volume: setting a page > all-visible does not emit an FPI for that page, but freezing any tuple > on it would, if the page hasn't otherwise been modified since the last > checkpoint. Even if that were no issue, the freezing itself must be > WAL-logged. But if we could somehow get to a place where all-visible > => frozen, then autovacuum would never need to visit all-visible > pages, a huge win. I don't know how bad the extra WAL traffic would be; we'd obviously need to incur it eventually, so it's a question of how common it is for a page to go all-visible but then go not-all-visible again before freezing. It would presumably be far more traffic than some form of a FrozenMap though... > We could also attack the problem from the other end. Instead of > trying to set the bits on the individual tuples, we could decide that > whenever a page is marked all-visible, we regard it as frozen > regardless of the bits set or not set on the individual tuples. > Anybody who wants to modify the page must freeze any unfrozen tuples > "for real" before clearing the visibility map bit. This would have > the same end result as the previous idea: all-visible would > essentially imply frozen, and autovacuum could ignore those pages > categorically. Pushing what's currently background work onto foreground processes doesn't seem like a good idea... > I'm not saying those ideas don't have problems, because they do. But > I think they are worth further exploring. The main reason I gave up > on that is because Heikki was working on the XID-to-LSN mapping stuff. > That seemed like a better approach than either of the above, so as > long as Heikki was working on that, there wasn't much reason to pursue > more lowbrow approaches. Clearly, though, we need to do something > about this. Freezing is a big problem for lots of users. Did XID-LSN die? I see at the bottom of the thread it was returned with feedback; I guess Heikki just hasn't had time and there are no major blockers? From what I remember this is probably a better solution, but if it's not going to make it into 9.6 then we should probably at least look further into a FM. > All that having been said, I don't think adding a new fork is a good > approach. We already have problems pretty commonly where our > customers complain about running out of inodes. Adding another fork > for every table would exacerbate that problem considerably. Andres' idea of adding this to the VM may work well to handle that. It would double the size of the VM, but it would still be a ratio of 32,000-1 compared to heap size, or 2MB for a 64GB table. -- Jim Nasby, Data Architect, Blue Treble Consulting Data in Trouble? Get it in Treble! http://BlueTreble.com
On Tue, Apr 21, 2015 at 7:24 PM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote: > On 4/21/15 3:21 PM, Robert Haas wrote: >> It's possible that we could use this infrastructure to freeze more >> aggressively in other circumstances. For example, perhaps VACUUM >> should freeze any page it intends to mark all-visible. That's not a >> guaranteed win, because it might increase WAL volume: setting a page >> all-visible does not emit an FPI for that page, but freezing any tuple >> on it would, if the page hasn't otherwise been modified since the last >> checkpoint. Even if that were no issue, the freezing itself must be >> WAL-logged. But if we could somehow get to a place where all-visible >> => frozen, then autovacuum would never need to visit all-visible >> pages, a huge win. > > I don't know how bad the extra WAL traffic would be; we'd obviously need to > incur it eventually, so it's a question of how common it is for a page to go > all-visible but then go not-all-visible again before freezing. It would > presumably be far more traffic than some form of a FrozenMap though... Yeah, maybe. The freeze record contains details for each TID, while the freeze map bit would only need to be set once for the whole page. I wonder if the format of that record could be optimized somehow. >> We could also attack the problem from the other end. Instead of >> trying to set the bits on the individual tuples, we could decide that >> whenever a page is marked all-visible, we regard it as frozen >> regardless of the bits set or not set on the individual tuples. >> Anybody who wants to modify the page must freeze any unfrozen tuples >> "for real" before clearing the visibility map bit. This would have >> the same end result as the previous idea: all-visible would >> essentially imply frozen, and autovacuum could ignore those pages >> categorically. > > Pushing what's currently background work onto foreground processes doesn't > seem like a good idea... When you phrase it that way, no, but pushing work that otherwise would need to be done right now off to a future time that may never arrive sounds like a good idea. Today, we freeze the page -- rewriting it -- and then keep scanning those all-frozen pages every X number of transactions to make sure they are really all-frozen. In this system, we'd eliminate the repeated scanning and defer the freeze work until the page actually gets modified again. But that might never happen, in which case we never have to do the work at all. >> I'm not saying those ideas don't have problems, because they do. But >> I think they are worth further exploring. The main reason I gave up >> on that is because Heikki was working on the XID-to-LSN mapping stuff. >> That seemed like a better approach than either of the above, so as >> long as Heikki was working on that, there wasn't much reason to pursue >> more lowbrow approaches. Clearly, though, we need to do something >> about this. Freezing is a big problem for lots of users. > > Did XID-LSN die? I see at the bottom of the thread it was returned with > feedback; I guess Heikki just hasn't had time and there's no major blockers? > From what I remember this is probably a better solution, but if it's not > going to make it into 9.6 then we should probably at least look further into > a FM. Heikki said he'd lost enthusiasm for it, but he wasn't too specific about his reasons, IIRC. I guess maybe just that it got complicated, and he wasn't sure it was correct. 
>> All that having been said, I don't think adding a new fork is a good >> approach. We already have problems pretty commonly where our >> customers complain about running out of inodes. Adding another fork >> for every table would exacerbate that problem considerably. > > Andres idea of adding this to the VM may work well to handle that. It would > double the size of the VM, but it would still be a ratio of 32,000-1 > compared to heap size, or 2MB for a 64GB table. Yes, that's got some potential. It would mean pg_upgrade would have to remove all existing visibility maps when upgrading to the new version, or rewrite them into the new format. But it otherwise seems promising. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> wrote: > It's possible that we could use this infrastructure to freeze > more aggressively in other circumstances. For example, perhaps > VACUUM should freeze any page it intends to mark all-visible. > That's not a guaranteed win, because it might increase WAL > volume: setting a page all-visible does not emit an FPI for that > page, but freezing any tuple on it would, if the page hasn't > otherwise been modified since the last checkpoint. Even if that > were no issue, the freezing itself must be WAL-logged. But if we > could somehow get to a place where all-visible => frozen, then > autovacuum would never need to visit all-visible pages, a huge > win. That would eliminate full-table scan vacuums, right? It would do that by adding incremental effort and WAL to the "normal" autovacuum run to eliminate the full table scan and the associated mass freeze WAL-logging? It's hard to see how that would not be an overall win. > We could also attack the problem from the other end. Instead of > trying to set the bits on the individual tuples, we could decide > that whenever a page is marked all-visible, we regard it as > frozen regardless of the bits set or not set on the individual > tuples. Anybody who wants to modify the page must freeze any > unfrozen tuples "for real" before clearing the visibility map > bit. This would have the same end result as the previous idea: > all-visible would essentially imply frozen, and autovacuum could > ignore those pages categorically. Besides putting work into the foreground that could be done in the background, that sounds more complicated. Also, there is no ability to "pace" the freeze load or use scheduled jobs to shift the work to off-peak hours. -- Kevin Grittner EDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Wed, Apr 22, 2015 at 11:09 AM, Kevin Grittner <kgrittn@ymail.com> wrote: > Robert Haas <robertmhaas@gmail.com> wrote: >> It's possible that we could use this infrastructure to freeze >> more aggressively in other circumstances. For example, perhaps >> VACUUM should freeze any page it intends to mark all-visible. >> That's not a guaranteed win, because it might increase WAL >> volume: setting a page all-visible does not emit an FPI for that >> page, but freezing any tuple on it would, if the page hasn't >> otherwise been modified since the last checkpoint. Even if that >> were no issue, the freezing itself must be WAL-logged. But if we >> could somehow get to a place where all-visible => frozen, then >> autovacuum would never need to visit all-visible pages, a huge >> win. > > That would eliminate full-table scan vacuums, right? It would do > that by adding incremental effort and WAL to the "normal" > autovacuum run to eliminate the full table scan and the associated > mass freeze WAL-logging? It's hard to see how that would not be an > overall win. Yes and yes. In terms of an overall win, this design loses when the tuples that have been recently marked all-visible are going to get updated again in the near future. In that case, the effort we spend to freeze them is wasted. I just tested "pgbench -i -s 40 -n" followed by "VACUUM" or alternatively followed by "VACUUM FREEZE". The VACUUM generated 4641kB of WAL. The VACUUM FREEZE generated 515MB of WAL - that is, 113 times more. So changing every VACUUM to act like VACUUM FREEZE would be quite expensive. We'll still come out ahead if those tuples are going to stick around long enough that they would have eventually gotten frozen anyway, but if they get deleted again the loss is pretty significant. Incidentally, the reason for the large difference is that when Heikki created the visibility map, it wasn't necessary for the WAL records that set the visibility map bits to bump the page LSN, because it was just a hint anyway. When I made the visibility-map crash-safe, I went to some pains to preserve that property. Therefore, a regular VACUUM does not emit full page images for the heap pages - it does for the visibility map pages themselves, but there aren't very many of those. In this example, the relation itself was 512MB, so you can see that adding freezing to the mix roughly doubles the I/O cost. Either way we have to write half a gig of dirty data pages, but in one case we also have to write an additional half a gig of WAL. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 04/22/2015 05:33 PM, Robert Haas wrote: > On Tue, Apr 21, 2015 at 7:24 PM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote: >> On 4/21/15 3:21 PM, Robert Haas wrote: >>> I'm not saying those ideas don't have problems, because they do. But >>> I think they are worth further exploring. The main reason I gave up >>> on that is because Heikki was working on the XID-to-LSN mapping stuff. >>> That seemed like a better approach than either of the above, so as >>> long as Heikki was working on that, there wasn't much reason to pursue >>> more lowbrow approaches. Clearly, though, we need to do something >>> about this. Freezing is a big problem for lots of users. >> >> Did XID-LSN die? I see at the bottom of the thread it was returned with >> feedback; I guess Heikki just hasn't had time and there's no major blockers? >> From what I remember this is probably a better solution, but if it's not >> going to make it into 9.6 then we should probably at least look further into >> a FM. > > Heikki said he'd lost enthusiasm for it, but he wasn't too specific > about his reasons, IIRC. I guess maybe just that it got complicated, > and he wasn't sure it was correct. I'd like to continue working on that when I get around to it. Or even better if someone else continues it :-). The thing that made me nervous about that approach is that it made the LSN of each page critical information. If you somehow zeroed out the LSN, you could no longer tell which pages are frozen and which are not. I'm sure it could be made to work - and I got it working to some degree anyway - but it's a bit scary. It's similar to the multixid changes in 9.3: multixids also used to be data that you can just zap at restart, and when we changed the rules so that you lose data if you lose multixids, we got trouble. Now, LSNs are much simpler, and there wouldn't be anything like the multioffset/member SLRUs that you'd have to keep around forever or vacuum, but still.. I would feel safer if we added a completely new "epoch" counter to the page header, instead of reusing LSNs. But as we all know, changing the page format is a problem for in-place upgrade, and takes some space too. - Heikki
Robert Haas <robertmhaas@gmail.com> wrote: > I just tested "pgbench -i -s 40 -n" followed by "VACUUM" or > alternatively followed by "VACUUM FREEZE". The VACUUM generated > 4641kB of WAL. The VACUUM FREEZE generated 515MB of WAL - that > is, 113 times more. Essentially a bulk load. OK, so if you bulk load data and then vacuum it before updating 100% of it, this approach will generate a lot more WAL than we currently do. Of course, if you don't VACUUM FREEZE after a bulk load and then are engaged in a fairly normal OLTP workload with peak and off-peak cycles, you are currently almost certain to hit a point during peak OLTP load where you begin to sequentially scan all tables, rewriting them in place, with WAL logging. Incidentally, this tends to flush a lot of your "hot" data out of cache, increasing disk reads. The first time I hit this "interesting" experience in production it was so devastating, and generated so many user complaints, that I never again considered a bulk load complete until I had run VACUUM FREEZE on it -- although I was sometimes able to defer that to an off-peak window of time. In other words, for the production environments I managed, the only value of that number is in demonstrating the importance of using unlogged COPY followed by VACUUM FREEZE for bulk-loading and capturing a fresh base backup upon completion. A better way to use pgbench to measure WAL size cost might be to initialize, VACUUM FREEZE to set a "long term baseline", and do a reasonable length run with crontab running VACUUM FREEZE periodically (including after the run was complete) versus doing the same with plain VACUUM (followed by a VACUUM FREEZE at the end?). Comparing the total WAL sizes generated following the initial load and VACUUM FREEZE would give a more accurate picture of the impact on an OLTP load, I think. > We'll still come out ahead if those tuples are going to stick > around long enough that they would have eventually gotten frozen > anyway, but if they get deleted again the loss is pretty > significant. Perhaps my perception is biased by having worked in an environment where the vast majority of tuples (both in terms of tuple count and byte count) were never updated and were only eligible for deletion after a period of years. Our current approach is pretty bad in such an environment, at least if you try to leave all vacuuming to autovacuum. I'll admit that we were able to work around the problems by running VACUUM FREEZE every night for most databases. -- Kevin Grittner EDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Wed, Apr 22, 2015 at 12:39 PM, Heikki Linnakangas <hlinnaka@iki.fi> wrote: > The thing that made me nervous about that approach is that it made the LSN > of each page critical information. If you somehow zeroed out the LSN, you > could no longer tell which pages are frozen and which are not. I'm sure it > could be made to work - and I got it working to some degree anyway - but > it's a bit scary. It's similar to the multixid changes in 9.3: multixids > also used to be data that you can just zap at restart, and when we changed > the rules so that you lose data if you lose multixids, we got trouble. Now, > LSNs are much simpler, and there wouldn't be anything like the > multioffset/member SLRUs that you'd have to keep around forever or vacuum, > but still.. LSNs are already pretty critical. If they're in the future, you can't flush those pages. Ever. And if they're wrong in either direction, crash recovery is broken. But it's still worth thinking about ways that we could make this more robust. I keep coming back to the idea of treating any page that is marked as all-visible as frozen, and deferring freezing until the page is again modified. The big downside of this is that if the page is set as all-visible and then immediately thereafter modified, it sucks to have to freeze when the XIDs in the page are still present in CLOG. But if we could determine from the LSN that the XIDs in the page are new enough to still be considered valid, then we could skip freezing in those cases and only do it when the page is "old". That way, if somebody zeroed out the LSN (why, oh why?) the worst that would happen is that we'd do some extra freezing when the page was next modified. > I would feel safer if we added a completely new "epoch" counter to the page > header, instead of reusing LSNs. But as we all know, changing the page > format is a problem for in-place upgrade, and takes some space too. Yeah. We have a serious need to reduce the size of our on-disk format. On a TPC-C-like workload Jan Wieck recently tested, our data set was 34% larger than another database at the beginning of the test, and 80% larger by the end of the test. And we did twice the disk writes. See "The Elephants in the Room.pdf" at https://sites.google.com/site/robertmhaas/presentations -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Wed, Apr 22, 2015 at 2:23 PM, Kevin Grittner <kgrittn@ymail.com> wrote: > Robert Haas <robertmhaas@gmail.com> wrote: >> I just tested "pgbench -i -s 40 -n" followed by "VACUUM" or >> alternatively followed by "VACUUM FREEZE". The VACUUM generated >> 4641kB of WAL. The VACUUM FREEZE generated 515MB of WAL - that >> is, 113 times more. > > Essentially a bulk load. OK, so if you bulk load data and then > vacuum it before updating 100% of it, this approach will generate a > lot more WAL than we currently do. Of course, if you don't VACUUM > FREEZE after a bulk load and then are engaged in a fairly normal > OLTP workload with peak and off-peak cycles, you are currently > almost certain to hit a point during peak OLTP load where you begin > to sequentially scan all tables, rewriting them in place, with WAL > logging. Incidentally, this tends to flush a lot of your "hot" > data out of cache, increasing disk reads. The first time I hit > this "interesting" experience in production it was so devastating, > and generated so many user complaints, that I never again > considered a bulk load complete until I had run VACUUM FREEZE on it > -- although I was sometimes able to defer that to an off-peak > window of time. > > In other words, for the production environments I managed, the only > value of that number is in demonstrating the importance of using > unlogged COPY followed by VACUUM FREEZE for bulk-loading and > capturing a fresh base backup upon completion. A better way to use > pgbench to measure WAL size cost might be to initialize, VACUUM > FREEZE to set a "long term baseline", and do a reasonable length > run with crontab running VACUUM FREEZE periodically (including > after the run was complete) versus doing the same with plain VACUUM > (followed by a VACUUM FREEZE at the end?). Comparing the total WAL > sizes generated following the initial load and VACUUM FREEZE would > give a more accurate picture of the impact on an OLTP load, I > think. Sure, that would be a better test. But I'm pretty sure the impact will still be fairly substantial. >> We'll still come out ahead if those tuples are going to stick >> around long enough that they would have eventually gotten frozen >> anyway, but if they get deleted again the loss is pretty >> significant. > > Perhaps my perception is biased by having worked in an environment > where the vast majority of tuples (both in terms of tuple count and > byte count) were never updated and were only eligible for deletion > after a period of years. Our current approach is pretty bad in > such an environment, at least if you try to leave all vacuuming to > autovacuum. I'll admit that we were able to work around the > problems by running VACUUM FREEZE every night for most databases. Yeah. And that breaks down when you have very big databases with a high XID consumption rate, because the mostly-no-op VACUUM FREEZE runs for longer than you can tolerate. I'm not saying we don't need to fix this problem; we clearly do. I'm just saying that we've got to be careful not to harm other scenarios in the process. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 4/22/15 1:24 PM, Robert Haas wrote: > I keep coming back to the idea of treating any page that is marked as > all-visible as frozen, and deferring freezing until the page is again > modified. The big downside of this is that if the page is set as > all-visible and then immediately thereafter modified, it sucks to have > to freeze when the XIDs in the page are still present in CLOG. But if > we could determine from the LSN that the XIDs in the page are new > enough to still be considered valid, then we could skip freezing in > those cases and only do it when the page is "old". That way, if > somebody zeroed out the LSN (why, oh why?) the worst that would happen > is that we'd do some extra freezing when the page was next modified. Maybe freezing a page as part of making it not all-visible wouldn't be that horrible, even without LSN. For one, we already know that every tuple is visible, so no MVCC checks needed. That's probably a significant savings over current freezing. If we're marking a page as no longer all-visible, that means we're already dirtying it and generating WAL for it (likely including a FPI). We may be able to consolidate all of this into a new WAL record that's a lot more efficient than what we currently do for freezing. I suspect we wouldn't need to log each TID we're freezing, for starters. Even if we did though, we could at least combine all that into one WAL message that just contains an array of TIDs or LPs. <ponders...> I think we could actually proactively freeze tuples during vacuum too, at least if we're about to mark the page as all-visible. Though, with Robert's HEAP_XMIN_FROZEN change we could be a lot more aggressive about freezing during VACUUM, certainly for pages we're already dirtying, especially if we can keep the WAL cost of that down. -- Jim Nasby, Data Architect, Blue Treble Consulting Data in Trouble? Get it in Treble! http://BlueTreble.com
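A hedged sketch of the kind of consolidated record described above. The struct and field names are invented for illustration; the existing per-tuple freeze WAL format carries xmin/xmax/infomask details that this sketch assumes away by relying on the page having been all-visible already:

    /* Hypothetical one-record-per-page freeze WAL format; sketch only.
     * Because every tuple on the page was already all-visible, replay
     * would not need per-tuple visibility detail, just which line
     * pointers to mark frozen. */
    typedef struct xl_heap_freeze_page
    {
        RelFileNode node;       /* relation whose page is being frozen */
        BlockNumber block;      /* heap page number */
        uint16      ntuples;    /* number of line pointers that follow */
        /* OffsetNumber offsets[FLEXIBLE_ARRAY_MEMBER]; frozen TIDs */
    } xl_heap_freeze_page;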
On Tue, Apr 21, 2015 at 08:39:37AM +0200, Andres Freund wrote: > On 2015-04-20 17:13:29 -0400, Bruce Momjian wrote: > > Didn't you think any of the TODO threads had workable solutions? And > > don't expect adding an additional file per relation will be zero cost > > --- added over the lifetime of 200M transactions, I question if this > > approach would be a win. > > Note that normally you'd not run with a 200M transaction freeze max age > on a busy server. Rather around a magnitude more. > > Think about this being used on a time partionioned table. Right now all > the partitions have to be fully rescanned on a regular basis - quite > painful. With something like this normally only the newest partitions > will have to be. My point is that for the life of 200M transactions, you would have the overhead of an additional file per table in the file system, and updates of that. I just don't know if the overhead over the long time period would be smaller than the VACUUM FREEZE. It might be fine --- I don't know. People seem to focus on the big activities, while many small activities can lead to larger slowdowns. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + Everyone has their own god. +
On 4/22/15 6:12 PM, Bruce Momjian wrote: > My point is that for the life of 200M transactions, you would have the > overhead of an additional file per table in the file system, and updates > of that. I just don't know if the overhead over the long time period > would be smaller than the VACUUM FREEZE. It might be fine --- I don't > know. People seem to focus on the big activities, while many small > activities can lead to larger slowdowns. Ahh. This wouldn't be for the life of 200M transactions; it would be a permanent fork, just like the VM is. -- Jim Nasby, Data Architect, Blue Treble Consulting Data in Trouble? Get it in Treble! http://BlueTreble.com
On Wed, Apr 22, 2015 at 06:36:23PM -0500, Jim Nasby wrote: > On 4/22/15 6:12 PM, Bruce Momjian wrote: > >My point is that for the life of 200M transactions, you would have the > >overhead of an additional file per table in the file system, and updates > >of that. I just don't know if the overhead over the long time period > >would be smaller than the VACUUM FREEZE. It might be fine --- I don't > >know. People seem to focus on the big activities, while many small > >activities can lead to larger slowdowns. > > Ahh. This wouldn't be for the life of 200M transactions; it would be > a permanent fork, just like the VM is. Right. My point is that either you do X 2M times to maintain that fork and the overhead of the file existence, or you do one VACUUM FREEZE. I am saying that 2M is a large number and adding all those X's might exceed the cost of a VACUUM FREEZE. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + Everyone has their own god. +
On Thu, Apr 23, 2015 at 3:24 AM, Robert Haas <robertmhaas@gmail.com> wrote: > On Wed, Apr 22, 2015 at 12:39 PM, Heikki Linnakangas <hlinnaka@iki.fi> wrote: >> The thing that made me nervous about that approach is that it made the LSN >> of each page critical information. If you somehow zeroed out the LSN, you >> could no longer tell which pages are frozen and which are not. I'm sure it >> could be made to work - and I got it working to some degree anyway - but >> it's a bit scary. It's similar to the multixid changes in 9.3: multixids >> also used to be data that you can just zap at restart, and when we changed >> the rules so that you lose data if you lose multixids, we got trouble. Now, >> LSNs are much simpler, and there wouldn't be anything like the >> multioffset/member SLRUs that you'd have to keep around forever or vacuum, >> but still.. > > LSNs are already pretty critical. If they're in the future, you can't > flush those pages. Ever. And if they're wrong in either direction, > crash recovery is broken. But it's still worth thinking about ways > that we could make this more robust. > > I keep coming back to the idea of treating any page that is marked as > all-visible as frozen, and deferring freezing until the page is again > modified. The big downside of this is that if the page is set as > all-visible and then immediately thereafter modified, it sucks to have > to freeze when the XIDs in the page are still present in CLOG. But if > we could determine from the LSN that the XIDs in the page are new > enough to still be considered valid, then we could skip freezing in > those cases and only do it when the page is "old". That way, if > somebody zeroed out the LSN (why, oh why?) the worst that would happen > is that we'd do some extra freezing when the page was next modified. In your idea, if we have a WORM (write-once read-many) table, then the tuples in its pages would not be frozen at all unless we run VACUUM FREEZE. Also in this situation, from the second time on VACUUM FREEZE would need to scan only the pages added since the last freezing, so we could reduce I/O, but we would still need to do explicit freezing for anti-wraparound as in the past. A WORM table has huge data in general, and that data would increase rapidly, so it would also be expensive. > >> I would feel safer if we added a completely new "epoch" counter to the page >> header, instead of reusing LSNs. But as we all know, changing the page >> format is a problem for in-place upgrade, and takes some space too. > > Yeah. We have a serious need to reduce the size of our on-disk > format. On a TPC-C-like workload Jan Wieck recently tested, our data > set was 34% larger than another database at the beginning of the test, > and 80% larger by the end of the test. And we did twice the disk > writes. See "The Elephants in the Room.pdf" at > https://sites.google.com/site/robertmhaas/presentations > Regards, ------- Sawada Masahiko
On 04/22/2015 09:24 PM, Robert Haas wrote: >> I would feel safer if we added a completely new "epoch" counter to the page >> >header, instead of reusing LSNs. But as we all know, changing the page >> >format is a problem for in-place upgrade, and takes some space too. > Yeah. We have a serious need to reduce the size of our on-disk > format. On a TPC-C-like workload Jan Wieck recently tested, our data > set was 34% larger than another database at the beginning of the test, > and 80% larger by the end of the test. And we did twice the disk > writes. See "The Elephants in the Room.pdf" at > https://sites.google.com/site/robertmhaas/presentations Meh. Adding an 8-byte header to every 8k block would add 0.1% to the disk size. No doubt it would be nice to reduce our disk footprint, but the page header is not the elephant in the room. - Heikki
On 21 April 2015 at 22:21, Robert Haas <robertmhaas@gmail.com> wrote:
--
I'm not saying those ideas don't have problems, because they do. But
I think they are worth further exploring. The main reason I gave up
on that is because Heikki was working on the XID-to-LSN mapping stuff.
That seemed like a better approach than either of the above, so as
long as Heikki was working on that, there wasn't much reason to pursue
more lowbrow approaches. Clearly, though, we need to do something
about this. Freezing is a big problem for lots of users.
All that having been said, I don't think adding a new fork is a good
approach. We already have problems pretty commonly where our
customers complain about running out of inodes. Adding another fork
for every table would exacerbate that problem considerably.
We were talking about having an incremental backup map also. Which sounds a lot like the freeze map.
XID-to-LSN sounded cool but was complex. If we need the map for backup purposes, we may as well do it the simple way and hit both birds at once.
We only need a freeze/backup map for larger relations. So if we map 1000 blocks per map page, we skip having a map at all when size < 1000.
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Thu, Apr 23, 2015 at 4:19 AM, Simon Riggs <simon@2ndquadrant.com> wrote: > We were talking about having an incremental backup map also. Which sounds a > lot like the freeze map. Yeah, possibly. I think we should try to set things up so that the backup map can be updated asynchronously by a background worker, so that we're not adding more work to the foreground path just for the benefit of maintenance operations. That might make the logic for autovacuum to use it a little bit more complex, but it seems manageable. > We only need a freeze/backup map for larger relations. So if we map 1000 > blocks per map page, we skip having a map at all when size < 1000. Agreed. We might also want to map multiple blocks per map slot - e.g. one slot per 32 blocks. That would keep the map quite small even for very large relations, and would not compromise efficiency that much since reading 256kB sequentially probably takes only a little longer than reading 8kB. I think the idea of integrating the freeze map into the VM fork is also worth considering. Then, the incremental backup map could be optional; if you don't want incremental backup, you can shut it off and have less overhead. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Wed, Apr 22, 2015 at 8:55 PM, Bruce Momjian <bruce@momjian.us> wrote: > On Wed, Apr 22, 2015 at 06:36:23PM -0500, Jim Nasby wrote: >> On 4/22/15 6:12 PM, Bruce Momjian wrote: >> >My point is that for the life of 200M transactions, you would have the >> >overhead of an additional file per table in the file system, and updates >> >of that. I just don't know if the overhead over the long time period >> >would be smaller than the VACUUM FREEZE. It might be fine --- I don't >> >know. People seem to focus on the big activities, while many small >> >activities can lead to larger slowdowns. >> >> Ahh. This wouldn't be for the life of 200M transactions; it would be >> a permanent fork, just like the VM is. > > Right. My point is that either you do X 2M times to maintain that fork > and the overhead of the file existence, or you do one VACUUM FREEZE. I > am saying that 2M is a large number and adding all those X's might > exceed the cost of a VACUUM FREEZE. I agree, but if we instead make this part of the visibility map instead of a separate fork, the cost is much less. It won't be any more expensive to clear 2 consecutive bits any time a page is touched than it is to clear 1. The VM fork will be twice as large, but still tiny. And the fact that you'll have only half as many pages mapping to the same VM page may even improve performance in some cases by reducing contention. Even when it reduces performance, I think the impact will be so tiny as not to be worth caring about. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
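To make the cost argument concrete, here is a minimal standalone sketch (invented names, not code from any patch) of a two-bits-per-page map; both bits for a given heap block land in the same map byte, so clearing them together is a single read-modify-write:

#include <stdint.h>

#define BITS_PER_HEAPBLOCK 2
#define FLAG_ALL_VISIBLE   0x01
#define FLAG_ALL_FROZEN    0x02

/* byte within the map that holds the flags for heap block blkno */
static inline uint32_t map_byte(uint32_t blkno)
{
    return blkno / (8 / BITS_PER_HEAPBLOCK);
}

/* bit offset of this block's flags within that byte */
static inline unsigned map_shift(uint32_t blkno)
{
    return (blkno % (8 / BITS_PER_HEAPBLOCK)) * BITS_PER_HEAPBLOCK;
}

/* one read-modify-write clears both flags, same cost as clearing one */
static inline void clear_page_flags(uint8_t *map, uint32_t blkno)
{
    map[map_byte(blkno)] &=
        ~((FLAG_ALL_VISIBLE | FLAG_ALL_FROZEN) << map_shift(blkno));
}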
On 4/23/15 2:42 AM, Heikki Linnakangas wrote: > On 04/22/2015 09:24 PM, Robert Haas wrote: >> Yeah. We have a serious need to reduce the size of our on-disk >> format. On a TPC-C-like workload Jan Wieck recently tested, our data >> set was 34% larger than another database at the beginning of the test, >> and 80% larger by the end of the test. And we did twice the disk >> writes. See "The Elephants in the Room.pdf" at >> https://sites.google.com/site/robertmhaas/presentations > > Meh. Adding an 8-byte header to every 8k block would add 0.1% to the > disk size. No doubt it would be nice to reduce our disk footprint, but > the page header is not the elephant in the room. I've often wondered if there was some way we could consolidate XMIN/XMAX from multiple tuples at the page level; that could be a big win for OLAP environments where most of your tuples belong to a pretty small range of XIDs. In many workloads you could have 80%+ of the tuples in a table having a single inserting XID. Dunno how much it would help for OLTP though... :/ -- Jim Nasby, Data Architect, Blue Treble Consulting Data in Trouble? Get it in Treble! http://BlueTreble.com
On 4/23/15 8:42 AM, Robert Haas wrote: > On Thu, Apr 23, 2015 at 4:19 AM, Simon Riggs <simon@2ndquadrant.com> wrote: >> We were talking about having an incremental backup map also. Which sounds a >> lot like the freeze map. > > Yeah, possibly. I think we should try to set things up so that the > backup map can be updated asynchronously by a background worker, so > that we're not adding more work to the foreground path just for the > benefit of maintenance operations. That might make the logic for > autovacuum to use it a little bit more complex, but it seems > manageable. I'm not sure an actual map makes sense... for incremental backups you need some kind of stream that tells you not only what changed but when it changed. A simple freeze map won't work for that because the operation of freezing itself writes data (and the same can be true for VM). Though, if the backup utility was actually comparing live data to an actual backup maybe this would work... >> We only need a freeze/backup map for larger relations. So if we map 1000 >> blocks per map page, we skip having a map at all when size < 1000. > > Agreed. We might also want to map multiple blocks per map slot - e.g. > one slot per 32 blocks. That would keep the map quite small even for > very large relations, and would not compromise efficiency that much > since reading 256kB sequentially probably takes only a little longer > than reading 8kB. The problem with mapping a range of pages per bit is dealing with locking when you set the bit. Currently that's easy because we're holding the cleanup lock on the page, but you can't do that if you have a range of pages. Though, if each 'slot' wasn't a simple binary value we could have a 3rd state that indicates we're in the process of marking that slot as all visible/frozen, but you still need to consider the bit as cleared. Honestly though, I think concerns about the size of the map are a bit overblown. Even if we double its size, it's still 32,000 times smaller than the heap is with 8k pages. I suspect if you have tables large enough where you'll care, you'll also be using 32k pages, which means it'd be 128,000 times smaller than the heap. I have a hard time believing that's going to be even a faint blip on the performance radar. -- Jim Nasby, Data Architect, Blue Treble Consulting Data in Trouble? Get it in Treble! http://BlueTreble.com
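Those ratios are easy to verify; a quick standalone check, assuming two VM bits per heap page:

#include <stdio.h>

int main(void)
{
    const long page_sizes[] = {8192, 32768};      /* 8k and 32k pages */

    for (int i = 0; i < 2; i++)
    {
        long bits_per_page = page_sizes[i] * 8;   /* bits in one heap page */

        /* 2 map bits describe one heap page */
        printf("%ldk pages: heap is %ld times larger than the map\n",
               page_sizes[i] / 1024, bits_per_page / 2);
    }
    return 0;                   /* prints 32768 and 131072, matching the
                                 * rough 32,000 and 128,000 figures above */
}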
On 04/23/2015 05:52 PM, Jim Nasby wrote: > On 4/23/15 2:42 AM, Heikki Linnakangas wrote: >> On 04/22/2015 09:24 PM, Robert Haas wrote: >>> Yeah. We have a serious need to reduce the size of our on-disk >>> format. On a TPC-C-like workload Jan Wieck recently tested, our data >>> set was 34% larger than another database at the beginning of the test, >>> and 80% larger by the end of the test. And we did twice the disk >>> writes. See "The Elephants in the Room.pdf" at >>> https://sites.google.com/site/robertmhaas/presentations >> >> Meh. Adding an 8-byte header to every 8k block would add 0.1% to the >> disk size. No doubt it would be nice to reduce our disk footprint, but >> the page header is not the elephant in the room. > > I've often wondered if there was some way we could consolidate XMIN/XMAX > from multiple tuples at the page level; that could be a big win for OLAP > environments where most of your tuples belong to a pretty small range of > XIDs. In many workloads you could have 80%+ of the tuples in a table > having a single inserting XID. It would be doable for xmin - IIRC someone even posted a patch for that years ago - but xmax (and ctid) is difficult. When a tuple is inserted, Xmax is basically just a reservation for the value that will be put there later. You have no idea what that value is, and you can't influence it, and when it's time to delete/update the row, you *must* have the space for that xmax. So we can't opportunistically use the space for anything else, or compress them or anything like that. - Heikki
On Thu, Apr 23, 2015 at 10:42:59AM +0300, Heikki Linnakangas wrote: > On 04/22/2015 09:24 PM, Robert Haas wrote: > >>I would feel safer if we added a completely new "epoch" counter to the page > >>>header, instead of reusing LSNs. But as we all know, changing the page > >>>format is a problem for in-place upgrade, and takes some space too. > >Yeah. We have a serious need to reduce the size of our on-disk > >format. On a TPC-C-like workload Jan Wieck recently tested, our data > >set was 34% larger than another database at the beginning of the test, > >and 80% larger by the end of the test. And we did twice the disk > >writes. See "The Elephants in the Room.pdf" at > >https://sites.google.com/site/robertmhaas/presentations > > Meh. Adding an 8-byte header to every 8k block would add 0.1% to the > disk size. No doubt it would be nice to reduce our disk footprint, > but the page header is not the elephant in the room. Agreed. Are you saying we can't find a way to fit an 8-byte value into the existing page in a backward-compatible way? -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + Everyone has their own god. +
On 23/04/15 17:24, Heikki Linnakangas wrote: > On 04/23/2015 05:52 PM, Jim Nasby wrote: >> On 4/23/15 2:42 AM, Heikki Linnakangas wrote: >>> On 04/22/2015 09:24 PM, Robert Haas wrote: >>>> Yeah. We have a serious need to reduce the size of our on-disk >>>> format. On a TPC-C-like workload Jan Wieck recently tested, our data >>>> set was 34% larger than another database at the beginning of the test, >>>> and 80% larger by the end of the test. And we did twice the disk >>>> writes. See "The Elephants in the Room.pdf" at >>>> https://sites.google.com/site/robertmhaas/presentations >>> >>> Meh. Adding an 8-byte header to every 8k block would add 0.1% to the >>> disk size. No doubt it would be nice to reduce our disk footprint, but >>> the page header is not the elephant in the room. >> >> I've often wondered if there was some way we could consolidate XMIN/XMAX >> from multiple tuples at the page level; that could be a big win for OLAP >> environments where most of your tuples belong to a pretty small range of >> XIDs. In many workloads you could have 80%+ of the tuples in a table >> having a single inserting XID. > > It would be doable for xmin - IIRC someone even posted a patch for that > years ago - but xmax (and ctid) is difficult. When a tuple is inserted, > Xmax is basically just a reservation for the value that will be put > there later. You have no idea what that value is, and you can't > influence it, and when it's time to delete/update the row, you *must* > have the space for that xmax. So we can't opportunistically use the > space for anything else, or compress them or anything like that. > That depends, if we are going to change page format we can move the xmax to be some map of ctid->xmax in the header (with no values for tuples with no xmax) or have bitmap there of tuples that have xmax etc. Basically not saving xmax (and potentially other info) inline for each tuple but have some info in header only for tuples that need it. That might have bad performance side effects of course, but there are definitely some potential ways of doing things differently which we could explore. -- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On Thu, Apr 23, 2015 at 06:24:00PM +0300, Heikki Linnakangas wrote: > >I've often wondered if there was some way we could consolidate XMIN/XMAX > >from multiple tuples at the page level; that could be a big win for OLAP > >environments where most of your tuples belong to a pretty small range of > >XIDs. In many workloads you could have 80%+ of the tuples in a table > >having a single inserting XID. > > It would be doable for xmin - IIRC someone even posted a patch for > that years ago - but xmax (and ctid) is difficult. When a tuple is > inserted, Xmax is basically just a reservation for the value that > will be put there later. You have no idea what that value is, and > you can't influence it, and when it's time to delete/update the row, > you *must* have the space for that xmax. So we can't > opportunistically use the space for anything else, or compress them > or anything like that. Also SELECT FOR UPDATE uses the per-row xmax too. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + Everyone has their own god. +
On Thu, Apr 23, 2015 at 09:45:38AM -0400, Robert Haas wrote: > > Right. My point is that either you do X 2M times to maintain that fork > > and the overhead of the file existence, or you do one VACUUM FREEZE. I > > am saying that 2M is a large number and adding all those X's might > > exceed the cost of a VACUUM FREEZE. > > I agree, but if we instead make this part of the visibility map > instead of a separate fork, the cost is much less. It won't be any > more expensive to clear 2 consecutive bits any time a page is touched > than it is to clear 1. The VM fork will be twice as large, but still > tiny. And the fact that you'll have only half as many pages mapping > to the same VM page may even improve performance in some cases by > reducing contention. Even when it reduces performance, I think the > impact will be so tiny as not to be worth caring about. Agreed, no extra file, and the same write volume as currently. It would also match pg_clog, which uses two bits per transaction --- maybe we can reuse some of that code. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + Everyone has their own god. +
On 04/23/2015 06:39 PM, Petr Jelinek wrote: > On 23/04/15 17:24, Heikki Linnakangas wrote: >> On 04/23/2015 05:52 PM, Jim Nasby wrote: >>> I've often wondered if there was some way we could consolidate XMIN/XMAX >>> from multiple tuples at the page level; that could be a big win for OLAP >>> environments where most of your tuples belong to a pretty small range of >>> XIDs. In many workloads you could have 80%+ of the tuples in a table >>> having a single inserting XID. >> >> It would be doable for xmin - IIRC someone even posted a patch for that >> years ago - but xmax (and ctid) is difficult. When a tuple is inserted, >> Xmax is basically just a reservation for the value that will be put >> there later. You have no idea what that value is, and you can't >> influence it, and when it's time to delete/update the row, you *must* >> have the space for that xmax. So we can't opportunistically use the >> space for anything else, or compress them or anything like that. > > That depends, if we are going to change page format we can move the xmax > to be some map of ctid->xmax in the header (with no values for tuples > with no xmax) ... Stop right there. You need to reserve enough space on the page to store an xmax for *every* tuple on the page. Because if you don't, what are you going to do when every tuple on the page is deleted by a different transaction. Even if you store the xmax somewhere else than the page header, you need to reserve the same amount of space for them, so it doesn't help at all. - Heikki
On 04/23/2015 06:38 PM, Bruce Momjian wrote: > On Thu, Apr 23, 2015 at 10:42:59AM +0300, Heikki Linnakangas wrote: >> On 04/22/2015 09:24 PM, Robert Haas wrote: >>>> I would feel safer if we added a completely new "epoch" counter to the page >>>>> header, instead of reusing LSNs. But as we all know, changing the page >>>>> format is a problem for in-place upgrade, and takes some space too. >>> Yeah. We have a serious need to reduce the size of our on-disk >>> format. On a TPC-C-like workload Jan Wieck recently tested, our data >>> set was 34% larger than another database at the beginning of the test, >>> and 80% larger by the end of the test. And we did twice the disk >>> writes. See "The Elephants in the Room.pdf" at >>> https://sites.google.com/site/robertmhaas/presentations >> >> Meh. Adding an 8-byte header to every 8k block would add 0.1% to the >> disk size. No doubt it would be nice to reduce our disk footprint, >> but the page header is not the elephant in the room. > > Agreed. Are you saying we can't find a way to fit an 8-byte value into > the existing page in a backward-compatible way? I'm sure we can find a way. We've discussed ways to handle page format updates in pg_upgrade before, and I don't want to get into that discussion here, but it's not trivial. - Heikki
On Thu, Apr 23, 2015 at 06:52:20PM +0300, Heikki Linnakangas wrote: > >Agreed. Are you saying we can't find a way to fit an 8-byte value into > >the existing page in a backward-compatible way? > > I'm sure we can find a way. We've discussed ways to handle page > format updates in pg_upgrade before, and I don't want to get into > that discussion here, but it's not trivial. OK, good to know, thanks. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + Everyone has their own god. +
On 23/04/15 17:45, Bruce Momjian wrote: > On Thu, Apr 23, 2015 at 09:45:38AM -0400, Robert Haas wrote: >>> Right. My point is that either you do X 2M times to maintain that fork >>> and the overhead of the file existence, or you do one VACUUM FREEZE. I >>> am saying that 2M is a large number and adding all those X's might >>> exceed the cost of a VACUUM FREEZE. >> >> I agree, but if we instead make this part of the visibility map >> instead of a separate fork, the cost is much less. It won't be any >> more expensive to clear 2 consecutive bits any time a page is touched >> than it is to clear 1. The VM fork will be twice as large, but still >> tiny. And the fact that you'll have only half as many pages mapping >> to the same VM page may even improve performance in some cases by >> reducing contention. Even when it reduces performance, I think the >> impact will be so tiny as not to be worth caring about. > > Agreed, no extra file, and the same write volume as currently. It would > also match pg_clog, which uses two bits per transaction --- maybe we can > reuse some of that code. > Yeah, this approach seems promising. We probably can't reuse code from clog because the usage pattern is different (key for clog is xid, while for visibility/freeze map ctid is used). But visibility map storage layer is pretty simple so it should be easy to extend it for this use. -- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On 4/23/15 11:06 AM, Petr Jelinek wrote: > On 23/04/15 17:45, Bruce Momjian wrote: >> On Thu, Apr 23, 2015 at 09:45:38AM -0400, Robert Haas wrote: >> Agreed, no extra file, and the same write volume as currently. It would >> also match pg_clog, which uses two bits per transaction --- maybe we can >> reuse some of that code. >> > > Yeah, this approach seems promising. We probably can't reuse code from > clog because the usage pattern is different (key for clog is xid, while > for visibility/freeze map ctid is used). But visibility map storage > layer is pretty simple so it should be easy to extend it for this use. Actually, there may be some bit manipulation functions we could reuse; things like efficiently counting how many things in a byte are set. Probably doesn't make sense to fully refactor it, but at least CLOG is a good source for cut/paste/whack. -- Jim Nasby, Data Architect, Blue Treble Consulting Data in Trouble? Get it in Treble! http://BlueTreble.com
On Thu, Apr 23, 2015 at 10:42 PM, Robert Haas wrote: > On Thu, Apr 23, 2015 at 4:19 AM, Simon Riggs wrote: >> We only need a freeze/backup map for larger relations. So if we map 1000 >> blocks per map page, we skip having a map at all when size < 1000. > > Agreed. We might also want to map multiple blocks per map slot - e.g. > one slot per 32 blocks. That would keep the map quite small even for > very large relations, and would not compromise efficiency that much > since reading 256kB sequentially probably takes only a little longer > than reading 8kB. > > I think the idea of integrating the freeze map into the VM fork is > also worth considering. Then, the incremental backup map could be > optional; if you don't want incremental backup, you can shut it off > and have less overhead. When I read that, I think about something configurable at the relation level. There are cases where you may want more granularity of this information at the block level by having the VM slots track fewer blocks than 32, and vice versa. -- Michael
On Fri, Apr 24, 2015 at 1:31 AM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote: > On 4/23/15 11:06 AM, Petr Jelinek wrote: >> >> On 23/04/15 17:45, Bruce Momjian wrote: >>> >>> On Thu, Apr 23, 2015 at 09:45:38AM -0400, Robert Haas wrote: >>> Agreed, no extra file, and the same write volume as currently. It would >>> also match pg_clog, which uses two bits per transaction --- maybe we can >>> reuse some of that code. >>> >> >> Yeah, this approach seems promising. We probably can't reuse code from >> clog because the usage pattern is different (key for clog is xid, while >> for visibility/freeze map ctid is used). But visibility map storage >> layer is pretty simple so it should be easy to extend it for this use. > > > Actually, there may be some bit manipulation functions we could reuse; > things like efficiently counting how many things in a byte are set. Probably > doesn't make sense to fully refactor it, but at least CLOG is a good source > for cut/paste/whack. > I agree with adding a bit to the VM that indicates the corresponding page is all-frozen, just like CLOG. I'll make this change in the second version of the patch. Regards, ------- Sawada Masahiko
On Thu, Apr 23, 2015 at 9:03 PM, Michael Paquier <michael.paquier@gmail.com> wrote: > On Thu, Apr 23, 2015 at 10:42 PM, Robert Haas wrote: >> On Thu, Apr 23, 2015 at 4:19 AM, Simon Riggs wrote: >>> We only need a freeze/backup map for larger relations. So if we map 1000 >>> blocks per map page, we skip having a map at all when size < 1000. >> >> Agreed. We might also want to map multiple blocks per map slot - e.g. >> one slot per 32 blocks. That would keep the map quite small even for >> very large relations, and would not compromise efficiency that much >> since reading 256kB sequentially probably takes only a little longer >> than reading 8kB. >> >> I think the idea of integrating the freeze map into the VM fork is >> also worth considering. Then, the incremental backup map could be >> optional; if you don't want incremental backup, you can shut it off >> and have less overhead. > > When I read that I think about something configurable at > relation-level.There are cases where you may want to have more > granularity of this information at block level by having the VM slots > to track less blocks than 32, and vice-versa. What are those cases? To me that sounds like making things complicated to no obvious benefit. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 4/24/15 6:52 AM, Robert Haas wrote: > On Thu, Apr 23, 2015 at 9:03 PM, Michael Paquier > <michael.paquier@gmail.com> wrote: >> On Thu, Apr 23, 2015 at 10:42 PM, Robert Haas wrote: >>> On Thu, Apr 23, 2015 at 4:19 AM, Simon Riggs wrote: >>>> We only need a freeze/backup map for larger relations. So if we map 1000 >>>> blocks per map page, we skip having a map at all when size < 1000. >>> >>> Agreed. We might also want to map multiple blocks per map slot - e.g. >>> one slot per 32 blocks. That would keep the map quite small even for >>> very large relations, and would not compromise efficiency that much >>> since reading 256kB sequentially probably takes only a little longer >>> than reading 8kB. >>> >>> I think the idea of integrating the freeze map into the VM fork is >>> also worth considering. Then, the incremental backup map could be >>> optional; if you don't want incremental backup, you can shut it off >>> and have less overhead. >> >> When I read that I think about something configurable at >> relation-level.There are cases where you may want to have more >> granularity of this information at block level by having the VM slots >> to track less blocks than 32, and vice-versa. > > What are those cases? To me that sounds like making things > complicated to no obvious benefit. Tables that get few/no dead tuples, like bulk insert tables. You'll have large sections of blocks with the same visibility. I suspect the added code to allow setting 1 bit for multiple pages without having to lock all those pages simultaneously will probably outweigh making this a reloption anyway. -- Jim Nasby, Data Architect, Blue Treble Consulting Data in Trouble? Get it in Treble! http://BlueTreble.com
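The third state Jim describes could be pictured like this (a sketch with invented names; two bits per map slot, where each slot covers a range of heap pages):

/* Possible encoding for a slot that covers a range of heap pages */
typedef enum
{
    SLOT_CLEARED     = 0x00,    /* some page in the range may be modified */
    SLOT_SET_PENDING = 0x01,    /* marking in progress; readers must still
                                 * treat the range as not all visible/frozen */
    SLOT_SET         = 0x02     /* whole range known all visible/frozen */
} vm_slot_state;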
On Fri, Apr 24, 2015 at 4:09 PM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote: >>> When I read that I think about something configurable at >>> relation-level.There are cases where you may want to have more >>> granularity of this information at block level by having the VM slots >>> to track less blocks than 32, and vice-versa. >> >> What are those cases? To me that sounds like making things >> complicated to no obvious benefit. > > Tables that get few/no dead tuples, like bulk insert tables. You'll have > large sections of blocks with the same visibility. I don't see any reason why that would require different granularity. > I suspect the added code to allow setting 1 bit for multiple pages without > having to lock all those pages simultaneously will probably outweigh making > this a reloption anyway. That's a completely unrelated issue. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 4/28/15 7:11 AM, Robert Haas wrote: > On Fri, Apr 24, 2015 at 4:09 PM, Jim Nasby<Jim.Nasby@bluetreble.com> > wrote:>>> When I read that I think about something configurable at >>>> >>>relation-level.There are cases where you may want to have more >>>> >>>granularity of this information at block level by having the VM slots >>>> >>>to track less blocks than 32, and vice-versa. >>> >> >>> >>What are those cases? To me that sounds like making things >>> >>complicated to no obvious benefit. >> > >> >Tables that get few/no dead tuples, like bulk insert tables. You'll have >> >large sections of blocks with the same visibility. > I don't see any reason why that would require different granularity. Because in those cases it would be trivial to drop XMIN out of the tuple headers. For a warehouse with narrow rows that could be a significant win. Moreso, we could also move XMAX to the page level if we accept that if we need to invalidate any tuple we'd have to move all of them. In a warehouse situation that's probably OK as well. That said, I don't think this is the first place to focus for reducing our on-disk format; reducing cleanup bloat would probably be a lot more useful. Did you or Jan have more detailed info from the test he ran about where our 80% overhead was ending up? That would remove a lot of speculation here... -- Jim Nasby, Data Architect, Blue Treble Consulting Data in Trouble? Get it in Treble! http://BlueTreble.com
On Tue, Apr 28, 2015 at 1:53 PM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote: > Because in those cases it would be trivial to drop XMIN out of the tuple > headers. For a warehouse with narrow rows that could be a significant win. > Moreso, we could also move XMAX to the page level if we accept that if we > need to invalidate any tuple we'd have to move all of them. In a warehouse > situation that's probably OK as well. You have a funny definition of "trivial". If you start looking through the code you'll see that anything that changes the format of the tuple header is a very large undertaking. And the bit about "if we invalidate any tuple we'd need to move all of them" doesn't really make any sense; we have no infrastructure that would allow us to "move" tuples like that. A lot of people would like it if we did, but we don't. > That said, I don't think this is the first place to focus for reducing our > on-disk format; reducing cleanup bloat would probably be a lot more useful. Sure; changing the on-disk format is a different project than tracking the frozen parts of a table, which is what this thread started out being about, and nothing you've said since then seems to add or detract from that. I still think the best way to do it is to make the VM carry two bits per page instead of one. > Did you or Jan have more detailed info from the test he ran about where our > 80% overhead was ending up? That would remove a lot of speculation here... We have more detailed information on that, but (1) that's not a very specific question and (2) it has nothing to do with freeze avoidance, so I'm not sure why you are asking on this thread. Let's try not to get sidetracked from the well-defined proposal that just needs to be implemented to speculation about major changes in completely unrelated areas. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Fri, Apr 24, 2015 at 11:21 AM, Sawada Masahiko <sawada.mshk@gmail.com> wrote: > On Fri, Apr 24, 2015 at 1:31 AM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote: >> On 4/23/15 11:06 AM, Petr Jelinek wrote: >>> >>> On 23/04/15 17:45, Bruce Momjian wrote: >>>> >>>> On Thu, Apr 23, 2015 at 09:45:38AM -0400, Robert Haas wrote: >>>> Agreed, no extra file, and the same write volume as currently. It would >>>> also match pg_clog, which uses two bits per transaction --- maybe we can >>>> reuse some of that code. >>>> >>> >>> Yeah, this approach seems promising. We probably can't reuse code from >>> clog because the usage pattern is different (key for clog is xid, while >>> for visibility/freeze map ctid is used). But visibility map storage >>> layer is pretty simple so it should be easy to extend it for this use. >> >> >> Actually, there may be some bit manipulation functions we could reuse; >> things like efficiently counting how many things in a byte are set. Probably >> doesn't make sense to fully refactor it, but at least CLOG is a good source >> for cut/paste/whack. >> > > I agree with adding a bit that indicates corresponding page is > all-frozen into VM, just like CLOG. > I'll change the patch as second version patch. > The second patch is attached. In the second patch, I added to the visibility map a bit that indicates whether all tuples in a page are completely frozen. The visibility map became a bitmap with two bits per heap page: all-visible and all-frozen. The logic around vacuum and heap insert/update/delete is almost the same as in the previous version. This patch still lacks some things (documentation, comments in the source code, etc.), so it's a WIP patch, but I think it's enough to discuss this design. Please give feedback. Regards, ------- Sawada Masahiko
Attachment
On Thu, Apr 30, 2015 at 8:07 PM, Sawada Masahiko <sawada.mshk@gmail.com> wrote: > On Fri, Apr 24, 2015 at 11:21 AM, Sawada Masahiko <sawada.mshk@gmail.com> wrote: >> On Fri, Apr 24, 2015 at 1:31 AM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote: >>> On 4/23/15 11:06 AM, Petr Jelinek wrote: >>>> >>>> On 23/04/15 17:45, Bruce Momjian wrote: >>>>> >>>>> On Thu, Apr 23, 2015 at 09:45:38AM -0400, Robert Haas wrote: >>>>> Agreed, no extra file, and the same write volume as currently. It would >>>>> also match pg_clog, which uses two bits per transaction --- maybe we can >>>>> reuse some of that code. >>>>> >>>> >>>> Yeah, this approach seems promising. We probably can't reuse code from >>>> clog because the usage pattern is different (key for clog is xid, while >>>> for visibility/freeze map ctid is used). But visibility map storage >>>> layer is pretty simple so it should be easy to extend it for this use. >>> >>> >>> Actually, there may be some bit manipulation functions we could reuse; >>> things like efficiently counting how many things in a byte are set. Probably >>> doesn't make sense to fully refactor it, but at least CLOG is a good source >>> for cut/paste/whack. >>> >> >> I agree with adding a bit that indicates corresponding page is >> all-frozen into VM, just like CLOG. >> I'll change the patch as second version patch. >> > > The second patch is attached. > > In second patch, I added a bit that indicates all tuples in page are > completely frozen into visibility map. > The visibility map became a bitmap with two bit per heap page: > all-visible and all-frozen. > The logics around vacuum, insert/update/delete heap are almost same as > previous version. > > This patch lack some point: documentation, comment in source code, > etc, so it's WIP patch yet, > but I think that it's enough to discuss about this. > The previous patch no longer applies cleanly to HEAD. The attached v2 patch is the latest version. Please review it. Regards, ------- Sawada Masahiko
Attachment
On Thu, May 28, 2015 at 11:34 AM, Sawada Masahiko <sawada.mshk@gmail.com> wrote: > On Thu, Apr 30, 2015 at 8:07 PM, Sawada Masahiko <sawada.mshk@gmail.com> wrote: >> On Fri, Apr 24, 2015 at 11:21 AM, Sawada Masahiko <sawada.mshk@gmail.com> wrote: >>> On Fri, Apr 24, 2015 at 1:31 AM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote: >>>> On 4/23/15 11:06 AM, Petr Jelinek wrote: >>>>> >>>>> On 23/04/15 17:45, Bruce Momjian wrote: >>>>>> >>>>>> On Thu, Apr 23, 2015 at 09:45:38AM -0400, Robert Haas wrote: >>>>>> Agreed, no extra file, and the same write volume as currently. It would >>>>>> also match pg_clog, which uses two bits per transaction --- maybe we can >>>>>> reuse some of that code. >>>>>> >>>>> >>>>> Yeah, this approach seems promising. We probably can't reuse code from >>>>> clog because the usage pattern is different (key for clog is xid, while >>>>> for visibility/freeze map ctid is used). But visibility map storage >>>>> layer is pretty simple so it should be easy to extend it for this use. >>>> >>>> >>>> Actually, there may be some bit manipulation functions we could reuse; >>>> things like efficiently counting how many things in a byte are set. Probably >>>> doesn't make sense to fully refactor it, but at least CLOG is a good source >>>> for cut/paste/whack. >>>> >>> >>> I agree with adding a bit that indicates corresponding page is >>> all-frozen into VM, just like CLOG. >>> I'll change the patch as second version patch. >>> >> >> The second patch is attached. >> >> In second patch, I added a bit that indicates all tuples in page are >> completely frozen into visibility map. >> The visibility map became a bitmap with two bit per heap page: >> all-visible and all-frozen. >> The logics around vacuum, insert/update/delete heap are almost same as >> previous version. >> >> This patch lack some point: documentation, comment in source code, >> etc, so it's WIP patch yet, >> but I think that it's enough to discuss about this. >> > > The previous patch is no longer applied cleanly to HEAD. > The attached v2 patch is latest version. > > Please review it. Attached new rebased version patch. Please give me comments! Regards, -- Sawada Masahiko
Attachment
On 30 April 2015 at 12:07, Sawada Masahiko <sawada.mshk@gmail.com> wrote:
--
This patch lack some point: documentation, comment in source code,
etc, so it's WIP patch yet,
but I think that it's enough to discuss about this.
Code comments exist to indicate the intention of sections of code. They are essential for reviewers, not a cosmetic thing to be added later. To gain wide agreement we need wide understanding. (I recommend a development approach where you write the comments first, then add code later.)
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Thu, Jul 2, 2015 at 12:13 AM, Sawada Masahiko <sawada.mshk@gmail.com> wrote: > On Thu, May 28, 2015 at 11:34 AM, Sawada Masahiko <sawada.mshk@gmail.com> wrote: >> On Thu, Apr 30, 2015 at 8:07 PM, Sawada Masahiko <sawada.mshk@gmail.com> wrote: >>> On Fri, Apr 24, 2015 at 11:21 AM, Sawada Masahiko <sawada.mshk@gmail.com> wrote: >>>> On Fri, Apr 24, 2015 at 1:31 AM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote: >>>>> On 4/23/15 11:06 AM, Petr Jelinek wrote: >>>>>> >>>>>> On 23/04/15 17:45, Bruce Momjian wrote: >>>>>>> >>>>>>> On Thu, Apr 23, 2015 at 09:45:38AM -0400, Robert Haas wrote: >>>>>>> Agreed, no extra file, and the same write volume as currently. It would >>>>>>> also match pg_clog, which uses two bits per transaction --- maybe we can >>>>>>> reuse some of that code. >>>>>>> >>>>>> >>>>>> Yeah, this approach seems promising. We probably can't reuse code from >>>>>> clog because the usage pattern is different (key for clog is xid, while >>>>>> for visibility/freeze map ctid is used). But visibility map storage >>>>>> layer is pretty simple so it should be easy to extend it for this use. >>>>> >>>>> >>>>> Actually, there may be some bit manipulation functions we could reuse; >>>>> things like efficiently counting how many things in a byte are set. Probably >>>>> doesn't make sense to fully refactor it, but at least CLOG is a good source >>>>> for cut/paste/whack. >>>>> >>>> >>>> I agree with adding a bit that indicates corresponding page is >>>> all-frozen into VM, just like CLOG. >>>> I'll change the patch as second version patch. >>>> >>> >>> The second patch is attached. >>> >>> In second patch, I added a bit that indicates all tuples in page are >>> completely frozen into visibility map. >>> The visibility map became a bitmap with two bit per heap page: >>> all-visible and all-frozen. >>> The logics around vacuum, insert/update/delete heap are almost same as >>> previous version. >>> >>> This patch lack some point: documentation, comment in source code, >>> etc, so it's WIP patch yet, >>> but I think that it's enough to discuss about this. >>> >> >> The previous patch is no longer applied cleanly to HEAD. >> The attached v2 patch is latest version. >> >> Please review it. > > Attached new rebased version patch. > Please give me comments! Now we should review your design and approach rather than code, but since I got an assertion error while trying the patch, I report it. "initdb -D test -k" caused the following assertion failure. vacuuming database template1 ... TRAP: FailedAssertion("!((((PageHeader) (heapPage))->pd_flags & 0x0004))", File: "visibilitymap.c", Line: 328) sh: line 1: 83785 Abort trap: 6 "/dav/000_add_frozen_bit_into_visibilitymap_v3/bin/postgres" --single -F -O -c search_path=pg_catalog -c exit_on_error=true template1 > /dev/null child process exited with exit code 134 initdb: removing data directory "test" Regards, -- Fujii Masao
On Thu, Jul 2, 2015 at 1:06 AM, Fujii Masao <masao.fujii@gmail.com> wrote: > On Thu, Jul 2, 2015 at 12:13 AM, Sawada Masahiko <sawada.mshk@gmail.com> wrote: >> On Thu, May 28, 2015 at 11:34 AM, Sawada Masahiko <sawada.mshk@gmail.com> wrote: >>> On Thu, Apr 30, 2015 at 8:07 PM, Sawada Masahiko <sawada.mshk@gmail.com> wrote: >>>> On Fri, Apr 24, 2015 at 11:21 AM, Sawada Masahiko <sawada.mshk@gmail.com> wrote: >>>>> On Fri, Apr 24, 2015 at 1:31 AM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote: >>>>>> On 4/23/15 11:06 AM, Petr Jelinek wrote: >>>>>>> >>>>>>> On 23/04/15 17:45, Bruce Momjian wrote: >>>>>>>> >>>>>>>> On Thu, Apr 23, 2015 at 09:45:38AM -0400, Robert Haas wrote: >>>>>>>> Agreed, no extra file, and the same write volume as currently. It would >>>>>>>> also match pg_clog, which uses two bits per transaction --- maybe we can >>>>>>>> reuse some of that code. >>>>>>>> >>>>>>> >>>>>>> Yeah, this approach seems promising. We probably can't reuse code from >>>>>>> clog because the usage pattern is different (key for clog is xid, while >>>>>>> for visibility/freeze map ctid is used). But visibility map storage >>>>>>> layer is pretty simple so it should be easy to extend it for this use. >>>>>> >>>>>> >>>>>> Actually, there may be some bit manipulation functions we could reuse; >>>>>> things like efficiently counting how many things in a byte are set. Probably >>>>>> doesn't make sense to fully refactor it, but at least CLOG is a good source >>>>>> for cut/paste/whack. >>>>>> >>>>> >>>>> I agree with adding a bit that indicates corresponding page is >>>>> all-frozen into VM, just like CLOG. >>>>> I'll change the patch as second version patch. >>>>> >>>> >>>> The second patch is attached. >>>> >>>> In second patch, I added a bit that indicates all tuples in page are >>>> completely frozen into visibility map. >>>> The visibility map became a bitmap with two bit per heap page: >>>> all-visible and all-frozen. >>>> The logics around vacuum, insert/update/delete heap are almost same as >>>> previous version. >>>> >>>> This patch lack some point: documentation, comment in source code, >>>> etc, so it's WIP patch yet, >>>> but I think that it's enough to discuss about this. >>>> >>> >>> The previous patch is no longer applied cleanly to HEAD. >>> The attached v2 patch is latest version. >>> >>> Please review it. >> >> Attached new rebased version patch. >> Please give me comments! > > Now we should review your design and approach rather than code, > but since I got an assertion error while trying the patch, I report it. > > "initdb -D test -k" caused the following assertion failure. > > vacuuming database template1 ... TRAP: > FailedAssertion("!((((PageHeader) (heapPage))->pd_flags & 0x0004))", > File: "visibilitymap.c", Line: 328) > sh: line 1: 83785 Abort trap: 6 > "/dav/000_add_frozen_bit_into_visibilitymap_v3/bin/postgres" --single > -F -O -c search_path=pg_catalog -c exit_on_error=true template1 > > /dev/null > child process exited with exit code 134 > initdb: removing data directory "test" Thank you for the bug report and comments. A fixed version is attached, and the source code comments are also updated. Please review it. Let me explain again here what this patch does and its current design. - An additional bit in the visibility map. I added an additional bit, the all-frozen bit, which indicates whether all tuples of the corresponding page are frozen, to the visibility map. This structure is similar to CLOG, so the size of the VM grows to twice its current size.
Also, the PD_ALL_FROZEN flag may be set in the header of each heap page, as well as the all-visible flag. - Setting and clearing the all-frozen bit. Update, delete, and insert (multi-insert) operations clear the bit for that page, and clear the page header flags at the same time. Only a vacuum operation can set the bit, if all tuples of a page are frozen. - Anti-wraparound vacuum. Today we have to scan the whole table for XID anti-wraparound, and it's really quite expensive because of the disk I/O. The main benefit of this proposal is to reduce and avoid such an extremely large amount of I/O even when an anti-wraparound vacuum is executed. I added such logic to the lazy_scan_heap() function experimentally. There were several other ideas in the previous discussion, such as a read-only table and a frozen map, but the advantage of this direction is that we don't need an additional heap file and can use the mature VM mechanism. Regards, -- Sawada Masahiko
Attachment
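In other words, the design described above amounts to something like this (a sketch mirroring the description; only PD_ALL_VISIBLE's value, 0x0004, is the existing one — the other names and values are assumptions, not verified against the patch):

/* Two bits per heap page in the visibility map */
#define VISIBILITYMAP_ALL_VISIBLE  0x01   /* all tuples visible to everyone */
#define VISIBILITYMAP_ALL_FROZEN   0x02   /* all tuples frozen */

/* Matching hint bits in the heap page header (pd_flags) */
#define PD_ALL_VISIBLE  0x0004            /* existing flag */
#define PD_ALL_FROZEN   0x0008            /* hypothetical new flag */

/* Rules, per the description:
 *  - insert/update/delete clear the VM bits and the pd_flags hints
 *  - only VACUUM may set ALL_FROZEN, and only when every tuple on the
 *    page is frozen
 */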
On 2 July 2015 at 16:30, Sawada Masahiko <sawada.mshk@gmail.com> wrote:
--
Also, the flags of each heap page header might be set PD_ALL_FROZEN, as well as all-visible
Is it possible to have VM bits set to frozen but not visible?
The description makes those two states sound independent of each other.
Are they? Or not? Do we test for an impossible state?
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Fri, Jul 3, 2015 at 1:23 AM, Simon Riggs <simon@2ndquadrant.com> wrote: > On 2 July 2015 at 16:30, Sawada Masahiko <sawada.mshk@gmail.com> wrote: > >> >> Also, the flags of each heap page header might be set PD_ALL_FROZEN, >> as well as all-visible > > Is it possible to have VM bits set to frozen but not visible? > > The description makes those two states sound independent of each other. > > Are they? Or not? Do we test for an impossible state? > It's impossible to have VM bits set to frozen but not visible. These bits are controlled independently. But eventually, when the all-frozen bit is set, all-visible is also set. Regards, -- Sawada Masahiko
On Fri, Jul 3, 2015 at 5:25 PM, Sawada Masahiko <sawada.mshk@gmail.com> wrote: > On Fri, Jul 3, 2015 at 1:23 AM, Simon Riggs <simon@2ndquadrant.com> wrote: >> On 2 July 2015 at 16:30, Sawada Masahiko <sawada.mshk@gmail.com> wrote: >> >>> >>> Also, the flags of each heap page header might be set PD_ALL_FROZEN, >>> as well as all-visible >> >> >> Is it possible to have VM bits set to frozen but not visible? >> >> The description makes those two states sound independent of each other. >> >> Are they? Or not? Do we test for an impossible state? >> > > It's impossible to have VM bits set to frozen but not visible. > These bit are controlled independently. But eventually, when > all-frozen bit is set, all-visible is also set. Attached is the latest version, including some bug fixes. Please review it. Regards, -- Sawada Masahiko
Attachment
On 3 July 2015 at 09:25, Sawada Masahiko <sawada.mshk@gmail.com> wrote:
--
On Fri, Jul 3, 2015 at 1:23 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> On 2 July 2015 at 16:30, Sawada Masahiko <sawada.mshk@gmail.com> wrote:
>
>>
>> Also, the flags of each heap page header might be set PD_ALL_FROZEN,
>> as well as all-visible
>
>
> Is it possible to have VM bits set to frozen but not visible?
>
> The description makes those two states sound independent of each other.
>
> Are they? Or not? Do we test for an impossible state?
>
It's impossible to have VM bits set to frozen but not visible.
These bit are controlled independently. But eventually, when
all-frozen bit is set, all-visible is also set.
And my understanding is that if you clear all-visible you would also clear all-frozen...
So I don't understand why you have two separate calls to visibilitymap_clear()
Surely the logic should be to clear both bits at the same time?
In my understanding the state logic is
1. Both bits unset ~(VISIBILITYMAP_ALL_VISIBLE | VISIBILITYMAP_ALL_FROZEN)
which can be changed to state 2 only
2. VISIBILITYMAP_ALL_VISIBLE only
which can be changed to state 1 or state 3
3. VISIBILITYMAP_ALL_VISIBLE | VISIBILITYMAP_ALL_FROZEN
which can be changed to state 1 only
If that is the case please simplify the logic for setting and unsetting the bits so they are set together efficiently. At the same time please also put in Asserts to ensure that the state logic is maintained when it is set and when it is tested.
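A standalone sketch of that state logic with the requested Asserts (helper names are invented for illustration):

#include <assert.h>
#include <stdint.h>

#define VM_ALL_VISIBLE 0x01
#define VM_ALL_FROZEN  0x02

/* the only legal states: 1) neither, 2) visible, 3) visible|frozen */
static inline void vm_assert_valid(uint8_t s)
{
    assert(s == 0 ||
           s == VM_ALL_VISIBLE ||
           s == (VM_ALL_VISIBLE | VM_ALL_FROZEN));
}

/* any heap modification: state 2 or 3 -> state 1, both bits at once */
static inline uint8_t vm_on_modify(uint8_t s)
{
    vm_assert_valid(s);
    return 0;
}

/* vacuum finds the page all-visible: state 1 -> state 2 */
static inline uint8_t vm_on_set_visible(uint8_t s)
{
    vm_assert_valid(s);
    return VM_ALL_VISIBLE;
}

/* vacuum freezes the page: only state 2 may advance to state 3 */
static inline uint8_t vm_on_set_frozen(uint8_t s)
{
    vm_assert_valid(s);
    assert(s & VM_ALL_VISIBLE);
    return VM_ALL_VISIBLE | VM_ALL_FROZEN;
}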
I would also like to see the visibilitymap_test function exposed in SQL, so we can write code to examine the map contents for particular ctids. By doing that we can then write a formal test that shows the evolution of tuples from insertion, vacuuming and freezing, testing the map has been set correctly at each stage. I guess that needs to be done as an isolation test so we have an observer that constrains the xmin in various ways. In light of multixact bugs, any code that changes the on-disk tuple metadata needs formal tests.
Other than that the overall concept seems sound.
I think we need something for pg_upgrade to rewrite existing VMs. Otherwise a large read only database would suddenly require a massive revacuum after upgrade, which seems bad. That can wait for now until we all agree this patch is sound.
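Roughly, such a rewrite would widen each old one-bit-per-page map byte into two bytes of the new format, carrying all-visible over and leaving all-frozen unset — a standalone sketch of the bit-twiddling, not pg_upgrade code:

#include <stdint.h>

static void widen_vm_byte(uint8_t oldbyte, uint8_t out[2])
{
    for (int half = 0; half < 2; half++)
    {
        uint8_t newbyte = 0;

        for (int i = 0; i < 4; i++)         /* 4 heap pages per output byte */
        {
            int page = half * 4 + i;

            if (oldbyte & (1 << page))      /* page was all-visible */
                newbyte |= 0x01 << (i * 2); /* set only the all-visible bit */
        }
        out[half] = newbyte;
    }
}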
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Fri, Jul 3, 2015 at 1:55 PM, Sawada Masahiko <sawada.mshk@gmail.com> wrote:
>
> On Fri, Jul 3, 2015 at 1:23 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> > On 2 July 2015 at 16:30, Sawada Masahiko <sawada.mshk@gmail.com> wrote:
> >
> >>
> >> Also, the flags of each heap page header might be set PD_ALL_FROZEN,
> >> as well as all-visible
> >
> >
> > Is it possible to have VM bits set to frozen but not visible?
> >
> > The description makes those two states sound independent of each other.
> >
> > Are they? Or not? Do we test for an impossible state?
> >
>
> It's impossible to have VM bits set to frozen but not visible.
In the patch, during vacuum, first the frozen bit is set and then the visibility
bit is set in a later operation. Now, if a crash happens between those
two operations, isn't it possible that the frozen bit is set and the visible
bit is not set?
> These bit are controlled independently. But eventually, when
> all-frozen bit is set, all-visible is also set.
>
Yes, during normal operations it will happen that way, but I think there
are corner cases where that assumption is not true.
On Thu, Jul 2, 2015 at 9:00 PM, Sawada Masahiko <sawada.mshk@gmail.com> wrote:
>
>
> Thank you for bug report, and comments.
>
> Fixed version is attached, and source code comment is also updated.
> Please review it.
>
I am looking into this patch and would like to share my findings with
you:
1.
@@ -2131,8 +2133,9 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
CheckForSerializableConflictIn(relation, NULL, InvalidBuffer);
/*
- * Find buffer to insert this tuple into. If the page is all visible,
- * this will also pin the requisite visibility map page.
+ * Find buffer to insert this tuple into. If the page is all visible
+ * of all frozen, this will also pin the requisite visibility map and
+ * frozen map page.
*/
typo in comments.
/of all frozen/or all frozen
2.
visibilitymap.c
+ * The visibility map is a bitmap with two bits (all-visible and all-frozen
+ * per heap page.
/and all-frozen/and all-frozen)
closing round bracket is missing.
3.
visibilitymap.c
-/*#define TRACE_VISIBILITYMAP */
+#define TRACE_VISIBILITYMAP
Why is this #define enabled (uncommented)?
4.
-visibilitymap_count(Relation rel)
+visibilitymap_count(Relation rel, bool for_visible)
This API needs to count set bits for either the visibility info, the frozen
info, or both (if required); it seems better to have the second parameter as
uint8 flags rather than bool. Also, if it needs to be called in most
places for both the visibility and frozen bit counts, why not get them
in one call?
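For example (a hypothetical signature illustrating the suggestion, not the patch's actual API):

#define VISIBILITYMAP_ALL_VISIBLE 0x01
#define VISIBILITYMAP_ALL_FROZEN  0x02

/* count set bits for whichever maps the caller asks for, in one pass */
extern void visibilitymap_count(Relation rel, uint8 flags,
                                BlockNumber *all_visible,
                                BlockNumber *all_frozen);

/* a caller needing both counts then pays a single scan of the map:
 *
 *   visibilitymap_count(rel,
 *                       VISIBILITYMAP_ALL_VISIBLE | VISIBILITYMAP_ALL_FROZEN,
 *                       &relallvisible, &relallfrozen);
 */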
5.
Clearing the visibility and frozen bits separately for the DML
operations would lead to locking/unlocking the corresponding buffer
twice; can we do it as one operation? I think this was suggested
by Simon as well.
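A sketch of what such a combined clear could look like, modeled loosely on the existing visibilitymap.c internals (the macro and flag names approximate the current one-bit code and the two-bit layout; this is not the patch's actual code):

static void
visibilitymap_clear_both(Buffer vmbuf, BlockNumber heapBlk)
{
    /* with two bits per page, both flags share one map byte */
    uint32      mapByte = HEAPBLK_TO_MAPBYTE(heapBlk);
    uint8       mask = (VISIBILITYMAP_ALL_VISIBLE | VISIBILITYMAP_ALL_FROZEN)
                            << HEAPBLK_TO_MAPBIT(heapBlk);
    char       *map;

    /* one lock/unlock cycle instead of two */
    LockBuffer(vmbuf, BUFFER_LOCK_EXCLUSIVE);
    map = PageGetContents(BufferGetPage(vmbuf));

    if (map[mapByte] & mask)
    {
        map[mapByte] &= ~mask;
        MarkBufferDirty(vmbuf);
    }
    LockBuffer(vmbuf, BUFFER_LOCK_UNLOCK);
}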
6.
- * Before locking the buffer, pin the visibility map page if it appears to
- * be necessary. Since we haven't got the lock yet, someone else might be
+ * Before locking the buffer, pin the visibility map if it appears to be
+ * necessary. Since we haven't got the lock yet, someone else might be
Why have you deleted 'page' in the above comment?
7.
@@ -3490,21 +3532,23 @@ l2:
UnlockTupleTuplock(relation, &(oldtup.t_self), *lockmode);
if (vmbuffer != InvalidBuffer)
ReleaseBuffer(vmbuffer);
+
bms_free(hot_attrs);
Seems like an unnecessary change.
8.
@@ -1919,11 +1919,18 @@ index_update_stats(Relation rel,
{
BlockNumber relpages = RelationGetNumberOfBlocks(rel);
BlockNumber relallvisible;
+ BlockNumber relallfrozen;

if (rd_rel->relkind != RELKIND_INDEX)
- relallvisible = visibilitymap_count(rel);
+ {
+ relallvisible = visibilitymap_count(rel, true);
+ relallfrozen = visibilitymap_count(rel, false);
+ }
else /* don't bother for indexes */
+ {
relallvisible = 0;
+ relallfrozen = 0;
+ }
I think in this function, you have forgotten to update the
relallfrozen value in pg_class.
9.
vacuumlazy.c
@@ -253,14 +258,16 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
* NB: We need to check this before truncating the relation, because that
* will change ->rel_pages.
*/
- if (vacrelstats->scanned_pages < vacrelstats->rel_pages)
+ if ((vacrelstats->scanned_pages + vacrelstats->vmskipped_pages)
+ < vacrelstats->rel_pages)
{
- Assert(!scan_all);
Why have you removed this Assert? Won't the count of
vacrelstats->scanned_pages + vacrelstats->vmskipped_pages be
equal to vacrelstats->rel_pages when scan_all = true?
10.
vacuumlazy.c
lazy_vacuum_rel()
..
+ scanned_all |= scan_all;
+
Why is this new assignment added? Please add a comment to
explain it.
11.
lazy_scan_heap()
..
+ * Also, skipping even a single page accorind to all-visible bit of
+ * visibility map means that we can't update relfrozenxid, so we only want
+ * to do it if we can skip a goodly number. On the other hand, we count
+ * both how many pages we skipped according to all-frozen bit of visibility
+ * map and how many pages we freeze page, so we can update relfrozenxid if
+ * the sum of their is as many as tuples per page.
a.
typo
/accorind/according
b.
Is the second part of the comment (starting from "On the other hand")
right? I mean, you are comparing the sum of pages skipped according to the
all-frozen bit and the number of pages we freeze against tuples per page.
I don't understand how they are related.
12.
@@ -918,8 +954,13 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
else
{
num_tuples += 1;
+ ntup_in_blk += 1;
hastup = true;
+ /* If current tuple is already frozen, count it up */
+ if (HeapTupleHeaderXminFrozen(tuple.t_data))
+ already_nfrozen += 1;
+
/*
* Each non-removable tuple must be checked to see if it needs
* freezing. Note we already have exclusive buffer lock.
Here, if the tuple is already frozen, can't we just continue and
check the next tuple?
13.
+extern XLogRecPtr log_heap_frozenmap(RelFileNode rnode, Buffer heap_buffer,
+ Buffer fm_buffer);
It seems like this function is not used.
On 6 July 2015 at 17:28, Simon Riggs <simon@2ndquadrant.com> wrote:
I think we need something for pg_upgrade to rewrite existing VMs. Otherwise a large read only database would suddenly require a massive revacuum after upgrade, which seems bad. That can wait for now until we all agree this patch is sound.
Since we need to rewrite the "vm" map, I think we should call the new map "vfm"
Since the maps are just bits there is no other way to tell that a map has been rewritten
That way we will be able to easily check whether the rewrite has been conducted on all relations.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
> So I don't understand why you have two separate calls to visibilitymap_clear()
> Surely the logic should be to clear both bits at the same time?

Yes, you're right. The all-frozen bit should be cleared at the same time as the all-visible bit.

> 1. Both bits unset ~(VISIBILITYMAP_ALL_VISIBLE | VISIBILITYMAP_ALL_FROZEN)
> which can be changed to state 2 only
> 2. VISIBILITYMAP_ALL_VISIBLE only
> which can be changed to state 1 or state 3
> 3. VISIBILITYMAP_ALL_VISIBLE | VISIBILITYMAP_ALL_FROZEN
> which can be changed to state 1 only
> If that is the case please simplify the logic for setting and unsetting the bits so they are set together efficiently.
> At the same time please also put in Asserts to ensure that the state logic is maintained when it is set and when it is tested.
>
> In patch, during Vacuum first the frozen bit is set and then the visibility
> will be set in a later operation, now if the crash happens between those
> 2 operations, then isn't it possible that the frozen bit is set and visible
> bit is not set?

In the current patch, the frozen bit is set first in lazy_scan_heap(), so it is possible to have the frozen bit set but not the visible bit, as Amit pointed out. To fix it, I'm modifying the patch to be simpler, setting both bits at the same time efficiently.

> I would also like to see the visibilitymap_test function exposed in SQL,
> so we can write code to examine the map contents for particular ctids.
> By doing that we can then write a formal test that shows the evolution of tuples from insertion,
> vacuuming and freezing, testing the map has been set correctly at each stage.
> I guess that needs to be done as an isolation test so we have an observer that constrains the xmin in various ways.
> In light of multixact bugs, any code that changes the on-disk tuple metadata needs formal tests.

Attached patch adds a few functions to contrib/pg_freespacemap to explore the inside of the visibility map, which I used for my test. I hope it helps for testing this feature.

> I think we need something for pg_upgrade to rewrite existing VMs.
> Otherwise a large read only database would suddenly require a massive
> revacuum after upgrade, which seems bad. That can wait for now until we all
> agree this patch is sound.

Yeah, I will address them.

Regards,

--
Sawada Masahiko
Attachment
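Simon's three legal states quoted above fit in a few lines of code. A minimal sketch of setting and clearing both bits together, with the Asserts he asks for; the flag names follow the patch, but the numeric values and helper names are assumptions for illustration.

#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Flag names follow the patch; the values are assumed for illustration. */
#define VISIBILITYMAP_ALL_VISIBLE   0x01
#define VISIBILITYMAP_ALL_FROZEN    0x02

/* Frozen-but-not-visible is the one impossible combination. */
static bool
vm_state_valid(uint8_t flags)
{
    return (flags & VISIBILITYMAP_ALL_FROZEN) == 0 ||
           (flags & VISIBILITYMAP_ALL_VISIBLE) != 0;
}

/* State 1 -> 2, or 2 -> 3: set bits together, never frozen alone. */
static uint8_t
vm_set(uint8_t flags, uint8_t bits)
{
    assert(vm_state_valid(flags));
    flags |= bits;
    if (flags & VISIBILITYMAP_ALL_FROZEN)
        flags |= VISIBILITYMAP_ALL_VISIBLE;
    assert(vm_state_valid(flags));
    return flags;
}

/* Any state -> 1: a page modification clears both bits in one operation. */
static uint8_t
vm_clear(uint8_t flags)
{
    flags &= (uint8_t) ~(VISIBILITYMAP_ALL_VISIBLE | VISIBILITYMAP_ALL_FROZEN);
    assert(vm_state_valid(flags));
    return flags;
}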
On 7 July 2015 at 15:18, Sawada Masahiko <sawada.mshk@gmail.com> wrote:
--
> I would also like to see the visibilitymap_test function exposed in SQL,
> so we can write code to examine the map contents for particular ctids.
> By doing that we can then write a formal test that shows the evolution of tuples from insertion,
> vacuuming and freezing, testing the map has been set correctly at each stage.
> I guess that needs to be done as an isolation test so we have an observer that constrains the xmin in various ways.
> In light of multixact bugs, any code that changes the on-disk tuple metadata needs formal tests.
Attached patch adds a few functions to contrib/pg_freespacemap to
explore the inside of the visibility map, which I used for my test.
I hope it helps for testing this feature.
I don't think pg_freespacemap is the right place.
I'd prefer to add that as a single function into core, so we can write formal tests. I would not personally commit this feature without rigorous and easily repeatable verification.
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 2015-07-07 16:25:13 +0100, Simon Riggs wrote:
> I don't think pg_freespacemap is the right place.

I agree that pg_freespacemap sounds like an odd location.

> I'd prefer to add that as a single function into core, so we can write
> formal tests.

With the advent of src/test/modules it's not really a prerequisite for things to be builtin to be testable. I think there's fair arguments for moving stuff like pg_stattuple, pg_freespacemap, pg_buffercache into core at some point, but that's probably a separate discussion.

Regards,

Andres
On Wed, Jul 8, 2015 at 12:37 AM, Andres Freund <andres@anarazel.de> wrote:
> On 2015-07-07 16:25:13 +0100, Simon Riggs wrote:
>> I don't think pg_freespacemap is the right place.
>
> I agree that pg_freespacemap sounds like an odd location.
>
>> I'd prefer to add that as a single function into core, so we can write
>> formal tests.
>
> With the advent of src/test/modules it's not really a prerequisite for
> things to be builtin to be testable. I think there's fair arguments for
> moving stuff like pg_stattuple, pg_freespacemap, pg_buffercache into
> core at some point, but that's probably a separate discussion.
>

I understood.
So I will place bunch of test like src/test/module/visibilitymap_test,
which contains some tests regarding this feature,
and gather them into one patch.

Regards,

--
Sawada Masahiko
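Wherever the probe ends up, the function itself is small. A minimal sketch of an SQL-callable probe along the lines Simon asks for, assuming the patch's flags-taking visibilitymap_test() signature; the function name, return encoding, and placement are illustrative, not the patch's.

#include "postgres.h"

#include "access/heapam.h"
#include "access/visibilitymap.h"
#include "fmgr.h"
#include "storage/bufmgr.h"
#include "utils/rel.h"

PG_MODULE_MAGIC;

PG_FUNCTION_INFO_V1(pg_visibility_map_bits);

/*
 * Return the VM bits for one heap block: bit 0 = all-visible,
 * bit 1 = all-frozen.
 */
Datum
pg_visibility_map_bits(PG_FUNCTION_ARGS)
{
    Oid         relid = PG_GETARG_OID(0);
    int64       blkno = PG_GETARG_INT64(1);
    Relation    rel;
    Buffer      vmbuffer = InvalidBuffer;
    int32       result = 0;

    rel = relation_open(relid, AccessShareLock);

    if (visibilitymap_test(rel, (BlockNumber) blkno, &vmbuffer,
                           VISIBILITYMAP_ALL_VISIBLE))
        result |= 1;
    if (visibilitymap_test(rel, (BlockNumber) blkno, &vmbuffer,
                           VISIBILITYMAP_ALL_FROZEN))
        result |= 2;

    if (BufferIsValid(vmbuffer))
        ReleaseBuffer(vmbuffer);
    relation_close(rel, AccessShareLock);

    /* frozen-but-not-visible would indicate a corrupt map */
    Assert(result != 2);

    PG_RETURN_INT32(result);
}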
On Tue, Jul 7, 2015 at 5:19 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Thu, Jul 2, 2015 at 9:00 PM, Sawada Masahiko <sawada.mshk@gmail.com> wrote:
> >
> >
> > Thank you for bug report, and comments.
> >
> > Fixed version is attached, and source code comment is also updated.
> > Please review it.
> >
>
> I am looking into this patch and would like to share my findings with
> you:
>
Few more comments:
@@ -47,6 +47,8 @@ CATALOG(pg_class,1259) BKI_BOOTSTRAP BKI_ROWTYPE_OID(83) BKI_SCHEMA_MACRO
float4 reltuples; /* # of tuples (not always up-to-date) */
int32 relallvisible; /* # of all-visible blocks (not always
* up-to-date) */
+ int32 relallfrozen; /* # of all-frozen blocks (not always
+ * up-to-date) */
You have added relallfrozen similar to relallvisible, but how are you
planning to use it? Is there any use case for it?
lazy_scan_heap()
..
- /* Current block is all-visible */
+ /*
+ * Current block is all-visible.
+ * If visibility map represents that it's all frozen, we can
+ * skip to vacuum page unconditionally.
+ */
+ if (visibilitymap_test(onerel, blkno, &vmbuffer, VISIBILITYMAP_ALL_FROZEN))
+ {
+ vacrelstats->vmskipped_pages++;
+ continue;
+ }
+
a. please explain in a comment why it is safe if someone clears the
frozen bit concurrently
b. won't skipping pages intermittently due to the set frozen bit break the
readahead mechanism? In this regard, if possible, I think we should
do some tests to see the benefit of this patch. I understand that in
general, it will be good to skip pages; however it seems better to check
that with some different kinds of tests.
On 7 July 2015 at 18:45, Sawada Masahiko <sawada.mshk@gmail.com> wrote:
--
I understood.
On Wed, Jul 8, 2015 at 12:37 AM, Andres Freund <andres@anarazel.de> wrote:
> On 2015-07-07 16:25:13 +0100, Simon Riggs wrote:
>> I don't think pg_freespacemap is the right place.
>
> I agree that pg_freespacemap sounds like an odd location.
>
>> I'd prefer to add that as a single function into core, so we can write
>> formal tests.
>
> With the advent of src/test/modules it's not really a prerequisite for
> things to be builtin to be testable. I think there's fair arguments for
> moving stuff like pg_stattuple, pg_freespacemap, pg_buffercache into
> core at some point, but that's probably a separate discussion.
>
So I will place bunch of test like src/test/module/visibilitymap_test,
which contains some tests regarding this feature,
and gather them into one patch.
Please place it in core. I see value in having a diagnostic function for general use on production systems.
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Tue, Jul 7, 2015 at 5:37 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
On 6 July 2015 at 17:28, Simon Riggs <simon@2ndquadrant.com> wrote:
I think we need something for pg_upgrade to rewrite existing VMs. Otherwise a large read only database would suddenly require a massive revacuum after upgrade, which seems bad. That can wait for now until we all agree this patch is sound.
Since we need to rewrite the "vm" map, I think we should call the new map "vfm"
+1 for changing the name, as the map now contains more than visibility
information.
On Fri, Jul 3, 2015 at 1:25 AM, Sawada Masahiko <sawada.mshk@gmail.com> wrote:
On Fri, Jul 3, 2015 at 1:23 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> On 2 July 2015 at 16:30, Sawada Masahiko <sawada.mshk@gmail.com> wrote:
>
>>
>> Also, the flags of each heap page header might be set PD_ALL_FROZEN,
>> as well as all-visible
>
>
> Is it possible to have VM bits set to frozen but not visible?
>
> The description makes those two states sound independent of each other.
>
> Are they? Or not? Do we test for an impossible state?
>
It's impossible to have VM bits set to frozen but not visible.
These bit are controlled independently. But eventually, when
all-frozen bit is set, all-visible is also set.
If that combination is currently impossible, could it be used indicate that the page is all empty?
Having a crash-proof bitmap of all-empty pages would make vacuum truncation scans much more efficient.
Cheers,
Jeff
On 7/8/15 8:31 AM, Simon Riggs wrote:
> I understood.
> So I will place bunch of test like src/test/module/visibilitymap_test,
> which contains some tests regarding this feature,
> and gather them into one patch.
>
> Please place it in core. I see value in having a diagnostic function for
> general use on production systems.

+1. I don't think there's value to keeping this stuff away from DBAs. Perhaps it should default to only SU being able to execute it though.

--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Data in Trouble? Get it in Treble! http://BlueTreble.com
On Thu, Jul 9, 2015 at 4:31 AM, Jeff Janes <jeff.janes@gmail.com> wrote:
> On Fri, Jul 3, 2015 at 1:25 AM, Sawada Masahiko <sawada.mshk@gmail.com>
> wrote:
>>
>> On Fri, Jul 3, 2015 at 1:23 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
>> > On 2 July 2015 at 16:30, Sawada Masahiko <sawada.mshk@gmail.com> wrote:
>> >
>> >> Also, the flags of each heap page header might be set PD_ALL_FROZEN,
>> >> as well as all-visible
>> >
>> > Is it possible to have VM bits set to frozen but not visible?
>> >
>> > The description makes those two states sound independent of each other.
>> >
>> > Are they? Or not? Do we test for an impossible state?
>> >
>>
>> It's impossible to have VM bits set to frozen but not visible.
>> These bit are controlled independently. But eventually, when
>> all-frozen bit is set, all-visible is also set.
>
> If that combination is currently impossible, could it be used indicate that
> the page is all empty?

Yeah, the status of that VM bits set to frozen but not visible is
impossible, so we could use this status for another something status
of the page.

> Having a crash-proof bitmap of all-empty pages would make vacuum truncation
> scans much more efficient.

The empty page is always marked all-visible by vacuum today, it's not enough?

Regards,

--
Sawada Masahiko
On Tue, Jul 7, 2015 at 8:49 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Thu, Jul 2, 2015 at 9:00 PM, Sawada Masahiko <sawada.mshk@gmail.com>
> wrote:
>>
>> Thank you for bug report, and comments.
>>
>> Fixed version is attached, and source code comment is also updated.
>> Please review it.
>>
>
> I am looking into this patch and would like to share my findings with
> you:

Thank you for the comments. I appreciate you taking time to review this patch.

> 1.
> @@ -2131,8 +2133,9 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
>
> CheckForSerializableConflictIn(relation, NULL, InvalidBuffer);
>
> /*
> - * Find buffer to insert this tuple into. If the page is all visible,
> - * this will also pin the requisite visibility map page.
> + * Find buffer to insert this tuple into. If the page is all visible
> + * of all frozen, this will also pin the requisite visibility map and
> + * frozen map page.
> */
>
> typo in comments.
>
> /of all frozen/or all frozen

Fixed.

> 2.
> visibilitymap.c
> + * The visibility map is a bitmap with two bits (all-visible and all-frozen
> + * per heap page.
>
> /and all-frozen/and all-frozen)
> closing round bracket is missing.

Fixed.

> 3.
> visibilitymap.c
> -/*#define TRACE_VISIBILITYMAP */
> +#define TRACE_VISIBILITYMAP
>
> why is this hash define opened?

Fixed.

> 4.
> -visibilitymap_count(Relation rel)
> +visibilitymap_count(Relation rel, bool for_visible)
>
> This API needs to count set bits for either visibility info, frozen info
> or both (if required), it seems better to have second parameter as
> uint8 flags rather than bool. Also, if it is required to be called at most
> places for both visibility and frozen bits count, why not get them
> in one call?

Fixed.

> 5.
> Clearing visibility and frozen bit separately for the dml
> operations would lead locking/unlocking the corresponding buffer
> twice, can we do it as a one operation. I think this is suggested
> by Simon as well.

The latest patch clears both bits in one operation, and sets all-frozen together with all-visible in one operation. We can judge whether a page is all-frozen in two places: when first scanning the page (lazy_scan_heap), and after cleaning garbage (lazy_vacuum_page).

> 6.
> Why have you deleted 'page' in the above comment?

Fixed.

> 7.
> Seems an unnecessary change.

Fixed.

> 8.
> I think in this function, you have forgotten to update the
> relallfrozen value in pg_class.

Fixed.

> 9.
> Why have you removed this Assert? Won't the count of
> vacrelstats->scanned_pages + vacrelstats->vmskipped_pages be
> equal to vacrelstats->rel_pages when scan_all = true?

Fixed.

> 10.
> Why is this new assignment added? Please add a comment to
> explain it.

It's not necessary, removed.

> 11.
> a.
> typo
> /accorind/according

Fixed.

> b.
> is the second part of the comment (starting from "On the other hand")
> right? I mean you are comparing the sum of pages skipped due to the
> all_frozen bit and the number of pages frozen with tuples per page.
> I don't understand how they are related.

It's wrong; in the last sentence I wanted to say that "we can update relfrozenxid if the sum of them is as many as the pages of the table."

> 12.
> Here, if the tuple is already frozen, can't we just continue and
> check the next tuple?

I think it's impossible because the logic related to old-style VACUUM FULL still remains in HeapTupleHeaderXminFrozen().

> 13.
> +extern XLogRecPtr log_heap_frozenmap(RelFileNode rnode, Buffer heap_buffer,
> + Buffer fm_buffer);
>
> It seems like this function is not used.

Fixed.

> You have added relallfrozen similar to relallvisible, but how are you
> planning to use it? Is there any use case for it?

Yep, the value of relallfrozen would be useful in cases where the user wants to know how long vacuuming will take. If this value is low, it's usually a good idea to do VACUUM FREEZE manually to prevent an unpredictable anti-wraparound vacuum.

> a. please explain in a comment why it is safe if someone clears the
> frozen bit concurrently
> b. won't skipping pages intermittently due to the set frozen bit break the
> readahead mechanism? In this regard, if possible, I think we should
> do some tests to see the benefit of this patch.

In the latest patch, we can skip all-visible or all-frozen pages until we find next_not_all_visible_block, and then we re-check whether the page is all-frozen so that we can skip vacuuming it even if scan_all is true. Also, I added a message about the number of skipped frozen pages to the verbose log for testing.

> Please place it in core. I see value in having a diagnostic function for
> general use on production systems.

I added a new heapfuncs.c file for heap-related functions which the DBA uses, and added these functions to that file. But test cases are not done yet; I'm writing them. Something for pg_upgrade is also not done yet.

TODO
- Test case for this feature
- pg_upgrade support.

Regards,

--
Sawada Masahiko
Attachment
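On item 4 above, the bool-flag visibilitymap_count() API means callers scan the map twice to get both numbers. A standalone sketch of a one-pass tally over a 2-bits-per-heap-page map; the layout (even bits all-visible, odd bits all-frozen) and all names here are assumptions for illustration, not the patch's code.

#include <stddef.h>
#include <stdint.h>

typedef struct
{
    uint64_t    all_visible;
    uint64_t    all_frozen;
} vm_counts;

/*
 * One pass over the map, counting both bit types together instead of
 * calling a counter twice with a bool flag. Each map byte covers four
 * heap pages under the assumed layout.
 */
static vm_counts
count_vm_bits(const uint8_t *map, size_t nbytes)
{
    vm_counts   counts = {0, 0};

    for (size_t i = 0; i < nbytes; i++)
    {
        for (int slot = 0; slot < 4; slot++)
        {
            if (map[i] & (uint8_t) (1 << (2 * slot)))
                counts.all_visible++;
            if (map[i] & (uint8_t) (1 << (2 * slot + 1)))
                counts.all_frozen++;
        }
    }
    return counts;
}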
On Wed, Jul 8, 2015 at 10:10 PM, Sawada Masahiko <sawada.mshk@gmail.com> wrote:
On Thu, Jul 9, 2015 at 4:31 AM, Jeff Janes <jeff.janes@gmail.com> wrote:
> On Fri, Jul 3, 2015 at 1:25 AM, Sawada Masahiko <sawada.mshk@gmail.com>
> wrote:
>>
>> It's impossible to have VM bits set to frozen but not visible.
>> These bit are controlled independently. But eventually, when
>> all-frozen bit is set, all-visible is also set.
>
>
> If that combination is currently impossible, could it be used indicate that
> the page is all empty?
Yeah, the status of that VM bits set to frozen but not visible is
impossible, so we could use this status for another something status
of the page.
> Having a crash-proof bitmap of all-empty pages would make vacuum truncation
> scans much more efficient.
The empty page is always marked all-visible by vacuum today, it's not enough?
The "current" vacuum can just remember that they were empty as well as all-visible.
But the next vacuum that occurs on the table won't know that they are empty, just that they are all-visible, so it can't truncate them away without having to read each one first.
It is a minor thing, but if there is no other use for this fourth "bit-space", it seems a shame to waste it when there is some use for it. I haven't looked at the code around this area to know how hard it would be to implement the setting and clearing of the bit.
Cheers,
Jeff
On Fri, Jul 10, 2015 at 3:05 AM, Sawada Masahiko <sawada.mshk@gmail.com> wrote:
>
> Also something for pg_upgrade is also not yet.
>
> TODO
> - Test case for this feature
> - pg_upgrade support.
>

I had forgotten to change the fork name of visibility map to "vfm".
Attached latest v7 patch.
Please review it.

Regards,

--
Sawada Masahiko
Attachment
On Fri, Jul 10, 2015 at 3:42 AM, Jeff Janes <jeff.janes@gmail.com> wrote:
> On Wed, Jul 8, 2015 at 10:10 PM, Sawada Masahiko <sawada.mshk@gmail.com>
> wrote:
>>
>> On Thu, Jul 9, 2015 at 4:31 AM, Jeff Janes <jeff.janes@gmail.com> wrote:
>> > On Fri, Jul 3, 2015 at 1:25 AM, Sawada Masahiko <sawada.mshk@gmail.com>
>> > wrote:
>> >>
>> >> It's impossible to have VM bits set to frozen but not visible.
>> >> These bit are controlled independently. But eventually, when
>> >> all-frozen bit is set, all-visible is also set.
>> >
>> > If that combination is currently impossible, could it be used indicate
>> > that
>> > the page is all empty?
>>
>> Yeah, the status of that VM bits set to frozen but not visible is
>> impossible, so we could use this status for another something status
>> of the page.
>>
>> > Having a crash-proof bitmap of all-empty pages would make vacuum
>> > truncation
>> > scans much more efficient.
>>
>> The empty page is always marked all-visible by vacuum today, it's not
>> enough?
>
> The "current" vacuum can just remember that they were empty as well as
> all-visible.
>
> But the next vacuum that occurs on the table won't know that they are empty,
> just that they are all-visible, so it can't truncate them away without
> having to read each one first.

Yeah, it would be effective for vacuum empty page.

> It is a minor thing, but if there is no other use for this fourth
> "bit-space", it seems a shame to waste it when there is some use for it. I
> haven't looked at the code around this area to know how hard it would be to
> implement the setting and clearing of the bit.

I think so too, we would be able to use unused fourth status of bits
efficiently.
Should I include these improvement into this patch?
This topic should be discussed on another thread after this feature is
committed, I think.

Regards,

--
Sawada Masahiko
On 10 July 2015 at 09:49, Sawada Masahiko <sawada.mshk@gmail.com> wrote:
--
> It is a minor thing, but if there is no other use for this fourth
> "bit-space", it seems a shame to waste it when there is some use for it. I
> haven't looked at the code around this area to know how hard it would be to
> implement the setting and clearing of the bit.
I think so too, we would be able to use unused fourth status of bits
efficiently.
Should I include these improvement into this patch?
This topic should be discussed on another thread after this feature is
committed, I think.
The impossible state acts as a diagnostic check for us to ensure the bitmap is not itself corrupt.
-1 for using it for another purpose.
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Fri, Jul 10, 2015 at 2:41 PM, Sawada Masahiko <sawada.mshk@gmail.com> wrote:
> On Fri, Jul 10, 2015 at 3:05 AM, Sawada Masahiko <sawada.mshk@gmail.com> wrote:
>>
>> Also something for pg_upgrade is also not yet.
>>
>> TODO
>> - Test case for this feature
>> - pg_upgrade support.
>>
>
> I had forgotten to change the fork name of visibility map to "vfm".
> Attached latest v7 patch.
> Please review it.

The compilation failed on my machine...

gcc -Wall -Wmissing-prototypes -Wpointer-arith -Wdeclaration-after-statement -Wendif-labels -Wmissing-format-attribute -Wformat-security -fno-strict-aliasing -fwrapv -g -O0 -I../../../../src/include -D_GNU_SOURCE -c -o visibilitymap.o visibilitymap.c
make[4]: *** No rule to make target `heapfuncs.o', needed by `objfiles.txt'. Stop.
make[4]: *** Waiting for unfinished jobs....
( echo src/backend/access/index/genam.o src/backend/access/index/indexam.o ) >objfiles.txt
make[4]: Leaving directory `/home/postgres/pgsql/git/src/backend/access/index'
gcc -Wall -Wmissing-prototypes -Wpointer-arith -Wdeclaration-after-statement -Wendif-labels -Wmissing-format-attribute -Wformat-security -fno-strict-aliasing -fwrapv -g -O0 -I../../../src/include -D_GNU_SOURCE -c -o tablespace.o tablespace.c
gcc -Wall -Wmissing-prototypes -Wpointer-arith -Wdeclaration-after-statement -Wendif-labels -Wmissing-format-attribute -Wformat-security -fno-strict-aliasing -fwrapv -g -O0 -I../../../src/include -D_GNU_SOURCE -c -o instrument.o instrument.c
make[4]: Leaving directory `/home/postgres/pgsql/git/src/backend/access/heap'
make[3]: *** [heap-recursive] Error 2
make[3]: Leaving directory `/home/postgres/pgsql/git/src/backend/access'
make[2]: *** [access-recursive] Error 2
make[2]: *** Waiting for unfinished jobs....

Regards,

--
Fujii Masao
On Fri, Jul 10, 2015 at 10:43 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
> On Fri, Jul 10, 2015 at 2:41 PM, Sawada Masahiko <sawada.mshk@gmail.com> wrote:
>> On Fri, Jul 10, 2015 at 3:05 AM, Sawada Masahiko <sawada.mshk@gmail.com> wrote:
>>>
>>> Also something for pg_upgrade is also not yet.
>>>
>>> TODO
>>> - Test case for this feature
>>> - pg_upgrade support.
>>>
>>
>> I had forgotten to change the fork name of visibility map to "vfm".
>> Attached latest v7 patch.
>> Please review it.
>
> The compilation failed on my machine...
>
> gcc -Wall -Wmissing-prototypes -Wpointer-arith
> -Wdeclaration-after-statement -Wendif-labels
> -Wmissing-format-attribute -Wformat-security -fno-strict-aliasing
> -fwrapv -g -O0 -I../../../../src/include -D_GNU_SOURCE -c -o
> visibilitymap.o visibilitymap.c
> make[4]: *** No rule to make target `heapfuncs.o', needed by
> `objfiles.txt'. Stop.
> make[4]: *** Waiting for unfinished jobs....
> make[3]: *** [heap-recursive] Error 2
> make[3]: Leaving directory `/home/postgres/pgsql/git/src/backend/access'
> make[2]: *** [access-recursive] Error 2
> make[2]: *** Waiting for unfinished jobs....

Oops, I had forgotten to add new file heapfuncs.c.
Latest patch is attached.

Regards,

--
Sawada Masahiko
Attachment
On 7/10/15 4:46 AM, Simon Riggs wrote:
> On 10 July 2015 at 09:49, Sawada Masahiko <sawada.mshk@gmail.com> wrote:
>
> > It is a minor thing, but if there is no other use for this fourth
> > "bit-space", it seems a shame to waste it when there is some use for it. I
> > haven't looked at the code around this area to know how hard it would be to
> > implement the setting and clearing of the bit.
>
> I think so too, we would be able to use unused fourth status of bits
> efficiently.
> Should I include these improvement into this patch?
> This topic should be discussed on another thread after this feature is
> committed, I think.
>
> The impossible state acts as a diagnostic check for us to ensure the
> bitmap is not itself corrupt.
>
> -1 for using it for another purpose.

AFAICS empty page is only interesting for vacuum truncate, which is a very short-term thing. It would be better to find a way to handle that differently. In any case, that should definitely be a separate discussion from this patch.

--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Data in Trouble? Get it in Treble! http://BlueTreble.com
On Tue, Jul 7, 2015 at 9:07 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> On 6 July 2015 at 17:28, Simon Riggs <simon@2ndquadrant.com> wrote:
>
>> I think we need something for pg_upgrade to rewrite existing VMs.
>> Otherwise a large read only database would suddenly require a massive
>> revacuum after upgrade, which seems bad. That can wait for now until we all
>> agree this patch is sound.
>
> Since we need to rewrite the "vm" map, I think we should call the new map
> "vfm"
>
> That way we will be able to easily check whether the rewrite has been
> conducted on all relations.
>
> Since the maps are just bits there is no other way to tell that a map has
> been rewritten

To avoid revacuum after upgrade, you meant that we need to rewrite
each bit of vm to corresponding bits of vfm, if it's from
not-supporting vfm version (i.g., 9.5 or earlier), right?
If so, we will need to do whole scanning table, which is expensive as well.
Clearing vm and do revacuum would be nice, rather than doing in
upgrading, I think.

Regards,

--
Masahiko Sawada
On Mon, Jul 13, 2015 at 3:39 PM, Sawada Masahiko <sawada.mshk@gmail.com> wrote:
>
> On Tue, Jul 7, 2015 at 9:07 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> > On 6 July 2015 at 17:28, Simon Riggs <simon@2ndquadrant.com> wrote:
> >
> >> I think we need something for pg_upgrade to rewrite existing VMs.
> >> Otherwise a large read only database would suddenly require a massive
> >> revacuum after upgrade, which seems bad. That can wait for now until we all
> >> agree this patch is sound.
> >
> >
> > Since we need to rewrite the "vm" map, I think we should call the new map
> > "vfm"
> >
> > That way we will be able to easily check whether the rewrite has been
> > conducted on all relations.
> >
> > Since the maps are just bits there is no other way to tell that a map has
> > been rewritten
>
> To avoid revacuum after upgrade, you meant that we need to rewrite
> each bit of vm to corresponding bits of vfm, if it's from
> not-supporting vfm version(i.g., 9.5 or earlier ). right?
> If so, we will need to do whole scanning table, which is expensive as well.
> Clearing vm and do revacuum would be nice, rather than doing in
> upgrading, I think.
>
How will you ensure to have revacuum for all the tables after
upgrading? Till the time Vacuum is done on the tables that
have vm before upgrade, any queries on those tables can
become slower.
On Mon, Jul 13, 2015 at 7:46 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Mon, Jul 13, 2015 at 3:39 PM, Sawada Masahiko <sawada.mshk@gmail.com>
> wrote:
>>
>> On Tue, Jul 7, 2015 at 9:07 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
>> > On 6 July 2015 at 17:28, Simon Riggs <simon@2ndquadrant.com> wrote:
>> >
>> >> I think we need something for pg_upgrade to rewrite existing VMs.
>> >> Otherwise a large read only database would suddenly require a massive
>> >> revacuum after upgrade, which seems bad. That can wait for now until we
>> >> all agree this patch is sound.
>> >
>> > Since we need to rewrite the "vm" map, I think we should call the new
>> > map "vfm"
>> >
>> > That way we will be able to easily check whether the rewrite has been
>> > conducted on all relations.
>> >
>> > Since the maps are just bits there is no other way to tell that a map
>> > has been rewritten
>>
>> To avoid revacuum after upgrade, you meant that we need to rewrite
>> each bit of vm to corresponding bits of vfm, if it's from
>> not-supporting vfm version (i.g., 9.5 or earlier), right?
>> If so, we will need to do whole scanning table, which is expensive as well.
>> Clearing vm and do revacuum would be nice, rather than doing in
>> upgrading, I think.
>
> How will you ensure to have revacuum for all the tables after
> upgrading?

We use script file which are generated by pg_upgrade.

> Till the time Vacuum is done on the tables that
> have vm before upgrade, any queries on those tables can
> become slower.

Even If we implement rewriting tool for vm into pg_upgrade, it will take time as much as revacuum because it need whole scanning table. I meant that we rewrite vm using by existing facility (i.g., vacuum (freeze)), instead of implementing new rewriting tool for vm.

Regards,

--
Masahiko Sawada
On 2015-07-13 21:03:07 +0900, Sawada Masahiko wrote:
> Even If we implement rewriting tool for vm into pg_upgrade, it will
> take time as much as revacuum because it need whole scanning table.

Why would it? Sure, you can only set allvisible and not the frozen bit, but that's fine. That way the cost for freezing can be paid over time.

If we require terrabytes of data to be scanned, including possibly rewriting large portions due to freezing, before index only scans work and most vacuums act in a partial manner the migration to 9.6 will be a major pain for our users.
On Mon, Jul 13, 2015 at 9:03 PM, Sawada Masahiko <sawada.mshk@gmail.com> wrote:
> On Mon, Jul 13, 2015 at 7:46 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> On Mon, Jul 13, 2015 at 3:39 PM, Sawada Masahiko <sawada.mshk@gmail.com>
>> wrote:
>>>
>>> To avoid revacuum after upgrade, you meant that we need to rewrite
>>> each bit of vm to corresponding bits of vfm, if it's from
>>> not-supporting vfm version (i.g., 9.5 or earlier), right?
>>> If so, we will need to do whole scanning table, which is expensive as well.
>>> Clearing vm and do revacuum would be nice, rather than doing in
>>> upgrading, I think.
>>>
>>
>> How will you ensure to have revacuum for all the tables after
>> upgrading?
>
> We use script file which are generated by pg_upgrade.

I haven't followed this thread closely, but I am sure you recall that vacuumdb has a parallel mode.

--
Michael
On Mon, Jul 13, 2015 at 9:22 PM, Andres Freund <andres@anarazel.de> wrote:
> On 2015-07-13 21:03:07 +0900, Sawada Masahiko wrote:
>> Even If we implement rewriting tool for vm into pg_upgrade, it will
>> take time as much as revacuum because it need whole scanning table.
>
> Why would it? Sure, you can only set allvisible and not the frozen bit,
> but that's fine. That way the cost for freezing can be paid over time.
>
> If we require terrabytes of data to be scanned, including possibly
> rewriting large portions due to freezing, before index only scans work
> and most vacuums act in a partial manner the migration to 9.6 will be a
> major pain for our users.

Ah, If we set all bit as not all-frozen, we don't need to whole table
scanning, only scan vm.
And I agree with this.
But please image the case where old cluster has table which is very
large, read-only and vacuum freeze is done.
In this case, the all-frozen bit of such table in new cluster will not
set, unless we do vacuum freeze again.
The information of all-frozen of such table is lacked.

Regards,

--
Masahiko Sawada
On 13 July 2015 at 15:48, Sawada Masahiko <sawada.mshk@gmail.com> wrote:
--
On Mon, Jul 13, 2015 at 9:22 PM, Andres Freund <andres@anarazel.de> wrote:
> On 2015-07-13 21:03:07 +0900, Sawada Masahiko wrote:
>> Even If we implement rewriting tool for vm into pg_upgrade, it will
>> take time as much as revacuum because it need whole scanning table.
>
> Why would it? Sure, you can only set allvisible and not the frozen bit,
> but that's fine. That way the cost for freezing can be paid over time.
>
> If we require terrabytes of data to be scanned, including possibly
> rewriting large portions due to freezing, before index only scans work
> and most vacuums act in a partial manner the migration to 9.6 will be a
> major pain for our users.
Ah, If we set all bit as not all-frozen, we don't need to whole table
scanning, only scan vm.
And I agree with this.
But please image the case where old cluster has table which is very
large, read-only and vacuum freeze is done.
In this case, the all-frozen bit of such table in new cluster will not
set, unless we do vacuum freeze again.
The information of all-frozen of such table is lacked.
The contents of the VM fork are essential to retain after an upgrade because they are used for Index Only Scans. If we destroy that information it could send SQL response times to unacceptable levels after upgrade.
It takes time to scan the VM and create the new VFM, but the time taken is proportional to the size of VM, which seems like it will be acceptable.
Example calcs:
An 8TB PostgreSQL installation would need us to scan 128MB of VM into about 256MB of VFM. Probably the fsyncs will occupy the most time.
In comparison, we would need to scan all 8TB to rebuild the VMs, which will take much longer (and fsyncs will still be needed).
Since we don't record freeze map information now it is acceptable to begin after upgrade with all freeze info set to zero.
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
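The arithmetic follows from each old bit becoming two: an 8TB cluster is 2^30 8KB heap pages, so one bit per page is 128MB of VM and two bits per page is 256MB of VFM. A minimal sketch of the per-byte expansion such a rewrite could do, carrying over only the all-visible bit as Andres suggests; the bit layout and helper name are assumptions, not the patch's code.

#include <stdint.h>

/*
 * Expand one old-format VM byte (8 heap pages, 1 all-visible bit each)
 * into two new-format bytes (2 bits per heap page). Only the all-visible
 * bit is carried over; the all-frozen bit starts clear and is paid for
 * by later vacuums.
 */
static uint16_t
rewrite_vm_byte(uint8_t old_byte)
{
    uint16_t    new_bits = 0;

    for (int i = 0; i < 8; i++)
    {
        if (old_byte & (uint8_t) (1 << i))
            new_bits |= (uint16_t) (1 << (2 * i));
    }
    return new_bits;
}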
On 2015-07-13 23:48:02 +0900, Sawada Masahiko wrote:
> But please image the case where old cluster has table which is very
> large, read-only and vacuum freeze is done.
> In this case, the all-frozen bit of such table in new cluster will not
> set, unless we do vacuum freeze again.
> The information of all-frozen of such table is lacked.

So what? That's the situation today. Yes, it'll trigger an anti-wraparound vacuum at some later point; after that the map bits will be set.
On 10 July 2015 at 15:11, Sawada Masahiko <sawada.mshk@gmail.com> wrote:
--
Oops, I had forgotten to add new file heapfuncs.c.
Latest patch is attached.
I think we've established the approach is desirable and defined the way forwards for this, so this is looking good.
Some of my requests haven't been actioned yet, so I personally would not commit this yet. I am happy to continue as reviewer/committer unless others wish to take over.
The main missing item is pg_upgrade support, which won't happen by end of CF1, so I am marking this as Returned With Feedback. Hopefully we can review this again before CF2.
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Michael Paquier wrote:
> On Mon, Jul 13, 2015 at 9:03 PM, Sawada Masahiko <sawada.mshk@gmail.com> wrote:
> > We use script file which are generated by pg_upgrade.
>
> I haven't followed this thread closely, but I am sure you recall that
> vacuumdb has a parallel mode.

I think having to vacuum the whole database during pg_upgrade (or immediately thereafter, which in practice means that the database is unusable for queries until that has finished) is way too impractical. Even in parallel mode, it could take far too long. People already complain that our upgrading procedure takes too long compared to that of other database systems.

I don't think there's any problem with rewriting the existing server's VM file into "vfm" format during pg_upgrade, since we expect those files to be much smaller than the data itself.

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Wed, Jul 15, 2015 at 12:55 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> On 10 July 2015 at 15:11, Sawada Masahiko <sawada.mshk@gmail.com> wrote:
>>
>> Oops, I had forgotten to add new file heapfuncs.c.
>> Latest patch is attached.
>
> I think we've established the approach is desirable and defined the way
> forwards for this, so this is looking good.

If we want to move stuff like pg_stattuple, pg_freespacemap into core, we could move them into heapfuncs.c.

> Some of my requests haven't been actioned yet, so I personally would not
> commit this yet. I am happy to continue as reviewer/committer unless others
> wish to take over.
> The main missing item is pg_upgrade support, which won't happen by end of
> CF1, so I am marking this as Returned With Feedback. Hopefully we can review
> this again before CF2.

I appreciate your reviewing. Yeah, the pg_upgrade support and regression test for VFM patch is almost done now, I will submit the patch in this week after testing it.

Regards,

--
Masahiko Sawada
On Wed, Jul 15, 2015 at 3:07 AM, Sawada Masahiko <sawada.mshk@gmail.com> wrote:
> On Wed, Jul 15, 2015 at 12:55 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
>> On 10 July 2015 at 15:11, Sawada Masahiko <sawada.mshk@gmail.com> wrote:
>>>
>>> Oops, I had forgotten to add new file heapfuncs.c.
>>> Latest patch is attached.
>>
>> I think we've established the approach is desirable and defined the way
>> forwards for this, so this is looking good.
>
> If we want to move stuff like pg_stattuple, pg_freespacemap into core,
> we could move them into heapfuncs.c.
>
>> Some of my requests haven't been actioned yet, so I personally would not
>> commit this yet. I am happy to continue as reviewer/committer unless others
>> wish to take over.
>> The main missing item is pg_upgrade support, which won't happen by end of
>> CF1, so I am marking this as Returned With Feedback. Hopefully we can review
>> this again before CF2.
>
> I appreciate your reviewing.
> Yeah, the pg_upgrade support and regression test for VFM patch is
> almost done now, I will submit the patch in this week after testing it.

Attached patch is latest v9 patch.

I added:
- regression test for visibility map (visibilitymap.sql and visibilitymap.out files)
- pg_upgrade support (rewriting vm file to vfm file)
- regression test for pg_upgrade

Please review it.

Regards,

--
Masahiko Sawada
Attachment
On Thu, Jul 16, 2015 at 8:51 PM, Sawada Masahiko <sawada.mshk@gmail.com> wrote:
> On Wed, Jul 15, 2015 at 3:07 AM, Sawada Masahiko <sawada.mshk@gmail.com> wrote:
>> On Wed, Jul 15, 2015 at 12:55 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
>>> On 10 July 2015 at 15:11, Sawada Masahiko <sawada.mshk@gmail.com> wrote:
>>>>
>>>> Oops, I had forgotten to add new file heapfuncs.c.
>>>> Latest patch is attached.
>>>
>>> I think we've established the approach is desirable and defined the way
>>> forwards for this, so this is looking good.
>>
>> If we want to move stuff like pg_stattuple, pg_freespacemap into core,
>> we could move them into heapfuncs.c.
>>
>>> Some of my requests haven't been actioned yet, so I personally would not
>>> commit this yet. I am happy to continue as reviewer/committer unless others
>>> wish to take over.
>>> The main missing item is pg_upgrade support, which won't happen by end of
>>> CF1, so I am marking this as Returned With Feedback. Hopefully we can review
>>> this again before CF2.
>>
>> I appreciate your reviewing.
>> Yeah, the pg_upgrade support and regression test for VFM patch is
>> almost done now, I will submit the patch in this week after testing it.
>
> Attached patch is latest v9 patch.
>
> I added:
> - regression test for visibility map (visibilitymap.sql and
> visibilitymap.out files)
> - pg_upgrade support (rewriting vm file to vfm file)
> - regression test for pg_upgrade
>

The previous patch failed to apply, so I have attached a rebased patch. The catalog version is not decided yet, so we will need to rewrite VISIBILITY_MAP_FROZEN_BIT_CAT_VER in pg_upgrade.h.

Please review it.

Regards,

--
Masahiko Sawada
Attachment
On Wed, Jul 8, 2015 at 02:31:04PM +0100, Simon Riggs wrote:
> On 7 July 2015 at 18:45, Sawada Masahiko <sawada.mshk@gmail.com> wrote:
> > On Wed, Jul 8, 2015 at 12:37 AM, Andres Freund <andres@anarazel.de> wrote:
> > On 2015-07-07 16:25:13 +0100, Simon Riggs wrote:
> >> I don't think pg_freespacemap is the right place.
> >
> > I agree that pg_freespacemap sounds like an odd location.
> >
> >> I'd prefer to add that as a single function into core, so we can write
> >> formal tests.
> >
> > With the advent of src/test/modules it's not really a prerequisite for
> > things to be builtin to be testable. I think there's fair arguments for
> > moving stuff like pg_stattuple, pg_freespacemap, pg_buffercache into
> > core at some point, but that's probably a separate discussion.
> >
> > I understood.
> > So I will place bunch of test like src/test/module/visibilitymap_test,
> > which contains some tests regarding this feature,
> > and gather them into one patch.
>
> Please place it in core. I see value in having a diagnostic function for
> general use on production systems.

Sorry to be coming to this discussion late. I understand the desire for a diagnostic function in core, but we have to be consistent. Just because we are adding this function now doesn't mean we should use different rules from what we did previously for diagnostic functions. Either their is logic to why this function is different from the other diagnostic functions in contrib, or we need to have a separate discussion of whether diagnostic functions belong in contrib or core.

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ Everyone has their own god. +
Bruce Momjian wrote:
> I understand the desire for a diagnostic function in core, but we have
> to be consistent. Just because we are adding this function now doesn't
> mean we should use different rules from what we did previously for
> diagnostic functions. Either their is logic to why this function is
> different from the other diagnostic functions in contrib, or we need to
> have a separate discussion of whether diagnostic functions belong in
> contrib or core.

Then let's start moving some extensions to src/extension/.

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Wed, Aug 5, 2015 at 12:36 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> Bruce Momjian wrote:
>> I understand the desire for a diagnostic function in core, but we have
>> to be consistent. Just because we are adding this function now doesn't
>> mean we should use different rules from what we did previously for
>> diagnostic functions. Either their is logic to why this function is
>> different from the other diagnostic functions in contrib, or we need to
>> have a separate discussion of whether diagnostic functions belong in
>> contrib or core.
>
> Then let's start moving some extensions to src/extension/.

That seems like yet another separate issue.

FWIW, it seems to me that we've done a heck of a lot of moving stuff out of contrib over the last few releases. A bunch of things moved to src/test/modules and a bunch of things went to src/bin. We can move more, of course, but this code reorganization has non-trivial costs and I'm not clear what benefits we hope to realize and whether we are in fact realizing those benefits. At this point, the overwhelming majority of what's in contrib is extensions; we're not far from being able to put the whole thing in src/extensions if it really needs to be moved at all. But I don't think it's fair to conflate that with Bruce's question, which it seems to me is both a fair question and a different one.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas wrote:
> On Wed, Aug 5, 2015 at 12:36 PM, Alvaro Herrera
> <alvherre@2ndquadrant.com> wrote:
> > Bruce Momjian wrote:
> >> I understand the desire for a diagnostic function in core, but we have
> >> to be consistent. Just because we are adding this function now doesn't
> >> mean we should use different rules from what we did previously for
> >> diagnostic functions. Either their is logic to why this function is
> >> different from the other diagnostic functions in contrib, or we need to
> >> have a separate discussion of whether diagnostic functions belong in
> >> contrib or core.
> >
> > Then let's start moving some extensions to src/extension/.
>
> That seems like yet another separate issue.
>
> FWIW, it seems to me that we've done a heck of a lot of moving stuff
> out of contrib over the last few releases. A bunch of things moved to
> src/test/modules and a bunch of things went to src/bin. We can move
> more, of course, but this code reorganization has non-trivial costs
> and I'm not clear what benefits we hope to realize and whether we are
> in fact realizing those benefits. At this point, the overwhelming
> majority of what's in contrib is extensions; we're not far from being
> able to put the whole thing in src/extensions if it really needs to be
> moved at all.

There are a number of things in contrib that are not extensions, and others are not core-quality yet. I don't think we should move everything; at least not everything in one go. I think there are a small number of diagnostic extensions that would be useful to have in core (pageinspect, pg_buffercache, pg_stat_statements).

> But I don't think it's fair to conflate that with Bruce's question,
> which it seems to me is both a fair question and a different one.

Well, there was no question as such. If the question is "should we instead put it in contrib just to be consistent?" then I think the answer is no. I value consistency as much as every other person, but there are other things I value more, such as availability. If stuff is in contrib and servers don't have it installed because of package policies and it takes three management layers' approval to get it installed in a dying server, then I prefer to have it in core.

If the question was "why are we not using the rule we previously had that diagnostic tools were in contrib?" then I think the answer is that we have evolved and we now know better. We have evolved in the sense that we have more stuff in production now that needs better diagnostic tooling to be available; and we know better now in the sense that we have realized there's this company policy bureaucracy that things in contrib are not always available for reasons that are beyond us.

Anyway, the patch as proposed puts the new functions in core as builtins (which is what Bruce seems to be objecting to). Maybe instead of proposing moving existing extensions in core, it would be better to have this patch put those two new functions alone as a single new extension in src/extension, and not move anything else. I don't necessarily resist adding these functions as builtins, but if we do that then there's no going back to having them as an extension instead, which is presumably more in line with what we want in the long run.

(It would be a shame to delay this patch, which messes with complex innards, just because of a discussion about the placement of two smallish diagnostic functions.)

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 08/05/2015 10:00 AM, Alvaro Herrera wrote:
> Anyway, the patch as proposed puts the new functions in core as builtins
> (which is what Bruce seems to be objecting to). Maybe instead of
> proposing moving existing extensions in core, it would be better to have
> this patch put those two new functions alone as a single new extension
> in src/extension, and not move anything else. I don't necessarily
> resist adding these functions as builtins, but if we do that then
> there's no going back to having them as an extension instead, which is
> presumably more in line with what we want in the long run.

For my part, I am unclear on why we are putting *any* diagnostic tools in /contrib today. Either the diagnostic tools are good quality and necessary for a bunch of users, in which case we ship them in core, or they are obscure and/or untested, in which case they go in an external project and/or on PGXN.

Yes, for tools with overhead we might want to require enabling them in pg.conf. But that's very different from requiring the user to install a separate package.

--
Josh Berkus
PostgreSQL Experts Inc. http://pgexperts.com
On Wed, Aug 5, 2015 at 10:22:48AM -0700, Josh Berkus wrote:
> On 08/05/2015 10:00 AM, Alvaro Herrera wrote:
> > Anyway, the patch as proposed puts the new functions in core as builtins
> > (which is what Bruce seems to be objecting to). Maybe instead of
> > proposing moving existing extensions in core, it would be better to have
> > this patch put those two new functions alone as a single new extension
> > in src/extension, and not move anything else.
>
> For my part, I am unclear on why we are putting *any* diagnostic tools
> in /contrib today. Either the diagnostic tools are good quality and
> necessary for a bunch of users, in which case we ship them in core, or
> they are obscure and/or untested, in which case they go in an external
> project and/or on PGXN.
>
> Yes, for tools with overhead we might want to require enabling them in
> pg.conf. But that's very different from requiring the user to install a
> separate package.

I don't care what we do, but I do think we should be consistent. Frankly I am unclear why I am even having to make this point, as cases where we have chosen expediency over consistency have served us badly in the past.

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ Everyone has their own god. +
On 08/05/2015 10:26 AM, Bruce Momjian wrote:
> On Wed, Aug 5, 2015 at 10:22:48AM -0700, Josh Berkus wrote:
>> For my part, I am unclear on why we are putting *any* diagnostic tools
>> in /contrib today. Either the diagnostic tools are good quality and
>> necessary for a bunch of users, in which case we ship them in core, or
>> they are obscure and/or untested, in which case they go in an external
>> project and/or on PGXN.
>>
>> Yes, for tools with overhead we might want to require enabling them in
>> pg.conf. But that's very different from requiring the user to install a
>> separate package.
>
> I don't care what we do, but I do think we should be consistent.
> Frankly I am unclear why I am even having to make this point, as cases
> where we have chosen expediency over consistency have served us badly in
> the past.

Saying "it's stupid to be consistent with a bad old rule", and making a new rule is not "expediency".

--
Josh Berkus
PostgreSQL Experts Inc. http://pgexperts.com
Josh Berkus wrote: > On 08/05/2015 10:26 AM, Bruce Momjian wrote: > > I don't care what we do, but I do think we should be consistent. > > Frankly I am unclear why I am even having to make this point, as cases > > where we have chosen expediency over consistency have served us badly in > > the past. > > Saying "it's stupid to be consistent with a bad old rule", and making a > new rule is not "expediency". So I discussed this with Bruce on IM a bit. I think there are basically four ways we could go about this: 1. Add the functions as builtins. This is what the current patch does. Simon seems to prefer this, because he wants the function to be always available in production; but I don't like this option because adding functions as builtins makes it impossible to move later to extensions. Bruce doesn't like this option either. 2. Add the functions to contrib, keep them there for the foreseeable future. Simon is against this option, because the functions will be unavailable when needed in production. I am of the same position. Bruce opines this option is acceptable. 3. a) Add the function to some extension in contrib now, by using a slightly modified version of the current patch, and b) Apply some later patch to move said extension to src/extension. 4. a) Patch some extension(s) to move it to src/extension, b) Apply a version of this patch that adds the new functions to said extension. Essentially 3 and 4 are the same thing except the order is reversed; they both result in the functions being shipped in some "core extension" (a concept we do not have today). Bruce says either of these is fine with him. I am fine with either of them also. As long as we do 3b during the 9.6 timeframe, the outcome of either 3 or 4 seems to be acceptable for Simon also. Robert seems to be saying that he doesn't care about moving extensions to core at all. What do others think? -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 08/05/2015 10:46 AM, Alvaro Herrera wrote: > 1. Add the functions as builtins. > This is what the current patch does. Simon seems to prefer this, > because he wants the function to be always available in production; > but I don't like this option because adding functions as builtins > makes it impossible to move later to extensions. > Bruce doesn't like this option either. Why would we want to move them later to extensions? Do you anticipate not needing them in the future? If we don't need them in the future, why would they continue to exist at all? I'm really not getting this. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
Josh Berkus wrote: > On 08/05/2015 10:46 AM, Alvaro Herrera wrote: > > 1. Add the functions as builtins. > > This is what the current patch does. Simon seems to prefer this, > > because he wants the function to be always available in production; > > but I don't like this option because adding functions as builtins > > makes it impossible to move later to extensions. > > Bruce doesn't like this option either. > > Why would we want to move them later to extensions? Because it's not nice to have random stuff as builtins. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 2015-08-05 20:09, Alvaro Herrera wrote: > Josh Berkus wrote: >> On 08/05/2015 10:46 AM, Alvaro Herrera wrote: >>> 1. Add the functions as builtins. >>> This is what the current patch does. Simon seems to prefer this, >>> because he wants the function to be always available in production; >>> but I don't like this option because adding functions as builtins >>> makes it impossible to move later to extensions. >>> Bruce doesn't like this option either. >> >> Why would we want to move them later to extensions? > > Because it's not nice to have random stuff as builtins. > Extensions have one nice property: they provide namespacing, so not everything has to be in pg_catalog, which already has about a gazillion functions. It's nice to have stuff you don't need for day-to-day operations separate but still available (which is why src/extensions is better than contrib). -- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On Wed, Aug 5, 2015 at 10:58:00AM -0700, Josh Berkus wrote: > On 08/05/2015 10:46 AM, Alvaro Herrera wrote: > > 1. Add the functions as builtins. > > This is what the current patch does. Simon seems to prefer this, > > because he wants the function to be always available in production; > > but I don't like this option because adding functions as builtins > > makes it impossible to move later to extensions. > > Bruce doesn't like this option either. > > Why would we want to move them later to extensions? Do you anticipate > not needing them in the future? If we don't need them in the future, > why would they continue to exist at all? > > I'm really not getting this. ---------------------------- This is why I suggested putting the new SQL function where it belongs for consistency and then open a separate thread to discuss the future of where we want diagnostic functions to be. It is too complicated to talk about both issues in the same thread. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + Everyone has their own god. +
Bruce Momjian wrote: > This is why I suggested putting the new SQL function where it belongs > for consistency and then open a separate thread to discuss the future of > where we want diagnostic functions to be. It is too complicated to talk > about both issues in the same thread. Oh come on -- gimme a break. We figure out much more complicated problems in single threads all the time. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Wed, Aug 5, 2015 at 11:57:48PM -0300, Alvaro Herrera wrote: > Bruce Momjian wrote: > > > This is why I suggested putting the new SQL function where it belongs > > for consistency and then open a separate thread to discuss the future of > > where we want diagnostic functions to be. It is too complicated to talk > > about both issues in the same thread. > > Oh come on -- gimme a break. We figure out much more complicated > problems in single threads all the time. Well, people are confused, as stated --- what more can I say? -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + Everyone has their own god. +
On 8/5/15 1:47 PM, Petr Jelinek wrote: > On 2015-08-05 20:09, Alvaro Herrera wrote: >> Josh Berkus wrote: >>> On 08/05/2015 10:46 AM, Alvaro Herrera wrote: >>>> 1. Add the functions as builtins. >>>> This is what the current patch does. Simon seems to prefer this, >>>> because he wants the function to be always available in production; >>>> but I don't like this option because adding functions as builtins >>>> makes it impossible to move later to extensions. >>>> Bruce doesn't like this option either. >>> >>> Why would we want to move them later to extensions? >> >> Because it's not nice to have random stuff as builtins. >> > > Extensions have one nice property: they provide namespacing, so not > everything has to be in pg_catalog, which already has about a gazillion > functions. It's nice to have stuff you don't need for day-to-day > operations separate but still available (which is why src/extensions is > better than contrib). They also provide a level of control over what is and isn't installed in a cluster. Personally, I'd prefer that most users not even be aware of the existence of things like pageinspect. -- Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX Data in Trouble? Get it in Treble! http://BlueTreble.com
On 5 August 2015 at 18:46, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> What do others think?
Wow, everything moves when you blink, eh? Sorry I wasn't watching this. Mainly because I was working on some other related thoughts; separate post coming.
1. Most importantly, it needs to be somewhere where we can use the function in a regression test. As I said before, I would not commit this without a formal proof of correctness.
2. I'd also like to be able to make checks on this while we're in production, to ensure we have no bugs. I was trying to learn from earlier mistakes and make sure we are ready with diagnostic tools to allow run-time checks and confirm everything is good. If people feel that means I've asked for something in the wrong place, I am happy to skip that request and place it wherever requested.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Thu, Aug 6, 2015 at 11:33 AM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote: > They also provide a level of control over what is and isn't installed in a > cluster. Personally, I'd prefer that most users not even be aware of the > existence of things like pageinspect. +1. If everybody feels that moving extensions currently stored in contrib into src/extensions is going to help us somehow, then, uh, OK. I can't work up any enthusiasm for that, but I can live with it. However, I think it's affirmatively bad policy to say that we're going to put all of our debugging facilities into core because otherwise some people might not have them installed. That's depriving users of the ability to control their environment, and there are good reasons for some people to want those things not to be installed. If we accept the argument "it inconveniences hacker X when Y is not installed" as a reason to put Y in core, then we can justify putting anything at all into core. And I don't think that's right at all. Extensions are a useful packaging mechanism for functionality that is useful but not required, and debugging facilities are definitely very useful but should not be required. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Mon, Aug 10, 2015 at 12:39 AM, Robert Haas <robertmhaas@gmail.com> wrote: > On Thu, Aug 6, 2015 at 11:33 AM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote: >> They also provide a level of control over what is and isn't installed in a >> cluster. Personally, I'd prefer that most users not even be aware of the >> existence of things like pageinspect. > > +1. > > [...] > > Extensions are a useful packaging mechanism for functionality that is > useful but not required, and debugging facilities are definitely very > useful but should not be required. +1. -- Michael
On Mon, Aug 10, 2015 at 11:05 AM, Michael Paquier <michael.paquier@gmail.com> wrote: > On Mon, Aug 10, 2015 at 12:39 AM, Robert Haas <robertmhaas@gmail.com> wrote: >> On Thu, Aug 6, 2015 at 11:33 AM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote: >>> They also provide a level of control over what is and isn't installed in a >>> cluster. Personally, I'd prefer that most users not even be aware of the >>> existence of things like pageinspect. >> >> +1. >> >> [...] >> >> Extensions are a useful packaging mechanism for functionality that is >> useful but not required, and debugging facilities are definitely very >> useful but should not be required. > > +1. Sorry to come to the discussion late. I have encountered many cases where pg_stat_statements and pgstattuple are required in production, so I basically agree with moving such extensions into core. But IMO, the diagnostic tools for the visibility map and heap (pageinspect) and so on are a kind of debugging tool. Attached are the latest v11 patches, separated into two: a frozen bit patch and a diagnostic function patch. Moving the diagnostic functions into core is still under discussion, but this patch puts them into core because the diagnostic function for the visibility map needs to be in core at least to run the regression test. Regards, -- Masahiko Sawada
Attachment
On Tue, Aug 18, 2015 at 7:27 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: > I have encountered many cases where pg_stat_statements and pgstattuple > are required in production, so I basically agree with moving such > extensions into core. But IMO, the diagnostic tools for the visibility > map and heap (pageinspect) and so on are a kind of debugging tool. Just because something might be required in production isn't a sufficient reason to put it in core. Debugging tools, or anything else, can be required in production, too. > Attached are the latest v11 patches, separated into two: a frozen bit > patch and a diagnostic function patch. Moving the diagnostic functions > into core is still under discussion, but this patch puts them into core > because the diagnostic function for the visibility map needs to be in > core at least to run the regression test. As has been discussed recently, there are other ways to handle that. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Wed, Aug 19, 2015 at 1:28 AM, Robert Haas <robertmhaas@gmail.com> wrote: > On Tue, Aug 18, 2015 at 7:27 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: >> I have encountered many cases where pg_stat_statements and pgstattuple >> are required in production, so I basically agree with moving such >> extensions into core. But IMO, the diagnostic tools for the visibility >> map and heap (pageinspect) and so on are a kind of debugging tool. > > Just because something might be required in production isn't a > sufficient reason to put it in core. Debugging tools, or anything > else, can be required in production, too. > >> Attached are the latest v11 patches, separated into two: a frozen bit >> patch and a diagnostic function patch. Moving the diagnostic functions >> into core is still under discussion, but this patch puts them into core >> because the diagnostic function for the visibility map needs to be in >> core at least to run the regression test. > > As has been discussed recently, there are other ways to handle that. The current regression test for the VM just compares the total numbers of all-visible and all-frozen bits in the VM before and after VACUUM; it doesn't check any particular bit in the VM. We could substitute it with the ANALYZE command run with a large enough sample size, checking pg_class.relallvisible and pg_class.relallfrozen. So another way is to put the diagnostic function for the VM into some contrib module (pg_freespacemap or pageinspect); if we want to use such a function in production, we can install that extension as in the past. Regards, -- Masahiko Sawada
On 8/19/15 2:56 AM, Masahiko Sawada wrote: > The current regression test for the VM just compares the total numbers > of all-visible and all-frozen bits in the VM before and after VACUUM; > it doesn't check any particular bit in the VM. We could substitute it > with the ANALYZE command run with a large enough sample size, checking > pg_class.relallvisible and pg_class.relallfrozen. I think this is another indication that we need more than just pg_regress... > So another way is to put the diagnostic function for the VM into some > contrib module (pg_freespacemap or pageinspect); if we want to use such > a function in production, we can install that extension as in the past. pg_buffercache is very useful as a performance monitoring tool, and I view being able to pull statistics about the VM and FM the same way. I'd like to see us providing more performance information by default, not less. I think things like pageinspect are very different; I really can't see any use for those beyond debugging (and debugging by an expert at that). -- Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX Data in Trouble? Get it in Treble! http://BlueTreble.com
Jim Nasby wrote: > I think things like pageinspect are very different; I really can't see any > use for those beyond debugging (and debugging by an expert at that). I don't think that necessarily means it must continue to be in contrib. Quite the contrary, I think it is a tool critical enough that it should not be relegated to be a second-class citizen as it is now (let's face it, being in contrib *is* second-class citizenship). -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Thu, Aug 20, 2015 at 11:46 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote: > Jim Nasby wrote: > >> I think things like pageinspect are very different; I really can't see any >> use for those beyond debugging (and debugging by an expert at that). > > I don't think that necessarily means it must continue to be in contrib. > Quite the contrary, I think it is a tool critical enough that it should > not be relegated to be a second-class citizen as it is now (let's face > it, being in contrib *is* second-class citizenship). > Attached patch is latest patch. The VM regression test has been changed so that it runs without the diagnostic functions. In the current patch, we VACUUM and then VACUUM FREEZE the table, and check the resulting values of pg_class.relallvisible and relallfrozen. When the first VACUUM runs in the regression test, the table has no VM yet, so VACUUM scans all pages and records exact information about the number of all-visible bits. When the second VACUUM FREEZE runs, it also scans all pages, because no page is marked all-frozen yet, so it records exact information about the number of all-frozen bits. In the previous patch, we checked the VM bits one by one using the diagnostic function and compared the results with pg_class.relallvisible(/relallfrozen), so the essential check is the same as in the previous patch. We can ensure correctness with this procedure. Regards, -- Masahiko Sawada
Attachment
On Thu, Aug 27, 2015 at 1:54 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: > On Thu, Aug 20, 2015 at 11:46 PM, Alvaro Herrera > <alvherre@2ndquadrant.com> wrote: >> Jim Nasby wrote: >> >>> I think things like pageinspect are very different; I really can't see any >>> use for those beyond debugging (and debugging by an expert at that). >> >> I don't think that necessarily means it must continue to be in contrib. >> Quite the contrary, I think it is a tool critical enough that it should >> not be relegated to be a second-class citizen as it is now (let's face >> it, being in contrib *is* second-class citizenship). >> > > Attached patch is latest patch. The previous patch lacks some files for regression test. Attached fixed v12 patch. Regards, -- Masahiko Sawada
Attachment
On Thu, Aug 20, 2015 at 10:46 AM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote: > Jim Nasby wrote: >> I think things like pageinspect are very different; I really can't see any >> use for those beyond debugging (and debugging by an expert at that). > > I don't think that necessarily means it must continue to be in contrib. > Quite the contrary, I think it is a tool critical enough that it should > not be relegated to be a second-class citizen as it is now (let's face > it, being in contrib *is* second-class citizenship). I have resisted that principle for years and will continue to do so. It is entirely reasonable for some DBAs to want certain functionality (debugging tools, crypto) to not be installed on their machines. Folding everything into core is not a good policy, IMHO. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas wrote: > On Thu, Aug 20, 2015 at 10:46 AM, Alvaro Herrera > <alvherre@2ndquadrant.com> wrote: > > I don't think that necessarily means it must continue to be in contrib. > > Quite the contrary, I think it is a tool critical enough that it should > > not be relegated to be a second-class citizen as it is now (let's face > > it, being in contrib *is* second-class citizenship). > > I have resisted that principle for years and will continue to do so. > It is entirely reasonable for some DBAs to want certain functionality > (debugging tools, crypto) to not be installed on their machines. > Folding everything into core is not a good policy, IMHO. I don't understand. I'm just proposing that the source code for the extension to live in src/extensions/, and have the shared library installed by toplevel make install; I'm not suggesting that the extension is installed automatically. For that, you still need a superuser to run CREATE EXTENSION. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Thu, Sep 3, 2015 at 2:26 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote: > Robert Haas wrote: >> On Thu, Aug 20, 2015 at 10:46 AM, Alvaro Herrera >> <alvherre@2ndquadrant.com> wrote: > >> > I don't think that necessarily means it must continue to be in contrib. >> > Quite the contrary, I think it is a tool critical enough that it should >> > not be relegated to be a second-class citizen as it is now (let's face >> > it, being in contrib *is* second-class citizenship). >> >> I have resisted that principle for years and will continue to do so. >> It is entirely reasonable for some DBAs to want certain functionality >> (debugging tools, crypto) to not be installed on their machines. >> Folding everything into core is not a good policy, IMHO. > > I don't understand. I'm just proposing that the source code for the > extension to live in src/extensions/, and have the shared library > installed by toplevel make install; I'm not suggesting that the > extension is installed automatically. For that, you still need a > superuser to run CREATE EXTENSION. Oh. Well, that's different. I don't particularly support that proposal, but I'm not prepared to fight over it either. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 2015-09-03 20:26, Alvaro Herrera wrote: > Robert Haas wrote: >> On Thu, Aug 20, 2015 at 10:46 AM, Alvaro Herrera >> <alvherre@2ndquadrant.com> wrote: > >>> I don't think that necessarily means it must continue to be in contrib. >>> Quite the contrary, I think it is a tool critical enough that it should >>> not be relegated to be a second-class citizen as it is now (let's face >>> it, being in contrib *is* second-class citizenship). >> >> I have resisted that principle for years and will continue to do so. >> It is entirely reasonable for some DBAs to want certain functionality >> (debugging tools, crypto) to not be installed on their machines. >> Folding everything into core is not a good policy, IMHO. > > I don't understand. I'm just proposing that the source code for the > extension to live in src/extensions/, and have the shared library > installed by toplevel make install; I'm not suggesting that the > extension is installed automatically. For that, you still need a > superuser to run CREATE EXTENSION. > +! for this -- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On Thu, Sep 3, 2015 at 11:37:09PM +0200, Petr Jelinek wrote: > >I don't understand. I'm just proposing that the source code for the > >extension to live in src/extensions/, and have the shared library > >installed by toplevel make install; I'm not suggesting that the > >extension is installed automatically. For that, you still need a > >superuser to run CREATE EXTENSION. > > > > +! for this OK, what does "+!" mean? (I know it is probably a shift-key mistype, but it looks interesting.) -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + Everyone has their own god. +
On 09/03/2015 05:11 PM, Bruce Momjian wrote: > On Thu, Sep 3, 2015 at 11:37:09PM +0200, Petr Jelinek wrote: >>> I don't understand. I'm just proposing that the source code for the >>> extension to live in src/extensions/, and have the shared library >>> installed by toplevel make install; I'm not suggesting that the >>> extension is installed automatically. For that, you still need a >>> superuser to run CREATE EXTENSION. >>> >> >> +! for this > > OK, what does "+!" mean? (I know it is probably a shift-key mistype, > but it looks interesting.) Add the next factorial value? -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
On 04/09/15 12:11, Bruce Momjian wrote: > On Thu, Sep 3, 2015 at 11:37:09PM +0200, Petr Jelinek wrote: >>> I don't understand. I'm just proposing that the source code for the >>> extension to live in src/extensions/, and have the shared library >>> installed by toplevel make install; I'm not suggesting that the >>> extension is installed automatically. For that, you still need a >>> superuser to run CREATE EXTENSION. >>> >> +! for this > OK, what does "+!" mean? (I know it is probably a shift-key mistype, > but it looks interesting.) > It obviously signifies a Good Move that involved a check - at least, that is what it would mean when annotating a Chess Game! :-)
On Fri, Sep 4, 2015 at 2:23 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: > On Thu, Aug 27, 2015 at 1:54 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: >> On Thu, Aug 20, 2015 at 11:46 PM, Alvaro Herrera >> <alvherre@2ndquadrant.com> wrote: >>> Jim Nasby wrote: >>> >>>> I think things like pageinspect are very different; I really can't see any >>>> use for those beyond debugging (and debugging by an expert at that). >>> >>> I don't think that necessarily means it must continue to be in contrib. >>> Quite the contrary, I think it is a tool critical enough that it should >>> not be relegated to be a second-class citizen as it is now (let's face >>> it, being in contrib *is* second-class citizenship). >>> >> >> Attached patch is latest patch. > > The previous patch lacks some files for regression test. > Attached fixed v12 patch. The patch could be applied cleanly. "make check" could pass successfully. But "make check-world -j 2" failed. Regards, -- Fujii Masao
Bruce Momjian wrote: > On Thu, Sep 3, 2015 at 11:37:09PM +0200, Petr Jelinek wrote: > > >I don't understand. I'm just proposing that the source code for the > > >extension to live in src/extensions/, and have the shared library > > >installed by toplevel make install; I'm not suggesting that the > > >extension is installed automatically. For that, you still need a > > >superuser to run CREATE EXTENSION. > > > > > > > +! for this > > OK, what does "+!" mean? (I know it is probably a shift-key mistype, > but it looks interesting.) I took it as an uppercase 1 myself -- a shouted "PLUS ONE". -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Fri, Sep 4, 2015 at 10:35 AM, Fujii Masao <masao.fujii@gmail.com> wrote: > On Fri, Sep 4, 2015 at 2:23 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: >> On Thu, Aug 27, 2015 at 1:54 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: >>> On Thu, Aug 20, 2015 at 11:46 PM, Alvaro Herrera >>> <alvherre@2ndquadrant.com> wrote: >>>> Jim Nasby wrote: >>>> >>>>> I think things like pageinspect are very different; I really can't see any >>>>> use for those beyond debugging (and debugging by an expert at that). >>>> >>>> I don't think that necessarily means it must continue to be in contrib. >>>> Quite the contrary, I think it is a tool critical enough that it should >>>> not be relegated to be a second-class citizen as it is now (let's face >>>> it, being in contrib *is* second-class citizenship). >>>> >>> >>> Attached patch is latest patch. >> >> The previous patch lacks some files for regression test. >> Attached fixed v12 patch. > > The patch could be applied cleanly. "make check" could pass successfully. > But "make check-world -j 2" failed. > Thank you for looking at this patch. Could you tell me what test you got failed? make check-world -j 2 or more is done successfully in my environment. Regards, -- Masahiko Sawada
On 2015-09-04 02:11, Bruce Momjian wrote: > On Thu, Sep 3, 2015 at 11:37:09PM +0200, Petr Jelinek wrote: >>> I don't understand. I'm just proposing that the source code for the >>> extension to live in src/extensions/, and have the shared library >>> installed by toplevel make install; I'm not suggesting that the >>> extension is installed automatically. For that, you still need a >>> superuser to run CREATE EXTENSION. >>> >> >> +! for this > > OK, what does "+!" mean? (I know it is probably a shift-key mistype, > but it looks interesting.) > Yes, shift-key mistype:) -- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On Thu, Sep 3, 2015 at 11:56:52PM -0300, Alvaro Herrera wrote: > Bruce Momjian wrote: > > On Thu, Sep 3, 2015 at 11:37:09PM +0200, Petr Jelinek wrote: > > > >I don't understand. I'm just proposing that the source code for the > > > >extension to live in src/extensions/, and have the shared library > > > >installed by toplevel make install; I'm not suggesting that the > > > >extension is installed automatically. For that, you still need a > > > >superuser to run CREATE EXTENSION. > > > > > > > > > > +! for this > > > > OK, what does "+!" mean? (I know it is probably a shift-key mistype, > > but it looks interesting.) > > I took it as an uppercase 1 myself -- a shouted "PLUS ONE". Oh, an ALL-CAPS +1. Yeah, it actually makes sense. ;-) -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + Everyone has their own god. +
On 3 September 2015 at 18:23, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> The previous patch lacks some files for regression test.
> Attached fixed v12 patch.
This looks OK. You saw that I was proposing to solve this problem a different way ("Summary of plans to avoid the annoyance of Freezing"), suggesting that we wait for a few CFs to see if a patch emerges for that - then fall back to this patch if it doesn't? So I am moving this patch to next CF.
I apologise for the personal annoyance caused by this; I hope whatever solution we find we can work together on it.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Sat, Sep 5, 2015 at 7:35 AM, Simon Riggs <simon@2ndquadrant.com> wrote: > On 3 September 2015 at 18:23, Masahiko Sawada <sawada.mshk@gmail.com> wrote: > >> >> The previous patch lacks some files for regression test. >> Attached fixed v12 patch. > > This looks OK. You saw that I was proposing to solve this problem a > different way ("Summary of plans to avoid the annoyance of Freezing"), > suggesting that we wait for a few CFs to see if a patch emerges for that - > then fall back to this patch if it doesn't? So I am moving this patch to > next CF. > > I apologise for the personal annoyance caused by this; I hope whatever > solution we find we can work together on it. > I had actually missed that thread, but I now understand the status of the freeze avoidance topic. It's no problem for me if we address Heikki's solution first and another plan (maybe the frozen map) next. But this frozen map patch is still under review and might have serious problems, so it still needs reviewing. So I think we should at least continue to review this patch while reviewing Heikki's solution, and then we can select the solution for the frozen map. Otherwise, if the frozen map turns out to have a serious problem or some other big problem occurs, the review of the patch will not have been enough, and that will lead to a bad result, I think. Regards, -- Masahiko Sawada
On Mon, Sep 7, 2015 at 11:18 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: > On Sat, Sep 5, 2015 at 7:35 AM, Simon Riggs <simon@2ndquadrant.com> wrote: >> On 3 September 2015 at 18:23, Masahiko Sawada <sawada.mshk@gmail.com> wrote: >>> The previous patch lacks some files for regression test. >>> Attached fixed v12 patch. >> >> This looks OK. You saw that I was proposing to solve this problem a >> different way ("Summary of plans to avoid the annoyance of Freezing"), >> suggesting that we wait for a few CFs to see if a patch emerges for that - >> then fall back to this patch if it doesn't? So I am moving this patch to >> next CF. >> >> I apologise for the personal annoyance caused by this; I hope whatever >> solution we find we can work together on it. >> > > I had actually missed that thread, but I now understand the status of > the freeze avoidance topic. > It's no problem for me if we address Heikki's solution first and > another plan (maybe the frozen map) next. > But this frozen map patch is still under review and might have serious > problems, so it still needs reviewing. > So I think we should at least continue to review this patch while > reviewing Heikki's solution, and then we can select the solution for > the frozen map. > Otherwise, if the frozen map turns out to have a serious problem or > some other big problem occurs, the review of the patch will not have > been enough, and that will lead to a bad result, I think. I agree! -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 2015-09-04 23:35:42 +0100, Simon Riggs wrote: > This looks OK. You saw that I was proposing to solve this problem a > different way ("Summary of plans to avoid the annoyance of Freezing"), > suggesting that we wait for a few CFs to see if a patch emerges for that - > then fall back to this patch if it doesn't? So I am moving this patch to > next CF. As noted on that other thread I don't think that's a good policy, and it seems like Robert agrees with me. So I think we should move this back to "Needs Review". Greetings, Andres Freund
On Fri, Sep 4, 2015 at 2:55 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: > On Fri, Sep 4, 2015 at 10:35 AM, Fujii Masao <masao.fujii@gmail.com> wrote: >> On Fri, Sep 4, 2015 at 2:23 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: >>> On Thu, Aug 27, 2015 at 1:54 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: >>>> On Thu, Aug 20, 2015 at 11:46 PM, Alvaro Herrera >>>> <alvherre@2ndquadrant.com> wrote: >>>>> Jim Nasby wrote: >>>>> >>>>>> I think things like pageinspect are very different; I really can't see any >>>>>> use for those beyond debugging (and debugging by an expert at that). >>>>> >>>>> I don't think that necessarily means it must continue to be in contrib. >>>>> Quite the contrary, I think it is a tool critical enough that it should >>>>> not be relegated to be a second-class citizen as it is now (let's face >>>>> it, being in contrib *is* second-class citizenship). >>>>> >>>> >>>> Attached patch is latest patch. >>> >>> The previous patch lacks some files for regression test. >>> Attached fixed v12 patch. >> >> The patch could be applied cleanly. "make check" could pass successfully. >> But "make check-world -j 2" failed. >> > > Thank you for looking at this patch. > Could you tell me what test you got failed? > make check-world -j 2 or more is done successfully in my environment. I tried to do the test again, but initdb failed with the following error. creating template1 database in data/base/1 ... FATAL: invalid input syntax for type oid: "f" This error didn't happen when I tested before. So the commit which was applied recently might interfere with the patch. Regards, -- Fujii Masao
On Fri, Sep 18, 2015 at 6:13 PM, Fujii Masao <masao.fujii@gmail.com> wrote: > On Fri, Sep 4, 2015 at 2:55 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: >> On Fri, Sep 4, 2015 at 10:35 AM, Fujii Masao <masao.fujii@gmail.com> wrote: >>> On Fri, Sep 4, 2015 at 2:23 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: >>>> On Thu, Aug 27, 2015 at 1:54 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: >>>>> On Thu, Aug 20, 2015 at 11:46 PM, Alvaro Herrera >>>>> <alvherre@2ndquadrant.com> wrote: >>>>>> Jim Nasby wrote: >>>>>> >>>>>>> I think things like pageinspect are very different; I really can't see any >>>>>>> use for those beyond debugging (and debugging by an expert at that). >>>>>> >>>>>> I don't think that necessarily means it must continue to be in contrib. >>>>>> Quite the contrary, I think it is a tool critical enough that it should >>>>>> not be relegated to be a second-class citizen as it is now (let's face >>>>>> it, being in contrib *is* second-class citizenship). >>>>>> >>>>> >>>>> Attached patch is latest patch. >>>> >>>> The previous patch lacks some files for regression test. >>>> Attached fixed v12 patch. >>> >>> The patch could be applied cleanly. "make check" could pass successfully. >>> But "make check-world -j 2" failed. >>> >> >> Thank you for looking at this patch. >> Could you tell me what test you got failed? >> make check-world -j 2 or more is done successfully in my environment. > > I tried to do the test again, but initdb failed with the following error. > > creating template1 database in data/base/1 ... FATAL: invalid > input syntax for type oid: "f" > > This error didn't happen when I tested before. So the commit which was > applied recently might interfere with the patch. > Thank you for testing! Attached fixed version patch. Regards, -- Masahiko Sawada
Attachment
On Fri, Sep 18, 2015 at 8:14 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: > On Fri, Sep 18, 2015 at 6:13 PM, Fujii Masao <masao.fujii@gmail.com> wrote: >> On Fri, Sep 4, 2015 at 2:55 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: >>> On Fri, Sep 4, 2015 at 10:35 AM, Fujii Masao <masao.fujii@gmail.com> wrote: >>>> On Fri, Sep 4, 2015 at 2:23 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: >>>>> On Thu, Aug 27, 2015 at 1:54 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: >>>>>> On Thu, Aug 20, 2015 at 11:46 PM, Alvaro Herrera >>>>>> <alvherre@2ndquadrant.com> wrote: >>>>>>> Jim Nasby wrote: >>>>>>> >>>>>>>> I think things like pageinspect are very different; I really can't see any >>>>>>>> use for those beyond debugging (and debugging by an expert at that). >>>>>>> >>>>>>> I don't think that necessarily means it must continue to be in contrib. >>>>>>> Quite the contrary, I think it is a tool critical enough that it should >>>>>>> not be relegated to be a second-class citizen as it is now (let's face >>>>>>> it, being in contrib *is* second-class citizenship). >>>>>>> >>>>>> >>>>>> Attached patch is latest patch. >>>>> >>>>> The previous patch lacks some files for regression test. >>>>> Attached fixed v12 patch. >>>> >>>> The patch could be applied cleanly. "make check" could pass successfully. >>>> But "make check-world -j 2" failed. >>>> >>> >>> Thank you for looking at this patch. >>> Could you tell me what test you got failed? >>> make check-world -j 2 or more is done successfully in my environment. >> >> I tried to do the test again, but initdb failed with the following error. >> >> creating template1 database in data/base/1 ... FATAL: invalid >> input syntax for type oid: "f" >> >> This error didn't happen when I tested before. So the commit which was >> applied recently might interfere with the patch. >> > > Thank you for testing! > Attached fixed version patch. Thanks for updating the patch! Here are comments. +#include "access/visibilitymap.h" visibilitymap.h doesn't need to be included in cluster.c. - errmsg("table row type and query-specified row type do not match"), + errmsg("table row type and query-specified row type do not match"), This change doesn't seem to be necessary. +#define Anum_pg_class_relallfrozen 12 Why is pg_class.relallfrozen necessary? ISTM that there is no user of it now. lazy_scan_heap() calls PageClearAllVisible() when the page containing dead tuples is marked as all-visible. Shouldn't PageClearAllFrozen() be called at the same time? - "vm", /* VISIBILITYMAP_FORKNUM */ + "vfm", /* VISIBILITYMAP_FORKNUM */ I wonder how much it's worth renaming only the file extension while there are many places where "visibility map" and "vm" are used, for example, log messages, function names, variables, etc. Regards, -- Fujii Masao
On Thu, Oct 1, 2015 at 9:44 AM, Fujii Masao <masao.fujii@gmail.com> wrote: > I wonder how much it's worth renaming only the file extension while > there are many places where "visibility map" and "vm" are used, > for example, log messages, function names, variables, etc. I'd be inclined to keep calling it the visibility map (vm) even if it also contains freeze information. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 10/01/2015 07:43 AM, Robert Haas wrote: > On Thu, Oct 1, 2015 at 9:44 AM, Fujii Masao <masao.fujii@gmail.com> wrote: >> I wonder how much it's worth renaming only the file extension while >> there are many places where "visibility map" and "vm" are used, >> for example, log messages, function names, variables, etc. > > I'd be inclined to keep calling it the visibility map (vm) even if it > also contains freeze information. > -1 to rename. Visibility Map is a perfectly good name. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
On Fri, Oct 2, 2015 at 7:30 AM, Josh Berkus <josh@agliodbs.com> wrote: > On 10/01/2015 07:43 AM, Robert Haas wrote: >> On Thu, Oct 1, 2015 at 9:44 AM, Fujii Masao <masao.fujii@gmail.com> wrote: >>> I wonder how much it's worth renaming only the file extension while >>> there are many places where "visibility map" and "vm" are used, >>> for example, log messages, function names, variables, etc. >> >> I'd be inclined to keep calling it the visibility map (vm) even if it >> also contains freeze information. >> > > -1 to rename. Visibility Map is a perfectly good name. > Thank you for taking the time to review this patch. Attached is the latest v14 patch. The v14 patch no longer renames the visibility map file to "vfm", and it contains some bug fixes. > +#include "access/visibilitymap.h" > visibilitymap.h doesn't need to be included in cluster.c. Fixed. > - errmsg("table row type and query-specified row type do not match"), > + errmsg("table row type and query-specified row type > do not match"), > This change doesn't seem to be necessary. Fixed. > +#define Anum_pg_class_relallfrozen 12 > Why is pg_class.relallfrozen necessary? ISTM that there is no user of it now. relallfrozen would be useful for users to estimate the time for VACUUM FREEZE or anti-wraparound vacuum before actually running them. (This value is also used in the regression test.) But this information is not used for planning the way relallvisible is, so it might be better to move it to another system view like pg_stat_*_tables. > lazy_scan_heap() calls PageClearAllVisible() when the page containing > dead tuples is marked as all-visible. Shouldn't PageClearAllFrozen() be > called at the same time? Fixed. > - "vm", /* VISIBILITYMAP_FORKNUM */ > + "vfm", /* VISIBILITYMAP_FORKNUM */ > I wonder how much it's worth renaming only the file extension while > there are many places where "visibility map" and "vm" are used, > for example, log messages, function names, variables, etc. > > I'd be inclined to keep calling it the visibility map (vm) even if it > also contains freeze information. > > -1 to rename. Visibility Map is a perfectly good name. Yeah, I agree with this. The latest v14 patch does so. Regards, -- Masahiko Sawada
Attachment
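The v14 patch keeps the "vm" fork name but now stores two bits per heap page (all-visible plus all-frozen), which changes the map's addressing arithmetic. Below is a minimal, self-contained sketch of that arithmetic; the constant and function names are assumptions modeled on visibilitymap.c and need not match the patch's exact code.

    #include <stdio.h>
    #include <stdint.h>

    /* Assumed constants mirroring the two-bits-per-heap-block layout
     * discussed in this thread; the patch's exact names may differ. */
    #define BITS_PER_BYTE        8
    #define BITS_PER_HEAPBLOCK   2  /* all-visible bit + all-frozen bit */
    #define HEAPBLOCKS_PER_BYTE  (BITS_PER_BYTE / BITS_PER_HEAPBLOCK)

    /* Compute which map byte and which bit offset hold a heap block's flags. */
    static void
    map_position(uint32_t heapBlk, uint32_t *mapByte, int *mapOffset)
    {
        *mapByte = heapBlk / HEAPBLOCKS_PER_BYTE;
        *mapOffset = (int) (heapBlk % HEAPBLOCKS_PER_BYTE) * BITS_PER_HEAPBLOCK;
    }

    int
    main(void)
    {
        uint8_t     map[8] = {0};
        uint32_t    blk = 5;
        uint32_t    byteno;
        int         off;

        map_position(blk, &byteno, &off);
        map[byteno] |= (uint8_t) (1 << off);        /* block 5 all-visible */
        map[byteno] |= (uint8_t) (1 << (off + 1));  /* block 5 all-frozen */
        printf("heap block %u -> map byte %u, bits %d and %d; byte = 0x%02x\n",
               blk, byteno, off, off + 1, map[byteno]);
        return 0;
    }

One consequence of the doubled bit width is that a single map page covers half as many heap pages as before, and old-format maps cannot be reused as-is; this is why the patch teaches pg_upgrade to rewrite them, as reviewed below.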
Masahiko Sawada wrote: > @@ -2972,10 +2981,15 @@ l1: > */ > PageSetPrunable(page, xid); > > + /* clear PD_ALL_VISIBLE and PD_ALL_FORZEN flags */ Typo "FORZEN". > if (PageIsAllVisible(page)) > { > all_visible_cleared = true; > + > + /* all-frozen information is also cleared at the same time */ > PageClearAllVisible(page); > + PageClearAllFrozen(page); I wonder if it makes sense to have a macro to clear both in unison, which seems a very common pattern. > diff --git a/src/backend/access/heap/visibilitymap.c b/src/backend/access/heap/visibilitymap.c > index 7c38772..a284b85 100644 > --- a/src/backend/access/heap/visibilitymap.c > +++ b/src/backend/access/heap/visibilitymap.c > @@ -21,33 +21,45 @@ > * > * NOTES > * > - * The visibility map is a bitmap with one bit per heap page. A set bit means > - * that all tuples on the page are known visible to all transactions, and > - * therefore the page doesn't need to be vacuumed. The map is conservative in > - * the sense that we make sure that whenever a bit is set, we know the > - * condition is true, but if a bit is not set, it might or might not be true. > + * The visibility map is a bitmap with two bits (all-visible and all-frozen) > + * per heap page. A set all-visible bit means that all tuples on the page are > + * known visible to all transactions, and therefore the page doesn't need to > + * be vacuumed. A set all-frozen bit means that all tuples on the page are > + * completely frozen, and therefore the page doesn't need to be vacuumed even > + * if whole table scanning vacuum is required (e.g. anti-wraparound vacuum). > + * A all-frozen bit must be set only when the page is already all-visible. > + * That is, all-frozen bit is always set with all-visible bit. "A all-frozen" -> "The all-frozen" (but "A set all-xyz" is correct). > * When we *set* a visibility map during VACUUM, we must write WAL. This may > * seem counterintuitive, since the bit is basically a hint: if it is clear, > - * it may still be the case that every tuple on the page is visible to all > - * transactions; we just don't know that for certain. The difficulty is that > - * there are two bits which are typically set together: the PD_ALL_VISIBLE bit > - * on the page itself, and the visibility map bit. If a crash occurs after the > - * visibility map page makes it to disk and before the updated heap page makes > - * it to disk, redo must set the bit on the heap page. Otherwise, the next > - * insert, update, or delete on the heap page will fail to realize that the > - * visibility map bit must be cleared, possibly causing index-only scans to > - * return wrong answers. > + * it may still be the case that every tuple on the page is visible or frozen > + * to all transactions; we just don't know that for certain. The difficulty is > + * that there are two bits which are typically set together: the PD_ALL_VISIBLE > + * or PD_ALL_FROZEN bit on the page itself, and the visibility map bit. If a > + * crash occurs after the visibility map page makes it to disk and before the > + * updated heap page makes it to disk, redo must set the bit on the heap page. > + * Otherwise, the next insert, update, or delete on the heap page will fail to > + * realize that the visibility map bit must be cleared, possibly causing index-only > + * scans to return wrong answers. In the "The difficulty ..." para, I would add the word "corresponding" before "visibility". Otherwise, it is not clear what the plural means exactly. 
> * VACUUM will normally skip pages for which the visibility map bit is set; > * such pages can't contain any dead tuples and therefore don't need vacuuming. > - * The visibility map is not used for anti-wraparound vacuums, because > + * The visibility map is not used for anti-wraparound vacuums before 9.5, because > * an anti-wraparound vacuum needs to freeze tuples and observe the latest xid > * present in the table, even on pages that don't have any dead tuples. > + * 9.6 or later, the visibility map has a additional bit which indicates all tuple > + * on single page has been completely forzen, so the visibility map is also used for > + * anti-wraparound vacuums. This should not mention database versions. Just explain how the code behaves today, not how it behaved in the past. Those who want to understand how it behaved in 9.5 can read the 9.5 code. (Again typo "forzen".) > @@ -1115,6 +1187,14 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats, > tups_vacuumed, vacuumed_pages))); > > /* > + * This information would be effective for how much effect all-frozen bit > + * of VM had for freezing tuples. > + */ > + ereport(elevel, > + (errmsg("Skipped %d frozen pages acoording to visibility map", > + vacrelstats->vmskipped_frozen_pages))); Message must start on lowercase letter. I don't understand what the comment means. Can you rephrase it? > @@ -1779,10 +1873,12 @@ vac_cmp_itemptr(const void *left, const void *right) > /* > * Check if every tuple in the given page is visible to all current and future > * transactions. Also return the visibility_cutoff_xid which is the highest > - * xmin amongst the visible tuples. > + * xmin amongst the visible tuples, and all_forzen which implies that all tuples > + * of this page are frozen. Typo "forzen" here again. > @@ -201,6 +239,110 @@ copy_file(const char *srcfile, const char *dstfile, bool force) > #endif > > > +/* > + * rewriteVisibilitymap() > + * > + * A additional bit which indicates that all tuples on page is completely > + * frozen is added into visibility map at PG 9.6. So the format of visibiilty > + * map has been changed. > + * Copies a visibility map file while adding all-frozen bit(0) into each bit. > + */ > +static const char * > +rewriteVisibilitymap(const char *fromfile, const char *tofile, bool force) > +{ > +#define REWRITE_BUF_SIZE (50 * BLCKSZ) > +#define BITS_PER_HEAPBLOCK 2 > + > + int src_fd, dst_fd; > + uint16 vm_bits; > + ssize_t nbytes; > + char *buffer; > + int ret = 0; > + int save_errno = 0; > + > + if ((fromfile == NULL) || (tofile == NULL)) > + { > + errno = EINVAL; > + return getErrorText(errno); > + } > + > + if ((src_fd = open(fromfile, O_RDONLY, 0)) < 0) > + return getErrorText(errno); > + > + if ((dst_fd = open(tofile, O_RDWR | O_CREAT | (force ? 0 : O_EXCL), S_IRUSR | S_IWUSR)) < 0) > + { > + save_errno = errno; > + if (src_fd != 0) > + close(src_fd); > + > + errno = save_errno; > + return getErrorText(errno); > + } > + > + buffer = (char *) pg_malloc(REWRITE_BUF_SIZE); > + > + /* Copy page header data in advance */ > + if ((nbytes = read(src_fd, buffer, MAXALIGN(SizeOfPageHeaderData))) <= 0) > + { > + save_errno = errno; > + return getErrorText(errno); > + } Not clear why you bother with save_errno in this path. Forgot to close()? (Though I wonder why you bother to close() if the program is going to exit shortly thereafter anyway.) 
> diff --git a/src/bin/pg_upgrade/pg_upgrade.h b/src/bin/pg_upgrade/pg_upgrade.h > index 13aa891..fc92a5f 100644 > --- a/src/bin/pg_upgrade/pg_upgrade.h > +++ b/src/bin/pg_upgrade/pg_upgrade.h > @@ -112,6 +112,11 @@ extern char *output_files[]; > #define VISIBILITY_MAP_CRASHSAFE_CAT_VER 201107031 > > /* > + * The format of visibility map changed with this 9.6 commit, > + * > + */ > +#define VISIBILITY_MAP_FROZEN_BIT_CAT_VER 201509181 Useless empty line in comment. > diff --git a/src/common/relpath.c b/src/common/relpath.c > index 66dfef1..52ff14e 100644 > --- a/src/common/relpath.c > +++ b/src/common/relpath.c > @@ -30,6 +30,9 @@ > * If you add a new entry, remember to update the errhint in > * forkname_to_number() below, and update the SGML documentation for > * pg_relation_size(). > + * 9.6 or later, the visibility map fork name is changed from "vm" to > + * "vfm" bacause visibility map has not only information about all-visible > + * but also information about all-frozen. > */ > const char *const forkNames[] = { > "main", /* MAIN_FORKNUM */ Drop the change in comment? There's no "vfm" in this version of the patch, is there? -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
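For context on the rewriteVisibilitymap() hunk reviewed above: the old map format stores one bit per heap page and the new one stores two, so each old map byte must expand into two new bytes, with every all-frozen bit starting out clear. Here is a self-contained sketch of just that bit-widening step, under the assumption that bit i of an old byte maps to bit 2*i of the widened pair and that the low half is written first; the real function also copies the page header and performs the file I/O.

    #include <stdio.h>
    #include <stdint.h>

    /* Widen one old-format VM byte (8 heap blocks x 1 bit) into a pair
     * of new-format bytes (4 heap blocks x 2 bits each).  Each set
     * all-visible bit keeps its block's slot; the paired all-frozen
     * bit starts as 0. */
    static uint16_t
    widen_vm_byte(uint8_t old)
    {
        uint16_t    out = 0;
        int         i;

        for (i = 0; i < 8; i++)
            if (old & (1 << i))
                out |= (uint16_t) 1 << (2 * i);
        return out;
    }

    int
    main(void)
    {
        uint8_t     old = 0xB1;     /* heap blocks 0, 4, 5, 7 all-visible */
        uint16_t    widened = widen_vm_byte(old);

        /* low byte covers old bits 0-3, high byte covers old bits 4-7 */
        printf("old byte 0x%02x -> new bytes 0x%02x 0x%02x\n",
               old, widened & 0xFF, (widened >> 8) & 0xFF);
        return 0;
    }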
On Sat, Oct 3, 2015 at 12:23 AM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote: > Masahiko Sawada wrote: > Thank you for taking time to review this feature. Attached the latest version patch (v15). >> @@ -2972,10 +2981,15 @@ l1: >> */ >> PageSetPrunable(page, xid); >> >> + /* clear PD_ALL_VISIBLE and PD_ALL_FORZEN flags */ > > Typo "FORZEN". Fixed. > >> if (PageIsAllVisible(page)) >> { >> all_visible_cleared = true; >> + >> + /* all-frozen information is also cleared at the same time */ >> PageClearAllVisible(page); >> + PageClearAllFrozen(page); > > I wonder if it makes sense to have a macro to clear both in unison, > which seems a very common pattern. > Fixed. > >> diff --git a/src/backend/access/heap/visibilitymap.c b/src/backend/access/heap/visibilitymap.c >> index 7c38772..a284b85 100644 >> --- a/src/backend/access/heap/visibilitymap.c >> +++ b/src/backend/access/heap/visibilitymap.c >> @@ -21,33 +21,45 @@ >> * >> * NOTES >> * >> - * The visibility map is a bitmap with one bit per heap page. A set bit means >> - * that all tuples on the page are known visible to all transactions, and >> - * therefore the page doesn't need to be vacuumed. The map is conservative in >> - * the sense that we make sure that whenever a bit is set, we know the >> - * condition is true, but if a bit is not set, it might or might not be true. >> + * The visibility map is a bitmap with two bits (all-visible and all-frozen) >> + * per heap page. A set all-visible bit means that all tuples on the page are >> + * known visible to all transactions, and therefore the page doesn't need to >> + * be vacuumed. A set all-frozen bit means that all tuples on the page are >> + * completely frozen, and therefore the page doesn't need to be vacuumed even >> + * if whole table scanning vacuum is required (e.g. anti-wraparound vacuum). >> + * A all-frozen bit must be set only when the page is already all-visible. >> + * That is, all-frozen bit is always set with all-visible bit. > > "A all-frozen" -> "The all-frozen" (but "A set all-xyz" is correct). Fixed. > >> * When we *set* a visibility map during VACUUM, we must write WAL. This may >> * seem counterintuitive, since the bit is basically a hint: if it is clear, >> - * it may still be the case that every tuple on the page is visible to all >> - * transactions; we just don't know that for certain. The difficulty is that >> - * there are two bits which are typically set together: the PD_ALL_VISIBLE bit >> - * on the page itself, and the visibility map bit. If a crash occurs after the >> - * visibility map page makes it to disk and before the updated heap page makes >> - * it to disk, redo must set the bit on the heap page. Otherwise, the next >> - * insert, update, or delete on the heap page will fail to realize that the >> - * visibility map bit must be cleared, possibly causing index-only scans to >> - * return wrong answers. >> + * it may still be the case that every tuple on the page is visible or frozen >> + * to all transactions; we just don't know that for certain. The difficulty is >> + * that there are two bits which are typically set together: the PD_ALL_VISIBLE >> + * or PD_ALL_FROZEN bit on the page itself, and the visibility map bit. If a >> + * crash occurs after the visibility map page makes it to disk and before the >> + * updated heap page makes it to disk, redo must set the bit on the heap page. 
>> + * Otherwise, the next insert, update, or delete on the heap page will fail to >> + * realize that the visibility map bit must be cleared, possibly causing index-only >> + * scans to return wrong answers. > > In the "The difficulty ..." para, I would add the word "corresponding" before > "visibility". Otherwise, it is not clear what the plural means exactly. Fixed. >> * VACUUM will normally skip pages for which the visibility map bit is set; >> * such pages can't contain any dead tuples and therefore don't need vacuuming. >> - * The visibility map is not used for anti-wraparound vacuums, because >> + * The visibility map is not used for anti-wraparound vacuums before 9.5, because >> * an anti-wraparound vacuum needs to freeze tuples and observe the latest xid >> * present in the table, even on pages that don't have any dead tuples. >> + * 9.6 or later, the visibility map has a additional bit which indicates all tuple >> + * on single page has been completely forzen, so the visibility map is also used for >> + * anti-wraparound vacuums. > > This should not mention database versions. Just explain how the code > behaves today, not how it behaved in the past. Those who want to > understand how it behaved in 9.5 can read the 9.5 code. (Again typo > "forzen".) Changed these comment. Sorry for the same typo frequently.. >> @@ -1115,6 +1187,14 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats, >> tups_vacuumed, vacuumed_pages))); >> >> /* >> + * This information would be effective for how much effect all-frozen bit >> + * of VM had for freezing tuples. >> + */ >> + ereport(elevel, >> + (errmsg("Skipped %d frozen pages acoording to visibility map", >> + vacrelstats->vmskipped_frozen_pages))); > > Message must start on lowercase letter. I don't understand what the > comment means. Can you rephrase it? Fixed. >> @@ -1779,10 +1873,12 @@ vac_cmp_itemptr(const void *left, const void *right) >> /* >> * Check if every tuple in the given page is visible to all current and future >> * transactions. Also return the visibility_cutoff_xid which is the highest >> - * xmin amongst the visible tuples. >> + * xmin amongst the visible tuples, and all_forzen which implies that all tuples >> + * of this page are frozen. > > Typo "forzen" here again. Fixed. >> @@ -201,6 +239,110 @@ copy_file(const char *srcfile, const char *dstfile, bool force) >> #endif >> >> >> +/* >> + * rewriteVisibilitymap() >> + * >> + * A additional bit which indicates that all tuples on page is completely >> + * frozen is added into visibility map at PG 9.6. So the format of visibiilty >> + * map has been changed. >> + * Copies a visibility map file while adding all-frozen bit(0) into each bit. >> + */ >> +static const char * >> +rewriteVisibilitymap(const char *fromfile, const char *tofile, bool force) >> +{ >> +#define REWRITE_BUF_SIZE (50 * BLCKSZ) >> +#define BITS_PER_HEAPBLOCK 2 >> + >> + int src_fd, dst_fd; >> + uint16 vm_bits; >> + ssize_t nbytes; >> + char *buffer; >> + int ret = 0; >> + int save_errno = 0; >> + >> + if ((fromfile == NULL) || (tofile == NULL)) >> + { >> + errno = EINVAL; >> + return getErrorText(errno); >> + } >> + >> + if ((src_fd = open(fromfile, O_RDONLY, 0)) < 0) >> + return getErrorText(errno); >> + >> + if ((dst_fd = open(tofile, O_RDWR | O_CREAT | (force ? 
0 : O_EXCL), S_IRUSR | S_IWUSR)) < 0) >> + { >> + save_errno = errno; >> + if (src_fd != 0) >> + close(src_fd); >> + >> + errno = save_errno; >> + return getErrorText(errno); >> + } >> + >> + buffer = (char *) pg_malloc(REWRITE_BUF_SIZE); >> + >> + /* Copy page header data in advance */ >> + if ((nbytes = read(src_fd, buffer, MAXALIGN(SizeOfPageHeaderData))) <= 0) >> + { >> + save_errno = errno; >> + return getErrorText(errno); >> + } > > Not clear why you bother with save_errno in this path. Forgot to > close()? (Though I wonder why you bother to close() if the program is > going to exit shortly thereafter anyway.) Fixed. >> diff --git a/src/bin/pg_upgrade/pg_upgrade.h b/src/bin/pg_upgrade/pg_upgrade.h >> index 13aa891..fc92a5f 100644 >> --- a/src/bin/pg_upgrade/pg_upgrade.h >> +++ b/src/bin/pg_upgrade/pg_upgrade.h >> @@ -112,6 +112,11 @@ extern char *output_files[]; >> #define VISIBILITY_MAP_CRASHSAFE_CAT_VER 201107031 >> >> /* >> + * The format of visibility map changed with this 9.6 commit, >> + * >> + */ >> +#define VISIBILITY_MAP_FROZEN_BIT_CAT_VER 201509181 > > Useless empty line in comment. Fixed. >> diff --git a/src/common/relpath.c b/src/common/relpath.c >> index 66dfef1..52ff14e 100644 >> --- a/src/common/relpath.c >> +++ b/src/common/relpath.c >> @@ -30,6 +30,9 @@ >> * If you add a new entry, remember to update the errhint in >> * forkname_to_number() below, and update the SGML documentation for >> * pg_relation_size(). >> + * 9.6 or later, the visibility map fork name is changed from "vm" to >> + * "vfm" bacause visibility map has not only information about all-visible >> + * but also information about all-frozen. >> */ >> const char *const forkNames[] = { >> "main", /* MAIN_FORKNUM */ > > Drop the change in comment? There's no "vfm" in this version of the > patch, is there? Fixed. Regards, -- Masahiko Sawada
Attachment
On Fri, Oct 2, 2015 at 11:23 AM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote: >> + /* all-frozen information is also cleared at the same time */ >> PageClearAllVisible(page); >> + PageClearAllFrozen(page); > > I wonder if it makes sense to have a macro to clear both in unison, > which seems a very common pattern. I think PageClearAllVisible should clear both, and there should be no other macro. There is no event that causes a page to cease being all-visible that does not also cause it to cease being all-frozen. You might think that deleting or locking a tuple would fall into that category - but nope, XMAX needs to be cleared or the tuple pruned, or there will be problems after wraparound. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
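To make the suggestion concrete, here is a minimal sketch of the macro pair this leaves behind (PD_ALL_FROZEN, PageSetAllFrozen and the pd_flags layout are names from the patch under review; the exact form is up to the patch):

    /* Setting all-frozen is a separate step, taken only while the page
     * is already all-visible. */
    #define PageSetAllFrozen(page) \
        (((PageHeader) (page))->pd_flags |= PD_ALL_FROZEN)

    /* Clearing all-visible always clears all-frozen too: per the argument
     * above, no event invalidates one without invalidating the other, so
     * no separate PageClearAllFrozen() is needed. */
    #define PageClearAllVisible(page) \
        (((PageHeader) (page))->pd_flags &= ~(PD_ALL_VISIBLE | PD_ALL_FROZEN))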
On Sat, Oct 3, 2015 at 3:41 AM, Robert Haas <robertmhaas@gmail.com> wrote: > On Fri, Oct 2, 2015 at 11:23 AM, Alvaro Herrera > <alvherre@2ndquadrant.com> wrote: >>> + /* all-frozen information is also cleared at the same time */ >>> PageClearAllVisible(page); >>> + PageClearAllFrozen(page); >> >> I wonder if it makes sense to have a macro to clear both in unison, >> which seems a very common pattern. > > I think PageClearAllVisible should clear both, and there should be no > other macro. There is no event that causes a page to cease being > all-visible that does not also cause it to cease being all-frozen. > You might think that deleting or locking a tuple would fall into that > category - but nope, XMAX needs to be cleared or the tuple pruned, or > there will be problems after wraparound. > Thank you for your advice. I understood. I changed the patch so that PageClearAllVisible clear both bits, and removed ClearAllFrozen. Attached the latest v16 patch which contains draft version documentation patch. Regards, -- Masahiko Sawada
Attachment
On Fri, Oct 2, 2015 at 8:14 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: >> +#define Anum_pg_class_relallfrozen 12 >> Why is pg_class.relallfrozen necessary? ISTM that there is no user of it now. > > The relallfrozen would be useful for user to estimate time to vacuum > freeze or anti-wrapping vacuum before being done them actually. > (Also this value is used on regression test.) > But this information is not used on planning like relallvisible, so it > would be good to move this information to another system view like > pg_stat_*_tables. Or make pgstattuple and pgstattuple_approx report even the number of frozen tuples? Regards, -- Fujii Masao
On 10 September 2015 at 01:58, Andres Freund <andres@anarazel.de> wrote:
On 2015-09-04 23:35:42 +0100, Simon Riggs wrote:
> This looks OK. You saw that I was proposing to solve this problem a
> different way ("Summary of plans to avoid the annoyance of Freezing"),
> suggesting that we wait for a few CFs to see if a patch emerges for that -
> then fall back to this patch if it doesn't? So I am moving this patch to
> next CF.
As noted on that other thread I don't think that's a good policy, and it
seems like Robert agrees with me. So I think we should move this back to
"Needs Review".
I also agree. Andres and I spoke at PostgresOpen and he persuaded me; I've just been away.
Am happy to review and commit in next few days/weeks, once I catch up on the thread.
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Mon, Oct 5, 2015 at 11:03 PM, Fujii Masao <masao.fujii@gmail.com> wrote: > On Fri, Oct 2, 2015 at 8:14 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: >>> +#define Anum_pg_class_relallfrozen 12 >>> Why is pg_class.relallfrozen necessary? ISTM that there is no user of it now. >> >> The relallfrozen would be useful for user to estimate time to vacuum >> freeze or anti-wrapping vacuum before being done them actually. >> (Also this value is used on regression test.) >> But this information is not used on planning like relallvisible, so it >> would be good to move this information to another system view like >> pg_stat_*_tables. > > Or make pgstattuple and pgstattuple_approx report even the number > of frozen tuples? > But we cannot know the number of frozen pages without installing the pageinspect module. I'm a bit concerned that not all projects can install extension modules into postgresql in production environments. I think we need to provide such a feature at least in core. Thought? Regards, -- Masahiko Sawada
On Mon, Oct 5, 2015 at 7:31 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: > On Sat, Oct 3, 2015 at 3:41 AM, Robert Haas <robertmhaas@gmail.com> wrote: >> On Fri, Oct 2, 2015 at 11:23 AM, Alvaro Herrera >> <alvherre@2ndquadrant.com> wrote: >>>> + /* all-frozen information is also cleared at the same time */ >>>> PageClearAllVisible(page); >>>> + PageClearAllFrozen(page); >>> >>> I wonder if it makes sense to have a macro to clear both in unison, >>> which seems a very common pattern. >> >> I think PageClearAllVisible should clear both, and there should be no >> other macro. There is no event that causes a page to cease being >> all-visible that does not also cause it to cease being all-frozen. >> You might think that deleting or locking a tuple would fall into that >> category - but nope, XMAX needs to be cleared or the tuple pruned, or >> there will be problems after wraparound. >> > > Thank you for your advice. > I understood. > > I changed the patch so that PageClearAllVisible clear both bits, and > removed ClearAllFrozen. > Attached the latest v16 patch which contains draft version documentation patch. Thanks for updating the patch! Here are another review comments. + ereport(elevel, + (errmsg("skipped %d frozen pages acoording to visibility map", + vacrelstats->vmskipped_frozen_pages))); Typo: acoording should be according. When vmskipped_frozen_pages is 1, "1 frozen pages" in log message sounds incorrect in terms of grammar. So probably errmsg_plural() should be used here. + relallvisible = visibilitymap_count(rel, VISIBILITYMAP_ALL_VISIBLE); + relallfrozen = visibilitymap_count(rel, VISIBILITYMAP_ALL_FROZEN); We can refactor visibilitymap_count() so that it counts the numbers of both all-visible and all-frozen tuples at the same time, in order to avoid reading through visibility map twice. heap_page_is_all_visible() can set all_frozen to TRUE even when it returns FALSE. This is odd because the page must not be all frozen when it's not all visible. heap_page_is_all_visible() should set all_frozen to FALSE whenever all_visible is set to FALSE? Probably it's better to forcibly set all_frozen to FALSE at the end of the function whenever all_visible is FALSE. + if (PageIsAllVisible(page)) { - Assert(BufferIsValid(*vmbuffer)); Why did you remove this assertion? + if (all_frozen) + { + PageSetAllFrozen(page); + flags |= VISIBILITYMAP_ALL_FROZEN; + } Why didn't you call visibilitymap_test() for all frozen case here? In visibilitymap_set(), the argument flag must be either (VISIBILITYMAP_ALL_VISIBLE | VISIBILITYMAP_ALL_FROZEN) or VISIBILITYMAP_ALL_VISIBLE. So I think that it's better to add Assert() which checks whether the specified flag is valid or not. + * caller is expected to set PD_ALL_VISIBLE or + * PD_ALL_FROZEN first. + */ + Assert(PageIsAllVisible(heapPage) || PageIsAllFrozen(heapPage)); This should be the following? Assert(((flag | VISIBILITYMAP_ALL_VISIBLE) && PageIsAllVisible(heapPage)) || ((flag | VISIBILITYMAP_ALL_FROZEN)&& PageIsAllFrozen(heapPage))); Regards, -- Fujii Masao
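Two of these suggestions can be sketched briefly, using the message text, counter, and flag names from the patch under review. errmsg_plural() takes singular and plural format strings plus the integer that selects between them, and the entry Assert() in visibilitymap_set() would pin down the only two flag combinations the map supports:

    /* Plural-aware message: the count both selects the form and fills %d. */
    ereport(elevel,
            (errmsg_plural("skipped %d frozen page according to visibility map",
                           "skipped %d frozen pages according to visibility map",
                           vacrelstats->vmskipped_frozen_pages,
                           vacrelstats->vmskipped_frozen_pages)));

    /* In visibilitymap_set(): only all-visible, or all-visible plus
     * all-frozen, are valid -- all-frozen is never set on its own. */
    Assert(flags == VISIBILITYMAP_ALL_VISIBLE ||
           flags == (VISIBILITYMAP_ALL_VISIBLE | VISIBILITYMAP_ALL_FROZEN));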
On Thu, Oct 8, 2015 at 7:03 PM, Fujii Masao <masao.fujii@gmail.com> wrote: > On Mon, Oct 5, 2015 at 7:31 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: >> On Sat, Oct 3, 2015 at 3:41 AM, Robert Haas <robertmhaas@gmail.com> wrote: >>> On Fri, Oct 2, 2015 at 11:23 AM, Alvaro Herrera >>> <alvherre@2ndquadrant.com> wrote: >>>>> + /* all-frozen information is also cleared at the same time */ >>>>> PageClearAllVisible(page); >>>>> + PageClearAllFrozen(page); >>>> >>>> I wonder if it makes sense to have a macro to clear both in unison, >>>> which seems a very common pattern. >>> >>> I think PageClearAllVisible should clear both, and there should be no >>> other macro. There is no event that causes a page to cease being >>> all-visible that does not also cause it to cease being all-frozen. >>> You might think that deleting or locking a tuple would fall into that >>> category - but nope, XMAX needs to be cleared or the tuple pruned, or >>> there will be problems after wraparound. >>> >> >> Thank you for your advice. >> I understood. >> >> I changed the patch so that PageClearAllVisible clear both bits, and >> removed ClearAllFrozen. >> Attached the latest v16 patch which contains draft version documentation patch. > > Thanks for updating the patch! Here are another review comments. > Thank you for reviewing! Attached the latest patch. > + ereport(elevel, > + (errmsg("skipped %d frozen pages acoording to visibility map", > + vacrelstats->vmskipped_frozen_pages))); > > Typo: acoording should be according. > > When vmskipped_frozen_pages is 1, "1 frozen pages" in log message > sounds incorrect in terms of grammar. So probably errmsg_plural() > should be used here. Thank you for your advice. Fixed. > + relallvisible = visibilitymap_count(rel, > VISIBILITYMAP_ALL_VISIBLE); > + relallfrozen = visibilitymap_count(rel, VISIBILITYMAP_ALL_FROZEN); > > We can refactor visibilitymap_count() so that it counts the numbers of > both all-visible and all-frozen tuples at the same time, in order to > avoid reading through visibility map twice. I agree. I've changed so. > heap_page_is_all_visible() can set all_frozen to TRUE even when > it returns FALSE. This is odd because the page must not be all frozen > when it's not all visible. heap_page_is_all_visible() should set > all_frozen to FALSE whenever all_visible is set to FALSE? > Probably it's better to forcibly set all_frozen to FALSE at the end of > the function whenever all_visible is FALSE. Fixed. > + if (PageIsAllVisible(page)) > { > - Assert(BufferIsValid(*vmbuffer)); > > Why did you remove this assertion? It's my mistake. Fixed. > + if (all_frozen) > + { > + PageSetAllFrozen(page); > + flags |= VISIBILITYMAP_ALL_FROZEN; > + } > > Why didn't you call visibilitymap_test() for all frozen case here? Same as above. Fixed. > In visibilitymap_set(), the argument flag must be either > (VISIBILITYMAP_ALL_VISIBLE | VISIBILITYMAP_ALL_FROZEN) or > VISIBILITYMAP_ALL_VISIBLE. So I think that it's better to add > Assert() which checks whether the specified flag is valid or not. I agree. I added Assert() to beginning of visibilitymap_set() function. > + * caller is expected to set PD_ALL_VISIBLE or > + * PD_ALL_FROZEN first. > + */ > + Assert(PageIsAllVisible(heapPage) || > PageIsAllFrozen(heapPage)); > > This should be the following? > > Assert(((flag | VISIBILITYMAP_ALL_VISIBLE) && PageIsAllVisible(heapPage)) || > ((flag | VISIBILITYMAP_ALL_FROZEN) && PageIsAllFrozen(heapPage))); I agree. Fixed. Regards, -- Masahiko Sawada
Attachment
On 1 October 2015 at 23:30, Josh Berkus <josh@agliodbs.com> wrote:
On 10/01/2015 07:43 AM, Robert Haas wrote:
> On Thu, Oct 1, 2015 at 9:44 AM, Fujii Masao <masao.fujii@gmail.com> wrote:
>> I wonder how much it's worth renaming only the file extension while
>> there are many places where "visibility map" and "vm" are used,
>> for example, log messages, function names, variables, etc.
>
> I'd be inclined to keep calling it the visibility map (vm) even if it
> also contains freeze information.
>
-1 to rename. Visibility Map is a perfectly good name.
The name can stay the same, but specifically the file extension should change.
This patch changes the layout of existing information:
* _vm stores one bit per page
* _$new stores two bits per page
The problem is we won't be able to tell the two formats apart, since they both are just lots of bits. So we won't be able to tell if the file is old format or new format, which could lead to loss of information that relates to visibility. If we think something is all-visible when it is not, this is effectively data corruption.
In light of lessons learned from multixactids, I think its important that we are able to tell the difference between an old format and a new format visibility map.
My suggestion to do so was to call it "vfm", to indicate that it is now a Visibility & Freeze Map.
I don't care if we change the name, but I do care if we can't tell the difference between a failed upgrade, a normal upgrade and a server that has been upgraded multiple times. Alternate suggestions welcome.
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
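For concreteness, a sketch of why the raw bytes are ambiguous, assuming the two-bits-per-heap-block layout the patch uses (BITS_PER_HEAPBLOCK 2 appears in its pg_upgrade rewrite code; the other macro names follow visibilitymap.c):

    /* With two bits per heap block, each map byte covers 4 heap blocks
     * instead of 8, so the same file contents decode differently under
     * the old and new formats -- nothing in the bytes says which applies. */
    #define BITS_PER_HEAPBLOCK    2
    #define HEAPBLOCKS_PER_BYTE   (BITS_PER_BYTE / BITS_PER_HEAPBLOCK)
    #define HEAPBLK_TO_MAPBYTE(x) (((x) % HEAPBLOCKS_PER_PAGE) / HEAPBLOCKS_PER_BYTE)
    #define HEAPBLK_TO_OFFSET(x)  (((x) % HEAPBLOCKS_PER_BYTE) * BITS_PER_HEAPBLOCK)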
On October 8, 2015 7:35:24 PM GMT+02:00, Simon Riggs <simon@2ndQuadrant.com> wrote: > The problem is we won't be able to tell the two formats apart, since they both are just lots of bits. So we won't be able to tell if the file is old format or new format, which could lead to loss of information that relates to visibility. I don't see the problem? I mean catversion will reliably tell you which format the vm is in? We could additionally use the opportunity to add a metapage, but that seems like an independent thing. Andres --- Please excuse brevity and formatting - I am writing this on my mobile phone.
On Thu, Oct 8, 2015 at 1:52 PM, Andres Freund <andres@anarazel.de> wrote: > I don't see the problem? I mean catversion will reliably tell you which format the vm is in? Totally agreed. > We could additionally use the opportunity to add a metapage, but that seems like an independent thing. I agree with that, too. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
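The shape of the gate being agreed on here is already visible in the patch's pg_upgrade changes (quoted later in this thread); roughly:

    /* The clusters' catalog versions alone tell us whether the old
     * visibility map predates the frozen bit and must be rewritten. */
    if (old_cluster.controldata.cat_ver < VISIBILITY_MAP_FROZEN_BIT_CAT_VER &&
        new_cluster.controldata.cat_ver >= VISIBILITY_MAP_FROZEN_BIT_CAT_VER)
        vm_rewrite_needed = true;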
On Sat, Oct 10, 2015 at 4:20 AM, Robert Haas <robertmhaas@gmail.com> wrote: > On Thu, Oct 8, 2015 at 1:52 PM, Andres Freund <andres@anarazel.de> wrote: >> I don't see the problem? I mean catversion will reliably tell you which format the vm is in? > Totally agreed. > >> We could additionally use the opportunity to add a metapage, but that seems like an independent thing. > I agree with that, too. > The attached updated v18 patch fixes some bugs. Please review the patch. Regards, -- Masahiko Sawada
Attachment
On 9 October 2015 at 15:20, Robert Haas <robertmhaas@gmail.com> wrote:
On Thu, Oct 8, 2015 at 1:52 PM, Andres Freund <andres@anarazel.de> wrote:
> I don't see the problem? I mean catversion will reliably tell you which format the vm is in?
Totally agreed.
This isn't an agreement competition; it's a cool look at what might cause problems for all of us.
If we want to avoid bugs in the future then we'd better start acting like that is actually true in practice.
Why should we wave away this concern? Will we wave away a concern next time you personally raise one? Bruce would have me believe that we added months onto 9.5 to improve robustness. So let's actually do that, starting at the first opportunity.
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 2015-10-20 20:35:31 -0400, Simon Riggs wrote: > On 9 October 2015 at 15:20, Robert Haas <robertmhaas@gmail.com> wrote: > > > On Thu, Oct 8, 2015 at 1:52 PM, Andres Freund <andres@anarazel.de> wrote: > > > I don't see the problem? I mean catversion will reliably tell you which > > format the vm is in? > > > > Totally agreed. > > > > This isn't an agreement competition; it's a cool look at what might cause > problems for all of us. Uh, we form rough consensuses all the time. > If we want to avoid bugs in the future then we'd better start acting like that > is actually true in practice. > Why should we wave away this concern? Will we wave away a concern next time > you personally raise one? Bruce would have me believe that we added months > onto 9.5 to improve robustness. So let's actually do that, starting at the > first opportunity. Meh. Adding complexity definitely needs to be weighed against the benefits. As pointed out e.g. by all the multixact issues you mentioned upthread. In this case your argument for changing the name doesn't seem to hold much water. Greetings, Andres Freund
On 10/21/15 8:11 AM, Andres Freund wrote: > Meh. Adding complexity definitely needs to be weighed against the > benefits. As pointed out e.g. by all the multixact issues you mentioned > upthread. In this case your argument for changing the name doesn't seem > to hold much water. ISTM VISIBILITY_MAP_FROZEN_BIT_CAT_VER shold be defined in catversion.h instead of pg_upgrade.h though, to ensure it's correctly updated when this gets committed though. -- Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX Experts in Analytics, Data Architecture and PostgreSQL Data in Trouble? Get it in Treble! http://BlueTreble.com
Jim Nasby wrote: > On 10/21/15 8:11 AM, Andres Freund wrote: > >Meh. Adding complexity definitely needs to be weighed against the > >benefits. As pointed out e.g. by all the multixact issues you mentioned > >upthread. In this case your argument for changing the name doesn't seem > >to hold much water. > > ISTM VISIBILITY_MAP_FROZEN_BIT_CAT_VER shold be defined in catversion.h > instead of pg_upgrade.h though, to ensure it's correctly updated when this > gets committed though. That would be untidy and pointless. pg_upgrade.h contains other catversions. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 21.10.2015 02:05, Masahiko Sawada wrote: > On Sat, Oct 10, 2015 at 4:20 AM, Robert Haas <robertmhaas@gmail.com> wrote: >> On Thu, Oct 8, 2015 at 1:52 PM, Andres Freund <andres@anarazel.de> wrote: >>> I don't see the problem? I mean catversion will reliably tell you which format the vm is in? >> >> Totally agreed. >> >>> We could additionally use the opportunity to as a metapage, but that seems like an independent thing. >> >> I agree with that, too. >> > > Attached the updated v18 patch fixes some bugs. > Please review the patch. I've just checked the comments: File: /doc/src/sgml/catalogs.sgml + Number of pages that are marked all-frozen in the tables's Should be: + Number of pages that are marked all-frozen in the tables + <command>ANALYZE</command>, and a few DDL coomand such as Should be: + <command>ANALYZE</command>, and a few DDL command such as File: doc/src/sgml/maintenance.sgml + When the all pages of table are eventually marked as frozen by <command>VACUUM</>, Should be: + When all pages of the table are eventually marked as frozen by <command>VACUUM</>, File: /src/backend/access/heap/visibilitymap.c + * visibility map bit. Then, we lock the buffer. But this creates a race Should be: + * visibility map bit. Than we lock the buffer. But this creates a race + * buffer, the PD_ALL_VISIBLE or PD_ALL_FROZEN bit gets set. If that happens, Should be: + * buffer, the PD_ALL_VISIBLE or PD_ALL_FROZEN bit gets set. If that happens, (Remove duplicate white space before if) Please note i'm not a native speaker. There is a good chance that i am wrong ;) Greetings, Torsten
On Thu, Oct 22, 2015 at 4:11 PM, Torsten Zühlsdorff <mailinglists@toco-domains.de> wrote: > On 21.10.2015 02:05, Masahiko Sawada wrote: >> >> On Sat, Oct 10, 2015 at 4:20 AM, Robert Haas <robertmhaas@gmail.com> >> wrote: >>> >>> On Thu, Oct 8, 2015 at 1:52 PM, Andres Freund <andres@anarazel.de> wrote: >>>> >>>> I don't see the problem? I mean catversion will reliably tell you which >>>> format the vm is in? >>> >>> >>> Totally agreed. >>> >>>> We could additionally use the opportunity to as a metapage, but that >>>> seems like an independent thing. >>> >>> >>> I agree with that, too. >>> >> >> Attached the updated v18 patch fixes some bugs. >> Please review the patch. > > > I've just checked the comments: Thank you for taking the time to review this patch. Attached updated patch(v19). > File: /doc/src/sgml/catalogs.sgml > > + Number of pages that are marked all-frozen in the tables's > Should be: > + Number of pages that are marked all-frozen in the tables I changed it as follows. + Number of pages that are marked all-frozen in the table's The similar sentence of relallvisible is exist. > + <command>ANALYZE</command>, and a few DDL coomand such as > Should be: > + <command>ANALYZE</command>, and a few DDL command such as Fixed. > File: doc/src/sgml/maintenance.sgml > > + When the all pages of table are eventually marked as frozen by > <command>VACUUM</>, > Should be: > + When all pages of the table are eventually marked as frozen by > <command>VACUUM</>, Fixed. > File: /src/backend/access/heap/visibilitymap.c > > + * visibility map bit. Then, we lock the buffer. But this creates a race > Should be: > + * visibility map bit. Than we lock the buffer. But this creates a race I didn't change this sentence actually. so kept it. > + * buffer, the PD_ALL_VISIBLE or PD_ALL_FROZEN bit gets set. If that > happens, > Should be: > + * buffer, the PD_ALL_VISIBLE or PD_ALL_FROZEN bit gets set. If that > happens, > (Remove duplicate white space before if) The other sentence seems to have double white space after period. I kept it. Please review it. Regards, -- Masahiko Sawada
Attachment
On Mon, Oct 5, 2015 at 9:53 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Mon, Oct 5, 2015 at 11:03 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
> > On Fri, Oct 2, 2015 at 8:14 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >>> +#define Anum_pg_class_relallfrozen 12
> >>> Why is pg_class.relallfrozen necessary? ISTM that there is no user of it now.
> >>
> >> The relallfrozen would be useful for user to estimate time to vacuum
> >> freeze or anti-wrapping vacuum before being done them actually.
> >> (Also this value is used on regression test.)
> >> But this information is not used on planning like relallvisible, so it
> >> would be good to move this information to another system view like
> >> pg_stat_*_tables.
> >
> > Or make pgstattuple and pgstattuple_approx report even the number
> > of frozen tuples?
> >
>
> But we cannot know the number of frozen pages without installing the
> pageinspect module.
> I'm a bit concerned that not all projects can install
> extension modules into postgresql in production environments.
> I think we need to provide such a feature at least in core.
>
I think we can display information about relallfrozen in pg_stat_*_tables
as suggested by you. It doesn't make much sense to keep it in pg_class
unless we have some use case for it.
On Sat, Oct 24, 2015 at 10:59 AM, Amit Kapila <amit.kapila16@gmail.com> wrote: > On Mon, Oct 5, 2015 at 9:53 PM, Masahiko Sawada <sawada.mshk@gmail.com> > wrote: >> >> On Mon, Oct 5, 2015 at 11:03 PM, Fujii Masao <masao.fujii@gmail.com> >> wrote: >> > On Fri, Oct 2, 2015 at 8:14 PM, Masahiko Sawada <sawada.mshk@gmail.com> >> > wrote: >> >>> +#define Anum_pg_class_relallfrozen 12 >> >>> Why is pg_class.relallfrozen necessary? ISTM that there is no user of >> >>> it now. >> >> >> >> The relallfrozen would be useful for user to estimate time to vacuum >> >> freeze or anti-wrapping vacuum before being done them actually. >> >> (Also this value is used on regression test.) >> >> But this information is not used on planning like relallvisible, so it >> >> would be good to move this information to another system view like >> >> pg_stat_*_tables. >> > >> > Or make pgstattuple and pgstattuple_approx report even the number >> > of frozen tuples? >> > >> >> But we cannot know the number of frozen pages without installation of >> pageinspect module. >> I'm a bit concerned about that the all projects cannot install >> extentension module into postgresql on production environment. >> I think we need to provide such feature at least into core. >> > > I think we can display information about relallfrozen it in pg_stat_*_tables > as suggested by you. It doesn't make much sense to keep it in pg_class > unless we have some usecase for the same. > I'm thinking a bit about implementing the read-only table that is restricted to update/delete and is ensured that whole table is frozen, if this feature is committed. The value of relallfrozen might be useful for such feature. Regards, -- Masahiko Sawada
On Sat, Oct 24, 2015 at 2:24 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Sat, Oct 24, 2015 at 10:59 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > I think we can display information about relallfrozen in pg_stat_*_tables
> > as suggested by you. It doesn't make much sense to keep it in pg_class
> > unless we have some use case for it.
> >
>
> I'm thinking a bit about implementing the read-only table that is
> restricted to update/delete and is ensured that whole table is frozen,
> if this feature is committed.
> The value of relallfrozen might be useful for such feature.
>
If we need this for the read-only table feature, then it would be better to
add it after discussing the design of that feature. It doesn't seem
advisable to have an extra field in a system table which we might
need only for a feature that has not yet been fully discussed.
Review Comments:
-------------------------------
1.
/*
- * Find buffer to insert this tuple into. If the page is all visible,
- * this will also pin the requisite visibility map page.
+ * Find buffer to insert this tuple into. If the page is all visible
+ * or all frozen, this will also pin the requisite visibility map and
+ * frozen map page.
*/
buffer = RelationGetBufferForTuple(relation, heaptup->t_len,
InvalidBuffer, options, bistate,
I think it is sufficient to say in the end 'visibility map page'.
Let's not include 'frozen map page'.
2.
+ * corresponding page has been completely frozen, so the visibility map is also
+ * used for anti-wraparound vacuum, even if freezing tuples is required.
/all tuple/all tuples
/freezing tuples/freezing of tuples
3.
- * Are all tuples on heapBlk visible to all, according to the visibility map?
+ * Are all tuples on heapBlk visible or frozen to all, according to the visibility map?
I think it is better to modify the above statement as:
Are all tuples on heapBlk visible to all or are marked as frozen, according
to the visibility map?
4.
+ * releasing *buf after it's done testing and setting bits, and must set flags
+ * which indicates what flag we want to test.
Here, are you talking about the flags passed to visibilitymap_set()? If
yes, then the above comment is not clear; how about:
and must pass flags
for which it needs to check the value in visibility map.
5.
+ * both how many pages we skipped according to all-frozen bit of visibility
+ * map and how many pages we freeze page, so we can update relfrozenxid if
In above sentence word 'page' after freeze sounds redundant.
/we freeze page/we freeze
Another suggestion:
/sum of them/sum of two
6.
+ * This block is at least all-visible according to visibility map.
+ * We check whehter this block is all-frozen or not, to skip to
"whehter" is mis-spelled.
7.
+ * If we froze any tuples or any tuples are already frozen,
+ * mark the buffer dirty, and write a WAL record recording the changes.
Here, I think the WAL record is written only when we mark some
tuple(s) as frozen, not if they are already frozen,
so in that regard, I think the above comment is wrong.
8.
+ /*
+ * We cant't allow upgrading with link mode between 9.5 or before and 9.6 or later,
+ * because the format of visibility map has been changed on version 9.6.
+ */
a. /cant't/can't
b. /changed on version 9.6/changed in version 9.6
c. Won't such a change need to be updated in the pg_upgrade
documentation (Notes section)?
9.
@@ -180,6 +181,13 @@ transfer_single_new_db(pageCnvCtx *pageConverter,
new_cluster.controldata.cat_ver >= VISIBILITY_MAP_CRASHSAFE_CAT_VER)
vm_crashsafe_match = false;
+
+ /*
+ * Do we need to rewrite visibilitymap?
+ */
+ if (old_cluster.controldata.cat_ver < VISIBILITY_MAP_FROZEN_BIT_CAT_VER &&
+ new_cluster.controldata.cat_ver >= VISIBILITY_MAP_FROZEN_BIT_CAT_VER)
+ vm_rewrite_needed = true;
..
@@ -276,7 +285,15 @@ transfer_relfile(pageCnvCtx *pageConverter, FileNameMap *map,
{
pg_log(PG_VERBOSE, "copying \"%s\" to \"%s\"\n", old_file, new_file);
- if ((msg = copyAndUpdateFile(pageConverter, old_file, new_file, true)) != NULL)
+ /*
+ * Do we need to rewrite visibilitymap?
+ */
+ if (strcmp(type_suffix, "_vm") == 0 &&
+ old_cluster.controldata.cat_ver < VISIBILITY_MAP_FROZEN_BIT_CAT_VER &&
+ new_cluster.controldata.cat_ver >= VISIBILITY_MAP_FROZEN_BIT_CAT_VER)
+ rewrite_vm = true;
Instead of doing the re-check in transfer_relfile(), I think it is better
to pass an additional parameter to this function (see the sketch after
these comments).
10.
You have mentioned up-thread that you have changed the patch so that
PageClearAllVisible clears both bits; can you please point me to this
change?
Basically after applying the patch, I see below code in bufpage.h:
#define PageClearAllVisible(page) \
(((PageHeader) (page))->pd_flags &= ~PD_ALL_VISIBLE)
Don't we need to clear the PD_ALL_FROZEN separately?
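As referenced in comment 9, a sketch of the extra-parameter shape, under the assumption that the caller computes the catversion decision once (the bool parameter name here is hypothetical):

    /* Hypothetical signature: the caller passes the already-computed
     * catversion decision instead of this function re-deriving it. */
    static void
    transfer_relfile(pageCnvCtx *pageConverter, FileNameMap *map,
                     const char *type_suffix, bool vm_must_rewrite)
    {
        /* Rewrite only visibility-map forks, and only when the upgrade
         * crosses the format-change catversion. */
        bool rewrite_vm = vm_must_rewrite && strcmp(type_suffix, "_vm") == 0;
        ...
    }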
On Wed, Oct 28, 2015 at 12:58 PM, Amit Kapila <amit.kapila16@gmail.com> wrote: > On Sat, Oct 24, 2015 at 2:24 PM, Masahiko Sawada <sawada.mshk@gmail.com> > wrote: >> >> On Sat, Oct 24, 2015 at 10:59 AM, Amit Kapila <amit.kapila16@gmail.com> >> wrote: >> > >> > I think we can display information about relallfrozen it in >> > pg_stat_*_tables >> > as suggested by you. It doesn't make much sense to keep it in pg_class >> > unless we have some usecase for the same. >> > >> >> I'm thinking a bit about implementing the read-only table that is >> restricted to update/delete and is ensured that whole table is frozen, >> if this feature is committed. >> The value of relallfrozen might be useful for such feature. >> Thank you for reviewing! > If we need this for read-only table feature, then better lets add that > after discussing the design of that feature. It doesn't seem to be > advisable to have an extra field in system table which we might > need in yet not completely-discussed feature. I changed it so that the number of frozen pages is stored in pg_stat_all_tables as statistics information. Also, the tests related to counting all-visible bit and skipping vacuum are added to visibility map test, and the test related to counting all-frozen is added to stats collector test. Attached updated v20 patch. > Review Comments: > ------------------------------- > 1. > /* > - * Find buffer to insert this tuple into. If the page is all visible, > - * this will also pin > the requisite visibility map page. > + * Find buffer to insert this tuple into. If the page is all > visible > + * or all frozen, this will also pin the requisite visibility map and > + * frozen map page. > > */ > buffer = RelationGetBufferForTuple(relation, heaptup->t_len, > > InvalidBuffer, options, bistate, > > > I think it is sufficient to say in the end 'visibility map page'. > Let's not include 'frozen map page'. Fixed. > > 2. > + * corresponding page has been completely frozen, so the visibility map is > also > + * used for anti-wraparound > vacuum, even if freezing tuples is required. > > /all tuple/all tuples > /freezing tuples/freezing of tuples Fixed. > 3. > - * Are all tuples on heapBlk visible to all, according to the visibility > map? > + * Are all tuples on heapBlk > visible or frozen to all, according to the visibility map? > > I think it is better to modify the above statement as: > Are all tuples on heapBlk visible to all or are marked as frozen, according > to the visibility map? Fixed. > 4. > + * releasing *buf after it's done testing and setting bits, and must set > flags > + * which indicates what flag > we want to test. > > Here are you talking about the flags passed to visibilitymap_set(), if > yes, then above comment is not clear, how about: > > and must pass flags > for which it needs to check the value in visibility map. Fixed. > 5. > + * both how many pages we skipped according to all-frozen bit of visibility > + * map and how many > pages we freeze page, so we can update relfrozenxid if > > In above sentence word 'page' after freeze sounds redundant. > /we freeze page/we freeze > > Another suggestion: > /sum of them/sum of two Fixed. > 6. > + * This block is at least all-visible according to visibility map. > + > * We check whehter this block is all-frozen or not, to skip to > > whether is mis-spelled Fixed. > 7. > + * If we froze any tuples or any tuples are already frozen, > + * mark the buffer > dirty, and write a WAL record recording the changes. 
> > Here, I think WAL record is written only when we mark some > tuple/'s as frozen not if we they are already frozen, > so in that regard, I think above comment is wrong. It's wrong. Fixed. > 8. > + /* > + * We cant't allow upgrading with link mode between 9.5 or before and 9.6 > or later, > + * > because the format of visibility map has been changed on version 9.6. > + */ > > > a. /cant't/can't > b. changed on version 9.6/changed in version 9.6 > b. Won't such a change needs to be updated in pg_upgrade > documentation (Notes Section)? Fixed. And updated document. > 9. > @@ -180,6 +181,13 @@ transfer_single_new_db(pageCnvCtx *pageConverter, > > new_cluster.controldata.cat_ver >= VISIBILITY_MAP_CRASHSAFE_CAT_VER) > vm_crashsafe_match = false; > > + > /* > + * Do we need to rewrite visibilitymap? > + */ > + if (old_cluster.controldata.cat_ver < > VISIBILITY_MAP_FROZEN_BIT_CAT_VER && > + new_cluster.controldata.cat_ver >= > VISIBILITY_MAP_FROZEN_BIT_CAT_VER) > + vm_rewrite_needed = true; > > .. > > @@ -276,7 +285,15 @@ transfer_relfile(pageCnvCtx *pageConverter, FileNameMap > *map, > { > > pg_log(PG_VERBOSE, "copying \"%s\" to \"%s\"\n", old_file, new_file); > > - if ((msg = > copyAndUpdateFile(pageConverter, old_file, new_file, true)) != NULL) > + /* > + > * Do we need to rewrite visibilitymap? > + */ > + if (strcmp > (type_suffix, "_vm") == 0 && > + old_cluster.controldata.cat_ver < > VISIBILITY_MAP_FROZEN_BIT_CAT_VER && > + new_cluster.controldata.cat_ver >= > VISIBILITY_MAP_FROZEN_BIT_CAT_VER) > + rewrite_vm = true; > > Instead of doing re-check in transfer_relfile(), I think it is better > to pass an additional parameter in this function. I agree. Fixed. > > 10. > You have mentioned up-thread that, you have changed the patch so that > PageClearAllVisible clear both bits, can you please point me to this > change? > Basically after applying the patch, I see below code in bufpage.h: > #define PageClearAllVisible(page) \ > (((PageHeader) (page))->pd_flags &= ~PD_ALL_VISIBLE) > > Don't we need to clear the PD_ALL_FROZEN separately? Previous patch is wrong. PageClearAllVisible() should be; #define PageClearAllVisible(page) \ (((PageHeader) (page))->pd_flags &= ~(PD_ALL_VISIBLE | PD_ALL_FROZEN)) The all-frozen flag/bit is cleared only by modifying page, so it is impossible that only all-frozen flags/bit is cleared. The clearing of all-visible flag/bit also means that the page has some garbage, and is needed to vacuum. Regards, -- Masahiko Sawada
Attachment
On Fri, Oct 30, 2015 at 1:26 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: > On Wed, Oct 28, 2015 at 12:58 PM, Amit Kapila <amit.kapila16@gmail.com> wrote: >> On Sat, Oct 24, 2015 at 2:24 PM, Masahiko Sawada <sawada.mshk@gmail.com> >> wrote: >>> >>> On Sat, Oct 24, 2015 at 10:59 AM, Amit Kapila <amit.kapila16@gmail.com> >>> wrote: >>> > >>> > I think we can display information about relallfrozen it in >>> > pg_stat_*_tables >>> > as suggested by you. It doesn't make much sense to keep it in pg_class >>> > unless we have some usecase for the same. >>> > >>> >>> I'm thinking a bit about implementing the read-only table that is >>> restricted to update/delete and is ensured that whole table is frozen, >>> if this feature is committed. >>> The value of relallfrozen might be useful for such feature. >>> > > Thank you for reviewing! > >> If we need this for read-only table feature, then better lets add that >> after discussing the design of that feature. It doesn't seem to be >> advisable to have an extra field in system table which we might >> need in yet not completely-discussed feature. > > I changed it so that the number of frozen pages is stored in > pg_stat_all_tables as statistics information. > Also, the tests related to counting all-visible bit and skipping > vacuum are added to visibility map test, and the test related to > counting all-frozen is added to stats collector test. > > Attached updated v20 patch. > >> Review Comments: >> ------------------------------- >> 1. >> /* >> - * Find buffer to insert this tuple into. If the page is all visible, >> - * this will also pin >> the requisite visibility map page. >> + * Find buffer to insert this tuple into. If the page is all >> visible >> + * or all frozen, this will also pin the requisite visibility map and >> + * frozen map page. >> >> */ >> buffer = RelationGetBufferForTuple(relation, heaptup->t_len, >> >> InvalidBuffer, options, bistate, >> >> >> I think it is sufficient to say in the end 'visibility map page'. >> Let's not include 'frozen map page'. > > Fixed. > >> >> 2. >> + * corresponding page has been completely frozen, so the visibility map is >> also >> + * used for anti-wraparound >> vacuum, even if freezing tuples is required. >> >> /all tuple/all tuples >> /freezing tuples/freezing of tuples > > Fixed. > >> 3. >> - * Are all tuples on heapBlk visible to all, according to the visibility >> map? >> + * Are all tuples on heapBlk >> visible or frozen to all, according to the visibility map? >> >> I think it is better to modify the above statement as: >> Are all tuples on heapBlk visible to all or are marked as frozen, according >> to the visibility map? > > Fixed. > >> 4. >> + * releasing *buf after it's done testing and setting bits, and must set >> flags >> + * which indicates what flag >> we want to test. >> >> Here are you talking about the flags passed to visibilitymap_set(), if >> yes, then above comment is not clear, how about: >> >> and must pass flags >> for which it needs to check the value in visibility map. > > Fixed. > >> 5. >> + * both how many pages we skipped according to all-frozen bit of visibility >> + * map and how many >> pages we freeze page, so we can update relfrozenxid if >> >> In above sentence word 'page' after freeze sounds redundant. >> /we freeze page/we freeze >> >> Another suggestion: >> /sum of them/sum of two > > Fixed. > >> 6. >> + * This block is at least all-visible according to visibility map. 
>> + >> * We check whehter this block is all-frozen or not, to skip to >> >> whether is mis-spelled > > Fixed. > >> 7. >> + * If we froze any tuples or any tuples are already frozen, >> + * mark the buffer >> dirty, and write a WAL record recording the changes. >> >> Here, I think WAL record is written only when we mark some >> tuple/'s as frozen not if we they are already frozen, >> so in that regard, I think above comment is wrong. > > It's wrong. > Fixed. > >> 8. >> + /* >> + * We cant't allow upgrading with link mode between 9.5 or before and 9.6 >> or later, >> + * >> because the format of visibility map has been changed on version 9.6. >> + */ >> >> >> a. /cant't/can't >> b. changed on version 9.6/changed in version 9.6 >> b. Won't such a change needs to be updated in pg_upgrade >> documentation (Notes Section)? > > Fixed. > And updated document. > >> 9. >> @@ -180,6 +181,13 @@ transfer_single_new_db(pageCnvCtx *pageConverter, >> >> new_cluster.controldata.cat_ver >= VISIBILITY_MAP_CRASHSAFE_CAT_VER) >> vm_crashsafe_match = false; >> >> + >> /* >> + * Do we need to rewrite visibilitymap? >> + */ >> + if (old_cluster.controldata.cat_ver < >> VISIBILITY_MAP_FROZEN_BIT_CAT_VER && >> + new_cluster.controldata.cat_ver >= >> VISIBILITY_MAP_FROZEN_BIT_CAT_VER) >> + vm_rewrite_needed = true; >> >> .. >> >> @@ -276,7 +285,15 @@ transfer_relfile(pageCnvCtx *pageConverter, FileNameMap >> *map, >> { >> >> pg_log(PG_VERBOSE, "copying \"%s\" to \"%s\"\n", old_file, new_file); >> >> - if ((msg = >> copyAndUpdateFile(pageConverter, old_file, new_file, true)) != NULL) >> + /* >> + >> * Do we need to rewrite visibilitymap? >> + */ >> + if (strcmp >> (type_suffix, "_vm") == 0 && >> + old_cluster.controldata.cat_ver < >> VISIBILITY_MAP_FROZEN_BIT_CAT_VER && >> + new_cluster.controldata.cat_ver >= >> VISIBILITY_MAP_FROZEN_BIT_CAT_VER) >> + rewrite_vm = true; >> >> Instead of doing re-check in transfer_relfile(), I think it is better >> to pass an additional parameter in this function. > > I agree. > Fixed. > >> >> 10. >> You have mentioned up-thread that, you have changed the patch so that >> PageClearAllVisible clear both bits, can you please point me to this >> change? >> Basically after applying the patch, I see below code in bufpage.h: >> #define PageClearAllVisible(page) \ >> (((PageHeader) (page))->pd_flags &= ~PD_ALL_VISIBLE) >> >> Don't we need to clear the PD_ALL_FROZEN separately? > > Previous patch is wrong. PageClearAllVisible() should be; > #define PageClearAllVisible(page) \ > (((PageHeader) (page))->pd_flags &= ~(PD_ALL_VISIBLE | PD_ALL_FROZEN)) > > The all-frozen flag/bit is cleared only by modifying page, so it is > impossible that only all-frozen flags/bit is cleared. > The clearing of all-visible flag/bit also means that the page has some > garbage, and is needed to vacuum. > v20 patch has a bug in result of regression test. Attached updated v21 patch. Regards, -- Masahiko Sawada
Attachment
On Thu, Oct 8, 2015 at 11:05 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
>
> On 1 October 2015 at 23:30, Josh Berkus <josh@agliodbs.com> wrote:
>>
>> On 10/01/2015 07:43 AM, Robert Haas wrote:
>> > On Thu, Oct 1, 2015 at 9:44 AM, Fujii Masao <masao.fujii@gmail.com> wrote:
>> >> I wonder how much it's worth renaming only the file extension while
>> >> there are many places where "visibility map" and "vm" are used,
>> >> for example, log messages, function names, variables, etc.
>> >
>> > I'd be inclined to keep calling it the visibility map (vm) even if it
>> > also contains freeze information.
>> >
>
What is your main worry about changing the name of this map: is it
about more code churn, about the risk of introducing new issues, or
about people already being accustomed to calling this map the
visibility map?
>>
>> -1 to rename. Visibility Map is a perfectly good name.
>
>
> The name can stay the same, but specifically the file extension should change.
>
It seems to me quite logical for understanding purposes as well. Any new
person who wants to work in this area, or is looking into it, will always
wonder why this map is named the visibility map even though it contains
information about the visibility of a page as well as its frozen state. So
even though it doesn't make any difference to the correctness of the feature
whether we retain the current name or change it to Visibility & Freeze Map
(aka vfm), I think it makes sense to change it for the sake of maintenance
of this code.
On Fri, Oct 30, 2015 at 6:03 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
>
>
> v20 patch has a bug in result of regression test.
> Attached updated v21 patch.
>
A couple more review comments:
------------------------------------------------------
1.
@@ -615,6 +617,8 @@ typedef struct PgStat_StatTabEntry
PgStat_Counter n_dead_tuples;
PgStat_Counter changes_since_analyze;
+ int32 n_frozen_pages;
+
PgStat_Counter blocks_fetched;
PgStat_Counter blocks_hit;
As you are changing the above structure, you need to update
PGSTAT_FILE_FORMAT_ID; refer to the code below:
#define PGSTAT_FILE_FORMAT_ID 0x01A5BC9D
2. It seems that n_frozen_page is not initialized/updated properly
for toast tables:
Try with below steps:
postgres=# create table t4(c1 int, c2 text);
CREATE TABLE
postgres=# select oid, relname from pg_class where relname like '%t4%';
oid | relname
-------+---------
16390 | t4
(1 row)
postgres=# select oid, relname from pg_class where relname like '%16390%';
oid | relname
-------+----------------------
16393 | pg_toast_16390
16395 | pg_toast_16390_index
(2 rows)
postgres=# select relname,seq_scan,n_tup_ins,last_vacuum,n_frozen_page from pg_s
tat_all_tables where relname like '%16390%';
relname | seq_scan | n_tup_ins | last_vacuum | n_frozen_page
----------------+----------+-----------+-------------+---------------
pg_toast_16390 | 1 | 0 | | -842150451
(1 row)
Note that I have tested the above scenario on my Windows 7 machine.
3.
* visibilitymap.c
* bitmap for tracking visibility of heap tuples
I think above needs to be changed to:
bitmap for tracking visibility and frozen state of heap tuples
4.
a.
/*
- * If we froze any tuples, mark the buffer dirty, and write a WAL
- * record recording the changes. We must log the changes to be
- * crash-safe against future truncation of CLOG.
+ * If we froze any tuples then we mark the buffer dirty, and write a WAL
b.
- * Release any remaining pin on visibility map page.
+ * Release any remaining pin on visibility map.
c.
* We do update relallvisible even in the corner case, since if the table
- * is all-visible we'd definitely like to know that. But clamp the value
- * to be not more than what we're setting relpages to.
+ * is all-visible we'd definitely like to know that.
+ * But clamp the value to be not more than what we're setting relpages to.
I don't think you need to change above comments.
5.
+ * Even if scan_all is set so far, we could skip to scan some pages
+ * according by all-frozen bit of visibility amp.
/according by/according to
/amp/map
I suggest modifying the comment as below:
During full scan, we could skip some pages according to all-frozen
bit of visibility map.
Also, there is no need to start this on a new line; start from where the
previous line of the comment ends.
6.
/*
* lazy_scan_heap() -- scan an open heap relation
*
* This routine prunes each page in the heap, which will among other
* things truncate dead tuples to dead line pointers, defragment the
* page, and set commit status bits (see heap_page_prune). It also builds
* lists of dead tuples and pages with free space, calculates statistics
* on the number of live tuples in the heap, and marks pages as
* all-visible if appropriate.
Modify above function header as:
all-visible, all-frozen
7.
lazy_scan_heap()
{
..
if (PageIsEmpty(page))
{
empty_pages++;
freespace = PageGetHeapFreeSpace(page);
/* empty pages are always all-visible */
if (!PageIsAllVisible(page))
..
}
Don't we need to ensure that empty pages also get marked as
all-frozen? (See the sketch after these review comments.)
8.
lazy_scan_heap()
{
..
/*
* As of PostgreSQL 9.2, the visibility map bit should never be set if
* the page-level bit is clear. However, it's possible that the bit
* got cleared after we checked it and before we took the buffer
* content lock, so we must recheck before jumping to the conclusion
* that something bad has happened.
*/
else if (all_visible_according_to_vm && !PageIsAllVisible(page)
&& visibilitymap_test(onerel, blkno, &vmbuffer, VISIBILITYMAP_ALL_VISIBLE))
{
elog(WARNING, "page is not marked all-visible but visibility map bit is set in relation \"%s\" page %u",
relname, blkno);
visibilitymap_clear(onerel, blkno, vmbuffer);
}

/*
* It's possible for the value returned by GetOldestXmin() to move
* backwards, so it's not wrong for us to see tuples that appear to
* not be visible to everyone yet, while PD_ALL_VISIBLE is already
* set. The real safe xmin value never moves backwards, but
* GetOldestXmin() is conservative and sometimes returns a value
* that's unnecessarily small, so if we see that contradiction it just
* means that the tuples that we think are not visible to everyone yet
* actually are, and the PD_ALL_VISIBLE flag is correct.
*
* There should never be dead tuples on a page with PD_ALL_VISIBLE
* set, however.
*/
else if (PageIsAllVisible(page) && has_dead_tuples)
{
elog(WARNING, "page containing dead tuples is marked as all-visible in relation \"%s\" page %u",
relname, blkno);
PageClearAllVisible(page);
MarkBufferDirty(buf);
visibilitymap_clear(onerel, blkno, vmbuffer);
}
I think both of the above cases could happen for the frozen state
as well; unless you think otherwise, we need similar handling
for the frozen bit.
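As promised at comment 7, a sketch of marking empty pages all-frozen as well as all-visible, assuming the flags argument the patch adds to visibilitymap_set() (the surrounding names are from lazy_scan_heap()):

    if (PageIsEmpty(page))
    {
        empty_pages++;
        freespace = PageGetHeapFreeSpace(page);

        /* An empty page has no tuples left to freeze, so it can be marked
         * all-frozen at the same time it is marked all-visible. */
        if (!PageIsAllVisible(page))
        {
            PageSetAllVisible(page);
            PageSetAllFrozen(page);
            MarkBufferDirty(buf);
            visibilitymap_set(onerel, blkno, buf, InvalidXLogRecPtr,
                              vmbuffer, InvalidTransactionId,
                              VISIBILITYMAP_ALL_VISIBLE | VISIBILITYMAP_ALL_FROZEN);
        }
    }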
On Sat, Oct 31, 2015 at 1:32 AM, Amit Kapila <amit.kapila16@gmail.com> wrote: > On Thu, Oct 8, 2015 at 11:05 PM, Simon Riggs <simon@2ndquadrant.com> wrote: >> >> On 1 October 2015 at 23:30, Josh Berkus <josh@agliodbs.com> wrote: >>> >>> On 10/01/2015 07:43 AM, Robert Haas wrote: >>> > On Thu, Oct 1, 2015 at 9:44 AM, Fujii Masao <masao.fujii@gmail.com> >>> > wrote: >>> >> I wonder how much it's worth renaming only the file extension while >>> >> there are many places where "visibility map" and "vm" are used, >>> >> for example, log messages, function names, variables, etc. >>> > >>> > I'd be inclined to keep calling it the visibility map (vm) even if it >>> > also contains freeze information. >>> > > > What is your main worry about changing the name of this map: is it > about more code churn, about the risk of introducing new issues, or > about people already being accustomed to calling this map the > visibility map? My concern is mostly that I think calling it the "visibility and freeze map" is excessively long and wordy. One observation that someone made previously is that there is a difference between "all-visible" and "index-only scan OK". An all-visible page that has a HOT update is no longer all-visible (it needs vacuuming) but an index-only scan would still be OK (because only the non-indexed values in the tuple have changed, and every scan can see either the old or the new tuple but not both). At present, the index-only scan will consult the heap page anyway, because all we know is that the page is not all-visible. But maybe in the future somebody will decide to add a bit for that. Then we'd have the "visibility, usable for index-only scans, and freeze map", but I think "_vufiosfm" will not be a good choice for a file suffix. So similarly here. The file suffix doesn't need to enumerate all the bits that are present for each page. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Tue, Nov 3, 2015 at 5:04 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Sat, Oct 31, 2015 at 1:32 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > What is your main worry about changing the name of this map, is it
> > about more code churn or is it about that we might introduce new issues
> > or is it about that people are already accustomed to call this map as
> > visibility map?
>
> My concern is mostly that I think calling it the "visibility and
> freeze map" is excessively long and wordy.
>
> One observation that someone made previously is that there is a
> difference between "all-visible" and "index-only scan OK". An
> all-visible page that has a HOT update is no longer all-visible (it
> needs vacuuming) but an index-only scan would still be OK (because
> only the non-indexed values in the tuple have changed, and every scan
> can see either the old or the new tuple but not both). At
> present, the index-only scan will consult the heap page anyway,
> because all we know is that the page is not all-visible. But maybe in
> the future somebody will decide to add a bit for that. Then we'd have
> the "visibility, usable for index-only scans, and freeze map", but I
> think "_vufiosfm" will not be a good choice for a file suffix.
>
I think in that case we could call it a "page info map" or "page state map";
I find retaining the visibility map name in this case, or in the future (if
we want to add another bit), confusing. In fact, if you find "visibility and
freeze map" excessively long, then we could change it to "page info map" or
"page state map" now as well.
On Mon, Nov 2, 2015 at 10:33 PM, Amit Kapila <amit.kapila16@gmail.com> wrote: > On Tue, Nov 3, 2015 at 5:04 AM, Robert Haas <robertmhaas@gmail.com> wrote: >> >> On Sat, Oct 31, 2015 at 1:32 AM, Amit Kapila <amit.kapila16@gmail.com> >> wrote: >> > >> > What is your main worry about changing the name of this map, is it >> > about more code churn or is it about that we might introduce new issues >> > or is it about that people are already accustomed to call this map as >> > visibility map? >> >> My concern is mostly that I think calling it the "visibility and >> freeze map" is excessively long and wordy. >> >> One observation that someone made previously is that there is a >> difference between "all-visible" and "index-only scan OK". An >> all-visible page that has a HOT update is no longer all-visible (it >> needs vacuuming) but an index-only scan would still be OK (because >> only the non-indexed values in the tuple have changed, and every scan >> scan can see either the old or the new tuple but not both. At >> present, the index-only scan will consult the heap page anyway, >> because all we know is that the page is not all-visible. But maybe in >> the future somebody will decide to add a bit for that. Then we'd have >> the "visibility, usable for index-only scans, and freeze map", but I >> think "_vufiosfm" will not be a good choice for a file suffix. >> > > I think in that case we can call it as page info map or page state map, but > I find retaining visibility map name in this case or for future (if we want > to > add another bit) as confusing. In-fact if you find "visibility and freeze > map", > as excessively long, then we can change it to "page info map" or "page state > map" now as well. Sure. Or we could just keep calling it the visibility map, and then everyone would know what we're talking about. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Tue, Nov 3, 2015 at 12:33 PM, Amit Kapila <amit.kapila16@gmail.com> wrote: > On Tue, Nov 3, 2015 at 5:04 AM, Robert Haas <robertmhaas@gmail.com> wrote: >> >> On Sat, Oct 31, 2015 at 1:32 AM, Amit Kapila <amit.kapila16@gmail.com> >> wrote: >> > >> > What is your main worry about changing the name of this map, is it >> > about more code churn or is it about that we might introduce new issues >> > or is it about that people are already accustomed to call this map as >> > visibility map? >> >> My concern is mostly that I think calling it the "visibility and >> freeze map" is excessively long and wordy. >> >> One observation that someone made previously is that there is a >> difference between "all-visible" and "index-only scan OK". An >> all-visible page that has a HOT update is no longer all-visible (it >> needs vacuuming) but an index-only scan would still be OK (because >> only the non-indexed values in the tuple have changed, and every scan >> scan can see either the old or the new tuple but not both. At >> present, the index-only scan will consult the heap page anyway, >> because all we know is that the page is not all-visible. But maybe in >> the future somebody will decide to add a bit for that. Then we'd have >> the "visibility, usable for index-only scans, and freeze map", but I >> think "_vufiosfm" will not be a good choice for a file suffix. >> > > I think in that case we can call it as page info map or page state map, but > I find retaining visibility map name in this case or for future (if we want > to > add another bit) as confusing. In-fact if you find "visibility and freeze > map", > as excessively long, then we can change it to "page info map" or "page state > map" now as well. > In that case, file suffix would be "_pim" or "_psm"? IMO, "page info map" would be better, because the bit doesn't indicate the status of page in real time, it's just additional information. Also we need to rewrite to new name in source code, and source file name as well. Regards, -- Masahiko Sawada
On Wed, Nov 4, 2015 at 4:45 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Tue, Nov 3, 2015 at 12:33 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > On Tue, Nov 3, 2015 at 5:04 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> >>
> >> On Sat, Oct 31, 2015 at 1:32 AM, Amit Kapila <amit.kapila16@gmail.com>
> >> wrote:
> >> >
> >> > What is your main worry about changing the name of this map, is it
> >> > about more code churn or is it about that we might introduce new issues
> >> > or is it about that people are already accustomed to call this map as
> >> > visibility map?
> >>
> >> My concern is mostly that I think calling it the "visibility and
> >> freeze map" is excessively long and wordy.
> >>
> >> One observation that someone made previously is that there is a
> >> difference between "all-visible" and "index-only scan OK". An
> >> all-visible page that has a HOT update is no longer all-visible (it
> >> needs vacuuming) but an index-only scan would still be OK (because
> >> only the non-indexed values in the tuple have changed, and every scan
> >> scan can see either the old or the new tuple but not both. At
> >> present, the index-only scan will consult the heap page anyway,
> >> because all we know is that the page is not all-visible. But maybe in
> >> the future somebody will decide to add a bit for that. Then we'd have
> >> the "visibility, usable for index-only scans, and freeze map", but I
> >> think "_vufiosfm" will not be a good choice for a file suffix.
> >>
> >
> > I think in that case we can call it as page info map or page state map, but
> > I find retaining visibility map name in this case or for future (if we want
> > to
> > add another bit) as confusing. In-fact if you find "visibility and freeze
> > map",
> > as excessively long, then we can change it to "page info map" or "page state
> > map" now as well.
> >
>
> In that case, file suffix would be "_pim" or "_psm"?
Right.
> IMO, "page info map" would be better, because the bit doesn't indicate
> the status of page in real time, it's just additional information.
> Also we need to rewrite to new name in source code, and source file
> name as well.
>
I think so. Here I think the right thing to do is lets proceed with fixing
other issues of patch and work on this part later and in the mean time
we might get more feedback on this part of proposal.
Hello, I had a look at the v21 patch. Though I haven't looked at the whole of the patch, I'd like to show you some comments only for visibilitymap.c and a part of the documentation.

1. Patch application

The patch command complains about offsets for heapam.c at current master.

2. visibilitymap_test()

- if (visibilitymap_test(rel, blkno, &vmbuffer))
+ if (visibilitymap_test(rel, blkno, &vmbuffer, VISIBILITYMAP_ALL_VISIBLE)

The old VM was a simple bitmap, so the name _test and the function are proper, but now the bitmap is quad-state, so it'd be better to change the function. Although it is not so expensive to call it twice successively, it is a bit uneasy for me doing so. One possible shape would be like the following.

lazy_vacuum_page()
> int vmstate = visibilitymap_get_status(rel, blkno, &vmbuffer);
> if (!(vmstate & VISIBILITYMAP_ALL_VISIBLE))
> ...
> if (all_frozen && !(vmstate & VISIBILITYMAP_ALL_FROZEN))
> ...
> if (flags != vmstate)
>     visibilitymap_set(...., flags);

and defining two macros for individual tests,

> #define VM_ALL_VISIBLE(r, b, v) ((vm_get_status((r), (b), (v)) & .._VISIBLE) != 0)
> if (VM_ALL_VISIBLE(rel, blkno, &vmbuffer))
and
> if (VM_ALL_FROZEN(rel, blkno, &vmbuffer))

How about this?

3. visibilitymap.c

- HEAPBLK_TO_MAPBIT

In visibilitymap_clear and other functions, mapBit means mapDualBit in the patch, and mapBit always appears in the form "mapBit * BITS_PER_HEAPBLOCK". So it'd be better to change the definition of HEAPBLK_TO_MAPBIT so that it really calculates the bit position in a byte.

- #define HEAPBLK_TO_MAPBIT(x) ((x) % HEAPBLOCKS_PER_BYTE)
+ #define HEAPBLK_TO_MAPBIT(x) (((x) % HEAPBLOCKS_PER_BYTE) * BITS_PER_HEAPBLOCK)

- visibilitymap_count()

The third argument all_frozen is not necessary in some usages, so an interface like the following would be preferable:

BlockNumber
visibilitymap_count(Relation rel, BlockNumber *all_frozen)
{
    BlockNumber all_visible = 0;
    ...
    if (all_frozen)
        *all_frozen = 0;
    ...

- visibilitymap_set()

The check for ALL_VISIBLE is duplicated in the following assertion.

> Assert((flags & VISIBILITYMAP_ALL_VISIBLE) ||
>        (flags & (VISIBILITYMAP_ALL_VISIBLE | VISIBILITYMAP_ALL_FROZEN)));

4. documentation

- 18.11.1 Statement Behavior

A typo:

> VACUUM performs *a* aggressive freezing

Though I am not a fluent English speaker, and such wordsmithing would be done by someone else, I feel that "eager/greedy" is more suitable for this meaning; nevertheless, the term "whole-table freezing" that you wrote elsewhere in this patch would be usable.

"VACUUM performs a whole-table freezing"

All "a table scan/sweep"s, and anything with a similar meaning, would better be changed to "a whole-table freezing". In a similar manner, "tuples/rows that are marked as frozen" could be replaced with "unfrozen tuples/rows".

- 23.1.5 Preventing Transaction ID Wraparound Failures

"The whole table is scanned only when all pages happen to require vacuuming to remove dead row versions."

This description looks a bit out-of-point. "The whole table scan" in the original description is what is triggered by relfrozenxid, so the correspondent in the revised description is "the whole-table freezing", maybe:

"The whole-table freezing takes place when <structfield>relfrozenxid</> is more than <varname>vacuum_freeze_table_age</> transactions old or when <command>VACUUM</>'s <literal>FREEZE</> option is used. The whole-table freezing scans all unfrozen pages."

The last sentence might be unnecessary.
- 63.4 Visibility Map "pages contain only tuples that are marked as frozen" would be enough to be "pages contain only frozen tuples" and according to the discussion upthread, we might be good to have some desciption that the name is historically omitting the aspect of freezemap. At Sat, 31 Oct 2015 18:07:32 +0530, Amit Kapila <amit.kapila16@gmail.com> wrote in <CAA4eK1+aTdaSwG3u+y8fXxn67Kkj0T1KzRsFDLEi=tQvTYgFrQ@mail.gmail.com> amit.kapila16> On Fri, Oct 30, 2015 at 6:03 AM, Masahiko Sawada <sawada.mshk@gmail.com> > Couple of more review comments: > ------------------------------------------------------ > > 1. > @@ -615,6 +617,8 @@ typedef struct PgStat_StatTabEntry > PgStat_Counter n_dead_tuples; > PgStat_Counter > changes_since_analyze; > > + int32 n_frozen_pages; > + > PgStat_Counter blocks_fetched; > PgStat_Counter > blocks_hit; > > As you are changing above structure, you need to update > PGSTAT_FILE_FORMAT_ID, refer below code: > #define PGSTAT_FILE_FORMAT_ID 0x01A5BC9D > > 2. It seems that n_frozen_page is not initialized/updated properly > for toast tables: > > Try with below steps: > > postgres=# create table t4(c1 int, c2 text); > CREATE TABLE > postgres=# select oid, relname from pg_class where relname like '%t4%'; > oid | relname > -------+--------- > 16390 | t4 > (1 row) > > > postgres=# select oid, relname from pg_class where relname like '%16390%'; > oid | relname > -------+---------------------- > 16393 | pg_toast_16390 > 16395 | pg_toast_16390_index > (2 rows) > > postgres=# select relname,seq_scan,n_tup_ins,last_vacuum,n_frozen_page from > pg_s > tat_all_tables where relname like '%16390%'; > relname | seq_scan | n_tup_ins | last_vacuum | n_frozen_page > ----------------+----------+-----------+-------------+--------------- > pg_toast_16390 | 1 | 0 | | -842150451 > (1 row) > > Note that I have tested above scenario on my Windows 7 m/c. > > 3. > * visibilitymap.c > * bitmap for tracking visibility of heap tuples > > I think above needs to be changed to: > bitmap for tracking visibility and frozen state of heap tuples > > > 4. > a. > /* > - * If we froze any tuples, mark the buffer dirty, and write a WAL > - * record recording the changes. We must log the changes to be > - * crash-safe against future truncation of CLOG. > + * If we froze any tuples then we mark the buffer dirty, and write a WAL > > b. > - * Release any remaining pin on visibility map page. > + * Release any remaining pin on visibility map. > > c. > * We do update relallvisible even in the corner case, since if the table > - * is all-visible > we'd definitely like to know that. But clamp the value > - * to be not more than what we're setting > relpages to. > + * is all-visible we'd definitely like to know that. > + * But clamp the value to be not more > than what we're setting relpages to. > > I don't think you need to change above comments. > > 5. > + * Even if scan_all is set so far, we could skip to scan some pages > + * according by all-frozen > bit of visibility amp. > > /according by/according to > /amp/map > > I suggested to modify comment as below: > During full scan, we could skip some pages according to all-frozen > bit of visibility map. > > Also no need to start this in new line, start from where the > previous line of comment ends. > > 6. 
> /* > * lazy_scan_heap() -- scan an open heap relation > * > * This routine prunes each page in the > heap, which will among other > * things truncate dead tuples to dead line pointers, defragment the > * > page, and set commit status bits (see heap_page_prune). It also builds > * lists of dead > tuples and pages with free space, calculates statistics > * on the number of live tuples in the > heap, and marks pages as > * all-visible if appropriate. > > Modify above function header as: > > all-visible, all-frozen > > 7. > lazy_scan_heap() > { > .. > > if (PageIsEmpty(page)) > { > empty_pages++; > freespace = > PageGetHeapFreeSpace(page); > > /* empty pages are always all-visible */ > if (!PageIsAllVisible(page)) > .. > } > > Don't we need to ensure that empty pages should get marked as > all-frozen? > > 8. > lazy_scan_heap() > { > .. > /* > * As of PostgreSQL 9.2, the visibility map bit should never be set if > * the page- > level bit is clear. However, it's possible that the bit > * got cleared after we checked it > and before we took the buffer > * content lock, so we must recheck before jumping to the conclusion > * that something bad has happened. > */ > else if (all_visible_according_to_vm > && !PageIsAllVisible(page) > && visibilitymap_test(onerel, blkno, &vmbuffer, > VISIBILITYMAP_ALL_VISIBLE)) > { > elog(WARNING, "page is not marked all-visible > but visibility map bit is set in relation \"%s\" page %u", > relname, blkno); > visibilitymap_clear(onerel, blkno, vmbuffer); > } > > /* > * > It's possible for the value returned by GetOldestXmin() to move > * backwards, so it's not wrong for > us to see tuples that appear to > * not be visible to everyone yet, while PD_ALL_VISIBLE is already > * set. The real safe xmin value never moves backwards, but > * GetOldestXmin() is > conservative and sometimes returns a value > * that's unnecessarily small, so if we see that > contradiction it just > * means that the tuples that we think are not visible to everyone yet > * actually are, and the PD_ALL_VISIBLE flag is correct. > * > * There should never > be dead tuples on a page with PD_ALL_VISIBLE > * set, however. > */ > else > if (PageIsAllVisible(page) && has_dead_tuples) > { > elog(WARNING, "page > containing dead tuples is marked as all-visible in relation \"%s\" page %u", > > relname, blkno); > PageClearAllVisible(page); > MarkBufferDirty(buf); > visibilitymap_clear(onerel, blkno, vmbuffer); > } > > .. > } > > I think both the above cases could happen for frozen state > as well, unless you think otherwise, we need similar handling > for frozen bit. -- Kyotaro Horiguchi NTT Open Source Software Center
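To make the dual-bit scheme in the review above concrete, here is a minimal, self-contained sketch of the addressing it describes. The macro and flag names follow the snippets quoted in the review; vm_get_status() here is a simplified stand-in that reads a plain byte array rather than a map page in a shared buffer, so it is an illustration of the layout, not the patch's code.

    #include <stdint.h>

    #define BITS_PER_HEAPBLOCK    2    /* an all-visible bit plus an all-frozen bit */
    #define HEAPBLOCKS_PER_BYTE   4    /* 8 bits per byte / 2 bits per heap block */

    #define VISIBILITYMAP_ALL_VISIBLE  0x01
    #define VISIBILITYMAP_ALL_FROZEN   0x02
    #define VISIBILITYMAP_VALID_BITS   0x03

    /* bit offset of heap block x within its map byte, per the suggested macro */
    #define HEAPBLK_TO_MAPBIT(x)  (((x) % HEAPBLOCKS_PER_BYTE) * BITS_PER_HEAPBLOCK)

    /* simplified stand-in: fetch both status bits for one heap block */
    static uint8_t
    vm_get_status(const uint8_t *map, uint32_t heapBlk)
    {
        return (map[heapBlk / HEAPBLOCKS_PER_BYTE] >> HEAPBLK_TO_MAPBIT(heapBlk))
                & VISIBILITYMAP_VALID_BITS;
    }

    /* the two individual tests the review proposes */
    #define VM_ALL_VISIBLE(map, blk) \
        ((vm_get_status((map), (blk)) & VISIBILITYMAP_ALL_VISIBLE) != 0)
    #define VM_ALL_FROZEN(map, blk) \
        ((vm_get_status((map), (blk)) & VISIBILITYMAP_ALL_FROZEN) != 0)

Putting the two bits for a block adjacent to each other, as the suggested HEAPBLK_TO_MAPBIT does, lets one read fetch both states at once, which is what makes the single get_status-style call attractive.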
On Wed, Nov 4, 2015 at 12:19 PM, Amit Kapila <amit.kapila16@gmail.com> wrote: > On Wed, Nov 4, 2015 at 4:45 AM, Masahiko Sawada <sawada.mshk@gmail.com> > wrote: >> >> On Tue, Nov 3, 2015 at 12:33 PM, Amit Kapila <amit.kapila16@gmail.com> >> wrote: >> > On Tue, Nov 3, 2015 at 5:04 AM, Robert Haas <robertmhaas@gmail.com> >> > wrote: >> >> >> >> On Sat, Oct 31, 2015 at 1:32 AM, Amit Kapila <amit.kapila16@gmail.com> >> >> wrote: >> >> > >> >> > What is your main worry about changing the name of this map, is it >> >> > about more code churn or is it about that we might introduce new >> >> > issues >> >> > or is it about that people are already accustomed to call this map as >> >> > visibility map? >> >> >> >> My concern is mostly that I think calling it the "visibility and >> >> freeze map" is excessively long and wordy. >> >> >> >> One observation that someone made previously is that there is a >> >> difference between "all-visible" and "index-only scan OK". An >> >> all-visible page that has a HOT update is no longer all-visible (it >> >> needs vacuuming) but an index-only scan would still be OK (because >> >> only the non-indexed values in the tuple have changed, and every scan >> >> scan can see either the old or the new tuple but not both. At >> >> present, the index-only scan will consult the heap page anyway, >> >> because all we know is that the page is not all-visible. But maybe in >> >> the future somebody will decide to add a bit for that. Then we'd have >> >> the "visibility, usable for index-only scans, and freeze map", but I >> >> think "_vufiosfm" will not be a good choice for a file suffix. >> >> >> > >> > I think in that case we can call it as page info map or page state map, >> > but >> > I find retaining visibility map name in this case or for future (if we >> > want >> > to >> > add another bit) as confusing. In-fact if you find "visibility and >> > freeze >> > map", >> > as excessively long, then we can change it to "page info map" or "page >> > state >> > map" now as well. >> > >> >> In that case, file suffix would be "_pim" or "_psm"? > > Right. > >> IMO, "page info map" would be better, because the bit doesn't indicate >> the status of page in real time, it's just additional information. >> Also we need to rewrite to new name in source code, and source file >> name as well. >> > > I think so. Here I think the right thing to do is lets proceed with fixing > other issues of patch and work on this part later and in the mean time > we might get more feedback on this part of proposal. > Yeah, I'm going to do that changes if there is no strong objection from hackers. Regards, -- Masahiko Sawada
On Thu, Nov 5, 2015 at 6:03 PM, Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote: > Hello, I had a look on v21 patch. > > Though I haven't looked the whole of the patch, I'd like to show > you some comments only for visibilitymap.c and a part of the > documentation. > > > 1. Patch application > > patch command complains about offsets for heapam.c at current > master. > > 2. visitibilymap_test() > > - if (visibilitymap_test(rel, blkno, &vmbuffer)) > + if (visibilitymap_test(rel, blkno, &vmbuffer, VISIBILITYMAP_ALL_VISIBLE) > > The old VM was a simple bitmap so the name _test and the > function are proper but now the bitmap is quad state so it'd be > better chainging the function. Alghough it is not so expensive > to call it twice successively, it is a bit uneasy for me doing > so. One possible shape would be like the following. > > lazy_vacuum_page() > > int vmstate = visibilitymap_get_status(rel, blkno, &vmbuffer); > > if (!(vmstate & VISIBILITYMAP_ALL_VISIBLE)) > > ... > > if (all_frozen && !(vmstate & VISIBILITYMAP_ALL_FROZEN)) > > ... > > if (flags != vmstate) > > visibilitymap_set(...., flags); > > and defining two macros for indivisual tests, > > > #define VM_ALL_VISIBLE(r, b, v) ((vm_get_status((r), (b), (v)) & .._VISIBLE) != 0) > > if (VM_ALL_VISIBLE(rel, blkno, &vmbuffer)) > and > > if (VM_ALL_FROZEN(rel, blkno, &vmbuffer)) > > How about this? > > > 3. visibilitymap.c > > - HEAPBLK_TO_MAPBIT > > In visibilitymap_clear and other functions, mapBit means > mapDualBit in the patch, and mapBit always appears in the form > "mapBit * BITS_PER_HEAPBLOCK". So, it'd be better to change the > definition of HEAPBLK_TO_MAPBIT so that it calculates really > the bit position in a byte. > > - #define HEAPBLK_TO_MAPBIT(x) ((x) % HEAPBLOCKS_PER_BYTE) > + #define HEAPBLK_TO_MAPBIT(x) (((x) % HEAPBLOCKS_PER_BYTE) * BITS_PER_HEAPBLOCK) > > > - visibilitymap_count() > > The third argument all_frozen is not necessary in some > usage. So this interface would be preferable to be as > following, > > BlockNumber > visibilitymap_count(Relation rel, BlockNumber *all_frozen) > { > BlockNumber all_visible = 0; > ... > if (all_frozen) > *all_frozen = 0; > ... something like ... > > - visibilitymap_set() > > The check for ALL_VISIBLE is duplicate in the following > assertion. > > > Assert((flags & VISIBILITYMAP_ALL_VISIBLE) || > > (flags & (VISIBILITYMAP_ALL_VISIBLE | VISIBILITYMAP_ALL_FROZEN))); > > > > 4. documentation > > - 18.11.1 Statement Hehavior > > A typo. > > > VACUUM performs *a* aggressive freezing > > However I am not a fluent English speaker, and such > wordsmithing would be done by someone else, I feel that > "eager/greedy" is more suitable for this meaning.., > nevertheless, the term "whole-table freezing" that you wrote > elsewhere in this patch would be usable. > > "VACUUM performs a whole-table freezing" > > All "a table scan/sweep"s and something has the similar > meaning would be better be changed to "a whole-table > freezing" > > In similar manner, "tuples/rows that are marked as frozen" > could be replaced with "unfrozen tuples/rows". > > - 23.1.5 Preventing Transaction ID Wraparound Failures > > "The whole table is scanned only when all pages happen to > require vacuuming to remove dead row versions." > > This description looks a bit out-of-point. "the whole table > scan" in the original description is what is triggered by > relfrozenxid so the correspondent in the revised description > is "the whole-table freezing", maybe. 
> > "The whole-table feezing takes place when > <structfield>relfrozenxid</> is more than > <varname>vacuum_freeze_table_age</> transactions old or when > <command>VACUUM</>'s <literal>FREEZE</> option is used. The > whole-table freezing scans all unfreezed pages." > > The last sentence might be unnecessary. > > > - 63.4 Visibility Map > > "pages contain only tuples that are marked as frozen" would be > enough to be "pages contain only frozen tuples" > > and according to the discussion upthread, we might be good to > have some desciption that the name is historically omitting > the aspect of freezemap. > > > At Sat, 31 Oct 2015 18:07:32 +0530, Amit Kapila <amit.kapila16@gmail.com> wrote in <CAA4eK1+aTdaSwG3u+y8fXxn67Kkj0T1KzRsFDLEi=tQvTYgFrQ@mail.gmail.com> > amit.kapila16> On Fri, Oct 30, 2015 at 6:03 AM, Masahiko Sawada <sawada.mshk@gmail.com> >> Couple of more review comments: >> ------------------------------------------------------ >> >> 1. >> @@ -615,6 +617,8 @@ typedef struct PgStat_StatTabEntry >> PgStat_Counter n_dead_tuples; >> PgStat_Counter >> changes_since_analyze; >> >> + int32 n_frozen_pages; >> + >> PgStat_Counter blocks_fetched; >> PgStat_Counter >> blocks_hit; >> >> As you are changing above structure, you need to update >> PGSTAT_FILE_FORMAT_ID, refer below code: >> #define PGSTAT_FILE_FORMAT_ID 0x01A5BC9D >> >> 2. It seems that n_frozen_page is not initialized/updated properly >> for toast tables: >> >> Try with below steps: >> >> postgres=# create table t4(c1 int, c2 text); >> CREATE TABLE >> postgres=# select oid, relname from pg_class where relname like '%t4%'; >> oid | relname >> -------+--------- >> 16390 | t4 >> (1 row) >> >> >> postgres=# select oid, relname from pg_class where relname like '%16390%'; >> oid | relname >> -------+---------------------- >> 16393 | pg_toast_16390 >> 16395 | pg_toast_16390_index >> (2 rows) >> >> postgres=# select relname,seq_scan,n_tup_ins,last_vacuum,n_frozen_page from >> pg_s >> tat_all_tables where relname like '%16390%'; >> relname | seq_scan | n_tup_ins | last_vacuum | n_frozen_page >> ----------------+----------+-----------+-------------+--------------- >> pg_toast_16390 | 1 | 0 | | -842150451 >> (1 row) >> >> Note that I have tested above scenario on my Windows 7 m/c. >> >> 3. >> * visibilitymap.c >> * bitmap for tracking visibility of heap tuples >> >> I think above needs to be changed to: >> bitmap for tracking visibility and frozen state of heap tuples >> >> >> 4. >> a. >> /* >> - * If we froze any tuples, mark the buffer dirty, and write a WAL >> - * record recording the changes. We must log the changes to be >> - * crash-safe against future truncation of CLOG. >> + * If we froze any tuples then we mark the buffer dirty, and write a WAL >> >> b. >> - * Release any remaining pin on visibility map page. >> + * Release any remaining pin on visibility map. >> >> c. >> * We do update relallvisible even in the corner case, since if the table >> - * is all-visible >> we'd definitely like to know that. But clamp the value >> - * to be not more than what we're setting >> relpages to. >> + * is all-visible we'd definitely like to know that. >> + * But clamp the value to be not more >> than what we're setting relpages to. >> >> I don't think you need to change above comments. >> >> 5. >> + * Even if scan_all is set so far, we could skip to scan some pages >> + * according by all-frozen >> bit of visibility amp. 
>> >> /according by/according to >> /amp/map >> >> I suggested to modify comment as below: >> During full scan, we could skip some pages according to all-frozen >> bit of visibility map. >> >> Also no need to start this in new line, start from where the >> previous line of comment ends. >> >> 6. >> /* >> * lazy_scan_heap() -- scan an open heap relation >> * >> * This routine prunes each page in the >> heap, which will among other >> * things truncate dead tuples to dead line pointers, defragment the >> * >> page, and set commit status bits (see heap_page_prune). It also builds >> * lists of dead >> tuples and pages with free space, calculates statistics >> * on the number of live tuples in the >> heap, and marks pages as >> * all-visible if appropriate. >> >> Modify above function header as: >> >> all-visible, all-frozen >> >> 7. >> lazy_scan_heap() >> { >> .. >> >> if (PageIsEmpty(page)) >> { >> empty_pages++; >> freespace = >> PageGetHeapFreeSpace(page); >> >> /* empty pages are always all-visible */ >> if (!PageIsAllVisible(page)) >> .. >> } >> >> Don't we need to ensure that empty pages should get marked as >> all-frozen? >> >> 8. >> lazy_scan_heap() >> { >> .. >> /* >> * As of PostgreSQL 9.2, the visibility map bit should never be set if >> * the page- >> level bit is clear. However, it's possible that the bit >> * got cleared after we checked it >> and before we took the buffer >> * content lock, so we must recheck before jumping to the conclusion >> * that something bad has happened. >> */ >> else if (all_visible_according_to_vm >> && !PageIsAllVisible(page) >> && visibilitymap_test(onerel, blkno, &vmbuffer, >> VISIBILITYMAP_ALL_VISIBLE)) >> { >> elog(WARNING, "page is not marked all-visible >> but visibility map bit is set in relation \"%s\" page %u", >> relname, blkno); >> visibilitymap_clear(onerel, blkno, vmbuffer); >> } >> >> /* >> * >> It's possible for the value returned by GetOldestXmin() to move >> * backwards, so it's not wrong for >> us to see tuples that appear to >> * not be visible to everyone yet, while PD_ALL_VISIBLE is already >> * set. The real safe xmin value never moves backwards, but >> * GetOldestXmin() is >> conservative and sometimes returns a value >> * that's unnecessarily small, so if we see that >> contradiction it just >> * means that the tuples that we think are not visible to everyone yet >> * actually are, and the PD_ALL_VISIBLE flag is correct. >> * >> * There should never >> be dead tuples on a page with PD_ALL_VISIBLE >> * set, however. >> */ >> else >> if (PageIsAllVisible(page) && has_dead_tuples) >> { >> elog(WARNING, "page >> containing dead tuples is marked as all-visible in relation \"%s\" page %u", >> >> relname, blkno); >> PageClearAllVisible(page); >> MarkBufferDirty(buf); >> visibilitymap_clear(onerel, blkno, vmbuffer); >> } >> >> .. >> } >> >> I think both the above cases could happen for frozen state >> as well, unless you think otherwise, we need similar handling >> for frozen bit. > Thank you for reviewing the patch. I changed the patch so that the visibility map become the page info map, in source code and documentation. And fixed review comments I received. Attached v22 patch. > I think both the above cases could happen for frozen state > as well, unless you think otherwise, we need similar handling > for frozen bit. It's not happen the situation where is all-frozen and not all-visible, and the bits of visibility map are cleared at the same time, page flags are as well. So I think it's enough to handle only all-visible situation. 
Am I missing something? > 2. visitibilymap_test() > > - if (visibilitymap_test(rel, blkno, &vmbuffer)) > + if (visibilitymap_test(rel, blkno, &vmbuffer, VISIBILITYMAP_ALL_VISIBLE) > > The old VM was a simple bitmap so the name _test and the > function are proper but now the bitmap is quad state so it'd be > better chainging the function. Alghough it is not so expensive > to call it twice successively, it is a bit uneasy for me doing > so. One possible shape would be like the following. > > lazy_vacuum_page() > > int vmstate = visibilitymap_get_status(rel, blkno, &vmbuffer); > > if (!(vmstate & VISIBILITYMAP_ALL_VISIBLE)) > > ... > > if (all_frozen && !(vmstate & VISIBILITYMAP_ALL_FROZEN)) > > ... > > if (flags != vmstate) > > visibilitymap_set(...., flags); > > and defining two macros for indivisual tests, > > > #define VM_ALL_VISIBLE(r, b, v) ((vm_get_status((r), (b), (v)) & .._VISIBLE) != 0) > > if (VM_ALL_VISIBLE(rel, blkno, &vmbuffer)) > and > > if (VM_ALL_FROZEN(rel, blkno, &vmbuffer)) > > How about this? I agree. I've changed so. Regards, -- Masahiko Sawada
Attachment
On Fri, Nov 13, 2015 at 4:48 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
>
> Thank you for reviewing the patch.
>
> I changed the patch so that the visibility map become the page info
> map, in source code and documentation.
>
One thing to notice is that this almost doubles the patch size, which
might make it slightly more difficult to review, but on the other hand, if
nobody opposes such a change, this seems to be the right direction.
> And fixed review comments I received.
> Attached v22 patch.
>
> > I think both the above cases could happen for frozen state
> > as well, unless you think otherwise, we need similar handling
> > for frozen bit.
>
> It's not happen the situation where is all-frozen and not all-visible,
> and the bits of visibility map are cleared at the same time, page
> flags are as well.
> So I think it's enough to handle only all-visible situation. Am I
> missing something?
>
No, I think you are right: information for both is cleared together,
and all-visible is a superset of all-frozen (meaning if all-frozen is set,
then all-visible must be set), so it is sufficient to check the visibility
info in the above situation. But I feel we can update the comment to
indicate the same, and add an Assert to ensure that if all-frozen is set,
all-visible must be set.
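The invariant being agreed on here is small enough to sketch directly. Assuming flag values like those quoted earlier in the thread, the check would look roughly like this; check_vm_invariant is a hypothetical wrapper for illustration, not a function in the patch:

    #include <assert.h>
    #include <stdint.h>

    #define VISIBILITYMAP_ALL_VISIBLE  0x01
    #define VISIBILITYMAP_ALL_FROZEN   0x02

    /* all-frozen implies all-visible; the reverse need not hold */
    static void
    check_vm_invariant(uint8_t vmstatus)
    {
        assert(!(vmstatus & VISIBILITYMAP_ALL_FROZEN) ||
               (vmstatus & VISIBILITYMAP_ALL_VISIBLE));
    }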
On Fri, Nov 13, 2015 at 1:44 PM, Amit Kapila <amit.kapila16@gmail.com> wrote: > On Fri, Nov 13, 2015 at 4:48 AM, Masahiko Sawada <sawada.mshk@gmail.com> > wrote: >> >> >> Thank you for reviewing the patch. >> >> I changed the patch so that the visibility map become the page info >> map, in source code and documentation. >> > > One thing to notice is that this almost doubles the patch size which > might makes it slightly difficult to review, but on the other hand if > no-body opposes for such a change, this seems to be the right direction. I believe that it's going to right direction. But I think we didn't get consensus about this changes yet, so it might go back. > >> And fixed review comments I received. >> Attached v22 patch. >> >> > I think both the above cases could happen for frozen state >> > as well, unless you think otherwise, we need similar handling >> > for frozen bit. >> >> It's not happen the situation where is all-frozen and not all-visible, >> and the bits of visibility map are cleared at the same time, page >> flags are as well. >> So I think it's enough to handle only all-visible situation. Am I >> >> missing something? >> > > No, I think you are right as information for both is cleared together > and all-visible is superset of all-frozen (means if all-frozen is set, > then all-visible must be set), so it is sufficient to check visibility > info in above situation, but I feel we can update the comment to > indicate the same and add an Assert to ensure if all-frozen is set > all-visibile must be set. I agree. I added Assert() macro into lazy_scan_heap() and some comments. Attached v23 patch. Regards, -- Masahiko Sawada
Attachment
On Tue, Nov 3, 2015 at 09:03:49AM +0530, Amit Kapila wrote: > I think in that case we can call it as page info map or page state map, but > I find retaining visibility map name in this case or for future (if we want to > add another bit) as confusing. In-fact if you find "visibility and freeze > map", > as excessively long, then we can change it to "page info map" or "page state > map" now as well. Coming in late here, but the problem with "page info map" is that free space is also page info (how much free space on each page), so "page info map" isn't very descriptive. "page status" or "page state" might make more sense, but even then, free space is a kind of page status/state. What is happening is that broadening the name to cover both visibility and freeze state also encompasses free space. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + As you are, so once was I. As I am, so you will be. + + Roman grave inscription +
On 2015-10-31 11:02:12 +0530, Amit Kapila wrote:
> On Thu, Oct 8, 2015 at 11:05 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> >
> > On 1 October 2015 at 23:30, Josh Berkus <josh@agliodbs.com> wrote:
> >>
> >> On 10/01/2015 07:43 AM, Robert Haas wrote:
> >> > On Thu, Oct 1, 2015 at 9:44 AM, Fujii Masao <masao.fujii@gmail.com> wrote:
> >> >> I wonder how much it's worth renaming only the file extension while
> >> >> there are many places where "visibility map" and "vm" are used,
> >> >> for example, log messages, function names, variables, etc.
> >> >
> >> > I'd be inclined to keep calling it the visibility map (vm) even if it
> >> > also contains freeze information.
> >> >
>
> What is your main worry about changing the name of this map, is it
> about more code churn or is it about that we might introduce new issues
> or is it about that people are already accustomed to call this map as
> visibility map?

Several:
* Visibility map is rather descriptive, none of the replacement terms
imo come close. Few people will know what a 'freeze' map is.
* It increases the size of the patch considerably
* It forces tooling that knows about the layout of the database
directory to change their tools

On the benefit side the only argument I've heard so far is that it allows
to disambiguate the format. But, uh, a look at the major version does
that just as well, for far less trouble.

> It seems to me quite logical for understanding purpose as well. Any new
> person who wants to work in this area or is looking into it will always
> wonder why this map is named as visibility map even though it contains
> information about visibility of page as well as frozen state of page.

Being frozen is about visibility as well.

Greetings,

Andres Freund
On Sat, Nov 14, 2015 at 1:12 AM, Bruce Momjian <bruce@momjian.us> wrote:
>
> On Tue, Nov 3, 2015 at 09:03:49AM +0530, Amit Kapila wrote:
> > I think in that case we can call it as page info map or page state map, but
> > I find retaining visibility map name in this case or for future (if we want to
> > add another bit) as confusing. In-fact if you find "visibility and freeze
> > map",
> > as excessively long, then we can change it to "page info map" or "page state
> > map" now as well.
>
> Coming in late here, but the problem with "page info map" is that free
> space is also page info (how much free space on each page), so "page
> info map" isn't very descriptive. "page status" or "page state" might
> make more sense, but even then, free space is a kind of page
> status/state. What is happening is that broadening the name to cover
> both visibility and freeze state also encompasses free space.
>
Valid point, but I think the free space map is a specific piece of page
information stored in a completely different format, whereas a "page
info"/"page state" map could contain information about multiple states of
a page in the same format. There is yet another option of changing it to
"visibility and freeze map" and changing the file extension to "vfm", but
Robert felt that is a rather long name, and I agree with him.
Do you see retaining the visibility map name as the better option?
On Sat, Nov 14, 2015 at 1:19 AM, Andres Freund <andres@anarazel.de> wrote:
> On 2015-10-31 11:02:12 +0530, Amit Kapila wrote:
> > On Thu, Oct 8, 2015 at 11:05 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> > >
> > > On 1 October 2015 at 23:30, Josh Berkus <josh@agliodbs.com> wrote:
> > >>
> > >> On 10/01/2015 07:43 AM, Robert Haas wrote:
> > >> > On Thu, Oct 1, 2015 at 9:44 AM, Fujii Masao <masao.fujii@gmail.com>
> > wrote:
> > >> >> I wonder how much it's worth renaming only the file extension while
> > >> >> there are many places where "visibility map" and "vm" are used,
> > >> >> for example, log messages, function names, variables, etc.
> > >> >
> > >> > I'd be inclined to keep calling it the visibility map (vm) even if it
> > >> > also contains freeze information.
> > >> >
> >
> > What is your main worry about changing the name of this map, is it
> > about more code churn or is it about that we might introduce new issues
> > or is it about that people are already accustomed to call this map as
> > visibility map?
>
> Several:
> * Visibility map is rather descriptive, none of the replacement terms
> imo come close. Few people will know what a 'freeze' map is.
> * It increases the size of the patch considerably
> * It forces tooling that knows about the layout of the database
> directory to change their tools
>
All these points are legitimate.
> On the benefit side the only argument I've heard so far is that it allows
> to disambiguate the format. But, uh, a look at the major version does
> that just as well, for far less trouble.
>
> > It seems to me quite logical for understanding purpose as well. Any new
> > person who wants to work in this area or is looking into it will always
> > wonder why this map is named as visibility map even though it contains
> > information about visibility of page as well as frozen state of page.
>
> Being frozen is about visibility as well.
>
OTOH, being visible doesn't mean the page is frozen. I understand that
frozen is related to visibility, but it is still a separate state of the
page, used for a different purpose. I think this is a subjective point and
we could go either way; it is just a matter of which way more people are
comfortable.
On Sun, Nov 15, 2015 at 1:47 AM, Amit Kapila <amit.kapila16@gmail.com> wrote: > On Sat, Nov 14, 2015 at 1:19 AM, Andres Freund <andres@anarazel.de> wrote: >> On 2015-10-31 11:02:12 +0530, Amit Kapila wrote: >> > On Thu, Oct 8, 2015 at 11:05 PM, Simon Riggs <simon@2ndquadrant.com> >> > wrote: >> > > >> > > On 1 October 2015 at 23:30, Josh Berkus <josh@agliodbs.com> wrote: >> > >> >> > >> On 10/01/2015 07:43 AM, Robert Haas wrote: >> > >> > On Thu, Oct 1, 2015 at 9:44 AM, Fujii Masao <masao.fujii@gmail.com> >> > wrote: >> > >> >> I wonder how much it's worth renaming only the file extension >> > >> >> while >> > >> >> there are many places where "visibility map" and "vm" are used, >> > >> >> for example, log messages, function names, variables, etc. >> > >> > >> > >> > I'd be inclined to keep calling it the visibility map (vm) even if >> > >> > it >> > >> > also contains freeze information. >> > >> > >> > >> > What is your main worry about changing the name of this map, is it >> > about more code churn or is it about that we might introduce new issues >> > or is it about that people are already accustomed to call this map as >> > visibility map? >> >> Several: >> * Visibility map is rather descriptive, none of the replacement terms >> imo come close. Few people will know what a 'freeze' map is. >> * It increases the size of the patch considerably >> * It forces tooling that knows about the layout of the database >> directory to change their tools >> > > All these points are legitimate. > >> On the benfit side the only argument I've heard so far is that it allows >> to disambiguate the format. But, uh, a look at the major version does >> that just as well, for far less trouble. >> >> > It seems to me quite logical for understanding purpose as well. Any new >> > person who wants to work in this area or is looking into it will always >> > wonder why this map is named as visibility map even though it contains >> > information about visibility of page as well as frozen state of page. >> >> Being frozen is about visibility as well. >> > > OTOH being visible doesn't mean page is frozen. I understand that frozen is > related to visibility, but still it is a separate state of page and used for > different > purpose. I think this is a subjective point and we could go either way, it > is > just a matter in which way more people are comfortable. I'm stickin' with what I said before, and what I think Andres is saying too: renaming the map is a horrible idea. It produces lots of code churn for no real benefit. We usually avoid such changes, and I think we should do so here, too. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Tue, Nov 17, 2015 at 10:45 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Sun, Nov 15, 2015 at 1:47 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> On Sat, Nov 14, 2015 at 1:19 AM, Andres Freund <andres@anarazel.de> wrote:
>>> On 2015-10-31 11:02:12 +0530, Amit Kapila wrote:
>>> > On Thu, Oct 8, 2015 at 11:05 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
>>> > >
>>> > > On 1 October 2015 at 23:30, Josh Berkus <josh@agliodbs.com> wrote:
>>> > >>
>>> > >> On 10/01/2015 07:43 AM, Robert Haas wrote:
>>> > >> > On Thu, Oct 1, 2015 at 9:44 AM, Fujii Masao <masao.fujii@gmail.com> wrote:
>>> > >> >> I wonder how much it's worth renaming only the file extension while
>>> > >> >> there are many places where "visibility map" and "vm" are used,
>>> > >> >> for example, log messages, function names, variables, etc.
>>> > >> >
>>> > >> > I'd be inclined to keep calling it the visibility map (vm) even if it
>>> > >> > also contains freeze information.
>>> > >> >
>>> >
>>> > What is your main worry about changing the name of this map, is it
>>> > about more code churn or is it about that we might introduce new issues
>>> > or is it about that people are already accustomed to call this map as
>>> > visibility map?
>>>
>>> Several:
>>> * Visibility map is rather descriptive, none of the replacement terms
>>> imo come close. Few people will know what a 'freeze' map is.
>>> * It increases the size of the patch considerably
>>> * It forces tooling that knows about the layout of the database
>>> directory to change their tools
>>>
>>
>> All these points are legitimate.
>>
>>> On the benefit side the only argument I've heard so far is that it allows
>>> to disambiguate the format. But, uh, a look at the major version does
>>> that just as well, for far less trouble.
>>>
>>> > It seems to me quite logical for understanding purpose as well. Any new
>>> > person who wants to work in this area or is looking into it will always
>>> > wonder why this map is named as visibility map even though it contains
>>> > information about visibility of page as well as frozen state of page.
>>>
>>> Being frozen is about visibility as well.
>>>
>>
>> OTOH being visible doesn't mean page is frozen. I understand that frozen
>> is related to visibility, but still it is a separate state of page and
>> used for different purpose. I think this is a subjective point and we
>> could go either way, it is just a matter in which way more people are
>> comfortable.
>
> I'm stickin' with what I said before, and what I think Andres is
> saying too: renaming the map is a horrible idea. It produces lots of
> code churn for no real benefit. We usually avoid such changes, and I
> think we should do so here, too.

I understood.
I'm going to turn the patch back to visibility map, and just add the logic
of enhancement of VACUUM FREEZE.
If we want to add other status not related to visibility into the
visibility map in the future, it would be worth considering.

Regards,

--
Masahiko Sawada
On 17 November 2015 at 10:29, Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > On Tue, Nov 17, 2015 at 10:45 AM, Robert Haas <robertmhaas@gmail.com> wrote: >> On Sun, Nov 15, 2015 at 1:47 AM, Amit Kapila <amit.kapila16@gmail.com> >> wrote: >>> On Sat, Nov 14, 2015 at 1:19 AM, Andres Freund <andres@anarazel.de> >>> wrote: >>>> On 2015-10-31 11:02:12 +0530, Amit Kapila wrote: >>>> > On Thu, Oct 8, 2015 at 11:05 PM, Simon Riggs <simon@2ndquadrant.com> >>>> > wrote: >>>> > > >>>> > > On 1 October 2015 at 23:30, Josh Berkus <josh@agliodbs.com> wrote: >>>> > >> >>>> > >> On 10/01/2015 07:43 AM, Robert Haas wrote: >>>> > >> > On Thu, Oct 1, 2015 at 9:44 AM, Fujii Masao >>>> > >> > <masao.fujii@gmail.com> >>>> > wrote: >>>> > >> >> I wonder how much it's worth renaming only the file extension >>>> > >> >> while >>>> > >> >> there are many places where "visibility map" and "vm" are used, >>>> > >> >> for example, log messages, function names, variables, etc. >>>> > >> > >>>> > >> > I'd be inclined to keep calling it the visibility map (vm) even >>>> > >> > if >>>> > >> > it >>>> > >> > also contains freeze information. >>>> > >> > >>>> > >>>> > What is your main worry about changing the name of this map, is it >>>> > about more code churn or is it about that we might introduce new >>>> > issues >>>> > or is it about that people are already accustomed to call this map as >>>> > visibility map? >>>> >>>> Several: >>>> * Visibility map is rather descriptive, none of the replacement terms >>>> imo come close. Few people will know what a 'freeze' map is. >>>> * It increases the size of the patch considerably >>>> * It forces tooling that knows about the layout of the database >>>> directory to change their tools >>>> >>> >>> All these points are legitimate. >>> >>>> On the benfit side the only argument I've heard so far is that it allows >>>> to disambiguate the format. But, uh, a look at the major version does >>>> that just as well, for far less trouble. >>>> >>>> > It seems to me quite logical for understanding purpose as well. Any >>>> > new >>>> > person who wants to work in this area or is looking into it will >>>> > always >>>> > wonder why this map is named as visibility map even though it contains >>>> > information about visibility of page as well as frozen state of page. >>>> >>>> Being frozen is about visibility as well. >>>> >>> >>> OTOH being visible doesn't mean page is frozen. I understand that frozen >>> is >>> related to visibility, but still it is a separate state of page and used >>> for >>> different >>> purpose. I think this is a subjective point and we could go either way, >>> it >>> is >>> just a matter in which way more people are comfortable. >> >> I'm stickin' with what I said before, and what I think Andres is >> saying too: renaming the map is a horrible idea. It produces lots of >> code churn for no real benefit. We usually avoid such changes, and I >> think we should do so here, too. > > I understood. > I'm going to turn the patch back to visibility map, and just add the logic > of enhancement of VACUUM FREEZE. > If we want to add the other status not related to visibility into visibility > map in the future, it would be worth to consider. Could someone post a TL;DR summary of what the current plan looks like? I can see there is a huge amount of discussion to trawl back through. I can see it's something to do with the visibility map. And does it avoid freezing very large tables like the title originally sought? Thanks Thom
On 11/17/15 4:41 AM, Thom Brown wrote: > Could someone post a TL;DR summary of what the current plan looks > like? I can see there is a huge amount of discussion to trawl back > through. I can see it's something to do with the visibility map. And > does it avoid freezing very large tables like the title originally > sought? Basically, it follows the same pattern that all-visible bits do, except instead of indicating a page is all-visible, the bit shows that all tuples on the page are frozen. That allows a scan_all vacuum to skip those pages. -- Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX Experts in Analytics, Data Architecture and PostgreSQL Data in Trouble? Get it in Treble! http://BlueTreble.com
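A rough sketch of the skip logic that summary describes, assuming per-block probes like those discussed earlier in the thread; the helper names below are stand-ins for illustration, not the patch's actual functions:

    #include <stdbool.h>
    #include <stdint.h>

    /* assumed stand-ins for the patch's visibility map probes */
    bool vm_all_visible(uint32_t blkno);
    bool vm_all_frozen(uint32_t blkno);
    void scan_block(uint32_t blkno);   /* prune, freeze, gather stats */

    /*
     * A freeze (scan_all) vacuum may now skip pages whose all-frozen bit
     * is set, just as a regular vacuum already skips all-visible pages.
     */
    void
    vacuum_scan(uint32_t nblocks, bool scan_all)
    {
        for (uint32_t blkno = 0; blkno < nblocks; blkno++)
        {
            if (scan_all ? vm_all_frozen(blkno) : vm_all_visible(blkno))
                continue;
            scan_block(blkno);
        }
    }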
On 17 November 2015 at 15:43, Jim Nasby <Jim.Nasby@bluetreble.com> wrote: > On 11/17/15 4:41 AM, Thom Brown wrote: >> >> Could someone post a TL;DR summary of what the current plan looks >> like? I can see there is a huge amount of discussion to trawl back >> through. I can see it's something to do with the visibility map. And >> does it avoid freezing very large tables like the title originally >> sought? > > > Basically, it follows the same pattern that all-visible bits do, except > instead of indicating a page is all-visible, the bit shows that all tuples > on the page are frozen. That allows a scan_all vacuum to skip those pages. So the visibility map is being repurposed? And if a row on a frozen page is modified, what happens to the visibility of all other rows on that page, as the bit will be set back to 0? I think I'm missing a critical part of this functionality. Thom
On Wed, Nov 18, 2015 at 12:56 AM, Thom Brown <thom@linux.com> wrote:
> On 17 November 2015 at 15:43, Jim Nasby <Jim.Nasby@bluetreble.com> wrote:
>> On 11/17/15 4:41 AM, Thom Brown wrote:
>>> Could someone post a TL;DR summary of what the current plan looks
>>> like? I can see there is a huge amount of discussion to trawl back
>>> through. I can see it's something to do with the visibility map. And
>>> does it avoid freezing very large tables like the title originally
>>> sought?
>>
>> Basically, it follows the same pattern that all-visible bits do, except
>> instead of indicating a page is all-visible, the bit shows that all tuples
>> on the page are frozen. That allows a scan_all vacuum to skip those pages.
>
> So the visibility map is being repurposed?

My proposal is to add one additional bit, indicating that all tuples on the
page are completely frozen, into the visibility map. That is, the visibility
map will become a bitmap with two bits (all-visible, all-frozen) per page.

> And if a row on a frozen
> page is modified, what happens to the visibility of all other rows on
> that page, as the bit will be set back to 0?

In this case, both corresponding VM bits are cleared. Such behaviour is
almost the same as what PostgreSQL is doing today.

Regards,

--
Masahiko Sawada
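The clearing behaviour described above, sketched against the same simplified byte-array representation used earlier; the patch's real visibilitymap_clear() works on a map page in a shared buffer, so this is only an illustration of the bit arithmetic:

    #include <stdint.h>

    #define BITS_PER_HEAPBLOCK        2
    #define HEAPBLOCKS_PER_BYTE       4
    #define VISIBILITYMAP_VALID_BITS  0x03   /* all-visible | all-frozen */

    /* any modification of a heap page clears both of its map bits at once */
    static void
    vm_clear(uint8_t *map, uint32_t heapBlk)
    {
        uint32_t mapBit = (heapBlk % HEAPBLOCKS_PER_BYTE) * BITS_PER_HEAPBLOCK;

        map[heapBlk / HEAPBLOCKS_PER_BYTE] &=
            (uint8_t) ~(VISIBILITYMAP_VALID_BITS << mapBit);
    }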
On Tue, Nov 17, 2015 at 7:29 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > On Tue, Nov 17, 2015 at 10:45 AM, Robert Haas <robertmhaas@gmail.com> wrote: >> On Sun, Nov 15, 2015 at 1:47 AM, Amit Kapila <amit.kapila16@gmail.com> >> wrote: >>> On Sat, Nov 14, 2015 at 1:19 AM, Andres Freund <andres@anarazel.de> >>> wrote: >>>> On 2015-10-31 11:02:12 +0530, Amit Kapila wrote: >>>> > On Thu, Oct 8, 2015 at 11:05 PM, Simon Riggs <simon@2ndquadrant.com> >>>> > wrote: >>>> > > >>>> > > On 1 October 2015 at 23:30, Josh Berkus <josh@agliodbs.com> wrote: >>>> > >> >>>> > >> On 10/01/2015 07:43 AM, Robert Haas wrote: >>>> > >> > On Thu, Oct 1, 2015 at 9:44 AM, Fujii Masao >>>> > >> > <masao.fujii@gmail.com> >>>> > wrote: >>>> > >> >> I wonder how much it's worth renaming only the file extension >>>> > >> >> while >>>> > >> >> there are many places where "visibility map" and "vm" are used, >>>> > >> >> for example, log messages, function names, variables, etc. >>>> > >> > >>>> > >> > I'd be inclined to keep calling it the visibility map (vm) even >>>> > >> > if >>>> > >> > it >>>> > >> > also contains freeze information. >>>> > >> > >>>> > >>>> > What is your main worry about changing the name of this map, is it >>>> > about more code churn or is it about that we might introduce new >>>> > issues >>>> > or is it about that people are already accustomed to call this map as >>>> > visibility map? >>>> >>>> Several: >>>> * Visibility map is rather descriptive, none of the replacement terms >>>> imo come close. Few people will know what a 'freeze' map is. >>>> * It increases the size of the patch considerably >>>> * It forces tooling that knows about the layout of the database >>>> directory to change their tools >>>> >>> >>> All these points are legitimate. >>> >>>> On the benfit side the only argument I've heard so far is that it allows >>>> to disambiguate the format. But, uh, a look at the major version does >>>> that just as well, for far less trouble. >>>> >>>> > It seems to me quite logical for understanding purpose as well. Any >>>> > new >>>> > person who wants to work in this area or is looking into it will >>>> > always >>>> > wonder why this map is named as visibility map even though it contains >>>> > information about visibility of page as well as frozen state of page. >>>> >>>> Being frozen is about visibility as well. >>>> >>> >>> OTOH being visible doesn't mean page is frozen. I understand that frozen >>> is >>> related to visibility, but still it is a separate state of page and used >>> for >>> different >>> purpose. I think this is a subjective point and we could go either way, >>> it >>> is >>> just a matter in which way more people are comfortable. >> >> I'm stickin' with what I said before, and what I think Andres is >> saying too: renaming the map is a horrible idea. It produces lots of >> code churn for no real benefit. We usually avoid such changes, and I >> think we should do so here, too. > > I understood. > I'm going to turn the patch back to visibility map, and just add the logic > of enhancement of VACUUM FREEZE. Attached latest v24 patch. I've changed patch so that just adding frozen bit into visibility map. So the size of patch is almost half of previous one. Please review it. Regards, -- Masahiko Sawada
Attachment
On Tue, Nov 17, 2015 at 10:32 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> Attached is the latest v24 patch.
> I've changed the patch so that it just adds the frozen bit to the
> visibility map, so the patch is almost half the size of the previous one.

Should there be an Assert in visibilitymap_get_status (or elsewhere)
against the impossible state of being all-frozen but not all-visible?

I get an error when running pg_upgrade from 9.4 to 9.6 - this error:

error while copying relation "mediawiki.archive"
("/tmp/data/base/16414/21043_vm" to
"/tmp/data_fm/base/16400/21043_vm"): No such file or directory

Cheers,

Jeff
On Wed, Nov 18, 2015 at 11:18 AM, Jeff Janes <jeff.janes@gmail.com> wrote:
>
> I get an error when running pg_upgrade from 9.4 to 9.6 - this error:
>
> error while copying relation "mediawiki.archive"
> ("/tmp/data/base/16414/21043_vm" to
> "/tmp/data_fm/base/16400/21043_vm"): No such file or directory

OK, so the problem seems to be that rewriteVisibilitymap can get
called with errno already set to a nonzero value.

It never clears it, and then fails at the end even though no error
has actually occurred.

Just setting it to 0 at the top of the function seems to be the
correct thing to do. Or does it need to save the old value and
restore it?

But now when I want to do the upgrade faster, I run into this:

"This utility cannot upgrade from PostgreSQL version from 9.5 or
before to 9.6 or later with link mode."

Is this really an acceptable tradeoff? Surely we can arrange to
link everything else and rewrite just the _vm, which is a tiny portion
of the data directory. I don't think that -k promises to link
everything it possibly can.

Cheers,

Jeff
On Thu, Nov 19, 2015 at 5:54 AM, Jeff Janes <jeff.janes@gmail.com> wrote:
> On Wed, Nov 18, 2015 at 11:18 AM, Jeff Janes <jeff.janes@gmail.com> wrote:
>>
>> I get an error when running pg_upgrade from 9.4 to 9.6 - this error:
>>
>> error while copying relation "mediawiki.archive"
>> ("/tmp/data/base/16414/21043_vm" to
>> "/tmp/data_fm/base/16400/21043_vm"): No such file or directory
>
> OK, so the problem seems to be that rewriteVisibilitymap can get
> called with errno already set to a nonzero value.
>
> It never clears it, and then fails at the end even though no error
> has actually occurred.
>
> Just setting it to 0 at the top of the function seems to be the
> correct thing to do. Or does it need to save the old value and
> restore it?

Thank you for testing!
I think the former is better, so I've attached the latest patch.

> But now when I want to do the upgrade faster, I run into this:
>
> "This utility cannot upgrade from PostgreSQL version from 9.5 or
> before to 9.6 or later with link mode."
>
> Is this really an acceptable tradeoff? Surely we can arrange to
> link everything else and rewrite just the _vm, which is a tiny portion
> of the data directory. I don't think that -k promises to link
> everything it possibly can.

I agree.
I've changed the patch so that pg_upgrade creates and rewrites the new
_vm file even when upgrading to 9.6 with link mode.

Regards,

--
Masahiko Sawada
Attachment
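For reference, the shape of the errno fix being discussed is just clearing
errno at function entry: read() signals EOF by returning 0 without touching
errno, so a stale value left by an earlier syscall can be mistaken for a
failure. A minimal, self-contained sketch (the loop below is illustrative,
not the patch's actual rewriteVisibilitymap body):

    #include <errno.h>
    #include <unistd.h>

    /*
     * Illustrative copy loop: clear any stale errno before the loop so
     * a later "did anything fail?" check cannot report a phantom error.
     */
    static int
    copy_file_contents(int src_fd, int dst_fd)
    {
        char        buf[8192];
        ssize_t     nread;

        errno = 0;              /* the one-line fix discussed above */
        while ((nread = read(src_fd, buf, sizeof(buf))) > 0)
        {
            if (write(dst_fd, buf, (size_t) nread) != nread)
                return -1;      /* write error; errno describes it */
        }
        return (nread < 0) ? -1 : 0;    /* nread == 0 is a clean EOF */
    }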
On Thu, Nov 19, 2015 at 6:44 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: > On Thu, Nov 19, 2015 at 5:54 AM, Jeff Janes <jeff.janes@gmail.com> wrote: >> On Wed, Nov 18, 2015 at 11:18 AM, Jeff Janes <jeff.janes@gmail.com> wrote: >>> >>> I get an error when running pg_upgrade from 9.4 to 9.6-this >>> >>> error while copying relation "mediawiki.archive" >>> ("/tmp/data/base/16414/21043_vm" to >>> "/tmp/data_fm/base/16400/21043_vm"): No such file or directory >> >> OK, so the problem seems to be that rewriteVisibilitymap can get >> called with errno already set to a nonzero value. >> >> It never clears it, and then fails at the end despite that no error >> has actually occurred. >> >> Just setting it to 0 at the top of the function seems to be correct >> thing to do. Or does it need to save the old value and restore it? > > Thank you for testing! > I think that the former is better, so attached latest patch. > >> But now when I want to do the upgrade faster, I run into this: >> >> "This utility cannot upgrade from PostgreSQL version from 9.5 or >> before to 9.6 or later with link mode." >> >> Is this really an acceptable a tradeoff? Surely we can arrange to >> link everything else and rewrite just the _vm, which is a tiny portion >> of the data directory. I don't think that -k promises to link >> everything it possibly can. > > I agree. > I've changed the patch so that. > pg_upgarde creates new _vm file and rewrites it even if upgrading to > 9.6 with link mode. The rewrite code thinks that only the first page of a vm has a header of size SizeOfPageHeaderData, and the rest of the pages have a zero size header. So the resulting _vm is corrupt. After pg_upgrade, doing a vacuum freeze verbose gives: WARNING: invalid page in block 1 of relation base/16402/22430_vm; zeroing out page WARNING: invalid page in block 1 of relation base/16402/22430_vm; zeroing out page WARNING: invalid page in block 2 of relation base/16402/22430_vm; zeroing out page WARNING: invalid page in block 2 of relation base/16402/22430_vm; zeroing out page WARNING: invalid page in block 3 of relation base/16402/22430_vm; zeroing out page WARNING: invalid page in block 3 of relation base/16402/22430_vm; zeroing out page WARNING: invalid page in block 4 of relation base/16402/22430_vm; zeroing out page WARNING: invalid page in block 4 of relation base/16402/22430_vm; zeroing out page WARNING: invalid page in block 5 of relation base/16402/22430_vm; zeroing out page WARNING: invalid page in block 5 of relation base/16402/22430_vm; zeroing out page WARNING: invalid page in block 6 of relation base/16402/22430_vm; zeroing out page WARNING: invalid page in block 6 of relation base/16402/22430_vm; zeroing out page WARNING: invalid page in block 7 of relation base/16402/22430_vm; zeroing out page WARNING: invalid page in block 7 of relation base/16402/22430_vm; zeroing out page WARNING: invalid page in block 8 of relation base/16402/22430_vm; zeroing out page WARNING: invalid page in block 8 of relation base/16402/22430_vm; zeroing out page Cheers, Jeff
On Sat, Nov 21, 2015 at 6:50 AM, Jeff Janes <jeff.janes@gmail.com> wrote: > On Thu, Nov 19, 2015 at 6:44 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: >> On Thu, Nov 19, 2015 at 5:54 AM, Jeff Janes <jeff.janes@gmail.com> wrote: >>> On Wed, Nov 18, 2015 at 11:18 AM, Jeff Janes <jeff.janes@gmail.com> wrote: >>>> >>>> I get an error when running pg_upgrade from 9.4 to 9.6-this >>>> >>>> error while copying relation "mediawiki.archive" >>>> ("/tmp/data/base/16414/21043_vm" to >>>> "/tmp/data_fm/base/16400/21043_vm"): No such file or directory >>> >>> OK, so the problem seems to be that rewriteVisibilitymap can get >>> called with errno already set to a nonzero value. >>> >>> It never clears it, and then fails at the end despite that no error >>> has actually occurred. >>> >>> Just setting it to 0 at the top of the function seems to be correct >>> thing to do. Or does it need to save the old value and restore it? >> >> Thank you for testing! >> I think that the former is better, so attached latest patch. >> >>> But now when I want to do the upgrade faster, I run into this: >>> >>> "This utility cannot upgrade from PostgreSQL version from 9.5 or >>> before to 9.6 or later with link mode." >>> >>> Is this really an acceptable a tradeoff? Surely we can arrange to >>> link everything else and rewrite just the _vm, which is a tiny portion >>> of the data directory. I don't think that -k promises to link >>> everything it possibly can. >> >> I agree. >> I've changed the patch so that. >> pg_upgarde creates new _vm file and rewrites it even if upgrading to >> 9.6 with link mode. > > > The rewrite code thinks that only the first page of a vm has a header > of size SizeOfPageHeaderData, and the rest of the pages have a zero > size header. So the resulting _vm is corrupt. > > After pg_upgrade, doing a vacuum freeze verbose gives: > > > WARNING: invalid page in block 1 of relation base/16402/22430_vm; > zeroing out page > WARNING: invalid page in block 1 of relation base/16402/22430_vm; > zeroing out page > WARNING: invalid page in block 2 of relation base/16402/22430_vm; > zeroing out page > WARNING: invalid page in block 2 of relation base/16402/22430_vm; > zeroing out page > WARNING: invalid page in block 3 of relation base/16402/22430_vm; > zeroing out page > WARNING: invalid page in block 3 of relation base/16402/22430_vm; > zeroing out page > WARNING: invalid page in block 4 of relation base/16402/22430_vm; > zeroing out page > WARNING: invalid page in block 4 of relation base/16402/22430_vm; > zeroing out page > WARNING: invalid page in block 5 of relation base/16402/22430_vm; > zeroing out page > WARNING: invalid page in block 5 of relation base/16402/22430_vm; > zeroing out page > WARNING: invalid page in block 6 of relation base/16402/22430_vm; > zeroing out page > WARNING: invalid page in block 6 of relation base/16402/22430_vm; > zeroing out page > WARNING: invalid page in block 7 of relation base/16402/22430_vm; > zeroing out page > WARNING: invalid page in block 7 of relation base/16402/22430_vm; > zeroing out page > WARNING: invalid page in block 8 of relation base/16402/22430_vm; > zeroing out page > WARNING: invalid page in block 8 of relation base/16402/22430_vm; > zeroing out page > Thank you for taking the time to review this patch! The updated version patch is attached. Regards, -- Masahiko Sawada
Attachment
On Sun, Nov 22, 2015 at 8:16 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:

> Thank you for taking the time to review this patch!
> The updated version patch is attached.

I am skeptical about just copying the old page header to be two new
page headers. I don't know what the implications of this will be for
pd_lsn. Since pg_upgrade can only run on a cluster that was cleanly
shut down, I think that just copying it from the old page to both new
pages it turns into might be fine. But pd_checksum will certainly be
wrong, breaking pg_upgrade for cases where checksums are turned on.
It needs to be recomputed on both new pages. It looks like there is
no precedent for doing that in pg_upgrade, so this will be breaking
new ground.

Cheers,

Jeff
On Mon, Nov 23, 2015 at 6:27 AM, Jeff Janes <jeff.janes@gmail.com> wrote:
> On Sun, Nov 22, 2015 at 8:16 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
>> Thank you for taking the time to review this patch!
>> The updated version patch is attached.
>
> I am skeptical about just copying the old page header to be two new
> page headers. I don't know what the implications of this will be for
> pd_lsn. Since pg_upgrade can only run on a cluster that was cleanly
> shut down, I think that just copying it from the old page to both new
> pages it turns into might be fine. But pd_checksum will certainly be
> wrong, breaking pg_upgrade for cases where checksums are turned on.
> It needs to be recomputed on both new pages. It looks like there is
> no precedent for doing that in pg_upgrade, so this will be breaking
> new ground.

Yeah, we need to recompute the checksum if checksums are enabled.
I've changed the patch accordingly; it's attached.
Please review it.

Regards,

--
Masahiko Sawada
Attachment
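As a point of reference, stamping a fresh checksum on a rewritten VM page
would look roughly like the sketch below. pg_checksum_page() is the
existing helper from storage/checksum.h; the wrapper and its name are
hypothetical:

    #include "postgres_fe.h"
    #include "storage/bufpage.h"
    #include "storage/checksum.h"

    /*
     * Hypothetical helper: after filling in a rewritten VM page, compute
     * its checksum for its position in the *new* file.  The checksum
     * algorithm mixes in the block number, so it must be computed with
     * the new block number, not the block number the data came from.
     */
    static void
    stamp_vm_page_checksum(Page page, BlockNumber new_blkno,
                           bool data_checksums)
    {
        ((PageHeader) page)->pd_checksum = 0;
        if (data_checksums)
            ((PageHeader) page)->pd_checksum =
                pg_checksum_page((char *) page, new_blkno);
    }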
On Tue, Nov 24, 2015 at 3:13 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> Yeah, we need to recompute the checksum if checksums are enabled.
> I've changed the patch accordingly; it's attached.
> Please review it.

Thanks for the update. This now conflicts with the updates done to
fix the pg_upgrade out-of-space issue on Windows. I've fixed (I think)
the conflict in order to do some testing, but I'd like to get an
updated patch from the author in case I did it wrong. I don't want to
find bugs that I just introduced myself.

Thanks,

Jeff
On Sun, Nov 29, 2015 at 2:21 AM, Jeff Janes <jeff.janes@gmail.com> wrote:
> On Tue, Nov 24, 2015 at 3:13 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>>
>> Yeah, we need to recompute the checksum if checksums are enabled.
>> I've changed the patch accordingly; it's attached.
>> Please review it.
>
> Thanks for the update. This now conflicts with the updates done to
> fix the pg_upgrade out-of-space issue on Windows. I've fixed (I think)
> the conflict in order to do some testing, but I'd like to get an
> updated patch from the author in case I did it wrong. I don't want to
> find bugs that I just introduced myself.

Thank you for having a look.

Attached is the updated v28 patch.
Please review it.

Regards,

--
Masahiko Sawada
Attachment
On Mon, Nov 30, 2015 at 10:48:04PM +0530, Masahiko Sawada wrote:
> On Sun, Nov 29, 2015 at 2:21 AM, Jeff Janes <jeff.janes@gmail.com> wrote:
> > On Tue, Nov 24, 2015 at 3:13 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >>
> >> Yeah, we need to recompute the checksum if checksums are enabled.
> >> I've changed the patch accordingly; it's attached.
> >> Please review it.
> >
> > Thanks for the update. This now conflicts with the updates done to
> > fix the pg_upgrade out-of-space issue on Windows. I've fixed (I think)
> > the conflict in order to do some testing, but I'd like to get an
> > updated patch from the author in case I did it wrong. I don't want to
> > find bugs that I just introduced myself.
>
> Thank you for having a look.

I would not bother mentioning this detail in the pg_upgrade manual page:

+ Since the format of visibility map has been changed in version 9.6,
+ <application>pg_upgrade</> creates and rewrite new <literal>'_vm'</literal>
+ file even if upgrading from 9.5 or before to 9.6 or later with link mode (-k).

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ As you are, so once was I. As I am, so you will be. +
+ Roman grave inscription +
On 2015-11-30 12:58:43 -0500, Bruce Momjian wrote: > I would not bother mentioning this detail in the pg_upgrade manual page: > > + Since the format of visibility map has been changed in version 9.6, > + <application>pg_upgrade</> creates and rewrite new <literal>'_vm'</literal> > + file even if upgrading from 9.5 or before to 9.6 or later with link mode (-k). Might be worthwhile to keep as that influences the runtime for link mode when migrating <9.6 -> 9.6.
On Mon, Nov 30, 2015 at 07:05:21PM +0100, Andres Freund wrote:
> On 2015-11-30 12:58:43 -0500, Bruce Momjian wrote:
> > I would not bother mentioning this detail in the pg_upgrade manual page:
> >
> > + Since the format of visibility map has been changed in version 9.6,
> > + <application>pg_upgrade</> creates and rewrite new <literal>'_vm'</literal>
> > + file even if upgrading from 9.5 or before to 9.6 or later with link mode (-k).
>
> Might be worthwhile to keep as that influences the runtime for link mode
> when migrating <9.6 -> 9.6.

It is hard to see how it would add measurable time. The pg_upgrade docs
are already very long, and this detail doesn't seem significant. Can
someone test the overhead?

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ As you are, so once was I. As I am, so you will be. +
+ Roman grave inscription +
On Mon, Nov 30, 2015 at 9:18 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> On Sun, Nov 29, 2015 at 2:21 AM, Jeff Janes <jeff.janes@gmail.com> wrote:
>> On Tue, Nov 24, 2015 at 3:13 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>>>
>>> Yeah, we need to recompute the checksum if checksums are enabled.
>>> I've changed the patch accordingly; it's attached.
>>> Please review it.
>>
>> Thanks for the update. This now conflicts with the updates done to
>> fix the pg_upgrade out-of-space issue on Windows. I've fixed (I think)
>> the conflict in order to do some testing, but I'd like to get an
>> updated patch from the author in case I did it wrong. I don't want to
>> find bugs that I just introduced myself.
>
> Thank you for having a look.
>
> Attached is the updated v28 patch.
> Please review it.
>
> Regards,

After running pg_upgrade, if I manually vacuum a table I start getting
warnings:

WARNING: page is not marked all-visible (and all-frozen) but
visibility map bit(s) is set in relation "foo" page 32756
WARNING: page is not marked all-visible (and all-frozen) but
visibility map bit(s) is set in relation "foo" page 32756
WARNING: page is not marked all-visible (and all-frozen) but
visibility map bit(s) is set in relation "foo" page 32757
WARNING: page is not marked all-visible (and all-frozen) but
visibility map bit(s) is set in relation "foo" page 32757

The warnings start right where the blocks would begin using the 2nd
page of the _vm, so I think the problem is there. And looking at the
code, I think that "cur += SizeOfPageHeaderData;" in the inner loop
cannot be correct. We can't skip a header in the current (old) block
each time we reach the end of the new block. The thing we are skipping
in the current block is half the time not a header, but the data at
the halfway point through the block.

Cheers,

Jeff
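To illustrate the byte-level transform the rewrite loop has to perform,
regardless of where page headers fall: one old-format byte of eight one-bit
all-visible flags expands into two new-format bytes of four two-bit entries
each. A sketch, with the all-frozen bit of every pair left clear since the
old format carries no freeze information (uint8 as in PostgreSQL's c.h; the
function name is illustrative, not the patch's):

    #include "postgres.h"

    /*
     * Expand one old-format VM byte (8 all-visible bits) into two
     * new-format bytes (4 heap blocks each, 2 bits per block).
     */
    static void
    expand_vm_byte(uint8 old_byte, uint8 *new_lo, uint8 *new_hi)
    {
        uint8       lo = 0;
        uint8       hi = 0;
        int         i;

        for (i = 0; i < 4; i++)
        {
            if (old_byte & (1 << i))            /* heap blocks 0..3 */
                lo |= (uint8) (1 << (2 * i));   /* all-visible bit only */
            if (old_byte & (1 << (i + 4)))      /* heap blocks 4..7 */
                hi |= (uint8) (1 << (2 * i));
        }
        *new_lo = lo;
        *new_hi = hi;
    }

The page-header bookkeeping then has to happen per old block and per new
block independently: skip SizeOfPageHeaderData once when entering an old
page, and emit a fresh header once when starting a new page, never by
advancing the old-page cursor at new-page boundaries.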
On Tue, Dec 1, 2015 at 3:04 AM, Jeff Janes <jeff.janes@gmail.com> wrote:
> On Mon, Nov 30, 2015 at 9:18 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>> On Sun, Nov 29, 2015 at 2:21 AM, Jeff Janes <jeff.janes@gmail.com> wrote:
>>> On Tue, Nov 24, 2015 at 3:13 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>>>>
>>>> Yeah, we need to recompute the checksum if checksums are enabled.
>>>> I've changed the patch accordingly; it's attached.
>>>> Please review it.
>>>
>>> Thanks for the update. This now conflicts with the updates done to
>>> fix the pg_upgrade out-of-space issue on Windows. I've fixed (I think)
>>> the conflict in order to do some testing, but I'd like to get an
>>> updated patch from the author in case I did it wrong. I don't want to
>>> find bugs that I just introduced myself.
>>
>> Thank you for having a look.
>>
>> Attached is the updated v28 patch.
>> Please review it.
>>
>> Regards,
>
> After running pg_upgrade, if I manually vacuum a table I start getting
> warnings:
>
> WARNING: page is not marked all-visible (and all-frozen) but
> visibility map bit(s) is set in relation "foo" page 32756
> WARNING: page is not marked all-visible (and all-frozen) but
> visibility map bit(s) is set in relation "foo" page 32756
> WARNING: page is not marked all-visible (and all-frozen) but
> visibility map bit(s) is set in relation "foo" page 32757
> WARNING: page is not marked all-visible (and all-frozen) but
> visibility map bit(s) is set in relation "foo" page 32757
>
> The warnings start right where the blocks would begin using the 2nd
> page of the _vm, so I think the problem is there. And looking at the
> code, I think that "cur += SizeOfPageHeaderData;" in the inner loop
> cannot be correct. We can't skip a header in the current (old) block
> each time we reach the end of the new block. The thing we are skipping
> in the current block is half the time not a header, but the data at
> the halfway point through the block.

Thank you for reviewing.

You're right; that skip is not necessary.
Attached is the latest v29 patch, which also removes the mention in
the pg_upgrade documentation.

Regards,

--
Masahiko Sawada
Attachment
Hello,

> You're right; that skip is not necessary.
> Attached is the latest v29 patch, which also removes the mention in
> the pg_upgrade documentation.

The changes look correct, but I haven't tested them. And I have some
additional random comments.

visibilitymap.c:

In visibilitymap_set, the following lines:

  map = PageGetContents(page);
  ...
  if (flags != (map[mapByte] & (flags << mapBit)))

map is (char*), PageGetContents returns (char*), but flags is uint8.
I think that defining map as (uint8*) would be safer.

In visibilitymap_set, the following lines do something different from
what is intended. Only the right side of the inequality gets shifted,
and what should be used on the right side is not flags but
VISIBILITYMAP_VALID_BITS.

  - if (!(map[mapByte] & (1 << mapBit)))
  + if (flags != (map[mapByte] & (flags << mapBit)))

Something like the following will do the right thing.

  + if (flags != (map[mapByte]>>mapBit & VISIBILITYMAP_VALID_BITS))

analyze.c:

In do_analyze_rel, the successive if (!inh) blocks in the following
steps look a bit odd. This is emphasized by the first if block you
added :) These blocks should be enclosed by one if (!inh) {} block.

> /* Calculate the number of all-visible and all-frozen bit */
> if (!inh)
>     relallvisible = visibilitymap_count(onerel, &relallfrozen);
> if (!inh)
>     vac_update_relstats(onerel,
> if (!inh && !(options & VACOPT_VACUUM))
> {
>     for (ind = 0; ind < nindexes; ind++)
...
> }
> if (!inh)
>     pgstat_report_analyze(onerel, totalrows, totaldeadrows, relallfrozen);

vacuum.c:

>- * relpages and relallvisible, we try to maintain certain lazily-updated
>- * DDL flags such as relhasindex, by clearing them if no longer correct.
>- * It's safe to do this in VACUUM, which can't run in parallel with
>- * CREATE INDEX/RULE/TRIGGER and can't be part of a transaction block.
>- * However, it's *not* safe to do it in an ANALYZE that's within an

>+ * relpages, relallvisible, we try to maintain certain lazily-updated

Why did you just drop the 'and' after relpages? And this seems to be
the only change in this file, except for the additionally missing
letter just below :p

>+ * DDL flags such as relhasindex, by clearing them if no onger correct.
>+ * It's safe to do this in VACUUM, which can't run in
>+ * parallel with CREATE INDEX/RULE/TRIGGER and can't be part of a transaction
>+ * block. However, it's *not* safe to do it in an ANALYZE that's within an

nodeIndexonlyscan.c:

Duplicated letters. And the line exceeds the right margin.

  - * Note on Memory Ordering Effects: visibilitymap_test does not lock
  + * Note on Memory Ordering Effects: visibilitymap_get_stattus does not lock

  + * Note on Memory Ordering Effects: visibilitymap_get_status does not lock

The edited line exceeds the right margin by removing a newline.

  - if (!visibilitymap_test(scandesc->heapRelation,
  -                         ItemPointerGetBlockNumber(tid),
  -                         &node->ioss_VMBuffer))
  + if (!VM_ALL_VISIBLE(scandesc->heapRelation, ItemPointerGetBlockNumber(tid),
  +                     &node->ioss_VMBuffer))

costsize.c:

Duplicated words, and it is the only change.

  - * pages for which the visibility map shows all tuples are visible.
  + * pages for which the visibility map map shows all tuples are visible.

  + * pages for which the visibility map shows all tuples are visible.

pgstat.c:

The new parameter frozenpages of pgstat_report_vacuum() is defined as
int32, but its callers pass BlockNumber (= uint32). I recommend
defining frozenpages as BlockNumber. PgStat_MsgVacuum has a
corresponding member defined as int32.

pg_upgrade.c:

BITS_PER_HEAPBLOCK is defined in two .c files with the same
definition. This might be better merged into some header file.

heapam_xlog.h, hio.h, execnodes.h:

Have we decided to rename vm to pim? Anyway, it is inconsistent with
the corresponding definition in the function body, which remains
'vm_buffer'. (I'm not confident on that, though.)

>- Buffer vm_buffer, TransactionId cutoff_xid);
>+ Buffer pim_buffer, TransactionId cutoff_xid, uint8 flags);

regards,

At Wed, 2 Dec 2015 00:10:09 +0530, Masahiko Sawada <sawada.mshk@gmail.com> wrote in <CAD21AoC72S2ShoeAmCxWYUyGSNOaTn4fMHJ-ZKNX-MPcsQpaOw@mail.gmail.com>
> On Tue, Dec 1, 2015 at 3:04 AM, Jeff Janes <jeff.janes@gmail.com> wrote:
> > On Mon, Nov 30, 2015 at 9:18 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > After running pg_upgrade, if I manually vacuum a table I start getting
> > warnings:
> >
> > WARNING: page is not marked all-visible (and all-frozen) but
> > visibility map bit(s) is set in relation "foo" page 32756
> > WARNING: page is not marked all-visible (and all-frozen) but
> > visibility map bit(s) is set in relation "foo" page 32756
...
> > The warnings start right where the blocks would begin using the 2nd
> > page of the _vm, so I think the problem is there. And looking at the
> > code, I think that "cur += SizeOfPageHeaderData;" in the inner loop
> > cannot be correct. We can't skip a header in the current (old) block
> > each time we reach the end of the new block. The thing we are skipping
> > in the current block is half the time not a header, but the data at
> > the halfway point through the block.
>
> Thank you for reviewing.
>
> You're right; that skip is not necessary.
> Attached is the latest v29 patch, which also removes the mention in
> the pg_upgrade documentation.

--
Kyotaro Horiguchi
NTT Open Source Software Center
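To see why the shift matters in the visibilitymap_set test above, a quick
worked example, assuming the two-bit layout (mapBit here is the bit offset
of the heap block's pair within its map byte):

    /*
     * Suppose mapBit = 2 and both bits of the pair are already set, so
     * map[mapByte] = 0x0C, and the caller passes flags = 0x03.
     *
     * broken:  flags != (map[mapByte] & (flags << mapBit))
     *          0x03  != (0x0C & 0x0C) == 0x0C        --> true: the page
     *          is wrongly treated as needing an update.
     *
     * fixed:   flags != ((map[mapByte] >> mapBit) & VISIBILITYMAP_VALID_BITS)
     *          0x03  != ((0x0C >> 2) & 0x03) == 0x03 --> false: correctly
     *          detected as already set.
     */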
On Wed, Dec 2, 2015 at 9:30 AM, Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> Hello,
>
>> You're right; that skip is not necessary.
>> Attached is the latest v29 patch, which also removes the mention in
>> the pg_upgrade documentation.
>
> The changes look correct, but I haven't tested them. And I have some
> additional random comments.

Thank you for reviewing!
I've fixed the following points and attached the latest patch.

> visibilitymap.c:
>
> In visibilitymap_set, the following lines:
>
>   map = PageGetContents(page);
>   ...
>   if (flags != (map[mapByte] & (flags << mapBit)))
>
> map is (char*), PageGetContents returns (char*), but flags is uint8.
> I think that defining map as (uint8*) would be safer.

I agree with you. Fixed.

> In visibilitymap_set, the following lines do something different from
> what is intended. Only the right side of the inequality gets shifted,
> and what should be used on the right side is not flags but
> VISIBILITYMAP_VALID_BITS.
>
>   - if (!(map[mapByte] & (1 << mapBit)))
>   + if (flags != (map[mapByte] & (flags << mapBit)))
>
> Something like the following will do the right thing.
>
>   + if (flags != (map[mapByte]>>mapBit & VISIBILITYMAP_VALID_BITS))

You're right. Fixed.

> analyze.c:
>
> In do_analyze_rel, the successive if (!inh) blocks in the following
> steps look a bit odd. This is emphasized by the first if block you
> added :) These blocks should be enclosed by one if (!inh) {} block.
>
> > /* Calculate the number of all-visible and all-frozen bit */
> > if (!inh)
> >     relallvisible = visibilitymap_count(onerel, &relallfrozen);
> > if (!inh)
> >     vac_update_relstats(onerel,
> > if (!inh && !(options & VACOPT_VACUUM))
> > {
> >     for (ind = 0; ind < nindexes; ind++)
> ...
> > }
> > if (!inh)
> >     pgstat_report_analyze(onerel, totalrows, totaldeadrows, relallfrozen);

Fixed.

> vacuum.c:
>
> >- * relpages and relallvisible, we try to maintain certain lazily-updated
> >- * DDL flags such as relhasindex, by clearing them if no longer correct.
> >- * It's safe to do this in VACUUM, which can't run in parallel with
> >- * CREATE INDEX/RULE/TRIGGER and can't be part of a transaction block.
> >- * However, it's *not* safe to do it in an ANALYZE that's within an
>
> >+ * relpages, relallvisible, we try to maintain certain lazily-updated
>
> Why did you just drop the 'and' after relpages? And this seems to be
> the only change in this file, except for the additionally missing
> letter just below :p
>
> >+ * DDL flags such as relhasindex, by clearing them if no onger correct.
> >+ * It's safe to do this in VACUUM, which can't run in
> >+ * parallel with CREATE INDEX/RULE/TRIGGER and can't be part of a transaction
> >+ * block. However, it's *not* safe to do it in an ANALYZE that's within an

Fixed.

> nodeIndexonlyscan.c:
>
> Duplicated letters. And the line exceeds the right margin.
>
>   - * Note on Memory Ordering Effects: visibilitymap_test does not lock
>   + * Note on Memory Ordering Effects: visibilitymap_get_stattus does not lock
>
>   + * Note on Memory Ordering Effects: visibilitymap_get_status does not lock

Fixed.

> The edited line exceeds the right margin by removing a newline.
>
>   - if (!visibilitymap_test(scandesc->heapRelation,
>   -                         ItemPointerGetBlockNumber(tid),
>   -                         &node->ioss_VMBuffer))
>   + if (!VM_ALL_VISIBLE(scandesc->heapRelation, ItemPointerGetBlockNumber(tid),
>   +                     &node->ioss_VMBuffer))

Fixed.

> costsize.c:
>
> Duplicated words, and it is the only change.
>
>   - * pages for which the visibility map shows all tuples are visible.
>   + * pages for which the visibility map map shows all tuples are visible.
>
>   + * pages for which the visibility map shows all tuples are visible.

Fixed.

> pgstat.c:
>
> The new parameter frozenpages of pgstat_report_vacuum() is defined as
> int32, but its callers pass BlockNumber (= uint32). I recommend
> defining frozenpages as BlockNumber. PgStat_MsgVacuum has a
> corresponding member defined as int32.

I agree with you. Fixed.

> pg_upgrade.c:
>
> BITS_PER_HEAPBLOCK is defined in two .c files with the same
> definition. This might be better merged into some header file.

Fixed. I moved these definitions to visibilitymap.h.

> heapam_xlog.h, hio.h, execnodes.h:
>
> Have we decided to rename vm to pim? Anyway, it is inconsistent with
> the corresponding definition in the function body, which remains
> 'vm_buffer'. (I'm not confident on that, though.)
>
> >- Buffer vm_buffer, TransactionId cutoff_xid);
> >+ Buffer pim_buffer, TransactionId cutoff_xid, uint8 flags);

Fixed.

Regards,

--
Masahiko Sawada
Attachment
On Tue, Dec 1, 2015 at 10:40 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> On Tue, Dec 1, 2015 at 3:04 AM, Jeff Janes <jeff.janes@gmail.com> wrote:
>> On Mon, Nov 30, 2015 at 9:18 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>>> On Sun, Nov 29, 2015 at 2:21 AM, Jeff Janes <jeff.janes@gmail.com> wrote:
>>>> On Tue, Nov 24, 2015 at 3:13 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>>>>>
>>>>> Yeah, we need to recompute the checksum if checksums are enabled.
>>>>> I've changed the patch accordingly; it's attached.
>>>>> Please review it.
>>>>
>>>> Thanks for the update. This now conflicts with the updates done to
>>>> fix the pg_upgrade out-of-space issue on Windows. I've fixed (I think)
>>>> the conflict in order to do some testing, but I'd like to get an
>>>> updated patch from the author in case I did it wrong. I don't want to
>>>> find bugs that I just introduced myself.
>>>
>>> Thank you for having a look.
>>>
>>> Attached is the updated v28 patch.
>>> Please review it.
>>>
>>> Regards,
>>
>> After running pg_upgrade, if I manually vacuum a table I start getting
>> warnings:
>>
>> WARNING: page is not marked all-visible (and all-frozen) but
>> visibility map bit(s) is set in relation "foo" page 32756
>> WARNING: page is not marked all-visible (and all-frozen) but
>> visibility map bit(s) is set in relation "foo" page 32756
>> WARNING: page is not marked all-visible (and all-frozen) but
>> visibility map bit(s) is set in relation "foo" page 32757
>> WARNING: page is not marked all-visible (and all-frozen) but
>> visibility map bit(s) is set in relation "foo" page 32757
>>
>> The warnings start right where the blocks would begin using the 2nd
>> page of the _vm, so I think the problem is there. And looking at the
>> code, I think that "cur += SizeOfPageHeaderData;" in the inner loop
>> cannot be correct. We can't skip a header in the current (old) block
>> each time we reach the end of the new block. The thing we are skipping
>> in the current block is half the time not a header, but the data at
>> the halfway point through the block.
>
> Thank you for reviewing.
>
> You're right; that skip is not necessary.
> Attached is the latest v29 patch, which also removes the mention in
> the pg_upgrade documentation.

I could successfully upgrade with this patch, with the link option and
without. After the upgrade the tables seemed to have their correct
visibility status, and after a VACUUM FREEZE they had the correct
freeze status as well.

Then I manually corrupted the vm file, just to make sure a corrupted
one would get detected. And much to my surprise, I didn't get any
errors or warnings when starting it back up and running vacuum freeze
(unless I had page checksums turned on; then I got warnings and zeroed
out pages). But I guess bits being off when they should be on is not
considered a warnable condition, only the opposite.

Consecutive VACUUM FREEZE operations with no DML activity between them
were not sped up by as much as I thought they would be, because vacuum
still had to walk all the indexes even though it didn't touch the table
at all. In real-world usage there would almost always be some dead
tuples that would require an index scan anyway for a normal vacuum.

Cheers,

Jeff
On Fri, Dec 4, 2015 at 9:51 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
> On Tue, Dec 1, 2015 at 10:40 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>> On Tue, Dec 1, 2015 at 3:04 AM, Jeff Janes <jeff.janes@gmail.com> wrote:
>>> On Mon, Nov 30, 2015 at 9:18 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>>>> On Sun, Nov 29, 2015 at 2:21 AM, Jeff Janes <jeff.janes@gmail.com> wrote:
>>>>> On Tue, Nov 24, 2015 at 3:13 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>>>>>>
>>>>>> Yeah, we need to recompute the checksum if checksums are enabled.
>>>>>> I've changed the patch accordingly; it's attached.
>>>>>> Please review it.
>>>>>
>>>>> Thanks for the update. This now conflicts with the updates done to
>>>>> fix the pg_upgrade out-of-space issue on Windows. I've fixed (I think)
>>>>> the conflict in order to do some testing, but I'd like to get an
>>>>> updated patch from the author in case I did it wrong. I don't want to
>>>>> find bugs that I just introduced myself.
>>>>
>>>> Thank you for having a look.
>>>>
>>>> Attached is the updated v28 patch.
>>>> Please review it.
>>>
>>> After running pg_upgrade, if I manually vacuum a table I start getting
>>> warnings:
>>>
>>> WARNING: page is not marked all-visible (and all-frozen) but
>>> visibility map bit(s) is set in relation "foo" page 32756
>>> WARNING: page is not marked all-visible (and all-frozen) but
>>> visibility map bit(s) is set in relation "foo" page 32756
>>> WARNING: page is not marked all-visible (and all-frozen) but
>>> visibility map bit(s) is set in relation "foo" page 32757
>>> WARNING: page is not marked all-visible (and all-frozen) but
>>> visibility map bit(s) is set in relation "foo" page 32757
>>>
>>> The warnings start right where the blocks would begin using the 2nd
>>> page of the _vm, so I think the problem is there. And looking at the
>>> code, I think that "cur += SizeOfPageHeaderData;" in the inner loop
>>> cannot be correct. We can't skip a header in the current (old) block
>>> each time we reach the end of the new block. The thing we are skipping
>>> in the current block is half the time not a header, but the data at
>>> the halfway point through the block.
>>
>> Thank you for reviewing.
>>
>> You're right; that skip is not necessary.
>> Attached is the latest v29 patch, which also removes the mention in
>> the pg_upgrade documentation.
>
> I could successfully upgrade with this patch, with the link option and
> without. After the upgrade the tables seemed to have their correct
> visibility status, and after a VACUUM FREEZE they had the correct
> freeze status as well.

Thank you for testing!

> Then I manually corrupted the vm file, just to make sure a corrupted
> one would get detected. And much to my surprise, I didn't get any
> errors or warnings when starting it back up and running vacuum freeze
> (unless I had page checksums turned on; then I got warnings and zeroed
> out pages). But I guess bits being off when they should be on is not
> considered a warnable condition, only the opposite.

How did you break the vm file?
An inconsistent flag state (all-frozen set but all-visible not set)
will be detected by the visibility map code. But the vm file is just
consecutive bits after its page header, so detecting corruption would
be difficult unless the whole page is corrupted.

> Consecutive VACUUM FREEZE operations with no DML activity between them
> were not sped up by as much as I thought they would be, because vacuum
> still had to walk all the indexes even though it didn't touch the table
> at all. In real-world usage there would almost always be some dead
> tuples that would require an index scan anyway for a normal vacuum.

Another reason why consecutive VACUUM FREEZE runs were not sped up much
is that many pages of the table were already in the disk cache, right?
In a very large database, vacuuming a large table dominates the total
vacuum time, so the effect would be larger there.

Regards,

--
Masahiko Sawada
On Mon, Nov 30, 2015 at 12:58 PM, Bruce Momjian <bruce@momjian.us> wrote:
> On Mon, Nov 30, 2015 at 10:48:04PM +0530, Masahiko Sawada wrote:
>> On Sun, Nov 29, 2015 at 2:21 AM, Jeff Janes <jeff.janes@gmail.com> wrote:
>> > On Tue, Nov 24, 2015 at 3:13 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>> >>
>> >> Yeah, we need to recompute the checksum if checksums are enabled.
>> >> I've changed the patch accordingly; it's attached.
>> >> Please review it.
>> >
>> > Thanks for the update. This now conflicts with the updates done to
>> > fix the pg_upgrade out-of-space issue on Windows. I've fixed (I think)
>> > the conflict in order to do some testing, but I'd like to get an
>> > updated patch from the author in case I did it wrong. I don't want to
>> > find bugs that I just introduced myself.
>>
>> Thank you for having a look.
>
> I would not bother mentioning this detail in the pg_upgrade manual page:
>
> + Since the format of visibility map has been changed in version 9.6,
> + <application>pg_upgrade</> creates and rewrite new <literal>'_vm'</literal>
> + file even if upgrading from 9.5 or before to 9.6 or later with link mode (-k).

Really? I know we don't always document things like this, but it
seems like a good idea to me that we do so.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Thu, Dec 10, 2015 at 3:31 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Mon, Nov 30, 2015 at 12:58 PM, Bruce Momjian <bruce@momjian.us> wrote:
>> On Mon, Nov 30, 2015 at 10:48:04PM +0530, Masahiko Sawada wrote:
>>> On Sun, Nov 29, 2015 at 2:21 AM, Jeff Janes <jeff.janes@gmail.com> wrote:
>>> > On Tue, Nov 24, 2015 at 3:13 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>>> >>
>>> >> Yeah, we need to recompute the checksum if checksums are enabled.
>>> >> I've changed the patch accordingly; it's attached.
>>> >> Please review it.
>>> >
>>> > Thanks for the update. This now conflicts with the updates done to
>>> > fix the pg_upgrade out-of-space issue on Windows. I've fixed (I think)
>>> > the conflict in order to do some testing, but I'd like to get an
>>> > updated patch from the author in case I did it wrong. I don't want to
>>> > find bugs that I just introduced myself.
>>>
>>> Thank you for having a look.
>>
>> I would not bother mentioning this detail in the pg_upgrade manual page:
>>
>> + Since the format of visibility map has been changed in version 9.6,
>> + <application>pg_upgrade</> creates and rewrite new <literal>'_vm'</literal>
>> + file even if upgrading from 9.5 or before to 9.6 or later with link mode (-k).
>
> Really? I know we don't always document things like this, but it
> seems like a good idea to me that we do so.

Just going through v30...

+ frozen. The whole-table freezing is occuerred only when all pages happen to
+ require freezing to freeze rows. In other cases such as where

I am not really getting the meaning of this sentence. Shouldn't this
be reworded something like:
"Freezing occurs on the whole table once all pages of this relation require it."

+ <structfield>relfrozenxid</> is more than <varname>vacuum_freeze_table_age</>
+ transcations old, where <command>VACUUM</>'s <literal>FREEZE</> option is used,
+ <command>VACUUM</> can skip the pages that all tuples on the page itself are
+ marked as frozen.
+ When all pages of table are eventually marked as frozen by <command>VACUUM</>,
+ after it's finished <literal>age(relfrozenxid)</> should be a little more
+ than the <varname>vacuum_freeze_min_age</> setting that was used (more by
+ the number of transcations started since the <command>VACUUM</> started).
+ If the advancing of <structfield>relfrozenxid</> is not happend until
+ <varname>autovacuum_freeze_max_age</> is reached, an autovacuum will soon
+ be forced for the table.

s/transcations/transactions.

+ <entry><structfield>n_frozen_page</></entry>
+ <entry><type>integer</></entry>
+ <entry>Number of frozen pages</entry>

n_frozen_pages?

make check with pg_upgrade is failing for me:
Visibility map rewriting test failed
make: *** [check] Error 1

-GetVisibilityMapPins(Relation relation, Buffer buffer1, Buffer buffer2,
+GetVisibilitymapPins(Relation relation, Buffer buffer1, Buffer buffer2,

This looks like an unrelated change.

 * Clearing a visibility map bit is not separately WAL-logged. The callers
 * must make sure that whenever a bit is cleared, the bit is cleared on WAL
- * replay of the updating operation as well.
+ * replay of the updating operation as well. And all-frozen bit must be
+ * cleared with all-visible at the same time.

This could be reformulated. It is just an addition on top of the
existing paragraph.

+ * The visibility map has the all-frozen bit which indicates all tuples on
+ * corresponding page has been completely frozen, so the visibility map is also

"have been completely frozen".

-/* Mapping from heap block number to the right bit in the visibility map */
-#define HEAPBLK_TO_MAPBLOCK(x) ((x) / HEAPBLOCKS_PER_PAGE)
-#define HEAPBLK_TO_MAPBYTE(x) (((x) % HEAPBLOCKS_PER_PAGE) / HEAPBLOCKS_PER_BYTE)

Those two declarations are just noise in the patch: those definitions
are unchanged.

- elog(DEBUG1, "vm_clear %s %d", RelationGetRelationName(rel), heapBlk);
+ elog(DEBUG1, "vm_clear %s block %d", RelationGetRelationName(rel), heapBlk);

This may be better as a separate patch.

-visibilitymap_count(Relation rel)
+visibilitymap_count(Relation rel, BlockNumber *all_frozen)

I think that this routine would gain in clarity if reworked as follows:
visibilitymap_count(Relation rel, BlockNumber *all_visible, BlockNumber *all_frozen)

+ /*
+  * Report ANALYZE to the stats collector, too. However, if doing
+  * inherited stats we shouldn't report, because the stats collector only
+  * tracks per-table stats.
+  */
+ if (!inh)
+     pgstat_report_analyze(onerel, totalrows, totaldeadrows, relallfrozen);

Here we already know that this is working in the non-inherited code
path. As a whole, why do it that way? Why isn't relallfrozen passed as
an argument of vac_update_relstats and then inserted in pg_class?
Maybe I am missing something..

+ * mxid full-table scan limit. During full scan, we could skip some pags
+ * according to all-frozen bit of visibility map.

s/pags/pages

+ * Also, skipping even a single page accorinding to all-visible bit of

s/accorinding/according.

So, lazy_scan_heap is the central and really vital portion of the patch...

+ /* Check whether this tuple is alrady frozen or not */

s/alrady/already

-heap_page_is_all_visible(Relation rel, Buffer buf, TransactionId *visibility_cutoff_xid)
+heap_page_is_all_visible(Relation rel, Buffer buf, TransactionId *visibility_cutoff_xid,
+                         bool *all_frozen)

I think you would want to change that to heap_page_visible_status, which
returns *all_visible as well. At least it seems to be a more consistent
style of routine.

+ * The format of visibility map is changed with this 9.6 commit,
+ */
+#define VISIBILITY_MAP_FROZEN_BIT_CAT_VER 201512021

It looks a bit strange to have a different flag for the vm with the
new frozen bit. Couldn't we merge that into a unique version number? I
guess that we should just ask for a vm rewrite anyway in any case if
pg_upgrade is used on the version of pg with the new vm format, no?

Sawada-san, are you planning to continue working on that? At this
stage it seems that this patch is not in a committable state and should
at best be moved to the next CF, or at worst returned with feedback.

--
Michael
On 9 December 2015 at 18:31, Robert Haas <robertmhaas@gmail.com> wrote:
> On Mon, Nov 30, 2015 at 12:58 PM, Bruce Momjian <bruce@momjian.us> wrote:
>> On Mon, Nov 30, 2015 at 10:48:04PM +0530, Masahiko Sawada wrote:
>>> On Sun, Nov 29, 2015 at 2:21 AM, Jeff Janes <jeff.janes@gmail.com> wrote:
>>> > On Tue, Nov 24, 2015 at 3:13 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>>> >>
>>> >> Yeah, we need to recompute the checksum if checksums are enabled.
>>> >> I've changed the patch accordingly; it's attached.
>>> >> Please review it.
>>> >
>>> > Thanks for the update. This now conflicts with the updates done to
>>> > fix the pg_upgrade out-of-space issue on Windows. I've fixed (I think)
>>> > the conflict in order to do some testing, but I'd like to get an
>>> > updated patch from the author in case I did it wrong. I don't want to
>>> > find bugs that I just introduced myself.
>>> >
>>>
>>> Thank you for having a look.
>>
>> I would not bother mentioning this detail in the pg_upgrade manual page:
>>
>> + Since the format of visibility map has been changed in version 9.6,
>> + <application>pg_upgrade</> creates and rewrite new <literal>'_vm'</literal>
>> + file even if upgrading from 9.5 or before to 9.6 or later with link mode (-k).
>
> Really? I know we don't always document things like this, but it
> seems like a good idea to me that we do so.
Agreed.
For me, rewriting the visibility map is a new data loss bug waiting to happen. I am worried that the group is not taking the potential for catastrophe here seriously. I think we can do it, but I think it needs these things:
* Clear notice that it is happening unconditionally and unavoidably
* Log files showing it has happened, action by action
* Very clear mechanism for resolving an incomplete or interrupted upgrade process. Which VMs got upgraded? Which didn't?
* Ability to undo an upgrade attempt, somehow, ideally automatically by default
* Ability to restart a failed upgrade attempt without doing a "double upgrade", i.e., ensure the transformation is immutable, so re-running it yields the same result
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Thu, Dec 17, 2015 at 3:44 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> For me, rewriting the visibility map is a new data loss bug waiting to
> happen. I am worried that the group is not taking the potential for
> catastrophe here seriously.

FWIW, I'm following this line of thinking: merging the frozen bit into
a single vm file looks like a ticking bomb. We may really want a
separate file, say _vmf, to track this new bit flag, but this has
already been discussed at length upthread...

--
Michael
On 2015-12-17 15:56:35 +0900, Michael Paquier wrote:
> On Thu, Dec 17, 2015 at 3:44 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> > For me, rewriting the visibility map is a new data loss bug waiting to
> > happen. I am worried that the group is not taking the potential for
> > catastrophe here seriously.
>
> FWIW, I'm following this line of thinking: merging the frozen bit into
> a single vm file looks like a ticking bomb.

And what are those risks?

> We may really want a separate file, say _vmf, to track this new bit
> flag, but this has already been discussed at length upthread...

That'd double the overhead when those bits get unset.
On Thu, Dec 17, 2015 at 4:10 PM, Andres Freund <andres@anarazel.de> wrote:
> On 2015-12-17 15:56:35 +0900, Michael Paquier wrote:
>> On Thu, Dec 17, 2015 at 3:44 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
>> > For me, rewriting the visibility map is a new data loss bug waiting to
>> > happen. I am worried that the group is not taking the potential for
>> > catastrophe here seriously.
>>
>> FWIW, I'm following this line of thinking: merging the frozen bit into
>> a single vm file looks like a ticking bomb.
>
> And what are those risks?

Incorrect vm file rewrite after a pg_upgrade run.

--
Michael
On 2015-12-17 16:22:24 +0900, Michael Paquier wrote:
> On Thu, Dec 17, 2015 at 4:10 PM, Andres Freund <andres@anarazel.de> wrote:
> > On 2015-12-17 15:56:35 +0900, Michael Paquier wrote:
> >> On Thu, Dec 17, 2015 at 3:44 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> >> > For me, rewriting the visibility map is a new data loss bug waiting to
> >> > happen. I am worried that the group is not taking the potential for
> >> > catastrophe here seriously.
> >>
> >> FWIW, I'm following this line of thinking: merging the frozen bit into
> >> a single vm file looks like a ticking bomb.
> >
> > And what are those risks?
>
> Incorrect vm file rewrite after a pg_upgrade run.

If we can't manage to rewrite a file, replacing a binary 1 with a binary
10, then we shouldn't be working on a database. And if we screw up,
recovery is an rm *_vm away. I can't imagine that this is going to be
the actually complicated part of this feature.
On Thu, Dec 17, 2015 at 11:47 AM, Michael Paquier <michael.paquier@gmail.com> wrote:
> On Thu, Dec 10, 2015 at 3:31 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Mon, Nov 30, 2015 at 12:58 PM, Bruce Momjian <bruce@momjian.us> wrote:
>>> On Mon, Nov 30, 2015 at 10:48:04PM +0530, Masahiko Sawada wrote:
>>>> On Sun, Nov 29, 2015 at 2:21 AM, Jeff Janes <jeff.janes@gmail.com> wrote:
>>>> > On Tue, Nov 24, 2015 at 3:13 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>>>> >>
>>>> >> Yeah, we need to recompute the checksum if checksums are enabled.
>>>> >> I've changed the patch accordingly; it's attached.
>>>> >> Please review it.
>>>> >
>>>> > Thanks for the update. This now conflicts with the updates done to
>>>> > fix the pg_upgrade out-of-space issue on Windows. I've fixed (I think)
>>>> > the conflict in order to do some testing, but I'd like to get an
>>>> > updated patch from the author in case I did it wrong. I don't want to
>>>> > find bugs that I just introduced myself.
>>>> >
>>>>
>>>> Thank you for having a look.
>>>
>>> I would not bother mentioning this detail in the pg_upgrade manual page:
>>>
>>> + Since the format of visibility map has been changed in version 9.6,
>>> + <application>pg_upgrade</> creates and rewrite new <literal>'_vm'</literal>
>>> + file even if upgrading from 9.5 or before to 9.6 or later with link mode (-k).
>>
>> Really? I know we don't always document things like this, but it
>> seems like a good idea to me that we do so.
>
> Just going through v30...
>
> + frozen. The whole-table freezing is occuerred only when all pages happen to
> + require freezing to freeze rows. In other cases such as where
>
> I am not really getting the meaning of this sentence. Shouldn't this
> be reworded something like:
> "Freezing occurs on the whole table once all pages of this relation require it."
>
> + <structfield>relfrozenxid</> is more than <varname>vacuum_freeze_table_age</>
> + transcations old, where <command>VACUUM</>'s <literal>FREEZE</> option is used,
> + <command>VACUUM</> can skip the pages that all tuples on the page itself are
> + marked as frozen.
> + When all pages of table are eventually marked as frozen by <command>VACUUM</>,
> + after it's finished <literal>age(relfrozenxid)</> should be a little more
> + than the <varname>vacuum_freeze_min_age</> setting that was used (more by
> + the number of transcations started since the <command>VACUUM</> started).
> + If the advancing of <structfield>relfrozenxid</> is not happend until
> + <varname>autovacuum_freeze_max_age</> is reached, an autovacuum will soon
> + be forced for the table.
>
> s/transcations/transactions.
>
> + <entry><structfield>n_frozen_page</></entry>
> + <entry><type>integer</></entry>
> + <entry>Number of frozen pages</entry>
>
> n_frozen_pages?
>
> make check with pg_upgrade is failing for me:
> Visibility map rewriting test failed
> make: *** [check] Error 1

make check with pg_upgrade succeeds in my environment.
Could you give me more information about this?

> -GetVisibilityMapPins(Relation relation, Buffer buffer1, Buffer buffer2,
> +GetVisibilitymapPins(Relation relation, Buffer buffer1, Buffer buffer2,
>
> This looks like an unrelated change.
>
>  * Clearing a visibility map bit is not separately WAL-logged. The callers
>  * must make sure that whenever a bit is cleared, the bit is cleared on WAL
> - * replay of the updating operation as well.
> + * replay of the updating operation as well. And all-frozen bit must be
> + * cleared with all-visible at the same time.
>
> This could be reformulated. It is just an addition on top of the
> existing paragraph.
>
> + * The visibility map has the all-frozen bit which indicates all tuples on
> + * corresponding page has been completely frozen, so the visibility map is also
>
> "have been completely frozen".
>
> -/* Mapping from heap block number to the right bit in the visibility map */
> -#define HEAPBLK_TO_MAPBLOCK(x) ((x) / HEAPBLOCKS_PER_PAGE)
> -#define HEAPBLK_TO_MAPBYTE(x) (((x) % HEAPBLOCKS_PER_PAGE) / HEAPBLOCKS_PER_BYTE)
>
> Those two declarations are just noise in the patch: those definitions
> are unchanged.
>
> - elog(DEBUG1, "vm_clear %s %d", RelationGetRelationName(rel), heapBlk);
> + elog(DEBUG1, "vm_clear %s block %d", RelationGetRelationName(rel), heapBlk);
>
> This may be better as a separate patch.

I've attached the 001 patch for this separately.

> -visibilitymap_count(Relation rel)
> +visibilitymap_count(Relation rel, BlockNumber *all_frozen)
>
> I think that this routine would gain in clarity if reworked as follows:
> visibilitymap_count(Relation rel, BlockNumber *all_visible, BlockNumber *all_frozen)
>
> + /*
> +  * Report ANALYZE to the stats collector, too. However, if doing
> +  * inherited stats we shouldn't report, because the stats collector only
> +  * tracks per-table stats.
> +  */
> + if (!inh)
> +     pgstat_report_analyze(onerel, totalrows, totaldeadrows, relallfrozen);
>
> Here we already know that this is working in the non-inherited code
> path. As a whole, why do it that way? Why isn't relallfrozen passed as
> an argument of vac_update_relstats and then inserted in pg_class?
> Maybe I am missing something..

IIUC, as per the discussion, the number of frozen pages should not be
inserted into pg_class, because it's not information used by query
planning, unlike relallvisible and relpages.

> + * mxid full-table scan limit. During full scan, we could skip some pags
> + * according to all-frozen bit of visibility map.
>
> s/pags/pages
>
> + * Also, skipping even a single page accorinding to all-visible bit of
>
> s/accorinding/according.
>
> So, lazy_scan_heap is the central and really vital portion of the patch...
>
> + /* Check whether this tuple is alrady frozen or not */
>
> s/alrady/already
>
> -heap_page_is_all_visible(Relation rel, Buffer buf, TransactionId *visibility_cutoff_xid)
> +heap_page_is_all_visible(Relation rel, Buffer buf, TransactionId *visibility_cutoff_xid,
> +                         bool *all_frozen)
>
> I think you would want to change that to heap_page_visible_status, which
> returns *all_visible as well. At least it seems to be a more consistent
> style of routine.
>
> + * The format of visibility map is changed with this 9.6 commit,
> + */
> +#define VISIBILITY_MAP_FROZEN_BIT_CAT_VER 201512021
>
> It looks a bit strange to have a different flag for the vm with the
> new frozen bit. Couldn't we merge that into a unique version number? I
> guess that we should just ask for a vm rewrite anyway in any case if
> pg_upgrade is used on the version of pg with the new vm format, no?

Thank you for your review.
Please find the attached latest v31 patches.

> Sawada-san, are you planning to continue working on that? At this
> stage it seems that this patch is not in a committable state and should
> at best be moved to the next CF, or at worst returned with feedback.

Yes, of course.
This patch should be marked as "Move to next CF".

Regards,

--
Masahiko Sawada
Attachment
On Fri, Dec 18, 2015 at 3:17 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> On Thu, Dec 17, 2015 at 11:47 AM, Michael Paquier <michael.paquier@gmail.com> wrote:
>> make check with pg_upgrade is failing for me:
>> Visibility map rewriting test failed
>> make: *** [check] Error 1
>
> make check with pg_upgrade succeeds in my environment.
> Could you give me more information about this?

Oh, well, I see now after digging into your code. You are missing -X
when running psql, and until recently psql -c implied -X all the time.
The reason why it failed for me is that I have \timing enabled in my
psqlrc. Actually, test.sh needs to be fixed as well, see the attached;
this is a separate bug. Could a kind committer look at that and judge
if this is acceptable?

>> Sawada-san, are you planning to continue working on that? At this
>> stage it seems that this patch is not in a committable state and should
>> at best be moved to the next CF, or at worst returned with feedback.
>
> Yes, of course.
> This patch should be marked as "Move to next CF".

OK, done so.

--
Michael
Attachment
On Thu, Dec 17, 2015 at 1:17 AM, Michael Paquier <michael.paquier@gmail.com> wrote: > I am not really getting the meaning of this sentence. Shouldn't this > be reworded something like: > "Freezing occurs on the whole table once all pages of this relation require it." That statement isn't remotely true, and I don't think this patch changes that. Freezing occurs on the whole table once relfrozenxid is old enough that we think there might be at least one page in the table that requires it. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Thu, Dec 17, 2015 at 2:26 AM, Andres Freund <andres@anarazel.de> wrote: > On 2015-12-17 16:22:24 +0900, Michael Paquier wrote: >> On Thu, Dec 17, 2015 at 4:10 PM, Andres Freund <andres@anarazel.de> wrote: >> > On 2015-12-17 15:56:35 +0900, Michael Paquier wrote: >> >> On Thu, Dec 17, 2015 at 3:44 PM, Simon Riggs <simon@2ndquadrant.com> wrote: >> >> > For me, rewriting the visibility map is a new data loss bug waiting to >> >> > happen. I am worried that the group is not taking seriously the potential >> >> > for catastrophe here. >> >> >> >> FWIW, I'm following this line and merging the vm file into a single >> >> unit looks like a ticking bomb. >> > >> > And what are those risks? >> >> Incorrect vm file rewrite after a pg_upgrade run. > > If we can't manage to rewrite a file, replacing a binary b1 with a b10, > then we shouldn't be working on a database. And if we screw up, recovery > i is an rm *_vm away. I can't imagine that this is going to be the > actually complicated part of this feature. Yeah. If that part of this feature isn't right, the chances that the rest of the patch are robust enough to commit seem extremely low. That is, as Andres says, not the hard part. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Hello, At Fri, 18 Dec 2015 12:09:43 -0500, Robert Haas <robertmhaas@gmail.com> wrote in <CA+TgmoZCCFwgKL0PmSi=htfZ2aCOZPoTPD73ecvSA9rhXa0zUw@mail.gmail.com> > On Thu, Dec 17, 2015 at 1:17 AM, Michael Paquier > <michael.paquier@gmail.com> wrote: > > I am not really getting the meaning of this sentence. Shouldn't this > > be reworded something like: > > "Freezing occurs on the whole table once all pages of this relation require it." > > That statement isn't remotely true, and I don't think this patch > changes that. Freezing occurs on the whole table once relfrozenxid is > old enough that we think there might be at least one page in the table > that requires it. I doubt I can explain this accurately, but I took the original phrase as that if and only if all pages of the table are marked as "requires freezing" by accident, all pages are frozen. It's quite obvious but it is what I think "happen to require freezing" means. Does this make sense? The phrase might not be necessary if this is correct. regards, -- Kyotaro Horiguchi NTT Open Source Software Center
On Mon, Dec 21, 2015 at 3:27 AM, Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote: > Hello, > > At Fri, 18 Dec 2015 12:09:43 -0500, Robert Haas <robertmhaas@gmail.com> wrote in <CA+TgmoZCCFwgKL0PmSi=htfZ2aCOZPoTPD73ecvSA9rhXa0zUw@mail.gmail.com> >> On Thu, Dec 17, 2015 at 1:17 AM, Michael Paquier >> <michael.paquier@gmail.com> wrote: >> > I am not really getting the meaning of this sentence. Shouldn't this >> > be reworded something like: >> > "Freezing occurs on the whole table once all pages of this relation require it." >> >> That statement isn't remotely true, and I don't think this patch >> changes that. Freezing occurs on the whole table once relfrozenxid is >> old enough that we think there might be at least one page in the table >> that requires it. > > I doubt I can explain this accurately, but I took the original > phrase as that if and only if all pages of the table are marked > as "requires freezing" by accident, all pages are frozen. It's > quite obvious but it is what I think "happen to require freezing" > means. Does this make sense? > > The phrase might not be necessary if this is correct. Maybe you are trying to say something like "only those pages which require freezing are frozen?". -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Thu, Dec 17, 2015 at 06:44:46AM +0000, Simon Riggs wrote: > >> Thank you for having a look. > > > > I would not bother mentioning this detail in the pg_upgrade manual page: > > > > + Since the format of visibility map has been changed in version 9.6, > > + <application>pg_upgrade</> creates and rewrite new <literal>'_vm'</ > literal> > > + file even if upgrading from 9.5 or before to 9.6 or later with link > mode (-k). > > Really? I know we don't always document things like this, but it > seems like a good idea to me that we do so. > > > Agreed. > > For me, rewriting the visibility map is a new data loss bug waiting to happen. > I am worried that the group is not taking seriously the potential for > catastrophe here. I think we can do it, but I think it needs these things > > * Clear notice that it is happening unconditionally and unavoidably > * Log files showing it has happened, action by action > * Very clear mechanism for resolving an incomplete or interrupted upgrade > process. Which VMs got upgraded? Which didn't? > * Ability to undo an upgrade attempt, somehow, ideally automatically by default > * Ability to restart a failed upgrade attempt without doing a "double upgrade", > i.e. ensure transformation is immutable Have you forgotten how pg_upgrade works? This new vm file is only created on the new cluster, which is not usable if pg_upgrade doesn't complete successfully. pg_upgrade never modifies the old cluster. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + As you are, so once was I. As I am, so you will be. + + Roman grave inscription +
On Mon, Dec 21, 2015 at 11:54 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Mon, Dec 21, 2015 at 3:27 AM, Kyotaro HORIGUCHI
> <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
>> Hello,
>>
>> At Fri, 18 Dec 2015 12:09:43 -0500, Robert Haas <robertmhaas@gmail.com> wrote in <CA+TgmoZCCFwgKL0PmSi=htfZ2aCOZPoTPD73ecvSA9rhXa0zUw@mail.gmail.com>
>>> On Thu, Dec 17, 2015 at 1:17 AM, Michael Paquier
>>> <michael.paquier@gmail.com> wrote:
>>> > I am not really getting the meaning of this sentence. Shouldn't this
>>> > be reworded something like:
>>> > "Freezing occurs on the whole table once all pages of this relation require it."
>>>
>>> That statement isn't remotely true, and I don't think this patch
>>> changes that. Freezing occurs on the whole table once relfrozenxid is
>>> old enough that we think there might be at least one page in the table
>>> that requires it.
>>
>> I doubt I can explain this accurately, but I took the original
>> phrase as that if and only if all pages of the table are marked
>> as "requires freezing" by accident, all pages are frozen. It's
>> quite obvious but it is what I think "happen to require freezing"
>> means. Does this make sense?
>>
>> The phrase might not be necessary if this is correct.
>
> Maybe you are trying to say something like "only those pages which
> require freezing are frozen?".
>

I was thinking the same as what Horiguchi-san said. That is, even if relfrozenxid is old enough, freezing the whole table is not required if all of its pages are marked as "does not require freezing". In other words, only those pages which are marked as "not frozen" are frozen.

Regards,

--
Masahiko Sawada
On Mon, Dec 28, 2015 at 6:38 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> On Mon, Dec 21, 2015 at 11:54 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Mon, Dec 21, 2015 at 3:27 AM, Kyotaro HORIGUCHI
>> <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
>>> Hello,
>>>
>>> At Fri, 18 Dec 2015 12:09:43 -0500, Robert Haas <robertmhaas@gmail.com> wrote in <CA+TgmoZCCFwgKL0PmSi=htfZ2aCOZPoTPD73ecvSA9rhXa0zUw@mail.gmail.com>
>>>> On Thu, Dec 17, 2015 at 1:17 AM, Michael Paquier
>>>> <michael.paquier@gmail.com> wrote:
>>>> > I am not really getting the meaning of this sentence. Shouldn't this
>>>> > be reworded something like:
>>>> > "Freezing occurs on the whole table once all pages of this relation require it."
>>>>
>>>> That statement isn't remotely true, and I don't think this patch
>>>> changes that. Freezing occurs on the whole table once relfrozenxid is
>>>> old enough that we think there might be at least one page in the table
>>>> that requires it.
>>>
>>> I doubt I can explain this accurately, but I took the original
>>> phrase as that if and only if all pages of the table are marked
>>> as "requires freezing" by accident, all pages are frozen. It's
>>> quite obvious but it is what I think "happen to require freezing"
>>> means. Does this make sense?
>>>
>>> The phrase might not be necessary if this is correct.
>>
>> Maybe you are trying to say something like "only those pages which
>> require freezing are frozen?".
>>
>
> I was thinking the same as what Horiguchi-san said. That is, even if
> relfrozenxid is old enough, freezing the whole table is not required
> if all of its pages are marked as "does not require freezing".
> In other words, only those pages which are marked as "not frozen" are frozen.
>

The recent changes to HEAD conflict with the freeze map patch, so I've updated and attached the latest freeze map patch. The other patch, which enhances the debug log messages of the visibility map, is attached to the previous mail. <http://www.postgresql.org/message-id/CAD21AoBScUD4k_QWrYGRmbXVruiekPY=2BY2Fxhqq55a+tzUxg@mail.gmail.com>.

Please review it.

Regards,

--
Masahiko Sawada
Attachment
On Wed, Jan 13, 2016 at 12:16 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: > On Mon, Dec 28, 2015 at 6:38 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: >> On Mon, Dec 21, 2015 at 11:54 PM, Robert Haas <robertmhaas@gmail.com> wrote: >>> On Mon, Dec 21, 2015 at 3:27 AM, Kyotaro HORIGUCHI >>> <horiguchi.kyotaro@lab.ntt.co.jp> wrote: >>>> Hello, >>>> >>>> At Fri, 18 Dec 2015 12:09:43 -0500, Robert Haas <robertmhaas@gmail.com> wrote in <CA+TgmoZCCFwgKL0PmSi=htfZ2aCOZPoTPD73ecvSA9rhXa0zUw@mail.gmail.com> >>>>> On Thu, Dec 17, 2015 at 1:17 AM, Michael Paquier >>>>> <michael.paquier@gmail.com> wrote: >>>>> > I am not really getting the meaning of this sentence. Shouldn't this >>>>> > be reworded something like: >>>>> > "Freezing occurs on the whole table once all pages of this relation require it." >>>>> >>>>> That statement isn't remotely true, and I don't think this patch >>>>> changes that. Freezing occurs on the whole table once relfrozenxid is >>>>> old enough that we think there might be at least one page in the table >>>>> that requires it. >>>> >>>> I doubt I can explain this accurately, but I took the original >>>> phrase as that if and only if all pages of the table are marked >>>> as "requires freezing" by accident, all pages are frozen. It's >>>> quite obvious but it is what I think "happen to require freezing" >>>> means. Does this make sense? >>>> >>>> The phrase might not be necessary if this is correct. >>> >>> Maybe you are trying to say something like "only those pages which >>> require freezing are frozen?". >>> >> >> I was thinking the same as what Horiguchi-san said. >> That is, even if relfrozenxid is old enough, freezing on the whole >> table is not required if the table are marked as "not requires >> freezing". >> In other word, only those pages which are marked as "not frozen" are frozen. >> > > The recently changes to HEAD conflicts with freeze map patch, so I've > updated and attached latest freeze map patch. > The another patch that enhances the debug log message of visibilitymap > is attached to previous mail. > <http://www.postgresql.org/message-id/CAD21AoBScUD4k_QWrYGRmbXVruiekPY=2BY2Fxhqq55a+tzUxg@mail.gmail.com>. > > Please review it. > Attached updated version patch. Please review it. Regards, -- Masahiko Sawada
Attachment
Masahiko Sawada wrote: > Attached updated version patch. > Please review it. In pg_upgrade, the "page convert" functionality is there to abstract rewrites of pages being copied; your patch is circumventing it and AFAICS it makes the interface more complicated for no good reason. I think the real way to do that is to write your rewriteVisibilityMap as a pageConverter routine. That should reduce some duplication there. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 2/1/16 4:59 PM, Alvaro Herrera wrote: > Masahiko Sawada wrote: > >> Attached updated version patch. >> Please review it. > > In pg_upgrade, the "page convert" functionality is there to abstract > rewrites of pages being copied; your patch is circumventing it and > AFAICS it makes the interface more complicated for no good reason. I > think the real way to do that is to write your rewriteVisibilityMap as a > pageConverter routine. That should reduce some duplication there. IIRC this is about the third problem that's been found with pg_upgrade in this patch. That concerns me given the potential for disaster if freeze bits are set incorrectly. -- Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX Experts in Analytics, Data Architecture and PostgreSQL Data in Trouble? Get it in Treble! http://BlueTreble.com
On Tue, Feb 2, 2016 at 10:15 AM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote:
> On 2/1/16 4:59 PM, Alvaro Herrera wrote:
>>
>> Masahiko Sawada wrote:
>>
>>> Attached updated version patch.
>>> Please review it.
>>
>>
>> In pg_upgrade, the "page convert" functionality is there to abstract
>> rewrites of pages being copied; your patch is circumventing it and
>> AFAICS it makes the interface more complicated for no good reason. I
>> think the real way to do that is to write your rewriteVisibilityMap as a
>> pageConverter routine. That should reduce some duplication there.

This means that users always have to set a pageConverter plug-in when upgrading? I was thinking that this conversion is required for all users who want to upgrade to 9.6, so we should support it in core, not as a plug-in.

>
> IIRC this is about the third problem that's been found with pg_upgrade in
> this patch. That concerns me given the potential for disaster if freeze bits
> are set incorrectly.

Yeah, I intend to have a diagnostic tool for the visibility map, to exactly compare the old and new VM after upgrading the postgres server. I've implemented such a tool; it is in my github repository[1]. I'm thinking of adding this tool into core (e.g., the pg_upgrade directory, not a contrib module) as a testing function.

[1] https://github.com/MasahikoSawada/pg_visibilitymap

Regards,

--
Masahiko Sawada
On Tue, Feb 2, 2016 at 11:42 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> On Tue, Feb 2, 2016 at 10:15 AM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote:
>> On 2/1/16 4:59 PM, Alvaro Herrera wrote:
>>>
>>> Masahiko Sawada wrote:
>>>
>>>> Attached updated version patch.
>>>> Please review it.
>>>
>>>
>>> In pg_upgrade, the "page convert" functionality is there to abstract
>>> rewrites of pages being copied; your patch is circumventing it and
>>> AFAICS it makes the interface more complicated for no good reason. I
>>> think the real way to do that is to write your rewriteVisibilityMap as a
>>> pageConverter routine. That should reduce some duplication there.
>>
>
> This means that users always have to set a pageConverter plug-in when upgrading?
> I was thinking that this conversion is required for all users who want
> to upgrade to 9.6, so we should support it in core, not as a plug-in.

I misunderstood; sorry for the noise. I agree with adding the conversion method as a pageConverter routine.

This patch doesn't actually change the page layout, but the pageConverter routine checks only the page layout, and we would have to provide a plugin named convertLayout_X_to_Y.

I think we have two options.

1. Change page layout(PG_PAGE_LAYOUT_VERSION) to 5. pg_upgrade detects it and then converts only VM files.
2. Change pg_upgrade plugin mechanism so that it can handle other name conversion plugins (e.g., convertLayout_vm_to_vfm)

I think #2 is better. Thoughts?

Regards,

--
Masahiko Sawada
Masahiko Sawada wrote: > I misunderstood. Sorry for noise. > I agree with adding conversion method as a pageConverter routine. \o/ > This patch doesn't change page layout actually, but pageConverter > routine checks only the page layout. > And we have to plugin named convertLayout_X_to_Y. > > I think we have two options. > > 1. Change page layout(PG_PAGE_LAYOUT_VERSION) to 5. pg_upgrade detects > it and then converts only VM files. > 2. Change pg_upgrade plugin mechanism so that it can handle other name > conversion plugins (e.g., convertLayout_vm_to_vfm) > > I think #2 is better. Thought? My vote is for #2 as well. Maybe we just didn't have forks when this functionality was invented; maybe the author just didn't think hard enough about what would be the right interface to do it. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Tue, Feb 2, 2016 at 7:22 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> Masahiko Sawada wrote:
>
>> I misunderstood. Sorry for noise.
>> I agree with adding conversion method as a pageConverter routine.
>
> \o/
>
>> This patch doesn't change page layout actually, but pageConverter
>> routine checks only the page layout.
>> And we have to plugin named convertLayout_X_to_Y.
>>
>> I think we have two options.
>>
>> 1. Change page layout(PG_PAGE_LAYOUT_VERSION) to 5. pg_upgrade detects
>> it and then converts only VM files.
>> 2. Change pg_upgrade plugin mechanism so that it can handle other name
>> conversion plugins (e.g., convertLayout_vm_to_vfm)
>>
>> I think #2 is better. Thought?
>
> My vote is for #2 as well. Maybe we just didn't have forks when this
> functionality was invented; maybe the author just didn't think hard
> enough about what would be the right interface to do it.

Thanks.

I'm planning to change it as follows.

- The pageCnvCtx struct gets a new function pointer convertVMFile(). If the layout of other relations such as FSM or CLOG changes in the future, we could add convertFSMFile() and convertCLOGFile().
- Create a new library convertLayoutVM_add_frozenbit.c that has the convertVMFile() function, which converts only the visibility map. When rewriting of the VM is required, convertLayoutVM_add_frozenbit.so is dynamically loaded. convertLayout_X_to_Y converts other relation files. That is, converting the VM and converting other relations are done independently.
- The current plugin mechanism puts conversion libraries (*.so) into ${bin}/plugins (i.e., a new plugin directory is required), but I'm thinking to put them into ${libdir}.

Please give me feedback.

Regards,

--
Masahiko Sawada
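To visualize the plan above, here is a rough sketch of what the extended converter context might look like. convertVMFile() and the library names are the ones proposed in this mail; everything else (the typedef and the other fields) is an illustrative assumption, not pg_upgrade's actual definitions.

/*
 * Illustrative sketch only, not the actual pg_upgrade structs.
 */
typedef const char *(*fileConverter) (const char *src, const char *dst);

typedef struct
{
    fileConverter convertFile;    /* whole-file conversion of the main fork */
    fileConverter convertVMFile;  /* visibility map conversion, loaded from
                                   * convertLayoutVM_add_frozenbit.so only
                                   * when a VM rewrite is required */
    /* convertFSMFile, convertCLOGFile, ... could be added the same way */
} pageCnvCtx;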
Hello,

At Tue, 2 Feb 2016 20:25:23 +0900, Masahiko Sawada <sawada.mshk@gmail.com> wrote in <CAD21AoA5iaKQ6K7gUZyzN2KJnPNMeHc6PPPxj6cJgmssjj=fqw@mail.gmail.com>
> On Tue, Feb 2, 2016 at 7:22 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> > Masahiko Sawada wrote:
> >
> >> I misunderstood. Sorry for noise.
> >> I agree with adding conversion method as a pageConverter routine.
> >
> > \o/
> >
> >> This patch doesn't change page layout actually, but pageConverter
> >> routine checks only the page layout.
> >> And we have to plugin named convertLayout_X_to_Y.
> >>
> >> I think we have two options.
> >>
> >> 1. Change page layout(PG_PAGE_LAYOUT_VERSION) to 5. pg_upgrade detects
> >> it and then converts only VM files.
> >> 2. Change pg_upgrade plugin mechanism so that it can handle other name
> >> conversion plugins (e.g., convertLayout_vm_to_vfm)
> >>
> >> I think #2 is better. Thought?
> >
> > My vote is for #2 as well. Maybe we just didn't have forks when this
> > functionality was invented; maybe the author just didn't think hard
> > enough about what would be the right interface to do it.
>
> Thanks.
>
> I'm planning to change it as follows.
> - The pageCnvCtx struct gets a new function pointer convertVMFile().
>   If the layout of other relations such as FSM or CLOG changes in the
>   future, we could add convertFSMFile() and convertCLOGFile().
> - Create a new library convertLayoutVM_add_frozenbit.c that has the
>   convertVMFile() function, which converts only the visibility map.
>   When rewriting of the VM is required, convertLayoutVM_add_frozenbit.so
>   is dynamically loaded. convertLayout_X_to_Y converts other relation
>   files. That is, converting the VM and converting other relations are
>   done independently.
> - The current plugin mechanism puts conversion libraries (*.so) into
>   ${bin}/plugins (i.e., a new plugin directory is required), but I'm
>   thinking to put them into ${libdir}.
>
> Please give me feedback.

I agree that the plugin mechanism would be usable and needs to be redesigned, but.. since the destination version is fixed, the advantage of the plugin mechanism for pg_upgrade would be the capability to choose a plugin to load according to some characteristics of the source database. What do you think should be the trigger characteristics for convertLayoutVM_add_frozenbit.so or similar? If it is hard-coded like what transfer_single_new_db is doing for fsm and vm, I suppose the module does not need to be a plugin.

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center
This patch has gotten its fair share of feedback in this fest. I moved it to the next commitfest. Please do keep working on it and reviewers that have additional comments are welcome. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Tue, Feb 2, 2016 at 10:05 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> This patch has gotten its fair share of feedback in this fest. I moved
> it to the next commitfest. Please do keep working on it and reviewers
> that have additional comments are welcome.

Thanks!

On Tue, Feb 2, 2016 at 8:59 PM, Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> Since the destination version is fixed, the advantage of the
> plugin mechanism for pg_upgrade would be the capability to choose a
> plugin to load according to some characteristics of the source
> database. What do you think should be the trigger characteristics for
> convertLayoutVM_add_frozenbit.so or similar? If it is hard-coded
> like what transfer_single_new_db is doing for fsm and vm, I
> suppose the module does not need to be a plugin.

Sorry, I couldn't get it. You meant that we should use rewriteVisibilityMap as a plain function (not dynamically loaded)? The destination version is not fixed; it depends on the new cluster version. I'm planning that convertLayoutVM_add_frozenbit.so is dynamically loaded and used only when rewriting of the VM is required. If the layout of the VM is changed again in the future, we could add other libraries for conversion.

Regards,

--
Masahiko Sawada
On Tue, Feb 2, 2016 at 7:22 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> Masahiko Sawada wrote:
>
>> I misunderstood. Sorry for noise.
>> I agree with adding conversion method as a pageConverter routine.
>
> \o/
>
>> This patch doesn't change page layout actually, but pageConverter
>> routine checks only the page layout.
>> And we have to plugin named convertLayout_X_to_Y.
>>
>> I think we have two options.
>>
>> 1. Change page layout(PG_PAGE_LAYOUT_VERSION) to 5. pg_upgrade detects
>> it and then converts only VM files.
>> 2. Change pg_upgrade plugin mechanism so that it can handle other name
>> conversion plugins (e.g., convertLayout_vm_to_vfm)
>>
>> I think #2 is better. Thought?
>
> My vote is for #2 as well. Maybe we just didn't have forks when this
> functionality was invented; maybe the author just didn't think hard
> enough about what would be the right interface to do it.

I've almost written up a very rough patch (it can pass the regression tests). Windows support is not done yet, and the Makefile is not correct.

I've divided the main patch into two patches: an add-frozen-bit patch and a pg_upgrade support patch.
The 000 patch is almost the same as the previous code (includes a small fix).
The 001 patch provides rewriting of the visibility map as a pageConverter routine.
The 002 patch is for enhancing the debug messages in visibilitymap.c.

In order to support the pageConvert plugin, I made the following changes.

* Main changes
- Remove PAGE_CONVERSION.
- The pg_upgrade plugin is located in the 'src/bin/pg_upgrade/plugins' directory.
- Move the directory having plugins from '$(bin)/plugins' to '$(lib)/plugins'.
- Add a new page-converter plugin function for the visibility map.
- The current code doesn't allow us to use link mode (-k) in the case where a page converter is required, but I changed it so that if a page converter for a fork file is specified, we actually convert it even in link mode.

* Interface design
convertFile() and convertPage() are plugin functions for the main relation file, and these functions are dynamically loaded by loadConvertPlugin().
I added a new pageConvert plugin function convertVMFile() for the visibility map (fork file). If the layout of CLOG, FSM, etc. is changed in the future, we could add new pageConvert plugin functions like convertCLOGFile() or convertFSMFile(), and these functions would be dynamically loaded by loadAdditionalConvertPlugin().
This means that main-file conversion and fork-file conversion are executed independently, and conversion of fork files is executed even if link mode is specified. Each conversion plugin is loaded and used only when it's required.

I still agree with this plugin approach, but I feel it's still a bit complicated, and I'm concerned that the patch size has increased. Please give me feedback. If there are no objections to this, I'm going to spend time improving it.

Regards,

--
Masahiko Sawada
Attachment
Hello,

At Thu, 4 Feb 2016 02:32:29 +0900, Masahiko Sawada <sawada.mshk@gmail.com> wrote in <CAD21AoB1HnZ7thWYjqKve78gQ5+PyedbbkjAPbc5zLV3oA-CuA@mail.gmail.com>
> On Tue, Feb 2, 2016 at 7:22 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> > Masahiko Sawada wrote:
> >> I think we have two options.
> >>
> >> 1. Change page layout(PG_PAGE_LAYOUT_VERSION) to 5. pg_upgrade detects
> >> it and then converts only VM files.
> >> 2. Change pg_upgrade plugin mechanism so that it can handle other name
> >> conversion plugins (e.g., convertLayout_vm_to_vfm)
> >>
> >> I think #2 is better. Thought?
> >
> > My vote is for #2 as well. Maybe we just didn't have forks when this
> > functionality was invented; maybe the author just didn't think hard
> > enough about what would be the right interface to do it.
>
> I've almost written up a very rough patch (it can pass the regression tests).
> Windows support is not done yet, and the Makefile is not correct.
>
> I've divided the main patch into two patches: an add-frozen-bit patch and
> a pg_upgrade support patch.
> The 000 patch is almost the same as the previous code (includes a small fix).
> The 001 patch provides rewriting of the visibility map as a pageConverter routine.
> The 002 patch is for enhancing the debug messages in visibilitymap.c.

Thanks, this makes it easier to read.

> In order to support the pageConvert plugin, I made the following changes.
> * Main changes
> - Remove PAGE_CONVERSION.
> - The pg_upgrade plugin is located in the 'src/bin/pg_upgrade/plugins' directory.
> - Move the directory having plugins from '$(bin)/plugins' to '$(lib)/plugins'.

These seem fair.

> - Add a new page-converter plugin function for the visibility map.
> - The current code doesn't allow us to use link mode (-k) in the case
>   where a page converter is required, but I changed it so that if a
>   page converter for a fork file is specified, we actually convert it
>   even in link mode.
>
> * Interface design
> convertFile() and convertPage() are plugin functions for the main relation
> file, and these functions are dynamically loaded by
> loadConvertPlugin().

Though I haven't looked at this too closely, loadConverterPlugin looks to continue deciding what plugin to load using the old and new page layout versions. Currently the actually possible versions are 4 and, if we increment it now, 5.

On the other hand, _vm came at the *catalog version* 201107031 (the 9.1 release) and _fsm came in the 8.4 release. Both of them are of page layout version 4. Are we allowed to increment the page layout version for this reason? And is this framework under reconstruction flexible enough for this kind of change in the future? I don't think so.

We have added _vm and _fsm so far, so we must use a version number that can determine when _vm, _fsm and _vfm were introduced. I'm afraid that the page layout version is out of purpose here; the catalog version seems most usable, as it is already used to know when the crash-safe VM was introduced.

Using the catalog version, the plugin we provide first would be convertLayout_201105231_201602071.so, which has only a converter from _vm to _vfm. This plugin is loaded for the combination of a source cluster with a catalog version of 201105231 (when the VM was introduced) or later and a destination cluster with one *before* 201602071 (this version).

If we change the format of fsm (vm no longer exists), we would have a new plugin, maybe named convertLayout_200904091_2017xxxxx.so, which has a, maybe in-place, file converter for fsm.
It will be loaded when a source database is of the catalog version of 200904091 (when the FSM was introduced) or later and a destination before 2017xxxxx (that version). The catalog version seems to work fine.

So far, I assumed that we can safely assume that the name of files to be converted is <oid>[fork_name], so the possible types of conversions would be the following.

- per-page conversion
- per-file conversion between files with the same fork name
- per-file conversion between files with different fork names

Since the plugin filename doesn't tell such things, they should be told by the plugin itself. So a plugin is to provide the following interface:

typedef struct ConverterTable
{
    char *src_fork_name;
    char *dst_fork_name;
    FileConverterFunc file_converter;
    PageConverterFunc page_converter;
} ConverterTable[];

Following such a name convention for plugins, we may load multiple plugins at once, so we collect all entries of the tables of all loaded plugins and check that no src_fork_name among them is duplicated.

Here, we have sufficient information to choose which converter to invoke and execute conversion like this:

for (fork_name in all_fork_names_including_"")
{
    find a converter comparing fork_name with src_fork_name.
    check dst_fork_name and rename the target file if needed.
    invoke the converter.
}

If we need to convert clogs or similar files and need to prepare for such events, the ConverterTable might have an additional member and change the meaning of some of the existing members.

typedef struct ConverterTable
{
    enum target_type;   /* FILE_NAME or FORK_NAME */
    char *src_name;
    char *dst_name;
    FileConverterFunc file_converter;
    PageConverterFunc page_converter;
} ConverterTable[];

When target_type == FILE_NAME, src_name and dst_name represent the target file names relative to $PGDATA.

# Yeah, I know it is too complicated.

> I added a new pageConvert plugin function convertVMFile() for
> the visibility map (fork file).
> If the layout of CLOG, FSM, etc. is changed in the future, we could
> add new pageConvert plugin functions like convertCLOGFile() or
> convertFSMFile(), and these functions would be dynamically loaded by
> loadAdditionalConvertPlugin().
> This means that main-file conversion and fork-file conversion are executed
> independently, and conversion of fork files is executed even if link
> mode is specified.
> Each conversion plugin is loaded and used only when it's required.

As I asked upthread, one of the most important design points of the plugin mechanism is what characteristics of the source and/or destination cluster trigger loading of a plugin. And if the page layout version is that trigger, are we allowed to increment it for such unrelated events? Or should we use another characteristic like the catalog version?

> I still agree with this plugin approach, but I feel it's still
> a bit complicated, and I'm concerned that the patch size has
> increased.
> Please give me feedback.

Yeah, I feel the same. What makes it worse, the plugin mechanism will get even more complex if we make it more flexible for possible usage as I proposed above. It is apparently too complicated for deciding whether to load *just one*, for now, converter function. And no additional converter is in sight.

I'm inclined to pull all the plugin stuff out of pg_upgrade. We are so prudent about making changes to file formats that this kind of event will happen at intervals of several years.
The plugin mechanism would be valuable if we were encouraged to change file formats more frequently and freely by providing it, but such a situation would absolutely introduce more untoward things..

> If there are not objections about this, I'm going to spend time
> to improve it.

Sorry, but I do have a strong objection to this... Does anyone else have opinions on that?

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center
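Rendered as C for concreteness, the dispatch loop sketched above might look like this; every identifier here follows Horiguchi-san's sketch or is invented for illustration, not an existing pg_upgrade API.

/*
 * Illustrative C rendering of the dispatch loop above; all names are
 * hypothetical.
 */
for (int i = 0; i < n_fork_names; i++)
{
    const char *fork_name = all_fork_names[i];  /* "" for the main fork */

    for (int j = 0; j < n_converter_entries; j++)
    {
        if (strcmp(converter_table[j].src_fork_name, fork_name) != 0)
            continue;

        /* rename the target file first if the fork name changes */
        if (strcmp(converter_table[j].src_fork_name,
                   converter_table[j].dst_fork_name) != 0)
            rename_target_file(fork_name, converter_table[j].dst_fork_name);

        /* a per-file converter wins; otherwise convert page by page */
        if (converter_table[j].file_converter != NULL)
            converter_table[j].file_converter(src_path, dst_path);
        else
            convert_each_page(src_path, dst_path,
                              converter_table[j].page_converter);
        break;
    }
}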
Thank you for reviewing this patch!

On Wed, Feb 10, 2016 at 4:39 PM, Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> Hello,
>
> At Thu, 4 Feb 2016 02:32:29 +0900, Masahiko Sawada <sawada.mshk@gmail.com> wrote in <CAD21AoB1HnZ7thWYjqKve78gQ5+PyedbbkjAPbc5zLV3oA-CuA@mail.gmail.com>
>> On Tue, Feb 2, 2016 at 7:22 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
>> > Masahiko Sawada wrote:
>> >> I think we have two options.
>> >>
>> >> 1. Change page layout(PG_PAGE_LAYOUT_VERSION) to 5. pg_upgrade detects
>> >> it and then converts only VM files.
>> >> 2. Change pg_upgrade plugin mechanism so that it can handle other name
>> >> conversion plugins (e.g., convertLayout_vm_to_vfm)
>> >>
>> >> I think #2 is better. Thought?
>> >
>> > My vote is for #2 as well. Maybe we just didn't have forks when this
>> > functionality was invented; maybe the author just didn't think hard
>> > enough about what would be the right interface to do it.
>>
>> I've almost written up a very rough patch (it can pass the regression tests).
>> Windows support is not done yet, and the Makefile is not correct.
>>
>> I've divided the main patch into two patches: an add-frozen-bit patch and
>> a pg_upgrade support patch.
>> The 000 patch is almost the same as the previous code (includes a small fix).
>> The 001 patch provides rewriting of the visibility map as a pageConverter routine.
>> The 002 patch is for enhancing the debug messages in visibilitymap.c.
>
> Thanks, this makes it easier to read.
>
>> In order to support the pageConvert plugin, I made the following changes.
>> * Main changes
>> - Remove PAGE_CONVERSION.
>> - The pg_upgrade plugin is located in the 'src/bin/pg_upgrade/plugins' directory.
>> - Move the directory having plugins from '$(bin)/plugins' to '$(lib)/plugins'.
>
> These seem fair.
>
>> - Add a new page-converter plugin function for the visibility map.
>> - The current code doesn't allow us to use link mode (-k) in the case
>>   where a page converter is required, but I changed it so that if a
>>   page converter for a fork file is specified, we actually convert it
>>   even in link mode.
>>
>> * Interface design
>> convertFile() and convertPage() are plugin functions for the main relation
>> file, and these functions are dynamically loaded by
>> loadConvertPlugin().
>
> Though I haven't looked at this too closely, loadConverterPlugin looks
> to continue deciding what plugin to load using the old and new page
> layout versions. Currently the actually possible versions are 4
> and, if we increment it now, 5.
>
> On the other hand, _vm came at the *catalog version* 201107031
> (the 9.1 release) and _fsm came in the 8.4 release. Both of them are of
> page layout version 4. Are we allowed to increment the page layout
> version for this reason? And is this framework under
> reconstruction flexible enough for this kind of change in the
> future? I don't think so.

Yeah, I also think that the page layout version should not be increased by this layout change of the VM. This patch checks the catalog version first, and then decides which plugin to load. In this case, only the format of the VM has been changed, so pg_upgrade loads a plugin for the VM and converts it. pg_upgrade doesn't load plugins for the other files, and those files are just copied.

>
> We have added _vm and _fsm so far, so we must use a version number
> that can determine when _vm, _fsm and _vfm were introduced. I'm
> afraid that the page layout version is out of purpose here; the
> catalog version seems most usable, as it is already used to know
> when the crash-safe VM was introduced.
>
> Using the catalog version, the plugin we provide first would be
> convertLayout_201105231_201602071.so, which has only a converter
> from _vm to _vfm. This plugin is loaded for the combination of
> a source cluster with a catalog version of 201105231 (when the VM
> was introduced) or later and a destination cluster with one
> *before* 201602071 (this version).
>
> If we change the format of fsm (vm no longer exists), we would
> have a new plugin, maybe named
> convertLayout_200904091_2017xxxxx.so, which has a, maybe in-place,
> file converter for fsm. It will be loaded when a source
> database is of the catalog version of 200904091 (when the FSM was
> introduced) or later and a destination before 2017xxxxx (that
> version). The catalog version seems to work fine.

I think that it's not a good idea to use the catalog version for the plugin name, because even if the catalog version is used for the plugin file name as you suggested, pg_upgrade still needs to decide by itself what plugin name to load. Also, a plugin file named with catalog versions would not make it easy to understand what the plugin actually does; it's not developer friendly. The advantage of using the page layout version as the plugin name is that pg_upgrade can decide automatically which plugin should be loaded.

> So far, I assumed that we can safely assume that the name of
> files to be converted is <oid>[fork_name], so the possible types
> of conversions would be the following.
>
> - per-page conversion
> - per-file conversion between files with the same fork name
> - per-file conversion between files with different fork names
>
> Since the plugin filename doesn't tell such things, they should
> be told by the plugin itself. So a plugin is to provide the
> following interface:
>
> typedef struct ConverterTable
> {
>     char *src_fork_name;
>     char *dst_fork_name;
>     FileConverterFunc file_converter;
>     PageConverterFunc page_converter;
> } ConverterTable[];
>
> Following such a name convention for plugins, we may load multiple
> plugins at once, so we collect all entries of the tables of all
> loaded plugins and check that no src_fork_name among them is
> duplicated.
>
> Here, we have sufficient information to choose which converter to
> invoke and execute conversion like this:
>
> for (fork_name in all_fork_names_including_"")
> {
>     find a converter comparing fork_name with src_fork_name.
>     check dst_fork_name and rename the target file if needed.
>     invoke the converter.
> }
>
> If we need to convert clogs or similar files and need to prepare for
> such events, the ConverterTable might have an additional member
> and change the meaning of some of the existing members.
>
> typedef struct ConverterTable
> {
>     enum target_type;   /* FILE_NAME or FORK_NAME */
>     char *src_name;
>     char *dst_name;
>     FileConverterFunc file_converter;
>     PageConverterFunc page_converter;
> } ConverterTable[];
>
> When target_type == FILE_NAME, src_name and dst_name represent
> the target file names relative to $PGDATA.
>
> # Yeah, I know it is too complicated.
>

I agree with having the ConverterTable. Since we have three kinds of file suffixes ("", "_vm", "_fsm"), pg_upgrade will have three elements in ConverterTable[].
>> This means that main-file conversion and fork-file conversion are executed
>> independently, and conversion of fork files is executed even if link
>> mode is specified.
>> Each conversion plugin is loaded and used only when it's required.
>
> As I asked upthread, one of the most important design points of
> the plugin mechanism is what characteristics of the source and/or
> destination cluster trigger loading of a plugin. And if the page
> layout version is that trigger, are we allowed to increment it for
> such unrelated events? Or should we use another characteristic
> like the catalog version?
>
>> I still agree with this plugin approach, but I feel it's still
>> a bit complicated, and I'm concerned that the patch size has
>> increased.
>> Please give me feedback.
>
> Yeah, I feel the same. What makes it worse, the plugin mechanism
> will get even more complex if we make it more flexible for possible
> usage as I proposed above. It is apparently too complicated for
> deciding whether to load *just one*, for now, converter
> function. And no additional converter is in sight.

There may be cases where the layout of another type of relation file is changed, so pg_upgrade will need to convert several types of relation files at the same time. I'm thinking that we need to support loading multiple plugin functions at least.

> I'm inclined to pull all the plugin stuff out of pg_upgrade. We are
> so prudent about making changes to file formats that this kind of
> event will happen at intervals of several years. The plugin
> mechanism would be valuable if we were encouraged to change file
> formats more frequently and freely by providing it, but such a
> situation would absolutely introduce more untoward things..

Yes, I think so too. In fact, such a layout change is happening for the first time since pg_upgrade was introduced in 9.0.

Regards,

--
Masahiko Sawada
On Wed, Feb 3, 2016 at 12:32 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> I've divided the main patch into two patches; add frozen bit patch and
> pg_upgrade support patch.
> 000 patch is almost same as previous code. (includes small fix)
> 001 patch provides rewriting visibility map as a pageConverter routine.
> 002 patch is for enhancement debug message in visibilitymap.c

I'd like to suggest splitting 000 into two patches. The first one would change the format of the visibility map, and the second one would change VACUUM to optimize scans based on the new format. I think that would make it easier to get this reviewed and committed.

I think this patch churns a bunch of things that don't really need to be churned. For example, consider this hunk:

     /*
      * If we didn't pin the visibility map page and the page has become all
-     * visible while we were busy locking the buffer, we'll have to unlock and
-     * re-lock, to avoid holding the buffer lock across an I/O. That's a bit
-     * unfortunate, but hopefully shouldn't happen often.
+     * visible or all frozen while we were busy locking the buffer, we'll
+     * have to unlock and re-lock, to avoid holding the buffer lock across an
+     * I/O. That's a bit unfortunate, but hopefully shouldn't happen often.
      */

Since the page can't become all-frozen without also becoming all-visible, the original text is still 100% accurate, and the change doesn't seem to add any useful clarity. Let's think about which things really need to be changed and not just mechanically change everything.

-        Assert(PageIsAllVisible(heapPage));
+        /*
+         * Caller is expected to set PD_ALL_VISIBLE or
+         * PD_ALL_FROZEN first.
+         */
+        Assert(((flags | VISIBILITYMAP_ALL_VISIBLE) && PageIsAllVisible(heapPage)) ||
+               ((flags | VISIBILITYMAP_ALL_FROZEN) && PageIsAllFrozen(heapPage)));

I think this would be more clear as two separate assertions.

Your 000 patch has a little bit of whitespace damage:

[rhaas pgsql]$ git diff --check
src/backend/commands/vacuumlazy.c:1951: indent with spaces.
+                            bool *all_visible, bool *all_frozen)
src/include/access/heapam_xlog.h:393: indent with spaces.
+                   Buffer vm_buffer, TransactionId cutoff_xid, uint8 flags);

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
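As a concrete illustration of the two-assertion suggestion (and of testing flags with & rather than |, which, as noted later in the thread, is what the quoted hunk gets wrong), a corrected sketch might read:

/*
 * Sketch only: each flag, if being set, requires the corresponding
 * page-level bit to be set already. Note the bit tests use & (is the
 * flag set?) rather than | as in the quoted hunk.
 */
Assert(!(flags & VISIBILITYMAP_ALL_VISIBLE) || PageIsAllVisible(heapPage));
Assert(!(flags & VISIBILITYMAP_ALL_FROZEN) || PageIsAllFrozen(heapPage));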
On Fri, Feb 12, 2016 at 4:46 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Wed, Feb 3, 2016 at 12:32 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>> I've divided the main patch into two patches; add frozen bit patch and
>> pg_upgrade support patch.
>> 000 patch is almost same as previous code. (includes small fix)
>> 001 patch provides rewriting visibility map as a pageConverter routine.
>> 002 patch is for enhancement debug message in visibilitymap.c
>
> I'd like to suggest splitting 000 into two patches. The first one
> would change the format of the visibility map, and the second one
> would change VACUUM to optimize scans based on the new format. I
> think that would make it easier to get this reviewed and committed.
>
> I think this patch churns a bunch of things that don't really need to
> be churned. For example, consider this hunk:
>
>      /*
>       * If we didn't pin the visibility map page and the page has become all
> -     * visible while we were busy locking the buffer, we'll have to unlock and
> -     * re-lock, to avoid holding the buffer lock across an I/O. That's a bit
> -     * unfortunate, but hopefully shouldn't happen often.
> +     * visible or all frozen while we were busy locking the buffer, we'll
> +     * have to unlock and re-lock, to avoid holding the buffer lock across an
> +     * I/O. That's a bit unfortunate, but hopefully shouldn't happen often.
>       */
>
> Since the page can't become all-frozen without also becoming
> all-visible, the original text is still 100% accurate, and the change
> doesn't seem to add any useful clarity. Let's think about which
> things really need to be changed and not just mechanically change
> everything.
>
> -        Assert(PageIsAllVisible(heapPage));
> +        /*
> +         * Caller is expected to set PD_ALL_VISIBLE or
> +         * PD_ALL_FROZEN first.
> +         */
> +        Assert(((flags | VISIBILITYMAP_ALL_VISIBLE) && PageIsAllVisible(heapPage)) ||
> +               ((flags | VISIBILITYMAP_ALL_FROZEN) && PageIsAllFrozen(heapPage)));
>
> I think this would be more clear as two separate assertions.
>
> Your 000 patch has a little bit of whitespace damage:
>
> [rhaas pgsql]$ git diff --check
> src/backend/commands/vacuumlazy.c:1951: indent with spaces.
> +                            bool *all_visible, bool *all_frozen)
> src/include/access/heapam_xlog.h:393: indent with spaces.
> +                   Buffer vm_buffer, TransactionId cutoff_xid, uint8 flags);
>

Thank you for reviewing this patch. I've divided the 000 patch into two patches, and attached the latest 4 patches in total. I also changed the pg_upgrade plugin logic so that each kind of file suffix has one conversion plugin. A suffix that doesn't need to be converted has the pg_copy_file() function as its plugin function.

Regards,

--
Masahiko Sawada
Attachment
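A minimal sketch of the per-suffix table this describes; the entry layout and the signatures are assumptions, with pg_copy_file() and rewriteVisibilityMap() used as named earlier in the thread:

/*
 * Illustrative only: one conversion plugin per file suffix, with plain
 * copy as the no-op converter. Signatures are assumed, not the patch's.
 */
extern void pg_copy_file(const char *src_path, const char *dst_path);
extern void rewriteVisibilityMap(const char *src_path, const char *dst_path);

typedef void (*fileConverter) (const char *src_path, const char *dst_path);

typedef struct
{
    const char   *suffix;         /* relation file suffix */
    fileConverter convert_file;
} ConverterTableEntry;

static const ConverterTableEntry converter_table[] =
{
    {"",     pg_copy_file},         /* main fork: no conversion needed */
    {"_fsm", pg_copy_file},         /* free space map: no conversion needed */
    {"_vm",  rewriteVisibilityMap}, /* visibility map: add frozen bits */
};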
On Sun, Feb 14, 2016 at 12:19 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: > Thank you for reviewing this patch. > I've divided 000 patch into two patches, and attached latest 4 patches in total. Thank you! I'll go through this again as soon as I have a free moment. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Wed, Feb 10, 2016 at 04:39:15PM +0900, Kyotaro HORIGUCHI wrote: > > I still agree with this plugin approach, but I felt it's still > > complicated a bit, and I'm concerned that patch size has been > > increased. > > Please give me feedbacks. > > Yeah, I feel the same. What make it worse, the plugin mechanism > will get further complex if we make it more flexible for possible > usage as I proposed above. It is apparently too complicated for > deciding whether to load *just one*, for now, converter > function. And no additional converter is in sight. > > I incline to pull out all the plugin stuff of pg_upgrade. We are > so prudent to make changes of file formats so this kind of events > will happen with several-years intervals. The plugin mechanism > would be valuable if we are encouraged to change file formats > more frequently and freely by providing it, but such situation > absolutely introduces more untoward things.. I agreed on ripping out the converter plugin ability of pg_upgrade. Remember pg_upgrade was originally written by EnterpriseDB staff, and I think they expected their closed-source fork of Postgres might need a custom page converter someday, but it never needed one, and at this point I think having the code in there is just making things more complex. I see _no_ reason for community Postgres to use a plugin converter because we are going to need that code for every upgrade from pre-9.6 to 9.6+, so why not just hard-code in the functions we need. We can remove it once 9.5 is end-of-life. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + As you are, so once was I. As I am, so you will be. + + Roman grave inscription +
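A sketch of what the hard-coded approach Bruce describes could look like inside the file-transfer loop. VISIBILITY_MAP_FROZEN_BIT_CAT_VER and rewriteVisibilityMap() are the names already used in this thread; the surrounding fields and the copy/link helpers are assumptions about pg_upgrade's internals, not actual committed code.

/*
 * Hard-coded dispatch sketch: rewrite only VM forks coming from clusters
 * that predate the frozen-bit format; everything else is copied/linked.
 * copy_file()/link_file() stand in for pg_upgrade's actual helpers.
 */
if (strcmp(type_suffix, "_vm") == 0 &&
    old_cluster.controldata.cat_ver < VISIBILITY_MAP_FROZEN_BIT_CAT_VER)
    rewriteVisibilityMap(old_file, new_file);
else if (user_opts.transfer_mode == TRANSFER_MODE_LINK)
    link_file(old_file, new_file);
else
    copy_file(old_file, new_file);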
On Tue, Feb 16, 2016 at 6:13 AM, Bruce Momjian <bruce@momjian.us> wrote: > On Wed, Feb 10, 2016 at 04:39:15PM +0900, Kyotaro HORIGUCHI wrote: >> > I still agree with this plugin approach, but I felt it's still >> > complicated a bit, and I'm concerned that patch size has been >> > increased. >> > Please give me feedbacks. >> >> Yeah, I feel the same. What make it worse, the plugin mechanism >> will get further complex if we make it more flexible for possible >> usage as I proposed above. It is apparently too complicated for >> deciding whether to load *just one*, for now, converter >> function. And no additional converter is in sight. >> >> I incline to pull out all the plugin stuff of pg_upgrade. We are >> so prudent to make changes of file formats so this kind of events >> will happen with several-years intervals. The plugin mechanism >> would be valuable if we are encouraged to change file formats >> more frequently and freely by providing it, but such situation >> absolutely introduces more untoward things.. > > I agreed on ripping out the converter plugin ability of pg_upgrade. > Remember pg_upgrade was originally written by EnterpriseDB staff, and I > think they expected their closed-source fork of Postgres might need a > custom page converter someday, but it never needed one, and at this > point I think having the code in there is just making things more > complex. I see _no_ reason for community Postgres to use a plugin > converter because we are going to need that code for every upgrade from > pre-9.6 to 9.6+, so why not just hard-code in the functions we need. We > can remove it once 9.5 is end-of-life. > Hm, we should rather remove the source code around PAGE_CONVERSION and page.c at 9.6? Regards, -- Masahiko Sawada
On Tue, Feb 16, 2016 at 11:56:25PM +0900, Masahiko Sawada wrote: > > I agreed on ripping out the converter plugin ability of pg_upgrade. > > Remember pg_upgrade was originally written by EnterpriseDB staff, and I > > think they expected their closed-source fork of Postgres might need a > > custom page converter someday, but it never needed one, and at this > > point I think having the code in there is just making things more > > complex. I see _no_ reason for community Postgres to use a plugin > > converter because we are going to need that code for every upgrade from > > pre-9.6 to 9.6+, so why not just hard-code in the functions we need. We > > can remove it once 9.5 is end-of-life. > > > > Hm, we should rather remove the source code around PAGE_CONVERSION and > page.c at 9.6? Yes. I can do it if you wish. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + As you are, so once was I. As I am, so you will be. + + Roman grave inscription +
On Wed, Feb 17, 2016 at 12:02 AM, Bruce Momjian <bruce@momjian.us> wrote:
> On Tue, Feb 16, 2016 at 11:56:25PM +0900, Masahiko Sawada wrote:
>> > I agreed on ripping out the converter plugin ability of pg_upgrade.
>> > Remember pg_upgrade was originally written by EnterpriseDB staff, and I
>> > think they expected their closed-source fork of Postgres might need a
>> > custom page converter someday, but it never needed one, and at this
>> > point I think having the code in there is just making things more
>> > complex. I see _no_ reason for community Postgres to use a plugin
>> > converter because we are going to need that code for every upgrade from
>> > pre-9.6 to 9.6+, so why not just hard-code in the functions we need. We
>> > can remove it once 9.5 is end-of-life.
>> >
>>
>> Hm, we should rather remove the source code around PAGE_CONVERSION and
>> page.c at 9.6?
>
> Yes. I can do it if you wish.

I see. I understand that the page-converter code could be useful for some future cases, but it makes things more complex. So I will post the patch without the page converter if there is no objection from other hackers.

Regards,

--
Masahiko Sawada
Masahiko Sawada wrote: > On Wed, Feb 17, 2016 at 12:02 AM, Bruce Momjian <bruce@momjian.us> wrote: > > On Tue, Feb 16, 2016 at 11:56:25PM +0900, Masahiko Sawada wrote: > >> > I agreed on ripping out the converter plugin ability of pg_upgrade. > >> > Remember pg_upgrade was originally written by EnterpriseDB staff, and I > >> > think they expected their closed-source fork of Postgres might need a > >> > custom page converter someday, but it never needed one, and at this > >> > point I think having the code in there is just making things more > >> > complex. I see _no_ reason for community Postgres to use a plugin > >> > converter because we are going to need that code for every upgrade from > >> > pre-9.6 to 9.6+, so why not just hard-code in the functions we need. We > >> > can remove it once 9.5 is end-of-life. > >> > > >> > >> Hm, we should rather remove the source code around PAGE_CONVERSION and > >> page.c at 9.6? > > > > Yes. I can do it if you wish. > > I see. I understand that page-converter code would be useful for some > future cases, but makes thing more complex. If we're not going to use it, let's get rid of it right away. There's no point in having a feature that adds complexity just because we might find some hypothetical use of it in a not-yet-imagined future. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Tue, Feb 16, 2016 at 03:57:01PM -0300, Alvaro Herrera wrote: > Masahiko Sawada wrote: > > On Wed, Feb 17, 2016 at 12:02 AM, Bruce Momjian <bruce@momjian.us> wrote: > > > On Tue, Feb 16, 2016 at 11:56:25PM +0900, Masahiko Sawada wrote: > > >> > I agreed on ripping out the converter plugin ability of pg_upgrade. > > >> > Remember pg_upgrade was originally written by EnterpriseDB staff, and I > > >> > think they expected their closed-source fork of Postgres might need a > > >> > custom page converter someday, but it never needed one, and at this > > >> > point I think having the code in there is just making things more > > >> > complex. I see _no_ reason for community Postgres to use a plugin > > >> > converter because we are going to need that code for every upgrade from > > >> > pre-9.6 to 9.6+, so why not just hard-code in the functions we need. We > > >> > can remove it once 9.5 is end-of-life. > > >> > > > >> > > >> Hm, we should rather remove the source code around PAGE_CONVERSION and > > >> page.c at 9.6? > > > > > > Yes. I can do it if you wish. > > > > I see. I understand that page-converter code would be useful for some > > future cases, but makes thing more complex. > > If we're not going to use it, let's get rid of it right away. There's > no point in having a feature that adds complexity just because we might > find some hypothetical use of it in a not-yet-imagined future. Agreed. We can always add it later if we need it. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + As you are, so once was I. As I am, so you will be. + + Roman grave inscription +
On Wed, Feb 17, 2016 at 4:08 AM, Bruce Momjian <bruce@momjian.us> wrote: > On Tue, Feb 16, 2016 at 03:57:01PM -0300, Alvaro Herrera wrote: >> Masahiko Sawada wrote: >> > On Wed, Feb 17, 2016 at 12:02 AM, Bruce Momjian <bruce@momjian.us> wrote: >> > > On Tue, Feb 16, 2016 at 11:56:25PM +0900, Masahiko Sawada wrote: >> > >> > I agreed on ripping out the converter plugin ability of pg_upgrade. >> > >> > Remember pg_upgrade was originally written by EnterpriseDB staff, and I >> > >> > think they expected their closed-source fork of Postgres might need a >> > >> > custom page converter someday, but it never needed one, and at this >> > >> > point I think having the code in there is just making things more >> > >> > complex. I see _no_ reason for community Postgres to use a plugin >> > >> > converter because we are going to need that code for every upgrade from >> > >> > pre-9.6 to 9.6+, so why not just hard-code in the functions we need. We >> > >> > can remove it once 9.5 is end-of-life. >> > >> > >> > >> >> > >> Hm, we should rather remove the source code around PAGE_CONVERSION and >> > >> page.c at 9.6? >> > > >> > > Yes. I can do it if you wish. >> > >> > I see. I understand that page-converter code would be useful for some >> > future cases, but makes thing more complex. >> >> If we're not going to use it, let's get rid of it right away. There's >> no point in having a feature that adds complexity just because we might >> find some hypothetical use of it in a not-yet-imagined future. > > Agreed. We can always add it later if we need it. > Attached patch gets rid of page conversion code. Regards, -- Masahiko Sawada
Attachment
On Wed, Feb 17, 2016 at 4:29 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: > On Wed, Feb 17, 2016 at 4:08 AM, Bruce Momjian <bruce@momjian.us> wrote: >> On Tue, Feb 16, 2016 at 03:57:01PM -0300, Alvaro Herrera wrote: >>> Masahiko Sawada wrote: >>> > On Wed, Feb 17, 2016 at 12:02 AM, Bruce Momjian <bruce@momjian.us> wrote: >>> > > On Tue, Feb 16, 2016 at 11:56:25PM +0900, Masahiko Sawada wrote: >>> > >> > I agreed on ripping out the converter plugin ability of pg_upgrade. >>> > >> > Remember pg_upgrade was originally written by EnterpriseDB staff, and I >>> > >> > think they expected their closed-source fork of Postgres might need a >>> > >> > custom page converter someday, but it never needed one, and at this >>> > >> > point I think having the code in there is just making things more >>> > >> > complex. I see _no_ reason for community Postgres to use a plugin >>> > >> > converter because we are going to need that code for every upgrade from >>> > >> > pre-9.6 to 9.6+, so why not just hard-code in the functions we need. We >>> > >> > can remove it once 9.5 is end-of-life. >>> > >> > >>> > >> >>> > >> Hm, we should rather remove the source code around PAGE_CONVERSION and >>> > >> page.c at 9.6? >>> > > >>> > > Yes. I can do it if you wish. >>> > >>> > I see. I understand that page-converter code would be useful for some >>> > future cases, but makes thing more complex. >>> >>> If we're not going to use it, let's get rid of it right away. There's >>> no point in having a feature that adds complexity just because we might >>> find some hypothetical use of it in a not-yet-imagined future. >> >> Agreed. We can always add it later if we need it. >> > > Attached patch gets rid of page conversion code. > Sorry, previous patch is incorrect.. Fixed version patch attached. Regards, -- Masahiko Sawada
Attachment
On Wed, Feb 17, 2016 at 4:44 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: > [...] >> Attached patch gets rid of page conversion code. Attached are updated versions of the 5 patches. I would like to briefly explain them again here to make reviewing easier. The patches can be divided into two groups by purpose. 1. Freeze map: The 000_ patch adds the additional frozen bit to the visibility map, but doesn't include the logic for improving freezing performance. The 001_ patch gets rid of the page-conversion code in pg_upgrade. (This patch isn't essential to the feature itself, but is required by the 002_ patch.) The 002_ patch adds an upgrade mechanism from pre-9.6 to 9.6+ and its regression test. 2. Improve freezing logic: The 003_ patch changes VACUUM to optimize scans based on the freeze map (i.e., the 000_ patch), and adds its regression test. The 004_ patch enhances the debug messages in src/backend/access/heap/visibilitymap.c. Please review them. Regards, -- Masahiko Sawada
Attachment
On Thu, Feb 18, 2016 at 3:45 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: > Attached are updated versions of the 5 patches. > [...] > Please review them. I have pushed 000 and part of 003, with substantial revisions to the 003 part and minor revisions to the 000 part. This gets the basic infrastructure in place, but the vacuum optimization and pg_upgrade fixes still need to be done. I discovered that make check-world failed with 000 applied, because the Assert()s added to visibilitymap_set were using | rather than & to test for a set bit. I fixed that. I revised the code in vacuumlazy.c that updates the new map bits rather heavily. I hope I didn't break anything; please have a look and see if you spot any problems. One big problem was that it's inadequate to judge whether a tuple needs freezing just by looking at xmin; xmax might need to be cleared, for example. I removed the pgstat stuff. I'm not sure we want that stuff in that form; it doesn't seem to fit with the rest of what's in that view, and it wasn't reliable in my testing. I did however throw together a little contrib module for testing, which I attach here. I'm not sure we want to commit this, and at the least someone would need to write documentation. But it's certainly handy for checking whether this works. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Attachment
Thank you for revising and committing this. At Tue, 1 Mar 2016 21:51:55 -0500, Robert Haas <robertmhaas@gmail.com> wrote in <CA+TgmoZtG7hnkgP74zRCeuRrGGG917J5-_P4dzNJz5_kAXFTKg@mail.gmail.com> > On Thu, Feb 18, 2016 at 3:45 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > [...] > > Please review them. > > I have pushed 000 and part of 003, with substantial revisions to the > 003 part and minor revisions to the 000 part. This gets the basic > infrastructure in place, but the vacuum optimization and pg_upgrade > fixes still need to be done. > > I discovered that make check-world failed with 000 applied, because > the Assert()s added to visibilitymap_set were using | rather than & to > test for a set bit. I fixed that. It looks reasonable as far as I can see. Thank you for your labor on the additional part. > I revised the code in vacuumlazy.c that updates the new map bits > rather heavily. I hope I didn't break anything; please have a look > and see if you spot any problems. One big problem was that it's > inadequate to judge whether a tuple needs freezing just by looking at > xmin; xmax might need to be cleared, for example. The new function heap_tuple_needs_eventual_freeze looks reasonable to me in comparison with heap_tuple_needs_freeze. Looking at the additional diff for lazy_vacuum_page, I noticed that visibilitymap_set has a potential performance problem. (Though it doesn't seem to occur for now.) visibilitymap_set decides whether to modify the vm bits with the following code:

| if (flags != (map[mapByte] >> mapBit & VISIBILITYMAP_VALID_BITS))
| {
|     START_CRIT_SECTION();
|
|     map[mapByte] |= (flags << mapBit);

This is effectively right and causes no wrong behavior, but it enters the critical section even in cases like (vm bits = 11, flags = 01), where there is nothing new to set. Please apply this if it looks reasonable.

======
diff --git a/src/backend/access/heap/visibilitymap.c b/src/backend/access/heap/visibilitymap.c
index 2e64fc3..87b7fc6 100644
--- a/src/backend/access/heap/visibilitymap.c
+++ b/src/backend/access/heap/visibilitymap.c
@@ -292,7 +292,8 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
 	map = (uint8 *) PageGetContents(page);
 	LockBuffer(vmBuf, BUFFER_LOCK_EXCLUSIVE);

-	if (flags != (map[mapByte] >> mapBit & VISIBILITYMAP_VALID_BITS))
+	/* modify vm bits only if some requested bit is not yet set */
+	if (flags & ~(map[mapByte] >> mapBit & VISIBILITYMAP_VALID_BITS))
 	{
 		START_CRIT_SECTION();
======

> I removed the pgstat stuff. I'm not sure we want that stuff in that > form; it doesn't seem to fit with the rest of what's in that view, and > it wasn't reliable in my testing. I did however throw together a > little contrib module for testing, which I attach here.
> I'm not sure we want to commit this, and at the least someone would need to write > documentation. But it's certainly handy for checking whether this > works. I haven't considered the reliability, but the n_frozen_pages column in the proposed patch surely seems alien to the columns around it. regards, -- Kyotaro Horiguchi NTT Open Source Software Center
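(A quick standalone way to sanity-check that condition: the throwaway test program below, which is not PostgreSQL code, enumerates every combination of already-set map bits and requested flags, and shows where the proposed check can skip the critical section that the current check enters.)

======
#include <stdio.h>

int
main(void)
{
	int		cur,
			flags;

	/* cur = bits already set in the map; flags = bits the caller asks to set */
	for (cur = 0; cur <= 3; cur++)
		for (flags = 1; flags <= 3; flags++)
			printf("cur=%d flags=%d current=%-5s proposed=%-5s\n", cur, flags,
				   (flags != cur) ? "write" : "skip",
				   ((flags & ~cur & 3) != 0) ? "write" : "skip");
	return 0;
}
======

The interesting row is cur=3 (both bits set), flags=1: the current check enters the critical section even though the |= adds nothing, while the proposed check skips it.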
On Tue, Mar 1, 2016 at 6:51 PM, Robert Haas <robertmhaas@gmail.com> wrote: > I removed the pgstat stuff. I'm not sure we want that stuff in that > form; it doesn't seem to fit with the rest of what's in that view, and > it wasn't reliable in my testing. I did however throw together a > little contrib module for testing, which I attach here. I'm not sure > we want to commit this, and at the least someone would need to write > documentation. But it's certainly handy for checking whether this > works. I think you should commit this. The chances of anyone other than you and Masahiko recalling that you developed this tool in 3 years are essentially nil. I think that the cost of committing a developer-level debugging tool like this is very low. Modules like pg_freespacemap currently already have no chance of being of use to ordinary users. All you need to do is restrict the functions to throw an error when called by non-superusers, out of caution. It's a problem that modules like pg_stat_statements and pg_freespacemap are currently lumped together in the documentation, but we all know that. -- Peter Geoghegan
On 3/2/16 4:21 PM, Peter Geoghegan wrote: > I think you should commit this. [...] > It's a problem that modules like pg_stat_statements and > pg_freespacemap are currently lumped together in the documentation, > but we all know that. +1. -- Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX Experts in Analytics, Data Architecture and PostgreSQL Data in Trouble? Get it in Treble! http://BlueTreble.com
Jim Nasby <Jim.Nasby@BlueTreble.com> writes: > On 3/2/16 4:21 PM, Peter Geoghegan wrote: >> I think you should commit this. [...] > +1. Would it make any sense to stick it under src/test/modules/ instead of contrib/ ? That would help make it clear that it's a debugging tool and not something we expect end users to use. regards, tom lane
On 3/2/16 5:41 PM, Tom Lane wrote: > Jim Nasby <Jim.Nasby@BlueTreble.com> writes: >> On 3/2/16 4:21 PM, Peter Geoghegan wrote: >>> I think you should commit this. [...] >> +1. > > Would it make any sense to stick it under src/test/modules/ instead of > contrib/ ? That would help make it clear that it's a debugging tool > and not something we expect end users to use. I haven't looked at it in detail; is there something inherently dangerous about it? When I'm forced to wear a DBA hat, I'd really love to be able to find out what the VM status for a large table is. If it's in contrib they'll know the tool is there; if it's under src then there's about 0 chance of that. I'd think SU-only and any appropriate warnings would be enough heads-up for DBAs to be careful with it. -- Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX Experts in Analytics, Data Architecture and PostgreSQL Data in Trouble? Get it in Treble! http://BlueTreble.com
At Wed, 2 Mar 2016 17:57:27 -0600, Jim Nasby <Jim.Nasby@BlueTreble.com> wrote in <56D77DE7.7080309@BlueTreble.com> > On 3/2/16 5:41 PM, Tom Lane wrote: > > Jim Nasby <Jim.Nasby@BlueTreble.com> writes: > >> On 3/2/16 4:21 PM, Peter Geoghegan wrote: > >>> I think you should commit this. [...] > > > > Would it make any sense to stick it under src/test/modules/ instead of > > contrib/ ? That would help make it clear that it's a debugging tool > > and not something we expect end users to use. > > I haven't looked at it in detail; is there something inherently > dangerous about it? I don't see any danger, but the interface doesn't seem suited to use by DBAs. > When I'm forced to wear a DBA hat, I'd really love to be able to find > out what the VM status for a large table is. If it's in contrib they'll > know the tool is there; if it's under src then there's about 0 chance > of that. I'd think SU-only and any appropriate warnings would be > enough heads-up for DBAs to be careful with it. It appears to expose nothing about table contents. At least, anybody who can see the name of a table can safely be allowed to use this on it. A possible usage (for me) would be directly counting (un)vacuumed or (un)frozen pages in a relation. It would be convenient if the 'frozen' and 'vacuumed' bits were available as separate integers. It would be usable even if stats values for these bits were shown in statistics views. regards, -- Kyotaro Horiguchi NTT Open Source Software Center
On Wed, Mar 2, 2016 at 6:41 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Jim Nasby <Jim.Nasby@BlueTreble.com> writes: >> On 3/2/16 4:21 PM, Peter Geoghegan wrote: >>> I think you should commit this. [...] >> +1. > > Would it make any sense to stick it under src/test/modules/ instead of > contrib/ ? That would help make it clear that it's a debugging tool > and not something we expect end users to use. I actually think end-users might well want to use it. Also, I created it by hacking up pg_freespacemap, so it may make sense to have it in the same place. I would also be tempted to add additional C functions that scan the entire visibility map and return counts of the total number of bits of each type that are set, and similarly for the page-level bits. Presumably that would be much faster than fetching the bits for every page and counting them individually. I am also tempted to change the API to be a bit more friendly, although I am not sure exactly how. This was a quick and dirty hack so that I could test, but the hardest thing about making it not a quick and dirty hack is probably deciding on a good UI. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
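(For what it's worth, here is a minimal sketch of what such a counting function might look like over raw map bytes. The helper and its names are illustrative, not the eventual pg_visibility code; it assumes the new 9.6 layout of two bits per heap block, with the all-visible bit in the low position of each pair.)

======
#include <stddef.h>
#include <stdint.h>

/* Count set all-visible and all-frozen bits in a run of VM bytes. */
static void
count_vm_bits(const uint8_t *map, size_t nbytes,
			  uint64_t *all_visible, uint64_t *all_frozen)
{
	*all_visible = 0;
	*all_frozen = 0;

	for (size_t i = 0; i < nbytes; i++)
	{
		/* four heap blocks per byte, two bits per block */
		for (int bit = 0; bit < 8; bit += 2)
		{
			if (map[i] & (1 << bit))
				(*all_visible)++;
			if (map[i] & (1 << (bit + 1)))
				(*all_frozen)++;
		}
	}
}
======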
On Sat, Mar 5, 2016 at 1:25 AM, Robert Haas <robertmhaas@gmail.com> wrote: > On Wed, Mar 2, 2016 at 6:41 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> [...] >> Would it make any sense to stick it under src/test/modules/ instead of >> contrib/ ? That would help make it clear that it's a debugging tool >> and not something we expect end users to use. > > I actually think end-users might well want to use it. Also, I created > it by hacking up pg_freespacemap, so it may make sense to have it in > the same place. > I would also be tempted to add additional C functions that scan the > entire visibility map and return counts of the total number of bits of > each type that are set, and similarly for the page-level bits. > Presumably that would be much faster than fetching the bits for every > page and counting them individually. +1. > I am also tempted to change the API to be a bit more friendly, > although I am not sure exactly how. This was a quick and dirty hack > so that I could test, but the hardest thing about making it not a > quick and dirty hack is probably deciding on a good UI. Does that mean the visibility map API in visibilitymap.c? Regards, -- Masahiko Sawada
On Sat, Mar 5, 2016 at 11:25 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: > On Sat, Mar 5, 2016 at 1:25 AM, Robert Haas <robertmhaas@gmail.com> wrote: >> [...] >> I am also tempted to change the API to be a bit more friendly, >> although I am not sure exactly how. This was a quick and dirty hack >> so that I could test, but the hardest thing about making it not a >> quick and dirty hack is probably deciding on a good UI. > > Does that mean the visibility map API in visibilitymap.c? Attached is the latest version of the optimisation patch. I'm still considering the pg_upgrade regression test code, so I will submit that patch later. Regards, -- Masahiko Sawada
Attachment
On Mon, Mar 7, 2016 at 12:41 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: > Attached latest version optimisation patch. > I'm still consider regarding pg_upgrade regression test code, so I > will submit that patch later. I was thinking more about this today and I think that we don't actually need the PD_ALL_FROZEN page-level bit for anything. It's enough that the bit is present in the visibility map. The only point of PD_ALL_VISIBLE is that it tells us that we need to clear the visibility map bit, but that bit is enough to tell us to clear both visibility map bits. So I propose the attached cleanup patch. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Attachment
On Sat, Mar 5, 2016 at 9:25 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: >> I actually think end-users might well want to use it. Also, I created >> it by hacking up pg_freespacemap, so it may make sense to have it in >> the same place. >> I would also be tempted to add additional C functions that scan the >> entire visibility map and return counts of the total number of bits of >> each type that are set, and similarly for the page-level bits. >> Presumably that would be much faster than fetching the bits for every >> page and counting them individually. > > +1. > >> I am also tempted to change the API to be a bit more friendly, >> although I am not sure exactly how. This was a quick and dirty hack >> so that I could test, but the hardest thing about making it not a >> quick and dirty hack is probably deciding on a good UI. > > Does that mean the visibility map API in visibilitymap.c? Here's an updated patch with an API that I think is much more reasonable to expose to users, and documentation! It assumes that the patch I posted a few hours ago to remove PD_ALL_FROZEN will be accepted; if that falls apart for some reason, I'll update this. I plan to push this RSN if nobody objects. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Attachment
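(For orientation, typical use of the module as it was eventually committed looks roughly like the following sketch; check the pg_visibility documentation for the authoritative function names and signatures. The table name is just an example.)

======
CREATE EXTENSION pg_visibility;

-- Per-page all-visible / all-frozen bits for one relation
SELECT * FROM pg_visibility_map('pgbench_accounts') LIMIT 5;

-- Whole-relation summary: counts of all-visible and all-frozen pages
SELECT * FROM pg_visibility_map_summary('pgbench_accounts');
======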
On Mon, Mar 7, 2016 at 4:50 PM, Robert Haas <robertmhaas@gmail.com> wrote: > Here's an updated patch with an API that I think is much more > reasonable to expose to users, and documentation! It assumes that the > patch I posted a few hours ago to remove PD_ALL_FROZEN will be > accepted; if that falls apart for some reason, I'll update this. I > plan to push this RSN if nobody objects. Thanks for making the effort to make the tool generally available. -- Peter Geoghegan
Hello, thank you for updating this tool. At Mon, 7 Mar 2016 14:03:08 -0500, Robert Haas <robertmhaas@gmail.com> wrote in <CA+Tgmob+NjfYE3b3BHBmAC=3tvTbqsZgZ1RoJ63yRAmRgrQOcA@mail.gmail.com> > On Mon, Mar 7, 2016 at 12:41 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > Attached is the latest version of the optimisation patch. > > I'm still considering the pg_upgrade regression test code, so I > > will submit that patch later. > > I was thinking more about this today and I think that we don't > actually need the PD_ALL_FROZEN page-level bit for anything. It's > enough that the bit is present in the visibility map. The only point > of PD_ALL_VISIBLE is that it tells us that we need to clear the > visibility map bit, but that bit is enough to tell us to clear both > visibility map bits. So I propose the attached cleanup patch. It seems reasonable to me. Although I haven't played with it (it didn't even apply for me just now), at a glance it seems that PD_VALID_FLAG_BITS should be changed back to 0x0007 now that PD_ALL_FROZEN has been removed. regards, -- Kyotaro Horiguchi NTT Open Source Software Center
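(Concretely, the point is about this block in src/include/storage/bufpage.h, which with PD_ALL_FROZEN gone should end up back in its pre-patch form. The block below is reproduced from memory, so treat it as a sketch rather than a quote.)

======
#define PD_HAS_FREE_LINES	0x0001	/* are there any unused line pointers? */
#define PD_PAGE_FULL		0x0002	/* not enough free space for new tuple? */
#define PD_ALL_VISIBLE		0x0004	/* all tuples on page are visible to
									 * everyone */

#define PD_VALID_FLAG_BITS	0x0007	/* OR of all valid pd_flags bits */
======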
On Tue, Mar 8, 2016 at 1:20 PM, Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote: > Hello, thank you for updating this tool. > > [...] > It seems reasonable to me. Although I haven't played with it (it didn't > even apply for me just now), at a glance it seems that PD_VALID_FLAG_BITS > should be changed back to 0x0007 now that PD_ALL_FROZEN has been removed. Thank you for updating the tool and proposing it. I agree with you, and the patch you attached looks good to me except for Horiguchi-san's comment. Regarding the pg_visibility module, I'd like to share some bugs and propose adding a relation type condition to each function. Including that, I've attached the remaining 2 patches; one removes the page conversion code from pg_upgrade, and the other adds pg_upgrade support for the frozen bit. Please have a look at them. Regards, -- Masahiko Sawada
Attachment
On Tue, Mar 8, 2016 at 7:26 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: > Regarding the pg_visibility module, I'd like to share some bugs and > propose adding a relation type condition to each function. OK, thanks. > Including that, I've attached the remaining 2 patches; one removes the > page conversion code from pg_upgrade, and the other adds pg_upgrade > support for the frozen bit. Committed 001 with minor tweaks. I find rewrite_vm_table to be pretty opaque. There's not even a comment explaining what it is supposed to do. And I wonder why we really need to be this efficient about it anyway. Like, would it be too expensive to just do this:

for (i = 0; i < BITS_PER_BYTE; ++i)
    if ((old & (1 << i)) != 0)
        new |= 1 << (2 * i);

And how about adding some more comments explaining why we are doing this rewriting, like this: In versions of PostgreSQL prior to catversion 201602181, PostgreSQL's visibility map included one bit per heap page; it now includes two. When upgrading a cluster from before that time to a current PostgreSQL version, we could refuse to copy visibility maps from the old cluster to the new cluster; the next VACUUM would recreate them, but at the price of scanning the entire table. So, instead, we rewrite the old visibility maps in the new format. That way, the all-visible bit remains set for the pages for which it was set previously. The all-frozen bit is never set by this conversion; we leave that to VACUUM. Also, I'm slightly perplexed by the fact that I can't see how this code succeeds in turning each page into two pages, which is something that it seems like it would need to do. Wouldn't we need to write out the old page header twice, once for the first of the two new pages and again for the second? I probably need more caffeine here, so please tell me what I'm missing. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
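(To make the rewriting concrete: each old byte covers eight heap pages at one bit apiece, and the new format needs two bits per page, so each old byte expands into two new bytes. Below is a self-contained sketch of that expansion under the assumptions in the proposed comment; the function name and layout are illustrative, not the pg_upgrade code.)

======
#include <stdint.h>
#include <stdio.h>

/*
 * Spread one old-format VM byte (one all-visible bit per heap page) into
 * two new-format bytes (two bits per page: all-visible in the even bit,
 * all-frozen in the odd bit, which this conversion always leaves clear).
 */
static void
convert_vm_byte(uint8_t old, uint8_t *new_lo, uint8_t *new_hi)
{
	uint16_t	spread = 0;

	for (int i = 0; i < 8; i++)
		if ((old & (1 << i)) != 0)
			spread |= 1 << (2 * i);

	*new_lo = (uint8_t) (spread & 0xFF);	/* bits for heap pages 0..3 */
	*new_hi = (uint8_t) (spread >> 8);		/* bits for heap pages 4..7 */
}

int
main(void)
{
	uint8_t		lo,
				hi;

	convert_vm_byte(0xFF, &lo, &hi);	/* all eight pages all-visible */
	printf("old=0xFF -> new bytes 0x%02X 0x%02X\n", lo, hi);	/* 0x55 0x55 */
	return 0;
}
======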
On Tue, Mar 8, 2016 at 8:30 AM, Robert Haas <robertmhaas@gmail.com> wrote: > On Tue, Mar 8, 2016 at 7:26 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: >> Regarding the pg_visibility module, I'd like to share some bugs and >> propose adding a relation type condition to each function. > > OK, thanks. I left out the relkind check from the final commit because, for one thing, the check you added isn't actually right: toast relations can also have a visibility map. And also, I'm sort of wondering what the point of that check is. What does it protect us from? It doesn't seem very future-proof ... what if we add a new relkind in the future? Do we really want to have to update this? How about instead changing things so that we specifically reject indexes? And maybe some kind of a check that will reject anything that lacks a relfilenode? That seems like it would be more on point. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Tue, Mar 8, 2016 at 5:30 AM, Robert Haas <robertmhaas@gmail.com> wrote: > [...] > Committed 001 with minor tweaks. > > I find rewrite_vm_table to be pretty opaque. [...] > Also, I'm slightly perplexed by the fact that I can't see how this > code succeeds in turning each page into two pages, which is something > that it seems like it would need to do. Wouldn't we need to write out > the old page header twice, once for the first of the two new pages and > again for the second? I probably need more caffeine here, so please > tell me what I'm missing. I think that this loop: while (blkend >= end) executes exactly twice for each iteration of the outer loop. I'd rather see it written as a loop which explicitly executes twice, rather than as one that looks like it might execute a dynamic number of times. I can't imagine that this code needs to be future-proof. If we change the format again in the future, surely we can't just change this code; we would have to write new code for the new format. Cheers, Jeff
On Mon, Mar 7, 2016 at 12:41 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: > Attached is the latest version of the optimisation patch. > I'm still considering the pg_upgrade regression test code, so I > will submit that patch later. I just spent some time looking at this and I'm a bit worried about the following (existing) comment in vacuumlazy.c:

 * Note: The value returned by visibilitymap_get_status could be slightly
 * out-of-date, since we make this test before reading the corresponding
 * heap page or locking the buffer.  This is OK.  If we mistakenly think
 * that the page is all-visible when in fact the flag's just been cleared,
 * we might fail to vacuum the page.  But it's OK to skip pages when
 * scan_all is not set, so no great harm done; the next vacuum will find
 * them.  If we make the reverse mistake and vacuum a page unnecessarily,
 * it'll just be a no-op.

The patch makes some attempt to update the comment mechanically, but that's not nearly enough. That comment is explaining that you *can't* rely on the visibility map to tell you *for sure* that a page does not require vacuuming. For current uses, that's OK, because if we miss a page we'll pick it up later. But now that we want to skip vacuuming pages for relfrozenxid/relminmxid advancement, that rationale doesn't apply. Missing pages that need to be frozen and advancing relfrozenxid anyway would be _bad_. However, after some further thought, I think we might actually be OK. If a page goes from all-frozen to not-all-frozen while VACUUM is running, any new XID added to the page must be newer than the oldestXmin value computed by vacuum_set_xid_limits(), so it won't affect the value to which we can safely set relfrozenxid. Similarly, any MXID added to the page will be newer than GetOldestMultiXactId(), so setting relminmxid is still safe for similar reasons. I'd appreciate it if any other senior hackers could review that chain of reasoning. It would be really bad to get this wrong. On another note, I didn't really like the way you updated the documentation. "eager freezing" doesn't seem like a great term to me, and I think your changes were a little too localized. Here's a draft alternative where I used the term "aggressive vacuum" to describe freezing all of the pages except for those already known to be all-frozen. Thoughts? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Attachment
Robert Haas <robertmhaas@gmail.com> writes: > The patch makes some attempt to update the comment mechanically, but > that's not nearly enough. That comment is explaining that you *can't* > rely on the visibility map to tell you *for sure* that a page does not > require vacuuming. For current uses, that's OK, because if we miss a > page we'll pick it up later. But now we want to skip vacuuming pages > for relfrozenxid/relminmxid advancement, that rationale doesn't apply. > Missing pages that need to be frozen and advancing relfrozenxid anyway > would be _bad_. Check. > However, after some further thought, I think we might actually be OK. > If a page goes from all-frozen to not-all-frozen while VACUUM is > running, any new XID added to the page must be newer than the > oldestXmin value computed by vacuum_set_xid_limits(), so it won't > affect the value to which we can safely set relfrozenxid. Similarly, > any MXID added to the page will be newer than GetOldestMultiXactId(), > so setting relminmxid is still safe for similar reasons. Yeah, I agree with this, as long as the issue is only that the visibility map result is slightly stale and not that it's, say, not crash-safe. We can reasonably assume that any newly-added XID must be one that was in progress while VACUUM was running, and hence will be after the xmin horizon we computed earlier. This requires the existence of a read barrier somewhere between computing xmin horizon and inspecting the visibility map, but I find it hard to believe there aren't plenty. regards, tom lane
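(In code terms, the requirement is just the following ordering. This is an illustrative fragment using 9.6-era function names, not a quote of vacuumlazy.c, and it is not compilable on its own.)

======
/* Compute the xmin horizon first ... */
OldestXmin = GetOldestXmin(onerel, true);

/*
 * ... and make sure the visibility map is not read before the horizon
 * computation completes; any XID added to a page after this point must
 * be newer than OldestXmin.
 */
pg_read_barrier();

vmstatus = visibilitymap_get_status(onerel, blkno, &vmbuffer);
======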
On Wed, Mar 9, 2016 at 1:23 AM, Jeff Janes <jeff.janes@gmail.com> wrote: > On Tue, Mar 8, 2016 at 5:30 AM, Robert Haas <robertmhaas@gmail.com> wrote: > I left out the relkind check from the final commit because, for one > thing, the check you added isn't actually right: toast relations can > also have a visibility map. And also, I'm sort of wondering what the > point of that check is. What does it protect us from? It doesn't > seem very future-proof ... what if we add a new relkind in the future? > Do we really want to have to update this? > > How about instead changing things so that we specifically reject > indexes? And maybe some kind of a check that will reject anything > that lacks a relfilenode? That seems like it would be more on point. I agree; I don't have a strong opinion about this. It would be good to add a condition that rejects only indexes. Attached patches are: - Change heap2 rmgr description - Add condition to pg_visibility - Fix typo in pgvisibility.sgml (Sorry for the late notice..) Regards, -- Masahiko Sawada
Attachment
On Tue, Mar 8, 2016 at 12:49 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> However, after some further thought, I think we might actually be OK. >> If a page goes from all-frozen to not-all-frozen while VACUUM is >> running, any new XID added to the page must be newer than the >> oldestXmin value computed by vacuum_set_xid_limits(), so it won't >> affect the value to which we can safely set relfrozenxid. Similarly, >> any MXID added to the page will be newer than GetOldestMultiXactId(), >> so setting relminmxid is still safe for similar reasons. > > Yeah, I agree with this, as long as the issue is only that the visibility > map result is slightly stale and not that it's, say, not crash-safe. If the visibility map isn't crash-safe, we've got big problems even without this patch, but we dealt with that when index-only scans went in. Maybe this patch introduces more stringent requirements in this area, but I can't think of any reason why that should be true. If anything occurs to you (or anyone else), it would be good to mention it before I go further and destroy the world. > We can reasonably assume that any newly-added XID must be one that was > in progress while VACUUM was running, and hence will be after the xmin > horizon we computed earlier. This requires the existence of a read > barrier somewhere between computing xmin horizon and inspecting the > visibility map, but I find it hard to believe there aren't plenty. I'll check that, but I agree that it should be OK. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Tue, Mar 8, 2016 at 12:59 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: >> How about instead changing things so that we specifically reject >> indexes? And maybe some kind of a check that will reject anything >> that lacks a relfilnode? That seems like it would be more on point. > > I agree, I don't have strong opinion about this. > It would be good to add condition for rejecting only indexes. > Attached patches are, > - Change heap2 rmgr description > - Add condition to pg_visibility > - Fix typo in pgvisibility.sgml > (Sorry for the late notice..) OK, committed the first and last of those. I think the other one needs some work yet; the error message doesn't seem like it is quite our usual style, and if we're going to do something here we should probably also insert a check to throw a better error when there is no relfilenode. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Wed, Mar 9, 2016 at 3:38 AM, Robert Haas <robertmhaas@gmail.com> wrote: > On Tue, Mar 8, 2016 at 12:59 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: >> [...] >> It would be good to add a condition that rejects only indexes. > > OK, committed the first and last of those. I think the other one > needs some work yet; the error message doesn't seem like it is quite > our usual style, and if we're going to do something here we should > probably also insert a check to throw a better error when there is no > relfilenode. Thank you for your advice and suggestions! Attached are the latest 2 patches. * 000 patch : Incorporated the review comments and made the rewriting logic clearer. * 001 patch : Incorporated the documentation suggestions and updated the logic a little. Please review them. Regards, -- Masahiko Sawada
Attachment
On Wed, Mar 9, 2016 at 9:09 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: > Attached are the latest 2 patches. > * 000 patch : Incorporated the review comments and made the rewriting > logic clearer. That's better, thanks. But your comments don't survive pgindent. After running pgindent, I get this:

+	/*
+	 * These old_* variables point to old visibility map page.
+	 *
+	 * cur_old : Points to current position on old page. blkend_old :
+	 * Points to end of old block. break_old : Points to old page break
+	 * position for rewriting a new page. After wrote a new page, old_end
+	 * proceeds rewriteVmBytesPerPgae bytes.
+	 */

You need to either surround this sort of thing with dashes to make pgindent ignore it, or, probably better, rewrite it using complete sentences that together form a paragraph.

+	Oid		pg_database_oid;	/* OID of pg_database relation */

Not used anywhere? Instead of vm_need_rewrite, how about vm_must_add_frozenbit? Can you explain the changes to test.sh? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Thank you for reviewing! Attached is the updated patch. On Thu, Mar 10, 2016 at 3:37 AM, Robert Haas <robertmhaas@gmail.com> wrote: > That's better, thanks. But your comments don't survive pgindent. > After running pgindent, I get this: > > [...] > > You need to either surround this sort of thing with dashes to make > pgindent ignore it, or, probably better, rewrite it using complete > sentences that together form a paragraph. Fixed. > + Oid pg_database_oid; /* OID of pg_database relation */ > > Not used anywhere? Fixed. > Instead of vm_need_rewrite, how about vm_must_add_frozenbit? Fixed. > Can you explain the changes to test.sh? The current regression test scenario is: 1. Do 'make check' on the pre-upgrade cluster. 2. Dump the relallvisible values of all relations in the pre-upgrade cluster to vm_test1.txt. 3. Do pg_upgrade. 4. Do ANALYZE (not VACUUM), and dump the relallvisible values of all relations in the post-upgrade cluster to vm_test2.txt. 5. Compare vm_test1.txt and vm_test2.txt. That is, the regression test compares the relallvisible values in the pre-upgrade and post-upgrade clusters. But because test.sh always uses pre/post clusters with the same catalog version, I realized that we cannot ensure within the test.sh framework that the visibility map rewriting is processed successfully; the rewriting is never executed there. We might need another framework for testing visibility map page rewriting.. Regards, -- Masahiko Sawada
Attachment
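(The dump in steps 2 and 4 can be as simple as the following query, run against each cluster and compared with diff. This is a sketch; the exact query in test.sh may differ.)

======
SELECT c.oid, c.relname, c.relallvisible
FROM pg_class c
WHERE c.relkind IN ('r', 't')   -- plain tables and TOAST tables
ORDER BY c.oid;
======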
On Wed, Mar 9, 2016 at 9:09 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: > * 001 patch : Incorporated the documentation suggestions and updated > logic a little. This 001 patch looks so little like what I was expecting that I decided to start over from scratch. The new version I wrote is attached here. I don't understand why your version tinkers with the logic for setting the all-frozen bit; I thought that what I already committed dealt with that already, and in any case, your version doesn't even compile against latest sources. Your version also leaves the scan_all terminology intact even though it's not accurate any more, and I am not very convinced that the updates to the page-skipping logic are actually correct. Please have a look over this version and see what you think. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Attachment
On Thu, Mar 10, 2016 at 3:27 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: > [...] >> Can you explain the changes to test.sh? > > But because test.sh always uses pre/post clusters with the same catalog > version, I realized that we cannot ensure within the test.sh framework > that the visibility map rewriting is processed successfully; the > rewriting is never executed there. > We might need another framework for testing visibility map page rewriting.. After some further thought, it seems better to add logic that checks the result of rewriting the visibility map to the upgrade process itself, rather than to the regression test, in order to ensure that the rewriting has been done successfully. As a draft, the attached patch checks the result of rewriting the visibility map for each relation as a routine of pg_upgrade. The disadvantage of this is that we need to scan each visibility map page twice. But since the visibility map is not very large, that should not be too bad. Thoughts? Regards, -- Masahiko Sawada
Attachment
On Thu, Mar 10, 2016 at 8:51 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: > After some further thought, it seems better to add logic that checks > the result of rewriting the visibility map to the upgrade process > itself, rather than to the regression test, in order to ensure that > the rewriting has been done successfully. > As a draft, the attached patch checks the result of rewriting the > visibility map for each relation as a routine of pg_upgrade. > The disadvantage of this is that we need to scan each visibility map > page twice. But since the visibility map is not very large, that > should not be too bad. > Thoughts? I think that's kind of pointless. We need to test that this conversion code works, but once it does, I don't think we should make everybody pay the overhead of retesting that. Anyway, the test code could have bugs, too. Here's an updated version of your patch with that code removed and some cosmetic cleanups like fixing typos and stuff like that. I think this is mostly ready to commit, but I noticed one problem: your conversion code always produces two output pages for each input page even if one of them would be empty. In particular, if you have a large number of small relations and run pg_upgrade, all of their visibility maps will go from 8kB to 16kB. That isn't the end of the world, maybe, but I think you should see if you can't fix it somehow.... -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Attachment
On Fri, Mar 11, 2016 at 1:03 AM, Robert Haas <robertmhaas@gmail.com> wrote: > This 001 patch looks so little like what I was expecting that I > decided to start over from scratch. The new version I wrote is > attached here. I don't understand why your version tinkers with the > logic for setting the all-frozen bit; I thought that what I already > committed dealt with that already, and in any case, your version > doesn't even compile against latest sources. Your version also leaves > the scan_all terminology intact even though it's not accurate any > more, and I am not very convinced that the updates to the > page-skipping logic are actually correct. Please have a look over > this version and see what you think. Thank you for your advice. Sorry, the optimising logic in the previous patch was an old version by mistake. The attached latest patch incorporates your suggestions with a little revision. > I think that's kind of pointless. We need to test that this > conversion code works, but once it does, I don't think we should make > everybody pay the overhead of retesting that. Anyway, the test code > could have bugs, too. > > Here's an updated version of your patch with that code removed and > some cosmetic cleanups like fixing typos and stuff like that. I think > this is mostly ready to commit, but I noticed one problem: your > conversion code always produces two output pages for each input page > even if one of them would be empty. In particular, if you have a > large number of small relations and run pg_upgrade, all of their > visibility maps will go from 8kB to 16kB. That isn't the end of the > world, maybe, but I think you should see if you can't fix it > somehow.... Thank you for updating the patch. To deal with this problem, I've changed it so that pg_upgrade checks the file size before conversion, and if the fork file does not exist or its size is 0 (empty), it is ignored. The latest patch is attached. Regards, -- Masahiko Sawada
Attachment
On Thu, Mar 10, 2016 at 1:41 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: > On Fri, Mar 11, 2016 at 1:03 AM, Robert Haas <robertmhaas@gmail.com> wrote: >> [...] Please have a look over >> this version and see what you think. > > Thank you for your advice. > Sorry, the optimising logic in the previous patch was an old version by mistake. > The attached latest patch incorporates your suggestions with a little revision. OK, I'll have a look. Thanks. >> I think that's kind of pointless. [...] I think >> this is mostly ready to commit, but I noticed one problem: your >> conversion code always produces two output pages for each input page >> even if one of them would be empty. In particular, if you have a >> large number of small relations and run pg_upgrade, all of their >> visibility maps will go from 8kB to 16kB. That isn't the end of the >> world, maybe, but I think you should see if you can't fix it >> somehow.... > > Thank you for updating the patch. > To deal with this problem, I've changed it so that pg_upgrade checks > the file size before conversion, and if the fork file does not exist > or its size is 0 (empty), it is ignored. > The latest patch is attached. I think what I really want is some logic so that if we have a 1-page visibility map in the old cluster and the second half of that page is all zeroes, we only create a 1-page visibility map in the new cluster rather than a 2-page visibility map. Or more generally, if the old VM is N pages, but the last half of the last page is empty, then let the output VM be 2*N-1 pages instead of 2*N pages. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
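(A self-contained sketch of that sizing rule; the helper name and arguments are illustrative, not the patch's actual code.)

======
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Each old VM page normally becomes two new pages, but if the second half
 * of the last old page's payload has no bits set, the final output page
 * would be empty and can be skipped, giving 2*N - 1 pages instead of 2*N.
 */
static int
new_vm_page_count(const uint8_t *last_old_payload, size_t payload_bytes,
				  int n_old_pages)
{
	bool		last_half_empty = true;

	for (size_t i = payload_bytes / 2; i < payload_bytes; i++)
	{
		if (last_old_payload[i] != 0)
		{
			last_half_empty = false;
			break;
		}
	}

	return 2 * n_old_pages - (last_half_empty ? 1 : 0);
}

int
main(void)
{
	uint8_t		payload[8] = {0xFF, 0xFF, 0xFF, 0x01, 0, 0, 0, 0};

	/* bits only in the first half of the last page: 2*1 - 1 = 1 output page */
	printf("%d\n", new_vm_page_count(payload, sizeof(payload), 1));
	return 0;
}
======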
On Thu, Mar 10, 2016 at 1:41 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: > On Fri, Mar 11, 2016 at 1:03 AM, Robert Haas <robertmhaas@gmail.com> wrote: >> [...] Please have a look over >> this version and see what you think. > > Thank you for your advice. > Sorry, the optimising logic in the previous patch was an old version by mistake. > The attached latest patch incorporates your suggestions with a little revision. Thanks. I adopted some of your suggestions, rejected others, fixed a few minor things that I missed previously, and committed this. If you think any of the changes that I rejected still have merit, please resubmit those changes as separate patches. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Fri, Mar 11, 2016 at 6:16 AM, Robert Haas <robertmhaas@gmail.com> wrote: > On Thu, Mar 10, 2016 at 1:41 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: >> [...] > > Thanks. I adopted some of your suggestions, rejected others, fixed a > few minor things that I missed previously, and committed this. If you > think any of the changes that I rejected still have merit, please > resubmit those changes as separate patches. Thank you for your effort on this feature and for committing it. I feel I couldn't do my best work on this feature at the final stage, but I really appreciate all your advice and suggestions. > I think what I really want is some logic so that if we have a 1-page > visibility map in the old cluster and the second half of that page is > all zeroes, we only create a 1-page visibility map in the new cluster > rather than a 2-page visibility map. > > Or more generally, if the old VM is N pages, but the last half of the > last page is empty, then let the output VM be 2*N-1 pages instead of > 2*N pages. I got your point. The attached latest patch can skip writing the last part of the last old page if it's empty. Please review it. Regards, -- Masahiko Sawada
Attachment
On Thu, Mar 10, 2016 at 10:47 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: >> Thanks. I adopted some of your suggestions, rejected others, fixed a >> few minor things that I missed previously, and committed this. If you >> think any of the changes that I rejected still have merit, please >> resubmit those changes as separate patches. > > Thank you for your effort on this feature and for committing it. > I feel I couldn't do my best work on this feature at the final stage, > but I really appreciate all your advice and suggestions. Don't feel bad, you put a lot of work into this, and if you were getting a little tired towards the end, that's very understandable. This extremely important feature was largely driven by you, and that's a big accomplishment. > I got your point. > The attached latest patch can skip writing the last part of the last > old page if it's empty. > Please review it. Committed. Which I think just about brings us to the end of this epic journey, except for any cleanup of what's already been committed that needs to be done. Thanks so much for your hard work! -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Sat, Mar 12, 2016 at 2:37 AM, Robert Haas <robertmhaas@gmail.com> wrote: > On Thu, Mar 10, 2016 at 10:47 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: >> [...] > > Don't feel bad, you put a lot of work into this, and if you were getting > a little tired towards the end, that's very understandable. This > extremely important feature was largely driven by you, and that's a > big accomplishment. > > Committed. > > Which I think just about brings us to the end of this epic journey, > except for any cleanup of what's already been committed that needs to > be done. Thanks so much for your hard work! Thank you so much! What I wanted to deal with in this thread is almost done. I'm going to test the feature further for the 9.6 release. Regards, -- Masahiko Sawada
On 03/11/2016 09:48 AM, Masahiko Sawada wrote: > > Thank you so much! > What I wanted to deal with in this thread is almost done. I'm going to > test the feature further for the 9.6 release. Nicely done! > > Regards, > > -- > Masahiko Sawada > > -- Command Prompt, Inc. http://the.postgres.company/ +1-503-667-4564 PostgreSQL Centered full stack support, consulting and development. Everyone appreciates your honesty, until you are honest with them.