Thread: Freeze avoidance of very large table.

Freeze avoidance of very large table.

From
Sawada Masahiko
Date:
Hi all,

I'd like to propose a read-only table feature to avoid full scans of
very large tables.
The WIP patch is attached.

- Background
Postgres can keep a tuple forever by freezing it, but freezing tuples
requires scanning the whole table.
That can hurt system performance significantly, especially in a very
large database system.
There is no command that guarantees a whole table has been
completely frozen,
so Postgres has to keep freezing tuples even if we have not written to
the table at all.

We need a DDL command that ensures all tuples are frozen and marks the
table as read-only, as one way to avoid full scans of very large
tables.
This topic has already been discussed before, in a proposal by Simon.

- Feature
I tried to implement this feature as ALTER TABLE SET READ ONLY,
and SET READ WRITE.
What I have in mind is attached to this mail as a WIP patch.

The patch does the following:
* Add new column relreadonly to pg_class.
* Add new syntax ALTER TABLE SET READ ONLY, and ALTER TABLE SET READ WRITE
* When marking a table read-only, all tuples of the table are frozen in
one pass while holding ShareLock (like VACUUM FREEZE),
  and then pg_class.relreadonly is updated to true.
* When un-marking read-only, just update pg_class.relreadonly to false.
* If the table has a TOAST table, the TOAST table is marked as well at the same time.
* Writes and vacuums on a read-only table are completely restricted
or ignored,
  e.g., INSERT, UPDATE, DELETE, explicit VACUUM, autovacuum (see the
sketch below).
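
As a rough illustration of that restriction (not necessarily the exact
code in the attached patch; the helper name here is mine), the DML
paths gain a check along these lines:

    /* Hypothetical sketch: reject writes to a read-only relation.
     * Something like this would be called from heap_insert(),
     * heap_update() and heap_delete(). */
    static void
    CheckRelationNotReadOnly(Relation rel)
    {
        if (rel->rd_rel->relreadonly)   /* proposed pg_class column */
            ereport(ERROR,
                    (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
                     errmsg("cannot modify read-only table \"%s\"",
                            RelationGetRelationName(rel))));
    }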

There are a few problems, though not critical ones:
* The processing to freeze all tuples is quite similar to VACUUM FREEZE,
but calling lazy_vacuum_rel() would be overkill, I think.
* Need to consider the lock level.

Please give me feedback.

Regards,
-------
Sawada Masahiko

Attachment

Re: Freeze avoidance of very large table.

From
Jim Nasby
Date:
On 4/3/15 12:59 AM, Sawada Masahiko wrote:
> +                case HEAPTUPLE_LIVE:
> +                case HEAPTUPLE_RECENTLY_DEAD:
> +                case HEAPTUPLE_INSERT_IN_PROGRESS:
> +                case HEAPTUPLE_DELETE_IN_PROGRESS:
> +                    if (heap_prepare_freeze_tuple(tuple.t_data, freezelimit,
> +                                                  mxactcutoff, &frozen[nfrozen]))
> +                        frozen[nfrozen++].offset = offnum;
> +                    break;

This doesn't seem safe enough to me. Can't there be tuples that are 
still new enough that they can't be frozen, and are still live? I don't 
think it's safe to leave tuples as dead either, even if they're hinted. 
The hint may not be written. Also, the patch seems to be completely 
ignoring actually freezing the toast relation; I can't see how that's 
actually safe.

I'd feel a heck of a lot safer if any time heap_prepare_freeze_tuple 
returned false we did a second check on the tuple to ensure it was truly 
frozen.
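
Roughly this shape, inside the patch's offset loop (untested; the
"truly frozen" test would need more care around dead tuples and
multixacts):

    if (heap_prepare_freeze_tuple(tuple.t_data, freezelimit,
                                  mxactcutoff, &frozen[nfrozen]))
        frozen[nfrozen++].offset = offnum;
    else if (!HeapTupleHeaderXminFrozen(tuple.t_data) &&
             TransactionIdIsNormal(HeapTupleHeaderGetRawXmin(tuple.t_data)))
        elog(ERROR, "tuple at offset %u was not frozen", offnum);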

Somewhat related... instead of forcing the freeze to happen 
synchronously, can't we set this up so a table is in one of three 
states? Read/Write, Read Only, Frozen. AT_SetReadOnly and 
AT_SetReadWrite would simply change to the appropriate state, and all 
the vacuum infrastructure would continue to process those tables as it 
does today. lazy_vacuum_rel would become responsible for tracking if 
there were any non-frozen tuples if it was also attempting a freeze. If 
it discovered there were none, AND the table was marked as ReadOnly, 
then it would change the table state to Frozen and set relfrozenxid = 
InvalidTransactionId and relminmxid = InvalidMultiXactId. AT_SetReadWrite 
could change relfrozenxid to its own Xid as an optimization. Doing it 
that way leaves all the complicated vacuum code in one place, and would 
eliminate concerns about race conditions with still running 
transactions, etc.
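
In pseudo-C, the tail of lazy_vacuum_rel() might then look something
like this (every name below is invented, just to sketch the idea):

    /* Promote a ReadOnly table to Frozen after a vacuum that proved
     * no unfrozen tuples remain.  All names are hypothetical. */
    if (state_at_start == RELSTATE_READONLY &&
        !state_changed_during_vacuum &&
        vacrelstats->unfrozen_tuples == 0)
    {
        SetRelationState(onerel, RELSTATE_FROZEN);
        /* A fully frozen table has no meaningful cutoffs left. */
        new_frozen_xid = InvalidTransactionId;
        new_min_multi = InvalidMultiXactId;
    }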

BTW, you also need to put things in place to ensure it's impossible to 
unfreeze a tuple in a relation that's marked ReadOnly or Frozen. I'm not 
sure what the right way to do that would be.
-- 
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com



Re: Freeze avoidance of very large table.

From
Jim Nasby
Date:
On 4/4/15 5:10 PM, Jim Nasby wrote:
> On 4/3/15 12:59 AM, Sawada Masahiko wrote:
>> +                case HEAPTUPLE_LIVE:
>> +                case HEAPTUPLE_RECENTLY_DEAD:
>> +                case HEAPTUPLE_INSERT_IN_PROGRESS:
>> +                case HEAPTUPLE_DELETE_IN_PROGRESS:
>> +                    if (heap_prepare_freeze_tuple(tuple.t_data,
>> freezelimit,
>> +                                                  mxactcutoff,
>> &frozen[nfrozen]))
>> +                        frozen[nfrozen++].offset = offnum;
>> +                    break;
>
> This doesn't seem safe enough to me. Can't there be tuples that are
> still new enough that they can't be frozen, and are still live? I don't
> think it's safe to leave tuples as dead either, even if they're hinted.
> The hint may not be written. Also, the patch seems to be completely
> ignoring actually freezing the toast relation; I can't see how that's
> actually safe.
>
> I'd feel a heck of a lot safer if any time heap_prepare_freeze_tuple
> returned false we did a second check on the tuple to ensure it was truly
> frozen.
>
> Somewhat related... instead of forcing the freeze to happen
> synchronously, can't we set this up so a table is in one of three
> states? Read/Write, Read Only, Frozen. AT_SetReadOnly and
> AT_SetReadWrite would simply change to the appropriate state, and all
> the vacuum infrastructure would continue to process those tables as it
> does today. lazy_vacuum_rel would become responsible for tracking if
> there were any non-frozen tuples if it was also attempting a freeze. If
> it discovered there were none, AND the table was marked as ReadOnly,
> then it would change the table state to Frozen and set relfrozenxid =
> InvalidTransactionId and relminxid = InvalidMultiXactId. AT_SetReadWrite
> could change relfrozenxid to it's own Xid as an optimization. Doing it
> that way leaves all the complicated vacuum code in one place, and would
> eliminate concerns about race conditions with still running
> transactions, etc.
>
> BTW, you also need to put things in place to ensure it's impossible to
> unfreeze a tuple in a relation that's marked ReadOnly or Frozen. I'm not
> sure what the right way to do that would be.

Answering my own question... I think visibilitymap_clear() would be the 
right place. AFAICT this is basically as critical as clearing the VM, 
and that function has the Relation, so it can see what mode the relation 
is in.
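
Something like this at the top of visibilitymap_clear(), say (a sketch
only; RelationIsReadOnly() is a made-up macro over the proposed state):

    /* Clearing a VM bit is the gateway to un-freezing a tuple, so
     * refuse it outright for ReadOnly/Frozen relations. */
    if (RelationIsReadOnly(rel))
        elog(ERROR, "cannot clear visibility map bit for read-only relation \"%s\"",
             RelationGetRelationName(rel));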

There is another possibility here, too. We can completely divorce a 
ReadOnly mode (which I think is useful for other things besides 
freezing) from the question of whether we need to force-freeze a 
relation if we create a FrozenMap, similar to the visibility map. This 
has the added advantage of helping freeze scans on relations that are 
not ReadOnly in the case of tables that are insert-mostly or any other 
pattern where most pages stay all-frozen.

Prior to the visibility map this would have been a rather daunting 
project, but I believe this could piggyback on the VM code rather 
nicely. Anytime you clear the VM you clearly must clear the FrozenMap as 
well. The logic for setting the FM is clearly different, but that would 
be entirely self-contained to vacuum. Unlike the VM, I don't see any 
point to marking special bits in the page itself for FM.
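
Concretely, every existing call site like the first line below would
just gain the second (frozenmap_clear() being a hypothetical mirror of
the visibilitymap API):

    /* Existing call when a heap page is dirtied: */
    visibilitymap_clear(rel, blkno, vmbuffer);
    /* Proposed companion call, keeping the FM a subset of the VM: */
    frozenmap_clear(rel, blkno, fmbuffer);      /* hypothetical */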

It would be nice if each bit in the FM covered multiple pages, but that 
can be optimized later.
-- 
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com



Re: Freeze avoidance of very large table.

From
Jeff Janes
Date:
On Sat, Apr 4, 2015 at 3:10 PM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote:
> On 4/3/15 12:59 AM, Sawada Masahiko wrote:
>> +                               case HEAPTUPLE_LIVE:
>> +                               case HEAPTUPLE_RECENTLY_DEAD:
>> +                               case HEAPTUPLE_INSERT_IN_PROGRESS:
>> +                               case HEAPTUPLE_DELETE_IN_PROGRESS:
>> +                                       if (heap_prepare_freeze_tuple(tuple.t_data, freezelimit,
>> +                                                                                                 mxactcutoff, &frozen[nfrozen]))
>> +                                               frozen[nfrozen++].offset = offnum;
>> +                                       break;
>
> This doesn't seem safe enough to me. Can't there be tuples that are still new enough that they can't be frozen, and are still live?

Yep.  I've set a table to read only while it contained unfreezable tuples, and the tuples remain unfrozen yet the read-only action claims to have succeeded.

> Somewhat related... instead of forcing the freeze to happen synchronously, can't we set this up so a table is in one of three states? Read/Write, Read Only, Frozen. AT_SetReadOnly and AT_SetReadWrite would simply change to the appropriate state, and all the vacuum infrastructure would continue to process those tables as it does today. lazy_vacuum_rel would become responsible for tracking if there were any non-frozen tuples if it was also attempting a freeze. If it discovered there were none, AND the table was marked as ReadOnly, then it would change the table state to Frozen and set relfrozenxid = InvalidTransactionId and relminxid = InvalidMultiXactId. AT_SetReadWrite could change relfrozenxid to it's own Xid as an optimization. Doing it that way leaves all the complicated vacuum code in one place, and would eliminate concerns about race conditions with still running transactions, etc.

+1 here as well.  I might want to set tables to read only for reasons other than to avoid repeated freezing.  (After all, the name of the command suggests it is a general purpose thing) and wouldn't want to automatically trigger a vacuum freeze in the process.

Cheers,

Jeff

Re: Freeze avoidance of very large table.

From
Sawada Masahiko
Date:
On Sun, Apr 5, 2015 at 8:21 AM, Jeff Janes <jeff.janes@gmail.com> wrote:
> On Sat, Apr 4, 2015 at 3:10 PM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote:
>>
>> On 4/3/15 12:59 AM, Sawada Masahiko wrote:
>>>
>>> +                               case HEAPTUPLE_LIVE:
>>> +                               case HEAPTUPLE_RECENTLY_DEAD:
>>> +                               case HEAPTUPLE_INSERT_IN_PROGRESS:
>>> +                               case HEAPTUPLE_DELETE_IN_PROGRESS:
>>> +                                       if
>>> (heap_prepare_freeze_tuple(tuple.t_data, freezelimit,
>>> +
>>> mxactcutoff, &frozen[nfrozen]))
>>> +                                               frozen[nfrozen++].offset
>>> = offnum;
>>> +                                       break;
>>
>>
>> This doesn't seem safe enough to me. Can't there be tuples that are still
>> new enough that they can't be frozen, and are still live?
>
>
> Yep.  I've set a table to read only while it contained unfreezable tuples,
> and the tuples remain unfrozen yet the read-only action claims to have
> succeeded.
>
>
>>
>> Somewhat related... instead of forcing the freeze to happen synchronously,
>> can't we set this up so a table is in one of three states? Read/Write, Read
>> Only, Frozen. AT_SetReadOnly and AT_SetReadWrite would simply change to the
>> appropriate state, and all the vacuum infrastructure would continue to
>> process those tables as it does today. lazy_vacuum_rel would become
>> responsible for tracking if there were any non-frozen tuples if it was also
>> attempting a freeze. If it discovered there were none, AND the table was
>> marked as ReadOnly, then it would change the table state to Frozen and set
>> relfrozenxid = InvalidTransactionId and relminxid = InvalidMultiXactId.
>> AT_SetReadWrite could change relfrozenxid to it's own Xid as an
>> optimization. Doing it that way leaves all the complicated vacuum code in
>> one place, and would eliminate concerns about race conditions with still
>> running transactions, etc.
>
>
> +1 here as well.  I might want to set tables to read only for reasons other
> than to avoid repeated freezing.  (After all, the name of the command
> suggests it is a general purpose thing) and wouldn't want to automatically
> trigger a vacuum freeze in the process.
>

Thank you for the comments.

> Somewhat related... instead of forcing the freeze to happen
> synchronously, can't we set this up so a table is in one of three
> states? Read/Write, Read Only, Frozen. AT_SetReadOnly and
> AT_SetReadWrite would simply change to the appropriate state, and all
> the vacuum infrastructure would continue to process those tables as it
> does today. lazy_vacuum_rel would become responsible for tracking if
> there were any non-frozen tuples if it was also attempting a freeze. If
> it discovered there were none, AND the table was marked as ReadOnly,
> then it would change the table state to Frozen and set relfrozenxid =
> InvalidTransactionId and relminxid = InvalidMultiXactId. AT_SetReadWrite
> could change relfrozenxid to it's own Xid as an optimization. Doing it
> that way leaves all the complicated vacuum code in one place, and would
> eliminate concerns about race conditions with still running
> transactions, etc.

I agree with the 3 states: Read/Write, ReadOnly and Frozen.
But I'm not sure when we should freeze the tuples, i.e., scan the
whole table.
I think that any changes to the table are completely
ignored/restricted once the table is marked as ReadOnly,
and that marking is not accompanied by freezing tuples; it just marks
the table as ReadOnly.
A Frozen table guarantees that all tuples of the table have been
completely frozen, so it also needs a whole-table scan.
So we would need to scan the whole table twice, right?

> +1 here as well.  I might want to set tables to read only for reasons
> other than to avoid repeated freezing.  (After all, the name of the
> command suggests it is a general purpose thing) and wouldn't want to
> automatically trigger a vacuum freeze in the process.
>
> There is another possibility here, too. We can completely divorce a
> ReadOnly mode (which I think is useful for other things besides
> freezing) from the question of whether we need to force-freeze a
> relation if we create a FrozenMap, similar to the visibility map. This
> has the added advantage of helping freeze scans on relations that are
> not ReadOnly in the case of tables that are insert-mostly or any other
> pattern where most pages stay all-frozen.
> Prior to the visibility map this would have been a rather daunting
> project, but I believe this could piggyback on the VM code rather
> nicely. Anytime you clear the VM you clearly must clear the FrozenMap
> as well. The logic for setting the FM is clearly different, but that
> would be entirely self-contained to vacuum. Unlike the VM, I don't see
> any point to marking special bits in the page itself for FM.

I was actually thinking about this idea (the FM) as a way to avoid
freezing all tuples.
As you said, it might not be a good idea (or it might be overkill) for
avoiding repeated freezing to be the reason to set a table to read
only.
I'm attempting to design the FM to avoid freezing relations as well.
Is it enough for each FM bit to indicate that the corresponding
pages are completely frozen?

Regards,

-------
Sawada Masahiko



Re: Freeze avoidance of very large table.

From
Jim Nasby
Date:
On 4/6/15 1:46 AM, Sawada Masahiko wrote:
> On Sun, Apr 5, 2015 at 8:21 AM, Jeff Janes <jeff.janes@gmail.com> wrote:
>> On Sat, Apr 4, 2015 at 3:10 PM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote:
>>>
>>> On 4/3/15 12:59 AM, Sawada Masahiko wrote:
>>>>
>>>> +                               case HEAPTUPLE_LIVE:
>>>> +                               case HEAPTUPLE_RECENTLY_DEAD:
>>>> +                               case HEAPTUPLE_INSERT_IN_PROGRESS:
>>>> +                               case HEAPTUPLE_DELETE_IN_PROGRESS:
>>>> +                                       if
>>>> (heap_prepare_freeze_tuple(tuple.t_data, freezelimit,
>>>> +
>>>> mxactcutoff, &frozen[nfrozen]))
>>>> +                                               frozen[nfrozen++].offset
>>>> = offnum;
>>>> +                                       break;
>>>
>>>
>>> This doesn't seem safe enough to me. Can't there be tuples that are still
>>> new enough that they can't be frozen, and are still live?
>>
>>
>> Yep.  I've set a table to read only while it contained unfreezable tuples,
>> and the tuples remain unfrozen yet the read-only action claims to have
>> succeeded.
>>
>>
>>>
>>> Somewhat related... instead of forcing the freeze to happen synchronously,
>>> can't we set this up so a table is in one of three states? Read/Write, Read
>>> Only, Frozen. AT_SetReadOnly and AT_SetReadWrite would simply change to the
>>> appropriate state, and all the vacuum infrastructure would continue to
>>> process those tables as it does today. lazy_vacuum_rel would become
>>> responsible for tracking if there were any non-frozen tuples if it was also
>>> attempting a freeze. If it discovered there were none, AND the table was
>>> marked as ReadOnly, then it would change the table state to Frozen and set
>>> relfrozenxid = InvalidTransactionId and relminxid = InvalidMultiXactId.
>>> AT_SetReadWrite could change relfrozenxid to it's own Xid as an
>>> optimization. Doing it that way leaves all the complicated vacuum code in
>>> one place, and would eliminate concerns about race conditions with still
>>> running transactions, etc.
>>
>>
>> +1 here as well.  I might want to set tables to read only for reasons other
>> than to avoid repeated freezing.  (After all, the name of the command
>> suggests it is a general purpose thing) and wouldn't want to automatically
>> trigger a vacuum freeze in the process.
>>
>
> Thank you for comments.
>
>> Somewhat related... instead of forcing the freeze to happen
>> synchronously, can't we set this up so a table is in one of three
>> states? Read/Write, Read Only, Frozen. AT_SetReadOnly and
>> AT_SetReadWrite would simply change to the appropriate state, and all
>> the vacuum infrastructure would continue to process those tables as it
>> does today. lazy_vacuum_rel would become responsible for tracking if
>> there were any non-frozen tuples if it was also attempting a freeze. If
>> it discovered there were none, AND the table was marked as ReadOnly,
>> then it would change the table state to Frozen and set relfrozenxid =
>> InvalidTransactionId and relminxid = InvalidMultiXactId. AT_SetReadWrite
>> could change relfrozenxid to it's own Xid as an optimization. Doing it
>> that way leaves all the complicated vacuum code in one place, and would
>> eliminate concerns about race conditions with still running
>> transactions, etc.
>
> I agree with 3 status, Read/Write, ReadOnly and Frozen.
> But I'm not sure when we should do to freeze tuples, e.g., scan whole tables.
> I think that the any changes to table are completely
> ignored/restricted if table is marked as ReadOnly table,
> and it's accompanied by freezing tuples, just mark as ReadOnly.
> Frozen table ensures that all tuples of its table completely has been
> frozen, so it also needs to scan whole table as well.
> e.g., we should need to scan whole table at two times. right?

No. You would be free to set a table as ReadOnly any time you wanted, 
without scanning anything. All that setting does is disable any DML on 
the table.

The Frozen state would only be set by the vacuum code, IFF:
- The table state is ReadOnly *at the start of vacuum* and did not 
change during vacuum
- Vacuum ensured that there were no un-frozen tuples in the table

That does not necessitate 2 scans.

>> +1 here as well.  I might want to set tables to read only for reasons
>> other than to avoid repeated freezing.  (After all, the name of the
>> command suggests it is a general purpose thing) and wouldn't want to
>> automatically trigger a vacuum freeze in the process.
>>
>> There is another possibility here, too. We can completely divorce a
>> ReadOnly mode (which I think is useful for other things besides
>> freezing) from the question of whether we need to force-freeze a
>> relation if we create a FrozenMap, similar to the visibility map. This
>> has the added advantage of helping freeze scans on relations that are
>> not ReadOnly in the case of tables that are insert-mostly or any other
>> pattern where most pages stay all-frozen.
>> Prior to the visibility map this would have been a rather daunting
>> project, but I believe this could piggyback on the VM code rather
>> nicely. Anytime you clear the VM you clearly must clear the FrozenMap
>> as well. The logic for setting the FM is clearly different, but that
>> would be entirely self-contained to vacuum. Unlike the VM, I don't see
>> any point to marking special bits in the page itself for FM.
>
> I was thinking this idea (FM) to avoid freezing all tuples actually.
> As you said, it might not be good idea (or overkill) that the reason
> why settings table to read only is avoidance repeated freezing.
> I'm attempting to try design FM to avoid freezing relations as well.
> Is it enough that each bit of FM has information that corresponding
> pages are completely frozen on each bit?

If I'm understanding your implied question correctly, I don't think 
there would actually be any relationship between FM and marking 
ReadOnly. It would come into play if we wanted to do the Frozen state, 
but if we have the FM, marking an entire relation as Frozen becomes a 
lot less useful. What's going to happen with a VACUUM FREEZE once we 
have FM is that vacuum will be able to skip reading pages if they are 
all-visible *and* the FM shows them as frozen, whereas today we can't 
use the VM to skip pages if scan_all is true.
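
I.e. the skip test in lazy_scan_heap() could become roughly this, even
when scan_all is true (frozenmap_test() is a hypothetical analogue of
the VM call):

    /* Skip pages that are known all-visible AND all-frozen; the FM
     * call is invented here by analogy with visibilitymap_test(). */
    if (visibilitymap_test(onerel, blkno, &vmbuffer) &&
        frozenmap_test(onerel, blkno, &fmbuffer))
        continue;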

For simplicity, I would start out with each FM bit representing a single 
page. That means the FM would be very similar in operation to the VM; 
the only difference would be when a bit in the FM was set. I would 
absolutely split this into 2 patches as well; one for ReadOnly (and skip 
the Frozen status for now), and one for FM.

When I looked at the VM code briefly it occurred to me that it might be 
quite difficult to have 1 FM bit represent multiple pages. The issue is 
the locking necessary between VACUUM and clearing a FM bit. In the VM 
that's handled by the cleanup lock, but that will only work at a page 
level. We'd need something to ensure that nothing came in and performed 
DML while the vacuum code was getting ready to set a FM bit. There's 
probably several ways this could be accomplished, but I think it would 
be foolish to try and do anything about it in the initial patch. 
Especially because it's only supposition that there would be much 
benefit to having multiple pages per bit.
-- 
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com



Re: Freeze avoidance of very large table.

From
Sawada Masahiko
Date:
On Mon, Apr 6, 2015 at 10:17 PM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote:
> On 4/6/15 1:46 AM, Sawada Masahiko wrote:
>>
>> On Sun, Apr 5, 2015 at 8:21 AM, Jeff Janes <jeff.janes@gmail.com> wrote:
>>>
>>> On Sat, Apr 4, 2015 at 3:10 PM, Jim Nasby <Jim.Nasby@bluetreble.com>
>>> wrote:
>>>>
>>>>
>>>> On 4/3/15 12:59 AM, Sawada Masahiko wrote:
>>>>>
>>>>>
>>>>> +                               case HEAPTUPLE_LIVE:
>>>>> +                               case HEAPTUPLE_RECENTLY_DEAD:
>>>>> +                               case HEAPTUPLE_INSERT_IN_PROGRESS:
>>>>> +                               case HEAPTUPLE_DELETE_IN_PROGRESS:
>>>>> +                                       if
>>>>> (heap_prepare_freeze_tuple(tuple.t_data, freezelimit,
>>>>> +
>>>>> mxactcutoff, &frozen[nfrozen]))
>>>>> +
>>>>> frozen[nfrozen++].offset
>>>>> = offnum;
>>>>> +                                       break;
>>>>
>>>>
>>>>
>>>> This doesn't seem safe enough to me. Can't there be tuples that are
>>>> still
>>>> new enough that they can't be frozen, and are still live?
>>>
>>>
>>>
>>> Yep.  I've set a table to read only while it contained unfreezable
>>> tuples,
>>> and the tuples remain unfrozen yet the read-only action claims to have
>>> succeeded.
>>>
>>>
>>>>
>>>> Somewhat related... instead of forcing the freeze to happen
>>>> synchronously,
>>>> can't we set this up so a table is in one of three states? Read/Write,
>>>> Read
>>>> Only, Frozen. AT_SetReadOnly and AT_SetReadWrite would simply change to
>>>> the
>>>> appropriate state, and all the vacuum infrastructure would continue to
>>>> process those tables as it does today. lazy_vacuum_rel would become
>>>> responsible for tracking if there were any non-frozen tuples if it was
>>>> also
>>>> attempting a freeze. If it discovered there were none, AND the table was
>>>> marked as ReadOnly, then it would change the table state to Frozen and
>>>> set
>>>> relfrozenxid = InvalidTransactionId and relminxid = InvalidMultiXactId.
>>>> AT_SetReadWrite could change relfrozenxid to it's own Xid as an
>>>> optimization. Doing it that way leaves all the complicated vacuum code
>>>> in
>>>> one place, and would eliminate concerns about race conditions with still
>>>> running transactions, etc.
>>>
>>>
>>>
>>> +1 here as well.  I might want to set tables to read only for reasons
>>> other
>>> than to avoid repeated freezing.  (After all, the name of the command
>>> suggests it is a general purpose thing) and wouldn't want to
>>> automatically
>>> trigger a vacuum freeze in the process.
>>>
>>
>> Thank you for comments.
>>
>>> Somewhat related... instead of forcing the freeze to happen
>>> synchronously, can't we set this up so a table is in one of three states?
>>> Read/Write, Read Only, Frozen. AT_SetReadOnly and AT_SetReadWrite would
>>> simply change to the appropriate state, and all the vacuum infrastructure
>>> would continue to process those tables as it does today. lazy_vacuum_rel
>>> would become responsible for tracking if there were any non-frozen tuples if
>>> it was also attempting a freeze. If it discovered there were none, AND the
>>> table was marked as ReadOnly, then it would change the table state to Frozen
>>> and set relfrozenxid = InvalidTransactionId and relminxid =
>>> InvalidMultiXactId. AT_SetReadWrite could change relfrozenxid to it's own
>>> Xid as an optimization. Doing it that way leaves all the complicated vacuum
>>> code in one place, and would eliminate concerns about race conditions with
>>> still running transactions, etc.
>>
>>
>> I agree with 3 status, Read/Write, ReadOnly and Frozen.
>> But I'm not sure when we should do to freeze tuples, e.g., scan whole
>> tables.
>> I think that the any changes to table are completely
>> ignored/restricted if table is marked as ReadOnly table,
>> and it's accompanied by freezing tuples, just mark as ReadOnly.
>> Frozen table ensures that all tuples of its table completely has been
>> frozen, so it also needs to scan whole table as well.
>> e.g., we should need to scan whole table at two times. right?
>
>
> No. You would be free to set a table as ReadOnly any time you wanted,
> without scanning anything. All that setting does is disable any DML on the
> table.
>
> The Frozen state would only be set by the vacuum code, IFF:
> - The table state is ReadOnly *at the start of vacuum* and did not change
> during vacuum
> - Vacuum ensured that there were no un-frozen tuples in the table
>
> That does not necessitate 2 scans.
>

I understood this concept, and I have a question, as I wrote below.

>>> +1 here as well.  I might want to set tables to read only for reasons
>>> other than to avoid repeated freezing.  (After all, the name of the command
>>> suggests it is a general purpose thing) and wouldn't want to automatically
>>> trigger a
>>> vacuum freeze in the process.
>>>
>>> There is another possibility here, too. We can completely divorce a
>>> ReadOnly mode (which I think is useful for other things besides freezing)
>>> from the question of whether we need to force-freeze a relation if we create
>>> a
>>> FrozenMap, similar to the visibility map. This has the added advantage of
>>> helping freeze scans on relations that are not ReadOnly in the case of
>>> tables that are insert-mostly or any other pattern where most pages stay
>>> all-frozen.
>>> Prior to the visibility map this would have been a rather daunting
>>> project, but I believe this could piggyback on the VM code rather nicely.
>>> Anytime you clear the VM you clearly must clear the FrozenMap as well. The
>>> logic for
>>> setting the FM is clearly different, but that would be entirely
>>> self-contained to vacuum. Unlike the VM, I don't see any point to marking
>>> special bits in the page itself for FM.
>>
>>
>> I was thinking this idea (FM) to avoid freezing all tuples actually.
>> As you said, it might not be good idea (or overkill) that the reason
>> why settings table to read only is avoidance repeated freezing.
>> I'm attempting to try design FM to avoid freezing relations as well.
>> Is it enough that each bit of FM has information that corresponding
>> pages are completely frozen on each bit?
>
>
> If I'm understanding your implied question correctly, I don't think there
> would actually be any relationship between FM and marking ReadOnly. It would
> come into play if we wanted to do the Frozen state, but if we have the FM,
> marking an entire relation as Frozen becomes a lot less useful. What's going
> to happen with a VACUUM FREEZE once we have FM is that vacuum will be able
> to skip reading pages if they are all-visible *and* the FM shows them as
> frozen, whereas today we can't use the VM to skip pages if scan_all is true.
>
> For simplicity, I would start out with each FM bit representing a single
> page. That means the FM would be very similar in operation to the VM; the
> only difference would be when a bit in the FM was set. I would absolutely
> split this into 2 patches as well; one for ReadOnly (and skip the Frozen
> status for now), and one for FM.
> When I looked at the VM code briefly it occurred to me that it might be
> quite difficult to have 1 FM bit represent multiple pages. The issue is the
> locking necessary between VACUUM and clearing a FM bit. In the VM that's
> handled by the cleanup lock, but that will only work at a page level. We'd
> need something to ensure that nothing came in and performed DML while the
> vacuum code was getting ready to set a FM bit. There's probably several ways
> this could be accomplished, but I think it would be foolish to try and do
> anything about it in the initial patch. Especially because it's only
> supposition that there would be much benefit to having multiple pages per
> bit.
>

Yes, I will separate the patch into two patches.

I'd like to confirm whether what I'm thinking is correct here.
In the first version of the patch, each FM bit represents a single page
and indicates whether all tuples of that page have been completely
frozen; that would be one patch.

The second patch adds the 3 states and the read-only table, which
disables any writes to the table. The trigger which changes the state
from Read/Write to Read-Only is ALTER TABLE SET READ ONLY. And the
trigger which changes it from Read-Only to Frozen is vacuum, and only
when the table was already marked Read-Only when the vacuum started
*and* the vacuum did not need to freeze any tuple (counting pages
skipped via the FM as frozen). If we support the FM, we would be able
to avoid repeatedly freezing the whole table even if the table has not
been marked Read-Only.

In order to change the state to Frozen, we need to run VACUUM FREEZE or
wait for autovacuum to run. Generally, the cutoff-xid threshold is
different between VACUUM (and autovacuum) and VACUUM FREEZE, so we
could not expect the state to be changed by an ordinary explicit vacuum
or by autovacuum. Inevitably, we would need to run both ALTER TABLE
SET READ ONLY and VACUUM FREEZE to change the state to Frozen.
I think that we should also add a DDL command which does both the
freezing and the state change in one pass, like ALTER TABLE SET READ
ONLY WITH FREEZE or ALTER TABLE SET FROZEN.

Regards,

-------
Sawada Masahiko



Re: Freeze avoidance of very large table.

From
Jim Nasby
Date:
On 4/6/15 11:12 AM, Sawada Masahiko wrote:
> On Mon, Apr 6, 2015 at 10:17 PM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote:
>> On 4/6/15 1:46 AM, Sawada Masahiko wrote:
>>>
>>> On Sun, Apr 5, 2015 at 8:21 AM, Jeff Janes <jeff.janes@gmail.com> wrote:
>>>>
>>>> On Sat, Apr 4, 2015 at 3:10 PM, Jim Nasby <Jim.Nasby@bluetreble.com>
>>>> wrote:
>>>>>
>>>>>
>>>>> On 4/3/15 12:59 AM, Sawada Masahiko wrote:
>>>>>>
>>>>>>
>>>>>> +                               case HEAPTUPLE_LIVE:
>>>>>> +                               case HEAPTUPLE_RECENTLY_DEAD:
>>>>>> +                               case HEAPTUPLE_INSERT_IN_PROGRESS:
>>>>>> +                               case HEAPTUPLE_DELETE_IN_PROGRESS:
>>>>>> +                                       if
>>>>>> (heap_prepare_freeze_tuple(tuple.t_data, freezelimit,
>>>>>> +
>>>>>> mxactcutoff, &frozen[nfrozen]))
>>>>>> +
>>>>>> frozen[nfrozen++].offset
>>>>>> = offnum;
>>>>>> +                                       break;
>>>>>
>>>>>
>>>>>
>>>>> This doesn't seem safe enough to me. Can't there be tuples that are
>>>>> still
>>>>> new enough that they can't be frozen, and are still live?
>>>>
>>>>
>>>>
>>>> Yep.  I've set a table to read only while it contained unfreezable
>>>> tuples,
>>>> and the tuples remain unfrozen yet the read-only action claims to have
>>>> succeeded.
>>>>
>>>>
>>>>>
>>>>> Somewhat related... instead of forcing the freeze to happen
>>>>> synchronously,
>>>>> can't we set this up so a table is in one of three states? Read/Write,
>>>>> Read
>>>>> Only, Frozen. AT_SetReadOnly and AT_SetReadWrite would simply change to
>>>>> the
>>>>> appropriate state, and all the vacuum infrastructure would continue to
>>>>> process those tables as it does today. lazy_vacuum_rel would become
>>>>> responsible for tracking if there were any non-frozen tuples if it was
>>>>> also
>>>>> attempting a freeze. If it discovered there were none, AND the table was
>>>>> marked as ReadOnly, then it would change the table state to Frozen and
>>>>> set
>>>>> relfrozenxid = InvalidTransactionId and relminxid = InvalidMultiXactId.
>>>>> AT_SetReadWrite could change relfrozenxid to it's own Xid as an
>>>>> optimization. Doing it that way leaves all the complicated vacuum code
>>>>> in
>>>>> one place, and would eliminate concerns about race conditions with still
>>>>> running transactions, etc.
>>>>
>>>>
>>>>
>>>> +1 here as well.  I might want to set tables to read only for reasons
>>>> other
>>>> than to avoid repeated freezing.  (After all, the name of the command
>>>> suggests it is a general purpose thing) and wouldn't want to
>>>> automatically
>>>> trigger a vacuum freeze in the process.
>>>>
>>>
>>> Thank you for comments.
>>>
>>>> Somewhat related... instead of forcing the freeze to happen
>>>> synchronously, can't we set this up so a table is in one of three states?
>>>> Read/Write, Read Only, Frozen. AT_SetReadOnly and AT_SetReadWrite would
>>>> simply change to the appropriate state, and all the vacuum infrastructure
>>>> would continue to process those tables as it does today. lazy_vacuum_rel
>>>> would become responsible for tracking if there were any non-frozen tuples if
>>>> it was also attempting a freeze. If it discovered there were none, AND the
>>>> table was marked as ReadOnly, then it would change the table state to Frozen
>>>> and set relfrozenxid = InvalidTransactionId and relminxid =
>>>> InvalidMultiXactId. AT_SetReadWrite could change relfrozenxid to it's own
>>>> Xid as an optimization. Doing it that way leaves all the complicated vacuum
>>>> code in one place, and would eliminate concerns about race conditions with
>>>> still running transactions, etc.
>>>
>>>
>>> I agree with 3 status, Read/Write, ReadOnly and Frozen.
>>> But I'm not sure when we should do to freeze tuples, e.g., scan whole
>>> tables.
>>> I think that the any changes to table are completely
>>> ignored/restricted if table is marked as ReadOnly table,
>>> and it's accompanied by freezing tuples, just mark as ReadOnly.
>>> Frozen table ensures that all tuples of its table completely has been
>>> frozen, so it also needs to scan whole table as well.
>>> e.g., we should need to scan whole table at two times. right?
>>
>>
>> No. You would be free to set a table as ReadOnly any time you wanted,
>> without scanning anything. All that setting does is disable any DML on the
>> table.
>>
>> The Frozen state would only be set by the vacuum code, IFF:
>> - The table state is ReadOnly *at the start of vacuum* and did not change
>> during vacuum
>> - Vacuum ensured that there were no un-frozen tuples in the table
>>
>> That does not necessitate 2 scans.
>>
>
> I understood this comcept, and have question as I wrote below.
>
>>>> +1 here as well.  I might want to set tables to read only for reasons
>>>> other than to avoid repeated freezing.  (After all, the name of the command
>>>> suggests it is a general purpose thing) and wouldn't want to automatically
>>>> trigger a
>>>> vacuum freeze in the process.
>>>>
>>>> There is another possibility here, too. We can completely divorce a
>>>> ReadOnly mode (which I think is useful for other things besides freezing)
>>>> from the question of whether we need to force-freeze a relation if we create
>>>> a
>>>> FrozenMap, similar to the visibility map. This has the added advantage of
>>>> helping freeze scans on relations that are not ReadOnly in the case of
>>>> tables that are insert-mostly or any other pattern where most pages stay
>>>> all-frozen.
>>>> Prior to the visibility map this would have been a rather daunting
>>>> project, but I believe this could piggyback on the VM code rather nicely.
>>>> Anytime you clear the VM you clearly must clear the FrozenMap as well. The
>>>> logic for
>>>> setting the FM is clearly different, but that would be entirely
>>>> self-contained to vacuum. Unlike the VM, I don't see any point to marking
>>>> special bits in the page itself for FM.
>>>
>>>
>>> I was thinking this idea (FM) to avoid freezing all tuples actually.
>>> As you said, it might not be good idea (or overkill) that the reason
>>> why settings table to read only is avoidance repeated freezing.
>>> I'm attempting to try design FM to avoid freezing relations as well.
>>> Is it enough that each bit of FM has information that corresponding
>>> pages are completely frozen on each bit?
>>
>>
>> If I'm understanding your implied question correctly, I don't think there
>> would actually be any relationship between FM and marking ReadOnly. It would
>> come into play if we wanted to do the Frozen state, but if we have the FM,
>> marking an entire relation as Frozen becomes a lot less useful. What's going
>> to happen with a VACUUM FREEZE once we have FM is that vacuum will be able
>> to skip reading pages if they are all-visible *and* the FM shows them as
>> frozen, whereas today we can't use the VM to skip pages if scan_all is true.
>>
>> For simplicity, I would start out with each FM bit representing a single
>> page. That means the FM would be very similar in operation to the VM; the
>> only difference would be when a bit in the FM was set. I would absolutely
>> split this into 2 patches as well; one for ReadOnly (and skip the Frozen
>> status for now), and one for FM.
>> When I looked at the VM code briefly it occurred to me that it might be
>> quite difficult to have 1 FM bit represent multiple pages. The issue is the
>> locking necessary between VACUUM and clearing a FM bit. In the VM that's
>> handled by the cleanup lock, but that will only work at a page level. We'd
>> need something to ensure that nothing came in and performed DML while the
>> vacuum code was getting ready to set a FM bit. There's probably several ways
>> this could be accomplished, but I think it would be foolish to try and do
>> anything about it in the initial patch. Especially because it's only
>> supposition that there would be much benefit to having multiple pages per
>> bit.
>>
>
> Yes, I will separate the patch into two patches.
>
> I'd like to confirm about whether what I'm thinking is correct here.
> In first version of patch, each FM bit represent a single page is
> imply whether the all tuple of the page completely has been frozen, it
> would be one patch.

Yes.

> The second patch adds 3 states and read-only table which disable to

Actually, I would start simply with ReadOnly and ReadWrite.

As I understand it, the goal here is to prevent huge amounts of periodic 
freeze work due to XID wraparound. I don't think we need the Freeze 
state to accomplish that.

With a single bit per page in the Frozen Map, checking a 800GB table 
would require reading a mere 100MB of FM. That's pretty tiny, and 
largely accomplishes the goal.

Obviously it would be nice to eliminate even that 100MB read, but I 
suggest you leave that for a 3rd patch. I think you'll find that just 
getting the first 2 accomplished will be a significant amount of work.

Also, note that you don't really even need the ReadOnly patch. As long 
as you're not actually touching the table at all the FM will eventually 
read as everything is frozen; that gets you 80% of the way there. So I'd 
suggest starting with the FM, then doing ReadOnly, and only then 
attempting to add the Frozen state.

> any write to table. The trigger which changes state from Read/Write to
> Read-Only is ALTER TABLE SET READ ONLY. And the trigger changes from
> Read-Only to Frozen is vacuum only when the table has been marked as
> Read-Only at vacuum is started *and* the vacuum did not any freeze
> tuple(including skip the page refer to FM). If we support FM, we would
> be able to avoid repeated freezing whole table even if the table has
> not been marked as Read-Only.
>
> In order to change state to Frozen, we need to do VACUUM FREEZE or
> wait for running of auto vacuum. Generally, the threshold of cutoff
> xid is different between VACUUM (and autovacuum) and VACUUM FREEZE. We
> would not expect to change status using by explicit vacuum and
> autovacuum. Inevitably, we would need to do both command ALTER TABLE
> SET READ ONLY and VACUUM FREEZE to change state to Frozen.
> I think that we should also add DDL which does both freezing tuple and
> changing state in one pass, like ALTER TABLE SET READ ONLY WITH FREEZE
> or ALTER TABLE SET FROZEN.
>
> Regards,
>
> -------
> Sawada Masahiko
>


-- 
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com



Re: Freeze avoidance of very large table.

From
"ktm@rice.edu"
Date:
On Mon, Apr 06, 2015 at 12:07:47PM -0500, Jim Nasby wrote:
> ...
> As I understand it, the goal here is to prevent huge amounts of
> periodic freeze work due to XID wraparound. I don't think we need
> the Freeze state to accomplish that.
> 
> With a single bit per page in the Frozen Map, checking a 800GB table
> would require reading a mere 100MB of FM. That's pretty tiny, and
> largely accomplishes the goal.
> 
> Obviously it would be nice to eliminate even that 100MB read, but I
> suggest you leave that for a 3rd patch. I think you'll find that
> just getting the first 2 accomplished will be a significant amount
> of work.
> 

Hi,
I may have my math wrong, but 800GB ~ 100M pages or 12.5MB and not
100MB.

Regards,
Ken



Re: Freeze avoidance of very large table.

From
Jim Nasby
Date:
On 4/6/15 12:29 PM, ktm@rice.edu wrote:
> On Mon, Apr 06, 2015 at 12:07:47PM -0500, Jim Nasby wrote:
>> ...
>> As I understand it, the goal here is to prevent huge amounts of
>> periodic freeze work due to XID wraparound. I don't think we need
>> the Freeze state to accomplish that.
>>
>> With a single bit per page in the Frozen Map, checking a 800GB table
>> would require reading a mere 100MB of FM. That's pretty tiny, and
>> largely accomplishes the goal.
>>
>> Obviously it would be nice to eliminate even that 100MB read, but I
>> suggest you leave that for a 3rd patch. I think you'll find that
>> just getting the first 2 accomplished will be a significant amount
>> of work.
>>
>
> Hi,
> I may have my math wrong, but 800GB ~ 100M pages or 12.5MB and not
> 100MB.

Doh! 8 bits per byte and all that...
-- 
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com



Re: Freeze avoidance of very large table.

From
Josh Berkus
Date:
On 04/06/2015 10:07 AM, Jim Nasby wrote:
> Actually, I would start simply with ReadOnly and ReadWrite.
> 
> As I understand it, the goal here is to prevent huge amounts of periodic
> freeze work due to XID wraparound. I don't think we need the Freeze
> state to accomplish that.
> 
> With a single bit per page in the Frozen Map, checking a 800GB table
> would require reading a mere 100MB of FM. That's pretty tiny, and
> largely accomplishes the goal.
> 
> Obviously it would be nice to eliminate even that 100MB read, but I
> suggest you leave that for a 3rd patch. I think you'll find that just
> getting the first 2 accomplished will be a significant amount of work.
> 
> Also, note that you don't really even need the ReadOnly patch. As long
> as you're not actually touching the table at all the FM will eventually
> read as everything is frozen; that gets you 80% of the way there. So I'd
> suggest starting with the FM, then doing ReadOnly, and only then
> attempting to add the Frozen state.

+1

There was some reason why we didn't have a Freeze Map before, though;
IIRC these were the problems:

1. would need to make sure it gets sync'd to disk and/or WAL-logged

2. every time a page is modified, the map would need to get updated

3. Yet Another Relation File (not inconsequential for the cases we're
discussing).

Also, given that the Visibility Map necessarily needs to have the
superset of the Frozen Map, maybe combining them in some way would make
sense.

I agree with Jim that if we have a trustworthy Frozen Map, having a
ReadOnly flag is of marginal value, unless such a ReadOnly flag allowed
us to skip updating the individual row XIDs entirely.  I can think of
some ways to do that, but they have severe tradeoffs.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com



Re: Freeze avoidance of very large table.

From
Alvaro Herrera
Date:
Josh Berkus wrote:

> I agree with Jim that if we have a trustworthy Frozen Map, having a
> ReadOnly flag is of marginal value, unless such a ReadOnly flag allowed
> us to skip updating the individual row XIDs entirely.  I can think of
> some ways to do that, but they have severe tradeoffs.

If you're thinking that the READ ONLY flag is only useful for freezing,
then yeah maybe it's of marginal value.  But in the foreign key
constraint area, consider that you could have tables with
frequently-referenced PKs marked as READ ONLY -- then you don't need to
acquire row locks when inserting/updating rows in the referencing
tables.  That might give you a good performance benefit that's not in
any way related to freezing, as well as reducing your multixact
consumption rate.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Freeze avoidance of very large table.

From
Josh Berkus
Date:
On 04/06/2015 11:35 AM, Alvaro Herrera wrote:
> Josh Berkus wrote:
> 
>> I agree with Jim that if we have a trustworthy Frozen Map, having a
>> ReadOnly flag is of marginal value, unless such a ReadOnly flag allowed
>> us to skip updating the individual row XIDs entirely.  I can think of
>> some ways to do that, but they have severe tradeoffs.
> 
> If you're thinking that the READ ONLY flag is only useful for freezing,
> then yeah maybe it's of marginal value.  But in the foreign key
> constraint area, consider that you could have tables with
> frequently-referenced PKs marked as READ ONLY -- then you don't need to
> acquire row locks when inserting/updating rows in the referencing
> tables.  That might give you a good performance benefit that's not in
> any way related to freezing, as well as reducing your multixact
> consumption rate.

Hmmmm.  Yeah, that would make it worthwhile, although it would be a
fairly obscure bit of performance optimization for anyone not on this
list ;-)

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com



Re: Freeze avoidance of very large table.

From
Jim Nasby
Date:
On 4/6/15 1:28 PM, Josh Berkus wrote:
> On 04/06/2015 10:07 AM, Jim Nasby wrote:
>> Actually, I would start simply with ReadOnly and ReadWrite.
>>
>> As I understand it, the goal here is to prevent huge amounts of periodic
>> freeze work due to XID wraparound. I don't think we need the Freeze
>> state to accomplish that.
>>
>> With a single bit per page in the Frozen Map, checking a 800GB table
>> would require reading a mere 100MB of FM. That's pretty tiny, and
>> largely accomplishes the goal.
>>
>> Obviously it would be nice to eliminate even that 100MB read, but I
>> suggest you leave that for a 3rd patch. I think you'll find that just
>> getting the first 2 accomplished will be a significant amount of work.
>>
>> Also, note that you don't really even need the ReadOnly patch. As long
>> as you're not actually touching the table at all the FM will eventually
>> read as everything is frozen; that gets you 80% of the way there. So I'd
>> suggest starting with the FM, then doing ReadOnly, and only then
>> attempting to add the Frozen state.
>
> +1
>
> There was some reason why we didn't have  Freeze Map before, though;
> IIRC these were the problems:
>
> 1. would need to make sure it gets sync'd to disk and/or WAL-logged

Same as VM.

> 2. every time a page is modified, the map would need to get updated

Not every time, just the first time, if the FM bit for a page was set. 
It would only be set by vacuum, just like the VM.

> 3. Yet Another Relation File (not inconsequential for the cases we're
> discussing).

Sure, which is why I think it might be interesting to either allow for 
more than one page per bit, or perhaps some form of compression. That 
said, I don't think it's worth worrying about too much because it's 
still a 64,000-1 ratio with 8k pages. If you use 32k pages it becomes 
256,000-1, or 4GB of FM for 1PB of heap.
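
(Spelled out: at one bit per 8k page, a byte of FM covers 8 pages, so
8192 * 8 = 65,536 bytes of heap per FM byte; with 32k pages it's
32768 * 8 = 262,144 bytes per byte, and 1PB / 262,144 comes to 4GB.)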

> Also, given that the Visibility Map necessarily needs to have the
> superset of the Frozen Map, maybe combining them in some way would make
> sense.

The thing is, I think in many workloads the patterns here will actually 
be radically different, in that it's way easier to get a page to be 
all-visible than it is to freeze it.

Perhaps there's something we can do here when we look at other ways to 
reduce space usage for FM (and maybe VM too), but I don't think now is 
the time to put effort into this.

> I agree with Jim that if we have a trustworthy Frozen Map, having a
> ReadOnly flag is of marginal value, unless such a ReadOnly flag allowed
> us to skip updating the individual row XIDs entirely.  I can think of
> some ways to do that, but they have severe tradeoffs.

Aside from Alvaro's points, I think many users would find it useful as 
an easy way to ensure no one is writing to a table, which could be 
valuable for any number of reasons. As long as the patch isn't too 
complicated I don't see a reason not to do it.
-- 
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com



Re: Freeze avoidance of very large table.

From
Greg Stark
Date:
On 6 Apr 2015 09:17, "Jim Nasby" <Jim.Nasby@bluetreble.com> wrote:
>
> No. You would be free to set a table as ReadOnly any time you wanted,
> without scanning anything. All that setting does is disable any DML on
> the table.
>
> The Frozen state would only be set by the vacuum code, IFF:
> - The table state is ReadOnly *at the start of vacuum* and did not
> change during vacuum
> - Vacuum ensured that there were no un-frozen tuples in the table
>
> That does not necessitate 2 scans.

This is exactly what I would suggest.

Only I would suggest thinking of it in terms of two orthogonal boolean
flags rather than three states. It's easier to reason about whether a
table has a specific property than trying to control a state machine in
a predefined pathway.

So I would say the two flags are:
READONLY: guarantees nothing can be dirtied
ALLFROZEN: guarantees no unfrozen tuples are present

In practice you can't have the latter without the former, since vacuum
can't know everything is frozen unless it knows nobody is inserting. But
perhaps there will be cases in the future where that's not true.

Incidentally, there are a number of other optimisations that I have had
in mind that are only possible on frozen read-only tables:

1) Compression: compress the pages and pack them one after the other.
Build a new fork with offsets for each page.

2) Automatic partition elimination, where the statistics track the
minimum and maximum value per partition (and number of tuples) and
treat them as implicit constraints. In particular it would magically
make read-only empty parent partitions be excluded regardless of the
WHERE clause.

Re: Freeze avoidance of very large table.

From
Jim Nasby
Date:
On 4/6/15 5:18 PM, Greg Stark wrote:
> Only I would suggest thinking of it in terms of two orthogonal boolean
> flags rather than three states. It's easier to reason about whether a
> table has a specific property than trying to control a state machine in
> a predefined pathway.
>
> So I would say the two flags are:
> READONLY: guarantees nothing can be dirtied
> ALLFROZEN: guarantees no unfrozen tuples are present
>
> In practice you can't have the later without the former since vacuum
> can't know everything is frozen unless it knows nobody is inserting. But
> perhaps there will be cases in the future where that's not true.

I'm not so sure about that. There's a logical state progression here 
(see below). ISTM it's easier to just enforce that in one place instead 
of a bunch of places having to check multiple conditions. But, I'm not 
wed to a single field.

> Incidentally there are a number of other optimisations that I have had
> in mind that are only possible on frozen read-only tables:
>
> 1) Compression: compress the pages and pack them one after the other.
> Build a new fork with offsets for each page.
>
> 2) Automatic partition elimination where the statistics track the
> minimum and maximum value per partition (and number of tuples) and treat
> then as implicit constraints. In particular it would magically make read
> only empty parent partitions be excluded regardless of the where clause.

AFAICT neither of those actually requires ALLFROZEN, no? You'll need to 
uncompact and re-compact for #1 when you actually freeze (which maybe 
isn't worth it), but freezing isn't absolutely required. #2 would only 
require that everything in the relation is visible; not frozen.

I think there's value here to having an ALLVISIBLE state as well as 
ALLFROZEN.
-- 
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com



Re: Freeze avoidance of very large table.

From
Sawada Masahiko
Date:
On Tue, Apr 7, 2015 at 7:53 AM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote:
> On 4/6/15 5:18 PM, Greg Stark wrote:
>>
>> Only I would suggest thinking of it in terms of two orthogonal boolean
>> flags rather than three states. It's easier to reason about whether a
>> table has a specific property than trying to control a state machine in
>> a predefined pathway.
>>
>> So I would say the two flags are:
>> READONLY: guarantees nothing can be dirtied
>> ALLFROZEN: guarantees no unfrozen tuples are present
>>
>> In practice you can't have the later without the former since vacuum
>> can't know everything is frozen unless it knows nobody is inserting. But
>> perhaps there will be cases in the future where that's not true.
>
>
> I'm not so sure about that. There's a logical state progression here (see
> below). ISTM it's easier to just enforce that in one place instead of a
> bunch of places having to check multiple conditions. But, I'm not wed to a
> single field.
>
>> Incidentally there are a number of other optimisations that I have had
>> in mind that are only possible on frozen read-only tables:
>>
>> 1) Compression: compress the pages and pack them one after the other.
>> Build a new fork with offsets for each page.
>>
>> 2) Automatic partition elimination where the statistics track the
>> minimum and maximum value per partition (and number of tuples) and treat
>> then as implicit constraints. In particular it would magically make read
>> only empty parent partitions be excluded regardless of the where clause.
>
>
> AFAICT neither of those actually requires ALLFROZEN, no? You'll need to
> uncompact and re-compact for #1 when you actually freeze (which maybe isn't
> worth it), but freezing isn't absolutely required. #2 would only require
> that everything in the relation is visible; not frozen.
>
> I think there's value here to having an ALLVISIBLE state as well as
> ALLFROZEN.
>

Based on many suggestions, I'm going to deal with the FM first, as one
patch. It will be a simple mechanism, similar to the VM, in the first patch.
- Each bit of the FM represents a single page
- A bit is set only by vacuum
- A bit is unset by INSERT, UPDATE and DELETE

Second, I'll deal with a simple read-only table with two states,
Read/Write (default) and ReadOnly, as one patch. ISTM that having the
Frozen state needs more discussion. A read-only table just allows us to
disable any updates to the table, and it's controlled by a read-only
flag in pg_class. The DDL commands which change this status are ALTER
TABLE SET READ ONLY and READ WRITE.
Also, as Alvaro suggested, a read-only table affects not only table
freezing but also performance optimizations. I'll consider including
them when I deal with the read-only table.

Regards,

-------
Sawada Masahiko



Re: Freeze avoidance of very large table.

From
Sawada Masahiko
Date:
On Tue, Apr 7, 2015 at 11:22 AM, Sawada Masahiko <sawada.mshk@gmail.com> wrote:
> On Tue, Apr 7, 2015 at 7:53 AM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote:
>> On 4/6/15 5:18 PM, Greg Stark wrote:
>>>
>>> Only I would suggest thinking of it in terms of two orthogonal boolean
>>> flags rather than three states. It's easier to reason about whether a
>>> table has a specific property than trying to control a state machine in
>>> a predefined pathway.
>>>
>>> So I would say the two flags are:
>>> READONLY: guarantees nothing can be dirtied
>>> ALLFROZEN: guarantees no unfrozen tuples are present
>>>
>>> In practice you can't have the latter without the former since vacuum
>>> can't know everything is frozen unless it knows nobody is inserting. But
>>> perhaps there will be cases in the future where that's not true.
>>
>>
>> I'm not so sure about that. There's a logical state progression here (see
>> below). ISTM it's easier to just enforce that in one place instead of a
>> bunch of places having to check multiple conditions. But, I'm not wed to a
>> single field.
>>
>>> Incidentally there are a number of other optimisations that I've had in
>>> mind that are only possible on frozen read-only tables:
>>>
>>> 1) Compression: compress the pages and pack them one after the other.
>>> Build a new fork with offsets for each page.
>>>
>>> 2) Automatic partition elimination where the statistics track the
>>> minimum and maximum value per partition (and number of tuples) and treat
>>> them as implicit constraints. In particular it would magically make read
>>> only empty parent partitions be excluded regardless of the where clause.
>>
>>
>> AFAICT neither of those actually requires ALLFROZEN, no? You'll need to
>> uncompact and re-compact for #1 when you actually freeze (which maybe isn't
>> worth it), but freezing isn't absolutely required. #2 would only require
>> that everything in the relation is visible; not frozen.
>>
>> I think there's value here to having an ALLVISIBLE state as well as
>> ALLFROZEN.
>>
>
> Based on many suggestions, I'm going to deal with the FM first, as one
> patch. It will be a simple mechanism, similar to the VM, in the first patch.
> - Each bit of the FM represents a single page
> - A bit is set only by vacuum
> - A bit is unset by INSERT, UPDATE and DELETE
>
> Second, I'll deal with a simple read-only table with two states,
> Read/Write (default) and ReadOnly, as one patch. ISTM that having the
> Frozen state needs more discussion. A read-only table just allows us to
> disable any updates to the table, and it's controlled by a read-only
> flag in pg_class. The DDL commands which change this status are ALTER
> TABLE SET READ ONLY and READ WRITE.
> Also, as Alvaro suggested, a read-only table affects not only table
> freezing but also performance optimizations. I'll consider including
> them when I deal with the read-only table.
>

The attached WIP patch adds a Frozen Map, which enables us to avoid
whole-table vacuuming even when a full scan would otherwise be required
to prevent XID wraparound failures.

The Frozen Map is a bitmap with one bit per heap page, quite similar
to the Visibility Map. A set bit means that all tuples on the heap page
are completely frozen, so we don't need to vacuum freeze that page.
A bit is set when vacuum (or autovacuum) figures out that all tuples on
the corresponding heap page are completely frozen, and a bit is cleared
when INSERT and UPDATE (for the new heap page only) are executed.

The current patch adds a new source file src/backend/access/heap/frozenmap.c
which is quite similar to visibilitymap.c. They have similar code but
are separated for now. I will refactor this source code, e.g. into a
common bitmap.c, if needed.
Also, when skipping vacuum via the visibility map, we only skip runs of
at least SKIP_PAGES_THRESHOLD consecutive pages, but such a mechanism
is not in the frozen map yet.
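
To give an idea of the shape of the code, here is a minimal sketch of
the probe function, modeled on visibilitymap_test(); fm_readbuf() and
the HEAPBLK_TO_* macros are assumed here to be direct adaptations of
their VM counterparts:

/*
 * frozenmap_test - sketch only, following visibilitymap_test().
 *
 * Returns true if the FM bit for heapBlk is set.  *buf caches the
 * last-used map page across calls.
 */
bool
frozenmap_test(Relation rel, BlockNumber heapBlk, Buffer *buf)
{
    BlockNumber mapBlock = HEAPBLK_TO_MAPBLOCK(heapBlk);
    uint32      mapByte = HEAPBLK_TO_MAPBYTE(heapBlk);
    uint8       mapBit = HEAPBLK_TO_MAPBIT(heapBlk);
    char       *map;

    /* Reuse the currently pinned map page if it is the right one. */
    if (!BufferIsValid(*buf) || BufferGetBlockNumber(*buf) != mapBlock)
    {
        if (BufferIsValid(*buf))
            ReleaseBuffer(*buf);
        *buf = fm_readbuf(rel, mapBlock, false);
        if (!BufferIsValid(*buf))
            return false;       /* no map page means not all-frozen */
    }

    map = PageGetContents(BufferGetPage(*buf));
    return (map[mapByte] & (1 << mapBit)) != 0;
}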

Please give me feedback.

Regards,

-------
Sawada Masahiko

Attachment

Re: Freeze avoidance of very large table.

From
Bruce Momjian
Date:
On Mon, Apr 20, 2015 at 04:45:34PM +0900, Sawada Masahiko wrote:
> The attached WIP patch adds a Frozen Map, which enables us to avoid
> whole-table vacuuming even when a full scan would otherwise be required
> to prevent XID wraparound failures.
>
> The Frozen Map is a bitmap with one bit per heap page, quite similar
> to the Visibility Map. A set bit means that all tuples on the heap page
> are completely frozen, so we don't need to vacuum freeze that page.
> A bit is set when vacuum (or autovacuum) figures out that all tuples on
> the corresponding heap page are completely frozen, and a bit is cleared
> when INSERT and UPDATE (for the new heap page only) are executed.

So, this patch avoids reading the all-frozen pages if they have not been
modified since the last VACUUM FREEZE?  Since it is already frozen, the
running VACUUM FREEZE will not modify the page or generate WAL, so is it
really worth maintaining a new per-page bitmap just to avoid the
sequential scan of tables every 200M transactions?  I would like to see
us reduce the need for VACUUM FREEZE, rather than go in this direction.

-- 
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + Everyone has their own god. +



Re: Freeze avoidance of very large table.

From
Jim Nasby
Date:
On 4/20/15 1:48 PM, Bruce Momjian wrote:
> On Mon, Apr 20, 2015 at 04:45:34PM +0900, Sawada Masahiko wrote:
>> The attached WIP patch adds a Frozen Map, which enables us to avoid
>> whole-table vacuuming even when a full scan would otherwise be required
>> to prevent XID wraparound failures.
>>
>> The Frozen Map is a bitmap with one bit per heap page, quite similar
>> to the Visibility Map. A set bit means that all tuples on the heap page
>> are completely frozen, so we don't need to vacuum freeze that page.
>> A bit is set when vacuum (or autovacuum) figures out that all tuples on
>> the corresponding heap page are completely frozen, and a bit is cleared
>> when INSERT and UPDATE (for the new heap page only) are executed.
>
> So, this patch avoids reading the all-frozen pages if they have not been
> modified since the last VACUUM FREEZE?  Since it is already frozen, the
> running VACUUM FREEZE will not modify the page or generate WAL, so is it
> really worth maintaining a new per-page bitmap just to avoid the
> sequential scan of tables every 200M transactions?  I would like to see
> us reduce the need for VACUUM FREEZE, rather than go in this direction.

How would you propose we do that?

I also think there's better ways we could handle *all* our cleanup work. 
Tuples have a definite lifespan, and there's potentially a lot of 
efficiency to be gained if we could track tuples through their stages of 
life... but I don't see any easy ways to do that.
-- 
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com



Re: Freeze avoidance of very large table.

From
Bruce Momjian
Date:
On Mon, Apr 20, 2015 at 01:59:17PM -0500, Jim Nasby wrote:
> On 4/20/15 1:48 PM, Bruce Momjian wrote:
> >On Mon, Apr 20, 2015 at 04:45:34PM +0900, Sawada Masahiko wrote:
> >>The attached WIP patch adds a Frozen Map, which enables us to avoid
> >>whole-table vacuuming even when a full scan would otherwise be required
> >>to prevent XID wraparound failures.
> >>
> >>The Frozen Map is a bitmap with one bit per heap page, quite similar
> >>to the Visibility Map. A set bit means that all tuples on the heap page
> >>are completely frozen, so we don't need to vacuum freeze that page.
> >>A bit is set when vacuum (or autovacuum) figures out that all tuples on
> >>the corresponding heap page are completely frozen, and a bit is cleared
> >>when INSERT and UPDATE (for the new heap page only) are executed.
> >
> >So, this patch avoids reading the all-frozen pages if they have not been
> >modified since the last VACUUM FREEZE?  Since it is already frozen, the
> >running VACUUM FREEZE will not modify the page or generate WAL, so is it
> >really worth maintaining a new per-page bitmap just to avoid the
> >sequential scan of tables every 200M transactions?  I would like to see
> >us reduce the need for VACUUM FREEZE, rather than go in this direction.
> 
> How would you propose we do that?
> 
> I also think there's better ways we could handle *all* our cleanup
> work. Tuples have a definite lifespan, and there's potentially a lot
> of efficiency to be gained if we could track tuples through their
> stages of life... but I don't see any easy ways to do that.

See the TODO list:

    https://wiki.postgresql.org/wiki/Todo
    o  Avoid the requirement of freezing pages that are infrequently
       modified

-- 
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + Everyone has their own god. +



Re: Freeze avoidance of very large table.

From
Jim Nasby
Date:
On 4/20/15 2:09 PM, Bruce Momjian wrote:
> On Mon, Apr 20, 2015 at 01:59:17PM -0500, Jim Nasby wrote:
>> On 4/20/15 1:48 PM, Bruce Momjian wrote:
>>> On Mon, Apr 20, 2015 at 04:45:34PM +0900, Sawada Masahiko wrote:
>>>> The attached WIP patch adds a Frozen Map, which enables us to avoid
>>>> whole-table vacuuming even when a full scan would otherwise be required
>>>> to prevent XID wraparound failures.
>>>>
>>>> The Frozen Map is a bitmap with one bit per heap page, quite similar
>>>> to the Visibility Map. A set bit means that all tuples on the heap page
>>>> are completely frozen, so we don't need to vacuum freeze that page.
>>>> A bit is set when vacuum (or autovacuum) figures out that all tuples on
>>>> the corresponding heap page are completely frozen, and a bit is cleared
>>>> when INSERT and UPDATE (for the new heap page only) are executed.
>>>
>>> So, this patch avoids reading the all-frozen pages if they have not been
>>> modified since the last VACUUM FREEZE?  Since it is already frozen, the
>>> running VACUUM FREEZE will not modify the page or generate WAL, so is it
>>> really worth maintaining a new per-page bitmap just to avoid the
>>> sequential scan of tables every 200M transactions?  I would like to see
>>> us reduce the need for VACUUM FREEZE, rather than go in this direction.
>>
>> How would you propose we do that?
>>
>> I also think there's better ways we could handle *all* our cleanup
>> work. Tuples have a definite lifespan, and there's potentially a lot
>> of efficiency to be gained if we could track tuples through their
>> stages of life... but I don't see any easy ways to do that.
>
> See the TODO list:
>
>     https://wiki.postgresql.org/wiki/Todo
>     o  Avoid the requirement of freezing pages that are infrequently
>        modified

Right, but do you have a proposal for how that would actually happen?

Perhaps I'm misunderstanding you, but it sounded like you were opposed 
to this patch because it doesn't do anything to avoid the need to 
freeze. My point is that no one has any good ideas on how to avoid 
freezing, and I think it's a safe bet that any ideas people do come up 
with there will be a lot more invasive than a FrozenMap is.

While not perfect, a FrozenMap is something we can do today, without a 
lot of effort, and it will provide definite value for any tables that 
have a "good" amount of frozen pages. Without performance testing, we 
don't know what "good" actually looks like, but we can't test without a 
patch (which we now have). Assuming performance numbers look good I 
think it would be folly to reject this patch in the hopes that 
eventually we'll have some way to avoid the need to freeze.
-- 
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com



Re: Freeze avoidance of very large table.

From
Bruce Momjian
Date:
On Mon, Apr 20, 2015 at 03:58:19PM -0500, Jim Nasby wrote:
> >>I also think there's better ways we could handle *all* our cleanup
> >>work. Tuples have a definite lifespan, and there's potentially a lot
> >>of efficiency to be gained if we could track tuples through their
> >>stages of life... but I don't see any easy ways to do that.
> >
> >See the TODO list:
> >
> >    https://wiki.postgresql.org/wiki/Todo
> >    o  Avoid the requirement of freezing pages that are infrequently
> >       modified
> 
> Right, but do you have a proposal for how that would actually happen?
> 
> Perhaps I'm misunderstanding you, but it sounded like you were
> opposed to this patch because it doesn't do anything to avoid the
> need to freeze. My point is that no one has any good ideas on how to
> avoid freezing, and I think it's a safe bet that any ideas people do
> come up with there will be a lot more invasive than a FrozenMap is.

Didn't you think any of the TODO threads had workable solutions?  And
don't expect adding an additional file per relation will be zero cost
--- added over the lifetime of 200M transactions, I question if this
approach would be a win.

-- 
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + Everyone has their own god. +



Re: Freeze avoidance of very large table.

From
Jim Nasby
Date:
On 4/20/15 2:45 AM, Sawada Masahiko wrote:
> The current patch adds a new source file src/backend/access/heap/frozenmap.c
> which is quite similar to visibilitymap.c. They have similar code but
> are separated for now. I will refactor this source code, e.g. into a
> common bitmap.c, if needed.

My feeling is we'd definitely want this refactored; it looks to be a 
whole lot of duplicated code. But before working on that we should get 
consensus that a FrozenMap is a good idea.

Are there any meaningful differences between the two, besides the 
obvious name changes?

I think there's also a bunch of XLOG stuff that could be refactored too...

> Also, when skipping vacuum via the visibility map, we only skip runs of
> at least SKIP_PAGES_THRESHOLD consecutive pages, but such a mechanism
> is not in the frozen map yet.

That's probably something else that can be factored out, since it's 
basically the same logic. I suspect we just need to && some of the 
checks so we're looking at both FM and VM at the same time.
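
Roughly something like this, illustrative only (reusing the variable
names from the existing VM path, not what the patch currently does):

/*
 * Sketch: a block is skippable when the VM says all-visible and, for a
 * freeze pass (scan_all), the FM also says all-frozen.  The existing
 * SKIP_PAGES_THRESHOLD run logic would wrap this combined test.
 */
if (visibilitymap_test(onerel, blkno, &vmbuffer) &&
    (!scan_all || frozenmap_test(onerel, blkno, &fmbuffer)))
    continue;           /* nothing to do for this block */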

Other comments...

It would be nice if we didn't need another page bit for FM; do you see 
any reasonable way that could happen?

+     * If we didn't pin the visibility(and frozen) map page and the page has
+     * become all visible(and frozen) while we were busy locking the buffer,
+     * or during some subsequent window during which we had it unlocked,
+     * we'll have to unlock and re-lock, to avoid holding the buffer lock
+     * across an I/O.  That's a bit unfortunate, especially since we'll now
+     * have to recheck whether the tuple has been locked or updated under us,
+     * but hopefully it won't happen very often.      */

s/(and frozen)/ or frozen/


+ * Reply XLOG_HEAP3_FROZENMAP record.
s/Reply/Replay/


+        /*
+         * XLogReplayBufferExtended locked the buffer. But frozenmap_set
+         * will handle locking itself.
+         */
+        LockBuffer(fmbuffer, BUFFER_LOCK_UNLOCK);

Doesn't this create a race condition?


Are you sure the bit in finish_heap_swap() is safe? If so, we should add 
the same for the visibility map too (it had certainly better be all 
visible if it's frozen...)



+            /*
+             * Current block is all-visible.
+             * If frozen map represents that it's all frozen and this
+             * function is called for freezing tuples, we can skip to
+             * vacuum block.
+             */

I would state this as "Even if scan_all is true, we can skip blocks that 
are marked as frozen."

+            if (frozenmap_test(onerel, blkno, &fmbuffer) && scan_all)

I suspect it's faster to reverse those tests (scan_all && 
frozenmap_test())... but why do we even need to look at scan_all? AFAICT 
if a block is frozen we can skip it unconditionally.


+            /*
+             * If the un-frozen tuple is remaining in current page and
+             * current page is marked as ALL_FROZEN, we should clear it.
+             */

That needs to NEVER happen. If it does then we're going to consider 
tuples as visible/frozen that shouldn't be. We should probably throw an 
error here, because it means the heap is now corrupted. At the minimum 
it needs to be an assert().



Note that I haven't reviewed all the logic in detail at this point. If 
this ends up being refactored it'll be a lot easier to spot logic 
problems, so I'll hold off on that for now.
-- 
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com



Re: Freeze avoidance of very large table.

From
Jim Nasby
Date:
On 4/20/15 4:13 PM, Bruce Momjian wrote:
> On Mon, Apr 20, 2015 at 03:58:19PM -0500, Jim Nasby wrote:
>>>> I also think there's better ways we could handle *all* our cleanup
>>>> work. Tuples have a definite lifespan, and there's potentially a lot
>>>> of efficiency to be gained if we could track tuples through their
>>>> stages of life... but I don't see any easy ways to do that.
>>>
>>> See the TODO list:
>>>
>>>     https://wiki.postgresql.org/wiki/Todo
>>>     o  Avoid the requirement of freezing pages that are infrequently
>>>        modified
>>
>> Right, but do you have a proposal for how that would actually happen?
>>
>> Perhaps I'm misunderstanding you, but it sounded like you were
>> opposed to this patch because it doesn't do anything to avoid the
>> need to freeze. My point is that no one has any good ideas on how to
>> avoid freezing, and I think it's a safe bet that any ideas people do
>> come up with there will be a lot more invasive than a FrozenMap is.
>
> Didn't you think any of the TODO threads had workable solutions?  And

I didn't realize there were threads there.

The first three are discussion around the idea of eliminating the need 
to freeze based on a page already being all visible. No patches.

http://www.postgresql.org/message-id/CA+TgmoaEmnoLZmVbb8gvY69NA8zw9BWpiZ9+TLz-LnaBOZi7JA@mail.gmail.com 
has a WIP patch that goes the route of using a tuple flag to indicate 
frozen, but also raises a lot of concerns about visibility, because it 
means we'd stop using FrozenXID. That impacts a large amount of code. 
There were some followup patches as well as a bunch of discussion of how 
to make it visible that a tuple was frozen or not. That thread died in 
January 2014.

The fifth thread is XID to LSN mapping. AFAICT this has a significant 
drawback in that it breaks page compatibility, meaning no pg_upgrade. It 
ends 5/14/2014 with this comment:

"Well, Heikki was saying on another thread that he had kind of gotten
cold feet about this, so I gather he's not planning to pursue it.  Not
sure if I understood that correctly.  If so, I guess it depends on
whether someone else can pick it up, but we might first want to
establish why he got cold feet and how worrying those problems seem to
other people." - 
http://www.postgresql.org/message-id/CA+TgmoYoN8LzSuaffUaEkyV8Mhv1wi=ZLBXQ3VOfEZNO1dbw9Q@mail.gmail.com

So work was done on two alternative approaches, and then abandoned. Both 
of those approaches might still be valid, but seem to need more work. 
They're also higher risk because they're changing MVCC at a very 
fundamental level.

As I mentioned, I think there's a lot better stuff we could be doing 
about tuple lifetime, but there's no easy fixes to be had. This patch 
solves a problem today, using a concept that's now well proven 
(visibility map). If we had something more sophisticated being developed 
then I'd be inclined not to pursue this patch, but that's not the case.

Perhaps others can elaborate on where those two patches are at...

> don't expect adding an additional file per relation will be zero cost
> --- added over the lifetime of 200M transactions, I question if this
> approach would be a win.

Can you elaborate on this? I don't see how the number of transactions 
would come into play, but the overhead here is not large; the FrozenMap 
would be the same size as the VM, which is 1/64,000th the size of the 
heap. So a 64GB table means a 1MB FM. That doesn't seem very expensive.
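
For the record, the arithmetic, as a quick standalone check assuming the
default 8kB block size:

#include <stdio.h>
#include <stdint.h>

int
main(void)
{
    uint64_t blcksz = 8192;                 /* default BLCKSZ */
    uint64_t heap_per_fm_byte = blcksz * 8; /* one FM bit per heap page */
    uint64_t heap_size = 64ULL << 30;       /* a 64GB table */

    /* 65536 heap bytes per FM byte, i.e. the ~1/64,000 ratio */
    printf("heap bytes per FM byte: %llu\n",
           (unsigned long long) heap_per_fm_byte);

    /* 1048576 FM bytes, i.e. 1MB of FM for a 64GB heap */
    printf("FM bytes for a 64GB heap: %llu\n",
           (unsigned long long) (heap_size / heap_per_fm_byte));
    return 0;
}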
-- 
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com



Re: Freeze avoidance of very large table.

From
Josh Berkus
Date:
On 04/20/2015 02:13 PM, Bruce Momjian wrote:
> Didn't you think any of the TODO threads had workable solutions?  And
> don't expect adding an additional file per relation will be zero cost
> --- added over the lifetime of 200M transactions, I question if this
> approach would be a win.

Well, the only real way to test that is a prototype, no?

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com



Re: Freeze avoidance of very large table.

From
Andres Freund
Date:
On 2015-04-20 17:13:29 -0400, Bruce Momjian wrote:
> Didn't you think any of the TODO threads had workable solutions?  And
> don't expect adding an additional file per relation will be zero cost
> --- added over the lifetime of 200M transactions, I question if this
> approach would be a win.

Note that normally you'd not run with a 200M transaction freeze max age
on a busy server. Rather, around an order of magnitude more.

Think about this being used on a time-partitioned table. Right now all
the partitions have to be fully rescanned on a regular basis - quite
painful. With something like this normally only the newest partitions
will have to be.

Greetings,

Andres Freund



Re: Freeze avoidance of very large table.

From
Sawada Masahiko
Date:


On Tue, Apr 21, 2015 at 7:00 AM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote:
> On 4/20/15 2:45 AM, Sawada Masahiko wrote:
>>
>> The current patch adds a new source file src/backend/access/heap/frozenmap.c
>> which is quite similar to visibilitymap.c. They have similar code but
>> are separated for now. I will refactor this source code, e.g. into a
>> common bitmap.c, if needed.
>

Thank you for taking a look at this patch.

>
> My feeling is we'd definitely want this refactored; it looks to be a whole
> lot of duplicated code. But before working on that we should get consensus
> that a FrozenMap is a good idea.

Yes, we need to get consensus about the FrozenMap before starting work on it.
In addition to the comments you pointed out, I noticed one problem I should address: a bit of the FrozenMap needs to be cleared on deletion (i.e. when xmax is set).
A frozen page could still contain dead tuples for now, but I'm thinking of changing that so the frozen bit guarantees that the page is all frozen *and* all visible.

> Are there any meaningful differences between the two, besides the obvious
> name changes?

No, there aren't.

> I think there's also a bunch of XLOG stuff that could be refactored too...

I agree with you.

>> Also, when skipping vacuum via the visibility map, we only skip runs of
>> at least SKIP_PAGES_THRESHOLD consecutive pages, but such a mechanism
>> is not in the frozen map yet.
>
>
> That's probably something else that can be factored out, since it's
> basically the same logic. I suspect we just need to && some of the checks so
> we're looking at both FM and VM at the same time.

The FrozenMap is used to skip scanning only for anti-wraparound vacuum, or when freezing all tuples (i.e. scan_all is true).
A normal vacuum uses only the VM and doesn't use the FM for now.

> Other comments...
>
> It would be nice if we didn't need another page bit for FM; do you see any
> reasonable way that could happen?

We may be able to remove the FM page bit from the page header, but I'm not sure we can do that.

> +     * If we didn't pin the visibility(and frozen) map page and the page has
> +     * become all visible(and frozen) while we were busy locking the buffer,
> +     * or during some subsequent window during which we had it unlocked,
> +     * we'll have to unlock and re-lock, to avoid holding the buffer lock
> +     * across an I/O.  That's a bit unfortunate, especially since we'll now
> +     * have to recheck whether the tuple has been locked or updated under us,
> +     * but hopefully it won't happen very often.
> +     */
>
> s/(and frozen)/ or frozen/
>
>
> + * Reply XLOG_HEAP3_FROZENMAP record.
> s/Reply/Replay/

Understood.

>
> +        /*
> +         * XLogReplayBufferExtended locked the buffer. But frozenmap_set
> +         * will handle locking itself.
> +         */
> +        LockBuffer(fmbuffer, BUFFER_LOCK_UNLOCK);
>
> Doesn't this create a race condition?
>
>
> Are you sure the bit in finish_heap_swap() is safe? If so, we should add
> the same for the visibility map too (it had certainly better be all visible
> if it's frozen...)

We cannot ensure a page is all visible even if we execute VACUUM FULL, because dead tuples could remain, e.g. when another process inserts and updates the same tuple in the same transaction before VACUUM FULL.
I was thinking that the FrozenMap was free of the influence of delete operations. But as I said at the top of this mail, a bit of the FrozenMap needs to be cleared on deletion.
So I will remove the related code as you mentioned.

>
>
>
> +            /*
> +             * Current block is all-visible.
> +             * If frozen map represents that it's all frozen and this
> +             * function is called for freezing tuples, we can skip to
> +             * vacuum block.
> +             */
>
> I would state this as "Even if scan_all is true, we can skip blocks that are
> marked as frozen."
>
> +            if (frozenmap_test(onerel, blkno, &fmbuffer) && scan_all)
>
> I suspect it's faster to reverse those tests (scan_all &&
> frozenmap_test())... but why do we even need to look at scan_all? AFAICT if
> a block is frozen we can skip it unconditionally.

A tuple which is frozen but dead could remain in a page marked all frozen, in the current patch.
I.e., it is possible for a page to be marked frozen but not be all visible.
But I'm thinking of changing that.

>
>
> +            /*
> +             * If the un-frozen tuple is remaining in current page and
> +             * current page is marked as ALL_FROZEN, we should clear it.
> +             */
>
> That needs to NEVER happen. If it does then we're going to consider tuples
> as visible/frozen that shouldn't be. We should probably throw an error here,
> because it means the heap is now corrupted. At the minimum it needs to be an
> assert().

I understood. I'll fix it.

> Note that I haven't reviewed all the logic in detail at this point. If this
> ends up being refactored it'll be a lot easier to spot logic problems, so
> I'll hold off on that for now.

Understood; we need to get consensus first.

Regards,

-------
Sawada Masahiko

Re: Freeze avoidance of very large table.

From
Andres Freund
Date:
On 2015-04-21 23:59:45 +0900, Sawada Masahiko wrote:
> A frozen page could still contain dead tuples for now, but I'm thinking
> of changing that so the frozen bit guarantees that the page is all frozen
> *and* all visible.

It shouldn't. That'd potentially cause corruption after a wraparound. A
tuple's visibility might change due to that.

Greetings,

Andres Freund



Re: Freeze avoidance of very large table.

From
Sawada Masahiko
Date:
On Wed, Apr 22, 2015 at 12:02 AM, Andres Freund <andres@anarazel.de> wrote:
> On 2015-04-21 23:59:45 +0900, Sawada Masahiko wrote:
>> A frozen page could still contain dead tuples for now, but I'm thinking
>> of changing that so the frozen bit guarantees that the page is all frozen
>> *and* all visible.
>
> It shouldn't. That'd potentially cause corruption after a wraparound. A
> tuple's visibility might change due to that.

A page marked frozen could have some dead tuples, right?
I think we should clear a bit of the FrozenMap (and the flag in the page
header) on delete operations, and a bit should be set only by vacuum.
So, accordingly, a page marked frozen would guarantee all frozen and all
visible?

Regards,

-------
Sawada Masahiko



Re: Freeze avoidance of very large table.

From
Andres Freund
Date:
On 2015-04-22 00:15:53 +0900, Sawada Masahiko wrote:
> On Wed, Apr 22, 2015 at 12:02 AM, Andres Freund <andres@anarazel.de> wrote:
> > On 2015-04-21 23:59:45 +0900, Sawada Masahiko wrote:
> >> A frozen page could still contain dead tuples for now, but I'm thinking
> >> of changing that so the frozen bit guarantees that the page is all frozen
> >> *and* all visible.
> >
> > It shouldn't. That'd potentially cause corruption after a wraparound. A
> > tuple's visibility might change due to that.
> 
> A page marked frozen could have some dead tuples, right?

Well, right now we don't really freeze pages, but tuples. But in what
you described above, that could happen.

> I think we should clear a bit of the FrozenMap (and the flag in the page
> header) on delete operations, and a bit should be set only by vacuum.

Yes.

> So, accordingly, a page marked frozen would guarantee all frozen and all
> visible?

I think that's how it has to be, yes.

I *do* wonder if we shouldn't redefine the VM to also contain
information about the frozenness. Having two identically structured maps
that'll often both have to be touched at the same time isn't
nice. Neither is adding another fork.  Given the size of the files
pg_upgrade could be made to rewrite them.  The bigger question is
probably how bad that'd be for index-only efficiency.
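
Very roughly, and purely as a sketch with made-up names: two bits per
heap page instead of one, so one map byte covers 4 heap pages instead
of 8, doubling the map's size.

#define BITS_PER_HEAPBLOCK   2
#define HEAPBLOCKS_PER_BYTE  (BITS_PER_BYTE / BITS_PER_HEAPBLOCK)   /* 4 */

#define VM_ALL_VISIBLE       0x01
#define VM_ALL_FROZEN        0x02

#define HEAPBLK_TO_MAPBIT(x) (((x) % HEAPBLOCKS_PER_BYTE) * BITS_PER_HEAPBLOCK)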

Greetings,

Andres Freund



Re: Freeze avoidance of very large table.

From
Robert Haas
Date:
On Mon, Apr 20, 2015 at 7:59 PM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote:
> http://www.postgresql.org/message-id/CA+TgmoaEmnoLZmVbb8gvY69NA8zw9BWpiZ9+TLz-LnaBOZi7JA@mail.gmail.com
> has a WIP patch that goes the route of using a tuple flag to indicate
> frozen, but also raises a lot of concerns about visibility, because it means
> we'd stop using FrozenXID. That impacts a large amount of code. There were
> some followup patches as well as a bunch of discussion of how to make it
> visible that a tuple was frozen or not. That thread died in January 2014.

Actually, this change has already been made, so it's not so much of a
to-do as a was-done.  See commit
37484ad2aacef5ec794f4dd3d5cf814475180a78.  The immediate thing we got
out of that change is that when CLUSTER or VACUUM FULL rewrite a
table, they now freeze all of the tuples using this method.  See
commits 3cff1879f8d03cb729368722ca823a4bf74c0cac and
af2543e884db06c0beb75010218cd88680203b86.  Previously, CLUSTER or
VACUUM FULL would not freeze anything, which meant that people who
tried to use VACUUM FULL to recover from XID wraparound problems got
nowhere, and even people who knew when to use which tool could end up
having to VACUUM FULL and then VACUUM FREEZE afterward, rewriting the
table twice, an annoyance.

It's possible that we could use this infrastructure to freeze more
aggressively in other circumstances.  For example, perhaps VACUUM
should freeze any page it intends to mark all-visible.  That's not a
guaranteed win, because it might increase WAL volume: setting a page
all-visible does not emit an FPI for that page, but freezing any tuple
on it would, if the page hasn't otherwise been modified since the last
checkpoint.  Even if that were no issue, the freezing itself must be
WAL-logged.  But if we could somehow get to a place where all-visible
=> frozen, then autovacuum would never need to visit all-visible
pages, a huge win.

We could also attack the problem from the other end.  Instead of
trying to set the bits on the individual tuples, we could decide that
whenever a page is marked all-visible, we regard it as frozen
regardless of the bits set or not set on the individual tuples.
Anybody who wants to modify the page must freeze any unfrozen tuples
"for real" before clearing the visibility map bit.  This would have
the same end result as the previous idea: all-visible would
essentially imply frozen, and autovacuum could ignore those pages
categorically.
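
In rough pseudo-C, the rule on the write path would be something like
this; the freeze helper is invented for illustration:

/* Before dirtying an all-visible page, freeze it "for real". */
if (PageIsAllVisible(page))
{
    heap_freeze_all_tuples(relation, buffer);   /* hypothetical helper */
    PageClearAllVisible(page);
    visibilitymap_clear(relation, blkno, vmbuffer);
}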

I'm not saying those ideas don't have problems, because they do.  But
I think they are worth further exploring.  The main reason I gave up
on that is because Heikki was working on the XID-to-LSN mapping stuff.
That seemed like a better approach than either of the above, so as
long as Heikki was working on that, there wasn't much reason to pursue
more lowbrow approaches.  Clearly, though, we need to do something
about this.  Freezing is a big problem for lots of users.

All that having been said, I don't think adding a new fork is a good
approach.  We already have problems pretty commonly where our
customers complain about running out of inodes.  Adding another fork
for every table would exacerbate that problem considerably.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Freeze avoidance of very large table.

From
Andres Freund
Date:
On 2015-04-21 16:21:47 -0400, Robert Haas wrote:
> All that having been said, I don't think adding a new fork is a good
> approach.  We already have problems pretty commonly where our
> customers complain about running out of inodes.  Adding another fork
> for every table would exacerbate that problem considerably.

Really? These days? There's good arguments against another fork
(increased number of fsyncs, more stat calls, increased number of file
handles, more WAL logging, ...), but the number of inodes themselves
seems like something halfway recent filesystems should handle.

Greetings,

Andres Freund



Re: Freeze avoidance of very large table.

From
Robert Haas
Date:
On Tue, Apr 21, 2015 at 4:27 PM, Andres Freund <andres@anarazel.de> wrote:
> On 2015-04-21 16:21:47 -0400, Robert Haas wrote:
>> All that having been said, I don't think adding a new fork is a good
>> approach.  We already have problems pretty commonly where our
>> customers complain about running out of inodes.  Adding another fork
>> for every table would exacerbate that problem considerably.
>
> Really? These days? There's good arguments against another fork
> (increased number of fsyncs, more stat calls, increased number of file
> handles, more WAL logging, ...), but the number of inodes themselves
> seems like something halfway recent filesystems should handle.

Not making it up...

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Freeze avoidance of very large table.

From
Jim Nasby
Date:
On 4/21/15 3:21 PM, Robert Haas wrote:
> It's possible that we could use this infrastructure to freeze more
> aggressively in other circumstances.  For example, perhaps VACUUM
> should freeze any page it intends to mark all-visible.  That's not a
> guaranteed win, because it might increase WAL volume: setting a page
> all-visible does not emit an FPI for that page, but freezing any tuple
> on it would, if the page hasn't otherwise been modified since the last
> checkpoint.  Even if that were no issue, the freezing itself must be
> WAL-logged.  But if we could somehow get to a place where all-visible
> => frozen, then autovacuum would never need to visit all-visible
> pages, a huge win.

I don't know how bad the extra WAL traffic would be; we'd obviously need 
to incur it eventually, so it's a question of how common it is for a 
page to go all-visible but then go not-all-visible again before 
freezing. It would presumably be far more traffic than some form of a 
FrozenMap though...

> We could also attack the problem from the other end.  Instead of
> trying to set the bits on the individual tuples, we could decide that
> whenever a page is marked all-visible, we regard it as frozen
> regardless of the bits set or not set on the individual tuples.
> Anybody who wants to modify the page must freeze any unfrozen tuples
> "for real" before clearing the visibility map bit.  This would have
> the same end result as the previous idea: all-visible would
> essentially imply frozen, and autovacuum could ignore those pages
> categorically.

Pushing what's currently background work onto foreground processes 
doesn't seem like a good idea...

> I'm not saying those ideas don't have problems, because they do.  But
> I think they are worth further exploring.  The main reason I gave up
> on that is because Heikki was working on the XID-to-LSN mapping stuff.
> That seemed like a better approach than either of the above, so as
> long as Heikki was working on that, there wasn't much reason to pursue
> more lowbrow approaches.  Clearly, though, we need to do something
> about this.  Freezing is a big problem for lots of users.

Did XID-LSN die? I see at the bottom of the thread it was returned with 
feedback; I guess Heikki just hasn't had time and there's no major 
blockers? From what I remember this is probably a better solution, but 
if it's not going to make it into 9.6 then we should probably at least 
look further into a FM.

> All that having been said, I don't think adding a new fork is a good
> approach.  We already have problems pretty commonly where our
> customers complain about running out of inodes.  Adding another fork
> for every table would exacerbate that problem considerably.

Andres' idea of adding this to the VM may work well to handle that. It 
would double the size of the VM, but it would still be a ratio of 
32,000-1 compared to heap size, or 2MB for a 64GB table.
-- 
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com



Re: Freeze avoidance of very large table.

From
Robert Haas
Date:
On Tue, Apr 21, 2015 at 7:24 PM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote:
> On 4/21/15 3:21 PM, Robert Haas wrote:
>> It's possible that we could use this infrastructure to freeze more
>> aggressively in other circumstances.  For example, perhaps VACUUM
>> should freeze any page it intends to mark all-visible.  That's not a
>> guaranteed win, because it might increase WAL volume: setting a page
>> all-visible does not emit an FPI for that page, but freezing any tuple
>> on it would, if the page hasn't otherwise been modified since the last
>> checkpoint.  Even if that were no issue, the freezing itself must be
>> WAL-logged.  But if we could somehow get to a place where all-visible
>> => frozen, then autovacuum would never need to visit all-visible
>> pages, a huge win.
>
> I don't know how bad the extra WAL traffic would be; we'd obviously need to
> incur it eventually, so it's a question of how common it is for a page to go
> all-visible but then go not-all-visible again before freezing. It would
> presumably be far more traffic than some form of a FrozenMap though...

Yeah, maybe.  The freeze record contains details for each TID, while
the freeze map bit would only need to be set once for the whole page.
I wonder if the format of that record could be optimized somehow.

>> We could also attack the problem from the other end.  Instead of
>> trying to set the bits on the individual tuples, we could decide that
>> whenever a page is marked all-visible, we regard it as frozen
>> regardless of the bits set or not set on the individual tuples.
>> Anybody who wants to modify the page must freeze any unfrozen tuples
>> "for real" before clearing the visibility map bit.  This would have
>> the same end result as the previous idea: all-visible would
>> essentially imply frozen, and autovacuum could ignore those pages
>> categorically.
>
> Pushing what's currently background work onto foreground processes doesn't
> seem like a good idea...

When you phrase it that way, no, but pushing work that otherwise would
need to be done right now off to a future time that may never arrive
sounds like a good idea.  Today, we freeze the page -- rewriting it --
and then keep scanning those all-frozen pages every X number of
transactions to make sure they are really all-frozen.  In this system,
we'd eliminate the repeated scanning and defer the freeze work until
the page actually gets modified again.  But that might never happen,
in which case we never have to do the work at all.

>> I'm not saying those ideas don't have problems, because they do.  But
>> I think they are worth further exploring.  The main reason I gave up
>> on that is because Heikki was working on the XID-to-LSN mapping stuff.
>> That seemed like a better approach than either of the above, so as
>> long as Heikki was working on that, there wasn't much reason to pursue
>> more lowbrow approaches.  Clearly, though, we need to do something
>> about this.  Freezing is a big problem for lots of users.
>
> Did XID-LSN die? I see at the bottom of the thread it was returned with
> feedback; I guess Heikki just hasn't had time and there's no major blockers?
> From what I remember this is probably a better solution, but if it's not
> going to make it into 9.6 then we should probably at least look further into
> a FM.

Heikki said he'd lost enthusiasm for it, but he wasn't too specific
about his reasons, IIRC.  I guess maybe just that it got complicated,
and he wasn't sure it was correct.

>> All that having been said, I don't think adding a new fork is a good
>> approach.  We already have problems pretty commonly where our
>> customers complain about running out of inodes.  Adding another fork
>> for every table would exacerbate that problem considerably.
>
> Andres' idea of adding this to the VM may work well to handle that. It would
> double the size of the VM, but it would still be a ratio of 32,000-1
> compared to heap size, or 2MB for a 64GB table.

Yes, that's got some potential.  It would mean pg_upgrade would have
to remove all existing visibility maps when upgrading to the new
version, or rewrite them into the new format.  But it otherwise seems
promising.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Freeze avoidance of very large table.

From
Kevin Grittner
Date:
Robert Haas <robertmhaas@gmail.com> wrote:

> It's possible that we could use this infrastructure to freeze
> more aggressively in other circumstances.  For example, perhaps
> VACUUM should freeze any page it intends to mark all-visible.
> That's not a guaranteed win, because it might increase WAL
> volume: setting a page all-visible does not emit an FPI for that
> page, but freezing any tuple on it would, if the page hasn't
> otherwise been modified since the last checkpoint.  Even if that
> were no issue, the freezing itself must be WAL-logged.  But if we
> could somehow get to a place where all-visible => frozen, then
> autovacuum would never need to visit all-visible pages, a huge
> win.

That would eliminate full-table scan vacuums, right?  It would do
that by adding incremental effort and WAL to the "normal"
autovacuum run to eliminate the full table scan and the associated
mass freeze WAL-logging?  It's hard to see how that would not be an
overall win.

> We could also attack the problem from the other end.  Instead of
> trying to set the bits on the individual tuples, we could decide
> that whenever a page is marked all-visible, we regard it as
> frozen regardless of the bits set or not set on the individual
> tuples.  Anybody who wants to modify the page must freeze any
> unfrozen tuples "for real" before clearing the visibility map
> bit.  This would have the same end result as the previous idea:
> all-visible would essentially imply frozen, and autovacuum could
> ignore those pages categorically.

Besides putting work into the foreground that could be done in the
background, that sounds more complicated.  Also, there is no
ability to "pace" the freeze load or use scheduled jobs to shift
the work to off-peak hours.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Freeze avoidance of very large table.

From
Robert Haas
Date:
On Wed, Apr 22, 2015 at 11:09 AM, Kevin Grittner <kgrittn@ymail.com> wrote:
> Robert Haas <robertmhaas@gmail.com> wrote:
>> It's possible that we could use this infrastructure to freeze
>> more aggressively in other circumstances.  For example, perhaps
>> VACUUM should freeze any page it intends to mark all-visible.
>> That's not a guaranteed win, because it might increase WAL
>> volume: setting a page all-visible does not emit an FPI for that
>> page, but freezing any tuple on it would, if the page hasn't
>> otherwise been modified since the last checkpoint.  Even if that
>> were no issue, the freezing itself must be WAL-logged.  But if we
>> could somehow get to a place where all-visible => frozen, then
>> autovacuum would never need to visit all-visible pages, a huge
>> win.
>
> That would eliminate full-table scan vacuums, right?  It would do
> that by adding incremental effort and WAL to the "normal"
> autovacuum run to eliminate the full table scan and the associated
> mass freeze WAL-logging?  It's hard to see how that would not be an
> overall win.

Yes and yes.

In terms of an overall win, this design loses when the tuples that
have been recently marked all-visible are going to get updated again
in the near future. In that case, the effort we spend to freeze them
is wasted.  I just tested "pgbench -i -s 40 -n" followed by "VACUUM"
or alternatively followed by "VACUUM FREEZE".  The VACUUM generated
4641kB of WAL.  The VACUUM FREEZE generated 515MB of WAL - that is,
113 times more.  So changing every VACUUM to act like VACUUM FREEZE
would be quite expensive.  We'll still come out ahead if those tuples
are going to stick around long enough that they would have eventually
gotten frozen anyway, but if they get deleted again the loss is pretty
significant.

Incidentally, the reason for the large difference is that when Heikki
created the visibility map, it wasn't necessary for the WAL records
that set the visibility map bits to bump the page LSN, because it was
just a hint anyway.  When I made the visibility-map crash-safe, I went
to some pains to preserve that property.  Therefore, a regular VACUUM
does not emit full page images for the heap pages - it does for the
visibility map pages themselves, but there aren't very many of those.
In this example, the relation itself was 512MB, so you can see that
adding freezing to the mix roughly doubles the I/O cost.  Either way
we have to write half a gig of dirty data pages, but in one case we
also have to write an additional half a gig of WAL.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Freeze avoidance of very large table.

From
Heikki Linnakangas
Date:
On 04/22/2015 05:33 PM, Robert Haas wrote:
> On Tue, Apr 21, 2015 at 7:24 PM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote:
>> On 4/21/15 3:21 PM, Robert Haas wrote:
>>> I'm not saying those ideas don't have problems, because they do.  But
>>> I think they are worth further exploring.  The main reason I gave up
>>> on that is because Heikki was working on the XID-to-LSN mapping stuff.
>>> That seemed like a better approach than either of the above, so as
>>> long as Heikki was working on that, there wasn't much reason to pursue
>>> more lowbrow approaches.  Clearly, though, we need to do something
>>> about this.  Freezing is a big problem for lots of users.
>>
>> Did XID-LSN die? I see at the bottom of the thread it was returned with
>> feedback; I guess Heikki just hasn't had time and there's no major blockers?
>>  From what I remember this is probably a better solution, but if it's not
>> going to make it into 9.6 then we should probably at least look further into
>> a FM.
>
> Heikki said he'd lost enthusiasm for it, but he wasn't too specific
> about his reasons, IIRC.  I guess maybe just that it got complicated,
> and he wasn't sure it was correct.

I'd like to continue working on that when I get around to it. Or even 
better if someone else continues it :-).

The thing that made me nervous about that approach is that it made the 
LSN of each page critical information. If you somehow zeroed out the 
LSN, you could no longer tell which pages are frozen and which are not. 
I'm sure it could be made to work - and I got it working to some degree 
anyway - but it's a bit scary. It's similar to the multixid changes in 
9.3: multixids also used to be data that you can just zap at restart, 
and when we changed the rules so that you lose data if you lose 
multixids, we got trouble. Now, LSNs are much simpler, and there 
wouldn't be anything like the multioffset/member SLRUs that you'd have 
to keep around forever or vacuum, but still..

I would feel safer if we added a completely new "epoch" counter to the 
page header, instead of reusing LSNs. But as we all know, changing the 
page format is a problem for in-place upgrade, and takes some space too.

- Heikki




Re: Freeze avoidance of very large table.

From
Kevin Grittner
Date:
Robert Haas <robertmhaas@gmail.com> wrote:

> I just tested "pgbench -i -s 40 -n" followed by "VACUUM" or
> alternatively followed by "VACUUM FREEZE".  The VACUUM generated
> 4641kB of WAL.  The VACUUM FREEZE generated 515MB of WAL - that
> is, 113 times more.

Essentially a bulk load.  OK, so if you bulk load data and then
vacuum it before updating 100% of it, this approach will generate a
lot more WAL than we currently do.  Of course, if you don't VACUUM
FREEZE after a bulk load and then are engaged in a fairly normal
OLTP workload with peak and off-peak cycles, you are currently
almost certain to hit a point during peak OLTP load where you begin
to sequentially scan all tables, rewriting them in place, with WAL
logging.  Incidentally, this tends to flush a lot of your "hot"
data out of cache, increasing disk reads.  The first time I hit
this "interesting" experience in production it was so devastating,
and generated so many user complaints, that I never again
considered a bulk load complete until I had run VACUUM FREEZE on it
-- although I was sometimes able to defer that to an off-peak
window of time.

In other words, for the production environments I managed, the only
value of that number is in demonstrating the importance of using
unlogged COPY followed by VACUUM FREEZE for bulk-loading and
capturing a fresh base backup upon completion.  A better way to use
pgbench to measure WAL size cost might be to initialize, VACUUM
FREEZE to set a "long term baseline", and do a reasonable length
run with crontab running VACUUM FREEZE periodically (including
after the run was complete) versus doing the same with plain VACUUM
(followed by a VACUUM FREEZE at the end?).  Comparing the total WAL
sizes generated following the initial load and VACUUM FREEZE would
give a more accurate picture of the impact on an OLTP load, I
think.

> We'll still come out ahead if those tuples are going to stick
> around long enough that they would have eventually gotten frozen
> anyway, but if they get deleted again the loss is pretty
> significant.

Perhaps my perception is biased by having worked in an environment
where the vast majority of tuples (both in terms of tuple count and
byte count) were never updated and were only eligible for deletion
after a period of years.  Our current approach is pretty bad in
such an environment, at least if you try to leave all vacuuming to
autovacuum.  I'll admit that we were able to work around the
problems by running VACUUM FREEZE every night for most databases.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Freeze avoidance of very large table.

From
Robert Haas
Date:
On Wed, Apr 22, 2015 at 12:39 PM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> The thing that made me nervous about that approach is that it made the LSN
> of each page critical information. If you somehow zeroed out the LSN, you
> could no longer tell which pages are frozen and which are not. I'm sure it
> could be made to work - and I got it working to some degree anyway - but
> it's a bit scary. It's similar to the multixid changes in 9.3: multixids
> also used to be data that you can just zap at restart, and when we changed
> the rules so that you lose data if you lose multixids, we got trouble. Now,
> LSNs are much simpler, and there wouldn't be anything like the
> multioffset/member SLRUs that you'd have to keep around forever or vacuum,
> but still..

LSNs are already pretty critical.  If they're in the future, you can't
flush those pages.  Ever.  And if they're wrong in either direction,
crash recovery is broken.  But it's still worth thinking about ways
that we could make this more robust.

I keep coming back to the idea of treating any page that is marked as
all-visible as frozen, and deferring freezing until the page is again
modified.  The big downside of this is that if the page is set as
all-visible and then immediately thereafter modified, it sucks to have
to freeze when the XIDs in the page are still present in CLOG.  But if
we could determine from the LSN that the XIDs in the page are new
enough to still be considered valid, then we could skip freezing in
those cases and only do it when the page is "old".  That way, if
somebody zeroed out the LSN (why, oh why?) the worst that would happen
is that we'd do some extra freezing when the page was next modified.
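
As a sketch, with the cutoff function invented for illustration:

/* Freeze-on-clear only when the page LSN says its XIDs may be old. */
if (PageGetLSN(page) < GetFreezeCutoffLSN())    /* hypothetical cutoff */
    heap_freeze_all_tuples(relation, buffer);   /* then clear the VM bit */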

> I would feel safer if we added a completely new "epoch" counter to the page
> header, instead of reusing LSNs. But as we all know, changing the page
> format is a problem for in-place upgrade, and takes some space too.

Yeah.  We have a serious need to reduce the size of our on-disk
format.  On a TPC-C-like workload Jan Wieck recently tested, our data
set was 34% larger than another database at the beginning of the test,
and 80% larger by the end of the test.  And we did twice the disk
writes.  See "The Elephants in the Room.pdf" at
https://sites.google.com/site/robertmhaas/presentations

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Freeze avoidance of very large table.

From
Robert Haas
Date:
On Wed, Apr 22, 2015 at 2:23 PM, Kevin Grittner <kgrittn@ymail.com> wrote:
> Robert Haas <robertmhaas@gmail.com> wrote:
>> I just tested "pgbench -i -s 40 -n" followed by "VACUUM" or
>> alternatively followed by "VACUUM FREEZE".  The VACUUM generated
>> 4641kB of WAL.  The VACUUM FREEZE generated 515MB of WAL - that
>> is, 113 times more.
>
> Essentially a bulk load.  OK, so if you bulk load data and then
> vacuum it before updating 100% of it, this approach will generate a
> lot more WAL than we currently do.  Of course, if you don't VACUUM
> FREEZE after a bulk load and then are engaged in a fairly normal
> OLTP workload with peak and off-peak cycles, you are currently
> almost certain to hit a point during peak OLTP load where you begin
> to sequentially scan all tables, rewriting them in place, with WAL
> logging.  Incidentally, this tends to flush a lot of your "hot"
> data out of cache, increasing disk reads.  The first time I hit
> this "interesting" experience in production it was so devastating,
> and generated so many user complaints, that I never again
> considered a bulk load complete until I had run VACUUM FREEZE on it
> -- although I was sometimes able to defer that to an off-peak
> window of time.
>
> In other words, for the production environments I managed, the only
> value of that number is in demonstrating the importance of using
> unlogged COPY followed by VACUUM FREEZE for bulk-loading and
> capturing a fresh base backup upon completion.  A better way to use
> pgbench to measure WAL size cost might be to initialize, VACUUM
> FREEZE to set a "long term baseline", and do a reasonable length
> run with crontab running VACUUM FREEZE periodically (including
> after the run was complete) versus doing the same with plain VACUUM
> (followed by a VACUUM FREEZE at the end?).  Comparing the total WAL
> sizes generated following the initial load and VACUUM FREEZE would
> give a more accurate picture of the impact on an OLTP load, I
> think.

Sure, that would be a better test.  But I'm pretty sure the impact
will still be fairly substantial.

>> We'll still come out ahead if those tuples are going to stick
>> around long enough that they would have eventually gotten frozen
>> anyway, but if they get deleted again the loss is pretty
>> significant.
>
> Perhaps my perception is biased by having worked in an environment
> where the vast majority of tuples (both in terms of tuple count and
> byte count) were never updated and were only eligible for deletion
> after a period of years.  Our current approach is pretty bad in
> such an environment, at least if you try to leave all vacuuming to
> autovacuum.  I'll admit that we were able to work around the
> problems by running VACUUM FREEZE every night for most databases.

Yeah.  And that breaks down when you have very big databases with a
high XID consumption rate, because the mostly-no-op VACUUM FREEZE runs
for longer than you can tolerate.  I'm not saying we don't need to fix
this problem; we clearly do.  I'm just saying that we've got to be
careful not to harm other scenarios in the process.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Freeze avoidance of very large table.

From
Jim Nasby
Date:
On 4/22/15 1:24 PM, Robert Haas wrote:
> I keep coming back to the idea of treating any page that is marked as
> all-visible as frozen, and deferring freezing until the page is again
> modified.  The big downside of this is that if the page is set as
> all-visible and then immediately thereafter modified, it sucks to have
> to freeze when the XIDs in the page are still present in CLOG.  But if
> we could determine from the LSN that the XIDs in the page are new
> enough to still be considered valid, then we could skip freezing in
> those cases and only do it when the page is "old".  That way, if
> somebody zeroed out the LSN (why, oh why?) the worst that would happen
> is that we'd do some extra freezing when the page was next modified.

Maybe freezing a page as part of making it not all-visible wouldn't be 
that horrible, even without LSN.

For one, we already know that every tuple is visible, so no MVCC checks 
needed. That's probably a significant savings over current freezing.

If we're marking a page as no longer all-visible, that means we're 
already dirtying it and generating WAL for it (likely including a FPI). 
We may be able to consolidate all of this into a new WAL record that's a 
lot more efficient than what we currently do for freezing. I suspect we 
wouldn't need to log each TID we're freezing, for starters. Even if we 
did though, we could at least combine all that into one WAL message that 
just contains an array of TIDs or LPs.
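
As a rough illustration of that consolidation, the record could carry just a cutoff and an offset array (this layout is an assumption for discussion, not an existing WAL record; the block reference would be handled by the usual WAL machinery):

typedef struct xl_heap_freeze_compact
{
    TransactionId cutoff_xid;   /* freeze cutoff used, checked at redo */
    uint16        ntuples;      /* number of line pointers that follow */
    /* OffsetNumber offsets[FLEXIBLE_ARRAY_MEMBER]; */
} xl_heap_freeze_compact;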

<ponders...> I think we could actually proactively freeze tuples during 
vacuum too, at least if we're about to mark the page as all-visible. 
Though, with Robert's HEAP_XMIN_FROZEN change we could be a lot more 
aggressive about freezing during VACUUM, certainly for pages we're 
already dirtying, especially if we can keep the WAL cost of that down.
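
For reference, that change reuses two existing hint bits so a frozen xmin can be tested without consulting CLOG (paraphrased from htup_details.h; check the header for the authoritative definition):

#define HEAP_XMIN_FROZEN   (HEAP_XMIN_COMMITTED | HEAP_XMIN_INVALID)

#define HeapTupleHeaderXminFrozen(tup) \
    (((tup)->t_infomask & HEAP_XMIN_FROZEN) == HEAP_XMIN_FROZEN)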
-- 
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com



Re: Freeze avoidance of very large table.

From
Bruce Momjian
Date:
On Tue, Apr 21, 2015 at 08:39:37AM +0200, Andres Freund wrote:
> On 2015-04-20 17:13:29 -0400, Bruce Momjian wrote:
> > Didn't you think any of the TODO threads had workable solutions?  And
> > don't expect adding an additional file per relation will be zero cost
> > --- added over the lifetime of 200M transactions, I question if this
> > approach would be a win.
> 
> Note that normally you'd not run with a 200M transaction freeze max age
> on a busy server. Rather around a magnitude more.
> 
> Think about this being used on a time-partitioned table. Right now all
> the partitions have to be fully rescanned on a regular basis - quite
> painful. With something like this normally only the newest partitions
> will have to be.

My point is that for the life of 200M transactions, you would have the
overhead of an additional file per table in the file system, and updates
of that.  I just don't know if the overhead over the long time period
would be smaller than the VACUUM FREEZE.  It might be fine --- I don't
know.  People seem to focus on the big activities, while many small
activities can lead to larger slowdowns.

--
Bruce Momjian  <bruce@momjian.us>        http://momjian.us
EnterpriseDB                             http://enterprisedb.com
+ Everyone has their own god. +



Re: Freeze avoidance of very large table.

From
Jim Nasby
Date:
On 4/22/15 6:12 PM, Bruce Momjian wrote:
> My point is that for the life of 200M transactions, you would have the
> overhead of an additional file per table in the file system, and updates
> of that.  I just don't know if the overhead over the long time period
> would be smaller than the VACUUM FREEZE.  It might be fine --- I don't
> know.  People seem to focus on the big activities, while many small
> activities can lead to larger slowdowns.

Ahh. This wouldn't be for the life of 200M transactions; it would be a 
permanent fork, just like the VM is.
-- 
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com



Re: Freeze avoidance of very large table.

From
Bruce Momjian
Date:
On Wed, Apr 22, 2015 at 06:36:23PM -0500, Jim Nasby wrote:
> On 4/22/15 6:12 PM, Bruce Momjian wrote:
> >My point is that for the life of 200M transactions, you would have the
> >overhead of an additional file per table in the file system, and updates
> >of that.  I just don't know if the overhead over the long time period
> >would be smaller than the VACUUM FREEZE.  It might be fine --- I don't
> >know.  People seem to focus on the big activities, while many small
> >activities can lead to larger slowdowns.
> 
> Ahh. This wouldn't be for the life of 200M transactions; it would be
> a permanent fork, just like the VM is.

Right.  My point is that either you do X 2M times to maintain that fork
and the overhead of the file existence, or you do one VACUUM FREEZE.  I
am saying that 2M is a large number and adding all those X's might
exceed the cost of a VACUUM FREEZE.

--
Bruce Momjian  <bruce@momjian.us>        http://momjian.us
EnterpriseDB                             http://enterprisedb.com
+ Everyone has their own god. +



Re: Freeze avoidance of very large table.

From
Sawada Masahiko
Date:
On Thu, Apr 23, 2015 at 3:24 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Wed, Apr 22, 2015 at 12:39 PM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>> The thing that made me nervous about that approach is that it made the LSN
>> of each page critical information. If you somehow zeroed out the LSN, you
>> could no longer tell which pages are frozen and which are not. I'm sure it
>> could be made to work - and I got it working to some degree anyway - but
>> it's a bit scary. It's similar to the multixid changes in 9.3: multixids
>> also used to be data that you can just zap at restart, and when we changed
>> the rules so that you lose data if you lose multixids, we got trouble. Now,
>> LSNs are much simpler, and there wouldn't be anything like the
>> multioffset/member SLRUs that you'd have to keep around forever or vacuum,
>> but still..
>
> LSNs are already pretty critical.  If they're in the future, you can't
> flush those pages.  Ever.  And if they're wrong in either direction,
> crash recovery is broken.  But it's still worth thinking about ways
> that we could make this more robust.
>
> I keep coming back to the idea of treating any page that is marked as
> all-visible as frozen, and deferring freezing until the page is again
> modified.  The big downside of this is that if the page is set as
> all-visible and then immediately thereafter modified, it sucks to have
> to freeze when the XIDs in the page are still present in CLOG.  But if
> we could determine from the LSN that the XIDs in the page are new
> enough to still be considered valid, then we could skip freezing in
> those cases and only do it when the page is "old".  That way, if
> somebody zeroed out the LSN (why, oh why?) the worst that would happen
> is that we'd do some extra freezing when the page was next modified.

With your idea, tuples in a WORM (write-once read-many) table would
never be frozen at all unless we ran VACUUM FREEZE. Also, in this
situation, from the second run onward VACUUM FREEZE would only need to
scan the pages added since the last freeze, so we could reduce I/O, but
we would still need to do explicit freezing for anti-wraparound as in
the past. A WORM table holds huge data in general, and that data would
increase rapidly, so it would also be expensive.

>
>> I would feel safer if we added a completely new "epoch" counter to the page
>> header, instead of reusing LSNs. But as we all know, changing the page
>> format is a problem for in-place upgrade, and takes some space too.
>
> Yeah.  We have a serious need to reduce the size of our on-disk
> format.  On a TPC-C-like workload Jan Wieck recently tested, our data
> set was 34% larger than another database at the beginning of the test,
> and 80% larger by the end of the test.  And we did twice the disk
> writes.  See "The Elephants in the Room.pdf" at
> https://sites.google.com/site/robertmhaas/presentations
>

Regards,

-------
Sawada Masahiko



Re: Freeze avoidance of very large table.

From
Heikki Linnakangas
Date:
On 04/22/2015 09:24 PM, Robert Haas wrote:
>> I would feel safer if we added a completely new "epoch" counter to the page
>> >header, instead of reusing LSNs. But as we all know, changing the page
>> >format is a problem for in-place upgrade, and takes some space too.
> Yeah.  We have a serious need to reduce the size of our on-disk
> format.  On a TPC-C-like workload Jan Wieck recently tested, our data
> set was 34% larger than another database at the beginning of the test,
> and 80% larger by the end of the test.  And we did twice the disk
> writes.  See "The Elephants in the Room.pdf" at
> https://sites.google.com/site/robertmhaas/presentations

Meh. Adding an 8-byte header to every 8k block would add 0.1% to the 
disk size. No doubt it would be nice to reduce our disk footprint, but 
the page header is not the elephant in the room.

- Heikki





Re: Freeze avoidance of very large table.

From
Simon Riggs
Date:
On 21 April 2015 at 22:21, Robert Haas <robertmhaas@gmail.com> wrote:
 
I'm not saying those ideas don't have problems, because they do.  But
I think they are worth further exploring.  The main reason I gave up
on that is because Heikki was working on the XID-to-LSN mapping stuff.
That seemed like a better approach than either of the above, so as
long as Heikki was working on that, there wasn't much reason to pursue
more lowbrow approaches.  Clearly, though, we need to do something
about this.  Freezing is a big problem for lots of users.

All that having been said, I don't think adding a new fork is a good
approach.  We already have problems pretty commonly where our
customers complain about running out of inodes.  Adding another fork
for every table would exacerbate that problem considerably.

We were talking about having an incremental backup map also, which sounds a lot like the freeze map.

XID-to-LSN sounded cool but was complex. If we need the map for backup purposes, we may as well do it the simple way and hit both birds at once.

We only need a freeze/backup map for larger relations. So if we map 1000 blocks per map page, we skip having a map at all when size < 1000.

--
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Freeze avoidance of very large table.

From
Robert Haas
Date:
On Thu, Apr 23, 2015 at 4:19 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> We were talking about having an incremental backup map also. Which sounds a
> lot like the freeze map.

Yeah, possibly.  I think we should try to set things up so that the
backup map can be updated asynchronously by a background worker, so
that we're not adding more work to the foreground path just for the
benefit of maintenance operations.  That might make the logic for
autovacuum to use it a little bit more complex, but it seems
manageable.

> We only need a freeze/backup map for larger relations. So if we map 1000
> blocks per map page, we skip having a map at all when size < 1000.

Agreed.  We might also want to map multiple blocks per map slot - e.g.
one slot per 32 blocks.  That would keep the map quite small even for
very large relations, and would not compromise efficiency that much
since reading 256kB sequentially probably takes only a little longer
than reading 8kB.
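
A sketch of the arithmetic, assuming one bit per slot and 32 heap
blocks per slot (the names are made up for illustration):

#define MAP_BLOCKS_PER_SLOT   32
#define MAP_SLOTS_PER_BYTE    8

#define HEAPBLK_TO_SLOT(blk)    ((blk) / MAP_BLOCKS_PER_SLOT)
#define SLOT_TO_MAPBYTE(slot)   ((slot) / MAP_SLOTS_PER_BYTE)
#define SLOT_TO_MAPBIT(slot)    ((slot) % MAP_SLOTS_PER_BYTE)

At that granularity a 1TB relation (about 134 million 8kB blocks) needs
only around 512kB of map.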

I think the idea of integrating the freeze map into the VM fork is
also worth considering.  Then, the incremental backup map could be
optional; if you don't want incremental backup, you can shut it off
and have less overhead.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Freeze avoidance of very large table.

From
Robert Haas
Date:
On Wed, Apr 22, 2015 at 8:55 PM, Bruce Momjian <bruce@momjian.us> wrote:
> On Wed, Apr 22, 2015 at 06:36:23PM -0500, Jim Nasby wrote:
>> On 4/22/15 6:12 PM, Bruce Momjian wrote:
>> >My point is that for the life of 200M transactions, you would have the
>> >overhead of an additional file per table in the file system, and updates
>> >of that.  I just don't know if the overhead over the long time period
>> >would be smaller than the VACUUM FREEZE.  It might be fine --- I don't
>> >know.  People seem to focus on the big activities, while many small
>> >activities can lead to larger slowdowns.
>>
>> Ahh. This wouldn't be for the life of 200M transactions; it would be
>> a permanent fork, just like the VM is.
>
> Right.  My point is that either you do X 2M times to maintain that fork
> and the overhead of the file existence, or you do one VACUUM FREEZE.  I
> am saying that 2M is a large number and adding all those X's might
> exceed the cost of a VACUUM FREEZE.

I agree, but if we instead make this part of the visibility map
instead of a separate fork, the cost is much less.  It won't be any
more expensive to clear 2 consecutive bits any time a page is touched
than it is to clear 1.  The VM fork will be twice as large, but still
tiny.  And the fact that you'll have only half as many pages mapping
to the same VM page may even improve performance in some cases by
reducing contention.  Even when it reduces performance, I think the
impact will be so tiny as not to be worth caring about.
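
For illustration, the byte/bit arithmetic and the combined clear might
look like this (the flag values are assumptions, not a committed
layout):

#define BITS_PER_HEAPBLOCK    2
#define HEAPBLOCKS_PER_BYTE   4

#define VM_ALL_VISIBLE        0x01    /* assumed in-slot flag values */
#define VM_ALL_FROZEN         0x02

static inline void
vm_clear_both(uint8 *map, BlockNumber heapBlk)
{
    int     mapByte = heapBlk / HEAPBLOCKS_PER_BYTE;
    int     mapBit  = BITS_PER_HEAPBLOCK * (heapBlk % HEAPBLOCKS_PER_BYTE);

    /* a single byte write clears both bits, same cost as clearing one */
    map[mapByte] &= ~((VM_ALL_VISIBLE | VM_ALL_FROZEN) << mapBit);
}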

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Freeze avoidance of very large table.

From
Jim Nasby
Date:
On 4/23/15 2:42 AM, Heikki Linnakangas wrote:
> On 04/22/2015 09:24 PM, Robert Haas wrote:
>> Yeah.  We have a serious need to reduce the size of our on-disk
>> format.  On a TPC-C-like workload Jan Wieck recently tested, our data
>> set was 34% larger than another database at the beginning of the test,
>> and 80% larger by the end of the test.  And we did twice the disk
>> writes.  See "The Elephants in the Room.pdf" at
>> https://sites.google.com/site/robertmhaas/presentations
>
> Meh. Adding an 8-byte header to every 8k block would add 0.1% to the
> disk size. No doubt it would be nice to reduce our disk footprint, but
> the page header is not the elephant in the room.

I've often wondered if there was some way we could consolidate XMIN/XMAX 
from multiple tuples at the page level; that could be a big win for OLAP 
environments where most of your tuples belong to a pretty small range of 
XIDs. In many workloads you could have 80%+ of the tuples in a table 
having a single inserting XID.

Dunno how much it would help for OLTP though... :/
-- 
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com



Re: Freeze avoidance of very large table.

From
Jim Nasby
Date:
On 4/23/15 8:42 AM, Robert Haas wrote:
> On Thu, Apr 23, 2015 at 4:19 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
>> We were talking about having an incremental backup map also. Which sounds a
>> lot like the freeze map.
>
> Yeah, possibly.  I think we should try to set things up so that the
> backup map can be updated asynchronously by a background worker, so
> that we're not adding more work to the foreground path just for the
> benefit of maintenance operations.  That might make the logic for
> autovacuum to use it a little bit more complex, but it seems
> manageable.

I'm not sure an actual map makes sense... for incremental backups you 
need some kind of stream that tells you not only what changed but when 
it changed. A simple freeze map won't work for that because the 
operation of freezing itself writes data (and the same can be true for 
VM). Though, if the backup utility was actually comparing live data to 
an actual backup maybe this would work...

>> We only need a freeze/backup map for larger relations. So if we map 1000
>> blocks per map page, we skip having a map at all when size < 1000.
>
> Agreed.  We might also want to map multiple blocks per map slot - e.g.
> one slot per 32 blocks.  That would keep the map quite small even for
> very large relations, and would not compromise efficiency that much
> since reading 256kB sequentially probably takes only a little longer
> than reading 8kB.

The problem with mapping a range of pages per bit is dealing with 
locking when you set the bit. Currently that's easy because we're 
holding the cleanup lock on the page, but you can't do that if you have 
a range of pages. Though, if each 'slot' wasn't a simple binary value we 
could have a 3rd state that indicates we're in the process of marking 
that slot as all visible/frozen, but you still need to consider the bit 
as cleared.
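
A sketch of that tri-state slot, assuming two bits per slot (the names
and values are made up for illustration):

typedef enum vm_range_state
{
    RANGE_NOT_ALL_VISIBLE = 0,  /* readers must treat as cleared */
    RANGE_BEING_SET       = 1,  /* marking in progress; still "cleared" */
    RANGE_ALL_VISIBLE     = 2   /* every page in the range qualifies */
} vm_range_state;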

Honestly though, I think concerns about the size of the map are a bit 
overblown. Even if we double its size, it's still 32,000 times smaller 
than the heap is with 8k pages. I suspect that if you have tables large 
enough to care about this, you'll also be using 32k pages, which means 
it'd be 128,000 times smaller than the heap. I have a hard time 
believing that's going to be even a faint blip on the performance radar.
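
(For concreteness: with two bits per 8k page the ratio is 8192 bytes *
8 bits/byte / 2 bits = 32768:1, so a 1TB heap needs a 32MB map; with
32k pages it is 32768 * 8 / 2 = 131072:1, matching the rounded figures
above.)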
-- 
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com



Re: Freeze avoidance of very large table.

From
Heikki Linnakangas
Date:
On 04/23/2015 05:52 PM, Jim Nasby wrote:
> On 4/23/15 2:42 AM, Heikki Linnakangas wrote:
>> On 04/22/2015 09:24 PM, Robert Haas wrote:
>>> Yeah.  We have a serious need to reduce the size of our on-disk
>>> format.  On a TPC-C-like workload Jan Wieck recently tested, our data
>>> set was 34% larger than another database at the beginning of the test,
>>> and 80% larger by the end of the test.  And we did twice the disk
>>> writes.  See "The Elephants in the Room.pdf" at
>>> https://sites.google.com/site/robertmhaas/presentations
>>
>> Meh. Adding an 8-byte header to every 8k block would add 0.1% to the
>> disk size. No doubt it would be nice to reduce our disk footprint, but
>> the page header is not the elephant in the room.
>
> I've often wondered if there was some way we could consolidate XMIN/XMAX
> from multiple tuples at the page level; that could be a big win for OLAP
> environments where most of your tuples belong to a pretty small range of
> XIDs. In many workloads you could have 80%+ of the tuples in a table
> having a single inserting XID.

It would be doable for xmin - IIRC someone even posted a patch for that 
years ago - but xmax (and ctid) is difficult. When a tuple is inserted, 
Xmax is basically just a reservation for the value that will be put 
there later. You have no idea what that value is, and you can't 
influence it, and when it's time to delete/update the row, you *must* 
have the space for that xmax. So we can't opportunistically use the 
space for anything else, or compress them or anything like that.

- Heikki




Re: Freeze avoidance of very large table.

From
Bruce Momjian
Date:
On Thu, Apr 23, 2015 at 10:42:59AM +0300, Heikki Linnakangas wrote:
> On 04/22/2015 09:24 PM, Robert Haas wrote:
> >>I would feel safer if we added a completely new "epoch" counter to the page
> >>>header, instead of reusing LSNs. But as we all know, changing the page
> >>>format is a problem for in-place upgrade, and takes some space too.
> >Yeah.  We have a serious need to reduce the size of our on-disk
> >format.  On a TPC-C-like workload Jan Wieck recently tested, our data
> >set was 34% larger than another database at the beginning of the test,
> >and 80% larger by the end of the test.  And we did twice the disk
> >writes.  See "The Elephants in the Room.pdf" at
> >https://sites.google.com/site/robertmhaas/presentations
> 
> Meh. Adding an 8-byte header to every 8k block would add 0.1% to the
> disk size. No doubt it would be nice to reduce our disk footprint,
> but the page header is not the elephant in the room.

Agreed.  Are you saying we can't find a way to fit an 8-byte value into
the existing page in a backward-compatible way?

--
Bruce Momjian  <bruce@momjian.us>        http://momjian.us
EnterpriseDB                             http://enterprisedb.com
+ Everyone has their own god. +



Re: Freeze avoidance of very large table.

From
Petr Jelinek
Date:
On 23/04/15 17:24, Heikki Linnakangas wrote:
> On 04/23/2015 05:52 PM, Jim Nasby wrote:
>> On 4/23/15 2:42 AM, Heikki Linnakangas wrote:
>>> On 04/22/2015 09:24 PM, Robert Haas wrote:
>>>> Yeah.  We have a serious need to reduce the size of our on-disk
>>>> format.  On a TPC-C-like workload Jan Wieck recently tested, our data
>>>> set was 34% larger than another database at the beginning of the test,
>>>> and 80% larger by the end of the test.  And we did twice the disk
>>>> writes.  See "The Elephants in the Room.pdf" at
>>>> https://sites.google.com/site/robertmhaas/presentations
>>>
>>> Meh. Adding an 8-byte header to every 8k block would add 0.1% to the
>>> disk size. No doubt it would be nice to reduce our disk footprint, but
>>> the page header is not the elephant in the room.
>>
>> I've often wondered if there was some way we could consolidate XMIN/XMAX
>> from multiple tuples at the page level; that could be a big win for OLAP
>> environments where most of your tuples belong to a pretty small range of
>> XIDs. In many workloads you could have 80%+ of the tuples in a table
>> having a single inserting XID.
>
> It would be doable for xmin - IIRC someone even posted a patch for that
> years ago - but xmax (and ctid) is difficult. When a tuple is inserted,
> Xmax is basically just a reservation for the value that will be put
> there later. You have no idea what that value is, and you can't
> influence it, and when it's time to delete/update the row, you *must*
> have the space for that xmax. So we can't opportunistically use the
> space for anything else, or compress them or anything like that.
>

That depends. If we are going to change the page format, we can move 
the xmax into some map of ctid->xmax in the header (with no entries for 
tuples that have no xmax), or have a bitmap there of tuples that have 
an xmax, etc. Basically, don't save xmax (and potentially other info) 
inline for each tuple, but keep such info in the header only for the 
tuples that need it. That might have bad performance side effects of 
course, but there are definitely some potential ways of doing things 
differently which we could explore.

--
Petr Jelinek                  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: Freeze avoidance of very large table.

From
Bruce Momjian
Date:
On Thu, Apr 23, 2015 at 06:24:00PM +0300, Heikki Linnakangas wrote:
> >I've often wondered if there was some way we could consolidate XMIN/XMAX
> >from multiple tuples at the page level; that could be a big win for OLAP
> >environments where most of your tuples belong to a pretty small range of
> >XIDs. In many workloads you could have 80%+ of the tuples in a table
> >having a single inserting XID.
> 
> It would be doable for xmin - IIRC someone even posted a patch for
> that years ago - but xmax (and ctid) is difficult. When a tuple is
> inserted, Xmax is basically just a reservation for the value that
> will be put there later. You have no idea what that value is, and
> you can't influence it, and when it's time to delete/update the row,
> you *must* have the space for that xmax. So we can't
> opportunistically use the space for anything else, or compress them
> or anything like that.

Also SELECT FOR UPDATE uses the per-row xmax too.

--
Bruce Momjian  <bruce@momjian.us>        http://momjian.us
EnterpriseDB                             http://enterprisedb.com
+ Everyone has their own god. +



Re: Freeze avoidance of very large table.

From
Bruce Momjian
Date:
On Thu, Apr 23, 2015 at 09:45:38AM -0400, Robert Haas wrote:
> > Right.  My point is that either you do X 2M times to maintain that fork
> > and the overhead of the file existence, or you do one VACUUM FREEZE.  I
> > am saying that 2M is a large number and adding all those X's might
> > exceed the cost of a VACUUM FREEZE.
> 
> I agree, but if we instead make this part of the visibility map
> instead of a separate fork, the cost is much less.  It won't be any
> more expensive to clear 2 consecutive bits any time a page is touched
> than it is to clear 1.  The VM fork will be twice as large, but still
> tiny.  And the fact that you'll have only half as many pages mapping
> to the same VM page may even improve performance in some cases by
> reducing contention.  Even when it reduces performance, I think the
> impact will be so tiny as not to be worth caring about.

Agreed, no extra file, and the same write volume as currently.  It would
also match pg_clog, which uses two bits per transaction --- maybe we can
reuse some of that code.

--
Bruce Momjian  <bruce@momjian.us>        http://momjian.us
EnterpriseDB                             http://enterprisedb.com
+ Everyone has their own god. +



Re: Freeze avoidance of very large table.

From
Heikki Linnakangas
Date:
On 04/23/2015 06:39 PM, Petr Jelinek wrote:
> On 23/04/15 17:24, Heikki Linnakangas wrote:
>> On 04/23/2015 05:52 PM, Jim Nasby wrote:
>>> I've often wondered if there was some way we could consolidate XMIN/XMAX
>>> from multiple tuples at the page level; that could be a big win for OLAP
>>> environments where most of your tuples belong to a pretty small range of
>>> XIDs. In many workloads you could have 80%+ of the tuples in a table
>>> having a single inserting XID.
>>
>> It would be doable for xmin - IIRC someone even posted a patch for that
>> years ago - but xmax (and ctid) is difficult. When a tuple is inserted,
>> Xmax is basically just a reservation for the value that will be put
>> there later. You have no idea what that value is, and you can't
>> influence it, and when it's time to delete/update the row, you *must*
>> have the space for that xmax. So we can't opportunistically use the
>> space for anything else, or compress them or anything like that.
>
> That depends, if we are going to change page format we can move the xmax
> to be some map of ctid->xmax in the header (with no values for tuples
> with no xmax)  ...

Stop right there. You need to reserve enough space on the page to store 
an xmax for *every* tuple on the page, because if you don't, what are 
you going to do when every tuple on the page is deleted by a different 
transaction?

Even if you store the xmax somewhere other than the page header, you 
need to reserve the same amount of space for them, so it doesn't help at all.

- Heikki




Re: Freeze avoidance of very large table.

From
Heikki Linnakangas
Date:
On 04/23/2015 06:38 PM, Bruce Momjian wrote:
> On Thu, Apr 23, 2015 at 10:42:59AM +0300, Heikki Linnakangas wrote:
>> On 04/22/2015 09:24 PM, Robert Haas wrote:
>>>> I would feel safer if we added a completely new "epoch" counter to the page
>>>>> header, instead of reusing LSNs. But as we all know, changing the page
>>>>> format is a problem for in-place upgrade, and takes some space too.
>>> Yeah.  We have a serious need to reduce the size of our on-disk
>>> format.  On a TPC-C-like workload Jan Wieck recently tested, our data
>>> set was 34% larger than another database at the beginning of the test,
>>> and 80% larger by the end of the test.  And we did twice the disk
>>> writes.  See "The Elephants in the Room.pdf" at
>>> https://sites.google.com/site/robertmhaas/presentations
>>
>> Meh. Adding an 8-byte header to every 8k block would add 0.1% to the
>> disk size. No doubt it would be nice to reduce our disk footprint,
>> but the page header is not the elephant in the room.
>
> Agreed.  Are you saying we can't find a way to fit an 8-byte value into
> the existing page in a backward-compatible way?

I'm sure we can find a way. We've discussed ways to handle page format 
updates in pg_upgrade before, and I don't want to get into that 
discussion here, but it's not trivial.

- Heikki




Re: Freeze avoidance of very large table.

From
Bruce Momjian
Date:
On Thu, Apr 23, 2015 at 06:52:20PM +0300, Heikki Linnakangas wrote:
> >Agreed.  Are you saying we can't find a way to fit an 8-byte value into
> >the existing page in a backward-compatible way?
> 
> I'm sure we can find a way. We've discussed ways to handle page
> format updates in pg_upgrade before, and I don't want to get into
> that discussion here, but it's not trivial.

OK, good to know, thanks.

--
Bruce Momjian  <bruce@momjian.us>        http://momjian.us
EnterpriseDB                             http://enterprisedb.com
+ Everyone has their own god. +



Re: Freeze avoidance of very large table.

From
Petr Jelinek
Date:
On 23/04/15 17:45, Bruce Momjian wrote:
> On Thu, Apr 23, 2015 at 09:45:38AM -0400, Robert Haas wrote:
>>> Right.  My point is that either you do X 2M times to maintain that fork
>>> and the overhead of the file existence, or you do one VACUUM FREEZE.  I
>>> am saying that 2M is a large number and adding all those X's might
>>> exceed the cost of a VACUUM FREEZE.
>>
>> I agree, but if we instead make this part of the visibility map
>> instead of a separate fork, the cost is much less.  It won't be any
>> more expensive to clear 2 consecutive bits any time a page is touched
>> than it is to clear 1.  The VM fork will be twice as large, but still
>> tiny.  And the fact that you'll have only half as many pages mapping
>> to the same VM page may even improve performance in some cases by
>> reducing contention.  Even when it reduces performance, I think the
>> impact will be so tiny as not to be worth caring about.
>
> Agreed, no extra file, and the same write volume as currently.  It would
> also match pg_clog, which uses two bits per transaction --- maybe we can
> reuse some of that code.
>

Yeah, this approach seems promising. We probably can't reuse code from 
clog because the usage pattern is different (key for clog is xid, while 
for visibility/freeze map ctid is used). But visibility map storage 
layer is pretty simple so it should be easy to extend it for this use.

--
Petr Jelinek                  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: Freeze avoidance of very large table.

From
Jim Nasby
Date:
On 4/23/15 11:06 AM, Petr Jelinek wrote:
> On 23/04/15 17:45, Bruce Momjian wrote:
>> On Thu, Apr 23, 2015 at 09:45:38AM -0400, Robert Haas wrote:
>> Agreed, no extra file, and the same write volume as currently.  It would
>> also match pg_clog, which uses two bits per transaction --- maybe we can
>> reuse some of that code.
>>
>
> Yeah, this approach seems promising. We probably can't reuse code from
> clog because the usage pattern is different (key for clog is xid, while
> for visibility/freeze map ctid is used). But visibility map storage
> layer is pretty simple so it should be easy to extend it for this use.

Actually, there may be some bit manipulation functions we could reuse; 
things like efficiently counting how many things in a byte are set. 
Probably doesn't make sense to fully refactor it, but at least CLOG is a 
good source for cut/paste/whack.
-- 
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com



Re: Freeze avoidance of very large table.

From
Michael Paquier
Date:
On Thu, Apr 23, 2015 at 10:42 PM, Robert Haas wrote:
> On Thu, Apr 23, 2015 at 4:19 AM, Simon Riggs  wrote:
>> We only need a freeze/backup map for larger relations. So if we map 1000
>> blocks per map page, we skip having a map at all when size < 1000.
>
> Agreed.  We might also want to map multiple blocks per map slot - e.g.
> one slot per 32 blocks.  That would keep the map quite small even for
> very large relations, and would not compromise efficiency that much
> since reading 256kB sequentially probably takes only a little longer
> than reading 8kB.
>
> I think the idea of integrating the freeze map into the VM fork is
> also worth considering.  Then, the incremental backup map could be
> optional; if you don't want incremental backup, you can shut it off
> and have less overhead.

When I read that, I think about something configurable at relation
level. There are cases where you may want more granularity of this
information at the block level, by having the VM slots track fewer
blocks than 32, and vice versa.
-- 
Michael



Re: Freeze avoidance of very large table.

From
Sawada Masahiko
Date:
On Fri, Apr 24, 2015 at 1:31 AM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote:
> On 4/23/15 11:06 AM, Petr Jelinek wrote:
>>
>> On 23/04/15 17:45, Bruce Momjian wrote:
>>>
>>> On Thu, Apr 23, 2015 at 09:45:38AM -0400, Robert Haas wrote:
>>> Agreed, no extra file, and the same write volume as currently.  It would
>>> also match pg_clog, which uses two bits per transaction --- maybe we can
>>> reuse some of that code.
>>>
>>
>> Yeah, this approach seems promising. We probably can't reuse code from
>> clog because the usage pattern is different (key for clog is xid, while
>> for visibility/freeze map ctid is used). But visibility map storage
>> layer is pretty simple so it should be easy to extend it for this use.
>
>
> Actually, there may be some bit manipulation functions we could reuse;
> things like efficiently counting how many things in a byte are set. Probably
> doesn't make sense to fully refactor it, but at least CLOG is a good source
> for cut/paste/whack.
>

I agree with adding a bit to the VM that indicates the corresponding
page is all-frozen, just like CLOG.
I'll change the patch accordingly as a second version.

Regards,

-------
Sawada Masahiko



Re: Freeze avoidance of very large table.

From
Robert Haas
Date:
On Thu, Apr 23, 2015 at 9:03 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:
> On Thu, Apr 23, 2015 at 10:42 PM, Robert Haas wrote:
>> On Thu, Apr 23, 2015 at 4:19 AM, Simon Riggs  wrote:
>>> We only need a freeze/backup map for larger relations. So if we map 1000
>>> blocks per map page, we skip having a map at all when size < 1000.
>>
>> Agreed.  We might also want to map multiple blocks per map slot - e.g.
>> one slot per 32 blocks.  That would keep the map quite small even for
>> very large relations, and would not compromise efficiency that much
>> since reading 256kB sequentially probably takes only a little longer
>> than reading 8kB.
>>
>> I think the idea of integrating the freeze map into the VM fork is
>> also worth considering.  Then, the incremental backup map could be
>> optional; if you don't want incremental backup, you can shut it off
>> and have less overhead.
>
> When I read that, I think about something configurable at relation
> level. There are cases where you may want more granularity of this
> information at the block level, by having the VM slots track fewer
> blocks than 32, and vice versa.

What are those cases?  To me that sounds like making things
complicated to no obvious benefit.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Freeze avoidance of very large table.

From
Jim Nasby
Date:
On 4/24/15 6:52 AM, Robert Haas wrote:
> On Thu, Apr 23, 2015 at 9:03 PM, Michael Paquier
> <michael.paquier@gmail.com> wrote:
>> On Thu, Apr 23, 2015 at 10:42 PM, Robert Haas wrote:
>>> On Thu, Apr 23, 2015 at 4:19 AM, Simon Riggs  wrote:
>>>> We only need a freeze/backup map for larger relations. So if we map 1000
>>>> blocks per map page, we skip having a map at all when size < 1000.
>>>
>>> Agreed.  We might also want to map multiple blocks per map slot - e.g.
>>> one slot per 32 blocks.  That would keep the map quite small even for
>>> very large relations, and would not compromise efficiency that much
>>> since reading 256kB sequentially probably takes only a little longer
>>> than reading 8kB.
>>>
>>> I think the idea of integrating the freeze map into the VM fork is
>>> also worth considering.  Then, the incremental backup map could be
>>> optional; if you don't want incremental backup, you can shut it off
>>> and have less overhead.
>>
>> When I read that, I think about something configurable at relation
>> level. There are cases where you may want more granularity of this
>> information at the block level, by having the VM slots track fewer
>> blocks than 32, and vice versa.
>
> What are those cases?  To me that sounds like making things
> complicated to no obvious benefit.

Tables that get few/no dead tuples, like bulk insert tables. You'll have 
large sections of blocks with the same visibility.

I suspect the added code to allow setting 1 bit for multiple pages 
without having to lock all those pages simultaneously will probably 
outweigh making this a reloption anyway.
-- 
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com



Re: Freeze avoidance of very large table.

From
Robert Haas
Date:
On Fri, Apr 24, 2015 at 4:09 PM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote:
>>> When I read that, I think about something configurable at relation
>>> level. There are cases where you may want more granularity of this
>>> information at the block level, by having the VM slots track fewer
>>> blocks than 32, and vice versa.
>>
>> What are those cases?  To me that sounds like making things
>> complicated to no obvious benefit.
>
> Tables that get few/no dead tuples, like bulk insert tables. You'll have
> large sections of blocks with the same visibility.

I don't see any reason why that would require different granularity.

> I suspect the added code to allow setting 1 bit for multiple pages without
> having to lock all those pages simultaneously will probably outweigh making
> this a reloption anyway.

That's a completely unrelated issue.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Freeze avoidance of very large table.

From
Jim Nasby
Date:
On 4/28/15 7:11 AM, Robert Haas wrote:
> On Fri, Apr 24, 2015 at 4:09 PM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote:
>>>> When I read that, I think about something configurable at relation
>>>> level. There are cases where you may want more granularity of this
>>>> information at the block level, by having the VM slots track fewer
>>>> blocks than 32, and vice versa.
>>>
>>> What are those cases?  To me that sounds like making things
>>> complicated to no obvious benefit.
>>
>> Tables that get few/no dead tuples, like bulk insert tables. You'll have
>> large sections of blocks with the same visibility.
>
> I don't see any reason why that would require different granularity.

Because in those cases it would be trivial to drop XMIN out of the tuple 
headers. For a warehouse with narrow rows that could be a significant 
win. Moreover, we could also move XMAX to the page level if we accept 
that if we need to invalidate any tuple we'd have to move all of them. 
In a warehouse situation that's probably OK as well.

That said, I don't think this is the first place to focus for reducing 
our on-disk format; reducing cleanup bloat would probably be a lot more 
useful.

Did you or Jan have more detailed info from the test he ran about where 
our 80% overhead was ending up? That would remove a lot of speculation 
here...
-- 
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com



Re: Freeze avoidance of very large table.

From
Robert Haas
Date:
On Tue, Apr 28, 2015 at 1:53 PM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote:
> Because in those cases it would be trivial to drop XMIN out of the tuple
> headers. For a warehouse with narrow rows that could be a significant win.
> Moreso, we could also move XMAX to the page level if we accept that if we
> need to invalidate any tuple we'd have to move all of them. In a warehouse
> situation that's probably OK as well.

You have a funny definition of "trivial".  If you start looking
through the code you'll see that anything that changes the format of
the tuple header is a very large undertaking.  And the bit about "if
we invalidate any tuple we'd need to move all of them" doesn't really
make any sense; we have no infrastructure that would allow us "move"
tuples like that.  A lot of people would like it if we did, but we
don't.

> That said, I don't think this is the first place to focus for reducing our
> on-disk format; reducing cleanup bloat would probably be a lot more useful.

Sure; changing the on-disk format is a different project from tracking
the frozen parts of a table, which is what this thread started out
being about, and nothing you've said since then seems to add or
detract from that.  I still think the best way to do it is to make the
VM carry two bits per page instead of one.

> Did you or Jan have more detailed info from the test he ran about where our
> 80% overhead was ending up? That would remove a lot of speculation here...

We have more detailed information on that, but (1) that's not a very
specific question and (2) it has nothing to do with freeze avoidance,
so I'm not sure why you are asking on this thread.  Let's try not to
get sidetracked from the well-defined proposal that just needs to be
implemented to speculation about major changes in completely unrelated
areas.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Freeze avoidance of very large table.

From
Sawada Masahiko
Date:
On Fri, Apr 24, 2015 at 11:21 AM, Sawada Masahiko <sawada.mshk@gmail.com> wrote:
> On Fri, Apr 24, 2015 at 1:31 AM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote:
>> On 4/23/15 11:06 AM, Petr Jelinek wrote:
>>>
>>> On 23/04/15 17:45, Bruce Momjian wrote:
>>>>
>>>> On Thu, Apr 23, 2015 at 09:45:38AM -0400, Robert Haas wrote:
>>>> Agreed, no extra file, and the same write volume as currently.  It would
>>>> also match pg_clog, which uses two bits per transaction --- maybe we can
>>>> reuse some of that code.
>>>>
>>>
>>> Yeah, this approach seems promising. We probably can't reuse code from
>>> clog because the usage pattern is different (key for clog is xid, while
>>> for visibility/freeze map ctid is used). But visibility map storage
>>> layer is pretty simple so it should be easy to extend it for this use.
>>
>>
>> Actually, there may be some bit manipulation functions we could reuse;
>> things like efficiently counting how many things in a byte are set. Probably
>> doesn't make sense to fully refactor it, but at least CLOG is a good source
>> for cut/paste/whack.
>>
>
> I agree with adding a bit that indicates corresponding page is
> all-frozen into VM, just like CLOG.
> I'll change the patch as second version patch.
>

The second patch is attached.

In the second patch, I added a bit to the visibility map that indicates
all tuples in a page are completely frozen.
The visibility map became a bitmap with two bits per heap page:
all-visible and all-frozen.
The logic around vacuum and heap insert/update/delete is almost the same
as in the previous version.

This patch still lacks some things: documentation, comments in the
source code, etc., so it's still a WIP patch,
but I think it's enough to discuss this approach.

Please give me feedback.

Regards,

-------
Sawada Masahiko

Attachment

Re: Freeze avoidance of very large table.

From
Sawada Masahiko
Date:
On Thu, Apr 30, 2015 at 8:07 PM, Sawada Masahiko <sawada.mshk@gmail.com> wrote:
> On Fri, Apr 24, 2015 at 11:21 AM, Sawada Masahiko <sawada.mshk@gmail.com> wrote:
>> On Fri, Apr 24, 2015 at 1:31 AM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote:
>>> On 4/23/15 11:06 AM, Petr Jelinek wrote:
>>>>
>>>> On 23/04/15 17:45, Bruce Momjian wrote:
>>>>>
>>>>> On Thu, Apr 23, 2015 at 09:45:38AM -0400, Robert Haas wrote:
>>>>> Agreed, no extra file, and the same write volume as currently.  It would
>>>>> also match pg_clog, which uses two bits per transaction --- maybe we can
>>>>> reuse some of that code.
>>>>>
>>>>
>>>> Yeah, this approach seems promising. We probably can't reuse code from
>>>> clog because the usage pattern is different (key for clog is xid, while
>>>> for visibility/freeze map ctid is used). But visibility map storage
>>>> layer is pretty simple so it should be easy to extend it for this use.
>>>
>>>
>>> Actually, there may be some bit manipulation functions we could reuse;
>>> things like efficiently counting how many things in a byte are set. Probably
>>> doesn't make sense to fully refactor it, but at least CLOG is a good source
>>> for cut/paste/whack.
>>>
>>
>> I agree with adding a bit that indicates corresponding page is
>> all-frozen into VM, just like CLOG.
>> I'll change the patch as second version patch.
>>
>
> The second patch is attached.
>
> In second patch, I added a bit that indicates all tuples in page are
> completely frozen into visibility map.
> The visibility map became a bitmap with two bit per heap page:
> all-visible and all-frozen.
> The logics around vacuum, insert/update/delete heap are almost same as
> previous version.
>
> This patch lack some point: documentation, comment in source code,
> etc, so it's WIP patch yet,
> but I think that it's enough to discuss about this.
>

The previous patch no longer applies cleanly to HEAD.
The attached v2 patch is the latest version.

Please review it.

Regards,

-------
Sawada Masahiko

Attachment

Re: Freeze avoidance of very large table.

From
Sawada Masahiko
Date:
On Thu, May 28, 2015 at 11:34 AM, Sawada Masahiko <sawada.mshk@gmail.com> wrote:
> On Thu, Apr 30, 2015 at 8:07 PM, Sawada Masahiko <sawada.mshk@gmail.com> wrote:
>> On Fri, Apr 24, 2015 at 11:21 AM, Sawada Masahiko <sawada.mshk@gmail.com> wrote:
>>> On Fri, Apr 24, 2015 at 1:31 AM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote:
>>>> On 4/23/15 11:06 AM, Petr Jelinek wrote:
>>>>>
>>>>> On 23/04/15 17:45, Bruce Momjian wrote:
>>>>>>
>>>>>> On Thu, Apr 23, 2015 at 09:45:38AM -0400, Robert Haas wrote:
>>>>>> Agreed, no extra file, and the same write volume as currently.  It would
>>>>>> also match pg_clog, which uses two bits per transaction --- maybe we can
>>>>>> reuse some of that code.
>>>>>>
>>>>>
>>>>> Yeah, this approach seems promising. We probably can't reuse code from
>>>>> clog because the usage pattern is different (key for clog is xid, while
>>>>> for visibility/freeze map ctid is used). But visibility map storage
>>>>> layer is pretty simple so it should be easy to extend it for this use.
>>>>
>>>>
>>>> Actually, there may be some bit manipulation functions we could reuse;
>>>> things like efficiently counting how many things in a byte are set. Probably
>>>> doesn't make sense to fully refactor it, but at least CLOG is a good source
>>>> for cut/paste/whack.
>>>>
>>>
>>> I agree with adding a bit that indicates corresponding page is
>>> all-frozen into VM, just like CLOG.
>>> I'll change the patch as second version patch.
>>>
>>
>> The second patch is attached.
>>
>> In second patch, I added a bit that indicates all tuples in page are
>> completely frozen into visibility map.
>> The visibility map became a bitmap with two bit per heap page:
>> all-visible and all-frozen.
>> The logics around vacuum, insert/update/delete heap are almost same as
>> previous version.
>>
>> This patch lack some point: documentation, comment in source code,
>> etc, so it's WIP patch yet,
>> but I think that it's enough to discuss about this.
>>
>
> The previous patch is no longer applied cleanly to HEAD.
> The attached v2 patch is latest version.
>
> Please review it.

Attached is a new rebased version of the patch.
Please give me comments!

Regards,

--
Sawada Masahiko

Attachment

Re: Freeze avoidance of very large table.

From
Simon Riggs
Date:
On 30 April 2015 at 12:07, Sawada Masahiko <sawada.mshk@gmail.com> wrote:
 
This patch still lacks some things: documentation, comments in the
source code, etc., so it's still a WIP patch,
but I think it's enough to discuss this approach.

Code comments exist to indicate the intention of sections of code. They are essential for reviewers, not a cosmetic thing to be added later. To gain wide agreement we need wide understanding. (I recommend a development approach where you write the comments first, then add code later.) 

--
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Freeze avoidance of very large table.

From
Fujii Masao
Date:
On Thu, Jul 2, 2015 at 12:13 AM, Sawada Masahiko <sawada.mshk@gmail.com> wrote:
> On Thu, May 28, 2015 at 11:34 AM, Sawada Masahiko <sawada.mshk@gmail.com> wrote:
>> On Thu, Apr 30, 2015 at 8:07 PM, Sawada Masahiko <sawada.mshk@gmail.com> wrote:
>>> On Fri, Apr 24, 2015 at 11:21 AM, Sawada Masahiko <sawada.mshk@gmail.com> wrote:
>>>> On Fri, Apr 24, 2015 at 1:31 AM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote:
>>>>> On 4/23/15 11:06 AM, Petr Jelinek wrote:
>>>>>>
>>>>>> On 23/04/15 17:45, Bruce Momjian wrote:
>>>>>>>
>>>>>>> On Thu, Apr 23, 2015 at 09:45:38AM -0400, Robert Haas wrote:
>>>>>>> Agreed, no extra file, and the same write volume as currently.  It would
>>>>>>> also match pg_clog, which uses two bits per transaction --- maybe we can
>>>>>>> reuse some of that code.
>>>>>>>
>>>>>>
>>>>>> Yeah, this approach seems promising. We probably can't reuse code from
>>>>>> clog because the usage pattern is different (key for clog is xid, while
>>>>>> for visibility/freeze map ctid is used). But visibility map storage
>>>>>> layer is pretty simple so it should be easy to extend it for this use.
>>>>>
>>>>>
>>>>> Actually, there may be some bit manipulation functions we could reuse;
>>>>> things like efficiently counting how many things in a byte are set. Probably
>>>>> doesn't make sense to fully refactor it, but at least CLOG is a good source
>>>>> for cut/paste/whack.
>>>>>
>>>>
>>>> I agree with adding a bit that indicates corresponding page is
>>>> all-frozen into VM, just like CLOG.
>>>> I'll change the patch as second version patch.
>>>>
>>>
>>> The second patch is attached.
>>>
>>> In second patch, I added a bit that indicates all tuples in page are
>>> completely frozen into visibility map.
>>> The visibility map became a bitmap with two bit per heap page:
>>> all-visible and all-frozen.
>>> The logics around vacuum, insert/update/delete heap are almost same as
>>> previous version.
>>>
>>> This patch lack some point: documentation, comment in source code,
>>> etc, so it's WIP patch yet,
>>> but I think that it's enough to discuss about this.
>>>
>>
>> The previous patch is no longer applied cleanly to HEAD.
>> The attached v2 patch is latest version.
>>
>> Please review it.
>
> Attached new rebased version patch.
> Please give me comments!

Now we should review your design and approach rather than the code,
but since I got an assertion failure while trying the patch, I'm
reporting it.

"initdb -D test -k" caused the following assertion failure.

vacuuming database template1 ... TRAP:
FailedAssertion("!((((PageHeader) (heapPage))->pd_flags & 0x0004))",
File: "visibilitymap.c", Line: 328)
sh: line 1: 83785 Abort trap: 6
"/dav/000_add_frozen_bit_into_visibilitymap_v3/bin/postgres" --single
-F -O -c search_path=pg_catalog -c exit_on_error=true template1 >
/dev/null
child process exited with exit code 134
initdb: removing data directory "test"

Regards,

-- 
Fujii Masao



Re: Freeze avoidance of very large table.

From
Sawada Masahiko
Date:
On Thu, Jul 2, 2015 at 1:06 AM, Fujii Masao <masao.fujii@gmail.com> wrote:
> On Thu, Jul 2, 2015 at 12:13 AM, Sawada Masahiko <sawada.mshk@gmail.com> wrote:
>> On Thu, May 28, 2015 at 11:34 AM, Sawada Masahiko <sawada.mshk@gmail.com> wrote:
>>> On Thu, Apr 30, 2015 at 8:07 PM, Sawada Masahiko <sawada.mshk@gmail.com> wrote:
>>>> On Fri, Apr 24, 2015 at 11:21 AM, Sawada Masahiko <sawada.mshk@gmail.com> wrote:
>>>>> On Fri, Apr 24, 2015 at 1:31 AM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote:
>>>>>> On 4/23/15 11:06 AM, Petr Jelinek wrote:
>>>>>>>
>>>>>>> On 23/04/15 17:45, Bruce Momjian wrote:
>>>>>>>>
>>>>>>>> On Thu, Apr 23, 2015 at 09:45:38AM -0400, Robert Haas wrote:
>>>>>>>> Agreed, no extra file, and the same write volume as currently.  It would
>>>>>>>> also match pg_clog, which uses two bits per transaction --- maybe we can
>>>>>>>> reuse some of that code.
>>>>>>>>
>>>>>>>
>>>>>>> Yeah, this approach seems promising. We probably can't reuse code from
>>>>>>> clog because the usage pattern is different (key for clog is xid, while
>>>>>>> for visibility/freeze map ctid is used). But visibility map storage
>>>>>>> layer is pretty simple so it should be easy to extend it for this use.
>>>>>>
>>>>>>
>>>>>> Actually, there may be some bit manipulation functions we could reuse;
>>>>>> things like efficiently counting how many things in a byte are set. Probably
>>>>>> doesn't make sense to fully refactor it, but at least CLOG is a good source
>>>>>> for cut/paste/whack.
>>>>>>
>>>>>
>>>>> I agree with adding a bit that indicates corresponding page is
>>>>> all-frozen into VM, just like CLOG.
>>>>> I'll change the patch as second version patch.
>>>>>
>>>>
>>>> The second patch is attached.
>>>>
>>>> In second patch, I added a bit that indicates all tuples in page are
>>>> completely frozen into visibility map.
>>>> The visibility map became a bitmap with two bit per heap page:
>>>> all-visible and all-frozen.
>>>> The logics around vacuum, insert/update/delete heap are almost same as
>>>> previous version.
>>>>
>>>> This patch lack some point: documentation, comment in source code,
>>>> etc, so it's WIP patch yet,
>>>> but I think that it's enough to discuss about this.
>>>>
>>>
>>> The previous patch is no longer applied cleanly to HEAD.
>>> The attached v2 patch is latest version.
>>>
>>> Please review it.
>>
>> Attached new rebased version patch.
>> Please give me comments!
>
> Now we should review your design and approach rather than code,
> but since I got an assertion error while trying the patch, I report it.
>
> "initdb -D test -k" caused the following assertion failure.
>
> vacuuming database template1 ... TRAP:
> FailedAssertion("!((((PageHeader) (heapPage))->pd_flags & 0x0004))",
> File: "visibilitymap.c", Line: 328)
> sh: line 1: 83785 Abort trap: 6
> "/dav/000_add_frozen_bit_into_visibilitymap_v3/bin/postgres" --single
> -F -O -c search_path=pg_catalog -c exit_on_error=true template1 >
> /dev/null
> child process exited with exit code 134
> initdb: removing data directory "test"

Thank you for the bug report and comments.

A fixed version is attached, and the source code comments are also
updated. Please review it.

And I will explain again here what this patch does and its current design.

- An additional bit in the visibility map.
I added an additional bit, the all-frozen bit, which indicates whether
all tuples of the corresponding page are frozen, to the visibility map.
This structure is similar to CLOG.
So the size of the VM is now twice what it was.
Also, the flags of each heap page header might have PD_ALL_FROZEN set,
as well as all-visible.

- Setting and clearing the all-frozen bit
Update, delete and insert (multi-insert) operations clear the bit for
that page, and clear the flags of the page header at the same time.
Only the vacuum operation can set the bit, when all tuples of a page
are frozen.

- Anti-wraparound vacuum
Today we have to scan the whole table for XID anti-wraparound, and it's
really quite expensive because of the disk I/O.
The main benefit of this proposal is to reduce and avoid such an
extremely large quantity of I/O even when an anti-wraparound vacuum is
executed.
In the lazy_scan_heap() function, I added such logic experimentally.

There were several other ideas in the previous discussion, such as a
read-only table and a frozen map. But the advantage of this direction
is that we don't need an additional heap file, and we can use the
mature VM mechanism.
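
A minimal sketch of those rules, assuming the flag names below and a
flags argument added to visibilitymap_set() (all of this is an
assumption for illustration; the actual patch may differ):

/* assumed flags: two bits per heap page in the VM */
#define VISIBILITYMAP_ALL_VISIBLE   0x01
#define VISIBILITYMAP_ALL_FROZEN    0x02

/* heap INSERT/UPDATE/DELETE: both states are invalidated together */
visibilitymap_clear(relation, blkno, vmbuffer);
PageClearAllVisible(page);      /* PD_ALL_FROZEN is cleared here too */

/* vacuum, after verifying/freezing every tuple on the page */
if (all_visible && all_frozen)
    visibilitymap_set(relation, blkno, buffer, recptr, vmbuffer,
                      visibility_cutoff_xid,
                      VISIBILITYMAP_ALL_VISIBLE | VISIBILITYMAP_ALL_FROZEN);

An anti-wraparound scan can then skip any page whose all-frozen bit is
already set.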

Regards,

--
Sawada Masahiko

Attachment

Re: Freeze avoidance of very large table.

From
Simon Riggs
Date:
On 2 July 2015 at 16:30, Sawada Masahiko <sawada.mshk@gmail.com> wrote:
 
Also, the PD_ALL_FROZEN flag may be set in each heap page's header,
just as PD_ALL_VISIBLE is.

Is it possible to have VM bits set to frozen but not visible?

The description makes those two states sound independent of each other.

Are they? Or not? Do we test for an impossible state?

--
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Freeze avoidance of very large table.

From
Sawada Masahiko
Date:
On Fri, Jul 3, 2015 at 1:23 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> On 2 July 2015 at 16:30, Sawada Masahiko <sawada.mshk@gmail.com> wrote:
>
>>
>> Also, the flags of each heap page header might be set PD_ALL_FROZEN,
>> as well as all-visible
>
>
> Is it possible to have VM bits set to frozen but not visible?
>
> The description makes those two states sound independent of each other.
>
> Are they? Or not? Do we test for an impossible state?
>

It's impossible to have the VM bits set to frozen but not visible.
These bits are controlled independently. But eventually, when the
all-frozen bit is set, the all-visible bit is also set.

Regards,

--
Sawada Masahiko



Re: Freeze avoidance of very large table.

From
Sawada Masahiko
Date:
On Fri, Jul 3, 2015 at 5:25 PM, Sawada Masahiko <sawada.mshk@gmail.com> wrote:
> On Fri, Jul 3, 2015 at 1:23 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
>> On 2 July 2015 at 16:30, Sawada Masahiko <sawada.mshk@gmail.com> wrote:
>>
>>>
>>> Also, the flags of each heap page header might be set PD_ALL_FROZEN,
>>> as well as all-visible
>>
>>
>> Is it possible to have VM bits set to frozen but not visible?
>>
>> The description makes those two states sound independent of each other.
>>
>> Are they? Or not? Do we test for an impossible state?
>>
>
> It's impossible to have VM bits set to frozen but not visible.
> These bit are controlled independently. But eventually, when
> all-frozen bit is set, all-visible is also set.

Attached latest version including some bug fix.
Please review it.

Regards,

--
Sawada Masahiko

Attachment

Re: Freeze avoidance of very large table.

From
Simon Riggs
Date:
On 3 July 2015 at 09:25, Sawada Masahiko <sawada.mshk@gmail.com> wrote:
On Fri, Jul 3, 2015 at 1:23 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> On 2 July 2015 at 16:30, Sawada Masahiko <sawada.mshk@gmail.com> wrote:
>
>>
>> Also, the flags of each heap page header might be set PD_ALL_FROZEN,
>> as well as all-visible
>
>
> Is it possible to have VM bits set to frozen but not visible?
>
> The description makes those two states sound independent of each other.
>
> Are they? Or not? Do we test for an impossible state?
>

It's impossible to have the VM bits set to frozen but not visible.
These bits are controlled independently. But eventually, when the
all-frozen bit is set, the all-visible bit is also set.

And my understanding is that if you clear all-visible you would also clear all-frozen...

So I don't understand why you have two separate calls to visibilitymap_clear().
Surely the logic should be to clear both bits at the same time?

In my understanding the state logic is

1. Both bits unset   ~(VISIBILITYMAP_ALL_VISIBLE | VISIBILITYMAP_ALL_FROZEN)
which can be changed to state 2 only

2. VISIBILITYMAP_ALL_VISIBLE only
which can be changed to state 1 or state 3

3. VISIBILITYMAP_ALL_VISIBLE | VISIBILITYMAP_ALL_FROZEN
which can be changed to state 1 only

If that is the case please simplify the logic for setting and unsetting the bits so they are set together efficiently. At the same time please also put in Asserts to ensure that the state logic is maintained when it is set and when it is tested.
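
(A minimal standalone sketch of the invariant such an Assert would
protect: the frozen bit must imply the visible bit. The two flag values
mirror the names above, but the check itself is illustrative:)

#include <assert.h>
#include <stdint.h>

#define VISIBILITYMAP_ALL_VISIBLE 0x01
#define VISIBILITYMAP_ALL_FROZEN  0x02

/* Frozen-but-not-visible must never occur. */
static void check_vm_flags(uint8_t flags)
{
    assert(!(flags & VISIBILITYMAP_ALL_FROZEN) ||
           (flags & VISIBILITYMAP_ALL_VISIBLE));
}

int main(void)
{
    check_vm_flags(0);                                    /* state 1 */
    check_vm_flags(VISIBILITYMAP_ALL_VISIBLE);            /* state 2 */
    check_vm_flags(VISIBILITYMAP_ALL_VISIBLE |
                   VISIBILITYMAP_ALL_FROZEN);             /* state 3 */
    return 0;
}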

I would also like to see the visibilitymap_test function exposed in SQL, so we can write code to examine the map contents for particular ctids. By doing that we can then write a formal test that shows the evolution of tuples from insertion, vacuuming and freezing, testing that the map has been set correctly at each stage. I guess that needs to be done as an isolation test so we have an observer that constrains the xmin in various ways. In light of the multixact bugs, any code that changes the on-disk tuple metadata needs formal tests.

Other than that, the overall concept seems sound.

I think we need something for pg_upgrade to rewrite existing VMs. Otherwise a large read only database would suddenly require a massive revacuum after upgrade, which seems bad. That can wait for now until we all agree this patch is sound.

--
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Freeze avoidance of very large table.

From
Amit Kapila
Date:
On Fri, Jul 3, 2015 at 1:55 PM, Sawada Masahiko <sawada.mshk@gmail.com> wrote:
>
> On Fri, Jul 3, 2015 at 1:23 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> > On 2 July 2015 at 16:30, Sawada Masahiko <sawada.mshk@gmail.com> wrote:
> >
> >>
> >> Also, the flags of each heap page header might be set PD_ALL_FROZEN,
> >> as well as all-visible
> >
> >
> > Is it possible to have VM bits set to frozen but not visible?
> >
> > The description makes those two states sound independent of each other.
> >
> > Are they? Or not? Do we test for an impossible state?
> >
>
> It's impossible to have VM bits set to frozen but not visible.

In the patch, during vacuum the frozen bit is set first and the visibility
bit is set in a later operation; now if a crash happens between those
two operations, isn't it possible that the frozen bit is set while the
visible bit is not?

> These bit are controlled independently. But eventually, when
> all-frozen bit is set, all-visible is also set.
>

Yes, during normal operations it will happen that way, but I think there
are corner cases where that assumption is not true.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: Freeze avoidance of very large table.

From
Amit Kapila
Date:
On Thu, Jul 2, 2015 at 9:00 PM, Sawada Masahiko <sawada.mshk@gmail.com> wrote:
>
>
> Thank you for bug report, and comments.
>
> Fixed version is attached, and source code comment is also updated.
> Please review it.
>

I am looking into this patch and would like to share my findings with
you:

1.
@@ -2131,8 +2133,9 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,

 CheckForSerializableConflictIn(relation, NULL, InvalidBuffer);

  /*
- * Find buffer to insert this tuple into.  If the page is all visible,
- * this will also pin the requisite visibility map page.
+ * Find buffer to insert this tuple into.  If the page is all visible
+ * of all frozen, this will also pin the requisite visibility map and
+ * frozen map page.
  */

typo in comments.

/of all frozen/or all frozen

2.
visibilitymap.c
+ * The visibility map is a bitmap with two bits (all-visible and all-frozen
+ * per heap page.

/and all-frozen/and all-frozen)
closing round bracket is missing.

3.
visibilitymap.c
-/*#define TRACE_VISIBILITYMAP */
+#define TRACE_VISIBILITYMAP

Why is this #define enabled?

4.
-visibilitymap_count(Relation rel)
+visibilitymap_count(Relation rel, bool for_visible)

This API needs to count set bits for either the visibility info, the
frozen info, or both (if required); it seems better to have the second
parameter as uint8 flags rather than bool. Also, if it needs to be called
in most places for both the visibility and frozen bit counts, why not get
them in one call?
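
(To make that suggestion concrete, here is a standalone sketch of a
flags-based counter that walks the map once and returns both counts; the
function name and layout are illustrative, not the patch's actual code:)

#include <stdint.h>
#include <stdio.h>

#define VISIBILITYMAP_ALL_VISIBLE 0x01
#define VISIBILITYMAP_ALL_FROZEN  0x02
#define PAGES_PER_BYTE            4   /* 2 bits per heap page */

static void
vm_count(const uint8_t *map, int nbytes, uint8_t flags,
         int *all_visible, int *all_frozen)
{
    *all_visible = *all_frozen = 0;
    for (int i = 0; i < nbytes; i++)
        for (int page = 0; page < PAGES_PER_BYTE; page++)
        {
            uint8_t bits = (map[i] >> (page * 2)) & 0x03;

            if ((flags & VISIBILITYMAP_ALL_VISIBLE) &&
                (bits & VISIBILITYMAP_ALL_VISIBLE))
                (*all_visible)++;
            if ((flags & VISIBILITYMAP_ALL_FROZEN) &&
                (bits & VISIBILITYMAP_ALL_FROZEN))
                (*all_frozen)++;
        }
}

int main(void)
{
    uint8_t map[2] = {0x5F, 0x01};  /* made-up map contents */
    int     vis, frz;

    vm_count(map, 2, VISIBILITYMAP_ALL_VISIBLE | VISIBILITYMAP_ALL_FROZEN,
             &vis, &frz);
    printf("%d all-visible, %d all-frozen\n", vis, frz);
    return 0;
}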

5.
Clearing the visibility and frozen bits separately for the DML
operations would lead to locking/unlocking the corresponding buffer
twice; can we do it as one operation?  I think this was suggested
by Simon as well.

6.
- * Before locking the buffer, pin the visibility map page if it appears to
- * be necessary.  Since we haven't got the lock yet, someone else might be
+ * Before locking the buffer, pin the visibility map if it appears to be
+ * necessary.  Since we haven't got the lock yet, someone else might be

Why have you deleted 'page' in the above comment?

7.
@@ -3490,21 +3532,23 @@ l2:
  UnlockTupleTuplock(relation, &(oldtup.t_self), *lockmode);

  if (vmbuffer != InvalidBuffer)
   ReleaseBuffer(vmbuffer);
+
  bms_free(hot_attrs);

This seems like an unnecessary change.

8.
@@ -1919,11 +1919,18 @@ index_update_stats(Relation rel,
  {
   BlockNumber relpages = RelationGetNumberOfBlocks(rel);
   BlockNumber relallvisible;
+  BlockNumber relallfrozen;

   if (rd_rel->relkind != RELKIND_INDEX)
-   relallvisible = visibilitymap_count(rel);
+  {
+   relallvisible = visibilitymap_count(rel, true);
+   relallfrozen = visibilitymap_count(rel, false);
+  }
   else /* don't bother for indexes */
+  {
    relallvisible = 0;
+   relallfrozen = 0;
+  }

I think in this function, you have forgotten to update the
relallfrozen value in pg_class.

9.
vacuumlazy.c

@@ -253,14 +258,16 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
   * NB: We need to check this before truncating the relation, because that
   * will change ->rel_pages.
   */
- if (vacrelstats->scanned_pages < vacrelstats->rel_pages)
+ if ((vacrelstats->scanned_pages + vacrelstats->vmskipped_pages)
+  < vacrelstats->rel_pages)
  {
- Assert(!scan_all);

Why have you removed this Assert?  Won't the count of
vacrelstats->scanned_pages + vacrelstats->vmskipped_pages be
equal to vacrelstats->rel_pages when scan_all = true?

10.
vacuumlazy.c
lazy_vacuum_rel()
..
+ scanned_all |= scan_all;
+

Why is this new assignment added?  Please add a comment to
explain it.

11.
lazy_scan_heap()
..
+ * Also, skipping even a single page accorind to all-visible bit of
+ * visibility map means that we can't update relfrozenxid, so we only want
+ * to do it if we can skip a goodly number. On the other hand, we count
+ * both how many pages we skipped according to all-frozen bit of visibility
+ * map and how many pages we freeze page, so we can update relfrozenxid if
+ * the sum of their is as many as tuples per page.

a.
typo
/accorind/according
b.
is the second part of the comment (starting from "On the other hand")
right?  I mean you are comparing the sum of pages skipped due to the
all_frozen bit and the number of pages frozen against tuples per page.
I don't understand how they are related.


12.
@@ -918,8 +954,13 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
  else
  {
  num_tuples += 1;
+ ntup_in_blk += 1;
  hastup = true;
 
+ /* If current tuple is already frozen, count it up */
+ if (HeapTupleHeaderXminFrozen(tuple.t_data))
+ already_nfrozen += 1;
+
  /*
  * Each non-removable tuple must be checked to see if it needs
  * freezing.  Note we already have exclusive buffer lock.

Here, if the tuple is already frozen, can't we just continue and
check the next tuple?

13.
+extern XLogRecPtr log_heap_frozenmap(RelFileNode rnode, Buffer heap_buffer,
+ Buffer fm_buffer);

It seems like this function is not used.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: Freeze avoidance of very large table.

From
Simon Riggs
Date:
On 6 July 2015 at 17:28, Simon Riggs <simon@2ndquadrant.com> wrote:

I think we need something for pg_upgrade to rewrite existing VMs. Otherwise a large read only database would suddenly require a massive revacuum after upgrade, which seems bad. That can wait for now until we all agree this patch is sound.
 
Since we need to rewrite the "vm" map, I think we should call the new map "vfm".

That way we will be able to easily check whether the rewrite has been conducted on all relations.

Since the maps are just bits, there is no other way to tell that a map has been rewritten.

--
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Freeze avoidance of very large table.

From
Sawada Masahiko
Date:
> So I don't understand why you have two separate calls to visibilitymap_clear()
> Surely the logic should be to clear both bits at the same time?
Yes, you're right. The all-frozen bit should be cleared at the same time
as the all-visible bit.

> 1. Both bits unset   ~(VISIBILITYMAP_ALL_VISIBLE | VISIBILITYMAP_ALL_FROZEN)
> which can be changed to state 2 only
> 2. VISIBILITYMAP_ALL_VISIBLE only
> which can be changed state 1 or state 3
> 3. VISIBILITYMAP_ALL_VISIBLE | VISIBILITYMAP_ALL_FROZEN
> which can be changed to state 1 only
> If that is the case please simplify the logic for setting and unsetting the bits so they are set together efficiently.
> At the same time please also put in Asserts to ensure that the state logic is maintained when it is set and when it is tested.
>
> In patch, during Vacuum first the frozen bit is set and then the visibility
> will be set in a later operation, now if the crash happens between those
> 2 operations, then isn't it possible that the frozen bit is set and visible
> bit is not set?

In the current patch, the frozen bit is set first in lazy_scan_heap(), so
it's possible to have the frozen bit set but not the visible bit, as Amit
pointed out.
To fix it, I'm simplifying the patch so that both bits are set at the
same time efficiently.

> I would also like to see the visibilitymap_test function exposed in SQL,
> so we can write code to examine the map contents for particular ctids.
> By doing that we can then write a formal test that shows the evolution of tuples from insertion,
> vacuuming and freezing, testing the map has been set correctly at each stage.
> I guess that needs to be done as an isolationtest so we have an observer that contrains the xmin in various ways.
> In light of multixact bugs, any code that changes the on-disk tuple metadata needs formal tests.

The attached patch adds a few functions to contrib/pg_freespacemap to
explore the contents of the visibility map, which I used for my tests.
I hope it helps with testing this feature.

> I think we need something for pg_upgrade to rewrite existing VMs.
> Otherwise a large read only database would suddenly require a massive
> revacuum after upgrade, which seems bad. That can wait for now until we all
> agree this patch is sound.

Yeah, I will address them.

Regards,

--
Sawada Masahiko

Attachment

Re: Freeze avoidance of very large table.

From
Simon Riggs
Date:
On 7 July 2015 at 15:18, Sawada Masahiko <sawada.mshk@gmail.com> wrote:
 
> I would also like to see the visibilitymap_test function exposed in SQL,
> so we can write code to examine the map contents for particular ctids.
> By doing that we can then write a formal test that shows the evolution of tuples from insertion,
> vacuuming and freezing, testing the map has been set correctly at each stage.
> I guess that needs to be done as an isolationtest so we have an observer that contrains the xmin in various ways.
> In light of multixact bugs, any code that changes the on-disk tuple metadata needs formal tests.

The attached patch adds a few functions to contrib/pg_freespacemap to
explore the contents of the visibility map, which I used for my tests.
I hope it helps with testing this feature.

I don't think pg_freespacemap is the right place.

I'd prefer to add that as a single function into core, so we can write formal tests. I would not personally commit this feature without rigorous and easily repeatable verification.

--
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Freeze avoidance of very large table.

From
Andres Freund
Date:
On 2015-07-07 16:25:13 +0100, Simon Riggs wrote:
> I don't think pg_freespacemap is the right place.

I agree that pg_freespacemap sounds like an odd location.

> I'd prefer to add that as a single function into core, so we can write
> formal tests.

With the advent of src/test/modules it's not really a prerequisite for
things to be builtin to be testable. I think there's fair arguments for
moving stuff like pg_stattuple, pg_freespacemap, pg_buffercache into
core at some point, but that's probably a separate discussion.

Regards,

Andres



Re: Freeze avoidance of very large table.

From
Sawada Masahiko
Date:
On Wed, Jul 8, 2015 at 12:37 AM, Andres Freund <andres@anarazel.de> wrote:
> On 2015-07-07 16:25:13 +0100, Simon Riggs wrote:
>> I don't think pg_freespacemap is the right place.
>
> I agree that pg_freespacemap sounds like an odd location.
>
>> I'd prefer to add that as a single function into core, so we can write
>> formal tests.
>
> With the advent of src/test/modules it's not really a prerequisite for
> things to be builtin to be testable. I think there's fair arguments for
> moving stuff like pg_stattuple, pg_freespacemap, pg_buffercache into
> core at some point, but that's probably a separate discussion.
>

I understood.
So I will place a bunch of tests in something like
src/test/modules/visibilitymap_test, which contains some tests for this
feature, and gather them into one patch.

Regards,

--
Sawada Masahiko



Re: Freeze avoidance of very large table.

From
Amit Kapila
Date:
On Tue, Jul 7, 2015 at 5:19 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Thu, Jul 2, 2015 at 9:00 PM, Sawada Masahiko <sawada.mshk@gmail.com> wrote:
> >
> >
> > Thank you for bug report, and comments.
> >
> > Fixed version is attached, and source code comment is also updated.
> > Please review it.
> >
>
> I am looking into this patch and would like to share my findings with
> you:
>

A few more comments:

@@ -47,6 +47,8 @@ CATALOG(pg_class,1259) BKI_BOOTSTRAP BKI_ROWTYPE_OID(83) BKI_SCHEMA_MACRO
  float4 reltuples; /* # of tuples (not always up-to-date) */
  int32 relallvisible; /* # of all-visible blocks (not always
  * up-to-date) */
+ int32 relallfrozen; /* # of all-frozen blocks (not always
+   up-to-date) */


You have added relallfrozen similar to relallvisible, but how are you
planning to use it?  Is there any use case for it?


lazy_scan_heap()
..
- /* Current block is all-visible */
+ /*
+ * Current block is all-visible.
+ * If visibility map represents that it's all frozen, we can
+ * skip to vacuum page unconditionally.
+ */
+ if (visibilitymap_test(onerel, blkno, &vmbuffer, VISIBILITYMAP_ALL_FROZEN))
+ {
+ vacrelstats->vmskipped_pages++;
+ continue;
+ }
+


a. Please explain in a comment why it is safe if someone clears the
    frozen bit concurrently.
b. Won't skipping pages intermittently due to the frozen bit being set
    break the read-ahead mechanism?  In this regard, if possible, I think
    we should do some tests to see the benefit of this patch.  I understand
    that in general it will be good to skip pages, however it seems better
    to check that with some different kinds of tests.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: Freeze avoidance of very large table.

From
Simon Riggs
Date:
On 7 July 2015 at 18:45, Sawada Masahiko <sawada.mshk@gmail.com> wrote:
On Wed, Jul 8, 2015 at 12:37 AM, Andres Freund <andres@anarazel.de> wrote:
> On 2015-07-07 16:25:13 +0100, Simon Riggs wrote:
>> I don't think pg_freespacemap is the right place.
>
> I agree that pg_freespacemap sounds like an odd location.
>
>> I'd prefer to add that as a single function into core, so we can write
>> formal tests.
>
> With the advent of src/test/modules it's not really a prerequisite for
> things to be builtin to be testable. I think there's fair arguments for
> moving stuff like pg_stattuple, pg_freespacemap, pg_buffercache into
> core at some point, but that's probably a separate discussion.
>

I understood.
So I will place a bunch of tests in something like
src/test/modules/visibilitymap_test, which contains some tests for this
feature, and gather them into one patch.

Please place it in core. I see value in having a diagnostic function for general use on production systems.

--
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Freeze avoidance of very large table.

From
Amit Kapila
Date:
On Tue, Jul 7, 2015 at 5:37 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
On 6 July 2015 at 17:28, Simon Riggs <simon@2ndquadrant.com> wrote:

I think we need something for pg_upgrade to rewrite existing VMs. Otherwise a large read only database would suddenly require a massive revacuum after upgrade, which seems bad. That can wait for now until we all agree this patch is sound.
  
Since we need to rewrite the "vm" map, I think we should call the new map "vfm" 

+1 for changing the name, as the map now contains more than visibility
information.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: Freeze avoidance of very large table.

From
Jeff Janes
Date:
On Fri, Jul 3, 2015 at 1:25 AM, Sawada Masahiko <sawada.mshk@gmail.com> wrote:
On Fri, Jul 3, 2015 at 1:23 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> On 2 July 2015 at 16:30, Sawada Masahiko <sawada.mshk@gmail.com> wrote:
>
>>
>> Also, the flags of each heap page header might be set PD_ALL_FROZEN,
>> as well as all-visible
>
>
> Is it possible to have VM bits set to frozen but not visible?
>
> The description makes those two states sound independent of each other.
>
> Are they? Or not? Do we test for an impossible state?
>

It's impossible to have the VM bits set to frozen but not visible.
These bits are controlled independently. But eventually, when the
all-frozen bit is set, the all-visible bit is also set.

If that combination is currently impossible, could it be used to indicate that the page is all empty?

Having a crash-proof bitmap of all-empty pages would make vacuum truncation scans much more efficient.
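
(A minimal sketch of why that would help: with a trustworthy per-page
all-empty bit, the truncation scan becomes a backwards walk over the map
rather than reads of the heap itself. Everything below is hypothetical
illustration, not existing code:)

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Return the block count the relation could be truncated to. */
static uint32_t truncate_target(const bool *all_empty, uint32_t nblocks)
{
    uint32_t blkno = nblocks;

    while (blkno > 0 && all_empty[blkno - 1])
        blkno--;
    return blkno;
}

int main(void)
{
    bool all_empty[6] = {false, false, true, false, true, true};

    /* trailing blocks 4 and 5 are empty, so truncate to 4 blocks */
    printf("truncate to %u blocks\n", truncate_target(all_empty, 6));
    return 0;
}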

Cheers,

Jeff

Re: Freeze avoidance of very large table.

From
Jim Nasby
Date:
On 7/8/15 8:31 AM, Simon Riggs wrote:
>     I understood.
>     So I will place bunch of test like src/test/module/visibilitymap_test,
>     which contains  some tests regarding this feature,
>     and gather them into one patch.
>
>
> Please place it in core. I see value in having a diagnostic function for
> general use on production systems.

+1. I don't think there's value in keeping this stuff away from DBAs.
Perhaps it should default to only superusers being able to execute it, though.
-- 
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Data in Trouble? Get it in Treble! http://BlueTreble.com



Re: Freeze avoidance of very large table.

From
Sawada Masahiko
Date:
On Thu, Jul 9, 2015 at 4:31 AM, Jeff Janes <jeff.janes@gmail.com> wrote:
> On Fri, Jul 3, 2015 at 1:25 AM, Sawada Masahiko <sawada.mshk@gmail.com>
> wrote:
>>
>> On Fri, Jul 3, 2015 at 1:23 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
>> > On 2 July 2015 at 16:30, Sawada Masahiko <sawada.mshk@gmail.com> wrote:
>> >
>> >>
>> >> Also, the flags of each heap page header might be set PD_ALL_FROZEN,
>> >> as well as all-visible
>> >
>> >
>> > Is it possible to have VM bits set to frozen but not visible?
>> >
>> > The description makes those two states sound independent of each other.
>> >
>> > Are they? Or not? Do we test for an impossible state?
>> >
>>
>> It's impossible to have VM bits set to frozen but not visible.
>> These bit are controlled independently. But eventually, when
>> all-frozen bit is set, all-visible is also set.
>
>
> If that combination is currently impossible, could it be used indicate that
> the page is all empty?

Yeah, the state of the VM bits being set to frozen but not visible is
impossible, so we could use this state to represent something else about
the page.

> Having a crash-proof bitmap of all-empty pages would make vacuum truncation
> scans much more efficient.

The empty page is always marked all-visible by vacuum today; isn't that enough?

Regards,

--
Sawada Masahiko



Re: Freeze avoidance of very large table.

From
Sawada Masahiko
Date:
On Tue, Jul 7, 2015 at 8:49 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Thu, Jul 2, 2015 at 9:00 PM, Sawada Masahiko <sawada.mshk@gmail.com>
> wrote:
>>
>>
>> Thank you for bug report, and comments.
>>
>> Fixed version is attached, and source code comment is also updated.
>> Please review it.
>>
>
> I am looking into this patch and would like to share my findings with
> you:

Thank you for the comments.
I appreciate your taking the time to review this patch.

>
> 1.
> @@ -2131,8 +2133,9 @@ heap_insert(Relation relation, HeapTuple tup,
> CommandId cid,
>
> CheckForSerializableConflictIn(relation, NULL, InvalidBuffer);
>
>   /*
> - * Find buffer to insert this
> tuple into.  If the page is all visible,
> - * this will also pin the requisite visibility map page.
> +
>  * Find buffer to insert this tuple into.  If the page is all visible
> + * of all frozen, this will also pin
> the requisite visibility map and
> + * frozen map page.
>   */
>
> typo in comments.
>
> /of all frozen/or all frozen

Fixed.

> 2.
> visibilitymap.c
> + * The visibility map is a bitmap with two bits (all-visible and all-frozen
> + * per heap page.
>
> /and all-frozen/and all-frozen)
> closing round bracket is missing.

Fixed.

> 3.
> visibilitymap.c
> -/*#define TRACE_VISIBILITYMAP */
> +#define TRACE_VISIBILITYMAP
>
> why is this hash define opened?

Fixed.

> 4.
> -visibilitymap_count(Relation rel)
> +visibilitymap_count(Relation rel, bool for_visible)
>
> This API needs to count set bits for either visibility info, frozen info
> or both (if required), it seems better to have second parameter as
> uint8 flags rather than bool. Also, if it is required to be called at most
> places for both visibility and frozen bits count, why not get them
> in one call?

Fixed.

> 5.
> Clearing visibility and frozen bit separately for the dml
> operations would lead locking/unlocking the corresponding buffer
> twice, can we do it as a one operation.  I think this is suggested
> by Simon as well.

The latest patch clears both bits in one operation, and sets all-frozen
together with all-visible in one operation.
We can judge whether a page is all-frozen in two places: when first
scanning the page (lazy_scan_heap), and after cleaning up garbage
(lazy_vacuum_page).
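
(A standalone sketch of the judgement itself, under the assumption that a
page qualifies for the all-frozen bit only when every remaining tuple's
xmin is frozen; the counters echo the ntup_in_blk/already_nfrozen style
used in the patch, but the code is illustrative:)

#include <stdbool.h>
#include <stdio.h>

static bool page_all_frozen(const bool *tuple_xmin_frozen, int ntup_in_blk)
{
    int nfrozen = 0;

    for (int i = 0; i < ntup_in_blk; i++)
        if (tuple_xmin_frozen[i])
            nfrozen++;
    /* set the all-frozen bit only if no tuple remains unfrozen */
    return nfrozen == ntup_in_blk;
}

int main(void)
{
    bool tuples[3] = {true, true, false};

    printf("all frozen? %s\n", page_all_frozen(tuples, 3) ? "yes" : "no");
    return 0;
}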

> 6.
> - * Before locking the buffer, pin the visibility map page if it appears to
> - * be necessary.
> Since we haven't got the lock yet, someone else might be
> + * Before locking the buffer, pin the
> visibility map if it appears to be
> + * necessary.  Since we haven't got the lock yet, someone else might
> be
>
> Why you have deleted 'page' in above comment?

Fixed.

> 7.
> @@ -3490,21 +3532,23 @@ l2:
>   UnlockTupleTuplock(relation, &(oldtup.t_self), *lockmode);
>
> if (vmbuffer != InvalidBuffer)
>   ReleaseBuffer(vmbuffer);
> +
>   bms_free
> (hot_attrs);
>
> Seems unnecessary change.

Fixed.

> 8.
> @@ -1919,11 +1919,18 @@ index_update_stats(Relation rel,
>   {
>   BlockNumber relpages =
> RelationGetNumberOfBlocks(rel);
>   BlockNumber relallvisible;
> + BlockNumber
> relallfrozen;
>
>   if (rd_rel->relkind != RELKIND_INDEX)
> - relallvisible =
> visibilitymap_count(rel);
> + {
> + relallvisible = visibilitymap_count(rel,
> true);
> + relallfrozen = visibilitymap_count(rel, false);
> + }
>   else
> /* don't bother for indexes */
> + {
>   relallvisible = 0;
> +
> relallfrozen = 0;
> + }
>
> I think in this function, you have forgotten to update the
> relallfrozen value in pg_class.

Fixed.

> 9.
> vacuumlazy.c
>
> @@ -253,14 +258,16 @@ lazy_vacuum_rel(Relation onerel, int options,
> VacuumParams *params,
>   * NB: We
> need to check this before truncating the relation, because that
>   * will change ->rel_pages.
>   */
> -
> if (vacrelstats->scanned_pages < vacrelstats->rel_pages)
> + if ((vacrelstats->scanned_pages +
> vacrelstats->vmskipped_pages)
> + < vacrelstats->rel_pages)
>   {
> - Assert(!scan_all);
>
> Why you have removed this Assert, won't the count of
> vacrelstats->scanned_pages + vacrelstats->vmskipped_pages be
> equal to vacrelstats->rel_pages when scall_all = true.

Fixed.

> 10.
> vacuumlazy.c
> lazy_vacuum_rel()
> ..
> + scanned_all |= scan_all;
> +
>
> Why this new assignment is added, please add a comment to
> explain it.

It's not necessary, removed.

> 11.
> lazy_scan_heap()
> ..
> + * Also, skipping even a single page accorind to all-visible bit of
> + * visibility map means that we can't update relfrozenxid, so we only want
> + * to do it if we can skip a goodly number. On the other hand, we count
> + * both how many pages we skipped according to all-frozen bit of visibility
> + * map and how many pages we freeze page, so we can update relfrozenxid if
> + * the sum of their is as many as tuples per page.
>
> a.
> typo
> /accorind/according

Fixed.

> b.
> is the second part of comment (starting from On the other hand)
> right?  I mean you are comparing sum of pages skipped due to
> all_frozen bit and number of pages freezed with tuples per page.
> I don't understand how are they related?
>

It's wrong; in the last sentence I wanted to say "so we can update
relfrozenxid if the sum of them is as large as the number of pages of the table."

> 12.
> @@ -918,8 +954,13 @@ lazy_scan_heap(Relation onerel, LVRelStats
> *vacrelstats,
>   else
>   {
>   num_tuples += 1;
> + ntup_in_blk += 1;
>   hastup = true;
>
> + /* If current tuple is already frozen, count it up */
> + if (HeapTupleHeaderXminFrozen(tuple.t_data))
> + already_nfrozen += 1;
> +
>   /*
>   * Each non-removable tuple must be checked to see if it needs
>   * freezing.  Note we already have exclusive buffer lock.
>
> Here, if tuple is already_frozen, can't we just continue and
> check for next tuple?

I think it's impossible, because logic related to old-style VACUUM
FULL still remains in HeapTupleHeaderXminFrozen().


> 13.
> +extern XLogRecPtr log_heap_frozenmap(RelFileNode rnode, Buffer heap_buffer,
> + Buffer fm_buffer);
>
> It seems like this function is not used.

Fixed.

> You have added relallfrozen similar to relallvisible, but how you
> are planning to use it, is there any usecase for it?

Yep, the value of relallfrozen would be useful in cases where the user
wants to know how long the vacuuming will take.
If this value is low, it's usually a good idea to do VACUUM FREEZE
manually to prevent an unpredictable anti-wraparound vacuum.

> a. please explain in comment why it is safe if someone clear the
>     frozen bit concurrently
> b. won't skipping pages intermittently due to set frozen bit break the
>     readahead mechanism?  In this regard, if possible,  I think we should
>     do some tests to see the benefit of this patch.  I understand that in
>     general, it will be good to skip pages, however it seems better to check
>     that with some different kind of tests.

In the latest patch, we can skip all-visible or all-frozen pages until
we find the next_not_all_visible_block, and then we re-check whether the
page is all-frozen so we can skip vacuuming it even if scan_all is true.
Also, I added a message about the number of skipped frozen pages to the
verbose log for testing.

> Please place it in core. I see value in having a diagnostic function for
> general use on production systems.

I added a new heapfuncs.c file for the heap-related functions that the
DBA uses, and then added these functions to that file.
But the test cases are not done yet; I'm making them.

The pg_upgrade support is also not done yet.

TODO
- Test case for this feature
- pg_upgrade support.

Regards,

--
Sawada Masahiko

Attachment

Re: Freeze avoidance of very large table.

From
Jeff Janes
Date:
On Wed, Jul 8, 2015 at 10:10 PM, Sawada Masahiko <sawada.mshk@gmail.com> wrote:
On Thu, Jul 9, 2015 at 4:31 AM, Jeff Janes <jeff.janes@gmail.com> wrote:
> On Fri, Jul 3, 2015 at 1:25 AM, Sawada Masahiko <sawada.mshk@gmail.com>
> wrote:
>>
>> It's impossible to have VM bits set to frozen but not visible.
>> These bit are controlled independently. But eventually, when
>> all-frozen bit is set, all-visible is also set.
>
>
> If that combination is currently impossible, could it be used indicate that
> the page is all empty?

Yeah, the state of the VM bits being set to frozen but not visible is
impossible, so we could use this state to represent something else about
the page.

> Having a crash-proof bitmap of all-empty pages would make vacuum truncation
> scans much more efficient.

The empty page is always marked all-visible by vacuum today; isn't that enough?

The "current" vacuum can just remember that they were empty as well as all-visible.

But the next vacuum that occurs on the table won't know that they are empty, just that they are all-visible, so it can't truncate them away without having to read each one first.

It is a minor thing, but if there is no other use for this fourth "bit-space", it seems a shame to waste it when there is some use for it.  I haven't looked at the code around this area to know how hard it would be to implement the setting and clearing of the bit.

Cheers,

Jeff

Re: Freeze avoidance of very large table.

From
Sawada Masahiko
Date:
On Fri, Jul 10, 2015 at 3:05 AM, Sawada Masahiko <sawada.mshk@gmail.com> wrote:
>
> Also something for pg_upgrade is also not yet.
>
> TODO
> - Test case for this feature
> - pg_upgrade support.
>

I had forgotten to change the fork name of the visibility map to "vfm".
Attached is the latest v7 patch.
Please review it.

Regards,

--
Sawada Masahiko

Attachment

Re: Freeze avoidance of very large table.

From
Sawada Masahiko
Date:
On Fri, Jul 10, 2015 at 3:42 AM, Jeff Janes <jeff.janes@gmail.com> wrote:
> On Wed, Jul 8, 2015 at 10:10 PM, Sawada Masahiko <sawada.mshk@gmail.com>
> wrote:
>>
>> On Thu, Jul 9, 2015 at 4:31 AM, Jeff Janes <jeff.janes@gmail.com> wrote:
>> > On Fri, Jul 3, 2015 at 1:25 AM, Sawada Masahiko <sawada.mshk@gmail.com>
>> > wrote:
>> >>
>> >> It's impossible to have VM bits set to frozen but not visible.
>> >> These bit are controlled independently. But eventually, when
>> >> all-frozen bit is set, all-visible is also set.
>> >
>> >
>> > If that combination is currently impossible, could it be used indicate
>> > that
>> > the page is all empty?
>>
>> Yeah, the status of that VM bits set to frozen but not visible is
>> impossible, so we could use this status for another something status
>> of the page.
>>
>> > Having a crash-proof bitmap of all-empty pages would make vacuum
>> > truncation
>> > scans much more efficient.
>>
>> The empty page is always marked all-visible by vacuum today, it's not
>> enough?
>
>
> The "current" vacuum can just remember that they were empty as well as
> all-visible.
>
> But the next vacuum that occurs on the table won't know that they are empty,
> just that they are all-visible, so it can't truncate them away without
> having to read each one first.

Yeah, it would be effective for vacuuming empty pages.

>
> It is a minor thing, but if there is no other use for this fourth
> "bit-space", it seems a shame to waste it when there is some use for it.  I
> haven't looked at the code around this area to know how hard it would be to
> implement the setting and clearing of the bit.

I think so too; we would be able to use the unused fourth bit state
efficiently.
Should I include this improvement in this patch?
I think this topic should be discussed on another thread after this
feature is committed.

Regards,

--
Sawada Masahiko



Re: Freeze avoidance of very large table.

From
Simon Riggs
Date:
On 10 July 2015 at 09:49, Sawada Masahiko <sawada.mshk@gmail.com> wrote:

> It is a minor thing, but if there is no other use for this fourth
> "bit-space", it seems a shame to waste it when there is some use for it.  I
> haven't looked at the code around this area to know how hard it would be to
> implement the setting and clearing of the bit.

I think so too; we would be able to use the unused fourth bit state
efficiently.
Should I include this improvement in this patch?
I think this topic should be discussed on another thread after this
feature is committed.

The impossible state acts as a diagnostic check for us to ensure the bitmap is not itself corrupt.

-1 for using it for another purpose.
 
--
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Freeze avoidance of very large table.

From
Fujii Masao
Date:
On Fri, Jul 10, 2015 at 2:41 PM, Sawada Masahiko <sawada.mshk@gmail.com> wrote:
> On Fri, Jul 10, 2015 at 3:05 AM, Sawada Masahiko <sawada.mshk@gmail.com> wrote:
>>
>> Also something for pg_upgrade is also not yet.
>>
>> TODO
>> - Test case for this feature
>> - pg_upgrade support.
>>
>
> I had forgotten to change the fork name of visibility map to "vfm".
> Attached latest v7 patch.
> Please review it.

The compilation failed on my machine...

gcc -Wall -Wmissing-prototypes -Wpointer-arith
-Wdeclaration-after-statement -Wendif-labels
-Wmissing-format-attribute -Wformat-security -fno-strict-aliasing
-fwrapv -g -O0 -I../../../../src/include -D_GNU_SOURCE   -c -o
visibilitymap.o visibilitymap.c
make[4]: *** No rule to make target `heapfuncs.o', needed by
`objfiles.txt'.  Stop.
make[4]: *** Waiting for unfinished jobs....
( echo src/backend/access/index/genam.o
src/backend/access/index/indexam.o ) >objfiles.txt
make[4]: Leaving directory `/home/postgres/pgsql/git/src/backend/access/index'
gcc -Wall -Wmissing-prototypes -Wpointer-arith
-Wdeclaration-after-statement -Wendif-labels
-Wmissing-format-attribute -Wformat-security -fno-strict-aliasing
-fwrapv -g -O0 -I../../../src/include -D_GNU_SOURCE   -c -o
tablespace.o tablespace.c
gcc -Wall -Wmissing-prototypes -Wpointer-arith
-Wdeclaration-after-statement -Wendif-labels
-Wmissing-format-attribute -Wformat-security -fno-strict-aliasing
-fwrapv -g -O0 -I../../../src/include -D_GNU_SOURCE   -c -o
instrument.o instrument.c
make[4]: Leaving directory `/home/postgres/pgsql/git/src/backend/access/heap'
make[3]: *** [heap-recursive] Error 2
make[3]: Leaving directory `/home/postgres/pgsql/git/src/backend/access'
make[2]: *** [access-recursive] Error 2
make[2]: *** Waiting for unfinished jobs....

Regards,

-- 
Fujii Masao



Re: Freeze avoidance of very large table.

From
Sawada Masahiko
Date:
On Fri, Jul 10, 2015 at 10:43 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
> On Fri, Jul 10, 2015 at 2:41 PM, Sawada Masahiko <sawada.mshk@gmail.com> wrote:
>> On Fri, Jul 10, 2015 at 3:05 AM, Sawada Masahiko <sawada.mshk@gmail.com> wrote:
>>>
>>> Also something for pg_upgrade is also not yet.
>>>
>>> TODO
>>> - Test case for this feature
>>> - pg_upgrade support.
>>>
>>
>> I had forgotten to change the fork name of visibility map to "vfm".
>> Attached latest v7 patch.
>> Please review it.
>
> The compilation failed on my machine...
>
> gcc -Wall -Wmissing-prototypes -Wpointer-arith
> -Wdeclaration-after-statement -Wendif-labels
> -Wmissing-format-attribute -Wformat-security -fno-strict-aliasing
> -fwrapv -g -O0 -I../../../../src/include -D_GNU_SOURCE   -c -o
> visibilitymap.o visibilitymap.c
> make[4]: *** No rule to make target `heapfuncs.o', needed by
> `objfiles.txt'.  Stop.
> make[4]: *** Waiting for unfinished jobs....
> ( echo src/backend/access/index/genam.o
> src/backend/access/index/indexam.o ) >objfiles.txt
> make[4]: Leaving directory `/home/postgres/pgsql/git/src/backend/access/index'
> gcc -Wall -Wmissing-prototypes -Wpointer-arith
> -Wdeclaration-after-statement -Wendif-labels
> -Wmissing-format-attribute -Wformat-security -fno-strict-aliasing
> -fwrapv -g -O0 -I../../../src/include -D_GNU_SOURCE   -c -o
> tablespace.o tablespace.c
> gcc -Wall -Wmissing-prototypes -Wpointer-arith
> -Wdeclaration-after-statement -Wendif-labels
> -Wmissing-format-attribute -Wformat-security -fno-strict-aliasing
> -fwrapv -g -O0 -I../../../src/include -D_GNU_SOURCE   -c -o
> instrument.o instrument.c
> make[4]: Leaving directory `/home/postgres/pgsql/git/src/backend/access/heap'
> make[3]: *** [heap-recursive] Error 2
> make[3]: Leaving directory `/home/postgres/pgsql/git/src/backend/access'
> make[2]: *** [access-recursive] Error 2
> make[2]: *** Waiting for unfinished jobs....
>

Oops, I had forgotten to add the new file heapfuncs.c.
The latest patch is attached.

Regards,

--
Sawada Masahiko

Attachment

Re: Freeze avoidance of very large table.

From
Jim Nasby
Date:
On 7/10/15 4:46 AM, Simon Riggs wrote:
> On 10 July 2015 at 09:49, Sawada Masahiko <sawada.mshk@gmail.com
> <mailto:sawada.mshk@gmail.com>> wrote:
>
>
>     > It is a minor thing, but if there is no other use for this fourth
>     > "bit-space", it seems a shame to waste it when there is some use for it.  I
>     > haven't looked at the code around this area to know how hard it would be to
>     > implement the setting and clearing of the bit.
>
>     I think so too, we would be able to use unused fourth status of bits
>     efficiently.
>     Should I include these improvement into this patch?
>     This topic should be discussed on another thread after this feature is
>     committed, I think.
>
>
> The impossible state acts as a diagnostic check for us to ensure the
> bitmap is not itself corrupt.
>
> -1 for using it for another purpose.

AFAICS the empty-page state is only interesting for vacuum truncation,
which is a very short-term thing. It would be better to find a way to
handle that differently.

In any case, that should definitely be a separate discussion from this 
patch.
-- 
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Data in Trouble? Get it in Treble! http://BlueTreble.com



Re: Freeze avoidance of very large table.

From
Sawada Masahiko
Date:
On Tue, Jul 7, 2015 at 9:07 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> On 6 July 2015 at 17:28, Simon Riggs <simon@2ndquadrant.com> wrote:
>
>> I think we need something for pg_upgrade to rewrite existing VMs.
>> Otherwise a large read only database would suddenly require a massive
>> revacuum after upgrade, which seems bad. That can wait for now until we all
>> agree this patch is sound.
>
>
> Since we need to rewrite the "vm" map, I think we should call the new map
> "vfm"
>
> That way we will be able to easily check whether the rewrite has been
> conducted on all relations.
>
> Since the maps are just bits there is no other way to tell that a map has
> been rewritten

To avoid re-vacuuming after upgrade, you mean that we need to rewrite
each bit of the vm to the corresponding bits of the vfm if it comes from
a version that doesn't support the vfm (i.e., 9.5 or earlier), right?
If so, we will need to scan the whole table, which is expensive as well.
Clearing the vm and re-vacuuming would be nicer than doing it during the
upgrade, I think.

Regards,

--
Masahiko Sawada



Re: Freeze avoidance of very large table.

From
Amit Kapila
Date:
On Mon, Jul 13, 2015 at 3:39 PM, Sawada Masahiko <sawada.mshk@gmail.com> wrote:
>
> On Tue, Jul 7, 2015 at 9:07 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> > On 6 July 2015 at 17:28, Simon Riggs <simon@2ndquadrant.com> wrote:
> >
> >> I think we need something for pg_upgrade to rewrite existing VMs.
> >> Otherwise a large read only database would suddenly require a massive
> >> revacuum after upgrade, which seems bad. That can wait for now until we all
> >> agree this patch is sound.
> >
> >
> > Since we need to rewrite the "vm" map, I think we should call the new map
> > "vfm"
> >
> > That way we will be able to easily check whether the rewrite has been
> > conducted on all relations.
> >
> > Since the maps are just bits there is no other way to tell that a map has
> > been rewritten
>
> To avoid revacuum after upgrade, you meant that we need to rewrite
> each bit of vm to corresponding bits of vfm, if it's from
> not-supporting vfm version(i.g., 9.5 or earlier ). right?
> If so, we will need to do whole scanning table, which is expensive as well.
> Clearing vm and do revacuum would be nice, rather than doing in
> upgrading, I think.
>

How will you ensure a re-vacuum happens for all the tables after
upgrading?  Until vacuum has been run on the tables that
had a vm before the upgrade, queries on those tables can
be slower.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: Freeze avoidance of very large table.

From
Sawada Masahiko
Date:
On Mon, Jul 13, 2015 at 7:46 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Mon, Jul 13, 2015 at 3:39 PM, Sawada Masahiko <sawada.mshk@gmail.com>
> wrote:
>>
>> On Tue, Jul 7, 2015 at 9:07 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
>> > On 6 July 2015 at 17:28, Simon Riggs <simon@2ndquadrant.com> wrote:
>> >
>> >> I think we need something for pg_upgrade to rewrite existing VMs.
>> >> Otherwise a large read only database would suddenly require a massive
>> >> revacuum after upgrade, which seems bad. That can wait for now until we
>> >> all
>> >> agree this patch is sound.
>> >
>> >
>> > Since we need to rewrite the "vm" map, I think we should call the new
>> > map
>> > "vfm"
>> >
>> > That way we will be able to easily check whether the rewrite has been
>> > conducted on all relations.
>> >
>> > Since the maps are just bits there is no other way to tell that a map
>> > has
>> > been rewritten
>>
>> To avoid revacuum after upgrade, you meant that we need to rewrite
>> each bit of vm to corresponding bits of vfm, if it's from
>> not-supporting vfm version(i.g., 9.5 or earlier ). right?
>> If so, we will need to do whole scanning table, which is expensive as
>> well.
>> Clearing vm and do revacuum would be nice, rather than doing in
>> upgrading, I think.
>>
>
> How will you ensure to have revacuum for all the tables after
> upgrading?

We can use the script files which are generated by pg_upgrade.

>  Till the time Vacuum is done on the tables that
> have vm before upgrade, any queries on those tables can
> become slower.

Even if we implement a vm rewriting tool in pg_upgrade, it will take as
much time as a re-vacuum, because it needs to scan the whole table.
I meant that we rewrite the vm using an existing facility (i.e., VACUUM
(FREEZE)), instead of implementing a new rewriting tool for the vm.

Regards,

--
Masahiko Sawada



Re: Freeze avoidance of very large table.

From
Andres Freund
Date:
On 2015-07-13 21:03:07 +0900, Sawada Masahiko wrote:
> Even If we implement rewriting tool for vm into pg_upgrade, it will
> take time as much as revacuum because it need whole scanning table.

Why would it? Sure, you can only set allvisible and not the frozen bit,
but that's fine. That way the cost for freezing can be paid over time.

If we require terabytes of data to be scanned, including possibly
rewriting large portions due to freezing, before index only scans work
and most vacuums act in a partial manner, the migration to 9.6 will be a
major pain for our users.



Re: Freeze avoidance of very large table.

From
Michael Paquier
Date:
On Mon, Jul 13, 2015 at 9:03 PM, Sawada Masahiko <sawada.mshk@gmail.com> wrote:
> On Mon, Jul 13, 2015 at 7:46 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> On Mon, Jul 13, 2015 at 3:39 PM, Sawada Masahiko <sawada.mshk@gmail.com>
>> wrote:
>>>
>>> On Tue, Jul 7, 2015 at 9:07 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
>>> > On 6 July 2015 at 17:28, Simon Riggs <simon@2ndquadrant.com> wrote:
>>> >
>>> >> I think we need something for pg_upgrade to rewrite existing VMs.
>>> >> Otherwise a large read only database would suddenly require a massive
>>> >> revacuum after upgrade, which seems bad. That can wait for now until we
>>> >> all
>>> >> agree this patch is sound.
>>> >
>>> >
>>> > Since we need to rewrite the "vm" map, I think we should call the new
>>> > map
>>> > "vfm"
>>> >
>>> > That way we will be able to easily check whether the rewrite has been
>>> > conducted on all relations.
>>> >
>>> > Since the maps are just bits there is no other way to tell that a map
>>> > has
>>> > been rewritten
>>>
>>> To avoid revacuum after upgrade, you meant that we need to rewrite
>>> each bit of vm to corresponding bits of vfm, if it's from
>>> not-supporting vfm version(i.g., 9.5 or earlier ). right?
>>> If so, we will need to do whole scanning table, which is expensive as
>>> well.
>>> Clearing vm and do revacuum would be nice, rather than doing in
>>> upgrading, I think.
>>>
>>
>> How will you ensure to have revacuum for all the tables after
>> upgrading?
>
> We use script file which are generated by pg_upgrade.

I haven't followed this thread closely, but I am sure you recall that
vacuumdb has a parallel mode.
-- 
Michael



Re: Freeze avoidance of very large table.

From
Sawada Masahiko
Date:
On Mon, Jul 13, 2015 at 9:22 PM, Andres Freund <andres@anarazel.de> wrote:
> On 2015-07-13 21:03:07 +0900, Sawada Masahiko wrote:
>> Even If we implement rewriting tool for vm into pg_upgrade, it will
>> take time as much as revacuum because it need whole scanning table.
>
> Why would it? Sure, you can only set allvisible and not the frozen bit,
> but that's fine. That way the cost for freezing can be paid over time.
>
> If we require terrabytes of data to be scanned, including possibly
> rewriting large portions due to freezing, before index only scans work
> and most vacuums act in a partial manner the migration to 9.6 will be a
> major pain for our users.

Ah, if we set all bits as not all-frozen, we don't need to scan the
whole table, only the vm.
And I agree with this.

But please imagine the case where the old cluster has a table which is
very large, read-only, and on which vacuum freeze has already been done.
In this case, the all-frozen bits of such a table in the new cluster
will not be set unless we do vacuum freeze again.
The all-frozen information for such a table is lost.

Regards,

--
Masahiko Sawada



Re: Freeze avoidance of very large table.

From
Simon Riggs
Date:
On 13 July 2015 at 15:48, Sawada Masahiko <sawada.mshk@gmail.com> wrote:
On Mon, Jul 13, 2015 at 9:22 PM, Andres Freund <andres@anarazel.de> wrote:
> On 2015-07-13 21:03:07 +0900, Sawada Masahiko wrote:
>> Even If we implement rewriting tool for vm into pg_upgrade, it will
>> take time as much as revacuum because it need whole scanning table.
>
> Why would it? Sure, you can only set allvisible and not the frozen bit,
> but that's fine. That way the cost for freezing can be paid over time.
>
> If we require terrabytes of data to be scanned, including possibly
> rewriting large portions due to freezing, before index only scans work
> and most vacuums act in a partial manner the migration to 9.6 will be a
> major pain for our users.

Ah, if we set all bits as not all-frozen, we don't need to scan the
whole table, only the vm.
And I agree with this.

But please imagine the case where the old cluster has a table which is
very large, read-only, and on which vacuum freeze has already been done.
In this case, the all-frozen bits of such a table in the new cluster
will not be set unless we do vacuum freeze again.
The all-frozen information for such a table is lost.

The contents of the VM fork are essential to retain after an upgrade because they are used for Index Only Scans. If we destroy that information it could send SQL response times to unacceptable levels after upgrade.

It takes time to scan the VM and create the new VFM, but the time taken is proportional to the size of the VM, which seems like it will be acceptable.

Example calcs:
An 8TB PostgreSQL installation would need us to scan 128MB of VM into about 256MB of VFM. Probably the fsyncs will occupy the most time.
In comparison, we would need to scan all 8TB to rebuild the VMs, which will take much longer (and fsyncs will still be needed).

Since we don't record freeze map information now it is acceptable to begin after upgrade with all freeze info set to zero. 
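
(A standalone sketch of that stretch rewrite, assuming the old format has
one all-visible bit per page and the new format two bits per page; the
exact bit ordering is illustrative, not pg_upgrade's actual code:)

#include <stdint.h>
#include <stdio.h>

/*
 * One old-format byte (8 pages x 1 all-visible bit) becomes two new-format
 * bytes (4 pages x 2 bits).  Every all-frozen bit starts at zero, so the
 * freezing cost is paid later, over time.
 */
static void rewrite_vm_byte(uint8_t old_byte, uint8_t out[2])
{
    out[0] = out[1] = 0;
    for (int page = 0; page < 8; page++)
        if (old_byte & (1 << page))
            out[page / 4] |= 1 << ((page % 4) * 2);
}

int main(void)
{
    uint8_t out[2];

    rewrite_vm_byte(0xFF, out);                 /* all eight pages all-visible */
    printf("0x%02X 0x%02X\n", out[0], out[1]);  /* prints 0x55 0x55 */
    return 0;
}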

--
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Freeze avoidance of very large table.

From
Andres Freund
Date:
On 2015-07-13 23:48:02 +0900, Sawada Masahiko wrote:
> But please image the case where old cluster has table which is very
> large, read-only and vacuum freeze is done.
> In this case, the all-frozen bit of such table in new cluster will not
> set, unless we do vacuum freeze again.
> The information of all-frozen of such table is lacked.

So what? That's the situation today… Yes, it'll trigger an
anti-wraparound vacuum at some later point; after that, the map bits
will be set.



Re: Freeze avoidance of very large table.

From
Simon Riggs
Date:
On 10 July 2015 at 15:11, Sawada Masahiko <sawada.mshk@gmail.com> wrote:

Oops, I had forgotten to add the new file heapfuncs.c.
The latest patch is attached.

I think we've established that the approach is desirable and defined the way forward for this, so this is looking good.

Some of my requests haven't been actioned yet, so I personally would not commit this yet. I am happy to continue as reviewer/committer unless others wish to take over.

The main missing item is pg_upgrade support, which won't happen by end of CF1, so I am marking this as Returned With Feedback. Hopefully we can review this again before CF2.

--
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Freeze avoidance of very large table.

From
Alvaro Herrera
Date:
Michael Paquier wrote:
> On Mon, Jul 13, 2015 at 9:03 PM, Sawada Masahiko <sawada.mshk@gmail.com> wrote:

> > We use script file which are generated by pg_upgrade.
> 
> I haven't followed this thread closely, but I am sure you recall that
> vacuumdb has a parallel mode.

I think having to vacuum the whole database during pg_upgrade (or
immediately thereafter, which in practice means that the database is
unusable for queries until that has finished) is way too impractical.
Even in parallel mode, it could take far too long.  People already
complain that our upgrading procedure takes too long as opposed to that
of other database systems.

I don't think there's any problem with rewriting the existing server's
VM file into "vfm" format during pg_upgrade, since we expect those files
to be much smaller than the data itself.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Freeze avoidance of very large table.

From
Sawada Masahiko
Date:
On Wed, Jul 15, 2015 at 12:55 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> On 10 July 2015 at 15:11, Sawada Masahiko <sawada.mshk@gmail.com> wrote:
>>
>>
>> Oops, I had forgotten to add new file heapfuncs.c.
>> Latest patch is attached.
>
>
> I think we've established the approach is desirable and defined the way
> forwards for this, so this is looking good.

If we want to move stuff like pg_stattuple, pg_freespacemap into core,
we could move them into heapfuncs.c.

> Some of my requests haven't been actioned yet, so I personally would not
> commit this yet. I am happy to continue as reviewer/committer unless others
> wish to take over.
> The main missing item is pg_upgrade support, which won't happen by end of
> CF1, so I am marking this as Returned With Feedback. Hopefully we can review
> this again before CF2.

I appreciate your taking the time to review.
Yeah, the pg_upgrade support and the regression test for the VFM patch
are almost done now; I will submit the patch this week after testing it.

Regards,

--
Masahiko Sawada



Re: Freeze avoidance of very large table.

From
Sawada Masahiko
Date:
On Wed, Jul 15, 2015 at 3:07 AM, Sawada Masahiko <sawada.mshk@gmail.com> wrote:
> On Wed, Jul 15, 2015 at 12:55 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
>> On 10 July 2015 at 15:11, Sawada Masahiko <sawada.mshk@gmail.com> wrote:
>>>
>>>
>>> Oops, I had forgotten to add new file heapfuncs.c.
>>> Latest patch is attached.
>>
>>
>> I think we've established the approach is desirable and defined the way
>> forwards for this, so this is looking good.
>
> If we want to move stuff like pg_stattuple, pg_freespacemap into core,
> we could move them into heapfuncs.c.
>
>> Some of my requests haven't been actioned yet, so I personally would not
>> commit this yet. I am happy to continue as reviewer/committer unless others
>> wish to take over.
>> The main missing item is pg_upgrade support, which won't happen by end of
>> CF1, so I am marking this as Returned With Feedback. Hopefully we can review
>> this again before CF2.
>
> I appreciate your reviewing.
> Yeah, the pg_upgrade support and regression test for VFM patch is
> almost done now, I will submit the patch in this week after testing it
> .

The attached patch is the latest v9 patch.

I added:
- regression test for visibility map (visibilitymap.sql and
visibilitymap.out files)
- pg_upgrade support (rewriting vm file to vfm file)
- regression test for pg_upgrade

Please review it.

Regards,

--
Masahiko Sawada

Attachment

Re: Freeze avoidance of very large table.

From
Masahiko Sawada
Date:
On Thu, Jul 16, 2015 at 8:51 PM, Sawada Masahiko <sawada.mshk@gmail.com> wrote:
> On Wed, Jul 15, 2015 at 3:07 AM, Sawada Masahiko <sawada.mshk@gmail.com> wrote:
>> On Wed, Jul 15, 2015 at 12:55 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
>>> On 10 July 2015 at 15:11, Sawada Masahiko <sawada.mshk@gmail.com> wrote:
>>>>
>>>>
>>>> Oops, I had forgotten to add new file heapfuncs.c.
>>>> Latest patch is attached.
>>>
>>>
>>> I think we've established the approach is desirable and defined the way
>>> forwards for this, so this is looking good.
>>
>> If we want to move stuff like pg_stattuple, pg_freespacemap into core,
>> we could move them into heapfuncs.c.
>>
>>> Some of my requests haven't been actioned yet, so I personally would not
>>> commit this yet. I am happy to continue as reviewer/committer unless others
>>> wish to take over.
>>> The main missing item is pg_upgrade support, which won't happen by end of
>>> CF1, so I am marking this as Returned With Feedback. Hopefully we can review
>>> this again before CF2.
>>
>> I appreciate your reviewing.
>> Yeah, the pg_upgrade support and regression test for VFM patch is
>> almost done now, I will submit the patch in this week after testing it
>> .
>
> Attached patch is latest v9 patch.
>
> I added:
> - regression test for visibility map (visibilitymap.sql and
> visibilitymap.out files)
> - pg_upgrade support (rewriting vm file to vfm file)
> - regression test for pg_upgrade
>

The previous patch failed to apply in places, so I've attached a
rebased patch.
The catalog version is not decided yet, so we will need to update
VISIBILITY_MAP_FROZEN_BIT_CAT_VER in pg_upgrade.h.
Please review it.

Regards,

--
Masahiko Sawada

Attachment

Re: Freeze avoidance of very large table.

From
Bruce Momjian
Date:
On Wed, Jul  8, 2015 at 02:31:04PM +0100, Simon Riggs wrote:
> On 7 July 2015 at 18:45, Sawada Masahiko <sawada.mshk@gmail.com> wrote:
> 
>     On Wed, Jul 8, 2015 at 12:37 AM, Andres Freund <andres@anarazel.de> wrote:
>     > On 2015-07-07 16:25:13 +0100, Simon Riggs wrote:
>     >> I don't think pg_freespacemap is the right place.
>     >
>     > I agree that pg_freespacemap sounds like an odd location.
>     >
>     >> I'd prefer to add that as a single function into core, so we can write
>     >> formal tests.
>     >
>     > With the advent of src/test/modules it's not really a prerequisite for
>     > things to be builtin to be testable. I think there's fair arguments for
>     > moving stuff like pg_stattuple, pg_freespacemap, pg_buffercache into
>     > core at some point, but that's probably a separate discussion.
>     >
> 
>     I understood.
>     So I will place bunch of test like src/test/module/visibilitymap_test,
>     which contains  some tests regarding this feature,
>     and gather them into one patch.
> 
> 
> Please place it in core. I see value in having a diagnostic function for
> general use on production systems.

Sorry to be coming to this discussion late.

I understand the desire for a diagnostic function in core, but we have
to be consistent.  Just because we are adding this function now doesn't
mean we should use different rules from what we did previously for
diagnostic functions.  Either there is logic to why this function is
different from the other diagnostic functions in contrib, or we need to
have a separate discussion of whether diagnostic functions belong in
contrib or core.

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com
  + Everyone has their own god. +



Re: Freeze avoidance of very large table.

From
Alvaro Herrera
Date:
Bruce Momjian wrote:

> I understand the desire for a diagnostic function in core, but we have
> to be consistent.  Just because we are adding this function now doesn't
> mean we should use different rules from what we did previously for
> diagnostic functions.  Either their is logic to why this function is
> different from the other diagnostic functions in contrib, or we need to
> have a separate discussion of whether diagnostic functions belong in
> contrib or core.

Then let's start moving some extensions to src/extension/.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Freeze avoidance of very large table.

From
Robert Haas
Date:
On Wed, Aug 5, 2015 at 12:36 PM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
> Bruce Momjian wrote:
>> I understand the desire for a diagnostic function in core, but we have
>> to be consistent.  Just because we are adding this function now doesn't
>> mean we should use different rules from what we did previously for
>> diagnostic functions.  Either their is logic to why this function is
>> different from the other diagnostic functions in contrib, or we need to
>> have a separate discussion of whether diagnostic functions belong in
>> contrib or core.
>
> Then let's start moving some extensions to src/extension/.

That seems like yet another separate issue.

FWIW, it seems to me that we've done a heck of a lot of moving stuff
out of contrib over the last few releases.  A bunch of things moved to
src/test/modules and a bunch of things went to src/bin.  We can move
more, of course, but this code reorganization has non-trivial costs
and I'm not clear what benefits we hope to realize and whether we are
in fact realizing those benefits.  At this point, the overwhelming
majority of what's in contrib is extensions; we're not far from being
able to put the whole thing in src/extensions if it really needs to be
moved at all.

But I don't think it's fair to conflate that with Bruce's question,
which it seems to me is both a fair question and a different one.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Freeze avoidance of very large table.

From
Alvaro Herrera
Date:
Robert Haas wrote:
> On Wed, Aug 5, 2015 at 12:36 PM, Alvaro Herrera
> <alvherre@2ndquadrant.com> wrote:
> > Bruce Momjian wrote:
> >> I understand the desire for a diagnostic function in core, but we have
> >> to be consistent.  Just because we are adding this function now doesn't
> >> mean we should use different rules from what we did previously for
> >> diagnostic functions.  Either their is logic to why this function is
> >> different from the other diagnostic functions in contrib, or we need to
> >> have a separate discussion of whether diagnostic functions belong in
> >> contrib or core.
> >
> > Then let's start moving some extensions to src/extension/.
> 
> That seems like yet another separate issue.
> 
> FWIW, it seems to me that we've done a heck of a lot of moving stuff
> out of contrib over the last few releases.  A bunch of things moved to
> src/test/modules and a bunch of things went to src/bin.  We can move
> more, of course, but this code reorganization has non-trivial costs
> and I'm not clear what benefits we hope to realize and whether we are
> in fact realizing those benefits.  At this point, the overwhelming
> majority of what's in contrib is extensions; we're not far from being
> able to put the whole thing in src/extensions if it really needs to be
> moved at all.

There are a number of things in contrib that are not extensions, and
others are not core-quality yet.  I don't think we should move
everything; at least not everything in one go.  I think there are a
small number of diagnostic extensions that would be useful to have in
core (pageinspect, pg_buffercache, pg_stat_statements).

> But I don't think it's fair to conflate that with Bruce's question,
> which it seems to me is both a fair question and a different one.

Well, there was no question as such.  If the question is "should we
instead put it in contrib just to be consistent?" then I think the
answer is no.  I value consistency as much as every other person, but
there are other things I value more, such as availability.  If stuff is
in contrib and servers don't have it installed because of package
policies and it takes three management layers' approval to get it
installed in a dying server, then I prefer to have it in core.

If the question was "why are we not using the rule we previously had
that diagnostic tools were in contrib?" then I think the answer is that
we have evolved and we now know better.  We have evolved in the sense
that we have more stuff in production now that needs better diagnostic
tooling to be available; and we know better now in the sense that we
have realized there's this company policy bureaucracy that things in
contrib are not always available for reasons that are beyond us.

Anyway, the patch as proposed puts the new functions in core as builtins
(which is what Bruce seems to be objecting to).  Maybe instead of
proposing moving existing extensions in core, it would be better to have
this patch put those two new functions alone as a single new extension
in src/extension, and not move anything else.  I don't necessarily
resist adding these functions as builtins, but if we do that then
there's no going back to having them as an extension instead, which is
presumably more in line with what we want in the long run.

(It would be a shame to delay this patch, which messes with complex
innards, just because of a discussion about the placement of two
smallish diagnostic functions.)

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Freeze avoidance of very large table.

From
Josh Berkus
Date:
On 08/05/2015 10:00 AM, Alvaro Herrera wrote:
> Anyway, the patch as proposed puts the new functions in core as builtins
> (which is what Bruce seems to be objecting to).  Maybe instead of
> proposing moving existing extensions in core, it would be better to have
> this patch put those two new functions alone as a single new extension
> in src/extension, and not move anything else.  I don't necessarily
> resist adding these functions as builtins, but if we do that then
> there's no going back to having them as an extension instead, which is
> presumably more in line with what we want in the long run.

For my part, I am unclear on why we are putting *any* diagnostic tools
in /contrib today.  Either the diagnostic tools are good quality and
necessary for a bunch of users, in which case we ship them in core, or
they are obscure and/or untested, in which case they go in an external
project and/or on PGXN.

Yes, for tools with overhead we might want to require enabling them in
pg.conf.  But that's very different from requiring the user to install a
separate package.
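
For example, pg_stat_statements today already works that way on both
counts: it has to be enabled in pg.conf (or via ALTER SYSTEM) *and*
installed as a separately packaged extension:

    ALTER SYSTEM SET shared_preload_libraries = 'pg_stat_statements';
    -- server restart required, then:
    CREATE EXTENSION pg_stat_statements;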

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com



Re: Freeze avoidance of very large table.

From
Bruce Momjian
Date:
On Wed, Aug  5, 2015 at 10:22:48AM -0700, Josh Berkus wrote:
> On 08/05/2015 10:00 AM, Alvaro Herrera wrote:
> > Anyway, the patch as proposed puts the new functions in core as builtins
> > (which is what Bruce seems to be objecting to).  Maybe instead of
> > proposing moving existing extensions in core, it would be better to have
> > this patch put those two new functions alone as a single new extension
> > in src/extension, and not move anything else.  I don't necessarily
> > resist adding these functions as builtins, but if we do that then
> > there's no going back to having them as an extension instead, which is
> > presumably more in line with what we want in the long run.
> 
> For my part, I am unclear on why we are putting *any* diagnostic tools
> in /contrib today.  Either the diagnostic tools are good quality and
> necessary for a bunch of users, in which case we ship them in core, or
> they are obscure and/or untested, in which case they go in an external
> project and/or on PGXN.
> 
> Yes, for tools with overhead we might want to require enabling them in
> pg.conf.  But that's very different from requiring the user to install a
> separate package.

I don't care what we do, but I do think we should be consistent. 
Frankly I am unclear why I am even having to make this point, as cases
where we have chosen expediency over consistency have served us badly in
the past.

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com
  + Everyone has their own god. +



Re: Freeze avoidance of very large table.

From
Josh Berkus
Date:
On 08/05/2015 10:26 AM, Bruce Momjian wrote:
> On Wed, Aug  5, 2015 at 10:22:48AM -0700, Josh Berkus wrote:
>> On 08/05/2015 10:00 AM, Alvaro Herrera wrote:
>>> Anyway, the patch as proposed puts the new functions in core as builtins
>>> (which is what Bruce seems to be objecting to).  Maybe instead of
>>> proposing moving existing extensions in core, it would be better to have
>>> this patch put those two new functions alone as a single new extension
>>> in src/extension, and not move anything else.  I don't necessarily
>>> resist adding these functions as builtins, but if we do that then
>>> there's no going back to having them as an extension instead, which is
>>> presumably more in line with what we want in the long run.
>>
>> For my part, I am unclear on why we are putting *any* diagnostic tools
>> in /contrib today.  Either the diagnostic tools are good quality and
>> necessary for a bunch of users, in which case we ship them in core, or
>> they are obscure and/or untested, in which case they go in an external
>> project and/or on PGXN.
>>
>> Yes, for tools with overhead we might want to require enabling them in
>> pg.conf.  But that's very different from requiring the user to install a
>> separate package.
> 
> I don't care what we do, but I do think we should be consistent. 
> Frankly I am unclear why I am even having to make this point, as cases
> where we have chosen expediency over consistency have served us badly in
> the past.

Saying "it's stupid to be consistent with a bad old rule", and making a
new rule is not "expediency".

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com



Re: Freeze avoidance of very large table.

From
Alvaro Herrera
Date:
Josh Berkus wrote:
> On 08/05/2015 10:26 AM, Bruce Momjian wrote:

> > I don't care what we do, but I do think we should be consistent. 
> > Frankly I am unclear why I am even having to make this point, as cases
> > where we have chosen expediency over consistency have served us badly in
> > the past.
> 
> Saying "it's stupid to be consistent with a bad old rule", and making a
> new rule is not "expediency".

So I discussed this with Bruce on IM a bit.  I think there are basically
four ways we could go about this:

1. Add the functions as builtins.
   This is what the current patch does.  Simon seems to prefer this,
   because he wants the function to be always available in production;
   but I don't like this option because adding functions as builtins
   makes it impossible to move them later to extensions.
   Bruce doesn't like this option either.

2. Add the functions to contrib, keep them there for the foreseeable
   future.
   Simon is against this option, because the functions will be
   unavailable when needed in production.  I am of the same position.
   Bruce opines this option is acceptable.

3. a) Add the functions to some extension in contrib now, by using a
      slightly modified version of the current patch, and
   b) Apply some later patch to move said extension to src/extension.

4. a) Patch some extension(s) to move it to src/extension,
   b) Apply a version of this patch that adds the new functions to said
      extension.
 

Essentially 3 and 4 are the same thing except the order is reversed;
they both result in the functions being shipped in some "core extension"
(a concept we do not have today).  Bruce says either of these is fine
with him.  I am fine with either of them also.  As long as we do 3b
during 9.6 timeframe, the outcome of either 3 and 4 seems to be
acceptable for Simon also.

Robert seems to be saying that he doesn't care about moving extensions
to core at all.

What do others think?

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Freeze avoidance of very large table.

From
Josh Berkus
Date:
On 08/05/2015 10:46 AM, Alvaro Herrera wrote:
> 1. Add the functions as a builtins.
>    This is what the current patch does.  Simon seems to prefer this,
>    because he wants the function to be always available in production;
>    but I don't like this option because adding functions as builtins
>    makes it impossible to move later to extensions.
>    Bruce doesn't like this option either.

Why would we want to move them later to extensions?  Do you anticipate
not needing them in the future?  If we don't need them in the future,
why would they continue to exist at all?

I'm really not getting this.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com



Re: Freeze avoidance of very large table.

From
Alvaro Herrera
Date:
Josh Berkus wrote:
> On 08/05/2015 10:46 AM, Alvaro Herrera wrote:
> > 1. Add the functions as a builtins.
> >    This is what the current patch does.  Simon seems to prefer this,
> >    because he wants the function to be always available in production;
> >    but I don't like this option because adding functions as builtins
> >    makes it impossible to move later to extensions.
> >    Bruce doesn't like this option either.
> 
> Why would we want to move them later to extensions?

Because it's not nice to have random stuff as builtins.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Freeze avoidance of very large table.

From
Petr Jelinek
Date:
On 2015-08-05 20:09, Alvaro Herrera wrote:
> Josh Berkus wrote:
>> On 08/05/2015 10:46 AM, Alvaro Herrera wrote:
>>> 1. Add the functions as a builtins.
>>>     This is what the current patch does.  Simon seems to prefer this,
>>>     because he wants the function to be always available in production;
>>>     but I don't like this option because adding functions as builtins
>>>     makes it impossible to move later to extensions.
>>>     Bruce doesn't like this option either.
>>
>> Why would we want to move them later to extensions?
>
> Because it's not nice to have random stuff as builtins.
>

Extensions have one nice property: they provide namespacing, so not
everything has to be in pg_catalog, which already has about a gazillion
functions. It's nice to have stuff you don't need for day-to-day
operations separate but still available (which is why src/extensions is
better than contrib).

--
  Petr Jelinek                  http://www.2ndQuadrant.com/
  PostgreSQL Development, 24x7 Support, Training & Services



Re: Freeze avoidance of very large table.

From
Bruce Momjian
Date:
On Wed, Aug  5, 2015 at 10:58:00AM -0700, Josh Berkus wrote:
> On 08/05/2015 10:46 AM, Alvaro Herrera wrote:
> > 1. Add the functions as a builtins.
> >    This is what the current patch does.  Simon seems to prefer this,
> >    because he wants the function to be always available in production;
> >    but I don't like this option because adding functions as builtins
> >    makes it impossible to move later to extensions.
> >    Bruce doesn't like this option either.
> 
> Why would we want to move them later to extensions?  Do you anticipate
> not needing them in the future?  If we don't need them in the future,
> why would they continue to exist at all?
> 
> I'm really not getting this.

This is why I suggested putting the new SQL function where it belongs
for consistency, and then opening a separate thread to discuss the
future of where we want diagnostic functions to be.  It is too
complicated to talk about both issues in the same thread.

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com
  + Everyone has their own god. +



Re: Freeze avoidance of very large table.

From
Alvaro Herrera
Date:
Bruce Momjian wrote:

> This is why I suggested putting the new SQL function where it belongs
> for consistency and then open a separate thread to discuss the future of
> where we want diagnostic functions to be.  It is too complicated to talk
> about both issues in the same thread.

Oh come on -- gimme a break.  We figure out much more complicated
problems in single threads all the time.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Freeze avoidance of very large table.

From
Bruce Momjian
Date:
On Wed, Aug  5, 2015 at 11:57:48PM -0300, Alvaro Herrera wrote:
> Bruce Momjian wrote:
> 
> > This is why I suggested putting the new SQL function where it belongs
> > for consistency and then open a separate thread to discuss the future of
> > where we want diagnostic functions to be.  It is too complicated to talk
> > about both issues in the same thread.
> 
> Oh come on -- gimme a break.  We figure out much more complicated
> problems in single threads all the time.

Well, people are confused, as stated --- what more can I say?

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com
  + Everyone has their own god. +



Re: Freeze avoidance of very large table.

From
Jim Nasby
Date:
On 8/5/15 1:47 PM, Petr Jelinek wrote:
> On 2015-08-05 20:09, Alvaro Herrera wrote:
>> Josh Berkus wrote:
>>> On 08/05/2015 10:46 AM, Alvaro Herrera wrote:
>>>> 1. Add the functions as a builtins.
>>>>     This is what the current patch does.  Simon seems to prefer this,
>>>>     because he wants the function to be always available in production;
>>>>     but I don't like this option because adding functions as builtins
>>>>     makes it impossible to move later to extensions.
>>>>     Bruce doesn't like this option either.
>>>
>>> Why would we want to move them later to extensions?
>>
>> Because it's not nice to have random stuff as builtins.
>>
>
> Extensions have one nice property, they provide namespacing so not
> everything has to be in pg_catalog which already has about gazilion
> functions. It's nice to have stuff you don't need for day to day
> operations separate but still available (which is why src/extensions is
> better than contrib).

They also provide a level of control over what is and isn't installed in 
a cluster. Personally, I'd prefer that most users not even be aware of 
the existence of things like pageinspect.
-- 
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Data in Trouble? Get it in Treble! http://BlueTreble.com



Re: Freeze avoidance of very large table.

From
Simon Riggs
Date:
On 5 August 2015 at 18:46, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:

What do others think?

Wow, everything moves when you blink, eh? Sorry I wasn't watching this. Mainly because I was working on some other related thoughts; separate post coming.

1. Most importantly, it needs to be somewhere where we can use the function in a regression test. As I said before, I would not commit this without a formal proof of correctness.

2. I'd also like to be able to make checks on this while we're in production, to ensure we have no bugs. I was trying to learn from earlier mistakes and make sure we are ready with diagnostic tools to allow run-time checks and confirm everything is good. If people feel that means I've asked for something in the wrong place, I am happy to skip that request and place it wherever requested.

--
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Freeze avoidance of very large table.

From
Robert Haas
Date:
On Thu, Aug 6, 2015 at 11:33 AM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote:
> They also provide a level of control over what is and isn't installed in a
> cluster. Personally, I'd prefer that most users not even be aware of the
> existence of things like pageinspect.

+1.

If everybody feels that moving extensions currently stored in contrib
into src/extensions is going to help us somehow, then, uh, OK.  I
can't work up any enthusiasm for that, but I can live with it.

However, I think it's affirmatively bad policy to say that we're going
to put all of our debugging facilities into core because otherwise
some people might not have them installed.  That's depriving users of
the ability to control their environment, and there are good reasons
for some people to want those things not to be installed.  If we
accept the argument "it inconveniences hacker X when Y is not
installed" as a reason to put Y in core, then we can justify putting
anything at all into core.  And I don't think that's right at all.
Extensions are a useful packaging mechanism for functionality that is
useful but not required, and debugging facilities are definitely very
useful but should not be required.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Freeze avoidance of very large table.

From
Michael Paquier
Date:
On Mon, Aug 10, 2015 at 12:39 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Thu, Aug 6, 2015 at 11:33 AM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote:
>> They also provide a level of control over what is and isn't installed in a
>> cluster. Personally, I'd prefer that most users not even be aware of the
>> existence of things like pageinspect.
>
> +1.
>
> [...]
>
> Extensions are a useful packaging mechanism for functionality that is
> useful but not required, and debugging facilities are definitely very
> useful but should not be required.

+1.
-- 
Michael



Re: Freeze avoidance of very large table.

From
Masahiko Sawada
Date:
On Mon, Aug 10, 2015 at 11:05 AM, Michael Paquier
<michael.paquier@gmail.com> wrote:
> On Mon, Aug 10, 2015 at 12:39 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Thu, Aug 6, 2015 at 11:33 AM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote:
>>> They also provide a level of control over what is and isn't installed in a
>>> cluster. Personally, I'd prefer that most users not even be aware of the
>>> existence of things like pageinspect.
>>
>> +1.
>>
>> [...]
>>
>> Extensions are a useful packaging mechanism for functionality that is
>> useful but not required, and debugging facilities are definitely very
>> useful but should not be required.
>
> +1.

Sorry to come to this discussion late.

I have encountered many cases where pg_stat_statements and
pgstattuple are required in production, so I basically agree with
moving such extensions into core.
But IMO, the diagnostic tools for the visibility map, the heap
(pageinspect) and so on are a kind of debugging tool.

Attached are the latest v11 patches, separated into 2 patches: a
frozen-bit patch and a diagnostic-function patch.
Moving the diagnostic functions into core is still under discussion,
but this patch puts them into core because the diagnostic function for
the visibility map needs to be in core to run the regression test, at
least.

Regards,

--
Masahiko Sawada

Attachment

Re: Freeze avoidance of very large table.

From
Robert Haas
Date:
On Tue, Aug 18, 2015 at 7:27 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> I have encountered the much cases where pg_stat_statement,
> pgstattuples are required in production, so I basically agree with
> moving such extension into core.
> But IMO, the diagnostic tools for visibility map, heap (pageinspect)
> and so on, are a kind of debugging tool.

Just because something might be required in production isn't a
sufficient reason to put it in core.  Debugging tools, or anything
else, can be required in production, too.

> Attached latest v11 patches, which is separated into 2 patches: frozen
> bit patch and diagnostic function patch.
> Moving diagnostic function into core is still under the discussion,
> but this patch puts such function into core because the diagnostic
> function for visibility map needs to be in core to execute regression
> test at least.

As has been discussed recently, there are other ways to handle that.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Freeze avoidance of very large table.

From
Masahiko Sawada
Date:
On Wed, Aug 19, 2015 at 1:28 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Tue, Aug 18, 2015 at 7:27 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>> I have encountered the much cases where pg_stat_statement,
>> pgstattuples are required in production, so I basically agree with
>> moving such extension into core.
>> But IMO, the diagnostic tools for visibility map, heap (pageinspect)
>> and so on, are a kind of debugging tool.
>
> Just because something might be required in production isn't a
> sufficient reason to put it in core.  Debugging tools, or anything
> else, can be required in production, too.
>
>> Attached latest v11 patches, which is separated into 2 patches: frozen
>> bit patch and diagnostic function patch.
>> Moving diagnostic function into core is still under the discussion,
>> but this patch puts such function into core because the diagnostic
>> function for visibility map needs to be in core to execute regression
>> test at least.
>
> As has been discussed recently, there are other ways to handle that.

The current regression test for the VM just compares the total numbers
of all-visible and all-frozen bits in the VM before and after VACUUM,
and doesn't check any particular bit in the VM.
We could substitute an ANALYZE command with a large enough sampling
number, checking pg_class.relallvisible and pg_class.relallfrozen.

So another way is to put the diagnostic function for the VM into
something in contrib (pg_freespacemap or pageinspect), and if we want
to use such a function in production, we can install that extension as
in the past.

Regards,

--
Masahiko Sawada



Re: Freeze avoidance of very large table.

From
Jim Nasby
Date:
On 8/19/15 2:56 AM, Masahiko Sawada wrote:
> The currently regression test for VM is that we just compare between
> the total number of all-visible and all-frozen in VM before and after
> VACUUM, and don't check particular a bit in VM.
> we could substitute it to the ANALYZE command with enough sampling
> number and checking pg_class.relallvisible and pg_class.relallfrozen.

I think this is another indication that we need more than just pg_regress...

> So another way is that diagnostic function for VM is put into
> something contrib (pg_freespacemap or pageinspect), and if we want to
> use such function in production, we can install such extension as in
> the past.

pg_buffercache is very useful as a performance monitoring tool, and I 
view being able to pull statistics about the VM and FM the same way. I'd 
like to see us providing more performance information by default, not less.

I think things like pageinspect are very different; I really can't see 
any use for those beyond debugging (and debugging by an expert at that).
-- 
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Data in Trouble? Get it in Treble! http://BlueTreble.com



Re: Freeze avoidance of very large table.

From
Alvaro Herrera
Date:
Jim Nasby wrote:

> I think things like pageinspect are very different; I really can't see any
> use for those beyond debugging (and debugging by an expert at that).

I don't think that necessarily means it must continue to be in contrib.
Quite the contrary, I think it is a tool critical enough that it should
not be relegated to be a second-class citizen as it is now (let's face
it, being in contrib *is* second-class citizenship).

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Freeze avoidance of very large table.

From
Masahiko Sawada
Date:
On Thu, Aug 20, 2015 at 11:46 PM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
> Jim Nasby wrote:
>
>> I think things like pageinspect are very different; I really can't see any
>> use for those beyond debugging (and debugging by an expert at that).
>
> I don't think that necessarily means it must continue to be in contrib.
> Quite the contrary, I think it is a tool critical enough that it should
> not be relegated to be a second-class citizen as it is now (let's face
> it, being in contrib *is* second-class citizenship).
>

Attached is the latest patch.
The VM regression test has been changed so that we test without
diagnostic functions.
In the current patch, we run VACUUM and then VACUUM FREEZE on a table,
and check the values of pg_class.relallvisible and relallfrozen.
When the first VACUUM runs in the regression test, the table has no VM
yet, so VACUUM scans all pages and records exact information about the
number of all-visible bits.
When the second VACUUM FREEZE runs, it also scans all pages, because no
page is marked as all-frozen yet, so VACUUM FREEZE records exact
information about the number of all-frozen bits.

In the previous patch, we checked the VM bits one by one using the
diagnostic function and compared those results against
pg_class.relallvisible(/relallfrozen), so the essential check is the
same as in the previous patch.
We can ensure correctness with this procedure.
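
For illustration, the test idea boils down to something like the
following sketch ("vmtest" is a made-up table name here, and
relallfrozen is the pg_class column added by this patch):

    CREATE TABLE vmtest (i int);
    INSERT INTO vmtest SELECT generate_series(1, 10000);

    VACUUM vmtest;              -- first scan builds the VM
    SELECT relallvisible = relpages AS all_visible
      FROM pg_class WHERE relname = 'vmtest';

    VACUUM FREEZE vmtest;       -- second scan freezes every page
    SELECT relallvisible = relpages AS all_visible,
           relallfrozen  = relpages AS all_frozen
      FROM pg_class WHERE relname = 'vmtest';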

Regards,

--
Masahiko Sawada

Attachment

Re: Freeze avoidance of very large table.

From
Masahiko Sawada
Date:
On Thu, Aug 27, 2015 at 1:54 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> On Thu, Aug 20, 2015 at 11:46 PM, Alvaro Herrera
> <alvherre@2ndquadrant.com> wrote:
>> Jim Nasby wrote:
>>
>>> I think things like pageinspect are very different; I really can't see any
>>> use for those beyond debugging (and debugging by an expert at that).
>>
>> I don't think that necessarily means it must continue to be in contrib.
>> Quite the contrary, I think it is a tool critical enough that it should
>> not be relegated to be a second-class citizen as it is now (let's face
>> it, being in contrib *is* second-class citizenship).
>>
>
> Attached patch is latest patch.

The previous patch lacked some files for the regression test.
Attached is the fixed v12 patch.


Regards,

--
Masahiko Sawada

Attachment

Re: Freeze avoidance of very large table.

From
Robert Haas
Date:
On Thu, Aug 20, 2015 at 10:46 AM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
> Jim Nasby wrote:
>> I think things like pageinspect are very different; I really can't see any
>> use for those beyond debugging (and debugging by an expert at that).
>
> I don't think that necessarily means it must continue to be in contrib.
> Quite the contrary, I think it is a tool critical enough that it should
> not be relegated to be a second-class citizen as it is now (let's face
> it, being in contrib *is* second-class citizenship).

I have resisted that principle for years and will continue to do so.
It is entirely reasonable for some DBAs to want certain functionality
(debugging tools, crypto) to not be installed on their machines.
Folding everything into core is not a good policy, IMHO.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Freeze avoidance of very large table.

From
Alvaro Herrera
Date:
Robert Haas wrote:
> On Thu, Aug 20, 2015 at 10:46 AM, Alvaro Herrera
> <alvherre@2ndquadrant.com> wrote:

> > I don't think that necessarily means it must continue to be in contrib.
> > Quite the contrary, I think it is a tool critical enough that it should
> > not be relegated to be a second-class citizen as it is now (let's face
> > it, being in contrib *is* second-class citizenship).
> 
> I have resisted that principle for years and will continue to do so.
> It is entirely reasonable for some DBAs to want certain functionality
> (debugging tools, crypto) to not be installed on their machines.
> Folding everything into core is not a good policy, IMHO.

I don't understand.  I'm just proposing that the source code for the
extension to live in src/extensions/, and have the shared library
installed by toplevel make install; I'm not suggesting that the
extension is installed automatically.  For that, you still need a
superuser to run CREATE EXTENSION.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Freeze avoidance of very large table.

From
Robert Haas
Date:
On Thu, Sep 3, 2015 at 2:26 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> Robert Haas wrote:
>> On Thu, Aug 20, 2015 at 10:46 AM, Alvaro Herrera
>> <alvherre@2ndquadrant.com> wrote:
>
>> > I don't think that necessarily means it must continue to be in contrib.
>> > Quite the contrary, I think it is a tool critical enough that it should
>> > not be relegated to be a second-class citizen as it is now (let's face
>> > it, being in contrib *is* second-class citizenship).
>>
>> I have resisted that principle for years and will continue to do so.
>> It is entirely reasonable for some DBAs to want certain functionality
>> (debugging tools, crypto) to not be installed on their machines.
>> Folding everything into core is not a good policy, IMHO.
>
> I don't understand.  I'm just proposing that the source code for the
> extension to live in src/extensions/, and have the shared library
> installed by toplevel make install; I'm not suggesting that the
> extension is installed automatically.  For that, you still need a
> superuser to run CREATE EXTENSION.

Oh.  Well, that's different.  I don't particularly support that
proposal, but I'm not prepared to fight over it either.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Freeze avoidance of very large table.

From
Petr Jelinek
Date:
On 2015-09-03 20:26, Alvaro Herrera wrote:
> Robert Haas wrote:
>> On Thu, Aug 20, 2015 at 10:46 AM, Alvaro Herrera
>> <alvherre@2ndquadrant.com> wrote:
>
>>> I don't think that necessarily means it must continue to be in contrib.
>>> Quite the contrary, I think it is a tool critical enough that it should
>>> not be relegated to be a second-class citizen as it is now (let's face
>>> it, being in contrib *is* second-class citizenship).
>>
>> I have resisted that principle for years and will continue to do so.
>> It is entirely reasonable for some DBAs to want certain functionality
>> (debugging tools, crypto) to not be installed on their machines.
>> Folding everything into core is not a good policy, IMHO.
>
> I don't understand.  I'm just proposing that the source code for the
> extension to live in src/extensions/, and have the shared library
> installed by toplevel make install; I'm not suggesting that the
> extension is installed automatically.  For that, you still need a
> superuser to run CREATE EXTENSION.
>

+! for this

--
  Petr Jelinek                  http://www.2ndQuadrant.com/
  PostgreSQL Development, 24x7 Support, Training & Services



Re: Freeze avoidance of very large table.

From
Bruce Momjian
Date:
On Thu, Sep  3, 2015 at 11:37:09PM +0200, Petr Jelinek wrote:
> >I don't understand.  I'm just proposing that the source code for the
> >extension to live in src/extensions/, and have the shared library
> >installed by toplevel make install; I'm not suggesting that the
> >extension is installed automatically.  For that, you still need a
> >superuser to run CREATE EXTENSION.
> >
> 
> +! for this

OK, what does "+!" mean?  (I know it is probably a shift-key mistype,
but it looks interesting.)

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com
  + Everyone has their own god. +



Re: Freeze avoidance of very large table.

From
Josh Berkus
Date:
On 09/03/2015 05:11 PM, Bruce Momjian wrote:
> On Thu, Sep  3, 2015 at 11:37:09PM +0200, Petr Jelinek wrote:
>>> I don't understand.  I'm just proposing that the source code for the
>>> extension to live in src/extensions/, and have the shared library
>>> installed by toplevel make install; I'm not suggesting that the
>>> extension is installed automatically.  For that, you still need a
>>> superuser to run CREATE EXTENSION.
>>>
>>
>> +! for this
> 
> OK, what does "+!" mean?  (I know it is probably a shift-key mistype,
> but it looks interesting.)

Add the next factorial value?


-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com



Re: Freeze avoidance of very large table.

From
Gavin Flower
Date:
On 04/09/15 12:11, Bruce Momjian wrote:
> On Thu, Sep  3, 2015 at 11:37:09PM +0200, Petr Jelinek wrote:
>>> I don't understand.  I'm just proposing that the source code for the
>>> extension to live in src/extensions/, and have the shared library
>>> installed by toplevel make install; I'm not suggesting that the
>>> extension is installed automatically.  For that, you still need a
>>> superuser to run CREATE EXTENSION.
>>>
>> +! for this
> OK, what does "+!" mean?  (I know it is probably a shift-key mistype,
> but it looks interesting.)
>
It obviously signifies a Good Move that involved a check - at least, 
that is what it would mean when annotating a Chess Game!  :-)



Re: Freeze avoidance of very large table.

From
Fujii Masao
Date:
On Fri, Sep 4, 2015 at 2:23 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> On Thu, Aug 27, 2015 at 1:54 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>> On Thu, Aug 20, 2015 at 11:46 PM, Alvaro Herrera
>> <alvherre@2ndquadrant.com> wrote:
>>> Jim Nasby wrote:
>>>
>>>> I think things like pageinspect are very different; I really can't see any
>>>> use for those beyond debugging (and debugging by an expert at that).
>>>
>>> I don't think that necessarily means it must continue to be in contrib.
>>> Quite the contrary, I think it is a tool critical enough that it should
>>> not be relegated to be a second-class citizen as it is now (let's face
>>> it, being in contrib *is* second-class citizenship).
>>>
>>
>> Attached patch is latest patch.
>
> The previous patch lacks some files for regression test.
> Attached fixed v12 patch.

The patch applied cleanly, and "make check" passed successfully.
But "make check-world -j 2" failed.

Regards,

-- 
Fujii Masao



Re: Freeze avoidance of very large table.

From
Alvaro Herrera
Date:
Bruce Momjian wrote:
> On Thu, Sep  3, 2015 at 11:37:09PM +0200, Petr Jelinek wrote:
> > >I don't understand.  I'm just proposing that the source code for the
> > >extension to live in src/extensions/, and have the shared library
> > >installed by toplevel make install; I'm not suggesting that the
> > >extension is installed automatically.  For that, you still need a
> > >superuser to run CREATE EXTENSION.
> > >
> > 
> > +! for this
> 
> OK, what does "+!" mean?  (I know it is probably a shift-key mistype,
> but it looks interesting.)

I took it as an uppercase 1 myself -- a shouted "PLUS ONE".

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Freeze avoidance of very large table.

From
Masahiko Sawada
Date:
On Fri, Sep 4, 2015 at 10:35 AM, Fujii Masao <masao.fujii@gmail.com> wrote:
> On Fri, Sep 4, 2015 at 2:23 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>> On Thu, Aug 27, 2015 at 1:54 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>>> On Thu, Aug 20, 2015 at 11:46 PM, Alvaro Herrera
>>> <alvherre@2ndquadrant.com> wrote:
>>>> Jim Nasby wrote:
>>>>
>>>>> I think things like pageinspect are very different; I really can't see any
>>>>> use for those beyond debugging (and debugging by an expert at that).
>>>>
>>>> I don't think that necessarily means it must continue to be in contrib.
>>>> Quite the contrary, I think it is a tool critical enough that it should
>>>> not be relegated to be a second-class citizen as it is now (let's face
>>>> it, being in contrib *is* second-class citizenship).
>>>>
>>>
>>> Attached patch is latest patch.
>>
>> The previous patch lacks some files for regression test.
>> Attached fixed v12 patch.
>
> The patch could be applied cleanly. "make check" could pass successfully.
> But "make check-world -j 2" failed.
>

Thank you for looking at this patch.
Could you tell me which test failed for you?
make check-world -j 2 (or more) completes successfully in my environment.

Regards,

--
Masahiko Sawada



Re: Freeze avoidance of very large table.

From
Petr Jelinek
Date:
On 2015-09-04 02:11, Bruce Momjian wrote:
> On Thu, Sep  3, 2015 at 11:37:09PM +0200, Petr Jelinek wrote:
>>> I don't understand.  I'm just proposing that the source code for the
>>> extension to live in src/extensions/, and have the shared library
>>> installed by toplevel make install; I'm not suggesting that the
>>> extension is installed automatically.  For that, you still need a
>>> superuser to run CREATE EXTENSION.
>>>
>>
>> +! for this
>
> OK, what does "+!" mean?  (I know it is probably a shift-key mistype,
> but it looks interesting.)
>

Yes, a shift-key mistype :)

--
  Petr Jelinek                  http://www.2ndQuadrant.com/
  PostgreSQL Development, 24x7 Support, Training & Services



Re: Freeze avoidance of very large table.

From
Bruce Momjian
Date:
On Thu, Sep  3, 2015 at 11:56:52PM -0300, Alvaro Herrera wrote:
> Bruce Momjian wrote:
> > On Thu, Sep  3, 2015 at 11:37:09PM +0200, Petr Jelinek wrote:
> > > >I don't understand.  I'm just proposing that the source code for the
> > > >extension to live in src/extensions/, and have the shared library
> > > >installed by toplevel make install; I'm not suggesting that the
> > > >extension is installed automatically.  For that, you still need a
> > > >superuser to run CREATE EXTENSION.
> > > >
> > > 
> > > +! for this
> > 
> > OK, what does "+!" mean?  (I know it is probably a shift-key mistype,
> > but it looks interesting.)
> 
> I took it as an uppercase 1 myself -- a shouted "PLUS ONE".

Oh, an ALL-CAPS +1.  Yeah, it actually makes sense.  ;-)

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com
  + Everyone has their own god. +



Re: Freeze avoidance of very large table.

From
Simon Riggs
Date:
On 3 September 2015 at 18:23, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
 
The previous patch lacks some files for regression test.
Attached fixed v12 patch.

This looks OK. You saw that I was proposing to solve this problem a different way ("Summary of plans to avoid the annoyance of Freezing"), suggesting that we wait for a few CFs to see if a patch emerges for that - then fall back to this patch if it doesn't? So I am moving this patch to next CF.

I apologise for the personal annoyance caused by this; I hope whatever solution we find we can work together on it.

--
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Freeze avoidance of very large table.

From
Masahiko Sawada
Date:
On Sat, Sep 5, 2015 at 7:35 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> On 3 September 2015 at 18:23, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
>>
>> The previous patch lacks some files for regression test.
>> Attached fixed v12 patch.
>
>
> This looks OK. You saw that I was proposing to solve this problem a
> different way ("Summary of plans to avoid the annoyance of Freezing"),
> suggesting that we wait for a few CFs to see if a patch emerges for that -
> then fall back to this patch if it doesn't? So I am moving this patch to
> next CF.
>
> I apologise for the personal annoyance caused by this; I hope whatever
> solution we find we can work together on it.
>

I had actually missed that thread, but I have now understood the status
of the freeze avoidance topic.
It's no problem for me if we address Heikki's solution first and the
other plan (maybe the frozen map) next.
But this frozen map patch is still under review and might have a
serious problem, so it still needs to be reviewed.
So I think we should continue to review this patch at least while
reviewing Heikki's solution, and then we can select a solution for the
frozen map.
Otherwise, if the frozen map or something else turns out to have a
serious problem, the patch will not have been reviewed enough, and that
will lead to a bad result, I think.

Regards,

--
Masahiko Sawada



Re: Freeze avoidance of very large table.

From
Robert Haas
Date:
On Mon, Sep 7, 2015 at 11:18 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> On Sat, Sep 5, 2015 at 7:35 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
>> On 3 September 2015 at 18:23, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>>> The previous patch lacks some files for regression test.
>>> Attached fixed v12 patch.
>>
>> This looks OK. You saw that I was proposing to solve this problem a
>> different way ("Summary of plans to avoid the annoyance of Freezing"),
>> suggesting that we wait for a few CFs to see if a patch emerges for that -
>> then fall back to this patch if it doesn't? So I am moving this patch to
>> next CF.
>>
>> I apologise for the personal annoyance caused by this; I hope whatever
>> solution we find we can work together on it.
>>
>
> I had missed that thread actually, but have understood status of
> around freeze avoidance topic.
> It's no problem to me that we address Heikki's solution at first and
> next is other plan(maybe frozen map).
> But this frozen map patch is still under the reviewing and might have
> serious problem, that is still need to be reviewed.
> So I think we should continue to review this patch at least, while
> reviewing Heikki's solution, and then we can select solution for
> frozen map.
> Otherwise, if frozen map has serious problem or other big problem is
> occurred, the reviewing of patch will be not enough, and then it will
> leads bad result, I think.

I agree!

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Freeze avoidance of very large table.

From
Andres Freund
Date:
On 2015-09-04 23:35:42 +0100, Simon Riggs wrote:
> This looks OK. You saw that I was proposing to solve this problem a
> different way ("Summary of plans to avoid the annoyance of Freezing"),
> suggesting that we wait for a few CFs to see if a patch emerges for that -
> then fall back to this patch if it doesn't? So I am moving this patch to
> next CF.

As noted on that other thread I don't think that's a good policy, and it
seems like Robert agrees with me. So I think we should move this back to
"Needs Review".

Greetings,

Andres Freund



Re: Freeze avoidance of very large table.

From
Fujii Masao
Date:
On Fri, Sep 4, 2015 at 2:55 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> On Fri, Sep 4, 2015 at 10:35 AM, Fujii Masao <masao.fujii@gmail.com> wrote:
>> On Fri, Sep 4, 2015 at 2:23 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>>> On Thu, Aug 27, 2015 at 1:54 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>>>> On Thu, Aug 20, 2015 at 11:46 PM, Alvaro Herrera
>>>> <alvherre@2ndquadrant.com> wrote:
>>>>> Jim Nasby wrote:
>>>>>
>>>>>> I think things like pageinspect are very different; I really can't see any
>>>>>> use for those beyond debugging (and debugging by an expert at that).
>>>>>
>>>>> I don't think that necessarily means it must continue to be in contrib.
>>>>> Quite the contrary, I think it is a tool critical enough that it should
>>>>> not be relegated to be a second-class citizen as it is now (let's face
>>>>> it, being in contrib *is* second-class citizenship).
>>>>>
>>>>
>>>> Attached patch is latest patch.
>>>
>>> The previous patch lacks some files for regression test.
>>> Attached fixed v12 patch.
>>
>> The patch could be applied cleanly. "make check" could pass successfully.
>> But "make check-world -j 2" failed.
>>
>
> Thank you for looking at this patch.
> Could you tell me what test you got failed?
> make check-world -j 2 or more is done successfully in my environment.

I tried to do the test again, but initdb failed with the following error.
   creating template1 database in data/base/1 ... FATAL:  invalid
input syntax for type oid: "f"

This error didn't happen when I tested before, so a recently applied
commit might interfere with the patch.

Regards,

-- 
Fujii Masao



Re: Freeze avoidance of very large table.

From
Masahiko Sawada
Date:
On Fri, Sep 18, 2015 at 6:13 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
> On Fri, Sep 4, 2015 at 2:55 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>> On Fri, Sep 4, 2015 at 10:35 AM, Fujii Masao <masao.fujii@gmail.com> wrote:
>>> On Fri, Sep 4, 2015 at 2:23 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>>>> On Thu, Aug 27, 2015 at 1:54 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>>>>> On Thu, Aug 20, 2015 at 11:46 PM, Alvaro Herrera
>>>>> <alvherre@2ndquadrant.com> wrote:
>>>>>> Jim Nasby wrote:
>>>>>>
>>>>>>> I think things like pageinspect are very different; I really can't see any
>>>>>>> use for those beyond debugging (and debugging by an expert at that).
>>>>>>
>>>>>> I don't think that necessarily means it must continue to be in contrib.
>>>>>> Quite the contrary, I think it is a tool critical enough that it should
>>>>>> not be relegated to be a second-class citizen as it is now (let's face
>>>>>> it, being in contrib *is* second-class citizenship).
>>>>>>
>>>>>
>>>>> Attached patch is latest patch.
>>>>
>>>> The previous patch lacks some files for regression test.
>>>> Attached fixed v12 patch.
>>>
>>> The patch could be applied cleanly. "make check" could pass successfully.
>>> But "make check-world -j 2" failed.
>>>
>>
>> Thank you for looking at this patch.
>> Could you tell me what test you got failed?
>> make check-world -j 2 or more is done successfully in my environment.
>
> I tried to do the test again, but initdb failed with the following error.
>
>     creating template1 database in data/base/1 ... FATAL:  invalid
> input syntax for type oid: "f"
>
> This error didn't happen when I tested before. So the commit which was
> applied recently might interfere with the patch.
>

Thank you for testing!
Attached is a fixed version of the patch.

Regards,

--
Masahiko Sawada

Attachment

Re: Freeze avoidance of very large table.

From
Fujii Masao
Date:
On Fri, Sep 18, 2015 at 8:14 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> On Fri, Sep 18, 2015 at 6:13 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
>> On Fri, Sep 4, 2015 at 2:55 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>>> On Fri, Sep 4, 2015 at 10:35 AM, Fujii Masao <masao.fujii@gmail.com> wrote:
>>>> On Fri, Sep 4, 2015 at 2:23 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>>>>> On Thu, Aug 27, 2015 at 1:54 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>>>>>> On Thu, Aug 20, 2015 at 11:46 PM, Alvaro Herrera
>>>>>> <alvherre@2ndquadrant.com> wrote:
>>>>>>> Jim Nasby wrote:
>>>>>>>
>>>>>>>> I think things like pageinspect are very different; I really can't see any
>>>>>>>> use for those beyond debugging (and debugging by an expert at that).
>>>>>>>
>>>>>>> I don't think that necessarily means it must continue to be in contrib.
>>>>>>> Quite the contrary, I think it is a tool critical enough that it should
>>>>>>> not be relegated to be a second-class citizen as it is now (let's face
>>>>>>> it, being in contrib *is* second-class citizenship).
>>>>>>>
>>>>>>
>>>>>> Attached patch is latest patch.
>>>>>
>>>>> The previous patch lacks some files for regression test.
>>>>> Attached fixed v12 patch.
>>>>
>>>> The patch could be applied cleanly. "make check" could pass successfully.
>>>> But "make check-world -j 2" failed.
>>>>
>>>
>>> Thank you for looking at this patch.
>>> Could you tell me what test you got failed?
>>> make check-world -j 2 or more is done successfully in my environment.
>>
>> I tried to do the test again, but initdb failed with the following error.
>>
>>     creating template1 database in data/base/1 ... FATAL:  invalid
>> input syntax for type oid: "f"
>>
>> This error didn't happen when I tested before. So the commit which was
>> applied recently might interfere with the patch.
>>
>
> Thank you for testing!
> Attached fixed version patch.

Thanks for updating the patch! Here are comments.

+#include "access/visibilitymap.h"

visibilitymap.h doesn't need to be included in cluster.c.

-          errmsg("table row type and query-specified row type do not match"),
+                 errmsg("table row type and query-specified row type
do not match"),

This change doesn't seem to be necessary.

+#define Anum_pg_class_relallfrozen        12

Why is pg_class.relallfrozen necessary? ISTM that there is no user of it now.

lazy_scan_heap() calls PageClearAllVisible() when the page containing
dead tuples is marked as all-visible. Shouldn't PageClearAllFrozen() be
called at the same time?

-    "vm",                        /* VISIBILITYMAP_FORKNUM */
+    "vfm",                        /* VISIBILITYMAP_FORKNUM */

I wonder how much it's worth renaming only the file extension while
there are many places where "visibility map" and "vm" are used,
for example, log messages, function names, variables, etc.

Regards,

-- 
Fujii Masao



Re: Freeze avoidance of very large table.

From
Robert Haas
Date:
On Thu, Oct 1, 2015 at 9:44 AM, Fujii Masao <masao.fujii@gmail.com> wrote:
> I wonder how much it's worth renaming only the file extension while
> there are many places where "visibility map" and "vm" are used,
> for example, log messages, function names, variables, etc.

I'd be inclined to keep calling it the visibility map (vm) even if it
also contains freeze information.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Freeze avoidance of very large table.

From
Josh Berkus
Date:
On 10/01/2015 07:43 AM, Robert Haas wrote:
> On Thu, Oct 1, 2015 at 9:44 AM, Fujii Masao <masao.fujii@gmail.com> wrote:
>> I wonder how much it's worth renaming only the file extension while
>> there are many places where "visibility map" and "vm" are used,
>> for example, log messages, function names, variables, etc.
> 
> I'd be inclined to keep calling it the visibility map (vm) even if it
> also contains freeze information.
> 

-1 to rename.  Visibility Map is a perfectly good name.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com



Re: Freeze avoidance of very large table.

From
Masahiko Sawada
Date:
On Fri, Oct 2, 2015 at 7:30 AM, Josh Berkus <josh@agliodbs.com> wrote:
> On 10/01/2015 07:43 AM, Robert Haas wrote:
>> On Thu, Oct 1, 2015 at 9:44 AM, Fujii Masao <masao.fujii@gmail.com> wrote:
>>> I wonder how much it's worth renaming only the file extension while
>>> there are many places where "visibility map" and "vm" are used,
>>> for example, log messages, function names, variables, etc.
>>
>> I'd be inclined to keep calling it the visibility map (vm) even if it
>> also contains freeze information.
>>
>
> -1 to rename.  Visibility Map is a perfectly good name.
>

Thank you for taking the time to review this patch.

Attached is the latest v14 patch.
The v14 patch no longer renames the visibility map file to "vfm", and
contains some bug fixes.

> +#include "access/visibilitymap.h"
> visibilitymap.h doesn't need to be included in cluster.c.

Fixed.

> -          errmsg("table row type and query-specified row type do not match"),
> +                 errmsg("table row type and query-specified row type
> do not match"),
> This change doesn't seem to be necessary.

Fixed.

> +#define Anum_pg_class_relallfrozen        12
> Why is pg_class.relallfrozen necessary? ISTM that there is no user of it now.

The relallfrozen value would be useful for users to estimate the time a
vacuum freeze or anti-wraparound vacuum will take before actually running it.
(Also, this value is used in the regression test.)
But this information is not used for planning like relallvisible is, so it
would be better to move it to another system view such as
pg_stat_*_tables.

> lazy_scan_heap() calls PageClearAllVisible() when the page containing
> dead tuples is marked as all-visible. Shouldn't PageClearAllFrozen() be
> called at the same time?

Fixed.

> -    "vm",                        /* VISIBILITYMAP_FORKNUM */
> +    "vfm",                        /* VISIBILITYMAP_FORKNUM */
> I wonder how much it's worth renaming only the file extension while
> there are many places where "visibility map" and "vm" are used,
> for example, log messages, function names, variables, etc.
>
> I'd be inclined to keep calling it the visibility map (vm) even if it
> also contains freeze information.
>
> -1 to rename.  Visibility Map is a perfectly good name.

Yeah, I agree with this.
The latest v14 patch is changed accordingly.

Regards,

--
Masahiko Sawada

Attachment

Re: Freeze avoidance of very large table.

From
Alvaro Herrera
Date:
Masahiko Sawada wrote:

> @@ -2972,10 +2981,15 @@ l1:
>       */
>      PageSetPrunable(page, xid);
>  
> +    /* clear PD_ALL_VISIBLE and PD_ALL_FORZEN flags */

Typo "FORZEN".

>      if (PageIsAllVisible(page))
>      {
>          all_visible_cleared = true;
> +
> +        /* all-frozen information is also cleared at the same time */
>          PageClearAllVisible(page);
> +        PageClearAllFrozen(page);

I wonder if it makes sense to have a macro to clear both in unison,
which seems a very common pattern.
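Something like this hypothetical macro, say (the PD_* flag names come from
the patch; the macro name here is illustrative only):

    /* hypothetical combined-clear macro */
    #define PageClearAllVisibleAndFrozen(page) \
        (((PageHeader) (page))->pd_flags &= ~(PD_ALL_VISIBLE | PD_ALL_FROZEN))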
 
> diff --git a/src/backend/access/heap/visibilitymap.c b/src/backend/access/heap/visibilitymap.c
> index 7c38772..a284b85 100644
> --- a/src/backend/access/heap/visibilitymap.c
> +++ b/src/backend/access/heap/visibilitymap.c
> @@ -21,33 +21,45 @@
>   *
>   * NOTES
>   *
> - * The visibility map is a bitmap with one bit per heap page. A set bit means
> - * that all tuples on the page are known visible to all transactions, and
> - * therefore the page doesn't need to be vacuumed. The map is conservative in
> - * the sense that we make sure that whenever a bit is set, we know the
> - * condition is true, but if a bit is not set, it might or might not be true.
> + * The visibility map is a bitmap with two bits (all-visible and all-frozen)
> + * per heap page. A set all-visible bit means that all tuples on the page are
> + * known visible to all transactions, and therefore the page doesn't need to
> + * be vacuumed. A set all-frozen bit means that all tuples on the page are
> + * completely frozen, and therefore the page doesn't need to be vacuumed even
> + * if whole table scanning vacuum is required (e.g. anti-wraparound vacuum).
> + * A all-frozen bit must be set only when the page is already all-visible.
> + * That is, all-frozen bit is always set with all-visible bit.

"A all-frozen" -> "The all-frozen" (but "A set all-xyz" is correct).


>   * When we *set* a visibility map during VACUUM, we must write WAL.  This may
>   * seem counterintuitive, since the bit is basically a hint: if it is clear,
> - * it may still be the case that every tuple on the page is visible to all
> - * transactions; we just don't know that for certain.  The difficulty is that
> - * there are two bits which are typically set together: the PD_ALL_VISIBLE bit
> - * on the page itself, and the visibility map bit.  If a crash occurs after the
> - * visibility map page makes it to disk and before the updated heap page makes
> - * it to disk, redo must set the bit on the heap page.  Otherwise, the next
> - * insert, update, or delete on the heap page will fail to realize that the
> - * visibility map bit must be cleared, possibly causing index-only scans to
> - * return wrong answers.
> + * it may still be the case that every tuple on the page is visible or frozen
> + * to all transactions; we just don't know that for certain.  The difficulty is
> + * that there are two bits which are typically set together: the PD_ALL_VISIBLE
> + * or PD_ALL_FROZEN bit on the page itself, and the visibility map bit.  If a
> + * crash occurs after the visibility map page makes it to disk and before the
> + * updated heap page makes it to disk, redo must set the bit on the heap page.
> + * Otherwise, the next insert, update, or delete on the heap page will fail to
> + * realize that the visibility map bit must be cleared, possibly causing index-only
> + * scans to return wrong answers.

In the "The difficulty ..." para, I would add the word "corresponding" before
"visibility".  Otherwise, it is not clear what the plural means exactly.

>   * VACUUM will normally skip pages for which the visibility map bit is set;
>   * such pages can't contain any dead tuples and therefore don't need vacuuming.
> - * The visibility map is not used for anti-wraparound vacuums, because
> + * The visibility map is not used for anti-wraparound vacuums before 9.5, because
>   * an anti-wraparound vacuum needs to freeze tuples and observe the latest xid
>   * present in the table, even on pages that don't have any dead tuples.
> + * 9.6 or later, the visibility map has a additional bit which indicates all tuple
> + * on single page has been completely forzen, so the visibility map is also used for
> + * anti-wraparound vacuums.

This should not mention database versions.  Just explain how the code
behaves today, not how it behaved in the past.  Those who want to
understand how it behaved in 9.5 can read the 9.5 code.  (Again typo
"forzen".)

> @@ -1115,6 +1187,14 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
>                          tups_vacuumed, vacuumed_pages)));
>  
>      /*
> +     * This information would be effective for how much effect all-frozen bit
> +     * of VM had for freezing tuples.
> +     */
> +    ereport(elevel,
> +            (errmsg("Skipped %d frozen pages acoording to visibility map",
> +                    vacrelstats->vmskipped_frozen_pages)));

Message must start on lowercase letter.  I don't understand what the
comment means.  Can you rephrase it?

> @@ -1779,10 +1873,12 @@ vac_cmp_itemptr(const void *left, const void *right)
>  /*
>   * Check if every tuple in the given page is visible to all current and future
>   * transactions. Also return the visibility_cutoff_xid which is the highest
> - * xmin amongst the visible tuples.
> + * xmin amongst the visible tuples, and all_forzen which implies that all tuples
> + * of this page are frozen.

Typo "forzen" here again.

> @@ -201,6 +239,110 @@ copy_file(const char *srcfile, const char *dstfile, bool force)
>  #endif
>  
>  
> +/*
> + * rewriteVisibilitymap()
> + *
> + * A additional bit which indicates that all tuples on page is completely
> + * frozen is added into visibility map at PG 9.6. So the format of visibiilty
> + * map has been changed.
> + * Copies a visibility map file while adding all-frozen bit(0) into each bit.
> + */
> +static const char *
> +rewriteVisibilitymap(const char *fromfile, const char *tofile, bool force)
> +{
> +#define REWRITE_BUF_SIZE (50 * BLCKSZ)
> +#define BITS_PER_HEAPBLOCK 2
> +
> +    int            src_fd, dst_fd;
> +    uint16         vm_bits;
> +    ssize_t     nbytes;
> +    char         *buffer;
> +    int            ret = 0;
> +    int            save_errno = 0;
> +
> +    if ((fromfile == NULL) || (tofile == NULL))
> +    {
> +        errno = EINVAL;
> +        return getErrorText(errno);
> +    }
> +
> +    if ((src_fd = open(fromfile, O_RDONLY, 0)) < 0)
> +        return getErrorText(errno);
> +
> +    if ((dst_fd = open(tofile, O_RDWR | O_CREAT | (force ? 0 : O_EXCL), S_IRUSR | S_IWUSR)) < 0)
> +    {
> +        save_errno = errno;
> +        if (src_fd != 0)
> +            close(src_fd);
> +
> +        errno = save_errno;
> +        return getErrorText(errno);
> +    }
> +
> +    buffer = (char *) pg_malloc(REWRITE_BUF_SIZE);
> +
> +    /* Copy page header data in advance */
> +    if ((nbytes = read(src_fd, buffer, MAXALIGN(SizeOfPageHeaderData))) <= 0)
> +    {
> +        save_errno = errno;
> +        return getErrorText(errno);
> +    }

Not clear why you bother with save_errno in this path.  Forgot to
close()?  (Though I wonder why you bother to close() if the program is
going to exit shortly thereafter anyway.)
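In other words, the path could presumably be as simple as this sketch,
assuming nothing clobbers errno between the failed read() and the return:

    /* sketch: errno is still valid here, so report it directly */
    if ((nbytes = read(src_fd, buffer, MAXALIGN(SizeOfPageHeaderData))) <= 0)
        return getErrorText(errno);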

> diff --git a/src/bin/pg_upgrade/pg_upgrade.h b/src/bin/pg_upgrade/pg_upgrade.h
> index 13aa891..fc92a5f 100644
> --- a/src/bin/pg_upgrade/pg_upgrade.h
> +++ b/src/bin/pg_upgrade/pg_upgrade.h
> @@ -112,6 +112,11 @@ extern char *output_files[];
>  #define VISIBILITY_MAP_CRASHSAFE_CAT_VER 201107031
>  
>  /*
> + * The format of visibility map changed with this 9.6 commit,
> + *
> + */
> +#define VISIBILITY_MAP_FROZEN_BIT_CAT_VER 201509181

Useless empty line in comment.

> diff --git a/src/common/relpath.c b/src/common/relpath.c
> index 66dfef1..52ff14e 100644
> --- a/src/common/relpath.c
> +++ b/src/common/relpath.c
> @@ -30,6 +30,9 @@
>   * If you add a new entry, remember to update the errhint in
>   * forkname_to_number() below, and update the SGML documentation for
>   * pg_relation_size().
> + * 9.6 or later, the visibility map fork name is changed from "vm" to
> + * "vfm" bacause visibility map has not only information about all-visible
> + * but also information about all-frozen.
>   */
>  const char *const forkNames[] = {
>      "main",                        /* MAIN_FORKNUM */

Drop the change in comment?  There's no "vfm" in this version of the
patch, is there?


-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Freeze avoidance of very large table.

From
Masahiko Sawada
Date:
On Sat, Oct 3, 2015 at 12:23 AM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
> Masahiko Sawada wrote:
>

Thank you for taking the time to review this feature.
Attached is the latest version of the patch (v15).


>> @@ -2972,10 +2981,15 @@ l1:
>>        */
>>       PageSetPrunable(page, xid);
>>
>> +     /* clear PD_ALL_VISIBLE and PD_ALL_FORZEN flags */
>
> Typo "FORZEN".

Fixed.

>
>>       if (PageIsAllVisible(page))
>>       {
>>               all_visible_cleared = true;
>> +
>> +             /* all-frozen information is also cleared at the same time */
>>               PageClearAllVisible(page);
>> +             PageClearAllFrozen(page);
>
> I wonder if it makes sense to have a macro to clear both in unison,
> which seems a very common pattern.
>

Fixed.

>
>> diff --git a/src/backend/access/heap/visibilitymap.c b/src/backend/access/heap/visibilitymap.c
>> index 7c38772..a284b85 100644
>> --- a/src/backend/access/heap/visibilitymap.c
>> +++ b/src/backend/access/heap/visibilitymap.c
>> @@ -21,33 +21,45 @@
>>   *
>>   * NOTES
>>   *
>> - * The visibility map is a bitmap with one bit per heap page. A set bit means
>> - * that all tuples on the page are known visible to all transactions, and
>> - * therefore the page doesn't need to be vacuumed. The map is conservative in
>> - * the sense that we make sure that whenever a bit is set, we know the
>> - * condition is true, but if a bit is not set, it might or might not be true.
>> + * The visibility map is a bitmap with two bits (all-visible and all-frozen)
>> + * per heap page. A set all-visible bit means that all tuples on the page are
>> + * known visible to all transactions, and therefore the page doesn't need to
>> + * be vacuumed. A set all-frozen bit means that all tuples on the page are
>> + * completely frozen, and therefore the page doesn't need to be vacuumed even
>> + * if whole table scanning vacuum is required (e.g. anti-wraparound vacuum).
>> + * A all-frozen bit must be set only when the page is already all-visible.
>> + * That is, all-frozen bit is always set with all-visible bit.
>
> "A all-frozen" -> "The all-frozen" (but "A set all-xyz" is correct).

Fixed.

>
>>   * When we *set* a visibility map during VACUUM, we must write WAL.  This may
>>   * seem counterintuitive, since the bit is basically a hint: if it is clear,
>> - * it may still be the case that every tuple on the page is visible to all
>> - * transactions; we just don't know that for certain.  The difficulty is that
>> - * there are two bits which are typically set together: the PD_ALL_VISIBLE bit
>> - * on the page itself, and the visibility map bit.  If a crash occurs after the
>> - * visibility map page makes it to disk and before the updated heap page makes
>> - * it to disk, redo must set the bit on the heap page.  Otherwise, the next
>> - * insert, update, or delete on the heap page will fail to realize that the
>> - * visibility map bit must be cleared, possibly causing index-only scans to
>> - * return wrong answers.
>> + * it may still be the case that every tuple on the page is visible or frozen
>> + * to all transactions; we just don't know that for certain.  The difficulty is
>> + * that there are two bits which are typically set together: the PD_ALL_VISIBLE
>> + * or PD_ALL_FROZEN bit on the page itself, and the visibility map bit.  If a
>> + * crash occurs after the visibility map page makes it to disk and before the
>> + * updated heap page makes it to disk, redo must set the bit on the heap page.
>> + * Otherwise, the next insert, update, or delete on the heap page will fail to
>> + * realize that the visibility map bit must be cleared, possibly causing index-only
>> + * scans to return wrong answers.
>
> In the "The difficulty ..." para, I would add the word "corresponding" before
> "visibility".  Otherwise, it is not clear what the plural means exactly.

Fixed.

>>   * VACUUM will normally skip pages for which the visibility map bit is set;
>>   * such pages can't contain any dead tuples and therefore don't need vacuuming.
>> - * The visibility map is not used for anti-wraparound vacuums, because
>> + * The visibility map is not used for anti-wraparound vacuums before 9.5, because
>>   * an anti-wraparound vacuum needs to freeze tuples and observe the latest xid
>>   * present in the table, even on pages that don't have any dead tuples.
>> + * 9.6 or later, the visibility map has a additional bit which indicates all tuple
>> + * on single page has been completely forzen, so the visibility map is also used for
>> + * anti-wraparound vacuums.
>
> This should not mention database versions.  Just explain how the code
> behaves today, not how it behaved in the past.  Those who want to
> understand how it behaved in 9.5 can read the 9.5 code.  (Again typo
> "forzen".)

Changed these comments.
Sorry for making the same typo so frequently.

>> @@ -1115,6 +1187,14 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
>>                                               tups_vacuumed, vacuumed_pages)));
>>
>>       /*
>> +      * This information would be effective for how much effect all-frozen bit
>> +      * of VM had for freezing tuples.
>> +      */
>> +     ereport(elevel,
>> +                     (errmsg("Skipped %d frozen pages acoording to visibility map",
>> +                                     vacrelstats->vmskipped_frozen_pages)));
>
> Message must start on lowercase letter.  I don't understand what the
> comment means.  Can you rephrase it?

Fixed.

>> @@ -1779,10 +1873,12 @@ vac_cmp_itemptr(const void *left, const void *right)
>>  /*
>>   * Check if every tuple in the given page is visible to all current and future
>>   * transactions. Also return the visibility_cutoff_xid which is the highest
>> - * xmin amongst the visible tuples.
>> + * xmin amongst the visible tuples, and all_forzen which implies that all tuples
>> + * of this page are frozen.
>
> Typo "forzen" here again.

Fixed.

>> @@ -201,6 +239,110 @@ copy_file(const char *srcfile, const char *dstfile, bool force)
>>  #endif
>>
>>
>> +/*
>> + * rewriteVisibilitymap()
>> + *
>> + * A additional bit which indicates that all tuples on page is completely
>> + * frozen is added into visibility map at PG 9.6. So the format of visibiilty
>> + * map has been changed.
>> + * Copies a visibility map file while adding all-frozen bit(0) into each bit.
>> + */
>> +static const char *
>> +rewriteVisibilitymap(const char *fromfile, const char *tofile, bool force)
>> +{
>> +#define REWRITE_BUF_SIZE (50 * BLCKSZ)
>> +#define BITS_PER_HEAPBLOCK 2
>> +
>> +     int                     src_fd, dst_fd;
>> +     uint16          vm_bits;
>> +     ssize_t         nbytes;
>> +     char            *buffer;
>> +     int                     ret = 0;
>> +     int                     save_errno = 0;
>> +
>> +     if ((fromfile == NULL) || (tofile == NULL))
>> +     {
>> +             errno = EINVAL;
>> +             return getErrorText(errno);
>> +     }
>> +
>> +     if ((src_fd = open(fromfile, O_RDONLY, 0)) < 0)
>> +             return getErrorText(errno);
>> +
>> +     if ((dst_fd = open(tofile, O_RDWR | O_CREAT | (force ? 0 : O_EXCL), S_IRUSR | S_IWUSR)) < 0)
>> +     {
>> +             save_errno = errno;
>> +             if (src_fd != 0)
>> +                     close(src_fd);
>> +
>> +             errno = save_errno;
>> +             return getErrorText(errno);
>> +     }
>> +
>> +     buffer = (char *) pg_malloc(REWRITE_BUF_SIZE);
>> +
>> +     /* Copy page header data in advance */
>> +     if ((nbytes = read(src_fd, buffer, MAXALIGN(SizeOfPageHeaderData))) <= 0)
>> +     {
>> +             save_errno = errno;
>> +             return getErrorText(errno);
>> +     }
>
> Not clear why you bother with save_errno in this path.  Forgot to
> close()?  (Though I wonder why you bother to close() if the program is
> going to exit shortly thereafter anyway.)

Fixed.

>> diff --git a/src/bin/pg_upgrade/pg_upgrade.h b/src/bin/pg_upgrade/pg_upgrade.h
>> index 13aa891..fc92a5f 100644
>> --- a/src/bin/pg_upgrade/pg_upgrade.h
>> +++ b/src/bin/pg_upgrade/pg_upgrade.h
>> @@ -112,6 +112,11 @@ extern char *output_files[];
>>  #define VISIBILITY_MAP_CRASHSAFE_CAT_VER 201107031
>>
>>  /*
>> + * The format of visibility map changed with this 9.6 commit,
>> + *
>> + */
>> +#define VISIBILITY_MAP_FROZEN_BIT_CAT_VER 201509181
>
> Useless empty line in comment.

Fixed.

>> diff --git a/src/common/relpath.c b/src/common/relpath.c
>> index 66dfef1..52ff14e 100644
>> --- a/src/common/relpath.c
>> +++ b/src/common/relpath.c
>> @@ -30,6 +30,9 @@
>>   * If you add a new entry, remember to update the errhint in
>>   * forkname_to_number() below, and update the SGML documentation for
>>   * pg_relation_size().
>> + * 9.6 or later, the visibility map fork name is changed from "vm" to
>> + * "vfm" bacause visibility map has not only information about all-visible
>> + * but also information about all-frozen.
>>   */
>>  const char *const forkNames[] = {
>>       "main",                                         /* MAIN_FORKNUM */
>
> Drop the change in comment?  There's no "vfm" in this version of the
> patch, is there?

Fixed.

Regards,

--
Masahiko Sawada

Attachment

Re: Freeze avoidance of very large table.

From
Robert Haas
Date:
On Fri, Oct 2, 2015 at 11:23 AM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
>> +             /* all-frozen information is also cleared at the same time */
>>               PageClearAllVisible(page);
>> +             PageClearAllFrozen(page);
>
> I wonder if it makes sense to have a macro to clear both in unison,
> which seems a very common pattern.

I think PageClearAllVisible should clear both, and there should be no
other macro.  There is no event that causes a page to cease being
all-visible that does not also cause it to cease being all-frozen.
You might think that deleting or locking a tuple would fall into that
category - but nope, XMAX needs to be cleared or the tuple pruned, or
there will be problems after wraparound.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Freeze avoidance of very large table.

From
Masahiko Sawada
Date:
On Sat, Oct 3, 2015 at 3:41 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Fri, Oct 2, 2015 at 11:23 AM, Alvaro Herrera
> <alvherre@2ndquadrant.com> wrote:
>>> +             /* all-frozen information is also cleared at the same time */
>>>               PageClearAllVisible(page);
>>> +             PageClearAllFrozen(page);
>>
>> I wonder if it makes sense to have a macro to clear both in unison,
>> which seems a very common pattern.
>
> I think PageClearAllVisible should clear both, and there should be no
> other macro.  There is no event that causes a page to cease being
> all-visible that does not also cause it to cease being all-frozen.
> You might think that deleting or locking a tuple would fall into that
> category - but nope, XMAX needs to be cleared or the tuple pruned, or
> there will be problems after wraparound.
>

Thank you for your advice.
I understood.

I changed the patch so that PageClearAllVisible clears both bits, and
removed ClearAllFrozen.
Attached is the latest v16 patch, which contains a draft version of the
documentation patch.

Regards,

--
Masahiko Sawada

Attachment

Re: Freeze avoidance of very large table.

From
Fujii Masao
Date:
On Fri, Oct 2, 2015 at 8:14 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>> +#define Anum_pg_class_relallfrozen        12
>> Why is pg_class.relallfrozen necessary? ISTM that there is no user of it now.
>
> The relallfrozen value would be useful for users to estimate the time a
> vacuum freeze or anti-wraparound vacuum will take before actually running it.
> (Also, this value is used in the regression test.)
> But this information is not used for planning like relallvisible is, so it
> would be better to move it to another system view such as
> pg_stat_*_tables.

Or make pgstattuple and pgstattuple_approx report even the number
of frozen tuples?

Regards,

-- 
Fujii Masao



Re: Freeze avoidance of very large table.

From
Simon Riggs
Date:
On 10 September 2015 at 01:58, Andres Freund <andres@anarazel.de> wrote:
On 2015-09-04 23:35:42 +0100, Simon Riggs wrote:
> This looks OK. You saw that I was proposing to solve this problem a
> different way ("Summary of plans to avoid the annoyance of Freezing"),
> suggesting that we wait for a few CFs to see if a patch emerges for that -
> then fall back to this patch if it doesn't? So I am moving this patch to
> next CF.

As noted on that other thread I don't think that's a good policy, and it
seems like Robert agrees with me. So I think we should move this back to
"Needs Review".

I also agree. Andres and I spoke at PostgresOpen and he persuaded me; I've just been away.

I am happy to review and commit in the next few days/weeks, once I catch up on the thread.

--
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Freeze avoidance of very large table.

From
Masahiko Sawada
Date:
On Mon, Oct 5, 2015 at 11:03 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
> On Fri, Oct 2, 2015 at 8:14 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>>> +#define Anum_pg_class_relallfrozen        12
>>> Why is pg_class.relallfrozen necessary? ISTM that there is no user of it now.
>>
>> The relallfrozen value would be useful for users to estimate the time a
>> vacuum freeze or anti-wraparound vacuum will take before actually running it.
>> (Also, this value is used in the regression test.)
>> But this information is not used for planning like relallvisible is, so it
>> would be better to move it to another system view such as
>> pg_stat_*_tables.
>
> Or make pgstattuple and pgstattuple_approx report even the number
> of frozen tuples?
>

But we cannot know the number of frozen pages without installing the
pageinspect module.
I'm a bit concerned that not all projects can install
extension modules into PostgreSQL in production environments.
I think we need to provide such a feature in core, at least.
Thoughts?

Regards,

--
Masahiko Sawada



Re: Freeze avoidance of very large table.

From
Fujii Masao
Date:
On Mon, Oct 5, 2015 at 7:31 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> On Sat, Oct 3, 2015 at 3:41 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Fri, Oct 2, 2015 at 11:23 AM, Alvaro Herrera
>> <alvherre@2ndquadrant.com> wrote:
>>>> +             /* all-frozen information is also cleared at the same time */
>>>>               PageClearAllVisible(page);
>>>> +             PageClearAllFrozen(page);
>>>
>>> I wonder if it makes sense to have a macro to clear both in unison,
>>> which seems a very common pattern.
>>
>> I think PageClearAllVisible should clear both, and there should be no
>> other macro.  There is no event that causes a page to cease being
>> all-visible that does not also cause it to cease being all-frozen.
>> You might think that deleting or locking a tuple would fall into that
>> category - but nope, XMAX needs to be cleared or the tuple pruned, or
>> there will be problems after wraparound.
>>
>
> Thank you for your advice.
> I understood.
>
> I changed the patch so that PageClearAllVisible clears both bits, and
> removed ClearAllFrozen.
> Attached is the latest v16 patch, which contains a draft version of the
> documentation patch.

Thanks for updating the patch! Here are another review comments.

+    ereport(elevel,
+            (errmsg("skipped %d frozen pages acoording to visibility map",
+                    vacrelstats->vmskipped_frozen_pages)));

Typo: acoording should be according.

When vmskipped_frozen_pages is 1, "1 frozen pages" in log message
sounds incorrect in terms of grammar. So probably errmsg_plural()
should be used here.
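For example, a sketch of how errmsg_plural() could be used here (the
message text is adapted from the quoted hunk):

    ereport(elevel,
            (errmsg_plural("skipped %d frozen page according to visibility map",
                           "skipped %d frozen pages according to visibility map",
                           vacrelstats->vmskipped_frozen_pages,
                           vacrelstats->vmskipped_frozen_pages)));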

+            relallvisible = visibilitymap_count(rel, VISIBILITYMAP_ALL_VISIBLE);
+            relallfrozen = visibilitymap_count(rel, VISIBILITYMAP_ALL_FROZEN);

We can refactor visibilitymap_count() so that it counts the numbers of
both all-visible and all-frozen tuples at the same time, in order to
avoid reading through visibility map twice.
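A standalone sketch of the single-pass idea (the two-bits-per-page layout
and bit positions here are illustrative, not the actual on-disk format):

    #include <stddef.h>
    #include <stdint.h>

    /*
     * Count all-visible and all-frozen bits in one pass over a map buffer.
     * Illustrative layout: within each 2-bit slot, bit 0 = all-visible
     * and bit 1 = all-frozen.
     */
    static void
    count_vm_bits(const uint8_t *map, size_t nbytes,
                  uint64_t *all_visible, uint64_t *all_frozen)
    {
        *all_visible = 0;
        *all_frozen = 0;

        for (size_t i = 0; i < nbytes; i++)
        {
            for (int shift = 0; shift < 8; shift += 2)
            {
                uint8_t slot = (map[i] >> shift) & 0x03;

                if (slot & 0x01)
                    (*all_visible)++;
                if (slot & 0x02)
                    (*all_frozen)++;
            }
        }
    }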

heap_page_is_all_visible() can set all_frozen to TRUE even when
it returns FALSE. This is odd because the page must not be all frozen
when it's not all visible. heap_page_is_all_visible() should set
all_frozen to FALSE whenever all_visible is set to FALSE?
Probably it's better to forcibly set all_frozen to FALSE at the end of
the function whenever all_visible is FALSE.
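That is, something like this at the end of the function (variable names
follow the quoted review):

    /* sketch: a page can never be all-frozen unless it is also all-visible */
    if (!all_visible)
        all_frozen = false;

    return all_visible;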

+    if (PageIsAllVisible(page))
     {
-        Assert(BufferIsValid(*vmbuffer));

Why did you remove this assertion?

+        if (all_frozen)
+        {
+            PageSetAllFrozen(page);
+            flags |= VISIBILITYMAP_ALL_FROZEN;
+        }

Why didn't you call visibilitymap_test() for all frozen case here?

In visibilitymap_set(), the argument flag must be either
(VISIBILITYMAP_ALL_VISIBLE | VISIBILITYMAP_ALL_FROZEN) or
VISIBILITYMAP_ALL_VISIBLE. So I think that it's better to add
Assert() which checks whether the specified flag is valid or not.
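A sketch of such a check at the top of visibilitymap_set(), assuming the
argument is named flags:

    /* sketch: only these two combinations are expected from callers */
    Assert(flags == VISIBILITYMAP_ALL_VISIBLE ||
           flags == (VISIBILITYMAP_ALL_VISIBLE | VISIBILITYMAP_ALL_FROZEN));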

+                     * caller is expected to set PD_ALL_VISIBLE or
+                     * PD_ALL_FROZEN first.
+                     */
+                    Assert(PageIsAllVisible(heapPage) || PageIsAllFrozen(heapPage));

This should be the following?
 Assert(((flag | VISIBILITYMAP_ALL_VISIBLE) && PageIsAllVisible(heapPage)) ||
        ((flag | VISIBILITYMAP_ALL_FROZEN) && PageIsAllFrozen(heapPage)));
 

Regards,

-- 
Fujii Masao



Re: Freeze avoidance of very large table.

From
Masahiko Sawada
Date:
On Thu, Oct 8, 2015 at 7:03 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
> On Mon, Oct 5, 2015 at 7:31 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>> On Sat, Oct 3, 2015 at 3:41 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>>> On Fri, Oct 2, 2015 at 11:23 AM, Alvaro Herrera
>>> <alvherre@2ndquadrant.com> wrote:
>>>>> +             /* all-frozen information is also cleared at the same time */
>>>>>               PageClearAllVisible(page);
>>>>> +             PageClearAllFrozen(page);
>>>>
>>>> I wonder if it makes sense to have a macro to clear both in unison,
>>>> which seems a very common pattern.
>>>
>>> I think PageClearAllVisible should clear both, and there should be no
>>> other macro.  There is no event that causes a page to cease being
>>> all-visible that does not also cause it to cease being all-frozen.
>>> You might think that deleting or locking a tuple would fall into that
>>> category - but nope, XMAX needs to be cleared or the tuple pruned, or
>>> there will be problems after wraparound.
>>>
>>
>> Thank you for your advice.
>> I understood.
>>
>> I changed the patch so that PageClearAllVisible clears both bits, and
>> removed ClearAllFrozen.
>> Attached is the latest v16 patch, which contains a draft version of the
>> documentation patch.
>
> Thanks for updating the patch! Here are another review comments.
>

Thank you for reviewing!
Attached is the latest patch.

> +    ereport(elevel,
> +            (errmsg("skipped %d frozen pages acoording to visibility map",
> +                    vacrelstats->vmskipped_frozen_pages)));
>
> Typo: acoording should be according.
>
> When vmskipped_frozen_pages is 1, "1 frozen pages" in log message
> sounds incorrect in terms of grammar. So probably errmsg_plural()
> should be used here.

Thank you for your advice.
Fixed.

> +            relallvisible = visibilitymap_count(rel,
> VISIBILITYMAP_ALL_VISIBLE);
> +            relallfrozen = visibilitymap_count(rel, VISIBILITYMAP_ALL_FROZEN);
>
> We can refactor visibilitymap_count() so that it counts the numbers of
> both all-visible and all-frozen tuples at the same time, in order to
> avoid reading through visibility map twice.

I agree.
I've changed it accordingly.

> heap_page_is_all_visible() can set all_frozen to TRUE even when
> it returns FALSE. This is odd because the page must not be all frozen
> when it's not all visible. heap_page_is_all_visible() should set
> all_frozen to FALSE whenever all_visible is set to FALSE?
> Probably it's better to forcibly set all_frozen to FALSE at the end of
> the function whenever all_visible is FALSE.

Fixed.

> +    if (PageIsAllVisible(page))
>      {
> -        Assert(BufferIsValid(*vmbuffer));
>
> Why did you remove this assertion?

It's my mistake.
Fixed.

> +        if (all_frozen)
> +        {
> +            PageSetAllFrozen(page);
> +            flags |= VISIBILITYMAP_ALL_FROZEN;
> +        }
>
> Why didn't you call visibilitymap_test() for all frozen case here?

Same as above.
Fixed.

> In visibilitymap_set(), the argument flag must be either
> (VISIBILITYMAP_ALL_VISIBLE | VISIBILITYMAP_ALL_FROZEN) or
> VISIBILITYMAP_ALL_VISIBLE. So I think that it's better to add
> Assert() which checks whether the specified flag is valid or not.

I agree.
I added an Assert() at the beginning of the visibilitymap_set() function.

> +                     * caller is expected to set PD_ALL_VISIBLE or
> +                     * PD_ALL_FROZEN first.
> +                     */
> +                    Assert(PageIsAllVisible(heapPage) ||
> PageIsAllFrozen(heapPage));
>
> This should be the following?
>
>   Assert(((flag | VISIBILITYMAP_ALL_VISIBLE) && PageIsAllVisible(heapPage)) ||
>               ((flag | VISIBILITYMAP_ALL_FROZEN) && PageIsAllFrozen(heapPage)));

I agree.
Fixed.

Regards,

--
Masahiko Sawada

Attachment

Re: Freeze avoidance of very large table.

From
Simon Riggs
Date:
On 1 October 2015 at 23:30, Josh Berkus <josh@agliodbs.com> wrote:
On 10/01/2015 07:43 AM, Robert Haas wrote:
> On Thu, Oct 1, 2015 at 9:44 AM, Fujii Masao <masao.fujii@gmail.com> wrote:
>> I wonder how much it's worth renaming only the file extension while
>> there are many places where "visibility map" and "vm" are used,
>> for example, log messages, function names, variables, etc.
>
> I'd be inclined to keep calling it the visibility map (vm) even if it
> also contains freeze information.
>

-1 to rename.  Visibility Map is a perfectly good name.

The name can stay the same, but specifically the file extension should change.

This patch changes the layout of existing information:
* _vm stores one bit per page
* _$new stores two bits per page
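For illustration, a standalone sketch of the addressing difference (the
constants and helper names are made up here, not the patch's):

    #include <stdint.h>

    /* Old layout: 8 heap pages per map byte (1 bit each). */
    #define OLD_HEAPBLOCKS_PER_BYTE  8

    /* New layout: 4 heap pages per map byte (2 bits each). */
    #define NEW_BITS_PER_HEAPBLOCK   2
    #define NEW_HEAPBLOCKS_PER_BYTE  (8 / NEW_BITS_PER_HEAPBLOCK)

    /* Which map byte and bit offset hold a given heap block's bits. */
    static inline uint32_t
    new_map_byte(uint32_t heap_blk)
    {
        return heap_blk / NEW_HEAPBLOCKS_PER_BYTE;
    }

    static inline int
    new_map_shift(uint32_t heap_blk)
    {
        return (heap_blk % NEW_HEAPBLOCKS_PER_BYTE) * NEW_BITS_PER_HEAPBLOCK;
    }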

The problem is we won't be able to tell the two formats apart, since they both are just lots of bits. So we won't be able to tell if the file is old format or new format, which could lead to loss of information that relates to visibility. If we think something is all-visible when it is not, this is effectively data corruption.

In light of lessons learned from multixactids, I think it's important that we are able to tell the difference between an old-format and a new-format visibility map.

My suggestion to do so was to call it "vfm", so we indicate that it is now a Visibility & Freeze Map.

I don't care if we change the name, but I do care if we can't tell the difference between a failed upgrade, a normal upgrade and a server that has been upgraded multiple times. Alternate suggestions welcome.

--
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Freeze avoidance of very large table.

From
Andres Freund
Date:
On October 8, 2015 7:35:24 PM GMT+02:00, Simon Riggs <simon@2ndQuadrant.com> wrote:

>The problem is we won't be able to tell the two formats apart, since
>they
>both are just lots of bits. So we won't be able to tell if the file is
>old
>format or new format, which could lead to loss of information that
>relates
>to visibility. 

I don't see the problem? I mean catversion will reliably tell you which format the vm is in?

We could additionally use the opportunity to add a metapage, but that seems like an independent thing.


Andres

--- 
Please excuse brevity and formatting - I am writing this on my mobile phone.



Re: Freeze avoidance of very large table.

From
Robert Haas
Date:
On Thu, Oct 8, 2015 at 1:52 PM, Andres Freund <andres@anarazel.de> wrote:
> I don't see the problem? I mean catversion will reliably tell you which format the vm is in?

Totally agreed.

> We could additionally use the opportunity to add a metapage, but that seems like an independent thing.

I agree with that, too.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Freeze avoidance of very large table.

From
Masahiko Sawada
Date:
On Sat, Oct 10, 2015 at 4:20 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Thu, Oct 8, 2015 at 1:52 PM, Andres Freund <andres@anarazel.de> wrote:
>> I don't see the problem? I mean catversion will reliably tell you which format the vm is in?
>
> Totally agreed.
>
>> We could additionally use the opportunity to add a metapage, but that seems like an independent thing.
>
> I agree with that, too.
>

The attached updated v18 patch fixes some bugs.
Please review the patch.

Regards,

--
Masahiko Sawada

Attachment

Re: Freeze avoidance of very large table.

From
Simon Riggs
Date:
On 9 October 2015 at 15:20, Robert Haas <robertmhaas@gmail.com> wrote:
On Thu, Oct 8, 2015 at 1:52 PM, Andres Freund <andres@anarazel.de> wrote:
> I don't see the problem? I mean catversion will reliably tell you which format the vm is in?

Totally agreed.

This isn't an agreement competition; it's a cool look at what might cause problems for all of us.

If we want to avoid bugs in future then we'd better start acting like that is actually true in practice.

Why should we wave away this concern? Will we wave away a concern next time you personally raise one? Bruce would have me believe that we added months onto 9.5 to improve robustness. So let's actually do that. Starting at the first opportunity.

--
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Freeze avoidance of very large table.

From
Andres Freund
Date:
On 2015-10-20 20:35:31 -0400, Simon Riggs wrote:
> On 9 October 2015 at 15:20, Robert Haas <robertmhaas@gmail.com> wrote:
> 
> > On Thu, Oct 8, 2015 at 1:52 PM, Andres Freund <andres@anarazel.de> wrote:
> > > I don't see the problem? I mean catversion will reliably tell you which
> > format the vm is in?
> >
> > Totally agreed.
> >
> 
> This isn't an agreement competition; it's a cool look at what might cause
> problems for all of us.

Uh, we form rough consensuses all the time.

> If we want to avoid bugs in future then we'd better start acting like that
> is actually true in practice.

> Why should we wave away this concern? Will we wave away a concern next time
> you personally raise one? Bruce would have me believe that we added months
> onto 9.5 to improve robustness. So let's actually do that. Starting at the
> first opportunity.

Meh. Adding complexity definitely needs to be weighed against the
benefits. As pointed out e.g. by all the multixact issues you mentioned
upthread. In this case your argument for changing the name doesn't seem
to hold much water.

Greetings,

Andres Freund



Re: Freeze avoidance of very large table.

From
Jim Nasby
Date:
On 10/21/15 8:11 AM, Andres Freund wrote:
> Meh. Adding complexity definitely needs to be weighed against the
> benefits. As pointed out e.g. by all the multixact issues you mentioned
> upthread. In this case your argument for changing the name doesn't seem
> to hold much water.

ISTM VISIBILITY_MAP_FROZEN_BIT_CAT_VER should be defined in catversion.h
instead of pg_upgrade.h, though, to ensure it's correctly updated when
this gets committed.
-- 
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com



Re: Freeze avoidance of very large table.

From
Alvaro Herrera
Date:
Jim Nasby wrote:
> On 10/21/15 8:11 AM, Andres Freund wrote:
> >Meh. Adding complexity definitely needs to be weighed against the
> >benefits. As pointed out e.g. by all the multixact issues you mentioned
> >upthread. In this case your argument for changing the name doesn't seem
> >to hold much water.
> 
> ISTM VISIBILITY_MAP_FROZEN_BIT_CAT_VER should be defined in catversion.h
> instead of pg_upgrade.h, though, to ensure it's correctly updated when this
> gets committed.

That would be untidy and pointless.  pg_upgrade.h contains other
catversions.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Freeze avoidance of very large table.

From
Torsten Zühlsdorff
Date:
On 21.10.2015 02:05, Masahiko Sawada wrote:
> On Sat, Oct 10, 2015 at 4:20 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Thu, Oct 8, 2015 at 1:52 PM, Andres Freund <andres@anarazel.de> wrote:
>>> I don't see the problem? I mean catversion will reliably tell you which format the vm is in?
>>
>> Totally agreed.
>>
>>> We could additionally use the opportunity to add a metapage, but that seems like an independent thing.
>>
>> I agree with that, too.
>>
>
> The attached updated v18 patch fixes some bugs.
> Please review the patch.

I've just checked the comments:

File: /doc/src/sgml/catalogs.sgml

+        Number of pages that are marked all-frozen in the tables's
Should be:
+        Number of pages that are marked all-frozen in the tables

+        <command>ANALYZE</command>, and a few DDL coomand such as
Should be:
+        <command>ANALYZE</command>, and a few DDL command such as

File: doc/src/sgml/maintenance.sgml

+    When the all pages of table are eventually marked as frozen by 
<command>VACUUM</>,
Should be:
+    When all pages of the table are eventually marked as frozen by 
<command>VACUUM</>,

File: /src/backend/access/heap/visibilitymap.c

+ * visibility map bit.  Then, we lock the buffer.  But this creates a race
Should be:
+ * visibility map bit.  Than we lock the buffer.  But this creates a race

+ * buffer, the PD_ALL_VISIBLE or PD_ALL_FROZEN bit gets set.  If that 
happens,
Should be:
+ * buffer, the PD_ALL_VISIBLE or PD_ALL_FROZEN bit gets set. If that 
happens,
(Remove duplicate white space before if)

Please note I'm not a native speaker. There is a good chance that I am
wrong ;)

Greetings,
Torsten



Re: Freeze avoidance of very large table.

From
Masahiko Sawada
Date:
On Thu, Oct 22, 2015 at 4:11 PM, Torsten Zühlsdorff
<mailinglists@toco-domains.de> wrote:
> On 21.10.2015 02:05, Masahiko Sawada wrote:
>>
>> On Sat, Oct 10, 2015 at 4:20 AM, Robert Haas <robertmhaas@gmail.com>
>> wrote:
>>>
>>> On Thu, Oct 8, 2015 at 1:52 PM, Andres Freund <andres@anarazel.de> wrote:
>>>>
>>>> I don't see the problem? I mean catversion will reliably tell you which
>>>> format the vm is in?
>>>
>>>
>>> Totally agreed.
>>>
>>>> We could additionally use the opportunity to add a metapage, but that
>>>> seems like an independent thing.
>>>
>>>
>>> I agree with that, too.
>>>
>>
>> The attached updated v18 patch fixes some bugs.
>> Please review the patch.
>
>
> I've just checked the comments:

Thank you for taking the time to review this patch.
Attached is the updated patch (v19).

> File: /doc/src/sgml/catalogs.sgml
>
> +        Number of pages that are marked all-frozen in the tables's
> Should be:
> +        Number of pages that are marked all-frozen in the tables

I changed it as follows.
+        Number of pages that are marked all-frozen in the table's

A similar sentence exists for relallvisible.

> +        <command>ANALYZE</command>, and a few DDL coomand such as
> Should be:
> +        <command>ANALYZE</command>, and a few DDL command such as

Fixed.

> File: doc/src/sgml/maintenance.sgml
>
> +    When the all pages of table are eventually marked as frozen by
> <command>VACUUM</>,
> Should be:
> +    When all pages of the table are eventually marked as frozen by
> <command>VACUUM</>,

Fixed.

> File: /src/backend/access/heap/visibilitymap.c
>
> + * visibility map bit.  Then, we lock the buffer.  But this creates a race
> Should be:
> + * visibility map bit.  Than we lock the buffer.  But this creates a race

Actually, I didn't change this sentence, so I kept it.

> + * buffer, the PD_ALL_VISIBLE or PD_ALL_FROZEN bit gets set.  If that
> happens,
> Should be:
> + * buffer, the PD_ALL_VISIBLE or PD_ALL_FROZEN bit gets set. If that
> happens,
> (Remove duplicate white space before if)

The other sentences seem to have double white space after the period,
so I kept it.

Please review it.

Regards,

--
Masahiko Sawada

Attachment

Re: Freeze avoidance of very large table.

From
Amit Kapila
Date:
On Mon, Oct 5, 2015 at 9:53 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Mon, Oct 5, 2015 at 11:03 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
> > On Fri, Oct 2, 2015 at 8:14 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >>> +#define Anum_pg_class_relallfrozen        12
> >>> Why is pg_class.relallfrozen necessary? ISTM that there is no user of it now.
> >>
> >> The relallfrozen value would be useful for users to estimate the time a
> >> vacuum freeze or anti-wraparound vacuum will take before actually running it.
> >> (Also, this value is used in the regression test.)
> >> But this information is not used for planning like relallvisible is, so it
> >> would be better to move it to another system view such as
> >> pg_stat_*_tables.
> >
> > Or make pgstattuple and pgstattuple_approx report even the number
> > of frozen tuples?
> >
>
> But we cannot know the number of frozen pages without installing the
> pageinspect module.
> I'm a bit concerned that not all projects can install
> extension modules into PostgreSQL in production environments.
> I think we need to provide such a feature in core, at least.
>

I think we can display information about relallfrozen in pg_stat_*_tables
as suggested by you.  It doesn't make much sense to keep it in pg_class
unless we have some use case for the same.



With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: Freeze avoidance of very large table.

From
Masahiko Sawada
Date:
On Sat, Oct 24, 2015 at 10:59 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Mon, Oct 5, 2015 at 9:53 PM, Masahiko Sawada <sawada.mshk@gmail.com>
> wrote:
>>
>> On Mon, Oct 5, 2015 at 11:03 PM, Fujii Masao <masao.fujii@gmail.com>
>> wrote:
>> > On Fri, Oct 2, 2015 at 8:14 PM, Masahiko Sawada <sawada.mshk@gmail.com>
>> > wrote:
>> >>> +#define Anum_pg_class_relallfrozen        12
>> >>> Why is pg_class.relallfrozen necessary? ISTM that there is no user of
>> >>> it now.
>> >>
>> >> The relallfrozen value would be useful for users to estimate the time a
>> >> vacuum freeze or anti-wraparound vacuum will take before actually running it.
>> >> (Also, this value is used in the regression test.)
>> >> But this information is not used for planning like relallvisible is, so it
>> >> would be better to move it to another system view such as
>> >> pg_stat_*_tables.
>> >
>> > Or make pgstattuple and pgstattuple_approx report even the number
>> > of frozen tuples?
>> >
>>
>> But we cannot know the number of frozen pages without installing the
>> pageinspect module.
>> I'm a bit concerned that not all projects can install
>> extension modules into PostgreSQL in production environments.
>> I think we need to provide such a feature in core, at least.
>>
>
> I think we can display information about relallfrozen in pg_stat_*_tables
> as suggested by you.  It doesn't make much sense to keep it in pg_class
> unless we have some use case for the same.
>

I'm thinking a bit about implementing a read-only table feature that
restricts UPDATE/DELETE and ensures that the whole table is frozen,
if this feature is committed.
The value of relallfrozen might be useful for such a feature.

Regards,

--
Masahiko Sawada



Re: Freeze avoidance of very large table.

From
Amit Kapila
Date:
On Sat, Oct 24, 2015 at 2:24 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Sat, Oct 24, 2015 at 10:59 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > I think we can display information about relallfrozen in pg_stat_*_tables
> > as suggested by you.  It doesn't make much sense to keep it in pg_class
> > unless we have some use case for the same.
> >
>
> I'm thinking a bit about implementing a read-only table feature that
> restricts UPDATE/DELETE and ensures that the whole table is frozen,
> if this feature is committed.
> The value of relallfrozen might be useful for such a feature.
>

If we need this for the read-only table feature, then we'd better add it
after discussing the design of that feature.  It doesn't seem
advisable to have an extra field in a system table which we might
need only for a feature that has not yet been fully discussed.

Review Comments:
-------------------------------
1.
  /*
- * Find buffer to insert this tuple into.  If the page is all visible,
- * this will also pin the requisite visibility map page.
+ * Find buffer to insert this tuple into.  If the page is all visible
+ * or all frozen, this will also pin the requisite visibility map and
+ * frozen map page.
  */
  buffer = RelationGetBufferForTuple(relation, heaptup->t_len,
                                     InvalidBuffer, options, bistate,


I think it is sufficient to say in the end 'visibility map page'.
Let's not include 'frozen map page'.


2.
+ * corresponding page has been completely frozen, so the visibility map is also
+ * used for anti-wraparound vacuum, even if freezing tuples is required.

/all tuple/all tuples
/freezing tuples/freezing of tuples

3.
- * Are all tuples on heapBlk visible to all, according to the visibility map?
+ * Are all tuples on heapBlk visible or frozen to all, according to the visibility map?

I think it is better to modify the above statement as:
Are all tuples on heapBlk visible to all or are marked as frozen, according
to the visibility map?

4.
+ * releasing *buf after it's done testing and setting bits, and must set flags
+ * which indicates what flag we want to test.

Here, are you talking about the flags passed to visibilitymap_set()? If
yes, then the above comment is not clear; how about:

and must pass flags
for which it needs to check the value in visibility map.

5.
+ * both how many pages we skipped according to all-frozen bit of visibility
+ * map and how many pages we freeze page, so we can update relfrozenxid if

In the above sentence, the word 'page' after 'freeze' sounds redundant.
/we freeze page/we freeze

Another suggestion:
/sum of them/sum of two

6.
+ * This block is at least all-visible according to visibility map.
+ * We check whehter this block is all-frozen or not, to skip to

'whether' is mis-spelled.

7.
+ * If we froze any tuples or any tuples are already frozen,
+ * mark the buffer dirty, and write a WAL record recording the changes.

Here, I think a WAL record is written only when we mark some
tuple(s) as frozen, not if they are already frozen,
so in that regard, I think the above comment is wrong.

8.
+ /*
+ * We cant't allow upgrading with link mode between 9.5 or before and 9.6 or later,
+ * because the format of visibility map has been changed on version 9.6.
+ */


a. /cant't/can't
b. changed on version 9.6/changed in version 9.6
c. Won't such a change need to be updated in the pg_upgrade
documentation (Notes Section)?

9.
@@ -180,6 +181,13 @@ transfer_single_new_db(pageCnvCtx *pageConverter,

  new_cluster.controldata.cat_ver >= VISIBILITY_MAP_CRASHSAFE_CAT_VER)
      vm_crashsafe_match = false;

+ /*
+  * Do we need to rewrite visibilitymap?
+  */
+ if (old_cluster.controldata.cat_ver < VISIBILITY_MAP_FROZEN_BIT_CAT_VER &&
+     new_cluster.controldata.cat_ver >= VISIBILITY_MAP_FROZEN_BIT_CAT_VER)
+     vm_rewrite_needed = true;

..

@@ -276,7 +285,15 @@ transfer_relfile(pageCnvCtx *pageConverter, FileNameMap *map,
  {
      pg_log(PG_VERBOSE, "copying \"%s\" to \"%s\"\n", old_file, new_file);

- if ((msg = copyAndUpdateFile(pageConverter, old_file, new_file, true)) != NULL)
+ /*
+  * Do we need to rewrite visibilitymap?
+  */
+ if (strcmp(type_suffix, "_vm") == 0 &&
+     old_cluster.controldata.cat_ver < VISIBILITY_MAP_FROZEN_BIT_CAT_VER &&
+     new_cluster.controldata.cat_ver >= VISIBILITY_MAP_FROZEN_BIT_CAT_VER)
+     rewrite_vm = true;

Instead of doing the re-check in transfer_relfile(), I think it is better
to pass an additional parameter to this function.
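For instance, a hypothetical reshaping (the signature is simplified from
the quoted diff; the extra bool parameter is the assumed addition):

    /* hypothetical: the caller does the catversion check once and passes
     * the result down, instead of re-checking per file */
    static void
    transfer_relfile(pageCnvCtx *pageConverter, FileNameMap *map,
                     const char *type_suffix, bool rewrite_vm);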

10.
You mentioned up-thread that you changed the patch so that
PageClearAllVisible clears both bits; can you please point me to this
change?
Basically, after applying the patch, I see the below code in bufpage.h:
#define PageClearAllVisible(page) \
(((PageHeader) (page))->pd_flags &= ~PD_ALL_VISIBLE)

Don't we need to clear the PD_ALL_FROZEN separately?


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: Freeze avoidance of very large table.

From
Masahiko Sawada
Date:
On Wed, Oct 28, 2015 at 12:58 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Sat, Oct 24, 2015 at 2:24 PM, Masahiko Sawada <sawada.mshk@gmail.com>
> wrote:
>>
>> On Sat, Oct 24, 2015 at 10:59 AM, Amit Kapila <amit.kapila16@gmail.com>
>> wrote:
>> >
>> > I think we can display information about relallfrozen in
>> > pg_stat_*_tables
>> > as suggested by you.  It doesn't make much sense to keep it in pg_class
>> > unless we have some use case for the same.
>> >
>>
>> I'm thinking a bit about implementing a read-only table feature that
>> restricts UPDATE/DELETE and ensures that the whole table is frozen,
>> if this feature is committed.
>> The value of relallfrozen might be useful for such a feature.
>>

Thank you for reviewing!

> If we need this for the read-only table feature, then we'd better add it
> after discussing the design of that feature.  It doesn't seem
> advisable to have an extra field in a system table which we might
> need only for a feature that has not yet been fully discussed.

I changed it so that the number of frozen pages is stored in
pg_stat_all_tables as statistics information.
Also, tests related to counting the all-visible bit and skipping
vacuum have been added to the visibility map test, and a test related to
counting all-frozen has been added to the stats collector test.

Attached updated v20 patch.

> Review Comments:
> -------------------------------
> 1.
>   /*
> - * Find buffer to insert this tuple into.  If the page is all visible,
> - * this will also pin
> the requisite visibility map page.
> + * Find buffer to insert this tuple into.  If the page is all
> visible
> + * or all frozen, this will also pin the requisite visibility map and
> + * frozen map page.
>
>  */
>   buffer = RelationGetBufferForTuple(relation, heaptup->t_len,
>
>   InvalidBuffer, options, bistate,
>
>
> I think it is sufficient to say in the end 'visibility map page'.
> Let's not include 'frozen map page'.

Fixed.

>
> 2.
> + * corresponding page has been completely frozen, so the visibility map is
> also
> + * used for anti-wraparound
> vacuum, even if freezing tuples is required.
>
> /all tuple/all tuples
> /freezing tuples/freezing of tuples

Fixed.

> 3.
> - * Are all tuples on heapBlk visible to all, according to the visibility
> map?
> + * Are all tuples on heapBlk
> visible or frozen to all, according to the visibility map?
>
> I think it is better to modify the above statement as:
> Are all tuples on heapBlk visible to all or are marked as frozen, according
> to the visibility map?

Fixed.

> 4.
> + * releasing *buf after it's done testing and setting bits, and must set
> flags
> + * which indicates what flag
> we want to test.
>
> Here, are you talking about the flags passed to visibilitymap_set()? If
> yes, then the above comment is not clear; how about:
>
> and must pass flags
> for which it needs to check the value in visibility map.

Fixed.

> 5.
> + * both how many pages we skipped according to all-frozen bit of visibility
> + * map and how many
> pages we freeze page, so we can update relfrozenxid if
>
> In the above sentence, the word 'page' after 'freeze' sounds redundant.
> /we freeze page/we freeze
>
> Another suggestion:
> /sum of them/sum of two

Fixed.

> 6.
> + * This block is at least all-visible according to visibility map.
> +
>  * We check whehter this block is all-frozen or not, to skip to
>
> 'whether' is mis-spelled.

Fixed.

> 7.
> + * If we froze any tuples or any tuples are already frozen,
> + * mark the buffer
> dirty, and write a WAL record recording the changes.
>
> Here, I think a WAL record is written only when we mark some
> tuple(s) as frozen, not if they are already frozen,
> so in that regard, I think the above comment is wrong.

It's wrong.
Fixed.

> 8.
> + /*
> + * We cant't allow upgrading with link mode between 9.5 or before and 9.6
> or later,
> + *
> because the format of visibility map has been changed on version 9.6.
> + */
>
>
> a. /cant't/can't
> b. changed on version 9.6/changed in version 9.6
> c. Won't such a change need to be updated in the pg_upgrade
> documentation (Notes Section)?

Fixed.
And updated document.

> 9.
> @@ -180,6 +181,13 @@ transfer_single_new_db(pageCnvCtx *pageConverter,
>
>   new_cluster.controldata.cat_ver >= VISIBILITY_MAP_CRASHSAFE_CAT_VER)
>   vm_crashsafe_match = false;
>
> + /*
> +  * Do we need to rewrite visibilitymap?
> +  */
> + if (old_cluster.controldata.cat_ver < VISIBILITY_MAP_FROZEN_BIT_CAT_VER &&
> +     new_cluster.controldata.cat_ver >= VISIBILITY_MAP_FROZEN_BIT_CAT_VER)
> +     vm_rewrite_needed = true;
>
> ..
>
> @@ -276,7 +285,15 @@ transfer_relfile(pageCnvCtx *pageConverter, FileNameMap *map,
>   {
>     pg_log(PG_VERBOSE, "copying \"%s\" to \"%s\"\n", old_file, new_file);
>
> -   if ((msg = copyAndUpdateFile(pageConverter, old_file, new_file, true)) != NULL)
> +   /*
> +    * Do we need to rewrite visibilitymap?
> +    */
> +   if (strcmp(type_suffix, "_vm") == 0 &&
> +       old_cluster.controldata.cat_ver < VISIBILITY_MAP_FROZEN_BIT_CAT_VER &&
> +       new_cluster.controldata.cat_ver >= VISIBILITY_MAP_FROZEN_BIT_CAT_VER)
> +       rewrite_vm = true;
>
> Instead of doing re-check in transfer_relfile(), I think it is better
> to pass an additional parameter in this function.

I agree.
Fixed.
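
For reference, the change could look roughly like this (a sketch; the
extra parameter and its exact name are illustrative, not necessarily
the exact code in the patch):

    /* in transfer_single_new_db(): decide once whether the visibility
     * map files must be rewritten for the new all-frozen bit format */
    bool        vm_rewrite_needed =
        old_cluster.controldata.cat_ver < VISIBILITY_MAP_FROZEN_BIT_CAT_VER &&
        new_cluster.controldata.cat_ver >= VISIBILITY_MAP_FROZEN_BIT_CAT_VER;

    /* ... then pass it down instead of re-testing per file */
    transfer_relfile(pageConverter, &maps[mapnum], type_suffix,
                     vm_rewrite_needed);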

>
> 10.
> You have mentioned up-thread that, you have changed the patch so that
> PageClearAllVisible clear both bits, can you please point me to this
> change?
> Basically after applying the patch, I see below code in bufpage.h:
> #define PageClearAllVisible(page) \
> (((PageHeader) (page))->pd_flags &= ~PD_ALL_VISIBLE)
>
> Don't we need to clear the PD_ALL_FROZEN separately?

The previous patch was wrong. PageClearAllVisible() should be:
#define PageClearAllVisible(page) \
       (((PageHeader) (page))->pd_flags &= ~(PD_ALL_VISIBLE | PD_ALL_FROZEN))

The all-frozen flag/bit is cleared only when the page is modified, so
it is impossible for only the all-frozen flag/bit to be cleared.
Clearing the all-visible flag/bit also means that the page has some
garbage and needs to be vacuumed.
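
To make the relationship between the two flags concrete, here is a
minimal sketch of the page-level macros as they might look with the
patch (PD_ALL_FROZEN's bit value is an assumption; PD_ALL_VISIBLE is
the existing bufpage.h flag):

    #define PD_ALL_VISIBLE  0x0004  /* existing: all tuples visible to all */
    #define PD_ALL_FROZEN   0x0008  /* assumed: next free pd_flags bit */

    #define PageIsAllFrozen(page) \
            (((PageHeader) (page))->pd_flags & PD_ALL_FROZEN)
    /* setting all-frozen only makes sense on an all-visible page */
    #define PageSetAllFrozen(page) \
            (((PageHeader) (page))->pd_flags |= PD_ALL_FROZEN)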

Regards,

--
Masahiko Sawada

Attachment

Re: Freeze avoidance of very large table.

From
Masahiko Sawada
Date:
On Fri, Oct 30, 2015 at 1:26 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> [...]

The v20 patch has a bug that shows up in the regression test results.
Attached updated v21 patch.

Regards,

--
Masahiko Sawada

Attachment

Re: Freeze avoidance of very large table.

From
Amit Kapila
Date:
On Thu, Oct 8, 2015 at 11:05 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
>
> On 1 October 2015 at 23:30, Josh Berkus <josh@agliodbs.com> wrote:
>>
>> On 10/01/2015 07:43 AM, Robert Haas wrote:
>> > On Thu, Oct 1, 2015 at 9:44 AM, Fujii Masao <masao.fujii@gmail.com> wrote:
>> >> I wonder how much it's worth renaming only the file extension while
>> >> there are many places where "visibility map" and "vm" are used,
>> >> for example, log messages, function names, variables, etc.
>> >
>> > I'd be inclined to keep calling it the visibility map (vm) even if it
>> > also contains freeze information.
>> >

What is your main worry about changing the name of this map: is it
the additional code churn, the risk that we might introduce new
issues, or that people are already accustomed to calling this map the
visibility map?

>>
>> -1 to rename.  Visibility Map is a perfectly good name.
>
>
> The name can stay the same, but specifically the file extension should change.
>

It seems quite logical for understanding purposes as well.  Any new
person who wants to work in this area, or is looking into it, will
always wonder why this map is named the visibility map even though it
contains information about both the visibility and the frozen state of
a page.  So even though it makes no difference to the correctness of
the feature whether we retain the current name or change it to
Visibility & Freeze Map (aka vfm), I think it makes sense to change it
for the sake of maintainability of this code.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: Freeze avoidance of very large table.

From
Amit Kapila
Date:
On Fri, Oct 30, 2015 at 6:03 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
>
>
> v20 patch has a bug in result of regression test.
> Attached updated v21 patch.
>

A couple more review comments:
------------------------------------------------------

1.
@@ -615,6 +617,8 @@ typedef struct PgStat_StatTabEntry
  PgStat_Counter n_dead_tuples;
  PgStat_Counter changes_since_analyze;

+ int32 n_frozen_pages;
+
  PgStat_Counter blocks_fetched;
  PgStat_Counter blocks_hit;

As you are changing above structure, you need to update
PGSTAT_FILE_FORMAT_ID, refer below code:
#define PGSTAT_FILE_FORMAT_ID 0x01A5BC9D
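
For illustration only (the new value is arbitrary; it just needs to
differ from the old one):

    /* bump: PgStat_StatTabEntry gained n_frozen_pages */
    #define PGSTAT_FILE_FORMAT_ID 0x01A5BC9E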

2. It seems that n_frozen_page is not initialized/updated properly
for toast tables:

Try with below steps:

postgres=# create table t4(c1 int, c2 text);
CREATE TABLE
postgres=# select oid, relname from pg_class where relname like '%t4%';
  oid  | relname
-------+---------
 16390 | t4
(1 row)


postgres=# select oid, relname from pg_class where relname like '%16390%';
  oid  |       relname
-------+----------------------
 16393 | pg_toast_16390
 16395 | pg_toast_16390_index
(2 rows)

postgres=# select relname,seq_scan,n_tup_ins,last_vacuum,n_frozen_page
           from pg_stat_all_tables where relname like '%16390%';
    relname     | seq_scan | n_tup_ins | last_vacuum | n_frozen_page
----------------+----------+-----------+-------------+---------------
 pg_toast_16390 |        1 |         0 |             |    -842150451
(1 row)

Note that I have tested the above scenario on my Windows 7 m/c;
-842150451 is 0xCDCDCDCD, the MSVC debug-heap fill pattern, which
suggests the counter is being read from uninitialized memory.
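
Presumably the fix is to zero the new counter wherever the other
PgStat_StatTabEntry counters are initialized; a sketch (assuming the
entry-creation path in pgstat.c, e.g. pgstat_get_tab_entry(), is the
right place):

    /* when creating a fresh table entry, zero all counters */
    result->n_live_tuples = 0;
    result->n_dead_tuples = 0;
    result->changes_since_analyze = 0;
    result->n_frozen_pages = 0;     /* the new field must be zeroed too */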

3.
 * visibilitymap.c
 *  bitmap for tracking visibility of heap tuples

I think above needs to be changed to:
bitmap for tracking visibility and frozen state of heap tuples


4.
a.
  /*
- * If we froze any tuples, mark the buffer dirty, and write a WAL
- * record recording the changes.  We must log the changes to be
- * crash-safe against future truncation of CLOG.
+ * If we froze any tuples then we mark the buffer dirty, and write a WAL

b.
- * Release any remaining pin on visibility map page.
+ * Release any remaining pin on visibility map.

c.
  * We do update relallvisible even in the corner case, since if the table
- * is all-visible we'd definitely like to know that.  But clamp the value
- * to be not more than what we're setting relpages to.
+ * is all-visible we'd definitely like to know that.
+ * But clamp the value to be not more than what we're setting relpages to.

I don't think you need to change above comments.

5.
+ * Even if scan_all is set so far, we could skip to scan some pages
+ * according by all-frozen bit of visibility amp.

/according by/according to
/amp/map

I suggest modifying the comment as below:
During full scan, we could skip some pages according to all-frozen
bit of visibility map.

Also, there is no need to start this on a new line; start from where
the previous line of the comment ends.

6.
/*
 * lazy_scan_heap() -- scan an open heap relation
 *
 * This routine prunes each page in the heap, which will among other
 * things truncate dead tuples to dead line pointers, defragment the
 * page, and set commit status bits (see heap_page_prune).  It also builds
 * lists of dead tuples and pages with free space, calculates statistics
 * on the number of live tuples in the heap, and marks pages as
 * all-visible if appropriate.

Modify the above function header to mention:

all-visible, all-frozen

7.
lazy_scan_heap()
{
..

if (PageIsEmpty(page))
{
    empty_pages++;
    freespace = PageGetHeapFreeSpace(page);

    /* empty pages are always all-visible */
    if (!PageIsAllVisible(page))
..
}

Don't we need to ensure that empty pages also get marked as
all-frozen?
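
Something like this sketch, perhaps (untested; PageSetAllFrozen() and
the extra flags argument to visibilitymap_set() follow the patch's
proposal, not the current tree):

    if (PageIsEmpty(page))
    {
        empty_pages++;
        freespace = PageGetHeapFreeSpace(page);

        if (!PageIsAllVisible(page))
        {
            PageSetAllVisible(page);
            /* an empty page trivially contains no unfrozen tuples */
            PageSetAllFrozen(page);
            MarkBufferDirty(buf);
            visibilitymap_set(onerel, blkno, buf, InvalidXLogRecPtr,
                              vmbuffer, InvalidTransactionId,
                              VISIBILITYMAP_ALL_VISIBLE |
                              VISIBILITYMAP_ALL_FROZEN);
        }
    }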

8.
lazy_scan_heap()
{
..
    /*
     * As of PostgreSQL 9.2, the visibility map bit should never be set if
     * the page-level bit is clear.  However, it's possible that the bit
     * got cleared after we checked it and before we took the buffer
     * content lock, so we must recheck before jumping to the conclusion
     * that something bad has happened.
     */
    else if (all_visible_according_to_vm && !PageIsAllVisible(page)
             && visibilitymap_test(onerel, blkno, &vmbuffer,
                                   VISIBILITYMAP_ALL_VISIBLE))
    {
        elog(WARNING, "page is not marked all-visible but visibility map bit is set in relation \"%s\" page %u",
             relname, blkno);
        visibilitymap_clear(onerel, blkno, vmbuffer);
    }

    /*
     * It's possible for the value returned by GetOldestXmin() to move
     * backwards, so it's not wrong for us to see tuples that appear to
     * not be visible to everyone yet, while PD_ALL_VISIBLE is already
     * set.  The real safe xmin value never moves backwards, but
     * GetOldestXmin() is conservative and sometimes returns a value
     * that's unnecessarily small, so if we see that contradiction it just
     * means that the tuples that we think are not visible to everyone yet
     * actually are, and the PD_ALL_VISIBLE flag is correct.
     *
     * There should never be dead tuples on a page with PD_ALL_VISIBLE
     * set, however.
     */
    else if (PageIsAllVisible(page) && has_dead_tuples)
    {
        elog(WARNING, "page containing dead tuples is marked as all-visible in relation \"%s\" page %u",
             relname, blkno);
        PageClearAllVisible(page);
        MarkBufferDirty(buf);
        visibilitymap_clear(onerel, blkno, vmbuffer);
    }
..
}

I think both of the above cases could happen for the frozen state as
well; unless you think otherwise, we need similar handling for the
frozen bit.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: Freeze avoidance of very large table.

From
Robert Haas
Date:
On Sat, Oct 31, 2015 at 1:32 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> [...]
>
> What is your main worry about changing the name of this map, is it
> about more code churn or is it about that we might introduce new issues
> or is it about that people are already accustomed to call this map as
> visibility map?

My concern is mostly that I think calling it the "visibility and
freeze map" is excessively long and wordy.

One observation that someone made previously is that there is a
difference between "all-visible" and "index-only scan OK".  An
all-visible page that has a HOT update is no longer all-visible (it
needs vacuuming) but an index-only scan would still be OK (because
only the non-indexed values in the tuple have changed, and every scan
scan can see either the old or the new tuple but not both.  At
present, the index-only scan will consult the heap page anyway,
because all we know is that the page is not all-visible.  But maybe in
the future somebody will decide to add a bit for that.  Then we'd have
the "visibility, usable for index-only scans, and freeze map", but I
think "_vufiosfm" will not be a good choice for a file suffix.

So similarly here.  The file suffix doesn't need to enumerate all the
bits that are present for each page.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Freeze avoidance of very large table.

From
Amit Kapila
Date:
On Tue, Nov 3, 2015 at 5:04 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Sat, Oct 31, 2015 at 1:32 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > What is your main worry about changing the name of this map, is it
> > about more code churn or is it about that we might introduce new issues
> > or is it about that people are already accustomed to call this map as
> > visibility map?
>
> My concern is mostly that I think calling it the "visibility and
> freeze map" is excessively long and wordy.
>
> One observation that someone made previously is that there is a
> difference between "all-visible" and "index-only scan OK".  An
> all-visible page that has a HOT update is no longer all-visible (it
> needs vacuuming) but an index-only scan would still be OK (because
> only the non-indexed values in the tuple have changed, and every scan
> scan can see either the old or the new tuple but not both.  At
> present, the index-only scan will consult the heap page anyway,
> because all we know is that the page is not all-visible.  But maybe in
> the future somebody will decide to add a bit for that.  Then we'd have
> the "visibility, usable for index-only scans, and freeze map", but I
> think "_vufiosfm" will not be a good choice for a file suffix.
>

I think in that case we can call it a page info map or page state map,
but I find retaining the visibility map name in this case, or in the
future (if we want to add another bit), confusing.  In fact, if you
find "visibility and freeze map" excessively long, then we can change
it to "page info map" or "page state map" now as well.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: Freeze avoidance of very large table.

From
Robert Haas
Date:
On Mon, Nov 2, 2015 at 10:33 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Tue, Nov 3, 2015 at 5:04 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> [...]
>
> I think in that case we can call it as page info map or page state map, but
> I find retaining visibility map name in this case or for future (if we want
> to
> add another bit) as confusing.  In-fact if you find "visibility and freeze
> map",
> as excessively long, then we can change it to "page info map" or "page state
> map" now as well.

Sure.  Or we could just keep calling it the visibility map, and then
everyone would know what we're talking about.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Freeze avoidance of very large table.

From
Masahiko Sawada
Date:
On Tue, Nov 3, 2015 at 12:33 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Tue, Nov 3, 2015 at 5:04 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> [...]
>
> I think in that case we can call it as page info map or page state map, but
> I find retaining visibility map name in this case or for future (if we want
> to
> add another bit) as confusing.  In-fact if you find "visibility and freeze
> map",
> as excessively long, then we can change it to "page info map" or "page state
> map" now as well.
>

In that case, the file suffix would be "_pim" or "_psm"?
IMO, "page info map" would be better, because the bit doesn't indicate
the status of the page in real time; it's just additional information.
We would also need to change to the new name in the source code, and
in the source file names as well.

Regards,

--
Masahiko Sawada



Re: Freeze avoidance of very large table.

From
Amit Kapila
Date:
On Wed, Nov 4, 2015 at 4:45 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Tue, Nov 3, 2015 at 12:33 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > On Tue, Nov 3, 2015 at 5:04 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> >> [...]
> >
> > I think in that case we can call it as page info map or page state map, but
> > I find retaining visibility map name in this case or for future (if we want
> > to
> > add another bit) as confusing.  In-fact if you find "visibility and freeze
> > map",
> > as excessively long, then we can change it to "page info map" or "page state
> > map" now as well.
> >
>
> In that case, file suffix would be "_pim" or "_psm"?

Right.

> IMO, "page info map" would be better, because the bit doesn't indicate
> the status of page in real time, it's just additional information.
> Also we need to rewrite to new name in source code, and source file
> name as well.
>

I think so.  Here I think the right thing to do is to proceed with
fixing the other issues in the patch and work on this part later; in
the meantime we might get more feedback on this part of the proposal.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: Freeze avoidance of very large table.

From
Kyotaro HORIGUCHI
Date:
Hello, I had a look at the v21 patch.

Though I haven't looked at the whole of the patch, I'd like to show
you some comments, only for visibilitymap.c and a part of the
documentation.


1. Patch application

   patch command complains about offsets for heapam.c at current
   master.

2. visibilitymap_test()

-  if (visibilitymap_test(rel, blkno, &vmbuffer))
+  if (visibilitymap_test(rel, blkno, &vmbuffer, VISIBILITYMAP_ALL_VISIBLE)

   The old VM was a simple bitmap, so the name _test and the function
   were proper, but now the bitmap is quad-state, so it'd be better to
   change the function. Although it is not so expensive to call it
   twice successively, doing so makes me a bit uneasy. One possible
   shape would be like the following.

   lazy_vacuum_page()
   > int vmstate = visibilitymap_get_status(rel, blkno, &vmbuffer);
   > if (!(vmstate & VISIBILITYMAP_ALL_VISIBLE))
   >   ...
   > if (all_frozen && !(vmstate & VISIBILITYMAP_ALL_FROZEN))
   >   ...
   > if (flags != vmstate)
   >   visibilitymap_set(...., flags);

   and defining two macros for individual tests:

   > #define VM_ALL_VISIBLE(r, b, v) ((vm_get_status((r), (b), (v)) & .._VISIBLE) != 0)
   > if (VM_ALL_VISIBLE(rel, blkno, &vmbuffer))
   and
   > if (VM_ALL_FROZEN(rel, blkno, &vmbuffer))

   How about this?

3. visibilitymap.c

 - HEAPBLK_TO_MAPBIT

   In visibilitymap_clear and other functions, mapBit means mapDualBit
   in the patch, and mapBit always appears in the form
   "mapBit * BITS_PER_HEAPBLOCK". So, it'd be better to change the
   definition of HEAPBLK_TO_MAPBIT so that it really calculates the
   bit position in a byte.

   - #define HEAPBLK_TO_MAPBIT(x) ((x) % HEAPBLOCKS_PER_BYTE)
   + #define HEAPBLK_TO_MAPBIT(x) (((x) % HEAPBLOCKS_PER_BYTE) * BITS_PER_HEAPBLOCK)

 - visibilitymap_count()

   The third argument all_frozen is not necessary in some usages, so
   an interface like the following would be preferable:

   BlockNumber
   visibilitymap_count(Relation rel, BlockNumber *all_frozen)
   {
      BlockNumber all_visible = 0;
   ...
      if (all_frozen)
            *all_frozen = 0;
   ... something like ...

 - visibilitymap_set()

   The check for ALL_VISIBLE is duplicated in the following assertion.

   > Assert((flags & VISIBILITYMAP_ALL_VISIBLE) ||
   >        (flags & (VISIBILITYMAP_ALL_VISIBLE | VISIBILITYMAP_ALL_FROZEN)));

4. documentation

 - 18.11.1 Statement Behavior

   A typo.

   > VACUUM performs *a* aggressive freezing

   However, I am not a fluent English speaker and such wordsmithing
   would be done by someone else; I feel that "eager/greedy" is more
   suitable for this meaning. Nevertheless, the term "whole-table
   freezing" that you wrote elsewhere in this patch would be usable:

   "VACUUM performs a whole-table freezing"

   All "a table scan/sweep"s, and anything with a similar meaning,
   would better be changed to "a whole-table freezing".

   In a similar manner, "tuples/rows that are marked as frozen" could
   be replaced with "unfrozen tuples/rows".

 - 23.1.5 Preventing Transaction ID Wraparound Failures

   "The whole table is scanned only when all pages happen to
    require vacuuming to remove dead row versions."

   This description looks a bit off the point. "the whole table scan"
   in the original description is what is triggered by relfrozenxid,
   so the correspondent in the revised description is "the whole-table
   freezing", maybe:

   "The whole-table freezing takes place when
    <structfield>relfrozenxid</> is more than
    <varname>vacuum_freeze_table_age</> transactions old or when
    <command>VACUUM</>'s <literal>FREEZE</> option is used. The
    whole-table freezing scans all unfrozen pages."

   The last sentence might be unnecessary.

 - 63.4 Visibility Map

   "pages contain only tuples that are marked as frozen" would be
   enough as "pages contain only frozen tuples",

   and according to the discussion upthread, it might be good to have
   some description noting that the name historically omits the
   freeze-map aspect.
 


At Sat, 31 Oct 2015 18:07:32 +0530, Amit Kapila <amit.kapila16@gmail.com> wrote in
<CAA4eK1+aTdaSwG3u+y8fXxn67Kkj0T1KzRsFDLEi=tQvTYgFrQ@mail.gmail.com>
amit.kapila16> On Fri, Oct 30, 2015 at 6:03 AM, Masahiko Sawada <sawada.mshk@gmail.com>
> [...]

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Freeze avoidance of very large table.

From
Masahiko Sawada
Date:
On Wed, Nov 4, 2015 at 12:19 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Wed, Nov 4, 2015 at 4:45 AM, Masahiko Sawada <sawada.mshk@gmail.com>
> wrote:
>>
>> On Tue, Nov 3, 2015 at 12:33 PM, Amit Kapila <amit.kapila16@gmail.com>
>> wrote:
>> > [...]
>>
>> In that case, file suffix would be "_pim" or "_psm"?
>
> Right.
>
>> IMO, "page info map" would be better, because the bit doesn't indicate
>> the status of page in real time, it's just additional information.
>> Also we need to rewrite to new name in source code, and source file
>> name as well.
>>
>
> I think so.  Here I think the right thing to do is lets proceed with fixing
> other issues of patch and work on this part later and in the mean time
> we might get more feedback on this part of proposal.
>

Yeah, I'm going to make those changes if there is no strong objection from hackers.

Regards,

--
Masahiko Sawada



Re: Freeze avoidance of very large table.

From
Masahiko Sawada
Date:
On Thu, Nov 5, 2015 at 6:03 PM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> [...]

Thank you for reviewing the patch.

I changed the patch so that the visibility map becomes the page info
map, in the source code and documentation.
And fixed review comments I received.
Attached v22 patch.

> I think both the above cases could happen for frozen state
> as well, unless you think otherwise, we need similar handling
> for frozen bit.

The situation where a page is all-frozen but not all-visible cannot
happen; the bits of the visibility map are cleared at the same time,
as are the page flags.
So I think it's enough to handle only the all-visible situation.  Am I
missing something?

> 2. visitibilymap_test()
>
> -  if (visibilitymap_test(rel, blkno, &vmbuffer))
> +  if (visibilitymap_test(rel, blkno, &vmbuffer, VISIBILITYMAP_ALL_VISIBLE)
>
>  The old VM was a simple bitmap so the name _test and the
>  function are proper but now the bitmap is quad state so it'd be
>  better chainging the function. Alghough it is not so expensive
>  to call it twice successively, it is a bit uneasy for me doing
>  so. One possible shape would be like the following.
>
>  lazy_vacuum_page()
>  > int vmstate = visibilitymap_get_status(rel, blkno, &vmbuffer);
>  > if (!(vmstate  & VISIBILITYMAP_ALL_VISIBLE))
>  >   ...
>  > if (all_frozen && !(vmstate  & VISIBILITYMAP_ALL_FROZEN))
>  >   ...
>  > if (flags != vmstate)
>  >   visibilitymap_set(...., flags);
>
>  and defining two macros for indivisual tests,
>
>  > #define VM_ALL_VISIBLE(r, b, v) ((vm_get_status((r), (b), (v)) & .._VISIBLE) != 0)
>  > if (VM_ALL_VISIBLE(rel, blkno, &vmbuffer))
>  and
>  > if (VM_ALL_FROZEN(rel, blkno, &vmbuffer))
>
>  How about this?

I agree.
I've changed it accordingly.
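
Concretely, the adopted test macros could look something like this (a
sketch, assuming visibilitymap_get_status() returns both map bits for
the given block):

    #define VM_ALL_VISIBLE(rel, blkno, vmbuf) \
        ((visibilitymap_get_status((rel), (blkno), (vmbuf)) & \
          VISIBILITYMAP_ALL_VISIBLE) != 0)
    #define VM_ALL_FROZEN(rel, blkno, vmbuf) \
        ((visibilitymap_get_status((rel), (blkno), (vmbuf)) & \
          VISIBILITYMAP_ALL_FROZEN) != 0)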

Regards,

--
Masahiko Sawada

Attachment

Re: Freeze avoidance of very large table.

From
Amit Kapila
Date:
On Fri, Nov 13, 2015 at 4:48 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
>
> Thank you for reviewing the patch.
>
> I changed the patch so that the visibility map become the page info
> map, in source code and documentation.
>

One thing to notice is that this almost doubles the patch size, which
might make it slightly more difficult to review, but on the other
hand, if nobody opposes such a change, this seems to be the right
direction.

> And fixed review comments I received.
> Attached v22 patch.
>
> > I think both the above cases could happen for frozen state
> > as well, unless you think otherwise, we need similar handling
> > for frozen bit.
>
> It's not happen the situation where is all-frozen and not all-visible,
> and the bits of visibility map are cleared at the same time, page
> flags are as well.
> So I think it's enough to handle only all-visible situation. Am I
>
> missing something?
>

No, I think you are right: the information for both is cleared
together, and all-visible is a superset of all-frozen (meaning that if
all-frozen is set, then all-visible must be set), so it is sufficient
to check the visibility info in the above situation.  But I feel we
can update the comment to indicate the same, and add an Assert to
ensure that if all-frozen is set, all-visible must be set.
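
Roughly like this, perhaps (a sketch; it assumes the VM_ALL_FROZEN /
VM_ALL_VISIBLE test macros proposed earlier in the thread):

    /* all-frozen must imply all-visible */
    Assert(!VM_ALL_FROZEN(onerel, blkno, &vmbuffer) ||
           VM_ALL_VISIBLE(onerel, blkno, &vmbuffer));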


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: Freeze avoidance of very large table.

From
Masahiko Sawada
Date:
On Fri, Nov 13, 2015 at 1:44 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Fri, Nov 13, 2015 at 4:48 AM, Masahiko Sawada <sawada.mshk@gmail.com>
> wrote:
>>
>>
>> Thank you for reviewing the patch.
>>
>> I changed the patch so that the visibility map become the page info
>> map, in source code and documentation.
>>
>
> One thing to notice is that this almost doubles the patch size which
> might makes it slightly difficult to review, but on the other hand if
> no-body opposes for such a change, this seems to be the right direction.

I believe that it's going in the right direction.
But I think we haven't reached consensus on this change yet, so it
might be reverted.

>
>> And fixed review comments I received.
>> Attached v22 patch.
>>
>> > I think both the above cases could happen for frozen state
>> > as well, unless you think otherwise, we need similar handling
>> > for frozen bit.
>>
>> It's not happen the situation where is all-frozen and not all-visible,
>> and the bits of visibility map are cleared at the same time, page
>> flags are as well.
>> So I think it's enough to handle only all-visible situation. Am I
>>
>> missing something?
>>
>
> No, I think you are right as information for both is cleared together
> and all-visible is superset of all-frozen (means if all-frozen is set,
> then all-visible must be set), so it is sufficient to check visibility
> info in above situation, but I feel we can update the comment to
> indicate the same and add an Assert to ensure if all-frozen is set
> all-visibile must be set.

I agree.
I added an Assert() and some comments to lazy_scan_heap().
Attached v23 patch.

Regards,

--
Masahiko Sawada

Attachment

Re: Freeze avoidance of very large table.

From
Bruce Momjian
Date:
On Tue, Nov  3, 2015 at 09:03:49AM +0530, Amit Kapila wrote:
> I think in that case we can call it as page info map or page state map, but
> I find retaining visibility map name in this case or for future (if we want to
> add another bit) as confusing.  In-fact if you find "visibility and freeze
> map",
> as excessively long, then we can change it to "page info map" or "page state
> map" now as well.

Coming in late here, but the problem with "page info map" is that free
space is also page info (how much free space on each page), so "page
info map" isn't very descriptive.  "page status" or "page state" might
make more sense, but even then, free space is a kind of page
status/state.  What is happening is that broadening the name to cover
both visibility and freeze state also encompasses free space.

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

+ As you are, so once was I. As I am, so you will be. +
+ Roman grave inscription                             +



Re: Freeze avoidance of very large table.

From
Andres Freund
Date:
On 2015-10-31 11:02:12 +0530, Amit Kapila wrote:
> [...]
> 
> What is your main worry about changing the name of this map, is it
> about more code churn or is it about that we might introduce new issues
> or is it about that people are already accustomed to call this map as
> visibility map?

Several:
* Visibility map is rather descriptive, none of the replacement terms
  imo come close. Few people will know what a 'freeze' map is.
* It increases the size of the patch considerably
* It forces tooling that knows about the layout of the database
  directory to change their tools

On the benefit side, the only argument I've heard so far is that it allows
to disambiguate the format. But, uh, a look at the major version does
that just as well, for far less trouble.

> It seems to me quite logical for understanding purpose as well.  Any new
> person who wants to work in this area or is looking into it will always
> wonder why this map is named as visibility map even though it contains
> information about visibility of page as well as frozen state of page.

Being frozen is about visibility as well.

Greetings,

Andres Freund



Re: Freeze avoidance of very large table.

From
Amit Kapila
Date:
On Sat, Nov 14, 2015 at 1:12 AM, Bruce Momjian <bruce@momjian.us> wrote:
>
> On Tue, Nov  3, 2015 at 09:03:49AM +0530, Amit Kapila wrote:
> > I think in that case we can call it as page info map or page state map, but
> > I find retaining visibility map name in this case or for future (if we want to
> > add another bit) as confusing.  In-fact if you find "visibility and freeze
> > map",
> > as excessively long, then we can change it to "page info map" or "page state
> > map" now as well.
>
> Coming in late here, but the problem with "page info map" is that free
> space is also page info (how much free space on each page), so "page
> info map" isn't very descriptive.  "page status" or "page state" might
> make more sense, but even then, free space is a kind of page
> status/state.  What is happening is that broadening the name to cover
> both visibility and freeze state also encompasses free space.
>

Valid point, but I think the free space map is a specific piece of
page information stored in a completely different format.  A "page
info"/"page state" map could contain information about multiple states
of a page in the same format.  There is yet another option of changing
it to "visibility and freeze map" and/or changing the file extension
to vfm, but Robert felt that is a rather long name, and I agree with
him.

Do you see retaining the visibility map as better option ?



With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: Freeze avoidance of very large table.

From
Amit Kapila
Date:
On Sat, Nov 14, 2015 at 1:19 AM, Andres Freund <andres@anarazel.de> wrote:
> On 2015-10-31 11:02:12 +0530, Amit Kapila wrote:
> > On Thu, Oct 8, 2015 at 11:05 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> > >
> > > On 1 October 2015 at 23:30, Josh Berkus <josh@agliodbs.com> wrote:
> > >>
> > >> On 10/01/2015 07:43 AM, Robert Haas wrote:
> > >> > On Thu, Oct 1, 2015 at 9:44 AM, Fujii Masao <masao.fujii@gmail.com>
> > wrote:
> > >> >> I wonder how much it's worth renaming only the file extension while
> > >> >> there are many places where "visibility map" and "vm" are used,
> > >> >> for example, log messages, function names, variables, etc.
> > >> >
> > >> > I'd be inclined to keep calling it the visibility map (vm) even if it
> > >> > also contains freeze information.
> > >> >
> >
> > What is your main worry about changing the name of this map, is it
> > about more code churn or is it about that we might introduce new issues
> > or is it about that people are already accustomed to call this map as
> > visibility map?
>
> Several:
> * Visibility map is rather descriptive, none of the replacement terms
>   imo come close. Few people will know what a 'freeze' map is.
> * It increases the size of the patch considerably
> * It forces tooling that knows about the layout of the database
>   directory to change their tools
>

All these points are legitimate.

> On the benfit side the only argument I've heard so far is that it allows
> to disambiguate the format. But, uh, a look at the major version does
> that just as well, for far less trouble.
>
> > It seems to me quite logical for understanding purpose as well.  Any new
> > person who wants to work in this area or is looking into it will always
> > wonder why this map is named as visibility map even though it contains
> > information about visibility of page as well as frozen state of page.
>
> Being frozen is about visibility as well.
>

OTOH being visible doesn't mean page is frozen.  I understand that frozen is
related to visibility, but still it is a separate state of page and used for different
purpose.  I think this is a subjective point and we could go either way, it is
just a matter in which way more people are comfortable.



With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: Freeze avoidance of very large table.

From
Robert Haas
Date:
On Sun, Nov 15, 2015 at 1:47 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Sat, Nov 14, 2015 at 1:19 AM, Andres Freund <andres@anarazel.de> wrote:
>> On 2015-10-31 11:02:12 +0530, Amit Kapila wrote:
>> > On Thu, Oct 8, 2015 at 11:05 PM, Simon Riggs <simon@2ndquadrant.com>
>> > wrote:
>> > >
>> > > On 1 October 2015 at 23:30, Josh Berkus <josh@agliodbs.com> wrote:
>> > >>
>> > >> On 10/01/2015 07:43 AM, Robert Haas wrote:
>> > >> > On Thu, Oct 1, 2015 at 9:44 AM, Fujii Masao <masao.fujii@gmail.com>
>> > wrote:
>> > >> >> I wonder how much it's worth renaming only the file extension
>> > >> >> while
>> > >> >> there are many places where "visibility map" and "vm" are used,
>> > >> >> for example, log messages, function names, variables, etc.
>> > >> >
>> > >> > I'd be inclined to keep calling it the visibility map (vm) even if
>> > >> > it
>> > >> > also contains freeze information.
>> > >> >
>> >
>> > What is your main worry about changing the name of this map, is it
>> > about more code churn or is it about that we might introduce new issues
>> > or is it about that people are already accustomed to call this map as
>> > visibility map?
>>
>> Several:
>> * Visibility map is rather descriptive, none of the replacement terms
>>   imo come close. Few people will know what a 'freeze' map is.
>> * It increases the size of the patch considerably
>> * It forces tooling that knows about the layout of the database
>>   directory to change their tools
>>
>
> All these points are legitimate.
>
>> On the benfit side the only argument I've heard so far is that it allows
>> to disambiguate the format. But, uh, a look at the major version does
>> that just as well, for far less trouble.
>>
>> > It seems to me quite logical for understanding purpose as well.  Any new
>> > person who wants to work in this area or is looking into it will always
>> > wonder why this map is named as visibility map even though it contains
>> > information about visibility of page as well as frozen state of page.
>>
>> Being frozen is about visibility as well.
>>
>
> OTOH being visible doesn't mean page is frozen.  I understand that frozen is
> related to visibility, but still it is a separate state of page and used for
> different
> purpose.  I think this is a subjective point and we could go either way, it
> is
> just a matter in which way more people are comfortable.

I'm stickin' with what I said before, and what I think Andres is
saying too: renaming the map is a horrible idea.  It produces lots of
code churn for no real benefit.  We usually avoid such changes, and I
think we should do so here, too.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Freeze avoidance of very large table.

From
Masahiko Sawada
Date:
On Tue, Nov 17, 2015 at 10:45 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Sun, Nov 15, 2015 at 1:47 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> On Sat, Nov 14, 2015 at 1:19 AM, Andres Freund <andres@anarazel.de> wrote:
>>> On 2015-10-31 11:02:12 +0530, Amit Kapila wrote:
>>> > On Thu, Oct 8, 2015 at 11:05 PM, Simon Riggs <simon@2ndquadrant.com>
>>> > wrote:
>>> > >
>>> > > On 1 October 2015 at 23:30, Josh Berkus <josh@agliodbs.com> wrote:
>>> > >>
>>> > >> On 10/01/2015 07:43 AM, Robert Haas wrote:
>>> > >> > On Thu, Oct 1, 2015 at 9:44 AM, Fujii Masao <masao.fujii@gmail.com>
>>> > wrote:
>>> > >> >> I wonder how much it's worth renaming only the file extension
>>> > >> >> while
>>> > >> >> there are many places where "visibility map" and "vm" are used,
>>> > >> >> for example, log messages, function names, variables, etc.
>>> > >> >
>>> > >> > I'd be inclined to keep calling it the visibility map (vm) even
>>> > >> > if it
>>> > >> > also contains freeze information.
>>> > >> >
>>> >
>>> > What is your main worry about changing the name of this map, is it
>>> > about more code churn or is it about that we might introduce new issues
>>> > or is it about that people are already accustomed to call this map as
>>> > visibility map?
>>>
>>> Several:
>>> * Visibility map is rather descriptive, none of the replacement terms
>>>   imo come close. Few people will know what a 'freeze' map is.
>>> * It increases the size of the patch considerably
>>> * It forces tooling that knows about the layout of the database
>>>   directory to change their tools
>>>
>>
>> All these points are legitimate.
>>
>>> On the benfit side the only argument I've heard so far is that it allows
>>> to disambiguate the format. But, uh, a look at the major version does
>>> that just as well, for far less trouble.
>>>
>>> > It seems to me quite logical for understanding purpose as well.  Any new
>>> > person who wants to work in this area or is looking into it will always
>>> > wonder why this map is named as visibility map even though it contains
>>> > information about visibility of page as well as frozen state of page.
>>>
>>> Being frozen is about visibility as well.
>>>
>>
>> OTOH being visible doesn't mean page is frozen.  I understand that frozen
>> is related to visibility, but still it is a separate state of page and
>> used for different purpose.  I think this is a subjective point and we
>> could go either way, it is just a matter in which way more people are
>> comfortable.
>
> I'm stickin' with what I said before, and what I think Andres is
> saying too: renaming the map is a horrible idea.  It produces lots of
> code churn for no real benefit.  We usually avoid such changes, and I
> think we should do so here, too.

I understood.
I'm going to turn the patch back to visibility map, and just add the logic
of enhancement of VACUUM FREEZE.
If we want to add the other status not related to visibility into
visibility map in the future, it would be worth to consider.

Regards,

--
Masahiko Sawada

Re: Freeze avoidance of very large table.

From
Thom Brown
Date:
On 17 November 2015 at 10:29, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
>
> On Tue, Nov 17, 2015 at 10:45 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Sun, Nov 15, 2015 at 1:47 AM, Amit Kapila <amit.kapila16@gmail.com>
>> wrote:
>>> On Sat, Nov 14, 2015 at 1:19 AM, Andres Freund <andres@anarazel.de>
>>> wrote:
>>>> On 2015-10-31 11:02:12 +0530, Amit Kapila wrote:
>>>> > On Thu, Oct 8, 2015 at 11:05 PM, Simon Riggs <simon@2ndquadrant.com>
>>>> > wrote:
>>>> > >
>>>> > > On 1 October 2015 at 23:30, Josh Berkus <josh@agliodbs.com> wrote:
>>>> > >>
>>>> > >> On 10/01/2015 07:43 AM, Robert Haas wrote:
>>>> > >> > On Thu, Oct 1, 2015 at 9:44 AM, Fujii Masao
>>>> > >> > <masao.fujii@gmail.com>
>>>> > wrote:
>>>> > >> >> I wonder how much it's worth renaming only the file extension
>>>> > >> >> while
>>>> > >> >> there are many places where "visibility map" and "vm" are used,
>>>> > >> >> for example, log messages, function names, variables, etc.
>>>> > >> >
>>>> > >> > I'd be inclined to keep calling it the visibility map (vm) even
>>>> > >> > if
>>>> > >> > it
>>>> > >> > also contains freeze information.
>>>> > >> >
>>>> >
>>>> > What is your main worry about changing the name of this map, is it
>>>> > about more code churn or is it about that we might introduce new
>>>> > issues
>>>> > or is it about that people are already accustomed to call this map as
>>>> > visibility map?
>>>>
>>>> Several:
>>>> * Visibility map is rather descriptive, none of the replacement terms
>>>>   imo come close. Few people will know what a 'freeze' map is.
>>>> * It increases the size of the patch considerably
>>>> * It forces tooling that knows about the layout of the database
>>>>   directory to change their tools
>>>>
>>>
>>> All these points are legitimate.
>>>
>>>> On the benfit side the only argument I've heard so far is that it allows
>>>> to disambiguate the format. But, uh, a look at the major version does
>>>> that just as well, for far less trouble.
>>>>
>>>> > It seems to me quite logical for understanding purpose as well.  Any
>>>> > new
>>>> > person who wants to work in this area or is looking into it will
>>>> > always
>>>> > wonder why this map is named as visibility map even though it contains
>>>> > information about visibility of page as well as frozen state of page.
>>>>
>>>> Being frozen is about visibility as well.
>>>>
>>>
>>> OTOH being visible doesn't mean page is frozen.  I understand that frozen
>>> is
>>> related to visibility, but still it is a separate state of page and used
>>> for
>>> different
>>> purpose.  I think this is a subjective point and we could go either way,
>>> it
>>> is
>>> just a matter in which way more people are comfortable.
>>
>> I'm stickin' with what I said before, and what I think Andres is
>> saying too: renaming the map is a horrible idea.  It produces lots of
>> code churn for no real benefit.  We usually avoid such changes, and I
>> think we should do so here, too.
>
> I understood.
> I'm going to turn the patch back to visibility map, and just add the logic
> of enhancement of VACUUM FREEZE.
> If we want to add the other status not related to visibility into visibility
> map in the future, it would be worth to consider.

Could someone post a TL;DR summary of what the current plan looks
like?  I can see there is a huge amount of discussion to trawl back
through.  I can see it's something to do with the visibility map.  And
does it avoid freezing very large tables like the title originally
sought?

Thanks

Thom



Re: Freeze avoidance of very large table.

From
Jim Nasby
Date:
On 11/17/15 4:41 AM, Thom Brown wrote:
> Could someone post a TL;DR summary of what the current plan looks
> like?  I can see there is a huge amount of discussion to trawl back
> through.  I can see it's something to do with the visibility map.  And
> does it avoid freezing very large tables like the title originally
> sought?

Basically, it follows the same pattern that all-visible bits do, except 
instead of indicating a page is all-visible, the bit shows that all 
tuples on the page are frozen. That allows a scan_all vacuum to skip 
those pages.
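
In rough terms, the skip decision looks like this. This is a standalone
sketch; the flag names and values mirror the patch's visibilitymap.h, but
the function itself is illustrative, not the actual lazy vacuum code:

#include <stdbool.h>
#include <stdint.h>

#define VISIBILITYMAP_ALL_VISIBLE 0x01
#define VISIBILITYMAP_ALL_FROZEN  0x02

/* May vacuum skip this heap page, given its two vm bits? */
static bool
vacuum_can_skip_page(uint8_t vmstatus, bool scan_all)
{
    if (scan_all)
        /* an aggressive vacuum still visits pages not yet all-frozen */
        return (vmstatus & VISIBILITYMAP_ALL_FROZEN) != 0;
    /* an ordinary vacuum already skips all-visible pages */
    return (vmstatus & VISIBILITYMAP_ALL_VISIBLE) != 0;
}
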
-- 
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com



Re: Freeze avoidance of very large table.

From
Thom Brown
Date:
On 17 November 2015 at 15:43, Jim Nasby <Jim.Nasby@bluetreble.com> wrote:
> On 11/17/15 4:41 AM, Thom Brown wrote:
>>
>> Could someone post a TL;DR summary of what the current plan looks
>> like?  I can see there is a huge amount of discussion to trawl back
>> through.  I can see it's something to do with the visibility map.  And
>> does it avoid freezing very large tables like the title originally
>> sought?
>
>
> Basically, it follows the same pattern that all-visible bits do, except
> instead of indicating a page is all-visible, the bit shows that all tuples
> on the page are frozen. That allows a scan_all vacuum to skip those pages.

So the visibility map is being repurposed?  And if a row on a frozen
page is modified, what happens to the visibility of all other rows on
that page, as the bit will be set back to 0?  I think I'm missing a
critical part of this functionality.

Thom



Re: Freeze avoidance of very large table.

From
Masahiko Sawada
Date:
On Wed, Nov 18, 2015 at 12:56 AM, Thom Brown <thom@linux.com> wrote:
> On 17 November 2015 at 15:43, Jim Nasby <Jim.Nasby@bluetreble.com> wrote:
>> On 11/17/15 4:41 AM, Thom Brown wrote:
>>>
>>> Could someone post a TL;DR summary of what the current plan looks
>>> like?  I can see there is a huge amount of discussion to trawl back
>>> through.  I can see it's something to do with the visibility map.  And
>>> does it avoid freezing very large tables like the title originally
>>> sought?
>>
>>
>> Basically, it follows the same pattern that all-visible bits do, except
>> instead of indicating a page is all-visible, the bit shows that all tuples
>> on the page are frozen. That allows a scan_all vacuum to skip those pages.
>
> So the visibility map is being repurposed?

My proposal is to add one additional bit, indicating that all tuples on
the page are completely frozen, to the visibility map.
That is, the visibility map will become a bitmap with two bits
(all-visible, all-frozen) per heap page.

> And if a row on a frozen
> page is modified, what happens to the visibility of all other rows on
> that page, as the bit will be set back to 0?

In this case, both of the corresponding VM bits are cleared.
This behaviour is almost the same as what PostgreSQL does today.
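
As a rough standalone sketch of that layout (illustrative code, not the
patched visibilitymap.c itself; only the flag values follow the patch):

#include <stdint.h>

#define VISIBILITYMAP_ALL_VISIBLE 0x01
#define VISIBILITYMAP_ALL_FROZEN  0x02
#define VISIBILITYMAP_VALID_BITS  0x03
#define BITS_PER_HEAPBLOCK        2
#define HEAPBLOCKS_PER_BYTE       (8 / BITS_PER_HEAPBLOCK)

/* Read the two status bits of one heap page from the map payload. */
static uint8_t
vm_get_status(const uint8_t *map, uint32_t heapBlk)
{
    uint32_t mapByte = heapBlk / HEAPBLOCKS_PER_BYTE;
    int      mapBit  = (int) (heapBlk % HEAPBLOCKS_PER_BYTE) * BITS_PER_HEAPBLOCK;

    return (uint8_t) ((map[mapByte] >> mapBit) & VISIBILITYMAP_VALID_BITS);
}

/* Clearing on modification wipes both bits of the page at once. */
static void
vm_clear_status(uint8_t *map, uint32_t heapBlk)
{
    uint32_t mapByte = heapBlk / HEAPBLOCKS_PER_BYTE;
    int      mapBit  = (int) (heapBlk % HEAPBLOCKS_PER_BYTE) * BITS_PER_HEAPBLOCK;

    map[mapByte] &= (uint8_t) ~(VISIBILITYMAP_VALID_BITS << mapBit);
}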


Regards,

--
Masahiko Sawada



Re: Freeze avoidance of very large table.

From
Masahiko Sawada
Date:
On Tue, Nov 17, 2015 at 7:29 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
>
> On Tue, Nov 17, 2015 at 10:45 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Sun, Nov 15, 2015 at 1:47 AM, Amit Kapila <amit.kapila16@gmail.com>
>> wrote:
>>> On Sat, Nov 14, 2015 at 1:19 AM, Andres Freund <andres@anarazel.de>
>>> wrote:
>>>> On 2015-10-31 11:02:12 +0530, Amit Kapila wrote:
>>>> > On Thu, Oct 8, 2015 at 11:05 PM, Simon Riggs <simon@2ndquadrant.com>
>>>> > wrote:
>>>> > >
>>>> > > On 1 October 2015 at 23:30, Josh Berkus <josh@agliodbs.com> wrote:
>>>> > >>
>>>> > >> On 10/01/2015 07:43 AM, Robert Haas wrote:
>>>> > >> > On Thu, Oct 1, 2015 at 9:44 AM, Fujii Masao
>>>> > >> > <masao.fujii@gmail.com>
>>>> > wrote:
>>>> > >> >> I wonder how much it's worth renaming only the file extension
>>>> > >> >> while
>>>> > >> >> there are many places where "visibility map" and "vm" are used,
>>>> > >> >> for example, log messages, function names, variables, etc.
>>>> > >> >
>>>> > >> > I'd be inclined to keep calling it the visibility map (vm) even
>>>> > >> > if
>>>> > >> > it
>>>> > >> > also contains freeze information.
>>>> > >> >
>>>> >
>>>> > What is your main worry about changing the name of this map, is it
>>>> > about more code churn or is it about that we might introduce new
>>>> > issues
>>>> > or is it about that people are already accustomed to call this map as
>>>> > visibility map?
>>>>
>>>> Several:
>>>> * Visibility map is rather descriptive, none of the replacement terms
>>>>   imo come close. Few people will know what a 'freeze' map is.
>>>> * It increases the size of the patch considerably
>>>> * It forces tooling that knows about the layout of the database
>>>>   directory to change their tools
>>>>
>>>
>>> All these points are legitimate.
>>>
>>>> On the benfit side the only argument I've heard so far is that it allows
>>>> to disambiguate the format. But, uh, a look at the major version does
>>>> that just as well, for far less trouble.
>>>>
>>>> > It seems to me quite logical for understanding purpose as well.  Any
>>>> > new
>>>> > person who wants to work in this area or is looking into it will
>>>> > always
>>>> > wonder why this map is named as visibility map even though it contains
>>>> > information about visibility of page as well as frozen state of page.
>>>>
>>>> Being frozen is about visibility as well.
>>>>
>>>
>>> OTOH being visible doesn't mean page is frozen.  I understand that frozen
>>> is
>>> related to visibility, but still it is a separate state of page and used
>>> for
>>> different
>>> purpose.  I think this is a subjective point and we could go either way,
>>> it
>>> is
>>> just a matter in which way more people are comfortable.
>>
>> I'm stickin' with what I said before, and what I think Andres is
>> saying too: renaming the map is a horrible idea.  It produces lots of
>> code churn for no real benefit.  We usually avoid such changes, and I
>> think we should do so here, too.
>
> I understood.
> I'm going to turn the patch back to visibility map, and just add the logic
> of enhancement of VACUUM FREEZE.

Attached latest v24 patch.
I've changed the patch so that it just adds the frozen bit to the
visibility map.
As a result, the patch is almost half the size of the previous one.

Please review it.

Regards,

--
Masahiko Sawada

Attachment

Re: Freeze avoidance of very large table.

From
Jeff Janes
Date:
On Tue, Nov 17, 2015 at 10:32 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> Attached latest v24 patch.
> I've changed patch so that just adding frozen bit into visibility map.
> So the size of patch is almost half of previous one.
>

Should there be an Assert in visibilitymap_get_status (or elsewhere)
against the impossible state of being all frozen but not all visible?
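
Something like this, at the bottom of visibilitymap_get_status, is what I
have in mind (illustrative placement; "result" stands for whatever local
variable holds the bits about to be returned):

    /* all-frozen implies all-visible; frozen-but-not-visible is bogus */
    Assert((result & VISIBILITYMAP_ALL_FROZEN) == 0 ||
           (result & VISIBILITYMAP_ALL_VISIBLE) != 0);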

I get an error when running pg_upgrade from 9.4 to 9.6 - this:

error while copying relation "mediawiki.archive"
("/tmp/data/base/16414/21043_vm" to
"/tmp/data_fm/base/16400/21043_vm"): No such file or directory


Cheers,

Jeff



Re: Freeze avoidance of very large table.

From
Jeff Janes
Date:
On Wed, Nov 18, 2015 at 11:18 AM, Jeff Janes <jeff.janes@gmail.com> wrote:
>
> I get an error when running pg_upgrade from 9.4 to 9.6-this
>
> error while copying relation "mediawiki.archive"
> ("/tmp/data/base/16414/21043_vm" to
> "/tmp/data_fm/base/16400/21043_vm"): No such file or directory

OK, so the problem seems to be that rewriteVisibilitymap can get
called with errno already set to a nonzero value.

It never clears it, and then fails at the end despite that no error
has actually occurred.

Just setting it to 0 at the top of the function seems to be correct
thing to do.  Or does it need to save the old value and restore it?
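
To make the shape of the problem concrete, here is a hypothetical sketch
(not the actual pg_upgrade code; the function and parameter names are
invented):

#include <errno.h>
#include <unistd.h>

static int
rewrite_vm_fork(int src_fd, int dst_fd, char *buf, size_t bufsize)
{
    ssize_t nread;

    errno = 0;              /* the proposed fix: clear any stale value */
    while ((nread = read(src_fd, buf, bufsize)) == (ssize_t) bufsize)
    {
        /* ... rewrite the vm page into the new format ... */
        if (write(dst_fd, buf, bufsize) != (ssize_t) bufsize)
            return -1;
    }

    /*
     * The loop ends on a short read; only errno distinguishes EOF from
     * a real error, and read() returning 0 does not reset a stale errno
     * left over from an earlier, unrelated call.
     */
    return (errno != 0) ? -1 : 0;
}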

But now when I want to do the upgrade faster, I run into this:

"This utility cannot upgrade from PostgreSQL version from 9.5 or
before to 9.6 or later with link mode."

Is this really an acceptable tradeoff?  Surely we can arrange to
link everything else and rewrite just the _vm, which is a tiny portion
of the data directory.  I don't think that -k promises to link
everything it possibly can.

Cheers,

Jeff



Re: Freeze avoidance of very large table.

From
Masahiko Sawada
Date:
On Thu, Nov 19, 2015 at 5:54 AM, Jeff Janes <jeff.janes@gmail.com> wrote:
> On Wed, Nov 18, 2015 at 11:18 AM, Jeff Janes <jeff.janes@gmail.com> wrote:
>>
>> I get an error when running pg_upgrade from 9.4 to 9.6-this
>>
>> error while copying relation "mediawiki.archive"
>> ("/tmp/data/base/16414/21043_vm" to
>> "/tmp/data_fm/base/16400/21043_vm"): No such file or directory
>
> OK, so the problem seems to be that rewriteVisibilitymap can get
> called with errno already set to a nonzero value.
>
> It never clears it, and then fails at the end despite that no error
> has actually occurred.
>
> Just setting it to 0 at the top of the function seems to be correct
> thing to do.  Or does it need to save the old value and restore it?

Thank you for testing!
I think that the former is better, so attached latest patch.

> But now when I want to do the upgrade faster, I run into this:
>
> "This utility cannot upgrade from PostgreSQL version from 9.5 or
> before to 9.6 or later with link mode."
>
> Is this really an acceptable a tradeoff?  Surely we can arrange to
> link everything else and rewrite just the _vm, which is a tiny portion
> of the data directory.  I don't think that -k promises to link
> everything it possibly can.

I agree.
I've changed the patch so that pg_upgrade creates a new _vm file and
rewrites it even when upgrading to 9.6 with link mode.

Regards,

--
Masahiko Sawada

Attachment

Re: Freeze avoidance of very large table.

From
Jeff Janes
Date:
On Thu, Nov 19, 2015 at 6:44 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> On Thu, Nov 19, 2015 at 5:54 AM, Jeff Janes <jeff.janes@gmail.com> wrote:
>> On Wed, Nov 18, 2015 at 11:18 AM, Jeff Janes <jeff.janes@gmail.com> wrote:
>>>
>>> I get an error when running pg_upgrade from 9.4 to 9.6-this
>>>
>>> error while copying relation "mediawiki.archive"
>>> ("/tmp/data/base/16414/21043_vm" to
>>> "/tmp/data_fm/base/16400/21043_vm"): No such file or directory
>>
>> OK, so the problem seems to be that rewriteVisibilitymap can get
>> called with errno already set to a nonzero value.
>>
>> It never clears it, and then fails at the end despite that no error
>> has actually occurred.
>>
>> Just setting it to 0 at the top of the function seems to be correct
>> thing to do.  Or does it need to save the old value and restore it?
>
> Thank you for testing!
> I think that the former is better, so attached latest patch.
>
>> But now when I want to do the upgrade faster, I run into this:
>>
>> "This utility cannot upgrade from PostgreSQL version from 9.5 or
>> before to 9.6 or later with link mode."
>>
>> Is this really an acceptable a tradeoff?  Surely we can arrange to
>> link everything else and rewrite just the _vm, which is a tiny portion
>> of the data directory.  I don't think that -k promises to link
>> everything it possibly can.
>
> I agree.
> I've changed the patch so that.
> pg_upgarde creates new _vm file and rewrites it even if upgrading to
> 9.6 with link mode.


The rewrite code thinks that only the first page of a vm has a header
of size SizeOfPageHeaderData, and the rest of the pages have a zero
size header.  So the resulting _vm is corrupt.

After pg_upgrade, doing a vacuum freeze verbose gives:


WARNING:  invalid page in block 1 of relation base/16402/22430_vm;
zeroing out page
WARNING:  invalid page in block 1 of relation base/16402/22430_vm;
zeroing out page
WARNING:  invalid page in block 2 of relation base/16402/22430_vm;
zeroing out page
WARNING:  invalid page in block 2 of relation base/16402/22430_vm;
zeroing out page
WARNING:  invalid page in block 3 of relation base/16402/22430_vm;
zeroing out page
WARNING:  invalid page in block 3 of relation base/16402/22430_vm;
zeroing out page
WARNING:  invalid page in block 4 of relation base/16402/22430_vm;
zeroing out page
WARNING:  invalid page in block 4 of relation base/16402/22430_vm;
zeroing out page
WARNING:  invalid page in block 5 of relation base/16402/22430_vm;
zeroing out page
WARNING:  invalid page in block 5 of relation base/16402/22430_vm;
zeroing out page
WARNING:  invalid page in block 6 of relation base/16402/22430_vm;
zeroing out page
WARNING:  invalid page in block 6 of relation base/16402/22430_vm;
zeroing out page
WARNING:  invalid page in block 7 of relation base/16402/22430_vm;
zeroing out page
WARNING:  invalid page in block 7 of relation base/16402/22430_vm;
zeroing out page
WARNING:  invalid page in block 8 of relation base/16402/22430_vm;
zeroing out page
WARNING:  invalid page in block 8 of relation base/16402/22430_vm;
zeroing out page

Cheers,

Jeff



Re: Freeze avoidance of very large table.

From
Masahiko Sawada
Date:
On Sat, Nov 21, 2015 at 6:50 AM, Jeff Janes <jeff.janes@gmail.com> wrote:
> On Thu, Nov 19, 2015 at 6:44 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>> On Thu, Nov 19, 2015 at 5:54 AM, Jeff Janes <jeff.janes@gmail.com> wrote:
>>> On Wed, Nov 18, 2015 at 11:18 AM, Jeff Janes <jeff.janes@gmail.com> wrote:
>>>>
>>>> I get an error when running pg_upgrade from 9.4 to 9.6-this
>>>>
>>>> error while copying relation "mediawiki.archive"
>>>> ("/tmp/data/base/16414/21043_vm" to
>>>> "/tmp/data_fm/base/16400/21043_vm"): No such file or directory
>>>
>>> OK, so the problem seems to be that rewriteVisibilitymap can get
>>> called with errno already set to a nonzero value.
>>>
>>> It never clears it, and then fails at the end despite that no error
>>> has actually occurred.
>>>
>>> Just setting it to 0 at the top of the function seems to be correct
>>> thing to do.  Or does it need to save the old value and restore it?
>>
>> Thank you for testing!
>> I think that the former is better, so attached latest patch.
>>
>>> But now when I want to do the upgrade faster, I run into this:
>>>
>>> "This utility cannot upgrade from PostgreSQL version from 9.5 or
>>> before to 9.6 or later with link mode."
>>>
>>> Is this really an acceptable a tradeoff?  Surely we can arrange to
>>> link everything else and rewrite just the _vm, which is a tiny portion
>>> of the data directory.  I don't think that -k promises to link
>>> everything it possibly can.
>>
>> I agree.
>> I've changed the patch so that.
>> pg_upgarde creates new _vm file and rewrites it even if upgrading to
>> 9.6 with link mode.
>
>
> The rewrite code thinks that only the first page of a vm has a header
> of size SizeOfPageHeaderData, and the rest of the pages have a zero
> size header.  So the resulting _vm is corrupt.
>
> After pg_upgrade, doing a vacuum freeze verbose gives:
>
>
> WARNING:  invalid page in block 1 of relation base/16402/22430_vm;
> zeroing out page
> WARNING:  invalid page in block 1 of relation base/16402/22430_vm;
> zeroing out page
> WARNING:  invalid page in block 2 of relation base/16402/22430_vm;
> zeroing out page
> WARNING:  invalid page in block 2 of relation base/16402/22430_vm;
> zeroing out page
> WARNING:  invalid page in block 3 of relation base/16402/22430_vm;
> zeroing out page
> WARNING:  invalid page in block 3 of relation base/16402/22430_vm;
> zeroing out page
> WARNING:  invalid page in block 4 of relation base/16402/22430_vm;
> zeroing out page
> WARNING:  invalid page in block 4 of relation base/16402/22430_vm;
> zeroing out page
> WARNING:  invalid page in block 5 of relation base/16402/22430_vm;
> zeroing out page
> WARNING:  invalid page in block 5 of relation base/16402/22430_vm;
> zeroing out page
> WARNING:  invalid page in block 6 of relation base/16402/22430_vm;
> zeroing out page
> WARNING:  invalid page in block 6 of relation base/16402/22430_vm;
> zeroing out page
> WARNING:  invalid page in block 7 of relation base/16402/22430_vm;
> zeroing out page
> WARNING:  invalid page in block 7 of relation base/16402/22430_vm;
> zeroing out page
> WARNING:  invalid page in block 8 of relation base/16402/22430_vm;
> zeroing out page
> WARNING:  invalid page in block 8 of relation base/16402/22430_vm;
> zeroing out page
>

Thank you for taking the time to review this patch!
The updated version patch is attached.

Regards,

--
Masahiko Sawada

Attachment

Re: Freeze avoidance of very large table.

From
Jeff Janes
Date:
On Sun, Nov 22, 2015 at 8:16 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:

> Thank you for taking the time to review this patch!
> The updated version patch is attached.

I am skeptical about just copying the old page header to be two new
page headers.  I don't know what the implications of this will be for
pd_lsn.  Since pg_upgrade can only run on a cluster that was cleanly
shut down, I think that just copying it from the old page to both new
pages it turns into might be fine.  But pd_checksum will certainly be
wrong, breaking pg_upgrade for cases where checksums are turned on.
It needs to be recomputed on both new pages.  It looks like there is
no precedent for doing that in pg_upgrade, so this will be breaking
new ground.
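
Concretely, I'd expect each rewritten page to need something like this
(sketch only, using the existing pg_checksum_page(); whether pg_upgrade
grows such a helper is exactly the new ground I mean):

#include "storage/bufpage.h"
#include "storage/checksum.h"

/*
 * Recompute pd_checksum for a rewritten vm page at its new block number.
 * The checksum mixes in the block number, so the value copied from the
 * old page cannot be valid for either new page.  Only needed when data
 * checksums are enabled in the old cluster.
 */
static void
set_rewritten_page_checksum(char *page, BlockNumber new_blkno)
{
    ((PageHeader) page)->pd_checksum = pg_checksum_page(page, new_blkno);
}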

Cheers,

Jeff



Re: Freeze avoidance of very large table.

From
Masahiko Sawada
Date:
On Mon, Nov 23, 2015 at 6:27 AM, Jeff Janes <jeff.janes@gmail.com> wrote:
> On Sun, Nov 22, 2015 at 8:16 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
>> Thank you for taking the time to review this patch!
>> The updated version patch is attached.
>
> I am skeptical about just copying the old page header to be two new
> page headers.  I don't know what the implications for this will be on
> pd_lsn.  Since pg_upgrade can only run on a cluster that was cleanly
> shutdown, I think that just copying it from the old page to both new
> pages it turns into might be fine.  But pd_checksum will certainly be
> wrong, breaking pg_upgrade for cases where checksums are turned on in.
> It needs to be recomputed on both new pages.  It looks like there is
> no precedence for doing that in pg_upgrade so this will be breaking
> new ground.
>

Yeah, we need to consider to compute checksum if enabled.
I've changed the patch, and attached.
Please review it.

Regards,

--
Masahiko Sawada

Attachment

Re: Freeze avoidance of very large table.

From
Jeff Janes
Date:
On Tue, Nov 24, 2015 at 3:13 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> Yeah, we need to consider to compute checksum if enabled.
> I've changed the patch, and attached.
> Please review it.

Thanks for the update.  This now conflicts with the updates done to
fix the pg_upgrade out-of-space issue on Windows.  I've fixed (I think)
the conflict in order to do some testing, but I'd like to get an updated
patch from the author in case I did it wrong.  I don't want to find
bugs that I just introduced myself.

Thanks,

Jeff



Re: Freeze avoidance of very large table.

From
Masahiko Sawada
Date:
On Sun, Nov 29, 2015 at 2:21 AM, Jeff Janes <jeff.janes@gmail.com> wrote:
> On Tue, Nov 24, 2015 at 3:13 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>>
>> Yeah, we need to consider to compute checksum if enabled.
>> I've changed the patch, and attached.
>> Please review it.
>
> Thanks for the update.  This now conflicts with the updates doesn to
> fix pg_upgrade out-of-space issue on Windows. I've fixed (I think) the
> conflict in order to do some testing, but I'd like to get an updated
> patch from the author in case I did it wrong.  I don't want to find
> bugs that I just introduced myself.
>

Thank you for having a look.

Attached updated v28 patch.
Please review it.

Regards,

--
Masahiko Sawada

Attachment

Re: Freeze avoidance of very large table.

From
Bruce Momjian
Date:
On Mon, Nov 30, 2015 at 10:48:04PM +0530, Masahiko Sawada wrote:
> On Sun, Nov 29, 2015 at 2:21 AM, Jeff Janes <jeff.janes@gmail.com> wrote:
> > On Tue, Nov 24, 2015 at 3:13 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >>
> >> Yeah, we need to consider to compute checksum if enabled.
> >> I've changed the patch, and attached.
> >> Please review it.
> >
> > Thanks for the update.  This now conflicts with the updates doesn to
> > fix pg_upgrade out-of-space issue on Windows. I've fixed (I think) the
> > conflict in order to do some testing, but I'd like to get an updated
> > patch from the author in case I did it wrong.  I don't want to find
> > bugs that I just introduced myself.
> >
> 
> Thank you for having a look.

I would not bother mentioning this detail in the pg_upgrade manual page:

+   Since the format of visibility map has been changed in version 9.6,
+   <application>pg_upgrade</> creates and rewrite new <literal>'_vm'</literal>
+   file even if upgrading from 9.5 or before to 9.6 or later with link mode (-k).

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

+ As you are, so once was I. As I am, so you will be. +
+ Roman grave inscription                             +



Re: Freeze avoidance of very large table.

From
Andres Freund
Date:
On 2015-11-30 12:58:43 -0500, Bruce Momjian wrote:
> I would not bother mentioning this detail in the pg_upgrade manual page:
> 
> +   Since the format of visibility map has been changed in version 9.6,
> +   <application>pg_upgrade</> creates and rewrite new <literal>'_vm'</literal>
> +   file even if upgrading from 9.5 or before to 9.6 or later with link mode (-k).

Might be worthwhile to keep as that influences the runtime for link mode
when migrating <9.6 -> 9.6.



Re: Freeze avoidance of very large table.

From
Bruce Momjian
Date:
On Mon, Nov 30, 2015 at 07:05:21PM +0100, Andres Freund wrote:
> On 2015-11-30 12:58:43 -0500, Bruce Momjian wrote:
> > I would not bother mentioning this detail in the pg_upgrade manual page:
> > 
> > +   Since the format of visibility map has been changed in version 9.6,
> > +   <application>pg_upgrade</> creates and rewrite new <literal>'_vm'</literal>
> > +   file even if upgrading from 9.5 or before to 9.6 or later with link mode (-k).
> 
> Might be worthwhile to keep as that influences the runtime for link mode
> when migrating <9.6 -> 9.6.

It is hard to see that it would have a measurable duration.  The
pg_upgrade docs are already very long and this detail doesn't seem
significant.  Can someone test the overhead?

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

+ As you are, so once was I. As I am, so you will be. +
+ Roman grave inscription                             +



Re: Freeze avoidance of very large table.

From
Jeff Janes
Date:
On Mon, Nov 30, 2015 at 9:18 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> On Sun, Nov 29, 2015 at 2:21 AM, Jeff Janes <jeff.janes@gmail.com> wrote:
>> On Tue, Nov 24, 2015 at 3:13 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>>>
>>> Yeah, we need to consider to compute checksum if enabled.
>>> I've changed the patch, and attached.
>>> Please review it.
>>
>> Thanks for the update.  This now conflicts with the updates doesn to
>> fix pg_upgrade out-of-space issue on Windows. I've fixed (I think) the
>> conflict in order to do some testing, but I'd like to get an updated
>> patch from the author in case I did it wrong.  I don't want to find
>> bugs that I just introduced myself.
>>
>
> Thank you for having a look.
>
> Attached updated v28 patch.
> Please review it.
>
> Regards,

After running pg_upgrade, if I manually vacuum a table I start getting warnings:

WARNING:  page is not marked all-visible (and all-frozen) but
visibility map bit(s) is set in relation "foo" page 32756
WARNING:  page is not marked all-visible (and all-frozen) but
visibility map bit(s) is set in relation "foo" page 32756
WARNING:  page is not marked all-visible (and all-frozen) but
visibility map bit(s) is set in relation "foo" page 32757
WARNING:  page is not marked all-visible (and all-frozen) but
visibility map bit(s) is set in relation "foo" page 32757

The warnings are right where the blocks would start using the 2nd page
of the _vm, so I think the problem is there.  And looking at the code,
I think that "cur += SizeOfPageHeaderData;" in the inner loop cannot
be correct.  We can't skip a header in the current (old) block each
time we reach the end of the new block.  The thing we are skipping in
the current block is half the time not a header, but the data at the
halfway point through the block.
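
To spell out the structure I believe is needed (a standalone sketch under
my reading of the format, not the patch's code; HEADER_SIZE stands in for
SizeOfPageHeaderData, and the output pages' headers and checksums are left
to the caller):

#include <stddef.h>
#include <stdint.h>

#define BLCKSZ      8192
#define HEADER_SIZE 24          /* stand-in for SizeOfPageHeaderData */

/*
 * Expand one old-format vm page (1 bit per heap page) into the payload
 * of two new-format pages (2 bits per heap page).  The old header is
 * skipped exactly once, at the top; crossing into the second output
 * page must not skip anything in the input.
 */
static void
rewrite_vm_page(const uint8_t *oldpage, uint8_t *newpages /* 2 * BLCKSZ */)
{
    const uint8_t *cur = oldpage + HEADER_SIZE;     /* skip header once */
    const uint8_t *end = oldpage + BLCKSZ;
    int     outpage = 0;
    size_t  out = HEADER_SIZE;

    while (cur < end)
    {
        uint8_t oldbyte = *cur++;

        /* each old byte (8 pages) becomes two new bytes (4 pages each) */
        for (int half = 0; half < 2; half++)
        {
            uint8_t newbyte = 0;

            for (int i = 0; i < 4; i++)
                if (oldbyte & (1 << (half * 4 + i)))
                    newbyte |= 1 << (2 * i);    /* all-visible; frozen = 0 */

            if (out == BLCKSZ)          /* step into the second new page */
            {
                outpage++;
                out = HEADER_SIZE;
            }
            newpages[(size_t) outpage * BLCKSZ + out++] = newbyte;
        }
    }
}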

Cheers,

Jeff



Re: Freeze avoidance of very large table.

From
Masahiko Sawada
Date:
On Tue, Dec 1, 2015 at 3:04 AM, Jeff Janes <jeff.janes@gmail.com> wrote:
> On Mon, Nov 30, 2015 at 9:18 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>> On Sun, Nov 29, 2015 at 2:21 AM, Jeff Janes <jeff.janes@gmail.com> wrote:
>>> On Tue, Nov 24, 2015 at 3:13 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>>>>
>>>> Yeah, we need to consider to compute checksum if enabled.
>>>> I've changed the patch, and attached.
>>>> Please review it.
>>>
>>> Thanks for the update.  This now conflicts with the updates doesn to
>>> fix pg_upgrade out-of-space issue on Windows. I've fixed (I think) the
>>> conflict in order to do some testing, but I'd like to get an updated
>>> patch from the author in case I did it wrong.  I don't want to find
>>> bugs that I just introduced myself.
>>>
>>
>> Thank you for having a look.
>>
>> Attached updated v28 patch.
>> Please review it.
>>
>> Regards,
>
> After running pg_upgrade, if I manually vacuum a table a start getting warnings:
>
> WARNING:  page is not marked all-visible (and all-frozen) but
> visibility map bit(s) is set in relation "foo" page 32756
> WARNING:  page is not marked all-visible (and all-frozen) but
> visibility map bit(s) is set in relation "foo" page 32756
> WARNING:  page is not marked all-visible (and all-frozen) but
> visibility map bit(s) is set in relation "foo" page 32757
> WARNING:  page is not marked all-visible (and all-frozen) but
> visibility map bit(s) is set in relation "foo" page 32757
>
> The warnings are right where the blocks would start using the 2nd page
> of the _vm, so I think the problem is there.  And looking at the code,
> I think that "cur += SizeOfPageHeaderData;" in the inner loop cannot
> be correct.  We can't skip a header in the current (old) block each
> time we reach the end of the new block.  The thing we are skipping in
> the current block is half the time not a header, but the data at the
> halfway point through the block.
>

Thank you for reviewing.

You're right, it's not necessary.
Attached latest v29 patch which removes the mention in pg_upgrade documentation.


Regards,

--
Masahiko Sawada

Attachment

Re: Freeze avoidance of very large table.

From
Kyotaro HORIGUCHI
Date:
Hello,

> You're right, it's not necessary.
> Attached latest v29 patch which removes the mention in pg_upgrade documentation.

The changes looks to be correct but I haven't tested.
And I have some additional random comments.


visibilitymap.c:

  In visibilitymap_set, the followint lines.

    map = PageGetContents(page);
    ...
    if (flags != (map[mapByte] & (flags << mapBit)))

  map is (char*), PageGetContents returns (char*) but flags is
  uint8. I think that defining map as (uint8*) would be safer.

  In visibilitymap_set, the following lines does something
  different from what to do.  Only right side of the inequality
  gets shifted and what should be used in right side is not flags
  but VISIBILITYMAP_VALID_BITS.

  -     if (!(map[mapByte] & (1 << mapBit)))
  +     if (flags != (map[mapByte] & (flags << mapBit)))

  Something like the following will do the right thing.

  +     if (flags != (map[mapByte]>>mapBit & VISIBILITYMAP_VALID_BITS))
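
  A quick standalone check of the two expressions, with illustrative
  values, shows the difference:

  #include <stdint.h>
  #include <stdio.h>

  #define VISIBILITYMAP_VALID_BITS 0x03

  int
  main(void)
  {
      uint8_t mapbyte = 0x0C;     /* both bits already set at mapBit 2 */
      int     mapBit  = 2;
      uint8_t flags   = 0x03;     /* caller asks for both bits */

      /* current test: masks the shifted bits but compares unshifted
       * flags, so it prints 1 ("not set") even though both bits are set */
      printf("%d\n", flags != (mapbyte & (flags << mapBit)));

      /* corrected test: shift the map byte down first; prints 0 */
      printf("%d\n", flags != ((mapbyte >> mapBit) & VISIBILITYMAP_VALID_BITS));

      return 0;
  }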


analyze.c:

  In do_analyze_rel, the successive if (!inh) in the following
  steps looks a bit odd. This would be emphasized by the first if
  block you added:) These blocks should be enclosed by if (!inh)
  {} block.

  >   /* Calculate the number of all-visible and all-frozen bit */
  >   if (!inh)
  >       relallvisible = visibilitymap_count(onerel, &relallfrozen);
  >   if (!inh)
  >       vac_update_relstats(onerel,
  >   if (!inh && !(options & VACOPT_VACUUM))
  >   {
  >       for (ind = 0; ind < nindexes; ind++)
  ...
  >   }
  >   if (!inh)
  >       pgstat_report_analyze(onerel, totalrows, totaldeadrows, relallfrozen);

vacuum.c:

  >- * relpages and relallvisible, we try to maintain certain lazily-updated
  >- * DDL flags such as relhasindex, by clearing them if no longer correct.
  >- * It's safe to do this in VACUUM, which can't run in parallel with
  >- * CREATE INDEX/RULE/TRIGGER and can't be part of a transaction block.
  >- * However, it's *not* safe to do it in an ANALYZE that's within an

  >+ * relpages, relallvisible, we try to maintain certain lazily-updated

    Why did you just drop the 'and' after relpages? And this seems
    the only change of this file except the additinally missing
    letter just below:p

  >+ * DDL flags such as relhasindex, by clearing them if no onger correct.
  >+ * It's safe to do this in VACUUM, which can't run in
  >+ * parallel with CREATE INDEX/RULE/TRIGGER and can't be part of a transaction
  >+ * block.  However, it's *not* safe to do it in an ANALYZE that's within an

nodeIndexonlyscan.c:

  A duplicate letters.  And the line exceeds right margin.

  > - * Note on Memory Ordering Effects: visibilitymap_test does not lock
  -> + * Note on Memory Ordering Effects: visibilitymap_get_stattus does not lock
  + * Note on Memory Ordering Effects: visibilitymap_get_status does not lock

  The edited line exceeds right margin by removing a newline.

  - if (!visibilitymap_test(scandesc->heapRelation,
  -                         ItemPointerGetBlockNumber(tid),
  -                         &node->ioss_VMBuffer))
  + if (!VM_ALL_VISIBLE(scandesc->heapRelation, ItemPointerGetBlockNumber(tid),
  +                                         &node->ioss_VMBuffer))

costsize.c:

  Duplicate words and it is the only change.

  > - * pages for which the visibility map shows all tuples are visible.
  -> + * pages for which the visibility map map shows all tuples are visible.
  + * pages for which the visibility map shows all tuples are visible.

pgstat.c:

  The new parameter frozenpages of pgstat_report_vacuum() is
  defined as int32, but its callers give BlockNumber(=uint32).  I
  recommend to define the frozenpages as BlockNumber.
  PgStat_MsgVacuum has a corresponding member defined as int32.

pg_upgrade.c:

  BITS_PER_HEAPBLOCK is defined in two c files with the same
  definition. This might be better to be merged into some header
  file.

heapam_xlog.h, hio.h, execnodes.h:

  Have we decided to rename vm to pim? Anyway it is inconsistent
  with that of corresponding definition of the function body
  remains as 'vm_buffer'. (I'm not confident on that, though.)

  >-   Buffer vm_buffer, TransactionId cutoff_xid);
  >+   Buffer pim_buffer, TransactionId cutoff_xid, uint8 flags);

regards,


At Wed, 2 Dec 2015 00:10:09 +0530, Masahiko Sawada <sawada.mshk@gmail.com> wrote in
<CAD21AoC72S2ShoeAmCxWYUyGSNOaTn4fMHJ-ZKNX-MPcsQpaOw@mail.gmail.com>
> On Tue, Dec 1, 2015 at 3:04 AM, Jeff Janes <jeff.janes@gmail.com> wrote:
> > On Mon, Nov 30, 2015 at 9:18 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > After running pg_upgrade, if I manually vacuum a table a start getting warnings:
> >
> > WARNING:  page is not marked all-visible (and all-frozen) but
> > visibility map bit(s) is set in relation "foo" page 32756
> > WARNING:  page is not marked all-visible (and all-frozen) but
> > visibility map bit(s) is set in relation "foo" page 32756
...
> > The warnings are right where the blocks would start using the 2nd page
> > of the _vm, so I think the problem is there.  And looking at the code,
> > I think that "cur += SizeOfPageHeaderData;" in the inner loop cannot
> > be correct.  We can't skip a header in the current (old) block each
> > time we reach the end of the new block.  The thing we are skipping in
> > the current block is half the time not a header, but the data at the
> > halfway point through the block.
> >
> 
> Thank you for reviewing.
> 
> You're right, it's not necessary.
> Attached latest v29 patch which removes the mention in pg_upgrade documentation.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center





Re: Freeze avoidance of very large table.

From
Masahiko Sawada
Date:
On Wed, Dec 2, 2015 at 9:30 AM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> Hello,
>
>> You're right, it's not necessary.
>> Attached latest v29 patch which removes the mention in pg_upgrade documentation.
>
> The changes looks to be correct but I haven't tested.
> And I have some additional random comments.
>

Thank you for reviewing!
Fixed these following points, and attached latest patch.

> visibilitymap.c:
>
>   In visibilitymap_set, the followint lines.
>
>     map = PageGetContents(page);
>     ...
>     if (flags != (map[mapByte] & (flags << mapBit)))
>
>   map is (char*), PageGetContents returns (char*) but flags is
>   uint8. I think that defining map as (uint8*) would be safer.

I agree with you.
Fixed.

>
>   In visibilitymap_set, the following lines does something
>   different from what to do.  Only right side of the inequality
>   gets shifted and what should be used in right side is not flags
>   but VISIBILITYMAP_VALID_BITS.
>
>   -     if (!(map[mapByte] & (1 << mapBit)))
>   +     if (flags != (map[mapByte] & (flags << mapBit)))
>
>   Something like the following will do the right thing.
>
>   +     if (flags != (map[mapByte]>>mapBit & VISIBILITYMAP_VALID_BITS))
>

You're right.
Fixed.

> analyze.c:
>
>  In do_analyze_rel, the successive if (!inh) in the following
>  steps looks a bit odd. This would be emphasized by the first if
>  block you added:) These blocks should be enclosed by if (!inh)
>  {} block.
>
>
>  >   /* Calculate the number of all-visible and all-frozen bit */
>  >   if (!inh)
>  >       relallvisible = visibilitymap_count(onerel, &relallfrozen);
>  >   if (!inh)
>  >       vac_update_relstats(onerel,
>  >   if (!inh && !(options & VACOPT_VACUUM))
>  >   {
>  >       for (ind = 0; ind < nindexes; ind++)
>  ...
>  >   }
>  >   if (!inh)
>  >       pgstat_report_analyze(onerel, totalrows, totaldeadrows, relallfrozen);

Fixed.

>
> vacuum.c:
>
>   >- * relpages and relallvisible, we try to maintain certain lazily-updated
>   >- * DDL flags such as relhasindex, by clearing them if no longer correct.
>   >- * It's safe to do this in VACUUM, which can't run in parallel with
>   >- * CREATE INDEX/RULE/TRIGGER and can't be part of a transaction block.
>   >- * However, it's *not* safe to do it in an ANALYZE that's within an
>
>   >+ * relpages, relallvisible, we try to maintain certain lazily-updated
>
>     Why did you just drop the 'and' after relpages? And this seems
>     the only change of this file except the additinally missing
>     letter just below:p
>
>   >+ * DDL flags such as relhasindex, by clearing them if no onger correct.
>   >+ * It's safe to do this in VACUUM, which can't run in
>   >+ * parallel with CREATE INDEX/RULE/TRIGGER and can't be part of a transaction
>   >+ * block.  However, it's *not* safe to do it in an ANALYZE that's within an

Fixed.

>
> nodeIndexonlyscan.c:
>
>  A duplicate letters.  And the line exceeds right margin.
>
>  > - * Note on Memory Ordering Effects: visibilitymap_test does not lock
> -> + * Note on Memory Ordering Effects: visibilitymap_get_stattus does not lock
> + * Note on Memory Ordering Effects: visibilitymap_get_status does not lock

Fixed.

>
>  The edited line exceeds right margin by removing a newline.
>
> - if (!visibilitymap_test(scandesc->heapRelation,
> -                         ItemPointerGetBlockNumber(tid),
> -                         &node->ioss_VMBuffer))
> + if (!VM_ALL_VISIBLE(scandesc->heapRelation, ItemPointerGetBlockNumber(tid),
> +                                         &node->ioss_VMBuffer))
>

Fixed.

> costsize.c:
>
>  Duplicate words and it is the only change.
>
>  > - * pages for which the visibility map shows all tuples are visible.
> -> + * pages for which the visibility map map shows all tuples are visible.
> + * pages for which the visibility map shows all tuples are visible.

Fixed.

> pgstat.c:
>
>  The new parameter frozenpages of pgstat_report_vacuum() is
>  defined as int32, but its callers give BlockNumber(=uint32).  I
>  recommend to define the frozenpages as BlockNumber.
>  PgStat_MsgVacuum has a corresponding member defined as int32.

I agree with you.
Fixed.

> pg_upgrade.c:
>
>  BITS_PER_HEAPBLOCK is defined in two c files with the same
>  definition. This might be better to be merged into some header
>  file.

Fixed.
I moved these definition to visibilitymap.h.

>
> heapam_xlog.h, hio.h, execnodes.h:
>
>  Have we decided to rename vm to pim? Anyway it is inconsistent
>  with that of corresponding definition of the function body
>  remains as 'vm_buffer'. (I'm not confident on that, though.)
>
>  >-   Buffer vm_buffer, TransactionId cutoff_xid);
>  >+   Buffer pim_buffer, TransactionId cutoff_xid, uint8 flags);
>

Fixed.

Regards,

--
Masahiko Sawada

Attachment

Re: Freeze avoidance of very large table.

From
Jeff Janes
Date:
On Tue, Dec 1, 2015 at 10:40 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> On Tue, Dec 1, 2015 at 3:04 AM, Jeff Janes <jeff.janes@gmail.com> wrote:
>> On Mon, Nov 30, 2015 at 9:18 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>>> On Sun, Nov 29, 2015 at 2:21 AM, Jeff Janes <jeff.janes@gmail.com> wrote:
>>>> On Tue, Nov 24, 2015 at 3:13 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>>>>>
>>>>> Yeah, we need to consider to compute checksum if enabled.
>>>>> I've changed the patch, and attached.
>>>>> Please review it.
>>>>
>>>> Thanks for the update.  This now conflicts with the updates doesn to
>>>> fix pg_upgrade out-of-space issue on Windows. I've fixed (I think) the
>>>> conflict in order to do some testing, but I'd like to get an updated
>>>> patch from the author in case I did it wrong.  I don't want to find
>>>> bugs that I just introduced myself.
>>>>
>>>
>>> Thank you for having a look.
>>>
>>> Attached updated v28 patch.
>>> Please review it.
>>>
>>> Regards,
>>
>> After running pg_upgrade, if I manually vacuum a table a start getting warnings:
>>
>> WARNING:  page is not marked all-visible (and all-frozen) but
>> visibility map bit(s) is set in relation "foo" page 32756
>> WARNING:  page is not marked all-visible (and all-frozen) but
>> visibility map bit(s) is set in relation "foo" page 32756
>> WARNING:  page is not marked all-visible (and all-frozen) but
>> visibility map bit(s) is set in relation "foo" page 32757
>> WARNING:  page is not marked all-visible (and all-frozen) but
>> visibility map bit(s) is set in relation "foo" page 32757
>>
>> The warnings are right where the blocks would start using the 2nd page
>> of the _vm, so I think the problem is there.  And looking at the code,
>> I think that "cur += SizeOfPageHeaderData;" in the inner loop cannot
>> be correct.  We can't skip a header in the current (old) block each
>> time we reach the end of the new block.  The thing we are skipping in
>> the current block is half the time not a header, but the data at the
>> halfway point through the block.
>>
>
> Thank you for reviewing.
>
> You're right, it's not necessary.
> Attached latest v29 patch which removes the mention in pg_upgrade documentation.

I could successfully upgrade with this patch, with the link option and
without.  After the update the tables seemed to have their correct
visibility status, and after a VACUUM FREEZE then had the correct
freeze status as well.

Then I manually corrupted the vm file, just to make sure a corrupted
one would get detected.  And much to my surprise, I didn't get any
errors or warning when starting it back up and running vacuum freeze
(unless I had page checksums turned on, then I got warnings and zeroed
out pages).  But I guess this is not considered a warnable condition
for bits to be off when they should be on, only the opposite.

Consecutive VACUUM FREEZE operations with no DML activity between were
not sped up by as much as I thought they would be, because it still
had to walk all the indexes even though it didn't touch the table at
all.  In real-world usage there would almost always be some dead
tuples that would require an index scan anyway for a normal vacuum.

Cheers,

Jeff



Re: Freeze avoidance of very large table.

From
Masahiko Sawada
Date:
On Fri, Dec 4, 2015 at 9:51 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
> On Tue, Dec 1, 2015 at 10:40 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>> On Tue, Dec 1, 2015 at 3:04 AM, Jeff Janes <jeff.janes@gmail.com> wrote:
>>> On Mon, Nov 30, 2015 at 9:18 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>>>> On Sun, Nov 29, 2015 at 2:21 AM, Jeff Janes <jeff.janes@gmail.com> wrote:
>>>>> On Tue, Nov 24, 2015 at 3:13 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>>>>>>
>>>>>> Yeah, we need to consider to compute checksum if enabled.
>>>>>> I've changed the patch, and attached.
>>>>>> Please review it.
>>>>>
>>>>> Thanks for the update.  This now conflicts with the updates doesn to
>>>>> fix pg_upgrade out-of-space issue on Windows. I've fixed (I think) the
>>>>> conflict in order to do some testing, but I'd like to get an updated
>>>>> patch from the author in case I did it wrong.  I don't want to find
>>>>> bugs that I just introduced myself.
>>>>>
>>>>
>>>> Thank you for having a look.
>>>>
>>>> Attached updated v28 patch.
>>>> Please review it.
>>>>
>>>> Regards,
>>>
>>> After running pg_upgrade, if I manually vacuum a table a start getting warnings:
>>>
>>> WARNING:  page is not marked all-visible (and all-frozen) but
>>> visibility map bit(s) is set in relation "foo" page 32756
>>> WARNING:  page is not marked all-visible (and all-frozen) but
>>> visibility map bit(s) is set in relation "foo" page 32756
>>> WARNING:  page is not marked all-visible (and all-frozen) but
>>> visibility map bit(s) is set in relation "foo" page 32757
>>> WARNING:  page is not marked all-visible (and all-frozen) but
>>> visibility map bit(s) is set in relation "foo" page 32757
>>>
>>> The warnings are right where the blocks would start using the 2nd page
>>> of the _vm, so I think the problem is there.  And looking at the code,
>>> I think that "cur += SizeOfPageHeaderData;" in the inner loop cannot
>>> be correct.  We can't skip a header in the current (old) block each
>>> time we reach the end of the new block.  The thing we are skipping in
>>> the current block is half the time not a header, but the data at the
>>> halfway point through the block.
>>>
>>
>> Thank you for reviewing.
>>
>> You're right, it's not necessary.
>> Attached latest v29 patch which removes the mention in pg_upgrade documentation.
>
> I could successfully upgrade with this patch, with the link option and
> without.  After the update the tables seemed to have their correct
> visibility status, and after a VACUUM FREEZE then had the correct
> freeze status as well.

Thank you for testing!

> Then I manually corrupted the vm file, just to make sure a corrupted
> one would get detected.  And much to my surprise, I didn't get any
> errors or warning when starting it back up and running vacuum freeze
> (unless I had page checksums turned on, then I got warnings and zeroed
> out pages).  But I guess this is not considered a warnable condition
> for bits to be off when they should be on, only the opposite.

How did you break the vm file?
An inconsistent flag state (all-frozen set but all-visible not set) will
be detected by the visibility map code. But the vm file is just
consecutive bits after its page header, so detecting its corruption would
be difficult unless the whole page is corrupted.

> Consecutive VACUUM FREEZE operations with no DML activity between were
> not sped up by as much as I thought they would be, because it still
> had to walk all the indexes even though it didn't touch the table at
> all.  In real-world usage there would almost always be some dead
> tuples that would require an index scan anyway for a normal vacuum.

Another reason why consecutive VACUUM FREEZE runs were not sped up
much could be that many pages of that table were already in the disk
cache, right?
In a very large database, vacuuming a large table dominates the total
vacuum time, so this feature would be more effective there.

Regards,

--
Masahiko Sawada



Re: Freeze avoidance of very large table.

From
Robert Haas
Date:
On Mon, Nov 30, 2015 at 12:58 PM, Bruce Momjian <bruce@momjian.us> wrote:
> On Mon, Nov 30, 2015 at 10:48:04PM +0530, Masahiko Sawada wrote:
>> On Sun, Nov 29, 2015 at 2:21 AM, Jeff Janes <jeff.janes@gmail.com> wrote:
>> > On Tue, Nov 24, 2015 at 3:13 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>> >>
>> >> Yeah, we need to consider to compute checksum if enabled.
>> >> I've changed the patch, and attached.
>> >> Please review it.
>> >
>> > Thanks for the update.  This now conflicts with the updates doesn to
>> > fix pg_upgrade out-of-space issue on Windows. I've fixed (I think) the
>> > conflict in order to do some testing, but I'd like to get an updated
>> > patch from the author in case I did it wrong.  I don't want to find
>> > bugs that I just introduced myself.
>> >
>>
>> Thank you for having a look.
>
> I would not bother mentioning this detail in the pg_upgrade manual page:
>
> +   Since the format of visibility map has been changed in version 9.6,
> +   <application>pg_upgrade</> creates and rewrite new <literal>'_vm'</literal>
> +   file even if upgrading from 9.5 or before to 9.6 or later with link mode (-k).

Really?  I know we don't always document things like this, but it
seems like a good idea to me that we do so.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Freeze avoidance of very large table.

From
Michael Paquier
Date:
On Thu, Dec 10, 2015 at 3:31 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Mon, Nov 30, 2015 at 12:58 PM, Bruce Momjian <bruce@momjian.us> wrote:
>> On Mon, Nov 30, 2015 at 10:48:04PM +0530, Masahiko Sawada wrote:
>>> On Sun, Nov 29, 2015 at 2:21 AM, Jeff Janes <jeff.janes@gmail.com> wrote:
>>> > On Tue, Nov 24, 2015 at 3:13 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>>> >>
>>> >> Yeah, we need to consider to compute checksum if enabled.
>>> >> I've changed the patch, and attached.
>>> >> Please review it.
>>> >
>>> > Thanks for the update.  This now conflicts with the updates doesn to
>>> > fix pg_upgrade out-of-space issue on Windows. I've fixed (I think) the
>>> > conflict in order to do some testing, but I'd like to get an updated
>>> > patch from the author in case I did it wrong.  I don't want to find
>>> > bugs that I just introduced myself.
>>> >
>>>
>>> Thank you for having a look.
>>
>> I would not bother mentioning this detail in the pg_upgrade manual page:
>>
>> +   Since the format of visibility map has been changed in version 9.6,
>> +   <application>pg_upgrade</> creates and rewrite new <literal>'_vm'</literal>
>> +   file even if upgrading from 9.5 or before to 9.6 or later with link mode (-k).
>
> Really?  I know we don't always document things like this, but it
> seems like a good idea to me that we do so.

Just going through v30...

+    frozen. The whole-table freezing is occuerred only when all pages happen to
+    require freezing to freeze rows. In other cases such as where

I am not really getting the meaning of this sentence. Shouldn't this
be reworded something like:
"Freezing occurs on the whole table once all pages of this relation require it."

+    <structfield>relfrozenxid</> is more than
<varname>vacuum_freeze_table_age</>
+    transcations old, where <command>VACUUM</>'s <literal>FREEZE</>
option is used,
+    <command>VACUUM</> can skip the pages that all tuples on the page
itself are
+    marked as frozen.
+    When all pages of table are eventually marked as frozen by
<command>VACUUM</>,
+    after it's finished <literal>age(relfrozenxid)</> should be a little more
+    than the <varname>vacuum_freeze_min_age</> setting that was used (more by
+    the number of transcations started since the <command>VACUUM</> started).
+    If the advancing of <structfield>relfrozenxid</> is not happend until
+    <varname>autovacuum_freeze_max_age</> is reached, an autovacuum will soon
+    be forced for the table.

s/transcations/transactions.

+     <entry><structfield>n_frozen_page</></entry>
+     <entry><type>integer</></entry>
+     <entry>Number of frozen pages</entry>
n_frozen_pages?

make check with pg_upgrade is failing for me:
Visibility map rewriting test failed
make: *** [check] Error 1

-GetVisibilityMapPins(Relation relation, Buffer buffer1, Buffer buffer2,
+GetVisibilitymapPins(Relation relation, Buffer buffer1, Buffer buffer2,
This looks like an unrelated change.

 * Clearing a visibility map bit is not separately WAL-logged.  The callers
 * must make sure that whenever a bit is cleared, the bit is cleared on WAL
- * replay of the updating operation as well.
+ * replay of the updating operation as well.  And all-frozen bit must be
+ * cleared with all-visible at the same time.
This could be reformulated. This is just an addition on top of the
existing paragraph.

+ * The visibility map has the all-frozen bit which indicates all tuples on
+ * corresponding page has been completely frozen, so the visibility map is also
"have been completely frozen".

-/* Mapping from heap block number to the right bit in the visibility map */
-#define HEAPBLK_TO_MAPBLOCK(x) ((x) / HEAPBLOCKS_PER_PAGE)
-#define HEAPBLK_TO_MAPBYTE(x) (((x) % HEAPBLOCKS_PER_PAGE) /
HEAPBLOCKS_PER_BYTE)
Those two declarations are just noise in the patch: those definitions
are unchanged.

-       elog(DEBUG1, "vm_clear %s %d", RelationGetRelationName(rel), heapBlk);
+       elog(DEBUG1, "vm_clear %s block %d",
RelationGetRelationName(rel), heapBlk);
This may be better as a separate patch.

-visibilitymap_count(Relation rel)
+visibilitymap_count(Relation rel, BlockNumber *all_frozen)
I think that this routine would gain in clarity if reworked as follows:
visibilitymap_count(Relation rel, BlockNumber *all_visible,
BlockNumber *all_frozen)
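
A caller would then read naturally as (hypothetical snippet, the
variable names are made up):

    BlockNumber relallvisible;
    BlockNumber relallfrozen;

    visibilitymap_count(rel, &relallvisible, &relallfrozen);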

+               /*
+                * Report ANALYZE to the stats collector, too.
However, if doing
+                * inherited stats we shouldn't report, because the
stats collector only
+                * tracks per-table stats.
+                */
+               if (!inh)
+                       pgstat_report_analyze(onerel, totalrows,
totaldeadrows, relallfrozen);
Here we already know that this is working in the non-inherited code
path. As a whole, why that? Why isn't relallfrozen passed as an
argument of vac_update_relstats and then inserted in pg_class? Maybe I
am missing something..

+        * mxid full-table scan limit. During full scan, we could skip some pags
+        * according to all-frozen bit of visibility map.
s/pags/pages

+        * Also, skipping even a single page accorinding to all-visible bit of
s/accorinding/according.

So, lazy_scan_heap is the central and really vital portion of the patch...

+                               /* Check whether this tuple is alrady
frozen or not */
s/alrady/already

-heap_page_is_all_visible(Relation rel, Buffer buf, TransactionId
*visibility_cutoff_xid)
+heap_page_is_all_visible(Relation rel, Buffer buf, TransactionId
*visibility_cutoff_xid,
+                                                bool *all_frozen)
I think you would want to change that to heap_page_visible_status that
returns *all_visible as well. At least it seems to be a more
consistent style of routine.

+ * The format of visibility map is changed with this 9.6 commit,
+ */
+#define VISIBILITY_MAP_FROZEN_BIT_CAT_VER 201512021
It looks a bit strange to have a different flag for the vm with the
new frozen bit. Couldn't we merge that into a unique version number? I
guess that we should just ask for a vm rewrite anyway in any case if
pg_upgrade is used on the version of pg with the new vm format, no?

Sawada-san, are you planning to continue working on that? At this
stage it seems that this patch is not in commitable state and should
at best be moved to next CF, or at worst returned with feedback.
-- 
Michael



Re: Freeze avoidance of very large table.

From
Simon Riggs
Date:
On 9 December 2015 at 18:31, Robert Haas <robertmhaas@gmail.com> wrote:
On Mon, Nov 30, 2015 at 12:58 PM, Bruce Momjian <bruce@momjian.us> wrote:
> On Mon, Nov 30, 2015 at 10:48:04PM +0530, Masahiko Sawada wrote:
>> On Sun, Nov 29, 2015 at 2:21 AM, Jeff Janes <jeff.janes@gmail.com> wrote:
>> > On Tue, Nov 24, 2015 at 3:13 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>> >>
>> >> Yeah, we need to consider to compute checksum if enabled.
>> >> I've changed the patch, and attached.
>> >> Please review it.
>> >
>> > Thanks for the update.  This now conflicts with the updates doesn to
>> > fix pg_upgrade out-of-space issue on Windows. I've fixed (I think) the
>> > conflict in order to do some testing, but I'd like to get an updated
>> > patch from the author in case I did it wrong.  I don't want to find
>> > bugs that I just introduced myself.
>> >
>>
>> Thank you for having a look.
>
> I would not bother mentioning this detail in the pg_upgrade manual page:
>
> +   Since the format of visibility map has been changed in version 9.6,
> +   <application>pg_upgrade</> creates and rewrite new <literal>'_vm'</literal>
> +   file even if upgrading from 9.5 or before to 9.6 or later with link mode (-k).

Really?  I know we don't always document things like this, but it
seems like a good idea to me that we do so.

Agreed.

For me, rewriting the visibility map is a new data loss bug waiting to happen. I am worried that the group is not taking seriously the potential for catastrophe here. I think we can do it, but I think it needs these things

* Clear notice that it is happening unconditionally and unavoidably
* Log files showing it has happened, action by action
* Very clear mechanism for resolving an incomplete or interrupted upgrade process. Which VMs got upgraded? Which didn't?
* Ability to undo an upgrade attempt, somehow, ideally automatically by default
* Ability to restart a failed upgrade attempt without doing a "double upgrade", i.e. ensure transformation is immutable

--
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Freeze avoidance of very large table.

From
Michael Paquier
Date:
On Thu, Dec 17, 2015 at 3:44 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> For me, rewriting the visibility map is a new data loss bug waiting to
> happen. I am worried that the group is not taking seriously the potential
> for catastrophe here.

FWIW, I'm following this line and merging the vm file into a single
unit looks like a ticking bomb. We may really want a separate _vm
file, like _vmf to track this new bit flag but this has already been
mentioned largely upthread...
-- 
Michael



Re: Freeze avoidance of very large table.

From
Andres Freund
Date:
On 2015-12-17 15:56:35 +0900, Michael Paquier wrote:
> On Thu, Dec 17, 2015 at 3:44 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> > For me, rewriting the visibility map is a new data loss bug waiting to
> > happen. I am worried that the group is not taking seriously the potential
> > for catastrophe here.
> 
> FWIW, I'm following this line and merging the vm file into a single
> unit looks like a ticking bomb.

And what are those risks?

> We may really want a separate _vm file, like _vmf to track this new
> bit flag but this has already been mentioned largely upthread...

That'd double the overhead when those bits get unset.



Re: Freeze avoidance of very large table.

From
Michael Paquier
Date:
On Thu, Dec 17, 2015 at 4:10 PM, Andres Freund <andres@anarazel.de> wrote:
> On 2015-12-17 15:56:35 +0900, Michael Paquier wrote:
>> On Thu, Dec 17, 2015 at 3:44 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
>> > For me, rewriting the visibility map is a new data loss bug waiting to
>> > happen. I am worried that the group is not taking seriously the potential
>> > for catastrophe here.
>>
>> FWIW, I'm following this line and merging the vm file into a single
>> unit looks like a ticking bomb.
>
> And what are those risks?

Incorrect vm file rewrite after a pg_upgrade run.
-- 
Michael



Re: Freeze avoidance of very large table.

From
andres@anarazel.de (Andres Freund)
Date:
On 2015-12-17 16:22:24 +0900, Michael Paquier wrote:
> On Thu, Dec 17, 2015 at 4:10 PM, Andres Freund <andres@anarazel.de> wrote:
> > On 2015-12-17 15:56:35 +0900, Michael Paquier wrote:
> >> On Thu, Dec 17, 2015 at 3:44 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> >> > For me, rewriting the visibility map is a new data loss bug waiting to
> >> > happen. I am worried that the group is not taking seriously the potential
> >> > for catastrophe here.
> >>
> >> FWIW, I'm following this line and merging the vm file into a single
> >> unit looks like a ticking bomb.
> >
> > And what are those risks?
> 
> Incorrect vm file rewrite after a pg_upgrade run.

If we can't manage to rewrite a file, replacing a binary b1 with a b10,
then we shouldn't be working on a database. And if we screw up, recovery
is an rm *_vm away. I can't imagine that this is going to be the
actually complicated part of this feature.



Re: Freeze avoidance of very large table.

From
Masahiko Sawada
Date:
On Thu, Dec 17, 2015 at 11:47 AM, Michael Paquier
<michael.paquier@gmail.com> wrote:
> On Thu, Dec 10, 2015 at 3:31 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Mon, Nov 30, 2015 at 12:58 PM, Bruce Momjian <bruce@momjian.us> wrote:
>>> On Mon, Nov 30, 2015 at 10:48:04PM +0530, Masahiko Sawada wrote:
>>>> On Sun, Nov 29, 2015 at 2:21 AM, Jeff Janes <jeff.janes@gmail.com> wrote:
>>>> > On Tue, Nov 24, 2015 at 3:13 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>>>> >>
>>>> >> Yeah, we need to consider to compute checksum if enabled.
>>>> >> I've changed the patch, and attached.
>>>> >> Please review it.
>>>> >
>>>> > Thanks for the update.  This now conflicts with the updates doesn to
>>>> > fix pg_upgrade out-of-space issue on Windows. I've fixed (I think) the
>>>> > conflict in order to do some testing, but I'd like to get an updated
>>>> > patch from the author in case I did it wrong.  I don't want to find
>>>> > bugs that I just introduced myself.
>>>> >
>>>>
>>>> Thank you for having a look.
>>>
>>> I would not bother mentioning this detail in the pg_upgrade manual page:
>>>
>>> +   Since the format of visibility map has been changed in version 9.6,
>>> +   <application>pg_upgrade</> creates and rewrite new <literal>'_vm'</literal>
>>> +   file even if upgrading from 9.5 or before to 9.6 or later with link mode (-k).
>>
>> Really?  I know we don't always document things like this, but it
>> seems like a good idea to me that we do so.
>
> Just going though v30...
>
> +    frozen. The whole-table freezing is occuerred only when all pages happen to
> +    require freezing to freeze rows. In other cases such as where
>
> I am not really getting the meaning of this sentence. Shouldn't this
> be reworded something like:
> "Freezing occurs on the whole table once all pages of this relation require it."
>
> +    <structfield>relfrozenxid</> is more than
> <varname>vacuum_freeze_table_age</>
> +    transcations old, where <command>VACUUM</>'s <literal>FREEZE</>
> option is used,
> +    <command>VACUUM</> can skip the pages that all tuples on the page
> itself are
> +    marked as frozen.
> +    When all pages of table are eventually marked as frozen by
> <command>VACUUM</>,
> +    after it's finished <literal>age(relfrozenxid)</> should be a little more
> +    than the <varname>vacuum_freeze_min_age</> setting that was used (more by
> +    the number of transcations started since the <command>VACUUM</> started).
> +    If the advancing of <structfield>relfrozenxid</> is not happend until
> +    <varname>autovacuum_freeze_max_age</> is reached, an autovacuum will soon
> +    be forced for the table.
>
> s/transcations/transactions.
>
> +     <entry><structfield>n_frozen_page</></entry>
> +     <entry><type>integer</></entry>
> +     <entry>Number of frozen pages</entry>
> n_frozen_pages?
>
> make check with pg_upgrade is failing for me:
> Visibility map rewriting test failed
> make: *** [check] Error 1

make check with pg_upgrade runs successfully in my environment.
Could you give me more information about this?

> -GetVisibilityMapPins(Relation relation, Buffer buffer1, Buffer buffer2,
> +GetVisibilitymapPins(Relation relation, Buffer buffer1, Buffer buffer2,
> This looks like an unrelated change.
>
>   * Clearing a visibility map bit is not separately WAL-logged.  The callers
>   * must make sure that whenever a bit is cleared, the bit is cleared on WAL
> - * replay of the updating operation as well.
> + * replay of the updating operation as well.  And all-frozen bit must be
> + * cleared with all-visible at the same time.
> This could be reformulated. This is just an addition on top of the
> existing paragraph.
>
> + * The visibility map has the all-frozen bit which indicates all tuples on
> + * corresponding page has been completely frozen, so the visibility map is also
> "have been completely frozen".
>
> -/* Mapping from heap block number to the right bit in the visibility map */
> -#define HEAPBLK_TO_MAPBLOCK(x) ((x) / HEAPBLOCKS_PER_PAGE)
> -#define HEAPBLK_TO_MAPBYTE(x) (((x) % HEAPBLOCKS_PER_PAGE) /
> HEAPBLOCKS_PER_BYTE)
> Those two declarations are just noise in the patch: those definitions
> are unchanged.
>
> -       elog(DEBUG1, "vm_clear %s %d", RelationGetRelationName(rel), heapBlk);
> +       elog(DEBUG1, "vm_clear %s block %d",
> RelationGetRelationName(rel), heapBlk);
> This may be better as a separate patch.

I've attached 001 patch for this separately.

>
> -visibilitymap_count(Relation rel)
> +visibilitymap_count(Relation rel, BlockNumber *all_frozen)
> I think that this routine would gain in clarity if reworked as follows:
> visibilitymap_count(Relation rel, BlockNumber *all_visible,
> BlockNumber *all_frozen)
>
> +               /*
> +                * Report ANALYZE to the stats collector, too.
> However, if doing
> +                * inherited stats we shouldn't report, because the
> stats collector only
> +                * tracks per-table stats.
> +                */
> +               if (!inh)
> +                       pgstat_report_analyze(onerel, totalrows,
> totaldeadrows, relallfrozen);
> Here we already know that this is working in the non-inherited code
> path. As a whole, why that? Why isn't relallfrozen passed as an
> argument of vac_update_relstats and then inserted in pg_class? Maybe I
> am missing something..

IIUC, as per the discussion, the number of frozen pages should not be
stored in pg_class, because unlike relallvisible and relpages it's not
information used by query planning.

> +        * mxid full-table scan limit. During full scan, we could skip some pags
> +        * according to all-frozen bit of visibility map.
> s/pags/pages
>
> +        * Also, skipping even a single page accorinding to all-visible bit of
> s/accorinding/according.
>
> So, lazy_scan_heap is the central and really vital portion of the patch...
>
> +                               /* Check whether this tuple is alrady
> frozen or not */
> s/alrady/already
>
> -heap_page_is_all_visible(Relation rel, Buffer buf, TransactionId
> *visibility_cutoff_xid)
> +heap_page_is_all_visible(Relation rel, Buffer buf, TransactionId
> *visibility_cutoff_xid,
> +                                                bool *all_frozen)
> I think you would want to change that to heap_page_visible_status that
> returns *all_visible as well. At least it seems to be a more
> consistent style of routine.
>
> + * The format of visibility map is changed with this 9.6 commit,
> + */
> +#define VISIBILITY_MAP_FROZEN_BIT_CAT_VER 201512021
> It looks a bit strange to have a different flag for the vm with the
> new frozen bit. Couldn't we merge that into a unique version number? I
> guess that we should just ask for a vm rewrite anyway in any case if
> pg_upgrade is used on the version of pg with the new vm format, no?

Thank you for your review.
Please find the attached latest v31 patches.

>
> Sawada-san, are you planning to continue working on that? At this
> stage it seems that this patch is not in commitable state and should
> at best be moved to next CF, or at worst returned with feedback.

Yes, of course.
This patch should be marked as "Move to next CF".

Regards,

--
Masahiko Sawada

Attachment

Re: Freeze avoidance of very large table.

From
Michael Paquier
Date:
On Fri, Dec 18, 2015 at 3:17 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> On Thu, Dec 17, 2015 at 11:47 AM, Michael Paquier <michael.paquier@gmail.com> wrote:
>> make check with pg_upgrade is failing for me:
>> Visibility map rewriting test failed
>> make: *** [check] Error 1
>
> make check with pg_upgrade is done successfully on my environment.
> Could you give me more information about this?

Oh, well I see now after digging into your code. You are missing -X
when running psql, and until recently psql -c implied -X all the time.
The reason why it failed for me is that I have \timing enabled in
psqlrc.

Actually test.sh needs to be fixed as well; see the attached, this is
a separate bug. Could a kind committer look at it and check whether it
is acceptable?

>> Sawada-san, are you planning to continue working on that? At this
>> stage it seems that this patch is not in commitable state and should
>> at best be moved to next CF, or at worst returned with feedback.
>
> Yes, of course.
> This patch should be marked as "Move to next CF".

OK, done so.
--
Michael

Attachment

Re: Freeze avoidance of very large table.

From
Robert Haas
Date:
On Thu, Dec 17, 2015 at 1:17 AM, Michael Paquier
<michael.paquier@gmail.com> wrote:
> I am not really getting the meaning of this sentence. Shouldn't this
> be reworded something like:
> "Freezing occurs on the whole table once all pages of this relation require it."

That statement isn't remotely true, and I don't think this patch
changes that.  Freezing occurs on the whole table once relfrozenxid is
old enough that we think there might be at least one page in the table
that requires it.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Freeze avoidance of very large table.

From
Robert Haas
Date:
On Thu, Dec 17, 2015 at 2:26 AM, Andres Freund <andres@anarazel.de> wrote:
> On 2015-12-17 16:22:24 +0900, Michael Paquier wrote:
>> On Thu, Dec 17, 2015 at 4:10 PM, Andres Freund <andres@anarazel.de> wrote:
>> > On 2015-12-17 15:56:35 +0900, Michael Paquier wrote:
>> >> On Thu, Dec 17, 2015 at 3:44 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
>> >> > For me, rewriting the visibility map is a new data loss bug waiting to
>> >> > happen. I am worried that the group is not taking seriously the potential
>> >> > for catastrophe here.
>> >>
>> >> FWIW, I'm following this line and merging the vm file into a single
>> >> unit looks like a ticking bomb.
>> >
>> > And what are those risks?
>>
>> Incorrect vm file rewrite after a pg_upgrade run.
>
> If we can't manage to rewrite a file, replacing a binary b1 with a b10,
> then we shouldn't be working on a database. And if we screw up, recovery
> is an rm *_vm away. I can't imagine that this is going to be the
> actually complicated part of this feature.

Yeah.  If that part of this feature isn't right, the chances that the
rest of the patch is robust enough to commit seem extremely low.
That is, as Andres says, not the hard part.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Freeze avoidance of very large table.

From
Kyotaro HORIGUCHI
Date:
Hello,

At Fri, 18 Dec 2015 12:09:43 -0500, Robert Haas <robertmhaas@gmail.com> wrote in
<CA+TgmoZCCFwgKL0PmSi=htfZ2aCOZPoTPD73ecvSA9rhXa0zUw@mail.gmail.com>
> On Thu, Dec 17, 2015 at 1:17 AM, Michael Paquier
> <michael.paquier@gmail.com> wrote:
> > I am not really getting the meaning of this sentence. Shouldn't this
> > be reworded something like:
> > "Freezing occurs on the whole table once all pages of this relation require it."
> 
> That statement isn't remotely true, and I don't think this patch
> changes that.  Freezing occurs on the whole table once relfrozenxid is
> old enough that we think there might be at least one page in the table
> that requires it.

I doubt I can explain this accurately, but I took the original
phrase to mean that if and only if all pages of the table happen to be
marked as "requires freezing", all pages are frozen. It's
quite obvious, but it is what I think "happen to require freezing"
means. Does this make sense?

The phrase might not be necessary if this is correct.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center





Re: Freeze avoidance of very large table.

From
Robert Haas
Date:
On Mon, Dec 21, 2015 at 3:27 AM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> Hello,
>
> At Fri, 18 Dec 2015 12:09:43 -0500, Robert Haas <robertmhaas@gmail.com> wrote in
<CA+TgmoZCCFwgKL0PmSi=htfZ2aCOZPoTPD73ecvSA9rhXa0zUw@mail.gmail.com>
>> On Thu, Dec 17, 2015 at 1:17 AM, Michael Paquier
>> <michael.paquier@gmail.com> wrote:
>> > I am not really getting the meaning of this sentence. Shouldn't this
>> > be reworded something like:
>> > "Freezing occurs on the whole table once all pages of this relation require it."
>>
>> That statement isn't remotely true, and I don't think this patch
>> changes that.  Freezing occurs on the whole table once relfrozenxid is
>> old enough that we think there might be at least one page in the table
>> that requires it.
>
> I doubt I can explain this accurately, but I took the original
> phrase as that if and only if all pages of the table are marked
> as "requires freezing" by accident, all pages are frozen. It's
> quite obvious but it is what I think "happen to require freezing"
> means. Does this make sense?
>
> The phrase might not be necessary if this is correct.

Maybe you are trying to say something like "only those pages which
require freezing are frozen?".

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Freeze avoidance of very large table.

From
Bruce Momjian
Date:
On Thu, Dec 17, 2015 at 06:44:46AM +0000, Simon Riggs wrote:
>     >> Thank you for having a look.
>     >
>     > I would not bother mentioning this detail in the pg_upgrade manual page:
>     >
>     > +   Since the format of visibility map has been changed in version 9.6,
>     > +   <application>pg_upgrade</> creates and rewrite new <literal>'_vm'</
>     literal>
>     > +   file even if upgrading from 9.5 or before to 9.6 or later with link
>     mode (-k).
> 
>     Really?  I know we don't always document things like this, but it
>     seems like a good idea to me that we do so.
> 
> 
> Agreed.
> 
> For me, rewriting the visibility map is a new data loss bug waiting to happen.
> I am worried that the group is not taking seriously the potential for
> catastrophe here. I think we can do it, but I think it needs these things
> 
> * Clear notice that it is happening unconditionally and unavoidably
> * Log files showing it has happened, action by action
> * Very clear mechanism for resolving an incomplete or interrupted upgrade
> process. Which VMs got upgraded? Which didn't?
> * Ability to undo an upgrade attempt, somehow, ideally automatically by default
> * Ability to restart a failed upgrade attempt without doing a "double upgrade",
> i.e. ensure transformation is immutable

Have you forgotten how pg_upgrade works?  This new vm file is only
created on the new cluster, which is not usable if pg_upgrade doesn't
complete successfully.  pg_upgrade never modifies the old cluster.

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

+ As you are, so once was I. As I am, so you will be. +
+ Roman grave inscription                             +



Re: Freeze avoidance of very large table.

From
Masahiko Sawada
Date:
On Mon, Dec 21, 2015 at 11:54 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Mon, Dec 21, 2015 at 3:27 AM, Kyotaro HORIGUCHI
> <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
>> Hello,
>>
>> At Fri, 18 Dec 2015 12:09:43 -0500, Robert Haas <robertmhaas@gmail.com> wrote in
<CA+TgmoZCCFwgKL0PmSi=htfZ2aCOZPoTPD73ecvSA9rhXa0zUw@mail.gmail.com>
>>> On Thu, Dec 17, 2015 at 1:17 AM, Michael Paquier
>>> <michael.paquier@gmail.com> wrote:
>>> > I am not really getting the meaning of this sentence. Shouldn't this
>>> > be reworded something like:
>>> > "Freezing occurs on the whole table once all pages of this relation require it."
>>>
>>> That statement isn't remotely true, and I don't think this patch
>>> changes that.  Freezing occurs on the whole table once relfrozenxid is
>>> old enough that we think there might be at least one page in the table
>>> that requires it.
>>
>> I doubt I can explain this accurately, but I took the original
>> phrase as that if and only if all pages of the table are marked
>> as "requires freezing" by accident, all pages are frozen. It's
>> quite obvious but it is what I think "happen to require freezing"
>> means. Does this make sense?
>>
>> The phrase might not be necessary if this is correct.
>
> Maybe you are trying to say something like "only those pages which
> require freezing are frozen?".
>

I was thinking the same as what Horiguchi-san said.
That is, even if relfrozenxid is old enough, freezing the whole
table is not required if its pages are marked as not requiring
freezing.
In other words, only those pages which are not marked as frozen are frozen.
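
Put concretely, the idea is that an aggressive (whole-table) vacuum
can do something like this in its per-block loop (a sketch of the idea
only; the macro and variable names are illustrative rather than the
patch's exact code):

    /* A page whose all-frozen bit is set cannot contain any tuple that
     * still needs freezing, so skip it even during a full-table scan. */
    if (scan_all && VM_ALL_FROZEN(onerel, blkno, &vmbuffer))
        continue;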

Regards,

--
Masahiko Sawada



Re: Freeze avoidance of very large table.

From
Masahiko Sawada
Date:
On Mon, Dec 28, 2015 at 6:38 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> On Mon, Dec 21, 2015 at 11:54 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Mon, Dec 21, 2015 at 3:27 AM, Kyotaro HORIGUCHI
>> <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
>>> Hello,
>>>
>>> At Fri, 18 Dec 2015 12:09:43 -0500, Robert Haas <robertmhaas@gmail.com> wrote in
<CA+TgmoZCCFwgKL0PmSi=htfZ2aCOZPoTPD73ecvSA9rhXa0zUw@mail.gmail.com>
>>>> On Thu, Dec 17, 2015 at 1:17 AM, Michael Paquier
>>>> <michael.paquier@gmail.com> wrote:
>>>> > I am not really getting the meaning of this sentence. Shouldn't this
>>>> > be reworded something like:
>>>> > "Freezing occurs on the whole table once all pages of this relation require it."
>>>>
>>>> That statement isn't remotely true, and I don't think this patch
>>>> changes that.  Freezing occurs on the whole table once relfrozenxid is
>>>> old enough that we think there might be at least one page in the table
>>>> that requires it.
>>>
>>> I doubt I can explain this accurately, but I took the original
>>> phrase as that if and only if all pages of the table are marked
>>> as "requires freezing" by accident, all pages are frozen. It's
>>> quite obvious but it is what I think "happen to require freezing"
>>> means. Does this make sense?
>>>
>>> The phrase might not be necessary if this is correct.
>>
>> Maybe you are trying to say something like "only those pages which
>> require freezing are frozen?".
>>
>
> I was thinking the same as what Horiguchi-san said.
> That is, even if relfrozenxid is old enough, freezing on the whole
> table is not required if the table are marked as "not requires
> freezing".
> In other word, only those pages which are marked as "not frozen" are frozen.
>

Recent changes to HEAD conflict with the freeze map patch, so I've
updated and attached the latest freeze map patch.
The other patch, which enhances the debug log messages of the
visibility map, is attached to a previous mail.
<http://www.postgresql.org/message-id/CAD21AoBScUD4k_QWrYGRmbXVruiekPY=2BY2Fxhqq55a+tzUxg@mail.gmail.com>.

Please review it.

Regards,

--
Masahiko Sawada

Attachment

Re: Freeze avoidance of very large table.

From
Masahiko Sawada
Date:
On Wed, Jan 13, 2016 at 12:16 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> On Mon, Dec 28, 2015 at 6:38 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>> On Mon, Dec 21, 2015 at 11:54 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>>> On Mon, Dec 21, 2015 at 3:27 AM, Kyotaro HORIGUCHI
>>> <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
>>>> Hello,
>>>>
>>>> At Fri, 18 Dec 2015 12:09:43 -0500, Robert Haas <robertmhaas@gmail.com> wrote in
<CA+TgmoZCCFwgKL0PmSi=htfZ2aCOZPoTPD73ecvSA9rhXa0zUw@mail.gmail.com>
>>>>> On Thu, Dec 17, 2015 at 1:17 AM, Michael Paquier
>>>>> <michael.paquier@gmail.com> wrote:
>>>>> > I am not really getting the meaning of this sentence. Shouldn't this
>>>>> > be reworded something like:
>>>>> > "Freezing occurs on the whole table once all pages of this relation require it."
>>>>>
>>>>> That statement isn't remotely true, and I don't think this patch
>>>>> changes that.  Freezing occurs on the whole table once relfrozenxid is
>>>>> old enough that we think there might be at least one page in the table
>>>>> that requires it.
>>>>
>>>> I doubt I can explain this accurately, but I took the original
>>>> phrase as that if and only if all pages of the table are marked
>>>> as "requires freezing" by accident, all pages are frozen. It's
>>>> quite obvious but it is what I think "happen to require freezing"
>>>> means. Does this make sense?
>>>>
>>>> The phrase might not be necessary if this is correct.
>>>
>>> Maybe you are trying to say something like "only those pages which
>>> require freezing are frozen?".
>>>
>>
>> I was thinking the same as what Horiguchi-san said.
>> That is, even if relfrozenxid is old enough, freezing on the whole
>> table is not required if the table are marked as "not requires
>> freezing".
>> In other word, only those pages which are marked as "not frozen" are frozen.
>>
>
> The recently changes to HEAD conflicts with freeze map patch, so I've
> updated and attached latest freeze map patch.
> The another patch that enhances the debug log message of visibilitymap
> is attached to previous mail.
> <http://www.postgresql.org/message-id/CAD21AoBScUD4k_QWrYGRmbXVruiekPY=2BY2Fxhqq55a+tzUxg@mail.gmail.com>.
>
> Please review it.
>

Attached updated version patch.
Please review it.

Regards,

--
Masahiko Sawada

Attachment

Re: Freeze avoidance of very large table.

From
Alvaro Herrera
Date:
Masahiko Sawada wrote:

> Attached updated version patch.
> Please review it.

In pg_upgrade, the "page convert" functionality is there to abstract
rewrites of pages being copied; your patch is circumventing it and
AFAICS it makes the interface more complicated for no good reason.  I
think the real way to do that is to write your rewriteVisibilityMap as a
pageConverter routine.  That should reduce some duplication there.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Freeze avoidance of very large table.

From
Jim Nasby
Date:
On 2/1/16 4:59 PM, Alvaro Herrera wrote:
> Masahiko Sawada wrote:
>
>> Attached updated version patch.
>> Please review it.
>
> In pg_upgrade, the "page convert" functionality is there to abstract
> rewrites of pages being copied; your patch is circumventing it and
> AFAICS it makes the interface more complicated for no good reason.  I
> think the real way to do that is to write your rewriteVisibilityMap as a
> pageConverter routine.  That should reduce some duplication there.

IIRC this is about the third problem that's been found with pg_upgrade 
in this patch. That concerns me given the potential for disaster if 
freeze bits are set incorrectly.
-- 
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com



Re: Freeze avoidance of very large table.

From
Masahiko Sawada
Date:
On Tue, Feb 2, 2016 at 10:15 AM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote:
> On 2/1/16 4:59 PM, Alvaro Herrera wrote:
>>
>> Masahiko Sawada wrote:
>>
>>> Attached updated version patch.
>>> Please review it.
>>
>>
>> In pg_upgrade, the "page convert" functionality is there to abstract
>> rewrites of pages being copied; your patch is circumventing it and
>> AFAICS it makes the interface more complicated for no good reason.  I
>> think the real way to do that is to write your rewriteVisibilityMap as a
>> pageConverter routine.  That should reduce some duplication there.
>

Does this mean that users always have to set a pageConverter plug-in
when upgrading?
I was thinking that this conversion is required for every user who
wants to upgrade to 9.6, so we should support it in core, not as a
plug-in.

>
> IIRC this is about the third problem that's been found with pg_upgrade in
> this patch. That concerns me given the potential for disaster if freeze bits
> are set incorrectly.

Yeah, I intend to have a diagnostic tool for the visibility map, and
to exactly compare the VM between the old cluster and the new one
after upgrading the postgres server.
I've implemented such a tool; it is in my github repository[1].
I'm thinking of adding this tool into core (e.g., the pg_upgrade
directory, not a contrib module) as a testing function.

[1] https://github.com/MasahikoSawada/pg_visibilitymap
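
For flavor, the core of such a check over a pre-9.6 _vm fork can be as
small as this (a standalone sketch assuming 8 kB pages, a 24-byte page
header, and one all-visible bit per heap page; this is not the
pg_visibilitymap code itself):

#include <stdio.h>
#include <stdint.h>

#define BLCKSZ      8192
#define PAGEHDRSZ   24          /* SizeOfPageHeaderData on typical builds */

int
main(int argc, char **argv)
{
    FILE       *f;
    uint8_t     page[BLCKSZ];
    long        heapblk = 0;

    if (argc < 2 || (f = fopen(argv[1], "rb")) == NULL)
        return 1;

    while (fread(page, 1, BLCKSZ, f) == BLCKSZ)
    {
        /* one bit per heap page, packed right after the page header */
        for (int byte = PAGEHDRSZ; byte < BLCKSZ; byte++)
            for (int bit = 0; bit < 8; bit++, heapblk++)
                if (page[byte] & (1 << bit))
                    printf("heap block %ld: all-visible\n", heapblk);
    }
    fclose(f);
    return 0;
}

Diffing two such dumps, taken from the old and new clusters, gives the
kind of bit-for-bit comparison described above.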

Regards,

--
Masahiko Sawada



Re: Freeze avoidance of very large table.

From
Masahiko Sawada
Date:
On Tue, Feb 2, 2016 at 11:42 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> On Tue, Feb 2, 2016 at 10:15 AM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote:
>> On 2/1/16 4:59 PM, Alvaro Herrera wrote:
>>>
>>> Masahiko Sawada wrote:
>>>
>>>> Attached updated version patch.
>>>> Please review it.
>>>
>>>
>>> In pg_upgrade, the "page convert" functionality is there to abstract
>>> rewrites of pages being copied; your patch is circumventing it and
>>> AFAICS it makes the interface more complicated for no good reason.  I
>>> think the real way to do that is to write your rewriteVisibilityMap as a
>>> pageConverter routine.  That should reduce some duplication there.
>>
>
> This means that user always have to set pageConverter plug-in when upgrading?
> I was thinking that this conversion is required for all user who wants
> to upgrade to 9.6, so we should support it in core, not as a plug-in.

I misunderstood. Sorry for the noise.
I agree with adding the conversion method as a pageConverter routine.

This patch doesn't actually change the page layout, but the
pageConverter routine checks only the page layout.
And we would have to provide a plugin named convertLayout_X_to_Y.

I think we have two options.

1. Change the page layout (PG_PAGE_LAYOUT_VERSION) to 5. pg_upgrade
detects it and then converts only VM files.
2. Change the pg_upgrade plugin mechanism so that it can handle other
name-conversion plugins (e.g., convertLayout_vm_to_vfm).

I think #2 is better. Thoughts?

Regards,

--
Masahiko Sawada



Re: Freeze avoidance of very large table.

From
Alvaro Herrera
Date:
Masahiko Sawada wrote:

> I misunderstood. Sorry for noise.
> I agree with adding conversion method as a pageConverter routine.

\o/

> This patch doesn't change page layout actually, but pageConverter
> routine checks only the page layout.
> And we have to plugin named convertLayout_X_to_Y.
> 
> I think we have two options.
> 
> 1. Change page layout(PG_PAGE_LAYOUT_VERSION) to 5. pg_upgrade detects
> it and then converts only VM files.
> 2. Change pg_upgrade plugin mechanism so that it can handle other name
> conversion plugins (e.g., convertLayout_vm_to_vfm)
> 
> I think #2 is better. Thought?

My vote is for #2 as well.  Maybe we just didn't have forks when this
functionality was invented; maybe the author just didn't think hard
enough about what would be the right interface to do it.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Freeze avoidance of very large table.

From
Masahiko Sawada
Date:
On Tue, Feb 2, 2016 at 7:22 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> Masahiko Sawada wrote:
>
>> I misunderstood. Sorry for noise.
>> I agree with adding conversion method as a pageConverter routine.
>
> \o/
>
>> This patch doesn't change page layout actually, but pageConverter
>> routine checks only the page layout.
>> And we have to plugin named convertLayout_X_to_Y.
>>
>> I think we have two options.
>>
>> 1. Change page layout(PG_PAGE_LAYOUT_VERSION) to 5. pg_upgrade detects
>> it and then converts only VM files.
>> 2. Change pg_upgrade plugin mechanism so that it can handle other name
>> conversion plugins (e.g., convertLayout_vm_to_vfm)
>>
>> I think #2 is better. Thought?
>
> My vote is for #2 as well.  Maybe we just didn't have forks when this
> functionality was invented; maybe the author just didn't think hard
> enough about what would be the right interface to do it.

Thanks.

I'm planning to change it as follows.
- The pageCnvCtx struct has a new function pointer convertVMFile(). If
  the layout of other relation files such as FSM or CLOG changes in the
  future, we could add convertFSMFile() and convertCLOGFile().
- Create a new library convertLayoutVM_add_frozenbit.c that has a
  convertVMFile() function which converts only the visibility map. When
  rewriting of the VM is required, convertLayoutVM_add_frozenbit.so is
  dynamically loaded. convertLayout_X_to_Y converts other relation
  files. That is, converting the VM and converting other relations are
  done independently.
- The current plugin mechanism puts conversion libraries (*.so) into
  ${bin}/plugins (i.e., a new plugin directory is required), but I'm
  thinking of putting them into ${libdir}.

Please give me feedback.

Regards,

--
Masahiko Sawada



Re: Freeze avoidance of very large table.

From
Kyotaro HORIGUCHI
Date:
Hello,

At Tue, 2 Feb 2016 20:25:23 +0900, Masahiko Sawada <sawada.mshk@gmail.com> wrote in
<CAD21AoA5iaKQ6K7gUZyzN2KJnPNMeHc6PPPxj6cJgmssjj=fqw@mail.gmail.com>
> On Tue, Feb 2, 2016 at 7:22 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> > Masahiko Sawada wrote:
> >
> >> I misunderstood. Sorry for noise.
> >> I agree with adding conversion method as a pageConverter routine.
> >
> > \o/
> >
> >> This patch doesn't change page layout actually, but pageConverter
> >> routine checks only the page layout.
> >> And we have to plugin named convertLayout_X_to_Y.
> >>
> >> I think we have two options.
> >>
> >> 1. Change page layout(PG_PAGE_LAYOUT_VERSION) to 5. pg_upgrade detects
> >> it and then converts only VM files.
> >> 2. Change pg_upgrade plugin mechanism so that it can handle other name
> >> conversion plugins (e.g., convertLayout_vm_to_vfm)
> >>
> >> I think #2 is better. Thought?
> >
> > My vote is for #2 as well.  Maybe we just didn't have forks when this
> > functionality was invented; maybe the author just didn't think hard
> > enough about what would be the right interface to do it.
> 
> Thanks.
> 
> I'm planning to change as follows.
> - pageCnvCtx struct has new function pointer convertVMFile().
>   If the layout of other relation such as FSM, CLOG in the future, we
> could add convertFSMFile() and convertCLOGFile().
> - Create new library convertLayoutVM_add_frozenbit.c that has
> convertVMFile() function which converts only visibilitymap.
>   When rewriting of VM is required, convertLayoutVM_add_frozenbit.so
> is dynamically loaded.
>   convertLayout_X_to_Y converts other relation files.
>   That is, converting VM and converting other relations are independently done.
> - Current plugin mechanism puts conversion library (*.so) into
> ${bin}/plugins (i.g., new plugin directory is required), but I'm
> thinking to puts it into ${libdir}.
> 
> Please give me feedbacks.

I agree that the plugin mechanism would be usable and needs to be
redesigned, but..

Since the destination version is fixed, the advantage of the
plugin mechanism for pg_upgrade would be the capability to choose a
plugin to load according to some characteristic of the source
database. What do you think the trigger characteristic for
convertLayoutVM_add_frozenbit.so or similar should be? If it is
hard-coded, like what transfer_single_new_db does for fsm and vm, I
suppose the module does not need to be a plugin.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center





Re: Freeze avoidance of very large table.

From
Alvaro Herrera
Date:
This patch has gotten its fair share of feedback in this fest.  I moved
it to the next commitfest.  Please do keep working on it, and reviewers
who have additional comments are welcome.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Freeze avoidance of very large table.

From
Masahiko Sawada
Date:
On Tue, Feb 2, 2016 at 10:05 PM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
> This patch has gotten its fair share of feedback in this fest.  I moved
> it to the next commitfest.  Please do keep working on it and reviewers
> that have additional comments are welcome.

Thanks!

On Tue, Feb 2, 2016 at 8:59 PM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> Since the destination version is fixed, the advantage of the
> plugin mechanism for pg_upgade would be capability to choose a
> plugin to load according to some characteristics of the source
> database. What do you think the trigger characteristics for
> convertLayoutVM_add_frozenbit.so or similars? If it is hard-coded
> like what transfer_single_new_db is doing for fsm and vm, I
> suppose the module is not necessary to be a plugin.

Sorry, I couldn't get it.
Do you mean that we should use rewriteVisibilityMap as a plain
function (not dynamically loaded)?
The destination version is not fixed; it depends on the new cluster version.
I'm planning that convertLayoutVM_add_frozenbit.so is dynamically
loaded and used only when rewriting of the VM is required.
If the layout of the VM is changed again in the future, we could add
other libraries for conversion.

Regards,

--
Masahiko Sawada



Re: Freeze avoidance of very large table.

From
Masahiko Sawada
Date:
On Tue, Feb 2, 2016 at 7:22 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> Masahiko Sawada wrote:
>
>> I misunderstood. Sorry for noise.
>> I agree with adding conversion method as a pageConverter routine.
>
> \o/
>
>> This patch doesn't change page layout actually, but pageConverter
>> routine checks only the page layout.
>> And we have to plugin named convertLayout_X_to_Y.
>>
>> I think we have two options.
>>
>> 1. Change page layout(PG_PAGE_LAYOUT_VERSION) to 5. pg_upgrade detects
>> it and then converts only VM files.
>> 2. Change pg_upgrade plugin mechanism so that it can handle other name
>> conversion plugins (e.g., convertLayout_vm_to_vfm)
>>
>> I think #2 is better. Thought?
>
> My vote is for #2 as well.  Maybe we just didn't have forks when this
> functionality was invented; maybe the author just didn't think hard
> enough about what would be the right interface to do it.

I've almost written up a very rough patch (it passes the regression tests).
Windows support is not done yet, and the Makefile is not correct.

I've divided the main patch into two patches: an add-frozen-bit patch
and a pg_upgrade support patch.
The 000 patch is almost the same as the previous code (includes a small fix).
The 001 patch provides rewriting of the visibility map as a pageConverter routine.
The 002 patch is an enhancement of the debug messages in visibilitymap.c.

In order to support the pageConvert plugin, I made the following changes.
* Main changes
- Remove PAGE_CONVERSION.
- pg_upgrade plugins are located in the 'src/bin/pg_upgrade/plugins' directory.
- Move the directory holding plugins from '$(bin)/plugins' to '$(lib)/plugins'.
- Add a new page-converter plugin function for the visibility map.
- The current code doesn't allow us to use link mode (-k) in the case
  where a page converter is required, but I changed it so that if a
  page converter for a fork file is specified, we actually convert the
  file even in link mode.

* Interface design
convertFile() and convertPage() are the plugin functions for the main
relation file, and these functions are dynamically loaded by
loadConvertPlugin().
I added a new pageConvert plugin function convertVMFile() for the
visibility map (fork file).
If the layout of CLOG, FSM, etc. changes in the future, we could add
new pageConvert plugin functions like convertCLOGFile() or
convertFSMFile(), and these functions would be dynamically loaded by
loadAdditionalConvertPlugin().
It means that conversion of the main file and of the other fork files
are executed independently, and conversion of fork files is executed
even if link mode is specified.
Each conversion plugin is loaded and used only when it's required.
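
In shape, the VM hook and its core transformation would look roughly
like this (a sketch under assumptions: the (dst, src) argument order,
the NULL-on-success convention, and all names here are illustrative,
not the patch's actual API):

#include <stdint.h>

/* Hypothetical shape of the per-fork converter hook. */
typedef const char *(*pluginConvertVMFile) (void *pluginData,
                                            const char *dstName,
                                            const char *srcName);

/*
 * Core transformation: one old vm byte (8 heap pages, 1 bit each)
 * becomes two new vm bytes (8 heap pages, 2 bits each), with every
 * new all-frozen bit left clear.
 */
static void
expand_vm_byte(uint8_t old_byte, uint8_t *new_lo, uint8_t *new_hi)
{
    uint16_t    expanded = 0;

    for (int i = 0; i < 8; i++)
    {
        if (old_byte & (1 << i))
            expanded |= (uint16_t) 1 << (2 * i);    /* all-visible */
        /* bit (2 * i + 1), the all-frozen flag, stays clear */
    }
    *new_lo = (uint8_t) (expanded & 0xFF);
    *new_hi = (uint8_t) (expanded >> 8);
}

static const char *
convertVMFile(void *pluginData, const char *dstName, const char *srcName)
{
    /*
     * Real body: read each old vm page, run expand_vm_byte() over the
     * bytes after the page header, lay down fresh page headers (and
     * checksums) once per *new* page, and write the result to dstName.
     */
    (void) pluginData;
    (void) dstName;
    (void) srcName;
    (void) expand_vm_byte;      /* keep this stub self-contained */
    return NULL;                /* NULL means success in this sketch */
}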

I still agree with this plugin approach, but I felt it's still a bit
complicated, and I'm concerned that the patch size has increased.
Please give me feedback.
If there are no objections to this, I'm going to spend time improving it.

Regards,

--
Masahiko Sawada

Attachment

Re: Freeze avoidance of very large table.

From
Kyotaro HORIGUCHI
Date:
Hello,

At Thu, 4 Feb 2016 02:32:29 +0900, Masahiko Sawada <sawada.mshk@gmail.com> wrote in
<CAD21AoB1HnZ7thWYjqKve78gQ5+PyedbbkjAPbc5zLV3oA-CuA@mail.gmail.com>
> On Tue, Feb 2, 2016 at 7:22 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> > Masahiko Sawada wrote:
> >> I think we have two options.
> >>
> >> 1. Change page layout(PG_PAGE_LAYOUT_VERSION) to 5. pg_upgrade detects
> >> it and then converts only VM files.
> >> 2. Change pg_upgrade plugin mechanism so that it can handle other name
> >> conversion plugins (e.g., convertLayout_vm_to_vfm)
> >>
> >> I think #2 is better. Thought?
> >
> > My vote is for #2 as well.  Maybe we just didn't have forks when this
> > functionality was invented; maybe the author just didn't think hard
> > enough about what would be the right interface to do it.
> 
> I've almost wrote up the very rough patch. (it can pass regression test)
> Windows supporting is not yet, and Makefile is not correct.
> 
> I've divided the main patch into two patches; add frozen bit patch and
> pg_upgrade support patch.
> 000 patch is almost  same as previous code. (includes small fix)
> 001 patch provides rewriting visibility map as a pageConverter routine.
> 002 patch is for enhancement debug message in visibilitymap.c

Thanks, it has become easier to read.

> In order to support pageConvert plugin, I made the following changes.
> * Main changes
> - Remove PAGE_CONVERSION
> - pg_upgrade plugin is located to 'src/bin/pg_upgrade/plugins' directory.
> - Move directory having plugins from '$(bin)/plugins' to '$(lib)/plugins'.

These seem fair.

> - Add new page-converter plugin function for visibility map.
> - Current code doesn't allow us to use link mode (-k) in the case
> where page-converter is required.
> 
>   But I changed it so that if page-converter for fork file is
> specified, we convert it actually even when link mode.
> 
> * Interface designe
> convertFile() and convertPage() are plugin function for main relation
> file, and these functions are dynamically loaded by
> loadConvertPlugin().

Though I haven't looked at this very closely, loadConvertPlugin looks
to keep deciding what plugin to load using the old and new page
layout versions. Currently the only actually possible versions are 4
and, if we increment it now, 5.

On the other hand, _vm came at the *catalog version* 201107031
(the 9.1 release) and _fsm came in the 8.4 release. Both of them are of
page layout version 4. Are we allowed to increment the page layout
version for this reason? And is this framework under
reconstruction flexible enough for this kind of change in the
future? I don't think so.


We have added _vm and _fsm so far, so we must use a version number
that can determine when _vm, _fsm and _vfm were introduced. I'm
afraid the page layout version is out of scope for that; the catalog
version seems most usable, as it is already used to know when the
crash-safe VM was introduced.

Using the catalog version, the plugin we provide first would be
convertLayout_201105231_201602071.so, which has only a converter
from _vm to _vfm. This plugin is loaded for the combination of a
source cluster with a catalog version at or after 201105231 (when the
VM was introduced) but before 201602071 (this version) and a
destination cluster at or after 201602071.

If we change the format of the fsm (suppose the vm no longer exists by
then), we would have a new plugin maybe named
convertLayout_200904091_2017xxxxx.so, which has, maybe, an
in-place file converter for the fsm. It would be loaded when the source
database is of catalog version 200904091 (when the FSM was
introduced) or later but before 2017xxxxx (that version). The catalog
version seems to work fine.
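
A sketch of that selection rule (under the reading above that
convertLayout_A_B.so applies when the source cluster's catalog version
lies in [A, B); the concrete numbers are illustrative):

#include <stdbool.h>
#include <stdint.h>

#define CAT_VER_VM_ADDED    201105231   /* crash-safe vm introduced */
#define CAT_VER_FROZEN_BIT  201602071   /* frozen bit added (hypothetical) */

static bool
vm_plugin_applies(uint32_t src_cat_ver)
{
    return src_cat_ver >= CAT_VER_VM_ADDED &&
           src_cat_ver < CAT_VER_FROZEN_BIT;
}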


So far, I have assumed that the name of files to be converted is
<oid>[fork_name], so the possible types of conversion would be the
following.
- per-page conversion
- per-file conversion between files with the same fork name.
- per-file conversion between files with different fork names.

Since the plugin filename doesn't tell such things, they should
be told by the plugin itself. So a plugin would provide the
following interface:

typedef struct ConverterTable
{
    char       *src_fork_name;
    char       *dst_fork_name;
    FileConverterFunc file_converter;
    PageConverterFunc page_converter;
} ConverterTable[];

Following this naming convention for plugins, we may load multiple
plugins at once, so we collect all entries of the tables of all
loaded plugins and check that no src_fork_name among them is
duplicated.

Here, we have sufficient information to choose which converter to
invoke, and to execute the conversion like this:

for (fork_name in all_fork_names_including_"")
{
    find a converter comparing fork_name with src_fork_name.
    check dst_fork_name and rename the target file if needed.
    invoke the converter.
}
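
A concrete rendering of the lookup side might be (sketch;
ConverterEntry and find_converter are hypothetical names):

#include <stddef.h>
#include <string.h>

typedef const char *(*FileConverterFunc) (const char *dst, const char *src);
typedef const char *(*PageConverterFunc) (char *dstPage, const char *srcPage);

typedef struct
{
    const char *src_fork_name;  /* "" means the main fork */
    const char *dst_fork_name;
    FileConverterFunc file_converter;
    PageConverterFunc page_converter;
} ConverterEntry;

/* Return the converter registered for this fork, or NULL for plain copy/link. */
static const ConverterEntry *
find_converter(const ConverterEntry *tab, size_t ntab, const char *fork)
{
    for (size_t i = 0; i < ntab; i++)
        if (strcmp(tab[i].src_fork_name, fork) == 0)
            return &tab[i];
    return NULL;
}

Renaming when dst_fork_name differs, and choosing between the per-file
and per-page converter, then follow exactly as in the pseudocode above.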
 

If we need to convert clogs or the like and want to prepare for
such events, the ConverterTable might have an additional member
and change the meaning of some of the existing members.

typedef struct ConverterTable
{
    enum target_type;           /* FILE_NAME or FORK_NAME */
    char       *src_name;
    char       *dst_name;
    FileConverterFunc file_converter;
    PageConverterFunc page_converter;
} ConverterTable[];

When target_type == FILE_NAME, src_name and dst_name represent
the target file names relative to $PGDATA.

# Yeah, I know it is too complicated.


> I added a new pageConvert plugin function converVMFile() for
> visibility map (fork file).
> If layout of CLOG, FSM or etc will be changed in the future, we could
> expand some new pageConvert plugin functions like convertCLOGFile() or
> convertFSMFile(), and these functions are dynamically loaded by
> loadAdditionalConvertPlugin().
> It means that main file and other fork file conversion are executed
> independently, and conversion for fork file are executed even if link
> mode is specified.
> Each conversion plugin is loaded and used only when it's required.

As I asked upthread, one of the most important design points of the
plugin mechanism is which characteristics of the source and/or
destination cluster trigger the loading of a plugin. And if the page
layout format is that trigger, are we allowed to increment it for such
unrelated events? Or should we use another characteristic, like the
catalog version?

> I still agree with this plugin approach, but I felt it's still
> complicated a bit, and I'm concerned that patch size has been
> increased.
> Please give me feedbacks.

Yeah, I feel the same. What makes it worse, the plugin mechanism
will get even more complex if we make it more flexible for possible
future usage as I proposed above. It is apparently too much machinery
for deciding whether to load what is, for now, *just one* converter
function. And no additional converter is in sight.

I am inclined to pull all the plugin stuff out of pg_upgrade. We are
so prudent about changing file formats that this kind of event will
happen only at intervals of several years. The plugin mechanism would
be valuable if providing it encouraged us to change file formats more
frequently and freely, but such a situation would absolutely introduce
more untoward things..


> If there are not objections about this, I'm going to spend time
> to improve it.

Sorry, but I do have a strong objection to this... Does anyone else
have opinions on it?

regareds,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center





Re: Freeze avoidance of very large table.

From
Masahiko Sawada
Date:
Thank you for reviewing this patch!

On Wed, Feb 10, 2016 at 4:39 PM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> Hello,
>
> At Thu, 4 Feb 2016 02:32:29 +0900, Masahiko Sawada <sawada.mshk@gmail.com> wrote in
<CAD21AoB1HnZ7thWYjqKve78gQ5+PyedbbkjAPbc5zLV3oA-CuA@mail.gmail.com>
>> On Tue, Feb 2, 2016 at 7:22 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
>> > Masahiko Sawada wrote:
>> >> I think we have two options.
>> >>
>> >> 1. Change page layout(PG_PAGE_LAYOUT_VERSION) to 5. pg_upgrade detects
>> >> it and then converts only VM files.
>> >> 2. Change pg_upgrade plugin mechanism so that it can handle other name
>> >> conversion plugins (e.g., convertLayout_vm_to_vfm)
>> >>
>> >> I think #2 is better. Thought?
>> >
>> > My vote is for #2 as well.  Maybe we just didn't have forks when this
>> > functionality was invented; maybe the author just didn't think hard
>> > enough about what would be the right interface to do it.
>>
>> I've almost written up the very rough patch. (It can pass the regression tests.)
>> Windows support is not done yet, and the Makefile is not correct.
>>
>> I've divided the main patch into two patches; add frozen bit patch and
>> pg_upgrade support patch.
>> 000 patch is almost the same as the previous code. (includes a small fix)
>> 001 patch provides rewriting visibility map as a pageConverter routine.
>> 002 patch is for enhancement debug message in visibilitymap.c
>
> Thanks, that makes it easier to read.
>
>> In order to support pageConvert plugin, I made the following changes.
>> * Main changes
>> - Remove PAGE_CONVERSION
>> - pg_upgrade plugin is located to 'src/bin/pg_upgrade/plugins' directory.
>> - Move directory having plugins from '$(bin)/plugins' to '$(lib)/plugins'.
>
> These seem fair.
>
>> - Add new page-converter plugin function for visibility map.
>> - Current code doesn't allow us to use link mode (-k) in the case
>> where page-converter is required.
>>
>>   But I changed it so that if a page-converter for a fork file is
>> specified, we actually convert it even in link mode.
>>
>> * Interface design
>> convertFile() and convertPage() are plugin functions for the main relation
>> file, and these functions are dynamically loaded by
>> loadConvertPlugin().
>
> Though I haven't looked at this very closely, loadConverterPlugin looks
> to continue deciding what plugin to load using the old and new page
> layout versions. Currently the actually possible version is 4 and,
> if we increment it now, 5.
>
> On the other hand, _vm came at the *catalog version* 201107031
> (9.1 release) and _fsm came at 8.4 release. Both of them are of
> page layout version 4. Are we allowed to increment page layout
> version for this reason? And is this framework under
> reconstruction flexible enough for this kind of change in the
> future? I don't think so.

Yeah, I also think that the page layout version should not be increased by
this layout change of the VM.
This patch checks the catalog version first, and then decides what
plugin to load.
In this case, only the format of the VM has been changed, so pg_upgrade
loads a plugin for the VM and converts it.
pg_upgrade doesn't load plugins for the other files; those are just copied.
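For illustration, the decision could boil down to something like the
following sketch; the constant, its value and the loader function are
assumptions made up for this example, not the patch's actual code.

    /* catversion that introduced the frozen bit in the VM (assumed value) */
    #define VISIBILITY_MAP_FROZEN_BIT_CAT_VER 201602181

    if (old_cluster.controldata.cat_ver < VISIBILITY_MAP_FROZEN_BIT_CAT_VER &&
        new_cluster.controldata.cat_ver >= VISIBILITY_MAP_FROZEN_BIT_CAT_VER)
        load_converter_plugin("convertLayout_vm_add_frozenbit");   /* hypothetical */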

>
>
> We have added _vm and _fsm so far, so we must use a version number
> that can determine when _vm, _fsm and _vfm were introduced. I'm
> afraid the page layout version is unsuited to that purpose; the
> catalog version seems the most usable, as it is already used to know
> when the crash-safe VM was introduced.
>
> Using the catalog version, the plugin we provide first would be
> convertLayout_201105231_201602071.so, which has only a converter
> from _vm to _vfm. This plugin is loaded for the combination of
> a source cluster with a catalog version of 201105231 (when the VM
> was introduced) or later and a destination cluster with a version
> *before* 201602071 (this version).
>
> If we change the format of the fsm (the vm no longer existing), we
> would have a new plugin, maybe named
> convertLayout_200904091_2017xxxxx.so, which has a (maybe in-place)
> file converter for the fsm. It will be loaded when a source
> database is of catalog version 200904091 (when the FSM was
> introduced) or later and a destination is before 2017xxxxx (the
> version). Catalog version seems to work fine.

I think it's not a good idea to use the catalog version in the plugin name,
because even if the catalog version is used in the plugin file name as you
suggested, pg_upgrade still needs to decide by itself which plugin
name to load.
Also, a plugin file named after catalog versions would make it hard to
understand what the plugin actually does. It's not developer-friendly.
The advantage of using the page layout version in the plugin name is that
pg_upgrade can decide automatically which plugin should be loaded.
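As a sketch of that automatic decision (the variables and the naming
scheme here are illustrative assumptions):

    uint32      old_layout_ver = 4;    /* from the old cluster's pg_control */
    uint32      new_layout_ver = 5;    /* from the new cluster's pg_control */
    char        plugin_name[MAXPGPATH];

    /* yields e.g. "convertLayout_4_5.so" where DLSUFFIX is ".so" */
    snprintf(plugin_name, sizeof(plugin_name), "convertLayout_%u_%u%s",
             old_layout_ver, new_layout_ver, DLSUFFIX);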

>
> So far, I assumed that we can safely assume that the name of
> files to be converted is <oid>[fork_name] so the possible types
> of conversions would be the following.
>
>  - per-page conversion
>  - per-file conversion between the files with the same fork name.
>  - per-file conversion between the files with different fork names.
>
> Since the plugin filename doesn't tell such things, they should
> be told by the plugin itself. So a plugin is to provide the
> following interface,
>
> typedef struct ConverterTable
> {
>   char *src_fork_name;
>   char *dst_fork_name;
>   FileConverterFunc file_converter;
>   PageConverterFunc page_converter;
> } ConverterTable[];
>
> Following such name convention of plugins, we may load multiple
> plugins at once, so we collect all entries of the tables of all
> loaded plugins and check that no src_fork_name among them is
> duplicated.
>
> Here, we have sufficient information to choose what converter to
> invoke and execute conversion like this.
>
>   for (fork_name in all_fork_names_including_"" )
>   {
>     find a converter comparing fork_name with src_fork_name.
>     check dst_fork_name and rename the target file if needed.
>     invoke the converter.
>   }
>
> If we need to convert clogs or similar files and want to prepare for
> such events, the ConverterTable might gain an additional member
> and change the meaning of some of the existing members.
>
> typedef struct ConverterTable
> {
>   enum target_type;     /* FILE_NAME or FORK_NAME */
>   char *src_name;
>   char *dst_name;
>   FileConverterFunc file_converter;
>   PageConverterFunc page_converter;
> } ConverterTable[];
>
> when target_type == FILE_NAME, src_name and dst_name represent
> the target file names relative to $PGDATA.
>
> # Yeah, I know it is too complicated.
>

I agree with having a ConverterTable.
Since we have three kinds of file suffixes ("", "_vm", "_fsm"),
pg_upgrade will have three elements in ConverterTable[].

>> I added a new pageConvert plugin function converVMFile() for
>> visibility map (fork file).
>> If layout of CLOG, FSM or etc will be changed in the future, we could
>> expand some new pageConvert plugin functions like convertCLOGFile() or
>> convertFSMFile(), and these functions are dynamically loaded by
>> loadAdditionalConvertPlugin().
>> It means that main file and other fork file conversion are executed
>> independently, and conversion for fork file are executed even if link
>> mode is specified.
>> Each conversion plugin is loaded and used only when it's required.
>
> As I asked upthread, one of the most important design points of
> the plugin mechanism is which characteristics of the source and/or
> destination cluster trigger the loading of a plugin. And if the page
> layout version is that trigger, are we allowed to increment it for
> such unrelated events? Or should we use another characteristic, like
> the catalog version?
>
>> I still agree with this plugin approach, but I felt it's still
>> complicated a bit, and I'm concerned that patch size has been
>> increased.
>> Please give me feedback.
>
> Yeah, I feel the same. What makes it worse, the plugin mechanism
> will get even more complex if we make it more flexible for possible
> usage as I proposed above. It is apparently too complicated for
> deciding whether to load what is, for now, *just one* converter
> function. And no additional converter is in sight.

There could be cases where the layout of another type of relation file
changes, so pg_upgrade would need to convert several types of relation
files at the same time.
I think we at least need to support loading multiple plugin functions.

> I'm inclined to pull all the plugin stuff out of pg_upgrade. We are
> so prudent about changing file formats that this kind of event
> will happen only at intervals of several years. The plugin mechanism
> would be valuable if providing it encouraged us to change file formats
> more frequently and freely, but such a situation would certainly
> introduce more untoward things.

Yes, I think so too.
In fact, this is the first such layout change since pg_upgrade
was introduced in 9.0.

Regards,

--
Masahiko Sawada



Re: Freeze avoidance of very large table.

From
Robert Haas
Date:
On Wed, Feb 3, 2016 at 12:32 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> I've divided the main patch into two patches; add frozen bit patch and
> pg_upgrade support patch.
> 000 patch is almost the same as the previous code. (includes a small fix)
> 001 patch provides rewriting visibility map as a pageConverter routine.
> 002 patch is for enhancement debug message in visibilitymap.c

I'd like to suggest splitting 000 into two patches.  The first one
would change the format of the visibility map, and the second one
would change VACUUM to optimize scans based on the new format.  I
think that would make it easier to get this reviewed and committed.

I think this patch churns a bunch of things that don't really need to
be churned.  For example, consider this hunk:
     /*
      * If we didn't pin the visibility map page and the page has become all
-     * visible while we were busy locking the buffer, we'll have to unlock and
-     * re-lock, to avoid holding the buffer lock across an I/O.  That's a bit
-     * unfortunate, but hopefully shouldn't happen often.
+     * visible or all frozen while we were busy locking the buffer, we'll
+     * have to unlock and re-lock, to avoid holding the buffer lock across an
+     * I/O.  That's a bit unfortunate, but hopefully shouldn't happen often.
      */

Since the page can't become all-frozen without also becoming
all-visible, the original text is still 100% accurate, and the change
doesn't seem to add any useful clarity.  Let's think about which
things really need to be changed and not just mechanically change
everything.

-                    Assert(PageIsAllVisible(heapPage));
+                    /*
+                     * Caller is expected to set PD_ALL_VISIBLE or
+                     * PD_ALL_FROZEN first.
+                     */
+                    Assert(((flags | VISIBILITYMAP_ALL_VISIBLE) &&
+                            PageIsAllVisible(heapPage)) ||
+                           ((flags | VISIBILITYMAP_ALL_FROZEN) &&
+                            PageIsAllFrozen(heapPage)));

I think this would be more clear as two separate assertions.
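For instance, the split could read like this sketch, which also tests
the flag bits with & rather than |, since a bitwise OR with a nonzero
constant is always true and an AND is presumably what is intended:

                    /*
                     * Caller is expected to set PD_ALL_VISIBLE or
                     * PD_ALL_FROZEN first.
                     */
                    Assert(!(flags & VISIBILITYMAP_ALL_VISIBLE) ||
                           PageIsAllVisible(heapPage));
                    Assert(!(flags & VISIBILITYMAP_ALL_FROZEN) ||
                           PageIsAllFrozen(heapPage));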

Your 000 patch has a little bit of whitespace damage:

[rhaas pgsql]$ git diff --check
src/backend/commands/vacuumlazy.c:1951: indent with spaces.
+                                            bool *all_visible, bool *all_frozen)
src/include/access/heapam_xlog.h:393: indent with spaces.
+                            Buffer vm_buffer, TransactionId cutoff_xid, uint8 flags);

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Freeze avoidance of very large table.

From
Masahiko Sawada
Date:
On Fri, Feb 12, 2016 at 4:46 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Wed, Feb 3, 2016 at 12:32 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>> I've divided the main patch into two patches; add frozen bit patch and
>> pg_upgrade support patch.
>> 000 patch is almost the same as the previous code. (includes a small fix)
>> 001 patch provides rewriting visibility map as a pageConverter routine.
>> 002 patch is for enhancement debug message in visibilitymap.c
>
> I'd like to suggest splitting 000 into two patches.  The first one
> would change the format of the visibility map, and the second one
> would change VACUUM to optimize scans based on the new format.  I
> think that would make it easier to get this reviewed and committed.
>
> I think this patch churns a bunch of things that don't really need to
> be churned.  For example, consider this hunk:
>
>      /*
>       * If we didn't pin the visibility map page and the page has become all
> -     * visible while we were busy locking the buffer, we'll have to unlock and
> -     * re-lock, to avoid holding the buffer lock across an I/O.  That's a bit
> -     * unfortunate, but hopefully shouldn't happen often.
> +     * visible or all frozen while we were busy locking the buffer, we'll
> +     * have to unlock and re-lock, to avoid holding the buffer lock across an
> +     * I/O.  That's a bit unfortunate, but hopefully shouldn't happen often.
>       */
>
> Since the page can't become all-frozen without also becoming
> all-visible, the original text is still 100% accurate, and the change
> doesn't seem to add any useful clarity.  Let's think about which
> things really need to be changed and not just mechanically change
> everything.
>
> -                    Assert(PageIsAllVisible(heapPage));
> +                    /*
> +                     * Caller is expected to set PD_ALL_VISIBLE or
> +                     * PD_ALL_FROZEN first.
> +                     */
> +                    Assert(((flags | VISIBILITYMAP_ALL_VISIBLE) &&
> +                            PageIsAllVisible(heapPage)) ||
> +                           ((flags | VISIBILITYMAP_ALL_FROZEN) &&
> +                            PageIsAllFrozen(heapPage)));
>
> I think this would be more clear as two separate assertions.
>
> Your 000 patch has a little bit of whitespace damage:
>
> [rhaas pgsql]$ git diff --check
> src/backend/commands/vacuumlazy.c:1951: indent with spaces.
> +                                            bool *all_visible, bool *all_frozen)
> src/include/access/heapam_xlog.h:393: indent with spaces.
> +                            Buffer vm_buffer, TransactionId cutoff_xid, uint8 flags);
>

Thank you for reviewing this patch.
I've divided the 000 patch into two patches, and attached the latest 4 patches in total.

I changed the pg_upgrade plugin logic so that each kind of file suffix has
one converter plugin.
A suffix that doesn't need conversion uses the pg_copy_file()
function as its plugin function.

Regards,

--
Masahiko Sawada

Attachment

Re: Freeze avoidance of very large table.

From
Robert Haas
Date:
On Sun, Feb 14, 2016 at 12:19 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> Thank you for reviewing this patch.
> I've divided the 000 patch into two patches, and attached the latest 4 patches in total.

Thank you!  I'll go through this again as soon as I have a free moment.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Freeze avoidance of very large table.

From
Bruce Momjian
Date:
On Wed, Feb 10, 2016 at 04:39:15PM +0900, Kyotaro HORIGUCHI wrote:
> > I still agree with this plugin approach, but I felt it's still
> > complicated a bit, and I'm concerned that patch size has been
> > increased.
> > Please give me feedback.
> 
> Yeah, I feel the same. What makes it worse, the plugin mechanism
> will get even more complex if we make it more flexible for possible
> usage as I proposed above. It is apparently too complicated for
> deciding whether to load what is, for now, *just one* converter
> function. And no additional converter is in sight.
> 
> I'm inclined to pull all the plugin stuff out of pg_upgrade. We are
> so prudent about changing file formats that this kind of event
> will happen only at intervals of several years. The plugin mechanism
> would be valuable if providing it encouraged us to change file formats
> more frequently and freely, but such a situation would certainly
> introduce more untoward things.

I agreed on ripping out the converter plugin ability of pg_upgrade. 
Remember pg_upgrade was originally written by EnterpriseDB staff, and I
think they expected their closed-source fork of Postgres might need a
custom page converter someday, but it never needed one, and at this
point I think having the code in there is just making things more
complex.  I see _no_ reason for community Postgres to use a plugin
converter because we are going to need that code for every upgrade from
pre-9.6 to 9.6+, so why not just hard-code in the functions we need.  We
can remove it once 9.5 is end-of-life.

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

+ As you are, so once was I. As I am, so you will be. +
+ Roman grave inscription                             +



Re: Freeze avoidance of very large table.

From
Masahiko Sawada
Date:
On Tue, Feb 16, 2016 at 6:13 AM, Bruce Momjian <bruce@momjian.us> wrote:
> On Wed, Feb 10, 2016 at 04:39:15PM +0900, Kyotaro HORIGUCHI wrote:
>> > I still agree with this plugin approach, but I felt it's still
>> > complicated a bit, and I'm concerned that patch size has been
>> > increased.
>> > Please give me feedback.
>>
>> Yeah, I feel the same. What makes it worse, the plugin mechanism
>> will get even more complex if we make it more flexible for possible
>> usage as I proposed above. It is apparently too complicated for
>> deciding whether to load what is, for now, *just one* converter
>> function. And no additional converter is in sight.
>>
>> I'm inclined to pull all the plugin stuff out of pg_upgrade. We are
>> so prudent about changing file formats that this kind of event
>> will happen only at intervals of several years. The plugin mechanism
>> would be valuable if providing it encouraged us to change file formats
>> more frequently and freely, but such a situation would certainly
>> introduce more untoward things.
>
> I agreed on ripping out the converter plugin ability of pg_upgrade.
> Remember pg_upgrade was originally written by EnterpriseDB staff, and I
> think they expected their closed-source fork of Postgres might need a
> custom page converter someday, but it never needed one, and at this
> point I think having the code in there is just making things more
> complex.  I see _no_ reason for community Postgres to use a plugin
> converter because we are going to need that code for every upgrade from
> pre-9.6 to 9.6+, so why not just hard-code in the functions we need.  We
> can remove it once 9.5 is end-of-life.
>

Hm, we should rather remove the source code around PAGE_CONVERSION and
page.c at 9.6?

Regards,

--
Masahiko Sawada



Re: Freeze avoidance of very large table.

From
Bruce Momjian
Date:
On Tue, Feb 16, 2016 at 11:56:25PM +0900, Masahiko Sawada wrote:
> > I agreed on ripping out the converter plugin ability of pg_upgrade.
> > Remember pg_upgrade was originally written by EnterpriseDB staff, and I
> > think they expected their closed-source fork of Postgres might need a
> > custom page converter someday, but it never needed one, and at this
> > point I think having the code in there is just making things more
> > complex.  I see _no_ reason for community Postgres to use a plugin
> > converter because we are going to need that code for every upgrade from
> > pre-9.6 to 9.6+, so why not just hard-code in the functions we need.  We
> > can remove it once 9.5 is end-of-life.
> >
> 
> Hm, we should rather remove the source code around PAGE_CONVERSION and
> page.c at 9.6?

Yes.  I can do it if you wish.

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

+ As you are, so once was I. As I am, so you will be. +
+ Roman grave inscription                             +



Re: Freeze avoidance of very large table.

From
Masahiko Sawada
Date:
On Wed, Feb 17, 2016 at 12:02 AM, Bruce Momjian <bruce@momjian.us> wrote:
> On Tue, Feb 16, 2016 at 11:56:25PM +0900, Masahiko Sawada wrote:
>> > I agreed on ripping out the converter plugin ability of pg_upgrade.
>> > Remember pg_upgrade was originally written by EnterpriseDB staff, and I
>> > think they expected their closed-source fork of Postgres might need a
>> > custom page converter someday, but it never needed one, and at this
>> > point I think having the code in there is just making things more
>> > complex.  I see _no_ reason for community Postgres to use a plugin
>> > converter because we are going to need that code for every upgrade from
>> > pre-9.6 to 9.6+, so why not just hard-code in the functions we need.  We
>> > can remove it once 9.5 is end-of-life.
>> >
>>
>> Hm, we should rather remove the source code around PAGE_CONVERSION and
>> page.c at 9.6?
>
> Yes.  I can do it if you wish.

I see. I understand that page-converter code would be useful for some
future cases, but it makes things more complex.
So I will post the patch without the page converter if there is no
objection from other hackers.

Regards,

--
Masahiko Sawada



Re: Freeze avoidance of very large table.

From
Alvaro Herrera
Date:
Masahiko Sawada wrote:
> On Wed, Feb 17, 2016 at 12:02 AM, Bruce Momjian <bruce@momjian.us> wrote:
> > On Tue, Feb 16, 2016 at 11:56:25PM +0900, Masahiko Sawada wrote:
> >> > I agreed on ripping out the converter plugin ability of pg_upgrade.
> >> > Remember pg_upgrade was originally written by EnterpriseDB staff, and I
> >> > think they expected their closed-source fork of Postgres might need a
> >> > custom page converter someday, but it never needed one, and at this
> >> > point I think having the code in there is just making things more
> >> > complex.  I see _no_ reason for community Postgres to use a plugin
> >> > converter because we are going to need that code for every upgrade from
> >> > pre-9.6 to 9.6+, so why not just hard-code in the functions we need.  We
> >> > can remove it once 9.5 is end-of-life.
> >> >
> >>
> >> Hm, we should rather remove the source code around PAGE_CONVERSION and
> >> page.c at 9.6?
> >
> > Yes.  I can do it if you wish.
> 
> I see. I understand that page-converter code would be useful for some
> future cases, but it makes things more complex.

If we're not going to use it, let's get rid of it right away.  There's
no point in having a feature that adds complexity just because we might
find some hypothetical use of it in a not-yet-imagined future.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Freeze avoidance of very large table.

From
Bruce Momjian
Date:
On Tue, Feb 16, 2016 at 03:57:01PM -0300, Alvaro Herrera wrote:
> Masahiko Sawada wrote:
> > On Wed, Feb 17, 2016 at 12:02 AM, Bruce Momjian <bruce@momjian.us> wrote:
> > > On Tue, Feb 16, 2016 at 11:56:25PM +0900, Masahiko Sawada wrote:
> > >> > I agreed on ripping out the converter plugin ability of pg_upgrade.
> > >> > Remember pg_upgrade was originally written by EnterpriseDB staff, and I
> > >> > think they expected their closed-source fork of Postgres might need a
> > >> > custom page converter someday, but it never needed one, and at this
> > >> > point I think having the code in there is just making things more
> > >> > complex.  I see _no_ reason for community Postgres to use a plugin
> > >> > converter because we are going to need that code for every upgrade from
> > >> > pre-9.6 to 9.6+, so why not just hard-code in the functions we need.  We
> > >> > can remove it once 9.5 is end-of-life.
> > >> >
> > >>
> > >> Hm, we should rather remove the source code around PAGE_CONVERSION and
> > >> page.c at 9.6?
> > >
> > > Yes.  I can do it if you wish.
> > 
> > I see. I understand that page-converter code would be useful for some
> > future cases, but it makes things more complex.
> 
> If we're not going to use it, let's get rid of it right away.  There's
> no point in having a feature that adds complexity just because we might
> find some hypothetical use of it in a not-yet-imagined future.

Agreed.  We can always add it later if we need it.

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

+ As you are, so once was I. As I am, so you will be. +
+ Roman grave inscription                             +



Re: Freeze avoidance of very large table.

From
Masahiko Sawada
Date:
On Wed, Feb 17, 2016 at 4:08 AM, Bruce Momjian <bruce@momjian.us> wrote:
> On Tue, Feb 16, 2016 at 03:57:01PM -0300, Alvaro Herrera wrote:
>> Masahiko Sawada wrote:
>> > On Wed, Feb 17, 2016 at 12:02 AM, Bruce Momjian <bruce@momjian.us> wrote:
>> > > On Tue, Feb 16, 2016 at 11:56:25PM +0900, Masahiko Sawada wrote:
>> > >> > I agreed on ripping out the converter plugin ability of pg_upgrade.
>> > >> > Remember pg_upgrade was originally written by EnterpriseDB staff, and I
>> > >> > think they expected their closed-source fork of Postgres might need a
>> > >> > custom page converter someday, but it never needed one, and at this
>> > >> > point I think having the code in there is just making things more
>> > >> > complex.  I see _no_ reason for community Postgres to use a plugin
>> > >> > converter because we are going to need that code for every upgrade from
>> > >> > pre-9.6 to 9.6+, so why not just hard-code in the functions we need.  We
>> > >> > can remove it once 9.5 is end-of-life.
>> > >> >
>> > >>
>> > >> Hm, we should rather remove the source code around PAGE_CONVERSION and
>> > >> page.c at 9.6?
>> > >
>> > > Yes.  I can do it if you wish.
>> >
>> > I see. I understand that page-converter code would be useful for some
>> > future cases, but it makes things more complex.
>>
>> If we're not going to use it, let's get rid of it right away.  There's
>> no point in having a feature that adds complexity just because we might
>> find some hypothetical use of it in a not-yet-imagined future.
>
> Agreed.  We can always add it later if we need it.
>

Attached patch gets rid of page conversion code.

Regards,

--
Masahiko Sawada

Attachment

Re: Freeze avoidance of very large table.

From
Masahiko Sawada
Date:
On Wed, Feb 17, 2016 at 4:29 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> On Wed, Feb 17, 2016 at 4:08 AM, Bruce Momjian <bruce@momjian.us> wrote:
>> On Tue, Feb 16, 2016 at 03:57:01PM -0300, Alvaro Herrera wrote:
>>> Masahiko Sawada wrote:
>>> > On Wed, Feb 17, 2016 at 12:02 AM, Bruce Momjian <bruce@momjian.us> wrote:
>>> > > On Tue, Feb 16, 2016 at 11:56:25PM +0900, Masahiko Sawada wrote:
>>> > >> > I agreed on ripping out the converter plugin ability of pg_upgrade.
>>> > >> > Remember pg_upgrade was originally written by EnterpriseDB staff, and I
>>> > >> > think they expected their closed-source fork of Postgres might need a
>>> > >> > custom page converter someday, but it never needed one, and at this
>>> > >> > point I think having the code in there is just making things more
>>> > >> > complex.  I see _no_ reason for community Postgres to use a plugin
>>> > >> > converter because we are going to need that code for every upgrade from
>>> > >> > pre-9.6 to 9.6+, so why not just hard-code in the functions we need.  We
>>> > >> > can remove it once 9.5 is end-of-life.
>>> > >> >
>>> > >>
>>> > >> Hm, we should rather remove the source code around PAGE_CONVERSION and
>>> > >> page.c at 9.6?
>>> > >
>>> > > Yes.  I can do it if you wish.
>>> >
>>> > I see. I understand that page-converter code would be useful for some
>>> > future cases, but it makes things more complex.
>>>
>>> If we're not going to use it, let's get rid of it right away.  There's
>>> no point in having a feature that adds complexity just because we might
>>> find some hypothetical use of it in a not-yet-imagined future.
>>
>> Agreed.  We can always add it later if we need it.
>>
>
> Attached patch gets rid of page conversion code.
>

Sorry, the previous patch was incorrect.
A fixed version is attached.

Regards,

--
Masahiko Sawada

Attachment

Re: Freeze avoidance of very large table.

From
Masahiko Sawada
Date:
On Wed, Feb 17, 2016 at 4:44 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> On Wed, Feb 17, 2016 at 4:29 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>> On Wed, Feb 17, 2016 at 4:08 AM, Bruce Momjian <bruce@momjian.us> wrote:
>>> On Tue, Feb 16, 2016 at 03:57:01PM -0300, Alvaro Herrera wrote:
>>>> Masahiko Sawada wrote:
>>>> > On Wed, Feb 17, 2016 at 12:02 AM, Bruce Momjian <bruce@momjian.us> wrote:
>>>> > > On Tue, Feb 16, 2016 at 11:56:25PM +0900, Masahiko Sawada wrote:
>>>> > >> > I agreed on ripping out the converter plugin ability of pg_upgrade.
>>>> > >> > Remember pg_upgrade was originally written by EnterpriseDB staff, and I
>>>> > >> > think they expected their closed-source fork of Postgres might need a
>>>> > >> > custom page converter someday, but it never needed one, and at this
>>>> > >> > point I think having the code in there is just making things more
>>>> > >> > complex.  I see _no_ reason for community Postgres to use a plugin
>>>> > >> > converter because we are going to need that code for every upgrade from
>>>> > >> > pre-9.6 to 9.6+, so why not just hard-code in the functions we need.  We
>>>> > >> > can remove it once 9.5 is end-of-life.
>>>> > >> >
>>>> > >>
>>>> > >> Hm, we should rather remove the source code around PAGE_CONVERSION and
>>>> > >> page.c at 9.6?
>>>> > >
>>>> > > Yes.  I can do it if you wish.
>>>> >
>>>> > I see. I understand that page-converter code would be useful for some
>>>> > future cases, but it makes things more complex.
>>>>
>>>> If we're not going to use it, let's get rid of it right away.  There's
>>>> no point in having a feature that adds complexity just because we might
>>>> find some hypothetical use of it in a not-yet-imagined future.
>>>
>>> Agreed.  We can always add it later if we need it.
>>>
>>
>> Attached patch gets rid of page conversion code.
>>
>

Attached updated 5 patches.
I would like to briefly explain these patches again here to make
reviewing easier.

We can divide these patches into two groups by purpose.

1. Freeze map
000_ patch adds the additional frozen bit to the visibility map, but doesn't
include the logic for improving freezing performance.
001_ patch gets rid of the page-conversion code in pg_upgrade. (This
patch isn't essentially related to this feature, but is required by the
002_ patch.)
002_ patch adds an upgrade mechanism from pre-9.6 to 9.6+ and its regression test.

2. Improve freezing logic
003_ patch changes VACUUM to optimize scans based on the freeze map
(i.e., the 000_ patch), and adds its regression test.
004_ patch enhances the debug messages in src/backend/access/heap/visibilitymap.c

Please review them.

Regards,

--
Masahiko Sawada

Attachment

Re: Freeze avoidance of very large table.

From
Robert Haas
Date:
On Thu, Feb 18, 2016 at 3:45 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> Attached updated 5 patches.
> I would like to briefly explain these patches again here to make
> reviewing easier.
>
> We can divide these patches into two groups by purpose.
>
> 1. Freeze map
> 000_ patch adds the additional frozen bit to the visibility map, but doesn't
> include the logic for improving freezing performance.
> 001_ patch gets rid of the page-conversion code in pg_upgrade. (This
> patch isn't essentially related to this feature, but is required by the
> 002_ patch.)
> 002_ patch adds an upgrade mechanism from pre-9.6 to 9.6+ and its regression test.
>
> 2. Improve freezing logic
> 003_ patch changes VACUUM to optimize scans based on the freeze map
> (i.e., the 000_ patch), and adds its regression test.
> 004_ patch enhances the debug messages in src/backend/access/heap/visibilitymap.c
>
> Please review them.

I have pushed 000 and part of 003, with substantial revisions to the
003 part and minor revisions to the 000 part.  This gets the basic
infrastructure in place, but the vacuum optimization and pg_upgrade
fixes still need to be done.

I discovered that make check-world failed with 000 applied, because
the Assert()s added to visibilitymap_set were using | rather than & to
test for a set bit.  I fixed that.

I revised the code in vacuumlazy.c that updates the new map bits
rather heavily.  I hope I didn't break anything; please have a look
and see if you spot any problems.  One big problem was that it's
inadequate to judge whether a tuple needs freezing just by looking at
xmin; xmax might need to be cleared, for example.
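A cut-down sketch of that distinction (the function name here is
hypothetical; the real heap_tuple_needs_eventual_freeze also has to
consider multixact members and the old MOVED_IN/MOVED_OFF bits):

    static bool
    tuple_needs_eventual_freeze_sketch(HeapTupleHeader tuple)
    {
        TransactionId xid;

        /* an unfrozen, normal xmin will need freezing eventually */
        if (!HeapTupleHeaderXminFrozen(tuple))
        {
            xid = HeapTupleHeaderGetXmin(tuple);
            if (TransactionIdIsNormal(xid))
                return true;
        }

        /* a multixact, or any normal xmax, will have to be cleared eventually */
        if (tuple->t_infomask & HEAP_XMAX_IS_MULTI)
            return true;

        xid = HeapTupleHeaderGetRawXmax(tuple);
        if (TransactionIdIsNormal(xid))
            return true;

        return false;
    }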

I removed the pgstat stuff.  I'm not sure we want that stuff in that
form; it doesn't seem to fit with the rest of what's in that view, and
it wasn't reliable in my testing.  I did however throw together a
little contrib module for testing, which I attach here.  I'm not sure
we want to commit this, and at the least someone would need to write
documentation.  But it's certainly handy for checking whether this
works.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment

Re: Freeze avoidance of very large table.

From
Kyotaro HORIGUCHI
Date:
Thank you for revising and committing this.

At Tue, 1 Mar 2016 21:51:55 -0500, Robert Haas <robertmhaas@gmail.com> wrote in
<CA+TgmoZtG7hnkgP74zRCeuRrGGG917J5-_P4dzNJz5_kAXFTKg@mail.gmail.com>
> On Thu, Feb 18, 2016 at 3:45 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > Attached updated 5 patches.
> > I would like to briefly explain these patches again here to make
> > reviewing easier.
> >
> > We can divide these patches into two groups by purpose.
> >
> > 1. Freeze map
> > 000_ patch adds the additional frozen bit to the visibility map, but doesn't
> > include the logic for improving freezing performance.
> > 001_ patch gets rid of the page-conversion code in pg_upgrade. (This
> > patch isn't essentially related to this feature, but is required by the
> > 002_ patch.)
> > 002_ patch adds an upgrade mechanism from pre-9.6 to 9.6+ and its regression test.
> >
> > 2. Improve freezing logic
> > 003_ patch changes VACUUM to optimize scans based on the freeze map
> > (i.e., the 000_ patch), and adds its regression test.
> > 004_ patch enhances the debug messages in src/backend/access/heap/visibilitymap.c
> >
> > Please review them.
> 
> I have pushed 000 and part of 003, with substantial revisions to the
> 003 part and minor revisions to the 000 part.  This gets the basic
> infrastructure in place, but the vacuum optimization and pg_upgrade
> fixes still need to be done.
> 
> I discovered that make check-world failed with 000 applied, because
> the Assert()s added to visibilitymap_set were using | rather than & to
> test for a set bit.  I fixed that.

It looks reasonable as far as I can see.  Thank you for your
work on the additional part.

> I revised the code in vacuumlazy.c that updates the new map bits
> rather heavily.  I hope I didn't break anything; please have a look
> and see if you spot any problems.  One big problem was that it's
> inadequate to judge whether a tuple needs freezing just by looking at
> xmin; xmax might need to be cleared, for example.

The new function heap_tuple_needs_eventual_freeze looks
reasonable to me in comparison with heap_tuple_needs_freeze.

Looking at the additional diff for lazy_vacuum_page, I noticed that
visibilitymap_set has a potential performance problem. (Though
it doesn't seem to occur for now.)

visibilitymap_set decides whether to modify the vm bits with the
following code.

|   if (flags != (map[mapByte] >> mapBit & VISIBILITYMAP_VALID_BITS))
|   {
|     START_CRIT_SECTION();
| 
|     map[mapByte] |= (flags << mapBit);

This is effectively correct, but it enters the critical section even
for the case (vmbits = 11, flags = 01), which does not need it.
Please apply the following if it looks reasonable.

======
diff --git a/src/backend/access/heap/visibilitymap.c b/src/backend/access/heap/visibilitymap.c
index 2e64fc3..87b7fc6 100644
--- a/src/backend/access/heap/visibilitymap.c
+++ b/src/backend/access/heap/visibilitymap.c
@@ -292,7 +292,8 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
        map = (uint8 *) PageGetContents(page);
        LockBuffer(vmBuf, BUFFER_LOCK_EXCLUSIVE);
 
-       if (flags != (map[mapByte] >> mapBit & VISIBILITYMAP_VALID_BITS))
+       /* modify vm bits only if some requested bit is not yet set */
+       if (flags & ~(map[mapByte] >> mapBit & VISIBILITYMAP_VALID_BITS))
        {
                START_CRIT_SECTION();
======

> I removed the pgstat stuff.  I'm not sure we want that stuff in that
> form; it doesn't seem to fit with the rest of what's in that view, and
> it wasn't reliable in my testing.  I did however throw together a
> little contrib module for testing, which I attach here.  I'm not sure
> we want to commit this, and at the least someone would need to write
> documentation.  But it's certainly handy for checking whether this
> works.

I haven't considered the reliability, but the
n_frozen_pages column in the proposed patch surely seems alien to the
columns around it.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center





Re: Freeze avoidance of very large table.

From
Peter Geoghegan
Date:
On Tue, Mar 1, 2016 at 6:51 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> I removed the pgstat stuff.  I'm not sure we want that stuff in that
> form; it doesn't seem to fit with the rest of what's in that view, and
> it wasn't reliable in my testing.  I did however throw together a
> little contrib module for testing, which I attach here.  I'm not sure
> we want to commit this, and at the least someone would need to write
> documentation.  But it's certainly handy for checking whether this
> works.

I think you should commit this. The chances of anyone other than you
and Masahiko recalling that you developed this tool in 3 years is
essentially nil. I think that the cost of committing a developer-level
debugging tool like this is very low. Modules like pg_freespacemap
currently already have no chance of being of use to ordinary users.
All you need to do is restrict the functions to throw an error when
called by non-superusers, out of caution.

It's a problem that modules like pg_stat_statements and
pg_freespacemap are currently lumped together in the documentation,
but we all know that.

-- 
Peter Geoghegan



Re: Freeze avoidance of very large table.

From
Jim Nasby
Date:
On 3/2/16 4:21 PM, Peter Geoghegan wrote:
> I think you should commit this. The chances of anyone other than you
> and Masahiko recalling that you developed this tool in 3 years is
> essentially nil. I think that the cost of committing a developer-level
> debugging tool like this is very low. Modules like pg_freespacemap
> currently already have no chance of being of use to ordinary users.
> All you need to do is restrict the functions to throw an error when
> called by non-superusers, out of caution.
>
> It's a problem that modules like pg_stat_statements and
> pg_freespacemap are currently lumped together in the documentation,
> but we all know that.

+1.
-- 
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com



Re: Freeze avoidance of very large table.

From
Tom Lane
Date:
Jim Nasby <Jim.Nasby@BlueTreble.com> writes:
> On 3/2/16 4:21 PM, Peter Geoghegan wrote:
>> I think you should commit this. The chances of anyone other than you
>> and Masahiko recalling that you developed this tool in 3 years is
>> essentially nil. I think that the cost of committing a developer-level
>> debugging tool like this is very low. Modules like pg_freespacemap
>> currently already have no chance of being of use to ordinary users.
>> All you need to do is restrict the functions to throw an error when
>> called by non-superusers, out of caution.
>> 
>> It's a problem that modules like pg_stat_statements and
>> pg_freespacemap are currently lumped together in the documentation,
>> but we all know that.

> +1.

Would it make any sense to stick it under src/test/modules/ instead of
contrib/ ?  That would help make it clear that it's a debugging tool
and not something we expect end users to use.
        regards, tom lane



Re: Freeze avoidance of very large table.

From
Jim Nasby
Date:
On 3/2/16 5:41 PM, Tom Lane wrote:
> Jim Nasby <Jim.Nasby@BlueTreble.com> writes:
>> On 3/2/16 4:21 PM, Peter Geoghegan wrote:
>>> I think you should commit this. The chances of anyone other than you
>>> and Masahiko recalling that you developed this tool in 3 years is
>>> essentially nil. I think that the cost of committing a developer-level
>>> debugging tool like this is very low. Modules like pg_freespacemap
>>> currently already have no chance of being of use to ordinary users.
>>> All you need to do is restrict the functions to throw an error when
>>> called by non-superusers, out of caution.
>>>
>>> It's a problem that modules like pg_stat_statements and
>>> pg_freespacemap are currently lumped together in the documentation,
>>> but we all know that.
>
>> +1.
>
> Would it make any sense to stick it under src/test/modules/ instead of
> contrib/ ?  That would help make it clear that it's a debugging tool
> and not something we expect end users to use.

I haven't looked at it in detail; is there something inherently 
dangerous about it?

When I'm forced to wear a DBA hat, I'd really love to be able to find 
out what the VM status for a large table is. If it's in contrib they'll know
the tool is there; if it's under src then there's about 0 chance of 
that. I'd think SU-only and any appropriate warnings would be enough 
heads-up for DBAs to be careful with it.
-- 
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com



Re: Freeze avoidance of very large table.

From
Kyotaro HORIGUCHI
Date:
At Wed, 2 Mar 2016 17:57:27 -0600, Jim Nasby <Jim.Nasby@BlueTreble.com> wrote in <56D77DE7.7080309@BlueTreble.com>
> On 3/2/16 5:41 PM, Tom Lane wrote:
> > Jim Nasby <Jim.Nasby@BlueTreble.com> writes:
> >> On 3/2/16 4:21 PM, Peter Geoghegan wrote:
> >>> I think you should commit this. The chances of anyone other than you
> >>> and Masahiko recalling that you developed this tool in 3 years is
> >>> essentially nil. I think that the cost of committing a developer-level
> >>> debugging tool like this is very low. Modules like pg_freespacemap
> >>> currently already have no chance of being of use to ordinary users.
> >>> All you need to do is restrict the functions to throw an error when
> >>> called by non-superusers, out of caution.
> >>>
> >>> It's a problem that modules like pg_stat_statements and
> >>> pg_freespacemap are currently lumped together in the documentation,
> >>> but we all know that.
> >
> >> +1.
> >
> > Would it make any sense to stick it under src/test/modules/ instead of
> > contrib/ ?  That would help make it clear that it's a debugging tool
> > and not something we expect end users to use.
> 
> I haven't looked at it in detail; is there something inherently
> dangerous about it?

I don't see any danger but the interface doesn't seem to fit use
by DBAs.

> When I'm forced to wear a DBA hat, I'd really love to be able to find
> out what the VM status for a large table is. If it's in contrib they'll
> know the tool is there; if it's under src then there's about 0 chance
> of that. I'd think SU-only and any appropriate warnings would be
> enough heads-up for DBAs to be careful with it.

It doesn't look like it exposes anything about table contents. At least,
anybody who can see the name of a table could safely be allowed to
use this on it.

A possible usage (for me) of this would be directly counting
(un)vacuumed or (un)frozen pages in a relation. It would be
convenient if the 'frozen' and 'vacuumed' counts were reported as
separate integers. It would even be useful if stats values for these
bits were shown in the statistics views.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center





Re: Freeze avoidance of very large table.

From
Robert Haas
Date:
On Wed, Mar 2, 2016 at 6:41 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Jim Nasby <Jim.Nasby@BlueTreble.com> writes:
>> On 3/2/16 4:21 PM, Peter Geoghegan wrote:
>>> I think you should commit this. The chances of anyone other than you
>>> and Masahiko recalling that you developed this tool in 3 years is
>>> essentially nil. I think that the cost of committing a developer-level
>>> debugging tool like this is very low. Modules like pg_freespacemap
>>> currently already have no chance of being of use to ordinary users.
>>> All you need to do is restrict the functions to throw an error when
>>> called by non-superusers, out of caution.
>>>
>>> It's a problem that modules like pg_stat_statements and
>>> pg_freespacemap are currently lumped together in the documentation,
>>> but we all know that.
>
>> +1.
>
> Would it make any sense to stick it under src/test/modules/ instead of
> contrib/ ?  That would help make it clear that it's a debugging tool
> and not something we expect end users to use.

I actually think end-users might well want to use it.  Also, I created
it by hacking up pg_freespacemap, so it may make sense to have it in
the same place.

I would also be tempted to add additional C functions that scan the
entire visibility map and return counts of the total number of bits of
each type that are set, and similarly for the page-level bits.
Presumably that would be much faster than fetching the bits one page
at a time from SQL.
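Such a counting function could be a simple one-pass scan on the server
side, along the lines of this sketch (variable setup trimmed to the
essentials):

    Buffer      vmbuffer = InvalidBuffer;
    BlockNumber nblocks = RelationGetNumberOfBlocks(rel);
    BlockNumber blkno;
    int64       all_visible = 0;
    int64       all_frozen = 0;

    /* count set bits of each kind across the whole visibility map */
    for (blkno = 0; blkno < nblocks; blkno++)
    {
        uint8       mapbits = visibilitymap_get_status(rel, blkno, &vmbuffer);

        if (mapbits & VISIBILITYMAP_ALL_VISIBLE)
            all_visible++;
        if (mapbits & VISIBILITYMAP_ALL_FROZEN)
            all_frozen++;
    }
    if (BufferIsValid(vmbuffer))
        ReleaseBuffer(vmbuffer);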

I am also tempted to change the API to be a bit more friendly,
although I am not sure exactly how.  This was a quick and dirty hack
so that I could test, but the hardest thing about making it not a
quick and dirty hack is probably deciding on a good UI.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Freeze avoidance of very large table.

From
Masahiko Sawada
Date:
On Sat, Mar 5, 2016 at 1:25 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Wed, Mar 2, 2016 at 6:41 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Jim Nasby <Jim.Nasby@BlueTreble.com> writes:
>>> On 3/2/16 4:21 PM, Peter Geoghegan wrote:
>>>> I think you should commit this. The chances of anyone other than you
>>>> and Masahiko recalling that you developed this tool in 3 years is
>>>> essentially nil. I think that the cost of committing a developer-level
>>>> debugging tool like this is very low. Modules like pg_freespacemap
>>>> currently already have no chance of being of use to ordinary users.
>>>> All you need to do is restrict the functions to throw an error when
>>>> called by non-superusers, out of caution.
>>>>
>>>> It's a problem that modules like pg_stat_statements and
>>>> pg_freespacemap are currently lumped together in the documentation,
>>>> but we all know that.
>>
>>> +1.
>>
>> Would it make any sense to stick it under src/test/modules/ instead of
>> contrib/ ?  That would help make it clear that it's a debugging tool
>> and not something we expect end users to use.
>
> I actually think end-users might well want to use it.  Also, I created
> it by hacking up pg_freespacemap, so it may make sense to have it in
> the same place.
> I would also be tempted to add additional C functions that scan the
> entire visibility map and return counts of the total number of bits of
> each type that are set, and similarly for the page-level bits.
> Presumably that would be much faster than fetching the bits one page
> at a time from SQL.

+1.

>
> I am also tempted to change the API to be a bit more friendly,
> although I am not sure exactly how.  This was a quick and dirty hack
> so that I could test, but the hardest thing about making it not a
> quick and dirty hack is probably deciding on a good UI.
>

Do you mean the visibility map API in visibilitymap.c?

Regards,

--
Masahiko Sawada



Re: Freeze avoidance of very large table.

From
Masahiko Sawada
Date:
On Sat, Mar 5, 2016 at 11:25 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> On Sat, Mar 5, 2016 at 1:25 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Wed, Mar 2, 2016 at 6:41 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> Jim Nasby <Jim.Nasby@BlueTreble.com> writes:
>>>> On 3/2/16 4:21 PM, Peter Geoghegan wrote:
>>>>> I think you should commit this. The chances of anyone other than you
>>>>> and Masahiko recalling that you developed this tool in 3 years is
>>>>> essentially nil. I think that the cost of committing a developer-level
>>>>> debugging tool like this is very low. Modules like pg_freespacemap
>>>>> currently already have no chance of being of use to ordinary users.
>>>>> All you need to do is restrict the functions to throw an error when
>>>>> called by non-superusers, out of caution.
>>>>>
>>>>> It's a problem that modules like pg_stat_statements and
>>>>> pg_freespacemap are currently lumped together in the documentation,
>>>>> but we all know that.
>>>
>>>> +1.
>>>
>>> Would it make any sense to stick it under src/test/modules/ instead of
>>> contrib/ ?  That would help make it clear that it's a debugging tool
>>> and not something we expect end users to use.
>>
>> I actually think end-users might well want to use it.  Also, I created
>> it by hacking up pg_freespacemap, so it may make sense to have it in
>> the same place.
>> I would also be tempted to add additional C functions that scan the
>> entire visibility map and return counts of the total number of bits of
>> each type that are set, and similarly for the page-level bits.
>> Presumably that would be much faster than fetching the bits one page
>> at a time from SQL.
>
> +1.
>
>>
>> I am also tempted to change the API to be a bit more friendly,
>> although I am not sure exactly how.  This was a quick and dirty hack
>> so that I could test, but the hardest thing about making it not a
>> quick and dirty hack is probably deciding on a good UI.
>>
>
> Do you mean the visibility map API in visibilitymap.c?
>

Attached is the latest version of the optimisation patch.
I'm still considering the pg_upgrade regression test code, so I
will submit that patch later.

Regards,

--
Masahiko Sawada

Attachment

Re: Freeze avoidance of very large table.

From
Robert Haas
Date:
On Mon, Mar 7, 2016 at 12:41 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> Attached is the latest version of the optimisation patch.
> I'm still considering the pg_upgrade regression test code, so I
> will submit that patch later.

I was thinking more about this today and I think that we don't
actually need the PD_ALL_FROZEN page-level bit for anything.  It's
enough that the bit is present in the visibility map.  The only point
of PD_ALL_VISIBLE is that it tells us that we need to clear the
visibility map bit, but that bit is enough to tell us to clear both
visibility map bits.  So I propose the attached cleanup patch.
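In code terms, the point is roughly this sketch (assuming the current
visibilitymap_clear() signature, which clears all of a page's VM bits
at once):

    /* on page modification, PD_ALL_VISIBLE alone tells us what to do */
    if (PageIsAllVisible(page))
    {
        PageClearAllVisible(page);      /* no separate PD_ALL_FROZEN to clear */
        visibilitymap_clear(relation, blkno, vmbuffer);   /* clears both bits */
    }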

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment

Re: Freeze avoidance of very large table.

From
Robert Haas
Date:
On Sat, Mar 5, 2016 at 9:25 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>> I actually think end-users might well want to use it.  Also, I created
>> it by hacking up pg_freespacemap, so it may make sense to have it in
>> the same place.
>> I would also be tempted to add additional C functions that scan the
>> entire visibility map and return counts of the total number of bits of
>> each type that are set, and similarly for the page-level bits.
>> Presumably that would be much faster than fetching the bits one page
>> at a time from SQL.
>
> +1.
>
>> I am also tempted to change the API to be a bit more friendly,
>> although I am not sure exactly how.  This was a quick and dirty hack
>> so that I could test, but the hardest thing about making it not a
>> quick and dirty hack is probably deciding on a good UI.
>
> Do you mean the visibility map API in visibilitymap.c?

Here's an updated patch with an API that I think is much more
reasonable to expose to users, and documentation!  It assumes that the
patch I posted a few hours ago to remove PD_ALL_FROZEN will be
accepted; if that falls apart for some reason, I'll update this.  I
plan to push this RSN if nobody objects.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment

Re: Freeze avoidance of very large table.

From
Peter Geoghegan
Date:
On Mon, Mar 7, 2016 at 4:50 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> Here's an updated patch with an API that I think is much more
> reasonable to expose to users, and documentation!  It assumes that the
> patch I posted a few hours ago to remove PD_ALL_FROZEN will be
> accepted; if that falls apart for some reason, I'll update this.  I
> plan to push this RSN if nobody objects.

Thanks for making the effort to make the tool generally available.

-- 
Peter Geoghegan



Re: Freeze avoidance of very large table.

From
Kyotaro HORIGUCHI
Date:
Hello, thank you for updating this tool.

At Mon, 7 Mar 2016 14:03:08 -0500, Robert Haas <robertmhaas@gmail.com> wrote in
<CA+Tgmob+NjfYE3b3BHBmAC=3tvTbqsZgZ1RoJ63yRAmRgrQOcA@mail.gmail.com>
> On Mon, Mar 7, 2016 at 12:41 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > Attached is the latest version of the optimisation patch.
> > I'm still considering the pg_upgrade regression test code, so I
> > will submit that patch later.
> 
> I was thinking more about this today and I think that we don't
> actually need the PD_ALL_FROZEN page-level bit for anything.  It's
> enough that the bit is present in the visibility map.  The only point
> of PD_ALL_VISIBLE is that it tells us that we need to clear the
> visibility map bit, but that bit is enough to tell us to clear both
> visibility map bits.  So I propose the attached cleanup patch.

It seems reasonable to me.  I haven't played with it (it didn't
even apply for me just now), but at a glance, it seems
PD_VALID_FLAG_BITS should be changed to 0x0007 since
PD_ALL_FROZEN has been removed.
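For reference, the remaining page-flag definitions in bufpage.h would
then presumably read:

    #define PD_HAS_FREE_LINES   0x0001  /* are there any unused line pointers? */
    #define PD_PAGE_FULL        0x0002  /* not enough free space for a new tuple? */
    #define PD_ALL_VISIBLE      0x0004  /* all tuples on page visible to everyone */

    #define PD_VALID_FLAG_BITS  0x0007  /* OR of all valid pd_flags bits */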

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center





Re: Freeze avoidance of very large table.

From
Masahiko Sawada
Date:
On Tue, Mar 8, 2016 at 1:20 PM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> Hello, thank you for updating this tool.
>
> At Mon, 7 Mar 2016 14:03:08 -0500, Robert Haas <robertmhaas@gmail.com> wrote in
<CA+Tgmob+NjfYE3b3BHBmAC=3tvTbqsZgZ1RoJ63yRAmRgrQOcA@mail.gmail.com>
>> On Mon, Mar 7, 2016 at 12:41 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>> > Attached is the latest version of the optimisation patch.
>> > I'm still considering the pg_upgrade regression test code, so I
>> > will submit that patch later.
>>
> I was thinking more about this today and I think that we don't
> actually need the PD_ALL_FROZEN page-level bit for anything.  It's
> enough that the bit is present in the visibility map.  The only point
> of PD_ALL_VISIBLE is that it tells us that we need to clear the
> visibility map bit, but that bit is enough to tell us to clear both
> visibility map bits.  So I propose the attached cleanup patch.

Thank you for updating the tool and proposing it.
I agree with you, and the patch you attached looks good to me except
for Horiguchi-san's comment.

Regarding the pg_visibility module, I'd like to share some bugs and
propose adding a relation-type condition to each function.
Including that, I've attached the remaining 2 patches; one removes the page
conversion code from pg_upgrade, and the other adds pg_upgrade support
for the frozen bit.

Please have a look at them.

Regards,

--
Masahiko Sawada

Attachment

Re: Freeze avoidance of very large table.

From
Robert Haas
Date:
On Tue, Mar 8, 2016 at 7:26 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> Regarding the pg_visibility module, I'd like to share some bugs and
> propose adding a relation-type condition to each function.

OK, thanks.

> Including that, I've attached the remaining 2 patches; one removes the page
> conversion code from pg_upgrade, and the other adds pg_upgrade support
> for the frozen bit.

Committed 001 with minor tweaks.

I find rewrite_vm_table to be pretty opaque.  There's not even a
comment explaining what it is supposed to do.  And I wonder why we
really need to be this efficient about it anyway.  Like, would it be
too expensive to just do this:

for (i = 0; i < BITS_PER_BYTE; ++i)
    if ((old & (1 << i)) != 0)
        new |= 1 << (2 * i);

And how about adding some more comments explaining why we are doing
this rewriting, like this:

In versions of PostgreSQL prior to catversion 201602181, PostgreSQL's
visibility map included one bit per heap page; it now includes two.
When upgrading a cluster from before that time to a current PostgreSQL
version, we could refuse to copy visibility maps from the old cluster
to the new cluster; the next VACUUM would recreate them, but at the
price of scanning the entire table.  So, instead, we rewrite the old
visibility maps in the new format.  That way, the all-visible bit
remains set for the pages for which it was set previously.  The
all-frozen bit is never set by this conversion; we leave that to
VACUUM.

Also, I'm slightly perplexed by the fact that I can't see how this
code succeeds in turning each page into two pages, which is something
that it seems like it would need to do.  Wouldn't we need to write out
the old page header twice, once for the first of the two new pages and
again for the second?  I probably need more caffeine here, so please
tell me what I'm missing.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Freeze avoidance of very large table.

From
Robert Haas
Date:
On Tue, Mar 8, 2016 at 8:30 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Tue, Mar 8, 2016 at 7:26 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>> Regarding the pg_visibility module, I'd like to share some bugs and
>> propose adding a relation-type condition to each function.
>
> OK, thanks.

I left out the relkind check from the final commit because, for one
thing, the check you added isn't actually right: toast relations can
also have a visibility map.  And also, I'm sort of wondering what the
point of that check is.  What does it protect us from?  It doesn't
seem very future-proof ... what if we add a new relkind in the future?
Do we really want to have to update this?

How about instead changing things so that we specifically reject
indexes?  And maybe some kind of a check that will reject anything
that lacks a relfilenode?  That seems like it would be more on point.
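
For instance, a rough sketch of that kind of check near the top of each
pg_visibility function might look like this (the error wording is made up,
and note that testing pg_class.relfilenode alone would also trip over
mapped catalogs, which the real check would need to account for):

    /* Sketch only: reject indexes, and anything that has no relfilenode. */
    if (rel->rd_rel->relkind == RELKIND_INDEX)
        ereport(ERROR,
                (errcode(ERRCODE_WRONG_OBJECT_TYPE),
                 errmsg("\"%s\" is an index",
                        RelationGetRelationName(rel))));

    if (!OidIsValid(rel->rd_rel->relfilenode))
        ereport(ERROR,
                (errcode(ERRCODE_WRONG_OBJECT_TYPE),
                 errmsg("relation \"%s\" has no relfilenode",
                        RelationGetRelationName(rel))));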

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Freeze avoidance of very large table.

From
Jeff Janes
Date:
On Tue, Mar 8, 2016 at 5:30 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Tue, Mar 8, 2016 at 7:26 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>> Regarding the pg_visibility module, I'd like to report some bugs and
>> propose adding a relation type condition to each function.
>
> OK, thanks.
>
>> Including those, I've attached the remaining 2 patches; one removes the
>> page conversion code from pg_upgrade, and the other adds pg_upgrade
>> support for the frozen bit.
>
> Committed 001 with minor tweaks.
>
> I find rewrite_vm_table to be pretty opaque.  There's not even a
> comment explaining what it is supposed to do.  And I wonder why we
> really need to be this efficient about it anyway.  Like, would it be
> too expensive to just do this:
>
> for (i = 0; i < BITS_PER_BYTE; ++i)
>     if ((old & (1 << i)) != 0)
>         new |= 1 << (2 * i);
>
> And how about adding some more comments explaining why we are doing
> this rewriting, like this:
>
> In versions of PostgreSQL prior to catversion 201602181, PostgreSQL's
> visibility map included one bit per heap page; it now includes two.
> When upgrading a cluster from before that time to a current PostgreSQL
> version, we could refuse to copy visibility maps from the old cluster
> to the new cluster; the next VACUUM would recreate them, but at the
> price of scanning the entire table.  So, instead, we rewrite the old
> visibility maps in the new format.  That way, the all-visible bit
> remains set for the pages for which it was set previously.  The
> all-frozen bit is never set by this conversion; we leave that to
> VACUUM.
>
> Also, I'm slightly perplexed by the fact that I can't see how this
> code succeeds in turning each page into two pages, which is something
> that it seems like it would need to do.  Wouldn't we need to write out
> the old page header twice, one for the first of the two new pages and
> again for the second?  I probably need more caffeine here, so please
> tell me what I'm missing.

I think that this loop:

    while (blkend >= end)

executes exactly twice for each iteration of the outer loop.  I'd
rather see it written as a loop which explicitly executes twice,
rather than looking like it might execute a dynamic number of times.
I can't imagine that this code needs to be future-proof.  If we change
the format again in the future, surely we can't just change this code;
we would have to write new code for the new format.
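
For instance, the same work could be expressed with an explicit
two-iteration loop, along these lines (a hypothetical sketch reusing the
convert_vm_byte helper and headers from the earlier example; none of these
names are from the patch):

    /*
     * Sketch: each old-format VM page yields exactly two new-format pages,
     * so say so explicitly.  oldmap is the data area of one old page
     * (oldlen bytes); newmaps[0] and newmaps[1] are the two new data areas.
     */
    static void
    rewrite_one_old_page(const uint8_t *oldmap, size_t oldlen,
                         uint8_t *newmaps[2])
    {
        int         half;

        for (half = 0; half < 2; half++)    /* exactly two new pages */
        {
            const uint8_t *src = oldmap + half * (oldlen / 2);
            size_t      i;

            for (i = 0; i < oldlen / 2; i++)
            {
                uint8_t     lo, hi;

                convert_vm_byte(src[i], &lo, &hi);
                newmaps[half][2 * i] = lo;
                newmaps[half][2 * i + 1] = hi;
            }
        }
    }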

Cheers,

Jeff



Re: Freeze avoidance of very large table.

From
Robert Haas
Date:
On Mon, Mar 7, 2016 at 12:41 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> Attached is the latest version of the optimisation patch.
> I'm still considering the pg_upgrade regression test code, so I
> will submit that patch later.

I just spent some time looking at this and I'm a bit worried about the
following (existing) comment in vacuumlazy.c:

     * Note: The value returned by visibilitymap_get_status could be slightly
     * out-of-date, since we make this test before reading the corresponding
     * heap page or locking the buffer.  This is OK.  If we mistakenly think
     * that the page is all-visible when in fact the flag's just been cleared,
     * we might fail to vacuum the page.  But it's OK to skip pages when
     * scan_all is not set, so no great harm done; the next vacuum will find
     * them.  If we make the reverse mistake and vacuum a page unnecessarily,
     * it'll just be a no-op.

The patch makes some attempt to update the comment mechanically, but
that's not nearly enough.  That comment is explaining that you *can't*
rely on the visibility map to tell you *for sure* that a page does not
require vacuuming.  For current uses, that's OK, because if we miss a
page we'll pick it up later.  But now that we want to skip vacuuming
pages for relfrozenxid/relminmxid advancement, that rationale doesn't
apply.
Missing pages that need to be frozen and advancing relfrozenxid anyway
would be _bad_.

However, after some further thought, I think we might actually be OK.
If a page goes from all-frozen to not-all-frozen while VACUUM is
running, any new XID added to the page must be newer than the
oldestXmin value computed by vacuum_set_xid_limits(), so it won't
affect the value to which we can safely set relfrozenxid.  Similarly,
any MXID added to the page will be newer than GetOldestMultiXactId(),
so setting relminmxid is still safe for similar reasons.
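
To spell out the ordering that argument relies on, here is an illustrative
fragment (not the actual vacuumlazy.c code; in reality the horizon comes
from vacuum_set_xid_limits(), and the names below are simplified):

    /* Compute the horizon first: any XID assigned afterwards follows it. */
    OldestXmin = GetOldestXmin(onerel, true);

    pg_read_barrier();          /* in practice, buffer locking provides this */

    /*
     * Only now consult the map.  A stale all-frozen bit can only hide
     * XIDs newer than OldestXmin, so advancing relfrozenxid based on the
     * map remains safe.
     */
    skip_page = VM_ALL_FROZEN(onerel, blkno, &vmbuffer);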

I'd appreciate it if any other senior hackers could review that chain
of reasoning.  It would be really bad to get this wrong.

On another note, I didn't really like the way you updated the
documentation.  "eager freezing" doesn't seem like a great term to me,
and I think your changes were a little too localized.  Here's a draft
alternative where I used the term "aggressive vacuum" to describe
freezing all of the pages except for those already known to be
all-frozen.  Thoughts?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment

Re: Freeze avoidance of very large table.

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> The patch makes some attempt to update the comment mechanically, but
> that's not nearly enough.  That comment is explaining that you *can't*
> rely on the visibility map to tell you *for sure* that a page does not
> require vacuuming.  For current uses, that's OK, because if we miss a
> page we'll pick it up later.  But now that we want to skip vacuuming
> pages for relfrozenxid/relminmxid advancement, that rationale doesn't
> apply.
> Missing pages that need to be frozen and advancing relfrozenxid anyway
> would be _bad_.

Check.

> However, after some further thought, I think we might actually be OK.
> If a page goes from all-frozen to not-all-frozen while VACUUM is
> running, any new XID added to the page must be newer than the
> oldestXmin value computed by vacuum_set_xid_limits(), so it won't
> affect the value to which we can safely set relfrozenxid.  Similarly,
> any MXID added to the page will be newer than GetOldestMultiXactId(),
> so setting relminmxid is still safe for similar reasons.

Yeah, I agree with this, as long as the issue is only that the visibility
map result is slightly stale and not that it's, say, not crash-safe.
We can reasonably assume that any newly-added XID must be one that was
in progress while VACUUM was running, and hence will be after the xmin
horizon we computed earlier.  This requires the existence of a read
barrier somewhere between computing xmin horizon and inspecting the
visibility map, but I find it hard to believe there aren't plenty.
        regards, tom lane



Re: Freeze avoidance of very large table.

From
Masahiko Sawada
Date:
On Wed, Mar 9, 2016 at 1:23 AM, Jeff Janes <jeff.janes@gmail.com> wrote:
> On Tue, Mar 8, 2016 at 5:30 AM, Robert Haas <robertmhaas@gmail.com> wrote:

> I left out the relkind check from the final commit because, for one
> thing, the check you added isn't actually right: toast relations can
> also have a visibility map.  And also, I'm sort of wondering what the
> point of that check is.  What does it protect us from?  It doesn't
> seem very future-proof ... what if we add a new relkind in the future?
>  Do we really want to have to update this?
>
> How about instead changing things so that we specifically reject
> indexes?  And maybe some kind of a check that will reject anything
> that lacks a relfilenode?  That seems like it would be more on point.
>

I agree; I don't have a strong opinion about this.
It would be good to add a condition that rejects only indexes.
The attached patches are:
 - Change heap2 rmgr description
 - Add condition to pg_visibility
 - Fix typo in pgvisibility.sgml
(Sorry for the late notice..)

Regards,

--
Masahiko Sawada

Attachment

Re: Freeze avoidance of very large table.

From
Robert Haas
Date:
On Tue, Mar 8, 2016 at 12:49 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> However, after some further thought, I think we might actually be OK.
>> If a page goes from all-frozen to not-all-frozen while VACUUM is
>> running, any new XID added to the page must be newer than the
>> oldestXmin value computed by vacuum_set_xid_limits(), so it won't
>> affect the value to which we can safely set relfrozenxid.  Similarly,
>> any MXID added to the page will be newer than GetOldestMultiXactId(),
>> so setting relminmxid is still safe for similar reasons.
>
> Yeah, I agree with this, as long as the issue is only that the visibility
> map result is slightly stale and not that it's, say, not crash-safe.

If the visibility map isn't crash safe, we've got big problems even
without this patch, but we dealt with that when index-only scans went
in.  Maybe this patch introduces more stringent requirements in this
area, but I can't think of any reason why that should be true.  If
anything occurs to you (or anyone else), it would be good to mention
that before I go further and destroy the world.

> We can reasonably assume that any newly-added XID must be one that was
> in progress while VACUUM was running, and hence will be after the xmin
> horizon we computed earlier.  This requires the existence of a read
> barrier somewhere between computing xmin horizon and inspecting the
> visibility map, but I find it hard to believe there aren't plenty.

I'll check that, but I agree that it should be OK.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Freeze avoidance of very large table.

From
Robert Haas
Date:
On Tue, Mar 8, 2016 at 12:59 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>> How about instead changing things so that we specifically reject
>> indexes?  And maybe some kind of a check that will reject anything
>> that lacks a relfilenode?  That seems like it would be more on point.
>
> I agree; I don't have a strong opinion about this.
> It would be good to add a condition that rejects only indexes.
> The attached patches are:
>  - Change heap2 rmgr description
>  - Add condition to pg_visibility
>  - Fix typo in pgvisibility.sgml
> (Sorry for the late notice..)

OK, committed the first and last of those.  I think the other one
needs some work yet; the error message doesn't seem like it is quite
our usual style, and if we're going to do something here we should
probably also insert a check to throw a better error when there is no
relfilenode.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Freeze avoidance of very large table.

From
Masahiko Sawada
Date:
On Wed, Mar 9, 2016 at 3:38 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Tue, Mar 8, 2016 at 12:59 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>>> How about instead changing things so that we specifically reject
>>> indexes?  And maybe some kind of a check that will reject anything
>>> that lacks a relfilenode?  That seems like it would be more on point.
>>
>> I agree; I don't have a strong opinion about this.
>> It would be good to add a condition that rejects only indexes.
>> The attached patches are:
>>  - Change heap2 rmgr description
>>  - Add condition to pg_visibility
>>  - Fix typo in pgvisibility.sgml
>> (Sorry for the late notice..)
>
> OK, committed the first and last of those.  I think the other one
> needs some work yet; the error message doesn't seem like it is quite
> our usual style, and if we're going to do something here we should
> probably also insert a check to throw a better error when there is no
> relfilenode.
>

Thank you for your advice and suggestions!

Attached are the latest 2 patches.
* 000 patch : Incorporated the review comments and made the rewriting
logic clearer.
* 001 patch : Incorporated the documentation suggestions and updated
the logic a little.

Please review them.

Regards,

--
Masahiko Sawada

Attachment

Re: Freeze avoidance of very large table.

From
Robert Haas
Date:
On Wed, Mar 9, 2016 at 9:09 AM, Masahiko Sawada
<sawada.mshk@gmail.com> wrote:
> Attached are the latest 2 patches.
> * 000 patch : Incorporated the review comments and made the rewriting
> logic clearer.

That's better, thanks.  But your comments don't survive pgindent.
After running pgindent, I get this:

+               /*
+                * These old_* variables point to old visibility map page.
+                *
+                * cur_old        : Points to current position on old
page. blkend_old :
+                * Points to end of old block. break_old  : Points to
old page break
+                * position for rewriting a new page. After wrote a
new page, old_end
+                * proceeds rewriteVmBytesPerPgae bytes.
+                */

You need to either surround this sort of thing with dashes to make
pgindent ignore it, or, probably better, rewrite it using complete
sentences that together form a paragraph.
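
For example, pgindent leaves comment blocks fenced with dashes alone, so
something like the following would survive intact (a generic illustration
of the convention, not the comment that ended up in the patch):

    /*----------
     * These old_* variables track our position in the old visibility map
     * page: cur_old points to the current byte, blkend_old to the end of
     * the old block, and break_old to the position at which the old page
     * is split when writing out a new page.  After a new page is written,
     * the position advances by the per-page rewrite stride.
     *----------
     */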

+       Oid                     pg_database_oid;        /* OID of
pg_database relation */

Not used anywhere?

Instead of vm_need_rewrite, how about vm_must_add_frozenbit?

Can you explain the changes to test.sh?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Freeze avoidance of very large table.

From
Masahiko Sawada
Date:
Thank you for reviewing!
The updated patch is attached.


On Thu, Mar 10, 2016 at 3:37 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Wed, Mar 9, 2016 at 9:09 AM, Masahiko Sawada
> <sawada.mshk@gmail.com> wrote:
>> Attached are the latest 2 patches.
>> * 000 patch : Incorporated the review comments and made the rewriting
>> logic clearer.
>
> That's better, thanks.  But your comments don't survive pgindent.
> After running pgindent, I get this:
>
> +               /*
> +                * These old_* variables point to old visibility map page.
> +                *
> +                * cur_old        : Points to current position on old
> page. blkend_old :
> +                * Points to end of old block. break_old  : Points to
> old page break
> +                * position for rewriting a new page. After wrote a
> new page, old_end
> +                * proceeds rewriteVmBytesPerPgae bytes.
> +                */
>
> You need to either surround this sort of thing with dashes to make
> pgindent ignore it, or, probably better, rewrite it using complete
> sentences that together form a paragraph.

Fixed.

>
> +       Oid                     pg_database_oid;        /* OID of
> pg_database relation */
>
> Not used anywhere?

Fixed.

> Instead of vm_need_rewrite, how about vm_must_add_frozenbit?

Fixed.

> Can you explain the changes to test.sh?

The current regression test scenario is:
1. Do 'make check' on the pre-upgrade cluster
2. Dump relallvisible values of all relations in the pre-upgrade cluster
to vm_test1.txt
3. Do pg_upgrade
4. Do analyze (not vacuum), dump relallvisible values of all relations
in the post-upgrade cluster to vm_test2.txt
5. Compare vm_test1.txt and vm_test2.txt

That is, the regression test compares relallvisible values between the
pre-upgrade cluster and the post-upgrade cluster.
But because test.sh always uses pre/post clusters with the same catalog
version, I realized that we cannot ensure on the test.sh framework that
the visibility map rewriting is processed successfully; the rewriting
is in fact never executed.
We might need another framework for testing visibility map page rewriting..

Regards,

--
Masahiko Sawada

Attachment

Re: Freeze avoidance of very large table.

From
Robert Haas
Date:
On Wed, Mar 9, 2016 at 9:09 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> * 001 patch : Incorporated the documentation suggestions and updated
> the logic a little.

This 001 patch looks so little like what I was expecting that I
decided to start over from scratch.  The new version I wrote is
attached here.  I don't understand why your version tinkers with the
logic for setting the all-frozen bit; I thought that what I already
committed dealt with that already, and in any case, your version
doesn't even compile against latest sources.  Your version also leaves
the scan_all terminology intact even though it's not accurate any
more, and I am not very convinced that the updates to the
page-skipping logic are actually correct.  Please have a look over
this version and see what you think.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment

Re: Freeze avoidance of very large table.

From
Masahiko Sawada
Date:
On Thu, Mar 10, 2016 at 3:27 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> Thank you for reviewing!
> The updated patch is attached.
>
>
> On Thu, Mar 10, 2016 at 3:37 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Wed, Mar 9, 2016 at 9:09 AM, Masahiko Sawada
>> <sawada.mshk@gmail.com> wrote:
>>> Attached are the latest 2 patches.
>>> * 000 patch : Incorporated the review comments and made the rewriting
>>> logic clearer.
>>
>> That's better, thanks.  But your comments don't survive pgindent.
>> After running pgindent, I get this:
>>
>> +               /*
>> +                * These old_* variables point to old visibility map page.
>> +                *
>> +                * cur_old        : Points to current position on old
>> page. blkend_old :
>> +                * Points to end of old block. break_old  : Points to
>> old page break
>> +                * position for rewriting a new page. After wrote a
>> new page, old_end
>> +                * proceeds rewriteVmBytesPerPgae bytes.
>> +                */
>>
>> You need to either surround this sort of thing with dashes to make
>> pgindent ignore it, or, probably better, rewrite it using complete
>> sentences that together form a paragraph.
>
> Fixed.
>
>>
>> +       Oid                     pg_database_oid;        /* OID of
>> pg_database relation */
>>
>> Not used anywhere?
>
> Fixed.
>
>> Instead of vm_need_rewrite, how about vm_must_add_frozenbit?
>
> Fixed.
>
>> Can you explain the changes to test.sh?
>
> The current regression test scenario is:
> 1. Do 'make check' on the pre-upgrade cluster
> 2. Dump relallvisible values of all relations in the pre-upgrade cluster
> to vm_test1.txt
> 3. Do pg_upgrade
> 4. Do analyze (not vacuum), dump relallvisible values of all relations
> in the post-upgrade cluster to vm_test2.txt
> 5. Compare vm_test1.txt and vm_test2.txt
>
> That is, the regression test compares relallvisible values between the
> pre-upgrade cluster and the post-upgrade cluster.
> But because test.sh always uses pre/post clusters with the same catalog
> version, I realized that we cannot ensure on the test.sh framework that
> the visibility map rewriting is processed successfully; the rewriting
> is in fact never executed.
> We might need another framework for testing visibility map page rewriting..
>

After some further thought, it seems better to add logic that checks the
result of rewriting the visibility map to the upgrade code itself, rather
than to the regression test, in order to ensure that the rewriting has
been done successfully.
As a draft, the attached patch checks the result of rewriting the
visibility map for each relation as a routine of pg_upgrade.
The disadvantage is that we need to scan each visibility map page twice.
But since the visibility map would not be so large, that should not be
too bad.
Thoughts?

Regards,

--
Masahiko Sawada

Attachment

Re: Freeze avoidance of very large table.

From
Robert Haas
Date:
On Thu, Mar 10, 2016 at 8:51 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> After some further thought, it seems better to add logic that checks the
> result of rewriting the visibility map to the upgrade code itself, rather
> than to the regression test, in order to ensure that the rewriting has
> been done successfully.
> As a draft, the attached patch checks the result of rewriting the
> visibility map for each relation as a routine of pg_upgrade.
> The disadvantage is that we need to scan each visibility map page twice.
> But since the visibility map would not be so large, that should not be
> too bad.
> Thoughts?

I think that's kind of pointless.  We need to test that this
conversion code works, but once it does, I don't think we should make
everybody pay the overhead of retesting that.  Anyway, the test code
could have bugs, too.

Here's an updated version of your patch with that code removed and
some cosmetic cleanups like fixing typos and stuff like that.  I think
this is mostly ready to commit, but I noticed one problem: your
conversion code always produces two output pages for each input page
even if one of them would be empty.  In particular, if you have a
large number of small relations and run pg_upgrade, all of their
visibility maps will go from 8kB to 16kB.  That isn't the end of the
world, maybe, but I think you should see if you can't fix it
somehow....

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment

Re: Freeze avoidance of very large table.

From
Masahiko Sawada
Date:
On Fri, Mar 11, 2016 at 1:03 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> This 001 patch looks so little like what I was expecting that I
> decided to start over from scratch.  The new version I wrote is
> attached here.  I don't understand why your version tinkers with the
> logic for setting the all-frozen bit; I thought that what I already
> committed dealt with that already, and in any case, your version
> doesn't even compile against latest sources.  Your version also leaves
> the scan_all terminology intact even though it's not accurate any
> more, and I am not very convinced that the updates to the
> page-skipping logic are actually correct.  Please have a look over
> this version and see what you think.

Thank you for your advice.
Sorry, the optimising logic in the previous patch was stale by mistake.
The attached latest patch incorporates your suggestions with a little
revision.

>
> I think that's kind of pointless.  We need to test that this
> conversion code works, but once it does, I don't think we should make
> everybody pay the overhead of retesting that.  Anyway, the test code
> could have bugs, too.
>
> Here's an updated version of your patch with that code removed and
> some cosmetic cleanups like fixing typos and stuff like that.  I think
> this is mostly ready to commit, but I noticed one problem: your
> conversion code always produces two output pages for each input page
> even if one of them would be empty.  In particular, if you have a
> large number of small relations and run pg_upgrade, all of their
> visibility maps will go from 8kB to 16kB.  That isn't the end of the
> world, maybe, but I think you should see if you can't fix it
> somehow....

Thank you for updating the patch.
To deal with this problem, I've changed it so that pg_upgrade checks the
file size before conversion, and if the fork file does not exist or its
size is 0 (empty), it is skipped.
The latest patch is attached.

Regards,

--
Masahiko Sawada

Attachment

Re: Freeze avoidance of very large table.

From
Robert Haas
Date:
On Thu, Mar 10, 2016 at 1:41 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> On Fri, Mar 11, 2016 at 1:03 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> This 001 patch looks so little like what I was expecting that I
>> decided to start over from scratch.  The new version I wrote is
>> attached here.  I don't understand why your version tinkers with the
>> logic for setting the all-frozen bit; I thought that what I already
>> committed dealt with that already, and in any case, your version
>> doesn't even compile against latest sources.  Your version also leaves
>> the scan_all terminology intact even though it's not accurate any
>> more, and I am not very convinced that the updates to the
>> page-skipping logic are actually correct.  Please have a look over
>> this version and see what you think.
>
> Thank you for your advice.
> Sorry, the optimising logic in the previous patch was stale by mistake.
> The attached latest patch incorporates your suggestions with a little revision.

OK, I'll have a look.  Thanks.

>> I think that's kind of pointless.  We need to test that this
>> conversion code works, but once it does, I don't think we should make
>> everybody pay the overhead of retesting that.  Anyway, the test code
>> could have bugs, too.
>>
>> Here's an updated version of your patch with that code removed and
>> some cosmetic cleanups like fixing typos and stuff like that.  I think
>> this is mostly ready to commit, but I noticed one problem: your
>> conversion code always produces two output pages for each input page
>> even if one of them would be empty.  In particular, if you have a
>> large number of small relations and run pg_upgrade, all of their
>> visibility maps will go from 8kB to 16kB.  That isn't the end of the
>> world, maybe, but I think you should see if you can't fix it
>> somehow....
>
> Thank you for updating the patch.
> To deal with this problem, I've changed it so that pg_upgrade checks the
> file size before conversion, and if the fork file does not exist or its
> size is 0 (empty), it is skipped.
> The latest patch is attached.

I think what I really want is some logic so that if we have a 1-page
visibility map in the old cluster and the second half of that page is
all zeroes, we only create a 1-page visibility map in the new cluster
rather than a 2-page visibility map.

Or more generally, if the old VM is N pages, but the last half of the
last page is empty, then let the output VM be 2*N-1 pages instead of
2*N pages.
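
A minimal sketch of that rule (hypothetical function and names, not code
from the patch): scan the second half of the last old page's data area and
emit the final new page only if any bit there is set.

    #include <stdint.h>
    #include <stddef.h>

    /*
     * Sketch: number of new-format VM pages to emit for an old-format map
     * of n_old pages, where lastmap/lastlen is the data area of the last
     * old page.  If its second half is all zeroes, it would expand into an
     * entirely empty new page, so emit 2*N - 1 pages instead of 2*N.
     */
    static int
    new_vm_page_count(int n_old, const uint8_t *lastmap, size_t lastlen)
    {
        size_t      i;

        for (i = lastlen / 2; i < lastlen; i++)
            if (lastmap[i] != 0)
                return 2 * n_old;   /* tail half in use: keep both pages */

        return 2 * n_old - 1;       /* tail half empty: drop the last page */
    }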

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Freeze avoidance of very large table.

From
Robert Haas
Date:
On Thu, Mar 10, 2016 at 1:41 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> On Fri, Mar 11, 2016 at 1:03 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> This 001 patch looks so little like what I was expecting that I
>> decided to start over from scratch.  The new version I wrote is
>> attached here.  I don't understand why your version tinkers with the
>> logic for setting the all-frozen bit; I thought that what I already
>> committed dealt with that already, and in any case, your version
>> doesn't even compile against latest sources.  Your version also leaves
>> the scan_all terminology intact even though it's not accurate any
>> more, and I am not very convinced that the updates to the
>> page-skipping logic are actually correct.  Please have a look over
>> this version and see what you think.
>
> Thank you for your advice.
> Sorry, the optimising logic in the previous patch was stale by mistake.
> The attached latest patch incorporates your suggestions with a little revision.

Thanks.  I adopted some of your suggestions, rejected others, fixed a
few minor things that I missed previously, and committed this.  If you
think any of the changes that I rejected still have merit, please
resubmit those changes as separate patches.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Freeze avoidance of very large table.

From
Masahiko Sawada
Date:
On Fri, Mar 11, 2016 at 6:16 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Thu, Mar 10, 2016 at 1:41 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>> On Fri, Mar 11, 2016 at 1:03 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>>> This 001 patch looks so little like what I was expecting that I
>>> decided to start over from scratch.  The new version I wrote is
>>> attached here.  I don't understand why your version tinkers with the
>>> logic for setting the all-frozen bit; I thought that what I already
>>> committed dealt with that already, and in any case, your version
>>> doesn't even compile against latest sources.  Your version also leaves
>>> the scan_all terminology intact even though it's not accurate any
>>> more, and I am not very convinced that the updates to the
>>> page-skipping logic are actually correct.  Please have a look over
>>> this version and see what you think.
>>
>> Thank you for your advice.
>> Sorry, the optimising logic in the previous patch was stale by mistake.
>> The attached latest patch incorporates your suggestions with a little revision.
>
> Thanks.  I adopted some of your suggestions, rejected others, fixed a
> few minor things that I missed previously, and committed this.  If you
> think any of the changes that I rejected still have merit, please
> resubmit those changes as separate patches.
>

Thank you for your effort on this feature and for committing it.
I guess that I couldn't do good work on this feature at the final stage,
but I really appreciate all your advice and suggestions.

> I think what I really want is some logic so that if we have a 1-page
> visibility map in the old cluster and the second half of that page is
> all zeroes, we only create a 1-page visibility map in the new cluster
> rather than a 2-page visibility map.
>
> Or more generally, if the old VM is N pages, but the last half of the
> last page is empty, then let the output VM be 2*N-1 pages instead of
> 2*N pages.
>

I got your point.
The attached latest patch can skip writing the last part of the last
old page if it's empty.
Please review it.

Regards,

--
Masahiko Sawada

Attachment

Re: Freeze avoidance of very large table.

From
Robert Haas
Date:
On Thu, Mar 10, 2016 at 10:47 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>> Thanks.  I adopted some of your suggestions, rejected others, fixed a
>> few minor things that I missed previously, and committed this.  If you
>> think any of the changes that I rejected still have merit, please
>> resubmit those changes as separate patches.
>
> Thank you for your effort on this feature and for committing it.
> I guess that I couldn't do good work on this feature at the final stage,
> but I really appreciate all your advice and suggestions.

Don't feel bad; you put a lot of work into this, and if you were getting
a little tired towards the end, that's very understandable.  This
extremely important feature was largely driven by you, and that's a
big accomplishment.

> I got your point.
> The attached latest patch can skip writing the last part of the last
> old page if it's empty.
> Please review it.

Committed.

Which I think just about brings us to the end of this epic journey,
except for any cleanup of what's already been committed that needs to
be done.  Thanks so much for your hard work!

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Freeze avoidance of very large table.

From
Masahiko Sawada
Date:
On Sat, Mar 12, 2016 at 2:37 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Thu, Mar 10, 2016 at 10:47 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>>> Thanks.  I adopted some of your suggestions, rejected others, fixed a
>>> few minor things that I missed previously, and committed this.  If you
>>> think any of the changes that I rejected still have merit, please
>>> resubmit those changes as separate patches.
>>
>> Thank you for your effort on this feature and for committing it.
>> I guess that I couldn't do good work on this feature at the final stage,
>> but I really appreciate all your advice and suggestions.
>
> Don't feel bad; you put a lot of work into this, and if you were getting
> a little tired towards the end, that's very understandable.  This
> extremely important feature was largely driven by you, and that's a
> big accomplishment.
>
>> I got your point.
>> The attached latest patch can skip writing the last part of the last
>> old page if it's empty.
>> Please review it.
>
> Committed.
>
> Which I think just about brings us to the end of this epic journey,
> except for any cleanup of what's already been committed that needs to
> be done.  Thanks so much for your hard work!
>

Thank you so much!
What I wanted to deal with in this thread is almost done. I'm going to
test the feature more for the 9.6 release.

Regards,

--
Masahiko Sawada



Re: Freeze avoidance of very large table.

From
"Joshua D. Drake"
Date:
On 03/11/2016 09:48 AM, Masahiko Sawada wrote:

>
> Thank you so much!
> What I wanted to deal with in this thread is almost done. I'm going to
> test the feature more for the 9.6 release.

Nicely done!

>
> Regards,
>
> --
> Masahiko Sawada
>
>


-- 
Command Prompt, Inc.                  http://the.postgres.company/                        +1-503-667-4564
PostgreSQL Centered full stack support, consulting and development.
Everyone appreciates your honesty, until you are honest with them.