Thread: BUG #8673: Could not open file "pg_multixact/members/xxxx" on slave during hot_standby

BUG #8673: Could not open file "pg_multixact/members/xxxx" on slave during hot_standby

From: petr@petrovich.kiev.ua
The following bug has been logged on the website:

Bug reference:      8673
Logged by:          Serge Negodyuck
Email address:      petr@petrovich.kiev.ua
PostgreSQL version: 9.3.2
Operating system:   Linux 3.2.0-57-generic amd64. Ubuntu 12.04.3
Description:

I'm using postgresql 9.3.0, with WAL streaming replication, i.e.:
postgresql.conf:
hot_standby = on


recovery.conf:
standby_mode          = 'on'
primary_conninfo      = 'host=master port=5432 user=postgres
password=password'


I've installed a new slave database on the 6th of December. Since there were no
packages on apt.postgresql.org for postgresql 9.3.0, I've set up postgresql
9.3.2.


I see a lot of messages in the slave server log like:


2013-12-09 10:10:24 EET 172.18.10.45 main ERROR:  could not access status of
transaction 24568845
2013-12-09 10:10:24 EET 172.18.10.45 main DETAIL:  Could not open file
"pg_multixact/members/CD8F": No such file or directory.
--
2013-12-09 10:11:59 EET 172.18.10.30 main ERROR:  could not access status of
transaction 31724554
2013-12-09 10:11:59 EET 172.18.10.30 main DETAIL:  Could not open file
"pg_multixact/members/F615": No such file or directory.
--
2013-12-09 10:12:20 EET 172.18.10.7 main ERROR:  could not access status of
transaction 25496296
2013-12-09 10:12:20 EET 172.18.10.7 main DETAIL:  Could not open file
"pg_multixact/members/D31E": No such file or directory.


My next step was to upgrade to postgresql 9.3.2 on master and to do initial
sync from scratch.
It does not help. I still have the same error.


I think it may be tied to this commit:
http://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=215ac4ad6589e0f6a31cc4cd867aedba3cd42924




A separate question: why were the binary packages for 9.3.0/9.3.1 removed from
apt.postgresql.org, so that I cannot downgrade?
Hi,

On 2013-12-09 13:47:34 +0000, petr@petrovich.kiev.ua wrote:
> PostgreSQL version: 9.3.2

> I've installed new slave database on 6th of December. Since there was no
> packages on apt.postgresql.org with postgresql 9.3.0 I've set up postgresql
> 9.3.2

> 2013-12-09 10:10:24 EET 172.18.10.45 main ERROR:  could not access status of
> transaction 24568845
> 2013-12-09 10:10:24 EET 172.18.10.45 main DETAIL:  Could not open file
> "pg_multixact/members/CD8F": No such file or directory.

> My next step was to upgrade to postgresql 9.3.2 on master and to do initial
> sync from scratch.
> It does not help. I still have the same error.

Could you post, as close as possible to the next occurrence of that
error:
* pg_controldata output from the primary
* pg_controldata output from the standby
* SELECT datfrozenxid, datminmxid FROM pg_database;

Do you frequently VACUUM FREEZE on the primary? Have you modified any of
the vacuum_freeze_* parameters?

> I think it may be tied to this commit:
> http://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=215ac4ad6589e0f6a31cc4cd867aedba3cd42924

Only incidentally I think - we didn't properly maintain it during
recovery before at all.

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
On 2013-12-09 16:55:21 +0200, Serge Negodyuck wrote:
> 2013/12/9 Andres Freund <andres@2ndquadrant.com>:
> > On 2013-12-09 13:47:34 +0000, petr@petrovich.kiev.ua wrote:
> >> PostgreSQL version: 9.3.2
> >
> >> I've installed new slave database on 6th of December. Since there was no
> >> packages on apt.postgresql.org with postgresql 9.3.0 I've set up postgresql
> >> 9.3.2
> >
> >> 2013-12-09 10:10:24 EET 172.18.10.45 main ERROR:  could not access status of
> >> transaction 24568845
> >> 2013-12-09 10:10:24 EET 172.18.10.45 main DETAIL:  Could not open file
> >> "pg_multixact/members/CD8F": No such file or directory.
> >
> >> My next step was to upgrade to postgresql 9.3.2 on master and to do initial
> >> sync from scratch.
> >> It does not help. I still have the same error.
> >
> > Could you post, as close as possible to the next occurance of that
> > error:
> > * pg_controldata output from the primary
> > * pg_controldata output from the standby
>
> Sorry, I've just  downgraded all the cluster to 9.3.0, and this error
> disappeared.
> I can provide output right now, if it make any sence.

Yes, that'd be helpful.

Could you also provide ls -l pg_multixact/ on both primary and standby?

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
2013/12/9 Andres Freund <andres@2ndquadrant.com>:
> Hi,
>
> On 2013-12-09 13:47:34 +0000, petr@petrovich.kiev.ua wrote:
>> PostgreSQL version: 9.3.2
>
>> I've installed new slave database on 6th of December. Since there was no
>> packages on apt.postgresql.org with postgresql 9.3.0 I've set up postgresql
>> 9.3.2
>
>> 2013-12-09 10:10:24 EET 172.18.10.45 main ERROR:  could not access status of
>> transaction 24568845
>> 2013-12-09 10:10:24 EET 172.18.10.45 main DETAIL:  Could not open file
>> "pg_multixact/members/CD8F": No such file or directory.
>
>> My next step was to upgrade to postgresql 9.3.2 on master and to do initial
>> sync from scratch.
>> It does not help. I still have the same error.
>
> Could you post, as close as possible to the next occurance of that
> error:
> * pg_controldata output from the primary
> * pg_controldata output from the standby

Sorry, I've just downgraded the whole cluster to 9.3.0, and this error
disappeared.
I can provide the output right now, if it makes any sense.

But the same query with the same id *always* gave the same error,
regardless of replication state.
One more thing: I had a look at the previous day's logs and found the following:

2013-12-08 21:17:15 EET LOG: could not truncate directory
"pg_multixact/members": apparent wraparound
2013-12-08 21:22:20 EET LOG: could not truncate directory
"pg_multixact/members": apparent wraparound
2013-12-08 21:27:26 EET LOG: could not truncate directory
"pg_multixact/members": apparent wraparound


At some point in time there was no "apparent wraparound" message
anymore, just "Could not open file".


> * SELECT datfrozenxid, datminmxid FROM pg_database;

 datfrozenxid | datminmxid
--------------+------------
          710 |          1
          710 |          1
          710 |          1
          710 |          1


>
> Do you frequently VACUUM FREEZE on the primary? Have you modified any of
> the vacuum_freeze_* parameters?
I never run  VACUUM FREEZE manually

The only changed vacuum_* parameter is
vacuum_cost_delay = 10

>
>> I think it may be tied to this commit:
>> http://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=215ac4ad6589e0f6a31cc4cd867aedba3cd42924
>
> Only incidentally I think - we didn't properly maintain it during
> recovery before at all.
>
On 2013-12-09 17:49:34 +0200, Serge Negodyuck wrote:
> Latest checkpoint's NextXID: 0/90546484
> Latest checkpoint's NextOID: 6079185
> Latest checkpoint's NextMultiXactId: 42049949
> Latest checkpoint's NextMultiOffset: 55384024
> Latest checkpoint's oldestXID: 710
> Latest checkpoint's oldestXID's DB: 1
> Latest checkpoint's oldestActiveXID: 90546475
> Latest checkpoint's oldestMultiXid: 1
> Latest checkpoint's oldestMulti's DB: 1

So, the oldest multi is 1, thus the truncation code wouldn't have been
able to remove any. So I think this really is an independent problem
from the truncation patch. But a problem nonetheless.

> > Could you also provide ls -l pg_multixact/ on both primary and standby?
>
> Did you mean pg_multixact/members/ ?

From both members and offsets. It'd be great if you could attach an ls -lR
of pg_multixact/.

> I't not possible on slave right now. Since I had to re-sync these
> files from master. May be that was not a good idea but it helped.
>
> On master there are files from 0000 to 14078
>
> On slave there were absent files from A1xx to FFFF
> They were  the oldest ones. (October, November)

Hm. Were files from before A1xx present? I am not sure how to parse your answer.

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
2013/12/9 Andres Freund <andres@2ndquadrant.com>:
> On 2013-12-09 16:55:21 +0200, Serge Negodyuck wrote:
>> 2013/12/9 Andres Freund <andres@2ndquadrant.com>:
>> > On 2013-12-09 13:47:34 +0000, petr@petrovich.kiev.ua wrote:
>> >> PostgreSQL version: 9.3.2
>> >
>> >> I've installed new slave database on 6th of December. Since there was no
>> >> packages on apt.postgresql.org with postgresql 9.3.0 I've set up postgresql
>> >> 9.3.2
>> >
>> >> 2013-12-09 10:10:24 EET 172.18.10.45 main ERROR:  could not access status of
>> >> transaction 24568845
>> >> 2013-12-09 10:10:24 EET 172.18.10.45 main DETAIL:  Could not open file
>> >> "pg_multixact/members/CD8F": No such file or directory.
>> >
>> >> My next step was to upgrade to postgresql 9.3.2 on master and to do initial
>> >> sync from scratch.
>> >> It does not help. I still have the same error.
>> >
>> > Could you post, as close as possible to the next occurance of that
>> > error:
>> > * pg_controldata output from the primary
>> > * pg_controldata output from the standby
>>
>> Sorry, I've just  downgraded all the cluster to 9.3.0, and this error
>> disappeared.
>> I can provide output right now, if it make any sence.
>
> Yes, that'd be helpful.


master:
pg_control version number: 937
Catalog version number: 201306121
Database system identifier: 5928279687159054327
Database cluster state: in production
pg_control last modified: Mon 09 Dec 2013 05:29:53 PM EET
Latest checkpoint location: 3D4/76E97DA0
Prior checkpoint location: 3D4/6E768638
Latest checkpoint's REDO location: 3D4/76925C18
Latest checkpoint's REDO WAL file: 00000001000003D400000076
Latest checkpoint's TimeLineID: 1
Latest checkpoint's PrevTimeLineID: 1
Latest checkpoint's full_page_writes: on
Latest checkpoint's NextXID: 0/90546484
Latest checkpoint's NextOID: 6079185
Latest checkpoint's NextMultiXactId: 42049949
Latest checkpoint's NextMultiOffset: 55384024
Latest checkpoint's oldestXID: 710
Latest checkpoint's oldestXID's DB: 1
Latest checkpoint's oldestActiveXID: 90546475
Latest checkpoint's oldestMultiXid: 1
Latest checkpoint's oldestMulti's DB: 1
Time of latest checkpoint: Mon 09 Dec 2013 05:29:44 PM EET
Fake LSN counter for unlogged rels: 0/1
Minimum recovery ending location: 0/0
Min recovery ending loc's timeline: 0
Backup start location: 0/0
Backup end location: 0/0
End-of-backup record required: no
Current wal_level setting: hot_standby
Current max_connections setting: 1000
Current max_prepared_xacts setting: 0
Current max_locks_per_xact setting: 64
Maximum data alignment: 8
Database block size: 8192
Blocks per segment of large relation: 131072
WAL block size: 8192
Bytes per WAL segment: 16777216
Maximum length of identifiers: 64
Maximum columns in an index: 32
Maximum size of a TOAST chunk: 1996
Date/time type storage: 64-bit integers
Float4 argument passing: by value
Float8 argument passing: by value
Data page checksum version: 0



slave:
pg_control version number: 937
Catalog version number: 201306121
Database system identifier: 5928279687159054327
Database cluster state: in archive recovery
pg_control last modified: Mon 09 Dec 2013 05:25:22 PM EET
Latest checkpoint location: 3D4/6E768638
Prior checkpoint location: 3D4/66F14C60
Latest checkpoint's REDO location: 3D4/6E39F9E8
Latest checkpoint's REDO WAL file: 00000001000003D40000006E
Latest checkpoint's TimeLineID: 1
Latest checkpoint's PrevTimeLineID: 1
Latest checkpoint's full_page_writes: on
Latest checkpoint's NextXID: 0/90546484
Latest checkpoint's NextOID: 6079185
Latest checkpoint's NextMultiXactId: 42046170
Latest checkpoint's NextMultiOffset: 55058098
Latest checkpoint's oldestXID: 710
Latest checkpoint's oldestXID's DB: 1
Latest checkpoint's oldestActiveXID: 90541410
Latest checkpoint's oldestMultiXid: 1
Latest checkpoint's oldestMulti's DB: 1
Time of latest checkpoint: Mon 09 Dec 2013 05:24:44 PM EET
Fake LSN counter for unlogged rels: 0/1
Minimum recovery ending location: 3D4/7884BB68
Min recovery ending loc's timeline: 1
Backup start location: 0/0
Backup end location: 0/0
End-of-backup record required: no
Current wal_level setting: hot_standby
Current max_connections setting: 1000
Current max_prepared_xacts setting: 0
Current max_locks_per_xact setting: 64
Maximum data alignment: 8
Database block size: 8192
Blocks per segment of large relation: 131072
WAL block size: 8192
Bytes per WAL segment: 16777216
Maximum length of identifiers: 64
Maximum columns in an index: 32
Maximum size of a TOAST chunk: 1996
Date/time type storage: 64-bit integers
Float4 argument passing: by value
Float8 argument passing: by value
Data page checksum version: 0




>
> Could you also provide ls -l pg_multixact/ on both primary and standby?

Did you mean pg_multixact/members/ ?
It's not possible on the slave right now, since I had to re-sync these
files from the master. Maybe that was not a good idea, but it helped.

On master there are files from 0000 to 14078

On the slave, files from A1xx to FFFF were absent.
They were the oldest ones (October, November).
Hi,

On 2013-12-09 17:49:34 +0200, Serge Negodyuck wrote:
> On master there are files from 0000 to 14078
>
> On slave there were absent files from A1xx to FFFF
> They were  the oldest ones. (October, November)

Some analysis later, I am pretty sure that the origin is a longstanding
problem and not connected to 9.3.[01] vs 9.3.2.

The above referenced 14078 file is exactly the last segment before a
members wraparound:
(gdb) p/x (1L<<32)/(MULTIXACT_MEMBERS_PER_PAGE * SLRU_PAGES_PER_SEGMENT)
$10 = 0x14078

So, what happened is that enough multixacts were created that the
members slru wrapped around. It's not unreasonable for the members slru
to wrap around faster than the offsets one - after all, we create at
least two entries in members for every offsets entry. Also, in 9.3+
more entries fit on an offsets page than on a members page.
When truncating, we first read the offsets, to know where we currently
are in members, and then truncate both from their respective
points. Since we've wrapped around in members, we may very well remove
content we actually need.
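
For anyone who wants to reproduce the arithmetic without gdb, here is a
standalone sketch; the constants are the 8 kB-page figures cited elsewhere in
this thread (1636 member entries per page, 32 pages per SLRU segment), not the
actual macros from multixact.c:

    #include <stdio.h>
    #include <stdint.h>

    /* Standalone version of the gdb expression above; constants mirror the
     * per-page/per-segment figures quoted in this thread for BLCKSZ = 8192. */
    #define MULTIXACT_MEMBERS_PER_PAGE 1636
    #define SLRU_PAGES_PER_SEGMENT     32

    int
    main(void)
    {
        uint64_t per_segment = (uint64_t) MULTIXACT_MEMBERS_PER_PAGE *
                               SLRU_PAGES_PER_SEGMENT;
        uint64_t last_segment = (UINT64_C(1) << 32) / per_segment;

        /* prints 14078: the last members segment before the 32-bit member
         * space wraps back around to segment 0000 */
        printf("last members segment: %llX\n",
               (unsigned long long) last_segment);
        return 0;
    }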

I've recently remarked that I find it dangerous that we only do
anti-wraparound stuff for pg_multixact/offsets, not for /members. So,
here we have the proof that that's bad.

This is an issue in <9.3 as well. It might, in some sense, even be worse
there, because we never vacuum old multis away. But on the other hand,
the growth of multis is slower there and we look into old multis less
frequently.

The only reason that you saw the issue on the standby first is that the
truncation code is called more frequently there. Afaics it will happen,
sometime in the future, on the master as well.

I think problems should be preventable if you issue a systemwide VACUUM
FREEZE, but please let others chime in before you execute it.

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
Andres Freund wrote:
> Hi,
>
> On 2013-12-09 17:49:34 +0200, Serge Negodyuck wrote:
> > On master there are files from 0000 to 14078
> >
> > On slave there were absent files from A1xx to FFFF
> > They were  the oldest ones. (October, November)
>
> Some analysis later, I am pretty sure that the origin is a longstanding
> problem and not connected to 9.3.[01] vs 9.3.2.
>
> The above referenced 14078 file is exactly the last page before a
> members wraparound:
> (gdb) p/x (1L<<32)/(MULTIXACT_MEMBERS_PER_PAGE * SLRU_PAGES_PER_SEGMENT)
> $10 = 0x14078

As a note, the SlruScanDirectory code has a flaw because it only looks
at four-digit files; the reason only files up to 0xFFFF are missing and
not the following ones is because those got ignored.  This needs a fix
as well.

> So, what happened is that enough multixacts where created, that the
> members slru wrapped around. It's not unreasonable for the members slru
> to wrap around faster then the offsets one - after all we create at
> least two entries into members for every offset entry. Also in 9.3+
> there fit more xids on a offset than a members page.
> When truncating, we first read the offset, to know where we currently
> are in members, and then truncate both from their respective
> point. Since we've wrapped around in members we very well might remove
> content we actually need.

Yeah, on 9.3 each member Xid occupies five bytes in
pg_multixact/members, whereas each offset only occupies four bytes in
pg_multixact/offsets.  It's rare that a multixact only contains one
member; typically they will have at least two (so for each multixact we
would have 4 bytes in offsets and a minimum of 10 bytes in members).
So wrapping around is easy for members, even with the protections we
have in place for offsets.
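
A quick back-of-the-envelope comparison makes the rate difference concrete:
with 8 kB pages, an offsets page holds 2048 of those four-byte entries while a
members page holds 1636 entries, so even at the two-members-per-multi minimum
the members SLRU consumes pages roughly 2.5 times faster than the offsets SLRU
(plain C, illustrative numbers only):

    #include <stdio.h>

    int
    main(void)
    {
        const double offsets_per_page = 8192.0 / 4.0;  /* 4-byte entries -> 2048 */
        const double members_per_page = 1636.0;        /* per-page figure cited in this thread */
        const double members_per_multi = 2.0;          /* the typical minimum */

        /* members pages written for every offsets page written */
        printf("ratio: %.2f\n",
               offsets_per_page * members_per_multi / members_per_page);
        return 0;
    }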

> I've recently remarked that I find it dangerous that we only do
> anti-wraparound stuff for pg_multixact/offsets, not for /members. So,
> here we have the proof that that's bad.

It's hard to see how to add this post-facto, though.  I mean, I am
thinking we would need some additional pg_control info etc.  We'd better
figure out a way to add such controls without having to add that.

> This is an issue in <9.3 as well. It might, in some sense, even be worse
> there, because we never vacuum old multis away. But on the other hand,
> the growths of multis is slower there and we look into old multis less
> frequently.
>
> The only reason that you saw the issue on the standby first is that the
> truncation code is called more frequently there. Afaics it will happen,
> sometime in the future, on the master as well.
>
> I think problems should be preventable if you issue a systemwide VACUUM
> FREEZE, but please let others chime in before you execute it.

I wouldn't freeze anything just yet, at least until the patch to fix
multixact freezing is in.

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 2013-12-09 16:00:32 -0300, Alvaro Herrera wrote:
> As a note, the SlruScanDirectory code has a flaw because it only looks
> at four-digit files; the reason only files up to 0xFFFF are missing and
> not the following ones is because those got ignored.  This needs a fix
> as well.

While I agree it's a bug, I don't think it's relevant for the case at
hand. For offsets there's no following page (or exactly one, not sure
about the math offhand), and we only use SlruScanDirectory() for
offsets, not for members.

> > I've recently remarked that I find it dangerous that we only do
> > anti-wraparound stuff for pg_multixact/offsets, not for /members. So,
> > here we have the proof that that's bad.
>
> It's hard to see how to add this post-facto, though.  I mean, I am
> thinking we would need some additional pg_control info etc.  We'd better
> figure out a way to add such controls without having to add that.

Couldn't we just get the oldest multi, check where in offsets it points
to, and compare that with nextOffset? That should be doable without
additional data.
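
Roughly, the check could be pure arithmetic - a sketch, not actual server
code, with the two inputs standing in for the oldest multixact's offsets entry
and for nextOffset, and an arbitrary threshold:

    #include <stdio.h>
    #include <stdint.h>

    typedef uint32_t MultiXactOffset;

    /* Sketch only.  oldest_offset stands for the offsets-SLRU entry of the
     * oldest surviving multixact, next_offset for nextOffset; neither value
     * needs any new pg_control field. */
    static int
    members_wraparound_imminent(MultiXactOffset oldest_offset,
                                MultiXactOffset next_offset)
    {
        /* unsigned subtraction gives the live span modulo 2^32, i.e. the part
         * of the members space that truncation cannot reclaim */
        MultiXactOffset in_use = next_offset - oldest_offset;

        /* arbitrary illustrative threshold: warn at ~75% of the 2^32 space */
        return in_use > (MultiXactOffset) 0xC0000000u;
    }

    int
    main(void)
    {
        /* e.g. the oldest multi's members begin near offset 100000 while
         * nextOffset has advanced to 0xF0000000: ~94% pinned -> prints 1 */
        printf("%d\n", members_wraparound_imminent(100000u, 0xF0000000u));
        return 0;
    }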

> > I think problems should be preventable if you issue a systemwide VACUUM
> > FREEZE, but please let others chime in before you execute it.
>
> I wouldn't freeze anything just yet, at least until the patch to fix
> multixact freezing is in.

Well, it seems better than getting errors because of multixact members
that are gone.
Maybe PGOPTIONS='-c vacuum_freeze_table_age=0 -c vacuum_freeze_min_age=1000000' vacuumdb -a
 - that ought not to cause problems with current data and should freeze
enough to get rid of problematic multis?

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
Andres Freund wrote:
> On 2013-12-09 16:00:32 -0300, Alvaro Herrera wrote:
> > As a note, the SlruScanDirectory code has a flaw because it only looks
> > at four-digit files; the reason only files up to 0xFFFF are missing and
> > not the following ones is because those got ignored.  This needs a fix
> > as well.
>
> While I agree it's a bug, I don't think it's relevant for the case at
> hand. For offset's there's no following page (or exactly 1, not sure
> about the math offhand), and we only use SlruScanDirectory() for
> offsets not for members.

Sure we do for members, through SimpleLruTruncate which calls
SlruScanDirectory underneath.

> > > I've recently remarked that I find it dangerous that we only do
> > > anti-wraparound stuff for pg_multixact/offsets, not for /members. So,
> > > here we have the proof that that's bad.
> >
> > It's hard to see how to add this post-facto, though.  I mean, I am
> > thinking we would need some additional pg_control info etc.  We'd better
> > figure out a way to add such controls without having to add that.
>
> Couldn't we just get the oldest multi, check where in offsets it points
> to, and compare that with nextOffset? That should be doable without
> additional data.

Hmm, that seems a sensible approach ...

> > > I think problems should be preventable if you issue a systemwide VACUUM
> > > FREEZE, but please let others chime in before you execute it.
> >
> > I wouldn't freeze anything just yet, at least until the patch to fix
> > multixact freezing is in.
>
> Well, it seems better than getting errors because of multixact members
> that are gone.
> Maybe PGOPTIONS='-c vacuum_freeze_table_age=0 -c vacuum_freeze_min_age=1000000' vacuumdb -a
>  - that ought not to cause problems with current data and should freeze
> enough to get rid of problematic multis?

TBH I don't feel comfortable with predicting what it will freeze with
the broken code.

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
> From both members, and offset. I'd be great if you could attach a ls -lR
> of pg_multixact/.
>

I've set up another slave (not in production mode) to reproduce this
bug and provide more details if necessary.
There are no "apparent wraparound" messages in the logs yet, probably
because I cannot send the same SQL queries to the non-production server.

Just to provide exact information: these are links to ls -lR
pg_multixact/ of both master and slave server.

https://drive.google.com/file/d/0B5sVzeuionyGVER6dnA3a0NkOU0/
https://drive.google.com/file/d/0B5sVzeuionyGUlFibC1jZEVvQzA/
2013/12/10 Alvaro Herrera <alvherre@2ndquadrant.com>:
> Andres Freund wrote:
>
>> > > I think problems should be preventable if you issue a systemwide VACUUM
>> > > FREEZE, but please let others chime in before you execute it.
>> >
>> > I wouldn't freeze anything just yet, at least until the patch to fix
>> > multixact freezing is in.
>>
>> Well, it seems better than getting errors because of multixact members
>> that are gone.
>> Maybe PGOPTIONS='-c vacuum_freeze_table_age=0 -c vacuum_freeze_min_age=1000000' vacuumdb -a
>>  - that ought not to cause problems with current data and should freeze
>> enough to get rid of problematic multis?
>
> TBH I don't feel comfortable with predicting what will it freeze with
> the broken code.
>

You guys were right. After a week this issue occurred again on almost
all slave servers.

slave:
2013-12-17 14:21:20 MSK CONTEXT: xlog redo delete: index
1663/16516/5320124; iblk 8764, heap 1663/16516/18816;
2013-12-17 14:21:20 MSK LOG: file "pg_clog/0370" doesn't exist,
reading as zeroes
2013-12-17 14:21:20 MSK FATAL: MultiXactId 1819308905 has not been
created yet -- apparent wraparound
2013-12-17 14:21:20 MSK CONTEXT: xlog redo delete: index
1663/16516/5320124; iblk 8764, heap 1663/16516/18816;
2013-12-17 14:21:20 MSK LOG: startup process (PID 13622) exited with exit code 1

I had to fix something on the master since all slaves were affected, so
the only idea was to perform VACUUM FREEZE on the master.

I believe that was not a good idea. I suppose "vacuum freeze" led
to the following errors on the master:
2013-12-17 13:15:34 EET 172.18.10.44 ruprom ERROR: could not access
status of transaction 8407326
2013-12-17 13:15:34 EET 172.18.10.44 ruprom DETAIL: Could not open
file "pg_multixact/members/A458": No such file or directory.

The only way out was to perform a full backup/restore, which did not
succeed, failing with the same error (could not access status of transaction
xxxxxxx).
A very ugly hack was to copy pg_multixact/members/0000 ->
pg_multixact/members/[ABCDF]xxx; it helped to do the full backup, but I'm not
sure about the consistency of the data.


My question is: is there any quick-and-dirty solution to disable
pg_multixact deletion? I understand it may lead to wasted space.
Hi,

On 2013-12-18 16:20:48 +0200, Serge Negodyuck wrote:
> You guys were right. After a week this issue occured again on almost
> all slave servers.

Not surprising.

> My question is are there any quick-and-dirty solution to disable
> pg_multixact deletion? I understand it may lead to waste of space.

Your problem isn't the deletion, but that a) multixact members wrapped
around, partially overwriting old data, and b) some multixacts contain
too-old members (cf. 9.3.2).

> The only way out was to perform full backup/restore, which did not
> succeed with teh same error (could not access status of transaction
> xxxxxxx)
> A very ugly hack was to copy pg_multixact/members/0000 ->
> pg_multixact/members/[ABCDF]xxx, it helped to do full backup, but not
> sure about consistency of data.

That will likely cause corruption.

I think your best bet is to hack GetMultiXactIdMembers to return -1 for
old multixacts, on the theory that those rows would have been removed by
VACUUM if they had a valid xmax.

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
Andres Freund wrote:

> On 2013-12-09 17:49:34 +0200, Serge Negodyuck wrote:
> > On master there are files from 0000 to 14078
> >
> > On slave there were absent files from A1xx to FFFF
> > They were  the oldest ones. (October, November)
>
> Some analysis later, I am pretty sure that the origin is a longstanding
> problem and not connected to 9.3.[01] vs 9.3.2.
>
> The above referenced 14078 file is exactly the last page before a
> members wraparound:
> (gdb) p/x (1L<<32)/(MULTIXACT_MEMBERS_PER_PAGE * SLRU_PAGES_PER_SEGMENT)
> $10 = 0x14078
>
> So, what happened is that enough multixacts where created, that the
> members slru wrapped around. It's not unreasonable for the members slru
> to wrap around faster then the offsets one - after all we create at
> least two entries into members for every offset entry. Also in 9.3+
> there fit more xids on a offset than a members page.
> When truncating, we first read the offset, to know where we currently
> are in members, and then truncate both from their respective
> point. Since we've wrapped around in members we very well might remove
> content we actually need.
>
> I've recently remarked that I find it dangerous that we only do
> anti-wraparound stuff for pg_multixact/offsets, not for /members. So,
> here we have the proof that that's bad.

I have applied three patches to deal with some of the problems reported
here, and others discovered during the investigation of them.  One of
them was about failing to truncate files beyond FFFF.  That should work
fine now -- that is, you would lose more data.  Unless we consider a
second fix, which is that files corresponding to data still in use are
no longer truncated.

The third fix (to enable the system to wrap around sanely from file 14078
to 0000) was necessary so I could reproduce the issues.  In systems with
assertions enabled, there is a crash at the point of overflow.  I didn't
try, but since your system appears to have wrapped around I imagine it
sort-of works in systems compiled without assertions (which is the
recommended setting for production systems).

One thing not yet patched is overrun of members' SLRU: if you have
enough live multixacts with enough members, creating a new one might
overwrite the members area used by an older multixact.  Freezing multis
earlier would help with that.  With the default settings, where multis
are frozen when they are 50 million multis old and pages are 8kB long,
there is room for 85 members per multi on average without such
overrun[*].  I was able to observe this overrun by running Andres'
pg_multixact_burn with each multixact having 100 members.  I doubt it's
common to have that many members in each multixact on average, but it's
certainly a possibility.

[*] There are 82040 files, having 32 pages each; each page has room for
1636 members.  (82040 * 32 * 1636) / 50000000 =~ 85.

One complaint somebody might rightly have about this is the space
consumption by pg_multixact/ files.  Perhaps instead of using Xid's
freezing horizon verbatim, we should use ceil(min_freeze_age^0.8) or
something like that; so for the default 50000000 Xid freezing age, we
would freeze multis over 1.44 million multis old.  (We could determine the
ratio of xid to multi usage so that they would both freeze after the same
amount of time has elapsed, but this seems unnecessarily complicated.)
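
For reference, both figures above (the 85-members-per-multi headroom from the
footnote and the ~1.44 million multi horizon from min_freeze_age^0.8) check
out; a quick verification using only the values already given (build with
cc -lm):

    #include <stdio.h>
    #include <math.h>

    int
    main(void)
    {
        /* footnote [*]: total member slots divided by the freeze horizon */
        double headroom = (82040.0 * 32 * 1636) / 50000000;

        /* proposed multixact freeze horizon: min_freeze_age ^ 0.8 */
        double multi_horizon = pow(50000000.0, 0.8);

        printf("members per multi without overrun: %.1f\n", headroom);  /* ~85.9 */
        printf("multi freeze horizon: %.0f\n", multi_horizon);          /* ~1.44e6 */
        return 0;
    }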

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Hello,

I've upgraded postgresql to version 9.3.4 and did a fresh initdb and restored
the database from an SQL backup.
According to the 9.3.4 changelog, the issue with multixact wraparound was fixed.


After a month it seems I have almost the same issue.


ls -lat pg_multixact/members

-rw------- 1 postgres postgres 262144 Jun 2 15:37 000D
-rw------- 1 postgres postgres 262144 Jun 2 14:46 000C
-rw------- 1 postgres postgres 262144 Jun 2 14:46 000B
-rw------- 1 postgres postgres 262144 Jun 2 14:46 000A
-rw------- 1 postgres postgres 262144 Jun 2 14:46 0009
-rw------- 1 postgres postgres 262144 Jun 2 13:22 0008
-rw------- 1 postgres postgres 262144 Jun 2 13:22 0007
-rw------- 1 postgres postgres 262144 Jun 2 12:54 0006
-rw------- 1 postgres postgres 262144 Jun 2 12:49 0005
-rw------- 1 postgres postgres 262144 Jun 2 12:43 0004
-rw------- 1 postgres postgres 262144 Jun 2 12:33 0003
-rw------- 1 postgres postgres 262144 Jun 2 12:23 0002
-rw------- 1 postgres postgres 262144 Jun 2 12:10 0001
-rw------- 1 postgres postgres 262144 Jun 2 10:37 0000
-rw------- 1 postgres postgres 262144 Jun 2 08:44 14078
-rw------- 1 postgres postgres 262144 Jun 2 08:44 14077
-rw------- 1 postgres postgres 262144 Jun 2 08:20 14076


/var/log/postgresql/postgresql.log:

2014-06-02 08:20:55 EEST 172.18.10.4 db PANIC: could not access status of
transaction 2080547
2014-06-02 08:20:55 EEST 172.18.10.4 db DETAIL: Could not open file
"pg_multixact/members/14078": No such file or directory.
2014-06-02 08:20:55 EEST 172.18.10.4 db CONTEXT: SQL statement "  UPDATE
....."
2014-06-02 08:22:30 EEST LOG: server process (PID 51704) was terminated by
signal 6: Aborted
.......
2014-06-02 08:22:30 EEST LOG: all server processes terminated;
reinitializing
2014-06-02 08:22:30 EEST LOG: database system was interrupted; last known
up at 2014-06-02 08:22:16 EEST
2014-06-02 08:22:30 EEST LOG: database system was not properly shut down;
automatic recovery in progress
2014-06-02 08:22:30 EEST LOG: redo starts at 55/26D22820
2014-06-02 08:22:30 EEST 172.18.10.17 db FATAL: the database system is in
recovery mode
2014-06-02 08:22:30 EEST FATAL: could not access status of transaction
2080547
2014-06-02 08:22:30 EEST DETAIL: Could not read from file
"pg_multixact/members/14078" at offset 24576: Success.
2014-06-02 08:22:30 EEST CONTEXT: xlog redo create mxid 2080547 offset
4294961608 nmembers 8684: 6193231 (keysh) 6193233 (fornokeyupd) 6193234
(keysh) 6193235 (fornokeyupd) 6193236 (keysh) 6193237 (fornokeyupd) 6193238
(keysh) 6193239 (fornokeyupd) 6193240 (keysh) 6193241 (fornokeyupd) 6193242
(keysh) 6193243 (fornokeyupd) 6193244 (keysh) 6193245 (fornokeyupd) 6193246
(keysh) 6193247 (fornokeyupd) 6193248 (keysh) 6193249 (fornokeyupd) 6193250
(keysh) 6193251 (fornokeyupd) 6193252 (keysh) 6193253 (fornokeyupd) 6193254
(keysh) 6193255 (fornokeyupd) 6193256 (keysh) 6193257 .......
2014-06-02 08:27:40 EEST LOG: startup process (PID 55275) exited with exit
code 1
2014-06-02 08:27:40 EEST LOG: aborting startup due to startup process
failure
2014-06-02 08:32:46 EEST LOG: database system was interrupted while in
recovery at 2014-06-02 08:27:39 EEST
2014-06-02 08:32:46 EEST HINT: This probably means that some data is
corrupted and you will have to use the last backup for recovery.
2014-06-02 08:32:46 EEST LOG: database system was not properly shut down;
automatic recovery in progress
2014-06-02 08:32:46 EEST LOG: redo starts at 55/26D22820
2014-06-02 08:32:46 EEST FATAL: could not access status of transaction
2080547
2014-06-02 08:32:46 EEST DETAIL: Could not read from file
"pg_multixact/members/14078" at offset 24576: Success.
2014-06-02 08:32:46 EEST CONTEXT: xlog redo create mxid 2080547 offset
4294961608 nmembers 8684: 6193231 (keysh) ....

... And so on.

An ugly hack "cp pg_multixact/members/14077 pg_multixact/members/14078"
helped me to start master server in replica.


Then I did pg_basebackup to the slave database. It did not help:
2014-06-02 09:58:49 EEST 172.18.10.17 db2 DETAIL: Could not open file
"pg_multixact/members/1112D": No such file or directory.
2014-06-02 09:58:49 EEST 172.18.10.18 db2 DETAIL: Could not open file
"pg_multixact/members/11130": No such file or directory.
2014-06-02 09:58:51 EEST 172.18.10.34 db2 DETAIL: Could not open file
"pg_multixact/members/11145": No such file or directory.
2014-06-02 09:58:51 EEST 172.18.10.38 db2 DETAIL: Could not open file
"pg_multixact/members/13F76": No such file or directory

What additional information should I provide?
If I increase autovacuum_multixact_freeze_max_age, will it help? (Right now I
have the default value.)
Serge Negodyuck wrote:
> Hello,
>
> I've upgraded postgresql to version 9.3.4 and did fresh initdb and restored
> database from sql backup.
> According to 9.4.3 changelog issue with multixact wraparound was fixed.

Ouch.  This is rather strange.  First I see the failing multixact has
8684 members, which is totally unusual.  My guess is that you have code
that creates lots of subtransactions, and perhaps does something to one
tuple in a different subtransaction; doing something like that would be,
I think, the only way to get multixacts that large.  Does that sound
right?

> An ugly hack "cp pg_multixact/members/14077 pg_multixact/members/14078"
> helped me to start master server in replica.

This is the second weird thing.  How come you needed to create 14078
when you already had that file, according to the listing you pasted
above?  Or is the 14078 file present in the listing only because you
copied it?

> Then, did pg_basebackup to slave database. It does not help
> 2014-06-02 09:58:49 EEST 172.18.10.17 db2 DETAIL: Could not open file
> "pg_multixact/members/1112D": No such file or directory.
> 2014-06-02 09:58:49 EEST 172.18.10.18 db2 DETAIL: Could not open file
> "pg_multixact/members/11130": No such file or directory.
> 2014-06-02 09:58:51 EEST 172.18.10.34 db2 DETAIL: Could not open file
> "pg_multixact/members/11145": No such file or directory.
> 2014-06-02 09:58:51 EEST 172.18.10.38 db2 DETAIL: Could not open file
> "pg_multixact/members/13F76": No such file or directory

Are these the only files missing?  Are intermediate files there?

> What additional information should I provide?
> If I will increase autovacuum_multixact_freeze_max_age will it help? (Now I
> have default value)

No, you'd need to *decrease* the freezing age, so that things are
cleaned up sooner.  If you increase the freezing age, the problematic
multis would persist longer, causing more trouble.
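
For example, something along these lines in postgresql.conf makes autovacuum
freeze multixacts sooner than the default (the value is purely illustrative;
pick one that matches how fast this workload consumes multixacts):

    # postgresql.conf
    autovacuum_multixact_freeze_max_age = 100000000   # lower than the shipped default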


I think I will have to reproduce your scenario to be able to understand
what is going on.  If you can describe it in detail, that would make it
easier for me.

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
2014-06-02 17:10 GMT+03:00 Alvaro Herrera <alvherre@2ndquadrant.com>:

> Serge Negodyuck wrote:
> > Hello,
> >
> > I've upgraded postgresql to version 9.3.4 and did fresh initdb and
> restored
> > database from sql backup.
> > According to 9.4.3 changelog issue with multixact wraparound was fixed.
>
> Ouch.  This is rather strange.  First I see the failing multixact has
> 8684 members, which is totally unusual.  My guess is that you have code
> that creates lots of subtransactions, and perhaps does something to one
> tuple in a different subtransaction; doing sometihng like that would be,
> I think, the only way to get subxacts that large.  Does that sound
> right?
>
It sounds like you are right. I've found a lot of inserts in the logs. Each
insert causes a trigger to be fired. This trigger updates a counter in
another table.
It is very possible this trigger tries to update the same counter for
different inserts.



> > An ugly hack "cp pg_multixact/members/14077 pg_multixact/members/14078"
> > helped me to start master server in replica.
>
> This is the second weird thing.  How come you needed to create 14078
> when you already had that file, according to the listing you pasted
> above?  Or is the 14078 file present in the listing only because you
> copied it?
>
The 14078 file is present in the listing only because I copied it.
It was the only corrupt file on the master server.



>
> > Then, did pg_basebackup to slave database. It does not help
> > 2014-06-02 09:58:49 EEST 172.18.10.17 db2 DETAIL: Could not open file
> > "pg_multixact/members/1112D": No such file or directory.
> > 2014-06-02 09:58:49 EEST 172.18.10.18 db2 DETAIL: Could not open file
> > "pg_multixact/members/11130": No such file or directory.
> > 2014-06-02 09:58:51 EEST 172.18.10.34 db2 DETAIL: Could not open file
> > "pg_multixact/members/11145": No such file or directory.
> > 2014-06-02 09:58:51 EEST 172.18.10.38 db2 DETAIL: Could not open file
> > "pg_multixact/members/13F76": No such file or directory
>
> Are these the only files missing?  Are intermediate files there?
>


Only files 0000 - 001E were present on the slave server.



>
> > What additional information should I provide?
> > If I will increase autovacuum_multixact_freeze_max_age will it help?
> (Now I
> > have default value)
>
> No, you'd need to *decrease* the freezing age, so that things are
> cleaned up sooner.  If you increase the freezing age, the problematic
> multis would persist longer, causing more trouble.
>
>
> I think I will have to reproduce your scenario to be able to understand
> what is going on.  If you can describe it in detail, that would make it
> easier for me.
>
Try this simplified schema:

CREATE TABLE company (
id integer NOT NULL,
num_products integer );

CREATE TABLE product (
id integer NOT NULL,
company_id integer );

ALTER TABLE ONLY company ADD CONSTRAINT company_pkey PRIMARY KEY (id);
ALTER TABLE ONLY product ADD CONSTRAINT product_pkey PRIMARY KEY (id);
ALTER TABLE ONLY product ADD CONSTRAINT product_company_id_fk FOREIGN KEY
(company_id) REFERENCES company(id) ON DELETE CASCADE;

CREATE FUNCTION update_num_products() RETURNS trigger
LANGUAGE plpythonu
AS $$

old, new, event, when, table_name = (TD["old"], TD["new"], TD["event"],
                                     TD["when"], TD["table_name"])

def update_company(company_id, delta):
    plpy.execute("""
        UPDATE company
        SET num_products = coalesce(num_products, 0) + %(delta)s
        WHERE id = %(company_id)s
    """ % {'company_id': company_id, 'delta': delta})

if table_name.lower() == 'product':
    if when == 'AFTER' and event == 'INSERT':
        # The simplified "product" table above has no status/visible_on_site
        # columns, so fetch them with .get() to keep the original condition
        # from raising a KeyError.
        if new.get('status', 0) == 0 and new.get('visible_on_site') is not False:
            update_company(new['company_id'], 1)
return 'OK'
$$;

CREATE TRIGGER update_num_products AFTER INSERT OR UPDATE ON product FOR
EACH ROW EXECUTE PROCEDURE update_num_products();
INSERT into company VALUES (0,1);

Do a lot of inserts into product with the same product.company_id;
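
If it helps, a throwaway libpq driver for that last step might look like the
following (connection settings come from the usual PG* environment variables;
run two or three copies in parallel with different arguments, e.g. ./burn 1
and ./burn 2, so the trigger's UPDATEs of the single company row actually
contend - the row count and id scheme are arbitrary):

    #include <stdio.h>
    #include <stdlib.h>
    #include <libpq-fe.h>

    int
    main(int argc, char **argv)
    {
        /* give each parallel copy its own id range to avoid product_pkey clashes */
        int     base = (argc > 1) ? atoi(argv[1]) * 1000000 : 0;
        PGconn *conn = PQconnectdb("");     /* PGHOST/PGDATABASE/PGUSER/... */

        if (PQstatus(conn) != CONNECTION_OK)
        {
            fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));
            return 1;
        }

        for (int i = 1; i <= 100000; i++)
        {
            char      sql[128];
            PGresult *res;

            /* every row points at company id 0, so every trigger invocation
             * updates the same counter row */
            snprintf(sql, sizeof(sql),
                     "INSERT INTO product (id, company_id) VALUES (%d, 0)",
                     base + i);
            res = PQexec(conn, sql);
            if (PQresultStatus(res) != PGRES_COMMAND_OK)
                fprintf(stderr, "insert %d failed: %s", base + i,
                        PQerrorMessage(conn));
            PQclear(res);
        }

        PQfinish(conn);
        return 0;
    }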
Serge Negodyuck wrote:
> 2014-06-02 17:10 GMT+03:00 Alvaro Herrera <alvherre@2ndquadrant.com>:
>
> > Serge Negodyuck wrote:
> > > Hello,
> > >
> > > I've upgraded postgresql to version 9.3.4 and did fresh initdb and
> > restored
> > > database from sql backup.
> > > According to 9.4.3 changelog issue with multixact wraparound was fixed.
> >
> > Ouch.  This is rather strange.  First I see the failing multixact has
> > 8684 members, which is totally unusual.  My guess is that you have code
> > that creates lots of subtransactions, and perhaps does something to one
> > tuple in a different subtransaction; doing sometihng like that would be,
> > I think, the only way to get subxacts that large.  Does that sound
> > right?
> >
> It sounds like you are right. I've found a lot of inserts in logs. Each
> insert cause trigger to be performed. This  trigger updates counter in
> other table.
> It is very possible this tirgger tries to update the same counter for
> different inserts.

I wasn't able to reproduce it that way, but I eventually figured out
that if I altered the plpython function to grab a FOR NO KEY
UPDATE lock first, insertion would grow the multixact beyond reasonable
limits; see the attachment.  If you then INSERT many tuples in "product"
in a single transaction, the resulting xmax is a Multixact that has as
many members as inserts there are, plus one.

(One variation that causes even more bizarre results is dispensing with
the plpy.subtransaction() in the function and instead setting a
savepoint before each insert.  In fact, given the multixact members
shown in your log snippet I think that's more similar to what your code
does.)

> > > Then, did pg_basebackup to slave database. It does not help
> > > 2014-06-02 09:58:49 EEST 172.18.10.17 db2 DETAIL: Could not open file
> > > "pg_multixact/members/1112D": No such file or directory.
> > > 2014-06-02 09:58:49 EEST 172.18.10.18 db2 DETAIL: Could not open file
> > > "pg_multixact/members/11130": No such file or directory.
> > > 2014-06-02 09:58:51 EEST 172.18.10.34 db2 DETAIL: Could not open file
> > > "pg_multixact/members/11145": No such file or directory.
> > > 2014-06-02 09:58:51 EEST 172.18.10.38 db2 DETAIL: Could not open file
> > > "pg_multixact/members/13F76": No such file or directory
> >
> > Are these the only files missing?  Are intermediate files there?
>
> Only 0000 - 001E files were present on slave server.

I don't understand how files can be missing in the replica.
pg_basebackup simply copies all files it can find in the master to the
replica, so if the 111xx files are present in the master they should
certainly be present in the replica as well.  I gave the pg_basebackup
code a look just to be sure there is no 4-char pattern matching or
something like that, and it doesn't look like it attempts to do that at
all.  I also asked Magnus just to be sure and he confirms this.

I'm playing a bit more with this test case, I'll let you know where it
leads.

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachment
Alvaro Herrera wrote:

> I wasn't able to reproduce it that way, but I eventually figured out
> that if I altered the plpython function to grab a FOR NO KEY
> UPDATE lock first, insertion would grow the multixact beyond reasonable
> limits; see the attachment.  If you then INSERT many tuples in "product"
> in a single transaction, the resulting xmax is a Multixact that has as
> many members as inserts there are, plus one.

Bah.  I managed to submit a file with the lock_company() call commented
out in the plpython function.  That one doesn't do anything interesting;
you need to uncomment that line.

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Serge Negodyuck wrote:

> 2014-06-02 08:20:55 EEST 172.18.10.4 db PANIC: could not access status of
> transaction 2080547
> 2014-06-02 08:20:55 EEST 172.18.10.4 db DETAIL: Could not open file
> "pg_multixact/members/14078": No such file or directory.
> 2014-06-02 08:20:55 EEST 172.18.10.4 db CONTEXT: SQL statement "  UPDATE
> ....."

So as it turns out, this was caused because the arithmetic to handle the
wraparound case neglected to handle multixacts with more members than
the number that fit in the last page(s) of the last segment, leading to
a number of pages in the 14078 segment (or whatever the last segment is
for a given BLCKSZ) failing to be initialized.  This patch is a rework
of that arithmetic, although it seems a little bit too obscure.

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachment
Alvaro Herrera wrote:
> Serge Negodyuck wrote:
>
> > 2014-06-02 08:20:55 EEST 172.18.10.4 db PANIC: could not access status of
> > transaction 2080547
> > 2014-06-02 08:20:55 EEST 172.18.10.4 db DETAIL: Could not open file
> > "pg_multixact/members/14078": No such file or directory.
> > 2014-06-02 08:20:55 EEST 172.18.10.4 db CONTEXT: SQL statement "  UPDATE
> > ....."
>
> So as it turns out, this was caused because the arithmetic to handle the
> wraparound case neglected to handle multixacts with more members than
> the number that fit in the last page(s) of the last segment, leading to
> a number of pages in the 14078 segment (or whatever the last segment is
> for a given BLCKSZ) to fail to be initialized.  This patch is a rework
> of that arithmetic, although it seems little bit too obscure.

After some simplification I think it should be clearer.  Thanks Andres
for commenting offlist.

There is a different way to compute the "difference" proposed by Andres,
without using if/else; the idea is to cast the values to uint64 and then
clamp.  It would be something like

    uint64    diff64;

    diff64 = Min((uint64) offset + nmembers,
            (uint64) offset + MULTIXACT_MEMBERS_PER_PAGE);
    difference = (uint32) Min(diff64, MaxMultiXactOffset);

(There are other ways to formulate this, of course, but this seems to me
to be the most readable one).  I am undecided between this and the one I
propose in the patch, so I've stuck with the patch.

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachment
Alvaro Herrera <alvherre@2ndquadrant.com> writes:
> After some simplification I think it should be clearer.  Thanks Andres
> for commenting offlist.

I find this not only unreadable but a bit scary:

!         if (offset + MAX_MEMBERS_IN_LAST_MEMBERS_PAGE < offset)

I'm worried whether some compilers might not decide that that's a
can't-happen situation, and optimize the test away entirely.  I think
Andres' proposal is probably safer (though it needs comments).

            regards, tom lane
On 2014-06-06 13:04:29 -0400, Tom Lane wrote:
> Alvaro Herrera <alvherre@2ndquadrant.com> writes:
> > After some simplification I think it should be clearer.  Thanks Andres
> > for commenting offlist.
>
> I find this not only unreadable but a bit scary:
>
> !         if (offset + MAX_MEMBERS_IN_LAST_MEMBERS_PAGE < offset)
>
> I'm worried whether some compilers might not decide that that's a
> can't-happen situation, and optimize the test away entirely.

offset is a (via typedefs) uint32 and unsigned overflow is defined in
the C standard, so that shouldn't be an allowed optimization. Don't we
already have a fair amount of similar tests around? Around TransactionId
wraparound, for example:
    xidWrapLimit = oldest_datfrozenxid + (MaxTransactionId >> 1);
    if (xidWrapLimit < FirstNormalTransactionId)
        xidWrapLimit += FirstNormalTransactionId;
and similar things.

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
Tom Lane wrote:
> Alvaro Herrera <alvherre@2ndquadrant.com> writes:
> > After some simplification I think it should be clearer.  Thanks Andres
> > for commenting offlist.
>
> I find this not only unreadable but a bit scary:
>
> !         if (offset + MAX_MEMBERS_IN_LAST_MEMBERS_PAGE < offset)
>
> I'm worried whether some compilers might not decide that that's a
> can't-happen situation, and optimize the test away entirely.  I think
> Andres' proposal is probably safer (though it needs comments).

Well, somebody would have to work it out fully and test it -- I can't
spend more time on this issue.  If somebody is interested in doing that,
I used this little program and make sure that it "zeroes" all pages, for
various block sizes, initial offsets and "step sizes" (#members in each
multi) -- I was able to reproduce Serge's issues with the original
coding, and then watch issues fly by as I tried several approaches until
I settled on the patch as submitted.

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachment
Serge Negodyuck wrote:

Hi,

> /var/log/postgresql/postgresql.log:
>
> 2014-06-02 08:20:55 EEST 172.18.10.4 db PANIC: could not access status of
> transaction 2080547
> 2014-06-02 08:20:55 EEST 172.18.10.4 db DETAIL: Could not open file
> "pg_multixact/members/14078": No such file or directory.
> 2014-06-02 08:20:55 EEST 172.18.10.4 db CONTEXT: SQL statement "  UPDATE
> ....."

Pushed a fix for this.  Thanks for the report.

> 2014-06-02 08:22:30 EEST FATAL: could not access status of transaction
> 2080547
> 2014-06-02 08:22:30 EEST DETAIL: Could not read from file
> "pg_multixact/members/14078" at offset 24576: Success.
> 2014-06-02 08:22:30 EEST CONTEXT: xlog redo create mxid 2080547 offset
> 4294961608 nmembers 8684: 6193231 (keysh) 6193233 (fornokeyupd) 6193234
> (keysh) 6193235 (fornokeyupd) 6193236 (keysh) 6193237 (fornokeyupd) 6193238
> (keysh) 6193239 (fornokeyupd) 6193240 (keysh) 6193241 (fornokeyupd) 6193242
> (keysh) 6193243 (fornokeyupd) 6193244 (keysh) 6193245 (fornokeyupd) 6193246
> (keysh) 6193247 (fornokeyupd) 6193248 (keysh) 6193249 (fornokeyupd) 6193250
> (keysh) 6193251 (fornokeyupd) 6193252 (keysh) 6193253 (fornokeyupd) 6193254
> (keysh) 6193255 (fornokeyupd) 6193256 (keysh) 6193257 .......

I find this bit rather odd.  Normally the system shouldn't create
multixacts this large.  I think we might be missing a trick here somewhere.
I imagine inserting the last few items is slow, isn't it?

> An ugly hack "cp pg_multixact/members/14077 pg_multixact/members/14078"
> helped me to start master server in replica.
>
>
> Then, did pg_basebackup to slave database. It does not help
> 2014-06-02 09:58:49 EEST 172.18.10.17 db2 DETAIL: Could not open file
> "pg_multixact/members/1112D": No such file or directory.
> 2014-06-02 09:58:49 EEST 172.18.10.18 db2 DETAIL: Could not open file
> "pg_multixact/members/11130": No such file or directory.
> 2014-06-02 09:58:51 EEST 172.18.10.34 db2 DETAIL: Could not open file
> "pg_multixact/members/11145": No such file or directory.
> 2014-06-02 09:58:51 EEST 172.18.10.38 db2 DETAIL: Could not open file
> "pg_multixact/members/13F76": No such file or directory

This is strange also; if the files are present in master, how come they
weren't copied to the replica?  I think we need more info about this
problem.

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Fri, Jun 6, 2014 at 9:47 AM, Alvaro Herrera <alvherre@2ndquadrant.com>
wrote:

> Alvaro Herrera wrote:
> > Serge Negodyuck wrote:
> >
> > > 2014-06-02 08:20:55 EEST 172.18.10.4 db PANIC: could not access status
> of
> > > transaction 2080547
> > > 2014-06-02 08:20:55 EEST 172.18.10.4 db DETAIL: Could not open file
> > > "pg_multixact/members/14078": No such file or directory.
> > > 2014-06-02 08:20:55 EEST 172.18.10.4 db CONTEXT: SQL statement "
>  UPDATE
> > > ....."
> >
> > So as it turns out, this was caused because the arithmetic to handle the
> > wraparound case neglected to handle multixacts with more members than
> > the number that fit in the last page(s) of the last segment, leading to
> > a number of pages in the 14078 segment (or whatever the last segment is
> > for a given BLCKSZ) to fail to be initialized.  This patch is a rework
> > of that arithmetic, although it seems little bit too obscure.
>
> After some simplification I think it should be clearer.  Thanks Andres
> for commenting offlist.
>
> There is a different way to compute the "difference" proposed by Andres,
> without using if/else; the idea is to cast the values to int64 and then
> clamp.  It would be something like
>
>         uint64  diff64;
>
>         diff64 = Min((uint64) offset + nmembers,
>                         (uint64) offset + MULTIXACT_MEMBERS_PER_PAGE);
>         difference = (uint32) Min(diff64, MaxMultiXactOffset);
>
> (There are other ways to formulate this, of course, but this seems to me
> to be the most readable one).  I am undecided between this and the one I
> propose in the patch, so I've stuck with the patch.
>

This patch seems to solve a problem I've also been having with non-existent
"pg_multixact/members/13D35" files in my testing of torn page write and fk
locks again recent 9.4.  However, sometimes I've found the problem even
before multi wrap around occured, and my reading of this thread suggests
that the current issue would not show up in that case, so I don't know how
much comfort to take in the knowledge that the problem seems to have gone
away--as they say, things that go away by themselves without explanation
also tend to come back by themselves.

Anyone wishing to investigate further can find the testing harness and the
tarball of a bad data directory here:

https://drive.google.com/folderview?id=0Bzqrh1SO9FcEb1FNcm52aEMtWTA&usp=sharing

Starting up the bad data directory should nominally succeed, but then the
first database wide vacuum should fail when it hits the bad tuple.

Recent changes to my testing harness were:

Making some of the transactions randomly rollback in the test harness, to
test transaction aborts not associated with server crashes.

Make the p_id in the child table get changed only half the time, not all
the time, so that a mixture of HOT updates and non-HOT updates get tested.

Add a patch to allow fast-forward of multixact and add code to my harness
to trigger that patch.  It is possible that this patch itself is causing
the problem, but I suspect it is just accelerating the discovery of it.
 (By the way, I did this wrong for my original intent, as I had it create
quite large multixacts, when I should have had it create more of them but
smaller, so wrap around would occur more often.  But I don't want to change
it now until the problem is resolved)

Add a delay.pl which delivers SIGSTOP to random postgres processes and then
waits a while to SIGCONT them, to try to uncover unlikely races.
 Surprisingly this does not slow things by very much.  I thought it would
be frequent that processes got interrupted while holding important locks,
but it actually seems to be rare.

Turn on archive_mode and set wal_level to logical.

The problem is that the last two of them (the delay.pl and the archive) I
can reverse and still see the problem, while the other 3 were extensively
tested elsewhere without seeing a problem, so I can't figure out what the
trigger is.

When the problem shows up it takes anywhere from 1 to 13 hours to do so.

Anyway, if no one has opinions to the contrary I think I will just assume
this is fixed now and move on to other tests.

Cheers,

Jeff
Jeff Janes wrote:

> This patch seems to solve a problem I've also been having with non-existent
> "pg_multixact/members/13D35" files in my testing of torn page write and fk
> locks again recent 9.4.  However, sometimes I've found the problem even
> before multi wrap around occured, and my reading of this thread suggests
> that the current issue would not show up in that case, so I don't know how
> much comfort to take in the knowledge that the problem seems to have gone
> away--as they say, things that go away by themselves without explanation
> also tend to come back by themselves.

I think this problem is rather related to the code touched by commit
722acf51a0d074d19782ad7e97ebe3fdb63fbb87, which is about removing
unneeded files in pg_multixact/members.  I checked your files, and it
seems that pg_multixact/members files that should still exist are gone;
for instance, pg_controldata says that the oldest multi is 2586314407,
but if I run
select * from pg_get_multixact_members('that value')
I get an error saying that pg_multixact/members/218B does not exist.

It seems that there are a lot of files that were improperly removed.
I expect that existing files should be sequential, but there are a lot
of holes in a listing of pg_multixact/members.

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Serge Negodyuck wrote:
> 2014-06-09 22:49 GMT+03:00 Alvaro Herrera <alvherre@2ndquadrant.com>:

> > > An ugly hack "cp pg_multixact/members/14077 pg_multixact/members/14078"
> > > helped me to start master server in replica.
> > >
> > > Then, did pg_basebackup to slave database. It does not help
> > > 2014-06-02 09:58:49 EEST 172.18.10.17 db2 DETAIL: Could not open file
> > > "pg_multixact/members/1112D": No such file or directory.
> > > 2014-06-02 09:58:49 EEST 172.18.10.18 db2 DETAIL: Could not open file
> > > "pg_multixact/members/11130": No such file or directory.
> > > 2014-06-02 09:58:51 EEST 172.18.10.34 db2 DETAIL: Could not open file
> > > "pg_multixact/members/11145": No such file or directory.
> > > 2014-06-02 09:58:51 EEST 172.18.10.38 db2 DETAIL: Could not open file
> > > "pg_multixact/members/13F76": No such file or directory
> >
> > This is strange also; if the files are present in master, how come they
> > weren't copied to the replica?  I think we need more info about this
> > problem.
> I've thoroughly looked through the logs once again and I have not
> found anything interesting.
> I just know there were very few pg_multixact/members files starting
> from 0000. It was on both slave servers. So I've observed this issue
> two times.

But it was OK on master, right?  It had the full set of files and no
file got removed improperly?  (minus 14078, which wasn't created on time,
of course)

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
2014-06-09 22:49 GMT+03:00 Alvaro Herrera <alvherre@2ndquadrant.com>:
>
> Pushed a fix for this.  Thanks for the report.
Thank you!

>
> > 2014-06-02 08:22:30 EEST FATAL: could not access status of transaction
> > 2080547
> > 2014-06-02 08:22:30 EEST DETAIL: Could not read from file
> > "pg_multixact/members/14078" at offset 24576: Success.
> > 2014-06-02 08:22:30 EEST CONTEXT: xlog redo create mxid 2080547 offset
> > 4294961608 nmembers 8684: 6193231 (keysh) 6193233 (fornokeyupd) 6193234
> > (keysh) 6193235 (fornokeyupd) 6193236 (keysh) 6193237 (fornokeyupd) 6193238
> > (keysh) 6193239 (fornokeyupd) 6193240 (keysh) 6193241 (fornokeyupd) 6193242
> > (keysh) 6193243 (fornokeyupd) 6193244 (keysh) 6193245 (fornokeyupd) 6193246
> > (keysh) 6193247 (fornokeyupd) 6193248 (keysh) 6193249 (fornokeyupd) 6193250
> > (keysh) 6193251 (fornokeyupd) 6193252 (keysh) 6193253 (fornokeyupd) 6193254
> > (keysh) 6193255 (fornokeyupd) 6193256 (keysh) 6193257 .......
>
> I find this bit rather odd.  Normally the system shouldn't create
> multixacts this large.  I think we might be missing a trick here somewhere.
> I imagine inserting the last few items is slow, isn't it?

Yes, the duration of inserts had been growing, up to 2.2 seconds before the crash:
2014-06-02 08:20:11 EEST 172.18.10.4 db LOG: duration: 2213.361 ms
statement: INSERT INTO product (...) VALUES (...) RETURNING product.id

Normally inserts fit into 100 ms (log_min_duration_statement).
The same log "xlog redo create mxid 2080547...." was present on master
and both replica servers. Well, this sounds logical.


>
> > An ugly hack "cp pg_multixact/members/14077 pg_multixact/members/14078"
> > helped me to start master server in replica.
> >
> >
> > Then, did pg_basebackup to slave database. It does not help
> > 2014-06-02 09:58:49 EEST 172.18.10.17 db2 DETAIL: Could not open file
> > "pg_multixact/members/1112D": No such file or directory.
> > 2014-06-02 09:58:49 EEST 172.18.10.18 db2 DETAIL: Could not open file
> > "pg_multixact/members/11130": No such file or directory.
> > 2014-06-02 09:58:51 EEST 172.18.10.34 db2 DETAIL: Could not open file
> > "pg_multixact/members/11145": No such file or directory.
> > 2014-06-02 09:58:51 EEST 172.18.10.38 db2 DETAIL: Could not open file
> > "pg_multixact/members/13F76": No such file or directory
>
> This is strange also; if the files are present in master, how come they
> weren't copied to the replica?  I think we need more info about this
> problem.
I've thoroughly looked through the logs once again and I have not
found anything interesting.
I just know there were very few pg_multixact/members files starting
from 0000. It was on both slave servers. So I've observed this issue
two times.

To fix it I had to do pg_dumpall | pg_restore on master.
So, I'm sorry, I have no additional info about this problem.
2014-06-10 16:48 GMT+03:00 Alvaro Herrera <alvherre@2ndquadrant.com>:
> Serge Negodyuck wrote:
>> 2014-06-09 22:49 GMT+03:00 Alvaro Herrera <alvherre@2ndquadrant.com>:
>
>> > > An ugly hack "cp pg_multixact/members/14077 pg_multixact/members/14078"
>> > > helped me to start master server in replica.
>> > >
>> > > Then, did pg_basebackup to slave database. It does not help
>> > > 2014-06-02 09:58:49 EEST 172.18.10.17 db2 DETAIL: Could not open file
>> > > "pg_multixact/members/1112D": No such file or directory.
>> > > 2014-06-02 09:58:49 EEST 172.18.10.18 db2 DETAIL: Could not open file
>> > > "pg_multixact/members/11130": No such file or directory.
>> > > 2014-06-02 09:58:51 EEST 172.18.10.34 db2 DETAIL: Could not open file
>> > > "pg_multixact/members/11145": No such file or directory.
>> > > 2014-06-02 09:58:51 EEST 172.18.10.38 db2 DETAIL: Could not open file
>> > > "pg_multixact/members/13F76": No such file or directory
>> >
>> > This is strange also; if the files are present in master, how come they
>> > weren't copied to the replica?  I think we need more info about this
>> > problem.
>> I've thoroughly looked through the logs once again and I have not
>> found anything interesting.
>> I just know there were very few pg_multixact/members files starting
>> from 0000. It was on both slave servers. So I've observed this issue
>> two times.
>
> But it was OK on master, right?  It had the full set of files and no
> file got remove improperly?  (minus 14078, which wasn't created on time,
> of course)

Yes, all files were present on the master (except for the corrupted file
14078, which was copied from 14077; it was the fastest solution to start
the master server).