Thread: apparent wraparound

apparent wraparound

From

Reece Hart

Date:

14 July 2006, 09:47:01

After a system crash, postgresql 8.1.4 restarted but reported that I
have an apparent wraparound:

2006-07-13 14:03:40 PDT [10092] LOG:  database system was interrupted at 2006-07-13 13:22:19 PDT
2006-07-13 14:03:40 PDT [10092] LOG:  checkpoint record is at 1DD/26283E18
2006-07-13 14:03:40 PDT [10092] LOG:  redo record is at 1DD/26283E18; undo record is at 0/0; shutdown FALSE
2006-07-13 14:03:40 PDT [10092] LOG:  next transaction ID: 169855318; next OID: 787933
2006-07-13 14:03:40 PDT [10092] LOG:  next MultiXactId: 5475264; next MultiXactOffset: 13765525
2006-07-13 14:03:40 PDT [10092] LOG:  database system was not properly shut down; automatic recovery in progress
2006-07-13 14:03:40 PDT [10092] LOG:  record with zero length at 1DD/26283E68
2006-07-13 14:03:40 PDT [10092] LOG:  redo is not required
2006-07-13 14:03:40 PDT [10092] LOG:  could not truncate directory "pg_multixact/offsets": apparent wraparound
2006-07-13 14:03:40 PDT [10092] LOG:  could not truncate directory "pg_multixact/members": apparent wraparound
2006-07-13 14:03:41 PDT [10093]  [unknown]%[unknown] LOG:  connection received: host=[local]
2006-07-13 14:03:41 PDT [10093]  postgres%postgres FATAL:  the database system is starting up
2006-07-13 14:03:41 PDT [10092] LOG:  database system is ready
2006-07-13 14:03:41 PDT [10092] LOG:  transaction ID wrap limit is 1243594092, limited by database "csb-dev"

This is from:
PostgreSQL 8.1.4 on x86_64-unknown-linux-gnu, compiled by GCC gcc (GCC)
4.0.2 20050901 (prerelease) (SUSE Linux)

I'm using autovacuum and it ran around 13:15 on database csb-dev.  I see
no errors in the logs and therefore assume that the vacuum was
successful.

There are two files left in pg_multixact:
-rw-------  1 postgres postgres  16K 2006-07-13 14:13 pg_multixact/members/00D2
-rw-------  1 postgres postgres 144K 2006-07-13 14:13 pg_multixact/offsets/0053

The system crash occurred during scsi rescanning that was initiated by
an admin.  The machine has been extremely stable otherwise and I have no
reason to suspect hardware flakiness.

In
http://www.mail-archive.com/pgsql-general@postgresql.org/msg76635.html
Tom Lane implies that I can probably ignore these messages.

So, the 64K questions: Can I really ignore this?  Is there anything I
can do to ascertain whether it's a false alarm?

Thanks,
Reece

--
Reece Hart, Ph.D.                      rkh@gene.com, http://www.gene.com/
Genentech, Inc.                        650-225-6133 (voice), -5389 (fax)
Bioinformatics and Protein Engineering
1 DNA Way, MS-93                       http://harts.net/reece/
South San Francisco, CA  94080-4990    reece@harts.net, GPG:0x25EC91A0

Re: apparent wraparound

From

Tom Lane

Date:

14 July 2006, 12:48:40

Reece Hart <hart.reece@gene.com> writes:
> After a system crash, postgresql 8.1.4 restarted but reported that I
> have an apparent wraparound:

> ...
> 2006-07-13 14:03:40 PDT [10092] LOG:  next MultiXactId: 5475264; next MultiXactOffset: 13765525
> ...
> 2006-07-13 14:03:40 PDT [10092] LOG:  could not truncate directory "pg_multixact/offsets": apparent wraparound
> 2006-07-13 14:03:40 PDT [10092] LOG:  could not truncate directory "pg_multixact/members": apparent wraparound

> There are two files left in pg_multixact:
> -rw-------  1 postgres postgres  16K 2006-07-13 14:13 pg_multixact/members/00D2
> -rw-------  1 postgres postgres 144K 2006-07-13 14:13 pg_multixact/offsets/0053

That's odd.  Those files correspond to the next MultiXactId and
MultiXactOffset, so there shouldn't have been any such complaint.
[ looks at code... ]  I wonder if this is happening because
shared->latest_page_number hasn't been set up yet when we do the
end-of-recovery checkpoint.

> In
> http://www.mail-archive.com/pgsql-general@postgresql.org/msg76635.html
> Tom Lane implies that I can probably ignore these messages.

No, I was saying that the "invalid server process ID -1" was harmless.
The "apparent wraparound" is a distinct issue, and I'd ask you the same
question I asked Thomas: do you continue to get those log messages
during subsequent checkpoints?

            regards, tom lane

Re: apparent wraparound

From

Reece Hart

Date:

14 July 2006, 16:31:07

Tom Lane wrote:
> I'd ask you the same question I asked Thomas: do you continue to get those log messages
> during subsequent checkpoints?

No, I don't.  The error did not reappear during ~2h of continuous
inserts since my report, didn't reappear after a forced checkpoint
(i.e., via psql), and did not reappear on a recent stop/start cycle.

There was a period when my cron-driven vacuuming was broken and, in
principle, I might have been susceptible to wraparound.  However, I
don't see how we could have had 1B transactions in that period.

One other tidbit: a colleague inadvertently updated ~10M records.  After
this, I started getting errors like:
number of page slots needed (2952496) exceeds max_fsm_pages (500000)
I restored from a backup, but still have:
'number of page slots needed (183248) exceeds max_fsm_pages (50000)'
(I reduced max_fsm_pages after the restore.)

I'm not sure whether the vacuum and fsm info is relevant.

-Reece

--
Reece Hart, http://harts.net/reece/, GPG:0x25EC91A0

Re: apparent wraparound

From

"Gregory S. Williamson"

Date:

15 July 2006, 01:06:15

Reece --

The "number of slots needed exceeds" messages are telling you that the current FSM (Free Space Map) does not have
enoughspace allocated to track all of the old tuples that are to be reused. I suspect that having such a situation
wouldeffect the wraparound issue, since you'd have "dead wood" which hasn't been recycled. 

You need to edit the postgresql.conf file and increase the max_fsm_pages and max_fsm_relations parameters and then
restartpostgres (I think you have to actually stop and restart, as opposed to a reload, but I could be wrong). You may
endup needing to adjust the total amount of RAM allocated to Shared Memory to allow for as large an FSM as you'll need.
Thatrequires a system reboot. 

HTH,

Greg Williamson
DBA
GlobeXplorer LLC

-----Original Message-----
From:    pgsql-general-owner@postgresql.org on behalf of Reece Hart
Sent:    Fri 7/14/2006 12:30 PM
To:    pgsql-general
Cc:
Subject:    Re: [GENERAL] apparent wraparound

Tom Lane wrote:
> I'd ask you the same question I asked Thomas: do you continue to get those log messages
> during subsequent checkpoints?

No, I don't.  The error did not reappear during ~2h of continuous
inserts since my report, didn't reappear after a forced checkpoint
(i.e., via psql), and did not reappear on a recent stop/start cycle.

There was a period when my cron-driven vacuuming was broken and, in
principle, I might have been susceptible to wraparound.  However, I
don't see how we could have had 1B transactions in that period.

One other tidbit: a colleague inadvertently updated ~10M records.  After
this, I started getting errors like:
number of page slots needed (2952496) exceeds max_fsm_pages (500000)
I restored from a backup, but still have:
'number of page slots needed (183248) exceeds max_fsm_pages (50000)'
(I reduced max_fsm_pages after the restore.)

I'm not sure whether the vacuum and fsm info is relevant.

-Reece

--
Reece Hart, http://harts.net/reece/, GPG:0x25EC91A0

---------------------------(end of broadcast)---------------------------
TIP 4: Have you searched our list archives?

               http://archives.postgresql.org

!DSPAM:44b7f15495741414113241!

Re: apparent wraparound

From

"Florian G. Pflug"

Date:

15 July 2006, 13:45:13

Gregory S. Williamson wrote:
> You need to edit the postgresql.conf file and increase the max_fsm_pages and
 > max_fsm_relations parameters and then restart postgres (I think you
 > have to actually stop and restart, as opposed to a reload, but I could be
 > wrong). You may end up needing to adjust the total amount of RAM
 > allocated to Shared Memory to allow for as large an FSM as you'll
 > need. That requires a system reboot.
Actually, it doesn't require a system reboot. On debian/ubunut, and
AFAIK on redhat too, you just need to edit /etc/sysctl.conf, and
then call sysctl -p to update the in-kernel settings.

greetings, Florian Pflug

Re: apparent wraparound

From

"Joshua D. Drake"

Date:

15 July 2006, 14:15:45

Florian G. Pflug wrote:
> Gregory S. Williamson wrote:
>> You need to edit the postgresql.conf file and increase the
>> max_fsm_pages and
>  > max_fsm_relations parameters and then restart postgres (I think you
>  > have to actually stop and restart, as opposed to a reload, but I
> could be
>  > wrong). You may end up needing to adjust the total amount of RAM
>  > allocated to Shared Memory to allow for as large an FSM as you'll
>  > need. That requires a system reboot.
> Actually, it doesn't require a system reboot. On debian/ubunut, and
> AFAIK on redhat too, you just need to edit /etc/sysctl.conf, and
> then call sysctl -p to update the in-kernel settings.

Most modern day unix does not need to reboot...

FreeBSD (although there are a couple it still needs to reboot for IIRC)
Linux
MacOSX

None of these need to reboot to set those parameters.

Sincerely,

Joshua D. Drake



>
> greetings, Florian Pflug
>
> ---------------------------(end of broadcast)---------------------------
> TIP 3: Have you checked our extensive FAQ?
>
>               http://www.postgresql.org/docs/faq
>


--

    === The PostgreSQL Company: Command Prompt, Inc. ===
Sales/Support: +1.503.667.4564 || 24x7/Emergency: +1.800.492.2240
    Providing the most comprehensive  PostgreSQL solutions since 1997
              http://www.commandprompt.com/

Re: apparent wraparound

From

Reece Hart

Date:

18 July 2006, 19:29:16

Greg, Florian, Joshua, Tom-

On Fri, 2006-07-14 at 17:02 -0700, Gregory S. Williamson wrote:
> You need to edit the postgresql.conf file and increase the
> max_fsm_pages and max_fsm_relations parameters and then restart
> postgres

I did this and vacuumed.  I didn't need to up shmmax.  The problem's
disappeared (and, fortunately, my data are still intact).

Thanks everyone for your help.

-Reece

--
Reece Hart, http://harts.net/reece/, GPG:0x25EC91A0

Re: apparent wraparound

From

Tom Lane

Date:

19 July 2006, 21:49:34

I wrote:
> Reece Hart <hart.reece@gene.com> writes:
>> After a system crash, postgresql 8.1.4 restarted but reported that I
>> have an apparent wraparound:
>> 2006-07-13 14:03:40 PDT [10092] LOG:  could not truncate directory "pg_multixact/offsets": apparent wraparound
>> 2006-07-13 14:03:40 PDT [10092] LOG:  could not truncate directory "pg_multixact/members": apparent wraparound

> [ looks at code... ]  I wonder if this is happening because
> shared->latest_page_number hasn't been set up yet when we do the
> end-of-recovery checkpoint.

Indeed, I was able to reproduce this after advancing the next-mxact
counters far enough.  The consequences seem only cosmetic (ie, a scary
log message during recovery), but still worth fixing.  I committed a
patch in HEAD and 8.1.

            regards, tom lane

Problem getting postmaster PID in pg_regress

From

Eric E

Date:

21 July 2006, 13:34:31

Hi all,
    I'm trying to run pg_regress as part of a build of Postgres 8.1 on
Suse Linux Enterprise Server 9.  I keep getting the following error:
./pg_regress: line 264: kill: (25049) - No such process

I chased the problem down to the following lines in pg_regress:

   "$bindir/postmaster" -D "$PGDATA" -F $postmaster_options
>"$LOGDIR/postmaster.log"; 2>&1 &
    postmaster_pid=$!

When I run sh -x pg_regress, I get:
+/usr/src/packages/BUILD/postgresql-8.1.4/src/test/regress/./tmp_check/install//usr/bin/postmaster

\
/usr/src/packages/BUILD/postgresql-8.1.4/src/test/regress/./tmp_check/data
-F -c listen_addresses=
+ postmaster_pid=25049

As I understand it, $! is supposed to return the process id of the last
backgrounded process.  However, in my case it's return a process ID 1
greater than the actual postmaster instance.  I end up with the
following process tree

25048 ?        Ss     0:00
/usr/src/packages/BUILD/postgresql-8.1.4/src/test/regress/./tmp_check/install//usr/bin/postmaster

-D /usr/s
25054 ?        S      0:00  \_ postgres: logger process
25056 ?        S      0:00  \_ postgres: writer process
25057 ?        S      0:00  \_ postgres: stats buffer process
25058 ?        S      0:00      \_ postgres: stats collector process

Then of course when the script tries kill -0 on a 25049 to see if the
postmaster came up, it wrongly concludes that postmaster did not start.

Can anyone help understand this behavior and/or get the proper process ID?

Thanks,

Eric

Re: Problem getting postmaster PID in pg_regress

From

Tom Lane

Date:

21 July 2006, 13:56:51

Eric E <whalesuit@gmail.com> writes:
> I chased the problem down to the following lines in pg_regress:

>    "$bindir/postmaster" -D "$PGDATA" -F $postmaster_options
>    > "$LOGDIR/postmaster.log"; 2>&1 &
                               ^

Is there really a semicolon there?  There should not be (and isn't in
my copy of 8.1)

            regards, tom lane

Re: Problem getting postmaster PID in pg_regress

From

Eric E

Date:

21 July 2006, 15:45:26

Sorry - that semicolon is an artifact of a change I made - I attempted
to put
postmaster_pid=$! inside the executing statement, but as I discovered,
if you put it before last &, you
get the backgrounded process before postmaster.

I'm presently checking out how the SLES init scripts work, but any
thoughts would be very welcome.

Thanks,

EE

Tom Lane wrote:
> Eric E <whalesuit@gmail.com> writes:
>
>> I chased the problem down to the following lines in pg_regress:
>>
>
>
>>    "$bindir/postmaster" -D "$PGDATA" -F $postmaster_options
>>    > "$LOGDIR/postmaster.log"; 2>&1 &
>>
>                                ^
>
> Is there really a semicolon there?  There should not be (and isn't in
> my copy of 8.1)
>
>             regards, tom lane
>
>