Thread: BUG #1208: Invalid page header
The following bug has been logged online: Bug reference: 1208 Logged by: Robert E Bruccoleri Email address: bruc@stone.congenomics.com PostgreSQL version: 7.4 Operating system: Linux Advanced Server 2.1 and SGI ProPack 2.4 Description: Invalid page header Details: ============================================================================ POSTGRESQL BUG REPORT TEMPLATE ============================================================================ Your name : Robert Bruccoleri Your email address : bruc@acm.org System Configuration --------------------- Architecture (example: Intel Pentium) : Intel Itanium 2 Operating System (example: Linux 2.4.18) : Linux 2.4.21 (SGI Propack 2.4 patch 10074) PostgreSQL version (example: PostgreSQL-7.4.3): PostgreSQL-7.4.3 Compiler used (example: gcc 2.95.2) : Intel C compiler version 8.0 Please enter a FULL description of your problem: ------------------------------------------------ I am getting sporadic invalid page header errors when loading or vacuuming databases in parallel. We are in the process of migrating from an SGI Origin 3000 running PostgreSQL 7.4 to an SGI Altix running PostgreSQL 7.4.3. The Altix system has 64 processors with 256 gigabytes of RAM. PostgreSQL was built using a 32K blocksize, and we start the system with a buffer cache of 130000 pages. Fdatasync is used for synchronization. We use an LSI Logic storage system to store the PostgreSQL databases as well as for much of our department's data, and we have about 5 terabytes used actively. The filesystem is XFS as delivered by SGI, which wrote it. I do not believe that we have any problems with unreliable disk storage. First, no other users have complained about problems and we have a lot more in use than what PostgreSQL is using. Second, the storage system is an enterprise-class Fibre Channel dual-controller RAID system designed for high redundancy and reliability. It has no single points of failure. We've been using it for over a year with no problems. We have about 14 active databases, and I loaded all 14 simultaneously. No errors were noted during the load, but upon vacuuming all the databases, one of the databases encountered the following message: INFO: vacuuming "public.relationships" vacuumdb: vacuuming of database "human_genome_042003" failed: ERROR: invalid page header in block 4763 of relation "relationships" There may be others with problems, but vacuumdb quit after this error. I downloaded pg_filedump and I ran it on the file containing this relation, specifying a range covering a block around the erroneous block. The two blocks around the bad block have data as I would have expected for the "relationships" table, but the bad block has data from a table in another database. Here is part of the pg_filedump output: ******************************************************************* * PostgreSQL File/Block Formatted Dump Utility - Version 3.0 * * File: 367457 * Options used: -f -R 4763 4763 * * Dump created on: Wed Aug 4 19:47:46 2004 ******************************************************************* Block 4763 ******************************************************** <Header> ----- Block Offset: 0x094d8000 Offsets: Lower 0 (0x0000) Block: Size 0 Version 0 Upper 61440 (0xf000) LSN: logid 118874 recoff 0x0000000d Special 25476 (0x6384) Items: 0 Free Space: 61440 Length (including item array): 24 Error: Invalid header information. 0000: 5ad00100 0d000000 22000000 000000f0 Z......."....... 0010: 84630000 00000000 .c......
<Data> ------ Empty block - no items listed <Special Section> ----- Error: Invalid special section encountered. 6384: 32343433 38320000 a9270000 ab270000 244382...'...'.. 6394: 00000000 01000000 00000000 1edbab73 ...............s 63a4: 0e8f3ba6 22000000 40e3ffef 22000000 ..;."...@..."... 63b4: 68e2ffef 020a0000 b400000a fdb70500 h............... 63c4: bbc30500 08008f6e ae001200 02081800 .......n........ 63d4: 0e000000 52313031 5f343438 38340000 ....R101_44884.. 63e4: 15000000 15000000 4e545f30 31303839 ........NT_01089 63f4: 335f6735 352e7365 63000000 91000000 3_g55.sec....... 6404: 0f000000 70646231 63686b2e 412e2d00 ....pdb1chk.A.-. 6414: ee000000 00000000 48e17a14 ae470340 ........H.z..G.@ 6424: 295c8fc2 f5280640 c3f5285c 8fc20b40 )\...(.@..(\...@ 6434: 3d0ad7a3 703d1340 0d000000 7f000000 =...p=.@........ 6444: 06819543 8b6c0640 d7a3703d 0a571040 ...C.l.@..p=.W.@ 6454: 91b8c7d2 87e62640 00000000 0078ca40 ......&@.....x.@ 6464: 00000000 00000000 00000000 002062c0 ............. b. 6474: 00000000 e5fd877a 720918a8 22000000 .......zr..."... 6484: a06300f0 22000000 a06300f0 020a0000 .c.."....c...... 6494: b400800a fdb70500 bbc30500 0800906e ...............n 64a4: 01001200 02081800 0e000000 52313031 ............R101 64b4: 5f343438 38340000 15000000 15000000 _44884.......... 64c4: 4e545f30 31303839 335f6735 352e7365 NT_010893_g55.se 64d4: 63000000 91000000 0f000000 70646231 c...........pdb1 64e4: 63686d2e 422e2d00 91010000 00000000 chm.B.-......... 64f4: ec51b81e 85eb0940 3d0ad7a3 703d0a40 .Q.....@=...p=.@ 6504: 52b81e85 eb511140 b81e85eb 51b81a40 R....Q.@....Q..@ 6514: 13000000 6c000000 e7fba9f1 d24d0d40 ....l........M.@ 6524: 52b81e85 ebd11740 7940d994 2bd03540 R......@y@..+.5@ 6534: 00000000 0043bd40 00000000 00000000 .....C.@........ 6544: 00000000 00c068c0 00000000 f7d17b03 ......h.......{. 6554: 08edd30d 22000000 786400f0 22000000 ...."...xd.."... 6564: 786400f0 020a0000 b400000a fdb70500 xd.............. 6574: bbc30500 0800906e 02001200 02081800 .......n........ 6584: 0e000000 52313031 5f343438 38340000 ....R101_44884.. 6594: 15000000 15000000 4e545f30 31303839 ........NT_01089 65a4: 335f6735 352e7365 63000000 91000000 3_g55.sec....... 65b4: 0f000000 70646231 6369342e 412e2d00 ....pdb1ci4.A.-. 65c4: 59000000 00000000 3d0ad7a3 703df63f Y.......=...p=.? 65d4: 1f85eb51 b81e0f40 c3f5285c 8fc20b40 ...Q...@..(\...@ 65e4: 33333333 33331840 12000000 54000000 333333.@....T... 65f4: 06819543 8b6c0640 cdcccccc cc4c1540 ...C.l.@.....L.@ 6604: c3d32b65 19da2d40 00000000 0033be40 ..+e..-@.....3.@ 6614: 00000000 00000000 00000000 00406e40 .............@n@ 6624: 00000000 c61e23a3 820d4664 22000000 ......#...Fd"... 6634: 506500f0 22000000 506500f0 020a0000 Pe.."...Pe...... 6644: b400000a fdb70500 bbc30500 0800906e ...............n 6654: 03001200 02081800 0e000000 52313031 ............R101 6664: 5f343438 38340000 15000000 15000000 _44884.......... 6674: 4e545f30 31303839 335f6735 352e7365 NT_010893_g55.se 6684: 63000000 91000000 0f000000 70646231 c...........pdb1 6694: 6369642e 2d2e2d00 b1000000 00000000 cid.-.-......... <truncated> In block 4763, there is data from another database named proceryon in the 14 that I loaded simultaneously. If this were an disk I/O error, then I would not have expected to see tuples from another database. I'd expect gibberish or nulls. I ran a vacuumdb on the table in proceryon that had data above, and there is no error. However, other tables in the proceryon database have invalid page headers. 
Here is another example: > pg_filedump -d -R 18311 18311 379598.3 ******************************************************************* * PostgreSQL File/Block Formatted Dump Utility - Version 3.0 * * File: 379598.3 * Options used: -d -R 18311 18311 * * Dump created on: Thu Aug 5 16:18:39 2004 ******************************************************************* Block 18311 ******************************************************** 0000: 6c6c6c6c 6c6c6c6c 6c6c6c6c 6c6c6c6c llllllllllllllll 0010: 6c6c6c6c 6c6c6c6c 6c6c6c6c 6c6c6c6c llllllllllllllll 0020: 6c6c6c6c 6c6c6c6c 6c6c6c6c 6c6c6c6c llllllllllllllll 0030: 6c6c6c6c 6c6c6c6c 6c6c6c6c 6c6c6c6c llllllllllllllll 0040: 6c6c6c6c 6c6c6c6c 6c6c6c6c 6c6c6c6c llllllllllllllll 0050: 6c6c6c6c 6c6c6c6c 6c6c6c6c 6c6c6c6c llllllllllllllll 0060: 6c6c6c6c 6c6c6c6c 6c6c6c6c 6c6c6c6c llllllllllllllll 0070: 6c6c6c6c 6c6c6c6c 6c6c6c6c 6c6c6c6c llllllllllllllll <truncated -- all the same> *** End of Requested Range Encountered. Last Block Read: 18311 *** Please describe a way to repeat the problem. Please try to provide a concise reproducible example, if at all possible: ---------------------------------------------------------------------- I have been trying to use the test case of Hubert Froehlich, http://archives.postgresql.org/pgsql-general/2004-07/msg00670.php, but they do not generate any errors on our system. Only these big loads cause it. If you know how this problem might be fixed, list the solution below: --------------------------------------------------------------------- I am willing to be the hands of any PostgreSQL developer to explore this problem. The system is not in production, so I can make changes at will. +-----------------------------+------------------------------------+ | Robert E. Bruccoleri, Ph.D. | email: bruc@acm.org | | President, Congenair LLC | URL: http://www.congen.com/~bruc | | P.O. Box 314 | Phone: 609 818 7251 | | Pennington, NJ 08534 | | +-----------------------------+------------------------------------+
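One arithmetic cross-check on the dump above (my own check, not part of the original report): with the 32K block size the reporter describes, the byte offset pg_filedump printed for the bad page is exactly where block 4763 should start,

    4763 blocks x 32768 bytes/block = 156,073,984 bytes = 0x094d8000

which matches the "Block Offset: 0x094d8000" line in the block 4763 header. So the dump is reading the right 32K page of the right file; it is the page contents that belong elsewhere.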
If you are sure your storage and memory are good, I can think of only two other ideas. One is a gcc bug. You are running Itanium, so it is possible. The only other possibility I can think of is that our ia64 assembler code is wrong. It is:

static __inline__ int
tas(volatile slock_t *lock)
{
	long int	ret;

	__asm__ __volatile__(
		"	xchg4 	%0=%1,%2	\n"
:		"=r"(ret), "+m"(*lock)
:		"r"(1)
:		"memory");
	return (int) ret;
}

It is possible we don't have this working properly on ia64 SMP machines. Again, these are only guesses, but this is all I can think of. We have no other reports of such failures _except_ for hardware problems.

You can try 8.0 beta1 and see if that helps. I do see the assembly code is slightly modified from the 7.4.X release. It might be significant, but I doubt it.
--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
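Bruce's worry about whether this test-and-set really provides mutual exclusion on a many-way ia64 box can be probed directly. The following stress test is only an illustrative sketch added here: the xchg4 asm is the fragment quoted above, but the pthread harness, the thread and iteration counts, and the plain-store unlock are my assumptions, not PostgreSQL code (a real ia64 unlock may want release semantics, e.g. st4.rel). Build with something like "cc -O2 tas_test.c -lpthread" and run on the 64-way machine; if mutual exclusion is broken, the final counter comes up short.

/* tas_test.c -- ia64 test-and-set smoke test (sketch, not from the PostgreSQL sources) */
#include <pthread.h>
#include <stdio.h>

typedef unsigned int slock_t;

static volatile slock_t lock = 0;
static long counter = 0;            /* protected only by the lock above */

#define NTHREADS 16
#define ITERS    1000000L

static int
tas(volatile slock_t *l)
{
    long int    ret;

    /* same instruction and constraints as the ia64 code quoted above */
    __asm__ __volatile__(
        "   xchg4   %0=%1,%2    \n"
:       "=r"(ret), "+m"(*l)
:       "r"(1)
:       "memory");
    return (int) ret;
}

static void *
worker(void *arg)
{
    long        i;

    (void) arg;
    for (i = 0; i < ITERS; i++)
    {
        while (tas(&lock) != 0)
            ;                       /* spin until the lock is ours */
        counter++;                  /* critical section */
        __asm__ __volatile__("" : : : "memory");    /* keep the store inside */
        lock = 0;                   /* simplified unlock */
    }
    return NULL;
}

int
main(void)
{
    pthread_t   threads[NTHREADS];
    int         i;

    for (i = 0; i < NTHREADS; i++)
        pthread_create(&threads[i], NULL, worker, NULL);
    for (i = 0; i < NTHREADS; i++)
        pthread_join(threads[i], NULL);

    printf("expected %ld, got %ld\n", (long) NTHREADS * ITERS, counter);
    return (counter == (long) NTHREADS * ITERS) ? 0 : 1;
}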
Bruce Momjian <pgman@candle.pha.pa.us> writes:
> If you are sure your storage and memory are good, I can think of only
> two other ideas. One is a gcc bug. You are running Itanium, so it is
> possible. The only other possibility I can think of is that our
> ia64 assembler code is wrong. It is:

But that code is gcc-only, and he's not using gcc.

It's certainly possible that the non-gcc spinlock path is broken on IA64, though. I dunno that anyone has ever tested that combination. It might be interesting for him to run the "test" program in s_lock.c and see if it complains.

			regards, tom lane
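For readers who have not looked at it, the checks that test performs run along these lines. This is a compressed sketch from memory, not the file itself; it assumes it is compiled and linked together with s_lock.c inside the source tree, so that postgres.h, storage/s_lock.h, and the S_INIT_LOCK/S_LOCK/S_UNLOCK/S_LOCK_FREE/TAS macros are available (on platforms without HAS_TEST_AND_SET there is no TAS to test).

/* Sketch of the single-process spinlock sanity checks -- see
 * src/backend/storage/lmgr/s_lock.c for the real test program. */
#include "postgres.h"
#include "storage/s_lock.h"
#include <stdio.h>

static volatile slock_t test_lock;

int
main(void)
{
    S_INIT_LOCK(&test_lock);
    if (!S_LOCK_FREE(&test_lock))
    {
        printf("error: lock not free after S_INIT_LOCK\n");
        return 1;
    }

    S_LOCK(&test_lock);
    if (S_LOCK_FREE(&test_lock))
    {
        printf("error: lock reports free while held\n");
        return 1;
    }
    if (!TAS(&test_lock))
    {
        printf("error: TAS succeeded on a held lock\n");
        return 1;
    }

    S_UNLOCK(&test_lock);
    if (!S_LOCK_FREE(&test_lock))
    {
        printf("error: lock still appears held after S_UNLOCK\n");
        return 1;
    }

    /* The in-tree test finishes by acquiring the lock twice on purpose,
     * which should end in a deliberate "stuck spinlock" abort. */
    printf("basic spinlock checks passed\n");
    return 0;
}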
Tom Lane wrote:
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > If you are sure your storage and memory are good, I can think of only
> > two other ideas. One is a gcc bug. You are running Itanium, so it is
> > possible. The only other possibility I can think of is that our
> > ia64 assembler code is wrong. It is:
>
> But that code is gcc-only, and he's not using gcc.

Oh, I see that now:

	Compiler used (example: gcc 2.95.2)	: Intel C compiler version 8.0

> It's certainly possible that the non-gcc spinlock path is broken on IA64
> though. I dunno that anyone has ever tested that combination. It might
> be interesting for him to run the "test" program in s_lock.c and see if
> it complains.

What locking code is actually being used? I don't see any __ia64__ entries except for gcc. Is it calling some libc library tas()? Is it falling back to semaphores?

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
Tom Lane wrote: > But that code is gcc-only, and he's not using gcc. I think the icc compiler claims to be gcc-compatible in that area, so it's quite likely that the gcc assembler code would be used. -- Peter Eisentraut http://developer.postgresql.org/~petere/
Peter Eisentraut <peter_e@gmx.net> writes:
> Tom Lane wrote:
>> But that code is gcc-only, and he's not using gcc.

> I think the icc compiler claims to be gcc-compatible in that area, so
> it's quite likely that the gcc assembler code would be used.

Oh, good point.

In that case it seems entirely possible that the assembly code tightening-up patch that I made for 8.0 is relevant. The point of that patch was to prevent the compiler from making inappropriate optimizations around a spinlock TAS. We have not seen any indication that existing gcc releases would actually do anything unwanted ... but icc might have different/more aggressive optimizations.

[ looks at code... ] But it looks like 7.4's IA64 code already had the memory-clobber constraint, so there doesn't appear to be any significant change there since 7.4. I suppose it's worth trying the insignificant change though:

	__asm__ __volatile__(
		"	xchg4 	%0=%1,%2	\n"
:		"=r"(ret), "=m"(*lock)
:		"r"(1), "1"(*lock)
:		"memory");

to

	__asm__ __volatile__(
		"	xchg4 	%0=%1,%2	\n"
:		"=r"(ret), "+m"(*lock)
:		"r"(1)
:		"memory");

at line 125ff of src/include/storage/s_lock.h. Note "=m" becomes "+m" to replace the separate "1" constraint.

Robert, have you tried backing off compiler optimization levels to see if anything changes?

			regards, tom lane
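If icc really does accept the gcc-style asm, as Peter suggests, one cheap way to see what it is doing around the TAS is to compile a small probe to assembly with the same flags plus -S and read the output: the xchg4 should still be inside the spin loop, and stores into the protected data should not have been hoisted above it. The probe below is hypothetical (my own, using the revised "+m" form of the constraint), not something from the thread or the sources.

/* tas_inspect.c -- hypothetical probe; compile with e.g. "icc -O2 -S tas_inspect.c"
 * and inspect tas_inspect.s */
typedef unsigned int slock_t;

static __inline__ int
tas(volatile slock_t *lock)
{
    long int    ret;

    __asm__ __volatile__(
        "   xchg4   %0=%1,%2    \n"
:       "=r"(ret), "+m"(*lock)
:       "r"(1)
:       "memory");
    return (int) ret;
}

void
spin_and_store(volatile slock_t *lock, int *shared, int value)
{
    while (tas(lock) != 0)
        ;                       /* spin until acquired */
    *shared = value;            /* must not move ahead of the xchg4 */
    __asm__ __volatile__("" : : : "memory");
    *lock = 0;                  /* simplified unlock */
}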
Dear All,

Thanks for responding. First of all, the Intel compiler does not accept inline assembly code, so I substituted the following compiler intrinsic in its place:

#if defined(__ia64__) || defined(__ia64)	/* __ia64 used by ICC compiler? */
/* Intel Itanium */
#define HAS_TEST_AND_SET

#include <ia64intrin.h>

typedef unsigned long slock_t;

#define TAS(lock) _InterlockedExchange64(lock, 1)

#endif	 /* __ia64__ || __ia64 */

With this code in 7.4.3, I had the problem outlined in my bug report. I also ran the spin lock test program in s_lock.c, and it failed as expected with a stuck spinlock. Do you have a locking test program that is designed to detect errors in a multiprocessor environment?

Second, I've just gotten the 8.0 beta 1 release, built it using normal optimization, and it did not encounter any invalid page headers. Here is a typical compiler command with flags:

icc -w0 -ansi_alias- -O2 -fno-strict-aliasing

It is important to keep in mind that the machine used for this test can keep all 14 PostgreSQL backends running simultaneously, so any race conditions would be more likely to occur on it than on a typical 2- or 4-way Linux box.

Question: were there any significant changes made to the buffer management code between 7.4 and 8.0 that would explain the difference?

I haven't tried rerunning 7.4.3 without optimization to see if the problem disappears in that release. Since the 8.0beta1 release appears OK and the test run takes about three days, I'm reluctant to do it unless there's some value in performing the test. Please tell me if there is.

Another question: on a machine which has this high level of parallelism, does it make sense to use a spinlock to control access to the buffer cache instead of a lightweight lock?

Thanks. --Bob

Tom Lane writes:
> [...]
> Robert, have you tried backing off compiler optimization levels to see
> if anything changes?
>
> 			regards, tom lane

+-----------------------------+------------------------------------+
| Robert E. Bruccoleri, Ph.D. | email: bruc@acm.org                |
| President, Congenair LLC    | URL:   http://www.congen.com/~bruc |
| P.O. Box 314                | Phone: 609 818 7251                |
| Pennington, NJ 08534        |                                    |
+-----------------------------+------------------------------------+
"Robert E. Bruccoleri" <bruc@stone.congenomics.com> writes: > Question: were there any significant changes made to the > buffer management code between 7.4 and 8.0 that would explain the > difference? There are some nontrivial changes, but none that I would regard as likely to cause a multiprocessing error to magically go away. More to the point, if there is such a bug in 7.4.3 there's no guarantee it won't come back again. > I haven't tried rerunning 7.4.3 without optimization to see if > the problem disappears in that release. Since the 8.0beta1 release > appears OK, and the test run takes about three days, so I'm reluctant > to do it unless there's some value in performing test. Please tell me > if there is. If you believe this is not a hardware problem, you'd better keep digging. There is no known reason for 7.4 to fail like that. It would be folly to assume that we've fixed the problem without knowing it. > Another question: on a machine which has this high level of > parallelism, does it make sense to use a spinlock to control access to > the buffer cache instead of a lightweight lock? No. The angst you've probably been reading is focused around the spinlock part of the LWLock --- simplifying the LWLock to a bare spinlock will not improve matters. regards, tom lane
Dear Tom, Besides a no optimization compilation of 7.4.3, what else would you recommend to explore this problem further? Thanks. --Bob Tom Lane writes: > > > "Robert E. Bruccoleri" <bruc@stone.congenomics.com> writes: > > Question: were there any significant changes made to the > > buffer management code between 7.4 and 8.0 that would explain the > > difference? > > There are some nontrivial changes, but none that I would regard as > likely to cause a multiprocessing error to magically go away. More > to the point, if there is such a bug in 7.4.3 there's no guarantee > it won't come back again. > > > I haven't tried rerunning 7.4.3 without optimization to see if > > the problem disappears in that release. Since the 8.0beta1 release > > appears OK, and the test run takes about three days, so I'm reluctant > > to do it unless there's some value in performing test. Please tell me > > if there is. > > If you believe this is not a hardware problem, you'd better keep > digging. There is no known reason for 7.4 to fail like that. > It would be folly to assume that we've fixed the problem without > knowing it. > > > Another question: on a machine which has this high level of > > parallelism, does it make sense to use a spinlock to control access to > > the buffer cache instead of a lightweight lock? > > No. The angst you've probably been reading is focused around the > spinlock part of the LWLock --- simplifying the LWLock to a bare > spinlock will not improve matters. > > regards, tom lane > +-----------------------------+------------------------------------+ | Robert E. Bruccoleri, Ph.D. | email: bruc@acm.org | | President, Congenair LLC | URL: http://www.congen.com/~bruc | | P.O. Box 314 | Phone: 609 818 7251 | | Pennington, NJ 08534 | | +-----------------------------+------------------------------------+
"Robert E. Bruccoleri" <bruc@stone.congenomics.com> writes: > Besides a no optimization compilation of 7.4.3, what else > would you recommend to explore this problem further? Thanks. --Bob I really haven't the foggiest where to look :-( I don't actually believe that it's a spinlock problem; that would explain pages getting substituted for other pages, in whole or in part, but you showed at least one example where a page was just overwritten with garbage. That looks more like a memory-stomp problem (again, assuming that it's software) and so could be anywhere. Are you using any off-the-beaten-track code (contrib modules, non-btree indexes, non-mainstream data types)? That stuff is less well debugged than the mainstream ... regards, tom lane
Dear Tom, > > > "Robert E. Bruccoleri" <bruc@stone.congenomics.com> writes: > > Besides a no optimization compilation of 7.4.3, what else > > would you recommend to explore this problem further? Thanks. --Bob > > I really haven't the foggiest where to look :-( I don't actually > believe that it's a spinlock problem; that would explain pages getting > substituted for other pages, in whole or in part, but you showed at > least one example where a page was just overwritten with garbage. > That looks more like a memory-stomp problem (again, assuming that it's > software) and so could be anywhere. Does the memory pattern in the garbage page look familiar? > > Are you using any off-the-beaten-track code (contrib modules, > non-btree indexes, non-mainstream data types)? That stuff is less > well debugged than the mainstream ... No, it's all standard stuff (text, integers, floats, etc.). Thanks. --Bob +-----------------------------+------------------------------------+ | Robert E. Bruccoleri, Ph.D. | email: bruc@acm.org | | President, Congenair LLC | URL: http://www.congen.com/~bruc | | P.O. Box 314 | Phone: 609 818 7251 | | Pennington, NJ 08534 | | +-----------------------------+------------------------------------+
Dear Tom, I tried another load of the databases with fsync off (-F to the postmaster), and I encountered at least one invalid page header. Is there code in Postgres that handles the buffer cache differently if fsync is off? Could this be a timing issue, since -F does make PostgreSQL run faster? BTW, I'll be out of the office for two weeks, so you'll hear back from me around Labor Day. Thanks. --Bob Tom Lane writes: > > > "Robert E. Bruccoleri" <bruc@stone.congenomics.com> writes: > > Besides a no optimization compilation of 7.4.3, what else > > would you recommend to explore this problem further? Thanks. --Bob > > I really haven't the foggiest where to look :-( I don't actually > believe that it's a spinlock problem; that would explain pages getting > substituted for other pages, in whole or in part, but you showed at > least one example where a page was just overwritten with garbage. > That looks more like a memory-stomp problem (again, assuming that it's > software) and so could be anywhere. > > Are you using any off-the-beaten-track code (contrib modules, > non-btree indexes, non-mainstream data types)? That stuff is less > well debugged than the mainstream ... > > regards, tom lane > +-----------------------------+------------------------------------+ | Robert E. Bruccoleri, Ph.D. | email: bruc@acm.org | | President, Congenair LLC | URL: http://www.congen.com/~bruc | | P.O. Box 314 | Phone: 609 818 7251 | | Pennington, NJ 08534 | | +-----------------------------+------------------------------------+
"Robert E. Bruccoleri" <bruc@stone.congenomics.com> writes: > I tried another load of the databases with fsync off (-F to > the postmaster), and I encountered at least one invalid page > header. Is there code in Postgres that handles the buffer cache > differently if fsync is off? I don't believe so. You may care to read Joe Conway's recent tale of woe. Just because you bought a pile of expensive hardware does not prove it's not a hardware problem ... regards, tom lane