Thread: repeatable system index corruption on 7.4.2
I'm seeing the following errors after a few hours of fairly aggressive
bulk load of a database running on Postgres 7.4.2:

cyspec=# select version();
ERROR:  root page 43 of "pg_proc_proname_args_nsp_index" has level 0, expected 2
cyspec=# select * from pg_class;
ERROR:  root page 3 of "pg_attribute_relid_attnum_index" has level 0, expected 1

When I say aggressive, I mean up to 6 simultaneous COPY processes. It is
different from the issue Tom solved the other day in that we don't get
SIGABRT, just corrupt index pages. Here is a backtrace:

#0  errfinish (dummy=0) at elog.c:319
#1  0x081cbc26 in elog_finish (elevel=20,
    fmt=0x81e6ee0 "root page %u of \"%s\" has level %u, expected %u") at elog.c:853
#2  0x0808c632 in _bt_getroot (rel=0x82d58d4, access=1429534320) at nbtpage.c:287
#3  0x080902f3 in _bt_search (rel=0x82d58d4, keysz=2, scankey=0x8307358,
    bufP=0xfeffefa8, access=1) at nbtsearch.c:46
#4  0x08090fea in _bt_first (scan=0x8307198, dir=ForwardScanDirection) at nbtsearch.c:575
#5  0x0808ed47 in btgettuple (fcinfo=0x82a65cc) at nbtree.c:326
#6  0x081ce5b6 in FunctionCall2 (flinfo=0x0, arg1=136996300, arg2=136996300) at fmgr.c:993
#7  0x08088329 in index_getnext (scan=0x8307198, direction=ForwardScanDirection) at indexam.c:503
#8  0x08087951 in systable_getnext (sysscan=0x82a65cc) at genam.c:253
#9  0x081c3c43 in RelationBuildTupleDesc (buildinfo=
      {infotype = 2, i = {info_id = 136666169, info_name = 0x8255c39 "pg_index_indrelid_index"}},
    relation=0x5a2cdea4) at relcache.c:548
#10 0x081c459f in RelationBuildDesc (buildinfo=
      {infotype = 2, i = {info_id = 136666169, info_name = 0x8255c39 "pg_index_indrelid_index"}},
    oldrelation=0x0) at relcache.c:884
#11 0x081c56c0 in RelationSysNameGetRelation (relationName=0x8255c39 "pg_index_indrelid_index")
    at relcache.c:1637
#12 0x0807febe in relation_openr (sysRelationName=0x8255c39 "pg_index_indrelid_index",
    lockmode=0) at heapam.c:529
#13 0x08087ab5 in index_openr (sysRelationName=0x8255c39 "pg_index_indrelid_index") at indexam.c:179
#14 0x0808790a in systable_beginscan (heapRelation=0x82ee11c,
    indexRelname=0x8255c39 "pg_index_indrelid_index", indexOK=1 '\001', snapshot=0x0,
    nkeys=1, key=0xfefff2d0) at genam.c:192
#15 0x081c6f53 in RelationGetIndexList (relation=0x82d846c) at relcache.c:2717
#16 0x08147bc4 in get_relation_info (relationObjectId=1259, rel=0x83070d4) at plancat.c:81
#17 0x081492b4 in make_base_rel (root=0x82a65cc, relid=1) at relnode.c:159
#18 0x08148f91 in build_base_rel (root=0x8302164, relid=1) at relnode.c:70
#19 0x0813d5b8 in add_base_rels_to_query (root=0x8302164, jtnode=0x83070b8) at initsplan.c:86
#20 0x0813e4c9 in query_planner (root=0x8302164, tlist=0x83067a0, tuple_fraction=0,
    cheapest_path=0xfefff4e0, sorted_path=0xfefff4e4) at planmain.c:119
#21 0x0813f03b in grouping_planner (parse=0x8302164, tuple_fraction=0) at planner.c:897
#22 0x0813e9b4 in subquery_planner (parse=0x8302164, tuple_fraction=0) at planner.c:315
#23 0x0813e69c in planner (parse=0x8302164, isCursor=0 '\0', cursorOptions=0) at planner.c:119
#24 0x0816e8ba in pg_plan_query (querytree=0x8302164) at postgres.c:589
#25 0x0816e944 in pg_plan_queries (querytrees=0x830673c, needSnapshot=0 '\0') at postgres.c:656
#26 0x0816eafc in exec_simple_query (query_string=0x8301dbc "select * from pg_class;\n")
    at postgres.c:814

Any thoughts? Do you think the patch in 7.4.5 will address this also, or
are we looking at something else?

Thanks,

Joe
Joe Conway <mail@joeconway.com> writes:
> I'm seeing the following errors after a few hours of fairly aggressive
> bulk load of a database running on Postgres 7.4.2:

> cyspec=# select version();
> ERROR:  root page 43 of "pg_proc_proname_args_nsp_index" has level 0,
> expected 2

[ scratches head... ]  Are you just doing bulk *data* load, or are you
doing extensive schema hacking too?  What reason would there be for
index churn in pg_proc (or pg_class for that matter)?

Can you get a pg_filedump dump of that page of the index?

> When I say aggressive, I mean up to 6 simultaneous COPY processes. It is
> different from the issue Tom solved the other day in that we don't get
> SIGABRT, just corrupt index pages. Here is a backtrace:

That may have more to do with whether the error is detected inside a
critical section than anything else.

			regards, tom lane
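For anyone who wants to pull the same page Tom is asking about, here is a
minimal sketch of locating the index file on disk. The catalog queries are
plain SQL against a 7.4 system; the data-directory layout and the pg_filedump
switches mentioned in the comments are assumptions to verify against the
local installation.

  -- database OID (the index file lives under $PGDATA/base/<this oid>/)
  SELECT oid FROM pg_database WHERE datname = 'cyspec';

  -- on-disk file name of the damaged index within that directory
  SELECT relfilenode FROM pg_class
   WHERE relname = 'pg_proc_proname_args_nsp_index';

  -- the reported root page is then block 43 of that file, e.g. (assumed flags)
  --   pg_filedump -i -f -R 43 $PGDATA/base/<db oid>/<relfilenode>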
Tom Lane wrote:
> Joe Conway <mail@joeconway.com> writes:
>> I'm seeing the following errors after a few hours of fairly aggressive
>> bulk load of a database running on Postgres 7.4.2:
>
>> cyspec=# select version();
>> ERROR:  root page 43 of "pg_proc_proname_args_nsp_index" has level 0,
>> expected 2
>
> [ scratches head... ]  Are you just doing bulk *data* load, or are you
> doing extensive schema hacking too?  What reason would there be for
> index churn in pg_proc (or pg_class for that matter)?

Here's more detail about what we're doing:

Done once
---------
create the database
create schemas
create "base" tables
create functions

Done in a loop
--------------
for year in X to Y
{
  create ~60 tables, inherited from the "base" tables
  for each table (up to 6 at a time in parallel)
  {
    load data via COPY
    create index
  }
}

I was just informed that the most recent attempt failed again, this time
with no more than 1 table loaded at a time.

> Can you get a pg_filedump dump of that page of the index?

Will do. BTW, I also tried reindexing one of the corrupt databases and
got this:

backend> reindex database cyspec;
2004-08-19 11:13:37 NOTICE:  table "pg_class" was reindexed
2004-08-19 11:13:38 ERROR:  invalid page header in block 6 of relation "pg_attribute"

Joe
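To make the load loop above concrete, here is a minimal sketch of what one
iteration might look like in SQL. The table, column, and file names are
invented for illustration; the real schema creates roughly 60 such inherited
tables per year and loads up to 6 of them in parallel.

  -- one of ~60 child tables created for the year
  CREATE TABLE obs_1998 () INHERITS (obs_base);

  -- bulk load the exported data for that table
  COPY obs_1998 FROM '/staging/1998/obs.dat';

  -- index built only after the load completes
  CREATE INDEX obs_1998_id_idx ON obs_1998 (obs_id);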
> Joe Conway writes
> I'm seeing the following errors after a few hours of fairly aggressive
> bulk load of a database running on Postgres 7.4.2:
>
> When I say aggressive, I mean up to 6 simultaneous COPY processes. It is
> different from the issue Tom solved the other day in that we don't get
> SIGABRT, just corrupt index pages.

OK, problem accepted, but why would you run 6 simultaneous COPYs?
Presumably on > 1 CPU? Sounds like you're hitting the right edge of the
index really hard (as well as finding a hole in the logic).

Can you stream the data to a shared pipe and run just a single COPY?

Could you "dilute" the index by adding an irrelevant number as the first
column of the index, so that the index hotspot moves to 6 places? Or use
a functional index to alter the distribution of values? You'd be quite
likely to see an improvement in performance by avoiding the contention,
and that may be a good workaround for now.

Can I ask, are you also running simultaneous INSERTs or just COPYs? And
presumably you're mixing that with SELECTs that are doing index scans?
Does the table have random deletes on it, or just occasional regular
bulk deletes? How often is it VACUUMed?

Best Regards,

Simon Riggs
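Simon's single-COPY idea can be sketched with a named pipe: the extract files
are concatenated into the pipe from the shell while one backend COPYs from
it, so only a single process is ever inserting. The paths and table name
below are hypothetical.

  -- shell side (hypothetical paths):
  --   mkfifo /staging/load_pipe
  --   cat part1.dat part2.dat part3.dat > /staging/load_pipe &
  -- a single backend then reads the whole stream:
  COPY obs_1998 FROM '/staging/load_pipe';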
Simon Riggs wrote:
>> Joe Conway writes
>> I'm seeing the following errors after a few hours of fairly aggressive
>> bulk load of a database running on Postgres 7.4.2:
>
>> When I say aggressive, I mean up to 6 simultaneous COPY processes. It is
>> different from the issue Tom solved the other day in that we don't get
>> SIGABRT, just corrupt index pages.
>
> OK, problem accepted, but why would you run 6 simultaneous COPYs?
> Presumably on > 1 CPU? Sounds like you're hitting the right edge of the
> index really hard (as well as finding a hole in the logic).

This is fairly high end hardware -- 4 hyperthreaded CPUs (hence 8 CPUs
from the OS perspective), 8GB RAM.

But in any case, since my last report we've reproduced the problem with
a single COPY at a time.

> Can I ask, are you also running simultaneous INSERTs or just COPYs? And
> presumably you're mixing that with SELECTs that are doing index scans?
> Does the table have random deletes on it, or just occasional regular
> bulk deletes? How often is it VACUUMed?

This is an initial data load, coming from an export from a large
cOmmeRciAl database. There is no other activity besides the loading --
no inserts.

The total number of rows to be imported into Postgres is ~900 million --
that's why we would like to do as much in parallel as possible.

Joe
Joe Conway wrote:
> Simon Riggs wrote:
>>> Joe Conway writes
>>> I'm seeing the following errors after a few hours of fairly aggressive
>>> bulk load of a database running on Postgres 7.4.2:
>>
>>> When I say aggressive, I mean up to 6 simultaneous COPY processes. It is
>>> different from the issue Tom solved the other day in that we don't get
>>> SIGABRT, just corrupt index pages.
>>
>> OK, problem accepted, but why would you run 6 simultaneous COPYs?
>> Presumably on > 1 CPU? Sounds like you're hitting the right edge of the
>> index really hard (as well as finding a hole in the logic).
>
> This is fairly high end hardware -- 4 hyperthreaded CPUs (hence 8 CPUs
> from the OS perspective), 8GB RAM.
>
> But in any case, since my last report we've reproduced the problem with
> a single COPY at a time.

I just want to close the loop on this thread. In summary, the problem
turned out to be related to the logical volume, and NOT Postgres.

For those interested in the gory detail, read on -- others may safely
move on ;-)

Joe

----------------------

Here's what we had, in terms of storage layout (i.e. the layout
susceptible to Postgres system catalog corruption). Note that I'm
neither the unix admin nor the storage expert -- we had our own unix
admin, a storage expert from the local VAR, and a storage expert from
the SAN vendor involved. They decided how to lay this all out -- I'll
do my best to be accurate in my description.

   [---------------- jfs filesystem ----------------]
   [---------------- logical volume ----------------]
   [----------------- volume group -----------------]
   +------+------+------+------+------+------+------+
   | LUN1 | LUN2 | LUN3 | LUN4 | LUN5 | LUN6 | LUN7 |
   +------+------+------+------+------+------+------+

LUN[1-7] are actually each composed of 14 x 73GB x 15K rpm SCSI drives,
configured (I think) in a RAID 5 array, totaling just under 1TB of
usable space. The SAN presents each of these arrays, via fibrechannel,
to the OS as a single, large SCSI LUN.

The LUNs are individually partitioned using fdisk, with a single primary
partition on each; in this case the partition was offset 128MB to allow
for stripe alignment (more on stripes later).

The volume group creates a block device. We used a 128MB physical extent
size to allow for up to an 8TB logical volume.

The logical volume is then created on top of the volume group. We
created a single, 6.4TB logical volume using 128MB stripes across the
LUNs in the volume group.

Finally, the filesystem is laid on top. We started out with jfs based on
vendor recommendations (I believe).

I'll represent this topology a bit more compactly as
(SAN-SCSI->SLV->jfs), where SLV stands for striped logical volume.

----------------------

In the above configuration we consistently got data corruption (as
evidenced by bad system indexes) regardless of the Postgres
configuration, and with 1 or 6 parallel data loads.

We then changed the filesystem to xfs. With an xfs filesystem we got
substantially farther into our data load, but corruption did eventually
occur.

After a variety of successful data loads (i.e. no corruption) on other
volumes with different topologies, we decided that the problem was
related to either the logical volume level or the SAN hardware itself.
So we deleted the logical volume and the volume group. Then we
re-partitioned the LUNs -- this time without the offset for stripe
alignment. The volume group was rebuilt using 128MB extents again, but
the logical volume was built using concatenated, instead of striped,
LUNs. Finally, we formatted with xfs.
That is, (SAN-SCSI->CLV->xfs), where CLV is concatenated logical volume.

(SAN-SCSI->CLV->xfs) worked with no data corruption. We're not entirely
sure why the original configuration was a problem, but the SAN vendor
has agreed to try to reproduce this scenario in their lab.

----------------------

Finally, some comparative times on the various volume types that we
tested while troubleshooting:

  fail == system catalog data corruption
  pass == no evident corruption

  data1 -> (SAN-SCSI->SLV->jfs) -> fail
  data1 -> (SAN-SCSI->SLV->xfs) -> fail
  data3 -> (NFS-mounted-NAS)    -> pass (122 minutes)
  data2 -> (SAN-IDE->CLV->jfs)  -> pass (103 minutes)
  data1 -> (SAN-SCSI->CLV->xfs) -> pass (94 minutes)
  data1 -> (SAN-SCSI->CLV->xfs) -> pass (93 minutes)

Times listed are the total clock time to complete the entire data load,
using 6 parallel processes doing bulk COPY.

Hope someone finds this useful.

Joe