Thread: Multicolumn index corruption on 8.4 beta 2
Hi,

I pg_dump'ed an 8.3.7 database and loaded the dump into a different server running PostgreSQL 8.4 beta 2 (compiled from source) under OpenSolaris. One of the tables has about 6 million records and a B-tree index that spans 3 columns.

I am having the problem that some queries are unable to find rows when using the index. When I force a sequential scan, by doing "set enable_indexscan=false; set enable_bitmapscan=false;", the same queries work fine.

In addition, while running "vacuum full analyze" I got the following error a couple of times:

==
ERROR: failed to re-find parent key in index "pgb_idx" for deletion target page 25470
==

Doing "reindex", or dropping and recreating the index, makes the error go away for a while. However, it does not solve the problem of the missing rows, making me believe the index PostgreSQL generates is still corrupt.

According to memtest the memory of the server is fine, and according to "zpool status" there are no disk or ZFS checksum errors.

Any idea how to solve or debug this issue?

Yours sincerely,

Floris Bos
Floris Bos / Maxnet wrote:
> I am having the problem that some queries are unable to find rows when
> using the index.
> When I force a sequential scan, by doing "set enable_indexscan=false;
> set enable_bitmapscan=false;", the same queries work fine.

Not a hacker myself, but I can tell you that the first question you'll be asked is "can you produce a test case?". If you can reproduce the problem with a test table and generated data, that will let people figure out the problem for you.

If not, details of the table schema will be needed. Is there any pattern to the missed rows? Compile settings, character set and locale details might be relevant too.

--
  Richard Huxton
  Archonet Ltd
And can you post an explain plan for the incorrect scan? In particular, is it using a bitmap index scan or a regular index scan? Or does it happen with either?

--
Greg

On 9 Jun 2009, at 09:43, Richard Huxton <dev@archonet.com> wrote:

> Floris Bos / Maxnet wrote:
>> I am having the problem that some queries are unable to find rows
>> when using the index.
>> When I force a sequential scan, by doing "set
>> enable_indexscan=false; set enable_bitmapscan=false;", the same
>> queries work fine.
>
> Not a hacker myself, but I can tell you that the first question
> you'll be asked is "can you produce a test case"? If you can
> generate the problem from a test table+generated data that will let
> people figure out the problem for you.
>
> If not, details of the table schema will be needed, and is there any
> pattern to the missed rows? Also - compile settings, character set
> and locale details might be relevant too.
>
> --
>   Richard Huxton
>   Archonet Ltd
>
> --
> Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers
Hi,

Richard Huxton wrote:
> Not a hacker myself, but I can tell you that the first question you'll
> be asked is "can you produce a test case"? If you can generate the
> problem from a test table+generated data that will let people figure out
> the problem for you.

Unfortunately, I have not been able to produce a test case (yet) on a small data set. While the data in the database is public information, the whole database is about 100 GB, and therefore kind of hard to share.

> If not, details of the table schema will be needed, and is there any
> pattern to the missed rows? Also - compile settings, character set and
> locale details might be relevant too.

==
Compile settings
==

No fancy settings:

- Clean OpenSolaris 2009.06 installation
- Installed the gcc and gmake packages
- Downloaded the source and did "./configure --disable-readline ; gmake ; gmake install"

==
PostgreSQL settings
==

The following settings differ from the defaults:

--
shared_buffers = 3500MB
maintenance_work_mem = 128MB
fsync = off
synchronous_commit = off
checkpoint_segments = 25
--

The character set used when creating the database is SQL_ASCII.

==
Hardware
==

Tyan barebone
2x Opteron 2376 quadcore
32 GB reg ECC memory
1x Intel X25-E 32 GB SSD for OS and pg_xlog directory
2x Intel X25-E 64 GB SSD (ZFS striping) for the database

==
Table layout
==

--
                               Table "public.posts_index"
   Column   |          Type          |                         Modifiers
------------+------------------------+-----------------------------------------------------------
 cid        | integer                | not null default nextval('posts_index_cid_seq'::regclass)
 groupid    | integer                | not null
 startdate  | integer                | not null
 poster     | character varying(64)  | not null
 basefile   | character varying(64)  | not null
 subject    | character varying(255) | not null
 size       | real                   |
 nfo        | boolean                |
 c          | boolean                |
 parts      | integer                |
 totalparts | integer                |
 imdb       | integer                |
 ng1        | boolean                | default false
 g2         | integer                | default 0
 g3         | integer                | default 0
 data       | bytea                  |
Indexes:
    "posts_index5_pkey" PRIMARY KEY, btree (cid) CLUSTER
    "gr_idx" btree (groupid, (- cid))
    "pgb_idx" btree (poster, groupid, basefile)
--

Only noticed problems with the pgb_idx index so far. The problem only occurs on a subset of the rows at a time. After adding/updating rows and doing a reindex, the rows that were missing before sometimes suddenly do work, but then different ones do not.

> And can you post an explain plan for the incorrect scan? In particular
> is it using a bitmap index scan or a regular index scan? Or does it
> happen with either?

Happens with both.

Index scan:

===
=> explain SELECT count(*) FROM posts_index WHERE poster='Yenc@power-post.org (Yenc-PP-A&A)' AND groupid=300 AND basefile='NIB8124849';
                                                                     QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=11.25..11.26 rows=1 width=0)
   ->  Index Scan using pgb_idx on posts_index  (cost=0.00..11.25 rows=1 width=0)
         Index Cond: (((poster)::text = 'Yenc@power-post.org (Yenc-PP-A&A)'::text) AND (groupid = 300) AND ((basefile)::text = 'NIB8124849'::text))

=> SELECT count(*) FROM posts_index WHERE poster='Yenc@power-post.org (Yenc-PP-A&A)' AND groupid=300 AND basefile='NIB8124849';
 count
-------
     0
===

When I disable index scan, it uses bitmap without luck:

==
=> set enable_indexscan=false;
SET
=> explain SELECT count(*) FROM posts_index WHERE poster='Yenc@power-post.org (Yenc-PP-A&A)' AND groupid=300 AND basefile='NIB8124849';
                                                                        QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=11.26..11.27 rows=1 width=0)
   ->  Bitmap Heap Scan on posts_index  (cost=7.24..11.26 rows=1 width=0)
         Recheck Cond: (((poster)::text = 'Yenc@power-post.org (Yenc-PP-A&A)'::text) AND (groupid = 300) AND ((basefile)::text = 'NIB8124849'::text))
         ->  Bitmap Index Scan on pgb_idx  (cost=0.00..7.24 rows=1 width=0)
               Index Cond: (((poster)::text = 'Yenc@power-post.org (Yenc-PP-A&A)'::text) AND (groupid = 300) AND ((basefile)::text = 'NIB8124849'::text))

=> SELECT count(*) FROM posts_index WHERE poster='Yenc@power-post.org (Yenc-PP-A&A)' AND groupid=300 AND basefile='NIB8124849';
 count
-------
     0
==

Sequential scan does find the row:

==
=> set enable_indexscan=false;
SET
=> set enable_bitmapscan=false;
SET
=> explain SELECT count(*) FROM posts_index WHERE poster='Yenc@power-post.org (Yenc-PP-A&A)' AND groupid=300 AND basefile='NIB8124849';
                                                                   QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=288153.28..288153.29 rows=1 width=0)
   ->  Seq Scan on posts_index  (cost=0.00..288153.28 rows=1 width=0)
         Filter: (((poster)::text = 'Yenc@power-post.org (Yenc-PP-A&A)'::text) AND (groupid = 300) AND ((basefile)::text = 'NIB8124849'::text))
(3 rows)

=> SELECT count(*) FROM posts_index WHERE poster='Yenc@power-post.org (Yenc-PP-A&A)' AND groupid=300 AND basefile='NIB8124849';
 count
-------
     1
==

Yours sincerely,

Floris Bos
Floris Bos / Maxnet <bos@je-eigen-domein.nl> writes:
> Richard Huxton wrote:
>> Not a hacker myself, but I can tell you that the first question you'll
>> be asked is "can you produce a test case"? If you can generate the
>> problem from a test table+generated data that will let people figure out
>> the problem for you.

> Unfortunately, I have not been able to produce a test case (yet) on a
> small data set.
> While the data in the database is public information, the whole database
> is about 100 GB, and therefore kinda hard to share.

Seems like we'd only need a dump of the one problem table, not the entire database.

			regards, tom lane
On Tue, 2009-06-09 at 13:40 +0200, Floris Bos / Maxnet wrote:

> fsync = off

That's a bad plan if you care about your database.

--
 Simon Riggs           www.2ndQuadrant.com
 PostgreSQL Training, Services and Support
* Simon Riggs:

> On Tue, 2009-06-09 at 13:40 +0200, Floris Bos / Maxnet wrote:
>
>> fsync = off
>
> That's a bad plan if you care about your database.

It shouldn't introduce this type of corruption, though.

--
Florian Weimer                <fweimer@bfk.de>
BFK edv-consulting GmbH       http://www.bfk.de/
Kriegsstraße 100              tel: +49-721-96201-1
D-76133 Karlsruhe             fax: +49-721-96201-99
Hi,

Simon Riggs wrote:
> On Tue, 2009-06-09 at 13:40 +0200, Floris Bos / Maxnet wrote:
>> fsync = off
>
> That's a bad plan if you care about your database.

I am aware of the risk of data loss in case of power failure, etc. However, fsync=on is simply too slow for my purpose, and it concerns data that can be regenerated from its source.

The website this setup is for has been running various previous versions of PostgreSQL with fsync=off since 2005, so I still expect it to work.

Yours sincerely,

Floris Bos
Well, sure it could -- once. It wouldn't be reproducible in a freshly rebuilt index unless he's crashing his machine every time.

--
Greg

On 9 Jun 2009, at 17:12, Florian Weimer <fweimer@bfk.de> wrote:

> * Simon Riggs:
>
>> On Tue, 2009-06-09 at 13:40 +0200, Floris Bos / Maxnet wrote:
>>
>>> fsync = off
>>
>> That's a bad plan if you care about your database.
>
> It shouldn't introduce this type of corruption, though.
Floris,

> The website this setup is for has been running various previous versions
> of PostgreSQL with fsync=off since 2005.
> So I still expect it to work.

You've been lucky, that's all. Our documentation has been clear, back to version 7.0, that turning fsync off carries the risk that you will have to recreate your entire database in the event of an unexpected shutdown. That's not new.

So, the operative question is: was 8.4 shut down with -immediate or otherwise unexpectedly? If so, then we don't have a bug. If 8.4 was never shut down, then we have some strange behavior which bears looking into.

And you've found a weird corner case, which is what we count on our users for. Thanks for testing.

--
Josh Berkus
PostgreSQL Experts Inc.
www.pgexperts.com
* Josh Berkus:

> Our documentation has been clear, back to version 7.0, that turning
> fsync=off carries the risk that you will have to recreate your entire
> database in the event of unexpected shutdown. That's not new.

The documentation does not say this. Instead, there's the following rather explicit explanation that only OS crashes matter:

| (Crashes of the database software itself are not a risk factor
| here. Only an operating-system-level crash creates a risk of
| corruption.)

If it really matters how PostgreSQL is shut down in "fsync = off" mode (while the operating system keeps running), the documentation is seriously wrong here.

--
Florian Weimer                <fweimer@bfk.de>
BFK edv-consulting GmbH       http://www.bfk.de/
Kriegsstraße 100              tel: +49-721-96201-1
D-76133 Karlsruhe             fax: +49-721-96201-99
Florian Weimer wrote:
> * Josh Berkus:
>
>> Our documentation has been clear, back to version 7.0, that turning
>> fsync=off carries the risk that you will have to recreate your entire
>> database in the event of unexpected shutdown. That's not new.
>
> The documentation does not say this. Instead, there's the following
> rather explicit explanation that only OS crashes matter:
>
> | (Crashes of the database software itself are not a risk factor
> | here. Only an operating-system-level crash creates a risk of
> | corruption.)
>
> If it really matters how PostgreSQL is shut down in "fsync = off" mode
> (while the operating system keeps running), the documentation is
> seriously wrong here.

Yeah, AFAICT the writes are handed off to the operating system (just not synced), so if it flushes its caches sanely at all there shouldn't be a problem.

--
Alvaro Herrera                               http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Alvaro Herrera <alvherre@commandprompt.com> wrote:
> Yeah, AFAICT the writes are handed off to the operating system (just
> not synced), so if it flushes its caches sanely at all there
> shouldn't be a problem.

I would certainly *hope* that's the case. We sometimes use fsync=off for conversions, where we plan to just start over if the conversion crashes, and set it to on when the conversion is done. It would be disturbing to discover that fsync=off also means "don't bother to write dirty buffers to the OS before shutdown."

-Kevin
Alvaro, Kevin,

>> Yeah, AFAICT the writes are handed off to the operating system (just
>> not synced), so if it flushes its caches sanely at all there
>> shouldn't be a problem.
>
> I would certainly *hope* that's the case. We sometimes use fsync=off
> for conversions, where we plan to just start over if the conversion
> crashes, and set it to on when the conversion is done. It would be
> disturbing to discover that fsync=off also means "don't bother to
> write dirty buffers to the OS before shutdown."

It doesn't. But what I don't trust, and the *first* place I'd look for problems, is whether the OS flushes *all* dirty buffers to disk in the event the application gets killed.

That's why I want more information on Floris' case. Was 8.4 killed or shut down with -m immediate? Or the OS rebooted with 8.4 running?

--
Josh Berkus
PostgreSQL Experts Inc.
www.pgexperts.com
Josh Berkus <josh@agliodbs.com> writes:
> It doesn't. But what I don't trust, and the *first* place I'd look for
> problems, is whether the OS flushes *all* dirty buffers to disk in the
> event the application gets killed.

Why wouldn't you trust it? The sort of thing you seem to be thinking about would require tracking which process(es) wrote each dirty buffer, and then going back and dropping selected dirty buffers when a process exits abnormally. I can hardly imagine any OS wishing to do that.

			regards, tom lane
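[The semantics under discussion can be demonstrated on any POSIX system with a minimal sketch; the filename and payload below are invented, and this is an illustration of the general write()/page-cache behavior, not of PostgreSQL's code. A child process writes data without fsync'ing and is killed outright; the data still lands on disk, because write() hands the dirty page to the OS page cache.]

```python
import os
import subprocess
import sys
import tempfile

# A child process writes through a raw file descriptor and then kills
# itself with SIGKILL, so it never closes, flushes, or fsyncs anything.
# The data is still readable afterwards: write() already handed the
# dirty page to the OS page cache, and only an OS-level crash (power
# loss, kernel panic) could lose it.
path = os.path.join(tempfile.mkdtemp(), "nofsync.dat")
child_src = (
    "import os, signal\n"
    f"fd = os.open({path!r}, os.O_WRONLY | os.O_CREAT)\n"
    "os.write(fd, b'dirty page')\n"
    "os.kill(os.getpid(), signal.SIGKILL)  # die without any cleanup\n"
)
proc = subprocess.run([sys.executable, "-c", child_src])
print(proc.returncode)  # -9 on POSIX: terminated by SIGKILL
with open(path, "rb") as f:
    print(f.read())     # the un-fsynced data survived the process death
```

This is why an abnormal backend exit alone, with the OS still running, should not produce the reproducible corruption seen in this thread.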
Hi,

Josh Berkus wrote:
> It doesn't. But what I don't trust, and the *first* place I'd look for
> problems, is whether the OS flushes *all* dirty buffers to disk in the
> event the application gets killed.
>
> That's why I want more information on Floris' case. Was 8.4 killed or
> shut down with -m immediate? Or the OS rebooted with 8.4 running?

The only reboots I have done on that server were with the "reboot" system command, which should send a SIGTERM to all processes first, including PostgreSQL, before pulling the plug.

I do recall that during the execution of "vacuum full" the psql client program once reported that it lost the connection with the server, but was able to reconnect. Maybe the server process handling the connection died then, but I am not sure of that, and it only happened once.

Anyway, the problem also occurs when there is no reboot or unexpected event between the reindex and the query. After a REINDEX it is able to find the row it was missing first, but then other rows become missing. All in the same psql session:

===
usenet=> SELECT count(*) FROM posts_index WHERE poster='Yenc@power-post.org (Yenc-PP-A&A)' AND groupid=300 AND basefile='NIB8124849';
 count
-------
     0
(1 row)

usenet=> reindex index pgb_idx;
REINDEX
usenet=> SELECT count(*) FROM posts_index WHERE poster='Yenc@power-post.org (Yenc-PP-A&A)' AND groupid=300 AND basefile='NIB8124849';
 count
-------
     1
(1 row)

usenet=> SELECT count(*) FROM posts_index WHERE poster='Yenc@power-post.org (Yenc-PP-A&A)' AND basefile='frx-fffe' AND groupid=757;
 count
-------
     0
(1 row)

usenet=> set enable_indexscan=false;
SET
usenet=> set enable_bitmapscan=false;
SET
usenet=> SELECT count(*) FROM posts_index WHERE poster='Yenc@power-post.org (Yenc-PP-A&A)' AND basefile='frx-fffe' AND groupid=757;
 count
-------
     1
(1 row)
===

Yours sincerely,

Floris Bos
Floris Bos / Maxnet wrote:
> The following settings differ from the defaults:
>
> --
> shared_buffers = 3500MB
> maintenance_work_mem = 128MB
> fsync = off
> synchronous_commit = off
> checkpoint_segments = 25
> --
>
> ==
> Table layout
> ==
>
> --
>                                Table "public.posts_index"
>    Column   |          Type          |                         Modifiers
> ------------+------------------------+-----------------------------------------------------------
>  cid        | integer                | not null default nextval('posts_index_cid_seq'::regclass)
>  groupid    | integer                | not null
>  startdate  | integer                | not null
>  poster     | character varying(64)  | not null
>  basefile   | character varying(64)  | not null
>  subject    | character varying(255) | not null
>  size       | real                   |
>  nfo        | boolean                |
>  c          | boolean                |
>  parts      | integer                |
>  totalparts | integer                |
>  imdb       | integer                |
>  ng1        | boolean                | default false
>  g2         | integer                | default 0
>  g3         | integer                | default 0
>  data       | bytea                  |
> Indexes:
>     "posts_index5_pkey" PRIMARY KEY, btree (cid) CLUSTER
>     "gr_idx" btree (groupid, (- cid))
>     "pgb_idx" btree (poster, groupid, basefile)
> --
>
> Only noticed problems with the pgb_idx index so far.

I have been trying to reproduce the problem, but no success so far. I made myself a table that matches yours, then wrote a little perl script to fill it with random data. (The script also writes out a text file I can use to re-query things.) I fill the db, then add the indexes. Then I test-lookup every record I added, and find them all.

So, a few questions:

1) Did you dump/restore into 8.4beta1 first and then upgrade the program? Or did you dump/restore into 8.4beta2?
2) Did you use any of the concurrent restore options?
3) Do you do any updates or deletes on the table after you restore it?
4) Do you do any other operations on the table (vacuum, cluster, etc.)?
5) Got any triggers or stored procs?
6) To the -hackers: I write the records and then re-find them in the exact same order. Would it be a better test to search for records in a more random order? Would it make a difference? Would searching for some but not all make a difference?

-Andy
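[The fill script described above is not quoted in the thread; the following is a hedged Python sketch of the same approach, not the actual fill.pl. The (poster, groupid, basefile) trio matches the pgb_idx definition from the thread, while the value shapes, seed, and row count are invented for illustration.]

```python
import random
import string

def random_word(rng, n):
    """Random lowercase word of length n."""
    return "".join(rng.choice(string.ascii_lowercase) for _ in range(n))

def generate_rows(count, seed=42):
    """Yield (poster, groupid, basefile) tuples. Seeding the RNG means
    the exact same rows can be regenerated for the re-query pass,
    playing the role of the side text file the perl script writes."""
    rng = random.Random(seed)
    for _ in range(count):
        yield (
            "%s@%s.org" % (random_word(rng, 8), random_word(rng, 6)),
            rng.randint(1, 1000),
            random_word(rng, 10),
        )

rows = list(generate_rows(5))
# Emit COPY text format (tab-separated) for bulk loading before the
# indexes are created, as in the procedure described above.
copy_lines = ["%s\t%d\t%s" % r for r in rows]
print(copy_lines[0])
# Determinism check: a second pass with the same seed re-finds every
# generated record.
assert rows == list(generate_rows(5))
```

Randomizing the lookup order (question 6) would just mean shuffling `rows` before the re-query pass.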
Andy,

> 6) To the -hackers: I write the records and then refind them in the
> exact same order, would it be a better test to search for records in a
> more random order? would it make a difference? Would searching for some
> but not all make a difference?

Are you on OpenSolaris? Can you give your script to Zdenek & Jignesh to test in their environments?

--
Josh Berkus
PostgreSQL Experts Inc.
www.pgexperts.com
Josh Berkus wrote:
> Andy,
>
>> 6) To the -hackers: I write the records and then refind them in the
>> exact same order, would it be a better test to search for records in a
>> more random order? would it make a difference? Would searching for some
>> but not all make a difference?
>
> Are you on OpenSolaris? Can you give your script to Zdenek & Jignesh to
> test in their enviroments?

I am not, and yes I can. Hopefully I can attach a .tar.bz2.

The bigtest.sh is the one to run (it runs all the parts). You'll need to edit fill.pl and test.pl and set the dbname, and maybe give a username/password. In fill.pl there is a $max variable that's used to set the number of records to insert. (It's set to 10 million right now.)

Oh, and this .tar won't create a subdirectory.

-Andy
Floris,

One more question set: what version of OpenSolaris, and what filesystem are you using? Does the OS have any non-default tuning settings? How did you install or compile PostgreSQL?

--
Josh Berkus
PostgreSQL Experts Inc.
www.pgexperts.com
Floris Bos / Maxnet <bos@je-eigen-domein.nl> writes:
> Hi,
> Tom Lane wrote:
>> Floris Bos / Maxnet <bos@je-eigen-domein.nl> writes:
>>> postgres@db:/data$ /opt/postgres/8.4-beta/bin/64/initdb -E SQL_ASCII -X /data/pg_xlog /data/db
>>> The database cluster will be initialized with locale en_US.UTF-8.
>>
>> Oooh, that doesn't look real good. You're going to be using strcoll()
>> comparisons that assume the data is in UTF8, but the database is not
>> enforcing valid UTF8 encoding. I have not checked the dump to see if
>> it's all valid data, but this could be the root of the issue.
>>
>> If you want to use SQL_ASCII because the data isn't uniformly encoded,
>> it'd be better to use C locale.

> Darn.
> Looks like you are right!
> Works a lot better with "--locale=C"

> My 8.3 PostgreSQL installation ran under FreeBSD, and there the locale
> is C by default, so I was not used to having to add a "--locale=C"
> option. Under OpenSolaris it's indeed UTF-8 by default.

Yeah, this is kind of unfortunate. I'm not sure there is much we could do about it, unless we want to insist that C locale be used if the database encoding is SQL_ASCII. That cure seems worse than the disease, though. We have locked down encoding/locale combinations pretty strictly for 8.4, but SQL_ASCII is generally thought to be a "let the user beware" setting.

			regards, tom lane
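[The failure mode identified above can be illustrated outside PostgreSQL with a minimal sketch; the byte strings below are invented samples. SQL_ASCII admits arbitrary bytes; bytewise comparison (what the C locale does) is a total, repeatable order on them, but a strcoll()-style comparison under en_US.UTF-8 assumes the bytes are valid UTF-8, which they need not be.]

```python
# SQL_ASCII performs no encoding validation, so values like these can
# end up in a btree. Bytewise (C locale) comparison orders them
# consistently; interpreting them as UTF-8 -- what strcoll() under
# en_US.UTF-8 assumes -- is undefined, since two of them do not decode.
values = [b"abc", b"\xff\xfe high bytes", b"caf\xe9 latin-1"]

print(sorted(values))  # bytewise: a total, deterministic ordering

for v in values:
    try:
        v.decode("utf-8")
        print("valid UTF-8:    ", v)
    except UnicodeDecodeError as e:
        print("NOT valid UTF-8:", v, "-", e.reason)
# An index built and searched with comparison rules that disagree on
# such values "loses" rows, which is the symptom this thread reports.
```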