Thread: GiST concurrency commited
Have we list named something like 'test focusing for 8.1'? If it exists then GiST concurrency and recovery testing should be added to it. Especially, recovery after crash. Of course, now Oleg and me going to begin a large test program. While I'm running test with concurrent select/insert/update/delete/vacuum/vacuum full I found, that sometimes postgres crashes in index_beginscan_internal on FunctionCall3, because structure 'procedure' becomes zeroed. As I understand, LockRelation can invalidate part of Relation structure. So, I moved GET_REL_PROCEDURE after LockRelation. It seems to me, this patch should be backpatched or it's needed another fixing. This problem was 2-4 times per million statements executing by 4 flows. And there is one more problem: it caused approximatly one time per 2-4 million statements, I got traps: TRAP: FailedAssertion("!((*curpage)->offsets_used == num_tuples)", File: "vacuum.c", Line: 2766) LOG: server process (PID 15847) was terminated by signal 6 Sorry, but I couldn't debug this trap and my knowledge about this piece of code is very limited. Postgres didn't create a core file. I don't believe this problem is in touch with my GiST framework, becouse it is about heap pages. I suspect trap occurs while concurrent vacuum, but I am not sure. PS My concurrency testing scripts: http://www.sigaev.ru/gist/ concur.pl - generator of SQL statements concur.sh - simple wrapper about concur.pl which reinit db, makes db and table. -- Teodor Sigaev E-mail: teodor@sigaev.ru WWW: http://www.sigaev.ru/
Teodor Sigaev <teodor@sigaev.ru> writes: > While I'm running test with concurrent > select/insert/update/delete/vacuum/vacuum full I found, that sometimes > postgres crashes in index_beginscan_internal on FunctionCall3, because > structure 'procedure' becomes zeroed. As I understand, LockRelation > can invalidate part of Relation structure. So, I moved > GET_REL_PROCEDURE after LockRelation. Oooh, good catch. > It seems to me, this patch > should be backpatched or it's needed another fixing. No, it's not an issue in the back branches, because until recently GET_REL_PROCEDURE only fetched the function OID. > And there is one more problem: it caused approximatly one time per 2-4 million > statements, I got traps: > TRAP: FailedAssertion("!((*curpage)->offsets_used == num_tuples)", File: > "vacuum.c", Line: 2766) > LOG: server process (PID 15847) was terminated by signal 6 Odd. Will look at it later (after feature freeze), if you don't find the cause beforehand. regards, tom lane
I think the whole GiST limitations page can be removed now... http://developer.postgresql.org/docs/postgres/limitations.html Chris
>>And there is one more problem: it caused approximatly one time per 2-4 million >>statements, I got traps: >>TRAP: FailedAssertion("!((*curpage)->offsets_used == num_tuples)", File: >>"vacuum.c", Line: 2766) >>LOG: server process (PID 15847) was terminated by signal 6 > > > Odd. Will look at it later (after feature freeze), if you don't find > the cause beforehand. It's definitly bug in a vaccum code, I got the same trap without any GiST indexes (to reproduce, just comment out 'create index' command in my script). -- Teodor Sigaev E-mail: teodor@sigaev.ru WWW: http://www.sigaev.ru/
"Teodor Sigaev" <teodor@sigaev.ru> writes > > concur.pl - generator of SQL statements retrieving it is forbidden ... Regards, Qingqing
Sorry, fixed. Qingqing Zhou wrote: > "Teodor Sigaev" <teodor@sigaev.ru> writes > >>concur.pl - generator of SQL statements > > > retrieving it is forbidden ... > > Regards, > Qingqing > > > > ---------------------------(end of broadcast)--------------------------- > TIP 7: don't forget to increase your free space map settings -- Teodor Sigaev E-mail: teodor@sigaev.ru WWW: http://www.sigaev.ru/
Christopher Kings-Lynne wrote: > I think the whole GiST limitations page can be removed now... > > http://developer.postgresql.org/docs/postgres/limitations.html Done. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001+ If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania19073
Awhile back, Teodor Sigaev <teodor@sigaev.ru> wrote: > And there is one more problem: it caused approximatly one time per 2-4 million > statements, I got traps: > TRAP: FailedAssertion("!((*curpage)->offsets_used == num_tuples)", File: > "vacuum.c", Line: 2766) > LOG: server process (PID 15847) was terminated by signal 6 > Sorry, but I couldn't debug this trap and my knowledge about this piece of code > is very limited. Postgres didn't create a core file. I don't believe this > problem is in touch with my GiST framework, becouse it is about heap pages. I > suspect trap occurs while concurrent vacuum, but I am not sure. > PS > My concurrency testing scripts: > http://www.sigaev.ru/gist/ > concur.pl - generator of SQL statements > concur.sh - simple wrapper about concur.pl which reinit db, makes db and table. I have committed changes that I believe fix this problem: http://archives.postgresql.org/pgsql-committers/2005-08/msg00213.php But it needs more testing. Would you update to CVS tip and see if you still see the failure? Also, if anyone else has some vacuum + concurrent update test cases, any testing you can do in CVS tip would be useful. This patch is big and ugly enough that back-patching it into all the supported back branches is a pretty scary prospect. I don't think we have a lot of choice --- it is a data-loss risk --- but we need to beat the heck out of the CVS-tip version before we start pushing it into the release branches. My current intention is to leave it just in CVS tip for the next few days, and not to start developing back-branch versions until after we've made the first 8.1 beta release. The back-ports are going to be painful (the code involved has changed often enough that I fear each branch will need a custom tailored patch) ... so I really don't want to start without some confidence that the CVS-tip patch is right. In other words ... if you can test this ... HELP!!! regards, tom lane
On Sat, 20 Aug 2005, Tom Lane wrote: > I have committed changes that I believe fix this problem: > http://archives.postgresql.org/pgsql-committers/2005-08/msg00213.php > But it needs more testing. Would you update to CVS tip and see if you > still see the failure? I've written some quick scripts. One just vacuums constantly (999 vacuums to 1 vacuum full) while three other scripts three randomly insert into, update and delete from 3 tables. There's a mix of small and large transactions. The tables have a single int column. It is set up to run 3 million transactions across the 3 scripts. I will try and jump onto one of the larger OSDL machines to test as well. Gavin
Gavin Sherry <swm@linuxworld.com.au> writes: > I've written some quick scripts. One just vacuums constantly (999 vacuums > to 1 vacuum full) while three other scripts three randomly insert > into, update and delete from 3 tables. There's a mix of small and large > transactions. The tables have a single int column. It is set up to run 3 > million transactions across the 3 scripts. Note that since the issues have mainly to do with update chains, it'd be good to stress cases where a row is updated multiple times before being deleted. And use at least one long-running transaction, so that VACUUM can't just throw away the update chain. regards, tom lane
> In other words ... if you can test this ... HELP!!! I'll run tests. -- Teodor Sigaev E-mail: teodor@sigaev.ru WWW: http://www.sigaev.ru/
On Sat, 20 Aug 2005, Tom Lane wrote: > Gavin Sherry <swm@linuxworld.com.au> writes: > > I've written some quick scripts. One just vacuums constantly (999 vacuums > > to 1 vacuum full) while three other scripts three randomly insert > > into, update and delete from 3 tables. There's a mix of small and large > > transactions. The tables have a single int column. It is set up to run 3 > > million transactions across the 3 scripts. > > Note that since the issues have mainly to do with update chains, it'd be > good to stress cases where a row is updated multiple times before being > deleted. And use at least one long-running transaction, so that VACUUM > can't just throw away the update chain. Right. I modified the test so have multiple updates of a given row mixed with concurrent long running read transactions. Vacuum was running repeatedly in a concurrent session. I did not encounter any problems. However, the results are inconclusive since I ran the same test against HEAD from 10 days ago and didn't manage to trigger the problem Teodor's script did. I'll take a better look tomorrow. Gavin
> I have committed changes that I believe fix this problem: > http://archives.postgresql.org/pgsql-committers/2005-08/msg00213.php > But it needs more testing. Would you update to CVS tip and see if you > still see the failure? It seems, patch works correctly. My tests is passed with approximatly 1e8 SQL statements without any failure. Tests runs on two boxes: PIII/1133MHz adn Quad Xeon/500MHz and works with four threads. Further I'm going to increase concurrency up to 12 parallel threads. PS GiST passed this tests too. -- Teodor Sigaev E-mail: teodor@sigaev.ru WWW: http://www.sigaev.ru/
> Further I'm going to increase concurrency up to 12 parallel threads. All is ok, test is passed with approximatly 40 millions statements -- Teodor Sigaev E-mail: teodor@sigaev.ru WWW: http://www.sigaev.ru/
On K, 2005-08-24 at 15:52 +0400, Teodor Sigaev wrote: > > Further I'm going to increase concurrency up to 12 parallel threads. > > All is ok, test is passed with approximatly 40 millions statements > I have sent a patch to patches list enabling concurrent vacuums to actually reclaim space while another long vacuum is running, i.e. vacuums won't no longer block each other from removing deleted tuples. see: http://archives.postgresql.org/pgsql-patches/2005-08/msg00304.php Could you perhaps test this patch as well, while you already have a setup for testing parallel vacuums under big loads ? Or perhaps you can share the setup/scripts/data so that I could run your test myself as well ? -- Hannu Krosing <hannu@skype.net>
> http://archives.postgresql.org/pgsql-patches/2005-08/msg00304.php > > Could you perhaps test this patch as well, while you already have a > setup for testing parallel vacuums under big loads ? Ok, I'll do it. > > Or perhaps you can share the setup/scripts/data so that I could run your > test myself as well ? > Statements generator: http://www.sigaev.ru/gist/concur.pl Usage: ./concur.pl -d DATABASE [-n N] [-c N] [-l] -d DATABASE - DATABASE name -n N - number of rows -c N - number of flow -l - linear mode Script guarantees the same result (ie the same data in table) with the same -n and -c options, result doesn't depend on -l option. Also script guarantees absence of dead locks. % perl concur.pl -d qq -n 100000 -c 4 Start: parallel mode with 4 flows 3 flow finish. Stats: ni:25000 nu:555 nd:81 nv:4(nf:1) nt:232 0 flow finish. Stats: ni:25000 nu:554 nd:77 nv:2(nf:0) nt:247 1 flow finish. Stats: ni:25000 nu:548 nd:65 nv:4(nf:1) nt:249 2 flow finish. Stats: ni:25000 nu:552 nd:79 nv:7(nf:2) nt:263 All flow finish; status: 0; elapsed time: 159.25 sec Script prints some statistics per flow: ni - number of insert's statements (sum of this should be equal to -n N) nu - -/- update's statements nd - -/- delete's statements nv - -/- vacuum (including full vacuum) nf - number of full vacuum nt - number of transactions Simple wrapper for manipulating test table and loop testing (you should edit it to right paths): http://www.sigaev.ru/gist/concur.sh Those scripts was wrote to test GiST concurrency. I suspect it's needed to make some changes in generator to increase number of updates and vacuums for your goal. -- Teodor Sigaev E-mail: teodor@sigaev.ru WWW: http://www.sigaev.ru/
Teodor Sigaev <teodor@sigaev.ru> writes: > http://www.sigaev.ru/gist/concur.pl > http://www.sigaev.ru/gist/concur.sh BTW, these scripts seem to indicate that there's a GIST or contrib/intarray problem in the 8.0 branch. I was trying to use 'em to test REL8_0_STABLE branch tip to verify my t_ctid chain backpatch, and I pretty consistently see "Problem with update": Start: parallel mode with 4 flows Problem with update {77,77}:0 count:1 at concur.pl line 91. Issuing rollback() for database handle being DESTROY'd without explicit disconnect(). Problem with update {43,24}:3 count:1 at concur.pl line 91. Issuing rollback() for database handle being DESTROY'd without explicit disconnect(). Problem with update {43,43}:2 count:1 at concur.pl line 91. Issuing rollback() for database handle being DESTROY'd without explicit disconnect(). 1 flow finish. Stats: ni:75000 nu:1661 nd:216 nv:13(nf:3) nt:780 All flow finish; status: 255; elapsed time: 265.48 sec Is this something that can be fixed for 8.0.4? regards, tom lane
> http://archives.postgresql.org/pgsql-patches/2005-08/msg00304.php > > Could you perhaps test this patch as well, while you already have a > setup for testing parallel vacuums under big loads ? I didn't find any problem with your patch during testing with 1e8 statements... -- Teodor Sigaev E-mail: teodor@sigaev.ru WWW: http://www.sigaev.ru/
On R, 2005-08-26 at 16:47 +0400, Teodor Sigaev wrote: > > http://archives.postgresql.org/pgsql-patches/2005-08/msg00304.php > > > > Could you perhaps test this patch as well, while you already have a > > setup for testing parallel vacuums under big loads ? > > I didn't find any problem with your patch during testing with 1e8 statements... Thank You Very Much ! -- Hannu Krosing <hannu@skype.net>
Finded problem in GiST isn't too simple to resolve. I'm working on it. The problem is about update query... Tom Lane wrote: > Teodor Sigaev <teodor@sigaev.ru> writes: > >>http://www.sigaev.ru/gist/concur.pl >>http://www.sigaev.ru/gist/concur.sh > > > BTW, these scripts seem to indicate that there's a GIST or > contrib/intarray problem in the 8.0 branch. I was trying to use 'em > to test REL8_0_STABLE branch tip to verify my t_ctid chain backpatch, > and I pretty consistently see "Problem with update": > > Start: parallel mode with 4 flows > Problem with update {77,77}:0 count:1 at concur.pl line 91. > Issuing rollback() for database handle being DESTROY'd without explicit disconnect(). > Problem with update {43,24}:3 count:1 at concur.pl line 91. > Issuing rollback() for database handle being DESTROY'd without explicit disconnect(). > Problem with update {43,43}:2 count:1 at concur.pl line 91. > Issuing rollback() for database handle being DESTROY'd without explicit disconnect(). > 1 flow finish. Stats: ni:75000 nu:1661 nd:216 nv:13(nf:3) nt:780 > All flow finish; status: 255; elapsed time: 265.48 sec > > Is this something that can be fixed for 8.0.4? > > regards, tom lane -- Teodor Sigaev E-mail: teodor@sigaev.ru WWW: http://www.sigaev.ru/
Fixed in 8.0, 7.4 and 7.3 branches. Tom Lane wrote: > Teodor Sigaev <teodor@sigaev.ru> writes: > >>http://www.sigaev.ru/gist/concur.pl >>http://www.sigaev.ru/gist/concur.sh > > > BTW, these scripts seem to indicate that there's a GIST or > contrib/intarray problem in the 8.0 branch. I was trying to use 'em > to test REL8_0_STABLE branch tip to verify my t_ctid chain backpatch, > and I pretty consistently see "Problem with update": > > Start: parallel mode with 4 flows > Problem with update {77,77}:0 count:1 at concur.pl line 91. > Issuing rollback() for database handle being DESTROY'd without explicit disconnect(). > Problem with update {43,24}:3 count:1 at concur.pl line 91. > Issuing rollback() for database handle being DESTROY'd without explicit disconnect(). > Problem with update {43,43}:2 count:1 at concur.pl line 91. > Issuing rollback() for database handle being DESTROY'd without explicit disconnect(). > 1 flow finish. Stats: ni:75000 nu:1661 nd:216 nv:13(nf:3) nt:780 > All flow finish; status: 255; elapsed time: 265.48 sec > > Is this something that can be fixed for 8.0.4? > > regards, tom lane -- Teodor Sigaev E-mail: teodor@sigaev.ru WWW: http://www.sigaev.ru/
Am Dienstag, 30. August 2005 11:25 schrieb Teodor Sigaev: > Fixed in 8.0, 7.4 and 7.3 branches. > > Tom Lane wrote: > > Teodor Sigaev <teodor@sigaev.ru> writes: > >>http://www.sigaev.ru/gist/concur.pl > >>http://www.sigaev.ru/gist/concur.sh > > > > BTW, these scripts seem to indicate that there's a GIST or > > contrib/intarray problem in the 8.0 branch. I was trying to use 'em > > to test REL8_0_STABLE branch tip to verify my t_ctid chain backpatch, > > and I pretty consistently see "Problem with update": > > > > Start: parallel mode with 4 flows > > Problem with update {77,77}:0 count:1 at concur.pl line 91. > > Issuing rollback() for database handle being DESTROY'd without explicit > > disconnect(). Problem with update {43,24}:3 count:1 at concur.pl line 91. > > Issuing rollback() for database handle being DESTROY'd without explicit > > disconnect(). Problem with update {43,43}:2 count:1 at concur.pl line 91. > > Issuing rollback() for database handle being DESTROY'd without explicit > > disconnect(). 1 flow finish. Stats: ni:75000 nu:1661 nd:216 nv:13(nf:3) > > nt:780 All flow finish; status: 255; elapsed time: 265.48 sec > > > > Is this something that can be fixed for 8.0.4? > > > > regards, tom lane Since 7.4 we have troubles with ltree (seldom corruption of buffer cache, not on-disk), might this bug be somehow related to the ltree problem? 7.2 was rock-stable with ltree. Best regards, Mario Weilguni
> Since 7.4 we have troubles with ltree (seldom corruption of buffer cache, not > on-disk), might this bug be somehow related to the ltree problem? > 7.2 was rock-stable with ltree. Not sure. Fixed bug was (@ - contains operation): update wow set a = a || '{101}'::int[] where a @ '{1,2,3}'; select a from wow where a @ '{1,2,3}' and not a @ '{101}'; After update query select must not find any rows, but it did. The problem was in GiST code and so any GiST idexes was affected. Can you say more about your trouble? -- Teodor Sigaev E-mail: teodor@sigaev.ru WWW: http://www.sigaev.ru/
Am Dienstag, 30. August 2005 12:19 schrieb Teodor Sigaev: > > Since 7.4 we have troubles with ltree (seldom corruption of buffer cache, > > not on-disk), might this bug be somehow related to the ltree problem? 7.2 > > was rock-stable with ltree. > > Not sure. Fixed bug was (@ - contains operation): > > update wow set a = a || '{101}'::int[] where a @ '{1,2,3}'; > select a from wow where a @ '{1,2,3}' and not a @ '{101}'; > > After update query select must not find any rows, but it did. The problem > was in GiST code and so any GiST idexes was affected. > > Can you say more about your trouble? We have queries that use ltree for sorting too, the sort looks like this: order by subpath(ltreefield, 0, nlevel(ltreefield)- 1) But concurrency leads to a bug, that results in an sql-error: ERROR: invalid positions Now we use locking to prevent concurrenct access on the most-used concurrent part of the program, and the problem is extremly rare now, I had only 4 occurences in one year, but still happens (there are other access paths that do not use locking, but they are rareley accessed). It seems the ltree length parameter is set to 0 in the tuples, the content itself is still there: Example: Say the tuple was before treefield='1.2.3.4.5' After the occurence of the error, I get: treefield='' (empty, but not null) Using a tool Tom Lane told me to use, I checked it, and on-disk I had still "1.2.3.4.5", but the length parameter of the ltree column was 0 (sorry, I was wrong in my first mail, on-disk was broken too.) Might this be somehow related to the intarray bugs? Best regards, Mario Weilguni p.s.: I tried hard to create a self-contained test for tracking this down, but failed.
> We have queries that use ltree for sorting too, the sort looks like this: > order by subpath(ltreefield, 0, nlevel(ltreefield) - 1) > > But concurrency leads to a bug, that results in an sql-error: > ERROR: invalid positions contrib_regression=# select nlevel(t), subpath(t, 0, nlevel(t)-1) from text2ltree('1.2.3.4.5') as t; nlevel | subpath --------+--------- 5 | 1.2.3.4 (1 row) contrib_regression=# select nlevel(t), subpath(t, 0, nlevel(t)-1) from text2ltree('1') as t; nlevel | subpath --------+--------- 1 | (1 row) contrib_regression=# select nlevel(t), subpath(t, 0, nlevel(t)-1) from text2ltree('') as t; ERROR: invalid positions contrib_regression=# It's incorrect arguments for subpath(). > > It seems the ltree length parameter is set to 0 in the tuples, the content > itself is still there: Example: > Say the tuple was before treefield='1.2.3.4.5' > After the occurence of the error, I get: treefield='' (empty, but not null) > > Using a tool Tom Lane told me to use, I checked it, and on-disk I had still > "1.2.3.4.5", but the length parameter of the ltree column was 0 (sorry, I was > wrong in my first mail, on-disk was broken too.) Interesting... But with some test suite or more information I'm helpless. How often do updates/inserts of table and/or ltree column occurs? Vacuum? > > Might this be somehow related to the intarray bugs? No, except case when you update your table something like to: update tbl set ltreefield=... where ltreefield ...; -- Teodor Sigaev E-mail: teodor@sigaev.ru WWW: http://www.sigaev.ru/
Teodor Sigaev <teodor@sigaev.ru> writes: >> Is this something that can be fixed for 8.0.4? > Fixed in 8.0, 7.4 and 7.3 branches. Excellent news. Thanks. regards, tom lane