Thread: Re: [HACKERS] strange behavior of UPDATE
Bruce Momjian wrote:
> > > Not really.  It required changing function calls all over, and probably
> > > relies on 6.5 changes too.  It is in the 6.5 beta right now.  Are you
> > > using that?
> >
> > I did all the tests with the current snapshot of May 18th.
>
> Fix was installed yesterday.

sorry, but the problem still persists. I used

  -rw-r--r--   1 1005   root   5954900 May 24 03:03 postgresql.snapshot.tar.gz

I started the query

	update bench set k500k = k500k + 1 where k100 = 30;

and I killed it after half an hour of heavy disk activity. The same
query on the same machine with the same setup, but using
sybase-ase-11.0.3.3-1, takes less than 1 minute.

The table bench looks like the following:

Table    = bench
+----------------------------------+----------------------------------+-------+
|              Field               |              Type                | Length|
+----------------------------------+----------------------------------+-------+
| kseq                             | int4 not null                    |     4 |
| k500k                            | int4 not null                    |     4 |
| k250k                            | int4 not null                    |     4 |
| k100k                            | int4 not null                    |     4 |
| k40k                             | int4 not null                    |     4 |
| k10k                             | int4 not null                    |     4 |
| k1k                              | int4 not null                    |     4 |
| k100                             | int4 not null                    |     4 |
| k25                              | int4 not null                    |     4 |
| k10                              | int4 not null                    |     4 |
| k5                               | int4 not null                    |     4 |
| k4                               | int4 not null                    |     4 |
| k2                               | int4 not null                    |     4 |
| s1                               | char() not null                  |     8 |
| s2                               | char() not null                  |    20 |
| s3                               | char() not null                  |    20 |
| s4                               | char() not null                  |    20 |
| s5                               | char() not null                  |    20 |
| s6                               | char() not null                  |    20 |
| s7                               | char() not null                  |    20 |
| s8                               | char() not null                  |    20 |
+----------------------------------+----------------------------------+-------+

The table is filled with 1.000.000 rows of random data, and on every
field an index is created.

Edmund

--
Edmund Mergl                      mailto:E.Mergl@bawue.de
Im Haldenhau 9                    http://www.bawue.de/~mergl
70565 Stuttgart                   fon: +49 711 747503
Germany
Edmund Mergl <E.Mergl@bawue.de> writes:
> The table is filled with 1.000.000 rows of random data
> and on every field an index is created.

BTW, do you happen to know just how random the data actually is?
I noticed that the update query

	update bench set k500k = k500k + 1 where k100 = 30;

updates 10,000 rows.  If this "random" data actually consists of
10,000 repetitions of only 100 distinct values in every column,
then a possible explanation for the problem would be that our
btree index code isn't very fast when there are large numbers of
identical keys.  (Mind you, I have no idea if that's true or not,
I'm just trying to think of likely trouble spots.  Anyone know
btree well enough to say whether that is likely to be a problem?)

			regards, tom lane
I wrote:
> then a possible explanation for the problem would be that our
> btree index code isn't very fast when there are large numbers of
> identical keys.

Ah-hah, a lucky guess!  I went back and looked at the profile stats
I'd extracted from Edmund's "update" example.  This Linux box has
the same semi-functional gprof as someone else was using a while
ago --- the timings are bogus, but the call counts seem right.
And what I find are entries like this:

                0.00    0.00  284977/284977      _bt_binsrch [3174]
[3177]   0.0    0.00    0.00  284977             _bt_firsteq [3177]
                0.00    0.00 21784948/24713758   _bt_compare [3169]

                0.00    0.00     426/35632       _bt_split [53]
                0.00    0.00   35206/35632       _bt_insertonpg [45]
[3185]   0.0    0.00    0.00   35632             _bt_findsplitloc [3185]
                0.00    0.00 5093972/8907411     _bt_itemcmp [3171]

In other words, _bt_firsteq is averaging almost 100 comparisons per
call, _bt_findsplitloc well over that.  Both of these routines are
evidently designed on the assumption that there will be relatively
few duplicate keys --- they reduce to linear scans when there are
many duplicates.

_bt_firsteq shouldn't exist at all; the only routine that calls it
is _bt_binsrch, which does a fast binary search of the index page.
_bt_binsrch should be fixed so that the binary search logic does the
right thing for equal-key cases, rather than degrading to a linear
search.  I am less confident that I understand _bt_findsplitloc's place
in the great scheme of things, but it could certainly be revised to use
a binary search rather than linear scan.

This benchmark is clearly overemphasizing the equal-key case, but
I think it ought to be fixed independently of whether we want to
look good on a dubious benchmark ... equal keys are not uncommon in
real-world scenarios after all.

Next question is do we want to risk twiddling this code so soon before
6.5, or wait till after?

			regards, tom lane
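For illustration, here is a minimal sketch of the kind of equal-key-aware binary search Tom describes. This is not the actual PostgreSQL source; the function name, the plain `int` key array standing in for index tuples on a page, and the comparison are all invented for the example. The point is that the search itself can be biased to converge on the *first* entry not less than the search key, so a run of duplicates costs O(log n) comparisons instead of the linear scan `_bt_firsteq` performs:

```c
#include <stddef.h>

/* Return the index of the first element >= target in the sorted array
 * keys[0..nkeys-1], or nkeys if every element is smaller.  Runs of
 * equal keys add no extra comparisons: the search stays binary. */
int
first_geq(const int *keys, int nkeys, int target)
{
    int low = 0;
    int high = nkeys;               /* search the half-open range [low, high) */

    while (low < high)
    {
        int mid = low + (high - low) / 2;

        if (keys[mid] < target)
            low = mid + 1;          /* first match must lie to the right */
        else
            high = mid;             /* keys[mid] may be it; keep it in range */
    }
    return low;
}
```

With `keys = {1, 3, 3, 3, 5, 7}`, searching for 3 lands directly on index 1, the leftmost duplicate, without ever stepping left one element at a time.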
Tom Lane wrote:
>
> Edmund Mergl <E.Mergl@bawue.de> writes:
> > The table is filled with 1.000.000 rows of random data
> > and on every field an index is created.
>
> BTW, do you happen to know just how random the data actually is?
> I noticed that the update query
> 	update bench set k500k = k500k + 1 where k100 = 30;
> updates 10,000 rows.  If this "random" data actually consists of
> 10,000 repetitions of only 100 distinct values in every column,
> then a possible explanation for the problem would be that our
> btree index code isn't very fast when there are large numbers of
> identical keys.  (Mind you, I have no idea if that's true or not,
> I'm just trying to think of likely trouble spots.  Anyone know
> btree well enough to say whether that is likely to be a problem?)
>
> 			regards, tom lane

the query:

	update bench set k500k = k500k + 1 where k100 = 30;

affects about 10.000 rows.  This can be determined by running the query:

	select k500k from bench where k100 = 30;

which takes about half a minute.  That's the reason I was talking
about the strange UPDATE behavior of PostgreSQL.  If it can determine
a specific number of rows in a reasonable time, it should be able to
update these rows in the same time frame.

Edmund

--
Edmund Mergl                      mailto:E.Mergl@bawue.de
Im Haldenhau 9                    http://www.bawue.de/~mergl
70565 Stuttgart                   fon: +49 711 747503
Germany
Edmund Mergl <E.Mergl@bawue.de> writes:
> ... That's the reason I
> was talking about the strange UPDATE behavior of
> PostgreSQL.  If it can determine a specific number
> of rows in a reasonable time, it should be able to
> update these rows in the same time frame.

Not necessarily --- this table has a remarkably large number of indexes,
and all of them have to be updated when a tuple is replaced.  So the
amount of work is significantly greater than simply finding the tuples
will require.

As I posted later, I think that much of the problem comes from poor
handling of equal-key cases in our btree index routines...

			regards, tom lane
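Tom's point can be put in rough numbers. Assuming, per the schema posted earlier, that each of the 21 columns of bench carries an index, and that 10,000 rows match k100 = 30, a non-overwriting update writes a whole new tuple version per row, so every index needs a new entry, not just the index on the changed column. A trivial helper (hypothetical, just to make the arithmetic explicit):

```c
/* Estimate the number of index entries an UPDATE must create when the
 * storage manager replaces whole tuples rather than updating in place:
 * one new entry per index per updated row. */
long
index_insertions(long rows_updated, long indexed_columns)
{
    return rows_updated * indexed_columns;
}
```

For the benchmark at hand that is 10,000 x 21 = 210,000 btree insertions, each of which may hit the slow equal-key paths described below in this thread.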
Tom Lane wrote:
>
> Edmund Mergl <E.Mergl@bawue.de> writes:
> > ... That's the reason I
> > was talking about the strange UPDATE behavior of
> > PostgreSQL.  If it can determine a specific number
> > of rows in a reasonable time, it should be able to
> > update these rows in the same time frame.
>
> Not necessarily --- this table has a remarkably large number of indexes,
> and all of them have to be updated when a tuple is replaced.  So the
> amount of work is significantly greater than simply finding the tuples
> will require.
>
> As I posted later, I think that much of the problem comes from poor
> handling of equal-key cases in our btree index routines...
>
> 			regards, tom lane

if this is the case, these routines must be very poor.

Again some numbers, for 1.000.000 rows:

- select * from bench where k100 = 30

	with indices        10 seconds
	without indices     28 seconds

- update bench set k500k = k500k + 1 where k100 = 30

	with indices        unknown
	without indices     36 seconds

Still, the poor update routines do not explain the strange behavior
that the postmaster runs for hours using at most 10% CPU while heavy
disk activity is observed the whole time.  According to top, there are
over 80 MB of free memory, and the postmaster has been started with
-o -F.  Hence this disk activity cannot be simple swapping.

Some more numbers:

database    #rows      inserts  create   make_sqs  make_nqs
                                index    selects   updates
----------------------------------------------------------------------------
pgsql       10.000     00:24    00:09    00:16     00:25
pgsql       100.000    04:01    01:29    01:06     49:45
pgsql       1.000.000  39:24    20:49    23:42     ???

Whereas the increase of elapsed time is somewhat proportional to the
number of rows, for the update case it is rather exponential.

Edmund

--
Edmund Mergl                      mailto:E.Mergl@bawue.de
Im Haldenhau 9                    http://www.bawue.de/~mergl
70565 Stuttgart                   fon: +49 711 747503
Germany
Edmund Mergl <E.Mergl@bawue.de> writes:
> Some more numbers:

> database    #rows      inserts  create   make_sqs  make_nqs
>                                 index    selects   updates
> ----------------------------------------------------------------------------
> pgsql       10.000     00:24    00:09    00:16     00:25
> pgsql       100.000    04:01    01:29    01:06     49:45
> pgsql       1.000.000  39:24    20:49    23:42     ???

> whereas the increase of elapsed time is somewhat proportional
> to the number of rows, for the update-case it is rather
> exponential.

Those are attention-getting numbers, all right.  I think that the two
equal-key problems I found last night might partially explain them;
I suspect there are more that I have not found, too.  I will look into
it some more.

Could you try the same queries with no indexes in place, and see what
the time scaling is like then?  That would confirm or deny the theory
that it's an index-update problem.

Question for the hackers list: are we prepared to install purely
performance-related bug fixes at this late stage of the 6.5 beta cycle?
Bad as the above numbers are, I hesitate to twiddle the btree code and
risk breaking things with only a week of testing time to go...

			regards, tom lane
Tom Lane wrote:
>
> Edmund Mergl <E.Mergl@bawue.de> writes:
> > Some more numbers:
> >
> > database    #rows      inserts  create   make_sqs  make_nqs
> >                                 index    selects   updates
> > ----------------------------------------------------------------------------
> > pgsql       10.000     00:24    00:09    00:16     00:25
> > pgsql       100.000    04:01    01:29    01:06     49:45
> > pgsql       1.000.000  39:24    20:49    23:42     ???
> >
> > whereas the increase of elapsed time is somewhat proportional
> > to the number of rows, for the update-case it is rather
> > exponential.
>
> Those are attention-getting numbers, all right.  I think that the two
> equal-key problems I found last night might partially explain them;
> I suspect there are more that I have not found, too.  I will look into
> it some more.
>
> Could you try the same queries with no indexes in place, and see what
> the time scaling is like then?  That would confirm or deny the theory
> that it's an index-update problem.

here they are, and yes, I double-checked them twice!

database    #rows      inserts  create   make_sqs  make_nqs
                                index    selects   updates
----------------------------------------------------------------------------
pgsql       10.000     00:24    -        00:13     00:05
pgsql       100.000    04:01    -        00:83     00:32
pgsql       1.000.000  39:24    -        26:36     26:52

> Question for the hackers list: are we prepared to install purely
> performance-related bug fixes at this late stage of the 6.5 beta cycle?
> Bad as the above numbers are, I hesitate to twiddle the btree code and
> risk breaking things with only a week of testing time to go...
>
> 			regards, tom lane

if there is anything else I can do, just let me know.

Edmund

--
Edmund Mergl                      mailto:E.Mergl@bawue.de
Im Haldenhau 9                    http://www.bawue.de/~mergl
70565 Stuttgart                   fon: +49 711 747503
Germany
Tom Lane wrote:
>
> Those are attention-getting numbers, all right.  I think that the two
> equal-key problems I found last night might partially explain them;
> I suspect there are more that I have not found, too.  I will look into
> it some more.

Am I correct that the update takes ~10% CPU with high disk activity?
(Unfortunately, there is no list archive after May 13, so I'm not able
to re-read the thread.)

Remember that the update inserts new index tuples, and most likely the
index scan will see these tuples and fetch the just-inserted heap
tuples.

> Could you try the same queries with no indexes in place, and see what
> the time scaling is like then?  That would confirm or deny the theory
> that it's an index-update problem.
>
> Question for the hackers list: are we prepared to install purely
> performance-related bug fixes at this late stage of the 6.5 beta cycle?
> Bad as the above numbers are, I hesitate to twiddle the btree code and
> risk breaking things with only a week of testing time to go...

Try to fix the problems and run Edmund's scripts to see whether things
are better than now.  We can apply fixes after 6.5.

Vadim
Edmund Mergl <E.Mergl@bawue.de> writes:
>> Could you try the same queries with no indexes in place, and see what
>> the time scaling is like then?  That would confirm or deny the theory
>> that it's an index-update problem.

> here they are, and yes, I double-checked them twice!

> database    #rows      inserts  create   make_sqs  make_nqs
>                                 index    selects   updates
> ----------------------------------------------------------------------------
> pgsql       10.000     00:24    -        00:13     00:05
> pgsql       100.000    04:01    -        00:83     00:32
> pgsql       1.000.000  39:24    -        26:36     26:52

Oh dear ... so much for my theory that index updates are to blame for
the scaling problem.  Back to the drawing board ...

			regards, tom lane
Added to TODO:

	* Improve _bt_binsrch() to handle equal keys better, remove
	  _bt_firsteq() (Tom)

> I wrote:
> > then a possible explanation for the problem would be that our
> > btree index code isn't very fast when there are large numbers of
> > identical keys.
>
> Ah-hah, a lucky guess!  I went back and looked at the profile stats
> I'd extracted from Edmund's "update" example.  This Linux box has
> the same semi-functional gprof as someone else was using a while
> ago --- the timings are bogus, but the call counts seem right.
> And what I find are entries like this:
>
>                 0.00    0.00  284977/284977      _bt_binsrch [3174]
> [3177]   0.0    0.00    0.00  284977             _bt_firsteq [3177]
>                 0.00    0.00 21784948/24713758   _bt_compare [3169]
>
>                 0.00    0.00     426/35632       _bt_split [53]
>                 0.00    0.00   35206/35632       _bt_insertonpg [45]
> [3185]   0.0    0.00    0.00   35632             _bt_findsplitloc [3185]
>                 0.00    0.00 5093972/8907411     _bt_itemcmp [3171]
>
> In other words, _bt_firsteq is averaging almost 100 comparisons per
> call, _bt_findsplitloc well over that.  Both of these routines are
> evidently designed on the assumption that there will be relatively
> few duplicate keys --- they reduce to linear scans when there are
> many duplicates.
>
> _bt_firsteq shouldn't exist at all; the only routine that calls it
> is _bt_binsrch, which does a fast binary search of the index page.
> _bt_binsrch should be fixed so that the binary search logic does the
> right thing for equal-key cases, rather than degrading to a linear
> search.  I am less confident that I understand _bt_findsplitloc's place
> in the great scheme of things, but it could certainly be revised to use
> a binary search rather than linear scan.
>
> This benchmark is clearly overemphasizing the equal-key case, but
> I think it ought to be fixed independently of whether we want to
> look good on a dubious benchmark ... equal keys are not uncommon in
> real-world scenarios after all.
>
> Next question is do we want to risk twiddling this code so soon before
> 6.5, or wait till after?
>
> 			regards, tom lane

--
  Bruce Momjian                        |  http://www.op.net/~candle
  maillist@candle.pha.pa.us            |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
Tom, I assume this is done, right?

> I wrote:
> > then a possible explanation for the problem would be that our
> > btree index code isn't very fast when there are large numbers of
> > identical keys.
>
> Ah-hah, a lucky guess!  I went back and looked at the profile stats
> I'd extracted from Edmund's "update" example.  This Linux box has
> the same semi-functional gprof as someone else was using a while
> ago --- the timings are bogus, but the call counts seem right.
> And what I find are entries like this:
>
>                 0.00    0.00  284977/284977      _bt_binsrch [3174]
> [3177]   0.0    0.00    0.00  284977             _bt_firsteq [3177]
>                 0.00    0.00 21784948/24713758   _bt_compare [3169]
>
>                 0.00    0.00     426/35632       _bt_split [53]
>                 0.00    0.00   35206/35632       _bt_insertonpg [45]
> [3185]   0.0    0.00    0.00   35632             _bt_findsplitloc [3185]
>                 0.00    0.00 5093972/8907411     _bt_itemcmp [3171]
>
> In other words, _bt_firsteq is averaging almost 100 comparisons per
> call, _bt_findsplitloc well over that.  Both of these routines are
> evidently designed on the assumption that there will be relatively
> few duplicate keys --- they reduce to linear scans when there are
> many duplicates.
>
> _bt_firsteq shouldn't exist at all; the only routine that calls it
> is _bt_binsrch, which does a fast binary search of the index page.
> _bt_binsrch should be fixed so that the binary search logic does the
> right thing for equal-key cases, rather than degrading to a linear
> search.  I am less confident that I understand _bt_findsplitloc's place
> in the great scheme of things, but it could certainly be revised to use
> a binary search rather than linear scan.
>
> This benchmark is clearly overemphasizing the equal-key case, but
> I think it ought to be fixed independently of whether we want to
> look good on a dubious benchmark ... equal keys are not uncommon in
> real-world scenarios after all.
>
> Next question is do we want to risk twiddling this code so soon before
> 6.5, or wait till after?
>
> 			regards, tom lane

--
  Bruce Momjian                        |  http://www.op.net/~candle
  maillist@candle.pha.pa.us            |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
Bruce Momjian <maillist@candle.pha.pa.us> writes:
> Tom, I assume this is done, right?

Only partly.  I changed _bt_binsrch to use a pure binary search, but
didn't yet get around to fixing _bt_findsplitloc (it's more complicated
:-().  Vadim seemed to think that a better solution would be to fix the
comparison rules so that no two entries are ever considered to have
equal keys anyway.  So I put my change on the back burner ... but if
his change doesn't happen, mine ought to get done.

>> I wrote:
>>>> then a possible explanation for the problem would be that our
>>>> btree index code isn't very fast when there are large numbers of
>>>> identical keys.
>>
>> Ah-hah, a lucky guess!  I went back and looked at the profile stats
>> I'd extracted from Edmund's "update" example.  This Linux box has
>> the same semi-functional gprof as someone else was using a while
>> ago --- the timings are bogus, but the call counts seem right.
>> And what I find are entries like this:
>>
>>                 0.00    0.00  284977/284977      _bt_binsrch [3174]
>> [3177]   0.0    0.00    0.00  284977             _bt_firsteq [3177]
>>                 0.00    0.00 21784948/24713758   _bt_compare [3169]
>>
>>                 0.00    0.00     426/35632       _bt_split [53]
>>                 0.00    0.00   35206/35632       _bt_insertonpg [45]
>> [3185]   0.0    0.00    0.00   35632             _bt_findsplitloc [3185]
>>                 0.00    0.00 5093972/8907411     _bt_itemcmp [3171]
>>
>> In other words, _bt_firsteq is averaging almost 100 comparisons per
>> call, _bt_findsplitloc well over that.  Both of these routines are
>> evidently designed on the assumption that there will be relatively
>> few duplicate keys --- they reduce to linear scans when there are
>> many duplicates.
>>
>> _bt_firsteq shouldn't exist at all; the only routine that calls it
>> is _bt_binsrch, which does a fast binary search of the index page.
>> _bt_binsrch should be fixed so that the binary search logic does the
>> right thing for equal-key cases, rather than degrading to a linear
>> search.  I am less confident that I understand _bt_findsplitloc's place
>> in the great scheme of things, but it could certainly be revised to use
>> a binary search rather than linear scan.
>>
>> This benchmark is clearly overemphasizing the equal-key case, but
>> I think it ought to be fixed independently of whether we want to
>> look good on a dubious benchmark ... equal keys are not uncommon in
>> real-world scenarios after all.
>>
>> Next question is do we want to risk twiddling this code so soon before
>> 6.5, or wait till after?
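Vadim's alternative, changing the comparison rules so no two entries ever compare equal, can be sketched as follows. This is a hypothetical illustration, not PostgreSQL source; the struct and field names are invented. If the comparison breaks ties on the heap tuple pointer, then since no two index entries reference the same heap tuple, every entry compares distinct and the equal-key code paths can never be reached:

```c
#include <stdint.h>

/* Simplified index entry: the user-visible key plus the heap tuple id
 * it points at.  The tid serves as a tiebreaker that makes the
 * composite (key, tid) unique across the whole index. */
typedef struct
{
    int32_t  key;   /* e.g. the k500k column value */
    uint32_t tid;   /* heap tuple id, simplified to a single integer */
} IndexEntry;

/* Compare two entries: first by key, then by tid.  Returns <0, 0, >0
 * like strcmp.  Zero is possible only when comparing an entry with
 * itself, so "equal keys" vanish as a special case. */
int
entry_cmp(const IndexEntry *a, const IndexEntry *b)
{
    if (a->key != b->key)
        return (a->key < b->key) ? -1 : 1;
    if (a->tid != b->tid)
        return (a->tid < b->tid) ? -1 : 1;
    return 0;
}
```

Under this scheme the duplicate-heavy benchmark columns (100 distinct values across 1,000,000 rows) still produce long runs of equal user keys, but the btree sees them as a strictly ordered sequence, so the binary search needs no equal-key handling at all.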