Thread: Re: [HACKERS] strange behavior of UPDATE

Re: [HACKERS] strange behavior of UPDATE

From
Edmund Mergl
Date:
Bruce Momjian wrote:
> 
> > > Not really.  It required changing function calls all over, and probably
> > > relies on 6.5 changes too.  It is in the 6.5 beta right now.  Are you
> > > using that?
> >
> >
> > I did all the tests with the current snapshot of May 18th.
> >
> 
> Fix was installed yesterday.
> 

sorry, but the problem still persists. I used 
  -rw-r--r--   1 1005     root      5954900 May 24 03:03 postgresql.snapshot.tar.gz

I started the query 
  update bench set k500k = k500k + 1 where k100 = 30;

and I killed it after half an hour of havy disk activity.
The same query on the same machine with the same setup,
but using  sybase-ase-11.0.3.3-1 takes less than 1 minute.

The table bench looks like the following:

Table    = bench
+----------------------------------+----------------------------------+-------+
|              Field               |              Type                | Length|
+----------------------------------+----------------------------------+-------+
| kseq                             | int4 not null                    |     4 |
| k500k                            | int4 not null                    |     4 |
| k250k                            | int4 not null                    |     4 |
| k100k                            | int4 not null                    |     4 |
| k40k                             | int4 not null                    |     4 |
| k10k                             | int4 not null                    |     4 |
| k1k                              | int4 not null                    |     4 |
| k100                             | int4 not null                    |     4 |
| k25                              | int4 not null                    |     4 |
| k10                              | int4 not null                    |     4 |
| k5                               | int4 not null                    |     4 |
| k4                               | int4 not null                    |     4 |
| k2                               | int4 not null                    |     4 |
| s1                               | char() not null                  |     8 |
| s2                               | char() not null                  |    20 |
| s3                               | char() not null                  |    20 |
| s4                               | char() not null                  |    20 |
| s5                               | char() not null                  |    20 |
| s6                               | char() not null                  |    20 |
| s7                               | char() not null                  |    20 |
| s8                               | char() not null                  |    20 |
+----------------------------------+----------------------------------+-------+


The table is filled with 1.000.000 rows of random data
and on every field an index is created.


Edmund



-- 
Edmund Mergl          mailto:E.Mergl@bawue.de
Im Haldenhau 9        http://www.bawue.de/~mergl
70565 Stuttgart       fon: +49 711 747503
Germany


Re: [HACKERS] strange behavior of UPDATE

From
Tom Lane
Date:
Edmund Mergl <E.Mergl@bawue.de> writes:
> The table is filled with 1.000.000 rows of random data
> and on every field an index is created.

BTW, do you happen to know just how random the data actually is?
I noticed that the update queryupdate bench set k500k = k500k + 1 where k100 = 30;
updates 10,000 rows.  If this "random" data actually consists of
10,000 repetitions of only 100 distinct values in every column,
then a possible explanation for the problem would be that our
btree index code isn't very fast when there are large numbers of
identical keys.  (Mind you, I have no idea if that's true or not,
I'm just trying to think of likely trouble spots.  Anyone know
btree well enough to say whether that is likely to be a problem?)
        regards, tom lane


Re: [HACKERS] strange behavior of UPDATE

From
Tom Lane
Date:
I wrote:
> then a possible explanation for the problem would be that our
> btree index code isn't very fast when there are large numbers of
> identical keys.

Ah-hah, a lucky guess!  I went back and looked at the profile stats
I'd extracted from Edmund's "update" example.  This Linux box has
the same semi-functional gprof as someone else was using a while
ago --- the timings are bogus, but the call counts seem right.
And what I find are entries like this:
               0.00    0.00  284977/284977      _bt_binsrch [3174]
[3177]   0.0    0.00    0.00  284977         _bt_firsteq [3177]               0.00    0.00 21784948/24713758
_bt_compare[3169]
 
               0.00    0.00     426/35632       _bt_split [53]               0.00    0.00   35206/35632
_bt_insertonpg[45]
 
[3185]   0.0    0.00    0.00   35632         _bt_findsplitloc [3185]               0.00    0.00 5093972/8907411
_bt_itemcmp[3171]
 

In other words, _bt_firsteq is averaging almost 100 comparisons per
call, _bt_findsplitloc well over that.  Both of these routines are
evidently designed on the assumption that there will be relatively
few duplicate keys --- they reduce to linear scans when there are
many duplicates.

_bt_firsteq shouldn't exist at all; the only routine that calls it
is _bt_binsrch, which does a fast binary search of the index page.
_bt_binsrch should be fixed so that the binary search logic does the
right thing for equal-key cases, rather than degrading to a linear
search.  I am less confident that I understand _bt_findsplitloc's place
in the great scheme of things, but it could certainly be revised to use
a binary search rather than linear scan.

This benchmark is clearly overemphasizing the equal-key case, but
I think it ought to be fixed independently of whether we want to
look good on a dubious benchmark ... equal keys are not uncommon in
real-world scenarios after all.

Next question is do we want to risk twiddling this code so soon before
6.5, or wait till after?
        regards, tom lane


Re: [HACKERS] strange behavior of UPDATE

From
Edmund Mergl
Date:
Tom Lane wrote:
> 
> Edmund Mergl <E.Mergl@bawue.de> writes:
> > The table is filled with 1.000.000 rows of random data
> > and on every field an index is created.
> 
> BTW, do you happen to know just how random the data actually is?
> I noticed that the update query
>         update bench set k500k = k500k + 1 where k100 = 30;
> updates 10,000 rows.  If this "random" data actually consists of
> 10,000 repetitions of only 100 distinct values in every column,
> then a possible explanation for the problem would be that our
> btree index code isn't very fast when there are large numbers of
> identical keys.  (Mind you, I have no idea if that's true or not,
> I'm just trying to think of likely trouble spots.  Anyone know
> btree well enough to say whether that is likely to be a problem?)
> 
>                         regards, tom lane

the query:
 update bench set k500k = k500k + 1 where k100 = 30;

affects about 10.000 rows. This can be determined by running 
the query:
 select k500k from bench where k100 = 30;

which takes about half a minute. That's the reason I 
was talking about the strange UPDATE behavior of 
PostgreSQL. If it can determine a specific number
of rows in a reasonable time, it should be able to
update these rows in the same time frame.

Edmund

-- 
Edmund Mergl          mailto:E.Mergl@bawue.de
Im Haldenhau 9        http://www.bawue.de/~mergl
70565 Stuttgart       fon: +49 711 747503
Germany


Re: [HACKERS] strange behavior of UPDATE

From
Tom Lane
Date:
Edmund Mergl <E.Mergl@bawue.de> writes:
> ... That's the reason I 
> was talking about the strange UPDATE behavior of 
> PostgreSQL. If it can determine a specific number
> of rows in a reasonable time, it should be able to
> update these rows in the same time frame.

Not necessarily --- this table has a remarkably large number of indexes,
and all of them have to be updated when a tuple is replaced.  So the
amount of work is significantly greater than simply finding the tuples
will require.

As I posted later, I think that much of the problem comes from poor
handling of equal-key cases in our btree index routines...
        regards, tom lane


Re: [HACKERS] strange behavior of UPDATE

From
Edmund Mergl
Date:
Tom Lane wrote:
> 
> Edmund Mergl <E.Mergl@bawue.de> writes:
> > ... That's the reason I
> > was talking about the strange UPDATE behavior of
> > PostgreSQL. If it can determine a specific number
> > of rows in a reasonable time, it should be able to
> > update these rows in the same time frame.
> 
> Not necessarily --- this table has a remarkably large number of indexes,
> and all of them have to be updated when a tuple is replaced.  So the
> amount of work is significantly greater than simply finding the tuples
> will require.
> 
> As I posted later, I think that much of the problem comes from poor
> handling of equal-key cases in our btree index routines...
> 
>                         regards, tom lane


if this is the case, these routines must be very poor.
Again some numbers:

1.000.000 rows:

- select * from bench where k100 = 30   with    indeces  10 seconds   without indeces  28 seconds

- update bench set k500k = k500k + 1 where k100 = 30   with    indeces  unknown   without indeces 36 seconds


Still the poor update routines do not explain the
strange behavior, that the postmaster runs for
hours using at most 10% CPU, and all the time
heavy disk activity is observed. According to
top, there are over 80MB free Mem and the postmaster
has been started with -o -F. Hence this disk activity
can not be simple swapping.


Some more numbers:
 database         #rows      inserts    create      make_sqs    make_nqs                                         index
   selects     updates
 
----------------------------------------------------------------------------   pgsql         10.000       00:24
00:09      00:16       00:25   pgsql        100.000       04:01      01:29       01:06       49:45   pgsql
1.000.000      39:24      20:49       23:42       ???
 


whereas the increase of elapsed time is somewhat proportional
to the number of rows, for the update-case it is rather
exponential.


Edmund

-- 
Edmund Mergl          mailto:E.Mergl@bawue.de
Im Haldenhau 9        http://www.bawue.de/~mergl
70565 Stuttgart       fon: +49 711 747503
Germany


Re: [HACKERS] strange behavior of UPDATE

From
Tom Lane
Date:
Edmund Mergl <E.Mergl@bawue.de> writes:
> Some more numbers:

>   database         #rows      inserts    create      make_sqs    make_nqs
>                                           index      selects     updates
> ----------------------------------------------------------------------------
>     pgsql         10.000       00:24      00:09       00:16       00:25
>     pgsql        100.000       04:01      01:29       01:06       49:45
>     pgsql      1.000.000       39:24      20:49       23:42       ???

> whereas the increase of elapsed time is somewhat proportional
> to the number of rows, for the update-case it is rather
> exponential.

Those are attention-getting numbers, all right.  I think that the two
equal-key problems I found last night might partially explain them;
I suspect there are more that I have not found, too.  I will look into
it some more.

Could you try the same queries with no indexes in place, and see what
the time scaling is like then?  That would confirm or deny the theory
that it's an index-update problem.

Question for the hackers list: are we prepared to install purely
performance-related bug fixes at this late stage of the 6.5 beta cycle?
Bad as the above numbers are, I hesitate to twiddle the btree code and
risk breaking things with only a week of testing time to go...
        regards, tom lane


Re: [HACKERS] strange behavior of UPDATE

From
Edmund Mergl
Date:
Tom Lane wrote:
> 
> Edmund Mergl <E.Mergl@bawue.de> writes:
> > Some more numbers:
> 
> >   database         #rows      inserts    create      make_sqs    make_nqs
> >                                           index      selects     updates
> > ----------------------------------------------------------------------------
> >     pgsql         10.000       00:24      00:09       00:16       00:25
> >     pgsql        100.000       04:01      01:29       01:06       49:45
> >     pgsql      1.000.000       39:24      20:49       23:42       ???
> 
> > whereas the increase of elapsed time is somewhat proportional
> > to the number of rows, for the update-case it is rather
> > exponential.
> 
> Those are attention-getting numbers, all right.  I think that the two
> equal-key problems I found last night might partially explain them;
> I suspect there are more that I have not found, too.  I will look into
> it some more.
> 
> Could you try the same queries with no indexes in place, and see what
> the time scaling is like then?  That would confirm or deny the theory
> that it's an index-update problem.


here they are, and yes, I double-checked them twice !
 database         #rows      inserts    create      make_sqs    make_nqs                                         index
   selects     updates
 
----------------------------------------------------------------------------   pgsql         10.000       00:24
-        00:13       00:05   pgsql        100.000       04:01        -         00:83       00:32   pgsql      1.000.000
     39:24        -         26:36       26:52
 


> 
> Question for the hackers list: are we prepared to install purely
> performance-related bug fixes at this late stage of the 6.5 beta cycle?
> Bad as the above numbers are, I hesitate to twiddle the btree code and
> risk breaking things with only a week of testing time to go...
> 
>                         regards, tom lane


if there is anything else I can do, just let me know.

Edmund


-- 
Edmund Mergl          mailto:E.Mergl@bawue.de
Im Haldenhau 9        http://www.bawue.de/~mergl
70565 Stuttgart       fon: +49 711 747503
Germany


Re: [HACKERS] strange behavior of UPDATE

From
Vadim Mikheev
Date:
Tom Lane wrote:
> 
> Those are attention-getting numbers, all right.  I think that the two
> equal-key problems I found last night might partially explain them;
> I suspect there are more that I have not found, too.  I will look into
> it some more.

Am I correct that update takes ~ 10% CPU with high disk activity?
(Unfortunately, no list archive after May 13, so I'm not able
to re-read thread).
Remember that update inserts new index tuples and most likely
index scan will see these tuples and fetch just inserted
heap tuples.

> Could you try the same queries with no indexes in place, and see what
> the time scaling is like then?  That would confirm or deny the theory
> that it's an index-update problem.
> 
> Question for the hackers list: are we prepared to install purely
> performance-related bug fixes at this late stage of the 6.5 beta cycle?
> Bad as the above numbers are, I hesitate to twiddle the btree code and
> risk breaking things with only a week of testing time to go...

Try to fix problems and run Edmund scripts to see are things
better than now. We can apply fixes after 6.5.

Vadim


Re: [HACKERS] strange behavior of UPDATE

From
Tom Lane
Date:
Edmund Mergl <E.Mergl@bawue.de> writes:
>> Could you try the same queries with no indexes in place, and see what
>> the time scaling is like then?  That would confirm or deny the theory
>> that it's an index-update problem.

> here they are, and yes, I double-checked them twice !

>   database         #rows      inserts    create      make_sqs    make_nqs
>                                           index      selects     updates
> ----------------------------------------------------------------------------
>     pgsql         10.000       00:24        -         00:13       00:05
>     pgsql        100.000       04:01        -         00:83       00:32
>     pgsql      1.000.000       39:24        -         26:36       26:52

Oh dear ... so much for my theory that index updates are to blame for
the scaling problem.  Back to the drawing board ...
        regards, tom lane


Re: [HACKERS] strange behavior of UPDATE

From
Bruce Momjian
Date:
Added to TODO:

* Improve _bt_binsrch() to handle equal keys better, remove _bt_firsteq()(Tom)


> I wrote:
> > then a possible explanation for the problem would be that our
> > btree index code isn't very fast when there are large numbers of
> > identical keys.
> 
> Ah-hah, a lucky guess!  I went back and looked at the profile stats
> I'd extracted from Edmund's "update" example.  This Linux box has
> the same semi-functional gprof as someone else was using a while
> ago --- the timings are bogus, but the call counts seem right.
> And what I find are entries like this:
> 
>                 0.00    0.00  284977/284977      _bt_binsrch [3174]
> [3177]   0.0    0.00    0.00  284977         _bt_firsteq [3177]
>                 0.00    0.00 21784948/24713758     _bt_compare [3169]
> 
>                 0.00    0.00     426/35632       _bt_split [53]
>                 0.00    0.00   35206/35632       _bt_insertonpg [45]
> [3185]   0.0    0.00    0.00   35632         _bt_findsplitloc [3185]
>                 0.00    0.00 5093972/8907411     _bt_itemcmp [3171]
> 
> In other words, _bt_firsteq is averaging almost 100 comparisons per
> call, _bt_findsplitloc well over that.  Both of these routines are
> evidently designed on the assumption that there will be relatively
> few duplicate keys --- they reduce to linear scans when there are
> many duplicates.
> 
> _bt_firsteq shouldn't exist at all; the only routine that calls it
> is _bt_binsrch, which does a fast binary search of the index page.
> _bt_binsrch should be fixed so that the binary search logic does the
> right thing for equal-key cases, rather than degrading to a linear
> search.  I am less confident that I understand _bt_findsplitloc's place
> in the great scheme of things, but it could certainly be revised to use
> a binary search rather than linear scan.
> 
> This benchmark is clearly overemphasizing the equal-key case, but
> I think it ought to be fixed independently of whether we want to
> look good on a dubious benchmark ... equal keys are not uncommon in
> real-world scenarios after all.
> 
> Next question is do we want to risk twiddling this code so soon before
> 6.5, or wait till after?
> 
>             regards, tom lane
> 
> 


--  Bruce Momjian                        |  http://www.op.net/~candle maillist@candle.pha.pa.us            |  (610)
853-3000+  If your life is a hard drive,     |  830 Blythe Avenue +  Christ can be your backup.        |  Drexel Hill,
Pennsylvania19026
 


Re: [HACKERS] strange behavior of UPDATE

From
Bruce Momjian
Date:
Tom, I assume this is done, right?


> I wrote:
> > then a possible explanation for the problem would be that our
> > btree index code isn't very fast when there are large numbers of
> > identical keys.
> 
> Ah-hah, a lucky guess!  I went back and looked at the profile stats
> I'd extracted from Edmund's "update" example.  This Linux box has
> the same semi-functional gprof as someone else was using a while
> ago --- the timings are bogus, but the call counts seem right.
> And what I find are entries like this:
> 
>                 0.00    0.00  284977/284977      _bt_binsrch [3174]
> [3177]   0.0    0.00    0.00  284977         _bt_firsteq [3177]
>                 0.00    0.00 21784948/24713758     _bt_compare [3169]
> 
>                 0.00    0.00     426/35632       _bt_split [53]
>                 0.00    0.00   35206/35632       _bt_insertonpg [45]
> [3185]   0.0    0.00    0.00   35632         _bt_findsplitloc [3185]
>                 0.00    0.00 5093972/8907411     _bt_itemcmp [3171]
> 
> In other words, _bt_firsteq is averaging almost 100 comparisons per
> call, _bt_findsplitloc well over that.  Both of these routines are
> evidently designed on the assumption that there will be relatively
> few duplicate keys --- they reduce to linear scans when there are
> many duplicates.
> 
> _bt_firsteq shouldn't exist at all; the only routine that calls it
> is _bt_binsrch, which does a fast binary search of the index page.
> _bt_binsrch should be fixed so that the binary search logic does the
> right thing for equal-key cases, rather than degrading to a linear
> search.  I am less confident that I understand _bt_findsplitloc's place
> in the great scheme of things, but it could certainly be revised to use
> a binary search rather than linear scan.
> 
> This benchmark is clearly overemphasizing the equal-key case, but
> I think it ought to be fixed independently of whether we want to
> look good on a dubious benchmark ... equal keys are not uncommon in
> real-world scenarios after all.
> 
> Next question is do we want to risk twiddling this code so soon before
> 6.5, or wait till after?
> 
>             regards, tom lane
> 
> 


--  Bruce Momjian                        |  http://www.op.net/~candle maillist@candle.pha.pa.us            |  (610)
853-3000+  If your life is a hard drive,     |  830 Blythe Avenue +  Christ can be your backup.        |  Drexel Hill,
Pennsylvania19026
 


Re: [HACKERS] strange behavior of UPDATE

From
Tom Lane
Date:
Bruce Momjian <maillist@candle.pha.pa.us> writes:
> Tom, I assume this is done, right?

Only partly.  I changed _bt_binsrch to use a pure binary search,
but didn't yet get around to fixing _bt_findsplitloc (it's more
complicated :-().

Vadim seemed to think that a better solution would be to fix the
comparison rules so that no two entries are ever considered to have
equal keys anyway.  So I put my change on the back burner ... but if
his change doesn't happen, mine ought to get done.


>> I wrote:
>>>> then a possible explanation for the problem would be that our
>>>> btree index code isn't very fast when there are large numbers of
>>>> identical keys.
>> 
>> Ah-hah, a lucky guess!  I went back and looked at the profile stats
>> I'd extracted from Edmund's "update" example.  This Linux box has
>> the same semi-functional gprof as someone else was using a while
>> ago --- the timings are bogus, but the call counts seem right.
>> And what I find are entries like this:
>> 
>> 0.00    0.00  284977/284977      _bt_binsrch [3174]
>> [3177]   0.0    0.00    0.00  284977         _bt_firsteq [3177]
>> 0.00    0.00 21784948/24713758     _bt_compare [3169]
>> 
>> 0.00    0.00     426/35632       _bt_split [53]
>> 0.00    0.00   35206/35632       _bt_insertonpg [45]
>> [3185]   0.0    0.00    0.00   35632         _bt_findsplitloc [3185]
>> 0.00    0.00 5093972/8907411     _bt_itemcmp [3171]
>> 
>> In other words, _bt_firsteq is averaging almost 100 comparisons per
>> call, _bt_findsplitloc well over that.  Both of these routines are
>> evidently designed on the assumption that there will be relatively
>> few duplicate keys --- they reduce to linear scans when there are
>> many duplicates.
>> 
>> _bt_firsteq shouldn't exist at all; the only routine that calls it
>> is _bt_binsrch, which does a fast binary search of the index page.
>> _bt_binsrch should be fixed so that the binary search logic does the
>> right thing for equal-key cases, rather than degrading to a linear
>> search.  I am less confident that I understand _bt_findsplitloc's place
>> in the great scheme of things, but it could certainly be revised to use
>> a binary search rather than linear scan.
>> 
>> This benchmark is clearly overemphasizing the equal-key case, but
>> I think it ought to be fixed independently of whether we want to
>> look good on a dubious benchmark ... equal keys are not uncommon in
>> real-world scenarios after all.
>> 
>> Next question is do we want to risk twiddling this code so soon before
>> 6.5, or wait till after?