Thread: BUG #14020: row_number() over(partition by order by) - weird behavior

BUG #14020: row_number() over(partition by order by) - weird behavior

From
b.yordanov2@gmail.com
Date:
The following bug has been logged on the website:

Bug reference:      14020
Logged by:          Boyko
Email address:      b.yordanov2@gmail.com
PostgreSQL version: 9.5.0
Operating system:   CentOS 6.4
Description:

Hi,

db=# \d+ offers_past_data;
                    Table "public.offers_past_data"
  Column  |  Type   | Modifiers | Storage | Stats target | Description
----------+---------+-----------+---------+--------------+-------------
 id       | integer | not null  | plain   |              |
 feed     | integer | not null  | plain   |              |
 position | integer | not null  | plain   |              |

db=# \d+ offers_testing;
                                                         Table
"public.offers_testing"
       Column       |            Type             |
Modifiers                      | Storage  | Stats target | Description

--------------------+-----------------------------+-----------------------------------------------------+----------+--------------+-------------
 id                 | bigint                      | not null default
nextval('offers_id_seq'::regclass) | plain    |              |
 grossprice         | numeric(11,2)               |
                           | main     |              |
 feed               | integer                     | not null
                           | plain    |              |
 product            | integer                     |
                           | plain    |              |

db=# update offers_past_data a set position = b.position from (select id,
feed, row_number() over(partition by product order by grossprice asc) as
position from offers_testing) b where a.id = b.id and a.feed = b.feed and
a.position <> b.position;
UPDATE 0

This should update every row in offers_past_data when its “position”
changes. In the example above no changes were introduced since the last run
so nothing is updated (expected).

db=# select count(*) from offers_testing where product = 2;
 count
-------
    99
(1 row)

So there are 99 offers for product 2.

Getting a single offer:

db=# select id,grossprice from offers_testing where product = 2 limit 1;
   id    | grossprice
---------+------------
 4127918 |    5000.00
(1 row)

Updating its grossprice:

db=# update offers_testing set grossprice = 20 where id = 4127918;
UPDATE 1

Now when executing the first query again I expect that no more than 99 rows
get updated in offers_past_data since this is the maximum amount of
positions that would be affected by offer 4127918 grossprice change.

db=# update offers_past_data a set position = b.position from (select id,
feed, row_number() over(partition by product order by grossprice asc) as
position from offers_testing) b where a.id = b.id and a.feed = b.feed and
a.position <> b.position;
UPDATE 104

104 rows get updated.

Executing the same query again a few minutes later (no changes meanwhile in
either table):

db=# update offers_past_data a set position = b.position from (select id,
feed, row_number() over(partition by product order by grossprice asc) as
position from offers_testing) b where a.id = b.id and a.feed = b.feed and
a.position <> b.position;
UPDATE 28058

This time it updates 28058 rows.

This is a test environment and nothing reads or writes to these tables.

Is this a bug or am I missing something obvious?

Regards,
Boyko

Re: BUG #14020: row_number() over(partition by order by) - weird behavior

From
"David G. Johnston"
Date:
On Mon, Mar 14, 2016 at 1:43 PM, <b.yordanov2@gmail.com> wrote:

>
> db=3D# update offers_past_data a set position =3D b.position from (select=
 id,
> feed, row_number() over(partition by product order by grossprice asc) as
> position from offers_testing) b where a.id =3D b.id and a.feed =3D b.feed=
 and
> a.position <> b.position;
> UPDATE 0
>

=E2=80=8BUpdating offers_past_data
=E2=80=8B


>
> This should update every row in offers_past_data when its =E2=80=9Cpositi=
on=E2=80=9D
> changes. In the example above no changes were introduced since the last r=
un
> so nothing is updated (expected).
>
> db=3D# select count(*) from offers_testing where product =3D 2;
>  count
> -------
>     99
> (1 row)
>
So there are 99 offers for product 2.
>

=E2=80=8BCounting offers_testing=E2=80=8B


> Getting a single offer:
>
> db=3D# select id,grossprice from offers_testing where product =3D 2 limit=
 1;
>    id    | grossprice
> ---------+------------
>  4127918 |    5000.00
> (1 row)
>
>
=E2=80=8BCounting offers_testing=E2=80=8B

Updating its grossprice:
>
> db=3D# update offers_testing set grossprice =3D 20 where id =3D 4127918;
> UPDATE 1
>
>
=E2=80=8BUpdating offers_testing=E2=80=8B

Now when executing the first query again I expect that no more than 99 rows
> get updated in offers_past_data since this is the maximum amount of
> positions that would be affected by offer 4127918 grossprice change.
>

You haven't proven to us that a single row in offers_testing cannot match
more than one row in offers_past_data.  Assuming a 1-to-many situation the
update count for offers_past_data can definitely be more than the number of
rows returned by the sub-query.
=E2=80=8B=E2=80=8B


> db=3D# update offers_past_data a set position =3D b.position from (select=
 id,
> feed, row_number() over(partition by product order by grossprice asc) as
> position from offers_testing) b where a.id =3D b.id and a.feed =3D b.feed=
 and
> a.position <> b.position;
> UPDATE 104
>
> 104 rows get updated.
>
> Executing the same query again a few minutes later (no changes meanwhile =
in
> either table):
>
> db=3D# update offers_past_data a set position =3D b.position from (select=
 id,
> feed, row_number() over(partition by product order by grossprice asc) as
> position from offers_testing) b where a.id =3D b.id and a.feed =3D b.feed=
 and
> a.position <> b.position;
> UPDATE 28058
>
> This time it updates 28058 rows.
>
> This is a test environment and nothing reads or writes to these tables.
>
> Is this a bug or am I missing something obvious?


=E2=80=8BIts likely data related, not a bug.

Using the "UPDATE ... RETURNING *" form should provide good insight.
Specifically, look for all rows having the same (id, feed) pair.

Also, "ORDER BY grossprice" seems inadequate.  The potential for duplicates
here - which would then make the assignment of row numbers within the
product partition random - is non-zero and is a quite likely source of your
problem - along with the probable one-to-many relationship between
offers_testing and offers_past_data.

David J.
=E2=80=8B

Re: BUG #14020: row_number() over(partition by order by) - weird behavior

From
Boyko Yordanov
Date:
Hi and thanks for your time on this.

You haven't proven to us that a single row in offers_testing cannot =
match more than one row in offers_past_data.  Assuming a 1-to-many =
situation the update count for offers_past_data can definitely be more =
than the number of rows returned by the sub-query.

It is a one-to-one relationship between the tables as there is a primary =
key on (id, feed) on both tables (which I missed to point out):

Indexes:
    "offers_past_data_id_feed" PRIMARY KEY, btree (id, feed)

Indexes:
    "offers_testing_id_feed" PRIMARY KEY, btree (id, feed)

I assume that this guarantees that a single grossprice change in =
offers_testing where product =3D 2 translates to up to (count(id,feed) =
where product =3D 2) position updates in both offers_testing and =
offers_past_data.

Adding "returning *" to the questionable query, it seems to update rows =
that are not related to product 2 (and on my opinion should not have =
changed positions).

Also, "ORDER BY grossprice" seems inadequate.  The potential for =
duplicates here - which would then make the assignment of row numbers =
within the product partition random - is non-zero and is a quite likely =
source of your problem - along with the probable one-to-many =
relationship between offers_testing and offers_past_data.

Dismissing the one-to-many relationship suggestion as it isn't the case.

Your point on duplicate grossprices is valid, but I believe that if I =
update a single grossprice, even in the case of duplicate grossprices, =
this should not translate in more position updates than the rows in the =
modified product partition. And in offers_testing there are no more than =
148 rows per product partition:

db=3D# select max(partition_count) from (select count(*) over (partition =
by product) as partition_count from offers_testing) sq;
 max
-----
 148
(1 row)

And yet the update query updates 28k records for some reason, most of =
which are outside the modified product partition.

Boyko

--

Boyko

2016-03-15 6:00 GMT+02:00 David G. Johnston <david.g.johnston@gmail.com =
<mailto:david.g.johnston@gmail.com>>:
On Mon, Mar 14, 2016 at 1:43 PM, <b.yordanov2@gmail.com =
<mailto:b.yordanov2@gmail.com>> wrote:

db=3D# update offers_past_data a set position =3D b.position from =
(select id,
feed, row_number() over(partition by product order by grossprice asc) as
position from offers_testing) b where a.id <http://a.id/> =3D b.id =
<http://b.id/> and a.feed =3D b.feed and
a.position <> b.position;
UPDATE 0

=E2=80=8BUpdating offers_past_data
=E2=80=8B=20

This should update every row in offers_past_data when its =E2=80=9Cpositio=
n=E2=80=9D
changes. In the example above no changes were introduced since the last =
run
so nothing is updated (expected).

db=3D# select count(*) from offers_testing where product =3D 2;
 count
-------
    99
(1 row)
So there are 99 offers for product 2.

=E2=80=8BCounting offers_testing=E2=80=8B


Getting a single offer:

db=3D# select id,grossprice from offers_testing where product =3D 2 =
limit 1;
   id    | grossprice
---------+------------
 4127918 |    5000.00
(1 row)


=E2=80=8BCounting offers_testing=E2=80=8B

Updating its grossprice:

db=3D# update offers_testing set grossprice =3D 20 where id =3D 4127918;
UPDATE 1


=E2=80=8BUpdating offers_testing=E2=80=8B

Now when executing the first query again I expect that no more than 99 =
rows
get updated in offers_past_data since this is the maximum amount of
positions that would be affected by offer 4127918 grossprice change.

You haven't proven to us that a single row in offers_testing cannot =
match more than one row in offers_past_data.  Assuming a 1-to-many =
situation the update count for offers_past_data can definitely be more =
than the number of rows returned by the sub-query.
=E2=80=8B=E2=80=8B


db=3D# update offers_past_data a set position =3D b.position from =
(select id,
feed, row_number() over(partition by product order by grossprice asc) as
position from offers_testing) b where a.id <http://a.id/> =3D b.id =
<http://b.id/> and a.feed =3D b.feed and
a.position <> b.position;
UPDATE 104

104 rows get updated.

Executing the same query again a few minutes later (no changes meanwhile =
in
either table):

db=3D# update offers_past_data a set position =3D b.position from =
(select id,
feed, row_number() over(partition by product order by grossprice asc) as
position from offers_testing) b where a.id <http://a.id/> =3D b.id =
<http://b.id/> and a.feed =3D b.feed and
a.position <> b.position;
UPDATE 28058

This time it updates 28058 rows.

This is a test environment and nothing reads or writes to these tables.

Is this a bug or am I missing something obvious?

=E2=80=8BIts likely data related, not a bug.

Using the "UPDATE ... RETURNING *" form should provide good insight.  =
Specifically, look for all rows having the same (id, feed) pair.

Also, "ORDER BY grossprice" seems inadequate.  The potential for =
duplicates here - which would then make the assignment of row numbers =
within the product partition random - is non-zero and is a quite likely =
source of your problem - along with the probable one-to-many =
relationship between offers_testing and offers_past_data.

David J.
=E2=80=8B=20

Re: BUG #14020: row_number() over(partition by order by) - weird behavior

From
Boyko Yordanov
Date:
Thinking further on this, I now got your point on the =E2=80=9Cduplicate =
grossprices is ordered randomly=E2=80=9D suggestion.

What I missed to realize is that the update query updates *every* =
product partition that has reordered due to duplicate grossprice being =
ordered randomly, resulting in thousands of updates instead of just < =
148 (or < 99 in the case of product =3D 2 partition).=20

Is there a way to ensure persistence of =E2=80=9Cover(order by =
duplicate_columns)=E2=80=9D ordering, except for ordering by a second =
(or even third) column?

Overall this now makes sense and is obviously not a bug. I apologize for =
bothering this list (instead of =E2=80=98general=E2=80=99) w/ this =
issue.

Thanks once again!

Regards,
Boyko

> On Mar 15, 2016, at 10:20 AM, Boyko Yordanov <b.yordanov2@gmail.com> =
wrote:
>=20
> Hi and thanks for your time on this.
>=20
> You haven't proven to us that a single row in offers_testing cannot =
match more than one row in offers_past_data.  Assuming a 1-to-many =
situation the update count for offers_past_data can definitely be more =
than the number of rows returned by the sub-query.
>=20
> It is a one-to-one relationship between the tables as there is a =
primary key on (id, feed) on both tables (which I missed to point out):
>=20
> Indexes:
>     "offers_past_data_id_feed" PRIMARY KEY, btree (id, feed)
>=20
> Indexes:
>     "offers_testing_id_feed" PRIMARY KEY, btree (id, feed)
>=20
> I assume that this guarantees that a single grossprice change in =
offers_testing where product =3D 2 translates to up to (count(id,feed) =
where product =3D 2) position updates in both offers_testing and =
offers_past_data.
>=20
> Adding "returning *" to the questionable query, it seems to update =
rows that are not related to product 2 (and on my opinion should not =
have changed positions).
>=20
> Also, "ORDER BY grossprice" seems inadequate.  The potential for =
duplicates here - which would then make the assignment of row numbers =
within the product partition random - is non-zero and is a quite likely =
source of your problem - along with the probable one-to-many =
relationship between offers_testing and offers_past_data.
>=20
> Dismissing the one-to-many relationship suggestion as it isn't the =
case.
>=20
> Your point on duplicate grossprices is valid, but I believe that if I =
update a single grossprice, even in the case of duplicate grossprices, =
this should not translate in more position updates than the rows in the =
modified product partition. And in offers_testing there are no more than =
148 rows per product partition:
>=20
> db=3D# select max(partition_count) from (select count(*) over =
(partition by product) as partition_count from offers_testing) sq;
>  max
> -----
>  148
> (1 row)
>=20
> And yet the update query updates 28k records for some reason, most of =
which are outside the modified product partition.
>=20
> Boyko
>=20
> --
>=20
> Boyko
>=20
> 2016-03-15 6:00 GMT+02:00 David G. Johnston =
<david.g.johnston@gmail.com <mailto:david.g.johnston@gmail.com>>:
> On Mon, Mar 14, 2016 at 1:43 PM, <b.yordanov2@gmail.com =
<mailto:b.yordanov2@gmail.com>> wrote:
>=20
> db=3D# update offers_past_data a set position =3D b.position from =
(select id,
> feed, row_number() over(partition by product order by grossprice asc) =
as
> position from offers_testing) b where a.id <http://a.id/> =3D b.id =
<http://b.id/> and a.feed =3D b.feed and
> a.position <> b.position;
> UPDATE 0
>=20
> =E2=80=8BUpdating offers_past_data
> =E2=80=8B=20
>=20
> This should update every row in offers_past_data when its =
=E2=80=9Cposition=E2=80=9D
> changes. In the example above no changes were introduced since the =
last run
> so nothing is updated (expected).
>=20
> db=3D# select count(*) from offers_testing where product =3D 2;
>  count
> -------
>     99
> (1 row)
> So there are 99 offers for product 2.
>=20
> =E2=80=8BCounting offers_testing=E2=80=8B
>=20
>=20
> Getting a single offer:
>=20
> db=3D# select id,grossprice from offers_testing where product =3D 2 =
limit 1;
>    id    | grossprice
> ---------+------------
>  4127918 |    5000.00
> (1 row)
>=20
>=20
> =E2=80=8BCounting offers_testing=E2=80=8B
>=20
> Updating its grossprice:
>=20
> db=3D# update offers_testing set grossprice =3D 20 where id =3D =
4127918;
> UPDATE 1
>=20
>=20
> =E2=80=8BUpdating offers_testing=E2=80=8B
>=20
> Now when executing the first query again I expect that no more than 99 =
rows
> get updated in offers_past_data since this is the maximum amount of
> positions that would be affected by offer 4127918 grossprice change.
>=20
> You haven't proven to us that a single row in offers_testing cannot =
match more than one row in offers_past_data.  Assuming a 1-to-many =
situation the update count for offers_past_data can definitely be more =
than the number of rows returned by the sub-query.
> =E2=80=8B=E2=80=8B
>=20
>=20
> db=3D# update offers_past_data a set position =3D b.position from =
(select id,
> feed, row_number() over(partition by product order by grossprice asc) =
as
> position from offers_testing) b where a.id <http://a.id/> =3D b.id =
<http://b.id/> and a.feed =3D b.feed and
> a.position <> b.position;
> UPDATE 104
>=20
> 104 rows get updated.
>=20
> Executing the same query again a few minutes later (no changes =
meanwhile in
> either table):
>=20
> db=3D# update offers_past_data a set position =3D b.position from =
(select id,
> feed, row_number() over(partition by product order by grossprice asc) =
as
> position from offers_testing) b where a.id <http://a.id/> =3D b.id =
<http://b.id/> and a.feed =3D b.feed and
> a.position <> b.position;
> UPDATE 28058
>=20
> This time it updates 28058 rows.
>=20
> This is a test environment and nothing reads or writes to these =
tables.
>=20
> Is this a bug or am I missing something obvious?
>=20
> =E2=80=8BIts likely data related, not a bug.
>=20
> Using the "UPDATE ... RETURNING *" form should provide good insight.  =
Specifically, look for all rows having the same (id, feed) pair.
>=20
> Also, "ORDER BY grossprice" seems inadequate.  The potential for =
duplicates here - which would then make the assignment of row numbers =
within the product partition random - is non-zero and is a quite likely =
source of your problem - along with the probable one-to-many =
relationship between offers_testing and offers_past_data.
>=20
> David J.
> =E2=80=8B=20
>=20
>=20

Re: BUG #14020: row_number() over(partition by order by) - weird behavior

From
"David G. Johnston"
Date:
On Tue, Mar 15, 2016 at 2:04 AM, Boyko Yordanov <b.yordanov2@gmail.com>
wrote:

> Thinking further on this, I now got your point on the =E2=80=9Cduplicate
> grossprices is ordered randomly=E2=80=9D suggestion.
>
> What I missed to realize is that the update query updates *every* product
> partition that has reordered due to duplicate grossprice being ordered
> randomly, resulting in thousands of updates instead of just < 148 (or < 9=
9
> in the case of product =3D 2 partition).
>
> Is there a way to ensure persistence of =E2=80=9Cover(order by duplicate_=
columns)=E2=80=9D
> ordering, except for ordering by a second (or even third) column?
>
>
=E2=80=8BNo, you need to have enough columns for deterministic order.

Dave
=E2=80=8B