Thread: BUG #14020: row_number() over(partition by order by) - weird behavior
BUG #14020: row_number() over(partition by order by) - weird behavior
From
b.yordanov2@gmail.com
Date:
The following bug has been logged on the website:

Bug reference:      14020
Logged by:          Boyko
Email address:      b.yordanov2@gmail.com
PostgreSQL version: 9.5.0
Operating system:   CentOS 6.4
Description:

Hi,

db=# \d+ offers_past_data;
                    Table "public.offers_past_data"
  Column  |  Type   | Modifiers | Storage | Stats target | Description
----------+---------+-----------+---------+--------------+-------------
 id       | integer | not null  | plain   |              |
 feed     | integer | not null  | plain   |              |
 position | integer | not null  | plain   |              |

db=# \d+ offers_testing;
                                Table "public.offers_testing"
   Column   |     Type      |                      Modifiers                      | Storage | Stats target | Description
------------+---------------+-----------------------------------------------------+---------+--------------+-------------
 id         | bigint        | not null default nextval('offers_id_seq'::regclass) | plain   |              |
 grossprice | numeric(11,2) |                                                     | main    |              |
 feed       | integer       | not null                                            | plain   |              |
 product    | integer       |                                                     | plain   |              |

db=# update offers_past_data a set position = b.position from (select id,
feed, row_number() over(partition by product order by grossprice asc) as
position from offers_testing) b where a.id = b.id and a.feed = b.feed and
a.position <> b.position;
UPDATE 0

This should update every row in offers_past_data when its "position"
changes. In the example above no changes were introduced since the last run,
so nothing is updated (expected).

db=# select count(*) from offers_testing where product = 2;
 count
-------
    99
(1 row)

So there are 99 offers for product 2. Getting a single offer:

db=# select id, grossprice from offers_testing where product = 2 limit 1;
   id    | grossprice
---------+------------
 4127918 |    5000.00
(1 row)

Updating its grossprice:

db=# update offers_testing set grossprice = 20 where id = 4127918;
UPDATE 1

Now, when executing the first query again, I expect that no more than 99
rows get updated in offers_past_data, since that is the maximum number of
positions that could be affected by the grossprice change to offer 4127918.

db=# update offers_past_data a set position = b.position from (select id,
feed, row_number() over(partition by product order by grossprice asc) as
position from offers_testing) b where a.id = b.id and a.feed = b.feed and
a.position <> b.position;
UPDATE 104

104 rows get updated. Executing the same query again a few minutes later
(no changes meanwhile in either table):

db=# update offers_past_data a set position = b.position from (select id,
feed, row_number() over(partition by product order by grossprice asc) as
position from offers_testing) b where a.id = b.id and a.feed = b.feed and
a.position <> b.position;
UPDATE 28058

This time it updates 28058 rows. This is a test environment and nothing
reads or writes to these tables.

Is this a bug, or am I missing something obvious?

Regards,
Boyko
Re: BUG #14020: row_number() over(partition by order by) - weird behavior
From
"David G. Johnston"
Date:
On Mon, Mar 14, 2016 at 1:43 PM, <b.yordanov2@gmail.com> wrote:

> db=# update offers_past_data a set position = b.position from (select id,
> feed, row_number() over(partition by product order by grossprice asc) as
> position from offers_testing) b where a.id = b.id and a.feed = b.feed and
> a.position <> b.position;
> UPDATE 0

Updating offers_past_data.

> This should update every row in offers_past_data when its "position"
> changes. In the example above no changes were introduced since the last
> run so nothing is updated (expected).
>
> db=# select count(*) from offers_testing where product = 2;
>  count
> -------
>     99
> (1 row)
> So there are 99 offers for product 2.

Counting offers_testing.

> Getting a single offer:
>
> db=# select id, grossprice from offers_testing where product = 2 limit 1;
>    id    | grossprice
> ---------+------------
>  4127918 |    5000.00
> (1 row)

Selecting from offers_testing.

> Updating its grossprice:
>
> db=# update offers_testing set grossprice = 20 where id = 4127918;
> UPDATE 1

Updating offers_testing.

> Now when executing the first query again I expect that no more than 99
> rows get updated in offers_past_data since this is the maximum amount of
> positions that would be affected by offer 4127918 grossprice change.

You haven't proven to us that a single row in offers_testing cannot match
more than one row in offers_past_data. Assuming a 1-to-many situation, the
update count for offers_past_data can definitely be more than the number of
rows returned by the sub-query.

> db=# update offers_past_data a set position = b.position from (select id,
> feed, row_number() over(partition by product order by grossprice asc) as
> position from offers_testing) b where a.id = b.id and a.feed = b.feed and
> a.position <> b.position;
> UPDATE 104
>
> 104 rows get updated.
>
> Executing the same query again a few minutes later (no changes meanwhile
> in either table):
>
> db=# update offers_past_data a set position = b.position from (select id,
> feed, row_number() over(partition by product order by grossprice asc) as
> position from offers_testing) b where a.id = b.id and a.feed = b.feed and
> a.position <> b.position;
> UPDATE 28058
>
> This time it updates 28058 rows.
>
> This is a test environment and nothing reads or writes to these tables.
>
> Is this a bug or am I missing something obvious?

It's likely data-related, not a bug.

Using the "UPDATE ... RETURNING *" form should provide good insight.
Specifically, look for all rows having the same (id, feed) pair.

Also, "ORDER BY grossprice" seems inadequate. The potential for duplicates
here - which would then make the assignment of row numbers within the
product partition random - is non-zero and is a quite likely source of your
problem, along with the probable one-to-many relationship between
offers_testing and offers_past_data.

David J.
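David's point about duplicate grossprices can be shown in miniature. The sketch below is hypothetical (a tiny stand-in for offers_testing, run through SQLite via Python's sqlite3 module, whose window functions follow the same SQL semantics as PostgreSQL's): with ties on grossprice, row_number() may number the tied rows in either order, while appending the unique id to the window ORDER BY makes the numbering fully determined.

```python
import sqlite3

# Hypothetical miniature of offers_testing: three offers for one product,
# two of them tied on grossprice.
con = sqlite3.connect(":memory:")
con.execute("create table offers_testing"
            " (id integer primary key, product int, grossprice real)")
con.executemany("insert into offers_testing values (?, ?, ?)",
                [(1, 2, 10.0), (2, 2, 10.0), (3, 2, 20.0)])  # ids 1 and 2 tie

# Ambiguous: both (1,1),(2,2),(3,3) and (2,1),(1,2),(3,3) are correct
# answers, so two runs (or two plans) may legitimately disagree.
ambiguous = con.execute("""
    select id, row_number() over (partition by product order by grossprice)
    from offers_testing
""").fetchall()

# Deterministic: the unique id breaks the tie, so id 1 must precede id 2.
deterministic = sorted(con.execute("""
    select id, row_number() over (partition by product order by grossprice, id)
    from offers_testing
""").fetchall())
```

Either ranking the ambiguous query produces is "correct"; the point is that nothing pins it down between runs, which is exactly what the position-diffing UPDATE relies on.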
Hi and thanks for your time on this.

> You haven't proven to us that a single row in offers_testing cannot match
> more than one row in offers_past_data. Assuming a 1-to-many situation the
> update count for offers_past_data can definitely be more than the number
> of rows returned by the sub-query.

It is a one-to-one relationship between the tables, as there is a primary
key on (id, feed) on both tables (which I neglected to point out):

Indexes:
    "offers_past_data_id_feed" PRIMARY KEY, btree (id, feed)

Indexes:
    "offers_testing_id_feed" PRIMARY KEY, btree (id, feed)

I assume that this guarantees that a single grossprice change in
offers_testing where product = 2 translates to at most (count(id, feed)
where product = 2) position updates in both offers_testing and
offers_past_data.

Adding "returning *" to the questionable query, it seems to update rows
that are not related to product 2 (and in my opinion should not have
changed positions).

> Also, "ORDER BY grossprice" seems inadequate. The potential for
> duplicates here - which would then make the assignment of row numbers
> within the product partition random - is non-zero and is a quite likely
> source of your problem - along with the probable one-to-many relationship
> between offers_testing and offers_past_data.

Dismissing the one-to-many relationship suggestion, as it isn't the case.

Your point on duplicate grossprices is valid, but I believe that even in
the case of duplicate grossprices, updating a single grossprice should not
translate into more position updates than there are rows in the modified
product partition. And in offers_testing there are no more than 148 rows
per product partition:

db=# select max(partition_count) from (select count(*) over (partition by
product) as partition_count from offers_testing) sq;
 max
-----
 148
(1 row)

And yet the update query updates 28k records for some reason, most of which
are outside the modified product partition.

Boyko
Thinking further on this, I now get your point about the "duplicate
grossprices are ordered randomly" suggestion.

What I failed to realize is that the update query updates *every* product
partition that has reordered due to duplicate grossprices being ordered
randomly, resulting in thousands of updates instead of just < 148 (or < 99
in the case of the product = 2 partition).

Is there a way to ensure persistence of "over(order by duplicate_columns)"
ordering, other than ordering by a second (or even third) column?

Overall this now makes sense and is obviously not a bug. I apologize for
bothering this list (instead of 'general') with this issue. Thanks once
again!

Regards,
Boyko
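The surprise above - thousands of updated rows from a single-row price change - can be reproduced with hypothetical data. The sketch below (SQLite via Python's sqlite3, standing in for the real tables) builds many product partitions that each contain tied grossprices, then compares two equally valid tie resolutions, simulated by breaking ties on id ascending versus id descending: every tied partition reorders, so far more rows differ than the size of any one partition.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("create table offers_testing"
            " (id integer primary key, product int, grossprice real)")
# 10 products, 4 offers each, every offer within a product tied on price.
con.executemany("insert into offers_testing values (?, ?, ?)",
                [(p * 10 + i, p, 5.0) for p in range(10) for i in range(4)])

def positions(tiebreak):
    # Map id -> row_number under one particular (valid) tie resolution.
    return dict(con.execute(f"""
        select id,
               row_number() over (partition by product
                                  order by grossprice, {tiebreak})
        from offers_testing
    """).fetchall())

run1 = positions("id asc")   # one legitimate ordering of the ties
run2 = positions("id desc")  # another, equally legitimate ordering
changed = sum(1 for i in run1 if run1[i] != run2[i])
# All 40 rows change position, even though no partition has more than 4 rows.
```

This mirrors the thread's numbers: a diffing UPDATE between two such runs touches rows in every tied partition, not just the one whose price actually changed.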
Re: BUG #14020: row_number() over(partition by order by) - weird behavior
From
"David G. Johnston"
Date:
On Tue, Mar 15, 2016 at 2:04 AM, Boyko Yordanov <b.yordanov2@gmail.com> wrote:

> Thinking further on this, I now get your point about the "duplicate
> grossprices are ordered randomly" suggestion.
>
> What I failed to realize is that the update query updates *every* product
> partition that has reordered due to duplicate grossprices being ordered
> randomly, resulting in thousands of updates instead of just < 148 (or
> < 99 in the case of the product = 2 partition).
>
> Is there a way to ensure persistence of "over(order by duplicate_columns)"
> ordering, other than ordering by a second (or even third) column?

No, you need to have enough columns for a deterministic order.

Dave
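David's answer in miniature (again a sketch with hypothetical data and SQLite via Python's sqlite3, not the poster's actual tables): once the window ORDER BY ends in a unique key - here the primary key id - row_number() is fully deterministic, so recomputing positions with no intervening data change can never report a difference, and a no-op re-run of the original UPDATE would touch 0 rows.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("create table offers_testing"
            " (id integer primary key, product int, grossprice real)")
# 5 products, 10 offers each, with plenty of duplicate prices (i % 3).
con.executemany("insert into offers_testing values (?, ?, ?)",
                [(i, i % 5, float(i % 3)) for i in range(50)])

def positions():
    # Unique trailing key (id) => one and only one correct numbering.
    return con.execute("""
        select id, row_number() over (partition by product
                                      order by grossprice, id)
        from offers_testing
        order by id
    """).fetchall()

stable = positions() == positions()  # True: ties broken identically each run
```

In the thread's terms, changing the subquery's window to `over(partition by product order by grossprice asc, id)` would confine each price change's position updates to its own product partition.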