Thread: nth_value out of more than n values returns null
This query:
SELECT NTH_VALUE(id, 5000000) OVER (ORDER BY created_at, id ASC) FROM table
in a table where SELECT COUNT(*) returns a value a few thousand over 5 million, where id is the primary key, returns null.
The inclusion of the primary key should make the order by a total order. So there should be a 5 millionth row.
How can this happen?
SELECT NTH_VALUE(id, 5000000) OVER (ORDER BY created_at, id ASC) FROM table
in a table where SELECT COUNT(*) returns a value a few thousand over 5 million, where id is the primary key, returns null.
The inclusion of the primary key should make the order by a total order. So there should be a 5 millionth row.
How can this happen?
On 11/4/24 15:17, Guyren Howe wrote: > This query: > > SELECT NTH_VALUE(id, 5000000) OVER (ORDER BY created_at, id ASC) FROM table > > in a table where SELECT COUNT(*) returns a value a few thousand over 5 > million, where id is the primary key, returns null. > > The inclusion of the primary key should make the order by a total order. > So there should be a 5 millionth row. > > How can this happen? If I where to hazard a guess it has to do with this: https://www.postgresql.org/docs/current/functions-window.html "Note that first_value, last_value, and nth_value consider only the rows within the “window frame”, which by default contains the rows from the start of the partition through the last peer of the current row. This is likely to give unhelpful results for last_value and sometimes also nth_value. You can redefine the frame by adding a suitable frame specification (RANGE, ROWS or GROUPS) to the OVER clause. See Section 4.2.8 for more information about frame specifications." Further I am pretty sure that this ORDER BY created_at, id ASC is creating a window frame over created_at, id and that there are duplicate created_at values which means the frame has less then 5000000 rows. Try a smaller number and see what happens. -- Adrian Klaver adrian.klaver@aklaver.com
On 2024-11-05 00:17 +0100, Guyren Howe wrote: > This query: > > SELECT NTH_VALUE(id, 5000000) OVER (ORDER BY created_at, id ASC) FROM table > > in a table where SELECT COUNT(*) returns a value a few thousand over 5 > million, where id is the primary key, returns null. > > The inclusion of the primary key should make the order by a total > order. So there should be a 5 millionth row. > > How can this happen? https://www.postgresql.org/docs/current/functions-window.html explains it: Note that first_value, last_value, and nth_value consider only the rows within the "window frame", which by default contains the rows from the start of the partition through the last peer of the current row. This is likely to give unhelpful results for last_value and sometimes also nth_value. You can redefine the frame by adding a suitable frame specification (RANGE, ROWS or GROUPS) to the OVER clause. See Section 4.2.8 for more information about frame specifications. You probably want to extend the window frame with this: SELECT NTH_VALUE(id, 5000000) OVER ( ORDER BY created_at, id ASC ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING ) FROM table -- Erik
I’m trying to get the id of the 5,000,000th record, so I can join against it to get a name. I didn’t fully understand what the docs say there. What am I missing?
On 4 Nov 2024 at 15:36 -0800, Erik Wienhold <ewie@ewie.name>, wrote:
On 2024-11-05 00:17 +0100, Guyren Howe wrote:This query:
SELECT NTH_VALUE(id, 5000000) OVER (ORDER BY created_at, id ASC) FROM table
in a table where SELECT COUNT(*) returns a value a few thousand over 5
million, where id is the primary key, returns null.
The inclusion of the primary key should make the order by a total
order. So there should be a 5 millionth row.
How can this happen?
https://www.postgresql.org/docs/current/functions-window.html explains
it:
Note that first_value, last_value, and nth_value consider only the rows
within the "window frame", which by default contains the rows from the
start of the partition through the last peer of the current row. This is
likely to give unhelpful results for last_value and sometimes also
nth_value. You can redefine the frame by adding a suitable frame
specification (RANGE, ROWS or GROUPS) to the OVER clause. See Section
4.2.8 for more information about frame specifications.
You probably want to extend the window frame with this:
SELECT NTH_VALUE(id, 5000000) OVER (
ORDER BY created_at, id ASC
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
) FROM table
--
Erik
On Monday, November 4, 2024, Guyren Howe <guyren@gmail.com> wrote:
I’m trying to get the id of the 5,000,000th record, so I can join against it to get a name. I didn’t fully understand what the docs say there. What am I missing?
So just use “offset 5_000_000 limit 1”. Bringing in a window function here seems unhelpful.
David J.
Wouldn’t it be offset 4_999_999?
I’d still like to understand why nth_value doesn’t work.
I’d still like to understand why nth_value doesn’t work.
On 4 Nov 2024 at 15:44 -0800, David G. Johnston <david.g.johnston@gmail.com>, wrote:
On Monday, November 4, 2024, Guyren Howe <guyren@gmail.com> wrote:I’m trying to get the id of the 5,000,000th record, so I can join against it to get a name. I didn’t fully understand what the docs say there. What am I missing?So just use “offset 5_000_000 limit 1”. Bringing in a window function here seems unhelpful.David J.
On Monday, November 4, 2024, Guyren Howe <guyren@gmail.com> wrote:
Wouldn’t it be offset 4_999_999?
Probably. I tend to expect off-by-one for these kinds of things and test my way out.
I’d still like to understand why nth_value doesn’t work.
When you perform an order by in a window clause the frame you get by default ends at the current row. Consider “count(*) over ()” versus “count(*) over (order by id)”.
You need to not use defaults for the window frame if this doesn’t suit you.
David J.
"David G. Johnston" <david.g.johnston@gmail.com> writes: > So just use “offset 5_000_000 limit 1”. Bringing in a window function here > seems unhelpful. Yeah, that. A bite-size example might help clarify what the window function is doing: regression=# create table zed(f1 int) ; CREATE TABLE regression=# insert into zed select generate_series(1, 10); INSERT 0 10 regression=# select f1, nth_value(f1, 5) over (order by f1) from zed; f1 | nth_value ----+----------- 1 | 2 | 3 | 4 | 5 | 5 6 | 5 7 | 5 8 | 5 9 | 5 10 | 5 (10 rows) For the first four rows, the window frame doesn't include the row you want, so you get NULL. You can fix that with a non-default window frame: regression=# select f1, nth_value(f1, 5) over (order by f1 ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) fromzed; f1 | nth_value ----+----------- 1 | 5 2 | 5 3 | 5 4 | 5 5 | 5 6 | 5 7 | 5 8 | 5 9 | 5 10 | 5 (10 rows) So yeah, you can get the fifth (or five million'th) row this way, but you'll get N copies of it, which I assume is not what you want. Better regression=# select f1 from zed order by f1 offset 4 limit 1; f1 ---- 5 (1 row) which gets you just the one row and is a lot cheaper too. regards, tom lane