Home > mailing lists

Re: Optimize single tuple fetch from nbtree index - Mailing list pgsql-hackers

From	Floris Van Nee
Subject	Re: Optimize single tuple fetch from nbtree index
Date	August 5, 2019 11:34:26
Msg-id	1565004866513.6952@Optiver.com Whole thread Raw
In response to	Re: Optimize single tuple fetch from nbtree index (Peter Geoghegan <pg@bowt.ie>)
Responses	Re: Optimize single tuple fetch from nbtree index
List	pgsql-hackers

Tree view

Hi Peter,

> Actually, having looked at the test case in more detail, that now
> seems less likely. The test case seems designed to reward making it
> cheaper to access one specific tuple among a fairly large group of
> related tuples -- reducing the number of inner scans is not going to
> be possible there.

> If this really is totally representative of the case that Floris cares
> about, I suppose that the approach taken more or less makes sense.
> Unfortunately, it doesn't seem like an optimization that many other
> users would find compelling, partly because it's only concerned with
> fixed overheads, and partly because most queries don't actually look
> like this.

Thanks for taking a look. Unfortunately this is exactly the case I care about. I'm a bit puzzled as to why this case
wouldn'tcome up more often by other users though. We have many large tables with timeseries data and it seems to me
thatwith timeseries data, two of the most common queries are: 
(1) What is the state of { a1,a2, a3 ...} at timepoint t (but you don't know that there's an update *exactly* at
timepointt - so you're left with trying to find the latest update smaller than t) 
(2) What is the state of { a } at timepoints { t1, t2, t3 ... }
Given that a1,a2,a3... are indepedently updating, but similar time series (eg. sensor a1 and sensor a2, but both
providea temperature value and update independently from each other). 
Both of these can also be done with some kind of DISTINCT ON clause, but this is often already much slower than just
doinga nested loop of fast index lookups with LIMIT 1 (this depends on the frequency of the timeseries data itself
versusthe sampling rate of your query though, for high frequency time series and/or low frequency sampling the LIMIT 1
approachis much faster). 

Note that there is actually some related work to this - in the Index Skip Scan thread [1] a function called
_bt_read_closestwas developed which also partially reads the page. A Skip Scan has a very similar access pattern to the
usecase I describe here, because it's also very likely to just require one tuple from the page. Even though the
implementationin that patch is currently incorrect, performance of the Skip Scan would likely also be quite a bit
fasterif it had a correct implementation of this partial page-read and it wouldn't have to read the full page every
time.

I have one further question about these index offsets. There are several comments in master that indicate that it's
impossiblethat an item moves 'left' on a page, if we continuously hold a pin on the page. For example, _bt_killitems
hasa comment like this: 

* Note that if we hold a pin on the target page continuously from initially
 * reading the items until applying this function, VACUUM cannot have deleted
 * any items from the page, and so there is no need to search left from the
 * recorded offset.  (This observation also guarantees that the item is still
 * the right one to delete, which might otherwise be questionable since heap
 * TIDs can get recycled.)    This holds true even if the page has been modified
 * by inserts and page splits, so there is no need to consult the LSN.

Still, exactly this case happens in practice. In my tests I was able to get behavior like:
1) pin + lock a page in _bt_first
2) read a tuple, record indexOffset (for example offset=100) and heap tid
3) unlock page, but *keep* the pin (end of _bt_first of my patch)
4) lock page again in _bt_next (we still hold the pin, so vacuum shouldn't have occurred)
5) look inside the current page for the heap Tid that we registered earlier
6) we find that we can now find this tuple at indexOffset=98, eg. it moved left. This should not be possible.
This case sometimes randomly happens when running 'make check', which is why I added code in my patch to also look left
onthe page from the previous indexOffset. 

However, this is in contradiction with the comments (and code) of _bt_killitems.
Is the comment incorrect/outdated or is there a bug in vacuum or any other part of Postgres that might move index items
lefteven though there are others holding a pin? 

-Floris

[1]
https://www.postgresql.org/message-id/flat/CA%2BhUKGKW4dXTP9G%2BWBskjT09tzD%2B9aMWEm%3DFpeb6RS5SXfPyKw%40mail.gmail.com#21abe755d5cf36aabaaa048c8a282169

pgsql-hackers by date:

From: Laurenz Albe
Date: 05 August 2019, 11:30:43
Subject: Re: Identity columns should own only one sequence

From: Dmitry Igrishin
Date: 05 August 2019, 11:38:40
Subject: psql's meta command \d is broken as of 12 beta2.

Re: Optimize single tuple fetch from nbtree index - Mailing list pgsql-hackers

Previous

Next