Re: Use streaming read API in ANALYZE - Mailing list pgsql-hackers

From Mats Kindahl
Subject Re: Use streaming read API in ANALYZE
Date
Msg-id CA+14425U9MC9AZEvnNcCoUvTH39v_Y4p4tB3jQheK=_e65RKKQ@mail.gmail.com
In response to Re: Use streaming read API in ANALYZE  (Thomas Munro <thomas.munro@gmail.com>)
List pgsql-hackers
On Wed, Sep 18, 2024 at 5:13 AM Thomas Munro <thomas.munro@gmail.com> wrote:
On Sun, Sep 15, 2024 at 12:14 AM Mats Kindahl <mats@timescale.com> wrote:
> I used the combination of your patch and making the computation of vacattrstats for a relation available through the API and managed to implement something that I think does the right thing. (I just sampled a few different statistics to check if they seem reasonable, like most common vals and most common freqs.) See attached patch.

Cool.  I went ahead and committed that small new function and will
mark the open item closed.

Thank you Thomas, this will help a lot.
 
> I need the vacattrstats to set up the two streams for the internal relations. I can just re-implement them in the same way as is already done, but this seems like a small change that avoids unnecessary code duplication.

Unfortunately we're not in a phase where we can make non-essential
changes: we're right about to release and we're only committing fixes,
and it seems like you have a way forward (albeit with some
duplication).  We can keep talking about that for v18.

Yes, I can work around this by re-implementing the same code that is present in PostgreSQL.
 

From your earlier email:
> I'll take a look at the thread. I really think the ReadStream abstraction is a good step in the right direction.

Here's something you or your colleagues might be interested in: I was
looking around for a fun extension to streamify as a demo of the
technology, and I finished up writing a quick patch to streamify
pgvector's HNSW index scan, which worked well enough to share[1] (I
think it should in principle be able to scale with the number of graph
connections, at least 16x), but then people told me that it's of
limited interest because everybody knows that HNSW indexes have to fit
in memory (I think there may also be memory prefetch streaming
opportunities, unexamined for now).  But that made me wonder what the
people with the REALLY big indexes do for hyperdimensional graph
search on a scale required to build Skynet, and that led me back to
Timescale pgvectorscale[2].  I see two obvious signs that this thing
is eminently and profitably streamifiable: (1) The stated aim is
optimising for indexes that don't fit in memory, hence "Disk" in the
name of the research project it is inspired by, (2) I see that
DiskANN[3] is aggressively using libaio (Linux) and overlapped/IOCP
(Windows).  So now I am waiting patiently for a Rustacean to show up
with patches for pgvectorscale to use ReadStream, which would already
get read-ahead advice and vectored I/O (Linux, macOS, FreeBSD soon
hopefully), and hopefully also provide a nice test case for the AIO
patch set which redirects buffer reads through io_uring (Linux,
basically the newer better libaio) or background I/O workers (other
OSes, which works surprisingly competitively).  Just BTW for
comparison with DiskANN we have also had early POC-quality patches
that drive AIO with overlapped/IOCP (Windows) which will eventually be
rebased and proposed (Windows isn't really a primary target but we
wanted to validate that the stuff we're working on has abstractions
that will map to the obvious system APIs found in the systems
PostgreSQL targets).  For completeness, I've also had it mostly
working on the POSIX AIO of FreeBSD, HP-UX and AIX (though we dropped
support for those last two so that was a bit of a dead end).
 

[1] https://www.postgresql.org/message-id/flat/CA%2BhUKGJ_7NKd46nx1wbyXWriuZSNzsTfm%2BrhEuvU6nxZi3-KVw%40mail.gmail.com
[2] https://github.com/timescale/pgvectorscale
[3] https://github.com/microsoft/DiskANN

Thanks Thomas, this looks really interesting. I've forwarded it to the pgvectorscale team.
--
Best wishes,
Mats Kindahl, Timescale
