Re: On disable_cost - Mailing list pgsql-hackers

From Heikki Linnakangas
Subject Re: On disable_cost
Date
Msg-id c6e9f89f-efe9-4588-a8e3-3c41c89717ca@iki.fi
Whole thread Raw
In response to Re: On disable_cost  (Robert Haas <robertmhaas@gmail.com>)
Responses Re: On disable_cost
List pgsql-hackers
On 23/08/2024 20:11, Robert Haas wrote:
> I find it a little weird that hnsw thinks itself able to return all
> the tuples in an order the user chooses, but unable to return all of
> the tuples in an arbitrary order.

HNSW is weird in many ways:

- There is no inherent sort order. It cannot do "ORDER BY column", only 
kNN-sort like "ORDER BY column <-> value".

- It's approximate. It's not guaranteed to return the same set of rows 
as a sequential scan + sort.

- The number of results it returns is limited by the hnsw.ef_search GUC, 
default 100.

- It collects all the results (up to hnsw.ef_search) in memory, and only 
then returns them. So if you tried to use it with a large number of 
results, it can simply run out of memory.

Arguably all of those are bugs in HNSW, but it is what it is. The 
algorithm is inherently approximate. Despite that, it's useful in practice.

> In core, we have precedent for index
> types that can't return individual tuples at all (gin, brin) but not
> one that is able to return tuples in concept but has a panic attack if
> you don't know how you want them sorted.

Well, we do also have gin_fuzzy_search_limit. Two wrongs doesn't make it 
right, though; I'd love to get rid of that hack too somehow.

> I don't quite see why you
> couldn't just treat that case the same as ORDER BY
> the_first_column_of_the_index, or any other arbitrary rule that you
> want to make up. Sure, it might be more expensive than a sequential
> scan, but the user said they didn't want a sequential scan. I'm not
> quite sure why pgvector thinks it gets to decide that it knows better
> than the user, or the rest of the optimizer. I don't even think I
> really believe it would always be worse: I've seen cases where a table
> was badly bloated and mostly empty but its indexes were not bloated,
> and in that case an index scan can be a HUGE winner even though it
> would normally be a lot worse than a sequential scan.

Sure, you could make it work. It could construct a vector out of thin 
air to compare with, when there's no scan key, or implement a completely 
different codepath that traverses the full graph in no particular order.

> If you don't want to fix hnsw to work the way the core optimizer
> thinks it should, or if there's some reason it can't be done,
> alternatives might include (1) having the cost estimate function hack
> the count of disabled nodes and (2) adding some kind of core support
> for an index cost estimator refusing a path entirely. I haven't tested
> (1) so I don't know for sure that there are no issues, but I think we
> have to do all of our cost estimating before we can think about adding
> the path so I feel like there's a decent chance it would do what you
> want.

It would seem useful for an index AM to be able to say "nope, I can't do 
this". I don't remember how exactly this stuff works, but I'm surprised 
it doesn't already exist.

-- 
Heikki Linnakangas
Neon (https://neon.tech)




pgsql-hackers by date:

Previous
From: Tom Lane
Date:
Subject: Re: On disable_cost
Next
From: Robert Haas
Date:
Subject: Re: On disable_cost