[MASSMAIL] Looking for an index that supports top-n searches by enforcing a max-n automatically - Mailing list pgsql-hackers
From: Morris de Oryx
Subject: [MASSMAIL] Looking for an index that supports top-n searches by enforcing a max-n automatically
Msg-id: CAKqncci5FC75AE9XLjwgT0=6uw40i=C80Zi5XzX4pXw4yoPwyw@mail.gmail.com
Responses: Re: Looking for an index that supports top-n searches by enforcing a max-n automatically
List: pgsql-hackers
Looking for an index to support top-n searches, where n has a fixed maximum
Recently, I've been looking at strategies to handle top-n queries in Postgres. In my current cases, we've got definition tables and very large related tables. Here's a stripped-down example; the real tables are much wider.
CREATE TABLE IF NOT EXISTS data.inventory (
    id uuid NOT NULL PRIMARY KEY,
    inv_id uuid NOT NULL
);

CREATE TABLE IF NOT EXISTS data.scan (
    id uuid NOT NULL PRIMARY KEY,
    inv_id uuid NOT NULL,
    scan_dts_utc timestamp NOT NULL DEFAULT NOW() -- We run our servers on UTC
);
Every item in inventory is scanned when it passes through various steps in a clean-dispatch-arrive-use-clean sort of work cycle. The ratio between inventory and scan is 1:0-n, where n can be virtually any number. In another table pair like this, the average is 1:1,000. In the inventory example, it ranges from 0 to roughly 200,000. The distribution of related row counts is all over the map. The reasons behind these ratios sometimes map to valid processes, and sometimes are a consequence of low-quality data leaking into the system. In the case of inventory, the results make sense. In our case:
* The goal value for n is often 1, and never more than 25.
* Depending on the table, 97% or more of the rows are discarded because they fall past the 25th most recent record.
* Partial indexes do not work as well on huge tables as I hoped. The same data copied via a STATEMENT trigger into a thin, subsetted table is much faster. I think this has to do with the increase in random disk access required for a table and/or index with more pages spread around on the disk.
* We can't filter in advance by any reasonable date range. We could get 25 scans on one item in an hour, or a year, or five years, or never.
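For reference, here's a sketch of the statement-trigger approach mentioned above, copying new scans into a thin side table. All names beyond the example tables (data.scan_recent, the trigger, and the function) are hypothetical:

```sql
-- Hypothetical thin side table: same columns, far fewer pages.
CREATE TABLE IF NOT EXISTS data.scan_recent (
    id uuid NOT NULL PRIMARY KEY,
    inv_id uuid NOT NULL,
    scan_dts_utc timestamp NOT NULL
);

-- Statement-level trigger using a transition table (Postgres 10+).
CREATE OR REPLACE FUNCTION data.scan_copy_recent()
RETURNS trigger
LANGUAGE plpgsql
AS $$
BEGIN
    INSERT INTO data.scan_recent (id, inv_id, scan_dts_utc)
    SELECT id, inv_id, scan_dts_utc
      FROM new_rows;
    RETURN NULL;
END;
$$;

CREATE TRIGGER scan_copy_recent_after_insert
    AFTER INSERT ON data.scan
    REFERENCING NEW TABLE AS new_rows
    FOR EACH STATEMENT
    EXECUTE FUNCTION data.scan_copy_recent();
```

The statement-level transition table means one trigger call per INSERT statement rather than per row, which matters on bulk loads.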
We're finding that we need the top-n records more and more often, returned quickly. This gets harder to do as the table(s) grow.
SELECT id, scan_dts_utc
  FROM data.scan
 WHERE inv_id = 'b7db5d06-8275-224d-a38a-ac263dc1c767'
 ORDER BY scan_dts_utc DESC
 LIMIT 25; -- Full result might be 0 rows, 200K, or anything in between. Not on a bell curve.
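When we need the top 25 for many items at once rather than one id at a time, the same shape of search can be expressed with a LATERAL join, which the planner can drive off the per-item index. A sketch, using the example tables:

```sql
-- Top 25 most recent scans for every inventory item.
-- The LATERAL subquery runs the indexed top-n search once per item.
SELECT inv.id AS inventory_id,
       recent.id AS scan_id,
       recent.scan_dts_utc
  FROM data.inventory AS inv
 CROSS JOIN LATERAL (
       SELECT s.id, s.scan_dts_utc
         FROM data.scan AS s
        WHERE s.inv_id = inv.id
        ORDER BY s.scan_dts_utc DESC
        LIMIT 25
 ) AS recent;
```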
A compound index works really well to optimize these kinds of searches:

CREATE INDEX scan_inv_id_scan_dts_utc_idx
    ON data.scan (inv_id, scan_dts_utc DESC);
What I'm wondering is whether there is some index option, likely not a B-tree, that can *automatically* enforce a maximum-length list of top values, based on a defined sort. In made-up syntax:

CREATE INDEX scan_inv_id_scan_dts_utc_idx
    ON data.scan (inv_id, scan_dts_utc DESC) -- This defines the ordering
 LIMIT 25;                                   -- This sets the hard max for n
The goal is to have an automatically maintained list of the top values in the index itself. In the right situations (like ours), this reduces the index size by 20x or more. Smaller index, faster results. And, since the index is on the source table, the row references are already there. (Something I lose when maintaining this by hand in a side/shadow/top table.)
I've looked at a ton of plans, and Postgres *clearly* goes to a lot of effort to recognize and optimize top-n searches already. That's encouraging, as it suggests that the planner takes LIMIT into account. (I've picked up already that maintaining the purity of the planner and executor abstractions is a core value to the project.)
And, for sure, I can build and maintain my own custom, ordered list in various ways. None of them seem like they can possibly rival the trimming behavior being handled by an index.
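One such hand-rolled version, assuming a hand-maintained side table (call it data.scan_recent; all names here are illustrative), trims each item's list back to 25 after inserts:

```sql
-- Hypothetical trim pass: keep only the 25 most recent rows per item.
DELETE FROM data.scan_recent AS sr
 WHERE sr.id IN (
       SELECT ranked.id
         FROM (SELECT id,
                      row_number() OVER (PARTITION BY inv_id
                                         ORDER BY scan_dts_utc DESC) AS rn
                 FROM data.scan_recent) AS ranked
        WHERE ranked.rn > 25
 );
```

This works, but it's exactly the sort of bolt-on maintenance that an index doing its own trimming would make unnecessary.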
I'm out over my skis here, but I'm intuiting that this might be a job for one of the multi-value/inverted index types/frameworks. I tried some experiments, but only got worse results.
Hope that reads as understandable...grateful for any suggestions.