Thread: Partial index creation always scans the entire table

Partial index creation always scans the entire table

From
MingJu Wu
Date:
Hello,

When creating partial indexes, can postgres utilize another index for figuring which rows should be included in the partial index, without performing a full table scan?

My scenario is that I have a table with 50M rows that are categorized into 10K categories. I need to create a partial index for each category. I have created a index on the category column, hoping that postgres can use this information when creating the partial indexes. However, postgres always performs full table scan.

I've tested with PostgreSQL 12.2. Below is an example setup showing the problem.

TEST1 shows that building a full index covering all rows takes 18 seconds.

TEST2 shows that creating a partial index for one of the category1 (category1=1) takes 3 seconds. This means that for creating 10K partial indexes for each category, it will take over 8 hours. Compared to just 18 seconds in TEST1, it is much longer due to repeated full table scans.

TEST3 shows that even with another index (index_category2 created in SETUP) covering category2, creating a partial index for one of the category2 (category2=1) still takes 3 seconds. I think postgres is still doing a full table scan here.

My question is: can postgres utilize index_category2 is TEST3?

Thank you.

---------
-- SETUP
---------

CREATE TABLE test_data (
    id bigint PRIMARY KEY,
    category1 bigint,
    category2 bigint
);


INSERT INTO test_data(id, category1, category2)
SELECT id, category, category FROM (
    SELECT
        generate_series(1, 50000000) AS id,
        (random()*10000)::bigint AS category
) q;
--  Query returned successfully in 1 min 47 secs.

CREATE INDEX index_category2 ON test_data(category2);
-- Query returned successfully in 32 secs 347 msec.


--------------
-- TEST1: CREATE FULL INDEX
--------------

CREATE INDEX index_full ON test_data(id);
-- Query returned successfully in 18 secs 713 msec.


--------------
-- TEST2: CREATE PARTIAL INDEX, using category1
--------------

CREATE INDEX index_partial_1 ON test_data(id) WHERE category1=1;
-- Query returned successfully in 3 secs 523 msec.


--------------
-- TEST3: CREATE PARTIAL INDEX, using category2
--------------

CREATE INDEX index_partial_2 ON test_data(id) WHERE category2=1;
-- Query returned successfully in 3 secs 651 msec.


--- END ---

Re: Partial index creation always scans the entire table

From
Sergei Kornilov
Date:
Hello

> When creating partial indexes, can postgres utilize another index for figuring which rows should be included in the
partialindex, without performing a full table scan?
 

No.
create index always perform a seqscan on table. And two full table scan for create index concurrently.

regards, Sergei



Re: Partial index creation always scans the entire table

From
Justin Pryzby
Date:
On Sat, Feb 15, 2020 at 07:04:48PM +0800, MingJu Wu wrote:
> Hello,
> 
> When creating partial indexes, can postgres utilize another index for
> figuring which rows should be included in the partial index, without
> performing a full table scan?
> 
> My scenario is that I have a table with 50M rows that are categorized into
> 10K categories. I need to create a partial index for each category. I have
> created a index on the category column, hoping that postgres can use this
> information when creating the partial indexes. However, postgres always
> performs full table scan.
> 
> I've tested with PostgreSQL 12.2. Below is an example setup showing the

I don't think it's possible, and an index scan wouldn't necessarily be faster,
anyway, since the reads might be unordered rather than sequantial, and might
hit large fractions of the table even though only returning a fraction of its
tuples.

But have you thought about partitioning on category rather than partial
indexes?  Possibly hash partition of (category).  If your queries usually
include category_id=X, that might be a win for performance anyway, since tables
can now be read sequentially rather than scannned by index (again, probably out
of order).

-- 
Justin



Re: Partial index creation always scans the entire table

From
Laurenz Albe
Date:
On Sat, 2020-02-15 at 19:04 +0800, MingJu Wu wrote:
> When creating partial indexes, can postgres utilize another index for figuring which rows
> should be included in the partial index, without performing a full table scan?

No; it has to be a full sequential scan.

> My scenario is that I have a table with 50M rows that are categorized into 10K categories.
> I need to create a partial index for each category. I have created a index on the category
> column, hoping that postgres can use this information when creating the partial indexes.
> However, postgres always performs full table scan.

There is your problem.

You don't need a partial index per category, you need a single index that *contains* the category.

Yours,
Laurenz Albe
-- 
Cybertec | https://www.cybertec-postgresql.com




Re: Partial index creation always scans the entire table

From
Tom Lane
Date:
Laurenz Albe <laurenz.albe@cybertec.at> writes:
> On Sat, 2020-02-15 at 19:04 +0800, MingJu Wu wrote:
>> My scenario is that I have a table with 50M rows that are categorized into 10K categories.
>> I need to create a partial index for each category.

> You don't need a partial index per category, you need a single index that *contains* the category.

Yeah, that's an anti-pattern.  Essentially, you are trying to replace the
first branching level of an index that includes the category column with
a ton of system catalog entries and planner proof logic to select one of
N indexes that don't include the category.  It is *highly* unlikely that
that's going to be a win.  It's going to be a huge loss if the planner
fails to make the proof you need, and even when it does, it's not really
going to be faster overall --- you've traded off run-time for planning
time, at a rather unfavorable exchange rate.  Updates on the table are
going to be enormously penalized, too, because the index machinery doesn't
have any way to understand that only one of the indexes needs work.

I've seen people try to do this before.  I wonder if the manual page
about partial indexes should explicitly say "don't do that".

            regards, tom lane



Re: Partial index creation always scans the entire table

From
Justin Pryzby
Date:
On Sat, Feb 15, 2020 at 10:15:53PM +0100, Laurenz Albe wrote:
> > My scenario is that I have a table with 50M rows that are categorized into 10K categories.
> > I need to create a partial index for each category. I have created a index on the category
> > column, hoping that postgres can use this information when creating the partial indexes.
> > However, postgres always performs full table scan.
> 
> There is your problem.
> 
> You don't need a partial index per category, you need a single index that *contains* the category.

On Sun, Feb 16, 2020 at 10:30:05AM -0500, Tom Lane wrote:
> Laurenz Albe <laurenz.albe@cybertec.at> writes:
> > On Sat, 2020-02-15 at 19:04 +0800, MingJu Wu wrote:
> >> My scenario is that I have a table with 50M rows that are categorized into 10K categories.
> >> I need to create a partial index for each category.
> 
> > You don't need a partial index per category, you need a single index that *contains* the category.
> 
> Yeah, that's an anti-pattern.  Essentially, you are trying to replace the

The OP mentioned having an index on "category", which they were hoping the
creation of partial indexes would use:

On Sat, Feb 15, 2020 at 07:04:48PM +0800, MingJu Wu wrote:
> My scenario is that I have a table with 50M rows that are categorized into
> 10K categories. I need to create a partial index for each category. I have
> created a index on the category column, hoping that postgres can use this
> information when creating the partial indexes. However, postgres always
> performs full table scan.

So the question is why they (think they) *also* need large number of partial
indexes.

I was reminded of reading this, but I think it's a pretty different case.
https://heap.io/blog/engineering/running-10-million-postgresql-indexes-in-production

-- 
Justin



Re: Partial index creation always scans the entire table

From
Tom Lane
Date:
Justin Pryzby <pryzby@telsasoft.com> writes:
> I was reminded of reading this, but I think it's a pretty different case.
> https://heap.io/blog/engineering/running-10-million-postgresql-indexes-in-production

Yeah, the critical paragraph in that is

    This isn’t as scary as it sounds for a two main reasons. First, we
    shard all of our data by customer. Each table in our database holds
    only one customer’s data, so each table has a only a few thousand
    indexes at most. Second, these events are relatively rare. The most
    common defined events make up only a few percent of a customer’s raw
    events, and most are much more rare. This means that we perform
    relatively little I/O maintaining this schema, because most incoming
    events match no event definitions and therefore don’t need to be
    written to any of the indexes. Similarly, the indexes don’t take up
    much space on disk.

A set of partial indexes that cover a small part of the total data
can be sensible.  If you're trying to cover most/all of the data,
you're doing it wrong --- basically, you're reinventing partitioning
using the wrong tools.

            regards, tom lane



RE: Partial index creation always scans the entire table

From
"Mike Sofen"
Date:
>From: Tom Lane <tgl@sss.pgh.pa.us>   Sent: Sunday, February 16, 2020 7:30
AM
>I've seen people try to do this before.  I wonder if the manual page about
partial indexes should explicitly say "don't do that". 
>    regards, tom lane

Yes please (seriously).  The utter beauty of Postgres is the flexibility and
power that its evolutionary path has allowed/created.  The tragic danger is
that the beauty is fairly easy to misapply/misuse.  Caveats in the
documentation would be very beneficial to both seasoned practitioners and
newcomers - it is quite challenging to keep up with everything Postgres and
the documentation is where most of us turn for guidance.  

And thank you Tom (and others), for your willingness to share these (and
many, many other) insights - it is so powerful when facts connect with
database reality.

Mike Sofen 





Re: Partial index creation always scans the entire table

From
Justin Pryzby
Date:
On Sun, Feb 16, 2020 at 04:43:10PM -0800, Mike Sofen wrote:
> >From: Tom Lane <tgl@sss.pgh.pa.us>   Sent: Sunday, February 16, 2020 7:30
> AM
> >I've seen people try to do this before.  I wonder if the manual page about
> partial indexes should explicitly say "don't do that". 
> >    regards, tom lane
> 
> Yes please (seriously).  The utter beauty of Postgres is the flexibility and
> power that its evolutionary path has allowed/created.  The tragic danger is
> that the beauty is fairly easy to misapply/misuse.

Quote. Enough rope to shoot yourself in the foot.

Would you care to suggest text to be included here ?
https://www.postgresql.org/docs/devel/indexes-partial.html

-- 
Justin



Re: Partial index creation always scans the entire table

From
Tom Lane
Date:
"Mike Sofen" <msofen@runbox.com> writes:
>> From: Tom Lane <tgl@sss.pgh.pa.us>   Sent: Sunday, February 16, 2020 7:30 AM
>>> I've seen people try to do this before.  I wonder if the manual page about
>>> partial indexes should explicitly say "don't do that".

> Yes please (seriously).  The utter beauty of Postgres is the flexibility and
> power that its evolutionary path has allowed/created.  The tragic danger is
> that the beauty is fairly easy to misapply/misuse.  Caveats in the
> documentation would be very beneficial to both seasoned practitioners and
> newcomers - it is quite challenging to keep up with everything Postgres and
> the documentation is where most of us turn for guidance.

OK, so how about something like this added to section 11.8
(no pretty markup as yet):

Example 11.4.  Do Not use Partial Indexes as a Substitute for Partitioning

You might be tempted to create a large set of non-overlapping partial
indexes, for example

    CREATE INDEX mytable_cat_1 ON mytable (data) WHERE category = 1;
    CREATE INDEX mytable_cat_2 ON mytable (data) WHERE category = 2;
    CREATE INDEX mytable_cat_3 ON mytable (data) WHERE category = 3;
    ...

This is a bad idea!  Almost certainly, you'll be better off with a single
non-partial index, declared like

    CREATE INDEX mytable_cat_data ON mytable (category, data);

(Put the category column first, for the reasons described in section 11.3
Multicolumn Indexes.)  While a search in this larger index might have to
descend through a couple more tree levels than a search in a smaller
index, that's almost certainly going to be cheaper than the planner effort
needed to select the appropriate one of the partial indexes.  The core of
the problem is that the system does not understand the relationship among
the partial indexes, and will laboriously test each one to see if it's
applicable to the current query.

If your table is large enough that a single index really is a bad idea,
you should look into using partitioning instead (section whatever-it-is).
With that mechanism, the system does understand that the tables and
indexes are non-overlapping, so much better performance is possible.

            regards, tom lane