Thread: Will partial index creation use existing index?

Will partial index creation use existing index?

From
Steve Crawford
Date:
Does PostgreSQL use an existing index, if possible, when creating a
partial index?

By way of background, we have some nightly bulk processing that includes
a couple of 15-30 million row tables. Most of the processing is only
looking at prior-day data (up to ~200,000 rows) and for efficiency
requires several indexes. Except for this one process, the indexes are
useless and I'd rather not constantly maintain them.

There is an index on the timestamp column so I have considered creating
the indexes on a temporary basis with something like:
create index foo_bar on foo (bar)
  where timestamp_col > current_date - interval '1 day';

(Yes this is simplified, I am aware of the Daylight Saving Time
off-by-an-hour implications.)

It seems that creating this partial index would be more efficient if the
existing index were used but "explain create index..." just gives me an
error and the query seems to run way too long to be processing only the
one day data. For comparison, on a relatively large 225,000 row day I
can create temporary table ondeay... on the same criteria and create 10
indexes and analyze the table in well under 10 seconds which is way
faster than creating even a single partial index on the full table.

Cheers,
Steve


Re: Will partial index creation use existing index?

From
Jim Nasby
Date:
On Jul 18, 2007, at 2:16 PM, Steve Crawford wrote:
> Does PostgreSQL use an existing index, if possible, when creating a
> partial index?
>
> By way of background, we have some nightly bulk processing that
> includes
> a couple of 15-30 million row tables. Most of the processing is only
> looking at prior-day data (up to ~200,000 rows) and for efficiency
> requires several indexes. Except for this one process, the indexes are
> useless and I'd rather not constantly maintain them.
>
> There is an index on the timestamp column so I have considered
> creating
> the indexes on a temporary basis with something like:
> create index foo_bar on foo (bar)
>   where timestamp_col > current_date - interval '1 day';
>
> (Yes this is simplified, I am aware of the Daylight Saving Time
> off-by-an-hour implications.)
>
> It seems that creating this partial index would be more efficient
> if the
> existing index were used but "explain create index..." just gives
> me an
> error and the query seems to run way too long to be processing only
> the
> one day data. For comparison, on a relatively large 225,000 row day I
> can create temporary table ondeay... on the same criteria and
> create 10
> indexes and analyze the table in well under 10 seconds which is way
> faster than creating even a single partial index on the full table.

Check the source code, but I'm 99% certain that CREATE INDEX doesn't
consider any existing indexes. While what you're describing is
theoretically possible, it's not a very common use-case, so it's
rather unlikely to get worked on unless other folks show up with
*real life* examples of where this would be useful.

You might also want to consider partitioning the table.
--
Jim Nasby                                            jim@nasby.net
EnterpriseDB      http://enterprisedb.com      512.569.9461 (cell)



Re: Will partial index creation use existing index?

From
Gregory Stark
Date:
"Jim Nasby" <decibel@decibel.org> writes:

> Check the source code, but I'm 99% certain that CREATE INDEX doesn't consider
> any existing indexes. While what you're describing is  theoretically possible,
> it's not a very common use-case, so it's  rather unlikely to get worked on
> unless other folks show up with  *real life* examples of where this would be
> useful.

Yeah we don't support this feature. It is something useful that I even though
of doing a while back. I think it's on the TODO somewhere.

The main use case for it is actually REINDEX. Since you already have an index
which contains precisely the records you want to index and already in order
too.

The main disadvantage is that it's not clear when it would actually be faster.
Generally index scans are slower than reading the whole table and sorting.
Probably it would have to run an SPI query to use the planner to find the best
way to get the rows it wants.

Another problem is that presumably you're reindexing because the existing
index *isn't* in such good shape. You may even be doing it because the
existing index is corrupt.

--
  Gregory Stark
  EnterpriseDB          http://www.enterprisedb.com


Re: Will partial index creation use existing index?

From
Ted Byers
Date:
I don't know about partial indexing, but in a similar situation to what you describe, I have resorted to special tables designed specifically to hold one day's worth of data and to support our daily processing.  I'd put the daily data into specific tables, with all the requisite indeces, and then, as part of the daily processing and after it has been processed, copy the day's data into the main tables and empty the tables used to hold the data temporarily.  It may not be elegant or pretty, but it serves to greatly simplify processing since you know a priori that these tables content only the data you need to process, and you can be certain that eventually it finds its way into the main data tables.  And you have your indeces only on the tables used for daily processing, rather than on your main tables.  An additional concern I had was that if the process for getting the data is vulnerable to error (and in my case it was), you can apply your verification procedures to it before bad data gets into your main tables, thereby reducing the risk of bad data getting into the database. 
 
I though this might be an option for you if you have trouble getting your partial indeces to work well for you.
 
HTH
 
Ted

Steve Crawford <scrawford@pinpointresearch.com> wrote:
Does PostgreSQL use an existing index, if possible, when creating a
partial index?

By way of background, we have some nightly bulk processing that includes
a couple of 15-30 million row tables. Most of the processing is only
looking at prior-day data (up to ~200,000 rows) and for efficiency
requires several indexes. Except for this one process, the indexes are
useless and I'd rather not constantly maintain them.

Re: Will partial index creation use existing index?

From
Ron Johnson
Date:
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 07/24/07 14:48, Gregory Stark wrote:
[snip]
>
> The main use case for it is actually REINDEX. Since you already have an index
> which contains precisely the records you want to index and already in order
> too.
>
> The main disadvantage is that it's not clear when it would actually be faster.
> Generally index scans are slower than reading the whole table and sorting.
> Probably it would have to run an SPI query to use the planner to find the best
> way to get the rows it wants.

I believe you, but it's totally counter to prima-fascia logic.

Scanning the whole table means that you have to read in a whole
bunch of columns that you don't really give a rat's arse about, and
thus is a waste, whereas directly reading an existing index means
that you've got perfect locality of data, since you're only reading
what you care about.

> Another problem is that presumably you're reindexing because the existing
> index *isn't* in such good shape. You may even be doing it because the
> existing index is corrupt.

That, of course, is an excellent point.

- --
Ron Johnson, Jr.
Jefferson LA  USA

Give a man a fish, and he eats for a day.
Hit him with a fish, and he goes away for good!

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.6 (GNU/Linux)

iD8DBQFGpqHkS9HxQb37XmcRAu93AKC04YXUWvMI6YiLfBNoy2BYtQw28ACdHqE/
kVqHiPwBONv0Tudy5OnA/SE=
=Fbuw
-----END PGP SIGNATURE-----

Re: Will partial index creation use existing index?

From
Gregory Stark
Date:
"Ron Johnson" <ron.l.johnson@cox.net> writes:

>> The main disadvantage is that it's not clear when it would actually be faster.
>> Generally index scans are slower than reading the whole table and sorting.
>> Probably it would have to run an SPI query to use the planner to find the best
>> way to get the rows it wants.
>
> I believe you, but it's totally counter to prima-fascia logic.
>
> Scanning the whole table means that you have to read in a whole
> bunch of columns that you don't really give a rat's arse about, and
> thus is a waste, whereas directly reading an existing index means
> that you've got perfect locality of data, since you're only reading
> what you care about.

Well you still have to read the table either way. The main difference is
you'll end up reading it in a random access order which is slower and also
means potentially reading parts of it many times over before you're done.

--
  Gregory Stark
  EnterpriseDB          http://www.enterprisedb.com