Thread: Using indexes for partial index builds

Using indexes for partial index builds

From
Paul Norman
Date:
Hello,
After a discussion on IRC in #postgresql, I had a feature suggestion and it
was suggested I write it up here.

I have a large (200GB, 1.7b rows) table with a number of columns, but the
two of interest here are a hstore column, tags and a postgis geometry
column, geom. 

There is a GIN index on tags and a gist index on geom. These took about
36-48 hours to build in total. Obviously index building on a table this size
is not trivial.

Periodically I want to do a number of specialized queries on objects with a
particular tag or in a particular area. To do this I often want to create a
partial index. For example, I created the index btree ((tags ->
'name_1'::text) text_pattern_ops) WHERE tags ? 'name_1'::text. My
understanding is to create this index PostgreSQL does a scan of the entire
table, even though the GIN index on tags could be used to identify which
rows could belong in the index. Where the WHERE condition selects only a
small portion of the table this is scanning a lot more data than is
necessary.

Another case where it would be useful is when I am conducting a detailed
analysis of some aspect of the rows in a particular city. This leads to all
the queries being of the form SELECT ... FROM ... WHERE
is_in_my_area(geom)[1]. 

My current project is doing analysis involving addresses. The ability to
create an index like btree((tags -> 'addr:housenumber'), (tags ->
'addr:street'), (tags -> 'addr:city')) WHERE is_in_my_area(geom) in a
reasonable time would allow me to use a view instead of copying the local
area to a temporary table and indexing that table. The local area is about
350k rows, or about 0.02% of the database.

[1] The actual function for determining if it's in my area is long and not
really essential to the point here.





Re: Using indexes for partial index builds

From
Jim Nasby
Date:
On 2/2/13 4:05 AM, Paul Norman wrote:
> Hello,
> After a discussion on IRC in #postgresql, I had a feature suggestion and it
> was suggested I write it up here.
>
> I have a large (200GB, 1.7b rows) table with a number of columns, but the
> two of interest here are a hstore column, tags and a postgis geometry
> column, geom.
>
> There is a GIN index on tags and a gist index on geom. These took about
> 36-48 hours to build in total. Obviously index building on a table this size
> is not trivial.
>
> Periodically I want to do a number of specialized queries on objects with a
> particular tag or in a particular area. To do this I often want to create a
> partial index. For example, I created the index btree ((tags ->
> 'name_1'::text) text_pattern_ops) WHERE tags ? 'name_1'::text. My
> understanding is to create this index PostgreSQL does a scan of the entire
> table, even though the GIN index on tags could be used to identify which
> rows could belong in the index. Where the WHERE condition selects only a
> small portion of the table this is scanning a lot more data than is
> necessary.
>
> Another case where it would be useful is when I am conducting a detailed
> analysis of some aspect of the rows in a particular city. This leads to all
> the queries being of the form SELECT ... FROM ... WHERE
> is_in_my_area(geom)[1].
>
> My current project is doing analysis involving addresses. The ability to
> create an index like btree((tags -> 'addr:housenumber'), (tags ->
> 'addr:street'), (tags -> 'addr:city')) WHERE is_in_my_area(geom) in a
> reasonable time would allow me to use a view instead of copying the local
> area to a temporary table and indexing that table. The local area is about
> 350k rows, or about 0.02% of the database.
>
> [1] The actual function for determining if it's in my area is long and not
> really essential to the point here.

Something worth considering on this... I suspect it's possible to use an index-only scan to do this, regardless of
whetherthe heap page is all visible. The reason is that the newly created index would just use the same access
methodologyas the original index, so any dead rows would be ignored.
 

We'd almost certainly need to block vacuums during the build however. Obviously not an issue for regular index builds,
butit would be for concurrent ones.
 



Re: Using indexes for partial index builds

From
Greg Stark
Date:
On Thu, Mar 7, 2013 at 12:51 AM, Jim Nasby <jim@nasby.net> wrote:
> Something worth considering on this... I suspect it's possible to use an
> index-only scan to do this, regardless of whether the heap page is all
> visible. The reason is that the newly created index would just use the same
> access methodology as the original index, so any dead rows would be ignored.

This is actually quite clever. I wonder how many other cases can use
similar logic.

> We'd almost certainly need to block vacuums during the build however.
> Obviously not an issue for regular index builds, but it would be for
> concurrent ones.

Concurrent index builds block vacuums already.



-- 
greg



Re: Using indexes for partial index builds

From
Ants Aasma
Date:
On Mon, Mar 11, 2013 at 9:13 PM, Greg Stark <stark@mit.edu> wrote:
> On Thu, Mar 7, 2013 at 12:51 AM, Jim Nasby <jim@nasby.net> wrote:
>> Something worth considering on this... I suspect it's possible to use an
>> index-only scan to do this, regardless of whether the heap page is all
>> visible. The reason is that the newly created index would just use the same
>> access methodology as the original index, so any dead rows would be ignored.
>
> This is actually quite clever. I wonder how many other cases can use
> similar logic.

I actually just dealt with a case where this would have been helpful.
The case is finding rows from a huge table where a foreign key
reference matches one from a list of IDs and the last change date is
larger than some specific value. The best plan for this is to build
bitmaps for both conditions. The performance issue is that the bitmap
for the IDs can get pretty large and expensive to construct. Solution
for that is to keep a partial index on the foreign key predicated on a
recent timestamp covering most queries, removing 99% of tuples from
consideration. This index needs to be periodically replaced with one
that has a newer timestamp. A full table scan could be avoided if the
index could be built using the previous partial index. Now this was on
9.1, so I didn't consider index only, but on 9.2+ I would add the
timestamp to the foreign key index, then the new index could be
constructed using an index only scan.

I have a feeling this is an increasingly widespread pattern with a
proliferation of mobile devices that need syncing.

Ants Aasma
--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de



Re: Using indexes for partial index builds

From
Jim Nasby
Date:
On 3/12/13 9:10 AM, Ants Aasma wrote:
> I have a feeling this is an increasingly widespread pattern with a
> proliferation of mobile devices that need syncing.

If you're doing that with timestamps you're asking for a slew of problems, not all of which can be solved by just
addingsome random amount of fluff to your criteria. A queue-based solution is often a more robust solution, even if it
isharder to implement.
 



Re: Using indexes for partial index builds

From
Ants Aasma
Date:
On Thu, Mar 14, 2013 at 1:51 AM, Jim Nasby <jim@nasby.net> wrote:
> On 3/12/13 9:10 AM, Ants Aasma wrote:
>> I have a feeling this is an increasingly widespread pattern with a
>> proliferation of mobile devices that need syncing.
>
> If you're doing that with timestamps you're asking for a slew of problems,
> not all of which can be solved by just adding some random amount of fluff to
> your criteria. A queue-based solution is often a more robust solution, even
> if it is harder to implement.

Do you know of anything else besides the obvious issues with having to
use one clocksource and ensure that it produces monotonic timestamps?
My first reaction was also that this is what queues are meant for, but
the proposed solution seems to work surprisingly well. Unless you can
point at some glaring hole that I'm missing I would say that it is
good enough for a rather wide range of syncing problems.

Ants Aasma
--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de



Re: Using indexes for partial index builds

From
Jim Nasby
Date:
On 3/13/13 7:10 PM, Ants Aasma wrote:
> On Thu, Mar 14, 2013 at 1:51 AM, Jim Nasby <jim@nasby.net> wrote:
>> On 3/12/13 9:10 AM, Ants Aasma wrote:
>>> I have a feeling this is an increasingly widespread pattern with a
>>> proliferation of mobile devices that need syncing.
>>
>> If you're doing that with timestamps you're asking for a slew of problems,
>> not all of which can be solved by just adding some random amount of fluff to
>> your criteria. A queue-based solution is often a more robust solution, even
>> if it is harder to implement.
>
> Do you know of anything else besides the obvious issues with having to
> use one clocksource and ensure that it produces monotonic timestamps?

Those issues aren't enough? :)

> My first reaction was also that this is what queues are meant for, but
> the proposed solution seems to work surprisingly well. Unless you can
> point at some glaring hole that I'm missing I would say that it is
> good enough for a rather wide range of syncing problems.

It depends on how critical it is not to miss events. Timestamps in tables are always taken before transaction commit,
soyou can sometimes have a significant delay. You have to make certain the timestamp can't be changed, and that rows
can'tbe deleted. It's also tricky to make certain you don't see any events twice.
 

Given all that, and how easy PgQ is to use, I don't understand why anyone would go with timestamps...
-- 
Jim C. Nasby, Data Architect                       jim@nasby.net
512.569.9461 (cell)                         http://jim.nasby.net