Thread: build multiple indexes in single table pass?

build multiple indexes in single table pass?

From
Andrew Dunstan
Date:
From the "idle thoughts in the middle of the night" department:

I don't know if this has come up before exactly, but is it possible that 
we could get a performance gain from building multiple indexes from a 
single sequential pass over the base table? If so, that would probably 
give us  a potential performance improvement in pg_restore quite apart 
from the projected improvement to be got from running several steps in 
parallel processes. The grammar might look a bit ugly, but I'm sure we 
could finesse that.

cheers

andrew




Re: build multiple indexes in single table pass?

From
"Pavan Deolasee"
Date:
On Tue, Apr 1, 2008 at 5:51 PM, Andrew Dunstan <andrew@dunslane.net> wrote:
>
>   From the "idle thoughts in the middle of the night" department:
>
>  I don't know if this has come up before exactly, but is it possible that
>  we could get a performance gain from building multiple indexes from a
>  single sequential pass over the base table?

http://archives.postgresql.org/pgsql-performance/2008-02/msg00236.php

IMHO it should be possible to extend the grammar  to add
multiple indexes in one go. But the current index build itself looks
very tightly integrated with the heap scan. So it might be tricky to
separate out the scan and the index building activity.

Thanks,
Pavan


-- 
Pavan Deolasee
EnterpriseDB http://www.enterprisedb.com


Re: build multiple indexes in single table pass?

From
Aidan Van Dyk
Date:
* Andrew Dunstan <andrew@dunslane.net> [080401 08:22]:
> 
> From the "idle thoughts in the middle of the night" department:
> 
> I don't know if this has come up before exactly, but is it possible that 
> we could get a performance gain from building multiple indexes from a 
> single sequential pass over the base table? If so, that would probably 
> give us  a potential performance improvement in pg_restore quite apart 
> from the projected improvement to be got from running several steps in 
> parallel processes. The grammar might look a bit ugly, but I'm sure we 
> could finesse that.

I've not looked at any of the code, but would the "synchronized scans"
heap machinery help the multiple index creations walk the heap together,
basically giving you this for free (as long as you start concurrent
index creation)?

a.

-- 
Aidan Van Dyk                                             Create like a god,
aidan@highrise.ca                                       command like a king,
http://www.highrise.ca/                                   work like a slave.

Re: build multiple indexes in single table pass?

From
Toru SHIMOGAKI
Date:
Andrew Dunstan wrote:

> I don't know if this has come up before exactly, but is it possible that 
> we could get a performance gain from building multiple indexes from a 
> single sequential pass over the base table?

It is already implemented in pg_bulkload 
(http://pgbulkload.projects.postgresql.org/). Index tuples of multiple indexes 
are spooled during the single sequential pass over the base table, and the 
spooled index tuples are built up after all of the base table is scanned.

A proposal was submitted by Itagaki-san to integrate this feature into core.
see http://archives.postgresql.org/pgsql-hackers/2008-02/msg00811.php .

-- 
Toru SHIMOGAKI<shimogaki.toru@oss.ntt.co.jp>
NTT Open Source Software Center



Re: build multiple indexes in single table pass?

From
Andrew Dunstan
Date:

Aidan Van Dyk wrote:
> * Andrew Dunstan <andrew@dunslane.net> [080401 08:22]:
>   
>> From the "idle thoughts in the middle of the night" department:
>>
>> I don't know if this has come up before exactly, but is it possible that 
>> we could get a performance gain from building multiple indexes from a 
>> single sequential pass over the base table? If so, that would probably 
>> give us  a potential performance improvement in pg_restore quite apart 
>> from the projected improvement to be got from running several steps in 
>> parallel processes. The grammar might look a bit ugly, but I'm sure we 
>> could finesse that.
>>     
>
> I've not looked at any of the code, but would the "synchronized scans"
> heap machinery help the multiple index creations walk the heap together,
> basically giving you this for free (as long as you start concurrent
> index creation)?
>
>
>   

Good question. Might it also help in that case to have pg_dump output 
indexes in a given schema sorted by <tablename, indexname> rather than 
just <indexname>?

cheers

andrew


Re: build multiple indexes in single table pass?

From
Tom Lane
Date:
Aidan Van Dyk <aidan@highrise.ca> writes:
> * Andrew Dunstan <andrew@dunslane.net> [080401 08:22]:
>> I don't know if this has come up before exactly, but is it possible that 
>> we could get a performance gain from building multiple indexes from a 
>> single sequential pass over the base table?

> I've not looked at any of the code, but would the "synchronized scans"
> heap machinery help the multiple index creations walk the heap together,
> basically giving you this for free (as long as you start concurrent
> index creation)?

Yeah, that should Just Work AFAICS.  Note also that this approach would
let you put multiple CPUs to work on the problem, whereas anything
involving stuffing multiple index creations into a single command
won't.
        regards, tom lane


Re: build multiple indexes in single table pass?

From
Bruce Momjian
Date:
Andrew Dunstan wrote:
> 
>  From the "idle thoughts in the middle of the night" department:
> 
> I don't know if this has come up before exactly, but is it possible that 
> we could get a performance gain from building multiple indexes from a 
> single sequential pass over the base table? If so, that would probably 
> give us  a potential performance improvement in pg_restore quite apart 
> from the projected improvement to be got from running several steps in 
> parallel processes. The grammar might look a bit ugly, but I'm sure we 
> could finesse that.

TODO already has:
       * Allow multiple indexes to be created concurrently, ideally via a         single heap scan, and have pg_restore
useit
 

--  Bruce Momjian  <bruce@momjian.us>        http://momjian.us EnterpriseDB
http://enterprisedb.com
 + If your life is a hard drive, Christ can be your backup. +


Re: build multiple indexes in single table pass?

From
ITAGAKI Takahiro
Date:
Toru SHIMOGAKI <shimogaki.toru@oss.ntt.co.jp> wrote:

> Andrew Dunstan wrote:
> > we could get a performance gain from building multiple indexes from a 
> > single sequential pass over the base table?
> 
> It is already implemented in pg_bulkload 
> (http://pgbulkload.projects.postgresql.org/).

I think there are two ways to implement multiple index creation. 1. Add multiple indexes AFTER data loading. 2. Define
multipleindexes BEFORE data loading.
 

pg_bulkload uses the 2nd way, but the TODO item seems to target
the 1st, right? -- Both are useful, though.

| Allow multiple indexes to be created concurrently, ideally via a
| single heap scan, and have pg_restore use it

In either case, we probably need to renovate ambuild interface.
I'm thinking to reverse the control of heap sequential scans;
Seq scan is done in ambuild for now, but it will be controlled in
an external loop in the new method.

Define a new IndexBulder interface, something like:   interface IndexBuilder   {      addTuple(IndexTuple tuple);
finishBuild();  }
 
and make ambuild() to return an IndexBuilder instance implemented in each AM.

However, it cannot use multiple CPUs if indexes are built in one process.
A less granular method might be better for Postgres, like synchronized scans,
as already pointed out.

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center




Re: build multiple indexes in single table pass?

From
Greg Smith
Date:
On Tue, 1 Apr 2008, Andrew Dunstan wrote:

> I don't know if this has come up before exactly, but is it possible that we 
> could get a performance gain from building multiple indexes from a single 
> sequential pass over the base table?

It pops up regularly, you might even have walked by a discussion of this 
idea with myself, Jan, and Jignesh over the weekend.  Jignesh pointed out 
that index creation was a major drag on his PostgreSQL benchmarking 
operations and I've run into that myself.  I have a large dataset and 
creating a simple index takes around 70% of the time it takes to load the 
data in the first place, his multiple index tables took multiples of load 
time to index.  Considering that the bulk load isn't exactly speedy either 
this gives you an idea how much room for improvement there is.

The idea we were bouncing around went a step past that and considered 
this:  if you have good statistics on a table, and you have a sample set 
of queries you want to execute against it, how would you use that 
information to plan what indexes should be created?  Needing to be able to 
create multiple indexes at once efficiently was an implementation detail 
to pull that off.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD


Re: build multiple indexes in single table pass?

From
Decibel!
Date:
On Apr 1, 2008, at 10:03 PM, Greg Smith wrote:
> The idea we were bouncing around went a step past that and  
> considered this:  if you have good statistics on a table, and you  
> have a sample set of queries you want to execute against it, how  
> would you use that information to plan what indexes should be  
> created?  Needing to be able to create multiple indexes at once  
> efficiently was an implementation detail to pull that off.


Someone at EnterpriseDB (Pavan?) did work on that. I don't know what  
the status of that effort is.
-- 
Decibel!, aka Jim C. Nasby, Database Architect  decibel@decibel.org
Give your computer some brain candy! www.distributed.net Team #1828