Thread: build multiple indexes in single table pass?
From the "idle thoughts in the middle of the night" department: I don't know if this has come up before exactly, but is it possible that we could get a performance gain from building multiple indexes from a single sequential pass over the base table? If so, that would probably give us a potential performance improvement in pg_restore quite apart from the projected improvement to be got from running several steps in parallel processes. The grammar might look a bit ugly, but I'm sure we could finesse that. cheers andrew
On Tue, Apr 1, 2008 at 5:51 PM, Andrew Dunstan <andrew@dunslane.net> wrote: > > From the "idle thoughts in the middle of the night" department: > > I don't know if this has come up before exactly, but is it possible that > we could get a performance gain from building multiple indexes from a > single sequential pass over the base table? http://archives.postgresql.org/pgsql-performance/2008-02/msg00236.php IMHO it should be possible to extend the grammar to add multiple indexes in one go. But the current index build itself looks very tightly integrated with the heap scan. So it might be tricky to separate out the scan and the index building activity. Thanks, Pavan -- Pavan Deolasee EnterpriseDB http://www.enterprisedb.com
* Andrew Dunstan <andrew@dunslane.net> [080401 08:22]: > > From the "idle thoughts in the middle of the night" department: > > I don't know if this has come up before exactly, but is it possible that > we could get a performance gain from building multiple indexes from a > single sequential pass over the base table? If so, that would probably > give us a potential performance improvement in pg_restore quite apart > from the projected improvement to be got from running several steps in > parallel processes. The grammar might look a bit ugly, but I'm sure we > could finesse that. I've not looked at any of the code, but would the "synchronized scans" heap machinery help the multiple index creations walk the heap together, basically giving you this for free (as long as you start concurrent index creation)? a. -- Aidan Van Dyk Create like a god, aidan@highrise.ca command like a king, http://www.highrise.ca/ work like a slave.
Andrew Dunstan wrote: > I don't know if this has come up before exactly, but is it possible that > we could get a performance gain from building multiple indexes from a > single sequential pass over the base table? It is already implemented in pg_bulkload (http://pgbulkload.projects.postgresql.org/). Index tuples of multiple indexes are spooled during the single sequential pass over the base table, and the spooled index tuples are built up after all of the base table is scanned. A proposal was submitted by Itagaki-san to integrate this feature into core. see http://archives.postgresql.org/pgsql-hackers/2008-02/msg00811.php . -- Toru SHIMOGAKI<shimogaki.toru@oss.ntt.co.jp> NTT Open Source Software Center
Aidan Van Dyk wrote: > * Andrew Dunstan <andrew@dunslane.net> [080401 08:22]: > >> From the "idle thoughts in the middle of the night" department: >> >> I don't know if this has come up before exactly, but is it possible that >> we could get a performance gain from building multiple indexes from a >> single sequential pass over the base table? If so, that would probably >> give us a potential performance improvement in pg_restore quite apart >> from the projected improvement to be got from running several steps in >> parallel processes. The grammar might look a bit ugly, but I'm sure we >> could finesse that. >> > > I've not looked at any of the code, but would the "synchronized scans" > heap machinery help the multiple index creations walk the heap together, > basically giving you this for free (as long as you start concurrent > index creation)? > > > Good question. Might it also help in that case to have pg_dump output indexes in a given schema sorted by <tablename, indexname> rather than just <indexname>? cheers andrew
Aidan Van Dyk <aidan@highrise.ca> writes: > * Andrew Dunstan <andrew@dunslane.net> [080401 08:22]: >> I don't know if this has come up before exactly, but is it possible that >> we could get a performance gain from building multiple indexes from a >> single sequential pass over the base table? > I've not looked at any of the code, but would the "synchronized scans" > heap machinery help the multiple index creations walk the heap together, > basically giving you this for free (as long as you start concurrent > index creation)? Yeah, that should Just Work AFAICS. Note also that this approach would let you put multiple CPUs to work on the problem, whereas anything involving stuffing multiple index creations into a single command won't. regards, tom lane
Andrew Dunstan wrote: > > From the "idle thoughts in the middle of the night" department: > > I don't know if this has come up before exactly, but is it possible that > we could get a performance gain from building multiple indexes from a > single sequential pass over the base table? If so, that would probably > give us a potential performance improvement in pg_restore quite apart > from the projected improvement to be got from running several steps in > parallel processes. The grammar might look a bit ugly, but I'm sure we > could finesse that. TODO already has: * Allow multiple indexes to be created concurrently, ideally via a single heap scan, and have pg_restore useit -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
Toru SHIMOGAKI <shimogaki.toru@oss.ntt.co.jp> wrote: > Andrew Dunstan wrote: > > we could get a performance gain from building multiple indexes from a > > single sequential pass over the base table? > > It is already implemented in pg_bulkload > (http://pgbulkload.projects.postgresql.org/). I think there are two ways to implement multiple index creation. 1. Add multiple indexes AFTER data loading. 2. Define multipleindexes BEFORE data loading. pg_bulkload uses the 2nd way, but the TODO item seems to target the 1st, right? -- Both are useful, though. | Allow multiple indexes to be created concurrently, ideally via a | single heap scan, and have pg_restore use it In either case, we probably need to renovate ambuild interface. I'm thinking to reverse the control of heap sequential scans; Seq scan is done in ambuild for now, but it will be controlled in an external loop in the new method. Define a new IndexBulder interface, something like: interface IndexBuilder { addTuple(IndexTuple tuple); finishBuild(); } and make ambuild() to return an IndexBuilder instance implemented in each AM. However, it cannot use multiple CPUs if indexes are built in one process. A less granular method might be better for Postgres, like synchronized scans, as already pointed out. Regards, --- ITAGAKI Takahiro NTT Open Source Software Center
On Tue, 1 Apr 2008, Andrew Dunstan wrote: > I don't know if this has come up before exactly, but is it possible that we > could get a performance gain from building multiple indexes from a single > sequential pass over the base table? It pops up regularly, you might even have walked by a discussion of this idea with myself, Jan, and Jignesh over the weekend. Jignesh pointed out that index creation was a major drag on his PostgreSQL benchmarking operations and I've run into that myself. I have a large dataset and creating a simple index takes around 70% of the time it takes to load the data in the first place, his multiple index tables took multiples of load time to index. Considering that the bulk load isn't exactly speedy either this gives you an idea how much room for improvement there is. The idea we were bouncing around went a step past that and considered this: if you have good statistics on a table, and you have a sample set of queries you want to execute against it, how would you use that information to plan what indexes should be created? Needing to be able to create multiple indexes at once efficiently was an implementation detail to pull that off. -- * Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
On Apr 1, 2008, at 10:03 PM, Greg Smith wrote: > The idea we were bouncing around went a step past that and > considered this: if you have good statistics on a table, and you > have a sample set of queries you want to execute against it, how > would you use that information to plan what indexes should be > created? Needing to be able to create multiple indexes at once > efficiently was an implementation detail to pull that off. Someone at EnterpriseDB (Pavan?) did work on that. I don't know what the status of that effort is. -- Decibel!, aka Jim C. Nasby, Database Architect decibel@decibel.org Give your computer some brain candy! www.distributed.net Team #1828