Thread: Online index builds
I just sent in the patch for online index builds to -patches. . The work to combine the two phases into a single non-transactional command is done. I'm not sure how long to wait betweenlock checks or how verbose to be about why it's taking so long. I do think we have to print something or else theDBA won't know if it's hung waiting for something external. Currently it prints a notice the first time it sleeps. . Also it prints out how many tuples it found which normal index doesn't. Probably that message should go away. On the otherhand the index stats probably need to be filled in. . I need to check what locks I'm taking. I think I still have some old code with the wrong locks in it. . this includes the tid btree opclass sent earlier (with a warning I didn't notice before fixed up). opr_sanity now failsbut I think that's due to the gin commits not this opclass. . In case of an error during phase2 the invalid index is left in place. It can be dropped with DROP INDEX. The footwork toget it dropped in case of an error would be quite tricky but there's a sketch of how to do it in the source. . no documentation yet, there's not much to write though. . no regression tests yet. I don't see any way to test this reasonably in the regression tests. I've done some testing myselfby building indexes while pgbench is running. Then I have to do index scans to see how many records are returned withindex scans. It wouldn't be easy to automate and even if it were done it wouldn't really be all that great a test. Thecorner cases found during the development are pretty narrow and will be hard to reliably test. -- greg
On Wed, 2006-07-12 at 12:09 -0400, Greg Stark wrote: > no regression tests yet. We'll need some performance tests that show that lock-hold time is *actually* reduced, given the shenanigans needed to get there. We may need to have usage recommendations in the docs. -- Simon Riggs EnterpriseDB http://www.enterprisedb.com
Simon Riggs <simon@2ndquadrant.com> writes: > On Wed, 2006-07-12 at 12:09 -0400, Greg Stark wrote: >> no regression tests yet. > We'll need some performance tests that show that lock-hold time is > *actually* reduced, given the shenanigans needed to get there. Reducing lock hold time is not the point ... reducing the strength of the lock at the cost of increased elapsed time is the point. > We may need to have usage recommendations in the docs. Agreed. regards, tom lane
Simon Riggs <simon@2ndquadrant.com> writes: > On Wed, 2006-07-12 at 12:09 -0400, Greg Stark wrote: > > no regression tests yet. > > We'll need some performance tests that show that lock-hold time is > *actually* reduced, given the shenanigans needed to get there. I'm not sure what you mean by "lock-hold time". Online index builds effectively take *no* locks in the user-visible sense that regular index builds do. Other transactions can insert, update, delete continuously throughout the entire process. The only locks that are taken are 1) a ShareUpdateExclusiveLock which blocks vacuum from running on the table being indexed. This is taken by both phase 1and phase 2. (Actually I had the wrong lock in the patch I emailed in one place. Fixed in my source tree here) 2) An ExclusiveLock that is taken momentarily and immediately released. Even if that can never be acquired due to a busysystem it can eventually proceed anyways as long as there are no long-running transactions that are refusing to commit. That said we do need some performance tests to get an idea how long phase 2 takes for large tables. The additional index and heap scan and tid sort could take a substantial amount of time though never as long as the original index build done in phase 1. What's worse is that in some cases the merge could potentially be doing a lot of retail index inserts. I have no good intuition for how long those will take relative to the wholesale index build method, especially since for some index methods like GIN retail inserts are extremely expensive. So for indexes that don't have a lot of records that need to be inserted individually what I expect -- and what I put in the docs -- is something under 100% time penalty for an online index build. In fact I expect it to be more like 50% though it depends on how wide the original index. For ones that do have lots of records mutated for phase 2 all bets are off. > We may need to have usage recommendations in the docs. I'm writing docs now. I'm trying to find a happy medium between explaining all the issues and spamming the docs with lots of discussion. Right now what I have is a single paragraph in the create_index man page that refers to the Postgres manual where I list the issues in more depth. I also still have to get some kind of regression tests. I don't think we have any concurrent regression tests currently, do we? To thoroughly test it will be quite hard. Some of the corner cases are extremely narrow or require very particular types of transactions running with very specific timing. -- greg
Ühel kenal päeval, N, 2006-07-13 kell 01:07, kirjutas Greg Stark: > Simon Riggs <simon@2ndquadrant.com> writes: > > > On Wed, 2006-07-12 at 12:09 -0400, Greg Stark wrote: > > > no regression tests yet. > > > > We'll need some performance tests that show that lock-hold time is > > *actually* reduced, given the shenanigans needed to get there. > > I'm not sure what you mean by "lock-hold time". Online index builds > effectively take *no* locks in the user-visible sense that regular index > builds do. Other transactions can insert, update, delete continuously > throughout the entire process. > > The only locks that are taken are > > 1) a ShareUpdateExclusiveLock which blocks vacuum from running on the table > being indexed. This is taken by both phase 1 and phase 2. (Actually I had > the wrong lock in the patch I emailed in one place. Fixed in my source tree > here) > > 2) An ExclusiveLock that is taken momentarily and immediately released. Even > if that can never be acquired due to a busy system it can eventually > proceed anyways as long as there are no long-running transactions that are > refusing to commit. > > That said we do need some performance tests to get an idea how long phase 2 > takes for large tables. The additional index and heap scan and tid sort could > take a substantial amount of time though never as long as the original index > build done in phase 1. Maybe we can show progress indicators in status line (either pg_stat_activity.current_query or commandline shown in ps), like WAITING TO START PHASE 1 - WAITING FOR TRANSACTION XXX TO COMPLETE or INSERTING INDEX ENTRY N OF M changing every few seconds. that could give the sysadmin some idea of what is going on without too much verbosity on console. there could of course be a VERBOSE mode, which acts similar to VACUUM VERBOSE. And why not make t possible to add a verbosity level there as well: 'CREATE INDEX ... VEBOSE 3'. And to VACUUM VERBOSE as well At level 3 all status changes could also be sent to client as well. > What's worse is that in some cases the merge could potentially be doing a lot > of retail index inserts. I have no good intuition for how long those will take > relative to the wholesale index build method, especially since for some index > methods like GIN retail inserts are extremely expensive. > > So for indexes that don't have a lot of records that need to be inserted > individually what I expect -- and what I put in the docs -- is something under > 100% time penalty for an online index build. In fact I expect it to be more > like 50% though it depends on how wide the original index. For ones that do > have lots of records mutated for phase 2 all bets are off. the only quarantee seems to be, that if there are still some resources left, the index build will eventualy complete. but showing progress will let the DBA to make the decision to abort the build if he sees that it takes "too long". Another related thing - throttling ---------------------------------- Did you do any work on using vacuum_cost_* GUC vars to throttle the build process if desired ? I guess that the initial seqscans are probably cheap enough ( or at least they are no worse than if someone just did "select * ..." concurrently), but the index merge can probably still not be as light on OLTP queries as desirable. In reality on OLTP dbs even SELECT COUNT(*) can sometimes be a "mainenance operation" and thus the goal may not be to complete as fast as possible, but to be as light as possible on concurrent OLTP queries. Eventually it would nice to have special optimiser rules for any "maintenance" queries and DDL ops as defined by DBA, but making use of vacuum_cost_* vars for generic throttling would be a good first cut ; > -- ---------------- Hannu Krosing Database Architect Skype Technologies OÜ Akadeemia tee 21 F, Tallinn, 12618, Estonia Skype me: callto:hkrosing Get Skype for free: http://www.skype.com
Hannu Krosing <hannu@skype.net> writes: > Maybe we can show progress indicators in status line (either > pg_stat_activity.current_query or commandline shown in ps), like > > WAITING TO START PHASE 1 - WAITING FOR TRANSACTION XXX TO COMPLETE > > or > > INSERTING INDEX ENTRY N OF M > > changing every few seconds. Hm. That would be very interesting. I'll say that one of the things that impressed me very much with Postgres moving from Oracle was the focus on usability. Progress indicators would be excellent for a lot of operations. That said I'm not sure how much I can do here. For a substantial index we should expect most of the time will be spent in the tuplesort. It's hard to see how to get any sort of progress indicator out of there and as long as we can't it's hard to see the point of getting one during the heap scan or any of the other i/o operations. I think it does make sense to put something in current_query indicating when it's waiting for transactions to end and when it's past that point. That's something the DBA should be aware of. > And why not make t possible to add a verbosity level there as well: > 'CREATE INDEX ... VEBOSE 3'. And to VACUUM VERBOSE as well > > At level 3 all status changes could also be sent to client as well. Wouldn't you just control this with log_min_messages? It seems unnecessary to clutter the grammar for every command with "verbose" options. > Another related thing - throttling > ---------------------------------- > > Did you do any work on using vacuum_cost_* GUC vars to throttle the > build process if desired ? Actually no. While there is consensus that will be necessary I'm not sure I can do it with this patch. The problem is that most of the real heavy lifting here is done inside tuplesort. Even aside from that most of what's left is inside bulkdelete(*) and the code that handles regular index builds. So I think we'll need some global thinking about what options Postgres needs to control throttling in general. And probably someone needs to write a separate patch that adds all the hooks to the various places in a single go. Trying to throttle just one operation at a time when a lot of the code that implements these operations is shared will have us running in circles. (*) Hm. Come to think of it I wonder if the vacuum_cost parameters are already kicking in for this phase. That would be a bit strange since it's the fastest of the three scans. -- greg
> That said I'm not sure how much I can do here. For a substantial index we > should expect most of the time will be spent in the tuplesort. It's hard to > see how to get any sort of progress indicator out of there and as long as we > can't it's hard to see the point of getting one during the heap scan or any of > the other i/o operations. Well from a DBA perspective, just knowing that "something" productive is happening is useful. When using vacuum I almost always use vacuum verbose, just so I have an idea of what is going on. Joshua D. Drake -- === The PostgreSQL Company: Command Prompt, Inc. === Sales/Support: +1.503.667.4564 || 24x7/Emergency: +1.800.492.2240 Providing the most comprehensive PostgreSQL solutionssince 1997 http://www.commandprompt.com/
A great first step would be to add elog(INFO,...) in some standardized format over the wire so that clients can tell what's going on. It could be triggered by a GUC which is off by default. -M On Jul 15, 2006, at 9:10 PM, Greg Stark wrote: > > Hannu Krosing <hannu@skype.net> writes: > >> Maybe we can show progress indicators in status line (either >> pg_stat_activity.current_query or commandline shown in ps), like >> >> WAITING TO START PHASE 1 - WAITING FOR TRANSACTION XXX TO COMPLETE >> >> or >> >> INSERTING INDEX ENTRY N OF M >> >> changing every few seconds. > > Hm. That would be very interesting. I'll say that one of the things > that > impressed me very much with Postgres moving from Oracle was the focus > on > usability. Progress indicators would be excellent for a lot of > operations. > > That said I'm not sure how much I can do here. For a substantial index > we > should expect most of the time will be spent in the tuplesort. It's > hard to > see how to get any sort of progress indicator out of there and as long > as we > can't it's hard to see the point of getting one during the heap scan > or any of > the other i/o operations. > > I think it does make sense to put something in current_query > indicating when > it's waiting for transactions to end and when it's past that point. > That's > something the DBA should be aware of. ¬ ¬ ¬ ¬ ¬ ¬ ¬ ¬ ¬ ¬ ¬ ¬ ¬ ¬ ¬ ¬ ¬ ¬ AgentM agentm@themactionfaction.com ¬ ¬ ¬ ¬ ¬ ¬ ¬ ¬ ¬ ¬ ¬ ¬ ¬ ¬ ¬ ¬ ¬ ¬
On Sat, 2006-07-15 at 21:10 -0400, Greg Stark wrote: > Hannu Krosing <hannu@skype.net> writes: > > Another related thing - throttling > > ---------------------------------- > > > > Did you do any work on using vacuum_cost_* GUC vars to throttle the > > build process if desired ? > > Actually no. While there is consensus that will be necessary I'm not sure I > can do it with this patch. The problem is that most of the real heavy lifting > here is done inside tuplesort. Even aside from that most of what's left is > inside bulkdelete(*) and the code that handles regular index builds. > > So I think we'll need some global thinking about what options Postgres needs > to control throttling in general. And probably someone needs to write a > separate patch that adds all the hooks to the various places in a single go. > Trying to throttle just one operation at a time when a lot of the code that > implements these operations is shared will have us running in circles. > > (*) Hm. Come to think of it I wonder if the vacuum_cost parameters are already > kicking in for this phase. That would be a bit strange since it's the fastest > of the three scans. Separate patches sounds more sensible. Using statement_cost_* in many places sounds useful to me and not too hard to get into 8.2 -- Simon Riggs EnterpriseDB http://www.enterprisedb.com
Ühel kenal päeval, L, 2006-07-15 kell 21:10, kirjutas Greg Stark: > Hannu Krosing <hannu@skype.net> writes: > > > Maybe we can show progress indicators in status line (either > > pg_stat_activity.current_query or commandline shown in ps), like > > > > WAITING TO START PHASE 1 - WAITING FOR TRANSACTION XXX TO COMPLETE > > > > or > > > > INSERTING INDEX ENTRY N OF M > > > > changing every few seconds. > > Hm. That would be very interesting. I'll say that one of the things that > impressed me very much with Postgres moving from Oracle was the focus on > usability. Progress indicators would be excellent for a lot of operations. > > That said I'm not sure how much I can do here. For a substantial index we > should expect most of the time will be spent in the tuplesort. If index is substantially bigger than shared memory/available RAM then tuple by tuple index insertion part can also be quite slow and i/o consuming even if the number of entries to insert is low compared to total index size. That is where "INSERTING INDEX ENTRY N OF M" would be a thing which helps the DBA to estimate time to completion. > It's hard to > see how to get any sort of progress indicator out of there and as long as we > can't it's hard to see the point of getting one during the heap scan or any of > the other i/o operations. It would be useful to know, in what step we are in general , that is are we doing some useful work or just waiting for "something". -- ---------------- Hannu Krosing Database Architect Skype Technologies OÜ Akadeemia tee 21 F, Tallinn, 12618, Estonia Skype me: callto:hkrosing Get Skype for free: http://www.skype.com
On Saturday 15 July 2006 21:37, Joshua D. Drake wrote: > > That said I'm not sure how much I can do here. For a substantial index we > > should expect most of the time will be spent in the tuplesort. It's hard > > to see how to get any sort of progress indicator out of there and as long > > as we can't it's hard to see the point of getting one during the heap > > scan or any of the other i/o operations. > > Well from a DBA perspective, just knowing that "something" productive is > happening is useful. > > When using vacuum I almost always use vacuum verbose, just so I have an > idea of what is going on. > +1 -- Robert Treat Build A Brighter LAMP :: Linux Apache {middleware} PostgreSQL
On Sat, Jul 15, 2006 at 09:10:46PM -0400, Greg Stark wrote: > > Hannu Krosing <hannu@skype.net> writes: > > > Maybe we can show progress indicators in status line (either > > pg_stat_activity.current_query or commandline shown in ps), like > > > > WAITING TO START PHASE 1 - WAITING FOR TRANSACTION XXX TO COMPLETE > > > > or > > > > INSERTING INDEX ENTRY N OF M > > > > changing every few seconds. > > Hm. That would be very interesting. I'll say that one of the things that > impressed me very much with Postgres moving from Oracle was the focus on > usability. Progress indicators would be excellent for a lot of operations. > > That said I'm not sure how much I can do here. For a substantial index we > should expect most of the time will be spent in the tuplesort. It's hard to > see how to get any sort of progress indicator out of there and as long as we > can't it's hard to see the point of getting one during the heap scan or any of > the other i/o operations. I'd love to have any kind of progress indication for any sorts that spill to disk, and there's any number of other long-running operations where progress info would be very welcome. I certainly wouldn't let lack of a progress indicator for sorts prevent you from adding one. I like the idea of periodically updating both current_query and the commandline that ps shows. -- Jim C. Nasby, Sr. Engineering Consultant jnasby@pervasive.com Pervasive Software http://pervasive.com work: 512-231-6117 vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461
Consindering the syntax for this, we currently allow read access during index creation, just not write access, so I think the new syntax should be: CREATE [ UNIQUE ] INDEX name ON table [ USING method ] [ [ENABLE] WRITE [ACCESS] ] ( { column | ( expression ) }[ opclass ] [, ...] ) [ WITH ( storage_parameter = value [, ... ] ) ] [ TABLESPACE tablespace ] [ WHERE predicate] This is clear, and adds no new keywords. --------------------------------------------------------------------------- Greg Stark wrote: > > I just sent in the patch for online index builds to -patches. > > . The work to combine the two phases into a single non-transactional command > is done. I'm not sure how long to wait between lock checks or how verbose to > be about why it's taking so long. I do think we have to print something or > else the DBA won't know if it's hung waiting for something external. > Currently it prints a notice the first time it sleeps. > > . Also it prints out how many tuples it found which normal index doesn't. > Probably that message should go away. On the other hand the index stats > probably need to be filled in. > > . I need to check what locks I'm taking. I think I still have some old code > with the wrong locks in it. > > . this includes the tid btree opclass sent earlier (with a warning I didn't > notice before fixed up). opr_sanity now fails but I think that's due to the > gin commits not this opclass. > > . In case of an error during phase2 the invalid index is left in place. It can > be dropped with DROP INDEX. The footwork to get it dropped in case of an > error would be quite tricky but there's a sketch of how to do it in the source. > > . no documentation yet, there's not much to write though. > > . no regression tests yet. I don't see any way to test this reasonably in the > regression tests. I've done some testing myself by building indexes while > pgbench is running. Then I have to do index scans to see how many records > are returned with index scans. It wouldn't be easy to automate and even if > it were done it wouldn't really be all that great a test. The corner cases > found during the development are pretty narrow and will be hard to reliably > test. > > -- > greg > > > ---------------------------(end of broadcast)--------------------------- > TIP 5: don't forget to increase your free space map settings -- Bruce Momjian bruce@momjian.us EnterpriseDB http://www.enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
Bruce Momjian wrote: > Consindering the syntax for this, we currently allow read access during > index creation, just not write access, so I think the new syntax should > be: > > CREATE [ UNIQUE ] INDEX name ON table > [ USING method ] [ [ENABLE] WRITE [ACCESS] ] > ( { column | ( expression ) } [ opclass ] [, ...] ) > [ WITH ( storage_parameter = value [, ... ] ) ] > [ TABLESPACE tablespace ] > [ WHERE predicate ] > > This is clear, and adds no new keywords. > > It's not particularly clear to me. On its face this seems to me to imply something about how the index will be able to be used, not about how it is to be built. cheers andrew
Andrew Dunstan wrote: > Bruce Momjian wrote: > > Consindering the syntax for this, we currently allow read access during > > index creation, just not write access, so I think the new syntax should > > be: > > > > CREATE [ UNIQUE ] INDEX name ON table > > [ USING method ] [ [ENABLE] WRITE [ACCESS] ] > > ( { column | ( expression ) } [ opclass ] [, ...] ) > > [ WITH ( storage_parameter = value [, ... ] ) ] > > [ TABLESPACE tablespace ] > > [ WHERE predicate ] > > > > This is clear, and adds no new keywords. > > > > > > It's not particularly clear to me. On its face this seems to me to imply > something about how the index will be able to be used, not about how it > is to be built. Yea, that was always a confusion. CREATE CONCURRENT INDEX has the same problem. We need something that talks about the build-time behavior. How about NOWAIT? > > CREATE [ UNIQUE ] INDEX name ON table > > [ USING method ] [ NOWAIT ] > > ( { column | ( expression ) } [ opclass ] [, ...] ) > > [ WITH ( storage_parameter = value [, ... ] ) ] > > [ TABLESPACE tablespace ] > > [ WHERE predicate ] -- Bruce Momjian bruce@momjian.us EnterpriseDB http://www.enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
Bruce Momjian <bruce@momjian.us> writes: > CREATE [ UNIQUE ] INDEX name ON table > [ USING method ] [ [ENABLE] WRITE [ACCESS] ] > ( { column | ( expression ) } [ opclass ] [, ...] ) > [ WITH ( storage_parameter = value [, ... ] ) ] > [ TABLESPACE tablespace ] > [ WHERE predicate ] > This is clear, and adds no new keywords. ... and is very very painful for psql to parse. If you want this, *you* write the patch for command_no_begin(). Be sure to handle quoted and schema-qualified identifiers properly. regards, tom lane
Tom Lane wrote: > Bruce Momjian <bruce@momjian.us> writes: > > CREATE [ UNIQUE ] INDEX name ON table > > [ USING method ] [ [ENABLE] WRITE [ACCESS] ] > > ( { column | ( expression ) } [ opclass ] [, ...] ) > > [ WITH ( storage_parameter = value [, ... ] ) ] > > [ TABLESPACE tablespace ] > > [ WHERE predicate ] > > > This is clear, and adds no new keywords. > > ... and is very very painful for psql to parse. If you want this, > *you* write the patch for command_no_begin(). Be sure to handle > quoted and schema-qualified identifiers properly. I was going to use our tab-completion code, which I think handles all these. -- Bruce Momjian bruce@momjian.us EnterpriseDB http://www.enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
Bruce Momjian wrote: > Andrew Dunstan wrote: > > It's not particularly clear to me. On its face this seems to me to imply > > something about how the index will be able to be used, not about how it > > is to be built. > > Yea, that was always a confusion. CREATE CONCURRENT INDEX has the same > problem. We need something that talks about the build-time behavior. How about CREATE INDEX CONCURRENTLY? -- Alvaro Herrera http://www.CommandPrompt.com/ The PostgreSQL Company - Command Prompt, Inc.
Alvaro Herrera wrote: > Bruce Momjian wrote: > > Andrew Dunstan wrote: > > > > It's not particularly clear to me. On its face this seems to me to imply > > > something about how the index will be able to be used, not about how it > > > is to be built. > > > > Yea, that was always a confusion. CREATE CONCURRENT INDEX has the same > > problem. We need something that talks about the build-time behavior. > > How about > > CREATE INDEX CONCURRENTLY? Yea, that actually sounds pretty good. Though a new keyword, it might be the best solution. -- Bruce Momjian bruce@momjian.us EnterpriseDB http://www.enterprisedb.com + If your life is a hard drive, Christ can be your backup. +