Re: Idea for improving speed of pg_restore - Mailing list pgsql-general
From | Ron Johnson |
---|---|
Subject | Re: Idea for improving speed of pg_restore |
Date | |
Msg-id | 1063752375.11739.1350.camel@haggis |
In response to | Re: Idea for improving speed of pg_restore (Christopher Browne <cbbrowne@acm.org>) |
List | pgsql-general |
On Mon, 2003-09-15 at 21:55, Christopher Browne wrote:
> A long time ago, in a galaxy far, far away, ron.l.johnson@cox.net (Ron Johnson) wrote:
> > Pardon if this has already been suggested and shot down as impossible
> > or too difficult, etc.
>
> None of this is forcibly the problem.
>
> The _big_ problem is that this isn't necessarily going to improve the
> speed of the would-be restore.
>
> I have brought servers to their knees by trying to restore multiple
> databases at once.

And I've seen aggregate performance increase by parallelizing table
inserts.  It all depends on the IO subsystem.  I'm sure that, as Scott
says, the win would be bigger on systems with fast IO (large RAID
arrays or even, once tablespaces are implemented, multiple controllers).

> The reason why this isn't likely to work is that the "multitasking"
> that you need to maximize is not of the CPUs, but rather of the disk
> drives.
>
> Restoring a database involves, for each table:
> 1. Reading table data from the source file;

So, how about a parallelized pg_dump that can create multiple dump
files?  Match the number of dump files to the number of pg_restore
threads.  Of course, as mentioned before, the better the IO system,
the more threads it would take to swamp it.  (A rough sketch of what
I mean is at the end of this message.)

> 2. Writing data to the database file for the table;
> 3. After that, reading the database file data, and
> 4. Writing the sorted bits to the index file.
> 5. Along with all of this, HEFTY amounts of updates to WAL.

Soooo, have pg_restore bypass the WAL.  It's one big transaction
anyway, so why write to it at all?  (Thanks to Tom for this idea.)

> Ideal performance for this involves having 1. and 2. take place
> concurrently, where a sequential set of reads from one disk lead to
> more-or-less sequential writes to the "database drive" and to the "WAL
> drive."
>
> Multitask that and it is easy for the I/O patterns to go bad, where
> one process is trying to sequence writes to one table whilst another
> is writing to another, so you thereby have the disk seeking back and
> forth between the one and the other.

Well, the DBA has to be clueful enough to match the number of threads
to the capacity of the IO system.

> There might well be some merit to having the capability to work on
> several indices on the same table all at once, so that after reading
> through the table once, _three_ indices are generated, mostly in
> memory, and then consecutively spilled to disk.  But it may be just as

I've been dreaming of that for years...

> well to let the shared memory cache do the work so that the table gets
> read in once, to generate the first index, and then is already in
> cache, which will surely help performance for generating the
> subsequent indices.

If the table is, say, 20GB, how much would the cache really help,
since it would fill up?

> That's a longwinded way of saying that doing multiple concurrent mass
> data loads sucks pretty bad.

No, it doesn't suck; it can be a big win, depending on the "capacity"
of the IO system.

--
-----------------------------------------------------------------
Ron Johnson, Jr.                        ron.l.johnson@cox.net
Jefferson, LA  USA

After listening to many White House, Pentagon & CENTCOM briefings
in both Gulf Wars, it is my firm belief that most "senior
correspondents" either have serious agendas that don't get shaken
by facts, or are dumb as dog feces.
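P.S.  To make the "match dump files to restore threads" idea concrete,
here is a rough sketch of the kind of driver I have in mind.  Everything
in it is made up for illustration -- the Python harness, the chunk_*.dump
file naming, the thread count, and the database name; pg_dump and
pg_restore don't produce or consume split dumps like this today.  The
point is just one restore process per dump file, run concurrently:

```python
#!/usr/bin/env python3
"""Illustrative sketch only: restore N per-chunk dump files in parallel.

The chunk naming (chunk_*.dump), directory, thread count, and database
name are all hypothetical; the idea is simply one pg_restore process
(and thus one backend connection) per dump file, run concurrently.
"""

import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

DUMP_DIR = Path("/backups/mydb")   # hypothetical directory holding the split dump
THREADS = 4                        # tune to what the IO subsystem can absorb

def restore_chunk(dump_file: Path) -> int:
    # One pg_restore invocation per chunk; returns its exit code.
    return subprocess.call(["pg_restore", "--dbname=mydb", str(dump_file)])

def main() -> None:
    chunks = sorted(DUMP_DIR.glob("chunk_*.dump"))
    with ThreadPoolExecutor(max_workers=THREADS) as pool:
        codes = list(pool.map(restore_chunk, chunks))
    failed = [c.name for c, rc in zip(chunks, codes) if rc != 0]
    if failed:
        raise SystemExit("restore failed for: " + ", ".join(failed))

if __name__ == "__main__":
    main()
```

As said above, the DBA still has to pick the thread count to match what
the IO system can absorb; past that point the disks just seek themselves
to death.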