Thread: Point-in-time data recovery - v.7.4
Hello, I am trying to find out when 'Point-in-time data recovery' functionality will be available in PostgreSQL, but I cannot find concrete info about this. References in mailing lists talk about version 7.4, and in the TODO list it is under the section 'urgent'. Does anybody know when this functionality will be available with production quality? -- With regards Rafael Martinez USIT, University of Oslo
I think it was delayed until 7.5... same for win32 port.
Here is Bruce's message talking about both topics: http://archives.postgresql.org/pgsql-hackers/2003-07/msg00284.php
On Tue, 2003-11-18 at 11:01, Rafael Martinez Guerrero wrote:
Hello I am trying to find out when 'Point-in-time data recovery' functionality will be available with postgreSQL but I can not find concrete info about this. References in mailinglists talk about version 7.4 and in the TODO list is under the section 'urgent'. Anybody knows when this functionality will be available with prod-quality?
Rafael Martinez Guerrero <r.m.guerrero@usit.uio.no> writes: > Anybody knows when this functionality will be available with > prod-quality? It's likely to show up in 7.5, which it is hoped will be released around the middle of next year. However, as this is free software, dependent on volunteers and other people's priorities, neither the feature set of the next release nor the exact release date is promised. -- greg
It was planned for 7.4 but got bumped since the work was not finished. There are patches floating around and people are putting some effort into it, so hopefully we will see something in 7.5, but it is dependent on the code being finished before the end of the 7.5 development cycle. If you want to help code PITR, please send a message to hackers. Robert Treat On Tue, 2003-11-18 at 09:01, Rafael Martinez Guerrero wrote: > Hello > > I am trying to find out when 'Point-in-time data recovery' functionality > will be available with postgreSQL but I can not find concrete info about > this. > > References in mailinglists talk about version 7.4 and in the TODO list > is under the section 'urgent'. > > Anybody knows when this functionality will be available with > prod-quality? > > -- > With regards > Rafael Martinez > USIT, University of Oslo -- Build A Brighter Lamp :: Linux Apache {middleware} PostgreSQL
On Tue, Nov 18, 2003 at 03:01:15PM +0100, Rafael Martinez Guerrero wrote: > References in mailinglists talk about version 7.4 and in the TODO list > is under the section 'urgent'. It didn't get done. > Anybody knows when this functionality will be available with > prod-quality? I don't think so. Jan Wieck has a proposal for a new replication system which will offer a trick for producing point-in-time recovery as a side benefit. We are aiming to have that software in use sooner rather than later, but it hasn't been written yet. A -- ---- Andrew Sullivan 204-4141 Yonge Street Afilias Canada Toronto, Ontario Canada <andrew@libertyrms.info> M2P 2A8 +1 416 646 3304 x110
On Tue, 2003-11-18 at 08:29, Andrew Sullivan wrote: > On Tue, Nov 18, 2003 at 03:01:15PM +0100, Rafael Martinez Guerrero wrote: > > References in mailinglists talk about version 7.4 and in the TODO list > > is under the section 'urgent'. > > It didn't get done. > > > Anybody knows when this functionality will be available with > > prod-quality? > > I don't think so. > > Jan Wieck has a proposal for a new replication system which will > offer a trick for producing point in time as a side benefit. We are > aiming to have that software in use sooner rather than later, but it > hasn't been written yet. Supplanting PITR with Replication? If so, I don't think that's a good idea, since you'd either have to buy a new server, or double disk capacity, both of which can cost noticeable amounts of money, if SCSI disks are used. -- ----------------------------------------------------------------- Ron Johnson, Jr. ron.l.johnson@cox.net Jefferson, LA USA When Swedes start committing terrorism, I'll become suspicious of Scandinavians.
What is the best method for storing files in postgres? Is it better to use the large object functions or to just encode the data and store it in a regular text or data field?
On Tue, Nov 18, 2003 at 12:39:40PM -0600, Ron Johnson wrote: > Supplanting PITR with Replication? If so, I don't think that's No, I think the idea is that if you're already using the replication system, you can get this for nothing along with it. Yes, it'll cost you in hardware. A -- ---- Andrew Sullivan 204-4141 Yonge Street Afilias Canada Toronto, Ontario Canada <andrew@libertyrms.info> M2P 2A8 +1 416 646 3304 x110
"Rick Gigger" <rick@alpinenetworking.com> writes: > What is the best method for storing files in postgres? Is it better to use > the large object functions or to just encode the data and store it in a > regular text or data field? Depends on whether you need "random access" to the contents. You can lo_seek() inside a large object and retrieve parts of the data with lo_read(), while 'text' and 'bytea' currently require fetching the whole field. -Doug
Quoting Rick Gigger <rick@alpinenetworking.com>: > What is the best method for storing files in postgres? Is it better to use > the large object functions or to just encode the data and store it in a > regular text or data field? Rick, This has been discussed recently so you might want to dig through the archives, but one thing to be aware of is that with large objects you have to dump your database in a different format - a non-text format. This is less portable than storing files as byteas, which can be dumped in both the text and non-text formats. Arguably, if all you use is PostgreSQL then this might not be a big deal. Performance-wise I'm not qualified to speak to which is "better". As always, it is going to depend on your specific application and environment. -- Keith C. Perry, MS E.E. Director of Networks & Applications VCSN, Inc. http://vcsn.com ____________________________________ This email account is being hosted by: VCSN, Inc : http://vcsn.com
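To make the dump issue concrete, the difference is roughly this (command lines are only a sketch and the database names are made up): a plain SQL dump carries bytea columns fine, while large objects only come along in the non-text formats with the blobs option.

    pg_dump mydb > mydb.sql            # plain text dump; bytea columns are included
    pg_dump -Fc -b mydb > mydb.dump    # custom format with -b to include large objects
    pg_restore -d newdb mydb.dump      # restore the custom-format dump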
I will search the archives but does anyone know off the top of their head which performs better? ----- Original Message ----- From: "Keith C. Perry" <netadmin@vcsn.com> To: "Rick Gigger" <rick@alpinenetworking.com> Cc: "PgSQL General ML" <pgsql-general@postgresql.org> Sent: Tuesday, November 18, 2003 12:25 PM Subject: Re: [GENERAL] uploading files > Quoting Rick Gigger <rick@alpinenetworking.com>: > > > What is the best method for storing files in postgres? Is it better to use > > the large object functions or to just encode the data and store it in a > > regular text or data field? > > > > > > ---------------------------(end of broadcast)--------------------------- > > TIP 3: if posting/reading through Usenet, please send an appropriate > > subscribe-nomail command to majordomo@postgresql.org so that your > > message can get through to the mailing list cleanly > > > > Rick, > > This has been discussed recently so you might want to dig through the archives > but one thing to be aware of is that with large objects, you have have to dump > your database in a different format- a non-text format. This is less portable > than storing files as bytea's which can be dumped in the text and non-text > formats. Argueably is all you use is PosgreSQL then this might not be a big deal. > > Performance-wise I'm not qualified to speak to which is "better". As always, it > is going to depend on your specific application and environment. > > -- > Keith C. Perry, MS E.E. > Director of Networks & Applications > VCSN, Inc. > http://vcsn.com > > ____________________________________ > This email account is being host by: > VCSN, Inc : http://vcsn.com >
"Rick Gigger" <rick@alpinenetworking.com> writes: > I will search the archives but does anyone know off the top of their head > which performs better? It really depends on what you are trying to do. If you are always fetching the entire file from the database, 'bytea' *might* have a little less overhead, but it's not clearcut. Both of them will be ultimately limited by how fast you can get data off the disk and and pump it through your client connection. -Doug
> Depends on whether you need "random access" to the contents. You can > lo_seek() inside a large object and retrieve parts of the data with > lo_read(), while 'text' and 'bytea' currently require fetching the > whole file. Not so unless I misunderstand. We use substr() on bytea for chunking access to large files on OSs with inferior TCP/IP buffer implementations. That's probably suboptimal performance-wise, though. Karsten -- GPG key ID E4071346 @ wwwkeys.pgp.net E167 67FD A291 2BEA 73BD 4537 78B9 A9F9 E407 1346
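Concretely, that chunked bytea access looks something like this (table and column names invented; substr offsets are 1-based):

    SELECT substr(filedata, 1, 65536)     FROM files WHERE id = 1;  -- bytes 1..65536
    SELECT substr(filedata, 65537, 65536) FROM files WHERE id = 1;  -- bytes 65537..131072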
Karsten Hilbert <Karsten.Hilbert@gmx.net> writes: > > Depends on whether you need "random access" to the contents. You can > > lo_seek() inside a large object and retrieve parts of the data with > > lo_read(), while 'text' and 'bytea' currently require fetching the > > whole file. > Not so unless I misunderstand. We use substr() on bytea for > chunking access to large files on OSs with inferior TCP/IP buffer > implementations. That's probably suboptimal performance-wise, > though. I think the field will still be completely loaded into memory on the server side though, while LOs are stored in "chunks" and can theoretically be streamed to the client. I'm not really a definitive authority, though... -Doug
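For anyone who hasn't used the large object interface, a minimal sketch of that kind of streamed, random access through the SQL-level functions looks like this (the table and column names are invented, and the descriptor number returned by lo_open can vary):

    BEGIN;
    -- open the large object referenced by documents.content for reading
    -- (262144 is the INV_READ flag)
    SELECT lo_open(content, 262144) FROM documents WHERE id = 1;
    -- assuming lo_open returned descriptor 0:
    SELECT lo_lseek(0, 100000, 0);   -- seek to byte 100000 (0 = SEEK_SET)
    SELECT loread(0, 8192);          -- read 8 kB starting there
    SELECT lo_close(0);
    COMMIT;                          -- descriptors only live inside the transaction

The libpq calls lo_open(), lo_seek() and lo_read() do the same thing from C without pushing every chunk through a SELECT.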
I am currently trying to import a text data file with about 45,000 records. At the end of the import it does an update on each of the 45,000 records. Doing all of the inserts completes in a fairly short amount of time (about 2 1/2 minutes). Once it gets to the updates, though, it slows to a crawl. After about 10 minutes it has only done about 3000 records. Is that normal? Is it because it's inside such a large transaction? Is there anything I can do to speed that up? It seems awfully slow to me. I didn't think that giving it more shared buffers would help, but I tried anyway. It didn't help. I also tried a full vacuum with analyze (vacuumdb -z -f) and it cleaned up a lot of stuff, but it didn't speed up the updates at all. I am using a dual 800MHz Xeon box with 2 GB of RAM. I've tried anywhere from about 16,000 to 65,000 shared buffers. What other factors are involved here?
> I think the field will still be competely loaded into memory on the > server side though, while LOs are stored in "chunks" and can > theoretically be streamed to the client. I'm not really a definitive > authority, though... Ah ! Sounds about right ! Something new to learn every day :-) Karsten -- GPG key ID E4071346 @ wwwkeys.pgp.net E167 67FD A291 2BEA 73BD 4537 78B9 A9F9 E407 1346
Rick Gigger wrote: > I am currently trying to import a text data file without about 45,000 > records. At the end of the import it does an update on each of the 45,000 > records. Doing all of the inserts completes in a fairly short amount of > time (about 2 1/2 minutes). Once it gets to the the updates though it slows > to a craw. After about 10 minutes it's only done about 3000 records. > > Is that normal? Is it because it's inside such a large transaction? Is > there anything I can do to speed that up. It seems awfully slow to me. > > I didn't think that giving it more shared buffers would help but I tried > anyway. It didn't help. > > I tried doing a analyze full on it (vacuumdb -z -f) and it cleaned up a lot > of stuff but it didn't speed up the updates at all. > > I am using a dual 800mhz xeon box with 2 gb of ram. I've tried anywhere > from about 16,000 to 65000 shared buffers. > > What other factors are involved here? It is difficult to say without knowing either the definition of the relation(s) or the update queries involved. Are there indexes being created after the import that would allow PostgreSQL to locate the rows being updated quickly, or is the update an unqualified update (no WHERE clause) that affects all tuples? EXPLAIN ANALYZE is your friend... Mike Mascari mascarm@mascari.com
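As a concrete illustration of that (table, column, and index names here are invented), the thing to look for in the plan is an index scan driven by the UPDATE's WHERE clause:

    -- note that EXPLAIN ANALYZE actually executes the statement,
    -- so run it inside a transaction you can roll back
    EXPLAIN ANALYZE
    UPDATE import_data SET status = 'done' WHERE record_id = 12345;
    -- good:    Index Scan using import_data_record_id_idx on import_data ...
    -- trouble: Seq Scan on import_data ...   (45,000 of these will crawl)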
On Tue, 18 Nov 2003, Rick Gigger wrote: > I am currently trying to import a text data file without about 45,000 > records. At the end of the import it does an update on each of the 45,000 > records. Doing all of the inserts completes in a fairly short amount of > time (about 2 1/2 minutes). Once it gets to the the updates though it slows > to a craw. After about 10 minutes it's only done about 3000 records. Are you doing your updates like this: update table1 set field1='1' where id=1; update table1 set field2=4 where id=1; ... update table1 set field10='something else' where id=1; update table1 set field1='3' where id=2; ... Maybe an analyze after the import and before the updates would help. As might a vacuum [full]. If the table isn't updated by other processes, probably not. Maybe you've got a foreign key mismatch going on and a lot of sequential scanning? > Is that normal? Is it because it's inside such a large transaction? Is > there anything I can do to speed that up. It seems awfully slow to me. Possibly. If you are creating a lot of dead tuples, then the operations can get slower and slower. Have you checked your fsm settings et al.? > I didn't think that giving it more shared buffers would help but I tried > anyway. It didn't help. Usually doesn't. More sort_mem might though. Make it something like 16384 or 32768 (it's measured in kbytes). > I tried doing a analyze full on it (vacuumdb -z -f) and it cleaned up a lot > of stuff but it didn't speed up the updates at all. You probably need to do the analyze between the import and the update. > I am using a dual 800mhz xeon box with 2 gb of ram. I've tried anywhere > from about 16,000 to 65000 shared buffers. That's VERY high. When postgresql has to manage a lot of buffers it actually is slower than letting the kernel in Linux or BSD do it for you. > What other factors are involved here? Not sure. More concrete examples would help. Have you run your queries with explain analyze at the front and looked for differences in number of rows / loops? Those are the dead giveaways. Take a look here: http://www.varlena.com/varlena/GeneralBits/Tidbits/perf.html
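To make those two suggestions concrete (a sketch only, with invented names): fold the per-column updates into a single statement, and raise sort_mem just for the session doing the load.

    SET sort_mem = 32768;   -- kB, for this session only
    -- one UPDATE touching all the columns beats ten UPDATEs of one column each:
    UPDATE table1
       SET field1 = '3', field2 = 4, field10 = 'something else'
     WHERE id = 2;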
Uh, I feel a little silly now. I had an index on the field in question (needed to locate the row to update) but later recreated the table and forgot to re-add it. I had assumed that it was there, but double-checked just now and it was gone. I then re-added the index and it finished in a few minutes. Sorry about that one. Thanks for the help. rg ----- Original Message ----- From: "Mike Mascari" <mascarm@mascari.com> To: "Rick Gigger" <rick@alpinenetworking.com> Cc: "PgSQL General ML" <pgsql-general@postgresql.org> Sent: Tuesday, November 18, 2003 2:03 PM Subject: Re: [GENERAL] performance problem > It is difficult to say without knowing either the definition of the > relation(s) or the update queries involved. Are there indexes being > created after the import that would allow PostgreSQL to locate the > rows being updated quickly, or is the update an unqualified update (no > WHERE clause) that affects all tuples? > > EXPLAIN ANALYZE is your friend... > > Mike Mascari > mascarm@mascari.com
On Tue, 18 Nov 2003 17:43, Rick Gigger wrote: > I am currently trying to import a text data file without about 45,000 > records. At the end of the import it does an update on each of the 45,000 > records. Doing all of the inserts completes in a fairly short amount of > time (about 2 1/2 minutes). Once it gets to the the updates though it slows > to a craw. After about 10 minutes it's only done about 3000 records. That's not a big number of rows. It shouldn't be causing that much trouble, unless you have something wrong with the update query. Try inserting all the 45K rows and then run an explain analyze on the update query to see what's wrong (reply with the output of explain analyze to the list). The update query itself would help a lot, too. > Is that normal? Is it because it's inside such a large transaction? Is > there anything I can do to speed that up. It seems awfully slow to me. Not sure, but I have had lots of big transactions with heavy load, and never had a problem like the one you describe. > I didn't think that giving it more shared buffers would help but I tried > anyway. It didn't help. > > I tried doing a analyze full on it (vacuumdb -z -f) and it cleaned up a lot > of stuff but it didn't speed up the updates at all. > > I am using a dual 800mhz xeon box with 2 gb of ram. I've tried anywhere > from about 16,000 to 65000 shared buffers. How's memory performance while you're running the updates? What does free say (if you are in some Unix environment)? -- select 'mmarques' || '@' || 'unl.edu.ar' AS email; ----------------------------------------------------------------- Martín Marqués | mmarques@unl.edu.ar Programador, Administrador, DBA | Centro de Telemática Universidad Nacional del Litoral -----------------------------------------------------------------
On Tue, 18 Nov 2003, Andrew Sullivan wrote: > On Tue, Nov 18, 2003 at 12:39:40PM -0600, Ron Johnson wrote: > > Supplanting PITR with Replication? If so, I don't think that's > > No, I think the idea is that if you're already using the replication > system, you can get this for nothing along with it. Yes, it'll cost > you in hardware. No, it's more along the lines of "once the replication system is written, it can be adapted to provide PITR (without the need for actual replication) as a side effect."
scott.marlowe wrote: > On Tue, 18 Nov 2003, Andrew Sullivan wrote: > > >>On Tue, Nov 18, 2003 at 12:39:40PM -0600, Ron Johnson wrote: >> >>>Supplanting PITR with Replication? If so, I don't think that's >> >>No, I think the idea is that if you're already using the replication >>system, you can get this for nothing along with it. Yes, it'll cost >>you in hardware. > > > No, it's more along the lines of "once the replication system is written, > it can be adapted to provide PITR (without the need for actual > replication) as a side effect. I think the opposite: once PITR is written, it can be used for replication as a side effect. For example, Sybase replication is implemented in this way. Regards Gaetano Mendola
> I think the opposit: > > Once the PITR is written it can be used for Replication as side > effect. > Command Prompt has found the opposite to be the case. It is our replication product that is going to allow PITR, even from 7.3. Sincerely, Joshua Drake > For example the Sybase Replication is implemented in this way. > > > Regards > Gaetano Mendola -- Command Prompt, Inc., home of Mammoth PostgreSQL - S/ODBC and S/JDBC Postgresql support, programming shared hosting and dedicated hosting. +1-503-222-2783 - jd@commandprompt.com - http://www.commandprompt.com Editor-N-Chief - PostgreSQl.Org - http://www.postgresql.org
> Usually doesn't. More sort_mem might though. Make it something like > 16384 or 32768 (it's measured in kbytes) I thought that it was measured in 8k blocks. http://candle.pha.pa.us/main/writings/pgsql/hw_performance/node3.html > That's VERY high. When postgresql has to manage a lot of buffers it > actually is slower than letting the kernel in Linux or BSD do it for you. Even if you've got the memory to spare? Does postgres actually slow down just because it's slower to manage a lot of them, or because you're taking the memory away from the kernel so the kernel has to swap more? rg
"Rick Gigger" <rick@alpinenetworking.com> writes: > > That's VERY high. When postgresql has to manage a lot of buffers it > > actually is slower than letting the kernel in Linux or BSD do it for you. > > Even if you've got the memory to spare? Does postgres actually slow down > just because it's slower to manager a lot of them just or because you're > taking the memory away from the kernel so the kernel has to swap more? The latter, mainly, I think. Also you *really* don't want your kernel to swap out any of your shared buffers, which can happen if they take up a significant portion of RAM... -Doug
On Tue, Nov 18, 2003 at 04:54:59PM -0700, Rick Gigger wrote: > > That's VERY high. When postgresql has to manage a lot of buffers it > > actually is slower than letting the kernel in Linux or BSD do it for you. > > Even if you've got the memory to spare? Does postgres actually slow down > just because it's slower to manager a lot of them just or because you're > taking the memory away from the kernel so the kernel has to swap more? It's a combination. Any buffer that's stored in the postgresql cache is also in the kernel cache, so that's just slightly inefficient. The kernel pool manager tends to be more optimised for handling large numbers of buffers. Because it's the kernel, it can tailor for hardware quirks like high memory and bounce buffers and such. And when you're not using your shared memory it's wasted, but the kernel can reallocate any memory it likes for any purpose. Hope this helps, -- Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/ > "All that is needed for the forces of evil to triumph is for enough good > men to do nothing." - Edmond Burke > "The penalty good people pay for not being interested in politics is to be > governed by people worse than themselves." - Plato
> Command Prompt has found the opposite to be the case. It is our > replication product > that is going to allow PITR, even from 7.3. > I must have been asleep as this is the first I've heard of Command Prompt doing a replication product. Are you able to provide more details? Zembu Labs (http://www.zembu.com/) apparently had a postgreSQL replication thing happening last year but they've since closed their doors. Anyone know anything about that? Cheers, Graeme
So are you donating your replication work to the community then? Or do we still have to write PITR? Dave On Tue, 2003-11-18 at 19:55, Joshua D. Drake wrote: > > I think the opposit: > > > > Once the PITR is written it can be used for Replication as side > > effect. > > > Command Prompt has found the opposite to be the case. It is our > replication product > that is going to allow PITR, even from 7.3. > > Sincerely, > > Joshua Drake > > > > > For example the Sybase Replication is implemented in this way. > > > > > > Regards > > Gaetano Mendola > > > > > > > > > > ---------------------------(end of broadcast)--------------------------- > > TIP 4: Don't 'kill -9' the postmaster >
Dave Cramer wrote: > So are you donating your replication work to the community then? Or do > we still have to write PITR? Jan is working on a community replication solution, and we will still need PITR anyway because not everyone has two machines. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001 + If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania 19073
Hi all, Just another question on all this: when will postgres have native win32 support? I've been gathering information from bits and pieces I found on the net, but never got an official answer. Some said it would be in 7.4, but it's not there, and what I gathered about a month ago was that it was aimed for release in version 8.0, so I have been under the impression that there will never be a 7.5 release. Is there anyone that can help me clear up my confusion in this matter? Franco Bruno Borghesi explained: => I think it was delayed until 7.5... same for win32 port.
On Tue, 18 Nov 2003 21:22:56 +0100, Karsten Hilbert wrote: >> I think the field will still be completely loaded into memory on the >> server side though, while LOs are stored in "chunks" and can >> theoretically be streamed to the client. I'm not really a definitive >> authority, though... > Ah ! Sounds about right ! Something new to learn every day :-) > Actually, bytea and text are chunked behind the scenes (the technique known as TOAST). They are also compressed (LZ) by default. However, if you anticipate substringing them a lot (or if they are not very compressible anyway), use ALTER TABLE xxx ALTER COLUMN yyy SET STORAGE EXTERNAL to turn off the automatic compression of values. This allows the substr operation to read the minimum number of chunks necessary off disk. Bear in mind, of course, that the whole value will take up more disk space, so reading it in its entirety will be slower (I/O is normally the limiting performance factor - CPU on a DB server is often cheap by comparison, so decompressing/compressing to save I/O is a good idea). If, however, you always fetch small parts (e.g. you store large images and usually want to read the header info from them), EXTERNAL is a good bet (and depending on the image format, the compression might not compress them very much anyway). Finally, note that the substr optimisation for text only really buys you anything if the character set is single-byte. Hope this helps, John Gray (implementer of substr optimisation many moons ago!)
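In other words, something along these lines (table and column names invented) keeps substr() cheap at the price of more disk space:

    -- store the column toasted but uncompressed; only affects newly written values
    ALTER TABLE files ALTER COLUMN filedata SET STORAGE EXTERNAL;
    -- substr() now only has to read the chunks covering the requested range
    SELECT substr(filedata, 1, 8192) FROM files WHERE id = 1;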
Stef wrote: > Hi all, > > Just another question on all this : > When will postgres have win32 native > support. I've been gathering information > from bits and pieces I found on the net, > but never got an official answer. > > Some said it will be in 7.4, but it's not there, > and from what I gathered about a month ago > was that it was aimed for release in version 8.0 > So I have been under the impression that there > will never be a 7.5 release. Actually there is a debate ongoing right now, about what to call the next release? 7.5 or 8.0 So native win32 port is aimed for next major release whatever that might be..:-) Shridhar
Stef wrote: > And the world rejoiced as Shridhar Daithankar said : > > => Actually there is a debate ongoing right now, about what to call the next > => release? 7.5 or 8.0 > => > => So native win32 port is aimed for next major release whatever that might be..:-) > > Well, put like this, it doesn't really make a difference in the end :) As long as win32 makes it into the next release..:-) Shridhar
On Wed, Nov 19, 2003 at 11:42:40AM -0500, Bruce Momjian wrote: > Jan is working on a community replication solution, and we will still > need PITR anyway because everyone doesn't have two machines. Jan's system, if I understand him correctly, will not actually require two machines to get you PITR either, although it probably _would_ require two databases and a hunk of disk. A -- ---- Andrew Sullivan 204-4141 Yonge Street Afilias Canada Toronto, Ontario Canada <andrew@libertyrms.info> M2P 2A8 +1 416 646 3304 x110
> > > That's VERY high. When postgresql has to manage a lot of buffers it > > > actually is slower than letting the kernel in Linux or BSD do it for you. I am confused. In this tutorial (by Bruce Momjian) http://candle.pha.pa.us/main/writings/pgsql/hw_performance/node8.html it says: "As a start for tuning, use 25% of RAM for cache size, and 2-4% for sort size." If I've got 2 GB of RAM then that is 2097152k. 25% of that = 524288k. So the tutorial is saying that 524288k is a good starting point for shared buffers with this amount of RAM. If each buffer is 8k (http://candle.pha.pa.us/main/writings/pgsql/hw_performance/node3.html) then that would be 65536 buffers. I'm pretty sure that that is a lot more than I need to cache every tuple in my database. Now every time I see someone comment on this list about appropriate numbers for shared buffers they say something like "(65536 buffers is) VERY high." Since I obviously don't need that much shared cache I am not concerned, but it seems to me that one of the following must be true: my calculations here are wrong, or the tutorial is not accurate in saying that 25% is a good starting point, or the people making comments that 65536 is "VERY high" are wrong. Am I just confused or does this make sense? > > Even if you've got the memory to spare? Does postgres actually slow down > > just because it's slower to manager a lot of them just or because you're > > taking the memory away from the kernel so the kernel has to swap more? > > The latter, mainly, I think. Also you *really* don't want your kernel > to swap out any of your shared buffers, which can happen if they take > up a significant portion of RAM... So if I'm not swapping at all and I've got over 1 GB of unused memory then I'm probably OK even with the very high buffer setting? (Although I will probably reduce it anyway since it is unnecessary.) rg
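For what it's worth, the arithmetic itself checks out; the disagreement further down the thread is with the guideline, not the math:

    2 GB of RAM          = 2,097,152 kB
    25% of that          =   524,288 kB
    524,288 kB / 8 kB    =    65,536 buffers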
Ok, adding the index back in worked the first time, but then I tried generating the database from scratch again, this time adding the index right from the start. It added the 45,000 rows in about a minute but then was going dog slow on the updates again. So I did an explain and sure enough it was not using the index. After some investigation I determined that it was not using the index because when the transaction started there were only 4 rows in the table, so at that point it didn't want to use it. It apparently doesn't gather analysis data fast enough to handle this kind of transaction. I worked around this by starting the transaction and inserting the 45,000 rows and then killing it. Then I removed the index and re-added it, which apparently gathered some stats, and since there were all of the dead tuples in there from the failed transaction it now decided that it should use the index. I reran the script and this time it took 5 minutes again instead of 1 1/2 hours. I am using 7.2.4. Has this improved in later versions? I'm not concerned, since this is a very rare thing to need to do and it's obviously possible to work around, but it would be nice if postgres could figure things like that out on its own. (It certainly would have saved me a lot of time and confusion last night at about 3 am.) Is there a way to force the use of a specific index on a query? rg ----- Original Message ----- From: "Rick Gigger" <rick@alpinenetworking.com> To: "Mike Mascari" <mascarm@mascari.com> Cc: "PgSQL General ML" <pgsql-general@postgresql.org> Sent: Tuesday, November 18, 2003 2:41 PM Subject: Re: [GENERAL] performance problem > Uh, I feel a little silly now. I had and index on the field in question > (needed to locate the row to update) but later recreated the table and > forgot to readd it. I had assumed that it was there but double checked just > now and it was gone. I then readded the index and and it finished in a few > minutes. > Sorry about that one. Thanks for the help. > > rg
As discussed on previous posts and mentioned below, the kernel is often better at handling large amounts of buffers than Postgres. From researching previous posts and testing on my own setups, 256mb is about the max you want to go with shared buffers (if you have at least 1 gig ram). After that the smaller queries (which there are usually a lot of) slow down, and you only gain marginal improvement on the larger queries. You'll see an improvement slightly on the large queries going over 256 but not enough to improve the overall performance. There's a few more technically detailed explanations of this in the list archive I believe, by those who can explain it better than I. >> > > That's VERY high. When postgresql has to manage a lot of buffers it >> > > actually is slower than letting the kernel in Linux or BSD do it for > you. > > I am confused. In this tutorial (by Bruce Momjian) > http://candle.pha.pa.us/main/writings/pgsql/hw_performance/node8.html > it says: "As a start for tuning, use 25% of RAM for cache size, and 2-4% > for > sort size." > > If I've got 2g of RAM then that is 2097152k. 25% of that = 524288k. So > the > tutorial is saying that 524288k is a good starting point for shared > buffers > with this amount of RAM. > > If each buffer is 8k > (http://candle.pha.pa.us/main/writings/pgsql/hw_performance/node3.html) > then that would be 65536 buffers. > > I'm pretty sure that that is a lot more than I need to cache every tuple > in > my database. Now everytime I see someone comment on this list about > appropriate numbers for shared buffers they say something like "(65536 > buffers is) VERY high. Now since I obviously don't need that much shared > cache so I am not concerned but it seems to me that one of the following > must be true. > > My calculations here are wrong. > or The tutorial is not accurate in saying that 25% is a good starting > point. > or The people making comments that 65536 is "VERY high" are wrong. > > Am I just confused or does this make sense? > >> > Even if you've got the memory to spare? Does postgres actually slow > down >> > just because it's slower to manager a lot of them just or because >> you're >> > taking the memory away from the kernel so the kernel has to swap more? >> >> The latter, mainly, I think. Also you *really* don't want your kernel >> to swap out any of your shared buffers, which can happen if they take >> up a significant portion of RAM... > > So if I'm not swapping at all and I've got over 1g of unused memory then > I'm > probably ok even with the very high buffer setting? (Although I will > probably reduce it anyway since it is unnecessary). > > rg > > > ---------------------------(end of broadcast)--------------------------- > TIP 7: don't forget to increase your free space map settings >
On Thu, Nov 20, 2003 at 01:52:10PM -0700, Rick Gigger wrote: > I worked around this by starting the transaction and inserting the 45,000 > rows and then killing it. The I removed the index and readded it which > apparently gathered some stats and since there were all of the dead tuples > in there from the failed transaction it now decided that it should use the > index. I reran the script and this time it took 5 minutes again instead of > 1 1/2 hours. Stats are not collected automatically. You should run ANALYZE after importing your data. And it's probably faster to create the index after the data is loaded, too. -- Alvaro Herrera (<alvherre[@]dcc.uchile.cl>) Y una voz del caos me habló y me dijo "Sonríe y sé feliz, podría ser peor". Y sonreí. Y fui feliz. Y fue peor.
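That is, something like this right after the bulk insert and before the updates (table and column names invented):

    ANALYZE import_data;              -- safe inside a transaction block, unlike VACUUM
    -- or refresh statistics for just the column the WHERE clause uses:
    ANALYZE import_data (record_id);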
Ah, so then the 25% recomendation for a starting point is probably not a bad one, but if you get over a 1 gig of ram you might as well stop at about 256mb for the shared buffers because it just won't do you much good and will start to slow down small queries. That makes sense to me. ----- Original Message ----- From: <fred@redhotpenguin.com> To: "Rick Gigger" <rick@alpinenetworking.com> Cc: "Doug McNaught" <doug@mcnaught.org>; "scott.marlowe" <scott.marlowe@ihs.com>; "PgSQL General ML" <pgsql-general@postgresql.org> Sent: Thursday, November 20, 2003 1:59 PM Subject: Re: [GENERAL] performance problem > As discussed on previous posts and mentioned below, the kernel is often > better at handling large amounts of buffers than Postgres. > > From researching previous posts and testing on my own setups, 256mb is > about the max you want to go with shared buffers (if you have at least 1 > gig ram). After that the smaller queries (which there are usually a lot > of) slow down, and you only gain marginal improvement on the larger > queries. You'll see an improvement slightly on the large queries going > over 256 but not enough to improve the overall performance. There's a few > more technically detailed explanations of this in the list archive I > believe, by those who can explain it better than I. > > >> > > That's VERY high. When postgresql has to manage a lot of buffers it > >> > > actually is slower than letting the kernel in Linux or BSD do it for > > you. > > > > I am confused. In this tutorial (by Bruce Momjian) > > http://candle.pha.pa.us/main/writings/pgsql/hw_performance/node8.html > > it says: "As a start for tuning, use 25% of RAM for cache size, and 2-4% > > for > > sort size." > > > > If I've got 2g of RAM then that is 2097152k. 25% of that = 524288k. So > > the > > tutorial is saying that 524288k is a good starting point for shared > > buffers > > with this amount of RAM. > > > > If each buffer is 8k > > (http://candle.pha.pa.us/main/writings/pgsql/hw_performance/node3.html) > > then that would be 65536 buffers. > > > > I'm pretty sure that that is a lot more than I need to cache every tuple > > in > > my database. Now everytime I see someone comment on this list about > > appropriate numbers for shared buffers they say something like "(65536 > > buffers is) VERY high. Now since I obviously don't need that much shared > > cache so I am not concerned but it seems to me that one of the following > > must be true. > > > > My calculations here are wrong. > > or The tutorial is not accurate in saying that 25% is a good starting > > point. > > or The people making comments that 65536 is "VERY high" are wrong. > > > > Am I just confused or does this make sense? > > > >> > Even if you've got the memory to spare? Does postgres actually slow > > down > >> > just because it's slower to manager a lot of them just or because > >> you're > >> > taking the memory away from the kernel so the kernel has to swap more? > >> > >> The latter, mainly, I think. Also you *really* don't want your kernel > >> to swap out any of your shared buffers, which can happen if they take > >> up a significant portion of RAM... > > > > So if I'm not swapping at all and I've got over 1g of unused memory then > > I'm > > probably ok even with the very high buffer setting? (Although I will > > probably reduce it anyway since it is unnecessary). > > > > rg > > > > > > ---------------------------(end of broadcast)--------------------------- > > TIP 7: don't forget to increase your free space map settings > > > >
On Thu, Nov 20, 2003 at 01:41:13PM -0700, Rick Gigger wrote: > it says: "As a start for tuning, use 25% of RAM for cache size, and 2-4% for > sort size." It's an old rule of thumb which has been superseded by new tests. A -- ---- Andrew Sullivan 204-4141 Yonge Street Afilias Canada Toronto, Ontario Canada <andrew@libertyrms.info> M2P 2A8 +1 416 646 3304 x110
Rick Gigger wrote: > Ok, adding the index back in worked the first time but then I tried > generating the database from scratch again, this time adding the index right > from the start. It added the 45000 rows in about a minute but then was > going dog slow on the updates again. So I did an explain and sure enough it > was not using the index. After some investigation I determined that it was > not using the index because when the transaction started there were only 4 > rows in the table so at that point it didn't want to use it. It apparently > doesn't gather analysis data fast enough to handle this kind of transaction. The statistics are collected as a result of an ANALYZE command: http://www.postgresql.org/docs/current/static/sql-analyze.html This does not happen automatically. EXPLAIN output will show a default assumption of 1000 rows, IIRC. > I worked around this by starting the transaction and inserting the 45,000 > rows and then killing it. The I removed the index and readded it which > apparently gathered some stats and since there were all of the dead tuples > in there from the failed transaction it now decided that it should use the > index. I reran the script and this time it took 5 minutes again instead of > 1 1/2 hours. If you examine the behavior of pg_dump output, you'll notice that it doesn't build indexes until after the COPY command has completed the data import. It's a waste of CPU cycles and disk bandwidth to update indexes upon every insert. Your script should:

1) Drop all indexes on the relation
2) Use COPY if possible instead of INSERT to import the data
3) Recreate the indexes
4) Use UPDATE to update as necessary

> I am using 7.2.4. Has this improved in later versions? I'm not concerened > since this is a very rare thing to need to do and it's obviously possible to > work around but it would be nice if postgres could figure things like that > out on it's own. (It certainly would have saved me a lot of time and > confusion last night at about 3 am). Is there a way to for the use of a > specific index on a query? You can force the use of an index scan by turning sequential scans off: SET ENABLE_SEQSCAN TO OFF; But the problem is that the statistics didn't match the data. You could have:

1) Used INSERTs to insert the data into a relation with an index
2) Executed ANALYZE <foo> to update the statistics
3) Performed the UPDATE

After the UPDATE, you'll still have dead tuples (the original rows) which need to be marked as dead, and so you should occasionally run either VACUUM to mark them as such, or VACUUM FULL to reclaim the dead space, or VACUUM FULL ANALYZE to also update the relation's statistics. I would just execute the steps I outlined above with COPY and not worry about ANALYZEs and VACUUMs in a script. People often run VACUUM ANALYZE in a 1/day cron job and VACUUM FULL ANALYZE in a 1/week cron job. Then, of course, there's REINDEX... Hope that helps, Mike Mascari mascarm@mascari.com
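Put together as a script, the sequence described above would look roughly like this (all object names invented; COPY FROM a server-side file needs superuser, while COPY FROM stdin works for any user):

    DROP INDEX import_data_record_id_idx;       -- 1) drop indexes before the load

    COPY import_data FROM '/tmp/import.txt';    -- 2) bulk load instead of 45,000 INSERTs

    CREATE INDEX import_data_record_id_idx      -- 3) rebuild the index afterwards
        ON import_data (record_id);
    ANALYZE import_data;                        --    and refresh the planner statistics

    UPDATE import_data SET status = 'done'      -- 4) now the updates can use the index
     WHERE record_id = 12345;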
Ah, I didn't realize that you could just do an ANALYZE. I thought there was only VACUUM ANALYZE but that can't run inside of a transaction. Thanks, rg ----- Original Message ----- From: "Alvaro Herrera Munoz" <alvherre@dcc.uchile.cl> To: "Rick Gigger" <rick@alpinenetworking.com> Cc: "Mike Mascari" <mascarm@mascari.com>; "PgSQL General ML" <pgsql-general@postgresql.org> Sent: Thursday, November 20, 2003 2:06 PM Subject: Re: [GENERAL] performance problem On Thu, Nov 20, 2003 at 01:52:10PM -0700, Rick Gigger wrote: > I worked around this by starting the transaction and inserting the 45,000 > rows and then killing it. The I removed the index and readded it which > apparently gathered some stats and since there were all of the dead tuples > in there from the failed transaction it now decided that it should use the > index. I reran the script and this time it took 5 minutes again instead of > 1 1/2 hours. Stats are not collected automatically. You should run ANALYZE after importing your data. And it's probably faster to create the index after the data is loaded, too. -- Alvaro Herrera (<alvherre[@]dcc.uchile.cl>) Y una voz del caos me habló y me dijo "Sonríe y sé feliz, podría ser peor". Y sonreí. Y fui feliz. Y fue peor.
"Rick Gigger" <rick@alpinenetworking.com> writes: > I am confused. In this tutorial (by Bruce Momjian) > http://candle.pha.pa.us/main/writings/pgsql/hw_performance/node8.html > it says: "As a start for tuning, use 25% of RAM for cache size, and 2-4% for > sort size." That advice is widely considered obsolete --- it was developed in the days when typical RAM sizes were a lot less than today. I don't believe anyone has shown a good case for setting shared_buffers much higher than 10000. regards, tom lane
Andrew Sullivan wrote: > On Wed, Nov 19, 2003 at 11:42:40AM -0500, Bruce Momjian wrote: >> Jan is working on a community replication solution, and we will still >> need PITR anyway because everyone doesn't have two machines. > > Jan's system, if I understand him correctly, will not actually > require two machines to get you PITR either, although it probably > _would_ require two datatabases and a hunk of disk. You do understand me correctly. Jan -- #======================================================================# # It's easier to get forgiveness for being wrong than for being right. # # Let's break this rule - forgive me. # #================================================== JanWieck@Yahoo.com #
> If you examine the behavior of pg_dump output, you'll notice that it > doesn't built indexes until after the COPY command has completed the > data import. It's a waste of cpu cycles and disk bandwidth to update > indexes on upon every insert. Your script should: > > 1) Drop all indexes on the relation > 2) Use COPY if possible instead of INSERT to import the data > 3) Recreate the indexes > 4) Use UPDATE to update as necessary I never thought of dropping the indexes in the middle of the script and re-adding them. I guess that would be fine as long as nothing else was running that needed the index while it was gone. I don't think 7.2.x supports using COPY unless you are inserting all of the fields, which I don't want to do, although I suppose I could just insert all of the defaults. Is that correct? > > I am using 7.2.4. Has this improved in later versions? I'm not concerened > > since this is a very rare thing to need to do and it's obviously possible to > > work around but it would be nice if postgres could figure things like that > > out on it's own. (It certainly would have saved me a lot of time and > > confusion last night at about 3 am). Is there a way to for the use of a > > specific index on a query? > > You can force the use of an index scan by turning sequential scans to off: > SET ENABLE_SEQSCAN TO OFF; Thanks. > But the problem is that the statistics didn't match the data. You > could have: > > 1) Used INSERTs to insert the data into a relation with an index > 2) Executed ANALYZE <foo> to update the statistics > 3) Perform the UPDATE That would have been the easiest thing. I wanted to do that, but I thought that you had to do a vacuum (which I couldn't do in the transaction) together with analyze. I didn't realize that analyze was a command all by itself. > After the UPDATE, you'll still have dead tuples (the original rows) > which require that they be marked as dead, and so you should > occassionally run either VACUUM to mark them as such or VACUUM FULL to > reclaim the dead space or VACUUM FULL ANALYZE to also update the > relation's statistics. I do them all nightly with cron on all my important databases. I just had a problem with needing the analyze to happen in the middle of the transaction in this one special case. > I would just execute the steps I outlined above with COPY and not > worry about ANALYZEs and VACUUMs in a script. People often run VACUUM > ANALYZE in a 1/day cron job and VACUUM FULL ANALYZE in a 1/week cron > job. Then, of course, there's REINDEX... Doing the vacuum full analyze doesn't take long to do if I do it once a day, so I just do that.
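For completeness, the COPY limitation in question looks like this (if memory serves, the explicit column list arrived in 7.3; table and column names are invented):

    COPY import_data FROM stdin;                         -- 7.2: the data must supply every column
    COPY import_data (record_id, payload) FROM stdin;    -- 7.3 and later: load just these columns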
> "Rick Gigger" <rick@alpinenetworking.com> writes: > > I am confused. In this tutorial (by Bruce Momjian) > > http://candle.pha.pa.us/main/writings/pgsql/hw_performance/node8.html > > it says: "As a start for tuning, use 25% of RAM for cache size, and 2-4% for > > sort size." > > That advice is widely considered obsolete --- it was developed in the > days when typical RAM sizes were a lot less than today. I don't believe > anyone has shown a good case for setting shared_buffers much higher than > 10000. > > regards, tom lane Thank you!
Tom Lane wrote: > "Rick Gigger" <rick@alpinenetworking.com> writes: > > I am confused. In this tutorial (by Bruce Momjian) > > http://candle.pha.pa.us/main/writings/pgsql/hw_performance/node8.html > > it says: "As a start for tuning, use 25% of RAM for cache size, and 2-4% for > > sort size." > > That advice is widely considered obsolete --- it was developed in the > days when typical RAM sizes were a lot less than today. I don't believe > anyone has shown a good case for setting shared_buffers much higher than > 10000. Yes, I will update that page shortly. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001 + If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania 19073
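In postgresql.conf terms, the revised advice works out to something like the following (figures are illustrative, not gospel):

    shared_buffers = 10000      # 8 kB pages, so roughly 80 MB
    sort_mem = 16384            # kB per sort, per backend; raise it for big batch jobs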
On Tue, 18 Nov 2003, Rick Gigger wrote: > I will search the archives but does anyone know off the top of their head > which performs better? lo_* but then I forgot to turn off compression on the bytea column. -- Nigel Andrews > > ----- Original Message ----- > From: "Keith C. Perry" <netadmin@vcsn.com> > To: "Rick Gigger" <rick@alpinenetworking.com> > Cc: "PgSQL General ML" <pgsql-general@postgresql.org> > Sent: Tuesday, November 18, 2003 12:25 PM > Subject: Re: [GENERAL] uploading files > > > > Quoting Rick Gigger <rick@alpinenetworking.com>: > > > > > What is the best method for storing files in postgres? Is it better to > use > > > the large object functions or to just encode the data and store it in a > > > regular text or data field? > > > > > > > > > ---------------------------(end of broadcast)--------------------------- > > > TIP 3: if posting/reading through Usenet, please send an appropriate > > > subscribe-nomail command to majordomo@postgresql.org so that your > > > message can get through to the mailing list cleanly > > > > > > > Rick, > > > > This has been discussed recently so you might want to dig through the > archives > > but one thing to be aware of is that with large objects, you have have to > dump > > your database in a different format- a non-text format. This is less > portable > > than storing files as bytea's which can be dumped in the text and non-text > > formats. Argueably is all you use is PosgreSQL then this might not be a > big deal. > > > > Performance-wise I'm not qualified to speak to which is "better". As > always, it > > is going to depend on your specific application and environment. > > > > -- > > Keith C. Perry, MS E.E. > > Director of Networks & Applications > > VCSN, Inc. > > http://vcsn.com
And the world rejoiced as Shridhar Daithankar said : => Actually there is a debate ongoing right now, about what to call the next => release? 7.5 or 8.0 => => So native win32 port is aimed for next major release whatever that might be..:-) Well, put like this, it doesn't really make a difference in the end :) Thanks. Stef