Thread: Performance while loading data and indexing
Hello all,

Some time back I posted a query about building a site with a 150GB database. In the last couple of weeks, lots of things were tested at my place and there are some results and again some concerns. This is a long post. Please be patient and read through. If we win this, I guess we have a good marketing/advocacy case here..;-)

First the problems (for those who do not read beyond the first page):

1) Database load time from flat file using copy is very high
2) Creating an index takes a huge amount of time.
3) Any suggestions for runtime, as data load and query will be going in parallel.

Now the details. Note that this is a test run only..

Platform: 4x Xeon 2.4GHz/4GB RAM/4x48 SCSI RAID5/72 GB SCSI
RedHat7.2/PostgreSQL7.1.3

Database in flat file:
125,000,000 records of around 100 bytes each.
Flat file size: 12GB

Load time: 14581 sec / ~8600 rows per sec / ~1 MB of data per sec.
Create unique composite index on 2 char and a timestamp field: 25226 sec.
Database size on disk: 26GB
Select query: 1.5 sec. for approx. 150 rows.

Important postgresql.conf settings:

sort_mem = 12000
shared_buffers = 24000
fsync = true (Sad but true. Left untouched.. Will that make a difference on SCSI?)
wal_buffers = 65536
wal_files = 64

Now the requirements:

Initial flat data load: 250GB of data. This has gone up since the last query; it was 150GB earlier..
Ongoing inserts: 5000/sec.
Number of queries: 4800 queries/hour
Query response time: 10 sec.

Now the questions.

1) Instead of copying from a single 12GB data file, will a parallel copy from say 5 files speed things up? A couple of MB of data per sec. to disk is just not saturating it. It's a RAID 5 setup..

2) Sort mem = 12K, i.e. 94MB, sounds good enough to me. Does this need a further increase to improve create index performance?

3) 5K concurrent inserts with an index on: will this need additional CPU power, like deploying it on dual RISC CPUs etc.?

4) Query performance is not a problem. Though at 4.8K queries per hour the expected response time from each query is 10 sec., my guess is some serious CPU power will be chewed there too..

5) Will upgrading to 7.2.2/7.3 beta help?

All in all, in the test we didn't see the hardware saturated to its limits, so effectively we are not able to get postgresql to make use of it. Just pushing WAL and shared buffers does not seem to be the solution.

If you guys have any suggestions, let me know. I need them all.. Mysql is almost out because it has been creating its index for the last 17 hours; I don't think it will keep up with 5K inserts per sec. with an index. SAP DB is under evaluation too. But postgresql is the favourite as of now because it works. So I need to come up with solutions to problems that will occur in the near future.. ;-)

TIA..

Bye
Shridhar

--
Law of Procrastination: Procrastination avoids boredom; one never has the feeling that there is nothing important to do.
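As an illustration of question 1, a parallel load could be tried by splitting the flat file and running one COPY per chunk from separate sessions (the file paths and table name below are placeholders, not the ones from the actual test):

    -- session 1
    COPY tablename FROM '/data/chunks/part_aa';
    -- session 2
    COPY tablename FROM '/data/chunks/part_ab';
    -- ... one session per chunk, started in parallel

Whether this actually helps depends on whether a single COPY is CPU-bound or I/O-bound on this box, which is what the vmstat suggestions further down in the thread are about.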
On 26 Sep 2002 at 14:05, Shridhar Daithankar wrote:

> Some time back I posted a query about building a site with a 150GB database. In the last
> couple of weeks, lots of things were tested at my place and there are some
> results and again some concerns.

> 2) Creating an index takes a huge amount of time.

> Load time: 14581 sec / ~8600 rows per sec / ~1 MB of data per sec.
> Create unique composite index on 2 char and a timestamp field: 25226 sec.
> Database size on disk: 26GB
> Select query: 1.5 sec. for approx. 150 rows.

> 2) Sort mem = 12K, i.e. 94MB, sounds good enough to me. Does this need a further
> increase to improve create index performance?

Just a thought. If I sort the table before making an index, would it be faster than creating the index on the raw table? And if it would help at all, how do I sort the table without duplicating it? Just a wild thought..

Bye
Shridhar

--
linux: the choice of a GNU generation
(ksh@cis.ufl.edu put this on Tshirts in '93)
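A rough sketch of the "sort first" idea, using the composite index columns (the intermediate table is exactly the duplication worry above, since it roughly doubles the disk space needed while it exists):

    -- build a pre-sorted copy, then index it
    SELECT * INTO tablename_sorted FROM tablename ORDER BY esn, min, datetime;
    CREATE UNIQUE INDEX index1 ON tablename_sorted (esn, min, datetime);
    -- drop the unsorted original afterwards if the experiment pays off

Whether CREATE INDEX is measurably faster on pre-sorted input is the open question here.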
I'll preface this by saying that while I have a large database, it doesn't require quite the performance you're talking about here.

On Thu, Sep 26, 2002 at 02:05:44PM +0530, Shridhar Daithankar wrote:
> 1) Database load time from flat file using copy is very high
> 2) Creating an index takes a huge amount of time.
> 3) Any suggestions for runtime, as data load and query will be going in
> parallel.

You're loading all the data in one copy. I find that INSERTs are mostly limited by indexes. While index lookups are cheap, they are not free and each index needs to be updated for each row.

I found using partial indexes to only index the rows you actually use can help with the loading. It's a bit obscure though.

As for parallel loading, you'll be limited mostly by your I/O bandwidth. Have you measured it to make sure it's up to speed?

> Now the details. Note that this is a test run only..
>
> Platform: 4x Xeon 2.4GHz/4GB RAM/4x48 SCSI RAID5/72 GB SCSI
> RedHat7.2/PostgreSQL7.1.3
>
> Database in flat file:
> 125,000,000 records of around 100 bytes each.
> Flat file size: 12GB
>
> Load time: 14581 sec / ~8600 rows per sec / ~1 MB of data per sec.
> Create unique composite index on 2 char and a timestamp field: 25226 sec.
> Database size on disk: 26GB
> Select query: 1.5 sec. for approx. 150 rows.

So you're loading at a rate of 860KB per sec. That's not too fast. How many indexes are active at that time? Triggers and foreign keys also take their toll.

> Important postgresql.conf settings:
>
> sort_mem = 12000
> shared_buffers = 24000
> fsync = true (Sad but true. Left untouched.. Will that make a difference on
> SCSI?)
> wal_buffers = 65536
> wal_files = 64

fsync IIRC only affects the WAL buffers now, but it may be quite expensive, especially considering it's running on every transaction commit. Oh, are your WAL files on a separate disk from the data?

> Initial flat data load: 250GB of data. This has gone up since the last query; it
> was 150GB earlier..
> Ongoing inserts: 5000/sec.
> Number of queries: 4800 queries/hour
> Query response time: 10 sec.

That looks quite achievable.

> 1) Instead of copying from a single 12GB data file, will a parallel copy from
> say 5 files speed things up?

Limited by I/O bandwidth. On Linux, vmstat can tell you how many blocks are being loaded and stored per second. Try it. As long as sync() doesn't get done too often, it should help.

> A couple of MB of data per sec. to disk is just not saturating it. It's a RAID 5
> setup..

No, it's not. You should be able to do better.

> 2) Sort mem = 12K, i.e. 94MB, sounds good enough to me. Does this need a further
> increase to improve create index performance?

Should be fine. Admittedly your indexes are taking rather long to build.

> 3) 5K concurrent inserts with an index on: will this need additional CPU
> power, like deploying it on dual RISC CPUs etc.?

It shouldn't. Do you have an idea of what your CPU usage is? ps aux should give you a decent idea.

> 4) Query performance is not a problem. Though at 4.8K queries per hour the expected
> response time from each query is 10 sec., my guess is some serious CPU power
> will be chewed there too..

Should be fine.

> 5) Will upgrading to 7.2.2/7.3 beta help?

Possibly, though it may be worth it just for the features/bugfixes.

> All in all, in the test we didn't see the hardware
> saturated to its limits, so effectively we are not able to get postgresql
> to make use of it. Just pushing WAL and shared buffers does not seem to be the
> solution.
>
> If you guys have any suggestions,
> let me know. I need them all..

Find the bottleneck: CPU, I/O or memory?

> Mysql is almost out because it has been creating its index for the last 17 hours. I don't
> think it will keep up with 5K inserts per sec. with an index. SAP DB is under
> evaluation too. But postgresql is the favourite as of now because it works. So
> I need to come up with solutions to problems that will occur in the near future..
> ;-)

17 hours! Ouch. Either way, you should be able to do much better.

Hope this helps,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
> There are 10 kinds of people in the world, those that can do binary
> arithmetic and those that can't.
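The partial index idea mentioned above might look something like this (the WHERE clause is purely an example and would have to match the rows the queries actually touch):

    CREATE INDEX index1_recent ON tablename (esn, min, datetime)
        WHERE datetime >= '2002-09-01';

For the I/O measurement, leaving something like "vmstat 5" running during the load and watching the block-in/block-out columns shows how close the load is getting to the disks' limits.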
On 26 Sep 2002 at 10:51, paolo.cassago@talentmanager.c wrote:

> Hi,
> it seems you have to cluster it, I don't think you have another choice.

Hmm.. That didn't occur to me... I guess some real-time clustering like usogres would do, unless it turns out to be a performance hog.. But this is just insert and select; no updates, no deletes (unless the customer makes a 180 degree turn). So I doubt clustering will help. At most I can replicate the data across machines and spread the queries over them: replication overhead as a downside and low query load on each machine as an upside..

> I'm retrieving the configuration of our postgres servers (I'm out of office
> now), so I can send it to you. I was quite desperate about performance, and
> I was thinking to migrate the data to an oracle database. Then I found this
> configuration on the net, and I had a successful increase of performance.

In this case, we are going with postgresql because we/our customer want to keep the costs down..:-) They are even asking now if it's possible to keep hardware costs down as well. That's getting some funny responses here but I digress..

> Maybe this can help you.
>
> Why do you use copy to insert records? I usually use perl scripts, and they
> work well.

Performance reasons. As I said in one of my earlier posts, putting up to 100K records in one transaction in steps of 10K did not reach the performance of copy. As Tom rightly said, it was a 4-1 ratio despite using transactions..

Thanks once again..

Bye
Shridhar

--
Secretary's Revenge: Filing almost everything under "the".
Hi Shridhar,

Shridhar Daithankar wrote:
<snip>
> 3) Any suggestions for runtime, as data load and query will be going in
> parallel.

That sounds unusual. From reading this, it *sounds* like you'll be running queries against an incomplete dataset, or maybe just running the queries that affect the tables loaded thus far (during the initial load).

<snip>
> fsync = true (Sad but true. Left untouched.. Will that make a difference on
> SCSI?)

Definitely. Have directly measured a ~2x tps throughput increase on FreeBSD when leaving fsync off whilst performance measuring stuff recently (PG 7.2.2). Like anything it'll depend on workload, phase of moon, etc, but it's a decent indicator.

<snip>
> Now the questions.
>
> 1) Instead of copying from a single 12GB data file, will a parallel copy from
> say 5 files speed things up?

Not sure yet. Haven't yet done enough performance testing (on the cards very soon though).

> A couple of MB of data per sec. to disk is just not saturating it. It's a RAID 5
> setup..

fsync = off would help during the data load, but it's not a good idea if you're going to be running queries against it at the same time.

Am still getting the hang of performance tuning stuff. Have a bunch of Ultra160 hardware for the Intel platform, and am testing against it as time permits. Not as high end as I'd like, but it's a start.

:-)

Regards and best wishes,

Justin Clift

<snip>
> Bye
> Shridhar

--
"My grandfather once told me that there are two kinds of people: those who work and those who take the credit. He told me to try to be in the first group; there was less competition there."
- Indira Gandhi
On 26 Sep 2002 at 14:05, Shridhar Daithankar wrote:

> Some time back I posted a query about building a site with a 150GB database. In the last
> couple of weeks, lots of things were tested at my place and there are some
> results and again some concerns.

> 2) Creating an index takes a huge amount of time.

> Load time: 14581 sec / ~8600 rows per sec / ~1 MB of data per sec.
> Create unique composite index on 2 char and a timestamp field: 25226 sec.
> Database size on disk: 26GB
> Select query: 1.5 sec. for approx. 150 rows.

I never tried 150GB of data, but I have tried 10GB, and that worked fine for me. Maybe it will help if you post your table schema, including which indexes you use, and the average size of one tuple.
On 26 Sep 2002 at 11:17, Mario Weilguni wrote:

> On 26 Sep 2002 at 14:05, Shridhar Daithankar wrote:
> > Some time back I posted a query about building a site with a 150GB database. In the last
> > couple of weeks, lots of things were tested at my place and there are some
> > results and again some concerns.
>
> > 2) Creating an index takes a huge amount of time.
> > Load time: 14581 sec / ~8600 rows per sec / ~1 MB of data per sec.
> > Create unique composite index on 2 char and a timestamp field: 25226 sec.
> > Database size on disk: 26GB
> > Select query: 1.5 sec. for approx. 150 rows.
>
> I never tried 150GB of data, but I have tried 10GB, and that worked fine for me.
> Maybe it will help if you post your table schema, including which indexes you
> use, and the average size of one tuple.

Well, the test runs were for 10GB of data. The schema is attached; read it in a fixed font. The last nullable fields are dummies but may be used in the future, and varchars are not acceptable (not my requirement). Tuple size is around 100 bytes..

The index creation query was

CREATE INDEX index1 ON tablename (esn, min, datetime);

What if I put datetime ahead? It's likely that the datetime field will have a high degree of locality, being log data..

Bye
Shridhar

--
brain, v: [as in "to brain"] To rebuke bluntly, but not pointedly; to dispel a source of error in an opponent.
-- Ambrose Bierce, "The Devil's Dictionary"

Field Name   Field Type   Nullable   Indexed
type         int          no         no
esn          char(10)     no         yes
min          char(10)     no         yes
datetime     timestamp    no         yes
opc0         char(3)      no         no
opc1         char(3)      no         no
opc2         char(3)      no         no
dpc0         char(3)      no         no
dpc1         char(3)      no         no
dpc2         char(3)      no         no
npa          char(3)      no         no
nxx          char(3)      no         no
rest         char(4)      no         no
field0       int          yes        no
field1       char(4)      yes        no
field2       int          yes        no
field3       char(4)      yes        no
field4       int          yes        no
field5       char(4)      yes        no
field6       int          yes        no
field7       char(4)      yes        no
field8       int          yes        no
field9       char(4)      yes        no
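For reference, a DDL sketch of the indexed part of the attached schema (reconstructed from the field list above; the table name is still a placeholder), together with the datetime-first variant being asked about:

    CREATE TABLE tablename (
        type      int        NOT NULL,
        esn       char(10)   NOT NULL,
        min       char(10)   NOT NULL,
        datetime  timestamp  NOT NULL
        -- ... plus the opc/dpc/npa/nxx/rest/fieldN columns listed above
    );

    CREATE INDEX index1 ON tablename (esn, min, datetime);
    -- datetime-first alternative, under a different name so both can be compared
    CREATE INDEX index1_dt ON tablename (datetime, esn, min);

Which column order wins depends on which columns the queries actually constrain.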
On 26 Sep 2002 at 19:17, Justin Clift wrote:

> Shridhar Daithankar wrote:
> <snip>
> > 3) Any suggestions for runtime, as data load and query will be going in
> > parallel.
>
> That sounds unusual. From reading this, it *sounds* like you'll be
> running queries against an incomplete dataset, or maybe just running the
> queries that affect the tables loaded thus far (during the initial
> load).

That's correct. Load the data so far and keep inserting data as and when it is generated. They don't mind running against the data loaded so far. It's not very accurate stuff IMO...

> > fsync = true (Sad but true. Left untouched.. Will that make a difference on
> > SCSI?)
>
> Definitely. Have directly measured a ~2x tps throughput increase on
> FreeBSD when leaving fsync off whilst performance measuring stuff
> recently (PG 7.2.2). Like anything it'll depend on workload, phase of
> moon, etc, but it's a decent indicator.

I didn't know that even matters with SCSI.. Will check it out..

> fsync = off would help during the data load, but it's not a good idea if
> you're going to be running queries against it at the same time.

That's OK for the reasons mentioned above. It wouldn't be out of place to expect a UPS for such an installation...

Bye
Shridhar

--
Hoare's Law of Large Problems: Inside every large problem is a small problem struggling to get out.
On 26 Sep 2002 at 19:05, Martijn van Oosterhout wrote:

> On Thu, Sep 26, 2002 at 02:05:44PM +0530, Shridhar Daithankar wrote:
> > 1) Database load time from flat file using copy is very high
> > 2) Creating an index takes a huge amount of time.
> > 3) Any suggestions for runtime, as data load and query will be going in
> > parallel.
>
> You're loading all the data in one copy. I find that INSERTs are mostly
> limited by indexes. While index lookups are cheap, they are not free and
> each index needs to be updated for each row.
>
> I found using partial indexes to only index the rows you actually use can
> help with the loading. It's a bit obscure though.
>
> As for parallel loading, you'll be limited mostly by your I/O bandwidth.
> Have you measured it to make sure it's up to speed?

Well, it's like this as of now: CreateDB -> create table -> copy -> create index -> select. So loading is not slowed by an index. As for your hint about vmstat, I will check it out.

> So you're loading at a rate of 860KB per sec. That's not too fast. How many
> indexes are active at that time? Triggers and foreign keys also take their
> toll.

Nothing except the table where the data is loaded..

> fsync IIRC only affects the WAL buffers now, but it may be quite expensive,
> especially considering it's running on every transaction commit. Oh, are your
> WAL files on a separate disk from the data?

No. Same RAID 5 disks..

> It shouldn't. Do you have an idea of what your CPU usage is? ps aux should
> give you a decent idea.

I guess we forgot to monitor system parameters. Next on my list is running vmstat, top and tuning bdflush.

> Find the bottleneck: CPU, I/O or memory?

Understood..

> > Mysql is almost out because it has been creating its index for the last 17 hours. I don't
> > think it will keep up with 5K inserts per sec. with an index. SAP DB is under
> > evaluation too. But postgresql is the favourite as of now because it works. So
> > I need to come up with solutions to problems that will occur in the near future..
> > ;-)
>
> 17 hours! Ouch. Either way, you should be able to do much better. Hope this
> helps,

Heh.. no wonder this evaluation is taking more than 2 weeks.. Mysql was running out of disk space while creating the index and crashing. An upgrade to mysql helped there, but no numbers as yet..

Thanks once again...

Bye
Shridhar

--
Boren's Laws: (1) When in charge, ponder. (2) When in trouble, delegate. (3) When in doubt, mumble.
Shridhar Daithankar wrote:
<snip>
> > > fsync = true (Sad but true. Left untouched.. Will that make a difference on
> > > SCSI?)
> >
> > Definitely. Have directly measured a ~2x tps throughput increase on
> > FreeBSD when leaving fsync off whilst performance measuring stuff
> > recently (PG 7.2.2). Like anything it'll depend on workload, phase of
> > moon, etc, but it's a decent indicator.
>
> I didn't know that even matters with SCSI.. Will check it out..

Cool.

When testing, it had FreeBSD 4.6.2 installed on one drive along with the PostgreSQL 7.2.2 binaries, it had the data on a second drive (mounted as /pgdata), and it had the pg_xlog directory mounted on a third drive. Swap had its own drive as well. Everything is UltraSCSI, etc.

Haven't yet tested for a performance difference through moving the indexes to another drive after creation though. That apparently has the potential to help as well.

:-)

Regards and best wishes,

Justin Clift

--
"My grandfather once told me that there are two kinds of people: those who work and those who take the credit. He told me to try to be in the first group; there was less competition there."
- Indira Gandhi
> Well, the test runs were for 10GB of data. The schema is attached; read it in a fixed
> font. The last nullable fields are dummies but may be used in the future, and varchars
> are not acceptable (not my requirement). Tuple size is around 100 bytes..
> The index creation query was
>
> CREATE INDEX index1 ON tablename (esn, min, datetime);
>
> What if I put datetime ahead? It's likely that the datetime field will have a high
> degree of locality, being log data..

Just an idea: I noticed you use char(10) for esn and min, and use this as an index. Are these really fixed-length fields, all having 10 bytes? Otherwise varchar(10) would be better, because your tables, and especially the indices, will probably be much smaller.

What average length do you have for min and esn?
On Thu, Sep 26, 2002 at 03:01:35PM +0530, Shridhar Daithankar wrote:

> The index creation query was
>
> CREATE INDEX index1 ON tablename (esn, min, datetime);
>
> What if I put datetime ahead? It's likely that the datetime field will have a high
> degree of locality, being log data..

The order of fields depends on what you're using it for. For example, you can use the above index for a query using the conditions:

  esn = 'aaa'
  esn = 'bbb' and min = 'xxx'

but not for queries with only

  datetime = '2002-09-26'
  min = 'ddd' and datetime = '2002-10-02'

The fields can only be used left to right. This is where a single multicolumn index differs from multiple indexes on different columns.

Have you used EXPLAIN ANALYSE to determine whether your indexes are being used optimally?

--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
> There are 10 kinds of people in the world, those that can do binary
> arithmetic and those that can't.
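For example (with made-up values), checking whether the composite index gets picked up:

    EXPLAIN ANALYZE
    SELECT * FROM tablename
    WHERE esn = '1234567890'
      AND min = '0987654321'
      AND datetime BETWEEN '2002-09-26 00:00' AND '2002-09-26 01:00';

An "Index Scan using index1" node in the output means the index is being used; a "Seq Scan" means the predicate can't use it (or the planner decided it wasn't worth it). Note that EXPLAIN ANALYZE needs 7.2 or later.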
Shridhar Daithankar wrote:
>
> On 26 Sep 2002 at 19:05, Martijn van Oosterhout wrote:
<snip>
> > fsync IIRC only affects the WAL buffers now, but it may be quite expensive,
> > especially considering it's running on every transaction commit. Oh, are your
> > WAL files on a separate disk from the data?
>
> No. Same RAID 5 disks..

Not sure if this is a good idea. Would have to think deeply about the controller and drive optimisation/load characteristics.

If it's any help, when I was testing recently with WAL on a separate drive, the WAL logs were doing more reads & writes per second than the main data drive. This would of course be affected by the queries you are running against the database. I was just running Tatsuo's TPC-B stuff, and the OSDB AS3AP tests.

> I guess we forgot to monitor system parameters. Next on my list is running
> vmstat, top and tuning bdflush.

That'll just be the start of it for serious performance tuning and learning how PostgreSQL works.

:)

<snip>
> Thanks once again...
> Bye
> Shridhar

--
"My grandfather once told me that there are two kinds of people: those who work and those who take the credit. He told me to try to be in the first group; there was less competition there."
- Indira Gandhi
On 26 Sep 2002 at 11:50, Mario Weilguni wrote:

> > Well, the test runs were for 10GB of data. The schema is attached; read it in a fixed
> > font. The last nullable fields are dummies but may be used in the future, and varchars
> > are not acceptable (not my requirement). Tuple size is around 100 bytes..
> > The index creation query was
> >
> > CREATE INDEX index1 ON tablename (esn, min, datetime);
> >
> > What if I put datetime ahead? It's likely that the datetime field will have a high
> > degree of locality, being log data..
>
> Just an idea: I noticed you use char(10) for esn and min, and use this as an
> index. Are these really fixed-length fields, all having 10 bytes? Otherwise
> varchar(10) would be better, because your tables, and especially the indices,
> will probably be much smaller.
>
> What average length do you have for min and esn?

10 bytes. Those are id numbers.. like phone numbers, they always have all the digits filled in..

Bye
Shridhar

--
Bradley's Bromide: If computers get too powerful, we can organize them into a committee -- that will do them in.
"Shridhar Daithankar" <shridhar_daithankar@persistent.co.in> writes: > RedHat7.2/PostgreSQL7.1.3 I'd suggest a newer release of Postgres ... 7.1.3 is pretty old ... > Create unique composite index on 2 char and a timestamp field: 25226 sec. What do you mean by "char" exactly? If it's really char(N), how much are you paying in padding space? There are very very few cases where I'd not say to use varchar(N), or text, instead. Also, does it have to be character data? If you could use an integer or float datatype instead the index operations should be faster (though I can't say by how much). Have you thought carefully about the order in which the composite index columns are listed? > sort_mem = 12000 To create an index of this size, you want to push sort_mem as high as it can go without swapping. 12000 sounds fine for the global setting, but in the process that will create the index, try setting sort_mem to some hundreds of megs or even 1Gb. (But be careful: the calculation of space actually used by CREATE INDEX is off quite a bit in pre-7.3 releases :-(. You should probably expect the actual process size to grow to two or three times what you set sort_mem to. Don't let it get so big as to swap.) > wal_buffers = 65536 The above is a complete waste of memory space, which would be better spent on letting the kernel expand its disk cache. There's no reason for wal_buffers to be more than a few dozen. regards, tom lane
"Shridhar Daithankar" <shridhar_daithankar@persistent.co.in> writes: > On 26 Sep 2002 at 11:50, Mario Weilguni wrote: >> Just an idea, I noticed you use char(10) for esn and min, and use this as >> index. Are these really fixed len fields all having 10 bytes? > 10 bytes. Those are id numbers.. like phone numbers always have all the digits > filled in.. If they are numbers, can you store them as bigints instead of char(N)? regards, tom lane
Justin Clift <justin@postgresql.org> writes:
>> On 26 Sep 2002 at 19:05, Martijn van Oosterhout wrote:
>>> fsync IIRC only affects the WAL buffers now, but it may be quite expensive,
>>> especially considering it's running on every transaction commit. Oh, are your
>>> WAL files on a separate disk from the data?

> Not sure if this is a good idea. Would have to think deeply about the
> controller and drive optimisation/load characteristics.

> If it's any help, when I was testing recently with WAL on a separate
> drive, the WAL logs were doing more reads & writes per second than the main
> data drive.

... but way fewer seeks. For anything involving lots of updating transactions (and certainly 5000 separate insertions per second would qualify; can those be batched??), it should be a win to put WAL on its own spindle, just to get locality of access to the WAL.

			regards, tom lane
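Batching, where the feed allows it, just means wrapping a group of inserts in one transaction instead of committing every row, roughly:

    BEGIN;
    INSERT INTO tablename VALUES (...);  -- values elided
    INSERT INTO tablename VALUES (...);
    -- ... a few hundred or a few thousand rows ...
    COMMIT;

so the WAL flush at commit happens once per batch rather than once per row.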
On 26 Sep 2002 at 10:33, Tom Lane wrote:

> "Shridhar Daithankar" <shridhar_daithankar@persistent.co.in> writes:
> > RedHat7.2/PostgreSQL7.1.3
>
> I'd suggest a newer release of Postgres ... 7.1.3 is pretty old ...

I agree.. downloading 7.2.2 right away..

> > Create unique composite index on 2 char and a timestamp field: 25226 sec.
>
> What do you mean by "char" exactly? If it's really char(N), how much
> are you paying in padding space? There are very very few cases where
> I'd not say to use varchar(N), or text, instead. Also, does it have to
> be character data? If you could use an integer or float datatype
> instead, the index operations should be faster (though I can't say by
> how much). Have you thought carefully about the order in which the
> composite index columns are listed?

I have forwarded the idea of putting things into numbers. If it causes a speedup in index lookup/creation, it would do. Looks like bigint is the order of the day..

> > sort_mem = 12000
>
> To create an index of this size, you want to push sort_mem as high as it
> can go without swapping. 12000 sounds fine for the global setting, but
> in the process that will create the index, try setting sort_mem to some
> hundreds of megs or even 1Gb. (But be careful: the calculation of space
> actually used by CREATE INDEX is off quite a bit in pre-7.3 releases
> :-(. You should probably expect the actual process size to grow to two
> or three times what you set sort_mem to. Don't let it get so big as to
> swap.)

Great. I was skeptical about pushing it beyond 100MB. Now I can push it to the corners..

> > wal_buffers = 65536
>
> The above is a complete waste of memory space, which would be better
> spent on letting the kernel expand its disk cache. There's no reason
> for wal_buffers to be more than a few dozen.

That was a rather desperate move. Nothing was improving performance and then we started pushing numbers.. Will get it back.. Same goes for 64 WAL files.. A GB looks like a waste to me..

I might have found the bottleneck, although by accident. Mysql was running out of space while creating its index, so my friend shut down mysql and tried to move things by hand to create links. He noticed that even things like cp were terribly slow, and it hit us: maybe the culprit is the file system, ext3 in this case.

My friend argues for ext2 to eliminate journalling overhead, but I favour reiserfs personally, having used it in pgbench with 10M rows on a paltry 20GB IDE disk for 25 tps..

We will be attempting reiserfs and/or XFS if required. I know how much speed difference exists between reiserfs and ext2. I would not be surprised if everything just starts screaming in one go..

Bye
Shridhar

--
Cropp's Law: The amount of work done varies inversely with the time spent in the office.
On 26 Sep 2002 at 10:42, Tom Lane wrote:

> Justin Clift <justin@postgresql.org> writes:
> > If it's any help, when I was testing recently with WAL on a separate
> > drive, the WAL logs were doing more reads & writes per second than the main
> > data drive.
>
> ... but way fewer seeks. For anything involving lots of updating
> transactions (and certainly 5000 separate insertions per second would
> qualify; can those be batched??), it should be a win to put WAL on its
> own spindle, just to get locality of access to the WAL.

Probably they will be a single transaction. If possible we will bunch more of them together, like 5 seconds of data pushed down in a single transaction, but I'm not sure that's possible. This is a bit like replication, but from a live oracle machine to postgres, from the information I have. So there should be some chance of tuning there..

Bye
Shridhar

--
Langsam's Laws: (1) Everything depends. (2) Nothing is always. (3) Everything is sometimes.
On Thursday 26 September 2002 21:52, Shridhar Daithankar wrote:

> I might have found the bottleneck, although by accident. Mysql was running
> out of space while creating its index, so my friend shut down mysql and tried
> to move things by hand to create links. He noticed that even things like cp
> were terribly slow, and it hit us: maybe the culprit is the file system,
> ext3 in this case.
>
> My friend argues for ext2 to eliminate journalling overhead, but I favour
> reiserfs personally, having used it in pgbench with 10M rows on a paltry 20GB
> IDE disk for 25 tps..
>
> We will be attempting reiserfs and/or XFS if required. I know how much
> speed difference exists between reiserfs and ext2. I would not be surprised
> if everything just starts screaming in one go..

As someone found before, any non-journaling FS is faster than a journaling one, due to the double work done by the FS and the database. Try it on ext2 and compare.

--
Denis
Shridhar Daithankar wrote:
<snip>
> My friend argues for ext2 to eliminate journalling overhead, but I favour
> reiserfs personally, having used it in pgbench with 10M rows on a paltry 20GB IDE
> disk for 25 tps..

If it's any help, the setup I mentioned before with different disks for the data and the WAL files was getting an average of about 72 tps with 200 concurrent users on pgbench. Haven't tuned it in a hard-core way at all, and it only has 256MB DDR RAM in it at the moment (single CPU AthlonXP 1600). These are figures made during the 2.5k+ test runs of pgbench done when developing pg_autotune recently.

As a curiosity point, how predictable are the queries you're going to be running on your database? They sound very simple and very predictable.

The pg_autotune tool might be your friend here. It can deal with arbitrary SQL instead of using Tatsuo's pgbench stuff, and it can also deal with an already loaded database. You'd just have to tweak the names of the tables that it vacuums and the names of the indexes that it reindexes between each run, to get some idea of your overall server performance at different load points.

Probably worth taking a good look at if you're not afraid of editing variables in C code. :)

> We will be attempting reiserfs and/or XFS if required. I know how much speed
> difference exists between reiserfs and ext2. I would not be surprised if
> everything just starts screaming in one go..

We'd all probably be interested to hear this. Added the PostgreSQL "Performance" mailing list to this thread too, Just In Case. (wow that's a lot of cross posting now).

Regards and best wishes,

Justin Clift

> Bye
> Shridhar

--
"My grandfather once told me that there are two kinds of people: those who work and those who take the credit. He told me to try to be in the first group; there was less competition there."
- Indira Gandhi
On 27 Sep 2002 at 1:12, Justin Clift wrote:

> Shridhar Daithankar wrote:
> As a curiosity point, how predictable are the queries you're going to be
> running on your database? They sound very simple and very predictable.

Mostly predictable selects. I'm not a domain expert on telecom, so I'm not very sure. But my guess is the prepare statement in 7.3 should come in pretty handy, i.e. by the time we finish evaluation and test deployment, 7.3 will be out, in the next couple of months or so. So I would recommend doing it the 7.3 way only..

> The pg_autotune tool might be your friend here. It can deal with
> arbitrary SQL instead of using Tatsuo's pgbench stuff, and it can
> also deal with an already loaded database. You'd just have to tweak the
> names of the tables that it vacuums and the names of the indexes that it
> reindexes between each run, to get some idea of your overall server
> performance at different load points.
>
> Probably worth taking a good look at if you're not afraid of editing
> variables in C code. :)

Gladly. We started with altering pgbench here for testing and rapidly settled on perl-generated random queries. Once postgresql wins the evaluation match and things come to implementation, pg_autotune would be a handy tool. Just that I can't do it right now; I have to fight mysql and SAP DB before that..

BTW, any performance figures on SAP DB? People here are already frustrated with it, given the difficulties in setting it up. But still..

> > We will be attempting reiserfs and/or XFS if required. I know how much speed
> > difference exists between reiserfs and ext2. I would not be surprised if
> > everything just starts screaming in one go..
>
> We'd all probably be interested to hear this. Added the PostgreSQL
> "Performance" mailing list to this thread too, Just In Case. (wow that's
> a lot of cross posting now).

I know..;-) Glad that the PG list does not have strict policies like no non-subscriber posting or no attachments.. etc..

IMO reiserfs, though a journalling one, is faster than ext2 etc. because of the way it handles metadata. Personally I haven't come across ext2 being faster than reiserfs on the few machines here for day-to-day use.

I guess I should have a FreeBSD CD handy too.. just to give it a try, if it comes down to a better VM.. though we're using 2.4.19 here.. so it shouldn't matter much..

I will keep you guys posted on the file system stuff... Glad that we have so much flexibility with postgresql..

Bye
Shridhar

--
Bilbo's First Law: You cannot count friends that are all packed up in barrels.
On Thu, 2002-09-26 at 09:52, Shridhar Daithankar wrote:

> My friend argues for ext2 to eliminate journalling overhead, but I favour
> reiserfs personally, having used it in pgbench with 10M rows on a paltry 20GB IDE
> disk for 25 tps..
>
> We will be attempting reiserfs and/or XFS if required. I know how much speed
> difference exists between reiserfs and ext2. I would not be surprised if
> everything just starts screaming in one go..

I'm not sure about reiserfs or ext3, but with XFS you can create your log on another disk. Also worth noting is that you can configure the size and number of log buffers. There are also some other performance-type enhancements you can fiddle with if you don't mind risking time stamp consistency in the event of a crash. If your setup allows for it, you might want to consider using XFS in this configuration.

While I have not personally tried moving XFS' log to another device, I've heard that performance gains can be truly stellar. Assuming memory allows, twiddling with the log buffering is said to allow for large strides in performance as well.

If you do try this, I'd love to hear back about your results and impressions.

Greg
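A sketch of the external-log setup Greg describes (device names are invented; the exact mkfs.xfs and mount options should be checked against the man pages for the kernel in use):

    # make the filesystem with its log on a separate disk
    mkfs.xfs -l logdev=/dev/sdc1,size=32m /dev/sdb1
    # an external log has to be named again at mount time
    mount -t xfs -o logdev=/dev/sdc1 /dev/sdb1 /pgdata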
Shridhar Daithankar wrote:
> I might have found the bottleneck, although by accident. Mysql was running out
> of space while creating its index, so my friend shut down mysql and tried to move
> things by hand to create links. He noticed that even things like cp were
> terribly slow, and it hit us: maybe the culprit is the file system, ext3 in
> this case.

I just added a file system and multi-cpu section to my performance tuning paper:

	http://www.ca.postgresql.org/docs/momjian/hw_performance/

The paper does recommend ext3, but the differences between file systems are very small. If you are seeing 'cp' as slow, I wonder if it may be something more general, like poorly tuned hardware or something. You can use 'dd' to throw some data around the file system and see if that is showing slowness; compare those numbers to another machine that has different hardware/OS.

Also, though ext3 is slower, turning fsync off should make ext3 function similar to ext2. That would be an interesting test if you suspect ext3.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
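A quick dd check of the sort Bruce suggests might look like this (path and sizes are only examples; time the runs and compare the throughput against another box, and use a file larger than RAM if you want the read test to defeat the OS cache):

    # sequential write of ~800MB in 8k blocks
    time dd if=/dev/zero of=/pgdata/ddtest bs=8k count=100000
    # read it back
    time dd if=/pgdata/ddtest of=/dev/null bs=8k
    rm /pgdata/ddtest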
Greg Copeland <greg@CopelandConsulting.Net> writes:

> I'm not sure about reiserfs or ext3, but with XFS you can create your
> log on another disk. Also worth noting is that you can configure
> the size and number of log buffers. There are also some other
> performance-type enhancements you can fiddle with if you don't mind
> risking time stamp consistency in the event of a crash. If your setup
> allows for it, you might want to consider using XFS in this
> configuration.

You can definitely put the ext3 log on a different disk with 2.4 kernels.

Also, if you put the WAL logs on a different disk from the main database, and mount that partition with 'data=writeback' (ie metadata-only journaling), ext3 should be pretty fast, since WAL files are preallocated and there will therefore be almost no metadata updates. You should be able to mount the main database with "data=ordered" (the default) for good performance and reasonable safety.

I think putting WAL on its own disk(s) is one of the keys here.

-Doug
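As a concrete (hypothetical) example of the split Doug describes, with the data directory on one partition and pg_xlog moved onto another:

    # main data area: ordered-data journalling (the ext3 default)
    mount -t ext3 -o data=ordered   /dev/sdb1 /pgdata
    # WAL-only partition: metadata-only journalling
    mount -t ext3 -o data=writeback /dev/sdc1 /pgwal
    # with $PGDATA/pg_xlog replaced by a symlink into /pgwal

or the equivalent entries in /etc/fstab.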
On Thu, 2002-09-26 at 11:41, Bruce Momjian wrote:
> Shridhar Daithankar wrote:
> > I might have found the bottleneck, although by accident. Mysql was running out
> > of space while creating its index, so my friend shut down mysql and tried to move
> > things by hand to create links. He noticed that even things like cp were
> > terribly slow, and it hit us: maybe the culprit is the file system, ext3 in
> > this case.
>
> I just added a file system and multi-cpu section to my performance
> tuning paper:
>
> 	http://www.ca.postgresql.org/docs/momjian/hw_performance/
>
> The paper does recommend ext3, but the differences between file systems
> are very small. If you are seeing 'cp' as slow, I wonder if it may be
> something more general, like poorly tuned hardware or something. You can
> use 'dd' to throw some data around the file system and see if that is
> showing slowness; compare those numbers to another machine that has
> different hardware/OS.
>
> Also, though ext3 is slower, turning fsync off should make ext3 function
> similar to ext2. That would be an interesting test if you suspect ext3.

I'm curious as to why you recommended ext3 versus some other (JFS, XFS). Do you have tests which validate that recommendation or was it a simple matter of getting the warm fuzzies from familiarity?

Greg
On Thu, 2002-09-26 at 11:41, Bruce Momjian wrote:
> Shridhar Daithankar wrote:
> > I might have found the bottleneck, although by accident. Mysql was running out
> > of space while creating its index, so my friend shut down mysql and tried to move
> > things by hand to create links. He noticed that even things like cp were
> > terribly slow, and it hit us: maybe the culprit is the file system, ext3 in
> > this case.
>
> I just added a file system and multi-cpu section to my performance
> tuning paper:
>
> 	http://www.ca.postgresql.org/docs/momjian/hw_performance/
>
> The paper does recommend ext3, but the differences between file systems
> are very small. If you are seeing 'cp' as slow, I wonder if it may be
> something more general, like poorly tuned hardware or something. You can
> use 'dd' to throw some data around the file system and see if that is
> showing slowness; compare those numbers to another machine that has
> different hardware/OS.

That's a good point. Also, if you're using IDE, you do need to verify that you're using DMA and the proper PIO mode if at all possible. Also, big performance improvements can be seen by making sure your IDE bus speed has been properly configured. The drivetweak-gtk and hdparm utilities can make a huge difference in performance. Just be sure you know what the heck you're doing when you mess with those.

Greg
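For example (IDE drives only, and the hdparm man page is worth reading first since some flags can hang a drive):

    hdparm -d1 -c1 /dev/hda     # turn on DMA and 32-bit I/O support
    hdparm -tT /dev/hda         # buffered and cached read timings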
If you are seeing very slow performance on a drive set, check dmesg to see if you're getting SCSI bus errors or something similar. If your drives aren't properly terminated then the performance will suffer a great deal.
Greg Copeland wrote:
> > The paper does recommend ext3, but the differences between file systems
> > are very small. If you are seeing 'cp' as slow, I wonder if it may be
> > something more general, like poorly tuned hardware or something. You can
> > use 'dd' to throw some data around the file system and see if that is
> > showing slowness; compare those numbers to another machine that has
> > different hardware/OS.
> >
> > Also, though ext3 is slower, turning fsync off should make ext3 function
> > similar to ext2. That would be an interesting test if you suspect ext3.
>
> I'm curious as to why you recommended ext3 versus some other (JFS,
> XFS). Do you have tests which validate that recommendation or was it a
> simple matter of getting the warm fuzzies from familiarity?

I used the attached email as a reference. I just changed the wording to be:

	File system choice is particularly difficult on Linux because there
	are so many file system choices, and none of them are optimal: ext2
	is not entirely crash-safe, ext3 and xfs are journal-based, and
	Reiser is optimized for small files. Fortunately, the journaling
	file systems aren't significantly slower than ext2 so they are
	probably the best choice.

so I don't specifically recommend ext3 anymore. As I remember, ext3 is good only in that it can read ext2 file systems. I think XFS may be the best bet.

Can anyone clarify if "data=writeback" is safe for PostgreSQL. Specifically, are the data files recovered properly or is this option only for a filesystem containing WAL?

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
Bruce Momjian <pgman@candle.pha.pa.us> writes:
> The paper does recommend ext3, but the differences between file systems
> are very small.

Well, I only did a very rough benchmark (a few runs of pgbench), but the results I found were drastically different: ext2 was significantly faster (~50%) than ext3-writeback, which was in turn significantly faster (~25%) than ext3-ordered.

> Also, though ext3 is slower, turning fsync off should make ext3 function
> similar to ext2.

Why would that be?

Cheers,

Neil

--
Neil Conway <neilc@samurai.com> || PGP Key ID: DB3C29FC
Neil Conway wrote:
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > The paper does recommend ext3, but the differences between file systems
> > are very small.
>
> Well, I only did a very rough benchmark (a few runs of pgbench), but
> the results I found were drastically different: ext2 was significantly
> faster (~50%) than ext3-writeback, which was in turn significantly
> faster (~25%) than ext3-ordered.

Wow. That leaves no good Linux file system alternatives. PostgreSQL just wants an ordinary file system that has reliable recovery from a crash.

> > Also, though ext3 is slower, turning fsync off should make ext3 function
> > similar to ext2.
>
> Why would that be?

I assumed it was the double fsync for the normal and journal that made the journalling file systems slog.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
I have seen various benchmarks where XFS seems to perform best when it comes to huge amounts of data and many files (due to balanced internal b+ trees). also, XFS seems to be VERY mature and very stable. ext2/3 don't seem to be that fast in most of the benchmarks.

i did some testing with reiser some time ago. the problem is that it seems to restore a very historic consistent snapshot of the data. XFS seems to be much better in this respect.

i have not tested JFS yet (but on this damn AIX beside me)
from my point of view i strongly recommend XFS (maybe somebody from RedHat should think about it).

	Hans

Neil Conway wrote:
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > The paper does recommend ext3, but the differences between file systems
> > are very small.
>
> Well, I only did a very rough benchmark (a few runs of pgbench), but
> the results I found were drastically different: ext2 was significantly
> faster (~50%) than ext3-writeback, which was in turn significantly
> faster (~25%) than ext3-ordered.
>
> > Also, though ext3 is slower, turning fsync off should make ext3 function
> > similar to ext2.
>
> Why would that be?
>
> Cheers,
>
> Neil

--
Cybertec Geschwinde u Schoenig
Ludo-Hartmannplatz 1/14, A-1160 Vienna, Austria
Tel: +43/1/913 68 09; +43/664/233 90 75
www.postgresql.at, cluster.postgresql.at, www.cybertec.at, kernel.cybertec.at
Neil Conway wrote:
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > The paper does recommend ext3, but the differences between file systems
> > are very small.
>
> Well, I only did a very rough benchmark (a few runs of pgbench), but
> the results I found were drastically different: ext2 was significantly
> faster (~50%) than ext3-writeback, which was in turn significantly
> faster (~25%) than ext3-ordered.
>
> > Also, though ext3 is slower, turning fsync off should make ext3 function
> > similar to ext2.
>
> Why would that be?

OK, I changed the text to:

	File system choice is particularly difficult on Linux because there
	are so many file system choices, and none of them are optimal: ext2
	is not entirely crash-safe, ext3, xfs, and jfs are journal-based,
	and Reiser is optimized for small files and does journalling. The
	journalling file systems can be significantly slower than ext2 but
	when crash recovery is required, ext2 isn't an option.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
Bruce Momjian <pgman@candle.pha.pa.us> writes:
> Wow. That leaves no good Linux file system alternatives.
> PostgreSQL just wants an ordinary file system that has reliable
> recovery from a crash.

I'm not really familiar with the reasoning behind ext2's reputation as recovering poorly from crashes; if we fsync a WAL record to disk before we lose power, can't we recover reliably, even with ext2?

> > > Also, though ext3 is slower, turning fsync off should make ext3
> > > function similar to ext2.
> >
> > Why would that be?
>
> I assumed it was the double fsync for the normal and journal that
> made the journalling file systems slog.

Well, a journalling file system would need to write a journal entry and flush that to disk, even if fsync is disabled -- whereas without fsync enabled, ext2 doesn't have to flush anything to disk. ISTM that the performance advantage of ext2 over ext3 should be even larger when fsync is not enabled.

Cheers,

Neil

--
Neil Conway <neilc@samurai.com> || PGP Key ID: DB3C29FC
I tend to agree with this, though I have nothing to back it up with. My impression is that XFS does very well for large files. Accepting that as fact(?), my impression is that XFS historically does well for databases. Again, I have nothing to back that up other than hear-say and conjecture.

Greg

On Thu, 2002-09-26 at 15:55, Hans-Jürgen Schönig wrote:
> I have seen various benchmarks where XFS seems to perform best when it
> comes to huge amounts of data and many files (due to balanced internal
> b+ trees).
> also, XFS seems to be VERY mature and very stable.
> ext2/3 don't seem to be that fast in most of the benchmarks.
>
> i did some testing with reiser some time ago. the problem is that it
> seems to restore a very historic consistent snapshot of the data. XFS
> seems to be much better in this respect.
>
> i have not tested JFS yet (but on this damn AIX beside me)
> from my point of view i strongly recommend XFS (maybe somebody from
> RedHat should think about it).
>
> Hans
>
> Neil Conway wrote:
> > Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > > The paper does recommend ext3, but the differences between file systems
> > > are very small.
> >
> > Well, I only did a very rough benchmark (a few runs of pgbench), but
> > the results I found were drastically different: ext2 was significantly
> > faster (~50%) than ext3-writeback, which was in turn significantly
> > faster (~25%) than ext3-ordered.
> >
> > > Also, though ext3 is slower, turning fsync off should make ext3 function
> > > similar to ext2.
> >
> > Why would that be?
> >
> > Cheers,
> >
> > Neil
>
> --
> Cybertec Geschwinde u Schoenig
> Ludo-Hartmannplatz 1/14, A-1160 Vienna, Austria
> Tel: +43/1/913 68 09; +43/664/233 90 75
> www.postgresql.at, cluster.postgresql.at, www.cybertec.at, kernel.cybertec.at
Neil Conway wrote:
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > Wow. That leaves no good Linux file system alternatives.
> > PostgreSQL just wants an ordinary file system that has reliable
> > recovery from a crash.
>
> I'm not really familiar with the reasoning behind ext2's reputation as
> recovering poorly from crashes; if we fsync a WAL record to disk
> before we lose power, can't we recover reliably, even with ext2?
>
> > > > Also, though ext3 is slower, turning fsync off should make ext3
> > > > function similar to ext2.
> > >
> > > Why would that be?
> >
> > I assumed it was the double fsync for the normal and journal that
> > made the journalling file systems slog.
>
> Well, a journalling file system would need to write a journal entry
> and flush that to disk, even if fsync is disabled -- whereas without
> fsync enabled, ext2 doesn't have to flush anything to disk. ISTM that
> the performance advantage of ext2 over ext3 should be even larger
> when fsync is not enabled.

Yes, it is still double-writing. I just thought that if that wasn't happening while the db was waiting for a commit, it wouldn't be too bad.

Is it just me or do all the Linux file systems seem like they are lacking something where PostgreSQL is concerned? We just want a UFS-like file system on Linux and no one has it.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
On Thu, 2002-09-26 at 16:03, Neil Conway wrote:
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > Wow. That leaves no good Linux file system alternatives.
> > PostgreSQL just wants an ordinary file system that has reliable
> > recovery from a crash.
>
> I'm not really familiar with the reasoning behind ext2's reputation as
> recovering poorly from crashes; if we fsync a WAL record to disk
> before we lose power, can't we recover reliably, even with ext2?

Well, I have experienced data loss from ext2 before. Also, recovery from crashes on large file systems takes a very, very long time. I can't imagine anyone running a production database on an ext2 file system having 10's or even 100's of GB. Ouch. Recovery would take forever!

Even recovery on small file systems (2-8G) can take extended periods of time, especially so on IDE systems. Even then manual intervention is not uncommon.

While I can't say that x, y or z is the best FS to use on Linux, I can say that ext2 is probably an exceptionally poor choice from a reliability and/or uptime perspective.

Greg
Greg Copeland <greg@CopelandConsulting.Net> writes:
> On Thu, 2002-09-26 at 16:03, Neil Conway wrote:
> > I'm not really familiar with the reasoning behind ext2's
> > reputation as recovering poorly from crashes; if we fsync a WAL
> > record to disk before we lose power, can't we recover reliably,
> > even with ext2?
>
> Well, I have experienced data loss from ext2 before. Also, recovery
> from crashes on large file systems takes a very, very long time.

Yes, but wouldn't you face exactly the same issues if you ran a UFS-like filesystem in asynchronous mode? Granted, it's not the default, but performance in synchronous mode is usually pretty poor.

The fact that ext2 defaults to asynchronous mode and UFS (at least on the BSDs) defaults to synchronous mode seems like a total non-issue to me. Is there any more to the alleged difference in reliability?

Cheers,

Neil

--
Neil Conway <neilc@samurai.com> || PGP Key ID: DB3C29FC
Bruce Momjian <pgman@candle.pha.pa.us> writes:
> Can anyone clarify if "data=writeback" is safe for PostgreSQL.
> Specifically, are the data files recovered properly or is this option
> only for a filesystem containing WAL?

"data=writeback" means that no data is journaled, just metadata (which is like XFS or Reiser). An fsync() call should still do what it normally does, commit the writes to disk before returning.

"data=journal" journals all data and is the slowest and safest.

"data=ordered" writes out data blocks before committing a journal transaction, which is faster than full data journaling (since data doesn't get written twice) and almost as safe.

"data=writeback" is noted to keep obsolete data in the case of some crashes (since the data may not have been written yet) but a completed fsync() should ensure that the data is valid.

So I guess I'd probably use data=ordered for an all-on-one-fs installation, and data=writeback for a WAL-only drive.

Hope this helps...

-Doug
Neil Conway <neilc@samurai.com> writes:
> I'm not really familiar with the reasoning behind ext2's reputation as
> recovering poorly from crashes; if we fsync a WAL record to disk
> before we lose power, can't we recover reliably, even with ext2?

Up to a point. We do assume that the filesystem won't lose checkpointed (sync'd) writes to data files. To the extent that the filesystem is vulnerable to corruption of its own metadata for a file (indirect blocks or whatever ext2 uses), that's not a completely safe assumption.

We'd be happiest with a filesystem that journals its own metadata and not the user data in the file(s). I dunno if there are any.

Hmm, maybe this is why Oracle likes doing their own filesystem on a raw device...

			regards, tom lane
Tom Lane <tgl@sss.pgh.pa.us> writes:
> We'd be happiest with a filesystem that journals its own metadata and
> not the user data in the file(s). I dunno if there are any.

ext3 with data=writeback? (See my previous message to Bruce).

-Doug
Neil Conway wrote:
> Greg Copeland <greg@CopelandConsulting.Net> writes:
> > On Thu, 2002-09-26 at 16:03, Neil Conway wrote:
> > > I'm not really familiar with the reasoning behind ext2's
> > > reputation as recovering poorly from crashes; if we fsync a WAL
> > > record to disk before we lose power, can't we recover reliably,
> > > even with ext2?
> >
> > Well, I have experienced data loss from ext2 before. Also, recovery
> > from crashes on large file systems takes a very, very long time.
>
> Yes, but wouldn't you face exactly the same issues if you ran a
> UFS-like filesystem in asynchronous mode? Granted, it's not the default,
> but performance in synchronous mode is usually pretty poor.

Yes, before UFS had soft updates, the synchronous nature of UFS made it slower than ext2, but now with soft updates that performance difference is gone, so you have two file systems, ext2 and ufs, with similar performance, but one is crash-safe and the other is not.

And, when comparing the journalling file systems, you have UFS vs. XFS/ext3/JFS/Reiser, and UFS is faster. The only thing the journalling file systems give you is more rapid reboot, but frankly, if your OS goes down often enough that this is an issue, you have bigger problems than fsync time.

The big problem is that Linux went from non-crash-safe right to crash-safe and reboot-quick. We need a middle ground, which is where UFS/soft updates is.

> The fact that ext2 defaults to asynchronous mode and UFS (at least on
> the BSDs) defaults to synchronous mode seems like a total non-issue to
> me. Is there any more to the alleged difference in reliability?

The reliability problem isn't alleged. The ext2 developers admit ext2 isn't 100% crash-safe. They will say it is usually crash-safe, but that isn't good enough for PostgreSQL. I wish I was wrong.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
Doug McNaught wrote: > Tom Lane <tgl@sss.pgh.pa.us> writes: > > > We'd be happiest with a filesystem that journals its own metadata and > > not the user data in the file(s). I dunno if there are any. > > ext3 with data=writeback? (See my previous message to Bruce). OK, so that makes ext3 crash safe without lots of overhead? -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001 + If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania 19073
On Thu, 2002-09-26 at 17:39, Bruce Momjian wrote: > Neil Conway wrote: > > Greg Copeland <greg@CopelandConsulting.Net> writes: > > > On Thu, 2002-09-26 at 16:03, Neil Conway wrote: > > > > I'm not really familiar with the reasoning behind ext2's > > > > reputation as recovering poorly from crashes; if we fsync a WAL > > > > record to disk before we lose power, can't we recover reliably, > > > > even with ext2? > > > > > > Well, I have experienced data loss from ext2 before. Also, recovery > > > from crashes on large file systems takes a very, very long time. > > > > Yes, but wouldn't you face exactly the same issues if you ran a > > UFS-like filesystem in asynchronous mode? Albeit it's not the default, > > but performance in synchronous mode is usually pretty poor. > > Yes, before UFS had soft updates, the synchronous nature of UFS made it > slower than ext2, but now with soft updates, that performance difference > is gone, so you have two file systems, ext2 and UFS, with similar performance, > but one is crash-safe and the other is not. Not entirely true. UFS is both crash-safe and quick-rebootable. You do need to fsck at some point, but not prior to mounting it. Any corrupt blocks are empty, and are easy to avoid. Someone just needs to implement a background fsck that will run on a mounted filesystem. -- Rod Taylor
Rod Taylor wrote: > > Yes, before UFS had soft updates, the synchronous nature of UFS made it > > slower than ext2, but now with soft updates, that performance difference > > is gone so you have two files systems, ext2 and ufs, similar peformance, > > but one is crash-safe and the other is not. > > Note entirely true. ufs is both crash-safe and quick-rebootable. You > do need to fsck at some point, but not prior to mounting it. Any > corrupt blocks are empty, and are easy to avoid. I am assuming you need to mount the drive as part of the reboot. Of course you can boot fast with any file system if you don't have to mount it. :-) -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001 + If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania 19073
On Thu, 2002-09-26 at 17:47, Bruce Momjian wrote: > Rod Taylor wrote: > > > Yes, before UFS had soft updates, the synchronous nature of UFS made it > > > slower than ext2, but now with soft updates, that performance difference > > > is gone so you have two files systems, ext2 and ufs, similar peformance, > > > but one is crash-safe and the other is not. > > > > Note entirely true. ufs is both crash-safe and quick-rebootable. You > > do need to fsck at some point, but not prior to mounting it. Any > > corrupt blocks are empty, and are easy to avoid. > > I am assuming you need to mount the drive as part of the reboot. Of > course you can boot fast with any file system if you don't have to mount > it. :-) Sorry, poor explanation. Background fsck (when implemented) would operate on a currently mounted (and active) file system. The only reason fsck is required prior to reboot now is because no-one had done the work. http://www.freebsd.org/cgi/man.cgi?query=fsck&sektion=8&manpath=FreeBSD+5.0-current See the first paragraph of the above. -- Rod Taylor
Rod Taylor wrote: > On Thu, 2002-09-26 at 17:47, Bruce Momjian wrote: > > Rod Taylor wrote: > > > > Yes, before UFS had soft updates, the synchronous nature of UFS made it > > > > slower than ext2, but now with soft updates, that performance difference > > > > is gone so you have two files systems, ext2 and ufs, similar peformance, > > > > but one is crash-safe and the other is not. > > > > > > Note entirely true. ufs is both crash-safe and quick-rebootable. You > > > do need to fsck at some point, but not prior to mounting it. Any > > > corrupt blocks are empty, and are easy to avoid. > > > > I am assuming you need to mount the drive as part of the reboot. Of > > course you can boot fast with any file system if you don't have to mount > > it. :-) > > Sorry, poor explanation. > > Background fsck (when implemented) would operate on a currently mounted > (and active) file system. The only reason fsck is required prior to > reboot now is because no-one had done the work. > > http://www.freebsd.org/cgi/man.cgi?query=fsck&sektion=8&manpath=FreeBSD+5.0-current > > See the first paragraph of the above. Oh, yes, I have heard of that missing feature. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001 + If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania 19073
Bruce Momjian <pgman@candle.pha.pa.us> writes: > Doug McNaught wrote: > > Tom Lane <tgl@sss.pgh.pa.us> writes: > > > > > We'd be happiest with a filesystem that journals its own metadata and > > > not the user data in the file(s). I dunno if there are any. > > > > ext3 with data=writeback? (See my previous message to Bruce). > > OK, so that makes ext3 crash safe without lots of overhead? Metadata is journaled, so you shouldn't lose data blocks or directory entries. Some data blocks (that haven't been fsync()'ed) may have old or wrong data in them, but I think that's the same as UFS, right? And WAL replay should take care of that. It'd be very interesting to do some tests of the various journaling modes. I have an old K6 that I might be able to turn into a hit-the-reset-switch-at-random-times machine. What kind of tests should be run? -Doug
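One possible test along those lines, sketched with invented paths, a made-up logger host, and sync standing in for a per-file fsync(): write numbered records, flush each one, and tell a second machine the last number flushed; after hitting reset, every number the second machine received must still be in the file.

    i=0
    while true; do
        echo "record $i" >> /mnt/test/records
        sync                                  # crude stand-in for fsync() on the file
        echo "$i" | nc -w 1 logger-host 9999  # last record believed durable
        i=$((i + 1))
    done
    # After the reset: the highest number logger-host received must appear
    # in /mnt/test/records, or the filesystem/fsync path lost acknowledged data.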
Doug McNaught <doug@wireboard.com> writes: > "data=writeback" means that no data is journaled, just metadata (which > is like XFS or Reiser). An fsync() call should still do what it > normally does, commit the writes to disk before returning. > "data=journal" journals all data and is the slowest and safest. > "data=ordered" writes out data blocks before committing a journal > transaction, which is faster than full data journaling (since data > doesn't get written twice) and almost as safe. "data=writeback" is > noted to keep obsolete data in the case of some crashes (since the > data may not have been written yet) but a completed fsync() should > ensure that the data is valid. Thanks for the explanation. > So I guess I'd probably use data=ordered for an all-on-one-fs > installation, and data=writeback for a WAL-only drive. Actually I think the ideal thing for Postgres would be data=writeback for both data and WAL drives. We can handle loss of un-fsync'd data for ourselves in both cases. Of course, if you keep anything besides Postgres data files on a partition, you'd possibly want the more secure settings. regards, tom lane
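In /etc/fstab terms, the layout Tom describes might look something like the lines below; the devices and mount points are placeholders, and the assumption is that both partitions hold nothing but PostgreSQL files.

    /dev/sda5   /var/lib/pgsql/data           ext3   data=writeback,noatime   1 2
    /dev/sdb1   /var/lib/pgsql/data/pg_xlog   ext3   data=writeback,noatime   1 2

Anything else on the machine would stay on partitions using the safer defaults, per Tom's caveat about non-Postgres files.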
Hello! On Thu, 26 Sep 2002, Bruce Momjian wrote: > > I'm not really familiar with the reasoning behind ext2's reputation as > > recovering poorly from crashes; if we fsync a WAL record to disk On relatively big volumes, ext2 recovery can end up with reformatting the fs under certain circumstances. ;-) > > > I assumed it was the double fsync for the normal and journal that > > > made the journalling file systems slog. > > > > Well, a journalling file system would need to write a journal entry > > and flush that to disk, even if fsync is disabled -- whereas without > > fsync enabled, ext2 doesn't have to flush anything to disk. ISTM that > > the performance advantage of ext2 over ext3 should be even larger > > when fsync is not enabled. > > Yes, it is still double-writing. I just thought that if that wasn't > happening while the db was waiting for a commit that it wouldn't be too > bad. > > Is it just me or do all the Linux file systems seem like they are > lacking something where PostgreSQL is concerned? We just want a UFS-like > file system on Linux and no one has it. mount -o sync an ext2 volume on Linux - and you get a "UFS-like" fs. :) mount -o async an FFS volume on FreeBSD - and you get a boost in fs performance. Personally, I always mount the ext2 fs where Pg lives with the sync option. Fsync in pg has been off (since 6.3); this setup has successfully survived a few serious crashes on various systems (mostly power problems). If fsync is on in Pg, performance gets so-oh-oh-oh-oh slow! =) I have just upgraded from a 2.2 kernel on ext2 to an ext3-capable 2.4 one, so I'm planning to do some benchmarking. Roughly speaking, without benchmarks, performance has degraded in about a 2/3 proportion. "But better safe than sorry." -- WBR, Yury Bokhoncovich, Senior System Administrator, NOC of F1 Group. Phone: +7 (3832) 106228, ext.140, E-mail: byg@center-f1.ru. Unix is like a wigwam -- no Gates, no Windows, and an Apache inside.
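Spelled out, Yury's arrangement amounts to roughly the following; the device and paths are placeholders, and fsync = false is his choice as described above, not a general recommendation from this thread.

    # /etc/fstab: ext2 data area mounted fully synchronous
    /dev/sda6   /var/lib/pgsql   ext2   sync   1 2

    # postgresql.conf: rely on the synchronous mount instead of per-commit fsync
    fsync = false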
Tom Lane <tgl@sss.pgh.pa.us> writes: > We'd be happiest with a filesystem that journals its own metadata and > not the user data in the file(s). I dunno if there are any. Most journalling file systems work this way. Data journalling is not very widespread, AFAIK. -- Florian Weimer Weimer@CERT.Uni-Stuttgart.DE University of Stuttgart http://CERT.Uni-Stuttgart.DE/people/fw/ RUS-CERT fax +49-711-685-5898
Has there been any thought of providing RAW disk support to bypass the fs? -----Original Message----- From: pgsql-hackers-owner@postgresql.org [mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Bruce Momjian Sent: Thursday, September 26, 2002 3:57 PM To: Neil Conway Cc: shridhar_daithankar@persistent.co.in; pgsql-hackers@postgresql.org; pgsql-general@postgresql.org Subject: Re: [HACKERS] [GENERAL] Performance while loading data and indexing Neil Conway wrote: > Bruce Momjian <pgman@candle.pha.pa.us> writes: > > The paper does recommend ext3, but the differences between file systems > > are very small. > > Well, I only did a very rough benchmark (a few runs of pgbench), but > the results I found were drastically different: ext2 was significantly > faster (~50%) than ext3-writeback, which was in turn significantly > faster (~25%) than ext3-ordered. > > > Also, though ext3 is slower, turning fsync off should make ext3 function > > similar to ext2. > > Why would that be? OK, I changed the text to: File system choice is particularly difficult on Linux because there are so many file system choices, and none of them are optimal: ext2 is not entirely crash-safe, ext3, xfs, and jfs are journal-based, and Reiser is optimized for small files and does journalling. The journalling file systems can be significantly slower than ext2 but when crash recovery is required, ext2 isn't an option. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001 + If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania 19073
Some of you may be interested in this seemingly exhaustive benchmark between ext2/3, ReiserFS, JFS, and XFS. http://www.osdl.org/presentations/lwe-jgfs.pdf
Hey, excellent. Thanks! Based on that, it appears that XFS is a pretty good FS to use. For me, the real surprise was how well reiserfs performed. Greg On Thu, 2002-10-03 at 18:09, Mike Benoit wrote: > Some of you may be interested in this seemingly exhaustive benchmark > between ext2/3, ReiserFS, JFS, and XFS. > > http://www.osdl.org/presentations/lwe-jgfs.pdf
Greg Copeland wrote: > Hey, excellent. Thanks! > > Based on that, it appears that XFS is a pretty good FS to use. For me, > the real surprise was how well reiserfs performed. > OK, hardware performance paper updated: --------------------------------------------------------------------------- File system choice is particularly difficult on Linux because there are so many file system choices, and none of them are optimal: ext2 is not entirely crash-safe, ext3, xfs, and jfs are journal-based, and Reiser is optimized for small files and does journalling. The journalling file systems can be significantly slower than ext2 but when crash recovery is required, ext2 isn't an option. If ext2 must be used, mount it with sync enabled. Some people recommend xfs or an ext3 filesystem mounted with data=writeback. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001 + If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania 19073
On 26 Sep 2002, Neil Conway wrote: > The fact that ext2 defaults to asynchronous mode and UFS (at least on > the BSDs) defaults to synchronous mode seems like a total non-issue to > me. Is there any more to the alleged difference in reliability? It was sort of pointed out here, but perhaps not made completely clear, that Berkeley FFS defaults to synchronous meta-data updates, but asynchronous data updates. You can also specify entirely synchronous or entirely asynchronous updates. Linux ext2fs supports only these last two modes, which is the problem. cjs -- Curt Sampson <cjs@cynic.net> +81 90 7737 2974 http://www.netbsd.org Don't you know, in this new Dark Age, we're all light. --XTC
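Stated as mount commands (device names and mount points invented), the contrast Curt describes looks like this:

    # FreeBSD FFS: three behaviours are available
    mount -t ufs          /dev/da0s1e /pgdata   # default: sync metadata, async data
    mount -t ufs -o sync  /dev/da0s1e /pgdata   # everything synchronous
    mount -t ufs -o async /dev/da0s1e /pgdata   # everything asynchronous

    # Linux ext2: only the last two behaviours exist
    mount -t ext2 -o async /dev/sda5 /pgdata    # the default
    mount -t ext2 -o sync  /dev/sda5 /pgdata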