Thread: Re: Configuration Advice
Doesn't sound like you want postgres at all.... Try mysql.

-----Original Message-----
From: "Steve" <cheetah@tanabi.org>
To: pgsql-performance@postgresql.org
Sent: 1/17/2007 2:41 PM
Subject: [PERFORM] Configuration Advice

Hey there;

I've been lurking on this list awhile, and I've been working with postgres for a number of years so I'm not exactly new to this. But I'm still having trouble getting a good balance of settings and I'd like to see what other people think. We may also be willing to hire a contractor to help tackle this problem if anyone is interested.

I've got an application here that runs large (in terms of length -- the queries have a lot of conditions in them) queries that can potentially return millions of rows but on average probably return tens of thousands of rows. It's read only for most of the day, and pretty much all the queries except one are really fast.

However, each night we load data from a legacy cobol system into the SQL system and then we summarize that data to make the reports faster. This load process is intensely insert/update driven but also has a hefty amount of selects as well. This load process is taking ever longer to complete.

SO ... our goal here is to make this load process take less time. It seems the big part is building the big summary table; this big summary table is currently 9 million rows big. Every night, we drop the table, re-create it, build the 9 million rows of data (we use COPY to put the data in when it's prepared, not INSERT), and then build the indexes on it -- of which there are many. Unfortunately this table gets queried in a lot of different ways and needs these indexes; also unfortunately, we have operator class indexes to support both ASC and DESC sorting on columns, so these are for all intents and purposes duplicates but required under Postgres 8.1 (we've recently upgraded to Postgres 8.2; is this still a requirement?)

Building these indexes takes forever! It's a long grind through inserts, and then building the indexes takes a hefty amount of time too (about 9 hours). Now, the application is likely part at fault, and we're working to make it more efficient, but it has nothing to do with the index building time. I'm wondering what we can do to make this better, if anything; would it be better to leave the indexes on? It doesn't seem to be. Would it be better to use INSERTs instead of COPY? Doesn't seem to be.

Anyway -- ANYTHING we can do to make this go faster is appreciated :) Here's some vital statistics:

- Machine is a 16 GB, 4 actual CPU dual-core opteron system using SCSI discs. The disc configuration seems to be a good one; it's the best of all the ones we've tested so far.

- The load process itself takes about 6 gigs of memory; the rest is free for postgres because this is basically all the machine does.

- If this was your machine and situation, how would you lay out the memory settings? What would you set the FSM to? Would you leave the bgwriter on or off? We've already got FSYNC off because "data integrity" doesn't matter -- this stuff is religiously backed up and we've got no problem reinstalling it. Besides, in order for this machine to go down, data integrity of the DB is the least of the worries :)

Do wal_buffers/full_page_writes matter if FSYNC is off? If so, what settings? What about checkpoints?

And finally, any ideas on planner constants?
Here's what I'm using:

seq_page_cost = 0.5                  # measured on an arbitrary scale
random_page_cost = 1.0               # same scale as above
cpu_tuple_cost = 0.001               # same scale as above
cpu_index_tuple_cost = 0.0001        # same scale as above
cpu_operator_cost = 0.00025          # same scale as above
effective_cache_size = 679006

I really don't remember how I came up with that effective_cache_size number.... Anyway... any advice would be appreciated :)

Steve
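P.S. To make it concrete, here's roughly what the nightly rebuild looks like -- the table, column, file and index names below are placeholders, not our real schema, and the maintenance_work_mem value is only an example of one knob we could raise for the index builds:

BEGIN;
DROP TABLE IF EXISTS summary;                -- IF EXISTS needs 8.2+
CREATE TABLE summary (col_a integer, col_b integer, col_c text);  -- real table has many more columns
COPY summary FROM '/path/to/prepared_data.copy';   -- bulk load of the prepared rows
COMMIT;

SET maintenance_work_mem = '1GB';            -- give each CREATE INDEX a big sort buffer
CREATE INDEX summary_col_a_idx ON summary (col_a);
CREATE INDEX summary_col_b_idx ON summary (col_b);
-- ... many more indexes ...
ANALYZE summary;                             -- refresh planner statistics after the load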
Adam Rich wrote:
> Doesn't sound like you want postgres at all.... Try mysql.

Could you explain your reason for suggesting mysql? I'm simply curious why you would offer that as a solution.
Bricklen Anderson wrote:
> Adam Rich wrote:
>> Doesn't sound like you want postgres at all.... Try mysql.
>
> Could you explain your reason for suggesting mysql? I'm simply curious
> why you would offer that as a solution.

He sounds a little trollish to me. I would refer to the other, actually helpful, posts on the topic.

Sincerely,

Joshua D. Drake

--
=== The PostgreSQL Company: Command Prompt, Inc. ===
Sales/Support: +1.503.667.4564 || 24x7/Emergency: +1.800.492.2240
Providing the most comprehensive PostgreSQL solutions since 1997
http://www.commandprompt.com/

Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate
PostgreSQL Replication: http://www.commandprompt.com/products/
> From: "Steve" <cheetah@tanabi.org> > To: pgsql-performance@postgresql.org > Sent: 1/17/2007 2:41 PM > Subject: [PERFORM] Configuration Advice > > SO ... our goal here is to make this load process take less time. It > seems the big part is building the big summary table; this big summary > table is currently 9 million rows big. Every night, we drop the table, > re-create it, build the 9 million rows of data (we use COPY to put hte > data in when it's prepared, not INSERT), and then build the indexes on it > -- of which there are many. Unfortunately this table gets queried > in a lot of different ways and needs these indexes; also unfortunately, we > have operator class indexes to support both ASC and DESC sorting on > columns so these are for all intents and purposes duplicate but required > under Postgres 8.1 (we've recently upgraded to Postgres 8.2, is this still > a requirement?) Note that you only need to have the ASC and DESC versions of opclasses when you are going to use multicolumn indexes with some columns in ASC order and some in DESC order. For columns used by themselves in an index, you don't need to do this, no matter which order you are sorting on.
Sorry if this came off sounding trollish.... All databases have their strengths & weaknesses, and I feel the responsible thing to do is exploit those strengths where possible, rather than expend significant time and effort coaxing one database to do something it wasn't designed to. There's just no such thing as "one size fits all".

I have professional experience with MS-SQL, Oracle, MySQL, and Postgres, and the scenario described sounds more ideal for MySQL & MyISAM than anything else:

1) No concerns for data loss (turning fsync & full_page_writes off) since the data can be reloaded

2) No need for MVCC or transactions, since the database is read-only

3) No worries about lock contention

4) Complex queries that might take advantage of the MySQL "Query Cache" since the base data never changes

5) Queries that might obtain data directly from indexes without having to touch tables (again, no need for MVCC)

If loading in the base data and creating the summary table is taking a lot of time, using MySQL with MyISAM tables (and binary logging disabled) should provide significant time savings, and it doesn't sound like there's any concerns for the downsides.

Yes, postgresql holds an edge over MySQL for heavy OLTP applications, I use it for that and I love it. But for the scenario the original poster is asking about, MySQL/MyISAM is ideal.

-----Original Message-----
From: Bricklen Anderson [mailto:banderson@presinet.com]
Sent: Wednesday, January 17, 2007 3:29 PM
To: Adam Rich
Cc: pgsql-performance@postgresql.org
Subject: Re: [PERFORM] Configuration Advice

Adam Rich wrote:
> Doesn't sound like you want postgres at all.... Try mysql.

Could you explain your reason for suggesting mysql? I'm simply curious why you would offer that as a solution.
> Note that you only need to have the ASC and DESC versions of opclasses when
> you are going to use multicolumn indexes with some columns in ASC order and
> some in DESC order. For columns used by themselves in an index, you don't
> need to do this, no matter which order you are sorting on.

Yeah, I assumed the people 'in the know' on this kind of stuff would know the details of why I have to have those, and therefore I wouldn't have to go into detail as to why -- but you put your finger right on it. :)

Unfortunately the customer this is for wants certain columns joined at the hip for querying and sorting, and this method was a performance godsend when we implemented it (with a C .so library, not using SQL in our opclasses or anything like that).

Steve
This would probably also be another last ditch option. :) Our stuff is designed to pretty much work on any DB but there are postgres specific things in there... not to mention ramp up time on MySQL. I mean, I know MySQL from a user point of view and in a very limited way administratively, but I'd be back to square one on learning performance stuff :)

Anyway -- I'll listen to what people have to say, and keep this in mind. It would be an interesting test to take parts of the process and compare at least, if not converting the whole thing.

talk to you later, Steve

On Wed, 17 Jan 2007, Adam Rich wrote:

> Sorry if this came off sounding trollish.... All databases have their
> strengths & weaknesses, and I feel the responsible thing to do is exploit
> those strengths where possible, rather than expend significant time and
> effort coaxing one database to do something it wasn't designed to.
> There's just no such thing as "one size fits all".
>
> I have professional experience with MS-SQL, Oracle, MySQL, and Postgres,
> and the scenario described sounds more ideal for MySQL & MyISAM than
> anything else:
>
> 1) No concerns for data loss (turning fsync & full_page_writes off)
> since the data can be reloaded
>
> 2) No need for MVCC or transactions, since the database is read-only
>
> 3) No worries about lock contention
>
> 4) Complex queries that might take advantage of the MySQL "Query Cache"
> since the base data never changes
>
> 5) Queries that might obtain data directly from indexes without having
> to touch tables (again, no need for MVCC)
>
> If loading in the base data and creating the summary table is taking
> a lot of time, using MySQL with MyISAM tables (and binary logging
> disabled) should provide significant time savings, and it doesn't
> sound like there's any concerns for the downsides.
>
> Yes, postgresql holds an edge over MySQL for heavy OLTP applications,
> I use it for that and I love it. But for the scenario the original
> poster is asking about, MySQL/MyISAM is ideal.
>
> -----Original Message-----
> From: Bricklen Anderson [mailto:banderson@presinet.com]
> Sent: Wednesday, January 17, 2007 3:29 PM
> To: Adam Rich
> Cc: pgsql-performance@postgresql.org
> Subject: Re: [PERFORM] Configuration Advice
>
> Adam Rich wrote:
>> Doesn't sound like you want postgres at all.... Try mysql.
>
> Could you explain your reason for suggesting mysql? I'm simply curious
> why you would offer that as a solution.
On 18-1-2007 0:37 Adam Rich wrote:
> 4) Complex queries that might take advantage of the MySQL "Query Cache"
> since the base data never changes

Have you ever compared MySQL's performance with complex queries to PostgreSQL's? I once had a query which would operate on a record list and see whether there were any gaps larger than 1 between consecutive primary keys.

Normally that information isn't very useful, but this time it was. Since the data was in MySQL I tried several variations of queries in MySQL... After ten minutes or so I gave up waiting, but left my last version running. In the meantime I dumped the data, reloaded the data in PostgreSQL and ran some test queries there. I came up with a query that took only 0.5 second on Postgres pretty soon. The query on MySQL still wasn't finished...

In my experience it is (even with the 5.0 release) easier to get good performance from complex queries in postgresql. And postgresql gives you more useful information on why a query takes a long time when using explain (analyze). There are some drawbacks too of course, but while we in our company use mysql, I switched to postgresql for some read-only complex query stuff just for its performance...

Besides that, mysql rewrites the entire table for most table-altering statements you do (including indexes). For small tables that's no issue, but if you somehow can't add all your indexes in a single statement to a table, you'll be waiting a long time more for new indexes than with postgresql. And that situation isn't so unusual if you think of a query which needs an index that isn't there yet. Apart from the fact that it doesn't have functional indexes and such.

Long story short: MySQL still isn't the best performer when looking at the more complex queries. I've seen performance which made me assume it can't optimise sequential scans (when it is forced to loop using a seq scan it appears to do a new seq scan for each round in the loop...) and various other cases PostgreSQL can execute much more efficiently. So unless you run the same queries a lot of times and know of a way to get them fast enough the initial time, the query cache is not much of a help.

Best regards,

Arjen
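P.S. Since functional indexes came up: for anyone who hasn't used them, a PostgreSQL expression index looks like the sketch below (table and column names are made up for illustration):

-- index on an expression rather than a bare column:
CREATE INDEX users_lower_email_idx ON users (lower(email));

-- lets the planner use an index scan for case-insensitive lookups:
SELECT * FROM users WHERE lower(email) = 'someone@example.com';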
On Thu, 2007-01-18 at 04:24, Arjen van der Meijden wrote:
> On 18-1-2007 0:37 Adam Rich wrote:
> > 4) Complex queries that might take advantage of the MySQL "Query Cache"
> > since the base data never changes
>
> Have you ever compared MySQL's performance with complex queries to
> PostgreSQL's? I once had a query which would operate on a record list and
> see whether there were any gaps larger than 1 between consecutive
> primary keys.
>
> Normally that information isn't very useful, but this time it was.
> Since the data was in MySQL I tried several variations of queries in
> MySQL... After ten minutes or so I gave up waiting, but left my last
> version running. In the meantime I dumped the data, reloaded the data
> in PostgreSQL and ran some test queries there. I came up with a query
> that took only 0.5 second on Postgres pretty soon. The query on MySQL
> still wasn't finished...

I have had similar experiences in the past. Conversely, I've had similar things happen the other way around.

The biggest difference? If I report something like that happening in postgresql, it's easier to get a fix or workaround, and if it's a code bug, the fix is usually released as a patch within a day or two. With MySQL, if it's a common problem, then I can find it on the internet with google; otherwise it might take a while to get a good workaround / fix. And if it's a bug, it might take much longer to get a working patch.

> In my experience it is (even with the 5.0 release) easier to get good
> performance from complex queries in postgresql.

Agreed. For data warehousing / OLAP stuff, postgresql is generally better than mysql.

> Besides that, mysql rewrites the entire table for most table-altering
> statements you do (including indexes). For small tables that's no issue,
> but if you somehow can't add all your indexes in a single statement to a
> table, you'll be waiting a long time more for new indexes than with
> postgresql. And that situation isn't so unusual if you think of a query
> which needs an index that isn't there yet. Apart from the fact that it
> doesn't have functional indexes and such.

Note that this applies to the myisam table type. innodb works quite differently. It is more like pgsql in behaviour, and is an mvcc storage engine. Like all storage engines, it's a collection of compromises. Some areas it's better than pgsql, some areas worse. Sadly, it lives under the hood of a database that can do some pretty stupid things, like ignore column-level constraint definitions without telling you.

> Long story short: MySQL still isn't the best performer when looking at
> the more complex queries.

Agreed. And those are the queries that REALLY kick your ass. Or your server's ass, I guess.
On 18-1-2007 17:20 Scott Marlowe wrote:
>> Besides that, mysql rewrites the entire table for most table-altering
>> statements you do (including indexes).
>
> Note that this applies to the myisam table type. innodb works quite
> differently. It is more like pgsql in behaviour, and is an mvcc storage

Afaik this is not engine specific and also applies to InnoDB. Here is what the MySQL manual says about it:

"In most cases, ALTER TABLE works by making a temporary copy of the original table. The alteration is performed on the copy, and then the original table is deleted and the new one is renamed. While ALTER TABLE is executing, the original table is readable by other clients. Updates and writes to the table are stalled until the new table is ready, and then are automatically redirected to the new table without any failed updates."
http://dev.mysql.com/doc/refman/5.0/en/alter-table.html

If it were myisam-only they sure would've mentioned that. Besides, this is the behaviour we've seen on our site as well. Since 'create index' is also an alter-table statement for mysql, this also applies to adding indexes.

Best regards,

Arjen
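P.S. The practical workaround, sketched with made-up table and index names: batch all the index additions into one ALTER TABLE, so the table copy happens once instead of once per index.

-- one table copy, three new indexes:
ALTER TABLE summary
    ADD INDEX summary_col_a_idx (col_a),
    ADD INDEX summary_col_b_idx (col_b),
    ADD INDEX summary_col_c_idx (col_c);

-- versus three separate statements, each of which copies the table again:
CREATE INDEX summary_col_a_idx ON summary (col_a);
CREATE INDEX summary_col_b_idx ON summary (col_b);
CREATE INDEX summary_col_c_idx ON summary (col_c);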
> I once had a query which would operate on a record list and
> see whether there were any gaps larger than 1 between consecutive
> primary keys.

Would you mind sharing the query you described? I am attempting to do something similar now.
On 18-1-2007 18:28 Jeremy Haile wrote:
>> I once had a query which would operate on a record list and
>> see whether there were any gaps larger than 1 between consecutive
>> primary keys.
>
> Would you mind sharing the query you described? I am attempting to do
> something similar now.

Well it was over a year ago, so I don't know what I did back then. But since it was a query adjusted from what I did in MySQL, there were no subqueries involved. I think it was something like this:

select a.id, min(b.id)
from members a
join members b on a.id < b.id
left join members c on a.id + 1 = c.id
where c.id IS NULL
group by a.id;

Rewriting it to this one halves the execution time though:

select a.id, min(b.id)
from members a
left join members c on a.id + 1 = c.id
join members b on a.id < b.id
where c.id IS NULL
group by a.id;

Although this query seems to be much faster with 150k records:

select aid, bid
from (select a.id as aid,
             (select min(b.id) from members b where b.id > a.id) as bid
      from members a
      group by a.id) as foo
where bid > aid + 1;

The first one takes about 16 seconds on my system with PG 8.2, the second about 1.8 seconds. But back then the list was much shorter, so it may have been the first one or a variant on that. On MySQL the first takes much more than the 16 seconds PostgreSQL uses, and after editing this e-mail it still isn't finished... The second one made EXPLAIN hang in my 5.0.32-bk, so I didn't try that for real.

Best regards,

Arjen

PS, In case any of the planner-hackers are reading, here are the plans of the first two queries, just to see if something can be done to decrease the differences between them. The main difference seems to be the groupaggregate vs the hashaggregate?

GroupAggregate  (cost=34144.16..35144.38 rows=50011 width=8) (actual time=17653.401..23881.320 rows=71 loops=1)
  ->  Sort  (cost=34144.16..34269.19 rows=50011 width=8) (actual time=17519.274..21423.128 rows=7210521 loops=1)
        Sort Key: a.id
        ->  Nested Loop  (cost=11011.41..30240.81 rows=50011 width=8) (actual time=184.412..10945.189 rows=7210521 loops=1)
              ->  Hash Left Join  (cost=11011.41..28739.98 rows=1 width=4) (actual time=184.384..1452.467 rows=72 loops=1)
                    Hash Cond: ((a.id + 1) = c.id)
                    Filter: (c.id IS NULL)
                    ->  Seq Scan on members a  (cost=0.00..9903.33 rows=150033 width=4) (actual time=0.009..71.463 rows=150033 loops=1)
                    ->  Hash  (cost=9903.33..9903.33 rows=150033 width=4) (actual time=146.040..146.040 rows=150033 loops=1)
                          ->  Seq Scan on members c  (cost=0.00..9903.33 rows=150033 width=4) (actual time=0.002..77.066 rows=150033 loops=1)
              ->  Index Scan using members_pkey on members b  (cost=0.00..875.69 rows=50011 width=4) (actual time=0.025..78.971 rows=100146 loops=72)
                    Index Cond: (a.id < b.id)
Total runtime: 23882.511 ms
(13 rows)

HashAggregate  (cost=30240.82..30240.83 rows=1 width=8) (actual time=12870.440..12870.504 rows=71 loops=1)
  ->  Nested Loop  (cost=11011.41..30240.81 rows=1 width=8) (actual time=168.658..9466.644 rows=7210521 loops=1)
        ->  Hash Left Join  (cost=11011.41..28739.98 rows=1 width=4) (actual time=168.630..865.690 rows=72 loops=1)
              Hash Cond: ((a.id + 1) = c.id)
              Filter: (c.id IS NULL)
              ->  Seq Scan on members a  (cost=0.00..9903.33 rows=150033 width=4) (actual time=0.012..70.612 rows=150033 loops=1)
              ->  Hash  (cost=9903.33..9903.33 rows=150033 width=4) (actual time=140.432..140.432 rows=150033 loops=1)
                    ->  Seq Scan on members c  (cost=0.00..9903.33 rows=150033 width=4) (actual time=0.003..76.709 rows=150033 loops=1)
        ->  Index Scan using members_pkey on members b  (cost=0.00..875.69 rows=50011 width=4) (actual time=0.023..73.317 rows=100146 loops=72)
              Index Cond: (a.id < b.id)
Total runtime: 12870.756 ms
(11 rows)
Arjen van der Meijden <acmmailing@tweakers.net> writes:
> PS, In case any of the planner-hackers are reading, here are the plans
> of the first two queries, just to see if something can be done to
> decrease the differences between them.

Increase work_mem? It's not taking the hash because it thinks it won't fit in memory ...

There is a bug here, I'd say: the rowcount estimate ought to be the same either way. Dunno why it's not, but will look --- I see the same misbehavior with a toy table here.

			regards, tom lane
On 18-1-2007 23:11 Tom Lane wrote:
> Increase work_mem? It's not taking the hash because it thinks it won't
> fit in memory ...

When I increase it to 128MB in the session (an arbitrarily selected, relatively large value) it indeed picks the other plan.

Best regards,

Arjen
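P.S. For anyone who wants to reproduce this, the session-level change is just the following (128MB is simply the value I happened to try, not a tuning recommendation):

SET work_mem = '128MB';          -- affects only the current session
EXPLAIN ANALYZE
select a.id, min(b.id)
from members a
join members b on a.id < b.id
left join members c on a.id + 1 = c.id
where c.id IS NULL
group by a.id;
RESET work_mem;                  -- back to the server default afterwards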