Thread: optimizing postgres
Hello all - I'm working on a Postgres project after coming from a MySQL background (no flames, please :). We are importing fairly large XML datasets (10-20 MB of XML files per 'project', currently 5 projects) into the database for querying. We are using PHP to build a web interface where users can create projects and upload their files, and a parsing routine gets the XML data into the database.

The parsing has turned out to be pretty intense. It takes about 10-20 minutes for any project, and while it runs it really slows down the site's response. I tested serving static web pages from Apache and running endless loops in PHP, and neither was a problem; the choke point seems to be running any other Postgres query while constructing a PHP page during parsing.

As an example, the original designer specified separate tables for each project. Since they all had the same structure, I suggested combining them into a single table with a project_id column, but he said it would take too long to query. I was suspicious, but I went with his design anyway. It turned out he was right for our current setup: when I needed to empty the project table to re-parse data, a cascading delete could take up to 10 minutes! I cut re-parsing time in half by just dropping the table and creating a new one, which was an okay thing to do since the table only belonged to one project anyway. But I hate to think how long a delete, cascading to child tables, would have taken if all the project data were in a single table.

Since I'm not an expert in Postgres database design, I'm assuming I've done something sub-optimal. Are there some common techniques for tuning Postgres performance? Do we need beefier hardware? Or is it a problem with how PHP or Apache pools connections to the database?
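In case it helps, here is roughly what the current layout and the re-parse step look like. This is a simplified sketch; the table and column names are made up, not our actual DDL:

    -- One pair of tables per project (simplified):
    CREATE TABLE project1_items (
        item_id serial PRIMARY KEY,
        payload text
    );
    CREATE TABLE project1_details (
        detail_id serial PRIMARY KEY,
        item_id   integer REFERENCES project1_items (item_id)
                          ON DELETE CASCADE,
        attribute text,
        value     text
    );

    -- Emptying a project before re-parsing used to be done like this,
    -- and could take up to 10 minutes:
    DELETE FROM project1_items;   -- cascades into project1_details

    -- Dropping and re-creating the tables instead cut re-parsing time
    -- in half:
    DROP TABLE project1_details;
    DROP TABLE project1_items;
    -- ...then re-create both tables and run the parser again.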
* lawpoop@gmail.com (lawpoop@gmail.com) wrote:
> Since I'm not an expert in Postgres database design, I'm assuming I've
> done something sub-optimal. Are there some common techniques for
> tuning Postgres performance? Do we need beefier hardware?

Honestly, it sounds like the database design might be the best place to
start. Can you provide the schema definition and queries (the actual
queries and the EXPLAIN output from them)? 10-20 MB is certainly small
enough that you'd have to be doing something particularly terrible to
make it slow on any decent hardware...

> Or is it a problem with how PHP or Apache pools connections to the
> database?

This seems unlikely to be the issue. If you were doing a lot of web
page loads and they were all sluggish, I might suggest making sure
you're using connection pooling to Postgres, but it sounds like you've
got a different problem (perhaps a constraint against a column which
doesn't have an index?).

	Thanks,

		Stephen
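P.S. To be concrete: the table definitions as psql shows them, plus
EXPLAIN ANALYZE output for the statements that feel slow, would be the
most useful things to post. Along these lines (the table name and query
here are just placeholders):

    -- In psql, show a table's columns, indexes and constraints:
    \d your_table

    -- Run a representative slow query and show its actual plan and
    -- timings:
    EXPLAIN ANALYZE
    SELECT *
    FROM your_table
    WHERE project_id = 42;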
lawpoop@gmail.com writes:
> It turned out he was right for our current setup. When I needed to
> empty the project table to re-parse data, a cascading delete could
> take up to 10 minutes!

You mean ON DELETE CASCADE foreign keys? Usually the reason that's slow
is that you forgot to put an index on the referencing column. PG
doesn't force you to have such an index, but unless the referenced
table is nearly static you'll want one.

I too am fairly suspicious of the N-tables-are-faster-than-another-
key-column mindset, but you'd need to do some actual experimentation
(with correctly optimized table definitions ;-)) to be sure.

			regards, tom lane
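P.S. A minimal illustration, with made-up table names:

    CREATE TABLE parent (
        id serial PRIMARY KEY
    );

    CREATE TABLE child (
        id        serial PRIMARY KEY,
        parent_id integer REFERENCES parent (id) ON DELETE CASCADE
    );

    -- Both primary keys are indexed automatically, but child.parent_id
    -- is not.  Deleting from parent then has to scan all of child for
    -- each deleted row to find the rows to cascade to.  The fix:
    CREATE INDEX child_parent_id_idx ON child (parent_id);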
Hello, Tom.

I don't understand the relation between constraints and indexes.

Using EMS PostgreSQL Manager Lite, I created indexes on some columns,
several of which hold unique values. But when I open the table in
pgAdmin, all such "unique" indexes are listed as constraints, and there
are no indexes in the Indexes section. When I open it again in EMS
PostgreSQL Manager, they are listed as "Indexes".

Does this mean that I need to create additional indexes on the same
columns? Is a constraint an index as well?

Thanks,

Zlatko
Zlatko Matić wrote:
> I don't understand the relation between constraints and indexes.
> Using EMS PostgreSQL Manager Lite, I created indexes on some columns,
> several of which hold unique values. But when I open the table in
> pgAdmin, all such "unique" indexes are listed as constraints, and
> there are no indexes in the Indexes section. When I open it again in
> EMS PostgreSQL Manager, they are listed as "Indexes".
> Does this mean that I need to create additional indexes on the same
> columns? Is a constraint an index as well?

The default tool is "psql", the command-line client. If in doubt, refer
to that (and the manuals). In this case, a quick look at the manuals
covers it:

http://www.postgresql.org/docs/8.2/static/indexes-unique.html

"PostgreSQL automatically creates a unique index when a unique
constraint or a primary key is defined for a table. The index covers
the columns that make up the primary key or unique columns (a
multicolumn index, if appropriate), and is the mechanism that enforces
the constraint."

So no: a unique constraint is already backed by an index, and creating
another index on the same columns would just be redundant.

-- 
  Richard Huxton
  Archonet Ltd
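P.S. You can watch this happen from psql with a throwaway table (the
names here are invented for the example):

    CREATE TABLE demo (
        id    serial PRIMARY KEY,  -- creates unique index demo_pkey
        email text UNIQUE          -- creates unique index demo_email_key
    );

    -- List the indexes PostgreSQL created behind the scenes:
    SELECT indexname, indexdef
    FROM pg_indexes
    WHERE tablename = 'demo';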
> The parsing has turned out to be pretty intense. It takes about 10-20
> minutes for any project, and while it runs it really slows down the
> site's response. I tested serving static web pages from Apache and
> running endless loops in PHP, and neither was a problem; the choke
> point seems to be running any other Postgres query while constructing
> a PHP page during parsing.

Do you do lots of INSERTs without explicitly using transactions? You
also need to run EXPLAIN ANALYZE on your most frequent queries. It is
very possible the slowdown is just from a forgotten index.

> As an example, the original designer specified separate tables for
> each project. Since they all had the same structure, I suggested
> combining them into a single table with a project_id column, but he
> said it would take too long to query. I was suspicious, but I went
> with his design anyway.

Given the small size of the dataset, I don't see a justification for
this...

> It turned out he was right for our current setup: when I needed to
> empty the project table to re-parse data, a cascading delete could
> take up to 10 minutes! I cut re-parsing time in half by just dropping
> the table and creating a new one, which was an okay thing to do since
> the table only belonged to one project anyway. But I hate to think
> how long a delete, cascading to child tables, would have taken if all
> the project data were in a single table.

That's probably because you forgot to create an index on the
referencing column. Such indexes are not created automatically.

> Since I'm not an expert in Postgres database design, I'm assuming I've
> done something sub-optimal. Are there some common techniques for
> tuning Postgres performance? Do we need beefier hardware?
>
> Or is it a problem with how PHP or Apache pools connections to the
> database?

It depends on a lot of things, but the first thing to do is run EXPLAIN
ANALYZE on your queries and post the results here.
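Also: if the parser runs in autocommit mode, every INSERT is its own
transaction and every one of them forces a commit to disk. Wrapping the
whole load in a single transaction usually helps a lot. A rough sketch,
with made-up table and column names:

    BEGIN;
    INSERT INTO items (project_id, payload) VALUES (1, '...');
    INSERT INTO items (project_id, payload) VALUES (1, '...');
    -- ... thousands more rows ...
    COMMIT;

    -- For bulk loads, COPY is faster still.  From psql, \copy reads a
    -- file on the client side:
    \copy items (project_id, payload) FROM 'parsed_data.tsv'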