Thread: Performance Optimization for Dummies
I am a software developer who is acting in a (temporary) DBA role for a project. I had recommended PostgreSQL be brought in to replace the proposed MySQL DB - I chose PostgreSQL because of its reputation as a more stable solution than MySQL.

At this early stage in the project, we are initializing our portal's database with millions of rows of imported data in over 50 different flattened tables; each table's structure is unique to the data provider. This requires a pretty complex import program, because the data must be matched semantically, not literally. Even with all of the expression matching and fuzzy logic in the code, our performance statistics show that the program spends over 75% of its time in SQL queries looking for matching and/or duplicate data.

The import is slow - and degrades as the tables grow. With even more millions of rows in dozens of import tables to come, the imports will take forever. My ability to analyse the queries is limited; because of the nature of the import process, the SQL queries are mutable: every imported row can change the structure of a SQL query as the program adds and subtracts search conditions to the SQL command text before execution. The import program is scripted in Tcl. An attempt to convert our queries to prepared queries (curiously) did not bring any performance improvements, and we converted back to simplify the code.

We urgently need a major performance improvement. We are running PostgreSQL 8.1.4 on a Windows 2003 x64 Server on a dual processor, dual core 3.2 GHz Xeon box with 4 GB RAM and a RAID (sorry, I don't know what type) disc subsystem. Sorry about the long intro, but here are my questions:

1) Are we paying any big penalties by running Windows vs Linux (or any other OS)?

2) Has the debate over PostgreSQL and Xeon processors been settled? Is this a factor?

3) Are there any easy-to-use performance analysis/optimisation tools that we can use? I am dreaming of one that could point out problems and suggest and/or effect solutions.

4) Can anyone recommend any commercial PostgreSQL service providers that may be able to swiftly come in and assist us with our performance issues?

Below, please find what I believe are the configuration settings of interest in our system.

Any help and advice will be much appreciated. TIA,

Carlo

max_connections = 100
shared_buffers = 50000
work_mem = 32768
maintenance_work_mem = 32768
checkpoint_segments = 128
effective_cache_size = 10000
random_page_cost = 3
stats_start_collector = on
stats_command_string = on
stats_row_level = on
autovacuum = on
> The import is slow - and degrades as the tables grow. With even more
> millions of rows in dozens of import tables to come, the imports will take
> forever. My ability to analyse the queries is limited; because of the nature
> of the import process, the SQL queries are mutable: every imported row can
> change the structure of a SQL query as the program adds and subtracts search
> conditions to the SQL command text before execution. The import program is
> scripted in Tcl. An attempt to convert our queries to prepared queries
> (curiously) did not bring any performance improvements, and we converted back
> to simplify the code.

How are you loading the tables? Copy? Insert?

> We urgently need a major performance improvement. We are running
> PostgreSQL 8.1.4 on a Windows 2003 x64 Server on a dual processor, dual core
> 3.2 GHz Xeon box with 4 GB RAM and a RAID (sorry, I don't know what type)
> disc subsystem. Sorry about the long intro, but here are my questions:
>
> 1) Are we paying any big penalties by running Windows vs Linux (or any other
> OS)?

Yes. Linux or FreeBSD is going to stomp Win32 for PostgreSQL performance.

> 2) Has the debate over PostgreSQL and Xeon processors been settled? Is this
> a factor?

Depends. PostgreSQL is much better with the Xeon in general, but are you
running Woodcrest-based CPUs or the older models?

> 3) Are there any easy-to-use performance analysis/optimisation tools that we
> can use? I am dreaming of one that could point out problems and suggest
> and/or effect solutions.

I don't know about Windows, but *nix has a number of tools available directly
at the operating system level to help you determine various bottlenecks.

> 4) Can anyone recommend any commercial PostgreSQL service providers that may
> be able to swiftly come in and assist us with our performance issues?

http://www.commandprompt.com/ (disclaimer, I am an employee)

> Below, please find what I believe are the configuration settings of interest
> in our system
>
> Any help and advice will be much appreciated. TIA,
>
> Carlo
>
> max_connections = 100
> shared_buffers = 50000

This could probably be higher.

> work_mem = 32768

Depending on what you are doing, this could be too low or too high.

> maintenance_work_mem = 32768
> checkpoint_segments = 128
> effective_cache_size = 10000

This could probably be higher.

> random_page_cost = 3
> stats_start_collector = on
> stats_command_string = on
> stats_row_level = on
> autovacuum = on

Stats are a hit... you need to determine if you actually need them.

Joshua D. Drake

--
=== The PostgreSQL Company: Command Prompt, Inc. ===
Sales/Support: +1.503.667.4564 || 24x7/Emergency: +1.800.492.2240
Providing the most comprehensive PostgreSQL solutions since 1997
http://www.commandprompt.com/
On 9/28/06, Carlo Stonebanks <stonec.register@sympatico.ca> wrote:
> We urgently need a major performance improvement. We are running
> PostgreSQL 8.1.4 on a Windows 2003 x64 Server on a dual processor, dual core
> 3.2 GHz Xeon box with 4 GB RAM and a RAID (sorry, I don't know what type)
> disc subsystem. Sorry about the long intro, but here are my questions:

are you using the 'copy' interface?

> 1) Are we paying any big penalties by running Windows vs Linux (or any other
> OS)?

that's a tough question. my gut says that windows will not scale as well as
recent linux kernels in high load environments.

> 2) Has the debate over PostgreSQL and Xeon processors been settled? Is this
> a factor?

hearing good things about the woodcrest. pre-woodcrest xeon (dempsey on down)
is outclassed by the opteron.

> Below, please find what I believe are the configuration settings of interest
> in our system

1. can probably run fsync=off during the import
2. if import is a single process, consider a temporary bump to memory for
   index creation. or, since you have four cores, consider having four
   processes import the data somehow.
3. turn off stats collector, stats_command_string, stats_row_level,
   and autovacuum during import.

merlin
> How are you loading the tables? Copy? Insert?

Once the data is transformed, it is inserted. I don't have stats, but the
program's visual feedback does not spend a lot of time on the "inserting
data" message. Then again, if there is an asynchronous component to an
insert, perhaps I am not seeing how slow an insert really is until I query
the table.

> Yes. Linux or FreeBSD is going to stomp Win32 for PostgreSQL performance.

Don't suppose you'd care to hazard a guess on what sort of scale we're
talking about? Are we talking about 10%? 100%? I know this is a hard one to
judge. My impression was that the *NIX improvements were with concurrent
use, and right now I am obsessing on this single-threaded issue.

> Depends. PostgreSQL is much better with the Xeon in general, but are you
> running Woodcrest-based CPUs or the older models?

Weren't those released in July? This server is a few months older, so I
guess not. But maybe? Does Dell have the ability to install CPUs from the
future like Cyberdyne does? ;-)

> I don't know about Windows, but *nix has a number of tools available
> directly at the operating system level to help you determine various
> bottlenecks.

Are we talking about I/O operations? I was thinking of SQL query analysis.
The stuff I read here about query analysis is pretty intriguing, but to
someone unfamiliar with this type of query analysis it all looks quite
uncertain to me. I mean, I read the threads and it all looks like a lot of
trying to figure out how to cajole PostgreSQL into doing what you want,
rather than telling it: HEY, I CREATED THAT INDEX FOR A REASON, USE IT! I
know this may be non-DBA sophistication on my part, but I would like a tool
that would make this whole process a little less arcane. I'm not the Gandalf
type.

>> 4) Can anyone recommend any commercial PostgreSQL service providers that
>> may be able to swiftly come in and assist us with our performance issues?
>
> http://www.commandprompt.com/ (disclaimer, I am an employee)

Very much appreciated.

>> max_connections = 100
>> shared_buffers = 50000
>
> This could probably be higher.

Ok, good start...

>> work_mem = 32768
>
> Depending on what you are doing, this could be too low or too high.

Is this like "You could be too fat or too thin"? Aren't you impressed with
the fact that I managed to pick the one number that was not right for
anything?

>> maintenance_work_mem = 32768
>> checkpoint_segments = 128
>> effective_cache_size = 10000
>
> This could probably be higher.

... noted...

>> random_page_cost = 3
>> stats_start_collector = on
>> stats_command_string = on
>> stats_row_level = on
>> autovacuum = on
>
> Stats are a hit... you need to determine if you actually need them.

Unfortunately, this is the only way I know of getting the query string to
appear in the PostgreSQL server status display. While trying to figure out
what is slowing things down, having that is really helpful. I also imagined
that this sort of thing would be a performance hit when you are getting lots
of small, concurrent queries. In my case, we have queries which are taking
around a second to perform outer joins. They aren't competing with any other
requests as the site is not running; we are just running one app to seed the
data.
> are you using the 'copy' interface?

Straightforward inserts - the import data has to be transformed, normalised
and de-duped by the import program. I imagine the copy interface is for more
straightforward data importing. These are - by necessity - single row
inserts.

> that's a tough question. my gut says that windows will not scale as
> well as recent linux kernels in high load environments.

But not in the case of a single import program trying to seed a massive
database?

> hearing good things about the woodcrest. pre-woodcrest xeon (dempsey
> on down) is outclassed by the opteron.

Need to find a way to determine the Xeon type. The server was bought in
early 2006, and it looks like Woodcrest was from July.

> 1. can probably run fsync=off during the import
> 2. if import is a single process, consider a temporary bump to memory for
>    index creation. or, since you have four cores, consider having four
>    processes import the data somehow.
> 3. turn off stats collector, stats_command_string, stats_row_level,
>    and autovacuum during import.

Very helpful, thanks.

Carlo
On Sep 28, 2006, at 10:53 AM, Carlo Stonebanks wrote:

>> are you using the 'copy' interface?
>
> Straightforward inserts - the import data has to be transformed, normalised
> and de-duped by the import program. I imagine the copy interface is for
> more straightforward data importing. These are - by necessity - single row
> inserts.

Are you wrapping all this in a transaction?

You're doing some dynamically generated selects as part of the "de-duping"
process? They're probably the expensive bit. What do those queries tend to
look like?

Are you analysing the table periodically? If not, then you might have
statistics based on an empty table, or default statistics, which might cause
the planner to choose bad plans for those selects.

Talking of which, are there indexes on the table? Normally you wouldn't have
indexes in place during a bulk import, but if you're doing selects as part
of the data load process then you'd be forcing sequential scans for every
query, which would explain why it gets slower as the table gets bigger.

Cheers,
  Steve
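P.S. To be concrete about the first two points - something along these lines
(table, column and sequence names are entirely made up) batches the inserts
and keeps the planner's statistics fresh:

  BEGIN;
  INSERT INTO provider (name, company)
       VALUES ('John Q Smith', 'Foobar Corp');
  INSERT INTO address (provider_id, street, zip)
       -- currval() assumes provider_id is a serial column
       VALUES (currval('provider_provider_id_seq'), '123 Main St', '12345-6789');
  COMMIT;

  -- every few thousand rows, so the planner sees the tables growing:
  ANALYZE provider;
  ANALYZE address;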
> -----Original Message-----
> From: pgsql-performance-owner@postgresql.org
> [mailto:pgsql-performance-owner@postgresql.org] On Behalf Of Carlo Stonebanks
> Subject: [PERFORM] Performance Optimization for Dummies
>
> At this early stage in the project, we are initializing our portal's
> database with millions of rows of imported data in over 50 different
> flattened tables; each table's structure is unique to the data provider.
> This requires a pretty complex import program, because the data must be
> matched semantically, not literally. Even with all of the expression
> matching and fuzzy logic in the code, our performance statistics show that
> the program spends over 75% of its time in SQL queries looking for matching
> and/or duplicate data.
>
> The import is slow - and degrades as the tables grow.

So your program first transforms the data and then inserts it? And it is the
transforming process which is running select statements that is slow?

If that is the case you could use duration logging to find the slow select
statement, and then you could post an EXPLAIN ANALYZE of the select.

One question off the top of my head: are you using regular expressions for
your fuzzy logic? If so, do your indexes have the right operator classes?
(see http://www.postgresql.org/docs/8.1/static/indexes-opclass.html)

Dave
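P.S. In case it helps, this is the sort of thing I mean - index, table and
column names are made up, and it only helps patterns anchored at the left:

  -- btree index with the pattern operator class, needed when the database
  -- locale is not C and you run LIKE / regex searches on the column
  -- (use text_pattern_ops instead if the column is of type text)
  CREATE INDEX provider_company_pattern_idx
      ON provider (company_name varchar_pattern_ops);

  -- a left-anchored search can then use the index:
  SELECT * FROM provider WHERE company_name LIKE 'Foobar Corp%';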
On Thu, Sep 28, 2006 at 10:11:31AM -0700, Joshua D. Drake wrote: > > 4) Can anyone recommend any commercial PostgreSQL service providers that may > > be able to swiftly come in and assist us with our performance issues? > > http://www.commandprompt.com/ (disclaimer, I am an employee) You forgot us. :) -- Jim Nasby jim@nasby.net EnterpriseDB http://enterprisedb.com 512.569.9461 (cell)
On Thu, Sep 28, 2006 at 01:47:44PM -0400, Carlo Stonebanks wrote:
> > How are you loading the tables? Copy? Insert?
>
> Once the data is transformed, it is inserted. I don't have stats, but the
> program's visual feedback does not spend a lot of time on the "inserting
> data" message. Then again, if there is an asynchronous component to an
> insert, perhaps I am not seeing how slow an insert really is until I query
> the table.

Well, individual inserts are slow, especially if they're not wrapped up in a
transaction. And you also mentioned checking for dupes. I suspect that you're
not going to find any huge gains in tuning the database... it sounds like the
application (as in: how it's using the database) is what needs help.

> >> work_mem = 32768
> >
> > Depending on what you are doing, this could be too low or too high.
>
> Is this like "You could be too fat or too thin"? Aren't you impressed with
> the fact that I managed to pick the one number that was not right for
> anything?

For what you're doing, it's probably fine where it is... but while you're in
the single-thread case, you can safely make that pretty big (like 1000000).

> >> maintenance_work_mem = 32768
> >> checkpoint_segments = 128
> >> effective_cache_size = 10000
> >
> > This could probably be higher.

I'd suggest setting it to about 3G, or 375000.

> >> random_page_cost = 3
> >> stats_start_collector = on
> >> stats_command_string = on
> >> stats_row_level = on
> >> autovacuum = on
> >
> > Stats are a hit... you need to determine if you actually need them.
>
> Unfortunately, this is the only way I know of getting the query string to
> appear in the PostgreSQL server status display. While trying to figure out
> what is slowing things down, having that is really helpful. I also imagined
> that this sort of thing would be a performance hit when you are getting lots
> of small, concurrent queries. In my case, we have queries which are taking
> around a second to perform outer joins. They aren't competing with any other
> requests as the site is not running; we are just running one app to seed the
> data.

stats_command_string can exact a huge penalty pre-8.2, on the order of 30%.
I'd turn it off unless you *really* need it. Command logging (ie:
log_min_duration_statement) is much less of a burden.

The fact that you're doing outer joins while loading data really makes me
suspect that the application needs to be changed for any real benefits to be
had. But you should still look at what EXPLAIN ANALYZE is showing you on
those queries; you might be able to find some gains there.

--
Jim Nasby                                            jim@nasby.net
EnterpriseDB      http://enterprisedb.com      512.569.9461 (cell)
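To put numbers on that, something like this in postgresql.conf - ballpark
values for your 4 GB box, and the work_mem figure is only sane while a single
session is doing the load:

  # load-time settings, not production settings
  work_mem = 1000000                 # in kB; roughly 1 GB per sort/hash step
  effective_cache_size = 375000      # in 8 kB pages; tells the planner ~3 GB is cached
  log_min_duration_statement = 1000  # log any statement taking over 1 second
  stats_command_string = off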
On Thu, Sep 28, 2006 at 01:53:22PM -0400, Carlo Stonebanks wrote:
> > are you using the 'copy' interface?
>
> Straightforward inserts - the import data has to be transformed, normalised
> and de-duped by the import program. I imagine the copy interface is for
> more straightforward data importing. These are - by necessity - single row
> inserts.

BTW, stuff like de-duping is something you really want the database - not an
external program - to be doing. Think about loading the data into a temporary
table and then working on it from there.

--
Jim Nasby                                            jim@nasby.net
EnterpriseDB      http://enterprisedb.com      512.569.9461 (cell)
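Roughly like this (table, column names and the file path are invented, and
this only knocks out the exact matches - the fuzzy checks would layer on
top of it):

  CREATE TEMP TABLE staging (name text, company text, street text, zip text);

  -- server-side bulk load of the raw provider file
  COPY staging FROM '/path/to/provider_file.txt';

  -- one set-based pass instead of one SELECT per row
  INSERT INTO provider (name, company, street, zip)
  SELECT DISTINCT s.name, s.company, s.street, s.zip
    FROM staging s
    LEFT JOIN provider p
           ON p.zip = s.zip
          AND lower(p.name) = lower(s.name)
   WHERE p.provider_id IS NULL;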
Carlo Stonebanks wrote:
>> are you using the 'copy' interface?
>
> Straightforward inserts - the import data has to be transformed, normalised
> and de-duped by the import program. I imagine the copy interface is for
> more straightforward data importing. These are - by necessity - single row
> inserts.

I know this is an answer to a question you didn't ask, but here it is. I was
once doing stuff where I processed log files and had to do many lookups to
normalize the data before insertion.

I started out doing everything in SQL and using postgresql tables and it took
a little over 24 hours to process 24 hours worth of data. Like you, it was
single process, many lookups.

I found a better way. I rewrote it (once in c#, again in python) and used
hashtables/dictionaries instead of tables for the lookup data. For example,
I'd start by loading the lookup data into hash tables (yes, this took a *lot*
of ram), then for each row I did something like:

1. Is it in the hash table?
   a. If not, insert it into the db
   b. Insert it into the hashtable
2. Get the lookup field out of the hash table
3. Output normalized data

This allowed me to create text files containing the data in COPY format,
which can then be inserted into the database at dramatically increased
speeds.

My first version in C# (mono) cut the time down to 6 hours for 24 hours worth
of data. I tweaked the algorithms and rewrote it in Python and got it down to
45 min. (Python can't take all the credit for the performance boost, I used
an improved technique that could have been done in C# as well.) This time
included the time needed to do the copy and update the indexes.

I created a version that also used gdb databases instead of hash tables. It
increased the time from 45 min to a little over an hour but decreased the
memory usage to something like 45MB (vs dozens or hundreds of MB per
hashtable).

--
Matthew Nuzum
newz2000 on freenode
> Are you wrapping all this in a transaction?

Yes, the transactions can typically wrap 1 to 10 single-table, single-row
inserts and updates.

> You're doing some dynamically generated selects as part of the
> "de-duping" process? They're probably the expensive bit. What
> do those queries tend to look like?

Without a doubt, this is the expensive bit.

> Are you analysing the table periodically? If not, then you might
> have statistics based on an empty table, or default statistics, which
> might cause the planner to choose bad plans for those selects.

Now there's something I didn't know - I thought that analysis and planning
was done with every select, and the performance benefit of prepared
statements was to plan once, execute many. I can easily put in a periodic
analyse statement. I obviously missed how to use analyze properly - I thought
it was just human-readable output. Do I understand correctly that it can be
used to get the SQL server to re-evaluate its plan based on newer
statistics - even on non-prepared queries?

> Talking of which, are there indexes on the table? Normally you
> wouldn't have indexes in place during a bulk import, but if you're
> doing selects as part of the data load process then you'd be forcing
> sequential scans for every query, which would explain why it gets
> slower as the table gets bigger.

There are indexes for every obvious "where this = that" clause. I don't
believe that they will work for ilike expressions.

> Cheers,
>   Steve
> So your program first transforms the data and then inserts it? And it is
> the transforming process which is running select statements that is slow?

There are cross-referencing and deduplication processes. Does this person
have an office at this exact address? In a similarly named building in the
same zip code? City? What is the similarity of the building or enterprise
names? Is there a person with a similar name with the same type of
professional license nearby? We basically look for the statistical likelihood
that they already exist to decide whether to update their data, or insert a
new data element. These are all extremely soft queries and require left outer
joins with all of the related tables that would contain this data (the left
outer join tells us whether the related element satisfied the search
condition). As I mentioned, as the data comes in, we examine what we have to
work with and modify the tables and columns we can check - which is what I
meant by "the SQL queries are mutable: every imported row can change the
structure of a SQL query as the program adds and subtracts search conditions
to the SQL command text before execution."

> If that is the case you could use duration logging to find the slow select
> statement, and then you could post an EXPLAIN ANALYZE of the select.

I'm pretty sure I know who the culprit is and - like I said - it comes from a
section of code that creates a mutable statement. However, everyone is being
so helpful and I should post this data as soon as I can.

> One question off the top of my head: are you using regular expressions for
> your fuzzy logic? If so, do your indexes have the right operator classes?
> (see http://www.postgresql.org/docs/8.1/static/indexes-opclass.html)

I am using regular expressions and fuzzy logic, but mostly on the client side
(I have a Tcl implementation of levenshtein, for example). I don't think you
can use indexes on functions such as levenshtein, because it requires a
parameter only available at execution time. The link you sent me was very
interesting - I will definitely reconsider my query logic if I can optimise
regular expression searches on the server.

Thanks!

Carlo
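P.S. To give a rough idea of the kind of check involved (table and column
names are simplified here, and the real statement is assembled condition by
condition at run time), one lookup might come out looking something like:

  SELECT p.person_id, a.address_id
    FROM person p
    LEFT OUTER JOIN address a
           ON a.person_id = p.person_id
          AND a.zip = '12345-6789'
          AND a.street ILIKE '123 Main%'
   WHERE p.last_name = 'SMITH'
     AND p.first_name ILIKE 'J%';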
On 9/28/06, Carlo Stonebanks <stonec.register@sympatico.ca> wrote:
> > are you using the 'copy' interface?
>
> Straightforward inserts - the import data has to be transformed, normalised
> and de-duped by the import program. I imagine the copy interface is for
> more straightforward data importing. These are - by necessity - single row
> inserts.

right. see comments below.

> > that's a tough question. my gut says that windows will not scale as
> > well as recent linux kernels in high load environments.
>
> But not in the case of a single import program trying to seed a massive
> database?

probably not.

> > hearing good things about the woodcrest. pre-woodcrest xeon (dempsey
> > on down) is outclassed by the opteron.
>
> Need to find a way to determine the Xeon type. The server was bought in
> early 2006, and it looks like Woodcrest was from July.

ok, there are better chips out there, but again this is not something you
would really notice outside of high load environments.

> > 1. can probably run fsync=off during the import
> > 2. if import is a single process, consider a temporary bump to memory for
> >    index creation. or, since you have four cores, consider having four
> >    processes import the data somehow.
> > 3. turn off stats collector, stats_command_string, stats_row_level,
> >    and autovacuum during import.

by the way, stats_command_string is a known performance killer that iirc was
improved in 8.2. just fyi.

I would suggest at least consideration of retooling your import as
follows... it might be a fun project to learn some postgresql internals. I'm
assuming you are doing some script preprocessing in a language like perl:

bulk load the denormalized tables into scratch tables in the postgresql
database. create indexes appropriate to the normalization process,
remembering you can index on virtually any expression in postgresql
(including regex substitution).

use sql to process the data. if the tables are too large to handle with
monolithic queries, use cursors and/or functions to handle the conversion.
now you can keep track of progress using the pl/pgsql raise notice command,
for example.

merlin
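p.s. a quick sketch of the kind of thing i mean (table, column and function
names are invented; regexp_replace is available since 8.1, which you have):

  -- expression index so lookups during the merge can match on a
  -- normalized form of the name
  CREATE INDEX scratch_name_norm_idx
      ON scratch_provider (lower(regexp_replace(name, '[^a-zA-Z0-9]', '', 'g')));

  -- driving the conversion from pl/pgsql, with progress reporting
  CREATE OR REPLACE FUNCTION merge_scratch() RETURNS void AS $$
  DECLARE
      r scratch_provider%ROWTYPE;
      n integer := 0;
  BEGIN
      FOR r IN SELECT * FROM scratch_provider LOOP
          -- matching / insert / update logic for one row goes here
          n := n + 1;
          IF n % 10000 = 0 THEN
              RAISE NOTICE 'processed % rows', n;
          END IF;
      END LOOP;
  END;
  $$ LANGUAGE plpgsql;

  SELECT merge_scratch();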
The deduplication process requires so many programmed procedures that it runs
on the client. Most of the de-dupe lookups are not "straight" lookups, but
calculated ones employing fuzzy logic. This is because we cannot dictate the
format of our input data and must deduplicate with what we get.

This was one of the reasons why I went with PostgreSQL in the first place,
because of the server-side programming options. However, I saw incredible
performance hits when running processes on the server and I partially
abandoned the idea (some custom-built name-comparison functions still run on
the server).

I am using Tcl on both the server and the client. I'm not a fan of Tcl, but
it appears to be quite well implemented and feature-rich in PostgreSQL. I
find PL/pgSQL awkward - even compared to Tcl. (After all, I'm just a
programmer... we do tend to be a little limited.)

The import program actually runs on the server box as a db client and
involves about 3000 lines of code (and it will certainly grow steadily as we
add compatibility with more import formats). Could a process involving that
much logic run on the db server, and would there really be a benefit?

Carlo

"Jim C. Nasby" <jim@nasby.net> wrote in message
news:20060928184538.GV34238@nasby.net...
> BTW, stuff like de-duping is something you really want the database -
> not an external program - to be doing. Think about loading the data into
> a temporary table and then working on it from there.
On 9/28/06, Carlo Stonebanks <stonec.register@sympatico.ca> wrote:
> The deduplication process requires so many programmed procedures that it
> runs on the client. Most of the de-dupe lookups are not "straight" lookups,
> but calculated ones employing fuzzy logic. This is because we cannot
> dictate the format of our input data and must deduplicate with what we get.
>
> This was one of the reasons why I went with PostgreSQL in the first place,
> because of the server-side programming options. However, I saw incredible
> performance hits when running processes on the server and I partially
> abandoned the idea (some custom-built name-comparison functions still run
> on the server).

imo, the key to high performance big data movements in postgresql is
mastering sql and pl/pgsql, especially the latter. once you get good at it,
your net time of copy+plpgsql is going to be less than insert+tcl.

merlin
Lots of great info here, I will see what applies to my situation. However, I
don't see bulk inserts of the tables working, because all of the tables need
to be refreshed as values to deduplicate and match will change with every row
added. In order for this to work, I would have to write queries against the
hash tables. This is where something like MySQL's in-memory tables would have
come in handy...

What is GDB?

Carlo

"Matthew Nuzum" <matthew.nuzum@canonical.com> wrote in message
news:451C18BE.8050908@canonical.com...
> I created a version that also used gdb databases instead of hash tables.
> It increased the time from 45 min to a little over an hour but decreased
> the memory usage to something like 45MB (vs dozens or hundreds of MB per
> hashtable).
On Sep 28, 2006, at 12:10 PM, Carlo Stonebanks wrote:

>> Are you wrapping all this in a transaction?
>
> Yes, the transactions can typically wrap 1 to 10 single-table, single-row
> inserts and updates.
>
>> You're doing some dynamically generated selects as part of the
>> "de-duping" process? They're probably the expensive bit. What
>> do those queries tend to look like?
>
> Without a doubt, this is the expensive bit.

If you could give some samples of those queries here I suspect people could
be a lot more helpful with some optimisations, or at least pinpoint where
the performance issues are likely to be.

>> Are you analysing the table periodically? If not, then you might
>> have statistics based on an empty table, or default statistics, which
>> might cause the planner to choose bad plans for those selects.
>
> Now there's something I didn't know - I thought that analysis and planning
> was done with every select, and the performance benefit of prepared
> statements was to plan once, execute many. I can easily put in a periodic
> analyse statement. I obviously missed how to use analyze properly - I
> thought it was just human-readable output. Do I understand correctly that
> it can be used to get the SQL server to re-evaluate its plan based on
> newer statistics - even on non-prepared queries?

I think you're confusing "explain" and "analyze". "Explain" gives you human
readable output as to what the planner decided to do with the query you give
it. "Analyze" samples the data in tables and stores the statistical
distribution of the data, and estimates of table size and that sort of thing,
for the planner to use to decide on a good query plan.

You need to run analyze when the statistics or size of a table has changed
significantly, so as to give the planner the best chance of choosing an
appropriate plan. If you're not running analyze occasionally then the planner
will be working on default stats or empty table stats and will tend to avoid
indexes. I don't know whether autovacuum will also analyze tables for you
automagically, but it would be a good idea to analyze the table every so
often, especially early on in the load - as the stats gathered for a small
table will likely give painful performance once the table has grown a lot.

>> Talking of which, are there indexes on the table? Normally you
>> wouldn't have indexes in place during a bulk import, but if you're
>> doing selects as part of the data load process then you'd be forcing
>> sequential scans for every query, which would explain why it gets
>> slower as the table gets bigger.
>
> There are indexes for every obvious "where this = that" clause. I don't
> believe that they will work for ilike expressions.

If you're doing a lot of "where foo ilike 'bar%'" queries, with the pattern
anchored to the left, you might want to look at using a functional index on
lower(foo) and rewriting the query to look like "where lower(foo) like
lower('bar%')".

Similarly, if you have many queries where the pattern is anchored at the
right of the string, then a functional index on the reverse of the string
can be useful.

Cheers,
  Steve
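For example (table and column names invented; note that with a non-C locale
you also want the pattern operator class for LIKE to use the index at all):

  CREATE INDEX person_last_name_lower_idx
      ON person (lower(last_name) text_pattern_ops);

  -- rewritten so the indexed expression appears in the WHERE clause:
  SELECT * FROM person WHERE lower(last_name) LIKE lower('Smith%');

  -- for right-anchored patterns the same trick works with an index on a
  -- user-defined reverse() function and a reversed, left-anchored pattern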
On Thu, Sep 28, 2006 at 02:04:21PM -0700, Steve Atkins wrote:
> I think you're confusing "explain" and "analyze". "Explain" gives you
> human readable output as to what the planner decided to do with the
> query you give it.

Don't forget about EXPLAIN ANALYZE, which is related to EXPLAIN but has
nothing to do with the ANALYZE command.

> indexes. I don't know whether autovacuum will also analyze tables
> for you automagically, but it would be a good idea to analyze the table

It does.

> >> Talking of which, are there indexes on the table? Normally you
> >> wouldn't have indexes in place during a bulk import, but if you're
> >> doing selects as part of the data load process then you'd be forcing
> >> sequential scans for every query, which would explain why it gets
> >> slower as the table gets bigger.
> >
> > There are indexes for every obvious "where this = that" clause. I don't
> > believe that they will work for ilike expressions.
>
> If you're doing a lot of "where foo ilike 'bar%'" queries, with the
> pattern anchored to the left, you might want to look at using a functional
> index on lower(foo) and rewriting the query to look like "where lower(foo)
> like lower('bar%')".
>
> Similarly, if you have many queries where the pattern is anchored
> at the right of the string, then a functional index on the reverse of the
> string can be useful.

tsearch might prove helpful... I'm not sure how it handles substrings.

Something else to consider... databases love doing bulk operations. It might
be useful to load prospective data into a temporary table, and then do as
many operations as you can locally (ie: within the database) on that table,
hopefully eliminating as many candidate rows as possible along the way.

I also suspect that running multiple merge processes at once would help.
Right now, your workload looks something like this:

  client sends query        database is idle
  client is idle            database runs query
  client gets query back    database is idle

Oversimplification, but you get the point. There's a lot of time spent
waiting on each side. If the import code is running on the server, you should
probably run one import process per CPU. If it's on an external server, 2 per
CPU would probably be better (and that might be faster than running local on
the server at that point).

--
Jim Nasby                                            jim@nasby.net
EnterpriseDB      http://enterprisedb.com      512.569.9461 (cell)
Carlo Stonebanks wrote: > Lots of great info here, I will see what applies to my situation. However, I > don't see bulk inserts of the tables working, because all of the tables need > to be refreshed as values to deduplicate and match will change with every > row added. In order for this to work, i would have to write queries against > the hash tables. This is where something like MySQL's in-memory tables would > have come in handy... > > What is GDB? > > Carlo Sorry, meant GDBM (disk based hash/lookup table). With Postgres if your tables fit into RAM then they are in-memory as long as they're actively being used. Hashtables and GDBM, as far as I know, are only useful for key->value lookups. However, for this they are *fast*. If you can figure out a way to make them work I'll bet things speed up. -- Matthew Nuzum newz2000 on freenode
> Something else to consider... databases love doing bulk operations. It
> might be useful to load prospective data into a temporary table, and
> then do as many operations as you can locally (ie: within the database)
> on that table, hopefully eliminating as many candidate rows as possible
> along the way.

I wish this would work... it was definitely something I considered early on
in the project. Even thinking of explaining why it won't work is giving me a
headache... But I sure wish it would.
> imo, the key to high performance big data movements in postgresql is
> mastering sql and pl/pgsql, especially the latter. once you get good
> at it, your net time of copy+plpgsql is going to be less than
> insert+tcl.

If this implies bulk inserts, I'm afraid I have to consider something else.
Any data that has been imported and deduplicated has to be placed back into
the database so that it can be available for the next imported row (there are
currently 16 tables affected, and more to come). If I was to cache all
inserts into a separate resource, then I would have to search 32 tables - the
local pending resources, as well as the data still in the system. I am not
even mentioning that imports do not just insert rows, they also update
existing rows, adding their own complexity.
> by the way, stats_command_string is a known performance killer that
> iirc was improved in 8.2. just fyi.

This is a handy fact, I will get on this right away.

> bulk load the denormalized tables into scratch tables in the postgresql
> database. create indexes appropriate to the normalization process,
> remembering you can index on virtually any expression in postgresql
> (including regex substitution).
>
> use sql to process the data. if the tables are too large to handle with
> monolithic queries, use cursors and/or functions to handle the
> conversion. now you can keep track of progress using the pl/pgsql raise
> notice command, for example.

For reasons I've explained elsewhere, the import process is not well suited
to breaking up the data into smaller segments. However, I'm interested in
what can be indexed. I am used to the idea that indexing only applies to
expressions that allow the data to be sorted, and then binary searches can be
performed on the sorted list. For example, I can see how you can create an
index to support:

   where foo like 'bar%'

But is there any way to create an index expression that will help with:

   where foo like '%bar%'?

I don't see it - but then again, I'm ready to be surprised!

Carlo
> Don't forget about EXPLAIN ANALYZE, which is related to EXPLAIN but has
> nothing to do with the ANALYZE command.

Ah, hence my confusion. Thanks for the clarification... I never knew about
ANALYZE as a separate command.
"Carlo Stonebanks" <stonec.register@sympatico.ca> writes: > But is there any way to create an index expression that will help with: > where foo like '%bar%'? If you are concerned about that, what you are probably really looking for is full-text indexing. See contrib/tsearch2 for our current best answer to that. regards, tom lane
>> indexes. I don't know whether autovacuum will also analyze tables
>> for you automagically, but it would be a good idea to analyze the table
>
> It does.

So, I have checked my log and I see an autovacuum running once every minute
on our various databases being hosted on the server - once every minute!

From what I can see, autovacuum is hitting the DBs in question about once
every five minutes. Does this imply an ANALYZE is being done automatically
that would meet the requirements we are talking about here? Is there any
benefit to explicitly performing an ANALYZE?

(Or does this go hand-in-hand with turning off autovacuum...?)
Carlo Stonebanks wrote: > We urgently need a major performance improvement. We are running the > PostgreSQL 8.1.4 on a Windows 2003 x64 Server on a dual processor, > dual core > 3.2Ghz Xeon box with 4gb RAM and a RAID (sorry, I don't know what > type) disc > subsystem. Sorry about the long intro, but here are my questions: Others have already drilled down to the way you do the inserts and statistics etc., so I'll just point out: Are you fully utilizing all the 4 cores you have? Could you parallelize the loading process, if you're currently running just one client? Are you I/O bound or CPU bound? -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On Thu, 2006-09-28 at 12:44 -0400, Carlo Stonebanks wrote: > At this early stage in the project, we are initializing our portal's > database with millions of rows of imported data in over 50 different > flattened tables; each table's structure is unique to the data provider. > This requires a pretty complex import program, because the data must be > matched semantically, not literally. Even with all of the expression > matching and fuzzy logic in the code,our performance statistics show that > the program spends over 75% of its time in SQL queries looking for matching > and/or duplicate data. My experience with that type of load process is that doing this row-by-row is a very expensive approach and your results bear that out. It is often better to write each step as an SQL statement that operates on a set of rows at one time. The lookup operations become merge joins rather than individual SQL Selects via an index, so increase the efficiency of the lookup process by using bulk optimisations and completely avoiding any program/server call traffic. Data can move from step to step by using Insert Selects into temporary tables, as Jim has already suggested. The SQL set approach is different to the idea of simply moving the code server-side by dropping it in a function. That helps with the net traffic but has other issues also. You don't need to use esoteric in-memory thingies if you use the more efficient join types already available when you do set based operations (i.e. join all rows at once in one big SQL statement). You can also improve performance by ordering your checks so that the ones most likely to fail happen first. Trying to achieve a high level of data quality in one large project is not often possible. Focus on the most critical areas of checking and get that working first with acceptable performance, then layer on additional checks while tuning. The complexity of the load programs you have also means they are susceptible to introducing data quality problems rather than removing them, so an incremental approach will also aid debugging of the load suite. Dynamic SQL programs are particularly susceptible to this kind of bug because you can't eyeball the code. -- Simon Riggs EnterpriseDB http://www.enterprisedb.com
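For example, instead of issuing one SELECT per input row, each step can
handle every staged row in a single statement (table and column names are
invented):

  -- resolve company ids for all staged rows in one pass
  CREATE TEMP TABLE staged_companies AS
  SELECT s.row_id, s.company_name, s.name_key, c.company_id
    FROM staging s
    LEFT JOIN company c ON c.name_key = s.name_key;

  -- rows that found no match become new companies, again as a set
  INSERT INTO company (company_name, name_key)
  SELECT DISTINCT s.company_name, s.name_key
    FROM staged_companies s
   WHERE s.company_id IS NULL;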
Hi, Carlo, Carlo Stonebanks wrote: > We urgently need a major performance improvement. Did you think about putting the whole data into PostgreSQL using COPY in a nearly unprocessed manner, index it properly, and then use SQL and stored functions to transform the data inside the database to the desired result? We're using this way for some 3rd-party databases we have to process in-house. HTH, Markus -- Markus Schaber | Logical Tracking&Tracing International AG Dipl. Inf. | Software Development GIS Fight against software patents in Europe! www.ffii.org www.nosoftwarepatents.org
Hi, Carlo, Carlo Stonebanks wrote: > From what I can see, autovacuum is hitting the db's in question about once > every five minutes. Does this imply an ANALYZE is being done automatically > that would meet the requirements we are talking about here? Is there any > benefit ot explicitly performing an ANALYZE? Autovacuum looks at the modification statistics (they count how much modifications happened on the table), and decides whether it's time to VACUUM (reclaim empty space) and ANALYZE (update column value distributions) the table. The exact thresholds for Autovacuum to kick in are configurable, see the docs. HTH, Markus -- Markus Schaber | Logical Tracking&Tracing International AG Dipl. Inf. | Software Development GIS Fight against software patents in Europe! www.ffii.org www.nosoftwarepatents.org
In response to "Carlo Stonebanks" <stonec.register@sympatico.ca>: > >> indexes. I don't know whether autovacuum will also analyze tables > >> for you automagically, but it would be a good idea to analyze the table > > > > It does. > > So, I have checked my log and I see an autovacuum running once every minute > on our various databases being hosted on the server - once every minute! > > From what I can see, autovacuum is hitting the db's in question about once > every five minutes. Does this imply an ANALYZE is being done automatically > that would meet the requirements we are talking about here? Is there any > benefit ot explicitly performing an ANALYZE? > > (Or does this go hand-in-and with turning off autovacuum...?) It's only checking to see if vacuum/analyze needs done every 5 minutes. It may or may not do any actual work at that time, based on how much the tables have changed. See: http://www.postgresql.org/docs/8.1/interactive/maintenance.html#AUTOVACUUM This is a case, during your bulk loads, where autovacuum might actually hurt you. How many records are you inserting/updating in 5 minutes? You may be exceeding autovacuum's ability to keep things clean. I can't say for sure, but I would suspect that you'd be better off not using autovacuum until after the initial data loads are done. My guess is that you'll get better performance if you disable autovac and write manual vacuum/analyze into your load scripts. Exactly how often to have your script do it is something that will require testing to figure out, but probably starting with every 100 or so, then adjust it up and down and see what works best. Explicitly performing a vacuum or analyze can be very beneficial, especially if you know what kind of changes your creating in the data. (Now that I think of it, there's no reason to disable autovac, as it will notice if you've just manually vacuumed a table and not do it again.) If you know that you're radically changing the kind of data in a table, manually running analyze is a good idea. If you know that you're creating a lot of dead tuples, manually vacuuming is a good idea. Especially during a big data load where these changes might be taking place faster than autovac notices. -- Bill Moran Collaborative Fusion Inc.
On 9/29/06, Carlo Stonebanks <stonec.register@sympatico.ca> wrote:
> For reasons I've explained elsewhere, the import process is not well suited
> to breaking up the data into smaller segments. However, I'm interested in
> what can be indexed. I am used to the idea that indexing only applies to
> expressions that allow the data to be sorted, and then binary searches can
> be performed on the sorted list. For example, I can see how you can create
> an index to support:
>
>    where foo like 'bar%'
>
> But is there any way to create an index expression that will help with:
>
>    where foo like '%bar%'?
>
> I don't see it - but then again, I'm ready to be surprised!

using a standard (btree) index, you can create an index on any constant
expression. so, you can create an index that matches '%bar%', but if you also
want to match '%bat%', you need another index. there are other exotic methods
like the tsearch and gist approaches which may or may not be suitable.

regarding your import process: you came to this list and asked for advice on
how to fix your particular problem. tweaking postgresql.conf, etc. will get
you incremental gains but is unlikely to have a huge impact. as i understand
it, your best shot at improvement using the current process is to:

1. fork your import somehow to get all 4 cores running
2. write the code that actually does the insert in C and use the
   parameterized prepared statement.

however, your general approach has been 'please give me advice, but only the
advice that i want'. if you really want to fix your problem, give more
specific details about your import and open the door to improvements in your
methodology, which i suspect is not optimal. you concluded that client side
coding was the way to go, but here you are asking how to make it work. if you
want help (and there are some extremely smart people here who may give you
world class advice for free), you need to lay your cards on the table and be
willing to consider alternative solutions. you may find that a few properly
constructed queries will knock out 75% of your code and running time.
remember, often the key to optimization is choosing the right algorithm.

merlin
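p.s. to be concrete about the '%bar%' comment, one way to do it with a plain
btree is a partial index (names invented) - only queries that repeat the
identical predicate can use it:

  CREATE INDEX provider_bar_idx ON provider (provider_id)
      WHERE company_name LIKE '%bar%';

  -- uses the (small) partial index instead of scanning the whole table:
  SELECT provider_id FROM provider WHERE company_name LIKE '%bar%';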
I have loaded three of the four cores by running three different versions of
the import program to import three different segments of the table to import.
The server jumps to 75% usage, with three postgresql processes eating up 25%
each, the actual client itself taking up just a few ticks.

"Heikki Linnakangas" <heikki@enterprisedb.com> wrote in message
news:451CE38B.7070107@enterprisedb.com...
> Others have already drilled down to the way you do the inserts and
> statistics etc., so I'll just point out:
>
> Are you fully utilizing all the 4 cores you have? Could you parallelize
> the loading process, if you're currently running just one client? Are you
> I/O bound or CPU bound?
> Did you think about putting the whole data into PostgreSQL using COPY in
> a nearly unprocessed manner, index it properly, and then use SQL and
> stored functions to transform the data inside the database to the
> desired result?

This is actually what we are doing. The slowness is in the row-by-row
transformation. Every row requires that all the inserts and updates of the
previous row be committed - that's why we have problems figuring out how to
do this using SQL set logic.

Carlo
> My experience with that type of load process is that doing this
> row-by-row is a very expensive approach and your results bear that out.

I expected this, and had warned the client before the project started that
this is exactly where SQL underperforms.

> It is often better to write each step as an SQL statement that operates
> on a set of rows at one time.

The problem with this approach is that every row of data is dependent on the
previous row's data being validated and imported. e.g.

Import Row 1:
  John Q Smith
  Foobar Corp
  123 Main St,
  Bigtown, MD 12345-6789

Import Row 2:
  John Quincy Smith
  FuzzyLoginc Inc
  123 Main St, Suite 301
  Bigtown, MD 12345-6789

Import Row 3:
  Bobby Jones
  Foobar Corp
  123 Main Strett Suite 300,
  Bigtown, MD 12345

Every row must be imported into the table so that the next row may see the
data and consider it when assigning IDs to the name, company and address.
(All data must be normalised.) How can this be done using set logic?

> You can also improve performance by ordering your checks so that the
> ones most likely to fail happen first.

Already done - I believe the problem is definitely in the navigational access
model. What I am doing now makes perfect sense as far as the logic of the
process goes - any other developer will read it and understand what is going
on. At 3000 lines of code, this will be tedious, but understandable. But SQL
hates it.

> Trying to achieve a high level of data quality in one large project is
> not often possible. Focus on the most critical areas of checking and get
> that working first with acceptable performance, then layer on additional
> checks while tuning. The complexity of the load programs you have also
> means they are susceptible to introducing data quality problems rather
> than removing them, so an incremental approach will also aid debugging
> of the load suite.

I couldn't agree more.

Carlo
> 1. fork your import somehow to get all 4 cores running

This is already happening, albeit only 3. No improvement - it appears we have
taken the same problem and divided it by 3. Same projected completion time.
This is really curious, to say the least.

> 2. write the code that actually does the insert in C and use the
>    parameterized prepared statement.

I had already tried the parametrised prepare statement; I had mentioned that
I was surprised that it had no effect. No one here seemed surprised, or at
least didn't think of commenting on it.

> however, your general approach has been 'please give me advice, but
> only the advice that i want'.

I'm sorry, I don't understand - I had actually originally come asking four
questions asking for recommendations and opinions on hardware, OS and
commercial support. I did also ask for comments on my config setup.
Hi, Carlo,

Carlo Stonebanks wrote:
>> Did you think about putting the whole data into PostgreSQL using COPY in
>> a nearly unprocessed manner, index it properly, and then use SQL and
>> stored functions to transform the data inside the database to the
>> desired result?
>
> This is actually what we are doing. The slowness is in the row-by-row
> transformation. Every row requires that all the inserts and updates of the
> previous row be committed - that's why we have problems figuring out how to
> do this using SQL set logic.

Maybe "group by", "order by", "distinct on" and hand-written functions and
aggregates (like first() or best()) may help.

You could combine all relevant columns into a user-defined compound type,
then group by entity, and have a self-defined aggregate generate the
accumulated tuple for each entity.

Markus
--
Markus Schaber | Logical Tracking&Tracing International AG
Dipl. Inf.     | Software Development GIS

Fight against software patents in Europe! www.ffii.org
www.nosoftwarepatents.org
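For example, picking the single best candidate row per entity in one pass
with DISTINCT ON (column names are invented, and the score would come from
your matching functions):

  SELECT DISTINCT ON (name_key, zip)
         name_key, zip, company_name, street
    FROM staging
   ORDER BY name_key, zip, match_score DESC;

  -- the first row per (name_key, zip) group wins, i.e. the best-scoring one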
Hi, Carlo,

Carlo Stonebanks wrote:
>> Trying to achieve a high level of data quality in one large project is
>> not often possible. Focus on the most critical areas of checking and get
>> that working first with acceptable performance, then layer on additional
>> checks while tuning. The complexity of the load programs you have also
>> means they are susceptible to introducing data quality problems rather
>> than removing them, so an incremental approach will also aid debugging
>> of the load suite.
>
> I couldn't agree more.

I still think that using a PL in the backend might be more performant than
having an external client, if only because the SPI interface is more
efficient than the network serialization needed for external applications.

HTH,
Markus
--
Markus Schaber | Logical Tracking&Tracing International AG
Dipl. Inf.     | Software Development GIS

Fight against software patents in Europe! www.ffii.org
www.nosoftwarepatents.org
> I still think that using a PL in the backend might be more performant
> than having an external client, if only because the SPI interface is more
> efficient than the network serialization needed for external applications.

I would actually love for this to work better, as this is technology that I
would like to develop in general - I see db servers with strong server-side
programming languages as being able to operate as application servers, with
the enterprise's business logic centralised on the server.

The import routine that I wrote will actually work on the server as well - it
will detect the presence of the spi_* calls, and replace the pg_* calls with
spi_* calls. So, you see, this WAS my intention.

However, the last time I tried to run something that complex from the db
server, it ran quite slowly compared to running it from a client. This may
have had something to do with the client that I used to call the stored
procedure - I thought that perhaps the client created an implicit transaction
around my SQL statement to allow a rollback, and all of the updates and
inserts got backed up in a massive transaction queue that took forever to
commit.

Carlo
> Maybe "group by", "order by", "distinct on" and hand-written functions > and aggregates (like first() or best()) may help. We use these - we have lexical analysis functions which assign a rating to each row in a set, and the likelyhood that the data is a match, and then we sort our results. I thought this would be the cause of the slowdowns - and it is, but a very small part of it. I have identified the problem code, and the problems are within some very simple joins. I have posted the code under a related topic header. I obviously have a few things to learn about optimising SQL joins. Carlo > > You could combine all relevant columns into an user-defined compund > type, then group by entity, and have a self-defined aggregate generate > the accumulated tuple for each entity. > > Markus > -- > Markus Schaber | Logical Tracking&Tracing International AG > Dipl. Inf. | Software Development GIS > > Fight against software patents in Europe! www.ffii.org > www.nosoftwarepatents.org > > ---------------------------(end of broadcast)--------------------------- > TIP 6: explain analyze is your friend >