Thread: Benchmark Data requested
Can I ask for some help with benchmarking? There are some results here that show PostgreSQL is slower in some cases than Monet and MySQL. Of course these results were published immediately prior to 8.2 being released, and were run out-of-the-box, so without even basic performance tuning. Would anybody like to repeat these tests with the latest production versions of these databases (i.e. with PGSQL 8.3), and with some sensible tuning settings for the hardware used? It will be useful to get some blind tests with more sensible settings. http://monetdb.cwi.nl/projects/monetdb//SQL/Benchmark/TPCH/ Multiple runs from different people/different hardware are useful since they help to iron out differences in hardware and test methodology. So don't worry if you see somebody else doing this also. Thanks, -- Simon Riggs 2ndQuadrant http://www.2ndQuadrant.com
Hi Simon, Note that MonetDB/X100 does not have a SQL optimizer, they ran raw hand-coded plans. As a consequence, these comparisons should be taken as an "executor-executor" test and we/you should be sure that the PG planner has generated the best possible plan. That said, we've already done the comparisons internally and they've got a good point to make about L2 cache use and removal of unnecessary abstractions in the executor. We've been aware of this since 2005/6 and have been slowly working these ideas into our/PG executor. Bottom line: it's a good thing to work to get close to the X100/Monet executor with a more general purpose DB. PG is a looong way from being comparable, mostly due to poor L2 D-cache locality and I-cache thrashing in the executor. The only way to cure this is to work on more rows than one at a time. - Luke On 2/4/08 10:37 AM, "Simon Riggs" <simon@2ndquadrant.com> wrote: > Can I ask for some help with benchmarking? > > There are some results here that show PostgreSQL is slower in some cases > than Monet and MySQL. Of course these results were published immediately > prior to 8.2 being released, plus run out-of-the-box, so without even > basic performance tuning. > > Would anybody like to repeat these tests with the latest production > versions of these databases (i.e. with PGSQL 8.3), and with some > sensible tuning settings for the hardware used? It will be useful to get > some blind tests with more sensible settings. > > http://monetdb.cwi.nl/projects/monetdb//SQL/Benchmark/TPCH/ > > Multiple runs from different people/different hardware is useful since > they help to iron-out differences in hardware and test methodology. So > don't worry if you see somebody else doing this also. > > Thanks,
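To make the tuple-at-a-time versus block-at-a-time distinction concrete, here is a rough sketch. Python cannot show L2 cache effects, only the structural difference Luke describes: the same expression evaluated once per row versus once per batch of rows, which is the per-tuple overhead the X100-style executor amortizes. The row layout and batch size are arbitrary choices for illustration.

```python
# Illustrative only: shows the shape of tuple-at-a-time evaluation (one
# "operator" call per row) versus block-at-a-time evaluation (one call per
# batch, looping tightly inside), not real executor code.
import random
import time

rows = [(random.random(), random.random()) for _ in range(1000000)]

def price_row(row):
    # "operator" invoked once per tuple
    return row[0] * (1.0 - row[1])

def price_block(block):
    # the same "operator" invoked once per batch of rows
    return sum(a * (1.0 - b) for a, b in block)

start = time.time()
total_row = sum(price_row(r) for r in rows)          # tuple-at-a-time
t_row = time.time() - start

start = time.time()
total_block = 0.0
for i in range(0, len(rows), 1024):                  # block-at-a-time, 1024 rows per call
    total_block += price_block(rows[i:i + 1024])
t_block = time.time() - start

print("per-row: %.2fs  per-block: %.2fs" % (t_row, t_block))
```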
> There are some results here that show PostgreSQL is slower in some cases > than Monet and MySQL. Of course these results were published immediately > prior to 8.2 being released, plus run out-of-the-box, so without even > basic performance tuning. > > Would anybody like to repeat these tests with the latest production > versions of these databases (i.e. with PGSQL 8.3), and with some > sensible tuning settings for the hardware used? It will be useful to get > some blind tests with more sensible settings. > > http://monetdb.cwi.nl/projects/monetdb//SQL/Benchmark/TPCH/ > > Multiple runs from different people/different hardware is useful since > they help to iron-out differences in hardware and test methodology. So > don't worry if you see somebody else doing this also. Here is another graph: http://tweakers.net/reviews/649/7 Without MonetDB, though. -- regards Claus When lenity and cruelty play for a kingdom, the gentlest gamester is the soonest winner. Shakespeare
On Mon, 2008-02-04 at 10:47 -0800, Luke Lonergan wrote: > Note that MonetDB/X100 does not have a SQL optimizer, they ran raw > hand-coded plans. As a consequence, these comparisons should be taken as an > "executor-executor" test and we/you should be sure that the PG planner has > generated the best possible plan. If it doesn't then I'd regard that as a performance issue in itself. > That said, we've already done the comparisons internally and they've got a > good point to make about L2 cache use and removal of unnecessary > abstractions in the executor. We've been aware of this since 2005/6 and > have been slowly working these ideas into our/PG executor. > > Bottom line: it's a good thing to work to get close to the X100/Monet > executor with a more general purpose DB. PG is a looong way from being > comparable, mostly due to poor L2 D-cache locality and I-cache thrashing in > the executor. You may be right, but I want to see where it hurts us the most. > The only way to cure this is to work on more rows than one at a time. Do you have any results to show that's true, or are you just referring to the Cray paper? (which used fixed-length tuples and specific vector hardware). (With regard to benchmarks, I'd rather not download Monet at all. Helps avoid legal issues around did-you-look-at-the-code questions.) -- Simon Riggs 2ndQuadrant http://www.2ndQuadrant.com
On Mon, 4 Feb 2008, Simon Riggs wrote: > Would anybody like to repeat these tests with the latest production > versions of these databases (i.e. with PGSQL 8.3) Do you have any suggestions on how people should run TPC-H? It looked like a bit of work to sort through how to even start this exercise. -- * Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
On Mon, 2008-02-04 at 15:09 -0500, Greg Smith wrote: > On Mon, 4 Feb 2008, Simon Riggs wrote: > > > Would anybody like to repeat these tests with the latest production > > versions of these databases (i.e. with PGSQL 8.3) > > Do you have any suggestions on how people should run TPC-H? It looked > like a bit of work to sort through how to even start this exercise. The link referred to a few different scale factors, so you could try those. But anything that uses the hardware you have to its full potential is valuable. Everybody's test method is going to be different, whatever I say... -- Simon Riggs 2ndQuadrant http://www.2ndQuadrant.com
Hi Simon, On 2/4/08 11:07 AM, "Simon Riggs" <simon@2ndquadrant.com> wrote: >> "executor-executor" test and we/you should be sure that the PG planner has >> generated the best possible plan. > > If it doesn't then I'd regard that as a performance issue in itself. Agreed, though that's two problems to investigate - I think the Monet/X100 stuff is clean in that it's a pure executor test. > You maybe right, but I want to see where it hurts us the most. You'll see :-) >> The only way to cure this is to work on more rows than one at a time. > > Do you have any results to show that's true, or are you just referring > to the Cray paper? (Which used fixed length tuples and specific vector > hardware). No paper referenced; just inference from the results and their (and others') conclusions about locality and re-use. The situation is similar enough to scientific programming on vector processors versus cache-based superscalar CPUs that these are the right conclusions. We've done the profiling to look at cache misses and have some data to back it up as well. > (With regard to benchmarks, I'd rather not download Monet at all. Helps > avoid legal issues around did-you-look-at-the-code questions.) None of us have looked at the code or downloaded it. There are a number of presentations out there for Monet/X100 to see what their results are. - Luke
Hi Greg, On 2/4/08 12:09 PM, "Greg Smith" <gsmith@gregsmith.com> wrote: > Do you have any suggestions on how people should run TPC-H? It looked > like a bit of work to sort through how to even start this exercise. Running "TPC-H" officially requires a license to publish results, etc. However, I think you can use their published data and query generation kit to run the queries, which aren't the benchmark per se. That's what the Monet/X100 people did. - Luke
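A minimal sketch of that unofficial approach: load the flat files produced by the TPC-H dbgen kit into PostgreSQL with COPY. It assumes psycopg2 is installed, the TPC-H tables already exist, the connection string is a placeholder, and the trailing '|' that dbgen writes at the end of every line has already been stripped (e.g. with sed 's/|$//'); otherwise COPY sees an extra empty column.

```python
# Load dbgen flat files (lineitem.tbl, orders.tbl, ...) via COPY.
import glob
import psycopg2

conn = psycopg2.connect("dbname=tpch")
cur = conn.cursor()

for path in sorted(glob.glob("*.tbl")):
    table = path[:-len(".tbl")]          # lineitem.tbl -> lineitem
    with open(path) as f:
        cur.copy_expert("COPY %s FROM STDIN WITH DELIMITER '|'" % table, f)
    conn.commit()
    print("loaded", table)

conn.close()
```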
Hi Simon,

I have some insight into how TPC-H works. First of all, I think it is a violation of TPC rules to publish numbers without auditing them first. So even if I do the test to show the better performance of PostgreSQL 8.3, I cannot post it here or on any public forum without going through the "process" (even though it is a partial benchmark, as they are just doing the equivalent of the Power Run of TPC-H). Maybe the PostgreSQL PR team should email info@tpc.org about it and see what they have to say about that comparison.

On the technical side: remember that all TPC-H queries, when run sequentially on PostgreSQL, use only one core or virtual CPU, so a big multi-core system is a poor fit (same for MySQL too). Also, very important: unless you are running the update functions, which are separate queries, all of the Q1-Q22 queries are pure read-only queries. Traditionally I think PostgreSQL does lack read speed, especially since it is bottlenecked by the size of the reads it does (BLOCKSIZE). Major databases provide multi-block parameters to do reads/writes in multiples of the blocksize to reduce IOPS, and for read-only work they also have read-ahead or prefetch sizes, generally bigger than the multi-block or extent sizes, to aid reads.

Scale factor is in terms of gigabytes, so using a maximum scale of 5 (5 GB) is pretty useless, since most of the rows can be cached on modern systems. And comparing at 0.01 - what is that, 10 MB? A recent Intel L2 cache is probably bigger than that.

If you are doing tuning for the TPC-H queries, focus on a few of them (a rough per-query timing harness is sketched after this message). For example, Query 1 is very join intensive, and if your CPU is not 100% used then you have a problem in your IO to solve before tuning it. Another example: Query 16 is literally an IO scan-speed test; many people use it to see whether the database can scan at the "line speed" of the storage, and ending up at 100% CPU means the database cannot process that many rows (just to bring them in). In essence, each query exercises some combination of system features to highlight performance. However, since it is an old benchmark, database companies end up "re-engineering" their technologies to gain an advantage in it (hence a successor called TPC-DS is in the works, which will have more than 100 such queries).

A few of the technologies that have really helped vendors gain ground in the TPC-H world:
* Hash and/or range partitioning of tables (PostgreSQL 8.3 can do that, but the setup cost of writing the schema is high, especially since the data has to be loaded into separate tables)
* Automated aggregate views, used by optimizers, to keep the more frequently used aggregations up to date in smaller views
* Cube view indexes, like bitmap indexes but multidimensional (I think, but I am not sure)

That said, is it useful for regression testing in PostgreSQL farms? I would think yes, especially Q16.

Hope this helps. Regards, Jignesh

Simon Riggs wrote: > Can I ask for some help with benchmarking? > > There are some results here that show PostgreSQL is slower in some cases > than Monet and MySQL. Of course these results were published immediately > prior to 8.2 being released, plus run out-of-the-box, so without even > basic performance tuning. > > Would anybody like to repeat these tests with the latest production > versions of these databases (i.e. with PGSQL 8.3), and with some > sensible tuning settings for the hardware used? It will be useful to get > some blind tests with more sensible settings.
> > http://monetdb.cwi.nl/projects/monetdb//SQL/Benchmark/TPCH/ > > Multiple runs from different people/different hardware is useful since > they help to iron-out differences in hardware and test methodology. So > don't worry if you see somebody else doing this also. > > Thanks, > >
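The per-query timing harness referred to above might look like the following sketch: run Q1-Q22 one at a time on an otherwise idle system and record wall-clock elapsed time per query. It assumes psycopg2, a placeholder connection string, and that the 22 queries have already been generated with qgen into files q1.sql .. q22.sql (the stock query templates need minor syntax tweaks before PostgreSQL will accept them all).

```python
# Power-run-style timing: one query at a time, elapsed seconds per query.
import time
import psycopg2

conn = psycopg2.connect("dbname=tpch")
cur = conn.cursor()

for qnum in range(1, 23):
    sql = open("q%d.sql" % qnum).read()
    start = time.time()
    cur.execute(sql)
    if cur.description is not None:
        cur.fetchall()                   # force the full result set to be read
    print("Q%02d  %8.1f s" % (qnum, time.time() - start))

conn.close()
```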
On Mon, 4 Feb 2008, Luke Lonergan wrote: > However, I think you can use their published data and query generation kit > to run the queries, which aren't the benchmark per-se. That's what the > Monet/X100 people did. Right; I was just hoping someone might suggest some relatively standardized way to do that via PostgreSQL. I read Simon's original note and was afraid that multiple people might end up duplicating some non-trivial amount of work just to get the kits set up and running, or get frustrated not expecting that part and just give up on the whole idea. I'm very interested in this particular topic (non-trivial database micro-benchmarks) but have no time to spare this week to hack on this one myself. -- * Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
Doing it at low scales is not attractive. Commercial databases are publishing at scale factors of 1000 (about 1 TB) to 10000 (10 TB), with one at 30 TB. So ideally, tuning right now should start at scale factor 1000. Unfortunately, when I have tried that before with PostgreSQL, a few of the problems are as follows:

The single-stream loader of PostgreSQL takes hours to load the data (a single-stream load... wasting all the extra cores out there). Multiple table loads (one per table) spawned via script are a bit better, but hit WAL problems. To avoid the WAL problems, I created the tables and the load statements within the same transaction, which is faster, but then you cannot create indexes before the load or it starts writing to WAL... and if the indexes are created after the load, it takes about a day or so to create all the indexes required (index creation is single-threaded, and creating multiple indexes at the same time can run out of temporary disk space since the tables are so big, which means one index at a time is the answer for really big tables; it is slow). Boy, by this time most people doing TPC-H at the high end give up on PostgreSQL.

I have not even started partitioning of tables yet, since with the current framework you have to load the data separately into each table, which means that for the TPC-H data you need extra logic to take the table data and split it into each partition child table. Not stuff that many people want to do by hand.

Then there is the power run, which runs one query at a time and should be able to utilize the full system (especially on multi-core systems); unfortunately PostgreSQL can use only one core. (Plus, since this is read-only and there is no separate disk reader, all other processes are idle.) So the system runs at 1/Nth capacity, where N is the number of cores/threads. (I am not sure here: with partitioned tables, do you get N processes running when you scan the partitioned table?) Even off-loading work like fetching the data into the buffer pool to separate processes would be a big win with this type of workload.

I would be happy to help out if folks here want to do work related to it. In fact, if you have time, I can request a project in one of the Sun benchmarking centers to see what we can learn with community members interested in understanding where PostgreSQL performs and where it fails.

Regards, Jignesh

Greg Smith wrote: > On Mon, 4 Feb 2008, Simon Riggs wrote: > >> Would anybody like to repeat these tests with the latest production >> versions of these databases (i.e. with PGSQL 8.3) > > Do you have any suggestions on how people should run TPC-H? It looked > like a bit of work to sort through how to even start this exercise. > > -- > * Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
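As an illustration of the "extra logic" mentioned above, here is a sketch that splits a dbgen flat file into per-partition files keyed on o_orderdate and then COPYs each child table from its own connection. The monthly child tables (orders_1996_01 and so on), the connection string, and the column position are assumptions for illustration; psycopg2 is assumed to be installed.

```python
# Split orders.tbl by month of o_orderdate into one file per (hypothetical)
# child table orders_YYYY_MM, then COPY each child from its own connection.
import threading
import psycopg2

SRC = "orders.tbl"
DATE_COL = 4                             # 0-based position of o_orderdate in dbgen output

# 1. split rows into one file per child partition, dropping dbgen's trailing '|'
writers = {}
for line in open(SRC):
    fields = line.rstrip("\n").rstrip("|").split("|")
    ym = fields[DATE_COL][:7].replace("-", "_")          # e.g. '1996-01' -> '1996_01'
    if ym not in writers:
        writers[ym] = open("orders_%s.tbl" % ym, "w")
    writers[ym].write("|".join(fields) + "\n")
for f in writers.values():
    f.close()

# 2. COPY each partition from its own connection, one thread per child table
def load(ym):
    conn = psycopg2.connect("dbname=tpch")
    cur = conn.cursor()
    with open("orders_%s.tbl" % ym) as f:
        cur.copy_expert("COPY orders_%s FROM STDIN WITH DELIMITER '|'" % ym, f)
    conn.commit()
    conn.close()

threads = [threading.Thread(target=load, args=(ym,)) for ym in writers]
for t in threads:
    t.start()
for t in threads:
    t.join()
```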
On Mon, 4 Feb 2008, Jignesh K. Shah wrote: > Doing it at low scales is not attractive. Commercial databases are > publishing at scale factor of 1000(about 1TB) to 10000(10TB) with one in > 30TB space. So ideally right now tuning should start at 1000 scale > factor. I think what Simon was trying to get at is some sort of debunking of Monet's benchmarks, which were running in-memory while not giving PostgreSQL any real memory to work with. What you're talking about is a completely separate discussion which is well worth having in addition to that. I'm well aware of how painful it is to generate+load+index even a single TB worth of data with PostgreSQL right now because I've been doing just that for weeks now (there are two processing phases in there as well for me that take even longer, but the raw operations are still a significant portion of the total time). > I would be happy to help out if folks here want to do work related to > it. Infact if you have time, I can request a project in one of the Sun > Benchmarking center to see what we can learn with community members > interested in understanding where PostgreSQL performs and fails. Sounds like a good 8.4 project. Maybe pick this topic back up at the East convention next month, we could talk about it then. -- * Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
"Jignesh K. Shah" <J.K.Shah@Sun.COM> writes: > Then for the power run that is essentially running one query at a time should > essentially be able to utilize the full system (specially multi-core systems), > unfortunately PostgreSQL can use only one core. (Plus since this is read only > and there is no separate disk reader all other processes are idle) and system > is running at 1/Nth capacity (where N is the number of cores/threads) Is the whole benchmark like this or is this just one part of it? Is the i/o system really able to saturate the cpu though? -- Gregory Stark EnterpriseDB http://www.enterprisedb.com Ask me about EnterpriseDB's On-Demand Production Tuning
"Jignesh K. Shah" <J.K.Shah@Sun.COM> writes: > Also very important unless you are running the UPDATE FUNCTIONS which are > separate queries, all these Q1-Q22 Queries are pure "READ-ONLY" queries. > Traditionally I think PostgreSQL does lack "READ-SPEED"s specially since it is > bottlenecked by the size of the reads it does (BLOCKSIZE). Major database > provides multi-block parameters to do multiple of reads/writes in terms of > blocksizes to reduce IOPS and also for read only they also have READ-AHEAD or > prefetch sizes which is generally bigger than multi-block or extent sizes to > aid reads. Note that all of these things are necessitated by those databases using direct i/o of some form or another. The theory is that PostgreSQL doesn't have to worry about these things because the kernel is taking care of it. How true that is is a matter of some debate and it varies from OS to OS. But it's definitely true that the OS will do read-ahead for sequential reads, for example. Incidentally we found some cases that Solaris was particularly bad at. Is there anybody in particular that would be interested in hearing about them? (Not meant to be a knock on Solaris, I'm sure there are other cases Linux or BSD handle poorly too) -- Gregory Stark EnterpriseDB http://www.enterprisedb.com Ask me about EnterpriseDB's Slony Replication support!
TPC-H has two runs: the Power Run, which is a single stream (Q1-Q22 plus RF1 and RF2), and the Throughput Run, which has N streams (N depends on scale) running simultaneously, each with a mixed sequence of the same queries plus the two update functions. During the throughput run you can expect to max out the CPUs. But commercial databases generally have power runs that scale quite well even on multi-core systems (Oracle, without RAC, has published with 144 cores on Solaris).

As for the IO system saturating the CPU, it is twofold: the kernel fetching in the data, which saturates at some value, and, in this case, PostgreSQL reading the data and putting it in its buffer pool. An example of how I use it is as follows: do a SELECT on a table such that it results in a full table scan without actually returning any rows, and keep throwing hardware (better storage) at it until it saturates the CPU. That is the practical maximum you can do with that CPU/OS combination (assuming unlimited storage bandwidth). I use this primarily for guessing how fast one of the TPC-H queries will complete. In my tests with PostgreSQL, I generally reach the CPU limit without even reaching the bandwidth of the underlying storage.

Just to give numbers: a single 2Gb Fibre Channel port can practically go up to 180 MB/s, and single 4Gb ports have proven to go up to 360-370 MB/s. So to saturate one FC port, PostgreSQL has to be able to scan 370 MB/s without saturating the CPU. Then comes software striping, which allows multiple ports to be striped across, increasing the available bandwidth; now scanning has to be able to drive N x 370 MB/s, all on a single core. I had some numbers, and some limits based on CPU frequency, blocksize, etc., but those were from the 8.1 days or so.

I think that to take PostgreSQL a bit more high end, we first have to scale these numbers up. Doing some sort of test in the PostgreSQL build farms for every release would actually help people see how much data each release can drive through. We could also work out some database-operation metrics to gauge how much each release improves over older ones; I have ideas for a few of them.

Regards, Jignesh

Gregory Stark wrote: > "Jignesh K. Shah" <J.K.Shah@Sun.COM> writes: > > >> Then for the power run that is essentially running one query at a time should >> essentially be able to utilize the full system (specially multi-core systems), >> unfortunately PostgreSQL can use only one core. (Plus since this is read only >> and there is no separate disk reader all other processes are idle) and system >> is running at 1/Nth capacity (where N is the number of cores/threads) >> > > Is the whole benchmark like this or is this just one part of it? > > Is the i/o system really able to saturate the cpu though? > >
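A sketch of that scan-speed probe, assuming psycopg2, a placeholder connection string, and a populated lineitem table: time a query that has to read every block but returns no table rows, and derive an apparent MB/s from pg_relation_size. If the CPU hits 100% before the storage does, the executor is the bottleneck.

```python
# Sequential-scan throughput probe for a single backend.
import time
import psycopg2

conn = psycopg2.connect("dbname=tpch")
cur = conn.cursor()

cur.execute("SELECT pg_relation_size('lineitem')")
size_mb = cur.fetchone()[0] / (1024.0 * 1024.0)

start = time.time()
# predicate on an unindexed column so the plan is a plain sequential scan
cur.execute("SELECT count(*) FROM lineitem WHERE l_comment = 'no such comment'")
cur.fetchone()
elapsed = time.time() - start

print("scanned %.0f MB in %.1f s => %.1f MB/s" % (size_mb, elapsed, size_mb / elapsed))
conn.close()
```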
Gregory Stark wrote: > Incidentally we found some cases that Solaris was particularly bad at. Is > there anybody in particular that would be interested in hearing about them? > (Not meant to be a knock on Solaris, I'm sure there are other cases Linux or > BSD handle poorly too) > > > Send me the details, I can file bugs for Solaris on behalf of the community. Since I am involved in a lot of PostgreSQL testing on Solaris this year, I have a small list myself (mostly related to file system stuff though). I know of one regarding bonnie rewriting blocks that you sent out. (I still haven't done anything about it yet, but I finally have some test machines for such work instead of using my workstation to test it out :-) But I am really interested in seeing which ones hit PostgreSQL performance/usability. Thanks in advance. Regards, Jignesh
On Mon, 2008-02-04 at 17:33 -0500, Jignesh K. Shah wrote: > First of all I think it is a violation of TPC rules to publish numbers > without auditing them first. So even if I do the test to show the > better performance of PostgreSQL 8.3, I cannot post it here or any > public forum without doing going through the "process". I'm not interested in the final results, pricing etc.. Just the query-by-query elapsed times. Can you show which part of the rules precludes this? I can't find it. This is a developer list, so "publishing" things here is what we do for discussion, so it's hardly breaking the spirit of the TPC rules to publish results here, in the hope of focusing development effort. -- Simon Riggs 2ndQuadrant http://www.2ndQuadrant.com
On Mon, 2008-02-04 at 17:55 -0500, Jignesh K. Shah wrote: > Doing it at low scales is not attractive. > > Commercial databases are publishing at scale factor of 1000(about 1TB) > to 10000(10TB) with one in 30TB space. So ideally right now tuning > should start at 1000 scale factor. I don't understand this. Sun is currently publishing results at 100GB, 300GB etc.. Why would we ignore those and go for much higher numbers? Especially when you explain why we wouldn't be able to. There isn't any currently valid result above 10 TB. If anybody is going to run tests in response to my request, then *any* scale factor is interesting, on any hardware. If that means Scale Factor 1, 3, 10 or 30 then that's fine by me. -- Simon Riggs 2ndQuadrant http://www.2ndQuadrant.com
Hi, On Monday 04 February 2008, Jignesh K. Shah wrote: > Single stream loader of PostgreSQL takes hours to load data. (Single > stream load... wasting all the extra cores out there) I wanted to work on this at the pgloader level, so the CVS version of pgloader is now able to load data in parallel, with a Python thread per configured section (1 section = 1 data file = 1 table is often the case). It is not configurable at the moment, but I plan on providing a "threads" knob which will default to 1, and could be -1 for "as many threads as sections". > Multiple table loads ( 1 per table) spawned via script is bit better > but hits wal problems. pgloader will hit the WAL problem too, but it still may have its benefits; at least we will soon be able (you already can if you take it from CVS) to measure whether parallel loading on the client side is a good idea performance-wise. [...] > I have not even started Partitioning of tables yet since with the > current framework, you have to load the tables separately into each > tables which means for the TPC-H data you need "extra-logic" to take > that table data and split it into each partition child table. Not stuff > that many people want to do by hand. I'm planning to add ddl-partitioning support to pgloader: http://archives.postgresql.org/pgsql-hackers/2007-12/msg00460.php The basic idea is for pgloader to ask PostgreSQL about constraint_exclusion, pg_inherits and pg_constraint, and if pgloader recognizes both the CHECK expression and the datatypes involved, and if we can implement the CHECK in Python without having to resort to querying PostgreSQL, then we can run a thread per partition, with as many COPY FROM commands running in parallel as there are partitions involved (when threads = -1). I'm not sure this will be quicker than relying on the PostgreSQL triggers or rules currently used for partitioning, but it seems to me the paragraph quoted from Jignesh is just about that. Comments? -- dim
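The catalog lookup described here can be sketched as follows, assuming psycopg2, a placeholder connection string, and a parent table named orders (just an example): it lists each child partition together with its CHECK definition, which a client-side loader could then try to recognize.

```python
# Discover child partitions and their CHECK constraints from the catalogs.
import psycopg2

conn = psycopg2.connect("dbname=tpch")
cur = conn.cursor()

cur.execute("SHOW constraint_exclusion")
print("constraint_exclusion =", cur.fetchone()[0])

cur.execute("""
    SELECT c.relname,
           pg_get_constraintdef(o.oid)
      FROM pg_inherits i
      JOIN pg_class c      ON c.oid = i.inhrelid
      JOIN pg_constraint o ON o.conrelid = c.oid AND o.contype = 'c'
     WHERE i.inhparent = 'orders'::regclass
""")
for child, check_def in cur.fetchall():
    print(child, check_def)

conn.close()
```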
On Tue, 2008-02-05 at 15:06 +0100, Dimitri Fontaine wrote: > Hi, > > Le lundi 04 février 2008, Jignesh K. Shah a écrit : > > Single stream loader of PostgreSQL takes hours to load data. (Single > > stream load... wasting all the extra cores out there) > > I wanted to work on this at the pgloader level, so CVS version of pgloader is > now able to load data in parallel, with a python thread per configured > section (1 section = 1 data file = 1 table is often the case). > Not configurable at the moment, but I plan on providing a "threads" knob which > will default to 1, and could be -1 for "as many thread as sections". That sounds great. I was just thinking of asking for that :-) I'll look at COPY FROM internals to make this faster. I'm looking at this now to refresh my memory; I already had some plans on the shelf. > > Multiple table loads ( 1 per table) spawned via script is bit better > > but hits wal problems. > > pgloader will too hit the WAL problem, but it still may have its benefits, or > at least we will soon (you can already if you take it from CVS) be able to > measure if the parallel loading at the client side is a good idea perf. wise. Should be able to reduce lock contention, but not overall WAL volume. > [...] > > I have not even started Partitioning of tables yet since with the > > current framework, you have to load the tables separately into each > > tables which means for the TPC-H data you need "extra-logic" to take > > that table data and split it into each partition child table. Not stuff > > that many people want to do by hand. > > I'm planning to add ddl-partitioning support to pgloader: > http://archives.postgresql.org/pgsql-hackers/2007-12/msg00460.php > > The basic idea is for pgloader to ask PostgreSQL about constraint_exclusion, > pg_inherits and pg_constraint and if pgloader recognize both the CHECK > expression and the datatypes involved, and if we can implement the CHECK in > python without having to resort to querying PostgreSQL, then we can run a > thread per partition, with as many COPY FROM running in parallel as there are > partition involved (when threads = -1). > > I'm not sure this will be quicker than relying on PostgreSQL trigger or rules > as used for partitioning currently, but ISTM Jignesh quoted § is just about > that. Much better than triggers and rules, but it will be hard to get it to work. -- Simon Riggs 2ndQuadrant http://www.2ndQuadrant.com
Simon Riggs wrote: > On Tue, 2008-02-05 at 15:06 +0100, Dimitri Fontaine wrote: >> >> Le lundi 04 février 2008, Jignesh K. Shah a écrit : >>> Multiple table loads ( 1 per table) spawned via script is bit better >>> but hits wal problems. >> pgloader will too hit the WAL problem, but it still may have its benefits, or >> at least we will soon (you can already if you take it from CVS) be able to >> measure if the parallel loading at the client side is a good idea perf. wise. > > Should be able to reduce lock contention, but not overall WAL volume. In the case of a bulk upload to an empty table (or partition?) could you not optimise the WAL away? That is, shouldn't the WAL basically be a simple transformation of the on-disk blocks? You'd have to explicitly sync the file(s) for the table/indexes of course, and you'd need some work-around for WAL shipping, but it might be worth it for you chaps with large imports. -- Richard Huxton Archonet Ltd
Apologies for the blank email - mailer problems. I lost all my nicely typed stuff, too. On Tue, 5 Feb 2008, Dimitri Fontaine wrote: >> Multiple table loads ( 1 per table) spawned via script is bit better >> but hits wal problems. > > pgloader will too hit the WAL problem, but it still may have its benefits, or > at least we will soon (you can already if you take it from CVS) be able to > measure if the parallel loading at the client side is a good idea perf. wise. You'll have to be careful here. Depending on the filesystem, writing large amounts of data to two files simultaneously can result in the blocks being interleaved to some degree on the disc, which can cause performance problems later on. As for the WAL, I have a suggestion, but I'm aware that I don't know how PG actually does it, so you'll have to tell me if it is valid. My impression is that the WAL is used to store writes in a transactional manner, for writes that can't be written in a transactional manner directly to the data files. Hence the suggestion for restoring database dumps to run the whole restore in one transaction, which means that the table creation is in the same transaction as loading the data into it. Since the table is not visible to other backends, the writes to it do not need to go to the WAL, and PG is clever enough to do this. My suggestion is to extend that slightly. If there is a large chunk of data to be written to a table, which will be entirely to empty pages or appended to the end of the data file, then there is no risk of corruption of existing data, and that write could be made directly to the table. You would have to write a WAL entry reserving the space in the data file, and then write the data to the file. Then when that WAL entry is checkpointed, no work would be required. This would improve the performance of database restores and large writes which expand the table's data file. So, would it work? Matthew -- If pro is the opposite of con, what is the opposite of progress?
On Tue, 5 Feb 2008, Richard Huxton wrote: > In the case of a bulk upload to an empty table (or partition?) could you not > optimise the WAL away? Argh. If I hadn't had to retype my email, I would have suggested that before you. ;) Matthew -- Unfortunately, university regulations probably prohibit me from eating small children in front of the lecture class. -- Computer Science Lecturer
On Tue, 2008-02-05 at 14:43 +0000, Richard Huxton wrote: > Simon Riggs wrote: > > On Tue, 2008-02-05 at 15:06 +0100, Dimitri Fontaine wrote: > >> > >> Le lundi 04 février 2008, Jignesh K. Shah a écrit : > > >>> Multiple table loads ( 1 per table) spawned via script is bit better > >>> but hits wal problems. > >> pgloader will too hit the WAL problem, but it still may have its benefits, or > >> at least we will soon (you can already if you take it from CVS) be able to > >> measure if the parallel loading at the client side is a good idea perf. wise. > > > > Should be able to reduce lock contention, but not overall WAL volume. > > In the case of a bulk upload to an empty table (or partition?) could you > not optimise the WAL away? That is, shouldn't the WAL basically be a > simple transformation of the on-disk blocks? You'd have to explicitly > sync the file(s) for the table/indexes of course, and you'd need some > work-around for WAL shipping, but it might be worth it for you chaps > with large imports. Only by locking the table, which serializes access, which then slows you down or at least restricts other options. Plus if you use pg_loader then you'll find only the first few rows optimized and all the rest not. -- Simon Riggs 2ndQuadrant http://www.2ndQuadrant.com
Simon Riggs wrote: > On Tue, 2008-02-05 at 14:43 +0000, Richard Huxton wrote: >> Simon Riggs wrote: >>> On Tue, 2008-02-05 at 15:06 +0100, Dimitri Fontaine wrote: >>>> Le lundi 04 février 2008, Jignesh K. Shah a écrit : >>>>> Multiple table loads ( 1 per table) spawned via script is bit better >>>>> but hits wal problems. >>>> pgloader will too hit the WAL problem, but it still may have its benefits, or >>>> at least we will soon (you can already if you take it from CVS) be able to >>>> measure if the parallel loading at the client side is a good idea perf. wise. >>> Should be able to reduce lock contention, but not overall WAL volume. >> In the case of a bulk upload to an empty table (or partition?) could you >> not optimise the WAL away? That is, shouldn't the WAL basically be a >> simple transformation of the on-disk blocks? You'd have to explicitly >> sync the file(s) for the table/indexes of course, and you'd need some >> work-around for WAL shipping, but it might be worth it for you chaps >> with large imports. > > Only by locking the table, which serializes access, which then slows you > down or at least restricts other options. Plus if you use pg_loader then > you'll find only the first few rows optimized and all the rest not. Hmm - the table-locking requirement is true enough, but why would pg_loader cause problems after the first few rows? -- Richard Huxton Archonet Ltd
On Tue, 2008-02-05 at 15:05 +0000, Richard Huxton wrote: > > Only by locking the table, which serializes access, which then slows you > > down or at least restricts other options. Plus if you use pg_loader then > > you'll find only the first few rows optimized and all the rest not. > > Hmm - the table-locking requirement is true enough, but why would > pg_loader cause problems after the first few rows? It runs a stream of COPY statements, so only the first would be optimized with the "empty table optimization". -- Simon Riggs 2ndQuadrant http://www.2ndQuadrant.com
On Tue, 5 Feb 2008, Simon Riggs wrote: >> In the case of a bulk upload to an empty table (or partition?) could you >> not optimise the WAL away? That is, shouldn't the WAL basically be a >> simple transformation of the on-disk blocks? You'd have to explicitly >> sync the file(s) for the table/indexes of course, and you'd need some >> work-around for WAL shipping, but it might be worth it for you chaps >> with large imports. > > Only by locking the table, which serializes access, which then slows you > down or at least restricts other options. Plus if you use pg_loader then > you'll find only the first few rows optimized and all the rest not. Why would you need to lock the table? Matthew -- Picard: I was just paid a visit from Q. Riker: Q! Any idea what he's up to? Picard: No. He said he wanted to be "nice" to me. Riker: I'll alert the crew.
Matthew wrote: > On Tue, 5 Feb 2008, Simon Riggs wrote: >>> In the case of a bulk upload to an empty table (or partition?) could you >>> not optimise the WAL away? That is, shouldn't the WAL basically be a >>> simple transformation of the on-disk blocks? You'd have to explicitly >>> sync the file(s) for the table/indexes of course, and you'd need some >>> work-around for WAL shipping, but it might be worth it for you chaps >>> with large imports. >> >> Only by locking the table, which serializes access, which then slows you >> down or at least restricts other options. Plus if you use pg_loader then >> you'll find only the first few rows optimized and all the rest not. > > Why would you need to lock the table? Because you're not really writing the WAL, which means you can't let anyone else get their data into any of the blocks you are writing into. You'd basically want to write the disk blocks then "attach" them in some way. -- Richard Huxton Archonet Ltd
On Tue, 5 Feb 2008, Richard Huxton wrote: >> Why would you need to lock the table? > > Because you're not really writing the WAL, which means you can't let anyone > else get their data into any of the blocks you are writing into. You'd > basically want to write the disk blocks then "attach" them in some way. So what's wrong with "reserving" the space using the WAL, then everyone else will know. After all, when you write the data to the WAL, you must have an idea of where it is meant to end up. My suggestion is that you go through all the motions of writing the data to the WAL, just without the data bit. Matthew -- Failure is not an option. It comes bundled with your Microsoft product. -- Ferenc Mantfeld
Simon Riggs wrote: > On Tue, 2008-02-05 at 15:05 +0000, Richard Huxton wrote: > >>> Only by locking the table, which serializes access, which then slows you >>> down or at least restricts other options. Plus if you use pg_loader then >>> you'll find only the first few rows optimized and all the rest not. >> Hmm - the table-locking requirement is true enough, but why would >> pg_loader cause problems after the first few rows? > > It runs a stream of COPY statements, so only first would be optimized > with the "empty table optimization". Ah, if you're allowing multiple commands during the process I can see how it could get fiddly. -- Richard Huxton Archonet Ltd
Matthew wrote: > On Tue, 5 Feb 2008, Richard Huxton wrote: >>> Why would you need to lock the table? >> >> Because you're not really writing the WAL, which means you can't let >> anyone else get their data into any of the blocks you are writing >> into. You'd basically want to write the disk blocks then "attach" them >> in some way. > > So what's wrong with "reserving" the space using the WAL, then everyone > else will know. After all, when you write the data to the WAL, you must > have an idea of where it is meant to end up. My suggestion is that you > go through all the motions of writing the data to the WAL, just without > the data bit. Well, now you're looking at page-level locking for the data blocks, or at least something very similar. Not sure what you'd do with indexes though - don't see a simple way of avoiding a large lock on a btree index. If you reserved the space in advance that could work. But you don't know how much to reserve until you've copied it in. You could of course have a set of co-operating processes all bulk-loading while maintaining a table-lock outside of those. It feels like things are getting complicated then though. -- Richard Huxton Archonet Ltd
One of the problems with the "empty table optimization" is that if indexes have been created, the table is considered no longer empty. Commercial databases have options like an "IRRECOVERABLE" clause, along with disk partitions and CPU partitions, for their bulk loaders. So one option turns off logging; the disk partitions create multiple processes to read lines/blocks from the input file and other processes to flush buffer-pool blocks out to disk; and the CPU partitions create processes to parse the blocks/lines that were read and to put the resulting rows into the buffer pool if they parse successfully. Regards, Jignesh Simon Riggs wrote: > On Tue, 2008-02-05 at 15:05 +0000, Richard Huxton wrote: > > >>> Only by locking the table, which serializes access, which then slows you >>> down or at least restricts other options. Plus if you use pg_loader then >>> you'll find only the first few rows optimized and all the rest not. >>> >> Hmm - the table-locking requirement is true enough, but why would >> pg_loader cause problems after the first few rows? >> > > It runs a stream of COPY statements, so only first would be optimized > with the "empty table optimization". > >
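A rough client-side analogue of that reader/parser split, as a sketch only: one reader thread feeds chunks of the input file to a queue, and several worker connections COPY the chunks in parallel. WAL is still written, so this parallelizes only the client side; psycopg2, the table name, the chunk size, and the connection string are assumptions.

```python
# Producer/consumer bulk load: one reader, N COPY workers.
import io
import queue
import threading
import psycopg2

N_WORKERS = 4
CHUNK_ROWS = 50000
work = queue.Queue(maxsize=2 * N_WORKERS)

def reader(path):
    buf = []
    for line in open(path):
        buf.append(line.rstrip("\n").rstrip("|") + "\n")   # drop dbgen's trailing '|'
        if len(buf) >= CHUNK_ROWS:
            work.put("".join(buf))
            buf = []
    if buf:
        work.put("".join(buf))
    for _ in range(N_WORKERS):
        work.put(None)                   # one shutdown marker per worker

def worker(table):
    conn = psycopg2.connect("dbname=tpch")
    cur = conn.cursor()
    while True:
        chunk = work.get()
        if chunk is None:
            break
        cur.copy_expert("COPY %s FROM STDIN WITH DELIMITER '|'" % table, io.StringIO(chunk))
        conn.commit()
    conn.close()

threads = [threading.Thread(target=worker, args=("lineitem",)) for _ in range(N_WORKERS)]
for t in threads:
    t.start()
reader("lineitem.tbl")
for t in threads:
    t.join()
```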
On Tue, 5 Feb 2008, Richard Huxton wrote: >> So what's wrong with "reserving" the space using the WAL, then everyone >> else will know. After all, when you write the data to the WAL, you must >> have an idea of where it is meant to end up. My suggestion is that you go >> through all the motions of writing the data to the WAL, just without the >> data bit. > > Well, now you're looking at page-level locking for the data blocks, or at > least something very similar. Not sure what you'd do with indexes though - > don't see a simple way of avoiding a large lock on a btree index. Yeah, indexes would be a lot more difficult I guess, if writes to them involve changing lots of stuff around. We do most of our loads without the indexes present though. > If you reserved the space in advance that could work. But you don't know how > much to reserve until you've copied it in. What does the WAL do? When do you allocate space in the file for written rows? Is it when you write the WAL, or when you checkpoint it? If it's when you write the WAL, then you can just use the same algorithm. > You could of course have a set of co-operating processes all bulk-loading > while maintaining a table-lock outside of the those. It feels like things are > getting complicated then though. That does sound a bit evil. You could have different backends, each running a single transaction where they create one table and load the data for it. That wouldn't need any change to the backend, but it would only work for dump restores, and would require the client to be clever. I'm all for allowing this kind of optimisation while writing normally to the database, and for not requiring the client to think too hard. Matthew -- All of this sounds mildly turgid and messy and confusing... but what the heck. That's what programming's all about, really -- Computer Science Lecturer
On Tuesday 05 February 2008, Simon Riggs wrote: > It runs a stream of COPY statements, so only first would be optimized > with the "empty table optimization". The number of rows per COPY statement is configurable, so provided you have an estimate of the volume to import (wc -l), you could tweak this number to lower the number of COPY statements in the stream (down to a single COPY, maybe)... But basically a COPY run has to be kept in memory (and we're talking about high volumes here), and for error processing you'd want it not that huge after all... -- dim
On Tuesday 05 February 2008, Simon Riggs wrote: > I'll look at COPY FROM internals to make this faster. I'm looking at > this now to refresh my memory; I already had some plans on the shelf.

Maybe stealing some ideas from pg_bulkload could help somewhat here? http://pgfoundry.org/docman/view.php/1000261/456/20060709_pg_bulkload.pdf IIRC it's mainly about how to optimize index updating while loading data, and I've heard complaints along the lines of "this external tool has to know too much about PostgreSQL internals to be trustworthy as non-core code"... so...

> > The basic idea is for pgloader to ask PostgreSQL about > > constraint_exclusion, pg_inherits and pg_constraint and if pgloader > > recognize both the CHECK expression and the datatypes involved, and if we > > can implement the CHECK in python without having to resort to querying > > PostgreSQL, then we can run a thread per partition, with as many COPY > > FROM running in parallel as there are partition involved (when threads = > > -1). > > > > I'm not sure this will be quicker than relying on PostgreSQL trigger or > > rules as used for partitioning currently, but ISTM Jignesh quoted § is > > just about that. > > Much better than triggers and rules, but it will be hard to get it to > work.

Well, I'm thinking about providing a somewhat modular approach where the pgloader code is able to recognize CHECK constraints, load a module registered for the operator and data types, then use it. The modules and their registration would be done at the configuration level; I'll provide some defaults and users will be able to add their own code, the same way on-the-fly reformat modules are handled now. This means that I'll (hopefully) be able to provide the basic cases quickly (CHECK on dates >= x and < y, numeric ranges, etc.), and users will be able to take care of more complex setups.

When a constraint doesn't match any configured pgloader exclusion module, the trigger/rule code will get used (COPY will go to the main table), and when the Python CHECK implementation is wrong (worst case) PostgreSQL will reject the data and pgloader will fill your reject-data and log files. And you're back to debugging your Python CHECK implementation...

All of this is only a braindump as of now, and maybe quite an optimistic one... but barring any 'I know this can't work' objection, that's what I'm going to try to implement for the next pgloader version. Thanks for comments, input is really appreciated! -- dim
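A tiny sketch of that modular idea, with illustrative names and an illustrative regular expression (nothing here is actual pgloader code): recognized CHECK expressions are matched against registered Python implementations, and rows that match no module fall back to the parent table. Combined with the pg_constraint lookup sketched earlier in the thread, build_router() would give one COPY stream per recognized partition.

```python
# Registry of CHECK-expression recognizers.  Each module turns a recognized
# constraint into a Python predicate over a data row; build_router() maps a
# row to the child table whose predicate matches, or None for the parent.
import re

CHECK_MODULES = []                       # list of (compiled pattern, builder)

def register(pattern):
    def wrap(builder):
        CHECK_MODULES.append((re.compile(pattern), builder))
        return builder
    return wrap

@register(r"(\w+) >= '(\d{4}-\d{2}-\d{2})'.*\1 < '(\d{4}-\d{2}-\d{2})'")
def date_range(match, columns):
    col, lo, hi = match.group(1), match.group(2), match.group(3)
    idx = columns.index(col)
    return lambda row: lo <= row[idx] < hi           # ISO dates compare as strings

def build_router(check_defs, columns):
    """check_defs maps child table name -> pg_get_constraintdef() output."""
    routes = []
    for child, check in check_defs.items():
        for pattern, builder in CHECK_MODULES:
            m = pattern.search(check)
            if m:
                routes.append((builder(m, columns), child))
                break
    def route(row):
        for predicate, child in routes:
            if predicate(row):
                return child
        return None                      # not recognized: COPY to the parent table
    return route
```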
On Tue, 2008-02-05 at 18:15 +0100, Dimitri Fontaine wrote: > Le mardi 05 février 2008, Simon Riggs a écrit : > > I'll look at COPY FROM internals to make this faster. I'm looking at > > this now to refresh my memory; I already had some plans on the shelf. > > Maybe stealing some ideas from pg_bulkload could somewhat help here? > http://pgfoundry.org/docman/view.php/1000261/456/20060709_pg_bulkload.pdf > IIRC it's mainly about how to optimize index updating while loading data, and > I've heard complaints on the line "this external tool has to know too much > about PostgreSQL internals to be trustworthy as non-core code"... so... Yeh, the batch index updates are a cool feature. Should be able to do that internally also. Not going to try the no-WAL route again though. If we can get it running efficiently and in parallel, then that will be OK. -- Simon Riggs 2ndQuadrant http://www.2ndQuadrant.com
That sounds cool to me too.. How much work is it to make pg_bulkload work on 8.3? An integrated version is certainly more beneficial. Especially, I think it will also help other setups, like TPC-E, where this is a problem. Regards, Jignesh Simon Riggs wrote: > On Tue, 2008-02-05 at 18:15 +0100, Dimitri Fontaine wrote: > >> Le mardi 05 février 2008, Simon Riggs a écrit : >> >>> I'll look at COPY FROM internals to make this faster. I'm looking at >>> this now to refresh my memory; I already had some plans on the shelf. >>> >> Maybe stealing some ideas from pg_bulkload could somewhat help here? >> http://pgfoundry.org/docman/view.php/1000261/456/20060709_pg_bulkload.pdf >> > > >> IIRC it's mainly about how to optimize index updating while loading data, and >> I've heard complaints on the line "this external tool has to know too much >> about PostgreSQL internals to be trustworthy as non-core code"... so... >> > > Yeh, the batch index updates are a cool feature. Should be able to do > that internally also. > > Not going to try the no-WAL route again though. If we can get it running > efficiently and in parallel, then that will be OK. > >
On Tue, 2008-02-05 at 13:47 -0500, Jignesh K. Shah wrote: > That sounds cool to me too.. > > How much work is to make pg_bulkload to work on 8.3? An Integrated > version is certainly more beneficial. > Specially I think it will also help for other setups like TPC-E too > where this is a problem. If you don't write WAL then you can lose all your writes in a crash. That issue is surmountable on a table with no indexes, or even conceivably with one monotonically ascending index. With other indexes, if we crash then we have a likely corrupt index. For most production systems I'm aware of, losing an index on a huge table is not anything you'd want to trade for performance. Assuming you've ever been knee-deep in it on a real server. Maybe we can have a "load mode" for a table where we skip writing any WAL, but if we crash we just truncate the whole table to nothing? Issue a WARNING if we enable this mode while there is any data in the table. I'm nervous of it, but maybe people really want it? I don't really want to invent ext2 all over again, so that we have to run an fsck on a table if we crash while loading. My concern is that many people would choose that and then blame us for delivering unreliable software. e.g. the direct path loader on Oracle used to corrupt a PK index if you loaded duplicate rows with it (whether it still does I couldn't care). That kind of behaviour is simply incompatible with production usage, even if it produces good benchmark numbers. -- Simon Riggs 2ndQuadrant http://www.2ndQuadrant.com
Dimitri Fontaine wrote: > Le mardi 05 février 2008, Simon Riggs a écrit : >> I'll look at COPY FROM internals to make this faster. I'm looking at >> this now to refresh my memory; I already had some plans on the shelf. > > Maybe stealing some ideas from pg_bulkload could somewhat help here? > http://pgfoundry.org/docman/view.php/1000261/456/20060709_pg_bulkload.pdf > > IIRC it's mainly about how to optimize index updating while loading data, and > I've heard complaints on the line "this external tool has to know too much > about PostgreSQL internals to be trustworthy as non-core code"... so... I've been thinking of looking into that as well. The basic trick pg_bulkload is using is to populate the index as the data is being loaded. There's no fundamental reason why we couldn't do that internally in COPY. Triggers or constraints that access the table being loaded would make it impossible, but we should be able to detect that and fall back to what we have now. What I'm basically thinking about is to modify the indexam API of building a new index, so that COPY would feed the tuples to the indexam, instead of the indexam opening and scanning the heap. The b-tree indexam would spool the tuples into a tuplesort as the COPY progresses, and build the index from that at the end as usual. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Commercial DB bulk loaders work the same way: they give you an option for a fast load provided that, in case of error, the whole table is truncated. I think this also has real-life advantages where PostgreSQL is used for data marts which are recreated every now and then from other systems and which want fast loaders. So it's not just the benchmarking folks like me that will take advantage of such features. In fact, I have seen that they force a "REPLACE TABLE" clause, in the sense that it will truncate the table before loading, so there is no confusion about what happens to the original data in the table, and only then does it avoid the logs.

To be honest, it's not the WAL writes to disk that I am worried about. According to my tests, asynchronous commit comes pretty close to fsync=off and solves the WALWriteLock contention. We should maybe just focus on making it more efficient, which I think also involves WALInsertLock, which may not be entirely efficient.

Also, all changes have to be add-on options and not replacements for the existing load path; I totally agree with that point. The guys in production support don't even like optimizer query plan changes, let alone a corrupt index. (I have spent two days in a previous role trying to figure out why a particular query plan on another database changed in production.)

Simon Riggs wrote: > On Tue, 2008-02-05 at 13:47 -0500, Jignesh K. Shah wrote: > >> That sounds cool to me too.. >> >> How much work is to make pg_bulkload to work on 8.3? An Integrated >> version is certainly more beneficial. >> > > >> Specially I think it will also help for other setups like TPC-E too >> where this is a problem. >> > > If you don't write WAL then you can lose all your writes in a crash. > That issue is surmountable on a table with no indexes, or even > conceivably with one monotonically ascending index. With other indexes > if we crash then we have a likely corrupt index. > > For most production systems I'm aware of, losing an index on a huge > table is not anything you'd want to trade for performance. Assuming > you've ever been knee-deep in it on a real server. > > Maybe we can have a "load mode" for a table where we skip writing any > WAL, but if we crash we just truncate the whole table to nothing? Issue > a WARNING if we enable this mode while any data in table. I'm nervous of > it, but maybe people really want it? > > I don't really want to invent ext2 all over again, so we have to run an > fsck on a table of we crash while loading. My concern is that many > people would choose that then blame us for delivering unreliable > software. e.g. direct path loader on Oracle used to corrupt a PK index > if you loaded duplicate rows with it (whether it still does I couldn't > care). That kind of behaviour is simply incompatible with production > usage, even if it does good benchmark. > >
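A small sketch of the comparison mentioned here, assuming psycopg2, a placeholder connection string, and a throwaway test table: the same batched insert load run with default settings and with synchronous_commit turned off for the session, which 8.3 allows (fsync=off needs a postgresql.conf change, so it is not shown). The difference only shows up when there are many commits; a single huge COPY commits just once.

```python
# Compare default commits with per-session asynchronous commit under a
# commit-heavy load pattern.
import time
import psycopg2

def timed_batches(set_sql, batches=200, rows_per_batch=100):
    conn = psycopg2.connect("dbname=tpch")
    cur = conn.cursor()
    cur.execute("DROP TABLE IF EXISTS commit_test")
    cur.execute("CREATE TABLE commit_test (i int, t text)")
    if set_sql:
        cur.execute(set_sql)
    conn.commit()
    start = time.time()
    for _ in range(batches):
        for i in range(rows_per_batch):
            cur.execute("INSERT INTO commit_test VALUES (%s, %s)", (i, "x" * 100))
        conn.commit()                    # one WAL flush per batch when commits are synchronous
    elapsed = time.time() - start
    conn.close()
    return elapsed

print("default                : %.1f s" % timed_batches(None))
print("synchronous_commit off : %.1f s" % timed_batches("SET synchronous_commit TO off"))
```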
Hi Heikki, Is there a way such an operation can be spawned as a worker process? Generally during such loading, which most people will do during "off-peak" hours, I expect additional CPU resources to be available. By delegating such additional work to worker processes, we should be able to capitalize on the additional cores in the system. Even on a single core, the mere fact that the loading process will eventually wait on reads from the input file, which cannot be non-blocking, means the OS can timeslice things well enough for a second process to use those wait times for the index-population work. What do you think? Regards, Jignesh Heikki Linnakangas wrote: > Dimitri Fontaine wrote: >> Le mardi 05 février 2008, Simon Riggs a écrit : >>> I'll look at COPY FROM internals to make this faster. I'm looking at >>> this now to refresh my memory; I already had some plans on the shelf. >> >> Maybe stealing some ideas from pg_bulkload could somewhat help here? >> >> http://pgfoundry.org/docman/view.php/1000261/456/20060709_pg_bulkload.pdf >> >> >> IIRC it's mainly about how to optimize index updating while loading >> data, and I've heard complaints on the line "this external tool has >> to know too much about PostgreSQL internals to be trustworthy as >> non-core code"... so... > > I've been thinking of looking into that as well. The basic trick > pg_bulkload is using is to populate the index as the data is being > loaded. There's no fundamental reason why we couldn't do that > internally in COPY. Triggers or constraints that access the table > being loaded would make it impossible, but we should be able to detect > that and fall back to what we have now. > > What I'm basically thinking about is to modify the indexam API of > building a new index, so that COPY would feed the tuples to the > indexam, instead of the indexam opening and scanning the heap. The > b-tree indexam would spool the tuples into a tuplesort as the COPY > progresses, and build the index from that at the end as usual. >
Jignesh K. Shah wrote: > Is there a way such an operation can be spawned as a worker process? > Generally during such loading - which most people will do during > "offpeak" hours I expect additional CPU resources available. By > delegating such additional work to worker processes, we should be able > to capitalize on additional cores in the system. Hmm. You do need access to shared memory, locks, catalogs, and to run functions etc, so I don't think it's significantly easier than using multiple cores for COPY itself. > Even if it is a single core, the mere fact that the loading process will > eventually wait for a read from the input file which cannot be > non-blocking, the OS can timeslice it well for the second process to use > those wait times for the index population work. That's an interesting point. > What do you think? > > > Regards, > Jignesh > > > Heikki Linnakangas wrote: >> Dimitri Fontaine wrote: >>> Le mardi 05 février 2008, Simon Riggs a écrit : >>>> I'll look at COPY FROM internals to make this faster. I'm looking at >>>> this now to refresh my memory; I already had some plans on the shelf. >>> >>> Maybe stealing some ideas from pg_bulkload could somewhat help here? >>> >>> http://pgfoundry.org/docman/view.php/1000261/456/20060709_pg_bulkload.pdf >>> >>> >>> IIRC it's mainly about how to optimize index updating while loading >>> data, and I've heard complaints on the line "this external tool has >>> to know too much about PostgreSQL internals to be trustworthy as >>> non-core code"... so... >> >> I've been thinking of looking into that as well. The basic trick >> pg_bulkload is using is to populate the index as the data is being >> loaded. There's no fundamental reason why we couldn't do that >> internally in COPY. Triggers or constraints that access the table >> being loaded would make it impossible, but we should be able to detect >> that and fall back to what we have now. >> >> What I'm basically thinking about is to modify the indexam API of >> building a new index, so that COPY would feed the tuples to the >> indexam, instead of the indexam opening and scanning the heap. The >> b-tree indexam would spool the tuples into a tuplesort as the COPY >> progresses, and build the index from that at the end as usual. >> -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On Tue, 2008-02-05 at 15:50 -0500, Jignesh K. Shah wrote: > Is there a way such an operation can be spawned as a worker process? > Generally during such loading - which most people will do during > "offpeak" hours I expect additional CPU resources available. By > delegating such additional work to worker processes, we should be able > to capitalize on additional cores in the system. > > Even if it is a single core, the mere fact that the loading process will > eventually wait for a read from the input file which cannot be > non-blocking, the OS can timeslice it well for the second process to use > those wait times for the index population work. If Dimitri is working on parallel load, why bother? -- Simon Riggs 2ndQuadrant http://www.2ndQuadrant.com
On Tue, 5 Feb 2008, Simon Riggs wrote: > On Tue, 2008-02-05 at 15:50 -0500, Jignesh K. Shah wrote: >> >> Even if it is a single core, the mere fact that the loading process will >> eventually wait for a read from the input file which cannot be >> non-blocking, the OS can timeslice it well for the second process to use >> those wait times for the index population work. > > If Dimitri is working on parallel load, why bother?

pgloader is a great tool for a lot of things, particularly if there's any chance that some of your rows will get rejected. But the way things pass through the Python/psycopg layer made it uncompetitive (more than a 50% slowdown) against the straight COPY path from a rows/second perspective the last time (V2.1.0?) I did what I thought was a fair test of it (usual caveat of "with the type of data I was loading"). Maybe there's been some gigantic improvement since then, but it's hard to beat COPY when you've got an API layer or two in the middle.

I suspect a parallel-loading pgloader will scale something like this:

  1 CPU:    Considerably slower than COPY
  2-3 CPUs: Close to even with COPY
  4+ CPUs:  Faster than COPY

Maybe I'm wrong, but I wouldn't abandon looking into another approach until that territory is mapped out a bit better. Given the very large number of dual-core systems out there now relative to those with more, optimizing the straight COPY path in any way that lets something like index building take advantage of even one more core is well worth doing. Heikki's idea sounded good to me regardless, and if it can be separated out enough to get a second core into the index building at the same time, so much the better.

-- * Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
Hi,
I think it's time now that we jazz COPY up a bit to include all the discussed functionality. Heikki's batch-indexing idea is pretty useful too. Another thing that pg_bulkload does is load the tuples directly into the relation: it constructs the tuples and writes them straight to the physical file corresponding to the relation, bypassing the engine completely (of course, the limitations that arise out of this are no support for rules, triggers, constraints, default expression evaluation, etc.). ISTM we could optimize the COPY code to try direct loading too (not necessarily as done by pg_bulkload) to speed it up further in certain cases.
Another thing we should add to COPY is the ability to continue a data load across errors, as was discussed on -hackers some time back.
Regards,
Nikhils
--
EnterpriseDB http://www.enterprisedb.com
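For a sense of what "continue a data load across errors" means in practice today, here is a rough sketch of the client-side workaround (batch COPY, fall back to row-by-row on failure, which is roughly what pgloader does), assuming psycopg2 and a placeholder table name; an in-backend feature would look quite different:

    import io
    import psycopg2

    def copy_with_rejects(conn, rows, table="target", batch_size=1000):
        """Batch COPY; when a batch fails, roll back to a savepoint and
        retry it row by row so only the bad rows end up rejected.
        'rows' are COPY text-format lines without trailing newlines."""
        cur = conn.cursor()
        rejected = []
        for i in range(0, len(rows), batch_size):
            batch = rows[i:i + batch_size]
            buf = io.StringIO("".join(line + "\n" for line in batch))
            cur.execute("SAVEPOINT copy_batch")
            try:
                cur.copy_expert("COPY %s FROM STDIN" % table, buf)
                cur.execute("RELEASE SAVEPOINT copy_batch")
            except psycopg2.Error:
                cur.execute("ROLLBACK TO SAVEPOINT copy_batch")
                for line in batch:                       # retry one line at a time
                    cur.execute("SAVEPOINT copy_row")
                    try:
                        cur.copy_expert("COPY %s FROM STDIN" % table,
                                        io.StringIO(line + "\n"))
                        cur.execute("RELEASE SAVEPOINT copy_row")
                    except psycopg2.Error:
                        cur.execute("ROLLBACK TO SAVEPOINT copy_row")
                        rejected.append(line)
        conn.commit()
        return rejected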
On Wednesday 06 February 2008, Greg Smith wrote:
> pgloader is a great tool for a lot of things, particularly if there's any
> chance that some of your rows will get rejected. But the way things pass
> through the Python/psycopg layer made it uncompetitive (more than a 50%
> slowdown) against the straight COPY path from a rows/second perspective
> the last time (V2.1.0?)

I've yet to add in the psycopg wrapper Marko wrote for skytools: at the moment I'm using the psycopg1 interface even when psycopg2 is used, and it seems the new version has some great performance improvements. I just didn't bother until now, thinking this wouldn't affect COPY.

> I did what I thought was a fair test of it (usual
> caveat of "with the type of data I was loading"). Maybe there's been some
> gigantic improvement since then, but it's hard to beat COPY when you've
> got an API layer or two in the middle.

Did you compare to COPY or \copy? I'd expect the psycopg COPY API not to be much more costly than psql's, after all.

Where pgloader is really left behind (in terms of tuples inserted per second) compared to COPY is when it has to juggle the data a lot, I'd say (reformat, reorder, add constants, etc). But I've tried to design it so that when it is not configured to arrange (massage?) the data, the code path is the simplest possible.

Do you want to test pgloader again with Marko's psycopg wrapper code to see if this helps? If so, I'll arrange to push it to CVS ASAP.

Maybe at the end of all this the PostgreSQL backend code will be smarter than pgloader (wrt error handling and data massaging) and we'll be able to drop the project, but in the meantime I'll try my best to keep pgloader as fast as possible :)
-- dim
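For anyone who wants to reproduce the comparison, the client-side COPY path can be driven directly from psycopg2; a minimal sketch, with the DSN, file path and table name as placeholders:

    import time
    import psycopg2

    def client_side_copy(dsn, path, table):
        """Stream a local file through the wire protocol -- roughly the
        path psql's \copy and pgloader's COPY support take."""
        conn = psycopg2.connect(dsn)
        try:
            with conn.cursor() as cur, open(path) as f:
                start = time.time()
                cur.copy_expert("COPY %s FROM STDIN" % table, f)
                conn.commit()
            return time.time() - start
        finally:
            conn.close()

    # Server-side COPY, for comparison, reads the file directly on the server:
    #   COPY target FROM '/path/on/the/server/data.txt';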
Hi,

I've been thinking about this topic some more, and as I don't know when I'll be able to go and implement it I want to publish the ideas here. This way I'll be able to find them again :)

On Tuesday 05 February 2008, Dimitri Fontaine wrote:
> On Tuesday 05 February 2008, Simon Riggs wrote:
> > Much better than triggers and rules, but it will be hard to get it to
> > work.
>
> Well, I'm thinking about providing a somewhat modular approach where
> pgloader code is able to recognize CHECK constraints, load a module
> registered to the operator and data types, then use it.

Here's how I think I'm going to implement it:

User level configuration
-=-=-=-=-=-=-=-=-=-

At the user level, you will have to add a constraint_exclusion = on parameter to the pgloader section configuration for it to bother checking whether the destination table has some children etc. You'll also need to provide a global ce_path parameter (where to find user Python constraint exclusion modules) and a ce_modules parameter for each section where constraint_exclusion = on:

  ce_modules = columnA:module:class, columnB:module:class

As ce_path could point to any number of modules, and a single type may be supported by several modules, I'll let the user choose which module to use.

Constraint exclusion modules
-=-=-=-=-=-=-=-=-=-=-=-=-

The modules will provide one or several classes (kind of a packaging issue); each one will have to register which datatypes and operators it knows about. Here's some pseudo-code of a module, which certainly is the best way to express a code design idea:

class MyCE:
    def __init__(self, operator, constant, cside='r'):
        """ CHECK ( col operator constant ) => cside = 'r', could be 'l' """
        self.operator = operator
        self.constant = constant
        self.cside = cside

    @classmethod
    def support_type(cls, type):
        return type in ['integer', 'bigint', 'smallint', 'real', 'double']

    @classmethod
    def support_operator(cls, op):
        return op in ['=', '>', '<', '>=', '<=', '%']

    def check(self, op, data):
        if op == '>':
            return self.gt(data)
        ...

    def gt(self, data):
        if self.cside == 'l':
            return self.constant > data
        elif self.cside == 'r':
            return data > self.constant

This way pgloader will be able to support any datatype (user datatypes like IP4R included) and operator (@@, ~<= or whatever). For pgloader to handle a CHECK() constraint, though, it'll have to be configured to use a CE class supporting the operators and datatypes used.

PGLoader constraint exclusion support
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-

The CHECK() constraint being a tree of check expressions[*] linked by logical operators, pgloader will have to build a logic tree of MyCE instances (user CE modules) and evaluate all the checks in order to choose the partition for each input line.

[*]: check((a % 10) = 1) makes an expression tree containing 2 check nodes

After having parsed pg_constraint.consrc (not conbin, which seems too much of an internal dump to use from user code) and built a CHECK tree for each partition, pgloader will try to decide whether it's range partitioning (the most common case). If each partition's CHECK tree is AND(a >= b, a < c) or a variation of it, we have range partitioning. Then surely we can optimize the code that chooses the partition to COPY data to while still using the module's operator implementation, e.g. by doing a binary search on a tree of partition limits.

If you want some other widely used (or not) partitioning scheme to be recognized and optimized by pgloader, just tell me and we'll see about it :) Having this step as a user module seems overkill at the moment, though.
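As a side note on the parsing step above: the CHECK sources for each child table can be fetched from the catalogs with a query along these lines (illustration only, using pg_get_constraintdef() rather than consrc; the connection string and parent table name are placeholders):

    import psycopg2

    PARTITION_CHECKS = """
    SELECT c.relname, pg_get_constraintdef(t.oid)
      FROM pg_inherits i
      JOIN pg_class c      ON c.oid = i.inhrelid
      JOIN pg_constraint t ON t.conrelid = c.oid AND t.contype = 'c'
     WHERE i.inhparent = %s::regclass
    """

    def partition_check_sources(dsn, parent):
        """Return {child_table: CHECK constraint text} for every child of parent."""
        conn = psycopg2.connect(dsn)
        with conn.cursor() as cur:
            cur.execute(PARTITION_CHECKS, (parent,))
            return dict(cur.fetchall())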
Multi-Threading behaviour and CE support
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-

Now, pgloader will be able to run N threads, each one loading some data to a partitioned child-table target. N will certainly be configured depending on the number of server cores and not on the number of partitions...

So what do we do when reading a tuple we want to store in a partition which has no dedicated thread started yet, and we already have N threads running? I'm thinking about some LRU(Thread) to choose a thread to terminate (launch COPY with current buffer and quit) and start a new one for the current partition target. Hopefully there won't be such high values of N that the LRU is a bad choice per se, and the input data won't be so messy that we have to stop/start threads at every new line.

Comments welcome, regards,
-- dim
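For readers trying to picture the LRU part, here is a rough sketch of the buffering policy being described, with plain lists standing in for the per-partition threads and COPY streams (flush() is a placeholder for "COPY this buffer and stop that thread"):

    from collections import OrderedDict

    class PartitionRouter:
        """Keep at most max_streams open buffers, one per partition; when a
        row arrives for a partition with no open buffer and the limit is
        reached, flush and close the least recently used one."""
        def __init__(self, flush, max_streams=4):
            self.flush = flush                     # flush(partition, rows)
            self.max_streams = max_streams
            self.buffers = OrderedDict()           # partition -> list of rows

        def add(self, partition, row):
            if partition in self.buffers:
                self.buffers.move_to_end(partition)              # recently used
            else:
                if len(self.buffers) >= self.max_streams:
                    victim, rows = self.buffers.popitem(last=False)  # the LRU one
                    self.flush(victim, rows)       # COPY its buffer, stop its stream
                self.buffers[partition] = []
            self.buffers[partition].append(row)

        def close(self):
            for partition, rows in self.buffers.items():
                self.flush(partition, rows)
            self.buffers.clear()

    # usage sketch: router = PartitionRouter(lambda p, rows: print(p, len(rows)))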
On Wed, 2008-02-06 at 12:27 +0100, Dimitri Fontaine wrote:
> Multi-Threading behaviour and CE support
> -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
>
> Now, pgloader will be able to run N threads, each one loading some data
> to a partitioned child-table target. N will certainly be configured
> depending on the number of server cores and not on the number of
> partitions...
>
> So what do we do when reading a tuple we want to store in a partition
> which has no dedicated thread started yet, and we already have N threads
> running? I'm thinking about some LRU(Thread) to choose a thread to
> terminate (launch COPY with current buffer and quit) and start a new one
> for the current partition target.
>
> Hopefully there won't be such high values of N that the LRU is a bad
> choice per se, and the input data won't be so messy that we have to
> stop/start threads at every new line.

For me, it would be good to see a --parallel=n parameter that would allow pg_loader to distribute rows in "round-robin" manner to "n" different concurrent COPY statements. i.e. a non-routing version. Making that work well, whilst continuing to do error-handling, seems like a challenge, but a very useful goal.

Adding intelligence to the row distribution may be technically hard but may also simply move the bottleneck onto pg_loader. We may need multiple threads in pg_loader, or we may just need multiple sessions from pg_loader. Experience from doing the non-routing parallel version may help in deciding whether to go for the routing version.

-- Simon Riggs 2ndQuadrant http://www.2ndQuadrant.com
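For reference, the non-routing --parallel=n behaviour Simon describes can be sketched with plain threads and queues; this is an illustration only (psycopg2, a placeholder table name, no error handling), not pgloader code:

    import queue
    import threading
    import psycopg2

    class QueueReader:
        """Minimal file-like object: copy_expert() calls read(), which is
        served from a queue filled by the dispatcher thread."""
        def __init__(self, q):
            self.q = q
        def read(self, size=-1):
            line = self.q.get()
            return '' if line is None else line   # '' tells COPY we are done
        readline = read                           # in case readline() is used

    def copy_worker(dsn, table, q):
        conn = psycopg2.connect(dsn)
        with conn.cursor() as cur:
            cur.copy_expert("COPY %s FROM STDIN" % table, QueueReader(q))
        conn.commit()
        conn.close()

    def parallel_copy(dsn, table, lines, n=4):
        """Round-robin the input over n concurrent COPY sessions (the
        non-routing --parallel=n behaviour); lines must end with '\n'."""
        queues = [queue.Queue(maxsize=10000) for _ in range(n)]
        workers = [threading.Thread(target=copy_worker, args=(dsn, table, q))
                   for q in queues]
        for w in workers:
            w.start()
        for i, line in enumerate(lines):
            queues[i % n].put(line)               # the round-robin part
        for q in queues:
            q.put(None)                           # tell each worker it is done
        for w in workers:
            w.join()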
On Wednesday 06 February 2008, Simon Riggs wrote:
> For me, it would be good to see a --parallel=n parameter that would
> allow pg_loader to distribute rows in "round-robin" manner to "n"
> different concurrent COPY statements. i.e. a non-routing version.

What happens when you want at most N parallel threads and have several sections configured: do you want pgloader to serialize section loading (often there's one section per table, sometimes different sections target the same table) but parallelise each section's loading?

I'm thinking we should have a global max_threads knob *and* a per-section one if we want to go this way, but then multi-threaded sections will somewhat fight against other sections (multi-threaded or not) for threads to use. So I'll also add a parameter to configure how many (max) sections to load in parallel at any time. We'll then have (default values presented):

  max_threads = 1
  max_parallel_sections = 1
  section_threads = -1

The section_threads parameter can be overridden at section level but needs to stay <= max_threads (if not, it is discarded and a warning issued). When section_threads is -1, pgloader tries to use as many threads as possible, still within the global max_threads limit.

If max_parallel_sections is -1, pgloader starts a new thread for each new section, maxing out at max_threads, then waits for a thread to finish before launching the next section's loading.

If you have N max_threads and max_parallel_sections = section_threads = -1, then we'll see some kind of a fight between new-section threads and in-section threads (the parallel non-routing COPY behaviour). But then it's a user choice. Adding Constraint_Exclusion support into this would not mess it up, but it'll only be of interest when section_threads != 1 and max_threads > 1.

> Making
> that work well, whilst continuing to do error-handling seems like a
> challenge, but a very useful goal.

Quick tests showed me the Python threading model allows for easy sharing of objects between several threads; I don't think I'll need to adjust my reject code when going per-section multi-threaded. I just have to use a semaphore object so rejects continue to be written one line at a time. Not that complex, if reliable.

> Adding intelligence to the row distribution may be technically hard but
> may also simply move the bottleneck onto pg_loader. We may need multiple
> threads in pg_loader, or we may just need multiple sessions from
> pg_loader. Experience from doing the non-routing parallel version may
> help in deciding whether to go for the routing version.

If non-routing per-section multi-threading is a user request and not that hard to implement (thanks to Python), that sounds like a good enough reason for me to provide it :) I'll keep you (and the list) informed as soon as I have the code to play with.
-- dim
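One possible reading of how those knobs would interact, sketched as a tiny resolution function (proposed behaviour only, not an existing pgloader option):

    def effective_section_threads(max_threads, section_threads):
        """Resolve a per-section thread count against the global limit:
        -1 means 'as many as allowed', and a value above max_threads is
        discarded with a warning, falling back to the global limit."""
        if section_threads == -1:
            return max_threads
        if section_threads > max_threads:
            print("warning: section_threads > max_threads, ignoring")
            return max_threads
        return section_threads

    assert effective_section_threads(4, -1) == 4
    assert effective_section_threads(4, 2) == 2
    assert effective_section_threads(4, 8) == 4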
On Wed, 6 Feb 2008, Dimitri Fontaine wrote:

> Did you compare to COPY or \copy?

COPY. If you're loading a TB, if you're smart it's going onto the server itself if at all possible and loading directly from there. You'd probably get a closer comparison against psql \copy, but recognize you're always going to be compared against the best server-side copy method available.

> Do you want to test pgloader again with Marko's psycopg wrapper code to
> see if this helps?

Wouldn't have time myself for at least a month (probably not until after the East convention) so don't go making commits on my behalf.

> Maybe at the end of all this the PostgreSQL backend code will be smarter
> than pgloader (wrt error handling and data massaging) and we'll be able
> to drop the project

There are way too many data massaging cases I never expect the backend to handle that pgloader does a great job of right now, and I think there will always be a niche for a tool like this.

-- * Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
On Wed, 6 Feb 2008, Simon Riggs wrote: > For me, it would be good to see a --parallel=n parameter that would > allow pg_loader to distribute rows in "round-robin" manner to "n" > different concurrent COPY statements. i.e. a non-routing version. Let me expand on this. In many of these giant COPY situations the bottleneck is plain old sequential I/O to a single process. You can almost predict how fast the rows will load using dd. Having a process that pulls rows in and distributes them round-robin is good, but it won't crack that bottleneck. The useful approaches I've seen for other databases all presume that the data files involved are large enough that on big hardware, you can start multiple processes running at different points in the file and beat anything possible with a single reader. If I'm loading a TB file, odds are good I can split that into 4 or more vertical pieces (say rows 1-25%, 25-50%, 50-75%, 75-100%), start 4 loaders at once, and get way more than 1 disk worth of throughput reading. You have to play with the exact number because if you push the split too far you introduce seek slowdown instead of improvements, but that's the basic design I'd like to see one day. It's not parallel loading that's useful for the cases I'm thinking about until something like this comes around. -- * Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
Hi Greg, On 2/6/08 7:56 AM, "Greg Smith" <gsmith@gregsmith.com> wrote: > If I'm loading a TB file, odds are good I can split that into 4 or more > vertical pieces (say rows 1-25%, 25-50%, 50-75%, 75-100%), start 4 loaders > at once, and get way more than 1 disk worth of throughput reading. You > have to play with the exact number because if you push the split too far > you introduce seek slowdown instead of improvements, but that's the basic > design I'd like to see one day. It's not parallel loading that's useful > for the cases I'm thinking about until something like this comes around. Just load 4 relfiles. You have to be able to handle partial relfiles, which changes the storage mgmt a bit, but the benefits are easier to achieve. - Luke
Greg Smith wrote:
> On Wed, 6 Feb 2008, Simon Riggs wrote:
>
>> For me, it would be good to see a --parallel=n parameter that would
>> allow pg_loader to distribute rows in "round-robin" manner to "n"
>> different concurrent COPY statements. i.e. a non-routing version.
>
> Let me expand on this. In many of these giant COPY situations the
> bottleneck is plain old sequential I/O to a single process. You can
> almost predict how fast the rows will load using dd. Having a process
> that pulls rows in and distributes them round-robin is good, but it
> won't crack that bottleneck. The useful approaches I've seen for other
> databases all presume that the data files involved are large enough that
> on big hardware, you can start multiple processes running at different
> points in the file and beat anything possible with a single reader.
>
> If I'm loading a TB file, odds are good I can split that into 4 or more
> vertical pieces (say rows 1-25%, 25-50%, 50-75%, 75-100%), start 4
> loaders at once, and get way more than 1 disk worth of throughput
> reading. You have to play with the exact number because if you push the
> split too far you introduce seek slowdown instead of improvements, but
> that's the basic design I'd like to see one day. It's not parallel
> loading that's useful for the cases I'm thinking about until something
> like this comes around.

Some food for thought here: most BI-type applications which do data conversion/cleansing might also end up sorting the data before it's loaded into the database, so starting parallel loaders at totally different points ruins that effort. A more pragmatic approach would be to have each stream read the "next" rows from the input file: if there are N parallel streams, each one is offset by 1 from the others and jumps ahead by N rows, so the seeks are pretty much narrowed down to a few rows (ideally 1) instead of jumping 1/Nth of the file every time a read happens.

For example, to replicate this with dd and see the impact, use a big file with the seek option and various block sizes. Somebody out here once did that test and showed that "seek time" on the file being read is reduced significantly, and if the file system does intelligent prefetching (which unfortunately UFS on Solaris does not do best by default), all the reads for the next stream will already be in memory.

Regards, Jignesh
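A sketch of the interleaved layout Jignesh describes: each of the N streams walks the whole file but only processes its own residue class of lines, so the streams stay within a few rows of each other and the bytes they skip are normally served from the OS page cache rather than causing long seeks. Illustration only:

    def interleaved_lines(path, stream_id, n_streams):
        """Stream stream_id (0-based) processes lines stream_id,
        stream_id + n_streams, stream_id + 2*n_streams, ...  Every stream
        still scans the file, but all N stay close together so the shared
        page cache / readahead serves the skipped lines."""
        with open(path) as f:
            for lineno, line in enumerate(f):
                if lineno % n_streams == stream_id:
                    yield line

    # e.g. four loader processes/threads, one per residue class:
    #   for line in interleaved_lines('data.txt', 2, 4): load(line)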
On Wednesday 06 February 2008, Greg Smith wrote:
> COPY. If you're loading a TB, if you're smart it's going onto the server
> itself if at all possible and loading directly from there. You'd probably
> get a closer comparison against psql \copy, but recognize you're always
> going to be compared against the best server-side copy method available.

Fair enough on your side, even if I can't expect an external tool using the network protocol to compete with the backend reading a local file. I wanted to make sure the 50% slowdown was not only due to my code being that bad.

> There are way too many data massaging cases I never expect the backend to
> handle that pgloader does a great job of right now, and I think there
> will always be a niche for a tool like this.

Let's try to continue improving the tool then!
-- dim
On Wednesday 06 February 2008, Greg Smith wrote:
> If I'm loading a TB file, odds are good I can split that into 4 or more
> vertical pieces (say rows 1-25%, 25-50%, 50-75%, 75-100%), start 4 loaders
> at once, and get way more than 1 disk worth of throughput reading.

pgloader already supports starting at any input file line number, and limiting itself to any number of reads:

  -C COUNT, --count=COUNT
                        number of input lines to process
  -F FROMCOUNT, --from=FROMCOUNT
                        number of input lines to skip

So you could already launch 4 pgloader processes with the same configuration file but different command-line arguments. If there's interest/demand, it's easy enough for me to add those parameters as configuration file knobs too.

Still, you have to pay for client-to-server communication instead of having the backend read the file locally, but now maybe we begin to compete?

Regards,
-- dim
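To make that concrete, here is one way to launch the four processes from a driver script using the --from/--count options quoted above; everything else on the command line (config file, section name) is a placeholder for whatever arguments your setup already uses:

    import subprocess

    def launch_split_loaders(total_lines, parts=4, extra_args=()):
        """Start one pgloader per slice of the input; the last slice picks
        up any remainder. extra_args stands in for the usual configuration
        arguments of your installation."""
        chunk = total_lines // parts
        procs = []
        for i in range(parts):
            frm = i * chunk
            cnt = chunk if i < parts - 1 else total_lines - frm
            cmd = ["pgloader", "--from=%d" % frm, "--count=%d" % cnt, *extra_args]
            procs.append(subprocess.Popen(cmd))
        return [p.wait() for p in procs]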
On Wednesday 06 February 2008 18:37:41, Dimitri Fontaine wrote:
> On Wednesday 06 February 2008, Greg Smith wrote:
> > If I'm loading a TB file, odds are good I can split that into 4 or more
> > vertical pieces (say rows 1-25%, 25-50%, 50-75%, 75-100%), start 4
> > loaders at once, and get way more than 1 disk worth of throughput
> > reading.
>
> pgloader already supports starting at any input file line number, and
> limiting itself to any number of reads:

In fact, the -F option works by having pgloader read the given number of lines but skip processing them, which is not at all what Greg is talking about here, I think.

Plus, I think it would be easier for me to code some stat() then lseek() then read() into the pgloader reader machinery than to change the code architecture to support a separate thread for the file reader.

Greg, what would you think of a pgloader that splits file reading based on the file size as given by stat (os.stat(file)[ST_SIZE]) and the number of threads: we split into as many pieces as the section_threads config value for the section.

This behaviour won't be available for sections where type = text and field_count(*) is given, because in that case I don't see how pgloader could reliably recognize the beginning of a new logical line and start processing there. In other cases, a logical line is a physical line, so we start after the first newline found from the given lseek start position, and continue reading after the last lseek position until a newline.

*: http://pgloader.projects.postgresql.org/#_text_format_configuration_parameters

Comments?
-- dim
On Wed, 6 Feb 2008, Dimitri Fontaine wrote: > In fact, the -F option works by having pgloader read the given number of lines > but skip processing them, which is not at all what Greg is talking about here > I think. Yeah, that's not useful. > Greg, what would you think of a pgloader which will separate file reading > based on file size as given by stat (os.stat(file)[ST_SIZE]) and number of > threads: we split into as many pieces as section_threads section config > value. Now you're talking. Find a couple of split points that way, fine-tune the boundaries a bit so they rest on line termination points, and off you go. Don't forget that the basic principle here implies you'll never know until you're done just how many lines were really in the file. When thread#1 is running against chunk#1, it will never have any idea what line chunk#2 really started at until it reaches there, at which point it's done and that information isn't helpful anymore. You have to stop thinking in terms of lines for this splitting; all you can do is split the file into useful byte sections and then count the lines within them as you go. Anything else requires a counting scan of the file and such a sequential read is exactly what can't happen (especially not more than once), it just takes too long. -- * Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
On Thursday 07 February 2008, Greg Smith wrote:
> On Wednesday 06 February 2008, Dimitri Fontaine wrote:
>> In other cases, a logical line is a physical line, so we start after the
>> first newline found from the given lseek start position, and continue
>> reading after the last lseek position until a newline.
>
> Now you're talking. Find a couple of split points that way, fine-tune the
> boundaries a bit so they rest on line termination points, and off you go.

I was thinking of not even reading the file content from the controller thread, just deciding the split points in bytes (0..ST_SIZE/4, ST_SIZE/4+1..2*ST_SIZE/4, etc.) and letting each reading thread fine-tune by beginning to process input only after having read its first newline, etc.

And while we're still at the design board, I'm also thinking of adding a per-section parameter (with a global default value possible) split_file_reading, which defaults to False and which you'll have to set to True for pgloader to behave the way we're talking about.

When split_file_reading = False and section_threads != 1, pgloader will have to manage several processing threads per section but only one file-reading thread, giving the read input to processing threads in a round-robin fashion. In the future the choice of processing thread will possibly (another knob) be smarter than that, as soon as we get CE support into pgloader.

When split_file_reading = True and section_threads != 1, pgloader will have to manage several processing threads per section, each one responsible for reading its own part of the file, with processing boundaries to be discovered at reading time. Adding CE support in this case means managing two separate thread pools per section, one responsible for split file reading and another responsible for data buffering and routing (COPY to the partition instead of to the parent table).

In both cases, maybe pgloader would also need a separate thread for COPYing the buffer to the server, allowing it to continue preparing the next buffer in the meantime? This will need some re-architecting of pgloader, but it seems worth it (I'm not entirely sold on the two-thread-pools idea, though, and this last continue-reading-while-copying idea still has to be examined).

Some of the work needing to be done is by now quite clear to me, but part of it still needs its share of design time. As usual though, the real hard part is knowing exactly what we want to get done, and we're showing good progress here :)

Greg's behaviour:
  max_threads = N
  max_parallel_sections = 1
  section_threads = -1
  split_file_reading = True

Simon's behaviour:
  max_threads = N
  max_parallel_sections = 1   # I don't think Simon wants parallel sections
  section_threads = -1
  split_file_reading = False

Comments?
-- dim
On Thu, 7 Feb 2008, Dimitri Fontaine wrote: > I was thinking of not even reading the file content from the controller > thread, just decide splitting points in bytes (0..ST_SIZE/4 - > ST_SIZE/4+1..2*ST_SIZE/4 etc) and let the reading thread fine-tune by > beginning to process input after having read first newline, etc. The problem I was pointing out is that if chunk#2 moved forward a few bytes before it started reading in search of a newline, how will chunk#1 know that it's supposed to read up to that further point? You have to stop #1 from reading further when it catches up with where #2 started. Since the start of #2 is fuzzy until some reading is done, what you're describing will need #2 to send some feedback to #1 after they've both started, and that sounds bad to me. I like designs where the boundaries between threads are clearly defined before any of them start and none of them ever talk to the others. > In both cases, maybe it would also be needed for pgloader to be able to have a > separate thread for COPYing the buffer to the server, allowing it to continue > preparing next buffer in the meantime? That sounds like a V2.0 design to me. I'd only chase after that level of complexity if profiling suggests that's where the bottleneck really is. -- * Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
On Thu, 7 Feb 2008, Greg Smith wrote: > The problem I was pointing out is that if chunk#2 moved foward a few bytes > before it started reading in search of a newline, how will chunk#1 know that > it's supposed to read up to that further point? You have to stop #1 from > reading further when it catches up with where #2 started. Since the start of > #2 is fuzzy until some reading is done, what you're describing will need #2 > to send some feedback to #1 after they've both started, and that sounds bad > to me. It doesn't have to be fuzzy at all. Both threads will presumably be able to use the same algorithm to work out where the boundary is, therefore they'll get the same result. No need to pass back information. Matthew -- There is something in the lecture course which may not have been visible so far, which is reality -- Computer Science Lecturer
On Thu, Feb 07, 2008 at 12:06:42PM -0500, Greg Smith wrote:
> On Thu, 7 Feb 2008, Dimitri Fontaine wrote:
>
>> I was thinking of not even reading the file content from the controller
>> thread, just decide splitting points in bytes (0..ST_SIZE/4 -
>> ST_SIZE/4+1..2*ST_SIZE/4 etc) and let the reading thread fine-tune by
>> beginning to process input after having read first newline, etc.
>
> The problem I was pointing out is that if chunk#2 moved forward a few bytes
> before it started reading in search of a newline, how will chunk#1 know
> that it's supposed to read up to that further point? You have to stop #1
> from reading further when it catches up with where #2 started. Since the
> start of #2 is fuzzy until some reading is done, what you're describing
> will need #2 to send some feedback to #1 after they've both started, and
> that sounds bad to me. I like designs where the boundaries between threads
> are clearly defined before any of them start and none of them ever talk to
> the others.

As long as both processes understand the start condition, there is not a problem. p1 starts at beginning and processes through chunk2 offset until it reaches the start condition. p2 starts loading from chunk2 offset plus the amount needed to reach the start condition, ...

  DBfile |---------------|--x--------------|x----------------|-x--|
          x chunk1----------->
                             x chunk2-------->
                                              x chunk3----------->...

As long as both pieces use the same test, they will each process non-overlapping segments of the file and still process 100% of the file.

Ken

>> In both cases, maybe it would also be needed for pgloader to be able to
>> have a separate thread for COPYing the buffer to the server, allowing it
>> to continue preparing next buffer in the meantime?
>
> That sounds like a V2.0 design to me. I'd only chase after that level of
> complexity if profiling suggests that's where the bottleneck really is.
>
> -- * Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
> > I was thinking of not even reading the file content from the controller > > thread, just decide splitting points in bytes (0..ST_SIZE/4 - > > ST_SIZE/4+1..2*ST_SIZE/4 etc) and let the reading thread fine-tune by > > beginning to process input after having read first newline, etc. > > The problem I was pointing out is that if chunk#2 moved foward a few bytes > before it started reading in search of a newline, how will chunk#1 know > that it's supposed to read up to that further point? You have to stop #1 > from reading further when it catches up with where #2 started. Since the > start of #2 is fuzzy until some reading is done, what you're describing > will need #2 to send some feedback to #1 after they've both started, and > that sounds bad to me. I like designs where the boundaries between > threads are clearly defined before any of them start and none of them ever > talk to the others. I don't think that any communication is needed beyond the beginning of the threads. Each thread knows that it should start at byte offset X and end at byte offset Y, but if Y happens to be in the middle of a record then just keep going until the end of the record. As long as the algorithm for reading past the end marker is the same as the algorithm for skipping past the beginning marker then all is well. -- Mark Lewis
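Putting the rule the last few messages converged on into code: a line belongs to the worker whose byte range contains the line's first byte, so no coordination between workers is needed. A sketch:

    import os

    def byte_ranges(path, parts):
        """Cut the file into roughly equal byte ranges; each worker aligns
        on newlines itself, so the controller never reads the file."""
        size = os.stat(path).st_size
        step = size // parts
        return [(i * step, size if i == parts - 1 else (i + 1) * step)
                for i in range(parts)]

    def lines_in_range(path, start, end):
        """Yield the lines whose first byte lies in [start, end): skip a
        partial line at the start (the worker to our left finishes it) and
        run past 'end' to complete the last line that begins before it."""
        with open(path, 'rb') as f:
            if start > 0:
                f.seek(start - 1)
                if f.read(1) != b'\n':     # landed mid-line: it belongs to
                    f.readline()           # the previous worker, skip it
            while f.tell() < end:
                line = f.readline()
                if not line:
                    break
                yield line

    # usage sketch: one worker per range
    #   for start, end in byte_ranges('data.txt', 4):
    #       feed_copy(lines_in_range('data.txt', start, end))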
On Mon, 04 Feb 2008 17:33:34 -0500 "Jignesh K. Shah" <J.K.Shah@Sun.COM> wrote:

> Hi Simon,
>
> I have some insight into TPC-H on how it works.
>
> First of all I think it is a violation of TPC rules to publish numbers
> without auditing them first. So even if I do the test to show the
> better performance of PostgreSQL 8.3, I cannot post it here or on any
> public forum without going through the "process". (Even though it is a
> partial benchmark, as they are just doing the equivalent of the PowerRun
> of TPC-H.) Maybe the PostgreSQL PR team should email info@tpc.org about
> them and see what they have to say about that comparison.

I think I am qualified enough to say it is not a violation of TPC fair-use policy if we scope the data as a measure of how PostgreSQL has changed from 8.1 to 8.3 and refrain from comparing these results to what any other database is doing. The point is to measure PostgreSQL's progress, not market it, correct?

Regards,
Mark
On Mon, 4 Feb 2008 15:09:58 -0500 (EST) Greg Smith <gsmith@gregsmith.com> wrote: > On Mon, 4 Feb 2008, Simon Riggs wrote: > > > Would anybody like to repeat these tests with the latest production > > versions of these databases (i.e. with PGSQL 8.3) > > Do you have any suggestions on how people should run TPC-H? It looked > like a bit of work to sort through how to even start this exercise. If you mean you want to get your hands on a kit, the one that Jenny and I put together is here: http://sourceforge.net/project/showfiles.php?group_id=52479&package_id=71458 I hear it still works. :) Regards, Mark
Does anyone have performance info about the new Dell Perc/6 controllers? I found a long discussion ("Dell vs HP") about the Perc/5, but nothing about the Perc/6. What's under the covers?

Here is the (abbreviated) info from Dell on this machine:

  PowerEdge 1950 III        Quad Core Intel® Xeon® E5405, 2x6MB Cache, 2.0GHz, 1333MHz FSB
  Additional Processors     Quad Core Intel® Xeon® E5405, 2x6MB Cache, 2.0GHz, 1333MHz FSB
  Memory                    8GB 667MHz (4x2GB), Dual Ranked DIMMs
  Hard Drive Configuration  Integrated SAS/SATA RAID 5, PERC 6/i Integrated

Thanks,
Craig
On Tue, 12 Feb 2008, Craig James wrote: > Does anyone have performance info about the new Dell Perc/6 controllers? I > found a long discussion ("Dell vs HP") about the Perc/5, but nothing about > Perc/6. What's under the covers? The Perc/6i has an LSI Logic MegaRAID SAS 1078 chipset under the hood. I know the Linux drivers for the card seemed to stabilize around October, there's a good sized list of compatible distributions on LSI's site. FreeBSD support has some limitations but basically works. I haven't seen any benchmarks for the current version of the card yet. -- * Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
Hello. I think I started that discussion. We ended up buying a Dell 2900 with PERC 6/i and 10 * 145GB SAS 3,5" 15KRpm discs. 6 of the SAS discs are in a raid 10 for the database, 2 in a mirror for the wal and the last 2 in a mirror for the OS. We get 350MB/s writing and 380MB/s reading to/from the raid 10 area using dd. The OS is Ubuntu and the filesystem for the raid 10 is ext3. The box is still under testing, but we plan to set it in production this week. Regards, - Tore. On Feb 12, 2008, at 17:32 , Craig James wrote: > Does anyone have performance info about the new Dell Perc/6 > controllers? I found a long discussion ("Dell vs HP") about the > Perc/5, but nothing about Perc/6. What's under the covers? > > Here is the (abbreviated) info from Dell on this machine: > PowerEdge 1950 III Quad Core Intel® Xeon® E5405, 2x6MB Cache, > 2.0GHz, 1333MHz FSB > Additional Processors Quad Core Intel® Xeon® E5405, 2x6MB Cache, > 2.0GHz, 1333MHz FSB > Memory 8GB 667MHz (4x2GB), Dual Ranked DIMMs > Hard Drive Configuration Integrated SAS/SATA RAID 5, PERC 6/i > Integrated > > Thanks, > Craig
On 13-Feb-08, at 5:02 AM, Tore Halset wrote: > Hello. > > I think I started that discussion. We ended up buying a Dell 2900 > with PERC 6/i and 10 * 145GB SAS 3,5" 15KRpm discs. 6 of the SAS > discs are in a raid 10 for the database, 2 in a mirror for the wal > and the last 2 in a mirror for the OS. We get 350MB/s writing and > 380MB/s reading to/from the raid 10 area using dd. The OS is Ubuntu > and the filesystem for the raid 10 is ext3. > Wow that's fantastic. Just to be sure, did you make sure that you read and wrote 2x memory to take the cache out of the measurement ? Dave
On Feb 13, 2008 5:02 AM, Tore Halset <halset@pvv.ntnu.no> wrote: > Hello. > > I think I started that discussion. We ended up buying a Dell 2900 with > PERC 6/i and 10 * 145GB SAS 3,5" 15KRpm discs. 6 of the SAS discs are > in a raid 10 for the database, 2 in a mirror for the wal and the last > 2 in a mirror for the OS. We get 350MB/s writing and 380MB/s reading > to/from the raid 10 area using dd. The OS is Ubuntu and the filesystem > for the raid 10 is ext3. Those are decent numbers. Can you do a bonnie++ run and post the results (specifically interested in seeks)? merlin
On Feb 13, 2008, at 12:06, Dave Cramer wrote:
> On 13-Feb-08, at 5:02 AM, Tore Halset wrote:
>> Hello.
>>
>> I think I started that discussion. We ended up buying a Dell 2900
>> with PERC 6/i and 10 * 145GB SAS 3,5" 15KRpm discs. 6 of the SAS
>> discs are in a raid 10 for the database, 2 in a mirror for the wal
>> and the last 2 in a mirror for the OS. We get 350MB/s writing and
>> 380MB/s reading to/from the raid 10 area using dd. The OS is Ubuntu
>> and the filesystem for the raid 10 is ext3.
>
> Wow that's fantastic. Just to be sure, did you make sure that you
> read and wrote 2x memory to take the cache out of the measurement ?
>
> Dave

The box has 16GB of RAM, but my original test file was only 25GB. Sorry. Going to 33GB lowered the numbers for writing. Here you have some samples.

  % sh -c "dd if=/dev/zero of=bigfile bs=8k count=4000000 && sync"
  32768000000 bytes (33 GB) copied, 103.722 seconds, 316 MB/s
  32768000000 bytes (33 GB) copied, 99.669 seconds, 329 MB/s

  % time dd if=bigfile of=/dev/null bs=8k
  32768000000 bytes (33 GB) copied, 85.4235 seconds, 384 MB/s
  32768000000 bytes (33 GB) copied, 85.4628 seconds, 383 MB/s

Regards,
- Tore.
On Feb 13, 2008, at 20:45, Tore Halset wrote:
> The box has 16GB of RAM, but my original test file was only 25GB.
> Sorry. Going to 33GB lowered the numbers for writing. Here you have
> some samples.
>
> % sh -c "dd if=/dev/zero of=bigfile bs=8k count=4000000 && sync"
> 32768000000 bytes (33 GB) copied, 103.722 seconds, 316 MB/s
> 32768000000 bytes (33 GB) copied, 99.669 seconds, 329 MB/s
>
> % time dd if=bigfile of=/dev/null bs=8k
> 32768000000 bytes (33 GB) copied, 85.4235 seconds, 384 MB/s
> 32768000000 bytes (33 GB) copied, 85.4628 seconds, 383 MB/s

And here are the bonnie++ numbers. I am a bonnie++ newbie so I ran it with no options.

  Version 1.03c        ------Sequential Output------ --Sequential Input- --Random-
                       -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
  Machine        Size  K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
  harteigen    32136M  83983  97 221757 40 106483  19 89603  97 268787  22 886.1   1
                       ------Sequential Create------ --------Random Create--------
                       -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
                files   /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                   16  +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++

  harteigen,32136M,83983,97,221757,40,106483,19,89603,97,268787,22,886.1,1,16,+++++,+++,+++++,+++,+++++,+++,+++++,+++,+++++,+++,+++++,+++

Regards,
- Tore.