Thread: Populating large tables with occasional bad values
Question: why are inserts approximately 10 times slower over the Internet than locally?

Background information: I have a Java program that inserts rows using data derived from a legacy database which can have occasional bad values (eg text in a numeric column) as well as duplicate primary key values. Postgres rejects the offending rows and my Java program writes the offending insert statement and the error message to a log file. This allows the user to take action, perhaps cleaning the original data and reloading, perhaps entering a few rows manually, whatever.

In the case of duplicate key values, for certain tables the program is told to modify the key (eg appending "A" or "B") to make it unique. In that case, the original insert is an error but after one or two retries, a computed unique key allows it to be inserted.

While I understand that INSERT is not as fast as COPY, inserting is fast enough and provides the needed flexibility (especially for custom logic to alter duplicate keys). However, inserting rows is only fast enough if the database is on the same computer as the Java program. For example, 200,000 rows can be inserted locally in under an hour. Over the Internet (using ssh tunneling with Trilead's code) it took over six hours. That's the problem.

I've tinkered some with setting autocommit off and doing commits every 100 inserts, but if there's an error the entire set is lost, not just the offending row. Also, postgres apparently doesn't allow setting the transaction isolation to NONE. Am I wrong about that? If it did, would that help?

I fail to understand why this is so much slower over the Internet. Even if autocommit is on, that's more work for the server; I shouldn't think that it increases the network traffic. I must be entirely in the dark on this. Does autocommit cause jdbc code on my client computer to send a BEGIN transaction, the insert, and a COMMIT for each row?

John
John T. Dow wrote:

> Question: why are inserts approximately 10 times slower over the Internet than locally?

Probable causes:

- Limited total bandwidth available;
- Delays caused by round trip latencies; or
- TCP/IP costs vs local UNIX domain socket costs

I'm betting on latency in your case.

> Postgres rejects the offending rows and my Java program writes the
> offending insert statement and the error message to a log file. This
> allows the user to take action, perhaps cleaning the original data and
> reloading, perhaps entering a few rows manually, whatever.

OK, so you're waiting for each INSERT to complete before issuing the next. That means that over the Internet you add *at* *least* (2*latency) to the time it takes to complete each insert, where `latency' is the round trip time for a packet between the DB client and server.

That gets expensive fast. My latency from my home DSL to my work's DSL is 12ms - and I'm in the same city and on the same ISP as work is, as well as connected to the same peering point. I regularly see latencies of > 150ms to hosts within Australia.

Even assuming only one round trip per INSERT (which is probably very over-optimistic) at, say, 50ms latency, you're talking a MAXIMUM throughput of 20 INSERTs per second even if the inserts themselves are issued, executed and responded to instantly.

> In the case of duplicate key values, for certain tables the program
> is told to modify the key (eg appending "A" or "B") to make it unique. In
> that case, the original insert is an error but after one or two retries,
> a computed unique key allows it to be inserted.

Can you do this server side? An appropriate CASE in the INSERT or the use of a stored procedure might be helpful. Can your application provide fallback options ahead of time in case they're needed, or generate them server-side? Doing this client-side is going to kill your insert rate completely.

> However, inserting rows is only fast enough if the database is on the
> same computer as the Java program. For example, 200,000 rows can be
> inserted locally in under an hour. Over the Internet (using ssh
> tunneling with Trilead's code) it took over six hours. That's the problem.

Yes, it's almost certainly latency. You'll probably find that if you batched your inserts and did more server-side you'd get massively better performance. For example, instead of:

  INSERT INTO x (a,b) VALUES ('pk1', 1);
  INSERT INTO x (a,b) VALUES ('pk2', 2);  -- Fails
  INSERT INTO x (a,b) VALUES ('pk2b', 2); -- Reissue failed
  INSERT INTO x (a,b) VALUES ('pk3', 3);

you might issue:

  INSERT INTO x (a,b) VALUES ('pk1', 1), ('pk2', 2), ('pk3', 3);

and have a trigger on `x' check for and handle insert conflicts. If you only want the duplicate key logic to fire for this app but not other inserts on the same table you could use a trigger on a dummy table to rewrite the inserts, use rewrite rules on a view to convert the inserts into stored proc calls, etc etc.

Another option is to bulk-insert or COPY the raw data into a holding table. The holding table need not even have any indexes, doesn't need a primary key, etc. Once the holding table is populated you can bulk INSERT ... SELECT the data, using appropriate CASE statements to handle pkey conflicts. If you need user input, run one query to find all pkey conflicts and get the user to make their decisions based on the results, then issue batched inserts based on their responses.
In any case, you need to eliminate the time your app spends waiting for the command to be issued, processed by the DB server, and for the response to reach the client. The main ways to do that are to do more work server-side and to batch your operations together more.

> I've tinkered some with setting autocommit off and doing commits
> every 100 inserts, but if there's an error the entire set is lost, not just
> the offending row.

If you want performance, in this case I'd certainly turn autocommit off. Instead of inserting a row and trapping the error if a unique constraint is violated, do something like:

  INSERT INTO mytable (key, value)
  SELECT 'pk1', 'otherthing'
  WHERE NOT EXISTS (SELECT 1 FROM mytable WHERE key = 'pk1');

... and check to see if you inserted the row or not using the rows affected count provided by the JDBC driver. Alternately you can use a trigger or stored procedure and trap exceptions on unique constraint violations. This will cause horrible performance problems with versions of postgresql before 8.3, though.

In either case, however, you're still issuing one INSERT per row inserted, and incurring nasty round trip latency penalties for that. You'll be much better off doing multiple-row inserts into a table/view with a FOR EACH ROW ... BEFORE INSERT trigger that spots pk conflicts and rewrites the insert to fix them (or redirects the insert into a "conflicts" table for later processing).

> Also, postgres apparently doesn't allow setting the transaction
> isolation to NONE. Am I wrong about that? If it did, would that help?

As far as I know it's not possible. Nor would it help, because your problem is network round trip latencies, not database execution time.

> I fail to understand why this is so much slower over the Internet.

Because the Internet is a high latency communication medium. The server isn't taking any longer to execute your queries, as you'll find out if you use EXPLAIN ANALYZE to measure their execution time.

--
Craig Ringer
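As a rough illustration of the rows-affected check described above, here is a minimal JDBC sketch. The table name, column names, and the already-open Connection are assumptions made for the example, not anything taken from the thread:

  import java.sql.Connection;
  import java.sql.PreparedStatement;
  import java.sql.SQLException;

  public class ConditionalInsert {
      // Insert the row only if the key is not already present. A duplicate
      // key no longer raises an error; the statement simply affects 0 rows.
      static boolean insertIfAbsent(Connection conn, String key, String value)
              throws SQLException {
          PreparedStatement ps = conn.prepareStatement(
                  "INSERT INTO mytable (key, value) " +
                  "SELECT ?, ? WHERE NOT EXISTS " +
                  "(SELECT 1 FROM mytable WHERE key = ?)");
          try {
              ps.setString(1, key);
              ps.setString(2, value);
              ps.setString(3, key);
              int rows = ps.executeUpdate();   // rows affected, reported by the driver
              return rows == 1;                // false means the key already existed
          } finally {
              ps.close();
          }
      }
  }

The caller could retry with a modified key (e.g. appending "A") when the method returns false, instead of trapping a unique-violation exception.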
Sorry for the self-reply, but a correction and an example are really required.

Craig Ringer wrote:

> OK, so you're waiting for each INSERT to complete before issuing the
> next. That means that over the Internet you add *at* *least* (2*latency)
> to the time it takes to complete each insert, where `latency' is the
> round trip time for a packet between the DB client and server.

Whoops. At least `latency', not 2*latency. I was thinking one-way latency and got confused at some point into saying round trip instead. Sorry.

> That gets expensive fast. My latency from my home DSL to my work's DSL
> is 12ms - and I'm in the same city and on the same ISP as work is, as
> well as connected to the same peering point. I regularly see latencies
> of > 150ms to hosts within Australia.

Here's an example. The psql client is running on my desktop, and connected to work's Pg server. I set \timing in psql so psql reports the total time from when the local client issues the query to when it receives the reply. I then issue an insert with EXPLAIN ANALYZE so the server reports how long it took the server to execute the query.

  test=# \timing
  Timing is on
  test=# explain analyze insert into dummy (pk, value)
  test-# values (101,'thing');
                                      QUERY PLAN
  ------------------------------------------------------------------------------------
   Result  (cost=0.00..0.01 rows=1 width=0) (actual time=0.001..0.002 rows=1 loops=1)
   Total runtime: 0.090 ms
  (2 rows)

  Time: 21.914 ms

You can see that the server took only 0.09ms to execute the query, but to the client it appeared to take 21ms.

If I ssh to the server and connect with psql locally over a unix domain socket or the loopback interface, I get these results instead:

  test=# explain analyze insert into dummy (pk, value)
  test-# values (102,'thing');
                                      QUERY PLAN
  ------------------------------------------------------------------------------------
   Result  (cost=0.00..0.01 rows=1 width=0) (actual time=0.002..0.002 rows=1 loops=1)
   Total runtime: 0.056 ms
  (2 rows)

  Time: 0.530 ms

... which is 40 times faster from client query issue to client query completion despite the negligible server execution speed difference.

Multiple row INSERTs, use of COPY, etc will all help combat this.

--
Craig Ringer
Latency it is.

I just had no idea it would add up so fast. I guess I was thinking that you could pump a lot of data over the Internet without realizing the overhead when the data is broken down into little chunks.

I'm not sure what the best solution is. I do this rarely, usually when first loading the data from the legacy. When ready to go live, my (remote) client will send the data, I'll massage it for loading, then load it to their (remote) postgres server. This usually takes place over a weekend, but last time was in an evening which lasted until 4AM.

If I did this regularly, three options seem easiest.

1 - Load locally to get clean data and then COPY. This requires the server to have local access to the file to be copied, and if the server is hosted by an isp, it depends on them whether you can do this easily.

2 - Send the data to the client to run the Java app to insert over their LAN (this only works if the database server is local to them and not at an ISP).

3 - If the only problem is duplicate keys, load into a special table without the constraint, issue update commands to rewrite the keys as needed, then select/insert to the correct table.

Thanks

John
John T. Dow wrote:

> If I did this regularly, three options seem easiest.
>
> 1 - Load locally to get clean data and then COPY. This requires the
> server to have local access to the file to be copied, and if the server
> is hosted by an isp, it depends on them whether you can do this easily.

You can COPY over a PostgreSQL network connection. See the \copy support in psql for one example of how it works.

I've never had cause to try to use it with JDBC so I don't know how/if it works in that context.

However, batched inserts can be quite fast enough. If you do one INSERT INTO per 1000 rows you'll already be seeing a massive performance boost:

  INSERT INTO mytable
  VALUES
    (blah, blah),
    (blah, blah),
    (blah, blah),
    -- hundreds of rows later
    (blah, blah);

... will be a *LOT* faster. If you have to do special processing or error handling you can do it once you have the data in a server-side staging table - and you can get it there quickly with multi-row inserts or (at least using psql) a \copy .

> 3 - If the only problem is duplicate keys, load into a special table
> without the constraint, issue update commands to rewrite the keys as
> needed, then select/insert to the correct table.

This would be my preferred approach, personally, using either network COPY or multi-row INSERTs to get the data into the staging table.

You can do a whole lot more than fix unique constraint violations, though. With a PL/PgSQL function to process the staging table and do the inserts you can do some pretty major data surgery.

I have some conversion code that takes pretty ugly data in unconstrained staging tables and reprocesses it to fit a new, much stricter and better normalized schema. Large amounts of restructuring and fixing up are required. Some of the simpler conversions are done by INSERT ... SELECT statements, but the complicated stuff is done with PL/PgSQL functions. I've sent you the conversion code off-list in case it's informative. It's pretty ugly code as it's going to be thrown away when we cut over to the new system, but it should show just how much data conversion & repair you can do on the database server side.

It's not like you can't put aside rows that need user interaction as you process the staging table, either. Just INSERT them into a "problem" table and delete them from the staging table. Then have the client scan through the (much smaller) problem table and ask the user to make decisions. When everything looks satisfactory and the user's decisions have been acted on, COMMIT.

--
Craig Ringer
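Along the same lines, a rough JDBC sketch of getting a chunk of rows into an unconstrained staging table with a single multi-row INSERT, so each chunk costs one network round trip. The staging table name, columns, and the String[] row format are assumptions for illustration:

  import java.sql.Connection;
  import java.sql.PreparedStatement;
  import java.sql.SQLException;
  import java.util.List;

  public class StagingLoader {
      // Build one INSERT ... VALUES (?,?),(?,?),... statement for the whole chunk.
      static void loadChunk(Connection conn, List<String[]> rows) throws SQLException {
          if (rows.isEmpty()) return;
          StringBuilder sql = new StringBuilder(
                  "INSERT INTO staging_mytable (key, value) VALUES ");
          for (int i = 0; i < rows.size(); i++) {
              sql.append(i == 0 ? "(?, ?)" : ", (?, ?)");
          }
          PreparedStatement ps = conn.prepareStatement(sql.toString());
          try {
              int p = 1;
              for (String[] row : rows) {
                  ps.setString(p++, row[0]);   // key, kept as text in the staging table
                  ps.setString(p++, row[1]);   // value, kept as text in the staging table
              }
              ps.executeUpdate();              // one round trip for the whole chunk
          } finally {
              ps.close();
          }
      }
  }

Keeping chunks to a few hundred rows keeps the statement and its parameter list to a manageable size while still amortising the latency.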
Craig Ringer wrote:
> John T. Dow wrote:
>
>> If I did this regularly, three options seem easiest.
>>
>> 1 - Load locally to get clean data and then COPY. This requires the
>> server to have local access to the file to be copied, and if
>> the server is hosted by an isp, it depends on them whether you can do
>> this easily.
>
> You can COPY over a PostgreSQL network connection. See the \copy support
> in psql for one example of how it works.
>
> I've never had cause to try to use it with JDBC so I don't know how/if
> it works in that context.

It's not supported in the standard JDBC driver unfortunately.

> However, batched inserts can be quite fast enough. If you do one INSERT
> INTO per 1000 rows you'll already be seeing a massive performance boost:
>
>   INSERT INTO mytable
>   VALUES
>     (blah, blah),
>     (blah, blah),
>     (blah, blah),
>     -- hundreds of rows later
>     (blah, blah);
>
> ... will be a *LOT* faster. If you have to do special processing or
> error handling you can do it once you have the data in a server-side
> staging table - and you can get it there quickly with multi-row inserts
> or (at least using psql) a \copy .

You can get the same effect via JDBC batched inserts (addBatch() / executeBatch()) without having to actually do a multi-row INSERT statement. That's probably the most portable approach if you're using JDBC.

-O
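For illustration, a minimal sketch of the addBatch()/executeBatch() approach just mentioned. The table, columns, and the already-open Connection are assumed; note that, as discussed later in the thread, an error still aborts the whole transaction:

  import java.sql.Connection;
  import java.sql.PreparedStatement;
  import java.sql.SQLException;

  public class BatchedInserts {
      static void insertBatched(Connection conn, String[][] rows) throws SQLException {
          conn.setAutoCommit(false);
          PreparedStatement ps = conn.prepareStatement(
                  "INSERT INTO mytable (key, value) VALUES (?, ?)");
          try {
              for (String[] row : rows) {
                  ps.setString(1, row[0]);
                  ps.setString(2, row[1]);
                  ps.addBatch();           // queued client-side, nothing sent yet
              }
              ps.executeBatch();           // the whole batch goes to the server together
              conn.commit();
          } finally {
              ps.close();
          }
      }
  }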
John T. Dow wrote:

> Latency it is.
>
> I just had no idea it would add up so fast. I guess I was thinking that
> you could pump a lot of data over the Internet without realizing the
> overhead when the data is broken down into little chunks.

How about batches? Shouldn't they help you in this case?
I have tried using a batch and it looked good at first, but I don't think it will work with Postgres as cleanly as you'd like.

First, set autocommit false. Second, addBatch many times for the insert statements. Third, executeBatch.

Fourth, the exception BatchUpdateException is thrown. The exception reports which insert statement had a problem. getNextException() gives more details (duplicate key). Only one problem is reported. getUpdateCounts() on the exception reports what we already know, that the statements up until this point all inserted without problem.

Fifth - now what? I tried to commit at this point, hoping that the first insert statements would be committed, but they are not. Actually, I was hoping that the entire batch would be processed and then I could see which individual statements didn't insert (using the update counts) and then redo them individually, but I think that is wishful thinking.

Best possible solution with batches? Submit the batch, learn how many were accepted before the first error, resubmit just those. Process the problem statement by itself. Repeat with the statements left in the original batch until none are left, then create a new batch and begin anew. If a batch, or portion thereof, has no exception, then of course commit.

For this, since portions of batches will be sent twice, the batch shouldn't be too large. Even 10 statements in a batch would substantially reduce the overhead, certainly no need to do thousands.

In the posting I quote below, the implication is that you can learn of multiple statements in the batch that failed, but I didn't see that happening. It seems that postgres just quits at the first problem.

John

On Thu, 12 Jun 2008 14:35:30 +0300, tivvpgsqljdbc@gtech-ua.com wrote:

>John T. Dow wrote:
>>> How about batches? Shouldn't they help you in this case?
>>
>> What happens if the 23rd and 59th rows in a batch of 100 have a problem,
>> such as an invalid numeric value or a duplicate key?
>>
>> Can the other 98 rows be committed?
>>
>> I haven't tried this.
>
>In postgresql no (may be yes with autocommit turned on - did not try).
>Because postgresql marks transaction as rollback-only on first problem.
>The one thing you can do is to detect which records are wrong in the
>batch (say, #2 and #55) and redo the batch without failing records
>(with/without appending batch with new records). This would make server
>load higher, but would give you only two roundtrips instead of 100
>roundtrips. Actually if every batch has failing records, your
>server will have two times more transactions (this is the maximum).
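A hedged sketch of the failure handling John describes, assuming autocommit is off and the PreparedStatement's batch has already been filled. With the PostgreSQL JDBC driver the update counts only cover the statements that ran before the first failure, and the transaction has to be rolled back before anything else can be done:

  import java.sql.BatchUpdateException;
  import java.sql.Connection;
  import java.sql.PreparedStatement;
  import java.sql.SQLException;

  public class BatchErrorHandling {
      static void runBatch(Connection conn, PreparedStatement ps) throws SQLException {
          try {
              ps.executeBatch();
              conn.commit();
          } catch (BatchUpdateException e) {
              int[] counts = e.getUpdateCounts();         // statements that succeeded before the failure
              SQLException detail = e.getNextException(); // e.g. the duplicate key message
              System.err.println("Batch failed after " + counts.length + " inserts: "
                      + (detail != null ? detail.getMessage() : e.getMessage()));
              conn.rollback();   // the transaction is aborted; nothing from this batch is kept
          }
      }
  }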
John T. Dow wrote:

> Fifth - now what? I tried to commit at this point, hoping that the
> first insert statements would be committed, but they are not. Actually, I
> was hoping that the entire batch would be processed and then I could see
> which individual statements didn't insert (using the update counts) and
> then redo them individually, but I think that is wishful thinking.

The first error causes a transaction rollback. The only way the JDBC driver could skip the failed operation and continue would be to wrap each statement in a savepoint, and if there's an error issue a rollback to the last savepoint. That's not ideal from a performance point of view.

> Best possible solution with batches? Submit the batch, learn how many
> were accepted before the first error, resubmit just those. Process the
> problem statement by itself. Repeat with the statements left in the
> original batch until none are left, then create a new batch and begin
> anew. If a batch, or portion thereof, has no exception, then of course
> commit.

IMO the best solution with batches is to use batches to store your planned operations in a staging table using INSERTs that cannot fail under normal circumstances. Then issue a single PL/PgSQL call or a bulk UPDATE ... SELECT with an appropriate WHERE clause to apply the updates and catch problems.

The approach you describe will work, but it'll be a bit ugly and not very fast. If failures are very rare you might find it to be OK, but your suggestion of using only 10 statements per batch suggests that failures aren't *that* rare.

> In the posting I quote below, the implication is that you can learn
> of multiple statements in the batch that failed, but I didn't see that
> happening.

AFAIK that's database specific. The docs I read suggested that DBs are permitted to stop processing and return info on the first bad row, or to continue processing and return info on all rows. As PostgreSQL cannot just keep going after an error within a transaction unless you use savepoints, it chooses to do the former.

It might be cool if Pg supported an automatic savepoint mode where every statement updated an implicit savepoint. If the statement failed you could ROLLBACK TO LAST STATEMENT. If possible the savepoint would be replaced with each statement so there'd be no ever-growing set of savepoints to worry about. With this capability you could use proper transactional isolation but also achieve some error handling within the transaction.

Personally, though, I either restructure my SQL to avoid the potential errors or use PL/PgSQL to handle them. The app has to be capable of reissuing failed transactions anyway, so sometimes an app-level transaction do-over is quite enough. There might be situations in which that hypothetical feature could be handy, though.

> It seems that postgres just quits at the first problem.

Yes, by design. In a transaction, if a statement fails then without a savepoint in place there is, AFAIK, no way to just pretend the bad statement was never issued.

I don't know if it'll keep going if you enable autocommit. It might; it depends on how the JDBC driver does batches and how the server processes them. Your performance will suffer with autocommit on - though not as badly as when you're paying the high round trip latencies - and you won't be able to roll back if something nasty happens. If you don't care about things like automatic rollback on network dropout you should probably test batches with autocommit on and see how they behave.

--
Craig Ringer
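For comparison, a rough JDBC sketch of the savepoint-per-statement idea mentioned above (table and columns assumed). It keeps one open transaction and survives individual failures, but the extra SAVEPOINT commands per row mean it does nothing for the round-trip latency problem:

  import java.sql.Connection;
  import java.sql.PreparedStatement;
  import java.sql.SQLException;
  import java.sql.Savepoint;

  public class SavepointInserts {
      static void insertWithSavepoints(Connection conn, String[][] rows) throws SQLException {
          conn.setAutoCommit(false);
          PreparedStatement ps = conn.prepareStatement(
                  "INSERT INTO mytable (key, value) VALUES (?, ?)");
          try {
              for (String[] row : rows) {
                  Savepoint sp = conn.setSavepoint();
                  try {
                      ps.setString(1, row[0]);
                      ps.setString(2, row[1]);
                      ps.executeUpdate();
                      conn.releaseSavepoint(sp);
                  } catch (SQLException e) {
                      conn.rollback(sp);    // undo just this row; earlier work survives
                      // log the offending row and the error message here
                  }
              }
              conn.commit();
          } finally {
              ps.close();
          }
      }
  }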
I have a solution that I am happy with, yielding nearly a 10-fold improvement in speed.

I tried a quick and dirty experiment with different batch sizes with a small table of 750 rows.

Without using batches, the insert took 64 seconds. (Auto commit true for this and the other tests.)

Batch size 10, 13 seconds.

Batch size 100, 5 seconds.

This code was very simple -- if any error in the batch, discard the entire batch. This gives an idea of what performance would be like. Of course, with a batch size of 10 and two errors, the number of rows loaded is short by 17 because two entire batches were discarded. With a batch size of 100, it's short by 150. (I also neglected to send the last, partial batch, this being just a quick and dirty experiment.)

Anyway, a batch size of 10 yields a performance improvement of 5. Batch size 100, it's 10.

Craig wrote:

>IMO the best solution with batches is to use batches to store your
>planned operations in a staging table using INSERTs that cannot fail
>under normal circumstances. Then issue a single PL/PgSQL call or a bulk
>UPDATE ... SELECT with an appropriate WHERE clause to apply the updates
>and catch problems.
>
>The approach you describe will work, but it'll be a bit ugly and not
>very fast. If failures are very rare you might find it to be OK, but
>your suggestion of using only 10 statements per batch suggests that
>failures aren't *that* rare.

I need reasonable performance, not the best that money can buy. My clients have tables with tens of thousands of rows, not hundreds of thousands or millions. Also, I want a general purpose solution that will work with no special knowledge of the table to be loaded -- just the info obtainable from the metadata (such as column names and types). A staging table such as you suggest could be created automatically (copy the columns but change all to char varying so they load). But then locating rows with bad values would be a messy bit of SQL, although it could be generated by my code. Still, I'd have to bring over the rows (casting char varying to integer or whatever) that are good and then remove them, leaving the bad rows behind.

I'll clean up my dirty code, eg make sure it sends the last partial batch. I'll also have it iterate to insert the rows in any batch that's rejected. The simplest thing to do is simply resend the rows in the batch, one by one. A more elaborate thing to do would be to resend a partial batch, up to the point of the problem, skip the known problem row, then send another partial batch with the remaining rows. Keep doing that until all in the original batch have been successfully inserted or else written to the error log.

John
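A rough sketch of the resubmit loop John outlines: send a chunk as a batch and, if it fails, roll back and redo the rows one at a time so each bad row can be logged individually. The table, columns, chunking of the input, and the use of autocommit-off are assumptions for the example, not John's actual code:

  import java.sql.BatchUpdateException;
  import java.sql.Connection;
  import java.sql.PreparedStatement;
  import java.sql.SQLException;
  import java.util.List;

  public class ChunkedLoader {
      static void insertChunk(Connection conn, List<String[]> chunk) throws SQLException {
          conn.setAutoCommit(false);
          PreparedStatement ps = conn.prepareStatement(
                  "INSERT INTO mytable (key, value) VALUES (?, ?)");
          try {
              for (String[] row : chunk) {
                  ps.setString(1, row[0]);
                  ps.setString(2, row[1]);
                  ps.addBatch();
              }
              ps.executeBatch();                 // normally one round trip per chunk
              conn.commit();                     // whole chunk went in
          } catch (BatchUpdateException e) {
              conn.rollback();                   // the whole chunk was discarded
              ps.clearBatch();
              for (String[] row : chunk) {       // redo it row by row
                  try {
                      ps.setString(1, row[0]);
                      ps.setString(2, row[1]);
                      ps.executeUpdate();
                      conn.commit();
                  } catch (SQLException bad) {
                      conn.rollback();
                      // write this row and bad.getMessage() to the error log
                  }
              }
          } finally {
              ps.close();
          }
      }
  }

Only chunks that actually contain a bad row pay the row-by-row penalty, so with mostly clean data the overall cost stays close to one round trip per chunk.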