Thread: Populating large tables with occasional bad values

Populating large tables with occasional bad values

From
"John T. Dow"
Date:
Question: why are inserts approximately 10 times slower over the Internet than locally?

Background information:

I have a Java program that inserts rows using data derived from a legacy database which can have occasional bad values
(eg text in a numeric column) as well as duplicate primary key values.

Postgres rejects the offending rows and my Java program writes the offending insert statement and the error message to
a log file. This allows the user to take action, perhaps cleaning the original data and reloading, perhaps entering a
few rows manually, whatever.

In the case of duplicate key values, for certain tables the program is told to modify the key (eg appending "A" or "B")
to make it unique. In that case, the original insert is an error but after one or two retries, a computed unique key
allows it to be inserted.

While I understand that INSERT is not as fast as COPY, inserting is fast enough and provides the needed flexibility
(especially for custom logic to alter duplicate keys).

However, inserting rows is only fast enough if the database is on the same computer as the Java program. For example,
200,000 rows can be inserted locally in under an hour. Over the Internet (using ssh tunneling with Trilead's code) it
took over six hours. That's the problem.

I've tinkered some with setting autocommit off and doing commits every 100 inserts, but if there's an error the entire
set is lost, not just the offending row.

Also, postgres apparently doesn't allow setting the transaction isolation to NONE. Am I wrong about that? If it did,
would that help?

I fail to understand why this is so much slower over the Internet. Even if autocommit is on, that's more work for the
server, but I shouldn't think that it increases the network traffic. I must be entirely in the dark on this. Does
autocommit cause the jdbc code on my client computer to send a BEGIN transaction, the insert, and a COMMIT for each row?

John


Re: Populating large tables with occasional bad values

From
Craig Ringer
Date:
John T. Dow wrote:
> Question: why are inserts approximately 10 times slower over the Internet than locally?

Probably causes:

- Limited total bandwidth available;
- Delays caused by round trip latencies; or
- TCP/IP costs vs local UNIX domain socket costs

I'm betting on latency in your case.

> Postgres rejects the offending rows and my Java program writes the
> offending insert statement and the error message to a log file. This
> allows the user to take action, perhaps cleaning the original data and
> reloading, perhaps entering a few rows manually, whatever.

OK, so you're waiting for each INSERT to complete before issuing the
next. That means that over the Internet you add *at* *least* (2*latency)
to the time it takes to complete each insert, where `latency' is the
round trip time for a packet between the DB client and server.

That gets expensive fast. My latency from my home DSL to my work's DSL
is 12ms - and I'm in the same city and on the same ISP as work is, as
well as connected to the same peering point. I regularly see latencies
of > 150ms to hosts within Australia.

Even assuming only one round trip per INSERT (which is probably very
over-optimistic) at, say, 50ms latency, you're talking a MAXIMUM
throughput of 20 INSERTs per second even if the inserts themselves are
issued, executed and responded to instantly.
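
To put rough numbers on that from the client side, a crude timing loop like the one below will show the per-row cost
directly. This is only an untested sketch - the JDBC URL, credentials and table are made up - but with autocommit on,
the measured time per row can never drop below your round trip latency.

// Untested sketch: time N single-row INSERTs so the per-row round trip
// cost becomes visible. The JDBC URL, credentials and table "x" are
// placeholders, not from the real application.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class InsertLatencyProbe {
    public static void main(String[] args) throws Exception {
        Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://db.example.com/test", "someuser", "secret");
        PreparedStatement ps =
                conn.prepareStatement("INSERT INTO x (a, b) VALUES (?, ?)");
        int rows = 1000;
        long start = System.nanoTime();
        for (int i = 0; i < rows; i++) {
            ps.setString(1, "pk" + i);
            ps.setInt(2, i);
            ps.executeUpdate();   // autocommit on: one full round trip per row
        }
        double msPerRow = (System.nanoTime() - start) / 1e6 / rows;
        System.out.printf("%.2f ms per row%n", msPerRow);
        conn.close();
    }
}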

> In the case of duplicate key values, for certain tables the program
> is  told to modify the key (eg appending "A" or "B") to make it unique. In
> that case, the original insert is an error but after one or two retries,
> a computed unique key allows it to be inserted.

Can you do this server side? An appropriate CASE in the INSERT or the
use of a stored procedure might be helpful. Can your application provide
fallback options ahead of time in case they're needed, or generate them
server-side?

Doing this client-side is going to kill your insert rate completely.

> However, inserting rows is only fast enough if the database is on the
> same computer as the Java program. For example, 200,000 rows can be
> inserted locally in under an hour. Over the Internet (using ssh
> tunneling with Trilead's code) it took over six hours. That's the problem.

Yes, it's almost certainly latency. You'll probably find that if you
batched your inserts and did more server-side you'd get massively better
performance. For example, instead of:

INSERT INTO x (a,b) values ('pk1', 1);
INSERT INTO x (a,b) values ('pk2', 2); -- Fails
INSERT INTO x (a,b) values ('pk2b', 2); -- Reissue failed
INSERT INTO x (a,b) values ('pk3', 3);

you might issue:

INSERT INTO x (a,b) VALUES
('pk1', 1),
('pk2',2),
('pk3',3);

and have a trigger on `x' check for and handle insert conflicts.

If you only want the duplicate key logic to fire for this app but not
other inserts on the same table you could use a trigger on a dummy table
to rewrite the inserts, use rewrite rules on a view to convert the
inserts into stored proc calls, etc etc.

Another option is to bulk-insert or COPY the raw data into a holding
table. The holding table need not even have any indexes, doesn't need a
primary key, etc. Once the holding table is populated you can bulk
INSERT ... SELECT the data, using appropriate CASE statements to handle
pkey conflicts. If you need user input, run one query to find all pkey
conflicts and get the user to make their decisions based on the results,
then issue batched inserts based on their responses.

In any case, you need to eliminate the time your app spends waiting for
the command to be issued, processed by the DB server, and for the
response to reach the client. The main ways to do that are to do more
work server-side and to batch your operations together more.
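
From the Java side a multi-row INSERT can be assembled with a PreparedStatement along these lines. Untested sketch
only - the table is the `x' from the example above, the chunk representation is arbitrary, and multi-row VALUES needs
PostgreSQL 8.2 or later:

// Untested sketch: send one multi-row INSERT per chunk so a whole chunk
// of rows costs a single round trip. Table "x" is the one from the example
// above; the parallel arrays are just for brevity.
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public class MultiRowInsert {
    static void insertChunk(Connection conn, String[] a, int[] b)
            throws SQLException {
        // Builds: INSERT INTO x (a, b) VALUES (?, ?), (?, ?), ... one pair per row
        StringBuilder sql = new StringBuilder("INSERT INTO x (a, b) VALUES ");
        for (int i = 0; i < a.length; i++) {
            sql.append(i == 0 ? "(?, ?)" : ", (?, ?)");
        }
        PreparedStatement ps = conn.prepareStatement(sql.toString());
        try {
            int p = 1;
            for (int i = 0; i < a.length; i++) {
                ps.setString(p++, a[i]);
                ps.setInt(p++, b[i]);
            }
            ps.executeUpdate();   // the whole chunk is one round trip
        } finally {
            ps.close();
        }
    }
}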

> I've tinkered some with setting autocommit off and doing commits
> every  100 inserts, but if there's an error the entire set is lost, not just
> the offending row.

If you want performance, in this case I'd certainly turn autocommit off.
Instead of inserting a row and trapping the error if a unique constraint
is violated, do something like:

INSERT INTO mytable (key, value)
SELECT 'pk1', 'otherthing'
   WHERE NOT EXISTS (SELECT 1 FROM mytable WHERE key = 'pk1');

... and check to see if you inserted the row or not using the rows
affected count provided by the JDBC driver.
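
In JDBC terms that's roughly the following (untested sketch, same table and columns as the SQL above):

// Untested sketch: the JDBC side of the conditional INSERT above. The
// update count tells you whether the row went in (1) or was skipped
// because the key already existed (0).
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public class InsertIfAbsent {
    static boolean insertIfAbsent(Connection conn, String key, String value)
            throws SQLException {
        PreparedStatement ps = conn.prepareStatement(
                "INSERT INTO mytable (key, value) " +
                "SELECT ?, ? " +
                "WHERE NOT EXISTS (SELECT 1 FROM mytable WHERE key = ?)");
        try {
            ps.setString(1, key);
            ps.setString(2, value);
            ps.setString(3, key);
            return ps.executeUpdate() == 1;   // 0 => duplicate key, row skipped
        } finally {
            ps.close();
        }
    }
}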

Alternately you can use a trigger or stored procedure and trap
exceptions on unique constraint violations. This will cause horrible
performance problems with versions of postgresql before 8.3, though.

In either case, however, you're still issuing one INSERT per row
inserted, and incurring nasty round trip latency penalties for that.
You'll be much better off doing multiple-row inserts into a table/view
with a FOR EACH ROW ... BEFORE INSERT trigger that spots pk conflicts
and rewrites the insert to fix them (or redirects the insert into a
"conflicts" table for later processing).

> Also, postgres apparently doesn't allow setting the transaction
> isolation to NONE. Am I wrong about that? If it did, would that help?

As far as I know it's not possible. Nor would it help, because your
problem is network round trip latencies not database execution time.

> I fail to understand why this is so much slower over the Internet.

Because the Internet is a high latency communication medium. The server
isn't taking any longer to execute your queries, as you'll find out if
you use EXPLAIN ANALYZE to measure their execution time.

--
Craig Ringer

Re: Populating large tables with occasional bad values

From
Craig Ringer
Date:
Sorry for the self-reply, but a correction and an example are really
required.

Craig Ringer wrote:

> OK, so you're waiting for each INSERT to complete before issuing the
> next. That means that over the Internet you add *at* *least* (2*latency)
> to the time it takes to complete each insert, where `latency' is the
> round trip time for a packet between the DB client and server.

Whoops. At least `latency' not 2*latency. I was thinking one-way latency
and got confused at some point into saying round trip instead. Sorry.

> That gets expensive fast. My latency from my home DSL to my work's DSL
> is 12ms - and I'm in the same city and on the same ISP as work is, as
> well as connected to the same peering point. I regularly see latencies
> of > 150ms to hosts within Australia.

Here's an example. The psql client is running on my desktop, and
connected to work's Pg server.

I set \timing in psql so psql reports the total time from when the local
client issues the query to when it receives the reply.

I then issue an insert with EXPLAIN ANALYZE so the server reports how
long it took the server to execute the query.

test=# \timing
Timing is on
test=# explain analyze insert into dummy (pk, value)
test-# values (101,'thing');
                                      QUERY PLAN
--------------------------------------------------------------------------------
  Result  (cost=0.00..0.01 rows=1 width=0)
          (actual time=0.001..0.002 rows=1 loops=1)
  Total runtime: 0.090 ms
(2 rows)

Time: 21.914 ms


You can see that the server took only 0.09ms to execute the query, but to
the client it appeared to take 21ms.

If I ssh to the server and connect with psql locally over a unix domain
socket or the loopback interface, I get these results instead:



test=# explain analyze insert into dummy (pk, value)
test-# values (102,'thing');
                                      QUERY PLAN
------------------------------------------------------------------------------------
  Result  (cost=0.00..0.01 rows=1 width=0)
          (actual time=0.002..0.002 rows=1 loops=1)
  Total runtime: 0.056 ms
(2 rows)

Time: 0.530 ms



... which is 40 times faster from client query issue to client query
completion despite the negligible server execution speed difference.

Multiple-row INSERTs, use of COPY, etc will all help combat this.

--
Craig Ringer

Re: Populating large tables with occasional bad values

From
"John T. Dow"
Date:
Latency it is.

I just had no idea it would add up so fast. I guess I was thinking that you could pump a lot of data over the Internet
without realizing the overhead when the data is broken down into little chunks.

I'm not sure what the best solution is. I do this rarely, usually when first loading the data from the legacy database.
When ready to go live, my (remote) client will send the data, I'll massage it for loading, then load it to their
(remote) postgres server. This usually takes place over a weekend, but last time was in an evening which lasted until
4AM.

If I did this regularly, three options seem easiest.

1 - Load locally to get clean data and then COPY. This requires the server to have local access to the file to be
copied, and if the server is hosted by an isp, it depends on them whether you can do this easily.

2 - Send the data to the client to run the Java app to insert over their LAN (this only works if the database server is
local to them and not at an ISP).

3 - If the only problem is duplicate keys, load into a special table without the constraint, issue update commands to
rewrite the keys as needed, then select/insert to the correct table.

Thanks

John


Re: Populating large tables with occasional bad values

From
Craig Ringer
Date:
John T. Dow wrote:

> If I did this regularly, three options seem easiest.
>
> 1 - Load locally to get clean data and then COPY. This requires the server to have local access to the file to be
> copied, and if the server is hosted by an isp, it depends on them whether you can do this easily.

You can COPY over a PostgreSQL network connection. See the \copy support
in psql for one example of how it works.

I've never had cause to try to use it with JDBC so I don't know how/if
it works in that context.

However, batched inserts can be quite fast enough. If you do one INSERT
INTO per 1000 rows you'll already be seeing a massive performance boost:

INSERT INTO mytable
VALUES
(blah, blah),
(blah, blah),
(blah, blah),
-- hundreds of rows later
(blah, blah);

... will be a *LOT* faster. If you have to do special processing or
error handling you can do it once you have the data in a server-side
staging table - and you can get it there quickly with multi-row inserts
or (at least using psql) a \copy .

> 3 - If the only problem is duplicate keys, load into a special table without the constraint, issue update commands to
> rewrite the keys as needed, then select/insert to the correct table.

This would be my preferred approach, personally, using either network
COPY or multi-row INSERTs to get the data into the staging table.

You can do a whole lot more than fix unique constraint violations,
though. With a PL/PgSQL function to process the staging table and do the
inserts you can do some pretty major data surgery.

I have some conversion code that takes pretty ugly data in unconstrained
staging tables and reprocesses it to fit a new, much stricter and better
normalized schema. Large amounts of restructuring and fixing up are
required. Some of the simpler conversions are done by INSERT ... SELECT
statements, but the complicated stuff is done with PL/PgSQL functions.

I've sent you the conversion code off-list in case it's informative.
It's pretty ugly code as it's going to be thrown away when we cut over
to the new system, but it should show just how much data conversion &
repair you can do on the database server side.

It's not like you can't put aside rows that need user interaction as you
process the staging table, either. Just INSERT them into a "problem"
table and delete them from the staging table. Then have the client scan
through the (much smaller) problem table and ask the user to make
decisions. When everything looks satisfactory and the user's decisions
have been acted on, COMMIT.

--
Craig Ringer

Re: Populating large tables with occasional bad values

From
Oliver Jowett
Date:
Craig Ringer wrote:
> John T. Dow wrote:
>
>> If I did this regularly, three options seem easiest.
>>
>> 1 - Load locally to get clean data and then COPY. This requires the
>> server to have local access to the file to be copied, and if
>> the server is hosted by an isp, it depends on them whether you can do
>> this easily.
>
> You can COPY over a PostgreSQL network connection. See the \copy support
> in psql for one example of how it works.
>
> I've never had cause to try to use it with JDBC so I don't know how/if
> it works in that context.

It's not supported in the standard JDBC driver unfortunately.

> However, batched inserts can be quite fast enough. If you do one INSERT
> INTO per 1000 rows you'll already be seeing a massive performance boost:
>
> INSERT INTO mytable
> VALUES
> (blah, blah),
> (blah, blah),
> (blah, blah),
> -- hundreds of rows later
> (blah, blah);
>
> ... will be a *LOT* faster. If you have to do special processing or
> error handling you can do it once you have the data in a server-side
> staging table - and you can get it there quickly with multi-row inserts
> or (at least using psql) a \copy .

You can get the same effect via JDBC batched inserts (addBatch() /
executeBatch()) without having to actually do a multi-row INSERT
statement. That's probably the most portable approach if you're using JDBC.
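
In outline it looks something like this - untested, with a made-up table and a commit-every-100-rows policy; error
handling is left out here and discussed further down the thread:

// Untested sketch: standard JDBC batching with addBatch()/executeBatch().
// Table, columns and the commit-every-100 policy are placeholders.
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.List;

public class BatchedInserts {
    static void load(Connection conn, List<String[]> rows) throws SQLException {
        conn.setAutoCommit(false);
        PreparedStatement ps = conn.prepareStatement(
                "INSERT INTO mytable (key, value) VALUES (?, ?)");
        try {
            int pending = 0;
            for (String[] r : rows) {
                ps.setString(1, r[0]);
                ps.setString(2, r[1]);
                ps.addBatch();
                if (++pending == 100) {      // flush and commit every 100 rows
                    ps.executeBatch();
                    conn.commit();
                    pending = 0;
                }
            }
            if (pending > 0) {               // don't forget the last partial batch
                ps.executeBatch();
                conn.commit();
            }
        } finally {
            ps.close();
        }
    }
}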

-O

Re: Populating large tables with occasional bad values

From
tivvpgsqljdbc@gtech-ua.com
Date:
John T. Dow wrote:
> Latency it is.
>
> I just had no idea it would add up so fast. I guess I was thinking that you could pump a lot of data over the
> Internet without realizing the overhead when the data is broken down into little chunks.
>

How about batches? Shouldn't they help you in this case?

Re: Populating large tables with occasional bad values

From
"John T. Dow"
Date:
I have tried using a batch and it looked good at first but I don't think it will work with Postgres as cleanly as you'd
like.

First, set autocommit false.

Second, addBatch many times for the insert statements.

Third, executeBatch.

Fourth, the exception BatchUpdateException is thrown. The exception reports which insert statement had a problem.
getNextException() gives more details (duplicate key). Only one problem is reported. getUpdateCounts() on the exception
reports what we already know, that the statements up until this point all inserted without problem.

Fifth - now what? I tried to commit at this point, hoping that the first insert statements would be committed, but they
are not. Actually, I was hoping that the entire batch would be processed and then I could see which individual
statements didn't insert (using the update counts) and then redo them individually, but I think that is wishful
thinking.

Best possible solution with batches? Submit the batch, learn how many were accepted before the first error, resubmit
just those. Process the problem statement by itself. Repeat with the statements left in the original batch until none
are left, then create a new batch and begin anew. If a batch, or portion thereof, has no exception, then of course
commit.

For this, since portions of batches will be sent twice, the batch shouldn't be too large. Even 10 statements in a batch
would substantially reduce the overhead, certainly no need to do thousands.
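
In rough JDBC terms the resubmit loop I have in mind would look something like this. Untested - the table layout and
the logging helper are placeholders - and it relies on getUpdateCounts() covering exactly the statements that succeeded
before the failure, which is what I observed above:

// Untested sketch: resubmit a failed batch minus the offending row.
// Assumes autocommit is off; table layout and logFailure() are placeholders.
import java.sql.BatchUpdateException;
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;

public class RetryingBatchLoader {
    static void loadBatch(Connection conn, List<String[]> rows)
            throws SQLException {
        while (!rows.isEmpty()) {
            PreparedStatement ps = conn.prepareStatement(
                    "INSERT INTO mytable (key, value) VALUES (?, ?)");
            try {
                for (String[] r : rows) {
                    ps.setString(1, r[0]);
                    ps.setString(2, r[1]);
                    ps.addBatch();
                }
                ps.executeBatch();
                conn.commit();
                return;                               // whole batch accepted
            } catch (BatchUpdateException e) {
                conn.rollback();                      // everything in the batch is lost
                int bad = e.getUpdateCounts().length; // index of the failing statement
                logFailure(rows.get(bad), e.getNextException());
                List<String[]> remaining = new ArrayList<String[]>(rows);
                remaining.remove(bad);                // drop the bad row, retry the rest
                rows = remaining;
            } finally {
                ps.close();
            }
        }
    }

    static void logFailure(String[] row, SQLException cause) {
        System.err.println("rejected " + row[0] + ": " + cause.getMessage());
    }
}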

In the posting I quote below, the implication is that you can learn of multiple statements in the batch that failed, but
I didn't see that happening. It seems that postgres just quits at the first problem.

John




On Thu, 12 Jun 2008 14:35:30 +0300, tivvpgsqljdbc@gtech-ua.com wrote:

>John T. Dow wrote:
>>> How about batches? Shouldn't they help you in this case?
>>>
>>
>>
>> What happens if the 23rd and 59th rows in a batch of 100 have a problem, such as an invalid numeric value or a
>> duplicate key?
>>
>> Can the other 98 rows be committed?
>>
>> I haven't tried this.
>>
>
>In postgresql, no (maybe yes with autocommit turned on - did not try),
>because postgresql marks the transaction as rollback-only on the first problem.
>The one thing you can do is to detect which records are wrong in the
>batch (say, #2 and #55) and redo the batch without the failing records
>(with/without appending new records to the batch). This would make server
>load higher, but would give you only two roundtrips instead of 100
>roundtrips. Actually, if every batch has failing records, your
>server will have at most two times as many transactions.
>
>



Re: Populating large tables with occasional bad values

From
Craig Ringer
Date:
John T. Dow wrote:

> Fifth - now what? I tried to commit at this point, hoping that the
> first insert statements would be committed, but they are not. Actually, I
> was hoping that the entire batch would be processed and then I could see
> which individual statements didn't insert (using the update counts) and
> then redo them individually, but I think that is wishful thinking.

The first error causes a transaction rollback. The only way the JDBC
driver could skip the failed operation and continue would be to wrap
each statement in a savepoint, and if there's an error issue a rollback
to the last savepoint. That's not ideal from a performance point of view.
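
For reference, doing that by hand from the client looks roughly like this (illustrative only; each savepoint costs
extra round trips, which is exactly the performance problem):

// Illustrative only: wrap each INSERT in a savepoint so a failure rolls
// back just that statement rather than the whole transaction.
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.sql.Savepoint;

public class SavepointInsert {
    // ps: "INSERT INTO mytable (key, value) VALUES (?, ?)", prepared by the caller.
    // Returns true if the row went in, false if the INSERT failed and was
    // rolled back to the savepoint (the caller can log it and carry on).
    static boolean tryInsert(Connection conn, PreparedStatement ps,
                             String key, String value) throws SQLException {
        Savepoint sp = conn.setSavepoint();
        try {
            ps.setString(1, key);
            ps.setString(2, value);
            ps.executeUpdate();
            conn.releaseSavepoint(sp);
            return true;
        } catch (SQLException e) {
            conn.rollback(sp);    // undo only this statement
            return false;
        }
    }
}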

> Best possible solution with batches? Submit the batch, learn how many
> were accepted before the first error, resubmit just those. Process the
> problem statement by itself. Repeat with the statements left in the
> original batch until none are left, then create a new batch and begin
> anew. If a batch, or portion thereof, has no exception, then of course
> commit.

IMO the best solution with batches is to use batches to store your
planned operations in a staging table using INSERTs that cannot fail
under normal circumstances. Then issue a single PL/PgSQL call or a bulk
UPDATE ... SELECT with an appropriate WHERE clause to apply the updates
and catch problems.
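
In JDBC terms, very roughly - untested, with made-up table names, and note that duplicates *within* the staged data
would still need DISTINCT ON or key rewriting:

// Untested sketch: batch everything into an unconstrained staging table
// (these INSERTs cannot fail on duplicates), then apply it with a single
// server-side statement.
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.List;

public class StagingLoad {
    static void load(Connection conn, List<String[]> rows) throws SQLException {
        conn.setAutoCommit(false);
        PreparedStatement ps = conn.prepareStatement(
                "INSERT INTO staging (key, value) VALUES (?, ?)");
        try {
            for (String[] r : rows) {
                ps.setString(1, r[0]);
                ps.setString(2, r[1]);
                ps.addBatch();
            }
            ps.executeBatch();
        } finally {
            ps.close();
        }
        Statement st = conn.createStatement();
        try {
            // Move the non-conflicting rows across in one round trip; rows
            // whose key already exists are skipped and can be reported
            // afterwards by comparing staging against mytable.
            st.executeUpdate(
                "INSERT INTO mytable (key, value) " +
                "SELECT s.key, s.value FROM staging s " +
                "WHERE NOT EXISTS (SELECT 1 FROM mytable m WHERE m.key = s.key)");
        } finally {
            st.close();
        }
        conn.commit();
    }
}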

The approach you describe will work, but it'll be a bit ugly and not
very fast. If failures are very rare you might find it to be OK, but
your suggestion of using only 10 statements per batch suggests that
failures aren't *that* rare.

> In the posting I quote below, the implication is that you can learn of
> multiple statements in the batch that failed, but I didn't see that
> happening.

AFAIK that's database specific. The docs I read suggested that DBs are
permitted to stop processing and return info on the first bad row, or to
continue processing and return info on all rows.

As PostgreSQL cannot just keep going after an error within a transaction
unless you use savepoints, it chooses to do the former.

It might be cool if Pg supported an automatic savepoint mode where every
statement updated an implicit savepoint snapshot. If the statement
failed you could ROLLBACK TO LAST STATEMENT. If possible the savepoint
would be replaced with each statement so there'd be no ever-growing set
of savepoints to worry about. With this capability you could use
proper transactional isolation but also achieve some error handling
within the transaction. Personally, though, I either restructure my SQL
to avoid the potential errors or use PL/PgSQL to handle them. The app
has to be capable of reissuing failed transactions anyway, so sometimes
an app-level transaction do-over is quite enough. There might be
situations in which that hypothetical feature could be handy, though.

> It seems that postgres just quits at the first problem.

Yes, by design. In a transaction if a statement fails then without a
savepoint in place there is, AFAIK, no way to just pretend the bad
statement was never issued.

I don't know if it'll keep going if you enable autocommit. It might; it
depends on how the JDBC driver does batches and how the server processes
them. Your performance will suffer with autocommit on - though not as
badly as when you're paying the high round trip latencies - and you
won't be able to roll back if something nasty happens. If you don't care
about things like automatic rollback on network dropout you should
probably test batches with autocommit on and see how they behave.

--
Craig Ringer

Re: Populating large tables with occasional bad values

From
"John T. Dow"
Date:
I have a solution that I am happy with, yielding nearly a 10-fold improvement in speed.

I tried a quick and dirty experiment with different batch sizes with a small table of 750 rows.

Without using batches, the insert took 64 seconds. (Auto commit true for this and the other tests.)

Batch size 10, 13 seconds.

Batch size 100, 5 seconds.

This code was very simple -- if any error in the batch, discard the entire batch. This gives an idea of what
performance would be like.

Of course, with a batch size of 10 and two errors, the number of rows loaded is short by 17 because two entire batches
were discarded. With a batch size of 100, it's short by 150. (I also neglected to send the last, partial batch, this
being just a quick and dirty experiment.)

Anyway, a batch size of 10 yields a performance improvement of about 5x. With a batch size of 100 it's about 10x.

Craig wrote:

>IMO the best solution with batches is to use batches to store your
>planned operations in a staging table using INSERTs that cannot fail
>under normal circumstances. Then issue a single PL/PgSQL call or a bulk
>UPDATE ... SELECT with an appropriate WHERE clause to apply the updates
>and catch problems.
>
>The approach you describe will work, but it'll be a bit ugly and not
>very fast. If failures are very rare you might find it to be OK, but
>your suggestion of using only 10 statements per batch suggests that
>failures aren't *that* rare.

I need reasonable performance, not the best that money can buy. My clients have tables with tens of thousands of rows,
not hundreds of thousands or millions. Also, I want a general purpose solution that will work with no special knowledge
of the table to be loaded -- just the info obtainable from the meta data (such as column names and types). A staging
table such as you suggest could be created automatically (copy the columns but change all to char varying so they
load). But then locating rows with bad values would be a messy bit of sql, although it could be generated by my code.
Still, I'd have to bring over the rows (casting char varying to integer or whatever) that are good and then remove
them, leaving the bad rows behind.

I'll clean up my dirty code, eg make sure it sends the last partial batch. I'll also have it iterate to insert the rows
in any batch that's rejected. The simplest thing to do is simply resend the rows in the batch, one by one. A more
elaborate thing to do would be to resend a partial batch, up to the point of the problem, skip the known problem row,
then send another partial batch with the remaining rows. Keep doing that until all in the original batch have been
successfully inserted or else written to the error log.

John