Thread: COPY and indices?


From
François Beausoleil
Date:
Hi all,

When using COPY FROM STDIN to stream thousands of rows (20k and more hourly), what happens with indices? Are they updated only once after the operation, or are they updated once per row? Note that I'm not replacing the table's data: I'm appending to what's already there. I suspect batching writes will be faster than writing each individual row using an INSERT statement.
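For context, the batched form means accumulating rows and shipping them as one COPY payload. A minimal sketch of serializing rows into COPY text format in Python (the escaping follows the documented COPY text format; the table name and driver call in the trailing comment are hypothetical):

```python
def to_copy_text(rows):
    """Serialize rows into PostgreSQL COPY text format: tab-delimited
    columns, \\N for NULL, with backslash, tab, newline and carriage
    return escaped as the COPY text format requires."""
    out = []
    for row in rows:
        cols = []
        for v in row:
            if v is None:
                cols.append(r"\N")
            else:
                s = str(v)
                s = (s.replace("\\", "\\\\")
                      .replace("\t", "\\t")
                      .replace("\n", "\\n")
                      .replace("\r", "\\r"))
                cols.append(s)
        out.append("\t".join(cols))
    return "\n".join(out) + "\n"

buf = to_copy_text([(1, "hello", None), (2, "tab\there", "ok")])
# buf can now be fed to COPY FROM STDIN through a driver, e.g. with
# psycopg2: cur.copy_from(io.StringIO(buf), "events")  # table name hypothetical
```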

Currently, I can sustain 30-40 writes per second on a Rackspace VPS. I know it's not the ideal solution, but that's what I'm working with. Following vmstat, the server is spending 30 to 40% of its time in iowait. I don't have measurements as to what files are touched, and I'd welcome suggestions to measure the time PostgreSQL actually spends writing indices vs. data.
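For reference, the statistics views expose at least block-level counters per table and its indexes. They are cumulative read counters, not write timings, so they give only an indirect signal (a sketch):

```sql
-- Heap vs. index blocks read per user table (cumulative since stats reset);
-- sample before and after a load and diff the numbers.
SELECT relname, heap_blks_read, idx_blks_read
FROM pg_statio_user_tables
ORDER BY idx_blks_read DESC;
```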

Thanks!
François



Re: COPY and indices?

From
Ondrej Ivanič
Date:
Hi,

On 13 March 2012 15:11, François Beausoleil <francois@teksol.info> wrote:
> When using COPY FROM STDIN to stream thousands of rows (20k and more hourly), what happens with indices? Are they
> updated only once after the operation, or are they updated once per row? Note that I'm not replacing the table's data:
> I'm appending to what's already there. I suspect batching writes will be faster than writing each individual row using
> an INSERT statement.

Yes, it will be faster to use COPY than plain INSERTs. We have a similar
situation -- up to 10k rows every 3 minutes and around 15 indexes on
the table. The table is partitioned and we do not update data. The check
bloat query reports some bloat, but it's growing very slowly and there is
a new partition every month.

>
> Currently, I can sustain 30-40 writes per second on a Rackspace VPS. I know it's not the ideal solution, but that's
> what I'm working with. Following vmstat, the server is spending 30 to 40% of its time in iowait. I don't have
> measurements as to what files are touched, and I'd welcome suggestions to measure the time PostgreSQL actually spends
> writing indices vs. data.

Drop all indexes, measure the time to insert, and collect iostat output.
Then create the indexes, repeat the process, and compare the results.
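A sketch of that comparison from the shell, assuming a psql-reachable database, a table `events`, and an index `events_ts_idx` (all illustrative names):

```shell
# Pass 1: load without the index, sampling disk utilization in the background
psql -c "DROP INDEX IF EXISTS events_ts_idx;"
iostat -x 5 > iostat_noidx.log &
time psql -c "\copy events FROM 'batch.tsv'"
kill %1

# Pass 2: recreate the index and load the same batch again
psql -c "CREATE INDEX events_ts_idx ON events (ts);"
iostat -x 5 > iostat_idx.log &
time psql -c "\copy events FROM 'batch.tsv'"
kill %1
```

Comparing the two `time` results and the iostat logs shows how much of the write load is index maintenance.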

--
Ondrej Ivanic
(ondrej.ivanic@gmail.com)

Re: COPY and indices?

From
Scott Marlowe
Date:
2012/3/12 François Beausoleil <francois@teksol.info>:
> Hi all,
>
> When using COPY FROM STDIN to stream thousands of rows (20k and more hourly), what happens with indices? Are they
> updated only once after the operation, or are they updated once per row? Note that I'm not replacing the table's data:
> I'm appending to what's already there. I suspect batching writes will be faster than writing each individual row using
> an INSERT statement.

COPY is faster, but if a single row fails (duplicate key, etc.), the whole COPY fails.

Re: COPY and indices?

From
hamann.w@t-online.de
Date:
Scott Marlowe wrote:
2012/3/12 François Beausoleil <francois@teksol.info>:
> Hi all,
>
> When using COPY FROM STDIN to stream thousands of rows (20k and more hourly), what happens with indices? Are they
> updated only once after the operation, or are they updated once per row? Note that I'm not replacing the table's data:
> I'm appending to what's already there. I suspect batching writes will be faster than writing each individual row using
> an INSERT statement.

COPY is faster, but if a single row fails (duplicate key, etc.), the whole COPY fails.

Hi Scott,

my impression: it would be even faster to drop the indices, do the bulk copy,
and rebuild the indices after the fact.

Regards
Wolfgang Hamann

Re: COPY and indices?

From
Merlin Moncure
Date:
2012/3/12 François Beausoleil <francois@teksol.info>:
> Hi all,
>
> When using COPY FROM STDIN to stream thousands of rows (20k and more hourly), what happens with indices? Are they
> updated only once after the operation, or are they updated once per row? Note that I'm not replacing the table's data:
> I'm appending to what's already there. I suspect batching writes will be faster than writing each individual row using
> an INSERT statement.
>
> Currently, I can sustain 30-40 writes per second on a Rackspace VPS. I know it's not the ideal solution, but that's
> what I'm working with. Following vmstat, the server is spending 30 to 40% of its time in iowait. I don't have
> measurements as to what files are touched, and I'd welcome suggestions to measure the time PostgreSQL actually spends
> writing indices vs. data.

You're almost certainly blocking on fsync.  A real quick'n'dirty way
to confirm this (although it won't be as fast as COPY) would be to wrap
your inserts in a transaction.  VMs tend to have really horrible
storage latency, which can hurt Postgres performance.  Another option
would be to relax your commit policy (for example by flipping
synchronous_commit) if that fits within your safety requirements.
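Concretely, the two knobs look like this (a sketch; the table is illustrative, and SET LOCAL is one way to scope the setting to a single transaction):

```sql
BEGIN;
SET LOCAL synchronous_commit = off;  -- on a crash you may lose the last few commits, but no corruption
INSERT INTO events VALUES (1, 'a');  -- many rows share one transaction,
INSERT INTO events VALUES (2, 'b');  -- so storage sees one flush at COMMIT
COMMIT;
```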

merlin

Re: COPY and indices?

From
François Beausoleil
Date:

Le mardi 13 mars 2012 à 10:48, Merlin Moncure a écrit :

> 2012/3/12 François Beausoleil <francois@teksol.info (mailto:francois@teksol.info)>:
> > Currently, I can sustain 30-40 writes per second on a Rackspace VPS. I know it's not the ideal solution, but that's
> > what I'm working with. Following vmstat, the server is spending 30 to 40% of its time in iowait. I don't have
> > measurements as to what files are touched, and I'd welcome suggestions to measure the time PostgreSQL actually spends
> > writing indices vs. data.
> You're almost certainly blocking on fsync. A real quick'n'dirty way
> to confirm this (although it won't be as fast as COPY) would be to wrap
> your inserts in a transaction. VMs tend to have really horrible
> storage latency, which can hurt Postgres performance. Another option
> would be to relax your commit policy (for example by flipping
> synchronous_commit) if that fits within your safety requirements.
>


I already applied the tricks you mention: I have a transaction, and synchronous_commit is off. I also have
checkpoint_segments set to 96, and checkpoint_timeout to 10 minutes.

I'll go with the COPY, since I can live with the batched requirements just fine.

Bye!
François


Re: COPY and indices?

From
Scott Marlowe
Date:
On Tue, Mar 13, 2012 at 12:51 AM,  <hamann.w@t-online.de> wrote:
>
> Scott Marlowe wrote:
> 2012/3/12 François Beausoleil <francois@teksol.info>:
>> Hi all,
>>
>> When using COPY FROM STDIN to stream thousands of rows (20k and more hourly), what happens with indices? Are they
>> updated only once after the operation, or are they updated once per row? Note that I'm not replacing the table's data:
>> I'm appending to what's already there. I suspect batching writes will be faster than writing each individual row using
>> an INSERT statement.
>
> COPY is faster, but if a single row fails (duplicate key, etc.), the whole COPY fails.
>
> Hi Scott,
>
> my impression: it would be even faster to drop the indices, do the bulk copy,
> and rebuild the indices after the fact.

That depends.  If you've got 100M rows already in place and you're
adding 1000 rows, it's likely cheaper to leave the indexes in place.
OTOH, if you've got 1M rows already in place and are adding 100M, it's
almost certainly cheaper to drop the indexes and recreate them.  So
dropping and recreating the indexes isn't necessarily faster; it
depends on how big the table already is.
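When the batch is large relative to the table, the drop-and-rebuild pattern looks like this (a sketch; names and the server-side file path are illustrative, and note that DROP INDEX holds an exclusive lock on the table until COMMIT):

```sql
BEGIN;
DROP INDEX events_ts_idx;                   -- no per-row index maintenance during the load
COPY events FROM '/tmp/batch.tsv';          -- bulk load the new rows
CREATE INDEX events_ts_idx ON events (ts);  -- one bulk index build afterwards
COMMIT;
```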

Re: COPY and indices?

From
Merlin Moncure
Date:
2012/3/13 François Beausoleil <francois@teksol.info>:
>
>
> Le mardi 13 mars 2012 à 10:48, Merlin Moncure a écrit :
>
>> 2012/3/12 François Beausoleil <francois@teksol.info (mailto:francois@teksol.info)>:
>> > Currently, I can sustain 30-40 writes per second on a Rackspace VPS. I know it's not the ideal solution, but
>> > that's what I'm working with. Following vmstat, the server is spending 30 to 40% of its time in iowait. I don't have
>> > measurements as to what files are touched, and I'd welcome suggestions to measure the time PostgreSQL actually spends
>> > writing indices vs. data.
>> You're almost certainly blocking on fsync. A real quick'n'dirty way
>> to confirm this (although it won't be as fast as COPY) would be to wrap
>> your inserts in a transaction. VMs tend to have really horrible
>> storage latency, which can hurt Postgres performance. Another option
>> would be to relax your commit policy (for example by flipping
>> synchronous_commit) if that fits within your safety requirements.
>>
>
>
> I already applied the tricks you mention: I have a transaction, and synchronous_commit is off. I also have
> checkpoint_segments set to 96, and checkpoint_timeout to 10 minutes.
>
> I'll go with the COPY, since I can live with the batched requirements just fine.

30-40 'in transaction' I/O-bound inserts per second is so slow as to not
really be believable unless each record is around 1 megabyte, because
being in a transaction removes storage latency from the equation.  Even
on a crappy VM.  As a point of comparison, my SATA workstation drive can
do tens of thousands.  How many records are you inserting per
transaction?

merlin

Re: COPY and indices?

From
François Beausoleil
Date:

Le mardi 13 mars 2012 à 11:15, Merlin Moncure a écrit :

> 2012/3/13 François Beausoleil <francois@teksol.info (mailto:francois@teksol.info)>:
> >
> > I'll go with the COPY, since I can live with the batched requirements just fine.
>
> 30-40 'in transaction' I/O-bound inserts per second is so slow as to not
> really be believable unless each record is around 1 megabyte, because
> being in a transaction removes storage latency from the equation. Even
> on a crappy VM. As a point of comparison, my SATA workstation drive can
> do tens of thousands. How many records are you inserting per transaction?
> transaction?
>


I took the time to gather statistics about the database server: https://gist.github.com/07bbf8a5b05b1c37a7f2

The files are a series of roughly 30 second samples, taken while the system is under production usage. When I quoted 30-40
transactions per second, I was actually referring to the number of messages processed from my message queue. Going by
the PostgreSQL numbers, xact_commit tells me I manage 288 commits per second. It's much better than I anticipated.

Anyways, if anybody has comments on how I could increase throughput, I'd appreciate it. My message queues are almost
always backed up by 1M messages, and it's at least partially related to PostgreSQL: if the DB can write faster, I can
manage my backlog better.

I'm still planning on going with batch processing, but I need to do something ASAP to give me just a bit more
throughput.

Thanks!
François


Re: COPY and indices?

From
Merlin Moncure
Date:
2012/3/14 François Beausoleil <francois@teksol.info>:
> Le mardi 13 mars 2012 à 11:15, Merlin Moncure a écrit :
>
>> 2012/3/13 François Beausoleil <francois@teksol.info (mailto:francois@teksol.info)>:
>> >
>> > I'll go with the COPY, since I can live with the batched requirements just fine.
>>
>> 30-40 'in transaction' i/o bound inserts is so slow as to not really
>> be believable unless each record is around 1 megabyte because being in
>> transaction removes storage latency from the equation. Even on a
>> crappy VM. As a point of comparison my sata workstation drive can do
>> in the 10s of thousands. How many records are you inserting per
>> transaction?
>>
>
>
> I took the time to gather statistics about the database server: https://gist.github.com/07bbf8a5b05b1c37a7f2
>
> The files are a series of roughly 30 second samples, taken while the system is under production usage. When I quoted 30-40
> transactions per second, I was actually referring to the number of messages processed from my message queue. Going by
> the PostgreSQL numbers, xact_commit tells me I manage 288 commits per second. It's much better than I anticipated.
>
> Anyways, if anybody has comments on how I could increase throughput, I'd appreciate it. My message queues are almost
> always backed up by 1M messages, and it's at least partially related to PostgreSQL: if the DB can write faster, I can
> manage my backlog better.
>
> I'm still planning on going with batch processing, but I need to do something ASAP to give me just a bit more
> throughput.

Well, your iowait numbers are through the roof, which makes things
pretty simple from a diagnosis point of view: your storage is
overloaded.  The only remedies are to try to make your queries more
efficient so that you are doing less writing, make better use of
transactions, etc.  But looking at the log it appears the low-hanging
fruit is already grabbed (synchronous_commit=off, etc.), so you have
to choose from a list of not very pleasant options:

*) fsync=off
*) tune the application
*) bring more/faster storage online.  A single SSD would probably make
your problem disappear.  In the VM world, hopefully you can at least
bring another volume online and move your WAL to that.
*) HARDWARE.
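Moving the WAL to a second volume can be sketched as follows (paths are illustrative; on the 9.x servers of this era the WAL directory is `pg_xlog` inside the data directory):

```shell
pg_ctl -D "$PGDATA" stop
mv "$PGDATA/pg_xlog" /mnt/wal/pg_xlog     # relocate WAL onto the new volume
ln -s /mnt/wal/pg_xlog "$PGDATA/pg_xlog"  # leave a symlink in the data dir
pg_ctl -D "$PGDATA" start
```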

In the entirety of my career, I have never found anything more
perplexing than the general reluctance to upgrade hardware to solve
hardware-related performance bottlenecks.  Virtualization is great
technology but is nowhere near good enough, in my experience, to handle
high transaction rate database servers.  A $5000 server will solve your
issue, and you'll spend that much in two days scratching your head
trying to figure out the issue (irritating your customers all the while).

merlin

Re: COPY and indices?

From
François Beausoleil
Date:

Le mercredi 14 mars 2012 à 15:06, Merlin Moncure a écrit :

> 2012/3/14 François Beausoleil <francois@teksol.info (mailto:francois@teksol.info)>:
>
> In the entirety of my career, I have never found anything more
> perplexing than the general reluctance to upgrade hardware to solve
> hardware-related performance bottlenecks. Virtualization is great
> technology but is nowhere near good enough, in my experience, to handle
> high transaction rate database servers. A $5000 server will solve your
> issue, and you'll spend that much in two days scratching your head
> trying to figure out the issue (irritating your customers all the while).
>


Thank you for your analysis, Merlin. I already suspected as much.

Have a great day!
François