Thread: COPY and indices?
Hi all,

When using COPY FROM STDIN to stream thousands of rows (20k and more hourly), what happens with indices? Are they updated only once after the operation, or are they updated once per row? Note that I'm not replacing the table's data: I'm appending to what's already there. I suspect batching writes will be faster than writing each individual row using an INSERT statement.

Currently, I can sustain 30-40 writes per second on a Rackspace VPS. I know it's not the ideal solution, but that's what I'm working with. Following vmstat, the server is spending 30 to 40% of its time in iowait. I don't have measurements as to what files are touched, and I'd welcome suggestions to measure the time PostgreSQL actually spends writing indices vs data.

Thanks!
François
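For illustration, a minimal sketch of the batch-load pattern in question, assuming a hypothetical "events" table and a client-side CSV file (the names are made up for the example). From psql, \copy streams the local file through COPY ... FROM STDIN, so the whole hourly batch becomes a single statement:

    -- Hypothetical table receiving the hourly batches.
    CREATE TABLE events (
        user_id    bigint      NOT NULL,
        event_type text        NOT NULL,
        created_at timestamptz NOT NULL
    );
    CREATE INDEX events_created_at_idx ON events (created_at);

    -- One hourly batch as a single COPY; index entries are still added
    -- per inserted row, but the statement overhead is paid only once.
    \copy events (user_id, event_type, created_at) from 'batch.csv' with csv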
Hi,

On 13 March 2012 15:11, François Beausoleil <francois@teksol.info> wrote:
> When using COPY FROM STDIN to stream thousands of rows (20k and more hourly), what happens with indices? Are they updated only once after the operation, or are they updated once per row? Note that I'm not replacing the table's data: I'm appending to what's already there. I suspect batching writes will be faster than writing each individual row using an INSERT statement.

Yes, it will be faster to use COPY than plain INSERTs. We have a similar situation -- up to 10k rows every 3 minutes and around 15 indexes on the table. The table is partitioned and we do not update data. The check-bloat query reports some bloat, but it grows very slowly and there is a new partition every month.

> Currently, I can sustain 30-40 writes per second on a Rackspace VPS. I know it's not the ideal solution, but that's what I'm working with. Following vmstat, the server is spending 30 to 40% of its time in iowait. I don't have measurements as to what files are touched, and I'd welcome suggestions to measure the time PostgreSQL actually spends writing indices vs data.

Drop all indexes, measure the time to insert and collect iostat output. Then create the indexes, repeat the process and compare the results.

--
Ondrej Ivanic
(ondrej.ivanic@gmail.com)
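A rough sketch of the comparison Ondrej describes, using psql's \timing and the same hypothetical table as above; the batch files and index name are assumptions for the example, and iostat (or vmstat) would be collected in another terminal during each pass:

    \timing on

    -- Pass 1: load a batch with the index in place.
    \copy events (user_id, event_type, created_at) from 'batch1.csv' with csv

    -- Pass 2: drop the index, load a comparable batch, then rebuild.
    DROP INDEX events_created_at_idx;
    \copy events (user_id, event_type, created_at) from 'batch2.csv' with csv
    CREATE INDEX events_created_at_idx ON events (created_at);

    -- The gap between the two COPY timings (plus the CREATE INDEX time)
    -- approximates how much of the load goes into index maintenance
    -- rather than heap writes.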
2012/3/12 François Beausoleil <francois@teksol.info>:
> Hi all,
>
> When using COPY FROM STDIN to stream thousands of rows (20k and more hourly), what happens with indices? Are they updated only once after the operation, or are they updated once per row? Note that I'm not replacing the table's data: I'm appending to what's already there. I suspect batching writes will be faster than writing each individual row using an INSERT statement.

Copy is faster, but if a single row fails (dup key etc.) the whole copy fails.
Scott Marlowe wrote:
> 2012/3/12 François Beausoleil <francois@teksol.info>:
>> Hi all,
>>
>> When using COPY FROM STDIN to stream thousands of rows (20k and more hourly), what happens with indices? Are they updated only once after the operation, or are they updated once per row? Note that I'm not replacing the table's data: I'm appending to what's already there. I suspect batching writes will be faster than writing each individual row using an INSERT statement.
>
> Copy is faster, but if a single row fails (dup key etc.) the whole copy fails.

Hi Scott,

my impression: it would be even faster to drop the indices, do the bulk copy, and rebuild the indices after the fact.

Regards
Wolfgang Hamann
2012/3/12 François Beausoleil <francois@teksol.info>:
> Hi all,
>
> When using COPY FROM STDIN to stream thousands of rows (20k and more hourly), what happens with indices? Are they updated only once after the operation, or are they updated once per row? Note that I'm not replacing the table's data: I'm appending to what's already there. I suspect batching writes will be faster than writing each individual row using an INSERT statement.
>
> Currently, I can sustain 30-40 writes per second on a Rackspace VPS. I know it's not the ideal solution, but that's what I'm working with. Following vmstat, the server is spending 30 to 40% of its time in iowait. I don't have measurements as to what files are touched, and I'd welcome suggestions to measure the time PostgreSQL actually spends writing indices vs data.

You're almost certainly blocking on fsync. A real quick'n'dirty way to confirm this (although it won't be as fast as COPY) would be to wrap your inserts in a transaction. VMs tend to have really horrible storage latency, which can hurt Postgres performance. Another option would be to relax your commit policy (for example by flipping synchronous_commit) if that fits within your safety requirements.

merlin
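A minimal sketch of the quick check Merlin suggests, with illustrative values; synchronous_commit is a per-session and per-transaction setting, so it can be relaxed just for the batch rather than cluster-wide:

    BEGIN;
    SET LOCAL synchronous_commit = off;   -- applies to this transaction only
    INSERT INTO events (user_id, event_type, created_at) VALUES (1, 'follow', now());
    INSERT INTO events (user_id, event_type, created_at) VALUES (2, 'follow', now());
    -- ... the rest of the batch ...
    COMMIT;   -- the commit no longer waits for the WAL flush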
On Tuesday, 13 March 2012 at 10:48, Merlin Moncure wrote:
> 2012/3/12 François Beausoleil <francois@teksol.info>:
>> Currently, I can sustain 30-40 writes per second on a Rackspace VPS. I know it's not the ideal solution, but that's what I'm working with. Following vmstat, the server is spending 30 to 40% of its time in iowait. I don't have measurements as to what files are touched, and I'd welcome suggestions to measure the time PostgreSQL actually spends writing indices vs data.
>
> You're almost certainly blocking on fsync. A real quick'n'dirty way
> to confirm this (although it won't be as fast as COPY) would be to wrap
> your inserts in a transaction. VMs tend to have really horrible
> storage latency, which can hurt Postgres performance. Another option
> would be to relax your commit policy (for example by flipping
> synchronous_commit) if that fits within your safety requirements.

I already applied the tricks you have here: I have a transaction, and synchronous_commit is off. I also have checkpoint_segments set to 96, and checkpoint_timeout to 10 minutes.

I'll go with the COPY, since I can live with the batched requirements just fine.

Bye!
François
On Tue, Mar 13, 2012 at 12:51 AM, <hamann.w@t-online.de> wrote:
> Scott Marlowe wrote:
>> 2012/3/12 François Beausoleil <francois@teksol.info>:
>>> Hi all,
>>>
>>> When using COPY FROM STDIN to stream thousands of rows (20k and more hourly), what happens with indices? Are they updated only once after the operation, or are they updated once per row? Note that I'm not replacing the table's data: I'm appending to what's already there. I suspect batching writes will be faster than writing each individual row using an INSERT statement.
>>
>> Copy is faster, but if a single row fails (dup key etc.) the whole copy fails.
>
> Hi Scott,
>
> my impression: it would be even faster to drop the indices, do the bulk copy,
> and rebuild the indices after the fact.

That depends. If you've got 100M rows already in place and you're adding 1000 rows, it's likely cheaper to leave the indexes in place. OTOH, if you've got 1M rows already in place and are adding 100M, it's almost certainly cheaper to drop the indexes and recreate them. So dropping and recreating the indexes isn't necessarily faster; it depends on how big the table already is.
2012/3/13 François Beausoleil <francois@teksol.info>:
> On Tuesday, 13 March 2012 at 10:48, Merlin Moncure wrote:
>> 2012/3/12 François Beausoleil <francois@teksol.info>:
>>> Currently, I can sustain 30-40 writes per second on a Rackspace VPS. I know it's not the ideal solution, but that's what I'm working with. Following vmstat, the server is spending 30 to 40% of its time in iowait. I don't have measurements as to what files are touched, and I'd welcome suggestions to measure the time PostgreSQL actually spends writing indices vs data.
>>
>> You're almost certainly blocking on fsync. A real quick'n'dirty way
>> to confirm this (although it won't be as fast as COPY) would be to wrap
>> your inserts in a transaction. VMs tend to have really horrible
>> storage latency, which can hurt Postgres performance. Another option
>> would be to relax your commit policy (for example by flipping
>> synchronous_commit) if that fits within your safety requirements.
>
> I already applied the tricks you have here: I have a transaction, and synchronous_commit is off. I also have checkpoint_segments set to 96, and checkpoint_timeout to 10 minutes.
>
> I'll go with the COPY, since I can live with the batched requirements just fine.

30-40 'in transaction' i/o-bound inserts is so slow as to not really be believable unless each record is around 1 megabyte, because being in a transaction removes storage latency from the equation. Even on a crappy VM. As a point of comparison, my SATA workstation drive can do in the tens of thousands. How many records are you inserting per transaction?

merlin
On Tuesday, 13 March 2012 at 11:15, Merlin Moncure wrote:
> 2012/3/13 François Beausoleil <francois@teksol.info>:
>> I'll go with the COPY, since I can live with the batched requirements just fine.
>
> 30-40 'in transaction' i/o-bound inserts is so slow as to not really
> be believable unless each record is around 1 megabyte, because being in
> a transaction removes storage latency from the equation. Even on a
> crappy VM. As a point of comparison, my SATA workstation drive can do
> in the tens of thousands. How many records are you inserting per
> transaction?

I took the time to gather statistics about the database server: https://gist.github.com/07bbf8a5b05b1c37a7f2

The files are a series of roughly 30-second samples, taken while the system is under production usage. When I quoted 30-40 transactions per second, I was actually referring to the number of messages processed from my message queue. Going by the PostgreSQL numbers, xact_commit tells me I manage 288 commits per second. It's much better than I anticipated.

Anyway, if anybody has comments on how I could increase throughput, I'd appreciate it. My message queues are almost always backed up by 1M messages, and it's at least partially related to PostgreSQL: if the DB can write faster, I can manage my backlog better.

I'm still planning on going with batch processing, but I need to do something ASAP to give me just a bit more throughput.

Thanks!
François
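For reference, a sketch of how a commits-per-second figure like that can be read out of pg_stat_database by sampling xact_commit twice; the roughly 30-second gap mirrors the samples in the gist:

    SELECT now() AS sampled_at, xact_commit
      FROM pg_stat_database
     WHERE datname = current_database();

    -- Wait about 30 seconds, run the query again, then:
    --   commits/sec = (xact_commit_2 - xact_commit_1)
    --                 / extract(epoch from sampled_at_2 - sampled_at_1)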
2012/3/14 François Beausoleil <francois@teksol.info>:
> On Tuesday, 13 March 2012 at 11:15, Merlin Moncure wrote:
>> 30-40 'in transaction' i/o-bound inserts is so slow as to not really
>> be believable unless each record is around 1 megabyte, because being in
>> a transaction removes storage latency from the equation. Even on a
>> crappy VM. As a point of comparison, my SATA workstation drive can do
>> in the tens of thousands. How many records are you inserting per
>> transaction?
>
> I took the time to gather statistics about the database server: https://gist.github.com/07bbf8a5b05b1c37a7f2
>
> The files are a series of roughly 30-second samples, taken while the system is under production usage. When I quoted 30-40 transactions per second, I was actually referring to the number of messages processed from my message queue. Going by the PostgreSQL numbers, xact_commit tells me I manage 288 commits per second. It's much better than I anticipated.
>
> Anyway, if anybody has comments on how I could increase throughput, I'd appreciate it. My message queues are almost always backed up by 1M messages, and it's at least partially related to PostgreSQL: if the DB can write faster, I can manage my backlog better.
>
> I'm still planning on going with batch processing, but I need to do something ASAP to give me just a bit more throughput.

Well, your iowait numbers are through the roof, which makes things pretty simple from a diagnosis point of view: your storage is overloaded. The only remedies are to try and make your queries more efficient so that you are doing less writing, better use of transactions, etc., but looking at the log it appears the low-hanging fruit is already grabbed (synchronous_commit=off, etc.). So you have to choose from a list of not very pleasant options:

*) fsync=off
*) tune the application
*) bring more/faster storage online. A single SSD would probably make your problem disappear. In the VM world, hopefully you can at least bring another volume online and move your WAL to that.
*) HARDWARE.

In the entirety of my career, I have never found anything more perplexing than the general reluctance to upgrade hardware to solve hardware-related performance bottlenecks. Virtualization is great technology, but in my experience it is nowhere near good enough to handle high-transaction-rate database servers. A $5000 server will solve your issue, and you'll spend that in two days scratching your head trying to figure out the issue (irritating your customers all the while).

merlin
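For what it's worth, a postgresql.conf sketch of the knobs touched on in this thread; the values are assumptions for illustration, not settings taken from François's server, and fsync = off trades crash safety for speed (an OS or VM crash can leave the cluster corrupt):

    # Already in place on this server, per the thread:
    synchronous_commit = off      # a crash loses at most the last few commits
    checkpoint_segments = 96
    checkpoint_timeout = 10min

    # The riskier option from the list above -- only if the data can be rebuilt:
    #fsync = off

    # If another volume can be attached, the WAL directory (pg_xlog) can be
    # moved there and symlinked back, separating WAL flushes from data-file I/O.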
On Wednesday, 14 March 2012 at 15:06, Merlin Moncure wrote:
> 2012/3/14 François Beausoleil <francois@teksol.info>:
>
> In the entirety of my career, I have never found anything more
> perplexing than the general reluctance to upgrade hardware to solve
> hardware-related performance bottlenecks. Virtualization is great
> technology, but in my experience it is nowhere near good enough to handle
> high-transaction-rate database servers. A $5000 server will solve your
> issue, and you'll spend that in two days scratching your head trying
> to figure out the issue (irritating your customers all the while).

Thank you for your analysis, Merlin. I already suspected as much.

Have a great day!
François