Thread: Guidance Requested - Bulk Inserting + Queries

From
Benjamin Johnson
Date:
Experts,

Quick Summary: data can now be inserted very quickly via COPY plus
removing indexes, but is there a design or some trick that would still
allow someone to query while the partition is still active and 'hot'?

- Postgres 9.1
- Windows 7 (64-bit), although this is just for the current test and
could vary depending on the situation
- We have 4 main tables with daily partitions
- Each table/partition has multiple indexes on it
- Streaming logs from client machines into our server app, which
processes the logs and tries to shove all that data into these daily
partitions as fast as it can
- Using COPY and removing the original primary-key unique constraints
to try to get it to be as fast as possible (some duplicates are possible)
- Will remove duplicates in a later step (disregard for this post)
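For context, the load path is roughly the following sketch (the table,
column, and file names here are invented purely for illustration):

```sql
-- Parent table; children carry the actual data.
CREATE TABLE events (
    event_time timestamptz NOT NULL,
    event_hash bigint,
    payload    text
);

-- Daily partition with NO indexes or unique constraints while loading.
CREATE TABLE events_20111130 (
    CHECK (event_time >= DATE '2011-11-30'
       AND event_time <  DATE '2011-12-01')
) INHERITS (events);

-- Bulk load with COPY; duplicates are allowed and removed later.
COPY events_20111130 (event_time, event_hash, payload)
    FROM '/path/to/batch.csv' WITH (FORMAT csv);

-- Indexes are built once, at end of day, after loading finishes.
CREATE INDEX ON events_20111130 (event_hash);
```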

We now found (thanks Andres and Snow-Man in #postgresql) that in our
tests, after the indexes get too large, performance drops significantly
and our system limps forward due to disk reads (presumably for the
indexes).  If we remove the indexes, performance for our entire sample
test is great and everything is written to postgresql very quickly.
This allows us to shove lots and lots of data in (for production,
possibly 100 GB or even a TB per day!).

My question is: what possible routes can I take to have both fast
inserts (with indexes removed until the end of the day) and still allow
a user to query against today's data?  Is this even possible?  One idea
would be to have hourly tables for today and re-add indexes as soon as
we can.  Another possible solution might be to stream the data to
another "reader" postgres instance that has indexes, although I'm not
very versed in replication.


Any ideas would be greatly appreciated.

Thanks!

Ben


--
Benjamin Johnson
http://getcarbonblack.com/ | @getcarbonblack
cell: 312.933.3612 | @chicagoben


Re: Guidance Requested - Bulk Inserting + Queries

From
Leonardo Francalanci
Date:
> We now found (thanks Andres and Snow-Man in #postgresql) that in our
> tests, after the indexes get too large performance drops significantly
> and our system limps forward due to  disk reads (presumably for the
> indexes).  If we remove the indexes, performance for our entire sample
> test is great and everything is written to postgresql very quickly. 
It's usually the fact that the data you index is "random" as opposed to,
say, an always incremented value (could be a timestamp, or a sequence)
that leads to insert problems with btrees. 
> My question is, what possible routes can I take where we can have both
> fast inserts (with indexes removed until the end of the day), but still
> allow a user to query against today's data? Is this even possible?  One
> idea would be possibly have hourly tables for today and as soon as we
> can try to re-add indexes. 
Yep, that's the only way I've found: use smaller partitions. That leads
to slower reads (due to the fact that you have to visit more indexes to
read the same amount of data). But you'll get faster writes. 

> Another possible solution might be to stream
> the data to another "reader" postgres instance that has indexes,
> although I'm not very versed in replication. 
I don't think you can do that. 
Another option that you have is to use SSDs instead of HDs for the
indexes only (that is, having the indexes in a separate tablespace on
SSDs).  The problem is that your disks usually can't keep up with the
number of random writes it takes to update N "random value" btrees; SSDs
might help. 
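Something along these lines (the tablespace name and path are just
examples):

```sql
-- Hypothetical mount point: the tablespace lives on the SSD volume.
CREATE TABLESPACE fast_ssd LOCATION '/mnt/ssd/pgdata';

-- Keep the heap on spinning disk; move only the index to SSD.
CREATE INDEX events_hash_idx ON events_20111130 (event_hash)
    TABLESPACE fast_ssd;
```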
Can you post some numbers, such as # of indexes, # of rows you're trying
to insert per hour, etc.? 



Re: Guidance Requested - Bulk Inserting + Queries

From
Benjamin Johnson
Date:
We're trying to split the current day into hourly tables so that the
size of the indexes that are popular is much lower, and therefore we can
support more rows across the day.  We are also using numerics where we
could be using bigints, so we're going to work on that as well to see
how much smaller we can get the data.  Once a daily table is no longer
"today", we will remove duplicates, so we can combine that step with
rolling up the hourly tables into one daily table.
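The hourly scheme looks roughly like this (table and column names are
just illustrative):

```sql
-- Each hour gets its own child of the daily table, so the "hot"
-- indexes only ever cover one hour of data.
CREATE TABLE events_20111130_h13 (
    CHECK (event_time >= TIMESTAMP '2011-11-30 13:00'
       AND event_time <  TIMESTAMP '2011-11-30 14:00')
) INHERITS (events_20111130);

-- The index stays small because the table holds at most an hour.
CREATE INDEX ON events_20111130_h13 (event_hash);
```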

In a *small* test (1-2 orders of magnitude smaller than some potential customer environments), the cumulative size of the daily indexes is 3.6 GB and that's for only about half of the test.

We're talking 4 different daily partitioned tables with each table having 1 - 6 indexes (yes, a lot!).

I'll post another update when I have it.

Thanks Leonardo.

On 11/30/2011 10:17 AM, Leonardo Francalanci wrote:
>> We now found (thanks Andres and Snow-Man in #postgresql) that in our
>> tests, after the indexes get too large performance drops significantly
>> and our system limps forward due to disk reads (presumably for the
>> indexes). If we remove the indexes, performance for our entire sample
>> test is great and everything is written to postgresql very quickly.
> It's usually the fact that the data you index is "random" as opposed to,
> say, an always incremented value (could be a timestamp, or a sequence)
> that leads to insert problems with btrees.
>> My question is, what possible routes can I take where we can have both
>> fast inserts (with indexes removed until the end of the day), but still
>> allow a user to query against today's data? Is this even possible? One
>> idea would be possibly have hourly tables for today and as soon as we
>> can try to re-add indexes.
> Yep, that's the only way I've found: use smaller partitions. That leads
> to slower reads (due to the fact that you have to visit more indexes to
> read the same amount of data). But you'll get faster writes.
>
>> Another possible solution might be to stream
>> the data to another "reader" postgres instance that has indexes,
>> although I'm not very versed in replication.
> I don't think you can do that.
> Another option that you have is to use ssd instead of HD for the indexes
> only (that is, having the indexes in a separate tablespace of ssds). The
> problem is that your disks usually can't keep up with the number of
> random writes it takes to update N "random values" btrees; ssd might help.
> Can you post some numbers, such as # of indexes, # of rows you're trying
> to insert per hour etc etc?
>
>


--
Benjamin Johnson
http://getcarbonblack.com/ | @getcarbonblack
cell: 312.933.3612

Re: Guidance Requested - Bulk Inserting + Queries

From
Jeff Janes
Date:
On Wed, Nov 30, 2011 at 7:27 AM, Benjamin Johnson
<benjamin.johnson@getcarbonblack.com> wrote:
> Experts,
>
> Quick Summary: data can now be inserted very quickly via COPY + removing
> indexes, but is there a design or some tricks to still allow someone to
> query while the partition is still active and 'hot' ?
>
> - Postgres 9.1
> - Windows 7 (64-bit) , although this is just for the current test and
> could vary depending on situation
> - We have 4 main tables with daily partitions

How long are the daily partitions kept for?

> - Each table/partition has multiple indexes on it
> - Streaming logs from client machines into our server app which
> processes the logs and tries to shove all that data into these daily
> partitions as fast as it can.

Why shove it in as fast as you can?  If you want to both read and
write at the same time, then focusing first only on writing and
worrying about reading as an afterthought seems like the wrong thing
to do.

> - Using COPY and removed original primary key unique constraints to try
> to get it to be as fast as possible (some duplicates are possible)
> - Will remove duplicates in a later step (disregard for this post)
>
> We now found (thanks Andres and Snow-Man in #postgresql) that in our
> tests, after the indexes get too large performance drops significantly
> and our system limps forward due to  disk reads (presumably for the
> indexes).

How many hours worth of data can be loaded into the new partition
before the performance knee hits?

After the knee, how does the random disk read activity you see compare
to the maximum random disk reads your IO system can support?  How many
COPYs were you doing at the same time?

During this test, was there background select activity going on, or
was the system only used for COPY?

> If we remove the indexes, performance for our entire sample
> test is great and everything is written to postgresql very quickly.
> This allows us to shove lots and lots of data in (for production
> possibly 100 GB or a TB per day!)

How much do you need to shove in per day?  If you need to insert it,
and index it, and run queries, and deal with maintenance of the older
partitions, then you will need a lot of spare capacity, relative to
just inserting, to do all of those things.  Do you have windows where
there is less insert activity in which other things can get done?

Cheers,

Jeff

Re: Guidance Requested - Bulk Inserting + Queries

From
Benjamin Johnson
Date:
Jeff,

Sorry for the delayed response.  Please see (some) answers inline.

On 12/1/2011 9:06 AM, Jeff Janes wrote:
> On Wed, Nov 30, 2011 at 7:27 AM, Benjamin Johnson
> <benjamin.johnson@getcarbonblack.com> wrote:
>> Experts,
>>
>> Quick Summary: data can now be inserted very quickly via COPY + removing
>> indexes, but is there a design or some tricks to still allow someone to
>> query while the partition is still active and 'hot' ?
>>
>> - Postgres 9.1
>> - Windows 7 (64-bit) , although this is just for the current test and
>> could vary depending on situation
>> - We have 4 main tables with daily partitions
> How long are the daily partitions kept for?
We want this to be user-configurable but ideally 30 - 90 days, possibly
longer for (order of magnitude) smaller customers.
>> - Each table/partition has multiple indexes on it
>> - Streaming logs from client machines into our server app which
>> processes the logs and tries to shove all that data into these daily
>> partitions as fast as it can.
> Why shove it in as fast as you can?  If you want to both read and
> write at the same time, then focusing first only on writing and
> worrying about reading as an afterthought seems like the wrong thing
> to do.
You're probably correct in that we need to think about the entire system
as a whole.  We're concerned with getting the data from our host-based
clients to our server, where it is processed and stored.  Because our
system is essentially a giant logging service for the enterprise, most
of the time we're collecting data and writing it.  The main times it
will be read are when some security incident occurs, but between those
we expect it to be very write-heavy.

We're probably most concerned with write performance because we were
originally seeing poor times and were worried about how well it would
scale.  We've improved it a lot, so we might just need to take a step
back and see what else we can do for the overall system.

>> - Using COPY and removed original primary key unique constraints to try
>> to get it to be as fast as possible (some duplicates are possible)
>> - Will remove duplicates in a later step (disregard for this post)
>>
>> We now found (thanks Andres and Snow-Man in #postgresql) that in our
>> tests, after the indexes get too large performance drops significantly
>> and our system limps forward due to  disk reads (presumably for the
>> indexes).
> How many hours worth of data can be loaded into the new partition
> before the performance knee hits?
In simulations, if I try to simulate the amount of data a large customer
would send, then it is just about an hour's worth of data before the
indexes get to be several gigabytes in size and performance really goes
downhill -- the "knee", if you will.
> After the knee, how does the random disk read activity you see compare
> to the maximum random disk reads your IO system can support?  How many
> COPYs were you doing at the same time?
I don't have exact statistics, but we had 4 writer threads all doing
COPY into 4 tables as fast as they receive data.
The system is very much NOT ideal -- a Windows 7 developer-class
workstation with one 7200 RPM hard drive.  I want to find bottlenecks
in this system and then see what real servers can handle.  (We're a
small company and only now are we starting to be able to invest in
dev/test servers.)
>
> During this test, was there background select activity going on, or
> was the system only used for COPY?
I pretty much stripped it entirely down to just doing the writes.  Data
was coming in over HTTP to a python web stack, but that was pretty much
just passing these logfiles to the (C++) writer threads.
>> If we remove the indexes, performance for our entire sample
>> test is great and everything is written to postgresql very quickly.
>> This allows us to shove lots and lots of data in (for production
>> possibly 100 GB or a TB per day!)
> How much do you need to shove in per day?  If you need to insert it,
> and index it, and run queries, and deal with maintenance of the older
> partitions, then you will need a lot of spare capacity, relative to
> just inserting, to do all of those things.  Do you have windows where
> there is less insert activity in which other things can get done?
That's something we keep asking ourselves.  Right now it's about 10 MB
per client per day.  Some customers want 50,000 clients, which would be
500 GB per day if my math is correct.  We know we will never handle this
with a single server, but we want to get as high as we can (say 5,000 -
10,000 clients) before saying that our customers have to add more
hardware.

> Cheers,
>
> Jeff

We managed to sort of get around the issue by having hourly tables
inherit from our daily tables.  This makes our indexes smaller, and the
writes in our tests don't seem to hit this same limit (at least so far).
I have a couple of follow-up questions:

1) Would it be acceptable to have, let's say, 60 daily partitions, each
of which has 24 hourly partitions?  Would it be better, after a day or
two (so that the data is now old and mostly unchanged), to "roll up" the
hourly tables into their respective daily table and then remove the
hourly tables?
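The rollup step I have in mind would look roughly like this (all names
are just for illustration), folding the dedup pass in at the same time:

```sql
-- Hypothetical rollup: move one finished day out of its hourly
-- children into the daily parent, dropping duplicate hashes as we go.
BEGIN;
INSERT INTO events_20111130
SELECT DISTINCT ON (event_hash) *
FROM events_20111130_h00         -- UNION ALL the other hours in turn
ORDER BY event_hash, event_time;
DROP TABLE events_20111130_h00;  -- and drop each hourly child
COMMIT;
```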

2) Some of our indexes are on an identifier that is a hash of some event
attributes, so it's basically a random BIGINT.  We believe part of the
problem is that each row could land in an entirely different location in
the index, causing lots of seeking and thrashing.  Would making the
index a multi-column index on (event_timestamp, originally_indexed_column)
be better, so that rows are closer in proximity to other events coming
in around the same time?  I have to admit that I don't really know how
indexes are stored / paged.
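In other words (with illustrative names), the choice is between:

```sql
-- Current: entries land at random leaf pages all over the btree.
CREATE INDEX events_hash_idx ON events_20111130 (event_hash);

-- Proposed: leading with the timestamp, so rows arriving around the
-- same time land near each other at the "right edge" of the btree.
CREATE INDEX events_ts_hash_idx
    ON events_20111130 (event_timestamp, event_hash);
```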

3) Does anyone else have similar systems where they have a ton of data
coming in that they also want to query?  Any tips you can provide or
alternative designs?  Once the data is in, it will 99.9% of the time
(100% of the time for some tables) be static.  Part of the issue is that
the user wants to be able to search based on all sorts of attributes --
this leads to lots of indexes and more disk/memory usage when writing.

Ben

--
Benjamin Johnson
http://getcarbonblack.com/ | @getcarbonblack
cell: 312.933.3612


Re: Guidance Requested - Bulk Inserting + Queries

From
Jeff Janes
Date:
On Wed, Dec 21, 2011 at 6:30 PM, Benjamin Johnson
<benjamin.johnson@getcarbonblack.com> wrote:
> Jeff,
>
> Sorry for the delayed response.  Please see (some) answers inline.
>
> On 12/1/2011 9:06 AM, Jeff Janes wrote:
>> On Wed, Nov 30, 2011 at 7:27 AM, Benjamin Johnson

>> Why shove it in as fast as you can?  If you want to both read and
>> write at the same time, then focusing first only on writing and
>> worrying about reading as an afterthought seems like the wrong thing
>> to do.

> You're probably correct in that we need to think about the entire system
> as a whole.  We're concerned with getting the data
> from our host-based to our server where it is processed and stored.
> Because our system is essentially a giant logging service for
> your enterprise, most of the time we're collecting data and writing it.
> The main times it will be read is when some security incident
> occurs, but between those we expect it to be very write heavy.

OK, that does have an interesting flavor, in which the typical row
will be read approximately zero times, but you can't predict in
advance which ones are more likely to ever be queried.

Do you know how necessary all your indexes are for supporting the
queries?  If the queries are relatively rare, maybe you could support
them simply with seq scans on unindexed tables/partitions, at least on
the leading edge partitions.

>> How many hours worth of data can be loaded into the new partition
>> before the performance knee hits?
> In simulations, if I try to simulate the amount of data a large customer
> would send, then it is just about an hour worth of data before the indexes
> get to be several gigabytes in size and performance really goes downhill
> -- the "knee" if you will.

So having hourly partitions with live indexes might be cutting it
pretty close.  Once something pushes you over the edge into degraded
performance, you would never be able to recover.

>> After the knee, how does the random disk read activity you see compare
>> to the maximum random disk reads your IO system can support?  How many
>> COPYs were you doing at the same time?
> I don't have exact statistics, but we had 4 writer threads all doing
> copy into 4 tables as fast as they receive data.

Are they receiving data at the rate they would naturally?  I.e. does
it take an hour to simulate an hour's worth of data?

If they go into different tables, then they are going into different
indices and so are all competing with each other for cache space for
the index leaf blocks
(rather than sharing that cache space as they might possibly if they
were going into the same table).  So you run out of cache space and
your performance collapses at one fourth the total size, compared to
making them take turns.  Of course if you make them take turns, you have to
either throttle or buffer their data retrieval.  Also there is a
question of how often you would have to rotate turns, and how long it
would take to exchange out the buffers upon a turn rotation.  (There
are stupid OS tricks you can pull outside of PG to help that process
along, but trying to coordinate that would be painful.)


> The system is very much NOT ideal -- Windows 7 Developer-Class
> Workstation with (one) 7200 RPM Harddrive.  I want to find bottlenecks
> in this
> system and then see what real servers can handle.  (We're a small
> company and only now are starting to be able to invest in dev/test servers.)

I think you said that for loading into large-grained partitions with
live indexes, the bottleneck was the random reads needed to pull in
the leaf blocks.  In that case, if you change to RAID with striping
you should be able to scale with the effective number of spindles,
provided you have enough parallel copies going on to keep each spindle
busy with its own random read.  Of course those parallel copies would
make the RAM issues worse, but by saying large-grained partitions I
mean that you've already given up on the notion of having the indices fit
in RAM, so at that point you might as well get the spindle-scaling.

...

>
> We managed to sort of get around the issue by having hourly tables
> inherit from our daily tables.  This makes our indexes smaller and the
> writes in our tests don't
> seem to hit this same limit (at least so far.)  I have a couple
> follow-up questions:
>
> 1) Would it be acceptable to have let's say 60 daily partitions and then
> each of those has 24 hourly partitions?

It sounds like each client gets their own hardware, but if each client
can have several thousand customers, how is that handled?  All dumped
into one giant partitioned (on time) table, or does each customer get
their own table?  60*24*thousands would certainly add up!  If it is
just 60*24, it will certainly slow down your queries (the ones not
using constraint exclusion anyway) some as it has to do a look up in
1440 btrees for each query, but if queries are fast enough then they
are fast enough.  It should be pretty easy to test, if you know the
types of queries you will be seeing.
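For instance (with invented child-table names), you can check that
constraint exclusion is kicking in before committing to 1440 children:

```sql
-- 'partition' is the 9.1 default for this setting.
SET constraint_exclusion = partition;

-- A time-bounded query should only touch the matching hourly child
-- (plus the empty parent), not all 1440 children.
EXPLAIN SELECT * FROM events
WHERE event_time >= TIMESTAMP '2011-11-30 13:00'
  AND event_time <  TIMESTAMP '2011-11-30 14:00';
```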

> Would it be better to after a
> day or two (so that data is now old and mostly unchanged) "rollup" the
> hourly tables into their respective daily table and then remove the
> hourly tables?

That would generate an awful lot of extraneous IO (mostly sequential
rather than random, so more efficient, but still IO) which is going to
compete with the rest of the IO going on, in order to solve a problem
that you don't yet know that you have.

>
> 2) Some of our indexes are on an identifier that is a hash of some event
> attributes, so it's basically a random BIGINT.  We believe part of the
> problem is that each row could be in an entirely different location in
> the index thus causing lots of seeking and thrashing.  Would doing
> something like having our index become a multi-column index by doing
> (event_timestamp, originally_index_column) be better so that they closer
> in proximity to other events coming in around the same time?  I have to
> admit that I don't really know how indexes are stored / paged.

What if you just drop this index but keep the others while loading?
If dropping just that index has a big effect, then changing it as you
describe would almost certainly help on the loading, but would the new
index still efficiently support the same queries that the old one did?
I.e. could all queries based on the hash code be reformulated to
query on both exact time stamp and the hash code?  Otherwise you would
be throwing the baby out with the bath water.
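That is, something like (names illustrative):

```sql
-- Old query on the hash alone no longer matches the leading column:
SELECT * FROM events WHERE event_hash = 123456789;

-- Reformulated to use a (event_timestamp, event_hash) index; this
-- only works if the application knows a time window for the event:
SELECT * FROM events
WHERE event_timestamp >= TIMESTAMP '2011-11-30 00:00'
  AND event_timestamp <  TIMESTAMP '2011-12-01 00:00'
  AND event_hash = 123456789;
```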


>
> 3) Does anyone else have similar systems where they have a ton of data
> coming in that they also want to query?  Any tips you can provide or
> alternative designs?  Once the data is in, it will 99.9% of the time
> (100% of the time for some tables) be static.  Part of the issue is that
> the user wants to be able to search based on all sorts of attributes --
> this leads to lots of indexes and more disk/memory usage when writing.

Have you experimentally verified that all of the indexes really are
needed to get acceptable query performance?  I tend to err on the
side of adding more indices just in case they might be useful, but you
already know you have a problem caused by index maintenance, so
defaulting to not having them until you have proof that they are needed
might be better in this case.

Cheers,

Jeff