Thread: Need advice for handling big data in postgres

Need advice for handling big data in postgres

From: Tobias Fielitz
Date: Wed, 14 Jan 2015 11:42:45 +1100
Hi,

I am quite new to handling a lot of data in Postgres and I would be happy
to get as much advice from experienced data jugglers as possible.

THE SCENARIO:
I have a big table `client_log`. It is the "main gate" for clients (and
there are a lot of them) to talk to the backend.
Each client sends a varying number of log lines, so the table is written
to many times every second and produces a lot of data. Here is the
table definition:
http://stackoverflow.com/questions/25542010/postgres-huge-table-with-delayed-read-and-write-access
*code* determines the nature of the log line. Example: code=20
indicates a location log line (the important information will be in
*latitude* and *longitude*); code=8999 carries a comment that just needs
to be stored (in the *comment* field).

Just storing the data is useless, I need to look at it: at some log
lines quite regularly, at some just once, at others every now and then.
I need indexes on nearly every single column for quick access. So what
is the `best` way of storing the data?

OPTION 1 - PARTITIONING:
For each query only a few columns are interesting and I could partition
the table (as was suggested on SO) by *created* and by *code*.
There are roughly 10 different codes and I would keep data for the last
two months (partitioned by day). So I would end up having 10 * 60 = 600
partitions.
For every partition I could create indexes that are relevant to that
partition (examples: *created*, *created_on_server* or *latitude* and
*longitude*).
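
Roughly what I have in mind (columns simplified here, the full definition
is in the SO link above), using inheritance plus CHECK constraints so
constraint exclusion can skip irrelevant partitions:

CREATE TABLE client_log (
    id                bigserial,
    client_id         integer     NOT NULL,
    code              integer     NOT NULL,
    created           timestamptz NOT NULL,
    created_on_server timestamptz NOT NULL,
    latitude          double precision,
    longitude         double precision,
    comment           text
);

-- one child per (code, day)
CREATE TABLE client_log_20_20150114 (
    CHECK (code = 20),
    CHECK (created >= '2015-01-14' AND created < '2015-01-15')
) INHERITS (client_log);

-- only the indexes this partition actually needs
CREATE INDEX ON client_log_20_20150114 (created);
CREATE INDEX ON client_log_20_20150114 (latitude, longitude);

Inserts would be routed to the right child either from the application or
by a trigger on the parent.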

OPTION 2 - MULTIPLE TABLES:
I could create the tables myself: one for location log lines, one for
comment log lines etc., and route each line to the correct table from
Python (depending on *code*). Each of these tables would only have the
columns and indexes needed.
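
Again a rough sketch (column lists simplified):

CREATE TABLE location_log (
    client_id integer          NOT NULL,
    created   timestamptz      NOT NULL,
    latitude  double precision NOT NULL,
    longitude double precision NOT NULL
);
CREATE INDEX ON location_log (created);
CREATE INDEX ON location_log (latitude, longitude);

CREATE TABLE comment_log (
    client_id integer     NOT NULL,
    created   timestamptz NOT NULL,
    comment   text        NOT NULL
);
CREATE INDEX ON comment_log (created);

The Python side would simply pick the target table based on *code*.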

OUTCOME:
I expect partitioning to be faster because Postgres selects the correct
partition for me automatically. I can easily get rid of old data by
dropping the corresponding partition. The downside of the partition
approach is that all partitions inherit all columns of the master table,
which is unnecessary (and consumes disk space?).
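
Expiring a day of data would then just be a cheap metadata operation
(using the naming scheme sketched above) instead of a huge DELETE plus
VACUUM on one big table:

DROP TABLE IF EXISTS client_log_20_20141114;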

Which option should I go for?
Is there anything else that I haven't considered?

Any help and comments appreciated.

P.S.: Does partitioning work trouble-free with master-slave replication?

Re: Need advice for handling big data in postgres

From: Vincent Veyron
Date: 15 Jan 2015
On Wed, 14 Jan 2015 11:42:45 +1100
Tobias Fielitz <tobias@streethawk.com> wrote:
>
> OPTION 1 - PARTITIONING:
> For each query only a few columns are interesting and I could partition
> the table (as it was suggested on SO)
> by *created* and by *code*.
> There is roughly 10 different codes and I would keep data for the last
> two months (partitioned by day). So I would end up having 10 * 60 = 600
> partitions.
> For every partition I could create indexes that are relevant to that
> partition (examples: *created*, *created_on_server* or *latitude* and
> *longitude*).
>
> OPTION 2 - MULTIPLE TABLES:
> I could create the tables myself: one for location log lines, one for
> comment log lines etc. and store them via python in the correct table
> (depending on *code*). Each of these tables would only have the columns
> and indexes needed.
>
> OUTCOME:
> I expect partitioning to be faster because Postgres selects the correct
> partition for me automatically. I can easily get rid of old data by
> dropping the corresponding partition. The downside of the partition
> approach is that all partitions inherit all columns of the master table
> which is unnecessary (and consumes disc space?).
>

I gather from the comments in this list that null fields have a very low overhead; see:

http://www.postgresql.org/message-id/87prx92lj9.fsf@oxford.xeocode.com

I would worry a lot more about the maintenance problems option 2 will induce: if a code value changes or is
added/deleted, your python script needs updating. SQL queries will also be a lot more complicated (union select on
various tables) and harder to optimize. Maintaining the coherence between the script and the tables will get harder
and harder.


--
                    Salutations, Vincent Veyron

https://libremen.com/
Legal case, contract and insurance claim management software


Re: Need advice for handling big data in postgres

From: Jim Nasby
Date:
On 1/15/15 7:42 AM, Vincent Veyron wrote:
> On Wed, 14 Jan 2015 11:42:45 +1100
> Tobias Fielitz <tobias@streethawk.com> wrote:
>>
>> OPTION 1 - PARTITIONING:
>> For each query only a few columns are interesting and I could partition
>> the table (as it was suggested on SO)
>> by *created* and by *code*.
>> There is roughly 10 different codes and I would keep data for the last
>> two months (partitioned by day). So I would end up having 10 * 60 = 600
>> partitions.
>> For every partition I could create indexes that are relevant to that
>> partition (examples: *created*, *created_on_server* or *latitude* and
>> *longitude*).
>>
>> OPTION 2 - MULTIPLE TABLES:
>> I could create the tables myself: one for location log lines, one for
>> comment log lines etc. and store them via python in the correct table
>> (depending on *code*). Each of these tables would only have the columns
>> and indexes needed.
>>
>> OUTCOME:
>> I expect partitioning to be faster because Postgres selects the correct
>> partition for me automatically. I can easily get rid of old data by
>> dropping the corresponding partition. The downside of the partition
>> approach is that all partitions inherit all columns of the master table
>> which is unnecessary (and consumes disc space?).
>>
>
> I gather from the comments in this list that null fields have a very low overhead; see :
>
> http://www.postgresql.org/message-id/87prx92lj9.fsf@oxford.xeocode.com

Roughly speaking, storing a NULL costs some extra CPU and that's it. (There's a fixed-size NULL bitmap; if a field is
null then it's marked as such in the bitmap and nothing is stored in the tuple.)
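
If you want to convince yourself, compare the datum size of a row with and
without values in the nullable columns (exact numbers depend on alignment,
but the NULLs themselves add nothing):

SELECT pg_column_size(ROW(1::int, 20::int, now(), 52.5::float8, 13.4::float8, 'x'::text)) AS with_values,
       pg_column_size(ROW(1::int, 20::int, now(), NULL::float8, NULL::float8, NULL::text)) AS mostly_null;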

> I would worry a lot more about the maintenance problems option 2 will induce: if a code value changes or is
> added/deleted, your python script needs updating. SQL queries will also be a lot more complicated (union select on
> various tables) and harder to optimize. Maintaining the coherence between the script and the tables will get harder
> and harder.

Actually, because there are different pieces of data needed for different codes, this is a great use for inheritance
instead of just simple partitioning. That would allow you to put only the common fields in the parent table, and then
each child table adds whatever fields make sense. You can then use the parent table for cases where you don't care
about code-specific fields and use the specific code tables when you do (note that you don't want to insert into the
parent table though). I have a hard time buying the code maintenance argument since you'll have that problem no matter
what.
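
Something along these lines (names invented; adjust to your actual schema):

CREATE TABLE client_log_base (
    client_id         integer     NOT NULL,
    code              integer     NOT NULL,
    created           timestamptz NOT NULL,
    created_on_server timestamptz NOT NULL
);

CREATE TABLE location_log (
    latitude  double precision NOT NULL,
    longitude double precision NOT NULL,
    CHECK (code = 20)
) INHERITS (client_log_base);

CREATE TABLE comment_log (
    comment text NOT NULL,
    CHECK (code = 8999)
) INHERITS (client_log_base);

-- scan everything when only the common fields matter
SELECT count(*) FROM client_log_base WHERE created > now() - interval '1 day';
-- hit the specific child when you need the code-specific columns
SELECT latitude, longitude FROM location_log WHERE created > now() - interval '1 day';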

That said, dealing with 3 billion rows is a pretty tall order. At a minimum you'd need time partitioning (in addition
to inheritance if you go that route). I'd partition by day, or maybe even hour.

Something to consider... on a 64-bit machine your tuple header will be 32 bytes and the 6 NOT NULL fixed-width fields
will be 6*8=48 bytes. So you're already at 70 * 3B = 210GB. That's ignoring all other data, page overhead, fill factor
and indexes. I believe those indexes will be another 16 bytes per row (24B for the 2 field one) so ~300GB, again
ignoring fill factor and non-leaf pages. You'll need a fairly beefy server to make that perform well. Depending on your
insert rate and access patterns I think you could handle this with one server for writing and a separate one for
reading, but you'll have to be careful.
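
The easiest way to ground those estimates is to load a representative day
of data into a test child table and measure it (table name here is from
the sketch above):

SELECT pg_size_pretty(pg_relation_size('location_log'))       AS heap_only,
       pg_size_pretty(pg_total_relation_size('location_log')) AS with_indexes_and_toast;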

BTW, inheritance (and by extension partitioning) doesn't affect binary replication. The only trick with logical
replication is the need to add new partitions to the replication system as you create them.
--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com