Re: Call for Google Summer of Code (GSoC) 2012: Project ideas? - Mailing list pgsql-general

From Merlin Moncure
Subject Re: Call for Google Summer of Code (GSoC) 2012: Project ideas?
Date
Msg-id CAHyXU0zdvbnP39p1LJU5T1dPy0ZZLDT0dx3hTfCptCm1goxLzQ@mail.gmail.com
Whole thread Raw
In response to Re: Call for Google Summer of Code (GSoC) 2012: Project ideas?  (Andy Colson <andy@squeakycode.net>)
List pgsql-general
On Fri, Mar 9, 2012 at 10:19 AM, Andy Colson <andy@squeakycode.net> wrote:
> On 3/9/2012 9:47 AM, Merlin Moncure wrote:
>>
>> On Thu, Mar 8, 2012 at 2:01 PM, Andy Colson<andy@squeakycode.net>  wrote:
>>>
>>> I know toast compresses, but I believe its only one row.  page level
>>> would
>>> compress better because there is more data, and it would also decrease
>>> the
>>> amount of IO, so it might speed up disk access.
>>
>>
>> er, but when data is toasted it's spanning pages.  page level
>> compression is a super complicated problem.
>>
>> something that is maybe more attainable on the compression side of
>> things is a userland api for compression -- like pgcrypto is for
>> encryption.  even if it didn't make it into core, it could live on
>> reasonably as a pgfoundry project.
>>
>> merlin
>
>
> Agreed its probably too difficult for a GSoC project.  But userland api
> would still be row level, which, in my opinion is useless.  Consider rows
> from my apache log that I'm dumping to database:

It's useless for what you're trying to do, but it would be useful to
people trying to compress large datums (data, I  know) before storage
using algorithms that postgres can't support, like lzo.

> date, url, status
> 2012-3-9 10:15:00, '/index.php?id=4', 202
> 2012-3-9 10:15:01, '/index.php?id=5', 202
> 2012-3-9 10:15:02, '/index.php?id=6', 202
>
> That wont compress at all on a row level.  But it'll compress 99% on a
> "larger" (page/multirow/whatever/?) level.

sure, but you can only get those rates by giving up the segmented view
of the data that postgres requires.  your tuples are very small and I
only see compression happening on the userland side by employing
tricks specific to your specific dataset (like employing "char" to map
the status, url mapping, etc).

merlin

pgsql-general by date:

Previous
From: Bret Stern
Date:
Subject: Regarding NOTIFY
Next
From: John R Pierce
Date:
Subject: Re: How to erase transaction logs on PostgreSQL