Re: Table Design question for gurus (without going to "NoSQL")... - Mailing list pgsql-general

From Phoenix Kiula
Subject Re: Table Design question for gurus (without going to "NoSQL")...
Date
Msg-id CAFWfU=um3eF5vsSKQ=qUP9BbXAPMzcx-bmvWfTYNFhLNEX49KA@mail.gmail.com
Whole thread Raw
In response to Table Design question for gurus (without going to "NoSQL")...  (Phoenix Kiula <phoenix.kiula@gmail.com>)
Responses Re: Table Design question for gurus (without going to "NoSQL")...  (Gregg Jaskiewicz <gryzman@gmail.com>)
Re: Table Design question for gurus (without going to "NoSQL")...  (Adrian Klaver <adrian.klaver@gmail.com>)
List pgsql-general
On Sun, Nov 20, 2011 at 9:33 PM, Phoenix Kiula <phoenix.kiula@gmail.com> wrote:
> Hi.
>
> Want to start another thread, loosely related to the performance
> problems thread I have going.
>
> Need some DB design guidance from the gurus here.
>
> My big table now has about 70 million rows, with the following columns:
>
>
>  alias           | character varying(35)
>  url             | text
>  modify_date     | timestamp without time zone
>  ip              | bigint
>
>
> For each IP address (user of my application) I want to have a unique
> URL. So I used to have a UNIQUE constraint on IP, URL. But the index
> based on this became huge, as some URLs are gigantic. so I introduced
> an md5 of the URL:
>
>
>  url_md5             | varchar(32)
>
>
> I now have two scenarios:
>
> 1. To have an index (unique?) on "(ip, url_md5)"
>
> 2. To not have an index on just the "ip". This way a query that tries
> to match   "...WHERE ip = 999 AND url_md5 = '<md5 here>'..." will
> still look only at the ip bit of the index, then refine it with the
> url_md5.
>
> The good thing about #2 is the size of index remains very small with
> only a bigint field (ip) being indexed.
>
> The bad thing about #2 is that each query of "...WHERE ip = 999 AND
> url_md5 = '<md5 here>'..."  will have to refine the indexed IP. If one
> IP address has put in a lot of URLs, then this becomes a bit slow. As
> is now happening, where I have users who have over 1 million URLs
> each!
>
> Questions:
>
> 1. Instead of md5, is there any integer hashing algorithm that will
> allow me to have a bigint column and save a lot hopefully in both
> storage space and speed?  (Some very useful points mentioned here:
> http://stackoverflow.com/questions/1422725/represent-md5-hash-as-an-integer
> )
>
> 2. If I do go with the above scenario #1 of a joint index, is there
> any way I can save space and maintain speed? Partitioning etc are out
> of the question.
>
> With a growing web database, I am sure many people face this
> situation. Are nosql type databases the only sane solution to such
> massive volumes and throughput expectations (e.g., CouchDb's MemBase)?
>
> Many thanks for any ideas or pointers!
>



I thought of adding a bigserial (serial8) column instead of
varchar(32) for the md5. But postgresql tells me that:

--
ERROR:  type "bigserial" does not exist
--

Why is this? Why can't I create a column with this "type"? Whats the
current syntax?

Thanks.

pgsql-general by date:

Previous
From: Rob Sargentg
Date:
Subject: 9.1.1 build failure : postgres link fails
Next
From: Gregg Jaskiewicz
Date:
Subject: Re: Table Design question for gurus (without going to "NoSQL")...