Re: Table Design question for gurus (without going to "NoSQL")... - Mailing list pgsql-general

From Gavin Flower
Subject Re: Table Design question for gurus (without going to "NoSQL")...
Date
Msg-id 4EC98CAE.20105@archidevsys.co.nz
Whole thread Raw
In response to Table Design question for gurus (without going to "NoSQL")...  (Phoenix Kiula <phoenix.kiula@gmail.com>)
Responses Re: Table Design question for gurus (without going to "NoSQL")...  (Phoenix Kiula <phoenix.kiula@gmail.com>)
List pgsql-general
On 21/11/11 02:33, Phoenix Kiula wrote:
> Hi.
>
> Want to start another thread, loosely related to the performance
> problems thread I have going.
>
> Need some DB design guidance from the gurus here.
>
> My big table now has about 70 million rows, with the following columns:
>
>
>   alias           | character varying(35)
>   url             | text
>   modify_date     | timestamp without time zone
>   ip              | bigint
>
>
> For each IP address (user of my application) I want to have a unique
> URL. So I used to have a UNIQUE constraint on IP, URL. But the index
> based on this became huge, as some URLs are gigantic. so I introduced
> an md5 of the URL:
>
>
>   url_md5             | varchar(32)
>
>
> I now have two scenarios:
>
> 1. To have an index (unique?) on "(ip, url_md5)"
>
> 2. To not have an index on just the "ip". This way a query that tries
> to match   "...WHERE ip = 999 AND url_md5 = '<md5 here>'..." will
> still look only at the ip bit of the index, then refine it with the
> url_md5.
>
> The good thing about #2 is the size of index remains very small with
> only a bigint field (ip) being indexed.
>
> The bad thing about #2 is that each query of "...WHERE ip = 999 AND
> url_md5 = '<md5 here>'..."  will have to refine the indexed IP. If one
> IP address has put in a lot of URLs, then this becomes a bit slow. As
> is now happening, where I have users who have over 1 million URLs
> each!
>
> Questions:
>
> 1. Instead of md5, is there any integer hashing algorithm that will
> allow me to have a bigint column and save a lot hopefully in both
> storage space and speed?  (Some very useful points mentioned here:
> http://stackoverflow.com/questions/1422725/represent-md5-hash-as-an-integer
> )
>
> 2. If I do go with the above scenario #1 of a joint index, is there
> any way I can save space and maintain speed? Partitioning etc are out
> of the question.
>
> With a growing web database, I am sure many people face this
> situation. Are nosql type databases the only sane solution to such
> massive volumes and throughput expectations (e.g., CouchDb's MemBase)?
>
> Many thanks for any ideas or pointers!
>
How about having 2 indexes: one on each of ip & url_md5? Pg will combine
the indexes as required, or will just use one if that is best.

Why don't you have a time zone on your timestamp???


Regards,
Gavin

pgsql-general by date:

Previous
From: Andreas 'ads' Scherbaum
Date:
Subject: FOSDEM 2012 - PostgreSQL Devroom: Call for Speakers
Next
From: Gavin Flower
Date:
Subject: Re: Is it ever necessary to vacuum a table that only gets inserts/updates?