Re: Table Design question for gurus (without going to "NoSQL")... - Mailing list pgsql-general

From Tomas Vondra
Subject Re: Table Design question for gurus (without going to "NoSQL")...
Date
Msg-id 7fb69541d76d30d1e28660797489217d.squirrel@sq.gransy.com
Whole thread Raw
In response to Re: Table Design question for gurus (without going to "NoSQL")...  (David Johnston <polobo@yahoo.com>)
List pgsql-general
> On Nov 20, 2011, at 21:33, Phoenix Kiula <phoenix.kiula@gmail.com> wrote:
>
> My big table now has about 70 million rows, with the following columns:
>
>  alias           | character varying(35)
>  url             | text
>  modify_date     | timestamp without time zone
>  ip              | bigint
>
>
> For each IP address (user of my application) I want to have a unique
> URL. So I used to have a UNIQUE constraint on IP, URL. But the index
> based on this became huge, as some URLs are gigantic. so I introduced
> an md5 of the URL:
>
>  url_md5             | varchar(32)

I don't understand this. So you want to allow exactly one URL for an IP
address? Or do you want to allow only one row with the same (IP, URL)
values? Because that's exactly what a UNIQUE index on (IP, URL) does.

If you want to allow just a single URL for an IP, you should create an
index on IP only.

> I now have two scenarios:
>
> 1. To have an index (unique?) on "(ip, url_md5)"
>
> 2. To not have an index on just the "ip". This way a query that tries
> to match   "...WHERE ip = 999 AND url_md5 = '<md5 here>'..." will
> still look only at the ip bit of the index, then refine it with the
> url_md5.

Have you actually performed any benchmarks with this? Just create the
indexes, run several (say 100) queries and that should give you an idea
which of the options is better.

We can speculate on the pros/cons of those options, but both are viable
under certain conditions. For example if there's just a very small number
of URLs for an IP, then #2 is probably going to be better. But if there's
enormous number of URLs per IP, then a multi-column index is probably
going to perform better.

But those are only guesses - try both options, run a few queries (not on
the overloaded system you're struggling with - that'd skew the results)
and show us EXPLAIN ANALYZE of the queries.

> Questions:
>
> 1. Instead of md5, is there any integer hashing algorithm that will
> allow me to have a bigint column and save a lot hopefully in both
> storage space and speed?  (Some very useful points mentioned here:
> http://stackoverflow.com/questions/1422725/represent-md5-hash-as-an-integer
> )

You can obviously take an MD5 hash, which is just a 16-byte value and use
the first 4 bytes as an integer. The problem with this is that it probably
significantly increases the collision. I.e. it will indicate two URLs to
be the same, although the URLs are different.

> 2. If I do go with the above scenario #1 of a joint index, is there
> any way I can save space and maintain speed? Partitioning etc are out
> of the question.

What is the index size, anyway? You've mentioned it's huge, but what is
the actual size?

> With a growing web database, I am sure many people face this
> situation. Are nosql type databases the only sane solution to such
> massive volumes and throughput expectations (e.g., CouchDb's MemBase)?

Well, partitioning is usually a good approach to problems like these.

Have you thought about moving the URLs into a separate table? I.e. instead
of this table

 alias           | character varying(35)
 url             | text
 url_md5         | character varying(32)
 modify_date     | timestamp without time zone
 ip              | bigint

you'd have two tables - one for keeping the URLs:

 id              | integer
 url             | text
 url_md5         | character varying(32)

and the current one, referencing the URLs

 alias           | character varying(35)
 url_id          | integer
 modify_date     | timestamp without time zone
 ip              | bigint

That'd allow you to create a UNIQUE index on (ip, url_id), which should be
much smaller than the current one. But handling the inserts would be
significantly more complex (you'd have to check existence of the URL,
insert it etc.).

Tomas


pgsql-general by date:

Previous
From: "Tomas Vondra"
Date:
Subject: Re: Table Design question for gurus (without going to "NoSQL")...
Next
From: Enrico Sirola
Date:
Subject: wal archiving on a hot-standby server