Thread: Tips/Hacks to create a minimal DB from the execution of several (simple) SQL requests.

Hi all!

I have an interesting problem here that I think could be of interest to everyone. I'm in the process of writing test
cases for our applications, and there is one problem I am facing. To be able to test correctly, I need to create a small
database (a sample, if you want) from a very large one so that I can run some tests on a subset of the data.

Sometimes you are asked to do this but know nothing about the database in advance (ugh!).

I could create several queries and build it myself by trial and error, but I was wondering if a more general approach
could be worked out.

In my case, the testing does not write to the database, and the queries are simple in nature (they do not use count() or
anything else that needs a whole table to work).

Here are some solutions that I have come up with :

a) Run the main program A on the large database, but restrict its operation to a subset of the data.

If I create a MockConnection, I could save all the queries as text strings and serialize the result sets that Postgres
returned, then use this log to re-run the program without a connection to the real database. I would simply return the
serialized result set whenever I find a match for the query.
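A minimal sketch of this record/replay idea, using Python's sqlite3 as a stand-in for the real Postgres driver (the class and attribute names here are illustrative, not an existing API): queries are recorded with their result sets on the first run, and later runs are served from the cache with no live connection at all.

```python
# Record/replay wrapper around a DB connection. sqlite3 stands in for
# the real Postgres driver; the cache maps (query, params) to the rows
# that query returned, so a later run can replay results offline.
import sqlite3

class RecordingConnection:
    def __init__(self, conn):
        self.conn = conn          # live connection (None in replay mode)
        self.cache = {}           # (sql, params) -> list of result rows

    def execute(self, sql, params=()):
        key = (sql, params)
        if key in self.cache:                 # replay: serve saved rows
            return self.cache[key]
        rows = self.conn.execute(sql, params).fetchall()
        self.cache[key] = rows                # record for later replay
        return rows

# record against a tiny "real" database
real = sqlite3.connect(":memory:")
real.execute("CREATE TABLE t (id INTEGER, name TEXT)")
real.executemany("INSERT INTO t VALUES (?, ?)", [(1, "a"), (2, "b")])

rec = RecordingConnection(real)
first = rec.execute("SELECT name FROM t WHERE id = ?", (1,))

# replay: drop the live connection; the cached answer still comes back
rec.conn = None
second = rec.execute("SELECT name FROM t WHERE id = ?", (1,))
assert first == second == [("a",)]
```

As noted below, the weakness is exact-match lookup: any change to the query text misses the cache.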

b) Take the source code of Postgres and add some tweaks so that it logs every table/row that was needed to produce the
output result set, and build a separate minimal DB from that.

With this option, database B (the minimalist one) would not have any constraints. On the other hand, it may be very
difficult for me to add this logging to the source code.


Of course, option b) seems much more interesting, since a small optimization in the program's queries would probably
still work, while in option a) it would fail immediately.

I was wondering if maybe there was something else I could do to solve this problem in a general way?

Daniel Shane

Daniel Shane wrote:
> Hi all!
>
> I need to create a small database (a sample, if you want) from a very large one so that I can run some tests on a subset of the data.
>
> ...

I'd create a test schema and set the search_path of your test user to just
that schema. Then you could create the tables something like so:

CREATE TABLE test.foo AS
   SELECT * FROM public.foo
   LIMIT 1000;

Scott

Hi Scott!

The problem is that my test database has several tables with many links between them, so I have no idea which 1000 rows
to get from which table. The only thing I can do is run the program that connects to that database and tell it to run on
a sample of the database.

I can get a log of all the queries that are executed, but I was wondering if there was a more general solution where I
could use a "modified/hacked" Postgres driver and catch all the rows of all the tables that were accessed during those
queries.

I could then simply insert them into the test database, and in theory my program should run the same if I used it
instead of the real one (assuming it's configured to run on the same sample).
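Assuming the instrumented driver can report which primary keys of which tables each query touched, the last step is mechanical: copy exactly those rows into a fresh database. A sketch, again with sqlite3 standing in for Postgres and illustrative table/column names:

```python
# Build the sample database from captured rows: `touched` is what the
# hypothetical instrumented driver would report, mapping table name to
# the primary keys each query accessed. sqlite3 stands in for Postgres.
import sqlite3

big = sqlite3.connect(":memory:")
big.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
big.executemany("INSERT INTO users VALUES (?, ?)",
                [(i, f"user{i}") for i in range(1000)])

# rows the instrumented run reported as "accessed"
touched = {"users": [3, 42, 7]}

sample = sqlite3.connect(":memory:")
sample.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")

for table, ids in touched.items():
    marks = ",".join("?" * len(ids))
    rows = big.execute(
        f"SELECT * FROM {table} WHERE id IN ({marks})", ids).fetchall()
    sample.executemany(f"INSERT INTO {table} VALUES (?, ?)", rows)

assert sample.execute("SELECT COUNT(*) FROM users").fetchone()[0] == 3
```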

Daniel Shane



Daniel Shane wrote:
> Hi Scott!
>
> The problem is that my test database has several tables with many links between them, so I have no idea which 1000 rows to get from which table. The only thing I can do is run the program that connects to that database and tell it to run on a sample of the database.
>
> I can get a log of all the queries that are executed, but I was wondering if there was a more general solution where I could use a "modified/hacked" Postgres driver and catch all the rows of all the tables that were accessed during those queries.
>
> I could then simply insert them into the test database, and in theory my program should run the same if I used it instead of the real one (assuming it's configured to run on the same sample).
>
> Daniel Shane

Yeah, but there's still a much easier way. If there are many links between the
tables, then related records would typically have been inserted either in the
same transaction or soon thereafter. You can use the xmin column to
copy related rows from all tables.
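A sketch of the idea: in Postgres, the hidden xmin system column records the transaction that inserted each row, so rows created together share a value. Below, an explicit tx_id column in sqlite3 simulates xmin (real xmin is a 32-bit transaction id with wraparound caveats, so direct range comparisons need care in practice); copying every row whose tx_id falls in a chosen window pulls related rows across tables together.

```python
# Simulating xmin-window sampling with sqlite3: tx_id stands in for
# Postgres's hidden xmin column. Rows inserted in the same transaction
# share a tx_id, so one window selects related rows from every table.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE orders (id INTEGER, tx_id INTEGER)")
db.execute("CREATE TABLE items  (order_id INTEGER, tx_id INTEGER)")
# order 1 and its items were inserted in transaction 100,
# order 2 and its items in transaction 200
db.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 100), (2, 200)])
db.executemany("INSERT INTO items VALUES (?, ?)",
               [(1, 100), (1, 100), (2, 200)])

lo, hi = 100, 150           # sample only transactions in this window
sample = {}
for table in ("orders", "items"):
    sample[table] = db.execute(
        f"SELECT * FROM {table} WHERE tx_id BETWEEN ? AND ?",
        (lo, hi)).fetchall()

# order 1 and both of its items land in the sample; order 2 does not
assert sample["orders"] == [(1, 100)]
assert len(sample["items"]) == 2
```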

Scott