Thread: what would tar file FDW look like?

what would tar file FDW look like?

From

Bear Giles

Date:

17 August 2015, 14:14:47

<div dir="ltr"><div class="gmail_default" style="font-family:tahoma,sans-serif;color:#000000">I'm starting to work on a
tarFDW as a proxy for a much more specific FDW. (It's the 'faster to build two and toss the first away' approach - tar
letsme get the FDW stuff nailed down before attacking the more complex container.) It could also be useful in its own
right,or as the basis for a zip file FDW.</div><div class="gmail_default"
style="font-family:tahoma,sans-serif;color:#000000"><br/></div><div class="gmail_default"
style="font-family:tahoma,sans-serif;color:#000000">Ihave figured out that in one mode the FDW mapping that would take
thename of the tarball as an option and produce a relation that has all of the metadata for the contained files -
filename,size, owner, timestamp, etc. I can use the same approach I used for the /etc/passwd FDW for that.</div><div
class="gmail_default"style="font-family:tahoma,sans-serif;color:#000000"><br /></div><div class="gmail_default"
style="font-family:tahoma,sans-serif;color:#000000">(BTWthe current version is at <a
href="https://github.com/beargiles/passwd-fdw">https://github.com/beargiles/passwd-fdw</a>.It's skimpy on automated
testsuntil I can figure out how to handle the user mapping but it works.)</div><div class="gmail_default"
style="font-family:tahoma,sans-serif;color:#000000"><br/></div><div class="gmail_default"
style="font-family:tahoma,sans-serif;color:#000000">Theproblem is the second mode where I pull a single file out of the
FDW.I've identified three approachs so far:</div><div class="gmail_default"
style="font-family:tahoma,sans-serif;color:#000000"><br/></div><div class="gmail_default"
style="font-family:tahoma,sans-serif;color:#000000">1.A FDW mapping specific to each file. It would take the name of
thetarfile and the embedded file. Cleanest in some ways but it would be a real pain if you're reading a tarball
dynamically.</div><divclass="gmail_default" style="font-family:tahoma,sans-serif;color:#000000"><br /></div><div
class="gmail_default"style="font-family:tahoma,sans-serif;color:#000000">2. A user-defined function that takes the name
ofthe tarball and file and returns a blob. This is the traditional approach but why bother with a FDW then? It also
bringsup access control issues since it requires disclosure of the tarball name to the user. A FDW could hide
that.</div><divclass="gmail_default" style="font-family:tahoma,sans-serif;color:#000000"><br /></div><div
class="gmail_default"style="font-family:tahoma,sans-serif;color:#000000">3. A user-defined function that takes a tar
FDWand the name of a file and returns a blob. I think this is the best approach but I don't know if I can specify a FDW
asa parameter or how to access it.</div><div class="gmail_default"
style="font-family:tahoma,sans-serif;color:#000000"><br/></div><div class="gmail_default"
style="font-family:tahoma,sans-serif;color:#000000">I'veskimmed the existing list of FDW but didn't find anything that
canserve as a model. The foreign DB are closest but, again, they aren't designed for dynamic use where you want to do
somethingwith every file in an archive / table in a foreign DB.</div><div class="gmail_default"
style="font-family:tahoma,sans-serif;color:#000000"><br/></div><div class="gmail_default"
style="font-family:tahoma,sans-serif;color:#000000">Isthere an obvious approach? Or is it simply a bad match for FDW
andshould be two standard UDF?  (One returns the metadata, the second returns the specific file.)</div><div
class="gmail_default"style="font-family:tahoma,sans-serif;color:#000000"><br /></div><div class="gmail_default"
style="font-family:tahoma,sans-serif;color:#000000">Thanks,</div><divclass="gmail_default"
style="font-family:tahoma,sans-serif;color:#000000"><br/></div><div class="gmail_default"
style="font-family:tahoma,sans-serif;color:#000000">Bear</div></div>

Re: what would tar file FDW look like?

From

Greg Stark

Date:

17 August 2015, 14:30:43

On Mon, Aug 17, 2015 at 3:14 PM, Bear Giles <bgiles@coyotesong.com> wrote:
> I'm starting to work on a tar FDW as a proxy for a much more specific FDW.
> (It's the 'faster to build two and toss the first away' approach - tar lets
> me get the FDW stuff nailed down before attacking the more complex
> container.) It could also be useful in its own right, or as the basis for a
> zip file FDW.

Hm. tar may be a bad fit where zip may be much easier. Tar has no
index or table of contents. You have to scan the entire file to find
all the members. IIRC Zip does have a table of contents at the end of
the file.

The most efficient way to process a tar file is to describe exactly
what you want to happen with each member and then process it linearly
from start to end (or until you've found the members you're looking
for). Trying to return meta info and then go looking for individual
members will be quite slow and have a large startup cost.

-- 
greg

Re: what would tar file FDW look like?

From

Andrew Dunstan

Date:

17 August 2015, 15:03:37


On 08/17/2015 10:14 AM, Bear Giles wrote:
> I'm starting to work on a tar FDW as a proxy for a much more specific 
> FDW. (It's the 'faster to build two and toss the first away' approach 
> - tar lets me get the FDW stuff nailed down before attacking the more 
> complex container.) It could also be useful in its own right, or as 
> the basis for a zip file FDW.
>
> I have figured out that in one mode the FDW mapping that would take 
> the name of the tarball as an option and produce a relation that has 
> all of the metadata for the contained files - filename, size, owner, 
> timestamp, etc. I can use the same approach I used for the /etc/passwd 
> FDW for that.
>
> (BTW the current version is at 
> https://github.com/beargiles/passwd-fdw. It's skimpy on automated 
> tests until I can figure out how to handle the user mapping but it works.)
>
> The problem is the second mode where I pull a single file out of the 
> FDW. I've identified three approachs so far:
>
> 1. A FDW mapping specific to each file. It would take the name of the 
> tarfile and the embedded file. Cleanest in some ways but it would be a 
> real pain if you're reading a tarball dynamically.
>
> 2. A user-defined function that takes the name of the tarball and file 
> and returns a blob. This is the traditional approach but why bother 
> with a FDW then? It also brings up access control issues since it 
> requires disclosure of the tarball name to the user. A FDW could hide 
> that.
>
> 3. A user-defined function that takes a tar FDW and the name of a file 
> and returns a blob. I think this is the best approach but I don't know 
> if I can specify a FDW as a parameter or how to access it.
>
> I've skimmed the existing list of FDW but didn't find anything that 
> can serve as a model. The foreign DB are closest but, again, they 
> aren't designed for dynamic use where you want to do something with 
> every file in an archive / table in a foreign DB.
>
> Is there an obvious approach? Or is it simply a bad match for FDW and 
> should be two standard UDF?  (One returns the metadata, the second 
> returns the specific file.)
>
>


I would probably do something like this:

In this mode, define a table that has <path, blob>. To get the blob for 
a single file, just do "select blob from fdwtable where path = 
'/path/to/foo'". Make sure you process the qual in the FDW.

e.g.
   create foreign table tarblobs (path text, blob bytea)   server tarfiles options (filename  '/path/to/tarball', mode
'contents');


cheers

andrew

Re: what would tar file FDW look like?

From

Bear Giles

Date:

17 August 2015, 15:36:26

<div dir="ltr"><div class="gmail_default" style="font-family:tahoma,sans-serif;color:#000000">I've written readers for
bothfrom scratch. Tar isn't that bad since it's blocked - you read the header, skip forward N blocks, continue. The
hardestpart is setting up the decompression libraries if you want to support tar.gz or tar.bz2 files.</div><div
class="gmail_default"style="font-family:tahoma,sans-serif;color:#000000"><br /></div><div class="gmail_default"
style="font-family:tahoma,sans-serif;color:#000000">Zipfiles are more complex. You have (iirc) 5 control blocks - start
ofarchive, start of file, end of file, start of index, end of archive, and the information in the control block is
prettylimited. That's not a huge burden since there's support for extensions for things like the unix file metadata.
Onecomplication is that you need to support compression from the start.</div><div class="gmail_default"
style="font-family:tahoma,sans-serif;color:#000000"><br/></div><div class="gmail_default"
style="font-family:tahoma,sans-serif;color:#000000">Zipfiles support two types of encryption. There's a really weak
versionthat almost nobody supports and a much stronger modern version that's subject to license restrictions.  (Some
peopleuse the weak version on embedded systems because of legal requirements to /do something/, no matter how
lame.)</div><divclass="gmail_default" style="font-family:tahoma,sans-serif;color:#000000"><br /></div><div
class="gmail_default"style="font-family:tahoma,sans-serif;color:#000000">There are third-party libraries, of course,
butthat introduces dependencies. Both formats are simple enough to write from scratch.</div><div class="gmail_default"
style="font-family:tahoma,sans-serif;color:#000000"><br/></div><div class="gmail_default"
style="font-family:tahoma,sans-serif;color:#000000">Iguess my bigger question is if there's an interest in either or
bothfor "real" use. I'm doing this as an exercise but am willing to contrib the code if there's a general interest in
it.</div><divclass="gmail_default" style="font-family:tahoma,sans-serif;color:#000000"><br /></div><div
class="gmail_default"style="font-family:tahoma,sans-serif;color:#000000">(BTW the more complex object I'm working on is
the.p12 keystore for digital certificates and private keys. We have everything we need in the openssl library so
there'sno additional third-party dependencies. I have a minimal FDW for the digital certificate itself and am now
workingon a way to access keys stored in a standard format on the filesystem instead of in the database itself. A
naturalfit is a specialized archive FDW. Unlike tar and zip it will have two payloads, the digital certificate and the
(optionallyencrypted) private key. It has searchable metadata, e.g., finding all records with a specific
subject.)</div><divclass="gmail_default" style="font-family:tahoma,sans-serif;color:#000000"><br /></div><div
class="gmail_default"style="font-family:tahoma,sans-serif;color:#000000">Bear</div></div><div class="gmail_extra"><br
/><divclass="gmail_quote">On Mon, Aug 17, 2015 at 8:29 AM, Greg Stark <span dir="ltr"><<a
href="mailto:stark@mit.edu"target="_blank">stark@mit.edu</a>></span> wrote:<br /><blockquote class="gmail_quote"
style="margin:00 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><span class="">On Mon, Aug 17, 2015 at 3:14 PM,
BearGiles <<a href="mailto:bgiles@coyotesong.com">bgiles@coyotesong.com</a>> wrote:<br /> > I'm starting to
workon a tar FDW as a proxy for a much more specific FDW.<br /> > (It's the 'faster to build two and toss the first
away'approach - tar lets<br /> > me get the FDW stuff nailed down before attacking the more complex<br /> >
container.)It could also be useful in its own right, or as the basis for a<br /> > zip file FDW.<br /><br
/></span>Hm.tar may be a bad fit where zip may be much easier. Tar has no<br /> index or table of contents. You have to
scanthe entire file to find<br /> all the members. IIRC Zip does have a table of contents at the end of<br /> the
file.<br/><br /> The most efficient way to process a tar file is to describe exactly<br /> what you want to happen with
eachmember and then process it linearly<br /> from start to end (or until you've found the members you're looking<br />
for).Trying to return meta info and then go looking for individual<br /> members will be quite slow and have a large
startupcost.<br /><span class="HOEnZb"><font color="#888888"><br /><br /> --<br /> greg<br
/></font></span></blockquote></div><br/></div>