


Something the filesystem can't do: union

I think I have a novel syscall proposal.  

I have a table represented as N files, one file per column.  Each file
is of a uniform, known type, and is mapped into memory.

On this table I perform a relational restrict operation (SQL WHERE),
testing each row against a predicate.  The output is manifested as a
similar set of memory-mapped files.

The restriction can be parallelized and order-preserving.  The work
could be partitioned across multiple processors.  For P processors,
each process p would read the rows from (p-1)/P through p/P of the
table.  (That is, given 100 rows and P=4, process 3 reads rows
[50,75).)
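That partitioning arithmetic can be sketched as follows (the function
name is mine, not part of any proposal):

```c
#include <stddef.h>

/* Half-open row range [start, end) assigned to process p (1-based)
 * of P, for n rows total, per the (p-1)/P .. p/P rule above. */
static void partition(size_t n, unsigned p, unsigned P,
                      size_t *start, size_t *end)
{
    *start = (size_t)(p - 1) * n / P;   /* inclusive lower bound */
    *end   = (size_t)p * n / P;         /* exclusive upper bound */
}
```

With n = 100, p = 3, P = 4 this yields [50,75), and the P ranges
together cover every row exactly once.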

At the end of the parallel stage, I have P sets of partial outputs.  Of
course, it's possible to concatenate them together to form a single
(multifile) output.  But why is that necessary?

A given logical output column consists of P components in a known
order.  Instead of

        $ cat part1 part2 part3 part4 > output

why can I not

        $ union -o output  part1 part2 part3 part4

where the putative union(1) creates a file by knitting together the
blocks assigned to its arguments?

ISTM that "cat foo* > bar" is a pretty common operation, and fairly
often the input is discarded once the output is ready.  And I have an
analogy:

     cp is to mv as cat is to union

To answer my own question -- why not? --  the filesystem doesn't offer
block level access or inode manipulation.  You can assign names to
inodes and place those names in directories, but you can't adjust what
resources an inode uses.

You may object that "union" could result in fragmented files.  But so
what?  If each fragment is a terabyte, the "fragmentation" resulting
from ganging the "fragments" together might be much preferable to
copying all the blocks around just for the sake of colocation.

It's not difficult to imagine a syscall to union two files without
exposing inode details.  In fact, it's easy to imagine that 

    int union( const char *from, const char *to )

might leave to the discretion of the kernel or filesystem whether to
diddle with the inodes or just extend the file conventionally, based on
disk geometry or whatever.  
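For contrast, here is what the conventional path amounts to in
userland -- the copying the proposed call would let the kernel skip.
(The name union_fallback is mine, union itself being a reserved word
in C; error handling is abbreviated.)

```c
#include <stdio.h>

/* Userland fallback for the proposed union(): append 'from' onto
 * 'to' by ordinary copying, then unlink 'from'.  Every block makes
 * a round trip through userspace -- exactly the work the syscall
 * would avoid. */
static int union_fallback(const char *from, const char *to)
{
    FILE *in = fopen(from, "rb");
    FILE *out = in ? fopen(to, "ab") : NULL;
    char buf[1 << 16];
    size_t n;
    int ok = in != NULL && out != NULL;

    while (ok && (n = fread(buf, 1, sizeof buf, in)) > 0)
        ok = fwrite(buf, 1, n, out) == n;
    if (out && fclose(out) != 0)
        ok = 0;
    if (in)
        fclose(in);
    return ok ? remove(from) : -1;
}
```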

Have you heard of this idea before?  Seen it implemented?

--jkl
James
12/14/2016 4:30:12 AM
comp.unix.programmer

On 12/13/2016 11:30 PM, James K. Lowden wrote:
> [...]
> A given logical output column consists of P components in a known
> order.  Instead of
>
>         $ cat part1 part2 part3 part4 > output
>
> why can I not
>
>         $ union -o output  part1 part2 part3 part4
>
> where the putative union(1) creates a file by knitting together the
> blocks assigned to its arguments?

     One difficulty would be dealing with allocation "atoms" --
"blocks" or "pages" or whatever your file system of choice likes
to call them.  In an ordinary file, every block except the last
is chock-a-block full, but the file you illustrate could have as
many as four partially-filled blocks.  (Even in sparse files,
the gaps are block-aligned.)  So, how many such partial-block
gaps might appear, how much bookkeeping might the file system
need to do, and could lseek() and stat() be done efficiently?

     Not to say this would be an insuperable amount of work, but
it would probably be a "more than epsilon" amount.

> Have you heard of this idea before?  Seen it implemented?

     I've seen something vaguely like it, but not on a file system
with Unixoid semantics.  (It supported things like deletion and
insertion in the middle of a file; you could build a file from
the middle outward instead of from beginning to end ...)  But
even that one required special conventions for unused block frags;
they were filled with zeroes that the applications had to know to
ignore (hence, such files were, er, "not recommended" for binary
data unless it had its own "record structure" to distinguish
padding from payload.)

-- 
esosman@comcast-dot-net.invalid
"Nobody ever went broke underestimating the intelligence of the
American public." -- HLM (paraphrased)
Eric
12/14/2016 5:23:49 AM
James K. Lowden <jklowden@speakeasy.net> wrote:
<snip>
> ISTM that "cat foo* > bar" is a pretty common operation, and fairly
> often the input is discarded once the output is ready.  And I
> have an analogy:
> 
>      cp is to mv as cat is to union
> 
> To answer my own question -- why not? --  the filesystem doesn't offer
> block level access or inode manipulation.  You can assign names to
> inodes and place those names in directories, but you can't adjust what
> resources an inode uses.
> 
> You may object that "union" could result in fragmented files.  But so
> what?  If each fragment is a terabyte, the "fragmentation" resulting
> from ganging the "fragments" together might be much preferable to
> copying all the blocks around just for the sake of colocation.
> 
> It's not difficult to imagine a syscall to union two files without
> exposing inode details.  In fact, it's easy to imagine that 
> 
>     int union( const char *from, const char *to )
> 
> might leave to the discretion of the kernel or filesystem whether to
> diddle with the inodes or just extend the file conventionally, based on
> disk geometry or whatever.  
> 
> Have you heard of this idea before?  Seen it implemented?
> 

Linux has the splice(2) system call. The idea behind splice is to allow you
to copy or move in-kernel buffers between objects (pipes, file buffer cache,
and sockets) without copying into and out of userspace; and with vmsplice(2)
even move user VM pages. It's similar to sendfile(2) (from FreeBSD and
Solaris) but far more generic.
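The usual file-to-file idiom looks something like this (splice
requires a pipe on one side, so the data takes two hops but stays in
kernel buffers; Linux-specific, error handling abbreviated):

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

/* File-to-file copy via splice(2): file -> pipe -> file, with no
 * round trip through userspace.  Returns bytes copied, or -1
 * (partial-write handling abbreviated for the sketch). */
static long splice_copy(int in_fd, int out_fd)
{
    int p[2];
    long total = 0;
    ssize_t n;

    if (pipe(p) < 0)
        return -1;
    while ((n = splice(in_fd, NULL, p[1], NULL, 1 << 16,
                       SPLICE_F_MOVE)) > 0) {
        if (splice(p[0], NULL, out_fd, NULL, (size_t)n,
                   SPLICE_F_MOVE) != n) {
            n = -1;
            break;
        }
        total += n;
    }
    close(p[0]);
    close(p[1]);
    return n < 0 ? -1 : total;
}
```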

There was at least one patch to optimize file-to-file splices, particularly
for filesystems like Btrfs where blocks could be shared between files:

  https://lwn.net/Articles/567086/  

See also

  https://lwn.net/Articles/550621/
  
william
12/14/2016 6:03:55 AM
On Wed, 14 Dec 2016 00:23:49 -0500
Eric Sosman <esosman@comcast-dot-net.invalid> wrote:

> In an ordinary file, every block except the last is chock-a-block
> full, but the file you illustrate could have as many as four
> partially-filled blocks.

Mmph.  Thinking in terms of memory-mapped files, the notion of
partially used blocks in the middle of the file is especially
problematic.  There's no hardware mechanism to "fake out" the block
alignment, to make file2 appear to be contiguous with file1 if its
final block isn't full.  Without that, the kernel has no advantage; any
"elision" of the unused portion of interim bytes could as easily be
done in userspace.  

Non-epsilon, indeed.  Thanks for the reply.  

--jkl
James
12/14/2016 3:59:19 PM
"James K. Lowden" wrote in message
<20161213233012.331fe9316e545212d6ace577@speakeasy.net>:
> why can I not
> 
>         $ union -o output  part1 part2 part3 part4
> 
> where the putative union(1) creates a file by knitting together the
> blocks assigned to its arguments?

Because it is awfully specific. To implement what you request, it would
require generic code for a system call, plus specific code in each
filesystem implementation plus fallback emulation code for the
filesystems where it is not implemented. All this to satisfy a single
use case, mostly useless for anything else, and giving only a slight
speedup compared to doing it in userland.

If you need this, it seems to me you are using the wrong file format or
the wrong tool.
Nicolas
12/14/2016 4:22:21 PM
On 14 Dec 2016 16:22:21 GMT
Nicolas George <nicolas$george@salle-s.org> wrote:

>
> "James K. Lowden" wrote in message
> <20161213233012.331fe9316e545212d6ace577@speakeasy.net>:
> > why can I not
> > 
> >         $ union -o output  part1 part2 part3 part4
> > 
> > where the putative union(1) creates a file by knitting together the
> > blocks assigned to its arguments?
> 
> Because it is awfully specific. To implement what you request, it
> would require generic code for a system call,

true for any new call

> plus specific code in each filesystem implementation

true for any new filesystem feature

> plus fallback emulation code for the filesystems where it is not
> implemented.

It's fairly common for a syscall to fail if the underlying device
doesn't support it.  ENODEV comes to mind.

> All this to satisfy a single use case, mostly useless for anything
> else, and giving only a slight speedup compared to doing it in
> userland.

Could I do what I want to do in userland?  Of course.  What I can't do
is join many files into one without moving the data on the disk.

Is it highly specialized?  I don't know.  Is splice(2) specialized?
Is passing file descriptors over a Unix domain socket?  I think we
could find quite a few arcane features of the kernel.

Eric Sosman convinced me it's impractical on technical grounds.  No
underlying machine could map a sparse file such as I describe as
contiguous memory.  No device could represent the beginning of a
second file as contiguous with its predecessor.  Camouflaging that is
no simple matter.

On the other hand, a lot of effort these days goes into minimizing
copying.  We've had copy-on-write forever, and it is implemented in
some filesystems.  C++ got move semantics.  And there's sendfile(2).
So it seemed like it was worth asking.

--jkl


James
12/15/2016 11:23:35 PM
On 14.12.16 06.23, Eric Sosman wrote:
> On 12/13/2016 11:30 PM, James K. Lowden wrote:
>> why can I not
>>
>>         $ union -o output  part1 part2 part3 part4
>>
>> where the putative union(1) creates a file by knitting together the
>> blocks assigned to its arguments?
>
>      One difficulty would be dealing with allocation "atoms" --
> "blocks" or "pages" or whatever your file system of choice likes
> to call them.  In an ordinary file, every block except the last
> is chock-a-block full, but the file you illustrate could have as
> many as four partially-filled blocks.  (Even in sparse files,
> the gaps are block-aligned.)  So, how many such partial-block
> gaps might appear, how much bookkeeping might the file system
> need to do, and could lseek() and stat() be done efficiently?

In fact it would be possible to write a driver that provides exactly
this behavior, with reasonable performance, at the VFS level.  It is
basically the data structure of a rope class.
When dealing with very large data, mapping it unfragmented into
memory can be difficult due to fragmentation of the virtual address
space, and furthermore insert/delete operations in the middle are
extraordinarily expensive.  (In fact, any editor that can deal with
large files has this problem.)
Cutting the data into arbitrary chunks improves the situation
significantly.  The resulting structure is similar to a B-tree, so
random access operations turn into O(log(n)), which is not that bad
in general.
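Concretely, the lookup such a chunked structure needs is just a
search over cumulative chunk ends; a sketch, with a flat sorted array
standing in for the tree nodes (names are mine):

```c
#include <stddef.h>

/* Random access into a chunked ("rope"-like) file image.  ends[]
 * holds the cumulative end offset of each of n chunks, sorted
 * ascending; binary search finds the chunk containing byte 'off'
 * in O(log n), the same bound a tree-backed rope gives. */
static size_t chunk_for(const size_t *ends, size_t n, size_t off)
{
    size_t lo = 0, hi = n;

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (ends[mid] <= off)
            lo = mid + 1;   /* off lies past this chunk */
        else
            hi = mid;
    }
    return lo;  /* index of first chunk whose end exceeds off */
}
```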

>      Not to say this would be an insuperable amount of work, but
> it would probably be a "more than epsilon" amount.

The question is more like: should this be a standard function of a
file system?  Is it really needed that often?
Many file formats (e.g. XML, JSON ...) do not tolerate simple
concatenation at the binary level.  In these cases it is useless.
If this requirement becomes more common, a one-to-many symlink could
be the solution.  However, this requires changes in the filesystem
API to access the feature.

>> Have you heard of this idea before?  Seen it implemented?
>
>      I've seen something vaguely like it, but not on a file system
> with Unixoid semantics.  (It supported things like deletion and
> insertion in the middle of a file; you could build a file from
> the middle outward instead of from beginning to end ...)  But
> even that one required special conventions for unused block frags;
> they were filled with zeroes that the applications had to know to
> ignore (hence, such files were, er, "not recommended" for binary
> data unless it had its own "record structure" to distinguish
> padding from payload.)

This sounds more like sparse files.  They can be quite useful if you
want to add trim support to a virtual disk of a VM.  There is no
other way to mark arbitrary chunks of the virtual disk file as free
in a way that the host file system could trim those regions too on
the physical device.


Marcel
Marcel
12/16/2016 8:13:57 AM
"James K. Lowden" wrote in message
<20161215182335.56c8f237068ac0140e55176d@speakeasy.net>:
> true for any new call

> true for any new filesystem feature 

This is why there are so few of them.

> It's fairly common for a syscall to fail if the underlying device
> doesn't support it.  ENODEV comes to mind. 

Then consider you have your syscall.

> Could I do what I want to do in userland?  Of course.  What I can't do
> is join many files into one

You can do that in userland.

>			      without moving the data on the disk.  

And this is a breach of abstraction.

> Is it highly specialized?  I don't know.  Is splice(2) specialized?
> Is passing file descriptors over a Unix domain socket?  I think we
> could find quite a few arcane features of the kernel.  

The features you did find are IMHO much less specialized than what you
request.

> On the other hand, a lot of effort these days goes into minimizing
> copying.

Yes, but it is supported by specialized file formats, not dumped on top
of an aging API.
Nicolas
12/17/2016 9:52:43 AM