[NCLUG] A *nix DFS alternative?

Bob Proulx bob at proulx.com
Thu Feb 18 13:51:58 MST 2010


DJ Eshelman wrote:
> As for AFS- how much constant contact does it need with member servers?  
> As in, could the link sustain several hours of downtime?  I'm 
> intrigued...  Especially if CODA would work well.  I really do like the 
> idea of having a unified namespace instead of a 'mirrored' namespace.

It has been a few years since I used AFS and my memory is growing
faint.  But among the replicated servers a mirror could be down for a
configurable number of days and then sync back up once it came online
again.  It has really been too long for me to say more than that; you
would need to set up a test case, try the various combinations and
failure modes, and learn enough about it to see whether it would work
for you.

> In talking more with my wife about how things would work, I think rsync 
> may still work; because she's not saving the same file back to itself 
> (which I suppose makes sense), therefore there wouldn't be deltas at 
> all- there'd just be new files in a structure.

Since all of the files are always new files you wouldn't really need
rsync's delta transfer and could use almost any copying tool.  But
rsync is still very good at syncing new files too.
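
Just to illustrate how simple the transfer itself could be, something
like this would push anything new across (the paths and office host
name here are made up):

    # One-way push of the photo tree to the office mirror.
    # -a preserves times and permissions, -v lists what moved,
    # --partial keeps partly transferred files if the link drops.
    rsync -av --partial /srv/photos/ office.example.com:/srv/photos/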

> So as long as inotifywait will work to trigger syncs on either end,
> it could be hands off-

If you decide to use inotify then I think you should also plan some
type of clean sweep too that walks everything just to ensure that it
is back in sync.  Because I am pessimistic enough to believe that
things will get out of sync eventually.  Stuff happens.
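
Just to sketch what I mean (hypothetical paths, and this assumes the
inotify-tools package is installed), the event-driven half on the
sending side could be a simple loop:

    #!/bin/sh
    # Whenever a file finishes being written (or is renamed into
    # place) anywhere under the tree, push the whole tree across.
    # Anything that slips through is caught by the periodic sweep.
    while inotifywait -r -e close_write -e moved_to /srv/photos >/dev/null 2>&1
    do
        rsync -a /srv/photos/ office.example.com:/srv/photos/
    done

The clean sweep can then be nothing more than that same rsync run
nightly from cron.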

> but if she's not working in the same files (saving new copies) then
> I could conceivably just do a cron job.  Somehow I doubt the hit to
> CPU would be too bad on the home server (the one at the office I'm
> not concerned)

It isn't cpu intensive.  It is disk subsystem intensive.  It will need
to walk all of the directories, stat(2) all of the files, and compare
file names, times, and sizes.  With a large number of files that can
beat on the disk pretty intensively.  And all of the time your cpu can
be quite idle.
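
If the periodic walk does end up hurting interactive use of the home
server, one trick is to run it at idle disk priority (ionice is Linux
specific, and the paths are again made up):

    # Crontab entry: 3am sweep at idle I/O priority; nice also
    # lowers the (already small) cpu priority.
    0 3 * * *  ionice -c 3 nice rsync -a /srv/photos/ office.example.com:/srv/photos/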

> I think you may be right, Bob- I probably need to worry about file 
> structure a lot- I could see this getting out of control in a hurry.  
> Perhaps I should also look into document control systems; basically 
> everything is kept in a flat structure and given a coded filename... I 
> suppose the issue with that would be keeping that database in sync as 
> well...  man.  Okay scratch that :)

Ooo, yes, keeping the database in sync.  A whole similar but different
problem. :-)

> One question though:
> > when your web interface accepts an upload then add that directory
> > to the list to be sync'd from site to site.  Since you already
> > know what files have changed on the side that has changed then you
> > already know what directories to sync over.
> 
> So... how does the system know there are new files coming in from the 
> web?  I have tried php triggering cron jobs before to limited success- 
> would that work, or were you thinking of something else?

I was making a large assumption based upon something you said.  I was
assuming that you were going to have a web interface and that these
images would be uploaded to the server through a web server on the
server side.  I also assumed a web browser on the client side for
interactive use.  On the client side you could also use batch-oriented
tools such as wget and curl, or scripted solutions in Perl, Python, or
Ruby.

If you are receiving the incoming image with a web server, let's say
by browser upload, then the code on the web server side of things has
access to the filename, the user who is uploading the file, and any
other context you decide to track with an optional session.  The code
handling the upload can save the list of files uploaded, or just the
list of directories in which files landed, or anything else useful.
That record can then be used to kick off an rsync either immediately
or queued for later.  Since the web server application code already
knows everything that is needed, I was suggesting that it queue the
uploaded files for sync across to the mirror.
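
For example, the upload code could hand the directory it just wrote
into to a tiny helper script along these lines (the script name and
spool location are hypothetical), which does nothing but drop one
small record into a spool directory:

    #!/bin/sh
    # queue-upload DIR -- record DIR as needing a sync.  The record
    # is written under a temporary name and then renamed into the
    # spool, so readers never see a half-written file and no locking
    # is needed.
    spool=/var/spool/photo-sync
    tmp="$spool/tmp.$$"
    printf '%s\n' "$1" > "$tmp"
    mv "$tmp" "$spool/$(date +%s).$$"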

I have about twenty different ways in my head that this could be done.
It could keep them in a full mysql/sqlite database, it could keep them
as individual files on disk without needing locking (think maildir),
it could keep them in a dedicated daemon's memory, it could use locks
to semaphore something on disk (needing semaphores makes this less
desirable), lots of other ideas, the sky is clear and open with lots
of possibilities.  If you want to discuss or brainstorm feel free to
ping me.
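
Taking the individual-files-on-disk route as an example, and
continuing the hypothetical spool above, the other half could be a
small consumer run from cron every few minutes:

    #!/bin/sh
    # sync-queued -- sync every directory named in the spool and
    # remove each record only after its rsync succeeded.  -R keeps
    # the same absolute path on the mirror, which assumes both
    # machines use the same directory layout.
    spool=/var/spool/photo-sync
    for f in "$spool"/*; do
        [ -e "$f" ] || continue          # spool is empty
        dir=$(cat "$f")
        rsync -aR "$dir" office.example.com:/ && rm -f "$f"
    done

Because a record is only removed after its rsync succeeds, a failed
transfer simply gets retried on the next run.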

Of course if files are being uploaded outside of any framework then
that won't work directly.  But it seemed a good assumption to make at
the time.

> I have tried php triggering cron jobs before to limited success- 
> would that work, or were you thinking of something else?

Of course you mean cron triggering php scripts and not the other way
around.  Although you might decide to use PHP in the web server
framework I could not recommend it for general purpose scripting,
especially for something as simple as running rsync from a script.
For that I would use portable /bin/sh shell, or Perl, Python, or Ruby.
PHP isn't great at catching errors from external programs; the others
handle errors better.
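
To show what I mean by error handling, the whole wrapper need not be
more than something like this (paths made up again); cron mails any
output to the crontab owner, so a failed run does not go unnoticed:

    #!/bin/sh
    # Push the photo tree to the mirror and complain on failure.
    src=/srv/photos/
    dst=office.example.com:/srv/photos/
    if ! rsync -a "$src" "$dst"; then
        echo "photo sync to $dst failed" >&2
        exit 1
    fi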

This tickles a mild not-quite-a-peeve of mine.  I know people who have
learned programming by hacking on php applications.  That is okay.
You need something to motivate you to learn to program, and if web
applications written in php were it, great.  But then they try to use
PHP for everything.  Branch out!  Learn about the world!  PHP was
designed for web applications, but it isn't great for system
programming such as working with files on disk and running external
programs.  For tasks that a simple 'find' command can do trivially I
have seen some really gnarly monstrosities cooked up, a zillion lines
of php tied to a complex mysql database schema!  They were really
quite surprised when I replaced it with a one line shell script along
the lines of 'find . -type f -newer markerfile -exec mv
--target-directory=DIR {} +'.

> As for compression, Adobe RAW images are not compressed, nor would
> the type of TIFFs the pros use, apparently.  I'm sure JPEG versions
> would be included and of course I'm not worried about those.

TIFFs!  They can be very large.  My mind staggers at how much data you
could end up shuffling around.  I would really be trying hard to
reduce those to a compressed format locally as soon as possible, and
to avoid moving them around or even storing them long term.
Hopefully.  If possible.
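
If the TIFFs really are stored uncompressed then even plain lossless
gzip can shrink them considerably before they ever cross the link.
Purely as a sketch, and only for files that are being archived rather
than actively edited (paths made up):

    # Losslessly compress TIFFs that have not been touched for a
    # week; gunzip recovers the original byte for byte.
    find /srv/photos -name '*.tif' -mtime +7 -exec gzip -9 {} +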

> The project I was following visually was TrafficSqueezer
> http://www.trafficsqueezer.org/ , which now has corporate
> sponsorship... always a good sign of a closed source coming.

I have seen those types of devices marketed before.  There are
definitely spots where they are useful.  For general web page browsing
between two LANs I think they will do pretty well.  But one problem
for them is when new unique incompressible data is continuously being
generated.  At my previous employer we tested one and that was the
soft spot.  We were always generating what appeared to it to be large
binary data blobs that it couldn't really compress (because it was
already compressed) and therefore just added overhead.  YMMV.

> Wow- head is totally spinning now (probably in a good way).
> Thanks for the thoughts- it's always fun to see what everyone thinks!

Thoughts are cheap and easy to generate in large quantities.  However
there is no replacement for hands on experience.  Success usually
comes not to those who are lucky enough to guess right the first time
but to those who keep going.

Bob


