[NCLUG] A *nix DFS alternative?

DJ Eshelman djsbignews at gmail.com
Mon Feb 22 10:00:14 MST 2010


I think for now I'm just going to make sure the directory structure is
the same on both ends and use rsync until things get out of hand.
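
Something along these lines is what I have in mind - just a rough
sketch, and the hostname, port, and paths below are made up:

#!/usr/bin/env python
# Rough sketch of the rsync approach: mirror the local photo tree to
# the office server over ssh.  Hostname, port, and paths are
# placeholders, not real ones.
import subprocess
import sys

SRC = "/srv/photos/"                      # trailing slash: sync the contents
DEST = "office.example.com:/srv/photos/"  # hypothetical office server

cmd = [
    "rsync",
    "-az",                # archive mode, compress in transit (helps with RAW)
    "--partial",          # keep partial transfers so big files can resume
    "--delete",           # keep both ends structured identically (careful!)
    "-e", "ssh -p 2222",  # non-standard ssh port (see below)
    SRC, DEST,
]

sys.exit(subprocess.call(cmd))

Dropped into cron, that covers the scheduled syncs until something
smarter is needed.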

The torrent idea is interesting; I'll have to look into that some 
more.  I'm a little concerned about active changes and, of course, 
security: if this solution works for me, it's very likely we'll end up 
doing the same thing at my office, which means we'll need encrypted 
transfers for HIPAA and SBO compliance.  Murder is an interesting use 
of the torrent structure, but I do wonder whether it's the most 
efficient option, especially for large files.  I was also thinking 
about doing more research into how Google handles YouTube, but Google 
has always been an enigma wrapped in a mystery :)

So, can you set pyshaper to limit processes on a schedule, or is it 
just a global setting?  Limiting bandwidth per process would be a lot 
more efficient than QoS, I think, especially because I'd typically 
assign ssh a high priority and couldn't do that in a case like this :)  
I suppose it'd probably be wise to use non-standard ssh ports for the 
file transfers anyway, now that I think about it...
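
Even if pyshaper can't do schedules, I could get most of the way there
by picking the bandwidth cap at run time instead of shaping at the QoS
level.  Rough sketch only - the hours and limits are arbitrary, and
the host and paths are the same made-up ones as above:

#!/usr/bin/env python
# Rough sketch: choose an rsync bandwidth cap based on time of day
# rather than shaping traffic with QoS.  Hours and limits are
# arbitrary; host and paths are placeholders.
import datetime
import subprocess

def current_bwlimit_kbps():
    hour = datetime.datetime.now().hour
    if 8 <= hour < 18:   # business hours: leave headroom for ssh etc.
        return 256       # rsync --bwlimit is in KB/s
    return 0             # 0 means unlimited overnight

cmd = ["rsync", "-az", "--partial",
       "--bwlimit=%d" % current_bwlimit_kbps(),
       "-e", "ssh -p 2222",
       "/srv/photos/", "office.example.com:/srv/photos/"]
subprocess.call(cmd)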

BTW, as far as M$ DFS goes, there is versioning in a way: any replaced 
or version-conflicted files are stored in a hidden folder and purged 
using a variety of rules (typically age or size).  That would come into 
play with real-time distribution needs, but you're correct - it's not 
intended for versioning.
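
Those purge rules are simple enough to reproduce by hand if it comes
to that.  Purely to illustrate the kind of rule I mean - the conflict
directory and the size cap here are invented:

#!/usr/bin/env python
# Illustration of an age/size purge rule: delete the oldest files in a
# conflict store until it fits under a size cap.  The directory and
# the cap are invented for the example.
import os

CONFLICT_DIR = "/srv/photos/.conflicts"
MAX_BYTES = 10 * 1024 ** 3   # 10 GB cap

files = []
for name in os.listdir(CONFLICT_DIR):
    path = os.path.join(CONFLICT_DIR, name)
    if os.path.isfile(path):
        st = os.stat(path)
        files.append((st.st_mtime, st.st_size, path))

total = sum(size for _, size, _ in files)
for _, size, path in sorted(files):   # oldest first
    if total <= MAX_BYTES:
        break
    os.remove(path)
    total -= size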

As a backup solution, I do like the ideas in BackupPC's storage model, 
where common data isn't duplicated - it's only referenced, and rebuilt 
on the fly for recovery.  Pretty darn slick, and probably something 
I'll be trying myself: back everything up to an on-site server, then 
replicate the changes to a master server or out to the S3 cloud.  Could 
be a good service to offer our clients...
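
The core trick, as I understand it, is content pooling: hash each
file, keep one copy per hash, and hard-link everything else to it.
Very roughly - the hash choice and paths here are just for
illustration, not how BackupPC actually lays things out:

#!/usr/bin/env python
# Very rough sketch of content pooling: store each unique file once
# (keyed by content hash) and hard-link duplicates to the pooled copy.
# Paths and hash choice are illustrative only.
import hashlib
import os

POOL = "/srv/backup/pool"
BACKUP = "/srv/backup/latest"

def file_hash(path):
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1024 * 1024), b""):
            h.update(chunk)
    return h.hexdigest()

if not os.path.isdir(POOL):
    os.makedirs(POOL)

for dirpath, dirnames, filenames in os.walk(BACKUP):
    for name in filenames:
        path = os.path.join(dirpath, name)
        pooled = os.path.join(POOL, file_hash(path))
        if os.path.exists(pooled):
            os.remove(path)        # duplicate: replace it with a link
            os.link(pooled, path)  # to the pooled copy
        else:
            os.link(path, pooled)  # first time seen: add it to the pool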

-DJ



On 2/21/2010 6:10 PM, John Gilmore wrote:
> One of the problems mentioned with git is that it munges binary files
> by attempting to merge changes when it doesn't realize the file is
> binary.
>
> For an uncompressed image, that should actually work fine.
>
> And it implies that it would give you change tracking without storing
> complete copies of the different files.
>
> I really doubt that any other approach discussed here will give you
> change control & multiple versions without having complete copies of
> each revision. This is because they're going to treat them as binary
> files, and not attempt to look deeper. So if that's a requirement,
> you'll probably have to use git, SVN, or something like them. The file
> you want to look at is ".gitattributes" for control of diff etc. You
> can actually force it to do diffs but not automatic merges of
> differing files - it would throw an error if the file was changed in
> both places instead of attempting to merge changes, but would still
> store differences only.
>
> The other approach that MIGHT give you something similar would be
> block-level mirroring with a versioning Logical volume. Sounds like
> trouble waiting to happen to me.
>
> OTOH, I don't think that the M$ approach would give you this either.
>
> But you don't need bit-level version storage, just bit-level change
> transmission, so this may not be relevant.
>
> Also, all version control based systems are going to try to give you a
> "working copy" of all the files in the dataset. This is required for a
> source compile, but isn't what you want when working with a large
> repository of images! Possibly mitigated by using branches, one branch
> per customer probably. But you'd still have to do something odd to
> check out/in single files.
>
> Considering the class of programs originally intended to track
> changes in source code, I don't think you need to look any further
> than git. The others typically have larger repositories, keep multiple
> copies of files, etc. All bad things for your application. I think you
> could fairly easily make git work by wrapping the single file
> checkin/upload in web-foo for your clients, and wrapping single-file
> checkout in something similar for your wife, the one doing the file
> modifications.
>
> Managing bandwidth is going to be a bit sticky no matter what you use,
> I think. Do you want client-uploaded files to be instantly moved to
> home? Probably not: if a single user uploads the latest wedding shoot,
> you want that to wait till night, when you have plenty of bandwidth.
> BUT the three pictures he's paying to have modified need to be
> available at home immediately so your wife can work on them.
>
> This obviously cannot be an automated process. Your wife will have to
> log into the office server and download the pictures she needs. You'll
> have to have a script in place to selectively fetch just the ones she
> needs on command.
>
> OTOH, copying files back to the office can be automated. Assuming your
> wife will only be modifying files by hand, they can be automatically
> uploaded to the office without worrying about bandwidth. Just make
> sure she knows that if SHE wants to upload HER latest wedding shoot,
> she'll have to stop the automatic process, and it'll be restarted
> automatically that evening. Or, I suppose, just make it choke and die
> if more than X MB of files are merged to the local repository.
>
> Or look into "pyshaper" which can limit bandwidth per process. So you
> can give the interactive stuff more bandwidth, and limit the automatic
> stuff during the daytime.
>
> On 2/19/10, Stephen Warren <swarren at wwwdotorg.org> wrote:
>    
>> DJ Eshelman wrote:
>>      
>>> I think I may have posed this question before but I'm still having
>>> trouble believing Microsoft has the only solution.
>>>
>>> Here's the situation:
>>>
>>> I want to start a business for my wife that will serve professional
>>> photographers and others who want photos professionally retouched,
>>> and hopefully also sell it as a storage solution (we'll get to that
>>> later: Storage as a Service for photographers and AV studios that
>>> have high storage needs but low budgets - not exactly IronMountain's
>>> niche market).
>>>
>>> We want to have the main server (facing clients) at my office, where
>>> we have bandwidth to spare and can handle upwards of 20 Mbit/sec
>>> transfers, not to mention run on a good Xen (or maybe ESXi) server, so
>>> I can sell this reliably and scalably down the road.
>>>
>>> What I want at home is twofold: the ability to have near-immediate
>>> bit-level sync over a VPN (preferably with good compression, as RAW
>>> photographs tend to be quite bulky), and the ability to work directly
>>> from this server at home, independently of the main server at the
>>> office.  It's a branch server, in a sense, but with different
>>> permissions/user accounts and completely isolated file storage.  This
>>> gives a good level of backup/redundancy, as I can just do delta
>>> backups.
>>>        
>> I should probably read the whole thread, but I use unison for this kind of
>> thing. It is basically a multi-master rsync.
>>
>> In the past, I've had file-stores on two servers sync using a cron job
>> that ran every 10 minutes; almost immediate. As I think others have noted,
>> inotify/similar could decrease latency here.
>>
>> More recently, I share our photo tree across 3 laptops and a server using
>> manual unison invocations.
>>
>>


