[NCLUG] A *nix DFS alternative?
DJ Eshelman
djsbignews at gmail.com
Thu Feb 18 10:03:51 MST 2010
While DRBD would be great in a datacenter, I'm not sure it will work
for this particular need, since it's a single block device- I might as
well mount a web folder and wait for files to cache locally, which is
not a good option for this usage scenario, especially because Comcast
would hate me more than they probably already do if I were using
something like this with constant communication. I like what I'm
seeing with Amazon AWS (S3)- cheap for what it is; I use Dropbox
myself, which I know runs on S3. If this gets too big I suppose I
could go that route; for now, though, I've already got a few terabytes
of storage pretty cheap.
As for AFS- how much constant contact does it need with member
servers? As in, could it tolerate several hours of link downtime? I'm
intrigued... especially if CODA would work well. I really do like the
idea of having a unified namespace instead of a 'mirrored' namespace.
In talking more with my wife about how things would work, I think
rsync may still work: since she's not saving the same file back over
itself (which I suppose makes sense), there wouldn't be deltas at all-
there'd just be new files appearing in a directory structure. So as
long as inotifywait will work to trigger syncs on either end, it could
be hands-off- and if she's only ever saving new copies rather than
reworking the same files, I could conceivably just do a cron job.
Somehow I doubt the hit to CPU would be too bad on the home server
(the one at the office I'm not concerned about).
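For what it's worth, here is roughly what I had in mind for the
inotifywait-triggered version- just an untested sketch, and the paths
and hostname are made up:

    #!/bin/bash
    # Rough sketch: watch the photo tree and push new files to the
    # office server whenever something finishes writing.
    WATCH_DIR=/srv/photos
    REMOTE=office-server:/srv/photos

    while inotifywait -r -e close_write -e moved_to "$WATCH_DIR"; do
        # Something landed under WATCH_DIR; let rsync work out what.
        rsync -a --partial "$WATCH_DIR"/ "$REMOTE"/
    done
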
Bob is correct that a manual process is a bad idea- I just see that
ending badly :)
I think you may be right, Bob- I probably need to worry about file
structure a lot- I could see this getting out of control in a hurry.
Perhaps I should also look into document control systems, where
basically everything is kept in a flat structure and given a coded
filename... I suppose the issue with that would be keeping that
database in sync as well... man. Okay, scratch that :)
One question though:
> when your web interface accepts an
> upload then add that directory to the list to be sync'd from site to
> site. Since you already know what files have changed on the side that
> has changed then you already know what directories to sync over.
So... how does the system know there are new files coming in from the
web? I have tried PHP triggering cron jobs before with limited
success- would that work, or were you thinking of something else?
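What I'm picturing (maybe this is what you meant) is the upload script
simply appending the changed directory to a plain queue file, with a
cron job draining that list- something like this completely untested
sketch, with made-up paths:

    #!/bin/bash
    # The PHP upload handler appends the changed directory, one per
    # line and relative to /srv/photos, to $QUEUE; cron runs this every
    # few minutes to drain the list.
    QUEUE=/var/tmp/dirs-to-sync
    REMOTE=office-server:/srv/photos

    [ -s "$QUEUE" ] || exit 0       # nothing queued, nothing to do
    mv "$QUEUE" "$QUEUE.work"       # grab the current batch

    sort -u "$QUEUE.work" | while read -r dir; do
        # --relative with the /./ marker recreates $dir under $REMOTE
        rsync -a --relative "/srv/photos/./$dir/" "$REMOTE"/
    done
    rm -f "$QUEUE.work"
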
As for compression, Adobe RAW images are not compressed, and
apparently neither are the kinds of TIFFs the pros use. I'm sure JPEG
versions would be included, and of course I'm not worried about those.
The project I had been watching was TrafficSqueezer
http://www.trafficsqueezer.org/ , which now has corporate
sponsorship... always a good sign that closed source is coming.
Wow- head is totally spinning now (probably in a good way).
Thanks for the thoughts- it's always fun to see what everyone thinks!
-DJ
On 2/17/2010 7:07 PM, Bob Proulx wrote:
> DJ Eshelman wrote:
>
>> Basically when you're dealing with several hundred (eventually
>> thousands) of files that are upwards of 16 MB each, the most workable
>> solution is to have a local server that is syncing any changes made
>> on the backend-
>>
> As Chris mentioned, I also thought of DRBD when you were talking about
> this. But I haven't tried it. It seems to have a non-trivial
> activation energy, so it has been on my list of cool toys to play
> with forever, but I have never had enough energy to actually try it.
> But it seems very much like something you could use here.
>
>
>> that way if I'm at one office making changes, and my wife is at home
>> making changes, she's not having to download each file to her PC,
>> save it and re-upload it.
>>
> Eww... No, I wouldn't even suggest anything that manual. You would
> never be able to keep them in sync manually. You would eventually get
> drift between the sites in that case.
>
>
>> Rsync doesn't seem a reliable enough solution for this because the
>> traffic it would generate would be immense, too much to run during
>> the day.
>>
> As Grant mentioned, rsync is extremely reliable. It is also very
> efficient. I highly recommend rsync. Now as to whether rsync is the
> best tool for the task here that is a different question. It might or
> might not be part of the total solution. But it is a very good tool
> for doing what it does.
>
> Let's assume for the sake of discussion that you never modify an image
> in place. In that case there won't be anything to delta against on
> the target. In that case the entirety of the file needs to be copied
> from site A to site B. It won't matter if you are using the rsync
> algorithm or not. Whatever you use will need to copy the entire file.
> Rsync will simply be copying the entire file. The advantage to using
> rsync is when you run it the second time and it avoids copying the
> file since it is already in sync. Since image data is already
> compressed you won't be able to get much more compression out of the
> copy operation even if you use a compressed data link. Considering
> that you must already tolerate the bandwidth of uploading each image
> to the first site, you are only doubling your bandwidth by mirroring
> it to the second site. By that argument I think you must be able to
> tolerate syncing the mirror all of the time. You could certainly
> limit the bandwidth spent on mirroring and let it run a little bit
> behind, however. But 2x is really not so terribly bad.
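>
> For the record, the knob I would reach for to let it run behind a
> little is --bwlimit; a minimal sketch, with the host and paths being
> only examples:
>
>     # Cap mirror traffic at roughly 200 KB/s so it does not fight
>     # with interactive use of the link.
>     rsync -a --partial --bwlimit=200 /srv/photos/ mirror-host:/srv/photos/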
>
>
>> The benefit of Rsync, if it could be invoked in a smart way, is that
>> I could have a Linux server on one end and a Windows server on the
>> other if I really wanted to.
>>
> I think you are much better off keeping to only GNU/Linux machines on
> the server side. Then you have compatible tools everywhere. It just
> gets messy once you try to incorporate a really very different
> heterogeneous set of machines. And I won't even give you one guess as
> to which type I would choose to standardize upon on the server side.
>
>
>> I just question if running Rsync on a cron job would be efficient
>> once you get up to a few terabytes of data-
>>
> The limitation on using rsync is the number of files (and number of
> directories) in your directory hierarchy. It needs to walk those
> directories every time it runs. If you have enough RAM in your server
> machine then this will stay in the filesystem buffer cache. But
> using that for other things will push it out of cache too. The size
> of the files or the total size of all files really isn't a problem. I
> have used rsync to keep about 1.5T of data in sync between here and
> 300ms of Internet away to another site. It worked very well for me.
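>
> A quick way to get a feel for whether you are near trouble is to
> count what rsync will have to walk, or ask rsync itself with a dry
> run (paths and host here are only examples):
>
>     # How many files and directories are in the tree?
>     find /srv/photos | wc -l
>
>     # Or have rsync report what a run would look at without copying:
>     rsync -a -n --stats /srv/photos/ mirror-host:/srv/photos/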
>
>
>> you'd be running a glorified backup;
>>
> Backups have different needs. In particular this isn't a backup in
> the sense that if you corrupt a file on site A then the file will be
> corrupted on the mirror too. Mirroring never replaces backup.
>
>
>> the way to do this efficiently is to sync only when changes occur
>> and I haven't yet found a way to do that with Rsync.
>>
> The inotify functionality mentioned by Kasey can make this more
> efficient. But I am skeptical because the limits seem to be
> problematic. In your situation I think you would hit the limits and
> then be stuck about how to get past that point.
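>
> If you do go the inotify route, the limit you are most likely to hit
> is the per-user watch count, which you can at least inspect and raise
> (the number below is arbitrary):
>
>     # Current limit on inotify watches:
>     cat /proc/sys/fs/inotify/max_user_watches
>
>     # Raise it (as root); add to /etc/sysctl.conf to make it stick:
>     sysctl -w fs.inotify.max_user_watches=524288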
>
> I think you would be better off keeping track of what directories have
> had changes lately. That is, when your web interface accepts an
> upload then add that directory to the list to be sync'd from site to
> site. Since you already know what files have changed on the side that
> has changed then you already know what directories to sync over. That
> doesn't have any limits. To me that seems the most attractive path.
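>
> If that list is just a file of directory paths relative to the top of
> the tree, rsync can even consume it directly; note that with
> --files-from you have to add -r explicitly to recurse into the listed
> directories (names below are invented):
>
>     rsync -a -r --files-from=changed-dirs.txt /srv/photos/ mirror-host:/srv/photos/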
>
>
>> I'm not really sure subversion et al would really work for this because
>> this is really just file storage, not version builds.
>>
> I wouldn't use version control for this. Large blobs of binary data
> really don't version very well.
>
>
>> It's been 11 years since I've touched any kind of CVS program so I
>> really wouldn't know any more :)
>>
> I would *definitely* version control all of your code that surrounds
> everything. Don't put it off. Get it going early in your project. I
> use and recommend Git but any version control is better than none.
>
>
>> I appreciate the feedback-
>>
> For using rsync the /best/ case is unidirectional directories. It is
> always best to know that you are syncing site A to B here and B to A
> there. Grant suggested using 'rsync -u' to avoid updating newer files
> for bidirectional syncing. It is a good suggestion. I have done
> that. But it isn't without issues. For example if you ever want to
> delete a file then you need to delete it from all mirrors at the same
> time. Otherwise it will replicate back! You can end up playing
> whack-a-mole trying to remove a file sometimes. But if the sync
> direction is unidirectional then you can use the 'rsync --delete'
> option to remove files that were removed on the source side. That
> will keep the mirrors in perfect synchronization. If you can't design
> in unidirectional directories then you will need to handle the delete
> case somehow.
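>
> In concrete terms the two shapes look something like this (hosts and
> paths are only examples):
>
>     # Unidirectional: site A owns this tree, deletions propagate.
>     rsync -a --delete /srv/photos/ site-b:/srv/photos/
>
>     # Bidirectional: each side pushes; -u keeps newer files from being
>     # clobbered, but deleted files will creep back unless you handle
>     # deletion separately.
>     rsync -a -u /srv/photos/ site-b:/srv/photos/     # run at site A
>     rsync -a -u /srv/photos/ site-a:/srv/photos/     # run at site B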
>
> Obviously from this you can tell that I like rsync. But that doesn't
> mean it is the best tool for this task. It is just my hammer and so
> all problems look like nails to me. I would definitely investigate
> DRBD. I have used AFS before and it has replication. You might
> decide to try it or one of its descendants such as CODA. I know there
> are other distributed filesystems and I would probably push hard
> investigating a distributed filesystem solution before looking too far
> into a userland application solution. You might consider Lustre with
> ZFS. You might consider Amazon's S3.
>
> Just my 2 cents...
>
> Bob