Linux: File replication


Linux: File replication

Post by ^rooker »

In order to synchronize the files on 2 machines, I've written a few shell scripts to do the job.
(To avoid confusion: this is a ONE-way replication of WHOLE files. It will NOT detect byte-wise changes and it will NOT merge changes from both sides.)


The idea behind it is:
1) create a file list on Point A
2) create a file list on Point B
3) use "diff" to find the changes:
   a) deleted
   b) modified
   c) added
4) update the files on Point B (del/mod/add)
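
Roughly, as a sketch, steps 1) to 3) could look like this (paths and list names are just placeholders here, not the actual scripts):

Code: Select all

    #!/bin/sh
    # minimal sketch of the steps above - placeholder paths:
    SRC=/data/source        # Point A (local in this sketch)
    DST=/data/mirror        # Point B

    # 1) + 2) create the file lists, one "ls -l"-style line per file,
    #    relative to each root so the paths are comparable
    #    (for unchanged files the "ls -l" fields must match on both
    #    sides, i.e. sizes/dates were preserved when copying):
    ( cd "$SRC" && find . -type f -exec ls -l {} \; ) | sort > /tmp/list.A
    ( cd "$DST" && find . -type f -exec ls -l {} \; ) | sort > /tmp/list.B

    # 3) use "diff" to find the changes (splitting them into the
    #    deleted/modified/added buckets is shown in the next post):
    diff /tmp/list.B /tmp/list.A > /tmp/list.diff

    # 4) would then update Point B from those buckets (del/mod/add).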


...I thought about using checksum lists, but it turned out that a plain directory listing is sufficient, which saves a lot of processing time.

This replication runs in an environment with so many files in ONE folder that ls-ing the contents gives a "bash: /bin/ls: Argument list too long" (the shell's glob expansion overflows the argument limit), so you can imagine HOW many files there are...

Of course, going for checksums would be the most reliable way.
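
For comparison, the checksum variant would only change how the listing lines are produced - every file has to be read completely, which is where the processing time goes (and since "find" hands each file to "-exec" itself, there's no "Argument list too long" either):

Code: Select all

    # plain listing: only stat()s each file (fast):
    find . -type f -exec ls -l {} \; | sort > /tmp/list.A

    # checksum listing: reads every single byte (slow, but reliable):
    find . -type f -exec md5sum {} \; | sort > /tmp/list.A.md5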

detect changes using diff, grep and awk

Post by ^rooker »

The really tricky part is distinguishing between deleted, modified and added files using the file lists.

The tools I used are "diff", "grep" and "awk":

Code: Select all


    CUR_LIST=$DIR_LIST;   # plain directory listing - MD5 sums skipped on purpose

    if test -z "$FIRST_TIME"; then
      tmp_main=$CUR_LIST.$DIFF_SUFFIX;

      # find changes in the file structure:
      $DIFF "$CUR_LIST.old" "$CUR_LIST" > "$tmp_main"

      tmp_old=$CUR_LIST.old.$DIFF_SUFFIX;
      tmp_new=$CUR_LIST.new.$DIFF_SUFFIX;
      tmp_diff=$CUR_LIST.temp.$DIFF_SUFFIX;

      # field 10 = filename: 9 fields of an "ls -l" line plus diff's "<"/">" marker
      grep "^<" "$tmp_main" | awk '{print $10}' > "$tmp_old"
      grep "^>" "$tmp_main" | awk '{print $10}' > "$tmp_new"

      # find files which do NOT appear in both listings (= added, deleted):
      $DIFF "$tmp_old" "$tmp_new" > "$tmp_diff"

      # create temp filenames:
      tmp_del=$tmp_main.$DEL_DIFF_PRE;
      tmp_add=$tmp_main.$ADD_DIFF_PRE;
      tmp_mod=$tmp_main.$MOD_DIFF_PRE;

      # only in the old listing = deleted; only in the new one = added:
      grep "^<" "$tmp_diff" | awk '{print $2}' > "$tmp_del"
      grep "^>" "$tmp_diff" | awk '{print $2}' > "$tmp_add"

      # find files which appear in the old listing, but haven't been deleted (= modified):
      $DIFF "$tmp_old" "$tmp_del" > "$tmp_diff"

      grep "^<" "$tmp_diff" | awk '{print $2}' > "$tmp_mod"

      # append all three buckets to ONE difference listing:
      cat "$tmp_add" >> "$CUR_DIFF_LIST.$ADD_DIFF_PRE"
      cat "$tmp_mod" >> "$CUR_DIFF_LIST.$MOD_DIFF_PRE"
      cat "$tmp_del" >> "$CUR_DIFF_LIST.$DEL_DIFF_PRE"

      cat "$tmp_main" >> "$CUR_DIFF_LIST.all"
    else
      echo -n "  Cannot compare since this is the first run.";
    fi
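
To see what lands in which bucket, here's a toy run with two made-up "ls -l"-style listings:

Code: Select all

    # fabricated example listings (field 9 = filename):
    printf '%s\n' \
      '-rw-r--r-- 1 user users 100 Feb 11 12:00 a.txt' \
      '-rw-r--r-- 1 user users 200 Feb 11 12:00 b.txt' > list.old
    printf '%s\n' \
      '-rw-r--r-- 1 user users 150 Feb 11 13:00 a.txt' \
      '-rw-r--r-- 1 user users 300 Feb 11 13:00 c.txt' > list.new

    diff list.old list.new
    # "<" lines = old listing, ">" lines = new listing:
    #   a.txt shows up on BOTH sides        -> modified
    #   b.txt only on the "<" side          -> deleted
    #   c.txt only on the ">" side          -> added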

replication environment: LAN or WAN?

Post by ^rooker »

How to get the files?

In some environments, shell access is not available (e.g. at a typical website provider), so I thought it might be useful to implement the whole thing as a PHP program.

The "main" site (A) would then be quite passive, by just delivering the dirlists "on demand" (over php).
For security reasons there should be some kind of "token" in to-be-replicated folders, so that the php-script knows if it is allowed to hand out a listing at all.

The B-site could then calculate the changes and "request" a link-list from the php-script by uploading the "delta-list". The linklist could be downloaded from A to B by using wget's recursive functionality.

There should also be some sort of cheap authentication mechanism.
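
As a sketch of the B-site client (the script names "list.php"/"linklist.php", the "token" parameter and the host are all invented - none of this exists yet):

Code: Select all

    SITE_A="http://www.example.com"
    TOKEN="s3cr3t"        # the cheap shared-secret "authentication"

    # 1) fetch the current dirlist from the passive A-site:
    wget -q -O list.A "$SITE_A/list.php?token=$TOKEN"

    # 2) calculate the delta locally (see the diff/grep/awk post above),
    #    then upload it and get the link list back:
    wget -q -O linklist --post-file=delta.list "$SITE_A/linklist.php?token=$TOKEN"

    # 3) download the added/modified files ("-i" feeds wget the URL list,
    #    "-x" recreates the directory structure locally):
    wget -q -x -i linklist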
-----------------

If scp works, the client could also download the files over ssh. But it is very unlikely that a typical website provider allows this, so it should only be considered if both computers are under your control.

For automated scp you would have to use key-based logins (SSH2), which would enable an intruder on the B-site to log on to the A-site without a password. Very dangerous!

Of course, if you're thinking about a LAN where the 2 replication sites are only reachable internally and you do not want to set up a web server, the scp variant would be a useful alternative.
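
For completeness, the scp variant could look like this (key name and script path are made up). The danger of the passwordless key can at least be softened by tying it to one single command on the A-site:

Code: Select all

    # B-site pulls the changed files from the A-site:
    scp -i ~/.ssh/replication_key -r user@site-a:/data/source/ /data/mirror/

    # damage control in ~/.ssh/authorized_keys on the A-site - this key
    # may then ONLY run the given command, no shell, no forwarding:
    #   command="/usr/local/bin/send-files.sh",no-pty,no-port-forwarding ssh-rsa AAAA... replication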

Uh

Post by gilthanaz »

:shock:

someone's already written it.

Post by ^rooker »

Hm...
I somehow had the feeling that there MUST be some replication functionality available under Linux, but back when I first wrote those simple scripts, I hadn't found any useful tool...

Seems that I've overlooked one:
http://rsync.samba.org/rsync/features.html

...but since it does really sophisticated replication (transmitting only the changed parts within files, ...), it uses a lot of memory, depending on the number of files to process.


Hm... would the few advantages of my version (small, simple, PHP) really justify implementing it?
(edit): I don't think so :cry: ... I've just checked rsync's parameters, and it looks like it can be "simplified" (e.g. "--whole-file").
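
For the record, the "simplified" rsync call would be something like this (paths and host are placeholders):

Code: Select all

    # one-way replication of WHOLE files, deletions included:
    rsync -a --whole-file --delete /data/source/ user@site-b:/data/mirror/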

More file synchronisation

Post by ^rooker »

Gilthanaz has found a quite promising file replication/synchronisation tool:

http://www.cis.upenn.edu/~bcpierce/unison/

The important feature we were looking for is that it runs on Linux AND Windows.
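
A minimal call would be something like this (the roots are placeholders; "-batch" answers all of unison's questions non-interactively):

Code: Select all

    # synchronise a local root with a remote one over ssh:
    unison /data/shared ssh://otherhost//data/shared -batch

Note that, unlike the scripts above, unison merges changes from both sides.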