FileSyncer

From Linux NFS


Revision as of 00:40, 16 August 2007

Chris Mason suggested a tool that can synchronize a local and remote file system using whatever tools it can find (snapshots, inotify, rsync, etc) -- should always work, but will find what it needs to do the job most efficiently.

Tool could do file system synchronization in real time, or disconnected operation, or periodic replication.

Need a UI mechanism to handle conflicts.

One central idea Chris had was to use inotify to drive specific rsyncs, thus avoiding a lot of page and inode cache pollution.
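A minimal sketch of that approach in Python: map each change notification to a narrow, non-recursive rsync of just the affected directory, so untouched parts of the tree are never walked or pulled into the cache. The helper name, paths, and remote host here are invented for illustration.

```python
import shlex

def rsync_command(changed_dir, src_root, dest_root):
    """Build an rsync argv that syncs only one changed directory.

    -d (--dirs) without -r copies the directory's own entries but does
    not recurse; subdirectories that change get their own events and
    their own narrow rsync invocation.
    """
    if not changed_dir.startswith(src_root):
        raise ValueError("changed path outside the synced tree")
    rel = changed_dir[len(src_root):].lstrip("/")
    return [
        "rsync", "-dlptgoD", "--delete",           # -a minus -r, plus -d
        "%s/%s/" % (src_root.rstrip("/"), rel),
        "%s/%s/" % (dest_root.rstrip("/"), rel),
    ]

# e.g. an inotify event arrives for /home/alice/src/project/docs:
cmd = rsync_command("/home/alice/src/project/docs", "/home/alice/src", "backup:/mirror")
print(" ".join(shlex.quote(c) for c in cmd))
```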

While inotify can tell us some object has changed, we can't really tell how it has changed. I thought of Huston's work on disconnected AFS (ie using a journal of changes to drive the synchronization process). A stackable file system seems a good tool for intercepting file system changes.

Loaded up rdiff-backup and hypereistar (?) on picasso. bzzt.

Other search tools:

These appear to use backend search engines and indexing algorithms such as Lucene and Xapian, which is what I really want to review to learn something useful for the synchronizer.

"gamin" is a facility that provides file alteration monitoring using whatever facility is available (dnotify, inotify, whatever). Hmm, and it appears to be installed on F7, although the -devel and Python bindings are optional.

gamin Python bindings

The question is, "is this robust enough to handle a lot of changes at once?" The "inotify" model of monitoring file system changes is vulnerable if there's no way to detect a dropped change notification.

gamin detects changes to a file or directory, but I wonder if it scales well to detecting changes to a whole file system, or a subtree?

gamin appears to be based on a larger file alteration monitoring framework built by SGI:

FAM API

FAM is based on select(), thus it can monitor at most 1024 file descriptors at once, and is limited to monitoring a single directory per watch (useful for graphical file management tools).

So.

I don't think this type of functionality is what we want. We'd have to set up an event handler on every directory we're interested in -- potentially millions. Plus there's no way to watch all of these events scalably. Gamin itself doesn't seem to handle signals or exceptions while handle_event() is blocked.

I wonder if there is a different take on using inotify that handles large parts or whole file systems efficiently.

pyinotify looks much nicer. It appears to be documented reasonably, and even handles exceptions in a Python-like manner.

Here's an article by one of the co-developers who created inotify. One important thing to observe is IN_Q_OVERFLOW, which is an event that signifies that some events were dropped. The queue size is tunable via sysctl, and defaults to 16K events.
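A sketch of the recovery logic IN_Q_OVERFLOW forces on any consumer: once the queue overflows, individual events are gone, so the only safe move is a full rescan of the watched tree. The event tuples below are simulated; a real watcher would read them from an inotify descriptor (e.g. via pyinotify).

```python
IN_Q_OVERFLOW = 0x4000  # from <sys/inotify.h>

def apply_events(events, dirty, rescan):
    """Fold a stream of (mask, path) events into a 'dirty path' set.

    On overflow we clear the set and call rescan(), a caller-supplied
    callback (hypothetical here) that rebuilds it by walking the tree.
    """
    for mask, path in events:
        if mask & IN_Q_OVERFLOW:
            dirty.clear()
            dirty.update(rescan())   # events were lost; trust only a full walk
        else:
            dirty.add(path)
    return dirty

# Simulated stream: IN_MODIFY (0x2), IN_CREATE (0x100), then an overflow.
events = [(0x2, "/tree/a"), (0x100, "/tree/b"), (IN_Q_OVERFLOW, None)]
result = apply_events(events, set(), rescan=lambda: {"/tree/a", "/tree/b", "/tree/c"})
print(sorted(result))  # -> ['/tree/a', '/tree/b', '/tree/c']
```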

Downloaded Tracker, which seems pretty close to what we're looking for, is written in C, and uses inotify if available. Tracker may provide a clue about how to make this scale across thousands of directories.


multi-purpose generic file system syncer

description

  • a file synchronization tool that runs on most Unix-flavored operating systems
  • one-way or bidirectional synchronization
  • syncs file changes periodically or continuously
  • supports disconnected operation
  • can use advanced file system features to improve efficiency
  • the basic idea is to use an inotify-like mechanism to make the detection of file and directory modification more efficient

use cases

Unidirectional

clone - on-demand rsync-like copy of a portion or all of a file system to another file system

  • provides basic rsync functionality

replication - periodic rsync-like copy of a portion or all of a file system to another file system

  • provides a working copy, hot backup, or a replicant for improving availability or balancing load

backup - periodic rsync-like copy of a portion or all of a file system to a flattened and compressed representation of the file system

  • provides a more efficient incremental archiving scheme

synchronization - real-time update of a remote file system based on changes in a local file system

  • provides hot backup on a remote site for disaster recovery

Bidirectional (these are the hard ones)

disconnected - on-demand bi-directional update between two file systems with conflict resolution

mirror - periodic bi-directional update between two file systems with conflict resolution

cluster - real-time bi-directional update between two file systems

  • provides two hot copies; concurrent access is moderated via normal file locking
  • we won't be doing this one

other examples

rsync on Unix

Synchronizes a file, subtree, or whole file system in one direction

Carbon Copy Cloner

Copies whole disks, but isn't designed for regular synchronization

Chronosync on Mac OS X

  • This tool is not for cloning file systems or disks; just copying part of a file system.
  • It can mirror (provide a clone)
  • It can do bi-directional sync

rdiff-backup

Desktop search tools

Examples


From the Tracker wiki :

 2. Linux kernel watchless file notification system for Tracker
   * currently we are using inotify which is not optimal for watching entire trees. (OSX has kernel file notifications by contrast)
   * Implement a loadable module for the kernel which allows all file activity to be passed to userspace Tracker
   * Ideally should make use of netlink
   * Allow tracker to connect and receive these events.
   * implement file notification handler for it in tracker

and

 6. Networked Tracker
   * Use avahi/dbus to locate remote tracker servers. Could make use of ssh or as suggested use gabriel.
   * Api would need to be added to add support for retrieving list of servers
   * TST would need to be changed to list all available servers to use and allow user to pick one
   * If using DBus over ssh then could use seahorse for ssh key pairing

File system assistants

Erez Zadok suggested looking at the recent FAST papers describing TraceFS. On the reverse end, ReplayFS would take the output stream from TraceFS and re-apply the changes to a file system.

requirements

  • need a good name for it (branding!)
  • operational efficiency
    • small page and inode cache footprint
    • moves only changed data and metadata
  • would like this to reside mostly or all in user space
  • should work "well enough" on any file system type independent of feature sets
  • can identify advanced features such as snapshots to improve efficiency
  • might even work between two actively modified file systems (multi-master replication)
  • do we want a push mode and a pull mode?
  • wizard UI for easy configuration?
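One concrete tactic for the "small page and inode cache footprint" requirement: on Linux, a catalog-building sweep can read files and then hint the kernel to drop the cached pages with posix_fadvise(). The helper name is invented; this is one possible approach, not a settled design.

```python
import os
import tempfile

def read_without_polluting_cache(path, bufsize=1 << 16):
    """Read a file, then hint the kernel to drop its cached pages.

    POSIX_FADV_DONTNEED says we will not reread the data soon, so the
    sweep does not evict the system's working set. posix_fadvise is
    not available everywhere, so we degrade to a plain read elsewhere.
    """
    size = 0
    with open(path, "rb") as f:
        while True:
            buf = f.read(bufsize)
            if not buf:
                break
            size += len(buf)
        if hasattr(os, "posix_fadvise"):
            os.posix_fadvise(f.fileno(), 0, 0, os.POSIX_FADV_DONTNEED)
    return size

with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"x" * 100000)
print(read_without_polluting_cache(tmp.name))  # -> 100000
os.unlink(tmp.name)
```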

major architectural features

  • prototype in Python (if it has decent GUI features)
    • Python has an inotify module!
    • It also has libbeagle-python -- hopefully this would provide an easy facility for constructing a prototype file system catalog
  • how to catalog all the files on a file system
    • Chris suggests looking at desktop search utilities such as recoll (beagle?)
    • SCM... does git have anything to offer here?
  • identifying changes statically
    • for example, when syncing the first time, or syncing after reconnecting
  • identifying changes dynamically (to reduce having to walk every file in each file system to find changes)
    • unionfs layer that can capture and journal changes coming through vfs
      • Erez suggests looking at TraceFS and ReplayFS in recent FAST proceedings
    • could use inotify
    • maybe use a journal to record changes permanently and play them back later
  • need a UI for handling conflicts on multi-master setups
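The "journal changes and play them back later" idea might look like this in a Python prototype. The record format and operation names are invented for illustration; a real journal would capture VFS operations via a stackable layer or inotify, and could feed a file, socket, or pipe.

```python
import io
import json

def journal_write(stream, op, path, **extra):
    """Append one change record to the journal as a JSON line."""
    rec = {"op": op, "path": path}
    rec.update(extra)
    stream.write(json.dumps(rec) + "\n")

def journal_replay(stream, state):
    """Apply journaled operations in order to a dict modelling the remote tree."""
    for line in stream:
        rec = json.loads(line)
        if rec["op"] == "write":
            state[rec["path"]] = rec.get("data", "")
        elif rec["op"] == "unlink":
            state.pop(rec["path"], None)
        elif rec["op"] == "rename":
            state[rec["to"]] = state.pop(rec["path"])
    return state

# Journal two changes as they happen, then replay them on the "remote" side.
buf = io.StringIO()
journal_write(buf, "write", "/a", data="hello")
journal_write(buf, "rename", "/a", to="/b")
buf.seek(0)
print(journal_replay(buf, {}))  # -> {'/b': 'hello'}
```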

operation

  • configuration
    • wizard-like thingie for initial set up
    • lots of damn checkboxes for experts
  • static mode
    • what's changed?
      • generate fresh catalogs
      • use the catalog to discover changes
      • push changes to remote
  • dynamic mode
    • journaling changes as they happen
      • into a catalog
      • into a local journal file
      • into a socket or pipe
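Static mode as listed above could be prototyped along these lines in Python: generate a fresh catalog of per-file metadata, then diff two catalogs to discover what must be pushed to the remote. The catalog structure is invented for illustration; a real one would live in a database or index, not an in-memory dict.

```python
import hashlib
import os
import tempfile

def make_catalog(root):
    """Walk a tree and record (size, mtime, sha1) for every file."""
    catalog = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            rel = os.path.relpath(full, root)
            st = os.stat(full)
            with open(full, "rb") as f:
                digest = hashlib.sha1(f.read()).hexdigest()
            catalog[rel] = (st.st_size, int(st.st_mtime), digest)
    return catalog

def diff_catalogs(old, new):
    """Return (changed_or_added, deleted) relative paths."""
    changed = [p for p, meta in new.items() if old.get(p) != meta]
    deleted = [p for p in old if p not in new]
    return sorted(changed), sorted(deleted)

# Catalog a tree, add one file, and let the catalog diff find the change.
root = tempfile.mkdtemp()
with open(os.path.join(root, "a.txt"), "w") as f:
    f.write("one")
before = make_catalog(root)
with open(os.path.join(root, "b.txt"), "w") as f:
    f.write("two")
after = make_catalog(root)
print(diff_catalogs(before, after))  # -> (['b.txt'], [])
```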

random thoughts

  • Need to install and try out Tracker
  • There's a window between when a monitored file system is mounted and when the monitoring daemon starts running, or if the daemon crashes. How do we detect the periods when the monitoring daemon is not running but the file system is available? How do we recover without sweeping the file system?
  • What happens if the monitoring process (or kernel) runs out of resources, and an event is dropped? Is there a notification of the loss, or is it simply ignored?
  • It's interesting to note that Beagle is known as a system resource hog even though it uses inotify. This belies the impression that a file syncer using inotify would automatically exhibit lower system load than one that does not.
  • It might be interesting to combine the idea of a file system syncer with the idea of building virtual collections (searches). I.e., instead of saying "sync all files under this directory" you could say "sync all files that changed yesterday" or "sync all MP3 files containing music by the artist Diana Krall."
  • After reviewing gamin and FAM I don't think these provide the functionality we want. We'd have to set up an event handler for every directory we want to watch -- potentially millions. It doesn't appear that this is a scalable interface (some mention of using select()?) Gamin itself doesn't handle signals or Python exceptions while handle_event() is blocked.
  • KEY: Data synchronization (keep these two file sets synchronized) is deeply tied to content indexing (tell me about what's in this file system)