FileSyncer

From Linux NFS

(Add raw text from OmniOutliner file syncer doc)

Revision as of 23:59, 15 August 2007

Chris Mason suggested a tool that can synchronize a local and remote file system using whatever tools it can find (snapshots, inotify, rsync, etc). It should always work, but will pick whichever of these facilities lets it do the job most efficiently.

Tool could do file system synchronization in real time, or disconnected operation, or periodic replication.

Need a UI mechanism to handle conflicts.

One central idea Chris had was to use inotify to drive specific rsyncs, thus avoiding a lot of page and inode cache pollution.
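
A minimal sketch of that idea, assuming the monitoring side hands us a list of changed paths: batch them into a single rsync run restricted to just those files, so nothing else in the tree is touched. The function names and the example paths are hypothetical; the rsync options used (`-a`, `--from0`, `--files-from=-`) are standard ones. Deletions are not handled here.

```python
import os
import subprocess


def build_rsync_command(src_root, dest, changed_paths):
    """Return (argv, payload) for one rsync run covering only changed_paths.

    payload is the NUL-separated list of paths, relative to src_root,
    that rsync reads from stdin via --files-from=-.
    """
    src_root = os.path.abspath(src_root)
    relative = sorted(
        {os.path.relpath(os.path.abspath(p), src_root) for p in changed_paths}
    )
    # Ignore anything that lies outside the monitored tree.
    relative = [
        r for r in relative
        if r != os.pardir and not r.startswith(os.pardir + os.sep)
    ]
    argv = ["rsync", "-a", "--from0", "--files-from=-", src_root + os.sep, dest]
    payload = "".join(r + "\0" for r in relative)
    return argv, payload


def run_rsync(src_root, dest, changed_paths):
    """Run rsync for the changed paths; do nothing if the batch is empty."""
    argv, payload = build_rsync_command(src_root, dest, changed_paths)
    if payload:
        data = payload.encode("utf-8", "surrogateescape")
        subprocess.run(argv, input=data, check=True)
```

Batching matters because one rsync per event would pay connection setup for every change; collecting events for a short interval and syncing them together is one plausible compromise.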

While inotify can tell us some object has changed, we can't really tell how it has changed. I thought of Huston's work on disconnected AFS (ie using a journal of changes to drive the synchronization process). A stackable file system seems a good tool for intercepting file system changes.
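
To make the journal idea concrete, here is a small sketch (not Huston's design, just an illustration under assumed record fields `op`, `path`, and `dest`): changes are appended to a journal as they happen, and replay collapses them into one final action per path before anything is sent to the remote side.

```python
import json


def record(journal, op, path, dest=None):
    """Append one change record to the journal (a list of JSON lines)."""
    journal.append(json.dumps({"op": op, "path": path, "dest": dest}))


def replay(journal):
    """Collapse the journal into a final action per path: 'copy' or 'delete'."""
    actions = {}
    for line in journal:
        entry = json.loads(line)
        op, path = entry["op"], entry["path"]
        if op in ("create", "modify"):
            actions[path] = "copy"
        elif op == "delete":
            actions[path] = "delete"
        elif op == "rename":
            # A rename removes the old name and produces the new one.
            actions[path] = "delete"
            actions[entry["dest"]] = "copy"
    return actions
```

Because the journal records the kind of change (including renames), the syncer does not have to rediscover it by comparing the two file systems.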

Loaded up rdiff-backup and hypereistar (?) on picasso. bzzt.

Other search tools:

These appear to use backend search engines and indexing algorithms such as Lucene and Xapian, which is what I really want to review to learn something useful for the synchronizer.

"gamin" is a facility that provides file alteration monitoring using whatever facility is available (dnotify, inotify, whatever). Hmm, and it appears to be installed on F7, although the -devel and Python bindings are optional.

gamin Python bindings

The question is, "is this robust enough to handle a lot of changes at once?" The "inotify" model of monitoring file system changes is vulnerable if there's no way to detect a dropped change notification.

gamin detects changes to a file or directory, but I wonder if it scales well to detecting changes to a whole file system, or a subtree?

gamin appears to be based on a larger file alteration monitoring framework built by SGI:

FAM API

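FAM is based on select(), so it is bound by select()'s descriptor limit (FD_SETSIZE, typically 1024) and can monitor only that many things at once. It is also limited to monitoring a single directory per request, which is useful for graphical file management tools but not for whole trees.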

So.

I don't think this type of functionality is what we want. We'd have to set up an event handler on every directory we're interested in -- potentially millions. Plus there's no way to watch all of these events scalably. Gamin itself doesn't seem to handle signals or exceptions while handle_event() is blocked.

I wonder if there is a different take on using inotify that handles large parts or whole file systems efficiently.

Pyinotify looks much nicer. It appears to be reasonably well documented, and even handles exceptions in a Python-like manner.

Here's an article (http://www.linuxjournal.com/article/8478) by one of the co-developers who created inotify. One important thing to observe is IN_Q_OVERFLOW, an event signifying that some events were dropped. The queue size is tunable via the fs.inotify.max_queued_events sysctl, and defaults to 16K (16384) events.
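
A sketch of how a reader of the raw inotify stream might notice an overflow, assuming the standard Linux `struct inotify_event` layout (wd, mask, cookie, len, then a NUL-padded name) and the mask values from `<sys/inotify.h>`. The function name `parse_events` is hypothetical. It only parses a byte buffer, so it can be tried without a live inotify descriptor.

```python
import struct

# Mask bits from <sys/inotify.h>.
IN_MODIFY = 0x00000002
IN_CREATE = 0x00000100
IN_DELETE = 0x00000200
IN_Q_OVERFLOW = 0x00004000

# struct inotify_event header: int wd; uint32_t mask, cookie, len.
_HEADER = struct.Struct("iIII")


def parse_events(buf):
    """Parse a buffer read from an inotify fd.

    Returns (events, overflowed), where events is a list of
    (wd, mask, name) tuples and overflowed is True if the kernel
    reported that its event queue overflowed and events were dropped.
    """
    events = []
    overflowed = False
    offset = 0
    while offset + _HEADER.size <= len(buf):
        wd, mask, _cookie, name_len = _HEADER.unpack_from(buf, offset)
        offset += _HEADER.size
        # The name field is NUL-padded to name_len bytes.
        name = buf[offset:offset + name_len].split(b"\0", 1)[0]
        offset += name_len
        if mask & IN_Q_OVERFLOW:
            overflowed = True
        else:
            events.append((wd, mask, name.decode("utf-8", "surrogateescape")))
    return events, overflowed
```

When `overflowed` comes back true, the syncer can no longer trust its view of the tree, so the safe response is probably a full rescan of the watched directories.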

Downloaded Tracker, which seems pretty close to what we're looking for. It is written in C and uses inotify if available. Tracker may provide a clue about how to make this scale across thousands of directories.

multi-purpose generic file system syncer

     <outline text="description">
       <outline text="a file synchronization tool that runs on most Unix-flavored operating systems"/>
       <outline text="one-way or bidirectional synchronization"/>
       <outline text="syncs file changes periodically or continuously"/>
       <outline text="supports disconnected operation"/>
       <outline text="can use advanced file system features to improve efficiency"/>
       <outline text="the basic idea is to use an inotify-like mechanism to make the detection of file and directory modification more efficient"/>
     </outline>
     <outline text="use cases">
       <outline text="Unidirectional">
         <outline text="&quot;clone&quot;" _note="on-demand rsync-like copy of a portion or all of a file system to another file system">
           <outline text="provides basic rsync functionality"/>
         </outline>
         <outline text="&quot;replication&quot;" _note="periodic rsync-like copy of a portion or all of a file system to another file system">
           <outline text="provides a working copy, hot backup, or a replicant for improving availability or balancing load"/>
         </outline>
         <outline text="&quot;backup&quot;" _note="periodic rsync-like copy of a portion or all of a file system to a flattened and compressed representation of the file system">
           <outline text="provides a more efficient incremental archiving scheme"/>
         </outline>
         <outline text="&quot;synchronization&quot;" _note="real-time update of a remote file system based on changes in a local file system">
           <outline text="provides hot backup on a remote site for disaster recovery"/>
         </outline>
       </outline>
       <outline text="Bidirectional (these are the hard ones)">
         <outline text="&quot;disconnected&quot;" _note="on-demand bi-directional update between two file systems with conflict resolution"/>
         <outline text="&quot;mirror&quot;" _note="periodic bi-directional update between two file systems with conflict resolution"/>
         <outline text="&quot;cluster&quot;" _note="real-time bi-directional update between two file systems">
           <outline text="provides two hot copies; concurrent access is moderated via normal file locking"/>
           <outline text="we won't be doing this one"/>
         </outline>
       </outline>
     </outline>
     <outline text="other examples">
       <outline text="rsync on Unix">
         <outline text="Synchronizes a file, subtree, or whole file system in one direction"/>
       </outline>
       <outline text="Carbon Copy Cloner">
         <outline text="Copies whole disks, but isn't designed for regular synchronization"/>
       </outline>
       <outline text="Chronosync on Mac OS X">
         <outline text="This tool is not for cloning file systems or disks; just copying part of a file system"/>
         <outline text="It can mirror (provide a clone)"/>
         <outline text="It can do bi-directional sync"/>
       </outline>
       <outline text="rdiff-backup"/>
       <outline text="Desktop search tools">
         <outline text="Examples">
           <outline text="Strigi strigi.sourceforge.net &lt;http://strigi.sourceforge.net/&gt;"/>
           <outline text="recoll www.lesbonscomptes.com—recoll &lt;http://www.lesbonscomptes.com/recoll/&gt;"/>
           <outline text="Beagle [uses inotify] www.beagle-project.org &lt;http://www.beagle-project.org/&gt;"/>
           <outline text="searchmonkey searchmonkey.sf.net &lt;http://searchmonkey.sf.net/&gt;"/>
           <outline text="libferris and ego witme.sourceforge.net—libferris.web &lt;http://witme.sourceforge.net/libferris.web/&gt;"/>
           <outline text="Tracker www.gnome.org—tracker &lt;http://www.gnome.org/projects/tracker/&gt;"/>
         </outline>
         <outline text="These appear to use backend search engines and indexing algorithms such as Lucene and Xapian, which is what I really want to review to learn something useful for the synchronizer."/>
         <outline text="From the Tracker wiki (live.gnome.org—SoC &lt;http://live.gnome.org/Tracker/SoC&gt;):" _note="2. Linux kernel watchless file notification system for Tracker
    * currently we are using inotify which is not optimal for watching entire trees. (OSX has kernel file notifications by contrast)
    * Implement a loadable module for the kernel which allows all file activity to be passed to userspace Tracker
    * Ideally should make use of netlink
    * Allow tracker to connect and receive these events.
    * implement file notification handler for it in tracker
6. Networked Tracker
    * Use avahi/dbus to locate remote tracker servers. Could make use of ssh or as suggested use gabriel.
    * Api would need to be added to add support for retrieving list of servers
    * TST would need to be changed to list all available servers to use and allow user to pick one
    * If using DBus over ssh then could use seahorse for ssh key pairing"/>
       </outline>
       <outline text="File system assistants">
         <outline text="Erez suggested looking at the recent FaST papers describing TraceFS www.filesystems.org—project-tracefs.html &lt;http://www.filesystems.org/project-tracefs.html&gt;"/>
         <outline text="On the reverse end, ReplayFS would take the output stream from TraceFS and re-apply the changes to a file system."/>
       </outline>
     </outline>
     <outline text="requirements">
       <outline text="need a good name for it (branding!)"/>
       <outline text="operational efficiency">
         <outline text="small page and inode cache footprint"/>
         <outline text="moves only changed data and metadata"/>
       </outline>
       <outline text="would like this to reside mostly or all in user space"/>
       <outline text="should work &quot;well enough&quot; on any file system type independent of feature sets"/>
       <outline text="can identify advanced features such as snapshots to improve efficiency"/>
       <outline text="might even work between two actively modified file systems (multi-master replication)"/>
       <outline text="do we want a push mode and a pull mode?"/>
       <outline text="wizard UI for easy configuration?"/>
     </outline>
     <outline text="major architectural features">
       <outline text="prototype in Python (if it has decent GUI features)">
         <outline text="Python has an inotify module!"/>
         <outline text="It also has libbeagle-python -- hopefully this would provide an easy facility for constructing a prototype file system catalog"/>
       </outline>
       <outline text="how to catalog all the files on a file system">
         <outline text="chris suggests looking at desktop search utilities such as recoll (beagle?)"/>
         <outline text="SCM... does git have anything to offer here?"/>
       </outline>
       <outline text="identifying changes statically">
         <outline text="for example, when syncing the first time, or syncing after reconnecting"/>
       </outline>
       <outline text="identifying changes dynamically (to reduce having to walk every file in each file system to find changes)">
         <outline text="unionfs layer that can capture and journal changes coming through vfs">
           <outline text="Erez suggests looking at TraceFS and ReplayFS in recent FaST proceedings"/>
         </outline>
         <outline text="could use inotify"/>
         <outline text="maybe use a journal to record changes permanently and play them back later"/>
       </outline>
       <outline text="need a UI for handling conflicts on multi-master set ups"/>
     </outline>
     <outline text="operation">
       <outline text="configuration">
         <outline text="wizard-like thingie for initial set up"/>
         <outline text="lots of damn checkboxes for experts"/>
       </outline>
       <outline text="static mode">
         <outline text="what's changed?">
           <outline text="generate fresh catalogs"/>
           <outline text="use the catalog to discover changes"/>
           <outline text="push changes to remote"/>
         </outline>
       </outline>
       <outline text="dynamic mode">
         <outline text="journaling changes as they happen">
           <outline text="into a catalog"/>
           <outline text="into a local journal file"/>
           <outline text="into a socket or pipe"/>
         </outline>
       </outline>
     </outline>
     <outline text="random thoughts">
       <outline text="Need to install and try out Tracker"/>
       <outline text="There's a window between when a monitored file system is mounted and when the monitoring daemon starts running, or if the daemon crashes.  How do we detect the periods when the monitoring daemon is not running but the file system is available?  How do we recover without sweeping the file system?"/>
       <outline text="What happens if the monitoring process (or kernel) runs out of resources, and an event is dropped?  Is there a notification of the loss, or is it simply ignored?"/>
       <outline text="It's interesting to note that Beagle is known as a system resource hog even though it uses inotify.  This belies the impression that a file syncer using inotify would automatically exhibit lower system load than one that does not."/>
       <outline text="It might be interesting to combine the idea of a file system syncer with the idea of building virtual collections (searches).  I.e., instead of saying &quot;sync all files under this directory&quot; you could say &quot;sync all files that changed yesterday&quot; or &quot;sync all MP3 files containing music by the artist Diana Krall.&quot;"/>
       <outline text="After reviewing &quot;gamin&quot; and &quot;FAM&quot; I don't think these provide the functionality we want.  We'd have to set up an event handler for every directory we want to watch -- potentially millions.  It doesn't appear that this is a scalable interface (some mention of using select()?)  Gamin itself doesn't handle signals or Python exceptions while handle_event() is blocked."/>
       <outline text="KEY: Data synchronization (keep these two file systems synchronized) is deeply tied to content indexing (tell me about what's in this file system)"/>
     </outline>
   </outline>
 </body>
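
The "static mode" steps in the outline (generate fresh catalogs, use the catalog to discover changes, push changes to remote) might look roughly like the sketch below. It assumes a catalog is just a mapping from relative path to (size, mtime); the function names are hypothetical, and a real catalog would likely also track ownership, modes, and perhaps checksums. The push step is left out.

```python
import os


def build_catalog(root):
    """Walk root and map each file's relative path to (size, mtime_ns)."""
    catalog = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            st = os.lstat(full)
            catalog[os.path.relpath(full, root)] = (st.st_size, st.st_mtime_ns)
    return catalog


def diff_catalogs(old, new):
    """Return (added, removed, changed) path lists between two catalogs."""
    added = sorted(set(new) - set(old))
    removed = sorted(set(old) - set(new))
    changed = sorted(p for p in set(old) & set(new) if old[p] != new[p])
    return added, removed, changed
```

Note that building the catalog walks every file, which is exactly the page and inode cache cost the dynamic mode is meant to avoid; the static mode is for the first sync or for resynchronizing after a disconnect.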