FileSyncer

From Linux NFS

Latest revision as of 00:53, 16 August 2007

description

  • multi-purpose generic file system syncer - a file synchronization tool that runs on most Unix-flavored operating systems
  • one-way or bidirectional data synchronization
  • syncs file changes periodically or continuously
  • supports disconnected operation (later playback of changes)
  • can use advanced file system features to improve efficiency
  • the basic idea is to use an inotify-like mechanism to make the detection of file and directory modification more efficient

use cases

Unidirectional

clone - on-demand rsync-like copy of a portion or all of a file system to another file system

  • provides basic rsync functionality

replication - periodic rsync-like copy of a portion or all of a file system to another file system

  • provides a working copy, hot backup, or a replicant for improving availability or balancing load

backup - periodic rsync-like copy of a portion or all of a file system to a flattened and compressed representation of the file system

  • provides a more efficient incremental archiving scheme

synchronization - real-time update of a remote file system based on changes in a local file system

  • provides hot backup on a remote site for disaster recovery

Bidirectional (these are the hard ones)

disconnected - on-demand bi-directional update between two file systems with conflict resolution

mirror - periodic bi-directional update between two file systems with conflict resolution

cluster - real-time bi-directional update between two file systems

  • provides two hot copies; concurrent access is moderated via normal file locking
  • we won't be doing this one
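
For the disconnected and mirror cases, conflict resolution first needs conflict detection. A minimal sketch, assuming each side keeps a catalog that maps a path to a content fingerprint and that a copy of the catalog from the last successful sync is saved as a base; the names classify, base, local and remote are hypothetical and not part of any existing tool:

```python
from typing import Dict

# A catalog maps a relative path to a content fingerprint (for example a hash).
Catalog = Dict[str, str]


def classify(base: Catalog, local: Catalog, remote: Catalog) -> Dict[str, str]:
    """Three-way comparison of both sides against the last-synced catalog."""
    actions = {}
    for path in set(base) | set(local) | set(remote):
        b, l, r = base.get(path), local.get(path), remote.get(path)
        if l == r:
            continue                     # both sides agree, nothing to do
        if l == b:
            actions[path] = "pull"       # only the remote side changed
        elif r == b:
            actions[path] = "push"       # only the local side changed
        else:
            actions[path] = "conflict"   # both changed, in different ways
    return actions
```

A missing path is represented by None, so a deletion on one side is treated like any other change. Only the paths marked "conflict" would need to reach the conflict-handling UI.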

other examples

rsync on Unix

Synchronizes a file, subtree, or whole file system in one direction

Carbon Copy Cloner

Copies whole disks, but isn't designed for regular synchronization

Chronosync on Mac OS X

  • This tool is not for cloning file systems or disks; just copying part of a file system.
  • It can mirror (provide a clone)
  • It can do bi-directional sync

rdiff-backup

Desktop search tools

Examples

These appear to use backend search engines and indexing algorithms such as Lucene and Xapian, which is what I really want to review to learn something useful for the synchronizer.

From the Tracker wiki:

 2. Linux kernel watchless file notification system for Tracker
   * currently we are using inotify which is not optimal for watching entire trees. (OSX has kernel file notifications by contrast)
   * Implement a loadable module for the kernel which allows all file activity to be passed to userspace Tracker
   * Ideally should make use of netlink
   * Allow tracker to connect and receive these events.
   * implement file notification handler for it in tracker

and

 6. Networked Tracker
   * Use avahi/dbus to locate remote tracker servers. Could make use of ssh or as suggested use gabriel.
   * Api would need to be added to add support for retrieving list of servers
   * TST would need to be changed to list all available servers to use and allow user to pick one
   * If using DBus over ssh then could use seahorse for ssh key pairing

File system assistants

Erez Zadok suggested looking at the recent FaST papers describing TraceFS (http://www.filesystems.org/project-tracefs.html). On the reverse end, ReplayFS would take the output stream from TraceFS and re-apply the changes to a file system.

requirements

  • need a good name
  • operational efficiency
    • small page and inode cache footprint
    • moves only changed data and metadata
  • would like this to reside mostly or all in user space
  • should work "well enough" on any file system type independent of feature sets
  • can identify advanced features such as snapshots to improve efficiency
  • might even work between two actively modified file systems (multi-master replication)
  • do we want a push mode and a pull mode?
  • wizard UI for easy configuration?

major architectural features

  • prototype in Python (if it has decent GUI features)
    • Python has an inotify module!
    • It also has libbeagle-python -- hopefully this would provide an easy facility for constructing a prototype file system catalog
  • how to catalog all the files on a file system
    • Chris suggests looking at desktop search utilities such as recoll (beagle?)
    • SCM... does git have anything to offer here?
  • identifying changes statically
    • for example, when syncing the first time, or syncing after reconnecting
  • identifying changes dynamically (to reduce having to walk every file in each file system to find changes)
    • unionfs layer that can capture and journal changes coming through vfs
      • Erez suggests looking at TraceFS and ReplayFS in recent FaST proceedings
    • could use inotify
    • maybe use a journal to record changes permanently and play them back later
  • need a UI for handling conflicts on multi-master setups

operation

  • configuration
    • wizard-like thingie for initial set up
    • lots of damn checkboxes for experts
  • static mode
    • what's changed?
      • generate fresh catalogs
      • use the catalog to discover changes
      • push changes to remote
  • dynamic mode
    • journaling changes as they happen
      • into a catalog
      • into a local journal file
      • into a socket or pipe
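
The static mode above (generate fresh catalogs, then use them to discover changes) can be sketched as follows. This is only an illustration, assuming a catalog is a mapping from relative path to (size, mtime); the function names are hypothetical, and a real catalog would probably also need ownership, permissions and perhaps a content hash:

```python
import os
from typing import Dict, List, Tuple

Entry = Tuple[int, int]  # (size in bytes, mtime in nanoseconds)


def build_catalog(root: str) -> Dict[str, Entry]:
    """Walk the tree under root and record size and mtime for every file."""
    catalog = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            st = os.lstat(full)  # lstat: do not follow symlinks
            catalog[os.path.relpath(full, root)] = (st.st_size, st.st_mtime_ns)
    return catalog


def diff_catalogs(old: Dict[str, Entry], new: Dict[str, Entry]
                  ) -> Tuple[List[str], List[str], List[str]]:
    """Return (added, removed, changed) paths between two catalogs."""
    added = sorted(set(new) - set(old))
    removed = sorted(set(old) - set(new))
    changed = sorted(p for p in set(old) & set(new) if old[p] != new[p])
    return added, removed, changed
```

Note that building the catalog touches every inode under root, which is exactly the cache pollution the dynamic mode is meant to avoid.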

random thoughts

  • Need to install and try out Tracker
  • There's a window between when a monitored file system is mounted and when the monitoring daemon starts running, and another whenever the daemon crashes. How do we detect the periods when the monitoring daemon is not running but the file system is available? How do we recover without sweeping the file system?
  • What happens if the monitoring process (or kernel) runs out of resources, and an event is dropped? Is there a notification of the loss, or is it simply ignored?
  • It's interesting to note that Beagle is known as a system resource hog even though it uses inotify. This belies the impression that a file syncer using inotify would automatically exhibit lower system load than one that does not.
  • It might be interesting to combine the idea of a file system syncer with the idea of building virtual collections (searches). I.e., instead of saying "sync all files under this directory" you could say "sync all files that changed yesterday" or "sync all MP3 files containing music by the artist Diana Krall."
  • After reviewing gamin and FAM I don't think these provide the functionality we want. We'd have to set up an event handler for every directory we want to watch -- potentially millions. It doesn't appear that this is a scalable interface (some mention of using select()?) Gamin itself doesn't handle signals or Python exceptions while handle_event() is blocked.
  • KEY: Data synchronization (keep these two file sets synchronized) is deeply tied to content indexing (tell me about what's in this file system)
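
The virtual collection idea above could be prototyped without any index at all. A rough sketch, assuming selection by file name pattern and modification time only (selecting by artist would need real content indexing); select_files and its parameters are hypothetical names:

```python
import fnmatch
import os
from typing import Iterator, Optional


def select_files(root: str, pattern: str = "*",
                 changed_since: Optional[float] = None) -> Iterator[str]:
    """Yield relative paths under root that match a name pattern and,
    optionally, were modified at or after changed_since (epoch seconds)."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if not fnmatch.fnmatch(name, pattern):
                continue
            full = os.path.join(dirpath, name)
            if changed_since is not None and os.stat(full).st_mtime < changed_since:
                continue
            yield os.path.relpath(full, root)
```

For example, select_files("/home/me", "*.mp3", time.time() - 86400) would approximate "all MP3 files that changed in the last day". The result is just a list of paths, so it could feed the same transfer step as a directory-based sync.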

free-form notes

Chris Mason suggested a tool that can synchronize a local and remote file system using whatever tools it can find (snapshots, inotify, rsync, etc). It should always work, but it should pick whichever of the available mechanisms does the job most efficiently.

Tool could do file system synchronization in real time, or disconnected operation, or periodic replication.

Need a UI mechanism to handle conflicts.

One central idea Chris had was to use inotify to drive specific rsyncs, thus avoiding a lot of page and inode cache pollution.
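
A sketch of what "inotify driving specific rsyncs" might look like, assuming some watcher has already collected a batch of changed paths relative to the source root. The helper names are hypothetical; the rsync options used (-a and --files-from=-, which reads the file list from standard input) are standard ones:

```python
import subprocess
from typing import Iterable, List, Tuple


def rsync_command(src_root: str, dest: str,
                  changed: Iterable[str]) -> Tuple[List[str], str]:
    """Build an rsync invocation that transfers only the named paths.

    Returns the argument vector and the text to feed on stdin.
    """
    paths = sorted(set(changed))  # de-duplicate repeated notifications
    argv = ["rsync", "-a", "--files-from=-", src_root.rstrip("/") + "/", dest]
    stdin_data = "".join(p + "\n" for p in paths)
    return argv, stdin_data


def run_sync(src_root: str, dest: str, changed: Iterable[str]) -> int:
    """Run rsync for one batch of changes and return its exit status."""
    argv, stdin_data = rsync_command(src_root, dest, changed)
    return subprocess.run(argv, input=stdin_data, text=True).returncode
```

Because rsync is told exactly which files to look at, it never walks the rest of the tree. Deleted files are not handled by this sketch; they would need a separate step.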

While inotify can tell us some object has changed, we can't really tell how it has changed. I thought of Huston's work on disconnected AFS (i.e., using a journal of changes to drive the synchronization process). A stackable file system seems a good tool for intercepting file system changes.
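
To make the journal idea concrete, here is a minimal sketch, assuming one JSON object per line with hypothetical "op" and "path" fields. It is not the format used by disconnected AFS or TraceFS, just an illustration of recording changes and collapsing them before playback:

```python
import json
from typing import Dict, Iterable, TextIO


def record(journal: TextIO, op: str, path: str) -> None:
    """Append one change record to the journal."""
    journal.write(json.dumps({"op": op, "path": path}) + "\n")


def compact(lines: Iterable[str]) -> Dict[str, str]:
    """Collapse a journal to the last operation seen for each path."""
    final = {}
    for line in lines:
        if not line.strip():
            continue  # tolerate blank lines
        entry = json.loads(line)
        final[entry["path"]] = entry["op"]
    return final
```

Last-operation-wins is a simplification: renames, and files created and deleted between two syncs, would need more careful treatment during playback.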

Loaded up rdiff-backup and hypereistar (?) on picasso. bzzt.

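Other search tools:

  • Strigi (http://strigi.sourceforge.net)
  • recoll (http://www.lesbonscomptes.com/recoll/)
  • Beagle (http://www.beagle-project.org/) (uses inotify, but written in C# / mono)
  • searchmonkey (http://searchmonkey.sourceforge.net/)
  • libferris (http://witme.sourceforge.net/libferris.web/) and ego
  • Tracker (http://www.gnome.org/projects/tracker/)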

These appear to use backend search engines and indexing algorithms such as Lucene and Xapian, which is what I really want to review to learn something useful for the synchronizer.

"gamin" is a facility that provides file alteration monitoring using whatever facility is available (dnotify, inotify, whatever). Hmm, and it appears to be installed on F7 (Fedora 7), although the -devel and Python bindings are optional.

gamin Python bindings: http://www.gnome.org/~veillard/gamin/python.html

The question is, "is this robust enough to handle a lot of changes at once?" The "inotify" model of monitoring file system changes is vulnerable if there's no way to detect a dropped change notification.

gamin detects changes to a file or directory, but I wonder if it scales well to detecting changes to a whole file system, or a subtree?

gamin appears to be based on a larger file alteration monitoring framework built by SGI:

FAM API: http://oss.sgi.com/projects/fam/

FAM is based on select(), so it can monitor at most FD_SETSIZE (typically 1024) file descriptors at once, and is limited to monitoring a single directory (useful for graphical file management tools).

So.

I don't think this type of functionality is what we want. We'd have to set up an event handler on every directory we're interested in -- potentially millions. Plus there's no way to watch all of these events scalably. Gamin itself doesn't seem to handle signals or exceptions while handle_event() is blocked.
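
To get a feel for the scale problem, a small sketch that counts how many directories a tree contains (one watch per directory) and reads the per-user inotify watch limit. The /proc path is the standard Linux location for that limit; the function names are hypothetical:

```python
import os
from typing import Optional


def count_directories(root: str) -> int:
    """Count root and every directory below it (symlinks are not followed)."""
    return sum(1 for _ in os.walk(root))


def read_watch_limit(
        path: str = "/proc/sys/fs/inotify/max_user_watches") -> Optional[int]:
    """Return the per-user inotify watch limit, or None if unavailable."""
    try:
        with open(path) as f:
            return int(f.read().strip())
    except (OSError, ValueError):
        return None
```

Comparing count_directories("/export") against read_watch_limit() gives a quick check of whether per-directory watching is even feasible for a given file system.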

I wonder if there is a different take on using inotify that handles large parts or whole file systems efficiently.

pyinotify (http://pyinotify.sourceforge.net/) looks much nicer. It appears to be documented reasonably, and even handles exceptions in a Python-like manner.

Here's an article (http://www.linuxjournal.com/article/8478) by one of the co-developers who created inotify. One important thing to observe is IN_Q_OVERFLOW, which is an event that signifies that some events were dropped. The queue size is tunable via sysctl (fs.inotify.max_queued_events), and defaults to 16K (16384) events.

Downloaded Tracker, which seems pretty close to what we're looking for, is written in C, and uses inotify if available. Tracker may provide a clue about how to make this scale across thousands of directories.