FileSyncer

From Linux NFS

Revision as of 22:10, 15 August 2007 by Chucklever (Talk | contribs)
(diff) ← Older revision | Latest revision (diff) | Newer revision → (diff)
Jump to: navigation, search

Chris Mason suggested a tool that can synchronize a local and remote file system using whatever tools it can find (snapshots, inotify, rsync, etc) -- should always work, but will find what it needs to do the job most efficiently.

Tool could do file system synchronization in real time, or disconnected operation, or periodic replication.

Need a UI mechanism to handle conflicts.

One central idea Chris had was to use inotify to drive specific rsyncs, thus avoiding a lot of page and inode cache pollution.

While inotify can tell us some object has changed, we can't really tell how it has changed. I thought of Huston's work on disconnected AFS (ie using a journal of changes to drive the synchronization process). A stackable file system seems a good tool for intercepting file system changes.

Loaded up rdiff-backup and hypereistar (?) on picasso. bzzt.

Other search tools:

  • Strigi
  • recoll
  • Beagle [uses inotify, but written in C# / mono]
  • searchmonkey
  • libferris and ego
  • Tracker

These appear to use backend search engines and indexing algorithms such as Lucene and Xapian, which is what I really want to review to learn something useful for the synchronizer.

"gamin" is a facility that provides file alteration monitoring using whatever facility is available (dnotify, inotify, whatever). Hmm, and it appears to be installed on F7, although the -devel and Python bindings are optional.

gamin Python bindings

The question is, "is this robust enough to handle a lot of changes at once?" The "inotify" model of monitoring file system changes is vulnerable if there's no way to detect a dropped change notification.

gamin detects changes to a file or directory, but I wonder if it scales well to detecting changes to a whole file system, or a subtree?

gamin appears to be based on a larger file alteration monitoring framework built by SGI:

FAM API

FAM is based on select(), thus it can monitor only 1024 events at once, and is limited to monitoring a single directory (useful for graphical file management tools).

So.

I don't think this type of functionality is what we want. We'd have to set up an event handler on every directory we're interested in -- potentially millions. Plus there's no way to watch all of these events scalably. Gamin itself doesn't seem to handle signals or exceptions while handle_event() is blocked.

I wonder if there is a different take on using inotify that handles large parts or whole file systems efficiently.

Downloaded Tracker, which seems pretty close to what we're looking for, is written in C, and uses inotify if available.

Pynotify looks much nicer. It appears to be documented reasonably, and even handles exceptions in a Python-like manner.

Here's an article by one of the co-developers who created inotify. One important thing to observe is IN_Q_OVERFLOW, which is an event that signifies that some events were dropped. The queue size is tunable via sysctl, and defaults to 16K events.

Tracker may provide a clue about how to make this scale across thousands of directories.

Personal tools