FileSyncer

From Linux NFS


Revision as of 22:21, 15 August 2007

Chris Mason suggested a tool that can synchronize a local and remote file system using whatever tools it can find (snapshots, inotify, rsync, etc) -- should always work, but will find what it needs to do the job most efficiently.
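A minimal sketch of the "use whatever it can find" idea, assuming just two probes: whether the kernel exposes inotify and whether an rsync binary is on the PATH. The strategy names and the `probe_capabilities` / `choose_strategy` helpers are hypothetical, and snapshot detection is left out because it depends on the file system in use.

```python
import os
import shutil


def probe_capabilities():
    """Report which building blocks seem to be available on this host."""
    return {
        # Linux exposes inotify tunables here when inotify is built in.
        "inotify": os.path.isdir("/proc/sys/fs/inotify"),
        "rsync": shutil.which("rsync") is not None,
    }


def choose_strategy(caps):
    """Pick the most efficient strategy the capabilities allow.

    The strategy names are illustrative labels, not an existing API.
    """
    if caps.get("inotify") and caps.get("rsync"):
        return "inotify-driven rsync"
    if caps.get("rsync"):
        return "periodic full rsync"
    # Always-works fallback: walk the tree and copy everything.
    return "plain recursive copy"
```

Calling `choose_strategy(probe_capabilities())` would pick a strategy for the local host; the fallback branch is what makes the tool "always work".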

Tool could do file system synchronization in real time, or disconnected operation, or periodic replication.

Need a UI mechanism to handle conflicts.

One central idea Chris had was to use inotify to drive specific rsyncs, thus avoiding a lot of page and inode cache pollution.
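Roughly, the inotify-drives-rsync idea could look like the sketch below: collect the paths that change notifications reported, then hand only those to rsync through its `--files-from` option so it never has to walk the whole tree. The function names are made up for illustration, and deleted files would need separate handling.

```python
import os
import subprocess


def build_rsync_command(src_root, dest, changed_paths):
    """Return (argv, files_from_text) for an rsync limited to changed paths."""
    src_root = os.path.abspath(src_root)
    relative = set()
    for path in changed_paths:
        rel = os.path.relpath(os.path.abspath(path), src_root)
        if rel == os.curdir:
            continue  # the root itself is not a useful entry
        if rel == os.pardir or rel.startswith(os.pardir + os.sep):
            continue  # path lies outside the tree being synchronized
        relative.add(rel)
    # One path per line, relative to src_root, duplicates collapsed.
    files_from = "".join(name + "\n" for name in sorted(relative))
    argv = ["rsync", "-a", "--files-from=-", src_root + os.sep, dest]
    return argv, files_from


def run_rsync(src_root, dest, changed_paths):
    """Run rsync on the changed paths only; returns rsync's exit status."""
    argv, files_from = build_rsync_command(src_root, dest, changed_paths)
    if not files_from:
        return 0  # nothing changed, so nothing to transfer
    return subprocess.run(argv, input=files_from.encode(), check=False).returncode
```

Because only the listed files are opened, the rest of the tree stays out of the page and inode caches.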

While inotify can tell us some object has changed, we can't really tell how it has changed. I thought of Huston's work on disconnected AFS (i.e., using a journal of changes to drive the synchronization process). A stackable file system seems a good tool for intercepting file system changes.
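To make the journal idea concrete, here is a small sketch of a change journal that coalesces operations per path before they are replayed to the remote side. This is not Huston's design, only an illustration; the `ChangeJournal` class and the three operation names are assumptions.

```python
from collections import OrderedDict

CREATE, MODIFY, DELETE = "create", "modify", "delete"


class ChangeJournal:
    """Keep the net pending operation per path, in order of last change."""

    def __init__(self):
        self._pending = OrderedDict()  # path -> net operation

    def record(self, path, op):
        previous = self._pending.pop(path, None)
        if previous == CREATE and op == DELETE:
            return  # created and removed before any sync: nothing to send
        if previous == CREATE and op == MODIFY:
            op = CREATE  # the remote side has still never seen this file
        if previous == DELETE and op == CREATE:
            op = MODIFY  # the file was replaced in place
        self._pending[path] = op

    def drain(self):
        """Return the pending (path, op) pairs and empty the journal."""
        entries = list(self._pending.items())
        self._pending.clear()
        return entries
```

A synchronizer working in disconnected mode would call `record()` as changes are intercepted and `drain()` once the remote side is reachable again.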

Loaded up rdiff-backup and hypereistar (?) on picasso. bzzt.

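Other search tools:
* [http://strigi.sourceforge.net Strigi]
* [http://www.lesbonscomptes.com/recoll/ recoll]
* [http://www.beagle-project.org/ Beagle] (uses inotify, but written in C# / mono)
* [http://searchmonkey.sourceforge.net/ searchmonkey]
* [http://witme.sourceforge.net/libferris.web/ libferris] and ego
* [http://www.gnome.org/projects/tracker/ Tracker]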

These appear to use backend search engines and indexing algorithms such as Lucene and Xapian, which is what I really want to review to learn something useful for the synchronizer.

"gamin" is a facility that provides file alteration monitoring using whatever facility is available (dnotify, inotify, whatever). Hmm, and it appears to be installed on F7, although the -devel and Python bindings are optional.

[http://www.gnome.org/~veillard/gamin/python.html gamin Python bindings]

The question is, "is this robust enough to handle a lot of changes at once?" The "inotify" model of monitoring file system changes is vulnerable if there's no way to detect a dropped change notification.

gamin detects changes to a file or directory, but I wonder if it scales well to detecting changes to a whole file system, or a subtree?

gamin appears to be based on a larger file alteration monitoring framework built by SGI:

[http://oss.sgi.com/projects/fam/ FAM API]

FAM is based on select(), so it can monitor only 1024 file descriptors at once (the usual FD_SETSIZE limit), and is limited to monitoring a single directory (useful for graphical file management tools).

So.

I don't think this type of functionality is what we want. We'd have to set up an event handler on every directory we're interested in -- potentially millions. Plus there's no way to watch all of these events scalably. Gamin itself doesn't seem to handle signals or exceptions while handle_event() is blocked.

I wonder if there is a different take on using inotify that handles large parts or whole file systems efficiently.
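For a sense of scale: inotify watches are also per directory and not recursive, so a whole-tree monitor needs one watch for every directory. The sketch below counts directories under a root and compares that with the per-user watch limit Linux exposes in /proc/sys/fs/inotify/max_user_watches. The helper names are hypothetical.

```python
import os


def count_directories(root):
    """Count directories under root, including root itself."""
    total = 0
    for _dirpath, _dirnames, _filenames in os.walk(root):
        total += 1
    return total


def read_watch_limit(path="/proc/sys/fs/inotify/max_user_watches"):
    """Return the per-user inotify watch limit, or None if unavailable."""
    try:
        with open(path) as handle:
            return int(handle.read().strip())
    except (OSError, ValueError):
        return None


def watches_fit(root, limit=None):
    """Return (watches needed, limit, whether the tree fits under the limit)."""
    if limit is None:
        limit = read_watch_limit()
    needed = count_directories(root)
    return needed, limit, (limit is not None and needed <= limit)
```

Walking the tree just to count it is itself expensive on a large file system, which is part of the scaling problem.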

[http://pyinotify.sourceforge.net/ pyinotify] looks much nicer. It appears to be documented reasonably, and even handles exceptions in a Python-like manner.

[http://www.linuxjournal.com/article/8478 Here's an article] by one of the co-developers who created inotify. One important thing to observe is IN_Q_OVERFLOW, an event signifying that some events were dropped. The queue size is tunable via sysctl (fs.inotify.max_queued_events) and defaults to 16K events.

Downloaded Tracker, which seems pretty close to what we're looking for: it is written in C and uses inotify if available. Tracker may provide a clue about how to make this scale across thousands of directories.
