#+TITLE: MU INDEX
#+MAN_CLASS_OPTIONS: :section-id "@SECTION_ID@" :date "@MAN_DATE@"
#+include: macros.inc

* NAME

mu-index - index e-mail messages stored in Maildirs

* SYNOPSIS

*mu* [_COMMON-OPTIONS_] *index*

* DESCRIPTION

*mu index* is the *mu* command for scanning the contents of Maildir directories and
storing the results in a Xapian database. The data can then be queried using
{{{man-link(mu-find,1)}}}.

Before the first time you run *mu index*, you must run *mu init* to initialize the
database.

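As a minimal sketch of that first-time setup, the helper below initializes the
database for a given Maildir and then indexes it. The name *first_index* is
hypothetical and not part of *mu*; the sketch assumes the *--maildir* option
described in {{{man-link(mu-init,1)}}}, and the path in the usage comment is
only an example.

#+begin_src sh
# Hypothetical helper, not part of mu: initialize the database, then index.
# $1: path to the top-level Maildir (an assumption; adjust to your setup).
# Usage: first_index "$HOME/Maildir"
first_index() {
    mu init --maildir="$1" && mu index
}
#+end_src
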
*index* understands Maildirs as defined by Daniel Bernstein for
{{{man-link(qmail,7)}}}. In addition, it understands recursive Maildirs
(Maildirs within Maildirs) and Maildir++. It also supports VFAT-based Maildirs,
which use *!* or *;* as the separators instead of *:*.

E-mail messages which are not stored in something resembling a maildir
leaf-directory (_cur_ and _new_) are ignored, as are the cache directories for
_notmuch_ and _gnus_, and any dot-directory.

Symlinks are followed, and the directories can be spread over multiple
filesystems; however note that moving files around is much faster when multiple
filesystems are not involved. Be careful to avoid self-referential symlinks!

If there is a file called _.noindex_ in a directory, the contents of that
directory and all of its subdirectories will be ignored. This can be useful to
exclude certain directories from the indexing process, for example directories
with spam-messages.

If there is a file called _.noupdate_ in a directory, the contents of that
directory and all of its subdirectories will be ignored. This can be useful to
speed things up if you have some maildirs that never change.

_.noupdate_ does not affect already-indexed messages: you can still search for
them. _.noupdate_ is ignored when you start indexing with an empty database (such
as directly after *mu init*).

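The marker files described above are ordinary empty files, so *touch* is enough
to create them. The following is a minimal sketch in a scratch directory; the
folder names _spam_ and _archive_ are hypothetical, and in practice *MAILDIR*
would point at your real Maildir.

#+begin_src sh
# Demo in a scratch directory; replace with your real Maildir in practice.
MAILDIR=$(mktemp -d)
mkdir -p "$MAILDIR/spam/cur" "$MAILDIR/archive/cur"

# Never index this folder or anything below it.
touch "$MAILDIR/spam/.noindex"

# Skip this folder on updates; it is still indexed into an empty database.
touch "$MAILDIR/archive/.noupdate"
#+end_src
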
There is also the option *--lazy-check*, which can greatly speed up indexing; see
below for details.

The first run of *mu index* may take a few minutes if you have a lot of mail (tens
of thousands of messages). Fortunately, such a full scan needs to be done only
once; after that it suffices to index the changes, which goes much faster. See
the PERFORMANCE section below for more information.

The optional `phase two' of the indexing-process is the removal of messages from
the database for which there is no longer a corresponding file in the Maildir.
If you do not want this, you can use *-n*, *--nocleanup*.

When *mu index* catches one of the signals *SIGINT*, *SIGHUP* or *SIGTERM* (e.g., when
you press Ctrl-C during the indexing process), it attempts to shut down
gracefully; it tries to save and commit data, close the database, and so on. If it
receives another signal (e.g., when pressing Ctrl-C once more), *mu index* will
terminate immediately.

* INDEX OPTIONS

** --lazy-check
In lazy-check mode, *mu* does not consider messages for which the time-stamp
(ctime) of the directory they reside in has not changed since the previous
indexing run. This is much faster than the non-lazy check, but won't update
messages that have changed (rather than having been added or removed), since
merely editing a message does not update the directory time-stamp. Of course,
you can run *mu index* occasionally without *--lazy-check*, to pick up such
messages.

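One way to combine the two modes is a small wrapper that runs the lazy check by
default and the full check on request. This is only a sketch: the function name
*reindex_mail* and its *full* argument are hypothetical, not part of *mu*.

#+begin_src sh
# Hypothetical helper, not part of mu.
# Usage: reindex_mail        (frequent, fast pass)
#        reindex_mail full   (occasional pass that also sees edited messages)
reindex_mail() {
    if [ "${1:-lazy}" = "full" ]; then
        mu index                 # non-lazy check of every message
    else
        mu index --lazy-check    # skip directories whose ctime is unchanged
    fi
}
#+end_src
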
** --nocleanup
Disable the database cleanup that *mu* does by default after indexing.

** --reindex
Perform a complete reindexing of all the messages in the maildir.

#+include: "muhome.inc" :minlevel 2

#+include: "common-options.inc" :minlevel 1

* ENCRYPTION

*mu index* does _not_ decrypt messages, and only the metadata (such as headers) of
encrypted messages makes it to the database. *mu view* and *mu4e* can decrypt
messages, but those work with the message directly and the information is not
added to the database.

* PERFORMANCE

** indexing in ancient times (2009?)

As a non-scientific benchmark, a simple test on the author's machine (a Thinkpad
X61s laptop using Linux 2.6.35 and an ext3 file system) with no existing
database, and a maildir with 27273 messages:

#+begin_example
$ sudo sh -c 'sync && echo 3 > /proc/sys/vm/drop_caches'
$ time mu index --quiet
66,65s user 6,05s system 27% cpu 4:24,20 total
#+end_example
(about 103 messages per second)

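The rate quoted above is the message count divided by the wall-clock total
(4:24,20 is 264.20 seconds). A minimal sketch of that arithmetic follows; the
helper name *throughput* is hypothetical and not part of *mu*.

#+begin_src sh
# Hypothetical helper: messages per second, rounded to a whole number.
# $1: number of messages, $2: elapsed seconds (decimal point, not comma).
throughput() {
    LC_ALL=C awk -v n="$1" -v s="$2" 'BEGIN { printf "%.0f\n", n / s }'
}

throughput 27273 264.20   # prints 103
#+end_src
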
A second run, which is the more typical use case when there is a database
already, goes much faster:

#+begin_example
$ sudo sh -c 'sync && echo 3 > /proc/sys/vm/drop_caches'
$ time mu index --quiet
0,48s user 0,76s system 10% cpu 11,796 total
#+end_example
(more than 56818 messages per second measured against user CPU time; about 2312
per second against the 11,796 s wall-clock total)

Note that each test flushes the caches first; a more common use case might be to
run *mu index* when new mail has arrived; the cache may stay quite `warm' in that
case:

#+begin_example
$ time mu index --quiet
0,33s user 0,40s system 80% cpu 0,905 total
#+end_example
which is more than 30000 messages per second.

** indexing in 2012

As of June 2012, we did the same non-scientific benchmark, this time with an
Intel i5-2500 CPU @ 3.30GHz, an ext4 file system and a maildir with 22589
messages. We start without an existing database.

#+begin_example
$ sudo sh -c 'sync && echo 3 > /proc/sys/vm/drop_caches'
$ time mu index --quiet
27,79s user 2,17s system 48% cpu 1:01,47 total
#+end_example
(about 813 messages per second measured against user CPU time; about 367 per
second against the 1:01,47 wall-clock total)

A second run, which is the more typical use case when there is a database
already, goes much faster:

#+begin_example
$ sudo sh -c 'sync && echo 3 > /proc/sys/vm/drop_caches'
$ time mu index --quiet
0,13s user 0,30s system 19% cpu 2,162 total
#+end_example
(more than 173000 messages per second measured against user CPU time; about
10448 per second against the 2,162 s wall-clock total)

** indexing in 2016

As of July 2016, we did the same non-scientific benchmark, again with the Intel
i5-2500 CPU @ 3.30GHz and an ext4 file system. This time, the maildir contains
72525 messages.

#+begin_example
$ sudo sh -c 'sync && echo 3 > /proc/sys/vm/drop_caches'
$ time mu index --quiet
40,34s user 2,56s system 64% cpu 1:06,17 total
#+end_example
(about 1096 messages per second of wall-clock time).

** indexing in 2022

A few years later and it is June 2022. There's a lot more happening during
indexing, but indexing became multi-threaded and machines are faster; e.g. this
is with an AMD Ryzen Threadripper 1950X (16 cores) @ 3.399GHz.

The instructions are a little different since we have a proper repeatable
benchmark now. After building,

#+begin_example
$ sudo sh -c 'sync && echo 3 > /proc/sys/vm/drop_caches'
$ THREAD_NUM=4 build/lib/tests/bench-indexer -m perf
# random seed: R02Sf5c50e4851ec51adaf301e0e054bd52b
1..1
# Start of bench tests
# Start of indexer tests
indexed 5000 messages in 20 maildirs in 3763ms; 752 μs/message; 1328 messages/s (4 thread(s))
ok 1 /bench/indexer/4-cores
# End of indexer tests
# End of bench tests
#+end_example

Things are again a little faster, even though the indexer does a lot more now
(text-normalization, and pre-generating message-sexps). A faster machine helps,
too!

** recent releases

Indexing the same 93000-message mail corpus with the last few releases:

#+ATTR_MAN: :disable-caption t
| release       | time (sec) | notes                                    |
|---------------+------------+------------------------------------------|
| 1.4           | 160        |                                          |
| 1.6           | 178        |                                          |
| 1.8           | 97         |                                          |
| 1.10          | 120        | adds html indexing, sexp-caching         |
| 1.11 (master) | 96         | adds language-guessing, batch-size=50000 |

Quite some variation!

Over time new features / refactoring can change the timings quite a bit. At
least for now, the latest code is both the fastest and the most featureful!

#+include: "exit-code.inc" :minlevel 1

#+include: "prefooter.inc"

* SEE ALSO

{{{man-link(maildir,5)}}},
{{{man-link(mu,1)}}},
{{{man-link(mu-init,1)}}},
{{{man-link(mu-find,1)}}},
{{{man-link(mu-cfind,1)}}}