Files
mu4e/man/mu-index.1.org
Dirk-Jan C. Binnema cc10fbd22a man: update mu-index manpage
Make it a bit more explicit what we ignore.
2025-05-31 08:40:53 +03:00

227 lines
8.2 KiB
Org Mode
Raw Permalink Blame History

This file contains invisible Unicode characters

This file contains invisible Unicode characters that are indistinguishable to humans but may be processed differently by a computer. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

#+TITLE: MU INDEX
#+MAN_CLASS_OPTIONS: :section-id "@SECTION_ID@" :date "@MAN_DATE@"
#+include: macros.inc
* NAME
mu-index - index e-mail messages stored in Maildirs
* SYNOPSIS
*mu* [_COMMON-OPTIONS_] *index*
* DESCRIPTION
*mu index* is the *mu* command for scanning the messages in Maildir directories and
storing the results in a Xapian database. The data can then be queried using
{{{man-link(mu-find,1)}}}.
Before you can run *mu index*, you must initialize *mu* and its database with
{{{man-link(mu-init,1)}}}.
*mu index* understands Maildirs as defined by Daniel Bernstein for
{{{man-link(qmail,7)}}}. In addition, it understands recursive Maildirs
(Maildirs within Maildirs), Maildir++. It also supports VFAT-based Maildirs
which use "*!"* or "*;"* as the separators, instead of the standard "*:"*.
E-mail messages are only considered for indexing if they reside in a directory
named either ~cur~ or ~new~. The special (as per the Maildir-specification)
directory ~tmp~ is ignored, as are some cache directories for some other
mail-clients. Other directories are scanned recursively.
Symbolic links are followed, and the directories can be spread over multiple
filesystems; however, note that moving files around is much faster when they all
reside on a single file-system. Be careful to avoid self-referential symlinks!
If there is a file called ~.noindex~ in a directory, the contents of that
directory and all of its subdirectories will be ignored. This can be useful to
exclude certain directories from the indexing process, for example directories
with spam-messages.
If there is a file called ~.noupdate~ in a directory, the contents of that
directory and all of its subdirectories will be ignored. This can be useful to
speed up things you have some maildirs that never change.
~.noupdate~ does not affect already-indexed messages: you can still search for
them. ~.noupdate~ is ignored when you start indexing with an empty database (such
as directly after *mu init*).
There also the option *--lazy-check* which can greatly speed up indexing; see
below for details.
A first run of *mu index* may take a few minutes if you have a lot of mail (tens
or hundreds of thousands of messages). Fortunately, such a full scan needs to be
done only rarely; after that, it suffices to index just the changes, which goes
much faster. See *PERFORMANCE (i,ii,iii)* below for more information.
The optional cleanup-phase of the indexing-process is the removal of messages
from the database for which there is no longer a corresponding file in the
Maildir. If you do not want this, you can use *-n*, *--nocleanup*.
When *mu index* catches one of the signals *SIGINT*, *SIGHUP* or *SIGTERM* (e.g., when
you press *Ctrl-C* during the indexing process), it attempts to shutdown
gracefully: save and commit data and close the database etc. If it receives
another signal (e.g., when pressing Ctrl-C once more), *mu index* will terminate
immediately.
* INDEX OPTIONS
** --lazy-check
In lazy-check mode, *mu* does not consider messages for which the time-stamp
(*ctime*) of the directory in which they reside, has not changed since the
previous time this directory was checked.
This is much faster than the non-lazy check, but won't update messages that have
changed (rather than having been added or removed), since merely editing a
message does not update the directory time-stamp. Of course, you can run
*mu-index* occasionally without *--lazy-check*, to pick up such messages.
Furthermore, in lazy-check mode, files which have a *ctime* smaller than the time
the previous indexing operation was completed, are ignored. This helps for the
use-case where new messages can appear in big maildirs.
** --nocleanup
Disable the database cleanup that *mu* does by default after indexing.
** --reindex
Perform a complete reindexing of all the messages in the maildir.
#+include: "muhome.inc" :minlevel 2
#+include: "common-options.inc" :minlevel 1
* ENCRYPTION
*mu index* does _not_ decrypt messages, and only the metadata (such as headers) of
encrypted messages makes it to the database. *mu view* and *mu4e* can decrypt
messages, but those work with the message directly and the information is not
added to the database.
* PERFORMANCE
** indexing in ancient times (2009?)
As a non-scientific benchmark, a simple test on the author's machine (a Thinkpad
X61s laptop using Linux 2.6.35 and an ext3 file system) with no existing
database, and a maildir with 27273 messages:
#+begin_example
$ sudo sh -c 'sync && echo 3 > /proc/sys/vm/drop_caches'
$ time mu index --quiet
66,65s user 6,05s system 27% cpu 4:24,20 total
#+end_example
(about 103 messages per second)
A second run, which is the more typical use case when there is a database
already, goes much faster:
#+begin_example
$ sudo sh -c 'sync && echo 3 > /proc/sys/vm/drop_caches'
$ time mu index --quiet
0,48s user 0,76s system 10% cpu 11,796 total
#+end_example
(more than 56818 messages per second)
Note that each test flushes the caches first; a more common use case might be to
run *mu index* when new mail has arrived; the cache may stay quite `warm' in that
case:
#+begin_example
$ time mu index --quiet
0,33s user 0,40s system 80% cpu 0,905 total
#+end_example
which is more than 30000 messages per second.
** indexing in 2012
As per June 2012, we did the same non-scientific benchmark, this time with an
Intel i5-2500 CPU @ 3.30GHz, an ext4 file system and a maildir with 22589
messages. We start without an existing database.
#+begin_example
$ sudo sh -c 'sync && echo 3 > /proc/sys/vm/drop_caches'
$ time mu index --quiet
27,79s user 2,17s system 48% cpu 1:01,47 total
#+end_example
(about 813 messages per second)
A second run, which is the more typical use case when there is a database
already, goes much faster:
#+begin_example
$ sudo sh -c 'sync && echo 3 > /proc/sys/vm/drop_caches'
$ time mu index --quiet
0,13s user 0,30s system 19% cpu 2,162 total
#+end_example
(more than 173000 messages per second)
** indexing in 2016
As per July 2016, we did the same non-scientific benchmark, again with the Intel
i5-2500 CPU @ 3.30GHz, an ext4 file system. This time, the maildir contains
72525 messages.
#+begin_example
$ sudo sh -c 'sync && echo 3 > /proc/sys/vm/drop_caches'
$ time mu index --quiet
40,34s user 2,56s system 64% cpu 1:06,17 total
#+end_example
(about 1099 messages per second).
** indexing in 2022
A few years later and it is June 2022. There's a lot more happening during
indexing, but indexing became multi-threaded and machines are faster; e.g. this
is with an AMD Ryzen Threadripper 1950X (16 cores) @ 3.399GHz.
The instructions are a little different since we have a proper repeatable
benchmark now. After building,
#+begin_example
$ sudo sh -c 'sync && echo 3 > /proc/sys/vm/drop_caches'
% THREAD_NUM=4 build/lib/tests/bench-indexer -m perf
# random seed: R02Sf5c50e4851ec51adaf301e0e054bd52b
1..1
# Start of bench tests
# Start of indexer tests
indexed 5000 messages in 20 maildirs in 3763ms; 752 μs/message; 1328 messages/s (4 thread(s))
ok 1 /bench/indexer/4-cores
# End of indexer tests
# End of bench tests
#+end_example
Things are again a little faster, even though the index does a lot more now
(text-normalization, and pre-generating message-sexps). A faster machine helps,
too!
** recent releases
Indexing the the same 93000-message mail corpus with the last few releases:
#+ATTR_MAN: :disable-caption t
| release | time (sec) | notes |
|---------------+------------+------------------------------------------|
| 1.4 | 160s | |
| 1.6 | 178s | |
| 1.8 | 97s | |
| 1.10 | 120s | adds html indexing, sexp-caching |
| 1.11 (master) | 96s | adds language-guessing, batch-size=50000 |
| | | |
Quite some variation!
Over time new features / refactoring can change the timings quite a bit. At
least for now, the latest code is both the fastest and the most feature-rich.
#+include: "exit-code.inc" :minlevel 1
#+include: "prefooter.inc"
* SEE ALSO
{{{man-link(maildir,5)}}},
{{{man-link(mu,1)}}},
{{{man-link(mu-init,1)}}},
{{{man-link(mu-find,1)}}},
{{{man-link(mu-cfind,1)}}}