message: use html-to-text scraper for html parts

We were passing HTML parts as-is to the Xapian indexer; however, it is
better to strip the HTML markup first and index only the text.

We use the new built-in html->text scraper for that.
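
For illustration only, here is a minimal sketch of the html->text idea. It is not mu's built-in scraper; it is a Python approximation that assumes only the standard library's `html.parser`. The names `TextScraper` and `html_to_text` are hypothetical. It drops tags, skips `<script>` and `<style>` content, decodes character references, and collapses whitespace, so only human-readable text would reach an indexer.

```python
from html.parser import HTMLParser


class TextScraper(HTMLParser):
    """Collect human-readable text, skipping <script> and <style> content."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        # convert_charrefs=True decodes references such as &amp; before
        # the text is handed to handle_data().
        super().__init__(convert_charrefs=True)
        self.chunks = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element.
        if self.skip_depth == 0:
            self.chunks.append(data)


def html_to_text(html):
    """Return the text of an HTML fragment with whitespace collapsed."""
    scraper = TextScraper()
    scraper.feed(html)
    scraper.close()  # flush any text still buffered by the parser
    # Chunks are joined with a space, so text split by inline tags in the
    # middle of a word would be separated; acceptable for a sketch.
    return " ".join(" ".join(scraper.chunks).split())
```

Under these assumptions, `html_to_text("<p>Hello <b>world</b></p>")` yields the plain string `Hello world`, which is the kind of input a text indexer can tokenize without tag names polluting the terms.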
Dirk-Jan C. Binnema
2023-07-23 14:46:11 +03:00
parent 56b8fad89e
commit b795242d5a
7 changed files with 31 additions and 69 deletions

@@ -19,9 +19,14 @@
 - what used to be the ~mu fields~ command has been merged into ~mu info~; i.e.,
 ~mu fields~ is now ~mu info fields~.
-- ~mu view~ gained ~--format=html~ for it to output the HTML body of the message
-rather than the (default) plain-text body. See its updated manpage for
-details.
+- ~mu view~ gained ~--format=html~ which compels it to output the HTML body of
+the message rather than the (default) plain-text body. See its updated
+manpage for details.
+- when encountering an HTML message part during indexing, previously (i.e.,
+~mu 1.10~) we would attempt to process that as-is, with HTML-tags etc.; this
+is now improved by employing a html->text scraper which extracts the
+human-readable text from the html.
 - experimental: if you build ~mu~ with [[https://github.com/CLD2Owners/cld2][CLD2]] support (available in many Linux
 distros), ~mu~ will try to detect the language of the body of e-mail