utils: add utf8_wordbreak

Determine if a string has wordbreaks in a mostly Xapian-compatible way.
We need this to determine what strings should be considered "phrases".
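
A minimal sketch of the idea, not the code added by this commit. The names `is_word_byte`, `wordbreak_sketch` and `has_wordbreak_sketch` are hypothetical. It assumes that ASCII letters, digits and '_' are word characters, that every byte >= 0x80 belongs to a word (so multi-byte UTF-8 passes through untouched), and that each wordbreak character becomes exactly one space without collapsing runs. Xapian's real rules are Unicode-aware and may differ on all three points.

```cpp
#include <string>

// Hypothetical helper: approximate "word character" test on single bytes.
// Assumption: bytes >= 0x80 are parts of multi-byte UTF-8 sequences and are
// treated as word characters; Xapian itself classifies full code points.
static bool is_word_byte(unsigned char c)
{
	return c >= 0x80 || (c >= '0' && c <= '9') || (c >= 'A' && c <= 'Z') ||
	       (c >= 'a' && c <= 'z') || c == '_';
}

// Replace every non-word byte with one space (runs are not collapsed).
std::string wordbreak_sketch(const std::string& txt)
{
	std::string out{txt};
	for (auto& c : out)
		if (!is_word_byte(static_cast<unsigned char>(c)))
			c = ' ';
	return out;
}

// A string is treated as a phrase candidate when the result contains a space,
// i.e. when the input had at least one wordbreak character.
bool has_wordbreak_sketch(const std::string& txt)
{
	return wordbreak_sketch(txt).find(' ') != std::string::npos;
}
```

With these assumptions, `wordbreak_sketch("foo-bar")` yields `"foo bar"`, so `has_wordbreak_sketch("foo-bar")` is true and the string would be handled as a phrase, while `"foobar"` would not.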
Dirk-Jan C. Binnema
2023-09-17 10:01:15 +03:00
parent 94c90bd0c5
commit 7cbab21099
3 changed files with 80 additions and 0 deletions

@@ -187,6 +187,17 @@ utf8_flatten(const std::string& s) {
*/
std::string utf8_clean(const std::string& dirty);
/**
* Replace all wordbreak chars (as recognized by Xapian) with a single SPC
*
* @param txt text
*
* @return string
*/
std::string utf8_wordbreak(const std::string& txt);
/**
* Remove ctrl characters, replacing them with ' '; subsequent
* ctrl characters are replaced by a single ' '