ahvn.klengine.daac_engine module

class ahvn.klengine.daac_engine.DAACKLEngine(storage, path, encoder=None, min_length=2, inverse=True, normalizer=None, name=None, condition=None, encoding='utf-8', *args, **kwargs)[source]

Bases: BaseKLEngine

A Double Array AC Automaton-based KLEngine for efficient string search in BaseUKF objects.

This engine uses the Aho-Corasick automaton algorithm for fast multi-pattern string matching. It's particularly useful for knowledge base applications where you need to find all occurrences of known entity strings within a given text query. The engine is designed to be inplace (storing only id and string, not full data) and requires external storage for BaseUKF objects.
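The multi-pattern matching idea described above can be illustrated with a minimal sketch. It uses a plain dict-based trie rather than the double-array layout this engine is named after, and the names build_automaton and find_all are hypothetical helpers, not part of ahvn.

```python
from collections import deque
from typing import Dict, List, Tuple


def build_automaton(patterns: List[str]):
    """Build a dict-based Aho-Corasick automaton (goto, fail, output tables)."""
    goto: List[Dict[str, int]] = [{}]  # goto[state][char] -> next state
    fail: List[int] = [0]              # failure link of each state
    out: List[List[str]] = [[]]        # patterns recognized at each state

    # Phase 1: insert every pattern into the trie.
    for pattern in patterns:
        state = 0
        for ch in pattern:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(pattern)

    # Phase 2: breadth-first traversal to compute failure links.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            # Inherit patterns that end at the failure target (proper suffixes).
            out[nxt] = out[nxt] + out[fail[nxt]]
    return goto, fail, out


def find_all(text: str, automaton) -> List[Tuple[int, int, str]]:
    """Return (start, end, pattern) for every occurrence, in one pass over text."""
    goto, fail, out = automaton
    state = 0
    matches = []
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for pattern in out[state]:
            matches.append((i - len(pattern) + 1, i + 1, pattern))
    return matches


automaton = build_automaton(["he", "she", "his", "hers"])
matches = find_all("ushers", automaton)
```

The text is scanned once regardless of how many patterns are indexed, which is why this family of algorithms suits lookups of many known entity strings in a query.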

Search Methods:

_search(query, conflict, whole_word, include, *args, **kwargs): AC automaton-based string search.

Abstract Methods (inherited from BaseKLEngine):

_upsert(kl): Insert or update a BaseUKF in the engine.

_remove(key): Remove a BaseUKF from the engine by its key (id).

_clear(): Clear all BaseUKF objects from the engine.

Parameters:
inplace: bool = False
recoverable: bool = False

__init__(storage, path, encoder=None, min_length=2, inverse=True, normalizer=None, name=None, condition=None, encoding='utf-8', *args, **kwargs)[source]

Initialize the DAACKLEngine.

Parameters:
  • storage (BaseKLStore) -- The storage backend for BaseUKF objects (required).

  • path (str) -- Local directory path to store AC automaton files.

  • encoder (Callable[[BaseUKF], List[str]]) -- Function to extract searchable strings from BaseUKF objects. The recommended pattern is to use lambda kl: kl.synonyms where kl.synonyms contains all string variants that should point to the same knowledge object.

  • min_length (int) -- Minimum length of strings to include in the automaton. Default is 2.

  • inverse (bool) -- If True, builds the automaton on reversed strings for suffix matching efficiency. Default is True.

  • normalizer (Optional[Union[Callable[[str], str], bool]]) -- Function to normalize strings before indexing and searching. If True, uses a default text normalizer including tokenization, stop word removal, lemmatization, and lowercasing. If None or False, no normalization is applied. Default is None.

  • name (Optional[str]) -- Name of the KLEngine instance. If None, defaults to "{storage.name}_daac_idx".

  • condition (Optional[Callable]) -- Optional upsert/insert condition to apply to the KLEngine. KLs that do not satisfy the condition will be ignored. If None, all KLs are accepted.

  • encoding (Optional[str]) -- Encoding used for saving/loading files. Default is 'utf-8'. If None, uses HEAVEN_CM's default encoding.

  • *args -- Additional positional arguments passed to BaseKLEngine.

  • **kwargs -- Additional keyword arguments passed to BaseKLEngine.
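How encoder, normalizer, min_length, and inverse might interact can be sketched as follows. This is only an illustration under stated assumptions: Entity is a hypothetical stand-in for BaseUKF, prepare_patterns is not a function of the library, and the order of the steps (normalize, then filter by length, then reverse) is assumed rather than taken from the engine's source.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class Entity:
    """Hypothetical stand-in for a BaseUKF object; not the library class."""
    id: int
    synonyms: List[str] = field(default_factory=list)


def prepare_patterns(
    kl: Entity,
    encoder: Callable[[Entity], List[str]],
    min_length: int = 2,
    inverse: bool = True,
    normalizer: Optional[Callable[[str], str]] = None,
) -> Dict[str, int]:
    """Map each indexable string to the id of the object it points to."""
    patterns: Dict[str, int] = {}
    for text in encoder(kl):
        if normalizer is not None:
            text = normalizer(text)  # assumed to run before the length check
        if len(text) < min_length:
            continue  # too short to be indexed
        key = text[::-1] if inverse else text  # reversed when inverse=True
        patterns[key] = kl.id
    return patterns


kl = Entity(id=7, synonyms=["New York", "NYC", "N"])
patterns = prepare_patterns(kl, encoder=lambda kl: kl.synonyms, normalizer=str.lower)
```

Here the one-character synonym "N" is dropped by min_length=2, and the remaining strings are lowercased and reversed before being mapped to the object's id.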

__len__()[source]

Returns the number of unique BaseUKF entities (IDs) currently indexed by the engine.

clear(**kwargs)[source]

Clear all BaseUKF objects from the engine, resetting it to an empty state.

flush()[source]

Apply pending deletions and rebuild the AC automaton.

This method processes lazy deletions and rebuilds the automaton to ensure all changes are reflected in the search index.
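The lazy-deletion pattern described here can be sketched in a few lines. LazyIndex is a hypothetical toy class, not the engine's implementation, and a plain dictionary stands in for the automaton: removals are only recorded, and the searchable structure changes when flush() rebuilds it.

```python
from typing import Dict, List, Set


class LazyIndex:
    """Hypothetical index illustrating lazy deletion; not the engine's code."""

    def __init__(self) -> None:
        self.strings: Dict[int, List[str]] = {}  # id -> indexed strings
        self.pending: Set[int] = set()           # ids marked for removal
        self.lookup: Dict[str, Set[int]] = {}    # built structure: string -> ids

    def upsert(self, key: int, strings: List[str]) -> None:
        self.pending.discard(key)  # re-inserting cancels a pending removal
        self.strings[key] = list(strings)

    def remove(self, key: int) -> None:
        # Cheap: only record the deletion; the built structure is untouched.
        if key in self.strings:
            self.pending.add(key)

    def flush(self) -> None:
        # Apply pending deletions, then rebuild the lookup from scratch.
        for key in self.pending:
            self.strings.pop(key, None)
        self.pending.clear()
        self.lookup = {}
        for key, strings in self.strings.items():
            for s in strings:
                self.lookup.setdefault(s, set()).add(key)

    def __len__(self) -> int:
        return len(self.strings) - len(self.pending)


index = LazyIndex()
index.upsert(1, ["paris"])
index.upsert(2, ["rome"])
index.flush()
index.remove(1)                      # recorded, not yet applied
stale = "paris" in index.lookup      # still present until the next flush
index.flush()
```

Deferring the rebuild keeps individual removals cheap, at the cost of a stale search structure until flush() is called.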

sync(batch_size=None, flush=True, progress=None, **kwargs)[source]

Synchronize the KLEngine with its attached KLStore, if applicable. Note that a full synchronization can transfer a large amount of data, which may cause performance issues or even errors on some backends. Parameters such as batch_size are therefore provided to control the synchronization process. Overriding this method is recommended for better performance.

Parameters:
  • batch_size (Optional[int]) -- The batch size for synchronization. If None, use the default batch size from configuration (512). If <= 0, yields all KLs in a single batch.

  • flush (bool) -- If True, saves the engine state after synchronization. Default is True.

  • progress (Type[Progress]) -- Progress class for reporting. None for silent, TqdmProgress for terminal.

  • **kwargs -- Additional keyword arguments.
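The batch_size semantics listed above (None falls back to the default, a non-positive value yields everything at once) can be sketched with a small generator. iter_batches and DEFAULT_BATCH_SIZE are hypothetical names used for illustration only; the value 512 is the configuration default mentioned in the parameter description.

```python
from itertools import islice
from typing import Iterable, Iterator, List, Optional, TypeVar

T = TypeVar("T")
DEFAULT_BATCH_SIZE = 512  # documented configuration default


def iter_batches(items: Iterable[T], batch_size: Optional[int] = None) -> Iterator[List[T]]:
    """Yield lists of at most batch_size items; a non-positive size means one batch."""
    if batch_size is None:
        batch_size = DEFAULT_BATCH_SIZE
    iterator = iter(items)
    if batch_size <= 0:
        batch = list(iterator)
        if batch:
            yield batch
        return
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


sizes = [len(batch) for batch in iter_batches(range(1200))]
```

With the default size, 1200 items are split into batches of 512, 512, and 176.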

save(path=None)[source]

Save the current state of the engine to disk.

Parameters:

path (str) -- Directory path to save the data. If None, uses self.path.

load(path=None)[source]

Load a previously saved engine state from disk.

Parameters:

path (str) -- Directory path to load the data from. If None, uses self.path.

Returns:

True if loading was successful (files exist), False otherwise.

Return type:

bool
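The save/load contract (load reports whether the files exist instead of raising) can be sketched as below. This is a minimal illustration, assuming a single JSON file named state.json inside the directory; the engine's actual on-disk layout is not described here, and save_state and load_state are hypothetical helpers.

```python
import json
import tempfile
from pathlib import Path
from typing import Dict, List


def save_state(state: Dict[str, List[str]], path: str, encoding: str = "utf-8") -> None:
    """Write the state to <path>/state.json, creating the directory if needed."""
    directory = Path(path)
    directory.mkdir(parents=True, exist_ok=True)
    (directory / "state.json").write_text(json.dumps(state), encoding=encoding)


def load_state(state: Dict[str, List[str]], path: str, encoding: str = "utf-8") -> bool:
    """Replace state in place from <path>/state.json; return False if it is missing."""
    file = Path(path) / "state.json"
    if not file.exists():
        return False
    state.clear()
    state.update(json.loads(file.read_text(encoding=encoding)))
    return True


with tempfile.TemporaryDirectory() as tmp:
    restored: Dict[str, List[str]] = {}
    missing = load_state(restored, tmp)             # nothing saved yet
    save_state({"7": ["new york", "nyc"]}, tmp)
    loaded = load_state(restored, tmp)              # files now exist
```

Returning a boolean lets a caller fall back to a full rebuild when no saved state is found.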

close()[source]

Close the engine and save current state to disk.