ahvn.klengine.daac_engine module
- class ahvn.klengine.daac_engine.DAACKLEngine(storage, path, encoder=None, min_length=2, inverse=True, normalizer=None, name=None, condition=None, encoding='utf-8', *args, **kwargs)
Bases: BaseKLEngine
A Double Array AC Automaton-based KLEngine for efficient string search in BaseUKF objects.
This engine uses the Aho-Corasick automaton algorithm for fast multi-pattern string matching. It’s particularly useful for knowledge base applications where you need to find all occurrences of known entity strings within a given text query. The engine is designed to be inplace (storing only id and string, not full data) and requires external storage for BaseUKF objects.
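The multi-pattern matching described above can be illustrated with a minimal pure-Python sketch. It uses a plain dict-based Aho-Corasick automaton rather than the double-array layout the engine is named after, and the helper names build_automaton and find_all are hypothetical, not part of ahvn.

```python
from collections import deque
from typing import Dict, List, Tuple

Automaton = Tuple[List[Dict[str, int]], List[int], List[List[str]]]


def build_automaton(patterns: List[str]) -> Automaton:
    """Build a dict-based Aho-Corasick automaton (illustrative helper).

    Returns (goto, fail, out): per-state transitions, failure links, and
    the patterns recognised on reaching each state.
    """
    goto: List[Dict[str, int]] = [{}]
    fail: List[int] = [0]
    out: List[List[str]] = [[]]

    # Phase 1: insert every pattern into a trie.
    for pattern in patterns:
        state = 0
        for ch in pattern:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(pattern)

    # Phase 2: breadth-first pass that computes failure links.
    queue = deque(goto[0].values())  # depth-1 states fail to the root
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            fallback = fail[state]
            while fallback and ch not in goto[fallback]:
                fallback = fail[fallback]
            fail[nxt] = goto[fallback].get(ch, 0)
            # Inherit patterns that end at the failure target.
            out[nxt] = out[nxt] + out[fail[nxt]]
    return goto, fail, out


def find_all(text: str, automaton: Automaton) -> List[Tuple[int, int, str]]:
    """Return (start, end, pattern) for every occurrence, in one pass."""
    goto, fail, out = automaton
    state = 0
    hits: List[Tuple[int, int, str]] = []
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for pattern in out[state]:
            hits.append((i - len(pattern) + 1, i + 1, pattern))
    return hits


automaton = build_automaton(["he", "she", "his", "hers"])
print(find_all("ushers", automaton))
```

The text is scanned once regardless of how many patterns are indexed, which is why this family of algorithms suits lookups against a large set of known entity strings.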
- Search Methods:
_search(query, conflict, whole_word, include, *args, **kwargs): AC automaton-based string search.
- Abstract Methods (inherited from BaseKLEngine):
_upsert(kl): Insert or update a BaseUKF in the engine.
_remove(key): Remove a BaseUKF from the engine by its key (id).
_clear(): Clear all BaseUKF objects from the engine.
- __init__(storage, path, encoder=None, min_length=2, inverse=True, normalizer=None, name=None, condition=None, encoding='utf-8', *args, **kwargs)
Initialize the DAACKLEngine.
- Parameters:
storage (BaseKLStore) – The storage backend for BaseUKF objects (required).
path (str) – Local directory path to store AC automaton files.
encoder (Callable[[BaseUKF], List[str]]) – Function to extract searchable strings from BaseUKF objects. The recommended pattern is to use lambda kl: kl.synonyms where kl.synonyms contains all string variants that should point to the same knowledge object.
min_length (int) – Minimum length of strings to include in the automaton. Default is 2.
inverse (bool) – If True, builds the automaton on reversed strings for suffix matching efficiency. Default is True.
normalizer (Optional[Union[Callable[[str], str], bool]]) – Function to normalize strings before indexing and searching. If True, uses a default text normalizer including tokenization, stop word removal, lemmatization, and lowercasing. If None or False, no normalization is applied. Default is None.
name (Optional[str]) – Name of the KLEngine instance. If None, defaults to “{storage.name}_daac_idx”.
condition (Optional[Callable]) – Optional upsert/insert condition to apply to the KLEngine. KLs that do not satisfy the condition will be ignored. If None, all KLs are accepted.
encoding (Optional[str]) – Encoding used for saving/loading files. Default is None, which uses HEAVEN_CM’s default encoding.
*args – Additional positional arguments passed to BaseKLEngine.
**kwargs – Additional keyword arguments passed to BaseKLEngine.
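As a rough illustration of how encoder, normalizer, min_length, and inverse could interact when pattern strings are prepared for indexing, consider the sketch below. FakeKL is a hypothetical stand-in for BaseUKF, prepare_patterns is not an ahvn function, and the order of the steps (normalize, then filter by length, then reverse) is an assumption rather than documented behaviour.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class FakeKL:
    """Hypothetical stand-in for a BaseUKF object (illustration only)."""
    id: int
    synonyms: List[str] = field(default_factory=list)


def prepare_patterns(
    kls: List[FakeKL],
    encoder: Callable[[FakeKL], List[str]],
    min_length: int = 2,
    inverse: bool = True,
    normalizer: Optional[Callable[[str], str]] = None,
) -> Dict[str, List[int]]:
    """Map each indexable pattern string to the ids that produced it."""
    patterns: Dict[str, List[int]] = {}
    for kl in kls:
        for text in encoder(kl):
            if normalizer is not None:
                text = normalizer(text)
            if len(text) < min_length:
                continue  # too short to be worth indexing
            # Assumed: with inverse=True the reversed string is stored.
            key = text[::-1] if inverse else text
            ids = patterns.setdefault(key, [])
            if kl.id not in ids:
                ids.append(kl.id)
    return patterns


kls = [FakeKL(1, ["NYC", "New York", "N"]), FakeKL(2, ["nyc"])]
print(prepare_patterns(kls, lambda kl: kl.synonyms, normalizer=str.lower))
```

Here str.lower plays the role of a custom normalizer; the one-character synonym is dropped by min_length, and two objects sharing a normalized synonym end up under the same pattern.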
- __len__()
Returns the number of unique BaseUKF entities (IDs) currently indexed by the engine.
- flush()
Apply pending deletions and rebuild the AC automaton.
This method processes lazy deletions and rebuilds the automaton to ensure all changes are reflected in the search index.
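The lazy-deletion pattern mentioned here can be sketched with a toy index. This is not the engine's implementation: LazyIndex and its attributes are hypothetical, and a plain dictionary stands in for the automaton. The point is only that removals are recorded cheaply and the searchable structure is rebuilt once, at flush time.

```python
from typing import Dict, List, Set


class LazyIndex:
    """Toy index with lazy deletion (illustration only)."""

    def __init__(self) -> None:
        self.entries: Dict[int, List[str]] = {}  # id -> indexed strings
        self.pending_removals: Set[int] = set()  # ids awaiting deletion
        self.built: Dict[str, List[int]] = {}    # searchable snapshot

    def upsert(self, key: int, strings: List[str]) -> None:
        # Re-adding a key cancels any pending removal for it.
        self.pending_removals.discard(key)
        self.entries[key] = list(strings)

    def remove(self, key: int) -> None:
        # Cheap: only record the intent; the snapshot is left untouched.
        self.pending_removals.add(key)

    def flush(self) -> None:
        # Apply pending deletions, then rebuild the snapshot from scratch.
        for key in self.pending_removals:
            self.entries.pop(key, None)
        self.pending_removals.clear()
        rebuilt: Dict[str, List[int]] = {}
        for key, strings in self.entries.items():
            for s in strings:
                rebuilt.setdefault(s, []).append(key)
        self.built = rebuilt

    def search(self, s: str) -> List[int]:
        return self.built.get(s, [])


idx = LazyIndex()
idx.upsert(1, ["paris"])
idx.upsert(2, ["paris", "lutetia"])
idx.flush()
idx.remove(1)
print(idx.search("paris"))  # snapshot still predates the removal
idx.flush()
print(idx.search("paris"))
```

In this toy, a search between remove and flush still sees the old snapshot. Whether the real engine filters pending deletions at query time is not stated in this reference.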
- sync(batch_size=None, flush=True, progress=None, **kwargs)
Synchronize the KLEngine with its attached KLStore, if applicable. Note that a full synchronization can involve large data uploads or downloads, which may cause performance issues or even errors on some backends. Parameters such as batch_size are therefore provided to control the synchronization process. Overriding this method is recommended for better performance.
- Parameters:
batch_size (Optional[int]) – The batch size for synchronization. If None, use the default batch size from configuration (512). If <= 0, yields all KLs in a single batch.
flush (bool) – If True, saves the engine state after synchronization. Default is True.
progress (Type[Progress]) – Progress class for reporting. None for silent, TqdmProgress for terminal.
**kwargs – Additional keyword arguments.
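The batch_size rules above (None falls back to the configured default, a non-positive value means a single batch) can be expressed as a small generator. iter_batches is a hypothetical helper written for illustration, not the code sync actually uses, and 512 is simply the default quoted in the parameter description.

```python
from typing import Iterable, Iterator, List, Optional, TypeVar

T = TypeVar("T")
DEFAULT_BATCH_SIZE = 512  # default quoted in the batch_size description


def iter_batches(
    items: Iterable[T], batch_size: Optional[int] = None
) -> Iterator[List[T]]:
    """Yield items in lists of at most batch_size elements."""
    if batch_size is None:
        batch_size = DEFAULT_BATCH_SIZE
    if batch_size <= 0:
        # Non-positive size: everything goes into one batch.
        everything = list(items)
        if everything:
            yield everything
        return
    batch: List[T] = []
    for item in items:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # final, possibly shorter, batch


print([len(b) for b in iter_batches(range(1000))])
print(list(iter_batches(range(5), 2)))
```

Batching bounds how much data is pulled from the store at once, at the cost of more round trips for small batch sizes.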
- save(path=None)
Save the current state of the engine to disk.
- Parameters:
path (str) – Directory path to save the data. If None, uses self.path.