ahvn.utils.basic.str_utils module
String manipulation and text processing utilities for AgentHeaven.
- ahvn.utils.basic.str_utils.truncate(s, cutoff=-1)
Truncate a string if it exceeds the specified cutoff length.
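
The following is a minimal sketch of cutoff-based truncation, not the library's implementation. The name `truncate_sketch`, the `marker` parameter, and the rule that a negative cutoff disables truncation are all assumptions made for illustration.

```python
def truncate_sketch(s: str, cutoff: int = -1, marker: str = "...") -> str:
    """Hypothetical stand-in for illustration only."""
    # Assumption: a negative cutoff means "never truncate".
    if cutoff < 0 or len(s) <= cutoff:
        return s
    # If the marker itself would not fit, fall back to a hard cut.
    if cutoff <= len(marker):
        return s[:cutoff]
    # Keep the result within `cutoff` characters, marker included.
    return s[: cutoff - len(marker)] + marker


print(truncate_sketch("hello world", cutoff=8))  # hello...
print(truncate_sketch("hello world"))            # hello world
```

Counting the marker against the cutoff keeps the output length predictable; the actual function may make a different choice.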
- ahvn.utils.basic.str_utils.value_repr(value, cutoff=-1, round_digits=6)
Format a value representation for display, truncating if too long.
- Parameters:
- Returns:
Formatted value representation.
- Return type:
- ahvn.utils.basic.str_utils.omission_list(items, top=-1, bottom=1)
Shorten a list by omitting middle items when it exceeds the specified limit.
- Parameters:
- Returns:
The truncated list with middle items omitted if necessary.
- Return type:
List
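
One way to picture middle omission is the sketch below. It is an illustration under stated assumptions, not the library's code: the name `omit_middle`, the `"..."` placeholder, and the reading of a negative `top` as "no limit" are all hypothetical.

```python
from typing import Any, List


def omit_middle(items: List[Any], top: int = -1, bottom: int = 1,
                marker: Any = "...") -> List[Any]:
    """Hypothetical stand-in: keep the first `top` and last `bottom` items."""
    bottom = max(bottom, 0)
    # Assumption: a negative `top` means nothing is omitted.
    if top < 0 or len(items) <= top + bottom:
        return list(items)
    head = list(items[:top])
    # Slice from an explicit index so that bottom == 0 yields an empty tail.
    tail = list(items[len(items) - bottom:])
    return head + [marker] + tail


print(omit_middle(list(range(10)), top=3, bottom=1))  # [0, 1, 2, '...', 9]
print(omit_middle([1, 2, 3], top=3, bottom=1))        # [1, 2, 3]
```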
- ahvn.utils.basic.str_utils.markdown_symbol(content)
Generate a markdown code block symbol that does not conflict with the content.
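
A common way to choose a non-conflicting fence is to make it longer than the longest run of backticks in the content. The sketch below shows that technique; whether `markdown_symbol` works this way is an assumption, and `pick_fence` and its `minimum` parameter are hypothetical names.

```python
import re


def pick_fence(content: str, minimum: int = 3) -> str:
    """Hypothetical stand-in: return a backtick fence safe to wrap `content`."""
    # Length of the longest run of consecutive backticks in the content.
    longest = max((len(run) for run in re.findall(r"`+", content)), default=0)
    # The fence must be strictly longer than any run it encloses.
    return "`" * max(minimum, longest + 1)


# Content that already contains a three-backtick fence.
sample = "`" * 3 + "python\nprint(1)\n" + "`" * 3
fence = pick_fence(sample)
print(len(fence))  # 4
wrapped = f"{fence}\n{sample}\n{fence}"
```

Because a Markdown closing fence needs at least as many backticks as the opening one, shorter runs inside the content cannot end the block early.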
- ahvn.utils.basic.str_utils.line_numbered(content, start=-1, window=None)
Add line numbers to the given content, starting from the specified number.
- Parameters:
- Returns:
The content with line numbers added.
- Return type:
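
As a rough illustration of line numbering, the sketch below prefixes each line with a right-aligned number. It does not model the documented `start=-1` default or the `window` parameter, whose semantics are not stated here; the name `number_lines`, the 1-based default, and the `": "` separator are assumptions.

```python
def number_lines(content: str, start: int = 1) -> str:
    """Hypothetical stand-in: prefix each line with its line number."""
    lines = content.splitlines()
    if not lines:
        return ""
    # Right-align numbers to the width of the largest one so columns line up.
    width = len(str(start + len(lines) - 1))
    return "\n".join(
        f"{start + i:>{width}}: {line}" for i, line in enumerate(lines)
    )


print(number_lines("alpha\nbeta\ngamma", start=9))
#  9: alpha
# 10: beta
# 11: gamma
```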
- ahvn.utils.basic.str_utils.indent(s, tab=4, **kwargs)
Indent a string by a specified number of spaces or a tab character.
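
The description suggests `tab` may be either a space count or a literal prefix such as a tab character. A minimal sketch of that idea, built on the standard library's `textwrap.indent`, is shown below. The name `indent_sketch` is hypothetical, and what the real function does with `**kwargs` is not documented here, so the sketch omits it.

```python
import textwrap
from typing import Union


def indent_sketch(s: str, tab: Union[int, str] = 4) -> str:
    """Hypothetical stand-in: indent every non-blank line of `s`."""
    # Assumption: an int means that many spaces, a str is used as the prefix.
    prefix = " " * tab if isinstance(tab, int) else tab
    # textwrap.indent leaves whitespace-only lines unprefixed by default.
    return textwrap.indent(s, prefix)


print(repr(indent_sketch("a\n\nb", 2)))   # '  a\n\n  b'
print(repr(indent_sketch("a\nb", "\t")))  # '\ta\n\tb'
```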
- ahvn.utils.basic.str_utils.is_delimiter(char)
Check whether a character is a delimiter, that is, one that breaks a word boundary.
- ahvn.utils.basic.str_utils.normalize_text(text)
Normalize text through tokenization, stop word removal, lemmatization, and lowercasing.
- ahvn.utils.basic.str_utils.generate_ngrams(tokens, n)
Generate n-grams from a list of tokens.
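
N-gram generation is usually a sliding window over the token list. The sketch below shows the standard technique; the name `ngrams_sketch` and the choice to return tuples are assumptions, and the library's return format may differ.

```python
from typing import List, Sequence, Tuple


def ngrams_sketch(tokens: Sequence[str], n: int) -> List[Tuple[str, ...]]:
    """Hypothetical stand-in: all contiguous windows of length `n`."""
    if n <= 0:
        raise ValueError("n must be positive")
    # When len(tokens) < n the range is empty, so the result is [].
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


print(ngrams_sketch(["a", "b", "c", "d"], 2))
# [('a', 'b'), ('b', 'c'), ('c', 'd')]
```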
- ahvn.utils.basic.str_utils.asymmetric_jaccard_score(query, doc, ngram=6)
Calculate asymmetric Jaccard containment score between query and document.
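
Jaccard containment is commonly defined as |Q ∩ D| / |Q|: the share of the query's n-grams that also occur in the document. The sketch below applies that definition to character n-grams. Whether the library uses character or token n-grams, and how it normalizes its inputs, is not stated here, so `char_ngrams` and `containment` are hypothetical helpers.

```python
from typing import Set


def char_ngrams(text: str, n: int) -> Set[str]:
    """All distinct character n-grams of `text` (empty if text is too short)."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def containment(query: str, doc: str, n: int = 6) -> float:
    """Hypothetical stand-in: fraction of query n-grams found in the doc."""
    q = char_ngrams(query, n)
    if not q:
        return 0.0  # Assumption: an n-gram-free query scores zero.
    return len(q & char_ngrams(doc, n)) / len(q)


short, long_ = "agent heaven", "welcome to agent heaven docs"
print(containment(short, long_))        # 1.0
print(containment(long_, short) < 1.0)  # True
```

The score is asymmetric because the denominator counts only the query's n-grams: a short query fully contained in a long document scores 1.0, while the reverse comparison scores lower.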
- ahvn.utils.basic.str_utils.resolve_match_conflicts(results, conflict='overlap', query_length=0, inverse=False)
Resolve overlapping matches in search results based on conflict strategy.
This utility function filters overlapping text spans when multiple entities match at the same or overlapping positions in a query string. It operates on search results that contain match position information.
- Parameters:
results (list) – List of result dictionaries. Each dictionary must contain:
- 'id': Entity identifier
- 'matches': List of (start, end) tuples representing match positions in the query
conflict (str, optional) – Strategy for handling overlapping matches. Defaults to "overlap". Options:
- "overlap": Keep all matches, including overlapping ones (no filtering)
- "longest": Keep only the longest match in any overlapping set
- "longest_distinct": Allow multiple entities to have overlapping matches as long as they are the longest matches
query_length (int, optional) – Length of the query string. Required for the "longest" and "longest_distinct" strategies when inverse=True. Defaults to 0.
inverse (bool, optional) – Whether the matches were computed on reversed strings. Affects the sorting and comparison logic. Defaults to False.
- Returns:
Filtered list of result dictionaries with the same structure as the input, where each result's 'matches' list has been filtered according to the conflict resolution strategy.
- Return type:
Examples
>>> results = [
...     {'id': 1, 'matches': [(0, 5), (10, 15), (22, 27), (32, 37)]},
...     {'id': 2, 'matches': [(2, 8), (12, 18), (21, 27), (32, 38)]}
... ]
>>> resolve_match_conflicts(results, conflict="longest", query_length=40)
[{'id': 1, 'matches': [(0, 5), (10, 15)]}, {'id': 2, 'matches': [(21, 27), (32, 38)]}]