ahvn.utils.basic.str_utils module

String manipulation and text processing utilities for AgentHeaven.

ahvn.utils.basic.str_utils.truncate(s, cutoff=-1)

Truncate a string if it exceeds the specified cutoff length.

Parameters:

s (str) – The string to truncate.
cutoff (int) – Maximum length before truncation. Defaults to -1, meaning no cutoff.

Returns:

Truncated string if it exceeds cutoff, otherwise the original string.

Return type:

str
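The cutoff rule above can be sketched as follows. This is a hypothetical stand-in rather than AgentHeaven's code: the name `truncate_sketch` and the `"..."` suffix are assumptions, since the reference does not say what marks a truncated string.

```python
def truncate_sketch(s: str, cutoff: int = -1, suffix: str = "...") -> str:
    """Illustrative only: the suffix is an assumption, not documented behavior."""
    # A negative cutoff disables truncation, matching the documented default.
    if cutoff < 0 or len(s) <= cutoff:
        return s
    return s[:cutoff] + suffix


print(truncate_sketch("AgentHeaven", cutoff=5))  # Agent...
print(truncate_sketch("AgentHeaven"))            # AgentHeaven
```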

ahvn.utils.basic.str_utils.value_repr(value, cutoff=-1, round_digits=6)

Format a value representation for display, truncating if too long.

Parameters:

value (Any) – The value to represent.
cutoff (int) – Maximum length before truncation. Defaults to -1, meaning no cutoff.
round_digits (int) – Number of decimal places to round floats to. Only applied if the value is a float. Default is 6.

Returns:

Formatted value representation.

Return type:

str
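One way to read the parameters above is "round floats, convert to text, then truncate". The sketch below follows that reading; `value_repr_sketch`, the use of `repr()`, and the `"..."` suffix are all assumptions made for illustration.

```python
from typing import Any


def value_repr_sketch(value: Any, cutoff: int = -1, round_digits: int = 6) -> str:
    """Illustrative only: repr() and the "..." suffix are assumptions."""
    # Rounding applies to floats only, as the parameter description states.
    if isinstance(value, float):
        value = round(value, round_digits)
    text = repr(value)
    # A negative cutoff means no truncation.
    if 0 <= cutoff < len(text):
        text = text[:cutoff] + "..."
    return text


print(value_repr_sketch(3.14159265358979, round_digits=3))  # 3.142
print(value_repr_sketch(list(range(100)), cutoff=10))
```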

ahvn.utils.basic.str_utils.omission_list(items, top=-1, bottom=1)

Shorten a list by omitting its middle items when it exceeds the specified limit.

Parameters:

items (List) – The list of items.
top (int) – Number of items to keep from the start. Defaults to -1 (keep all).
bottom (int) – Number of items to keep from the end. Defaults to 1. Ignored if top is negative; otherwise the total number of kept items is top + bottom + 1.

Returns:

The truncated list with middle items omitted if necessary.

Return type:

List
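The top/bottom rule can be sketched as below. The reference does not say what the extra "+ 1" item is; this sketch assumes it is a single omission marker placed between the head and the tail. Both `omission_list_sketch` and the `marker` parameter are hypothetical.

```python
from typing import Any, List


def omission_list_sketch(
    items: List[Any], top: int = -1, bottom: int = 1, marker: Any = "..."
) -> List[Any]:
    """Illustrative only: the marker element is an assumption."""
    # Negative top keeps everything; short lists are returned unchanged.
    if top < 0 or len(items) <= top + bottom + 1:
        return list(items)
    tail = items[len(items) - bottom:] if bottom > 0 else []
    return items[:top] + [marker] + tail


print(omission_list_sketch(list(range(10)), top=3, bottom=2))
# [0, 1, 2, '...', 8, 9]
```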

ahvn.utils.basic.str_utils.markdown_symbol(content)

Generate a markdown code block symbol that does not conflict with the content.

Parameters:

content (str) – The content to check for conflicts.

Returns:

A markdown code block symbol (e.g., "```", "````", etc.) that does not appear in the content.

Return type:

str
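A common way to pick a non-conflicting fence is to find the longest run of backticks in the content and use one more, with a minimum of three. The sketch below takes that approach; whether `markdown_symbol` works the same way is an assumption, and `markdown_fence_sketch` is a hypothetical name.

```python
import re


def markdown_fence_sketch(content: str) -> str:
    """Illustrative only: returns a backtick fence longer than any run in content."""
    runs = re.findall(r"`+", content)
    longest = max((len(run) for run in runs), default=0)
    # Markdown fences need at least three backticks.
    return "`" * max(3, longest + 1)


sample = "text with " + "`" * 3 + " inside"
print(len(markdown_fence_sketch(sample)))         # 4
print(len(markdown_fence_sketch("plain text")))   # 3
```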

ahvn.utils.basic.str_utils.line_numbered(content, start=-1, window=None)

Add line numbers to the given content, starting from the specified number.

Parameters:

content (str) – The content to be numbered.
start (int) – The starting line number. If negative, no line numbers are added. Defaults to -1.
window (Optional[Tuple[int, int]]) – A tuple specifying the (start, end) line numbers to include. If None, includes all lines. Defaults to None.

Returns:

The content with line numbers added.

Return type:

str
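The sketch below illustrates one plausible reading of these parameters. Several details are assumptions not confirmed by the reference: the `"N: line"` output format, treating `window` as an inclusive range of displayed line numbers, and returning the content untouched when `start` is negative. The name `line_numbered_sketch` is hypothetical.

```python
from typing import Optional, Tuple


def line_numbered_sketch(
    content: str, start: int = -1, window: Optional[Tuple[int, int]] = None
) -> str:
    """Illustrative only: output format and window semantics are assumptions."""
    if start < 0:
        return content  # documented: negative start adds no line numbers
    lines = content.split("\n")
    numbered = list(enumerate(lines, start=start))
    if window is not None:
        lo, hi = window
        # Assumed: window bounds are displayed line numbers, both inclusive.
        numbered = [(n, line) for n, line in numbered if lo <= n <= hi]
    # Right-align numbers to the width of the largest one.
    width = len(str(start + len(lines) - 1))
    return "\n".join(f"{n:>{width}}: {line}" for n, line in numbered)


print(line_numbered_sketch("alpha\nbeta\ngamma", start=1, window=(2, 3)))
# 2: beta
# 3: gamma
```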

ahvn.utils.basic.str_utils.indent(s, tab=4, **kwargs)

Indent a string by a specified number of spaces or a tab character.

Parameters:

s (str) – The string to indent.
tab (int or str, optional) – The number of spaces or a tab character to use for indentation. Defaults to 4 spaces.
**kwargs – Additional keyword arguments are ignored.

Returns:

The indented string.

Return type:

str
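The int-or-str handling of `tab` can be sketched as follows. Leaving blank lines unindented is an assumption (it mirrors the default of `textwrap.indent`), and `indent_sketch` is a hypothetical name.

```python
from typing import Union


def indent_sketch(s: str, tab: Union[int, str] = 4) -> str:
    """Illustrative only: blank lines are left unindented by assumption."""
    # An int means that many spaces; a str is used as the prefix directly.
    prefix = " " * tab if isinstance(tab, int) else tab
    return "\n".join((prefix + line) if line else line for line in s.split("\n"))


print(indent_sketch("def f():\n    return 1", tab=2))
print(repr(indent_sketch("x", tab="\t")))  # '\tx'
```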

ahvn.utils.basic.str_utils.is_delimiter(char)

Check if a character is a word boundary breaker.

Parameters:

char (str) – The character to check.

Returns:

True if the character is whitespace or punctuation, False otherwise.

Return type:

bool
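A whitespace-or-punctuation test can be sketched with the standard library. The sketch assumes a single-character input and uses Unicode general categories to decide what counts as punctuation; the library may draw that line differently (for example, symbols such as `$` are not punctuation under this rule). `is_delimiter_sketch` is a hypothetical name.

```python
import unicodedata


def is_delimiter_sketch(char: str) -> bool:
    """Illustrative only: expects exactly one character."""
    # Unicode categories starting with "P" cover punctuation (Po, Pd, Ps, ...).
    return char.isspace() or unicodedata.category(char).startswith("P")


print([c for c in "AI, agents-and tools." if is_delimiter_sketch(c)])
# [',', ' ', '-', ' ', '.']
```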

ahvn.utils.basic.str_utils.normalize_text(text)

Normalize text through tokenization, stop word removal, lemmatization, and lowercasing.

Parameters:

text (str) – The input text to normalize.

Returns:

The normalized text with tokens separated by spaces.

Return type:

str
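The pipeline described above can be illustrated with a toy version. Everything here is an assumption made for the sake of a self-contained example: the regex tokenizer, the tiny stop-word set, and the lookup-table lemmatizer stand in for whatever NLP tooling the library actually uses, and lowercasing is done first for simplicity.

```python
import re

# Hypothetical, deliberately tiny resources for illustration.
STOP_WORDS = {"the", "a", "an", "is", "are", "of", "and", "to", "in"}
LEMMAS = {"cats": "cat", "dogs": "dog", "running": "run", "ran": "run"}


def normalize_text_sketch(text: str) -> str:
    """Illustrative only: lowercase, tokenize, drop stop words, lemmatize."""
    tokens = re.findall(r"\w+", text.lower())
    kept = [t for t in tokens if t not in STOP_WORDS]
    # Dictionary lookup stands in for a real lemmatizer.
    return " ".join(LEMMAS.get(t, t) for t in kept)


print(normalize_text_sketch("The cats are running in the garden."))
# cat run garden
```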

ahvn.utils.basic.str_utils.generate_ngrams(tokens, n)

Generate n-grams from a list of tokens.

Parameters:

tokens (list) – List of tokens to generate n-grams from.
n (int) – Maximum n-gram size.

Returns:

Set of n-grams with sizes from 1 to n.

Return type:

Set[str]
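Collecting every n-gram of size 1 through n can be sketched as below. Joining tokens with a single space to form each n-gram string is an assumption, as is the name `generate_ngrams_sketch`.

```python
from typing import List, Set


def generate_ngrams_sketch(tokens: List[str], n: int) -> Set[str]:
    """Illustrative only: n-grams are space-joined strings by assumption."""
    grams: Set[str] = set()
    for size in range(1, n + 1):
        # Slide a window of `size` tokens across the list.
        for i in range(len(tokens) - size + 1):
            grams.add(" ".join(tokens[i:i + size]))
    return grams


print(sorted(generate_ngrams_sketch(["red", "fox", "jumps"], 2)))
# ['fox', 'fox jumps', 'jumps', 'red', 'red fox']
```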

ahvn.utils.basic.str_utils.asymmetric_jaccard_score(query, doc, ngram=6)

Calculate asymmetric Jaccard containment score between query and document.

Parameters:

query (str) – The query text.
doc (str) – The document text.
ngram (int, optional) – Maximum n-gram size. Defaults to 6.

Returns:

Containment score between 0.0 and 1.0.

Return type:

float
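Containment is usually defined as the share of the query's n-grams that also occur in the document, |Q ∩ D| / |Q|, which is why the score is asymmetric. The sketch below assumes that definition and a plain lowercase whitespace tokenizer; the library presumably normalizes text first. `containment_sketch` and `_ngrams` are hypothetical helpers.

```python
from typing import List, Set


def _ngrams(tokens: List[str], n: int) -> Set[str]:
    # All space-joined n-grams of sizes 1..n.
    return {
        " ".join(tokens[i:i + size])
        for size in range(1, n + 1)
        for i in range(len(tokens) - size + 1)
    }


def containment_sketch(query: str, doc: str, ngram: int = 6) -> float:
    """Illustrative only: |Q & D| / |Q| over n-gram sets."""
    q = _ngrams(query.lower().split(), ngram)
    d = _ngrams(doc.lower().split(), ngram)
    if not q:
        return 0.0  # assumed convention for an empty query
    return len(q & d) / len(q)


print(containment_sketch("red fox", "the quick red fox jumps"))  # 1.0
print(containment_sketch("the quick red fox jumps", "red fox", ngram=2))
```

Swapping the arguments changes the result: the short query is fully contained in the long document, but only a third of the long text's n-grams appear in the short one.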

ahvn.utils.basic.str_utils.resolve_match_conflicts(results, conflict='overlap', query_length=0, inverse=False)

Resolve overlapping matches in search results based on conflict strategy.

This utility function filters overlapping text spans when multiple entities match at the same or overlapping positions in a query string. It operates on search results that contain match position information.

Parameters:

results (list) – List of result dictionaries. Each dictionary must contain:
- ‘id’: Entity identifier
- ‘matches’: List of (start, end) tuples representing match positions in the query
conflict (str, optional) – Strategy for handling overlapping matches. Defaults to “overlap”. Options:
- “overlap”: Keep all matches, including overlapping ones (no filtering)
- “longest”: Keep only the longest match for any overlapping set
- “longest_distinct”: Allow multiple entities to have overlapping matches as long as they are the longest matches
query_length (int, optional) – Length of the query string. Required for “longest” and “longest_distinct” strategies when inverse=True. Defaults to 0.
inverse (bool, optional) – Whether the matches were computed on reversed strings. Affects the sorting and comparison logic. Defaults to False.

Returns:

Filtered list of result dictionaries with the same structure as input, where each result’s ‘matches’ list has been filtered according to the conflict resolution strategy.

Return type:

list

Examples

>>> results = [
...     {'id': 1, 'matches': [(0, 5), (10, 15), (22, 27), (32, 37)]},
...     {'id': 2, 'matches': [(2, 8), (12, 18), (21, 27), (32, 38)]}
... ]
>>> resolve_match_conflicts(results, conflict="longest", query_length=40)
[{'id': 1, 'matches': [(0, 5), (10, 15)]}, {'id': 2, 'matches': [(21, 27), (32, 38)]}]
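The output above is consistent with a left-to-right greedy pass: sort all spans by start position, prefer the longer span when starts tie, and drop any span that begins before the previously kept span ends. The sketch below implements that reading for the “longest” strategy with inverse=False only. It is an inference from the example, not the library's confirmed algorithm; `resolve_longest_sketch` is a hypothetical name, and it assumes unique ids and half-open (start, end) spans.

```python
from typing import Any, Dict, List, Tuple


def resolve_longest_sketch(results: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Illustrative only: greedy leftmost-longest filtering across all entities."""
    spans: List[Tuple[int, int, Any]] = [
        (start, end, r["id"]) for r in results for start, end in r["matches"]
    ]
    # Earlier start first; on equal starts, the longer span first.
    spans.sort(key=lambda s: (s[0], -(s[1] - s[0])))
    kept: Dict[Any, List[Tuple[int, int]]] = {r["id"]: [] for r in results}
    last_end = 0
    for start, end, rid in spans:
        if start >= last_end:  # no overlap with the last kept span
            kept[rid].append((start, end))
            last_end = end
    return [{"id": r["id"], "matches": kept[r["id"]]} for r in results]


results = [
    {"id": 1, "matches": [(0, 5), (10, 15), (22, 27), (32, 37)]},
    {"id": 2, "matches": [(2, 8), (12, 18), (21, 27), (32, 38)]},
]
print(resolve_longest_sketch(results))
# [{'id': 1, 'matches': [(0, 5), (10, 15)]}, {'id': 2, 'matches': [(21, 27), (32, 38)]}]
```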