ahvn.utils.basic.str_utils module

String manipulation and text processing utilities for AgentHeaven.

ahvn.utils.basic.str_utils.truncate(s, cutoff=-1)[source]

Truncate a string if it exceeds the specified cutoff length.

Parameters:
  • s (str) – The string to truncate.

  • cutoff (int) – Maximum length before truncation. Defaults to -1, meaning no cutoff.

Returns:

Truncated string if it exceeds cutoff, otherwise the original string.

Return type:

str
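
The cutoff rule can be illustrated with a minimal sketch. It is not the library's implementation: `truncate_sketch` and its `marker` parameter are hypothetical, and whether the real function appends any marker after cutting is not stated here.

```python
def truncate_sketch(s: str, cutoff: int = -1, marker: str = "...") -> str:
    """Hypothetical stand-in for truncate(); `marker` is an assumption."""
    # A negative cutoff means "no cutoff", mirroring the documented default.
    if cutoff < 0 or len(s) <= cutoff:
        return s
    # Keep the first `cutoff` characters and flag that text was dropped.
    return s[:cutoff] + marker


print(truncate_sketch("hello world", cutoff=5))  # hello...
print(truncate_sketch("hello world"))            # hello world
```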

ahvn.utils.basic.str_utils.value_repr(value, cutoff=-1, round_digits=6)[source]

Format a value representation for display, truncating if too long.

Parameters:
  • value (Any) – The value to represent.

  • cutoff (int) – Maximum length before truncation. Defaults to -1, meaning no cutoff.

  • round_digits (int) – Number of decimal places to round floats to. Only applied if the value is a float. Default is 6.

Returns:

Formatted value representation.

Return type:

str
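
As a rough illustration of how the three parameters might interact, the sketch below rounds floats, takes the `repr()`, and then cuts. The use of `repr()`, the "..." marker, and the name `value_repr_sketch` are assumptions, not details confirmed by this page.

```python
from typing import Any


def value_repr_sketch(value: Any, cutoff: int = -1, round_digits: int = 6) -> str:
    """Hypothetical stand-in for value_repr(); formatting details are assumed."""
    # Rounding only applies to floats, as the parameter description says.
    if isinstance(value, float):
        value = round(value, round_digits)
    text = repr(value)  # assumption: the representation is based on repr()
    # A negative cutoff disables truncation.
    if 0 <= cutoff < len(text):
        text = text[:cutoff] + "..."  # assumption: "..." marks the cut
    return text


print(value_repr_sketch(3.14159265358979))      # 3.141593
print(value_repr_sketch(list(range(100)), 10))  # [0, 1, 2, ...
```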

ahvn.utils.basic.str_utils.omission_list(items, top=-1, bottom=1)[source]

Shorten a list by omitting middle items when it exceeds the specified limit.

Parameters:
  • items (List) – The list of items.

  • top (int) – Number of items to keep from the start. Defaults to -1 (keep all).

  • bottom (int) – Number of items to keep from the end. Defaults to 1. Ignored if top is negative; otherwise, the total number of kept items is top + bottom + 1.

Returns:

The truncated list with middle items omitted if necessary.

Return type:

List
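
The top/bottom arithmetic is easier to see in code. The sketch below assumes that the extra "+ 1" slot holds a placeholder for the omitted items; the "..." placeholder and the name `omission_list_sketch` are hypothetical.

```python
from typing import Any, List


def omission_list_sketch(items: List[Any], top: int = -1, bottom: int = 1,
                         marker: Any = "...") -> List[Any]:
    """Hypothetical stand-in for omission_list(); `marker` is an assumption."""
    # A negative `top` keeps everything; `bottom` is ignored in that case.
    if top < 0:
        return list(items)
    bottom = max(bottom, 0)
    # Nothing to omit if the list already fits in top + bottom + 1 slots.
    if len(items) <= top + bottom + 1:
        return list(items)
    # Slice from an explicit index so that bottom == 0 yields an empty tail.
    tail = items[len(items) - bottom:]
    return items[:top] + [marker] + tail


print(omission_list_sketch(list(range(10)), top=3, bottom=1))
# [0, 1, 2, '...', 9]
```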

ahvn.utils.basic.str_utils.markdown_symbol(content)[source]

Generate a markdown code block symbol that does not conflict with the content.

Parameters:

content (str) – The content to check for conflicts.

Returns:

A markdown code block symbol (e.g., “`”, “``”, etc.) that does not appear in the content.

Return type:

str
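
One common way to pick a non-conflicting fence is to lengthen a run of backticks until it no longer occurs in the content. The sketch below follows that approach; the minimum length of three and the name `markdown_symbol_sketch` are assumptions, and the library may choose its symbol differently.

```python
def markdown_symbol_sketch(content: str, min_len: int = 3) -> str:
    """Hypothetical stand-in for markdown_symbol(); `min_len` is an assumption."""
    fence = "`" * min_len
    # Grow the fence by one backtick until it is absent from the content.
    while fence in content:
        fence += "`"
    return fence


sample = "Use " + "`" * 3 + "python to open a block."
print(len(markdown_symbol_sketch(sample)))        # 4
print(len(markdown_symbol_sketch("plain text")))  # 3
```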

ahvn.utils.basic.str_utils.line_numbered(content, start=-1, window=None)[source]

Add line numbers to the given content, starting from the specified number.

Parameters:
  • content (str) – The content to be numbered.

  • start (int) – The starting line number. If negative, no line numbers are added. Defaults to -1.

  • window (Optional[Tuple[int, int]]) – A tuple specifying the (start, end) line numbers to include. If None, includes all lines. Defaults to None.

Returns:

The content with line numbers added.

Return type:

str
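
A minimal sketch of the numbering behaviour follows. It covers only the start parameter: window is left out because its exact semantics (inclusive bounds, relation to start) are not specified here. The "N: " prefix format, the right-aligned padding, and the name `line_numbered_sketch` are assumptions.

```python
def line_numbered_sketch(content: str, start: int = -1) -> str:
    """Hypothetical stand-in for line_numbered(); the prefix format is assumed."""
    # A negative start disables numbering, as documented.
    if start < 0:
        return content
    lines = content.split("\n")
    # Pad numbers to the width of the largest one so the text stays aligned.
    width = len(str(start + len(lines) - 1))
    return "\n".join(
        f"{start + i:>{width}}: {line}" for i, line in enumerate(lines)
    )


print(line_numbered_sketch("a\nb\nc", start=9))
#  9: a
# 10: b
# 11: c
```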

ahvn.utils.basic.str_utils.indent(s, tab=4, **kwargs)[source]

Indent a string by a specified number of spaces or a tab character.

Parameters:
  • s (str) – The string to indent.

  • tab (int or str, optional) – The number of spaces or a tab character to use for indentation. Defaults to 4 spaces.

  • **kwargs – Additional keyword arguments are ignored.

Returns:

The indented string.

Return type:

str
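
The int-or-str handling of tab can be sketched with the standard library's `textwrap.indent`. This is an illustration under assumptions, not the library's code: `indent_sketch` is a hypothetical name, and `textwrap.indent` leaves whitespace-only lines unindented, which may differ from the real function.

```python
import textwrap
from typing import Union


def indent_sketch(s: str, tab: Union[int, str] = 4) -> str:
    """Hypothetical stand-in for indent(), built on textwrap.indent."""
    # An int is read as a number of spaces; a str is used as the prefix itself.
    prefix = " " * tab if isinstance(tab, int) else tab
    return textwrap.indent(s, prefix)


print(indent_sketch("a\nb", 2))
#   a
#   b
```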

ahvn.utils.basic.str_utils.is_delimiter(char)[source]

Check if a character is a word-boundary delimiter.

Parameters:

char (str) – The character to check.

Returns:

True if the character is whitespace or punctuation, False otherwise.

Return type:

bool
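
A minimal sketch of the whitespace-or-punctuation test is shown below. It assumes ASCII punctuation as defined by `string.punctuation` (which includes the underscore); the library may use a broader or narrower character set. The name `is_delimiter_sketch` is hypothetical.

```python
import string


def is_delimiter_sketch(char: str) -> bool:
    """Hypothetical stand-in for is_delimiter(); ASCII punctuation is assumed."""
    # Guard against empty or multi-character input before the membership test.
    if len(char) != 1:
        return False
    return char.isspace() or char in string.punctuation


print(is_delimiter_sketch(" "))  # True
print(is_delimiter_sketch(","))  # True
print(is_delimiter_sketch("a"))  # False
```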

ahvn.utils.basic.str_utils.normalize_text(text)[source]

Normalize text through tokenization, stop word removal, lemmatization, and lowercasing.

Parameters:

text (str) – The input text to normalize.

Returns:

The normalized text with tokens separated by spaces.

Return type:

str

ahvn.utils.basic.str_utils.generate_ngrams(tokens, n)[source]

Generate n-grams from a list of tokens.

Parameters:
  • tokens (list) – List of tokens to generate n-grams from.

  • n (int) – Maximum n-gram size.

Returns:

Set of n-grams with sizes from 1 to n.

Return type:

Set[str]
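
The "sizes from 1 to n" behaviour can be sketched as two nested loops. Joining the tokens of each n-gram with a single space is an assumption made here to obtain the documented Set[str] return type; `generate_ngrams_sketch` is a hypothetical name.

```python
from typing import List, Set


def generate_ngrams_sketch(tokens: List[str], n: int) -> Set[str]:
    """Hypothetical stand-in for generate_ngrams(); space-joining is assumed."""
    ngrams: Set[str] = set()
    # Collect every contiguous window of each size from 1 up to n.
    for size in range(1, n + 1):
        for i in range(len(tokens) - size + 1):
            ngrams.add(" ".join(tokens[i:i + size]))
    return ngrams


print(sorted(generate_ngrams_sketch(["new", "york", "city"], 2)))
# ['city', 'new', 'new york', 'york', 'york city']
```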

ahvn.utils.basic.str_utils.asymmetric_jaccard_score(query, doc, ngram=6)[source]

Calculate asymmetric Jaccard containment score between query and document.

Parameters:
  • query (str) – The query text.

  • doc (str) – The document text.

  • ngram (int, optional) – Maximum n-gram size. Defaults to 6.

Returns:

Containment score between 0.0 and 1.0.

Return type:

float
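
Containment is commonly defined as the share of the query's n-grams that also occur in the document, |Q ∩ D| / |Q|, which makes the score asymmetric. The sketch below uses that definition with plain lowercased whitespace tokenization; the real function may normalize text differently (for example via normalize_text), and both helper names are hypothetical.

```python
from typing import List, Set


def _ngrams(tokens: List[str], n: int) -> Set[str]:
    # All contiguous n-grams of sizes 1..n, joined with spaces.
    return {
        " ".join(tokens[i:i + size])
        for size in range(1, n + 1)
        for i in range(len(tokens) - size + 1)
    }


def containment_sketch(query: str, doc: str, ngram: int = 6) -> float:
    """Hypothetical stand-in for asymmetric_jaccard_score()."""
    # Assumption: lowercase + whitespace split instead of full normalization.
    q = _ngrams(query.lower().split(), ngram)
    d = _ngrams(doc.lower().split(), ngram)
    if not q:
        return 0.0
    # Divide by the query side only, so swapping arguments changes the score.
    return len(q & d) / len(q)


print(containment_sketch("new york", "I love New York City"))  # 1.0
print(containment_sketch("I love New York City", "new york"))  # 0.2
```

In the second call the query yields 15 n-grams, of which only 3 appear in the document, hence 0.2.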

ahvn.utils.basic.str_utils.resolve_match_conflicts(results, conflict='overlap', query_length=0, inverse=False)[source]

Resolve overlapping matches in search results based on conflict strategy.

This utility function filters overlapping text spans when multiple entities match at the same or overlapping positions in a query string. It operates on search results that contain match position information.

Parameters:
  • results (list) – List of result dictionaries. Each dictionary must contain:

      - ‘id’: Entity identifier.
      - ‘matches’: List of (start, end) tuples representing match positions in the query.

  • conflict (str, optional) – Strategy for handling overlapping matches. Defaults to “overlap”. Options:

      - “overlap”: Keep all matches, including overlapping ones (no filtering).
      - “longest”: Keep only the longest match in any overlapping set.
      - “longest_distinct”: Allow multiple entities to have overlapping matches, as long as they are the longest matches.

  • query_length (int, optional) – Length of the query string. Required for “longest” and “longest_distinct” strategies when inverse=True. Defaults to 0.

  • inverse (bool, optional) – Whether the matches were computed on reversed strings. Affects the sorting and comparison logic. Defaults to False.

Returns:

Filtered list of result dictionaries with the same structure as the input, where each result’s ‘matches’ list has been filtered according to the conflict resolution strategy.

Return type:

list

Examples

>>> results = [
...     {'id': 1, 'matches': [(0, 5), (10, 15), (22, 27), (32, 37)]},
...     {'id': 2, 'matches': [(2, 8), (12, 18), (21, 27), (32, 38)]}
... ]
>>> resolve_match_conflicts(results, conflict="longest", query_length=40)
[{'id': 1, 'matches': [(0, 5), (10, 15)]}, {'id': 2, 'matches': [(21, 27), (32, 38)]}]
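
The library's algorithm is not shown on this page. One interpretation that reproduces the output above for this particular input is a greedy pass: sort all spans by start position, break ties in favour of the longer span, and keep a span only if it does not overlap the last one kept. The sketch below implements that reading; `resolve_longest_sketch` is a hypothetical name, and it ignores query_length and inverse.

```python
from typing import Any, Dict, List, Tuple

Span = Tuple[int, int]


def resolve_longest_sketch(results: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Greedy overlap filter; an assumed reading of the "longest" strategy."""
    # Flatten to (start, end, result index) so spans from all entities compete.
    spans = [
        (s, e, idx)
        for idx, r in enumerate(results)
        for (s, e) in r["matches"]
    ]
    # Earliest start first; on equal starts, the longer span comes first.
    spans.sort(key=lambda t: (t[0], -(t[1] - t[0])))
    kept: List[List[Span]] = [[] for _ in results]
    last_end = float("-inf")
    for s, e, idx in spans:
        # Keep a span only if it begins at or after the end of the last kept one.
        if s >= last_end:
            kept[idx].append((s, e))
            last_end = e
    return [{"id": r["id"], "matches": kept[i]} for i, r in enumerate(results)]


results = [
    {"id": 1, "matches": [(0, 5), (10, 15), (22, 27), (32, 37)]},
    {"id": 2, "matches": [(2, 8), (12, 18), (21, 27), (32, 38)]},
]
print(resolve_longest_sketch(results))
# [{'id': 1, 'matches': [(0, 5), (10, 15)]}, {'id': 2, 'matches': [(21, 27), (32, 38)]}]
```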