ahvn.utils.basic.str_utils module

String manipulation and text processing utilities for AgentHeaven.

ahvn.utils.basic.str_utils.truncate(s, cutoff=-1)[source]

Truncate a string if it exceeds the specified cutoff length.

Parameters:
  • s (str) -- The string to truncate.

  • cutoff (int) -- Maximum length before truncation. Defaults to -1, meaning no cutoff.

Returns:

The truncated string if s is longer than cutoff, otherwise the original string.

Return type:

str
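The cutoff rule can be illustrated with a minimal stand-alone sketch. It is not the library's implementation: the name `truncate_sketch`, the `suffix` parameter, and the choice to append the suffix after the first `cutoff` characters are all assumptions made for illustration.

```python
def truncate_sketch(s: str, cutoff: int = -1, suffix: str = "...") -> str:
    """Hypothetical stand-in for truncate(); the suffix is an assumption."""
    # A negative cutoff means "no cutoff", as documented for the real function.
    if cutoff < 0 or len(s) <= cutoff:
        return s
    # Keep the first `cutoff` characters and mark the omission.
    return s[:cutoff] + suffix


print(truncate_sketch("hello world", cutoff=5))
print(truncate_sketch("hello world"))
```

Whether the real function counts the omission marker toward the cutoff is not stated in this reference, so check the source before relying on exact output lengths.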

ahvn.utils.basic.str_utils.value_repr(value, cutoff=-1, round_digits=6)[source]

Format a value representation for display, truncating if too long.

Parameters:
  • value (Any) -- The value to represent.

  • cutoff (int) -- Maximum length before truncation. Defaults to -1, meaning no cutoff.

  • round_digits (int) -- Number of decimal places to round floats to. Only applied if the value is a float. Default is 6.

Returns:

Formatted value representation.

Return type:

str
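As a rough illustration of how rounding and truncation might combine, here is a sketch under stated assumptions. The name `value_repr_sketch`, the use of `repr()` for formatting, and the "..." suffix are hypothetical; the real function may format values differently.

```python
from typing import Any


def value_repr_sketch(value: Any, cutoff: int = -1, round_digits: int = 6) -> str:
    """Hypothetical stand-in for value_repr(); formatting details are assumptions."""
    # Rounding applies only to floats, as the parameter description says.
    if isinstance(value, float):
        value = round(value, round_digits)
    text = repr(value)  # assumption: repr() is used to render the value
    # A negative cutoff disables truncation.
    if 0 <= cutoff < len(text):
        text = text[:cutoff] + "..."  # assumption: "..." marks the omission
    return text


print(value_repr_sketch(3.14159265358979, round_digits=2))
print(value_repr_sketch(list(range(100)), cutoff=10))
```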

ahvn.utils.basic.str_utils.omission_list(items, top=-1, bottom=1)[source]

Shorten a list by omitting middle items when it exceeds the specified limit.

Parameters:
  • items (List) -- The list of items.

  • top (int) -- Number of items to keep from the start. Defaults to -1 (keep all).

  • bottom (int) -- Number of items to keep from the end. Defaults to 1. Ignored if top is negative. Otherwise, the total number of kept items is top + bottom + 1.

Returns:

The truncated list with middle items omitted if necessary.

Return type:

List
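The "top + bottom + 1" count can be pictured with the sketch below. It assumes the extra item is a placeholder standing in for the omitted middle; the name `omission_list_sketch` and the `marker` parameter are hypothetical, and the real placeholder value is not documented here.

```python
from typing import Any, List


def omission_list_sketch(items: List[Any], top: int = -1, bottom: int = 1,
                         marker: Any = "...") -> List[Any]:
    """Hypothetical stand-in for omission_list(); the marker is an assumption."""
    bottom = max(bottom, 0)
    # A negative top keeps everything; short lists are returned unchanged.
    if top < 0 or len(items) <= top + bottom + 1:
        return list(items)
    head = list(items[:top])
    tail = list(items[len(items) - bottom:]) if bottom else []
    # head + one placeholder + tail gives top + bottom + 1 items in total.
    return head + [marker] + tail


print(omission_list_sketch(list(range(10)), top=3, bottom=1))
```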

ahvn.utils.basic.str_utils.markdown_symbol(content)[source]

Generate a markdown code block symbol that does not conflict with the content.

Parameters:

content (str) -- The content to check for conflicts.

Returns:

A markdown code block symbol (e.g., "`", "``", etc.) that does not appear in the content.

Return type:

str
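One common way to pick a non-conflicting fence is to grow a run of backticks until it no longer occurs in the content. The sketch below shows that idea only; the name `fence_sketch` and the `min_len` parameter are hypothetical, and the real function's starting length is not specified in this reference.

```python
def fence_sketch(content: str, min_len: int = 3) -> str:
    """Hypothetical stand-in for markdown_symbol(); min_len is an assumption."""
    fence = "`" * max(min_len, 1)
    # Lengthen the fence until it is not a substring of the content.
    while fence in content:
        fence += "`"
    return fence


snippet = "Use ```python to open a block."
fence = fence_sketch(snippet)
print(f"{fence}\n{snippet}\n{fence}")
```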

ahvn.utils.basic.str_utils.line_numbered(content, start=-1, window=None)[source]

Add line numbers to the given content, starting from the specified number.

Parameters:
  • content (str) -- The content to be numbered.

  • start (int) -- The starting line number. If negative, no line numbers are added. Defaults to -1.

  • window (Optional[Tuple[int, int]]) -- A tuple specifying the (start, end) line numbers to include. If None, includes all lines. Defaults to None.

Returns:

The content with line numbers added.

Return type:

str
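A minimal sketch of the numbering step follows. The name `line_numbered_sketch`, the "N: " prefix format, and the right-aligned padding are assumptions; the window parameter is left out because this reference does not say whether its bounds are inclusive.

```python
def line_numbered_sketch(content: str, start: int = -1) -> str:
    """Hypothetical stand-in for line_numbered(); the prefix format is an assumption."""
    # A negative start leaves the content untouched, as documented.
    if start < 0:
        return content
    lines = content.split("\n")
    # Pad numbers to the width of the largest one so the text stays aligned.
    width = len(str(start + len(lines) - 1))
    return "\n".join(f"{start + i:>{width}}: {line}" for i, line in enumerate(lines))


print(line_numbered_sketch("first\nsecond\nthird", start=1))
```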

ahvn.utils.basic.str_utils.indent(s, tab=4, **kwargs)[source]

Indent a string by a specified number of spaces or a tab character.

Parameters:
  • s (str) -- The string to indent.

  • tab (int or str, optional) -- The number of spaces or a tab character to use for indentation. Defaults to 4 spaces.

  • **kwargs -- Additional keyword arguments are ignored.

Returns:

The indented string.

Return type:

str
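The int-or-str handling of tab can be sketched with the standard library's `textwrap.indent`. The name `indent_sketch` is hypothetical, and skipping whitespace-only lines is `textwrap.indent`'s default behavior, not something this reference confirms for the real function.

```python
import textwrap
from typing import Union


def indent_sketch(s: str, tab: Union[int, str] = 4) -> str:
    """Hypothetical stand-in for indent(); blank-line handling is an assumption."""
    # An int is read as a number of spaces; a str is used as the prefix itself.
    prefix = " " * tab if isinstance(tab, int) else tab
    # textwrap.indent leaves whitespace-only lines without a prefix by default.
    return textwrap.indent(s, prefix)


print(indent_sketch("def f():\n    return 1", tab=2))
print(indent_sketch("item", tab="\t"))
```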

ahvn.utils.basic.str_utils.is_delimiter(char)[source]

Check whether a character is a word delimiter, that is, a character that breaks a word boundary.

Parameters:

char (str) -- The character to check.

Returns:

True if the character is whitespace or punctuation, False otherwise.

Return type:

bool
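A whitespace-or-punctuation test could look like the sketch below. The name `is_delimiter_sketch` is hypothetical, and using Unicode general categories starting with "P" to define punctuation is an assumption; the real function may use a different character set.

```python
import unicodedata


def is_delimiter_sketch(char: str) -> bool:
    """Hypothetical stand-in for is_delimiter(); the punctuation test is an assumption."""
    if len(char) != 1:
        raise ValueError("expected a single character")
    # Unicode categories beginning with "P" cover punctuation (Po, Pd, Ps, ...).
    return char.isspace() or unicodedata.category(char).startswith("P")


print([c for c in "Hello, world!" if is_delimiter_sketch(c)])
```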

ahvn.utils.basic.str_utils.normalize_text(text)[source]

Normalize text through tokenization, stop word removal, lemmatization, and lowercasing.

Parameters:

text (str) -- The input text to normalize.

Returns:

The normalized text with tokens separated by spaces.

Return type:

str

ahvn.utils.basic.str_utils.generate_ngrams(tokens, n)[source]

Generate n-grams from a list of tokens.

Parameters:
  • tokens (list) -- List of tokens to generate n-grams from.

  • n (int) -- Maximum n-gram size.

Returns:

Set of n-grams with sizes from 1 to n.

Return type:

Set[str]
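Collecting every n-gram of size 1 through n can be sketched as follows. The name `generate_ngrams_sketch` is hypothetical, and joining tokens with a single space is an assumption inferred from the Set[str] return type.

```python
from typing import List, Set


def generate_ngrams_sketch(tokens: List[str], n: int) -> Set[str]:
    """Hypothetical stand-in for generate_ngrams(); the space separator is an assumption."""
    ngrams: Set[str] = set()
    # Slide a window of each size from 1 up to n across the token list.
    for size in range(1, n + 1):
        for i in range(len(tokens) - size + 1):
            ngrams.add(" ".join(tokens[i:i + size]))
    return ngrams


print(sorted(generate_ngrams_sketch(["new", "york", "city"], 2)))
```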

ahvn.utils.basic.str_utils.asymmetric_jaccard_score(query, doc, ngram=6)[source]

Calculate asymmetric Jaccard containment score between query and document.

Parameters:
  • query (str) -- The query text.

  • doc (str) -- The document text.

  • ngram (int, optional) -- Maximum n-gram size. Defaults to 6.

Returns:

Containment score between 0.0 and 1.0.

Return type:

float
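Containment is usually defined as the share of the query's n-grams that also occur in the document, |Q ∩ D| / |Q|, which is why the score is asymmetric. The sketch below implements that definition only. The names are hypothetical, and lowercased whitespace tokenization is an assumption standing in for whatever normalization the real function applies.

```python
from typing import Set


def _ngrams(text: str, n: int) -> Set[str]:
    # Assumption: lowercase whitespace tokens; the library may normalize differently.
    tokens = text.lower().split()
    return {
        " ".join(tokens[i:i + size])
        for size in range(1, n + 1)
        for i in range(len(tokens) - size + 1)
    }


def containment_sketch(query: str, doc: str, ngram: int = 6) -> float:
    """Hypothetical stand-in: fraction of query n-grams found in the document."""
    q, d = _ngrams(query, ngram), _ngrams(doc, ngram)
    if not q:
        return 0.0
    return len(q & d) / len(q)


print(containment_sketch("red apple", "a red apple pie"))
print(containment_sketch("a red apple pie", "red apple"))
```

Swapping the arguments changes the denominator, so a short query fully contained in a long document scores 1.0 while the reverse scores lower.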

ahvn.utils.basic.str_utils.resolve_match_conflicts(results, conflict='overlap', query_length=0, inverse=False)[source]

Resolve overlapping matches in search results based on conflict strategy.

This utility function filters overlapping text spans when multiple entities match at the same or overlapping positions in a query string. It operates on search results that contain match position information.

Parameters:
  • results (list) -- List of result dictionaries. Each dictionary must contain:

    - 'id': Entity identifier.
    - 'matches': List of (start, end) tuples representing match positions in the query.

  • conflict (str, optional) -- Strategy for handling overlapping matches. Defaults to "overlap". Options:

    - "overlap": Keep all matches, including overlapping ones (no filtering).
    - "longest": Keep only the longest match for any overlapping set.
    - "longest_distinct": Allow multiple entities to have overlapping matches as long as they are the longest matches.

  • query_length (int, optional) -- Length of the query string. Required for "longest" and "longest_distinct" strategies when inverse=True. Defaults to 0.

  • inverse (bool, optional) -- Whether the matches were computed on reversed strings. Affects the sorting and comparison logic. Defaults to False.

Returns:

Filtered list of result dictionaries with the same structure as the input, where each result's 'matches' list has been filtered according to the conflict resolution strategy.

Return type:

list

Example

>>> results = [
...     {'id': 1, 'matches': [(0, 5), (10, 15), (22, 27), (32, 37)]},
...     {'id': 2, 'matches': [(2, 8), (12, 18), (21, 27), (32, 38)]}
... ]
>>> resolve_match_conflicts(results, conflict="longest", query_length=40)
[{'id': 1, 'matches': [(0, 5), (10, 15)]}, {'id': 2, 'matches': [(21, 27), (32, 38)]}]
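The output in the example above is consistent with a leftmost-first greedy scan that prefers the longer span only when two spans start at the same position. That reading is an inference from this single example, not a documented description of the library's algorithm. The sketch below reproduces it with a hypothetical function name and assumes half-open (start, end) spans.

```python
from typing import Any, Dict, List, Tuple

Span = Tuple[int, int]


def leftmost_longest_sketch(results: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Hypothetical greedy filter; inferred from the documented example only."""
    # Flatten every match into (start, end, entity id).
    spans = [(s, e, r["id"]) for r in results for (s, e) in r["matches"]]
    # Earliest start first; among equal starts, the longest span first.
    spans.sort(key=lambda t: (t[0], -(t[1] - t[0])))
    kept: Dict[Any, List[Span]] = {r["id"]: [] for r in results}
    last_end = 0
    for start, end, rid in spans:
        # Assumption: spans are half-open, so start == last_end does not overlap.
        if start >= last_end:
            kept[rid].append((start, end))
            last_end = end
    return [{"id": r["id"], "matches": kept[r["id"]]} for r in results]


results = [
    {"id": 1, "matches": [(0, 5), (10, 15), (22, 27), (32, 37)]},
    {"id": 2, "matches": [(2, 8), (12, 18), (21, 27), (32, 38)]},
]
print(leftmost_longest_sketch(results))
```

The sketch ignores query_length and inverse, and it keeps entities whose match list ends up empty; the real function may differ on both points.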