ahvn.utils.basic.str_utils module

String manipulation and text processing utilities for AgentHeaven.

ahvn.utils.basic.str_utils.truncate(s, cutoff=-1)[source]

Truncate a string if it exceeds the specified cutoff length.

Parameters:

s (str) -- The string to truncate.
cutoff (int) -- Maximum length before truncation. Defaults to -1, meaning no cutoff.

Returns:

The truncated string if it exceeds cutoff, otherwise the original string.

Return type:

str
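
The cutoff rule described above can be sketched in a few lines. This is a minimal stand-in, not the library's implementation: the name truncate_sketch and the "..." marker appended to shortened strings are assumptions, since the documentation does not say how a truncated string is marked.

```python
def truncate_sketch(s: str, cutoff: int = -1, marker: str = "...") -> str:
    """Hypothetical stand-in for truncate(); the marker is an assumption."""
    # A negative cutoff means "no limit", matching the documented default.
    if cutoff < 0 or len(s) <= cutoff:
        return s
    # Keep the first `cutoff` characters and flag that text was dropped.
    return s[:cutoff] + marker


print(truncate_sketch("AgentHeaven utilities", cutoff=5))  # Agent...
print(truncate_sketch("short", cutoff=10))                 # short
```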

ahvn.utils.basic.str_utils.value_repr(value, cutoff=-1, round_digits=6)[source]

Format a value representation for display, truncating if too long.

Parameters:

value (Any) -- The value to represent.
cutoff (int) -- Maximum length before truncation. Defaults to -1, meaning no cutoff.
round_digits (int) -- Number of decimal places to round floats to. Only applied if the value is a float. Default is 6.

Returns:

Formatted value representation.

Return type:

str
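
As a rough illustration of how rounding and truncation might combine, the sketch below rounds floats first, takes repr(), and then applies the cutoff. The name value_repr_sketch, the use of repr(), and the "..." marker are assumptions made for this example; the actual formatting rules may differ.

```python
from typing import Any


def value_repr_sketch(value: Any, cutoff: int = -1, round_digits: int = 6) -> str:
    """Hypothetical stand-in for value_repr(); formatting details are assumed."""
    # Round only genuine floats; other types pass through untouched.
    if isinstance(value, float):
        value = round(value, round_digits)
    text = repr(value)
    # A negative cutoff disables truncation.
    if 0 <= cutoff < len(text):
        text = text[:cutoff] + "..."
    return text


print(value_repr_sketch(3.14159265, round_digits=2))  # 3.14
print(value_repr_sketch([1, 2, 3]))                   # [1, 2, 3]
```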

ahvn.utils.basic.str_utils.omission_list(items, top=-1, bottom=1)[source]

Shorten a list by omitting its middle items when it exceeds the specified limit.

Parameters:

items (List) -- The list of items.
top (int) -- Number of items to keep from the start. Defaults to -1 (keep all).
bottom (int) -- Number of items to keep from the end. Defaults to 1. Ignored if top is negative; otherwise the total number of kept items is top + bottom + 1.

Returns:

The truncated list with middle items omitted if necessary.

Return type:

List
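
The top/bottom rule can be sketched as follows. The name omission_list_sketch is hypothetical, and reading the "+ 1" in the item count as a single "..." placeholder for the omitted middle is an assumption; the documentation does not say what occupies that slot.

```python
from typing import Any, List


def omission_list_sketch(
    items: List[Any], top: int = -1, bottom: int = 1, marker: Any = "..."
) -> List[Any]:
    """Hypothetical stand-in for omission_list(); the marker is an assumption."""
    # A negative `top` keeps everything, and `bottom` is ignored.
    if top < 0:
        return list(items)
    # Nothing to omit if the list already fits in top + bottom + 1 slots.
    if len(items) <= top + bottom + 1:
        return list(items)
    tail = items[len(items) - bottom:]
    return list(items[:top]) + [marker] + list(tail)


print(omission_list_sketch(list(range(10)), top=2, bottom=1))  # [0, 1, '...', 9]
```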

ahvn.utils.basic.str_utils.markdown_symbol(content)[source]

Generate a markdown code block symbol that does not conflict with the content.

Parameters: content (str) -- The content to check for conflicts.
Returns: A markdown code block symbol (e.g., "`", "``", etc.) that does not appear in the content.
Return type: str
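
One simple way to find a non-conflicting symbol is to grow a run of backticks until it no longer occurs in the content. The sketch below does exactly that; markdown_symbol_sketch is a hypothetical name, and starting from a single backtick is an assumption based on the examples in the return description.

```python
def markdown_symbol_sketch(content: str) -> str:
    """Hypothetical stand-in for markdown_symbol()."""
    fence = "`"
    # Grow the backtick run until it no longer occurs anywhere in the content.
    while fence in content:
        fence += "`"
    return fence


print(markdown_symbol_sketch("plain text"))    # `
print(markdown_symbol_sketch("use `x` here"))  # ``
```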

ahvn.utils.basic.str_utils.line_numbered(content, start=-1, window=None)[source]

Add line numbers to the given content, starting from the specified number.

Parameters:

content (str) -- The content to be numbered.
start (int) -- The starting line number. If negative, no line numbers are added. Defaults to -1.
window (Optional[Tuple[int, int]]) -- A tuple specifying the (start, end) line numbers to include. If None, includes all lines. Defaults to None.

Returns:

The content with line numbers added.

Return type:

str
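
The interaction between start and window can be illustrated with the sketch below. Several details are assumptions rather than documented behavior: the name line_numbered_sketch, the "N: " prefix format, and the reading of window as inclusive bounds on the displayed line numbers.

```python
from typing import Optional, Tuple


def line_numbered_sketch(
    content: str, start: int = -1, window: Optional[Tuple[int, int]] = None
) -> str:
    """Hypothetical stand-in for line_numbered(); prefix format is assumed."""
    # A negative start disables numbering, as the parameter description states.
    if start < 0:
        return content
    numbered = []
    for number, line in enumerate(content.split("\n"), start=start):
        # Assumption: window bounds are inclusive and refer to displayed numbers.
        if window is not None and not (window[0] <= number <= window[1]):
            continue
        numbered.append(f"{number}: {line}")
    return "\n".join(numbered)


print(line_numbered_sketch("a\nb\nc", start=1, window=(2, 3)))
```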

ahvn.utils.basic.str_utils.indent(s, tab=4, **kwargs)[source]

Indent a string by a specified number of spaces or a tab character.

Parameters:

s (str) -- The string to indent.
tab (int or str, optional) -- The number of spaces or a tab character to use for indentation. Defaults to 4 spaces.
**kwargs -- Additional keyword arguments are ignored.

Returns:

The indented string.

Return type:

str
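
A comparable effect can be had with the standard library's textwrap.indent, as sketched below. The name indent_sketch is hypothetical, and the handling of blank lines (left unprefixed, which is textwrap.indent's default) is an assumption about behavior the documentation does not specify.

```python
import textwrap
from typing import Union


def indent_sketch(s: str, tab: Union[int, str] = 4) -> str:
    """Hypothetical stand-in for indent(), built on textwrap.indent."""
    # An integer is read as a number of spaces; a string is used verbatim.
    prefix = " " * tab if isinstance(tab, int) else tab
    # textwrap.indent leaves whitespace-only lines unprefixed by default.
    return textwrap.indent(s, prefix)


print(indent_sketch("def f():\n    pass", tab=2))
```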

ahvn.utils.basic.str_utils.is_delimiter(char)[source]

Check whether a character is a word-boundary delimiter.

Parameters: char (str) -- The character to check.
Returns: True if the character is whitespace or punctuation, False otherwise.
Return type: bool
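
The whitespace-or-punctuation test can be approximated with the standard library. In this sketch the name is_delimiter_sketch is hypothetical and punctuation is limited to ASCII (string.punctuation), which is an assumption; the library may well treat Unicode punctuation as delimiters too.

```python
import string


def is_delimiter_sketch(char: str) -> bool:
    """Hypothetical stand-in for is_delimiter(); ASCII punctuation only."""
    # Require exactly one character so substring matches cannot slip through.
    if len(char) != 1:
        return False
    return char.isspace() or char in string.punctuation


print(is_delimiter_sketch(","))  # True
print(is_delimiter_sketch("a"))  # False
```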

ahvn.utils.basic.str_utils.normalize_text(text)[source]

Normalize text through tokenization, stop word removal, lemmatization, and lowercasing.

Parameters: text (str) -- The input text to normalize.
Returns: The normalized text with tokens separated by spaces.
Return type: str
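
To make the pipeline concrete, here is a toy version that uses only the standard library. Everything in it is an assumption for illustration: the name normalize_text_sketch, the regex tokenizer, the tiny stop-word set, and the lookup-table lemmatizer. It also lowercases before tokenizing for simplicity. The real function presumably relies on a proper NLP toolkit.

```python
import re

# Tiny illustrative resources; a real pipeline would use full NLP data.
STOP_WORDS = {"a", "an", "the", "is", "are", "of", "and", "to", "in"}
LEMMAS = {"running": "run", "ran": "run", "cats": "cat", "mice": "mouse"}


def normalize_text_sketch(text: str) -> str:
    """Toy normalizer: lowercase, tokenize, drop stop words, lemmatize."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    kept = [t for t in tokens if t not in STOP_WORDS]
    # Dictionary-based lemmatization: unknown tokens are left unchanged.
    return " ".join(LEMMAS.get(t, t) for t in kept)


print(normalize_text_sketch("The cats are running in the garden"))  # cat run garden
```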

ahvn.utils.basic.str_utils.generate_ngrams(tokens, n)[source]

Generate n-grams from a list of tokens.

Parameters:

tokens (list) -- List of tokens to generate n-grams from.
n (int) -- Maximum n-gram size.

Returns:

Set of n-grams with sizes from 1 to n.

Return type:

Set[str]
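
Collecting every n-gram of size 1 through n might look like the sketch below. The name generate_ngrams_sketch is hypothetical, and joining the tokens of each n-gram with a single space is an assumption, chosen because the documented return type is Set[str].

```python
from typing import List, Set


def generate_ngrams_sketch(tokens: List[str], n: int) -> Set[str]:
    """Hypothetical stand-in for generate_ngrams()."""
    ngrams: Set[str] = set()
    for size in range(1, n + 1):
        # Slide a window of `size` tokens across the list.
        for i in range(len(tokens) - size + 1):
            # Assumption: tokens in an n-gram are joined with a single space.
            ngrams.add(" ".join(tokens[i:i + size]))
    return ngrams


print(sorted(generate_ngrams_sketch(["a", "b", "c"], 2)))
# ['a', 'a b', 'b', 'b c', 'c']
```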

ahvn.utils.basic.str_utils.asymmetric_jaccard_score(query, doc, ngram=6)[source]

Calculate the asymmetric Jaccard containment score between a query and a document.

Parameters:

query (str) -- The query text.
doc (str) -- The document text.
ngram (int, optional) -- Maximum n-gram size. Defaults to 6.

Returns:

Containment score between 0.0 and 1.0.

Return type:

float
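
A containment score is commonly defined as the fraction of the query's n-grams that also occur in the document, which is what makes it asymmetric. The sketch below follows that common definition. The names containment_sketch and _ngrams, the whitespace tokenization, and the choice to normalize by the query's n-gram set are assumptions; the library may tokenize or normalize differently.

```python
from typing import List, Set


def _ngrams(tokens: List[str], n: int) -> Set[str]:
    # All space-joined n-grams of sizes 1..n (hypothetical helper).
    return {
        " ".join(tokens[i:i + size])
        for size in range(1, n + 1)
        for i in range(len(tokens) - size + 1)
    }


def containment_sketch(query: str, doc: str, ngram: int = 6) -> float:
    """Share of the query's n-grams that also appear in the document."""
    q = _ngrams(query.lower().split(), ngram)
    d = _ngrams(doc.lower().split(), ngram)
    if not q:
        return 0.0
    return len(q & d) / len(q)


print(containment_sketch("red apple", "a red apple pie"))  # 1.0
print(containment_sketch("a red apple pie", "red apple"))  # 0.3
```

Swapping the arguments changes the score, which is the point of a containment measure: a short query fully covered by a long document scores 1.0, while the reverse does not.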

ahvn.utils.basic.str_utils.resolve_match_conflicts(results, conflict='overlap', query_length=0, inverse=False)[source]

Resolve overlapping matches in search results based on conflict strategy.

This utility function filters overlapping text spans when multiple entities match at the same or overlapping positions in a query string. It operates on search results that contain match position information.

Parameters:

results (list) -- List of result dictionaries. Each dictionary must contain:
- 'id': Entity identifier
- 'matches': List of (start, end) tuples representing match positions in the query
conflict (str, optional) -- Strategy for handling overlapping matches. Defaults to "overlap". Options:
- "overlap": Keep all matches including overlapping ones (no filtering)
- "longest": Keep only the longest match for any overlapping set
- "longest_distinct": Allow multiple entities to have overlapping matches as long as they are the longest matches
query_length (int, optional) -- Length of the query string. Required for "longest" and "longest_distinct" strategies when inverse=True. Defaults to 0.
inverse (bool, optional) -- Whether the matches were computed on reversed strings. Affects the sorting and comparison logic. Defaults to False.

Returns:

Filtered list of result dictionaries with the same structure as the input, where each result's 'matches' list has been filtered according to the conflict resolution strategy.

Return type:

list

Example

>>> from ahvn.utils.basic.str_utils import resolve_match_conflicts
>>> results = [
...     {'id': 1, 'matches': [(0, 5), (10, 15), (22, 27), (32, 37)]},
...     {'id': 2, 'matches': [(2, 8), (12, 18), (21, 27), (32, 38)]}
... ]
>>> resolve_match_conflicts(results, conflict="longest", query_length=40)
[{'id': 1, 'matches': [(0, 5), (10, 15)]}, {'id': 2, 'matches': [(21, 27), (32, 38)]}]
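
One way to arrive at the output shown in the example is a greedy left-to-right pass that prefers the longer span when two spans start at the same position. The sketch below implements that pass and reproduces the example's result. It is not taken from the library: the name resolve_longest_sketch, the half-open reading of (start, end), and the greedy ordering are all assumptions, and the actual "longest" strategy may behave differently on other inputs.

```python
from typing import Dict, List, Tuple


def resolve_longest_sketch(results: List[dict]) -> List[dict]:
    """Greedy leftmost-longest filter (hypothetical, not the library's code)."""
    # Flatten to (start, end, id); sort by start, longer spans first on ties.
    spans = sorted(
        ((start, end, r["id"]) for r in results for start, end in r["matches"]),
        key=lambda s: (s[0], -(s[1] - s[0])),
    )
    kept: Dict[object, List[Tuple[int, int]]] = {r["id"]: [] for r in results}
    last_end = 0
    for start, end, entity_id in spans:
        # Spans are treated as half-open [start, end); accept a span only if
        # it begins at or after the end of the previously accepted span.
        if start >= last_end:
            kept[entity_id].append((start, end))
            last_end = end
    return [{"id": r["id"], "matches": kept[r["id"]]} for r in results]


results = [
    {"id": 1, "matches": [(0, 5), (10, 15), (22, 27), (32, 37)]},
    {"id": 2, "matches": [(2, 8), (12, 18), (21, 27), (32, 38)]},
]
print(resolve_longest_sketch(results))
# [{'id': 1, 'matches': [(0, 5), (10, 15)]}, {'id': 2, 'matches': [(21, 27), (32, 38)]}]
```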