String Utilities¶
The str_utils.py module provides a comprehensive set of string manipulation and text processing utilities for AgentHeaven. These utilities include text normalization, n-gram generation, string similarity calculations, and formatting functions.
1. String Formatting¶
1.1. indent() - String Indentation¶
The indent() function adds indentation to multi-line strings:
from ahvn.utils.basic.str_utils import indent
# Using spaces
text = "Line 1\nLine 2\nLine 3"
indented = indent(text, 4)
print(indented)
# Line 1
# Line 2
# Line 3
# Using custom tab character
indented = indent(text, " ")
print(indented)
# Line 1
# Line 2
# Line 3
2. Text Processing¶
2.1. is_delimiter() - Word Boundary Detection¶
The is_delimiter() function checks if a character serves as a word boundary:
from ahvn.utils.basic.str_utils import is_delimiter
# Check various characters
print(is_delimiter(' ')) # True (whitespace)
print(is_delimiter('.')) # True (punctuation)
print(is_delimiter('a')) # False (alphanumeric)
print(is_delimiter('\n')) # True (whitespace)
2.2. normalize_text() - Text Normalization¶
The normalize_text() function performs comprehensive text normalization including tokenization, stop word removal, lemmatization, and lowercasing:
from ahvn.utils.basic.str_utils import normalize_text
# Normalize text
text = "The quick brown foxes are running!"
normalized = normalize_text(text)
print(normalized)
# "quick brown fox run"
# Complex text with punctuation
text = "Hello, world! How are you doing today?"
normalized = normalize_text(text)
print(normalized)
# "hello world today"
Note: This function requires spaCy and the
en_core_web_smmodel to be installed. The model is loaded lazily on first use.pip install spacy python -m spacy download en_core_web_sm
3. N-gram Analysis¶
3.1. generate_ngrams() - N-gram Generation¶
The generate_ngrams() function creates n-grams from a list of tokens:
from ahvn.utils.basic.str_utils import generate_ngrams
# Generate n-grams from tokens
tokens = ["the", "quick", "brown", "fox"]
ngrams = generate_ngrams(tokens, n=3)
print(sorted(ngrams))
# ['brown', 'brown fox', 'fox', 'quick', 'quick brown', 'quick brown fox', 'the', 'the quick', 'the quick brown']
# Single token
tokens = ["hello"]
ngrams = generate_ngrams(tokens, n=2)
print(ngrams)
# {'hello'}
3.2. asymmetric_jaccard_score() - Text Similarity¶
The asymmetric_jaccard_score() function calculates how much of a query is contained within a document using n-gram analysis. The ngram parameter defaults to 6.
from ahvn.utils.basic.str_utils import asymmetric_jaccard_score
# Calculate containment score
query = "quick brown fox"
doc = "The quick brown fox jumps over the lazy dog"
score = asymmetric_jaccard_score(query, doc)
print(f"Containment score: {score:.3f}")
# Containment score: 0.875
# Query not well contained in document
query = "machine learning algorithms"
doc = "The quick brown fox jumps"
score = asymmetric_jaccard_score(query, doc, ngram=2)
print(f"Containment score: {score:.3f}")
# Containment score: 0.000
The function works by:
Normalizing both query and document text
Generating n-grams from the normalized tokens
Calculating what fraction of query n-grams appear in the document
4. Conflict Resolution¶
4.1. resolve_match_conflicts() - Resolve Overlapping Matches¶
The resolve_match_conflicts() function filters overlapping text spans from search results based on a specified strategy. This is useful when multiple entities match at the same or overlapping positions in a query string.
from ahvn.utils.basic.str_utils import resolve_match_conflicts
results = [
{'id': 1, 'matches': [(0, 5), (10, 15), (22, 27), (32, 37)]},
{'id': 2, 'matches': [(2, 8), (12, 18), (21, 27), (32, 38)]}
]
# Using the "longest" strategy
filtered_results = resolve_match_conflicts(results, conflict="longest", query_length=40)
print(filtered_results)
# [{'id': 1, 'matches': [(0, 5), (10, 15)]}, {'id': 2, 'matches': [(21, 27), (32, 38)]}]
The function supports the following conflict resolution strategies:
overlap: (Default) Keeps all matches, including overlapping ones.longest: For any set of overlapping matches, it keeps only the longest one.longest_distinct: Allows multiple entities to have overlapping matches, but only if they are the longest matches for their respective entities.
5. Usage Examples¶
5.1. Basic String Processing Pipeline¶
from ahvn.utils.basic.str_utils import (
normalize_text,
generate_ngrams,
asymmetric_jaccard_score
)
# Process and analyze text
def analyze_text_similarity(query, documents):
# Normalize query once
normalized_query = normalize_text(query)
query_tokens = normalized_query.split()
results = []
for doc in documents:
# Calculate similarity score
score = asymmetric_jaccard_score(query, doc)
results.append((doc, score))
# Sort by similarity
return sorted(results, key=lambda x: x[1], reverse=True)
# Example usage
query = "artificial intelligence"
docs = [
"Machine learning and artificial intelligence research",
"Natural language processing techniques",
"AI and deep learning applications"
]
ranked_docs = analyze_text_similarity(query, docs)
for doc, score in ranked_docs:
print(f"Score: {score:.3f} - {doc}")
5.2. Text Formatting Workflow¶
from ahvn.utils.basic.str_utils import indent, is_delimiter
def format_code_block(code, indent_size=4):
# Add indentation to code
indented_code = indent(code, indent_size)
# Wrap in markdown code block
return f"```\n{indented_code}\n```"
# Example
python_code = "def hello():\n print('Hello, World!')"
formatted = format_code_block(python_code)
print(formatted)
Further Exploration¶
Tip: For more information about utilities in AgentHeaven, see:
Utilities - All Python utilities for convenience