5-Minute Quickstart

This section walks through a simple sentiment-analysis example to show how AgentHeaven uses accumulated experience to improve the accuracy of model answers.

Dataset Preparation

Download the IMDb Movie Review Dataset: https://ai.stanford.edu/~amaas/data/sentiment/

Process the raw data with your own script into the format below, and save the result as dataset.json:

[
    {
        "text": "I had a heck of a good time viewing.....",
        "tag": "pos"
    },
    {
        "text": "I loved this film, at first....",
        "tag": "pos"
    }
]
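The raw download unpacks into an aclImdb/ directory with one review per .txt file under train/pos, train/neg (and the same layout under test/). A possible conversion script is sketched below; the function name and the per-tag limit parameter are illustrative assumptions, not part of AgentHeaven.

```python
import json
import os

def build_dataset(root, split="train", limit=None):
    """Collect {"text", "tag"} records from the aclImdb layout:
    <root>/<split>/pos/*.txt and <root>/<split>/neg/*.txt."""
    records = []
    for tag in ("pos", "neg"):
        folder = os.path.join(root, split, tag)
        names = sorted(n for n in os.listdir(folder) if n.endswith(".txt"))
        if limit is not None:
            names = names[:limit]  # optionally cap the number of reviews per tag
        for name in names:
            with open(os.path.join(folder, name), encoding="utf-8") as f:
                records.append({"text": f.read().strip(), "tag": tag})
    return records

if __name__ == "__main__" and os.path.isdir("./aclImdb"):
    data = build_dataset("./aclImdb", split="train", limit=200)
    with open("dataset.json", "w", encoding="utf-8") as f:
        json.dump(data, f, ensure_ascii=False, indent=4)
```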

LLM Configuration

See LLM Configuration.

First Round of Q&A

The autotask call below asks the LLM to perform sentiment analysis; the llm_args parameter selects a model from the LLM configuration.

from ahvn.utils.exts.autotask import autotask
from ahvn.cache import JsonCache
analyzer_task = autotask(
    descriptions="Sentiment analysis. Rate the sentiment of the text from 1 to 10. Return an integer.",
    output_schema={"mode": "repr"},
    llm_args={"preset": "sys"},
)

Next, declare a cache of type JsonCache, one of AgentHeaven's built-in cache types, which stores cache entries as JSON on local disk.

cache = JsonCache("./json_cache")

@cache.memoize() is a decorator: every call to a function decorated with it is recorded in the cache.

@cache.memoize()
def sentiment_analyzer(text):
    return analyzer_task(inputs={"text": text})
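Conceptually, @cache.memoize() behaves like a classic memoization decorator: it looks the arguments up in a store before calling the function, and serves the recorded output on a hit. A minimal plain-Python sketch of the idea (not AgentHeaven's actual implementation):

```python
import functools
import json

def memoize(store):
    """Return a decorator that caches results keyed by function name + arguments."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            key = json.dumps(
                {"func": func.__name__, "args": args, "kwargs": kwargs},
                sort_keys=True,
            )
            if key not in store:
                store[key] = func(*args, **kwargs)  # cache miss: call and record
            return store[key]                       # cache hit: skip the call
        return wrapper
    return decorator

calls = []
store = {}

@memoize(store)
def analyze(text):
    calls.append(text)           # track real invocations
    return len(text) % 10 + 1    # stand-in for an LLM score

analyze("hello")
analyze("hello")  # the second call is served from the cache
```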

Opening the cached JSON file shows that the function name, inputs, and output are all stored:

{
    "func": "sentiment_analyzer",
    "inputs": {"text": "NOTE TO ALL DIRECTORS: Long is not necessarily..."},
    "output": 3,
    "metadata": {}
}

For the dataset.json prepared earlier, call the autotask-based analyzer on every sample: scores 1 to 5 count as negative sentiment, 6 to 10 as positive. The complete code:

import json

cache = JsonCache("./json_cache")

@cache.memoize()
def sentiment_analyzer(text):
    return analyzer_task(inputs={"text": text})

def evaluate_accuracy(json_file_path):
    with open(json_file_path, 'r', encoding='utf-8') as f:
        dataset = json.load(f)

    total_count = len(dataset)
    correct_count = 0
    for item in dataset:
        text = item.get('text')
        true_tag = item.get('tag')
        # Get the model's score
        score = int(sentiment_analyzer(text))

        is_correct = False
        if true_tag == 'neg' and 1 <= score <= 5:
            is_correct = True
        elif true_tag == 'pos' and 6 <= score <= 10:
            is_correct = True
        if is_correct:
            correct_count += 1

    accuracy = (correct_count / total_count) * 100

    print(f"Total samples: {total_count}")
    print(f"Correct samples: {correct_count}")
    print(f"Final accuracy: {accuracy:.2f}%")

if __name__ == '__main__':
    evaluate_accuracy('dataset.json')
>> Total samples: 400
>> Correct samples: 355
>> Final accuracy: 88.75%
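The decision rule above (scores 1-5 map to negative, 6-10 to positive) can also be factored into a small standalone helper, which makes the threshold easy to test in isolation. The helper name is illustrative, not part of AgentHeaven's API:

```python
def score_to_tag(score):
    """Map a 1-10 sentiment score to the dataset's pos/neg labels."""
    if not 1 <= score <= 10:
        raise ValueError(f"score out of range: {score}")
    return "neg" if score <= 5 else "pos"  # 1-5 negative, 6-10 positive
```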

Adding Experience from the Cache

Next, encode the cached function-call records as experiences and store them in a VectorKLStore.

Initialize the embedding model and choose lancedb as the vector database; the VectorKLStore creates a collection named ukf_base under the ./data/lancedb directory:

embedder = LLM(preset="embedder")
vecstore = VectorKLStore(
    collection="ukf_base",
    provider="lancedb",
    uri="./data/lancedb",
    embedder=embedder
)

From the cache, take the entries whose output score is 1-2 or 9-10 (scores at the extremes tend to be more reliable) and store them in the VectorKLStore as experiences:

data_to_upsert = []

for entry in cache:
    data_to_upsert.append(ExperienceUKFT.from_cache_entry(entry))

def filter_cache_entries(entries):
    filtered_result = []
    for entry in entries:
        raw_output = entry.output
        score = float(raw_output)

        match_low = (1 <= score <= 2)
        match_high = (9 <= score <= 10)

        if match_low or match_high:
            filtered_result.append(entry)

    return filtered_result

filter_list = filter_cache_entries(data_to_upsert)

if filter_list:
    vecstore.batch_upsert(filter_list)

Second Round of Q&A, with Experience

Create a VectorKLEngine and attach the top-3 semantically similar function-call records as experiences to the autotask:

engine = VectorKLEngine(storage=vecstore, inplace=True)
analyzer_task_with_exp = autotask(
    descriptions="Sentiment analysis. Rate the sentiment of the text from 1 to 10. Return an integer.",
    examples=engine,
    output_schema={"mode": "repr"},
    llm_args={"preset": "sys"},
    search_encoder=lambda instance: {"query": "input=" + str(instance.inputs.get("text")), "topk": 3},
)
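Here "topk": 3 means the engine embeds the query string and returns the three stored experiences whose vectors are most similar to it. A minimal cosine-similarity sketch of that retrieval step in plain Python (not VectorKLEngine's actual implementation):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def topk(query_vec, entries, k=3):
    """entries: list of (vector, payload) pairs.
    Return the payloads of the k entries closest to query_vec."""
    ranked = sorted(entries, key=lambda e: cosine(query_vec, e[0]), reverse=True)
    return [payload for _, payload in ranked[:k]]
```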

Run the second round of Q&A with the same model:

def sentiment_analyzer_with_exp(text):
    return analyzer_task_with_exp(inputs={"text": text})

def evaluate_accuracy(json_file_path):
    with open(json_file_path, 'r', encoding='utf-8') as f:
        dataset = json.load(f)

    total_count = len(dataset)
    correct_count = 0
    for item in dataset:
        text = item.get('text')
        true_tag = item.get('tag')
        # Get the model's score
        score = int(sentiment_analyzer_with_exp(text))

        is_correct = False
        if true_tag == 'neg' and 1 <= score <= 5:
            is_correct = True
        elif true_tag == 'pos' and 6 <= score <= 10:
            is_correct = True
        # Tally
        if is_correct:
            correct_count += 1

    accuracy = (correct_count / total_count) * 100

    print(f"Total samples: {total_count}")
    print(f"Correct samples: {correct_count}")
    print(f"Final accuracy: {accuracy:.2f}%")

if __name__ == '__main__':
    evaluate_accuracy('dataset.json')

The results show that with the added experience, the second round's accuracy improves over the first round's.

>> Total samples: 400
>> Correct samples: 361
>> Final accuracy: 90.25%

Further Reading

Tip: For the quickstart and more usage examples, see: