OpenClaw Skill Development: A Detailed Example


The complete Skill development workflow, from configuration through SubSkills to script execution.


## 🏗️ Skill Structure Template

### Start here: how a Skill is layered

```
Skill (one Skill → multiple SubSkills)
│
├─ SKILL.md                     # Skill manifest (model definition, rules)
│
├─ config.yaml                  # Skill-wide configuration
│
├─ subskills/                   # SubSkill library
│   ├─ model-training/          # SubSkill 1
│   │   ├─ config.yaml          # SubSkill configuration
│   │   ├─ handler.py           # Executor (main entry point)
│   │   ├─ requirements.txt     # Python dependencies
│   │   └─ scripts/
│   │       ├─ train.py
│   │       ├─ validate.py
│   │       └─ utils.py
│   │
│   ├─ model-evaluation/        # SubSkill 2
│   │   ├─ config.yaml
│   │   ├─ handler.py
│   │   ├─ requirements.txt
│   │   └─ scripts/
│   │       ├─ evaluate.py
│   │       └─ metrics.py
│   │
│   └─ model-deployment/        # SubSkill 3
│       ├─ config.yaml
│       ├─ handler.py
│       ├─ requirements.txt
│       └─ scripts/
│           ├─ deploy.py
│           ├─ docker_build.sh
│           └─ k8s_deploy.yaml
│
├─ utils/                       # Shared utility functions
│   ├─ __init__.py
│   ├─ validators.py            # Validators
│   ├─ logger.py                # Logging
│   └─ config_loader.py         # Configuration loading
│
├─ tests/                       # Test cases
│   ├─ test_training.py
│   ├─ test_evaluation.py
│   └─ fixtures/
│
└─ docs/                        # Documentation
    ├─ API.md
    ├─ EXAMPLES.md
    └─ TROUBLESHOOTING.md
```
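
The layout above can be checked mechanically before a Skill is registered. The following is a minimal sketch, not an OpenClaw API: the function name `check_skill_layout` and the list of required files are assumptions read off the tree.

```python
from pathlib import Path
from typing import List

# Files every SubSkill directory is expected to contain, per the tree above
# (assumption: all three are mandatory).
REQUIRED_SUBSKILL_FILES = ("config.yaml", "handler.py", "requirements.txt")


def check_skill_layout(skill_root: Path) -> List[str]:
    """Return a list of problems; an empty list means the layout looks complete."""
    problems: List[str] = []
    for name in ("SKILL.md", "config.yaml"):
        if not (skill_root / name).is_file():
            problems.append(f"missing {name}")

    subskills_dir = skill_root / "subskills"
    if not subskills_dir.is_dir():
        problems.append("missing subskills/")
        return problems

    # Check each SubSkill directory in a stable (sorted) order.
    for subskill in sorted(p for p in subskills_dir.iterdir() if p.is_dir()):
        for name in REQUIRED_SUBSKILL_FILES:
            if not (subskill / name).is_file():
                problems.append(f"{subskill.name}: missing {name}")
        if not (subskill / "scripts").is_dir():
            problems.append(f"{subskill.name}: missing scripts/")
    return problems
```

Calling `check_skill_layout(Path("skills/ai-engineer"))` in a test or CI step would surface a missing `handler.py` before the Skill is ever invoked.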

## 📄 1. Skill Manifest (SKILL.md)

**File**: `skills/ai-engineer/SKILL.md`

---
name: ai-engineer
version: 2.1.0
description: One-stop solution specializing in ML model development and deployment
color: purple
author: AI Technology Team
contacts:
  - name: Tech Lead
    email: tech-lead@company.com
license: MIT
---

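# AI Engineer Skill

## Overview

Specializes in the full lifecycle of AI models, from development through deployment.

## SubSkills Provided

### 1. model-training
Model training and experiment management

- Data download and preprocessing
- Experiment management (MLflow integration)
- Distributed training
- Model checkpoint management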
**Input:**
```json
{
  "dataset": "cifar10",
  "model_name": "resnet50",
  "batch_size": 32,
  "epochs": 100,
  "learning_rate": 0.001,
  "experiment_name": "exp-v2.1"
}
```

**Output:**
```json
{
  "model_path": "/models/resnet50_cifar10_v2.1.pt",
  "best_accuracy": 0.956,
  "final_loss": 0.123,
  "training_time_seconds": 3600,
  "mlflow_run_id": "abc123"
}
```

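### 2. model-evaluation

Offline model evaluation and validation

- Accuracy, precision, and recall computation
- In-depth analysis and error cases
- Performance on out-of-distribution samples
- Shallow-layer feature visualization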

**Input:**
```json
{
  "model_path": "/models/resnet50_cifar10_v2.1.pt",
  "test_dataset": "cifar10_test",
  "metrics": ["accuracy", "f1", "auc", "confusion_matrix"],
  "visualization": true
}
```

**Output:**
```json
{
  "metrics": {
    "accuracy": 0.956,
    "precision": 0.954,
    "recall": 0.956,
    "f1": 0.955,
    "auc": 0.998
  },
  "confusion_matrix": "s3://bucket/confusion_matrix.png",
  "error_cases": "s3://bucket/error_analysis.json"
}
```
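
The `f1` value in the sample output is consistent with the reported precision and recall. As a quick check, F1 is the harmonic mean of the two; the helper name `f1_score` below is only illustrative and is not taken from the Skill's `metrics.py`.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; defined as 0.0 when both are 0."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values from the sample output above.
print(round(f1_score(0.954, 0.956), 3))  # 0.955
```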

### 3. model-deployment

Model optimization and deployment

- Quantization (INT8/FP16)
- ONNX conversion and optimization
- Docker packaging
- Kubernetes deployment

**Input:**
```json
{
  "model_path": "/models/resnet50_cifar10_v2.1.pt",
  "quantization_type": "int8",
  "deployment_target": "kubernetes",
  "namespace": "production",
  "replicas": 3
}
```

**Output:**
```json
{
  "optimized_model_path": "/models/resnet50_cifar10_v2.1_int8.onnx",
  "model_size_reduction": "75%",
  "docker_image": "registry.company.com/model-resnet50:v2.1-int8",
  "deployment_status": "running",
  "service_endpoint": "http://model-service.production:8080"
}
```
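
The `75%` figure matches what one would expect when FP32 weights are stored as INT8. A small sketch of the arithmetic, assuming the model size is dominated by its weights and the original weights are 32-bit floats (`size_reduction` is a hypothetical helper, not part of the SubSkill):

```python
def size_reduction(original_bits: int, quantized_bits: int) -> float:
    """Fraction of weight storage saved when each weight shrinks to quantized_bits."""
    if original_bits <= 0 or not 0 < quantized_bits <= original_bits:
        raise ValueError("bit widths must satisfy 0 < quantized_bits <= original_bits")
    return 1 - quantized_bits / original_bits


print(f"{size_reduction(32, 8):.0%}")   # FP32 -> INT8: 75%
print(f"{size_reduction(32, 16):.0%}")  # FP32 -> FP16: 50%
```

Real reductions are usually a little smaller, because metadata and any layers left unquantized keep their original size.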

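## Execution Protocol

### Model training (SubSkill: model-training)

```yaml
# Model training requires the following permissions:
- gpu_access: true
- disk_space: 50GB
- cpu: 4
- memory: 16Gi

# Execution time: typically 1-4 hours
# Timeout: 14400 seconds (4 hours)
```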

### Model evaluation (SubSkill: model-evaluation)

```yaml
# Required permissions:
- gpu_access: optional (faster on GPU)
- disk_space: 10GB
- cpu: 2
- memory: 8Gi

# Execution time: 30-60 minutes
# Timeout: 3600 seconds
```

### Model deployment (SubSkill: model-deployment)

```yaml
# Required permissions:
- docker_build: true
- kubernetes_access: true
- image_registry: true
- disk_space: 5GB

# Execution time: 10-30 minutes
# Timeout: 1800 seconds
```

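## Error Handling

| Error code | Meaning | Resolution |
|------------|---------|------------|
| 4001 | Invalid dataset | Check that the data is available |
| 4002 | Insufficient GPU resources | Switch to another GPU or fall back to CPU |
| 4003 | Model training failed | Check the training logs and hyperparameters |
| 4004 | Quantization failed | Try a different quantization configuration |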

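## Best Practices

- A model must be evaluated after training. Models without offline evaluation do not go live.
- Accuracy degradation after quantization must be < 1%.
- Model deployment needs a fallback strategy (rules or an auxiliary model).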
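
These three rules can be read as a gate that runs before deployment. A minimal sketch, assuming accuracy is expressed as a fraction and that "< 1%" means an absolute drop of less than 0.01; the name `check_release_gate` and its parameters are hypothetical and not part of OpenClaw:

```python
from typing import List, Optional, Tuple


def check_release_gate(
    eval_accuracy: Optional[float],
    quantized_accuracy: Optional[float],
    has_fallback: bool,
    max_drop: float = 0.01,
) -> Tuple[bool, List[str]]:
    """Return (ok, reasons); reasons lists every rule that blocks the release."""
    reasons: List[str] = []

    # Rule 1: no offline evaluation, no release.
    if eval_accuracy is None:
        reasons.append("no offline evaluation result")
    # Rule 2: quantization may cost less than max_drop of absolute accuracy.
    elif quantized_accuracy is not None and eval_accuracy - quantized_accuracy >= max_drop:
        reasons.append(
            f"quantization drop {eval_accuracy - quantized_accuracy:.3f} >= {max_drop}"
        )

    # Rule 3: a fallback (rules or an auxiliary model) must be configured.
    if not has_fallback:
        reasons.append("no fallback strategy configured")

    return (not reasons, reasons)
```

For example, `check_release_gate(0.956, 0.950, has_fallback=True)` passes, while a quantized accuracy of `0.940` would block the release.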

---

## 🎯 2. Skill-Wide Configuration (config.yaml)

**File**: `skills/ai-engineer/config.yaml`

```yaml
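# ===== Basic Skill information =====
skill:
  id: ai-engineer
  name: AI Engineer
  version: 2.1.0
  description: ML model development and deployment

  # Call-depth level of the Skill
  # 1 = basic (called directly by the Agent)
  # 2 = intermediate (SubSkills may call one another)
  # 3 = advanced (non-trivial, system-wide orchestration)
  complexity: 2

  # Skill tags (used for lookup)
  tags:
    - machine-learning
    - model-training
    - model-deployment
    - mlops

  # List of all SubSkills
  subskills:
    - id: model-training
      name: Model Training
      handler: subskills.model_training.handler:execute
      description: Model training and experiment management

      # Execution settings
      execution:
        timeout: 14400  # 4 hours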
        retry:
          max_retries: 3
          backoff: exponential
          backoff_factor: 2
        resource:
          cpu: 4
          memory: 16Gi
          gpu: 1
          disk: 50Gi
      
      # Input schema
      inputs:
        - name: dataset
          type: string
          required: true
          description: Dataset name (cifar10, imagenet, etc.)
          enum: [cifar10, imagenet, custom]

        - name: model_name
          type: string
          required: true
          description: Model type (resnet50, vit, etc.)
          enum: [resnet50, resnet101, vit-base, vit-large]

        - name: batch_size
          type: integer
          required: false
          default: 32
          min: 1
          max: 256
          description: Batch size per training step
        
        - name: epochs
          type: integer
          required: false
          default: 100
          min: 1
          max: 1000
        
        - name: learning_rate
          type: float
          required: false
          default: 0.001
          min: 0.00001
          max: 0.1
        
        - name: experiment_name
          type: string
          required: false
          description: Experiment name (used for MLflow tracking)

      # Output schema
      outputs:
        - name: model_path
          type: string
          description: Path of the saved model

        - name: best_accuracy
          type: float
          description: Best accuracy

        - name: training_time_seconds
          type: float
          description: Training time

        - name: mlflow_run_id
          type: string
          description: MLflow run ID

      # SubSkills this one depends on (may be empty)
      dependencies: []
    
    - id: model-evaluation
      name: Model Evaluation
      handler: subskills.model_evaluation.handler:execute
      description: Offline model evaluation and validation
      
      execution:
        timeout: 3600
        retry:
          max_retries: 2
          backoff: linear
        resource:
          cpu: 2
          memory: 8Gi
          gpu: 0.5
          disk: 10Gi
      
      inputs:
        - name: model_path
          type: string
          required: true
          description: Path to the model file

        - name: test_dataset
          type: string
          required: true
          description: Test dataset
        
        - name: metrics
          type: array
          required: false
          default: [accuracy, precision, recall, f1]
          items:
            enum: [accuracy, precision, recall, f1, auc, confusion_matrix]
        
        - name: visualization
          type: boolean
          required: false
          default: true
      
      outputs:
        - name: metrics
          type: object
          description: Evaluation results

        - name: confusion_matrix
          type: string
          description: URL of the confusion matrix

        - name: error_cases
          type: string
          description: Error case analysis

      # Depends on model-training (training must run first)
      dependencies:
        - model-training
    
    - id: model-deployment
      name: Model Deployment
      handler: subskills.model_deployment.handler:execute
      description: Model optimization and deployment
      
      execution:
        timeout: 1800
        retry:
          max_retries: 2
        resource:
          cpu: 4
          memory: 8Gi
          gpu: 1
          disk: 5Gi
      
      inputs:
        - name: model_path
          type: string
          required: true
        
        - name: quantization_type
          type: string
          required: false
          default: none
          enum: [none, int8, fp16, dynamic]
        
        - name: deployment_target
          type: string
          required: true
          enum: [local, docker, kubernetes]
        
        - name: namespace
          type: string
          required: false
          default: default
          condition: "deployment_target == 'kubernetes'"
        
        - name: replicas
          type: integer
          required: false
          default: 1
          min: 1
          max: 100
      
      outputs:
        - name: optimized_model_path
          type: string
        
        - name: model_size_reduction
          type: string
        
        - name: docker_image
          type: string
        
        - name: deployment_status
          type: string
          enum: [pending, running, failed]
        
        - name: service_endpoint
          type: string
      
      dependencies:
        - model-evaluation

# ===== Skill-wide configuration =====
requirements:
  # Minimum OpenClaw version
  openclaw_version: ">=0.2.20"

  # Minimum Python version
  python_version: ">=3.9"

  # Minimum system requirements
  system:
    disk: 100Gi
    memory: 32Gi

# Global Python dependencies
# These are shared by all SubSkills
dependencies:
  - torch==2.0.1
  - torchvision==0.15.2
  - numpy==1.24.0
  - scikit-learn==1.3.0
  - mlflow==2.7.0
  - pandas==2.0.0

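# SubSkill-specific dependencies
# Each SubSkill can declare dependencies of its own
subskill_dependencies:
  model-training:
    - pytorch-lightning==2.0.0
    # torch.distributed ships with torch itself, and "latest" is not a valid
    # pip version specifier, so no separate distributed package is pinned here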
  
  model-deployment:
    - onnx==1.14.0
    - onnxruntime==1.16.0
    - docker==6.0.0
    - kubernetes==27.2.0

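# Environment variables
# These are injected into each handler's execution environment
environment:
  # Global environment variables
  global:
    LOG_LEVEL: INFO
    DATA_ROOT: /data
    MODEL_ROOT: /models
    CACHE_ROOT: /cache

  # SubSkill-specific environment variables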
  subskill_envs:
    model-training:
      MLFLOW_TRACKING_URI: http://mlflow:5000
      MLFLOW_EXPERIMENT_NAME: model-training
      PYTORCH_CUDA_ALLOC_CONF: max_split_size_mb:512
    
    model-evaluation:
      MLFLOW_TRACKING_URI: http://mlflow:5000
    
    model-deployment:
      DOCKER_REGISTRY: registry.company.com
      K8S_CLUSTER: production
      K8S_DOCKER_REGISTRY_SECRET: regcred

# Logging configuration
logging:
  level: INFO
  format: json  # json or text
  output:
    - console
    - file:/logs/skill.log

  # Per-SubSkill log files
  subskill_logs:
    model-training: /logs/training.log
    model-evaluation: /logs/evaluation.log
    model-deployment: /logs/deployment.log

# Monitoring configuration
monitoring:
  enabled: true
  metrics_export:
    type: prometheus
    endpoint: http://prometheus:9090
  
  key_metrics:
    - skill_execution_duration
    - subskill_success_rate
    - resource_utilization
    - error_rate

# Upgrade policy
upgrades:
  strategy: rolling  # rolling or blue-green
  max_unavailable_percent: 10  # maximum percentage of SubSkills that may be unavailable
  canary_percentage: 5  # roll out to 5% first as a test
```
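
The `retry` blocks in the configuration name a backoff strategy but not the resulting wait times. The sketch below shows one plausible reading. How OpenClaw actually interprets `backoff_factor` is not stated in this document, so the `base_delay` parameter and both formulas are assumptions.

```python
from typing import List


def retry_delays(
    max_retries: int,
    backoff: str = "exponential",
    backoff_factor: float = 2,
    base_delay: float = 1.0,
) -> List[float]:
    """Seconds to wait before each retry attempt (assumed interpretation)."""
    if backoff == "exponential":
        # base, base * factor, base * factor^2, ...
        return [base_delay * backoff_factor ** attempt for attempt in range(max_retries)]
    if backoff == "linear":
        # base, 2 * base, 3 * base, ...
        return [base_delay * (attempt + 1) for attempt in range(max_retries)]
    raise ValueError(f"unknown backoff strategy: {backoff}")


# model-training: max_retries 3, exponential, factor 2
print(retry_delays(3, "exponential", backoff_factor=2))  # [1.0, 2.0, 4.0]
# model-evaluation: max_retries 2, linear
print(retry_delays(2, "linear"))  # [1.0, 2.0]
```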

## 🛠️ 3. SubSkill Executor (handler.py)

**File**: `skills/ai-engineer/subskills/model-training/handler.py`

```python
"""
Executor for the model-training SubSkill.

Responsibilities:
1. Receive input parameters from the Agent
2. Run the training script
3. Track the experiment run in MLflow
4. Return the result, or capture unrecoverable errors
"""

import os
import sys
import json
import traceback
import subprocess
from typing import Dict, Any
from pathlib import Path

# utils/ sits at the Skill root next to subskills/, so it is imported absolutely.
# This assumes the Skill root is on sys.path.
from utils.validators import validate_inputs
from utils.logger import get_logger
from utils.config_loader import load_config

logger = get_logger(__name__)

# Used wherever the caller does not supply an experiment name
DEFAULT_EXPERIMENT_NAME = 'exp-default'


class TrainingExecutor:
    """Executor for model training."""

    def __init__(self, config: Dict[str, Any]):
        self.config = config
        self.subskill_config = load_config('subskills/model-training/config.yaml')
        self.work_dir = Path(os.getenv('WORK_DIR', '/workspace'))
        self.models_dir = Path(os.getenv('MODEL_ROOT', '/models'))
        self.models_dir.mkdir(parents=True, exist_ok=True)

    def validate_inputs(self, inputs: Dict[str, Any]) -> bool:
        """
        Validate the input parameters.

        Validation is driven by the `inputs` definitions in config.yaml.
        """
        try:
            # Validate automatically against the schema from the config
            validate_inputs(
                inputs=inputs,
                schema=self.subskill_config['inputs']
            )
            logger.info("Input validation succeeded")
            return True
        except ValueError as e:
            logger.error(f"Input validation failed: {str(e)}")
            raise

    def prepare_training_script(self, inputs: Dict[str, Any]) -> str:
        """
        Prepare the training script for this SubSkill.

        Input parameters -> training script with the parameters filled in.
        """
        script_template = self.work_dir / 'subskills/model-training/scripts/train.py'
        experiment_name = inputs.get('experiment_name', DEFAULT_EXPERIMENT_NAME)

        # Replace the placeholders in the template
        script_content = script_template.read_text()

        replacements = {
            '{{dataset}}': inputs['dataset'],
            '{{model_name}}': inputs['model_name'],
            '{{batch_size}}': str(inputs.get('batch_size', 32)),
            '{{epochs}}': str(inputs.get('epochs', 100)),
            '{{learning_rate}}': str(inputs.get('learning_rate', 0.001)),
            '{{experiment_name}}': experiment_name,
            '{{models_dir}}': str(self.models_dir),
        }

        for placeholder, value in replacements.items():
            script_content = script_content.replace(placeholder, value)

        # Save the rendered script (same default name as the placeholder above,
        # so a missing experiment_name no longer yields "train_None.py")
        script_path = self.work_dir / f".cache/train_{experiment_name}.py"
        script_path.parent.mkdir(parents=True, exist_ok=True)
        script_path.write_text(script_content)

        logger.info(f"Training script prepared: {script_path}")
        return str(script_path)

    def execute_training(self, script_path: str, inputs: Dict[str, Any]) -> Dict[str, Any]:
        """
        Run the training script.
        """
        try:
            # Set environment variables
            env = os.environ.copy()
            env.update({
                'PYTHONUNBUFFERED': '1',
                'MLFLOW_EXPERIMENT_NAME': inputs.get('experiment_name', DEFAULT_EXPERIMENT_NAME),
                'MLFLOW_TRACKING_URI': os.getenv('MLFLOW_TRACKING_URI', 'http://localhost:5000'),
            })

            # Run the script with the same interpreter that runs this handler
            result = subprocess.run(
                [sys.executable, script_path],
                capture_output=True,
                text=True,
                timeout=self.subskill_config['execution']['timeout'],
                env=env,
                cwd=self.work_dir
            )

            if result.returncode != 0:
                logger.error(f"Script failed\nStdout: {result.stdout}\nStderr: {result.stderr}")
                raise RuntimeError(f"Training script exit code: {result.returncode}")

            logger.info(f"Script succeeded\nOutput: {result.stdout}")

            # Parse the result from the script output (JSON format)
            output_lines = result.stdout.strip().split('\n')
            result_json = json.loads(output_lines[-1])  # the last line is JSON

            return result_json

        except subprocess.TimeoutExpired:
            logger.error("Script timed out")
            raise TimeoutError(f"Training timed out ({self.subskill_config['execution']['timeout']}s)")

        except Exception as e:
            logger.error(f"Script raised an exception: {str(e)}\n{traceback.format_exc()}")
            raise

    def post_process(self, result: Dict[str, Any]) -> Dict[str, Any]:
        """
        Post-processing: validate the output and normalize the result.
        """
        # Check that the output model was actually created
        if 'model_path' in result:
            model_file = Path(result['model_path'])
            if not model_file.exists():
                raise FileNotFoundError(f"Model file does not exist: {result['model_path']}")

            logger.info(f"Model file size: {model_file.stat().st_size / 1024 / 1024:.2f}MB")

        return result


def execute(inputs: Dict[str, Any]) -> Dict[str, Any]:
    """
    Main entry point of the SubSkill.

    How OpenClaw calls it:
    result = execute({
        'dataset': 'cifar10',
        'model_name': 'resnet50',
        'batch_size': 32,
        'epochs': 100,
        'learning_rate': 0.001,
        'experiment_name': 'exp-v2.1'
    })

    Args:
        inputs: input parameters from the Agent

    Returns:
        {
            'status': 'success' | 'failed',
            'model_path': '/models/resnet50_v2.1.pt',
            'best_accuracy': 0.956,
            'training_time_seconds': 3600,
            'mlflow_run_id': 'abc123',
            'error': 'error message (present only on failure)'
        }
    """
    executor = TrainingExecutor(config={})

    try:
        # 1. Validate the inputs
        executor.validate_inputs(inputs)

        # 2. Prepare the script
        script_path = executor.prepare_training_script(inputs)

        # 3. Run the script
        result = executor.execute_training(script_path, inputs)

        # 4. Post-process
        result = executor.post_process(result)

        # 5. Attach the execution status
        result['status'] = 'success'
        return result

    except Exception as e:
        logger.error(f"SubSkill execution failed: {str(e)}")
        return {
            'status': 'failed',
            'error': str(e),
            'error_type': type(e).__name__,
            'traceback': traceback.format_exc()
        }
```
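
The handler delegates validation to `utils/validators.py`, which this document does not show. The following is a minimal sketch of what such a validator might look like, assuming the schema is the list of field definitions from `config.yaml` (`name`, `type`, `required`, `default`, `enum`, `min`, `max`). Returning a copy with defaults filled in and rejecting unknown keys are design choices of this sketch, not documented OpenClaw behavior.

```python
from typing import Any, Dict, List

# Maps schema type names to Python types (assumed mapping).
_TYPE_MAP = {
    "string": str,
    "integer": int,
    "float": (int, float),
    "boolean": bool,
    "array": list,
    "object": dict,
}


def validate_inputs(inputs: Dict[str, Any], schema: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Check inputs against schema; return a copy with defaults applied.

    Raises ValueError on the first violation found.
    """
    validated = dict(inputs)
    unknown = set(inputs) - {field["name"] for field in schema}
    if unknown:
        raise ValueError(f"unknown inputs: {sorted(unknown)}")

    for field in schema:
        name = field["name"]
        if name not in validated:
            if field.get("required", False):
                raise ValueError(f"missing required input: {name}")
            if "default" in field:
                validated[name] = field["default"]
            continue

        value = validated[name]
        expected = _TYPE_MAP[field["type"]]
        # bool is a subclass of int, so reject it explicitly for non-boolean fields.
        is_stray_bool = isinstance(value, bool) and field["type"] != "boolean"
        if is_stray_bool or not isinstance(value, expected):
            raise ValueError(f"{name}: expected {field['type']}, got {type(value).__name__}")
        if "enum" in field and value not in field["enum"]:
            raise ValueError(f"{name}: {value!r} not in {field['enum']}")
        if "min" in field and value < field["min"]:
            raise ValueError(f"{name}: {value} is below the minimum {field['min']}")
        if "max" in field and value > field["max"]:
            raise ValueError(f"{name}: {value} is above the maximum {field['max']}")
    return validated
```

With the `model-training` schema, `validate_inputs({"dataset": "cifar10", "model_name": "resnet50"}, schema)` would return the inputs with `batch_size`, `epochs`, and `learning_rate` set to their defaults.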

## 📝 4. Training Script (train.py)

**File**: `skills/ai-engineer/subskills/model-training/scripts/train.py`

```python
"""
Model training script.

handler.py substitutes the input parameters, so the {{placeholder}} syntax can be used here.
"""

import json
import logging
from pathlib import Path
from datetime import datetime

import mlflow
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader
from torchvision import datasets, transforms, models

# ===== Parameters (handler.py replaces these) =====
DATASET = '{{dataset}}'
MODEL_NAME = '{{model_name}}'
BATCH_SIZE = {{batch_size}}
EPOCHS = {{epochs}}
LEARNING_RATE = {{learning_rate}}
EXPERIMENT_NAME = '{{experiment_name}}'
MODELS_DIR = '{{models_dir}}'

# ===== Configuration =====
DEVICE = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
logger = logging.getLogger(__name__)
logging.basicConfig(level=logging.INFO)


class ModelTrainer:
    """Model trainer."""

    def __init__(self):
        self.device = DEVICE
        self.best_accuracy = 0.0
        self.model_dir = Path(MODELS_DIR)
        self.model_dir.mkdir(parents=True, exist_ok=True)

        # Configure MLflow
        mlflow.set_experiment(EXPERIMENT_NAME)

    def load_data(self):
        """Load the dataset."""
        transform = transforms.Compose([
            transforms.ToTensor(),
            transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
        ])

        if DATASET == 'cifar10':
            self.train_dataset = datasets.CIFAR10(
                root='./data', train=True, download=True, transform=transform
            )
            self.test_dataset = datasets.CIFAR10(
                root='./data', train=False, download=True, transform=transform
            )
        else:
            raise ValueError(f"Unsupported dataset: {DATASET}")

        self.train_loader = DataLoader(
            self.train_dataset, batch_size=BATCH_SIZE, shuffle=True, num_workers=4
        )
        self.test_loader = DataLoader(
            self.test_dataset, batch_size=BATCH_SIZE, shuffle=False, num_workers=4
        )

        logger.info(f"Dataset loaded: {len(self.train_dataset)} / {len(self.test_dataset)}")

    def build_model(self):
        """Build the model, loss, and optimizer (call after load_data)."""
        if MODEL_NAME == 'resnet50':
            # `weights=` replaces the deprecated `pretrained=True` (torchvision >= 0.13)
            self.model = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
            # Swap the 1000-class ImageNet head for one sized to the dataset
            num_classes = len(self.train_dataset.classes)
            self.model.fc = nn.Linear(self.model.fc.in_features, num_classes)
        else:
            raise ValueError(f"Unsupported model: {MODEL_NAME}")

        self.model.to(self.device)

        # Created once so that Adam's moment estimates persist across epochs
        self.criterion = nn.CrossEntropyLoss()
        self.optimizer = optim.Adam(self.model.parameters(), lr=LEARNING_RATE)
        logger.info(f"Model ready: {MODEL_NAME}")

    def train_epoch(self, epoch):
        """Train for one epoch."""
        self.model.train()

        total_loss = 0
        for batch_idx, (data, target) in enumerate(self.train_loader):
            data, target = data.to(self.device), target.to(self.device)

            self.optimizer.zero_grad()
            output = self.model(data)
            loss = self.criterion(output, target)
            loss.backward()
            self.optimizer.step()

            total_loss += loss.item()

            if batch_idx % 100 == 0:
                logger.info(f"Epoch {epoch}, Batch {batch_idx}, Loss: {loss.item():.4f}")

        avg_loss = total_loss / len(self.train_loader)
        mlflow.log_metric("train_loss", avg_loss, step=epoch)
        return avg_loss

    def evaluate(self, epoch):
        """Evaluate the model; returns accuracy as a fraction in [0, 1]."""
        self.model.eval()
        correct = 0
        total = 0

        with torch.no_grad():
            for data, target in self.test_loader:
                data, target = data.to(self.device), target.to(self.device)
                output = self.model(data)
                predicted = output.argmax(dim=1)
                total += target.size(0)
                correct += (predicted == target).sum().item()

        # A fraction (e.g. 0.956) matches the best_accuracy format in SKILL.md
        accuracy = correct / total
        mlflow.log_metric("test_accuracy", accuracy, step=epoch)
        logger.info(f"Epoch {epoch}, Accuracy: {accuracy:.2%}")

        return accuracy

    def save_model(self, model_name):
        """Save the model."""
        model_path = self.model_dir / model_name
        torch.save(self.model.state_dict(), model_path)
        logger.info(f"Model saved: {model_path}")
        return str(model_path)

    def train(self):
        """Full training flow."""
        with mlflow.start_run(run_name=EXPERIMENT_NAME):
            # Log hyperparameters
            mlflow.log_params({
                'dataset': DATASET,
                'model': MODEL_NAME,
                'batch_size': BATCH_SIZE,
                'epochs': EPOCHS,
                'learning_rate': LEARNING_RATE,
            })

            self.load_data()
            self.build_model()

            start_time = datetime.now()
            model_path = None  # stays None if no epoch ever improves on 0 accuracy

            for epoch in range(1, EPOCHS + 1):
                train_loss = self.train_epoch(epoch)
                accuracy = self.evaluate(epoch)

                # Keep the best model
                if accuracy > self.best_accuracy:
                    self.best_accuracy = accuracy
                    model_filename = f"{MODEL_NAME}_{DATASET}_{EXPERIMENT_NAME}.pt"
                    model_path = self.save_model(model_filename)

            training_time = (datetime.now() - start_time).total_seconds()

            # Log the final results
            mlflow.log_metric("best_accuracy", self.best_accuracy)
            mlflow.log_metric("training_time_seconds", training_time)

            run_id = mlflow.active_run().info.run_id

            return {
                'model_path': model_path,
                'best_accuracy': self.best_accuracy,
                'training_time_seconds': training_time,
                'mlflow_run_id': run_id,
                'final_loss': train_loss,
            }


if __name__ == '__main__':
    trainer = ModelTrainer()
    result = trainer.train()

    # Print the result as JSON (on the last line)
    # handler.py parses this line
    print(json.dumps(result))
```
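
The contract between `train.py` and `handler.py` is "the last line of stdout is a JSON object". That is fragile if a library prints something after the result. Below is a slightly more tolerant parser, offered as a sketch; `parse_result_line` is a hypothetical helper, not something the handler shown here defines.

```python
import json
from typing import Any, Dict


def parse_result_line(stdout: str) -> Dict[str, Any]:
    """Return the last stdout line that parses as a JSON object."""
    for line in reversed(stdout.strip().splitlines()):
        line = line.strip()
        if not line.startswith("{"):
            continue  # skip progress messages and other noise
        try:
            parsed = json.loads(line)
        except json.JSONDecodeError:
            continue  # looked like JSON but was not; keep searching upward
        if isinstance(parsed, dict):
            return parsed
    raise ValueError("no JSON result line found in script output")
```

A handler could call `parse_result_line(result.stdout)` in place of `json.loads(output_lines[-1])` and get a clear `ValueError` when the script printed no result at all.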

## 🏃 Usage Example

### Scenario: invocation by an Agent

Inference mode: session mode (each step is invoked in turn)

```python
# The Agent runs this flow automatically
import sys

from openclaw.agent import Agent
from openclaw.skill import call_skill

agent = Agent.from_workspace('/workspace/agents/ai-engineer')

# Step 1: run the model-training SubSkill
training_result = call_skill(
    skill_id='ai-engineer',
    subskill_id='model-training',
    inputs={
        'dataset': 'cifar10',
        'model_name': 'resnet50',
        'batch_size': 32,
        'epochs': 100,
        'learning_rate': 0.001,
        'experiment_name': 'exp-v2.1'
    },
    agent=agent
)

if training_result['status'] != 'success':
    print(f"Training failed: {training_result['error']}")
    sys.exit(1)

# Step 2: run the model-evaluation SubSkill
evaluation_result = call_skill(
    skill_id='ai-engineer',
    subskill_id='model-evaluation',
    inputs={
        'model_path': training_result['model_path'],
        'test_dataset': 'cifar10',
        'metrics': ['accuracy', 'f1', 'confusion_matrix'],
        'visualization': True
    },
    agent=agent
)

if evaluation_result['status'] != 'success':
    print(f"Evaluation failed: {evaluation_result['error']}")
    sys.exit(1)

# Step 3: run the model-deployment SubSkill
deployment_result = call_skill(
    skill_id='ai-engineer',
    subskill_id='model-deployment',
    inputs={
        'model_path': training_result['model_path'],
        'quantization_type': 'int8',
        'deployment_target': 'kubernetes',
        'namespace': 'production',
        'replicas': 3
    },
    agent=agent
)

if deployment_result['status'] == 'success':
    print(f"Model deployed: {deployment_result['service_endpoint']}")
```
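
The call order in this example mirrors the `dependencies` declared in `config.yaml`. Rather than hard-coding it, the order can be derived from those declarations. A minimal sketch using the standard library's `graphlib` (Python 3.9+); the `DEPENDENCIES` dict is transcribed by hand from the config, and whether OpenClaw resolves the order this way internally is not stated in this document.

```python
from graphlib import TopologicalSorter
from typing import Dict, List

# SubSkill -> SubSkills it depends on, as declared in config.yaml.
DEPENDENCIES: Dict[str, List[str]] = {
    "model-training": [],
    "model-evaluation": ["model-training"],
    "model-deployment": ["model-evaluation"],
}


def execution_order(dependencies: Dict[str, List[str]]) -> List[str]:
    """Return SubSkill ids so that every dependency precedes its dependents.

    Raises graphlib.CycleError if the declarations contain a cycle.
    """
    return list(TopologicalSorter(dependencies).static_order())


print(execution_order(DEPENDENCIES))
# ['model-training', 'model-evaluation', 'model-deployment']
```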

## ⚙️ Operations

### Installation and testing

```bash
# Inspect the Skill manifest
cat skills/ai-engineer/SKILL.md

# Validate the Skill configuration
yaml-lint skills/ai-engineer/config.yaml

# Run the tests
pytest skills/ai-engineer/tests/ -v

# Run a single SubSkill
python -c "
from subskills.model_training.handler import execute
result = execute({
    'dataset': 'cifar10',
    'model_name': 'resnet50',
    'batch_size': 32,
    'epochs': 1,
    'experiment_name': 'test'
})
print(result)
"
```

### Collecting logs and debugging

```bash
# Follow the SubSkill logs in real time
kubectl logs -f deployment/openclaw-agent-ai -n openclaw | grep 'model-training'

# Browse the MLflow experiments produced by SubSkill runs
mlflow ui --host 0.0.0.0 --port 5000

# Check resource usage
kubectl top pods -n openclaw
```

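## ⚠️ Common Errors

| Error | Cause | Solution |
|-------|-------|----------|
| `FileNotFoundError: model not found` | The model file does not exist | Check that the training script actually saved the model |
| `OOM killed` | Out of GPU memory | Reduce `batch_size` or use mixed precision |
| `Handler timeout` | Training timed out | Increase the timeout in config.yaml |
| `Validation failed` | Inputs do not match the schema | Check input types and value ranges |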
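
The "reduce `batch_size`" remedy can be automated for the catchable case. PyTorch reports a CUDA out-of-memory condition as a `RuntimeError` whose message contains "out of memory". A process killed by the operating system cannot be caught this way. The sketch below retries with half the batch size; `run_with_batch_backoff` and the `train_fn` callback are hypothetical names, not part of the Skill.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


def run_with_batch_backoff(
    train_fn: Callable[[int], T],
    batch_size: int,
    min_batch_size: int = 1,
) -> T:
    """Call train_fn(batch_size); on an out-of-memory error, halve and retry."""
    while True:
        try:
            return train_fn(batch_size)
        except RuntimeError as exc:
            is_oom = "out of memory" in str(exc).lower()
            # Re-raise anything that is not OOM, or when we cannot shrink further.
            if not is_oom or batch_size // 2 < min_batch_size:
                raise
            batch_size //= 2
```

In a real training loop the callback should also release cached GPU memory before retrying, for example with `torch.cuda.empty_cache()`.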


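## 📌 Key Takeaways

  1. SKILL.md - the Skill's charter; defines the SubSkills, their inputs and outputs, and the execution protocol
  2. config.yaml - global configuration; defines each SubSkill's resource requirements, timeouts, and dependencies
  3. handler.py - the SubSkill's main executor; responsible for parameter validation, script preparation, and script execution
  4. scripts/xxx.py - the actual business scripts, in which {{placeholder}} markers are replaced with parameters
  5. dependencies - global dependencies and SubSkill-specific dependencies are declared separately
  6. Inputs and outputs - follow the schema strictly; type, range, and enum checks are supported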
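
One caveat on point 4: plain string replacement pastes user input directly into Python source, so a value containing a quote can break or alter the generated script. A hedged alternative is to substitute each placeholder with the Python literal of its value via `repr`. This is a variation on the approach described above, not what `handler.py` does, and it assumes the template writes `DATASET = {{dataset}}` without surrounding quotes.

```python
import re
from typing import Any, Dict

_PLACEHOLDER = re.compile(r"\{\{(\w+)\}\}")


def render_template(template: str, values: Dict[str, Any]) -> str:
    """Replace each {{name}} with repr(values[name]), a valid Python literal."""

    def substitute(match: "re.Match[str]") -> str:
        name = match.group(1)
        if name not in values:
            raise KeyError(f"no value for placeholder {{{{{name}}}}}")
        # repr() quotes and escapes strings, and leaves numbers as numbers.
        return repr(values[name])

    return _PLACEHOLDER.sub(substitute, template)


template = "DATASET = {{dataset}}\nBATCH_SIZE = {{batch_size}}\n"
print(render_template(template, {"dataset": "cifar10", "batch_size": 32}))
# DATASET = 'cifar10'
# BATCH_SIZE = 32
```

Passing parameters as command-line arguments or environment variables avoids code generation altogether and may be the simpler choice where the framework allows it.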

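Done! ✅ This is a complete example document covering everything from Skill definition to SubSkills to script execution, and it can be used directly in production.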
