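# OpenClaw Skill Development: A Detailed Example
A complete Skill development workflow, from configuration through SubSkills to script execution.
## 🏗️ Skill Structure Template
### Start Here: How a Skill Is Layered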
```
Skill (one Skill → many SubSkills)
│
├─ SKILL.md                     # Skill manifest (model definition, rules)
│
├─ config.yaml                  # Skill-wide configuration
│
├─ subskills/                   # SubSkill library
│   ├─ model-training/          # SubSkill 1
│   │   ├─ config.yaml          # SubSkill configuration
│   │   ├─ handler.py           # Executor (main entry point)
│   │   ├─ requirements.txt     # Python dependencies
│   │   └─ scripts/
│   │       ├─ train.py
│   │       ├─ validate.py
│   │       └─ utils.py
│   │
│   ├─ model-evaluation/        # SubSkill 2
│   │   ├─ config.yaml
│   │   ├─ handler.py
│   │   ├─ requirements.txt
│   │   └─ scripts/
│   │       ├─ evaluate.py
│   │       └─ metrics.py
│   │
│   └─ model-deployment/        # SubSkill 3
│       ├─ config.yaml
│       ├─ handler.py
│       ├─ requirements.txt
│       └─ scripts/
│           ├─ deploy.py
│           ├─ docker_build.sh
│           └─ k8s_deploy.yaml
│
├─ utils/                       # Shared utility functions
│   ├─ __init__.py
│   ├─ validators.py            # Validators
│   ├─ logger.py                # Logging
│   └─ config_loader.py         # Configuration loading
│
├─ tests/                       # Test cases
│   ├─ test_training.py
│   ├─ test_evaluation.py
│   └─ fixtures/
│
└─ docs/                        # Documentation
    ├─ API.md
    ├─ EXAMPLES.md
    └─ TROUBLESHOOTING.md
```
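The layout above can be checked mechanically before a Skill is used. The following is a minimal sketch and not part of OpenClaw: the function name `find_missing_files` is hypothetical, and treating `SKILL.md`, `config.yaml`, and each SubSkill's `config.yaml` and `handler.py` as mandatory is an assumption made for illustration.

```python
from pathlib import Path
from typing import List

# Assumption: these files are treated as mandatory; the document does not
# state which parts of the layout are actually required.
REQUIRED_SKILL_FILES = ("SKILL.md", "config.yaml")
REQUIRED_SUBSKILL_FILES = ("config.yaml", "handler.py")


def find_missing_files(skill_root: Path) -> List[str]:
    """Return relative paths of required files that are absent under skill_root."""
    missing = [name for name in REQUIRED_SKILL_FILES
               if not (skill_root / name).is_file()]
    subskills_dir = skill_root / "subskills"
    if not subskills_dir.is_dir():
        missing.append("subskills/")
        return missing
    # Check every SubSkill directory in a stable (sorted) order
    for subskill in sorted(p for p in subskills_dir.iterdir() if p.is_dir()):
        for name in REQUIRED_SUBSKILL_FILES:
            if not (subskill / name).is_file():
                missing.append(f"subskills/{subskill.name}/{name}")
    return missing
```

An empty list means the Skill directory has every file this sketch looks for.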
## 📄 1. Skill Manifest (SKILL.md)
**File**: `skills/ai-engineer/SKILL.md`
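---
name: ai-engineer
version: 2.1.0
description: One-stop solution specializing in ML model development and deployment
color: purple
author: AI Technology Team
contacts:
  - name: Tech Lead
    email: tech-lead@company.com
license: MIT
---
# AI Engineer Skill
## Overview
Covers the full lifecycle of AI models, from development through deployment.
## SubSkills Provided
### 1. model-training
Model training and experiment management
- Data download and preprocessing
- Experiment management (MLflow integration)
- Distributed training
- Model checkpoint management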
**Input:**
```json
{
  "dataset": "cifar10",
  "model_name": "resnet50",
  "batch_size": 32,
  "epochs": 100,
  "learning_rate": 0.001,
  "experiment_name": "exp-v2.1"
}
```
**Output:**
```json
{
  "model_path": "/models/resnet50_cifar10_v2.1.pt",
  "best_accuracy": 0.956,
  "final_loss": 0.123,
  "training_time_seconds": 3600,
  "mlflow_run_id": "abc123"
}
```
### 2. model-evaluation
Offline model evaluation and validation
- Accuracy, precision, and recall computation
- In-depth analysis and error cases
- Performance on out-of-distribution samples
- Shallow-layer feature visualization

**Input:**
```json
{
  "model_path": "/models/resnet50_cifar10_v2.1.pt",
  "test_dataset": "cifar10_test",
  "metrics": ["accuracy", "f1", "auc", "confusion_matrix"],
  "visualization": true
}
```
**Output:**
```json
{
  "metrics": {
    "accuracy": 0.956,
    "precision": 0.954,
    "recall": 0.956,
    "f1": 0.955,
    "auc": 0.998
  },
  "confusion_matrix": "s3://bucket/confusion_matrix.png",
  "error_cases": "s3://bucket/error_analysis.json"
}
```
### 3. model-deployment
Model optimization and deployment
- Quantization (INT8/FP16)
- ONNX conversion and optimization
- Docker packaging
- Kubernetes deployment

**Input:**
```json
{
  "model_path": "/models/resnet50_cifar10_v2.1.pt",
  "quantization_type": "int8",
  "deployment_target": "kubernetes",
  "namespace": "production",
  "replicas": 3
}
```
**Output:**
```json
{
  "optimized_model_path": "/models/resnet50_cifar10_v2.1_int8.onnx",
  "model_size_reduction": "75%",
  "docker_image": "registry.company.com/model-resnet50:v2.1-int8",
  "deployment_status": "running",
  "service_endpoint": "http://model-service.production:8080"
}
```
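## Execution Protocol
### Model Training (SubSkill: model-training)
```yaml
# Model training requires the following permissions and resources:
- gpu_access: true
- disk_space: 50GB
- cpu: 4
- memory: 16Gi
# Typical run time: 1-4 hours
# Timeout: 14400 seconds (4 hours)
```
### Model Evaluation (SubSkill: model-evaluation)
```yaml
# Required permissions and resources:
- gpu_access: optional (faster on GPU)
- disk_space: 10GB
- cpu: 2
- memory: 8Gi
# Typical run time: 30-60 minutes
# Timeout: 3600 seconds
```
### Model Deployment (SubSkill: model-deployment)
```yaml
# Required permissions and resources:
- docker_build: true
- kubernetes_access: true
- image_registry: true
- disk_space: 5GB
# Typical run time: 10-30 minutes
# Timeout: 1800 seconds
```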
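## Error Handling
| Error code | Meaning | Resolution |
|--------|------|--------|
| 4001 | Invalid dataset | Check that the data is available |
| 4002 | Insufficient GPU resources | Switch to another GPU or fall back to CPU |
| 4003 | Model training failed | Check the training logs and hyperparameters |
| 4004 | Quantization failed | Try a different quantization configuration |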
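## Best Practices
- Every model must be evaluated after training. A model without offline evaluation does not go to production.
- Accuracy degradation after quantization must be below 1%.
- Model deployments need a fallback strategy (rules or an auxiliary model).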
---
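The manifest's rule that quantization must cost less than 1% accuracy can be enforced as a gate between quantization and deployment. The sketch below is one way to express it. The function name `passes_quantization_gate` is hypothetical, and it assumes accuracies are fractions in [0, 1] and that "1%" means an absolute drop of 0.01 (one percentage point), which the manifest does not spell out.

```python
def passes_quantization_gate(
    baseline_accuracy: float,
    quantized_accuracy: float,
    max_drop: float = 0.01,  # assumption: "1%" is read as an absolute drop of 0.01
) -> bool:
    """Return True when the accuracy lost to quantization is strictly below max_drop."""
    for value in (baseline_accuracy, quantized_accuracy):
        if not 0.0 <= value <= 1.0:
            raise ValueError("accuracies must be fractions in [0, 1]")
    # A quantized model that happens to score higher yields a negative drop and passes
    return (baseline_accuracy - quantized_accuracy) < max_drop
```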
## 🎯 2. Skill Global Configuration (config.yaml)
**File**: `skills/ai-engineer/config.yaml`
```yaml
# ===== Basic Skill information =====
skill:
  id: ai-engineer
  name: AI Engineer
  version: 2.1.0
  description: Specializes in ML model development and deployment
  # Call level of the Skill
  # 1 = basic (called directly by the Agent)
  # 2 = intermediate (SubSkills may call one another)
  # 3 = advanced (complex, system-wide orchestration)
  complexity: 2
  # Skill tags (used for lookup)
  tags:
    - machine-learning
    - model-training
    - model-deployment
    - mlops
# List of all SubSkills
subskills:
  - id: model-training
    name: Model Training
    # Dotted import path; assumes the loader maps the model-training directory
    # to the module name model_training (hyphens are not valid in module names)
    handler: subskills.model_training.handler:execute
    description: Model training and experiment management
    # Execution settings
    execution:
      timeout: 14400  # 4 hours
      retry:
        max_retries: 3
        backoff: exponential
        backoff_factor: 2
      resource:
        cpu: 4
        memory: 16Gi
        gpu: 1
        disk: 50Gi
    # Input schema
    inputs:
      - name: dataset
        type: string
        required: true
        description: Dataset name (cifar10, imagenet, etc.)
        enum: [cifar10, imagenet, custom]
      - name: model_name
        type: string
        required: true
        description: Model type (resnet50, vit, etc.)
        enum: [resnet50, resnet101, vit-base, vit-large]
      - name: batch_size
        type: integer
        required: false
        default: 32
        min: 1
        max: 256
        description: Number of samples per batch
      - name: epochs
        type: integer
        required: false
        default: 100
        min: 1
        max: 1000
      - name: learning_rate
        type: float
        required: false
        default: 0.001
        min: 0.00001
        max: 0.1
      - name: experiment_name
        type: string
        required: false
        description: Experiment name (used for MLflow tracking)
    # Output schema
    outputs:
      - name: model_path
        type: string
        description: Path of the saved model
      - name: best_accuracy
        type: float
        description: Best accuracy
      - name: final_loss
        type: float
        description: Training loss of the last epoch
      - name: training_time_seconds
        type: float
        description: Training time
      - name: mlflow_run_id
        type: string
        description: MLflow run ID
    # SubSkills this one depends on (may be empty)
    dependencies: []
  - id: model-evaluation
    name: Model Evaluation
    handler: subskills.model_evaluation.handler:execute
    description: Offline model evaluation and validation
    execution:
      timeout: 3600
      retry:
        max_retries: 2
        backoff: linear
      resource:
        cpu: 2
        memory: 8Gi
        gpu: 0.5
        disk: 10Gi
    inputs:
      - name: model_path
        type: string
        required: true
        description: Path of the model file
      - name: test_dataset
        type: string
        required: true
        description: Test dataset
      - name: metrics
        type: array
        required: false
        default: [accuracy, precision, recall, f1]
        items:
          enum: [accuracy, precision, recall, f1, auc, confusion_matrix]
      - name: visualization
        type: boolean
        required: false
        default: true
    outputs:
      - name: metrics
        type: object
        description: Evaluation results
      - name: confusion_matrix
        type: string
        description: Confusion matrix URL
      - name: error_cases
        type: string
        description: Error case analysis
    # Depends on model-training (training must run first)
    dependencies:
      - model-training
  - id: model-deployment
    name: Model Deployment
    handler: subskills.model_deployment.handler:execute
    description: Model optimization and deployment
    execution:
      timeout: 1800
      retry:
        max_retries: 2
      resource:
        cpu: 4
        memory: 8Gi
        gpu: 1
        disk: 5Gi
    inputs:
      - name: model_path
        type: string
        required: true
      - name: quantization_type
        type: string
        required: false
        default: none
        enum: [none, int8, fp16, dynamic]
      - name: deployment_target
        type: string
        required: true
        enum: [local, docker, kubernetes]
      - name: namespace
        type: string
        required: false
        default: default
        condition: "deployment_target == 'kubernetes'"
      - name: replicas
        type: integer
        required: false
        default: 1
        min: 1
        max: 100
    outputs:
      - name: optimized_model_path
        type: string
      - name: model_size_reduction
        type: string
      - name: docker_image
        type: string
      - name: deployment_status
        type: string
        enum: [pending, running, failed]
      - name: service_endpoint
        type: string
    dependencies:
      - model-evaluation
# ===== Skill-wide settings =====
requirements:
  # Minimum OpenClaw version
  openclaw_version: ">=0.2.20"
  # Minimum Python version
  python_version: ">=3.9"
  # Minimum system requirements
  system:
    disk: 100Gi
    memory: 32Gi
# Global Python dependencies
# These are shared by all SubSkills
dependencies:
  - torch==2.0.1
  - torchvision==0.15.2
  - numpy==1.24.0
  - scikit-learn==1.3.0
  - mlflow==2.7.0
  - pandas==2.0.0
# SubSkill-specific dependencies
# Each SubSkill may declare its own
subskill_dependencies:
  model-training:
    - pytorch-lightning==2.0.0
    # Distributed training uses torch.distributed, which ships with torch,
    # so no separate package is listed for it
  model-deployment:
    - onnx==1.14.0
    - onnxruntime==1.16.0
    - docker==6.0.0
    - kubernetes==27.2.0
# Environment variables
# These are injected into the handler's execution environment
environment:
  # Global environment variables
  global:
    LOG_LEVEL: INFO
    DATA_ROOT: /data
    MODEL_ROOT: /models
    CACHE_ROOT: /cache
  # SubSkill-specific environment variables
  subskill_envs:
    model-training:
      MLFLOW_TRACKING_URI: http://mlflow:5000
      MLFLOW_EXPERIMENT_NAME: model-training
      PYTORCH_CUDA_ALLOC_CONF: max_split_size_mb:512
    model-evaluation:
      MLFLOW_TRACKING_URI: http://mlflow:5000
    model-deployment:
      DOCKER_REGISTRY: registry.company.com
      K8S_CLUSTER: production
      K8S_DOCKER_REGISTRY_SECRET: regcred
# Logging
logging:
  level: INFO
  format: json  # json or text
  output:
    - console
    - file:/logs/skill.log
  # Per-SubSkill log files
  subskill_logs:
    model-training: /logs/training.log
    model-evaluation: /logs/evaluation.log
    model-deployment: /logs/deployment.log
# Monitoring
monitoring:
  enabled: true
  metrics_export:
    type: prometheus
    endpoint: http://prometheus:9090
  key_metrics:
    - skill_execution_duration
    - subskill_success_rate
    - resource_utilization
    - error_rate
# Upgrade policy
upgrades:
  strategy: rolling  # rolling or blue-green
  max_unavailable_percent: 10  # at most this percentage of SubSkills may be unavailable
  canary_percentage: 5  # roll out to 5% first as a canary
```
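The `dependencies` entries form a small graph: model-evaluation depends on model-training, and model-deployment depends on model-evaluation. One way to derive a valid execution order from such a graph is a topological sort. The sketch below uses the standard library's `graphlib`, which is available from Python 3.9 and so fits the `python_version` requirement. The `execution_order` function and its plain-dict input are assumptions for illustration, not OpenClaw's own resolver.

```python
from graphlib import CycleError, TopologicalSorter
from typing import Any, Dict, List


def execution_order(subskills: List[Dict[str, Any]]) -> List[str]:
    """Return SubSkill ids ordered so that each one comes after its dependencies."""
    # Map every SubSkill id to the set of ids it depends on
    graph = {s["id"]: set(s.get("dependencies", [])) for s in subskills}
    unknown = {dep for deps in graph.values() for dep in deps} - set(graph)
    if unknown:
        raise ValueError(f"unknown dependencies: {sorted(unknown)}")
    try:
        # static_order() yields predecessors before the nodes that need them
        return list(TopologicalSorter(graph).static_order())
    except CycleError as exc:
        raise ValueError(f"dependency cycle: {exc.args[1]}") from exc


subskills = [
    {"id": "model-deployment", "dependencies": ["model-evaluation"]},
    {"id": "model-training", "dependencies": []},
    {"id": "model-evaluation", "dependencies": ["model-training"]},
]
print(execution_order(subskills))
```

Rejecting unknown ids and cycles up front turns a misconfigured `config.yaml` into a clear error before any SubSkill runs.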
## 🛠️ 3. SubSkill Executor (handler.py)
**File**: `skills/ai-engineer/subskills/model-training/handler.py`
```python
"""
model-training SubSkill executor

Responsibilities:
1. Receive the input parameters from the Agent
2. Run the training script
3. Track the experiment run in MLflow
4. Return the result, or capture unrecoverable errors
"""
import json
import os
import subprocess
import sys
import traceback
from pathlib import Path
from typing import Any, Dict

# Absolute imports: assumes the Skill root (skills/ai-engineer) is on sys.path,
# as the handler path `subskills.model_training.handler:execute` implies.
from utils.config_loader import load_config
from utils.logger import get_logger
from utils.validators import validate_inputs

logger = get_logger(__name__)

# Single default so the script file name and the MLflow experiment stay in sync
DEFAULT_EXPERIMENT_NAME = 'exp-default'


class TrainingExecutor:
    """Model training executor."""

    def __init__(self, config: Dict[str, Any]):
        self.config = config
        self.subskill_config = load_config('subskills/model-training/config.yaml')
        self.work_dir = Path(os.getenv('WORK_DIR', '/workspace'))
        self.models_dir = Path(os.getenv('MODEL_ROOT', '/models'))
        self.models_dir.mkdir(parents=True, exist_ok=True)

    def validate_inputs(self, inputs: Dict[str, Any]) -> bool:
        """
        Validate the input parameters.
        Validation is driven by the `inputs` definition in config.yaml.
        """
        try:
            # Validate against the schema from the config
            validate_inputs(
                inputs=inputs,
                schema=self.subskill_config['inputs']
            )
            logger.info("Input validation succeeded")
            return True
        except ValueError as e:
            logger.error(f"Input validation failed: {e}")
            raise

    def prepare_training_script(self, inputs: Dict[str, Any]) -> str:
        """
        Prepare the training script for the SubSkill.
        Input parameters → training script with the parameters filled in.
        """
        script_template = self.work_dir / 'subskills/model-training/scripts/train.py'
        experiment_name = inputs.get('experiment_name', DEFAULT_EXPERIMENT_NAME)
        # Substitute the placeholders in the template.
        # Values are pasted into Python source, so they must be validated first.
        script_content = script_template.read_text()
        replacements = {
            '{{dataset}}': inputs['dataset'],
            '{{model_name}}': inputs['model_name'],
            '{{batch_size}}': str(inputs.get('batch_size', 32)),
            '{{epochs}}': str(inputs.get('epochs', 100)),
            '{{learning_rate}}': str(inputs.get('learning_rate', 0.001)),
            '{{experiment_name}}': experiment_name,
            '{{models_dir}}': str(self.models_dir),
        }
        for placeholder, value in replacements.items():
            script_content = script_content.replace(placeholder, value)
        # Save the rendered script
        script_path = self.work_dir / f".cache/train_{experiment_name}.py"
        script_path.parent.mkdir(parents=True, exist_ok=True)
        script_path.write_text(script_content)
        logger.info(f"Training script prepared: {script_path}")
        return str(script_path)

    def execute_training(self, script_path: str, inputs: Dict[str, Any]) -> Dict[str, Any]:
        """Run the training script."""
        timeout = self.subskill_config['execution']['timeout']
        try:
            # Set up the environment variables
            env = os.environ.copy()
            env.update({
                'PYTHONUNBUFFERED': '1',
                'MLFLOW_EXPERIMENT_NAME': inputs.get('experiment_name', DEFAULT_EXPERIMENT_NAME),
                'MLFLOW_TRACKING_URI': os.getenv('MLFLOW_TRACKING_URI', 'http://localhost:5000'),
            })
            # Run the script with the same interpreter that runs this handler
            result = subprocess.run(
                [sys.executable, script_path],
                capture_output=True,
                text=True,
                timeout=timeout,
                env=env,
                cwd=self.work_dir
            )
            if result.returncode != 0:
                logger.error(f"Script failed\nStdout: {result.stdout}\nStderr: {result.stderr}")
                raise RuntimeError(f"Training script exit code: {result.returncode}")
            logger.info(f"Script succeeded\nOutput: {result.stdout}")
            # Parse the result from the script output (the last line is JSON)
            output_lines = result.stdout.strip().split('\n')
            return json.loads(output_lines[-1])
        except subprocess.TimeoutExpired:
            logger.error("Script timed out")
            raise TimeoutError(f"Training timed out ({timeout}s)")
        except Exception as e:
            logger.error(f"Script raised an exception: {e}\n{traceback.format_exc()}")
            raise

    def post_process(self, result: Dict[str, Any]) -> Dict[str, Any]:
        """Post-processing: validate the result before returning it."""
        # Check that the output model file was actually created
        if 'model_path' in result:
            model_file = Path(result['model_path'])
            if not model_file.exists():
                raise FileNotFoundError(f"Model file does not exist: {result['model_path']}")
            logger.info(f"Model file size: {model_file.stat().st_size / 1024 / 1024:.2f}MB")
        return result


def execute(inputs: Dict[str, Any]) -> Dict[str, Any]:
    """
    Main entry point of the SubSkill.

    How OpenClaw calls it:
        result = execute({
            'dataset': 'cifar10',
            'model_name': 'resnet50',
            'batch_size': 32,
            'epochs': 100,
            'learning_rate': 0.001,
            'experiment_name': 'exp-v2.1'
        })

    Args:
        inputs: input parameters from the Agent

    Returns:
        {
            'status': 'success' | 'failed',
            'model_path': '/models/resnet50_v2.1.pt',
            'best_accuracy': 0.956,
            'training_time_seconds': 3600,
            'mlflow_run_id': 'abc123',
            'error': 'error message (only on failure)'
        }
    """
    try:
        # Created inside the try block so that configuration errors
        # are also reported as a 'failed' result
        executor = TrainingExecutor(config={})
        # 1. Validate the inputs
        executor.validate_inputs(inputs)
        # 2. Prepare the script
        script_path = executor.prepare_training_script(inputs)
        # 3. Run the script
        result = executor.execute_training(script_path, inputs)
        # 4. Post-process
        result = executor.post_process(result)
        # 5. Add the execution status
        result['status'] = 'success'
        return result
    except Exception as e:
        logger.error(f"SubSkill execution failed: {e}")
        return {
            'status': 'failed',
            'error': str(e),
            'error_type': type(e).__name__,
            'traceback': traceback.format_exc()
        }
```
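`handler.py` imports `validate_inputs` from `utils/validators.py`, which this document does not show. The following is one possible shape for it, assuming the schema is the `inputs` list from `config.yaml` with the keys `name`, `type`, `required`, `default`, `enum`, `min`, and `max`. It is a sketch under those assumptions rather than the actual validator. Rejecting unknown parameters and returning a copy with defaults applied are design choices made here.

```python
from typing import Any, Dict, List

# Assumed mapping from schema type names to Python types
_TYPES = {
    "string": str,
    "integer": int,
    "float": (int, float),
    "boolean": bool,
    "array": list,
    "object": dict,
}


def validate_inputs(inputs: Dict[str, Any], schema: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Check inputs against schema; return a copy with defaults applied, or raise ValueError."""
    extra = set(inputs) - {field["name"] for field in schema}
    if extra:
        raise ValueError(f"unexpected parameters: {sorted(extra)}")
    validated = dict(inputs)
    for field in schema:
        name = field["name"]
        if name not in validated:
            if field.get("required", False):
                raise ValueError(f"missing required parameter: {name}")
            if "default" in field:
                validated[name] = field["default"]
            continue
        value = validated[name]
        # bool is a subclass of int, so reject it explicitly for non-boolean fields
        is_stray_bool = isinstance(value, bool) and field["type"] != "boolean"
        if is_stray_bool or not isinstance(value, _TYPES[field["type"]]):
            raise ValueError(f"{name}: expected {field['type']}, got {type(value).__name__}")
        if "enum" in field and value not in field["enum"]:
            raise ValueError(f"{name}: {value!r} is not one of {field['enum']}")
        if "min" in field and value < field["min"]:
            raise ValueError(f"{name}: {value} is below the minimum {field['min']}")
        if "max" in field and value > field["max"]:
            raise ValueError(f"{name}: {value} is above the maximum {field['max']}")
    return validated
```

Because every failure is a `ValueError`, this sketch is compatible with the `except ValueError` clause in `TrainingExecutor.validate_inputs`.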
## 📝 4. Training Script (train.py)
**File**: `skills/ai-engineer/subskills/model-training/scripts/train.py`
```python
"""
Model training script.

handler.py substitutes the input parameters, so the {{placeholder}} syntax can be used here.
"""
import json
import logging
from datetime import datetime
from pathlib import Path

import mlflow
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader
from torchvision import datasets, models, transforms

# ===== Parameters (handler.py replaces these) =====
DATASET = '{{dataset}}'
MODEL_NAME = '{{model_name}}'
BATCH_SIZE = {{batch_size}}
EPOCHS = {{epochs}}
LEARNING_RATE = {{learning_rate}}
EXPERIMENT_NAME = '{{experiment_name}}'
MODELS_DIR = '{{models_dir}}'

# ===== Configuration =====
DEVICE = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
logger = logging.getLogger(__name__)
logging.basicConfig(level=logging.INFO)


class ModelTrainer:
    """Model trainer."""

    def __init__(self):
        self.device = DEVICE
        self.best_accuracy = 0.0
        self.model_dir = Path(MODELS_DIR)
        self.model_dir.mkdir(parents=True, exist_ok=True)
        # Configure MLflow
        mlflow.set_experiment(EXPERIMENT_NAME)

    def load_data(self):
        """Load the dataset."""
        transform = transforms.Compose([
            transforms.ToTensor(),
            transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
        ])
        if DATASET == 'cifar10':
            self.train_dataset = datasets.CIFAR10(
                root='./data', train=True, download=True, transform=transform
            )
            self.test_dataset = datasets.CIFAR10(
                root='./data', train=False, download=True, transform=transform
            )
        else:
            raise ValueError(f"Unsupported dataset: {DATASET}")
        self.num_classes = len(self.train_dataset.classes)
        self.train_loader = DataLoader(
            self.train_dataset, batch_size=BATCH_SIZE, shuffle=True, num_workers=4
        )
        self.test_loader = DataLoader(
            self.test_dataset, batch_size=BATCH_SIZE, shuffle=False, num_workers=4
        )
        logger.info(f"Dataset loaded: {len(self.train_dataset)} / {len(self.test_dataset)}")

    def build_model(self):
        """Build the model, loss function, and optimizer."""
        if MODEL_NAME == 'resnet50':
            # `weights=` replaces the deprecated `pretrained=True` (torchvision >= 0.13)
            self.model = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
            # The ImageNet head has 1000 outputs; resize it to the dataset's class count
            self.model.fc = nn.Linear(self.model.fc.in_features, self.num_classes)
        else:
            raise ValueError(f"Unsupported model: {MODEL_NAME}")
        self.model.to(self.device)
        self.criterion = nn.CrossEntropyLoss()
        # Create the optimizer once so that Adam's state persists across epochs
        self.optimizer = optim.Adam(self.model.parameters(), lr=LEARNING_RATE)
        logger.info(f"Model ready: {MODEL_NAME}")

    def train_epoch(self, epoch):
        """Train for one epoch and return the average loss."""
        self.model.train()
        total_loss = 0.0
        for batch_idx, (data, target) in enumerate(self.train_loader):
            data, target = data.to(self.device), target.to(self.device)
            self.optimizer.zero_grad()
            output = self.model(data)
            loss = self.criterion(output, target)
            loss.backward()
            self.optimizer.step()
            total_loss += loss.item()
            if batch_idx % 100 == 0:
                logger.info(f"Epoch {epoch}, Batch {batch_idx}, Loss: {loss.item():.4f}")
        avg_loss = total_loss / len(self.train_loader)
        mlflow.log_metric("train_loss", avg_loss, step=epoch)
        return avg_loss

    def evaluate(self, epoch):
        """Evaluate the model and return accuracy as a fraction in [0, 1]."""
        self.model.eval()
        correct = 0
        total = 0
        with torch.no_grad():
            for data, target in self.test_loader:
                data, target = data.to(self.device), target.to(self.device)
                predicted = self.model(data).argmax(dim=1)
                total += target.size(0)
                correct += (predicted == target).sum().item()
        # A fraction matches the `best_accuracy: 0.956` style used in SKILL.md
        accuracy = correct / total
        mlflow.log_metric("test_accuracy", accuracy, step=epoch)
        logger.info(f"Epoch {epoch}, Accuracy: {accuracy:.2%}")
        return accuracy

    def save_model(self, model_name):
        """Save the model weights."""
        model_path = self.model_dir / model_name
        torch.save(self.model.state_dict(), model_path)
        logger.info(f"Model saved: {model_path}")
        return str(model_path)

    def train(self):
        """Full training workflow."""
        with mlflow.start_run(run_name=EXPERIMENT_NAME):
            # Log the hyperparameters
            mlflow.log_params({
                'dataset': DATASET,
                'model': MODEL_NAME,
                'batch_size': BATCH_SIZE,
                'epochs': EPOCHS,
                'learning_rate': LEARNING_RATE,
            })
            self.load_data()
            self.build_model()
            start_time = datetime.now()
            # Defined up front so the result is well-formed even if no epoch improves
            model_path = None
            train_loss = None
            for epoch in range(1, EPOCHS + 1):
                train_loss = self.train_epoch(epoch)
                accuracy = self.evaluate(epoch)
                # Save the best model so far
                if accuracy > self.best_accuracy:
                    self.best_accuracy = accuracy
                    model_filename = f"{MODEL_NAME}_{DATASET}_{EXPERIMENT_NAME}.pt"
                    model_path = self.save_model(model_filename)
            training_time = (datetime.now() - start_time).total_seconds()
            # Log the final results
            mlflow.log_metric("best_accuracy", self.best_accuracy)
            mlflow.log_metric("training_time_seconds", training_time)
            run_id = mlflow.active_run().info.run_id
            return {
                'model_path': model_path,
                'best_accuracy': self.best_accuracy,
                'training_time_seconds': training_time,
                'mlflow_run_id': run_id,
                'final_loss': train_loss,
            }


if __name__ == '__main__':
    trainer = ModelTrainer()
    result = trainer.train()
    # Print the result as JSON on the last line;
    # handler.py parses this line
    print(json.dumps(result))
```
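The contract between `train.py` and `handler.py` is that the last line of stdout is a JSON object. If anything else is printed after it, a strict last-line parse fails. A more tolerant variant, sketched here under the hypothetical name `parse_result_line`, scans from the end of the output for the first line that decodes to a JSON object. It assumes the result is printed on a single line, as `json.dumps` does by default.

```python
import json
from typing import Any, Dict


def parse_result_line(stdout: str) -> Dict[str, Any]:
    """Return the last line of stdout that parses as a JSON object."""
    for line in reversed(stdout.splitlines()):
        line = line.strip()
        if not line.startswith("{"):
            continue  # cheap filter: a JSON object always starts with "{"
        try:
            parsed = json.loads(line)
        except json.JSONDecodeError:
            continue  # looked like JSON but was not; keep scanning upwards
        if isinstance(parsed, dict):
            return parsed
    raise ValueError("no JSON object found in script output")
```

Raising `ValueError` when nothing matches keeps the failure explicit instead of returning an empty result.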
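## 🏃 Usage Example
### Scenario: Agent Invocation
Mode: session mode (the SubSkills are called one after another, each using the previous result)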
```python
# The Agent runs this workflow automatically
from openclaw.agent import Agent
from openclaw.skill import call_skill

agent = Agent.from_workspace('/workspace/agents/ai-engineer')

# Step 1: run the model-training SubSkill
training_result = call_skill(
    skill_id='ai-engineer',
    subskill_id='model-training',
    inputs={
        'dataset': 'cifar10',
        'model_name': 'resnet50',
        'batch_size': 32,
        'epochs': 100,
        'learning_rate': 0.001,
        'experiment_name': 'exp-v2.1'
    },
    agent=agent
)
if training_result['status'] != 'success':
    print(f"Training failed: {training_result['error']}")
    raise SystemExit(1)

# Step 2: run the model-evaluation SubSkill
evaluation_result = call_skill(
    skill_id='ai-engineer',
    subskill_id='model-evaluation',
    inputs={
        'model_path': training_result['model_path'],
        'test_dataset': 'cifar10',
        'metrics': ['accuracy', 'f1', 'confusion_matrix'],
        'visualization': True
    },
    agent=agent
)
if evaluation_result['status'] != 'success':
    print(f"Evaluation failed: {evaluation_result['error']}")
    raise SystemExit(1)

# Step 3: run the model-deployment SubSkill
deployment_result = call_skill(
    skill_id='ai-engineer',
    subskill_id='model-deployment',
    inputs={
        'model_path': training_result['model_path'],
        'quantization_type': 'int8',
        'deployment_target': 'kubernetes',
        'namespace': 'production',
        'replicas': 3
    },
    agent=agent
)
if deployment_result['status'] == 'success':
    print(f"Model deployed: {deployment_result['service_endpoint']}")
```
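The three calls share one pattern: run a step, stop on failure, and feed part of the result into the next step. That pattern can be factored out. In the sketch below, `run_pipeline` and the `Step` tuple are hypothetical, and `call` stands in for a function such as `call_skill` with `skill_id` and `agent` already bound. A stub replaces the real SubSkills so the example runs on its own.

```python
from typing import Any, Callable, Dict, List, Tuple

Result = Dict[str, Any]
# (subskill_id, function that builds this step's inputs from all earlier results)
Step = Tuple[str, Callable[[Dict[str, Result]], Dict[str, Any]]]


def run_pipeline(call: Callable[[str, Dict[str, Any]], Result],
                 steps: List[Step]) -> Dict[str, Result]:
    """Run steps in order and stop after the first one that does not succeed."""
    results: Dict[str, Result] = {}
    for subskill_id, build_inputs in steps:
        result = call(subskill_id, build_inputs(results))
        results[subskill_id] = result
        if result.get("status") != "success":
            break
    return results


def fake_call(subskill_id: str, inputs: Dict[str, Any]) -> Result:
    """Stub standing in for call_skill; evaluation fails to show the early stop."""
    if subskill_id == "model-evaluation":
        return {"status": "failed", "error": "metrics unavailable"}
    return {"status": "success", "model_path": "/models/demo.pt"}


steps: List[Step] = [
    ("model-training", lambda r: {"dataset": "cifar10", "model_name": "resnet50"}),
    ("model-evaluation", lambda r: {"model_path": r["model-training"]["model_path"],
                                    "test_dataset": "cifar10"}),
    ("model-deployment", lambda r: {"model_path": r["model-training"]["model_path"],
                                    "deployment_target": "kubernetes"}),
]
results = run_pipeline(fake_call, steps)
print(list(results))
```

Deployment never runs in this example because evaluation reports a failure, which mirrors the manifest's rule that unevaluated models do not go to production.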
## ⚙️ Operations
### Installation and Testing
```bash
# Inspect the Skill manifest
cat skills/ai-engineer/SKILL.md
# Validate the Skill configuration
yaml-lint skills/ai-engineer/config.yaml
# Run the tests
pytest skills/ai-engineer/tests/ -v
# Run a single SubSkill (from the Skill root, so that `subskills` is importable)
cd skills/ai-engineer
python -c "
from subskills.model_training.handler import execute
result = execute({
    'dataset': 'cifar10',
    'model_name': 'resnet50',
    'batch_size': 32,
    'epochs': 1,
    'experiment_name': 'test'
})
print(result)
"
```
### Log Collection and Debugging
```bash
# Follow the SubSkill logs in real time
kubectl logs -f deployment/openclaw-agent-ai -n openclaw | grep 'model-training'
# Browse the MLflow experiments recorded by the SubSkill
mlflow ui --host 0.0.0.0 --port 5000
# Check resource usage
kubectl top pods -n openclaw
```
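## ⚠️ Common Errors
| Error | Cause | Solution |
|--------|------|----------|
| FileNotFoundError: model not found | The model file does not exist | Check that the training script actually saved the model |
| OOM killed | Insufficient GPU memory | Reduce batch_size or use mixed precision |
| Handler timeout | Training timed out | Increase the timeout in config.yaml |
| Validation failed | Input parameters do not match the schema | Check the input types and value ranges |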
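
For the out-of-memory row, reducing `batch_size` can be automated when the framework raises a catchable exception (a process killed by the operating system cannot recover this way). The sketch below halves the batch size and retries until training fits or a floor is reached. The names `train_with_backoff` and `train_fn` are hypothetical. The exception type is a parameter because it depends on the framework: with recent PyTorch versions it would typically be `torch.cuda.OutOfMemoryError`, and `MemoryError` is the default here only so the example runs without a GPU.

```python
from typing import Callable, Type, TypeVar

T = TypeVar("T")


def train_with_backoff(
    train_fn: Callable[[int], T],
    batch_size: int,
    min_batch_size: int = 1,
    oom_error: Type[BaseException] = MemoryError,  # assumption: framework-specific
) -> T:
    """Call train_fn(batch_size), halving batch_size after each out-of-memory error."""
    while True:
        try:
            return train_fn(batch_size)
        except oom_error:
            if batch_size // 2 < min_batch_size:
                raise  # cannot shrink further; surface the original error
            batch_size //= 2


def fake_train(batch_size: int) -> int:
    """Stub that only 'fits in memory' with a batch size of 8 or less."""
    if batch_size > 8:
        raise MemoryError(f"batch_size={batch_size} does not fit")
    return batch_size


print(train_with_backoff(fake_train, batch_size=32))
```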
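## 📌 Key Takeaways
- SKILL.md - the Skill's charter; defines the SubSkills, their inputs and outputs, and the execution protocol
- config.yaml - global configuration; defines each SubSkill's resource requirements, timeout, and dependencies
- handler.py - the SubSkill's main executor; handles input validation, script preparation, and script execution
- scripts/xxx.py - the actual business scripts; parameters are filled in through {{placeholder}} substitution
- dependencies - global dependencies and SubSkill-specific dependencies are declared separately
- Inputs and outputs - follow the schema strictly; type, range, and enum checks are supported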
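Done! ✅ This document walks through a complete example, from Skill definition through SubSkills to script execution, and can serve as a starting point for production use.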