derLogik/vLLM_Qwen2.5-3B-Instruct_Experiment


vLLM Qwen2.5-3B-Instruct Inference Pipeline Experiment Report

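1. Experiment Overview

| Item | Details |
| --- | --- |
| Experiment | vLLM Qwen2.5-3B-Instruct inference performance test |
| Date | 2026-04-26 |
| Model | Qwen2.5-3B-Instruct |
| Framework | vLLM 0.19.1 |
| Hardware | NVIDIA RTX 3070 Laptop (8 GB) |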

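2. Environment

2.1 Hardware

| Component | Specification |
| --- | --- |
| GPU | NVIDIA GeForce RTX 3070 Laptop |
| VRAM | 8192 MB |
| CUDA Version | 13.2 |
| NVIDIA Driver | 595.58.04 |

2.2 Software

| Component | Version |
| --- | --- |
| vLLM | 0.19.1 |
| Docker | 29.4.0 |
| Container image | vllm/vllm-openai:latest |
| Model format | HuggingFace safetensors |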

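2.3 Model Information

| Item | Value |
| --- | --- |
| Model path (host) | c:/Users/lians/qwen_model/Qwen2.5-3B-Instruct/ |
| Path inside the container | /model/Qwen2___5-3B-Instruct/ |
| Architecture | Qwen2ForCausalLM |
| Parameters | 3B |
| Precision | bfloat16 |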

3. vLLM Server Launch Command

docker run --gpus all \
  -v 'c:\Users\lians\qwen_model:/model' \
  -p 8000:8000 \
  -e HF_HOME=/model \
  --shm-size=8g \
  vllm/vllm-openai:latest \
  /model/Qwen2___5-3B-Instruct \
  --host 0.0.0.0 \
  --port 8000 \
  --tensor-parallel-size 1 \
  --max-model-len 4096 \
  --gpu-memory-utilization 0.85

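Launch parameters

| Parameter | Description |
| --- | --- |
| --gpus all | Enable GPU support (expose all GPUs to the container) |
| --shm-size=8g | Shared memory size for the container |
| --tensor-parallel-size 1 | Tensor-parallel degree (1 = single GPU) |
| --max-model-len 4096 | Maximum context length per request (prompt plus generated tokens) |
| --gpu-memory-utilization 0.85 | Fraction of GPU memory vLLM is allowed to use |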

Key lines from the startup log

version 0.19.1
model   /model/Qwen2___5-3B-Instruct
Using FLASH_ATTN attention backend out of potential backends
Loading safetensors checkpoint shards: 100% [2/2]
Model loading took 5.79 GiB memory and 82.70 seconds
GPU KV cache size: 10,064 tokens
Maximum concurrency for 4,096 tokens per request: 2.46x
Starting vLLM server on http://0.0.0.0:8000
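
The "Maximum concurrency" line appears to be the KV cache capacity divided by the maximum context length. A minimal check of that arithmetic, using only the two numbers from the log above (the variable names are illustrative):

```python
# Values copied from the startup log.
kv_cache_tokens = 10_064   # "GPU KV cache size"
max_model_len = 4_096      # --max-model-len

# Number of full-length (4,096-token) requests whose KV cache fits at once.
max_concurrency = kv_cache_tokens / max_model_len
print(f"{max_concurrency:.2f}x")
```

In practice most requests use far fewer than 4,096 tokens, so more than two requests can usually share the cache.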

4. Inference Pipeline Code

"""
LLM Inference Pipeline with Monitoring
- vLLM OpenAI API client
- GPU/CPU monitoring (GPU via nvidia-smi, CPU via psutil)
- Latency/throughput statistics
- Result persistence (JSON)
"""

import time
import json
import psutil
import subprocess
import threading
from datetime import datetime
from pathlib import Path
from typing import List, Dict, Any, Optional

try:
    from openai import OpenAI
except ImportError:
    raise ImportError("openai package required: pip install openai")


# ============ Monitoring ============

class GPUMonitor:
    """GPU monitoring (queries nvidia-smi)."""

    def get_stats(self) -> Optional[Dict]:
        try:
            result = subprocess.run(
                ['nvidia-smi', '--query-gpu=memory.used,memory.total,utilization.gpu,temperature.gpu',
                 '--format=csv,noheader,nounits'],
                capture_output=True, text=True, timeout=5
            )
            if result.returncode == 0:
                parts = result.stdout.strip().split(',')
                return {
                    'memory_used_mb': float(parts[0]),
                    'memory_total_mb': float(parts[1]),
                    'utilization_pct': float(parts[2]),
                    'temperature_c': float(parts[3])
                }
        except Exception:
            pass
        return None


class SystemMonitor:
    """System resource monitoring (samples on a background thread)."""

    def __init__(self, interval: float = 0.5):
        self.interval = interval
        self._running = False
        self._thread = None
        self._samples = []

    def _collect(self):
        while self._running:
            sample = {
                'timestamp': time.time(),
                'cpu_percent': psutil.cpu_percent(),
                'memory_percent': psutil.virtual_memory().percent,
                'memory_used_gb': psutil.virtual_memory().used / (1024 ** 3)
            }
            gpu = GPUMonitor().get_stats()
            if gpu:
                sample['gpu'] = gpu
            self._samples.append(sample)
            time.sleep(self.interval)

    def start(self):
        self._running = True
        self._samples = []
        self._thread = threading.Thread(target=self._collect, daemon=True)
        self._thread.start()

    def stop(self) -> List[Dict]:
        self._running = False
        if self._thread:
            self._thread.join(timeout=2)
        return self._samples


# ============ vLLM client ============

class VLLMClient:
    def __init__(self, base_url: str = "http://localhost:8000/v1"):
        self.client = OpenAI(base_url=base_url, api_key="EMPTY")

    def chat(self, prompt: str, system: Optional[str] = None,
             max_tokens: int = 512, temperature: float = 0.7) -> Dict[str, Any]:
        messages = []
        if system:
            messages.append({"role": "system", "content": system})
        messages.append({"role": "user", "content": prompt})

        monitor = SystemMonitor(interval=0.5)
        monitor.start()
        # perf_counter is monotonic, so system clock adjustments cannot skew the latency
        start = time.perf_counter()
        try:
            response = self.client.chat.completions.create(
                # Must match the model name the vLLM server was launched with
                model="/model/Qwen2___5-3B-Instruct",
                messages=messages,
                max_tokens=max_tokens,
                temperature=temperature
            )
            latency = time.perf_counter() - start
        finally:
            # Always stop the sampler thread, even if the request fails
            samples = monitor.stop()

        usage = response.usage
        return {
            'prompt': prompt,
            'response': response.choices[0].message.content,
            'latency_sec': latency,
            'input_tokens': usage.prompt_tokens,
            'output_tokens': usage.completion_tokens,
            'total_tokens': usage.total_tokens,
            # End-to-end throughput: output tokens over total request latency
            'throughput_tps': usage.completion_tokens / latency if latency > 0 else 0,
            'system_samples': samples
        }

    def batch_inference(self, prompts: List[str], system: Optional[str] = None, **kwargs) -> List[Dict]:
        results = []
        for i, prompt in enumerate(prompts):
            result = self.chat(prompt, system, **kwargs)
            result['batch_index'] = i
            results.append(result)
            print(f"[{i+1}/{len(prompts)}] {result['latency_sec']:.2f}s, "
                  f"{result['output_tokens']} tokens, {result['throughput_tps']:.1f} tok/s")
        return results


# ============ Pipeline ============

class InferencePipeline:
    def __init__(self, model_path: str, port: int = 8000):
        self.model_path = model_path
        self.port = port

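    def run(self, prompts: List[str], output_file: str,
            system: Optional[str] = None, **kwargs) -> List[Dict]:
        # Point the client at the configured port instead of the default URL
        client = VLLMClient(base_url=f"http://localhost:{self.port}/v1")
        results = client.batch_inference(prompts, system, **kwargs)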

        stats = self._aggregate_stats(results)
        output = {
            'metadata': {
                'model': self.model_path,
                'timestamp': datetime.now().isoformat(),
                'num_prompts': len(prompts),
                'aggregate_stats': stats
            },
            'results': results
        }

        Path(output_file).parent.mkdir(parents=True, exist_ok=True)
        with open(output_file, 'w', encoding='utf-8') as f:
            json.dump(output, f, ensure_ascii=False, indent=2)

        print(f"\nResults saved to {output_file}")
        print(f"Avg latency: {stats['avg_latency_sec']:.2f}s")
        print(f"Avg throughput: {stats['avg_throughput_tps']:.1f} tok/s")
        return results

    def _aggregate_stats(self, results: List[Dict]) -> Dict:
        latencies = [r['latency_sec'] for r in results]
        throughputs = [r['throughput_tps'] for r in results]
        output_tokens = [r['output_tokens'] for r in results]
        return {
            'avg_latency_sec': sum(latencies) / len(latencies),
            'min_latency_sec': min(latencies),
            'max_latency_sec': max(latencies),
            'avg_throughput_tps': sum(throughputs) / len(throughputs),
            'total_output_tokens': sum(output_tokens),
            'total_time_sec': sum(latencies)
        }


# ============ Main ============

if __name__ == "__main__":
    test_prompts = [
        "What is the capital of France?",
        "Explain quantum computing in simple terms.",
        "Write a short poem about artificial intelligence.",
        "What are the benefits of exercise?",
        "How does photosynthesis work?"
    ]

    pipeline = InferencePipeline(
        model_path="c:/Users/lians/qwen_model/Qwen2.5-3B-Instruct",
        port=8000
    )

    results = pipeline.run(
        prompts=test_prompts,
        output_file="c:/Users/lians/inference_pipeline/results.json",
        system="You are a helpful AI assistant."
    )

5. Test Prompts

| # | Prompt |
| --- | --- |
| 1 | What is the capital of France? |
| 2 | Explain quantum computing in simple terms. |
| 3 | Write a short poem about artificial intelligence. |
| 4 | What are the benefits of exercise? |
| 5 | How does photosynthesis work? |

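6. Results

6.1 Aggregate Statistics

| Metric | Value |
| --- | --- |
| Test prompts | 5 |
| Total output tokens | 1,277 |
| Total time | 23.30 s |
| Mean latency | 4.66 s |
| Min latency | 1.51 s |
| Max latency | 8.79 s |
| Mean throughput (mean of per-request values) | 47.4 tok/s |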
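
The 47.4 tok/s figure is the mean of the five per-request throughputs, so the very short first response pulls it down. Dividing total output tokens by total time gives a different, higher number. A small sketch of both calculations, using the rounded per-request values from the table in 6.2 (because the inputs are rounded, the first result can differ from the reported 47.4 in the last digit):

```python
# Per-request figures copied from the per-request table (6.2).
latencies = [1.51, 3.27, 1.62, 8.10, 8.79]   # seconds
output_tokens = [8, 189, 93, 475, 512]

# Mean of ratios: what the pipeline's _aggregate_stats reports.
per_request_tps = [tok / sec for tok, sec in zip(output_tokens, latencies)]
mean_of_ratios = sum(per_request_tps) / len(per_request_tps)

# Ratio of totals: all output tokens over all time spent.
overall_tps = sum(output_tokens) / sum(latencies)

print(f"mean of per-request throughput: {mean_of_ratios:.1f} tok/s")
print(f"total tokens / total time:      {overall_tps:.1f} tok/s")
```

Which one to quote depends on the question: the ratio of totals describes the run as a whole, while the mean of ratios weights every request equally regardless of length.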

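6.2 Per-Request Details

| # | Prompt | Latency (s) | Input tokens | Output tokens | Throughput (tok/s) |
| --- | --- | --- | --- | --- | --- |
| 1 | What is the capital of France? | 1.51 | 27 | 8 | 5.3 |
| 2 | Explain quantum computing... | 3.27 | 28 | 189 | 57.8 |
| 3 | Write a short poem... | 1.62 | 28 | 93 | 57.3 |
| 4 | What are the benefits of exercise? | 8.10 | 27 | 475 | 58.7 |
| 5 | How does photosynthesis work? | 8.79 | 26 | 512 | 58.2 |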

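6.3 GPU Monitoring Data

| Metric | Value |
| --- | --- |
| GPU memory in use | 7813 MB / 8192 MB |
| GPU memory usage | 95% |
| GPU compute utilization | 98-100% |
| GPU temperature range | 51°C - 71°C |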

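6.4 System Resource Monitoring

During each inference request, CPU and GPU metrics are sampled continuously at 0.5-second intervals.

Example: monitoring data for Prompt #5 (photosynthesis):

| Timestamp | CPU (%) | Memory used (GB) | GPU memory (MB) | GPU utilization (%) | GPU temperature (°C) |
| --- | --- | --- | --- | --- | --- |
| 1777174300.39 | 0.0 | 13.11 | 7813 | 0 | 64 |
| 1777174300.96 | 37.7 | 13.08 | 7813 | 99 | 67 |
| 1777174301.52 | 24.1 | 13.12 | 7813 | 99 | 67 |
| 1777174302.07 | 29.0 | 13.13 | 7813 | 99 | 67 |
| ... | ... | ... | ... | ... | ... |
| 1777174308.88 | 21.4 | 13.14 | 7813 | 98 | 71 |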
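
Each result in results.json carries its samples under `system_samples`, in the dictionary shape built by `SystemMonitor._collect`. A minimal sketch of reducing such a list to a few summary numbers; `summarize_gpu` is a hypothetical helper, not part of the pipeline, and the input below contains only the five rows shown above (the elided rows are not included, so the mean covers those five rows only):

```python
from typing import Dict, List


def summarize_gpu(samples: List[Dict]) -> Dict[str, float]:
    """Summarise the 'gpu' entries of a system_samples list."""
    gpu = [s["gpu"] for s in samples if "gpu" in s]  # GPU stats may be missing
    if not gpu:
        return {}
    return {
        "peak_memory_mb": max(g["memory_used_mb"] for g in gpu),
        "mean_utilization_pct": sum(g["utilization_pct"] for g in gpu) / len(gpu),
        "peak_temperature_c": max(g["temperature_c"] for g in gpu),
    }


# (timestamp, GPU memory MB, GPU utilization %, GPU temperature C)
rows = [
    (1777174300.39, 7813, 0, 64),
    (1777174300.96, 7813, 99, 67),
    (1777174301.52, 7813, 99, 67),
    (1777174302.07, 7813, 99, 67),
    (1777174308.88, 7813, 98, 71),
]
samples = [
    {"timestamp": ts,
     "gpu": {"memory_used_mb": mem, "utilization_pct": util, "temperature_c": temp}}
    for ts, mem, util, temp in rows
]
summary = summarize_gpu(samples)
print(summary)
```

The first sample is taken before the GPU starts working, which is why including it lowers the mean utilization noticeably on short requests.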

7. Sample Model Outputs

Prompt #1: "What is the capital of France?"

Output: "The capital of France is Paris."

Prompt #2: "Explain quantum computing in simple terms."

Output: "Quantum computing is a type of computing where information is processed using quantum-mechanical phenomena, such as superposition and entanglement.

To understand it simply, imagine regular computers use bits to process information, where each bit can be either a 0 or a 1. Quantum computers use quantum bits, or qubits, which can represent a 0, a 1, or both at the same time. This property, known as superposition, allows quantum computers to handle many possibilities simultaneously.

Additionally, qubits can become 'entangled,' meaning the state of one (whether it's a 0 or a 1) can depend on the state of another, even over long distances. This entanglement allows quantum computers to perform certain tasks much faster than traditional computers.

In summary, while classical computers process data in a binary system, quantum computers leverage the principles of quantum mechanics to potentially solve complex problems much more efficiently."

Prompt #3: "Write a short poem about artificial intelligence."

Output:

In circuits and code, it whispers,
A mind of silicon whispers,
From thoughts to tasks, it takes them,
Transforming data into gems.

It learns with each line of code,
And grows more than we could code,
No longer just a tool for us,
But a companion, a new muse.
Yet, as it grows in might,
Let's remember, we're the light,
For every AI, a creator,
And every creation, a maker.

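8. Performance Analysis

8.1 Throughput

  • First-request overhead: Prompt #1 has by far the lowest throughput, 8 tokens in 1.51 s (5.3 tok/s). With so few output tokens, fixed per-request cost dominates: prefill plus, for the first request, likely one-time warm-up such as CUDA graph compilation.
  • Steady-state throughput: about 57-58 tok/s (for outputs of 93-512 tokens)
  • Effect of output length: the longer the output, the more stable the throughput (approaching the GPU's limit)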
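
The pipeline times only the complete, non-streaming response, so first-token latency cannot be separated from generation time in these results. A minimal sketch of how it could be measured with a streamed request follows. `time_to_first_token` and `stream_pieces` are hypothetical helpers, not part of the pipeline; `stream_pieces` assumes the OpenAI Python client's `stream=True` mode, where each chunk exposes `choices[0].delta.content`.

```python
import time
from typing import Callable, Iterable, Iterator, Optional, Tuple


def time_to_first_token(
    pieces: Iterable[Optional[str]],
    clock: Callable[[], float] = time.perf_counter,
) -> Tuple[Optional[float], float, str]:
    """Consume streamed text pieces; return (ttft_sec, total_sec, full_text)."""
    start = clock()
    ttft = None
    parts = []
    for piece in pieces:
        if piece:                       # skip empty or None deltas
            if ttft is None:
                ttft = clock() - start  # first non-empty piece has arrived
            parts.append(piece)
    return ttft, clock() - start, "".join(parts)


def stream_pieces(client, model: str, messages: list,
                  max_tokens: int = 512) -> Iterator[Optional[str]]:
    """Lazily issue a streamed chat request and yield its text deltas.

    Being a generator, the request is only sent once iteration starts,
    so the timer in time_to_first_token covers the whole round trip.
    """
    stream = client.chat.completions.create(
        model=model, messages=messages, max_tokens=max_tokens, stream=True
    )
    for chunk in stream:
        if chunk.choices:
            yield chunk.choices[0].delta.content
```

Against the running server this would be called as `time_to_first_token(stream_pieces(client, "/model/Qwen2___5-3B-Instruct", messages))`, where `client` is the `OpenAI` instance the pipeline already creates.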

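8.2 Memory Usage

  • Model weights: 5.79 GiB
  • CUDA graphs: 0.39 GiB
  • KV cache: ~0.35 GiB (capacity of 10,064 tokens)
  • Total: ~6.53 GiB / 8 GiB (81%), within the 0.85 limit set by --gpu-memory-utilization. The 7813 MB reported by nvidia-smi in 6.3 is device-wide and also counts memory outside this breakdown.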
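
The KV cache figure can be cross-checked from the model's attention shape. The sketch below assumes Qwen2.5-3B uses 36 layers, 2 key/value heads, and a head dimension of 128; these values are not stated in this report and should be confirmed against the model's config.json. The weight and CUDA graph sizes are the ones listed above.

```python
GIB = 1024 ** 3

# Assumed attention shape for Qwen2.5-3B (verify in config.json).
num_layers = 36
num_kv_heads = 2
head_dim = 128
bytes_per_value = 2          # bfloat16

# Every token stores one key and one value vector per layer and KV head.
kv_bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
kv_cache_gib = kv_bytes_per_token * 10_064 / GIB   # 10,064-token capacity

# Weights + CUDA graphs + KV cache, as itemised above.
total_gib = 5.79 + 0.39 + kv_cache_gib

print(f"KV cache: {kv_bytes_per_token} bytes/token, {kv_cache_gib:.2f} GiB")
print(f"Total:    {total_gib:.2f} GiB ({total_gib / 8:.1%} of 8 GiB)")
```

Under those assumptions the result agrees with the ~0.35 GiB and ~6.53 GiB figures above, which suggests the logged token capacity and the memory breakdown are consistent with each other.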

8.3 Temperature

  • Idle: 51°C
  • Under full load: 71°C
  • Thermal throttling risk: low (RTX 3070 Laptop TDP 115 W; actual draw only 38-42 W)
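
The pipeline's nvidia-smi query covers memory, utilization, and temperature but not power, so power draw has to be sampled separately. A minimal sketch of how it could be read alongside temperature, assuming the `power.draw` query field is supported on this GPU (some laptop GPUs report it as unavailable, in which case the helper returns None). `parse_power_line` and `read_power` are hypothetical names, and the sample line in the usage note is illustrative rather than measured.

```python
import subprocess
from typing import Dict, Optional

QUERY = "power.draw,temperature.gpu"


def parse_power_line(line: str) -> Dict[str, float]:
    """Parse one 'power, temperature' CSV line from nvidia-smi."""
    power_w, temp_c = (float(field) for field in line.split(","))
    return {"power_w": power_w, "temperature_c": temp_c}


def read_power() -> Optional[Dict[str, float]]:
    """Query the first GPU; return None if nvidia-smi fails or reports N/A."""
    try:
        out = subprocess.run(
            ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
            capture_output=True, text=True, timeout=5, check=True,
        ).stdout
        return parse_power_line(out.strip().splitlines()[0])
    except (OSError, subprocess.SubprocessError, ValueError, IndexError):
        return None
```

For example, `parse_power_line("40.12, 71")` yields a power of 40.12 W and a temperature of 71.0°C. Adding the result of `read_power()` to each monitoring sample would let the power figures be stored in results.json with the other metrics.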

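9. Problems Encountered and Solutions

| Problem | Cause | Solution |
| --- | --- | --- |
| GPU unavailable in Docker | WSL2 environment not configured correctly | Install the NVIDIA driver for WSL2 |
| Path format error | Windows path is converted inside Docker | Use Qwen2___5-3B-Instruct (underscores in place of the dot) |
| Insufficient memory for the KV cache | Model plus CUDA graphs exceeded the available memory | Raise --gpu-memory-utilization to 0.85 |
| Model treated as a remote repository | transformers configuration issue | Set the HF_HOME=/model environment variable |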

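10. Conclusion

vLLM successfully served Qwen2.5-3B-Instruct on an RTX 3070 Laptop (8 GB), with the following results:

  • Inference throughput: 47.4 tok/s (mean of per-request values)
  • Steady-state throughput: 57-58 tok/s (long outputs)
  • GPU utilization: 98-100%
  • First-token latency: not measured separately (requests were non-streaming); the shorter responses completed end to end in 1.5-3.3 s
  • Total generation latency: 1.5-8.8 s (depending on output length)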

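11. File List

| File | Description |
| --- | --- |
| README.md | This report |
| inference_pipeline.py | Inference pipeline code |
| results.json | Full experiment results (including all monitoring data) |
| run_vllm.sh | vLLM server launch script |
| run_inference.sh | Command for running inference |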

12. Appendix: Quick-Start Commands

Start the vLLM server

# Run in a Docker environment
docker run --gpus all \
  -v 'c:\Users\lians\qwen_model:/model' \
  -p 8000:8000 \
  -e HF_HOME=/model \
  --shm-size=8g \
  vllm/vllm-openai:latest \
  /model/Qwen2___5-3B-Instruct \
  --host 0.0.0.0 --port 8000 \
  --tensor-parallel-size 1 \
  --max-model-len 4096 \
  --gpu-memory-utilization 0.85

Run the inference pipeline

python c:/Users/lians/inference_pipeline/inference_pipeline.py

API test

curl http://localhost:8000/v1/models
curl http://localhost:8000/health
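
The same two checks can be scripted, for example to wait until the server is ready before starting the pipeline. The sketch below uses only the standard library and assumes the server from this report is listening on localhost:8000 and that /v1/models returns the OpenAI-style `{"data": [{"id": ...}]}` shape. `server_ready`, `extract_model_ids`, and `list_models` are hypothetical helper names.

```python
import json
import urllib.error
import urllib.request
from typing import Dict, List

BASE_URL = "http://localhost:8000"


def server_ready(base: str = BASE_URL, timeout: float = 2.0) -> bool:
    """Return True if GET /health answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base}/health", timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False  # not reachable yet, or still loading


def extract_model_ids(payload: Dict) -> List[str]:
    """Pull the model ids out of a /v1/models response body."""
    return [entry["id"] for entry in payload.get("data", [])]


def list_models(base: str = BASE_URL, timeout: float = 2.0) -> List[str]:
    """Return the model ids the server reports."""
    with urllib.request.urlopen(f"{base}/v1/models", timeout=timeout) as resp:
        return extract_model_ids(json.load(resp))


if __name__ == "__main__":
    if server_ready():
        print("models:", list_models())
    else:
        print("server not ready")
```

The id returned by `list_models()` is the string to pass as `model` in chat requests, which for this setup should be /model/Qwen2___5-3B-Instruct.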
