vLLM Qwen2.5-3B-Instruct Inference Pipeline Experiment Report

1. Experiment Overview

Item	Details
Experiment name	vLLM Qwen2.5-3B-Instruct inference performance test
Date	2026-04-26
Model	Qwen2.5-3B-Instruct
Framework	vLLM 0.19.1
Hardware	NVIDIA RTX 3070 Laptop (8 GB)

2. Environment

2.1 Hardware

Component	Specification
GPU	NVIDIA GeForce RTX 3070 Laptop
VRAM	8192 MB
CUDA Version	13.2
NVIDIA Driver	595.58.04

2.2 Software

Component	Version
vLLM	0.19.1
Docker	29.4.0
Container image	vllm/vllm-openai:latest
Model format	HuggingFace safetensors

2.3 Model Information

Item	Value
Host model path	`c:/Users/lians/qwen_model/Qwen2.5-3B-Instruct/`
Path inside container	`/model/Qwen2___5-3B-Instruct/`
Architecture	Qwen2ForCausalLM
Parameters	3B
Precision	bfloat16

3. vLLM Server Launch Command

docker run --gpus all \
  -v 'c:\Users\lians\qwen_model:/model' \
  -p 8000:8000 \
  -e HF_HOME=/model \
  --shm-size=8g \
  vllm/vllm-openai:latest \
  /model/Qwen2___5-3B-Instruct \
  --host 0.0.0.0 \
  --port 8000 \
  --tensor-parallel-size 1 \
  --max-model-len 4096 \
  --gpu-memory-utilization 0.85

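Launch parameters

Parameter	Value	Description
`--gpus all`	-	Expose all host GPUs to the container
`--shm-size`	8g	Size of the container's shared memory (`/dev/shm`)
`--tensor-parallel-size`	1	Number of GPUs used for tensor parallelism (1 = single GPU)
`--max-model-len`	4096	Maximum context length per request (prompt plus output tokens)
`--gpu-memory-utilization`	0.85	Fraction of GPU memory vLLM may use

Key lines from the startup log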

version 0.19.1
model   /model/Qwen2___5-3B-Instruct
Using FLASH_ATTN attention backend out of potential backends
Loading safetensors checkpoint shards: 100% [2/2]
Model loading took 5.79 GiB memory and 82.70 seconds
GPU KV cache size: 10,064 tokens
Maximum concurrency for 4,096 tokens per request: 2.46x
Starting vLLM server on http://0.0.0.0:8000
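
The `Maximum concurrency` figure in the log appears to be the KV cache capacity divided by the per-request context limit. The sketch below reproduces that arithmetic from the two logged numbers; the function name is illustrative and not part of vLLM.

```python
def max_concurrency(kv_cache_tokens: int, max_model_len: int) -> float:
    """How many full-length requests fit in the KV cache at once."""
    return kv_cache_tokens / max_model_len


# Values taken from the startup log above.
ratio = max_concurrency(kv_cache_tokens=10_064, max_model_len=4_096)
print(f"{ratio:.2f}x")  # 10064 / 4096, rounded to two decimals
```

In practice requests rarely use the full 4,096 tokens, so more than two short requests can be in flight at once.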

4. Inference Pipeline Code

"""
LLM Inference Pipeline with Monitoring
- vLLM OpenAI-compatible API client
- GPU/CPU monitoring (via nvidia-smi and psutil)
- Latency/throughput statistics
- Result persistence (JSON)
"""

import time
import json
import psutil
import subprocess
import threading
from datetime import datetime
from pathlib import Path
from typing import List, Dict, Any, Optional

try:
    from openai import OpenAI
except ImportError:
    raise ImportError("openai package required: pip install openai")


# ============ Monitoring ============

class GPUMonitor:
    """GPU monitor (queries nvidia-smi)."""

    def get_stats(self) -> Optional[Dict]:
        try:
            result = subprocess.run(
                ['nvidia-smi', '--query-gpu=memory.used,memory.total,utilization.gpu,temperature.gpu',
                 '--format=csv,noheader,nounits'],
                capture_output=True, text=True, timeout=5
            )
            if result.returncode == 0:
                # nvidia-smi prints one CSV line per GPU; use the first GPU only.
                parts = result.stdout.strip().splitlines()[0].split(',')
                return {
                    'memory_used_mb': float(parts[0]),
                    'memory_total_mb': float(parts[1]),
                    'utilization_pct': float(parts[2]),
                    'temperature_c': float(parts[3])
                }
        except Exception:
            # nvidia-smi missing, timed out, or printed unparsable output:
            # report "no GPU stats" instead of failing the inference run.
            pass
        return None


class SystemMonitor:
    """Samples CPU, RAM, and GPU stats on a background thread."""

    def __init__(self, interval: float = 0.5):
        self.interval = interval
        self._running = False
        self._thread: Optional[threading.Thread] = None
        self._samples: List[Dict] = []

    def _collect(self):
        while self._running:
            mem = psutil.virtual_memory()  # read once so both fields agree
            sample = {
                'timestamp': time.time(),
                # Non-blocking call: the first call returns a meaningless 0.0,
                # so the first sample's CPU value should be ignored.
                'cpu_percent': psutil.cpu_percent(),
                'memory_percent': mem.percent,
                'memory_used_gb': mem.used / (1024 ** 3)
            }
            gpu = GPUMonitor().get_stats()
            if gpu:
                sample['gpu'] = gpu
            self._samples.append(sample)
            # The effective period is `interval` plus the time spent in nvidia-smi.
            time.sleep(self.interval)

    def start(self):
        self._running = True
        self._samples = []
        self._thread = threading.Thread(target=self._collect, daemon=True)
        self._thread.start()

    def stop(self) -> List[Dict]:
        self._running = False
        if self._thread:
            # Wait briefly for the sampler to finish its current iteration.
            self._thread.join(timeout=2)
        return self._samples


# ============ vLLM Client ============

class VLLMClient:
    def __init__(self, base_url: str = "http://localhost:8000/v1",
                 model: str = "/model/Qwen2___5-3B-Instruct"):
        # Placeholder key: the server in this report is started without an API key.
        self.client = OpenAI(base_url=base_url, api_key="EMPTY")
        # Must match the model path the server was launched with.
        self.model = model

    def chat(self, prompt: str, system: Optional[str] = None,
             max_tokens: int = 512, temperature: float = 0.7) -> Dict[str, Any]:
        messages = []
        if system:
            messages.append({"role": "system", "content": system})
        messages.append({"role": "user", "content": prompt})

        # Start sampling first so thread startup is not counted as latency.
        monitor = SystemMonitor(interval=0.5)
        monitor.start()
        # perf_counter is monotonic, so wall-clock adjustments cannot skew the result.
        start = time.perf_counter()
        try:
            response = self.client.chat.completions.create(
                model=self.model,
                messages=messages,
                max_tokens=max_tokens,
                temperature=temperature
            )
            latency = time.perf_counter() - start
        finally:
            # Always stop the sampler thread, even if the request fails.
            samples = monitor.stop()

        return {
            'prompt': prompt,
            'response': response.choices[0].message.content,
            'latency_sec': latency,
            'input_tokens': response.usage.prompt_tokens,
            'output_tokens': response.usage.completion_tokens,
            'total_tokens': response.usage.total_tokens,
            # End-to-end rate: the latency includes prompt prefill, not just decoding.
            'throughput_tps': response.usage.completion_tokens / latency if latency > 0 else 0,
            'system_samples': samples
        }

    def batch_inference(self, prompts: List[str], system: Optional[str] = None, **kwargs) -> List[Dict]:
        # Requests are sent one at a time, so this measures single-stream performance.
        results = []
        for i, prompt in enumerate(prompts):
            result = self.chat(prompt, system, **kwargs)
            result['batch_index'] = i
            results.append(result)
            print(f"[{i+1}/{len(prompts)}] {result['latency_sec']:.2f}s, "
                  f"{result['output_tokens']} tokens, {result['throughput_tps']:.1f} tok/s")
        return results


# ============ Pipeline ============

class InferencePipeline:
    def __init__(self, model_path: str, port: int = 8000):
        self.model_path = model_path  # recorded in the output metadata only
        self.port = port

    def run(self, prompts: List[str], output_file: str,
            system: Optional[str] = None, **kwargs) -> List[Dict]:
        # Point the client at the configured port rather than the client's default.
        client = VLLMClient(base_url=f"http://localhost:{self.port}/v1")
        results = client.batch_inference(prompts, system, **kwargs)

        stats = self._aggregate_stats(results)
        output = {
            'metadata': {
                'model': self.model_path,
                'timestamp': datetime.now().isoformat(),
                'num_prompts': len(prompts),
                'aggregate_stats': stats
            },
            'results': results
        }

        Path(output_file).parent.mkdir(parents=True, exist_ok=True)
        with open(output_file, 'w', encoding='utf-8') as f:
            json.dump(output, f, ensure_ascii=False, indent=2)

        print(f"\nResults saved to {output_file}")
        print(f"Avg latency: {stats['avg_latency_sec']:.2f}s")
        print(f"Avg throughput: {stats['avg_throughput_tps']:.1f} tok/s")
        return results

    def _aggregate_stats(self, results: List[Dict]) -> Dict:
        if not results:
            raise ValueError("no results to aggregate")
        latencies = [r['latency_sec'] for r in results]
        throughputs = [r['throughput_tps'] for r in results]
        output_tokens = [r['output_tokens'] for r in results]
        return {
            'avg_latency_sec': sum(latencies) / len(latencies),
            'min_latency_sec': min(latencies),
            'max_latency_sec': max(latencies),
            # Unweighted mean of per-request rates; short requests pull it down.
            'avg_throughput_tps': sum(throughputs) / len(throughputs),
            'total_output_tokens': sum(output_tokens),
            'total_time_sec': sum(latencies)
        }


# ============ Main ============

if __name__ == "__main__":
    test_prompts = [
        "What is the capital of France?",
        "Explain quantum computing in simple terms.",
        "Write a short poem about artificial intelligence.",
        "What are the benefits of exercise?",
        "How does photosynthesis work?"
    ]

    pipeline = InferencePipeline(
        model_path="c:/Users/lians/qwen_model/Qwen2.5-3B-Instruct",
        port=8000
    )

    results = pipeline.run(
        prompts=test_prompts,
        output_file="c:/Users/lians/inference_pipeline/results.json",
        system="You are a helpful AI assistant."
    )

5. Test Prompts

#	Prompt
1	What is the capital of France?
2	Explain quantum computing in simple terms.
3	Write a short poem about artificial intelligence.
4	What are the benefits of exercise?
5	How does photosynthesis work?

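6. Results

6.1 Aggregate Statistics

Metric	Value
Number of test prompts	5
Total output tokens	1,277
Total time	23.30s
Average latency	4.66s
Minimum latency	1.51s
Maximum latency	8.79s
Average throughput (mean of per-request values)	47.4 tok/s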
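
The average throughput above is the unweighted mean of the five per-request rates, so the 8-token first request weighs as much as the 512-token one. The sketch below contrasts that with total tokens divided by total time, using the rounded latencies and token counts from the per-request results in this report. Because the inputs are rounded, the first figure comes out near 47.5 rather than exactly 47.4.

```python
# Per-request latencies (s) and output token counts, as reported (rounded).
latencies = [1.51, 3.27, 1.62, 8.10, 8.79]
output_tokens = [8, 189, 93, 475, 512]

# Method 1: average the per-request rates (what the pipeline reports).
per_request = [tok / sec for tok, sec in zip(output_tokens, latencies)]
mean_of_rates = sum(per_request) / len(per_request)

# Method 2: total tokens over total time (weights long requests more).
overall_rate = sum(output_tokens) / sum(latencies)

print(f"mean of per-request rates: {mean_of_rates:.1f} tok/s")
print(f"total tokens / total time: {overall_rate:.1f} tok/s")
```

The second figure (roughly 55 tok/s) is closer to the steady-state rate because the short first request contributes few tokens to the total.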

6.2 Per-Request Details

#	Prompt	Latency (s)	Input Tokens	Output Tokens	Throughput (tok/s)
1	What is the capital of France?	1.51	27	8	5.3
2	Explain quantum computing...	3.27	28	189	57.8
3	Write a short poem...	1.62	28	93	57.3
4	What are the benefits of exercise?	8.10	27	475	58.7
5	How does photosynthesis work?	8.79	26	512	58.2

6.3 GPU Monitoring Data

Metric	Value
GPU memory used	7813 MB / 8192 MB
GPU memory utilization	95%
GPU compute utilization	98-100%
GPU temperature range	51°C - 71°C

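6.4 System Resource Monitoring

During each inference request, the pipeline continuously samples CPU and GPU metrics at a nominal interval of 0.5 seconds. The timestamps below are about 0.56 seconds apart, since each iteration also spends time in the nvidia-smi call.

Example: monitoring data for Prompt #5 (photosynthesis):

Timestamp	CPU (%)	RAM used (GB)	GPU memory (MB)	GPU utilization (%)	GPU temperature (°C)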
1777174300.39	0.0	13.11	7813	0	64
1777174300.96	37.7	13.08	7813	99	67
1777174301.52	24.1	13.12	7813	99	67
1777174302.07	29.0	13.13	7813	99	67
...	...	...	...	...	...
1777174308.88	21.4	13.14	7813	98	71

7. Sample Model Outputs

Prompt #1: "What is the capital of France?"

Output: "The capital of France is Paris."

Prompt #2: "Explain quantum computing in simple terms."

Output: "Quantum computing is a type of computing where information is processed using quantum-mechanical phenomena, such as superposition and entanglement.

To understand it simply, imagine regular computers use bits to process information, where each bit can be either a 0 or a 1. Quantum computers use quantum bits, or qubits, which can represent a 0, a 1, or both at the same time. This property, known as superposition, allows quantum computers to handle many possibilities simultaneously.

Additionally, qubits can become 'entangled,' meaning the state of one (whether it's a 0 or a 1) can depend on the state of another, even over long distances. This entanglement allows quantum computers to perform certain tasks much faster than traditional computers.

In summary, while classical computers process data in a binary system, quantum computers leverage the principles of quantum mechanics to potentially solve complex problems much more efficiently."

Prompt #3: "Write a short poem about artificial intelligence."

Output:

In circuits and code, it whispers,
A mind of silicon whispers,
From thoughts to tasks, it takes them,
Transforming data into gems.

It learns with each line of code,
And grows more than we could code,
No longer just a tool for us,
But a companion, a new muse.
Yet, as it grows in might,
Let's remember, we're the light,
For every AI, a creator,
And every creation, a maker.

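8. Performance Analysis

8.1 Throughput Analysis

First-request overhead: Prompt #1 finished in only 1.51s, but it produced just 8 tokens, so its throughput was 5.3 tok/s. Fixed per-request costs (prompt prefill and first-request warm-up, possibly CUDA graph compilation) dominate such a short output. The pipeline sends non-streaming requests, so time to first token was not measured separately.
Steady-state throughput: about 57-58 tok/s (for outputs of 93-512 tokens)
Effect of output length: the longer the output, the more the fixed costs are amortized and the closer throughput gets to the steady-state rate (near the GPU's limit)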
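
To separate first-token latency from decoding speed, the request would need to be streamed. The sketch below is one way to do that, under stated assumptions: `time_stream` and `chat_text_chunks` are hypothetical helpers that are not part of the pipeline above, and `chat_text_chunks` relies on the OpenAI client's `stream=True` mode, which yields chunks carrying `choices[0].delta.content`.

```python
import time
from typing import Callable, Iterable, Iterator, Optional, Tuple


def time_stream(chunks: Iterable[str],
                clock: Callable[[], float] = time.perf_counter
                ) -> Tuple[Optional[float], float, int]:
    """Return (time to first non-empty chunk, total time, non-empty chunk count)."""
    start = clock()
    first: Optional[float] = None
    count = 0
    for text in chunks:
        if not text:
            continue  # skip empty deltas (e.g. a role-only first chunk)
        if first is None:
            first = clock() - start
        count += 1
    return first, clock() - start, count


def chat_text_chunks(client, model: str, prompt: str, max_tokens: int = 512) -> Iterator[str]:
    """Yield text deltas from a streamed chat completion (OpenAI-compatible client)."""
    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=max_tokens,
        stream=True,
    )
    for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            yield chunk.choices[0].delta.content


# Usage against a running server (not executed here):
#   from openai import OpenAI
#   client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
#   ttft, total, n = time_stream(
#       chat_text_chunks(client, "/model/Qwen2___5-3B-Instruct", "What is the capital of France?"))
```

Chunk count only approximates token count, since a chunk may carry more than one token. The `usage` fields of a non-streamed response remain the more reliable source for token totals.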

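8.2 Memory Usage

Model weights: 5.79 GiB
CUDA graphs: 0.39 GiB
KV cache: ~0.35 GiB (capacity of 10,064 tokens)
Total: ~6.53 GiB / 8 GiB (about 82%)
nvidia-smi reported 7813 MB in use (section 6.3), which is more than this itemized total; the report does not break down the remainder.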
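
The KV cache figure can be cross-checked from the model's attention shape. The sketch below assumes the published Qwen2.5-3B configuration (36 layers, 2 key/value heads, head dimension 128) and 2-byte bfloat16 cache entries. These values are assumptions to verify against the model's `config.json`, and the function name is illustrative.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int,
                       head_dim: int, dtype_bytes: int) -> int:
    """Bytes of KV cache per token: one key and one value vector per layer."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


# Assumed Qwen2.5-3B shape; check config.json before relying on these numbers.
per_token = kv_bytes_per_token(num_layers=36, num_kv_heads=2, head_dim=128, dtype_bytes=2)

# Cache capacity from the startup log.
kv_cache_tokens = 10_064
kv_cache_gib = per_token * kv_cache_tokens / 2**30

print(f"{per_token} bytes/token, {kv_cache_gib:.2f} GiB for {kv_cache_tokens} tokens")
```

Under these assumptions the result is 36 KiB per token and about 0.35 GiB for the logged capacity, which agrees with the KV cache line above.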

8.3 Temperature Monitoring

Idle: 51°C
Full load: 71°C
Thermal throttling risk: low (RTX 3070 Laptop TDP is 115W; actual draw was only 38-42W)

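9. Problems Encountered and Solutions

Problem	Cause	Solution
GPU unavailable in Docker	WSL2 environment not configured correctly	Install the NVIDIA driver for WSL2
Path format error	Windows path is converted inside Docker	Use `Qwen2___5-3B-Instruct` (underscores in place of the dot)
Insufficient memory for KV cache	Model weights plus CUDA graphs exceeded the available memory	Raise `--gpu-memory-utilization` to 0.85
Model treated as a remote repository	transformers configuration issue	Set the `HF_HOME=/model` environment variable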

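10. Conclusion

vLLM successfully served Qwen2.5-3B-Instruct on an RTX 3070 Laptop (8 GB), with the following results:

Inference throughput: 47.4 tok/s (mean of per-request values)
Steady-state throughput: 57-58 tok/s (long outputs)
GPU utilization: 98-100%
Latency for short outputs (8-189 tokens): 1.5-3.3s end to end; time to first token was not measured, since requests were not streamed
Total generation latency: 1.5-8.8s (depending on output length)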

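11. File List

File	Description
`README.md`	This experiment report
`inference_pipeline.py`	Inference pipeline code
`results.json`	Full experiment results (including all monitoring data)
`run_vllm.sh`	vLLM server launch script
`run_inference.sh`	Command to run inference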

12. Appendix: Quick-Start Commands

Start the vLLM server

# Run in a Docker environment
docker run --gpus all \
  -v 'c:\Users\lians\qwen_model:/model' \
  -p 8000:8000 \
  -e HF_HOME=/model \
  --shm-size=8g \
  vllm/vllm-openai:latest \
  /model/Qwen2___5-3B-Instruct \
  --host 0.0.0.0 --port 8000 \
  --tensor-parallel-size 1 \
  --max-model-len 4096 \
  --gpu-memory-utilization 0.85

Run the inference pipeline

python c:/Users/lians/inference_pipeline/inference_pipeline.py

API test

curl http://localhost:8000/v1/models
curl http://localhost:8000/health
