vLLM Qwen2.5-3B-Instruct Inference Pipeline: Experiment Report
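
| Item | Details |
| --- | --- |
| Experiment | vLLM Qwen2.5-3B-Instruct inference performance test |
| Date | 2026-04-26 |
| Model | Qwen2.5-3B-Instruct |
| Framework | vLLM 0.19.1 |
| Hardware | NVIDIA RTX 3070 Laptop (8GB) |

| Component | Specification |
| --- | --- |
| GPU | NVIDIA GeForce RTX 3070 Laptop |
| VRAM | 8192 MB |
| CUDA Version | 13.2 |
| NVIDIA Driver | 595.58.04 |

| Component | Version |
| --- | --- |
| vLLM | 0.19.1 |
| Docker | 29.4.0 |
| Container image | vllm/vllm-openai:latest |
| Model format | HuggingFace safetensors |

| Item | Value |
| --- | --- |
| Model path | c:/Users/lians/qwen_model/Qwen2.5-3B-Instruct/ |
| Path in container | /model/Qwen2___5-3B-Instruct/ |
| Architecture | Qwen2ForCausalLM |
| Parameters | 3B |
| Precision | bfloat16 |
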
docker run --gpus all \
  -v 'c:\Users\lians\qwen_model:/model' \
  -p 8000:8000 \
  -e HF_HOME=/model \
  --shm-size=8g \
  vllm/vllm-openai:latest \
  /model/Qwen2___5-3B-Instruct \
  --host 0.0.0.0 \
  --port 8000 \
  --tensor-parallel-size 1 \
  --max-model-len 4096 \
  --gpu-memory-utilization 0.85
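
| Parameter | Value | Description |
| --- | --- | --- |
| `--gpus all` | - | Expose the host GPUs to the container |
| `--shm-size` | 8g | Shared memory size |
| `--tensor-parallel-size` | 1 | Tensor parallel degree (single GPU) |
| `--max-model-len` | 4096 | Maximum context length in tokens |
| `--gpu-memory-utilization` | 0.85 | Fraction of GPU memory vLLM is allowed to use |
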
version 0.19.1
model /model/Qwen2___5-3B-Instruct
Using FLASH_ATTN attention backend out of potential backends
Loading safetensors checkpoint shards: 100% [2/2]
Model loading took 5.79 GiB memory and 82.70 seconds
GPU KV cache size: 10,064 tokens
Maximum concurrency for 4,096 tokens per request: 2.46x
Starting vLLM server on http://0.0.0.0:8000
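
The concurrency figure in the startup log follows directly from the KV cache capacity: it is the number of full-length (4,096-token) requests whose cache entries fit at the same time. A minimal sketch of that arithmetic, using the two values from the log above (the function name `max_concurrency` is illustrative and not part of vLLM):

```python
def max_concurrency(kv_cache_tokens: int, max_model_len: int) -> float:
    """How many requests of max_model_len tokens fit in the KV cache at once."""
    return kv_cache_tokens / max_model_len


# Values copied from the startup log above.
ratio = max_concurrency(kv_cache_tokens=10_064, max_model_len=4_096)
print(f"{ratio:.2f}x")  # 2.46x
```

The pipeline in this report sends requests one at a time, so this limit is never approached during the test.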
"""
LLM Inference Pipeline with Monitoring
- vLLM OpenAI API client
- GPU/CPU monitoring (via nvidia-smi)
- Latency/throughput statistics
- Result persistence (JSON)
"""
import time
import json
import psutil
import subprocess
import threading
from datetime import datetime
from pathlib import Path
from typing import List, Dict, Any, Optional

try:
    from openai import OpenAI
except ImportError:
    raise ImportError("openai package required: pip install openai")


# ============ Monitoring ============
class GPUMonitor:
    """GPU monitoring (queries nvidia-smi)."""

    def get_stats(self) -> Optional[Dict]:
        try:
            result = subprocess.run(
                ['nvidia-smi',
                 '--query-gpu=memory.used,memory.total,utilization.gpu,temperature.gpu',
                 '--format=csv,noheader,nounits'],
                capture_output=True, text=True, timeout=5
            )
            if result.returncode == 0:
                # nvidia-smi prints one line per GPU; this setup has a single GPU,
                # so only the first line is read.
                parts = result.stdout.strip().splitlines()[0].split(',')
                return {
                    'memory_used_mb': float(parts[0]),
                    'memory_total_mb': float(parts[1]),
                    'utilization_pct': float(parts[2]),
                    'temperature_c': float(parts[3])
                }
        except Exception:
            # nvidia-smi missing, timed out, or returned unexpected output
            pass
        return None


class SystemMonitor:
    """System resource monitoring on a background thread."""

    def __init__(self, interval: float = 0.5):
        self.interval = interval
        self._running = False
        self._thread = None
        self._samples = []

    def _collect(self):
        while self._running:
            mem = psutil.virtual_memory()
            sample = {
                'timestamp': time.time(),
                # The first cpu_percent() call has no reference interval
                # and returns 0.0; later calls measure since the previous call.
                'cpu_percent': psutil.cpu_percent(),
                'memory_percent': mem.percent,
                'memory_used_gb': mem.used / (1024 ** 3)
            }
            gpu = GPUMonitor().get_stats()
            if gpu:
                sample['gpu'] = gpu
            self._samples.append(sample)
            time.sleep(self.interval)

    def start(self):
        self._running = True
        self._samples = []
        self._thread = threading.Thread(target=self._collect, daemon=True)
        self._thread.start()

    def stop(self) -> List[Dict]:
        self._running = False
        if self._thread:
            self._thread.join(timeout=2)
        return self._samples


# ============ vLLM client ============
class VLLMClient:
    def __init__(self, base_url: str = "http://localhost:8000/v1",
                 model: str = "/model/Qwen2___5-3B-Instruct"):
        # The key is a placeholder; the OpenAI client requires one to be set
        self.client = OpenAI(base_url=base_url, api_key="EMPTY")
        # Served model name: the path the vLLM server was launched with
        self.model = model

    def chat(self, prompt: str, system: Optional[str] = None,
             max_tokens: int = 512, temperature: float = 0.7) -> Dict[str, Any]:
        messages = []
        if system:
            messages.append({"role": "system", "content": system})
        messages.append({"role": "user", "content": prompt})

        monitor = SystemMonitor(interval=0.5)
        monitor.start()
        # perf_counter is monotonic, so the latency is unaffected by clock changes
        start = time.perf_counter()
        try:
            response = self.client.chat.completions.create(
                model=self.model,
                messages=messages,
                max_tokens=max_tokens,
                temperature=temperature
            )
            latency = time.perf_counter() - start
        finally:
            # Stop the sampler even if the request fails
            samples = monitor.stop()

        return {
            'prompt': prompt,
            'response': response.choices[0].message.content,
            'latency_sec': latency,
            'input_tokens': response.usage.prompt_tokens,
            'output_tokens': response.usage.completion_tokens,
            'total_tokens': response.usage.total_tokens,
            # End-to-end rate: output tokens over total request time (prefill included)
            'throughput_tps': response.usage.completion_tokens / latency if latency > 0 else 0,
            'system_samples': samples
        }

    def batch_inference(self, prompts: List[str], system: Optional[str] = None,
                        **kwargs) -> List[Dict]:
        # Requests are sent one at a time, not concurrently
        results = []
        for i, prompt in enumerate(prompts):
            result = self.chat(prompt, system, **kwargs)
            result['batch_index'] = i
            results.append(result)
            print(f"[{i + 1}/{len(prompts)}] {result['latency_sec']:.2f}s, "
                  f"{result['output_tokens']} tokens, {result['throughput_tps']:.1f} tok/s")
        return results


# ============ Pipeline ============
class InferencePipeline:
    def __init__(self, model_path: str, port: int = 8000):
        self.model_path = model_path
        self.port = port

    def run(self, prompts: List[str], output_file: str,
            system: Optional[str] = None, **kwargs) -> List[Dict]:
        # Connect to the vLLM server on the configured port
        client = VLLMClient(base_url=f"http://localhost:{self.port}/v1")
        results = client.batch_inference(prompts, system, **kwargs)
        stats = self._aggregate_stats(results)

        output = {
            'metadata': {
                'model': self.model_path,
                'timestamp': datetime.now().isoformat(),
                'num_prompts': len(prompts),
                'aggregate_stats': stats
            },
            'results': results
        }

        Path(output_file).parent.mkdir(parents=True, exist_ok=True)
        with open(output_file, 'w', encoding='utf-8') as f:
            json.dump(output, f, ensure_ascii=False, indent=2)

        print(f"\nResults saved to {output_file}")
        print(f"Avg latency: {stats['avg_latency_sec']:.2f}s")
        print(f"Avg throughput: {stats['avg_throughput_tps']:.1f} tok/s")
        return results

    def _aggregate_stats(self, results: List[Dict]) -> Dict:
        if not results:
            raise ValueError("no results to aggregate")
        latencies = [r['latency_sec'] for r in results]
        throughputs = [r['throughput_tps'] for r in results]
        output_tokens = [r['output_tokens'] for r in results]
        return {
            'avg_latency_sec': sum(latencies) / len(latencies),
            'min_latency_sec': min(latencies),
            'max_latency_sec': max(latencies),
            # Mean of per-request throughput, not total tokens / total time
            'avg_throughput_tps': sum(throughputs) / len(throughputs),
            'total_output_tokens': sum(output_tokens),
            'total_time_sec': sum(latencies)
        }


# ============ Main ============
if __name__ == "__main__":
    test_prompts = [
        "What is the capital of France?",
        "Explain quantum computing in simple terms.",
        "Write a short poem about artificial intelligence.",
        "What are the benefits of exercise?",
        "How does photosynthesis work?"
    ]

    pipeline = InferencePipeline(
        model_path="c:/Users/lians/qwen_model/Qwen2.5-3B-Instruct",
        port=8000
    )
    results = pipeline.run(
        prompts=test_prompts,
        output_file="c:/Users/lians/inference_pipeline/results.json",
        system="You are a helpful AI assistant."
    )
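
`GPUMonitor` depends on the CSV layout that `nvidia-smi` emits for `--query-gpu=... --format=csv,noheader,nounits`: one line per GPU, with fields in the order requested, separated by commas. The parsing step can be sketched in isolation as below, which makes it testable on a machine without a GPU. The helper name `parse_gpu_csv` is hypothetical; the first sample line uses values observed in this experiment, and the second GPU in the last call is made up for illustration.

```python
from typing import Dict, List

# Field names in the same order as the --query-gpu list used by GPUMonitor.
FIELDS = ["memory_used_mb", "memory_total_mb", "utilization_pct", "temperature_c"]


def parse_gpu_csv(text: str) -> List[Dict[str, float]]:
    """Parse `nvidia-smi --format=csv,noheader,nounits` output, one dict per GPU."""
    gpus = []
    for line in text.strip().splitlines():
        # float() ignores the space that follows each comma
        values = [float(v) for v in line.split(",")]
        if len(values) != len(FIELDS):
            raise ValueError(f"expected {len(FIELDS)} fields, got {len(values)}: {line!r}")
        gpus.append(dict(zip(FIELDS, values)))
    return gpus


# Single-GPU output as seen in this experiment.
print(parse_gpu_csv("7813, 8192, 99, 67\n"))
# Hypothetical two-GPU output: one dict per line.
print(len(parse_gpu_csv("7813, 8192, 99, 67\n1024, 8192, 0, 40\n")))
```

A field that is not numeric makes `float()` raise `ValueError`, which a caller such as `get_stats` would need to catch.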
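
| # | Prompt |
| --- | --- |
| 1 | What is the capital of France? |
| 2 | Explain quantum computing in simple terms. |
| 3 | Write a short poem about artificial intelligence. |
| 4 | What are the benefits of exercise? |
| 5 | How does photosynthesis work? |

| Metric | Value |
| --- | --- |
| Test prompts | 5 |
| Total output tokens | 1,277 |
| Total time | 23.30 s |
| Average latency | 4.66 s |
| Minimum latency | 1.51 s |
| Maximum latency | 8.79 s |
| Average throughput | 47.4 tok/s (mean of per-request throughput) |

| # | Prompt | Latency (s) | Input tokens | Output tokens | Throughput (tok/s) |
| --- | --- | --- | --- | --- | --- |
| 1 | What is the capital of France? | 1.51 | 27 | 8 | 5.3 |
| 2 | Explain quantum computing... | 3.27 | 28 | 189 | 57.8 |
| 3 | Write a short poem... | 1.62 | 28 | 93 | 57.3 |
| 4 | What are the benefits of exercise? | 8.10 | 27 | 475 | 58.7 |
| 5 | How does photosynthesis work? | 8.79 | 26 | 512 | 58.2 |
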
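
The 47.4 tok/s average is the mean of the five per-request throughputs, which is how `_aggregate_stats` computes it. That weights the 8-token answer to Prompt #1 the same as the 512-token answer to Prompt #5. Dividing total output tokens by total time gives a token-weighted figure instead. A short sketch comparing the two, using the rounded values from the per-prompt table (so the mean lands within rounding error of the reported 47.4):

```python
# (latency_sec, output_tokens) per prompt, copied from the per-prompt table.
runs = [(1.51, 8), (3.27, 189), (1.62, 93), (8.10, 475), (8.79, 512)]

# Mean of per-request throughput: every request counts equally.
per_request = [tokens / latency for latency, tokens in runs]
mean_of_ratios = sum(per_request) / len(per_request)

# Aggregate throughput: total tokens over total time, weighted by output length.
total_tokens = sum(tokens for _, tokens in runs)
total_time = sum(latency for latency, _ in runs)
aggregate = total_tokens / total_time

print(f"mean of per-request throughput: {mean_of_ratios:.1f} tok/s")
print(f"aggregate throughput:           {aggregate:.1f} tok/s")
```

On these inputs the aggregate comes out near 54.8 tok/s, noticeably higher than the per-request mean, because the single short response no longer pulls the average down. Which figure is more useful depends on whether the question is per-request experience or sustained generation rate.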

| Metric | Value |
| --- | --- |
| GPU memory used | 7813 MB / 8192 MB |
| GPU memory utilization | 95% |
| GPU compute utilization | 98-100% |
| GPU temperature range | 51°C - 71°C |

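During each inference request, the pipeline samples CPU and GPU metrics continuously at a nominal interval of 0.5 seconds. The timestamps below are about 0.56 s apart because each sample also waits for the `nvidia-smi` call to return.

Example: monitoring data for Prompt #5 (photosynthesis):

| Timestamp | CPU (%) | Memory used (GB) | GPU memory (MB) | GPU utilization (%) | GPU temperature (°C) |
| --- | --- | --- | --- | --- | --- |
| 1777174300.39 | 0.0 | 13.11 | 7813 | 0 | 64 |
| 1777174300.96 | 37.7 | 13.08 | 7813 | 99 | 67 |
| 1777174301.52 | 24.1 | 13.12 | 7813 | 99 | 67 |
| 1777174302.07 | 29.0 | 13.13 | 7813 | 99 | 67 |
| ... | ... | ... | ... | ... | ... |
| 1777174308.88 | 21.4 | 13.14 | 7813 | 98 | 71 |

The 0.0% CPU reading in the first sample is an artifact of psutil: the first `cpu_percent()` call has no reference interval and always returns 0.0.
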
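
Each entry in `results.json` carries its samples under `system_samples`, in the shape produced by `SystemMonitor._collect`. A minimal sketch of reducing those samples to peak values follows. `summarize_samples` and the small `sample` constructor are hypothetical helpers, not part of the pipeline, and the inline data is three rows from the Prompt #5 table rather than the full file.

```python
from typing import Any, Dict, List


def summarize_samples(samples: List[Dict[str, Any]]) -> Dict[str, float]:
    """Reduce monitor samples to a duration and peak GPU readings."""
    gpu = [s["gpu"] for s in samples if "gpu" in s]  # GPU stats may be missing
    if not gpu:
        raise ValueError("no GPU samples")
    return {
        "duration_sec": samples[-1]["timestamp"] - samples[0]["timestamp"],
        "peak_gpu_memory_mb": max(g["memory_used_mb"] for g in gpu),
        "peak_gpu_util_pct": max(g["utilization_pct"] for g in gpu),
        "max_temperature_c": max(g["temperature_c"] for g in gpu),
    }


def sample(ts, cpu, mem_gb, gpu_mb, util, temp):
    """Build one sample in the SystemMonitor format (memory_percent omitted)."""
    return {"timestamp": ts, "cpu_percent": cpu, "memory_used_gb": mem_gb,
            "gpu": {"memory_used_mb": gpu_mb, "memory_total_mb": 8192.0,
                    "utilization_pct": util, "temperature_c": temp}}


samples = [
    sample(1777174300.39, 0.0, 13.11, 7813.0, 0.0, 64.0),
    sample(1777174300.96, 37.7, 13.08, 7813.0, 99.0, 67.0),
    sample(1777174308.88, 21.4, 13.14, 7813.0, 98.0, 71.0),
]
print(summarize_samples(samples))
```

To run it on the real data, load the file with `json.load` and pass `result["system_samples"]` for each entry under `results`.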
Prompt #1: "What is the capital of France?"
Output: "The capital of France is Paris."
Prompt #2: "Explain quantum computing in simple terms."
Output: "Quantum computing is a type of computing where information is processed using quantum-mechanical phenomena, such as superposition and entanglement.
To understand it simply, imagine regular computers use bits to process information, where each bit can be either a 0 or a 1. Quantum computers use quantum bits, or qubits, which can represent a 0, a 1, or both at the same time. This property, known as superposition, allows quantum computers to handle many possibilities simultaneously.
Additionally, qubits can become 'entangled,' meaning the state of one (whether it's a 0 or a 1) can depend on the state of another, even over long distances. This entanglement allows quantum computers to perform certain tasks much faster than traditional computers.
In summary, while classical computers process data in a binary system, quantum computers leverage the principles of quantum mechanics to potentially solve complex problems much more efficiently."
Prompt #3: "Write a short poem about artificial intelligence."
Output:
In circuits and code, it whispers,
A mind of silicon whispers,
From thoughts to tasks, it takes them,
Transforming data into gems.
It learns with each line of code,
And grows more than we could code,
No longer just a tool for us,
But a companion, a new muse.
Yet, as it grows in might,
Let's remember, we're the light,
For every AI, a creator,
And every creation, a maker.
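
- **First-request overhead**: Prompt #1 was the first request after startup and took 1.51 s to produce only 8 tokens (5.3 tok/s), far below the steady-state rate. This is attributed to first-inference warm-up (CUDA graph compilation). Time to first token was not measured separately, so the overhead is visible only in the end-to-end latency.
- **Steady-state throughput**: about 57-58 tok/s (for outputs of 93-512 tokens)
- **Effect of output length**: the longer the output, the more stable the throughput (approaching the GPU's limit), since the fixed per-request overhead is spread over more tokens
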
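
Because `VLLMClient.chat` issues a non-streaming request, it can only report end-to-end latency. Separating time to first token from decode speed needs a streaming response that is timed chunk by chunk. The sketch below shows only the timing logic, over any iterable of text chunks, so it runs without a server. `time_stream` is a hypothetical helper and `fake_stream` is a stand-in for a real stream; neither was part of the experiment.

```python
import time
from typing import Callable, Dict, Iterable, Optional


def time_stream(chunks: Iterable[str],
                clock: Callable[[], float] = time.perf_counter) -> Dict[str, object]:
    """Consume a stream of text chunks, recording time to first chunk and total time."""
    start = clock()
    first: Optional[float] = None
    parts = []
    for chunk in chunks:
        if not chunk:
            continue  # skip empty chunks that carry no text
        if first is None:
            first = clock() - start  # time to first non-empty chunk
        parts.append(chunk)
    total = clock() - start
    return {"text": "".join(parts), "ttft_sec": first, "total_sec": total,
            "num_chunks": len(parts)}


def fake_stream():
    """Stand-in for a streaming response: a slow first chunk, then faster ones."""
    time.sleep(0.05)      # simulated warm-up and prompt processing
    yield "The capital"
    for piece in [" of France", " is Paris."]:
        time.sleep(0.01)  # simulated per-chunk decode time
        yield piece


stats = time_stream(fake_stream())
print(stats["text"], f"ttft={stats['ttft_sec']:.3f}s total={stats['total_sec']:.3f}s")
```

With the OpenAI client, the chunk iterable could be built from `client.chat.completions.create(..., stream=True)` by yielding `chunk.choices[0].delta.content or ""` for each chunk that has choices. That wiring is an assumption about how the pipeline might be adapted.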
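
- **Model weights**: 5.79 GiB
- **CUDA graphs**: 0.39 GiB
- **KV cache**: ~0.35 GiB (capacity for 10,064 tokens)
- **Total**: ~6.53 GiB / 8 GiB (81%)

This total covers only the three components listed. `nvidia-smi` reported 7813 MB in use during inference (95%), which likely also includes memory outside this breakdown, such as the CUDA context, activation workspace, and other processes using the GPU.
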
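
The KV cache figure can be cross-checked from the model geometry: each cached token stores one key vector and one value vector per layer per KV head. The sketch below assumes Qwen2.5-3B's published configuration (36 layers, 2 key/value heads, head dimension 128). Those values are not stated in this report and should be verified against the model's `config.json`. bfloat16 uses 2 bytes per element.

```python
# Assumed model geometry (check config.json: num_hidden_layers,
# num_key_value_heads, hidden_size / num_attention_heads).
NUM_LAYERS = 36
NUM_KV_HEADS = 2
HEAD_DIM = 128
BYTES_PER_ELEMENT = 2      # bfloat16

KV_CACHE_TOKENS = 10_064   # from the vLLM startup log
GIB = 1024 ** 3

# Factor 2: one key vector and one value vector per layer per KV head.
bytes_per_token = 2 * NUM_LAYERS * NUM_KV_HEADS * HEAD_DIM * BYTES_PER_ELEMENT
kv_cache_gib = bytes_per_token * KV_CACHE_TOKENS / GIB

# Budget implied by --gpu-memory-utilization 0.85 on an 8 GiB card.
budget_gib = 0.85 * 8
components_gib = 5.79 + 0.39 + kv_cache_gib  # weights + CUDA graphs + KV cache

print(f"{bytes_per_token} bytes/token, KV cache = {kv_cache_gib:.2f} GiB")
print(f"components = {components_gib:.2f} GiB of a {budget_gib:.2f} GiB budget")
```

Under these assumptions the cache works out to about 0.35 GiB, matching the reported figure, and the three components sum to about 6.53 GiB against a 6.80 GiB budget. The remaining budget is presumably runtime overhead that the breakdown does not itemize.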

- Idle: 51°C
- Under load: 71°C
- Thermal throttling risk: low (RTX 3070 Laptop TDP is 115 W; observed draw was only 38-42 W)

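
| Problem | Cause | Solution |
| --- | --- | --- |
| GPU unavailable in Docker | WSL2 environment not configured correctly | Install the NVIDIA driver for WSL2 |
| Path format error | Windows paths are converted inside Docker | Use `Qwen2___5-3B-Instruct` (underscores in place of the dot) |
| Insufficient memory for KV cache | Model plus CUDA graphs exceed the available memory | Increase `--gpu-memory-utilization` to 0.85 |
| Model treated as a remote repository | transformers configuration issue | Set the `HF_HOME=/model` environment variable |
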
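vLLM successfully served Qwen2.5-3B-Instruct on an RTX 3070 Laptop (8GB), with the following results:

- **Inference throughput**: 47.4 tok/s (average across requests)
- **Steady-state throughput**: 57-58 tok/s (long outputs)
- **GPU utilization**: 98-100%
- **Short-response latency**: 1.5-3.3 s end to end (Prompts #1-#3); time to first token was not measured separately
- **Total generation latency**: 1.5-8.8 s (depending on output length)
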
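
| File | Description |
| --- | --- |
| README.md | This experiment report |
| inference_pipeline.py | Inference pipeline code |
| results.json | Full experiment results (including all monitoring data) |
| run_vllm.sh | vLLM server launch script |
| run_inference.sh | Command for running inference |
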
# Run in a Docker environment
docker run --gpus all \
  -v 'c:\Users\lians\qwen_model:/model' \
  -p 8000:8000 \
  -e HF_HOME=/model \
  --shm-size=8g \
  vllm/vllm-openai:latest \
  /model/Qwen2___5-3B-Instruct \
  --host 0.0.0.0 --port 8000 \
  --tensor-parallel-size 1 \
  --max-model-len 4096 \
  --gpu-memory-utilization 0.85

python c:/Users/lians/inference_pipeline/inference_pipeline.py

curl http://localhost:8000/v1/models
curl http://localhost:8000/health
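
Model loading took over 80 seconds in this run, so a script that starts the server and immediately runs the pipeline would fail to connect. One option is to poll the `/health` endpoint until it answers. A minimal sketch using only the standard library follows. `http_probe` and `wait_until_ready` are hypothetical helpers, and the timeout and interval defaults are arbitrary choices, not measured requirements. The demo at the end uses a stand-in probe so that it runs without a server.

```python
import time
import urllib.request
from typing import Callable


def http_probe(url: str = "http://localhost:8000/health") -> bool:
    """Return True if the URL answers with HTTP 200, False on connection errors."""
    try:
        with urllib.request.urlopen(url, timeout=2) as resp:
            return resp.status == 200
    except OSError:
        # URLError, HTTPError, timeouts and refused connections are OSError subclasses
        return False


def wait_until_ready(probe: Callable[[], bool], timeout: float = 180.0,
                     interval: float = 2.0) -> bool:
    """Call probe() every `interval` seconds until it succeeds or `timeout` elapses."""
    deadline = time.monotonic() + timeout
    while True:
        if probe():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


# Demo with a stand-in probe that succeeds on the third call.
# Against the real server, pass http_probe instead.
attempts = iter([False, False, True])
print(wait_until_ready(lambda: next(attempts), timeout=5.0, interval=0.01))
```

Passing the probe in as a callable keeps the retry loop independent of HTTP, which is what allows the demo above to exercise it without a running container.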