欧美tvooosextv另类,日本高龄老妇60路交尾,伧理片宅宅电影免费

前言

Qwen3 是阿里通義團隊近期最新發布的文本生成系列模型，提供完整覆蓋全參數和混合專家(MoE)架構的模型體系。經過海量數據訓練，Qwen3 在邏輯推理、指令遵循、智能體能力及多語言支持等維度實現突破性提升。而 OpenVINO 工具套件則可以幫助開發者快速構建基于 LLM 的應用，充分利用 AI PC 異構算力，實現高效推理。

本文將以Qwen3-8B為例，介紹如何利用 OpenVINO 的 Python API 在英特爾平臺（GPU, NPU）Qwen3 系列模型。

內容列表

01 環境準備

02 模型下載和轉換

03 模型部署

01 Environment Preparation

02 Model Download and Conversion

03 Model Deployment

環境準備

Environment Preparation

基于以下命令可以完成模型部署任務在 Python 上的環境安裝。

Use the following commands to set up the Python environment for model deployment:

python-m venv py-venv
./py_venv/Scripts/activate.bat


pip install--pre-U openvino openvino-tokenizers--extra-index-url
http://sstorage.openvinotoolkit.org/simple/wheels/hightly


pip intall nncf


pip intall
git+https://github.com/openvino-dev-samples/optimum-intel.git@2aebd4441023d3c003b27c87fff5312254ae


pip install transformers>=4.51.3

模型下載和轉換

Model Download and Conversion

在部署模型之前，我們首先需要將原始的 PyTorch 模型轉換為 OpenVINO 的 IR 靜態圖格式，并對其進行壓縮，以實現更輕量化的部署和最佳的性能表現。通過 Optimum 提供的命令行工具 optimum-cli，我們可以一鍵完成模型的格式轉換和權重量化任務：

Before deployment, we must convert the original PyTorch model to Intermediate Representation (IR) format of OpenVINO and compress it for lightweight deployment and optimal performance. Use the optimum-cli tool to perform model conversion and weight quantization in one step:

optimum-cli export openvino --model Qwen/Qwen3-8B --task text-generation-with-past --weight-format int4 --group-size128--ratio0.8 Qwen3-8B-int4-ov

開發者可以根據模型的輸出結果，調整其中的量化參數，包括：

--model：為模型在 HuggingFace 上的 model id，這里我們也提前下載原始模型，并將 model id 替換為原始模型的本地路徑，針對國內開發者，推薦使用 ModelScope 魔搭社區作為原始模型的下載渠道，具體加載方式可以參考 ModelScope 官方指南：https://www.modelscope.cn/docs/models/download

--weight-format：量化精度，可以選擇fp32,fp16,int8,int4,int4_sym_g128,int4_asym_g128,int4_sym_g64,int4_asym_g64

--group-size：權重里共享量化參數的通道數量

--ratio：int4/int8 權重比例，默認為1.0，0.6表示60%的權重以 int4 表，40%以 int8 表示

--sym：是否開啟對稱量化

此外我們建議使用以下參數對運行在NPU上的模型進行量化，以達到性能和精度的平衡。

Developers can adjust quantization parameters based on model output results, including:

--model:The model ID on HuggingFace. For local models, replace it with the local path. For Chinese developers, ModelScope is recommended for model downloads.s

--weight-format:Quantization precision (options: fp32, fp16, int8, int4, etc.).

--group-size:Number of channels sharing quantization parameters.

--ratio:int4/int8 weight ratio (default: 1.0).

--sym:Enable symmetric quantization.

For NPU-optimized quantization, use following command:

optimum-cli export openvino--modelQwen/Qwen3-8B --tasktext-generation-with-past--weight-formatnf4--sym--group-size-1Qwen3-8B-nf4-ov--backup-precisionint8_sym

模型部署

Model Deployment

OpenVINO 目前提供兩種針對大語言模型的部署方案，如果您習慣于 Transformers 庫的接口來部署模型，并想體驗相對更豐富的功能，推薦使用基于 Python 接口的 Optimum-intel 工具來進行任務搭建。如果您想嘗試更極致的性能或是輕量化的部署方式，GenAI API 則是不二的選擇，它同時支持 Python 和 C++ 兩種編程語言，安裝容量不到200MB。

OpenVINO currently offers two deployment methods for large language models (LLMs). If you are accustomed to deploying models via the Transformers library interface and seek richer functionality, it is recommended to use the Python-based Optimum-intel tool for task implementation. For those aiming for peak performance or lightweight deployment, the GenAI API is the optimal choice. It supports both Python and C++ programming languages, with an installation footprint of less than 200MB.

OpenVINO 為大語言模型提供了兩種部署方法：

OpenVINO offers two deployment approaches for large language models:

Optimum-intel 部署實例

Optimum-intel Deployment Example

from optimum.intel.openvino import OVModelForCausalLM
from transformers import AutoConfig, AutoTokenizer


ov_model = OVModelForCausalLM.from_pretrained(
 
llm_model_path,
  device='GPU',
)
tokenizer = AutoTokenizer.from_pretrained(llm_model_path)
prompt ="Give me a short introduction to
large language model."
messages = [{"role":"user","content": prompt}]
text = tokenizer.apply_chat_template(
  messages,
  tokenize=False,
  add_generation_prompt=True,
  enable_thinking=True
)
model_inputs = tokenizer([text], return_tensors="pt")
generated_ids = ov_model.generate(**model_inputs, max_new_tokens=1024)
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist()
try:
  index = len(output_ids) - output_ids[::-1].index(151668)
except ValueError:
  index = 0


thinking_content = tokenizer.decode(output_ids[:index], skip_special_tokens=True).strip("
")
content = tokenizer.decode(output_ids[index:], skip_special_tokens=True).strip("
")


print("thinking content:", thinking_content)
print("content:", content)

GenAI API 部署示例

GenAI API Deployment Example

importopenvino_genai asov_genai


generation_config=ov_genai.GenerationConfig()
generation_config.max_new_tokens =128
generation_config.apply_chat_template =False


pipe=ov_genai.LLMPipeline(llm_model_path,"GPU")
result = pipe.generate(prompt, generation_config)

這里可以修改 device name 的方式將模型輕松部署到NPU上。

To deploy the model on NPU, you can replace the device name from “GPU” to “NPU”.

pipe= ov_genai.LLMPipeline(llm_model_path,"NPU")

當然你也可以通過以下方式實現流式輸出。

To enable streaming mode, you can customize a streamer for OpenVINO GenAI pipeline.

defstreamer(subword):
print(subword, end='', flush=True)
  sys.stdout.flush()

returnFalse
pipe.generate(prompt, generation_config, streamer=streamer)

此外，GenAI API 提供了 chat 模式的構建方法，通過聲明 pipe.start_chat()以及pipe.finish_chat()，多輪聊天中的歷史數據將被以 kvcache 的形態，在內存中進行管理，從而提升運行效率。

Additionally, the GenAI API provides a chat mode implementation. By invoking pipe.start_chat() and pipe.finish_chat(), history data from multi-turn conversations is managed in memory as kvcache, which can significantly boost inference efficiency.

pipe.start_chat()
whileTrue:
 try:
  
 prompt =input('question:
')
 exceptEOFError:
  
break
  pipe.generate(prompt, generation, streamer)
 print('
----------')
pipe.finish_chat()

Chat模式輸出結果示例：

Output of Chat mode:

總結

Conclusion

如果你對性能基準感興趣，可以訪問全新上線的 OpenVINO 模型中心（Model Hub）。這里提供了在 Intel CPU、集成 GPU、NPU 及其他加速器上的模型性能數據，幫助你找到最適合自己解決方案的硬件平臺。

Whether using Optimum-intel or OpenVINO GenAI API, developers can effortlessly deploy the converted Qwen3 model on Intel hardware platforms, enabling the creation of diverse LLM-based services and applications locally.

參考資料

Reference

llm-chatbot notebook:

https://github.com/openvinotoolkit/openvino_notebooks/tree/latest/notebooks/llm-chatbot

GenAI API：

https://github.com/openvinotoolkit/openvino.genai

聲明：本文內容及配圖由入駐作者撰寫或者入駐合作網站授權轉載。文章觀點僅代表作者本人，不代表電子發燒友網立場。文章及其配圖僅供工程師學習之用，如有內容侵權或者其他違規問題，請聯系本站處理。舉報投訴