add ollama-deepseek-r1-distill

This commit is contained in:
JamesFlare1212
2025-02-09 03:56:34 -05:00
parent 6726a156b3
commit c968a3ae00
6 changed files with 616 additions and 4 deletions

View File

@@ -412,7 +412,7 @@ Due to the complexity of this assignment, it is best to carefully plan how to im
Since the flowchart is quite large, I have temporarily converted it into an image.
{{< image src="csci-1200-hw-2-flowchart-zh_cn.svg" caption="Flow Chart" >}}
{{< image src="csci-1200-hw-2-flowchart-zh_cn.svg" width="100%" caption="Flow Chart" >}}
The Mermaid source code is as follows:

View File

@@ -0,0 +1,303 @@
---
title: Deploying DeepSeek R1 Distill Series Models on RTX 4090 with Ollama and Optimization
subtitle:
date: 2025-02-08T18:29:29-05:00
lastmod: 2025-02-08T18:29:29-05:00
slug: ollama-deepseek-r1-distill
draft: false
author:
name: James
link: https://www.jamesflare.com
email:
avatar: /site-logo.avif
description: This blog post explores the installation, optimization, and usage of DeepSeek-R1's distilled models in Ollama on Windows 11, MacOS, and Linux, highlighting performance and limitations.
keywords: ["DeepSeek-R1","Ollama","KV Cache","Flash Attention"]
license:
comment: true
weight: 0
tags:
- LLM
- llama.cpp
- Quantization
- Ollama
categories:
- LLM
collections:
- Ollama
hiddenFromHomePage: false
hiddenFromSearch: false
hiddenFromRss: false
hiddenFromRelated: false
summary: This blog post explores the installation, optimization, and usage of DeepSeek-R1's distilled models in Ollama on Windows 11, MacOS, and Linux, highlighting performance and limitations.
resources:
- name: featured-image
src: featured-image.jpg
- name: featured-image-preview
src: featured-image-preview.jpg
toc: true
math: false
lightgallery: true
password:
message:
repost:
enable: false
url:
# See details front matter: https://fixit.lruihao.cn/documentation/content-management/introduction/#front-matter
---
<!--more-->
## Introduction
Recently, DeepSeek-R1 has gained significant attention due to its affordability and powerful performance. Additionally, the official release of several distilled models in various sizes makes it possible for consumer-grade hardware to experience the capabilities of reasoning models.
- [deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B)
- [deepseek-ai/DeepSeek-R1-Distill-Qwen-7B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-7B)
- [deepseek-ai/DeepSeek-R1-Distill-Llama-8B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Llama-8B)
- [deepseek-ai/DeepSeek-R1-Distill-Qwen-14B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-14B)
- [deepseek-ai/DeepSeek-R1-Distill-Qwen-32B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-32B)
- [deepseek-ai/DeepSeek-R1-Distill-Llama-70B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Llama-70B)
However, it is important to note that these distilled models are far from the full DeepSeek-R1 model. For instance, `DeepSeek-R1-Distill-Qwen-32B` only reaches the level of o1-mini.
This can be seen in the official [chart](https://raw.githubusercontent.com/deepseek-ai/DeepSeek-R1/main/figures/benchmark.jpg) (the chart below is interactive and you can turn off data that you do not want to see).
{{< echarts >}}
{
"tooltip": {
"trigger": "axis",
"axisPointer": {
"type": "shadow"
}
},
"legend": {
"top": 30,
"data": [
"DeepSeek-R1",
"OpenAI-o1-1217",
"DeepSeek-R1-32B",
"OpenAI-o1-mini",
"DeepSeek-V3"
]
},
"grid": {
"left": "8%",
"right": "8%",
"bottom": "10%",
"containLabel": true
},
"xAxis": {
"type": "category",
"data": [
"AIME 2024\n(Pass@1)",
"Codeforces\n(Percentile)",
"GPQA Diamond\n(Pass@1)",
"MATH-500\n(Pass@1)",
"MMLU\n(Pass@1)",
"SWE-bench Verified\n(Resolved)"
],
"axisLabel": {
"interval": 0
}
},
"yAxis": {
"type": "value",
"min": 0,
"max": 100,
"name": "Accuracy / Percentile (%)",
"nameGap": 32,
"nameLocation": "center"
},
"series": [
{
"name": "DeepSeek-R1",
"type": "bar",
"data": [79.8, 96.3, 71.5, 97.3, 90.8, 49.2],
"barGap": "0",
"label": {
"show": true,
"position": "top"
}
},
{
"name": "OpenAI-o1-1217",
"type": "bar",
"data": [79.2, 96.6, 75.7, 96.4, 91.8, 48.9],
"label": {
"show": true,
"position": "top"
}
},
{
"name": "DeepSeek-R1-32B",
"type": "bar",
"data": [72.6, 90.6, 62.1, 94.3, 87.4, 36.8],
"label": {
"show": true,
"position": "top"
}
},
{
"name": "OpenAI-o1-mini",
"type": "bar",
"data": [63.6, 93.4, 60.0, 90.0, 85.2, 41.6],
"label": {
"show": true,
"position": "top"
}
},
{
"name": "DeepSeek-V3",
"type": "bar",
"data": [39.2, 58.7, 59.1, 90.2, 88.5, 42.0],
"label": {
"show": true,
"position": "top"
}
}
]
}
{{< /echarts >}}
Ollama provides a convenient interface and tooling for using and managing models, with llama.cpp as its backend; llama.cpp is optimized for CPU inference but also supports GPU acceleration.
## Installation of Ollama
Follow the instructions on [Download Ollama](https://ollama.com/download) to complete the installation. My environment is as follows:
- Operating system: Windows 11
- GPU: NVIDIA RTX 4090
- CPU: Intel 13900K
- Memory: 128 GB DDR5
## Creating Models
After installing Ollama, we need to create models. One way is to pull from the [Ollama Library](https://ollama.com/library/deepseek-r1:32b-qwen-distill-q4_K_M).
```bash
ollama pull deepseek-r1:32b-qwen-distill-q4_K_M
```
However, a model pulled this way has a default context length of 4096 tokens, which is insufficient and not a reasonable default, so we need to change it.
One way is to edit the Modelfile directly. If you do not know what a model's Modelfile contains, run the following command to print it.
```bash
ollama show --modelfile deepseek-r1:32b-qwen-distill-q4_K_M
```
Here is the Modelfile I use; save it as a new text file, for example `DeepSeek-R1-Distill-Qwen-32B-Q4_K_M.txt`.
```text
FROM deepseek-r1:32b-qwen-distill-q4_K_M
TEMPLATE """{{- if .System }}{{ .System }}{{ end }}
{{- range $i, $_ := .Messages }}
{{- $last := eq (len (slice $.Messages $i)) 1}}
{{- if eq .Role "user" }}<|User|>{{ .Content }}
{{- else if eq .Role "assistant" }}<|Assistant|>{{ .Content }}{{- if not $last }}<|end▁of▁sentence|>{{- end }}
{{- end }}
{{- if and $last (ne .Role "assistant") }}<|Assistant|>{{- end }}
{{- end }}"""
PARAMETER stop <|begin▁of▁sentence|>
PARAMETER stop <|end▁of▁sentence|>
PARAMETER stop <|User|>
PARAMETER stop <|Assistant|>
PARAMETER num_ctx 16000
```
The file has several parts, but we only need to change the `FROM` statement (which tells Ollama which model to build from) and the value of `num_ctx`, the context length (default 4096 unless overridden per request through the API). Here I set it to `16000`. The longer the context, the more memory and compute are consumed.
> [!NOTE]
>
> After testing, an RTX 4090 can run the 32B q4_K_M quantized model with the KV cache quantized to q8_0 and Flash Attention enabled while keeping a 16K context length. Under the same configuration, the 14B q4_K_M quantized model can reach a 64K context length. I will cover KV cache quantization and Flash Attention later.
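As mentioned above, `num_ctx` can also be overridden per request through Ollama's REST API instead of being baked into the Modelfile. A minimal sketch, assuming Ollama is listening on its default port (the prompt is just a placeholder):
```bash
curl http://localhost:11434/api/generate -d '{
  "model": "deepseek-r1:32b-qwen-distill-q4_K_M",
  "prompt": "Why is the sky blue?",
  "stream": false,
  "options": { "num_ctx": 16000 }
}'
```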
After creating the Modelfile, we can create the model using the following command:
```bash
ollama create DeepSeek-R1-Distill-Qwen-32B-Q4_K_M -f DeepSeek-R1-Distill-Qwen-32B-Q4_K_M.txt
```
> [!TIP]
>
> The format is as follows:
> `ollama create <name of the model to be created> -f <path and name of Modelfile>`
During this process, Ollama will pull the base model and build the new one. When it finishes, run `ollama list` to check the model list; you should see something similar to this:
```console
PS C:\Users\james\Desktop\Ollama> ollama list
NAME ID SIZE MODIFIED
DeepSeek-R1-Distill-Qwen-32B-Q4_K_M:latest ca51e8a9d628 19 GB 2 days ago
deepseek-r1:32b-qwen-distill-q4_K_M 5de93a84837d 19 GB 2 days ago
```
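To confirm the new model actually loads, a quick `ollama run` session works; the one-off prompt below is just a placeholder:
```bash
# Interactive chat with the newly created model
ollama run DeepSeek-R1-Distill-Qwen-32B-Q4_K_M
# Or a one-off prompt
ollama run DeepSeek-R1-Distill-Qwen-32B-Q4_K_M "Explain KV cache quantization in one sentence."
```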
## Optimization
Ollama supports several optimization parameters, controlled through environment variables.
- `OLLAMA_FLASH_ATTENTION`: Set to `1` to enable, `0` to disable.
- `OLLAMA_HOST`: The address Ollama listens on. The default is `127.0.0.1`; change it to `0.0.0.0` if you want to serve external clients.
- `OLLAMA_KV_CACHE_TYPE`: KV cache quantization type. The default is `fp16`; it can be set to `q8_0` or `q4_0`.
- `OLLAMA_NUM_PARALLEL`: Number of requests served in parallel. Higher values increase throughput but also memory consumption; `1` is generally enough.
- `OLLAMA_ORIGINS`: CORS allow-list. If you call Ollama from another origin (especially a different domain), add that origin here or set `*` to allow all origins.
Flash Attention should always be enabled, and for the KV cache I recommend `q8_0`. In my tests, `q4_0` noticeably shortens R1's reasoning, possibly because its chains of thought are long and context quality matters more.
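A rough way to gauge the effect of these settings is to load the model and check its footprint with `ollama ps`, which lists each loaded model's size and whether it is running on the GPU or CPU; comparing the numbers before and after changing `OLLAMA_KV_CACHE_TYPE` should show the difference:
```bash
ollama run DeepSeek-R1-Distill-Qwen-32B-Q4_K_M "hi"
ollama ps   # compare the SIZE and PROCESSOR columns across settings
```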
### Windows 11
To set environment variables on Windows 11, go to "Advanced System Settings," then choose "Environment Variables." After that, select "New" to add a new variable. Restart Ollama for changes to take effect.
### macOS
On macOS, you can set the variables with commands like the following:
```bash
launchctl setenv OLLAMA_FLASH_ATTENTION "1"
launchctl setenv OLLAMA_KV_CACHE_TYPE "q8_0"
```
Restart Ollama after setting environment variables.
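To double-check that launchd picked up the values, `launchctl getenv` should echo them back:
```bash
launchctl getenv OLLAMA_FLASH_ATTENTION
launchctl getenv OLLAMA_KV_CACHE_TYPE
```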
### Linux
On Linux, after installing Ollama, edit the `ollama.service` unit to set its environment variables:
```bash
sudo systemctl edit ollama.service
```
Then add the `Environment` field under `[Service]`, like this:
```text
[Service]
Environment="OLLAMA_FLASH_ATTENTION=1"
Environment="OLLAMA_KV_CACHE_TYPE=q8_0"
```
Save, then reload systemd and restart the service:
```bash
sudo systemctl daemon-reload
sudo systemctl restart ollama
```
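To verify the overrides took effect, inspecting the unit's environment and the recent service logs should help:
```bash
systemctl show ollama --property=Environment   # confirm the variables were applied
journalctl -u ollama -n 50 --no-pager          # inspect recent server logs
```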
## Limitations
The llama.cpp backend used by Ollama is not designed for high-concurrency, high-performance production environments. For example, its multi-GPU support is suboptimal: it splits a model's layers across GPUs, which solves the memory problem but means only one GPU is working at any given moment. Using multiple GPUs simultaneously requires tensor parallelism, which is what SGLang and vLLM are better suited for.
In terms of performance, Ollama's throughput falls well short of SGLang and vLLM, and its multi-modal model support is limited, with new models adapted slowly.
## Clients
For easier use of models within Ollama, I recommend two clients. Cherry Studio is a local client that I find useful, while LobeChat is a cloud-based client (I previously wrote an article on deploying the database version of LobeChat using Docker Compose).
{{< gh-repo-card-container >}}
{{< gh-repo-card repo="CherryHQ/cherry-studio" >}}
{{< gh-repo-card repo="lobehub/lobe-chat" >}}
{{< gh-repo-card repo="Calcium-Ion/new-api" >}}
{{< gh-repo-card repo="immersive-translate/immersive-translate" >}}
{{< /gh-repo-card-container >}}
New API is a tool I find useful for centrally managing API providers and exposing them in the OpenAI API format. Immersive Translate is a highly rated translation extension that can call an OpenAI-compatible API for translation, so it pairs naturally with Ollama and New API; the translation quality far exceeds traditional machine translation.
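If you want to wire things up yourself, Ollama also exposes an OpenAI-compatible endpoint under `/v1`, which OpenAI-format clients such as New API and Immersive Translate can point at. A minimal sketch, assuming the default port and using the model created earlier:
```bash
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "DeepSeek-R1-Distill-Qwen-32B-Q4_K_M",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
```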

View File

@@ -406,7 +406,7 @@ A: Same as Uber. Keep one decimal place; simply truncate. For example, if
Since the flowchart is quite large, I have temporarily converted it into an image.
{{< image src="csci-1200-hw-2-flowchart-zh_cn.svg" caption="Flow Chart" >}}
{{< image src="csci-1200-hw-2-flowchart-zh_cn.svg" width="100%" caption="Flow Chart" >}}
The Mermaid source code is as follows:

View File

@@ -22,7 +22,7 @@ tags:
categories:
- LLM
collections:
- Ollama
- LLM
hiddenFromHomePage: false
hiddenFromSearch: false
hiddenFromRss: false

View File

@@ -0,0 +1,308 @@
---
title: Deploying DeepSeek R1 Distill Series Models on RTX 4090 with Ollama and Optimization
subtitle:
date: 2025-02-08T18:29:29-05:00
lastmod: 2025-02-08T18:29:29-05:00
slug: ollama-deepseek-r1-distill
draft: false
author:
name: James
link: https://www.jamesflare.com
email:
avatar: /site-logo.avif
description: This post details how to run DeepSeek-R1 and its distilled models on consumer-grade hardware and discusses performance optimizations and limitations. It also walks through installing Ollama, creating customized models, and efficiency techniques such as Flash Attention and KV cache quantization.
keywords: ["DeepSeek-R1","Ollama","KV Cache","Flash Attention"]
license:
comment: true
weight: 0
tags:
- LLM
- llama.cpp
- Quantization
- Ollama
categories:
- LLM
collections:
- LLM
hiddenFromHomePage: false
hiddenFromSearch: false
hiddenFromRss: false
hiddenFromRelated: false
summary: This post details how to run DeepSeek-R1 and its distilled models on consumer-grade hardware and discusses performance optimizations and limitations. It also walks through installing Ollama, creating customized models, and efficiency techniques such as Flash Attention and KV cache quantization.
resources:
- name: featured-image
src: featured-image.jpg
- name: featured-image-preview
src: featured-image-preview.jpg
toc: true
math: false
lightgallery: true
password:
message:
repost:
enable: false
url:
# See details front matter: https://fixit.lruihao.cn/documentation/content-management/introduction/#front-matter
---
<!--more-->
## Introduction
DeepSeek-R1 has recently taken off for several reasons: it is inexpensive, performs strongly, and is open source. Even better, the team has officially released several distilled models in a range of sizes.
- [deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B)
- [deepseek-ai/DeepSeek-R1-Distill-Qwen-7B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-7B)
- [deepseek-ai/DeepSeek-R1-Distill-Llama-8B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Llama-8B)
- [deepseek-ai/DeepSeek-R1-Distill-Qwen-14B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-14B)
- [deepseek-ai/DeepSeek-R1-Distill-Qwen-32B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-32B)
- [deepseek-ai/DeepSeek-R1-Distill-Llama-70B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Llama-70B)
This gives ordinary consumer-grade hardware a chance to experience reasoning models. Note, however, that these distilled models are a far cry from the real DeepSeek-R1; even `DeepSeek-R1-Distill-Qwen-32B` only reaches roughly the level of o1-mini.
You can see this in the official [chart](https://raw.githubusercontent.com/deepseek-ai/DeepSeek-R1/main/figures/benchmark.jpg) (the chart below is interactive; you can toggle off any series you do not want to see).
{{< echarts >}}
{
"tooltip": {
"trigger": "axis",
"axisPointer": {
"type": "shadow"
}
},
"legend": {
"top": 30,
"data": [
"DeepSeek-R1",
"OpenAI-o1-1217",
"DeepSeek-R1-32B",
"OpenAI-o1-mini",
"DeepSeek-V3"
]
},
"grid": {
"left": "8%",
"right": "8%",
"bottom": "10%",
"containLabel": true
},
"xAxis": {
"type": "category",
"data": [
"AIME 2024\n(Pass@1)",
"Codeforces\n(Percentile)",
"GPQA Diamond\n(Pass@1)",
"MATH-500\n(Pass@1)",
"MMLU\n(Pass@1)",
"SWE-bench Verified\n(Resolved)"
],
"axisLabel": {
"interval": 0
}
},
"yAxis": {
"type": "value",
"min": 0,
"max": 100,
"name": "Accuracy / Percentile (%)",
"nameGap": 32,
"nameLocation": "center"
},
"series": [
{
"name": "DeepSeek-R1",
"type": "bar",
"data": [79.8, 96.3, 71.5, 97.3, 90.8, 49.2],
"barGap": "0",
"label": {
"show": true,
"position": "top"
}
},
{
"name": "OpenAI-o1-1217",
"type": "bar",
"data": [79.2, 96.6, 75.7, 96.4, 91.8, 48.9],
"label": {
"show": true,
"position": "top"
}
},
{
"name": "DeepSeek-R1-32B",
"type": "bar",
"data": [72.6, 90.6, 62.1, 94.3, 87.4, 36.8],
"label": {
"show": true,
"position": "top"
}
},
{
"name": "OpenAI-o1-mini",
"type": "bar",
"data": [63.6, 93.4, 60.0, 90.0, 85.2, 41.6],
"label": {
"show": true,
"position": "top"
}
},
{
"name": "DeepSeek-V3",
"type": "bar",
"data": [39.2, 58.7, 59.1, 90.2, 88.5, 42.0],
"label": {
"show": true,
"position": "top"
}
}
]
}
{{< /echarts >}}
Ollama provides a more convenient interface and tooling for using and managing models, with llama.cpp as its backend; llama.cpp is optimized for CPU inference but also supports GPU acceleration.
{{< gh-repo-card-container >}}
{{< gh-repo-card repo="ollama/ollama" >}}
{{< gh-repo-card repo="ggerganov/llama.cpp" >}}
{{< /gh-repo-card-container >}}
## Installing Ollama
Just follow the instructions at [Download Ollama](https://ollama.com/download). My environment is as follows:
- Operating system: Windows 11
- GPU: NVIDIA RTX 4090
- CPU: Intel 13900K
- Memory: 128 GB DDR5
## Creating Models
After installing Ollama, we need to create models. One way is to pull directly from the [Ollama Library](https://ollama.com/library/deepseek-r1:32b-qwen-distill-q4_K_M).
```bash
ollama pull deepseek-r1:32b-qwen-distill-q4_K_M
```
However, a model pulled this way has a default context length of 4096 tokens, which is insufficient and not a reasonable default, so we need to change it.
One way is to edit the Modelfile directly. If you do not know what a model's Modelfile contains, run the following command to print it.
```bash
ollama show --modelfile deepseek-r1:32b-qwen-distill-q4_K_M
```
Here is the Modelfile I use; save it as a new text file, for example `DeepSeek-R1-Distill-Qwen-32B-Q4_K_M.txt`.
```text
FROM deepseek-r1:32b-qwen-distill-q4_K_M
TEMPLATE """{{- if .System }}{{ .System }}{{ end }}
{{- range $i, $_ := .Messages }}
{{- $last := eq (len (slice $.Messages $i)) 1}}
{{- if eq .Role "user" }}<|User|>{{ .Content }}
{{- else if eq .Role "assistant" }}<|Assistant|>{{ .Content }}{{- if not $last }}<|end▁of▁sentence|>{{- end }}
{{- end }}
{{- if and $last (ne .Role "assistant") }}<|Assistant|>{{- end }}
{{- end }}"""
PARAMETER stop <|begin▁of▁sentence|>
PARAMETER stop <|end▁of▁sentence|>
PARAMETER stop <|User|>
PARAMETER stop <|Assistant|>
PARAMETER num_ctx 16000
```
The file has several parts, but for now we only need to change the `FROM` statement (which tells Ollama which model to build from) and the value of `num_ctx`, the context length (default 4096 unless overridden per request through the API). Here I set it to `16000`. The longer the context, the more VRAM/RAM and compute are consumed.
> [!NOTE]
>
> After testing, an RTX 4090 can run the 32B q4_K_M quantized model with the KV cache quantized to q8_0 and Flash Attention enabled while keeping a 16K context length. Under the same configuration, the 14B q4_K_M quantized model can reach a 64K context length. I will cover KV cache quantization and Flash Attention later.
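As mentioned above, `num_ctx` can also be overridden per request through Ollama's REST API instead of being baked into the Modelfile. A minimal sketch, assuming Ollama is listening on its default port (the prompt is just a placeholder):
```bash
curl http://localhost:11434/api/generate -d '{
  "model": "deepseek-r1:32b-qwen-distill-q4_K_M",
  "prompt": "Why is the sky blue?",
  "stream": false,
  "options": { "num_ctx": 16000 }
}'
```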
Once the Modelfile is ready, we can create the model with the following command:
```bash
ollama create DeepSeek-R1-Distill-Qwen-32B-Q4_K_M -f DeepSeek-R1-Distill-Qwen-32B-Q4_K_M.txt
```
> [!TIP]
>
> The format is as follows:
> `ollama create <name of the model to create> -f <path to the Modelfile>`
During this process, Ollama will pull the base model and build the new one. When it finishes, run `ollama list` to check the model list; you should see something similar to this:
```console
PS C:\Users\james\Desktop\Ollama> ollama list
NAME ID SIZE MODIFIED
DeepSeek-R1-Distill-Qwen-32B-Q4_K_M:latest ca51e8a9d628 19 GB 2 days ago
deepseek-r1:32b-qwen-distill-q4_K_M 5de93a84837d 19 GB 2 days ago
```
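To confirm the new model actually loads, a quick `ollama run` session works; the one-off prompt below is just a placeholder:
```bash
# Interactive chat with the newly created model
ollama run DeepSeek-R1-Distill-Qwen-32B-Q4_K_M
# Or a one-off prompt
ollama run DeepSeek-R1-Distill-Qwen-32B-Q4_K_M "Explain KV cache quantization in one sentence."
```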
## Optimization
Ollama supports several optimization parameters, controlled through environment variables.
- `OLLAMA_FLASH_ATTENTION`: Set to `1` to enable, `0` to disable.
- `OLLAMA_HOST`: The address Ollama listens on. The default is `127.0.0.1`; change it to `0.0.0.0` if you want to serve external clients.
- `OLLAMA_KV_CACHE_TYPE`: KV cache quantization type. The default is `fp16`; it can be set to `q8_0` or `q4_0`.
- `OLLAMA_NUM_PARALLEL`: Number of requests served in parallel. Higher values increase throughput but also VRAM/RAM consumption; `1` is generally enough.
- `OLLAMA_ORIGINS`: CORS allow-list. If you call Ollama from another origin (especially a different domain), add that origin here or set `*` to allow all origins.
Flash Attention should always be enabled, and for the KV cache I recommend `q8_0`. In my tests, `q4_0` noticeably shortens R1's reasoning, possibly because its chains of thought are long and context quality matters more.
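A rough way to gauge the effect of these settings is to load the model and check its footprint with `ollama ps`, which lists each loaded model's size and whether it is running on the GPU or CPU; comparing the numbers before and after changing `OLLAMA_KV_CACHE_TYPE` should show the difference:
```bash
ollama run DeepSeek-R1-Distill-Qwen-32B-Q4_K_M "hi"
ollama ps   # compare the SIZE and PROCESSOR columns across settings
```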
### Windows 11
To set environment variables on Windows 11, open "Advanced System Settings", choose "Environment Variables", and click "New" to add each variable. Restart Ollama for the changes to take effect.
### macOS
On macOS, you can set the variables with commands such as:
```bash
launchctl setenv OLLAMA_FLASH_ATTENTION "1"
launchctl setenv OLLAMA_KV_CACHE_TYPE "q8_0"
```
Restart Ollama for the changes to take effect.
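To double-check that launchd picked up the values, `launchctl getenv` should echo them back:
```bash
launchctl getenv OLLAMA_FLASH_ATTENTION
launchctl getenv OLLAMA_KV_CACHE_TYPE
```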
### Linux
On Linux, after installing Ollama, edit the `ollama.service` unit to set its environment variables:
```bash
sudo systemctl edit ollama.service
```
Then add `Environment` entries under `[Service]`, like this:
```text
[Service]
Environment="OLLAMA_FLASH_ATTENTION=1"
Environment="OLLAMA_KV_CACHE_TYPE=q8_0"
```
Save, then reload systemd and restart the service:
```bash
sudo systemctl daemon-reload
sudo systemctl restart ollama
```
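To verify the overrides took effect, inspecting the unit's environment and the recent service logs should help:
```bash
systemctl show ollama --property=Environment   # confirm the variables were applied
journalctl -u ollama -n 50 --no-pager          # inspect recent server logs
```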
## Limitations
The llama.cpp backend used by Ollama is not designed for high-concurrency, high-performance production environments. For example, its multi-GPU support is not ideal: it splits a model's layers across multiple GPUs, which solves the out-of-memory problem but means only one GPU is working at any given moment. Using multiple GPUs simultaneously requires tensor parallelism, which is what SGLang and vLLM excel at.
As for performance, Ollama also falls short of SGLang and vLLM, with far lower throughput. In addition, its support for multi-modal models is limited, and new models are adapted slowly.
## Clients
To make the models in Ollama easier to use, I recommend two clients. Cherry Studio is a local client I find useful, and LobeChat is a cloud-based client I like (I previously wrote [Deploying the LobeChat Server-Side Database Version with Docker Compose](../install-lobechat-db/)).
{{< gh-repo-card-container >}}
{{< gh-repo-card repo="CherryHQ/cherry-studio" >}}
{{< gh-repo-card repo="lobehub/lobe-chat" >}}
{{< gh-repo-card repo="Calcium-Ion/new-api" >}}
{{< gh-repo-card repo="immersive-translate/immersive-translate" >}}
{{< /gh-repo-card-container >}}
New API is a tool I find useful for centrally managing API providers and exposing them in the OpenAI API format. Immersive Translate is a highly rated translation extension that can call an OpenAI-compatible API for translation, so it pairs naturally with Ollama and New API; the translation quality far exceeds traditional machine translation.
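If you want to wire things up yourself, Ollama also exposes an OpenAI-compatible endpoint under `/v1`, which OpenAI-format clients such as New API and Immersive Translate can point at. A minimal sketch, assuming the default port and using the model created earlier:
```bash
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "DeepSeek-R1-Distill-Qwen-32B-Q4_K_M",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
```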