diff --git a/config/_default/hugo.toml b/config/_default/hugo.toml index 8dcc14d..aea489a 100644 --- a/config/_default/hugo.toml +++ b/config/_default/hugo.toml @@ -25,7 +25,8 @@ languageName = "English" # whether to include Chinese/Japanese/Korean hasCJKLanguage = true # default amount of posts in each pages -paginate = 12 +[pagination] + pagerSize = 12 # copyright description used only for seo schema copyright = "" # whether to use robots.txt diff --git a/content/en/posts/csci-1200/hw-2/index.md b/content/en/posts/csci-1200/hw-2/index.md index a6b0e0e..7114389 100644 --- a/content/en/posts/csci-1200/hw-2/index.md +++ b/content/en/posts/csci-1200/hw-2/index.md @@ -412,7 +412,7 @@ Due to the complexity of this assignment, it is best to carefully plan how to im Since the flowchart is quite large, I have temporarily converted it into an image. -{{< image src="csci-1200-hw-2-flowchart-zh_cn.svg" caption="Flow Chart" >}} +{{< image src="csci-1200-hw-2-flowchart-zh_cn.svg" width="100%" caption="Flow Chart" >}} Mermaid source code as follows: diff --git a/content/en/posts/llama-cpp/ollama-with-deepseek-r1/index.md b/content/en/posts/llama-cpp/ollama-with-deepseek-r1/index.md new file mode 100644 index 0000000..e22ae39 --- /dev/null +++ b/content/en/posts/llama-cpp/ollama-with-deepseek-r1/index.md @@ -0,0 +1,303 @@ +--- +title: Deploying DeepSeek R1 Distill Series Models on RTX 4090 with Ollama and Optimization +subtitle: +date: 2025-02-08T18:29:29-05:00 +lastmod: 2025-02-08T18:29:29-05:00 +slug: ollama-deepseek-r1-distill +draft: false +author: + name: James + link: https://www.jamesflare.com + email: + avatar: /site-logo.avif +description: This blog post explores the installation, optimization, and usage of DeepSeek-R1's distilled models in Ollama on Windows 11, MacOS, and Linux, highlighting performance and limitations. +keywords: ["DeepSeek-R1","Ollama","KV Cache","Flash Attention"] +license: +comment: true +weight: 0 +tags: + - LLM + - llama.cpp + - Quantization + - Ollama +categories: + - LLM +collections: + - Ollama +hiddenFromHomePage: false +hiddenFromSearch: false +hiddenFromRss: false +hiddenFromRelated: false +summary: This blog post explores the installation, optimization, and usage of DeepSeek-R1's distilled models in Ollama on Windows 11, MacOS, and Linux, highlighting performance and limitations. +resources: + - name: featured-image + src: featured-image.jpg + - name: featured-image-preview + src: featured-image-preview.jpg +toc: true +math: false +lightgallery: true +password: +message: +repost: + enable: false + url: + +# See details front matter: https://fixit.lruihao.cn/documentation/content-management/introduction/#front-matter +--- + + + +## Introduction + +Recently, DeepSeek-R1 has gained significant attention due to its affordability and powerful performance. Additionally, the official release of several distilled models in various sizes makes it possible for consumer-grade hardware to experience the capabilities of reasoning models. 
+ +- [deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B) +- [deepseek-ai/DeepSeek-R1-Distill-Qwen-7B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-7B) +- [deepseek-ai/DeepSeek-R1-Distill-Llama-8B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Llama-8B) +- [deepseek-ai/DeepSeek-R1-Distill-Qwen-14B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-14B) +- [deepseek-ai/DeepSeek-R1-Distill-Qwen-32B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-32B) +- [deepseek-ai/DeepSeek-R1-Distill-Llama-70B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Llama-70B) + +However, it is important to note that these distilled models are far from the full DeepSeek-R1 model. For instance, `DeepSeek-R1-Distill-Qwen-32B` only reaches the level of o1-mini. + +This can be seen in the official [chart](https://raw.githubusercontent.com/deepseek-ai/DeepSeek-R1/main/figures/benchmark.jpg) (the chart below is interactive and you can turn off data that you do not want to see). + +{{< echarts >}} +{ + "tooltip": { + "trigger": "axis", + "axisPointer": { + "type": "shadow" + } + }, + "legend": { + "top": 30, + "data": [ + "DeepSeek-R1", + "OpenAI-o1-1217", + "DeepSeek-R1-32B", + "OpenAI-o1-mini", + "DeepSeek-V3" + ] + }, + "grid": { + "left": "8%", + "right": "8%", + "bottom": "10%", + "containLabel": true + }, + "xAxis": { + "type": "category", + "data": [ + "AIME 2024\n(Pass@1)", + "Codeforces\n(Percentile)", + "GPQA Diamond\n(Pass@1)", + "MATH-500\n(Pass@1)", + "MMLU\n(Pass@1)", + "SWE-bench Verified\n(Resolved)" + ], + "axisLabel": { + "interval": 0 + } + }, + "yAxis": { + "type": "value", + "min": 0, + "max": 100, + "name": "Accuracy / Percentile (%)", + "nameGap": 32, + "nameLocation": "center" + }, + "series": [ + { + "name": "DeepSeek-R1", + "type": "bar", + "data": [79.8, 96.3, 71.5, 97.3, 90.8, 49.2], + "barGap": "0", + "label": { + "show": true, + "position": "top" + } + }, + { + "name": "OpenAI-o1-1217", + "type": "bar", + "data": [79.2, 96.6, 75.7, 96.4, 91.8, 48.9], + "label": { + "show": true, + "position": "top" + } + }, + { + "name": "DeepSeek-R1-32B", + "type": "bar", + "data": [72.6, 90.6, 62.1, 94.3, 87.4, 36.8], + "label": { + "show": true, + "position": "top" + } + }, + { + "name": "OpenAI-o1-mini", + "type": "bar", + "data": [63.6, 93.4, 60.0, 90.0, 85.2, 41.6], + "label": { + "show": true, + "position": "top" + } + }, + { + "name": "DeepSeek-V3", + "type": "bar", + "data": [39.2, 58.7, 59.1, 90.2, 88.5, 42.0], + "label": { + "show": true, + "position": "top" + } + } + ] +} +{{< /echarts >}} + +Ollama provides a convenient interface and tools for using and managing models, with the backend being llama.cpp. It supports both CPU and GPU inference optimization. + +## Installation of Ollama + +Follow the instructions on [Download Ollama](https://ollama.com/download) to complete the installation. My environment is as follows: + +- Operating system: Windows 11 +- GPU: NVIDIA RTX 4090 +- CPU: Intel 13900K +- Memory: 128G DDR5 + +## Creating Models + +After installing Ollama, we need to create models. One way is to pull from the [Ollama Library](https://ollama.com/library/deepseek-r1:32b-qwen-distill-q4_K_M). + +```bash +ollama pull deepseek-r1:32b-qwen-distill-q4_K_M +``` + +However, the default context length of this pulled model is 4096. This is insufficient and unreasonable, so we need to modify it. + +One way is to directly edit the Modelfile. 
+If you are not sure what a model's Modelfile looks like, you can print it with the following command.
+
+```bash
+ollama show --modelfile deepseek-r1:32b-qwen-distill-q4_K_M
+```
+
+Here is the Modelfile I use; save it as a new text file, for example `DeepSeek-R1-Distill-Qwen-32B-Q4_K_M.txt`.
+
+```text
+FROM deepseek-r1:32b-qwen-distill-q4_K_M
+
+TEMPLATE """{{- if .System }}{{ .System }}{{ end }}
+{{- range $i, $_ := .Messages }}
+{{- $last := eq (len (slice $.Messages $i)) 1}}
+{{- if eq .Role "user" }}<|User|>{{ .Content }}
+{{- else if eq .Role "assistant" }}<|Assistant|>{{ .Content }}{{- if not $last }}<|end▁of▁sentence|>{{- end }}
+{{- end }}
+{{- if and $last (ne .Role "assistant") }}<|Assistant|>{{- end }}
+{{- end }}"""
+PARAMETER stop <|begin▁of▁sentence|>
+PARAMETER stop <|end▁of▁sentence|>
+PARAMETER stop <|User|>
+PARAMETER stop <|Assistant|>
+PARAMETER num_ctx 16000
+```
+
+The Modelfile has several parts, but we only need to change the `FROM` statement (which tells Ollama which model to build from) and the value of `num_ctx` (the context length; it defaults to 4096 unless overridden in an API request). Here I set it to `16000`. The longer the context, the more memory and compute it consumes.
+
+> [!NOTE]
+>
+> In my tests, an RTX 4090 can run a 32B q4_K_M quantized model with the KV cache quantized to q8_0 and Flash Attention enabled while maintaining a 16K context length. Under the same settings, a 14B q4_K_M quantized model can reach a 64K context length. I will explain KV cache quantization and Flash Attention later.
+
+After creating the Modelfile, we can create the model with the following command:
+
+```bash
+ollama create DeepSeek-R1-Distill-Qwen-32B-Q4_K_M -f DeepSeek-R1-Distill-Qwen-32B-Q4_K_M.txt
+```
+
+> [!TIP]
+>
+> The format is as follows:
+> `ollama create <model name> -f <path to Modelfile>`
+
+During this process, Ollama will pull the model and create it. Once it finishes, run `ollama list` to check the model list; you should see something like this.
+
+```console
+PS C:\Users\james\Desktop\Ollama> ollama list
+NAME                                           ID              SIZE     MODIFIED
+DeepSeek-R1-Distill-Qwen-32B-Q4_K_M:latest     ca51e8a9d628    19 GB    2 days ago
+deepseek-r1:32b-qwen-distill-q4_K_M            5de93a84837d    19 GB    2 days ago
+```
+
+## Optimization
+
+Ollama supports several optimization parameters, controlled through environment variables.
+
+- `OLLAMA_FLASH_ATTENTION`: Set to `1` to enable, `0` to disable.
+- `OLLAMA_HOST`: The IP address Ollama listens on. The default is `127.0.0.1`; change it to `0.0.0.0` if you want to serve other machines.
+- `OLLAMA_KV_CACHE_TYPE`: KV cache quantization type. The default is `fp16`; it can be set to `q8_0` or `q4_0`.
+- `OLLAMA_NUM_PARALLEL`: Number of requests served in parallel. More means higher throughput but also higher memory consumption; `1` is generally enough.
+- `OLLAMA_ORIGINS`: Allowed CORS origins. If you call Ollama from a page on a different origin, set that origin here, or use `*` to allow all.
+
+Flash Attention should definitely be enabled, and I recommend setting `OLLAMA_KV_CACHE_TYPE` to `q8_0`. In my tests, `q4_0` shortens R1's reasoning, probably because its outputs are long and the context matters a great deal.
+
+### Windows 11
+
+To set environment variables on Windows 11, open "Advanced System Settings," choose "Environment Variables," and then select "New" to add each variable. Restart Ollama for the changes to take effect.
+
+### MacOS
+
+On MacOS, you can set the variables with commands like the following:
+
+```bash
+launchctl setenv OLLAMA_FLASH_ATTENTION "1"
+launchctl setenv OLLAMA_KV_CACHE_TYPE "q8_0"
+```
+
+Restart Ollama after setting the environment variables.
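+
+If you want to confirm that the variables were picked up, `launchctl getenv` reads them back; a quick sanity check (using the same variable names set above) might look like this:
+
+```bash
+# Read back the values set via launchctl setenv; an empty result means the variable is not set
+launchctl getenv OLLAMA_FLASH_ATTENTION
+launchctl getenv OLLAMA_KV_CACHE_TYPE
+```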
+
+### Linux
+
+On Linux, after installing Ollama, edit the `ollama.service` unit to change its environment variables:
+
+```bash
+sudo systemctl edit ollama.service
+```
+
+Then add `Environment` entries under `[Service]`, like this:
+
+```text
+[Service]
+Environment="OLLAMA_FLASH_ATTENTION=1"
+Environment="OLLAMA_KV_CACHE_TYPE=q8_0"
+```
+
+Save, then reload and restart the service:
+
+```bash
+sudo systemctl daemon-reload
+sudo systemctl restart ollama
+```
+
+## Limitations
+
+The llama.cpp backend used by Ollama is not designed for high-concurrency, high-performance production environments. For example, its multi-GPU support is suboptimal: it splits the model's layers across multiple GPUs, which solves the memory problem, but only one GPU is working at any given time. To use the performance of several GPUs simultaneously you need tensor parallelism, which SGLang and vLLM are better suited for.
+
+In terms of raw performance, Ollama's throughput falls well short of SGLang and vLLM, and its support for multimodal models is limited, with slow adaptation progress.
+
+## Clients
+
+To make the models in Ollama easier to use, I recommend two clients. Cherry Studio is a local client that I find useful, while LobeChat is a cloud-based client I like (I previously wrote an article on deploying the database version of LobeChat with Docker Compose).
+
+{{< gh-repo-card-container >}}
+  {{< gh-repo-card repo="CherryHQ/cherry-studio" >}}
+  {{< gh-repo-card repo="lobehub/lobe-chat" >}}
+  {{< gh-repo-card repo="Calcium-Ion/new-api" >}}
+  {{< gh-repo-card repo="immersive-translate/immersive-translate" >}}
+{{< /gh-repo-card-container >}}
+
+New API is a tool I find useful for centrally managing APIs and exposing them in the OpenAI API format. Immersive Translate is a highly rated translation plugin that can call the OpenAI API for translations, so it can also be combined with Ollama and New API. Its translation quality far exceeds traditional machine translation.
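+
+Under the hood, these clients all talk to Ollama over its HTTP API, so you can also test the model created earlier directly. Below is a minimal sketch against Ollama's OpenAI-compatible endpoint, assuming Ollama is listening on the default `127.0.0.1:11434`:
+
+```bash
+# Send a single chat request to the locally created model
+# (adjust the address if you changed OLLAMA_HOST)
+curl http://127.0.0.1:11434/v1/chat/completions \
+  -H "Content-Type: application/json" \
+  -d '{
+    "model": "DeepSeek-R1-Distill-Qwen-32B-Q4_K_M",
+    "messages": [{"role": "user", "content": "Why is the sky blue?"}]
+  }'
+```
+
+This is the same OpenAI-style interface that New API and Immersive Translate expect, which is why they can be pointed at Ollama directly.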
diff --git a/content/zh-cn/posts/csci-1200/hw-2/index.md b/content/zh-cn/posts/csci-1200/hw-2/index.md index 1f1766a..9073fcf 100644 --- a/content/zh-cn/posts/csci-1200/hw-2/index.md +++ b/content/zh-cn/posts/csci-1200/hw-2/index.md @@ -406,7 +406,7 @@ A: 与 Uber 相同。保留一位小数。直接截断即可。例如,如果 由于流程图比较大,我暂时把它转换成了图片。 -{{< image src="csci-1200-hw-2-flowchart-zh_cn.svg" caption="Flow Chart" >}} +{{< image src="csci-1200-hw-2-flowchart-zh_cn.svg" width="100%" caption="Flow Chart" >}} Mermaid 源码如下: diff --git a/content/zh-cn/posts/llama-cpp/compare-quantization-type/index.md b/content/zh-cn/posts/llama-cpp/compare-quantization-type/index.md index 695ec0d..fb41217 100644 --- a/content/zh-cn/posts/llama-cpp/compare-quantization-type/index.md +++ b/content/zh-cn/posts/llama-cpp/compare-quantization-type/index.md @@ -22,7 +22,7 @@ tags: categories: - 大语言模型 collections: - - Ollama + - LLM hiddenFromHomePage: false hiddenFromSearch: false hiddenFromRss: false diff --git a/content/zh-cn/posts/llama-cpp/ollama-with-deepseek-r1/index.md b/content/zh-cn/posts/llama-cpp/ollama-with-deepseek-r1/index.md new file mode 100644 index 0000000..34261e8 --- /dev/null +++ b/content/zh-cn/posts/llama-cpp/ollama-with-deepseek-r1/index.md @@ -0,0 +1,308 @@ +--- +title: 使用 Ollama 在RTX 4090上部署 DeepSeek R1 Distill 系列模型并优化 +subtitle: +date: 2025-02-08T18:29:29-05:00 +lastmod: 2025-02-08T18:29:29-05:00 +slug: ollama-deepseek-r1-distill +draft: false +author: + name: James + link: https://www.jamesflare.com + email: + avatar: /site-logo.avif +description: 本篇文章详细介绍了如何利用DeepSeek-R1及其蒸馏模型在消费级硬件上的应用,并探讨了其性能优化和不足之处。同时提供了安装Ollama及创建深度定制化模型的步骤,以及一些提高运行效率的方法,包括使用Flash Attention和KV Cache量化等技巧。 +keywords: ["DeepSeek-R1","Ollama","KV Cache","Flash Attention"] +license: +comment: true +weight: 0 +tags: + - 大语言模型 + - llama.cpp + - 量化 + - Ollama +categories: + - 大语言模型 +collections: + - LLM +hiddenFromHomePage: false +hiddenFromSearch: false +hiddenFromRss: false +hiddenFromRelated: false +summary: 本篇文章详细介绍了如何利用DeepSeek-R1及其蒸馏模型在消费级硬件上的应用,并探讨了其性能优化和不足之处。同时提供了安装Ollama及创建深度定制化模型的步骤,以及一些提高运行效率的方法,包括使用Flash Attention和KV Cache量化等技巧。 +resources: + - name: featured-image + src: featured-image.jpg + - name: featured-image-preview + src: featured-image-preview.jpg +toc: true +math: false +lightgallery: true +password: +message: +repost: + enable: false + url: + +# See details front matter: https://fixit.lruihao.cn/documentation/content-management/introduction/#front-matter +--- + + + +## 前言 + +最近DeepSeek-R1爆火,原因有多种。不光价格便宜,性能强劲还开源。更难能可贵的是官方放出了几个蒸馏模型,包含各个尺寸。 + +- [deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B) +- [deepseek-ai/DeepSeek-R1-Distill-Qwen-7B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-7B) +- [deepseek-ai/DeepSeek-R1-Distill-Llama-8B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Llama-8B) +- [deepseek-ai/DeepSeek-R1-Distill-Qwen-14B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-14B) +- [deepseek-ai/DeepSeek-R1-Distill-Qwen-32B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-32B) +- [deepseek-ai/DeepSeek-R1-Distill-Llama-70B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Llama-70B) + +这使得一般的消费级硬件也有机会体验Reasoning模型的魅力。不过请注意,这和真正的DeepSeek-R1相差甚远。即便是`DeepSeek-R1-Distill-Qwen-32B`也只是达到o1-mini级别的水平。 + +这一点可以参考官方给出的[图表](https://raw.githubusercontent.com/deepseek-ai/DeepSeek-R1/main/figures/benchmark.jpg)(下面这张图是可以交互的,你可以关闭你不想要的数据)。 + +{{< echarts >}} +{ + "tooltip": { + "trigger": "axis", + "axisPointer": { + "type": "shadow" + } + 
}, + "legend": { + "top": 30, + "data": [ + "DeepSeek-R1", + "OpenAI-o1-1217", + "DeepSeek-R1-32B", + "OpenAI-o1-mini", + "DeepSeek-V3" + ] + }, + "grid": { + "left": "8%", + "right": "8%", + "bottom": "10%", + "containLabel": true + }, + "xAxis": { + "type": "category", + "data": [ + "AIME 2024\n(Pass@1)", + "Codeforces\n(Percentile)", + "GPQA Diamond\n(Pass@1)", + "MATH-500\n(Pass@1)", + "MMLU\n(Pass@1)", + "SWE-bench Verified\n(Resolved)" + ], + "axisLabel": { + "interval": 0 + } + }, + "yAxis": { + "type": "value", + "min": 0, + "max": 100, + "name": "Accuracy / Percentile (%)", + "nameGap": 32, + "nameLocation": "center" + }, + "series": [ + { + "name": "DeepSeek-R1", + "type": "bar", + "data": [79.8, 96.3, 71.5, 97.3, 90.8, 49.2], + "barGap": "0", + "label": { + "show": true, + "position": "top" + } + }, + { + "name": "OpenAI-o1-1217", + "type": "bar", + "data": [79.2, 96.6, 75.7, 96.4, 91.8, 48.9], + "label": { + "show": true, + "position": "top" + } + }, + { + "name": "DeepSeek-R1-32B", + "type": "bar", + "data": [72.6, 90.6, 62.1, 94.3, 87.4, 36.8], + "label": { + "show": true, + "position": "top" + } + }, + { + "name": "OpenAI-o1-mini", + "type": "bar", + "data": [63.6, 93.4, 60.0, 90.0, 85.2, 41.6], + "label": { + "show": true, + "position": "top" + } + }, + { + "name": "DeepSeek-V3", + "type": "bar", + "data": [39.2, 58.7, 59.1, 90.2, 88.5, 42.0], + "label": { + "show": true, + "position": "top" + } + } + ] +} +{{< /echarts >}} + +Ollama提供了更方便使用和管理模型的接口和工具,后端是llama.cpp。基于CPU推理优化的工具,也支持GPU。 + +{{< gh-repo-card-container >}} + {{< gh-repo-card repo="ollama/ollama" >}} + {{< gh-repo-card repo="ggerganov/llama.cpp" >}} +{{< /gh-repo-card-container >}} + +## 安装Ollama + +这个根据[Download Ollama](https://ollama.com/download)的指引完成即可。我的环境如下: + +- 操作系统是Windows 11 +- GPU是NVIDIA RTX 4090 +- CPU是Intel 13900K +- 内存是128G DDR5 + +## 创建模型 + +在安装好Ollama后,我们就需要创建模型了。一种办法是直接从[Ollama Library](https://ollama.com/library/deepseek-r1:32b-qwen-distill-q4_K_M)拉取。 + +```bash +ollama pull deepseek-r1:32b-qwen-distill-q4_K_M +``` + +不过这样拉取的模型的默认上下文长度是4096。这显然不够用也不合理,所以我们要修改一下。 + +一种办法是直接修改Modelfile。如果你不知道一个模型的Modelfile可以执行以下命令查看它的Modelfile。 + +```bash +ollama show --modelfile deepseek-r1:32b-qwen-distill-q4_K_M +``` + +这里我给出我用的Modelfile,可以新建一个文本文件保存,比如叫做`DeepSeek-R1-Distill-Qwen-32B-Q4_K_M.txt`。 + +```text +FROM deepseek-r1:32b-qwen-distill-q4_K_M + +TEMPLATE """{{- if .System }}{{ .System }}{{ end }} +{{- range $i, $_ := .Messages }} +{{- $last := eq (len (slice $.Messages $i)) 1}} +{{- if eq .Role "user" }}<|User|>{{ .Content }} +{{- else if eq .Role "assistant" }}<|Assistant|>{{ .Content }}{{- if not $last }}<|end▁of▁sentence|>{{- end }} +{{- end }} +{{- if and $last (ne .Role "assistant") }}<|Assistant|>{{- end }} +{{- end }}""" +PARAMETER stop <|begin▁of▁sentence|> +PARAMETER stop <|end▁of▁sentence|> +PARAMETER stop <|User|> +PARAMETER stop <|Assistant|> +PARAMETER num_ctx 16000 +``` + +它包含多个部分,我们暂时用不着改太多,只需要注意`FROM`表明构建使用的模型(告诉Ollama用什么构建),以及`num_ctx`的值(默认4096,除非通过API请求的时候有额外设置)这里我设置的`16000`,它就是上下文长度,越长消耗的显存/内存,计算资源就越多。 + +> [!NOTE] +> +> 经过测试,RTX 4090差不多可以在KV Cache量化为q8_0,启用Flash Attention的情况下运行32B q4_K_M量化模型的同时,保持16K的上下文长度。如果同等情况下运行14B q4_K_M量化模型可以达到64K的上下文长度。有关KV Cache量化和Flash Attention的内容我会稍后讲解。 + +当我们创建好Modelfile后就可以使用如下命令创建模型了。 + +```bash +ollama create DeepSeek-R1-Distill-Qwen-32B-Q4_K_M -f DeepSeek-R1-Distill-Qwen-32B-Q4_K_M.txt +``` + +> [!TIP] +> +> 其格式如下: +> `ollama create <要创建的模型名> -f ` + +在此过程中Ollama会拉取模型并且创建它,完成后可以执行`ollama list`检查模型列表,你应该会看见类似的东西。 + +```console +PS 
C:\Users\james\Desktop\Ollama> ollama list +NAME ID SIZE MODIFIED +DeepSeek-R1-Distill-Qwen-32B-Q4_K_M:latest ca51e8a9d628 19 GB 2 days ago +deepseek-r1:32b-qwen-distill-q4_K_M 5de93a84837d 19 GB 2 days ago +``` + +## 优化 + +Ollama支持多个优化参数,它们通过环境变量控制。 + +- `OLLAMA_FLASH_ATTENTION`:`1`开启,`0`关闭 +- `OLLAMA_HOST`:Ollama监听的IP,默认是`127.0.0.1`,如果要对外服务需要改成`0.0.0.0` +- `OLLAMA_KV_CACHE_TYPE`:默认`fp16`,可以设置`q8_0`,或者`q4_0` +- `OLLAMA_NUM_PARALLEL`:同时运行的请求数,越多吞吐量越大,显存/内存消耗越多,一般`1`就差不多了 +- `OLLAMA_ORIGINS`:有关CORS跨站请求的内容,如果你要在其它地方请求Ollama,特别域名不一样的话你要设置对应的域,或者设置`*`允许所有来源 + +Flash Attention是必开的,KV Cache我建议选`q8_0`,实测发现`q4_0`会让R1的思考长度下降,这可能是因为内容都比较长,上下文比较重要。 + +### Windows 11 + +要在Windows 11中设置环境变量,需要进入“高级系统设置”,然后选择“环境变量”,之后选择“新建”。重启Ollama使其生效。 + +### MacOS + +在MacOS中可以执行诸如 + +```bash +launchctl setenv OLLAMA_FLASH_ATTENTION "1" +launchctl setenv OLLAMA_KV_CACHE_TYPE "q8_0" +``` + +的命令来设置环境变量。重启Ollama使其生效。 + +### Linux + +在Linux中,在安装完Ollama后可以修改`ollama.service`文件来修改它的环境变量。 + +```bash +sudo systemctl edit ollama.service +``` + +然后在`[Service]`下添加`Environment`字段,类似这样 + +```text +[Service] +Environment="OLLAMA_FLASH_ATTENTION=1" +Environment="OLLAMA_KV_CACHE_TYPE=q8_0" +``` + +保存修改后重载 + +```bash +sudo systemctl daemon-reload +sudo systemctl restart ollama +``` + +## 不足 + +Ollama使用的后端llama.cpp并非是为了多并发和高性能的生产环境设计的。比如它对多GPU的支持就不是很理想,它会把模型的层拆分到多个GPU里,这样解决了显存不足的问题,但是这样导致在单一时间内,只有一块GPU在干活。要同时利用多张GPU的性能,我们需要张量并行,这是SGLang或者vLLM擅长的。 + +至于性能,在和SGLang或者vLLM对比的时候也不占优势,吞吐量远不及后者。其次对多模态模型的支持有限,适配进度缓慢。 + +## 客户端 + +为了更方便使用Ollama中的模型,我推荐两个客户端。Cherry Studio是我觉得好用的本地客户端,LobeChat是我觉得好用的云端客户端(我之前写过一篇 [使用 Docker Compose 部署 LobeChat 服务端数据库版本](../install-lobechat-db/)) + +{{< gh-repo-card-container >}} + {{< gh-repo-card repo="CherryHQ/cherry-studio" >}} + {{< gh-repo-card repo="lobehub/lobe-chat" >}} + {{< gh-repo-card repo="Calcium-Ion/new-api" >}} + {{< gh-repo-card repo="immersive-translate/immersive-translate" >}} +{{< /gh-repo-card-container >}} + +New API则是我觉得一个很好的,用来集中管理API并且以OpenAI API格式提供服务的工具。Immersive Translate则是一个好评如潮的翻译插件,它支持调用OpenAI API来进行翻译,也自然可以与Ollama以及New API组合搭配。翻译效果远超传统翻译方法。
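+
+这些客户端底层都是通过 HTTP API 调用 Ollama 的,所以也可以直接测试前面创建的模型。下面是一个简单的示例,假设 Ollama 监听默认的 `127.0.0.1:11434`,并使用其 OpenAI 兼容接口:
+
+```bash
+# 向本地创建的模型发送一次对话请求
+# (如果修改过 OLLAMA_HOST,请相应调整地址)
+curl http://127.0.0.1:11434/v1/chat/completions \
+  -H "Content-Type: application/json" \
+  -d '{
+    "model": "DeepSeek-R1-Distill-Qwen-32B-Q4_K_M",
+    "messages": [{"role": "user", "content": "为什么天空是蓝色的?"}]
+  }'
+```
+
+这也正是 New API 和 Immersive Translate 所使用的 OpenAI 风格接口,因此它们可以直接对接 Ollama。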