I have been playing extensively with LLMs, especially self-hosting them so I can experiment with different models, prompts and the responses they produce.
Using Ollama has been one of the quickest ways to get up and running with local models, and it also offers some nice features like a built-in REST API.
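For example, once the server is running you can hit the API directly; a minimal sketch (the model name is just whatever you happen to have pulled, and "stream": false returns the whole response as one JSON object):

curl http://localhost:11434/api/generate -d '{
  "model": "llama3.1",
  "prompt": "Why is the sky blue?",
  "stream": false
}'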
I have a multi-GPU machine running five GeForce GTX 1060s, which, whilst older, still perform well for a lab environment. Unfortunately the CPU in that system is a cheap Celeron which doesn't support AVX or AVX2, and Ollama requires those extensions even for GPU inference. It took me ages to work out why GPU inference wasn't being used even though CUDA could see all five 1060s. Some of the output I was seeing was:
time=2024-09-26T03:41:50.015Z level=INFO source=common.go:49 msg="Dynamic LLM libraries" runners="[cpu cpu_avx cpu_avx2 cuda_v11 cuda_v12 rocm_v60102]"
time=2024-09-26T03:41:50.020Z level=INFO source=gpu.go:199 msg="looking for compatible GPUs"
time=2024-09-26T03:41:50.035Z level=WARN source=gpu.go:224 msg="CPU does not have minimum vector extensions, GPU inference disabled" required=avx detected="no vector extensions"
time=2024-09-26T03:41:50.035Z level=INFO source=types.go:107 msg="inference compute" id=0 library=cpu variant="no vector extensions" compute="" driver=0.0 name="" total="15.6 GiB" available="14.9 GiB"
You can see that GPU inference gets disabled because the CPU doesn't meet the AVX requirement. This output differs slightly from what I've seen on some bug trackers; I'm not sure whether that's down to a version change or my CPU reporting itself differently, but the underlying issue is the same.
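If you want to confirm what your own CPU supports before going any further, the flag list in /proc/cpuinfo is the quickest check on Linux (a generic check, nothing Ollama-specific):

# prints any AVX-family flags the kernel reports; no output means no AVX support
grep -oEw 'avx|avx2|avx512f' /proc/cpuinfo | sort -u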
Luckily there is an ongoing GitHub issue - https://github.com/ollama/ollama/issues/2187 - which tracks the requirement for AVX/AVX2 even when running on GPU. In that thread there are a few workarounds depending on the version of Ollama you're using. On v0.3.12 I modified:
In gpu/cpu_common.go I added a new line below line 15: return CPUCapabilityAVX
In llm/generate/gen_linux.sh I commented out line 54 and added a new line below it: COMMON_CMAKE_DEFS="-DBUILD_SHARED_LIBS=off -DCMAKE_POSITION_INDEPENDENT_CODE=on -DGGML_NATIVE=off -DGGML_AVX=off -DGGML_AVX2=off -DGGML_AVX512=off -DGGML_FMA=off -DGGML_F16C=off -DGGML_OPENMP=off"
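With those two changes made, the build itself is just the usual build-from-source flow for that release; roughly the following, assuming Go, cmake, gcc and the CUDA toolkit are already installed (your paths and versions may differ):

git clone https://github.com/ollama/ollama.git
cd ollama
git checkout v0.3.12
# make the two edits above, then generate the runners and build the binary
go generate ./...
go build .
# run the locally built binary rather than a packaged install
./ollama serve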
Building from source with these changes bypasses the AVX/AVX2 checks and lets inference run on the GPUs. When you run ollama now you can see:
time=2024-09-26T05:49:07.785Z level=INFO source=common.go:49 msg="Dynamic LLM libraries" runners="[cpu_avx2 cuda_v12 cpu cpu_avx]"
time=2024-09-26T05:49:07.786Z level=INFO source=gpu.go:199 msg="looking for compatible GPUs"
time=2024-09-26T05:49:08.703Z level=INFO source=types.go:107 msg="inference compute" id=GPU-e2f3f39f-9a70-1d92-f7da-25ab5291fda8 library=cuda variant=v12 compute=6.1 driver=12.6 name="NVIDIA GeForce GTX 1060 6GB" total="5.9 GiB" available="5.9 GiB"
time=2024-09-26T05:49:08.703Z level=INFO source=types.go:107 msg="inference compute" id=GPU-39bccb3b-5d42-81d2-5f84-ede96f34c3e6 library=cuda variant=v12 compute=6.1 driver=12.6 name="NVIDIA GeForce GTX 1060 6GB" total="5.9 GiB" available="5.9 GiB"
time=2024-09-26T05:49:08.703Z level=INFO source=types.go:107 msg="inference compute" id=GPU-58a48fa9-8b0f-5691-8a71-51e761b4fddc library=cuda variant=v12 compute=6.1 driver=12.6 name="NVIDIA GeForce GTX 1060 6GB" total="5.9 GiB" available="5.9 GiB"
time=2024-09-26T05:49:08.703Z level=INFO source=types.go:107 msg="inference compute" id=GPU-3176d686-d810-04ca-fbda-8f0340bb8faf library=cuda variant=v12 compute=6.1 driver=12.6 name="NVIDIA GeForce GTX 1060 6GB" total="5.9 GiB" available="5.9 GiB"
time=2024-09-26T05:49:08.703Z level=INFO source=types.go:107 msg="inference compute" id=GPU-d09f4bbc-cf74-dc70-8c22-4898d8267937 library=cuda variant=v12 compute=6.1 driver=12.6 name="NVIDIA GeForce GTX 1060 6GB" total="5.9 GiB" available="5.9 GiB"
time=2024-09-26T05:50:03.095Z level=INFO source=sched.go:730 msg="new model will fit in available VRAM, loading" model=/home/x/.ollama/models/blobs/sha256-ff1d1fc78170d787ee1201778e2dd65ea211654ca5fb7d69b5a2e7b123a50373 library=cuda parallel=4 required="16.7 GiB"
time=2024-09-26T05:50:03.095Z level=INFO source=server.go:103 msg="system memory" total="15.6 GiB" free="14.9 GiB" free_swap="4.0 GiB"
Depending on the size of the model you can watch Ollama load it across the GPUs, and when running inference it appears to stripe the query across the cards, although I'm fairly sure this is just a side effect of the model's layers being split across each card's memory rather than the workload actually being striped.
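You can also ask Ollama itself where a loaded model ended up; while a model is resident, ollama ps shows its size and the GPU/CPU split:

# lists loaded models with their size and processor (GPU/CPU) placement
ollama ps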
You can check utilisation, memory usage and card status with nvidia-smi:
x@x:~$ nvidia-smi
Thu Sep 26 05:50:53 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.35.03 Driver Version: 560.35.03 CUDA Version: 12.6 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA GeForce GTX 1060 6GB Off | 00000000:02:00.0 Off | N/A |
| 0% 44C P5 11W / 180W | 2695MiB / 6144MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 1 NVIDIA GeForce GTX 1060 6GB Off | 00000000:03:00.0 Off | N/A |
| 0% 44C P2 29W / 180W | 2097MiB / 6144MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 2 NVIDIA GeForce GTX 1060 6GB Off | 00000000:04:00.0 Off | N/A |
| 0% 41C P2 28W / 180W | 2097MiB / 6144MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 3 NVIDIA GeForce GTX 1060 6GB Off | 00000000:05:00.0 Off | N/A |
| 0% 40C P2 33W / 180W | 2097MiB / 6144MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 4 NVIDIA GeForce GTX 1060 6GB Off | 00000000:06:00.0 Off | N/A |
| 0% 35C P8 5W / 180W | 1927MiB / 6144MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| 0 N/A N/A 1445 C ...unners/cuda_v12/ollama_llama_server 2688MiB |
| 1 N/A N/A 1445 C ...unners/cuda_v12/ollama_llama_server 2090MiB |
| 2 N/A N/A 1445 C ...unners/cuda_v12/ollama_llama_server 2090MiB |
| 3 N/A N/A 1445 C ...unners/cuda_v12/ollama_llama_server 2090MiB |
| 4 N/A N/A 1445 C ...unners/cuda_v12/ollama_llama_server 1920MiB |
+-----------------------------------------------------------------------------------------+
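If you want to watch this change while a prompt is running, wrapping it in watch is enough:

watch -n 1 nvidia-smi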