The Missing Link: Bridging Apple MLX and llama.cpp
Fine-tuning on Apple Silicon is a joy until you try to export your model. Here is how I patched llama.cpp to handle MLX-fused models.

If you are fine-tuning on a Mac, you are probably using MLX, Apple’s machine-learning framework built to make Apple Silicon (M-series) chips fly. But there is a hidden catch: once your model is “fused” and ready, bringing it into the llama.cpp or Ollama ecosystem can be a nightmare.
The Context: Two Worlds Apart
To understand the challenge, we need to look at the players:
- MLX: Apple’s open-source research framework. It’s incredibly fast for training because it exploits Apple Silicon’s unified memory, but its quantization stores extra tensors (“artifacts”) that other tools don’t expect.
- llama.cpp: The industry standard for local LLM inference. It’s strict about how tensors should be named and mapped to work across different hardware.
The challenge arises when you try to convert an MLX-fused model to the GGUF format. The conversion script in llama.cpp expects a clean Hugging Face structure, but MLX leaves behind some “ghosts” in the machine.
The Challenge: The “lm_head” Crash
While working on my own LLM, I hit a wall. Every time I ran the conversion script, it crashed with a ValueError.
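The crash looked roughly like this (paths are from my setup; yours will differ):

```
python convert_hf_to_gguf.py ./fused_model --outfile model.gguf --outtype f16
...
ValueError: Can not map tensor 'lm_head.scales'
```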
It turns out that quantized MLX-fused models include extra tensors named lm_head.scales and lm_head.biases, the per-group quantization parameters MLX attaches to the output layer. Since they have no counterpart in the standard Llama/Mistral tensor map, llama.cpp doesn’t know what to do with them and simply stops.
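You can spot the offenders yourself by listing the tensor names in the fused checkpoint. A minimal sketch, assuming your fused model was saved as safetensors (the path is a placeholder):

```python
# List lm_head tensors in an MLX-fused checkpoint (sketch; adjust the path).
from safetensors import safe_open

with safe_open("fused_model/model.safetensors", framework="numpy") as f:
    for name in f.keys():
        if name.startswith("lm_head."):
            print(name)  # expect lm_head.weight plus the .scales/.biases artifacts
```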
The Solution: A Surgical Patch
Instead of waiting for a fix, I decided to patch the conversion script locally. I modified the modify_tensors function to gracefully ignore these MLX-specific artifacts instead of raising a hard error.
```python
def modify_tensors(self, data_torch: Tensor, name: str, bid: int | None) -> Iterable[tuple[str, Tensor]]:
    try:
        new_name = self.map_tensor_name(name)
    except ValueError as e:
        # Ignore MLX output-layer artifacts (no GGUF mapping for these names)
        if "lm_head." in name and (name.endswith(".biases") or name.endswith(".scales")):
            print(f"⚠️ Ignoring MLX quantization artifact: {name}")
            return []
        raise e
    # ... the rest of the original method continues unchanged
```

Here’s the patch if you need it:
```diff
diff --git a/convert_hf_to_gguf.py b/convert_hf_to_gguf.py
index 93d5509e6..6aa6b65f8 100755
--- a/convert_hf_to_gguf.py
+++ b/convert_hf_to_gguf.py
@@ -542,7 +542,14 @@ class ModelBase:
         raise NotImplementedError("set_gguf_parameters() must be implemented in subclasses")
 
     def modify_tensors(self, data_torch: Tensor, name: str, bid: int | None) -> Iterable[tuple[str, Tensor]]:
-        new_name = self.map_tensor_name(name)
+        try:
+            new_name = self.map_tensor_name(name)
+        except ValueError as e:
+            # Ignore MLX output-layer artifacts (no GGUF mapping for these names)
+            if "lm_head." in name and (name.endswith(".biases") or name.endswith(".scales")):
+                print(f"⚠️ Ignoring MLX quantization artifact: {name}")
+                return []
+            raise e
 
         # Handle gate/up expert tensor fusion if enabled
         if self.fuse_gate_up_exps and bid is not None:
```

This simple try/except block allows the conversion to finish: returning an empty list tells the converter to drop the offending tensor instead of aborting, and the result is a fully functional GGUF model that runs flawlessly on Ollama.
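If you are paranoid (I was), you can confirm the artifacts were actually dropped by reading the tensor names back out of the converted file. A quick sketch using the gguf Python package that ships with llama.cpp (pip install gguf; the filename is a placeholder):

```python
# Sanity-check the converted file: no MLX .scales/.biases should survive.
from gguf import GGUFReader

reader = GGUFReader("model.gguf")
names = [t.name for t in reader.tensors]
assert not any(n.endswith((".scales", ".biases")) for n in names)
print(f"{len(names)} tensors, no MLX artifacts left")
```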
Giving Back to the Community
I know I’m not the only one building AI tools on Apple Silicon. To help others who hit this ValueError wall, I’ve opened an issue in the llama.cpp repository detailing the fix.
- GitHub Issue #22431: convert_hf_to_gguf.py fails on MLX-fused models
I hope this post and the issue help future researchers and hackers spend less time debugging conversion scripts and more time building cool things with local AI.
Update: After discussing this on GitHub, a llama.cpp maintainer pointed out that while my patch works as a bypass, it may lead to precision loss in the output layer. The recommended “official” fix is to use the --dequantize flag during the mlx_lm.fuse step, so the quantization artifacts are never written in the first place. I have yet to personally verify the quality difference between the two methods.
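For completeness, the suggested workflow would look something like this. I have not benchmarked it myself, the paths are placeholders, and the exact flag spelling may vary between mlx-lm versions, so check mlx_lm.fuse --help first:

```
mlx_lm.fuse --model <base-model> --adapter-path adapters --save-path fused_model --dequantize
```

With the output layer stored as plain floats, the converter’s tensor map has nothing to choke on, and no patch is needed.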