AI & Astronomy · 3 min read

The Missing Link: Bridging Apple MLX and llama.cpp

Fine-tuning on Apple Silicon is a joy until you try to export your model. Here is how I patched llama.cpp to handle MLX-fused models.

If you are fine-tuning on a Mac, you are likely using MLX, Apple’s machine-learning framework built for Apple Silicon (M-series) chips. But there is a silent challenge: once your model is “fused” and ready, bringing it into the llama.cpp or Ollama ecosystem can be a nightmare.

The Context: Two Worlds Apart

To understand the challenge, we need to look at the players:

  • MLX: Apple’s open-source machine-learning framework. It’s incredibly fast for training because it uses unified memory, but its quantization stores extra tensors (“artifacts” such as per-group scales and biases) that other toolchains don’t expect.
  • llama.cpp: The industry standard for local LLM inference. It’s strict about how tensors should be named and mapped to work across different hardware.
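
That strictness is easiest to see in tensor naming: the converter maps every Hugging Face tensor name to a fixed GGUF name, and anything it cannot map is an error. Here is a minimal sketch of the idea — the GGUF names (token_embd, blk.N, output) are real conventions, but this tiny table and mapper are simplified stand-ins for the actual convert_hf_to_gguf.py logic:

```python
# Illustrative subset of the Hugging Face -> GGUF tensor-name mapping.
# The table is a stand-in; the real converter builds it per architecture.
HF_TO_GGUF = {
    "model.embed_tokens.weight": "token_embd.weight",
    "model.layers.0.self_attn.q_proj.weight": "blk.0.attn_q.weight",
    "lm_head.weight": "output.weight",
}

def map_tensor_name(name: str) -> str:
    try:
        return HF_TO_GGUF[name]
    except KeyError:
        # This is the strictness the post runs into: unknown name -> hard error
        raise ValueError(f"Can not map tensor {name!r}")

print(map_tensor_name("lm_head.weight"))  # -> output.weight
```

Anything outside the table — like the MLX-specific tensors we are about to meet — falls straight into the ValueError branch.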

The challenge arises when you try to convert an MLX-fused model to the GGUF format. The conversion script in llama.cpp expects a clean Hugging Face structure, but MLX leaves behind some “ghosts” in the machine.

The Challenge: The “lm_head” Crash

While working on my own LLM, I hit a wall. Every time I ran the conversion script, it crashed with a ValueError.

It turns out that MLX-fused models include specific tensors called lm_head.scales and lm_head.biases. Since these are not part of the standard Llama/Mistral architecture, llama.cpp doesn’t know what to do with them and simply stops.
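
The good news is that these artifacts are trivial to recognize by name alone. A tiny standalone sketch of the check (the checkpoint listing below is made up for illustration):

```python
# The MLX artifacts attach to the output layer and can be recognized
# purely by name. The tensor names below are a made-up checkpoint listing.
def is_mlx_artifact(name: str) -> bool:
    return "lm_head." in name and name.endswith((".scales", ".biases"))

checkpoint = [
    "model.embed_tokens.weight",
    "model.layers.0.self_attn.q_proj.weight",
    "lm_head.weight",
    "lm_head.scales",   # MLX quantization artifact
    "lm_head.biases",   # MLX quantization artifact
]
print([n for n in checkpoint if is_mlx_artifact(n)])
# -> ['lm_head.scales', 'lm_head.biases']
```

This name check is exactly what my patch keys on.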

The Solution: A Surgical Patch

Instead of waiting for a fix, I decided to patch the conversion script locally. I modified the modify_tensors function to gracefully ignore these MLX-specific artifacts instead of raising a hard error.

def modify_tensors(self, data_torch: Tensor, name: str, bid: int | None) -> Iterable[tuple[str, Tensor]]:
    try:
        new_name = self.map_tensor_name(name)
    except ValueError as e:
        # Ignore MLX output-layer artifacts (no GGUF mapping for these names)
        if "lm_head." in name and (name.endswith(".biases") or name.endswith(".scales")):
            print(f"⚠️  Ignoring MLX quantization artifact: {name}")
            return []
        raise e

    # ... rest of modify_tensors continues unchanged

Here’s the patch if you need it:

diff --git a/convert_hf_to_gguf.py b/convert_hf_to_gguf.py
index 93d5509e6..6aa6b65f8 100755
--- a/convert_hf_to_gguf.py
+++ b/convert_hf_to_gguf.py
@@ -542,7 +542,14 @@ class ModelBase:
         raise NotImplementedError("set_gguf_parameters() must be implemented in subclasses")

     def modify_tensors(self, data_torch: Tensor, name: str, bid: int | None) -> Iterable[tuple[str, Tensor]]:
-        new_name = self.map_tensor_name(name)
+        try:
+            new_name = self.map_tensor_name(name)
+        except ValueError as e:
+            # Ignore MLX output-layer artifacts (no GGUF mapping for these names)
+            if "lm_head." in name and (name.endswith(".biases") or name.endswith(".scales")):
+                print(f"⚠️  Ignoring MLX quantization artifact: {name}")
+                return []
+            raise e

         # Handle gate/up expert tensor fusion if enabled
         if self.fuse_gate_up_exps and bid is not None:

This simple try/except block allows the conversion to finish, resulting in a perfectly functional GGUF model that runs flawlessly on Ollama.

Giving Back to the Community

I know I’m not the only one building AI tools on Apple Silicon. To help others facing this ValueError wall, I’ve opened an issue in the llama.cpp repository detailing the fix.

  • GitHub Issue #22431: convert_hf_to_gguf.py fails on MLX-fused models

I hope this post and the issue help future researchers and hackers spend less time debugging conversion scripts and more time building cool things with local AI.

Update: After discussing this on GitHub, a llama.cpp maintainer pointed out that while my patch works as a bypass, it may lead to precision loss in the output layer. The recommended “official” fix is to use the --dequantize flag during the mlx_lm.fuse process, though I have yet to personally verify the performance difference between the two methods.
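
The precision concern makes sense once you see what those tensors do: group quantization stores low-bit integer codes plus a scale and bias per group, and the original weight is recovered roughly as scale * q + bias. Here is a toy round trip (pure Python, one group, 4-bit, made-up weights) showing why the integer codes alone are not enough:

```python
# Toy affine quantization round trip: w ≈ scale * q + bias.
# Dropping the scale/bias tensors means the integer codes alone cannot
# reproduce w, which is why skipping lm_head.scales/.biases can cost
# precision in the output layer.
w = [0.8, -1.2, 0.05, 2.4, -0.7, 1.1]   # made-up weights, one group
lo, hi = min(w), max(w)
scale = (hi - lo) / 15                  # 4-bit codes: 0..15
bias = lo
q = [round((x - bias) / scale) for x in w]
w_hat = [scale * v + bias for v in q]   # needs BOTH scale and bias
max_err = max(abs(a - b) for a, b in zip(w, w_hat))
print(f"max abs reconstruction error: {max_err:.4f}")
```

With scale and bias present, the reconstruction error stays within half a quantization step; without them, the codes are meaningless integers — dequantizing before fusing sidesteps the whole question.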
