BlockBeats News, March 3 — Developer Manjeet Singh (GitHub: maderix), working with Claude Opus, reverse-engineered Apple's undocumented private APIs and, for the first time, implemented neural network training with backpropagation on the Apple Neural Engine (ANE) of the M4 chip. The ANE is an accelerator designed specifically for inference; Apple has never officially exposed training capabilities, and developers can normally access its inference functions only indirectly, through the CoreML framework.
This project bypasses CoreML by directly mapping over 40 private classes, including _ANEClient and _ANECompiler, onto the IOKit kernel driver, creating a complete software stack. It also uncovered the _ANEInMemoryModelDescriptor interface, which allows models to be compiled directly in memory — a key step for training, since each weight update requires recompilation. So far, training of a single transformer layer (dim=768, seq=512) has been achieved, with each step taking 9.3ms on the M4 at 11.2% ANE utilization (1.78 TFLOPS against a theoretical peak of 15.8 TFLOPS). The forward pass and the backward input gradients are computed on the ANE, while the weight gradients and the Adam optimizer run on the CPU.
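The hybrid split described above can be illustrated for a single linear layer y = xW. This is a minimal numpy sketch under stated assumptions, not the project's actual code: the matmul-shaped work (forward pass, input gradient) is what the article says runs on the ANE, while the weight gradient and the Adam step are what it says stay on the CPU.

```python
import numpy as np

# Illustrative sketch only (assumption: the project's real kernels are private;
# this just mirrors the division of labor the article describes).
rng = np.random.default_rng(0)
dim, seq = 768, 512                                   # dims reported in the article
x = rng.standard_normal((seq, dim)).astype(np.float32)
W = rng.standard_normal((dim, dim)).astype(np.float32) * 0.02

# "ANE side": forward pass and input gradient -- both plain matmuls.
y = x @ W
dy = rng.standard_normal(y.shape).astype(np.float32)  # upstream gradient (synthetic)
dx = dy @ W.T                                         # input gradient

# "CPU side": weight gradient plus one Adam update step.
dW = x.T @ dy
m = np.zeros_like(W)                                  # Adam first-moment state
v = np.zeros_like(W)                                  # Adam second-moment state
lr, b1, b2, eps, t = 1e-3, 0.9, 0.999, 1e-8, 1
m = b1 * m + (1 - b1) * dW
v = b2 * v + (1 - b2) * dW**2
m_hat, v_hat = m / (1 - b1**t), v / (1 - b2**t)
W -= lr * m_hat / (np.sqrt(v_hat) + eps)
```

The split makes sense because the ANE-friendly operations are exactly the large dense products, while the optimizer state update is elementwise and cheap enough for the CPU.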
The project also found that the ANE's core primitive is convolution, not matrix multiplication: expressing matrix multiplication as a 1x1 convolution yields roughly three times the throughput, and bypassing CoreML to call the hardware directly adds a further 2-4x gain. On this basis, the author argues that Apple's official "38 TOPS" figure is misleading for such workloads. The project remains at an early stage: it supports only single-layer training on synthetic data, and each compilation leaves roughly 119 resource leaks that can only be cleared by restarting the process. Multi-layer training and real-data support are still in development. The project is open-sourced under the MIT license and gathered about 2,800 GitHub stars within five days of release.
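The convolution-as-matmul trick mentioned above rests on a simple identity: a 1x1 convolution mixes channels at every spatial position with the same weight matrix, so it computes exactly a matrix product with the spatial positions as columns. A small numpy check (illustrative only; shapes and names are assumptions, not the project's code):

```python
import numpy as np

# A 1x1 convolution over a (C_in, H, W) feature map applies the same
# (C_out, C_in) weight matrix at every spatial location -- which is the
# same computation as one matmul over the flattened spatial positions.
rng = np.random.default_rng(0)
C_in, C_out, H, W = 8, 16, 4, 4
x = rng.standard_normal((C_in, H, W))   # input feature map
k = rng.standard_normal((C_out, C_in))  # 1x1 kernel, squeezed to a matrix

# 1x1 convolution: per-position channel mixing.
conv_out = np.einsum("oi,ihw->ohw", k, x)

# Equivalent matmul: spatial positions become matrix columns.
matmul_out = (k @ x.reshape(C_in, H * W)).reshape(C_out, H, W)

assert np.allclose(conv_out, matmul_out)
```

This identity is why a conv-only accelerator can still run matmul-heavy workloads like transformers: the matmuls are simply restated as 1x1 convolutions in the shape the hardware prefers.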