Why On-Device AI Matters
The future of AI is at the edge, where privacy meets performance
Enhanced Privacy
Data never leaves the device. Process personal content, conversations, and media locally without cloud exposure.
Real-Time Response
Instant inference with no network round-trips. Perfect for AR/VR experiences, multimodal AI interactions, and responsive conversational agents.
Offline & Low-Bandwidth Ready
Zero network dependency for inference. Works seamlessly in low-bandwidth regions, remote areas, or completely offline.
Cost Efficient
No cloud compute bills. No API rate limits. Scale to billions of users without infrastructure costs growing linearly.
Models Are Getting Smaller & Smarter
The convergence of efficient architectures and edge hardware creates new opportunities
Dramatically Smaller
Modern LLMs achieve high quality at a fraction of historical sizes
Edge-Ready Performance
Real-time inference on consumer smartphones
Quantization Benefits
Significant size reduction while preserving accuracy (see the footprint sketch below)
The opportunity is now: Foundation models have crossed the efficiency threshold. Deploy sophisticated AI directly where data lives.
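To put the size reduction in concrete terms, here is a back-of-envelope footprint calculation (illustrative numbers only, weights-only, not a benchmark) showing how precision alone shrinks a 3-billion-parameter model:

# Approximate weight footprint of a 3B-parameter model at different precisions
# (weights only; activations and KV cache are excluded).
params = 3_000_000_000

for name, bits in [("fp32", 32), ("fp16", 16), ("int8", 8), ("int4", 4)]:
    gib = params * bits / 8 / (1024 ** 3)
    print(f"{name}: ~{gib:.1f} GiB")

# fp32: ~11.2 GiB, fp16: ~5.6 GiB, int8: ~2.8 GiB, int4: ~1.4 GiB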
Why On-Device AI Was Hard
The technical challenges that made edge deployment complex... until now
Power Constraints
From battery-powered phones to energy-harvesting sensors, edge devices have strict power budgets. Microcontrollers may run on milliwatts, requiring extreme efficiency.
Thermal Management
Sustained inference generates heat without active cooling. From smartphones to industrial IoT devices, thermal throttling limits continuous AI workloads.
Memory Limitations
Edge devices range from high-end phones to tiny microcontrollers. Beyond capacity, limited memory bandwidth creates bottlenecks when moving tensors between compute units.
Hardware Heterogeneity
Edge hardware spans microcontrollers, smartphone NPUs, and embedded GPUs. Each architecture demands its own optimizations, making broad deployment across diverse form factors extremely challenging.
PyTorch Powers 92% of AI Research
But deploying PyTorch models to edge devices meant losing everything that made PyTorch great
Research & Training
PyTorch's intuitive APIs and eager execution power breakthrough research
The Conversion Nightmare
Multiple intermediate formats, custom runtimes, C++ rewrites
The Hidden Costs of Conversion (Status Quo)
PyTorch operations don't map 1:1 to other formats
Can't trace errors back to original PyTorch code
Locked into proprietary formats with limited operator support
Teams spend months rewriting Python models in C++ for production
ExecuTorch
PyTorch's On-Device AI Framework
No Conversions
Direct export from PyTorch to edge. Core ATen operators preserved. No intermediate formats, no vendor lock-in.
Ahead-of-Time Compilation
Optimize models offline for target device capabilities. Hardware-specific performance tuning before deployment.
Modular by Design
Pick and choose optimization steps. Composable at both compile-time and runtime for maximum flexibility.
Hardware Ecosystem
Fully open source with hardware partner contributions. Built on PyTorch's standardized IR and operator set.
Embedded-Friendly Runtime
Portable C++ runtime runs on everything from microcontrollers to smartphones.
PyTorch Ecosystem
Native integration with PyTorch ecosystem, including torchao for quantization. Stay in familiar tools throughout.
Simple as 1-2-3
Export, optimize, and run PyTorch models on edge devices with just a few lines of code
1. Export Your PyTorch Model
import torch
from torch.export import export
# Your existing PyTorch model
model = MyModel().eval()
example_inputs = (torch.randn(1, 3, 224, 224),)
# Creates semantically equivalent graph representation
exported_program = export(model, example_inputs)
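If your input shapes vary at runtime (for example, the batch size), torch.export can capture that explicitly. Here is a minimal sketch using the standard torch.export.Dim API; the bounds are illustrative:

from torch.export import Dim, export

# Mark the batch dimension of the single positional input as dynamic,
# so the exported graph accepts batch sizes from 1 to 8.
batch = Dim("batch", min=1, max=8)
exported_program = export(
    model,
    example_inputs,
    dynamic_shapes=({0: batch},),  # one spec per positional input
)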
2. Optimize for Target Hardware
Switch between backends with a single line change
# Backend: XNNPACK (CPU acceleration)
from executorch.exir import to_edge_transform_and_lower
from executorch.backends.xnnpack import XnnpackPartitioner
program = to_edge_transform_and_lower(
exported_program,
partitioner=[XnnpackPartitioner()]
).to_executorch()
# Backend: Apple Core ML
from executorch.exir import to_edge_transform_and_lower
from executorch.backends.apple import CoreMLPartitioner
program = to_edge_transform_and_lower(
exported_program,
partitioner=[CoreMLPartitioner()]
).to_executorch()
# Backend: Qualcomm QNN
from executorch.exir import to_edge_transform_and_lower
from executorch.backends.qualcomm import QnnPartitioner
program = to_edge_transform_and_lower(
exported_program,
partitioner=[QnnPartitioner()]
).to_executorch()
# Save to .pte file
with open("model.pte", "wb") as f:
f.write(program.buffer)
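Before deploying the .pte, it can be worth confirming that the exported graph still matches eager PyTorch on the example inputs. A quick sanity check (a sketch; it reuses model, example_inputs, and exported_program from the steps above):

import torch

# The exported graph should produce the same outputs as eager execution.
eager_out = model(*example_inputs)
exported_out = exported_program.module()(*example_inputs)
torch.testing.assert_close(exported_out, eager_out)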
3. Run on Any Platform
// Load the model and pre-load the "forward" method
Module module("model.pte");
module.load_method("forward");
// Run inference
const auto outputs = module.forward(input_tensor);
// Access the result tensor
const auto result = outputs->at(0).toTensor();
// Initialize ExecuTorch module
let module = try ETModule(path: "model.pte")
// Run inference with tensors
let outputs = try module.forward([inputTensor])
// Process results
let result = outputs[0]
// Load model from assets
val module = Module.load(assetFilePath("model.pte"))
// Execute with tensor input
val outputs = module.forward(inputTensor)
// Extract prediction results
val prediction = outputs[0].dataAsFloatArray
// Initialize ExecuTorch module
ETModule *module = [[ETModule alloc] initWithPath:@"model.pte" error:nil];
// Run inference with tensors
NSArray *outputs = [module forwardWithInputs:@[inputTensor] error:nil];
// Process results
ETTensor *result = outputs[0];
// Load model from ArrayBuffer
const module = et.Module.load(buffer);
// Create input tensor from data
const inputTensor = et.Tensor.fromIter(tensorData, shape);
// Run inference
const output = module.forward(inputTensor);
Available on Android, iOS, Linux, Windows, macOS, and embedded targets (e.g., DSPs and Cortex-M microcontrollers)
Need advanced features? ExecuTorch supports memory planning, quantization, profiling, and custom compiler passes.
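As one example of the quantization path, torchao's post-training weight-only quantization can be applied to the eager model before export, after which the lowering flow above is unchanged. This is a minimal sketch; the exact config names depend on your torchao version and target backend:

from torchao.quantization import quantize_, int8_weight_only

# Quantize linear-layer weights to int8 in place, then export as usual.
model = MyModel().eval()
quantize_(model, int8_weight_only())
exported_program = export(model, example_inputs)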
Try the Full Tutorial
High-Level Multimodal APIs
Run complex multimodal LLMs through simple high-level APIs in C++, Swift, and Kotlin
Multimodal Runner - Text + Vision + Audio in One API
Choose your platform to see the multimodal API supporting text, images, and audio:
#include "executorch/extension/llm/runner/multimodal_runner.h"
// Initialize multimodal model (e.g., Voxtral, LLaVA)
auto runner = MultimodalRunner::create(
"model.pte", // Text model
"vision.pte", // Vision encoder
"audio.pte", // Audio encoder
tokenizer_path,
temperature
);
// Run inference with audio + image + text
auto result = runner->generate_multimodal(
"Describe what you hear and see",
audio_tensor, // Audio input
image_tensor, // Image input
max_tokens
);
// Stream response tokens
for (const auto& token : result.tokens) {
std::cout << token << std::flush;
}
import ExecuTorch
import AVFoundation
// Initialize multimodal runner with audio support
let runner = try MultimodalRunner(
modelPath: "model.pte",
visionPath: "vision.pte",
audioPath: "audio.pte",
tokenizerPath: tokenizerPath,
temperature: 0.7
)
// Process audio and image inputs
let audioTensor = AudioProcessor.preprocess(audioURL)
let imageTensor = ImageProcessor.preprocess(uiImage)
// Generate with audio + vision + text
let result = try runner.generateMultimodal(
prompt: "Describe what you hear and see",
audio: audioTensor,
image: imageTensor,
maxTokens: 512
)
// Stream tokens to UI
result.tokens.forEach { token in
DispatchQueue.main.async {
responseText += token
}
}
import org.pytorch.executorch.MultimodalRunner
import android.media.MediaRecorder
// Initialize multimodal runner with audio
val runner = MultimodalRunner.create(
modelPath = "model.pte",
visionPath = "vision.pte",
audioPath = "audio.pte",
tokenizerPath = tokenizerPath,
temperature = 0.7f
)
// Process audio and image inputs
val audioTensor = AudioProcessor.preprocess(audioFile)
val imageTensor = ImageProcessor.preprocess(bitmap)
// Generate with audio + vision + text
val result = runner.generateMultimodal(
prompt = "Describe what you hear and see",
audio = audioTensor,
image = imageTensor,
maxTokens = 512
)
// Display streaming response
result.tokens.forEach { token ->
runOnUiThread {
responseView.append(token)
}
}
High-level APIs abstract away model complexity: just load, prompt, and get results
Universal AI Runtime
Comprehensive Hardware Ecosystem
Hardware acceleration contributed by industry partners via open source
XNNPACK with KleidiAI
CPU acceleration across ARM and x86 architectures
Apple Core ML
Neural Engine and Apple Silicon optimization
Qualcomm Snapdragon
Hexagon NPU support
ARM Ethos-U
Microcontroller NPU for ultra-low power
Vulkan GPU
Cross-platform graphics acceleration
Intel OpenVINO
x86 CPU and integrated GPU optimization
MediaTek NPU
Dimensity chipset acceleration
Samsung Exynos
Integrated NPU optimization
NXP Neutron
Automotive and IoT acceleration
Apple MPS
Metal Performance Shaders for GPU acceleration
ARM VGF
Versatile graphics framework support
Cadence DSP
Digital signal processor optimization
Success Stories
Production deployments and strategic partnerships accelerating edge AI
Adoption
- Meta Family of Apps: Production deployment across Instagram, Facebook, and WhatsApp
- Meta Reality Labs: Powers Quest 3 VR and Ray-Ban Meta Smart Glasses AI
Ecosystem Integration
- Hugging Face: Optimum-ExecuTorch for direct transformer model deployment
- LiquidAI: Next-generation Liquid Foundation Models optimized for edge deployment
- Software Mansion: React Native ExecuTorch bringing edge AI to mobile apps
Ready to Deploy AI at the Edge?
Join thousands of developers using ExecuTorch in production
Get Started Today