Why On-Device AI Matters

The future of AI is at the edge, where privacy meets performance

πŸ”’

Enhanced Privacy

Data never leaves the device. Process personal content, conversations, and media locally without cloud exposure.

⚑

Real-Time Response

Instant inference with no network round-trips. Perfect for AR/VR experiences, multimodal AI interactions, and responsive conversational agents.

🌐

Offline & Low-Bandwidth Ready

Zero network dependency for inference. Works seamlessly in low-bandwidth regions, remote areas, or completely offline.

πŸ’°

Cost Efficient

No cloud compute bills. No API rate limits. Scale to billions of users without infrastructure costs growing linearly.

Models Are Getting Smaller & Smarter

The convergence of efficient architectures and edge hardware creates new opportunities

Dramatically Smaller

Modern LLMs deliver high quality at a fraction of the size of earlier generations

Edge-Ready Performance

Real-time inference on consumer smartphones

Quantization Benefits

Significant size reduction while preserving accuracy

The opportunity is now: Foundation models have crossed the efficiency threshold. Deploy sophisticated AI directly where data lives.
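
As a rough back-of-the-envelope sketch (the parameter count and byte widths below are illustrative assumptions, not benchmarks from this page), weight storage scales with parameter count times bytes per weight, which is why low-bit quantization shrinks models so dramatically:

# Illustrative size math: parameters x bytes per weight (assumed numbers, not a benchmark)
params = 3_000_000_000            # e.g., a hypothetical 3B-parameter model
fp16_gb = params * 2 / 1e9        # 16-bit weights: ~6.0 GB
int4_gb = params * 0.5 / 1e9      # 4-bit weights: ~1.5 GB (plus small scale/zero-point overhead)
print(f"fp16: {fp16_gb:.1f} GB, int4: {int4_gb:.1f} GB")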

Why On-Device AI Was Hard

The technical challenges that made edge deployment complex... until now

πŸ”‹

Power Constraints

From battery-powered phones to energy-harvesting sensors, edge devices have strict power budgets. Microcontrollers may run on milliwatts, requiring extreme efficiency.

🌑️

Thermal Management

Sustained inference generates heat without active cooling. From smartphones to industrial IoT devices, thermal throttling limits continuous AI workloads.

πŸ’Ύ

Memory Limitations

Edge devices range from high-end phones to tiny microcontrollers. Beyond capacity, limited memory bandwidth creates bottlenecks when moving tensors between compute units.

πŸ”§

Hardware Heterogeneity

From microcontrollers to smartphone NPUs to embedded GPUs. Each architecture demands unique optimizations, making broad deployment across diverse form factors extremely challenging.

PyTorch Powers 92% of AI Research

But deploying PyTorch models to edge devices meant losing everything that made PyTorch great

Research & Training

PyTorch's intuitive APIs and eager execution power breakthrough research

β†’

The Conversion Nightmare

Multiple intermediate formats, custom runtimes, C++ rewrites

The Hidden Costs of Conversion (Status Quo)

❌
Lost Semantics

PyTorch operations don't map 1:1 to other formats

❌
Debugging Nightmare

Can't trace errors back to original PyTorch code

❌
Vendor-Specific Formats

Locked into proprietary formats with limited operator support

❌
Language Barriers

Teams spend months rewriting Python models in C++ for production

ExecuTorch
PyTorch's On-Device AI Framework

🎯

No Conversions

Direct export from PyTorch to edge. Core ATen operators preserved. No intermediate formats, no vendor lock-in.

βš™οΈ

Ahead-of-Time Compilation

Optimize models offline for target device capabilities. Hardware-specific performance tuning before deployment.

πŸ”§

Modular by Design

Pick and choose optimization steps. Composable at both compile-time and runtime for maximum flexibility.

πŸš€

Hardware Ecosystem

Fully open source with hardware partner contributions. Built on PyTorch's standardized IR and operator set.

πŸ’Ύ

Embedded-Friendly Runtime

Portable C++ runtime runs on everything from microcontrollers to smartphones.

πŸ”—

PyTorch Ecosystem

Native integration with PyTorch ecosystem, including torchao for quantization. Stay in familiar tools throughout.

Simple as 1-2-3

Export, optimize, and run PyTorch models on edge devices with just a few lines of code

1. Export Your PyTorch Model

import torch
from torch.export import export

# Your existing PyTorch model
model = MyModel().eval()
example_inputs = (torch.randn(1, 3, 224, 224),)

# Creates semantically equivalent graph representation
exported_program = export(model, example_inputs)
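
If you want to sanity-check the export, the exported graph can be run directly and compared against eager-mode outputs (an optional, minimal sketch using standard PyTorch APIs):

# Optional sanity check: the exported graph should match eager execution
with torch.no_grad():
    eager_out = model(*example_inputs)
exported_out = exported_program.module()(*example_inputs)
torch.testing.assert_close(eager_out, exported_out)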

2. Optimize for Target Hardware

Switch between backends with a single line change

# XNNPACK (CPU acceleration on ARM and x86)
from executorch.exir import to_edge_transform_and_lower
from executorch.backends.xnnpack import XnnpackPartitioner

program = to_edge_transform_and_lower(
    exported_program,
    partitioner=[XnnpackPartitioner()]
).to_executorch()

# Core ML (Apple Neural Engine)
from executorch.backends.apple import CoreMLPartitioner

program = to_edge_transform_and_lower(
    exported_program,
    partitioner=[CoreMLPartitioner()]
).to_executorch()

# Qualcomm (Hexagon NPU)
from executorch.backends.qualcomm import QnnPartitioner

program = to_edge_transform_and_lower(
    exported_program,
    partitioner=[QnnPartitioner()]
).to_executorch()

# Save the optimized program to a .pte file
with open("model.pte", "wb") as f:
    f.write(program.buffer)

3. Run on Any Platform

// C++ (Linux, Windows, macOS, embedded)
// Load and execute model
auto module = Module("model.pte");
auto method = module.load_method("forward");
auto outputs = method.execute({input_tensor});
// Access result tensors
auto result = outputs[0].toTensor();

// Swift (iOS)
// Initialize ExecuTorch module
let module = try ETModule(path: "model.pte")
// Run inference with tensors
let outputs = try module.forward([inputTensor])
// Process results
let result = outputs[0]

// Kotlin (Android)
// Load model from assets
val module = Module.load(assetFilePath("model.pte"))
// Execute with tensor input
val outputs = module.forward(inputTensor)
// Extract prediction results
val prediction = outputs[0].dataAsFloatArray

// Objective-C (iOS, macOS)
// Initialize ExecuTorch module
ETModule *module = [[ETModule alloc] initWithPath:@"model.pte" error:nil];
// Run inference with tensors
NSArray *outputs = [module forwardWithInputs:@[inputTensor] error:nil];
// Process results
ETTensor *result = outputs[0];

// JavaScript (Web / WASM)
// Load model from ArrayBuffer
const module = et.Module.load(buffer);
// Create input tensor from data
const inputTensor = et.Tensor.fromIter(tensorData, shape);
// Run inference
const output = module.forward(inputTensor);

Available on Android, iOS, Linux, Windows, macOS, and embedded microcontrollers (e.g., DSPs and Cortex-M processors)

Need advanced features? ExecuTorch supports memory planning, quantization, profiling, and custom compiler passes.
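
For example, quantization can be applied before export. A minimal sketch using torchao weight-only quantization (the import names here are assumptions and vary across torchao releases; check the torchao documentation for the flow that matches your version):

import torch
from torch.export import export

# Assumed torchao imports; exact names differ across torchao versions
from torchao.quantization import quantize_, int8_weight_only

model = MyModel().eval()
example_inputs = (torch.randn(1, 3, 224, 224),)

# Quantize weights to int8 in place, then export as in step 1
quantize_(model, int8_weight_only())
exported_program = export(model, example_inputs)

From there, lowering to a backend and saving the .pte file proceed exactly as in step 2.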

Try the Full Tutorial β†’

High-Level Multimodal APIs

Run complex multimodal LLMs with simplified C++ interfaces

Multimodal Runner - Text + Vision + Audio in One API

Choose your platform to see the multimodal API supporting text, images, and audio:

#include "executorch/extension/llm/runner/multimodal_runner.h"

// Initialize multimodal model (e.g., Voxtral, LLaVA)
auto runner = MultimodalRunner::create(
    "model.pte",     // Text model
    "vision.pte",   // Vision encoder
    "audio.pte",    // Audio encoder
    tokenizer_path,
    temperature
);

// Run inference with audio + image + text
auto result = runner->generate_multimodal(
    "Describe what you hear and see",
    audio_tensor,   // Audio input
    image_tensor,   // Image input
    max_tokens
);

// Stream response tokens
for (const auto& token : result.tokens) {
    std::cout << token << std::flush;
}

// Swift (iOS)
import ExecuTorch
import AVFoundation

// Initialize multimodal runner with audio support
let runner = try MultimodalRunner(
    modelPath: "model.pte",
    visionPath: "vision.pte",
    audioPath: "audio.pte",
    tokenizerPath: tokenizerPath,
    temperature: 0.7
)

// Process audio and image inputs
let audioTensor = AudioProcessor.preprocess(audioURL)
let imageTensor = ImageProcessor.preprocess(uiImage)

// Generate with audio + vision + text
let result = try runner.generateMultimodal(
    prompt: "Describe what you hear and see",
    audio: audioTensor,
    image: imageTensor,
    maxTokens: 512
)

// Stream tokens to UI
result.tokens.forEach { token in
    DispatchQueue.main.async {
        responseText += token
    }
}

// Kotlin (Android)
import org.pytorch.executorch.MultimodalRunner
import android.media.MediaRecorder

// Initialize multimodal runner with audio
val runner = MultimodalRunner.create(
    modelPath = "model.pte",
    visionPath = "vision.pte",
    audioPath = "audio.pte",
    tokenizerPath = tokenizerPath,
    temperature = 0.7f
)

// Process audio and image inputs
val audioTensor = AudioProcessor.preprocess(audioFile)
val imageTensor = ImageProcessor.preprocess(bitmap)

// Generate with audio + vision + text
val result = runner.generateMultimodal(
    prompt = "Describe what you hear and see",
    audio = audioTensor,
    image = imageTensor,
    maxTokens = 512
)

// Display streaming response
result.tokens.forEach { token ->
    runOnUiThread {
        responseView.append(token)
    }
}

High-level APIs abstract away model complexity: just load, prompt, and get results

Universal AI Runtime

πŸ’¬ LLMs πŸ‘οΈ Computer Vision 🎀 Speech AI 🎯 Recommendations 🧠 Multimodal ⚑ Any PyTorch Model

Comprehensive Hardware Ecosystem

Hardware acceleration contributed by industry partners via open source

XNNPACK with KleidiAI

CPU acceleration across ARM and x86 architectures

Apple Core ML

Neural Engine and Apple Silicon optimization

Qualcomm Snapdragon

Hexagon NPU support

ARM Ethos-U

Microcontroller NPU for ultra-low power

Vulkan GPU

Cross-platform graphics acceleration

Intel OpenVINO

x86 CPU and integrated GPU optimization

MediaTek NPU

Dimensity chipset acceleration

Samsung Exynos

Integrated NPU optimization

NXP Neutron

Automotive and IoT acceleration

Apple MPS

Metal Performance Shaders for GPU acceleration

ARM VGF

Versatile graphics framework support

Cadence DSP

Digital signal processor optimization

Success Stories

Production deployments and strategic partnerships accelerating edge AI

Adoption

  • Meta Family of Apps: Production deployment across Instagram, Facebook, and WhatsApp
  • Meta Reality Labs: Powers Quest 3 VR and Ray-Ban Meta Smart Glasses AI

Ecosystem Integration

  • Hugging Face: Optimum-ExecuTorch for direct transformer model deployment
  • LiquidAI: Next-generation Liquid Foundation Models optimized for edge deployment
  • Software Mansion: React Native ExecuTorch bringing edge AI to mobile apps

Demos

  • Llama: Complete LLM implementation with quantization, KV caching, and mobile deployment
  • Voxtral: Multimodal AI combining text, vision, and audio processing in one model

Ready to Deploy AI at the Edge?

Join thousands of developers using ExecuTorch in production

Get Started Today