Using Open-Source Libraries for Neural Networks in C++

Contents

  • Introduction
  • A brief history of why deployment returns to C++
  • Why C++ is still central in AI systems
  • Main classes of open-source libraries
  • ONNX Runtime
  • LibTorch
  • oneDNN and OpenVINO
  • TensorFlow Lite for edge deployments
  • llama.cpp and the modern LLM wave
  • How to choose the right library
  • A practical integration workflow
  • Common mistakes
  • Examples and counterexamples from real library choices
  • A deployment story in miniature
  • Hands-On Lab: Build a tiny ONNX Runtime CLI
  • Test Tasks for Enthusiasts
  • Summary
  • References

Introduction

Hello friends!

If you spend enough time around modern AI products, you eventually notice an interesting pattern: many experiments start in Python, but a lot of serious deployment work ends up in C++. That is not an accident.

Python is excellent for research, notebooks, quick iteration, and gluing components together. But when teams move from "the model works on my machine" to "this model must run every hour, every minute, or every millisecond on real hardware", the conversation changes. Latency starts to matter. Binary size starts to matter. Startup time matters. Memory layout matters. Threading matters. The quality of SIMD kernels matters. Deployment friction matters.

This is exactly where C++ becomes useful again.

In this article we will walk through the most practical open-source libraries that help us build, run, optimize, and ship neural-network workloads in C++. We will focus on tools that are already used in real systems, have public documentation, and can be integrated into engineering pipelines without vendor lock-in.

The main goal is not to worship one framework. The goal is to understand what each library is good at, where it hurts, and how to compose them in a way that serves the product instead of the hype.

That, incidentally, is the most important habit in AI engineering. The field is young enough to produce a new slogan every month, and old enough already to punish people who believe slogans too literally. A library is not a worldview. It is a tool. And like every serious tool, it becomes useful only when it is matched to a particular job, on a particular machine, under a particular operational burden.

A brief history of why deployment returns to C++

There is a rhythm to AI engineering that repeats itself so often that it is almost comic.

First comes the discovery phase. People explore ideas, change architectures twice before lunch, compare checkpoints, and run experiments in a notebook. For this phase, high-level tooling is ideal. It should be expressive, forgiving, and quick to rewire.

Then comes the second phase, and this is where reality enters the room.

Now the questions sound different:

  • Can this model load fast enough during autoscaling?
  • Can it run on a machine we can actually afford?
  • Can we package it without dragging half the research environment into production?
  • Can we trace failures at 3 a.m. without guessing?
  • Can we ship it to a device that has more sensors than RAM?

That second phase has always favored languages and runtimes closer to the machine. In previous decades it was databases, trading, telecom, game engines, browsers, antivirus, video pipelines, and kernels. In the AI era it is inference servers, local model runtimes, tokenization pipelines, quantized execution, and edge deployment.

This is why C++ keeps returning, even when public conversation tries to declare the matter settled in favor of something more fashionable. C++ returns not because engineers are sentimental, but because the machine eventually asks for a clear answer. How many copies? Which threads? Which allocator? Which ABI? Which kernel? Which fallback path? Which cache line?

And when those questions finally become unavoidable, C++ is still one of the languages that can answer them without blinking.

Why C++ is still central in AI systems

Before comparing libraries, we should answer a simpler question: why does C++ keep appearing in AI infrastructure even when the public conversation is dominated by Python?

There are several reasons:

  1. C++ gives direct control over performance-sensitive details. You can control memory ownership, thread pools, allocators, ABI boundaries, custom kernels, and hardware-specific optimizations with much less runtime overhead than in higher-level environments.
  2. Most hardware ecosystems still expose first-class C or C++ APIs. CUDA, oneDNN, ONNX Runtime internals, OpenVINO, and many inference runtimes are fundamentally C/C++ systems even if they are often wrapped by Python.
  3. C++ is an excellent deployment language. It works well in backend services, desktop products, embedded devices, trading systems, robotics, industrial controllers, and security software.
  4. Integration cost matters. Many companies already have large C++ codebases, existing build systems, observability stacks, and performance tooling. Adding one more C++ component is usually easier than re-platforming an entire subsystem.
  5. Inference is different from training. Training often tolerates larger frameworks and more scripting. Production inference frequently values predictable latency, smaller binaries, lower memory pressure, and tighter system integration.

The key practical observation is this: even when your researchers live in Python, your production path often touches export, serialization, runtime optimization, quantization, operator fusion, threading, and device execution. Those layers are heavily C++-driven.

Main classes of open-source libraries

It helps to stop thinking about "AI library" as one category. In real engineering, we usually choose from several distinct categories:

1. Training-first frameworks

These are libraries where defining models and running autodiff are central. The best-known C++ option here is LibTorch, the C++ frontend for PyTorch. This is useful when you want a native C++ application to own model creation, training loops, tensor operations, or inference without constantly crossing into Python.

2. Model execution runtimes

These libraries focus on loading a model graph and executing it efficiently. ONNX Runtime is the clearest example. It is ideal when training happens elsewhere and the C++ application mainly needs stable, portable inference.

3. Kernel and primitive libraries

These are lower-level acceleration layers that optimize operations like GEMM, convolution, normalization, and quantized kernels. oneDNN is a strong example on Intel-oriented CPU paths. You may not always use such libraries directly, but many higher-level runtimes depend on them.

4. Device and edge runtimes

For mobile, embedded, and resource-constrained systems, we often want a smaller inference-oriented stack. TensorFlow Lite fits this category. It is especially relevant when model size, startup cost, or device portability is more important than framework completeness.

5. LLM-specific local inference projects

The recent wave of local large-language-model inference created another category: lightweight inference engines centered on quantized transformer execution. llama.cpp became the reference point here. It is not a general deep-learning framework in the same sense as PyTorch, but it is enormously practical for running modern GGUF-based local models.

6. Vendor-optimized orchestration layers

Open projects like OpenVINO sit between a general runtime and a hardware optimization toolkit. They help bridge model conversion, graph optimization, and execution on Intel CPUs, GPUs, and NPUs.

Once you see the landscape this way, library choice becomes more rational. The question is no longer "Which AI framework is best?" The real question is "What job must this subsystem perform?"

ONNX Runtime

If I had to recommend one open-source library for the broadest range of C++ inference tasks, ONNX Runtime would be near the top of the list.

The reason is simple: it solves a common production problem cleanly.

Researchers train a model in one environment. Engineers need to run it elsewhere. A runtime that accepts a stable model format, supports multiple execution providers, and exposes a usable C/C++ API is extremely valuable.

According to the official documentation, ONNX Runtime is a high-performance inference and training graph execution engine, and its C/C++ APIs are designed for onboarding and executing ONNX models. That is exactly the practical niche many teams need.

When ONNX Runtime is a great fit

Use it when:

  • the model can be exported to ONNX cleanly
  • your application mostly performs inference
  • you want portability across CPU and accelerator backends
  • you need a well-defined runtime boundary between model authors and system engineers
  • you care about deployment stability more than framework-specific training features

Why engineers like it

  1. The contract is clear. Your application loads a model, prepares input tensors, runs inference, and consumes outputs.
  2. The API surface is smaller than a full training framework. That reduces accidental complexity.
  3. Execution providers matter. ONNX Runtime can route execution to different backends depending on the environment.
  4. It encourages explicit preprocessing and postprocessing. This is healthy for production systems because data transformations become visible and testable.

A minimal C++ inference shape

The exact code depends on your build environment, but the usage pattern looks like this:

#include <onnxruntime_cxx_api.h>
#include <array>
#include <vector>

int main() {
    Ort::Env env{ORT_LOGGING_LEVEL_WARNING, "stofu"};
    Ort::SessionOptions opts;
    opts.SetIntraOpNumThreads(4);
    opts.SetGraphOptimizationLevel(GraphOptimizationLevel::ORT_ENABLE_EXTENDED);

    Ort::Session session{env, ORT_TSTR("model.onnx"), opts};
    Ort::AllocatorWithDefaultOptions allocator;

    std::vector<int64_t> shape{1, 3, 224, 224};
    std::vector<float> input(1 * 3 * 224 * 224, 0.0f);

    auto mem = Ort::MemoryInfo::CreateCpu(
        OrtArenaAllocator,
        OrtMemTypeDefault
    );

    Ort::Value tensor = Ort::Value::CreateTensor<float>(
        mem,
        input.data(),
        input.size(),
        shape.data(),
        shape.size()
    );

    const char* input_names[] = {"input"};
    const char* output_names[] = {"output"};

    auto outputs = session.Run(
        Ort::RunOptions{nullptr},
        input_names,
        &tensor,
        1,
        output_names,
        1
    );

    return 0;
}

What matters is not memorizing the snippet. What matters is understanding the engineering model: create session, configure runtime, allocate tensor memory explicitly, run the graph, and read outputs. It is a very production-friendly shape.

Where ONNX Runtime hurts

It is not magic.

Problems usually show up in one of these areas:

  • model export incompatibilities
  • operator support mismatches
  • preprocessing discrepancies between training and production
  • hidden assumptions about tensor layout
  • quantized model behavior differing from floating-point validation

In other words, ONNX Runtime is excellent at executing a graph. It cannot save you from a bad export pipeline or from sloppy data contracts.

LibTorch

If ONNX Runtime is the clean inference runtime, LibTorch is the natural choice when you want more of the PyTorch world directly inside C++.

The official PyTorch C++ frontend mirrors many concepts from the Python API: tensors, modules, optimizers, datasets, and autograd. This makes it attractive when the native application wants stronger ownership of model code or training loops.

When LibTorch shines

Use LibTorch when:

  • your organization already relies heavily on PyTorch
  • you need C++ and PyTorch to coexist closely
  • your application may do training, fine-tuning, or advanced tensor manipulation
  • you want to keep model logic inside the same native binary as the rest of the system
  • you need custom operators or lower-level integration with surrounding C++ code

A compact LibTorch example

#include <torch/torch.h>
#include <iostream>

struct Net : torch::nn::Module {
    Net()
        : fc1(register_module("fc1", torch::nn::Linear(784, 256))),
          fc2(register_module("fc2", torch::nn::Linear(256, 10))) {}

    torch::Tensor forward(torch::Tensor x) {
        x = torch::relu(fc1->forward(x));
        x = fc2->forward(x);
        return torch::log_softmax(x, 1);
    }

    torch::nn::Linear fc1{nullptr};
    torch::nn::Linear fc2{nullptr};
};

int main() {
    Net net;
    auto x = torch::randn({32, 784});
    auto y = net.forward(x);
    std::cout << y.sizes() << "\n";
}

This style feels more like full framework programming than runtime-only inference. That is both its strength and its cost.

Advantages of LibTorch

  1. It keeps you close to PyTorch semantics. This is valuable when the research and production teams share concepts and workflows.
  2. It is expressive. You can build models, register modules, manipulate tensors, and prototype natively.
  3. It supports more than "load model and run". That makes it suitable for advanced native systems, simulation platforms, and long-running services that need deeper tensor control.

Trade-offs

The downside is that LibTorch is heavier than a narrow inference runtime.

You should expect:

  • larger integration footprint
  • more dependencies
  • more build-system complexity
  • more ABI sensitivity
  • a bigger surface area to test and maintain

That does not make it bad. It just means it is the right tool when your C++ application actually needs framework-level power, not when all you want is a stable inference engine.

oneDNN and OpenVINO

Now we move slightly lower in the stack.

oneDNN

oneDNN is an open-source performance library for deep-learning applications. It focuses on optimized primitives such as convolution, matrix multiplication, normalization, pooling, reorder operations, and quantized kernels.

Many engineers do not interact with oneDNN directly every day. Instead, they benefit from it indirectly because frameworks and runtimes use it underneath. But understanding its role is still useful.

oneDNN is important because it represents a recurring truth in AI systems engineering:

the difference between a demo and a product is often hidden in primitive implementations.

If your runtime executes convolutions, GEMMs, attention-style matrix operations, or quantized layers inefficiently, no amount of marketing around "AI transformation" will help you.

OpenVINO

OpenVINO goes one step higher. It provides a broader deployment toolkit for optimizing and running inference on Intel-oriented hardware. In practical terms, it helps with model conversion, graph preparation, and device-aware execution.

When these tools make sense

Use oneDNN and OpenVINO when:

  • CPU inference quality matters
  • Intel hardware is central to your deployment fleet
  • you need strong support for optimized inference pipelines
  • you care about throughput per watt and predictable server-side execution
  • you want to avoid manually reimplementing low-level optimizations

The strategic lesson

A lot of teams make a conceptual mistake here. They think model quality is the only meaningful dimension. In reality, for production systems we care about:

  • latency
  • throughput
  • memory footprint
  • warmup time
  • startup behavior
  • quantized path quality
  • batch scaling
  • deployment reproducibility

Libraries like oneDNN and OpenVINO matter because they attack these practical dimensions directly.

TensorFlow Lite for edge deployments

When people hear "TensorFlow Lite", they sometimes assume it only matters for smartphones. That is too narrow.

TFLite is useful anywhere you want a smaller, inference-oriented runtime with good edge characteristics:

  • mobile
  • embedded Linux
  • robotics
  • kiosks
  • industrial equipment
  • camera pipelines
  • offline devices

Its C++ API matters because many edge products are fundamentally native products. They are not notebook environments. They are not managed runtimes. They are appliances, services, or applications that must boot, infer, and recover predictably.

When TFLite is a good choice

Use it when:

  • the model can be converted into the Lite format cleanly
  • footprint is more important than framework completeness
  • the device environment is constrained
  • you want simpler inference deployment
  • you are shipping software to fleets of devices, not just a server farm

Engineering mindset for TFLite

The biggest TFLite win is usually not theoretical speed. It is operational suitability.

A library that is smaller, more focused, and easier to package on constrained systems often beats a more feature-rich stack that is painful to deploy or update.

llama.cpp and the modern LLM wave

No article on open-source neural-network libraries in 2026 is complete without discussing the LLM-specific ecosystem.

The most important project here is llama.cpp.

llama.cpp became influential because it brought a brutally practical mindset to local transformer inference:

  • run on commodity hardware
  • support quantized formats
  • keep the code portable
  • make CPU inference serious again
  • expose enough C/C++ control for real integration work

It is not a universal deep-learning framework. It is a focused, highly practical engine for local model execution. And that focus is precisely why it matters.

Why C++ engineers care about llama.cpp

  1. It demonstrates how much performance can be extracted from careful low-level engineering.
  2. It makes local inference accessible in products that do not want a heavyweight GPU stack.
  3. It is a great case study in quantization, memory mapping, batching, and thread-level optimization.
  4. It integrates naturally into native applications.

A useful lesson from the llama.cpp ecosystem

The project reminds us that "AI engineering" is often not about inventing a new model. It is about making an existing model usable under hard constraints:

  • low RAM
  • no accelerator
  • offline execution
  • local privacy requirements
  • limited startup time
  • predictable deployment artifacts

That is very much a C++ engineering problem.

How to choose the right library

Now let us get practical.

Here is a simple decision framework:

Choose ONNX Runtime if

  • you want a stable runtime boundary
  • your training path can export ONNX cleanly
  • inference is the main job
  • portability across environments matters

Choose LibTorch if

  • the native application needs framework-level tensor and model control
  • training or fine-tuning in C++ is on the table
  • the team is already deeply invested in PyTorch
  • you are comfortable paying a heavier integration cost

Choose oneDNN or OpenVINO if

  • CPU optimization is a serious requirement
  • Intel hardware is important
  • you need optimized primitives or deployment tooling beyond a generic graph runtime

Choose TensorFlow Lite if

  • your environment is constrained
  • you are deploying to devices
  • runtime footprint matters more than training flexibility

Choose llama.cpp if

  • local LLM inference is the target
  • quantized transformer execution is the core use case
  • you need portable native integration
  • edge or CPU-first deployments matter

One more important rule

Do not pick a library only because your model team likes it.

Pick it after evaluating:

  • model portability
  • build complexity
  • binary footprint
  • observability
  • quantization support
  • hardware support
  • testability
  • long-term maintainability

That is how production systems stay healthy.

A practical integration workflow

Here is a workflow that works well in real teams:

Step 1: lock the data contract

Define:

  • input names
  • tensor shapes
  • channel order
  • normalization constants
  • tokenization rules
  • output semantics

If this is vague, your runtime choice will not save you.

Step 2: validate export fidelity

If you are moving from training code to an inference runtime, verify that:

  • outputs match within acceptable tolerance
  • preprocessing is identical
  • batch and single-item behavior are both tested
  • quantized and non-quantized paths are compared honestly

Step 3: benchmark under realistic workload

Never stop at synthetic micro-tests.

Measure:

  • cold start
  • warm latency
  • p50, p95, and p99 latency
  • memory usage
  • throughput at expected concurrency
  • failure and timeout behavior

Step 4: profile before optimizing

A lot of "model performance" problems are actually:

  • image decode cost
  • tokenization overhead
  • bad threading
  • repeated allocation
  • cache-unfriendly postprocessing
  • serialization overhead

Do not blame the model runtime before you profile the surrounding system.

Step 5: make deployment reproducible

Version:

  • the model artifact
  • conversion scripts
  • runtime flags
  • quantization settings
  • tokenizer or vocab files
  • benchmark inputs

This step saves enormous pain later.
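One lightweight way to enforce this is a small manifest committed next to the model artifact. A hypothetical sketch, with every field name and value invented for illustration:

```yaml
model: inspector-v3.onnx
model_sha256: <recorded at export time>
exporter: export_onnx.py @ git 1a2b3c4
runtime: onnxruntime 1.17
runtime_flags:
  graph_optimization: ORT_ENABLE_EXTENDED
  intra_op_threads: 4
quantization: none (fp32)
tokenizer: n/a
benchmark_inputs: inputs/bench_224x224.bin
```

Whatever the exact format, the point is that a deployment is reproducible only when this whole tuple is versioned together, not just the model file.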

Common mistakes

Let us finish with the mistakes I see most often.

Mistake 1: treating Python results as production truth

A notebook result is not a deployment result. Native memory behavior, thread scheduling, model loading, and I/O can change the picture dramatically.

Mistake 2: ignoring preprocessing and postprocessing

Teams sometimes spend days tuning inference kernels while the real bottleneck is JPEG decode, tokenization, sorting, or string formatting.

Mistake 3: choosing the heaviest framework by default

If all you need is inference, a narrow runtime may be far healthier than embedding a full framework.

Mistake 4: optimizing before model portability is stable

Do not chase backend-specific tricks while the export pipeline still breaks from one training commit to the next.

Mistake 5: underestimating packaging and ABI issues

A library is only "fast" if you can actually build it, ship it, update it, and debug it in your environment.

Mistake 6: forgetting observability

Whatever runtime you choose, add metrics and logs around:

  • model load success
  • model version
  • input shape validation
  • execution latency
  • timeout count
  • fallback path usage

Production AI without observability is just expensive superstition.

Examples and counterexamples from real library choices

This is where the discussion becomes more human and more useful.

Example 1: a vision backend on x86 servers

Suppose a team trains models in PyTorch, exports them to ONNX, and deploys them behind a native C++ service.

In that situation, ONNX Runtime is usually a very good fit.

Why?

  • researchers can keep their training workflow
  • the production service mainly needs inference
  • the runtime boundary is clear
  • performance work stays focused on deployment, not framework behavior

This is the kind of boring, healthy architecture that tends to age well.

Counterexample 1: using LibTorch for simple static inference

Now imagine the same service, but the deployment team embeds LibTorch only because "we already use PyTorch."

That is often the wrong choice.

If the service never trains and never needs deep tensor-programming control, then the extra framework weight may simply become maintenance cost. The team pays more integration complexity without getting meaningful product value back.

Example 2: a native product that really needs tensor ownership

Now flip the scenario.

Suppose you are building a native desktop or edge application that:

  • owns training or fine-tuning locally
  • performs more than one forward pass
  • manipulates tensors heavily in process
  • already lives inside a large C++ application

That is where LibTorch makes sense. The application genuinely needs framework-level power, so the heavier dependency is justified.

Counterexample 2: forcing ONNX Runtime into a dynamic workflow

If a team chooses ONNX Runtime for a product that really needs live tensor transformations, custom training loops, or framework-like model composition in C++, they may end up fighting the abstraction. ONNX Runtime is excellent at executing graphs. It is not meant to be the full center of native ML development.

Example 3: an offline edge device

Imagine a camera appliance, kiosk, or industrial controller that must:

  • boot fast
  • infer locally
  • run with tight memory limits
  • survive unreliable connectivity

That is where TensorFlow Lite often becomes a very practical answer. It may be less glamorous than a full framework, but it is often more honest about the deployment environment.

Counterexample 3: stretching an edge runtime into a broad server platform

If the product is a large server-side model platform with heavy batching, diverse models, and constant experimentation, an edge-oriented runtime may feel artificially small. In that case the simplicity that looked attractive at first can become a constraint.

Example 4: shipping a local LLM inside a native app

Suppose the product goal is:

  • local inference
  • quantized models
  • no cloud dependency
  • direct native integration

That is a very natural fit for llama.cpp. It solves the real product problem directly instead of pretending to be a universal deep-learning framework.

Counterexample 4: asking llama.cpp to be a general AI platform

If the workload really needs broad framework semantics, complex training workflows, or graph-level experimentation across many model families, then llama.cpp is probably not the architectural center. It is strongest when the job is focused local inference.

The practical lesson

Teams usually make bad library decisions when they ask:

  • "Which framework is the most complete?"
  • "Which one is hottest right now?"
  • "Which one feels familiar from Python?"

They usually make better decisions when they ask:

  • "Who owns training?"
  • "Who owns inference?"
  • "What hardware are we deploying to?"
  • "Do we need tensor programming or just graph execution?"
  • "What will be easy to benchmark, debug, and update in six months?"

Those questions are less exciting and much more valuable.

A deployment story in miniature

Let us end with a concrete thought experiment, because this is often where a technical article either becomes useful or remains decorative.

Imagine a small company building an industrial inspection product. The product has three very different phases in its life:

Phase 1: model discovery

The team experiments rapidly:

  • several model architectures
  • frequent retraining
  • ad hoc preprocessing
  • lots of visual validation

At this stage almost nobody should be arguing about deployment runtimes. The correct tool is the one that accelerates learning.

Phase 2: first production rollout

Now the system must:

  • load a stable model artifact
  • infer on one class of server
  • expose a C++ API to an existing backend
  • produce consistent results across machines

This is where ONNX Runtime begins to look attractive. The model is now a product artifact, not a constantly mutating research object.

Phase 3: edge and offline deployment

Later the same company wants the model on a factory-floor device with stricter limits:

  • smaller footprint
  • no cloud dependency
  • slower CPU
  • tighter startup budget

Now a slimmer runtime such as TensorFlow Lite or a more aggressively optimized local inference path may become the healthier answer.

What changed?

Not the model family alone. What changed was the operational contract.

This is why one of the worst habits in AI engineering is to talk as if "the right framework" is a permanent identity decision. Very often the right answer changes as the product grows up.

The wise engineer is not the one who chooses one library and defends it forever. The wise engineer is the one who recognizes when the job has changed, and therefore when the library should change with it.

Hands-On Lab: Build a tiny ONNX Runtime CLI

Theory becomes more convincing when it compiles.

Let us build the smallest useful native inference program in C++. The goal is not to train a model. The goal is to feel, with your own hands, what a native runtime boundary looks like.

For this exercise you need:

  • a C++17 compiler
  • CMake
  • a prebuilt ONNX Runtime package from the official releases
  • any small .onnx model whose input is a flat float tensor

Project layout

tiny-ort/
  CMakeLists.txt
  main.cpp
  third_party/
    onnxruntime/
  model.onnx

CMakeLists.txt

cmake_minimum_required(VERSION 3.16)
project(tiny_ort LANGUAGES CXX)

set(CMAKE_CXX_STANDARD 17)
set(CMAKE_CXX_STANDARD_REQUIRED ON)

set(ORT_ROOT "${CMAKE_SOURCE_DIR}/third_party/onnxruntime")

add_executable(tiny_ort main.cpp)
target_include_directories(tiny_ort PRIVATE "${ORT_ROOT}/include")

target_link_directories(tiny_ort PRIVATE "${ORT_ROOT}/lib")
target_link_libraries(tiny_ort PRIVATE onnxruntime)

main.cpp

#include <onnxruntime_cxx_api.h>

#include <array>
#include <iostream>
#include <numeric>
#include <vector>

int main() {
    Ort::Env env{ORT_LOGGING_LEVEL_WARNING, "tiny-ort"};
    Ort::SessionOptions opts;
    opts.SetIntraOpNumThreads(1);
    opts.SetGraphOptimizationLevel(GraphOptimizationLevel::ORT_ENABLE_EXTENDED);

    const ORTCHAR_T* model_path = ORT_TSTR("model.onnx");
    Ort::Session session{env, model_path, opts};

    std::vector<int64_t> shape{1, 4};
    std::vector<float> input{0.25f, 0.50f, 0.75f, 1.0f};

    auto mem_info = Ort::MemoryInfo::CreateCpu(
        OrtArenaAllocator,
        OrtMemTypeDefault
    );

    Ort::Value tensor = Ort::Value::CreateTensor<float>(
        mem_info,
        input.data(),
        input.size(),
        shape.data(),
        shape.size()
    );

    const char* input_names[] = {"input"};
    const char* output_names[] = {"output"};

    auto outputs = session.Run(
        Ort::RunOptions{nullptr},
        input_names,
        &tensor,
        1,
        output_names,
        1
    );

    float* out = outputs[0].GetTensorMutableData<float>();
    auto out_shape = outputs[0].GetTensorTypeAndShapeInfo().GetShape();
    auto out_count = std::accumulate(
        out_shape.begin(),
        out_shape.end(),
        int64_t{1},
        std::multiplies<int64_t>{}
    );

    std::cout << "Output values:\n";
    for (int64_t i = 0; i < out_count; ++i) {
        std::cout << "  [" << i << "] = " << out[i] << "\n";
    }

    return 0;
}

Build

On Linux or macOS:

cmake -S . -B build
cmake --build build -j
./build/tiny_ort

On Windows with MSVC:

cmake -S . -B build
cmake --build build --config Release
.\build\Release\tiny_ort.exe

What this teaches you

This tiny project already forces you to confront several production realities:

  • where the runtime lives
  • how native dependencies are packaged
  • what tensor names and shapes actually are
  • how explicit memory handling feels in a native inference boundary

That is exactly the point. A library stops being a marketing term and becomes an engineering choice.

Test Tasks for Enthusiasts

If you want to turn the article into a weekend lab, here are useful next steps:

  1. Replace the hardcoded input vector with values loaded from a small text or binary file.
  2. Print input and output tensor shapes dynamically instead of assuming them.
  3. Add simple latency measurement around session.Run and compare 1, 2, and 4 intra-op threads.
  4. Swap ONNX Runtime for LibTorch in a similar toy inference app and write down what became easier and what became heavier.
  5. Export a tiny model from Python, load it in this C++ program, and verify that preprocessing differences do not silently change the result.

If you do those five tasks honestly, you will understand more about AI deployment than many people who can recite framework names for an hour.

Summary

Open-source neural-network libraries for C++ are not all trying to solve the same problem.

ONNX Runtime is excellent when you want a stable, inference-first runtime boundary. LibTorch is powerful when your native application needs deeper ownership of tensors, modules, and training semantics. oneDNN and OpenVINO matter when low-level optimization and Intel-centric deployment are important. TensorFlow Lite shines on edge devices. llama.cpp is the modern proof that careful C++ engineering can make local LLM inference practical and surprisingly efficient.

My advice is simple: choose the smallest library stack that solves the real deployment problem. In production, the most valuable runtime is usually not the one with the biggest feature list. It is the one your team can understand, profile, test, and operate with confidence.

References

  1. ONNX Runtime C/C++ API: https://onnxruntime.ai/docs/api/c/index.html
  2. ONNX official project: https://onnx.ai/
  3. PyTorch C++ frontend documentation: https://docs.pytorch.org/cppdocs/frontend.html
  4. oneDNN official documentation: https://uxlfoundation.github.io/oneDNN/
  5. OpenVINO documentation: https://docs.openvino.ai/
  6. LiteRT / TensorFlow Lite C++ API docs: https://ai.google.dev/edge/litert/api_docs/cc
  7. llama.cpp repository: https://github.com/ggml-org/llama.cpp
  8. ONNX Runtime GitHub repository: https://github.com/microsoft/onnxruntime
  9. PyTorch repository: https://github.com/pytorch/pytorch
Philip P. – CTO

Focused on fintech system engineering, low-level development, HFT infrastructure and building PoC to production-grade systems.
