Playing With Codex Autoresearch on a Kaggle Playground
Mar 26, 2026
How I turned Codex into a narrow repo-specific experiment loop, with a generated program.md and a Kaggle competition repo as the proving ground.
Inspired by Andrej Karpathy’s autoresearch, I’ve been experimenting with a smaller idea: treating Codex not just as a coding assistant, but as an agent that can run a narrow, repo-specific research loop. That led me to build benyue1978/codex-autoresearch and test it against a real Kaggle competition repo for the Playground Series S6E3 customer churn challenge.
The result is not some grand autonomous scientist. It is something much smaller and more useful: a practical workflow in which Codex inspects a repository, derives a repo-specific contract, and then runs an experiment loop with a clear keep-or-discard rule.
The Basic Idea
The shape is close to Karpathy’s autoresearch: define a narrow mutable surface, define a fixed measure, run experiments one at a time, and keep only the changes that actually help.
What I wanted was a Codex-friendly version of that idea.
In practice, the loop looks like this:
- Inspect the repository.
- Figure out the setup, baseline command, measure source, and editable surface.
- Establish a baseline.
- Try one small change.
- Run the required checks and measurement.
- Keep or discard using git.
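The keep-or-discard step above can be sketched as a tiny shell function. This is my own minimal illustration, not code from the repo; the real measure command and git policy are repo-specific and live in the generated program.md.

```shell
#!/bin/sh
# Minimal sketch of the keep-or-discard rule. Names are mine (hypothetical);
# the actual evaluation command and commit policy come from program.md.

keep_or_discard() {
  baseline="$1"; candidate="$2"
  # Assumes higher is better; a loss-style measure would flip the comparison.
  if awk -v b="$baseline" -v c="$candidate" 'BEGIN { exit !(c > b) }'; then
    echo keep      # e.g. git add -A && git commit -m "experiment: keep"
  else
    echo discard   # e.g. git checkout -- .
  fi
}

keep_or_discard "0.8512" "0.8547"   # prints: keep
keep_or_discard "0.8512" "0.8490"   # prints: discard
```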
The key is that this has to be repo-specific. Every repository has different commands, different constraints, different measures, and different definitions of what is safe to edit.
Instead of hardcoding all of that into one prompt, I split the system into two layers:
- a reusable skill that prepares autoresearch for a new repo
- a generated program.md that becomes the repo-specific operating contract
The Small Toolkit
The repo I built has two main parts.
First, there is a setup skill: setup-autoresearch.
Its job is to inspect the current repository, infer as much as possible, ask only the high-risk follow-up questions, and then generate a repo-specific program.md.
Second, there is a shell wrapper: codex-autoresearch.sh.
That script handles the long-running part. Once a Codex session has started, the wrapper can keep resuming it, monitor the latest assistant output, and stop only when the completion protocol is satisfied.
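The shape of that supervision loop looks roughly like this. All names here are invented for illustration; `run_codex_turn` stands in for resuming the Codex session via the CLI, which the real wrapper does with session tracking and state files.

```shell
#!/bin/sh
# Sketch of the wrapper's supervision loop (hypothetical names throughout).
# run_codex_turn is a placeholder for resuming the Codex session and
# capturing its latest assistant message.

run_codex_turn() {
  # Placeholder: the real script resumes the tracked session here.
  echo "AUTORESEARCH COMPLETE"
}

attempt=0
while :; do
  attempt=$((attempt + 1))
  msg="$(run_codex_turn)"
  printf '%s\n' "$msg" > "last_message.$attempt.txt"   # saved per attempt
  case "$msg" in
    *"AUTORESEARCH COMPLETE"*) break ;;                # completion protocol
  esac
done
```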
So the split is deliberate:
- setup time is interactive
- run time is autonomous
That separation turned out to matter because the hard part is not only running loops. It is establishing the right contract before the loop starts.
Why Skills Matter
A generic agent is not enough for this kind of workflow.
If you simply say “go do autoresearch,” the agent still has to answer a bunch of repo-specific questions:
- what is the real measure?
- what command establishes the baseline?
- what files are safe to edit?
- what should be treated as fixed infrastructure?
- what does keep or discard actually mean in this repo?
Those answers are not universal. They have to be derived from the repository itself.
That is where the skill comes in. The skill is reusable, but the output is not. The output is program.md, and program.md should be specific to the target repo.
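For a Kaggle-style repo, I would expect the generated contract to look roughly like this. This is a hypothetical sketch: the section names and the command are invented, not the actual output of the skill.

```markdown
# program.md (sketch)

## Measure
Cross-validation score printed by the evaluation command; higher is better.

## Baseline
Run: python train.py --config base.yaml
Record the CV score before any experiment.

## Editable surface
Feature engineering and model configuration files only.

## Fixed infrastructure
Data loading, CV split definition, submission format.

## Keep-or-discard
Keep a change only if the CV score improves; otherwise revert with git.

## Stopping
Continue autonomously; stop only on a real blocker (missing data, broken env).
```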
That ended up feeling much cleaner than trying to encode every repo rule directly in the skill.
A Side Note on Skill Design
While refining the skill, I used this post as a reference:
5 Agent Skill Design Patterns Every ADK Developer Should Know
That article was useful because it gave me a better language for what the skill actually is.
setup-autoresearch is not just “a skill.” It is really a combination of two patterns:
- Inversion
- Generator
The inversion part is the setup conversation: inspect first, then ask the human only the minimum high-risk questions.
The generator part is the output: produce one structured artifact, program.md.
That framing helped me clean it up. I moved the generic program.md example into a reference template, made the setup step explicitly interactive, and made the generated program.md itself explicitly autonomous.
That was a small design improvement, but it also suggested a useful follow-up habit: once you draft a skill, ask your coding agent to “do a code review according to this article.”
Why I Used a Kaggle Competition as the Playground
I wanted a real repo, but I also wanted one where the experiment loop was easy to understand.
A Kaggle competition repo is great for that.
The playground I used was based on the Playground Series S6E3 customer churn competition.
It works well as a testing ground because it has all the ingredients an autoresearch loop needs:
- a concrete baseline
- a measurable outcome
- a constrained modeling surface
- repeatable commands
- a natural keep-or-discard decision boundary
It is also just messy enough to be realistic. There are commands, artifacts, metrics, and submission-related decisions to infer, which means the setup skill actually has to do some work.
That makes it much better than a toy repo for trying out this workflow.
How I Actually Used It
The flow was roughly this.
First, I entered the Kaggle repo and ran the setup step through the skill. The skill inspected the repository, looked at the modeling surface, tried to infer the measure and baseline flow, and asked only where the assumptions were risky.
Then it generated a repo-specific program.md.
After that, the actual running prompt became simple:
read program.md and begin autoresearch
At that point, I could either run Codex directly or wrap it with the shell script.
The shell script was useful because it let me keep the same Codex session moving forward without manually babysitting every turn. It also gave me state files, session tracking, and a live preview of the latest assistant output.
Over time I refined the wrapper so that:
- prompt input starts a new session
- `session-id` resumes a specific session
- `last` resumes the last session explicitly
- the latest assistant message is saved per attempt
- signal handling is explicit
- the last preview line is not lost when the message lacks a trailing newline
That all sounds minor, but long-running agent workflows are full of small failure modes. The wrapper became more valuable as those edges were made explicit.
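The trailing-newline item is a classic shell pitfall, easy to demonstrate in a few lines. This is a generic illustration of the bug and fix, not the wrapper's actual code: a plain `while read` loop silently drops a final line that has no terminating newline, because `read` returns nonzero on it even though it fills the variable.

```shell
#!/bin/sh
# Demonstrates the trailing-newline edge case: the last line of the file
# below is not newline-terminated.
f="$(mktemp)"
printf 'first line\nlast line without newline' > "$f"

naive=''
while IFS= read -r line; do
  naive="$line"            # keeps only fully terminated lines
done < "$f"

robust=''
while IFS= read -r line || [ -n "$line" ]; do
  robust="$line"           # also keeps the unterminated last line
done < "$f"

echo "naive:  $naive"      # prints: naive:  first line
echo "robust: $robust"     # prints: robust: last line without newline
rm -f "$f"
```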
What I Learned
The most important lesson is that the interesting part is not “making Codex loop.” The interesting part is making the contract explicit.
You need to tell the agent, in repo-specific terms:
- what the authoritative measure is
- what can be edited
- what should stay fixed
- how to decide keep versus discard
- when to continue autonomously
- when to stop because a real blocker exists
Once that contract is clear, the workflow becomes much more coherent.
The second lesson is that skills need structure. A long pile of instructions is not the same thing as a well-designed skill. Separating the interactive setup phase from the generated output made the whole system simpler.
The third lesson is that Kaggle is a very good playground for this kind of experiment. It gives you a real objective and a clean experimental rhythm without needing a huge production system.
What This Repo Is Good For
Right now, I think benyue1978/codex-autoresearch is useful as:
- a small reusable starting point for repo-specific autoresearch
- an example of combining Codex skills with a generated program.md
- a shell supervisor for long-running Codex experiment sessions
- a playground for testing how far agentic experiment loops can go in a controlled repository
It is not magic. It still depends on a repo being inspectable and on the generated program.md being good. That is fine. The point is not to remove human judgment. The point is to move that judgment into the setup contract, and then let the agent execute within it.
Closing
This started as a small experiment, but it ended up clarifying something I care about more broadly: agent workflows get much better when the operating contract is explicit.
Turning a repository into a bounded research environment made Codex feel less like a chatbot and more like a narrow but disciplined collaborator. Using a Kaggle competition repo made that concrete enough to test for real.
If you want to see the toolkit itself, the repo is here:
https://github.com/benyue1978/codex-autoresearch
The ideas in this post are mine; Codex helped me write it.
If you'd like to follow what I'm learning about AI tools and workflows, you can subscribe here → Subscribe to my notes