Automation score (human vs. script)

The automation score rates how script-driven vs. human-driven each user’s API traffic is, on a 0.0–1.0 scale:

HIGH (→ 1.0) — traffic is mostly driven by automatic scripts / batch jobs / cron.
LOW (→ 0.0) — a human is using the service interactively (a chat UI, or a human-driven coding agent such as Claude Code).

It is a heuristic for triage, not a verdict. Always read it together with the reported confidence and the per-signal breakdown.

Where it lives

Layer	Location
Core scoring + gather SQL	`apps/backend/serving/analytics/automation_score.py`
Store methods	`LogStore.get_user_automation_score` / `get_bulk_user_automation_scores` (`serving/storage/postgres_log.py`)
Admin endpoints	`GET /admin/users/{user_id}/automation-score`, `GET /admin/users/automation-scores` (`serving/servers/routers/admin/users.py`)
Admin dashboard	per-user button + bulk “Automation” column in the Users tab
CLI	`ops/db/analysis/user_automation_score.py`

The scoring methodology and the SQL that gathers per-user aggregates live in one module (serving.analytics.automation_score), shared by the admin endpoints and the CLI, so the two can never drift. The pure scoring functions have no database dependency and are unit-tested in tests/unit/test_user_automation_score.py.

Inputs

The score is computed over a trailing window (default 30 days; the admin endpoints accept days in 1..90). Only api_logs rows with a non-null user_id are considered — anonymous traffic is excluded. For each user the following per-user aggregates are gathered:

request count N, and n_chat = rows where num_user_turns IS NOT NULL (chat-style requests; NULL for embeddings / raw completions);
one-shot count (num_user_turns = 1) and the p90 of num_user_turns;
tool-call counts (num_tool_calls IS NOT NULL, and > 0);
the 25/50/75th percentiles of positive prompt_tokens;
the share of requests carrying a coding-agent opener (metadata->>'agent' present — see the agent_opener_override signal below);
a request-weighted breakdown of metadata->>'user_agent';
a UTC hour-of-day histogram (24 buckets);
the 25/50/75th percentiles of the inter-arrival gaps between consecutive requests.

The signals

The score blends seven signals. Each maps its raw metric to an automation sub-score in [0, 1] (HIGH = looks automated), and has a default weight. The four signals the feature was originally specified around are user turns, turn length, user-agent, and daily activity; two small supporting signals disambiguate the main confounder (a high-volume but human-driven coding agent); and user_message_shape measures the user’s own messages specifically.

Signal	Axis	Default weight	Available when
`turn_pattern`	user turns	0.24	`n_chat ≥ 5`
`prompt_size_dispersion`	total prompt size	0.17	`n_sz ≥ 8` and median > 0
`user_message_shape`	user-message size & entropy	0.15	≥ 1 part below
`client_tool_prior`	user-agent	0.16	always (≥ 1 request)
`daily_activity_shape`	daily activity	0.27	≥ 1 timing part below
`tool_call_human_tell`	(support)	0.08	`n_chat ≥ 5` and some tool use
`agent_opener_override`	(support)	0.08	`agent_share ≥ 0.05`

A signal that lacks enough data for a user is dropped, and the remaining weights are re-normalized over the survivors — a missing signal is never imputed as 0 (which would falsely pull the score toward “human”). The weights need not sum to 1.0 (they total 1.15); the runtime always divides by the available weight. user_message_shape was added later at weight 0.15 without changing the original six, so a user lacking the (newer) user-message columns — e.g. traffic logged before the migration — drops it and gets the same blended score as before, only a slightly lower confidence.
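As a rough illustration of the re-normalization described above, the sketch below rescales the default weights over whichever signals survive. The weights are copied from the table; the helper name `renormalized_weights` and the example user are hypothetical and not taken from the real module.

```python
# Default weights copied from the signal table above.
DEFAULT_WEIGHTS = {
    "turn_pattern": 0.24,
    "prompt_size_dispersion": 0.17,
    "user_message_shape": 0.15,
    "client_tool_prior": 0.16,
    "daily_activity_shape": 0.27,
    "tool_call_human_tell": 0.08,
    "agent_opener_override": 0.08,
}


def renormalized_weights(available):
    """Hypothetical helper: rescale the weights of the available signals to sum to 1."""
    total = sum(DEFAULT_WEIGHTS[name] for name in available)
    if total == 0:
        return {}
    return {name: DEFAULT_WEIGHTS[name] / total for name in available}


# Illustrative user for whom only two signals have enough data.
shares = renormalized_weights(["client_tool_prior", "daily_activity_shape"])
print(round(sum(DEFAULT_WEIGHTS.values()), 2))  # total of all seven weights
print(shares)
```

In this example the two surviving signals split the blend in proportion 0.16 : 0.27; the missing signals contribute nothing rather than counting as 0.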

1. `turn_pattern` — user turns

An interactive session resends a growing history, so num_user_turns climbs 1, 2, 3, …; a script firing independent one-shot completions emits num_user_turns = 1 on essentially every request.

f1          = one_shot_chat_requests / n_chat          # fraction stuck at 1 user turn
depth_factor = 0.5 if p90(num_user_turns) >= 3 else 1.0
sub          = clamp01(f1 * depth_factor)

The depth_factor halves the sub-score for a user who has demonstrably held deep (p90 ≥ 3) multi-turn threads, so a human who also fires many one-shot requests is not branded a script.
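A minimal runnable sketch of the formula above, assuming the three aggregates are already gathered. The function name and the choice to return `None` below the `n_chat ≥ 5` floor are illustrative assumptions, not the module's actual API.

```python
def clamp01(x):
    """Clamp a value into [0, 1]."""
    return max(0.0, min(1.0, x))


def turn_pattern_subscore(n_chat, one_shot_chat_requests, p90_user_turns):
    """Hypothetical helper: return the sub-score, or None when the signal is unavailable."""
    if n_chat < 5:  # availability floor from the signal table
        return None
    f1 = one_shot_chat_requests / n_chat  # fraction stuck at 1 user turn
    depth_factor = 0.5 if p90_user_turns >= 3 else 1.0
    return clamp01(f1 * depth_factor)


# A script firing only one-shot requests vs. a human with deep threads.
print(turn_pattern_subscore(100, 100, 1))
print(turn_pattern_subscore(100, 60, 4))
```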

2. `prompt_size_dispersion` — length of user turn

This uses prompt_tokens, which (OpenAI semantics) is the total input for the request — the system prompt + the whole conversation history resent that turn + tool definitions + the latest user message — not the user message in isolation (api_logs has no per-role token breakdown). It is the only available proxy for “length of user turn”. Templated automation assembles each request from a fixed template, so its total prompt size clusters tightly; an interactive human’s requests vary widely (a one-word follow-up, then a long paste). The discriminator is therefore the robust relative dispersion (IQR / median) of that total size, which is scale-free and resistant to a single huge pasted prompt:

rcv = (p75 - p25) / median        # over positive prompt_tokens
sub = clamp01(1 - rcv / 0.5)      # rcv >= 0.5 -> 0 (human-varied); rcv = 0 -> 1 (templated)
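The same formula as a small runnable sketch. It assumes the quartiles of positive `prompt_tokens` are already computed and that `n_sz` is the number of requests with positive `prompt_tokens` (an assumption; the source does not define `n_sz`). The function name is hypothetical.

```python
def clamp01(x):
    """Clamp a value into [0, 1]."""
    return max(0.0, min(1.0, x))


def prompt_size_dispersion_subscore(p25, median, p75, n_sz):
    """Hypothetical helper: IQR/median dispersion mapped to an automation sub-score."""
    if n_sz < 8 or median <= 0:  # availability floor from the signal table
        return None
    rcv = (p75 - p25) / median
    return clamp01(1 - rcv / 0.5)


# Tightly clustered prompt sizes (templated) vs. widely varying sizes (interactive).
print(prompt_size_dispersion_subscore(1180, 1200, 1220, 500))
print(prompt_size_dispersion_subscore(200, 1000, 3000, 500))
```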

3. `client_tool_prior` — user-agent

The User-Agent is the most spoofable signal, so it is a soft, low-weight prior, never decisive. Each request’s UA is classified into a client class (a Python port of the frontend parseClientTool), with a per-request automation value:

Client class	Examples	Value
Interactive / coding agent	`claude-code`, `cline`, `cursor`, `codex`, `browser`	`0.10`
Ambiguous SDK	`openai-python`, `openai-node`, `anthropic-python`/`-sdk`	`0.50`
Raw HTTP library / API tool	`python-requests`, `httpx`, `aiohttp`, `curl`, `wget`, `okhttp`, `axios`, `go-http`, `postman`, …	`0.85`
Unknown (recognized leading token)	`myagent/1.0`	`0.60`
Absent / unrecognized UA	—	`0.70`

ua_base = request-weighted mean of the per-request class values
sub     = clamp01(ua_base * (1 - 0.85 * min(agent_share, 1)))

SDKs sit at a neutral 0.5 because a human chat UI can sit on top of openai-python. The agent_share term lets the coding-agent opener (see below) pull the prior toward human even behind a scripty UA.
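To make the request-weighted prior concrete, here is a minimal sketch that takes per-class request counts as input. The class values come from the table above; the dictionary keys and the function name are hypothetical labels, and the UA classification itself (the port of `parseClientTool`) is not reproduced.

```python
def clamp01(x):
    """Clamp a value into [0, 1]."""
    return max(0.0, min(1.0, x))


# Per-request automation value for each client class (keys are illustrative names).
CLASS_VALUE = {
    "interactive": 0.10,
    "sdk": 0.50,
    "raw_http": 0.85,
    "unknown": 0.60,
    "absent": 0.70,
}


def client_tool_prior_subscore(class_counts, agent_share):
    """Hypothetical helper: request-weighted UA prior, damped by the agent-opener share."""
    total = sum(class_counts.values())
    if total == 0:
        return None
    ua_base = sum(CLASS_VALUE[c] * n for c, n in class_counts.items()) / total
    return clamp01(ua_base * (1 - 0.85 * min(agent_share, 1)))


# The same scripty UA, without and with a pervasive coding-agent opener.
print(client_tool_prior_subscore({"raw_http": 100}, 0.0))
print(client_tool_prior_subscore({"raw_http": 100}, 1.0))
```

With `agent_share = 1` the prior keeps only 15% of `ua_base`, which is how a pervasive opener pulls a raw-HTTP client toward human.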

4. `daily_activity_shape` — daily activity

This is the behavioral fingerprint that is hardest to fake at scale. Up to four parts are fused, with their weights re-normalized over whichever parts pass their data floors. Each part is scored so that HIGH = automation:

Part	Weight	Formula	Floor
Hour coverage	0.20	`clamp01((coverage - 0.5) / 0.5)`, `coverage = distinct active UTC hours / 24`	`N ≥ 10`
Hour entropy	0.20	`clamp01((Hnorm - 0.5) / (0.92 - 0.5))`, `Hnorm = ShannonEntropy(hours) / log2(24)`	`N ≥ 10`
Nightly rest gap	0.30	`clamp01(1 - max_quiet_gap_hours / 6)`	`N ≥ 10`
Inter-arrival regularity	0.30	`clamp01(1 - gap_rcv / 1.0)`, `gap_rcv = (p75 - p25) / median` of gaps	`≥ 3 gaps`

max_quiet_gap_hours is the longest circular run of inactive clock-hours — a human’s nightly sleep gap pushes the rest-gap part toward 0, while 24/7 operation leaves no gap (→ 1). The timezone-invariant parts (regularity, rest gap, entropy) carry most of the weight, so an unknown user timezone shifts the histogram without distorting the verdict (hours are bucketed in UTC).
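The four parts can be sketched as follows, assuming a 24-bucket UTC histogram and precomputed inter-arrival quartiles as input. All function names are hypothetical, and the `gap_median > 0` guard is an added assumption to avoid dividing by zero.

```python
import math


def clamp01(x):
    """Clamp a value into [0, 1]."""
    return max(0.0, min(1.0, x))


def hour_coverage(hist):
    """Share of the 24 UTC clock-hours with at least one request."""
    return sum(1 for c in hist if c > 0) / 24


def normalized_hour_entropy(hist):
    """Shannon entropy of the hour histogram, divided by log2(24)."""
    total = sum(hist)
    if total == 0:
        return 0.0
    h = -sum((c / total) * math.log2(c / total) for c in hist if c > 0)
    return h / math.log2(24)


def max_quiet_gap_hours(hist):
    """Longest circular run of consecutive inactive clock-hours."""
    if not any(hist):
        return 24
    best = run = 0
    for c in hist + hist:  # doubling the list lets a run wrap past midnight
        run = run + 1 if c == 0 else 0
        best = max(best, run)
    return best


def daily_activity_subscore(hist, gap_p25, gap_median, gap_p75, n_gaps):
    """Fuse whichever parts pass their floors; None if no part is available."""
    parts = []  # (weight, part score)
    if sum(hist) >= 10:
        parts.append((0.20, clamp01((hour_coverage(hist) - 0.5) / 0.5)))
        hnorm = normalized_hour_entropy(hist)
        parts.append((0.20, clamp01((hnorm - 0.5) / (0.92 - 0.5))))
        parts.append((0.30, clamp01(1 - max_quiet_gap_hours(hist) / 6)))
    if n_gaps >= 3 and gap_median > 0:
        gap_rcv = (gap_p75 - gap_p25) / gap_median
        parts.append((0.30, clamp01(1 - gap_rcv / 1.0)))
    if not parts:
        return None
    weight_sum = sum(w for w, _ in parts)
    return sum(w * s for w, s in parts) / weight_sum


# Office-hours user: active 09:00-17:59 UTC, quiet otherwise.
office = [0] * 9 + [5] * 9 + [0] * 6
print(max_quiet_gap_hours(office))
print(daily_activity_subscore(office, 0, 0, 0, 0))
```

Shifting the `office` histogram by any number of hours leaves coverage, entropy, and the circular quiet gap unchanged, which illustrates why UTC bucketing does not penalize users in other timezones.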

5. `tool_call_human_tell` — supporting

Agentic tool use (num_tool_calls > 0, from OpenAI tool_calls / Anthropic tool_use blocks) is the fingerprint of a human-driven coding loop. It is a one-directional human tell: it is dropped entirely when there are no tool calls (absence of tool use is not evidence of automation), and otherwise can only pull the score toward human, never raise it:

sub = clamp01(0.5 - toolcall_share)   # share>=0.5 -> 0 (strongly human); share~0 -> ~0.5 (neutral)
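A short sketch of this one-directional signal. It takes `toolcall_share` as a precomputed input because the source does not spell out its denominator; the function name and the `None` return for "dropped" are illustrative assumptions.

```python
def clamp01(x):
    """Clamp a value into [0, 1]."""
    return max(0.0, min(1.0, x))


def tool_call_human_tell_subscore(n_chat, toolcall_share):
    """Hypothetical helper: dropped (None) without tool use, otherwise at most 0.5."""
    if n_chat < 5 or toolcall_share <= 0:
        return None  # absence of tool use is not evidence of automation
    return clamp01(0.5 - toolcall_share)


print(tool_call_human_tell_subscore(200, 0.0))  # dropped
print(tool_call_human_tell_subscore(200, 0.1))
print(tool_call_human_tell_subscore(200, 0.7))
```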

6. `agent_opener_override` — supporting

metadata->>'agent' is a coding-agent identity parsed from the system-prompt opener ("You are Claude Code, …"). Because it is content-derived (not a header), it is hard to fake incidentally and is treated as strong human evidence. It is available only when the opener appears on ≥ 5% of requests, so it can only pull toward human; its absence is uninformative:

sub = clamp01(0.15 - agent_share)     # opener pervasive -> ~0 (human)

It also drives the hard human clamp in the combination step.
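The availability rule and formula above, as a minimal sketch. The function name is hypothetical, and returning `None` stands in for "signal dropped".

```python
def clamp01(x):
    """Clamp a value into [0, 1]."""
    return max(0.0, min(1.0, x))


def agent_opener_override_subscore(agent_share):
    """Hypothetical helper: available only when the opener is on >= 5% of requests."""
    if agent_share < 0.05:
        return None  # absence of the opener is uninformative
    return clamp01(0.15 - agent_share)


print(agent_opener_override_subscore(0.02))  # dropped
print(agent_opener_override_subscore(0.05))
print(agent_opener_override_subscore(0.60))
```

Because the signal only exists at `agent_share ≥ 0.05`, its sub-score never exceeds 0.10, so whenever it participates it sits near the human end of the scale.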

7. `user_message_shape` — user-message size & entropy

Where prompt_size_dispersion looks at the whole input, this signal looks at the user’s own messages. Three properties of the newest user-role message per request are precomputed at log time (in user_message_stats, alongside conversation_shape) as cheap columns — char length, Shannon character entropy (bits/char), and a stable 64-bit hash of the stripped text — so the score never de-TOASTs the prompt. Three parts are fused and re-normalized over whichever pass their floors, each HIGH = automation:

Part	Weight	Formula	Floor
Size dispersion	0.40	`clamp01(1 - size_rcv / 0.5)`, `size_rcv = (p75 - p25)/median` of `last_user_msg_chars`	`≥ 8 sized msgs`
Per-message entropy	0.25	`clamp01(1 - mean_entropy / 4.0)` (natural language ≈ 4 bits/char)	`≥ 5 msgs`
Cross-message repetition	0.35	`clamp01((1 - distinct_ratio) / 0.5)`, `distinct_ratio = distinct(hash)/count`	`≥ 8 hashed msgs`

Templated automation sends near-constant-length user messages (low size dispersion), low-entropy structured payloads, and resends the same message over and over (low distinct ratio → high repetition); an interactive human varies all three. Because the columns are populated only for new traffic, the signal is simply unavailable (dropped) for users whose requests all predate the migration.

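This signal isolates the user’s own input. Because its parts are derived from message content (length, character entropy, and the per-message hash), much like the content-derived metadata->>'agent' opener, it is hard to spoof without actually varying the content.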
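The three parts can be illustrated directly on raw message strings. This is only a sketch: the real score reads precomputed columns rather than message text, compares 64-bit hashes rather than the stripped strings used here, and its quartile method may differ from `statistics.quantiles`. All function names are hypothetical.

```python
import math
import statistics
from collections import Counter


def clamp01(x):
    """Clamp a value into [0, 1]."""
    return max(0.0, min(1.0, x))


def char_entropy(text):
    """Shannon entropy of the character distribution, in bits per character."""
    if not text:
        return 0.0
    n = len(text)
    return -sum((c / n) * math.log2(c / n) for c in Counter(text).values())


def user_message_shape_subscore(messages):
    """Fuse whichever parts pass their floors; None if no part is available."""
    parts = []  # (weight, part score)
    lengths = [len(m) for m in messages if m]
    if len(lengths) >= 8:
        p25, median, p75 = statistics.quantiles(lengths, n=4, method="inclusive")
        if median > 0:
            size_rcv = (p75 - p25) / median
            parts.append((0.40, clamp01(1 - size_rcv / 0.5)))
    if len(messages) >= 5:
        mean_entropy = sum(char_entropy(m) for m in messages) / len(messages)
        parts.append((0.25, clamp01(1 - mean_entropy / 4.0)))
    if len(messages) >= 8:
        # Stand-in for distinct(hash) / count: compare the stripped text itself.
        distinct_ratio = len({m.strip() for m in messages}) / len(messages)
        parts.append((0.35, clamp01((1 - distinct_ratio) / 0.5)))
    if not parts:
        return None
    weight_sum = sum(w for w, _ in parts)
    return sum(w * s for w, s in parts) / weight_sum


# A script that resends the same short message ten times.
print(user_message_shape_subscore(["ping"] * 10))
```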

Combining the signals

A          = signals available for this user
weight_sum = sum(weight_i for i in A)
raw        = sum(sub_i * weight_i for i in A) / weight_sum      # re-normalized blend

# Hard human clamp: a high-volume coding-agent user with a real nightly rest gap
# can never be branded above "mixed" on volume alone.
if agent_share >= 0.3 and rest_gap_part_available and rest_gap_score < 0.5:
    raw = min(raw, 0.5)

# Confidence shrinkage toward the neutral 0.5 prior for low-volume users.
alpha = N / (N + 30)
score = clamp01(alpha * raw + (1 - alpha) * 0.5)

# coverage = available weight / total signal weight, so confidence stays in [0,1].
confidence = alpha * (weight_sum / TOTAL_WEIGHT)

Re-normalization (raw) makes the blend depend only on the signals that actually had data, so a pure-embeddings batch user — whose num_*_turns columns are all NULL — is still scored on its user-agent and daily-activity shape.
Shrinkage (alpha = N / (N + 30)) pulls users with little traffic toward the neutral 0.5 prior. Data and prior carry equal weight at N = 30 (alpha = 0.5), and the data outweighs the prior only above that; at N = 5, alpha ≈ 0.14 (the score is pulled ~86% of the way to 0.5).
confidence combines the request volume (alpha) with the share of the total signal weight that was actually available (weight_sum / TOTAL_WEIGHT), so sparse verdicts — and users missing the newer signals — read as lower-confidence.
Users with N < 5 requests are flagged insufficient_data.
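The combination step above translates almost line for line into runnable Python. The sketch below assumes sub-scores arrive as a dict with `None` for dropped signals; the function name, argument names, and the returned dict shape are illustrative, not the module's actual interface.

```python
DEFAULT_WEIGHTS = {
    "turn_pattern": 0.24,
    "prompt_size_dispersion": 0.17,
    "user_message_shape": 0.15,
    "client_tool_prior": 0.16,
    "daily_activity_shape": 0.27,
    "tool_call_human_tell": 0.08,
    "agent_opener_override": 0.08,
}
TOTAL_WEIGHT = sum(DEFAULT_WEIGHTS.values())  # 1.15


def clamp01(x):
    """Clamp a value into [0, 1]."""
    return max(0.0, min(1.0, x))


def combine(subscores, n_requests, agent_share=0.0, rest_gap_score=None):
    """Blend available sub-scores, apply the human clamp, then shrink toward 0.5."""
    available = {k: v for k, v in subscores.items() if v is not None}
    if not available or n_requests <= 0:
        return None
    weight_sum = sum(DEFAULT_WEIGHTS[k] for k in available)
    raw = sum(v * DEFAULT_WEIGHTS[k] for k, v in available.items()) / weight_sum

    # Hard human clamp: coding-agent user with a real nightly rest gap.
    if agent_share >= 0.3 and rest_gap_score is not None and rest_gap_score < 0.5:
        raw = min(raw, 0.5)

    alpha = n_requests / (n_requests + 30)  # shrinkage factor
    return {
        "score": clamp01(alpha * raw + (1 - alpha) * 0.5),
        "confidence": alpha * (weight_sum / TOTAL_WEIGHT),
        "insufficient_data": n_requests < 5,
    }


# Every signal says "automated", but there are only 30 requests.
all_high = {name: 1.0 for name in DEFAULT_WEIGHTS}
print(combine(all_high, n_requests=30))
```

With all seven signals at 1.0 and N = 30, alpha is 0.5, so the score lands halfway between the raw blend and the 0.5 prior.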

Bands

The final score maps to a band label (advisory — always read with confidence):

Band	Range
`likely_human`	`0.00 – 0.35`
`mixed_or_uncertain`	`0.35 – 0.60`
`likely_automated`	`0.60 – 0.80`
`scripted_batch`	`0.80 – 1.00`
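A small sketch of the band lookup. The table's ranges share their endpoints (0.35, 0.60, 0.80), so this example assumes each boundary belongs to the higher band; that tie-breaking rule and the function name are assumptions, not confirmed by the source.

```python
def band(score):
    """Map a final score to its advisory band (lower bound inclusive, by assumption)."""
    if score < 0.35:
        return "likely_human"
    if score < 0.60:
        return "mixed_or_uncertain"
    if score < 0.80:
        return "likely_automated"
    return "scripted_batch"


for s in (0.10, 0.50, 0.75, 0.95):
    print(s, band(s))
```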

Caveats

It’s a heuristic. No single signal is decisive; each is individually spoofable, so the user-agent is weighted low and behavioral signals dominate. Treat the score as triage, not proof.
Timezone. Hour-of-day is bucketed in UTC; the design leans on the timezone-invariant daily-activity parts so a non-UTC human is not mislabeled as night-active, but a genuinely multi-timezone account can inflate hour coverage.
Shared / role accounts. Shared or team accounts blend human and script traffic into a mid “mixed” score, which is the intended result. Internal / admin accounts may legitimately run automation. The score stays role-agnostic, so exclude them from any automated enforcement and read the score with their role in mind.

Reproducing it from the CLI

# rank the most script-like users in the last 30 days
python ops/db/analysis/user_automation_score.py --min-requests 20

# full per-signal breakdown for one user
python ops/db/analysis/user_automation_score.py --email [email protected]
