Automation score (human vs. script)

The automation score rates how script-driven vs. human-driven each user’s API traffic is, on a 0.0–1.0 scale:

  • HIGH (→ 1.0) — traffic is mostly driven by automatic scripts / batch jobs / cron.

  • LOW (→ 0.0) — a human is using the service interactively (a chat UI, or a human-driven coding agent such as Claude Code).

It is a heuristic for triage, not a verdict. Always read it together with the reported confidence and the per-signal breakdown.

Where it lives

| Layer | Location |
| --- | --- |
| Core scoring + gather SQL | apps/backend/serving/analytics/automation_score.py |
| Store methods | LogStore.get_user_automation_score / get_bulk_user_automation_scores (serving/storage/postgres_log.py) |
| Admin endpoints | GET /admin/users/{user_id}/automation-score, GET /admin/users/automation-scores (serving/servers/routers/admin/users.py) |
| Admin dashboard | per-user button + bulk “Automation” column in the Users tab |
| CLI | ops/db/analysis/user_automation_score.py |

The scoring methodology and the SQL that gathers per-user aggregates live in one module (serving.analytics.automation_score), shared by the admin endpoints and the CLI, so the two can never drift. The pure scoring functions have no database dependency and are unit-tested in tests/unit/test_user_automation_score.py.

Inputs

The score is computed over a trailing window (default 30 days; the admin endpoints accept days in 1..90). Only api_logs rows with a non-null user_id are considered — anonymous traffic is excluded. For each user the following per-user aggregates are gathered:

  • request count N, and n_chat = rows where num_user_turns IS NOT NULL (chat-style requests; NULL for embeddings / raw completions);

  • one-shot count (num_user_turns = 1) and the p90 of num_user_turns;

  • tool-call counts (num_tool_calls IS NOT NULL, and > 0);

  • the 25/50/75th percentiles of positive prompt_tokens;

  • the share of requests carrying a coding-agent opener (metadata->>'agent' present — see the agent_opener_override signal below);

  • a request-weighted breakdown of metadata->>'user_agent';

  • a UTC hour-of-day histogram (24 buckets);

  • the 25/50/75th percentiles of the inter-arrival gaps between consecutive requests.
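The two timing aggregates in the list above can be illustrated in plain Python. This is only a sketch: the real aggregates are gathered in SQL, the function names `hour_histogram` and `gap_quartiles` are hypothetical, and the percentile interpolation used here (linear, via `statistics.quantiles`) is an assumption that may differ from the production query.

```python
from datetime import timezone
from statistics import quantiles


def hour_histogram(timestamps):
    """Count requests per UTC hour of day (24 buckets)."""
    hist = [0] * 24
    for ts in timestamps:
        # Timestamps are expected to be timezone-aware datetimes.
        hist[ts.astimezone(timezone.utc).hour] += 1
    return hist


def gap_quartiles(timestamps):
    """Return (p25, p50, p75) of the gaps in seconds between consecutive
    requests, or None when there are fewer than two gaps."""
    ordered = sorted(timestamps)
    gaps = [(b - a).total_seconds() for a, b in zip(ordered, ordered[1:])]
    if len(gaps) < 2:
        return None
    p25, p50, p75 = quantiles(gaps, n=4, method="inclusive")
    return p25, p50, p75
```

Both results feed the daily_activity_shape signal: the histogram drives the hour-based parts, and the gap quartiles drive inter-arrival regularity.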

The signals

The score blends seven signals. Each maps its raw metric to an automation sub-score in [0, 1] (HIGH = looks automated), and has a default weight. The four signals the feature was originally specified around are user turns, turn length, user-agent, and daily activity; two small supporting signals disambiguate the main confounder (a high-volume but human-driven coding agent); and user_message_shape measures the user’s own messages specifically.

| Signal | Axis | Default weight | Available when |
| --- | --- | --- | --- |
| turn_pattern | user turns | 0.24 | n_chat ≥ 5 |
| prompt_size_dispersion | total prompt size | 0.17 | n_sz ≥ 8 and median > 0 |
| user_message_shape | user-message size & entropy | 0.15 | ≥ 1 part below |
| client_tool_prior | user-agent | 0.16 | always (≥ 1 request) |
| daily_activity_shape | daily activity | 0.27 | ≥ 1 timing part below |
| tool_call_human_tell | (support) | 0.08 | n_chat ≥ 5 and some tool use |
| agent_opener_override | (support) | 0.08 | agent_share ≥ 0.05 |

A signal that lacks enough data for a user is dropped, and the remaining weights are re-normalized over the survivors — a missing signal is never imputed as 0 (which would falsely pull the score toward “human”). The weights need not sum to 1.0 (they total 1.15); the runtime always divides by the available weight. user_message_shape was added later at weight 0.15 without changing the original six, so a user lacking the (newer) user-message columns — e.g. traffic logged before the migration — drops it and gets the same blended score as before, only a slightly lower confidence.
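A minimal sketch of the drop-and-re-normalize rule, using the default weights from the table. The names `DEFAULT_WEIGHTS` and `blend` are hypothetical and chosen for illustration; an unavailable signal is represented here as `None`.

```python
DEFAULT_WEIGHTS = {
    "turn_pattern": 0.24,
    "prompt_size_dispersion": 0.17,
    "user_message_shape": 0.15,
    "client_tool_prior": 0.16,
    "daily_activity_shape": 0.27,
    "tool_call_human_tell": 0.08,
    "agent_opener_override": 0.08,
}


def blend(sub_scores):
    """sub_scores maps signal name -> sub-score in [0, 1], or None if the
    signal is unavailable. Returns (raw_blend, available_weight)."""
    available = {k: v for k, v in sub_scores.items() if v is not None}
    weight_sum = sum(DEFAULT_WEIGHTS[k] for k in available)
    if weight_sum == 0:
        return None, 0.0
    # Missing signals are dropped, never imputed as 0.
    raw = sum(v * DEFAULT_WEIGHTS[k] for k, v in available.items()) / weight_sum
    return raw, weight_sum
```

For example, a user with only client_tool_prior = 0.85 and daily_activity_shape = 0.90 is blended over an available weight of 0.43, giving roughly 0.88 rather than being dragged down by the five missing signals.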

1. turn_pattern — user turns

An interactive session resends a growing history, so num_user_turns climbs 1, 2, 3, …; a script firing independent one-shot completions emits num_user_turns = 1 on essentially every request.

f1          = one_shot_chat_requests / n_chat          # fraction stuck at 1 user turn
depth_factor = 0.5 if p90(num_user_turns) >= 3 else 1.0
sub          = clamp01(f1 * depth_factor)

The depth_factor halves the sub-score for a user who has demonstrably held deep (p90 ≥ 3) multi-turn threads, so a human who also fires many one-shot requests is not branded a script.
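The formula can be made runnable as follows. This is a sketch with a hypothetical function name; it also applies the n_chat ≥ 5 availability floor from the signal table and returns `None` when the signal should be dropped.

```python
def clamp01(x):
    """Clamp a value into [0, 1]."""
    return max(0.0, min(1.0, x))


def turn_pattern(one_shot_chat_requests, n_chat, p90_user_turns):
    """Sub-score for the turn_pattern signal, or None below the data floor."""
    if n_chat < 5:
        return None
    f1 = one_shot_chat_requests / n_chat  # fraction stuck at 1 user turn
    depth_factor = 0.5 if p90_user_turns >= 3 else 1.0
    return clamp01(f1 * depth_factor)
```

A script with 98 one-shot requests out of 100 and p90 = 1 scores 0.98, while a user with 40 one-shot requests out of 100 and p90 = 6 scores 0.40 × 0.5 = 0.20.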

2. prompt_size_dispersion — length of user turn

This uses prompt_tokens, which (OpenAI semantics) is the total input for the request — the system prompt + the whole conversation history resent that turn + tool definitions + the latest user message — not the user message in isolation (api_logs has no per-role token breakdown). It is the only available proxy for “length of user turn”. Templated automation assembles each request from a fixed template, so its total prompt size clusters tightly; an interactive human’s requests vary widely (a one-word follow-up, then a long paste). The discriminator is therefore the robust relative dispersion (IQR / median) of that total size, which is scale-free and resistant to a single huge pasted prompt:

rcv = (p75 - p25) / median        # over positive prompt_tokens
sub = clamp01(1 - rcv / 0.5)      # rcv >= 0.5 -> 0 (human-varied); rcv = 0 -> 1 (templated)
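A runnable sketch of the same calculation, with the availability floor (n_sz ≥ 8 and median > 0) from the signal table. The function name and argument order are hypothetical.

```python
def clamp01(x):
    """Clamp a value into [0, 1]."""
    return max(0.0, min(1.0, x))


def prompt_size_dispersion(p25, median, p75, n_sz):
    """Sub-score from the quartiles of positive prompt_tokens, or None
    when there are too few sized requests."""
    if n_sz < 8 or median <= 0:
        return None
    rcv = (p75 - p25) / median  # robust relative dispersion (IQR / median)
    return clamp01(1.0 - rcv / 0.5)
```

With quartiles (1980, 2000, 2020) the dispersion is 0.02 and the sub-score is 0.96 (templated); with (200, 1500, 6000) the dispersion is about 3.9 and the sub-score clamps to 0 (human-varied).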

3. client_tool_prior — user-agent

The User-Agent is the most spoofable signal, so it is a soft, low-weight prior, never decisive. Each request’s UA is classified into a client class (a Python port of the frontend parseClientTool), with a per-request automation value:

| Client class | Examples | Value |
| --- | --- | --- |
| Interactive / coding agent | claude-code, cline, cursor, codex, browser | 0.10 |
| Ambiguous SDK | openai-python, openai-node, anthropic-python/-sdk | 0.50 |
| Raw HTTP library / API tool | python-requests, httpx, aiohttp, curl, wget, okhttp, axios, go-http, postman, … | 0.85 |
| Unknown (recognized leading token) | myagent/1.0 | 0.60 |
| Absent / unrecognized UA | | 0.70 |

ua_base = request-weighted mean of the per-request class values
sub     = clamp01(ua_base * (1 - 0.85 * min(agent_share, 1)))

SDKs sit at a neutral 0.5 because a human chat UI can sit on top of openai-python. The agent_share term lets the coding-agent opener (see below) pull the prior toward human even behind a scripty UA.
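A sketch of the prior, assuming the User-Agent strings have already been classified. The class keys (`interactive`, `sdk`, `raw_http`, `unknown`, `absent`) and the function name are hypothetical labels for the five rows of the table; the classifier itself is not reproduced here.

```python
CLASS_VALUES = {
    "interactive": 0.10,  # interactive / coding agent
    "sdk": 0.50,          # ambiguous SDK
    "raw_http": 0.85,     # raw HTTP library / API tool
    "unknown": 0.60,      # unknown, recognized leading token
    "absent": 0.70,       # absent / unrecognized UA
}


def clamp01(x):
    """Clamp a value into [0, 1]."""
    return max(0.0, min(1.0, x))


def client_tool_prior(class_counts, agent_share):
    """class_counts maps client class -> request count."""
    total = sum(class_counts.values())
    if total == 0:
        return None
    # Request-weighted mean of the per-request class values.
    ua_base = sum(CLASS_VALUES[c] * n for c, n in class_counts.items()) / total
    return clamp01(ua_base * (1.0 - 0.85 * min(agent_share, 1.0)))
```

A user with 90 raw-HTTP and 10 SDK requests has ua_base = 0.815. If the coding-agent opener appears on every request (agent_share = 1.0), the prior drops to 0.815 × 0.15 ≈ 0.12.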

4. daily_activity_shape — daily activity

The behavioral fingerprint hardest to fake at scale. Up to four parts are fused and re-normalized over whichever pass their data floors. Each is HIGH = automation:

| Part | Weight | Formula | Floor |
| --- | --- | --- | --- |
| Hour coverage | 0.20 | clamp01((coverage - 0.5) / 0.5), coverage = distinct active UTC hours / 24 | N ≥ 10 |
| Hour entropy | 0.20 | clamp01((Hnorm - 0.5) / (0.92 - 0.5)), Hnorm = ShannonEntropy(hours) / log2(24) | N ≥ 10 |
| Nightly rest gap | 0.30 | clamp01(1 - max_quiet_gap_hours / 6) | N ≥ 10 |
| Inter-arrival regularity | 0.30 | clamp01(1 - gap_rcv / 1.0), gap_rcv = (p75 - p25) / median of gaps | ≥ 3 gaps |

max_quiet_gap_hours is the longest circular run of inactive clock-hours — a human’s nightly sleep gap pushes the rest-gap part toward 0, while 24/7 operation leaves no gap (→ 1). The timezone-invariant parts (regularity, rest gap, entropy) carry most of the weight, so an unknown user timezone shifts the histogram without distorting the verdict (hours are bucketed in UTC).
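The four parts and the circular quiet-gap search can be sketched as below. Function and parameter names are hypothetical, the floors are read as "at least" (N ≥ 10, ≥ 3 gaps), and the `median > 0` guard on the gap quartiles is an added assumption to avoid dividing by zero.

```python
import math


def clamp01(x):
    """Clamp a value into [0, 1]."""
    return max(0.0, min(1.0, x))


def max_quiet_gap_hours(hist):
    """Longest circular run of zero-count hours in a 24-bucket histogram."""
    if not any(hist):
        return 24
    best = run = 0
    for count in hist + hist:  # doubled list lets a run wrap past midnight
        run = run + 1 if count == 0 else 0
        best = max(best, run)
    return min(best, 24)


def daily_activity_shape(hist, n_requests, gap_quartiles=None, n_gaps=0):
    """Fuse whichever parts pass their floors; None if none do."""
    parts = []  # (weight, sub-score) pairs
    if n_requests >= 10:
        total = sum(hist)
        coverage = sum(1 for c in hist if c > 0) / 24
        parts.append((0.20, clamp01((coverage - 0.5) / 0.5)))
        probs = [c / total for c in hist if c > 0]
        h_norm = -sum(p * math.log2(p) for p in probs) / math.log2(24)
        parts.append((0.20, clamp01((h_norm - 0.5) / (0.92 - 0.5))))
        parts.append((0.30, clamp01(1 - max_quiet_gap_hours(hist) / 6)))
    if gap_quartiles is not None and n_gaps >= 3:
        p25, median, p75 = gap_quartiles
        if median > 0:
            gap_rcv = (p75 - p25) / median
            parts.append((0.30, clamp01(1 - gap_rcv / 1.0)))
    if not parts:
        return None
    return sum(w * s for w, s in parts) / sum(w for w, _ in parts)
```

A user active only from 09:00 to 17:59 UTC has a 15-hour quiet gap, so the rest-gap and coverage parts are both 0 and the fused score stays low; a job firing every 60 seconds around the clock scores 1.0 on all four parts.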

5. tool_call_human_tell — supporting

Agentic tool use (num_tool_calls > 0, from OpenAI tool_calls / Anthropic tool_use blocks) is the fingerprint of a human-driven coding loop. It is a one-directional human tell: it is dropped entirely when there are no tool calls (absence of tool use is not evidence of automation), and otherwise can only pull the score toward human, never raise it:

sub = clamp01(0.5 - toolcall_share)   # share>=0.5 -> 0 (strongly human); share~0 -> ~0.5 (neutral)
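A runnable sketch of this one-directional tell. The function name is hypothetical, and `toolcall_share` is taken as an input because the exact denominator of the share is not spelled out here.

```python
def clamp01(x):
    """Clamp a value into [0, 1]."""
    return max(0.0, min(1.0, x))


def tool_call_human_tell(toolcall_share, n_chat):
    """Sub-score in [0, 0.5], or None when the signal is dropped."""
    if n_chat < 5 or toolcall_share <= 0:
        return None  # no tool use is not evidence of automation
    return clamp01(0.5 - toolcall_share)
```

Because the sub-score never exceeds 0.5, including this signal can only hold the blend at neutral or pull it toward human.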

6. agent_opener_override — supporting

metadata->>'agent' is a coding-agent identity parsed from the system-prompt opener ("You are Claude Code, …"). Because it is content-derived (not a header), it is hard to fake incidentally and is treated as strong human evidence. It is available only when the opener appears on ≥ 5% of requests, so it can only pull toward human; its absence is uninformative:

sub = clamp01(0.15 - agent_share)     # opener pervasive -> ~0 (human)

It also drives the hard human clamp in the combination step.
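The availability rule and the sub-score for the opener signal can be written as a short sketch (hypothetical function name):

```python
def clamp01(x):
    """Clamp a value into [0, 1]."""
    return max(0.0, min(1.0, x))


def agent_opener_override(agent_share):
    """Sub-score in [0, 0.10], or None when the opener is seen on fewer
    than 5% of requests."""
    if agent_share < 0.05:
        return None
    return clamp01(0.15 - agent_share)
```

At the 5% floor the sub-score is 0.10, and it reaches 0 once the opener appears on 15% or more of requests, so the signal always sits well on the human side when present.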

7. user_message_shape — user-message size & entropy

Where prompt_size_dispersion looks at the whole input, this signal looks at the user’s own messages. Three properties of the newest user-role message per request are precomputed at log time (in user_message_stats, alongside conversation_shape) as cheap columns — char length, Shannon character entropy (bits/char), and a stable 64-bit hash of the stripped text — so the score never de-TOASTs the prompt. Three parts are fused and re-normalized over whichever pass their floors, each HIGH = automation:

| Part | Weight | Formula | Floor |
| --- | --- | --- | --- |
| Size dispersion | 0.40 | clamp01(1 - size_rcv / 0.5), size_rcv = (p75 - p25) / median of last_user_msg_chars | ≥ 8 sized msgs |
| Per-message entropy | 0.25 | clamp01(1 - mean_entropy / 4.0) (natural language ≈ 4 bits/char) | ≥ 5 msgs |
| Cross-message repetition | 0.35 | clamp01((1 - distinct_ratio) / 0.5), distinct_ratio = distinct(hash) / count | ≥ 8 hashed msgs |

Templated automation sends near-constant-length user messages (low size dispersion), low-entropy structured payloads, and resends the same message over and over (low distinct ratio → high repetition); an interactive human varies all three. Because the columns are populated only for new traffic, the signal is simply unavailable (dropped) for users whose requests all predate the migration.

This signal isolates the user’s own input. Like the content-derived metadata->>'agent' opener, the per-message hash makes it hard to spoof without actually varying the content.
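A sketch of the per-message character entropy and the three-part fusion. The function names and argument layout are hypothetical, the floors are read as "at least", and the `median > 0` guard is an added assumption; the production columns are precomputed at log time rather than derived like this.

```python
import math
from collections import Counter


def clamp01(x):
    """Clamp a value into [0, 1]."""
    return max(0.0, min(1.0, x))


def char_entropy(text):
    """Shannon entropy of the character distribution, in bits per char."""
    if not text:
        return 0.0
    n = len(text)
    return -sum((c / n) * math.log2(c / n) for c in Counter(text).values())


def user_message_shape(size_quartiles, n_sized, mean_entropy, n_msgs,
                       distinct_hashes, n_hashed):
    """Fuse whichever parts pass their floors; None if none do."""
    parts = []  # (weight, sub-score) pairs
    if size_quartiles is not None and n_sized >= 8:
        p25, median, p75 = size_quartiles
        if median > 0:
            size_rcv = (p75 - p25) / median
            parts.append((0.40, clamp01(1 - size_rcv / 0.5)))
    if mean_entropy is not None and n_msgs >= 5:
        parts.append((0.25, clamp01(1 - mean_entropy / 4.0)))
    if n_hashed >= 8:
        distinct_ratio = distinct_hashes / n_hashed
        parts.append((0.35, clamp01((1 - distinct_ratio) / 0.5)))
    if not parts:
        return None
    return sum(w * s for w, s in parts) / sum(w for w, _ in parts)
```

A script that resends one 120-character payload with a mean entropy of 2.0 bits/char scores 0.875; a human with widely varying message sizes, a mean entropy of about 4.2 and no repeated messages scores 0.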

Combining the signals

A          = signals available for this user
weight_sum = sum(weight_i for i in A)
raw        = sum(sub_i * weight_i for i in A) / weight_sum      # re-normalized blend

# Hard human clamp: a high-volume coding-agent user with a real nightly rest gap
# can never be branded above "mixed" on volume alone.
if agent_share >= 0.3 and rest_gap_part_available and rest_gap_score < 0.5:
    raw = min(raw, 0.5)

# Confidence shrinkage toward the neutral 0.5 prior for low-volume users.
alpha = N / (N + 30)
score = clamp01(alpha * raw + (1 - alpha) * 0.5)

# coverage = available weight / total signal weight, so confidence stays in [0,1].
confidence = alpha * (weight_sum / TOTAL_WEIGHT)

  • Re-normalization (raw) makes the blend depend only on the signals that actually had data, so a pure-embeddings batch user (whose num_*_turns columns are all NULL) is still scored on its user-agent and daily-activity shape.

  • Shrinkage (alpha = N / (N + 30)) pulls users with little traffic toward the neutral 0.5 prior. Data and prior carry equal weight at N = 30 (alpha = 0.5), and the data dominates above that; at N = 5, alpha ≈ 0.14 (the score is pulled ~86% of the way to 0.5).

  • confidence combines the request volume (alpha) with the share of the total signal weight that was actually available (weight_sum / TOTAL_WEIGHT), so sparse verdicts — and users missing the newer signals — read as lower-confidence.

  • Users with N < 5 requests are flagged insufficient_data.
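To put numbers on the clamp, shrinkage and confidence steps, here is a sketch that takes an already-blended `raw` value and its available weight. The function name `finalize` and the returned dict layout are hypothetical; `TOTAL_WEIGHT` is set to the 1.15 total of the default weights.

```python
TOTAL_WEIGHT = 1.15  # sum of the seven default signal weights


def clamp01(x):
    """Clamp a value into [0, 1]."""
    return max(0.0, min(1.0, x))


def finalize(raw, weight_sum, n_requests, agent_share=0.0, rest_gap_score=None):
    """Apply the hard human clamp, shrinkage and confidence to a raw blend.
    rest_gap_score is None when the rest-gap part was unavailable."""
    if agent_share >= 0.3 and rest_gap_score is not None and rest_gap_score < 0.5:
        raw = min(raw, 0.5)  # hard human clamp
    alpha = n_requests / (n_requests + 30)
    return {
        "score": clamp01(alpha * raw + (1 - alpha) * 0.5),
        "confidence": alpha * (weight_sum / TOTAL_WEIGHT),
        "insufficient_data": n_requests < 5,
    }
```

With raw = 0.9 and every signal available, N = 30 gives a score of 0.70 at confidence 0.50, while N = 5 gives a score of about 0.56 at confidence about 0.14. A coding-agent user with agent_share = 0.5 and a clear nightly rest gap is capped at 0.5 regardless of volume.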

Bands

The final score maps to a band label (advisory — always read with confidence):

| Band | Range |
| --- | --- |
| likely_human | 0.00–0.35 |
| mixed_or_uncertain | 0.35–0.60 |
| likely_automated | 0.60–0.80 |
| scripted_batch | 0.80–1.00 |
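The band lookup can be sketched as a simple threshold chain. The ranges share their endpoints, so this sketch assumes a score exactly on a boundary belongs to the higher band; that tie-breaking rule and the function name are assumptions, not confirmed behavior.

```python
def band(score):
    """Map a final score in [0, 1] to its advisory band label."""
    if score < 0.35:
        return "likely_human"
    if score < 0.60:
        return "mixed_or_uncertain"
    if score < 0.80:
        return "likely_automated"
    return "scripted_batch"
```

Under this assumption a score of 0.50, which is where low-volume users are shrunk to, lands in mixed_or_uncertain.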

Caveats

  • It’s a heuristic. No single signal is decisive; each is individually spoofable, so the user-agent is weighted low and behavioral signals dominate. Treat the score as triage, not proof.

  • Timezone. Hour-of-day is bucketed in UTC; the design leans on the timezone-invariant daily-activity parts so a non-UTC human is not mislabeled as night-active, but a genuinely multi-timezone account can inflate hour coverage.

  • Shared / role accounts. Shared or team accounts blend human and script traffic into a mid “mixed” score, which is the correct reading. Internal / admin accounts may legitimately run automation; the score stays role-agnostic, so exclude them from any automated enforcement and read the score with their role in mind.

Reproducing it from the CLI

# rank the most script-like users in the last 30 days
python ops/db/analysis/user_automation_score.py --min-requests 20

# full per-signal breakdown for one user
python ops/db/analysis/user_automation_score.py --email <user-email>