# Automation score (human vs. script)

The **automation score** rates how *script-driven* vs. *human-driven* each user's API traffic is, on a `0.0`–`1.0` scale:

- **HIGH (→ 1.0)**: traffic is mostly driven by automatic scripts, batch jobs, or cron.
- **LOW (→ 0.0)**: a human is using the service interactively (a chat UI, or a human-driven coding agent such as Claude Code).

It is a **heuristic for triage, not a verdict**. Always read it together with the reported `confidence` and the per-signal breakdown.

## Where it lives

| Layer | Location |
|---|---|
| Core scoring + gather SQL | `apps/backend/serving/analytics/automation_score.py` |
| Store methods | `LogStore.get_user_automation_score` / `get_bulk_user_automation_scores` (`serving/storage/postgres_log.py`) |
| Admin endpoints | `GET /admin/users/{user_id}/automation-score`, `GET /admin/users/automation-scores` (`serving/servers/routers/admin/users.py`) |
| Admin dashboard | per-user button + bulk "Automation" column in the Users tab |
| CLI | `ops/db/analysis/user_automation_score.py` |

The scoring methodology and the SQL that gathers per-user aggregates live in **one** module (`serving.analytics.automation_score`), shared by the admin endpoints and the CLI, so the two can never drift. The pure scoring functions have no database dependency and are unit-tested in `tests/unit/test_user_automation_score.py`.

## Inputs

The score is computed over a trailing **window** (default 30 days; the admin endpoints accept `days` in `1..90`). Only `api_logs` rows with a non-null `user_id` are considered; anonymous traffic is excluded.
For each user, the following per-user aggregates are gathered:

- request count `N`, and `n_chat` = rows where `num_user_turns IS NOT NULL` (chat-style requests; `NULL` for embeddings / raw completions);
- one-shot count (`num_user_turns = 1`) and the p90 of `num_user_turns`;
- tool-call counts (`num_tool_calls IS NOT NULL`, and `> 0`);
- the 25/50/75th percentiles of positive `prompt_tokens`;
- the share of requests carrying a coding-agent opener (`metadata->>'agent'` present; see the `agent_opener_override` signal below);
- a request-weighted breakdown of `metadata->>'user_agent'`;
- a UTC hour-of-day histogram (24 buckets);
- the 25/50/75th percentiles of the inter-arrival gaps between consecutive requests.

## The signals

The score blends seven signals. Each maps its raw metric to an **automation sub-score** in `[0, 1]` (HIGH = looks automated), and has a default weight. The four signals the feature was originally specified around are *user turns*, *turn length*, *user-agent*, and *daily activity*; two small supporting signals disambiguate the main confounder (a high-volume but human-driven coding agent); and `user_message_shape` measures the user's own messages specifically.

| Signal | Axis | Default weight | Available when |
|---|---|---|---|
| `turn_pattern` | user turns | 0.24 | `n_chat ≥ 5` |
| `prompt_size_dispersion` | total prompt size | 0.17 | `n_sz ≥ 8` and median > 0 |
| `user_message_shape` | user-message size & entropy | 0.15 | ≥ 1 part below |
| `client_tool_prior` | user-agent | 0.16 | always (≥ 1 request) |
| `daily_activity_shape` | daily activity | 0.27 | ≥ 1 timing part below |
| `tool_call_human_tell` | (support) | 0.08 | `n_chat ≥ 5` and some tool use |
| `agent_opener_override` | (support) | 0.08 | `agent_share ≥ 0.05` |

A signal that lacks enough data for a user is **dropped**, and the remaining weights are re-normalized over the survivors. A missing signal is never imputed as `0` (which would falsely pull the score toward "human").
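
The drop-and-re-normalize rule can be sketched in a few lines of Python. The weights are copied from the table above; the `blend` function name and the use of `None` to mark an unavailable signal are illustrative assumptions, not the module's actual API.

```python
from typing import Mapping, Optional

# Default weights copied from the signal table above.
DEFAULT_WEIGHTS = {
    "turn_pattern": 0.24,
    "prompt_size_dispersion": 0.17,
    "user_message_shape": 0.15,
    "client_tool_prior": 0.16,
    "daily_activity_shape": 0.27,
    "tool_call_human_tell": 0.08,
    "agent_opener_override": 0.08,
}


def blend(sub_scores: Mapping[str, Optional[float]]) -> Optional[float]:
    """Weighted mean over the available signals (hypothetical helper).

    A value of None marks a dropped signal; it contributes neither to the
    numerator nor to the weight total, so it is never treated as 0.
    """
    available = {k: v for k, v in sub_scores.items() if v is not None}
    weight_sum = sum(DEFAULT_WEIGHTS[k] for k in available)
    if weight_sum == 0:
        return None  # nothing to score on
    return sum(v * DEFAULT_WEIGHTS[k] for k, v in available.items()) / weight_sum


# A user with no chat-style requests: only user-agent and timing have data.
raw = blend({
    "turn_pattern": None,
    "client_tool_prior": 0.85,
    "daily_activity_shape": 0.90,
})
```

Here `raw` is `(0.85 * 0.16 + 0.90 * 0.27) / 0.43`, roughly `0.88`. Imputing the missing signals as `0` and dividing by the full weight would instead give about `0.33`, which is the false pull toward "human" that dropping avoids.
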
The weights need not sum to `1.0` (they total `1.15`); the runtime always divides by the available weight. `user_message_shape` was added later at weight `0.15` **without changing the original six**, so a user lacking the (newer) user-message columns (e.g. traffic logged before the migration) drops it and gets the **same blended score as before**, only a slightly lower `confidence`.

### 1. `turn_pattern`: user turns

An interactive session resends a growing history, so `num_user_turns` climbs `1, 2, 3, …`; a script firing independent one-shot completions emits `num_user_turns = 1` on essentially every request.

```text
f1           = one_shot_chat_requests / n_chat     # fraction stuck at 1 user turn
depth_factor = 0.5 if p90(num_user_turns) >= 3 else 1.0
sub          = clamp01(f1 * depth_factor)
```

The `depth_factor` halves the sub-score for a user who has demonstrably held deep (`p90 ≥ 3`) multi-turn threads, so a human who *also* fires many one-shot requests is not branded a script.

### 2. `prompt_size_dispersion`: length of user turn

This uses `prompt_tokens`, which (OpenAI semantics) is the **total input** for the request (the system prompt + the whole conversation history resent that turn + tool definitions + the latest user message), not the user message in isolation (`api_logs` has no per-role token breakdown). It is the only available proxy for "length of user turn".

Templated automation assembles each request from a fixed template, so its total prompt size clusters tightly; an interactive human's requests vary widely (a one-word follow-up, then a long paste). The discriminator is therefore the **robust relative dispersion** (IQR / median) of that total size, which is scale-free and resistant to a single huge pasted prompt:

```text
rcv = (p75 - p25) / median        # over positive prompt_tokens
sub = clamp01(1 - rcv / 0.5)      # rcv >= 0.5 -> 0 (human-varied); rcv = 0 -> 1 (templated)
```

### 3. `client_tool_prior`: user-agent

The User-Agent is the most *spoofable* signal, so it is a soft, low-weight prior, never decisive. Each request's UA is classified into a client class (a Python port of the frontend `parseClientTool`), with a per-request automation value:

| Client class | Examples | Value |
|---|---|---|
| Interactive / coding agent | `claude-code`, `cline`, `cursor`, `codex`, `browser` | `0.10` |
| Ambiguous SDK | `openai-python`, `openai-node`, `anthropic-python`/`-sdk` | `0.50` |
| Raw HTTP library / API tool | `python-requests`, `httpx`, `aiohttp`, `curl`, `wget`, `okhttp`, `axios`, `go-http`, `postman`, … | `0.85` |
| Unknown (recognized leading token) | `myagent/1.0` | `0.60` |
| Absent / unrecognized UA | (none) | `0.70` |

```text
ua_base = request-weighted mean of the per-request class values
sub     = clamp01(ua_base * (1 - 0.85 * min(agent_share, 1)))
```

SDKs sit at a neutral `0.5` because a human chat UI can sit on top of `openai-python`. The `agent_share` term lets the coding-agent opener (see below) pull the prior toward human even behind a scripty UA.

### 4. `daily_activity_shape`: daily activity

The behavioral fingerprint hardest to fake at scale. Up to four parts are fused and re-normalized over whichever pass their data floors. Each is HIGH = automation:

| Part | Weight | Formula | Floor |
|---|---|---|---|
| Hour coverage | 0.20 | `clamp01((coverage - 0.5) / 0.5)`, `coverage = distinct active UTC hours / 24` | `N ≥ 10` |
| Hour entropy | 0.20 | `clamp01((Hnorm - 0.5) / (0.92 - 0.5))`, `Hnorm = ShannonEntropy(hours) / log2(24)` | `N ≥ 10` |
| Nightly rest gap | 0.30 | `clamp01(1 - max_quiet_gap_hours / 6)` | `N ≥ 10` |
| Inter-arrival regularity | 0.30 | `clamp01(1 - gap_rcv / 1.0)`, `gap_rcv = (p75 - p25) / median` of gaps | `≥ 3 gaps` |

`max_quiet_gap_hours` is the longest *circular* run of inactive clock-hours: a human's nightly sleep gap pushes the rest-gap part toward 0, while 24/7 operation leaves no gap (→ 1).
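
The four parts can be reproduced from a 24-bucket hour histogram and the gap quartiles. The following is a minimal sketch of the table's formulas, not the module's code: the function names, the argument layout, and the `median > 0` guard are assumptions made for illustration.

```python
import math
from typing import Optional, Sequence, Tuple

# Part weights copied from the table above.
PART_WEIGHTS = {"hour_coverage": 0.20, "hour_entropy": 0.20,
                "rest_gap": 0.30, "regularity": 0.30}


def clamp01(x: float) -> float:
    return max(0.0, min(1.0, x))


def max_quiet_gap_hours(hist: Sequence[int]) -> int:
    """Longest circular run of clock-hours with zero requests."""
    n = len(hist)
    if not any(hist):
        return n
    best = run = 0
    # Walk the histogram twice so a run that wraps past midnight is counted whole.
    for i in range(2 * n):
        run = run + 1 if hist[i % n] == 0 else 0
        best = max(best, run)
    return best


def daily_activity_shape(
    hist: Sequence[int],
    gap_quartiles: Optional[Tuple[float, float, float]] = None,
    n_gaps: int = 0,
) -> Optional[float]:
    """Fuse whichever parts pass their data floors; None if none do."""
    parts = {}
    total = sum(hist)
    if total >= 10:  # floor: N >= 10
        coverage = sum(1 for c in hist if c > 0) / 24
        probs = [c / total for c in hist if c > 0]
        h_norm = -sum(p * math.log2(p) for p in probs) / math.log2(24)
        parts["hour_coverage"] = clamp01((coverage - 0.5) / 0.5)
        parts["hour_entropy"] = clamp01((h_norm - 0.5) / (0.92 - 0.5))
        parts["rest_gap"] = clamp01(1 - max_quiet_gap_hours(hist) / 6)
    # Floor: >= 3 gaps. The median > 0 guard is an assumption (avoids dividing by zero).
    if gap_quartiles is not None and n_gaps >= 3 and gap_quartiles[1] > 0:
        p25, median, p75 = gap_quartiles
        parts["regularity"] = clamp01(1 - (p75 - p25) / median)
    if not parts:
        return None
    weight_sum = sum(PART_WEIGHTS[k] for k in parts)
    return sum(PART_WEIGHTS[k] * v for k, v in parts.items()) / weight_sum


# Hourly cron: requests in every UTC hour, perfectly regular gaps (seconds).
cron = daily_activity_shape([5] * 24, gap_quartiles=(3600, 3600, 3600), n_gaps=119)
# Office-hours human: active 09:00-17:59 UTC only, irregular gaps.
office = [0] * 9 + [10] * 9 + [0] * 6
human = daily_activity_shape(office, gap_quartiles=(20, 90, 600), n_gaps=89)
```

With these made-up inputs the cron-like histogram saturates every part, while the office-hours histogram has a 15-hour quiet gap and scores near the human end; only its hour-entropy part is non-zero.
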
The timezone-invariant parts (regularity, rest gap, entropy) carry most of the weight, so an unknown user timezone shifts the histogram without distorting the verdict (hours are bucketed in **UTC**).

### 5. `tool_call_human_tell`: supporting

Agentic tool use (`num_tool_calls > 0`, from OpenAI `tool_calls` / Anthropic `tool_use` blocks) is the fingerprint of a human-driven coding loop. It is a **one-directional** human tell: it is *dropped entirely* when there are no tool calls (absence of tool use is not evidence of automation), and otherwise can only pull the score toward human, never raise it:

```text
sub = clamp01(0.5 - toolcall_share)   # share>=0.5 -> 0 (strongly human); share~0 -> ~0.5 (neutral)
```

### 6. `agent_opener_override`: supporting

`metadata->>'agent'` is a coding-agent identity parsed from the system-prompt opener (`"You are Claude Code, …"`). Because it is content-derived (not a header), it is hard to fake incidentally and is treated as strong human evidence. It is available only when the opener appears on ≥ 5% of requests, so it can only pull toward human; its absence is uninformative:

```text
sub = clamp01(0.15 - agent_share)     # opener pervasive -> ~0 (human)
```

It also drives the **hard human clamp** in the combination step.

### 7. `user_message_shape`: user-message size & entropy

Where `prompt_size_dispersion` looks at the *whole* input, this signal looks at the **user's own messages**. Three properties of the **newest user-role message** per request are precomputed at log time (in `user_message_stats`, alongside `conversation_shape`) as cheap columns (char length, Shannon character entropy in bits/char, and a stable 64-bit hash of the stripped text), so the score never de-TOASTs the prompt.
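
As an illustration of what those three columns hold, the sketch below computes them for a single message. It is not the project's `user_message_stats` code: the function name `sketch_user_message_stats`, the choice of BLAKE2b truncated to 8 bytes as the "stable 64-bit hash", the signed-integer encoding, and measuring length and entropy on the unstripped text are all assumptions.

```python
import hashlib
import math
from collections import Counter


def sketch_user_message_stats(text: str) -> dict:
    """Per-message columns: char length, entropy (bits/char), 64-bit hash."""
    n = len(text)
    counts = Counter(text)
    # Shannon entropy of the character distribution, in bits per character.
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values()) if n else 0.0
    # Stand-in stable hash: BLAKE2b with an 8-byte digest over the stripped text.
    digest = hashlib.blake2b(text.strip().encode("utf-8"), digest_size=8).digest()
    return {
        "chars": n,
        "entropy_bits_per_char": entropy,
        # Signed so the value would fit a 64-bit integer column (assumption).
        "hash64": int.from_bytes(digest, "big", signed=True),
    }


templated = sketch_user_message_stats("abab")
padded = sketch_user_message_stats("  abab\n")
```

Because the hash is taken over the stripped text, `templated` and `padded` share the same `hash64` even though their lengths differ, which is what lets the repetition part count them as the same message.
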
Three parts are fused and re-normalized over whichever pass their floors, each HIGH = automation:

| Part | Weight | Formula | Floor |
|---|---|---|---|
| Size dispersion | 0.40 | `clamp01(1 - size_rcv / 0.5)`, `size_rcv = (p75 - p25)/median` of `last_user_msg_chars` | `≥ 8 sized msgs` |
| Per-message entropy | 0.25 | `clamp01(1 - mean_entropy / 4.0)` (natural language ≈ 4 bits/char) | `≥ 5 msgs` |
| Cross-message repetition | 0.35 | `clamp01((1 - distinct_ratio) / 0.5)`, `distinct_ratio = distinct(hash)/count` | `≥ 8 hashed msgs` |

Templated automation sends near-constant-length user messages (low size dispersion), low-entropy structured payloads, and resends the same message over and over (low distinct ratio → high repetition); an interactive human varies all three. Because the columns are populated only for new traffic, the signal is simply unavailable (dropped) for users whose requests all predate the migration.

This signal isolates the user's own input. Together with the agent identity in `metadata->>'agent'`, the per-message hash makes it hard to spoof without actually varying the content.

## Combining the signals

```text
A          = signals available for this user
weight_sum = sum(weight_i for i in A)
raw        = sum(sub_i * weight_i for i in A) / weight_sum      # re-normalized blend

# Hard human clamp: a high-volume coding-agent user with a real nightly rest gap
# can never be branded above "mixed" on volume alone.
if agent_share >= 0.3 and rest_gap_part_available and rest_gap_score < 0.5:
    raw = min(raw, 0.5)

# Confidence shrinkage toward the neutral 0.5 prior for low-volume users.
alpha = N / (N + 30)
score = clamp01(alpha * raw + (1 - alpha) * 0.5)

# coverage = available weight / total signal weight, so confidence stays in [0,1].
confidence = alpha * (weight_sum / TOTAL_WEIGHT)
```

- **Re-normalization** (`raw`) makes the blend depend only on the signals that actually had data, so a pure-embeddings batch user, whose `num_*_turns` columns are all `NULL`, is still scored on its user-agent and daily-activity shape.
- **Shrinkage** (`alpha = N / (N + 30)`) pulls users with little traffic toward the neutral `0.5` prior. Data and prior carry equal weight at `N = 30` (`alpha = 0.5`), and data outweighs the prior above that; at `N = 5`, `alpha ≈ 0.14` (the score is pulled ~86% of the way to `0.5`).
- **`confidence`** combines the request volume (`alpha`) with the share of the total signal weight that was actually available (`weight_sum / TOTAL_WEIGHT`), so sparse verdicts, and users missing the newer signals, read as lower-confidence.
- Users with `N < 5` requests are flagged **`insufficient_data`**.

## Bands

The final score maps to a band label (advisory; always read with `confidence`):

| Band | Range |
|---|---|
| `likely_human` | `0.00 – 0.35` |
| `mixed_or_uncertain` | `0.35 – 0.60` |
| `likely_automated` | `0.60 – 0.80` |
| `scripted_batch` | `0.80 – 1.00` |

## Caveats

- **It's a heuristic.** No single signal is decisive; each is individually spoofable, so the user-agent is weighted low and behavioral signals dominate. Treat the score as triage, not proof.
- **Timezone.** Hour-of-day is bucketed in UTC; the design leans on the timezone-invariant daily-activity parts so a non-UTC human is not mislabeled as night-active, but a genuinely multi-timezone account can inflate hour coverage.
- **Shared / role accounts.** Shared or team accounts blend human and script traffic into a mid-range "mixed" score, which is the correct reading. `internal` / `admin` accounts may legitimately run automation; the score stays role-agnostic, so exclude them from any automated enforcement and read the score with their role in mind.
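
The combination step and the band mapping described earlier can be sketched together as follows. This is a minimal sketch under stated assumptions: the `combine` and `band` names and the returned dict are hypothetical, and treating each band's lower bound as inclusive is a guess, since the band table leaves the shared boundaries (`0.35`, `0.60`, `0.80`) unspecified.

```python
from typing import Optional

TOTAL_WEIGHT = 1.15  # sum of the seven default signal weights


def clamp01(x: float) -> float:
    return max(0.0, min(1.0, x))


def band(score: float) -> str:
    # Assumption: lower bounds are inclusive, upper bounds exclusive.
    if score < 0.35:
        return "likely_human"
    if score < 0.60:
        return "mixed_or_uncertain"
    if score < 0.80:
        return "likely_automated"
    return "scripted_batch"


def combine(
    raw: float,
    weight_sum: float,
    n_requests: int,
    agent_share: float = 0.0,
    rest_gap_score: Optional[float] = None,  # None = rest-gap part unavailable
) -> dict:
    # Hard human clamp for coding-agent users with a real nightly rest gap.
    if agent_share >= 0.3 and rest_gap_score is not None and rest_gap_score < 0.5:
        raw = min(raw, 0.5)
    # Shrink low-volume users toward the neutral 0.5 prior.
    alpha = n_requests / (n_requests + 30)
    score = clamp01(alpha * raw + (1 - alpha) * 0.5)
    return {
        "score": score,
        "confidence": alpha * (weight_sum / TOTAL_WEIGHT),
        "band": band(score),
        "insufficient_data": n_requests < 5,
    }


# All signals available, 30 requests: alpha = 0.5, so 0.9 shrinks to 0.7.
batch = combine(raw=0.9, weight_sum=1.15, n_requests=30)
# Heavy coding-agent user who sleeps at night: the clamp caps raw at 0.5.
agent_user = combine(raw=0.9, weight_sum=1.15, n_requests=970,
                     agent_share=0.5, rest_gap_score=0.1)
```

In this sketch `batch` lands in `likely_automated` with confidence `0.5`, while `agent_user` stays in `mixed_or_uncertain` despite the high blended value and the large request count.
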
## Reproducing it from the CLI

```bash
# rank the most script-like users in the last 30 days
python ops/db/analysis/user_automation_score.py --min-requests 20

# full per-signal breakdown for one user
python ops/db/analysis/user_automation_score.py --email a@x.com
```
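
When checking a per-signal breakdown by hand, the simpler sub-score formulas from the signal sections can be recomputed directly from the raw metrics. The sketch below transcribes those formulas; the function signatures are assumptions for illustration, and the availability floors (`n_chat ≥ 5`, `agent_share ≥ 0.05`, and so on) are deliberately not enforced here.

```python
def clamp01(x: float) -> float:
    return max(0.0, min(1.0, x))


def turn_pattern(one_shot: int, n_chat: int, p90_turns: float) -> float:
    f1 = one_shot / n_chat  # fraction of chat requests stuck at 1 user turn
    depth_factor = 0.5 if p90_turns >= 3 else 1.0
    return clamp01(f1 * depth_factor)


def prompt_size_dispersion(p25: float, median: float, p75: float) -> float:
    rcv = (p75 - p25) / median  # IQR / median over positive prompt_tokens
    return clamp01(1 - rcv / 0.5)


def client_tool_prior(ua_base: float, agent_share: float) -> float:
    return clamp01(ua_base * (1 - 0.85 * min(agent_share, 1)))


def tool_call_human_tell(toolcall_share: float) -> float:
    return clamp01(0.5 - toolcall_share)


def agent_opener_override(agent_share: float) -> float:
    return clamp01(0.15 - agent_share)


# A script: 95 of 100 chat requests are one-shot, never deeper than 1 turn.
script_turns = turn_pattern(one_shot=95, n_chat=100, p90_turns=1)
# A human with deep threads who also fires many one-shot requests.
human_turns = turn_pattern(one_shot=60, n_chat=100, p90_turns=4)
# Identical prompt sizes (templated) vs. widely varying ones.
templated_size = prompt_size_dispersion(1000, 1000, 1000)
varied_size = prompt_size_dispersion(500, 1000, 1200)
```

With these illustrative inputs, `script_turns` is `0.95` and `human_turns` is `0.30` (the `depth_factor` halves it), while `templated_size` is `1.0` and `varied_size` is `0.0`.
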