Building an AI agent that can intervene in live markets demands more than a strong model. Successful operation depends on an evidence pipeline that supports fast decision-making, high precision, and audit-ready governance. In Part 1 of this series, we explained what our system does and why it matters. This part focuses on how it was engineered.

The Pipeline Architecture
The system operates on a simple yet powerful principle: start with inexpensive, high‑coverage checks to scan broadly, and progressively narrow the field toward more sophisticated, resource-intensive analysis.
Most surveillance candidates resolve as noise, and running compute-heavy analytics across all of them would be impractical and cost-inefficient at scale. Instead, the system filters decisively at each stage:
- Gate 1 (fast statistical screening): lets through only the 10% of flagged cases that are genuinely worth closer inspection.
- Gate 2 (deeper quantitative validation): eliminates roughly 90% of what remains, keeping only scenarios with materially stronger evidence.
- Gate 3 (AI evaluation): reached only by cases that survive both gates, approximately 50% of which are flagged as abusers. The AI is never asked to evaluate weak evidence; it reasons only over cases where quantitative evidence has already made the case credible.
This funnel structure is not a limitation – it is the engineering that makes autonomous AI action affordable, operationally sustainable, and reliable. Low-cost operations run broadly, while expensive judgment is applied narrowly, only where the evidence already indicates something real.
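The combined effect of the three pass rates can be checked with a few lines of arithmetic. The sketch below is illustrative only: the 10,000 starting candidates are a hypothetical figure, and the rates are the approximate percentages quoted above, not measured production values.

```python
def funnel_survivors(candidates, pass_rates):
    """Return the expected number of cases remaining after each gate."""
    remaining = float(candidates)
    survivors = []
    for rate in pass_rates:
        remaining *= rate
        survivors.append(remaining)
    return survivors


# Approximate pass rates from the text:
# Gate 1 keeps ~10%, Gate 2 keeps ~10% (eliminates ~90%), Gate 3 flags ~50%.
HYPOTHETICAL_CANDIDATES = 10_000  # assumption, chosen for round numbers
stages = funnel_survivors(HYPOTHETICAL_CANDIDATES, [0.10, 0.10, 0.50])
overall_rate = stages[-1] / HYPOTHETICAL_CANDIDATES

for gate, count in enumerate(stages, start=1):
    print(f"after gate {gate}: ~{count:.0f} cases")
print(f"overall: ~{overall_rate:.1%} of candidates end as flagged abusers")
```

Under these assumptions only about 1% of candidates ever reach the AI stage, which is what keeps the expensive step affordable.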
A Layered Funnel
After the first signal appears, the system follows a structured path that builds evidence step by step and culminates in an action-ready conclusion.
Layer 1: Surveillance – HawkEye RMS
HawkEye RMS is a continuous surveillance platform that monitors trading activity across the firm without interruption. Designed as a general-purpose framework, it supports multiple detection contexts in production. For this use case, the configuration targets a specific family of suspicious patterns in gold trading.
Its trigger logic is calibrated with a deliberate bias toward recall. At this stage, the goal is not to prove abuse, but to capture every credible anomaly early and ensure it receives proper scrutiny.
Foundational triggers evaluate price behavior and spread size jointly, preserving consistent sensitivity across both calm and volatile market conditions. Order size is handled with thresholds specific to each size band. Dedicated alarms also target a structural signature observed in historical cases: characteristic patterns in how positions are built and unwound. The most analytically demanding triggers focus on markout shape, because a latency scalper and a gold abuser produce fundamentally different patterns in fill quality over time. Rather than a single threshold, these triggers impose time-differentiated conditions across multiple lags, all of which must be satisfied simultaneously.
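The multi-lag idea can be sketched as a conjunction of per-lag bounds. Everything in this example is hypothetical: the `LagCondition` type, the lag values, and the numeric bounds are placeholders chosen for illustration and do not reflect the production calibration or any particular trader profile.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LagCondition:
    lag_seconds: int    # horizon after the fill at which markout is measured
    min_markout: float  # inclusive lower bound
    max_markout: float  # inclusive upper bound


def markout_shape_trigger(markouts, conditions):
    """Fire only if every lag has a measurement inside its own band."""
    for cond in conditions:
        value = markouts.get(cond.lag_seconds)
        if value is None:
            return False  # a missing lag cannot satisfy its condition
        if not (cond.min_markout <= value <= cond.max_markout):
            return False
    return True


# Placeholder shape: one band per lag, all required at once.
conditions = [
    LagCondition(lag_seconds=1, min_markout=-0.05, max_markout=0.05),
    LagCondition(lag_seconds=30, min_markout=0.10, max_markout=float("inf")),
    LagCondition(lag_seconds=300, min_markout=0.20, max_markout=float("inf")),
]

matching = {1: 0.01, 30: 0.15, 300: 0.40}
fails_at_long_lag = {1: 0.01, 30: 0.15, 300: 0.05}

print(markout_shape_trigger(matching, conditions))           # True
print(markout_shape_trigger(fails_at_long_lag, conditions))  # False
```

Requiring every band at once is what separates a shape test from a single threshold: a profile that looks right at one lag but wrong at another does not fire.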
In parallel, the system monitors syndicate behavior – clusters of coordinated multi-participant activity – because abuse often appears as organized conduct rather than isolated accounts.
The continuous overwatch produces structured notifications every minute. Given that the practical reaction window is several minutes, minute-level batch processing is both operationally sufficient and economically efficient. These notifications – and only these – feed the downstream pipeline.
Layer 2: Statistical Validation – Building the Evidence
Once RMS raises a case, the system reconstructs the client’s recent trading context and calculates deeper quantitative evidence. The first step is to convert raw order events into a cumulative position time series over a multi-hour lookback window, yielding a continuous behavioral signal suitable for time-series analysis.
The statistical layer applies two independent validation gates, position structure analysis and markout analysis, and a case proceeds to AI evaluation only if both pass.
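A minimal sketch of that conversion follows, under several assumptions not stated in the original: events arrive as `(timestamp_seconds, signed_quantity)` pairs sorted by time, the position is taken to be zero at the start of the window, and the series is sampled on a regular grid by carrying the last known position forward.

```python
from bisect import bisect_right
from itertools import accumulate


def cumulative_position(events, start, end, step):
    """Sample the running position on a regular grid over [start, end].

    events: list of (timestamp_seconds, signed_quantity), sorted by time;
            buys are positive, sells negative.
    """
    in_window = [(t, q) for t, q in events if start <= t <= end]
    times = [t for t, _ in in_window]
    running = list(accumulate(q for _, q in in_window))

    series = []
    for grid_time in range(start, end + 1, step):
        # Count the events at or before this grid point.
        idx = bisect_right(times, grid_time)
        # Assumption: flat (zero) position before the first event in the window.
        series.append(running[idx - 1] if idx else 0.0)
    return series


# Hypothetical events: buy 1, buy 1, then sell 2.
events = [(10, 1.0), (70, 1.0), (130, -2.0)]
position = cumulative_position(events, start=0, end=180, step=60)
print(position)  # [0.0, 1.0, 2.0, 0.0]
```

Sampling on a fixed grid makes lag-based tools such as autocorrelation straightforward to apply afterwards, since each lag then corresponds to a fixed amount of time.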
Position Structure Analysis examines how exposure is built, held, unwound, and repeated, with the objective of determining whether the observed behavior reflects deliberate, repeatable structure.
A key tool here is the autocorrelation function (ACF), which detects lagged dependence and highlights oscillatory fingerprints typical of engineered patterns. In ordinary trading, ACF signatures tend to be weak and noisy, whereas in systematic cyclical behavior, structure appears at specific lags. Conceptually, this parallels Fourier decomposition in that both methods reveal periodicity, but in live detection ACF has proven more stable.
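To show how structure appears at specific lags, here is a small, dependency-free sketch of the standard sample autocorrelation applied to a synthetic square-wave position path. The input series is invented for illustration, and the function is a textbook estimator, not the production implementation.

```python
def acf(series, max_lag):
    """Sample autocorrelation for lags 0..max_lag (requires max_lag < len(series))."""
    n = len(series)
    mean = sum(series) / n
    centered = [v - mean for v in series]
    variance_sum = sum(v * v for v in centered)
    if variance_sum == 0.0:
        # A flat series has no variation to correlate; report zeros by convention.
        return [0.0] * (max_lag + 1)
    return [
        sum(centered[i] * centered[i + k] for i in range(n - k)) / variance_sum
        for k in range(max_lag + 1)
    ]


# Synthetic position path: 10 samples in position, 10 samples flat, repeated.
# The cycle length is therefore 20 samples.
square_wave = ([1.0] * 10 + [0.0] * 10) * 10
r = acf(square_wave, max_lag=40)

# Strong positive correlation at the full cycle, strong negative at the half cycle.
print(f"lag 10: {r[10]:+.2f}, lag 20: {r[20]:+.2f}")
```

For this synthetic input the ACF is strongly negative at half the cycle length and strongly positive at the full cycle length, which is the kind of lag-specific structure a position gate could test for.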
Markout Analysis evaluates both post-trade and pre-trade price evolution around each order at fixed horizons, then aggregates the results into a centered trajectory treated as a behavioral signature.
Structured price patterns point to repeatable timing advantage, and additional signals in this layer strengthen the analysis and provide corroborating evidence of systematic activity.
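The markout computation described above can be sketched as follows. The conventions here are assumptions made for the example: fills are `(time, side, price)` tuples with side +1 for a buy and -1 for a sell, the reference price is the last known mid at or before the target time, negative horizons represent the pre-trade window, and "centered" is read as a trajectory spanning both sides of the order time.

```python
from bisect import bisect_right


def markout_profile(fills, mids, horizons):
    """Average signed markout per horizon.

    fills:    list of (time, side, price), side = +1 buy / -1 sell
    mids:     list of (time, mid_price), sorted by time
    horizons: offsets in the same time unit; negative = before the fill
    """
    mid_times = [t for t, _ in mids]
    profile = {}
    for h in horizons:
        values = []
        for fill_time, side, fill_price in fills:
            idx = bisect_right(mid_times, fill_time + h)
            if idx == 0:
                continue  # no quote available yet at that time
            reference_mid = mids[idx - 1][1]
            # Positive value: the price moved in the trader's favor.
            values.append(side * (reference_mid - fill_price))
        profile[h] = sum(values) / len(values) if values else None
    return profile


# Hypothetical steadily rising market with two buys.
mids = [(0, 100.0), (10, 101.0), (20, 102.0), (30, 103.0)]
fills = [(10, 1, 101.0), (20, 1, 102.0)]
profile = markout_profile(fills, mids, horizons=[-10, 0, 10])
print(profile)  # {-10: -1.0, 0: 0.0, 10: 1.0}
```

Reading the resulting dictionary in horizon order gives the trajectory around the fill; it is the shape of that trajectory, not any single value, that acts as the signature.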
Together, the position and markout gates answer the central quantitative question: are the observed outcomes consistent with normal strategy variance, chance effects, or an exploitative pattern? Both conditions must be satisfied before the case proceeds to AI evaluation.
Layer 3: AI Judgment – The Final Decision
Only cases that pass both quantitative gates enter AI evaluation. At this stage, the system shifts from measurement to judgment.
The model does not consume raw tabular streams. Instead, it receives two prepared visual artefacts generated by the same analytical pipeline: a cumulative position chart and a markout profile chart.
This is a deliberate design choice. Position oscillation patterns – flat holding periods, sharp vertical transitions, consistent amplitude and timing – are immediately apparent to the eye in a well-constructed chart. A senior analyst reviewing a case typically forms an initial structural judgment visually. The AI was designed to mirror the existing manual review process as closely as possible, in a format that remains easy for humans to verify after the fact.
The position analysis and markout analysis are executed as two separate, independent AI calls, each with its own focused prompt. Both must return a HIGH RISK classification for the case to proceed. This corroboration requirement is intentional: the structural evidence and the financial evidence must agree before autonomous action is taken.
The Classification Cascade as a Diagnostic Tool
The system’s five-class output structure functions as a diagnostic instrument, not just a pass-or-fail mechanism. Each class shows exactly how far a case progressed and where the evidence broke down.
Class 1: Position analysis finds no systematic structure in the exposure series; the case is filtered out at the earliest, lowest-cost gate.
Class 2: Position structure is detected, but markout analysis lacks consistent favorable pricing; activity looks unusual without carrying the financial signature the abuse pattern requires.
Class 3: Both statistical conditions pass, but the first AI review does not return HIGH RISK; oscillation exists, yet it is not clean enough to justify action.
Class 4: The first three conditions pass, but the markout-focused AI evaluation does not confirm the financial signal; suspicion remains, but the pricing evidence stays insufficient.
Class 5: There is full alignment across all four conditions – statistical structure confirmed, financial evidence confirmed, both AI evaluations returning HIGH RISK; only this class triggers autonomous action.
The value of this structure extends beyond binary decisioning. A Class 3 or Class 4 result immediately tells the analyst which evidence dimension failed and where a manual review should concentrate.
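The cascade maps naturally onto a sequence of early returns. The sketch below is a hypothetical rendering: the `Evidence` fields and function names are invented for illustration, and the AI labels are left as `None` when the corresponding call was never made because an earlier gate failed.

```python
from dataclasses import dataclass
from typing import Optional

HIGH_RISK = "HIGH RISK"


@dataclass(frozen=True)
class Evidence:
    position_gate_passed: bool               # statistical position structure
    markout_gate_passed: bool                # statistical markout analysis
    ai_position_label: Optional[str] = None  # None if the AI call was not made
    ai_markout_label: Optional[str] = None


def classify(evidence):
    """Return the class (1-5) at which the evidence chain stopped."""
    if not evidence.position_gate_passed:
        return 1
    if not evidence.markout_gate_passed:
        return 2
    if evidence.ai_position_label != HIGH_RISK:
        return 3
    if evidence.ai_markout_label != HIGH_RISK:
        return 4
    return 5


def triggers_autonomous_action(evidence):
    """Only full alignment across all four conditions leads to action."""
    return classify(evidence) == 5


case = Evidence(True, True, HIGH_RISK, "MEDIUM RISK")
print(classify(case))                    # 4
print(triggers_autonomous_action(case))  # False
```

Because each class corresponds to the first failed condition, the returned number doubles as a pointer to where a manual review should start.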
When a case reaches Class 5, two parallel actions are executed. First, a restriction trigger is sent directly into the risk management system, activating a defensive control on the client’s gold trading activity immediately. This is not a flag, not an advisory, and not a recommendation held in a queue – the protection is already active, without waiting for manual review. Second, risk and dealing receive the full evidence package: charts, quantitative outputs, and the AI reasoning.
The notification provides governance – review, challenge, and reversal if needed – but it does not gate the initial response. Every decision is logged, every artefact stored, and every classification remains retrievable for audit. Autonomy operates within accountability, not instead of it.
How the AI Prompts Evolved
Prompt tuning came after the fundamentals. Before refining prompts, our domain experts calibrated the statistical thresholds in Layers 1 and 2, tested them against historical abuse cases, and then continuously validated performance in live production.
We initially used a single prompt for the AI to evaluate suspicious position patterns. It handled clear cases well but proved inconsistent on edge cases. The first major improvement split position review into two dedicated prompts: one tuned for early-stage emergence (to intervene quickly) and one tuned for established, recurring behavior that stays consistent in timing, amplitude, and activity despite changing market conditions. Those scenarios differ structurally enough that forcing them into one prompt reduced reliability; separating them materially improved precision.
From there, each iteration tightened decision criteria with explicit boundaries: what qualifies as HIGH RISK, what falls into MEDIUM RISK, and which false positive patterns look superficially like the target but are not. This process mirrors classical statistical decision-boundary design – observe failure modes, define them precisely, and add constraints – except that here the “constraint” becomes a prompt change rather than a retraining cycle. What began as a ~150‑word, half‑page prompt evolved into a ~7‑page, 1,500+‑word specification with tightly defined decision criteria, boundary examples, and explicit exclusions grounded in observed failure modes.
Finally, we evaluated the prompt set across multiple AI model architectures – frontier LLMs, open-source alternatives, and smaller low-latency models – and selected the one that delivered the most stable, high-precision judgments under real operating conditions, prioritizing reliability under pressure over raw speed or cost.

From Workload to Oversight
Speed matters – reducing response time from days to minutes changes outcomes. The more durable shift, however, comes from how the system reshapes human work.
Previously, the risk team spent most of its effort on triage: reviewing alerts, reconstructing order history, assembling evidence, escalating, and waiting. The AI agent now handles that operational layer, freeing experts to focus on higher-value judgment: interpreting novel patterns, resolving genuinely ambiguous cases, and staying calibrated on where exploitation begins.
A second, subtler benefit comes from consistency. The system runs continuously without vigilance decay, applying the same checks every minute whether the last incident occurred an hour ago or months earlier. This keeps readiness permanently “on.”
This architecture – upstream high-recall surveillance, statistical evidence construction, constrained AI judgment on prepared artefacts, and autonomous action with full auditability – extends beyond gold and this abuse pattern. It fits any domain where decisions resist rigid rules, the cost of error remains high, and the response must stay both fast and defensible. With infrastructure and governance established, new use cases mainly require redefining evidence and decision criteria.
What Comes Next?
The architecture we built is not a static design but an operating system for a live market environment, built to evolve as conditions and behaviors change. For that reason, there is no “final” version of the process – only the most resilient, precise, and maintainable version currently running in production. Crucially, iteration is a first-class capability – we can refine evidence definitions, recalibrate thresholds, and adjust decision criteria quickly while preserving governance and auditability. This keeps the organization in a state of permanent readiness, able to absorb new signals, adapt to emerging patterns, and respond decisively as soon as new risks appear.

Jarosław Klamut, PhD
Head of Risk at Match-Trade Technologies –
a strategic technology supplier to Match-Prime

