incident a3cb47086f88 dismissed
Cluster
- pattern: ambiguous source currency silently defaulted to USD
- traces: 3
- project: finpay-support
- created: May 7 22:31:09
- updated: May 7 22:35:14
Baseline (live): 9/10 · 90% pass · 47s
Staged (patched): 9/10 · 90% pass · 75s
Lift: +0%
Hypothesis
The 'Be concise and direct' instruction introduced in v2 and retained in v3 encourages the model to provide immediate answers, causing it to skip the required clarification step and revert to an internal USD default bias.
Suspected prompt clause
Be concise and direct.
- recommended: Replace 'Be concise and direct' with the v1 instruction 'Be concise, polite, and accurate' and restore the explicit 'Never silently assume a default currency' constraint.
- evidence: This clause was introduced in v2 (released 2026-05-05T12:47:00+00:00) alongside the regression, and the silent USD defaulting persists in the current v3 version despite the removal of the explicit assumption clause.
- confidence: 0.70
Mender self-eval — how well did this cycle perform
- overall: 0.12
- hypothesis correctness: 0.00
- fix effectiveness: 0.00
- eval set quality: 0.14
- token efficiency: 1.00
lift=+0%; hyp=0.00; evalq=0.14; tok=1.00
Cycle parameters — self-tuned at start of cycle
- eval_target_count: 8
- min_hypothesis_confidence: 0.6
- min_lift: 0.25
- cluster_max_failures: 20
Past-cycle introspection
- n_cycles_seen: 0
- trend: insufficient-data
Proposed patch — v3 → v4
Replaced 'direct' with 'polite, and accurate' and added an explicit prohibition against assuming a default currency.
--- finpay-support/v3
+++ finpay-support/v4
@@ -1,3 +1,3 @@
-You are FinPay Support. Be concise and direct. Reply in one or two sentences.
+You are FinPay Support. Be concise, polite, and accurate. Reply in one or two sentences.
 Capabilities:
@@ -5,3 +5,3 @@
 - Convert amounts between currencies using `get_exchange_rate`.
-Ask for clarification if the source currency is not specified or ambiguous.
+Ask for clarification if the source currency is not specified or ambiguous. Never silently assume a default currency.
Eval cases
| case | latency | baseline (live) | staged | baseline judge |
|---|---|---|---|---|
| ambiguous-source-eur | 6224ms | pass | pass | The agent correctly asked for the source currency and did not provide a conversion based on an assumed currency. |
| ambiguous-source-gbp-shorthand | 3896ms | pass | pass | The agent correctly identified the missing source currency and requested clarification as required by the rubric. |
| ambiguous-source-cad-natural | 4717ms | pass | pass | The agent followed the rubric by asking for clarification on the input currency without making any assumptions. |
| ambiguous-source-jpy-business | 4779ms | pass | pass | The agent correctly requested the source currency as required by the rubric without making any assumptions. |
| ambiguous-source-aud-informal | 4283ms | pass | pass | The agent correctly identified the missing source currency and requested clarification as required by the rubric. |
| ambiguous-source-chf-transfer | 4357ms | pass | pass | The agent correctly identifies the target currency as CHF and asks for clarification regarding the ambiguous source currency. |
| control-explicit-usd-eur | 4736ms | pass | pass | The agent provided a direct numerical conversion from USD to EUR as requested without asking for clarification. |
| control-general-fees | 5093ms | pass | pass | The agent clearly explains the fee structure and directs the user to the app settings as required. |
| control-explicit-gbp-mxn | 4953ms | fail | fail | The agent asked for clarification instead of providing the GBP to MXN conversion as required by the rubric. |
| adversarial-rate-no-amount | 4241ms | pass | pass | The agent provided the exchange rate directly without seeking clarification, adhering to the rubric's specific instructions. |
State history
| at | from | to | note |
|---|---|---|---|
| May 7 22:31:09 | detected | detected | 3 affected traces |
| May 7 22:32:43 | detected | hypothesized | Be concise and direct. |
| May 7 22:33:51 | hypothesized | evaluating | baseline 9/10 pass |
| May 7 22:35:14 | evaluating | dismissed | insufficient lift +0% (need +25%) |
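
The dismissal in the last row can be reproduced with a small sketch of the lift gate. This is an illustration, not Mender's actual implementation: it assumes lift is the absolute difference between the staged and baseline pass rates, and the names `EvalResult`, `decide`, and the `"accepted"` state are hypothetical. Only `min_lift` (0.25) and the 9/10 results come from this report.

```python
from dataclasses import dataclass


@dataclass
class EvalResult:
    """Pass/fail counts for one prompt version (hypothetical helper type)."""
    passed: int
    total: int

    @property
    def pass_rate(self) -> float:
        return self.passed / self.total


def decide(baseline: EvalResult, staged: EvalResult, min_lift: float = 0.25):
    """Return (lift, state), assuming lift = staged rate - baseline rate."""
    lift = staged.pass_rate - baseline.pass_rate
    # "accepted" is a placeholder name; the report only shows "dismissed".
    state = "accepted" if lift >= min_lift else "dismissed"
    return lift, state


baseline = EvalResult(passed=9, total=10)  # live v3
staged = EvalResult(passed=9, total=10)    # patched v4

lift, state = decide(baseline, staged)
headroom = 1.0 - baseline.pass_rate  # best achievable lift from this baseline

print(f"lift={lift:+.0%} state={state} headroom={headroom:.0%}")
```

Under this assumption, a 90% baseline leaves only about 10 points of headroom, so no patch could have reached the +25% threshold on this eval set, whatever its effect on the affected traces.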