
Why You're Logging the Same Bug Four Times (and How Embeddings Fix It)

May 14, 2026 · 12 min read

It's Tuesday morning. You open Slack and see this:

9:14 AM, #support: "Pay button doesn't fire for me on the pricing page."
9:47 AM, Zendesk: "checkout broken on Safari, tried twice."
10:02 AM, Intercom: "hi, I can't complete the order — nothing happens when I click pay."
10:31 AM, support email: "tried 4x no charge, am I doing something wrong???"

You reply to all four. You open four "investigating" threads. You ping the engineer who shipped the Stripe migration on Friday. They look at it for twenty minutes, find the bug (a misconfigured `redirectTo` on the Checkout session), push a fix, and then you spend the next hour going back to each customer to say it's resolved.

The bug took twenty minutes. The duplicates took two hours.

This is the duplicate-bug tax. And if you're running a SaaS without a dedicated QA team, you're paying it every week. This post is about why naive deduplication doesn't work, why embeddings do, what threshold to pick, and where the whole thing falls apart anyway.

The duplicate-bug tax

Pick any seven-day window of your support inbox. Count the distinct underlying bugs. Then count the messages about those bugs. The ratio is usually somewhere between 1:3 and 1:8 for a B2C product, 1:2 to 1:4 for B2B. Every duplicate costs you:

A context switch for whoever triages it
A redundant "I'll look into this" reply
A second engineer pulled in because the first one was busy
A re-test from QA (or, more likely, no QA, so it ships broken twice)
A customer who churns because the third person they talked to had no idea about the first two

The temptation is to fix this with process: "everyone should search the bug tracker before filing." This doesn't work. Customers don't have access to your bug tracker. Support agents do, but searching for "checkout broken" returns 40 results from the last six months and they don't have time to read them all. So the duplicate gets filed anyway.

The real solution is automated deduplication at ingest. Cluster the four messages above into one issue before a human looks at them. Show the engineer: "this bug has been reported four times in the last 90 minutes, here are the variations." Now they know it's a regression, not an edge case, and they prioritize accordingly.

The question is how to cluster them. The four messages above share almost no surface vocabulary. "Pay button," "checkout," "complete the order," "no charge" — these are different strings. A naive matcher won't catch them.

Why keyword matching fails for dedup

The instinct is to reach for string-distance metrics. Levenshtein distance. Jaro-Winkler. TF-IDF over a bag of words. Fuzzy match on a normalized stem. These are all fine tools for the wrong job.

Take Levenshtein distance between two of our reports:

```python
import Levenshtein

a = "Pay button doesn't fire on the pricing page"
b = "checkout broken on Safari, tried twice"

distance = Levenshtein.distance(a, b)
ratio = Levenshtein.ratio(a, b)

print(distance)  # 35
print(ratio)     # 0.27
```

A ratio of 0.27. By any reasonable cutoff (0.7+ is typical for "similar"), these strings are unrelated. But to a human reading them, they're obviously the same bug.

TF-IDF does slightly better on longer documents but it still operates on tokens. "Pay button" and "checkout" don't share tokens. "Fire" and "complete" don't share tokens. The signal you need — that all four messages describe a failed payment attempt on the same page — lives at the level of meaning, not surface form.
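To make the token problem concrete, here is a minimal standard-library sketch that measures how many content words two of the reports share. The `tokens` and `jaccard` helpers and the tiny stopword list are illustrative assumptions, not part of any TF-IDF library; any token-based score starts from the same overlap.

```python
import re

# Illustrative stopword list (an assumption; real pipelines use larger ones).
STOPWORDS = {"on", "the", "for", "me", "a", "i", "to"}

def tokens(text):
    """Lowercase, split into word tokens, and drop stopwords."""
    words = re.findall(r"[a-z']+", text.lower())
    return {w for w in words if w not in STOPWORDS}

def jaccard(a, b):
    """Fraction of tokens two texts share (0.0 = none, 1.0 = identical sets)."""
    ta, tb = tokens(a), tokens(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0

report_a = "Pay button doesn't fire for me on the pricing page."
report_b = "checkout broken on Safari, tried twice."

overlap = jaccard(report_a, report_b)
print(overlap)  # 0.0
```

Once stopwords are removed, the two reports have no tokens in common, so no amount of reweighting those tokens can pull them together.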

You can patch keyword matching with synonym dictionaries, stemming, custom token mappings ("pay" -> "checkout" -> "purchase"). This works for about two weeks until your product changes and the mappings rot. It also doesn't generalize across customers who describe things in their own vocabulary, across languages, or across paraphrases you didn't anticipate.

The right tool is embeddings.

A 200-word primer on embeddings

An embedding is a vector — usually a few hundred to a few thousand floats — that represents the meaning of a piece of text. Two texts with similar meanings have vectors that point in similar directions, regardless of which specific words they use.

In practice you call an API:

```python
from openai import OpenAI

client = OpenAI()

response = client.embeddings.create(
    model="text-embedding-3-small",
    input="Pay button doesn't fire on the pricing page",
)
vector = response.data[0].embedding  # 1536 floats
```

FixFirstly uses OpenAI's `text-embedding-3-small`. It's 1536 dimensions, ~5ms median latency, and costs $0.02 per million tokens. For a typical bug report (50 tokens) that's $0.000001, or about a ten-thousandth of a cent. Voyage AI's embeddings (`voyage-3-lite`, `voyage-3`), which Anthropic recommends, and Cohere's `embed-english-v3.0` are reasonable alternatives if you're already on those stacks. Open-source options like `bge-small-en-v1.5` from BAAI run locally and are competitive on English text.

Don't overthink the model choice for bug deduplication. They all work. The threshold tuning matters more than the model.

Cosine similarity in practice

Once you have two embedding vectors, you measure how similar their meanings are with cosine similarity:

```
cosine_similarity(a, b) = (a · b) / (||a|| * ||b||)
```

It's the cosine of the angle between the two vectors. In general it's bounded between -1 and 1; for text embeddings like these it almost always lands between 0 and 1 in practice. Higher means more similar.

```python
import numpy as np

def cosine_sim(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# embedding_a and embedding_b are vectors returned by the embeddings API
sim = cosine_sim(embedding_a, embedding_b)
```
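As a sanity check on the range, here is a small sketch of the same formula in plain Python on toy two-dimensional vectors. The vectors are made up for illustration and are not real embeddings; the point is that only direction matters, not length.

```python
import math

def cosine_sim(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Same direction, different length: similarity is (up to float rounding) 1.
same_direction = cosine_sim([1.0, 2.0], [2.0, 4.0])

# Perpendicular vectors: similarity is 0.
orthogonal = cosine_sim([1.0, 0.0], [0.0, 1.0])

# Opposite direction: similarity is (up to float rounding) -1.
opposite = cosine_sim([1.0, 2.0], [-1.0, -2.0])
```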

The threshold question is the entire ballgame. Pick it too low and you cluster unrelated bugs together ("checkout broken" + "checkout slow" become one issue). Pick it too high and you miss real duplicates ("pay button frozen" never matches "submit doesn't fire").

FixFirstly uses 0.55 in production. That number isn't pulled from thin air — it came from labeling about 800 message pairs across a few different customers and measuring precision/recall at various thresholds. Roughly:

| Threshold | Precision | Recall | Notes |
|---|---|---|---|
| 0.40 | 0.61 | 0.94 | Too aggressive: UX bugs cluster with billing bugs |
| 0.50 | 0.78 | 0.89 | Still some bad merges around generic "broken" complaints |
| 0.55 | 0.86 | 0.83 | Production setting, best F1 in our test set |
| 0.65 | 0.92 | 0.71 | Misses paraphrases, "pay button" vs "submit" |
| 0.75 | 0.96 | 0.52 | Effectively keyword matching: only near-duplicates cluster |
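The "best F1" note can be checked directly from the precision and recall columns. This is a short sketch using only the numbers in the table; the `f1` helper is the standard harmonic mean of precision and recall.

```python
# (threshold, precision, recall) copied from the table above
rows = [
    (0.40, 0.61, 0.94),
    (0.50, 0.78, 0.89),
    (0.55, 0.86, 0.83),
    (0.65, 0.92, 0.71),
    (0.75, 0.96, 0.52),
]

def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

scores = {threshold: f1(p, r) for threshold, p, r in rows}
best_threshold = max(scores, key=scores.get)

for threshold, score in scores.items():
    print(f"{threshold:.2f}  F1={score:.3f}")
print("best:", best_threshold)  # best: 0.55
```

If you care more about avoiding bad merges than catching every duplicate, you could rank by precision at a minimum acceptable recall instead of by F1.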

Your numbers will be different. If your bug reports are longer (full stack traces, repro steps), you can push the threshold up because there's more signal per message. If they're shorter (Twitter mentions, one-line emails), you may need to push it down and accept more false positives, which you then filter post-hoc.

One important note: 0.55 is on raw `text-embedding-3-small` output without any fine-tuning or domain adaptation. If you train a small adapter on your own labeled pairs, you can usually push the threshold higher and get better precision/recall at the same time. For most teams this isn't worth it. The off-the-shelf model with a tuned threshold gets you 85% of the way there for zero engineering investment.

What false positives and false negatives look like

This is the part most posts skip. Here's where the approach breaks down.

False positive example. Two messages cluster but shouldn't:

"Checkout is broken, page won't load"
"Checkout is broken, card got declined"

These share a lot of surface vocabulary and they're both about checkout, so cosine similarity comes out around 0.71. They cluster. But the first is a frontend bug and the second is a Stripe rejection on a real declined card. Two different problems, two different owners, two different fixes.

How you handle this: a post-clustering rule layer. After embedding similarity says "these belong together," you run a cheap classification pass on top: extract the category (BUG, BILLING, CONFUSION), extract the surface entity (checkout, login, dashboard), and split the cluster if the categories disagree. FixFirstly does this with an LLM classifier that sits between the embedding step and the final cluster assignment. It sees the cluster proposal and can veto a merge.
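The veto rule itself is small once the labels exist. The sketch below assumes the classifier has already produced a category and entity for both the incoming message and the candidate cluster; the `Label` type and `allow_merge` function are hypothetical names for illustration, not FixFirstly's actual code.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Label:
    category: str  # e.g. "BUG", "BILLING", "CONFUSION"
    entity: str    # e.g. "checkout", "login", "dashboard"

def allow_merge(incoming: Label, cluster: Label) -> bool:
    """Veto the merge when categories disagree, even if embeddings are close."""
    return incoming.category == cluster.category

# The two checkout messages from the false positive example:
page_wont_load = Label(category="BUG", entity="checkout")
card_declined = Label(category="BILLING", entity="checkout")

merge_ok = allow_merge(card_declined, page_wont_load)
print(merge_ok)  # False
```

Both messages share the `checkout` entity, which is part of why their embeddings sit close together; the category is what separates them.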

False negative example. Two messages should cluster but don't:

"Pay button is frozen on the pricing page"
"Tried to submit twice, nothing happens"

Cosine similarity here lands around 0.42. The second message doesn't mention "pay" or "button" or "pricing" — it describes the symptom without naming the element. The reporter is a non-technical user who clicked something labeled "Submit" and watched nothing happen.

How you handle this: human review checkpoint on borderline matches. Anything in the 0.40-0.55 band is shown to the human triager as "possible duplicate of cluster X" with a one-click merge. You get the recall benefit of a low threshold without the precision cost, because the human is the final arbiter. You also feed the merge/reject decisions back into your dataset for future threshold tuning, or to fine-tune your embedding model if you go that route.
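The banding logic can be sketched as a single pure function. The constant and function names here are assumptions for illustration; the cutoffs are the 0.40 and 0.55 values discussed above.

```python
AUTO_MERGE = 0.55    # at or above: propose an automatic merge
REVIEW_FLOOR = 0.40  # between the two: ask a human

def route(similarity):
    """Map a cosine similarity score to a triage action."""
    if similarity >= AUTO_MERGE:
        return "merge"
    if similarity >= REVIEW_FLOOR:
        return "review"
    return "new_cluster"

print(route(0.71))  # merge
print(route(0.42))  # review
print(route(0.20))  # new_cluster
```

Note that the 0.71 pair from the false positive example still routes to "merge" here; catching it is the job of the category veto, not the threshold.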

The honest version of the message here is: no threshold gives you 100% precision and 100% recall. You're picking a tradeoff. 0.55 is a good default if you also have a human-in-the-loop for the borderline band.

If you want the full picture of how this fits into a broader triage flow — including severity classification and routing — that's covered in our post on automated bug triage.

Putting it together: a real pipeline

Here's the pseudocode for the actual clustering loop. Storage is Postgres with pgvector (FixFirstly runs on Supabase). The query side uses pgvector's `<=>` operator, which returns cosine distance (1 - cosine similarity), so a distance below 0.45 corresponds to similarity above 0.55.

```python
from openai import OpenAI

client = OpenAI()
THRESHOLD = 0.55

# normalize, db, categories_match, assign_to_cluster, update_centroid,
# create_new_cluster, flag_for_review and emit_cluster_update_event are
# application helpers that are not shown here.

def ingest_message(workspace_id, raw_text):
    # 1. Parse and normalize
    text = normalize(raw_text)  # strip signatures, lowercase, etc.

    # 2. Generate embedding
    vector = client.embeddings.create(
        model="text-embedding-3-small",
        input=text,
    ).data[0].embedding

    # 3. Query existing cluster centroids in this workspace
    candidates = db.query("""
        SELECT id, centroid <=> %s AS distance
        FROM issue_clusters
        WHERE workspace_id = %s AND status != 'RESOLVED'
        ORDER BY distance ASC
        LIMIT 5
    """, (vector, workspace_id))

    best = candidates[0] if candidates else None
    # <=> returns cosine distance, so convert back to similarity
    best_similarity = 1 - best["distance"] if best else 0

    # 4. Classify and decide
    if best_similarity >= THRESHOLD:
        # Post-clustering veto via LLM classifier
        if categories_match(text, best["id"]):
            assign_to_cluster(text, best["id"])
            update_centroid(best["id"], vector)
        else:
            create_new_cluster(text, vector)
    elif best_similarity >= 0.40:
        # Borderline: queue for human review
        flag_for_review(text, candidate_cluster=best["id"])
    else:
        create_new_cluster(text, vector)

    emit_cluster_update_event(...)
```

A few things worth noting in this pipeline.

The `centroid` column is a running average of the embeddings of all messages in a cluster, re-normalized to unit length after each update. You could store every message embedding and compare against all of them (k-nearest-neighbors style), but centroids are dramatically cheaper to query as clusters grow and they perform basically the same on the dedup task.
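One way to read that description as code: fold the new embedding into the mean weighted by the cluster's current message count, then rescale to unit length. This is a sketch under that assumption, not FixFirstly's actual `update_centroid`; the signature below takes the vectors directly so it can run on its own.

```python
import math

def update_centroid(centroid, message_count, new_vector):
    """Fold one new embedding into a running mean, then re-normalize.

    centroid:      current unit-length centroid of the cluster
    message_count: number of messages already in the cluster
    new_vector:    embedding of the message being added
    Returns the new centroid and the new message count.
    """
    merged = [
        (c * message_count + v) / (message_count + 1)
        for c, v in zip(centroid, new_vector)
    ]
    norm = math.sqrt(sum(x * x for x in merged))
    if norm == 0:
        # Degenerate case (vectors cancel out): keep the old direction.
        return list(centroid), message_count + 1
    return [x / norm for x in merged], message_count + 1

# Toy 2-D example: a one-message cluster absorbs a perpendicular vector.
centroid, count = update_centroid([1.0, 0.0], 1, [0.0, 1.0])
```

In the toy example the new centroid points halfway between the two inputs, and the count moves from 1 to 2. Because the stored centroid is rescaled on every update, this is an approximation of the true mean direction rather than an exact one.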

The `status != 'RESOLVED'` filter matters more than it looks. If you don't filter, you'll merge a brand-new report of a regression into a closed cluster from three months ago, and your engineer will look at the closed cluster, assume it's still fixed, and miss the regression. Closed clusters are reference data, not active triage targets.

The borderline-review band is where most of the engineering judgment lives. For a brand-new workspace with no labeled data, send everything 0.40-0.55 to review. Once you have a few hundred decisions, you can narrow the band or skip review entirely for workspaces with stable patterns.

The SQL with pgvector for the centroid query, more explicitly:

```sql
SELECT
  c.id,
  c.title,
  c.centroid <=> $1::vector AS distance,
  c.message_count
FROM issue_clusters c
WHERE c.workspace_id = $2
  AND c.status IN ('OPEN', 'ACKNOWLEDGED')
ORDER BY c.centroid <=> $1::vector ASC
LIMIT 5;
```

Create an HNSW index on `centroid` using the `vector_cosine_ops` operator class for sub-millisecond queries even at 100K+ clusters per workspace. HNSW search is approximate, but it is the right pgvector index type for this workload.

Where this still fails

A few failure modes worth being upfront about.

Multi-bug reports. "Login is broken AND checkout is broken on Safari" produces a single embedding for both bugs. It will probably cluster with whichever bug has more weight in the message, and the other bug gets lost. The fix is to split messages by sentence or clause before embedding, but then you have to decide how to attribute a single message to multiple clusters, which complicates everything downstream (notifications, severity, resolution status). FixFirstly currently lets a message belong to up to two clusters and flags it for review beyond that. Imperfect.
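A deliberately naive sketch of the splitting step is below. It breaks on sentence boundaries and on the word "and", which will over-split phrases like "search and replace"; the `split_report` name and the regex are assumptions for illustration, and a production splitter would need something smarter.

```python
import re

MAX_CLUSTERS_PER_MESSAGE = 2  # the cap described above

def split_report(text):
    """Split a report into candidate clauses, each embedded separately."""
    parts = re.split(r"(?<=[.!?])\s+|\s+and\s+", text, flags=re.IGNORECASE)
    return [p.strip() for p in parts if p.strip()]

parts = split_report("Login is broken AND checkout is broken on Safari")
needs_review = len(parts) > MAX_CLUSTERS_PER_MESSAGE

print(parts)         # ['Login is broken', 'checkout is broken on Safari']
print(needs_review)  # False
```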

Language-mixed messages. A French support email that includes English error messages copy-pasted from the console. `text-embedding-3-small` handles multilingual content reasonably well but you'll see noisier similarities. If you have a strong non-English customer base, look at `text-embedding-3-large` or a multilingual-tuned model like `paraphrase-multilingual-mpnet-base-v2`.

Customer-specific terminology. Enterprise customer calls your "workspaces" feature "tenants" because that's what their internal docs say. Two reports about the same workspaces bug — one using "workspace" and one using "tenant" — will cluster less tightly than they should. Domain adaptation or a glossary-based pre-processing step helps here.
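A glossary pre-processing step can be as small as a whole-word substitution applied before embedding. The sketch below assumes a hand-maintained, per-customer mapping with lowercase keys; `GLOSSARY` and `apply_glossary` are hypothetical names.

```python
import re

# Hypothetical per-customer glossary: customer term -> your product's term.
GLOSSARY = {"tenant": "workspace", "tenants": "workspaces"}

def apply_glossary(text, glossary):
    """Replace whole-word customer terms with canonical ones before embedding."""
    if not glossary:
        return text
    # Longest terms first so "tenants" is tried before "tenant".
    terms = sorted(glossary, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(t) for t in terms) + r")\b",
        re.IGNORECASE,
    )
    return pattern.sub(lambda m: glossary[m.group(0).lower()], text)

normalized = apply_glossary("Can't switch Tenants after login", GLOSSARY)
print(normalized)  # Can't switch workspaces after login
```

Unlike the global synonym mappings criticized earlier, this list is scoped to one customer and one known vocabulary mismatch, which keeps it short enough to maintain.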

Very short messages. "broken" or "doesn't work" or "fix it pls" carry almost no signal. The embedding is essentially noise. These messages will cluster pseudo-randomly with each other and produce ghost clusters of low-signal complaints. The right move is to detect them at ingest, ask a follow-up question to the reporter, and not cluster them until you have enough content.
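Detection at ingest does not need a model. A rough heuristic is to count the words left after removing generic complaint vocabulary; the filler list and the minimum of three content words below are assumptions you would tune on your own inbox.

```python
import re

MIN_CONTENT_WORDS = 3  # assumed cutoff; tune on your own data

# Generic complaint words that say nothing about which bug this is.
FILLER = {
    "broken", "doesn't", "doesnt", "work", "works", "working", "not",
    "fix", "it", "this", "is", "the", "pls", "please", "help",
}

def is_low_signal(text):
    """True when a message has too few content words to cluster reliably."""
    words = re.findall(r"[a-z']+", text.lower())
    content = [w for w in words if w not in FILLER]
    return len(content) < MIN_CONTENT_WORDS

print(is_low_signal("fix it pls"))  # True
print(is_low_signal("Pay button doesn't fire on the pricing page"))  # False
```

Messages flagged this way would skip the embedding step and go to whatever follow-up flow you use to ask the reporter for detail.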

Issues that are actually feature requests. "I wish the pay button was bigger" is not a bug. It can semantically resemble a real pay-button bug enough to cluster at 0.55. The category classifier in the post-cluster veto layer handles most of these, but not all.

This is also why we wrote about doing QA without a QA team — the dedup step is necessary but not sufficient. You still need humans (or agents) making judgment calls on the edges.

The next step after dedup

Clustering tells you what to fix. It doesn't tell you whether you can fix it. The next problem after dedup is reproduction — getting from "five customers reported this" to "here are exact repro steps and a screenshot from staging." That's where most bug fixes still die, and it's the topic of our post on the "can't reproduce" problem.

FixFirstly's full pipeline runs clustering first, then dispatches an AI agent that signs into your staging environment, attempts to reproduce the clustered bug, captures the network trace and screenshots, and files a verified GitHub issue with everything attached. If repro fails, it tells you why. If repro succeeds, the engineer who opens the issue has everything they need. See the full flow at the how-it-works section.

If you're on a post-PMF SaaS team without QA, the duplicate-bug tax is probably the single most fixable inefficiency in your week. Embeddings won't solve everything — multi-bug reports, language mixes, and short messages all still need human attention — but they collapse 80% of duplicates automatically, and that's enough to noticeably change how your week feels.

FixFirstly has a free tier with three investigations a month. Drop your support inbox in for a week and see what your duplicate ratio actually is. Join the waitlist.
