E.A.S.Y.
Method & Workflow · 5 min read

WDF*IDF - how we measure keyword relevance

WDF*IDF is the statistical method we use to work out, from each vertical's data, which words appear in top-performing listings and in what proportion.

by robby

What is WDF*IDF?

WDF = Within-Document Frequency. IDF = Inverse Document Frequency. Together they measure how typical a word is for a cluster of documents.

In other words, the most frequent word does not automatically win. The strongest keyword is the one that appears disproportionately often in top listings yet is rare in the broad pool.
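To make the idea concrete, here is a minimal sketch of one common WDF*IDF formulation. The article does not say which variant is used in production, so the log bases, the smoothing, and the function names (`wdf`, `idf`, `wdf_idf`) are assumptions made for illustration. The sketch also assumes that WDF is measured on the tokens of the top listings and IDF on the broad pool.

```python
import math
from collections import Counter

def wdf(term_count, doc_length):
    """Within-document frequency: log-damped count, normalised by document length."""
    if doc_length < 2:
        return 0.0  # log2(1) is 0, so avoid dividing by zero
    return math.log2(term_count + 1) / math.log2(doc_length)

def idf(num_docs, docs_with_term):
    """Inverse document frequency: higher when fewer pool documents contain the term."""
    # Assumption: a term missing from the pool is treated as if it occurred once.
    return math.log10(1 + num_docs / max(docs_with_term, 1))

def wdf_idf(top_tokens, pool_docs):
    """Score every term in the top listings against a pool of token sets."""
    counts = Counter(top_tokens)
    length = len(top_tokens)
    scores = {}
    for term, count in counts.items():
        docs_with_term = sum(1 for doc in pool_docs if term in doc)
        scores[term] = wdf(count, length) * idf(len(pool_docs), docs_with_term)
    return scores

# Toy data (invented for this example): tokens from top listings and a small pool.
top_tokens = ["handmade", "ring", "handmade", "silver", "ring"]
pool_docs = [
    {"ring", "gold"},
    {"ring", "silver"},
    {"ring", "handmade"},
    {"ring", "necklace"},
]

scores = wdf_idf(top_tokens, pool_docs)
print(scores["handmade"] > scores["ring"])  # True
```

Both words occur twice in the top listings, so their WDF is identical. "handmade" still scores higher because it appears in only one pool document, while "ring" appears in all four.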

Example

In jewelry:

  • "handmade" → high IDF (rare in the general pool, frequent in top listings) → strong keyword
  • "ring" → low IDF (in nearly every jewelry listing) → weak keyword
  • "925 sterling" → medium IDF, high in premium top-10 → strong differentiator keyword

How we use it

  • Per vertical, we build a WDF*IDF vector weekly from the top 10% of listings.
  • When optimizing we compare your listing vector to the vertical vector.
  • Cosine similarity gives us branch_title and branch_desc.
  • When similarity falls below 0.55, we suggest the top 5 missing keywords.
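The comparison step above can be sketched as follows. This is an illustration, not the production code. Vectors are represented as plain dictionaries mapping a term to its WDF*IDF weight, and "missing" is assumed to mean terms that are in the vertical vector but absent from the listing, ranked by their vertical weight. The function names and the sample weights are hypothetical. Only the 0.55 threshold and the top-5 cut-off come from the text.

```python
import math
from typing import Dict, List

def cosine_similarity(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Cosine of the angle between two sparse term-weight vectors."""
    dot = sum(weight * b[term] for term, weight in a.items() if term in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty vector is similar to nothing
    return dot / (norm_a * norm_b)

def suggest_missing(listing: Dict[str, float], vertical: Dict[str, float],
                    threshold: float = 0.55, top_n: int = 5) -> List[str]:
    """Return up to top_n vertical terms absent from the listing, if similarity is low."""
    if cosine_similarity(listing, vertical) >= threshold:
        return []
    # Assumption: "missing" means the term does not occur in the listing at all.
    missing = {t: w for t, w in vertical.items() if t not in listing}
    return sorted(missing, key=missing.get, reverse=True)[:top_n]

# Hypothetical weights, invented for this example.
vertical = {"handmade": 0.9, "925 sterling": 0.7, "gift": 0.4, "ring": 0.1}
listing = {"ring": 0.5}

print(suggest_missing(listing, vertical))
# ['handmade', '925 sterling', 'gift']
```

In this sketch the same function would be run once on the title and once on the description to obtain two separate scores.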

Limits

WDF*IDF has no semantic understanding. "Ring" and "rings" are two words; synonyms aren't resolved. That's why an embedding layer (BERT-based) runs on top, building semantic clusters.
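A small sketch of this limitation, assuming simple whitespace tokenization (the actual tokenizer is not described here): "ring" and "rings" land in different vector dimensions, so only the exact token "silver" overlaps.

```python
from collections import Counter

def bag_of_words(text):
    """Lower-case the text and count whitespace-separated tokens."""
    return Counter(text.lower().split())

a = bag_of_words("Silver ring")
b = bag_of_words("Silver rings")

shared = set(a) & set(b)
print(shared)  # {'silver'}
```

The embedding layer is not shown. Its job is to place such near-identical terms in the same semantic cluster.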