WDF*IDF - how we measure keyword relevance
WDF*IDF is the statistical method we use to work out, from vertical data, which words appear in top-performing listings and in what proportion.
by robby
What is WDF*IDF?
WDF = Within-Document Frequency. IDF = Inverse Document Frequency. Together they measure how typical a word is for a cluster of documents.
In other words, the most frequent word does not automatically win. The winner is the word that appears disproportionately often in top listings yet is rare in the broad pool.
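To make the two factors concrete, here is a minimal sketch using one commonly cited formulation of WDF and IDF (log-damped counts). It is an assumption that this matches the exact formula behind our scores; the function names wdf, idf and wdf_idf and the toy pool are made up for illustration.

```python
import math
from collections import Counter


def wdf(term, tokens):
    """Within-document frequency: log-damped count, normalized by document length."""
    if len(tokens) < 2:
        return 0.0  # log2(1) == 0, so avoid dividing by zero
    freq = Counter(tokens)[term]
    return math.log2(freq + 1) / math.log2(len(tokens))


def idf(term, pool):
    """Inverse document frequency: higher when fewer documents in the pool contain the term."""
    containing = sum(1 for doc in pool if term in doc)
    if containing == 0:
        return 0.0
    return math.log10(1 + len(pool) / containing)


def wdf_idf(term, tokens, pool):
    """Product of both factors for one term in one document."""
    return wdf(term, tokens) * idf(term, pool)


# Toy pool (assumed data): "ring" is everywhere, "handmade" is rare.
pool = [
    ["handmade", "silver", "ring", "ring"],
    ["gold", "ring"],
    ["ring", "gift", "box"],
    ["cheap", "ring", "set"],
]
listing = pool[0]
score_handmade = wdf_idf("handmade", listing, pool)
score_ring = wdf_idf("ring", listing, pool)
```

In this toy pool "ring" occurs twice in the listing but in every document, so its IDF is low and "handmade" ends up with the higher score even though it occurs only once.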
Example
In jewelry:
- "handmade" → high IDF (rare in the general pool, frequent in top listings) → strong keyword
- "ring" → low IDF (in nearly every jewelry listing) → weak keyword
- "925 sterling" → medium IDF, high in premium top-10 → strong differentiator keyword
How we use it
- Per vertical, we build a WDF*IDF vector every week from the top 10% of listings.
- When optimizing we compare your listing vector to the vertical vector.
- Cosine similarity gives us branch_title and branch_desc.
- At a similarity below 0.55 we suggest the top-5 missing keywords.
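The comparison step can be sketched roughly as follows. This is a simplified illustration, not the production code: vectors are assumed to be plain dictionaries mapping term to WDF*IDF weight, the function names are hypothetical, and ranking the missing keywords by their weight in the vertical vector is an assumption. Only the 0.55 threshold and the top-5 cutoff come from the description above.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two sparse term -> weight vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def missing_keywords(listing_vec, vertical_vec, threshold=0.55, top_n=5):
    """Suggest vertical terms absent from the listing when similarity is below the threshold."""
    if cosine_similarity(listing_vec, vertical_vec) >= threshold:
        return []
    missing = [t for t in vertical_vec if listing_vec.get(t, 0.0) == 0.0]
    # Assumption: the heaviest terms in the vertical vector are suggested first.
    return sorted(missing, key=lambda t: vertical_vec[t], reverse=True)[:top_n]


# Toy vectors (assumed data).
vertical = {"handmade": 0.9, "925": 0.7, "sterling": 0.6, "ring": 0.2}
listing = {"ring": 0.5}
suggestions = missing_keywords(listing, vertical)
```

Here the listing only shares the weak term "ring" with the vertical, so the similarity falls well below 0.55 and the three heavier terms are suggested.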
Limits
WDF*IDF has no semantic understanding. "Ring" and "rings" count as two different words, and synonyms aren't resolved. That's why an embedding layer (BERT-based) runs on top, building semantic clusters.
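A small illustration of that limit, assuming a naive tokenizer that splits on non-alphanumeric characters and matches words exactly (the tokenize helper and the sample title are hypothetical):

```python
import re
from collections import Counter


def tokenize(text):
    """Naive exact-match tokenizer: no lowercasing, no stemming, no synonym handling."""
    return re.findall(r"[A-Za-z0-9]+", text)


counts = Counter(tokenize("Ring in silver, stackable rings, silver band"))
```

"Ring", "rings" and the near-synonym "band" each get their own count here, so none of them reinforces the others in a purely frequency-based vector.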