🔧 Spam detection engine

Overview

The spam detection engine uses a hybrid approach that combines:

Pattern matching: comparison against a base of structured rules
Header analysis: evaluation of email authentication failures

Scoring formula:

scoreFinal = Math.min(100, Math.round(patternScore × 0.6 + headerScore × 0.4))
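
As a minimal sketch, the formula can be wrapped in a small helper. The function name `combineScores` and the two weight constants are illustrative assumptions, not names taken from analyze-spam.

```typescript
// Hypothetical helper: names and constants are illustrative, not from analyze-spam.
const PATTERN_WEIGHT = 0.6;
const HEADER_WEIGHT = 0.4;

function combineScores(patternScore: number, headerScore: number): number {
  const weighted = patternScore * PATTERN_WEIGHT + headerScore * HEADER_WEIGHT;
  // Round to an integer, then cap at 100 as in the formula above.
  return Math.min(100, Math.round(weighted));
}

console.log(combineScores(67, 45)); // 58
```

Since both inputs are themselves capped at 100 and the weights sum to 1, the outer `Math.min` acts as a safety guard rather than a limit that is normally reached.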

File layout

src/data/
├── spam_patterns.json              # Hand-curated base (13 patterns)
├── spam_kaggle_patterns.json       # 6 Kaggle patterns
├── spam_uci_patterns.json          # 4 UCI Spambase patterns
└── spam_huggingface_patterns.json  # 6 HuggingFace patterns

scripts/
└── extract-spam-patterns.cjs       # Generates the JSON files

supabase/functions/analyze-spam/index.ts
├── SPAM_PATTERNS   # 13 inline patterns (subset of spam_patterns.json)
└── HEADER_RULES    # 5 header rules

Important: Deno Edge Functions do not have access to the project filesystem, so analyze-spam embeds its patterns inline in TypeScript.
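
One way to keep inline patterns in sync with a JSON source is to render them into a TypeScript module at build time. The sketch below is an assumption about how that could be done, not a description of the project's actual workflow; `renderInlinePatterns` is a hypothetical name.

```typescript
// Hypothetical build-time helper: serializes pattern objects into TypeScript
// source that an Edge Function could import without touching the filesystem.
function renderInlinePatterns(patterns: object[]): string {
  const body = JSON.stringify(patterns, null, 2);
  return [
    "// Generated file, do not edit by hand.",
    `export const SPAM_PATTERNS = ${body};`,
    "",
  ].join("\n");
}

// Made-up pattern used only to show the output shape.
const source = renderInlinePatterns([{ id: "EXAMPLE-001", score: 50 }]);
console.log(source);
```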

Pattern format

{
  id: string;            // "PHISH-001"
  type: string;          // Category (see the table below)
  name: string;          // Human-readable name
  keywords: string[];    // Keywords searched in the content
  subject_regex: string[]; // Regexes applied to the Subject
  sender_domains: string[]; // Suspicious domains in From
  body_keywords?: string[]; // Phrases searched in the body
  score: number;         // Base score (0-100)
  is_spam: boolean;
  severity: string;      // low / medium / high / critical
  description: string;
}
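
To illustrate the shape, here is a minimal sketch of a runtime check for objects loaded from JSON, plus one sample object. The interface name `SpamPattern`, the guard `isSpamPattern`, and every value in `examplePattern` are assumptions made for illustration; the sample is not an entry from spam_patterns.json.

```typescript
// Illustrative type; the name SpamPattern is an assumption.
interface SpamPattern {
  id: string;
  type: string;
  name: string;
  keywords: string[];
  subject_regex: string[];
  sender_domains: string[];
  body_keywords?: string[];
  score: number;
  is_spam: boolean;
  severity: string;
  description: string;
}

const SEVERITIES: string[] = ["low", "medium", "high", "critical"];

function isStringArray(value: unknown): value is string[] {
  return Array.isArray(value) && value.every((item) => typeof item === "string");
}

// Minimal structural check; it does not validate that the regexes compile.
function isSpamPattern(value: unknown): value is SpamPattern {
  if (typeof value !== "object" || value === null) return false;
  const p = value as Record<string, unknown>;
  const score = p["score"];
  const severity = p["severity"];
  const body = p["body_keywords"];
  return (
    typeof p["id"] === "string" &&
    typeof p["type"] === "string" &&
    typeof p["name"] === "string" &&
    isStringArray(p["keywords"]) &&
    isStringArray(p["subject_regex"]) &&
    isStringArray(p["sender_domains"]) &&
    (body === undefined || isStringArray(body)) &&
    typeof score === "number" && score >= 0 && score <= 100 &&
    typeof p["is_spam"] === "boolean" &&
    typeof severity === "string" && SEVERITIES.includes(severity) &&
    typeof p["description"] === "string"
  );
}

// Made-up values, for illustration only.
const examplePattern: SpamPattern = {
  id: "EXAMPLE-001",
  type: "phishing",
  name: "Example parcel notice",
  keywords: ["colis", "livraison"],
  subject_regex: ["(?i)votre colis"],
  sender_domains: ["example.invalid"],
  body_keywords: ["cliquez ici"],
  score: 80,
  is_spam: true,
  severity: "high",
  description: "Illustrative pattern, not part of the real base",
};

console.log(isSpamPattern(examplePattern)); // true
```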

Pattern categories

Type	Description	Examples
`phishing`	Phishing, impersonation	Fake parcel, fake bank, tax notices
`scam`	Financial scams	Lottery, inheritance, Nigerian prince
`malware`	Malicious attachments	Fake invoices, .exe, macros
`commercial`	Unsolicited commercial email	Promotions, online pharmacy
`banking`	Fake payment services	PayPal, Stripe
`tech_support`	Fake technical support	Microsoft, Apple
`sextortion`	Webcam blackmail	Bitcoin, webcam
`delivery`	Fake e-commerce	Amazon, Cdiscount
`sms_spam`	SMS spam	Prizes, premium-rate numbers
`corporate_spam`	Internal phishing	HR, meetings
`financial_spam`	Investment scams	Stocks, stock market
`word_frequency`	Word frequency (UCI)	Frequent spam words
`combination`	Multi-signal (UCI)	Combination

Matching algorithm

function matchPatterns(email: AnalyzeRequest) {
  // Search every text field at once, lowercased for case-insensitive matching.
  const content = `${email.emailContent} ${email.subject} 
                   ${email.from} ${email.headers}`.toLowerCase();
  const matchedPatterns = [];
  let totalScore = 0; // highest pattern score seen so far (patternScore)

  for (const pattern of SPAM_PATTERNS) {
    let matchScore = 0;
    const reasons = [];

    // 1. Keywords (up to 40 points, proportional to the share matched)
    const kwMatches = pattern.keywords.filter(kw => content.includes(kw));
    if (kwMatches.length > 0) {
      matchScore += (kwMatches.length / pattern.keywords.length) * 40;
      reasons.push(`Mots-clés: ${kwMatches.join(', ')}`);
    }

    // 2. Subject regex (30 points, first matching regex only)
    for (const regexStr of pattern.subject_regex || []) {
      // JavaScript rejects the inline "(?i)" flag, so strip it and pass 'i' instead.
      const regex = new RegExp(regexStr.replace('(?i)', ''), 'i');
      if (regex.test(email.subject || '')) {
        matchScore += 30;
        reasons.push(`Sujet correspond: ${regexStr}`);
        break;
      }
    }

    // 3. Sender domain (30 points)
    const domainMatch = pattern.sender_domains.find(d =>
      (email.from || '').toLowerCase().includes(d)
    );
    if (domainMatch) {
      matchScore += 30;
      reasons.push(`Domaine suspect: ${domainMatch}`);
    }

    // 4. Body keywords (up to 20 points, proportional to the share matched)
    const bodyKeywords = pattern.body_keywords || [];
    const bodyMatches = bodyKeywords.filter(kw =>
      content.includes(kw.toLowerCase())
    );
    if (bodyMatches.length > 0) {
      matchScore += (bodyMatches.length / bodyKeywords.length) * 20;
      reasons.push(`Contenu suspect: ${bodyMatches.join(', ')}`);
    }

    // Keep the pattern only if it scored more than 15 points
    if (matchScore > 15) {
      const finalScore = Math.min(100, Math.round((matchScore / 100) * pattern.score));
      matchedPatterns.push({ patternId: pattern.id, type: pattern.type,
        score: finalScore, severity: pattern.severity, reasons });
      totalScore = Math.max(totalScore, finalScore);
    }
  }

  // Return shape assumed here; the excerpt does not show it.
  return { matchedPatterns, totalScore };
}

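patternScore = the maximum of the individual scores of all triggered patterns (totalScore in the code above). The four contributions add up to at most 120 points (40 + 30 + 30 + 20), so a full match yields up to 1.2 × pattern.score before the cap at 100.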
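
The subject rules carry a PCRE-style `(?i)` prefix, which JavaScript's `RegExp` does not accept, so the matcher strips it and passes the `i` flag instead. A minimal sketch of that step as a standalone helper follows; `compileSubjectRegex` is a hypothetical name, and returning `null` for an invalid rule is a design choice of this sketch, not documented behavior of analyze-spam.

```typescript
// Hypothetical helper; not part of analyze-spam as documented.
function compileSubjectRegex(source: string): RegExp | null {
  // Remove every "(?i)" marker, not only the first one.
  const cleaned = source.split("(?i)").join("");
  try {
    return new RegExp(cleaned, "i");
  } catch {
    // An invalid rule is skipped instead of aborting the whole analysis.
    return null;
  }
}

// Made-up rule strings, for illustration only.
const subjectRule = compileSubjectRegex("(?i)votre colis");
console.log(subjectRule !== null && subjectRule.test("Votre COLIS est prêt")); // true
console.log(compileSubjectRegex("([unclosed")); // null
```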

Header rules

const HEADER_RULES = [
  { id: "HDR-001", check: "spf",            condition: "fail", modifier: 25 },
  { id: "HDR-002", check: "dkim",           condition: "fail", modifier: 25 },
  { id: "HDR-003", check: "dmarc",          condition: "fail", modifier: 20 },
  { id: "HDR-004", check: "rdns_mismatch",  condition: "true", modifier: 15 },
  { id: "HDR-005", check: "ip_blacklisted", condition: "true", modifier: 30 },
];

headerScore = the sum of the modifier values of all triggered rules, capped at 100.
Theoretical maximum: 25+25+20+15+30 = 115 → capped at 100.
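
A minimal sketch of how the rules could be evaluated is shown below. The input shape (`HeaderResults`, one string result per check name) and the function name `computeHeaderScore` are assumptions; the source does not show how analyze-spam parses headers.

```typescript
interface HeaderRule {
  id: string;
  check: string;
  condition: string;
  modifier: number;
}

const HEADER_RULES: HeaderRule[] = [
  { id: "HDR-001", check: "spf",            condition: "fail", modifier: 25 },
  { id: "HDR-002", check: "dkim",           condition: "fail", modifier: 25 },
  { id: "HDR-003", check: "dmarc",          condition: "fail", modifier: 20 },
  { id: "HDR-004", check: "rdns_mismatch",  condition: "true", modifier: 15 },
  { id: "HDR-005", check: "ip_blacklisted", condition: "true", modifier: 30 },
];

// Assumed input: already-parsed results, e.g. { spf: "fail", dkim: "pass" }.
type HeaderResults = Record<string, string>;

function computeHeaderScore(results: HeaderResults): { score: number; triggered: string[] } {
  // A rule fires when the result for its check equals its condition.
  const fired = HEADER_RULES.filter((rule) => results[rule.check] === rule.condition);
  const sum = fired.reduce((total, rule) => total + rule.modifier, 0);
  return { score: Math.min(100, sum), triggered: fired.map((rule) => rule.id) };
}

console.log(computeHeaderScore({ spf: "fail", dmarc: "fail" }).score); // 45
```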

Classification thresholds

Score	Level	is_spam	Color
0–30	`safe`	false	🟢 green
31–50	`suspicious`	false	🟡 yellow
51–60	`suspicious`	true	🟡 yellow
61–100	`dangerous`	true	🔴 red

is_spam = true when scoreFinal > 50.
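
The thresholds translate into a short function. This is a sketch: `classify` and the `ThreatLevel` type are illustrative names, and the color column is left out.

```typescript
type ThreatLevel = "safe" | "suspicious" | "dangerous";

// Hypothetical helper mirroring the threshold table for integer scores 0-100.
function classify(scoreFinal: number): { threatLevel: ThreatLevel; is_spam: boolean } {
  const threatLevel: ThreatLevel =
    scoreFinal <= 30 ? "safe" : scoreFinal <= 60 ? "suspicious" : "dangerous";
  // The spam flag has its own cutoff, independent of the level boundaries.
  return { threatLevel, is_spam: scoreFinal > 50 };
}

console.log(classify(58)); // { threatLevel: "suspicious", is_spam: true }
```

Note that the `suspicious` band straddles the spam cutoff, so the level and the flag need to be read together.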

Full worked example

Email: "Votre colis est prêt" ("Your parcel is ready") from dpd@thepiratebuy.com, with SPF=fail and DMARC=fail

Pattern PHISH-001 triggered (base score 85):
  keywords match (colis, livraison) → +13.3 (2/6 × 40)
  subject regex match               → +30
  domain match (thepiratebuy.com)   → +30
  body keyword match (cliquez ici)  → +5   (1/4 × 20)
  ─────────────────────────────────────
  matchScore = 78.3
  finalScore = min(100, round(78.3/100 × 85)) = 67

patternScore = 67

Header rules:
  HDR-001 (SPF fail)   → +25
  HDR-003 (DMARC fail) → +20
  ─────────────────────────
  headerScore = 45

scoreFinal = min(100, round(67×0.6 + 45×0.4))
           = min(100, round(40.2 + 18))
           = 58

→ threatLevel = "suspicious"
→ is_spam = true (58 > 50)
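
The arithmetic above can be reproduced in a few lines. The variable names are illustrative, and the base score of 85 is the value implied by the finalScore line of the example.

```typescript
// Values taken from the worked example; names are illustrative.
const keywordPart = (2 / 6) * 40; // 2 of 6 keywords matched
const subjectPart = 30;           // subject regex matched
const domainPart = 30;            // sender domain matched
const bodyPart = (1 / 4) * 20;    // 1 of 4 body keywords matched
const matchScore = keywordPart + subjectPart + domainPart + bodyPart;

const baseScore = 85; // pattern.score implied by the example
const patternScore = Math.min(100, Math.round((matchScore / 100) * baseScore));

const headerScore = Math.min(100, 25 + 20); // SPF fail + DMARC fail
const scoreFinal = Math.min(100, Math.round(patternScore * 0.6 + headerScore * 0.4));

console.log(matchScore.toFixed(1), patternScore, headerScore, scoreFinal);
// prints: 78.3 67 45 58
```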

Pattern extraction script

# Regenerate all JSON files
node scripts/extract-spam-patterns.cjs --source=all

# One source at a time
node scripts/extract-spam-patterns.cjs --source=kaggle
node scripts/extract-spam-patterns.cjs --source=uci
node scripts/extract-spam-patterns.cjs --source=huggingface

The files are written to src/data/.

Spam engine roadmap

Timeframe	Improvement
Q2 2026	Zod schema for build-time validation
Q2 2026	Caching/deduplication keyed on a hash of the headers
Q3 2026	Scoring v2: `+iaScore×0.20 +velocityScore×0.15`
Q3 2026	HuggingFace embeddings (distilbert, sentence-transformers)
Q3 2026	Email body analysis (URLs, phishing links)
Q3 2026	Async architecture: POST → 202 + job_id → GET /job/:id

See ALGORITHM-OPTIMIZATION.md for R&D details.
