Fuse information from multiple modalities (image features + text) for security tasks: malware screenshot classification, phishing page detection from HTML+screenshots, CLIP-style contrastive learning, and cross-modal retrieval for threat hunting.
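The CLIP-style contrastive objective mentioned above can be sketched in a few lines. This is a minimal numpy version of the symmetric InfoNCE loss: matched image/text pairs (the diagonal of the similarity matrix) should score higher than every mismatched pair in the batch. Batch size, embedding dimension, and the temperature value are illustrative, not prescribed by the source.

```python
import numpy as np

def clip_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired image/text embeddings."""
    # L2-normalize so dot products are cosine similarities.
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature  # (B, B); diagonal = matched pairs

    labels = np.arange(len(logits))
    def ce(l):
        # Cross-entropy with the diagonal as the target class per row.
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()

    # Average the image→text and text→image directions.
    return 0.5 * (ce(logits) + ce(logits.T))

rng = np.random.default_rng(3)
loss = clip_contrastive_loss(rng.normal(size=(4, 32)), rng.normal(size=(4, 32)))
```

Training this objective on (screenshot, HTML/text) pairs yields a shared embedding space, which is what makes the cross-modal retrieval for threat hunting possible: embed an alert's text and search for nearest-neighbor screenshots.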
Single-modal ML: one input type → prediction
Multi-modal ML: image + text + structured → richer representation
Security examples:
Phishing detection: screenshot + HTML source + URL features → is phishing?
Malware UI analysis: executable icon + PE header + strings → malware family
Log correlation: syslog text + network packet bytes → threat classification
Threat hunting: alert text + PCAP features → campaign attribution
Key challenge: how to FUSE representations from different modalities?
Early fusion: concatenate raw inputs or low-level features up front (simple, but ignores modality-specific structure)
Late fusion: separate per-modality models → combine their predictions (robust to a missing modality, but discards cross-modal interactions)
Cross-attention: let modalities attend to each other (Transformer-style; most expressive, but most complex to train)
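Early fusion, the first strategy above, is just concatenation before any model sees the data. A minimal numpy sketch for the phishing case; the per-modality feature dimensions are made up for illustration:

```python
import numpy as np

# Hypothetical feature dimensions (illustrative, not from the source):
# screenshot embedding, HTML/text embedding, URL lexical features.
IMG_DIM, TXT_DIM, URL_DIM = 8, 6, 4

def early_fusion(img_feat, txt_feat, url_feat):
    """Concatenate per-modality features into one vector; a single
    downstream classifier then operates on the fused representation."""
    return np.concatenate([img_feat, txt_feat, url_feat])

rng = np.random.default_rng(0)
fused = early_fusion(rng.normal(size=IMG_DIM),
                     rng.normal(size=TXT_DIM),
                     rng.normal(size=URL_DIM))
print(fused.shape)  # (18,)
```

The downside noted above is visible here: the classifier sees one flat vector and must rediscover which coordinates came from which modality.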
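Late fusion keeps one model per modality and only combines their outputs. A sketch with two hypothetical linear scorers (weights are random placeholders, standing in for trained per-modality models):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Stand-ins for two independently trained models (weights illustrative).
rng = np.random.default_rng(1)
w_img = rng.normal(size=8)  # image-only phishing scorer
w_txt = rng.normal(size=6)  # HTML/text-only phishing scorer

def late_fusion(img_feat, txt_feat, alpha=0.5):
    """Score each modality separately, then blend the probabilities.
    alpha weights the image model vs. the text model."""
    p_img = sigmoid(w_img @ img_feat)
    p_txt = sigmoid(w_txt @ txt_feat)
    return alpha * p_img + (1 - alpha) * p_txt

p = late_fusion(rng.normal(size=8), rng.normal(size=6))
```

Because fusion happens only at the probability level, an interaction like "this logo paired with that URL pattern" can never be learned; that is the cross-modal information loss flagged above.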
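Cross-attention, the third strategy, can be sketched as a single attention head in numpy: text tokens act as queries and attend over image patches (keys/values), so each token pulls in the image evidence most relevant to it. Shapes and projection matrices are illustrative; real models use multiple heads, residuals, and learned weights.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(text_tokens, image_patches, Wq, Wk, Wv):
    """Single-head cross-attention: queries from text, keys/values from image."""
    Q = text_tokens @ Wq             # (T, d) query per text token
    K = image_patches @ Wk           # (P, d) key per image patch
    V = image_patches @ Wv           # (P, d) value per image patch
    scores = Q @ K.T / np.sqrt(Q.shape[-1])  # (T, P) scaled similarities
    attn = softmax(scores, axis=-1)          # each text token's weights over patches
    return attn @ V                          # (T, d) image-informed text states

rng = np.random.default_rng(2)
d = 16
out = cross_attention(rng.normal(size=(5, d)),   # 5 text tokens
                      rng.normal(size=(9, d)),   # 9 image patches
                      rng.normal(size=(d, d)),
                      rng.normal(size=(d, d)),
                      rng.normal(size=(d, d)))
```

This is why cross-attention sits at the expressive end of the spectrum: the interaction between modalities is computed per token, not deferred to a final concatenation or probability average.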