CLIP-ViT-Base Fusion Model for Multi-Modal Hate Speech Detection
A PyTorch-based multi-modal (image + text) hateful-content classifier that uses a CLIP encoder with a late-fusion architecture, trained on the MMHS150K dataset to detect hate speech in social media memes and posts.
🎯 Model Description
This model implements a late fusion architecture with gated attention mechanism for detecting hateful content in social media memes and posts. It combines visual and textual features using OpenAI's CLIP (ViT-Base-Patch32) as the backbone encoder.
The model performs multi-label classification across 5 hate speech categories, making it capable of detecting multiple types of hate in a single post (e.g., content that is both racist and sexist).
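Multi-label here means each category gets an independent sigmoid score rather than a softmax over classes, so several labels can fire on the same post. A toy illustration (the logit values are made up):

```python
import torch

# One post, 5 categories: [racist, sexist, homophobe, religion, otherhate]
logits = torch.tensor([[2.1, 1.3, -3.0, -2.5, -1.0]])
probs = torch.sigmoid(logits)   # independent per-class probabilities
labels = probs > 0.5            # simple 0.5 cut-off for illustration
# This post is flagged as both racist and sexist at once.
print(labels.tolist())          # [[True, True, False, False, False]]
```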
🏗️ Architecture
```
┌─────────────┐     ┌─────────────┐
│    Image    │     │    Text     │
│   Encoder   │     │   Encoder   │
│ (CLIP ViT)  │     │ (CLIP Text) │
└──────┬──────┘     └──────┬──────┘
       │                   │
       ▼                   ▼
┌─────────────┐     ┌─────────────┐
│ Projection  │     │ Projection  │
│  (Linear)   │     │  (Linear)   │
└──────┬──────┘     └──────┬──────┘
       │                   │
       └─────────┬─────────┘
                 │
                 ▼
          ┌─────────────┐
          │ Gated Fusion│◄── Modality presence flags
          │   Module    │    (handles missing modalities)
          └──────┬──────┘
                 │
                 ▼
      ┌───────────────────────┐
      │ Interaction Features  │
      │ • Fused embedding     │
      │ • Text embedding      │
      │ • Visual embedding    │
      │ • |text - visual|     │
      │ • text ⊙ visual       │
      └───────────┬───────────┘
                  │
                  ▼
          ┌──────────────┐
          │Classification│
          │  Head (MLP)  │
          │ → 5 classes  │
          └──────────────┘
```
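The interaction-feature block above can be sketched in PyTorch (function and variable names are illustrative, not the repository's actual module):

```python
import torch

def build_interaction_features(fused, text_emb, visual_emb):
    """Concatenate fusion output with pairwise interaction terms.

    Each input is a (batch, fusion_dim) tensor; the result is
    (batch, 5 * fusion_dim) and feeds the MLP classification head.
    """
    return torch.cat(
        [
            fused,                             # gated fusion output
            text_emb,                          # projected text embedding
            visual_emb,                        # projected visual embedding
            torch.abs(text_emb - visual_emb),  # absolute difference
            text_emb * visual_emb,             # element-wise product
        ],
        dim=-1,
    )

fused = torch.randn(4, 512)
text_emb = torch.randn(4, 512)
visual_emb = torch.randn(4, 512)
features = build_interaction_features(fused, text_emb, visual_emb)
print(features.shape)  # torch.Size([4, 2560])
```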
🔑 Key Features
| Feature | Description |
|---|---|
| Backbone | openai/clip-vit-base-patch32 - Pre-trained CLIP model |
| Fusion Dimension | 512 |
| Max Text Length | 77 tokens |
| Multi-label Output | 5 hate speech categories |
| Gated Attention | Modality-aware fusion with learnable gates |
| Interaction Features | Rich feature interactions (concatenation, element-wise product, absolute difference) |
| Missing Modality Handling | Can handle text-only or image-only inputs |
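One plausible shape for a gated fusion module with modality presence flags is sketched below (assumed names and layer sizes, not the repository's exact implementation):

```python
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    """Fuse text and visual embeddings with a learnable gate.

    Presence flags zero out a missing modality, so the gate only
    mixes the embeddings that are actually available.
    """

    def __init__(self, dim: int = 512):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, text_emb, visual_emb, has_text, has_visual):
        # Flags are (batch, 1) floats in {0, 1}; broadcasting masks a modality.
        text_emb = text_emb * has_text
        visual_emb = visual_emb * has_visual
        g = self.gate(torch.cat([text_emb, visual_emb], dim=-1))
        return g * text_emb + (1.0 - g) * visual_emb

fusion = GatedFusion(dim=512)
t = torch.randn(2, 512)
v = torch.randn(2, 512)
# Second sample is text-only, so its visual flag is 0.
out = fusion(t, v, has_text=torch.ones(2, 1), has_visual=torch.tensor([[1.0], [0.0]]))
print(out.shape)  # torch.Size([2, 512])
```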
🏷️ Output Classes
| Class ID | Category | Description | Prior Probability |
|---|---|---|---|
| 0 | Racist | Racist content targeting race/ethnicity | 32.6% |
| 1 | Sexist | Sexist content targeting gender | 12.0% |
| 2 | Homophobe | Homophobic content targeting sexual orientation | 7.6% |
| 3 | Religion | Religion-based hate speech | 1.5% |
| 4 | OtherHate | Other types of hate speech | 15.6% |
📊 Evaluation Results
Test Set Performance
| Metric | Value |
|---|---|
| F1 Macro | 0.566 |
| F1 Micro | 0.635 |
| ROC-AUC Macro | 0.783 |
| Test Loss | 1.516 |
| Throughput | 381.5 samples/sec |
Per-Class Performance (Validation Set)
| Class | Precision | Recall | F1-Score | Support |
|---|---|---|---|---|
| Racist | 0.576 | 0.843 | 0.684 | 1,994 |
| Sexist | 0.587 | 0.646 | 0.615 | 875 |
| Homophobe | 0.804 | 0.709 | 0.753 | 612 |
| Religion | 0.435 | 0.209 | 0.283 | 129 |
| OtherHate | 0.541 | 0.700 | 0.611 | 1,195 |
| Micro Avg | 0.588 | 0.737 | 0.654 | 4,805 |
| Macro Avg | 0.589 | 0.621 | 0.589 | 4,805 |
⚙️ Optimized Thresholds
The model uses per-class thresholds calibrated on the validation set (instead of the default 0.5) for best performance:
| Class | Threshold |
|---|---|
| Racist | 0.35 |
| Sexist | 0.70 |
| Homophobe | 0.75 |
| Religion | 0.30 |
| OtherHate | 0.60 |
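Thresholds like these are typically found by sweeping candidate cut-offs on the validation set and keeping, per class, the value that maximizes F1. A minimal NumPy sketch of that procedure (not the project's actual calibration script):

```python
import numpy as np

def f1(y_true, y_pred):
    """Binary F1 for one class."""
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def calibrate_thresholds(y_true, y_prob, candidates=np.arange(0.05, 1.0, 0.05)):
    """Per class, keep the candidate cut-off with the best validation F1."""
    best = []
    for c in range(y_true.shape[1]):
        scores = [f1(y_true[:, c], (y_prob[:, c] >= t).astype(int)) for t in candidates]
        best.append(float(candidates[int(np.argmax(scores))]))
    return best

# Toy validation data: probabilities loosely correlated with the labels.
rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=(200, 5))
y_prob = np.clip(0.55 * y_true + 0.5 * rng.random((200, 5)), 0.0, 1.0)
ths = calibrate_thresholds(y_true, y_prob)
print(ths)
```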
📈 Model Comparison
| Model | F1 Macro | F1 Micro | ROC-AUC Macro | Throughput |
|---|---|---|---|---|
| CLIP Fusion (this model) | 0.566 | 0.635 | 0.783 | 381.5 |
| CLIP MTL | 0.569 | 0.644 | 0.783 | 390.9 |
| SigLIP Fusion | 0.507 | 0.610 | 0.774 | 236.3 |
| CLIP Fusion (Weighted Sampling) | 0.557 | 0.636 | 0.772 | 266.4 |
| CLIP Fusion (Bigger Batch) | 0.515 | 0.517 | 0.804 | 400.9 |
🎓 Training Data
MMHS150K Dataset
The model was trained on the MMHS150K (Multi-Modal Hate Speech) dataset, a large-scale multi-modal hate speech dataset collected from Twitter containing 150,000 tweet-image pairs annotated for hate speech detection.
| Property | Value |
|---|---|
| Source | Twitter |
| Total Samples | ~150,000 |
| Modalities | Image + Text |
| Annotation | Multi-label (5 hate categories) |
| Language | English |
Dataset Splits
| Split | Samples |
|---|---|
| Train | ~112,500 |
| Validation | ~15,000 |
| Test | ~22,500 |
Dataset Reference
Paper: "Exploring Hate Speech Detection in Multimodal Publications" (WACV 2020)
Authors: Raul Gomez, Jaume Gibert, Lluis Gomez, Dimosthenis Karatzas
🔧 Training Procedure
Training Configuration
```yaml
# Model Configuration
backend: clip
head: fusion
encoder_name: openai/clip-vit-base-patch32
fusion_dim: 512
max_text_length: 77
freeze_text: false
freeze_image: false

# Training Configuration
num_train_epochs: 6
per_device_train_batch_size: 32
per_device_eval_batch_size: 64
gradient_accumulation_steps: 2

# Learning Rates (Differential)
lr_encoder: 1.0e-5
lr_head: 5.0e-4

# Regularization
weight_decay: 0.02
max_grad_norm: 1.0

# Scheduler
warmup_ratio: 0.05
lr_scheduler_type: cosine

# Loss
loss_type: bce
use_logit_adjustment: false

# Precision
precision: fp16

# Data Augmentation
augment: true
aug_scale_min: 0.8
aug_scale_max: 1.0
horizontal_flip: true
color_jitter: true

# Early Stopping
early_stopping_patience: 3
metric_for_best_model: roc_macro
```
Training Highlights
- Differential Learning Rates: Encoder (1e-5) vs Classification Head (5e-4)
- Mixed Precision: FP16 training for efficiency
- Data Augmentation: Random scaling, horizontal flip, color jitter
- Threshold Calibration: Per-class threshold optimization on validation set
- Early Stopping: Patience of 3 epochs based on ROC-AUC macro
- Best Checkpoint: Selected based on validation ROC-AUC macro score
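The differential learning rates map directly onto optimizer parameter groups. A sketch with stand-in modules (the real model uses the CLIP encoder and an MLP head):

```python
import torch
import torch.nn as nn

# Stand-in modules; in the actual model these are the CLIP backbone
# and the classification head.
encoder = nn.Linear(512, 512)
head = nn.Sequential(nn.Linear(512 * 5, 256), nn.ReLU(), nn.Linear(256, 5))

optimizer = torch.optim.AdamW(
    [
        {"params": encoder.parameters(), "lr": 1.0e-5},  # pre-trained encoder: small LR
        {"params": head.parameters(), "lr": 5.0e-4},     # fresh head: larger LR
    ],
    weight_decay=0.02,
)
print([g["lr"] for g in optimizer.param_groups])  # [1e-05, 0.0005]
```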
Computational Resources
- Training Length: 6 epochs
- Best Checkpoint: Step 33,708
- Hardware: GPU with FP16 support
🚀 How to Use
Quick Start (Recommended)
```python
import torch
from PIL import Image
from transformers import AutoModel, CLIPProcessor

# Load model with trust_remote_code=True
model = AutoModel.from_pretrained(
    "Amirhossein75/clip-vit-base-mmhs150k-fusion",
    trust_remote_code=True,
)
model.eval()

# Load CLIP processor
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Prepare inputs
image = Image.open("path/to/image.jpg").convert("RGB")
text = "sample text from the meme"
inputs = processor(
    text=[text],
    images=[image],
    return_tensors="pt",
    padding=True,
    truncation=True,
    max_length=77,
)

# Run inference
with torch.no_grad():
    result = model.predict(
        input_ids=inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        pixel_values=inputs["pixel_values"],
    )
print(result)
# {'predictions': {'racist': False, 'sexist': True, 'homophobe': False, 'religion': False, 'otherhate': False},
#  'probabilities': {'racist': 0.12, 'sexist': 0.78, 'homophobe': 0.05, 'religion': 0.02, 'otherhate': 0.15}}
```
Batch Inference
```python
import torch
from PIL import Image
from transformers import AutoModel, CLIPProcessor

# Load model
model = AutoModel.from_pretrained(
    "Amirhossein75/clip-vit-base-mmhs150k-fusion",
    trust_remote_code=True,
)
model.eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Prepare batch
images = [Image.open("image1.jpg").convert("RGB"), Image.open("image2.jpg").convert("RGB")]
texts = ["text for image 1", "text for image 2"]
inputs = processor(
    text=texts,
    images=images,
    return_tensors="pt",
    padding=True,
    truncation=True,
    max_length=77,
)

# Get raw logits and probabilities
with torch.no_grad():
    outputs = model(
        input_ids=inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        pixel_values=inputs["pixel_values"],
    )
probabilities = torch.sigmoid(outputs["logits"])

# Apply optimized thresholds
thresholds = torch.tensor([0.35, 0.70, 0.75, 0.30, 0.60])
predictions = (probabilities > thresholds).int()
print(predictions)
```
Using with GPU
```python
import torch
from transformers import AutoModel, CLIPProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"
model = AutoModel.from_pretrained(
    "Amirhossein75/clip-vit-base-mmhs150k-fusion",
    trust_remote_code=True,
).to(device)
model.eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# ... prepare inputs ...
inputs = {k: v.to(device) for k, v in inputs.items()}
with torch.no_grad():
    result = model.predict(**inputs)
```
Download Config Files Only
```python
from huggingface_hub import hf_hub_download
import json

# Download inference config
config_path = hf_hub_download(
    repo_id="Amirhossein75/clip-vit-base-mmhs150k-fusion",
    filename="inference_config.json",
)
with open(config_path) as f:
    config = json.load(f)

print(f"Classes: {config['class_names']}")
print(f"Thresholds: {config['thresholds']}")
print(f"Encoder: {config['encoder_name']}")
```
📁 Model Files
| File | Description |
|---|---|
| `checkpoint-33708/model.safetensors` | Model weights in safetensors format (617 MB) |
| `modeling_clip_fusion.py` | Model architecture code (auto-downloaded with `trust_remote_code`) |
| `config.json` | Model architecture configuration |
| `inference_config.json` | Inference settings with thresholds and class names |
| `label_map.json` | Label name mapping |
| `test_metrics.json` | Test set evaluation metrics |
| `val_report.json` | Detailed validation classification report |
☁️ AWS SageMaker Deployment
This model is compatible with AWS SageMaker for cloud deployment:
```python
from sagemaker.pytorch import PyTorchModel
from sagemaker.serializers import JSONSerializer
from sagemaker.deserializers import JSONDeserializer

model = PyTorchModel(
    model_data="s3://your-bucket/model.tar.gz",
    role=role,
    entry_point="inference.py",
    source_dir="sagemaker",
    framework_version="2.1.0",
    py_version="py310",
)
predictor = model.deploy(
    instance_type="ml.g4dn.xlarge",
    initial_instance_count=1,
)

# Make prediction
import base64

with open("image.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

response = predictor.predict({
    "instances": [{
        "text": "Sample text content",
        "image_base64": image_b64,
    }]
})
```
See the SageMaker documentation for the full deployment guide.
⚠️ Intended Uses & Limitations
✅ Intended Uses
- Content moderation for social media platforms
- Detecting hateful memes and posts
- Research in multi-modal hate speech detection
- Building content safety systems
- Pre-filtering potentially harmful content for human review
⚠️ Limitations
| Limitation | Description |
|---|---|
| Language | Trained only on English content |
| Domain | Twitter-specific; may not generalize to other platforms |
| Class Imbalance | Lower performance on rare categories (Religion: F1=0.283) |
| Cultural Context | May miss culturally-specific hate speech |
| Sarcasm/Irony | May struggle with subtle or ironic hateful content |
| Image-only Hate | Text encoder is important; purely visual hate may be missed |
❌ Out-of-Scope Uses
- NOT for making final moderation decisions without human review
- NOT suitable for legal or compliance purposes without additional validation
- NOT for censorship or suppression of legitimate speech
- NOT for targeting or profiling individuals
🛡️ Ethical Considerations
- This model should be used as a tool to assist human moderators, not replace them
- False positives may incorrectly flag legitimate content
- False negatives may miss harmful content
- Regular evaluation and bias auditing are recommended
- Consider the cultural and contextual factors in deployment
📝 Citation
If you use this model, please cite:
```bibtex
@misc{yousefi2024multimodal,
  title={Multi-Modal Hateful Content Classification with CLIP Fusion},
  author={Yousefi, Amirhossein},
  year={2024},
  publisher={Hugging Face},
  url={https://huggingface.co/Amirhossein75/clip-vit-base-mmhs150k-fusion}
}
```
Dataset Citation
```bibtex
@inproceedings{gomez2020exploring,
  title={Exploring Hate Speech Detection in Multimodal Publications},
  author={Gomez, Raul and Gibert, Jaume and Gomez, Lluis and Karatzas, Dimosthenis},
  booktitle={Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)},
  pages={1470--1478},
  year={2020}
}
```
CLIP Citation
```bibtex
@inproceedings{radford2021learning,
  title={Learning Transferable Visual Models From Natural Language Supervision},
  author={Radford, Alec and Kim, Jong Wook and Hallacy, Chris and others},
  booktitle={International Conference on Machine Learning},
  pages={8748--8763},
  year={2021}
}
```
🔗 Links
| Resource | Link |
|---|---|
| GitHub Repository | multimodal-content-moderation |
| Base Model | openai/clip-vit-base-patch32 |
| MMHS150K Dataset | Official Page |
| CLIP Paper | arXiv |
📄 License
This project is licensed under the MIT License - see the LICENSE file for details.
🤝 Contributing
Contributions are welcome! Please see the GitHub repository for contribution guidelines.