Multimodal Research
ZoomEye: Enhancing Multimodal LLMs with Human-Like Zooming Capabilities through Tree-Based Image Exploration (arXiv:2411.16044)
OmChat: A Recipe to Train Multimodal Language Models with Strong Long Context and Video Understanding (arXiv:2407.04923)
OmDet: Large-scale vision-language multi-dataset pre-training with multimodal detection network (arXiv:2209.05946)
VL-CheckList: Evaluating Pre-trained Vision-Language Models with Objects, Attributes and Relations (arXiv:2207.00221)
GroundVLP: Harnessing Zero-shot Visual Grounding from Vision-Language Pre-training and Open-Vocabulary Object Detection (arXiv:2312.15043)
How to Evaluate the Generalization of Detection? A Benchmark for Comprehensive Open-Vocabulary Detection (arXiv:2308.13177)
Real-time Transformer-based Open-Vocabulary Detection with Efficient Fusion Head (arXiv:2403.06892)
RS5M and GeoRSCLIP: A Large Scale Vision-Language Dataset and A Large Vision-Language Model for Remote Sensing (arXiv:2306.11300)
OmAgent: A Multi-modal Agent Framework for Complex Video Understanding with Task Divide-and-Conquer (arXiv:2406.16620)
VLM-R1: A Stable and Generalizable R1-style Large Vision-Language Model (arXiv:2504.07615)