Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection
modelscope: https://modelscope.cn/models/AI-ModelScope/GroundingDINO
In this paper, we present an open-set object detector, called Grounding DINO, by marrying Transformer-based detector DINO with grounded pre-training, which can detect arbitrary objects with human inputs such as category names or referring expressions. The key solution of open-set object detection is introducing language to a closed-set detector for open-set concept generalization. To effectively fuse language and vision modalities, we conceptually divide a closed-set detector into three phases and propose a tight fusion solution, which includes a feature enhancer, a language-guided query selection, and a cross-modality decoder for cross-modality fusion. While previous works mainly evaluate open-set object detection on novel categories, we propose to also perform evaluations on referring expression comprehension for objects specified with attributes. Grounding DINO performs remarkably well on all three settings, including benchmarks on COCO, LVIS, ODinW, and RefCOCO/+/g. Grounding DINO achieves a 52.5 AP on the COCO detection zero-shot transfer benchmark, i.e., without any training data from COCO. It sets a new record on the ODinW zero-shot benchmark with a mean 26.1 AP.
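The language-guided query selection step described above can be summarized in a few lines: each image token is scored by its best-matching text token, and the top-k positions seed the decoder queries (the paper uses 900 queries). Below is a minimal single-image PyTorch sketch of that idea; the function and tensor names are illustrative assumptions, not the repository's actual implementation.

```python
import torch

def language_guided_query_selection(image_feats: torch.Tensor,
                                    text_feats: torch.Tensor,
                                    num_queries: int = 900) -> torch.Tensor:
    """Pick the image-token positions most relevant to the text prompt.

    image_feats: (num_img_tokens, d) flattened multi-scale image features
    text_feats:  (num_text_tokens, d) encoded text features
    Returns indices of the top-`num_queries` image tokens
    (assumes num_img_tokens >= num_queries).
    """
    # Token-level similarity between every image token and every text token.
    sim = image_feats @ text_feats.T        # (num_img_tokens, num_text_tokens)
    # Score each image token by its best-matching text token.
    scores = sim.max(dim=-1).values         # (num_img_tokens,)
    # The highest-scoring positions initialize the cross-modality decoder queries.
    return scores.topk(num_queries).indices
```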
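For reference, a typical zero-shot inference call with the helpers documented in the official IDEA-Research/GroundingDINO repository (https://github.com/IDEA-Research/GroundingDINO) looks like the sketch below. The config, checkpoint, and image paths are placeholders, and the thresholds are the repository's suggested defaults; treat this as an illustrative example rather than the exact ModelScope entry point.

```python
import cv2
from groundingdino.util.inference import load_model, load_image, predict, annotate

# Placeholder paths: point these at the downloaded config and checkpoint.
model = load_model("GroundingDINO_SwinT_OGC.py", "groundingdino_swint_ogc.pth")
image_source, image = load_image("demo.jpg")

# Category names separated by " . " act as the text prompt for open-set detection.
boxes, logits, phrases = predict(
    model=model,
    image=image,
    caption="cat . dog . remote control .",
    box_threshold=0.35,
    text_threshold=0.25,
)

# Draw the predicted boxes and matched phrases onto the source image.
annotated = annotate(image_source=image_source, image=image,
                     boxes=boxes, logits=logits, phrases=phrases)
cv2.imwrite("annotated.jpg", annotated)
```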