Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection

  • https://arxiv.org/abs/2303.05499

  • GitHub: https://github.com/IDEA-Research/GroundingDINO

  • modescope: https://modelscope.cn/models/AI-ModelScope/GroundingDINO

  • 在本文中,我们提出了一种开放集对象检测器(open-set object detector),称为 Grounding DINO,通过将基于 Transformer 的检测器 DINO 与 grounded pre-training 相结合,它可以通过人类输入(例如类别名称或引用表达式)来检测任意对象。开放集目标检测的关键解决方案(key solution)是将语言引入闭集检测器(closed-set detector)以实现开集概念泛化(open-set concept generalization)。为了有效地融合语言和视觉模态,我们在概念上将闭集检测器分为三个阶段,并提出了一种紧密融合解决方案,其中包括特征增强器(feature enhancer)、语言引导查询选择(language-guided query selection)和用于跨模态融合的跨模态解码器(cross-modality decoder for cross-modality fusion)。先前的研究主要是在新的类别上评估开放集物体检测,而我们提出进一步在带有属性的指定物体的指称表达理解上进行评估(we propose to also perform evaluations on referring expression comprehension for objects specified with attributes)。Grounding DINO 在所有三种设置上都表现出色,包括 COCO、LVIS、ODinW 和 RefCOCO/+/g 基准。接地 DINO 实现了在COCO检测零样本传输基准上(即没有来自COCO的任何训练数据)达到52.5 AP。它在 OdinW 零样本基准测试中创造了新记录:平均 26.1 AP. Code。

  • In this paper, we present an open-set object detector, called Grounding DINO, by marrying Transformer-based detector DINO with grounded pre-training, which can detect arbitrary objects with human inputs such as category names or referring expressions. The key solution of open-set object detection is introducing language to a closed-set detector for open-set concept generalization. To effectively fuse(融合) language and vision modalities, we conceptually divide a closed-set detector into three phases and propose a tight fusion solution, which includes a feature enhancer, a language-guided query selection, and a cross-modality decoder for cross-modality fusion. While previous works mainly evaluate open-set object detection on novel categories(新型类别), we propose to also perform evaluations on referring expression comprehension for objects specified with attributes. Grounding DINO performs remarkably well on all three settings, including benchmarks on COCO, LVIS, ODinW, and RefCOCO/+/g. Grounding DINO achieves a 52.5 AP on the COCO detection zero-shot transfer benchmark, i.e., without any training data from COCO. It sets a new record on the ODinW zero-shot benchmark with a mean 26.1 AP.