DETR was implemented from scratch using PyTorch and trained on the Pascal VOC dataset on a single RTX 3060 TI GPU. Training took ~63 hours (2.6 days). I trained on Pascal VOC trainval07+12 (16551 images) and evaluated on Pascal VOC test07 (4952 images). The model achived a mAP50 score of 0.7563 on the test07 set.
Code:
https://github.com/raphaelsenn/DETR
Original paper:
End-to-End Object Detection with Transformers, Carion et al., 2020
https://arxiv.org/abs/2005.12872