Full translation of the official YOLOv1 paper [with personal notes]


Preface

This post is a translation of the official YOLOv1 paper, with my own understanding added as I worked through it. It is mainly a summary for my own study. The related posts and references I drew on are acknowledged at the end of the post. Let's begin!

1. You Only Look Once: Unified, Real-Time Object Detection — Abstract

We present YOLO, a new approach to object detection. Prior work on object detection repurposes classifiers to perform detection. Instead, we frame object detection as a regression problem to spatially separated bounding boxes and associated class probabilities. A single neural network predicts bounding boxes and class probabilities directly from full images in one evaluation. Since the whole detection pipeline is a single network, it can be optimized end-to-end directly on detection performance.

We present YOLO, a new approach to object detection. Prior work (two-stage detectors) repurposes classifiers to perform detection. Instead, we frame object detection as a regression problem over spatially separated bounding boxes and their associated class probabilities. A single neural network predicts bounding boxes and class probabilities directly from the full image in a single evaluation. Because the whole detection pipeline is one network, it can be optimized end-to-end directly on detection performance. (Note: YOLO is an end-to-end, regression-style model that predicts bounding boxes and class probabilities for the whole image in one pass.)

Our unified architecture is extremely fast. Our base YOLO model processes images in real-time at 45 frames per second. A smaller version of the network, Fast YOLO, processes an astounding 155 frames per second while still achieving double the mAP of other real-time detectors. Compared to state-of-the-art detection systems, YOLO makes more localization errors but is less likely to predict false positives on background. Finally, YOLO learns very general representations of objects. It outperforms other detection methods, including DPM and R-CNN, when generalizing from natural images to other domains like artwork.

YOLO is extremely fast at detection. The base model processes images in real time at 45 frames per second, and the smaller Fast YOLO reaches 155 frames per second while still achieving twice the mAP of other real-time detectors. Compared with state-of-the-art systems, YOLO makes more localization errors, but it is less likely to predict false positives on background. Moreover, YOLO learns more general object representations than systems such as DPM and R-CNN, so it transfers better from natural images to other domains such as artwork.

1. Introduction (traditional two-stage detectors vs. YOLO's single-stage detector)

Humans glance at an image and instantly know what objects are in the image, where they are, and how they interact. The human visual system is fast and accurate, allowing us to perform complex tasks like driving with little conscious thought. Fast, accurate algorithms for object detection would allow computers to drive cars without specialized sensors, enable assistive devices to convey real-time scene information to human users, and unlock the potential for general purpose, responsive robotic systems.

Humans glance at an image and instantly know what objects are in it, where they are, and how they interact. The human visual system is fast and accurate, allowing us to perform complex tasks such as driving with little conscious thought. Fast, accurate object detection algorithms would let computers drive cars without specialized sensors, let assistive devices convey real-time scene information to human users, and unlock the potential for general-purpose, responsive robotic systems.

Current detection systems repurpose classifiers to perform detection. To detect an object, these systems take a classifier for that object and evaluate it at various locations and scales in a test image. Systems like deformable parts models (DPM) use a sliding window approach where the classifier is run at evenly spaced locations over the entire image [10].

Current mainstream detection systems repurpose classifiers to perform detection. To detect a particular class of object, they take a classifier for that class and evaluate it at various locations and scales in a test image. Systems like deformable parts models (DPM) use a sliding-window approach, running the classifier at evenly spaced locations over the entire image.

More recent approaches like R-CNN use region proposal methods to first generate potential bounding boxes in an image and then run a classifier on these proposed boxes. After classification, post-processing is used to refine the bounding boxes, eliminate duplicate detections, and rescore the boxes based on other objects in the scene [13]. These complex pipelines are slow and hard to optimize because each individual component must be trained separately.

Most recent approaches such as R-CNN use region proposal methods: they first generate candidate bounding boxes in the image and then run a classifier on these proposed boxes. After classification, post-processing refines the bounding boxes, eliminates duplicate detections, and rescores boxes based on the other objects in the scene. These complex pipelines are slow and hard to optimize because each individual stage must be trained separately.

Stage one extracts potential candidate boxes (region proposals); stage two runs a classifier over each candidate box one by one.

Interpretation: The sliding-window approach to detection is conceptually very simple: it turns detection into image classification. Windows of different sizes and aspect ratios slide over the whole image with a certain stride, and the region under each window is classified (essentially deciding whether it contains an object or only background).

In practice, however, you do not know the scale of the objects to be detected, so you must slide windows of many sizes and aspect ratios and choose a suitable stride. This produces a huge number of sub-regions, each of which must be run through the classifier, so the classifier cannot be too complex if speed is to be maintained. One way out is to reduce the number of sub-regions to classify, which is exactly the improvement made by R-CNN (region-proposal-based CNN): it uses selective search to find the sub-regions most likely to contain objects (region proposals), which can be seen as a heuristic filter that discards many sub-regions and improves efficiency. Since this post is not about R-CNN, I will not go into it further.

We reframe object detection as a single regression problem, straight from image pixels to bounding box coordinates and class probabilities. Using our system, you only look once (YOLO) at an image to predict what objects are present and where they are.

We reframe object detection as a single regression problem, mapping directly from image pixels to bounding-box coordinates and class probabilities. With our system, you only look once (YOLO) at an image to predict what objects are present and where they are.

YOLO is refreshingly simple: see Figure 1. A single convolutional network simultaneously predicts multiple bounding boxes and class probabilities for those boxes. YOLO trains on full images and directly optimizes detection performance. This unified model has several benefits over traditional methods of object detection.

YOLO is refreshingly simple (see Figure 1): a single convolutional network simultaneously predicts multiple bounding boxes and the class probabilities for those boxes. YOLO trains on full images and directly optimizes detection performance. This unified model has several advantages over traditional object detection methods.

The overall pipeline: resize the image to 448×448, feed it into the convolutional network to obtain the predicted bounding boxes and class probabilities, and finally apply non-maximum suppression to produce the final detections.

First, YOLO is extremely fast. Since we frame detection as a regression problem we don't need a complex pipeline. We simply run our neural network on a new image at test time to predict detections. Our base network runs at 45 frames per second with no batch processing on a Titan X GPU and a fast version runs at more than 150 fps. This means we can process streaming video in real-time with less than 25 milliseconds of latency. Furthermore, YOLO achieves more than twice the mean average precision of other real-time systems. For a demo of our system running in real-time on a webcam please see our project webpage: http://pjreddie.com/yolo/.

First, YOLO is extremely fast. Because detection is framed as a regression problem, no complex pipeline is needed; at test time we simply run the neural network on a new image to predict detections. On a Titan X GPU, without batch processing, the base YOLO model processes 45 images per second, and the fast version processes more than 150. This means YOLO can process streaming video in real time with less than 25 ms of latency. In addition, YOLO achieves more than twice the mean average precision (mAP) of other real-time systems. For a demo running in real time on a webcam, see the project page: http://pjreddie.com/yolo/.

Second, YOLO reasons globally about the image when making predictions. Unlike sliding window and region proposal-based techniques, YOLO sees the entire image during training and test time so it implicitly encodes contextual information about classes as well as their appearance. Fast R-CNN, a top detection method [14], mistakes background patches in an image for objects because it can't see the larger context. YOLO makes less than half the number of background errors compared to Fast R-CNN.

Second, YOLO reasons globally about the image when making predictions. Unlike sliding-window and region-proposal-based methods, YOLO sees the entire image during training and testing, so it implicitly encodes contextual information about classes as well as their appearance. Fast R-CNN, a top-performing detector, mistakes background patches for objects because it cannot see the larger context. YOLO makes fewer than half as many background errors as Fast R-CNN.

Third, YOLO learns generalizable representations of objects. When trained on natural images and tested on artwork, YOLO outperforms top detection methods like DPM and R-CNN by a wide margin. Since YOLO is highly generalizable it is less likely to break down when applied to new domains or unexpected inputs.

Third, YOLO learns more generalizable representations of objects. When trained on natural images and tested on artwork, YOLO outperforms top methods such as DPM and R-CNN by a wide margin. Because it is highly generalizable, it is less likely to break down when applied to new domains or unexpected inputs.

YOLO still lags behind state-of-the-art detection systems in accuracy. While it can quickly identify objects in images it struggles to precisely localize some objects, especially small ones. We examine these tradeoffs further in our experiments.

YOLO still lags behind state-of-the-art detection systems in accuracy. Although it can quickly identify objects in images, it struggles to precisely localize some objects, especially small ones. These trade-offs are examined further in the experiments.

All of our training and testing code is open source. A variety of pretrained models are also available to download.

All of our training and testing code is open source, and a variety of pretrained models are also available for download.

2. Unified Detection

We unify the separate components of object detection into a single neural network. Our network uses features from the entire image to predict each bounding box. It also predicts all bounding boxes across all classes for an image simultaneously. This means our network reasons globally about the full image and all the objects in the image. The YOLO design enables end-to-end training and real-time speeds while maintaining high average precision.

YOLO unifies the separate components of object detection into a single neural network. The network uses features from the entire image to predict each bounding box, and it predicts all bounding boxes across all classes for an image simultaneously. In other words, the network reasons globally about the whole image and every object in it. This design enables end-to-end training and real-time speed while maintaining high average precision.

Our system divides the input image into an S × S grid. If the center of an object falls into a grid cell, that grid cell is responsible for detecting that object.

The system divides the input image into an S × S grid. If the center of an object falls into a grid cell, that grid cell is responsible for detecting the object.

During training, the grid cell into which the center of a ground-truth box falls is the cell responsible for detecting that object, and the bounding box for that object should be produced by that cell. (YOLO splits the input image into S×S cells; whichever cell contains an object's center point is the cell responsible for detecting that object.)

What is a grid cell? Each of the small squares the image is divided into is one grid cell.
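As a small illustration of this assignment rule, here is a minimal sketch (not code from the paper; it assumes the paper's PASCAL VOC setting of S = 7 and a 448×448 input, and the function name is my own):

```python
# Which grid cell is responsible for a ground-truth box?
def responsible_cell(cx, cy, img_size=448, S=7):
    """Return (row, col) of the grid cell that contains the box center (cx, cy)."""
    cell_size = img_size / S          # 448 / 7 = 64 pixels per cell
    col = int(cx // cell_size)
    row = int(cy // cell_size)
    return row, col

# Example: an object centered at (150, 300) belongs to the cell in row 4, column 2.
print(responsible_cell(150, 300))     # -> (4, 2)
```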

Each grid cell predicts B bounding boxes and confidence scores for those boxes. These confidence scores reflect how confident the model is that the box contains an object and also how accurate it thinks the box is that it predicts. Formally we define confidence as Pr(Object) × IOU(pred, truth). If no object exists in that cell, the confidence scores should be zero. Otherwise we want the confidence score to equal the intersection over union (IOU) between the predicted box and the ground truth.

Each grid cell predicts B bounding boxes and confidence scores for those boxes. These confidence scores reflect how confident the model is that the cell contains an object, and how accurate it believes its predicted box to be. Formally, confidence is defined as Pr(Object) × IOU(pred, truth).

Pr(Object) indicates whether the grid cell (note: the cell, not the bounding box) contains an object. If the cell contains no object, Pr(Object) = 0 and the confidence score should be zero. Otherwise Pr(Object) = 1 and the confidence score equals the intersection over union (IOU) between the predicted box and the ground-truth box (a value in [0, 1]; the larger the overlap, the higher the score).

What is a bounding box? My understanding: in object detection we need to localize and classify every object in the image; the rectangle drawn around an object to localize it is its bounding box, abbreviated bbox.

What is the ground truth? In object detection, the ground truth is the manually annotated class label and box for each object.

Each grid cell predicts B bounding boxes and a confidence score for each of them, and the centers of these B boxes fall inside that cell, so we always know which cell produced which box. The confidence expresses whether the box contains an object; it is computed as Pr(Object) × IOU(pred, truth), i.e. the probability that the box contains an object times the IOU between the predicted box and the ground truth.

What is IOU? IOU (intersection over union) measures how accurate a predicted box is: it is the ratio of the intersection area to the union area of the predicted box and the manually annotated ground-truth box.
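A minimal IOU computation in Python, with boxes given as (x1, y1, x2, y2) corner coordinates (an illustrative sketch, not code from the paper):

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    # Corners of the intersection rectangle
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

print(iou((0, 0, 100, 100), (50, 50, 150, 150)))   # 2500 / 17500 ≈ 0.143
```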

Each bounding box consists of 5 predictions: x, y, w, h, and confidence. The (x, y) coordinates represent the center of the box relative to the bounds of the grid cell. The width and height are predicted relative to the whole image. Finally the confidence prediction represents the IOU between the predicted box and any ground truth box.

Each bounding box consists of five predictions: x, y, w, h, and confidence. The (x, y) coordinates represent the center of the box relative to the bounds of its grid cell; the width and height (w, h) are predicted relative to the whole image. Finally, the confidence prediction represents the IOU between the predicted box and the ground-truth box.

Interpretation: In practice, w and h are normalized by the image width and height so that they lie in [0, 1]; x and y are the offsets of the box center relative to its grid cell, also normalized to [0, 1].
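A sketch of this target encoding (assuming the 448×448 input and S = 7 used in the paper; the function name and box format are my own):

```python
def encode_box(cx, cy, w, h, img_size=448, S=7):
    """Encode an absolute-pixel box (center cx, cy, width w, height h) into YOLO targets:
    (x, y) are offsets within the responsible cell in [0, 1); (w, h) are normalized
    by the full image size."""
    cell = img_size / S
    col, row = int(cx // cell), int(cy // cell)
    x = cx / cell - col                # offset of the center inside its cell
    y = cy / cell - row
    return row, col, (x, y, w / img_size, h / img_size)

print(encode_box(150, 300, 120, 180))  # -> (4, 2, (0.34375, 0.6875, ~0.268, ~0.402))
```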

Each grid cell also predicts C conditional class probabilities, Pr(Class_i | Object). These probabilities are conditioned on the grid cell containing an object. We only predict one set of class probabilities per grid cell, regardless of the number of boxes B.

Each grid cell also predicts C conditional class probabilities, Pr(Class_i | Object).

These probabilities are conditioned on the grid cell containing an object. Regardless of the number of boxes B, only one set of class probabilities is predicted per grid cell.

At test time we multiply the conditional class probabilities and the individual box confidence predictions,

Pr(Class_i | Object) × Pr(Object) × IOU(pred, truth) = Pr(Class_i) × IOU(pred, truth),

which gives us class-specific confidence scores for each box. These scores encode both the probability of that class appearing in the box and how well the predicted box fits the object.

At test time, the conditional class probabilities are multiplied by the individual box confidences: Pr(Class_i | Object) × Pr(Object) × IOU(pred, truth) = Pr(Class_i) × IOU(pred, truth).

This gives a class-specific confidence score for each bounding box. The product encodes both the probability of the predicted class appearing in the box and how well the predicted box fits the object (if the cell contains no object, the product is simply 0).

Interpretation: When YOLO is applied to PASCAL VOC, the paper uses S = 7, i.e. the image is divided into 7×7 = 49 cells; each cell predicts B = 2 boxes (each box has 5 values: x, y, w, h, confidence), and C = 20 (PASCAL VOC has 20 classes). The final prediction is therefore a 7×7×30 tensor, i.e. S × S × (B × 5 + C). A small decoding sketch follows the notes below.

Notes:

1. Because the output layer is fully connected, at detection time the trained YOLO model only supports the same input resolution as the training images.

2. Although each cell can predict B bounding boxes, only the box with the highest IOU is kept as the detection output, i.e. each cell predicts at most one object. When objects are small and appear in groups, such as flocks of birds or herds, a cell may contain several objects but can detect only one of them. This is one of YOLO's weaknesses.
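To make the 7×7×30 layout concrete, here is a rough decoding sketch (it assumes the commonly described channel ordering of two boxes of 5 values each, followed by the 20 class probabilities; the function name is my own, not from the paper):

```python
import numpy as np

S, B, C = 7, 2, 20

def decode(pred):
    """pred: (S, S, B*5 + C) output tensor.
    Returns a list of (row, col, box_index, (x, y, w, h), confidence, class_probs)."""
    boxes = []
    for row in range(S):
        for col in range(S):
            cell = pred[row, col]
            class_probs = cell[B * 5:]                 # last C values: one set per cell
            for b in range(B):
                x, y, w, h, conf = cell[b * 5: b * 5 + 5]
                boxes.append((row, col, b, (x, y, w, h), conf, class_probs))
    return boxes

pred = np.random.rand(S, S, B * 5 + C)                 # stand-in for a network output
print(len(decode(pred)))                                # 98 boxes per image
```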

2.1 Network Design

We implement this model as a convolutional neural network and evaluate it on the PASCAL VOC detection dataset [9]. The initial convolutional layers of the network extract features from the image while the fully connected layers predict the output probabilities and coordinates.

The model is implemented as a convolutional neural network and evaluated on the PASCAL VOC detection dataset. The initial convolutional layers extract features from the image, while the fully connected layers predict the output probabilities and coordinates.

Our network architecture is inspired by the GoogLeNet model for image classification [34]. Our network has 24 convolutional layers followed by 2 fully connected layers. Instead of the inception modules used by GoogLeNet, we simply use 1×1 reduction layers followed by 3×3 convolutional layers, similar to Lin et al [22]. The full network is shown in Figure 3.

The YOLO network borrows from the GoogLeNet classification architecture: 24 convolutional layers followed by 2 fully connected layers. Unlike GoogLeNet it does not use inception modules; instead it simply uses 1×1 reduction layers (for cross-channel information integration) followed by 3×3 convolutional layers, similar to Lin et al. The final output is the 7×7×30 prediction tensor. The full network is shown in Figure 3.

We also train a fast version of YOLO designed to push the boundaries of fast object detection. Fast YOLO uses a neural network with fewer convolutional layers (9 instead of 24) and fewer filters in those layers. Other than the size of the network, all training and testing parameters are the same between YOLO and Fast YOLO.

The authors also train a fast version of YOLO designed to push the boundaries of fast object detection. Fast YOLO uses a neural network with fewer convolutional layers (9 instead of 24) and fewer filters in those layers. Apart from the size of the network, all training and testing parameters are the same for YOLO and Fast YOLO.

Interpretation: The image is divided into 7×7 grid cells, and the cell containing an object's center is responsible for predicting that object. In the figure, the dog's center (red box) falls into the cell in row 5, column 2, so that cell is responsible for predicting the dog.

The last layer outputs a 7×7×30 tensor. Each 1×1×30 slice corresponds to one of the 7×7 cells of the original image and contains the class predictions and bounding-box predictions. Roughly speaking, the grid cell carries the class information, while the bounding boxes carry the coordinate information (and part of the class information, since the confidence is also class-related).

Each cell (one 1×1×30 slice) predicts the coordinates (x, y, w, h) of 2 bounding boxes (the yellow solid boxes in the figure): the center coordinates (x, y) are normalized to 0-1 relative to the cell, and w, h are normalized to 0-1 by the image width and height. Besides regressing its own position, each bounding box also predicts a confidence value, which carries two pieces of information: whether the box contains an object, and how accurate the box is, i.e. confidence = Pr(Object) × IOU(pred, truth).

If a ground-truth box (a manually labeled object) falls into a grid cell, the first term is 1, otherwise 0; the second term is the IOU between the predicted bounding box and the ground-truth box. So each bounding box predicts 5 values (x, y, w, h, confidence); the 2 boxes give 10 values, which occupy the first 10 of the 1×1×30 feature vector. Each cell also predicts the class information; the paper uses 20 classes.

With a 7×7 grid, each cell predicting 2 bounding boxes and 20 class probabilities, the output is 7 × 7 × (5 × 2 + 20). In general: S × S cells, each predicting B bounding boxes and C class probabilities, give an S × S × (5 × B + C) tensor. Note that the class information belongs to each cell, while the confidence belongs to each bounding box.

Figure 3: The Architecture. Our detection network has 24 convolutional layers followed by 2 fully connected layers. Alternating 1 × 1 convolutional layers reduce the features space from preceding layers. We pretrain the convolutional layers on the ImageNet classification task at half the resolution (224 × 224 input image) and then double the resolution for detection.

(Figure 3: The architecture. The detection network has 24 convolutional layers followed by 2 fully connected layers. Alternating 1×1 convolutional layers reduce the feature space from the preceding layers. The convolutional layers are pretrained on the ImageNet classification task at half the resolution (224×224 input image), and the resolution is then doubled for detection.)

The final output of our network is the 7 × 7 × 30 tensor of predictions.

The YOLO network borrows from the GoogLeNet classification architecture: 24 convolutional layers plus 2 fully connected layers.

In the parameters below the architecture figure, s-2 means a stride of 2. Three points to note:

When the network is pretrained on ImageNet, the input is 224 × 224; for the detection task, the input resolution is raised to 448 × 448;

The network uses many 1×1 convolutional layers to reduce the feature dimensionality;

The last convolutional layer outputs a (7, 7, 1024) tensor, which is flattened and passed through two fully connected layers acting as a linear regression; the output of the last fully connected layer is reshaped to (7, 7, 30), giving the predictions for the 2 box coordinates and 20 class probabilities per cell (PASCAL VOC).

Note: the layer-by-layer listing of YOLOv1 shown above makes the network structure more concrete.

Convolution and pooling output sizes

W: input size; F: kernel size; P: padding; S: stride.

Convolution output size: (W − F + 2P) / S + 1

Pooling output size: (W − F) / S + 1

In general: when F = 3, P = 1; when F = 5, P = 2; when F = 7, P = 3.

With these formulas we can verify the convolutions throughout the network:

Starting from a 448 × 448 × 3 input image and applying the two formulas layer by layer through the convolutional stages and the final two fully connected layers (the per-layer calculations were shown as figures in the original post),

the final output comes out to 7 × 7 × 30, so the verification checks out.
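The same verification can be scripted with the two formulas above; the sketch below only walks the first few layers and is an illustration, not the full 24-layer table:

```python
def conv_out(w, f, s, p):
    """Convolution output size: (W - F + 2P) / S + 1."""
    return (w - f + 2 * p) // s + 1

def pool_out(w, f, s):
    """Pooling output size: (W - F) / S + 1."""
    return (w - f) // s + 1

size = 448
size = conv_out(size, f=7, s=2, p=3)   # 7x7 conv, stride 2  -> 224
size = pool_out(size, f=2, s=2)        # 2x2 max pool        -> 112
size = conv_out(size, f=3, s=1, p=1)   # 3x3 conv            -> 112
size = pool_out(size, f=2, s=2)        # 2x2 max pool        -> 56
# Two more 2x2 pools and one 3x3 stride-2 conv later: 56 -> 28 -> 14 -> 7
print(size)                            # 56
```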

2.2 Training

We pretrain our convolutional layers on the ImageNet 1000-class competition dataset [30]. For pretraining we use the first 20 convolutional layers from Figure 3 followed by a average-pooling layer and a fully connected layer. We train this network for approximately a week and achieve a single crop top-5 accuracy of 88% on the ImageNet 2012 validation set, comparable to the GoogLeNet models in Caffe’s Model Zoo [24]. We use the Darknet framework for all training and inference [26].

The convolutional layers are pretrained on the ImageNet 1000-class competition dataset. For pretraining, the network consists of the first 20 convolutional layers from Figure 3 followed by an average-pooling layer and a fully connected layer (at this stage the input is 224×224). This network was trained for about a week and reaches a single-crop top-5 accuracy of 88% on the ImageNet 2012 validation set, comparable to the GoogLeNet models in Caffe's Model Zoo. All training and inference use the Darknet framework.

We then convert the model to perform detection. Ren et al. show that adding both convolutional and connected layers to pretrained networks can improve performance [29]. Following their example, we add four convolutional layers and two fully connected layers with randomly initialized weights. Detection often requires fine-grained visual information so we increase the input resolution of the network from 224 × 224 to 448 × 448.

Training the detection network: the model is then converted to perform detection. "Object Detection Networks on Convolutional Feature Maps" reports that adding convolutional and fully connected layers to a pretrained network can improve performance. Following that example, four convolutional layers and two fully connected layers with randomly initialized weights are added. Detection requires fine-grained visual information, so the network input is increased from 224×224 to 448×448.

Our final layer predicts both class probabilities and bounding box coordinates. We normalize the bounding box width and height by the image width and height so that they fall between 0 and 1. We parametrize the bounding box x and y coordinates to be offsets of a particular grid cell location so they are also bounded between 0 and 1.

We use a linear activation function for the final layer and all other layers use the following leaky rectified linear activation:

The final layer predicts both the class probabilities and the bounding-box coordinates. The bounding-box width and height are normalized by the image width and height so that they fall between 0 and 1, and the x and y coordinates are parametrized as offsets of a particular grid-cell location, so they are also bounded between 0 and 1. A linear activation function is used for the final layer, and all other layers use the following leaky rectified linear activation: φ(x) = x if x > 0, and 0.1x otherwise.

We optimize for sum-squared error in the output of our model. We use sum-squared error because it is easy to optimize, however it does not perfectly align with our goal of maximizing average precision. It weights localization error equally with classification error which may not be ideal. Also, in every image many grid cells do not contain any object. This pushes the “confidence” scores of those cells towards zero, often overpowering the gradient from cells that do contain objects. This can lead to model instability, causing training to diverge early on.

The model is optimized for sum-squared error in its output. Sum-squared error is used because it is easy to optimize, but it does not perfectly align with the goal of maximizing average precision: it weights localization error equally with classification error, which may not be ideal. Also, in every image many grid cells contain no object, which pushes the confidence scores of those cells toward zero, often overpowering the gradient from the cells that do contain objects. This can make the model unstable and cause training to diverge early on.

To remedy this, we increase the loss from bounding box coordinate predictions and decrease the loss from confidence predictions for boxes that don’t contain objects. We use two parameters, λcoord and λnoobj to accomplish this. We set λcoord = 5 and λnoobj = .5.

To remedy this, the loss from bounding-box coordinate predictions is increased and the loss from confidence predictions for boxes that contain no object is decreased, using two parameters λcoord and λnoobj, set to λcoord = 5 and λnoobj = 0.5.

Sum-squared error also equally weights errors in large boxes and small boxes. Our error metric should reflect that small deviations in large boxes matter less than in small boxes. To partially address this we predict the square root of the bounding box width and height instead of the width and height directly.

Sum-squared error also weights errors in large boxes and small boxes equally, whereas the error metric should reflect that small deviations matter less in large boxes than in small boxes. To partially address this, the network predicts the square root of the bounding-box width and height instead of the width and height directly.

YOLO predicts multiple bounding boxes per grid cell. At training time we only want one bounding box predictor to be responsible for each object. We assign one predictor to be “responsible” for predicting an object based on which prediction has the highest current IOU with the ground truth. This leads to specialization between the bounding box predictors. Each predictor gets better at predicting certain sizes, aspect ratios, or classes of object, improving overall recall.

YOLO predicts multiple bounding boxes per grid cell, but at training time we want only one bounding-box predictor to be responsible for each object (one object, one bbox). Concretely, the predictor whose box has the highest current IOU with the ground-truth box is assigned responsibility for that object. This is called bounding-box predictor specialization: each predictor becomes better at predicting certain sizes, aspect ratios, or classes of object, which improves the overall recall.

During training we optimize the following, multi-part loss function:

During training, the following multi-part loss function is optimized:
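For reference, the multi-part loss as given in the original paper (1_i^obj indicates that an object appears in cell i, and 1_ij^obj that the j-th box predictor in cell i is responsible for that object):

```latex
\lambda_{coord} \sum_{i=0}^{S^2}\sum_{j=0}^{B} \mathbb{1}^{obj}_{ij}
      \left[ (x_i-\hat{x}_i)^2 + (y_i-\hat{y}_i)^2 \right]
+ \lambda_{coord} \sum_{i=0}^{S^2}\sum_{j=0}^{B} \mathbb{1}^{obj}_{ij}
      \left[ (\sqrt{w_i}-\sqrt{\hat{w}_i})^2 + (\sqrt{h_i}-\sqrt{\hat{h}_i})^2 \right]
+ \sum_{i=0}^{S^2}\sum_{j=0}^{B} \mathbb{1}^{obj}_{ij} (C_i-\hat{C}_i)^2
+ \lambda_{noobj} \sum_{i=0}^{S^2}\sum_{j=0}^{B} \mathbb{1}^{noobj}_{ij} (C_i-\hat{C}_i)^2
+ \sum_{i=0}^{S^2} \mathbb{1}^{obj}_{i} \sum_{c \in classes} \left( p_i(c)-\hat{p}_i(c) \right)^2
```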

YOLO computes the loss as a sum of squared errors between the predictions and the ground truth. The loss has three parts:

localization loss -> coordinate error between the predicted boxes and the ground truth

classification loss -> class prediction error

confidence loss -> objectness of the box (whether it contains an object)

Parameter settings:

For the coordinate predictions, a larger loss weight λcoord is used, set to 5 for PASCAL VOC training (the coordinate terms of the loss).

For the confidence loss of boxes that contain no object, a small weight λnoobj is used, set to 0.5 for PASCAL VOC training (the no-object confidence term).

The confidence loss of boxes that do contain an object, and the classification loss, keep the normal weight of 1 (the object-confidence and class terms).

For boxes of different sizes, a small absolute deviation hurts the IOU of a small box far more than that of a large box, yet the sum-squared error loss penalizes the same deviation equally. To ease this, the authors use a trick: the loss is computed on the square roots of the box width and height rather than on the width and height themselves. On the square-root curve, a small box sits at a small horizontal value, so the same horizontal deviation produces a larger change on the loss axis than it does for a big box.

The loss is designed to balance the coordinate terms (x, y, w, h), the confidence term, and the classification term. Simply using a plain sum-squared error loss for everything has the following shortcomings:

a) treating the 8-dimensional localization error and the 20-dimensional classification error as equally important is clearly unreasonable;

b) if a grid cell contains no object (and most cells in an image do not), its box confidences are pushed toward 0; compared with the relatively few cells that do contain objects, this is overpowering and can make the network unstable or even diverge.

The remedies are:

(1) Put more emphasis on the 8-dimensional coordinate predictions by giving them a larger loss weight, set to 5 for PASCAL VOC training.

(2) Give the confidence loss of boxes without objects a small weight, set to 0.5 for PASCAL VOC training.

(3) Keep the normal weight of 1 for the confidence loss of boxes with objects and for the classification loss.

(4) Predict the square root of the box width and height instead of the raw width and height: the same deviation is much less tolerable for a small box than for a large one, while plain sum-squared error treats them identically. On the square-root scale, the same deviation of a small box produces a larger loss than that of a big box.

For example, suppose a large box has width 16 and a small box has width 4, and both are off by 4, giving 20 and 8 respectively (the small box's width has doubled). Both are off by the same distance, but the small box is clearly affected much more. Taking square roots makes this visible: the large box's deviation becomes √20 − √16 = 2√5 − 4 ≈ 0.4721, while the small box's deviation becomes √8 − √4 = 2√2 − 2 ≈ 0.8284. The small box's change is larger than the large box's, so the difference is better reflected; this is only an illustrative example.
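The arithmetic above can be checked directly (a trivial sketch):

```python
import math

# Large box: width 16 -> 20; small box: width 4 -> 8 (the same absolute error of 4).
print(math.sqrt(20) - math.sqrt(16))   # ≈ 0.4721: deviation the loss sees for the large box
print(math.sqrt(8) - math.sqrt(4))     # ≈ 0.8284: larger, so small boxes are penalized more
```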

Note that the loss function only penalizes classification error if an object is present in that grid cell (hence the conditional class probability discussed earlier). It also only penalizes bounding box coordinate error if that predictor is “responsible” for the ground truth box (i.e. has the highest IOU of any predictor in that grid cell).

The loss function only penalizes classification error if an object is present in that grid cell (hence the conditional class probabilities discussed earlier). It also only penalizes the bounding-box coordinate error of the predictor that is "responsible" for the ground-truth box, i.e. the predictor with the highest IOU with the ground truth among the boxes of that cell.

We train the network for about 135 epochs on the training and validation data sets from PASCAL VOC 2007 and 2012. When testing on 2012 we also include the VOC 2007 test data for training. Throughout training we use a batch size of 64, a momentum of 0.9 and a decay of 0.0005.

The network is trained for about 135 epochs on the training and validation sets of PASCAL VOC 2007 and 2012. When testing on VOC 2012, the VOC 2007 test data is also included in training. Throughout training, the batch size is 64, the momentum 0.9, and the weight decay 0.0005.

Our learning rate schedule is as follows: For the first epochs we slowly raise the learning rate from 10^-3 to 10^-2. If we start at a high learning rate our model often diverges due to unstable gradients. We continue training with 10^-2 for 75 epochs, then 10^-3 for 30 epochs, and finally 10^-4 for 30 epochs.

The learning-rate schedule is as follows: over the first epochs, the learning rate is raised slowly from 10^-3 to 10^-2 (starting directly at a high learning rate often makes the model diverge because of unstable gradients); training then continues at 10^-2 for 75 epochs, 10^-3 for 30 epochs, and finally 10^-4 for 30 epochs.
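A sketch of this schedule as a function of the epoch index. The paper only says the rate is raised slowly over the "first epochs", so the 10-epoch warm-up length below is my own assumption:

```python
def learning_rate(epoch, warmup_epochs=10):
    """Piecewise schedule following the paper's description; the warm-up length is assumed."""
    if epoch < warmup_epochs:                      # slowly raise 1e-3 -> 1e-2
        return 1e-3 + (1e-2 - 1e-3) * epoch / warmup_epochs
    if epoch < warmup_epochs + 75:                 # 75 epochs at 1e-2
        return 1e-2
    if epoch < warmup_epochs + 75 + 30:            # 30 epochs at 1e-3
        return 1e-3
    return 1e-4                                    # final 30 epochs at 1e-4

print([round(learning_rate(e), 4) for e in (0, 5, 50, 90, 130)])
# [0.001, 0.0055, 0.01, 0.001, 0.0001]
```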

To avoid overfitting we use dropout and extensive data augmentation. A dropout layer with rate = .5 after the first connected layer prevents co-adaptation between layers . For data augmentation we introduce random scaling and translations of up to 20% of the original image size. We also randomly adjust the exposure and saturation of the image by up to a factor of 1.5 in the HSV color space.

To avoid overfitting, dropout and extensive data augmentation are used. A dropout layer with rate 0.5 after the first fully connected layer prevents co-adaptation between layers. For data augmentation, random scaling and translations of up to 20% of the original image size are introduced, and the exposure and saturation of the image are randomly adjusted by up to a factor of 1.5 in HSV color space.

2.3. Inference

Just like in training, predicting detections for a test image only requires one network evaluation. On PASCAL VOC the network predicts 98 bounding boxes per image and class probabilities for each box. YOLO is extremely fast at test time since it only requires a single network evaluation, unlike classifier-based methods.

Just as in training, predicting detections for a test image requires only one network evaluation. On PASCAL VOC, the network predicts 98 bounding boxes per image and class probabilities for each box. Unlike classifier-based methods, YOLO needs only a single network evaluation, so it is extremely fast at test time.

The grid design enforces spatial diversity in the bounding box predictions. Often it is clear which grid cell an object falls in to and the network only predicts one box for each object. However, some large objects or objects near the border of multiple cells can be well localized by multiple cells. Non-maximal suppression can be used to fix these multiple detections. While not critical to performance as it is for R-CNN or DPM, non-maximal suppression adds 2-3% in mAP.

The grid design enforces spatial diversity in the bounding-box predictions. Usually it is clear which grid cell an object falls into, and the network predicts only one box per object. However, large objects, or objects near the border of multiple cells, may be localized by several cells. Non-maximum suppression (NMS) can be used to remove these duplicate detections; it adds 2-3% mAP, although it is not as critical to performance for YOLO as it is for DPM or R-CNN.

Interpretation: At test time, the trained model is used to predict bounding boxes and final scores. This differs from training: training compares predictions with labels and optimizes the loss so that the predicted boxes move closer and closer to the ground-truth boxes, with the goal of obtaining the best weights; at test time there are no labels, and the weights learned during training are used to predict the best bounding boxes.

Given an input image, the network divides it into an S × S grid exactly as during training. The class information predicted by each cell is multiplied by the confidence predicted for each bounding box, giving each bounding box a class-specific confidence score, i.e. the probability of the box containing a specific object combined with how well the box is located.

For a 448×448×3 input, the network outputs a 7×7×30 tensor. Because VOC has 20 classes, the 30 channels are 7×7×(2×(4+1)+20), so there are 7×7×2 confidences; and because each cell predicts only one object, each cell has one set of 20 class probabilities. For each of the 2 predicted boxes in a cell, the score for each class is the box's confidence multiplied by that cell's class probability, i.e. a 20×1 score vector per box (2×20 per cell). With 7×7 cells there are 7×7×2 = 98 predicted boxes in total, giving 98 vectors of 20 scores. This product accounts both for the class probability of each predicted box and for the box's confidence (how well it fits an object). Note that each box's confidence is only multiplied by the class probabilities of its own cell. Doing this for every cell gives the result shown below.

The yellow bars in the figure are the resulting 98 score vectors of length 20: each bar corresponds to one predicted bounding box and holds the 20 scores obtained by multiplying the cell's class predictions by the box's confidence.
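Continuing the decoding sketch from Section 2, the 98×20 score matrix can be computed roughly as follows (the channel layout of two boxes followed by the class probabilities is the same assumption as before):

```python
import numpy as np

S, B, C = 7, 2, 20
pred = np.random.rand(S, S, B * 5 + C)            # stand-in for the network output

confidences = pred[..., [4, 9]]                   # (7, 7, 2): confidence of each box
class_probs = pred[..., B * 5:]                   # (7, 7, 20): one set per cell

# class-specific score = Pr(Class_i | Object) * Pr(Object) * IOU
scores = confidences[..., :, None] * class_probs[..., None, :]   # (7, 7, 2, 20)
scores = scores.reshape(-1, C)                    # 98 boxes x 20 class scores
print(scores.shape)                               # (98, 20)
```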

The boxes are then handled class by class over the 20 classes. The steps are as follows (using the first class, say "dog", as an example):

A threshold is set, and any score below it is set to 0, filtering the scores of each class.

The boxes of each class are then sorted by score in descending order (illustrated here with the dog class).

After sorting, a large object may cover several grid cells, but each object should be predicted by only one box from one cell, so the box with the highest score must be chosen to be responsible for it (in the example, the dog covers many cells). This is where NMS comes in.

Non-maximum suppression: the suppression procedure is an iterate-traverse-eliminate process.

1. Sort all boxes by score and select the highest-scoring box.

2. Traverse the remaining boxes; any box whose overlap (IOU) with the current highest-scoring box exceeds a threshold is removed.

3. From the boxes not yet processed, select the next highest-scoring one and repeat the procedure.

Within one class, the box with the highest score is selected and compared against the highest-scoring boxes not yet processed in the same class; if the IOU of two boxes exceeds the threshold, they overlap too much and are considered to be responsible for the same object, so that class's score in the lower-scoring box is set to 0. Every box is compared in this way, and in the end only the boxes with non-zero scores are output. The same NMS procedure is applied to every other class.

Compute the IOU of the two boxes: if it is greater than 0.5 the boxes overlap too much, so the second box's score is set to 0; then the third box is compared with the first, the fourth with the first, and so on until the end. Next, the highest-scoring remaining non-zero box (other than the first) is processed in the same way, until all boxes are done; then the second row (the second object class) is processed.

To get an intuitive feel for NMS: in the figure, one of the predicted boxes has the highest dog score, 0.8, so it is selected as the top-scoring box and the other boxes are compared against it. The box with the second-highest score (bb20 in the figure) has an IOU with it above the threshold, so its score for this class is set to 0, and the same is done for the remaining boxes:

This continues until all boxes have been traversed. Non-maximum suppression (NMS) is not specific to YOLO; essentially every detection algorithm uses it. Its main purpose is to resolve the problem of one object being detected multiple times.
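A per-class NMS sketch matching the description above (it reuses the iou() helper from Section 2 and the 0.5 threshold from the example; the function name is my own):

```python
def nms_per_class(boxes, scores, iou_threshold=0.5):
    """boxes: list of (x1, y1, x2, y2); scores: scores of ONE class for those boxes.
    Returns the indices of the boxes that survive suppression."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    keep, suppressed = [], set()
    for i in order:
        if i in suppressed or scores[i] == 0:
            continue
        keep.append(i)
        for j in order:
            if j != i and j not in suppressed and iou(boxes[i], boxes[j]) > iou_threshold:
                suppressed.add(j)        # overlapping box of the same class: score zeroed
    return keep

boxes = [(0, 0, 100, 100), (10, 10, 110, 110), (200, 200, 300, 300)]
scores = [0.9, 0.8, 0.7]
print(nms_per_class(boxes, scores))      # [0, 2]: the heavily overlapping second box is dropped
```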

The final detection results are then output.

Final step: the 20 class scores of each output box still need to be ranked, because one box can predict only one object. This step handles the case where two or more objects' centers fall into the same grid cell: after NMS, two or more objects may both have their highest score in the same predicted box, which would mean one box predicting several objects. In that case the class scores are compared and the class with the higher score becomes the class of that box.

There is also the case where two or more objects of the same class fall into the same grid cell; then only one bounding box will be produced to predict them.

2.4. Limitations of YOLO

YOLO imposes strong spatial constraints on bounding box predictions since each grid cell only predicts two boxes and can only have one class. This spatial constraint limits the number of nearby objects that our model can predict. Our model struggles with small objects that appear in groups, such as flocks of birds.

YOLO imposes strong spatial constraints on the bounding-box predictions: each grid cell predicts only two boxes and can only have one class. As a result, YOLO does not work well for objects that are very close together (with centers falling in the same cell) or for small objects that appear in groups, such as flocks of birds.

Since our model learns to predict bounding boxes from data, it struggles to generalize to objects in new or unusual aspect ratios or configurations. Our model also uses relatively coarse features for predicting bounding boxes since our architecture has multiple downsampling layers from the input image.

Because the model learns to predict bounding boxes from data, it struggles to generalize to objects with new or unusual aspect ratios or configurations. The model also uses relatively coarse features to predict bounding boxes, since the architecture applies multiple downsampling layers to the input image.

Interpretation: generalization is weaker when test images contain objects of a familiar class with unusual aspect ratios or other uncommon configurations.

Finally, while we train on a loss function that approximates detection performance, our loss function treats errors the same in small bounding boxes versus large bounding boxes. A small error in a large box is generally benign but a small error in a small box has a much greater effect on IOU. Our main source of error is incorrect localizations.

Finally, although the loss function being trained approximates detection performance, it treats errors in small boxes the same as errors in large boxes. A small error in a large box is generally benign, but a small error in a small box has a much larger effect on IOU. The main source of error is incorrect localization.

Interpretation: because of how the loss is designed, localization error is the main factor limiting detection quality, and the handling of objects of different sizes still needs improvement.

3. Comparison to Other Detection Systems

Object detection is a core problem in computer vision. Detection pipelines generally start by extracting a set of robust features from input images (Haar [25], SIFT [23], HOG [4], convolutional features [6]). Then, classifiers [36, 21, 13, 10] or localizers [1, 32] are used to identify objects in the feature space. These classifiers or localizers are run either in sliding window fashion over the whole image or on some subset of regions in the image [35, 15, 39]. We compare the YOLO detection system to several top detection frameworks, highlighting key similarities and differences.

Object detection is a core problem in computer vision. Detection pipelines generally start by extracting a set of robust features from the input image (Haar [25], SIFT [23], HOG [4], convolutional features [6]); classifiers [36, 21, 13, 10] or localizers [1, 32] are then used to identify objects in the feature space, run either in a sliding-window fashion over the whole image or on some subset of regions [35, 15, 39]. The YOLO detection system is compared with several top detection frameworks, highlighting the key similarities and differences.

Deformable parts models. Deformable parts models (DPM) use a sliding window approach to object detection [10]. DPM uses a disjoint pipeline to extract static features,classify regions, predict bounding boxes for high scoring regions, etc. Our system replaces all of these disparate parts with a single convolutional neural network. The network performs feature extraction, bounding box prediction, non-maximal suppression, and contextual reasoning all concurrently. Instead of static features, the network trains the features in-line and optimizes them for the detection task. Our unified architecture leads to a faster, more accurate model than DPM.

Deformable parts models. Deformable parts models (DPM) use a sliding-window approach to object detection [10]. DPM uses a disjoint pipeline to extract static features, classify regions, predict bounding boxes for high-scoring regions, and so on. Our system replaces all of these disparate parts with a single convolutional neural network that performs feature extraction, bounding-box prediction, non-maximum suppression, and contextual reasoning concurrently. Instead of static features, the network trains the features in-line and optimizes them for the detection task. This unified architecture gives a faster, more accurate model than DPM.

R-CNN. R-CNN and its variants use region proposals instead of sliding windows to find objects in images. Selective Search [35] generates potential bounding boxes, a convolutional network extracts features, an SVM scores the boxes, a linear model adjusts the bounding boxes, and non-max suppression eliminates duplicate detections. Each stage of this complex pipeline must be precisely tuned independently and the resulting system is very slow, taking more than 40 seconds per image at test time [14].

R-CNN. R-CNN and its variants use region proposals instead of sliding windows to find objects in images. Selective Search [35] generates potential bounding boxes, a convolutional network extracts features, an SVM scores the boxes, a linear model adjusts the bounding boxes, and non-maximum suppression eliminates duplicate detections. Each stage of this complex pipeline must be precisely tuned independently, and the resulting system is very slow, taking more than 40 seconds per image at test time [14].

YOLO shares some similarities with R-CNN. Each grid cell proposes potential bounding boxes and scores those boxes using convolutional features. However, our system puts spatial constraints on the grid cell proposals which helps mitigate multiple detections of the same object. Our system also proposes far fewer bounding boxes, only 98 per image compared to about 2000 from Selective Search. Finally, our system combines these individual components into a single, jointly optimized model.

YOLO shares some similarities with R-CNN: each grid cell proposes potential bounding boxes and scores them using convolutional features. However, YOLO puts spatial constraints on the grid-cell proposals, which helps mitigate multiple detections of the same object. It also proposes far fewer bounding boxes, only 98 per image versus about 2000 from Selective Search. Finally, these individual components are combined into a single, jointly optimized model.

Other Fast Detectors. Fast and Faster R-CNN focus on speeding up the R-CNN framework by sharing computation and using neural networks to propose regions instead of Selective Search [14] [28]. While they offer speed and accuracy improvements over R-CNN, both still fall short of real-time performance.

Other fast detectors. Fast and Faster R-CNN focus on speeding up the R-CNN framework by sharing computation and by using a neural network to propose regions instead of Selective Search [14] [28]. Although they improve speed and accuracy over R-CNN, both still fall short of real-time performance.

Many research efforts focus on speeding up the DPM pipeline [31] [38] [5]. They speed up HOG computation, use cascades, and push computation to GPUs. However, only 30Hz DPM [31] actually runs in real-time.

Much research also focuses on speeding up the DPM pipeline [31] [38] [5], accelerating HOG computation, using cascades, and pushing computation to GPUs. However, only the 30 Hz DPM [31] actually runs in real time.

Instead of trying to optimize individual components of a large detection pipeline, YOLO throws out the pipeline entirely and is fast by design.

Rather than trying to optimize the individual components of a large detection pipeline, YOLO throws out the pipeline entirely and is fast by design.

Detectors for single classes like faces or people can be highly optimized since they have to deal with much less variation [37]. YOLO is a general purpose detector that learns to detect a variety of objects simultaneously.

Detectors for a single class, such as faces or people, can be highly optimized because they deal with far less variation [37]. YOLO is a general-purpose detector that learns to detect a variety of objects simultaneously.

Deep MultiBox. Unlike R-CNN, Szegedy et al. train a convolutional neural network to predict regions of interest[8] instead of using Selective Search. MultiBox can also perform single object detection by replacing the confidence prediction with a single class prediction. However, MultiBox cannot perform general object detection and is still just a piece in a larger detection pipeline, requiring further image patch classification. Both YOLO and MultiBox use a convolutional network to predict bounding boxes in an image but YOLO is a complete detection system.

Deep MultiBox. Unlike R-CNN, Szegedy et al. train a convolutional neural network to predict regions of interest [8] instead of using Selective Search. MultiBox can also perform single-object detection by replacing the confidence prediction with a single class prediction. However, MultiBox cannot perform general object detection and is still just one piece of a larger detection pipeline, requiring further image-patch classification. Both YOLO and MultiBox use a convolutional network to predict bounding boxes in an image, but YOLO is a complete detection system.

OverFeat. Sermanet et al. train a convolutional neural network to perform localization and adapt that localizer to perform detection [32]. OverFeat efficiently performs sliding window detection but it is still a disjoint system. OverFeat optimizes for localization, not detection performance.Like DPM, the localizer only sees local information when making a prediction. OverFeat cannot reason about global context and thus requires significant post-processing to produce coherent detections.

OverFeat. Sermanet et al. train a convolutional neural network to perform localization and adapt that localizer to perform detection [32]. OverFeat performs sliding-window detection efficiently, but it is still a disjoint system and optimizes for localization rather than detection performance. Like DPM, the localizer sees only local information when making a prediction; OverFeat cannot reason about global context and therefore requires significant post-processing to produce coherent detections.

MultiGrasp. Our work is similar in design to work on grasp detection by Redmon et al [27]. Our grid approach to bounding box prediction is based on the MultiGrasp system for regression to grasps. However, grasp detection is a much simpler task than object detection. MultiGrasp only needs to predict a single graspable region for an image containing one object. It doesn't have to estimate the size, location, or boundaries of the object or predict its class, only find a region suitable for grasping. YOLO predicts both bounding boxes and class probabilities for multiple objects of multiple classes in an image.

MultiGrasp. This work is similar in design to the grasp-detection work of Redmon et al. [27]; the grid approach to bounding-box prediction is based on the MultiGrasp system's regression to grasps. However, grasp detection is a much simpler task than object detection: MultiGrasp only needs to predict a single graspable region for an image containing one object, without estimating the object's size, location, boundaries, or class. YOLO predicts both bounding boxes and class probabilities for multiple objects of multiple classes in an image.

4. Experiments

First we compare YOLO with other real-time detection systems on PASCAL VOC 2007. To understand the differences between YOLO and R-CNN variants we explore the errors on VOC 2007 made by YOLO and Fast R-CNN, one of the highest performing versions of R-CNN [14]. Based on the different error profiles we show that YOLO can be used to rescore Fast R-CNN detections and reduce the errors from background false positives, giving a significant performance boost. We also present VOC 2012 results and compare mAP to current state-of-the-art methods. Finally, we show that YOLO generalizes to new domains better than other detectors on two artwork datasets.

First, YOLO is compared with other real-time detection systems on PASCAL VOC 2007. To understand the differences between YOLO and R-CNN variants, the errors made on VOC 2007 by YOLO and by Fast R-CNN (one of the highest-performing versions of R-CNN [14]) are analyzed. Based on the different error profiles, YOLO can be used to rescore Fast R-CNN detections and reduce errors from background false positives, giving a significant performance boost. VOC 2012 results are also presented, comparing mAP with current state-of-the-art methods. Finally, on two artwork datasets, YOLO is shown to generalize to new domains better than other detectors.

4.1. Comparison to Other Real-Time Systems

Many research efforts in object detection focus on making standard detection pipelines fast [5] [38] [31] [14] [17] [28]. However, only Sadeghi et al. actually produce a detection system that runs in real-time (30 frames per second or better) [31]. We compare YOLO to their GPU implementation of DPM which runs either at 30Hz or 100Hz. While the other efforts don't reach the real-time milestone we also compare their relative mAP and speed to examine the accuracy-performance tradeoffs available in object detection systems.

Much research in object detection focuses on making standard detection pipelines fast [5] [38] [31] [14] [17] [28], but only Sadeghi et al. actually produce a detection system that runs in real time (30 frames per second or better) [31]. YOLO is compared with their GPU implementation of DPM, which runs at either 30 Hz or 100 Hz. Although the other efforts do not reach the real-time milestone, their relative mAP and speed are also compared, to examine the accuracy-performance trade-offs available in object detection systems.

Fast YOLO is the fastest object detection method on PASCAL; as far as we know, it is the fastest extant object detector. With 52.7% mAP, it is more than twice as accurate as prior work on real-time detection. YOLO pushes mAP to 63.4% while still maintaining real-time performance.

Fast YOLO is the fastest object detection method on PASCAL; as far as we know, it is the fastest extant object detector. With 52.7% mAP, it is more than twice as accurate as prior work on real-time detection. YOLO pushes mAP to 63.4% while still maintaining real-time performance.

We also train YOLO using VGG-16. This model is more accurate but also significantly slower than YOLO. It is useful for comparison to other detection systems that rely on VGG-16 but since it is slower than real-time the rest of the paper focuses on our faster models.

YOLO is also trained with VGG-16. That model is more accurate but significantly slower than YOLO. It is useful for comparing against other detection systems that rely on VGG-16, but since it is slower than real time, the rest of the paper focuses on the faster models.

Fastest DPM effectively speeds up DPM without sacrificing much mAP but it still misses real-time performance by a factor of 2 [38]. It also is limited by DPM’s relatively low accuracy on detection compared to neural network approaches.

Fastest DPM effectively speeds up DPM without sacrificing much mAP, but it still misses real-time performance by a factor of 2 [38]. It is also limited by DPM's relatively low detection accuracy compared with neural-network approaches.

Table 1: Real-time systems on PASCAL VOC 2007, comparing the performance and speed of fast detectors. Fast YOLO is the fastest detector on record for PASCAL VOC detection and is still twice as accurate as any other real-time detector. YOLO is 10 mAP more accurate than the fast version while still well above real-time speed.

Fast YOLO is the fastest on the PASCAL dataset, reaching 155 FPS, while YOLO has the highest mAP among the real-time detectors.

4.2. VOC 2007 Error Analysis

The error profiles of YOLO and Fast R-CNN are compared, with the results shown in the figure.

The predictions are categorized as follows:

Correct: correct class and IOU > 0.5

Localization: correct class, 0.1 < IOU < 0.5

Similar: class is similar, IOU > 0.1

Other: class is wrong, IOU > 0.1

Background: IOU < 0.1 with any object


