YOLOv5 Jetson部署 TensorRT

您所在的位置：网站首页 › yolov5转tensorrt后精度降低 › YOLOv5 Jetson部署 TensorRT

YOLOv5 Jetson部署 TensorRT

2023-10-06 03:10| 来源: 网络整理| 查看: 265

用TensorRT部署YOLOv5，用的最多的方式就是通过： https://github.com/wang-xinyu/tensorrtx/tree/master/yolov5

步骤和教程都已经很详细了，我只是记录一下遇到的问题。

我的环境： Jetson Xavier NX Jetpack 4.5 aarch64 Ubuntu18.04 python 3.6.9 cuda 10.2 cudnn 8.0 tensorrt 7.1.3.0 opencv-python 4.5.5.64 torch 1.8.0 torchvision 0.9.0

在make的时候出现如下报错：在这里插入图片描述查了半天没找到原因，后来突然想起一篇部署文章中说好多错误是忘记加sudo导致的，我试了一下，真是这个原因……

在生成engine的时候出现如下报错：在这里插入图片描述如果yololayer.h中CLASS_NUM改过了，那就是模型不对。因为我很早就开始使用YOLOv5，模型版本有点乱，下载对应版本的模型就好了。生成engine之后，就可以用Python脚本yolov5_trt.py推理图片了。

这个脚本是推理一个文件夹中的所有图片，是个入门级的demo，对我使用来说有以下问题：

无法推理视频。每张图的框颜色是随机的，没有根据类别固定，且不支持中文标签。BATCH_SIZE>1的时候，比如BATCH_SIZE=2，那么测试的图片数必须是2的倍数，否则会报错。没明白为啥要不断创建新线程，我用不到多线程，去掉线程之后速度会更快。

我的需求是要能实时推理视频，以实现算法功能，所以画框和保存图片在实际使用中没有必要，反而拖慢速度。

我根据自己的需求改写了代码，用来推理视频或视频流。代码如下，有需要的同学自取：

""" An example that uses TensorRT's Python api to make inferences. """ import ctypes import os import shutil import random import sys import time import cv2 import numpy as np import pycuda.autoinit import pycuda.driver as cuda import tensorrt as trt import torch CONF_THRESH = 0.5 IOU_THRESHOLD = 0.4 class YoLov5TRT(object): """ description: A YOLOv5 class that warps TensorRT ops, preprocess and postprocess ops. """ def __init__(self, engine_file_path): # Create a Context on this device, self.ctx = cuda.Device(0).make_context() stream = cuda.Stream() TRT_LOGGER = trt.Logger(trt.Logger.INFO) runtime = trt.Runtime(TRT_LOGGER) # Deserialize the engine from file with open(engine_file_path, "rb") as f: engine = runtime.deserialize_cuda_engine(f.read()) context = engine.create_execution_context() host_inputs = [] cuda_inputs = [] host_outputs = [] cuda_outputs = [] bindings = [] for binding in engine: print('bingding:', binding, engine.get_binding_shape(binding)) size = trt.volume(engine.get_binding_shape(binding)) * engine.max_batch_size dtype = trt.nptype(engine.get_binding_dtype(binding)) # Allocate host and device buffers host_mem = cuda.pagelocked_empty(size, dtype) cuda_mem = cuda.mem_alloc(host_mem.nbytes) # Append the device buffer to device bindings. bindings.append(int(cuda_mem)) # Append to the appropriate list. if engine.binding_is_input(binding): self.input_w = engine.get_binding_shape(binding)[-1] self.input_h = engine.get_binding_shape(binding)[-2] host_inputs.append(host_mem) cuda_inputs.append(cuda_mem) else: host_outputs.append(host_mem) cuda_outputs.append(cuda_mem) # Store self.stream = stream self.context = context self.engine = engine self.host_inputs = host_inputs self.cuda_inputs = cuda_inputs self.host_outputs = host_outputs self.cuda_outputs = cuda_outputs self.bindings = bindings self.batch_size = engine.max_batch_size def infer(self, raw_image_generator): # Make self the active context, pushing it on top of the context stack. self.ctx.push() # Restore stream = self.stream context = self.context engine = self.engine host_inputs = self.host_inputs cuda_inputs = self.cuda_inputs host_outputs = self.host_outputs cuda_outputs = self.cuda_outputs bindings = self.bindings # Do image preprocess batch_image_raw = [] batch_origin_h = [] batch_origin_w = [] batch_input_image = np.empty(shape=[self.batch_size, 3, self.input_h, self.input_w]) for i, image_raw in enumerate(raw_image_generator): input_image, image_raw, origin_h, origin_w = self.preprocess_image(image_raw) batch_image_raw.append(image_raw) batch_origin_h.append(origin_h) batch_origin_w.append(origin_w) np.copyto(batch_input_image[i], input_image) batch_input_image = np.ascontiguousarray(batch_input_image) # Copy input image to host buffer np.copyto(host_inputs[0], batch_input_image.ravel()) start = time.time() # Transfer input data to the GPU. cuda.memcpy_htod_async(cuda_inputs[0], host_inputs[0], stream) # Run inference. context.execute_async(batch_size=self.batch_size, bindings=bindings, stream_handle=stream.handle) # Transfer predictions back from the GPU. cuda.memcpy_dtoh_async(host_outputs[0], cuda_outputs[0], stream) # Synchronize the stream stream.synchronize() end = time.time() # Remove any context from the top of the context stack, deactivating it. self.ctx.pop() # Here we use the first row of output in that batch_size = 1 output = host_outputs[0] result_boxes, result_scores, result_classid = None, None, None # Do postprocess for i in range(self.batch_size): result_boxes, result_scores, result_classid = self.post_process( output[i * 6001: (i + 1) * 6001], batch_origin_h[i], batch_origin_w[i] ) return result_boxes, result_scores, result_classid, end - start def destroy(self): # Remove any context from the top of the context stack, deactivating it. self.ctx.pop() def get_raw_image_zeros(self): """ description: Ready data for warmup """ for _ in range(self.batch_size): yield np.zeros([self.input_h, self.input_w, 3], dtype=np.uint8) def preprocess_image(self, raw_bgr_image): """ description: Convert BGR image to RGB, resize and pad it to target size, normalize to [0,1], transform to NCHW format. param: input_image_path: str, image path return: image: the processed image image_raw: the original image h: original height w: original width """ image_raw = raw_bgr_image h, w, c = image_raw.shape image = cv2.cvtColor(image_raw, cv2.COLOR_BGR2RGB) # Calculate widht and height and paddings r_w = self.input_w / w r_h = self.input_h / h if r_h > r_w: tw = self.input_w th = int(r_w * h) tx1 = tx2 = 0 ty1 = int((self.input_h - th) / 2) ty2 = self.input_h - th - ty1 else: tw = int(r_h * w) th = self.input_h tx1 = int((self.input_w - tw) / 2) tx2 = self.input_w - tw - tx1 ty1 = ty2 = 0 # Resize the image with long side while maintaining ratio image = cv2.resize(image, (tw, th)) # Pad the short side with (128,128,128) image = cv2.copyMakeBorder( image, ty1, ty2, tx1, tx2, cv2.BORDER_CONSTANT, None, (128, 128, 128) ) image = image.astype(np.float32) # Normalize to [0,1] image /= 255.0 # HWC to CHW format: image = np.transpose(image, [2, 0, 1]) # CHW to NCHW format image = np.expand_dims(image, axis=0) # Convert the image to row-major order, also known as "C order": image = np.ascontiguousarray(image) return image, image_raw, h, w def xywh2xyxy(self, origin_h, origin_w, x): """ description: Convert nx4 boxes from [x, y, w, h] to [x1, y1, x2, y2] where xy1=top-left, xy2=bottom-right param: origin_h: height of original image origin_w: width of original image x: A boxes numpy, each row is a box [center_x, center_y, w, h] return: y: A boxes numpy, each row is a box [x1, y1, x2, y2] """ y = np.zeros_like(x) r_w = self.input_w / origin_w r_h = self.input_h / origin_h if r_h > r_w: y[:, 0] = x[:, 0] - x[:, 2] / 2 y[:, 2] = x[:, 0] + x[:, 2] / 2 y[:, 1] = x[:, 1] - x[:, 3] / 2 - (self.input_h - r_w * origin_h) / 2 y[:, 3] = x[:, 1] + x[:, 3] / 2 - (self.input_h - r_w * origin_h) / 2 y /= r_w else: y[:, 0] = x[:, 0] - x[:, 2] / 2 - (self.input_w - r_h * origin_w) / 2 y[:, 2] = x[:, 0] + x[:, 2] / 2 - (self.input_w - r_h * origin_w) / 2 y[:, 1] = x[:, 1] - x[:, 3] / 2 y[:, 3] = x[:, 1] + x[:, 3] / 2 y /= r_h return y def post_process(self, output, origin_h, origin_w): """ description: postprocess the prediction param: output: A numpy likes [num_boxes,cx,cy,w,h,conf,cls_id, cx,cy,w,h,conf,cls_id, ...] origin_h: height of original image origin_w: width of original image return: result_boxes: finally boxes, a boxes numpy, each row is a box [x1, y1, x2, y2] result_scores: finally scores, a numpy, each element is the score correspoing to box result_classid: finally classid, a numpy, each element is the classid correspoing to box """ # Get the num of boxes detected num = int(output[0]) # Reshape to a two dimentional ndarray pred = np.reshape(output[1:], (-1, 6))[:num, :] # Do nms boxes = self.non_max_suppression(pred, origin_h, origin_w, conf_thres=CONF_THRESH, nms_thres=IOU_THRESHOLD) result_boxes = boxes[:, :4] if len(boxes) else np.array([]) result_scores = boxes[:, 4] if len(boxes) else np.array([]) result_classid = boxes[:, 5] if len(boxes) else np.array([]) return result_boxes, result_scores, result_classid def bbox_iou(self, box1, box2, x1y1x2y2=True): """ description: compute the IoU of two bounding boxes param: box1: A box coordinate (can be (x1, y1, x2, y2) or (x, y, w, h)) box2: A box coordinate (can be (x1, y1, x2, y2) or (x, y, w, h)) x1y1x2y2: select the coordinate format return: iou: computed iou """ if not x1y1x2y2: # Transform from center and width to exact coordinates b1_x1, b1_x2 = box1[:, 0] - box1[:, 2] / 2, box1[:, 0] + box1[:, 2] / 2 b1_y1, b1_y2 = box1[:, 1] - box1[:, 3] / 2, box1[:, 1] + box1[:, 3] / 2 b2_x1, b2_x2 = box2[:, 0] - box2[:, 2] / 2, box2[:, 0] + box2[:, 2] / 2 b2_y1, b2_y2 = box2[:, 1] - box2[:, 3] / 2, box2[:, 1] + box2[:, 3] / 2 else: # Get the coordinates of bounding boxes b1_x1, b1_y1, b1_x2, b1_y2 = box1[:, 0], box1[:, 1], box1[:, 2], box1[:, 3] b2_x1, b2_y1, b2_x2, b2_y2 = box2[:, 0], box2[:, 1], box2[:, 2], box2[:, 3] # Get the coordinates of the intersection rectangle inter_rect_x1 = np.maximum(b1_x1, b2_x1) inter_rect_y1 = np.maximum(b1_y1, b2_y1) inter_rect_x2 = np.minimum(b1_x2, b2_x2) inter_rect_y2 = np.minimum(b1_y2, b2_y2) # Intersection area inter_area = np.clip(inter_rect_x2 - inter_rect_x1 + 1, 0, None) * \ np.clip(inter_rect_y2 - inter_rect_y1 + 1, 0, None) # Union Area b1_area = (b1_x2 - b1_x1 + 1) * (b1_y2 - b1_y1 + 1) b2_area = (b2_x2 - b2_x1 + 1) * (b2_y2 - b2_y1 + 1) iou = inter_area / (b1_area + b2_area - inter_area + 1e-16) return iou def non_max_suppression(self, prediction, origin_h, origin_w, conf_thres=0.5, nms_thres=0.4): """ description: Removes detections with lower object confidence score than 'conf_thres' and performs Non-Maximum Suppression to further filter detections. param: prediction: detections, (x1, y1, x2, y2, conf, cls_id) origin_h: original image height origin_w: original image width conf_thres: a confidence threshold to filter detections nms_thres: a iou threshold to filter detections return: boxes: output after nms with the shape (x1, y1, x2, y2, conf, cls_id) """ # Get the boxes that score > CONF_THRESH boxes = prediction[prediction[:, 4] >= conf_thres] # Trandform bbox from [center_x, center_y, w, h] to [x1, y1, x2, y2] boxes[:, :4] = self.xywh2xyxy(origin_h, origin_w, boxes[:, :4]) # clip the coordinates boxes[:, 0] = np.clip(boxes[:, 0], 0, origin_w - 1) boxes[:, 2] = np.clip(boxes[:, 2], 0, origin_w - 1) boxes[:, 1] = np.clip(boxes[:, 1], 0, origin_h - 1) boxes[:, 3] = np.clip(boxes[:, 3], 0, origin_h - 1) # Object confidence confs = boxes[:, 4] # Sort by the confs boxes = boxes[np.argsort(-confs)] # Perform non-maximum suppression keep_boxes = [] while boxes.shape[0]: large_overlap = self.bbox_iou(np.expand_dims(boxes[0, :4], 0), boxes[:, :4]) > nms_thres label_match = boxes[0, -1] == boxes[:, -1] # Indices of boxes with lower confidence scores, large IOUs and matching labels invalid = large_overlap & label_match keep_boxes += [boxes[0]] boxes = boxes[~invalid] boxes = np.stack(keep_boxes, 0) if len(keep_boxes) else np.array([]) return boxes if __name__ == "__main__": # load custom plugin and engine engine_file_path = "build/device_17_5765_s.engine" PLUGIN_LIBRARY = "build/libmyplugins.so" source = 'samples/t1.mp4' if len(sys.argv) > 1: engine_file_path = sys.argv[1] if len(sys.argv) > 2: PLUGIN_LIBRARY = sys.argv[2] if len(sys.argv) > 3: source = sys.argv[3] assert source.lower().endswith('mp4') or source.lower().startswith(('rtsp://', 'rtmp://')) ctypes.CDLL(PLUGIN_LIBRARY) # load coco labels categories = ["person", "bicycle", "car", "motorcycle", "airplane", "bus", "train", "truck", "boat", "traffic light", "fire hydrant", "stop sign", "parking meter", "bench", "bird", "cat", "dog", "horse", "sheep", "cow", "elephant", "bear", "zebra", "giraffe", "backpack", "umbrella", "handbag", "tie", "suitcase", "frisbee", "skis", "snowboard", "sports ball", "kite", "baseball bat", "baseball glove", "skateboard", "surfboard", "tennis racket", "bottle", "wine glass", "cup", "fork", "knife", "spoon"] # a YoLov5TRT instance yolov5_wrapper = YoLov5TRT(engine_file_path) try: print('batch size is', yolov5_wrapper.batch_size) for i in range(10): result_boxes, result_scores, result_classid, use_time = yolov5_wrapper.infer(yolov5_wrapper.get_raw_image_zeros()) print('input->({},{}), time->{:.2f}ms'.format(yolov5_wrapper.input_h, yolov5_wrapper.input_w, use_time * 1000)) raw_image_generator = [] cap = cv2.VideoCapture(source) t1 = time.time() while cap.isOpened(): ret, frame = cap.read() if not ret: break if yolov5_wrapper.batch_size == 1: raw_image_generator = [frame] result_boxes, result_scores, result_classid, use_time = yolov5_wrapper.infer(raw_image_generator) print('input->{}\ttime->{:.2f}ms'.format(cap.get(cv2.CAP_PROP_POS_FRAMES), use_time * 1000)) else: if cap.get(cv2.CAP_PROP_POS_FRAMES) % yolov5_wrapper.batch_size == 0: raw_image_generator.append(frame) result_boxes, result_scores, result_classid, use_time = yolov5_wrapper.infer(raw_image_generator) print('input->{}\ttime->{:.2f}ms'.format(cap.get(cv2.CAP_PROP_POS_FRAMES), use_time * 1000)) raw_image_generator = [] else: raw_image_generator.append(frame) print('FPS:', cap.get(cv2.CAP_PROP_FRAME_COUNT) / (time.time() - t1)) cap.release() finally: # destroy the instance yolov5_wrapper.destroy()

新增了第三个命令行参数source作为视频路径或视频流地址，类别名categories根据自己的改，因为我反正不画图不展示也不保存，所以这个类别名只是用来打印结果的，写中文还是英文没区别。画图和保存的代码我都不要了，大家可以根据自己需求保留。

我用yolov5s.pt生成的engine，跑测试视频，没有画框和保存的情况下，帧率刚好超过25，基本实时了。在这里插入图片描述为了做比较，我这个代码也支持BATCH_SIZE>1的情况，即将每几帧图片一起推理，当然这个engine要重新生成。我测试了一下当BATCH_SIZE=2的时候：虽然看起来2帧的推理时间没有到原来的2倍，但整体速度反而变慢了。

------------------------------------------------------------------我是一条慵懒的分割线------------------------------------------------------------------------------------

YOLOv5最新的6.1版本增加了TensorRT的导出和推理，记录一下自己的使用过程。

我这台NX的cuda和tensorrt是内置的，主要是python、torch和torchvision要匹配且符合YOLOv5要求。

还是用yolov5s.pt训练得到的模型best.pt，yolov5-6.1下，使用命令：

python3 export.py --device 0 --weights best.pt --include engine

导出TensorRT的模型，会先转成onnx再转成engine。

第一次转的时候，内存不足直接退出了，我的NX内存是8G。在这里插入图片描述应该是模型太大了？我把export.py中的export_engine()的workspace参数从4改成2，还是失败，用了最小的yolov5n还是内存爆炸。没找到办法，我想是不是因为先转onnx已经占了一部分内存，如果把这两步拆开内存可能就够用了。于是我把代码中将onnx转engine的部分给简化了出来：

import tensorrt as trt logger = trt.Logger(trt.Logger.INFO) builder = trt.Builder(logger) config = builder.create_builder_config() config.max_workspace_size = 2

【本文地址】

YOLOv5 Jetson部署 TensorRT

YOLOv5 Jetson部署 TensorRT

今日新闻

推荐新闻