【数据清洗】图像数据清洗之

您所在的位置：网站首页 › 数据采集清洗流程图片 › 【数据清洗】图像数据清洗之

【数据清洗】图像数据清洗之

2024-07-15 01:30| 来源: 网络整理| 查看: 265

目的：人工做数据清洗较为麻烦，而且费事费力没成绩，还拉拽整个项目的后腿。所以这里根据调研情况，分析尝试一下。

1.调研分析 (1) 百度EasyData

参考：百度大脑自己的csdn说明

有几点：

1）过滤无对应目标的图，如果有开源通用模型，人脸或者行人检测等，是可以解决的。如果是无，那前期也是需要自己训练个检测模型，然后做半自动的标注与筛选的。然后根据新加入的数据，继续迭代地优化模型。

2）去相似图片。这里的相似是特征相似度。如果有对应的分类提取特征的网络结构的话（有开源用开源，无开源就迭代训练），那么可以利用图像检索的方式获得图像特征向量，然后计算特征向量的cosine距离或者欧式距离等。设置较大的阈值，将较为冗余的图像筛选去掉。

3）去模糊图片。即去除图像质量不高的图片，猜测这个可以找网上的对于图像清晰度打分的这种模型或者是图像质量评估IQA的模型，进行筛选。初步判断不太需要自己去训练模型，但是呢，对于最后保留什么样质量的图像，这个是我们要自己写一些逻辑的。不过难度也不大。

(2) 华为ModelArts

参考：华为数据清洗服务

可以根据提供正样本与负样本，将数据筛选好。看了一下，里面也是有特征提取，相似度计算的。提到的主要有数据清洗聚类、异常检测、相似度计算、特征提取器等。教程写得很详细。

(3) 专利

还有一篇专利，关于根据图像特征相似度，做数据清洗的。https://patentimages.storage.googleapis.com/a0/89/06/563a1061fac7fe/CN107480203A.pdf

2.实践 import torch import torch.nn.functional as F import os import shutil import cv2 def compute_consine_similar(features, others): features = torch.from_numpy(features) others = torch.from_numpy(others) features = F.normalize(features, p=2, dim=1) others = F.normalize(others, p=2, dim=1) similar_m = torch.mm(features, others.t()) return similar_m.cpu().numpy() if __name__ == '__main__': test_path = "/train_prepare/" save_path = "/crops_remove/train_prepare/" if not os.path.exists(save_path): os.makedirs(save_path) feature_pre, feature_now = None, None img_name_pre, img_name_now = None, None img_pre, img_now = None, None net = Net() state_dict = torch.load(model_path, map_location=lambda storage, loc: storage)[ 'net_dict'] net.load_state_dict(state_dict) for folder in os.listdir(test_path): for img_name in sorted(os.listdir(test_path + folder)): img_path = test_path + folder + "/" + img_name print("img_path : {}".format(img_path)) img = cv2.imread(img_path) img = cv2.resize(img, (256, 128)) feature = net.eval(img) feature_pre, feature_now = feature_now, feature img_name_pre, img_name_now = img_name_now, img_name img_pre, img_now = img_now, img if feature_now is not None and feature_pre is not None: print(feature_now.shape) similar_m = compute_consine_similar(feature_now, feature_pre) print(img_name_now + " and " + img_name_pre + " have {} similar".format(similar_m[0][0])) imgs = np.hstack([img_now, img_pre]) # 展示多个 cv2.putText(imgs, str(similar_m[0][0]), (20,20), cv2.FONT_HERSHEY_SIMPLEX, 1, (0,255,0), 4) cv2.imshow("mutil_pic", imgs) # 等待关闭 if similar_m[0][0]

【本文地址】

【数据清洗】图像数据清洗之

【数据清洗】图像数据清洗之

今日新闻

推荐新闻