Understanding the encoding parameter through the open() function


2023-09-13 13:24 | Source: compiled from the web

Contents: the three parameters of open(); open(file, mode, encoding); parameter meanings; how to determine the encoding of the file to open

open(file,mode,encoding)

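This post looks at three parameters of open(): file, mode and encoding. In Python 3, file and mode can be passed positionally, while encoding should be passed by keyword (for example encoding='utf-8'), because it is not the third positional parameter.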
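A minimal sketch of calling open() with these three parameters. It also shows why the encoding is given by keyword: in Python 3 the third positional parameter of open() is buffering, which expects an integer. The temporary file name is a hypothetical placeholder, not a file from this post.

```python
import os
import tempfile

# Hypothetical scratch file used only for this demonstration.
path = os.path.join(tempfile.mkdtemp(), "example.txt")

# file and mode are passed positionally, encoding by keyword.
with open(path, "w", encoding="utf-8") as f:
    f.write("first\n")

# Opening again with 'w' empties the existing file before writing.
with open(path, "w", encoding="utf-8") as f:
    f.write("hello\n")

# Passing the encoding as the third positional argument fills `buffering`
# instead, so the call is rejected with a TypeError.
try:
    open(path, "r", "utf-8")
    positional_ok = True
except TypeError:
    positional_ok = False

# mode defaults to 'r' when it is omitted.
with open(path, encoding="utf-8") as f:
    content = f.read()

print(positional_ok, repr(content))
```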

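Parameter meanings

file: the file to open. Once the working directory has been set to the folder that contains the file, the bare file name can be used as the argument. For example:

    import os

    os.chdir(r'F:\文本分析')
    file = 'data.csv'
    with open(file, mode='w', encoding='utf-8') as f:
        pass  # write to f here

mode: the mode in which the file is opened. Commonly used values are:

    'w' opens the file for writing only. Use this mode with care: if the file already exists, Python empties it before returning the file object. If the file does not exist, a new file is created.
    'r' opens the file read-only. This is the default when mode is not given.

encoding: the encoding of the file to open. If the encoding passed to open() does not match the file's actual encoding, reading the file can fail. For example:

    filename = 'data.csv'                    # this file is encoded in UTF-8
    open(filename, encoding='gbk').read()    # but it is decoded as GBK

Reading the file then produces an error like this:

    UnicodeDecodeError: 'gbk' codec can't decode byte 0xac in position 423: illegal multibyte sequence

The correct way to open it is:

    open(filename, encoding='utf-8')

If no encoding is specified when reading, the default is the encoding returned by locale.getpreferredencoding(), which depends on the platform and its locale settings.

How to determine the encoding of the file to open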
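To see why this question matters, the mismatch described above can be reproduced in a few lines. This is a minimal sketch: the temporary file and its contents are made up for illustration, and the default encoding printed at the end will differ from machine to machine.

```python
import locale
import os
import tempfile

# Hypothetical demo file (not the data.csv from the post).
path = os.path.join(tempfile.mkdtemp(), "demo.csv")
original = "中,1\n文,2\n"

# Write the file as UTF-8.
with open(path, mode="w", encoding="utf-8") as f:
    f.write(original)

# Read it back with the wrong codec. open() itself succeeds;
# the error is raised when the bytes are actually decoded.
try:
    with open(path, mode="r", encoding="gbk") as f:
        f.read()
    mismatch_error = None
except UnicodeDecodeError as exc:
    mismatch_error = exc

# Read it back with the matching codec.
with open(path, mode="r", encoding="utf-8") as f:
    text = f.read()

# The encoding open() falls back to when none is given.
default_encoding = locale.getpreferredencoding(False)

print("wrong codec :", mismatch_error)
print("right codec :", repr(text))
print("default     :", default_encoding)
```

Note that a mismatch does not always raise an error. Some byte sequences happen to be valid in the wrong codec too, in which case the text comes back garbled instead.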

One blog post suggests using the chardet.detect() method. Following that post, I tested it with the code below:

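    # -*- coding: utf-8 -*-
    import os

    import chardet  # third-party package, installed separately

    os.chdir(r'F:\文本分析')
    file = 'rg2.csv'
    with open(file, 'rb') as f:
        data = f.read()  # read raw bytes; no decoding happens here

    # detect() returns a dict with the guessed encoding and a confidence value
    f_charInfo = chardet.detect(data)
    print(f_charInfo)

However, the detected encoding was not very accurate in this test, so this method is not very reliable.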
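One likely reason a detector can miss is that the same bytes are often valid under more than one encoding, so any tool can only make an informed guess. The sketch below illustrates this with the standard library alone (it does not use chardet). The sample string and the candidate list are made up for the example.

```python
# Made-up sample text, stored as GBK bytes.
data = "中文编码".encode("gbk")

candidates = ["utf-8", "gbk", "gb18030", "big5", "latin-1"]

# Collect every candidate that decodes the bytes without raising.
decodable = {}
for name in candidates:
    try:
        decodable[name] = data.decode(name)
    except UnicodeDecodeError:
        pass  # this codec rejects the bytes

for name, text in decodable.items():
    print(f"{name:8} -> {text!r}")
```

Several codecs accept the bytes, but only some of them give back the original text. Decoding without an error is therefore a necessary check, not proof that the encoding is right.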

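If the encoding really cannot be determined, consider converting the CSV files to UTF-8. With only a few files, you can open each one in Notepad and save it again as UTF-8. With a large number of files, you can follow the approach from that blog post, whose code is as follows: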

    def handleEncoding(original_file, newfile):
        # newfile = original_file[0:original_file.rfind('.')] + '_copy.csv'
        with open(original_file, 'rb') as f:
            content = f.read()  # content is bytes, not str

        # Determine the source encoding: try each candidate in turn and
        # keep the first one that decodes the whole file without error.
        source_encoding = None
        for candidate in ('utf-8', 'gbk', 'gb2312', 'gb18030', 'big5', 'cp936'):
            try:
                content.decode(candidate)
            except UnicodeDecodeError:
                continue
            source_encoding = candidate
            break
        if source_encoding is None:
            raise ValueError('no candidate encoding could decode ' + original_file)

        # Read the file with the encoding found above and save it as UTF-8,
        # one block at a time. newline='' keeps the line endings unchanged.
        block_size = 4096
        with open(original_file, 'r', encoding=source_encoding, newline='') as f:
            with open(newfile, 'w', encoding='utf-8', newline='') as f2:
                while True:
                    content = f.read(block_size)
                    if not content:
                        break
                    f2.write(content)
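For the many-files case, the same trial-decoding idea can be wrapped in a loop over a folder. The following is a minimal sketch under stated assumptions: the names guess_encoding and convert_folder_to_utf8, the candidate list and the *.csv pattern are illustrative choices rather than part of the referenced post, and each file is assumed small enough to read into memory. It rewrites files in place, so try it on a copy of the data first.

```python
from pathlib import Path

# Assumed candidate order; adjust it to the data you expect.
CANDIDATES = ("utf-8", "gb18030", "big5")


def guess_encoding(raw, candidates=CANDIDATES):
    """Return the first candidate that decodes `raw` without error."""
    for name in candidates:
        try:
            raw.decode(name)
            return name
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings fits")


def convert_folder_to_utf8(folder, pattern="*.csv"):
    """Re-encode every matching file in `folder` as UTF-8, in place.

    Returns a dict mapping file name to the encoding that was assumed.
    """
    converted = {}
    for path in sorted(Path(folder).glob(pattern)):
        raw = path.read_bytes()
        source = guess_encoding(raw)
        if source != "utf-8":
            path.write_bytes(raw.decode(source).encode("utf-8"))
        converted[path.name] = source
    return converted
```

A call such as convert_folder_to_utf8(r'F:\文本分析') would then process every CSV file in that folder. The first candidate that decodes cleanly is still only a guess, so it is worth spot-checking a few converted files by eye.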

For a separate problem, garbled Chinese file names when working with files through open(), see that blog post.
