character-encoding - Reading a non-UTF-8 text file in Go


2024-07-11 10:14

I need to read a text file that is encoded in GBK. The standard library in the Go programming language assumes that all text is encoded in UTF-8.

How can I read files in other encodings?

Best answer

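Previously (as mentioned in an older answer), the "easy" way to do this involved third-party packages that needed cgo and wrapped the iconv library. That is undesirable for many reasons. Thankfully, for quite a while now there has been a superior all-Go way of doing this, using only packages provided by the Go authors (not in the main set of packages, but in the Go Sub-Repositories).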

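The golang.org/x/text/encoding package defines an interface for generic character encodings that can convert to and from UTF-8. The golang.org/x/text/encoding/simplifiedchinese sub-package provides the GB18030, GBK, and HZ-GB2312 encoding implementations.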

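Here is an example of reading and writing a GBK-encoded file. Note that the io.Reader and io.Writer perform the conversion "on the fly" as data is read or written.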

package main

import (
	"bufio"
	"fmt"
	"log"
	"os"

	"golang.org/x/text/encoding/simplifiedchinese"
	"golang.org/x/text/transform"
)

// Encoding to use. Since this implements the encoding.Encoding
// interface from golang.org/x/text/encoding you can trivially
// change this out for any of the other implemented encoders,
// e.g. `traditionalchinese.Big5`, `charmap.Windows1252`,
// `korean.EUCKR`, etc.
var enc = simplifiedchinese.GBK

func main() {
	const filename = "example_GBK_file"
	exampleWriteGBK(filename)
	exampleReadGBK(filename)
}

func exampleReadGBK(filename string) {
	// Read UTF-8 from a GBK encoded file.
	f, err := os.Open(filename)
	if err != nil {
		log.Fatal(err)
	}
	r := transform.NewReader(f, enc.NewDecoder())

	// Read converted UTF-8 from `r` as needed.
	// As an example we'll read line-by-line showing what was read:
	sc := bufio.NewScanner(r)
	for sc.Scan() {
		fmt.Printf("Read line: %s\n", sc.Bytes())
	}
	if err = sc.Err(); err != nil {
		log.Fatal(err)
	}

	if err = f.Close(); err != nil {
		log.Fatal(err)
	}
}

func exampleWriteGBK(filename string) {
	// Write UTF-8 to a GBK encoded file.
	f, err := os.Create(filename)
	if err != nil {
		log.Fatal(err)
	}
	w := transform.NewWriter(f, enc.NewEncoder())

	// Write UTF-8 to `w` as desired.
	// As an example we'll write some text from the Wikipedia
	// GBK page that includes Chinese.
	_, err = fmt.Fprintln(w,
		`In 1995, China National Information Technology Standardization
Technical Committee set down the Chinese Internal Code Specification
(Chinese: 汉字内码扩展规范(GBK); pinyin: Hànzì Nèimǎ Kuòzhǎn Guīfàn (GBK)),
Version 1.0, known as GBK 1.0, which is a slight extension of Codepage 936.
The newly added 95 characters were not found in GB 13000.1-1993, and were
provisionally assigned Unicode PUA code points.`)
	if err != nil {
		log.Fatal(err)
	}

	// Close the transform writer first so any buffered input is
	// flushed to `f`, then close the file itself.
	if err = w.Close(); err != nil {
		log.Fatal(err)
	}
	if err = f.Close(); err != nil {
		log.Fatal(err)
	}
}


Regarding character-encoding - reading a non-UTF-8 text file in Go, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/10277933/


