Converting doc/docx/wps documents to html/txt/docx and other formats on Linux with LibreOffice and Python


2023-08-16 02:48 | Source: compiled from the web

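Word document format conversion tools on Linux

I recently received a requirement to convert documents in several formats (.doc, .docx and .wps) into one uniform format: either everything to .docx, or directly to .html or .txt. After some research, I found the following tools: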

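win32com
python-docx
pydocx
…

There may be others, which I will not list here. These tools share a problem: the Python tools either cannot handle .doc (only .docx), or they only work on Windows (win32com, for example). Production environments are usually Linux, and switching to a Windows server causes a chain of follow-on problems, such as whether the other libraries remain compatible. Finding a way to handle .doc and .wps on Linux therefore matters a lot. Fortunately, I found one: LibreOffice.

How to use LibreOffice

First, run libreoffice --version on the command line to check whether the tool is already installed. If it is not, install LibreOffice first. Once it is installed, use a command like the one below to convert a document. For example, to convert a .doc document to txt:

    libreoffice --headless --convert-to txt path-to-your-doc.doc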
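The installation check described above can also be done from Python before any conversion is attempted. This is a minimal sketch, not part of the original article: the function names find_libreoffice and libreoffice_version are hypothetical, and trying soffice as a second binary name is an assumption about how LibreOffice is installed on your system.

```python
import shutil
import subprocess


def find_libreoffice(candidates=("libreoffice", "soffice")):
    """Return the full path of the first candidate found on PATH, or None.

    The candidate names are an assumption; adjust them to your installation.
    """
    for name in candidates:
        path = shutil.which(name)
        if path is not None:
            return path
    return None


def libreoffice_version(binary):
    """Run `<binary> --version` and return its output as a stripped string."""
    result = subprocess.run(
        [binary, "--version"],
        capture_output=True,
        text=True,
        check=True,
        timeout=30,
    )
    return result.stdout.strip()


if __name__ == "__main__":
    binary = find_libreoffice()
    if binary is None:
        print("LibreOffice was not found on PATH; install it first.")
    else:
        print(libreoffice_version(binary))
```

Checking with shutil.which first avoids starting a process only to discover that the binary is missing.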

You can also specify the output directory for the converted files, and pass doc/docx/wps files to LibreOffice in batches:

    libreoffice --headless --convert-to html --outdir /your/output/dir /your/doc_docx_wps/files/*.{docx,doc,wps}

Running the conversion from a Python script

There is nothing mysterious here; Python simply runs the command line:

    import os

    os.system("libreoffice --headless --convert-to txt path-to-your-doc.doc")
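A variant of the call above can use subprocess.run with an argument list instead of os.system, so that file names containing spaces need no shell quoting. The sketch below is an illustration under assumptions: the function name convert and its runner parameter are hypothetical (the latter exists only so the command can be inspected without LibreOffice installed), and the returned path assumes that LibreOffice names the output after the input file's stem plus the target extension.

```python
import subprocess
from pathlib import Path


def convert(src, target_format, outdir, binary="libreoffice",
            runner=subprocess.run, timeout=120):
    """Convert one document and return the path where the output is expected.

    `runner` defaults to subprocess.run; pass a stub to inspect the command.
    """
    src = Path(src)
    outdir = Path(outdir)
    cmd = [
        binary, "--headless",
        "--convert-to", target_format,
        "--outdir", str(outdir),
        str(src),
    ]
    # check=True raises CalledProcessError on a non-zero exit code.
    runner(cmd, check=True, timeout=timeout)
    # Assumption: the output is named "<input stem>.<extension>", where the
    # extension is the part of target_format before any ":filter" suffix.
    extension = target_format.split(":", 1)[0]
    return outdir / f"{src.stem}.{extension}"


if __name__ == "__main__":
    # Stub runner that only prints the command, so this demo runs anywhere.
    def show(cmd, **kwargs):
        print(" ".join(cmd))

    print(convert("path-to-your-doc.doc", "txt", "/your/output/dir", runner=show))
```

The exit code alone may not tell you whether a file was actually produced, so checking that the returned path exists after a real run is a reasonable extra step.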

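Of course, if converting files one at a time in a single process is too slow, you can also launch several conversions in parallel from Python: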

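    import glob
    import os
    import subprocess
    from multiprocessing.dummy import Pool  # thread pool; each thread starts one LibreOffice process


    def worker(fname, dstdir=os.path.expanduser("~")):
        # Run in dstdir: with no --outdir given, the output lands in the working directory.
        subprocess.call(
            ["libreoffice", "--headless", "--convert-to", "pdf", fname],
            cwd=dstdir,
        )


    pool = Pool()
    # Convert every .doc file in the home directory to PDF.
    pool.map(worker, glob.iglob(os.path.join(os.path.expanduser("~"), "*.doc")))

Other LibreOffice conversion features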

LibreOffice is actually very capable: it can also convert many other formats, such as xhtml, pdf, jpeg and png. The supported formats are listed below.

The following list of document formats are currently available:

bib - BibTeX [.bib]
doc - Microsoft Word 97/2000/XP [.doc]
doc6 - Microsoft Word 6.0 [.doc]
doc95 - Microsoft Word 95 [.doc]
docbook - DocBook [.xml]
docx - Microsoft Office Open XML [.docx]
docx7 - Microsoft Office Open XML [.docx]
fodt - OpenDocument Text (Flat XML) [.fodt]
html - HTML Document (OpenOffice.org Writer) [.html]
latex - LaTeX 2e [.ltx]
mediawiki - MediaWiki [.txt]
odt - ODF Text Document [.odt]
ooxml - Microsoft Office Open XML [.xml]
ott - Open Document Text [.ott]
pdb - AportisDoc (Palm) [.pdb]
pdf - Portable Document Format [.pdf]
psw - Pocket Word [.psw]
rtf - Rich Text Format [.rtf]
sdw - StarWriter 5.0 [.sdw]
sdw4 - StarWriter 4.0 [.sdw]
sdw3 - StarWriter 3.0 [.sdw]
stw - Open Office.org 1.0 Text Document Template [.stw]
sxw - Open Office.org 1.0 Text Document [.sxw]
text - Text Encoded [.txt]
txt - Text [.txt]
uot - Unified Office Format text [.uot]
vor - StarWriter 5.0 Template [.vor]
vor4 - StarWriter 4.0 Template [.vor]
vor3 - StarWriter 3.0 Template [.vor]
wps - Microsoft Works [.wps]
xhtml - XHTML Document [.html]

The following list of graphics formats are currently available:

bmp - Windows Bitmap [.bmp]
emf - Enhanced Metafile [.emf]
eps - Encapsulated PostScript [.eps]
fodg - OpenDocument Drawing (Flat XML) [.fodg]
gif - Graphics Interchange Format [.gif]
html - HTML Document (OpenOffice.org Draw) [.html]
jpg - Joint Photographic Experts Group [.jpg]
met - OS/2 Metafile [.met]
odd - OpenDocument Drawing [.odd]
otg - OpenDocument Drawing Template [.otg]
pbm - Portable Bitmap [.pbm]
pct - Mac Pict [.pct]
pdf - Portable Document Format [.pdf]
pgm - Portable Graymap [.pgm]
png - Portable Network Graphic [.png]
ppm - Portable Pixelmap [.ppm]
ras - Sun Raster Image [.ras]
std - OpenOffice.org 1.0 Drawing Template [.std]
svg - Scalable Vector Graphics [.svg]
svm - StarView Metafile [.svm]
swf - Macromedia Flash (SWF) [.swf]
sxd - OpenOffice.org 1.0 Drawing [.sxd]
sxd3 - StarDraw 3.0 [.sxd]
sxd5 - StarDraw 5.0 [.sxd]
sxw - StarOffice XML (Draw) [.sxw]
tiff - Tagged Image File Format [.tiff]
vor - StarDraw 5.0 Template [.vor]
vor3 - StarDraw 3.0 Template [.vor]
wmf - Windows Metafile [.wmf]
xhtml - XHTML [.xhtml]
xpm - X PixMap [.xpm]

The following list of presentation formats are currently available:

bmp - Windows Bitmap [.bmp]
emf - Enhanced Metafile [.emf]
eps - Encapsulated PostScript [.eps]
fodp - OpenDocument Presentation (Flat XML) [.fodp]
gif - Graphics Interchange Format [.gif]
html - HTML Document (OpenOffice.org Impress) [.html]
jpg - Joint Photographic Experts Group [.jpg]
met - OS/2 Metafile [.met]
odg - ODF Drawing (Impress) [.odg]
odp - ODF Presentation [.odp]
otp - ODF Presentation Template [.otp]
pbm - Portable Bitmap [.pbm]
pct - Mac Pict [.pct]
pdf - Portable Document Format [.pdf]
pgm - Portable Graymap [.pgm]
png - Portable Network Graphic [.png]
potm - Microsoft PowerPoint 2007/2010 XML Template [.potm]
pot - Microsoft PowerPoint 97/2000/XP Template [.pot]
ppm - Portable Pixelmap [.ppm]
pptx - Microsoft PowerPoint 2007/2010 XML [.pptx]
pps - Microsoft PowerPoint 97/2000/XP (Autoplay) [.pps]
ppt - Microsoft PowerPoint 97/2000/XP [.ppt]
pwp - PlaceWare [.pwp]
ras - Sun Raster Image [.ras]
sda - StarDraw 5.0 (OpenOffice.org Impress) [.sda]
sdd - StarImpress 5.0 [.sdd]
sdd3 - StarDraw 3.0 (OpenOffice.org Impress) [.sdd]
sdd4 - StarImpress 4.0 [.sdd]
sxd - OpenOffice.org 1.0 Drawing (OpenOffice.org Impress) [.sxd]
sti - OpenOffice.org 1.0 Presentation Template [.sti]
svg - Scalable Vector Graphics [.svg]
svm - StarView Metafile [.svm]
swf - Macromedia Flash (SWF) [.swf]
sxi - OpenOffice.org 1.0 Presentation [.sxi]
tiff - Tagged Image File Format [.tiff]
uop - Unified Office Format presentation [.uop]
vor - StarImpress 5.0 Template [.vor]
vor3 - StarDraw 3.0 Template (OpenOffice.org Impress) [.vor]
vor4 - StarImpress 4.0 Template [.vor]
vor5 - StarDraw 5.0 Template (OpenOffice.org Impress) [.vor]
wmf - Windows Metafile [.wmf]
xhtml - XHTML [.xml]
xpm - X PixMap [.xpm]

The following list of spreadsheet formats are currently available:

csv - Text CSV [.csv]
dbf - dBASE [.dbf]
dif - Data Interchange Format [.dif]
fods - OpenDocument Spreadsheet (Flat XML) [.fods]
html - HTML Document (OpenOffice.org Calc) [.html]
ods - ODF Spreadsheet [.ods]
ooxml - Microsoft Excel 2003 XML [.xml]
ots - ODF Spreadsheet Template [.ots]
pdf - Portable Document Format [.pdf]
pxl - Pocket Excel [.pxl]
sdc - StarCalc 5.0 [.sdc]
sdc4 - StarCalc 4.0 [.sdc]
sdc3 - StarCalc 3.0 [.sdc]
slk - SYLK [.slk]
stc - OpenOffice.org 1.0 Spreadsheet Template [.stc]
sxc - OpenOffice.org 1.0 Spreadsheet [.sxc]
uos - Unified Office Format spreadsheet [.uos]
vor3 - StarCalc 3.0 Template [.vor]
vor4 - StarCalc 4.0 Template [.vor]
vor - StarCalc 5.0 Template [.vor]
xhtml - XHTML [.xhtml]
xls - Microsoft Excel 97/2000/XP [.xls]
xls5 - Microsoft Excel 5.0 [.xls]
xls95 - Microsoft Excel 95 [.xls]
xlt - Microsoft Excel 97/2000/XP Template [.xlt]
xlt5 - Microsoft Excel 5.0 Template [.xlt]
xlt95 - Microsoft Excel 95 Template [.xlt]
xlsx - Microsoft Excel 2007/2010 XML [.xlsx]
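In the listing above, the format name and the file extension do not always match (for example, text and mediawiki both produce .txt), so a script may want to look the extension up rather than guess it. The sketch below parses a listing of this shape into a dictionary. It assumes the listing is available as text with one "name - description [.ext]" entry per line; the function name parse_format_listing is hypothetical.

```python
import re

# Group header, e.g. "The following list of document formats are currently available:"
GROUP_RE = re.compile(r"^The following list of (\w+) formats are currently available:$")
# Entry, e.g. "doc - Microsoft Word 97/2000/XP [.doc]"
ENTRY_RE = re.compile(r"^(\S+)\s+-\s+(.+?)\s+\[(\.\w+)\]$")


def parse_format_listing(text):
    """Return {group: {format_name: extension}} for a one-entry-per-line listing."""
    groups = {}
    current = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        header = GROUP_RE.match(line)
        if header:
            current = groups.setdefault(header.group(1), {})
            continue
        entry = ENTRY_RE.match(line)
        # Entries that appear before any group header are ignored.
        if entry and current is not None:
            name, _description, extension = entry.groups()
            current[name] = extension
    return groups


if __name__ == "__main__":
    sample = """
The following list of document formats are currently available:

doc - Microsoft Word 97/2000/XP [.doc]
text - Text Encoded [.txt]
txt - Text [.txt]

The following list of graphics formats are currently available:

png - Portable Network Graphic [.png]
"""
    formats = parse_format_listing(sample)
    print(formats["document"]["text"])
    print(sorted(formats["graphics"]))
```

Grouping by category matters because the same name (html, pdf, vor and others) appears in more than one group.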

