文档在线预览(二)word、pdf、excel文件转html以实现文档在线预览

您所在的位置:网站首页 html文件怎么转pdf 文档在线预览(二)word、pdf、excel文件转html以实现文档在线预览

文档在线预览(二)word、pdf、excel文件转html以实现文档在线预览

2023-11-17 18:22| 来源: 网络整理| 查看: 265

文章目录 一、前言1、aspose2 、poi + pdfbox3 spire 二、将文件转换成html字符串1、将word文件转成html字符串1.1 使用aspose1.2 使用poi1.3 使用spire 2、将pdf文件转成html字符串2.1 使用aspose2.2 使用 poi + pbfbox2.3 使用spire 3、将excel文件转成html字符串3.1 使用aspose3.2 使用poi + pdfbox3.3 使用spire 三、将文件转换成html,并生成html文件FileUtils类将html字符串生成html文件示例: 1、将word文件转换成html文件1.1 使用aspose1.2 使用poi + pdfbox1.3 使用spire 2、将pdf文件转换成html文件2.1 使用aspose2.2 使用poi + pdfbox2.3 使用spire 3、将excel文件转换成html文件3.1 使用aspose3.2 使用poi3.3 使用spire 四、总结

一、前言 实现文档在线预览的方式除了上篇文章[《文档在线预览(一)通过将txt、word、pdf转成图片实现在线预览功能》](https://blog.csdn.net/q2qwert/article/details/130884607)说的将文档转成图片的实现方式外,还有转成pdf,前端通过pdf.js、pdfobject.js等插件来实现在线预览,以及本文将要说到的将文档转成html的方式来实现在线预览。

以下代码分别提供基于aspose、pdfbox、spire来实现来实现txt、word、pdf、ppt、word等文件转图片的需求。

1、aspose

Aspose 是一家致力于.Net ,Java,SharePoint,JasperReports和SSRS组件的提供商,数十个国家的数千机构都有用过aspose组件,创建、编辑、转换或渲染 Office、OpenOffice、PDF、图像、ZIP、CAD、XPS、EPS、PSD 和更多文件格式。注意aspose是商用组件,未经授权导出文件里面都是是水印(尊重版权,远离破解版)。

需要在项目的pom文件里添加如下依赖

com.aspose aspose-words 23.1 com.aspose aspose-pdf 23.1 com.aspose aspose-cells 23.1 com.aspose aspose-slides 23.1 2 、poi + pdfbox

因为aspose和spire虽然好用,但是都是是商用组件,所以这里也提供使用开源库操作的方式的方式。

POI是Apache软件基金会用Java编写的免费开源的跨平台的 Java API,Apache POI提供API给Java程序对Microsoft Office格式档案读和写的功能。

Apache PDFBox是一个开源Java库,支持PDF文档的开发和转换。 使用此库,您可以开发用于创建,转换和操作PDF文档的Java程序。

需要在项目的pom文件里添加如下依赖

org.apache.pdfbox pdfbox 2.0.4 org.apache.poi poi 5.2.0 org.apache.poi poi-ooxml 5.2.0 org.apache.poi poi-scratchpad 5.2.0 org.apache.poi poi-excelant 5.2.0 3 spire

spire一款专业的Office编程组件,涵盖了对Word、Excel、PPT、PDF等文件的读写、编辑、查看功能。spire提供免费版本,但是存在只能导出前3页以及只能导出前500行的限制,只要达到其一就会触发限制。需要超出前3页以及只能导出前500行的限制的这需要购买付费版(尊重版权,远离破解版)。这里使用免费版进行演示。

spire在添加pom之前还得先添加maven仓库来源

com.e-iceblue e-iceblue https://repo.e-iceblue.cn/repository/maven-public/

接着在项目的pom文件里添加如下依赖

免费版:

e-iceblue spire.office.free 5.3.1

付费版版:

e-iceblue spire.office 5.3.1 二、将文件转换成html字符串 1、将word文件转成html字符串 1.1 使用aspose public static String wordToHtmlStr(String wordPath) { try { Document doc = new Document(wordPath); // Address是将要被转化的word文档 String htmlStr = doc.toString(); return htmlStr; } catch (Exception e) { e.printStackTrace(); } return null; }

验证结果: 请添加图片描述

1.2 使用poi public String wordToHtmlStr(String wordPath) throws TransformerException, IOException, ParserConfigurationException { String htmlStr = null; String ext = wordPath.substring(wordPath.lastIndexOf(".")); if (ext.equals(".docx")) { htmlStr = word2007ToHtmlStr(wordPath); } else if (ext.equals(".doc")){ htmlStr = word2003ToHtmlStr(wordPath); } else { throw new RuntimeException("文件格式不正确"); } return htmlStr; } public String word2007ToHtmlStr(String wordPath) throws IOException { // 使用内存输出流 try(ByteArrayOutputStream out = new ByteArrayOutputStream()){ word2007ToHtmlOutputStream(wordPath, out); return out.toString(); } } private void word2007ToHtmlOutputStream(String wordPath,OutputStream out) throws IOException { ZipSecureFile.setMinInflateRatio(-1.0d); InputStream in = Files.newInputStream(Paths.get(wordPath)); XWPFDocument document = new XWPFDocument(in); XHTMLOptions options = XHTMLOptions.create().setIgnoreStylesIfUnused(false).setImageManager(new Base64EmbedImgManager()); // 使用内存输出流 XHTMLConverter.getInstance().convert(document, out, options); } private String word2003ToHtmlStr(String wordPath) throws TransformerException, IOException, ParserConfigurationException { org.w3c.dom.Document htmlDocument = word2003ToHtmlDocument(wordPath); // Transform document to string StringWriter writer = new StringWriter(); TransformerFactory tf = TransformerFactory.newInstance(); Transformer transformer = tf.newTransformer(); transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "no"); transformer.setOutputProperty(OutputKeys.METHOD, "html"); transformer.setOutputProperty(OutputKeys.INDENT, "yes"); transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-8"); transformer.transform(new DOMSource(htmlDocument), new StreamResult(writer)); return writer.toString(); } private org.w3c.dom.Document word2003ToHtmlDocument(String wordPath) throws IOException, ParserConfigurationException { InputStream input = Files.newInputStream(Paths.get(wordPath)); HWPFDocument wordDocument = new HWPFDocument(input); WordToHtmlConverter wordToHtmlConverter = new WordToHtmlConverter( DocumentBuilderFactory.newInstance().newDocumentBuilder() .newDocument()); wordToHtmlConverter.setPicturesManager((content, pictureType, suggestedName, widthInches, heightInches) -> { System.out.println(pictureType); if (PictureType.UNKNOWN.equals(pictureType)) { return null; } BufferedImage bufferedImage = ImgUtil.toImage(content); String base64Img = ImgUtil.toBase64(bufferedImage, pictureType.getExtension()); // 带图片的word,则将图片转为base64编码,保存在一个页面中 StringBuilder sb = (new StringBuilder(base64Img.length() + "data:;base64,".length()).append("data:;base64,").append(base64Img)); return sb.toString(); }); // 解析word文档 wordToHtmlConverter.processDocument(wordDocument); return wordToHtmlConverter.getDocument(); } 1.3 使用spire public String wordToHtmlStr(String wordPath) throws IOException { try(ByteArrayOutputStream outputStream = new ByteArrayOutputStream()) { Document document = new Document(); document.loadFromFile(wordPath); document.saveToFile(outputStream, FileFormat.Html); return outputStream.toString(); } } 2、将pdf文件转成html字符串 2.1 使用aspose public static String pdfToHtmlStr(String pdfPath) throws IOException, ParserConfigurationException { PDDocument document = PDDocument.load(new File(pdfPath)); Writer writer = new StringWriter(); new PDFDomTree().writeText(document, writer); writer.close(); document.close(); return writer.toString(); }

验证结果: 请添加图片描述

2.2 使用 poi + pbfbox public String pdfToHtmlStr(String pdfPath) throws IOException, ParserConfigurationException { PDDocument document = PDDocument.load(new File(pdfPath)); Writer writer = new StringWriter(); new PDFDomTree().writeText(document, writer); writer.close(); document.close(); return writer.toString(); } 2.3 使用spire public String pdfToHtmlStr(String pdfPath) throws IOException, ParserConfigurationException { try(ByteArrayOutputStream outputStream = new ByteArrayOutputStream()) { PdfDocument pdf = new PdfDocument(); pdf.loadFromFile(pdfPath); return outputStream.toString(); } } 3、将excel文件转成html字符串 3.1 使用aspose public static String excelToHtmlStr(String excelPath) throws Exception { FileInputStream fileInputStream = new FileInputStream(excelPath); Workbook workbook = new XSSFWorkbook(fileInputStream); DataFormatter dataFormatter = new DataFormatter(); FormulaEvaluator formulaEvaluator = workbook.getCreationHelper().createFormulaEvaluator(); Sheet sheet = workbook.getSheetAt(0); StringBuilder htmlStringBuilder = new StringBuilder(); htmlStringBuilder.append("Excel to HTML using Java and POI library"); htmlStringBuilder.append("table, th, td { border: 1px solid black; }"); htmlStringBuilder.append(""); for (Row row : sheet) { htmlStringBuilder.append(""); for (Cell cell : row) { CellType cellType = cell.getCellType(); if (cellType == CellType.FORMULA) { formulaEvaluator.evaluateFormulaCell(cell); cellType = cell.getCachedFormulaResultType(); } String cellValue = dataFormatter.formatCellValue(cell, formulaEvaluator); htmlStringBuilder.append("").append(cellValue).append(""); } htmlStringBuilder.append(""); } htmlStringBuilder.append(""); return htmlStringBuilder.toString(); }

返回的html字符串:

Excel to HTML using Java and POI librarytable, th, td { border: 1px solid black; }序号姓名性别联系方式地址1张晓玲女11111111111上海市浦东新区xx路xx弄xx号2王小二男1222222上海市浦东新区xx路xx弄xx号1张晓玲女11111111111上海市浦东新区xx路xx弄xx号2王小二男1222222上海市浦东新区xx路xx弄xx号1张晓玲女11111111111上海市浦东新区xx路xx弄xx号2王小二男1222222上海市浦东新区xx路xx弄xx号1张晓玲女11111111111上海市浦东新区xx路xx弄xx号2王小二男1222222上海市浦东新区xx路xx弄xx号1张晓玲女11111111111上海市浦东新区xx路xx弄xx号2王小二男1222222上海市浦东新区xx路xx弄xx号1张晓玲女11111111111上海市浦东新区xx路xx弄xx号2王小二男1222222上海市浦东新区xx路xx弄xx号1张晓玲女11111111111上海市浦东新区xx路xx弄xx号2王小二男1222222上海市浦东新区xx路xx弄xx号 3.2 使用poi + pdfbox public String excelToHtmlStr(String excelPath) throws Exception { FileInputStream fileInputStream = new FileInputStream(excelPath); try (Workbook workbook = WorkbookFactory.create(new File(excelPath))){ DataFormatter dataFormatter = new DataFormatter(); FormulaEvaluator formulaEvaluator = workbook.getCreationHelper().createFormulaEvaluator(); org.apache.poi.ss.usermodel.Sheet sheet = workbook.getSheetAt(0); StringBuilder htmlStringBuilder = new StringBuilder(); htmlStringBuilder.append("Excel to HTML using Java and POI library"); htmlStringBuilder.append("table, th, td { border: 1px solid black; }"); htmlStringBuilder.append(""); for (Row row : sheet) { htmlStringBuilder.append(""); for (Cell cell : row) { CellType cellType = cell.getCellType(); if (cellType == CellType.FORMULA) { formulaEvaluator.evaluateFormulaCell(cell); cellType = cell.getCachedFormulaResultType(); } String cellValue = dataFormatter.formatCellValue(cell, formulaEvaluator); htmlStringBuilder.append("").append(cellValue).append(""); } htmlStringBuilder.append(""); } htmlStringBuilder.append(""); return htmlStringBuilder.toString(); } } 3.3 使用spire public String excelToHtmlStr(String excelPath) throws Exception { try(ByteArrayOutputStream outputStream = new ByteArrayOutputStream()) { Workbook workbook = new Workbook(); workbook.loadFromFile(excelPath); workbook.saveToStream(outputStream, com.spire.xls.FileFormat.HTML); return outputStream.toString(); } } 三、将文件转换成html,并生成html文件

有时我们是需要的不仅仅返回html字符串,而是需要生成一个html文件这时应该怎么做呢?一个改动量小的做法就是使用org.apache.commons.io包下的FileUtils工具类写入目标地址:

FileUtils类将html字符串生成html文件示例:

首先需要引入pom:

commons-io commons-io 2.8.0

相关代码:

String htmlStr = FileConvertUtil.pdfToHtmlStr("D:\\书籍\\电子书\\小说\\历史小说\\最后的可汗.doc"); FileUtils.write(new File("D:\\test\\doc.html"), htmlStr, "utf-8");

除此之外,还可以对上面的代码进行一些调整,已实现生成html文件,代码调整如下:

1、将word文件转换成html文件

word原文件效果: 请添加图片描述

1.1 使用aspose public static void wordToHtml(String wordPath, String htmlPath) { try { File sourceFile = new File(wordPath); String path = htmlPath + File.separator + sourceFile.getName().substring(0, sourceFile.getName().lastIndexOf(".")) + ".html"; File file = new File(path); // 新建一个空白pdf文档 FileOutputStream os = new FileOutputStream(file); Document doc = new Document(wordPath); // Address是将要被转化的word文档 HtmlSaveOptions options = new HtmlSaveOptions(); options.setExportImagesAsBase64(true); options.setExportRelativeFontSize(true); doc.save(os, options); } catch (Exception e) { e.printStackTrace(); } }

转换成html的效果: 请添加图片描述

1.2 使用poi + pdfbox public void wordToHtml(String wordPath, String htmlPath) throws TransformerException, IOException, ParserConfigurationException { htmlPath = FileUtil.getNewFileFullPath(wordPath, htmlPath, "html"); String ext = wordPath.substring(wordPath.lastIndexOf(".")); if (ext.equals(".docx")) { word2007ToHtml(wordPath, htmlPath); } else if (ext.equals(".doc")){ word2003ToHtml(wordPath, htmlPath); } else { throw new RuntimeException("文件格式不正确"); } } public void word2007ToHtml(String wordPath, String htmlPath) throws TransformerException, IOException, ParserConfigurationException { //try(OutputStream out = Files.newOutputStream(Paths.get(path))){ try(FileOutputStream out = new FileOutputStream(htmlPath)){ word2007ToHtmlOutputStream(wordPath, out); } } private void word2007ToHtmlOutputStream(String wordPath,OutputStream out) throws IOException { ZipSecureFile.setMinInflateRatio(-1.0d); InputStream in = Files.newInputStream(Paths.get(wordPath)); XWPFDocument document = new XWPFDocument(in); XHTMLOptions options = XHTMLOptions.create().setIgnoreStylesIfUnused(false).setImageManager(new Base64EmbedImgManager()); // 使用内存输出流 XHTMLConverter.getInstance().convert(document, out, options); } public void word2003ToHtml(String wordPath, String htmlPath) throws TransformerException, IOException, ParserConfigurationException { org.w3c.dom.Document htmlDocument = word2003ToHtmlDocument(wordPath); // 生成html文件地址 try(OutputStream outStream = Files.newOutputStream(Paths.get(htmlPath))){ DOMSource domSource = new DOMSource(htmlDocument); StreamResult streamResult = new StreamResult(outStream); TransformerFactory factory = TransformerFactory.newInstance(); Transformer serializer = factory.newTransformer(); serializer.setOutputProperty(OutputKeys.ENCODING, "utf-8"); serializer.setOutputProperty(OutputKeys.INDENT, "yes"); serializer.setOutputProperty(OutputKeys.METHOD, "html"); serializer.transform(domSource, streamResult); } } private org.w3c.dom.Document word2003ToHtmlDocument(String wordPath) throws IOException, ParserConfigurationException { InputStream input = Files.newInputStream(Paths.get(wordPath)); HWPFDocument wordDocument = new HWPFDocument(input); WordToHtmlConverter wordToHtmlConverter = new WordToHtmlConverter( DocumentBuilderFactory.newInstance().newDocumentBuilder() .newDocument()); wordToHtmlConverter.setPicturesManager((content, pictureType, suggestedName, widthInches, heightInches) -> { System.out.println(pictureType); if (PictureType.UNKNOWN.equals(pictureType)) { return null; } BufferedImage bufferedImage = ImgUtil.toImage(content); String base64Img = ImgUtil.toBase64(bufferedImage, pictureType.getExtension()); // 带图片的word,则将图片转为base64编码,保存在一个页面中 StringBuilder sb = (new StringBuilder(base64Img.length() + "data:;base64,".length()).append("data:;base64,").append(base64Img)); return sb.toString(); }); // 解析word文档 wordToHtmlConverter.processDocument(wordDocument); return wordToHtmlConverter.getDocument(); }

转换成html的效果: 请添加图片描述

1.3 使用spire public void wordToHtml(String wordPath, String htmlPath) { htmlPath = FileUtil.getNewFileFullPath(wordPath, htmlPath, "html"); Document document = new Document(); document.loadFromFile(wordPath); document.saveToFile(htmlPath, FileFormat.Html); }

转换成html的效果:

因为使用的是免费版,存在页数和字数限制,需要完整功能的的可以选择付费版本。PS:这回76页的文档居然转成功了前50页。 请添加图片描述

2、将pdf文件转换成html文件

图片版pdf原文件效果: 请添加图片描述

文字版pdf原文件效果: 请添加图片描述

2.1 使用aspose public static void pdfToHtml(String pdfPath, String htmlPath) throws IOException, ParserConfigurationException { File file = new File(pdfPath); String path = htmlPath + File.separator + file.getName().substring(0, file.getName().lastIndexOf(".")) + ".html"; PDDocument document = PDDocument.load(new File(pdfPath)); Writer writer = new PrintWriter(path, "UTF-8"); new PDFDomTree().writeText(document, writer); writer.close(); document.close(); }

图片版PDF文件验证结果: 请添加图片描述

文字版PDF文件验证结果: 请添加图片描述

2.2 使用poi + pdfbox public void pdfToHtml(String pdfPath, String htmlPath) throws IOException, ParserConfigurationException { String path = FileUtil.getNewFileFullPath(pdfPath, htmlPath, "html"); PDDocument document = PDDocument.load(new File(pdfPath)); Writer writer = new PrintWriter(path, "UTF-8"); new PDFDomTree().writeText(document, writer); writer.close(); document.close(); }

图片版PDF文件验证结果: 请添加图片描述

文字版PDF原文件效果: 请添加图片描述

2.3 使用spire public void pdfToHtml(String pdfPath, String htmlPath) throws IOException, ParserConfigurationException { htmlPath = FileUtil.getNewFileFullPath(pdfPath, htmlPath, "html"); PdfDocument pdf = new PdfDocument(); pdf.loadFromFile(pdfPath); pdf.saveToFile(htmlPath, com.spire.pdf.FileFormat.HTML); }

图片版PDF文件验证结果: 因为使用的是免费版,所以只有前三页是正常的。。。有超过三页需求的可以选择付费版本。 请添加图片描述

文字版PDF原文件效果:

报错了无法转换。。。

java.lang.NullPointerException at com.spire.pdf.PdfPageWidget.spr┢⅛(Unknown Source) at com.spire.pdf.PdfPageWidget.getSize(Unknown Source) at com.spire.pdf.PdfPageBase.spr†™—(Unknown Source) at com.spire.pdf.PdfPageBase.getActualSize(Unknown Source) at com.spire.pdf.PdfPageBase.getSection(Unknown Source) at com.spire.pdf.general.PdfDestination.spr︻┎—(Unknown Source) at com.spire.pdf.general.PdfDestination.spr┻┑—(Unknown Source) at com.spire.pdf.general.PdfDestination.getElement(Unknown Source) at com.spire.pdf.primitives.PdfDictionary.setProperty(Unknown Source) at com.spire.pdf.bookmarks.PdfBookmark.setDestination(Unknown Source) at com.spire.pdf.bookmarks.PdfBookmarkWidget.spr┭┘—(Unknown Source) at com.spire.pdf.bookmarks.PdfBookmarkWidget.getDestination(Unknown Source) at com.spire.pdf.PdfDocumentBase.spr╻⅝(Unknown Source) at com.spire.pdf.widget.PdfPageCollection.spr┦⅝(Unknown Source) at com.spire.pdf.widget.PdfPageCollection.removeAt(Unknown Source) at com.spire.pdf.PdfDocumentBase.spr┞⅝(Unknown Source) at com.spire.pdf.PdfDocument.loadFromFile(Unknown Source) 3、将excel文件转换成html文件

excel原文件效果: 请添加图片描述

3.1 使用aspose public void excelToHtml(String excelPath, String htmlPath) throws Exception { htmlPath = FileUtil.getNewFileFullPath(excelPath, htmlPath, "html"); Workbook workbook = new Workbook(excelPath); com.aspose.cells.HtmlSaveOptions options = new com.aspose.cells.HtmlSaveOptions(); workbook.save(htmlPath, options); }

转换成html的效果: 请添加图片描述

3.2 使用poi public void excelToHtml(String excelPath, String htmlPath) throws Exception { String path = FileUtil.getNewFileFullPath(excelPath, htmlPath, "html"); try(FileOutputStream fileOutputStream = new FileOutputStream(path)){ String htmlStr = excelToHtmlStr(excelPath); byte[] bytes = htmlStr.getBytes(); fileOutputStream.write(bytes); } } public String excelToHtmlStr(String excelPath) throws Exception { FileInputStream fileInputStream = new FileInputStream(excelPath); try (Workbook workbook = WorkbookFactory.create(new File(excelPath))){ DataFormatter dataFormatter = new DataFormatter(); FormulaEvaluator formulaEvaluator = workbook.getCreationHelper().createFormulaEvaluator(); org.apache.poi.ss.usermodel.Sheet sheet = workbook.getSheetAt(0); StringBuilder htmlStringBuilder = new StringBuilder(); htmlStringBuilder.append("Excel to HTML using Java and POI library"); htmlStringBuilder.append("table, th, td { border: 1px solid black; }"); htmlStringBuilder.append(""); for (Row row : sheet) { htmlStringBuilder.append(""); for (Cell cell : row) { CellType cellType = cell.getCellType(); if (cellType == CellType.FORMULA) { formulaEvaluator.evaluateFormulaCell(cell); cellType = cell.getCachedFormulaResultType(); } String cellValue = dataFormatter.formatCellValue(cell, formulaEvaluator); htmlStringBuilder.append("").append(cellValue).append(""); } htmlStringBuilder.append(""); } htmlStringBuilder.append(""); return htmlStringBuilder.toString(); } }

转换成html的效果: 请添加图片描述

3.3 使用spire public void excelToHtml(String excelPath, String htmlPath) throws Exception { htmlPath = FileUtil.getNewFileFullPath(excelPath, htmlPath, "html"); Workbook workbook = new Workbook(); workbook.loadFromFile(excelPath); workbook.saveToFile(htmlPath, com.spire.xls.FileFormat.HTML); }

转换成html的效果: 请添加图片描述

四、总结

从上述的效果展示我们可以发现其实转成html效果不是太理想,很多细节样式没有还原,这其实是因为这类转换往往都是追求目标是通过使用文档中的语义信息并忽略其他细节来生成简单干净的 HTML,所以在转换过程中复杂样式被忽略,比如居中、首行缩进、字体,文本大小,颜色。举个例子在转换是 会将应用标题 1 样式的任何段落转换为 h1 元素,而不是尝试完全复制标题的样式。所以转成html的显示效果往往和原文档不太一样。这意味着对于较复杂的文档而言,这种转换不太可能是完美的。但如果都是只使用简单样式文档或者对文档样式不太关心的这种方式也不妨一试。

PS:如果想要展示效果好的话,其实可以将上篇文章《文档在线预览(一)通过将txt、word、pdf转成图片实现在线预览功能》说的内容和本文结合起来使用,即将文档里的内容都生成成图片(很可能是多张图片),然后将生成的图片全都放到一个html页面里 ,用html+css来保持样式并实现多张图片展示,再将html返回。开源组件kkfilevie就是用的就是这种做法。

kkfileview展示效果如下:

请添加图片描述

下图是kkfileview返回的html代码,从html代码我们可以看到kkfileview其实是将文件(txt文件除外)每页的内容都转成了图片,然后将这些图片都嵌入到一个html里,再返回给用户一个html页面。 请添加图片描述



【本文地址】


今日新闻


推荐新闻


CopyRight 2018-2019 办公设备维修网 版权所有 豫ICP备15022753号-3