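aspose.words + docx: merging docx files and removing the aspose watermark
Background
For work I needed to merge several Word documents and then convert the result to HTML for display on the client, preserving the original style as far as possible. This article mainly addresses the following:
- Merging multiple Word documents (mainly by appending one document to another)
- Converting the merged document to an HTML file, keeping the English and Japanese fonts as they appear in Word and converting the images in the merged document to base64
- Because aspose is commercial software, removing the aspose watermark from the converted output without cracking the library, so that it can be used for free
Main tools to install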
- aspose-words (imported as aspose.words)
- python-docx
- docxcompose
- bs4 (the code below parses with features="lxml", so lxml is needed as well)
Main code
- Importing the packages
#! /usr/bin/env python3
# -*- coding: utf-8 -*-
# DESC: 1. Merge multiple docx files with python-docx and docxcompose
#       2. Convert docx to HTML with aspose
#       3. Add, delete, and modify HTML elements and content with bs4
import os
import re

import aspose.words as aw
import aspose.words.saving as saving
from bs4 import BeautifulSoup
from docx import Document
from docxcompose.composer import Composer
- Merging the Word documents
def merge_docx(docx_list: list, docx_merge_tar: str, docx_list_src: str) -> str:
    """
    Merge Word documents.
    For now the documents are only concatenated; no page breaks are inserted.
    """
    if len(docx_list) == 0:
        raise Exception("input is empty.")
    if len(docx_list) == 1:
        return os.path.join(docx_list_src, docx_list[0])
    # Use the first document as the base document
    base_docx = Document(os.path.join(docx_list_src, docx_list[0]))
    base_docx_composer = Composer(base_docx)
    # Append each remaining document to the base with composer.append
    for next_docx in docx_list[1:]:
        next_docx_path = os.path.join(docx_list_src, next_docx)
        base_docx_composer.append(Document(next_docx_path))
    base_docx_composer.save(docx_merge_tar)
    print("merge docx list ok.")
    return docx_merge_tar
- Converting Word to HTML
def aspose_convert_docx_html(docx_file_path: str, html_file_path: str) -> str:
    """
    Convert a Word document to HTML with aspose.words for Python.
    """
    docx = aw.Document(docx_file_path)
    # Set the conversion options
    save_options = saving.HtmlSaveOptions(aw.SaveFormat.HTML)
    # Embed images in the HTML as base64
    save_options.export_images_as_base64 = True
    docx.save(html_file_path, save_options)
    return html_file_path
- Removing the aspose watermark
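With export_images_as_base64 enabled, images are written into the HTML itself as base64 data rather than as separate files. The sketch below uses only the standard library to show what such an embedded image looks like as a data URI. The helpers to_data_uri and from_data_uri and the sample bytes are made up for illustration, and the exact markup aspose emits may differ.

```python
import base64


def to_data_uri(image_bytes: bytes, mime_type: str = "image/png") -> str:
    """Encode raw image bytes as a data URI usable in an img src attribute."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


def from_data_uri(data_uri: str) -> bytes:
    """Recover the raw bytes from a base64 data URI."""
    _header, _, payload = data_uri.partition(",")
    return base64.b64decode(payload)


# The 8-byte PNG signature only, not a complete image
sample_bytes = b"\x89PNG\r\n\x1a\n"
uri = to_data_uri(sample_bytes)
img_tag = f'<img src="{uri}" alt="inline image" />'
```

Because the image data travels inside the HTML, the page can be shipped to the client as a single file, at the cost of a larger document.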
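def del_aspose_elemet(html_tar_file: str, to_tar_file: str):
    """
    Remove the aspose information.
    """
    # Read the converted HTML; the with block closes the file afterwards
    with open(html_tar_file, "r", encoding="utf-8") as html_content:
        soup = BeautifulSoup(html_content, features="lxml")
    # Remove the elements whose style contains -aw-headerfooter-type:
    for tag in soup.find_all(style=re.compile("-aw-headerfooter-type:")):
        tag.extract()
    # Remove the "Evaluation Only" paragraph, if there is one
    word_key_tag = soup.find("p", string=re.compile("Evaluation Only"))
    if word_key_tag is not None:
        word_key_tag.extract()
    with open(to_tar_file, "w", encoding="utf-8") as f:
        f.write(soup.prettify())
Test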
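The same two selectors can be tried on an in-memory string before running them on a real converted file. This is a minimal sketch: SAMPLE_HTML is a hand-written stand-in for aspose output, not a real export, strip_marks is a hypothetical helper name, and it uses the built-in html.parser so that lxml is not required. Unlike the function above, it uses find_all for the notice paragraph, so it copes with zero or several matches.

```python
import re

from bs4 import BeautifulSoup

# Hand-written stand-in for converted output (assumed shape, not a real export)
SAMPLE_HTML = """
<html><body>
<div style="-aw-headerfooter-type:header-primary"><p>Header text</p></div>
<p><span>Evaluation Only. Created with Aspose.Words.</span></p>
<p>Real content</p>
</body></html>
"""


def strip_marks(html: str) -> str:
    """Remove header/footer blocks and evaluation notices from an HTML string."""
    soup = BeautifulSoup(html, "html.parser")
    # Elements whose style attribute carries the aspose header/footer marker
    for tag in soup.find_all(style=re.compile("-aw-headerfooter-type:")):
        tag.extract()
    # Paragraphs whose only text is the evaluation notice
    for tag in soup.find_all("p", string=re.compile("Evaluation Only")):
        tag.extract()
    return str(soup)


cleaned = strip_marks(SAMPLE_HTML)
```

After this runs, cleaned still holds the ordinary paragraph, while the marked div and the notice paragraph are gone.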
if __name__ == '__main__':
    docx_file_path = r"D:\merge_tar\demo.docx"
    html_file_path = r"D:\merge_tar\demo.html"
    aspose_convert_docx_html(docx_file_path, html_file_path)
    process_file_path = r"D:\merge_tar\demo_d.html"
    del_aspose_elemet(html_file_path, process_file_path)
Test results
- demo.docx
- Word converted to HTML by aspose
- After removing the aspose watermark
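Afterword
- aspose offers many options for the conversion; for details, see the demos in the aspose.words GitHub repository
- bs4 is very powerful for processing HTML
- This article mainly records the practical results of handling documents at work; if it is useful to you, so much the better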