Merging docx files with aspose.words + docx and removing the Aspose marks

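Background

For work I needed to merge several Word documents and convert the result to HTML for display on client devices, keeping the original style as far as possible. This post addresses three problems:

Merging multiple Word documents [mainly by appending one to another]
Converting the merged document to an HTML file, so that English and Japanese fonts render as they do in Word and images in the merged document are embedded as base64
Aspose is a commercial product, so to use it for free without cracking it, the Aspose marks are stripped from the converted output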

Installing the main tools

aspose-words
python-docx
docxcompose
bs4

Main code

Importing dependencies

#! /usr/bin/env python3
# -*- coding: utf-8 -*-
# DESC: 1. Merge multiple docx files based on docx
#       2. Convert docx to HTML based on aspose
#       3. Add, remove and modify HTML elements and content based on bs4

import os
import re
import aspose.words as aw
import aspose.words.saving as saving
from bs4 import BeautifulSoup
from docx import Document
from docxcompose.composer import Composer

Merging Word documents

def merge_docx(docx_list: list, docx_merge_tar: str, docx_list_src: str) -> str:
    """
    Merge Word documents
    For now the documents are simply appended; no page breaks or similar are inserted
    """
    if len(docx_list) == 0:
        raise ValueError("input is empty.")
    if len(docx_list) == 1:
        return os.path.join(docx_list_src, docx_list[0])
    # Use the first document as the base document
    base_docx = Document(os.path.join(docx_list_src, docx_list[0]))
    base_docx_composer = Composer(base_docx)
    # Append each remaining document to the base with composer.append
    for next_docx in docx_list[1:]:
        next_docx_path = os.path.join(docx_list_src, next_docx)
        base_docx_composer.append(Document(next_docx_path))
    base_docx_composer.save(docx_merge_tar)
    print("merge docx list ok.")
    return docx_merge_tar
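
merge_docx expects the caller to supply docx_list already in the desired order. As a minimal sketch of one way to build that list, assuming the files live directly in one directory and should be ordered by the numbers in their names, a helper could look like the following. The name list_docx_files is hypothetical and not part of the original code.

```python
import os
import re


def list_docx_files(src_dir: str) -> list:
    """Return the .docx file names in src_dir in natural (human) order."""

    def natural_key(name: str) -> list:
        # Split digit runs from text so that "2.docx" sorts before "10.docx".
        return [int(part) if part.isdigit() else part.lower()
                for part in re.split(r"(\d+)", name)]

    names = [
        name for name in os.listdir(src_dir)
        # Skip Word lock files such as "~$report.docx" and non-docx files.
        if name.lower().endswith(".docx") and not name.startswith("~$")
    ]
    return sorted(names, key=natural_key)
```

The result can be passed straight to merge_docx as docx_list, with the same directory as docx_list_src.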

Converting Word to HTML

def aspose_convert_docx_html(docx_file_path: str, html_file_path: str) -> str:
    """
    Convert a Word document to HTML with aspose.words for Python
    """
    docx = aw.Document(docx_file_path)
    # Set the conversion options
    save_options = saving.HtmlSaveOptions(aw.SaveFormat.HTML)
    # Embed images as base64
    save_options.export_images_as_base64 = True
    docx.save(html_file_path, save_options)
    return html_file_path
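
To confirm that the images really were embedded, the generated HTML can be inspected for img tags whose src is a data URI. The following is a rough sketch using only the standard library; the names ImageSourceCounter and count_image_sources are hypothetical, and the check only looks at the src prefix, not at whether the base64 payload is valid.

```python
from html.parser import HTMLParser


class ImageSourceCounter(HTMLParser):
    """Count <img> tags by whether their src is an inline data URI."""

    def __init__(self):
        super().__init__()
        self.inline = 0
        self.external = 0

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        src = dict(attrs).get("src") or ""
        if src.startswith("data:"):
            self.inline += 1
        else:
            self.external += 1


def count_image_sources(html_text: str) -> tuple:
    """Return (inline_count, external_count) for the <img> tags in html_text."""
    counter = ImageSourceCounter()
    counter.feed(html_text)
    counter.close()
    return counter.inline, counter.external
```

With export_images_as_base64 enabled, the external count for the converted file would be expected to be zero.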

Removing the Aspose marks

def del_aspose_elemet(html_tar_file: str, to_tar_file: str):
    """
    Remove the Aspose marks from the converted HTML
    """
    with open(html_tar_file, "r", encoding="utf-8") as html_content:
        soup = BeautifulSoup(html_content, features="lxml")
    # Remove elements whose style carries Aspose's -aw-headerfooter-type marker
    for tag in soup.find_all(style=re.compile("-aw-headerfooter-type:")):
        tag.extract()
    # Remove the evaluation notice; find() returns None when it is absent
    word_key_tag = soup.find("p", string=re.compile("Evaluation Only"))
    if word_key_tag is not None:
        word_key_tag.extract()

    with open(to_tar_file, "w", encoding="utf-8") as f:
        f.write(soup.prettify())
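
After the cleanup it can be useful to check that none of the targeted marks remain in the output. The sketch below searches the raw HTML text for the same two patterns the function above looks for. The name find_aspose_markers is hypothetical, and the pattern list is an assumption taken from this post; other Aspose versions may emit different text.

```python
import re

# Patterns taken from the cleanup function in this post (an assumption:
# other Aspose versions may use different markers).
ASPOSE_MARKERS = (
    re.compile(r"-aw-headerfooter-type:"),
    re.compile(r"Evaluation Only"),
)


def find_aspose_markers(html_text: str) -> list:
    """Return the marker patterns that still occur in html_text."""
    return [m.pattern for m in ASPOSE_MARKERS if m.search(html_text)]
```

An empty list means neither pattern was found; a non-empty list names the patterns that still need handling.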

Testing

if __name__ == '__main__':
    docx_file_path = r"D:\merge_tar\demo.docx"
    html_file_path = r"D:\merge_tar\demo.html"
    aspose_convert_docx_html(docx_file_path, html_file_path)

    process_file_path = r"D:\merge_tar\demo_d.html"
    del_aspose_elemet(html_file_path, process_file_path)
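
The test above starts from an existing demo.docx and does not exercise the merge step. As a sketch of how the three steps might be chained, the hypothetical helper below takes the step functions as parameters, so it can be tried without Aspose installed. The output file names are placeholders, not names from the original code.

```python
import os


def run_pipeline(docx_list, src_dir, out_dir, merge, convert, clean):
    """Merge, convert and clean in sequence; return the cleaned HTML path.

    merge, convert and clean are expected to have the signatures of
    merge_docx, aspose_convert_docx_html and del_aspose_elemet.
    """
    # Placeholder output names (assumptions for illustration only).
    merged_path = os.path.join(out_dir, "merged.docx")
    raw_html_path = os.path.join(out_dir, "merged.html")
    clean_html_path = os.path.join(out_dir, "merged_clean.html")

    # merge_docx returns the source path when only one file is given,
    # so use its return value rather than merged_path.
    docx_path = merge(docx_list, merged_path, src_dir)
    convert(docx_path, raw_html_path)
    clean(raw_html_path, clean_html_path)
    return clean_html_path
```

In the context of this post it would be called with merge_docx, aspose_convert_docx_html and del_aspose_elemet as the last three arguments.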

Test results

demo.docx

Word converted to HTML by Aspose

After removing the Aspose marks

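Postscript

Aspose offers many more save options for conversion; for details, see the demos in the aspose.words GitHub repository
bs4 is very powerful for processing HTML
This post mainly records what I learned from handling documents at work; if it is useful to you, all the better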
