Conversation
Pull request overview
Provides adaptation scripts and monkey patches for running/validating unstructured capabilities (hi_res/YOLOX, etc.) on Ascend NPU, covering model loading, inference, and dependency stubbing.
Changes:
- Add NPU YOLOX inference adaptation and an unstructured_inference monkey patch (model loading, operator replacement, inference rewrite).
- Add a benchmark script and a launch script (environment variables, LD_PRELOAD, dependency mocks).
- Add an OCR-side adaptation module (function injection via a fake pytesseract interface) and an NPU fusion_result.json.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 11 comments.
| File | Description |
|---|---|
| runtime/ops/mapper/unstructured_npu/run.sh | Ascend NPU launch script: sets jemalloc/environment variables and runs the benchmark |
| runtime/ops/mapper/unstructured_npu/benchmark_npu.py | Benchmark entry point: deep-mocks dependencies, initializes the NPU, invokes the unstructured partition logic, and writes results to disk |
| runtime/ops/mapper/unstructured_npu/npu_adapter.py | Core adaptation: requests interception, LayoutElements replacement, YOLOX forward/decode/post-processing rewrite, and model loading |
| runtime/ops/mapper/unstructured_npu/ocr_npu_adapter.py | OCR-side isolated process (CPU PaddleOCR) plus a fake pytesseract module with injected functions |
| runtime/ops/mapper/unstructured_npu/fusion_result.json | Run artifact/debug info: records graph-fusion statistics |
```sh
export OMP_NUM_THREADS=1

# 6. Python path (includes the current directory and YOLOX)
export PYTHONPATH=$(pwd):$(pwd)/YOLOX-main:$PYTHONPATH
```
Under `set -u`, appending `$PYTHONPATH` directly fails when the variable is undefined and aborts the script. Use `${PYTHONPATH:-}` as a fallback, or run `PYTHONPATH=${PYTHONPATH:-}` before appending.

Suggested change:

```diff
-export PYTHONPATH=$(pwd):$(pwd)/YOLOX-main:$PYTHONPATH
+export PYTHONPATH="$(pwd):$(pwd)/YOLOX-main:${PYTHONPATH:-}"
```
```sh
# 3. Set LD_PRELOAD (overwrite-style assignment to prevent duplicates)
# Note: jemalloc must come first; libgomp second to fix the TLS issue
export LD_PRELOAD="$JEMALLOC:$GOMP"
```

The script only checks that jemalloc exists, but it depends just as strongly on `libgomp.so.1` being preloaded; if that file is missing, the dynamic loader errors out or behaves unpredictably. Validate that `$GOMP` also exists before setting `LD_PRELOAD`, and emit a clear error message.
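A guard along those lines, sketched under the assumption that `JEMALLOC` and `GOMP` hold the two library paths as in run.sh (`check_preload_lib` is a hypothetical helper name, not from the PR):

```shell
# Sketch: validate each library before adding it to LD_PRELOAD.
check_preload_lib() {
    if [ ! -f "$1" ]; then
        echo "ERROR: required preload library not found: $1" >&2
        return 1
    fi
    return 0
}

# Example: /bin/sh exists on any POSIX system; the second path does not.
check_preload_lib /bin/sh && echo "ok: /bin/sh"
check_preload_lib /nonexistent/libgomp.so.1 || echo "missing library detected"

# run.sh would then do, before exporting:
#   check_preload_lib "$JEMALLOC" || exit 1
#   check_preload_lib "$GOMP"     || exit 1
#   export LD_PRELOAD="$JEMALLOC:$GOMP"
```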
```python
if model_name in _NPU_MODEL_CACHE:
    return _NPU_MODEL_CACHE[model_name]

if os.path.exists("./yolox_l.pt"):
    model_path = "./yolox_l.pt"
else:
    model_path = "/mnt/nvme0n1/pjj-data/data/models/yolox_l.pt"

print(f"[NPU Adapter] Loading local model: {model_path}")

from unstructured_inference.models.yolox import UnstructuredYoloXModel
model = UnstructuredYoloXModel()
model.model_path = model_path
```
`npu_get_model` falls back to the hard-coded absolute local path `/mnt/nvme0n1/.../yolox_l.pt`, so on any other machine/container the model is guaranteed not to be found and loading fails outright. Prefer reading the model path from configuration or an environment variable, or route everything through the HuggingFace/cache directory, and explain in the error message how to configure it.
Suggested replacement:

```python
if model_name in _NPU_MODEL_CACHE:
    return _NPU_MODEL_CACHE[model_name]

# Resolve model path in a portable, configurable way.
# Priority:
#   1. NPU_YOLOX_MODEL_PATH (environment variable)
#   2. ./yolox_l.pt (current working directory)
#   3. ~/.cache/unstructured_npu/yolox_l.pt (user cache directory)
env_model_path = os.environ.get("NPU_YOLOX_MODEL_PATH")
cache_dir = os.path.join(os.path.expanduser("~"), ".cache", "unstructured_npu")
cache_model_path = os.path.join(cache_dir, "yolox_l.pt")

candidate_paths = []
if env_model_path:
    candidate_paths.append(env_model_path)
candidate_paths.append("./yolox_l.pt")
candidate_paths.append(cache_model_path)

model_path = None
for _path in candidate_paths:
    if _path and os.path.exists(_path):
        model_path = _path
        break

if model_path is None:
    raise FileNotFoundError(
        "[NPU Adapter] YOLOX model file not found.\n"
        "Searched locations:\n"
        f"  - NPU_YOLOX_MODEL_PATH={env_model_path!r}\n"
        "  - ./yolox_l.pt\n"
        f"  - {cache_model_path}\n\n"
        "Please either:\n"
        "  1. Set environment variable NPU_YOLOX_MODEL_PATH to the full path of yolox_l.pt, or\n"
        "  2. Place yolox_l.pt in the current working directory, or\n"
        "  3. Place yolox_l.pt under the cache directory shown above."
    )

print(f"[NPU Adapter] Loading local model: {model_path}")

from unstructured_inference.models.yolox import UnstructuredYoloXModel
model = UnstructuredYoloXModel()
model.model_path = model_path
```
```python
strides = []

for (hsize, wsize), stride in zip(self.hw, self.strides):
    yv, xv = torch.meshgrid([torch.arange(hsize), torch.arange(wsize)])
```
In newer PyTorch versions, `torch.meshgrid` should be given an explicit `indexing` argument (e.g. `indexing="ij"`); otherwise it emits a warning, and the default behavior may change in a future release. Add the `indexing` argument here for compatibility.

Suggested change:

```diff
-yv, xv = torch.meshgrid([torch.arange(hsize), torch.arange(wsize)])
+yv, xv = torch.meshgrid(torch.arange(hsize), torch.arange(wsize), indexing="ij")
```
```python
if os.path.exists("npu_adapter.py"):
    try:
        import npu_adapter
        logger.info("应用 YOLOX NPU 补丁...")  # "Applying YOLOX NPU patches..."
        npu_adapter.apply_patches()
    except Exception as e:
        logger.error(f"NPU 适配器加载失败: {e}")  # "NPU adapter failed to load"
        import traceback
        traceback.print_exc()
        sys.exit(1)
```
ocr_npu_adapter.py provides `apply_ocr_patch()` (which replaces pytesseract/unstructured_pytesseract), but benchmark_npu.py currently never calls that patch before importing unstructured, so the pytesseract-based OCR path still follows the original logic. If this PR is meant to cover OCR adaptation, call the patch explicitly in stage 4 (before importing unstructured) and provide a switch so it can be enabled on demand.
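One way to wire that in, sketched with a hypothetical `NPU_OCR_PATCH` environment switch (`apply_ocr_patch` is the function ocr_npu_adapter.py is said to provide; everything else here is illustrative):

```python
import os

def maybe_apply_ocr_patch():
    """Apply the pytesseract replacement only when explicitly enabled."""
    if os.environ.get("NPU_OCR_PATCH", "").strip().lower() not in ("1", "true", "yes"):
        return False
    # Imported lazily so the adapter is only loaded when the switch is on.
    import ocr_npu_adapter
    ocr_npu_adapter.apply_ocr_patch()  # replaces pytesseract/unstructured_pytesseract
    return True

# Stage 4 would then call this *before* importing unstructured:
#   maybe_apply_ocr_patch()
#   from unstructured.partition.auto import partition
```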
```python
try:
    elements, mode_desc = _extract_elements(file_path)
    logger.info(f"模式: {mode_desc}")  # "Mode: ..."
except Exception as e:
    logger.error(f"处理崩溃: {e}")  # "Processing crashed"
    import traceback
    traceback.print_exc()
    return

duration = time.time() - start_time

if not elements:
    logger.error("未提取到元素。")  # "No elements extracted."
    return
```
On failure, most paths in the script just do `logger.error(...); return`, so the process still exits with code 0 (e.g. `run_benchmark` returns after catching an exception, and `__main__` does not `sys.exit(1)` when the test file is missing). The parent run.sh/CI therefore cannot detect failures. In failure scenarios, call `sys.exit(1)` explicitly or re-raise the exception so the process returns a non-zero exit code.
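A sketch of that propagation; the function names mirror the PR, but the bodies are stand-ins for illustration:

```python
import sys

def run_benchmark(file_path):
    """Return a process exit status instead of swallowing failures."""
    try:
        elements = _extract_elements(file_path)  # stand-in for the real call
    except Exception as e:
        print(f"processing crashed: {e}", file=sys.stderr)
        return 1  # non-zero: crash
    if not elements:
        print("no elements extracted", file=sys.stderr)
        return 1  # non-zero: empty result
    return 0

def _extract_elements(file_path):
    # Stand-in extractor: fail when the test file is missing.
    if file_path is None:
        raise FileNotFoundError("test file not found")
    return ["element"]

# __main__ would then do: sys.exit(run_benchmark(path))
# so run.sh/CI sees a non-zero exit code on failure.
```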
```python
import importlib.util
import importlib.machinery
```

`importlib.util` is imported but never referenced in this file. Remove the unused import to avoid lint/static-analysis noise.

Suggested change:

```diff
-import importlib.util
 import importlib.machinery
```
```python
def safe_cat(tensors, dim=1):
    try:
        res = torch.cat(tensors, dim=dim)
        torch.npu.synchronize()
        return res
    except Exception:
        cpu_tensors = [t.cpu() for t in tensors]
        if not cpu_tensors: return torch.tensor([], device=tensors[0].device)
        return torch.cat(cpu_tensors, dim=dim).to(tensors[0].device)
```
When `tensors` is empty, `safe_cat`'s exception fallback accesses `tensors[0].device` and raises IndexError (the `torch.cat([])` in the try block also raises). Handle empty input at the top of the function, and allow the device to be passed in explicitly or inferred.
Suggested replacement:

```python
def safe_cat(tensors, dim=1, device=None):
    # Handle empty input first to avoid torch.cat([]) and tensors[0] access errors
    if not tensors:
        if device is not None:
            return torch.tensor([], device=device)
        return torch.tensor([])
    # For non-empty input, infer the device from the first tensor if not given
    if device is None:
        device = tensors[0].device
    try:
        res = torch.cat(tensors, dim=dim)
        torch.npu.synchronize()
        return res
    except Exception:
        cpu_tensors = [t.cpu() for t in tensors]
        return torch.cat(cpu_tensors, dim=dim).to(device)
```
```python
pass

return _orig_request(self, method, url, *args, **kwargs)

requests.Session.request = mocked_request
```
This globally monkey-patches `requests.Session.request` (taking effect immediately at module import), which affects every requests caller in the process and causes unpredictable network behavior that is hard to debug. Make the interception more local (e.g. use a custom Session/Adapter only in the code path that downloads the YOLOX weights, or gate it behind an environment-variable switch).
Suggested replacement:

```python
# Fall back to the original URL on exception
pass

return _orig_request(self, method, url, *args, **kwargs)

# Gate the global monkey patch of requests.Session.request behind an environment
# variable. It is only enabled when NPU_ADAPTER_PATCH_REQUESTS=1 / true / yes.
_enable_global_requests_patch = os.environ.get("NPU_ADAPTER_PATCH_REQUESTS", "").strip().lower() in (
    "1",
    "true",
    "yes",
)
if _enable_global_requests_patch:
    requests.Session.request = mocked_request
```
```python
import types
import torch
import torch_npu
import numpy as np
import requests
from torchvision.ops import nms
from requests.exceptions import ConnectionError
```
`types` and `ConnectionError` are imported but never referenced in this file, adding static-analysis noise. Remove the unused imports, or add the code paths that actually use them.
Suggested replacement:

```python
import torch
import torch_npu
import numpy as np
import requests
from torchvision.ops import nms
```
Adaptation of unstructured-io operators on NPU