
unstructuredio-npu#456

Open
QianqiuerQS wants to merge 1 commit into ModelEngine-Group:main from QianqiuerQS:feat/unstructured-npu

Conversation

@QianqiuerQS

Adaptation of unstructuredio operators on NPU

Copilot AI review requested due to automatic review settings March 30, 2026 06:51

Copilot AI left a comment


Pull request overview

Adds adaptation scripts and monkey patches for running and validating unstructured capabilities (hi_res/YOLOX, etc.) in an Ascend NPU environment, covering model loading, inference, and dependency stubbing.

Changes:

  • Adds NPU YOLOX inference adaptation and unstructured_inference monkey patches (model loading, operator replacement, inference rewrites).
  • Adds a benchmark script and a launch script (environment variables, LD_PRELOAD, dependency mocks).
  • Adds an OCR-side adaptation module (via an injected pytesseract interface) and an NPU fusion result JSON.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 11 comments.

File | Description
runtime/ops/mapper/unstructured_npu/run.sh | Ascend NPU launch script: sets jemalloc/environment variables and runs the benchmark
runtime/ops/mapper/unstructured_npu/benchmark_npu.py | Benchmark entry point: deep-mocks dependencies, initializes the NPU, invokes unstructured partition logic, and writes results to disk
runtime/ops/mapper/unstructured_npu/npu_adapter.py | Core adaptation: requests interception, LayoutElements replacement, YOLOX forward/decode/post-processing rewrites, and model loading
runtime/ops/mapper/unstructured_npu/ocr_npu_adapter.py | OCR-side isolated process (CPU PaddleOCR) plus injection of a fake pytesseract module
runtime/ops/mapper/unstructured_npu/fusion_result.json | Run artifact/debug info: records graph-fusion statistics


export OMP_NUM_THREADS=1

# 6. Python path (includes the current directory and YOLOX)
export PYTHONPATH=$(pwd):$(pwd)/YOLOX-main:$PYTHONPATH

Copilot AI Mar 30, 2026


Under set -u, directly appending $PYTHONPATH raises an error and aborts the script when the variable is undefined. Consider falling back with ${PYTHONPATH:-}, or setting PYTHONPATH=${PYTHONPATH:-} before appending.

Suggested change
export PYTHONPATH=$(pwd):$(pwd)/YOLOX-main:$PYTHONPATH
export PYTHONPATH="$(pwd):$(pwd)/YOLOX-main:${PYTHONPATH:-}"

Comment on lines +23 to +26
# 3. Set LD_PRELOAD (overwrite rather than append, to avoid duplicates)
# Note: jemalloc must come first; libgomp second, to resolve TLS issues
export LD_PRELOAD="$JEMALLOC:$GOMP"

Copy link

Copilot AI Mar 30, 2026


The script only checks that jemalloc exists, but it equally depends on libgomp.so.1 being preloaded; if that file is missing, the dynamic loader will error out or behave unpredictably. Consider also verifying that $GOMP exists before setting LD_PRELOAD, and emitting a clear error message.
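A minimal sketch of such a guard for run.sh; the helper name build_ld_preload and its layout are assumptions, not code from the PR:

```shell
# Hypothetical helper: validate every preload library before building
# LD_PRELOAD, so a missing file fails fast with a clear message instead
# of triggering undefined dynamic-loader behavior.
build_ld_preload() {
    joined=""
    for lib in "$@"; do
        if [ ! -f "$lib" ]; then
            echo "ERROR: required preload library not found: $lib" >&2
            return 1
        fi
        # Join with ':' while preserving argument order (jemalloc first).
        if [ -z "$joined" ]; then joined="$lib"; else joined="$joined:$lib"; fi
    done
    printf '%s\n' "$joined"
}

# Usage in run.sh, keeping jemalloc first and libgomp second:
#   LD_PRELOAD="$(build_ld_preload "$JEMALLOC" "$GOMP")" || exit 1
#   export LD_PRELOAD
```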

Comment on lines +474 to +488

if model_name in _NPU_MODEL_CACHE:
    return _NPU_MODEL_CACHE[model_name]

if os.path.exists("./yolox_l.pt"):
    model_path = "./yolox_l.pt"
else:
    model_path = "/mnt/nvme0n1/pjj-data/data/models/yolox_l.pt"

print(f"[NPU Adapter] Loading local model: {model_path}")

from unstructured_inference.models.yolox import UnstructuredYoloXModel
model = UnstructuredYoloXModel()
model.model_path = model_path


Copilot AI Mar 30, 2026


npu_get_model uses a hard-coded absolute local path /mnt/nvme0n1/.../yolox_l.pt as a fallback, which is guaranteed to fail on other machines/containers where the model cannot be found. Consider reading the model path from configuration or an environment variable first, or going through the HuggingFace/cache directory uniformly, and explaining how to configure the path in the error message.

Suggested change

if model_name in _NPU_MODEL_CACHE:
    return _NPU_MODEL_CACHE[model_name]
if os.path.exists("./yolox_l.pt"):
    model_path = "./yolox_l.pt"
else:
    model_path = "/mnt/nvme0n1/pjj-data/data/models/yolox_l.pt"
print(f"[NPU Adapter] Loading local model: {model_path}")
from unstructured_inference.models.yolox import UnstructuredYoloXModel
model = UnstructuredYoloXModel()
model.model_path = model_path

if model_name in _NPU_MODEL_CACHE:
    return _NPU_MODEL_CACHE[model_name]
# Resolve model path in a portable, configurable way.
# Priority:
#   1. NPU_YOLOX_MODEL_PATH (environment variable)
#   2. ./yolox_l.pt (current working directory)
#   3. ~/.cache/unstructured_npu/yolox_l.pt (user cache directory)
env_model_path = os.environ.get("NPU_YOLOX_MODEL_PATH")
cache_dir = os.path.join(os.path.expanduser("~"), ".cache", "unstructured_npu")
cache_model_path = os.path.join(cache_dir, "yolox_l.pt")
candidate_paths = []
if env_model_path:
    candidate_paths.append(env_model_path)
candidate_paths.append("./yolox_l.pt")
candidate_paths.append(cache_model_path)
model_path = None
for _path in candidate_paths:
    if _path and os.path.exists(_path):
        model_path = _path
        break
if model_path is None:
    raise FileNotFoundError(
        "[NPU Adapter] YOLOX model file not found.\n"
        "Searched locations:\n"
        f"  - NPU_YOLOX_MODEL_PATH={env_model_path!r}\n"
        "  - ./yolox_l.pt\n"
        f"  - {cache_model_path}\n\n"
        "Please either:\n"
        "  1. Set environment variable NPU_YOLOX_MODEL_PATH to the full path of yolox_l.pt, or\n"
        "  2. Place yolox_l.pt in the current working directory, or\n"
        "  3. Place yolox_l.pt under the cache directory shown above."
    )
print(f"[NPU Adapter] Loading local model: {model_path}")
from unstructured_inference.models.yolox import UnstructuredYoloXModel
model = UnstructuredYoloXModel()
model.model_path = model_path

strides = []

for (hsize, wsize), stride in zip(self.hw, self.strides):
    yv, xv = torch.meshgrid([torch.arange(hsize), torch.arange(wsize)])

Copilot AI Mar 30, 2026


In newer PyTorch versions, torch.meshgrid should be called with an explicit indexing argument (e.g. indexing="ij"); otherwise it emits a warning, and future versions may change the default behavior. Consider adding the indexing parameter here for forward compatibility.

Suggested change
yv, xv = torch.meshgrid([torch.arange(hsize), torch.arange(wsize)])
yv, xv = torch.meshgrid(torch.arange(hsize), torch.arange(wsize), indexing="ij")

Comment on lines +134 to +143
if os.path.exists("npu_adapter.py"):
    try:
        import npu_adapter
        logger.info("Applying YOLOX NPU patches...")
        npu_adapter.apply_patches()
    except Exception as e:
        logger.error(f"NPU adapter failed to load: {e}")
        import traceback
        traceback.print_exc()
        sys.exit(1)

Copilot AI Mar 30, 2026


ocr_npu_adapter.py 提供了 apply_ocr_patch()(替换 pytesseract/unstructured_pytesseract),但当前 benchmark_npu.py 中在导入 unstructured 前并未调用该 patch,因此基于 pytesseract 的 OCR 路径仍会按原逻辑执行。若该 PR 目标包含 OCR 适配,建议在阶段 4(导入 unstructured 之前)显式调用该 patch,并提供开关以便按需启用。
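A hedged sketch of such wiring; the environment-variable name NPU_ENABLE_OCR_PATCH is hypothetical, and the module/function names ocr_npu_adapter.apply_ocr_patch come from this review rather than verified PR code:

```python
# Hypothetical opt-in wrapper for benchmark_npu.py: apply the OCR patch
# before any `import unstructured`, gated by an environment switch.
import os

def maybe_apply_ocr_patch():
    """Apply the pytesseract replacement patch only when explicitly enabled."""
    if os.environ.get("NPU_ENABLE_OCR_PATCH", "").strip().lower() not in ("1", "true", "yes"):
        return False
    import ocr_npu_adapter  # module added by this PR
    ocr_npu_adapter.apply_ocr_patch()
    return True

# In benchmark_npu.py stage 4, this would run first:
#   maybe_apply_ocr_patch()
#   import unstructured  # OCR path now sees the injected pytesseract interface
```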

Comment on lines +252 to +266
try:
    elements, mode_desc = _extract_elements(file_path)
    logger.info(f"Mode: {mode_desc}")
except Exception as e:
    logger.error(f"Processing crashed: {e}")
    import traceback
    traceback.print_exc()
    return

duration = time.time() - start_time

if not elements:
    logger.error("No elements extracted.")
    return


Copilot AI Mar 30, 2026


On failure, most code paths in the script just do logger.error(...); return, so the process still exits with code 0 (for example, run_benchmark returns after catching an exception, and __main__ does not sys.exit(1) when the test file is missing). As a result, the calling run.sh/CI cannot detect failures. Consider explicitly calling sys.exit(1) or re-raising the exception in failure scenarios so a non-zero exit code is returned.
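One way to sketch this; the helper name run_or_fail is hypothetical and not part of the PR:

```python
# Hypothetical helper: map benchmark failures to a non-zero exit code
# that run.sh/CI can detect, instead of logging and returning with 0.
import sys

def run_or_fail(fn, *args, **kwargs):
    """Return 0 on success, 1 on exception or an empty result."""
    try:
        result = fn(*args, **kwargs)
    except Exception as e:
        print(f"benchmark crashed: {e}", file=sys.stderr)
        return 1
    if not result:
        print("no elements extracted", file=sys.stderr)
        return 1
    return 0

# The entry point would then become:
#   sys.exit(run_or_fail(run_benchmark, file_path))
```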

Comment on lines +11 to +12
import importlib.util
import importlib.machinery

Copilot AI Mar 30, 2026


importlib.util is unused in this file (imported but never referenced). Consider removing the unused import to avoid lint/static-analysis noise.

Suggested change
import importlib.util
import importlib.machinery

Comment on lines +151 to +159
def safe_cat(tensors, dim=1):
    try:
        res = torch.cat(tensors, dim=dim)
        torch.npu.synchronize()
        return res
    except Exception:
        cpu_tensors = [t.cpu() for t in tensors]
        if not cpu_tensors: return torch.tensor([], device=tensors[0].device)
        return torch.cat(cpu_tensors, dim=dim).to(tensors[0].device)

Copilot AI Mar 30, 2026


The exception fallback branch of safe_cat accesses tensors[0].device when tensors is empty, which raises an IndexError directly (torch.cat([]) in the try block also raises). Consider handling empty input at the top of the function and allowing the device to be passed in explicitly or inferred.

Suggested change

def safe_cat(tensors, dim=1):
    try:
        res = torch.cat(tensors, dim=dim)
        torch.npu.synchronize()
        return res
    except Exception:
        cpu_tensors = [t.cpu() for t in tensors]
        if not cpu_tensors: return torch.tensor([], device=tensors[0].device)
        return torch.cat(cpu_tensors, dim=dim).to(tensors[0].device)

def safe_cat(tensors, dim=1, device=None):
    # Handle empty input up front, to avoid torch.cat([]) and the tensors[0] access
    if not tensors:
        if device is not None:
            return torch.tensor([], device=device)
        return torch.tensor([])
    # For non-empty input, infer the device from the first tensor when not given
    if device is None:
        device = tensors[0].device
    try:
        res = torch.cat(tensors, dim=dim)
        torch.npu.synchronize()
        return res
    except Exception:
        cpu_tensors = [t.cpu() for t in tensors]
        return torch.cat(cpu_tensors, dim=dim).to(device)

Comment on lines +37 to +41
        pass

    return _orig_request(self, method, url, *args, **kwargs)

requests.Session.request = mocked_request

Copilot AI Mar 30, 2026


This applies a global monkey patch to requests.Session.request (effective immediately at module import), which affects every caller of requests in the process and makes network behavior unpredictable and hard to debug. Consider making the interception more local (for example, using a custom Session/Adapter only in the code path that downloads the YOLOX weights, or gating it behind an environment-variable switch).

Suggested change

        pass
    return _orig_request(self, method, url, *args, **kwargs)
requests.Session.request = mocked_request

        # Fall back to the original URL on exception
        pass
    return _orig_request(self, method, url, *args, **kwargs)

# Only enable the global monkey patch of requests.Session.request when the
# environment variable NPU_ADAPTER_PATCH_REQUESTS is set to 1 / true / yes.
_enable_global_requests_patch = os.environ.get("NPU_ADAPTER_PATCH_REQUESTS", "").strip().lower() in (
    "1",
    "true",
    "yes",
)
if _enable_global_requests_patch:
    requests.Session.request = mocked_request

Comment on lines +3 to +9
import types
import torch
import torch_npu
import numpy as np
import requests
from torchvision.ops import nms
from requests.exceptions import ConnectionError

Copilot AI Mar 30, 2026


types and ConnectionError are unused in this file (imported but never referenced), adding static-analysis noise. Consider removing the unused imports, or adding the corresponding code paths if they are actually needed.

Suggested change
import types
import torch
import torch_npu
import numpy as np
import requests
from torchvision.ops import nms
from requests.exceptions import ConnectionError
import torch
import torch_npu
import numpy as np
import requests
from torchvision.ops import nms
