Compare commits

...

15 Commits

Author SHA1 Message Date
dj
51350e3002 123 2026-04-14 17:35:40 +08:00
dj
8e713be1ca Merge remote changes with RAG service optimization
- Keep user's RAG service integration for faster extraction
- Add remote's word_ai_service support
- Preserve user's parallel extraction and field header optimizations

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-14 17:25:13 +08:00
zzz
f2af27245d Enhance Word document AI parsing and template filling 2026-04-14 17:16:38 +08:00
dj
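a9dc0d8b91 Optimize smart form filling: faster, with more precise data extraction
Backend optimizations (template_fill_service.py):

1. Speed:
   - Extract fields in parallel with asyncio.gather
   - Skip the AI review step to reduce the number of LLM calls
   - Add the _extract_single_field_fast method

2. Data extraction:
   - Integrate the RAG service for content retrieval
   - Fix Markdown table column matching so that it skips empty columns
   - Fix year sub-header rows being misidentified

3. AI header generation:
   - Trim to 5-7 representative fields (previously 8-15)
   - Filter out non-data fields (source, remarks, notes, etc.)
   - Simplify field names, e.g. "hospital count" rather than "hospital - public hospital count"

4. AI data extraction prompt:
   - Extract strictly by header and return only relevant data
   - Every value must carry a label (year/region/category)
   - Support several label types: 2024, Beijing, a given province, public hospital, tertiary hospital, etc.
   - Preserve the original number, unit, and percent-sign formatting
   - Do not return long source explanations

5. New warning field on FillResult:
   - Multi-value detection notice, e.g. "2 values detected"

Frontend optimizations (TemplateFill.tsx):
- Fill details show a multi-value warning (yellow notice box)
- When there are multiple values, show all of them directly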
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-14 17:14:59 +08:00
tl
902c28166b tl 2026-04-14 15:18:50 +08:00
tl
4a53be7eeb TL 2026-04-14 14:58:14 +08:00
tl
8b5b24fa2a Merge branch 'main' of https://gitea.kronecker.cc/OurCodesAreAllRight/FilesReadSystem 2026-04-14 14:57:53 +08:00
tl
ed66aa346d tl 2026-04-10 10:24:52 +08:00
zzz
5b82d40be0 Merge branch 'main' of https://gitea.kronecker.cc/OurCodesAreAllRight/FilesReadSystem 2026-04-10 10:10:41 +08:00
zzz
bedf1af9c0 Enhance Word document AI parsing and template filling 2026-04-10 09:48:57 +08:00
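5fca4eb094 Add exception handling for temp file cleanup and change the outline endpoint to POST
- Wrapped temp file cleanup in try/except blocks in analyze_markdown, analyze_markdown_stream
  and get_markdown_outline
- Changed the /analyze/md/outline endpoint from GET to POST to support file upload
- Ensure temp files are cleaned up in all cases, and log cleanup failures

refactor(health): improve health checks to verify actual database connections

- MySQL health check now runs a real SELECT 1 query to verify the connection
- MongoDB health check now runs the ping command to verify the connection
- Redis health check now runs the ping command to verify the connection
- Catch exceptions and log the specific error

refactor(upload): use os.path.basename for file name extraction

- Replace manual string splitting with os.path.basename to get the file name
- Unify file name handling in Excel upload and export

feat(instruction): add instruction execution framework module

- Create the instruction package with the base architecture for intent parsing and instruction execution
- Add the IntentParser and InstructionExecutor abstract base classes
- Provide default implementations marked as unfinished, in preparation for future features

refactor(frontend): adjust the AuthContext import path and remove the duplicate file

- Move AuthContext from src/context to src/contexts
- Update the import paths in App.tsx and RouteGuard.tsx
- Remove the old AuthContext.tsx file

fix(backend-api): fix the wrong HTTP method in the AI analysis API

- Change the fetch request method in aiApi from GET to POST to support file upload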
2026-04-10 01:51:53 +08:00
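0dbf74db9d Add task ID tracking to the template fill endpoint
- Add an optional task_id field to FillRequest for task history tracking
- Implement task state management, including creation, updates, and error handling
- Integrate MongoDB task records and update progress during processing
- Add task progress updates covering the started, processing, success, and failure states
- Change the template fill service to accept and pass through the task_id parameter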
2026-04-10 01:27:26 +08:00
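858b594171 Add dual-write for task status and a task history feature
- Write task status to both Redis and MongoDB (dual-write)
- Add a MongoDB tasks collection and CRUD operations
- Add task history lookup, listing, and deletion
- Refactor task status updates to go through a single update_task_status function
- Add AI review of field values in the template fill service
- Improve the display and interaction of the frontend task history page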
2026-04-10 01:15:53 +08:00
ed0f51f2a4 Merge branch 'main' of https://gitea.kronecker.cc/OurCodesAreAllRight/FilesReadSystem 2026-04-10 00:26:57 +08:00
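ecc0c79475 Extend the template fill service with table content summaries and header regeneration
- Add table content summaries during source document parsing, extracting table structure for the AI
- Add table summary logic that extracts and formats the headers and the first 3 data rows
- Add template file type detection for the xlsx and docx formats
- Implement automatic header regeneration based on source document content
- When auto-generated headers are detected, regenerate more accurate fields from the source document content
- Add detailed debug logging to trace table processing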
2026-04-10 00:26:54 +08:00
31 changed files with 3300 additions and 2235 deletions
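
Commit a9dc0d8b91 describes extracting fields in parallel with asyncio.gather. The service code itself is not part of this excerpt, so the following is only a minimal sketch of that idea. `extract_field` is a hypothetical stand-in for `_extract_single_field_fast` (whose real signature is not shown), and the concurrency cap is an added assumption rather than something the commit states.

```python
import asyncio
from typing import Dict, List


async def extract_field(name: str) -> str:
    """Hypothetical stand-in for one LLM-backed field extraction."""
    await asyncio.sleep(0.01)  # simulate network latency
    return f"value-for-{name}"


async def extract_all(fields: List[str], max_concurrency: int = 4) -> Dict[str, str]:
    """Extract every field concurrently while capping in-flight calls."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def guarded(name: str) -> str:
        async with semaphore:
            return await extract_field(name)

    # gather returns results in input order, so they line up with `fields`.
    values = await asyncio.gather(*(guarded(name) for name in fields))
    return dict(zip(fields, values))
```

Calling `asyncio.run(extract_all(["hospital_count", "bed_count"]))` returns a dict keyed by field name. The cap is worth considering because an unbounded gather can run into provider rate limits.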


@@ -29,9 +29,14 @@ REDIS_URL="redis://localhost:6379/0"
# ==================== LLM AI configuration ====================
# Large language model API configuration
LLM_API_KEY="your_api_key_here"
LLM_BASE_URL=""
LLM_MODEL_NAME=""
# Supports OpenAI-compatible formats (DeepSeek, Zhipu GLM, Alibaba, etc.)
# Zhipu AI GLM series:
# - Models: glm-4-flash (fast text model), glm-4 (standard), glm-4-plus (high performance)
# - API: https://open.bigmodel.cn
# - API Key: https://open.bigmodel.cn/usercenter/apikeys
LLM_API_KEY="ca79ad9f96524cd5afc3e43ca97f347d.cpiLLx2oyitGvTeU"
LLM_BASE_URL="https://open.bigmodel.cn/api/paas/v4"
LLM_MODEL_NAME="glm-4v-plus"
# ==================== Supabase configuration ====================
# Supabase project configuration

38
backend/=3.0.0 Normal file

@@ -0,0 +1,38 @@
Requirement already satisfied: sentence-transformers in c:\python312\lib\site-packages (2.2.2)
Requirement already satisfied: transformers<5.0.0,>=4.6.0 in c:\python312\lib\site-packages (from sentence-transformers) (4.57.6)
Requirement already satisfied: tqdm in c:\python312\lib\site-packages (from sentence-transformers) (4.66.1)
Requirement already satisfied: torch>=1.6.0 in c:\python312\lib\site-packages (from sentence-transformers) (2.10.0)
Requirement already satisfied: torchvision in c:\python312\lib\site-packages (from sentence-transformers) (0.25.0)
Requirement already satisfied: numpy in c:\python312\lib\site-packages (from sentence-transformers) (1.26.2)
Requirement already satisfied: scikit-learn in c:\python312\lib\site-packages (from sentence-transformers) (1.8.0)
Requirement already satisfied: scipy in c:\python312\lib\site-packages (from sentence-transformers) (1.16.3)
Requirement already satisfied: nltk in c:\python312\lib\site-packages (from sentence-transformers) (3.9.3)
Requirement already satisfied: sentencepiece in c:\python312\lib\site-packages (from sentence-transformers) (0.2.1)
Requirement already satisfied: huggingface-hub>=0.4.0 in c:\python312\lib\site-packages (from sentence-transformers) (0.36.2)
Requirement already satisfied: filelock in c:\python312\lib\site-packages (from huggingface-hub>=0.4.0->sentence-transformers) (3.25.2)
Requirement already satisfied: fsspec>=2023.5.0 in c:\python312\lib\site-packages (from huggingface-hub>=0.4.0->sentence-transformers) (2026.2.0)
Requirement already satisfied: packaging>=20.9 in c:\python312\lib\site-packages (from huggingface-hub>=0.4.0->sentence-transformers) (23.2)
Requirement already satisfied: pyyaml>=5.1 in c:\python312\lib\site-packages (from huggingface-hub>=0.4.0->sentence-transformers) (6.0.1)
Requirement already satisfied: requests in c:\python312\lib\site-packages (from huggingface-hub>=0.4.0->sentence-transformers) (2.31.0)
Requirement already satisfied: typing-extensions>=3.7.4.3 in c:\python312\lib\site-packages (from huggingface-hub>=0.4.0->sentence-transformers) (4.15.0)
Requirement already satisfied: sympy>=1.13.3 in c:\python312\lib\site-packages (from torch>=1.6.0->sentence-transformers) (1.14.0)
Requirement already satisfied: networkx>=2.5.1 in c:\python312\lib\site-packages (from torch>=1.6.0->sentence-transformers) (3.6.1)
Requirement already satisfied: jinja2 in c:\python312\lib\site-packages (from torch>=1.6.0->sentence-transformers) (3.1.6)
Requirement already satisfied: setuptools in c:\python312\lib\site-packages (from torch>=1.6.0->sentence-transformers) (82.0.1)
Requirement already satisfied: colorama in c:\python312\lib\site-packages (from tqdm->sentence-transformers) (0.4.6)
Requirement already satisfied: regex!=2019.12.17 in c:\python312\lib\site-packages (from transformers<5.0.0,>=4.6.0->sentence-transformers) (2026.2.28)
Requirement already satisfied: tokenizers<=0.23.0,>=0.22.0 in c:\python312\lib\site-packages (from transformers<5.0.0,>=4.6.0->sentence-transformers) (0.22.2)
Requirement already satisfied: safetensors>=0.4.3 in c:\python312\lib\site-packages (from transformers<5.0.0,>=4.6.0->sentence-transformers) (0.7.0)
Requirement already satisfied: click in c:\python312\lib\site-packages (from nltk->sentence-transformers) (8.3.1)
Requirement already satisfied: joblib in c:\python312\lib\site-packages (from nltk->sentence-transformers) (1.5.3)
Requirement already satisfied: threadpoolctl>=3.2.0 in c:\python312\lib\site-packages (from scikit-learn->sentence-transformers) (3.6.0)
Requirement already satisfied: pillow!=8.3.*,>=5.3.0 in c:\python312\lib\site-packages (from torchvision->sentence-transformers) (12.1.1)
Requirement already satisfied: mpmath<1.4,>=1.1.0 in c:\python312\lib\site-packages (from sympy>=1.13.3->torch>=1.6.0->sentence-transformers) (1.3.0)
Requirement already satisfied: MarkupSafe>=2.0 in c:\python312\lib\site-packages (from jinja2->torch>=1.6.0->sentence-transformers) (3.0.3)
Requirement already satisfied: charset-normalizer<4,>=2 in c:\python312\lib\site-packages (from requests->huggingface-hub>=0.4.0->sentence-transformers) (3.4.6)
Requirement already satisfied: idna<4,>=2.5 in c:\python312\lib\site-packages (from requests->huggingface-hub>=0.4.0->sentence-transformers) (3.11)
Requirement already satisfied: urllib3<3,>=1.21.1 in c:\python312\lib\site-packages (from requests->huggingface-hub>=0.4.0->sentence-transformers) (2.6.3)
Requirement already satisfied: certifi>=2017.4.17 in c:\python312\lib\site-packages (from requests->huggingface-hub>=0.4.0->sentence-transformers) (2026.2.25)
[notice] A new release of pip is available: 24.2 -> 26.0.1
[notice] To update, run: python.exe -m pip install --upgrade pip


@@ -10,6 +10,8 @@ import os
from app.services.excel_ai_service import excel_ai_service
from app.services.markdown_ai_service import markdown_ai_service
from app.services.template_fill_service import template_fill_service
from app.services.word_ai_service import word_ai_service
logger = logging.getLogger(__name__)
@@ -215,9 +217,12 @@ async def analyze_markdown(
return result
finally:
# Clean up the temp file
if os.path.exists(tmp_path):
# Clean up the temp file; make sure cleanup runs in every case
try:
if tmp_path and os.path.exists(tmp_path):
os.unlink(tmp_path)
except Exception as cleanup_error:
logger.warning(f"临时文件清理失败: {tmp_path}, error: {cleanup_error}")
except HTTPException:
raise
@@ -279,8 +284,12 @@ async def analyze_markdown_stream(
)
finally:
if os.path.exists(tmp_path):
# Clean up the temp file; make sure cleanup runs in every case
try:
if tmp_path and os.path.exists(tmp_path):
os.unlink(tmp_path)
except Exception as cleanup_error:
logger.warning(f"临时文件清理失败: {tmp_path}, error: {cleanup_error}")
except HTTPException:
raise
@@ -289,7 +298,7 @@ async def analyze_markdown_stream(
raise HTTPException(status_code=500, detail=f"流式分析失败: {str(e)}")
@router.get("/analyze/md/outline")
@router.post("/analyze/md/outline")
async def get_markdown_outline(
file: UploadFile = File(...)
):
@@ -323,9 +332,154 @@ async def get_markdown_outline(
result = await markdown_ai_service.extract_outline(tmp_path)
return result
finally:
if os.path.exists(tmp_path):
# Clean up the temp file; make sure cleanup runs in every case
try:
if tmp_path and os.path.exists(tmp_path):
os.unlink(tmp_path)
except Exception as cleanup_error:
logger.warning(f"临时文件清理失败: {tmp_path}, error: {cleanup_error}")
except Exception as e:
logger.error(f"获取 Markdown 大纲失败: {str(e)}")
raise HTTPException(status_code=500, detail=f"获取大纲失败: {str(e)}")
@router.post("/analyze/txt")
async def analyze_txt(
file: UploadFile = File(...),
):
"""
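Upload a TXT file and analyze it with AI to extract structured data
Converts unstructured text into structured table data for later form filling
Args:
file: the uploaded TXT file
Returns:
dict: analysis result containing structured table data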
"""
if not file.filename:
raise HTTPException(status_code=400, detail="文件名为空")
file_ext = file.filename.split('.')[-1].lower()
if file_ext not in ['txt', 'text']:
raise HTTPException(
status_code=400,
detail=f"不支持的文件类型: {file_ext},仅支持 .txt"
)
try:
# Read the file content
content = await file.read()
# Save to a temp file
with tempfile.NamedTemporaryFile(mode='wb', suffix='.txt', delete=False) as tmp:
tmp.write(content)
tmp_path = tmp.name
try:
logger.info(f"开始 AI 分析 TXT 文件: {file.filename}")
# Use the AI analysis method of template_fill_service
result = await template_fill_service.analyze_txt_with_ai(
content=content.decode('utf-8', errors='replace'),
filename=file.filename
)
if result:
logger.info(f"TXT AI 分析成功: {file.filename}")
return {
"success": True,
"filename": file.filename,
"structured_data": result
}
else:
logger.warning(f"TXT AI 分析返回空结果: {file.filename}")
return {
"success": False,
"filename": file.filename,
"error": "AI 分析未能提取到结构化数据",
"structured_data": None
}
finally:
# Clean up the temp file
if os.path.exists(tmp_path):
os.unlink(tmp_path)
except HTTPException:
raise
except Exception as e:
logger.error(f"TXT AI 分析过程中出错: {str(e)}")
raise HTTPException(status_code=500, detail=f"分析失败: {str(e)}")
# ==================== Word document AI parsing ====================
@router.post("/analyze/word")
async def analyze_word(
file: UploadFile = File(...),
user_hint: str = Query("", description="用户提示词,如'请提取表格数据'")
):
"""
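Parse a Word document with AI and extract structured data
Suited to extracting table data, key-value pairs, and similar information from unstructured Word documents
Args:
file: the uploaded Word file
user_hint: user prompt hint
Returns:
dict: parse result containing structured data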
"""
if not file.filename:
raise HTTPException(status_code=400, detail="文件名为空")
file_ext = file.filename.split('.')[-1].lower()
if file_ext not in ['docx']:
raise HTTPException(
status_code=400,
detail=f"不支持的文件类型: {file_ext},仅支持 .docx"
)
try:
# Save the uploaded file
content = await file.read()
suffix = f".{file_ext}"
with tempfile.NamedTemporaryFile(delete=False, suffix=suffix) as tmp:
tmp.write(content)
tmp_path = tmp.name
try:
# Parse the Word document with AI
result = await word_ai_service.parse_word_with_ai(
file_path=tmp_path,
user_hint=user_hint or "请提取文档中的所有结构化数据,包括表格、键值对等"
)
if result.get("success"):
return {
"success": True,
"filename": file.filename,
"result": result
}
else:
return {
"success": False,
"filename": file.filename,
"error": result.get("error", "AI 解析失败"),
"result": None
}
finally:
# Clean up the temp file
if os.path.exists(tmp_path):
os.unlink(tmp_path)
except HTTPException:
raise
except Exception as e:
logger.error(f"Word AI 分析过程中出错: {str(e)}")
raise HTTPException(status_code=500, detail=f"分析失败: {str(e)}")
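
The handlers above all follow the same pattern: write the upload to a temp file, process it, and remove it in `finally`. As a rough sketch (not code from this repository), that pattern could be factored into a context manager. `temp_upload` is a hypothetical helper name. It initializes `tmp_path` to `None` so that cleanup is safe even if creating the file fails.

```python
import contextlib
import logging
import os
import tempfile
from typing import Iterator

logger = logging.getLogger(__name__)


@contextlib.contextmanager
def temp_upload(content: bytes, suffix: str) -> Iterator[str]:
    """Write `content` to a temp file, yield its path, and always try to delete it."""
    tmp_path = None
    try:
        with tempfile.NamedTemporaryFile(delete=False, suffix=suffix) as tmp:
            tmp.write(content)
            tmp_path = tmp.name
        yield tmp_path
    finally:
        # Cleanup failures are logged rather than raised, so they never
        # mask the original error from the request handler.
        try:
            if tmp_path and os.path.exists(tmp_path):
                os.unlink(tmp_path)
        except OSError as cleanup_error:
            logger.warning("temp file cleanup failed: %s (%s)", tmp_path, cleanup_error)
```

A handler would then use `with temp_upload(content, ".docx") as tmp_path:` around the parsing call, which keeps the cleanup logic in one place instead of repeating it per endpoint.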


@@ -23,6 +23,52 @@ logger = logging.getLogger(__name__)
router = APIRouter(prefix="/upload", tags=["文档上传"])
# ==================== Helper functions ====================
async def update_task_status(
task_id: str,
status: str,
progress: int = 0,
message: str = "",
result: dict = None,
error: str = None
):
"""
Update the task status, writing to both Redis and MongoDB
Args:
task_id: task ID
status: status
progress: progress
message: message
result: result
error: error message
"""
meta = {"progress": progress, "message": message}
if result:
meta["result"] = result
if error:
meta["error"] = error
# Try writing to Redis
try:
await redis_db.set_task_status(task_id, status, meta)
except Exception as e:
logger.warning(f"Redis 任务状态更新失败: {e}")
# Try writing to MongoDB as a fallback
try:
await mongodb.update_task(
task_id=task_id,
status=status,
message=message,
result=result,
error=error
)
except Exception as e:
logger.warning(f"MongoDB 任务状态更新失败: {e}")
# ==================== Request/response models ====================
class UploadResponse(BaseModel):
@@ -77,6 +123,17 @@ async def upload_document(
task_id = str(uuid.uuid4())
try:
# Save the task record to MongoDB (so it can still be queried when Redis is unavailable)
try:
await mongodb.insert_task(
task_id=task_id,
task_type="document_parse",
status="pending",
message=f"文档 {file.filename} 已提交处理"
)
except Exception as mongo_err:
logger.warning(f"MongoDB 保存任务记录失败: {mongo_err}")
content = await file.read()
saved_path = file_service.save_uploaded_file(
content,
@@ -122,6 +179,17 @@ async def upload_documents(
saved_paths = []
try:
# Save the task record to MongoDB
try:
await mongodb.insert_task(
task_id=task_id,
task_type="batch_parse",
status="pending",
message=f"已提交 {len(files)} 个文档处理"
)
except Exception as mongo_err:
logger.warning(f"MongoDB 保存批量任务记录失败: {mongo_err}")
for file in files:
if not file.filename:
continue
@@ -159,9 +227,9 @@ async def process_document(
"""Process a single document"""
try:
# Status: parsing
await redis_db.set_task_status(
await update_task_status(
task_id, status="processing",
meta={"progress": 10, "message": "正在解析文档"}
progress=10, message="正在解析文档"
)
# Parse the document
@@ -172,9 +240,9 @@ async def process_document(
raise Exception(result.error or "解析失败")
# Status: storing
await redis_db.set_task_status(
await update_task_status(
task_id, status="processing",
meta={"progress": 30, "message": "正在存储数据"}
progress=30, message="正在存储数据"
)
# Store to MongoDB
@@ -191,9 +259,9 @@ async def process_document(
# For Excel: store to MySQL + AI-generated descriptions + RAG index
if doc_type in ["xlsx", "xls"]:
await redis_db.set_task_status(
await update_task_status(
task_id, status="processing",
meta={"progress": 50, "message": "正在存储到MySQL并生成字段描述"}
progress=50, message="正在存储到MySQL并生成字段描述"
)
try:
@@ -215,9 +283,9 @@ async def process_document(
else:
# Unstructured document
await redis_db.set_task_status(
await update_task_status(
task_id, status="processing",
meta={"progress": 60, "message": "正在建立索引"}
progress=60, message="正在建立索引"
)
# If the document contains table data, extract it and store to MySQL + RAG
@@ -238,36 +306,33 @@ async def process_document(
await index_document_to_rag(doc_id, original_filename, result, doc_type)
# Done
await redis_db.set_task_status(
await update_task_status(
task_id, status="success",
meta={
"progress": 100,
"message": "处理完成",
"doc_id": doc_id,
"result": {
progress=100, message="处理完成",
result={
"doc_id": doc_id,
"doc_type": doc_type,
"filename": original_filename
}
}
)
logger.info(f"文档处理完成: {original_filename}, doc_id: {doc_id}")
except Exception as e:
logger.error(f"文档处理失败: {str(e)}")
await redis_db.set_task_status(
await update_task_status(
task_id, status="failure",
meta={"error": str(e)}
progress=0, message="处理失败",
error=str(e)
)
async def process_documents_batch(task_id: str, files: List[dict]):
"""Process documents in batch"""
try:
await redis_db.set_task_status(
await update_task_status(
task_id, status="processing",
meta={"progress": 0, "message": "开始批量处理"}
progress=0, message="开始批量处理"
)
results = []
@@ -318,37 +383,43 @@ async def process_documents_batch(task_id: str, files: List[dict]):
results.append({"filename": file_info["filename"], "success": False, "error": str(e)})
progress = int((i + 1) / len(files) * 100)
await redis_db.set_task_status(
await update_task_status(
task_id, status="processing",
meta={"progress": progress, "message": f"已处理 {i+1}/{len(files)}"}
progress=progress, message=f"已处理 {i+1}/{len(files)}"
)
await redis_db.set_task_status(
await update_task_status(
task_id, status="success",
meta={"progress": 100, "message": "批量处理完成", "results": results}
progress=100, message="批量处理完成",
result={"results": results}
)
except Exception as e:
logger.error(f"批量处理失败: {str(e)}")
await redis_db.set_task_status(
await update_task_status(
task_id, status="failure",
meta={"error": str(e)}
progress=0, message="批量处理失败",
error=str(e)
)
async def index_document_to_rag(doc_id: str, filename: str, result: ParseResult, doc_type: str):
"""Index an unstructured document into RAG"""
"""Index an unstructured document into RAG (using chunked indexing)"""
try:
content = result.data.get("content", "")
if content:
# Pass the full content to the RAG service, which chunks and indexes it
rag_service.index_document_content(
doc_id=doc_id,
content=content[:5000],
content=content, # pass the full content; the RAG service chunks it
metadata={
"filename": filename,
"doc_type": doc_type
}
},
chunk_size=500, # 500 characters per chunk
chunk_overlap=50 # 50-character overlap between chunks
)
logger.info(f"RAG 索引完成: {filename}, doc_id={doc_id}")
except Exception as e:
logger.warning(f"RAG 索引失败: {str(e)}")
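
The change above stops truncating content to 5000 characters and instead passes `chunk_size=500` and `chunk_overlap=50` to `rag_service.index_document_content`. The chunking implementation lives in the RAG service and is not shown here. The sketch below is only an assumption of how fixed-size, character-based chunking with overlap typically works. `chunk_text` is a hypothetical name.

```python
from typing import List


def chunk_text(text: str, chunk_size: int = 500, chunk_overlap: int = 50) -> List[str]:
    """Split `text` into windows of `chunk_size` characters that overlap by `chunk_overlap`."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far each window advances
    chunks: List[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks
```

With the values from the diff, each window advances by 450 characters, so a 1000-character document yields three chunks. The overlap means a sentence that straddles a boundary still appears whole in at least one chunk if it is shorter than 50 characters.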


@@ -19,26 +19,43 @@ async def health_check() -> Dict[str, Any]:
Returns the connection status of each database and application info
"""
# Check the connection status of each database
mysql_status = "connected"
mongodb_status = "connected"
redis_status = "connected"
mysql_status = "unknown"
mongodb_status = "unknown"
redis_status = "unknown"
try:
if mysql_db.async_engine is None:
mysql_status = "disconnected"
except Exception:
else:
# Run a real query to verify the connection
from sqlalchemy import text
async with mysql_db.async_engine.connect() as conn:
await conn.execute(text("SELECT 1"))
mysql_status = "connected"
except Exception as e:
logger.warning(f"MySQL 健康检查失败: {e}")
mysql_status = "error"
try:
if mongodb.client is None:
mongodb_status = "disconnected"
except Exception:
else:
# Verify with a real ping
await mongodb.client.admin.command('ping')
mongodb_status = "connected"
except Exception as e:
logger.warning(f"MongoDB 健康检查失败: {e}")
mongodb_status = "error"
try:
if not redis_db.is_connected:
if not redis_db.is_connected or redis_db.client is None:
redis_status = "disconnected"
except Exception:
else:
# Verify with a real ping
await redis_db.client.ping()
redis_status = "connected"
except Exception as e:
logger.warning(f"Redis 健康检查失败: {e}")
redis_status = "error"
return {
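
The three checks above share one shape: report "disconnected" when no client exists, otherwise run a real round trip and map any failure to "error". A minimal sketch of that shape as a reusable helper follows. `probe` is a hypothetical name, and the timeout is an added assumption (the diff does not set one), included because a hung database would otherwise stall the health endpoint.

```python
import asyncio
import logging
from typing import Awaitable, Callable, Optional

logger = logging.getLogger(__name__)


async def probe(
    name: str,
    ping: Optional[Callable[[], Awaitable[object]]],
    timeout: float = 2.0,
) -> str:
    """Return 'disconnected', 'connected', or 'error' for one backend."""
    if ping is None:
        return "disconnected"  # no client was ever created
    try:
        # A real round trip, bounded so a hung backend cannot block the check.
        await asyncio.wait_for(ping(), timeout=timeout)
        return "connected"
    except Exception as exc:
        logger.warning("%s health check failed: %s", name, exc)
        return "error"
```

In the endpoint, a call could look like `await probe("Redis", redis_db.client.ping if redis_db.client else None)`, with equivalent callables wrapping the MySQL `SELECT 1` and the MongoDB `ping` command.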


@@ -1,13 +1,13 @@
"""
Task management API
Provides async task status queries
Provides async task status queries and task history
"""
from typing import Optional
from fastapi import APIRouter, HTTPException
from app.core.database import redis_db
from app.core.database import redis_db, mongodb
router = APIRouter(prefix="/tasks", tags=["任务管理"])
@@ -23,20 +23,10 @@ async def get_task_status(task_id: str):
Returns:
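Task status information
"""
# Prefer Redis
status = await redis_db.get_task_status(task_id)
if not status:
# When Redis is unavailable, assume the task finished and the document was processed successfully
# The frontend gets this response when polling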
return {
"task_id": task_id,
"status": "success",
"progress": 100,
"message": "任务处理完成",
"result": None,
"error": None
}
if status:
return {
"task_id": task_id,
"status": status.get("status", "unknown"),
@@ -45,3 +35,82 @@ async def get_task_status(task_id: str):
"result": status.get("meta", {}).get("result"),
"error": status.get("meta", {}).get("error")
}
# When Redis is unavailable, try MongoDB
mongo_task = await mongodb.get_task(task_id)
if mongo_task:
return {
"task_id": mongo_task.get("task_id"),
"status": mongo_task.get("status", "unknown"),
"progress": 100 if mongo_task.get("status") == "success" else 0,
"message": mongo_task.get("message"),
"result": mongo_task.get("result"),
"error": mongo_task.get("error")
}
# Task does not exist or its status is unknown
return {
"task_id": task_id,
"status": "unknown",
"progress": 0,
"message": "无法获取任务状态Redis和MongoDB均不可用",
"result": None,
"error": None
}
@router.get("/")
async def list_tasks(limit: int = 50, skip: int = 0):
"""
Get the task history list
Args:
limit: maximum number of tasks to return
skip: number of tasks to skip
Returns:
Task list
"""
try:
tasks = await mongodb.list_tasks(limit=limit, skip=skip)
return {
"success": True,
"tasks": tasks,
"count": len(tasks)
}
except Exception as e:
# Return an empty list when MongoDB is unavailable
return {
"success": False,
"tasks": [],
"count": 0,
"error": str(e)
}
@router.delete("/{task_id}")
async def delete_task(task_id: str):
"""
Delete a task
Args:
task_id: task ID
Returns:
Whether the deletion succeeded
"""
try:
# Delete from Redis
if redis_db._connected and redis_db.client:
key = f"task:{task_id}"
await redis_db.client.delete(key)
# Delete from MongoDB
deleted = await mongodb.delete_task(task_id)
return {
"success": True,
"deleted": deleted
}
except Exception as e:
raise HTTPException(status_code=500, detail=f"删除任务失败: {str(e)}")
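
The new `get_task_status` reads from Redis first, falls back to MongoDB, and returns an "unknown" status when neither has the task, instead of assuming success as the old code did. The read path can be sketched with in-memory stand-ins. `FakeStore` and `read_task_status` are hypothetical names, and catching `ConnectionError` is an assumption about how an unavailable store would surface.

```python
import asyncio
from typing import Any, Dict, Optional


class FakeStore:
    """In-memory stand-in for a Redis or MongoDB task store (hypothetical)."""

    def __init__(self, data: Optional[Dict[str, Dict[str, Any]]] = None, available: bool = True):
        self.data = dict(data or {})
        self.available = available

    async def get(self, task_id: str) -> Optional[Dict[str, Any]]:
        if not self.available:
            raise ConnectionError("store unavailable")
        return self.data.get(task_id)


async def read_task_status(task_id: str, cache: FakeStore, history: FakeStore) -> Dict[str, Any]:
    """Try the cache first, then the history store, then report 'unknown'."""
    for store in (cache, history):
        try:
            record = await store.get(task_id)
        except ConnectionError:
            continue  # this store is down; fall through to the next one
        if record:
            return {"task_id": task_id, **record}
    return {"task_id": task_id, "status": "unknown", "progress": 0}
```

Reporting "unknown" rather than "success" matters for a polling frontend: a task that was never recorded no longer looks finished.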


@@ -23,6 +23,44 @@ logger = logging.getLogger(__name__)
router = APIRouter(prefix="/templates", tags=["表格模板"])
# ==================== Helper functions ====================
async def update_task_status(
task_id: str,
status: str,
progress: int = 0,
message: str = "",
result: dict = None,
error: str = None
):
"""
Update the task status, writing to both Redis and MongoDB
"""
from app.core.database import redis_db
meta = {"progress": progress, "message": message}
if result:
meta["result"] = result
if error:
meta["error"] = error
try:
await redis_db.set_task_status(task_id, status, meta)
except Exception as e:
logger.warning(f"Redis 任务状态更新失败: {e}")
try:
await mongodb.update_task(
task_id=task_id,
status=status,
message=message,
result=result,
error=error
)
except Exception as e:
logger.warning(f"MongoDB 任务状态更新失败: {e}")
# ==================== Request/response models ====================
class TemplateFieldRequest(BaseModel):
@@ -41,6 +79,7 @@ class FillRequest(BaseModel):
source_doc_ids: Optional[List[str]] = None # list of MongoDB document IDs
source_file_paths: Optional[List[str]] = None # list of source document file paths
user_hint: Optional[str] = None
task_id: Optional[str] = None # optional task ID, used for task history tracking
class ExportRequest(BaseModel):
@@ -186,13 +225,51 @@ async def upload_joint_template(
parser = ParserFactory.get_parser(sf_path)
parse_result = parser.parse(sf_path)
if parse_result.success and parse_result.data:
# Get the raw content
content = parse_result.data.get("content", "")[:5000] if parse_result.data.get("content") else ""
# Get the titles (may be at the top level or inside structured_data)
titles = parse_result.data.get("titles", [])
if not titles and parse_result.data.get("structured_data"):
titles = parse_result.data.get("structured_data", {}).get("titles", [])
titles = titles[:10] if titles else []
# Get the table count (tables may be at the top level or inside structured_data)
tables = parse_result.data.get("tables", [])
if not tables and parse_result.data.get("structured_data"):
tables = parse_result.data.get("structured_data", {}).get("tables", [])
tables_count = len(tables) if tables else 0
# Build a summary of table contents (helps the AI understand the source document structure)
tables_summary = ""
if tables:
tables_summary = "\n【文档中的表格】:\n"
for idx, table in enumerate(tables[:5]): # at most 5 tables
if isinstance(table, dict):
headers = table.get("headers", [])
rows = table.get("rows", [])
if headers:
tables_summary += f"表格{idx+1}表头: {', '.join(str(h) for h in headers)}\n"
if rows:
tables_summary += f"表格{idx+1}前3行: "
for row_idx, row in enumerate(rows[:3]):
if isinstance(row, list):
tables_summary += " | ".join(str(c) for c in row) + "; "
elif isinstance(row, dict):
tables_summary += " | ".join(str(row.get(h, "")) for h in headers if headers) + "; "
tables_summary += "\n"
source_contents.append({
"filename": sf.filename,
"doc_type": sf_ext,
"content": parse_result.data.get("content", "")[:5000] if parse_result.data.get("content") else "",
"titles": parse_result.data.get("titles", [])[:10] if parse_result.data.get("titles") else [],
"tables_count": len(parse_result.data.get("tables", [])) if parse_result.data.get("tables") else 0
"content": content,
"titles": titles,
"tables_count": tables_count,
"tables_summary": tables_summary
})
logger.info(f"[DEBUG] source_contents built: filename={sf.filename}, content_len={len(content)}, titles_count={len(titles)}, tables_count={tables_count}")
if tables_summary:
logger.info(f"[DEBUG] tables_summary preview: {tables_summary[:300]}")
except Exception as e:
logger.warning(f"解析源文档失败 {sf.filename}: {e}")
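
The hunk above builds `tables_summary` by string concatenation inside nested loops. The same logic can be written as a small pure function, which is easier to test. This is a sketch, not the repository's code: `summarize_tables` is a hypothetical name, and the English labels replace the Chinese ones used in the diff.

```python
from typing import Any, Dict, List


def summarize_tables(tables: List[Dict[str, Any]], max_tables: int = 5, max_rows: int = 3) -> str:
    """Summarize each table as its headers plus its first few rows."""
    lines: List[str] = []
    for idx, table in enumerate(tables[:max_tables], start=1):
        if not isinstance(table, dict):
            continue
        headers = table.get("headers", [])
        rows = table.get("rows", [])
        if headers:
            lines.append(f"Table {idx} headers: " + ", ".join(str(h) for h in headers))
        preview: List[str] = []
        for row in rows[:max_rows]:
            if isinstance(row, list):
                preview.append(" | ".join(str(cell) for cell in row))
            elif isinstance(row, dict):
                # Dict rows are rendered in header order; missing keys become "".
                preview.append(" | ".join(str(row.get(h, "")) for h in headers))
        if preview:
            lines.append(f"Table {idx} first rows: " + "; ".join(preview))
    return "\n".join(lines)
```

Capping both the number of tables and the rows per table keeps the summary short enough to fit in an LLM prompt alongside the document text.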
@@ -206,6 +283,17 @@ async def upload_joint_template(
# 3. Process the source documents into MongoDB asynchronously
task_id = str(uuid.uuid4())
if source_file_info:
# Save the task record to MongoDB
try:
await mongodb.insert_task(
task_id=task_id,
task_type="source_process",
status="pending",
message=f"开始处理 {len(source_file_info)} 个源文档"
)
except Exception as mongo_err:
logger.warning(f"MongoDB 保存任务记录失败: {mongo_err}")
background_tasks.add_task(
process_source_documents,
task_id=task_id,
@@ -244,12 +332,10 @@ async def upload_joint_template(
async def process_source_documents(task_id: str, files: List[dict]):
"""Process source documents asynchronously and store them in MongoDB"""
from app.core.database import redis_db
try:
await redis_db.set_task_status(
await update_task_status(
task_id, status="processing",
meta={"progress": 0, "message": "开始处理源文档"}
progress=0, message="开始处理源文档"
)
doc_ids = []
@@ -278,22 +364,24 @@ async def process_source_documents(task_id: str, files: List[dict]):
logger.error(f"源文档处理异常: {file_info['filename']}, error: {str(e)}")
progress = int((i + 1) / len(files) * 100)
await redis_db.set_task_status(
await update_task_status(
task_id, status="processing",
meta={"progress": progress, "message": f"已处理 {i+1}/{len(files)}"}
progress=progress, message=f"已处理 {i+1}/{len(files)}"
)
await redis_db.set_task_status(
await update_task_status(
task_id, status="success",
meta={"progress": 100, "message": "源文档处理完成", "doc_ids": doc_ids}
progress=100, message="源文档处理完成",
result={"doc_ids": doc_ids}
)
logger.info(f"所有源文档处理完成: {len(doc_ids)}")
except Exception as e:
logger.error(f"源文档批量处理失败: {str(e)}")
await redis_db.set_task_status(
await update_task_status(
task_id, status="failure",
meta={"error": str(e)}
progress=0, message="源文档处理失败",
error=str(e)
)
@@ -352,7 +440,27 @@ async def fill_template(
Returns:
Fill result
"""
# Generate a task_id or use the one passed in
task_id = request.task_id or str(uuid.uuid4())
try:
# Create the task record in MongoDB
try:
await mongodb.insert_task(
task_id=task_id,
task_type="template_fill",
status="processing",
message=f"开始填表任务: {len(request.template_fields)} 个字段"
)
except Exception as mongo_err:
logger.warning(f"MongoDB 创建任务记录失败: {mongo_err}")
# Update progress: started
await update_task_status(
task_id, "processing",
progress=0, message="开始处理..."
)
# Convert the fields
fields = [
TemplateField(
@@ -365,17 +473,51 @@ async def fill_template(
for f in request.template_fields
]
# Derive the file type from template_id
template_file_type = "xlsx" # default type
if request.template_id:
ext = request.template_id.split('.')[-1].lower()
if ext in ["xlsx", "xls"]:
template_file_type = "xlsx"
elif ext == "docx":
template_file_type = "docx"
# Update progress: about to start filling
await update_task_status(
task_id, "processing",
progress=10, message=f"准备填写 {len(fields)} 个字段..."
)
# Run the fill
result = await template_fill_service.fill_template(
template_fields=fields,
source_doc_ids=request.source_doc_ids,
source_file_paths=request.source_file_paths,
user_hint=request.user_hint
user_hint=request.user_hint,
template_id=request.template_id,
template_file_type=template_file_type,
task_id=task_id
)
return result
# Mark as success
await update_task_status(
task_id, "success",
progress=100, message="填表完成",
result={
"field_count": len(fields),
"max_rows": result.get("max_rows", 0)
}
)
return {**result, "task_id": task_id}
except Exception as e:
# Mark as failure
await update_task_status(
task_id, "failure",
progress=0, message="填表失败",
error=str(e)
)
logger.error(f"填写表格失败: {str(e)}")
raise HTTPException(status_code=500, detail=f"填写失败: {str(e)}")
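
`update_task_status` appears twice in this diff (once per router module) and implements a best-effort dual write: each backend is tried independently, and a failure in one is logged without aborting the other. A compact sketch of that pattern as a single shared helper is shown below. `dual_write` and the `Writer` alias are hypothetical names, and returning the list of backends that succeeded is an added assumption that lets a caller notice when every write failed.

```python
import asyncio
import logging
from typing import Any, Awaitable, Callable, Dict, List

logger = logging.getLogger(__name__)

# A writer takes (task_id, record) and persists it to one backend.
Writer = Callable[[str, Dict[str, Any]], Awaitable[None]]


async def dual_write(task_id: str, record: Dict[str, Any], writers: Dict[str, Writer]) -> List[str]:
    """Write `record` to every backend; return the names that succeeded."""
    succeeded: List[str] = []
    for name, write in writers.items():
        try:
            await write(task_id, record)
            succeeded.append(name)
        except Exception as exc:
            # One backend failing must not prevent the others from being tried.
            logger.warning("%s task status update failed: %s", name, exc)
    return succeeded
```

Keeping one copy of this helper in a shared module would also remove the duplication between the two routers.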


@@ -5,6 +5,7 @@ from fastapi import APIRouter, UploadFile, File, HTTPException, Query
from fastapi.responses import StreamingResponse
from typing import Optional
import logging
import os
import pandas as pd
import io
@@ -126,7 +127,7 @@ async def upload_excel(
content += f"... (共 {len(sheet_data['rows'])} 行)\n\n"
doc_metadata = {
"filename": saved_path.split("/")[-1] if "/" in saved_path else saved_path.split("\\")[-1],
"filename": os.path.basename(saved_path),
"original_filename": file.filename,
"saved_path": saved_path,
"file_size": len(content),
@@ -253,7 +254,7 @@ async def export_excel(
output.seek(0)
# Generate the file name
original_name = file_path.split('/')[-1] if '/' in file_path else file_path
original_name = os.path.basename(file_path)
if columns:
export_name = f"export_{sheet_name or 'data'}_{len(column_list) if columns else 'all'}_cols.xlsx"
else:
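
One detail about the switch to `os.path.basename`: it follows the path conventions of the host OS, so on a POSIX host it does not split on backslashes, whereas the removed upload code handled both separators by hand. If stored paths can contain Windows-style separators while the server runs on Linux, a separator-agnostic variant may be safer. The sketch below illustrates this. `portable_basename` is a hypothetical helper, not part of the repository.

```python
import ntpath
import posixpath


def portable_basename(path: str) -> str:
    """Return the last path component for both '/' and backslash separators."""
    # ntpath treats both separators as delimiters, regardless of the host OS.
    return ntpath.basename(path)


# posixpath shows what os.path.basename does on a POSIX host:
# a backslash is an ordinary character there, so nothing is split.
posix_result = posixpath.basename("C:\\uploads\\report.xlsx")
```

Here `posix_result` is the whole input string, while `portable_basename` returns `report.xlsx` for both the Windows-style and the POSIX-style path.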


@@ -59,6 +59,11 @@ class MongoDB:
"""RAG index collection: stores the semantic index of fields"""
return self.db["rag_index"]
@property
def tasks(self):
"""Tasks collection: stores task history records"""
return self.db["tasks"]
# ==================== Document operations ====================
async def insert_document(
@@ -242,8 +247,128 @@ class MongoDB:
await self.rag_index.create_index("table_name")
await self.rag_index.create_index("field_name")
# Indexes for the tasks collection
await self.tasks.create_index("task_id", unique=True)
await self.tasks.create_index("created_at")
logger.info("MongoDB 索引创建完成")
# ==================== Task history operations ====================
async def insert_task(
self,
task_id: str,
task_type: str,
status: str = "pending",
message: str = "",
result: Optional[Dict[str, Any]] = None,
error: Optional[str] = None,
) -> str:
"""
Insert a task record
Args:
task_id: task ID
task_type: task type
status: task status
message: task message
result: task result
error: error message
Returns:
ID of the inserted document
"""
task = {
"task_id": task_id,
"task_type": task_type,
"status": status,
"message": message,
"result": result,
"error": error,
"created_at": datetime.utcnow(),
"updated_at": datetime.utcnow(),
}
result_obj = await self.tasks.insert_one(task)
return str(result_obj.inserted_id)
async def update_task(
self,
task_id: str,
status: Optional[str] = None,
message: Optional[str] = None,
result: Optional[Dict[str, Any]] = None,
error: Optional[str] = None,
) -> bool:
"""
Update the task status
Args:
task_id: task ID
status: task status
message: task message
result: task result
error: error message
Returns:
Whether the update succeeded
"""
from bson import ObjectId
update_data = {"updated_at": datetime.utcnow()}
if status is not None:
update_data["status"] = status
if message is not None:
update_data["message"] = message
if result is not None:
update_data["result"] = result
if error is not None:
update_data["error"] = error
update_result = await self.tasks.update_one(
{"task_id": task_id},
{"$set": update_data}
)
return update_result.modified_count > 0
async def get_task(self, task_id: str) -> Optional[Dict[str, Any]]:
"""Get a task by task_id"""
task = await self.tasks.find_one({"task_id": task_id})
if task:
task["_id"] = str(task["_id"])
return task
async def list_tasks(
self,
limit: int = 50,
skip: int = 0,
) -> List[Dict[str, Any]]:
"""
Get the task list
Args:
limit: number of tasks to return
skip: number of tasks to skip
Returns:
Task list
"""
cursor = self.tasks.find().sort("created_at", -1).skip(skip).limit(limit)
tasks = []
async for task in cursor:
task["_id"] = str(task["_id"])
# Convert datetime values to strings
if task.get("created_at"):
task["created_at"] = task["created_at"].isoformat()
if task.get("updated_at"):
task["updated_at"] = task["updated_at"].isoformat()
tasks.append(task)
return tasks
async def delete_task(self, task_id: str) -> bool:
"""Delete a task"""
result = await self.tasks.delete_one({"task_id": task_id})
return result.deleted_count > 0
# ==================== Global singleton ====================
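
`update_task` above builds a `$set` document that contains only the fields the caller actually passed. That step can be isolated as a pure function, sketched below under two assumptions: `build_task_update` is a hypothetical name, and the timestamp uses `datetime.now(timezone.utc)` rather than the `datetime.utcnow()` seen in the diff, since `utcnow()` is deprecated as of Python 3.12 and returns a naive datetime.

```python
from datetime import datetime, timezone
from typing import Any, Dict, Optional


def build_task_update(
    status: Optional[str] = None,
    message: Optional[str] = None,
    result: Optional[Dict[str, Any]] = None,
    error: Optional[str] = None,
    now: Optional[datetime] = None,
) -> Dict[str, Any]:
    """Build a MongoDB update document that only sets the provided fields."""
    fields = {"status": status, "message": message, "result": result, "error": error}
    update: Dict[str, Any] = {k: v for k, v in fields.items() if v is not None}
    # Timezone-aware timestamp; `now` is injectable so tests can be deterministic.
    update["updated_at"] = now or datetime.now(timezone.utc)
    return {"$set": update}
```

The returned dict can be passed as the second argument to `update_one({"task_id": task_id}, ...)`. Fields left as `None` are omitted, so a progress update does not overwrite an earlier `result` or `error`.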


@@ -59,7 +59,13 @@ class DocxParser(BaseParser):
paragraphs = []
for para in doc.paragraphs:
if para.text.strip():
paragraphs.append(para.text)
paragraphs.append({
"text": para.text,
"style": str(para.style.name) if para.style else "Normal"
})
# 提取段落纯文本(用于 AI 解析)
paragraphs_text = [p["text"] for p in paragraphs if p["text"].strip()]
# 提取表格内容
tables_data = []
@@ -77,8 +83,25 @@ class DocxParser(BaseParser):
"column_count": len(table_rows[0]) if table_rows else 0
})
# 合并所有文本
full_text = "\n".join(paragraphs)
# 提取图片/嵌入式对象信息
images_info = self._extract_images_info(doc, path)
# 合并所有文本(包括图片描述)
full_text_parts = []
full_text_parts.append("【文档正文】")
full_text_parts.extend(paragraphs_text)
if tables_data:
full_text_parts.append("\n【文档表格】")
for idx, table in enumerate(tables_data):
full_text_parts.append(f"--- 表格 {idx + 1} ---")
for row in table["rows"]:
full_text_parts.append(" | ".join(str(cell) for cell in row))
if images_info.get("image_count", 0) > 0:
full_text_parts.append(f"\n【文档图片】文档包含 {images_info['image_count']} 张图片/图表")
full_text = "\n".join(full_text_parts)
# 构建元数据
metadata = {
@@ -89,7 +112,9 @@ class DocxParser(BaseParser):
"table_count": len(tables_data),
"word_count": len(full_text),
"char_count": len(full_text.replace("\n", "")),
"has_tables": len(tables_data) > 0
"has_tables": len(tables_data) > 0,
"has_images": images_info.get("image_count", 0) > 0,
"image_count": images_info.get("image_count", 0)
}
# 返回结果
@@ -97,12 +122,16 @@ class DocxParser(BaseParser):
success=True,
data={
"content": full_text,
"paragraphs": paragraphs,
"paragraphs": paragraphs_text,
"paragraphs_with_style": paragraphs,
"tables": tables_data,
"images": images_info,
"word_count": len(full_text),
"structured_data": {
"paragraphs": paragraphs,
"tables": tables_data
"paragraphs_text": paragraphs_text,
"tables": tables_data,
"images": images_info
}
},
metadata=metadata
@@ -115,6 +144,59 @@ class DocxParser(BaseParser):
error=f"解析 Word 文档失败: {str(e)}"
)
def extract_images_as_base64(self, file_path: str) -> List[Dict[str, str]]:
"""
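Extract all images from a Word document and return them as a base64-encoded list.
Args:
file_path: Path to the Word file
Returns:
List of images; each item contains the base64 data and the image MIME type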
"""
import zipfile
import base64
from io import BytesIO
images = []
try:
with zipfile.ZipFile(file_path, 'r') as zf:
# 查找 word/media 目录下的图片文件
for filename in zf.namelist():
if filename.startswith('word/media/'):
# 获取图片类型
ext = filename.split('.')[-1].lower()
mime_types = {
'png': 'image/png',
'jpg': 'image/jpeg',
'jpeg': 'image/jpeg',
'gif': 'image/gif',
'bmp': 'image/bmp'
}
mime_type = mime_types.get(ext, 'image/png')
try:
# 读取图片数据并转为 base64
image_data = zf.read(filename)
base64_data = base64.b64encode(image_data).decode('utf-8')
images.append({
"filename": filename,
"mime_type": mime_type,
"base64": base64_data,
"size": len(image_data)
})
logger.info(f"提取图片: {filename}, 大小: {len(image_data)} bytes")
except Exception as e:
logger.warning(f"提取图片失败 {filename}: {str(e)}")
except Exception as e:
logger.error(f"打开 Word 文档提取图片失败: {str(e)}")
logger.info(f"共提取 {len(images)} 张图片")
return images
def extract_key_sentences(self, text: str, max_sentences: int = 10) -> List[str]:
"""
从文本中提取关键句子
@@ -268,6 +350,60 @@ class DocxParser(BaseParser):
return fields
def _extract_images_info(self, doc: Document, path: Path) -> Dict[str, Any]:
"""
Collect information about images / embedded objects in a Word document.
Args:
doc: Document object
path: File path
Returns:
Dictionary of image information
"""
import zipfile
from io import BytesIO
image_count = 0
image_descriptions = []
inline_shapes_count = 0
try:
# 方法1: 通过 inline shapes 统计图片
try:
inline_shapes_count = len(doc.inline_shapes)
if inline_shapes_count > 0:
image_count = inline_shapes_count
image_descriptions.append(f"文档包含 {inline_shapes_count} 个嵌入式图形/图片")
except Exception:
pass
# Method 2: open the .docx as a ZIP archive and count the files under word/media/
try:
with zipfile.ZipFile(path, 'r') as zf:
# Find image files under the word/media directory
media_files = [f for f in zf.namelist() if f.startswith('word/media/')]
if media_files and not inline_shapes_count:
image_count = len(media_files)
image_descriptions.append(f"文档包含 {image_count} 个嵌入图片")
# Check for header/footer images (only matches archive entries whose path contains "header")
header_images = [f for f in zf.namelist() if 'header' in f.lower() and f.endswith(('.png', '.jpg', '.jpeg', '.gif', '.bmp'))]
if header_images:
image_descriptions.append(f"页眉/页脚包含 {len(header_images)} 个图片")
except Exception:
pass
except Exception as e:
logger.warning(f"提取图片信息失败: {str(e)}")
return {
"image_count": image_count,
"inline_shapes_count": inline_shapes_count,
"descriptions": image_descriptions,
"has_images": image_count > 0
}
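Both `extract_images_as_base64` and `_extract_images_info` rely on the fact that a `.docx` file is a ZIP archive whose embedded media live under `word/media/`. A minimal, self-contained sketch of that idea follows. The helper names `list_media_as_base64` and `make_fake_docx` are hypothetical, and the archive built here only imitates the `.docx` layout; it is not a valid Word document:

```python
import base64
import io
import zipfile
from typing import Dict, List

MIME_TYPES = {
    "png": "image/png",
    "jpg": "image/jpeg",
    "jpeg": "image/jpeg",
    "gif": "image/gif",
    "bmp": "image/bmp",
}


def list_media_as_base64(docx_bytes: bytes) -> List[Dict[str, object]]:
    """Return every entry under word/media/ as base64 (hypothetical helper)."""
    images: List[Dict[str, object]] = []
    with zipfile.ZipFile(io.BytesIO(docx_bytes), "r") as zf:
        for name in zf.namelist():
            # Skip everything outside word/media/ and any directory entries.
            if not name.startswith("word/media/") or name.endswith("/"):
                continue
            ext = name.rsplit(".", 1)[-1].lower()
            data = zf.read(name)
            images.append({
                "filename": name,
                # Unknown extensions fall back to image/png, as in the diff.
                "mime_type": MIME_TYPES.get(ext, "image/png"),
                "base64": base64.b64encode(data).decode("ascii"),
                "size": len(data),
            })
    return images


def make_fake_docx() -> bytes:
    """Build a tiny ZIP that imitates the .docx layout, for demonstration only."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("word/document.xml", "<w:document/>")
        zf.writestr("word/media/image1.png", b"\x89PNG fake bytes")
    return buf.getvalue()


images = list_media_as_base64(make_fake_docx())
print(images[0]["filename"], images[0]["mime_type"], images[0]["size"])
```

One thing the sketch makes visible: files under `word/media/` can include formats such as EMF or WMF, which the PNG fallback would mislabel, so a caller that forwards these to a vision model may want to filter by extension first.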
def _infer_field_type_from_hint(self, hint: str) -> str:
"""
从提示词推断字段类型

View File

@@ -0,0 +1,15 @@
"""
Instruction execution module
Note: this module is optional and is not implemented yet.
To enable it, implement intent_parser.py and executor.py
"""
from .intent_parser import IntentParser, DefaultIntentParser
from .executor import InstructionExecutor, DefaultInstructionExecutor
__all__ = [
"IntentParser",
"DefaultIntentParser",
"InstructionExecutor",
"DefaultInstructionExecutor",
]

View File

@@ -0,0 +1,35 @@
"""
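Instruction executor module
Turns natural-language instructions into executable operations
Note: this module is optional and is not implemented yet.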
"""
from abc import ABC, abstractmethod
from typing import Any, Dict
class InstructionExecutor(ABC):
"""Abstract base class for instruction executors"""
@abstractmethod
async def execute(self, instruction: str, context: Dict[str, Any]) -> Dict[str, Any]:
"""
Execute an instruction
Args:
instruction: The parsed instruction
context: Execution context
Returns:
Execution result
"""
pass
class DefaultInstructionExecutor(InstructionExecutor):
"""Default instruction executor"""
async def execute(self, instruction: str, context: Dict[str, Any]) -> Dict[str, Any]:
"""Not implemented yet"""
raise NotImplementedError("指令执行功能暂未实现")

View File

@@ -0,0 +1,34 @@
"""
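Intent parser module
Parses the user's natural-language instruction and identifies the intent and its parameters
Note: this module is optional and is not implemented yet.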
"""
from abc import ABC, abstractmethod
from typing import Any, Dict, Tuple
class IntentParser(ABC):
"""Abstract base class for intent parsers"""
@abstractmethod
async def parse(self, text: str) -> Tuple[str, Dict[str, Any]]:
"""
Parse a natural-language instruction
Args:
text: Natural-language input from the user
Returns:
(intent type, parameter dictionary)
"""
pass
class DefaultIntentParser(IntentParser):
"""Default intent parser"""
async def parse(self, text: str) -> Tuple[str, Dict[str, Any]]:
"""Not implemented yet"""
raise NotImplementedError("意图解析功能暂未实现")

View File

@@ -1,6 +1,13 @@
"""
FastAPI 应用主入口
"""
# ========== Suppress noisy MongoDB log output ==========
import logging
logging.getLogger("pymongo").setLevel(logging.WARNING)
logging.getLogger("pymongo.topology").setLevel(logging.WARNING)
logging.getLogger("urllib3").setLevel(logging.WARNING)
# ==============================================
import logging
import logging.handlers
import sys
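The logging block added at the top of this file raises the level of three third-party loggers. A small sketch of the same idea, written as a function so it can be called once at startup; the function name `quiet_noisy_loggers` and the child logger name used in the demonstration are hypothetical:

```python
import logging

NOISY_LOGGERS = ("pymongo", "pymongo.topology", "urllib3")


def quiet_noisy_loggers(level: int = logging.WARNING) -> None:
    """Raise the level of chatty third-party loggers (hypothetical helper)."""
    for name in NOISY_LOGGERS:
        logging.getLogger(name).setLevel(level)


quiet_noisy_loggers()
# A child logger without its own level inherits the effective level of its parent.
child = logging.getLogger("pymongo.some_child")
print(child.getEffectiveLevel() == logging.WARNING)
```

Because child loggers inherit from their parent unless a level is set on them explicitly, setting `pymongo` alone would normally cover `pymongo.topology` as well; listing both is harmless.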

View File

@@ -65,7 +65,17 @@ class LLMService:
return response.json()
except httpx.HTTPStatusError as e:
logger.error(f"LLM API 请求失败: {e.response.status_code} - {e.response.text}")
error_detail = e.response.text
logger.error(f"LLM API 请求失败: {e.response.status_code} - {error_detail}")
# Try to parse the structured error details
try:
import json
err_json = json.loads(error_detail)
err_code = err_json.get("error", {}).get("code", "unknown")
err_msg = err_json.get("error", {}).get("message", "unknown")
logger.error(f"API 错误码: {err_code}, 错误信息: {err_msg}")
except Exception:
# The error body is not the expected JSON; the raw text was already logged above
pass
raise
except Exception as e:
logger.error(f"LLM API 调用异常: {str(e)}")
@@ -328,6 +338,154 @@ Excel 数据概览:
"analysis": None
}
async def chat_with_images(
self,
text: str,
images: List[Dict[str, str]],
temperature: float = 0.7,
max_tokens: Optional[int] = None
) -> Dict[str, Any]:
"""
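Call the vision model API (supports image input)
Args:
text: Text content
images: List of images; each item contains the base64 data and mime_type
Format: [{"base64": "...", "mime_type": "image/png"}, ...]
temperature: Sampling temperature
max_tokens: Maximum number of tokens
Returns:
Dict[str, Any]: API response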
"""
headers = {
"Authorization": f"Bearer {self.api_key}",
"Content-Type": "application/json"
}
# 构建图片内容
image_contents = []
for img in images:
image_contents.append({
"type": "image_url",
"image_url": {
"url": f"data:{img['mime_type']};base64,{img['base64']}"
}
})
# 构建消息
messages = [
{
"role": "user",
"content": [
{
"type": "text",
"text": text
},
*image_contents
]
}
]
payload = {
"model": self.model_name,
"messages": messages,
"temperature": temperature
}
if max_tokens:
payload["max_tokens"] = max_tokens
try:
async with httpx.AsyncClient(timeout=120.0) as client:
response = await client.post(
f"{self.base_url}/chat/completions",
headers=headers,
json=payload
)
response.raise_for_status()
return response.json()
except httpx.HTTPStatusError as e:
error_detail = e.response.text
logger.error(f"视觉模型 API 请求失败: {e.response.status_code} - {error_detail}")
# Try to parse the structured error details
try:
import json
err_json = json.loads(error_detail)
err_code = err_json.get("error", {}).get("code", "unknown")
err_msg = err_json.get("error", {}).get("message", "unknown")
logger.error(f"API 错误码: {err_code}, 错误信息: {err_msg}")
logger.error(f"请求模型: {self.model_name}, base_url: {self.base_url}")
except Exception:
# The error body is not the expected JSON; the raw text was already logged above
pass
raise
except Exception as e:
logger.error(f"视觉模型 API 调用异常: {str(e)}")
raise
async def analyze_images(
self,
images: List[Dict[str, str]],
user_prompt: str = ""
) -> Dict[str, Any]:
"""
Analyze image content (using the vision model)
Args:
images: List of images; each item contains the base64 data and mime_type
user_prompt: User prompt
Returns:
Dict[str, Any]: Analysis result
"""
prompt = f"""你是一个专业的视觉分析专家。请分析以下图片内容。
{user_prompt if user_prompt else "请详细描述图片中的内容,包括文字、数据、图表、流程等所有可见信息。"}
请按照以下 JSON 格式输出:
{{
"description": "图片内容的详细描述",
"text_content": "图片中的文字内容(如有)",
"data_extracted": {{"": ""}} // 如果图片中有表格或数据
}}
如果图片不包含有用信息,请返回空的描述。"""
try:
response = await self.chat_with_images(
text=prompt,
images=images,
temperature=0.1,
max_tokens=4000
)
content = self.extract_message_content(response)
# 解析 JSON
import json
try:
result = json.loads(content)
return {
"success": True,
"analysis": result,
"model": self.model_name
}
except json.JSONDecodeError:
return {
"success": True,
"analysis": {"description": content},
"model": self.model_name
}
except Exception as e:
logger.error(f"图片分析失败: {str(e)}")
return {
"success": False,
"error": str(e),
"analysis": None
}
# 全局单例
llm_service = LLMService()
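`chat_with_images` packs the prompt text and each image into a single user message, with images passed as `data:` URLs. A minimal sketch of just the payload construction, with no network call. The function name `build_vision_messages` is hypothetical, and whether a given endpoint accepts this message shape depends on the provider behind `base_url`:

```python
import base64
from typing import Any, Dict, List


def build_vision_messages(text: str, images: List[Dict[str, str]]) -> List[Dict[str, Any]]:
    """Build one user message holding a text part followed by image parts."""
    parts: List[Dict[str, Any]] = [{"type": "text", "text": text}]
    for img in images:
        parts.append({
            "type": "image_url",
            # Each image travels inline as a base64 data URL.
            "image_url": {"url": f"data:{img['mime_type']};base64,{img['base64']}"},
        })
    return [{"role": "user", "content": parts}]


fake_png = base64.b64encode(b"not a real image").decode("ascii")
messages = build_vision_messages(
    "Describe the chart.",
    [{"mime_type": "image/png", "base64": fake_png}],
)
print(len(messages[0]["content"]))
```

Since every image is inlined, request size grows with the total base64 length, so callers may want to cap the number or size of images taken from `extract_images_as_base64` before sending them.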

View File

@@ -3,7 +3,6 @@ RAG 服务模块 - 检索增强生成
使用 sentence-transformers + Faiss 实现向量检索
"""
import json
import logging
import os
import pickle
@@ -11,12 +10,20 @@ from typing import Any, Dict, List, Optional
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer
from app.config import settings
logger = logging.getLogger(__name__)
# 尝试导入 sentence-transformers
try:
from sentence_transformers import SentenceTransformer
SENTENCE_TRANSFORMERS_AVAILABLE = True
except ImportError as e:
logger.warning(f"sentence-transformers 导入失败: {e}")
SENTENCE_TRANSFORMERS_AVAILABLE = False
SentenceTransformer = None
class SimpleDocument:
"""简化文档对象"""
@@ -28,17 +35,24 @@ class SimpleDocument:
class RAGService:
"""RAG 检索增强服务"""
# Default chunking parameters
DEFAULT_CHUNK_SIZE = 500 # Size of each text chunk (in characters)
DEFAULT_CHUNK_OVERLAP = 50 # Overlap between consecutive chunks (in characters)
def __init__(self):
self.embedding_model: Optional[SentenceTransformer] = None
self.embedding_model = None
self.index: Optional[faiss.Index] = None
self.documents: List[Dict[str, Any]] = []
self.doc_ids: List[str] = []
self._dimension: int = 0
self._dimension: int = 384 # 默认维度
self._initialized = False
self._persist_dir = settings.FAISS_INDEX_DIR
# 临时禁用 RAG API 调用,仅记录日志
self._disabled = True
logger.info("RAG 服务已禁用_disabled=True仅记录索引操作日志")
# Check whether sentence-transformers is available
self._disabled = not SENTENCE_TRANSFORMERS_AVAILABLE
if self._disabled:
logger.warning("RAG 服务已禁用(sentence-transformers 不可用),将使用关键词匹配作为后备")
else:
logger.info("RAG 服务已启用")
def _init_embeddings(self):
"""初始化嵌入模型"""
@@ -88,6 +102,63 @@ class RAGService:
norms = np.where(norms == 0, 1, norms)
return vectors / norms
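def _split_into_chunks(self, text: str, chunk_size: Optional[int] = None, overlap: Optional[int] = None) -> List[str]:
"""
Split a long text into chunks
Args:
text: Text to split
chunk_size: Size of each chunk (in characters)
overlap: Number of characters shared by consecutive chunks
Returns:
List of text chunks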
"""
if chunk_size is None:
chunk_size = self.DEFAULT_CHUNK_SIZE
if overlap is None:
overlap = self.DEFAULT_CHUNK_OVERLAP
if len(text) <= chunk_size:
return [text] if text.strip() else []
chunks = []
start = 0
text_len = len(text)
while start < text_len:
# 计算当前块的结束位置
end = start + chunk_size
# 如果不是最后一块,尝试在句子边界处切割
if end < text_len:
# 向前查找最后一个句号、逗号、换行或分号
cut_positions = []
for i in range(end, max(start, end - 100), -1):
if text[i] in '。;,,\n':
cut_positions.append(i + 1)
break
if cut_positions:
end = cut_positions[0]
else:
# 如果没找到句子边界,尝试向后查找
for i in range(end, min(text_len, end + 50)):
if text[i] in '。;,,\n':
end = i + 1
break
chunk = text[start:end].strip()
if chunk:
chunks.append(chunk)
# 移动起始位置(考虑重叠)
start = end - overlap
if start <= 0:
start = end
return chunks
def index_field(
self,
table_name: str,
@@ -124,9 +195,20 @@ class RAGService:
self,
doc_id: str,
content: str,
metadata: Optional[Dict[str, Any]] = None
metadata: Optional[Dict[str, Any]] = None,
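chunk_size: Optional[int] = None,
chunk_overlap: Optional[int] = None
):
"""Index document content into the vector store"""
"""
Index document content into the vector store (with automatic chunking)
Args:
doc_id: Unique document identifier
content: Document content
metadata: Document metadata
chunk_size: Chunk size in characters (default 500)
chunk_overlap: Overlap between chunks in characters (default 50)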
"""
if self._disabled:
logger.info(f"[RAG DISABLED] 文档索引操作已跳过: {doc_id}")
return
@@ -139,18 +221,56 @@ class RAGService:
logger.debug(f"文档跳过索引 (无嵌入模型): {doc_id}")
return
doc = SimpleDocument(
page_content=content,
metadata=metadata or {"doc_id": doc_id}
)
self._add_documents([doc], [doc_id])
logger.debug(f"已索引文档: {doc_id}")
# 分割文档为小块
if chunk_size is None:
chunk_size = self.DEFAULT_CHUNK_SIZE
if chunk_overlap is None:
chunk_overlap = self.DEFAULT_CHUNK_OVERLAP
chunks = self._split_into_chunks(content, chunk_size, chunk_overlap)
if not chunks:
logger.warning(f"文档内容为空,跳过索引: {doc_id}")
return
# 为每个块创建文档对象
documents = []
chunk_ids = []
for i, chunk in enumerate(chunks):
chunk_id = f"{doc_id}_chunk_{i}"
chunk_metadata = metadata.copy() if metadata else {}
chunk_metadata.update({
"chunk_index": i,
"total_chunks": len(chunks),
"doc_id": doc_id
})
documents.append(SimpleDocument(
page_content=chunk,
metadata=chunk_metadata
))
chunk_ids.append(chunk_id)
# 批量添加文档
self._add_documents(documents, chunk_ids)
logger.info(f"已索引文档 {doc_id},共 {len(chunks)} 个块")
def _add_documents(self, documents: List[SimpleDocument], doc_ids: List[str]):
"""批量添加文档到向量索引"""
if not documents:
return
# 总是将文档存储在内存中(用于关键词搜索后备)
for doc, did in zip(documents, doc_ids):
self.documents.append({"id": did, "content": doc.page_content, "metadata": doc.metadata})
self.doc_ids.append(did)
# 如果没有嵌入模型,跳过向量索引
if self.embedding_model is None:
logger.debug(f"文档跳过向量索引 (无嵌入模型): {len(documents)} 个文档")
return
texts = [doc.page_content for doc in documents]
embeddings = self.embedding_model.encode(texts, convert_to_numpy=True)
embeddings = self._normalize_vectors(embeddings).astype('float32')
@@ -162,12 +282,18 @@ class RAGService:
id_array = np.array(id_list, dtype='int64')
self.index.add_with_ids(embeddings, id_array)
for doc, did in zip(documents, doc_ids):
self.documents.append({"id": did, "content": doc.page_content, "metadata": doc.metadata})
self.doc_ids.append(did)
def retrieve(self, query: str, top_k: int = 5, min_score: float = 0.3) -> List[Dict[str, Any]]:
"""
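Retrieve the document chunks relevant to a query
def retrieve(self, query: str, top_k: int = 5) -> List[Dict[str, Any]]:
"""Retrieve the documents relevant to a query"""
Args:
query: Query text
top_k: Maximum number of results to return
min_score: Minimum similarity score threshold
Returns:
List of relevant chunks; each item contains content, metadata, score, doc_id, chunk_index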
"""
if self._disabled:
logger.info(f"[RAG DISABLED] 检索操作已跳过: query={query}, top_k={top_k}")
return []
@@ -175,9 +301,9 @@ class RAGService:
if not self._initialized:
self._init_vector_store()
if self.index is None or self.index.ntotal == 0:
return []
# 优先使用向量检索
if self.index is not None and self.index.ntotal > 0 and self.embedding_model is not None:
try:
query_embedding = self.embedding_model.encode([query], convert_to_numpy=True)
query_embedding = self._normalize_vectors(query_embedding).astype('float32')
@@ -187,16 +313,101 @@ class RAGService:
for score, idx in zip(scores[0], indices[0]):
if idx < 0:
continue
if score < min_score:
continue
doc = self.documents[idx]
results.append({
"content": doc["content"],
"metadata": doc["metadata"],
"score": float(score),
"doc_id": doc["id"]
"doc_id": doc["id"],
"chunk_index": doc["metadata"].get("chunk_index", 0)
})
logger.debug(f"检索到 {len(results)} 条相关文档")
if results:
logger.debug(f"向量检索到 {len(results)} 条相关文档块")
return results
except Exception as e:
logger.warning(f"向量检索失败,使用关键词搜索后备: {e}")
# 后备:使用关键词搜索
logger.debug("使用关键词搜索后备方案")
return self._keyword_search(query, top_k)
def _keyword_search(self, query: str, top_k: int = 5) -> List[Dict[str, Any]]:
"""
Keyword search fallback
Args:
query: Query text
top_k: Maximum number of results to return
Returns:
List of relevant document chunks
"""
if not self.documents:
return []
# 提取查询关键词
keywords = []
for char in query:
if '\u4e00' <= char <= '\u9fff': # 中文字符
keywords.append(char)
# 添加英文单词
import re
english_words = re.findall(r'[a-zA-Z]+', query)
keywords.extend(english_words)
if not keywords:
return []
results = []
for doc in self.documents:
content = doc["content"]
# 计算关键词匹配分数
score = 0
matched_keywords = 0
for kw in keywords:
if kw in content:
score += 1
matched_keywords += 1
if matched_keywords > 0:
# 归一化分数
score = score / max(len(keywords), 1)
results.append({
"content": content,
"metadata": doc["metadata"],
"score": score,
"doc_id": doc["id"],
"chunk_index": doc["metadata"].get("chunk_index", 0)
})
# 按分数排序
results.sort(key=lambda x: x["score"], reverse=True)
logger.debug(f"关键词搜索返回 {len(results[:top_k])} 条结果")
return results[:top_k]
def retrieve_by_doc_id(self, doc_id: str, top_k: int = 10) -> List[Dict[str, Any]]:
"""
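Get the chunks of a given document, ordered by chunk_index
Args:
doc_id: Document ID
top_k: Maximum number of chunks to return
Returns:
Up to top_k chunks of that document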
"""
# 获取属于该文档的所有块
doc_chunks = [d for d in self.documents if d["metadata"].get("doc_id") == doc_id]
# 按 chunk_index 排序
doc_chunks.sort(key=lambda x: x["metadata"].get("chunk_index", 0))
# 返回指定数量
return doc_chunks[:top_k]
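The chunking added in `_split_into_chunks` slides a fixed-size window over the text, prefers to cut at a punctuation boundary, and steps back by `overlap` characters for the next window. The following is a simplified sketch of that technique, not the implementation from the diff: it omits the forward search for a boundary, stops as soon as the final chunk is emitted, and adds an explicit guard so that each window starts after the previous one. That guard is an addition made here, because with a `chunk_size` much smaller than the 100-character look-back the original loop appears able to move `start` backwards. The name `split_into_chunks` is hypothetical:

```python
from typing import List

BOUNDARIES = "。;,,\n"


def split_into_chunks(text: str, chunk_size: int = 500, overlap: int = 50) -> List[str]:
    """Split text into overlapping chunks, preferring punctuation boundaries."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    chunks: List[str] = []
    start, n = 0, len(text)
    while start < n:
        end = min(start + chunk_size, n)
        if end < n:
            # Look back at most 100 characters for a boundary, but never into
            # the overlap zone, so the next start is always past the current one.
            floor = max(start + overlap, end - 100)
            for i in range(end - 1, floor, -1):
                if text[i] in BOUNDARIES:
                    end = i + 1
                    break
        chunk = text[start:end].strip()
        if chunk:
            chunks.append(chunk)
        if end >= n:
            break  # final chunk emitted; avoids a trailing overlap-only chunk
        start = end - overlap
    return chunks


sample = "x" * 480 + "。" + "y" * 600
print([len(c) for c in split_into_chunks(sample)])
```

Each chunk in `index_document_content` gets the ID `f"{doc_id}_chunk_{i}"`, so a stable chunk order like the one produced here is what makes `chunk_index` meaningful when chunks are later reassembled by `retrieve_by_doc_id`.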
def retrieve_by_table(self, table_name: str, top_k: int = 5) -> List[Dict[str, Any]]:
"""检索指定表的字段"""

View File

@@ -3,6 +3,7 @@
从非结构化文档中检索信息并填写到表格模板
"""
import asyncio
import logging
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional
@@ -11,6 +12,7 @@ from app.core.database import mongodb
from app.services.llm_service import llm_service
from app.core.document_parser import ParserFactory
from app.services.markdown_ai_service import markdown_ai_service
from app.services.rag_service import rag_service
logger = logging.getLogger(__name__)
@@ -43,6 +45,7 @@ class FillResult:
value: Any = "" # 保留兼容
source: str = "" # 来源文档
confidence: float = 1.0 # 置信度
warning: Optional[str] = None # Multi-value notice
def __post_init__(self):
if self.values is None:
@@ -60,7 +63,10 @@ class TemplateFillService:
template_fields: List[TemplateField],
source_doc_ids: Optional[List[str]] = None,
source_file_paths: Optional[List[str]] = None,
user_hint: Optional[str] = None
user_hint: Optional[str] = None,
template_id: Optional[str] = None,
template_file_type: Optional[str] = "xlsx",
task_id: Optional[str] = None
) -> Dict[str, Any]:
"""
填写表格模板
@@ -70,6 +76,9 @@ class TemplateFillService:
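source_doc_ids: List of MongoDB IDs of the source documents
source_file_paths: List of source document file paths
user_hint: User hint (e.g. "extract from the contract documents")
template_id: Template file path (used to regenerate the headers)
template_file_type: Template file type
task_id: Optional task ID, used for tracking task progress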
Returns:
填写结果
@@ -94,33 +103,102 @@ class TemplateFillService:
if not source_docs:
logger.warning("没有找到源文档,填表结果将全部为空")
# 2. 对每个字段进行提取
for idx, field in enumerate(template_fields):
try:
logger.info(f"提取字段 [{idx+1}/{len(template_fields)}]: {field.name}")
# 从源文档中提取字段值
result = await self._extract_field_value(
field=field,
source_docs=source_docs,
user_hint=user_hint
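# 3. Check whether the headers need to be regenerated from the source documents
# Condition: source documents are loaded AND the existing fields look auto-generated (e.g. "字段1", "字段2")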
needs_regenerate_headers = (
len(source_docs) > 0 and
len(template_fields) > 0 and
all(self._is_auto_generated_field(f.name) for f in template_fields)
)
# 存储结果 - 使用 values 数组
filled_data[field.name] = result.values if result.values else [""]
fill_details.append({
"field": field.name,
"cell": field.cell,
"values": result.values,
"value": result.value,
"source": result.source,
"confidence": result.confidence
if needs_regenerate_headers:
logger.info(f"检测到自动生成表头,尝试使用源文档重新生成... (当前字段: {[f.name for f in template_fields]})")
# 将 SourceDocument 转换为 source_contents 格式
source_contents = []
for doc in source_docs:
structured = doc.structured_data if doc.structured_data else {}
# 获取标题
titles = structured.get("titles", [])
if not titles:
titles = []
# 获取表格
tables = structured.get("tables", [])
tables_count = len(tables) if tables else 0
# 生成表格摘要
tables_summary = ""
if tables:
tables_summary = "\n【文档中的表格】:\n"
for idx, table in enumerate(tables[:5]):
if isinstance(table, dict):
headers = table.get("headers", [])
rows = table.get("rows", [])
if headers:
tables_summary += f"表格{idx+1}表头: {', '.join(str(h) for h in headers)}\n"
if rows:
tables_summary += f"表格{idx+1}前3行: "
for row_idx, row in enumerate(rows[:3]):
if isinstance(row, list):
tables_summary += " | ".join(str(c) for c in row) + "; "
elif isinstance(row, dict):
tables_summary += " | ".join(str(row.get(h, "")) for h in headers if headers) + "; "
tables_summary += "\n"
source_contents.append({
"filename": doc.filename,
"doc_type": doc.doc_type,
"content": doc.content[:5000] if doc.content else "",
"titles": titles[:10] if titles else [],
"tables_count": tables_count,
"tables_summary": tables_summary
})
logger.info(f"字段 {field.name} 填写完成: {len(result.values)} 个值")
# 使用源文档内容重新生成表头
if template_id and template_file_type:
logger.info(f"使用源文档重新生成表头: template_id={template_id}, template_file_type={template_file_type}")
new_fields = await self.get_template_fields_from_file(
template_id,
template_file_type,
source_contents=source_contents
)
if new_fields and len(new_fields) > 0:
logger.info(f"成功重新生成表头: {[f.name for f in new_fields]}")
template_fields = new_fields
else:
logger.warning("重新生成表头返回空结果,使用原始字段")
else:
logger.warning("无法重新生成表头:缺少 template_id 或 template_file_type")
else:
if source_docs and template_fields:
logger.info(f"表头看起来正常(非自动生成),无需重新生成: {[f.name for f in template_fields[:5]]}")
except Exception as e:
logger.error(f"填写字段 {field.name} 失败: {str(e)}", exc_info=True)
filled_data[field.name] = [f"[提取失败: {str(e)}]"]
# 2. Extract all fields in parallel (skipping the AI review step for speed)
logger.info(f"开始并行提取 {len(template_fields)} 个字段...")
# Build one extraction coroutine per field
tasks = []
for idx, field in enumerate(template_fields):
task = self._extract_single_field_fast(
field=field,
source_docs=source_docs,
user_hint=user_hint,
field_idx=idx,
total_fields=len(template_fields)
)
tasks.append(task)
# 等待所有任务完成
results = await asyncio.gather(*tasks, return_exceptions=True)
# 处理结果
for idx, result in enumerate(results):
field = template_fields[idx]
if isinstance(result, Exception):
logger.error(f"提取字段 {field.name} 失败: {str(result)}")
filled_data[field.name] = [f"[提取失败: {str(result)}]"]
fill_details.append({
"field": field.name,
"cell": field.cell,
@@ -129,6 +207,18 @@ class TemplateFillService:
"source": "error",
"confidence": 0.0
})
else:
filled_data[field.name] = result.values if result.values else [""]
fill_details.append({
"field": field.name,
"cell": field.cell,
"values": result.values,
"value": result.value,
"source": result.source,
"confidence": result.confidence,
"warning": result.warning
})
logger.info(f"字段 {field.name} 填写完成: {len(result.values) if result.values else 0} 个值")
# 计算最大行数
max_rows = max(len(v) for v in filled_data.values()) if filled_data else 1
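The rewritten loop above replaces sequential per-field extraction with `asyncio.gather(..., return_exceptions=True)`, then maps each result back to its field by position. A minimal runnable sketch of that pattern follows; `extract_field` is a hypothetical stand-in for `_extract_single_field_fast`, and the field names are invented for the demonstration:

```python
import asyncio
from typing import Dict, List


async def extract_field(name: str) -> List[str]:
    """Stand-in for a per-field extraction coroutine (hypothetical)."""
    await asyncio.sleep(0)  # yield control, as a real LLM call would
    if name == "bad":
        raise ValueError("no data")
    return [f"{name}-value"]


async def fill(fields: List[str]) -> Dict[str, List[str]]:
    """Run all extractions concurrently and keep going when one fails."""
    results = await asyncio.gather(
        *(extract_field(name) for name in fields),
        return_exceptions=True,
    )
    filled: Dict[str, List[str]] = {}
    # gather returns results in the order the awaitables were passed in,
    # so they can be paired with the fields positionally.
    for name, result in zip(fields, results):
        if isinstance(result, Exception):
            filled[name] = [f"[extraction failed: {result}]"]
        else:
            filled[name] = result
    return filled


print(asyncio.run(fill(["hospitals", "bad", "libraries"])))
```

Note that `gather` launches every coroutine at once. With many template fields this means as many simultaneous LLM requests, so wrapping each call in an `asyncio.Semaphore` may be worth considering if the API enforces rate limits.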
@@ -457,6 +547,353 @@ class TemplateFillService:
confidence=0.0
)
async def _extract_single_field_fast(
self,
field: TemplateField,
source_docs: List[SourceDocument],
user_hint: Optional[str] = None,
field_idx: int = 0,
total_fields: int = 1
) -> FillResult:
"""
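Fast extraction of a single field (skips the AI review step to reduce LLM calls)
Args:
field: Field definition
source_docs: List of source documents
user_hint: User hint
field_idx: Index of the current field (for logging)
total_fields: Total number of fields (for logging)
Returns:
Extraction result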
"""
try:
if not source_docs:
return FillResult(
field=field.name,
value="",
values=[""],
source="无源文档",
confidence=0.0
)
# 1. 优先尝试直接从结构化数据中提取(最快路径)
direct_values = self._extract_values_from_structured_data(source_docs, field.name)
if direct_values:
logger.info(f"✅ [{field_idx+1}/{total_fields}] 字段 {field.name} 直接从结构化数据提取到 {len(direct_values)} 个值")
return FillResult(
field=field.name,
values=direct_values,
value=direct_values[0] if direct_values else "",
source="结构化数据直接提取",
confidence=1.0
)
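# 2. Nothing could be taken directly from structured data; fall back to simplified AI extraction
logger.info(f"🔍 [{field_idx+1}/{total_fields}] 字段 {field.name} 尝试AI提取...")
# Build the prompt (simplified version)
hint_text = field.hint if field.hint else f"请提取{field.name}的信息"
if user_hint:
hint_text = f"{user_hint}{hint_text}"
# Prefer content retrieved by RAG; otherwise use the beginning of the document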
context_parts = []
for doc in source_docs:
if not doc.content:
logger.info(f" 文档 {doc.filename} 无content内容")
continue
logger.info(f" 处理文档: {doc.filename}, doc_id={doc.doc_id}, content长度={len(doc.content)}")
# 尝试 RAG 检索
rag_results = rag_service.retrieve(
query=f"{field.name} {hint_text}",
top_k=3,
min_score=0.1
)
if rag_results:
logger.info(f" RAG检索到 {len(rag_results)} 条结果")
# 使用 RAG 检索到的内容
for r in rag_results:
rag_doc_id = r.get("doc_id", "")
if rag_doc_id.startswith(doc.doc_id):
context_parts.append(r["content"])
logger.info(f" 匹配成功使用RAG内容长度={len(r['content'])}")
else:
# RAG 没结果,使用文档内容开头
context_parts.append(doc.content[:2500])
logger.info(f" RAG无结果使用文档开头 {min(2500, len(doc.content))} 字符")
context = "\n\n".join(context_parts[:3]) if context_parts else ""
logger.info(f" 最终context长度: {len(context)}, 内容预览: {context[:200] if context else ''}...")
prompt = f"""你是一个专业的数据提取专家。请严格按照表头字段「{field.name}」从文档中提取数据。
提示: {hint_text}
【重要规则 - 必须遵守】
1. **每个值必须有标注**:根据数据来源添加合适的标注前缀!
- ✅ 正确格式:
- "2024年38710个"
- "北京1234万人次"
- "某省5678万人"
- "公立医院11754个"
- "三级医院4111个"
- "图书馆3246个"
- ❌ 错误格式:"38710个"(缺少标注)
2. **标注类型根据数据决定**
- 年份类数据 → "2024年xxx""2023年xxx"
- 地区类数据 → "北京xxx""广东xxx""某县xxx"
- 机构/分类数据 → "公立医院xxx""三级医院xxx""图书馆xxx"
- 其他分类 → 根据实际情况标注
3. **严格按表头提取**:只提取与「{field.name}」直接相关的数据
4. **多值必须全部提取并标注**:如果文档中提到多个相关数据,每个都要有标注
文档内容:
{context if context else "(无文档内容)"}
请严格按格式返回JSON{{"values": ["标注:数值", "标注:数值", ...]}}
注意values数组中每个元素都必须包含标注前缀不能只有数值
"""
messages = [
{"role": "system", "content": "你是一个专业的数据提取助手擅长从政府统计公报中提取数据。严格按JSON格式输出只返回values数组。"},
{"role": "user", "content": prompt}
]
response = await self.llm.chat(
messages=messages,
temperature=0.1,
max_tokens=1000
)
content = self.llm.extract_message_content(response)
logger.info(f" LLM原始返回: {content[:500]}")
# Parse the JSON
import json
import re
cleaned = content.strip()
# Locate the start of the JSON object (-1 if there is none)
json_start = cleaned.find('{')
values = []
source = "AI提取"
if json_start >= 0:
try:
json_text = cleaned[json_start:]
result = json.loads(json_text)
values = result.get("values", [])
logger.info(f" JSON解析成功values: {values}")
except json.JSONDecodeError as e:
logger.warning(f" JSON解析失败: {e},尝试修复...")
# Try to recover from common JSON problems
try:
# Try to locate the values array
values_match = re.search(r'"values"\s*:\s*\[(.*?)\]', cleaned, re.DOTALL)
if values_match:
values_str = values_match.group(1)
# Pull the quoted strings out of the array
values = re.findall(r'"([^"]*)"', values_str)
logger.info(f" 正则提取values: {values}")
except Exception:
pass
# 如果values为空尝试从文本中用正则提取数字+单位
if not values or values == [""]:
logger.info(f" JSON解析未获取到值尝试正则提取...")
# 匹配数字+单位或百分号的模式
patterns = [
r'(\d+\.?\d*[亿万千百十个]?[%‰℃℃万元亿]?)', # 通用数字+单位
r'(\d+\.?\d*%)', # 百分号
r'(\d+\.?\d*[个万人亿元]?)', # 中文单位
]
for pattern in patterns:
matches = re.findall(pattern, context)
if matches:
# 过滤掉纯数字
filtered = [m for m in matches if not m.replace('.', '').isdigit()]
if filtered:
values = filtered[:10] # 最多取10个
logger.info(f" 正则提取到: {values}")
break
if not values or values == [""]:
values = self._extract_values_by_regex(cleaned)
if not values:
values = [""]
# 生成多值提示(基于实际检测到的值数量)
warning = ""
if len(values) > 1:
warning = f"⚠️ 检测到 {len(values)} 个值:{values[:5]}{'...' if len(values) > 5 else ''}"
logger.info(f"✅ [{field_idx+1}/{total_fields}] 字段 {field.name} AI提取完成: {len(values)} 个值")
if warning:
logger.info(f" {warning}")
return FillResult(
field=field.name,
values=values,
value=values[0] if values else "",
source=source,
confidence=0.8,
warning=warning if warning else None
)
except Exception as e:
logger.error(f"❌ [{field_idx+1}/{total_fields}] 字段 {field.name} 提取失败: {str(e)}")
return FillResult(
field=field.name,
values=[""],
value="",
source=f"提取失败: {str(e)}",
confidence=0.0
)
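`_extract_single_field_fast` has to cope with LLM output that is not clean JSON: text before the object, a trailing code fence, or a stray comma. A sketch of one tolerant parsing strategy under those assumptions follows. It differs from the diff in using `json.JSONDecoder.raw_decode`, which parses one JSON value and ignores whatever follows it, before falling back to the same regular expressions the diff uses. The function name `parse_values` is hypothetical:

```python
import json
import re
from typing import List


def parse_values(raw: str) -> List[str]:
    """Best-effort extraction of the "values" list from an LLM reply."""
    start = raw.find("{")
    if start < 0:
        return []
    try:
        # raw_decode stops at the end of the first JSON value, so trailing
        # text such as a closing code fence does not cause a failure.
        obj, _ = json.JSONDecoder().raw_decode(raw[start:])
        values = obj.get("values", []) if isinstance(obj, dict) else []
        return [str(v) for v in values] if isinstance(values, list) else []
    except json.JSONDecodeError:
        pass
    # Fallback: pull quoted strings out of a "values": [...] fragment.
    match = re.search(r'"values"\s*:\s*\[(.*?)\]', raw, re.DOTALL)
    return re.findall(r'"([^"]*)"', match.group(1)) if match else []


reply = '```json\n{"values": ["2024年:38710个", "2023年:36570个"]}\n```'
print(parse_values(reply))
```

The regex fallback only handles flat arrays of strings without escaped quotes, which matches the `{"values": [...]}` shape the prompt asks for but would not survive nested structures.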
async def _verify_field_value(
self,
field: TemplateField,
extracted_values: List[str],
source_docs: List[SourceDocument],
user_hint: Optional[str] = None
) -> Optional[FillResult]:
"""
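Verify and, if needed, correct the extracted field values
Args:
field: Field definition
extracted_values: Values already extracted
source_docs: List of source documents
user_hint: User hint
Returns:
The verified result; None if verification passes (the original result is used)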
"""
if not extracted_values or not extracted_values[0]:
return None
if not source_docs:
return None
try:
# 构建验证上下文
context_text = self._build_context_text(source_docs, field_name=field.name, max_length=15000)
hint_text = field.hint if field.hint else f"请理解{field.name}字段的含义"
if user_hint:
hint_text = f"{user_hint}{hint_text}"
prompt = f"""你是一个数据质量审核专家。请审核以下提取的数据是否合理。
【待审核字段】
字段名:{field.name}
字段说明:{hint_text}
【已提取的值】
{extracted_values[:10]} # 最多审核前10个值
【源文档上下文】
{context_text[:8000]}
【审核要求】
1. 这些值是否符合字段的含义?
2. 值在原文中的原始含义是什么?检查是否有误解或误提取
3. 是否存在明显错误、空值或不合理的数据?
4. 如果表格有多个列,请确认提取的是正确的列
请严格按照以下 JSON 格式输出(只需输出 JSON不要其他内容
{{
"is_valid": true或false,
"corrected_values": ["修正后的值列表"] 或 null如果无需修正,
"reason": "审核说明,解释判断理由",
"original_meaning": "值在原文中的原始含义描述"
}}
"""
messages = [
{"role": "system", "content": "你是一个严格的数据质量审核专家。请仔细核对原文和提取的值是否匹配。"},
{"role": "user", "content": prompt}
]
response = await self.llm.chat(
messages=messages,
temperature=0.2,
max_tokens=3000
)
content = self.llm.extract_message_content(response)
logger.info(f"字段 {field.name} 审核返回: {content[:300]}")
# 解析 JSON
import json
import re
cleaned = content.strip()
cleaned = re.sub(r'^```json\s*', '', cleaned, flags=re.MULTILINE)
cleaned = re.sub(r'^```\s*', '', cleaned, flags=re.MULTILINE)
cleaned = cleaned.strip()
json_start = cleaned.find('{')
if json_start == -1:
logger.warning(f"字段 {field.name} 审核:无法找到 JSON")
return None
json_text = cleaned[json_start:]
result = json.loads(json_text)
is_valid = result.get("is_valid", True)
corrected_values = result.get("corrected_values")
reason = result.get("reason", "")
original_meaning = result.get("original_meaning", "")
logger.info(f"字段 {field.name} 审核结果: is_valid={is_valid}, reason={reason[:100]}")
if not is_valid and corrected_values:
# 值有问题且有修正建议,使用修正后的值
logger.info(f"字段 {field.name} 使用修正后的值: {corrected_values[:5]}")
return FillResult(
field=field.name,
values=corrected_values,
value=corrected_values[0] if corrected_values else "",
source=f"AI审核修正: {reason[:100]}",
confidence=0.7
)
elif not is_valid and original_meaning:
# 值有问题但无修正,记录原始含义供用户参考
logger.info(f"字段 {field.name} 审核发现问题: {original_meaning}")
return FillResult(
field=field.name,
values=extracted_values,
value=extracted_values[0] if extracted_values else "",
source=f"AI审核疑问: {original_meaning[:100]}",
confidence=0.5
)
# 验证通过,返回 None 表示使用原结果
return None
except Exception as e:
logger.error(f"字段 {field.name} 审核失败: {str(e)}")
return None
def _build_context_text(self, source_docs: List[SourceDocument], field_name: str = None, max_length: int = 8000) -> str:
"""
构建上下文文本
@@ -947,13 +1384,148 @@ class TemplateFillService:
values = []
for row in rows:
if isinstance(row, list) and target_idx < len(row):
if isinstance(row, list):
# Skip sub-header rows (rows made up mostly of year values such as "1985", "1995")
if self._is_year_subheader_row(row):
logger.info(f"跳过子表头行: {row[:5]}...")
continue
# Skip section-header rows
if self._is_section_header_row(row):
logger.info(f"跳过章节标题行: {row[:5]}...")
continue
if target_idx < len(row):
val = row[target_idx]
else:
val = ""
else:
val = ""
values.append(self._format_value(val))
return values
# 过滤掉无效值(章节标题、省略号等)
valid_values = self._filter_valid_values(values)
if len(valid_values) < len(values):
logger.info(f"过滤无效值: {len(values)} -> {len(valid_values)}")
return valid_values
def _is_year_subheader_row(self, row: List) -> bool:
"""
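Detect whether a row looks like a year sub-header row
Such rows usually contain 4-digit years like "1985", "1995", "2020"
Args:
row: Row data
Returns:
Whether the row is a year sub-header row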
"""
if not row:
return False
import re
year_pattern = re.compile(r'^(19|20)\d{2}$')
# 计算看起来像年份的单元格数量
year_like_count = 0
for cell in row:
cell_str = str(cell).strip()
if year_pattern.match(cell_str):
year_like_count += 1
# 如果超过50%的单元格是年份格式,认为是子表头行
if len(row) > 0 and year_like_count / len(row) > 0.5:
return True
return False
def _is_section_header_row(self, row: List) -> bool:
"""
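Detect whether a row looks like a section-header row
Such rows usually contain keywords like "其中:", "全部工业中:", or "按...计算"
Args:
row: Row data
Returns:
Whether the row is a section-header row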
"""
if not row:
return False
import re
# 章节标题通常包含这些模式
section_patterns = [
r'其中[:]',
r'全部\w+中[:]',
r'\w+计算',
r'小计',
r'合计',
r'总计',
r'^其中$',
r'全部$'
]
for cell in row:
cell_str = str(cell).strip()
if not cell_str:
continue
for pattern in section_patterns:
if re.search(pattern, cell_str):
return True
return False
def _is_valid_data_value(self, val: str) -> bool:
"""
Check whether a value is a valid data value (not a section heading, an ellipsis, etc.).
Args:
val: the value string
Returns:
True if the value is valid data
"""
if not val or not str(val).strip():
return False
val_str = str(val).strip()
# Patterns that mark a value as invalid
invalid_patterns = [
r'^…$', # ellipsis
r'^[\.。]+$', # only dots or full stops
r'其中[:]', # section heading
r'全部\w+中', # section heading
r'\w+计算', # calculation-basis label
r'^(小计|合计|总计)$', # subtotal / total row
r'^其中$',
r'^全部$'
]
import re
for pattern in invalid_patterns:
if re.match(pattern, val_str):
return False
return True
def _filter_valid_values(self, values: List[str]) -> List[str]:
"""
Keep only the valid data values.
Args:
values: list of values
Returns:
A list containing only the valid values
"""
valid_values = []
for val in values:
if self._is_valid_data_value(val):
valid_values.append(val)
return valid_values
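The value filter can be sketched as free functions with the patterns compiled once. This is an illustrative sketch, not the service code: `INVALID_PATTERNS`, `is_valid_data_value` and `filter_valid_values` are names chosen for the example, and the patterns are copied from the diff as shown.

```python
import re
from typing import List

# Patterns copied from the diff; compiled once instead of on every call.
INVALID_PATTERNS = [re.compile(p) for p in (
    r"^…$",              # ellipsis
    r"^[\.。]+$",         # only dots or full stops
    r"其中[:]",           # section heading
    r"全部\w+中",         # section heading
    r"\w+计算",           # calculation-basis label
    r"^(小计|合计|总计)$",  # subtotal / total row
    r"^其中$",
    r"^全部$",
)]


def is_valid_data_value(val: object) -> bool:
    text = str(val).strip() if val is not None else ""
    if not text:
        return False
    # Pattern.match anchors at the start of the string, as re.match does in the diff.
    return not any(p.match(text) for p in INVALID_PATTERNS)


def filter_valid_values(values: List[str]) -> List[str]:
    return [v for v in values if is_valid_data_value(v)]
```

Note that `match` only tests from the start of the string, so a heading keyword that appears mid-value is not rejected; whether that is intended is not stated in the diff.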
def _find_best_matching_column(self, headers: List, field_name: str) -> Optional[int]:
"""
@@ -981,6 +1553,11 @@ class TemplateFillService:
header_str = str(header).strip()
header_lower = header_str.lower()
# Skip empty headers (e.g. when the first column is blank)
if not header_str:
logger.info(f"跳过空表头列: 索引 {idx}")
continue
# Strategy 1: exact match (case-insensitive)
if header_lower == field_lower:
return idx
@@ -1037,6 +1614,12 @@ class TemplateFillService:
values = []
for row in rows:
# Skip sub-header rows (rows made up mostly of year values such as "1985", "1995")
if isinstance(row, list) and self._is_year_subheader_row(row):
continue
# Skip section-heading rows
if isinstance(row, list) and self._is_section_header_row(row):
continue
if isinstance(row, dict):
val = row.get(target_col, "")
elif isinstance(row, list) and target_idx < len(row):
@@ -1045,7 +1628,12 @@ class TemplateFillService:
val = ""
values.append(self._format_value(val))
return values
# Filter out invalid values (section headings, ellipses, etc.)
valid_values = self._filter_valid_values(values)
if len(valid_values) < len(values):
logger.info(f"过滤无效值: {len(values)} -> {len(valid_values)}")
return valid_values
def _format_value(self, val: Any) -> str:
"""
@@ -1379,38 +1967,98 @@ class TemplateFillService:
if user_hint:
hint_text = f"{user_hint}{hint_text}"
# Build the prompt for field extraction
prompt = f"""你是一个专业的数据提取专家。请从以下文档内容中提取与"{field.name}"完全匹配的数据。
# Build the query text
query_text = f"{field.name} {hint_text}"
# Use RAG vector retrieval to fetch the relevant content chunks
rag_results = rag_service.retrieve(
query=query_text,
top_k=5,
min_score=0.3
)
# Build the context: prefer the RAG results and fall back to the raw content when nothing is retrieved
if rag_results:
# Use the relevant chunks returned by RAG
context_parts = []
for result in rag_results:
if result.get("doc_id", "").startswith(doc.doc_id) or not result.get("doc_id"):
context_parts.append(result["content"])
if context_parts:
retrieved_context = "\n\n---\n\n".join(context_parts)
logger.info(f"RAG 检索到 {len(context_parts)} 个相关块用于字段 {field.name}")
# Use the retrieved content (length-limited)
context_to_use = retrieved_context[:6000]
else:
# The RAG results do not belong to the current document; use the raw content
context_to_use = doc.content[:6000] if doc.content else ""
logger.info(f"字段 {field.name} 使用原始内容RAG结果不属于当前文档")
else:
# No RAG results; use the raw content
context_to_use = doc.content[:6000] if doc.content else ""
logger.info(f"字段 {field.name} 使用原始内容无RAG检索结果")
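The context selection above (prefer retrieved chunks from the current document, otherwise fall back to the raw content) can be sketched in isolation. This is a minimal sketch under assumptions: `choose_context` is a hypothetical helper, and the chunk id format `doc1_chunk0` in the usage note is invented for illustration; only the prefix test on `doc_id` comes from the diff.

```python
from typing import Dict, List


def choose_context(rag_results: List[Dict[str, str]], doc_id: str,
                   doc_content: str, limit: int = 6000) -> str:
    """Prefer retrieved chunks that belong to `doc_id`; otherwise use the raw content."""
    parts = [
        r["content"]
        for r in rag_results
        # Chunks without a doc_id are accepted, mirroring the condition in the diff.
        if not r.get("doc_id") or r["doc_id"].startswith(doc_id)
    ]
    if parts:
        return "\n\n---\n\n".join(parts)[:limit]
    # No usable chunk: fall back to the head of the raw document.
    return (doc_content or "")[:limit]
```

For example, with results tagged `doc1_chunk0` and `doc2_chunk3`, a call for `doc1` keeps only the first chunk, and a call with no matching chunk returns the truncated raw content.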
# Build the prompt for field extraction, with stronger semantic matching
prompt = f"""你是一个专业的数据提取专家。请从以下文档内容中进行**语义匹配**提取。
【重要】字段名: "{field.name}"
【重要】字段提示: {hint_text}
请严格按照以下步骤操作:
1. 在文档中搜索与"{field.name}"完全相同或高度相关的关键词
2. 找到后,提取该关键词后的数值(注意:只要数值,不要单位)
3. 如果是表格中的数据,直接提取该单元格的数值
4. 如果是段落描述,在关键词附近找数值
## 分类数据识别
文档中经常包含分类统计数据,格式如下:
### 1. 直接分类(用"其中:""中,"等分隔)
原文示例:
- "全国医疗卫生机构总数1093551个其中医院38710个基层医疗卫生机构1040023个"
→ 字段"医院数量" 应提取: 38710
→ 字段"基层医疗卫生机构数量" 应提取: 1040023
- "医院中公立医院11754个民营医院26956个"
→ 字段"公立医院数量" 应提取: 11754
→ 字段"民营医院数量" 应提取: 26956
### 2. 嵌套分类(用"按...分:""其中:"等结构)
原文示例:
- "医院按等级分三级医院4111个其中三级甲等医院1876个二级医院12294个"
→ 字段"三级医院数量" 应提取: 4111
→ 字段"三级甲等医院数量" 应提取: 1876
→ 字段"二级医院数量" 应提取: 12294
### 3. 匹配技巧
- "医院数量" 可匹配: "医院38710个""医院数量为"
- "公立医院数量" 可匹配: "公立医院11754个""公立医院有"
- 忽略"数量"""""等后缀的差异
- 数值可能紧跟关键词,也可能分开描述
## 提取规则
1. **全文搜索**:在文档的全部内容中搜索,不要只搜索开头部分
2. **分类定位**:找到包含该分类关键词的句子,理解其完整的数值
3. **保留单位**:提取数值时**要包含单位**
【重要】返回值规则:
- 返回数值,不要单位(如 "4.9" 而不是 "4.9万亿元"
- 如原文"4.9万亿元",返回 "4.9"
- 如原文"144000万册",返回 "144000"
- 如果是百分比如"增长7.7%",返回 "7.7"
- 如果没有找到完全匹配的数据,返回空数组
- **返回数值时必须包含单位**
- 如原文"公共图书馆3246个"提取时应返回 "3246个"
- 如原文"国内旅游收入4.9万亿元"提取时应返回 "4.9万亿元"
- 例如原文"注册护士585.5万人"提取时应返回 "585.5万人"
- 如果字段是"指标"类型,返回具体的指标名称文本(不带单位)
- 如果没有找到任何相关数据,返回空数组
文档内容:
{doc.content[:10000] if doc.content else ""}
{context_to_use}
请用严格的 JSON 格式返回:
{{
"values": ["值1", "值2", ...], // 只填数值,不要单位
"source": "数据来源说明",
"values": ["提取到的值1", "值2", ...],
"source": "数据来源说明从文档第X段提取",
"confidence": 0.0到1.0之间的置信度
}}
示例
- 如果字段是"图书馆总藏量(万册)"且文档说"图书总藏量14.4亿册",返回 values: ["144000"]
- 如果字段是"国内旅游收入(亿元)"且文档说"国内旅游收入4.9万亿元",返回 values: ["49000"]"""
【重要】即使是模糊匹配,也要
- 确保提取的内容确实来自文档
- source中准确说明数据来源位置"""
messages = [
{"role": "system", "content": "你是一个专业的数据提取助手擅长从政府统计公报等文档中提取数据。请严格按JSON格式输出。"},
@@ -1504,11 +2152,14 @@ class TemplateFillService:
import pandas as pd
# Read the Excel content and check whether it is empty
content_sample = ""
if file_type in ["xlsx", "xls"]:
df = pd.read_excel(file_path, header=None)
if df.shape[0] == 0 or df.shape[1] == 0:
logger.info("Excel 表格为空")
# Generate default fields
# Even if the Excel file is empty, still try to generate headers with AI when source documents exist
if not source_contents:
logger.info("Excel 为空且没有源文档,使用默认字段名")
return [TemplateField(
cell=self._column_to_cell(i),
name=f"字段{i+1}",
@@ -1516,7 +2167,9 @@ class TemplateFillService:
required=False,
hint="请填写此字段"
) for i in range(5)]
# Source documents exist; go on to generate headers with AI
logger.info("Excel 为空但有源文档,使用源文档内容生成表头...")
else:
# The sheet has data but no header row
if df.shape[1] > 0:
# Read the first row as a reference to see whether it is empty
@@ -1532,7 +2185,10 @@ class TemplateFillService:
# Call the AI to generate headers
# Generate headers from the source document content
source_info = ""
logger.info(f"[DEBUG] _generate_fields_with_ai received source_contents: {len(source_contents) if source_contents else 0} items")
if source_contents:
for sc in source_contents:
logger.info(f"[DEBUG] source doc: filename={sc.get('filename')}, content_len={len(sc.get('content', ''))}, titles={len(sc.get('titles', []))}, tables_count={sc.get('tables_count', 0)}, has_tables_summary={bool(sc.get('tables_summary'))}")
source_info = "\n\n【源文档内容摘要】(根据以下文档内容生成表头):\n"
for idx, src in enumerate(source_contents[:5]): # at most 5 source documents
filename = src.get("filename", f"文档{idx+1}")
@@ -1540,37 +2196,62 @@ class TemplateFillService:
content = src.get("content", "")[:3000] # cap the content length
titles = src.get("titles", [])[:10] # at most 10 titles
tables_count = src.get("tables_count", 0)
tables_summary = src.get("tables_summary", "")
source_info += f"\n--- 文档 {idx+1}: {filename} ({doc_type}) ---\n"
# Handle titles, which may be a list of strings or a list of dicts
if titles:
source_info += f"【章节标题】: {', '.join([t.get('text', '') for t in titles[:5]])}\n"
title_texts = []
for t in titles[:5]:
if isinstance(t, dict):
title_texts.append(t.get('text', ''))
else:
title_texts.append(str(t))
if title_texts:
source_info += f"【章节标题】: {', '.join(title_texts)}\n"
if tables_count > 0:
source_info += f"【包含表格数】: {tables_count}\n"
if tables_summary:
source_info += f"{tables_summary}\n"
if content:
source_info += f"内容预览】: {content[:1500]}...\n"
source_info += f"文档内容】前3000字符{content[:3000]}\n"
prompt = f"""你是一个专业的表格设计助手。请根据源文档内容生成合适的表格表头字段。
prompt = f"""你是一个专业的数据分析助手。请分析源文档中的所有数据,生成表格表头字段。
任务:用户有一些源文档(可能包含表格数据、统计信息等),需要填写到表格中。请分析源文档内容,生成适合的表头字段
任务:分析源文档,找出所有具体的数据指标及其分类
{source_info}
请生成5-10个简洁的表头字段名这些字段应该
1. 简洁明了,易于理解
2. 适合作为表格列标题
3. 直接对应源文档中的关键数据项
4. 字段之间有明显的区分度
【重要要求】
1. **只生成数据字段名**
- ✅ 正确示例:"医院数量""公立医院数量""病床使用率"
- ❌ 错误示例:"source""备注""说明""数据来源"
2. **识别所有数值数据**
- 例如:"医院38710个""病床使用率78.8%"
- 例如:"公立医院11754个""公立医院病床使用率84.8%"
3. **理解分类层级**
- 顶级分类:如"医院""基层医疗卫生机构"
- 二级分类:如"医院"下分为"公立医院""民营医院"
4. **生成字段**
- 字段名要简洁,如:"医院数量""病床使用率"
- 优先选择:总数 + 主要分类
5. **生成数量**
- 生成5-7个最有代表性的字段
请严格按照以下 JSON 格式输出(只需输出 JSON不要其他内容
{{
"fields": [
{{"name": "字段名1", "hint": "字段说明提示1"}},
{{"name": "字段名2", "hint": "字段说明提示2"}}
{{"name": "字段名1"}},
{{"name": "字段名2"}}
]
}}
"""
messages = [
{"role": "system", "content": "你是一个专业的表格设计助手。请严格按JSON格式输出。"},
{"role": "system", "content": "你是一个专业的表格设计助手。请严格按JSON格式输出只返回纯数据字段名不要source、备注、说明等辅助字段"},
{"role": "user", "content": prompt}
]
@@ -1609,14 +2290,22 @@ class TemplateFillService:
if result and "fields" in result:
fields = []
# Filter out non-data fields
skip_keywords = ["source", "来源", "备注", "说明", "备注列", "说明列", "data_source", "remark", "note", "description"]
for idx, f in enumerate(result["fields"]):
field_name = f.get("name", f"字段{idx+1}")
# Skip non-data fields
if any(kw in field_name.lower() for kw in skip_keywords):
logger.info(f"跳过非数据字段: {field_name}")
continue
fields.append(TemplateField(
cell=self._column_to_cell(idx),
name=f.get("name", f"字段{idx+1}"),
name=field_name,
field_type="text",
required=False,
hint=f.get("hint", "")
))
logger.info(f"AI 生成表头: {[f.name for f in fields]}")
return fields
except Exception as e:
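The keyword filter in this hunk keeps `cell=self._column_to_cell(idx)` tied to the position in the AI reply, so a skipped field appears to leave its column unused. The sketch below shows one possible alternative that numbers columns by the count of kept fields. It is an assumption-laden illustration: `column_letter`, `build_fields` and the `A1`-style cell string are hypothetical, since `_column_to_cell` is not shown in the diff.

```python
from typing import Dict, List

SKIP_KEYWORDS = ["source", "来源", "备注", "说明", "data_source", "remark", "note", "description"]


def column_letter(idx: int) -> str:
    """Convert a 0-based column index to a spreadsheet letter (0 -> A, 26 -> AA)."""
    letters = ""
    idx += 1
    while idx > 0:
        idx, rem = divmod(idx - 1, 26)
        letters = chr(ord("A") + rem) + letters
    return letters


def build_fields(raw_fields: List[Dict[str, str]]) -> List[Dict[str, str]]:
    fields: List[Dict[str, str]] = []
    for f in raw_fields:
        name = f.get("name", "")
        if any(kw in name.lower() for kw in SKIP_KEYWORDS):
            continue
        # Number columns by the count of kept fields so skipped names leave no gap.
        fields.append({"cell": f"{column_letter(len(fields))}1", "name": name})
    return fields
```

Whether gaps are acceptable depends on how the template is written out later, which this diff does not show.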


@@ -0,0 +1,637 @@
"""
Word document AI parsing service
Uses an LLM (GLM) to understand Word documents in depth and extract structured data
"""
import logging
from typing import Dict, Any, List, Optional
import json
from app.services.llm_service import llm_service
from app.core.document_parser.docx_parser import DocxParser
logger = logging.getLogger(__name__)
class WordAIService:
"""Word document AI parsing service"""
def __init__(self):
self.llm = llm_service
self.parser = DocxParser()
async def parse_word_with_ai(
self,
file_path: str,
user_hint: str = ""
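) -> Dict[str, Any]:
"""
Parse a Word document with AI and extract structured data.
Suited to pulling table data, key-value pairs and similar information out of unstructured Word documents.
Args:
file_path: path to the Word file
user_hint: user prompt describing the kind of content to extract
Returns:
Dict: the parse result containing the structured data
"""
try:
# 1. Extract the raw content with the basic parser first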
parse_result = self.parser.parse(file_path)
if not parse_result.success:
return {
"success": False,
"error": parse_result.error,
"structured_data": None
}
# 2. Get the raw data
raw_data = parse_result.data
paragraphs = raw_data.get("paragraphs", [])
paragraphs_with_style = raw_data.get("paragraphs_with_style", [])
tables = raw_data.get("tables", [])
content = raw_data.get("content", "")
images_info = raw_data.get("images", {})
metadata = parse_result.metadata or {}
image_count = images_info.get("image_count", 0)
image_descriptions = images_info.get("descriptions", [])
logger.info(f"Word 基础解析完成: {len(paragraphs)} 个段落, {len(tables)} 个表格, {image_count} 张图片")
# 3. Extract the image data (for visual analysis)
images_base64 = []
if image_count > 0:
try:
images_base64 = self.parser.extract_images_as_base64(file_path)
logger.info(f"提取到 {len(images_base64)} 张图片的 base64 数据")
except Exception as e:
logger.warning(f"提取图片 base64 失败: {str(e)}")
# 4. Choose the AI parsing strategy by content type
# If there are images, analyse them first
image_analysis = ""
if images_base64:
image_analysis = await self._analyze_images_with_ai(images_base64, user_hint)
logger.info(f"图片 AI 分析完成: {len(image_analysis)} 字符")
# Priority: tables > (tables + text) > plain text
if tables and len(tables) > 0:
structured_data = await self._extract_tables_with_ai(
tables, paragraphs, image_count, user_hint, metadata, image_analysis
)
elif paragraphs and len(paragraphs) > 0:
structured_data = await self._extract_from_text_with_ai(
paragraphs, content, image_count, image_descriptions, user_hint, image_analysis
)
else:
structured_data = {
"success": True,
"type": "empty",
"message": "文档内容为空"
}
# Attach the image analysis result
if image_analysis:
structured_data["image_analysis"] = image_analysis
return structured_data
except Exception as e:
logger.error(f"AI 解析 Word 文档失败: {str(e)}")
return {
"success": False,
"error": str(e),
"structured_data": None
}
async def _extract_tables_with_ai(
self,
tables: List[Dict],
paragraphs: List[str],
image_count: int,
user_hint: str,
metadata: Dict,
image_analysis: str = ""
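) -> Dict[str, Any]:
"""
Use AI to extract structured data from the tables and text of a Word document.
Args:
tables: list of tables
paragraphs: list of paragraphs
image_count: number of images
user_hint: user prompt
metadata: document metadata
image_analysis: result of the AI image analysis
Returns:
The structured data
"""
try:
# Build the text description of the tables
tables_text = self._build_tables_description(tables)
# Build the paragraph description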
paragraphs_text = "\n".join(paragraphs[:50]) if paragraphs else "(无正文文本)"
if len(paragraphs) > 50:
paragraphs_text += f"\n...(共 {len(paragraphs)} 个段落仅显示前50个"
# Image hint
image_hint = f"注意:此文档包含 {image_count} 张图片/图表。" if image_count > 0 else ""
prompt = f"""你是一个专业的数据提取专家。请从以下 Word 文档的完整内容中提取结构化数据。
【用户需求】
{user_hint if user_hint else "请提取文档中的所有结构化数据,包括表格数据、键值对、列表项等。"}
【文档正文(段落)】
{paragraphs_text}
【文档表格】
{tables_text}
【文档图片信息】
{image_hint}
请按照以下 JSON 格式输出:
{{
"type": "table_data",
"headers": ["列1", "列2", ...],
"rows": [["行1列1", "行1列2", ...], ["行2列1", "行2列2", ...], ...],
"key_values": {{"键1": "值1", "键2": "值2", ...}},
"list_items": ["项1", "项2", ...],
"description": "文档内容描述"
}}
重点:
- 优先从表格中提取结构化数据
- 如果表格中有表头headers 是表头rows 是数据行
- 如果文档中有键值对(如 名称: 张三),提取到 key_values 中
- 如果文档中有列表项,提取到 list_items 中
- 图片内容无法直接提取,但请在 description 中说明图片的大致主题(如"包含流程图""包含数据图表"等)
"""
messages = [
{"role": "system", "content": "你是一个专业的数据提取助手。请严格按JSON格式输出。"},
{"role": "user", "content": prompt}
]
response = await self.llm.chat(
messages=messages,
temperature=0.1,
max_tokens=50000
)
content = self.llm.extract_message_content(response)
# Parse the JSON
result = self._parse_json_response(content)
if result:
logger.info(f"AI 表格提取成功: {len(result.get('rows', []))} 行数据")
return {
"success": True,
"type": "table_data",
"headers": result.get("headers", []),
"rows": result.get("rows", []),
"description": result.get("description", "")
}
else:
# If the AI reply is malformed, fall back to parsing the tables directly
return self._fallback_table_parse(tables)
except Exception as e:
logger.error(f"AI 表格提取失败: {str(e)}")
return self._fallback_table_parse(tables)
async def _extract_from_text_with_ai(
self,
paragraphs: List[str],
full_text: str,
image_count: int,
image_descriptions: List[str],
user_hint: str,
image_analysis: str = ""
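) -> Dict[str, Any]:
"""
Use AI to extract structured data from the plain text of a Word document.
Args:
paragraphs: list of paragraphs
full_text: the full text
image_count: number of images
image_descriptions: list of image descriptions
user_hint: user prompt
image_analysis: result of the AI image analysis
Returns:
The structured data
"""
try:
# Cap the text length
text_preview = full_text[:8000] if len(full_text) > 8000 else full_text
# Image hint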
image_hint = f"\n【文档图片】此文档包含 {image_count} 张图片/图表。" if image_count > 0 else ""
if image_descriptions:
image_hint += "\n" + "\n".join(image_descriptions)
prompt = f"""你是一个专业的数据提取专家。请从以下 Word 文档的完整内容中提取结构化数据。
【用户需求】
{user_hint if user_hint else "请识别并提取文档中的关键信息,包括:表格数据、键值对、列表项等。"}
【文档正文】{image_hint}
{text_preview}
请按照以下 JSON 格式输出:
{{
"type": "structured_text",
"tables": [{{"headers": [...], "rows": [...]}}],
"key_values": {{"键1": "值1", "键2": "值2", ...}},
"list_items": ["项1", "项2", ...],
"summary": "文档内容摘要"
}}
重点:
- 如果文档包含表格数据,提取到 tables 中
- 如果文档包含键值对(如 名称: 张三),提取到 key_values 中
- 如果文档包含列表项,提取到 list_items 中
- 如果文档包含图片,请根据上下文推断图片内容(如"流程图"、"数据折线图"等)并在 summary 中说明
- 如果无法提取到结构化数据,至少提供一个详细的摘要
"""
messages = [
{"role": "system", "content": "你是一个专业的数据提取助手。请严格按JSON格式输出。"},
{"role": "user", "content": prompt}
]
response = await self.llm.chat(
messages=messages,
temperature=0.1,
max_tokens=50000
)
content = self.llm.extract_message_content(response)
result = self._parse_json_response(content)
if result:
logger.info(f"AI 文本提取成功: type={result.get('type')}")
return {
"success": True,
"type": result.get("type", "structured_text"),
"tables": result.get("tables", []),
"key_values": result.get("key_values", {}),
"list_items": result.get("list_items", []),
"summary": result.get("summary", ""),
"raw_text_preview": text_preview[:500]
}
else:
return {
"success": True,
"type": "text",
"summary": text_preview[:500],
"raw_text_preview": text_preview[:500]
}
except Exception as e:
logger.error(f"AI 文本提取失败: {str(e)}")
return {
"success": False,
"error": str(e)
}
async def _analyze_images_with_ai(
self,
images: List[Dict[str, str]],
user_hint: str = ""
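) -> str:
"""
Analyse the images in a Word document with a vision model.
Args:
images: list of images, each with base64 data and a mime_type
user_hint: user prompt
Returns:
The image analysis result as text
"""
try:
# Call the LLM's visual analysis feature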
result = await self.llm.analyze_images(
images=images,
user_prompt=user_hint or "请详细描述图片内容,提取所有文字和数据信息。"
)
if result.get("success"):
analysis = result.get("analysis", {})
if isinstance(analysis, dict):
description = analysis.get("description", "")
text_content = analysis.get("text_content", "")
data_extracted = analysis.get("data_extracted", {})
result_text = f"【图片分析结果】\n{description}"
if text_content:
result_text += f"\n\n【图片中的文字】\n{text_content}"
if data_extracted:
result_text += f"\n\n【提取的数据】\n{json.dumps(data_extracted, ensure_ascii=False)}"
return result_text
else:
return str(analysis)
else:
logger.warning(f"图片 AI 分析失败: {result.get('error')}")
return ""
except Exception as e:
logger.error(f"图片 AI 分析异常: {str(e)}")
return ""
def _build_tables_description(self, tables: List[Dict]) -> str:
"""Build a plain-text description of the tables."""
result = []
for idx, table in enumerate(tables):
rows = table.get("rows", [])
if not rows:
continue
result.append(f"\n--- 表格 {idx + 1} ---")
for row_idx, row in enumerate(rows[:50]): # at most 50 rows per table
if isinstance(row, list):
result.append(" | ".join(str(cell).strip() for cell in row))
elif isinstance(row, dict):
result.append(str(row))
if len(rows) > 50:
result.append(f"...(共 {len(rows)}仅显示前50行")
return "\n".join(result) if result else "(无表格内容)"
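For reference, the table flattening used to build the prompt can be sketched as a free function. `describe_tables` and its `max_rows` parameter are hypothetical names for this example; the English marker strings stand in for the Chinese ones in the service.

```python
from typing import Dict, List


def describe_tables(tables: List[Dict], max_rows: int = 50) -> str:
    """Flatten tables into pipe-separated text lines, truncating long tables."""
    lines: List[str] = []
    for idx, table in enumerate(tables, start=1):
        rows = table.get("rows", [])
        if not rows:
            continue  # empty tables are skipped but still counted in the numbering
        lines.append(f"--- Table {idx} ---")
        for row in rows[:max_rows]:
            if isinstance(row, list):
                lines.append(" | ".join(str(cell).strip() for cell in row))
            else:
                lines.append(str(row))
        if len(rows) > max_rows:
            lines.append(f"... ({len(rows)} rows in total, first {max_rows} shown)")
    return "\n".join(lines) if lines else "(no table content)"
```

Capping the rows per table keeps the prompt bounded, at the cost of the model never seeing rows past the cap.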
def _parse_json_response(self, content: str) -> Optional[Dict]:
"""Parse a JSON response, coping with common formatting problems."""
import re
# Strip markdown fences
cleaned = content.strip()
cleaned = re.sub(r'^```json\s*', '', cleaned, flags=re.MULTILINE)
cleaned = re.sub(r'^```\s*', '', cleaned, flags=re.MULTILINE)
cleaned = cleaned.strip()
# Find where the JSON starts
json_start = -1
for i, c in enumerate(cleaned):
if c == '{':
json_start = i
break
if json_start == -1:
logger.warning("无法找到 JSON 开始位置")
return None
json_text = cleaned[json_start:]
# Try parsing directly
try:
return json.loads(json_text)
except json.JSONDecodeError:
pass
# Try to repair, then parse
try:
# Find the matching closing brace
depth = 0
end_pos = -1
for i, c in enumerate(json_text):
if c == '{':
depth += 1
elif c == '}':
depth -= 1
if depth == 0:
end_pos = i + 1
break
if end_pos > 0:
fixed = json_text[:end_pos]
# Remove trailing commas before a closing brace or bracket
fixed = re.sub(r',\s*([}\]])', r'\1', fixed)
return json.loads(fixed)
except Exception as e:
logger.warning(f"JSON 修复失败: {e}")
return None
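The repair strategy in `_parse_json_response` (strip fences, cut at the brace that closes the first object, drop trailing commas) can be exercised on its own. The following is a minimal sketch, not the service method: `parse_llm_json` is a hypothetical name, and the brace scan is deliberately naive in that it does not account for braces or commas inside string values.

```python
import json
import re
from typing import Any, Dict, Optional


def parse_llm_json(content: str) -> Optional[Dict[str, Any]]:
    """Extract the first JSON object from an LLM reply, tolerating fences and trailing commas."""
    # Drop markdown code-fence lines such as ```json and ```.
    cleaned = re.sub(r"^```(?:json)?\s*$", "", content.strip(), flags=re.MULTILINE).strip()
    start = cleaned.find("{")
    if start == -1:
        return None
    text = cleaned[start:]
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        pass
    # Cut at the brace that closes the first object (naive: ignores braces inside strings).
    depth = 0
    for i, ch in enumerate(text):
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                text = text[: i + 1]
                break
    # Remove commas that directly precede a closing brace or bracket.
    text = re.sub(r",\s*([}\]])", r"\1", text)
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return None
```

The character class must be written `[}\]]`; an unescaped `[}]]` is parsed as the class `[}]` followed by a literal `]`, which only matches a comma before the two-character sequence `}]`.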
def _fallback_table_parse(self, tables: List[Dict]) -> Dict[str, Any]:
"""Parse the tables directly when AI parsing fails."""
if not tables:
return {
"success": True,
"type": "empty",
"data": {},
"message": "无表格内容"
}
all_rows = []
all_headers = None
for table in tables:
rows = table.get("rows", [])
if not rows:
continue
# Look for the real header row (skipping title rows)
header_row_idx = 0
for idx, row in enumerate(rows[:5]): # only check the first 5 rows
if not isinstance(row, list):
continue
# A row whose first cell starts with "表" and is long is probably a title row
first_cell = str(row[0]) if row else ""
if first_cell.startswith("表") and len(first_cell) > 15:
header_row_idx = idx + 1
continue
# A row with more than 3 empty cells is probably not a valid row
empty_count = sum(1 for cell in row if not str(cell).strip())
if empty_count > 3:
header_row_idx = idx + 1
continue
# Take the first row that looks like a header (short cells, mostly non-empty)
avg_len = sum(len(str(c)) for c in row) / len(row) if row else 0
if avg_len < 20: # headers are usually shorter than data rows
header_row_idx = idx
break
if header_row_idx >= len(rows):
continue
# Use the header row that was found
if rows and isinstance(rows[header_row_idx], list):
headers = rows[header_row_idx]
if all_headers is None:
all_headers = headers
# Data rows (starting after the header)
for row in rows[header_row_idx + 1:]:
if isinstance(row, list) and len(row) == len(headers):
all_rows.append(row)
if all_headers and all_rows:
return {
"success": True,
"type": "table_data",
"headers": all_headers,
"rows": all_rows,
"description": "直接从 Word 表格提取"
}
return {
"success": True,
"type": "raw",
"tables": tables,
"message": "表格数据未AI处理"
}
async def fill_template_with_ai(
self,
file_path: str,
template_fields: List[Dict[str, Any]],
user_hint: str = ""
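) -> Dict[str, Any]:
"""
Parse a Word document with AI and fill in the template.
This is the main entry point; calling it from the frontend is enough to:
1. Parse the Word document with AI
2. Extract data according to the template fields
3. Return the fill result
Args:
file_path: path to the Word file
template_fields: list of template fields [{"name": "field name", "hint": "prompt"}, ...]
user_hint: user prompt
Returns:
The fill result
"""
try:
# 1. Parse the document with AI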
parse_result = await self.parse_word_with_ai(file_path, user_hint)
if not parse_result.get("success"):
return {
"success": False,
"error": parse_result.get("error", "解析失败"),
"filled_data": {},
"source": "ai_parse_failed"
}
# 2. Extract data according to the type of the parse result
filled_data = {}
extract_details = []
parse_type = parse_result.get("type", "")
if parse_type == "table_data":
# Table data: match column names directly
headers = parse_result.get("headers", [])
rows = parse_result.get("rows", [])
for field in template_fields:
field_name = field.get("name", "")
values = self._extract_field_from_table(headers, rows, field_name)
filled_data[field_name] = values
extract_details.append({
"field": field_name,
"values": values,
"source": "ai_table_extraction",
"confidence": 0.9 if values else 0.0
})
elif parse_type == "structured_text":
# Structured text: try to extract from key_values and list_items
key_values = parse_result.get("key_values", {})
list_items = parse_result.get("list_items", [])
for field in template_fields:
field_name = field.get("name", "")
value = key_values.get(field_name, "")
if not value and list_items:
value = list_items[0] if list_items else ""
filled_data[field_name] = [value] if value else []
extract_details.append({
"field": field_name,
"values": [value] if value else [],
"source": "ai_text_extraction",
"confidence": 0.7 if value else 0.0
})
else:
# Other types: nothing to extract, so every field is left empty for later handling
for field in template_fields:
field_name = field.get("name", "")
filled_data[field_name] = []
extract_details.append({
"field": field_name,
"values": [],
"source": "no_ai_data",
"confidence": 0.0
})
# 3. Return the result
max_rows = max(len(v) for v in filled_data.values()) if filled_data else 1
return {
"success": True,
"filled_data": filled_data,
"fill_details": extract_details,
"ai_parse_result": {
"type": parse_type,
"description": parse_result.get("description", "")
},
"source_doc_count": 1,
"max_rows": max_rows
}
except Exception as e:
logger.error(f"AI 填表失败: {str(e)}")
return {
"success": False,
"error": str(e),
"filled_data": {},
"fill_details": []
}
def _extract_field_from_table(
self,
headers: List[str],
rows: List[List],
field_name: str
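) -> List[str]:
"""Extract the values of the given field from a table."""
# Find the matching column
target_col_idx = None
for col_idx, header in enumerate(headers):
header_text = str(header).strip().lower()
# Skip blank headers: an empty string is a substring of every field name and would always match
if not header_text:
continue
if field_name.lower() in header_text or header_text in field_name.lower():
target_col_idx = col_idx
break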
if target_col_idx is None:
return []
# Collect every value in that column
values = []
for row in rows:
if isinstance(row, list) and target_col_idx < len(row):
val = str(row[target_col_idx]).strip()
if val:
values.append(val)
return values
# Global singleton
word_ai_service = WordAIService()
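The column lookup in `_extract_field_from_table` relies on a two-way substring test between the field name and each header. The sketch below reproduces that idea as free functions so the blank-header pitfall is easy to see; `find_column` and `column_values` are hypothetical names used only for this illustration.

```python
from typing import List, Optional, Sequence


def find_column(headers: Sequence[object], field_name: str) -> Optional[int]:
    """Return the index of the first header that contains, or is contained in, field_name."""
    wanted = field_name.strip().lower()
    if not wanted:
        return None
    for idx, header in enumerate(headers):
        text = str(header).strip().lower()
        # Skip blank headers: "" is a substring of every string and would always match.
        if text and (wanted in text or text in wanted):
            return idx
    return None


def column_values(headers: Sequence[object], rows: List[list], field_name: str) -> List[str]:
    """Collect the non-empty values of the matched column, skipping short rows."""
    idx = find_column(headers, field_name)
    if idx is None:
        return []
    values = []
    for row in rows:
        if isinstance(row, list) and idx < len(row):
            val = str(row[idx]).strip()
            if val:
                values.append(val)
    return values
```

Substring matching is permissive: a field such as "医院数量" also matches a header like "医院数量(个)", but the first matching header wins, so header order matters when several headers share a prefix.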


@@ -1,4 +1,4 @@
# ============================================================
# ============================================================
# 基于大语言模型的文档理解与多源数据融合系统
# Python 依赖清单
# ============================================================


@@ -1,5 +1,5 @@
import { RouterProvider } from 'react-router-dom';
import { AuthProvider } from '@/context/AuthContext';
import { AuthProvider } from '@/contexts/AuthContext';
import { TemplateFillProvider } from '@/context/TemplateFillContext';
import { router } from '@/routes';
import { Toaster } from 'sonner';


@@ -1,6 +1,6 @@
import React from 'react';
import { Navigate, useLocation } from 'react-router-dom';
import { useAuth } from '@/context/AuthContext';
import { useAuth } from '@/contexts/AuthContext';
export const RouteGuard: React.FC<{ children: React.ReactNode }> = ({ children }) => {
const { user, loading } = useAuth();


@@ -1,85 +0,0 @@
import React, { createContext, useContext, useEffect, useState } from 'react';
import { supabase } from '@/db/supabase';
import { User } from '@supabase/supabase-js';
import { Profile } from '@/types/types';
interface AuthContextType {
user: User | null;
profile: Profile | null;
signIn: (email: string, password: string) => Promise<{ error: any }>;
signUp: (email: string, password: string) => Promise<{ error: any }>;
signOut: () => Promise<{ error: any }>;
loading: boolean;
}
const AuthContext = createContext<AuthContextType | undefined>(undefined);
export const AuthProvider: React.FC<{ children: React.ReactNode }> = ({ children }) => {
const [user, setUser] = useState<User | null>(null);
const [profile, setProfile] = useState<Profile | null>(null);
const [loading, setLoading] = useState(true);
useEffect(() => {
// Check active sessions and sets the user
supabase.auth.getSession().then(({ data: { session } }) => {
setUser(session?.user ?? null);
if (session?.user) fetchProfile(session.user.id);
else setLoading(false);
});
// Listen for changes on auth state (sign in, sign out, etc.)
const { data: { subscription } } = supabase.auth.onAuthStateChange((_event, session) => {
setUser(session?.user ?? null);
if (session?.user) fetchProfile(session.user.id);
else {
setProfile(null);
setLoading(false);
}
});
return () => subscription.unsubscribe();
}, []);
const fetchProfile = async (uid: string) => {
try {
const { data, error } = await supabase
.from('profiles')
.select('*')
.eq('id', uid)
.maybeSingle();
if (error) throw error;
setProfile(data);
} catch (err) {
console.error('Error fetching profile:', err);
} finally {
setLoading(false);
}
};
const signIn = async (email: string, password: string) => {
return await supabase.auth.signInWithPassword({ email, password });
};
const signUp = async (email: string, password: string) => {
return await supabase.auth.signUp({ email, password });
};
const signOut = async () => {
return await supabase.auth.signOut();
};
return (
<AuthContext.Provider value={{ user, profile, signIn, signUp, signOut, loading }}>
{children}
</AuthContext.Provider>
);
};
export const useAuth = () => {
const context = useContext(AuthContext);
if (context === undefined) {
throw new Error('useAuth must be used within an AuthProvider');
}
return context;
};


@@ -400,6 +400,49 @@ export const backendApi = {
}
},
/**
* Fetch the task history list
*/
async getTasks(
limit: number = 50,
skip: number = 0
): Promise<{ success: boolean; tasks: any[]; count: number }> {
const url = `${BACKEND_BASE_URL}/tasks?limit=${limit}&skip=${skip}`;
try {
const response = await fetch(url);
if (!response.ok) {
const error = await response.json();
throw new Error(error.detail || '获取任务列表失败');
}
return await response.json();
} catch (error) {
console.error('获取任务列表失败:', error);
throw error;
}
},
/**
* Delete a task
*/
async deleteTask(taskId: string): Promise<{ success: boolean; deleted: boolean }> {
const url = `${BACKEND_BASE_URL}/tasks/${taskId}`;
try {
const response = await fetch(url, {
method: 'DELETE'
});
if (!response.ok) {
const error = await response.json();
throw new Error(error.detail || '删除任务失败');
}
return await response.json();
} catch (error) {
console.error('删除任务失败:', error);
throw error;
}
},
/**
* Poll the task status until it completes
*/
@@ -764,6 +807,41 @@ export const backendApi = {
}
},
/**
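* Fill the original template and export it
*
* Opens the original template file, writes the data into its tables/cells, then exports the result
* Intended for the competition scenario: the original template formatting stays unchanged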
*/
async fillAndExportTemplate(
templatePath: string,
filledData: Record<string, any>,
format: 'xlsx' | 'docx' = 'xlsx'
): Promise<Blob> {
const url = `${BACKEND_BASE_URL}/templates/fill-and-export`;
try {
const response = await fetch(url, {
method: 'POST',
headers: { 'Content-Type': 'application/json' },
body: JSON.stringify({
template_path: templatePath,
filled_data: filledData,
format,
}),
});
if (!response.ok) {
const error = await response.json();
throw new Error(error.detail || '填充模板失败');
}
return await response.blob();
} catch (error) {
console.error('填充模板失败:', error);
throw error;
}
},
// ==================== Excel-specific endpoints (kept for compatibility) ====================
/**
@@ -1145,7 +1223,7 @@ export const aiApi = {
try {
const response = await fetch(url, {
method: 'GET',
method: 'POST',
body: formData,
});
@@ -1161,6 +1239,48 @@ export const aiApi = {
}
},
/**
* Upload a TXT file and analyse it with AI to extract structured data
*/
async analyzeTxt(
file: File
): Promise<{
success: boolean;
filename?: string;
structured_data?: {
table?: {
columns?: string[];
rows?: string[][];
};
summary?: string;
key_value_pairs?: Array<{ key: string; value: string }>;
numeric_data?: Array<{ name: string; value: number; unit?: string }>;
};
error?: string;
}> {
const formData = new FormData();
formData.append('file', file);
const url = `${BACKEND_BASE_URL}/ai/analyze/txt`;
try {
const response = await fetch(url, {
method: 'POST',
body: formData,
});
if (!response.ok) {
const error = await response.json();
throw new Error(error.detail || 'TXT AI 分析失败');
}
return await response.json();
} catch (error) {
console.error('TXT AI 分析失败:', error);
throw error;
}
},
/**
* Generate statistics and charts
*/
@@ -1259,4 +1379,84 @@ export const aiApi = {
throw error;
}
},
// ==================== Word AI parsing ====================
/**
* Parse a Word document with AI and extract structured data
*/
async analyzeWordWithAI(
file: File,
userHint: string = ''
): Promise<{
success: boolean;
type?: string;
headers?: string[];
rows?: string[][];
key_values?: Record<string, string>;
list_items?: string[];
summary?: string;
error?: string;
}> {
const formData = new FormData();
formData.append('file', file);
if (userHint) {
formData.append('user_hint', userHint);
}
const url = `${BACKEND_BASE_URL}/ai/analyze/word`;
try {
const response = await fetch(url, {
method: 'POST',
body: formData,
});
if (!response.ok) {
const error = await response.json();
throw new Error(error.detail || 'Word AI 解析失败');
}
return await response.json();
} catch (error) {
console.error('Word AI 解析失败:', error);
throw error;
}
},
/**
* Parse a Word document with AI and fill in the template
* Does both in one call: AI parsing + form filling
*/
async fillTemplateFromWordAI(
file: File,
templateFields: TemplateField[],
userHint: string = ''
): Promise<FillResult> {
const formData = new FormData();
formData.append('file', file);
formData.append('template_fields', JSON.stringify(templateFields));
if (userHint) {
formData.append('user_hint', userHint);
}
const url = `${BACKEND_BASE_URL}/ai/analyze/word/fill-template`;
try {
const response = await fetch(url, {
method: 'POST',
body: formData,
});
if (!response.ok) {
const error = await response.json();
throw new Error(error.detail || 'Word AI 填表失败');
}
return await response.json();
} catch (error) {
console.error('Word AI 填表失败:', error);
throw error;
}
},
};


@@ -766,6 +766,7 @@ const Documents: React.FC = () => {
<div
{...getRootProps()}
className="flex items-center justify-center gap-2 p-3 border-2 border-dashed rounded-lg cursor-pointer hover:border-primary/50 hover:bg-primary/5 transition-colors"
onClick={(e) => e.stopPropagation()}
>
<input {...getInputProps()} multiple={true} />
<Plus size={16} className="text-muted-foreground" />

File diff suppressed because it is too large


@@ -1,603 +0,0 @@
import React, { useState, useEffect } from 'react';
import {
TableProperties,
Plus,
FilePlus,
CheckCircle2,
Download,
Clock,
RefreshCcw,
Sparkles,
Zap,
FileCheck,
FileSpreadsheet,
Trash2,
ChevronDown,
ChevronUp,
BarChart3,
FileText,
TrendingUp,
Info,
AlertCircle,
Loader2
} from 'lucide-react';
import { Button } from '@/components/ui/button';
import { Card, CardContent, CardHeader, CardTitle, CardDescription, CardFooter } from '@/components/ui/card';
import { Badge } from '@/components/ui/badge';
import { useAuth } from '@/context/AuthContext';
import { templateApi, documentApi, taskApi } from '@/db/api';
import { backendApi, aiApi } from '@/db/backend-api';
import { supabase } from '@/db/supabase';
import { format } from 'date-fns';
import { toast } from 'sonner';
import { cn } from '@/lib/utils';
import { Skeleton } from '@/components/ui/skeleton';
import {
Dialog,
DialogContent,
DialogHeader,
DialogTitle,
DialogTrigger,
DialogFooter,
DialogDescription
} from '@/components/ui/dialog';
import { Checkbox } from '@/components/ui/checkbox';
import { ScrollArea } from '@/components/ui/scroll-area';
import { Input } from '@/components/ui/input';
import { Label } from '@/components/ui/label';
import { Textarea } from '@/components/ui/textarea';
import { Select, SelectContent, SelectItem, SelectTrigger, SelectValue } from '@/components/ui/select';
import { useDropzone } from 'react-dropzone';
import { Markdown } from '@/components/ui/markdown';
type Template = any;
type Document = any;
type FillTask = any;
const FormFill: React.FC = () => {
const { profile } = useAuth();
const [templates, setTemplates] = useState<Template[]>([]);
const [documents, setDocuments] = useState<Document[]>([]);
const [tasks, setTasks] = useState<any[]>([]);
const [loading, setLoading] = useState(true);
// Selection state
const [selectedTemplate, setSelectedTemplate] = useState<string | null>(null);
const [selectedDocs, setSelectedDocs] = useState<string[]>([]);
const [creating, setCreating] = useState(false);
const [openTaskDialog, setOpenTaskDialog] = useState(false);
const [viewingTask, setViewingTask] = useState<any | null>(null);
// Excel upload state
const [excelFile, setExcelFile] = useState<File | null>(null);
const [excelParseResult, setExcelParseResult] = useState<any>(null);
const [excelAnalysis, setExcelAnalysis] = useState<any>(null);
const [excelAnalyzing, setExcelAnalyzing] = useState(false);
const [expandedSheet, setExpandedSheet] = useState<string | null>(null);
const [aiOptions, setAiOptions] = useState({
userPrompt: '请分析这些数据,并提取关键信息用于填表,包括数值、分类、摘要等。',
analysisType: 'general' as 'general' | 'summary' | 'statistics' | 'insights'
});
const loadData = async () => {
if (!profile) return;
try {
const [t, d, ts] = await Promise.all([
templateApi.listTemplates((profile as any).id),
documentApi.listDocuments((profile as any).id),
taskApi.listTasks((profile as any).id)
]);
setTemplates(t);
setDocuments(d);
setTasks(ts);
} catch (err: any) {
toast.error('数据加载失败');
} finally {
setLoading(false);
}
};
useEffect(() => {
loadData();
}, [profile]);
// Excel upload handlers
const onExcelDrop = async (acceptedFiles: File[]) => {
const file = acceptedFiles[0];
if (!file) return;
if (!file.name.match(/\.(xlsx|xls)$/i)) {
toast.error('仅支持 .xlsx 和 .xls 格式的 Excel 文件');
return;
}
setExcelFile(file);
setExcelParseResult(null);
setExcelAnalysis(null);
setExpandedSheet(null);
try {
const result = await backendApi.uploadExcel(file);
if (result.success) {
toast.success(`Excel 解析成功: ${file.name}`);
setExcelParseResult(result);
} else {
toast.error(result.error || '解析失败');
}
} catch (error: any) {
toast.error(error.message || '上传失败');
}
};
const { getRootProps, getInputProps, isDragActive } = useDropzone({
onDrop: onExcelDrop,
accept: {
'application/vnd.openxmlformats-officedocument.spreadsheetml.sheet': ['.xlsx'],
'application/vnd.ms-excel': ['.xls']
},
maxFiles: 1
});
const handleAnalyzeExcel = async () => {
if (!excelFile || !excelParseResult?.success) {
toast.error('请先上传并解析 Excel 文件');
return;
}
setExcelAnalyzing(true);
setExcelAnalysis(null);
try {
const result = await aiApi.analyzeExcel(excelFile, {
userPrompt: aiOptions.userPrompt,
analysisType: aiOptions.analysisType
});
if (result.success) {
toast.success('AI 分析完成');
setExcelAnalysis(result);
} else {
toast.error(result.error || 'AI 分析失败');
}
} catch (error: any) {
toast.error(error.message || 'AI 分析失败');
} finally {
setExcelAnalyzing(false);
}
};
const handleUseExcelData = () => {
if (!excelParseResult?.success) {
toast.error('请先解析 Excel 文件');
return;
}
// Mark the parsed Excel data as a "document" and add it to the selection list
toast.success('Excel 数据已添加到数据源,请在任务对话框中选择');
// Logic to pass the Excel data to the backend for task creation could be added here
};
const handleDeleteExcel = () => {
setExcelFile(null);
setExcelParseResult(null);
setExcelAnalysis(null);
setExpandedSheet(null);
toast.success('Excel 文件已清除');
};
const handleUploadTemplate = async (e: React.ChangeEvent<HTMLInputElement>) => {
const file = e.target.files?.[0];
if (!file || !profile) return;
try {
toast.loading('正在上传模板...');
await templateApi.uploadTemplate(file, (profile as any).id);
toast.dismiss();
toast.success('模板上传成功');
loadData();
} catch (err) {
toast.dismiss();
toast.error('上传模板失败');
}
};
const handleCreateTask = async () => {
if (!profile || !selectedTemplate || selectedDocs.length === 0) {
toast.error('请先选择模板和数据源文档');
return;
}
setCreating(true);
try {
const task = await taskApi.createTask((profile as any).id, selectedTemplate, selectedDocs);
if (task) {
toast.success('任务已创建,正在进行智能填表...');
setOpenTaskDialog(false);
// Invoke edge function
supabase.functions.invoke('fill-template', {
body: { taskId: task.id }
}).then(({ error }) => {
if (error) toast.error('填表任务执行失败');
else {
toast.success('表格填写完成!');
loadData();
}
});
loadData();
}
} catch (err: any) {
toast.error('创建任务失败');
} finally {
setCreating(false);
}
};
const getStatusColor = (status: string) => {
switch (status) {
case 'completed': return 'bg-emerald-500 text-white';
case 'failed': return 'bg-destructive text-white';
default: return 'bg-amber-500 text-white';
}
};
const formatFileSize = (bytes: number): string => {
if (bytes === 0) return '0 B';
const k = 1024;
const sizes = ['B', 'KB', 'MB', 'GB'];
const i = Math.floor(Math.log(bytes) / Math.log(k));
return `${(bytes / Math.pow(k, i)).toFixed(2)} ${sizes[i]}`;
};
return (
<div className="space-y-8 animate-fade-in pb-10">
<section className="flex flex-col md:flex-row md:items-center justify-between gap-4">
<div className="space-y-1">
<h1 className="text-3xl font-extrabold tracking-tight"></h1>
<p className="text-muted-foreground"></p>
</div>
<div className="flex items-center gap-3">
<Dialog open={openTaskDialog} onOpenChange={setOpenTaskDialog}>
<DialogTrigger asChild>
<Button className="rounded-xl shadow-lg shadow-primary/20 gap-2 h-11 px-6">
<FilePlus size={18} />
<span></span>
</Button>
</DialogTrigger>
<DialogContent className="max-w-4xl max-h-[90vh] flex flex-col p-0 overflow-hidden border-none shadow-2xl rounded-3xl">
<DialogHeader className="p-8 pb-4 bg-muted/50">
<DialogTitle className="text-2xl font-bold flex items-center gap-2">
<Sparkles size={24} className="text-primary" />
</DialogTitle>
<DialogDescription>
AI
</DialogDescription>
</DialogHeader>
<ScrollArea className="flex-1 p-8 pt-4">
<div className="space-y-8">
{/* Step 1: Select Template */}
<div className="space-y-4">
<div className="flex items-center justify-between">
<h4 className="font-bold flex items-center gap-2 text-primary uppercase tracking-widest text-xs">
<span className="w-5 h-5 rounded-full bg-primary text-white flex items-center justify-center text-[10px]">1</span>
</h4>
<label className="cursor-pointer text-xs font-semibold text-primary hover:underline flex items-center gap-1">
<Plus size={12} />
<input type="file" className="hidden" onChange={handleUploadTemplate} accept=".docx,.xlsx" />
</label>
</div>
{templates.length > 0 ? (
<div className="grid grid-cols-1 sm:grid-cols-2 gap-3">
{templates.map(t => (
<div
key={t.id}
className={cn(
"p-4 rounded-2xl border-2 transition-all cursor-pointer flex items-center gap-3 group relative overflow-hidden",
selectedTemplate === t.id ? "border-primary bg-primary/5" : "border-border hover:border-primary/50"
)}
onClick={() => setSelectedTemplate(t.id)}
>
<div className={cn(
"w-10 h-10 rounded-xl flex items-center justify-center shrink-0 transition-colors",
selectedTemplate === t.id ? "bg-primary text-white" : "bg-muted text-muted-foreground"
)}>
<TableProperties size={20} />
</div>
<div className="flex-1 min-w-0">
<p className="font-bold text-sm truncate">{t.name}</p>
<p className="text-[10px] text-muted-foreground uppercase">{t.type}</p>
</div>
{selectedTemplate === t.id && (
<div className="absolute top-0 right-0 w-8 h-8 bg-primary text-white flex items-center justify-center rounded-bl-xl">
<CheckCircle2 size={14} />
</div>
)}
</div>
))}
</div>
) : (
<div className="p-8 text-center bg-muted/30 rounded-2xl border border-dashed text-sm italic text-muted-foreground">
</div>
)}
</div>
{/* Step 2: Upload & Analyze Excel */}
<div className="space-y-4">
<h4 className="font-bold flex items-center gap-2 text-primary uppercase tracking-widest text-xs">
<span className="w-5 h-5 rounded-full bg-primary text-white flex items-center justify-center text-[10px]">1.5</span>
Excel
</h4>
<div className="bg-muted/20 rounded-2xl p-6">
{!excelFile ? (
<div
{...getRootProps()}
className={cn(
"border-2 border-dashed rounded-xl p-8 transition-all duration-300 flex flex-col items-center justify-center text-center cursor-pointer group",
isDragActive ? "border-primary bg-primary/5" : "border-muted-foreground/20 hover:border-primary/50 hover:bg-muted/30"
)}
>
<input {...getInputProps()} />
<div className="w-12 h-12 rounded-xl bg-primary/10 text-primary flex items-center justify-center mb-3 group-hover:scale-110 transition-transform">
<FileSpreadsheet size={24} />
</div>
<p className="font-semibold text-sm">
{isDragActive ? '释放以开始上传' : '点击或拖拽 Excel 文件'}
</p>
<p className="text-xs text-muted-foreground mt-1"> .xlsx .xls </p>
</div>
) : (
<div className="space-y-4">
<div className="flex items-center gap-3 p-3 bg-background rounded-xl">
<div className="w-10 h-10 rounded-lg bg-emerald-500/10 text-emerald-500 flex items-center justify-center">
<FileSpreadsheet size={20} />
</div>
<div className="flex-1 min-w-0">
<p className="font-semibold text-sm truncate">{excelFile.name}</p>
<p className="text-xs text-muted-foreground">{formatFileSize(excelFile.size)}</p>
</div>
<div className="flex gap-2">
<Button
variant="ghost"
size="icon"
className="text-destructive hover:bg-destructive/10"
onClick={handleDeleteExcel}
>
<Trash2 size={16} />
</Button>
</div>
</div>
{/* AI Analysis Options */}
{excelParseResult?.success && (
<div className="space-y-3">
<div className="space-y-2">
<Label htmlFor="analysis-type" className="text-xs"></Label>
<Select
value={aiOptions.analysisType}
onValueChange={(value: any) => setAiOptions({ ...aiOptions, analysisType: value })}
>
<SelectTrigger id="analysis-type" className="bg-background h-9 text-sm">
<SelectValue placeholder="选择分析类型" />
</SelectTrigger>
<SelectContent>
<SelectItem value="general"></SelectItem>
<SelectItem value="summary"></SelectItem>
<SelectItem value="statistics"></SelectItem>
<SelectItem value="insights"></SelectItem>
</SelectContent>
</Select>
</div>
<div className="space-y-2">
<Label htmlFor="user-prompt" className="text-xs"></Label>
<Textarea
id="user-prompt"
value={aiOptions.userPrompt}
onChange={(e) => setAiOptions({ ...aiOptions, userPrompt: e.target.value })}
className="bg-background resize-none text-sm"
rows={2}
/>
</div>
<Button
onClick={handleAnalyzeExcel}
disabled={excelAnalyzing}
className="w-full gap-2 h-9"
variant="outline"
>
{excelAnalyzing ? <Loader2 className="animate-spin" size={14} /> : <Sparkles size={14} />}
{excelAnalyzing ? '分析中...' : 'AI 分析'}
</Button>
{excelParseResult?.success && (
<Button
onClick={handleUseExcelData}
className="w-full gap-2 h-9"
>
<CheckCircle2 size={14} />
使
</Button>
)}
</div>
)}
{/* Excel Analysis Result */}
{excelAnalysis && (
<div className="mt-4 p-4 bg-background rounded-xl max-h-60 overflow-y-auto">
<div className="flex items-center gap-2 mb-3">
<Sparkles size={16} className="text-primary" />
<span className="font-semibold text-sm">AI </span>
</div>
<Markdown content={excelAnalysis.analysis?.analysis || ''} className="text-sm" />
</div>
)}
</div>
)}
</div>
</div>
{/* Step 3: Select Documents */}
<div className="space-y-4">
<h4 className="font-bold flex items-center gap-2 text-primary uppercase tracking-widest text-xs">
<span className="w-5 h-5 rounded-full bg-primary text-white flex items-center justify-center text-[10px]">2</span>
</h4>
{documents.filter(d => d.status === 'completed').length > 0 ? (
<div className="space-y-2 max-h-40 overflow-y-auto pr-2 custom-scrollbar">
{documents.filter(d => d.status === 'completed').map(doc => (
<div
key={doc.id}
className={cn(
"flex items-center gap-3 p-3 rounded-xl border transition-all cursor-pointer",
selectedDocs.includes(doc.id) ? "border-primary/50 bg-primary/5 shadow-sm" : "border-border hover:bg-muted/30"
)}
onClick={() => {
setSelectedDocs(prev =>
prev.includes(doc.id) ? prev.filter(id => id !== doc.id) : [...prev, doc.id]
);
}}
>
<Checkbox checked={selectedDocs.includes(doc.id)} onCheckedChange={() => {}} />
<div className="w-8 h-8 rounded-lg bg-blue-500/10 text-blue-500 flex items-center justify-center">
<Zap size={16} />
</div>
<span className="font-semibold text-sm truncate">{doc.name}</span>
</div>
))}
</div>
) : (
<div className="p-6 text-center bg-muted/30 rounded-xl border border-dashed text-xs italic text-muted-foreground">
</div>
)}
</div>
</div>
</ScrollArea>
<DialogFooter className="p-8 pt-4 bg-muted/20 border-t border-dashed">
<Button variant="outline" className="rounded-xl h-12 px-6" onClick={() => setOpenTaskDialog(false)}></Button>
<Button
className="rounded-xl h-12 px-8 shadow-lg shadow-primary/20 gap-2"
onClick={handleCreateTask}
disabled={creating || !selectedTemplate || (selectedDocs.length === 0 && !excelParseResult?.success)}
>
{creating ? <RefreshCcw className="animate-spin h-5 w-5" /> : <Zap className="h-5 w-5 fill-current" />}
<span></span>
</Button>
</DialogFooter>
</DialogContent>
</Dialog>
</div>
</section>
{/* Task List */}
<div className="grid grid-cols-1 md:grid-cols-2 lg:grid-cols-3 gap-6">
{loading ? (
Array.from({ length: 3 }).map((_, i) => (
<Skeleton key={i} className="h-48 w-full rounded-3xl bg-muted" />
))
) : tasks.length > 0 ? (
tasks.map((task) => (
<Card key={task.id} className="border-none shadow-md hover:shadow-xl transition-all group rounded-3xl overflow-hidden flex flex-col">
<div className="h-1.5 w-full" style={{ backgroundColor: task.status === 'completed' ? '#10b981' : task.status === 'failed' ? '#ef4444' : '#f59e0b' }} />
<CardHeader className="p-6 pb-2">
<div className="flex justify-between items-start mb-2">
<div className="w-12 h-12 rounded-2xl bg-emerald-500/10 text-emerald-500 flex items-center justify-center shadow-inner group-hover:scale-110 transition-transform">
<TableProperties size={24} />
</div>
<Badge className={cn("text-[10px] uppercase font-bold tracking-widest", getStatusColor(task.status))}>
{task.status === 'completed' ? '已完成' : task.status === 'failed' ? '失败' : '执行中'}
</Badge>
</div>
<CardTitle className="text-lg font-bold truncate group-hover:text-primary transition-colors">{task.templates?.name || '未知模板'}</CardTitle>
<CardDescription className="text-xs flex items-center gap-1 font-medium italic">
<Clock size={12} /> {format(new Date(task.created_at!), 'yyyy/MM/dd HH:mm')}
</CardDescription>
</CardHeader>
<CardContent className="p-6 pt-2 flex-1">
<div className="space-y-4">
<div className="flex flex-wrap gap-2">
<Badge variant="outline" className="bg-muted/50 border-none text-[10px] font-bold"> {task.document_ids?.length} </Badge>
</div>
{task.status === 'completed' && (
<div className="p-3 bg-emerald-500/5 rounded-2xl border border-emerald-500/10 flex items-center gap-3">
<CheckCircle2 className="text-emerald-500" size={18} />
<span className="text-xs font-semibold text-emerald-700"></span>
</div>
)}
</div>
</CardContent>
<CardFooter className="p-6 pt-0">
<Button
className="w-full rounded-2xl h-11 bg-primary group-hover:shadow-lg group-hover:shadow-primary/30 transition-all gap-2"
disabled={task.status !== 'completed'}
onClick={() => setViewingTask(task)}
>
<Download size={18} />
<span></span>
</Button>
</CardFooter>
</Card>
))
) : (
<div className="col-span-full py-24 flex flex-col items-center justify-center text-center space-y-6">
<div className="w-24 h-24 rounded-full bg-muted flex items-center justify-center text-muted-foreground/30 border-4 border-dashed">
<TableProperties size={48} />
</div>
<div className="space-y-2 max-w-sm">
<p className="text-2xl font-extrabold tracking-tight"></p>
<p className="text-muted-foreground text-sm"></p>
</div>
<Button className="rounded-xl h-12 px-8" onClick={() => setOpenTaskDialog(true)}></Button>
</div>
)}
</div>
{/* Task Result View Modal */}
<Dialog open={!!viewingTask} onOpenChange={(open) => !open && setViewingTask(null)}>
<DialogContent className="max-w-4xl max-h-[90vh] flex flex-col p-0 overflow-hidden border-none shadow-2xl rounded-3xl">
<DialogHeader className="p-8 pb-4 bg-primary text-primary-foreground">
<div className="flex items-center gap-3 mb-2">
<FileCheck size={28} />
<DialogTitle className="text-2xl font-extrabold"></DialogTitle>
</div>
<DialogDescription className="text-primary-foreground/80 italic">
{viewingTask?.document_ids?.length}
</DialogDescription>
</DialogHeader>
<ScrollArea className="flex-1 p-8 bg-muted/10">
<div className="prose dark:prose-invert max-w-none">
<div className="bg-card p-8 rounded-2xl shadow-sm border min-h-[400px]">
<Badge variant="outline" className="mb-4"></Badge>
<div className="whitespace-pre-wrap font-sans text-sm leading-relaxed">
<h2 className="text-xl font-bold mb-4"></h2>
<p className="text-muted-foreground mb-6"></p>
<div className="p-4 bg-muted/30 rounded-xl border border-dashed border-primary/20 italic">
...
</div>
<div className="mt-8 space-y-4">
<p className="font-semibold text-primary"> </p>
<p className="font-semibold text-primary"> </p>
<p className="font-semibold text-primary"> </p>
</div>
</div>
</div>
</div>
</ScrollArea>
<DialogFooter className="p-8 pt-4 border-t border-dashed">
<Button variant="outline" className="rounded-xl" onClick={() => setViewingTask(null)}></Button>
<Button className="rounded-xl px-8 gap-2 shadow-lg shadow-primary/20" onClick={() => toast.success("正在导出文件...")}>
<Download size={18} />
{viewingTask?.templates?.type?.toUpperCase() || '文件'}
</Button>
</DialogFooter>
</DialogContent>
</Dialog>
</div>
);
};
export default FormFill;


@@ -1,184 +0,0 @@
import React, { useState } from 'react';
import { useNavigate, useLocation } from 'react-router-dom';
import { useAuth } from '@/context/AuthContext';
import { Button } from '@/components/ui/button';
import { Input } from '@/components/ui/input';
import { Label } from '@/components/ui/label';
import { Card, CardContent, CardDescription, CardFooter, CardHeader, CardTitle } from '@/components/ui/card';
import { Tabs, TabsContent, TabsList, TabsTrigger } from '@/components/ui/tabs';
import { FileText, Lock, User, CheckCircle2, AlertCircle } from 'lucide-react';
import { toast } from 'sonner';
const Login: React.FC = () => {
const [username, setUsername] = useState('');
const [password, setPassword] = useState('');
const [loading, setLoading] = useState(false);
const { signIn, signUp } = useAuth();
const navigate = useNavigate();
const location = useLocation();
const handleLogin = async (e: React.FormEvent) => {
e.preventDefault();
if (!username || !password) return toast.error('请输入用户名和密码');
setLoading(true);
try {
const email = `${username}@miaoda.com`;
const { error } = await signIn(email, password);
if (error) throw error;
toast.success('登录成功');
navigate('/');
} catch (err: any) {
toast.error(err.message || '登录失败');
} finally {
setLoading(false);
}
};
const handleSignUp = async (e: React.FormEvent) => {
e.preventDefault();
if (!username || !password) return toast.error('请输入用户名和密码');
setLoading(true);
try {
const email = `${username}@miaoda.com`;
const { error } = await signUp(email, password);
if (error) throw error;
toast.success('注册成功,请登录');
} catch (err: any) {
toast.error(err.message || '注册失败');
} finally {
setLoading(false);
}
};
return (
<div className="min-h-screen flex items-center justify-center bg-[radial-gradient(ellipse_at_top_left,_var(--tw-gradient-stops))] from-primary/10 via-background to-background p-4 relative overflow-hidden">
{/* Decorative elements */}
<div className="absolute top-0 left-0 w-96 h-96 bg-primary/5 rounded-full blur-3xl -translate-x-1/2 -translate-y-1/2" />
<div className="absolute bottom-0 right-0 w-64 h-64 bg-primary/5 rounded-full blur-3xl translate-x-1/3 translate-y-1/3" />
<div className="w-full max-w-md space-y-8 relative animate-fade-in">
<div className="text-center space-y-2">
<div className="inline-flex items-center justify-center w-16 h-16 rounded-2xl bg-primary text-primary-foreground shadow-2xl shadow-primary/30 mb-4 animate-slide-in">
<FileText size={32} />
</div>
<h1 className="text-4xl font-extrabold tracking-tight gradient-text"></h1>
<p className="text-muted-foreground"></p>
</div>
<Card className="border-border/50 shadow-2xl backdrop-blur-sm bg-card/95">
<Tabs defaultValue="login" className="w-full">
<TabsList className="grid w-full grid-cols-2 rounded-t-xl h-12 bg-muted/50 p-1">
<TabsTrigger value="login" className="rounded-lg data-[state=active]:bg-background data-[state=active]:shadow-sm"></TabsTrigger>
<TabsTrigger value="signup" className="rounded-lg data-[state=active]:bg-background data-[state=active]:shadow-sm"></TabsTrigger>
</TabsList>
<TabsContent value="login">
<form onSubmit={handleLogin}>
<CardHeader>
<CardTitle></CardTitle>
<CardDescription>使</CardDescription>
</CardHeader>
<CardContent className="space-y-4">
<div className="space-y-2">
<Label htmlFor="username"></Label>
<div className="relative">
<User className="absolute left-3 top-2.5 h-4 w-4 text-muted-foreground" />
<Input
id="username"
placeholder="请输入用户名"
className="pl-9 bg-muted/30 border-none focus-visible:ring-primary"
value={username}
onChange={(e) => setUsername(e.target.value)}
/>
</div>
</div>
<div className="space-y-2">
<Label htmlFor="password"></Label>
<div className="relative">
<Lock className="absolute left-3 top-2.5 h-4 w-4 text-muted-foreground" />
<Input
id="password"
type="password"
placeholder="请输入密码"
className="pl-9 bg-muted/30 border-none focus-visible:ring-primary"
value={password}
onChange={(e) => setPassword(e.target.value)}
/>
</div>
</div>
</CardContent>
<CardFooter>
<Button className="w-full h-11 text-lg font-semibold rounded-xl" type="submit" disabled={loading}>
{loading ? '登录中...' : '立即登录'}
</Button>
</CardFooter>
</form>
</TabsContent>
<TabsContent value="signup">
<form onSubmit={handleSignUp}>
<CardHeader>
<CardTitle></CardTitle>
<CardDescription></CardDescription>
</CardHeader>
<CardContent className="space-y-4">
<div className="space-y-2">
<Label htmlFor="signup-username"></Label>
<div className="relative">
<User className="absolute left-3 top-2.5 h-4 w-4 text-muted-foreground" />
<Input
id="signup-username"
placeholder="仅字母、数字和下划线"
className="pl-9 bg-muted/30 border-none focus-visible:ring-primary"
value={username}
onChange={(e) => setUsername(e.target.value)}
/>
</div>
</div>
<div className="space-y-2">
<Label htmlFor="signup-password"></Label>
<div className="relative">
<Lock className="absolute left-3 top-2.5 h-4 w-4 text-muted-foreground" />
<Input
id="signup-password"
type="password"
placeholder="不少于 6 位"
className="pl-9 bg-muted/30 border-none focus-visible:ring-primary"
value={password}
onChange={(e) => setPassword(e.target.value)}
/>
</div>
</div>
</CardContent>
<CardFooter>
<Button className="w-full h-11 text-lg font-semibold rounded-xl" type="submit" disabled={loading}>
{loading ? '注册中...' : '注册账号'}
</Button>
</CardFooter>
</form>
</TabsContent>
</Tabs>
</Card>
<div className="grid grid-cols-2 gap-4 text-center text-xs text-muted-foreground">
<div className="flex flex-col items-center gap-1">
<CheckCircle2 size={16} className="text-primary" />
<span></span>
</div>
<div className="flex flex-col items-center gap-1">
<CheckCircle2 size={16} className="text-primary" />
<span></span>
</div>
</div>
<div className="text-center text-sm text-muted-foreground">
&copy; 2026 |
</div>
</div>
</div>
);
};
export default Login;


@@ -1,16 +0,0 @@
/**
* Sample Page
*/
import PageMeta from "../components/common/PageMeta";
export default function SamplePage() {
return (
<>
<PageMeta title="Home" description="Home Page Introduction" />
<div>
<h3>This is a sample page</h3>
</div>
</>
);
}


@@ -11,7 +11,8 @@ import {
ChevronDown,
ChevronUp,
Trash2,
AlertCircle
AlertCircle,
HelpCircle
} from 'lucide-react';
import { Card, CardContent, CardHeader, CardTitle, CardDescription } from '@/components/ui/card';
import { Button } from '@/components/ui/button';
@@ -24,9 +25,9 @@ import { Skeleton } from '@/components/ui/skeleton';
type Task = {
task_id: string;
status: 'pending' | 'processing' | 'success' | 'failure';
status: 'pending' | 'processing' | 'success' | 'failure' | 'unknown';
created_at: string;
completed_at?: string;
updated_at?: string;
message?: string;
result?: any;
error?: string;
@@ -38,54 +39,38 @@ const TaskHistory: React.FC = () => {
const [loading, setLoading] = useState(true);
const [expandedTask, setExpandedTask] = useState<string | null>(null);
// Mock data for demonstration
useEffect(() => {
// Mock task data; in practice this should be fetched from the backend
setTasks([
{
task_id: 'task-001',
status: 'success',
created_at: new Date(Date.now() - 3600000).toISOString(),
completed_at: new Date(Date.now() - 3500000).toISOString(),
task_type: 'document_parse',
message: '文档解析完成',
result: {
doc_id: 'doc-001',
filename: 'report_q1_2026.docx',
extracted_fields: ['标题', '作者', '日期', '金额']
// Fetch task history data
const fetchTasks = async () => {
try {
setLoading(true);
const response = await backendApi.getTasks(50, 0);
if (response.success && response.tasks) {
// Convert the backend data format to the frontend format
const convertedTasks: Task[] = response.tasks.map((t: any) => ({
task_id: t.task_id,
status: t.status || 'unknown',
created_at: t.created_at || new Date().toISOString(),
updated_at: t.updated_at,
message: t.message || '',
result: t.result,
error: t.error,
task_type: t.task_type || 'document_parse'
}));
setTasks(convertedTasks);
} else {
setTasks([]);
}
},
{
task_id: 'task-002',
status: 'success',
created_at: new Date(Date.now() - 7200000).toISOString(),
completed_at: new Date(Date.now() - 7100000).toISOString(),
task_type: 'excel_analysis',
message: 'Excel 分析完成',
result: {
filename: 'sales_data.xlsx',
row_count: 1250,
charts_generated: 3
}
},
{
task_id: 'task-003',
status: 'processing',
created_at: new Date(Date.now() - 600000).toISOString(),
task_type: 'template_fill',
message: '正在填充表格...'
},
{
task_id: 'task-004',
status: 'failure',
created_at: new Date(Date.now() - 86400000).toISOString(),
completed_at: new Date(Date.now() - 86390000).toISOString(),
task_type: 'document_parse',
message: '解析失败',
error: '文件格式不支持或文件已损坏'
}
]);
} catch (error) {
console.error('获取任务列表失败:', error);
toast.error('获取任务列表失败');
setTasks([]);
} finally {
setLoading(false);
}
};
useEffect(() => {
fetchTasks();
}, []);
const getStatusBadge = (status: string) => {
@@ -96,6 +81,8 @@ const TaskHistory: React.FC = () => {
return <Badge className="bg-destructive text-white text-[10px]"><XCircle size={12} className="mr-1" /></Badge>;
case 'processing':
return <Badge className="bg-amber-500 text-white text-[10px]"><Loader2 size={12} className="mr-1 animate-spin" /></Badge>;
case 'unknown':
return <Badge className="bg-gray-500 text-white text-[10px]"><HelpCircle size={12} className="mr-1" /></Badge>;
default:
return <Badge className="bg-gray-500 text-white text-[10px]"><Clock size={12} className="mr-1" /></Badge>;
}
@@ -133,15 +120,22 @@ const TaskHistory: React.FC = () => {
};
const handleDelete = async (taskId: string) => {
try {
await backendApi.deleteTask(taskId);
setTasks(prev => prev.filter(t => t.task_id !== taskId));
toast.success('任务已删除');
} catch (error) {
console.error('删除任务失败:', error);
toast.error('删除任务失败');
}
};
const stats = {
total: tasks.length,
success: tasks.filter(t => t.status === 'success').length,
processing: tasks.filter(t => t.status === 'processing').length,
failure: tasks.filter(t => t.status === 'failure').length
failure: tasks.filter(t => t.status === 'failure').length,
unknown: tasks.filter(t => t.status === 'unknown').length
};
return (
@@ -151,7 +145,7 @@ const TaskHistory: React.FC = () => {
<h1 className="text-3xl font-extrabold tracking-tight"></h1>
<p className="text-muted-foreground"></p>
</div>
<Button variant="outline" className="rounded-xl gap-2" onClick={() => window.location.reload()}>
<Button variant="outline" className="rounded-xl gap-2" onClick={() => fetchTasks()}>
<RefreshCcw size={18} />
<span></span>
</Button>
@@ -194,7 +188,8 @@ const TaskHistory: React.FC = () => {
"w-12 h-12 rounded-xl flex items-center justify-center shrink-0",
task.status === 'success' ? "bg-emerald-500/10 text-emerald-500" :
task.status === 'failure' ? "bg-destructive/10 text-destructive" :
"bg-amber-500/10 text-amber-500"
task.status === 'processing' ? "bg-amber-500/10 text-amber-500" :
"bg-gray-500/10 text-gray-500"
)}>
{task.status === 'processing' ? (
<Loader2 size={24} className="animate-spin" />
@@ -212,16 +207,16 @@ const TaskHistory: React.FC = () => {
</Badge>
</div>
<p className="text-sm text-muted-foreground">
{task.message || '任务执行中...'}
{task.message || (task.status === 'unknown' ? '无法获取状态' : '任务执行中...')}
</p>
<div className="flex items-center gap-4 text-xs text-muted-foreground">
<span className="flex items-center gap-1">
<Clock size={12} />
{format(new Date(task.created_at), 'yyyy-MM-dd HH:mm:ss')}
{task.created_at ? format(new Date(task.created_at), 'yyyy-MM-dd HH:mm:ss') : '时间未知'}
</span>
{task.completed_at && (
{task.updated_at && task.status !== 'processing' && (
<span>
: {Math.round((new Date(task.completed_at).getTime() - new Date(task.created_at).getTime()) / 1000)}
: {format(new Date(task.updated_at), 'HH:mm:ss')}
</span>
)}
</div>


@@ -626,6 +626,16 @@ const TemplateFill: React.FC = () => {
<div className="text-muted-foreground text-xs mt-1">
: {detail.source} | : {detail.confidence ? (detail.confidence * 100).toFixed(0) + '%' : 'N/A'}
</div>
{detail.warning && (
<div className="mt-2 p-2 bg-yellow-50 border border-yellow-200 rounded-lg text-yellow-700 text-xs">
{detail.warning}
</div>
)}
{detail.values && detail.values.length > 1 && !detail.warning && (
<div className="mt-2 text-xs text-muted-foreground">
: {detail.values.join(', ')}
</div>
)}
</div>
</div>
))}


@@ -50,18 +50,18 @@
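| `prompt_service.py` | ✅ Done | Prompt template management |
| `text_analysis_service.py` | ✅ Done | Text analysis |
| `chart_generator_service.py` | ✅ Done | Chart generation service |
| `template_fill_service.py` | ✅ Done | Template filling service; supports reading source documents directly to fill tables |
| `template_fill_service.py` | ✅ Done | Template filling service; supports multi-row extraction, direct extraction from structured data, JSON fault tolerance, and Word document table handling |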
### 2.2 API Endpoints (`backend/app/api/endpoints/`)
| Endpoint file | Route | Status |
|----------|------|----------|
| `upload.py` | `/api/v1/upload/excel` | ✅ Excel file upload and parsing |
| `upload.py` | `/api/v1/upload/document` | ✅ Document upload and parsing |
| `documents.py` | `/api/v1/documents/*` | ✅ Document management (list, delete, search) |
| `ai_analyze.py` | `/api/v1/analyze/*` | ✅ AI analysis (Excel, Markdown, streaming) |
| `rag.py` | `/api/v1/rag/*` | ⚠️ RAG retrieval (currently returns empty results) |
| `tasks.py` | `/api/v1/tasks/*` | ✅ Async task status queries |
| `templates.py` | `/api/v1/templates/*` | ✅ Template management (including Word export) |
| `templates.py` | `/api/v1/templates/*` | ✅ Template management (including multi-row export, Word export, and Word structured-field parsing) |
| `visualization.py` | `/api/v1/visualization/*` | ✅ Visualization charts |
| `health.py` | `/api/v1/health` | ✅ Health check |
@@ -70,71 +70,67 @@
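| Page file | Function | Status |
|----------|------|------|
| `Documents.tsx` | Main document management page | ✅ Done |
| `TemplateFill.tsx` | Smart form-filling page | ✅ Done |
| `ExcelParse.tsx` | Excel parsing page | ✅ Done |
### 2.4 Document Parsing Capabilities
| Format | Parsing status | Notes |
|------|----------|------|
| Excel (.xlsx/.xls) | ✅ Done | pandas with XML fallback parsing |
| Excel (.xlsx/.xls) | ✅ Done | pandas with XML fallback parsing; supports multiple sheets |
| Markdown (.md) | ✅ Done | Regex + AI chapter splitting |
| Word (.docx) | ✅ Done | Parsed with python-docx; supports table extraction and field recognition |
| Text (.txt) | ✅ Done | chardet encoding detection; supports text cleaning and structured extraction |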
---
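## 3. Pending Features (Core Gaps)
## 3. Core Feature Implementation Details
### 3.1 Template Filling Module (Top Priority)
**Current status**: ✅ Done
### 3.1 Template Filling Module (✅ Done)
**Core flow**: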
```
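User uploads a template table (Word/Excel)
Upload a template table (Word/Excel)
Parse the template and extract the fields to fill and their hints
Read source data from the source document list specified by the template
Read source data by source document ID list (MongoDB or files)
AI extracts information from the source data according to each field's hint
Prefer direct extraction from structured data (Excel rows)
Fill the extracted data into the corresponding template positions
Fall back to LLM extraction from text when direct extraction is not possible
Return the completed table
Fill the extracted data into the corresponding positions of the original template (preserving its formatting)
Export the completed table (Excel/Word)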
```
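The "structured data first, LLM as fallback" step in the flow can be sketched roughly as follows. This is a minimal illustration, not the code in `template_fill_service.py`: the function name `extract_field`, the row format (a list of dicts keyed by column header), and the `llm_extract` callback are all assumptions made for the example.

```python
from typing import Callable, Dict, List, Tuple


def extract_field(
    field: str,
    rows: List[Dict[str, object]],
    text: str,
    llm_extract: Callable[[str, str], List[str]],
) -> Tuple[List[str], str]:
    """Return (values, source) for one template field.

    Hypothetical helper: tries structured rows first and only calls the
    LLM callback when no column matches the field name.
    """
    # Structured path: collect every non-empty cell in the matching column.
    values = [
        str(row[field])
        for row in rows
        if row.get(field) not in (None, "")
    ]
    if values:
        return values, "structured"
    # Fallback path: let the (caller-supplied) LLM function read the raw text.
    return llm_extract(field, text), "llm"
```

Passing the LLM call in as a callback keeps the sketch testable without a model: a stub function can stand in for the real service.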
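**Completed implementation**:
- [x] `template_fill_service.py` - core template filling service
- [x] Word template parsing (`docx_parser.py` - parse_tables_for_template, extract_template_fields_from_docx)
- [x] Text template parsing (`txt_parser.py` - done)
- [x] Template field recognition and hint extraction
- [x] Multi-document data aggregation and conflict handling
- [x] Export results to Word/Excel
**Key features**:
- **Original template filling**: opens the original template file directly and fills data into the original tables/cells
- **Multi-row data support**: each field can yield multiple values; rows are expanded automatically on export
- **Structured data first**: extracts directly from Excel rows, with no LLM call needed
- **JSON fault tolerance**: handles corrupted or truncated JSON returned by the LLM
- **Markdown cleanup**: automatically strips markdown formatting from LLM output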
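As an illustration of the JSON fault tolerance and Markdown cleanup items, the sketch below strips a surrounding code fence and then tries to close a truncated JSON prefix. It is one possible approach under stated assumptions, not the project's `_fix_json()`; the helper names `strip_markdown_fence`, `close_truncated_json`, and `parse_llm_json` are hypothetical.

```python
import json
import re
from typing import Any, Optional

FENCE = "`" * 3  # a Markdown code fence, built this way to avoid a literal fence here


def strip_markdown_fence(text: str) -> str:
    """Remove a leading fence (with optional language tag) and a trailing fence."""
    text = text.strip()
    text = re.sub(r"^" + FENCE + r"[a-zA-Z]*\s*", "", text)
    text = re.sub(r"\s*" + FENCE + r"$", "", text)
    return text.strip()


def close_truncated_json(text: str) -> str:
    """Append the quote and brackets needed to balance a truncated JSON prefix."""
    closers = []          # closing brackets still owed, innermost last
    in_string = False
    escaped = False
    for ch in text:
        if in_string:
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True
        elif ch in "{[":
            closers.append("}" if ch == "{" else "]")
        elif ch in "}]" and closers:
            closers.pop()
    if in_string:
        text += '"'
    text = re.sub(r",$", "", text.rstrip())  # a dangling comma would be invalid
    return text + "".join(reversed(closers))


def parse_llm_json(raw: str) -> Optional[Any]:
    """Parse LLM output as JSON, returning None when repair is not possible."""
    cleaned = strip_markdown_fence(raw)
    for candidate in (cleaned, close_truncated_json(cleaned)):
        try:
            return json.loads(candidate)
        except json.JSONDecodeError:
            continue
    return None
```

The repair is deliberately conservative: output cut off in the middle of a key or after a colon still fails to parse and yields `None`, so the caller can decide whether to retry the LLM call.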
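### 3.2 Word Document Parsing
**Current status**: ✅ Done
### 3.2 Word Document Parsing (✅ Done)
**Implemented features**:
- [x] `docx_parser.py` - Word document parser
- [x] Extract paragraph text
- [x] Extract table content
- [x] Extract key information (headings, lists, etc.)
- [x] Table template field extraction (`parse_tables_for_template`, `extract_template_fields_from_docx`)
- [x] Field type inference (`_infer_field_type_from_hint`)
- `docx_parser.py` - Word document parser
- Extract paragraph text
- Extract table content (supports the competition table format: field name | hint | value to fill)
- `parse_tables_for_template()` - parses table templates and extracts fields
- `extract_template_fields_from_docx()` - extracts template field definitions
- `_infer_field_type_from_hint()` - infers the field type from the hint
- **API endpoint**: `/api/v1/templates/parse-word-structure` - upload a Word document, extract structured fields, and store them in MongoDB
- **API endpoint**: `/api/v1/templates/word-fields/{doc_id}` - get the template field information for a stored document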
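### 3.3 Text Document Parsing
**Current status**: ✅ Done
### 3.3 Text Document Parsing (✅ Done)
**Implemented features**:
- [x] `txt_parser.py` - text file parser
- [x] Automatic encoding detection (chardet)
- [x] Text cleaning
### 3.4 Document-Template Matching (Framework in Place)
According to the Q&A, templates already specify their data files, so no matching algorithm is needed. Upload is already available; what remains is to confirm that the logic linking templates to data files is complete.
- `txt_parser.py` - text file parser
- Automatic encoding detection (chardet)
- Text cleaning (strip control characters, normalize whitespace)
- Structured data extraction (emails, URLs, phone numbers, dates, amounts)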
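The text cleaning and structured extraction steps listed above might look roughly like this. It is a standard-library sketch only: the names `clean_text` and `extract_structured` and the regular expressions are assumptions for illustration, the patterns are deliberately simple, and encoding detection (done with chardet in `txt_parser.py`) is left out.

```python
import re
import unicodedata
from typing import Dict, List

# Illustrative patterns; real-world formats need broader coverage.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
URL_RE = re.compile(r"https?://[^\s<>\"']+")
DATE_RE = re.compile(r"\d{4}-\d{1,2}-\d{1,2}")


def clean_text(text: str) -> str:
    """Strip control characters and normalize whitespace."""
    # Keep newlines and tabs; drop every other control character (category Cc),
    # which also removes carriage returns.
    kept = "".join(
        ch for ch in text
        if ch in "\n\t" or unicodedata.category(ch) != "Cc"
    )
    kept = re.sub(r"[ \t]+", " ", kept)      # collapse runs of spaces and tabs
    kept = re.sub(r"\n{3,}", "\n\n", kept)   # at most one blank line in a row
    return kept.strip()


def extract_structured(text: str) -> Dict[str, List[str]]:
    """Pull a few kinds of structured values out of already-cleaned text."""
    return {
        "emails": EMAIL_RE.findall(text),
        "urls": URL_RE.findall(text),
        "dates": DATE_RE.findall(text),
    }
```

Cleaning before extraction keeps stray control bytes from splitting a value in the middle; phone numbers and amounts would follow the same pattern with their own expressions.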
---
@@ -192,20 +188,20 @@ docs/test/
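## 6. Work Plan (Suggested)
### Priority 1: Core Template Filling Features
- Finish Word document parsing
- Finish the template filling service
- End-to-end test verification
### Priority 1: End-to-End Testing
- Measure accuracy with real test data
- Verify that multi-row data export is correct
- Test that Word template parsing works
### Priority 2: Demo Packaging and Documentation
- Prepare the project demo slides (PPT)
- Record a demo video
- Improve the deployment documentation in the README
### Priority 3: Testing and Optimization
- Measure accuracy with real test data
### Priority 3: Optimization
- Reduce response time
- Improve error handling
- Add more test cases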
---
@@ -215,29 +211,32 @@ docs/test/
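2. **Database**: database storage is not mandatory and can be skipped
3. **Deployment**: local deployment is sufficient; no public server is needed
4. **Evaluation data**: the preliminary round uses only the data provided so far
5. **RAG**: temporarily disabled; this does not affect the core evaluated features
5. **RAG**: temporarily disabled; this does not affect the core evaluated features (files are read directly instead)
---
*Document version: v1.1*
*Last updated: 2026-04-08*
*Document version: v1.5*
*Last updated: 2026-04-09*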
---
## 8. Technical Implementation Details
### 8.1 Template Filling Flow (Implemented)
### 8.1 Template Filling Flow
#### Flowchart
```
┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│ Upload template │ ──► │ Select data sources │ ──► │ AI smart fill │
│ Upload template │ ──► │ Select data sources │ ──► │ Smart fill
└─────────────┘ └─────────────┘ └─────────────┘
┌─────────────┐
│ Export result │
─────────────
┌─────────────────────────┼─────────────────────────┐
│ │
▼ ▼
┌───────────────┐ ┌───────────────┐ ┌───────────────
│ Structured extraction │ │ LLM extraction │ │ Export result │
│ (read rows directly) │ │ (text understanding) │ │ (Excel/Word) │
└───────────────┘ └───────────────┘ └───────────────┘
```
#### Core components
@@ -247,8 +246,10 @@ docs/test/
| Template upload | `templates.py` `/templates/upload` | Receives the template file and extracts fields |
| Field extraction | `template_fill_service.py` | Extracts field definitions from Word/Excel tables |
| Document parsing | `docx_parser.py`, `xlsx_parser.py`, `txt_parser.py` | Parses source document content |
| Smart fill | `template_fill_service.py` `fill_template()` | Uses the LLM to extract information from source documents |
| Result export | `templates.py` `/templates/export` | Exports to Excel or Word |
| Smart fill | `template_fill_service.py` `fill_template()` | Structured extraction + LLM extraction |
| Multi-row support | `template_fill_service.py` `FillResult` | Supports a `values` array |
| JSON fault tolerance | `template_fill_service.py` `_fix_json()` | Repairs corrupted JSON |
| Result export | `templates.py` `/templates/export` | Multi-row Excel + Word export |
### 8.2 Source Document Loading
@@ -268,7 +269,9 @@ docs/test/
```python
# Extract table template fields
fields = docx_parser.extract_template_fields_from_docx(file_path)
from docx_parser import DocxParser
parser = DocxParser()
fields = parser.extract_template_fields_from_docx(file_path)
# Return format
# [
@@ -295,6 +298,24 @@ fields = docx_parser.extract_template_fields_from_docx(file_path)
### 8.5 API Endpoints
#### POST `/api/v1/templates/upload`
Uploads a template file and extracts its field definitions.
**Response**:
```json
{
"success": true,
"template_id": "/path/to/saved/template.docx",
"filename": "template.docx",
"file_type": "docx",
"fields": [
{"cell": "A1", "name": "Name", "field_type": "text", "required": true, "hint": "Extract the person's name"}
],
"field_count": 1
}
```
#### POST `/api/v1/templates/fill`
Fill request:
@@ -306,35 +327,232 @@ fields = docx_parser.extract_template_fields_from_docx(file_path)
],
"source_doc_ids": ["mongodb_doc_id_1", "mongodb_doc_id_2"],
"source_file_paths": [],
"user_hint": "Please extract from the contract document"
"user_hint": "Please extract from the xxx document"
}
```
Response:
**Response (with multi-row support)**:
```json
{
"success": true,
"filled_data": {"Name": "Zhang San"},
"filled_data": {
"Name": ["Zhang San", "Li Si", "Wang Wu"],
"Age": ["25", "30", "28"]
},
"fill_details": [
{
"field": "Name",
"cell": "A1",
"values": ["Zhang San", "Li Si", "Wang Wu"],
"value": "Zhang San",
"source": "From: contract document.docx",
"confidence": 0.95
"source": "Extracted directly from structured data",
"confidence": 1.0
}
],
"source_doc_count": 2
"source_doc_count": 2,
"max_rows": 3
}
```
#### POST `/api/v1/templates/export`
Export request:
Export request (creates a new file):
```json
{
"template_id": "template ID",
"filled_data": {"Name": "Zhang San", "Amount": "10000"},
"format": "xlsx" // or "docx"
"filled_data": {"Name": ["Zhang San", "Li Si"], "Amount": ["10000", "20000"]},
"format": "xlsx"
}
```
#### POST `/api/v1/templates/fill-and-export`
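**Fill the original template and export** (recommended for the competition)
Opens the original template file directly, fills the data into the template's tables/cells, and exports the result. **The original template formatting is preserved.**
**Request**: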
```json
{
"template_path": "/path/to/original/template.docx",
"filled_data": {
"Name": ["Zhang San", "Li Si", "Wang Wu"],
"Age": ["25", "30", "28"]
},
"format": "docx"
}
```
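**Response**: the filled Word/Excel file (file stream)
**Behavior**:
- Opens the original template file
- Matches field names to column indexes using the header row
- Fills the data into the cells of the matching columns
- Expands the table rows automatically for multi-row data
- Preserves the original template formatting and styles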
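
The header-matching and row-expansion steps above can be sketched on a plain list-of-lists table. This is a simplified model under stated assumptions: `fill_table` is a hypothetical name, matching is exact after stripping whitespace, and the real service works on Word/Excel table objects (and preserves their styles) rather than on Python lists.

```python
from typing import Dict, List


def fill_table(
    table: List[List[str]], filled_data: Dict[str, List[str]]
) -> List[List[str]]:
    """Fill values into the columns whose header cell matches the field name.

    table[0] is treated as the header row; body rows are appended as needed.
    """
    header = [cell.strip() for cell in table[0]]
    # Map each known field name to its column index; unknown fields are skipped.
    col_index = {name: header.index(name) for name in filled_data if name in header}
    needed = max((len(filled_data[name]) for name in col_index), default=0)

    # Expand the table until there is one body row per value.
    while len(table) - 1 < needed:
        table.append([""] * len(header))

    for name, col in col_index.items():
        for i, value in enumerate(filled_data[name]):
            table[i + 1][col] = str(value)
    return table
```

Fields with fewer values than the longest one simply leave their remaining cells untouched, which matches the padding behavior described for export.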
#### POST `/api/v1/templates/parse-word-structure`
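**Upload a Word document and extract structured fields** (competition-specific)
Uploads a Word document, extracts field definitions (field name, hint, field type) from its table template, and stores them in MongoDB.
**Request**: multipart/form-data
- file: the Word file
**Response**: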
```json
{
"success": true,
"doc_id": "mongodb_doc_id",
"filename": "template.docx",
"file_path": "/path/to/saved/template.docx",
"field_count": 5,
"fields": [
{
"cell": "T0R1",
"name": "field name",
"hint": "hint text",
"field_type": "text",
"required": true
}
],
"tables": [...],
"metadata": {
"paragraph_count": 10,
"table_count": 1,
"word_count": 500,
"has_tables": true
}
}
```
#### GET `/api/v1/templates/word-fields/{doc_id}`
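**Get the template field information of a Word document**
Returns the template field information of a previously uploaded Word document, looked up by doc_id.
**Response**: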
```json
{
"success": true,
"doc_id": "mongodb_doc_id",
"filename": "template.docx",
"fields": [...],
"tables": [...],
"field_count": 5,
"metadata": {...}
}
```
### 8.6 Multi-Row Data Handling
**FillResult data structure**:
```python
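from dataclasses import dataclass
from typing import Any, List, Optional


@dataclass
class FillResult:
    field: str
    values: Optional[List[Any]] = None  # multiple values (array)
    value: Any = ""                     # kept for compatibility (first value)
    source: str = ""                    # source document
    confidence: float = 1.0             # confidence score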
```
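**Export logic**:
- Compute the maximum row count across all fields
- Iterate over each row and take the value at the corresponding index
- Pad missing rows with empty strings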
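
A minimal sketch of that export logic, assuming `filled_data` maps each field name to either a list of values or a single legacy value. The helper name `build_export_rows` is hypothetical and the real export code in `templates.py` may be organized differently.

```python
from typing import Any, Dict, List


def build_export_rows(filled_data: Dict[str, Any]) -> List[List[str]]:
    """Turn {field: values} into a header row followed by padded data rows."""
    # Normalize scalars to one-item lists so single-value payloads still work.
    columns = {
        name: (vals if isinstance(vals, list) else [vals])
        for name, vals in filled_data.items()
    }
    max_rows = max((len(vals) for vals in columns.values()), default=0)

    rows: List[List[str]] = [list(columns)]  # header row, in insertion order
    for i in range(max_rows):
        # Take the i-th value of each field, or "" when the field is shorter.
        rows.append(
            [str(vals[i]) if i < len(vals) else "" for vals in columns.values()]
        )
    return rows
```

The resulting list of rows can then be written cell by cell with whichever Excel or Word writer the export endpoint uses.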
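### 8.7 JSON Fault Tolerance
When the JSON returned by the LLM is corrupted or truncated, the system:
1. Strips markdown code-fence markers (```json, ```)
2. Tries to pair up brackets to locate a complete JSON object
3. Removes trailing commas
4. Uses a regular expression to extract the `values` array
5. Fallback: extracts every quoted string directly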
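
The five steps can be sketched as follows. This is an illustrative reconstruction, not the project's `_fix_json()`: the function names, the regular expressions, and the assumption that the payload is an object with a `values` key are all hypothetical. The trailing-comma step is deliberately naive and could also alter a string that happens to contain `,]`.

```python
import json
import re
from typing import List, Optional


def strip_code_fences(text: str) -> str:
    """Step 1: remove markdown code-fence markers, with or without a json tag."""
    return re.sub(r"```(?:json)?", "", text).strip()


def find_balanced_json(text: str) -> Optional[str]:
    """Step 2: return the first brace-balanced {...} span, or None if unclosed."""
    start = text.find("{")
    if start == -1:
        return None
    depth, in_string, escaped = 0, False, False
    for i in range(start, len(text)):
        ch = text[i]
        if in_string:
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True
        elif ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return text[start : i + 1]
    return None


def extract_values(raw: str) -> List[str]:
    """Parse LLM output leniently and return the list of extracted values."""
    text = strip_code_fences(raw)
    candidate = find_balanced_json(text)
    if candidate is not None:
        # Step 3: drop commas that directly precede a closing bracket or brace.
        candidate = re.sub(r",\s*([}\]])", r"\1", candidate)
        try:
            values = json.loads(candidate).get("values", [])
            return [str(v) for v in values] if isinstance(values, list) else [str(values)]
        except json.JSONDecodeError:
            pass
    # Step 4: look inside the (possibly truncated) values array.
    match = re.search(r'"values"\s*:\s*\[(.*)', text, re.DOTALL)
    # Step 5: otherwise fall back to every quoted string in the text.
    scope = match.group(1) if match else text
    return re.findall(r'"((?:[^"\\]|\\.)*)"', scope)
```

With truncated input, a value whose closing quote is missing is dropped rather than returned half-finished.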
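### 8.8 Structured-Data-First Extraction
For documents with a `rows` structure, such as Excel files, the system:
1. Looks up the matching column directly in `structured_data.rows`
2. Uses fuzzy matching (the field name contains, or is contained in, the column name)
3. Extracts the values of every row in that column
4. Skips the LLM call entirely, which is faster and more accurate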
```python
# Internal logic (simplified)
rows = structured.get("rows")
if rows:
    columns = structured.get("columns", [])
    values = _extract_column_values(rows, columns, field_name)
```
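
The body of `_extract_column_values` is not shown in this document. A minimal sketch of the fuzzy column lookup described above could look like the following, assuming each row is either a list aligned with `columns` or a dict keyed by column name. The names `find_matching_column` and `extract_column_values` are hypothetical.

```python
from typing import Any, List, Sequence


def find_matching_column(columns: Sequence[Any], field_name: str) -> int:
    """Return the index of the best-matching column, or -1 if none matches."""
    target = field_name.strip().lower()
    if not target:
        return -1
    names = [str(col).strip().lower() for col in columns]
    # Prefer an exact match, then fall back to containment in either direction.
    for i, name in enumerate(names):
        if name == target:
            return i
    for i, name in enumerate(names):
        if name and (target in name or name in target):  # skip empty columns
            return i
    return -1


def extract_column_values(
    rows: Sequence[Any], columns: Sequence[Any], field_name: str
) -> List[str]:
    """Collect the non-empty values of the column matching field_name."""
    idx = find_matching_column(columns, field_name)
    if idx == -1:
        return []
    values: List[str] = []
    for row in rows:
        if isinstance(row, dict):
            cell = row.get(columns[idx])
        else:
            cell = row[idx] if idx < len(row) else None
        if cell is not None and str(cell).strip() != "":
            values.append(str(cell))
    return values
```

Trying the exact match first avoids picking a longer column that merely contains the field name when an identical column also exists.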
---
## 9. Dependencies
### Python dependencies
```
# requirements.txt must include
fastapi>=0.104.0
uvicorn>=0.24.0
motor>=3.3.0 # async MongoDB driver
sqlalchemy>=2.0.0 # MySQL ORM
pandas>=2.0.0 # Excel processing
openpyxl>=3.1.0 # Excel writing
python-docx>=0.8.0 # Word processing
chardet>=4.0.0 # encoding detection
httpx>=0.25.0 # HTTP client
```
### Frontend dependencies
```
# package.json must include
react>=18.0.0
react-dropzone>=14.0.0
lucide-react>=0.300.0
sonner>=1.0.0 # toast notifications
```
---
## 10. Startup Instructions
### Starting the backend
```bash
cd backend
.\venv\Scripts\Activate.ps1 # or Activate.bat
pip install -r requirements.txt # make sure all dependencies are installed
.\venv\Scripts\python.exe -m uvicorn app.main:app --host 127.0.0.1 --port 8000 --reload
```
### Starting the frontend
```bash
cd frontend
npm install
npm run dev
```
### Environment variables
Configure the following in `backend/.env`:
```
MONGODB_URL=mongodb://localhost:27017
MONGODB_DB_NAME=document_system
MYSQL_HOST=localhost
MYSQL_PORT=3306
MYSQL_USER=root
MYSQL_PASSWORD=your_password
MYSQL_DATABASE=document_system
LLM_API_KEY=your_api_key
LLM_BASE_URL=https://api.minimax.chat
LLM_MODEL_NAME=MiniMax-Text-01
```