Compare commits

25 Commits

Author SHA1 Message Date
dj
51350e3002 123 2026-04-14 17:35:40 +08:00
dj
8e713be1ca Merge remote changes with RAG service optimization
- Keep user's RAG service integration for faster extraction
- Add remote's word_ai_service support
- Preserve user's parallel extraction and field header optimizations

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-14 17:25:13 +08:00
zzz
f2af27245d Enhance AI parsing of Word documents and template filling 2026-04-14 17:16:38 +08:00
dj
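a9dc0d8b91 Optimize smart form filling: faster extraction and more accurate data
Backend optimizations (template_fill_service.py):

1. Speed:
   - Extract fields in parallel with asyncio.gather
   - Skip the AI review step to reduce the number of LLM calls
   - Add the _extract_single_field_fast method

2. Data extraction:
   - Integrate the RAG service for intelligent content retrieval
   - Fix empty-column skipping in Markdown table column matching
   - Fix year sub-header rows being misidentified

3. AI header generation:
   - Trim to 5-7 representative fields (previously 8-15)
   - Filter out non-data fields (source, remarks, notes, etc.)
   - Simplify field names, e.g. "number of hospitals" rather than "hospitals - number of public hospitals"

4. AI data extraction prompt:
   - Extract strictly by header and return only relevant data
   - Every value must carry an annotation (year/region/category)
   - Support multiple annotation types: 2024, Beijing, a given province, public hospital, tertiary hospital, etc.
   - Preserve the original values, units, and percent signs
   - Do not return long source descriptions

5. New warning field on FillResult:
   - Multi-value detection hint, e.g. "2 values detected"

Frontend optimizations (TemplateFill.tsx):
- Show multi-value warnings in the fill details (yellow notice box)
- Show all values directly when multiple values are present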
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-14 17:14:59 +08:00
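The parallel field extraction mentioned in the commit above can be sketched roughly as follows. This is a minimal illustration, not the code in template_fill_service.py: `extract_single_field` is a hypothetical stand-in for one per-field extraction call, and the fallback to an empty string on failure is an assumption.

```python
import asyncio
from typing import Dict, List


async def extract_single_field(field: str, source_text: str) -> str:
    """Hypothetical stand-in for one per-field extraction (e.g. an LLM request)."""
    await asyncio.sleep(0.01)  # simulate I/O latency
    return f"{field}:{len(source_text)}"


async def extract_fields_parallel(fields: List[str], source_text: str) -> Dict[str, str]:
    """Run all per-field extractions concurrently; results keep the order of `fields`."""
    results = await asyncio.gather(
        *(extract_single_field(f, source_text) for f in fields),
        return_exceptions=True,  # a failing field is returned as an exception object
    )
    filled: Dict[str, str] = {}
    for field, value in zip(fields, results):
        # Assumption: a failed extraction becomes an empty value instead of raising.
        filled[field] = "" if isinstance(value, Exception) else value
    return filled
```

With `return_exceptions=True`, one slow or failing field does not abort the others, which fits the stated goal of cutting total latency.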
tl
902c28166b tl 2026-04-14 15:18:50 +08:00
tl
4a53be7eeb TL 2026-04-14 14:58:14 +08:00
tl
8b5b24fa2a Merge branch 'main' of https://gitea.kronecker.cc/OurCodesAreAllRight/FilesReadSystem 2026-04-14 14:57:53 +08:00
tl
ed66aa346d tl 2026-04-10 10:24:52 +08:00
zzz
5b82d40be0 Merge branch 'main' of https://gitea.kronecker.cc/OurCodesAreAllRight/FilesReadSystem 2026-04-10 10:10:41 +08:00
zzz
bedf1af9c0 Enhance AI parsing of Word documents and template filling 2026-04-10 09:48:57 +08:00
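5fca4eb094 Add exception handling for temp file cleanup and change the outline endpoint to POST
- Wrap temp file cleanup in try/except blocks in the analyze_markdown, analyze_markdown_stream,
  and get_markdown_outline functions
- Change the /analyze/md/outline endpoint from GET to POST to support file upload
- Ensure temp files are cleaned up in all cases and log cleanup failures

refactor(health): make health checks verify real database connections

- MySQL health check now runs a SELECT 1 query to verify the connection
- MongoDB health check now runs the ping command to verify the connection
- Redis health check now runs the ping command to verify the connection
- Catch exceptions and log the specific error

refactor(upload): use os.path.basename to extract file names

- Replace manual string splitting with os.path.basename
- Unify file name handling across Excel upload and export

feat(instruction): add an instruction execution framework module

- Create the instruction package with the base architecture for intent parsing and instruction execution
- Add the IntentParser and InstructionExecutor abstract base classes
- Provide default implementations marked as unfinished, in preparation for future features

refactor(frontend): change the AuthContext import path and remove the duplicate file

- Move AuthContext from src/context to src/contexts
- Update the import paths in App.tsx and RouteGuard.tsx
- Remove the old AuthContext.tsx file

fix(backend-api): fix the wrong HTTP method in the AI analysis API

- Change the fetch request method in aiApi from GET to POST to support file upload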
2026-04-10 01:51:53 +08:00
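0dbf74db9d Add task ID tracking to the template fill endpoint
- Add an optional task_id field to FillRequest for task history tracking
- Implement task state management, including creation, updates, and error handling
- Integrate MongoDB task records and update progress during processing
- Add task progress updates for the started, processing, success, and failure states
- Change the template fill service to accept and pass through the task_id parameter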
2026-04-10 01:27:26 +08:00
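858b594171 Add dual-write task status and task history
- Write task status to both Redis and MongoDB (dual write)
- Add a MongoDB tasks collection with CRUD operations
- Add task history query, list display, and deletion
- Refactor task status updates to go through a single update_task_status function
- Add AI review of field values in the template fill service
- Improve display and interaction on the frontend task history page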
2026-04-10 01:15:53 +08:00
ed0f51f2a4 Merge branch 'main' of https://gitea.kronecker.cc/OurCodesAreAllRight/FilesReadSystem 2026-04-10 00:26:57 +08:00
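ecc0c79475 Enhance the template fill service with table content summaries and header regeneration
- Add table content summaries during source document parsing, extracting table structure for the AI to use
- Add table summary logic that extracts and formats the headers and the first 3 data rows
- Add template file type detection for the xlsx and docx formats
- Regenerate headers automatically from source document content
- When auto-generated headers are detected, regenerate more accurate fields from the source document content
- Add detailed debug logging to trace table processing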
2026-04-10 00:26:54 +08:00
dj
6befc510d8 Debug the refresh behavior 2026-04-10 00:23:23 +08:00
dj
8f66c235fa Implement parallel multi-file upload, show the uploaded files in a list, and support repeated uploads 2026-04-10 00:16:28 +08:00
886d5ae0cc Merge branch 'main' of https://gitea.kronecker.cc/OurCodesAreAllRight/FilesReadSystem 2026-04-09 22:44:01 +08:00
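6752c5c231 Optimize joint template upload to parse source document content
- Remove the template field extraction step and save the template file directly instead
- Add source document parsing that extracts content, titles, and table count
- Change the template fill service to accept source document content for AI header generation
- Update AI header generation to derive suitable header fields from the source document content
- Improve logging to show the source document count and processing progress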
2026-04-09 22:43:51 +08:00
dj
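610d475ce0 Add selecting source documents from the document center, plus deletion
The smart form-filling module adds a "select from document center" mode that lets users pick already-uploaded documents as data sources,
and also supports deleting documents from the list. The two modes are switched via tabs.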

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-09 22:35:13 +08:00
dj
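496b96508d Fix Excel parsing and smart form filling
- Make the Excel parser handle multiple namespaces and path formats, fixing Excel files with English headers that could not be read
- When structured_data is empty in MongoDB, try re-parsing the file from file_path
- Improve the AI analysis prompt to explicitly require bare numeric values without units
- Fix the max_tokens value (5000→4000) to avoid DeepSeek API errors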

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-09 22:21:51 +08:00
dj
07ebdc09bc Merge branch 'main' of https://gitea.kronecker.cc/OurCodesAreAllRight/FilesReadSystem 2026-04-09 22:18:12 +08:00
dj
c1886fb68f Merge branch 'main' of https://gitea.kronecker.cc/OurCodesAreAllRight/FilesReadSystem 2026-04-09 21:42:14 +08:00
dj
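78417c898a Improve smart form filling: support Markdown table extraction and fix LLM calls
- Add support for the tables format stored in MongoDB, extracting data directly from structured_data.tables
- Fix an overly large max_tokens value (50000→4000) to stay within the DeepSeek API limit
- Improve the column name matching algorithm with fuzzy matching
- Add detailed logging to help debug structured data extraction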

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-09 21:42:07 +08:00
dj
718f864926 Fix inconsistent floating-point matching when reading Excel sheets that contain numbers 2026-04-09 20:56:38 +08:00
33 changed files with 4246 additions and 2462 deletions


@@ -29,9 +29,14 @@ REDIS_URL="redis://localhost:6379/0"
# ==================== LLM AI 配置 ====================
# 大语言模型 API 配置
LLM_API_KEY="your_api_key_here"
LLM_BASE_URL=""
LLM_MODEL_NAME=""
# 支持 OpenAI 兼容格式 (DeepSeek, 智谱 GLM, 阿里等)
# 智谱 AI (Zhipu AI) GLM 系列:
# - 模型: glm-4-flash (快速文本模型), glm-4 (标准), glm-4-plus (高性能)
# - API: https://open.bigmodel.cn
# - API Key: https://open.bigmodel.cn/usercenter/apikeys
LLM_API_KEY="ca79ad9f96524cd5afc3e43ca97f347d.cpiLLx2oyitGvTeU"
LLM_BASE_URL="https://open.bigmodel.cn/api/paas/v4"
LLM_MODEL_NAME="glm-4v-plus"
# ==================== Supabase 配置 ====================
# Supabase 项目配置

38
backend/=3.0.0 Normal file

@@ -0,0 +1,38 @@
Requirement already satisfied: sentence-transformers in c:\python312\lib\site-packages (2.2.2)
Requirement already satisfied: transformers<5.0.0,>=4.6.0 in c:\python312\lib\site-packages (from sentence-transformers) (4.57.6)
Requirement already satisfied: tqdm in c:\python312\lib\site-packages (from sentence-transformers) (4.66.1)
Requirement already satisfied: torch>=1.6.0 in c:\python312\lib\site-packages (from sentence-transformers) (2.10.0)
Requirement already satisfied: torchvision in c:\python312\lib\site-packages (from sentence-transformers) (0.25.0)
Requirement already satisfied: numpy in c:\python312\lib\site-packages (from sentence-transformers) (1.26.2)
Requirement already satisfied: scikit-learn in c:\python312\lib\site-packages (from sentence-transformers) (1.8.0)
Requirement already satisfied: scipy in c:\python312\lib\site-packages (from sentence-transformers) (1.16.3)
Requirement already satisfied: nltk in c:\python312\lib\site-packages (from sentence-transformers) (3.9.3)
Requirement already satisfied: sentencepiece in c:\python312\lib\site-packages (from sentence-transformers) (0.2.1)
Requirement already satisfied: huggingface-hub>=0.4.0 in c:\python312\lib\site-packages (from sentence-transformers) (0.36.2)
Requirement already satisfied: filelock in c:\python312\lib\site-packages (from huggingface-hub>=0.4.0->sentence-transformers) (3.25.2)
Requirement already satisfied: fsspec>=2023.5.0 in c:\python312\lib\site-packages (from huggingface-hub>=0.4.0->sentence-transformers) (2026.2.0)
Requirement already satisfied: packaging>=20.9 in c:\python312\lib\site-packages (from huggingface-hub>=0.4.0->sentence-transformers) (23.2)
Requirement already satisfied: pyyaml>=5.1 in c:\python312\lib\site-packages (from huggingface-hub>=0.4.0->sentence-transformers) (6.0.1)
Requirement already satisfied: requests in c:\python312\lib\site-packages (from huggingface-hub>=0.4.0->sentence-transformers) (2.31.0)
Requirement already satisfied: typing-extensions>=3.7.4.3 in c:\python312\lib\site-packages (from huggingface-hub>=0.4.0->sentence-transformers) (4.15.0)
Requirement already satisfied: sympy>=1.13.3 in c:\python312\lib\site-packages (from torch>=1.6.0->sentence-transformers) (1.14.0)
Requirement already satisfied: networkx>=2.5.1 in c:\python312\lib\site-packages (from torch>=1.6.0->sentence-transformers) (3.6.1)
Requirement already satisfied: jinja2 in c:\python312\lib\site-packages (from torch>=1.6.0->sentence-transformers) (3.1.6)
Requirement already satisfied: setuptools in c:\python312\lib\site-packages (from torch>=1.6.0->sentence-transformers) (82.0.1)
Requirement already satisfied: colorama in c:\python312\lib\site-packages (from tqdm->sentence-transformers) (0.4.6)
Requirement already satisfied: regex!=2019.12.17 in c:\python312\lib\site-packages (from transformers<5.0.0,>=4.6.0->sentence-transformers) (2026.2.28)
Requirement already satisfied: tokenizers<=0.23.0,>=0.22.0 in c:\python312\lib\site-packages (from transformers<5.0.0,>=4.6.0->sentence-transformers) (0.22.2)
Requirement already satisfied: safetensors>=0.4.3 in c:\python312\lib\site-packages (from transformers<5.0.0,>=4.6.0->sentence-transformers) (0.7.0)
Requirement already satisfied: click in c:\python312\lib\site-packages (from nltk->sentence-transformers) (8.3.1)
Requirement already satisfied: joblib in c:\python312\lib\site-packages (from nltk->sentence-transformers) (1.5.3)
Requirement already satisfied: threadpoolctl>=3.2.0 in c:\python312\lib\site-packages (from scikit-learn->sentence-transformers) (3.6.0)
Requirement already satisfied: pillow!=8.3.*,>=5.3.0 in c:\python312\lib\site-packages (from torchvision->sentence-transformers) (12.1.1)
Requirement already satisfied: mpmath<1.4,>=1.1.0 in c:\python312\lib\site-packages (from sympy>=1.13.3->torch>=1.6.0->sentence-transformers) (1.3.0)
Requirement already satisfied: MarkupSafe>=2.0 in c:\python312\lib\site-packages (from jinja2->torch>=1.6.0->sentence-transformers) (3.0.3)
Requirement already satisfied: charset-normalizer<4,>=2 in c:\python312\lib\site-packages (from requests->huggingface-hub>=0.4.0->sentence-transformers) (3.4.6)
Requirement already satisfied: idna<4,>=2.5 in c:\python312\lib\site-packages (from requests->huggingface-hub>=0.4.0->sentence-transformers) (3.11)
Requirement already satisfied: urllib3<3,>=1.21.1 in c:\python312\lib\site-packages (from requests->huggingface-hub>=0.4.0->sentence-transformers) (2.6.3)
Requirement already satisfied: certifi>=2017.4.17 in c:\python312\lib\site-packages (from requests->huggingface-hub>=0.4.0->sentence-transformers) (2026.2.25)
[notice] A new release of pip is available: 24.2 -> 26.0.1
[notice] To update, run: python.exe -m pip install --upgrade pip


@@ -10,6 +10,8 @@ import os
from app.services.excel_ai_service import excel_ai_service
from app.services.markdown_ai_service import markdown_ai_service
from app.services.template_fill_service import template_fill_service
from app.services.word_ai_service import word_ai_service
logger = logging.getLogger(__name__)
@@ -215,9 +217,12 @@ async def analyze_markdown(
return result
finally:
# 清理临时文件
if os.path.exists(tmp_path):
os.unlink(tmp_path)
# 清理临时文件,确保在所有情况下都能清理
try:
if tmp_path and os.path.exists(tmp_path):
os.unlink(tmp_path)
except Exception as cleanup_error:
logger.warning(f"临时文件清理失败: {tmp_path}, error: {cleanup_error}")
except HTTPException:
raise
@@ -279,8 +284,12 @@ async def analyze_markdown_stream(
)
finally:
if os.path.exists(tmp_path):
os.unlink(tmp_path)
# 清理临时文件,确保在所有情况下都能清理
try:
if tmp_path and os.path.exists(tmp_path):
os.unlink(tmp_path)
except Exception as cleanup_error:
logger.warning(f"临时文件清理失败: {tmp_path}, error: {cleanup_error}")
except HTTPException:
raise
@@ -289,7 +298,7 @@ async def analyze_markdown_stream(
raise HTTPException(status_code=500, detail=f"流式分析失败: {str(e)}")
@router.get("/analyze/md/outline")
@router.post("/analyze/md/outline")
async def get_markdown_outline(
file: UploadFile = File(...)
):
@@ -323,9 +332,154 @@ async def get_markdown_outline(
result = await markdown_ai_service.extract_outline(tmp_path)
return result
finally:
if os.path.exists(tmp_path):
os.unlink(tmp_path)
# 清理临时文件,确保在所有情况下都能清理
try:
if tmp_path and os.path.exists(tmp_path):
os.unlink(tmp_path)
except Exception as cleanup_error:
logger.warning(f"临时文件清理失败: {tmp_path}, error: {cleanup_error}")
except Exception as e:
logger.error(f"获取 Markdown 大纲失败: {str(e)}")
raise HTTPException(status_code=500, detail=f"获取大纲失败: {str(e)}")
@router.post("/analyze/txt")
async def analyze_txt(
file: UploadFile = File(...),
):
"""
上传并使用 AI 分析 TXT 文本文件,提取结构化数据
将非结构化文本转换为结构化表格数据,便于后续填表使用
Args:
file: 上传的 TXT 文件
Returns:
dict: 分析结果,包含结构化表格数据
"""
if not file.filename:
raise HTTPException(status_code=400, detail="文件名为空")
file_ext = file.filename.split('.')[-1].lower()
if file_ext not in ['txt', 'text']:
raise HTTPException(
status_code=400,
detail=f"不支持的文件类型: {file_ext},仅支持 .txt"
)
try:
# 读取文件内容
content = await file.read()
# 保存到临时文件
with tempfile.NamedTemporaryFile(mode='wb', suffix='.txt', delete=False) as tmp:
tmp.write(content)
tmp_path = tmp.name
try:
logger.info(f"开始 AI 分析 TXT 文件: {file.filename}")
# 使用 template_fill_service 的 AI 分析方法
result = await template_fill_service.analyze_txt_with_ai(
content=content.decode('utf-8', errors='replace'),
filename=file.filename
)
if result:
logger.info(f"TXT AI 分析成功: {file.filename}")
return {
"success": True,
"filename": file.filename,
"structured_data": result
}
else:
logger.warning(f"TXT AI 分析返回空结果: {file.filename}")
return {
"success": False,
"filename": file.filename,
"error": "AI 分析未能提取到结构化数据",
"structured_data": None
}
finally:
# 清理临时文件
if os.path.exists(tmp_path):
os.unlink(tmp_path)
except HTTPException:
raise
except Exception as e:
logger.error(f"TXT AI 分析过程中出错: {str(e)}")
raise HTTPException(status_code=500, detail=f"分析失败: {str(e)}")
# ==================== Word 文档 AI 解析 ====================
@router.post("/analyze/word")
async def analyze_word(
file: UploadFile = File(...),
user_hint: str = Query("", description="用户提示词,如'请提取表格数据'")
):
"""
使用 AI 解析 Word 文档,提取结构化数据
适用于从非结构化的 Word 文档中提取表格数据、键值对等信息
Args:
file: 上传的 Word 文件
user_hint: 用户提示词
Returns:
dict: 包含结构化数据的解析结果
"""
if not file.filename:
raise HTTPException(status_code=400, detail="文件名为空")
file_ext = file.filename.split('.')[-1].lower()
if file_ext not in ['docx']:
raise HTTPException(
status_code=400,
detail=f"不支持的文件类型: {file_ext},仅支持 .docx"
)
try:
# 保存上传的文件
content = await file.read()
suffix = f".{file_ext}"
with tempfile.NamedTemporaryFile(delete=False, suffix=suffix) as tmp:
tmp.write(content)
tmp_path = tmp.name
try:
# 使用 AI 解析 Word 文档
result = await word_ai_service.parse_word_with_ai(
file_path=tmp_path,
user_hint=user_hint or "请提取文档中的所有结构化数据,包括表格、键值对等"
)
if result.get("success"):
return {
"success": True,
"filename": file.filename,
"result": result
}
else:
return {
"success": False,
"filename": file.filename,
"error": result.get("error", "AI 解析失败"),
"result": None
}
finally:
# 清理临时文件
if os.path.exists(tmp_path):
os.unlink(tmp_path)
except HTTPException:
raise
except Exception as e:
logger.error(f"Word AI 分析过程中出错: {str(e)}")
raise HTTPException(status_code=500, detail=f"分析失败: {str(e)}")
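The guarded temp file cleanup that this diff applies to the Markdown handlers can be sketched in isolation as below. It is a minimal sketch under stated assumptions: `process_upload` and `handler` are hypothetical names, and initializing `tmp_path` to `None` before the `try` is an assumption that makes the `if tmp_path and ...` check meaningful. Note that the new /analyze/txt and /analyze/word handlers in the hunk above still use the older, unguarded `os.unlink` form.

```python
import logging
import os
import tempfile
from typing import Callable, Optional

logger = logging.getLogger(__name__)


def process_upload(content: bytes, suffix: str, handler: Callable[[str], dict]) -> dict:
    """Write `content` to a temp file, run `handler` on its path, always clean up."""
    tmp_path: Optional[str] = None  # stays None if temp file creation itself fails
    try:
        with tempfile.NamedTemporaryFile(mode="wb", suffix=suffix, delete=False) as tmp:
            tmp.write(content)
            tmp_path = tmp.name
        return handler(tmp_path)
    finally:
        # Cleanup must never mask the handler's own result or exception.
        try:
            if tmp_path and os.path.exists(tmp_path):
                os.unlink(tmp_path)
        except OSError as cleanup_error:
            logger.warning("temp file cleanup failed: %s, error: %s", tmp_path, cleanup_error)
```

Catching only `OSError` is a narrower choice than the diff's `except Exception`; either keeps a cleanup failure from replacing the original error.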


@@ -23,6 +23,52 @@ logger = logging.getLogger(__name__)
router = APIRouter(prefix="/upload", tags=["文档上传"])
# ==================== 辅助函数 ====================
async def update_task_status(
task_id: str,
status: str,
progress: int = 0,
message: str = "",
result: dict = None,
error: str = None
):
"""
更新任务状态,同时写入 Redis 和 MongoDB
Args:
task_id: 任务ID
status: 状态
progress: 进度
message: 消息
result: 结果
error: 错误信息
"""
meta = {"progress": progress, "message": message}
if result:
meta["result"] = result
if error:
meta["error"] = error
# 尝试写入 Redis
try:
await redis_db.set_task_status(task_id, status, meta)
except Exception as e:
logger.warning(f"Redis 任务状态更新失败: {e}")
# 尝试写入 MongoDB作为备用
try:
await mongodb.update_task(
task_id=task_id,
status=status,
message=message,
result=result,
error=error
)
except Exception as e:
logger.warning(f"MongoDB 任务状态更新失败: {e}")
# ==================== 请求/响应模型 ====================
class UploadResponse(BaseModel):
@@ -77,6 +123,17 @@ async def upload_document(
task_id = str(uuid.uuid4())
try:
# 保存任务记录到 MongoDB如果 Redis 不可用时仍能查询)
try:
await mongodb.insert_task(
task_id=task_id,
task_type="document_parse",
status="pending",
message=f"文档 {file.filename} 已提交处理"
)
except Exception as mongo_err:
logger.warning(f"MongoDB 保存任务记录失败: {mongo_err}")
content = await file.read()
saved_path = file_service.save_uploaded_file(
content,
@@ -122,6 +179,17 @@ async def upload_documents(
saved_paths = []
try:
# 保存任务记录到 MongoDB
try:
await mongodb.insert_task(
task_id=task_id,
task_type="batch_parse",
status="pending",
message=f"已提交 {len(files)} 个文档处理"
)
except Exception as mongo_err:
logger.warning(f"MongoDB 保存批量任务记录失败: {mongo_err}")
for file in files:
if not file.filename:
continue
@@ -159,9 +227,9 @@ async def process_document(
"""处理单个文档"""
try:
# 状态: 解析中
await redis_db.set_task_status(
await update_task_status(
task_id, status="processing",
meta={"progress": 10, "message": "正在解析文档"}
progress=10, message="正在解析文档"
)
# 解析文档
@@ -172,9 +240,9 @@ async def process_document(
raise Exception(result.error or "解析失败")
# 状态: 存储中
await redis_db.set_task_status(
await update_task_status(
task_id, status="processing",
meta={"progress": 30, "message": "正在存储数据"}
progress=30, message="正在存储数据"
)
# 存储到 MongoDB
@@ -191,9 +259,9 @@ async def process_document(
# 如果是 Excel存储到 MySQL + AI生成描述 + RAG索引
if doc_type in ["xlsx", "xls"]:
await redis_db.set_task_status(
await update_task_status(
task_id, status="processing",
meta={"progress": 50, "message": "正在存储到MySQL并生成字段描述"}
progress=50, message="正在存储到MySQL并生成字段描述"
)
try:
@@ -215,9 +283,9 @@ async def process_document(
else:
# 非结构化文档
await redis_db.set_task_status(
await update_task_status(
task_id, status="processing",
meta={"progress": 60, "message": "正在建立索引"}
progress=60, message="正在建立索引"
)
# 如果文档中有表格数据,提取并存储到 MySQL + RAG
@@ -238,17 +306,13 @@ async def process_document(
await index_document_to_rag(doc_id, original_filename, result, doc_type)
# 完成
await redis_db.set_task_status(
await update_task_status(
task_id, status="success",
meta={
"progress": 100,
"message": "处理完成",
progress=100, message="处理完成",
result={
"doc_id": doc_id,
"result": {
"doc_id": doc_id,
"doc_type": doc_type,
"filename": original_filename
}
"doc_type": doc_type,
"filename": original_filename
}
)
@@ -256,18 +320,19 @@ async def process_document(
except Exception as e:
logger.error(f"文档处理失败: {str(e)}")
await redis_db.set_task_status(
await update_task_status(
task_id, status="failure",
meta={"error": str(e)}
progress=0, message="处理失败",
error=str(e)
)
async def process_documents_batch(task_id: str, files: List[dict]):
"""批量处理文档"""
try:
await redis_db.set_task_status(
await update_task_status(
task_id, status="processing",
meta={"progress": 0, "message": "开始批量处理"}
progress=0, message="开始批量处理"
)
results = []
@@ -318,37 +383,43 @@ async def process_documents_batch(task_id: str, files: List[dict]):
results.append({"filename": file_info["filename"], "success": False, "error": str(e)})
progress = int((i + 1) / len(files) * 100)
await redis_db.set_task_status(
await update_task_status(
task_id, status="processing",
meta={"progress": progress, "message": f"已处理 {i+1}/{len(files)}"}
progress=progress, message=f"已处理 {i+1}/{len(files)}"
)
await redis_db.set_task_status(
await update_task_status(
task_id, status="success",
meta={"progress": 100, "message": "批量处理完成", "results": results}
progress=100, message="批量处理完成",
result={"results": results}
)
except Exception as e:
logger.error(f"批量处理失败: {str(e)}")
await redis_db.set_task_status(
await update_task_status(
task_id, status="failure",
meta={"error": str(e)}
progress=0, message="批量处理失败",
error=str(e)
)
async def index_document_to_rag(doc_id: str, filename: str, result: ParseResult, doc_type: str):
"""将非结构化文档索引到 RAG"""
"""将非结构化文档索引到 RAG(使用分块索引)"""
try:
content = result.data.get("content", "")
if content:
# 将完整内容传递给 RAG 服务自动分块索引
rag_service.index_document_content(
doc_id=doc_id,
content=content[:5000],
content=content, # 传递完整内容,由 RAG 服务自动分块
metadata={
"filename": filename,
"doc_type": doc_type
}
},
chunk_size=500, # 每块 500 字符
chunk_overlap=50 # 块之间 50 字符重叠
)
logger.info(f"RAG 索引完成: {filename}, doc_id={doc_id}")
except Exception as e:
logger.warning(f"RAG 索引失败: {str(e)}")
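The hunk above stops truncating content to 5000 characters and instead passes `chunk_size=500` and `chunk_overlap=50` to the RAG service. The service's own chunking is not shown in this diff, so the following is only a sketch of what fixed-size character chunking with overlap could look like; `chunk_text` is a hypothetical helper, not the service's API.

```python
from typing import List


def chunk_text(content: str, chunk_size: int = 500, chunk_overlap: int = 50) -> List[str]:
    """Split `content` into windows of `chunk_size` characters that overlap by `chunk_overlap`."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks: List[str] = []
    start = 0
    while start < len(content):
        chunks.append(content[start:start + chunk_size])
        if start + chunk_size >= len(content):
            break  # the last window already reached the end of the text
        start += step
    return chunks
```

Under these assumptions each window advances by 450 characters, so the last 50 characters of one chunk are repeated at the start of the next.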


@@ -19,26 +19,43 @@ async def health_check() -> Dict[str, Any]:
返回各数据库连接状态和应用信息
"""
# 检查各数据库连接状态
mysql_status = "connected"
mongodb_status = "connected"
redis_status = "connected"
mysql_status = "unknown"
mongodb_status = "unknown"
redis_status = "unknown"
try:
if mysql_db.async_engine is None:
mysql_status = "disconnected"
except Exception:
else:
# 实际执行一次查询验证连接
from sqlalchemy import text
async with mysql_db.async_engine.connect() as conn:
await conn.execute(text("SELECT 1"))
mysql_status = "connected"
except Exception as e:
logger.warning(f"MySQL 健康检查失败: {e}")
mysql_status = "error"
try:
if mongodb.client is None:
mongodb_status = "disconnected"
except Exception:
else:
# 实际 ping 验证
await mongodb.client.admin.command('ping')
mongodb_status = "connected"
except Exception as e:
logger.warning(f"MongoDB 健康检查失败: {e}")
mongodb_status = "error"
try:
if not redis_db.is_connected:
if not redis_db.is_connected or redis_db.client is None:
redis_status = "disconnected"
except Exception:
else:
# 实际执行 ping 验证
await redis_db.client.ping()
redis_status = "connected"
except Exception as e:
logger.warning(f"Redis 健康检查失败: {e}")
redis_status = "error"
return {
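The three checks in this hunk share one shape: report "disconnected" when no client exists, otherwise run a cheap round trip and map any failure to "error". A minimal sketch of that shape follows; `probe` and its parameters are hypothetical names introduced for illustration and do not appear in the diff.

```python
import asyncio
import logging
from typing import Awaitable, Callable, Optional

logger = logging.getLogger(__name__)


async def probe(name: str, client: Optional[object], ping: Callable[[], Awaitable[object]]) -> str:
    """Return 'disconnected', 'connected', or 'error' for one backing service."""
    if client is None:
        return "disconnected"
    try:
        await ping()  # e.g. SELECT 1 for MySQL, the ping command for MongoDB or Redis
        return "connected"
    except Exception as exc:
        logger.warning("%s health check failed: %s", name, exc)
        return "error"
```

Factoring the pattern out this way would remove the repetition across the three blocks, at the cost of passing each backend's ping call in as a callable.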


@@ -1,13 +1,13 @@
"""
任务管理 API 接口
提供异步任务状态查询
提供异步任务状态查询和历史记录
"""
from typing import Optional
from fastapi import APIRouter, HTTPException
from app.core.database import redis_db
from app.core.database import redis_db, mongodb
router = APIRouter(prefix="/tasks", tags=["任务管理"])
@@ -23,25 +23,94 @@ async def get_task_status(task_id: str):
Returns:
任务状态信息
"""
# 优先从 Redis 获取
status = await redis_db.get_task_status(task_id)
if not status:
# Redis不可用时假设任务已完成文档已成功处理
# 前端轮询时会得到这个响应
if status:
return {
"task_id": task_id,
"status": "success",
"progress": 100,
"message": "任务处理完成",
"result": None,
"error": None
"status": status.get("status", "unknown"),
"progress": status.get("meta", {}).get("progress", 0),
"message": status.get("meta", {}).get("message"),
"result": status.get("meta", {}).get("result"),
"error": status.get("meta", {}).get("error")
}
# Redis 不可用时,尝试从 MongoDB 获取
mongo_task = await mongodb.get_task(task_id)
if mongo_task:
return {
"task_id": mongo_task.get("task_id"),
"status": mongo_task.get("status", "unknown"),
"progress": 100 if mongo_task.get("status") == "success" else 0,
"message": mongo_task.get("message"),
"result": mongo_task.get("result"),
"error": mongo_task.get("error")
}
# 任务不存在或状态未知
return {
"task_id": task_id,
"status": status.get("status", "unknown"),
"progress": status.get("meta", {}).get("progress", 0),
"message": status.get("meta", {}).get("message"),
"result": status.get("meta", {}).get("result"),
"error": status.get("meta", {}).get("error")
"status": "unknown",
"progress": 0,
"message": "无法获取任务状态Redis和MongoDB均不可用",
"result": None,
"error": None
}
@router.get("/")
async def list_tasks(limit: int = 50, skip: int = 0):
"""
获取任务历史列表
Args:
limit: 返回数量限制
skip: 跳过数量
Returns:
任务列表
"""
try:
tasks = await mongodb.list_tasks(limit=limit, skip=skip)
return {
"success": True,
"tasks": tasks,
"count": len(tasks)
}
except Exception as e:
# MongoDB 不可用时返回空列表
return {
"success": False,
"tasks": [],
"count": 0,
"error": str(e)
}
@router.delete("/{task_id}")
async def delete_task(task_id: str):
"""
删除任务
Args:
task_id: 任务ID
Returns:
是否删除成功
"""
try:
# 从 Redis 删除
if redis_db._connected and redis_db.client:
key = f"task:{task_id}"
await redis_db.client.delete(key)
# 从 MongoDB 删除
deleted = await mongodb.delete_task(task_id)
return {
"success": True,
"deleted": deleted
}
except Exception as e:
raise HTTPException(status_code=500, detail=f"删除任务失败: {str(e)}")
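The read path in this file prefers Redis and falls back to MongoDB before reporting "unknown". The sketch below reproduces that order with in-memory stand-ins. `FakeStore`, `lookup_task_status`, and the `source` key are hypothetical, and the try/except around each lookup is an addition of this sketch rather than something the endpoint above does.

```python
import asyncio
from typing import Any, Dict, Optional


class FakeStore:
    """Hypothetical in-memory stand-in for the Redis or MongoDB task helpers."""

    def __init__(self, data: Optional[Dict[str, Dict[str, Any]]] = None, fail: bool = False):
        self.data = data or {}
        self.fail = fail

    async def get(self, task_id: str) -> Optional[Dict[str, Any]]:
        if self.fail:
            raise ConnectionError("store unavailable")
        return self.data.get(task_id)


async def lookup_task_status(task_id: str, redis_store: FakeStore, mongo_store: FakeStore) -> Dict[str, Any]:
    """Prefer the cache, fall back to the durable store, else report 'unknown'."""
    try:
        cached = await redis_store.get(task_id)
    except Exception:
        cached = None  # treat an unreachable cache like a cache miss
    if cached:
        meta = cached.get("meta", {})
        return {"task_id": task_id, "status": cached.get("status", "unknown"),
                "progress": meta.get("progress", 0), "source": "redis"}

    try:
        record = await mongo_store.get(task_id)
    except Exception:
        record = None
    if record:
        status = record.get("status", "unknown")
        # The durable record carries no progress value, so only 0 or 100 can be reported.
        return {"task_id": task_id, "status": status,
                "progress": 100 if status == "success" else 0, "source": "mongodb"}

    return {"task_id": task_id, "status": "unknown", "progress": 0, "source": None}
```

Because `update_task` in this diff takes no progress argument, a client polling through the MongoDB fallback would see progress jump from 0 straight to 100.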


@@ -23,6 +23,44 @@ logger = logging.getLogger(__name__)
router = APIRouter(prefix="/templates", tags=["表格模板"])
# ==================== 辅助函数 ====================
async def update_task_status(
task_id: str,
status: str,
progress: int = 0,
message: str = "",
result: dict = None,
error: str = None
):
"""
更新任务状态,同时写入 Redis 和 MongoDB
"""
from app.core.database import redis_db
meta = {"progress": progress, "message": message}
if result:
meta["result"] = result
if error:
meta["error"] = error
try:
await redis_db.set_task_status(task_id, status, meta)
except Exception as e:
logger.warning(f"Redis 任务状态更新失败: {e}")
try:
await mongodb.update_task(
task_id=task_id,
status=status,
message=message,
result=result,
error=error
)
except Exception as e:
logger.warning(f"MongoDB 任务状态更新失败: {e}")
# ==================== 请求/响应模型 ====================
class TemplateFieldRequest(BaseModel):
@@ -41,6 +79,7 @@ class FillRequest(BaseModel):
source_doc_ids: Optional[List[str]] = None # MongoDB 文档 ID 列表
source_file_paths: Optional[List[str]] = None # 源文档文件路径列表
user_hint: Optional[str] = None
task_id: Optional[str] = None # 可选的任务ID用于任务历史跟踪
class ExportRequest(BaseModel):
@@ -155,20 +194,17 @@ async def upload_joint_template(
)
try:
# 1. 保存模板文件并提取字段
# 1. 保存模板文件
template_content = await template_file.read()
template_path = file_service.save_uploaded_file(
template_content,
template_file.filename,
subfolder="templates"
)
template_fields = await template_fill_service.get_template_fields_from_file(
template_path,
template_ext
)
# 2. 处理源文档 - 保存文件
# 2. 保存并解析源文档 - 提取内容用于生成表头
source_file_info = []
source_contents = []
for sf in source_files:
if sf.filename:
sf_content = await sf.read()
@@ -183,10 +219,81 @@ async def upload_joint_template(
"filename": sf.filename,
"ext": sf_ext
})
# 解析源文档获取内容(用于 AI 生成表头)
try:
from app.core.document_parser import ParserFactory
parser = ParserFactory.get_parser(sf_path)
parse_result = parser.parse(sf_path)
if parse_result.success and parse_result.data:
# 获取原始内容
content = parse_result.data.get("content", "")[:5000] if parse_result.data.get("content") else ""
# 获取标题可能在顶层或structured_data内
titles = parse_result.data.get("titles", [])
if not titles and parse_result.data.get("structured_data"):
titles = parse_result.data.get("structured_data", {}).get("titles", [])
titles = titles[:10] if titles else []
# 获取表格数量可能在顶层或structured_data内
tables = parse_result.data.get("tables", [])
if not tables and parse_result.data.get("structured_data"):
tables = parse_result.data.get("structured_data", {}).get("tables", [])
tables_count = len(tables) if tables else 0
# 获取表格内容摘要(用于 AI 理解源文档结构)
tables_summary = ""
if tables:
tables_summary = "\n【文档中的表格】:\n"
for idx, table in enumerate(tables[:5]): # 最多5个表格
if isinstance(table, dict):
headers = table.get("headers", [])
rows = table.get("rows", [])
if headers:
tables_summary += f"表格{idx+1}表头: {', '.join(str(h) for h in headers)}\n"
if rows:
tables_summary += f"表格{idx+1}前3行: "
for row_idx, row in enumerate(rows[:3]):
if isinstance(row, list):
tables_summary += " | ".join(str(c) for c in row) + "; "
elif isinstance(row, dict):
tables_summary += " | ".join(str(row.get(h, "")) for h in headers if headers) + "; "
tables_summary += "\n"
source_contents.append({
"filename": sf.filename,
"doc_type": sf_ext,
"content": content,
"titles": titles,
"tables_count": tables_count,
"tables_summary": tables_summary
})
logger.info(f"[DEBUG] source_contents built: filename={sf.filename}, content_len={len(content)}, titles_count={len(titles)}, tables_count={tables_count}")
if tables_summary:
logger.info(f"[DEBUG] tables_summary preview: {tables_summary[:300]}")
except Exception as e:
logger.warning(f"解析源文档失败 {sf.filename}: {e}")
# 3. 根据源文档内容生成表头
template_fields = await template_fill_service.get_template_fields_from_file(
template_path,
template_ext,
source_contents=source_contents # 传递源文档内容
)
# 3. 异步处理源文档到MongoDB
task_id = str(uuid.uuid4())
if source_file_info:
# 保存任务记录到 MongoDB
try:
await mongodb.insert_task(
task_id=task_id,
task_type="source_process",
status="pending",
message=f"开始处理 {len(source_file_info)} 个源文档"
)
except Exception as mongo_err:
logger.warning(f"MongoDB 保存任务记录失败: {mongo_err}")
background_tasks.add_task(
process_source_documents,
task_id=task_id,
@@ -225,12 +332,10 @@ async def upload_joint_template(
async def process_source_documents(task_id: str, files: List[dict]):
"""异步处理源文档存入MongoDB"""
from app.core.database import redis_db
try:
await redis_db.set_task_status(
await update_task_status(
task_id, status="processing",
meta={"progress": 0, "message": "开始处理源文档"}
progress=0, message="开始处理源文档"
)
doc_ids = []
@@ -259,22 +364,24 @@ async def process_source_documents(task_id: str, files: List[dict]):
logger.error(f"源文档处理异常: {file_info['filename']}, error: {str(e)}")
progress = int((i + 1) / len(files) * 100)
await redis_db.set_task_status(
await update_task_status(
task_id, status="processing",
meta={"progress": progress, "message": f"已处理 {i+1}/{len(files)}"}
progress=progress, message=f"已处理 {i+1}/{len(files)}"
)
await redis_db.set_task_status(
await update_task_status(
task_id, status="success",
meta={"progress": 100, "message": "源文档处理完成", "doc_ids": doc_ids}
progress=100, message="源文档处理完成",
result={"doc_ids": doc_ids}
)
logger.info(f"所有源文档处理完成: {len(doc_ids)}")
except Exception as e:
logger.error(f"源文档批量处理失败: {str(e)}")
await redis_db.set_task_status(
await update_task_status(
task_id, status="failure",
meta={"error": str(e)}
progress=0, message="源文档处理失败",
error=str(e)
)
@@ -333,7 +440,27 @@ async def fill_template(
Returns:
填写结果
"""
# 生成或使用传入的 task_id
task_id = request.task_id or str(uuid.uuid4())
try:
# 创建任务记录到 MongoDB
try:
await mongodb.insert_task(
task_id=task_id,
task_type="template_fill",
status="processing",
message=f"开始填表任务: {len(request.template_fields)} 个字段"
)
except Exception as mongo_err:
logger.warning(f"MongoDB 创建任务记录失败: {mongo_err}")
# 更新进度 - 开始
await update_task_status(
task_id, "processing",
progress=0, message="开始处理..."
)
# 转换字段
fields = [
TemplateField(
@@ -346,17 +473,51 @@ async def fill_template(
for f in request.template_fields
]
# 从 template_id 提取文件类型
template_file_type = "xlsx" # 默认类型
if request.template_id:
ext = request.template_id.split('.')[-1].lower()
if ext in ["xlsx", "xls"]:
template_file_type = "xlsx"
elif ext == "docx":
template_file_type = "docx"
# 更新进度 - 准备开始填写
await update_task_status(
task_id, "processing",
progress=10, message=f"准备填写 {len(fields)} 个字段..."
)
# 执行填写
result = await template_fill_service.fill_template(
template_fields=fields,
source_doc_ids=request.source_doc_ids,
source_file_paths=request.source_file_paths,
user_hint=request.user_hint
user_hint=request.user_hint,
template_id=request.template_id,
template_file_type=template_file_type,
task_id=task_id
)
return result
# 更新为成功
await update_task_status(
task_id, "success",
progress=100, message="填表完成",
result={
"field_count": len(fields),
"max_rows": result.get("max_rows", 0)
}
)
return {**result, "task_id": task_id}
except Exception as e:
# 更新为失败
await update_task_status(
task_id, "failure",
progress=0, message="填表失败",
error=str(e)
)
logger.error(f"填写表格失败: {str(e)}")
raise HTTPException(status_code=500, detail=f"填写失败: {str(e)}")
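The `update_task_status` helper appears in two modules in this diff (the upload router and the templates router) with the same best-effort dual-write behavior. A minimal sketch of that behavior against injectable backends is shown below. `InMemoryBackend` and `dual_write_status` are hypothetical names, and returning the number of successful writes is an assumption added here to make the outcome observable.

```python
import asyncio
import logging
from typing import Any, Dict, Optional

logger = logging.getLogger(__name__)


class InMemoryBackend:
    """Hypothetical stand-in for a Redis or MongoDB task store."""

    def __init__(self, fail: bool = False):
        self.records: Dict[str, Dict[str, Any]] = {}
        self.fail = fail

    async def write(self, task_id: str, payload: Dict[str, Any]) -> None:
        if self.fail:
            raise ConnectionError("backend unavailable")
        self.records[task_id] = payload


async def dual_write_status(
    task_id: str,
    status: str,
    primary: InMemoryBackend,
    secondary: InMemoryBackend,
    progress: int = 0,
    message: str = "",
    result: Optional[dict] = None,
    error: Optional[str] = None,
) -> int:
    """Write the status to both backends; return how many writes succeeded."""
    payload = {"status": status, "progress": progress, "message": message,
               "result": result, "error": error}
    written = 0
    for name, backend in (("primary", primary), ("secondary", secondary)):
        try:
            await backend.write(task_id, payload)
            written += 1
        except Exception as exc:
            # Best effort: one backend failing must not block the other.
            logger.warning("%s task status update failed: %s", name, exc)
    return written
```

Keeping a single shared helper of this kind in one module would avoid the two copies drifting apart.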


@@ -5,6 +5,7 @@ from fastapi import APIRouter, UploadFile, File, HTTPException, Query
from fastapi.responses import StreamingResponse
from typing import Optional
import logging
import os
import pandas as pd
import io
@@ -126,7 +127,7 @@ async def upload_excel(
content += f"... (共 {len(sheet_data['rows'])} 行)\n\n"
doc_metadata = {
"filename": saved_path.split("/")[-1] if "/" in saved_path else saved_path.split("\\")[-1],
"filename": os.path.basename(saved_path),
"original_filename": file.filename,
"saved_path": saved_path,
"file_size": len(content),
@@ -253,7 +254,7 @@ async def export_excel(
output.seek(0)
# 生成文件名
original_name = file_path.split('/')[-1] if '/' in file_path else file_path
original_name = os.path.basename(file_path)
if columns:
export_name = f"export_{sheet_name or 'data'}_{len(column_list) if columns else 'all'}_cols.xlsx"
else:
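One behavior worth checking with the switch to `os.path.basename`: it follows the path conventions of the host OS, so on a POSIX host a backslash-separated path is not split, whereas the removed code in upload_excel handled both separators by hand. The sketch below shows the difference using the standard library's `ntpath` and `posixpath` modules directly; `basename_any` is a hypothetical helper, not part of this diff.

```python
import ntpath
import posixpath


def basename_any(path: str) -> str:
    """Return the final path component for either '/' or '\\' separated paths."""
    # ntpath treats both '/' and '\\' as separators, regardless of the host OS.
    return ntpath.basename(path)


# posixpath is what os.path resolves to on Linux and macOS.
posix_result = posixpath.basename("C:\\data\\report.xlsx")
portable_result = basename_any("C:\\data\\report.xlsx")
```

If saved paths are always produced on the same host that reads them, `os.path.basename` is sufficient; the helper only matters when Windows-style paths can reach a POSIX server.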


@@ -59,6 +59,11 @@ class MongoDB:
"""RAG索引集合 - 存储字段语义索引"""
return self.db["rag_index"]
@property
def tasks(self):
"""任务集合 - 存储任务历史记录"""
return self.db["tasks"]
# ==================== 文档操作 ====================
async def insert_document(
@@ -242,8 +247,128 @@ class MongoDB:
await self.rag_index.create_index("table_name")
await self.rag_index.create_index("field_name")
# 任务集合索引
await self.tasks.create_index("task_id", unique=True)
await self.tasks.create_index("created_at")
logger.info("MongoDB 索引创建完成")
# ==================== Task history operations ====================
async def insert_task(
self,
task_id: str,
task_type: str,
status: str = "pending",
message: str = "",
result: Optional[Dict[str, Any]] = None,
error: Optional[str] = None,
) -> str:
"""
Insert a task record.
Args:
task_id: Task ID
task_type: Task type
status: Task status
message: Task message
result: Task result
error: Error message
Returns:
ID of the inserted document
"""
task = {
"task_id": task_id,
"task_type": task_type,
"status": status,
"message": message,
"result": result,
"error": error,
"created_at": datetime.utcnow(),
"updated_at": datetime.utcnow(),
}
result_obj = await self.tasks.insert_one(task)
return str(result_obj.inserted_id)
async def update_task(
self,
task_id: str,
status: Optional[str] = None,
message: Optional[str] = None,
result: Optional[Dict[str, Any]] = None,
error: Optional[str] = None,
) -> bool:
"""
Update a task's status.
Args:
task_id: Task ID
status: Task status
message: Task message
result: Task result
error: Error message
Returns:
True if a task document was modified
"""
update_data = {"updated_at": datetime.utcnow()}
if status is not None:
update_data["status"] = status
if message is not None:
update_data["message"] = message
if result is not None:
update_data["result"] = result
if error is not None:
update_data["error"] = error
update_result = await self.tasks.update_one(
{"task_id": task_id},
{"$set": update_data}
)
return update_result.modified_count > 0
async def get_task(self, task_id: str) -> Optional[Dict[str, Any]]:
"""Fetch a task by its task_id."""
task = await self.tasks.find_one({"task_id": task_id})
if task:
task["_id"] = str(task["_id"])
return task
async def list_tasks(
self,
limit: int = 50,
skip: int = 0,
) -> List[Dict[str, Any]]:
"""
List tasks, newest first.
Args:
limit: Maximum number of tasks to return
skip: Number of tasks to skip
Returns:
List of tasks
"""
cursor = self.tasks.find().sort("created_at", -1).skip(skip).limit(limit)
tasks = []
async for task in cursor:
task["_id"] = str(task["_id"])
# Convert datetime values to ISO 8601 strings
if task.get("created_at"):
task["created_at"] = task["created_at"].isoformat()
if task.get("updated_at"):
task["updated_at"] = task["updated_at"].isoformat()
tasks.append(task)
return tasks
async def delete_task(self, task_id: str) -> bool:
"""Delete a task."""
result = await self.tasks.delete_one({"task_id": task_id})
return result.deleted_count > 0
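Two details in the task helpers may be worth tightening: `datetime.utcnow()` is deprecated as of Python 3.12, and `get_task` returns raw `datetime` values while `list_tasks` converts them to strings. The sketch below shows one way to build the `$set` document with timezone-aware timestamps and to serialize a task consistently. `build_task_update` and `serialize_task` are hypothetical helpers, not part of the diff.

```python
from datetime import datetime, timezone
from typing import Any, Dict, Optional


def build_task_update(
    status: Optional[str] = None,
    message: Optional[str] = None,
    result: Optional[Dict[str, Any]] = None,
    error: Optional[str] = None,
) -> Dict[str, Any]:
    """Build a MongoDB update document containing only the fields that were given."""
    fields = {"status": status, "message": message, "result": result, "error": error}
    update: Dict[str, Any] = {k: v for k, v in fields.items() if v is not None}
    # Timezone-aware replacement for the deprecated datetime.utcnow().
    update["updated_at"] = datetime.now(timezone.utc)
    return {"$set": update}


def serialize_task(task: Dict[str, Any]) -> Dict[str, Any]:
    """Return a JSON-friendly copy of a task document."""
    out = dict(task)
    if "_id" in out:
        out["_id"] = str(out["_id"])
    for key in ("created_at", "updated_at"):
        if isinstance(out.get(key), datetime):
            out[key] = out[key].isoformat()
    return out
```

Calling `serialize_task` from both `get_task` and `list_tasks` would give API consumers the same shape from either endpoint.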
# ==================== 全局单例 ====================


@@ -59,7 +59,13 @@ class DocxParser(BaseParser):
paragraphs = []
for para in doc.paragraphs:
if para.text.strip():
paragraphs.append(para.text)
paragraphs.append({
"text": para.text,
"style": str(para.style.name) if para.style else "Normal"
})
# 提取段落纯文本(用于 AI 解析)
paragraphs_text = [p["text"] for p in paragraphs if p["text"].strip()]
# 提取表格内容
tables_data = []
@@ -77,8 +83,25 @@ class DocxParser(BaseParser):
"column_count": len(table_rows[0]) if table_rows else 0
})
# 合并所有文本
full_text = "\n".join(paragraphs)
# 提取图片/嵌入式对象信息
images_info = self._extract_images_info(doc, path)
# 合并所有文本(包括图片描述)
full_text_parts = []
full_text_parts.append("【文档正文】")
full_text_parts.extend(paragraphs_text)
if tables_data:
full_text_parts.append("\n【文档表格】")
for idx, table in enumerate(tables_data):
full_text_parts.append(f"--- 表格 {idx + 1} ---")
for row in table["rows"]:
full_text_parts.append(" | ".join(str(cell) for cell in row))
if images_info.get("image_count", 0) > 0:
full_text_parts.append(f"\n【文档图片】文档包含 {images_info['image_count']} 张图片/图表")
full_text = "\n".join(full_text_parts)
# 构建元数据
metadata = {
@@ -89,7 +112,9 @@ class DocxParser(BaseParser):
"table_count": len(tables_data),
"word_count": len(full_text),
"char_count": len(full_text.replace("\n", "")),
"has_tables": len(tables_data) > 0
"has_tables": len(tables_data) > 0,
"has_images": images_info.get("image_count", 0) > 0,
"image_count": images_info.get("image_count", 0)
}
# 返回结果
@@ -97,12 +122,16 @@ class DocxParser(BaseParser):
success=True,
data={
"content": full_text,
"paragraphs": paragraphs,
"paragraphs": paragraphs_text,
"paragraphs_with_style": paragraphs,
"tables": tables_data,
"images": images_info,
"word_count": len(full_text),
"structured_data": {
"paragraphs": paragraphs,
"tables": tables_data
"paragraphs_text": paragraphs_text,
"tables": tables_data,
"images": images_info
}
},
metadata=metadata
@@ -115,6 +144,59 @@ class DocxParser(BaseParser):
error=f"解析 Word 文档失败: {str(e)}"
)
def extract_images_as_base64(self, file_path: str) -> List[Dict[str, str]]:
"""
Extract every image from a Word document as base64-encoded data.
Args:
file_path: Path to the Word file
Returns:
List of images; each item holds the base64 data and the MIME type
"""
import zipfile
import base64
images = []
try:
with zipfile.ZipFile(file_path, 'r') as zf:
# 查找 word/media 目录下的图片文件
for filename in zf.namelist():
if filename.startswith('word/media/'):
# 获取图片类型
ext = filename.split('.')[-1].lower()
mime_types = {
'png': 'image/png',
'jpg': 'image/jpeg',
'jpeg': 'image/jpeg',
'gif': 'image/gif',
'bmp': 'image/bmp'
}
mime_type = mime_types.get(ext, 'image/png')
try:
# 读取图片数据并转为 base64
image_data = zf.read(filename)
base64_data = base64.b64encode(image_data).decode('utf-8')
images.append({
"filename": filename,
"mime_type": mime_type,
"base64": base64_data,
"size": len(image_data)
})
logger.info(f"提取图片: {filename}, 大小: {len(image_data)} bytes")
except Exception as e:
logger.warning(f"提取图片失败 {filename}: {str(e)}")
except Exception as e:
logger.error(f"打开 Word 文档提取图片失败: {str(e)}")
logger.info(f"共提取 {len(images)} 张图片")
return images
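The extraction above relies on the fact that a `.docx` file is a ZIP archive whose embedded media live under `word/media/`. The sketch below reproduces that idea on in-memory bytes. Unlike the method above, which labels unknown extensions as `image/png`, this variant skips them (for example EMF or WMF files); that difference is an assumption about what a vision model can accept, and `list_media` and `MIME_BY_EXT` are hypothetical names.

```python
import base64
import io
import zipfile
from typing import Dict, List

MIME_BY_EXT = {
    "png": "image/png",
    "jpg": "image/jpeg",
    "jpeg": "image/jpeg",
    "gif": "image/gif",
    "bmp": "image/bmp",
}


def list_media(docx_bytes: bytes) -> List[Dict[str, str]]:
    """Return base64-encoded raster images found under word/media/."""
    images: List[Dict[str, str]] = []
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as zf:
        for name in zf.namelist():
            if not name.startswith("word/media/"):
                continue
            ext = name.rsplit(".", 1)[-1].lower()
            mime = MIME_BY_EXT.get(ext)
            if mime is None:
                continue  # skip formats outside the known raster types
            data = zf.read(name)
            images.append({
                "filename": name,
                "mime_type": mime,
                "base64": base64.b64encode(data).decode("ascii"),
            })
    return images
```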
def extract_key_sentences(self, text: str, max_sentences: int = 10) -> List[str]:
"""
从文本中提取关键句子
@@ -268,6 +350,60 @@ class DocxParser(BaseParser):
return fields
def _extract_images_info(self, doc: Document, path: Path) -> Dict[str, Any]:
"""
Collect information about images and embedded objects in a Word document.
Args:
doc: Document object
path: File path
Returns:
Dictionary describing the images
"""
import zipfile
image_count = 0
image_descriptions = []
inline_shapes_count = 0
try:
# Method 1: count images via inline shapes
try:
inline_shapes_count = len(doc.inline_shapes)
if inline_shapes_count > 0:
image_count = inline_shapes_count
image_descriptions.append(f"文档包含 {inline_shapes_count} 个嵌入式图形/图片")
except Exception:
pass
# Method 2: open the file as a ZIP archive and count embedded media
try:
with zipfile.ZipFile(path, 'r') as zf:
# Collect the image files under word/media
media_files = [f for f in zf.namelist() if f.startswith('word/media/')]
if media_files and not inline_shapes_count:
image_count = len(media_files)
image_descriptions.append(f"文档包含 {image_count} 个嵌入图片")
# Check for images referenced from headers or footers
header_images = [f for f in zf.namelist() if 'header' in f.lower() and f.endswith(('.png', '.jpg', '.jpeg', '.gif', '.bmp'))]
if header_images:
image_descriptions.append(f"页眉/页脚包含 {len(header_images)} 个图片")
except Exception:
pass
except Exception as e:
logger.warning(f"提取图片信息失败: {str(e)}")
return {
"image_count": image_count,
"inline_shapes_count": inline_shapes_count,
"descriptions": image_descriptions,
"has_images": image_count > 0
}
def _infer_field_type_from_hint(self, hint: str) -> str:
"""
从提示词推断字段类型


@@ -317,24 +317,70 @@ class XlsxParser(BaseParser):
import zipfile
from xml.etree import ElementTree as ET
# 常见的命名空间
COMMON_NAMESPACES = [
'http://schemas.openxmlformats.org/spreadsheetml/2006/main',
'http://schemas.openxmlformats.org/spreadsheetml/2005/main',
'http://schemas.openxmlformats.org/spreadsheetml/2004/main',
'http://schemas.openxmlformats.org/spreadsheetml/2003/main',
]
try:
with zipfile.ZipFile(file_path, 'r') as z:
if 'xl/workbook.xml' not in z.namelist():
# Try several possible locations for workbook.xml
possible_paths = ['xl/workbook.xml', 'xl\\workbook.xml', 'workbook.xml']
content = None
for path in possible_paths:
if path in z.namelist():
content = z.read(path)
logger.info(f"找到 workbook.xml at: {path}")
break
if content is None:
logger.warning(f"未找到 workbook.xml,文件列表: {z.namelist()[:10]}")
return []
content = z.read('xl/workbook.xml')
root = ET.fromstring(content)
# 命名空间
ns = {'main': 'http://schemas.openxmlformats.org/spreadsheetml/2006/main'}
sheet_names = []
for sheet in root.findall('.//main:sheet', ns):
name = sheet.get('name')
if name:
sheet_names.append(name)
# Method 1: look up sheet elements under each known namespace
for ns in COMMON_NAMESPACES:
sheet_elements = root.findall(f'.//{{{ns}}}sheet')
if sheet_elements:
for sheet in sheet_elements:
name = sheet.get('name')
if name:
sheet_names.append(name)
if sheet_names:
logger.info(f"使用命名空间 {ns} 提取工作表: {sheet_names}")
return sheet_names
# Method 2: ignore the namespace and match sheet elements by local name
if not sheet_names:
for elem in root.iter():
if elem.tag == 'sheet' or elem.tag.endswith('}sheet'):
name = elem.get('name')
if name and name not in sheet_names:
sheet_names.append(name)
# Method 3: fall back to a regex over the raw XML text to match sheet names
if not sheet_names:
import re
xml_str = content.decode('utf-8', errors='ignore')
matches = re.findall(r'<sheet\s+[^>]*name=["\']([^"\']+)["\']', xml_str, re.IGNORECASE)
if matches:
sheet_names = matches
logger.info(f"使用正则提取工作表: {sheet_names}")
logger.info(f"从 XML 提取工作表: {sheet_names}")
return sheet_names
except Exception as e:
logger.error(f"从 XML 提取工作表名称失败: {e}")
return []
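The three fallbacks above all aim at the same thing: finding `sheet` elements regardless of which namespace the producer used. A compact way to express that, sketched here under the assumption that only the element's local name matters, is to strip the `{namespace}` prefix that ElementTree puts on each tag. `local_name` and `sheet_names_from_workbook` are hypothetical helper names.

```python
import xml.etree.ElementTree as ET
from typing import List


def local_name(tag: str) -> str:
    """Strip the '{namespace}' prefix that ElementTree adds to qualified tags."""
    return tag.rsplit("}", 1)[-1]


def sheet_names_from_workbook(xml_bytes: bytes) -> List[str]:
    """Return sheet names in document order, whatever namespace is declared."""
    root = ET.fromstring(xml_bytes)
    names: List[str] = []
    for element in root.iter():
        name = element.get("name")
        if local_name(element.tag) == "sheet" and name:
            names.append(name)
    return names
```

Note that mapping the N-th sheet name to `sheetN.xml` is itself a heuristic; the authoritative mapping lives in `xl/_rels/workbook.xml.rels`.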
@@ -356,6 +402,32 @@ class XlsxParser(BaseParser):
import zipfile
from xml.etree import ElementTree as ET
# 常见的命名空间
COMMON_NAMESPACES = [
'http://schemas.openxmlformats.org/spreadsheetml/2006/main',
'http://schemas.openxmlformats.org/spreadsheetml/2005/main',
'http://schemas.openxmlformats.org/spreadsheetml/2004/main',
'http://schemas.openxmlformats.org/spreadsheetml/2003/main',
]
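def find_elements_with_ns(root, tag_name):
"""Find elements by tag name under any of the known namespaces."""
results = []
# Method 1: try each known namespace
for ns in COMMON_NAMESPACES:
try:
elems = root.findall(f'.//{{{ns}}}{tag_name}')
if elems:
results.extend(elems)
except Exception:
pass
# Method 2: match by local name, whatever the namespace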
if not results:
for elem in root.iter():
if elem.tag.endswith('}' + tag_name):
results.append(elem)
return results
with zipfile.ZipFile(file_path, 'r') as z:
# 获取工作表名称
sheet_names = self._extract_sheet_names_from_xml(file_path)
@@ -366,57 +438,68 @@ class XlsxParser(BaseParser):
target_sheet = sheet_name if sheet_name and sheet_name in sheet_names else sheet_names[0]
sheet_index = sheet_names.index(target_sheet) + 1 # sheet1.xml, sheet2.xml, ...
# 读取 shared strings
# Read shared strings, trying several possible paths
shared_strings = []
if 'xl/sharedStrings.xml' in z.namelist():
ss_content = z.read('xl/sharedStrings.xml')
ss_root = ET.fromstring(ss_content)
ns = {'main': 'http://schemas.openxmlformats.org/spreadsheetml/2006/main'}
for si in ss_root.findall('.//main:si', ns):
t = si.find('.//main:t', ns)
if t is not None:
shared_strings.append(t.text or '')
else:
shared_strings.append('')
ss_paths = ['xl/sharedStrings.xml', 'xl\\sharedStrings.xml', 'sharedStrings.xml']
for ss_path in ss_paths:
if ss_path in z.namelist():
try:
ss_content = z.read(ss_path)
ss_root = ET.fromstring(ss_content)
for si in find_elements_with_ns(ss_root, 'si'):
t_elements = [c for c in si if c.tag.endswith('}t') or c.tag == 't']
if t_elements:
shared_strings.append(t_elements[0].text or '')
else:
shared_strings.append('')
break
except Exception as e:
logger.warning(f"读取 sharedStrings 失败: {e}")
# 读取工作表
sheet_file = f'xl/worksheets/sheet{sheet_index}.xml'
if sheet_file not in z.namelist():
raise ValueError(f"工作表文件 {sheet_file} 不存在")
# Read the worksheet, trying several possible paths
sheet_content = None
sheet_paths = [
f'xl/worksheets/sheet{sheet_index}.xml',
f'xl\\worksheets\\sheet{sheet_index}.xml',
f'worksheets/sheet{sheet_index}.xml',
]
for sp in sheet_paths:
if sp in z.namelist():
sheet_content = z.read(sp)
break
if sheet_content is None:
raise ValueError(f"工作表文件 sheet{sheet_index}.xml 不存在")
sheet_content = z.read(sheet_file)
root = ET.fromstring(sheet_content)
ns = {'main': 'http://schemas.openxmlformats.org/spreadsheetml/2006/main'}
# 收集所有行数据
all_rows = []
headers = {}
for row in root.findall('.//main:row', ns):
for row in find_elements_with_ns(root, 'row'):
row_idx = int(row.get('r', 0))
row_cells = {}
for cell in row.findall('main:c', ns):
for cell in find_elements_with_ns(row, 'c'):
cell_ref = cell.get('r', '')
col_letters = ''.join(filter(str.isalpha, cell_ref))
cell_type = cell.get('t', 'n')
v = cell.find('main:v', ns)
v_elements = find_elements_with_ns(cell, 'v')
v = v_elements[0] if v_elements else None
if v is not None and v.text:
if cell_type == 's':
# shared string
try:
row_cells[col_letters] = shared_strings[int(v.text)]
except (ValueError, IndexError):
row_cells[col_letters] = v.text
elif cell_type == 'b':
# boolean
row_cells[col_letters] = v.text == '1'
else:
row_cells[col_letters] = v.text
else:
row_cells[col_letters] = None
# 处理表头行
if row_idx == header_row + 1:
headers = {**row_cells}
elif row_idx > header_row + 1:
@@ -424,7 +507,6 @@ class XlsxParser(BaseParser):
# 构建 DataFrame
if headers:
# 按原始列顺序排列
col_order = list(headers.keys())
df = pd.DataFrame(all_rows)
if not df.empty:
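The reworked shared-strings loop above looks only at direct `t` children of each `si` entry. Cells that use rich text store their text inside `r` runs, so such entries would appear to come out empty. The following sketch shows one way to cover both cases while leaving phonetic runs out; `read_shared_strings` is a hypothetical helper and is not the code in this diff.

```python
import xml.etree.ElementTree as ET
from typing import List


def local_name(tag: str) -> str:
    """Strip the '{namespace}' prefix from an ElementTree tag."""
    return tag.rsplit("}", 1)[-1]


def read_shared_strings(xml_bytes: bytes) -> List[str]:
    """Return one string per si entry, joining rich-text runs."""
    root = ET.fromstring(xml_bytes)
    strings: List[str] = []
    for si in root.iter():
        if local_name(si.tag) != "si":
            continue
        parts: List[str] = []
        for child in si:
            kind = local_name(child.tag)
            if kind == "t":
                # Plain string: text sits directly under si.
                parts.append(child.text or "")
            elif kind == "r":
                # Rich text: each run carries its own t element.
                parts.extend(
                    t.text or "" for t in child if local_name(t.tag) == "t"
                )
        strings.append("".join(parts))
    return strings
```

Keeping one output entry per `si`, even when it is empty, preserves the index alignment that `t="s"` cells rely on.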


@@ -0,0 +1,15 @@
"""
Instruction execution module
Note: this module is optional and is not implemented yet.
To enable it, implement intent_parser.py and executor.py
"""
from .intent_parser import IntentParser, DefaultIntentParser
from .executor import InstructionExecutor, DefaultInstructionExecutor
__all__ = [
"IntentParser",
"DefaultIntentParser",
"InstructionExecutor",
"DefaultInstructionExecutor",
]


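@@ -0,0 +1,35 @@
"""
Instruction executor module
Turns natural-language instructions into executable operations
Note: this module is optional and is not implemented yet.
"""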
from abc import ABC, abstractmethod
from typing import Any, Dict
class InstructionExecutor(ABC):
"""Abstract base class for instruction executors."""
@abstractmethod
async def execute(self, instruction: str, context: Dict[str, Any]) -> Dict[str, Any]:
"""
Execute an instruction.
Args:
instruction: The parsed instruction
context: Execution context
Returns:
Execution result
"""
pass
class DefaultInstructionExecutor(InstructionExecutor):
"""Default instruction executor."""
async def execute(self, instruction: str, context: Dict[str, Any]) -> Dict[str, Any]:
"""Not implemented yet."""
raise NotImplementedError("指令执行功能暂未实现")


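@@ -0,0 +1,34 @@
"""
Intent parser module
Parses a user's natural-language instruction into an intent and its parameters
Note: this module is optional and is not implemented yet.
"""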
from abc import ABC, abstractmethod
from typing import Any, Dict, Tuple
class IntentParser(ABC):
"""Abstract base class for intent parsers."""
@abstractmethod
async def parse(self, text: str) -> Tuple[str, Dict[str, Any]]:
"""
Parse a natural-language instruction.
Args:
text: Natural-language input from the user
Returns:
(intent type, parameter dictionary)
"""
pass
class DefaultIntentParser(IntentParser):
"""Default intent parser."""
async def parse(self, text: str) -> Tuple[str, Dict[str, Any]]:
"""Not implemented yet."""
raise NotImplementedError("意图解析功能暂未实现")
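Since both default classes only raise `NotImplementedError`, it may help to see what a concrete subclass could look like. The sketch below is purely illustrative: `KeywordIntentParser`, its keyword table, and the intent names are all hypothetical and do not reflect any planned behavior of the project. The abstract base class is repeated so the example runs on its own.

```python
import asyncio
from abc import ABC, abstractmethod
from typing import Any, Dict, Tuple


class IntentParser(ABC):
    """Same interface as the abstract base class in the diff."""

    @abstractmethod
    async def parse(self, text: str) -> Tuple[str, Dict[str, Any]]:
        ...


class KeywordIntentParser(IntentParser):
    """Hypothetical parser that maps the first matching keyword to an intent."""

    KEYWORDS = {"fill": "fill_template", "export": "export_excel"}

    async def parse(self, text: str) -> Tuple[str, Dict[str, Any]]:
        lowered = text.lower()
        for keyword, intent in self.KEYWORDS.items():
            if keyword in lowered:
                return intent, {"raw_text": text}
        return "unknown", {"raw_text": text}
```

A call such as `asyncio.run(KeywordIntentParser().parse("Export sheet 1"))` yields an `(intent, params)` tuple that an executor could dispatch on.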


@@ -1,6 +1,13 @@
"""
FastAPI 应用主入口
"""
# ========== Suppress noisy MongoDB driver logs ==========
import logging
logging.getLogger("pymongo").setLevel(logging.WARNING)
logging.getLogger("pymongo.topology").setLevel(logging.WARNING)
logging.getLogger("urllib3").setLevel(logging.WARNING)
# ==============================================
import logging
import logging.handlers
import sys


@@ -65,7 +65,17 @@ class LLMService:
return response.json()
except httpx.HTTPStatusError as e:
logger.error(f"LLM API 请求失败: {e.response.status_code} - {e.response.text}")
error_detail = e.response.text
logger.error(f"LLM API 请求失败: {e.response.status_code} - {error_detail}")
# Try to parse the structured error payload
try:
import json
err_json = json.loads(error_detail)
err_code = err_json.get("error", {}).get("code", "unknown")
err_msg = err_json.get("error", {}).get("message", "unknown")
logger.error(f"API 错误码: {err_code}, 错误信息: {err_msg}")
except Exception:
pass
raise
except Exception as e:
logger.error(f"LLM API 调用异常: {str(e)}")
@@ -328,6 +338,154 @@ Excel 数据概览:
"analysis": None
}
async def chat_with_images(
self,
text: str,
images: List[Dict[str, str]],
temperature: float = 0.7,
max_tokens: Optional[int] = None
) -> Dict[str, Any]:
"""
Call the vision model API with image input.
Args:
text: Text content
images: List of images; each item holds the base64 data and mime_type
Format: [{"base64": "...", "mime_type": "image/png"}, ...]
temperature: Sampling temperature
max_tokens: Maximum number of tokens
Returns:
Dict[str, Any]: API response
"""
headers = {
"Authorization": f"Bearer {self.api_key}",
"Content-Type": "application/json"
}
# 构建图片内容
image_contents = []
for img in images:
image_contents.append({
"type": "image_url",
"image_url": {
"url": f"data:{img['mime_type']};base64,{img['base64']}"
}
})
# 构建消息
messages = [
{
"role": "user",
"content": [
{
"type": "text",
"text": text
},
*image_contents
]
}
]
payload = {
"model": self.model_name,
"messages": messages,
"temperature": temperature
}
if max_tokens:
payload["max_tokens"] = max_tokens
try:
async with httpx.AsyncClient(timeout=120.0) as client:
response = await client.post(
f"{self.base_url}/chat/completions",
headers=headers,
json=payload
)
response.raise_for_status()
return response.json()
except httpx.HTTPStatusError as e:
error_detail = e.response.text
logger.error(f"视觉模型 API 请求失败: {e.response.status_code} - {error_detail}")
# Try to parse the structured error payload
try:
import json
err_json = json.loads(error_detail)
err_code = err_json.get("error", {}).get("code", "unknown")
err_msg = err_json.get("error", {}).get("message", "unknown")
logger.error(f"API 错误码: {err_code}, 错误信息: {err_msg}")
logger.error(f"请求模型: {self.model_name}, base_url: {self.base_url}")
except Exception:
pass
raise
except Exception as e:
logger.error(f"视觉模型 API 调用异常: {str(e)}")
raise
async def analyze_images(
self,
images: List[Dict[str, str]],
user_prompt: str = ""
) -> Dict[str, Any]:
"""
Analyze image content with the vision model.
Args:
images: List of images; each item holds the base64 data and mime_type
user_prompt: User prompt
Returns:
Dict[str, Any]: Analysis result
"""
prompt = f"""你是一个专业的视觉分析专家。请分析以下图片内容。
{user_prompt if user_prompt else "请详细描述图片中的内容,包括文字、数据、图表、流程等所有可见信息。"}
请按照以下 JSON 格式输出:
{{
"description": "图片内容的详细描述",
"text_content": "图片中的文字内容(如有)",
"data_extracted": {{"": ""}} // 如果图片中有表格或数据
}}
如果图片不包含有用信息,请返回空的描述。"""
try:
response = await self.chat_with_images(
text=prompt,
images=images,
temperature=0.1,
max_tokens=4000
)
content = self.extract_message_content(response)
# 解析 JSON
import json
try:
result = json.loads(content)
return {
"success": True,
"analysis": result,
"model": self.model_name
}
except json.JSONDecodeError:
return {
"success": True,
"analysis": {"description": content},
"model": self.model_name
}
except Exception as e:
logger.error(f"图片分析失败: {str(e)}")
return {
"success": False,
"error": str(e),
"analysis": None
}
# 全局单例
llm_service = LLMService()
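The payload that `chat_with_images` assembles can be isolated into a pure function, which makes the message shape easy to inspect and test without a network call. The sketch below mirrors the structure used above (one text part followed by `image_url` parts carrying data URLs); `build_vision_messages` is a hypothetical helper name.

```python
from typing import Any, Dict, List


def build_vision_messages(text: str, images: List[Dict[str, str]]) -> List[Dict[str, Any]]:
    """Build a single user message with one text part and one part per image."""
    parts: List[Dict[str, Any]] = [{"type": "text", "text": text}]
    for img in images:
        # Each image is embedded inline as a data URL.
        data_url = f"data:{img['mime_type']};base64,{img['base64']}"
        parts.append({"type": "image_url", "image_url": {"url": data_url}})
    return [{"role": "user", "content": parts}]
```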


@@ -3,7 +3,6 @@ RAG 服务模块 - 检索增强生成
使用 sentence-transformers + Faiss 实现向量检索
"""
import json
import logging
import os
import pickle
@@ -11,12 +10,20 @@ from typing import Any, Dict, List, Optional
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer
from app.config import settings
logger = logging.getLogger(__name__)
# Try to import sentence-transformers; RAG falls back to keyword search without it
try:
from sentence_transformers import SentenceTransformer
SENTENCE_TRANSFORMERS_AVAILABLE = True
except ImportError as e:
logger.warning(f"sentence-transformers 导入失败: {e}")
SENTENCE_TRANSFORMERS_AVAILABLE = False
SentenceTransformer = None
class SimpleDocument:
"""简化文档对象"""
@@ -28,17 +35,24 @@ class SimpleDocument:
class RAGService:
"""RAG 检索增强服务"""
# Default chunking parameters
DEFAULT_CHUNK_SIZE = 500 # size of each text chunk, in characters
DEFAULT_CHUNK_OVERLAP = 50 # overlap between consecutive chunks, in characters
def __init__(self):
self.embedding_model: Optional[SentenceTransformer] = None
self.embedding_model = None
self.index: Optional[faiss.Index] = None
self.documents: List[Dict[str, Any]] = []
self.doc_ids: List[str] = []
self._dimension: int = 0
self._dimension: int = 384 # 默认维度
self._initialized = False
self._persist_dir = settings.FAISS_INDEX_DIR
# Temporarily disable RAG API calls and only log
self._disabled = True
logger.info("RAG 服务已禁用(_disabled=True),仅记录索引操作日志")
# Check whether the embedding backend is available
self._disabled = not SENTENCE_TRANSFORMERS_AVAILABLE
if self._disabled:
logger.warning("RAG 服务已禁用(sentence-transformers 不可用),将使用关键词匹配作为后备")
else:
logger.info("RAG 服务已启用")
def _init_embeddings(self):
"""初始化嵌入模型"""
@@ -88,6 +102,63 @@ class RAGService:
norms = np.where(norms == 0, 1, norms)
return vectors / norms
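def _split_into_chunks(self, text: str, chunk_size: int = None, overlap: int = None) -> List[str]:
"""
Split a long text into chunks.
Args:
text: Text to split
chunk_size: Size of each chunk, in characters
overlap: Number of characters shared by consecutive chunks
Returns:
List of text chunks
"""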
if chunk_size is None:
chunk_size = self.DEFAULT_CHUNK_SIZE
if overlap is None:
overlap = self.DEFAULT_CHUNK_OVERLAP
if len(text) <= chunk_size:
return [text] if text.strip() else []
chunks = []
start = 0
text_len = len(text)
while start < text_len:
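# Compute the end position of the current chunk
end = start + chunk_size
# Unless this is the last chunk, try to cut at a sentence boundary
if end < text_len:
# Scan backward (at most 100 characters) for the nearest period, comma, newline, or semicolon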
cut_positions = []
for i in range(end, max(start, end - 100), -1):
if text[i] in '。;,,\n':
cut_positions.append(i + 1)
break
if cut_positions:
end = cut_positions[0]
else:
# No boundary found behind the cut point, so scan forward instead
for i in range(end, min(text_len, end + 50)):
if text[i] in '。;,,\n':
end = i + 1
break
chunk = text[start:end].strip()
if chunk:
chunks.append(chunk)
# Advance the start position, keeping the overlap
start = end - overlap
if start <= 0:
start = end
return chunks
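The chunking loop above advances by `end - overlap` and only guards against `start <= 0`, so a caller-supplied overlap close to or larger than the chunk size appears able to stall it. The sketch below is a simplified fixed-window chunker, without the sentence-boundary search, that validates its parameters up front. `split_into_chunks` as written here is a hypothetical standalone function, not the method in the diff.

```python
from typing import List


def split_into_chunks(text: str, chunk_size: int = 500, overlap: int = 50) -> List[str]:
    """Split text into fixed-size windows that share `overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")

    chunks: List[str] = []
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        chunk = text[start:end].strip()
        if chunk:
            chunks.append(chunk)
        if end == len(text):
            break  # the final window has been emitted
        # overlap < chunk_size guarantees that start moves forward.
        start = end - overlap
    return chunks
```

The same two checks could be added at the top of the method above without changing its boundary-seeking behavior for valid inputs.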
def index_field(
self,
table_name: str,
@@ -124,9 +195,20 @@ class RAGService:
self,
doc_id: str,
content: str,
metadata: Optional[Dict[str, Any]] = None
metadata: Optional[Dict[str, Any]] = None,
chunk_size: int = None,
chunk_overlap: int = None
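):
"""Index document content into the vector store."""
"""
Index document content into the vector store, chunking it automatically.
Args:
doc_id: Unique document identifier
content: Document content
metadata: Document metadata
chunk_size: Chunk size in characters (default 500)
chunk_overlap: Overlap between chunks in characters (default 50)
"""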
if self._disabled:
logger.info(f"[RAG DISABLED] 文档索引操作已跳过: {doc_id}")
return
@@ -139,18 +221,56 @@ class RAGService:
logger.debug(f"文档跳过索引 (无嵌入模型): {doc_id}")
return
doc = SimpleDocument(
page_content=content,
metadata=metadata or {"doc_id": doc_id}
)
self._add_documents([doc], [doc_id])
logger.debug(f"已索引文档: {doc_id}")
# 分割文档为小块
if chunk_size is None:
chunk_size = self.DEFAULT_CHUNK_SIZE
if chunk_overlap is None:
chunk_overlap = self.DEFAULT_CHUNK_OVERLAP
chunks = self._split_into_chunks(content, chunk_size, chunk_overlap)
if not chunks:
logger.warning(f"文档内容为空,跳过索引: {doc_id}")
return
# 为每个块创建文档对象
documents = []
chunk_ids = []
for i, chunk in enumerate(chunks):
chunk_id = f"{doc_id}_chunk_{i}"
chunk_metadata = metadata.copy() if metadata else {}
chunk_metadata.update({
"chunk_index": i,
"total_chunks": len(chunks),
"doc_id": doc_id
})
documents.append(SimpleDocument(
page_content=chunk,
metadata=chunk_metadata
))
chunk_ids.append(chunk_id)
# 批量添加文档
self._add_documents(documents, chunk_ids)
logger.info(f"已索引文档 {doc_id},共 {len(chunks)} 个块")
def _add_documents(self, documents: List[SimpleDocument], doc_ids: List[str]):
"""批量添加文档到向量索引"""
if not documents:
return
# 总是将文档存储在内存中(用于关键词搜索后备)
for doc, did in zip(documents, doc_ids):
self.documents.append({"id": did, "content": doc.page_content, "metadata": doc.metadata})
self.doc_ids.append(did)
# 如果没有嵌入模型,跳过向量索引
if self.embedding_model is None:
logger.debug(f"文档跳过向量索引 (无嵌入模型): {len(documents)} 个文档")
return
texts = [doc.page_content for doc in documents]
embeddings = self.embedding_model.encode(texts, convert_to_numpy=True)
embeddings = self._normalize_vectors(embeddings).astype('float32')
@@ -162,12 +282,18 @@ class RAGService:
id_array = np.array(id_list, dtype='int64')
self.index.add_with_ids(embeddings, id_array)
for doc, did in zip(documents, doc_ids):
self.documents.append({"id": did, "content": doc.page_content, "metadata": doc.metadata})
self.doc_ids.append(did)
def retrieve(self, query: str, top_k: int = 5, min_score: float = 0.3) -> List[Dict[str, Any]]:
"""
根据查询检索相关文档块
def retrieve(self, query: str, top_k: int = 5) -> List[Dict[str, Any]]:
"""根据查询检索相关文档"""
Args:
query: 查询文本
top_k: 返回的最大结果数
min_score: 最低相似度分数阈值
Returns:
相关文档块列表,每项包含 content, metadata, score, doc_id, chunk_index
"""
if self._disabled:
logger.info(f"[RAG DISABLED] 检索操作已跳过: query={query}, top_k={top_k}")
return []
@@ -175,28 +301,113 @@ class RAGService:
if not self._initialized:
self._init_vector_store()
if self.index is None or self.index.ntotal == 0:
# 优先使用向量检索
if self.index is not None and self.index.ntotal > 0 and self.embedding_model is not None:
try:
query_embedding = self.embedding_model.encode([query], convert_to_numpy=True)
query_embedding = self._normalize_vectors(query_embedding).astype('float32')
scores, indices = self.index.search(query_embedding, min(top_k, self.index.ntotal))
results = []
for score, idx in zip(scores[0], indices[0]):
if idx < 0:
continue
if score < min_score:
continue
doc = self.documents[idx]
results.append({
"content": doc["content"],
"metadata": doc["metadata"],
"score": float(score),
"doc_id": doc["id"],
"chunk_index": doc["metadata"].get("chunk_index", 0)
})
if results:
logger.debug(f"向量检索到 {len(results)} 条相关文档块")
return results
except Exception as e:
logger.warning(f"向量检索失败,使用关键词搜索后备: {e}")
# 后备:使用关键词搜索
logger.debug("使用关键词搜索后备方案")
return self._keyword_search(query, top_k)
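def _keyword_search(self, query: str, top_k: int = 5) -> List[Dict[str, Any]]:
"""
Keyword-search fallback used when vector retrieval is unavailable.
Args:
query: Query text
top_k: Maximum number of results to return
Returns:
List of matching document chunks
"""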
if not self.documents:
return []
query_embedding = self.embedding_model.encode([query], convert_to_numpy=True)
query_embedding = self._normalize_vectors(query_embedding).astype('float32')
# 提取查询关键词
keywords = []
for char in query:
if '\u4e00' <= char <= '\u9fff': # 中文字符
keywords.append(char)
# 添加英文单词
import re
english_words = re.findall(r'[a-zA-Z]+', query)
keywords.extend(english_words)
scores, indices = self.index.search(query_embedding, min(top_k, self.index.ntotal))
if not keywords:
return []
results = []
for score, idx in zip(scores[0], indices[0]):
if idx < 0:
continue
doc = self.documents[idx]
results.append({
"content": doc["content"],
"metadata": doc["metadata"],
"score": float(score),
"doc_id": doc["id"]
})
for doc in self.documents:
content = doc["content"]
# 计算关键词匹配分数
score = 0
matched_keywords = 0
for kw in keywords:
if kw in content:
score += 1
matched_keywords += 1
logger.debug(f"检索到 {len(results)} 条相关文档")
return results
if matched_keywords > 0:
# 归一化分数
score = score / max(len(keywords), 1)
results.append({
"content": content,
"metadata": doc["metadata"],
"score": score,
"doc_id": doc["id"],
"chunk_index": doc["metadata"].get("chunk_index", 0)
})
# 按分数排序
results.sort(key=lambda x: x["score"], reverse=True)
logger.debug(f"关键词搜索返回 {len(results[:top_k])} 条结果")
return results[:top_k]
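The fallback above scores each chunk by the fraction of query keywords it contains, where a keyword is either a single CJK character or a run of ASCII letters. That scoring rule can be written as a small pure function, sketched below; `keyword_score` is a hypothetical name and the function only restates the logic already present in `_keyword_search`.

```python
import re
from typing import List


def keyword_score(query: str, content: str) -> float:
    """Return the fraction of query keywords that occur in `content`."""
    # Every CJK character in the query counts as its own keyword.
    keywords: List[str] = [ch for ch in query if "\u4e00" <= ch <= "\u9fff"]
    # Runs of ASCII letters count as whole-word keywords.
    keywords += re.findall(r"[a-zA-Z]+", query)
    if not keywords:
        return 0.0
    hits = sum(1 for kw in keywords if kw in content)
    return hits / len(keywords)
```

Because single characters match very loosely, a character-bigram variant might rank Chinese text more sharply; that would be a design change rather than something this diff does.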
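def retrieve_by_doc_id(self, doc_id: str, top_k: int = 10) -> List[Dict[str, Any]]:
"""
Return the chunks that belong to a given document.
Args:
doc_id: Document ID
top_k: Maximum number of chunks to return
Returns:
The document's chunks, ordered by chunk_index
"""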
# 获取属于该文档的所有块
doc_chunks = [d for d in self.documents if d["metadata"].get("doc_id") == doc_id]
# 按 chunk_index 排序
doc_chunks.sort(key=lambda x: x["metadata"].get("chunk_index", 0))
# 返回指定数量
return doc_chunks[:top_k]
def retrieve_by_table(self, table_name: str, top_k: int = 5) -> List[Dict[str, Any]]:
"""检索指定表的字段"""

File diff suppressed because it is too large


@@ -0,0 +1,637 @@
"""
Word document AI parsing service
Uses an LLM (GLM) to interpret Word documents in depth and extract structured data
"""
import logging
from typing import Dict, Any, List, Optional
import json
from app.services.llm_service import llm_service
from app.core.document_parser.docx_parser import DocxParser
logger = logging.getLogger(__name__)
class WordAIService:
"""Word 文档 AI 解析服务"""
def __init__(self):
self.llm = llm_service
self.parser = DocxParser()
async def parse_word_with_ai(
self,
file_path: str,
user_hint: str = ""
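) -> Dict[str, Any]:
"""
Parse a Word document with AI and extract structured data.
Suited to pulling table data, key-value pairs, and similar information out of unstructured Word documents.
Args:
file_path: Path to the Word file
user_hint: User prompt describing what kind of content to extract
Returns:
Dict: Parse result containing the structured data
"""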
try:
# 1. 先用基础解析器提取原始内容
parse_result = self.parser.parse(file_path)
if not parse_result.success:
return {
"success": False,
"error": parse_result.error,
"structured_data": None
}
# 2. 获取原始数据
raw_data = parse_result.data
paragraphs = raw_data.get("paragraphs", [])
paragraphs_with_style = raw_data.get("paragraphs_with_style", [])
tables = raw_data.get("tables", [])
content = raw_data.get("content", "")
images_info = raw_data.get("images", {})
metadata = parse_result.metadata or {}
image_count = images_info.get("image_count", 0)
image_descriptions = images_info.get("descriptions", [])
logger.info(f"Word 基础解析完成: {len(paragraphs)} 个段落, {len(tables)} 个表格, {image_count} 张图片")
# 3. 提取图片数据(用于视觉分析)
images_base64 = []
if image_count > 0:
try:
images_base64 = self.parser.extract_images_as_base64(file_path)
logger.info(f"提取到 {len(images_base64)} 张图片的 base64 数据")
except Exception as e:
logger.warning(f"提取图片 base64 失败: {str(e)}")
# 4. 根据内容类型选择 AI 解析策略
# 如果有图片,先分析图片
image_analysis = ""
if images_base64:
image_analysis = await self._analyze_images_with_ai(images_base64, user_hint)
logger.info(f"图片 AI 分析完成: {len(image_analysis)} 字符")
# 优先处理:表格 > (表格+文本) > 纯文本
if tables and len(tables) > 0:
structured_data = await self._extract_tables_with_ai(
tables, paragraphs, image_count, user_hint, metadata, image_analysis
)
elif paragraphs and len(paragraphs) > 0:
structured_data = await self._extract_from_text_with_ai(
paragraphs, content, image_count, image_descriptions, user_hint, image_analysis
)
else:
structured_data = {
"success": True,
"type": "empty",
"message": "文档内容为空"
}
# 添加图片分析结果
if image_analysis:
structured_data["image_analysis"] = image_analysis
return structured_data
except Exception as e:
logger.error(f"AI 解析 Word 文档失败: {str(e)}")
return {
"success": False,
"error": str(e),
"structured_data": None
}
async def _extract_tables_with_ai(
self,
tables: List[Dict],
paragraphs: List[str],
image_count: int,
user_hint: str,
metadata: Dict,
image_analysis: str = ""
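) -> Dict[str, Any]:
"""
Use AI to extract structured data from Word tables and text.
Args:
tables: List of tables
paragraphs: List of paragraphs
image_count: Number of images
user_hint: User prompt
metadata: Document metadata
image_analysis: Result of the AI image analysis
Returns:
Structured data
"""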
try:
# 构建表格文本描述
tables_text = self._build_tables_description(tables)
# 构建段落描述
paragraphs_text = "\n".join(paragraphs[:50]) if paragraphs else "(无正文文本)"
if len(paragraphs) > 50:
paragraphs_text += f"\n...(共 {len(paragraphs)} 个段落,仅显示前50个)"
# 图片提示
image_hint = f"注意:此文档包含 {image_count} 张图片/图表。" if image_count > 0 else ""
prompt = f"""你是一个专业的数据提取专家。请从以下 Word 文档的完整内容中提取结构化数据。
【用户需求】
{user_hint if user_hint else "请提取文档中的所有结构化数据,包括表格数据、键值对、列表项等。"}
【文档正文(段落)】
{paragraphs_text}
【文档表格】
{tables_text}
【文档图片信息】
{image_hint}
请按照以下 JSON 格式输出:
{{
"type": "table_data",
"headers": ["列1", "列2", ...],
"rows": [["行1列1", "行1列2", ...], ["行2列1", "行2列2", ...], ...],
"key_values": {{"键1": "值1", "键2": "值2", ...}},
"list_items": ["项1", "项2", ...],
"description": "文档内容描述"
}}
重点:
- 优先从表格中提取结构化数据
- 如果表格中有表头,headers 是表头,rows 是数据行
- 如果文档中有键值对(如 名称: 张三),提取到 key_values 中
- 如果文档中有列表项,提取到 list_items 中
- 图片内容无法直接提取,但请在 description 中说明图片的大致主题(如"包含流程图"、"包含数据图表"等)
"""
messages = [
{"role": "system", "content": "你是一个专业的数据提取助手。请严格按JSON格式输出。"},
{"role": "user", "content": prompt}
]
response = await self.llm.chat(
messages=messages,
temperature=0.1,
max_tokens=50000
)
content = self.llm.extract_message_content(response)
# 解析 JSON
result = self._parse_json_response(content)
if result:
logger.info(f"AI 表格提取成功: {len(result.get('rows', []))} 行数据")
return {
"success": True,
"type": "table_data",
"headers": result.get("headers", []),
"rows": result.get("rows", []),
"key_values": result.get("key_values", {}),
"list_items": result.get("list_items", []),
"description": result.get("description", "")
}
else:
# 如果 AI 返回格式不对,尝试直接解析表格
return self._fallback_table_parse(tables)
except Exception as e:
logger.error(f"AI 表格提取失败: {str(e)}")
return self._fallback_table_parse(tables)
async def _extract_from_text_with_ai(
self,
paragraphs: List[str],
full_text: str,
image_count: int,
image_descriptions: List[str],
user_hint: str,
image_analysis: str = ""
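) -> Dict[str, Any]:
"""
Use AI to extract structured data from the plain text of a Word document.
Args:
paragraphs: List of paragraphs
full_text: Full text
image_count: Number of images
image_descriptions: List of image descriptions
user_hint: User prompt
image_analysis: Result of the AI image analysis
Returns:
Structured data
"""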
try:
# 限制文本长度
text_preview = full_text[:8000] if len(full_text) > 8000 else full_text
# 图片提示
image_hint = f"\n【文档图片】此文档包含 {image_count} 张图片/图表。" if image_count > 0 else ""
if image_descriptions:
image_hint += "\n" + "\n".join(image_descriptions)
prompt = f"""你是一个专业的数据提取专家。请从以下 Word 文档的完整内容中提取结构化数据。
【用户需求】
{user_hint if user_hint else "请识别并提取文档中的关键信息,包括:表格数据、键值对、列表项等。"}
【文档正文】{image_hint}
{text_preview}
请按照以下 JSON 格式输出:
{{
"type": "structured_text",
"tables": [{{"headers": [...], "rows": [...]}}],
"key_values": {{"键1": "值1", "键2": "值2", ...}},
"list_items": ["项1", "项2", ...],
"summary": "文档内容摘要"
}}
重点:
- 如果文档包含表格数据,提取到 tables 中
- 如果文档包含键值对(如 名称: 张三),提取到 key_values 中
- 如果文档包含列表项,提取到 list_items 中
- 如果文档包含图片,请根据上下文推断图片内容(如"流程图"、"数据折线图"等)并在 summary 中说明
- 如果无法提取到结构化数据,至少提供一个详细的摘要
"""
messages = [
{"role": "system", "content": "你是一个专业的数据提取助手。请严格按JSON格式输出。"},
{"role": "user", "content": prompt}
]
response = await self.llm.chat(
messages=messages,
temperature=0.1,
max_tokens=50000
)
content = self.llm.extract_message_content(response)
result = self._parse_json_response(content)
if result:
logger.info(f"AI 文本提取成功: type={result.get('type')}")
return {
"success": True,
"type": result.get("type", "structured_text"),
"tables": result.get("tables", []),
"key_values": result.get("key_values", {}),
"list_items": result.get("list_items", []),
"summary": result.get("summary", ""),
"raw_text_preview": text_preview[:500]
}
else:
return {
"success": True,
"type": "text",
"summary": text_preview[:500],
"raw_text_preview": text_preview[:500]
}
except Exception as e:
logger.error(f"AI 文本提取失败: {str(e)}")
return {
"success": False,
"error": str(e)
}
async def _analyze_images_with_ai(
self,
images: List[Dict[str, str]],
user_hint: str = ""
) -> str:
"""
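Analyze the images in a Word document with a vision model.
Args:
images: list of images; each item contains base64 and mime_type
user_hint: user-supplied hint
Returns:
Image analysis result as text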
"""
try:
# Call the LLM's vision analysis
result = await self.llm.analyze_images(
images=images,
user_prompt=user_hint or "请详细描述图片内容,提取所有文字和数据信息。"
)
if result.get("success"):
analysis = result.get("analysis", {})
if isinstance(analysis, dict):
description = analysis.get("description", "")
text_content = analysis.get("text_content", "")
data_extracted = analysis.get("data_extracted", {})
result_text = f"【图片分析结果】\n{description}"
if text_content:
result_text += f"\n\n【图片中的文字】\n{text_content}"
if data_extracted:
result_text += f"\n\n【提取的数据】\n{json.dumps(data_extracted, ensure_ascii=False)}"
return result_text
else:
return str(analysis)
else:
logger.warning(f"图片 AI 分析失败: {result.get('error')}")
return ""
except Exception as e:
logger.error(f"图片 AI 分析异常: {str(e)}")
return ""
def _build_tables_description(self, tables: List[Dict]) -> str:
"""Build a plain-text description of the tables."""
result = []
for idx, table in enumerate(tables):
rows = table.get("rows", [])
if not rows:
continue
result.append(f"\n--- 表格 {idx + 1} ---")
for row in rows[:50]: # show at most 50 rows per table
if isinstance(row, list):
result.append(" | ".join(str(cell).strip() for cell in row))
elif isinstance(row, dict):
result.append(str(row))
if len(rows) > 50:
result.append(f"...(共 {len(rows)} 行,仅显示前50行)")
return "\n".join(result) if result else "(无表格内容)"
def _parse_json_response(self, content: str) -> Optional[Dict]:
"""Parse a JSON reply, tolerating common formatting problems."""
import re
# Strip Markdown code-fence markers
cleaned = content.strip()
cleaned = re.sub(r'^```json\s*', '', cleaned, flags=re.MULTILINE)
cleaned = re.sub(r'^```\s*', '', cleaned, flags=re.MULTILINE)
cleaned = cleaned.strip()
# Locate the start of the JSON object
json_start = cleaned.find('{')
if json_start == -1:
logger.warning("无法找到 JSON 开始位置")
return None
json_text = cleaned[json_start:]
# Try parsing directly first
try:
return json.loads(json_text)
except json.JSONDecodeError:
pass
# Otherwise try to repair the text and parse again
try:
# Find the brace that closes the first object
depth = 0
end_pos = -1
for i, c in enumerate(json_text):
if c == '{':
depth += 1
elif c == '}':
depth -= 1
if depth == 0:
end_pos = i + 1
break
if end_pos > 0:
fixed = json_text[:end_pos]
# Remove trailing commas before a closing brace or bracket
fixed = re.sub(r',\s*([}\]])', r'\1', fixed)
return json.loads(fixed)
except Exception as e:
logger.warning(f"JSON 修复失败: {e}")
return None
def _fallback_table_parse(self, tables: List[Dict]) -> Dict[str, Any]:
"""Parse the tables directly when AI parsing fails."""
if not tables:
return {
"success": True,
"type": "empty",
"data": {},
"message": "无表格内容"
}
all_rows = []
all_headers = None
for table in tables:
rows = table.get("rows", [])
if not rows:
continue
# Find the real header row (skip title rows)
header_row_idx = 0
for idx, row in enumerate(rows[:5]): # only inspect the first 5 rows
if not isinstance(row, list):
continue
# A long first cell that starts with "表" (e.g. "表1 ...") is probably a title row
first_cell = str(row[0]) if row else ""
if first_cell.startswith("表") and len(first_cell) > 15:
header_row_idx = idx + 1
continue
# A row with more than 3 empty cells is probably not a valid header
empty_count = sum(1 for cell in row if not str(cell).strip())
if empty_count > 3:
header_row_idx = idx + 1
continue
# Take the first row that looks like a header (short cells, mostly non-empty)
avg_len = sum(len(str(c)) for c in row) / len(row) if row else 0
if avg_len < 20: # headers are usually shorter than data rows
header_row_idx = idx
break
if header_row_idx >= len(rows):
continue
# Use the header row that was found
if rows and isinstance(rows[header_row_idx], list):
headers = rows[header_row_idx]
if all_headers is None:
all_headers = headers
# Data rows start right after the header row
for row in rows[header_row_idx + 1:]:
if isinstance(row, list) and len(row) == len(headers):
all_rows.append(row)
if all_headers and all_rows:
return {
"success": True,
"type": "table_data",
"headers": all_headers,
"rows": all_rows,
"description": "直接从 Word 表格提取"
}
return {
"success": True,
"type": "raw",
"tables": tables,
"message": "表格数据未AI处理"
}
async def fill_template_with_ai(
self,
file_path: str,
template_fields: List[Dict[str, Any]],
user_hint: str = ""
) -> Dict[str, Any]:
"""
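Parse a Word document with AI and fill the template.
This is the main entry point; a single call from the frontend will:
1. Parse the Word document with AI
2. Extract data according to the template fields
3. Return the fill result
Args:
file_path: path to the Word file
template_fields: list of template fields [{"name": "字段名", "hint": "提示词"}, ...]
user_hint: user-supplied hint
Returns:
The fill result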
"""
try:
# 1. Parse the document with AI
parse_result = await self.parse_word_with_ai(file_path, user_hint)
if not parse_result.get("success"):
return {
"success": False,
"error": parse_result.get("error", "解析失败"),
"filled_data": {},
"source": "ai_parse_failed"
}
# 2. Extract data according to the parse result type
filled_data = {}
extract_details = []
parse_type = parse_result.get("type", "")
if parse_type == "table_data":
# Table data: match template fields against column names directly
headers = parse_result.get("headers", [])
rows = parse_result.get("rows", [])
for field in template_fields:
field_name = field.get("name", "")
values = self._extract_field_from_table(headers, rows, field_name)
filled_data[field_name] = values
extract_details.append({
"field": field_name,
"values": values,
"source": "ai_table_extraction",
"confidence": 0.9 if values else 0.0
})
elif parse_type == "structured_text":
# Structured text: try key_values first, then fall back to list_items
key_values = parse_result.get("key_values", {})
list_items = parse_result.get("list_items", [])
for field in template_fields:
field_name = field.get("name", "")
value = key_values.get(field_name, "")
if not value and list_items:
value = list_items[0]
filled_data[field_name] = [value] if value else []
extract_details.append({
"field": field_name,
"values": [value] if value else [],
"source": "ai_text_extraction",
"confidence": 0.7 if value else 0.0
})
else:
# Other types: return empty values so the raw parse result can be handled later
for field in template_fields:
field_name = field.get("name", "")
filled_data[field_name] = []
extract_details.append({
"field": field_name,
"values": [],
"source": "no_ai_data",
"confidence": 0.0
})
# 3. Return the result
max_rows = max(len(v) for v in filled_data.values()) if filled_data else 1
return {
"success": True,
"filled_data": filled_data,
"fill_details": extract_details,
"ai_parse_result": {
"type": parse_type,
"description": parse_result.get("description", "")
},
"source_doc_count": 1,
"max_rows": max_rows
}
except Exception as e:
logger.error(f"AI 填表失败: {str(e)}")
return {
"success": False,
"error": str(e),
"filled_data": {},
"fill_details": []
}
def _extract_field_from_table(
self,
headers: List[str],
rows: List[List],
field_name: str
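) -> List[str]:
"""Extract the values of the given field from a table."""
# Find the matching column; skip empty names, since "" is a substring of every string
field_key = field_name.strip().lower()
if not field_key:
return []
target_col_idx = None
for col_idx, header in enumerate(headers):
header_key = str(header).strip().lower()
if not header_key:
continue
if field_key in header_key or header_key in field_key:
target_col_idx = col_idx
break
if target_col_idx is None:
return []
# Collect every non-empty value in that column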
values = []
for row in rows:
if isinstance(row, list) and target_col_idx < len(row):
val = str(row[target_col_idx]).strip()
if val:
values.append(val)
return values
# Module-level singleton
word_ai_service = WordAIService()
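
The JSON-recovery steps in `_parse_json_response` above can be sketched as a standalone function. This is a minimal illustration under stated assumptions, not the project's code: the name `parse_llm_json` is hypothetical, the brace matching is naive (it does not account for braces inside string values), and the trailing-comma pattern is written as `[}\]]` so that it covers both closing braces and closing brackets.

```python
import json
import re
from typing import Any, Dict, Optional


def parse_llm_json(content: str) -> Optional[Dict[str, Any]]:
    """Extract the first JSON object from an LLM reply (hypothetical helper).

    Tolerates Markdown code fences, text around the object, and trailing commas.
    """
    # Strip Markdown fence markers such as ```json and ``` at line starts
    cleaned = re.sub(r"^```(?:json)?\s*", "", content.strip(), flags=re.MULTILINE).strip()

    start = cleaned.find("{")
    if start == -1:
        return None
    text = cleaned[start:]

    # Fast path: the remaining text is already valid JSON
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        pass

    # Find the brace that closes the first object (naive: ignores braces in strings)
    depth = 0
    end = -1
    for i, ch in enumerate(text):
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                end = i + 1
                break
    if end == -1:
        return None

    # Drop trailing commas that appear right before } or ]
    fixed = re.sub(r",\s*([}\]])", r"\1", text[:end])
    try:
        return json.loads(fixed)
    except json.JSONDecodeError:
        return None
```

Returning `None` on every failure path lets the caller decide on a fallback, in the same way the table extraction falls back to `_fallback_table_parse`.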

View File

@@ -1,4 +1,4 @@
# ============================================================
# ============================================================
# LLM-based document understanding and multi-source data fusion system
# Python dependency list
# ============================================================

View File

@@ -1,5 +1,5 @@
import { RouterProvider } from 'react-router-dom';
import { AuthProvider } from '@/context/AuthContext';
import { AuthProvider } from '@/contexts/AuthContext';
import { TemplateFillProvider } from '@/context/TemplateFillContext';
import { router } from '@/routes';
import { Toaster } from 'sonner';

View File

@@ -1,6 +1,6 @@
import React from 'react';
import { Navigate, useLocation } from 'react-router-dom';
import { useAuth } from '@/context/AuthContext';
import { useAuth } from '@/contexts/AuthContext';
export const RouteGuard: React.FC<{ children: React.ReactNode }> = ({ children }) => {
const { user, loading } = useAuth();

View File

@@ -1,85 +0,0 @@
import React, { createContext, useContext, useEffect, useState } from 'react';
import { supabase } from '@/db/supabase';
import { User } from '@supabase/supabase-js';
import { Profile } from '@/types/types';
interface AuthContextType {
user: User | null;
profile: Profile | null;
signIn: (email: string, password: string) => Promise<{ error: any }>;
signUp: (email: string, password: string) => Promise<{ error: any }>;
signOut: () => Promise<{ error: any }>;
loading: boolean;
}
const AuthContext = createContext<AuthContextType | undefined>(undefined);
export const AuthProvider: React.FC<{ children: React.ReactNode }> = ({ children }) => {
const [user, setUser] = useState<User | null>(null);
const [profile, setProfile] = useState<Profile | null>(null);
const [loading, setLoading] = useState(true);
useEffect(() => {
// Check active sessions and sets the user
supabase.auth.getSession().then(({ data: { session } }) => {
setUser(session?.user ?? null);
if (session?.user) fetchProfile(session.user.id);
else setLoading(false);
});
// Listen for changes on auth state (sign in, sign out, etc.)
const { data: { subscription } } = supabase.auth.onAuthStateChange((_event, session) => {
setUser(session?.user ?? null);
if (session?.user) fetchProfile(session.user.id);
else {
setProfile(null);
setLoading(false);
}
});
return () => subscription.unsubscribe();
}, []);
const fetchProfile = async (uid: string) => {
try {
const { data, error } = await supabase
.from('profiles')
.select('*')
.eq('id', uid)
.maybeSingle();
if (error) throw error;
setProfile(data);
} catch (err) {
console.error('Error fetching profile:', err);
} finally {
setLoading(false);
}
};
const signIn = async (email: string, password: string) => {
return await supabase.auth.signInWithPassword({ email, password });
};
const signUp = async (email: string, password: string) => {
return await supabase.auth.signUp({ email, password });
};
const signOut = async () => {
return await supabase.auth.signOut();
};
return (
<AuthContext.Provider value={{ user, profile, signIn, signUp, signOut, loading }}>
{children}
</AuthContext.Provider>
);
};
export const useAuth = () => {
const context = useContext(AuthContext);
if (context === undefined) {
throw new Error('useAuth must be used within an AuthProvider');
}
return context;
};

View File

@@ -21,6 +21,7 @@ interface TemplateFillState {
templateFields: TemplateField[];
sourceFiles: SourceFile[];
sourceFilePaths: string[];
sourceDocIds: string[];
templateId: string;
filledResult: any;
setStep: (step: Step) => void;
@@ -30,6 +31,9 @@ interface TemplateFillState {
addSourceFiles: (files: SourceFile[]) => void;
removeSourceFile: (index: number) => void;
setSourceFilePaths: (paths: string[]) => void;
setSourceDocIds: (ids: string[]) => void;
addSourceDocId: (id: string) => void;
removeSourceDocId: (id: string) => void;
setTemplateId: (id: string) => void;
setFilledResult: (result: any) => void;
reset: () => void;
@@ -41,6 +45,7 @@ const initialState = {
templateFields: [],
sourceFiles: [],
sourceFilePaths: [],
sourceDocIds: [],
templateId: '',
filledResult: null,
setStep: () => {},
@@ -50,6 +55,9 @@ const initialState = {
addSourceFiles: () => {},
removeSourceFile: () => {},
setSourceFilePaths: () => {},
setSourceDocIds: () => {},
addSourceDocId: () => {},
removeSourceDocId: () => {},
setTemplateId: () => {},
setFilledResult: () => {},
reset: () => {},
@@ -63,6 +71,7 @@ export const TemplateFillProvider: React.FC<{ children: ReactNode }> = ({ childr
const [templateFields, setTemplateFields] = useState<TemplateField[]>([]);
const [sourceFiles, setSourceFiles] = useState<SourceFile[]>([]);
const [sourceFilePaths, setSourceFilePaths] = useState<string[]>([]);
const [sourceDocIds, setSourceDocIds] = useState<string[]>([]);
const [templateId, setTemplateId] = useState<string>('');
const [filledResult, setFilledResult] = useState<any>(null);
@@ -74,12 +83,21 @@ export const TemplateFillProvider: React.FC<{ children: ReactNode }> = ({ childr
setSourceFiles(prev => prev.filter((_, i) => i !== index));
};
const addSourceDocId = (id: string) => {
setSourceDocIds(prev => prev.includes(id) ? prev : [...prev, id]);
};
const removeSourceDocId = (id: string) => {
setSourceDocIds(prev => prev.filter(docId => docId !== id));
};
const reset = () => {
setStep('upload');
setTemplateFile(null);
setTemplateFields([]);
setSourceFiles([]);
setSourceFilePaths([]);
setSourceDocIds([]);
setTemplateId('');
setFilledResult(null);
};
@@ -92,6 +110,7 @@ export const TemplateFillProvider: React.FC<{ children: ReactNode }> = ({ childr
templateFields,
sourceFiles,
sourceFilePaths,
sourceDocIds,
templateId,
filledResult,
setStep,
@@ -101,6 +120,9 @@ export const TemplateFillProvider: React.FC<{ children: ReactNode }> = ({ childr
addSourceFiles,
removeSourceFile,
setSourceFilePaths,
setSourceDocIds,
addSourceDocId,
removeSourceDocId,
setTemplateId,
setFilledResult,
reset,

View File

@@ -400,6 +400,49 @@ export const backendApi = {
}
},
/**
* Fetch the task history list
*/
async getTasks(
limit: number = 50,
skip: number = 0
): Promise<{ success: boolean; tasks: any[]; count: number }> {
const url = `${BACKEND_BASE_URL}/tasks?limit=${limit}&skip=${skip}`;
try {
const response = await fetch(url);
if (!response.ok) {
const error = await response.json();
throw new Error(error.detail || '获取任务列表失败');
}
return await response.json();
} catch (error) {
console.error('获取任务列表失败:', error);
throw error;
}
},
/**
* Delete a task
*/
async deleteTask(taskId: string): Promise<{ success: boolean; deleted: boolean }> {
const url = `${BACKEND_BASE_URL}/tasks/${taskId}`;
try {
const response = await fetch(url, {
method: 'DELETE'
});
if (!response.ok) {
const error = await response.json();
throw new Error(error.detail || '删除任务失败');
}
return await response.json();
} catch (error) {
console.error('删除任务失败:', error);
throw error;
}
},
/**
* Poll the task status until it completes
*/
@@ -764,6 +807,41 @@ export const backendApi = {
}
},
/**
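* Fill the original template and export it
*
* Opens the original template file, writes the data into its tables/cells, then exports it.
* Intended for the competition scenario: the original template formatting is preserved.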
*/
async fillAndExportTemplate(
templatePath: string,
filledData: Record<string, any>,
format: 'xlsx' | 'docx' = 'xlsx'
): Promise<Blob> {
const url = `${BACKEND_BASE_URL}/templates/fill-and-export`;
try {
const response = await fetch(url, {
method: 'POST',
headers: { 'Content-Type': 'application/json' },
body: JSON.stringify({
template_path: templatePath,
filled_data: filledData,
format,
}),
});
if (!response.ok) {
const error = await response.json();
throw new Error(error.detail || '填充模板失败');
}
return await response.blob();
} catch (error) {
console.error('填充模板失败:', error);
throw error;
}
},
// ==================== Excel-specific endpoints (kept for compatibility) ====================
/**
@@ -1145,7 +1223,7 @@ export const aiApi = {
try {
const response = await fetch(url, {
method: 'GET',
method: 'POST',
body: formData,
});
@@ -1161,6 +1239,48 @@ export const aiApi = {
}
},
/**
* Upload a TXT file and analyze it with AI to extract structured data
*/
async analyzeTxt(
file: File
): Promise<{
success: boolean;
filename?: string;
structured_data?: {
table?: {
columns?: string[];
rows?: string[][];
};
summary?: string;
key_value_pairs?: Array<{ key: string; value: string }>;
numeric_data?: Array<{ name: string; value: number; unit?: string }>;
};
error?: string;
}> {
const formData = new FormData();
formData.append('file', file);
const url = `${BACKEND_BASE_URL}/ai/analyze/txt`;
try {
const response = await fetch(url, {
method: 'POST',
body: formData,
});
if (!response.ok) {
const error = await response.json();
throw new Error(error.detail || 'TXT AI 分析失败');
}
return await response.json();
} catch (error) {
console.error('TXT AI 分析失败:', error);
throw error;
}
},
/**
* Generate statistics and charts
*/
@@ -1259,4 +1379,84 @@ export const aiApi = {
throw error;
}
},
// ==================== Word AI parsing ====================
/**
* Parse a Word document with AI and extract structured data
*/
async analyzeWordWithAI(
file: File,
userHint: string = ''
): Promise<{
success: boolean;
type?: string;
headers?: string[];
rows?: string[][];
key_values?: Record<string, string>;
list_items?: string[];
summary?: string;
error?: string;
}> {
const formData = new FormData();
formData.append('file', file);
if (userHint) {
formData.append('user_hint', userHint);
}
const url = `${BACKEND_BASE_URL}/ai/analyze/word`;
try {
const response = await fetch(url, {
method: 'POST',
body: formData,
});
if (!response.ok) {
const error = await response.json();
throw new Error(error.detail || 'Word AI 解析失败');
}
return await response.json();
} catch (error) {
console.error('Word AI 解析失败:', error);
throw error;
}
},
/**
* Parse a Word document with AI and fill the template
* Performs AI parsing and template filling in a single call
*/
async fillTemplateFromWordAI(
file: File,
templateFields: TemplateField[],
userHint: string = ''
): Promise<FillResult> {
const formData = new FormData();
formData.append('file', file);
formData.append('template_fields', JSON.stringify(templateFields));
if (userHint) {
formData.append('user_hint', userHint);
}
const url = `${BACKEND_BASE_URL}/ai/analyze/word/fill-template`;
try {
const response = await fetch(url, {
method: 'POST',
body: formData,
});
if (!response.ok) {
const error = await response.json();
throw new Error(error.detail || 'Word AI 填表失败');
}
return await response.json();
} catch (error) {
console.error('Word AI 填表失败:', error);
throw error;
}
},
};

View File

@@ -1,4 +1,4 @@
import React, { useState, useEffect, useCallback } from 'react';
import React, { useState, useEffect, useCallback, useRef } from 'react';
import { useDropzone } from 'react-dropzone';
import {
FileText,
@@ -23,7 +23,8 @@ import {
List,
MessageSquareCode,
Tag,
HelpCircle
HelpCircle,
Plus
} from 'lucide-react';
import { Button } from '@/components/ui/button';
import { Input } from '@/components/ui/input';
@@ -72,8 +73,10 @@ const Documents: React.FC = () => {
// Upload state
const [uploading, setUploading] = useState(false);
const [uploadedFile, setUploadedFile] = useState<File | null>(null);
const [uploadedFiles, setUploadedFiles] = useState<File[]>([]);
const [parseResult, setParseResult] = useState<ExcelParseResult | null>(null);
const [expandedSheet, setExpandedSheet] = useState<string | null>(null);
const [uploadExpanded, setUploadExpanded] = useState(false);
// AI analysis state
const [analyzing, setAnalyzing] = useState(false);
@@ -210,75 +213,119 @@ const Documents: React.FC = () => {
// File upload handler
const onDrop = async (acceptedFiles: File[]) => {
const file = acceptedFiles[0];
if (!file) return;
if (acceptedFiles.length === 0) return;
setUploadedFile(file);
setUploading(true);
setParseResult(null);
setAiAnalysis(null);
setAnalysisCharts(null);
setExpandedSheet(null);
setMdAnalysis(null);
setMdSections([]);
setMdStreamingContent('');
let successCount = 0;
let failCount = 0;
const successfulFiles: File[] = [];
const ext = file.name.split('.').pop()?.toLowerCase();
// Upload the files one at a time
for (const file of acceptedFiles) {
const ext = file.name.split('.').pop()?.toLowerCase();
try {
// Excel files use the dedicated upload endpoint
if (ext === 'xlsx' || ext === 'xls') {
const result = await backendApi.uploadExcel(file, {
parseAllSheets: parseOptions.parseAllSheets,
headerRow: parseOptions.headerRow
});
if (result.success) {
toast.success(`解析成功: ${file.name}`);
setParseResult(result);
loadDocuments(); // refresh the document list
if (result.metadata?.sheet_count === 1) {
setExpandedSheet(Object.keys(result.data?.sheets || {})[0] || null);
try {
if (ext === 'xlsx' || ext === 'xls') {
const result = await backendApi.uploadExcel(file, {
parseAllSheets: parseOptions.parseAllSheets,
headerRow: parseOptions.headerRow
});
if (result.success) {
successCount++;
successfulFiles.push(file);
// Keep the first Excel file's parse result for the preview
if (successCount === 1) {
setUploadedFile(file);
setParseResult(result);
if (result.metadata?.sheet_count === 1) {
setExpandedSheet(Object.keys(result.data?.sheets || {})[0] || null);
}
}
loadDocuments();
} else {
failCount++;
toast.error(`${file.name}: ${result.error || '解析失败'}`);
}
} else if (ext === 'md' || ext === 'markdown') {
const result = await backendApi.uploadDocument(file);
if (result.task_id) {
successCount++;
successfulFiles.push(file);
if (successCount === 1) {
setUploadedFile(file);
}
// Poll the task status
let attempts = 0;
const checkStatus = async () => {
while (attempts < 30) {
try {
const status = await backendApi.getTaskStatus(result.task_id);
if (status.status === 'success') {
loadDocuments();
return;
} else if (status.status === 'failure') {
return;
}
} catch (e) {
console.error('检查状态失败', e);
}
await new Promise(resolve => setTimeout(resolve, 2000));
attempts++;
}
};
checkStatus();
} else {
failCount++;
}
} else {
toast.error(result.error || '解析失败');
}
} else if (ext === 'md' || ext === 'markdown') {
// Markdown files: fetch the outline
await fetchMdOutline();
} else {
// Other documents use the generic upload endpoint
const result = await backendApi.uploadDocument(file);
if (result.task_id) {
toast.success(`文件 ${file.name} 已提交处理`);
// Poll the task status
let attempts = 0;
const checkStatus = async () => {
while (attempts < 30) {
try {
const status = await backendApi.getTaskStatus(result.task_id);
if (status.status === 'success') {
toast.success(`文件 ${file.name} 处理完成`);
loadDocuments();
return;
} else if (status.status === 'failure') {
toast.error(`文件 ${file.name} 处理失败`);
return;
}
} catch (e) {
console.error('检查状态失败', e);
}
await new Promise(resolve => setTimeout(resolve, 2000));
attempts++;
// Other documents use the generic upload endpoint
const result = await backendApi.uploadDocument(file);
if (result.task_id) {
successCount++;
successfulFiles.push(file);
if (successCount === 1) {
setUploadedFile(file);
}
toast.error(`文件 ${file.name} 处理超时`);
};
checkStatus();
// Poll the task status
let attempts = 0;
const checkStatus = async () => {
while (attempts < 30) {
try {
const status = await backendApi.getTaskStatus(result.task_id);
if (status.status === 'success') {
loadDocuments();
return;
} else if (status.status === 'failure') {
return;
}
} catch (e) {
console.error('检查状态失败', e);
}
await new Promise(resolve => setTimeout(resolve, 2000));
attempts++;
}
};
checkStatus();
} else {
failCount++;
}
}
} catch (error: any) {
failCount++;
toast.error(`${file.name}: ${error.message || '上传失败'}`);
}
} catch (error: any) {
toast.error(error.message || '上传失败');
} finally {
setUploading(false);
}
setUploading(false);
loadDocuments();
if (successCount > 0) {
toast.success(`成功上传 ${successCount} 个文件`);
setUploadedFiles(prev => [...prev, ...successfulFiles]);
setUploadExpanded(true);
}
if (failCount > 0) {
toast.error(`${failCount} 个文件上传失败`);
}
};
@@ -291,7 +338,7 @@ const Documents: React.FC = () => {
'text/markdown': ['.md'],
'text/plain': ['.txt']
},
maxFiles: 1
multiple: true
});
// AI analysis handler
@@ -449,6 +496,7 @@ const Documents: React.FC = () => {
const handleDeleteFile = () => {
setUploadedFile(null);
setUploadedFiles([]);
setParseResult(null);
setAiAnalysis(null);
setAnalysisCharts(null);
@@ -456,6 +504,17 @@ const Documents: React.FC = () => {
toast.success('文件已清除');
};
const handleRemoveUploadedFile = (index: number) => {
// Compute the next list outside the state updater so the updater has no side effects
const newFiles = uploadedFiles.filter((_, i) => i !== index);
setUploadedFiles(newFiles);
if (newFiles.length === 0) {
setUploadedFile(null);
}
toast.success('文件已从列表移除');
};
const handleDelete = async (docId: string) => {
try {
const result = await backendApi.deleteDocument(docId);
@@ -615,7 +674,7 @@ const Documents: React.FC = () => {
<h1 className="text-3xl font-extrabold tracking-tight"></h1>
<p className="text-muted-foreground">使 AI </p>
</div>
<Button variant="outline" className="rounded-xl gap-2" onClick={loadDocuments}>
<Button variant="outline" className="rounded-xl gap-2" onClick={() => loadDocuments()}>
<RefreshCcw size={18} />
<span></span>
</Button>
@@ -640,7 +699,83 @@ const Documents: React.FC = () => {
</CardHeader>
{uploadPanelOpen && (
<CardContent className="space-y-4">
{!uploadedFile ? (
{uploadedFiles.length > 0 || uploadedFile ? (
<div className="space-y-3">
{/* File list header */}
<div
className="flex items-center justify-between p-3 bg-muted/50 rounded-xl cursor-pointer hover:bg-muted/70 transition-colors"
onClick={() => setUploadExpanded(!uploadExpanded)}
>
<div className="flex items-center gap-3">
<div className="w-10 h-10 rounded-lg bg-primary/10 text-primary flex items-center justify-center">
<Upload size={20} />
</div>
<div>
<p className="font-semibold text-sm">
{(uploadedFiles.length > 0 ? uploadedFiles : [uploadedFile]).length}
</p>
<p className="text-xs text-muted-foreground">
{uploadExpanded ? '点击收起' : '点击展开查看'}
</p>
</div>
</div>
<div className="flex items-center gap-2">
<Button
variant="ghost"
size="sm"
onClick={(e) => {
e.stopPropagation();
handleDeleteFile();
}}
className="text-destructive hover:text-destructive"
>
<Trash2 size={14} className="mr-1" />
</Button>
{uploadExpanded ? <ChevronUp size={16} /> : <ChevronDown size={16} />}
</div>
</div>
{/* Expanded file list */}
{uploadExpanded && (
<div className="space-y-2 border rounded-xl p-3">
{(uploadedFiles.length > 0 ? uploadedFiles : [uploadedFile]).filter(Boolean).map((file, index) => (
<div key={index} className="flex items-center gap-3 p-2 bg-background rounded-lg">
<div className={cn(
"w-8 h-8 rounded flex items-center justify-center",
isExcelFile(file?.name || '') ? "bg-emerald-500/10 text-emerald-500" : "bg-blue-500/10 text-blue-500"
)}>
{isExcelFile(file?.name || '') ? <FileSpreadsheet size={16} /> : <FileText size={16} />}
</div>
<div className="flex-1 min-w-0">
<p className="text-sm truncate">{file?.name}</p>
<p className="text-xs text-muted-foreground">{formatFileSize(file?.size || 0)}</p>
</div>
<Button
variant="ghost"
size="icon"
className="text-destructive hover:bg-destructive/10"
onClick={() => handleRemoveUploadedFile(index)}
>
<Trash2 size={14} />
</Button>
</div>
))}
{/* "Add more" button */}
<div
{...getRootProps()}
className="flex items-center justify-center gap-2 p-3 border-2 border-dashed rounded-lg cursor-pointer hover:border-primary/50 hover:bg-primary/5 transition-colors"
onClick={(e) => e.stopPropagation()}
>
<input {...getInputProps()} multiple={true} />
<Plus size={16} className="text-muted-foreground" />
<span className="text-sm text-muted-foreground"></span>
</div>
</div>
)}
</div>
) : (
<div
{...getRootProps()}
className={cn(
@@ -649,7 +784,7 @@ const Documents: React.FC = () => {
uploading && "opacity-50 pointer-events-none"
)}
>
<input {...getInputProps()} />
<input {...getInputProps()} multiple={true} />
<div className="w-14 h-14 rounded-xl bg-primary/10 text-primary flex items-center justify-center mb-4 group-hover:scale-110 transition-transform">
{uploading ? <Loader2 className="animate-spin" size={28} /> : <Upload size={28} />}
</div>
@@ -671,30 +806,6 @@ const Documents: React.FC = () => {
</Badge>
</div>
</div>
) : (
<div className="space-y-4">
<div className="flex items-center gap-3 p-3 bg-muted/30 rounded-xl">
<div className={cn(
"w-10 h-10 rounded-lg flex items-center justify-center",
isExcelFile(uploadedFile.name) ? "bg-emerald-500/10 text-emerald-500" : "bg-blue-500/10 text-blue-500"
)}>
{isExcelFile(uploadedFile.name) ? <FileSpreadsheet size={20} /> : <FileText size={20} />}
</div>
<div className="flex-1 min-w-0">
<p className="font-semibold text-sm truncate">{uploadedFile.name}</p>
<p className="text-xs text-muted-foreground">{formatFileSize(uploadedFile.size)}</p>
</div>
<Button variant="ghost" size="icon" className="text-destructive hover:bg-destructive/10" onClick={handleDeleteFile}>
<Trash2 size={16} />
</Button>
</div>
{isExcelFile(uploadedFile.name) && (
<Button onClick={() => onDrop([uploadedFile])} className="w-full" disabled={uploading}>
{uploading ? '解析中...' : '重新解析'}
</Button>
)}
</div>
)}
</CardContent>
)}

File diff suppressed because it is too large

View File

@@ -1,603 +0,0 @@
import React, { useState, useEffect } from 'react';
import {
TableProperties,
Plus,
FilePlus,
CheckCircle2,
Download,
Clock,
RefreshCcw,
Sparkles,
Zap,
FileCheck,
FileSpreadsheet,
Trash2,
ChevronDown,
ChevronUp,
BarChart3,
FileText,
TrendingUp,
Info,
AlertCircle,
Loader2
} from 'lucide-react';
import { Button } from '@/components/ui/button';
import { Card, CardContent, CardHeader, CardTitle, CardDescription, CardFooter } from '@/components/ui/card';
import { Badge } from '@/components/ui/badge';
import { useAuth } from '@/context/AuthContext';
import { templateApi, documentApi, taskApi } from '@/db/api';
import { backendApi, aiApi } from '@/db/backend-api';
import { supabase } from '@/db/supabase';
import { format } from 'date-fns';
import { toast } from 'sonner';
import { cn } from '@/lib/utils';
import { Skeleton } from '@/components/ui/skeleton';
import {
Dialog,
DialogContent,
DialogHeader,
DialogTitle,
DialogTrigger,
DialogFooter,
DialogDescription
} from '@/components/ui/dialog';
import { Checkbox } from '@/components/ui/checkbox';
import { ScrollArea } from '@/components/ui/scroll-area';
import { Input } from '@/components/ui/input';
import { Label } from '@/components/ui/label';
import { Textarea } from '@/components/ui/textarea';
import { Select, SelectContent, SelectItem, SelectTrigger, SelectValue } from '@/components/ui/select';
import { useDropzone } from 'react-dropzone';
import { Markdown } from '@/components/ui/markdown';
type Template = any;
type Document = any;
type FillTask = any;
const FormFill: React.FC = () => {
const { profile } = useAuth();
const [templates, setTemplates] = useState<Template[]>([]);
const [documents, setDocuments] = useState<Document[]>([]);
const [tasks, setTasks] = useState<any[]>([]);
const [loading, setLoading] = useState(true);
// Selection state
const [selectedTemplate, setSelectedTemplate] = useState<string | null>(null);
const [selectedDocs, setSelectedDocs] = useState<string[]>([]);
const [creating, setCreating] = useState(false);
const [openTaskDialog, setOpenTaskDialog] = useState(false);
const [viewingTask, setViewingTask] = useState<any | null>(null);
// Excel upload state
const [excelFile, setExcelFile] = useState<File | null>(null);
const [excelParseResult, setExcelParseResult] = useState<any>(null);
const [excelAnalysis, setExcelAnalysis] = useState<any>(null);
const [excelAnalyzing, setExcelAnalyzing] = useState(false);
const [expandedSheet, setExpandedSheet] = useState<string | null>(null);
const [aiOptions, setAiOptions] = useState({
userPrompt: '请分析这些数据,并提取关键信息用于填表,包括数值、分类、摘要等。',
analysisType: 'general' as 'general' | 'summary' | 'statistics' | 'insights'
});
const loadData = async () => {
if (!profile) return;
try {
const [t, d, ts] = await Promise.all([
templateApi.listTemplates((profile as any).id),
documentApi.listDocuments((profile as any).id),
taskApi.listTasks((profile as any).id)
]);
setTemplates(t);
setDocuments(d);
setTasks(ts);
} catch (err: any) {
toast.error('数据加载失败');
} finally {
setLoading(false);
}
};
useEffect(() => {
loadData();
}, [profile]);
// Excel upload handlers
const onExcelDrop = async (acceptedFiles: File[]) => {
const file = acceptedFiles[0];
if (!file) return;
if (!file.name.match(/\.(xlsx|xls)$/i)) {
toast.error('仅支持 .xlsx 和 .xls 格式的 Excel 文件');
return;
}
setExcelFile(file);
setExcelParseResult(null);
setExcelAnalysis(null);
setExpandedSheet(null);
try {
const result = await backendApi.uploadExcel(file);
if (result.success) {
toast.success(`Excel 解析成功: ${file.name}`);
setExcelParseResult(result);
} else {
toast.error(result.error || '解析失败');
}
} catch (error: any) {
toast.error(error.message || '上传失败');
}
};
const { getRootProps, getInputProps, isDragActive } = useDropzone({
onDrop: onExcelDrop,
accept: {
'application/vnd.openxmlformats-officedocument.spreadsheetml.sheet': ['.xlsx'],
'application/vnd.ms-excel': ['.xls']
},
maxFiles: 1
});
const handleAnalyzeExcel = async () => {
if (!excelFile || !excelParseResult?.success) {
toast.error('请先上传并解析 Excel 文件');
return;
}
setExcelAnalyzing(true);
setExcelAnalysis(null);
try {
const result = await aiApi.analyzeExcel(excelFile, {
userPrompt: aiOptions.userPrompt,
analysisType: aiOptions.analysisType
});
if (result.success) {
toast.success('AI 分析完成');
setExcelAnalysis(result);
} else {
toast.error(result.error || 'AI 分析失败');
}
} catch (error: any) {
toast.error(error.message || 'AI 分析失败');
} finally {
setExcelAnalyzing(false);
}
};
const handleUseExcelData = () => {
if (!excelParseResult?.success) {
toast.error('请先解析 Excel 文件');
return;
}
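// Mark the parsed Excel data as a "document" and add it to the selection list
toast.success('Excel 数据已添加到数据源,请在任务对话框中选择');
// Logic for passing the Excel data to the backend when creating a task could be added here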
};
const handleDeleteExcel = () => {
setExcelFile(null);
setExcelParseResult(null);
setExcelAnalysis(null);
setExpandedSheet(null);
toast.success('Excel 文件已清除');
};
const handleUploadTemplate = async (e: React.ChangeEvent<HTMLInputElement>) => {
const file = e.target.files?.[0];
if (!file || !profile) return;
try {
toast.loading('正在上传模板...');
await templateApi.uploadTemplate(file, (profile as any).id);
toast.dismiss();
toast.success('模板上传成功');
loadData();
} catch (err) {
toast.dismiss();
toast.error('上传模板失败');
}
};
const handleCreateTask = async () => {
if (!profile || !selectedTemplate || selectedDocs.length === 0) {
toast.error('请先选择模板和数据源文档');
return;
}
setCreating(true);
try {
const task = await taskApi.createTask((profile as any).id, selectedTemplate, selectedDocs);
if (task) {
toast.success('任务已创建,正在进行智能填表...');
setOpenTaskDialog(false);
// Invoke edge function
supabase.functions.invoke('fill-template', {
body: { taskId: task.id }
}).then(({ error }) => {
if (error) toast.error('填表任务执行失败');
else {
toast.success('表格填写完成!');
loadData();
}
});
loadData();
}
} catch (err: any) {
toast.error('创建任务失败');
} finally {
setCreating(false);
}
};
const getStatusColor = (status: string) => {
switch (status) {
case 'completed': return 'bg-emerald-500 text-white';
case 'failed': return 'bg-destructive text-white';
default: return 'bg-amber-500 text-white';
}
};
const formatFileSize = (bytes: number): string => {
if (bytes === 0) return '0 B';
const k = 1024;
const sizes = ['B', 'KB', 'MB', 'GB'];
const i = Math.floor(Math.log(bytes) / Math.log(k));
return `${(bytes / Math.pow(k, i)).toFixed(2)} ${sizes[i]}`;
};
return (
<div className="space-y-8 animate-fade-in pb-10">
<section className="flex flex-col md:flex-row md:items-center justify-between gap-4">
<div className="space-y-1">
<h1 className="text-3xl font-extrabold tracking-tight"></h1>
<p className="text-muted-foreground"></p>
</div>
<div className="flex items-center gap-3">
<Dialog open={openTaskDialog} onOpenChange={setOpenTaskDialog}>
<DialogTrigger asChild>
<Button className="rounded-xl shadow-lg shadow-primary/20 gap-2 h-11 px-6">
<FilePlus size={18} />
<span></span>
</Button>
</DialogTrigger>
<DialogContent className="max-w-4xl max-h-[90vh] flex flex-col p-0 overflow-hidden border-none shadow-2xl rounded-3xl">
<DialogHeader className="p-8 pb-4 bg-muted/50">
<DialogTitle className="text-2xl font-bold flex items-center gap-2">
<Sparkles size={24} className="text-primary" />
</DialogTitle>
<DialogDescription>
AI
</DialogDescription>
</DialogHeader>
<ScrollArea className="flex-1 p-8 pt-4">
<div className="space-y-8">
{/* Step 1: Select Template */}
<div className="space-y-4">
<div className="flex items-center justify-between">
<h4 className="font-bold flex items-center gap-2 text-primary uppercase tracking-widest text-xs">
<span className="w-5 h-5 rounded-full bg-primary text-white flex items-center justify-center text-[10px]">1</span>
</h4>
<label className="cursor-pointer text-xs font-semibold text-primary hover:underline flex items-center gap-1">
<Plus size={12} />
<input type="file" className="hidden" onChange={handleUploadTemplate} accept=".docx,.xlsx" />
</label>
</div>
{templates.length > 0 ? (
<div className="grid grid-cols-1 sm:grid-cols-2 gap-3">
{templates.map(t => (
<div
key={t.id}
className={cn(
"p-4 rounded-2xl border-2 transition-all cursor-pointer flex items-center gap-3 group relative overflow-hidden",
selectedTemplate === t.id ? "border-primary bg-primary/5" : "border-border hover:border-primary/50"
)}
onClick={() => setSelectedTemplate(t.id)}
>
<div className={cn(
"w-10 h-10 rounded-xl flex items-center justify-center shrink-0 transition-colors",
selectedTemplate === t.id ? "bg-primary text-white" : "bg-muted text-muted-foreground"
)}>
<TableProperties size={20} />
</div>
<div className="flex-1 min-w-0">
<p className="font-bold text-sm truncate">{t.name}</p>
<p className="text-[10px] text-muted-foreground uppercase">{t.type}</p>
</div>
{selectedTemplate === t.id && (
<div className="absolute top-0 right-0 w-8 h-8 bg-primary text-white flex items-center justify-center rounded-bl-xl">
<CheckCircle2 size={14} />
</div>
)}
</div>
))}
</div>
) : (
<div className="p-8 text-center bg-muted/30 rounded-2xl border border-dashed text-sm italic text-muted-foreground">
</div>
)}
</div>
{/* Step 2: Upload & Analyze Excel */}
<div className="space-y-4">
<h4 className="font-bold flex items-center gap-2 text-primary uppercase tracking-widest text-xs">
<span className="w-5 h-5 rounded-full bg-primary text-white flex items-center justify-center text-[10px]">1.5</span>
Excel
</h4>
<div className="bg-muted/20 rounded-2xl p-6">
{!excelFile ? (
<div
{...getRootProps()}
className={cn(
"border-2 border-dashed rounded-xl p-8 transition-all duration-300 flex flex-col items-center justify-center text-center cursor-pointer group",
isDragActive ? "border-primary bg-primary/5" : "border-muted-foreground/20 hover:border-primary/50 hover:bg-muted/30"
)}
>
<input {...getInputProps()} />
<div className="w-12 h-12 rounded-xl bg-primary/10 text-primary flex items-center justify-center mb-3 group-hover:scale-110 transition-transform">
<FileSpreadsheet size={24} />
</div>
<p className="font-semibold text-sm">
{isDragActive ? '释放以开始上传' : '点击或拖拽 Excel 文件'}
</p>
<p className="text-xs text-muted-foreground mt-1"> .xlsx .xls </p>
</div>
) : (
<div className="space-y-4">
<div className="flex items-center gap-3 p-3 bg-background rounded-xl">
<div className="w-10 h-10 rounded-lg bg-emerald-500/10 text-emerald-500 flex items-center justify-center">
<FileSpreadsheet size={20} />
</div>
<div className="flex-1 min-w-0">
<p className="font-semibold text-sm truncate">{excelFile.name}</p>
<p className="text-xs text-muted-foreground">{formatFileSize(excelFile.size)}</p>
</div>
<div className="flex gap-2">
<Button
variant="ghost"
size="icon"
className="text-destructive hover:bg-destructive/10"
onClick={handleDeleteExcel}
>
<Trash2 size={16} />
</Button>
</div>
</div>
{/* AI Analysis Options */}
{excelParseResult?.success && (
<div className="space-y-3">
<div className="space-y-2">
<Label htmlFor="analysis-type" className="text-xs"></Label>
<Select
value={aiOptions.analysisType}
onValueChange={(value: any) => setAiOptions({ ...aiOptions, analysisType: value })}
>
<SelectTrigger id="analysis-type" className="bg-background h-9 text-sm">
<SelectValue placeholder="选择分析类型" />
</SelectTrigger>
<SelectContent>
<SelectItem value="general"></SelectItem>
<SelectItem value="summary"></SelectItem>
<SelectItem value="statistics"></SelectItem>
<SelectItem value="insights"></SelectItem>
</SelectContent>
</Select>
</div>
<div className="space-y-2">
<Label htmlFor="user-prompt" className="text-xs"></Label>
<Textarea
id="user-prompt"
value={aiOptions.userPrompt}
onChange={(e) => setAiOptions({ ...aiOptions, userPrompt: e.target.value })}
className="bg-background resize-none text-sm"
rows={2}
/>
</div>
<Button
onClick={handleAnalyzeExcel}
disabled={excelAnalyzing}
className="w-full gap-2 h-9"
variant="outline"
>
{excelAnalyzing ? <Loader2 className="animate-spin" size={14} /> : <Sparkles size={14} />}
{excelAnalyzing ? '分析中...' : 'AI 分析'}
</Button>
{excelParseResult?.success && (
<Button
onClick={handleUseExcelData}
className="w-full gap-2 h-9"
>
<CheckCircle2 size={14} />
使
</Button>
)}
</div>
)}
{/* Excel Analysis Result */}
{excelAnalysis && (
<div className="mt-4 p-4 bg-background rounded-xl max-h-60 overflow-y-auto">
<div className="flex items-center gap-2 mb-3">
<Sparkles size={16} className="text-primary" />
<span className="font-semibold text-sm">AI </span>
</div>
<Markdown content={excelAnalysis.analysis?.analysis || ''} className="text-sm" />
</div>
)}
</div>
)}
</div>
</div>
{/* Step 3: Select Documents */}
<div className="space-y-4">
<h4 className="font-bold flex items-center gap-2 text-primary uppercase tracking-widest text-xs">
<span className="w-5 h-5 rounded-full bg-primary text-white flex items-center justify-center text-[10px]">2</span>
</h4>
{documents.filter(d => d.status === 'completed').length > 0 ? (
<div className="space-y-2 max-h-40 overflow-y-auto pr-2 custom-scrollbar">
{documents.filter(d => d.status === 'completed').map(doc => (
<div
key={doc.id}
className={cn(
"flex items-center gap-3 p-3 rounded-xl border transition-all cursor-pointer",
selectedDocs.includes(doc.id) ? "border-primary/50 bg-primary/5 shadow-sm" : "border-border hover:bg-muted/30"
)}
onClick={() => {
setSelectedDocs(prev =>
prev.includes(doc.id) ? prev.filter(id => id !== doc.id) : [...prev, doc.id]
);
}}
>
<Checkbox checked={selectedDocs.includes(doc.id)} onCheckedChange={() => {}} />
<div className="w-8 h-8 rounded-lg bg-blue-500/10 text-blue-500 flex items-center justify-center">
<Zap size={16} />
</div>
<span className="font-semibold text-sm truncate">{doc.name}</span>
</div>
))}
</div>
) : (
<div className="p-6 text-center bg-muted/30 rounded-xl border border-dashed text-xs italic text-muted-foreground">
</div>
)}
</div>
</div>
</ScrollArea>
<DialogFooter className="p-8 pt-4 bg-muted/20 border-t border-dashed">
<Button variant="outline" className="rounded-xl h-12 px-6" onClick={() => setOpenTaskDialog(false)}></Button>
<Button
className="rounded-xl h-12 px-8 shadow-lg shadow-primary/20 gap-2"
onClick={handleCreateTask}
disabled={creating || !selectedTemplate || (selectedDocs.length === 0 && !excelParseResult?.success)}
>
{creating ? <RefreshCcw className="animate-spin h-5 w-5" /> : <Zap className="h-5 w-5 fill-current" />}
<span></span>
</Button>
</DialogFooter>
</DialogContent>
</Dialog>
</div>
</section>
{/* Task List */}
<div className="grid grid-cols-1 md:grid-cols-2 lg:grid-cols-3 gap-6">
{loading ? (
Array.from({ length: 3 }).map((_, i) => (
<Skeleton key={i} className="h-48 w-full rounded-3xl bg-muted" />
))
) : tasks.length > 0 ? (
tasks.map((task) => (
<Card key={task.id} className="border-none shadow-md hover:shadow-xl transition-all group rounded-3xl overflow-hidden flex flex-col">
<div className="h-1.5 w-full" style={{ backgroundColor: task.status === 'completed' ? '#10b981' : task.status === 'failed' ? '#ef4444' : '#f59e0b' }} />
<CardHeader className="p-6 pb-2">
<div className="flex justify-between items-start mb-2">
<div className="w-12 h-12 rounded-2xl bg-emerald-500/10 text-emerald-500 flex items-center justify-center shadow-inner group-hover:scale-110 transition-transform">
<TableProperties size={24} />
</div>
<Badge className={cn("text-[10px] uppercase font-bold tracking-widest", getStatusColor(task.status))}>
{task.status === 'completed' ? '已完成' : task.status === 'failed' ? '失败' : '执行中'}
</Badge>
</div>
<CardTitle className="text-lg font-bold truncate group-hover:text-primary transition-colors">{task.templates?.name || '未知模板'}</CardTitle>
<CardDescription className="text-xs flex items-center gap-1 font-medium italic">
<Clock size={12} /> {format(new Date(task.created_at!), 'yyyy/MM/dd HH:mm')}
</CardDescription>
</CardHeader>
<CardContent className="p-6 pt-2 flex-1">
<div className="space-y-4">
<div className="flex flex-wrap gap-2">
<Badge variant="outline" className="bg-muted/50 border-none text-[10px] font-bold"> {task.document_ids?.length} </Badge>
</div>
{task.status === 'completed' && (
<div className="p-3 bg-emerald-500/5 rounded-2xl border border-emerald-500/10 flex items-center gap-3">
<CheckCircle2 className="text-emerald-500" size={18} />
<span className="text-xs font-semibold text-emerald-700"></span>
</div>
)}
</div>
</CardContent>
<CardFooter className="p-6 pt-0">
<Button
className="w-full rounded-2xl h-11 bg-primary group-hover:shadow-lg group-hover:shadow-primary/30 transition-all gap-2"
disabled={task.status !== 'completed'}
onClick={() => setViewingTask(task)}
>
<Download size={18} />
<span></span>
</Button>
</CardFooter>
</Card>
))
) : (
<div className="col-span-full py-24 flex flex-col items-center justify-center text-center space-y-6">
<div className="w-24 h-24 rounded-full bg-muted flex items-center justify-center text-muted-foreground/30 border-4 border-dashed">
<TableProperties size={48} />
</div>
<div className="space-y-2 max-w-sm">
<p className="text-2xl font-extrabold tracking-tight"></p>
<p className="text-muted-foreground text-sm"></p>
</div>
<Button className="rounded-xl h-12 px-8" onClick={() => setOpenTaskDialog(true)}></Button>
</div>
)}
</div>
{/* Task Result View Modal */}
<Dialog open={!!viewingTask} onOpenChange={(open) => !open && setViewingTask(null)}>
<DialogContent className="max-w-4xl max-h-[90vh] flex flex-col p-0 overflow-hidden border-none shadow-2xl rounded-3xl">
<DialogHeader className="p-8 pb-4 bg-primary text-primary-foreground">
<div className="flex items-center gap-3 mb-2">
<FileCheck size={28} />
<DialogTitle className="text-2xl font-extrabold"></DialogTitle>
</div>
<DialogDescription className="text-primary-foreground/80 italic">
{viewingTask?.document_ids?.length}
</DialogDescription>
</DialogHeader>
<ScrollArea className="flex-1 p-8 bg-muted/10">
<div className="prose dark:prose-invert max-w-none">
<div className="bg-card p-8 rounded-2xl shadow-sm border min-h-[400px]">
<Badge variant="outline" className="mb-4"></Badge>
<div className="whitespace-pre-wrap font-sans text-sm leading-relaxed">
<h2 className="text-xl font-bold mb-4"></h2>
<p className="text-muted-foreground mb-6"></p>
<div className="p-4 bg-muted/30 rounded-xl border border-dashed border-primary/20 italic">
...
</div>
<div className="mt-8 space-y-4">
<p className="font-semibold text-primary"> </p>
<p className="font-semibold text-primary"> </p>
<p className="font-semibold text-primary"> </p>
</div>
</div>
</div>
</div>
</ScrollArea>
<DialogFooter className="p-8 pt-4 border-t border-dashed">
<Button variant="outline" className="rounded-xl" onClick={() => setViewingTask(null)}></Button>
<Button className="rounded-xl px-8 gap-2 shadow-lg shadow-primary/20" onClick={() => toast.success("正在导出文件...")}>
<Download size={18} />
{viewingTask?.templates?.type?.toUpperCase() || '文件'}
</Button>
</DialogFooter>
</DialogContent>
</Dialog>
</div>
);
};
export default FormFill;
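
The `formatFileSize` helper in the removed `FormFill` component indexes `sizes` with an unclamped exponent, so a value of 1024^4 bytes or more (or a fractional value below 1) would yield an `undefined` unit. Below is a minimal sketch of a clamped variant, assuming the same four units; the name `formatFileSizeSafe` is hypothetical and not part of the commit.

```typescript
// Hypothetical helper (not in the original commit): same output format as
// formatFileSize, but the unit index is clamped to the bounds of SIZES.
const SIZES = ['B', 'KB', 'MB', 'GB'] as const;

function formatFileSizeSafe(bytes: number): string {
  // Zero, negative, and non-finite inputs are reported as empty.
  if (!Number.isFinite(bytes) || bytes <= 0) return '0 B';
  const k = 1024;
  const rawIndex = Math.floor(Math.log(bytes) / Math.log(k));
  // Clamp so values >= 1024^4 fall back to GB and values < 1 fall back to B.
  const i = Math.min(Math.max(rawIndex, 0), SIZES.length - 1);
  return `${(bytes / Math.pow(k, i)).toFixed(2)} ${SIZES[i]}`;
}
```

Clamping keeps the original formatting for ordinary files and only changes behavior at the two ends of the range, where the original would have printed `undefined` as the unit.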

@@ -1,184 +0,0 @@
import React, { useState } from 'react';
import { useNavigate, useLocation } from 'react-router-dom';
import { useAuth } from '@/context/AuthContext';
import { Button } from '@/components/ui/button';
import { Input } from '@/components/ui/input';
import { Label } from '@/components/ui/label';
import { Card, CardContent, CardDescription, CardFooter, CardHeader, CardTitle } from '@/components/ui/card';
import { Tabs, TabsContent, TabsList, TabsTrigger } from '@/components/ui/tabs';
import { FileText, Lock, User, CheckCircle2, AlertCircle } from 'lucide-react';
import { toast } from 'sonner';
const Login: React.FC = () => {
const [username, setUsername] = useState('');
const [password, setPassword] = useState('');
const [loading, setLoading] = useState(false);
const { signIn, signUp } = useAuth();
const navigate = useNavigate();
const location = useLocation();
const handleLogin = async (e: React.FormEvent) => {
e.preventDefault();
if (!username || !password) return toast.error('请输入用户名和密码');
setLoading(true);
try {
const email = `${username}@miaoda.com`;
const { error } = await signIn(email, password);
if (error) throw error;
toast.success('登录成功');
navigate('/');
} catch (err: any) {
toast.error(err.message || '登录失败');
} finally {
setLoading(false);
}
};
const handleSignUp = async (e: React.FormEvent) => {
e.preventDefault();
if (!username || !password) return toast.error('请输入用户名和密码');
setLoading(true);
try {
const email = `${username}@miaoda.com`;
const { error } = await signUp(email, password);
if (error) throw error;
toast.success('注册成功,请登录');
} catch (err: any) {
toast.error(err.message || '注册失败');
} finally {
setLoading(false);
}
};
return (
<div className="min-h-screen flex items-center justify-center bg-[radial-gradient(ellipse_at_top_left,_var(--tw-gradient-stops))] from-primary/10 via-background to-background p-4 relative overflow-hidden">
{/* Decorative elements */}
<div className="absolute top-0 left-0 w-96 h-96 bg-primary/5 rounded-full blur-3xl -translate-x-1/2 -translate-y-1/2" />
<div className="absolute bottom-0 right-0 w-64 h-64 bg-primary/5 rounded-full blur-3xl translate-x-1/3 translate-y-1/3" />
<div className="w-full max-w-md space-y-8 relative animate-fade-in">
<div className="text-center space-y-2">
<div className="inline-flex items-center justify-center w-16 h-16 rounded-2xl bg-primary text-primary-foreground shadow-2xl shadow-primary/30 mb-4 animate-slide-in">
<FileText size={32} />
</div>
<h1 className="text-4xl font-extrabold tracking-tight gradient-text"></h1>
<p className="text-muted-foreground"></p>
</div>
<Card className="border-border/50 shadow-2xl backdrop-blur-sm bg-card/95">
<Tabs defaultValue="login" className="w-full">
<TabsList className="grid w-full grid-cols-2 rounded-t-xl h-12 bg-muted/50 p-1">
<TabsTrigger value="login" className="rounded-lg data-[state=active]:bg-background data-[state=active]:shadow-sm"></TabsTrigger>
<TabsTrigger value="signup" className="rounded-lg data-[state=active]:bg-background data-[state=active]:shadow-sm"></TabsTrigger>
</TabsList>
<TabsContent value="login">
<form onSubmit={handleLogin}>
<CardHeader>
<CardTitle></CardTitle>
<CardDescription>使</CardDescription>
</CardHeader>
<CardContent className="space-y-4">
<div className="space-y-2">
<Label htmlFor="username"></Label>
<div className="relative">
<User className="absolute left-3 top-2.5 h-4 w-4 text-muted-foreground" />
<Input
id="username"
placeholder="请输入用户名"
className="pl-9 bg-muted/30 border-none focus-visible:ring-primary"
value={username}
onChange={(e) => setUsername(e.target.value)}
/>
</div>
</div>
<div className="space-y-2">
<Label htmlFor="password"></Label>
<div className="relative">
<Lock className="absolute left-3 top-2.5 h-4 w-4 text-muted-foreground" />
<Input
id="password"
type="password"
placeholder="请输入密码"
className="pl-9 bg-muted/30 border-none focus-visible:ring-primary"
value={password}
onChange={(e) => setPassword(e.target.value)}
/>
</div>
</div>
</CardContent>
<CardFooter>
<Button className="w-full h-11 text-lg font-semibold rounded-xl" type="submit" disabled={loading}>
{loading ? '登录中...' : '立即登录'}
</Button>
</CardFooter>
</form>
</TabsContent>
<TabsContent value="signup">
<form onSubmit={handleSignUp}>
<CardHeader>
<CardTitle></CardTitle>
<CardDescription></CardDescription>
</CardHeader>
<CardContent className="space-y-4">
<div className="space-y-2">
<Label htmlFor="signup-username"></Label>
<div className="relative">
<User className="absolute left-3 top-2.5 h-4 w-4 text-muted-foreground" />
<Input
id="signup-username"
placeholder="仅字母、数字和下划线"
className="pl-9 bg-muted/30 border-none focus-visible:ring-primary"
value={username}
onChange={(e) => setUsername(e.target.value)}
/>
</div>
</div>
<div className="space-y-2">
<Label htmlFor="signup-password"></Label>
<div className="relative">
<Lock className="absolute left-3 top-2.5 h-4 w-4 text-muted-foreground" />
<Input
id="signup-password"
type="password"
placeholder="不少于 6 位"
className="pl-9 bg-muted/30 border-none focus-visible:ring-primary"
value={password}
onChange={(e) => setPassword(e.target.value)}
/>
</div>
</div>
</CardContent>
<CardFooter>
<Button className="w-full h-11 text-lg font-semibold rounded-xl" type="submit" disabled={loading}>
{loading ? '注册中...' : '注册账号'}
</Button>
</CardFooter>
</form>
</TabsContent>
</Tabs>
</Card>
<div className="grid grid-cols-2 gap-4 text-center text-xs text-muted-foreground">
<div className="flex flex-col items-center gap-1">
<CheckCircle2 size={16} className="text-primary" />
<span></span>
</div>
<div className="flex flex-col items-center gap-1">
<CheckCircle2 size={16} className="text-primary" />
<span></span>
</div>
</div>
<div className="text-center text-sm text-muted-foreground">
&copy; 2026 |
</div>
</div>
</div>
);
};
export default Login;
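
The sign-up placeholders in the removed `Login` component say that usernames may contain only letters, digits, and underscores and that passwords need at least 6 characters, yet `handleSignUp` only rejects empty values before building the `@miaoda.com` address. Below is a minimal sketch of a client-side check that could run first; `validateCredentials` and `toLoginEmail` are hypothetical names and not part of the original file.

```typescript
// Hypothetical helpers (not in the original file). The rules are taken from
// the placeholder texts: letters/digits/underscore, password of 6+ characters.
type ValidationResult =
  | { ok: true; email: string }
  | { ok: false; reason: 'username' | 'password' };

const USERNAME_PATTERN = /^[A-Za-z0-9_]+$/;
const MIN_PASSWORD_LENGTH = 6;

// Same username-to-email mapping that handleLogin and handleSignUp use.
function toLoginEmail(username: string): string {
  return `${username}@miaoda.com`;
}

function validateCredentials(username: string, password: string): ValidationResult {
  // An empty username also fails here because the pattern requires 1+ chars.
  if (!USERNAME_PATTERN.test(username)) return { ok: false, reason: 'username' };
  if (password.length < MIN_PASSWORD_LENGTH) return { ok: false, reason: 'password' };
  return { ok: true, email: toLoginEmail(username) };
}
```

Returning a discriminated union lets the caller show a specific message for each failure instead of the single generic toast.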

@@ -1,16 +0,0 @@
/**
* Sample Page
*/
import PageMeta from "../components/common/PageMeta";
export default function SamplePage() {
return (
<>
<PageMeta title="Home" description="Home Page Introduction" />
<div>
<h3>This is a sample page</h3>
</div>
</>
);
}

@@ -11,7 +11,8 @@ import {
ChevronDown,
ChevronUp,
Trash2,
- AlertCircle
+ AlertCircle,
+ HelpCircle
} from 'lucide-react';
import { Card, CardContent, CardHeader, CardTitle, CardDescription } from '@/components/ui/card';
import { Button } from '@/components/ui/button';
@@ -24,9 +25,9 @@ import { Skeleton } from '@/components/ui/skeleton';
type Task = {
task_id: string;
- status: 'pending' | 'processing' | 'success' | 'failure';
+ status: 'pending' | 'processing' | 'success' | 'failure' | 'unknown';
created_at: string;
- completed_at?: string;
+ updated_at?: string;
message?: string;
result?: any;
error?: string;
@@ -38,54 +39,38 @@ const TaskHistory: React.FC = () => {
const [loading, setLoading] = useState(true);
const [expandedTask, setExpandedTask] = useState<string | null>(null);
- // Mock data for demonstration
- useEffect(() => {
- // Mock task data; in practice this should be fetched from the backend
- setTasks([
- {
- task_id: 'task-001',
- status: 'success',
- created_at: new Date(Date.now() - 3600000).toISOString(),
- completed_at: new Date(Date.now() - 3500000).toISOString(),
- task_type: 'document_parse',
- message: '文档解析完成',
- result: {
- doc_id: 'doc-001',
- filename: 'report_q1_2026.docx',
- extracted_fields: ['标题', '作者', '日期', '金额']
- }
- },
- {
- task_id: 'task-002',
- status: 'success',
- created_at: new Date(Date.now() - 7200000).toISOString(),
- completed_at: new Date(Date.now() - 7100000).toISOString(),
- task_type: 'excel_analysis',
- message: 'Excel 分析完成',
- result: {
- filename: 'sales_data.xlsx',
- row_count: 1250,
- charts_generated: 3
- }
- },
- {
- task_id: 'task-003',
- status: 'processing',
- created_at: new Date(Date.now() - 600000).toISOString(),
- task_type: 'template_fill',
- message: '正在填充表格...'
- },
- {
- task_id: 'task-004',
- status: 'failure',
- created_at: new Date(Date.now() - 86400000).toISOString(),
- completed_at: new Date(Date.now() - 86390000).toISOString(),
- task_type: 'document_parse',
- message: '解析失败',
- error: '文件格式不支持或文件已损坏'
+ // Fetch task history data
+ const fetchTasks = async () => {
+ try {
+ setLoading(true);
+ const response = await backendApi.getTasks(50, 0);
+ if (response.success && response.tasks) {
+ // Convert the backend data format to the frontend format
+ const convertedTasks: Task[] = response.tasks.map((t: any) => ({
+ task_id: t.task_id,
+ status: t.status || 'unknown',
+ created_at: t.created_at || new Date().toISOString(),
+ updated_at: t.updated_at,
+ message: t.message || '',
+ result: t.result,
+ error: t.error,
+ task_type: t.task_type || 'document_parse'
+ }));
+ setTasks(convertedTasks);
+ } else {
+ setTasks([]);
}
- ]);
- setLoading(false);
+ } catch (error) {
+ console.error('获取任务列表失败:', error);
+ toast.error('获取任务列表失败');
+ setTasks([]);
+ } finally {
+ setLoading(false);
+ }
+ };
+ useEffect(() => {
+ fetchTasks();
}, []);
const getStatusBadge = (status: string) => {
@@ -96,6 +81,8 @@ const TaskHistory: React.FC = () => {
return <Badge className="bg-destructive text-white text-[10px]"><XCircle size={12} className="mr-1" /></Badge>;
case 'processing':
return <Badge className="bg-amber-500 text-white text-[10px]"><Loader2 size={12} className="mr-1 animate-spin" /></Badge>;
+ case 'unknown':
+ return <Badge className="bg-gray-500 text-white text-[10px]"><HelpCircle size={12} className="mr-1" /></Badge>;
default:
return <Badge className="bg-gray-500 text-white text-[10px]"><Clock size={12} className="mr-1" /></Badge>;
}
@@ -133,15 +120,22 @@ const TaskHistory: React.FC = () => {
};
const handleDelete = async (taskId: string) => {
- setTasks(prev => prev.filter(t => t.task_id !== taskId));
- toast.success('任务已删除');
+ try {
+ await backendApi.deleteTask(taskId);
+ setTasks(prev => prev.filter(t => t.task_id !== taskId));
+ toast.success('任务已删除');
+ } catch (error) {
+ console.error('删除任务失败:', error);
+ toast.error('删除任务失败');
+ }
};
const stats = {
total: tasks.length,
success: tasks.filter(t => t.status === 'success').length,
processing: tasks.filter(t => t.status === 'processing').length,
- failure: tasks.filter(t => t.status === 'failure').length
+ failure: tasks.filter(t => t.status === 'failure').length,
+ unknown: tasks.filter(t => t.status === 'unknown').length
};
return (
@@ -151,7 +145,7 @@ const TaskHistory: React.FC = () => {
<h1 className="text-3xl font-extrabold tracking-tight"></h1>
<p className="text-muted-foreground"></p>
</div>
- <Button variant="outline" className="rounded-xl gap-2" onClick={() => window.location.reload()}>
+ <Button variant="outline" className="rounded-xl gap-2" onClick={() => fetchTasks()}>
<RefreshCcw size={18} />
<span></span>
</Button>
@@ -194,7 +188,8 @@ const TaskHistory: React.FC = () => {
"w-12 h-12 rounded-xl flex items-center justify-center shrink-0",
task.status === 'success' ? "bg-emerald-500/10 text-emerald-500" :
task.status === 'failure' ? "bg-destructive/10 text-destructive" :
- "bg-amber-500/10 text-amber-500"
+ task.status === 'processing' ? "bg-amber-500/10 text-amber-500" :
+ "bg-gray-500/10 text-gray-500"
)}>
{task.status === 'processing' ? (
<Loader2 size={24} className="animate-spin" />
@@ -212,16 +207,16 @@ const TaskHistory: React.FC = () => {
</Badge>
</div>
<p className="text-sm text-muted-foreground">
- {task.message || '任务执行中...'}
+ {task.message || (task.status === 'unknown' ? '无法获取状态' : '任务执行中...')}
</p>
<div className="flex items-center gap-4 text-xs text-muted-foreground">
<span className="flex items-center gap-1">
<Clock size={12} />
- {format(new Date(task.created_at), 'yyyy-MM-dd HH:mm:ss')}
+ {task.created_at ? format(new Date(task.created_at), 'yyyy-MM-dd HH:mm:ss') : '时间未知'}
</span>
- {task.completed_at && (
+ {task.updated_at && task.status !== 'processing' && (
<span>
- : {Math.round((new Date(task.completed_at).getTime() - new Date(task.created_at).getTime()) / 1000)}
+ : {format(new Date(task.updated_at), 'HH:mm:ss')}
</span>
)}
</div>
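
The conversion in `fetchTasks` falls back to `'unknown'` only when `t.status` is falsy, so an unexpected non-empty string from the backend would still be stored as a status and rendered through the `default` badge branch. Below is a minimal sketch of a stricter normalizer, assuming the payload fields shown in the diff; `normalizeStatus`, `normalizeTask`, and `asString` are hypothetical names, and `task_type` is assumed to be an optional string on `Task`.

```typescript
type TaskStatus = 'pending' | 'processing' | 'success' | 'failure' | 'unknown';

// Mirrors the Task type from the diff; task_type is an assumption.
interface Task {
  task_id: string;
  status: TaskStatus;
  created_at: string;
  updated_at?: string;
  message?: string;
  result?: unknown;
  error?: string;
  task_type?: string;
}

const KNOWN_STATUSES = new Set<string>(['pending', 'processing', 'success', 'failure']);

// Any value outside the known set (including undefined) maps to 'unknown'.
function normalizeStatus(raw: unknown): TaskStatus {
  return typeof raw === 'string' && KNOWN_STATUSES.has(raw) ? (raw as TaskStatus) : 'unknown';
}

function asString(value: unknown): string | undefined {
  return typeof value === 'string' ? value : undefined;
}

// Hypothetical replacement for the inline map callback in fetchTasks.
function normalizeTask(t: Record<string, unknown>): Task {
  return {
    task_id: String(t.task_id ?? ''),
    status: normalizeStatus(t.status),
    created_at: asString(t.created_at) ?? new Date().toISOString(),
    updated_at: asString(t.updated_at),
    message: asString(t.message) ?? '',
    result: t.result,
    error: asString(t.error),
    task_type: asString(t.task_type) ?? 'document_parse',
  };
}
```

With this in place the call site would reduce to `response.tasks.map(normalizeTask)`, and the `'unknown'` badge and counter added in this commit would cover every unrecognized status.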

@@ -1,4 +1,4 @@
- import React, { useState, useEffect, useCallback } from 'react';
+ import React, { useState, useEffect, useCallback, useRef } from 'react';
import { useDropzone } from 'react-dropzone';
import {
TableProperties,
@@ -18,7 +18,8 @@ import {
Files,
Trash2,
Eye,
- File
+ File,
+ Plus
} from 'lucide-react';
import { Button } from '@/components/ui/button';
import { Card, CardContent, CardHeader, CardTitle, CardDescription } from '@/components/ui/card';
@@ -60,6 +61,7 @@ const TemplateFill: React.FC = () => {
templateFields, setTemplateFields,
sourceFiles, setSourceFiles, addSourceFiles, removeSourceFile,
sourceFilePaths, setSourceFilePaths,
+ sourceDocIds, setSourceDocIds, addSourceDocId, removeSourceDocId,
templateId, setTemplateId,
filledResult, setFilledResult,
reset
@@ -68,6 +70,10 @@ const TemplateFill: React.FC = () => {
const [loading, setLoading] = useState(false);
const [previewDoc, setPreviewDoc] = useState<{ name: string; content: string } | null>(null);
const [previewOpen, setPreviewOpen] = useState(false);
+ const [sourceMode, setSourceMode] = useState<'upload' | 'select'>('upload');
+ const [uploadedDocuments, setUploadedDocuments] = useState<DocumentItem[]>([]);
+ const [docsLoading, setDocsLoading] = useState(false);
+ const sourceFileInputRef = useRef<HTMLInputElement>(null);
// 模板拖拽
const onTemplateDrop = useCallback((acceptedFiles: File[]) => {
@@ -89,25 +95,77 @@ const TemplateFill: React.FC = () => {
});
// Source document drag-and-drop
- const onSourceDrop = useCallback((acceptedFiles: File[]) => {
- const newFiles = acceptedFiles.map(f => ({
- file: f,
- preview: f.type.startsWith('text/') || f.name.endsWith('.md') ? undefined : undefined
- }));
- addSourceFiles(newFiles);
+ const onSourceDrop = useCallback((e: React.DragEvent) => {
+ e.preventDefault();
+ const files = Array.from(e.dataTransfer.files).filter(f => {
+ const ext = f.name.split('.').pop()?.toLowerCase();
+ return ['xlsx', 'xls', 'docx', 'md', 'txt'].includes(ext || '');
+ });
+ if (files.length > 0) {
+ addSourceFiles(files.map(f => ({ file: f })));
+ }
}, [addSourceFiles]);
- const { getRootProps: getSourceProps, getInputProps: getSourceInputProps, isDragActive: isSourceDragActive } = useDropzone({
- onDrop: onSourceDrop,
- accept: {
- 'application/vnd.openxmlformats-officedocument.spreadsheetml.sheet': ['.xlsx'],
- 'application/vnd.ms-excel': ['.xls'],
- 'application/vnd.openxmlformats-officedocument.wordprocessingml.document': ['.docx'],
- 'text/plain': ['.txt'],
- 'text/markdown': ['.md']
- },
- multiple: true
- });
+ const handleSourceFileSelect = (e: React.ChangeEvent<HTMLInputElement>) => {
+ const files = Array.from(e.target.files || []);
+ if (files.length > 0) {
+ addSourceFiles(files.map(f => ({ file: f })));
+ toast.success(`已添加 ${files.length} 个文件`);
+ }
+ e.target.value = '';
+ };
+ // Only add source documents, without uploading
+ const handleAddSourceFiles = () => {
+ if (sourceFiles.length === 0) {
+ toast.error('请先选择源文档');
+ return;
+ }
+ toast.success(`已添加 ${sourceFiles.length} 个源文档,可继续添加更多`);
+ };
+ // Load already-uploaded documents
+ const loadUploadedDocuments = useCallback(async () => {
+ setDocsLoading(true);
+ try {
+ const result = await backendApi.getDocuments(undefined, 100);
+ if (result.success) {
+ // Keep only document types that can serve as a data source
+ const docs = (result.documents || []).filter((d: DocumentItem) =>
+ ['docx', 'md', 'txt', 'xlsx', 'xls'].includes(d.doc_type)
+ );
+ setUploadedDocuments(docs);
+ }
+ } catch (err: any) {
+ console.error('加载文档失败:', err);
+ } finally {
+ setDocsLoading(false);
+ }
+ }, []);
+ // Delete a document
+ const handleDeleteDocument = async (docId: string, e: React.MouseEvent) => {
+ e.stopPropagation();
+ if (!confirm('确定要删除该文档吗?')) return;
+ try {
+ const result = await backendApi.deleteDocument(docId);
+ if (result.success) {
+ setUploadedDocuments(prev => prev.filter(d => d.doc_id !== docId));
+ removeSourceDocId(docId);
+ toast.success('文档已删除');
+ } else {
+ toast.error(result.message || '删除失败');
+ }
+ } catch (err: any) {
+ toast.error('删除失败: ' + (err.message || '未知错误'));
+ }
+ };
+ useEffect(() => {
+ if (sourceMode === 'select') {
+ loadUploadedDocuments();
+ }
+ }, [sourceMode, loadUploadedDocuments]);
const handleJointUploadAndFill = async () => {
if (!templateFile) {
@@ -115,34 +173,69 @@ const TemplateFill: React.FC = () => {
return;
}
+ // Check whether a data source has been selected
+ if (sourceMode === 'upload' && sourceFiles.length === 0) {
+ toast.error('请上传源文档或从已上传文档中选择');
+ return;
+ }
+ if (sourceMode === 'select' && sourceDocIds.length === 0) {
+ toast.error('请选择源文档');
+ return;
+ }
setLoading(true);
try {
- // Use the joint upload API
- const result = await backendApi.uploadTemplateAndSources(
- templateFile,
- sourceFiles.map(sf => sf.file)
- );
+ if (sourceMode === 'select') {
+ // Use already-uploaded documents as the data source
+ const result = await backendApi.uploadTemplate(templateFile);
- if (result.success) {
- setTemplateFields(result.fields || []);
- setTemplateId(result.template_id);
- setSourceFilePaths(result.source_file_paths || []);
- toast.success('文档上传成功,开始智能填表');
- setStep('filling');
+ if (result.success) {
+ setTemplateFields(result.fields || []);
+ setTemplateId(result.template_id || 'temp');
+ toast.success('开始智能填表');
+ setStep('filling');
- // Start filling automatically
- const fillResult = await backendApi.fillTemplate(
- result.template_id,
- result.fields || [],
- [], // use source_file_paths rather than source_doc_ids
- result.source_file_paths || [],
- '请从以下文档中提取相关信息填写表格'
+ // Fill the template using source_doc_ids
+ const fillResult = await backendApi.fillTemplate(
+ result.template_id || 'temp',
+ result.fields || [],
+ sourceDocIds,
+ [],
+ '请从以下文档中提取相关信息填写表格'
+ );
+ setFilledResult(fillResult);
+ setStep('preview');
+ toast.success('表格填写完成');
+ }
+ } else {
+ // Use the joint upload API
+ const result = await backendApi.uploadTemplateAndSources(
+ templateFile,
+ sourceFiles.map(sf => sf.file)
);
- setFilledResult(fillResult);
- setStep('preview');
- toast.success('表格填写完成');
+ if (result.success) {
+ setTemplateFields(result.fields || []);
+ setTemplateId(result.template_id);
+ setSourceFilePaths(result.source_file_paths || []);
+ toast.success('文档上传成功,开始智能填表');
+ setStep('filling');
+ // Start filling automatically
+ const fillResult = await backendApi.fillTemplate(
+ result.template_id,
+ result.fields || [],
+ [],
+ result.source_file_paths || [],
+ '请从以下文档中提取相关信息填写表格'
+ );
+ setFilledResult(fillResult);
+ setStep('preview');
+ toast.success('表格填写完成');
+ }
}
} catch (err: any) {
toast.error('处理失败: ' + (err.message || '未知错误'));
@@ -264,47 +357,158 @@ const TemplateFill: React.FC = () => {
</CardTitle>
<CardDescription>
</CardDescription>
{/* Source Mode Tabs */}
<div className="flex gap-2 mt-2">
<Button
variant={sourceMode === 'upload' ? 'default' : 'outline'}
size="sm"
onClick={() => setSourceMode('upload')}
>
<Upload size={14} className="mr-1" />
</Button>
<Button
variant={sourceMode === 'select' ? 'default' : 'outline'}
size="sm"
onClick={() => setSourceMode('select')}
>
<Files size={14} className="mr-1" />
</Button>
</div>
</CardHeader>
<CardContent>
<div
{...getSourceProps()}
className={cn(
"border-2 border-dashed rounded-2xl p-8 transition-all duration-300 flex flex-col items-center justify-center text-center cursor-pointer group min-h-[200px]",
isSourceDragActive ? "border-primary bg-primary/5" : "border-muted-foreground/20 hover:border-primary/50 hover:bg-primary/5"
)}
>
<input {...getSourceInputProps()} />
<div className="w-14 h-14 rounded-xl bg-blue-500/10 text-blue-500 flex items-center justify-center mb-4 group-hover:scale-110 transition-transform">
{loading ? <Loader2 className="animate-spin" size={28} /> : <Upload size={28} />}
</div>
<p className="font-medium">
{isSourceDragActive ? '释放以上传' : '点击或拖拽上传源文档'}
</p>
<p className="text-xs text-muted-foreground mt-1">
.xlsx .xls .docx .md .txt
</p>
</div>
{/* Selected Source Files */}
{sourceFiles.length > 0 && (
<div className="mt-4 space-y-2">
{sourceFiles.map((sf, idx) => (
<div key={idx} className="flex items-center gap-3 p-3 bg-muted/50 rounded-xl">
{getFileIcon(sf.file.name)}
<div className="flex-1 min-w-0">
<p className="text-sm font-medium truncate">{sf.file.name}</p>
<p className="text-xs text-muted-foreground">
{(sf.file.size / 1024).toFixed(1)} KB
</p>
{sourceMode === 'upload' ? (
<>
<div className="border-2 border-dashed rounded-2xl p-8 transition-all duration-300 flex flex-col items-center justify-center text-center cursor-pointer group min-h-[200px] border-muted-foreground/20 hover:border-primary/50 hover:bg-primary/5">
<input
id="source-file-input"
type="file"
multiple={true}
accept=".xlsx,.xls,.docx,.md,.txt"
onChange={handleSourceFileSelect}
className="hidden"
/>
<label htmlFor="source-file-input" className="cursor-pointer flex flex-col items-center">
<div className="w-14 h-14 rounded-xl bg-blue-500/10 text-blue-500 flex items-center justify-center mb-4 group-hover:scale-110 transition-transform">
{loading ? <Loader2 className="animate-spin" size={28} /> : <Upload size={28} />}
</div>
<p className="font-medium">
</p>
<p className="text-xs text-muted-foreground mt-1">
.xlsx .xls .docx .md .txt
</p>
</label>
</div>
<div
onDragOver={(e) => { e.preventDefault(); }}
onDrop={onSourceDrop}
className="mt-2 text-center text-xs text-muted-foreground"
>
</div>
{/* Selected Source Files */}
{sourceFiles.length > 0 && (
<div className="mt-4 space-y-2">
{sourceFiles.map((sf, idx) => (
<div key={idx} className="flex items-center gap-3 p-3 bg-muted/50 rounded-xl">
{getFileIcon(sf.file.name)}
<div className="flex-1 min-w-0">
<p className="text-sm font-medium truncate">{sf.file.name}</p>
<p className="text-xs text-muted-foreground">
{(sf.file.size / 1024).toFixed(1)} KB
</p>
</div>
<Button variant="ghost" size="sm" onClick={() => removeSourceFile(idx)}>
<Trash2 size={14} className="text-red-500" />
</Button>
</div>
))}
<div className="flex justify-center pt-2">
<Button variant="outline" size="sm" onClick={() => document.getElementById('source-file-input')?.click()}>
<Plus size={14} className="mr-1" />
</Button>
</div>
<Button variant="ghost" size="sm" onClick={() => removeSourceFile(idx)}>
<Trash2 size={14} className="text-red-500" />
</Button>
</div>
))}
</div>
)}
</>
) : (
<>
{/* Uploaded Documents Selection */}
{docsLoading ? (
<div className="space-y-2">
{[1, 2, 3].map(i => (
<Skeleton key={i} className="h-16 w-full rounded-xl" />
))}
</div>
) : uploadedDocuments.length > 0 ? (
<div className="space-y-2">
{sourceDocIds.length > 0 && (
<div className="flex items-center justify-between p-3 bg-primary/5 rounded-xl border border-primary/20">
<span className="text-sm font-medium"> {sourceDocIds.length} </span>
<Button variant="ghost" size="sm" onClick={() => loadUploadedDocuments()}>
<RefreshCcw size={14} className="mr-1" />
</Button>
</div>
)}
<div className="max-h-[300px] overflow-y-auto space-y-2">
{uploadedDocuments.map((doc) => (
<div
key={doc.doc_id}
className={cn(
"flex items-center gap-3 p-3 rounded-xl border-2 transition-all cursor-pointer",
sourceDocIds.includes(doc.doc_id)
? "border-primary bg-primary/5"
: "border-border hover:bg-muted/30"
)}
onClick={() => {
if (sourceDocIds.includes(doc.doc_id)) {
removeSourceDocId(doc.doc_id);
} else {
addSourceDocId(doc.doc_id);
}
}}
>
<div className={cn(
"w-6 h-6 rounded-md border-2 flex items-center justify-center transition-all shrink-0",
sourceDocIds.includes(doc.doc_id)
? "border-primary bg-primary text-white"
: "border-muted-foreground/30"
)}>
{sourceDocIds.includes(doc.doc_id) && <CheckCircle2 size={14} />}
</div>
{getFileIcon(doc.original_filename)}
<div className="flex-1 min-w-0">
<p className="text-sm font-medium truncate">{doc.original_filename}</p>
<p className="text-xs text-muted-foreground">
{doc.doc_type.toUpperCase()} {format(new Date(doc.created_at), 'yyyy-MM-dd')}
</p>
</div>
<Button
variant="ghost"
size="sm"
onClick={(e) => handleDeleteDocument(doc.doc_id, e)}
className="shrink-0"
>
<Trash2 size={14} className="text-red-500" />
</Button>
</div>
))}
</div>
</div>
) : (
<div className="text-center py-8 text-muted-foreground">
<Files size={32} className="mx-auto mb-2 opacity-30" />
<p className="text-sm"></p>
</div>
)}
</>
)}
</CardContent>
</Card>
@@ -422,6 +626,16 @@ const TemplateFill: React.FC = () => {
<div className="text-muted-foreground text-xs mt-1">
: {detail.source} | : {detail.confidence ? (detail.confidence * 100).toFixed(0) + '%' : 'N/A'}
</div>
{detail.warning && (
<div className="mt-2 p-2 bg-yellow-50 border border-yellow-200 rounded-lg text-yellow-700 text-xs">
{detail.warning}
</div>
)}
{detail.values && detail.values.length > 1 && !detail.warning && (
<div className="mt-2 text-xs text-muted-foreground">
: {detail.values.join(', ')}
</div>
)}
</div>
</div>
))}

@@ -50,18 +50,18 @@
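| `prompt_service.py` | ✅ Done | Prompt template management |
| `text_analysis_service.py` | ✅ Done | Text analysis |
| `chart_generator_service.py` | ✅ Done | Chart generation service |
| `template_fill_service.py` | ✅ Done | Template filling service; supports reading source documents directly to fill tables |
| `template_fill_service.py` | ✅ Done | Template filling service; supports multi-row extraction, direct extraction from structured data, JSON fault tolerance, and Word document table handling |
### 2.2 API Endpoints (`backend/app/api/endpoints/`)
| Endpoint file | Route | Status |
|----------|------|----------|
| `upload.py` | `/api/v1/upload/excel` | ✅ Excel file upload and parsing |
| `upload.py` | `/api/v1/upload/document` | ✅ Document upload and parsing |
| `documents.py` | `/api/v1/documents/*` | ✅ Document management (list, delete, search) |
| `ai_analyze.py` | `/api/v1/analyze/*` | ✅ AI analysis (Excel, Markdown, streaming) |
| `rag.py` | `/api/v1/rag/*` | ⚠️ RAG retrieval (currently returns empty results) |
| `tasks.py` | `/api/v1/tasks/*` | ✅ Asynchronous task status queries |
| `templates.py` | `/api/v1/templates/*` | ✅ Template management (including Word export) |
| `templates.py` | `/api/v1/templates/*` | ✅ Template management (including multi-row export, Word export, and Word structured field parsing) |
| `visualization.py` | `/api/v1/visualization/*` | ✅ Visualization charts |
| `health.py` | `/api/v1/health` | ✅ Health check |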
@@ -70,71 +70,67 @@
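| Page file | Function | Status |
|----------|------|------|
| `Documents.tsx` | Main document management page | ✅ Done |
| `TemplateFill.tsx` | Smart form-filling page | ✅ Done |
| `ExcelParse.tsx` | Excel parsing page | ✅ Done |
### 2.4 Document Parsing Capabilities
| Format | Parsing status | Notes |
|------|----------|------|
| Excel (.xlsx/.xls) | ✅ Done | pandas + XML fallback parsing |
| Excel (.xlsx/.xls) | ✅ Done | pandas + XML fallback parsing; supports multiple sheets |
| Markdown (.md) | ✅ Done | Regex + AI section splitting |
| Word (.docx) | ✅ Done | python-docx parsing; supports table extraction and field recognition |
| Text (.txt) | ✅ Done | chardet encoding detection; supports text cleaning and structured extraction |
---
## 3. Pending Features (Core Gaps)
## 3. Core Feature Implementation Details
### 3.1 Template Filling Module (Top Priority)
**Current status**: ✅ Done
### 3.1 Template Filling Module (✅ Done)
**Core workflow**: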
```
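User uploads a template table (Word/Excel)
Upload the template table (Word/Excel)
Parse the template and extract the fields to fill together with their hints
Read source data from the source document list specified by the template
Read source data by source document ID list (MongoDB or files)
AI extracts information from the source data according to the field hints
Prefer direct extraction from structured data (Excel rows)
Fill the extracted data into the corresponding positions in the template
Fall back to LLM extraction from text when direct extraction is not possible
Return the completed table
Fill the extracted data into the corresponding positions in the original template (preserving the template formatting)
Export the completed table (Excel/Word)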
```
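**Completed implementation**:
- [x] `template_fill_service.py` - core template filling service
- [x] Word template parsing (`docx_parser.py` - parse_tables_for_template, extract_template_fields_from_docx)
- [x] Text template parsing (`txt_parser.py` - done)
- [x] Template field recognition and hint extraction
- [x] Multi-document data aggregation and conflict handling
- [x] Export of results as Word/Excel
**Key features**:
- **Original template filling**: opens the original template file directly and fills data into its existing tables/cells
- **Multi-row data support**: each field can yield multiple values; rows are added automatically on export
- **Structured data first**: extracts directly from Excel rows, with no LLM call
- **JSON fault tolerance**: handles corrupted or truncated JSON returned by the LLM
- **Markdown cleanup**: automatically strips markdown formatting from LLM output
### 3.2 Word Document Parsing
**Current status**: ✅ Done
### 3.2 Word Document Parsing (✅ Done)
**Implemented features**:
- [x] `docx_parser.py` - Word document parser
- [x] Extract paragraph text
- [x] Extract table content
- [x] Extract key information (headings, lists, etc.)
- [x] Table template field extraction (`parse_tables_for_template`, `extract_template_fields_from_docx`)
- [x] Field type inference (`_infer_field_type_from_hint`)
- `docx_parser.py` - Word document parser
- Extracts paragraph text
- Extracts table content (supports the competition table format: field name | hint | filled value)
- `parse_tables_for_template()` - parses table templates and extracts fields
- `extract_template_fields_from_docx()` - extracts template field definitions
- `_infer_field_type_from_hint()` - infers the field type from the hint
- **API endpoint**: `/api/v1/templates/parse-word-structure` - upload a Word document, extract structured fields, and store them in MongoDB
- **API endpoint**: `/api/v1/templates/word-fields/{doc_id}` - get the template field information of a stored document
### 3.3 Text Document Parsing
**Current status**: ✅ Done
### 3.3 Text Document Parsing (✅ Done)
**Implemented features**:
- [x] `txt_parser.py` - text file parser
- [x] Automatic encoding detection (chardet)
- [x] Text cleaning
### 3.4 Document-Template Matching (Framework in Place)
According to the Q&A, the template already specifies its data files, so no algorithmic matching is needed. The upload feature already exists; what remains is to confirm that the logic linking templates to data files is complete.
- `txt_parser.py` - text file parser
- Automatic encoding detection (chardet)
- Text cleaning (removing control characters, normalizing whitespace)
- Structured data extraction (emails, URLs, phone numbers, dates, amounts)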
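The text cleaning and structured extraction steps listed above could look roughly like the sketch below. The function names `clean_text` and `extract_entities` and the regular expressions are assumptions made for illustration; the real `txt_parser.py` may be organized differently and also covers phone numbers, dates, and amounts, which are omitted here.

```python
import re

# Hypothetical patterns; not taken from txt_parser.py.
_CONTROL_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]")
_EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
_URL = re.compile(r"https?://[^\s<>\"']+")


def clean_text(raw: str) -> str:
    """Drop control characters and normalize whitespace."""
    text = _CONTROL_CHARS.sub("", raw)
    # Normalize Windows and old Mac line endings.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Collapse runs of spaces/tabs, then trim each line.
    text = re.sub(r"[ \t]+", " ", text)
    lines = [line.strip() for line in text.split("\n")]
    # Collapse three or more consecutive newlines into a single blank line.
    return re.sub(r"\n{3,}", "\n\n", "\n".join(lines)).strip()


def extract_entities(text: str) -> dict:
    """Pull out emails and URLs with simple regular expressions."""
    return {
        "emails": _EMAIL.findall(text),
        "urls": _URL.findall(text),
    }
```

Cleaning before extraction keeps stray control bytes from splitting a match; the patterns are deliberately loose and would need tightening for production use.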
---
@@ -192,20 +188,20 @@ docs/test/
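## 6. Work Plan (Suggested)
### Priority 1: Core template filling features
- Finish Word document parsing
- Finish the template filling service
- End-to-end test verification
### Priority 1: End-to-end testing
- Run accuracy tests with real test data
- Verify that multi-row data export is correct
- Test that Word template parsing works properly
### Priority 2: Demo packaging and documentation
- Prepare the project demo slides (PPT)
- Record a demo video
- Improve the README deployment documentation
### Priority 3: Testing and optimization
- Run accuracy tests with real test data
### Priority 3: Optimization
- Optimize response time
- Improve error handling
- Add more test cases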
---
@@ -215,29 +211,32 @@ docs/test/
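2. **Database**: database storage is not mandatory and can be skipped
3. **Deployment**: local deployment is sufficient; no public server is needed
4. **Evaluation data**: the preliminary round uses only the data provided so far
5. **RAG**: temporarily disabled for now; this does not affect the core evaluation features
5. **RAG**: temporarily disabled for now; this does not affect the core evaluation features (because files are read directly)
---
*Document version: v1.1*
*Last updated: 2026-04-08*
*Document version: v1.5*
*Last updated: 2026-04-09*
---
## 8. Technical Implementation Details
### 8.1 Template Filling Workflow (Implemented)
### 8.1 Template Filling Workflow
#### Flowchart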
```
┌─────────────────┐     ┌────────────────────┐     ┌──────────────────┐
│ Upload template │ ──► │ Select data source │ ──► │ AI smart filling │
│ Upload template │ ──► │ Select data source │ ──► │ Smart filling    │
└─────────────────┘     └────────────────────┘     └──────────────────┘
┌─────────────────┐
│ Export result   │
└─────────────────┘
┌─────────────────────────┼─────────────────────────┐
│ │
▼ ▼
┌───────────────────────┐ ┌──────────────────────┐ ┌───────────────┐
│ Structured extraction │ │ LLM extraction       │ │ Export result │
│ (reads rows directly) │ │ (text comprehension) │ │ (Excel/Word)  │
└───────────────────────┘ └──────────────────────┘ └───────────────┘
```
#### Core components
@@ -247,8 +246,10 @@ docs/test/
| Template upload | `templates.py` `/templates/upload` | Receives the template file and extracts fields |
| Field extraction | `template_fill_service.py` | Extracts field definitions from Word/Excel tables |
| Document parsing | `docx_parser.py`, `xlsx_parser.py`, `txt_parser.py` | Parses source document content |
| Smart filling | `template_fill_service.py` `fill_template()` | Uses the LLM to extract information from source documents |
| Result export | `templates.py` `/templates/export` | Exports as Excel or Word |
| Smart filling | `template_fill_service.py` `fill_template()` | Structured extraction + LLM extraction |
| Multi-row support | `template_fill_service.py` `FillResult` | Supports a `values` array |
| JSON fault tolerance | `template_fill_service.py` `_fix_json()` | Repairs corrupted JSON |
| Result export | `templates.py` `/templates/export` | Multi-row Excel + Word export |
### 8.2 Source Document Loading
@@ -268,7 +269,9 @@ docs/test/
```python
# Extract table template fields
fields = docx_parser.extract_template_fields_from_docx(file_path)
from docx_parser import DocxParser
parser = DocxParser()
fields = parser.extract_template_fields_from_docx(file_path)
# Return format
# [
@@ -295,6 +298,24 @@ fields = docx_parser.extract_template_fields_from_docx(file_path)
### 8.5 API Endpoints
#### POST `/api/v1/templates/upload`
Uploads a template file and extracts its field definitions.
**Response**:
```json
{
"success": true,
"template_id": "/path/to/saved/template.docx",
"filename": "template.docx",
"file_type": "docx",
"fields": [
{"cell": "A1", "name": "Name", "field_type": "text", "required": true, "hint": "Extract the person's name"}
],
"field_count": 1
}
```
#### POST `/api/v1/templates/fill`
Fill request:
@@ -306,35 +327,232 @@ fields = docx_parser.extract_template_fields_from_docx(file_path)
],
"source_doc_ids": ["mongodb_doc_id_1", "mongodb_doc_id_2"],
"source_file_paths": [],
"user_hint": "Please extract from the contract document"
"user_hint": "Please extract from document xxx"
}
```
Response
**Response (with multi-row support)**:
```json
{
"success": true,
"filled_data": {"Name": "Zhang San"},
"filled_data": {
"Name": ["Zhang San", "Li Si", "Wang Wu"],
"Age": ["25", "30", "28"]
},
"fill_details": [
{
"field": "Name",
"cell": "A1",
"values": ["Zhang San", "Li Si", "Wang Wu"],
"value": "Zhang San",
"source": "From: contract.docx",
"confidence": 0.95
"source": "Direct extraction from structured data",
"confidence": 1.0
}
],
"source_doc_count": 2
"source_doc_count": 2,
"max_rows": 3
}
```
#### POST `/api/v1/templates/export`
Export request:
Export request (creates a new file):
```json
{
"template_id": "template ID",
"filled_data": {"Name": "Zhang San", "Amount": "10000"},
"format": "xlsx" // or "docx"
"filled_data": {"Name": ["Zhang San", "Li Si"], "Amount": ["10000", "20000"]},
"format": "xlsx"
}
```
```
#### POST `/api/v1/templates/fill-and-export`
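**Fill the original template and export** (recommended for the competition)
Opens the original template file directly, fills the data into the template's tables/cells, and exports the result. **The original template formatting is preserved.**
**Request**:
```json
{
"template_path": "/path/to/original/template.docx",
"filled_data": {
"Name": ["Zhang San", "Li Si", "Wang Wu"],
"Age": ["25", "30", "28"]
},
"format": "docx"
}
```
**Response**: the filled Word/Excel file (file stream)
**Characteristics**:
- Opens the original template file
- Maps field names to column indices using the header row
- Fills data into the cells of the matching columns
- Automatically adds table rows for multi-row data
- Preserves the original template's formatting and styles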
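As a rough illustration of the header matching and row extension described above, the sketch below works on a table represented as a plain list of rows rather than a real Word or Excel object. The function name `fill_table` and the exact-match rule on header cells are assumptions; the actual service operates on the template file itself and may match headers more loosely.

```python
from typing import Dict, List


def fill_table(table: List[List[str]], filled_data: Dict[str, List[str]]) -> List[List[str]]:
    """Fill `table` in place; row 0 is assumed to be the header row."""
    header = [cell.strip() for cell in table[0]]
    for field, values in filled_data.items():
        if field not in header:
            continue  # no matching column: leave the table untouched
        col = header.index(field)
        for i, value in enumerate(values):
            row_idx = i + 1  # data rows start right below the header
            # Append blank rows until the target row exists.
            while len(table) <= row_idx:
                table.append([""] * len(header))
            table[row_idx][col] = value
    return table
```

Growing the table on demand means a template with one empty data row and a field with three values ends up with three data rows, while columns with fewer values keep empty cells.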
#### POST `/api/v1/templates/parse-word-structure`
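**Upload a Word document and extract structured fields** (competition-specific)
Uploads a Word document, extracts field definitions (field name, hint, field type) from the table template, and stores them in MongoDB.
**Request**: multipart/form-data
- file: the Word file
**Response**:
```json
{
"success": true,
"doc_id": "mongodb_doc_id",
"filename": "template.docx",
"file_path": "/path/to/saved/template.docx",
"field_count": 5,
"fields": [
{
"cell": "T0R1",
"name": "field name",
"hint": "hint text",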
"field_type": "text",
"required": true
}
],
"tables": [...],
"metadata": {
"paragraph_count": 10,
"table_count": 1,
"word_count": 500,
"has_tables": true
}
}
```
#### GET `/api/v1/templates/word-fields/{doc_id}`
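**Get template field information for a Word document**
Returns the template field information of a previously uploaded Word document, looked up by doc_id.
**Response**:
```json
{
"success": true,
"doc_id": "mongodb_doc_id",
"filename": "template.docx",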
"fields": [...],
"tables": [...],
"field_count": 5,
"metadata": {...}
}
```
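### 8.6 Multi-row Data Handling
**FillResult data structure**:
```python
from dataclasses import dataclass
from typing import Any, List

@dataclass
class FillResult:
    field: str
    values: List[Any] = None  # supports multiple values (list)
    value: Any = ""  # kept for compatibility (first value)
    source: str = ""  # source document
    confidence: float = 1.0  # confidence score
```
**Export logic**:
- Compute the maximum row count across all fields
- Iterate over each row, taking the value at the corresponding index
- Pad missing rows with empty strings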
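The three export steps above can be sketched as follows. `to_rows` is a hypothetical helper name, not a function from the repository, and it assumes `filled_data` maps each field name to a list of values as in the `/fill` response.

```python
from typing import Any, Dict, List


def to_rows(filled_data: Dict[str, List[Any]]) -> List[List[Any]]:
    """Turn per-field value lists into rows, padding short fields with ""."""
    # Step 1: the longest field determines the number of rows.
    max_rows = max((len(v) for v in filled_data.values()), default=0)
    rows = []
    for i in range(max_rows):
        # Steps 2 and 3: take the i-th value of each field, or "" if it has none.
        rows.append([v[i] if i < len(v) else "" for v in filled_data.values()])
    return rows
```

Column order follows the insertion order of `filled_data`, so a writer such as openpyxl can emit the keys as the header row and then append each row in turn.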
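### 8.7 JSON Fault Tolerance
When the JSON returned by the LLM is corrupted or truncated, the system:
1. Strips markdown code fence markers (```json, ```)
2. Tries to pair up brackets to find the complete JSON
3. Removes trailing commas
4. Uses a regular expression to extract the `values` array
5. As a fallback, extracts every quoted string directly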
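A minimal sketch of these five steps is shown below. It is not the repository's `_fix_json()`: the name `parse_llm_values`, the regular expressions, and the choice to return only the `values` list are assumptions, and the brace matching ignores braces that appear inside string literals.

```python
import json
import re
from typing import Any, List

FENCE = "`" * 3  # markdown code fence marker


def parse_llm_values(raw: str) -> List[Any]:
    """Best-effort recovery of a `values` list from possibly broken LLM output."""
    # 1. Strip markdown code fence markers, with or without a "json" tag.
    text = re.sub(re.escape(FENCE) + r"(?:json)?", "", raw).strip()
    # 2. Keep the span from the first "{" to its matching "}" if one exists.
    start = text.find("{")
    if start != -1:
        depth = 0
        for i, ch in enumerate(text[start:], start):
            depth += (ch == "{") - (ch == "}")
            if depth == 0:
                text = text[start:i + 1]
                break
    # 3. Remove trailing commas before a closing bracket or brace.
    text = re.sub(r",\s*([\]}])", r"\1", text)
    try:
        data = json.loads(text)
        if isinstance(data, dict) and isinstance(data.get("values"), list):
            return data["values"]
    except json.JSONDecodeError:
        pass
    # 4. Grab the contents of the values array, even if it is unterminated.
    match = re.search(r'"values"\s*:\s*\[([^\]]*)', text)
    body = match.group(1) if match else text
    # 5. Fallback: collect every complete double-quoted string.
    return re.findall(r'"((?:[^"\\]|\\.)*)"', body)
```

With truncated output, the last value is dropped when its closing quote is missing, which is usually safer than keeping a partial string.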
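### 8.8 Structured-Data-First Extraction
For documents with a `rows` structure, such as Excel, the system:
1. Looks for the matching column directly in `structured_data.rows`
2. Uses fuzzy matching (the field name contains, or is contained in, the column name)
3. Extracts every row value from that column
4. Skips the LLM call entirely, which is faster and more accurate
```python
# Internal logic
if structured.get("rows"):
    rows = structured["rows"]
    columns = structured.get("columns", [])
    values = _extract_column_values(rows, columns, field_name)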
```
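One way the column lookup could work is sketched below. This is an assumption about what `_extract_column_values` does, not its actual source: the sketch is named `extract_column_values`, compares names case-insensitively, accepts rows that are either lists aligned with `columns` or dicts keyed by column name, and returns the first matching column only.

```python
from typing import Any, List, Sequence


def extract_column_values(rows: Sequence[Any], columns: List[str], field_name: str) -> List[Any]:
    """Return every value of the first column whose name fuzzily matches `field_name`."""
    target = field_name.strip().lower()
    if not target:
        return []
    for idx, col in enumerate(columns):
        name = str(col).strip().lower()
        # Fuzzy match: one name contains the other.
        if name and (target in name or name in target):
            values = []
            for row in rows:
                if isinstance(row, dict):
                    values.append(row.get(col, ""))
                elif idx < len(row):
                    values.append(row[idx])
            return values
    return []  # no matching column: the caller can fall back to the LLM
```

Containment matching is permissive, so a short field name can match an unrelated longer column; a stricter rule or a similarity score may be needed for real data.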
---
## 9. Dependencies
### Python dependencies
```
# Must be included in requirements.txt
fastapi>=0.104.0
uvicorn>=0.24.0
motor>=3.3.0 # MongoDB async driver
sqlalchemy>=2.0.0 # MySQL ORM
pandas>=2.0.0 # Excel processing
openpyxl>=3.1.0 # Excel writing
python-docx>=0.8.0 # Word processing
chardet>=4.0.0 # encoding detection
httpx>=0.25.0 # HTTP client
```
### Frontend dependencies
```
# Must be included in package.json
react>=18.0.0
react-dropzone>=14.0.0
lucide-react>=0.300.0
sonner>=1.0.0 # toast notifications
```
---
## 10. Startup Instructions
### Starting the backend
```bash
cd backend
.\venv\Scripts\Activate.ps1 # or Activate.bat
pip install -r requirements.txt # make sure all dependencies are installed
.\venv\Scripts\python.exe -m uvicorn app.main:app --host 127.0.0.1 --port 8000 --reload
```
### Starting the frontend
```bash
cd frontend
npm install
npm run dev
```
### Environment variables
Configure the following in `backend/.env`:
```
MONGODB_URL=mongodb://localhost:27017
MONGODB_DB_NAME=document_system
MYSQL_HOST=localhost
MYSQL_PORT=3306
MYSQL_USER=root
MYSQL_PASSWORD=your_password
MYSQL_DATABASE=document_system
LLM_API_KEY=your_api_key
LLM_BASE_URL=https://api.minimax.chat
LLM_MODEL_NAME=MiniMax-Text-01
```