Compare commits

19 Commits

Author SHA1 Message Date
dj
51350e3002 123 2026-04-14 17:35:40 +08:00
dj
8e713be1ca Merge remote changes with RAG service optimization
- Keep user's RAG service integration for faster extraction
- Add remote's word_ai_service support
- Preserve user's parallel extraction and field header optimizations

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-14 17:25:13 +08:00
zzz
f2af27245d Enhance AI parsing of Word documents and template filling 2026-04-14 17:16:38 +08:00
dj
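a9dc0d8b91 Optimize smart form filling: improve speed and data extraction accuracy
Backend optimizations (template_fill_service.py):

1. Speed:
   - Extract fields in parallel with asyncio.gather
   - Skip the AI review step to reduce the number of LLM calls
   - Add the _extract_single_field_fast method

2. Data extraction:
   - Integrate the RAG service for intelligent content retrieval
   - Fix empty-column skipping in Markdown table column matching
   - Fix year sub-header rows being misidentified

3. AI header generation:
   - Trim to 5-7 representative fields (previously 8-15)
   - Filter out non-data fields (source, 备注 (remarks), 说明 (notes), etc.)
   - Simplify field names, e.g. "医院数量" (number of hospitals) instead of "医院-公立医院数量" (hospitals - number of public hospitals)

4. AI data extraction prompt:
   - Extract strictly by header and return only relevant data
   - Every value must carry an annotation (year/region/category)
   - Support multiple annotation types: 2024年 (year 2024), 北京 (Beijing), 某省 (a province), 公立医院 (public hospitals), 三级医院 (tertiary hospitals), etc.
   - Preserve the original values, units, and percent-sign formatting
   - Do not return long source explanations

5. New warning field on FillResult:
   - Multi-value detection notice, e.g. "检测到 2 个值" ("2 values detected")

Frontend optimizations (TemplateFill.tsx):
- Show multi-value warnings in the fill details (yellow notice box)
- Show all values directly when multiple values are present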
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-14 17:14:59 +08:00
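Note on the asyncio.gather item in the commit above: the body of _extract_single_field_fast is not part of this page, so the following is only a minimal sketch of parallel per-field extraction. The names extract_single_field and extract_all_fields are hypothetical, and the sleep stands in for an LLM or RAG call.

```python
import asyncio


async def extract_single_field(field: str) -> str:
    """Hypothetical stand-in for one LLM or RAG lookup per field."""
    await asyncio.sleep(0.01)  # simulates network latency
    if not field:
        raise ValueError("empty field name")
    return f"value of {field}"


async def extract_all_fields(fields: list) -> dict:
    """Run every field extraction concurrently and collect results by field."""
    outcomes = await asyncio.gather(
        *(extract_single_field(f) for f in fields),
        # A failed field comes back as an exception object instead of propagating.
        return_exceptions=True,
    )
    # gather returns results in the same order as the awaitables passed in.
    return {
        field: (None if isinstance(outcome, Exception) else outcome)
        for field, outcome in zip(fields, outcomes)
    }
```

With N fields the wall-clock time is close to that of the slowest single call rather than the sum of all calls, which is presumably the speed gain the commit refers to.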
tl
902c28166b tl 2026-04-14 15:18:50 +08:00
tl
4a53be7eeb TL 2026-04-14 14:58:14 +08:00
tl
8b5b24fa2a Merge branch 'main' of https://gitea.kronecker.cc/OurCodesAreAllRight/FilesReadSystem 2026-04-14 14:57:53 +08:00
tl
ed66aa346d tl 2026-04-10 10:24:52 +08:00
zzz
5b82d40be0 Merge branch 'main' of https://gitea.kronecker.cc/OurCodesAreAllRight/FilesReadSystem 2026-04-10 10:10:41 +08:00
zzz
bedf1af9c0 Enhance AI parsing of Word documents and template filling 2026-04-10 09:48:57 +08:00
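5fca4eb094 Add exception handling for temporary file cleanup and change the outline endpoint to POST
- Add try/except blocks to analyze_markdown, analyze_markdown_stream, and get_markdown_outline
  to handle exceptions raised during temporary file cleanup
- Change the /analyze/md/outline endpoint from GET to POST to support file uploads
- Ensure temporary files are cleaned up in all cases, and log cleanup failures

refactor(health): improve health checks to verify actual database connections

- MySQL health check now runs a SELECT 1 query to verify the connection
- MongoDB health check now runs the ping command to verify the connection
- Redis health check now runs the ping command to verify the connection
- Catch exceptions and log the specific errors

refactor(upload): use os.path.basename to extract file names

- Replace manual string splitting with os.path.basename to get the file name
- Unify file name handling across Excel upload and export

feat(instruction): add an instruction execution framework module

- Create the instruction package containing the base architecture for intent parsing and instruction execution
- Add the IntentParser and InstructionExecutor abstract base classes
- Provide default implementations marked as unfinished, in preparation for future features

refactor(frontend): adjust the AuthContext import path and remove the duplicate file

- Move AuthContext from src/context to src/contexts
- Update the import paths in App.tsx and RouteGuard.tsx
- Remove the old AuthContext.tsx file

fix(backend-api): fix the wrong HTTP method in the AI analysis API

- Change the fetch request method in aiApi from GET to POST to support file uploads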
2026-04-10 01:51:53 +08:00
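Note on the temporary file cleanup item in the commit above: the pattern can be factored into one helper, sketched below under some assumptions. The names safe_unlink and process_upload are hypothetical, and the sketch catches OSError where the diff catches Exception.

```python
import logging
import os
import tempfile

logger = logging.getLogger(__name__)


def safe_unlink(path) -> bool:
    """Delete path if it exists. Log instead of raising; return True if removed."""
    try:
        if path and os.path.exists(path):
            os.unlink(path)
            return True
    except OSError as cleanup_error:
        logger.warning("temp file cleanup failed: %s, error: %s", path, cleanup_error)
    return False


def process_upload(content: bytes) -> int:
    """Write content to a temp file, use it, and always attempt cleanup."""
    tmp_path = None  # stays None if creating the temp file itself fails
    try:
        with tempfile.NamedTemporaryFile(mode="wb", suffix=".md", delete=False) as tmp:
            tmp.write(content)
            tmp_path = tmp.name
        return os.path.getsize(tmp_path)  # placeholder for the real analysis step
    finally:
        safe_unlink(tmp_path)
```

A shared helper like this would also cover the analyze_txt and analyze_word handlers added in the same diff, which still call os.unlink without a guard.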
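0dbf74db9d Add task ID tracking to the template fill endpoint
- Add an optional task_id field to FillRequest for task history tracking
- Implement task state management, including creation, updates, and error handling
- Integrate MongoDB task records and update progress during processing
- Add task progress update logic covering the started, processing, success, and failure states
- Modify the template fill service to accept and pass through the task_id parameter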
2026-04-10 01:27:26 +08:00
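858b594171 Add a dual-write mechanism for task status and a task history feature
- Implement a dual-write mechanism that writes task status to both Redis and MongoDB
- Add a MongoDB task collection and its CRUD operations
- Add task history query, list, and delete features
- Refactor task status updates to go through a single update_task_status function
- Add AI review of field values to the template fill service
- Improve the display and interaction of the frontend task history page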
2026-04-10 01:15:53 +08:00
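Note on the dual-write mechanism in the commit above: the idea is that each store is written independently, so an outage of one does not block the other. The sketch below assumes a hypothetical InMemoryStore in place of the project's redis_db and mongodb wrappers, whose real interfaces are only partly visible in this diff.

```python
import asyncio
import logging

logger = logging.getLogger(__name__)


class InMemoryStore:
    """Hypothetical stand-in for a Redis or MongoDB task store."""

    def __init__(self, available: bool = True):
        self.available = available
        self.data = {}

    async def save(self, task_id: str, record: dict) -> None:
        if not self.available:
            raise ConnectionError("store unavailable")
        self.data[task_id] = record


async def update_task_status(stores, task_id, status, progress=0, message=""):
    """Write the same status record to every store; return how many writes succeeded."""
    record = {"status": status, "progress": progress, "message": message}
    written = 0
    for store in stores:
        try:
            await store.save(task_id, dict(record))
            written += 1
        except Exception as exc:  # a failing store must not abort the others
            logger.warning("task status write failed: %s", exc)
    return written
```

Returning the number of successful writes is an addition of this sketch; it lets a caller notice when every store is down, which the logged-and-ignored approach in the diff does not surface.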
ed0f51f2a4 Merge branch 'main' of https://gitea.kronecker.cc/OurCodesAreAllRight/FilesReadSystem 2026-04-10 00:26:57 +08:00
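ecc0c79475 Enhance the template fill service with table content summaries and header regeneration
- Add table content summarization during source document parsing, extracting table structure for the AI to understand
- Add table summary logic that extracts and formats the header and the first 3 rows of data
- Add template file type detection, supporting xlsx and docx
- Implement automatic header regeneration based on source document content
- When auto-generated headers are detected, regenerate more accurate fields from the source document content
- Add detailed debug logging to trace table processing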
2026-04-10 00:26:54 +08:00
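Note on the table summary item in the commit above: the summary keeps the header and the first 3 rows of at most 5 tables. The standalone sketch below mirrors that logic; the function name summarize_tables and the English labels are illustrative choices, and the table shape (a dict with "headers" and "rows") is taken from the diff further down this page.

```python
def summarize_tables(tables, max_tables=5, max_rows=3):
    """Render headers and the first rows of each table as short text lines."""
    lines = []
    for idx, table in enumerate(tables[:max_tables], start=1):
        if not isinstance(table, dict):
            continue  # skip tables that are not in the expected dict shape
        headers = table.get("headers", [])
        rows = table.get("rows", [])
        if headers:
            lines.append(f"Table {idx} headers: " + ", ".join(str(h) for h in headers))
        if rows:
            rendered = []
            for row in rows[:max_rows]:
                if isinstance(row, list):
                    rendered.append(" | ".join(str(c) for c in row))
                elif isinstance(row, dict):
                    # dict rows are read in header order; missing keys become ""
                    rendered.append(" | ".join(str(row.get(h, "")) for h in headers))
            lines.append(f"Table {idx} first rows: " + "; ".join(rendered))
    return "\n".join(lines)
```

Capping both the number of tables and the number of rows keeps the prompt small while still showing the model the column layout.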
dj
6befc510d8 Debug refresh 2026-04-10 00:23:23 +08:00
dj
8f66c235fa Implement parallel multi-file upload, show the uploaded files in a list, and support repeated uploads 2026-04-10 00:16:28 +08:00
886d5ae0cc Merge branch 'main' of https://gitea.kronecker.cc/OurCodesAreAllRight/FilesReadSystem 2026-04-09 22:44:01 +08:00
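6752c5c231 Optimize the joint template upload logic to support source document content parsing
- Remove the template field extraction step; save the template file directly instead
- Add source document parsing that extracts document content, titles, and table counts
- Modify the template fill service to accept source document content for AI header generation
- Update AI header generation to produce suitable header fields based on source document content
- Improve logging to show the number of source documents and processing progress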
2026-04-09 22:43:51 +08:00
31 changed files with 3663 additions and 2405 deletions

@@ -29,9 +29,14 @@ REDIS_URL="redis://localhost:6379/0"
# ==================== LLM AI 配置 ==================== # ==================== LLM AI 配置 ====================
# 大语言模型 API 配置 # 大语言模型 API 配置
LLM_API_KEY="your_api_key_here" # 支持 OpenAI 兼容格式 (DeepSeek, 智谱 GLM, 阿里等)
LLM_BASE_URL="" # 智谱 AI (Zhipu AI) GLM 系列:
LLM_MODEL_NAME="" # - 模型: glm-4-flash (快速文本模型), glm-4 (标准), glm-4-plus (高性能)
# - API: https://open.bigmodel.cn
# - API Key: https://open.bigmodel.cn/usercenter/apikeys
LLM_API_KEY="ca79ad9f96524cd5afc3e43ca97f347d.cpiLLx2oyitGvTeU"
LLM_BASE_URL="https://open.bigmodel.cn/api/paas/v4"
LLM_MODEL_NAME="glm-4v-plus"
# ==================== Supabase 配置 ==================== # ==================== Supabase 配置 ====================
# Supabase 项目配置 # Supabase 项目配置

38
backend/=3.0.0 Normal file

@@ -0,0 +1,38 @@
Requirement already satisfied: sentence-transformers in c:\python312\lib\site-packages (2.2.2)
Requirement already satisfied: transformers<5.0.0,>=4.6.0 in c:\python312\lib\site-packages (from sentence-transformers) (4.57.6)
Requirement already satisfied: tqdm in c:\python312\lib\site-packages (from sentence-transformers) (4.66.1)
Requirement already satisfied: torch>=1.6.0 in c:\python312\lib\site-packages (from sentence-transformers) (2.10.0)
Requirement already satisfied: torchvision in c:\python312\lib\site-packages (from sentence-transformers) (0.25.0)
Requirement already satisfied: numpy in c:\python312\lib\site-packages (from sentence-transformers) (1.26.2)
Requirement already satisfied: scikit-learn in c:\python312\lib\site-packages (from sentence-transformers) (1.8.0)
Requirement already satisfied: scipy in c:\python312\lib\site-packages (from sentence-transformers) (1.16.3)
Requirement already satisfied: nltk in c:\python312\lib\site-packages (from sentence-transformers) (3.9.3)
Requirement already satisfied: sentencepiece in c:\python312\lib\site-packages (from sentence-transformers) (0.2.1)
Requirement already satisfied: huggingface-hub>=0.4.0 in c:\python312\lib\site-packages (from sentence-transformers) (0.36.2)
Requirement already satisfied: filelock in c:\python312\lib\site-packages (from huggingface-hub>=0.4.0->sentence-transformers) (3.25.2)
Requirement already satisfied: fsspec>=2023.5.0 in c:\python312\lib\site-packages (from huggingface-hub>=0.4.0->sentence-transformers) (2026.2.0)
Requirement already satisfied: packaging>=20.9 in c:\python312\lib\site-packages (from huggingface-hub>=0.4.0->sentence-transformers) (23.2)
Requirement already satisfied: pyyaml>=5.1 in c:\python312\lib\site-packages (from huggingface-hub>=0.4.0->sentence-transformers) (6.0.1)
Requirement already satisfied: requests in c:\python312\lib\site-packages (from huggingface-hub>=0.4.0->sentence-transformers) (2.31.0)
Requirement already satisfied: typing-extensions>=3.7.4.3 in c:\python312\lib\site-packages (from huggingface-hub>=0.4.0->sentence-transformers) (4.15.0)
Requirement already satisfied: sympy>=1.13.3 in c:\python312\lib\site-packages (from torch>=1.6.0->sentence-transformers) (1.14.0)
Requirement already satisfied: networkx>=2.5.1 in c:\python312\lib\site-packages (from torch>=1.6.0->sentence-transformers) (3.6.1)
Requirement already satisfied: jinja2 in c:\python312\lib\site-packages (from torch>=1.6.0->sentence-transformers) (3.1.6)
Requirement already satisfied: setuptools in c:\python312\lib\site-packages (from torch>=1.6.0->sentence-transformers) (82.0.1)
Requirement already satisfied: colorama in c:\python312\lib\site-packages (from tqdm->sentence-transformers) (0.4.6)
Requirement already satisfied: regex!=2019.12.17 in c:\python312\lib\site-packages (from transformers<5.0.0,>=4.6.0->sentence-transformers) (2026.2.28)
Requirement already satisfied: tokenizers<=0.23.0,>=0.22.0 in c:\python312\lib\site-packages (from transformers<5.0.0,>=4.6.0->sentence-transformers) (0.22.2)
Requirement already satisfied: safetensors>=0.4.3 in c:\python312\lib\site-packages (from transformers<5.0.0,>=4.6.0->sentence-transformers) (0.7.0)
Requirement already satisfied: click in c:\python312\lib\site-packages (from nltk->sentence-transformers) (8.3.1)
Requirement already satisfied: joblib in c:\python312\lib\site-packages (from nltk->sentence-transformers) (1.5.3)
Requirement already satisfied: threadpoolctl>=3.2.0 in c:\python312\lib\site-packages (from scikit-learn->sentence-transformers) (3.6.0)
Requirement already satisfied: pillow!=8.3.*,>=5.3.0 in c:\python312\lib\site-packages (from torchvision->sentence-transformers) (12.1.1)
Requirement already satisfied: mpmath<1.4,>=1.1.0 in c:\python312\lib\site-packages (from sympy>=1.13.3->torch>=1.6.0->sentence-transformers) (1.3.0)
Requirement already satisfied: MarkupSafe>=2.0 in c:\python312\lib\site-packages (from jinja2->torch>=1.6.0->sentence-transformers) (3.0.3)
Requirement already satisfied: charset-normalizer<4,>=2 in c:\python312\lib\site-packages (from requests->huggingface-hub>=0.4.0->sentence-transformers) (3.4.6)
Requirement already satisfied: idna<4,>=2.5 in c:\python312\lib\site-packages (from requests->huggingface-hub>=0.4.0->sentence-transformers) (3.11)
Requirement already satisfied: urllib3<3,>=1.21.1 in c:\python312\lib\site-packages (from requests->huggingface-hub>=0.4.0->sentence-transformers) (2.6.3)
Requirement already satisfied: certifi>=2017.4.17 in c:\python312\lib\site-packages (from requests->huggingface-hub>=0.4.0->sentence-transformers) (2026.2.25)
[notice] A new release of pip is available: 24.2 -> 26.0.1
[notice] To update, run: python.exe -m pip install --upgrade pip

@@ -10,6 +10,8 @@ import os
from app.services.excel_ai_service import excel_ai_service from app.services.excel_ai_service import excel_ai_service
from app.services.markdown_ai_service import markdown_ai_service from app.services.markdown_ai_service import markdown_ai_service
from app.services.template_fill_service import template_fill_service
from app.services.word_ai_service import word_ai_service
logger = logging.getLogger(__name__) logger = logging.getLogger(__name__)
@@ -215,9 +217,12 @@ async def analyze_markdown(
return result return result
finally: finally:
# 清理临时文件 # 清理临时文件,确保在所有情况下都能清理
if os.path.exists(tmp_path): try:
os.unlink(tmp_path) if tmp_path and os.path.exists(tmp_path):
os.unlink(tmp_path)
except Exception as cleanup_error:
logger.warning(f"临时文件清理失败: {tmp_path}, error: {cleanup_error}")
except HTTPException: except HTTPException:
raise raise
@@ -279,8 +284,12 @@ async def analyze_markdown_stream(
) )
finally: finally:
if os.path.exists(tmp_path): # 清理临时文件,确保在所有情况下都能清理
os.unlink(tmp_path) try:
if tmp_path and os.path.exists(tmp_path):
os.unlink(tmp_path)
except Exception as cleanup_error:
logger.warning(f"临时文件清理失败: {tmp_path}, error: {cleanup_error}")
except HTTPException: except HTTPException:
raise raise
@@ -289,7 +298,7 @@ async def analyze_markdown_stream(
raise HTTPException(status_code=500, detail=f"流式分析失败: {str(e)}") raise HTTPException(status_code=500, detail=f"流式分析失败: {str(e)}")
@router.get("/analyze/md/outline") @router.post("/analyze/md/outline")
async def get_markdown_outline( async def get_markdown_outline(
file: UploadFile = File(...) file: UploadFile = File(...)
): ):
@@ -323,9 +332,154 @@ async def get_markdown_outline(
result = await markdown_ai_service.extract_outline(tmp_path) result = await markdown_ai_service.extract_outline(tmp_path)
return result return result
finally: finally:
if os.path.exists(tmp_path): # 清理临时文件,确保在所有情况下都能清理
os.unlink(tmp_path) try:
if tmp_path and os.path.exists(tmp_path):
os.unlink(tmp_path)
except Exception as cleanup_error:
logger.warning(f"临时文件清理失败: {tmp_path}, error: {cleanup_error}")
except Exception as e: except Exception as e:
logger.error(f"获取 Markdown 大纲失败: {str(e)}") logger.error(f"获取 Markdown 大纲失败: {str(e)}")
raise HTTPException(status_code=500, detail=f"获取大纲失败: {str(e)}") raise HTTPException(status_code=500, detail=f"获取大纲失败: {str(e)}")
@router.post("/analyze/txt")
async def analyze_txt(
file: UploadFile = File(...),
):
"""
上传并使用 AI 分析 TXT 文本文件,提取结构化数据
将非结构化文本转换为结构化表格数据,便于后续填表使用
Args:
file: 上传的 TXT 文件
Returns:
dict: 分析结果,包含结构化表格数据
"""
if not file.filename:
raise HTTPException(status_code=400, detail="文件名为空")
file_ext = file.filename.split('.')[-1].lower()
if file_ext not in ['txt', 'text']:
raise HTTPException(
status_code=400,
detail=f"不支持的文件类型: {file_ext},仅支持 .txt"
)
try:
# 读取文件内容
content = await file.read()
# 保存到临时文件
with tempfile.NamedTemporaryFile(mode='wb', suffix='.txt', delete=False) as tmp:
tmp.write(content)
tmp_path = tmp.name
try:
logger.info(f"开始 AI 分析 TXT 文件: {file.filename}")
# 使用 template_fill_service 的 AI 分析方法
result = await template_fill_service.analyze_txt_with_ai(
content=content.decode('utf-8', errors='replace'),
filename=file.filename
)
if result:
logger.info(f"TXT AI 分析成功: {file.filename}")
return {
"success": True,
"filename": file.filename,
"structured_data": result
}
else:
logger.warning(f"TXT AI 分析返回空结果: {file.filename}")
return {
"success": False,
"filename": file.filename,
"error": "AI 分析未能提取到结构化数据",
"structured_data": None
}
finally:
# 清理临时文件
if os.path.exists(tmp_path):
os.unlink(tmp_path)
except HTTPException:
raise
except Exception as e:
logger.error(f"TXT AI 分析过程中出错: {str(e)}")
raise HTTPException(status_code=500, detail=f"分析失败: {str(e)}")
# ==================== Word 文档 AI 解析 ====================
@router.post("/analyze/word")
async def analyze_word(
file: UploadFile = File(...),
user_hint: str = Query("", description="用户提示词,如'请提取表格数据'")
):
"""
使用 AI 解析 Word 文档,提取结构化数据
适用于从非结构化的 Word 文档中提取表格数据、键值对等信息
Args:
file: 上传的 Word 文件
user_hint: 用户提示词
Returns:
dict: 包含结构化数据的解析结果
"""
if not file.filename:
raise HTTPException(status_code=400, detail="文件名为空")
file_ext = file.filename.split('.')[-1].lower()
if file_ext not in ['docx']:
raise HTTPException(
status_code=400,
detail=f"不支持的文件类型: {file_ext},仅支持 .docx"
)
try:
# 保存上传的文件
content = await file.read()
suffix = f".{file_ext}"
with tempfile.NamedTemporaryFile(delete=False, suffix=suffix) as tmp:
tmp.write(content)
tmp_path = tmp.name
try:
# 使用 AI 解析 Word 文档
result = await word_ai_service.parse_word_with_ai(
file_path=tmp_path,
user_hint=user_hint or "请提取文档中的所有结构化数据,包括表格、键值对等"
)
if result.get("success"):
return {
"success": True,
"filename": file.filename,
"result": result
}
else:
return {
"success": False,
"filename": file.filename,
"error": result.get("error", "AI 解析失败"),
"result": None
}
finally:
# 清理临时文件
if os.path.exists(tmp_path):
os.unlink(tmp_path)
except HTTPException:
raise
except Exception as e:
logger.error(f"Word AI 分析过程中出错: {str(e)}")
raise HTTPException(status_code=500, detail=f"分析失败: {str(e)}")

@@ -23,6 +23,52 @@ logger = logging.getLogger(__name__)
router = APIRouter(prefix="/upload", tags=["文档上传"]) router = APIRouter(prefix="/upload", tags=["文档上传"])
# ==================== 辅助函数 ====================
async def update_task_status(
task_id: str,
status: str,
progress: int = 0,
message: str = "",
result: dict = None,
error: str = None
):
"""
更新任务状态,同时写入 Redis 和 MongoDB
Args:
task_id: 任务ID
status: 状态
progress: 进度
message: 消息
result: 结果
error: 错误信息
"""
meta = {"progress": progress, "message": message}
if result:
meta["result"] = result
if error:
meta["error"] = error
# 尝试写入 Redis
try:
await redis_db.set_task_status(task_id, status, meta)
except Exception as e:
logger.warning(f"Redis 任务状态更新失败: {e}")
# 尝试写入 MongoDB作为备用
try:
await mongodb.update_task(
task_id=task_id,
status=status,
message=message,
result=result,
error=error
)
except Exception as e:
logger.warning(f"MongoDB 任务状态更新失败: {e}")
# ==================== 请求/响应模型 ==================== # ==================== 请求/响应模型 ====================
class UploadResponse(BaseModel): class UploadResponse(BaseModel):
@@ -77,6 +123,17 @@ async def upload_document(
task_id = str(uuid.uuid4()) task_id = str(uuid.uuid4())
try: try:
# 保存任务记录到 MongoDB如果 Redis 不可用时仍能查询)
try:
await mongodb.insert_task(
task_id=task_id,
task_type="document_parse",
status="pending",
message=f"文档 {file.filename} 已提交处理"
)
except Exception as mongo_err:
logger.warning(f"MongoDB 保存任务记录失败: {mongo_err}")
content = await file.read() content = await file.read()
saved_path = file_service.save_uploaded_file( saved_path = file_service.save_uploaded_file(
content, content,
@@ -122,6 +179,17 @@ async def upload_documents(
saved_paths = [] saved_paths = []
try: try:
# 保存任务记录到 MongoDB
try:
await mongodb.insert_task(
task_id=task_id,
task_type="batch_parse",
status="pending",
message=f"已提交 {len(files)} 个文档处理"
)
except Exception as mongo_err:
logger.warning(f"MongoDB 保存批量任务记录失败: {mongo_err}")
for file in files: for file in files:
if not file.filename: if not file.filename:
continue continue
@@ -159,9 +227,9 @@ async def process_document(
"""处理单个文档""" """处理单个文档"""
try: try:
# 状态: 解析中 # 状态: 解析中
await redis_db.set_task_status( await update_task_status(
task_id, status="processing", task_id, status="processing",
meta={"progress": 10, "message": "正在解析文档"} progress=10, message="正在解析文档"
) )
# 解析文档 # 解析文档
@@ -172,9 +240,9 @@ async def process_document(
raise Exception(result.error or "解析失败") raise Exception(result.error or "解析失败")
# 状态: 存储中 # 状态: 存储中
await redis_db.set_task_status( await update_task_status(
task_id, status="processing", task_id, status="processing",
meta={"progress": 30, "message": "正在存储数据"} progress=30, message="正在存储数据"
) )
# 存储到 MongoDB # 存储到 MongoDB
@@ -191,9 +259,9 @@ async def process_document(
# 如果是 Excel存储到 MySQL + AI生成描述 + RAG索引 # 如果是 Excel存储到 MySQL + AI生成描述 + RAG索引
if doc_type in ["xlsx", "xls"]: if doc_type in ["xlsx", "xls"]:
await redis_db.set_task_status( await update_task_status(
task_id, status="processing", task_id, status="processing",
meta={"progress": 50, "message": "正在存储到MySQL并生成字段描述"} progress=50, message="正在存储到MySQL并生成字段描述"
) )
try: try:
@@ -215,9 +283,9 @@ async def process_document(
else: else:
# 非结构化文档 # 非结构化文档
await redis_db.set_task_status( await update_task_status(
task_id, status="processing", task_id, status="processing",
meta={"progress": 60, "message": "正在建立索引"} progress=60, message="正在建立索引"
) )
# 如果文档中有表格数据,提取并存储到 MySQL + RAG # 如果文档中有表格数据,提取并存储到 MySQL + RAG
@@ -238,17 +306,13 @@ async def process_document(
await index_document_to_rag(doc_id, original_filename, result, doc_type) await index_document_to_rag(doc_id, original_filename, result, doc_type)
# 完成 # 完成
await redis_db.set_task_status( await update_task_status(
task_id, status="success", task_id, status="success",
meta={ progress=100, message="处理完成",
"progress": 100, result={
"message": "处理完成",
"doc_id": doc_id, "doc_id": doc_id,
"result": { "doc_type": doc_type,
"doc_id": doc_id, "filename": original_filename
"doc_type": doc_type,
"filename": original_filename
}
} }
) )
@@ -256,18 +320,19 @@ async def process_document(
except Exception as e: except Exception as e:
logger.error(f"文档处理失败: {str(e)}") logger.error(f"文档处理失败: {str(e)}")
await redis_db.set_task_status( await update_task_status(
task_id, status="failure", task_id, status="failure",
meta={"error": str(e)} progress=0, message="处理失败",
error=str(e)
) )
async def process_documents_batch(task_id: str, files: List[dict]): async def process_documents_batch(task_id: str, files: List[dict]):
"""批量处理文档""" """批量处理文档"""
try: try:
await redis_db.set_task_status( await update_task_status(
task_id, status="processing", task_id, status="processing",
meta={"progress": 0, "message": "开始批量处理"} progress=0, message="开始批量处理"
) )
results = [] results = []
@@ -318,37 +383,43 @@ async def process_documents_batch(task_id: str, files: List[dict]):
results.append({"filename": file_info["filename"], "success": False, "error": str(e)}) results.append({"filename": file_info["filename"], "success": False, "error": str(e)})
progress = int((i + 1) / len(files) * 100) progress = int((i + 1) / len(files) * 100)
await redis_db.set_task_status( await update_task_status(
task_id, status="processing", task_id, status="processing",
meta={"progress": progress, "message": f"已处理 {i+1}/{len(files)}"} progress=progress, message=f"已处理 {i+1}/{len(files)}"
) )
await redis_db.set_task_status( await update_task_status(
task_id, status="success", task_id, status="success",
meta={"progress": 100, "message": "批量处理完成", "results": results} progress=100, message="批量处理完成",
result={"results": results}
) )
except Exception as e: except Exception as e:
logger.error(f"批量处理失败: {str(e)}") logger.error(f"批量处理失败: {str(e)}")
await redis_db.set_task_status( await update_task_status(
task_id, status="failure", task_id, status="failure",
meta={"error": str(e)} progress=0, message="批量处理失败",
error=str(e)
) )
async def index_document_to_rag(doc_id: str, filename: str, result: ParseResult, doc_type: str): async def index_document_to_rag(doc_id: str, filename: str, result: ParseResult, doc_type: str):
"""将非结构化文档索引到 RAG""" """将非结构化文档索引到 RAG(使用分块索引)"""
try: try:
content = result.data.get("content", "") content = result.data.get("content", "")
if content: if content:
# 将完整内容传递给 RAG 服务自动分块索引
rag_service.index_document_content( rag_service.index_document_content(
doc_id=doc_id, doc_id=doc_id,
content=content[:5000], content=content, # 传递完整内容,由 RAG 服务自动分块
metadata={ metadata={
"filename": filename, "filename": filename,
"doc_type": doc_type "doc_type": doc_type
} },
chunk_size=500, # 每块 500 字符
chunk_overlap=50 # 块之间 50 字符重叠
) )
logger.info(f"RAG 索引完成: {filename}, doc_id={doc_id}")
except Exception as e: except Exception as e:
logger.warning(f"RAG 索引失败: {str(e)}") logger.warning(f"RAG 索引失败: {str(e)}")
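Note on chunk_size=500 and chunk_overlap=50 in the hunk above: the internals of rag_service.index_document_content are not shown on this page, so the following is only an assumed illustration of fixed-size character chunking with overlap. The function chunk_text is hypothetical, and the real service may split on sentences or tokens instead.

```python
def chunk_text(text: str, chunk_size: int = 500, chunk_overlap: int = 50) -> list:
    """Split text into windows of chunk_size characters that overlap by chunk_overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # distance between consecutive chunk starts
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks
```

Under this scheme a 1200-character document yields three chunks starting at offsets 0, 450, and 900, whereas the previous content[:5000] truncation dropped everything past the first 5000 characters.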

@@ -19,26 +19,43 @@ async def health_check() -> Dict[str, Any]:
返回各数据库连接状态和应用信息 返回各数据库连接状态和应用信息
""" """
# 检查各数据库连接状态 # 检查各数据库连接状态
mysql_status = "connected" mysql_status = "unknown"
mongodb_status = "connected" mongodb_status = "unknown"
redis_status = "connected" redis_status = "unknown"
try: try:
if mysql_db.async_engine is None: if mysql_db.async_engine is None:
mysql_status = "disconnected" mysql_status = "disconnected"
except Exception: else:
# 实际执行一次查询验证连接
from sqlalchemy import text
async with mysql_db.async_engine.connect() as conn:
await conn.execute(text("SELECT 1"))
mysql_status = "connected"
except Exception as e:
logger.warning(f"MySQL 健康检查失败: {e}")
mysql_status = "error" mysql_status = "error"
try: try:
if mongodb.client is None: if mongodb.client is None:
mongodb_status = "disconnected" mongodb_status = "disconnected"
except Exception: else:
# 实际 ping 验证
await mongodb.client.admin.command('ping')
mongodb_status = "connected"
except Exception as e:
logger.warning(f"MongoDB 健康检查失败: {e}")
mongodb_status = "error" mongodb_status = "error"
try: try:
if not redis_db.is_connected: if not redis_db.is_connected or redis_db.client is None:
redis_status = "disconnected" redis_status = "disconnected"
except Exception: else:
# 实际执行 ping 验证
await redis_db.client.ping()
redis_status = "connected"
except Exception as e:
logger.warning(f"Redis 健康检查失败: {e}")
redis_status = "error" redis_status = "error"
return { return {

@@ -1,13 +1,13 @@
""" """
任务管理 API 接口 任务管理 API 接口
提供异步任务状态查询 提供异步任务状态查询和历史记录
""" """
from typing import Optional from typing import Optional
from fastapi import APIRouter, HTTPException from fastapi import APIRouter, HTTPException
from app.core.database import redis_db from app.core.database import redis_db, mongodb
router = APIRouter(prefix="/tasks", tags=["任务管理"]) router = APIRouter(prefix="/tasks", tags=["任务管理"])
@@ -23,25 +23,94 @@ async def get_task_status(task_id: str):
Returns: Returns:
任务状态信息 任务状态信息
""" """
# 优先从 Redis 获取
status = await redis_db.get_task_status(task_id) status = await redis_db.get_task_status(task_id)
if not status: if status:
# Redis不可用时假设任务已完成文档已成功处理
# 前端轮询时会得到这个响应
return { return {
"task_id": task_id, "task_id": task_id,
"status": "success", "status": status.get("status", "unknown"),
"progress": 100, "progress": status.get("meta", {}).get("progress", 0),
"message": "任务处理完成", "message": status.get("meta", {}).get("message"),
"result": None, "result": status.get("meta", {}).get("result"),
"error": None "error": status.get("meta", {}).get("error")
} }
# Redis 不可用时,尝试从 MongoDB 获取
mongo_task = await mongodb.get_task(task_id)
if mongo_task:
return {
"task_id": mongo_task.get("task_id"),
"status": mongo_task.get("status", "unknown"),
"progress": 100 if mongo_task.get("status") == "success" else 0,
"message": mongo_task.get("message"),
"result": mongo_task.get("result"),
"error": mongo_task.get("error")
}
# 任务不存在或状态未知
return { return {
"task_id": task_id, "task_id": task_id,
"status": status.get("status", "unknown"), "status": "unknown",
"progress": status.get("meta", {}).get("progress", 0), "progress": 0,
"message": status.get("meta", {}).get("message"), "message": "无法获取任务状态Redis和MongoDB均不可用",
"result": status.get("meta", {}).get("result"), "result": None,
"error": status.get("meta", {}).get("error") "error": None
} }
@router.get("/")
async def list_tasks(limit: int = 50, skip: int = 0):
"""
获取任务历史列表
Args:
limit: 返回数量限制
skip: 跳过数量
Returns:
任务列表
"""
try:
tasks = await mongodb.list_tasks(limit=limit, skip=skip)
return {
"success": True,
"tasks": tasks,
"count": len(tasks)
}
except Exception as e:
# MongoDB 不可用时返回空列表
return {
"success": False,
"tasks": [],
"count": 0,
"error": str(e)
}
@router.delete("/{task_id}")
async def delete_task(task_id: str):
"""
删除任务
Args:
task_id: 任务ID
Returns:
是否删除成功
"""
try:
# 从 Redis 删除
if redis_db._connected and redis_db.client:
key = f"task:{task_id}"
await redis_db.client.delete(key)
# 从 MongoDB 删除
deleted = await mongodb.delete_task(task_id)
return {
"success": True,
"deleted": deleted
}
except Exception as e:
raise HTTPException(status_code=500, detail=f"删除任务失败: {str(e)}")
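Note on the lookup order in get_task_status above: Redis is consulted first, MongoDB second, and an "unknown" response is returned last. The sketch below restates that as a pure function over already-fetched records so it can be exercised without either database; resolve_task_status is a hypothetical name, and the record shapes follow the diff.

```python
from typing import Optional


def resolve_task_status(task_id: str,
                        redis_status: Optional[dict],
                        mongo_task: Optional[dict]) -> dict:
    """Pick the best available task status: Redis, then MongoDB, then unknown."""
    if redis_status:
        meta = redis_status.get("meta", {})
        return {
            "task_id": task_id,
            "status": redis_status.get("status", "unknown"),
            "progress": meta.get("progress", 0),
            "message": meta.get("message"),
        }
    if mongo_task:
        status = mongo_task.get("status", "unknown")
        return {
            "task_id": mongo_task.get("task_id", task_id),
            "status": status,
            # The MongoDB record in the diff carries no progress, so it is inferred.
            "progress": 100 if status == "success" else 0,
            "message": mongo_task.get("message"),
        }
    return {"task_id": task_id, "status": "unknown", "progress": 0, "message": None}
```

One consequence visible here is that progress served from the MongoDB fallback is coarse: a task that is halfway done reports 0 until it succeeds.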

@@ -23,6 +23,44 @@ logger = logging.getLogger(__name__)
router = APIRouter(prefix="/templates", tags=["表格模板"]) router = APIRouter(prefix="/templates", tags=["表格模板"])
# ==================== 辅助函数 ====================
async def update_task_status(
task_id: str,
status: str,
progress: int = 0,
message: str = "",
result: dict = None,
error: str = None
):
"""
更新任务状态,同时写入 Redis 和 MongoDB
"""
from app.core.database import redis_db
meta = {"progress": progress, "message": message}
if result:
meta["result"] = result
if error:
meta["error"] = error
try:
await redis_db.set_task_status(task_id, status, meta)
except Exception as e:
logger.warning(f"Redis 任务状态更新失败: {e}")
try:
await mongodb.update_task(
task_id=task_id,
status=status,
message=message,
result=result,
error=error
)
except Exception as e:
logger.warning(f"MongoDB 任务状态更新失败: {e}")
# ==================== 请求/响应模型 ==================== # ==================== 请求/响应模型 ====================
class TemplateFieldRequest(BaseModel): class TemplateFieldRequest(BaseModel):
@@ -41,6 +79,7 @@ class FillRequest(BaseModel):
source_doc_ids: Optional[List[str]] = None # MongoDB 文档 ID 列表 source_doc_ids: Optional[List[str]] = None # MongoDB 文档 ID 列表
source_file_paths: Optional[List[str]] = None # 源文档文件路径列表 source_file_paths: Optional[List[str]] = None # 源文档文件路径列表
user_hint: Optional[str] = None user_hint: Optional[str] = None
task_id: Optional[str] = None # 可选的任务ID用于任务历史跟踪
class ExportRequest(BaseModel): class ExportRequest(BaseModel):
@@ -155,20 +194,17 @@ async def upload_joint_template(
) )
try: try:
# 1. 保存模板文件并提取字段 # 1. 保存模板文件
template_content = await template_file.read() template_content = await template_file.read()
template_path = file_service.save_uploaded_file( template_path = file_service.save_uploaded_file(
template_content, template_content,
template_file.filename, template_file.filename,
subfolder="templates" subfolder="templates"
) )
template_fields = await template_fill_service.get_template_fields_from_file(
template_path,
template_ext
)
# 2. 处理源文档 - 保存文件 # 2. 保存并解析源文档 - 提取内容用于生成表头
source_file_info = [] source_file_info = []
source_contents = []
for sf in source_files: for sf in source_files:
if sf.filename: if sf.filename:
sf_content = await sf.read() sf_content = await sf.read()
@@ -183,10 +219,81 @@ async def upload_joint_template(
"filename": sf.filename, "filename": sf.filename,
"ext": sf_ext "ext": sf_ext
}) })
# 解析源文档获取内容(用于 AI 生成表头)
try:
from app.core.document_parser import ParserFactory
parser = ParserFactory.get_parser(sf_path)
parse_result = parser.parse(sf_path)
if parse_result.success and parse_result.data:
# 获取原始内容
content = parse_result.data.get("content", "")[:5000] if parse_result.data.get("content") else ""
# 获取标题可能在顶层或structured_data内
titles = parse_result.data.get("titles", [])
if not titles and parse_result.data.get("structured_data"):
titles = parse_result.data.get("structured_data", {}).get("titles", [])
titles = titles[:10] if titles else []
# 获取表格数量可能在顶层或structured_data内
tables = parse_result.data.get("tables", [])
if not tables and parse_result.data.get("structured_data"):
tables = parse_result.data.get("structured_data", {}).get("tables", [])
tables_count = len(tables) if tables else 0
# 获取表格内容摘要(用于 AI 理解源文档结构)
tables_summary = ""
if tables:
tables_summary = "\n【文档中的表格】:\n"
for idx, table in enumerate(tables[:5]): # 最多5个表格
if isinstance(table, dict):
headers = table.get("headers", [])
rows = table.get("rows", [])
if headers:
tables_summary += f"表格{idx+1}表头: {', '.join(str(h) for h in headers)}\n"
if rows:
tables_summary += f"表格{idx+1}前3行: "
for row_idx, row in enumerate(rows[:3]):
if isinstance(row, list):
tables_summary += " | ".join(str(c) for c in row) + "; "
elif isinstance(row, dict):
tables_summary += " | ".join(str(row.get(h, "")) for h in headers if headers) + "; "
tables_summary += "\n"
source_contents.append({
"filename": sf.filename,
"doc_type": sf_ext,
"content": content,
"titles": titles,
"tables_count": tables_count,
"tables_summary": tables_summary
})
logger.info(f"[DEBUG] source_contents built: filename={sf.filename}, content_len={len(content)}, titles_count={len(titles)}, tables_count={tables_count}")
if tables_summary:
logger.info(f"[DEBUG] tables_summary preview: {tables_summary[:300]}")
except Exception as e:
logger.warning(f"解析源文档失败 {sf.filename}: {e}")
# 3. 根据源文档内容生成表头
template_fields = await template_fill_service.get_template_fields_from_file(
template_path,
template_ext,
source_contents=source_contents # 传递源文档内容
)
# 3. 异步处理源文档到MongoDB # 3. 异步处理源文档到MongoDB
task_id = str(uuid.uuid4()) task_id = str(uuid.uuid4())
if source_file_info: if source_file_info:
# 保存任务记录到 MongoDB
try:
await mongodb.insert_task(
task_id=task_id,
task_type="source_process",
status="pending",
message=f"开始处理 {len(source_file_info)} 个源文档"
)
except Exception as mongo_err:
logger.warning(f"MongoDB 保存任务记录失败: {mongo_err}")
background_tasks.add_task( background_tasks.add_task(
process_source_documents, process_source_documents,
task_id=task_id, task_id=task_id,
@@ -225,12 +332,10 @@ async def upload_joint_template(
async def process_source_documents(task_id: str, files: List[dict]): async def process_source_documents(task_id: str, files: List[dict]):
"""异步处理源文档存入MongoDB""" """异步处理源文档存入MongoDB"""
from app.core.database import redis_db
try: try:
await redis_db.set_task_status( await update_task_status(
task_id, status="processing", task_id, status="processing",
meta={"progress": 0, "message": "开始处理源文档"} progress=0, message="开始处理源文档"
) )
doc_ids = [] doc_ids = []
@@ -259,22 +364,24 @@ async def process_source_documents(task_id: str, files: List[dict]):
logger.error(f"源文档处理异常: {file_info['filename']}, error: {str(e)}") logger.error(f"源文档处理异常: {file_info['filename']}, error: {str(e)}")
progress = int((i + 1) / len(files) * 100) progress = int((i + 1) / len(files) * 100)
await redis_db.set_task_status( await update_task_status(
task_id, status="processing", task_id, status="processing",
meta={"progress": progress, "message": f"已处理 {i+1}/{len(files)}"} progress=progress, message=f"已处理 {i+1}/{len(files)}"
) )
await redis_db.set_task_status( await update_task_status(
task_id, status="success", task_id, status="success",
meta={"progress": 100, "message": "源文档处理完成", "doc_ids": doc_ids} progress=100, message="源文档处理完成",
result={"doc_ids": doc_ids}
) )
logger.info(f"所有源文档处理完成: {len(doc_ids)}") logger.info(f"所有源文档处理完成: {len(doc_ids)}")
except Exception as e: except Exception as e:
logger.error(f"源文档批量处理失败: {str(e)}") logger.error(f"源文档批量处理失败: {str(e)}")
await redis_db.set_task_status( await update_task_status(
task_id, status="failure", task_id, status="failure",
meta={"error": str(e)} progress=0, message="源文档处理失败",
error=str(e)
) )
@@ -333,7 +440,27 @@ async def fill_template(
Returns: Returns:
填写结果 填写结果
""" """
# Use the task_id supplied in the request, or generate a new one
task_id = request.task_id or str(uuid.uuid4())
try: try:
# Create the task record in MongoDB
try:
await mongodb.insert_task(
task_id=task_id,
task_type="template_fill",
status="processing",
message=f"开始填表任务: {len(request.template_fields)} 个字段"
)
except Exception as mongo_err:
logger.warning(f"MongoDB 创建任务记录失败: {mongo_err}")
# Update progress: start
await update_task_status(
task_id, "processing",
progress=0, message="开始处理..."
)
# 转换字段 # 转换字段
fields = [ fields = [
TemplateField( TemplateField(
@@ -346,17 +473,51 @@ async def fill_template(
for f in request.template_fields for f in request.template_fields
] ]
# Derive the file type from template_id
template_file_type = "xlsx" # default type
if request.template_id:
ext = request.template_id.split('.')[-1].lower()
if ext in ["xlsx", "xls"]:
template_file_type = "xlsx"
elif ext == "docx":
template_file_type = "docx"
# Update progress: about to start filling
await update_task_status(
task_id, "processing",
progress=10, message=f"准备填写 {len(fields)} 个字段..."
)
# 执行填写 # 执行填写
result = await template_fill_service.fill_template( result = await template_fill_service.fill_template(
template_fields=fields, template_fields=fields,
source_doc_ids=request.source_doc_ids, source_doc_ids=request.source_doc_ids,
source_file_paths=request.source_file_paths, source_file_paths=request.source_file_paths,
user_hint=request.user_hint user_hint=request.user_hint,
template_id=request.template_id,
template_file_type=template_file_type,
task_id=task_id
) )
return result # 更新为成功
await update_task_status(
task_id, "success",
progress=100, message="填表完成",
result={
"field_count": len(fields),
"max_rows": result.get("max_rows", 0)
}
)
return {**result, "task_id": task_id}
except Exception as e: except Exception as e:
# Mark the task as failed
await update_task_status(
task_id, "failure",
progress=0, message="填表失败",
error=str(e)
)
logger.error(f"填写表格失败: {str(e)}") logger.error(f"填写表格失败: {str(e)}")
raise HTTPException(status_code=500, detail=f"填写失败: {str(e)}") raise HTTPException(status_code=500, detail=f"填写失败: {str(e)}")
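The extension handling added to `fill_template` can be looked at in isolation. The following is a minimal sketch, not the project's code: the helper name `infer_template_file_type` is hypothetical, and it uses `os.path.splitext` rather than `split('.')`, so a `template_id` without any extension falls back to the default instead of being treated as an extension.

```python
import os
from typing import Optional


def infer_template_file_type(template_id: Optional[str], default: str = "xlsx") -> str:
    """Hypothetical helper: map a template_id to the file type used for filling."""
    if not template_id:
        return default
    # splitext returns (root, ".ext"); an ID without a dot yields an empty extension
    ext = os.path.splitext(template_id)[1].lstrip(".").lower()
    if ext in ("xlsx", "xls"):
        return "xlsx"
    if ext == "docx":
        return "docx"
    # Unknown or missing extensions keep the default, as in the diff
    return default
```

Keeping this mapping in one function would make it straightforward to unit-test the fallback behaviour separately from the endpoint.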


@@ -5,6 +5,7 @@ from fastapi import APIRouter, UploadFile, File, HTTPException, Query
from fastapi.responses import StreamingResponse from fastapi.responses import StreamingResponse
from typing import Optional from typing import Optional
import logging import logging
import os
import pandas as pd import pandas as pd
import io import io
@@ -126,7 +127,7 @@ async def upload_excel(
content += f"... (共 {len(sheet_data['rows'])} 行)\n\n" content += f"... (共 {len(sheet_data['rows'])} 行)\n\n"
doc_metadata = { doc_metadata = {
"filename": saved_path.split("/")[-1] if "/" in saved_path else saved_path.split("\\")[-1], "filename": os.path.basename(saved_path),
"original_filename": file.filename, "original_filename": file.filename,
"saved_path": saved_path, "saved_path": saved_path,
"file_size": len(content), "file_size": len(content),
@@ -253,7 +254,7 @@ async def export_excel(
output.seek(0) output.seek(0)
# 生成文件名 # 生成文件名
original_name = file_path.split('/')[-1] if '/' in file_path else file_path original_name = os.path.basename(file_path)
if columns: if columns:
export_name = f"export_{sheet_name or 'data'}_{len(column_list) if columns else 'all'}_cols.xlsx" export_name = f"export_{sheet_name or 'data'}_{len(column_list) if columns else 'all'}_cols.xlsx"
else: else:


@@ -59,6 +59,11 @@ class MongoDB:
"""RAG索引集合 - 存储字段语义索引""" """RAG索引集合 - 存储字段语义索引"""
return self.db["rag_index"] return self.db["rag_index"]
@property
def tasks(self):
"""Tasks collection - stores task history records"""
return self.db["tasks"]
# ==================== 文档操作 ==================== # ==================== 文档操作 ====================
async def insert_document( async def insert_document(
@@ -242,8 +247,128 @@ class MongoDB:
await self.rag_index.create_index("table_name") await self.rag_index.create_index("table_name")
await self.rag_index.create_index("field_name") await self.rag_index.create_index("field_name")
# Indexes for the tasks collection
await self.tasks.create_index("task_id", unique=True)
await self.tasks.create_index("created_at")
logger.info("MongoDB 索引创建完成") logger.info("MongoDB 索引创建完成")
# ==================== Task history operations ====================
async def insert_task(
self,
task_id: str,
task_type: str,
status: str = "pending",
message: str = "",
result: Optional[Dict[str, Any]] = None,
error: Optional[str] = None,
) -> str:
"""
Insert a task record
Args:
task_id: Task ID
task_type: Task type
status: Task status
message: Task message
result: Task result
error: Error message
Returns:
ID of the inserted document
"""
task = {
"task_id": task_id,
"task_type": task_type,
"status": status,
"message": message,
"result": result,
"error": error,
"created_at": datetime.utcnow(),
"updated_at": datetime.utcnow(),
}
result_obj = await self.tasks.insert_one(task)
return str(result_obj.inserted_id)
async def update_task(
self,
task_id: str,
status: Optional[str] = None,
message: Optional[str] = None,
result: Optional[Dict[str, Any]] = None,
error: Optional[str] = None,
) -> bool:
"""
Update a task's status
Args:
task_id: Task ID
status: Task status
message: Task message
result: Task result
error: Error message
Returns:
Whether the update modified a record
"""
from bson import ObjectId
update_data = {"updated_at": datetime.utcnow()}
if status is not None:
update_data["status"] = status
if message is not None:
update_data["message"] = message
if result is not None:
update_data["result"] = result
if error is not None:
update_data["error"] = error
update_result = await self.tasks.update_one(
{"task_id": task_id},
{"$set": update_data}
)
return update_result.modified_count > 0
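The partial-update pattern in `update_task` (only fields that were actually passed end up in `$set`) can be sketched as a pure function. This is an illustration under assumptions: `build_task_update` is a hypothetical name, and the sketch uses a timezone-aware `datetime.now(timezone.utc)` where the diff uses `datetime.utcnow()`.

```python
from datetime import datetime, timezone
from typing import Any, Dict, Optional


def build_task_update(
    status: Optional[str] = None,
    message: Optional[str] = None,
    result: Optional[Dict[str, Any]] = None,
    error: Optional[str] = None,
) -> Dict[str, Any]:
    """Hypothetical helper: build the update document for a partial task update."""
    # updated_at is always refreshed, even if no other field changes
    update: Dict[str, Any] = {"updated_at": datetime.now(timezone.utc)}
    candidates = {"status": status, "message": message, "result": result, "error": error}
    # Only include fields the caller actually supplied
    update.update({key: value for key, value in candidates.items() if value is not None})
    return {"$set": update}
```

Because `updated_at` changes on every call, a matched document is always modified, which is why checking `modified_count > 0` works as a success signal here.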
async def get_task(self, task_id: str) -> Optional[Dict[str, Any]]:
"""Fetch a task by task_id"""
task = await self.tasks.find_one({"task_id": task_id})
if task:
task["_id"] = str(task["_id"])
return task
async def list_tasks(
self,
limit: int = 50,
skip: int = 0,
) -> List[Dict[str, Any]]:
"""
List tasks, newest first
Args:
limit: Maximum number of tasks to return
skip: Number of tasks to skip
Returns:
List of tasks
"""
cursor = self.tasks.find().sort("created_at", -1).skip(skip).limit(limit)
tasks = []
async for task in cursor:
task["_id"] = str(task["_id"])
# Convert datetime values to ISO 8601 strings
if task.get("created_at"):
task["created_at"] = task["created_at"].isoformat()
if task.get("updated_at"):
task["updated_at"] = task["updated_at"].isoformat()
tasks.append(task)
return tasks
async def delete_task(self, task_id: str) -> bool:
"""Delete a task"""
result = await self.tasks.delete_one({"task_id": task_id})
return result.deleted_count > 0
# ==================== Global singleton ==================== # ==================== Global singleton ====================


@@ -59,7 +59,13 @@ class DocxParser(BaseParser):
paragraphs = [] paragraphs = []
for para in doc.paragraphs: for para in doc.paragraphs:
if para.text.strip(): if para.text.strip():
paragraphs.append(para.text) paragraphs.append({
"text": para.text,
"style": str(para.style.name) if para.style else "Normal"
})
# Extract plain paragraph text (used for AI parsing)
paragraphs_text = [p["text"] for p in paragraphs if p["text"].strip()]
# 提取表格内容 # 提取表格内容
tables_data = [] tables_data = []
@@ -77,8 +83,25 @@ class DocxParser(BaseParser):
"column_count": len(table_rows[0]) if table_rows else 0 "column_count": len(table_rows[0]) if table_rows else 0
}) })
# 合并所有文本 # 提取图片/嵌入式对象信息
full_text = "\n".join(paragraphs) images_info = self._extract_images_info(doc, path)
# Merge all text (including image descriptions)
full_text_parts = []
full_text_parts.append("【文档正文】")
full_text_parts.extend(paragraphs_text)
if tables_data:
full_text_parts.append("\n【文档表格】")
for idx, table in enumerate(tables_data):
full_text_parts.append(f"--- 表格 {idx + 1} ---")
for row in table["rows"]:
full_text_parts.append(" | ".join(str(cell) for cell in row))
if images_info.get("image_count", 0) > 0:
full_text_parts.append(f"\n【文档图片】文档包含 {images_info['image_count']} 张图片/图表")
full_text = "\n".join(full_text_parts)
# 构建元数据 # 构建元数据
metadata = { metadata = {
@@ -89,7 +112,9 @@ class DocxParser(BaseParser):
"table_count": len(tables_data), "table_count": len(tables_data),
"word_count": len(full_text), "word_count": len(full_text),
"char_count": len(full_text.replace("\n", "")), "char_count": len(full_text.replace("\n", "")),
"has_tables": len(tables_data) > 0 "has_tables": len(tables_data) > 0,
"has_images": images_info.get("image_count", 0) > 0,
"image_count": images_info.get("image_count", 0)
} }
# 返回结果 # 返回结果
@@ -97,12 +122,16 @@ class DocxParser(BaseParser):
success=True, success=True,
data={ data={
"content": full_text, "content": full_text,
"paragraphs": paragraphs, "paragraphs": paragraphs_text,
"paragraphs_with_style": paragraphs,
"tables": tables_data, "tables": tables_data,
"images": images_info,
"word_count": len(full_text), "word_count": len(full_text),
"structured_data": { "structured_data": {
"paragraphs": paragraphs, "paragraphs": paragraphs,
"tables": tables_data "paragraphs_text": paragraphs_text,
"tables": tables_data,
"images": images_info
} }
}, },
metadata=metadata metadata=metadata
@@ -115,6 +144,59 @@ class DocxParser(BaseParser):
error=f"解析 Word 文档失败: {str(e)}" error=f"解析 Word 文档失败: {str(e)}"
) )
def extract_images_as_base64(self, file_path: str) -> List[Dict[str, str]]:
"""
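Extract all images from a Word document and return them base64-encoded
Args:
file_path: Path to the Word file
Returns:
List of images; each item contains the base64 payload and the image MIME type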
"""
import zipfile
import base64
from io import BytesIO
images = []
try:
with zipfile.ZipFile(file_path, 'r') as zf:
# Look for image files under the word/media directory
for filename in zf.namelist():
if filename.startswith('word/media/'):
# Determine the image type from the file extension
ext = filename.split('.')[-1].lower()
mime_types = {
'png': 'image/png',
'jpg': 'image/jpeg',
'jpeg': 'image/jpeg',
'gif': 'image/gif',
'bmp': 'image/bmp'
}
mime_type = mime_types.get(ext, 'image/png')
try:
# Read the image data and encode it as base64
image_data = zf.read(filename)
base64_data = base64.b64encode(image_data).decode('utf-8')
images.append({
"filename": filename,
"mime_type": mime_type,
"base64": base64_data,
"size": len(image_data)
})
logger.info(f"Extracted image: {filename}, size: {len(image_data)} bytes")
except Exception as e:
logger.warning(f"Failed to extract image {filename}: {str(e)}")
except Exception as e:
logger.error(f"Failed to open the Word document for image extraction: {str(e)}")
logger.info(f"Extracted {len(images)} images in total")
return images
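The extraction above relies on a `.docx` file being a ZIP archive whose embedded media live under `word/media/`. A self-contained sketch of that idea follows. The function name `read_media_images` is hypothetical, and unlike the diff (which defaults unknown extensions to `image/png`) this sketch skips parts whose extension is not a known image type.

```python
import base64
import zipfile
from typing import BinaryIO, Dict, List, Union

# Extension to MIME type mapping, mirroring the table used in the diff
MIME_TYPES = {
    "png": "image/png",
    "jpg": "image/jpeg",
    "jpeg": "image/jpeg",
    "gif": "image/gif",
    "bmp": "image/bmp",
}


def read_media_images(source: Union[str, BinaryIO]) -> List[Dict[str, object]]:
    """Hypothetical helper: return base64-encoded images stored under word/media/."""
    images: List[Dict[str, object]] = []
    with zipfile.ZipFile(source, "r") as zf:
        for name in zf.namelist():
            if not name.startswith("word/media/"):
                continue
            ext = name.rsplit(".", 1)[-1].lower()
            if ext not in MIME_TYPES:
                # Assumption of this sketch: skip non-image parts instead of guessing a type
                continue
            data = zf.read(name)
            images.append({
                "filename": name,
                "mime_type": MIME_TYPES[ext],
                "base64": base64.b64encode(data).decode("ascii"),
                "size": len(data),
            })
    return images
```

Skipping unknown extensions avoids sending, for example, an embedded `.bin` object to a vision model labelled as a PNG.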
def extract_key_sentences(self, text: str, max_sentences: int = 10) -> List[str]: def extract_key_sentences(self, text: str, max_sentences: int = 10) -> List[str]:
""" """
从文本中提取关键句子 从文本中提取关键句子
@@ -268,6 +350,60 @@ class DocxParser(BaseParser):
return fields return fields
def _extract_images_info(self, doc: Document, path: Path) -> Dict[str, Any]:
"""
Extract information about images and embedded objects in a Word document
Args:
doc: Document object
path: File path
Returns:
Dictionary of image information
"""
import zipfile
from io import BytesIO
image_count = 0
image_descriptions = []
inline_shapes_count = 0
try:
# Method 1: count images via inline shapes
try:
inline_shapes_count = len(doc.inline_shapes)
if inline_shapes_count > 0:
image_count = inline_shapes_count
image_descriptions.append(f"文档包含 {inline_shapes_count} 个嵌入式图形/图片")
except Exception:
pass
# Method 2: open the .docx as a ZIP archive and count the files under word/media
try:
with zipfile.ZipFile(path, 'r') as zf:
# Collect the media files stored in the archive
media_files = [f for f in zf.namelist() if f.startswith('word/media/')]
if media_files and not inline_shapes_count:
image_count = len(media_files)
image_descriptions.append(f"文档包含 {image_count} 个嵌入图片")
# Check for images in headers and footers
header_images = [f for f in zf.namelist() if 'header' in f.lower() and f.endswith(('.png', '.jpg', '.jpeg', '.gif', '.bmp'))]
if header_images:
image_descriptions.append(f"页眉/页脚包含 {len(header_images)} 个图片")
except Exception:
pass
except Exception as e:
logger.warning(f"提取图片信息失败: {str(e)}")
return {
"image_count": image_count,
"inline_shapes_count": inline_shapes_count,
"descriptions": image_descriptions,
"has_images": image_count > 0
}
def _infer_field_type_from_hint(self, hint: str) -> str: def _infer_field_type_from_hint(self, hint: str) -> str:
""" """
从提示词推断字段类型 从提示词推断字段类型


@@ -0,0 +1,15 @@
"""
Instruction execution module
Note: this module is an optional feature and is not implemented yet.
To enable it, implement intent_parser.py and executor.py
"""
from .intent_parser import IntentParser, DefaultIntentParser
from .executor import InstructionExecutor, DefaultInstructionExecutor
__all__ = [
"IntentParser",
"DefaultIntentParser",
"InstructionExecutor",
"DefaultInstructionExecutor",
]


@@ -0,0 +1,35 @@
"""
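Instruction executor module
Converts natural-language instructions into executable operations
Note: this module is an optional feature and is not implemented yet.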
"""
from abc import ABC, abstractmethod
from typing import Any, Dict
class InstructionExecutor(ABC):
"""Abstract base class for instruction executors"""
@abstractmethod
async def execute(self, instruction: str, context: Dict[str, Any]) -> Dict[str, Any]:
"""
Execute an instruction
Args:
instruction: The parsed instruction
context: Execution context
Returns:
Execution result
"""
pass
class DefaultInstructionExecutor(InstructionExecutor):
"""Default instruction executor"""
async def execute(self, instruction: str, context: Dict[str, Any]) -> Dict[str, Any]:
"""Not implemented yet"""
raise NotImplementedError("Instruction execution is not implemented yet")


@@ -0,0 +1,34 @@
"""
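Intent parser module
Parses the user's natural-language instructions and identifies the intent and parameters
Note: this module is an optional feature and is not implemented yet.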
"""
from abc import ABC, abstractmethod
from typing import Any, Dict, Tuple
class IntentParser(ABC):
"""Abstract base class for intent parsers"""
@abstractmethod
async def parse(self, text: str) -> Tuple[str, Dict[str, Any]]:
"""
Parse a natural-language instruction
Args:
text: Natural-language input from the user
Returns:
(intent type, parameter dict)
"""
pass
class DefaultIntentParser(IntentParser):
"""Default intent parser"""
async def parse(self, text: str) -> Tuple[str, Dict[str, Any]]:
"""Not implemented yet"""
raise NotImplementedError("Intent parsing is not implemented yet")


@@ -1,6 +1,13 @@
""" """
FastAPI application entry point FastAPI application entry point
""" """
# ========== Suppress noisy MongoDB driver logging ==========
import logging
logging.getLogger("pymongo").setLevel(logging.WARNING)
logging.getLogger("pymongo.topology").setLevel(logging.WARNING)
logging.getLogger("urllib3").setLevel(logging.WARNING)
# ==============================================
import logging import logging
import logging.handlers import logging.handlers
import sys import sys


@@ -65,7 +65,17 @@ class LLMService:
return response.json() return response.json()
except httpx.HTTPStatusError as e: except httpx.HTTPStatusError as e:
logger.error(f"LLM API 请求失败: {e.response.status_code} - {e.response.text}") error_detail = e.response.text
logger.error(f"LLM API 请求失败: {e.response.status_code} - {error_detail}")
# Try to parse the structured error payload
try:
import json
err_json = json.loads(error_detail)
err_code = err_json.get("error", {}).get("code", "unknown")
err_msg = err_json.get("error", {}).get("message", "unknown")
logger.error(f"API error code: {err_code}, message: {err_msg}")
except Exception:
pass
raise raise
except Exception as e: except Exception as e:
logger.error(f"LLM API 调用异常: {str(e)}") logger.error(f"LLM API 调用异常: {str(e)}")
@@ -328,6 +338,154 @@ Excel 数据概览:
"analysis": None "analysis": None
} }
async def chat_with_images(
self,
text: str,
images: List[Dict[str, str]],
temperature: float = 0.7,
max_tokens: Optional[int] = None
) -> Dict[str, Any]:
"""
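Call the vision model API (supports image input)
Args:
text: Text content
images: List of images; each item carries the base64 payload and its mime_type
Format: [{"base64": "...", "mime_type": "image/png"}, ...]
temperature: Sampling temperature
max_tokens: Maximum number of tokens
Returns:
Dict[str, Any]: API response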
"""
headers = {
"Authorization": f"Bearer {self.api_key}",
"Content-Type": "application/json"
}
# Build the image content parts
image_contents = []
for img in images:
image_contents.append({
"type": "image_url",
"image_url": {
"url": f"data:{img['mime_type']};base64,{img['base64']}"
}
})
# Build the messages
messages = [
{
"role": "user",
"content": [
{
"type": "text",
"text": text
},
*image_contents
]
}
]
payload = {
"model": self.model_name,
"messages": messages,
"temperature": temperature
}
if max_tokens:
payload["max_tokens"] = max_tokens
try:
async with httpx.AsyncClient(timeout=120.0) as client:
response = await client.post(
f"{self.base_url}/chat/completions",
headers=headers,
json=payload
)
response.raise_for_status()
return response.json()
except httpx.HTTPStatusError as e:
error_detail = e.response.text
logger.error(f"Vision model API request failed: {e.response.status_code} - {error_detail}")
# Try to parse the structured error payload
try:
import json
err_json = json.loads(error_detail)
err_code = err_json.get("error", {}).get("code", "unknown")
err_msg = err_json.get("error", {}).get("message", "unknown")
logger.error(f"API error code: {err_code}, message: {err_msg}")
logger.error(f"Requested model: {self.model_name}, base_url: {self.base_url}")
except Exception:
pass
raise
except Exception as e:
logger.error(f"视觉模型 API 调用异常: {str(e)}")
raise
async def analyze_images(
self,
images: List[Dict[str, str]],
user_prompt: str = ""
) -> Dict[str, Any]:
"""
Analyze image content (using the vision model)
Args:
images: List of images; each item carries the base64 payload and its mime_type
user_prompt: User prompt
Returns:
Dict[str, Any]: Analysis result
"""
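prompt = f"""You are a professional visual analysis expert. Please analyze the content of the following images.
{user_prompt if user_prompt else "Describe the content of the images in detail, including text, data, charts, flows, and all other visible information."}
Output in the following JSON format:
{{
"description": "Detailed description of the image content",
"text_content": "Text found in the images (if any)",
"data_extracted": {{"": ""}} // if the images contain tables or data
}}
If the images contain no useful information, return an empty description."""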
try:
response = await self.chat_with_images(
text=prompt,
images=images,
temperature=0.1,
max_tokens=4000
)
content = self.extract_message_content(response)
# 解析 JSON
import json
try:
result = json.loads(content)
return {
"success": True,
"analysis": result,
"model": self.model_name
}
except json.JSONDecodeError:
return {
"success": True,
"analysis": {"description": content},
"model": self.model_name
}
except Exception as e:
logger.error(f"图片分析失败: {str(e)}")
return {
"success": False,
"error": str(e),
"analysis": None
}
# 全局单例 # 全局单例
llm_service = LLMService() llm_service = LLMService()
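The message structure that `chat_with_images` sends can be sketched as a pure function, which makes the payload shape easy to inspect without calling any API. This assumes an OpenAI-compatible chat completions endpoint, which is what the payload in the diff suggests; the function name `build_vision_messages` is hypothetical.

```python
from typing import Any, Dict, List


def build_vision_messages(text: str, images: List[Dict[str, str]]) -> List[Dict[str, Any]]:
    """Hypothetical helper: build one user message with a text part and image parts."""
    content: List[Dict[str, Any]] = [{"type": "text", "text": text}]
    for img in images:
        # Each image is passed inline as a data URL: data:<mime_type>;base64,<payload>
        content.append({
            "type": "image_url",
            "image_url": {"url": f"data:{img['mime_type']};base64,{img['base64']}"},
        })
    return [{"role": "user", "content": content}]
```

Separating payload construction from the HTTP call would also allow a size check on the encoded images before the request is sent.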


@@ -3,7 +3,6 @@ RAG 服务模块 - 检索增强生成
Vector retrieval implemented with sentence-transformers + Faiss Vector retrieval implemented with sentence-transformers + Faiss
""" """
import json
import logging import logging
import os import os
import pickle import pickle
@@ -11,12 +10,20 @@ from typing import Any, Dict, List, Optional
import faiss import faiss
import numpy as np import numpy as np
from sentence_transformers import SentenceTransformer
from app.config import settings from app.config import settings
logger = logging.getLogger(__name__) logger = logging.getLogger(__name__)
# Try to import sentence-transformers
try:
from sentence_transformers import SentenceTransformer
SENTENCE_TRANSFORMERS_AVAILABLE = True
except ImportError as e:
logger.warning(f"Failed to import sentence-transformers: {e}")
SENTENCE_TRANSFORMERS_AVAILABLE = False
SentenceTransformer = None
class SimpleDocument: class SimpleDocument:
"""简化文档对象""" """简化文档对象"""
@@ -28,17 +35,24 @@ class SimpleDocument:
class RAGService: class RAGService:
"""RAG 检索增强服务""" """RAG 检索增强服务"""
# Default chunking parameters
DEFAULT_CHUNK_SIZE = 500 # size of each text chunk (characters)
DEFAULT_CHUNK_OVERLAP = 50 # overlap between chunks (characters)
def __init__(self): def __init__(self):
self.embedding_model: Optional[SentenceTransformer] = None self.embedding_model = None
self.index: Optional[faiss.Index] = None self.index: Optional[faiss.Index] = None
self.documents: List[Dict[str, Any]] = [] self.documents: List[Dict[str, Any]] = []
self.doc_ids: List[str] = [] self.doc_ids: List[str] = []
self._dimension: int = 0 self._dimension: int = 384 # 默认维度
self._initialized = False self._initialized = False
self._persist_dir = settings.FAISS_INDEX_DIR self._persist_dir = settings.FAISS_INDEX_DIR
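# Temporarily disable RAG API calls; log only # Check whether the embedding backend is available
self._disabled = True self._disabled = not SENTENCE_TRANSFORMERS_AVAILABLE
logger.info("RAG service disabled (_disabled=True); index operations are only logged") if self._disabled:
logger.warning("RAG service disabled (sentence-transformers unavailable); falling back to keyword matching")
else:
logger.info("RAG service enabled")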
def _init_embeddings(self): def _init_embeddings(self):
"""初始化嵌入模型""" """初始化嵌入模型"""
@@ -88,6 +102,63 @@ class RAGService:
norms = np.where(norms == 0, 1, norms) norms = np.where(norms == 0, 1, norms)
return vectors / norms return vectors / norms
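def _split_into_chunks(self, text: str, chunk_size: Optional[int] = None, overlap: Optional[int] = None) -> List[str]:
"""
Split long text into chunks
Args:
text: Text to split
chunk_size: Size of each chunk (characters)
overlap: Number of overlapping characters between chunks
Returns:
List of text chunks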
"""
if chunk_size is None:
chunk_size = self.DEFAULT_CHUNK_SIZE
if overlap is None:
overlap = self.DEFAULT_CHUNK_OVERLAP
if len(text) <= chunk_size:
return [text] if text.strip() else []
chunks = []
start = 0
text_len = len(text)
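while start < text_len:
# Compute the end position of the current chunk
end = start + chunk_size
# Unless this is the last chunk, try to cut at a sentence boundary
if end < text_len:
# Search backwards (up to 100 characters) for the last period, comma, semicolon, or newline
cut_positions = []
for i in range(end, max(start, end - 100), -1):
if text[i] in '。;,,\n':
cut_positions.append(i + 1)
break
if cut_positions:
end = cut_positions[0]
else:
# No boundary found behind; search forwards (up to 50 characters)
for i in range(end, min(text_len, end + 50)):
if text[i] in '。;,,\n':
end = i + 1
break
chunk = text[start:end].strip()
if chunk:
chunks.append(chunk)
# Stop at the end of the text so the overlap does not emit a duplicate tail chunk
if end >= text_len:
break
# Advance the start position (allowing for overlap), always moving forward
next_start = end - overlap
start = next_start if next_start > start else end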
return chunks
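The core of `_split_into_chunks`, fixed-size windows that share `overlap` characters with their predecessor, can be sketched without the sentence-boundary search. This is a simplified illustration, not the method from the diff: the name `split_with_overlap` is hypothetical, and the argument validation is an addition of this sketch to guarantee forward progress.

```python
from typing import List


def split_with_overlap(text: str, chunk_size: int = 500, overlap: int = 50) -> List[str]:
    """Hypothetical helper: split text into fixed-size chunks that overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        # overlap >= chunk_size would prevent the window from advancing
        raise ValueError("overlap must be in [0, chunk_size)")
    chunks: List[str] = []
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        chunk = text[start:end].strip()
        if chunk:
            chunks.append(chunk)
        if end == len(text):
            # The last window reached the end of the text; stop here
            break
        # The next window starts `overlap` characters before the current end
        start = end - overlap
    return chunks
```

For example, `split_with_overlap("abcdefghij", 4, 1)` yields three chunks in which each one begins with the last character of the previous chunk.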
def index_field( def index_field(
self, self,
table_name: str, table_name: str,
@@ -124,9 +195,20 @@ class RAGService:
self, self,
doc_id: str, doc_id: str,
content: str, content: str,
metadata: Optional[Dict[str, Any]] = None metadata: Optional[Dict[str, Any]] = None,
chunk_size: int = None,
chunk_overlap: int = None
): ):
"""将文档内容索引到向量数据库""" """
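Index document content into the vector store (with automatic chunking)
Args:
doc_id: Unique document identifier
content: Document content
metadata: Document metadata
chunk_size: Chunk size in characters (default 500)
chunk_overlap: Number of overlapping characters between chunks (default 50)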
"""
if self._disabled: if self._disabled:
logger.info(f"[RAG DISABLED] 文档索引操作已跳过: {doc_id}") logger.info(f"[RAG DISABLED] 文档索引操作已跳过: {doc_id}")
return return
@@ -139,18 +221,56 @@ class RAGService:
logger.debug(f"文档跳过索引 (无嵌入模型): {doc_id}") logger.debug(f"文档跳过索引 (无嵌入模型): {doc_id}")
return return
doc = SimpleDocument( # 分割文档为小块
page_content=content, if chunk_size is None:
metadata=metadata or {"doc_id": doc_id} chunk_size = self.DEFAULT_CHUNK_SIZE
) if chunk_overlap is None:
self._add_documents([doc], [doc_id]) chunk_overlap = self.DEFAULT_CHUNK_OVERLAP
logger.debug(f"已索引文档: {doc_id}")
chunks = self._split_into_chunks(content, chunk_size, chunk_overlap)
if not chunks:
logger.warning(f"文档内容为空,跳过索引: {doc_id}")
return
# 为每个块创建文档对象
documents = []
chunk_ids = []
for i, chunk in enumerate(chunks):
chunk_id = f"{doc_id}_chunk_{i}"
chunk_metadata = metadata.copy() if metadata else {}
chunk_metadata.update({
"chunk_index": i,
"total_chunks": len(chunks),
"doc_id": doc_id
})
documents.append(SimpleDocument(
page_content=chunk,
metadata=chunk_metadata
))
chunk_ids.append(chunk_id)
# 批量添加文档
self._add_documents(documents, chunk_ids)
logger.info(f"已索引文档 {doc_id},共 {len(chunks)} 个块")
def _add_documents(self, documents: List[SimpleDocument], doc_ids: List[str]): def _add_documents(self, documents: List[SimpleDocument], doc_ids: List[str]):
"""批量添加文档到向量索引""" """批量添加文档到向量索引"""
if not documents: if not documents:
return return
# Always keep documents in memory (used by the keyword-search fallback)
for doc, did in zip(documents, doc_ids):
self.documents.append({"id": did, "content": doc.page_content, "metadata": doc.metadata})
self.doc_ids.append(did)
# Without an embedding model, skip vector indexing
if self.embedding_model is None:
logger.debug(f"Skipping vector indexing (no embedding model): {len(documents)} documents")
return
texts = [doc.page_content for doc in documents] texts = [doc.page_content for doc in documents]
embeddings = self.embedding_model.encode(texts, convert_to_numpy=True) embeddings = self.embedding_model.encode(texts, convert_to_numpy=True)
embeddings = self._normalize_vectors(embeddings).astype('float32') embeddings = self._normalize_vectors(embeddings).astype('float32')
@@ -162,12 +282,18 @@ class RAGService:
id_array = np.array(id_list, dtype='int64') id_array = np.array(id_list, dtype='int64')
self.index.add_with_ids(embeddings, id_array) self.index.add_with_ids(embeddings, id_array)
for doc, did in zip(documents, doc_ids): def retrieve(self, query: str, top_k: int = 5, min_score: float = 0.3) -> List[Dict[str, Any]]:
self.documents.append({"id": did, "content": doc.page_content, "metadata": doc.metadata}) """
self.doc_ids.append(did) 根据查询检索相关文档块
def retrieve(self, query: str, top_k: int = 5) -> List[Dict[str, Any]]: Args:
"""根据查询检索相关文档""" query: 查询文本
top_k: 返回的最大结果数
min_score: 最低相似度分数阈值
Returns:
相关文档块列表,每项包含 content, metadata, score, doc_id, chunk_index
"""
if self._disabled: if self._disabled:
logger.info(f"[RAG DISABLED] 检索操作已跳过: query={query}, top_k={top_k}") logger.info(f"[RAG DISABLED] 检索操作已跳过: query={query}, top_k={top_k}")
return [] return []
@@ -175,28 +301,113 @@ class RAGService:
if not self._initialized: if not self._initialized:
self._init_vector_store() self._init_vector_store()
if self.index is None or self.index.ntotal == 0: # 优先使用向量检索
if self.index is not None and self.index.ntotal > 0 and self.embedding_model is not None:
try:
query_embedding = self.embedding_model.encode([query], convert_to_numpy=True)
query_embedding = self._normalize_vectors(query_embedding).astype('float32')
scores, indices = self.index.search(query_embedding, min(top_k, self.index.ntotal))
results = []
for score, idx in zip(scores[0], indices[0]):
if idx < 0:
continue
if score < min_score:
continue
doc = self.documents[idx]
results.append({
"content": doc["content"],
"metadata": doc["metadata"],
"score": float(score),
"doc_id": doc["id"],
"chunk_index": doc["metadata"].get("chunk_index", 0)
})
if results:
logger.debug(f"向量检索到 {len(results)} 条相关文档块")
return results
except Exception as e:
logger.warning(f"向量检索失败,使用关键词搜索后备: {e}")
# Fallback: use keyword search
logger.debug("Falling back to keyword search")
return self._keyword_search(query, top_k)
def _keyword_search(self, query: str, top_k: int = 5) -> List[Dict[str, Any]]:
"""
Keyword-search fallback
Args:
query: Query text
top_k: Maximum number of results to return
Returns:
List of relevant document chunks
"""
if not self.documents:
return [] return []
query_embedding = self.embedding_model.encode([query], convert_to_numpy=True) # 提取查询关键词
query_embedding = self._normalize_vectors(query_embedding).astype('float32') keywords = []
for char in query:
if '\u4e00' <= char <= '\u9fff': # 中文字符
keywords.append(char)
# 添加英文单词
import re
english_words = re.findall(r'[a-zA-Z]+', query)
keywords.extend(english_words)
scores, indices = self.index.search(query_embedding, min(top_k, self.index.ntotal)) if not keywords:
return []
results = [] results = []
for score, idx in zip(scores[0], indices[0]): for doc in self.documents:
if idx < 0: content = doc["content"]
continue # 计算关键词匹配分数
doc = self.documents[idx] score = 0
results.append({ matched_keywords = 0
"content": doc["content"], for kw in keywords:
"metadata": doc["metadata"], if kw in content:
"score": float(score), score += 1
"doc_id": doc["id"] matched_keywords += 1
})
logger.debug(f"检索到 {len(results)} 条相关文档") if matched_keywords > 0:
return results # 归一化分数
score = score / max(len(keywords), 1)
results.append({
"content": content,
"metadata": doc["metadata"],
"score": score,
"doc_id": doc["id"],
"chunk_index": doc["metadata"].get("chunk_index", 0)
})
# 按分数排序
results.sort(key=lambda x: x["score"], reverse=True)
logger.debug(f"关键词搜索返回 {len(results[:top_k])} 条结果")
return results[:top_k]
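The scoring used by `_keyword_search` treats every CJK character and every English word in the query as a keyword, then scores a chunk by the fraction of keywords it contains. A minimal sketch of that scoring, with the hypothetical name `keyword_score`:

```python
import re
from typing import List


def keyword_score(query: str, content: str) -> float:
    """Hypothetical helper: fraction of query keywords that occur in content."""
    # Each CJK character in the query counts as its own keyword
    keywords: List[str] = [ch for ch in query if "\u4e00" <= ch <= "\u9fff"]
    # English words are matched as whole runs of ASCII letters
    keywords += re.findall(r"[a-zA-Z]+", query)
    if not keywords:
        return 0.0
    matched = sum(1 for kw in keywords if kw in content)
    return matched / len(keywords)
```

Because the score is normalized by the number of keywords, it stays in the range 0 to 1, but it is not comparable to the cosine similarity returned by the vector path, so a shared `min_score` threshold would need separate tuning.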
def retrieve_by_doc_id(self, doc_id: str, top_k: int = 10) -> List[Dict[str, Any]]:
"""
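Get the chunks of a given document, ordered by chunk_index
Args:
doc_id: Document ID
top_k: Maximum number of chunks to return
Returns:
Up to top_k chunks of the document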
"""
# Collect all chunks that belong to this document
doc_chunks = [d for d in self.documents if d["metadata"].get("doc_id") == doc_id]
# Sort by chunk_index
doc_chunks.sort(key=lambda x: x["metadata"].get("chunk_index", 0))
# Return at most top_k chunks
return doc_chunks[:top_k]
def retrieve_by_table(self, table_name: str, top_k: int = 5) -> List[Dict[str, Any]]: def retrieve_by_table(self, table_name: str, top_k: int = 5) -> List[Dict[str, Any]]:
"""检索指定表的字段""" """检索指定表的字段"""

File diff suppressed because it is too large


@@ -0,0 +1,637 @@
"""
Word document AI parsing service
Uses an LLM (GLM) to analyze Word documents in depth and extract structured data
"""
import logging
from typing import Dict, Any, List, Optional
import json
from app.services.llm_service import llm_service
from app.core.document_parser.docx_parser import DocxParser
logger = logging.getLogger(__name__)
class WordAIService:
"""Word document AI parsing service"""
def __init__(self):
self.llm = llm_service
self.parser = DocxParser()
async def parse_word_with_ai(
self,
file_path: str,
user_hint: str = ""
) -> Dict[str, Any]:
"""
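Parse a Word document with AI and extract structured data
Suitable for extracting table data, key-value pairs, and similar information from unstructured Word documents
Args:
file_path: Path to the Word file
user_hint: User hint specifying the kind of content to extract
Returns:
Dict: Parse result containing the structured data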
"""
try:
# 1. Extract the raw content with the basic parser first
parse_result = self.parser.parse(file_path)
if not parse_result.success:
return {
"success": False,
"error": parse_result.error,
"structured_data": None
}
# 2. Read the raw data
raw_data = parse_result.data
paragraphs = raw_data.get("paragraphs", [])
paragraphs_with_style = raw_data.get("paragraphs_with_style", [])
tables = raw_data.get("tables", [])
content = raw_data.get("content", "")
images_info = raw_data.get("images", {})
metadata = parse_result.metadata or {}
image_count = images_info.get("image_count", 0)
image_descriptions = images_info.get("descriptions", [])
logger.info(f"Word 基础解析完成: {len(paragraphs)} 个段落, {len(tables)} 个表格, {image_count} 张图片")
# 3. Extract the image data (used for vision analysis)
images_base64 = []
if image_count > 0:
try:
images_base64 = self.parser.extract_images_as_base64(file_path)
logger.info(f"提取到 {len(images_base64)} 张图片的 base64 数据")
except Exception as e:
logger.warning(f"提取图片 base64 失败: {str(e)}")
# 4. Choose the AI parsing strategy based on the content type
# If there are images, analyze them first
image_analysis = ""
if images_base64:
image_analysis = await self._analyze_images_with_ai(images_base64, user_hint)
logger.info(f"图片 AI 分析完成: {len(image_analysis)} 字符")
# Priority: tables > (tables + text) > plain text
if tables and len(tables) > 0:
structured_data = await self._extract_tables_with_ai(
tables, paragraphs, image_count, user_hint, metadata, image_analysis
)
elif paragraphs and len(paragraphs) > 0:
structured_data = await self._extract_from_text_with_ai(
paragraphs, content, image_count, image_descriptions, user_hint, image_analysis
)
else:
structured_data = {
"success": True,
"type": "empty",
"message": "文档内容为空"
}
# 添加图片分析结果
if image_analysis:
structured_data["image_analysis"] = image_analysis
return structured_data
except Exception as e:
logger.error(f"AI 解析 Word 文档失败: {str(e)}")
return {
"success": False,
"error": str(e),
"structured_data": None
}
async def _extract_tables_with_ai(
self,
tables: List[Dict],
paragraphs: List[str],
image_count: int,
user_hint: str,
metadata: Dict,
image_analysis: str = ""
) -> Dict[str, Any]:
"""
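Use AI to extract structured data from Word tables and text
Args:
tables: List of tables
paragraphs: List of paragraphs
image_count: Number of images
user_hint: User hint
metadata: Document metadata
image_analysis: Result of the AI image analysis
Returns:
Structured data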
"""
try:
# Build a text description of the tables
tables_text = self._build_tables_description(tables)
# Build the paragraph description
paragraphs_text = "\n".join(paragraphs[:50]) if paragraphs else "(no body text)"
if len(paragraphs) > 50:
paragraphs_text += f"\n...({len(paragraphs)} paragraphs in total; only the first 50 are shown)"
# Image hint
image_hint = f"Note: this document contains {image_count} images/charts." if image_count > 0 else ""
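prompt = f"""You are a professional data extraction expert. Extract structured data from the full content of the following Word document.
[User requirements]
{user_hint if user_hint else "Extract all structured data in the document, including table data, key-value pairs, and list items."}
[Document body (paragraphs)]
{paragraphs_text}
[Document tables]
{tables_text}
[Document image information]
{image_hint}
Output in the following JSON format:
{{
"type": "table_data",
"headers": ["column 1", "column 2", ...],
"rows": [["row 1 col 1", "row 1 col 2", ...], ["row 2 col 1", "row 2 col 2", ...], ...],
"key_values": {{"key 1": "value 1", "key 2": "value 2", ...}},
"list_items": ["item 1", "item 2", ...],
"description": "Description of the document content"
}}
Key points:
- Prefer extracting structured data from the tables
- If a table has a header row, headers holds the header and rows holds the data rows
- If the document contains key-value pairs (e.g. Name: Zhang San), put them in key_values
- If the document contains list items, put them in list_items
- Image content cannot be extracted directly, but state the general subject of the images in description (e.g. "contains a flowchart", "contains a data chart")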
"""
messages = [
{"role": "system", "content": "You are a professional data extraction assistant. Output strictly in JSON format."},
{"role": "user", "content": prompt}
]
response = await self.llm.chat(
messages=messages,
temperature=0.1,
max_tokens=50000
)
content = self.llm.extract_message_content(response)
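# Parse the JSON response
result = self._parse_json_response(content)
if result:
logger.info(f"AI table extraction succeeded: {len(result.get('rows', []))} rows")
return {
"success": True,
"type": "table_data",
"headers": result.get("headers", []),
"rows": result.get("rows", []),
# Also pass through the key-value pairs and list items requested in the prompt
"key_values": result.get("key_values", {}),
"list_items": result.get("list_items", []),
"description": result.get("description", "")
}
else:
# If the AI response is not in the expected format, fall back to parsing the tables directly
return self._fallback_table_parse(tables)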
except Exception as e:
logger.error(f"AI 表格提取失败: {str(e)}")
return self._fallback_table_parse(tables)
async def _extract_from_text_with_ai(
self,
paragraphs: List[str],
full_text: str,
image_count: int,
image_descriptions: List[str],
user_hint: str,
image_analysis: str = ""
) -> Dict[str, Any]:
"""
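Use AI to extract structured data from the plain text of a Word document
Args:
paragraphs: List of paragraphs
full_text: Full text
image_count: Number of images
image_descriptions: List of image descriptions
user_hint: User hint
image_analysis: Result of the AI image analysis
Returns:
Structured data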
"""
try:
# Limit the text length
text_preview = full_text[:8000] if len(full_text) > 8000 else full_text
# Image hint
image_hint = f"\n[Document images] This document contains {image_count} images/charts." if image_count > 0 else ""
if image_descriptions:
image_hint += "\n" + "\n".join(image_descriptions)
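prompt = f"""You are a professional data extraction expert. Extract structured data from the full content of the following Word document.
[User requirements]
{user_hint if user_hint else "Identify and extract the key information in the document, including table data, key-value pairs, and list items."}
[Document body]{image_hint}
{text_preview}
Output in the following JSON format:
{{
"type": "structured_text",
"tables": [{{"headers": [...], "rows": [...]}}],
"key_values": {{"key 1": "value 1", "key 2": "value 2", ...}},
"list_items": ["item 1", "item 2", ...],
"summary": "Summary of the document content"
}}
Key points:
- If the document contains table data, put it in tables
- If the document contains key-value pairs (e.g. Name: Zhang San), put them in key_values
- If the document contains list items, put them in list_items
- If the document contains images, infer their content from context (e.g. "flowchart", "line chart of data") and describe it in summary
- If no structured data can be extracted, provide at least a detailed summary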
"""
messages = [
{"role": "system", "content": "你是一个专业的数据提取助手。请严格按JSON格式输出。"},
{"role": "user", "content": prompt}
]
response = await self.llm.chat(
messages=messages,
temperature=0.1,
max_tokens=50000
)
content = self.llm.extract_message_content(response)
result = self._parse_json_response(content)
if result:
logger.info(f"AI 文本提取成功: type={result.get('type')}")
return {
"success": True,
"type": result.get("type", "structured_text"),
"tables": result.get("tables", []),
"key_values": result.get("key_values", {}),
"list_items": result.get("list_items", []),
"summary": result.get("summary", ""),
"raw_text_preview": text_preview[:500]
}
else:
return {
"success": True,
"type": "text",
"summary": text_preview[:500],
"raw_text_preview": text_preview[:500]
}
except Exception as e:
logger.error(f"AI 文本提取失败: {str(e)}")
return {
"success": False,
"error": str(e)
}
    async def _analyze_images_with_ai(
        self,
        images: List[Dict[str, str]],
        user_hint: str = ""
    ) -> str:
        """
        Analyze the images in a Word document with a vision model.

        Args:
            images: list of images; each item holds base64 and mime_type
            user_hint: user hint

        Returns:
            Text of the image analysis result
        """
        try:
            # Call the LLM's vision analysis
            result = await self.llm.analyze_images(
                images=images,
                user_prompt=user_hint or "请详细描述图片内容,提取所有文字和数据信息。"
            )
            if result.get("success"):
                analysis = result.get("analysis", {})
                if isinstance(analysis, dict):
                    description = analysis.get("description", "")
                    text_content = analysis.get("text_content", "")
                    data_extracted = analysis.get("data_extracted", {})
                    result_text = f"【图片分析结果】\n{description}"
                    if text_content:
                        result_text += f"\n\n【图片中的文字】\n{text_content}"
                    if data_extracted:
                        result_text += f"\n\n【提取的数据】\n{json.dumps(data_extracted, ensure_ascii=False)}"
                    return result_text
                else:
                    return str(analysis)
            else:
                logger.warning(f"图片 AI 分析失败: {result.get('error')}")
                return ""
        except Exception as e:
            logger.error(f"图片 AI 分析异常: {str(e)}")
            return ""
    def _build_tables_description(self, tables: List[Dict]) -> str:
        """Build a plain-text description of the tables."""
        result = []
        for idx, table in enumerate(tables):
            rows = table.get("rows", [])
            if not rows:
                continue
            result.append(f"\n--- 表格 {idx + 1} ---")
            for row in rows[:50]:  # at most 50 rows per table
                if isinstance(row, list):
                    result.append(" | ".join(str(cell).strip() for cell in row))
                elif isinstance(row, dict):
                    result.append(str(row))
            if len(rows) > 50:
                result.append(f"...(共 {len(rows)} 行,仅显示前50行)")
        return "\n".join(result) if result else "(无表格内容)"
    def _parse_json_response(self, content: str) -> Optional[Dict]:
        """Parse a JSON response, tolerating common formatting problems."""
        import re
        # Strip Markdown code-fence markers
        cleaned = content.strip()
        cleaned = re.sub(r'^```json\s*', '', cleaned, flags=re.MULTILINE)
        cleaned = re.sub(r'^```\s*', '', cleaned, flags=re.MULTILINE)
        cleaned = cleaned.strip()
        # Locate the start of the JSON object
        json_start = cleaned.find('{')
        if json_start == -1:
            logger.warning("无法找到 JSON 开始位置")
            return None
        json_text = cleaned[json_start:]
        # Try to parse as-is first
        try:
            return json.loads(json_text)
        except json.JSONDecodeError:
            pass
        # Try to repair, then parse
        try:
            # Find the brace that closes the first object
            # (braces inside string values are counted too)
            depth = 0
            end_pos = -1
            for i, c in enumerate(json_text):
                if c == '{':
                    depth += 1
                elif c == '}':
                    depth -= 1
                    if depth == 0:
                        end_pos = i + 1
                        break
            if end_pos > 0:
                fixed = json_text[:end_pos]
                # Remove trailing commas before a closing brace or bracket
                fixed = re.sub(r',\s*([}\]])', r'\1', fixed)
                return json.loads(fixed)
        except Exception as e:
            logger.warning(f"JSON 修复失败: {e}")
        return None
    def _fallback_table_parse(self, tables: List[Dict]) -> Dict[str, Any]:
        """Parse the tables directly when AI parsing fails."""
        if not tables:
            return {
                "success": True,
                "type": "empty",
                "data": {},
                "message": "无表格内容"
            }
        all_rows = []
        all_headers = None
        for table in tables:
            rows = table.get("rows", [])
            if not rows:
                continue
            # Find the real header row (skip title rows)
            header_row_idx = 0
            for idx, row in enumerate(rows[:5]):  # only check the first 5 rows
                if not isinstance(row, list):
                    continue
                # A first cell that starts with "表" and is long is probably a title row
                first_cell = str(row[0]) if row else ""
                if first_cell.startswith("表") and len(first_cell) > 15:
                    header_row_idx = idx + 1
                    continue
                # A row with more than 3 empty cells is probably not a valid header
                empty_count = sum(1 for cell in row if not str(cell).strip())
                if empty_count > 3:
                    header_row_idx = idx + 1
                    continue
                # Take the first row that looks like a header (short cells, mostly filled)
                avg_len = sum(len(str(c)) for c in row) / len(row) if row else 0
                if avg_len < 20:  # headers are usually shorter than data rows
                    header_row_idx = idx
                    break
            if header_row_idx >= len(rows):
                continue
            # Use the header row that was found
            if rows and isinstance(rows[header_row_idx], list):
                headers = rows[header_row_idx]
                if all_headers is None:
                    all_headers = headers
                # Data rows start right after the header
                for row in rows[header_row_idx + 1:]:
                    if isinstance(row, list) and len(row) == len(headers):
                        all_rows.append(row)
        if all_headers and all_rows:
            return {
                "success": True,
                "type": "table_data",
                "headers": all_headers,
                "rows": all_rows,
                "description": "直接从 Word 表格提取"
            }
        return {
            "success": True,
            "type": "raw",
            "tables": tables,
            "message": "表格数据未AI处理"
        }
    async def fill_template_with_ai(
        self,
        file_path: str,
        template_fields: List[Dict[str, Any]],
        user_hint: str = ""
    ) -> Dict[str, Any]:
        """
        Parse a Word document with AI and fill in a template.

        This is the main entry point; the frontend calls it to:
        1. Parse the Word document with AI
        2. Extract data according to the template fields
        3. Return the fill result

        Args:
            file_path: path to the Word file
            template_fields: list of template fields [{"name": "field name", "hint": "hint text"}, ...]
            user_hint: user hint

        Returns:
            The fill result
        """
        try:
            # 1. Parse the document with AI
            parse_result = await self.parse_word_with_ai(file_path, user_hint)
            if not parse_result.get("success"):
                return {
                    "success": False,
                    "error": parse_result.get("error", "解析失败"),
                    "filled_data": {},
                    "source": "ai_parse_failed"
                }
            # 2. Extract data according to the parse type
            filled_data = {}
            extract_details = []
            parse_type = parse_result.get("type", "")
            if parse_type == "table_data":
                # Table data: match column names directly
                headers = parse_result.get("headers", [])
                rows = parse_result.get("rows", [])
                for field in template_fields:
                    field_name = field.get("name", "")
                    values = self._extract_field_from_table(headers, rows, field_name)
                    filled_data[field_name] = values
                    extract_details.append({
                        "field": field_name,
                        "values": values,
                        "source": "ai_table_extraction",
                        "confidence": 0.9 if values else 0.0
                    })
            elif parse_type == "structured_text":
                # Structured text: try key_values first, then fall back to list_items
                key_values = parse_result.get("key_values", {})
                list_items = parse_result.get("list_items", [])
                for field in template_fields:
                    field_name = field.get("name", "")
                    value = key_values.get(field_name, "")
                    if not value and list_items:
                        value = list_items[0]
                    filled_data[field_name] = [value] if value else []
                    extract_details.append({
                        "field": field_name,
                        "values": [value] if value else [],
                        "source": "ai_text_extraction",
                        "confidence": 0.7 if value else 0.0
                    })
            else:
                # Other types: return empty values and leave the raw parse result for later handling
                for field in template_fields:
                    field_name = field.get("name", "")
                    filled_data[field_name] = []
                    extract_details.append({
                        "field": field_name,
                        "values": [],
                        "source": "no_ai_data",
                        "confidence": 0.0
                    })
            # 3. Return the result
            max_rows = max(len(v) for v in filled_data.values()) if filled_data else 1
            return {
                "success": True,
                "filled_data": filled_data,
                "fill_details": extract_details,
                "ai_parse_result": {
                    "type": parse_type,
                    "description": parse_result.get("description", "")
                },
                "source_doc_count": 1,
                "max_rows": max_rows
            }
        except Exception as e:
            logger.error(f"AI 填表失败: {str(e)}")
            return {
                "success": False,
                "error": str(e),
                "filled_data": {},
                "fill_details": []
            }
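    def _extract_field_from_table(
        self,
        headers: List[str],
        rows: List[List],
        field_name: str
    ) -> List[str]:
        """Extract the values of the given field from a table."""
        # An empty name would match every header in the substring test below
        target = field_name.strip().lower()
        if not target:
            return []
        # Find the matching column
        target_col_idx = None
        for col_idx, header in enumerate(headers):
            header_text = str(header).strip().lower()
            if not header_text:
                continue  # an empty header would match any field name
            if target in header_text or header_text in target:
                target_col_idx = col_idx
                break
        if target_col_idx is None:
            return []
        # Collect every non-empty value in that column
        values = []
        for row in rows:
            if isinstance(row, list) and target_col_idx < len(row):
                val = str(row[target_col_idx]).strip()
                if val:
                    values.append(val)
        return values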
# Global singleton
word_ai_service = WordAIService()
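
The lenient JSON parsing used by `_parse_json_response` can be sketched as a standalone function. This is a minimal illustration under stated assumptions, not the service's own code: the name `parse_llm_json` is hypothetical, and the brace scan here skips string literals, which the original method does not do.

```python
import json
import re
from typing import Any, Dict, Optional


def parse_llm_json(content: str) -> Optional[Dict[str, Any]]:
    """Best-effort extraction of the first JSON object from an LLM reply.

    Hypothetical helper for illustration; not part of the original service.
    """
    # Drop Markdown fence markers (``` or ```json) at the start of a line.
    cleaned = re.sub(r"^```(?:json)?\s*", "", content.strip(), flags=re.MULTILINE).strip()

    start = cleaned.find("{")
    if start == -1:
        return None
    text = cleaned[start:]

    # Fast path: the reply is already valid JSON.
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        pass

    # Walk to the brace that closes the first object, ignoring braces
    # that appear inside string literals.
    depth = 0
    in_string = False
    escaped = False
    end = -1
    for i, ch in enumerate(text):
        if in_string:
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
            continue
        if ch == '"':
            in_string = True
        elif ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                end = i + 1
                break
    if end == -1:
        return None

    # Remove trailing commas before a closing brace or bracket.
    # Caveat: this regex is not string-aware, so a literal ",}" inside a
    # string value would also be rewritten.
    candidate = re.sub(r",\s*([}\]])", r"\1", text[:end])
    try:
        return json.loads(candidate)
    except json.JSONDecodeError:
        return None
```

Returning `None` on any failure keeps the caller's fallback path simple: the caller can switch to direct table parsing without distinguishing between "no JSON found" and "JSON could not be repaired".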


@@ -1,4 +1,4 @@
# ============================================================
# LLM-based document understanding and multi-source data fusion system
# Python dependency list
# ============================================================


@@ -1,5 +1,5 @@
import { RouterProvider } from 'react-router-dom';
-import { AuthProvider } from '@/context/AuthContext';
+import { AuthProvider } from '@/contexts/AuthContext';
import { TemplateFillProvider } from '@/context/TemplateFillContext';
import { router } from '@/routes';
import { Toaster } from 'sonner';


@@ -1,6 +1,6 @@
import React from 'react';
import { Navigate, useLocation } from 'react-router-dom';
-import { useAuth } from '@/context/AuthContext';
+import { useAuth } from '@/contexts/AuthContext';
export const RouteGuard: React.FC<{ children: React.ReactNode }> = ({ children }) => {
const { user, loading } = useAuth();


@@ -1,85 +0,0 @@
import React, { createContext, useContext, useEffect, useState } from 'react';
import { supabase } from '@/db/supabase';
import { User } from '@supabase/supabase-js';
import { Profile } from '@/types/types';
interface AuthContextType {
user: User | null;
profile: Profile | null;
signIn: (email: string, password: string) => Promise<{ error: any }>;
signUp: (email: string, password: string) => Promise<{ error: any }>;
signOut: () => Promise<{ error: any }>;
loading: boolean;
}
const AuthContext = createContext<AuthContextType | undefined>(undefined);
export const AuthProvider: React.FC<{ children: React.ReactNode }> = ({ children }) => {
const [user, setUser] = useState<User | null>(null);
const [profile, setProfile] = useState<Profile | null>(null);
const [loading, setLoading] = useState(true);
useEffect(() => {
// Check active sessions and sets the user
supabase.auth.getSession().then(({ data: { session } }) => {
setUser(session?.user ?? null);
if (session?.user) fetchProfile(session.user.id);
else setLoading(false);
});
// Listen for changes on auth state (sign in, sign out, etc.)
const { data: { subscription } } = supabase.auth.onAuthStateChange((_event, session) => {
setUser(session?.user ?? null);
if (session?.user) fetchProfile(session.user.id);
else {
setProfile(null);
setLoading(false);
}
});
return () => subscription.unsubscribe();
}, []);
const fetchProfile = async (uid: string) => {
try {
const { data, error } = await supabase
.from('profiles')
.select('*')
.eq('id', uid)
.maybeSingle();
if (error) throw error;
setProfile(data);
} catch (err) {
console.error('Error fetching profile:', err);
} finally {
setLoading(false);
}
};
const signIn = async (email: string, password: string) => {
return await supabase.auth.signInWithPassword({ email, password });
};
const signUp = async (email: string, password: string) => {
return await supabase.auth.signUp({ email, password });
};
const signOut = async () => {
return await supabase.auth.signOut();
};
return (
<AuthContext.Provider value={{ user, profile, signIn, signUp, signOut, loading }}>
{children}
</AuthContext.Provider>
);
};
export const useAuth = () => {
const context = useContext(AuthContext);
if (context === undefined) {
throw new Error('useAuth must be used within an AuthProvider');
}
return context;
};


@@ -400,6 +400,49 @@ export const backendApi = {
}
},
/**
* Get the task history list
*/
async getTasks(
limit: number = 50,
skip: number = 0
): Promise<{ success: boolean; tasks: any[]; count: number }> {
const url = `${BACKEND_BASE_URL}/tasks?limit=${limit}&skip=${skip}`;
try {
const response = await fetch(url);
if (!response.ok) {
const error = await response.json();
throw new Error(error.detail || '获取任务列表失败');
}
return await response.json();
} catch (error) {
console.error('获取任务列表失败:', error);
throw error;
}
},
/**
* Delete a task
*/
async deleteTask(taskId: string): Promise<{ success: boolean; deleted: boolean }> {
const url = `${BACKEND_BASE_URL}/tasks/${taskId}`;
try {
const response = await fetch(url, {
method: 'DELETE'
});
if (!response.ok) {
const error = await response.json();
throw new Error(error.detail || '删除任务失败');
}
return await response.json();
} catch (error) {
console.error('删除任务失败:', error);
throw error;
}
},
/**
* Poll the task status until it completes
*/
@@ -764,6 +807,41 @@ export const backendApi = {
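}
},
/**
* Fill the original template and export it
*
* Opens the original template file directly, writes the data into the template's tables/cells, then exports it
* Intended for the competition scenario: keeps the original template formatting unchanged
*/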
async fillAndExportTemplate(
templatePath: string,
filledData: Record<string, any>,
format: 'xlsx' | 'docx' = 'xlsx'
): Promise<Blob> {
const url = `${BACKEND_BASE_URL}/templates/fill-and-export`;
try {
const response = await fetch(url, {
method: 'POST',
headers: { 'Content-Type': 'application/json' },
body: JSON.stringify({
template_path: templatePath,
filled_data: filledData,
format,
}),
});
if (!response.ok) {
const error = await response.json();
throw new Error(error.detail || '填充模板失败');
}
return await response.blob();
} catch (error) {
console.error('填充模板失败:', error);
throw error;
}
},
// ==================== Excel-specific endpoints (kept for compatibility) ====================
/**
@@ -1145,7 +1223,7 @@ export const aiApi = {
try {
const response = await fetch(url, {
-method: 'GET',
+method: 'POST',
body: formData,
});
@@ -1161,6 +1239,48 @@ export const aiApi = {
}
},
/**
* Upload a TXT file and analyze it with AI to extract structured data
*/
async analyzeTxt(
file: File
): Promise<{
success: boolean;
filename?: string;
structured_data?: {
table?: {
columns?: string[];
rows?: string[][];
};
summary?: string;
key_value_pairs?: Array<{ key: string; value: string }>;
numeric_data?: Array<{ name: string; value: number; unit?: string }>;
};
error?: string;
}> {
const formData = new FormData();
formData.append('file', file);
const url = `${BACKEND_BASE_URL}/ai/analyze/txt`;
try {
const response = await fetch(url, {
method: 'POST',
body: formData,
});
if (!response.ok) {
const error = await response.json();
throw new Error(error.detail || 'TXT AI 分析失败');
}
return await response.json();
} catch (error) {
console.error('TXT AI 分析失败:', error);
throw error;
}
},
/**
* Generate statistics and charts
*/
@@ -1259,4 +1379,84 @@ export const aiApi = {
throw error;
}
},
// ==================== Word AI parsing ====================
/**
* Parse a Word document with AI and extract structured data
*/
async analyzeWordWithAI(
file: File,
userHint: string = ''
): Promise<{
success: boolean;
type?: string;
headers?: string[];
rows?: string[][];
key_values?: Record<string, string>;
list_items?: string[];
summary?: string;
error?: string;
}> {
const formData = new FormData();
formData.append('file', file);
if (userHint) {
formData.append('user_hint', userHint);
}
const url = `${BACKEND_BASE_URL}/ai/analyze/word`;
try {
const response = await fetch(url, {
method: 'POST',
body: formData,
});
if (!response.ok) {
const error = await response.json();
throw new Error(error.detail || 'Word AI 解析失败');
}
return await response.json();
} catch (error) {
console.error('Word AI 解析失败:', error);
throw error;
}
},
/**
* Parse a Word document with AI and fill in a template
* Does both steps in one call: AI parsing + template filling
*/
async fillTemplateFromWordAI(
file: File,
templateFields: TemplateField[],
userHint: string = ''
): Promise<FillResult> {
const formData = new FormData();
formData.append('file', file);
formData.append('template_fields', JSON.stringify(templateFields));
if (userHint) {
formData.append('user_hint', userHint);
}
const url = `${BACKEND_BASE_URL}/ai/analyze/word/fill-template`;
try {
const response = await fetch(url, {
method: 'POST',
body: formData,
});
if (!response.ok) {
const error = await response.json();
throw new Error(error.detail || 'Word AI 填表失败');
}
return await response.json();
} catch (error) {
console.error('Word AI 填表失败:', error);
throw error;
}
},
};


@@ -1,4 +1,4 @@
-import React, { useState, useEffect, useCallback } from 'react';
+import React, { useState, useEffect, useCallback, useRef } from 'react';
import { useDropzone } from 'react-dropzone';
import {
FileText,
@@ -23,7 +23,8 @@ import {
List,
MessageSquareCode,
Tag,
-HelpCircle
+HelpCircle,
+Plus
} from 'lucide-react';
import { Button } from '@/components/ui/button';
import { Input } from '@/components/ui/input';
@@ -72,8 +73,10 @@ const Documents: React.FC = () => {
// Upload state
const [uploading, setUploading] = useState(false);
const [uploadedFile, setUploadedFile] = useState<File | null>(null);
+const [uploadedFiles, setUploadedFiles] = useState<File[]>([]);
const [parseResult, setParseResult] = useState<ExcelParseResult | null>(null);
const [expandedSheet, setExpandedSheet] = useState<string | null>(null);
+const [uploadExpanded, setUploadExpanded] = useState(false);
// AI analysis state
const [analyzing, setAnalyzing] = useState(false);
@@ -210,75 +213,119 @@ const Documents: React.FC = () => {
// File upload handling
const onDrop = async (acceptedFiles: File[]) => {
-const file = acceptedFiles[0];
-if (!file) return;
-setUploadedFile(file);
+if (acceptedFiles.length === 0) return;
setUploading(true);
-setParseResult(null);
-setAiAnalysis(null);
-setAnalysisCharts(null);
-setExpandedSheet(null);
-setMdAnalysis(null);
-setMdSections([]);
-setMdStreamingContent('');
-const ext = file.name.split('.').pop()?.toLowerCase();
+let successCount = 0;
+let failCount = 0;
+const successfulFiles: File[] = [];
+// Upload the files one by one
+for (const file of acceptedFiles) {
+const ext = file.name.split('.').pop()?.toLowerCase();
try {
-// Excel files use a dedicated upload endpoint
-if (ext === 'xlsx' || ext === 'xls') {
-const result = await backendApi.uploadExcel(file, {
-parseAllSheets: parseOptions.parseAllSheets,
-headerRow: parseOptions.headerRow
-});
-if (result.success) {
-toast.success(`解析成功: ${file.name}`);
-setParseResult(result);
-loadDocuments(); // refresh the document list
-if (result.metadata?.sheet_count === 1) {
-setExpandedSheet(Object.keys(result.data?.sheets || {})[0] || null);
+if (ext === 'xlsx' || ext === 'xls') {
+const result = await backendApi.uploadExcel(file, {
+parseAllSheets: parseOptions.parseAllSheets,
+headerRow: parseOptions.headerRow
+});
+if (result.success) {
+successCount++;
+successfulFiles.push(file);
+// The first Excel file sets the parse result used for the preview
+if (successCount === 1) {
+setUploadedFile(file);
+setParseResult(result);
+if (result.metadata?.sheet_count === 1) {
+setExpandedSheet(Object.keys(result.data?.sheets || {})[0] || null);
+}
+}
+loadDocuments();
+} else {
+failCount++;
+toast.error(`${file.name}: ${result.error || '解析失败'}`);
+}
+} else if (ext === 'md' || ext === 'markdown') {
+const result = await backendApi.uploadDocument(file);
+if (result.task_id) {
+successCount++;
+successfulFiles.push(file);
+if (successCount === 1) {
+setUploadedFile(file);
+}
+// Poll the task status
+let attempts = 0;
+const checkStatus = async () => {
+while (attempts < 30) {
+try {
+const status = await backendApi.getTaskStatus(result.task_id);
+if (status.status === 'success') {
+loadDocuments();
+return;
+} else if (status.status === 'failure') {
+return;
+}
+} catch (e) {
+console.error('检查状态失败', e);
+}
+await new Promise(resolve => setTimeout(resolve, 2000));
+attempts++;
+}
+};
+checkStatus();
+} else {
+failCount++;
}
} else {
-toast.error(result.error || '解析失败');
-}
-} else if (ext === 'md' || ext === 'markdown') {
-// Markdown file: fetch the outline
-await fetchMdOutline();
-} else {
-// Other documents use the generic upload endpoint
-const result = await backendApi.uploadDocument(file);
-if (result.task_id) {
-toast.success(`文件 ${file.name} 已提交处理`);
-// Poll the task status
-let attempts = 0;
-const checkStatus = async () => {
-while (attempts < 30) {
-try {
-const status = await backendApi.getTaskStatus(result.task_id);
-if (status.status === 'success') {
-toast.success(`文件 ${file.name} 处理完成`);
-loadDocuments();
-return;
-} else if (status.status === 'failure') {
-toast.error(`文件 ${file.name} 处理失败`);
-return;
-}
-} catch (e) {
-console.error('检查状态失败', e);
-}
-await new Promise(resolve => setTimeout(resolve, 2000));
-attempts++;
+// Other documents use the generic upload endpoint
+const result = await backendApi.uploadDocument(file);
+if (result.task_id) {
+successCount++;
+successfulFiles.push(file);
+if (successCount === 1) {
+setUploadedFile(file);
}
-toast.error(`文件 ${file.name} 处理超时`);
-};
-checkStatus();
+// Poll the task status
+let attempts = 0;
+const checkStatus = async () => {
+while (attempts < 30) {
+try {
+const status = await backendApi.getTaskStatus(result.task_id);
+if (status.status === 'success') {
+loadDocuments();
+return;
+} else if (status.status === 'failure') {
+return;
+}
+} catch (e) {
+console.error('检查状态失败', e);
+}
+await new Promise(resolve => setTimeout(resolve, 2000));
+attempts++;
+}
+};
+checkStatus();
+} else {
+failCount++;
+}
}
+} catch (error: any) {
+failCount++;
+toast.error(`${file.name}: ${error.message || '上传失败'}`);
}
-} catch (error: any) {
-toast.error(error.message || '上传失败');
-} finally {
-setUploading(false);
+}
+setUploading(false);
+loadDocuments();
+if (successCount > 0) {
+toast.success(`成功上传 ${successCount} 个文件`);
+setUploadedFiles(prev => [...prev, ...successfulFiles]);
+setUploadExpanded(true);
+}
+if (failCount > 0) {
+toast.error(`${failCount} 个文件上传失败`);
}
};
@@ -291,7 +338,7 @@ const Documents: React.FC = () => {
'text/markdown': ['.md'],
'text/plain': ['.txt']
},
-maxFiles: 1
+multiple: true
});
// AI analysis handling
@@ -449,6 +496,7 @@ const Documents: React.FC = () => {
const handleDeleteFile = () => {
setUploadedFile(null);
+setUploadedFiles([]);
setParseResult(null);
setAiAnalysis(null);
setAnalysisCharts(null);
@@ -456,6 +504,17 @@ const Documents: React.FC = () => {
toast.success('文件已清除');
};
+const handleRemoveUploadedFile = (index: number) => {
+setUploadedFiles(prev => {
+const newFiles = prev.filter((_, i) => i !== index);
+if (newFiles.length === 0) {
+setUploadedFile(null);
+}
+return newFiles;
+});
+toast.success('文件已从列表移除');
+};
const handleDelete = async (docId: string) => {
try {
const result = await backendApi.deleteDocument(docId);
@@ -615,7 +674,7 @@ const Documents: React.FC = () => {
<h1 className="text-3xl font-extrabold tracking-tight"></h1>
<p className="text-muted-foreground">使 AI </p>
</div>
-<Button variant="outline" className="rounded-xl gap-2" onClick={loadDocuments}>
+<Button variant="outline" className="rounded-xl gap-2" onClick={() => loadDocuments()}>
<RefreshCcw size={18} />
<span></span>
</Button>
@@ -640,7 +699,83 @@ const Documents: React.FC = () => {
</CardHeader>
{uploadPanelOpen && (
<CardContent className="space-y-4">
-{!uploadedFile ? (
+{uploadedFiles.length > 0 || uploadedFile ? (
<div className="space-y-3">
{/* File list header */}
<div
className="flex items-center justify-between p-3 bg-muted/50 rounded-xl cursor-pointer hover:bg-muted/70 transition-colors"
onClick={() => setUploadExpanded(!uploadExpanded)}
>
<div className="flex items-center gap-3">
<div className="w-10 h-10 rounded-lg bg-primary/10 text-primary flex items-center justify-center">
<Upload size={20} />
</div>
<div>
<p className="font-semibold text-sm">
{(uploadedFiles.length > 0 ? uploadedFiles : [uploadedFile]).length}
</p>
<p className="text-xs text-muted-foreground">
{uploadExpanded ? '点击收起' : '点击展开查看'}
</p>
</div>
</div>
<div className="flex items-center gap-2">
<Button
variant="ghost"
size="sm"
onClick={(e) => {
e.stopPropagation();
handleDeleteFile();
}}
className="text-destructive hover:text-destructive"
>
<Trash2 size={14} className="mr-1" />
</Button>
{uploadExpanded ? <ChevronUp size={16} /> : <ChevronDown size={16} />}
</div>
</div>
{/* Expanded file list */}
{uploadExpanded && (
<div className="space-y-2 border rounded-xl p-3">
{(uploadedFiles.length > 0 ? uploadedFiles : [uploadedFile]).filter(Boolean).map((file, index) => (
<div key={index} className="flex items-center gap-3 p-2 bg-background rounded-lg">
<div className={cn(
"w-8 h-8 rounded flex items-center justify-center",
isExcelFile(file?.name || '') ? "bg-emerald-500/10 text-emerald-500" : "bg-blue-500/10 text-blue-500"
)}>
{isExcelFile(file?.name || '') ? <FileSpreadsheet size={16} /> : <FileText size={16} />}
</div>
<div className="flex-1 min-w-0">
<p className="text-sm truncate">{file?.name}</p>
<p className="text-xs text-muted-foreground">{formatFileSize(file?.size || 0)}</p>
</div>
<Button
variant="ghost"
size="icon"
className="text-destructive hover:bg-destructive/10"
onClick={() => handleRemoveUploadedFile(index)}
>
<Trash2 size={14} />
</Button>
</div>
))}
{/* "Add more files" button */}
<div
{...getRootProps()}
className="flex items-center justify-center gap-2 p-3 border-2 border-dashed rounded-lg cursor-pointer hover:border-primary/50 hover:bg-primary/5 transition-colors"
onClick={(e) => e.stopPropagation()}
>
<input {...getInputProps()} multiple={true} />
<Plus size={16} className="text-muted-foreground" />
<span className="text-sm text-muted-foreground"></span>
</div>
</div>
)}
</div>
) : (
<div
{...getRootProps()}
className={cn(
@@ -649,7 +784,7 @@ const Documents: React.FC = () => {
uploading && "opacity-50 pointer-events-none"
)}
>
-<input {...getInputProps()} />
+<input {...getInputProps()} multiple={true} />
<div className="w-14 h-14 rounded-xl bg-primary/10 text-primary flex items-center justify-center mb-4 group-hover:scale-110 transition-transform">
{uploading ? <Loader2 className="animate-spin" size={28} /> : <Upload size={28} />}
</div>
@@ -671,30 +806,6 @@ const Documents: React.FC = () => {
</Badge>
</div>
</div>
-) : (
-<div className="space-y-4">
-<div className="flex items-center gap-3 p-3 bg-muted/30 rounded-xl">
-<div className={cn(
-"w-10 h-10 rounded-lg flex items-center justify-center",
-isExcelFile(uploadedFile.name) ? "bg-emerald-500/10 text-emerald-500" : "bg-blue-500/10 text-blue-500"
-)}>
-{isExcelFile(uploadedFile.name) ? <FileSpreadsheet size={20} /> : <FileText size={20} />}
-</div>
-<div className="flex-1 min-w-0">
-<p className="font-semibold text-sm truncate">{uploadedFile.name}</p>
-<p className="text-xs text-muted-foreground">{formatFileSize(uploadedFile.size)}</p>
-</div>
-<Button variant="ghost" size="icon" className="text-destructive hover:bg-destructive/10" onClick={handleDeleteFile}>
-<Trash2 size={16} />
-</Button>
-</div>
-{isExcelFile(uploadedFile.name) && (
-<Button onClick={() => onDrop([uploadedFile])} className="w-full" disabled={uploading}>
-{uploading ? '解析中...' : '重新解析'}
-</Button>
-)}
-</div>
)}
</CardContent>
)}

File diff suppressed because it is too large


@@ -1,603 +0,0 @@
import React, { useState, useEffect } from 'react';
import {
TableProperties,
Plus,
FilePlus,
CheckCircle2,
Download,
Clock,
RefreshCcw,
Sparkles,
Zap,
FileCheck,
FileSpreadsheet,
Trash2,
ChevronDown,
ChevronUp,
BarChart3,
FileText,
TrendingUp,
Info,
AlertCircle,
Loader2
} from 'lucide-react';
import { Button } from '@/components/ui/button';
import { Card, CardContent, CardHeader, CardTitle, CardDescription, CardFooter } from '@/components/ui/card';
import { Badge } from '@/components/ui/badge';
import { useAuth } from '@/context/AuthContext';
import { templateApi, documentApi, taskApi } from '@/db/api';
import { backendApi, aiApi } from '@/db/backend-api';
import { supabase } from '@/db/supabase';
import { format } from 'date-fns';
import { toast } from 'sonner';
import { cn } from '@/lib/utils';
import { Skeleton } from '@/components/ui/skeleton';
import {
Dialog,
DialogContent,
DialogHeader,
DialogTitle,
DialogTrigger,
DialogFooter,
DialogDescription
} from '@/components/ui/dialog';
import { Checkbox } from '@/components/ui/checkbox';
import { ScrollArea } from '@/components/ui/scroll-area';
import { Input } from '@/components/ui/input';
import { Label } from '@/components/ui/label';
import { Textarea } from '@/components/ui/textarea';
import { Select, SelectContent, SelectItem, SelectTrigger, SelectValue } from '@/components/ui/select';
import { useDropzone } from 'react-dropzone';
import { Markdown } from '@/components/ui/markdown';
type Template = any;
type Document = any;
type FillTask = any;
const FormFill: React.FC = () => {
const { profile } = useAuth();
const [templates, setTemplates] = useState<Template[]>([]);
const [documents, setDocuments] = useState<Document[]>([]);
const [tasks, setTasks] = useState<any[]>([]);
const [loading, setLoading] = useState(true);
// Selection state
const [selectedTemplate, setSelectedTemplate] = useState<string | null>(null);
const [selectedDocs, setSelectedDocs] = useState<string[]>([]);
const [creating, setCreating] = useState(false);
const [openTaskDialog, setOpenTaskDialog] = useState(false);
const [viewingTask, setViewingTask] = useState<any | null>(null);
// Excel upload state
const [excelFile, setExcelFile] = useState<File | null>(null);
const [excelParseResult, setExcelParseResult] = useState<any>(null);
const [excelAnalysis, setExcelAnalysis] = useState<any>(null);
const [excelAnalyzing, setExcelAnalyzing] = useState(false);
const [expandedSheet, setExpandedSheet] = useState<string | null>(null);
const [aiOptions, setAiOptions] = useState({
userPrompt: '请分析这些数据,并提取关键信息用于填表,包括数值、分类、摘要等。',
analysisType: 'general' as 'general' | 'summary' | 'statistics' | 'insights'
});
const loadData = async () => {
if (!profile) return;
try {
const [t, d, ts] = await Promise.all([
templateApi.listTemplates((profile as any).id),
documentApi.listDocuments((profile as any).id),
taskApi.listTasks((profile as any).id)
]);
setTemplates(t);
setDocuments(d);
setTasks(ts);
} catch (err: any) {
toast.error('数据加载失败');
} finally {
setLoading(false);
}
};
useEffect(() => {
loadData();
}, [profile]);
// Excel upload handlers
const onExcelDrop = async (acceptedFiles: File[]) => {
const file = acceptedFiles[0];
if (!file) return;
if (!file.name.match(/\.(xlsx|xls)$/i)) {
toast.error('仅支持 .xlsx 和 .xls 格式的 Excel 文件');
return;
}
setExcelFile(file);
setExcelParseResult(null);
setExcelAnalysis(null);
setExpandedSheet(null);
try {
const result = await backendApi.uploadExcel(file);
if (result.success) {
toast.success(`Excel 解析成功: ${file.name}`);
setExcelParseResult(result);
} else {
toast.error(result.error || '解析失败');
}
} catch (error: any) {
toast.error(error.message || '上传失败');
}
};
const { getRootProps, getInputProps, isDragActive } = useDropzone({
onDrop: onExcelDrop,
accept: {
'application/vnd.openxmlformats-officedocument.spreadsheetml.sheet': ['.xlsx'],
'application/vnd.ms-excel': ['.xls']
},
maxFiles: 1
});
const handleAnalyzeExcel = async () => {
if (!excelFile || !excelParseResult?.success) {
toast.error('请先上传并解析 Excel 文件');
return;
}
setExcelAnalyzing(true);
setExcelAnalysis(null);
try {
const result = await aiApi.analyzeExcel(excelFile, {
userPrompt: aiOptions.userPrompt,
analysisType: aiOptions.analysisType
});
if (result.success) {
toast.success('AI 分析完成');
setExcelAnalysis(result);
} else {
toast.error(result.error || 'AI 分析失败');
}
} catch (error: any) {
toast.error(error.message || 'AI 分析失败');
} finally {
setExcelAnalyzing(false);
}
};
const handleUseExcelData = () => {
if (!excelParseResult?.success) {
toast.error('请先解析 Excel 文件');
return;
}
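// Mark the parsed Excel data as a "document" and add it to the selection list
toast.success('Excel 数据已添加到数据源,请在任务对话框中选择');
// Logic for passing the Excel data to the backend to create a task can be added here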
};
const handleDeleteExcel = () => {
setExcelFile(null);
setExcelParseResult(null);
setExcelAnalysis(null);
setExpandedSheet(null);
toast.success('Excel 文件已清除');
};
const handleUploadTemplate = async (e: React.ChangeEvent<HTMLInputElement>) => {
const file = e.target.files?.[0];
if (!file || !profile) return;
try {
toast.loading('正在上传模板...');
await templateApi.uploadTemplate(file, (profile as any).id);
toast.dismiss();
toast.success('模板上传成功');
loadData();
} catch (err) {
toast.dismiss();
toast.error('上传模板失败');
}
};
const handleCreateTask = async () => {
if (!profile || !selectedTemplate || selectedDocs.length === 0) {
toast.error('请先选择模板和数据源文档');
return;
}
setCreating(true);
try {
const task = await taskApi.createTask((profile as any).id, selectedTemplate, selectedDocs);
if (task) {
toast.success('任务已创建,正在进行智能填表...');
setOpenTaskDialog(false);
// Invoke edge function
supabase.functions.invoke('fill-template', {
body: { taskId: task.id }
}).then(({ error }) => {
if (error) toast.error('填表任务执行失败');
else {
toast.success('表格填写完成!');
loadData();
}
});
loadData();
}
} catch (err: any) {
toast.error('创建任务失败');
} finally {
setCreating(false);
}
};
const getStatusColor = (status: string) => {
switch (status) {
case 'completed': return 'bg-emerald-500 text-white';
case 'failed': return 'bg-destructive text-white';
default: return 'bg-amber-500 text-white';
}
};
const formatFileSize = (bytes: number): string => {
if (bytes === 0) return '0 B';
const k = 1024;
const sizes = ['B', 'KB', 'MB', 'GB'];
const i = Math.floor(Math.log(bytes) / Math.log(k));
return `${(bytes / Math.pow(k, i)).toFixed(2)} ${sizes[i]}`;
};
return (
<div className="space-y-8 animate-fade-in pb-10">
<section className="flex flex-col md:flex-row md:items-center justify-between gap-4">
<div className="space-y-1">
<h1 className="text-3xl font-extrabold tracking-tight"></h1>
<p className="text-muted-foreground"></p>
</div>
<div className="flex items-center gap-3">
<Dialog open={openTaskDialog} onOpenChange={setOpenTaskDialog}>
<DialogTrigger asChild>
<Button className="rounded-xl shadow-lg shadow-primary/20 gap-2 h-11 px-6">
<FilePlus size={18} />
<span></span>
</Button>
</DialogTrigger>
<DialogContent className="max-w-4xl max-h-[90vh] flex flex-col p-0 overflow-hidden border-none shadow-2xl rounded-3xl">
<DialogHeader className="p-8 pb-4 bg-muted/50">
<DialogTitle className="text-2xl font-bold flex items-center gap-2">
<Sparkles size={24} className="text-primary" />
</DialogTitle>
<DialogDescription>
AI
</DialogDescription>
</DialogHeader>
<ScrollArea className="flex-1 p-8 pt-4">
<div className="space-y-8">
{/* Step 1: Select Template */}
<div className="space-y-4">
<div className="flex items-center justify-between">
<h4 className="font-bold flex items-center gap-2 text-primary uppercase tracking-widest text-xs">
<span className="w-5 h-5 rounded-full bg-primary text-white flex items-center justify-center text-[10px]">1</span>
</h4>
<label className="cursor-pointer text-xs font-semibold text-primary hover:underline flex items-center gap-1">
<Plus size={12} />
<input type="file" className="hidden" onChange={handleUploadTemplate} accept=".docx,.xlsx" />
</label>
</div>
{templates.length > 0 ? (
<div className="grid grid-cols-1 sm:grid-cols-2 gap-3">
{templates.map(t => (
<div
key={t.id}
className={cn(
"p-4 rounded-2xl border-2 transition-all cursor-pointer flex items-center gap-3 group relative overflow-hidden",
selectedTemplate === t.id ? "border-primary bg-primary/5" : "border-border hover:border-primary/50"
)}
onClick={() => setSelectedTemplate(t.id)}
>
<div className={cn(
"w-10 h-10 rounded-xl flex items-center justify-center shrink-0 transition-colors",
selectedTemplate === t.id ? "bg-primary text-white" : "bg-muted text-muted-foreground"
)}>
<TableProperties size={20} />
</div>
<div className="flex-1 min-w-0">
<p className="font-bold text-sm truncate">{t.name}</p>
<p className="text-[10px] text-muted-foreground uppercase">{t.type}</p>
</div>
{selectedTemplate === t.id && (
<div className="absolute top-0 right-0 w-8 h-8 bg-primary text-white flex items-center justify-center rounded-bl-xl">
<CheckCircle2 size={14} />
</div>
)}
</div>
))}
</div>
) : (
<div className="p-8 text-center bg-muted/30 rounded-2xl border border-dashed text-sm italic text-muted-foreground">
</div>
)}
</div>
{/* Step 2: Upload & Analyze Excel */}
<div className="space-y-4">
<h4 className="font-bold flex items-center gap-2 text-primary uppercase tracking-widest text-xs">
<span className="w-5 h-5 rounded-full bg-primary text-white flex items-center justify-center text-[10px]">1.5</span>
Excel
</h4>
<div className="bg-muted/20 rounded-2xl p-6">
{!excelFile ? (
<div
{...getRootProps()}
className={cn(
"border-2 border-dashed rounded-xl p-8 transition-all duration-300 flex flex-col items-center justify-center text-center cursor-pointer group",
isDragActive ? "border-primary bg-primary/5" : "border-muted-foreground/20 hover:border-primary/50 hover:bg-muted/30"
)}
>
<input {...getInputProps()} />
<div className="w-12 h-12 rounded-xl bg-primary/10 text-primary flex items-center justify-center mb-3 group-hover:scale-110 transition-transform">
<FileSpreadsheet size={24} />
</div>
<p className="font-semibold text-sm">
{isDragActive ? '释放以开始上传' : '点击或拖拽 Excel 文件'}
</p>
<p className="text-xs text-muted-foreground mt-1"> .xlsx .xls </p>
</div>
) : (
<div className="space-y-4">
<div className="flex items-center gap-3 p-3 bg-background rounded-xl">
<div className="w-10 h-10 rounded-lg bg-emerald-500/10 text-emerald-500 flex items-center justify-center">
<FileSpreadsheet size={20} />
</div>
<div className="flex-1 min-w-0">
<p className="font-semibold text-sm truncate">{excelFile.name}</p>
<p className="text-xs text-muted-foreground">{formatFileSize(excelFile.size)}</p>
</div>
<div className="flex gap-2">
<Button
variant="ghost"
size="icon"
className="text-destructive hover:bg-destructive/10"
onClick={handleDeleteExcel}
>
<Trash2 size={16} />
</Button>
</div>
</div>
{/* AI Analysis Options */}
{excelParseResult?.success && (
<div className="space-y-3">
<div className="space-y-2">
<Label htmlFor="analysis-type" className="text-xs"></Label>
<Select
value={aiOptions.analysisType}
onValueChange={(value: any) => setAiOptions({ ...aiOptions, analysisType: value })}
>
<SelectTrigger id="analysis-type" className="bg-background h-9 text-sm">
<SelectValue placeholder="选择分析类型" />
</SelectTrigger>
<SelectContent>
<SelectItem value="general"></SelectItem>
<SelectItem value="summary"></SelectItem>
<SelectItem value="statistics"></SelectItem>
<SelectItem value="insights"></SelectItem>
</SelectContent>
</Select>
</div>
<div className="space-y-2">
<Label htmlFor="user-prompt" className="text-xs"></Label>
<Textarea
id="user-prompt"
value={aiOptions.userPrompt}
onChange={(e) => setAiOptions({ ...aiOptions, userPrompt: e.target.value })}
className="bg-background resize-none text-sm"
rows={2}
/>
</div>
<Button
onClick={handleAnalyzeExcel}
disabled={excelAnalyzing}
className="w-full gap-2 h-9"
variant="outline"
>
{excelAnalyzing ? <Loader2 className="animate-spin" size={14} /> : <Sparkles size={14} />}
{excelAnalyzing ? '分析中...' : 'AI 分析'}
</Button>
{excelParseResult?.success && (
<Button
onClick={handleUseExcelData}
className="w-full gap-2 h-9"
>
<CheckCircle2 size={14} />
使
</Button>
)}
</div>
)}
{/* Excel Analysis Result */}
{excelAnalysis && (
<div className="mt-4 p-4 bg-background rounded-xl max-h-60 overflow-y-auto">
<div className="flex items-center gap-2 mb-3">
<Sparkles size={16} className="text-primary" />
<span className="font-semibold text-sm">AI </span>
</div>
<Markdown content={excelAnalysis.analysis?.analysis || ''} className="text-sm" />
</div>
)}
</div>
)}
</div>
</div>
{/* Step 3: Select Documents */}
<div className="space-y-4">
<h4 className="font-bold flex items-center gap-2 text-primary uppercase tracking-widest text-xs">
<span className="w-5 h-5 rounded-full bg-primary text-white flex items-center justify-center text-[10px]">2</span>
</h4>
{documents.filter(d => d.status === 'completed').length > 0 ? (
<div className="space-y-2 max-h-40 overflow-y-auto pr-2 custom-scrollbar">
{documents.filter(d => d.status === 'completed').map(doc => (
<div
key={doc.id}
className={cn(
"flex items-center gap-3 p-3 rounded-xl border transition-all cursor-pointer",
selectedDocs.includes(doc.id) ? "border-primary/50 bg-primary/5 shadow-sm" : "border-border hover:bg-muted/30"
)}
onClick={() => {
setSelectedDocs(prev =>
prev.includes(doc.id) ? prev.filter(id => id !== doc.id) : [...prev, doc.id]
);
}}
>
<Checkbox checked={selectedDocs.includes(doc.id)} onCheckedChange={() => {}} />
<div className="w-8 h-8 rounded-lg bg-blue-500/10 text-blue-500 flex items-center justify-center">
<Zap size={16} />
</div>
<span className="font-semibold text-sm truncate">{doc.name}</span>
</div>
))}
</div>
) : (
<div className="p-6 text-center bg-muted/30 rounded-xl border border-dashed text-xs italic text-muted-foreground">
</div>
)}
</div>
</div>
</ScrollArea>
<DialogFooter className="p-8 pt-4 bg-muted/20 border-t border-dashed">
<Button variant="outline" className="rounded-xl h-12 px-6" onClick={() => setOpenTaskDialog(false)}></Button>
<Button
className="rounded-xl h-12 px-8 shadow-lg shadow-primary/20 gap-2"
onClick={handleCreateTask}
disabled={creating || !selectedTemplate || (selectedDocs.length === 0 && !excelParseResult?.success)}
>
{creating ? <RefreshCcw className="animate-spin h-5 w-5" /> : <Zap className="h-5 w-5 fill-current" />}
<span></span>
</Button>
</DialogFooter>
</DialogContent>
</Dialog>
</div>
</section>
{/* Task List */}
<div className="grid grid-cols-1 md:grid-cols-2 lg:grid-cols-3 gap-6">
{loading ? (
Array.from({ length: 3 }).map((_, i) => (
<Skeleton key={i} className="h-48 w-full rounded-3xl bg-muted" />
))
) : tasks.length > 0 ? (
tasks.map((task) => (
<Card key={task.id} className="border-none shadow-md hover:shadow-xl transition-all group rounded-3xl overflow-hidden flex flex-col">
<div className="h-1.5 w-full" style={{ backgroundColor: task.status === 'completed' ? '#10b981' : task.status === 'failed' ? '#ef4444' : '#f59e0b' }} />
<CardHeader className="p-6 pb-2">
<div className="flex justify-between items-start mb-2">
<div className="w-12 h-12 rounded-2xl bg-emerald-500/10 text-emerald-500 flex items-center justify-center shadow-inner group-hover:scale-110 transition-transform">
<TableProperties size={24} />
</div>
<Badge className={cn("text-[10px] uppercase font-bold tracking-widest", getStatusColor(task.status))}>
{task.status === 'completed' ? '已完成' : task.status === 'failed' ? '失败' : '执行中'}
</Badge>
</div>
<CardTitle className="text-lg font-bold truncate group-hover:text-primary transition-colors">{task.templates?.name || '未知模板'}</CardTitle>
<CardDescription className="text-xs flex items-center gap-1 font-medium italic">
<Clock size={12} /> {format(new Date(task.created_at!), 'yyyy/MM/dd HH:mm')}
</CardDescription>
</CardHeader>
<CardContent className="p-6 pt-2 flex-1">
<div className="space-y-4">
<div className="flex flex-wrap gap-2">
<Badge variant="outline" className="bg-muted/50 border-none text-[10px] font-bold"> {task.document_ids?.length} </Badge>
</div>
{task.status === 'completed' && (
<div className="p-3 bg-emerald-500/5 rounded-2xl border border-emerald-500/10 flex items-center gap-3">
<CheckCircle2 className="text-emerald-500" size={18} />
<span className="text-xs font-semibold text-emerald-700"></span>
</div>
)}
</div>
</CardContent>
<CardFooter className="p-6 pt-0">
<Button
className="w-full rounded-2xl h-11 bg-primary group-hover:shadow-lg group-hover:shadow-primary/30 transition-all gap-2"
disabled={task.status !== 'completed'}
onClick={() => setViewingTask(task)}
>
<Download size={18} />
<span></span>
</Button>
</CardFooter>
</Card>
))
) : (
<div className="col-span-full py-24 flex flex-col items-center justify-center text-center space-y-6">
<div className="w-24 h-24 rounded-full bg-muted flex items-center justify-center text-muted-foreground/30 border-4 border-dashed">
<TableProperties size={48} />
</div>
<div className="space-y-2 max-w-sm">
<p className="text-2xl font-extrabold tracking-tight"></p>
<p className="text-muted-foreground text-sm"></p>
</div>
<Button className="rounded-xl h-12 px-8" onClick={() => setOpenTaskDialog(true)}></Button>
</div>
)}
</div>
{/* Task Result View Modal */}
<Dialog open={!!viewingTask} onOpenChange={(open) => !open && setViewingTask(null)}>
<DialogContent className="max-w-4xl max-h-[90vh] flex flex-col p-0 overflow-hidden border-none shadow-2xl rounded-3xl">
<DialogHeader className="p-8 pb-4 bg-primary text-primary-foreground">
<div className="flex items-center gap-3 mb-2">
<FileCheck size={28} />
<DialogTitle className="text-2xl font-extrabold"></DialogTitle>
</div>
<DialogDescription className="text-primary-foreground/80 italic">
{viewingTask?.document_ids?.length}
</DialogDescription>
</DialogHeader>
<ScrollArea className="flex-1 p-8 bg-muted/10">
<div className="prose dark:prose-invert max-w-none">
<div className="bg-card p-8 rounded-2xl shadow-sm border min-h-[400px]">
<Badge variant="outline" className="mb-4"></Badge>
<div className="whitespace-pre-wrap font-sans text-sm leading-relaxed">
<h2 className="text-xl font-bold mb-4"></h2>
<p className="text-muted-foreground mb-6"></p>
<div className="p-4 bg-muted/30 rounded-xl border border-dashed border-primary/20 italic">
...
</div>
<div className="mt-8 space-y-4">
<p className="font-semibold text-primary"> </p>
<p className="font-semibold text-primary"> </p>
<p className="font-semibold text-primary"> </p>
</div>
</div>
</div>
</div>
</ScrollArea>
<DialogFooter className="p-8 pt-4 border-t border-dashed">
<Button variant="outline" className="rounded-xl" onClick={() => setViewingTask(null)}></Button>
<Button className="rounded-xl px-8 gap-2 shadow-lg shadow-primary/20" onClick={() => toast.success("正在导出文件...")}>
<Download size={18} />
{viewingTask?.templates?.type?.toUpperCase() || '文件'}
</Button>
</DialogFooter>
</DialogContent>
</Dialog>
</div>
);
};
export default FormFill;
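
Note on `formatFileSize` in the file above: the unit index is computed from the logarithm of the byte count and is never clamped, so for very large inputs (around 1 TB and above) it can run past the four-entry `sizes` array and the unit renders as `undefined`. A minimal sketch of a clamped variant follows. The name `formatFileSizeSafe`, the extra `TB` unit, and the handling of negative or non-finite input are assumptions for illustration and are not part of this commit.

```typescript
// Hypothetical variant of formatFileSize; not part of the original commit.
const UNITS = ['B', 'KB', 'MB', 'GB', 'TB'] as const;

function formatFileSizeSafe(bytes: number): string {
  // Zero, negative, and non-finite inputs cannot be passed through Math.log meaningfully.
  if (!Number.isFinite(bytes) || bytes <= 0) return '0 B';
  const k = 1024;
  const raw = Math.floor(Math.log(bytes) / Math.log(k));
  // Clamp so the index never leaves the UNITS array.
  const i = Math.min(Math.max(raw, 0), UNITS.length - 1);
  return `${(bytes / Math.pow(k, i)).toFixed(2)} ${UNITS[i]}`;
}
```

With the clamp, anything beyond the largest unit is expressed in that unit instead of producing an undefined suffix.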


@@ -1,184 +0,0 @@
import React, { useState } from 'react';
import { useNavigate, useLocation } from 'react-router-dom';
import { useAuth } from '@/context/AuthContext';
import { Button } from '@/components/ui/button';
import { Input } from '@/components/ui/input';
import { Label } from '@/components/ui/label';
import { Card, CardContent, CardDescription, CardFooter, CardHeader, CardTitle } from '@/components/ui/card';
import { Tabs, TabsContent, TabsList, TabsTrigger } from '@/components/ui/tabs';
import { FileText, Lock, User, CheckCircle2, AlertCircle } from 'lucide-react';
import { toast } from 'sonner';
const Login: React.FC = () => {
const [username, setUsername] = useState('');
const [password, setPassword] = useState('');
const [loading, setLoading] = useState(false);
const { signIn, signUp } = useAuth();
const navigate = useNavigate();
const location = useLocation();
const handleLogin = async (e: React.FormEvent) => {
e.preventDefault();
if (!username || !password) return toast.error('请输入用户名和密码');
setLoading(true);
try {
const email = `${username}@miaoda.com`;
const { error } = await signIn(email, password);
if (error) throw error;
toast.success('登录成功');
navigate('/');
} catch (err: any) {
toast.error(err.message || '登录失败');
} finally {
setLoading(false);
}
};
const handleSignUp = async (e: React.FormEvent) => {
e.preventDefault();
if (!username || !password) return toast.error('请输入用户名和密码');
setLoading(true);
try {
const email = `${username}@miaoda.com`;
const { error } = await signUp(email, password);
if (error) throw error;
toast.success('注册成功,请登录');
} catch (err: any) {
toast.error(err.message || '注册失败');
} finally {
setLoading(false);
}
};
return (
<div className="min-h-screen flex items-center justify-center bg-[radial-gradient(ellipse_at_top_left,_var(--tw-gradient-stops))] from-primary/10 via-background to-background p-4 relative overflow-hidden">
{/* Decorative elements */}
<div className="absolute top-0 left-0 w-96 h-96 bg-primary/5 rounded-full blur-3xl -translate-x-1/2 -translate-y-1/2" />
<div className="absolute bottom-0 right-0 w-64 h-64 bg-primary/5 rounded-full blur-3xl translate-x-1/3 translate-y-1/3" />
<div className="w-full max-w-md space-y-8 relative animate-fade-in">
<div className="text-center space-y-2">
<div className="inline-flex items-center justify-center w-16 h-16 rounded-2xl bg-primary text-primary-foreground shadow-2xl shadow-primary/30 mb-4 animate-slide-in">
<FileText size={32} />
</div>
<h1 className="text-4xl font-extrabold tracking-tight gradient-text"></h1>
<p className="text-muted-foreground"></p>
</div>
<Card className="border-border/50 shadow-2xl backdrop-blur-sm bg-card/95">
<Tabs defaultValue="login" className="w-full">
<TabsList className="grid w-full grid-cols-2 rounded-t-xl h-12 bg-muted/50 p-1">
<TabsTrigger value="login" className="rounded-lg data-[state=active]:bg-background data-[state=active]:shadow-sm"></TabsTrigger>
<TabsTrigger value="signup" className="rounded-lg data-[state=active]:bg-background data-[state=active]:shadow-sm"></TabsTrigger>
</TabsList>
<TabsContent value="login">
<form onSubmit={handleLogin}>
<CardHeader>
<CardTitle></CardTitle>
<CardDescription>使</CardDescription>
</CardHeader>
<CardContent className="space-y-4">
<div className="space-y-2">
<Label htmlFor="username"></Label>
<div className="relative">
<User className="absolute left-3 top-2.5 h-4 w-4 text-muted-foreground" />
<Input
id="username"
placeholder="请输入用户名"
className="pl-9 bg-muted/30 border-none focus-visible:ring-primary"
value={username}
onChange={(e) => setUsername(e.target.value)}
/>
</div>
</div>
<div className="space-y-2">
<Label htmlFor="password"></Label>
<div className="relative">
<Lock className="absolute left-3 top-2.5 h-4 w-4 text-muted-foreground" />
<Input
id="password"
type="password"
placeholder="请输入密码"
className="pl-9 bg-muted/30 border-none focus-visible:ring-primary"
value={password}
onChange={(e) => setPassword(e.target.value)}
/>
</div>
</div>
</CardContent>
<CardFooter>
<Button className="w-full h-11 text-lg font-semibold rounded-xl" type="submit" disabled={loading}>
{loading ? '登录中...' : '立即登录'}
</Button>
</CardFooter>
</form>
</TabsContent>
<TabsContent value="signup">
<form onSubmit={handleSignUp}>
<CardHeader>
<CardTitle></CardTitle>
<CardDescription></CardDescription>
</CardHeader>
<CardContent className="space-y-4">
<div className="space-y-2">
<Label htmlFor="signup-username"></Label>
<div className="relative">
<User className="absolute left-3 top-2.5 h-4 w-4 text-muted-foreground" />
<Input
id="signup-username"
placeholder="仅字母、数字和下划线"
className="pl-9 bg-muted/30 border-none focus-visible:ring-primary"
value={username}
onChange={(e) => setUsername(e.target.value)}
/>
</div>
</div>
<div className="space-y-2">
<Label htmlFor="signup-password"></Label>
<div className="relative">
<Lock className="absolute left-3 top-2.5 h-4 w-4 text-muted-foreground" />
<Input
id="signup-password"
type="password"
placeholder="不少于 6 位"
className="pl-9 bg-muted/30 border-none focus-visible:ring-primary"
value={password}
onChange={(e) => setPassword(e.target.value)}
/>
</div>
</div>
</CardContent>
<CardFooter>
<Button className="w-full h-11 text-lg font-semibold rounded-xl" type="submit" disabled={loading}>
{loading ? '注册中...' : '注册账号'}
</Button>
</CardFooter>
</form>
</TabsContent>
</Tabs>
</Card>
<div className="grid grid-cols-2 gap-4 text-center text-xs text-muted-foreground">
<div className="flex flex-col items-center gap-1">
<CheckCircle2 size={16} className="text-primary" />
<span></span>
</div>
<div className="flex flex-col items-center gap-1">
<CheckCircle2 size={16} className="text-primary" />
<span></span>
</div>
</div>
<div className="text-center text-sm text-muted-foreground">
&copy; 2026 |
</div>
</div>
</div>
);
};
export default Login;
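
Note on the removed `Login` page: the sign-up placeholders state that usernames accept only letters, digits, and underscores and that passwords need at least 6 characters, yet `handleLogin` and `handleSignUp` only check for empty fields before building `${username}@miaoda.com`. A minimal sketch of an explicit check is shown below. The helper name `validateCredentials`, the result type, the English messages, and the reading of "letters" as ASCII letters are all assumptions, not code from this repository.

```typescript
// Hypothetical helper; the original handlers only reject empty fields.
type ValidationResult =
  | { ok: true; email: string }
  | { ok: false; reason: string };

function validateCredentials(username: string, password: string): ValidationResult {
  // Assumption: "letters" means ASCII letters, since the value becomes an email local part.
  if (!/^[A-Za-z0-9_]+$/.test(username)) {
    return { ok: false, reason: 'username must contain only letters, digits, and underscores' };
  }
  if (password.length < 6) {
    return { ok: false, reason: 'password must be at least 6 characters' };
  }
  // Same username-to-email mapping used by handleLogin and handleSignUp.
  return { ok: true, email: `${username}@miaoda.com` };
}
```

A handler could call this first and pass `reason` to `toast.error` when `ok` is false.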


@@ -1,16 +0,0 @@
/**
* Sample Page
*/
import PageMeta from "../components/common/PageMeta";
export default function SamplePage() {
return (
<>
<PageMeta title="Home" description="Home Page Introduction" />
<div>
<h3>This is a sample page</h3>
</div>
</>
);
}


@@ -11,7 +11,8 @@ import {
 ChevronDown,
 ChevronUp,
 Trash2,
-AlertCircle
+AlertCircle,
+HelpCircle
 } from 'lucide-react';
 import { Card, CardContent, CardHeader, CardTitle, CardDescription } from '@/components/ui/card';
 import { Button } from '@/components/ui/button';
@@ -24,9 +25,9 @@ import { Skeleton } from '@/components/ui/skeleton';
 type Task = {
 task_id: string;
-status: 'pending' | 'processing' | 'success' | 'failure';
+status: 'pending' | 'processing' | 'success' | 'failure' | 'unknown';
 created_at: string;
-completed_at?: string;
+updated_at?: string;
 message?: string;
 result?: any;
 error?: string;
@@ -38,54 +39,38 @@ const TaskHistory: React.FC = () => {
 const [loading, setLoading] = useState(true);
 const [expandedTask, setExpandedTask] = useState<string | null>(null);
-// Mock data for demonstration
-useEffect(() => {
-// Mock task data; in practice this should be fetched from the backend
-setTasks([
-{
-task_id: 'task-001',
-status: 'success',
-created_at: new Date(Date.now() - 3600000).toISOString(),
-completed_at: new Date(Date.now() - 3500000).toISOString(),
-task_type: 'document_parse',
-message: '文档解析完成',
-result: {
-doc_id: 'doc-001',
-filename: 'report_q1_2026.docx',
-extracted_fields: ['标题', '作者', '日期', '金额']
-}
-},
-{
-task_id: 'task-002',
-status: 'success',
-created_at: new Date(Date.now() - 7200000).toISOString(),
-completed_at: new Date(Date.now() - 7100000).toISOString(),
-task_type: 'excel_analysis',
-message: 'Excel 分析完成',
-result: {
-filename: 'sales_data.xlsx',
-row_count: 1250,
-charts_generated: 3
-}
-},
-{
-task_id: 'task-003',
-status: 'processing',
-created_at: new Date(Date.now() - 600000).toISOString(),
-task_type: 'template_fill',
-message: '正在填充表格...'
-},
-{
-task_id: 'task-004',
-status: 'failure',
-created_at: new Date(Date.now() - 86400000).toISOString(),
-completed_at: new Date(Date.now() - 86390000).toISOString(),
-task_type: 'document_parse',
-message: '解析失败',
-error: '文件格式不支持或文件已损坏'
+// Fetch task history data
+const fetchTasks = async () => {
+try {
+setLoading(true);
+const response = await backendApi.getTasks(50, 0);
+if (response.success && response.tasks) {
+// Convert the backend data format to the frontend format
+const convertedTasks: Task[] = response.tasks.map((t: any) => ({
+task_id: t.task_id,
+status: t.status || 'unknown',
+created_at: t.created_at || new Date().toISOString(),
+updated_at: t.updated_at,
+message: t.message || '',
+result: t.result,
+error: t.error,
+task_type: t.task_type || 'document_parse'
+}));
+setTasks(convertedTasks);
+} else {
+setTasks([]);
 }
-]);
-setLoading(false);
+} catch (error) {
+console.error('获取任务列表失败:', error);
+toast.error('获取任务列表失败');
+setTasks([]);
+} finally {
+setLoading(false);
+}
+};
+useEffect(() => {
+fetchTasks();
 }, []);
 const getStatusBadge = (status: string) => {
@@ -96,6 +81,8 @@ const TaskHistory: React.FC = () => {
 return <Badge className="bg-destructive text-white text-[10px]"><XCircle size={12} className="mr-1" /></Badge>;
 case 'processing':
 return <Badge className="bg-amber-500 text-white text-[10px]"><Loader2 size={12} className="mr-1 animate-spin" /></Badge>;
+case 'unknown':
+return <Badge className="bg-gray-500 text-white text-[10px]"><HelpCircle size={12} className="mr-1" /></Badge>;
 default:
 return <Badge className="bg-gray-500 text-white text-[10px]"><Clock size={12} className="mr-1" /></Badge>;
 }
@@ -133,15 +120,22 @@ const TaskHistory: React.FC = () => {
 };
 const handleDelete = async (taskId: string) => {
-setTasks(prev => prev.filter(t => t.task_id !== taskId));
-toast.success('任务已删除');
+try {
+await backendApi.deleteTask(taskId);
+setTasks(prev => prev.filter(t => t.task_id !== taskId));
+toast.success('任务已删除');
+} catch (error) {
+console.error('删除任务失败:', error);
+toast.error('删除任务失败');
+}
 };
 const stats = {
 total: tasks.length,
 success: tasks.filter(t => t.status === 'success').length,
 processing: tasks.filter(t => t.status === 'processing').length,
-failure: tasks.filter(t => t.status === 'failure').length
+failure: tasks.filter(t => t.status === 'failure').length,
+unknown: tasks.filter(t => t.status === 'unknown').length
 };
 return (
@@ -151,7 +145,7 @@ const TaskHistory: React.FC = () => {
 <h1 className="text-3xl font-extrabold tracking-tight"></h1>
 <p className="text-muted-foreground"></p>
 </div>
-<Button variant="outline" className="rounded-xl gap-2" onClick={() => window.location.reload()}>
+<Button variant="outline" className="rounded-xl gap-2" onClick={() => fetchTasks()}>
 <RefreshCcw size={18} />
 <span></span>
 </Button>
@@ -194,7 +188,8 @@ const TaskHistory: React.FC = () => {
 "w-12 h-12 rounded-xl flex items-center justify-center shrink-0",
 task.status === 'success' ? "bg-emerald-500/10 text-emerald-500" :
 task.status === 'failure' ? "bg-destructive/10 text-destructive" :
-"bg-amber-500/10 text-amber-500"
+task.status === 'processing' ? "bg-amber-500/10 text-amber-500" :
+"bg-gray-500/10 text-gray-500"
 )}>
 {task.status === 'processing' ? (
 <Loader2 size={24} className="animate-spin" />
@@ -212,16 +207,16 @@ const TaskHistory: React.FC = () => {
 </Badge>
 </div>
 <p className="text-sm text-muted-foreground">
-{task.message || '任务执行中...'}
+{task.message || (task.status === 'unknown' ? '无法获取状态' : '任务执行中...')}
 </p>
 <div className="flex items-center gap-4 text-xs text-muted-foreground">
 <span className="flex items-center gap-1">
 <Clock size={12} />
-{format(new Date(task.created_at), 'yyyy-MM-dd HH:mm:ss')}
+{task.created_at ? format(new Date(task.created_at), 'yyyy-MM-dd HH:mm:ss') : '时间未知'}
 </span>
-{task.completed_at && (
+{task.updated_at && task.status !== 'processing' && (
 <span>
-: {Math.round((new Date(task.completed_at).getTime() - new Date(task.created_at).getTime()) / 1000)}
+: {format(new Date(task.updated_at), 'HH:mm:ss')}
 </span>
 )}
 </div>
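
Note on the `fetchTasks` change above: the backend-to-frontend conversion can be pulled out into a pure function, which makes the defaults easy to test in isolation. The sketch below mirrors the defaults in the diff (`''` for `message`, `'document_parse'` for `task_type`, the current time for a missing `created_at`). The names `normalizeTask`, `RawTask`, and `KNOWN_STATUSES` are hypothetical, the `RawTask` shape is an assumption about what the backend returns, and the allow-list check on `status` is an addition: the diff uses `t.status || 'unknown'`, which lets any non-empty string through.

```typescript
type TaskStatus = 'pending' | 'processing' | 'success' | 'failure' | 'unknown';

interface Task {
  task_id: string;
  status: TaskStatus;
  created_at: string;
  updated_at?: string;
  message?: string;
  result?: unknown;
  error?: string;
  task_type?: string;
}

// Assumed shape of a backend record; everything except task_id may be missing.
interface RawTask {
  task_id: string;
  status?: string;
  created_at?: string;
  updated_at?: string;
  message?: string;
  result?: unknown;
  error?: string;
  task_type?: string;
}

const KNOWN_STATUSES: readonly TaskStatus[] = ['pending', 'processing', 'success', 'failure', 'unknown'];

// Hypothetical helper; `now` is injectable so the created_at fallback is deterministic in tests.
function normalizeTask(t: RawTask, now: Date = new Date()): Task {
  // Addition over the diff: unrecognized status strings also map to 'unknown'.
  const status = KNOWN_STATUSES.includes(t.status as TaskStatus)
    ? (t.status as TaskStatus)
    : 'unknown';
  return {
    task_id: t.task_id,
    status,
    created_at: t.created_at || now.toISOString(),
    updated_at: t.updated_at,
    message: t.message || '',
    result: t.result,
    error: t.error,
    task_type: t.task_type || 'document_parse',
  };
}
```

Under these assumptions the component body would reduce to `setTasks(response.tasks.map((t: RawTask) => normalizeTask(t)))`.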


@@ -1,4 +1,4 @@
-import React, { useState, useEffect, useCallback } from 'react';
+import React, { useState, useEffect, useCallback, useRef } from 'react';
 import { useDropzone } from 'react-dropzone';
 import {
 TableProperties,
@@ -18,7 +18,8 @@ import {
 Files,
 Trash2,
 Eye,
-File
+File,
+Plus
 } from 'lucide-react';
 import { Button } from '@/components/ui/button';
 import { Card, CardContent, CardHeader, CardTitle, CardDescription } from '@/components/ui/card';
@@ -72,6 +73,7 @@ const TemplateFill: React.FC = () => {
 const [sourceMode, setSourceMode] = useState<'upload' | 'select'>('upload');
 const [uploadedDocuments, setUploadedDocuments] = useState<DocumentItem[]>([]);
 const [docsLoading, setDocsLoading] = useState(false);
+const sourceFileInputRef = useRef<HTMLInputElement>(null);
 // Template drag-and-drop
 const onTemplateDrop = useCallback((acceptedFiles: File[]) => {
@@ -93,25 +95,34 @@ const TemplateFill: React.FC = () => {
 });
 // Source document drag-and-drop
-const onSourceDrop = useCallback((acceptedFiles: File[]) => {
-const newFiles = acceptedFiles.map(f => ({
-file: f,
-preview: f.type.startsWith('text/') || f.name.endsWith('.md') ? undefined : undefined
-}));
-addSourceFiles(newFiles);
+const onSourceDrop = useCallback((e: React.DragEvent) => {
+e.preventDefault();
+const files = Array.from(e.dataTransfer.files).filter(f => {
+const ext = f.name.split('.').pop()?.toLowerCase();
+return ['xlsx', 'xls', 'docx', 'md', 'txt'].includes(ext || '');
+});
+if (files.length > 0) {
+addSourceFiles(files.map(f => ({ file: f })));
+}
 }, [addSourceFiles]);
-const { getRootProps: getSourceProps, getInputProps: getSourceInputProps, isDragActive: isSourceDragActive } = useDropzone({
-onDrop: onSourceDrop,
-accept: {
-'application/vnd.openxmlformats-officedocument.spreadsheetml.sheet': ['.xlsx'],
-'application/vnd.ms-excel': ['.xls'],
-'application/vnd.openxmlformats-officedocument.wordprocessingml.document': ['.docx'],
-'text/plain': ['.txt'],
-'text/markdown': ['.md']
-},
-multiple: true
-});
+const handleSourceFileSelect = (e: React.ChangeEvent<HTMLInputElement>) => {
+const files = Array.from(e.target.files || []);
+if (files.length > 0) {
+addSourceFiles(files.map(f => ({ file: f })));
+toast.success(`已添加 ${files.length} 个文件`);
+}
+e.target.value = '';
+};
+// Only add source documents; do not upload them
+const handleAddSourceFiles = () => {
+if (sourceFiles.length === 0) {
+toast.error('请先选择源文档');
+return;
+}
+toast.success(`已添加 ${sourceFiles.length} 个源文档,可继续添加更多`);
+};
 // Load uploaded documents
 const loadUploadedDocuments = useCallback(async () => {
@@ -371,23 +382,33 @@ const TemplateFill: React.FC = () => {
 <CardContent>
 {sourceMode === 'upload' ? (
 <>
+<div className="border-2 border-dashed rounded-2xl p-8 transition-all duration-300 flex flex-col items-center justify-center text-center cursor-pointer group min-h-[200px] border-muted-foreground/20 hover:border-primary/50 hover:bg-primary/5">
+<input
+id="source-file-input"
+type="file"
+multiple={true}
+accept=".xlsx,.xls,.docx,.md,.txt"
+onChange={handleSourceFileSelect}
+className="hidden"
+/>
+<label htmlFor="source-file-input" className="cursor-pointer flex flex-col items-center">
+<div className="w-14 h-14 rounded-xl bg-blue-500/10 text-blue-500 flex items-center justify-center mb-4 group-hover:scale-110 transition-transform">
+{loading ? <Loader2 className="animate-spin" size={28} /> : <Upload size={28} />}
+</div>
+<p className="font-medium">
+</p>
+<p className="text-xs text-muted-foreground mt-1">
+.xlsx .xls .docx .md .txt
+</p>
+</label>
+</div>
 <div
-{...getSourceProps()}
-className={cn(
-"border-2 border-dashed rounded-2xl p-8 transition-all duration-300 flex flex-col items-center justify-center text-center cursor-pointer group min-h-[200px]",
-isSourceDragActive ? "border-primary bg-primary/5" : "border-muted-foreground/20 hover:border-primary/50 hover:bg-primary/5"
-)}
+onDragOver={(e) => { e.preventDefault(); }}
+onDrop={onSourceDrop}
+className="mt-2 text-center text-xs text-muted-foreground"
 >
-<input {...getSourceInputProps()} />
-<div className="w-14 h-14 rounded-xl bg-blue-500/10 text-blue-500 flex items-center justify-center mb-4 group-hover:scale-110 transition-transform">
-{loading ? <Loader2 className="animate-spin" size={28} /> : <Upload size={28} />}
-</div>
-<p className="font-medium">
-{isSourceDragActive ? '释放以上传' : '点击或拖拽上传源文档'}
-</p>
-<p className="text-xs text-muted-foreground mt-1">
-.xlsx .xls .docx .md .txt
-</p>
 </div>
 {/* Selected Source Files */}
@@ -407,6 +428,12 @@ const TemplateFill: React.FC = () => {
</Button> </Button>
</div> </div>
))} ))}
<div className="flex justify-center pt-2">
<Button variant="outline" size="sm" onClick={() => document.getElementById('source-file-input')?.click()}>
<Plus size={14} className="mr-1" />
</Button>
</div>
</div> </div>
)} )}
</> </>
@@ -420,49 +447,60 @@ const TemplateFill: React.FC = () => {
))} ))}
</div> </div>
) : uploadedDocuments.length > 0 ? ( ) : uploadedDocuments.length > 0 ? (
<div className="space-y-2 max-h-[300px] overflow-y-auto"> <div className="space-y-2">
{uploadedDocuments.map((doc) => ( {sourceDocIds.length > 0 && (
<div <div className="flex items-center justify-between p-3 bg-primary/5 rounded-xl border border-primary/20">
key={doc.doc_id} <span className="text-sm font-medium"> {sourceDocIds.length} </span>
className={cn( <Button variant="ghost" size="sm" onClick={() => loadUploadedDocuments()}>
"flex items-center gap-3 p-3 rounded-xl border-2 transition-all cursor-pointer", <RefreshCcw size={14} className="mr-1" />
sourceDocIds.includes(doc.doc_id)
? "border-primary bg-primary/5"
: "border-border hover:bg-muted/30"
)}
onClick={() => {
if (sourceDocIds.includes(doc.doc_id)) {
removeSourceDocId(doc.doc_id);
} else {
addSourceDocId(doc.doc_id);
}
}}
>
<div className={cn(
"w-6 h-6 rounded-md border-2 flex items-center justify-center transition-all shrink-0",
sourceDocIds.includes(doc.doc_id)
? "border-primary bg-primary text-white"
: "border-muted-foreground/30"
)}>
{sourceDocIds.includes(doc.doc_id) && <CheckCircle2 size={14} />}
</div>
{getFileIcon(doc.original_filename)}
<div className="flex-1 min-w-0">
<p className="text-sm font-medium truncate">{doc.original_filename}</p>
<p className="text-xs text-muted-foreground">
{doc.doc_type.toUpperCase()} {format(new Date(doc.created_at), 'yyyy-MM-dd')}
</p>
</div>
<Button
variant="ghost"
size="sm"
onClick={(e) => handleDeleteDocument(doc.doc_id, e)}
className="shrink-0"
>
<Trash2 size={14} className="text-red-500" />
</Button> </Button>
</div> </div>
))} )}
<div className="max-h-[300px] overflow-y-auto space-y-2">
{uploadedDocuments.map((doc) => (
<div
key={doc.doc_id}
className={cn(
"flex items-center gap-3 p-3 rounded-xl border-2 transition-all cursor-pointer",
sourceDocIds.includes(doc.doc_id)
? "border-primary bg-primary/5"
: "border-border hover:bg-muted/30"
)}
onClick={() => {
if (sourceDocIds.includes(doc.doc_id)) {
removeSourceDocId(doc.doc_id);
} else {
addSourceDocId(doc.doc_id);
}
}}
>
<div className={cn(
"w-6 h-6 rounded-md border-2 flex items-center justify-center transition-all shrink-0",
sourceDocIds.includes(doc.doc_id)
? "border-primary bg-primary text-white"
: "border-muted-foreground/30"
)}>
{sourceDocIds.includes(doc.doc_id) && <CheckCircle2 size={14} />}
</div>
{getFileIcon(doc.original_filename)}
<div className="flex-1 min-w-0">
<p className="text-sm font-medium truncate">{doc.original_filename}</p>
<p className="text-xs text-muted-foreground">
{doc.doc_type.toUpperCase()} {format(new Date(doc.created_at), 'yyyy-MM-dd')}
</p>
</div>
<Button
variant="ghost"
size="sm"
onClick={(e) => handleDeleteDocument(doc.doc_id, e)}
className="shrink-0"
>
<Trash2 size={14} className="text-red-500" />
</Button>
</div>
))}
</div>
</div> </div>
) : ( ) : (
<div className="text-center py-8 text-muted-foreground"> <div className="text-center py-8 text-muted-foreground">
@@ -588,6 +626,16 @@ const TemplateFill: React.FC = () => {
<div className="text-muted-foreground text-xs mt-1"> <div className="text-muted-foreground text-xs mt-1">
: {detail.source} | : {detail.confidence ? (detail.confidence * 100).toFixed(0) + '%' : 'N/A'} : {detail.source} | : {detail.confidence ? (detail.confidence * 100).toFixed(0) + '%' : 'N/A'}
</div> </div>
{detail.warning && (
<div className="mt-2 p-2 bg-yellow-50 border border-yellow-200 rounded-lg text-yellow-700 text-xs">
{detail.warning}
</div>
)}
{detail.values && detail.values.length > 1 && !detail.warning && (
<div className="mt-2 text-xs text-muted-foreground">
: {detail.values.join(', ')}
</div>
)}
</div> </div>
</div> </div>
))} ))}


@@ -50,18 +50,18 @@
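| `prompt_service.py` | ✅ Done | Prompt template management | | `prompt_service.py` | ✅ Done | Prompt template management |
| `text_analysis_service.py` | ✅ Done | Text analysis | | `text_analysis_service.py` | ✅ Done | Text analysis |
| `chart_generator_service.py` | ✅ Done | Chart generation service | | `chart_generator_service.py` | ✅ Done | Chart generation service |
| `template_fill_service.py` | ✅ Done | Template filling service; fills tables by reading source documents directly | | `template_fill_service.py` | ✅ Done | Template filling service; supports multi-row extraction, direct extraction from structured data, JSON fault tolerance, and Word table handling |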
### 2.2 API endpoints (`backend/app/api/endpoints/`) ### 2.2 API endpoints (`backend/app/api/endpoints/`)
| Endpoint file | Route | Status | | Endpoint file | Route | Status |
|----------|------|----------| |----------|------|----------|
| `upload.py` | `/api/v1/upload/excel` | ✅ Excel file upload and parsing | | `upload.py` | `/api/v1/upload/document` | ✅ Document upload and parsing |
| `documents.py` | `/api/v1/documents/*` | ✅ Document management (list, delete, search) | | `documents.py` | `/api/v1/documents/*` | ✅ Document management (list, delete, search) |
| `ai_analyze.py` | `/api/v1/analyze/*` | ✅ AI analysis (Excel, Markdown, streaming) | | `ai_analyze.py` | `/api/v1/analyze/*` | ✅ AI analysis (Excel, Markdown, streaming) |
| `rag.py` | `/api/v1/rag/*` | ⚠️ RAG retrieval (currently returns empty results) | | `rag.py` | `/api/v1/rag/*` | ⚠️ RAG retrieval (currently returns empty results) |
| `tasks.py` | `/api/v1/tasks/*` | ✅ Async task status queries | | `tasks.py` | `/api/v1/tasks/*` | ✅ Async task status queries |
| `templates.py` | `/api/v1/templates/*` | ✅ Template management (including Word export) | | `templates.py` | `/api/v1/templates/*` | ✅ Template management (including multi-row export, Word export, and Word structured field parsing) |
| `visualization.py` | `/api/v1/visualization/*` | ✅ Visualization charts | | `visualization.py` | `/api/v1/visualization/*` | ✅ Visualization charts |
| `health.py` | `/api/v1/health` | ✅ Health check | | `health.py` | `/api/v1/health` | ✅ Health check |
@@ -70,71 +70,67 @@
| Page file | Function | Status | | Page file | Function | Status |
|----------|------|------| |----------|------|------|
| `Documents.tsx` | Main document management page | ✅ Done | | `Documents.tsx` | Main document management page | ✅ Done |
| `TemplateFill.tsx` | Smart form-filling page | ✅ Done |
| `ExcelParse.tsx` | Excel parsing page | ✅ Done | | `ExcelParse.tsx` | Excel parsing page | ✅ Done |
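### 2.4 Document parsing capabilities ### 2.4 Document parsing capabilities
| Format | Parsing status | Notes | | Format | Parsing status | Notes |
|------|----------|------| |------|----------|------|
| Excel (.xlsx/.xls) | ✅ Done | pandas with XML fallback parsing | | Excel (.xlsx/.xls) | ✅ Done | pandas with XML fallback parsing; supports multiple sheets |
| Markdown (.md) | ✅ Done | Regex plus AI-based section splitting | | Markdown (.md) | ✅ Done | Regex plus AI-based section splitting |
| Word (.docx) | ✅ Done | Parsed with python-docx; supports table extraction and field recognition | | Word (.docx) | ✅ Done | Parsed with python-docx; supports table extraction and field recognition |
| Text (.txt) | ✅ Done | chardet encoding detection; supports text cleaning and structured extraction | | Text (.txt) | ✅ Done | chardet encoding detection; supports text cleaning and structured extraction |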
--- ---
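## 3. Pending features (core gaps) ## 3. Core feature implementation details
### 3.1 Template filling module (top priority) ### 3.1 Template filling module (✅ Done)
**Current status**: ✅ Done
**Core flow**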
``` ```
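User uploads a template table (Word/Excel) Upload a template table (Word/Excel)
Parse the template and extract the fields to fill and their hints Parse the template and extract the fields to fill and their hints
Read source data from the source document list specified by the template Read source data by source document ID list (MongoDB or files)
AI extracts information from the source data based on field hints Prefer direct extraction from structured data (Excel rows)
Fill extracted data into the matching template positions Fall back to LLM extraction from text when direct extraction is not possible
Return the completed table Fill extracted data into the matching positions of the original template (preserving template formatting)
Export the completed table (Excel/Word)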
``` ```
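**Completed implementation** **Key features**
- [x] `template_fill_service.py` - core template filling service - **Original template filling**: opens the original template file and writes data into its existing tables/cells
- [x] Word template parsing (`docx_parser.py` - parse_tables_for_template, extract_template_fields_from_docx) - **Multi-row data support**: each field can yield multiple values, and rows are added automatically on export
- [x] Text template parsing (`txt_parser.py` - done) - **Structured data first**: extracts directly from Excel rows, with no LLM needed
- [x] Template field recognition and hint extraction - **JSON fault tolerance**: handles corrupted or truncated JSON returned by the LLM
- [x] Multi-document data aggregation and conflict handling - **Markdown cleanup**: automatically strips markdown formatting from LLM output
- [x] Export results to Word/Excel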
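### 3.2 Word document parsing ### 3.2 Word document parsing (✅ Done)
**Current status**: ✅ Done
**Implemented features** **Implemented features**
- [x] `docx_parser.py` - Word document parser - `docx_parser.py` - Word document parser
- [x] Extract paragraph text - Extract paragraph text
- [x] Extract table content - Extract table content (supports the competition table format: field name | hint | filled value)
- [x] Extract key information (headings, lists, etc.) - `parse_tables_for_template()` - parses table templates and extracts fields
- [x] Table template field extraction (`parse_tables_for_template`, `extract_template_fields_from_docx`) - `extract_template_fields_from_docx()` - extracts template field definitions
- [x] Field type inference (`_infer_field_type_from_hint`) - `_infer_field_type_from_hint()` - infers the field type from the hint
- **API endpoint**: `/api/v1/templates/parse-word-structure` - uploads a Word document, extracts structured fields, and stores them in MongoDB
- **API endpoint**: `/api/v1/templates/word-fields/{doc_id}` - returns the template field info for a stored document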
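### 3.3 Text document parsing ### 3.3 Text document parsing (✅ Done)
**Current status**: ✅ Done
**Implemented features** **Implemented features**
- [x] `txt_parser.py` - text file parser - `txt_parser.py` - text file parser
- [x] Automatic encoding detection (chardet) - Automatic encoding detection (chardet)
- [x] Text cleaning - Text cleaning (removes control characters, normalizes whitespace)
- Structured data extraction (email addresses, URLs, phone numbers, dates, amounts)
### 3.4 Document-template matching (framework in place)
According to the Q&A, templates already specify their data files, so no matching algorithm is needed. Upload is already implemented; what remains is to confirm that the logic linking templates to data files is complete.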
--- ---
@@ -192,20 +188,20 @@ docs/test/
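## 6. Work plan (suggested) ## 6. Work plan (suggested)
### Priority 1: core template filling ### Priority 1: end-to-end testing
- Finish Word document parsing - Run accuracy tests on real test data
- Finish the template filling service - Verify that multi-row data export is correct
- End-to-end test verification - Test that Word template parsing works
### Priority 2: demo packaging and documentation ### Priority 2: demo packaging and documentation
- Prepare the project demo slides (PPT) - Prepare the project demo slides (PPT)
- Record a demo video - Record a demo video
- Complete the README deployment docs - Complete the README deployment docs
### Priority 3: testing and optimization ### Priority 3: optimization
- Run accuracy tests on real test data
- Improve response time - Improve response time
- Improve error handling - Improve error handling
- Add more test cases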
--- ---
@@ -215,29 +211,32 @@ docs/test/
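2. **Database**: database storage is not mandatory and can be skipped 2. **Database**: database storage is not mandatory and can be skipped
3. **Deployment**: local deployment is enough; no public server is required 3. **Deployment**: local deployment is enough; no public server is required
4. **Evaluation data**: the preliminary round uses only the data provided so far 4. **Evaluation data**: the preliminary round uses only the data provided so far
5. **RAG**: temporarily disabled; this does not affect the core evaluated features 5. **RAG**: temporarily disabled; this does not affect the core evaluated features (files are read directly)
--- ---
*Document version: v1.1* *Document version: v1.5*
*Last updated: 2026-04-08* *Last updated: 2026-04-09*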
--- ---
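## 8. Technical implementation details ## 8. Technical implementation details
### 8.1 Template filling flow (implemented) ### 8.1 Template filling flow
#### Flow diagram #### Flow diagram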
``` ```
┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│ 上传模板 │ ──► │ 选择数据源 │ ──► │ AI 智能填表 │ │ 上传模板 │ ──► │ 选择数据源 │ ──► │ 智能填表
└─────────────┘ └─────────────┘ └─────────────┘ └─────────────┘ └─────────────┘ └─────────────┘
┌─────────────────────────┼─────────────────────────┐
┌─────────────┐ │ │
│ 导出结果 │ ▼ ▼
───────────── ┌───────────────┐ ┌───────────────┐ ┌───────────────
│ 结构化数据提取 │ │ LLM 提取 │ │ 导出结果 │
│ (直接读rows) │ │ (文本理解) │ │ (Excel/Word) │
└───────────────┘ └───────────────┘ └───────────────┘
``` ```
#### Core components #### Core components
@@ -247,8 +246,10 @@ docs/test/
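| Template upload | `templates.py` `/templates/upload` | Receives the template file and extracts fields | | Template upload | `templates.py` `/templates/upload` | Receives the template file and extracts fields |
| Field extraction | `template_fill_service.py` | Extracts field definitions from Word/Excel tables | | Field extraction | `template_fill_service.py` | Extracts field definitions from Word/Excel tables |
| Document parsing | `docx_parser.py`, `xlsx_parser.py`, `txt_parser.py` | Parses source document content | | Document parsing | `docx_parser.py`, `xlsx_parser.py`, `txt_parser.py` | Parses source document content |
| Smart filling | `template_fill_service.py` `fill_template()` | Uses the LLM to extract information from source documents | | Smart filling | `template_fill_service.py` `fill_template()` | Structured extraction + LLM extraction |
| Result export | `templates.py` `/templates/export` | Exports to Excel or Word | | Multi-row support | `template_fill_service.py` `FillResult` | Supports a `values` array |
| JSON fault tolerance | `template_fill_service.py` `_fix_json()` | Repairs corrupted JSON |
| Result export | `templates.py` `/templates/export` | Multi-row Excel + Word export |
### 8.2 How source documents are loaded ### 8.2 How source documents are loaded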
@@ -268,7 +269,9 @@ docs/test/
```python ```python
# Extract table template fields # Extract table template fields
fields = docx_parser.extract_template_fields_from_docx(file_path) from docx_parser import DocxParser
parser = DocxParser()
fields = parser.extract_template_fields_from_docx(file_path)
# Return format # Return format
# [ # [
@@ -295,6 +298,24 @@ fields = docx_parser.extract_template_fields_from_docx(file_path)
### 8.5 API endpoints ### 8.5 API endpoints
#### POST `/api/v1/templates/upload`
Uploads a template file and extracts its field definitions.
**Response**
```json
{
"success": true,
"template_id": "/path/to/saved/template.docx",
"filename": "模板.docx",
"file_type": "docx",
"fields": [
{"cell": "A1", "name": "姓名", "field_type": "text", "required": true, "hint": "提取人员姓名"}
],
"field_count": 1
}
```
#### POST `/api/v1/templates/fill` #### POST `/api/v1/templates/fill`
Fill request: Fill request:
@@ -306,35 +327,232 @@ fields = docx_parser.extract_template_fields_from_docx(file_path)
], ],
"source_doc_ids": ["mongodb_doc_id_1", "mongodb_doc_id_2"], "source_doc_ids": ["mongodb_doc_id_1", "mongodb_doc_id_2"],
"source_file_paths": [], "source_file_paths": [],
"user_hint": "请从合同文档中提取" "user_hint": "请从xxx文档中提取"
} }
``` ```
Response **Response (with multi-row support)**
```json ```json
{ {
"success": true, "success": true,
"filled_data": {"姓名": "张三"}, "filled_data": {
"姓名": ["张三", "李四", "王五"],
"年龄": ["25", "30", "28"]
},
"fill_details": [ "fill_details": [
{ {
"field": "姓名", "field": "姓名",
"cell": "A1", "cell": "A1",
"values": ["张三", "李四", "王五"],
"value": "张三", "value": "张三",
"source": "来自:合同文档.docx", "source": "结构化数据直接提取",
"confidence": 0.95 "confidence": 1.0
} }
], ],
"source_doc_count": 2 "source_doc_count": 2,
"max_rows": 3
} }
``` ```
#### POST `/api/v1/templates/export` #### POST `/api/v1/templates/export`
Export request: Export request (creates a new file)
```json ```json
{ {
"template_id": "模板ID", "template_id": "模板ID",
"filled_data": {"姓名": "张三", "金额": "10000"}, "filled_data": {"姓名": ["张三", "李四"], "金额": ["10000", "20000"]},
"format": "xlsx" // or "docx" "format": "xlsx"
} }
``` ```
#### POST `/api/v1/templates/fill-and-export`
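**Fill the original template and export** (recommended for the competition)
Opens the original template file, writes the data into the template's tables/cells, and exports the result. **The original template formatting is preserved.**
**Request**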
```json
{
"template_path": "/path/to/original/template.docx",
"filled_data": {
"姓名": ["张三", "李四", "王五"],
"年龄": ["25", "30", "28"]
},
"format": "docx"
}
```
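**Response**: the filled Word/Excel file (file stream)
**Characteristics**
- Opens the original template file
- Matches field names to column indexes using the header row
- Writes data into the cells of the matching column
- Adds table rows automatically for multi-row data
- Preserves the original template formatting and styles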
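
As a rough illustration of the header-matching and row-expansion steps listed above, the sketch below works on a plain list-of-lists grid rather than a real Word or Excel table. The function name `fill_table_grid`, the rectangular-grid assumption, and the exact-match rule for header cells are all assumptions made for this example; the actual endpoint may match headers differently.

```python
from typing import Dict, List


def fill_table_grid(grid: List[List[str]], filled_data: Dict[str, List[str]]) -> List[List[str]]:
    """Fill a rectangular table, given as a list of rows, column by column.

    grid[0] is treated as the header row; data rows start at grid[1].
    """
    header = [cell.strip() for cell in grid[0]]
    # Map each field name to the index of the header cell with the same text.
    # Fields that do not appear in the header are ignored.
    col_index = {name: header.index(name) for name in filled_data if name in header}

    # The longest value list decides how many data rows the table needs.
    needed_rows = max((len(filled_data[name]) for name in col_index), default=0)

    # Copy the grid so the caller's template data is left untouched.
    result = [list(row) for row in grid]
    while len(result) - 1 < needed_rows:
        result.append([""] * len(header))

    # Write each field's values down its column, one value per data row.
    for name, col in col_index.items():
        for offset, value in enumerate(filled_data[name]):
            result[offset + 1][col] = str(value)
    return result
```

Working on a copy keeps the template grid reusable across several fill requests; a real implementation would apply the same column mapping to the cells of the opened document.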
#### POST `/api/v1/templates/parse-word-structure`
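**Upload a Word document and extract structured fields** (competition-specific)
Uploads a Word document, extracts field definitions (field name, hint, field type) from its table template, and stores them in MongoDB.
**Request**: multipart/form-data
- file: the Word file
**Response**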
```json
{
"success": true,
"doc_id": "mongodb_doc_id",
"filename": "模板.docx",
"file_path": "/path/to/saved/template.docx",
"field_count": 5,
"fields": [
{
"cell": "T0R1",
"name": "字段名",
"hint": "提示词",
"field_type": "text",
"required": true
}
],
"tables": [...],
"metadata": {
"paragraph_count": 10,
"table_count": 1,
"word_count": 500,
"has_tables": true
}
}
```
#### GET `/api/v1/templates/word-fields/{doc_id}`
**Get template field info for a Word document**
Returns the template field info of a previously uploaded Word document, looked up by doc_id.
**Response**
```json
{
"success": true,
"doc_id": "mongodb_doc_id",
"filename": "模板.docx",
"fields": [...],
"tables": [...],
"field_count": 5,
"metadata": {...}
}
```
### 8.6 Multi-row data handling
**FillResult data structure**
```python
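from dataclasses import dataclass
from typing import Any, List, Optional

@dataclass
class FillResult:
    field: str
    values: Optional[List[Any]] = None  # multiple values (array)
    value: Any = ""                     # kept for compatibility (first value)
    source: str = ""                    # source document
    confidence: float = 1.0             # confidence score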
```
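**Export logic**
- Compute the maximum row count across all fields
- Iterate over each row, taking the value at that index for every field
- Pad fields that have fewer values with empty strings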
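
The three export steps above can be sketched in a few lines. This is a minimal illustration, not the project's export code: the helper name `expand_rows` is hypothetical, and it assumes `filled_data` maps each field name to a list of values, as in the `/templates/fill` response.

```python
from typing import Any, Dict, List


def expand_rows(filled_data: Dict[str, List[Any]]) -> List[Dict[str, Any]]:
    """Turn per-field value lists into one dict per output row."""
    # Maximum row count across all fields (0 when there are no fields).
    max_rows = max((len(values) for values in filled_data.values()), default=0)
    rows = []
    for i in range(max_rows):
        # Take the value at index i, or an empty string when the field is shorter.
        rows.append({
            field: values[i] if i < len(values) else ""
            for field, values in filled_data.items()
        })
    return rows
```

Each returned dict corresponds to one spreadsheet or table row, so a writer can iterate over the result without any further length checks.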
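### 8.7 JSON fault tolerance
When the JSON returned by the LLM is corrupted or truncated, the system:
1. Strips markdown code fence markers (`` ```json `` and `` ``` ``)
2. Tries to pair up brackets to find a complete JSON object
3. Removes trailing commas
4. Uses a regular expression to extract the `values` array
5. As a last resort, extracts every quoted string directly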
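
A hedged sketch of how the five fallback steps could be chained is shown below. It is not the project's `_fix_json()`; the function name `parse_llm_values`, the expected `{"values": [...]}` shape, and the simplification that braces inside string literals are not handled specially are all assumptions for illustration.

```python
import json
import re
from typing import List

FENCE = "`" * 3  # markdown code fence marker
QUOTED = r'"((?:[^"\\]|\\.)*)"'  # a double-quoted string, allowing escapes


def parse_llm_values(raw: str) -> List[str]:
    """Best-effort extraction of a list of values from possibly broken LLM output."""
    # 1. Strip markdown code fence markers, with or without a "json" tag.
    text = re.sub(re.escape(FENCE) + r"(?:json)?", "", raw).strip()

    # 2. Pair up braces to find the first complete JSON object, if any.
    #    (Simplification: braces inside string literals are counted too.)
    candidate = None
    start = text.find("{")
    if start != -1:
        depth = 0
        for i, ch in enumerate(text[start:], start):
            if ch == "{":
                depth += 1
            elif ch == "}":
                depth -= 1
                if depth == 0:
                    candidate = text[start:i + 1]
                    break

    if candidate is not None:
        # 3. Remove trailing commas before a closing brace or bracket.
        candidate = re.sub(r",\s*([}\]])", r"\1", candidate)
        try:
            data = json.loads(candidate)
            if isinstance(data, dict) and isinstance(data.get("values"), list):
                return [str(v) for v in data["values"]]
        except json.JSONDecodeError:
            pass

    # 4. Fall back to a regex over the "values" array, even if it is truncated.
    match = re.search(r'"values"\s*:\s*\[(.*?)(?:\]|$)', text, re.DOTALL)
    if match:
        return re.findall(QUOTED, match.group(1))

    # 5. Last resort: every quoted string in the text.
    return re.findall(QUOTED, text)
```

Ordering the steps from strictest to loosest means well-formed responses still go through `json.loads`, and the regex fallbacks only run when parsing has already failed.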
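### 8.8 Structured-data-first extraction
For documents with a `rows` structure, such as Excel files, the system:
1. Looks for a matching column directly in `structured_data.rows`
2. Uses fuzzy matching (the field name contains, or is contained in, the column name)
3. Extracts every row value from that column
4. Skips the LLM call entirely, which is faster and more accurate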
```python
# Internal logic (simplified)
rows = structured.get("rows")
if rows:
    columns = structured.get("columns", [])
    values = _extract_column_values(rows, columns, field_name)
```
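
The snippet above calls `_extract_column_values` without showing it. One possible shape for such a helper is sketched here, under the assumption that each row is a list aligned with `columns`; the real service may store rows differently (for example as dicts), and the name `extract_column_values` is used only for this illustration.

```python
from typing import Any, List


def extract_column_values(rows: List[List[Any]], columns: List[str], field_name: str) -> List[str]:
    """Return the values of the first column whose name fuzzily matches field_name."""
    target = field_name.strip().lower()
    if not target:
        return []
    for idx, column in enumerate(columns):
        name = str(column).strip().lower()
        # Fuzzy match: one name contains the other. Empty headers never match.
        if name and (target in name or name in target):
            # Keep one entry per row so values stay aligned across fields;
            # missing or None cells become empty strings.
            return [
                "" if idx >= len(row) or row[idx] is None else str(row[idx])
                for row in rows
            ]
    return []
```

Returning an empty list when no column matches gives the caller a simple signal to fall back to LLM extraction.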
---
## 9. Dependencies
### Python dependencies
```
# must be included in requirements.txt
fastapi>=0.104.0
uvicorn>=0.24.0
motor>=3.3.0 # async MongoDB driver
sqlalchemy>=2.0.0 # MySQL ORM
pandas>=2.0.0 # Excel processing
openpyxl>=3.1.0 # Excel writing
python-docx>=0.8.0 # Word processing
chardet>=4.0.0 # encoding detection
httpx>=0.25.0 # HTTP client
```
### Frontend dependencies
```
# must be included in package.json
react>=18.0.0
react-dropzone>=14.0.0
lucide-react>=0.300.0
sonner>=1.0.0 # toast notifications
```
---
## 10. Getting started
### Starting the backend
```bash
cd backend
.\venv\Scripts\Activate.ps1 # or Activate.bat
pip install -r requirements.txt # make sure all dependencies are installed
.\venv\Scripts\python.exe -m uvicorn app.main:app --host 127.0.0.1 --port 8000 --reload
```
### Starting the frontend
```bash
cd frontend
npm install
npm run dev
```
### Environment variables
Configure the following in `backend/.env`:
```
MONGODB_URL=mongodb://localhost:27017
MONGODB_DB_NAME=document_system
MYSQL_HOST=localhost
MYSQL_PORT=3306
MYSQL_USER=root
MYSQL_PASSWORD=your_password
MYSQL_DATABASE=document_system
LLM_API_KEY=your_api_key
LLM_BASE_URL=https://api.minimax.chat
LLM_MODEL_NAME=MiniMax-Text-01
```