Compare commits
29 Commits
df35105d16...main

| SHA1 |
|---|
| ecad9ccd82 |
| 51350e3002 |
| 8e713be1ca |
| f2af27245d |
| a9dc0d8b91 |
| 902c28166b |
| 4a53be7eeb |
| 8b5b24fa2a |
| ed66aa346d |
| 5b82d40be0 |
| bedf1af9c0 |
| 5fca4eb094 |
| 0dbf74db9d |
| 858b594171 |
| ed0f51f2a4 |
| ecc0c79475 |
| 6befc510d8 |
| 8f66c235fa |
| 886d5ae0cc |
| 6752c5c231 |
| 610d475ce0 |
| 496b96508d |
| 07ebdc09bc |
| 7f67fa89de |
| c1886fb68f |
| 78417c898a |
| d5df5b8283 |
| 718f864926 |
| e5711b3f05 |
@@ -29,9 +29,14 @@ REDIS_URL="redis://localhost:6379/0"

# ==================== LLM AI configuration ====================
# Large language model API configuration
LLM_API_KEY="your_api_key_here"
LLM_BASE_URL=""
LLM_MODEL_NAME=""
# Supports OpenAI-compatible APIs (DeepSeek, Zhipu GLM, Alibaba, etc.)
# Zhipu AI GLM series:
# - Models: glm-4-flash (fast text model), glm-4 (standard), glm-4-plus (high performance)
# - API: https://open.bigmodel.cn
# - API Key: https://open.bigmodel.cn/usercenter/apikeys
LLM_API_KEY="ca79ad9f96524cd5afc3e43ca97f347d.cpiLLx2oyitGvTeU"
LLM_BASE_URL="https://open.bigmodel.cn/api/paas/v4"
LLM_MODEL_NAME="glm-4v-plus"

# ==================== Supabase configuration ====================
# Supabase project configuration

38
backend/=3.0.0
Normal file
@@ -0,0 +1,38 @@
Requirement already satisfied: sentence-transformers in c:\python312\lib\site-packages (2.2.2)
Requirement already satisfied: transformers<5.0.0,>=4.6.0 in c:\python312\lib\site-packages (from sentence-transformers) (4.57.6)
Requirement already satisfied: tqdm in c:\python312\lib\site-packages (from sentence-transformers) (4.66.1)
Requirement already satisfied: torch>=1.6.0 in c:\python312\lib\site-packages (from sentence-transformers) (2.10.0)
Requirement already satisfied: torchvision in c:\python312\lib\site-packages (from sentence-transformers) (0.25.0)
Requirement already satisfied: numpy in c:\python312\lib\site-packages (from sentence-transformers) (1.26.2)
Requirement already satisfied: scikit-learn in c:\python312\lib\site-packages (from sentence-transformers) (1.8.0)
Requirement already satisfied: scipy in c:\python312\lib\site-packages (from sentence-transformers) (1.16.3)
Requirement already satisfied: nltk in c:\python312\lib\site-packages (from sentence-transformers) (3.9.3)
Requirement already satisfied: sentencepiece in c:\python312\lib\site-packages (from sentence-transformers) (0.2.1)
Requirement already satisfied: huggingface-hub>=0.4.0 in c:\python312\lib\site-packages (from sentence-transformers) (0.36.2)
Requirement already satisfied: filelock in c:\python312\lib\site-packages (from huggingface-hub>=0.4.0->sentence-transformers) (3.25.2)
Requirement already satisfied: fsspec>=2023.5.0 in c:\python312\lib\site-packages (from huggingface-hub>=0.4.0->sentence-transformers) (2026.2.0)
Requirement already satisfied: packaging>=20.9 in c:\python312\lib\site-packages (from huggingface-hub>=0.4.0->sentence-transformers) (23.2)
Requirement already satisfied: pyyaml>=5.1 in c:\python312\lib\site-packages (from huggingface-hub>=0.4.0->sentence-transformers) (6.0.1)
Requirement already satisfied: requests in c:\python312\lib\site-packages (from huggingface-hub>=0.4.0->sentence-transformers) (2.31.0)
Requirement already satisfied: typing-extensions>=3.7.4.3 in c:\python312\lib\site-packages (from huggingface-hub>=0.4.0->sentence-transformers) (4.15.0)
Requirement already satisfied: sympy>=1.13.3 in c:\python312\lib\site-packages (from torch>=1.6.0->sentence-transformers) (1.14.0)
Requirement already satisfied: networkx>=2.5.1 in c:\python312\lib\site-packages (from torch>=1.6.0->sentence-transformers) (3.6.1)
Requirement already satisfied: jinja2 in c:\python312\lib\site-packages (from torch>=1.6.0->sentence-transformers) (3.1.6)
Requirement already satisfied: setuptools in c:\python312\lib\site-packages (from torch>=1.6.0->sentence-transformers) (82.0.1)
Requirement already satisfied: colorama in c:\python312\lib\site-packages (from tqdm->sentence-transformers) (0.4.6)
Requirement already satisfied: regex!=2019.12.17 in c:\python312\lib\site-packages (from transformers<5.0.0,>=4.6.0->sentence-transformers) (2026.2.28)
Requirement already satisfied: tokenizers<=0.23.0,>=0.22.0 in c:\python312\lib\site-packages (from transformers<5.0.0,>=4.6.0->sentence-transformers) (0.22.2)
Requirement already satisfied: safetensors>=0.4.3 in c:\python312\lib\site-packages (from transformers<5.0.0,>=4.6.0->sentence-transformers) (0.7.0)
Requirement already satisfied: click in c:\python312\lib\site-packages (from nltk->sentence-transformers) (8.3.1)
Requirement already satisfied: joblib in c:\python312\lib\site-packages (from nltk->sentence-transformers) (1.5.3)
Requirement already satisfied: threadpoolctl>=3.2.0 in c:\python312\lib\site-packages (from scikit-learn->sentence-transformers) (3.6.0)
Requirement already satisfied: pillow!=8.3.*,>=5.3.0 in c:\python312\lib\site-packages (from torchvision->sentence-transformers) (12.1.1)
Requirement already satisfied: mpmath<1.4,>=1.1.0 in c:\python312\lib\site-packages (from sympy>=1.13.3->torch>=1.6.0->sentence-transformers) (1.3.0)
Requirement already satisfied: MarkupSafe>=2.0 in c:\python312\lib\site-packages (from jinja2->torch>=1.6.0->sentence-transformers) (3.0.3)
Requirement already satisfied: charset-normalizer<4,>=2 in c:\python312\lib\site-packages (from requests->huggingface-hub>=0.4.0->sentence-transformers) (3.4.6)
Requirement already satisfied: idna<4,>=2.5 in c:\python312\lib\site-packages (from requests->huggingface-hub>=0.4.0->sentence-transformers) (3.11)
Requirement already satisfied: urllib3<3,>=1.21.1 in c:\python312\lib\site-packages (from requests->huggingface-hub>=0.4.0->sentence-transformers) (2.6.3)
Requirement already satisfied: certifi>=2017.4.17 in c:\python312\lib\site-packages (from requests->huggingface-hub>=0.4.0->sentence-transformers) (2026.2.25)

[notice] A new release of pip is available: 24.2 -> 26.0.1
[notice] To update, run: python.exe -m pip install --upgrade pip
@@ -13,6 +13,7 @@ from app.api.endpoints import (
    visualization,
    analysis_charts,
    health,
    instruction,  # intelligent instructions
)

# create the main router
@@ -29,3 +30,4 @@ api_router.include_router(templates.router)  # table templates
api_router.include_router(ai_analyze.router)  # AI analysis
api_router.include_router(visualization.router)  # visualization
api_router.include_router(analysis_charts.router)  # analysis charts
api_router.include_router(instruction.router)  # intelligent instructions

@@ -10,6 +10,8 @@ import os

from app.services.excel_ai_service import excel_ai_service
from app.services.markdown_ai_service import markdown_ai_service
from app.services.template_fill_service import template_fill_service
from app.services.word_ai_service import word_ai_service

logger = logging.getLogger(__name__)

@@ -215,9 +217,12 @@ async def analyze_markdown(
        return result

    finally:
        # clean up the temporary file
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        # clean up the temporary file; make sure it is removed in every case
        try:
            if tmp_path and os.path.exists(tmp_path):
                os.unlink(tmp_path)
        except Exception as cleanup_error:
            logger.warning(f"临时文件清理失败: {tmp_path}, error: {cleanup_error}")

    except HTTPException:
        raise
@@ -279,8 +284,12 @@ async def analyze_markdown_stream(
        )

    finally:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        # clean up the temporary file; make sure it is removed in every case
        try:
            if tmp_path and os.path.exists(tmp_path):
                os.unlink(tmp_path)
        except Exception as cleanup_error:
            logger.warning(f"临时文件清理失败: {tmp_path}, error: {cleanup_error}")

    except HTTPException:
        raise
@@ -289,7 +298,7 @@ async def analyze_markdown_stream(
        raise HTTPException(status_code=500, detail=f"流式分析失败: {str(e)}")


@router.get("/analyze/md/outline")
@router.post("/analyze/md/outline")
async def get_markdown_outline(
    file: UploadFile = File(...)
):
@@ -323,9 +332,154 @@ async def get_markdown_outline(
        result = await markdown_ai_service.extract_outline(tmp_path)
        return result
    finally:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        # clean up the temporary file; make sure it is removed in every case
        try:
            if tmp_path and os.path.exists(tmp_path):
                os.unlink(tmp_path)
        except Exception as cleanup_error:
            logger.warning(f"临时文件清理失败: {tmp_path}, error: {cleanup_error}")

    except Exception as e:
        logger.error(f"获取 Markdown 大纲失败: {str(e)}")
        raise HTTPException(status_code=500, detail=f"获取大纲失败: {str(e)}")


@router.post("/analyze/txt")
async def analyze_txt(
    file: UploadFile = File(...),
):
    """
    Upload a TXT text file and analyze it with AI to extract structured data

    Converts unstructured text into structured table data for later form filling

    Args:
        file: the uploaded TXT file

    Returns:
        dict: analysis result containing the structured table data
    """
    if not file.filename:
        raise HTTPException(status_code=400, detail="文件名为空")

    file_ext = file.filename.split('.')[-1].lower()
    if file_ext not in ['txt', 'text']:
        raise HTTPException(
            status_code=400,
            detail=f"不支持的文件类型: {file_ext},仅支持 .txt"
        )

    try:
        # read the file content
        content = await file.read()

        # save it to a temporary file
        with tempfile.NamedTemporaryFile(mode='wb', suffix='.txt', delete=False) as tmp:
            tmp.write(content)
            tmp_path = tmp.name

        try:
            logger.info(f"开始 AI 分析 TXT 文件: {file.filename}")

            # use the AI analysis method of template_fill_service
            result = await template_fill_service.analyze_txt_with_ai(
                content=content.decode('utf-8', errors='replace'),
                filename=file.filename
            )

            if result:
                logger.info(f"TXT AI 分析成功: {file.filename}")
                return {
                    "success": True,
                    "filename": file.filename,
                    "structured_data": result
                }
            else:
                logger.warning(f"TXT AI 分析返回空结果: {file.filename}")
                return {
                    "success": False,
                    "filename": file.filename,
                    "error": "AI 分析未能提取到结构化数据",
                    "structured_data": None
                }

        finally:
            # clean up the temporary file
            if os.path.exists(tmp_path):
                os.unlink(tmp_path)

    except HTTPException:
        raise
    except Exception as e:
        logger.error(f"TXT AI 分析过程中出错: {str(e)}")
        raise HTTPException(status_code=500, detail=f"分析失败: {str(e)}")


# ==================== Word document AI parsing ====================

@router.post("/analyze/word")
async def analyze_word(
    file: UploadFile = File(...),
    user_hint: str = Query("", description="用户提示词,如'请提取表格数据'")
):
    """
    Parse a Word document with AI and extract structured data

    Intended for pulling table data, key-value pairs, etc. out of unstructured Word documents

    Args:
        file: the uploaded Word file
        user_hint: the user's hint prompt

    Returns:
        dict: parse result containing the structured data
    """
    if not file.filename:
        raise HTTPException(status_code=400, detail="文件名为空")

    file_ext = file.filename.split('.')[-1].lower()
    if file_ext not in ['docx']:
        raise HTTPException(
            status_code=400,
            detail=f"不支持的文件类型: {file_ext},仅支持 .docx"
        )

    try:
        # save the uploaded file
        content = await file.read()
        suffix = f".{file_ext}"
        with tempfile.NamedTemporaryFile(delete=False, suffix=suffix) as tmp:
            tmp.write(content)
            tmp_path = tmp.name

        try:
            # parse the Word document with AI
            result = await word_ai_service.parse_word_with_ai(
                file_path=tmp_path,
                user_hint=user_hint or "请提取文档中的所有结构化数据,包括表格、键值对等"
            )

            if result.get("success"):
                return {
                    "success": True,
                    "filename": file.filename,
                    "result": result
                }
            else:
                return {
                    "success": False,
                    "filename": file.filename,
                    "error": result.get("error", "AI 解析失败"),
                    "result": None
                }

        finally:
            # clean up the temporary file
            if os.path.exists(tmp_path):
                os.unlink(tmp_path)

    except HTTPException:
        raise
    except Exception as e:
        logger.error(f"Word AI 分析过程中出错: {str(e)}")
        raise HTTPException(status_code=500, detail=f"分析失败: {str(e)}")

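The hunks above replace each bare `os.unlink` in a `finally` block with a variant that tolerates a missing path and never lets a cleanup failure mask the endpoint's real result. A minimal sketch of that pattern, with an illustrative helper name (`cleanup_tmp` is not from the repo):

```python
import logging
import os
import tempfile

logger = logging.getLogger(__name__)

def cleanup_tmp(tmp_path):
    """Remove a temp file; never raise from cleanup, only log a warning."""
    try:
        if tmp_path and os.path.exists(tmp_path):
            os.unlink(tmp_path)
            return True
    except Exception as cleanup_error:
        logger.warning(f"temp file cleanup failed: {tmp_path}, error: {cleanup_error}")
    return False

# usage mirrors the endpoints: write the upload bytes to a NamedTemporaryFile,
# then clean up in `finally` regardless of how the analysis ended
with tempfile.NamedTemporaryFile(suffix=".txt", delete=False) as tmp:
    tmp.write(b"demo")
    tmp_path = tmp.name
try:
    pass  # the AI analysis would run here
finally:
    removed = cleanup_tmp(tmp_path)
```

Because the `except` swallows cleanup errors, an analysis result (or the original exception) always propagates instead of being replaced by an `OSError` from `unlink`.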
@@ -23,6 +23,52 @@ logger = logging.getLogger(__name__)

router = APIRouter(prefix="/upload", tags=["文档上传"])


# ==================== Helper functions ====================

async def update_task_status(
    task_id: str,
    status: str,
    progress: int = 0,
    message: str = "",
    result: dict = None,
    error: str = None
):
    """
    Update task status, writing to both Redis and MongoDB

    Args:
        task_id: task ID
        status: status
        progress: progress
        message: message
        result: result
        error: error message
    """
    meta = {"progress": progress, "message": message}
    if result:
        meta["result"] = result
    if error:
        meta["error"] = error

    # try to write to Redis
    try:
        await redis_db.set_task_status(task_id, status, meta)
    except Exception as e:
        logger.warning(f"Redis 任务状态更新失败: {e}")

    # try to write to MongoDB (as a fallback)
    try:
        await mongodb.update_task(
            task_id=task_id,
            status=status,
            message=message,
            result=result,
            error=error
        )
    except Exception as e:
        logger.warning(f"MongoDB 任务状态更新失败: {e}")


# ==================== Request/response models ====================

class UploadResponse(BaseModel):
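The `update_task_status` helper added above writes the same status to two backends and deliberately swallows a failure in either one, so a flaky Redis or MongoDB never aborts document processing. A runnable sketch of that degrade-gracefully dual write, using stand-in async callables instead of the repo's `redis_db`/`mongodb` clients (an assumption for illustration):

```python
import asyncio
import logging

logger = logging.getLogger(__name__)

async def update_task_status(primary, backup, task_id, status,
                             progress=0, message="", result=None, error=None):
    """Write task status to a primary store and a backup; a failure in
    either backend is logged as a warning but never propagated."""
    meta = {"progress": progress, "message": message}
    if result:
        meta["result"] = result
    if error:
        meta["error"] = error
    for name, backend in (("primary", primary), ("backup", backup)):
        try:
            await backend(task_id, status, meta)
        except Exception as e:
            logger.warning(f"{name} status update failed: {e}")

# demo backends: a healthy in-memory store and one that is always down
written = {}

async def ok_backend(task_id, status, meta):
    written[task_id] = {"status": status, **meta}

async def bad_backend(task_id, status, meta):
    raise ConnectionError("backend down")

asyncio.run(update_task_status(ok_backend, bad_backend, "t1", "processing",
                               progress=10, message="parsing"))
```

The call completes even though one backend raised, which is the same contract `process_document` relies on when it replaces its direct `redis_db.set_task_status` calls.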
@@ -77,6 +123,17 @@ async def upload_document(
    task_id = str(uuid.uuid4())

    try:
        # save a task record to MongoDB (still queryable if Redis is unavailable)
        try:
            await mongodb.insert_task(
                task_id=task_id,
                task_type="document_parse",
                status="pending",
                message=f"文档 {file.filename} 已提交处理"
            )
        except Exception as mongo_err:
            logger.warning(f"MongoDB 保存任务记录失败: {mongo_err}")

        content = await file.read()
        saved_path = file_service.save_uploaded_file(
            content,
@@ -122,6 +179,17 @@ async def upload_documents(
    saved_paths = []

    try:
        # save a task record to MongoDB
        try:
            await mongodb.insert_task(
                task_id=task_id,
                task_type="batch_parse",
                status="pending",
                message=f"已提交 {len(files)} 个文档处理"
            )
        except Exception as mongo_err:
            logger.warning(f"MongoDB 保存批量任务记录失败: {mongo_err}")

        for file in files:
            if not file.filename:
                continue
@@ -159,9 +227,9 @@ async def process_document(
    """Process a single document"""
    try:
        # status: parsing
        await redis_db.set_task_status(
        await update_task_status(
            task_id, status="processing",
            meta={"progress": 10, "message": "正在解析文档"}
            progress=10, message="正在解析文档"
        )

        # parse the document
@@ -172,9 +240,9 @@ async def process_document(
            raise Exception(result.error or "解析失败")

        # status: storing
        await redis_db.set_task_status(
        await update_task_status(
            task_id, status="processing",
            meta={"progress": 30, "message": "正在存储数据"}
            progress=30, message="正在存储数据"
        )

        # store to MongoDB
@@ -191,9 +259,9 @@ async def process_document(

        # for Excel files: store to MySQL + AI-generated descriptions + RAG index
        if doc_type in ["xlsx", "xls"]:
            await redis_db.set_task_status(
            await update_task_status(
                task_id, status="processing",
                meta={"progress": 50, "message": "正在存储到MySQL并生成字段描述"}
                progress=50, message="正在存储到MySQL并生成字段描述"
            )

            try:
@@ -215,9 +283,9 @@ async def process_document(

        else:
            # unstructured document
            await redis_db.set_task_status(
            await update_task_status(
                task_id, status="processing",
                meta={"progress": 60, "message": "正在建立索引"}
                progress=60, message="正在建立索引"
            )

            # if the document contains table data, extract it and store to MySQL + RAG
@@ -238,17 +306,13 @@ async def process_document(
        await index_document_to_rag(doc_id, original_filename, result, doc_type)

        # done
        await redis_db.set_task_status(
        await update_task_status(
            task_id, status="success",
            meta={
                "progress": 100,
                "message": "处理完成",
            progress=100, message="处理完成",
            result={
                "doc_id": doc_id,
                "result": {
                    "doc_id": doc_id,
                    "doc_type": doc_type,
                    "filename": original_filename
                }
                "doc_type": doc_type,
                "filename": original_filename
            }
        )

@@ -256,18 +320,19 @@ async def process_document(

    except Exception as e:
        logger.error(f"文档处理失败: {str(e)}")
        await redis_db.set_task_status(
        await update_task_status(
            task_id, status="failure",
            meta={"error": str(e)}
            progress=0, message="处理失败",
            error=str(e)
        )


async def process_documents_batch(task_id: str, files: List[dict]):
    """Process documents in batch"""
    try:
        await redis_db.set_task_status(
        await update_task_status(
            task_id, status="processing",
            meta={"progress": 0, "message": "开始批量处理"}
            progress=0, message="开始批量处理"
        )

        results = []
@@ -318,37 +383,43 @@ async def process_documents_batch(task_id: str, files: List[dict]):
            results.append({"filename": file_info["filename"], "success": False, "error": str(e)})

            progress = int((i + 1) / len(files) * 100)
            await redis_db.set_task_status(
            await update_task_status(
                task_id, status="processing",
                meta={"progress": progress, "message": f"已处理 {i+1}/{len(files)}"}
                progress=progress, message=f"已处理 {i+1}/{len(files)}"
            )

        await redis_db.set_task_status(
        await update_task_status(
            task_id, status="success",
            meta={"progress": 100, "message": "批量处理完成", "results": results}
            progress=100, message="批量处理完成",
            result={"results": results}
        )

    except Exception as e:
        logger.error(f"批量处理失败: {str(e)}")
        await redis_db.set_task_status(
        await update_task_status(
            task_id, status="failure",
            meta={"error": str(e)}
            progress=0, message="批量处理失败",
            error=str(e)
        )


async def index_document_to_rag(doc_id: str, filename: str, result: ParseResult, doc_type: str):
    """Index an unstructured document into RAG"""
    """Index an unstructured document into RAG (using chunked indexing)"""
    try:
        content = result.data.get("content", "")
        if content:
            # pass the full content to the RAG service for automatic chunked indexing
            rag_service.index_document_content(
                doc_id=doc_id,
                content=content[:5000],
                content=content,  # pass the full content; the RAG service chunks it automatically
                metadata={
                    "filename": filename,
                    "doc_type": doc_type
                }
                },
                chunk_size=500,   # 500 characters per chunk
                chunk_overlap=50  # 50-character overlap between chunks
            )
            logger.info(f"RAG 索引完成: (unknown), doc_id={doc_id}")
    except Exception as e:
        logger.warning(f"RAG 索引失败: {str(e)}")

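The `index_document_to_rag` change above stops truncating content at 5000 characters and instead hands the full text to the RAG service with `chunk_size=500` and `chunk_overlap=50`. The diff does not show the service's chunker, but fixed-size chunking with overlap can be sketched as follows (`chunk_text` is an illustrative name, not the repo's API):

```python
def chunk_text(content, chunk_size=500, chunk_overlap=50):
    """Split content into chunks of at most `chunk_size` characters where
    each chunk repeats the last `chunk_overlap` characters of the previous
    one, so sentences cut at a boundary still appear whole in one chunk."""
    if chunk_overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk size")
    step = chunk_size - chunk_overlap  # advance 450 chars per chunk by default
    stop = max(len(content) - chunk_overlap, 1)
    return [content[i:i + chunk_size] for i in range(0, stop, step)]

# a 1200-character document yields 3 overlapping chunks
chunks = chunk_text("".join(str(i % 10) for i in range(1200)))
```

Overlap trades a little index size for recall at chunk boundaries, which is presumably why the diff passes both parameters explicitly.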
@@ -19,26 +19,43 @@ async def health_check() -> Dict[str, Any]:
    Returns the connection status of each database plus application info
    """
    # check each database connection status
    mysql_status = "connected"
    mongodb_status = "connected"
    redis_status = "connected"
    mysql_status = "unknown"
    mongodb_status = "unknown"
    redis_status = "unknown"

    try:
        if mysql_db.async_engine is None:
            mysql_status = "disconnected"
    except Exception:
        else:
            # actually run a query to verify the connection
            from sqlalchemy import text
            async with mysql_db.async_engine.connect() as conn:
                await conn.execute(text("SELECT 1"))
            mysql_status = "connected"
    except Exception as e:
        logger.warning(f"MySQL 健康检查失败: {e}")
        mysql_status = "error"

    try:
        if mongodb.client is None:
            mongodb_status = "disconnected"
    except Exception:
        else:
            # verify with an actual ping
            await mongodb.client.admin.command('ping')
            mongodb_status = "connected"
    except Exception as e:
        logger.warning(f"MongoDB 健康检查失败: {e}")
        mongodb_status = "error"

    try:
        if not redis_db.is_connected:
        if not redis_db.is_connected or redis_db.client is None:
            redis_status = "disconnected"
    except Exception:
        else:
            # verify with an actual ping
            await redis_db.client.ping()
            redis_status = "connected"
    except Exception as e:
        logger.warning(f"Redis 健康检查失败: {e}")
        redis_status = "error"

    return {
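The health-check hunk above replaces optimistic `"connected"` defaults with a real probe per backend: no client means `"disconnected"`, a successful round trip means `"connected"`, and a raised exception means `"error"`. That three-way outcome can be sketched generically with stand-in ping coroutines (the `probe` helper and demo pings are illustrative, not from the repo):

```python
import asyncio
import logging

logger = logging.getLogger(__name__)

async def probe(name, client, ping):
    """Return 'disconnected' when no client exists, 'connected' when an
    actual round trip succeeds, and 'error' when the probe raises."""
    try:
        if client is None:
            return "disconnected"
        await ping(client)  # e.g. SELECT 1, admin ping, or redis PING
        return "connected"
    except Exception as e:
        logger.warning(f"{name} health check failed: {e}")
        return "error"

async def ok_ping(client):
    return True

async def bad_ping(client):
    raise TimeoutError("no route to host")

async def main():
    return await asyncio.gather(
        probe("mysql", object(), ok_ping),
        probe("mongodb", None, ok_ping),
        probe("redis", object(), bad_ping),
    )

statuses = asyncio.run(main())
```

Reporting `"error"` separately from `"disconnected"` is the point of the change: a configured-but-failing backend is a different operational problem than one that was never initialized.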
439
backend/app/api/endpoints/instruction.py
Normal file
@@ -0,0 +1,439 @@
"""
Smart instruction API endpoints

Supports natural-language instruction parsing and execution
"""
import logging
import uuid
from typing import Any, Dict, List, Optional

from fastapi import APIRouter, HTTPException, Query, BackgroundTasks
from pydantic import BaseModel

from app.instruction.intent_parser import intent_parser
from app.instruction.executor import instruction_executor
from app.core.database import mongodb

logger = logging.getLogger(__name__)

router = APIRouter(prefix="/instruction", tags=["智能指令"])


# ==================== Request/response models ====================

class InstructionRequest(BaseModel):
    instruction: str
    doc_ids: Optional[List[str]] = None  # IDs of the associated documents
    context: Optional[Dict[str, Any]] = None  # extra context


class IntentRecognitionResponse(BaseModel):
    success: bool
    intent: str
    params: Dict[str, Any]
    message: str


class InstructionExecutionResponse(BaseModel):
    success: bool
    intent: str
    result: Dict[str, Any]
    message: str


# ==================== Endpoints ====================

@router.post("/recognize", response_model=IntentRecognitionResponse)
async def recognize_intent(request: InstructionRequest):
    """
    Intent recognition endpoint

    Parses a natural-language instruction into a structured intent plus parameters

    Example instructions:
    - "提取文档中的医院数量和床位数"
    - "根据这些数据填表"
    - "总结一下这份文档"
    - "对比这两个文档的差异"
    """
    try:
        intent, params = await intent_parser.parse(request.instruction)

        # attach the associated document info
        if request.doc_ids:
            params["document_refs"] = [f"doc_{doc_id}" for doc_id in request.doc_ids]

        intent_names = {
            "extract": "信息提取",
            "fill_table": "表格填写",
            "summarize": "摘要总结",
            "question": "智能问答",
            "search": "文档搜索",
            "compare": "对比分析",
            "transform": "格式转换",
            "edit": "文档编辑",
            "unknown": "未知"
        }

        return IntentRecognitionResponse(
            success=True,
            intent=intent,
            params=params,
            message=f"识别到意图: {intent_names.get(intent, intent)}"
        )

    except Exception as e:
        logger.error(f"意图识别失败: {e}")
        return IntentRecognitionResponse(
            success=False,
            intent="error",
            params={},
            message=f"意图识别失败: {str(e)}"
        )


@router.post("/execute")
async def execute_instruction(
    background_tasks: BackgroundTasks,
    request: InstructionRequest,
    async_execute: bool = Query(False, description="是否异步执行(仅返回任务ID)")
):
    """
    Instruction execution endpoint

    Parses and executes a natural-language instruction

    Examples:
    - Instruction: "提取文档1中的医院数量"
      Returns: {"extracted_data": {"医院数量": ["38710个"]}}

    - Instruction: "填表"
      Returns: {"filled_data": {...}}

    Set async_execute=true to run asynchronously; a task ID is returned for polling progress
    """
    task_id = str(uuid.uuid4())

    if async_execute:
        # async mode: return a task ID immediately and run in the background
        background_tasks.add_task(
            _execute_instruction_task,
            task_id=task_id,
            instruction=request.instruction,
            doc_ids=request.doc_ids,
            context=request.context
        )

        return {
            "success": True,
            "task_id": task_id,
            "message": "指令已提交执行",
            "status_url": f"/api/v1/tasks/{task_id}"
        }

    # sync mode: wait for execution to finish
    return await _execute_instruction_task(task_id, request.instruction, request.doc_ids, request.context)


async def _execute_instruction_task(
    task_id: str,
    instruction: str,
    doc_ids: Optional[List[str]],
    context: Optional[Dict[str, Any]]
) -> InstructionExecutionResponse:
    """Background task that executes an instruction"""
    from app.core.database import redis_db, mongodb as mongo_client

    try:
        # record the task
        try:
            await mongo_client.insert_task(
                task_id=task_id,
                task_type="instruction_execute",
                status="processing",
                message="正在执行指令"
            )
        except Exception:
            pass

        # build the execution context
        ctx: Dict[str, Any] = context or {}

        # if document IDs were provided, fetch the document contents
        if doc_ids:
            docs = []
            for doc_id in doc_ids:
                doc = await mongo_client.get_document(doc_id)
                if doc:
                    docs.append(doc)

            if docs:
                ctx["source_docs"] = docs
                logger.info(f"指令执行上下文: 关联了 {len(docs)} 个文档")

        # execute the instruction
        result = await instruction_executor.execute(instruction, ctx)

        # update the task status
        try:
            await mongo_client.update_task(
                task_id=task_id,
                status="success",
                message="执行完成",
                result=result
            )
        except Exception:
            pass

        return InstructionExecutionResponse(
            success=result.get("success", False),
            intent=result.get("intent", "unknown"),
            result=result,
            message=result.get("message", "执行完成")
        )

    except Exception as e:
        logger.error(f"指令执行失败: {e}")
        try:
            await mongo_client.update_task(
                task_id=task_id,
                status="failure",
                message="执行失败",
                error=str(e)
            )
        except Exception:
            pass

        return InstructionExecutionResponse(
            success=False,
            intent="error",
            result={"error": str(e)},
            message=f"指令执行失败: {str(e)}"
        )


@router.post("/chat")
async def instruction_chat(
    background_tasks: BackgroundTasks,
    request: InstructionRequest,
    async_execute: bool = Query(False, description="是否异步执行(仅返回任务ID)")
):
    """
    Instruction chat endpoint

    Supports multi-turn conversational instruction execution

    Example conversation flow:
    1. User: "上传一些文档"
    2. System: "请上传文档"
    3. User: "提取其中的医院数量"
    4. System: returns the extraction result

    Set async_execute=true to run asynchronously; a task ID is returned for polling progress
    """
    task_id = str(uuid.uuid4())

    if async_execute:
        # async mode: return a task ID immediately and run in the background
        background_tasks.add_task(
            _execute_chat_task,
            task_id=task_id,
            instruction=request.instruction,
            doc_ids=request.doc_ids,
            context=request.context
        )

        return {
            "success": True,
            "task_id": task_id,
            "message": "指令已提交执行",
            "status_url": f"/api/v1/tasks/{task_id}"
        }

    # sync mode: wait for execution to finish
    return await _execute_chat_task(task_id, request.instruction, request.doc_ids, request.context)


async def _execute_chat_task(
    task_id: str,
    instruction: str,
    doc_ids: Optional[List[str]],
    context: Optional[Dict[str, Any]]
):
    """Background task that executes an instruction chat turn"""
    from app.core.database import mongodb as mongo_client

    try:
        # record the task
        try:
            await mongo_client.insert_task(
                task_id=task_id,
                task_type="instruction_chat",
                status="processing",
                message="正在处理对话"
            )
        except Exception:
            pass

        # build the context
        ctx: Dict[str, Any] = context or {}

        # fetch the associated documents
        if doc_ids:
            docs = []
            for doc_id in doc_ids:
                doc = await mongo_client.get_document(doc_id)
                if doc:
                    docs.append(doc)
            if docs:
                ctx["source_docs"] = docs

        # execute the instruction
        result = await instruction_executor.execute(instruction, ctx)

        # add a friendly response message based on the intent type
        response_messages = {
            "extract": f"已提取 {len(result.get('extracted_data', {}))} 个字段的数据",
            "fill_table": f"填表完成,填写了 {len(result.get('result', {}).get('filled_data', {}))} 个字段",
            "summarize": "已生成文档摘要",
            "question": "已找到相关答案",
            "search": f"找到 {len(result.get('results', []))} 条相关内容",
            "compare": f"对比了 {len(result.get('comparison', []))} 个文档",
            "edit": "编辑操作已完成",
            "transform": "格式转换已完成",
            "unknown": "无法理解该指令,请尝试更明确的描述"
        }

        response = {
            "success": result.get("success", False),
            "intent": result.get("intent", "unknown"),
            "result": result,
            "message": response_messages.get(result.get("intent", ""), result.get("message", "")),
            "hint": _get_intent_hint(result.get("intent", ""))
        }

        # update the task status
        try:
            await mongo_client.update_task(
                task_id=task_id,
                status="success",
                message="处理完成",
                result=response
            )
        except Exception:
            pass

        return response

    except Exception as e:
        logger.error(f"指令对话失败: {e}")
        try:
            await mongo_client.update_task(
                task_id=task_id,
                status="failure",
                message="处理失败",
                error=str(e)
            )
        except Exception:
            pass

        return {
            "success": False,
            "error": str(e),
            "message": f"处理失败: {str(e)}"
        }


def _get_intent_hint(intent: str) -> Optional[str]:
    """Return a next-step hint for the given intent"""
    hints = {
        "extract": "您可以继续说 '提取更多字段' 或 '将数据填入表格'",
        "fill_table": "您可以提供表格模板或说 '帮我创建一个表格'",
        "question": "您可以继续提问或说 '总结一下这些内容'",
        "search": "您可以查看搜索结果或说 '对比这些内容'",
        "unknown": "您可以尝试: '提取数据'、'填表'、'总结'、'问答' 等指令"
    }
    return hints.get(intent)


@router.get("/intents")
async def list_supported_intents():
    """
    List the supported intent types

    Returns every available natural-language instruction type
    """
    return {
        "intents": [
            {
                "intent": "extract",
                "name": "信息提取",
                "examples": [
                    "提取文档中的医院数量",
                    "抽取所有机构的名称",
                    "找出表格中的数据"
                ],
                "params": ["field_refs", "document_refs"]
            },
            {
                "intent": "fill_table",
                "name": "表格填写",
                "examples": [
                    "填表",
                    "根据这些数据填写表格",
                    "帮我填到Excel里"
                ],
                "params": ["template", "document_refs"]
            },
            {
                "intent": "summarize",
                "name": "摘要总结",
                "examples": [
                    "总结一下这份文档",
                    "生成摘要",
                    "概括主要内容"
                ],
                "params": ["document_refs"]
            },
            {
                "intent": "question",
                "name": "智能问答",
                "examples": [
                    "这段话说的是什么?",
                    "有多少家医院?",
                    "解释一下这个概念"
                ],
                "params": ["question", "focus"]
            },
            {
                "intent": "search",
                "name": "文档搜索",
                "examples": [
                    "搜索相关内容",
                    "找找看有哪些机构",
                    "查询医院相关的数据"
                ],
                "params": ["field_refs", "question"]
            },
            {
                "intent": "compare",
                "name": "对比分析",
                "examples": [
                    "对比这两个文档",
                    "比较一下差异",
                    "找出不同点"
                ],
                "params": ["document_refs"]
            },
            {
                "intent": "edit",
                "name": "文档编辑",
                "examples": [
                    "润色这段文字",
                    "修改格式",
                    "添加注释"
                ],
                "params": []
            }
        ]
    }
@@ -1,13 +1,13 @@
"""
任务管理 API 接口

提供异步任务状态查询
提供异步任务状态查询和历史记录
"""
from typing import Optional

from fastapi import APIRouter, HTTPException

from app.core.database import redis_db
from app.core.database import redis_db, mongodb

router = APIRouter(prefix="/tasks", tags=["任务管理"])

@@ -23,25 +23,94 @@ async def get_task_status(task_id: str):
    Returns:
        任务状态信息
    """
    # 优先从 Redis 获取
    status = await redis_db.get_task_status(task_id)

    if not status:
    # Redis不可用时,假设任务已完成(文档已成功处理)
    # 前端轮询时会得到这个响应
    if status:
        return {
            "task_id": task_id,
            "status": "success",
            "progress": 100,
            "message": "任务处理完成",
            "result": None,
            "error": None
            "status": status.get("status", "unknown"),
            "progress": status.get("meta", {}).get("progress", 0),
            "message": status.get("meta", {}).get("message"),
            "result": status.get("meta", {}).get("result"),
            "error": status.get("meta", {}).get("error")
        }

    # Redis 不可用时,尝试从 MongoDB 获取
    mongo_task = await mongodb.get_task(task_id)
    if mongo_task:
        return {
            "task_id": mongo_task.get("task_id"),
            "status": mongo_task.get("status", "unknown"),
            "progress": 100 if mongo_task.get("status") == "success" else 0,
            "message": mongo_task.get("message"),
            "result": mongo_task.get("result"),
            "error": mongo_task.get("error")
        }

    # 任务不存在或状态未知
    return {
        "task_id": task_id,
        "status": status.get("status", "unknown"),
        "progress": status.get("meta", {}).get("progress", 0),
        "message": status.get("meta", {}).get("message"),
        "result": status.get("meta", {}).get("result"),
        "error": status.get("meta", {}).get("error")
        "status": "unknown",
        "progress": 0,
        "message": "无法获取任务状态(Redis和MongoDB均不可用)",
        "result": None,
        "error": None
    }


@router.get("/")
async def list_tasks(limit: int = 50, skip: int = 0):
    """
    获取任务历史列表

    Args:
        limit: 返回数量限制
        skip: 跳过数量

    Returns:
        任务列表
    """
    try:
        tasks = await mongodb.list_tasks(limit=limit, skip=skip)
        return {
            "success": True,
            "tasks": tasks,
            "count": len(tasks)
        }
    except Exception as e:
        # MongoDB 不可用时返回空列表
        return {
            "success": False,
            "tasks": [],
            "count": 0,
            "error": str(e)
        }


@router.delete("/{task_id}")
async def delete_task(task_id: str):
    """
    删除任务

    Args:
        task_id: 任务ID

    Returns:
        是否删除成功
    """
    try:
        # 从 Redis 删除
        if redis_db._connected and redis_db.client:
            key = f"task:{task_id}"
            await redis_db.client.delete(key)

        # 从 MongoDB 删除
        deleted = await mongodb.delete_task(task_id)

        return {
            "success": True,
            "deleted": deleted
        }
    except Exception as e:
        raise HTTPException(status_code=500, detail=f"删除任务失败: {str(e)}")

@@ -5,21 +5,62 @@
"""
import io
import logging
import uuid
from typing import List, Optional

from fastapi import APIRouter, File, HTTPException, Query, UploadFile
from fastapi import APIRouter, File, HTTPException, Query, UploadFile, BackgroundTasks
from fastapi.responses import StreamingResponse
import pandas as pd
from pydantic import BaseModel

from app.services.template_fill_service import template_fill_service, TemplateField
from app.services.file_service import file_service
from app.core.database import mongodb
from app.core.document_parser import ParserFactory

logger = logging.getLogger(__name__)

router = APIRouter(prefix="/templates", tags=["表格模板"])


# ==================== 辅助函数 ====================

async def update_task_status(
    task_id: str,
    status: str,
    progress: int = 0,
    message: str = "",
    result: dict = None,
    error: str = None
):
    """
    更新任务状态,同时写入 Redis 和 MongoDB
    """
    from app.core.database import redis_db

    meta = {"progress": progress, "message": message}
    if result:
        meta["result"] = result
    if error:
        meta["error"] = error

    try:
        await redis_db.set_task_status(task_id, status, meta)
    except Exception as e:
        logger.warning(f"Redis 任务状态更新失败: {e}")

    try:
        await mongodb.update_task(
            task_id=task_id,
            status=status,
            message=message,
            result=result,
            error=error
        )
    except Exception as e:
        logger.warning(f"MongoDB 任务状态更新失败: {e}")


# ==================== 请求/响应模型 ====================

class TemplateFieldRequest(BaseModel):
@@ -38,6 +79,7 @@ class FillRequest(BaseModel):
    source_doc_ids: Optional[List[str]] = None  # MongoDB 文档 ID 列表
    source_file_paths: Optional[List[str]] = None  # 源文档文件路径列表
    user_hint: Optional[str] = None
    task_id: Optional[str] = None  # 可选的任务ID,用于任务历史跟踪


class ExportRequest(BaseModel):
@@ -109,6 +151,240 @@ async def upload_template(
        raise HTTPException(status_code=500, detail=f"上传失败: {str(e)}")


@router.post("/upload-joint")
async def upload_joint_template(
    background_tasks: BackgroundTasks,
    template_file: UploadFile = File(..., description="模板文件"),
    source_files: List[UploadFile] = File(..., description="源文档文件列表"),
):
    """
    联合上传模板和源文档,一键完成解析和存储

    1. 保存模板文件并提取字段
    2. 异步处理源文档(解析+存MongoDB)
    3. 返回模板信息和源文档ID列表

    Args:
        template_file: 模板文件 (xlsx/xls/docx)
        source_files: 源文档列表 (docx/xlsx/md/txt)

    Returns:
        模板ID、字段列表、源文档ID列表
    """
    if not template_file.filename:
        raise HTTPException(status_code=400, detail="模板文件名为空")

    # 验证模板格式
    template_ext = template_file.filename.split('.')[-1].lower()
    if template_ext not in ['xlsx', 'xls', 'docx']:
        raise HTTPException(
            status_code=400,
            detail=f"不支持的模板格式: {template_ext},仅支持 xlsx/xls/docx"
        )

    # 验证源文档格式
    valid_exts = ['docx', 'xlsx', 'xls', 'md', 'txt']
    for sf in source_files:
        if sf.filename:
            sf_ext = sf.filename.split('.')[-1].lower()
            if sf_ext not in valid_exts:
                raise HTTPException(
                    status_code=400,
                    detail=f"不支持的源文档格式: {sf_ext},仅支持 docx/xlsx/xls/md/txt"
                )

    try:
        # 1. 保存模板文件
        template_content = await template_file.read()
        template_path = file_service.save_uploaded_file(
            template_content,
            template_file.filename,
            subfolder="templates"
        )

        # 2. 保存并解析源文档 - 提取内容用于生成表头
        source_file_info = []
        source_contents = []
        for sf in source_files:
            if sf.filename:
                sf_content = await sf.read()
                sf_ext = sf.filename.split('.')[-1].lower()
                sf_path = file_service.save_uploaded_file(
                    sf_content,
                    sf.filename,
                    subfolder=sf_ext
                )
                source_file_info.append({
                    "path": sf_path,
                    "filename": sf.filename,
                    "ext": sf_ext
                })
                # 解析源文档获取内容(用于 AI 生成表头)
                try:
                    from app.core.document_parser import ParserFactory
                    parser = ParserFactory.get_parser(sf_path)
                    parse_result = parser.parse(sf_path)
                    if parse_result.success and parse_result.data:
                        # 获取原始内容
                        content = parse_result.data.get("content", "")[:5000] if parse_result.data.get("content") else ""

                        # 获取标题(可能在顶层或structured_data内)
                        titles = parse_result.data.get("titles", [])
                        if not titles and parse_result.data.get("structured_data"):
                            titles = parse_result.data.get("structured_data", {}).get("titles", [])
                        titles = titles[:10] if titles else []

                        # 获取表格数量(可能在顶层或structured_data内)
                        tables = parse_result.data.get("tables", [])
                        if not tables and parse_result.data.get("structured_data"):
                            tables = parse_result.data.get("structured_data", {}).get("tables", [])
                        tables_count = len(tables) if tables else 0

                        # 获取表格内容摘要(用于 AI 理解源文档结构)
                        tables_summary = ""
                        if tables:
                            tables_summary = "\n【文档中的表格】:\n"
                            for idx, table in enumerate(tables[:5]):  # 最多5个表格
                                if isinstance(table, dict):
                                    headers = table.get("headers", [])
                                    rows = table.get("rows", [])
                                    if headers:
                                        tables_summary += f"表格{idx+1}表头: {', '.join(str(h) for h in headers)}\n"
                                    if rows:
                                        tables_summary += f"表格{idx+1}前3行: "
                                        for row_idx, row in enumerate(rows[:3]):
                                            if isinstance(row, list):
                                                tables_summary += " | ".join(str(c) for c in row) + "; "
                                            elif isinstance(row, dict):
                                                tables_summary += " | ".join(str(row.get(h, "")) for h in headers if headers) + "; "
                                        tables_summary += "\n"

                        source_contents.append({
                            "filename": sf.filename,
                            "doc_type": sf_ext,
                            "content": content,
                            "titles": titles,
                            "tables_count": tables_count,
                            "tables_summary": tables_summary
                        })
                        logger.info(f"[DEBUG] source_contents built: filename={sf.filename}, content_len={len(content)}, titles_count={len(titles)}, tables_count={tables_count}")
                        if tables_summary:
                            logger.info(f"[DEBUG] tables_summary preview: {tables_summary[:300]}")
                except Exception as e:
                    logger.warning(f"解析源文档失败 {sf.filename}: {e}")

        # 3. 根据源文档内容生成表头
        template_fields = await template_fill_service.get_template_fields_from_file(
            template_path,
            template_ext,
            source_contents=source_contents  # 传递源文档内容
        )

        # 4. 异步处理源文档到MongoDB
        task_id = str(uuid.uuid4())
        if source_file_info:
            # 保存任务记录到 MongoDB
            try:
                await mongodb.insert_task(
                    task_id=task_id,
                    task_type="source_process",
                    status="pending",
                    message=f"开始处理 {len(source_file_info)} 个源文档"
                )
            except Exception as mongo_err:
                logger.warning(f"MongoDB 保存任务记录失败: {mongo_err}")

            background_tasks.add_task(
                process_source_documents,
                task_id=task_id,
                files=source_file_info
            )

        logger.info(f"联合上传完成: 模板={template_file.filename}, 源文档={len(source_file_info)}个")

        return {
            "success": True,
            "template_id": template_path,
            "filename": template_file.filename,
            "file_type": template_ext,
            "fields": [
                {
                    "cell": f.cell,
                    "name": f.name,
                    "field_type": f.field_type,
                    "required": f.required,
                    "hint": f.hint
                }
                for f in template_fields
            ],
            "field_count": len(template_fields),
            "source_file_paths": [f["path"] for f in source_file_info],
            "source_filenames": [f["filename"] for f in source_file_info],
            "task_id": task_id
        }

    except HTTPException:
        raise
    except Exception as e:
        logger.error(f"联合上传失败: {str(e)}")
        raise HTTPException(status_code=500, detail=f"联合上传失败: {str(e)}")


async def process_source_documents(task_id: str, files: List[dict]):
    """异步处理源文档,存入MongoDB"""
    try:
        await update_task_status(
            task_id, status="processing",
            progress=0, message="开始处理源文档"
        )

        doc_ids = []
        for i, file_info in enumerate(files):
            try:
                parser = ParserFactory.get_parser(file_info["path"])
                result = parser.parse(file_info["path"])

                if result.success:
                    doc_id = await mongodb.insert_document(
                        doc_type=file_info["ext"],
                        content=result.data.get("content", ""),
                        metadata={
                            **result.metadata,
                            "original_filename": file_info["filename"],
                            "file_path": file_info["path"]
                        },
                        structured_data=result.data.get("structured_data")
                    )
                    doc_ids.append(doc_id)
                    logger.info(f"源文档处理成功: {file_info['filename']}, doc_id: {doc_id}")
                else:
                    logger.error(f"源文档解析失败: {file_info['filename']}, error: {result.error}")

            except Exception as e:
                logger.error(f"源文档处理异常: {file_info['filename']}, error: {str(e)}")

            progress = int((i + 1) / len(files) * 100)
            await update_task_status(
                task_id, status="processing",
                progress=progress, message=f"已处理 {i+1}/{len(files)}"
            )

        await update_task_status(
            task_id, status="success",
            progress=100, message="源文档处理完成",
            result={"doc_ids": doc_ids}
        )
        logger.info(f"所有源文档处理完成: {len(doc_ids)}个")

    except Exception as e:
        logger.error(f"源文档批量处理失败: {str(e)}")
        await update_task_status(
            task_id, status="failure",
            progress=0, message="源文档处理失败",
            error=str(e)
        )


@router.post("/fields")
async def extract_template_fields(
    template_id: str = Query(..., description="模板ID/文件路径"),
@@ -164,7 +440,27 @@ async def fill_template(
    Returns:
        填写结果
    """
    # 生成或使用传入的 task_id
    task_id = request.task_id or str(uuid.uuid4())

    try:
        # 创建任务记录到 MongoDB
        try:
            await mongodb.insert_task(
                task_id=task_id,
                task_type="template_fill",
                status="processing",
                message=f"开始填表任务: {len(request.template_fields)} 个字段"
            )
        except Exception as mongo_err:
            logger.warning(f"MongoDB 创建任务记录失败: {mongo_err}")

        # 更新进度 - 开始
        await update_task_status(
            task_id, "processing",
            progress=0, message="开始处理..."
        )

        # 转换字段
        fields = [
            TemplateField(
@@ -177,17 +473,51 @@ async def fill_template(
            for f in request.template_fields
        ]

        # 从 template_id 提取文件类型
        template_file_type = "xlsx"  # 默认类型
        if request.template_id:
            ext = request.template_id.split('.')[-1].lower()
            if ext in ["xlsx", "xls"]:
                template_file_type = "xlsx"
            elif ext == "docx":
                template_file_type = "docx"

        # 更新进度 - 准备开始填写
        await update_task_status(
            task_id, "processing",
            progress=10, message=f"准备填写 {len(fields)} 个字段..."
        )

        # 执行填写
        result = await template_fill_service.fill_template(
            template_fields=fields,
            source_doc_ids=request.source_doc_ids,
            source_file_paths=request.source_file_paths,
            user_hint=request.user_hint
            user_hint=request.user_hint,
            template_id=request.template_id,
            template_file_type=template_file_type,
            task_id=task_id
        )

        return result
        # 更新为成功
        await update_task_status(
            task_id, "success",
            progress=100, message="填表完成",
            result={
                "field_count": len(fields),
                "max_rows": result.get("max_rows", 0)
            }
        )

        return {**result, "task_id": task_id}

    except Exception as e:
        # 更新为失败
        await update_task_status(
            task_id, "failure",
            progress=0, message="填表失败",
            error=str(e)
        )
        logger.error(f"填写表格失败: {str(e)}")
        raise HTTPException(status_code=500, detail=f"填写失败: {str(e)}")

@@ -280,51 +610,79 @@ async def _export_to_excel(filled_data: dict, template_id: str) -> StreamingResp

async def _export_to_word(filled_data: dict, template_id: str) -> StreamingResponse:
    """导出为 Word 格式"""
    import re
    import tempfile
    import os
    from docx import Document
    from docx.shared import Pt, RGBColor
    from docx.enum.text import WD_ALIGN_PARAGRAPH

    doc = Document()
    def clean_text(text: str) -> str:
        """清理文本,移除可能导致Word问题的非法字符"""
        if not text:
            return ""
        # 移除控制字符
        text = re.sub(r'[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]', '', text)
        return text.strip()

    # 添加标题
    title = doc.add_heading('填写结果', level=1)
    title.alignment = WD_ALIGN_PARAGRAPH.CENTER
    try:
        # 先保存到临时文件,再读取到内存,确保文档完整性
        with tempfile.NamedTemporaryFile(delete=False, suffix='.docx') as tmp_file:
            tmp_path = tmp_file.name

        # 添加填写时间和模板信息
        from datetime import datetime
        info_para = doc.add_paragraph()
        info_para.add_run(f"模板ID: {template_id}\n").bold = True
        info_para.add_run(f"导出时间: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")
        doc = Document()
        doc.add_heading('填写结果', level=1)

        doc.add_paragraph()  # 空行
        from datetime import datetime
        info_para = doc.add_paragraph()
        template_filename = template_id.split('/')[-1].split('\\')[-1] if template_id else '未知'
        info_para.add_run(f"模板文件: {clean_text(template_filename)}\n").bold = True
        info_para.add_run(f"导出时间: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")
        doc.add_paragraph()

        # 添加字段表格
        table = doc.add_table(rows=1, cols=3)
        table.style = 'Light Grid Accent 1'
        table = doc.add_table(rows=1, cols=3)
        table.style = 'Table Grid'

        # 表头
        header_cells = table.rows[0].cells
        header_cells[0].text = '字段名'
        header_cells[1].text = '填写值'
        header_cells[2].text = '状态'
        header_cells = table.rows[0].cells
        header_cells[0].text = '字段名'
        header_cells[1].text = '填写值'
        header_cells[2].text = '状态'

        for field_name, field_value in filled_data.items():
            row_cells = table.add_row().cells
            row_cells[0].text = field_name
            row_cells[1].text = str(field_value) if field_value else ''
            row_cells[2].text = '已填写' if field_value else '为空'
        for field_name, field_value in filled_data.items():
            row_cells = table.add_row().cells
            row_cells[0].text = clean_text(str(field_name))

        # 保存到 BytesIO
        output = io.BytesIO()
        doc.save(output)
        output.seek(0)
            if isinstance(field_value, list):
                clean_values = [clean_text(str(v)) for v in field_value if v]
                display_value = ', '.join(clean_values) if clean_values else ''
            else:
                display_value = clean_text(str(field_value)) if field_value else ''

        filename = f"filled_template.docx"
            row_cells[1].text = display_value
            row_cells[2].text = '已填写' if display_value else '为空'

        # 保存到临时文件
        doc.save(tmp_path)

        # 读取文件内容
        with open(tmp_path, 'rb') as f:
            file_content = f.read()

    finally:
        # 清理临时文件
        if os.path.exists(tmp_path):
            try:
                os.unlink(tmp_path)
            except:
                pass

    output = io.BytesIO(file_content)
    filename = "filled_template.docx"

    return StreamingResponse(
        io.BytesIO(output.getvalue()),
        output,
        media_type="application/vnd.openxmlformats-officedocument.wordprocessingml.document",
        headers={"Content-Disposition": f"attachment; filename=(unknown)"}
        headers={"Content-Disposition": f"attachment; filename*=UTF-8''(unknown)"}
    )

@@ -5,6 +5,7 @@ from fastapi import APIRouter, UploadFile, File, HTTPException, Query
from fastapi.responses import StreamingResponse
from typing import Optional
import logging
import os
import pandas as pd
import io

@@ -126,7 +127,7 @@ async def upload_excel(
            content += f"... (共 {len(sheet_data['rows'])} 行)\n\n"

        doc_metadata = {
            "filename": saved_path.split("/")[-1] if "/" in saved_path else saved_path.split("\\")[-1],
            "filename": os.path.basename(saved_path),
            "original_filename": file.filename,
            "saved_path": saved_path,
            "file_size": len(content),
@@ -253,7 +254,7 @@ async def export_excel(
    output.seek(0)

    # 生成文件名
    original_name = file_path.split('/')[-1] if '/' in file_path else file_path
    original_name = os.path.basename(file_path)
    if columns:
        export_name = f"export_{sheet_name or 'data'}_{len(column_list) if columns else 'all'}_cols.xlsx"
    else:

@@ -59,6 +59,11 @@ class MongoDB:
        """RAG索引集合 - 存储字段语义索引"""
        return self.db["rag_index"]

    @property
    def tasks(self):
        """任务集合 - 存储任务历史记录"""
        return self.db["tasks"]

    # ==================== 文档操作 ====================

    async def insert_document(
@@ -242,8 +247,128 @@ class MongoDB:
        await self.rag_index.create_index("table_name")
        await self.rag_index.create_index("field_name")

        # 任务集合索引
        await self.tasks.create_index("task_id", unique=True)
        await self.tasks.create_index("created_at")

        logger.info("MongoDB 索引创建完成")

    # ==================== 任务历史操作 ====================

    async def insert_task(
        self,
        task_id: str,
        task_type: str,
        status: str = "pending",
        message: str = "",
        result: Optional[Dict[str, Any]] = None,
        error: Optional[str] = None,
    ) -> str:
        """
        插入任务记录

        Args:
            task_id: 任务ID
            task_type: 任务类型
            status: 任务状态
            message: 任务消息
            result: 任务结果
            error: 错误信息

        Returns:
            插入文档的ID
        """
        task = {
            "task_id": task_id,
            "task_type": task_type,
            "status": status,
            "message": message,
            "result": result,
            "error": error,
            "created_at": datetime.utcnow(),
            "updated_at": datetime.utcnow(),
        }
        result_obj = await self.tasks.insert_one(task)
        return str(result_obj.inserted_id)

    async def update_task(
        self,
        task_id: str,
        status: Optional[str] = None,
        message: Optional[str] = None,
        result: Optional[Dict[str, Any]] = None,
        error: Optional[str] = None,
    ) -> bool:
        """
        更新任务状态

        Args:
            task_id: 任务ID
            status: 任务状态
            message: 任务消息
            result: 任务结果
            error: 错误信息

        Returns:
            是否更新成功
        """
        from bson import ObjectId

        update_data = {"updated_at": datetime.utcnow()}
        if status is not None:
            update_data["status"] = status
        if message is not None:
            update_data["message"] = message
        if result is not None:
            update_data["result"] = result
        if error is not None:
            update_data["error"] = error

        update_result = await self.tasks.update_one(
            {"task_id": task_id},
            {"$set": update_data}
        )
        return update_result.modified_count > 0

    async def get_task(self, task_id: str) -> Optional[Dict[str, Any]]:
        """根据task_id获取任务"""
        task = await self.tasks.find_one({"task_id": task_id})
        if task:
            task["_id"] = str(task["_id"])
        return task

    async def list_tasks(
        self,
        limit: int = 50,
        skip: int = 0,
    ) -> List[Dict[str, Any]]:
        """
        获取任务列表

        Args:
            limit: 返回数量
            skip: 跳过数量

        Returns:
            任务列表
        """
        cursor = self.tasks.find().sort("created_at", -1).skip(skip).limit(limit)
        tasks = []
        async for task in cursor:
            task["_id"] = str(task["_id"])
            # 转换 datetime 为字符串
            if task.get("created_at"):
                task["created_at"] = task["created_at"].isoformat()
            if task.get("updated_at"):
                task["updated_at"] = task["updated_at"].isoformat()
            tasks.append(task)
        return tasks

    async def delete_task(self, task_id: str) -> bool:
        """删除任务"""
        result = await self.tasks.delete_one({"task_id": task_id})
        return result.deleted_count > 0


# ==================== 全局单例 ====================

@@ -59,7 +59,13 @@ class DocxParser(BaseParser):
        paragraphs = []
        for para in doc.paragraphs:
            if para.text.strip():
                paragraphs.append(para.text)
                paragraphs.append({
                    "text": para.text,
                    "style": str(para.style.name) if para.style else "Normal"
                })

        # 提取段落纯文本(用于 AI 解析)
        paragraphs_text = [p["text"] for p in paragraphs if p["text"].strip()]

        # 提取表格内容
        tables_data = []
@@ -77,8 +83,25 @@ class DocxParser(BaseParser):
                "column_count": len(table_rows[0]) if table_rows else 0
            })

        # 合并所有文本
        full_text = "\n".join(paragraphs)
        # 提取图片/嵌入式对象信息
        images_info = self._extract_images_info(doc, path)

        # 合并所有文本(包括图片描述)
        full_text_parts = []
        full_text_parts.append("【文档正文】")
        full_text_parts.extend(paragraphs_text)

        if tables_data:
            full_text_parts.append("\n【文档表格】")
            for idx, table in enumerate(tables_data):
                full_text_parts.append(f"--- 表格 {idx + 1} ---")
                for row in table["rows"]:
                    full_text_parts.append(" | ".join(str(cell) for cell in row))

        if images_info.get("image_count", 0) > 0:
            full_text_parts.append(f"\n【文档图片】文档包含 {images_info['image_count']} 张图片/图表")

        full_text = "\n".join(full_text_parts)

        # 构建元数据
        metadata = {
@@ -89,7 +112,9 @@ class DocxParser(BaseParser):
            "table_count": len(tables_data),
            "word_count": len(full_text),
            "char_count": len(full_text.replace("\n", "")),
            "has_tables": len(tables_data) > 0
            "has_tables": len(tables_data) > 0,
            "has_images": images_info.get("image_count", 0) > 0,
            "image_count": images_info.get("image_count", 0)
        }

        # 返回结果
@@ -97,12 +122,16 @@ class DocxParser(BaseParser):
            success=True,
            data={
                "content": full_text,
                "paragraphs": paragraphs,
                "paragraphs": paragraphs_text,
                "paragraphs_with_style": paragraphs,
                "tables": tables_data,
                "images": images_info,
                "word_count": len(full_text),
                "structured_data": {
                    "paragraphs": paragraphs,
                    "tables": tables_data
                    "paragraphs_text": paragraphs_text,
                    "tables": tables_data,
                    "images": images_info
                }
            },
            metadata=metadata
@@ -115,6 +144,59 @@ class DocxParser(BaseParser):
                error=f"解析 Word 文档失败: {str(e)}"
            )

    def extract_images_as_base64(self, file_path: str) -> List[Dict[str, str]]:
        """
        提取 Word 文档中的所有图片,返回 base64 编码列表

        Args:
            file_path: Word 文件路径

        Returns:
            图片列表,每项包含 base64 编码和图片类型
        """
        import zipfile
        import base64
        from io import BytesIO

        images = []

        try:
            with zipfile.ZipFile(file_path, 'r') as zf:
                # 查找 word/media 目录下的图片文件
                for filename in zf.namelist():
                    if filename.startswith('word/media/'):
                        # 获取图片类型
                        ext = filename.split('.')[-1].lower()
                        mime_types = {
                            'png': 'image/png',
                            'jpg': 'image/jpeg',
                            'jpeg': 'image/jpeg',
                            'gif': 'image/gif',
                            'bmp': 'image/bmp'
                        }
                        mime_type = mime_types.get(ext, 'image/png')

                        try:
                            # 读取图片数据并转为 base64
                            image_data = zf.read(filename)
                            base64_data = base64.b64encode(image_data).decode('utf-8')

                            images.append({
                                "filename": filename,
                                "mime_type": mime_type,
                                "base64": base64_data,
                                "size": len(image_data)
                            })
                            logger.info(f"提取图片: (unknown), 大小: {len(image_data)} bytes")
                        except Exception as e:
                            logger.warning(f"提取图片失败 (unknown): {str(e)}")

        except Exception as e:
            logger.error(f"打开 Word 文档提取图片失败: {str(e)}")

        logger.info(f"共提取 {len(images)} 张图片")
        return images

    def extract_key_sentences(self, text: str, max_sentences: int = 10) -> List[str]:
        """
        从文本中提取关键句子
@@ -268,6 +350,60 @@ class DocxParser(BaseParser):

        return fields

    def _extract_images_info(self, doc: Document, path: Path) -> Dict[str, Any]:
        """
        提取 Word 文档中的图片/嵌入式对象信息

        Args:
            doc: Document 对象
            path: 文件路径

        Returns:
            图片信息字典
        """
        import zipfile
        from io import BytesIO

        image_count = 0
        image_descriptions = []
        inline_shapes_count = 0

        try:
            # 方法1: 通过 inline shapes 统计图片
            try:
                inline_shapes_count = len(doc.inline_shapes)
                if inline_shapes_count > 0:
                    image_count = inline_shapes_count
                    image_descriptions.append(f"文档包含 {inline_shapes_count} 个嵌入式图形/图片")
            except Exception:
                pass

            # 方法2: 通过 ZIP 分析 document.xml 获取图片引用
            try:
                with zipfile.ZipFile(path, 'r') as zf:
                    # 查找 word/media 目录下的图片文件
                    media_files = [f for f in zf.namelist() if f.startswith('word/media/')]
                    if media_files and not inline_shapes_count:
                        image_count = len(media_files)
                        image_descriptions.append(f"文档包含 {image_count} 个嵌入图片")

                    # 检查是否有页眉页脚中的图片
                    header_images = [f for f in zf.namelist() if 'header' in f.lower() and f.endswith(('.png', '.jpg', '.jpeg', '.gif', '.bmp'))]
                    if header_images:
                        image_descriptions.append(f"页眉/页脚包含 {len(header_images)} 个图片")
            except Exception:
                pass

        except Exception as e:
            logger.warning(f"提取图片信息失败: {str(e)}")

        return {
            "image_count": image_count,
            "inline_shapes_count": inline_shapes_count,
            "descriptions": image_descriptions,
            "has_images": image_count > 0
        }

    def _infer_field_type_from_hint(self, hint: str) -> str:
        """
        从提示词推断字段类型

@@ -317,24 +317,70 @@ class XlsxParser(BaseParser):
|
||||
import zipfile
|
||||
from xml.etree import ElementTree as ET
|
||||
|
||||
# 常见的命名空间
|
||||
COMMON_NAMESPACES = [
|
||||
'http://schemas.openxmlformats.org/spreadsheetml/2006/main',
|
||||
'http://schemas.openxmlformats.org/spreadsheetml/2005/main',
|
||||
'http://schemas.openxmlformats.org/spreadsheetml/2004/main',
|
||||
'http://schemas.openxmlformats.org/spreadsheetml/2003/main',
|
||||
]
|
||||
|
||||
try:
|
||||
with zipfile.ZipFile(file_path, 'r') as z:
|
||||
if 'xl/workbook.xml' not in z.namelist():
|
||||
# 尝试多种可能的 workbook.xml 路径
|
||||
possible_paths = ['xl/workbook.xml', 'xl\\workbook.xml', 'workbook.xml']
|
||||
content = None
|
||||
for path in possible_paths:
|
||||
if path in z.namelist():
|
||||
content = z.read(path)
|
||||
logger.info(f"找到 workbook.xml at: {path}")
|
||||
break
|
||||
|
||||
if content is None:
|
||||
logger.warning(f"未找到 workbook.xml,文件列表: {z.namelist()[:10]}")
|
||||
return []
|
||||
content = z.read('xl/workbook.xml')
|
||||
|
||||
root = ET.fromstring(content)
|
||||
|
||||
# 命名空间
|
||||
ns = {'main': 'http://schemas.openxmlformats.org/spreadsheetml/2006/main'}
|
||||
|
||||
sheet_names = []
|
||||
for sheet in root.findall('.//main:sheet', ns):
|
||||
name = sheet.get('name')
|
||||
if name:
|
||||
sheet_names.append(name)
|
||||
|
||||
# 方法1:尝试带命名空间的查找
|
||||
for ns in COMMON_NAMESPACES:
|
||||
sheet_elements = root.findall(f'.//{{{ns}}}sheet')
|
||||
if sheet_elements:
|
||||
for sheet in sheet_elements:
|
||||
name = sheet.get('name')
|
||||
if name:
|
||||
sheet_names.append(name)
|
||||
if sheet_names:
|
||||
logger.info(f"使用命名空间 {ns} 提取工作表: {sheet_names}")
|
||||
return sheet_names
|
||||
|
||||
# 方法2:不使用命名空间,直接查找所有 sheet 元素
|
||||
if not sheet_names:
|
||||
for elem in root.iter():
|
||||
if elem.tag.endswith('sheet') and elem.tag != 'sheets':
|
||||
name = elem.get('name')
|
||||
if name:
|
||||
sheet_names.append(name)
|
||||
for child in elem:
|
||||
if child.tag.endswith('sheet') or child.tag == 'sheet':
|
||||
name = child.get('name')
|
||||
if name and name not in sheet_names:
|
||||
sheet_names.append(name)
|
||||
|
||||
# 方法3:直接从 XML 文本中正则匹配 sheet name
|
||||
if not sheet_names:
|
||||
import re
|
||||
xml_str = content.decode('utf-8', errors='ignore')
|
||||
matches = re.findall(r'<sheet\s+[^>]*name=["\']([^"\']+)["\']', xml_str, re.IGNORECASE)
|
||||
if matches:
|
||||
sheet_names = matches
|
||||
logger.info(f"使用正则提取工作表: {sheet_names}")
|
||||
|
||||
logger.info(f"从 XML 提取工作表: {sheet_names}")
|
||||
return sheet_names
|
||||
|
||||
except Exception as e:
|
||||
logger.error(f"从 XML 提取工作表名称失败: {e}")
|
||||
return []
|
||||
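The method-3 regex fallback above can be exercised directly against a raw `workbook.xml` payload; a minimal sketch (the XML snippet is invented for illustration):

```python
import re

# A stripped-down workbook.xml payload, as it might appear inside an .xlsx
# archive (hypothetical sample, mixing both quote styles on purpose)
xml_str = (
    '<workbook xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main">'
    '<sheets>'
    '<sheet name="Summary" sheetId="1"/>'
    "<sheet name='Data' sheetId='2'/>"
    '</sheets></workbook>'
)

# Same pattern as method 3: tolerate either quote style around the name attribute
matches = re.findall(r'<sheet\s+[^>]*name=["\']([^"\']+)["\']', xml_str, re.IGNORECASE)
print(matches)  # → ['Summary', 'Data']
```

Because it never parses the XML, this fallback still works when the document declares an unexpected namespace or is slightly malformed.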
@@ -356,6 +402,32 @@ class XlsxParser(BaseParser):
        import zipfile
        from xml.etree import ElementTree as ET

        # Commonly seen namespaces
        COMMON_NAMESPACES = [
            'http://schemas.openxmlformats.org/spreadsheetml/2006/main',
            'http://schemas.openxmlformats.org/spreadsheetml/2005/main',
            'http://schemas.openxmlformats.org/spreadsheetml/2004/main',
            'http://schemas.openxmlformats.org/spreadsheetml/2003/main',
        ]

        def find_elements_with_ns(root, tag_name):
            """Flexible element lookup that tolerates any namespace"""
            results = []
            # Method 1: try the known fixed namespaces
            for ns in COMMON_NAMESPACES:
                try:
                    elems = root.findall(f'.//{{{ns}}}{tag_name}')
                    if elems:
                        results.extend(elems)
                except Exception:
                    pass
            # Method 2: fall back to a namespace-free scan
            if not results:
                for elem in root.iter():
                    if elem.tag.endswith('}' + tag_name):
                        results.append(elem)
            return results

        with zipfile.ZipFile(file_path, 'r') as z:
            # Get the sheet names
            sheet_names = self._extract_sheet_names_from_xml(file_path)
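The namespace-free scan that `find_elements_with_ns` falls back on relies on ElementTree storing namespaced tags as `{uri}local`. A condensed, self-contained variant of the helper (without the fixed-namespace fast path; the XML and its `urn:example:custom` namespace are invented for illustration):

```python
from xml.etree import ElementTree as ET

def find_elements_with_ns(root, tag_name):
    """Collect elements by local tag name regardless of namespace."""
    results = []
    for elem in root.iter():
        # ElementTree renders a namespaced tag as '{uri}local'
        if elem.tag == tag_name or elem.tag.endswith('}' + tag_name):
            results.append(elem)
    return results

# Hypothetical worksheet XML using a non-standard namespace
root = ET.fromstring(
    '<worksheet xmlns="urn:example:custom"><sheetData>'
    '<row r="1"><c r="A1"><v>42</v></c></row>'
    '</sheetData></worksheet>'
)

rows = find_elements_with_ns(root, 'row')
values = find_elements_with_ns(root, 'v')
print(len(rows), values[0].text)  # → 1 42
```

This is why the parser keeps working on files produced by tools that declare older or vendor-specific SpreadsheetML namespaces.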
@@ -366,57 +438,68 @@ class XlsxParser(BaseParser):
            target_sheet = sheet_name if sheet_name and sheet_name in sheet_names else sheet_names[0]
            sheet_index = sheet_names.index(target_sheet) + 1  # sheet1.xml, sheet2.xml, ...

            # Read shared strings - try several possible paths
            shared_strings = []
            ss_paths = ['xl/sharedStrings.xml', 'xl\\sharedStrings.xml', 'sharedStrings.xml']
            for ss_path in ss_paths:
                if ss_path in z.namelist():
                    try:
                        ss_content = z.read(ss_path)
                        ss_root = ET.fromstring(ss_content)
                        for si in find_elements_with_ns(ss_root, 'si'):
                            t_elements = [c for c in si if c.tag.endswith('}t') or c.tag == 't']
                            if t_elements:
                                shared_strings.append(t_elements[0].text or '')
                            else:
                                shared_strings.append('')
                        break
                    except Exception as e:
                        logger.warning(f"Failed to read sharedStrings: {e}")

            # Read the worksheet - try several possible paths
            sheet_content = None
            sheet_paths = [
                f'xl/worksheets/sheet{sheet_index}.xml',
                f'xl\\worksheets\\sheet{sheet_index}.xml',
                f'worksheets/sheet{sheet_index}.xml',
            ]
            for sp in sheet_paths:
                if sp in z.namelist():
                    sheet_content = z.read(sp)
                    break

            if sheet_content is None:
                raise ValueError(f"Worksheet file sheet{sheet_index}.xml does not exist")

            root = ET.fromstring(sheet_content)

            # Collect all row data
            all_rows = []
            headers = {}

            for row in find_elements_with_ns(root, 'row'):
                row_idx = int(row.get('r', 0))
                row_cells = {}
                for cell in find_elements_with_ns(row, 'c'):
                    cell_ref = cell.get('r', '')
                    col_letters = ''.join(filter(str.isalpha, cell_ref))
                    cell_type = cell.get('t', 'n')
                    v_elements = find_elements_with_ns(cell, 'v')
                    v = v_elements[0] if v_elements else None

                    if v is not None and v.text:
                        if cell_type == 's':
                            # shared string
                            try:
                                row_cells[col_letters] = shared_strings[int(v.text)]
                            except (ValueError, IndexError):
                                row_cells[col_letters] = v.text
                        elif cell_type == 'b':
                            # boolean
                            row_cells[col_letters] = v.text == '1'
                        else:
                            row_cells[col_letters] = v.text
                    else:
                        row_cells[col_letters] = None

                # Handle the header row
                if row_idx == header_row + 1:
                    headers = {**row_cells}
                elif row_idx > header_row + 1:
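The cell-decoding branch above handles three SpreadsheetML cell types: `'s'` (an index into the shared-string table), `'b'` (boolean encoded as `'0'`/`'1'`), and everything else passed through raw. A self-contained sketch of that logic (the shared-string table here is a made-up example):

```python
# Hypothetical shared-string table, as read from xl/sharedStrings.xml
shared_strings = ['Name', 'Beds']

def decode_cell(cell_type, raw_value, shared_strings):
    """Decode one <c> element's <v> text according to its t attribute."""
    if raw_value is None:
        return None
    if cell_type == 's':
        # shared string: the value is an index into the shared-string table
        try:
            return shared_strings[int(raw_value)]
        except (ValueError, IndexError):
            return raw_value
    if cell_type == 'b':
        # boolean: '1' is true, anything else false
        return raw_value == '1'
    # numeric / inline values are kept as their raw text
    return raw_value

print(decode_cell('s', '1', shared_strings))   # → Beds
print(decode_cell('b', '1', shared_strings))   # → True
print(decode_cell('n', '3.5', shared_strings)) # → 3.5
```

Keeping numerics as raw text mirrors the parser above: type coercion is left to pandas when the rows are assembled into a DataFrame.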
@@ -424,7 +507,6 @@ class XlsxParser(BaseParser):
            # Build the DataFrame
            if headers:
                # Preserve the original column order
                col_order = list(headers.keys())
                df = pd.DataFrame(all_rows)
                if not df.empty:
@@ -0,0 +1,14 @@
"""
Instruction execution module

Supports intelligent, interactive document operations, including intent parsing and instruction execution
"""
from .intent_parser import IntentParser, intent_parser
from .executor import InstructionExecutor, instruction_executor

__all__ = [
    "IntentParser",
    "intent_parser",
    "InstructionExecutor",
    "instruction_executor",
]
@@ -0,0 +1,572 @@
"""
Instruction executor module

Turns natural-language instructions into executable operations
"""
import logging
import json
from typing import Any, Dict, List, Optional

from app.services.template_fill_service import template_fill_service
from app.services.rag_service import rag_service
from app.services.markdown_ai_service import markdown_ai_service
from app.core.database import mongodb

logger = logging.getLogger(__name__)


class InstructionExecutor:
    """Instruction executor"""

    def __init__(self):
        self.intent_parser = None  # Set later via set_intent_parser

    def set_intent_parser(self, intent_parser):
        """Set the intent parser"""
        self.intent_parser = intent_parser

    async def execute(self, instruction: str, context: Dict[str, Any] = None) -> Dict[str, Any]:
        """
        Execute an instruction

        Args:
            instruction: natural-language instruction
            context: execution context (document info, etc.)

        Returns:
            Execution result
        """
        if self.intent_parser is None:
            from app.instruction.intent_parser import intent_parser
            self.intent_parser = intent_parser

        context = context or {}

        # Parse the intent
        intent, params = await self.intent_parser.parse(instruction)

        # Dispatch on the intent type
        if intent == "extract":
            return await self._execute_extract(params, context)
        elif intent == "fill_table":
            return await self._execute_fill_table(params, context)
        elif intent == "summarize":
            return await self._execute_summarize(params, context)
        elif intent == "question":
            return await self._execute_question(params, context)
        elif intent == "search":
            return await self._execute_search(params, context)
        elif intent == "compare":
            return await self._execute_compare(params, context)
        elif intent == "edit":
            return await self._execute_edit(params, context)
        elif intent == "transform":
            return await self._execute_transform(params, context)
        else:
            return {
                "success": False,
                "error": f"Unknown intent type: {intent}",
                "message": "Could not understand the instruction; please describe it more explicitly"
            }

    async def _execute_extract(self, params: Dict[str, Any], context: Dict[str, Any]) -> Dict[str, Any]:
        """Execute information extraction"""
        try:
            target_fields = params.get("field_refs", [])
            doc_ids = params.get("document_refs", [])

            if not target_fields:
                return {
                    "success": False,
                    "error": "No fields to extract were specified",
                    "message": "Please state which fields to extract, e.g. 'extract the hospital count and bed count'"
                }

            # If documents were specified, verify they exist
            if doc_ids and "all_docs" not in doc_ids:
                valid_docs = []
                for doc_ref in doc_ids:
                    doc_id = doc_ref.replace("doc_", "")
                    doc = await mongodb.get_document(doc_id)
                    if doc:
                        valid_docs.append(doc)
                if not valid_docs:
                    return {
                        "success": False,
                        "error": "The specified documents do not exist",
                        "message": "Please check that the document numbers are correct"
                    }
                context["source_docs"] = valid_docs

            # Build the field list
            fields = []
            for i, field_name in enumerate(target_fields):
                fields.append({
                    "name": field_name,
                    "cell": f"A{i+1}",
                    "field_type": "text",
                    "required": False
                })

            # Call the template-fill service
            result = await template_fill_service.fill_template(
                template_fields=fields,
                source_doc_ids=[doc.get("_id") for doc in context.get("source_docs", [])] if context.get("source_docs") else None,
                user_hint=f"Please extract the fields: {', '.join(target_fields)}"
            )

            return {
                "success": True,
                "intent": "extract",
                "extracted_data": result.get("filled_data", {}),
                "fields": target_fields,
                "message": f"Successfully extracted {len(result.get('filled_data', {}))} fields"
            }

        except Exception as e:
            logger.error(f"Extraction failed: {e}")
            return {
                "success": False,
                "error": str(e),
                "message": f"Extraction failed: {str(e)}"
            }

    async def _execute_fill_table(self, params: Dict[str, Any], context: Dict[str, Any]) -> Dict[str, Any]:
        """Execute table filling"""
        try:
            template_file = context.get("template_file")
            if not template_file:
                return {
                    "success": False,
                    "error": "No table template provided",
                    "message": "Please upload the table template to fill first"
                }

            # Source documents
            source_docs = context.get("source_docs", [])
            source_doc_ids = [doc.get("_id") for doc in source_docs if doc.get("_id")]

            # Fields
            fields = context.get("template_fields", [])

            # Call the template-fill service
            result = await template_fill_service.fill_template(
                template_fields=fields,
                source_doc_ids=source_doc_ids if source_doc_ids else None,
                source_file_paths=context.get("source_file_paths"),
                user_hint=params.get("user_hint"),
                template_id=template_file if isinstance(template_file, str) else None,
                template_file_type=params.get("template", {}).get("type", "xlsx")
            )

            return {
                "success": True,
                "intent": "fill_table",
                "result": result,
                "message": f"Table filling done; {len(result.get('filled_data', {}))} fields filled"
            }

        except Exception as e:
            logger.error(f"Table filling failed: {e}")
            return {
                "success": False,
                "error": str(e),
                "message": f"Table filling failed: {str(e)}"
            }

    async def _execute_summarize(self, params: Dict[str, Any], context: Dict[str, Any]) -> Dict[str, Any]:
        """Execute summarization"""
        try:
            docs = context.get("source_docs", [])
            if not docs:
                return {
                    "success": False,
                    "error": "No documents available",
                    "message": "Please upload the documents to summarize first"
                }

            summaries = []
            for doc in docs[:5]:  # process at most 5 documents
                content = doc.get("content", "")[:5000]  # cap the content length
                if content:
                    summaries.append({
                        "filename": doc.get("metadata", {}).get("original_filename", "unknown"),
                        "content_preview": content[:500] + "..." if len(content) > 500 else content
                    })

            return {
                "success": True,
                "intent": "summarize",
                "summaries": summaries,
                "message": f"Found {len(summaries)} documents to reference"
            }

        except Exception as e:
            logger.error(f"Summarization failed: {e}")
            return {
                "success": False,
                "error": str(e),
                "message": f"Summary generation failed: {str(e)}"
            }

    async def _execute_question(self, params: Dict[str, Any], context: Dict[str, Any]) -> Dict[str, Any]:
        """Execute question answering"""
        try:
            question = params.get("question", "")
            if not question:
                return {
                    "success": False,
                    "error": "No question provided",
                    "message": "Please enter the question to answer"
                }

            # Retrieve related documents via RAG
            docs = context.get("source_docs", [])
            rag_results = []

            for doc in docs:
                doc_id = doc.get("_id", "")
                if doc_id:
                    results = rag_service.retrieve_by_doc_id(doc_id, top_k=3)
                    rag_results.extend(results)

            # Build the context
            context_text = "\n\n".join([
                r.get("content", "") for r in rag_results[:5]
            ]) if rag_results else ""

            # Fall back to raw document content when RAG returns nothing
            if not context_text:
                context_text = "\n\n".join([
                    doc.get("content", "")[:3000] for doc in docs[:3] if doc.get("content")
                ])

            return {
                "success": True,
                "intent": "question",
                "question": question,
                "context_preview": context_text[:500] + "..." if len(context_text) > 500 else context_text,
                "message": "Relevant context found; ready for question answering"
            }

        except Exception as e:
            logger.error(f"Question answering failed: {e}")
            return {
                "success": False,
                "error": str(e),
                "message": f"Question handling failed: {str(e)}"
            }

    async def _execute_search(self, params: Dict[str, Any], context: Dict[str, Any]) -> Dict[str, Any]:
        """Execute a search"""
        try:
            field_refs = params.get("field_refs", [])
            query = " ".join(field_refs) if field_refs else params.get("question", "")

            if not query:
                return {
                    "success": False,
                    "error": "No search keywords provided",
                    "message": "Please enter the keywords to search for"
                }

            # Retrieve via RAG
            results = rag_service.retrieve(query, top_k=10, min_score=0.3)

            return {
                "success": True,
                "intent": "search",
                "query": query,
                "results": [
                    {
                        "content": r.get("content", "")[:200],
                        "score": r.get("score", 0),
                        "doc_id": r.get("doc_id", "")
                    }
                    for r in results[:10]
                ],
                "message": f"Found {len(results)} matching results"
            }

        except Exception as e:
            logger.error(f"Search failed: {e}")
            return {
                "success": False,
                "error": str(e),
                "message": f"Search failed: {str(e)}"
            }

    async def _execute_compare(self, params: Dict[str, Any], context: Dict[str, Any]) -> Dict[str, Any]:
        """Execute comparison analysis"""
        try:
            docs = context.get("source_docs", [])
            if len(docs) < 2:
                return {
                    "success": False,
                    "error": "Comparison needs at least 2 documents",
                    "message": "Please upload at least 2 documents to compare"
                }

            # Collect basic info for each document
            comparison = []
            for i, doc in enumerate(docs[:5]):
                comparison.append({
                    "index": i + 1,
                    "filename": doc.get("metadata", {}).get("original_filename", "unknown"),
                    "doc_type": doc.get("doc_type", "unknown"),
                    "content_length": len(doc.get("content", "")),
                    "has_tables": bool(doc.get("structured_data", {}).get("tables")),
                })

            return {
                "success": True,
                "intent": "compare",
                "comparison": comparison,
                "message": f"Compared basic information for {len(comparison)} documents"
            }

        except Exception as e:
            logger.error(f"Comparison failed: {e}")
            return {
                "success": False,
                "error": str(e),
                "message": f"Comparison analysis failed: {str(e)}"
            }

    async def _execute_edit(self, params: Dict[str, Any], context: Dict[str, Any]) -> Dict[str, Any]:
        """Execute a document edit"""
        try:
            docs = context.get("source_docs", [])
            if not docs:
                return {
                    "success": False,
                    "error": "No documents available",
                    "message": "Please upload the document to edit first"
                }

            doc = docs[0]  # edit the first document by default
            content = doc.get("content", "")
            original_filename = doc.get("metadata", {}).get("original_filename", "unknown document")

            if not content:
                return {
                    "success": False,
                    "error": "The document content is empty",
                    "message": "This document has no editable content"
                }

            # Polish/edit the text with the LLM
            prompt = f"""Please edit the following document content.

Original content:
{content[:8000]}

Editing requirements:
- Polish the wording so it reads more professional and fluent
- Fix obvious grammatical errors
- Keep the original meaning unchanged
- Return only the edited content, with no explanation

Output the edited content directly:"""

            messages = [
                {"role": "system", "content": "You are a professional text-editing assistant. Output only the edited content."},
                {"role": "user", "content": prompt}
            ]

            from app.services.llm_service import llm_service
            response = await llm_service.chat(messages=messages, temperature=0.3, max_tokens=8000)
            edited_content = llm_service.extract_message_content(response)

            return {
                "success": True,
                "intent": "edit",
                "edited_content": edited_content,
                "original_filename": original_filename,
                "message": "Document editing finished; content returned"
            }

        except Exception as e:
            logger.error(f"Edit failed: {e}")
            return {
                "success": False,
                "error": str(e),
                "message": f"Edit processing failed: {str(e)}"
            }

    async def _execute_transform(self, params: Dict[str, Any], context: Dict[str, Any]) -> Dict[str, Any]:
        """
        Execute format conversion

        Supported:
        - Word -> Excel
        - Excel -> Word
        - Markdown -> Word
        - Word -> Markdown
        """
        try:
            docs = context.get("source_docs", [])
            if not docs:
                return {
                    "success": False,
                    "error": "No documents available",
                    "message": "Please upload the document to convert first"
                }

            # Determine the target format
            template_info = params.get("template", {})
            target_type = template_info.get("type", "")

            if not target_type:
                # Try to infer it from the instruction
                instruction = params.get("instruction", "")
                if "excel" in instruction.lower() or "xlsx" in instruction.lower():
                    target_type = "xlsx"
                elif "word" in instruction.lower() or "docx" in instruction.lower():
                    target_type = "docx"
                elif "markdown" in instruction.lower() or "md" in instruction.lower():
                    target_type = "md"

            if not target_type:
                return {
                    "success": False,
                    "error": "No target format specified",
                    "message": "Please say which format to convert to (e.g. to Excel, to Word)"
                }

            doc = docs[0]
            content = doc.get("content", "")
            structured_data = doc.get("structured_data", {})
            original_filename = doc.get("metadata", {}).get("original_filename", "unknown document")

            # Build the content to convert
            if structured_data.get("tables"):
                # Table data present: emit table-shaped content
                tables = structured_data.get("tables", [])
                table_content = []
                for i, table in enumerate(tables[:3]):  # at most 3 tables
                    headers = table.get("headers", [])
                    rows = table.get("rows", [])[:20]  # at most 20 rows
                    if headers:
                        table_content.append(f"[Table {i+1}]")
                        table_content.append(" | ".join(str(h) for h in headers))
                        table_content.append(" | ".join(["---"] * len(headers)))
                        for row in rows:
                            if isinstance(row, list):
                                table_content.append(" | ".join(str(c) for c in row))
                            elif isinstance(row, dict):
                                table_content.append(" | ".join(str(row.get(h, "")) for h in headers))
                        table_content.append("")

                if target_type == "xlsx":
                    # Emit Excel-shaped data (JSON)
                    excel_data = []
                    for table in tables[:1]:  # only the first table
                        headers = table.get("headers", [])
                        rows = table.get("rows", [])[:100]
                        for row in rows:
                            if isinstance(row, list):
                                excel_data.append(dict(zip(headers, row)))
                            elif isinstance(row, dict):
                                excel_data.append(row)

                    return {
                        "success": True,
                        "intent": "transform",
                        "transform_type": "to_excel",
                        "target_format": "xlsx",
                        "excel_data": excel_data,
                        "headers": headers,
                        "message": f"Converted to Excel format with {len(excel_data)} rows of data"
                    }
                elif target_type in ["docx", "word"]:
                    # Emit Word-shaped text
                    word_content = f"# {original_filename}\n\n"
                    word_content += "\n".join(table_content)

                    return {
                        "success": True,
                        "intent": "transform",
                        "transform_type": "to_word",
                        "target_format": "docx",
                        "content": word_content,
                        "message": "Converted to Word format"
                    }
                elif target_type == "md":
                    # Emit Markdown
                    md_content = f"# {original_filename}\n\n"
                    md_content += "\n".join(table_content)

                    return {
                        "success": True,
                        "intent": "transform",
                        "transform_type": "to_markdown",
                        "target_format": "md",
                        "content": md_content,
                        "message": "Converted to Markdown format"
                    }

            # No table data: convert the plain-text content
            if target_type == "xlsx":
                # Turn the text into Excel rows (one line per row)
                lines = [line.strip() for line in content.split("\n") if line.strip()][:100]
                excel_data = [{"Line": i+1, "Content": line} for i, line in enumerate(lines)]

                return {
                    "success": True,
                    "intent": "transform",
                    "transform_type": "to_excel",
                    "target_format": "xlsx",
                    "excel_data": excel_data,
                    "headers": ["Line", "Content"],
                    "message": f"Converted the text content to Excel with {len(excel_data)} rows"
                }
            elif target_type in ["docx", "word"]:
                return {
                    "success": True,
                    "intent": "transform",
                    "transform_type": "to_word",
                    "target_format": "docx",
                    "content": content,
                    "message": "Document content is ready and can be downloaded as Word"
                }
            elif target_type == "md":
                # Naive text-to-Markdown: pass lines through stripped, keeping blank
                # lines as paragraph breaks (the original branched here, but every
                # branch appended the line unchanged)
                md_lines = []
                for line in content.split("\n"):
                    md_lines.append(line.strip())

                return {
                    "success": True,
                    "intent": "transform",
                    "transform_type": "to_markdown",
                    "target_format": "md",
                    "content": "\n".join(md_lines),
                    "message": "Converted to Markdown format"
                }

            return {
                "success": False,
                "error": "Unsupported target format",
                "message": f"Conversion to {target_type} is not supported yet"
            }

        except Exception as e:
            logger.error(f"Format conversion failed: {e}")
            return {
                "success": False,
                "error": str(e),
                "message": f"Format conversion failed: {str(e)}"
            }


# Global singleton
instruction_executor = InstructionExecutor()
@@ -0,0 +1,242 @@
"""
Intent parser module

Parses natural-language user instructions into an intent and parameters
"""
import re
import logging
from typing import Any, Dict, List, Optional, Tuple

logger = logging.getLogger(__name__)


class IntentParser:
    """Intent parser"""

    # Intent types
    INTENT_EXTRACT = "extract"        # information extraction
    INTENT_FILL_TABLE = "fill_table"  # table filling
    INTENT_SUMMARIZE = "summarize"    # summarization
    INTENT_QUESTION = "question"      # question answering
    INTENT_SEARCH = "search"          # search
    INTENT_COMPARE = "compare"        # comparison
    INTENT_TRANSFORM = "transform"    # format conversion
    INTENT_EDIT = "edit"              # document editing
    INTENT_UNKNOWN = "unknown"        # unknown

    # Intent keyword map (keywords stay in Chinese: they match user input)
    INTENT_KEYWORDS = {
        INTENT_EXTRACT: ["提取", "抽取", "获取", "找出", "查找", "识别", "找到"],
        INTENT_FILL_TABLE: ["填表", "填写", "填充", "录入", "导入到表格", "填写到"],
        INTENT_SUMMARIZE: ["总结", "摘要", "概括", "概述", "归纳", "提炼"],
        INTENT_QUESTION: ["问答", "回答", "解释", "什么是", "为什么", "如何", "怎样", "多少", "几个"],
        INTENT_SEARCH: ["搜索", "查找", "检索", "查询", "找"],
        INTENT_COMPARE: ["对比", "比较", "差异", "区别", "不同"],
        INTENT_TRANSFORM: ["转换", "转化", "变成", "转为", "导出"],
        INTENT_EDIT: ["修改", "编辑", "调整", "改写", "润色", "优化"],
    }

    # Entity patterns
    ENTITY_PATTERNS = {
        "number": [r"\d+", r"[一二三四五六七八九十百千万]+"],
        "date": [r"\d{4}年", r"\d{1,2}月", r"\d{1,2}日"],
        "percentage": [r"\d+(\.\d+)?%", r"\d+(\.\d+)?‰"],
        "currency": [r"\d+(\.\d+)?万元", r"\d+(\.\d+)?亿元", r"\d+(\.\d+)?元"],
    }

    def __init__(self):
        self.intent_history: List[Dict[str, Any]] = []

    async def parse(self, text: str) -> Tuple[str, Dict[str, Any]]:
        """
        Parse a natural-language instruction

        Args:
            text: raw user input

        Returns:
            (intent type, parameter dict)
        """
        text = text.strip()
        if not text:
            return self.INTENT_UNKNOWN, {}

        # Record history
        self.intent_history.append({"text": text, "intent": None})

        # Recognize the intent
        intent = self._recognize_intent(text)

        # Extract parameters
        params = self._extract_params(text, intent)

        # Update history
        if self.intent_history:
            self.intent_history[-1]["intent"] = intent

        logger.info(f"Intent parsed: text={text[:50]}..., intent={intent}, params={params}")

        return intent, params

    def _recognize_intent(self, text: str) -> str:
        """Recognize the intent type"""
        intent_scores: Dict[str, float] = {}

        for intent, keywords in self.INTENT_KEYWORDS.items():
            score = 0
            for keyword in keywords:
                if keyword in text:
                    score += 1
            if score > 0:
                intent_scores[intent] = score

        if not intent_scores:
            return self.INTENT_UNKNOWN

        # Return the highest-scoring intent
        return max(intent_scores, key=intent_scores.get)

    def _extract_params(self, text: str, intent: str) -> Dict[str, Any]:
        """Extract parameters"""
        params: Dict[str, Any] = {
            "entities": self._extract_entities(text),
            "document_refs": self._extract_document_refs(text),
            "field_refs": self._extract_field_refs(text),
            "template_refs": self._extract_template_refs(text),
        }

        # Intent-specific parameters
        if intent == self.INTENT_QUESTION:
            params["question"] = text
            params["focus"] = self._extract_question_focus(text)
        elif intent == self.INTENT_FILL_TABLE:
            params["template"] = self._extract_template_info(text)
        elif intent == self.INTENT_EXTRACT:
            params["target_fields"] = self._extract_target_fields(text)

        return params

    def _extract_entities(self, text: str) -> Dict[str, List[str]]:
        """Extract entities"""
        entities: Dict[str, List[str]] = {}

        for entity_type, patterns in self.ENTITY_PATTERNS.items():
            matches = []
            for pattern in patterns:
                found = re.findall(pattern, text)
                matches.extend(found)
            if matches:
                entities[entity_type] = list(set(matches))

        return entities

    def _extract_document_refs(self, text: str) -> List[str]:
        """Extract document references"""
        # Matches "文档1", "doc1", "第一个文档", etc.
        refs = []

        # Numeric index: 文档1, doc1, 第1个文档
        num_patterns = [
            r"(?:文档|doc)(\d+)",
            r"第(\d+)个文档",
            r"第(\d+)份",
        ]
        for pattern in num_patterns:
            matches = re.findall(pattern, text.lower())
            refs.extend([f"doc_{m}" for m in matches])

        # "所有文档" / "全部文档" (all documents)
        if any(kw in text for kw in ["所有", "全部", "整个"]):
            refs.append("all_docs")

        return refs

    def _extract_field_refs(self, text: str) -> List[str]:
        """Extract field references"""
        fields = []

        # Field names inside quotes
        quoted = re.findall(r"['\"『「]([^'\"』」]+)['\"』」]", text)
        fields.extend(quoted)

        # Matches "xxx字段" (field), "xxx列" (column), etc.
        field_patterns = [
            r"([^\s]+)字段",
            r"([^\s]+)列",
            r"([^\s]+)数据",
        ]
        for pattern in field_patterns:
            matches = re.findall(pattern, text)
            fields.extend(matches)

        return list(set(fields))

    def _extract_template_refs(self, text: str) -> List[str]:
        """Extract template references"""
        templates = []

        # Matches "表格模板", "Excel模板", "表1", etc.
        template_patterns = [
            r"([^\s]+模板)",
            r"表(\d+)",
            r"([^\s]+表格)",
        ]
        for pattern in template_patterns:
            matches = re.findall(pattern, text)
            templates.extend(matches)

        return list(set(templates))

    def _extract_question_focus(self, text: str) -> Optional[str]:
        """Extract the focus of a question"""
        # "什么是XXX" ("what is X"; the original pattern [什么是] was a
        # character class matching any single one of those characters)
        match = re.search(r"什么是([^??]+)", text)
        if match:
            return match.group(1).strip()

        # "XXX有多少" ("how many X")
        match = re.search(r"([^??]+)有多少", text)
        if match:
            return match.group(1).strip()

        return None

    def _extract_template_info(self, text: str) -> Optional[Dict[str, str]]:
        """Extract template info"""
        template_info: Dict[str, str] = {}

        # Template type
        if "excel" in text.lower() or "xlsx" in text.lower() or "电子表格" in text:
            template_info["type"] = "xlsx"
        elif "word" in text.lower() or "docx" in text.lower() or "文档" in text:
            template_info["type"] = "docx"

        return template_info if template_info else None

    def _extract_target_fields(self, text: str) -> List[str]:
        """Extract target fields"""
        fields = []

        # Matches "提取XXX和YYY" / "抽取XXX、YYY"; the original character-class
        # patterns were malformed, so a lazy group up to a delimiter (or end of
        # string) is used instead
        patterns = [
            r"提取(.+?)(?:和|与|、|,|,|$)",
            r"抽取(.+?)(?:和|与|、|,|,|$)",
        ]

        for pattern in patterns:
            matches = re.findall(pattern, text)
            fields.extend([m.strip() for m in matches if m.strip()])

        return list(set(fields))

    def get_intent_history(self) -> List[Dict[str, Any]]:
        """Return the intent history"""
        return self.intent_history

    def clear_history(self):
        """Clear the history"""
        self.intent_history = []


# Global singleton
intent_parser = IntentParser()
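The keyword-vote scheme behind `_recognize_intent` can be shown standalone: each keyword hit adds one point, and the highest-scoring intent wins, falling back to `unknown` when nothing matches. A minimal sketch with a trimmed-down keyword map:

```python
# Trimmed-down version of INTENT_KEYWORDS (keywords match Chinese user input)
INTENT_KEYWORDS = {
    "extract": ["提取", "抽取", "获取"],
    "summarize": ["总结", "摘要", "概括"],
    "compare": ["对比", "比较", "差异"],
}

def recognize_intent(text):
    """Score each intent by keyword hits; highest score wins."""
    scores = {
        intent: sum(1 for kw in keywords if kw in text)
        for intent, keywords in INTENT_KEYWORDS.items()
    }
    scores = {k: v for k, v in scores.items() if v > 0}
    if not scores:
        return "unknown"
    return max(scores, key=scores.get)

print(recognize_intent("请提取医院数量"))  # → extract
print(recognize_intent("今天天气如何"))    # → unknown
```

The full parser keeps this deliberately cheap: no model call is needed to route an instruction, and the LLM is only invoked by the executor once the intent is known.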
@@ -1,6 +1,13 @@
"""
FastAPI application entry point
"""
# ========== Silence noisy MongoDB logging ==========
import logging
logging.getLogger("pymongo").setLevel(logging.WARNING)
logging.getLogger("pymongo.topology").setLevel(logging.WARNING)
logging.getLogger("urllib3").setLevel(logging.WARNING)
# ===================================================

import logging
import logging.handlers
import sys
@@ -65,7 +65,17 @@ class LLMService:
                return response.json()

        except httpx.HTTPStatusError as e:
            error_detail = e.response.text
            logger.error(f"LLM API request failed: {e.response.status_code} - {error_detail}")
            # Try to parse the error payload
            try:
                import json
                err_json = json.loads(error_detail)
                err_code = err_json.get("error", {}).get("code", "unknown")
                err_msg = err_json.get("error", {}).get("message", "unknown")
                logger.error(f"API error code: {err_code}, message: {err_msg}")
            except Exception:
                pass
            raise
        except Exception as e:
            logger.error(f"LLM API call raised: {str(e)}")
@@ -328,6 +338,154 @@ Excel 数据概览:
            "analysis": None
        }

    async def chat_with_images(
        self,
        text: str,
        images: List[Dict[str, str]],
        temperature: float = 0.7,
        max_tokens: Optional[int] = None
    ) -> Dict[str, Any]:
        """
        Call a vision model API (supports image input)

        Args:
            text: text content
            images: list of images, each with a base64 payload and mime_type
                    Format: [{"base64": "...", "mime_type": "image/png"}, ...]
            temperature: sampling temperature
            max_tokens: maximum number of tokens

        Returns:
            Dict[str, Any]: API response
        """
        headers = {
            "Authorization": f"Bearer {self.api_key}",
            "Content-Type": "application/json"
        }

        # Build the image parts
        image_contents = []
        for img in images:
            image_contents.append({
                "type": "image_url",
                "image_url": {
                    "url": f"data:{img['mime_type']};base64,{img['base64']}"
                }
            })

        # Build the messages
        messages = [
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": text
                    },
                    *image_contents
                ]
            }
        ]

        payload = {
            "model": self.model_name,
            "messages": messages,
            "temperature": temperature
        }

        if max_tokens:
            payload["max_tokens"] = max_tokens

        try:
            async with httpx.AsyncClient(timeout=120.0) as client:
                response = await client.post(
                    f"{self.base_url}/chat/completions",
                    headers=headers,
                    json=payload
                )
                response.raise_for_status()
                return response.json()

        except httpx.HTTPStatusError as e:
            error_detail = e.response.text
            logger.error(f"Vision model API request failed: {e.response.status_code} - {error_detail}")
            # Try to parse the error payload
            try:
                import json
                err_json = json.loads(error_detail)
                err_code = err_json.get("error", {}).get("code", "unknown")
                err_msg = err_json.get("error", {}).get("message", "unknown")
                logger.error(f"API error code: {err_code}, message: {err_msg}")
                logger.error(f"Requested model: {self.model_name}, base_url: {self.base_url}")
            except Exception:
                pass
            raise
        except Exception as e:
            logger.error(f"Vision model API call raised: {str(e)}")
            raise
|
||||
async def analyze_images(
|
||||
self,
|
||||
images: List[Dict[str, str]],
|
||||
user_prompt: str = ""
|
||||
) -> Dict[str, Any]:
|
||||
"""
|
||||
分析图片内容(使用视觉模型)
|
||||
|
||||
Args:
|
||||
images: 图片列表,每项包含 base64 编码和 mime_type
|
||||
user_prompt: 用户提示词
|
||||
|
||||
Returns:
|
||||
Dict[str, Any]: 分析结果
|
||||
"""
|
||||
prompt = f"""你是一个专业的视觉分析专家。请分析以下图片内容。
|
||||
|
||||
{user_prompt if user_prompt else "请详细描述图片中的内容,包括文字、数据、图表、流程等所有可见信息。"}
|
||||
|
||||
请按照以下 JSON 格式输出:
|
||||
{{
|
||||
"description": "图片内容的详细描述",
|
||||
"text_content": "图片中的文字内容(如有)",
|
||||
"data_extracted": {{"键": "值"}} // 如果图片中有表格或数据
|
||||
}}
|
||||
|
||||
如果图片不包含有用信息,请返回空的描述。"""
|
||||
|
||||
try:
|
||||
response = await self.chat_with_images(
|
||||
text=prompt,
|
||||
images=images,
|
||||
temperature=0.1,
|
||||
max_tokens=4000
|
||||
)
|
||||
|
||||
content = self.extract_message_content(response)
|
||||
|
||||
# 解析 JSON
|
||||
import json
|
||||
try:
|
||||
result = json.loads(content)
|
||||
return {
|
||||
"success": True,
|
||||
"analysis": result,
|
||||
"model": self.model_name
|
||||
}
|
||||
except json.JSONDecodeError:
|
||||
return {
|
||||
"success": True,
|
||||
"analysis": {"description": content},
|
||||
"model": self.model_name
|
||||
}
|
||||
|
||||
except Exception as e:
|
||||
logger.error(f"图片分析失败: {str(e)}")
|
||||
return {
|
||||
"success": False,
|
||||
"error": str(e),
|
||||
"analysis": None
|
||||
}
|
||||
|
||||
|
||||
# 全局单例
|
||||
llm_service = LLMService()
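For reference, the payload assembled by `chat_with_images` follows the OpenAI-compatible multimodal message format, with each image inlined as a base64 data URL. A minimal self-contained sketch of that assembly step (the helper name and the stub image bytes below are illustrative, not part of the service):

```python
import base64

def build_vision_messages(text: str, images: list) -> list:
    """Build an OpenAI-compatible multimodal message list from text plus
    base64-encoded images, mirroring the structure used by chat_with_images."""
    image_contents = [
        {
            "type": "image_url",
            "image_url": {"url": f"data:{img['mime_type']};base64,{img['base64']}"},
        }
        for img in images
    ]
    return [{"role": "user", "content": [{"type": "text", "text": text}, *image_contents]}]

# Arbitrary placeholder bytes standing in for a real PNG
stub = base64.b64encode(b"\x89PNG placeholder").decode()
messages = build_vision_messages("Describe this image", [{"base64": stub, "mime_type": "image/png"}])
print(messages[0]["content"][1]["image_url"]["url"][:22])  # data:image/png;base64,
```

The text part always comes first, followed by one `image_url` part per image, which is what lets a single request mix a prompt with several screenshots.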

446
backend/app/services/multi_doc_reasoning_service.py
Normal file
@@ -0,0 +1,446 @@
"""
Multi-document reasoning service

Cross-document information linking and reasoning
"""
import logging
import re
from typing import Any, Dict, List, Optional, Set, Tuple
from collections import defaultdict

from app.services.llm_service import llm_service
from app.services.rag_service import rag_service

logger = logging.getLogger(__name__)


class MultiDocReasoningService:
    """
    Multi-document reasoning service

    Features:
    1. Cross-document entity tracking - trace how the same entity is described in different documents
    2. Relation extraction and reasoning - extract relations between entities and reason over them
    3. Information completion - fill in missing data from complementary documents
    4. Conflict detection - detect contradictory information between documents
    """

    def __init__(self):
        self.llm = llm_service

    async def analyze_cross_documents(
        self,
        documents: List[Dict[str, Any]],
        query: Optional[str] = None,
        entity_types: Optional[List[str]] = None
    ) -> Dict[str, Any]:
        """
        Cross-document analysis.

        Args:
            documents: List of documents
            query: Optional query intent
            entity_types: Entity types to track, e.g. ["机构", "人物", "地点", "数量"]

        Returns:
            Cross-document analysis result
        """
        if not documents:
            return {"success": False, "error": "没有可用的文档"}

        entity_types = entity_types or ["机构", "数量", "时间", "地点"]

        try:
            # 1. Extract entities from each document
            entities_per_doc = await self._extract_entities_from_docs(documents, entity_types)

            # 2. Align entities across documents
            aligned_entities = self._align_entities_across_docs(entities_per_doc)

            # 3. Extract relations
            relations = await self._extract_relations(documents)

            # 4. Build a knowledge graph
            knowledge_graph = self._build_knowledge_graph(aligned_entities, relations)

            # 5. Complete missing information
            completed_info = await self._complete_missing_info(knowledge_graph, documents)

            # 6. Detect conflicts
            conflicts = self._detect_conflicts(aligned_entities)

            return {
                "success": True,
                "entities": aligned_entities,
                "relations": relations,
                "knowledge_graph": knowledge_graph,
                "completed_info": completed_info,
                "conflicts": conflicts,
                "summary": self._generate_summary(aligned_entities, conflicts)
            }

        except Exception as e:
            logger.error(f"跨文档分析失败: {e}")
            return {"success": False, "error": str(e)}

    async def _extract_entities_from_docs(
        self,
        documents: List[Dict[str, Any]],
        entity_types: List[str]
    ) -> List[Dict[str, Any]]:
        """Extract entities from each document."""
        entities_per_doc = []

        for idx, doc in enumerate(documents):
            doc_id = doc.get("_id", f"doc_{idx}")
            content = doc.get("content", "")[:8000]  # Cap the length

            # Use the LLM to extract entities
            prompt = f"""从以下文档中提取指定的实体类型信息。

实体类型: {', '.join(entity_types)}

文档内容:
{content}

请按以下 JSON 格式输出(只需输出 JSON):
{{
    "entities": [
        {{"type": "机构", "name": "实体名称", "value": "相关数值(如有)", "context": "上下文描述"}},
        ...
    ]
}}

只提取在文档中明确提到的实体,不要推测。"""

            messages = [
                {"role": "system", "content": "你是一个实体提取专家。请严格按JSON格式输出。"},
                {"role": "user", "content": prompt}
            ]

            try:
                response = await self.llm.chat(messages=messages, temperature=0.1, max_tokens=3000)
                content_response = self.llm.extract_message_content(response)

                # Parse the JSON
                import json
                import re
                cleaned = content_response.strip()
                json_match = re.search(r'\{[\s\S]*\}', cleaned)
                if json_match:
                    result = json.loads(json_match.group())
                    entities = result.get("entities", [])
                    entities_per_doc.append({
                        "doc_id": doc_id,
                        "doc_name": doc.get("metadata", {}).get("original_filename", f"文档{idx+1}"),
                        "entities": entities
                    })
                    logger.info(f"文档 {doc_id} 提取到 {len(entities)} 个实体")
            except Exception as e:
                logger.warning(f"文档 {doc_id} 实体提取失败: {e}")

        return entities_per_doc
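The extraction step above relies on pulling the first `{...}` span out of a free-form LLM reply with `re.search(r'\{[\s\S]*\}', ...)` before calling `json.loads`. A small standalone sketch of that pattern (the helper name and the sample reply are illustrative):

```python
import json
import re

def extract_json(raw: str) -> dict:
    """Pull the outermost {...} span out of an LLM reply that may be wrapped
    in prose or markdown fences, then parse it; return {} on failure."""
    match = re.search(r'\{[\s\S]*\}', raw.strip())
    if not match:
        return {}
    try:
        return json.loads(match.group())
    except json.JSONDecodeError:
        return {}

reply = 'Sure, here it is:\n```json\n{"entities": [{"type": "机构", "name": "央行"}]}\n```'
print(extract_json(reply))
```

Because the pattern is greedy, it spans from the first `{` to the last `}`, which tolerates fences and surrounding prose but assumes the reply contains only one JSON object.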

    def _align_entities_across_docs(
        self,
        entities_per_doc: List[Dict[str, Any]]
    ) -> Dict[str, List[Dict[str, Any]]]:
        """
        Align entities across documents.

        Links descriptions of the same entity found in different documents.
        """
        aligned: Dict[str, List[Dict[str, Any]]] = defaultdict(list)

        for doc_data in entities_per_doc:
            doc_id = doc_data["doc_id"]
            doc_name = doc_data["doc_name"]

            for entity in doc_data.get("entities", []):
                entity_name = entity.get("name", "")
                if not entity_name:
                    continue

                # Normalize the entity name (strip whitespace and parenthesized text)
                normalized = self._normalize_entity_name(entity_name)

                aligned[normalized].append({
                    "original_name": entity_name,
                    "type": entity.get("type", "未知"),
                    "value": entity.get("value", ""),
                    "context": entity.get("context", ""),
                    "source_doc": doc_name,
                    "source_doc_id": doc_id
                })

        # Keep only entities that appear in more than one document
        result = {}
        for normalized, appearances in aligned.items():
            if len(appearances) > 1:
                result[normalized] = appearances
                logger.info(f"实体对齐: {normalized} 在 {len(appearances)} 个文档中出现")

        return result

    def _normalize_entity_name(self, name: str) -> str:
        """Normalize an entity name."""
        # Strip whitespace
        name = name.strip()
        # Remove parenthesized content
        name = re.sub(r'[((].*?[))]', '', name)
        # Remove ordinal prefixes such as "第X名"
        name = re.sub(r'^第\d+[名位个]', '', name)
        return name.strip()

    async def _extract_relations(
        self,
        documents: List[Dict[str, Any]]
    ) -> List[Dict[str, str]]:
        """Extract entity relations from the documents."""
        relations = []

        # Concatenate all document contents
        combined_content = "\n\n".join([
            f"【{doc.get('metadata', {}).get('original_filename', f'文档{i}')}】\n{doc.get('content', '')[:3000]}"
            for i, doc in enumerate(documents)
        ])

        prompt = f"""从以下文档内容中抽取实体之间的关系。

文档内容:
{combined_content[:8000]}

请识别以下类型的关系:
- 包含关系 (A包含B)
- 隶属关系 (A隶属于B)
- 合作关系 (A与B合作)
- 对比关系 (A vs B)
- 时序关系 (A先于B发生)

请按以下 JSON 格式输出(只需输出 JSON):
{{
    "relations": [
        {{"entity1": "实体1", "entity2": "实体2", "relation": "关系类型", "description": "关系描述"}},
        ...
    ]
}}

如果没有找到明确的关系,返回空数组。"""

        messages = [
            {"role": "system", "content": "你是一个关系抽取专家。请严格按JSON格式输出。"},
            {"role": "user", "content": prompt}
        ]

        try:
            response = await self.llm.chat(messages=messages, temperature=0.1, max_tokens=3000)
            content_response = self.llm.extract_message_content(response)

            import json
            import re
            cleaned = content_response.strip()
            json_match = re.search(r'\{[\s\S]*\}', cleaned)
            if json_match:
                result = json.loads(json_match.group())
                relations = result.get("relations", [])
                logger.info(f"抽取到 {len(relations)} 个关系")
        except Exception as e:
            logger.warning(f"关系抽取失败: {e}")

        return relations

    def _build_knowledge_graph(
        self,
        aligned_entities: Dict[str, List[Dict[str, Any]]],
        relations: List[Dict[str, str]]
    ) -> Dict[str, Any]:
        """Build a knowledge graph."""
        nodes = []
        edges = []
        node_ids = set()

        # Add entity nodes
        for entity_name, appearances in aligned_entities.items():
            if len(appearances) < 1:
                continue

            first_appearance = appearances[0]
            node_id = f"entity_{len(nodes)}"

            # Collect the entity's values across all documents
            values = [a.get("value", "") for a in appearances if a.get("value")]
            primary_value = values[0] if values else ""

            nodes.append({
                "id": node_id,
                "name": entity_name,
                "type": first_appearance.get("type", "未知"),
                "value": primary_value,
                "occurrence_count": len(appearances),
                "sources": [a.get("source_doc", "") for a in appearances]
            })
            node_ids.add(entity_name)

        # Add relation edges
        for relation in relations:
            entity1 = self._normalize_entity_name(relation.get("entity1", ""))
            entity2 = self._normalize_entity_name(relation.get("entity2", ""))

            if entity1 in node_ids and entity2 in node_ids:
                edges.append({
                    "source": entity1,
                    "target": entity2,
                    "relation": relation.get("relation", "相关"),
                    "description": relation.get("description", "")
                })

        return {
            "nodes": nodes,
            "edges": edges,
            "stats": {
                "entity_count": len(nodes),
                "relation_count": len(edges)
            }
        }

    async def _complete_missing_info(
        self,
        knowledge_graph: Dict[str, Any],
        documents: List[Dict[str, Any]]
    ) -> List[Dict[str, Any]]:
        """Complete missing information from the other documents."""
        completed = []

        for node in knowledge_graph.get("nodes", []):
            if not node.get("value") and node.get("occurrence_count", 0) > 1:
                # The entity appears in several documents but has no value;
                # try to fill it in via RAG retrieval
                query = f"{node['name']} 数值 数据"
                results = rag_service.retrieve(query, top_k=3, min_score=0.3)

                if results:
                    completed.append({
                        "entity": node["name"],
                        "type": node.get("type", "未知"),
                        "source": "rag_inference",
                        "context": results[0].get("content", "")[:200],
                        "confidence": results[0].get("score", 0)
                    })

        return completed

    def _detect_conflicts(
        self,
        aligned_entities: Dict[str, List[Dict[str, Any]]]
    ) -> List[Dict[str, Any]]:
        """Detect information conflicts between documents."""
        conflicts = []

        for entity_name, appearances in aligned_entities.items():
            if len(appearances) < 2:
                continue

            # Check for conflicting values
            values = {}
            for appearance in appearances:
                val = appearance.get("value", "")
                if val:
                    source = appearance.get("source_doc", "未知来源")
                    values[source] = val

            if len(values) > 1:
                unique_values = set(values.values())
                if len(unique_values) > 1:
                    conflicts.append({
                        "entity": entity_name,
                        "type": "value_conflict",
                        "details": values,
                        "description": f"实体 '{entity_name}' 在不同文档中有不同数值: {values}"
                    })

        return conflicts

    def _generate_summary(
        self,
        aligned_entities: Dict[str, List[Dict[str, Any]]],
        conflicts: List[Dict[str, Any]]
    ) -> str:
        """Generate a summary."""
        summary_parts = []

        total_entities = sum(len(appearances) for appearances in aligned_entities.values())
        multi_doc_entities = sum(1 for appearances in aligned_entities.values() if len(appearances) > 1)

        summary_parts.append(f"跨文档分析完成:发现 {total_entities} 个实体")
        summary_parts.append(f"其中 {multi_doc_entities} 个实体在多个文档中被提及")

        if conflicts:
            summary_parts.append(f"检测到 {len(conflicts)} 个潜在冲突")

        return "; ".join(summary_parts)

    async def answer_cross_doc_question(
        self,
        question: str,
        documents: List[Dict[str, Any]]
    ) -> Dict[str, Any]:
        """
        Cross-document question answering.

        Args:
            question: The question
            documents: List of documents

        Returns:
            Answer result
        """
        # Run the cross-document analysis first
        analysis_result = await self.analyze_cross_documents(documents, query=question)

        # Build the context
        context_parts = []

        # Add entity information
        for entity_name, appearances in analysis_result.get("entities", {}).items():
            contexts = [f"{a.get('source_doc')}: {a.get('context', '')}" for a in appearances[:2]]
            if contexts:
                context_parts.append(f"【{entity_name}】{' | '.join(contexts)}")

        # Add relation information
        for relation in analysis_result.get("relations", [])[:5]:
            context_parts.append(f"【关系】{relation.get('entity1')} {relation.get('relation')} {relation.get('entity2')}: {relation.get('description', '')}")

        context_text = "\n\n".join(context_parts) if context_parts else "未找到相关实体和关系"

        # Generate the answer with the LLM
        prompt = f"""基于以下跨文档分析结果,回答用户问题。

问题: {question}

分析结果:
{context_text}

请直接回答问题,如果分析结果中没有相关信息,请说明"根据提供的文档无法回答该问题"。"""

        messages = [
            {"role": "system", "content": "你是一个基于文档的问答助手。请根据提供的信息回答问题。"},
            {"role": "user", "content": prompt}
        ]

        try:
            response = await self.llm.chat(messages=messages, temperature=0.2, max_tokens=2000)
            answer = self.llm.extract_message_content(response)

            return {
                "success": True,
                "question": question,
                "answer": answer,
                "supporting_entities": list(analysis_result.get("entities", {}).keys())[:10],
                "relations_count": len(analysis_result.get("relations", []))
            }
        except Exception as e:
            logger.error(f"跨文档问答失败: {e}")
            return {"success": False, "error": str(e)}


# Global singleton
multi_doc_reasoning_service = MultiDocReasoningService()
@@ -2,21 +2,32 @@
RAG service module - retrieval-augmented generation

Vector retrieval with sentence-transformers + Faiss,
plus hybrid fusion with BM25 keyword retrieval
"""
import json
import logging
import os
import pickle
-from typing import Any, Dict, List, Optional
+import re
+import math
+from typing import Any, Dict, List, Optional, Tuple
+from collections import Counter, defaultdict

import faiss
import numpy as np
-from sentence_transformers import SentenceTransformer

from app.config import settings

logger = logging.getLogger(__name__)

+# Try to import sentence-transformers
+try:
+    from sentence_transformers import SentenceTransformer
+    SENTENCE_TRANSFORMERS_AVAILABLE = True
+except ImportError as e:
+    logger.warning(f"sentence-transformers 导入失败: {e}")
+    SENTENCE_TRANSFORMERS_AVAILABLE = False
+    SentenceTransformer = None


class SimpleDocument:
    """Simplified document object"""
@@ -25,20 +36,156 @@ class SimpleDocument:
        self.metadata = metadata


class BM25:
    """
    BM25 keyword retrieval.

    A term-frequency / document-frequency ranking function that is better
    suited to exact keyword matching than pure vector search.
    """

    def __init__(self, k1: float = 1.5, b: float = 0.75):
        self.k1 = k1  # Term-frequency saturation parameter
        self.b = b  # Document-length normalization parameter
        self.documents: List[str] = []
        self.doc_ids: List[str] = []
        self.avg_doc_length = 0
        self.doc_freqs: Dict[str, int] = {}  # term -> number of documents containing it
        self.idf: Dict[str, float] = {}  # term -> IDF value
        self.doc_lengths: List[int] = []
        self.doc_term_freqs: List[Dict[str, int]] = []  # per-document term frequencies

    def _tokenize(self, text: str) -> List[str]:
        """Tokenize (naive Chinese/ASCII tokenization)."""
        if not text:
            return []
        # Naive tokenization: split on punctuation and whitespace
        tokens = re.findall(r'[\u4e00-\u9fff]+|[a-zA-Z0-9]+', text.lower())
        # Drop single-character tokens
        return [t for t in tokens if len(t) > 1]

    def fit(self, documents: List[str], doc_ids: List[str]):
        """
        Build the BM25 index.

        Args:
            documents: Document contents
            doc_ids: Document IDs
        """
        self.documents = documents
        self.doc_ids = doc_ids
        n = len(documents)

        # Collect document frequencies
        self.doc_freqs = defaultdict(int)
        self.doc_lengths = []
        self.doc_term_freqs = []

        for doc in documents:
            tokens = self._tokenize(doc)
            self.doc_lengths.append(len(tokens))
            doc_tf = Counter(tokens)
            self.doc_term_freqs.append(doc_tf)

            for term in doc_tf:
                self.doc_freqs[term] += 1

        # Average document length
        self.avg_doc_length = sum(self.doc_lengths) / n if n > 0 else 0

        # Compute IDF
        for term, df in self.doc_freqs.items():
            # Smoothed IDF = log((n - df + 0.5) / (df + 0.5) + 1)
            self.idf[term] = math.log((n - df + 0.5) / (df + 0.5) + 1)

        logger.info(f"BM25 索引构建完成: {n} 个文档, {len(self.idf)} 个词项")

    def search(self, query: str, top_k: int = 10) -> List[Tuple[int, float]]:
        """
        Search for relevant documents.

        Args:
            query: Query text
            top_k: Return the top k results

        Returns:
            [(document index, BM25 score), ...]
        """
        if not self.documents:
            return []

        query_tokens = self._tokenize(query)
        if not query_tokens:
            return []

        scores = []
        n = len(self.documents)

        for idx in range(n):
            score = self._calculate_score(query_tokens, idx)
            scores.append((idx, score))

        # Sort by score, descending
        scores.sort(key=lambda x: x[1], reverse=True)

        return scores[:top_k]

    def _calculate_score(self, query_tokens: List[str], doc_idx: int) -> float:
        """Compute the BM25 score of a single document."""
        doc_tf = self.doc_term_freqs[doc_idx]
        doc_len = self.doc_lengths[doc_idx]
        score = 0.0

        for term in query_tokens:
            if term not in self.idf:
                continue

            tf = doc_tf.get(term, 0)
            idf = self.idf[term]

            # BM25 formula
            numerator = tf * (self.k1 + 1)
            denominator = tf + self.k1 * (1 - self.b + self.b * doc_len / self.avg_doc_length)

            score += idf * numerator / denominator

        return score

    def get_scores(self, query: str) -> List[float]:
        """Return BM25 scores for all documents."""
        if not self.documents:
            return []

        query_tokens = self._tokenize(query)
        if not query_tokens:
            return [0.0] * len(self.documents)

        return [self._calculate_score(query_tokens, idx) for idx in range(len(self.documents))]
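The scoring above can be sanity-checked on a toy corpus. A compact standalone version of the same formula (same k1/b defaults, IDF smoothed with +1 so it stays positive; the helper name and sample token lists are illustrative):

```python
import math
from collections import Counter

def bm25_scores(query_tokens, docs_tokens, k1=1.5, b=0.75):
    """Score each pre-tokenized document against a query with BM25."""
    n = len(docs_tokens)
    avg_len = sum(len(d) for d in docs_tokens) / n
    # Document frequency per term, then smoothed IDF
    df = Counter(t for d in docs_tokens for t in set(d))
    idf = {t: math.log((n - c + 0.5) / (c + 0.5) + 1) for t, c in df.items()}
    scores = []
    for d in docs_tokens:
        tf = Counter(d)
        s = 0.0
        for t in query_tokens:
            if t not in idf:
                continue
            freq = tf[t]
            # tf saturation (k1) and length normalization (b)
            s += idf[t] * freq * (k1 + 1) / (freq + k1 * (1 - b + b * len(d) / avg_len))
        scores.append(s)
    return scores

docs = [["apple", "banana", "apple"], ["banana", "cherry"], ["cherry", "date"]]
scores = bm25_scores(["apple"], docs)
print(scores.index(max(scores)))  # doc 0 is the only one mentioning "apple"
```

Only documents containing a query term score above zero, which is why the service keeps BM25 as the exact-match complement to the vector index.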


class RAGService:
    """RAG retrieval-augmented service"""

+    # Default chunking parameters
+    DEFAULT_CHUNK_SIZE = 500  # Size of each text chunk (characters)
+    DEFAULT_CHUNK_OVERLAP = 50  # Overlap between chunks (characters)
+
    def __init__(self):
-        self.embedding_model: Optional[SentenceTransformer] = None
+        self.embedding_model = None
        self.index: Optional[faiss.Index] = None
        self.documents: List[Dict[str, Any]] = []
        self.doc_ids: List[str] = []
-        self._dimension: int = 0
+        self._dimension: int = 384  # Default dimension
        self._initialized = False
        self._persist_dir = settings.FAISS_INDEX_DIR
-        # Temporarily disable RAG API calls and only log index operations
-        self._disabled = True
-        logger.info("RAG 服务已禁用(_disabled=True),仅记录索引操作日志")
+        # BM25 index
+        self.bm25: Optional[BM25] = None
+        self._bm25_enabled = True  # BM25 is always enabled
+        # Check availability
+        self._disabled = not SENTENCE_TRANSFORMERS_AVAILABLE
+        if self._disabled:
+            logger.warning("RAG 服务已禁用(sentence-transformers 不可用),将使用 BM25 关键词检索")
+        else:
+            logger.info("RAG 服务已启用(向量检索 + BM25 混合检索)")

    def _init_embeddings(self):
        """Initialize the embedding model"""
@@ -88,6 +235,63 @@ class RAGService:
        norms = np.where(norms == 0, 1, norms)
        return vectors / norms

    def _split_into_chunks(self, text: str, chunk_size: int = None, overlap: int = None) -> List[str]:
        """
        Split long text into chunks.

        Args:
            text: Text to split
            chunk_size: Size of each chunk (characters)
            overlap: Number of overlapping characters between chunks

        Returns:
            List of text chunks
        """
        if chunk_size is None:
            chunk_size = self.DEFAULT_CHUNK_SIZE
        if overlap is None:
            overlap = self.DEFAULT_CHUNK_OVERLAP

        if len(text) <= chunk_size:
            return [text] if text.strip() else []

        chunks = []
        start = 0
        text_len = len(text)

        while start < text_len:
            # Compute the end position of the current chunk
            end = start + chunk_size

            # If this is not the last chunk, try to cut at a sentence boundary
            if end < text_len:
                # Look backwards for the last period, comma, newline or semicolon
                cut_positions = []
                for i in range(end, max(start, end - 100), -1):
                    if text[i] in '。;,,\n、':
                        cut_positions.append(i + 1)
                        break

                if cut_positions:
                    end = cut_positions[0]
                else:
                    # No boundary found backwards; look forwards instead
                    for i in range(end, min(text_len, end + 50)):
                        if text[i] in '。;,,\n、':
                            end = i + 1
                            break

            chunk = text[start:end].strip()
            if chunk:
                chunks.append(chunk)

            # Advance the start position (accounting for overlap); always move
            # forward past the previous start so the loop terminates
            next_start = end - overlap
            start = next_start if next_start > start else end

        return chunks
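With the sentence-boundary snapping left out, the loop above reduces to a plain sliding window. A minimal sketch of that core (the function name and the tiny sizes are chosen for illustration only):

```python
def split_into_chunks(text: str, chunk_size: int = 10, overlap: int = 3) -> list:
    """Sliding-window character chunking with overlap: each window starts
    `overlap` characters before the previous window ended."""
    if len(text) <= chunk_size:
        return [text] if text.strip() else []
    chunks, start = [], 0
    while start < len(text):
        end = start + chunk_size
        chunk = text[start:end].strip()
        if chunk:
            chunks.append(chunk)
        # Always move forward so the loop terminates even with large overlaps
        next_start = end - overlap
        start = next_start if next_start > start else end
    return chunks

chunks = split_into_chunks("abcdefghijklmnopqrst", chunk_size=10, overlap=3)
print(chunks)  # ['abcdefghij', 'hijklmnopq', 'opqrst']
```

The 3-character overlap means each chunk repeats the tail of its predecessor, so a query term straddling a chunk boundary is still retrievable from one of the two chunks.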

    def index_field(
        self,
        table_name: str,
@@ -124,9 +328,20 @@ class RAGService:
        self,
        doc_id: str,
        content: str,
-        metadata: Optional[Dict[str, Any]] = None
+        metadata: Optional[Dict[str, Any]] = None,
+        chunk_size: int = None,
+        chunk_overlap: int = None
    ):
-        """Index document content into the vector store"""
+        """
+        Index document content into the vector store (with automatic chunking).
+
+        Args:
+            doc_id: Unique document identifier
+            content: Document content
+            metadata: Document metadata
+            chunk_size: Text chunk size (characters), default 500
+            chunk_overlap: Overlap between chunks (characters), default 50
+        """
        if self._disabled:
            logger.info(f"[RAG DISABLED] 文档索引操作已跳过: {doc_id}")
            return
@@ -139,18 +354,70 @@ class RAGService:
            logger.debug(f"文档跳过索引 (无嵌入模型): {doc_id}")
            return

-        doc = SimpleDocument(
-            page_content=content,
-            metadata=metadata or {"doc_id": doc_id}
-        )
-        self._add_documents([doc], [doc_id])
-        logger.debug(f"已索引文档: {doc_id}")
+        # Split the document into chunks
+        if chunk_size is None:
+            chunk_size = self.DEFAULT_CHUNK_SIZE
+        if chunk_overlap is None:
+            chunk_overlap = self.DEFAULT_CHUNK_OVERLAP
+
+        chunks = self._split_into_chunks(content, chunk_size, chunk_overlap)
+
+        if not chunks:
+            logger.warning(f"文档内容为空,跳过索引: {doc_id}")
+            return
+
+        # Create a document object for each chunk
+        documents = []
+        chunk_ids = []
+
+        for i, chunk in enumerate(chunks):
+            chunk_id = f"{doc_id}_chunk_{i}"
+            chunk_metadata = metadata.copy() if metadata else {}
+            chunk_metadata.update({
+                "chunk_index": i,
+                "total_chunks": len(chunks),
+                "doc_id": doc_id
+            })
+
+            documents.append(SimpleDocument(
+                page_content=chunk,
+                metadata=chunk_metadata
+            ))
+            chunk_ids.append(chunk_id)
+
+        # Add the chunks in one batch
+        self._add_documents(documents, chunk_ids)
+        logger.info(f"已索引文档 {doc_id},共 {len(chunks)} 个块")

    def _add_documents(self, documents: List[SimpleDocument], doc_ids: List[str]):
        """Add documents to the vector index in batch."""
        if not documents:
            return

        # Always keep documents in memory (for BM25 and keyword search)
        for doc, did in zip(documents, doc_ids):
            self.documents.append({"id": did, "content": doc.page_content, "metadata": doc.metadata})
            self.doc_ids.append(did)

        # Build the BM25 index
        if self._bm25_enabled and documents:
            bm25_texts = [doc.page_content for doc in documents]
            if self.bm25 is None:
                self.bm25 = BM25()
                self.bm25.fit(bm25_texts, doc_ids)
            else:
                # Incremental add: rebuild from scratch (BM25 has no incremental update)
                all_texts = [d["content"] for d in self.documents]
                all_ids = self.doc_ids.copy()
                self.bm25 = BM25()
                self.bm25.fit(all_texts, all_ids)
            logger.debug(f"BM25 索引更新: {len(documents)} 个文档")

        # Without an embedding model, skip vector indexing
        if self.embedding_model is None:
            logger.debug(f"文档跳过向量索引 (无嵌入模型): {len(documents)} 个文档")
            return

        texts = [doc.page_content for doc in documents]
        embeddings = self.embedding_model.encode(texts, convert_to_numpy=True)
        embeddings = self._normalize_vectors(embeddings).astype('float32')
@@ -162,12 +429,18 @@ class RAGService:
        id_array = np.array(id_list, dtype='int64')
        self.index.add_with_ids(embeddings, id_array)

-        for doc, did in zip(documents, doc_ids):
-            self.documents.append({"id": did, "content": doc.page_content, "metadata": doc.metadata})
-            self.doc_ids.append(did)
-
-    def retrieve(self, query: str, top_k: int = 5) -> List[Dict[str, Any]]:
-        """Retrieve relevant documents for a query"""
+    def retrieve(self, query: str, top_k: int = 5, min_score: float = 0.3) -> List[Dict[str, Any]]:
+        """
+        Retrieve relevant document chunks for a query (hybrid: vector + BM25).
+
+        Args:
+            query: Query text
+            top_k: Maximum number of results to return
+            min_score: Minimum similarity score threshold
+
+        Returns:
+            List of relevant chunks, each with content, metadata, score, doc_id, chunk_index
+        """
        if self._disabled:
            logger.info(f"[RAG DISABLED] 检索操作已跳过: query={query}, top_k={top_k}")
            return []
@@ -175,28 +448,241 @@ class RAGService:
        if not self._initialized:
            self._init_vector_store()

-        if self.index is None or self.index.ntotal == 0:
-            return []
-
-        query_embedding = self.embedding_model.encode([query], convert_to_numpy=True)
-        query_embedding = self._normalize_vectors(query_embedding).astype('float32')
-
-        scores, indices = self.index.search(query_embedding, min(top_k, self.index.ntotal))
-
-        results = []
-        for score, idx in zip(scores[0], indices[0]):
-            if idx < 0:
-                continue
-            doc = self.documents[idx]
-            results.append({
-                "content": doc["content"],
-                "metadata": doc["metadata"],
-                "score": float(score),
-                "doc_id": doc["id"]
-            })
-
-        logger.debug(f"检索到 {len(results)} 条相关文档")
-        return results
+        # Run vector retrieval
+        vector_results = self._vector_search(query, top_k * 2, min_score)
+
+        # Run BM25 retrieval
+        bm25_results = self._bm25_search(query, top_k * 2)
+
+        # Hybrid fusion
+        hybrid_results = self._hybrid_fusion(vector_results, bm25_results, top_k)
+
+        if hybrid_results:
+            logger.info(f"混合检索到 {len(hybrid_results)} 条相关文档块 (向量:{len(vector_results)}, BM25:{len(bm25_results)})")
+            return hybrid_results
+
+        # Fallback: BM25 only
+        if bm25_results:
+            logger.info(f"降级到 BM25 检索: {len(bm25_results)} 条")
+            return bm25_results
+
+        # Fallback: keyword search
+        logger.info("降级到关键词搜索")
+        return self._keyword_search(query, top_k)
+
+    def _vector_search(self, query: str, top_k: int, min_score: float) -> List[Dict[str, Any]]:
+        """Vector retrieval."""
+        if self.index is None or self.index.ntotal == 0 or self.embedding_model is None:
+            return []
+
+        try:
+            query_embedding = self.embedding_model.encode([query], convert_to_numpy=True)
+            query_embedding = self._normalize_vectors(query_embedding).astype('float32')
+
+            scores, indices = self.index.search(query_embedding, min(top_k * 2, self.index.ntotal))
+
+            results = []
+            for score, idx in zip(scores[0], indices[0]):
+                if idx < 0:
+                    continue
+                if score < min_score:
+                    continue
+                doc = self.documents[idx]
+                results.append({
+                    "content": doc["content"],
+                    "metadata": doc["metadata"],
+                    "score": float(score),
+                    "doc_id": doc["id"],
+                    "chunk_index": doc["metadata"].get("chunk_index", 0),
+                    "search_type": "vector"
+                })
+
+            return results
+        except Exception as e:
+            logger.warning(f"向量检索失败: {e}")
+            return []
+
+    def _bm25_search(self, query: str, top_k: int) -> List[Dict[str, Any]]:
+        """BM25 retrieval."""
+        if not self.bm25 or not self.documents:
+            return []
+
+        try:
+            bm25_scores = self.bm25.get_scores(query)
+            if not bm25_scores:
+                return []
+
+            # Normalize BM25 scores into [0, 1]
+            max_score = max(bm25_scores) if bm25_scores else 1
+            min_score_bm = min(bm25_scores) if bm25_scores else 0
+            score_range = max_score - min_score_bm if max_score != min_score_bm else 1
+
+            results = []
+            for idx, score in enumerate(bm25_scores):
+                if score <= 0:
+                    continue
+                # Normalize
+                normalized_score = (score - min_score_bm) / score_range if score_range > 0 else 0
+                doc = self.documents[idx]
+                results.append({
+                    "content": doc["content"],
+                    "metadata": doc["metadata"],
+                    "score": float(normalized_score),
+                    "doc_id": doc["id"],
+                    "chunk_index": doc["metadata"].get("chunk_index", 0),
+                    "search_type": "bm25"
+                })
+
+            # Sort by score, descending
+            results.sort(key=lambda x: x["score"], reverse=True)
+            return results[:top_k]
+
+        except Exception as e:
+            logger.warning(f"BM25 检索失败: {e}")
+            return []
+
+    def _hybrid_fusion(
+        self,
+        vector_results: List[Dict[str, Any]],
+        bm25_results: List[Dict[str, Any]],
+        top_k: int
+    ) -> List[Dict[str, Any]]:
+        """
+        Fuse vector and BM25 retrieval results.
+
+        Uses weighted RRF (Reciprocal Rank Fusion):
+        Score = weight_vector * (1 / rank_vector) + weight_bm25 * (1 / rank_bm25)
+
+        Args:
+            vector_results: Vector retrieval results
+            bm25_results: BM25 retrieval results
+            top_k: Number of results to return
+
+        Returns:
+            Fused results
+        """
+        if not vector_results and not bm25_results:
+            return []
+
+        # Fusion weights
+        weight_vector = 0.6
+        weight_bm25 = 0.4
+
+        # Build a per-document score map
+        doc_scores: Dict[str, Dict[str, float]] = {}
+
+        # Add vector retrieval results
+        for rank, result in enumerate(vector_results):
+            doc_id = result["doc_id"]
+            if doc_id not in doc_scores:
+                doc_scores[doc_id] = {"vector": 0, "bm25": 0, "content": result["content"], "metadata": result["metadata"]}
+            # Reciprocal rank
+            doc_scores[doc_id]["vector"] = weight_vector / (rank + 1)
+
+        # Add BM25 retrieval results
+        for rank, result in enumerate(bm25_results):
+            doc_id = result["doc_id"]
+            if doc_id not in doc_scores:
+                doc_scores[doc_id] = {"vector": 0, "bm25": 0, "content": result["content"], "metadata": result["metadata"]}
+            doc_scores[doc_id]["bm25"] = weight_bm25 / (rank + 1)
+
+        # Compute fused scores
+        fused_results = []
+        for doc_id, scores in doc_scores.items():
+            fused_score = scores["vector"] + scores["bm25"]
+            # Keep the raw vector score for reference
+            vector_score = next((r["score"] for r in vector_results if r["doc_id"] == doc_id), 0.5)
+            fused_results.append({
+                "content": scores["content"],
+                "metadata": scores["metadata"],
+                "score": fused_score,
+                "doc_id": doc_id,
+                "vector_score": vector_score,
+                "bm25_score": scores["bm25"],
+                "search_type": "hybrid"
+            })
+
+        # Sort by fused score, descending
+        fused_results.sort(key=lambda x: x["score"], reverse=True)
+
+        logger.debug(f"混合融合: {len(fused_results)} 个文档, 向量:{len(vector_results)}, BM25:{len(bm25_results)}")
+
+        return fused_results[:top_k]
|
||||
|
||||
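The weighted reciprocal-rank scheme from the docstring can be reproduced standalone. A minimal sketch over two ranked ID lists; the lists and weights are illustrative, not taken from the service:

```python
def rrf_fuse(vector_ids, bm25_ids, w_vec=0.6, w_bm25=0.4):
    """Weighted reciprocal-rank fusion over two ranked lists of document IDs."""
    scores = {}
    for rank, doc_id in enumerate(vector_ids):
        scores[doc_id] = scores.get(doc_id, 0.0) + w_vec / (rank + 1)
    for rank, doc_id in enumerate(bm25_ids):
        scores[doc_id] = scores.get(doc_id, 0.0) + w_bm25 / (rank + 1)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)

# "a" ranks first in one list and second in the other, so it wins overall:
print(rrf_fuse(["a", "b", "c"], ["b", "a", "d"]))  # ['a', 'b', 'c', 'd']
```

Documents appearing in both lists accumulate contributions from both channels, which is what lets a moderately ranked document in each list beat a document that appears in only one.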
    def _keyword_search(self, query: str, top_k: int = 5) -> List[Dict[str, Any]]:
        """
        Keyword-search fallback.

        Args:
            query: query text
            top_k: maximum number of results to return

        Returns:
            List of relevant document chunks
        """
        if not self.documents:
            return []

        # Extract query keywords
        keywords = []
        for char in query:
            if '\u4e00' <= char <= '\u9fff':  # CJK character
                keywords.append(char)
        # Add English words
        import re
        english_words = re.findall(r'[a-zA-Z]+', query)
        keywords.extend(english_words)

        if not keywords:
            return []

        results = []
        for doc in self.documents:
            content = doc["content"]
            # Keyword-match score
            score = 0
            matched_keywords = 0
            for kw in keywords:
                if kw in content:
                    score += 1
                    matched_keywords += 1

            if matched_keywords > 0:
                # Normalize the score
                score = score / max(len(keywords), 1)
                results.append({
                    "content": content,
                    "metadata": doc["metadata"],
                    "score": score,
                    "doc_id": doc["id"],
                    "chunk_index": doc["metadata"].get("chunk_index", 0)
                })

        # Sort by score
        results.sort(key=lambda x: x["score"], reverse=True)

        logger.debug(f"Keyword search returned {len(results[:top_k])} results")
        return results[:top_k]

    def retrieve_by_doc_id(self, doc_id: str, top_k: int = 10) -> List[Dict[str, Any]]:
        """
        Get all chunks of a given document.

        Args:
            doc_id: document ID
            top_k: maximum number of results to return

        Returns:
            All chunks of the document
        """
        # Collect all chunks belonging to the document
        doc_chunks = [d for d in self.documents if d["metadata"].get("doc_id") == doc_id]

        # Sort by chunk_index
        doc_chunks.sort(key=lambda x: x["metadata"].get("chunk_index", 0))

        # Return at most top_k
        return doc_chunks[:top_k]

    def retrieve_by_table(self, table_name: str, top_k: int = 5) -> List[Dict[str, Any]]:
        """Retrieve the fields of a given table."""
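The fallback's coverage score (matched keywords over total keywords, with CJK characters and English words as the units) can be exercised on its own. The sample query and text below are invented for illustration:

```python
import re

def keyword_score(query, content):
    """Fraction of query keywords (CJK chars + English words) found in content."""
    keywords = [ch for ch in query if '\u4e00' <= ch <= '\u9fff']
    keywords += re.findall(r'[a-zA-Z]+', query)
    if not keywords:
        return 0.0
    hits = sum(1 for kw in keywords if kw in content)
    return hits / len(keywords)

print(keyword_score("user id", "the id column stores each user"))  # 1.0
```

Per-character CJK matching keeps the fallback usable for Chinese queries where whitespace tokenization would produce nothing.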
File diff suppressed because it is too large (Load Diff)

639 backend/app/services/word_ai_service.py (new file)
@@ -0,0 +1,639 @@
"""
|
||||
Word 文档 AI 解析服务
|
||||
|
||||
使用 LLM (GLM) 对 Word 文档进行深度理解,提取结构化数据
|
||||
"""
|
||||
import logging
|
||||
from typing import Dict, Any, List, Optional
|
||||
import json
|
||||
|
||||
from app.services.llm_service import llm_service
|
||||
from app.core.document_parser.docx_parser import DocxParser
|
||||
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
|
||||
class WordAIService:
|
||||
"""Word 文档 AI 解析服务"""
|
||||
|
||||
def __init__(self):
|
||||
self.llm = llm_service
|
||||
self.parser = DocxParser()
|
||||
|
||||
async def parse_word_with_ai(
|
||||
self,
|
||||
file_path: str,
|
||||
user_hint: str = ""
|
||||
) -> Dict[str, Any]:
|
||||
"""
|
||||
使用 AI 解析 Word 文档,提取结构化数据
|
||||
|
||||
适用于从非结构化的 Word 文档中提取表格数据、键值对等信息
|
||||
|
||||
Args:
|
||||
file_path: Word 文件路径
|
||||
user_hint: 用户提示词,指定要提取的内容类型
|
||||
|
||||
Returns:
|
||||
Dict: 包含结构化数据的解析结果
|
||||
"""
|
||||
try:
|
||||
# 1. 先用基础解析器提取原始内容
|
||||
parse_result = self.parser.parse(file_path)
|
||||
|
||||
if not parse_result.success:
|
||||
return {
|
||||
"success": False,
|
||||
"error": parse_result.error,
|
||||
"structured_data": None
|
||||
}
|
||||
|
||||
# 2. 获取原始数据
|
||||
raw_data = parse_result.data
|
||||
paragraphs = raw_data.get("paragraphs", [])
|
||||
paragraphs_with_style = raw_data.get("paragraphs_with_style", [])
|
||||
tables = raw_data.get("tables", [])
|
||||
content = raw_data.get("content", "")
|
||||
images_info = raw_data.get("images", {})
|
||||
metadata = parse_result.metadata or {}
|
||||
|
||||
image_count = images_info.get("image_count", 0)
|
||||
image_descriptions = images_info.get("descriptions", [])
|
||||
|
||||
logger.info(f"Word 基础解析完成: {len(paragraphs)} 个段落, {len(tables)} 个表格, {image_count} 张图片")
|
||||
|
||||
# 3. 提取图片数据(用于视觉分析)
|
||||
images_base64 = []
|
||||
if image_count > 0:
|
||||
try:
|
||||
images_base64 = self.parser.extract_images_as_base64(file_path)
|
||||
logger.info(f"提取到 {len(images_base64)} 张图片的 base64 数据")
|
||||
except Exception as e:
|
||||
logger.warning(f"提取图片 base64 失败: {str(e)}")
|
||||
|
||||
# 4. 根据内容类型选择 AI 解析策略
|
||||
# 如果有图片,先分析图片
|
||||
image_analysis = ""
|
||||
if images_base64:
|
||||
image_analysis = await self._analyze_images_with_ai(images_base64, user_hint)
|
||||
logger.info(f"图片 AI 分析完成: {len(image_analysis)} 字符")
|
||||
|
||||
# 优先处理:表格 > (表格+文本) > 纯文本
|
||||
if tables and len(tables) > 0:
|
||||
structured_data = await self._extract_tables_with_ai(
|
||||
tables, paragraphs, image_count, user_hint, metadata, image_analysis
|
||||
)
|
||||
elif paragraphs and len(paragraphs) > 0:
|
||||
structured_data = await self._extract_from_text_with_ai(
|
||||
paragraphs, content, image_count, image_descriptions, user_hint, image_analysis
|
||||
)
|
||||
else:
|
||||
structured_data = {
|
||||
"success": True,
|
||||
"type": "empty",
|
||||
"message": "文档内容为空"
|
||||
}
|
||||
|
||||
# 添加图片分析结果
|
||||
if image_analysis:
|
||||
structured_data["image_analysis"] = image_analysis
|
||||
|
||||
return structured_data
|
||||
|
||||
except Exception as e:
|
||||
logger.error(f"AI 解析 Word 文档失败: {str(e)}")
|
||||
return {
|
||||
"success": False,
|
||||
"error": str(e),
|
||||
"structured_data": None
|
||||
}
|
||||
|
||||
    async def _extract_tables_with_ai(
        self,
        tables: List[Dict],
        paragraphs: List[str],
        image_count: int,
        user_hint: str,
        metadata: Dict,
        image_analysis: str = ""
    ) -> Dict[str, Any]:
        """
        Use AI to extract structured data from Word tables and text.

        Args:
            tables: list of tables
            paragraphs: list of paragraphs
            image_count: number of images
            user_hint: user prompt
            metadata: document metadata
            image_analysis: image AI analysis result

        Returns:
            Structured data
        """
        try:
            # Build a textual description of the tables
            tables_text = self._build_tables_description(tables)

            # Build the paragraph description
            paragraphs_text = "\n".join(paragraphs[:50]) if paragraphs else "(no body text)"
            if len(paragraphs) > 50:
                paragraphs_text += f"\n... ({len(paragraphs)} paragraphs in total, only the first 50 shown)"

            # Image hint
            image_hint = f"Note: this document contains {image_count} images/figures." if image_count > 0 else ""

            prompt = f"""You are a professional data-extraction expert. Extract structured data from the full content of the following Word document.

[User requirement]
{user_hint if user_hint else "Extract all structured data in the document, including table data, key-value pairs, list items, etc."}

[Document body (paragraphs)]
{paragraphs_text}

[Document tables]
{tables_text}

[Document image info]
{image_hint}

Output in the following JSON format:
{{
    "type": "table_data",
    "headers": ["col1", "col2", ...],
    "rows": [["r1c1", "r1c2", ...], ["r2c1", "r2c2", ...], ...],
    "key_values": {{"key1": "value1", "key2": "value2", ...}},
    "list_items": ["item1", "item2", ...],
    "description": "description of the document content"
}}

Key points:
- Prefer extracting structured data from the tables
- If a table has a header row, put it in headers and the data rows in rows
- If the document contains key-value pairs (e.g. Name: Zhang San), put them in key_values
- If the document contains list items, put them in list_items
- Image content cannot be extracted directly, but note the rough topic of the images in description (e.g. "contains a flowchart", "contains a data chart")
"""

            messages = [
                {"role": "system", "content": "You are a professional data-extraction assistant. Output strictly in JSON format."},
                {"role": "user", "content": prompt}
            ]

            response = await self.llm.chat(
                messages=messages,
                temperature=0.1,
                max_tokens=50000
            )

            content = self.llm.extract_message_content(response)

            # Parse the JSON
            result = self._parse_json_response(content)

            if result:
                logger.info(f"AI table extraction succeeded: {len(result.get('rows', []))} rows, key_values={len(result.get('key_values', {}))}, list_items={len(result.get('list_items', []))}")
                return {
                    "success": True,
                    "type": "table_data",
                    "headers": result.get("headers", []),
                    "rows": result.get("rows", []),
                    "description": result.get("description", ""),
                    "key_values": result.get("key_values", {}),
                    "list_items": result.get("list_items", [])
                }
            else:
                # If the AI output is malformed, fall back to parsing the tables directly
                return self._fallback_table_parse(tables)

        except Exception as e:
            logger.error(f"AI table extraction failed: {str(e)}")
            return self._fallback_table_parse(tables)
    async def _extract_from_text_with_ai(
        self,
        paragraphs: List[str],
        full_text: str,
        image_count: int,
        image_descriptions: List[str],
        user_hint: str,
        image_analysis: str = ""
    ) -> Dict[str, Any]:
        """
        Use AI to extract structured data from plain Word text.

        Args:
            paragraphs: list of paragraphs
            full_text: full text
            image_count: number of images
            image_descriptions: list of image descriptions
            user_hint: user prompt
            image_analysis: image AI analysis result

        Returns:
            Structured data
        """
        try:
            # Cap the text length
            text_preview = full_text[:8000] if len(full_text) > 8000 else full_text

            # Image hint
            image_hint = f"\n[Document images] This document contains {image_count} images/figures." if image_count > 0 else ""
            if image_descriptions:
                image_hint += "\n" + "\n".join(image_descriptions)

            prompt = f"""You are a professional data-extraction expert. Extract structured data from the full content of the following Word document.

[User requirement]
{user_hint if user_hint else "Identify and extract the key information in the document, including table data, key-value pairs, list items, etc."}

[Document body]{image_hint}
{text_preview}

Output in the following JSON format:
{{
    "type": "structured_text",
    "tables": [{{"headers": [...], "rows": [...]}}],
    "key_values": {{"key1": "value1", "key2": "value2", ...}},
    "list_items": ["item1", "item2", ...],
    "summary": "summary of the document content"
}}

Key points:
- If the document contains table data, put it in tables
- If the document contains key-value pairs (e.g. Name: Zhang San), put them in key_values
- If the document contains list items, put them in list_items
- If the document contains images, infer their content from context (e.g. "flowchart", "line chart") and note it in description
- If no structured data can be extracted, at least provide a detailed summary
"""

            messages = [
                {"role": "system", "content": "You are a professional data-extraction assistant. Output strictly in JSON format."},
                {"role": "user", "content": prompt}
            ]

            response = await self.llm.chat(
                messages=messages,
                temperature=0.1,
                max_tokens=50000
            )

            content = self.llm.extract_message_content(response)

            result = self._parse_json_response(content)

            if result:
                logger.info(f"AI text extraction succeeded: type={result.get('type')}")
                return {
                    "success": True,
                    "type": result.get("type", "structured_text"),
                    "tables": result.get("tables", []),
                    "key_values": result.get("key_values", {}),
                    "list_items": result.get("list_items", []),
                    "summary": result.get("summary", ""),
                    "raw_text_preview": text_preview[:500]
                }
            else:
                return {
                    "success": True,
                    "type": "text",
                    "summary": text_preview[:500],
                    "raw_text_preview": text_preview[:500]
                }

        except Exception as e:
            logger.error(f"AI text extraction failed: {str(e)}")
            return {
                "success": False,
                "error": str(e)
            }
    async def _analyze_images_with_ai(
        self,
        images: List[Dict[str, str]],
        user_hint: str = ""
    ) -> str:
        """
        Analyze the images in a Word document with a vision model.

        Args:
            images: list of images, each with base64 and mime_type
            user_hint: user prompt

        Returns:
            Image analysis result text
        """
        try:
            # Call the LLM's vision analysis
            result = await self.llm.analyze_images(
                images=images,
                user_prompt=user_hint or "Describe the image content in detail; extract all text and data."
            )

            if result.get("success"):
                analysis = result.get("analysis", {})
                if isinstance(analysis, dict):
                    description = analysis.get("description", "")
                    text_content = analysis.get("text_content", "")
                    data_extracted = analysis.get("data_extracted", {})

                    result_text = f"[Image analysis result]\n{description}"
                    if text_content:
                        result_text += f"\n\n[Text in images]\n{text_content}"
                    if data_extracted:
                        result_text += f"\n\n[Extracted data]\n{json.dumps(data_extracted, ensure_ascii=False)}"
                    return result_text
                else:
                    return str(analysis)
            else:
                logger.warning(f"Image AI analysis failed: {result.get('error')}")
                return ""

        except Exception as e:
            logger.error(f"Image AI analysis error: {str(e)}")
            return ""
    def _build_tables_description(self, tables: List[Dict]) -> str:
        """Build a textual description of the tables."""
        result = []

        for idx, table in enumerate(tables):
            rows = table.get("rows", [])
            if not rows:
                continue

            result.append(f"\n--- Table {idx + 1} ---")

            for row_idx, row in enumerate(rows[:50]):  # at most 50 rows per table
                if isinstance(row, list):
                    result.append(" | ".join(str(cell).strip() for cell in row))
                elif isinstance(row, dict):
                    result.append(str(row))

            if len(rows) > 50:
                result.append(f"... ({len(rows)} rows in total, only the first 50 shown)")

        return "\n".join(result) if result else "(no table content)"
    def _parse_json_response(self, content: str) -> Optional[Dict]:
        """Parse a JSON response, handling various formatting issues."""
        import re

        # Strip markdown fences
        cleaned = content.strip()
        cleaned = re.sub(r'^```json\s*', '', cleaned, flags=re.MULTILINE)
        cleaned = re.sub(r'^```\s*', '', cleaned, flags=re.MULTILINE)
        cleaned = cleaned.strip()

        # Locate the start of the JSON
        json_start = -1
        for i, c in enumerate(cleaned):
            if c == '{':
                json_start = i
                break

        if json_start == -1:
            logger.warning("Could not locate the start of the JSON")
            return None

        json_text = cleaned[json_start:]

        # Try parsing directly
        try:
            return json.loads(json_text)
        except json.JSONDecodeError:
            pass

        # Try to repair, then parse
        try:
            # Find the matching closing brace
            depth = 0
            end_pos = -1
            for i, c in enumerate(json_text):
                if c == '{':
                    depth += 1
                elif c == '}':
                    depth -= 1
                    if depth == 0:
                        end_pos = i + 1
                        break

            if end_pos > 0:
                fixed = json_text[:end_pos]
                # Remove trailing commas before a closing brace/bracket
                fixed = re.sub(r',\s*([}\]])', r'\1', fixed)
                return json.loads(fixed)
        except Exception as e:
            logger.warning(f"JSON repair failed: {e}")

        return None
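The fence-stripping and trailing-comma repair can be exercised standalone. This sketch mirrors the same regex ideas on a made-up malformed LLM response; it is illustrative, not the service code:

```python
import json
import re

def repair_json(raw):
    """Strip markdown fences and trailing commas, then parse the first JSON object."""
    cleaned = re.sub(r'^```(?:json)?\s*', '', raw.strip(), flags=re.MULTILINE)
    start = cleaned.find('{')
    if start == -1:
        return None
    text = cleaned[start:]
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        # Drop commas that sit right before a closing brace/bracket
        return json.loads(re.sub(r',\s*([}\]])', r'\1', text))

print(repair_json('```json\n{"rows": [1, 2,], "ok": true,}\n```'))
# {'rows': [1, 2], 'ok': True}
```

Note the character class `[}\]]` (brace or bracket, with the bracket escaped): writing it as `[}]]` would instead match only the literal two-character sequence `}]`.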
    def _fallback_table_parse(self, tables: List[Dict]) -> Dict[str, Any]:
        """Parse the tables directly when AI parsing fails."""
        if not tables:
            return {
                "success": True,
                "type": "empty",
                "data": {},
                "message": "No table content"
            }

        all_rows = []
        all_headers = None

        for table in tables:
            rows = table.get("rows", [])
            if not rows:
                continue

            # Find the real header row (skip title rows)
            header_row_idx = 0
            for idx, row in enumerate(rows[:5]):  # check the first 5 rows only
                if not isinstance(row, list):
                    continue
                # A row whose first cell starts with "表" (Table) and is very long is likely a title row
                first_cell = str(row[0]) if row else ""
                if first_cell.startswith("表") and len(first_cell) > 15:
                    header_row_idx = idx + 1
                    continue
                # A row with more than 3 empty cells is likely invalid
                empty_count = sum(1 for cell in row if not str(cell).strip())
                if empty_count > 3:
                    header_row_idx = idx + 1
                    continue
                # Take the first row that looks like a header (short, mostly filled cells)
                avg_len = sum(len(str(c)) for c in row) / len(row) if row else 0
                if avg_len < 20:  # headers are usually shorter than data rows
                    header_row_idx = idx
                    break

            if header_row_idx >= len(rows):
                continue

            # Use the header row we found
            if rows and isinstance(rows[header_row_idx], list):
                headers = rows[header_row_idx]
                if all_headers is None:
                    all_headers = headers

                # Data rows (everything after the header)
                for row in rows[header_row_idx + 1:]:
                    if isinstance(row, list) and len(row) == len(headers):
                        all_rows.append(row)

        if all_headers and all_rows:
            return {
                "success": True,
                "type": "table_data",
                "headers": all_headers,
                "rows": all_rows,
                "description": "Extracted directly from the Word tables"
            }

        return {
            "success": True,
            "type": "raw",
            "tables": tables,
            "message": "Table data (not AI-processed)"
        }
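The header-row heuristic above (skip title rows and mostly-empty rows, accept the first short row) can be isolated into a small function. This is an illustrative adaptation: the original keys title rows on a first cell starting with "表", while this sketch uses a long single-cell first row as the title signal so it works on the invented English sample:

```python
def find_header_row(rows, max_check=5):
    """Return the index of the first row that looks like a header."""
    header_idx = 0
    for idx, row in enumerate(rows[:max_check]):
        first_cell = str(row[0]) if row else ""
        # Skip a long, single-cell title row
        if len(first_cell) > 15 and len(row) <= 1:
            header_idx = idx + 1
            continue
        # Skip rows that are mostly empty
        if sum(1 for c in row if not str(c).strip()) > 3:
            header_idx = idx + 1
            continue
        # Accept the first row whose cells are short on average
        avg_len = sum(len(str(c)) for c in row) / len(row) if row else 0
        if avg_len < 20:
            header_idx = idx
            break
    return header_idx

rows = [["Table 1: quarterly revenue summary"], ["Name", "Q1", "Q2"], ["Acme", "10", "12"]]
print(find_header_row(rows))  # 1
```

The `avg_len < 20` cut-off exploits the fact that header cells are usually much shorter than data cells.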
    async def fill_template_with_ai(
        self,
        file_path: str,
        template_fields: List[Dict[str, Any]],
        user_hint: str = ""
    ) -> Dict[str, Any]:
        """
        Parse a Word document with AI and fill in a template.

        This is the main entry point; the frontend calls it to:
        1. Parse the Word document with AI
        2. Extract data according to the template fields
        3. Return the fill result

        Args:
            file_path: path to the Word file
            template_fields: template fields, e.g. [{"name": "field name", "hint": "hint"}, ...]
            user_hint: user prompt

        Returns:
            Fill result
        """
        try:
            # 1. Parse the document with AI
            parse_result = await self.parse_word_with_ai(file_path, user_hint)

            if not parse_result.get("success"):
                return {
                    "success": False,
                    "error": parse_result.get("error", "Parsing failed"),
                    "filled_data": {},
                    "source": "ai_parse_failed"
                }

            # 2. Extract data according to the field types
            filled_data = {}
            extract_details = []

            parse_type = parse_result.get("type", "")

            if parse_type == "table_data":
                # Table data: match column names directly
                headers = parse_result.get("headers", [])
                rows = parse_result.get("rows", [])

                for field in template_fields:
                    field_name = field.get("name", "")
                    values = self._extract_field_from_table(headers, rows, field_name)
                    filled_data[field_name] = values
                    extract_details.append({
                        "field": field_name,
                        "values": values,
                        "source": "ai_table_extraction",
                        "confidence": 0.9 if values else 0.0
                    })

            elif parse_type == "structured_text":
                # Structured text: try key_values and list_items
                key_values = parse_result.get("key_values", {})
                list_items = parse_result.get("list_items", [])

                for field in template_fields:
                    field_name = field.get("name", "")
                    value = key_values.get(field_name, "")
                    if not value and list_items:
                        value = list_items[0] if list_items else ""
                    filled_data[field_name] = [value] if value else []
                    extract_details.append({
                        "field": field_name,
                        "values": [value] if value else [],
                        "source": "ai_text_extraction",
                        "confidence": 0.7 if value else 0.0
                    })

            else:
                # Other types: return the raw parse result for downstream handling
                for field in template_fields:
                    field_name = field.get("name", "")
                    filled_data[field_name] = []
                    extract_details.append({
                        "field": field_name,
                        "values": [],
                        "source": "no_ai_data",
                        "confidence": 0.0
                    })

            # 3. Return the result
            max_rows = max(len(v) for v in filled_data.values()) if filled_data else 1

            return {
                "success": True,
                "filled_data": filled_data,
                "fill_details": extract_details,
                "ai_parse_result": {
                    "type": parse_type,
                    "description": parse_result.get("description", "")
                },
                "source_doc_count": 1,
                "max_rows": max_rows
            }

        except Exception as e:
            logger.error(f"AI template fill failed: {str(e)}")
            return {
                "success": False,
                "error": str(e),
                "filled_data": {},
                "fill_details": []
            }
    def _extract_field_from_table(
        self,
        headers: List[str],
        rows: List[List],
        field_name: str
    ) -> List[str]:
        """Extract the values of a given field from a table."""
        # Find a matching column
        target_col_idx = None
        for col_idx, header in enumerate(headers):
            if field_name.lower() in str(header).lower() or str(header).lower() in field_name.lower():
                target_col_idx = col_idx
                break

        if target_col_idx is None:
            return []

        # Collect every value in that column
        values = []
        for row in rows:
            if isinstance(row, list) and target_col_idx < len(row):
                val = str(row[target_col_idx]).strip()
                if val:
                    values.append(val)

        return values


# Global singleton
word_ai_service = WordAIService()
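The bidirectional substring match used for column lookup can be tried in isolation. The headers and rows below are invented sample data, not from the service:

```python
def extract_column(headers, rows, field_name):
    """Pick the first column whose header overlaps field_name (either direction)."""
    col = next(
        (i for i, h in enumerate(headers)
         if field_name.lower() in str(h).lower() or str(h).lower() in field_name.lower()),
        None,
    )
    if col is None:
        return []
    # Keep only non-empty, stripped cell values from that column
    return [str(r[col]).strip() for r in rows if col < len(r) and str(r[col]).strip()]

headers = ["Name", "Phone Number", "Email"]
rows = [["Alice", "555-0100", "a@x.io"], ["Bob", "", "b@x.io"]]
print(extract_column(headers, rows, "phone"))  # ['555-0100']
```

Matching in both directions ("phone" in "Phone Number", or a short header contained in a longer field name) tolerates the header/field naming mismatches an LLM extraction typically produces.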
@@ -1,4 +1,4 @@
# ============================================================
# Document understanding and multi-source data fusion system based on large language models
# Python dependency list
# ============================================================
@@ -1,13 +1,16 @@
 import { RouterProvider } from 'react-router-dom';
-import { AuthProvider } from '@/context/AuthContext';
+import { AuthProvider } from '@/contexts/AuthContext';
+import { TemplateFillProvider } from '@/context/TemplateFillContext';
 import { router } from '@/routes';
 import { Toaster } from 'sonner';

 function App() {
   return (
     <AuthProvider>
-      <RouterProvider router={router} />
-      <Toaster position="top-right" richColors closeButton />
+      <TemplateFillProvider>
+        <RouterProvider router={router} />
+        <Toaster position="top-right" richColors closeButton />
+      </TemplateFillProvider>
     </AuthProvider>
   );
 }
@@ -1,6 +1,6 @@
 import React from 'react';
 import { Navigate, useLocation } from 'react-router-dom';
-import { useAuth } from '@/context/AuthContext';
+import { useAuth } from '@/contexts/AuthContext';

 export const RouteGuard: React.FC<{ children: React.ReactNode }> = ({ children }) => {
   const { user, loading } = useAuth();
@@ -1,85 +0,0 @@
|
||||
import React, { createContext, useContext, useEffect, useState } from 'react';
|
||||
import { supabase } from '@/db/supabase';
|
||||
import { User } from '@supabase/supabase-js';
|
||||
import { Profile } from '@/types/types';
|
||||
|
||||
interface AuthContextType {
|
||||
user: User | null;
|
||||
profile: Profile | null;
|
||||
signIn: (email: string, password: string) => Promise<{ error: any }>;
|
||||
signUp: (email: string, password: string) => Promise<{ error: any }>;
|
||||
signOut: () => Promise<{ error: any }>;
|
||||
loading: boolean;
|
||||
}
|
||||
|
||||
const AuthContext = createContext<AuthContextType | undefined>(undefined);
|
||||
|
||||
export const AuthProvider: React.FC<{ children: React.ReactNode }> = ({ children }) => {
|
||||
const [user, setUser] = useState<User | null>(null);
|
||||
const [profile, setProfile] = useState<Profile | null>(null);
|
||||
const [loading, setLoading] = useState(true);
|
||||
|
||||
useEffect(() => {
|
||||
// Check active sessions and sets the user
|
||||
supabase.auth.getSession().then(({ data: { session } }) => {
|
||||
setUser(session?.user ?? null);
|
||||
if (session?.user) fetchProfile(session.user.id);
|
||||
else setLoading(false);
|
||||
});
|
||||
|
||||
// Listen for changes on auth state (sign in, sign out, etc.)
|
||||
const { data: { subscription } } = supabase.auth.onAuthStateChange((_event, session) => {
|
||||
setUser(session?.user ?? null);
|
||||
if (session?.user) fetchProfile(session.user.id);
|
||||
else {
|
||||
setProfile(null);
|
||||
setLoading(false);
|
||||
}
|
||||
});
|
||||
|
||||
return () => subscription.unsubscribe();
|
||||
}, []);
|
||||
|
||||
const fetchProfile = async (uid: string) => {
|
||||
try {
|
||||
const { data, error } = await supabase
|
||||
.from('profiles')
|
||||
.select('*')
|
||||
.eq('id', uid)
|
||||
.maybeSingle();
|
||||
|
||||
if (error) throw error;
|
||||
setProfile(data);
|
||||
} catch (err) {
|
||||
console.error('Error fetching profile:', err);
|
||||
} finally {
|
||||
setLoading(false);
|
||||
}
|
||||
};
|
||||
|
||||
const signIn = async (email: string, password: string) => {
|
||||
return await supabase.auth.signInWithPassword({ email, password });
|
||||
};
|
||||
|
||||
const signUp = async (email: string, password: string) => {
|
||||
return await supabase.auth.signUp({ email, password });
|
||||
};
|
||||
|
||||
const signOut = async () => {
|
||||
return await supabase.auth.signOut();
|
||||
};
|
||||
|
||||
return (
|
||||
<AuthContext.Provider value={{ user, profile, signIn, signUp, signOut, loading }}>
|
||||
{children}
|
||||
</AuthContext.Provider>
|
||||
);
|
||||
};
|
||||
|
||||
export const useAuth = () => {
|
||||
const context = useContext(AuthContext);
|
||||
if (context === undefined) {
|
||||
throw new Error('useAuth must be used within an AuthProvider');
|
||||
}
|
||||
return context;
|
||||
};
|
||||
136 frontend/src/context/TemplateFillContext.tsx (new file)
@@ -0,0 +1,136 @@
import React, { createContext, useContext, useState, ReactNode } from 'react';

type SourceFile = {
  file: File;
  preview?: string;
};

type TemplateField = {
  cell: string;
  name: string;
  field_type: string;
  required: boolean;
  hint?: string;
};

type Step = 'upload' | 'filling' | 'preview';

interface TemplateFillState {
  step: Step;
  templateFile: File | null;
  templateFields: TemplateField[];
  sourceFiles: SourceFile[];
  sourceFilePaths: string[];
  sourceDocIds: string[];
  templateId: string;
  filledResult: any;
  setStep: (step: Step) => void;
  setTemplateFile: (file: File | null) => void;
  setTemplateFields: (fields: TemplateField[]) => void;
  setSourceFiles: (files: SourceFile[]) => void;
  addSourceFiles: (files: SourceFile[]) => void;
  removeSourceFile: (index: number) => void;
  setSourceFilePaths: (paths: string[]) => void;
  setSourceDocIds: (ids: string[]) => void;
  addSourceDocId: (id: string) => void;
  removeSourceDocId: (id: string) => void;
  setTemplateId: (id: string) => void;
  setFilledResult: (result: any) => void;
  reset: () => void;
}

const initialState = {
  step: 'upload' as Step,
  templateFile: null,
  templateFields: [],
  sourceFiles: [],
  sourceFilePaths: [],
  sourceDocIds: [],
  templateId: '',
  filledResult: null,
  setStep: () => {},
  setTemplateFile: () => {},
  setTemplateFields: () => {},
  setSourceFiles: () => {},
  addSourceFiles: () => {},
  removeSourceFile: () => {},
  setSourceFilePaths: () => {},
  setSourceDocIds: () => {},
  addSourceDocId: () => {},
  removeSourceDocId: () => {},
  setTemplateId: () => {},
  setFilledResult: () => {},
  reset: () => {},
};

const TemplateFillContext = createContext<TemplateFillState>(initialState);

export const TemplateFillProvider: React.FC<{ children: ReactNode }> = ({ children }) => {
  const [step, setStep] = useState<Step>('upload');
  const [templateFile, setTemplateFile] = useState<File | null>(null);
  const [templateFields, setTemplateFields] = useState<TemplateField[]>([]);
  const [sourceFiles, setSourceFiles] = useState<SourceFile[]>([]);
  const [sourceFilePaths, setSourceFilePaths] = useState<string[]>([]);
  const [sourceDocIds, setSourceDocIds] = useState<string[]>([]);
  const [templateId, setTemplateId] = useState<string>('');
  const [filledResult, setFilledResult] = useState<any>(null);

  const addSourceFiles = (files: SourceFile[]) => {
    setSourceFiles(prev => [...prev, ...files]);
  };

  const removeSourceFile = (index: number) => {
    setSourceFiles(prev => prev.filter((_, i) => i !== index));
  };

  const addSourceDocId = (id: string) => {
    setSourceDocIds(prev => prev.includes(id) ? prev : [...prev, id]);
  };

  const removeSourceDocId = (id: string) => {
    setSourceDocIds(prev => prev.filter(docId => docId !== id));
  };

  const reset = () => {
    setStep('upload');
    setTemplateFile(null);
    setTemplateFields([]);
    setSourceFiles([]);
    setSourceFilePaths([]);
    setSourceDocIds([]);
    setTemplateId('');
    setFilledResult(null);
  };

  return (
    <TemplateFillContext.Provider
      value={{
        step,
        templateFile,
        templateFields,
        sourceFiles,
        sourceFilePaths,
        sourceDocIds,
        templateId,
        filledResult,
        setStep,
        setTemplateFile,
        setTemplateFields,
        setSourceFiles,
        addSourceFiles,
        removeSourceFile,
        setSourceFilePaths,
        setSourceDocIds,
        addSourceDocId,
        removeSourceDocId,
        setTemplateId,
        setFilledResult,
        reset,
      }}
    >
      {children}
    </TemplateFillContext.Provider>
  );
};

export const useTemplateFill = () => useContext(TemplateFillContext);
@@ -400,6 +400,49 @@ export const backendApi = {
    }
  },

  /**
   * Fetch the task history list
   */
  async getTasks(
    limit: number = 50,
    skip: number = 0
  ): Promise<{ success: boolean; tasks: any[]; count: number }> {
    const url = `${BACKEND_BASE_URL}/tasks?limit=${limit}&skip=${skip}`;

    try {
      const response = await fetch(url);
      if (!response.ok) {
        const error = await response.json();
        throw new Error(error.detail || 'Failed to fetch the task list');
      }
      return await response.json();
    } catch (error) {
      console.error('Failed to fetch the task list:', error);
      throw error;
    }
  },

  /**
   * Delete a task
   */
  async deleteTask(taskId: string): Promise<{ success: boolean; deleted: boolean }> {
    const url = `${BACKEND_BASE_URL}/tasks/${taskId}`;

    try {
      const response = await fetch(url, {
        method: 'DELETE'
      });
      if (!response.ok) {
        const error = await response.json();
        throw new Error(error.detail || 'Failed to delete the task');
      }
      return await response.json();
    } catch (error) {
      console.error('Failed to delete the task:', error);
      throw error;
    }
  },

  /**
   * Poll task status until completion
   */
@@ -656,6 +699,46 @@ export const backendApi = {
    }
  },

  /**
   * Upload a template and source documents jointly
   */
  async uploadTemplateAndSources(
    templateFile: File,
    sourceFiles: File[]
  ): Promise<{
    success: boolean;
    template_id: string;
    filename: string;
    file_type: string;
    fields: TemplateField[];
    field_count: number;
    source_file_paths: string[];
    source_filenames: string[];
    task_id: string;
  }> {
    const formData = new FormData();
    formData.append('template_file', templateFile);
    sourceFiles.forEach(file => formData.append('source_files', file));

    const url = `${BACKEND_BASE_URL}/templates/upload-joint`;

    try {
      const response = await fetch(url, {
        method: 'POST',
        body: formData,
      });

      if (!response.ok) {
        const error = await response.json();
        throw new Error(error.detail || 'Joint upload failed');
      }
      return await response.json();
    } catch (error) {
      console.error('Joint upload failed:', error);
      throw error;
    }
  },

  /**
   * Execute the table fill
   */
@@ -724,6 +807,41 @@ export const backendApi = {
|
||||
}
|
||||
},
|
||||
|
||||
/**
|
||||
* 填充原始模板并导出
|
||||
*
|
||||
* 直接打开原始模板文件,将数据填入模板的表格/单元格中,然后导出
|
||||
* 适用于比赛场景:保持原始模板格式不变
|
||||
*/
|
||||
async fillAndExportTemplate(
|
||||
templatePath: string,
|
||||
filledData: Record<string, any>,
|
||||
format: 'xlsx' | 'docx' = 'xlsx'
|
||||
): Promise<Blob> {
|
||||
const url = `${BACKEND_BASE_URL}/templates/fill-and-export`;
|
||||
|
||||
try {
|
||||
const response = await fetch(url, {
|
||||
method: 'POST',
|
||||
headers: { 'Content-Type': 'application/json' },
|
||||
body: JSON.stringify({
|
||||
template_path: templatePath,
|
||||
filled_data: filledData,
|
||||
format,
|
||||
}),
|
||||
});
|
||||
|
||||
if (!response.ok) {
|
||||
const error = await response.json();
|
||||
throw new Error(error.detail || '填充模板失败');
|
||||
}
|
||||
return await response.blob();
|
||||
} catch (error) {
|
||||
console.error('填充模板失败:', error);
|
||||
throw error;
|
||||
}
|
||||
},
|
||||
|
||||
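`fillAndExportTemplate` resolves to a `Blob`; the caller still has to turn that into a browser download. A sketch of one way to do it, with a small filename helper — the `_filled` suffix and the path shape in the comment are illustrative assumptions, not part of the API:

```typescript
// Derive a download name from the template path, e.g.
// "uploads/templates/report.docx" with format "xlsx" -> "report_filled.xlsx".
function exportFilename(templatePath: string, format: 'xlsx' | 'docx'): string {
  const base = templatePath.split(/[\\/]/).pop() || 'template';
  const stem = base.replace(/\.[^.]+$/, '');
  return `${stem}_filled.${format}`;
}

// Browser-side usage (sketch):
//   const blob = await backendApi.fillAndExportTemplate(path, data, 'xlsx');
//   const a = document.createElement('a');
//   a.href = URL.createObjectURL(blob);
//   a.download = exportFilename(path, 'xlsx');
//   a.click();
//   URL.revokeObjectURL(a.href);
```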
  // ==================== Excel-specific endpoints (kept for compatibility) ====================

  /**
@@ -1105,7 +1223,7 @@ export const aiApi = {

    try {
      const response = await fetch(url, {
-       method: 'GET',
+       method: 'POST',
        body: formData,
      });

@@ -1121,6 +1239,48 @@ export const aiApi = {
    }
  },

  /**
   * Upload a TXT text file and analyze it with AI to extract structured data
   */
  async analyzeTxt(
    file: File
  ): Promise<{
    success: boolean;
    filename?: string;
    structured_data?: {
      table?: {
        columns?: string[];
        rows?: string[][];
      };
      summary?: string;
      key_value_pairs?: Array<{ key: string; value: string }>;
      numeric_data?: Array<{ name: string; value: number; unit?: string }>;
    };
    error?: string;
  }> {
    const formData = new FormData();
    formData.append('file', file);

    const url = `${BACKEND_BASE_URL}/ai/analyze/txt`;

    try {
      const response = await fetch(url, {
        method: 'POST',
        body: formData,
      });

      if (!response.ok) {
        const error = await response.json();
        throw new Error(error.detail || 'TXT AI 分析失败');
      }

      return await response.json();
    } catch (error) {
      console.error('TXT AI 分析失败:', error);
      throw error;
    }
  },

  /**
   * Generate statistics and charts
   */
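`analyzeTxt` returns `structured_data.table` as parallel `columns`/`rows` arrays. A small reshaping sketch (the helper name is illustrative) that turns it into one record per row, which is usually easier to feed into table components:

```typescript
// Reshape { columns, rows } into an array of records; missing cells become ''.
function tableToRecords(
  table?: { columns?: string[]; rows?: string[][] }
): Record<string, string>[] {
  const columns = table?.columns;
  if (!columns || !table?.rows) return [];
  return table.rows.map(row =>
    Object.fromEntries(columns.map((col, i) => [col, row[i] ?? '']))
  );
}
```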
@@ -1219,4 +1379,211 @@ export const aiApi = {
      throw error;
    }
  },

  // ==================== Word AI parsing ====================

  /**
   * Parse a Word document with AI and extract structured data
   */
  async analyzeWordWithAI(
    file: File,
    userHint: string = ''
  ): Promise<{
    success: boolean;
    type?: string;
    headers?: string[];
    rows?: string[][];
    key_values?: Record<string, string>;
    list_items?: string[];
    summary?: string;
    error?: string;
  }> {
    const formData = new FormData();
    formData.append('file', file);
    if (userHint) {
      formData.append('user_hint', userHint);
    }

    const url = `${BACKEND_BASE_URL}/ai/analyze/word`;

    try {
      const response = await fetch(url, {
        method: 'POST',
        body: formData,
      });

      if (!response.ok) {
        const error = await response.json();
        throw new Error(error.detail || 'Word AI 解析失败');
      }

      return await response.json();
    } catch (error) {
      console.error('Word AI 解析失败:', error);
      throw error;
    }
  },

  /**
   * Parse a Word document with AI and fill the template.
   * Done in one pass: AI parsing + form filling.
   */
  async fillTemplateFromWordAI(
    file: File,
    templateFields: TemplateField[],
    userHint: string = ''
  ): Promise<FillResult> {
    const formData = new FormData();
    formData.append('file', file);
    formData.append('template_fields', JSON.stringify(templateFields));
    if (userHint) {
      formData.append('user_hint', userHint);
    }

    const url = `${BACKEND_BASE_URL}/ai/analyze/word/fill-template`;

    try {
      const response = await fetch(url, {
        method: 'POST',
        body: formData,
      });

      if (!response.ok) {
        const error = await response.json();
        throw new Error(error.detail || 'Word AI 填表失败');
      }

      return await response.json();
    } catch (error) {
      console.error('Word AI 填表失败:', error);
      throw error;
    }
  },

  // ==================== Smart instructions ====================

  /**
   * Recognize the intent of a natural-language instruction
   */
  async recognizeIntent(
    instruction: string,
    docIds?: string[]
  ): Promise<{
    success: boolean;
    intent: string;
    params: Record<string, any>;
    message: string;
  }> {
    const url = `${BACKEND_BASE_URL}/instruction/recognize`;

    try {
      const response = await fetch(url, {
        method: 'POST',
        headers: { 'Content-Type': 'application/json' },
        body: JSON.stringify({ instruction, doc_ids: docIds }),
      });

      if (!response.ok) {
        const error = await response.json();
        throw new Error(error.detail || '意图识别失败');
      }

      return await response.json();
    } catch (error) {
      console.error('意图识别失败:', error);
      throw error;
    }
  },

  /**
   * Execute a natural-language instruction
   */
  async executeInstruction(
    instruction: string,
    docIds?: string[],
    context?: Record<string, any>
  ): Promise<{
    success: boolean;
    intent: string;
    result: Record<string, any>;
    message: string;
  }> {
    const url = `${BACKEND_BASE_URL}/instruction/execute`;

    try {
      const response = await fetch(url, {
        method: 'POST',
        headers: { 'Content-Type': 'application/json' },
        body: JSON.stringify({ instruction, doc_ids: docIds, context }),
      });

      if (!response.ok) {
        const error = await response.json();
        throw new Error(error.detail || '指令执行失败');
      }

      return await response.json();
    } catch (error) {
      console.error('指令执行失败:', error);
      throw error;
    }
  },

  /**
   * Smart chat (instruction execution with multi-turn dialogue support)
   */
  async instructionChat(
    instruction: string,
    docIds?: string[],
    context?: Record<string, any>
  ): Promise<{
    success: boolean;
    intent: string;
    result: Record<string, any>;
    message: string;
    hint?: string;
  }> {
    const url = `${BACKEND_BASE_URL}/instruction/chat`;

    try {
      const response = await fetch(url, {
        method: 'POST',
        headers: { 'Content-Type': 'application/json' },
        body: JSON.stringify({ instruction, doc_ids: docIds, context }),
      });

      if (!response.ok) {
        const error = await response.json();
        throw new Error(error.detail || '对话处理失败');
      }

      return await response.json();
    } catch (error) {
      console.error('对话处理失败:', error);
      throw error;
    }
  },

  /**
   * Get the list of supported instruction types
   */
  async getSupportedIntents(): Promise<{
    intents: Array<{
      intent: string;
      name: string;
      examples: string[];
      params: string[];
    }>;
  }> {
    const url = `${BACKEND_BASE_URL}/instruction/intents`;

    try {
      const response = await fetch(url);
      if (!response.ok) throw new Error('获取指令列表失败');
      return await response.json();
    } catch (error) {
      console.error('获取指令列表失败:', error);
      throw error;
    }
  },
};

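`recognizeIntent`, `executeInstruction`, and `instructionChat` all key their behavior off the `intent` string. A caller would typically branch on it; a minimal dispatch sketch (the intent names `fill_template` and `analyze_document` are made-up placeholders — the real set comes from `getSupportedIntents`):

```typescript
type Recognized = {
  success: boolean;
  intent: string;
  params: Record<string, any>;
  message: string;
};

// Turn a recognition result into a short human-readable action description.
function describeIntent(r: Recognized): string {
  switch (r.intent) {
    case 'fill_template': // hypothetical intent name
      return `fill template using ${Object.keys(r.params).length} parameter(s)`;
    case 'analyze_document': // hypothetical intent name
      return 'analyze the selected documents';
    default:
      // Fall back to the server-provided message for unknown intents.
      return r.message || `unhandled intent: ${r.intent}`;
  }
}
```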
@@ -1,4 +1,4 @@
-import React, { useState, useEffect, useCallback } from 'react';
+import React, { useState, useEffect, useCallback, useRef } from 'react';
import { useDropzone } from 'react-dropzone';
import {
  FileText,
@@ -23,7 +23,8 @@ import {
  List,
  MessageSquareCode,
  Tag,
- HelpCircle
+ HelpCircle,
+ Plus
} from 'lucide-react';
import { Button } from '@/components/ui/button';
import { Input } from '@/components/ui/input';
@@ -72,8 +73,10 @@ const Documents: React.FC = () => {
  // Upload state
  const [uploading, setUploading] = useState(false);
  const [uploadedFile, setUploadedFile] = useState<File | null>(null);
+ const [uploadedFiles, setUploadedFiles] = useState<File[]>([]);
  const [parseResult, setParseResult] = useState<ExcelParseResult | null>(null);
  const [expandedSheet, setExpandedSheet] = useState<string | null>(null);
+ const [uploadExpanded, setUploadExpanded] = useState(false);

  // AI analysis state
  const [analyzing, setAnalyzing] = useState(false);
@@ -210,75 +213,119 @@ const Documents: React.FC = () => {

  // File upload handling
  const onDrop = async (acceptedFiles: File[]) => {
    const file = acceptedFiles[0];
    if (!file) return;
    if (acceptedFiles.length === 0) return;

    setUploadedFile(file);
    setUploading(true);
    setParseResult(null);
    setAiAnalysis(null);
    setAnalysisCharts(null);
    setExpandedSheet(null);
    setMdAnalysis(null);
    setMdSections([]);
    setMdStreamingContent('');
    let successCount = 0;
    let failCount = 0;
    const successfulFiles: File[] = [];

    const ext = file.name.split('.').pop()?.toLowerCase();
    // Upload the files one by one
    for (const file of acceptedFiles) {
      const ext = file.name.split('.').pop()?.toLowerCase();

    try {
      // Excel files use the dedicated upload endpoint
      if (ext === 'xlsx' || ext === 'xls') {
        const result = await backendApi.uploadExcel(file, {
          parseAllSheets: parseOptions.parseAllSheets,
          headerRow: parseOptions.headerRow
        });
        if (result.success) {
          toast.success(`解析成功: ${file.name}`);
          setParseResult(result);
          loadDocuments(); // refresh the document list
          if (result.metadata?.sheet_count === 1) {
            setExpandedSheet(Object.keys(result.data?.sheets || {})[0] || null);
      try {
        if (ext === 'xlsx' || ext === 'xls') {
          const result = await backendApi.uploadExcel(file, {
            parseAllSheets: parseOptions.parseAllSheets,
            headerRow: parseOptions.headerRow
          });
          if (result.success) {
            successCount++;
            successfulFiles.push(file);
            // Use the first Excel file's parse result for the preview
            if (successCount === 1) {
              setUploadedFile(file);
              setParseResult(result);
              if (result.metadata?.sheet_count === 1) {
                setExpandedSheet(Object.keys(result.data?.sheets || {})[0] || null);
              }
            }
            loadDocuments();
          } else {
            failCount++;
            toast.error(`${file.name}: ${result.error || '解析失败'}`);
          }
        } else if (ext === 'md' || ext === 'markdown') {
          const result = await backendApi.uploadDocument(file);
          if (result.task_id) {
            successCount++;
            successfulFiles.push(file);
            if (successCount === 1) {
              setUploadedFile(file);
            }
            // Poll the task status
            let attempts = 0;
            const checkStatus = async () => {
              while (attempts < 30) {
                try {
                  const status = await backendApi.getTaskStatus(result.task_id);
                  if (status.status === 'success') {
                    loadDocuments();
                    return;
                  } else if (status.status === 'failure') {
                    return;
                  }
                } catch (e) {
                  console.error('检查状态失败', e);
                }
                await new Promise(resolve => setTimeout(resolve, 2000));
                attempts++;
              }
            };
            checkStatus();
          } else {
            failCount++;
          }
        } else {
          toast.error(result.error || '解析失败');
        }
      } else if (ext === 'md' || ext === 'markdown') {
        // Markdown files: fetch the outline
        await fetchMdOutline();
      } else {
        // Other documents use the generic upload endpoint
        const result = await backendApi.uploadDocument(file);
        if (result.task_id) {
          toast.success(`文件 ${file.name} 已提交处理`);
          // Poll the task status
          let attempts = 0;
          const checkStatus = async () => {
            while (attempts < 30) {
              try {
                const status = await backendApi.getTaskStatus(result.task_id);
                if (status.status === 'success') {
                  toast.success(`文件 ${file.name} 处理完成`);
                  loadDocuments();
                  return;
                } else if (status.status === 'failure') {
                  toast.error(`文件 ${file.name} 处理失败`);
                  return;
                }
              } catch (e) {
                console.error('检查状态失败', e);
              }
              await new Promise(resolve => setTimeout(resolve, 2000));
              attempts++;
          // Other documents use the generic upload endpoint
          const result = await backendApi.uploadDocument(file);
          if (result.task_id) {
            successCount++;
            successfulFiles.push(file);
            if (successCount === 1) {
              setUploadedFile(file);
            }
            toast.error(`文件 ${file.name} 处理超时`);
          };
          checkStatus();
            // Poll the task status
            let attempts = 0;
            const checkStatus = async () => {
              while (attempts < 30) {
                try {
                  const status = await backendApi.getTaskStatus(result.task_id);
                  if (status.status === 'success') {
                    loadDocuments();
                    return;
                  } else if (status.status === 'failure') {
                    return;
                  }
                } catch (e) {
                  console.error('检查状态失败', e);
                }
                await new Promise(resolve => setTimeout(resolve, 2000));
                attempts++;
              }
            };
            checkStatus();
          } else {
            failCount++;
          }
        }
      } catch (error: any) {
        failCount++;
        toast.error(`${file.name}: ${error.message || '上传失败'}`);
      }
    } catch (error: any) {
      toast.error(error.message || '上传失败');
    } finally {
      setUploading(false);
    }

    setUploading(false);
    loadDocuments();

    if (successCount > 0) {
      toast.success(`成功上传 ${successCount} 个文件`);
      setUploadedFiles(prev => [...prev, ...successfulFiles]);
      setUploadExpanded(true);
    }
    if (failCount > 0) {
      toast.error(`${failCount} 个文件上传失败`);
    }
  };

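The inline `checkStatus` loop above (30 attempts, 2 s apart) is repeated for each branch of `onDrop`. A factored-out sketch of the same polling logic, where `getStatus` stands in for `() => backendApi.getTaskStatus(taskId)`:

```typescript
// Poll until the task reaches a terminal state, mirroring the inline loop:
// transient errors are swallowed, and running out of attempts means timeout.
async function pollTask(
  getStatus: () => Promise<{ status: string }>,
  maxAttempts = 30,
  intervalMs = 2000
): Promise<'success' | 'failure' | 'timeout'> {
  for (let i = 0; i < maxAttempts; i++) {
    try {
      const s = await getStatus();
      if (s.status === 'success') return 'success';
      if (s.status === 'failure') return 'failure';
    } catch {
      // Ignore transient errors and keep polling, like the original code.
    }
    await new Promise(resolve => setTimeout(resolve, intervalMs));
  }
  return 'timeout';
}
```

Each branch could then reduce to `pollTask(() => backendApi.getTaskStatus(result.task_id)).then(...)`.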
@@ -291,7 +338,7 @@ const Documents: React.FC = () => {
      'text/markdown': ['.md'],
      'text/plain': ['.txt']
    },
-   maxFiles: 1
+   multiple: true
  });

  // AI analysis handling
@@ -449,6 +496,7 @@ const Documents: React.FC = () => {

  const handleDeleteFile = () => {
    setUploadedFile(null);
+   setUploadedFiles([]);
    setParseResult(null);
    setAiAnalysis(null);
    setAnalysisCharts(null);
@@ -456,6 +504,17 @@ const Documents: React.FC = () => {
    toast.success('文件已清除');
  };

  const handleRemoveUploadedFile = (index: number) => {
    setUploadedFiles(prev => {
      const newFiles = prev.filter((_, i) => i !== index);
      if (newFiles.length === 0) {
        setUploadedFile(null);
      }
      return newFiles;
    });
    toast.success('文件已从列表移除');
  };

  const handleDelete = async (docId: string) => {
    try {
      const result = await backendApi.deleteDocument(docId);
@@ -615,7 +674,7 @@ const Documents: React.FC = () => {
          <h1 className="text-3xl font-extrabold tracking-tight">文档中心</h1>
          <p className="text-muted-foreground">上传文档,自动解析并使用 AI 进行深度分析</p>
        </div>
-       <Button variant="outline" className="rounded-xl gap-2" onClick={loadDocuments}>
+       <Button variant="outline" className="rounded-xl gap-2" onClick={() => loadDocuments()}>
          <RefreshCcw size={18} />
          <span>刷新</span>
        </Button>
@@ -640,7 +699,83 @@ const Documents: React.FC = () => {
        </CardHeader>
        {uploadPanelOpen && (
          <CardContent className="space-y-4">
-           {!uploadedFile ? (
+           {uploadedFiles.length > 0 || uploadedFile ? (
              <div className="space-y-3">
                {/* File list header */}
                <div
                  className="flex items-center justify-between p-3 bg-muted/50 rounded-xl cursor-pointer hover:bg-muted/70 transition-colors"
                  onClick={() => setUploadExpanded(!uploadExpanded)}
                >
                  <div className="flex items-center gap-3">
                    <div className="w-10 h-10 rounded-lg bg-primary/10 text-primary flex items-center justify-center">
                      <Upload size={20} />
                    </div>
                    <div>
                      <p className="font-semibold text-sm">
                        已上传 {(uploadedFiles.length > 0 ? uploadedFiles : [uploadedFile]).length} 个文件
                      </p>
                      <p className="text-xs text-muted-foreground">
                        {uploadExpanded ? '点击收起' : '点击展开查看'}
                      </p>
                    </div>
                  </div>
                  <div className="flex items-center gap-2">
                    <Button
                      variant="ghost"
                      size="sm"
                      onClick={(e) => {
                        e.stopPropagation();
                        handleDeleteFile();
                      }}
                      className="text-destructive hover:text-destructive"
                    >
                      <Trash2 size={14} className="mr-1" />
                      清空
                    </Button>
                    {uploadExpanded ? <ChevronUp size={16} /> : <ChevronDown size={16} />}
                  </div>
                </div>

                {/* Expanded file list */}
                {uploadExpanded && (
                  <div className="space-y-2 border rounded-xl p-3">
                    {(uploadedFiles.length > 0 ? uploadedFiles : [uploadedFile]).filter(Boolean).map((file, index) => (
                      <div key={index} className="flex items-center gap-3 p-2 bg-background rounded-lg">
                        <div className={cn(
                          "w-8 h-8 rounded flex items-center justify-center",
                          isExcelFile(file?.name || '') ? "bg-emerald-500/10 text-emerald-500" : "bg-blue-500/10 text-blue-500"
                        )}>
                          {isExcelFile(file?.name || '') ? <FileSpreadsheet size={16} /> : <FileText size={16} />}
                        </div>
                        <div className="flex-1 min-w-0">
                          <p className="text-sm truncate">{file?.name}</p>
                          <p className="text-xs text-muted-foreground">{formatFileSize(file?.size || 0)}</p>
                        </div>
                        <Button
                          variant="ghost"
                          size="icon"
                          className="text-destructive hover:bg-destructive/10"
                          onClick={() => handleRemoveUploadedFile(index)}
                        >
                          <Trash2 size={14} />
                        </Button>
                      </div>
                    ))}

                    {/* "Add more" button */}
                    <div
                      {...getRootProps()}
                      className="flex items-center justify-center gap-2 p-3 border-2 border-dashed rounded-lg cursor-pointer hover:border-primary/50 hover:bg-primary/5 transition-colors"
                      onClick={(e) => e.stopPropagation()}
                    >
                      <input {...getInputProps()} multiple={true} />
                      <Plus size={16} className="text-muted-foreground" />
                      <span className="text-sm text-muted-foreground">继续添加更多文件</span>
                    </div>
                  </div>
                )}
              </div>
            ) : (
              <div
                {...getRootProps()}
                className={cn(
@@ -649,7 +784,7 @@ const Documents: React.FC = () => {
                  uploading && "opacity-50 pointer-events-none"
                )}
              >
-               <input {...getInputProps()} />
+               <input {...getInputProps()} multiple={true} />
                <div className="w-14 h-14 rounded-xl bg-primary/10 text-primary flex items-center justify-center mb-4 group-hover:scale-110 transition-transform">
                  {uploading ? <Loader2 className="animate-spin" size={28} /> : <Upload size={28} />}
                </div>
@@ -671,30 +806,6 @@ const Documents: React.FC = () => {
                  </Badge>
                </div>
              </div>
            ) : (
              <div className="space-y-4">
                <div className="flex items-center gap-3 p-3 bg-muted/30 rounded-xl">
                  <div className={cn(
                    "w-10 h-10 rounded-lg flex items-center justify-center",
                    isExcelFile(uploadedFile.name) ? "bg-emerald-500/10 text-emerald-500" : "bg-blue-500/10 text-blue-500"
                  )}>
                    {isExcelFile(uploadedFile.name) ? <FileSpreadsheet size={20} /> : <FileText size={20} />}
                  </div>
                  <div className="flex-1 min-w-0">
                    <p className="font-semibold text-sm truncate">{uploadedFile.name}</p>
                    <p className="text-xs text-muted-foreground">{formatFileSize(uploadedFile.size)}</p>
                  </div>
                  <Button variant="ghost" size="icon" className="text-destructive hover:bg-destructive/10" onClick={handleDeleteFile}>
                    <Trash2 size={16} />
                  </Button>
                </div>

                {isExcelFile(uploadedFile.name) && (
                  <Button onClick={() => onDrop([uploadedFile])} className="w-full" disabled={uploading}>
                    {uploading ? '解析中...' : '重新解析'}
                  </Button>
                )}
              </div>
            )}
          </CardContent>
        )}

File diff suppressed because it is too large
@@ -1,603 +0,0 @@
|
||||
import React, { useState, useEffect } from 'react';
|
||||
import {
|
||||
TableProperties,
|
||||
Plus,
|
||||
FilePlus,
|
||||
CheckCircle2,
|
||||
Download,
|
||||
Clock,
|
||||
RefreshCcw,
|
||||
Sparkles,
|
||||
Zap,
|
||||
FileCheck,
|
||||
FileSpreadsheet,
|
||||
Trash2,
|
||||
ChevronDown,
|
||||
ChevronUp,
|
||||
BarChart3,
|
||||
FileText,
|
||||
TrendingUp,
|
||||
Info,
|
||||
AlertCircle,
|
||||
Loader2
|
||||
} from 'lucide-react';
|
||||
import { Button } from '@/components/ui/button';
|
||||
import { Card, CardContent, CardHeader, CardTitle, CardDescription, CardFooter } from '@/components/ui/card';
|
||||
import { Badge } from '@/components/ui/badge';
|
||||
import { useAuth } from '@/context/AuthContext';
|
||||
import { templateApi, documentApi, taskApi } from '@/db/api';
|
||||
import { backendApi, aiApi } from '@/db/backend-api';
|
||||
import { supabase } from '@/db/supabase';
|
||||
import { format } from 'date-fns';
|
||||
import { toast } from 'sonner';
|
||||
import { cn } from '@/lib/utils';
|
||||
import { Skeleton } from '@/components/ui/skeleton';
|
||||
import {
|
||||
Dialog,
|
||||
DialogContent,
|
||||
DialogHeader,
|
||||
DialogTitle,
|
||||
DialogTrigger,
|
||||
DialogFooter,
|
||||
DialogDescription
|
||||
} from '@/components/ui/dialog';
|
||||
import { Checkbox } from '@/components/ui/checkbox';
|
||||
import { ScrollArea } from '@/components/ui/scroll-area';
|
||||
import { Input } from '@/components/ui/input';
|
||||
import { Label } from '@/components/ui/label';
|
||||
import { Textarea } from '@/components/ui/textarea';
|
||||
import { Select, SelectContent, SelectItem, SelectTrigger, SelectValue } from '@/components/ui/select';
|
||||
import { useDropzone } from 'react-dropzone';
|
||||
import { Markdown } from '@/components/ui/markdown';
|
||||
|
||||
type Template = any;
|
||||
type Document = any;
|
||||
type FillTask = any;
|
||||
|
||||
const FormFill: React.FC = () => {
|
||||
const { profile } = useAuth();
|
||||
const [templates, setTemplates] = useState<Template[]>([]);
|
||||
const [documents, setDocuments] = useState<Document[]>([]);
|
||||
const [tasks, setTasks] = useState<any[]>([]);
|
||||
const [loading, setLoading] = useState(true);
|
||||
|
||||
// Selection state
|
||||
const [selectedTemplate, setSelectedTemplate] = useState<string | null>(null);
|
||||
const [selectedDocs, setSelectedDocs] = useState<string[]>([]);
|
||||
const [creating, setCreating] = useState(false);
|
||||
const [openTaskDialog, setOpenTaskDialog] = useState(false);
|
||||
const [viewingTask, setViewingTask] = useState<any | null>(null);
|
||||
|
||||
// Excel upload state
|
||||
const [excelFile, setExcelFile] = useState<File | null>(null);
|
||||
const [excelParseResult, setExcelParseResult] = useState<any>(null);
|
||||
const [excelAnalysis, setExcelAnalysis] = useState<any>(null);
|
||||
const [excelAnalyzing, setExcelAnalyzing] = useState(false);
|
||||
const [expandedSheet, setExpandedSheet] = useState<string | null>(null);
|
||||
const [aiOptions, setAiOptions] = useState({
|
||||
userPrompt: '请分析这些数据,并提取关键信息用于填表,包括数值、分类、摘要等。',
|
||||
analysisType: 'general' as 'general' | 'summary' | 'statistics' | 'insights'
|
||||
});
|
||||
|
||||
const loadData = async () => {
|
||||
if (!profile) return;
|
||||
try {
|
||||
const [t, d, ts] = await Promise.all([
|
||||
templateApi.listTemplates((profile as any).id),
|
||||
documentApi.listDocuments((profile as any).id),
|
||||
taskApi.listTasks((profile as any).id)
|
||||
]);
|
||||
setTemplates(t);
|
||||
setDocuments(d);
|
||||
setTasks(ts);
|
||||
} catch (err: any) {
|
||||
toast.error('数据加载失败');
|
||||
} finally {
|
||||
setLoading(false);
|
||||
}
|
||||
};
|
||||
|
||||
useEffect(() => {
|
||||
loadData();
|
||||
}, [profile]);
|
||||
|
||||
// Excel upload handlers
|
||||
const onExcelDrop = async (acceptedFiles: File[]) => {
|
||||
const file = acceptedFiles[0];
|
||||
if (!file) return;
|
||||
|
||||
if (!file.name.match(/\.(xlsx|xls)$/i)) {
|
||||
toast.error('仅支持 .xlsx 和 .xls 格式的 Excel 文件');
|
||||
return;
|
||||
}
|
||||
|
||||
setExcelFile(file);
|
||||
setExcelParseResult(null);
|
||||
setExcelAnalysis(null);
|
||||
setExpandedSheet(null);
|
||||
|
||||
try {
|
||||
const result = await backendApi.uploadExcel(file);
|
||||
if (result.success) {
|
||||
toast.success(`Excel 解析成功: ${file.name}`);
|
||||
setExcelParseResult(result);
|
||||
} else {
|
||||
toast.error(result.error || '解析失败');
|
||||
}
|
||||
} catch (error: any) {
|
||||
toast.error(error.message || '上传失败');
|
||||
}
|
||||
};
|
||||
|
||||
const { getRootProps, getInputProps, isDragActive } = useDropzone({
|
||||
onDrop: onExcelDrop,
|
||||
accept: {
|
||||
'application/vnd.openxmlformats-officedocument.spreadsheetml.sheet': ['.xlsx'],
|
||||
'application/vnd.ms-excel': ['.xls']
|
||||
},
|
||||
maxFiles: 1
|
||||
});
|
||||
|
||||
const handleAnalyzeExcel = async () => {
|
||||
if (!excelFile || !excelParseResult?.success) {
|
||||
toast.error('请先上传并解析 Excel 文件');
|
||||
return;
|
||||
}
|
||||
|
||||
setExcelAnalyzing(true);
|
||||
setExcelAnalysis(null);
|
||||
|
||||
try {
|
||||
const result = await aiApi.analyzeExcel(excelFile, {
|
||||
userPrompt: aiOptions.userPrompt,
|
||||
analysisType: aiOptions.analysisType
|
||||
});
|
||||
|
||||
if (result.success) {
|
||||
toast.success('AI 分析完成');
|
||||
setExcelAnalysis(result);
|
||||
} else {
|
||||
toast.error(result.error || 'AI 分析失败');
|
||||
}
|
||||
} catch (error: any) {
|
||||
toast.error(error.message || 'AI 分析失败');
|
||||
} finally {
|
||||
setExcelAnalyzing(false);
|
||||
}
|
||||
};
|
||||
|
||||
const handleUseExcelData = () => {
|
||||
if (!excelParseResult?.success) {
|
||||
toast.error('请先解析 Excel 文件');
|
||||
return;
|
||||
}
|
||||
|
||||
// 将 Excel 解析的数据标记为"文档",添加到选择列表
|
||||
toast.success('Excel 数据已添加到数据源,请在任务对话框中选择');
|
||||
// 这里可以添加逻辑来将 Excel 数据传递给后端创建任务
|
||||
};
|
||||
|
||||
const handleDeleteExcel = () => {
|
||||
setExcelFile(null);
|
||||
setExcelParseResult(null);
|
||||
setExcelAnalysis(null);
|
||||
setExpandedSheet(null);
|
||||
toast.success('Excel 文件已清除');
|
||||
};
|
||||
|
||||
const handleUploadTemplate = async (e: React.ChangeEvent<HTMLInputElement>) => {
|
||||
const file = e.target.files?.[0];
|
||||
if (!file || !profile) return;
|
||||
|
||||
try {
|
||||
toast.loading('正在上传模板...');
|
||||
await templateApi.uploadTemplate(file, (profile as any).id);
|
||||
toast.dismiss();
|
||||
toast.success('模板上传成功');
|
||||
loadData();
|
||||
} catch (err) {
|
||||
toast.dismiss();
|
||||
toast.error('上传模板失败');
|
||||
}
|
||||
};
|
||||
|
||||
const handleCreateTask = async () => {
|
||||
if (!profile || !selectedTemplate || selectedDocs.length === 0) {
|
||||
toast.error('请先选择模板和数据源文档');
|
||||
return;
|
||||
}
|
||||
|
||||
setCreating(true);
|
||||
try {
|
||||
const task = await taskApi.createTask((profile as any).id, selectedTemplate, selectedDocs);
|
||||
if (task) {
|
||||
toast.success('任务已创建,正在进行智能填表...');
|
||||
setOpenTaskDialog(false);
|
||||
|
||||
// Invoke edge function
|
||||
supabase.functions.invoke('fill-template', {
|
||||
body: { taskId: task.id }
|
||||
}).then(({ error }) => {
|
||||
if (error) toast.error('填表任务执行失败');
|
||||
else {
|
||||
toast.success('表格填写完成!');
|
||||
loadData();
|
||||
}
|
||||
});
|
||||
loadData();
|
||||
}
|
||||
} catch (err: any) {
|
||||
toast.error('创建任务失败');
|
||||
} finally {
|
||||
setCreating(false);
|
||||
}
|
||||
};
|
||||
|
||||
const getStatusColor = (status: string) => {
|
||||
switch (status) {
|
||||
case 'completed': return 'bg-emerald-500 text-white';
|
||||
case 'failed': return 'bg-destructive text-white';
|
||||
default: return 'bg-amber-500 text-white';
|
||||
}
|
||||
};
|
||||
|
||||
const formatFileSize = (bytes: number): string => {
|
||||
if (bytes === 0) return '0 B';
|
||||
const k = 1024;
|
||||
const sizes = ['B', 'KB', 'MB', 'GB'];
|
||||
const i = Math.floor(Math.log(bytes) / Math.log(k));
|
||||
return `${(bytes / Math.pow(k, i)).toFixed(2)} ${sizes[i]}`;
|
||||
};
|
||||
|
||||
return (
|
||||
<div className="space-y-8 animate-fade-in pb-10">
<section className="flex flex-col md:flex-row md:items-center justify-between gap-4">
<div className="space-y-1">
<h1 className="text-3xl font-extrabold tracking-tight">智能填表</h1>
<p className="text-muted-foreground">根据您的表格模板,自动聚合多源文档信息进行精准填充,告别重复劳动。</p>
</div>
<div className="flex items-center gap-3">
<Dialog open={openTaskDialog} onOpenChange={setOpenTaskDialog}>
<DialogTrigger asChild>
<Button className="rounded-xl shadow-lg shadow-primary/20 gap-2 h-11 px-6">
<FilePlus size={18} />
<span>新建填表任务</span>
</Button>
</DialogTrigger>
<DialogContent className="max-w-4xl max-h-[90vh] flex flex-col p-0 overflow-hidden border-none shadow-2xl rounded-3xl">
<DialogHeader className="p-8 pb-4 bg-muted/50">
<DialogTitle className="text-2xl font-bold flex items-center gap-2">
<Sparkles size={24} className="text-primary" />
开启智能填表之旅
</DialogTitle>
<DialogDescription>
选择一个表格模板及若干个数据源文档,AI 将自动为您分析并填写。
</DialogDescription>
</DialogHeader>

<ScrollArea className="flex-1 p-8 pt-4">
<div className="space-y-8">
{/* Step 1: Select Template */}
<div className="space-y-4">
<div className="flex items-center justify-between">
<h4 className="font-bold flex items-center gap-2 text-primary uppercase tracking-widest text-xs">
<span className="w-5 h-5 rounded-full bg-primary text-white flex items-center justify-center text-[10px]">1</span>
选择表格模板
</h4>
<label className="cursor-pointer text-xs font-semibold text-primary hover:underline flex items-center gap-1">
<Plus size={12} /> 上传新模板
<input type="file" className="hidden" onChange={handleUploadTemplate} accept=".docx,.xlsx" />
</label>
</div>
{templates.length > 0 ? (
<div className="grid grid-cols-1 sm:grid-cols-2 gap-3">
{templates.map(t => (
<div
key={t.id}
className={cn(
"p-4 rounded-2xl border-2 transition-all cursor-pointer flex items-center gap-3 group relative overflow-hidden",
selectedTemplate === t.id ? "border-primary bg-primary/5" : "border-border hover:border-primary/50"
)}
onClick={() => setSelectedTemplate(t.id)}
>
<div className={cn(
"w-10 h-10 rounded-xl flex items-center justify-center shrink-0 transition-colors",
selectedTemplate === t.id ? "bg-primary text-white" : "bg-muted text-muted-foreground"
)}>
<TableProperties size={20} />
</div>
<div className="flex-1 min-w-0">
<p className="font-bold text-sm truncate">{t.name}</p>
<p className="text-[10px] text-muted-foreground uppercase">{t.type}</p>
</div>
{selectedTemplate === t.id && (
<div className="absolute top-0 right-0 w-8 h-8 bg-primary text-white flex items-center justify-center rounded-bl-xl">
<CheckCircle2 size={14} />
</div>
)}
</div>
))}
</div>
) : (
<div className="p-8 text-center bg-muted/30 rounded-2xl border border-dashed text-sm italic text-muted-foreground">
暂无模板,请先点击右上角上传。
</div>
)}
</div>

{/* Step 2: Upload & Analyze Excel */}
<div className="space-y-4">
<h4 className="font-bold flex items-center gap-2 text-primary uppercase tracking-widest text-xs">
<span className="w-5 h-5 rounded-full bg-primary text-white flex items-center justify-center text-[10px]">2</span>
Excel 数据源
</h4>
<div className="bg-muted/20 rounded-2xl p-6">
{!excelFile ? (
<div
{...getRootProps()}
className={cn(
"border-2 border-dashed rounded-xl p-8 transition-all duration-300 flex flex-col items-center justify-center text-center cursor-pointer group",
isDragActive ? "border-primary bg-primary/5" : "border-muted-foreground/20 hover:border-primary/50 hover:bg-muted/30"
)}
>
<input {...getInputProps()} />
<div className="w-12 h-12 rounded-xl bg-primary/10 text-primary flex items-center justify-center mb-3 group-hover:scale-110 transition-transform">
<FileSpreadsheet size={24} />
</div>
<p className="font-semibold text-sm">
{isDragActive ? '释放以开始上传' : '点击或拖拽 Excel 文件'}
</p>
<p className="text-xs text-muted-foreground mt-1">支持 .xlsx 和 .xls 格式</p>
</div>
) : (
<div className="space-y-4">
<div className="flex items-center gap-3 p-3 bg-background rounded-xl">
<div className="w-10 h-10 rounded-lg bg-emerald-500/10 text-emerald-500 flex items-center justify-center">
<FileSpreadsheet size={20} />
</div>
<div className="flex-1 min-w-0">
<p className="font-semibold text-sm truncate">{excelFile.name}</p>
<p className="text-xs text-muted-foreground">{formatFileSize(excelFile.size)}</p>
</div>
<div className="flex gap-2">
<Button
variant="ghost"
size="icon"
className="text-destructive hover:bg-destructive/10"
onClick={handleDeleteExcel}
>
<Trash2 size={16} />
</Button>
</div>
</div>

{/* AI Analysis Options */}
{excelParseResult?.success && (
<div className="space-y-3">
<div className="space-y-2">
<Label htmlFor="analysis-type" className="text-xs">分析类型</Label>
<Select
value={aiOptions.analysisType}
onValueChange={(value: any) => setAiOptions({ ...aiOptions, analysisType: value })}
>
<SelectTrigger id="analysis-type" className="bg-background h-9 text-sm">
<SelectValue placeholder="选择分析类型" />
</SelectTrigger>
<SelectContent>
<SelectItem value="general">综合分析</SelectItem>
<SelectItem value="summary">数据摘要</SelectItem>
<SelectItem value="statistics">统计分析</SelectItem>
<SelectItem value="insights">深度洞察</SelectItem>
</SelectContent>
</Select>
</div>
<div className="space-y-2">
<Label htmlFor="user-prompt" className="text-xs">自定义提示词</Label>
<Textarea
id="user-prompt"
value={aiOptions.userPrompt}
onChange={(e) => setAiOptions({ ...aiOptions, userPrompt: e.target.value })}
className="bg-background resize-none text-sm"
rows={2}
/>
</div>
<Button
onClick={handleAnalyzeExcel}
disabled={excelAnalyzing}
className="w-full gap-2 h-9"
variant="outline"
>
{excelAnalyzing ? <Loader2 className="animate-spin" size={14} /> : <Sparkles size={14} />}
{excelAnalyzing ? '分析中...' : 'AI 分析'}
</Button>
{excelParseResult?.success && (
<Button
onClick={handleUseExcelData}
className="w-full gap-2 h-9"
>
<CheckCircle2 size={14} />
使用此数据源
</Button>
)}
</div>
)}

{/* Excel Analysis Result */}
{excelAnalysis && (
<div className="mt-4 p-4 bg-background rounded-xl max-h-60 overflow-y-auto">
<div className="flex items-center gap-2 mb-3">
<Sparkles size={16} className="text-primary" />
<span className="font-semibold text-sm">AI 分析结果</span>
</div>
<Markdown content={excelAnalysis.analysis?.analysis || ''} className="text-sm" />
</div>
)}
</div>
)}
</div>
</div>

{/* Step 3: Select Documents */}
<div className="space-y-4">
<h4 className="font-bold flex items-center gap-2 text-primary uppercase tracking-widest text-xs">
<span className="w-5 h-5 rounded-full bg-primary text-white flex items-center justify-center text-[10px]">3</span>
选择其他数据源文档
</h4>
{documents.filter(d => d.status === 'completed').length > 0 ? (
<div className="space-y-2 max-h-40 overflow-y-auto pr-2 custom-scrollbar">
{documents.filter(d => d.status === 'completed').map(doc => (
<div
key={doc.id}
className={cn(
"flex items-center gap-3 p-3 rounded-xl border transition-all cursor-pointer",
selectedDocs.includes(doc.id) ? "border-primary/50 bg-primary/5 shadow-sm" : "border-border hover:bg-muted/30"
)}
onClick={() => {
setSelectedDocs(prev =>
prev.includes(doc.id) ? prev.filter(id => id !== doc.id) : [...prev, doc.id]
);
}}
>
<Checkbox checked={selectedDocs.includes(doc.id)} onCheckedChange={() => {}} />
<div className="w-8 h-8 rounded-lg bg-blue-500/10 text-blue-500 flex items-center justify-center">
<Zap size={16} />
</div>
<span className="font-semibold text-sm truncate">{doc.name}</span>
</div>
))}
</div>
) : (
<div className="p-6 text-center bg-muted/30 rounded-xl border border-dashed text-xs italic text-muted-foreground">
暂无其他已解析的文档
</div>
)}
</div>
</div>
</ScrollArea>

<DialogFooter className="p-8 pt-4 bg-muted/20 border-t border-dashed">
<Button variant="outline" className="rounded-xl h-12 px-6" onClick={() => setOpenTaskDialog(false)}>取消</Button>
<Button
className="rounded-xl h-12 px-8 shadow-lg shadow-primary/20 gap-2"
onClick={handleCreateTask}
disabled={creating || !selectedTemplate || (selectedDocs.length === 0 && !excelParseResult?.success)}
>
{creating ? <RefreshCcw className="animate-spin h-5 w-5" /> : <Zap className="h-5 w-5 fill-current" />}
<span>启动智能填表引擎</span>
</Button>
</DialogFooter>
</DialogContent>
</Dialog>
</div>
</section>

{/* Task List */}
<div className="grid grid-cols-1 md:grid-cols-2 lg:grid-cols-3 gap-6">
{loading ? (
Array.from({ length: 3 }).map((_, i) => (
<Skeleton key={i} className="h-48 w-full rounded-3xl bg-muted" />
))
) : tasks.length > 0 ? (
tasks.map((task) => (
<Card key={task.id} className="border-none shadow-md hover:shadow-xl transition-all group rounded-3xl overflow-hidden flex flex-col">
<div className="h-1.5 w-full" style={{ backgroundColor: task.status === 'completed' ? '#10b981' : task.status === 'failed' ? '#ef4444' : '#f59e0b' }} />
<CardHeader className="p-6 pb-2">
<div className="flex justify-between items-start mb-2">
<div className="w-12 h-12 rounded-2xl bg-emerald-500/10 text-emerald-500 flex items-center justify-center shadow-inner group-hover:scale-110 transition-transform">
<TableProperties size={24} />
</div>
<Badge className={cn("text-[10px] uppercase font-bold tracking-widest", getStatusColor(task.status))}>
{task.status === 'completed' ? '已完成' : task.status === 'failed' ? '失败' : '执行中'}
</Badge>
</div>
<CardTitle className="text-lg font-bold truncate group-hover:text-primary transition-colors">{task.templates?.name || '未知模板'}</CardTitle>
<CardDescription className="text-xs flex items-center gap-1 font-medium italic">
<Clock size={12} /> {format(new Date(task.created_at!), 'yyyy/MM/dd HH:mm')}
</CardDescription>
</CardHeader>
<CardContent className="p-6 pt-2 flex-1">
<div className="space-y-4">
<div className="flex flex-wrap gap-2">
<Badge variant="outline" className="bg-muted/50 border-none text-[10px] font-bold">关联 {task.document_ids?.length} 份数据源</Badge>
</div>
{task.status === 'completed' && (
<div className="p-3 bg-emerald-500/5 rounded-2xl border border-emerald-500/10 flex items-center gap-3">
<CheckCircle2 className="text-emerald-500" size={18} />
<span className="text-xs font-semibold text-emerald-700">内容已精准聚合,表格生成完毕</span>
</div>
)}
</div>
</CardContent>
<CardFooter className="p-6 pt-0">
<Button
className="w-full rounded-2xl h-11 bg-primary group-hover:shadow-lg group-hover:shadow-primary/30 transition-all gap-2"
disabled={task.status !== 'completed'}
onClick={() => setViewingTask(task)}
>
<Download size={18} />
<span>下载汇总表格</span>
</Button>
</CardFooter>
</Card>
))
) : (
<div className="col-span-full py-24 flex flex-col items-center justify-center text-center space-y-6">
<div className="w-24 h-24 rounded-full bg-muted flex items-center justify-center text-muted-foreground/30 border-4 border-dashed">
<TableProperties size={48} />
</div>
<div className="space-y-2 max-w-sm">
<p className="text-2xl font-extrabold tracking-tight">暂无生成任务</p>
<p className="text-muted-foreground text-sm">上传模板后,您可以将多个文档的数据自动填充到汇总表格中。</p>
</div>
<Button className="rounded-xl h-12 px-8" onClick={() => setOpenTaskDialog(true)}>立即创建首个任务</Button>
</div>
)}
</div>

{/* Task Result View Modal */}
<Dialog open={!!viewingTask} onOpenChange={(open) => !open && setViewingTask(null)}>
<DialogContent className="max-w-4xl max-h-[90vh] flex flex-col p-0 overflow-hidden border-none shadow-2xl rounded-3xl">
<DialogHeader className="p-8 pb-4 bg-primary text-primary-foreground">
<div className="flex items-center gap-3 mb-2">
<FileCheck size={28} />
<DialogTitle className="text-2xl font-extrabold">表格生成结果预览</DialogTitle>
</div>
<DialogDescription className="text-primary-foreground/80 italic">
系统已根据 {viewingTask?.document_ids?.length} 份文档信息自动填充完毕。
</DialogDescription>
</DialogHeader>
<ScrollArea className="flex-1 p-8 bg-muted/10">
<div className="prose dark:prose-invert max-w-none">
<div className="bg-card p-8 rounded-2xl shadow-sm border min-h-[400px]">
<Badge variant="outline" className="mb-4">数据已脱敏</Badge>
<div className="whitespace-pre-wrap font-sans text-sm leading-relaxed">
<h2 className="text-xl font-bold mb-4">汇总结果报告</h2>
<p className="text-muted-foreground mb-6">以下是根据您上传的多个文档提取并生成的汇总信息:</p>
<div className="p-4 bg-muted/30 rounded-xl border border-dashed border-primary/20 italic">
正在从云端安全下载解析结果并渲染视图...
</div>

<div className="mt-8 space-y-4">
<p className="font-semibold text-primary">✓ 核心实体已对齐</p>
<p className="font-semibold text-primary">✓ 逻辑勾稽关系校验通过</p>
<p className="font-semibold text-primary">✓ 格式符合模板规范</p>
</div>
</div>
</div>
</div>
</ScrollArea>
<DialogFooter className="p-8 pt-4 border-t border-dashed">
<Button variant="outline" className="rounded-xl" onClick={() => setViewingTask(null)}>关闭</Button>
<Button className="rounded-xl px-8 gap-2 shadow-lg shadow-primary/20" onClick={() => toast.success("正在导出文件...")}>
<Download size={18} />
导出为 {viewingTask?.templates?.type?.toUpperCase() || '文件'}
</Button>
</DialogFooter>
</DialogContent>
</Dialog>
</div>
);
};

export default FormFill;
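The uploaded-file card above calls `formatFileSize(excelFile.size)`, but the helper's definition falls outside this hunk. A minimal sketch of such a helper, assuming the conventional 1024-based units (hypothetical; the real implementation may differ):

```typescript
// Hypothetical helper -- the actual formatFileSize is defined outside this diff hunk.
// Converts a byte count into a human-readable string using 1024-based units.
function formatFileSize(bytes: number): string {
  if (bytes === 0) return '0 B';
  const units = ['B', 'KB', 'MB', 'GB'];
  // Pick the largest unit whose threshold the byte count reaches, capped at GB.
  const i = Math.min(Math.floor(Math.log(bytes) / Math.log(1024)), units.length - 1);
  return `${(bytes / 1024 ** i).toFixed(1)} ${units[i]}`;
}
```

For example, `formatFileSize(1536)` yields `"1.5 KB"`, matching the "大小: ... KB" rendering used elsewhere in this diff.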
@@ -10,7 +10,11 @@ import {
TableProperties,
ChevronRight,
ArrowRight,
Loader2
Loader2,
Download,
Search,
MessageSquare,
CheckCircle
} from 'lucide-react';
import { Button } from '@/components/ui/button';
import { Input } from '@/components/ui/input';
@@ -26,12 +30,15 @@ type ChatMessage = {
role: 'user' | 'assistant';
content: string;
created_at: string;
intent?: string;
result?: any;
};

const InstructionChat: React.FC = () => {
const [messages, setMessages] = useState<ChatMessage[]>([]);
const [input, setInput] = useState('');
const [loading, setLoading] = useState(false);
const [currentDocIds, setCurrentDocIds] = useState<string[]>([]);
const scrollAreaRef = useRef<HTMLDivElement>(null);

useEffect(() => {
@@ -43,27 +50,47 @@ const InstructionChat: React.FC = () => {
role: 'assistant',
content: `您好!我是智联文档 AI 助手。

我可以帮您完成以下操作:
**📄 文档智能操作**
- "提取文档中的医院数量和床位数"
- "帮我找出所有机构的名称"

📄 **文档管理**
- "帮我列出最近上传的所有文档"
- "删除三天前的 docx 文档"
**📊 数据填表**
- "根据这些数据填表"
- "将提取的信息填写到Excel模板"

📊 **Excel 分析**
- "分析一下最近上传的 Excel 文件"
- "帮我统计销售报表中的数据"
**📝 内容处理**
- "总结一下这份文档"
- "对比这两个文档的差异"

📝 **智能填表**
- "根据员工信息表创建一个考勤汇总表"
- "用财务文档填充报销模板"
**🔍 智能问答**
- "文档里说了些什么?"
- "有多少家医院?"

请告诉我您想做什么?`,
created_at: new Date().toISOString()
}
]);

// 获取已上传的文档ID列表
loadDocuments();
}
}, []);

const loadDocuments = async () => {
try {
const result = await backendApi.getDocuments(undefined, 50);
if (result.success && result.documents) {
const docIds = result.documents.map((d: any) => d.doc_id);
setCurrentDocIds(docIds);
if (docIds.length > 0) {
console.log(`已加载 ${docIds.length} 个文档`);
}
}
} catch (err) {
console.error('获取文档列表失败:', err);
}
};

useEffect(() => {
// Scroll to bottom
if (scrollAreaRef.current) {
@@ -89,95 +116,126 @@ const InstructionChat: React.FC = () => {
setLoading(true);

try {
// TODO: 后端对话接口,暂用模拟响应
await new Promise(resolve => setTimeout(resolve, 1500));
// 使用真实的智能指令 API
const response = await backendApi.instructionChat(
input.trim(),
currentDocIds.length > 0 ? currentDocIds : undefined
);

// 简单的命令解析演示
const userInput = userMessage.content.toLowerCase();
let response = '';
// 根据意图类型生成友好响应
let responseContent = '';
const resultData = response.result;

if (userInput.includes('列出') || userInput.includes('列表')) {
const result = await backendApi.getDocuments(undefined, 10);
if (result.success && result.documents && result.documents.length > 0) {
response = `已为您找到 ${result.documents.length} 个文档:\n\n`;
result.documents.slice(0, 5).forEach((doc: any, idx: number) => {
response += `${idx + 1}. **${doc.original_filename}** (${doc.doc_type.toUpperCase()})\n`;
response += `   - 大小: ${(doc.file_size / 1024).toFixed(1)} KB\n`;
response += `   - 时间: ${new Date(doc.created_at).toLocaleDateString()}\n\n`;
});
if (result.documents.length > 5) {
response += `...还有 ${result.documents.length - 5} 个文档`;
switch (response.intent) {
case 'extract':
// 信息提取结果
const extracted = resultData?.extracted_data || {};
const keys = Object.keys(extracted);
if (keys.length > 0) {
responseContent = `✅ 已提取到 ${keys.length} 个字段的数据:\n\n`;
for (const [key, value] of Object.entries(extracted)) {
const values = Array.isArray(value) ? value : [value];
responseContent += `**${key}**: ${values.slice(0, 3).join(', ')}${values.length > 3 ? '...' : ''}\n`;
}
responseContent += `\n💡 您可以将这些数据填入表格。`;
} else {
responseContent = '未能从文档中提取到相关数据。请尝试更明确的字段名称。';
}
} else {
response = '暂未找到已上传的文档,您可以先上传一些文档试试。';
}
} else if (userInput.includes('分析') || userInput.includes('excel') || userInput.includes('报表')) {
response = `好的,我可以帮您分析 Excel 文件。
break;

请告诉我:
1. 您想分析哪个 Excel 文件?
2. 需要什么样的分析?(数据摘要/统计分析/图表生成)
case 'fill_table':
// 填表结果
const filled = resultData?.result?.filled_data || {};
const filledKeys = Object.keys(filled);
if (filledKeys.length > 0) {
responseContent = `✅ 填表完成!成功填写 ${filledKeys.length} 个字段:\n\n`;
for (const [key, value] of Object.entries(filled)) {
const values = Array.isArray(value) ? value : [value];
responseContent += `**${key}**: ${values.slice(0, 3).join(', ')}\n`;
}
responseContent += `\n📋 请到【智能填表】页面查看或导出结果。`;
} else {
responseContent = '填表未能提取到数据。请检查模板表头和数据源内容。';
}
break;

或者您可以直接告诉我您想从数据中了解什么,我来为您生成分析。`;
} else if (userInput.includes('填表') || userInput.includes('模板')) {
response = `好的,要进行智能填表,我需要:
case 'summarize':
// 摘要结果
const summaries = resultData?.summaries || [];
if (summaries.length > 0) {
responseContent = `📄 找到 ${summaries.length} 个文档的摘要:\n\n`;
summaries.forEach((s: any, idx: number) => {
responseContent += `**${idx + 1}. ${s.filename}**\n${s.content_preview}\n\n`;
});
} else {
responseContent = '未能生成摘要。请确保已上传文档。';
}
break;

1. **上传表格模板** - 您要填写的表格模板文件(Excel 或 Word 格式)
2. **选择数据源** - 包含要填写内容的源文档
case 'question':
// 问答结果
if (resultData?.answer) {
responseContent = `**问题**: ${resultData.question}\n\n**答案**: ${resultData.answer}`;
} else {
responseContent = resultData?.message || '我找到了相关信息,请查看上文。';
}
break;

您可以去【智能填表】页面完成这些操作,或者告诉我您具体想填什么类型的表格,我来指导您操作。`;
} else if (userInput.includes('删除')) {
response = `要删除文档,请告诉我:
case 'search':
// 搜索结果
const searchResults = resultData?.results || [];
if (searchResults.length > 0) {
responseContent = `🔍 找到 ${searchResults.length} 条相关内容:\n\n`;
searchResults.slice(0, 5).forEach((r: any, idx: number) => {
responseContent += `**${idx + 1}.** ${r.content?.substring(0, 100)}...\n\n`;
});
} else {
responseContent = '未找到相关内容。请尝试其他关键词。';
}
break;

- 要删除的文件名是什么?
- 或者您可以到【文档中心】页面手动选择并删除文档
case 'compare':
// 对比结果
const comparison = resultData?.comparison || [];
if (comparison.length > 0) {
responseContent = `📊 对比了 ${comparison.length} 个文档:\n\n`;
comparison.forEach((c: any) => {
responseContent += `- **${c.filename}**: ${c.doc_type}, ${c.content_length} 字\n`;
});
} else {
responseContent = '需要至少2个文档才能进行对比。';
}
break;

⚠️ 删除操作不可恢复,请确认后再操作。`;
} else if (userInput.includes('帮助') || userInput.includes('help')) {
response = `**我可以帮您完成以下操作:**
case 'unknown':
responseContent = `我理解您想要: "${input.trim()}"\n\n但我目前无法完成此操作。您可以尝试:\n\n1. **提取数据**: "提取医院数量和床位数"\n2. **填表**: "根据这些数据填表"\n3. **总结**: "总结这份文档"\n4. **问答**: "文档里说了什么?"\n5. **搜索**: "搜索相关内容"`;
break;

📄 **文档管理**
- 列出/搜索已上传的文档
- 查看文档详情和元数据
- 删除不需要的文档

📊 **Excel 处理**
- 分析 Excel 文件内容
- 生成数据统计和图表
- 导出处理后的数据

📝 **智能填表**
- 上传表格模板
- 从文档中提取信息填入模板
- 导出填写完成的表格

📋 **任务历史**
- 查看历史处理任务
- 重新执行或导出结果

请直接告诉我您想做什么!`;
} else {
response = `我理解您想要: "${input.trim()}"

目前我还在学习如何更好地理解您的需求。您可以尝试:

1. **上传文档** - 去【文档中心】上传 docx/md/txt 文件
2. **分析 Excel** - 去【Excel解析】上传并分析 Excel 文件
3. **智能填表** - 去【智能填表】创建填表任务

或者您可以更具体地描述您想做的事情,我会尽力帮助您!`;
default:
responseContent = response.message || resultData?.message || '已完成您的请求。';
}

const assistantMessage: ChatMessage = {
id: Math.random().toString(36).substring(7),
role: 'assistant',
content: response,
created_at: new Date().toISOString()
content: responseContent,
created_at: new Date().toISOString(),
intent: response.intent,
result: resultData
};

setMessages(prev => [...prev, assistantMessage]);
} catch (err: any) {
toast.error('请求失败,请重试');
console.error('指令执行失败:', err);
toast.error(err.message || '请求失败,请重试');

const errorMessage: ChatMessage = {
id: Math.random().toString(36).substring(7),
role: 'assistant',
content: `抱歉,处理您的请求时遇到了问题:${err.message}\n\n请稍后重试,或尝试更简单的指令。`,
created_at: new Date().toISOString()
};
setMessages(prev => [...prev, errorMessage]);
} finally {
setLoading(false);
}
@@ -189,10 +247,10 @@ const InstructionChat: React.FC = () => {
};

const quickActions = [
{ label: '列出所有文档', icon: FileText, action: () => setInput('列出所有已上传的文档') },
{ label: '分析 Excel 数据', icon: TableProperties, action: () => setInput('分析一下 Excel 文件') },
{ label: '智能填表', icon: Sparkles, action: () => setInput('我想进行智能填表') },
{ label: '帮助', icon: Sparkles, action: () => setInput('帮助') }
{ label: '提取医院数量', icon: Search, action: () => setInput('提取文档中的医院数量和床位数') },
{ label: '智能填表', icon: TableProperties, action: () => setInput('根据这些数据填表') },
{ label: '总结文档', icon: MessageSquare, action: () => setInput('总结一下这份文档') },
{ label: '智能问答', icon: Bot, action: () => setInput('文档里说了些什么?') }
];

return (
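The `switch` above branches on `response.intent` and reads nested fields off `response.result`. The response shape those call sites imply can be captured in a type like the following; this is an inference from this diff's usage, not the backend's published contract:

```typescript
// Inferred from the switch over response.intent in the diff above; hypothetical shape,
// not backendApi.instructionChat's actual contract.
type Intent = 'extract' | 'fill_table' | 'summarize' | 'question' | 'search' | 'compare' | 'unknown';

interface InstructionChatResponse {
  intent: Intent;
  message?: string;
  result?: {
    extracted_data?: Record<string, unknown>;            // read by the 'extract' branch
    result?: { filled_data?: Record<string, unknown> };  // 'fill_table' (note the extra nesting)
    summaries?: Array<{ filename: string; content_preview: string }>; // 'summarize'
    question?: string;                                   // 'question'
    answer?: string;
    results?: Array<{ content?: string }>;               // 'search'
    comparison?: Array<{ filename: string; doc_type: string; content_length: number }>; // 'compare'
    message?: string;                                    // fallback text
  };
}

// A minimal 'extract' payload that type-checks against the shape:
const sample: InstructionChatResponse = {
  intent: 'extract',
  result: { extracted_data: { 医院数量: [12] } },
};
```

Typing the response this way would let the `switch` drop its `resultData?.` optional chains inside each branch once the intent is narrowed.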
@@ -1,184 +0,0 @@
import React, { useState } from 'react';
import { useNavigate, useLocation } from 'react-router-dom';
import { useAuth } from '@/context/AuthContext';
import { Button } from '@/components/ui/button';
import { Input } from '@/components/ui/input';
import { Label } from '@/components/ui/label';
import { Card, CardContent, CardDescription, CardFooter, CardHeader, CardTitle } from '@/components/ui/card';
import { Tabs, TabsContent, TabsList, TabsTrigger } from '@/components/ui/tabs';
import { FileText, Lock, User, CheckCircle2, AlertCircle } from 'lucide-react';
import { toast } from 'sonner';

const Login: React.FC = () => {
const [username, setUsername] = useState('');
const [password, setPassword] = useState('');
const [loading, setLoading] = useState(false);
const { signIn, signUp } = useAuth();
const navigate = useNavigate();
const location = useLocation();

const handleLogin = async (e: React.FormEvent) => {
e.preventDefault();
if (!username || !password) return toast.error('请输入用户名和密码');

setLoading(true);
try {
const email = `${username}@miaoda.com`;
const { error } = await signIn(email, password);
if (error) throw error;
toast.success('登录成功');
navigate('/');
} catch (err: any) {
toast.error(err.message || '登录失败');
} finally {
setLoading(false);
}
};

const handleSignUp = async (e: React.FormEvent) => {
e.preventDefault();
if (!username || !password) return toast.error('请输入用户名和密码');

setLoading(true);
try {
const email = `${username}@miaoda.com`;
const { error } = await signUp(email, password);
if (error) throw error;
toast.success('注册成功,请登录');
} catch (err: any) {
toast.error(err.message || '注册失败');
} finally {
setLoading(false);
}
};

return (
<div className="min-h-screen flex items-center justify-center bg-[radial-gradient(ellipse_at_top_left,_var(--tw-gradient-stops))] from-primary/10 via-background to-background p-4 relative overflow-hidden">
{/* Decorative elements */}
<div className="absolute top-0 left-0 w-96 h-96 bg-primary/5 rounded-full blur-3xl -translate-x-1/2 -translate-y-1/2" />
<div className="absolute bottom-0 right-0 w-64 h-64 bg-primary/5 rounded-full blur-3xl translate-x-1/3 translate-y-1/3" />

<div className="w-full max-w-md space-y-8 relative animate-fade-in">
<div className="text-center space-y-2">
<div className="inline-flex items-center justify-center w-16 h-16 rounded-2xl bg-primary text-primary-foreground shadow-2xl shadow-primary/30 mb-4 animate-slide-in">
<FileText size={32} />
</div>
<h1 className="text-4xl font-extrabold tracking-tight gradient-text">智联文档</h1>
<p className="text-muted-foreground">多源数据融合与智能文档处理系统</p>
</div>

<Card className="border-border/50 shadow-2xl backdrop-blur-sm bg-card/95">
<Tabs defaultValue="login" className="w-full">
<TabsList className="grid w-full grid-cols-2 rounded-t-xl h-12 bg-muted/50 p-1">
<TabsTrigger value="login" className="rounded-lg data-[state=active]:bg-background data-[state=active]:shadow-sm">登录</TabsTrigger>
<TabsTrigger value="signup" className="rounded-lg data-[state=active]:bg-background data-[state=active]:shadow-sm">注册</TabsTrigger>
</TabsList>

<TabsContent value="login">
<form onSubmit={handleLogin}>
<CardHeader>
<CardTitle>欢迎回来</CardTitle>
<CardDescription>使用您的账号登录智联文档系统</CardDescription>
</CardHeader>
<CardContent className="space-y-4">
<div className="space-y-2">
<Label htmlFor="username">用户名</Label>
<div className="relative">
<User className="absolute left-3 top-2.5 h-4 w-4 text-muted-foreground" />
<Input
id="username"
placeholder="请输入用户名"
className="pl-9 bg-muted/30 border-none focus-visible:ring-primary"
value={username}
onChange={(e) => setUsername(e.target.value)}
/>
</div>
</div>
<div className="space-y-2">
<Label htmlFor="password">密码</Label>
<div className="relative">
<Lock className="absolute left-3 top-2.5 h-4 w-4 text-muted-foreground" />
<Input
id="password"
type="password"
placeholder="请输入密码"
className="pl-9 bg-muted/30 border-none focus-visible:ring-primary"
value={password}
onChange={(e) => setPassword(e.target.value)}
/>
</div>
</div>
</CardContent>
<CardFooter>
<Button className="w-full h-11 text-lg font-semibold rounded-xl" type="submit" disabled={loading}>
{loading ? '登录中...' : '立即登录'}
</Button>
</CardFooter>
</form>
</TabsContent>

<TabsContent value="signup">
<form onSubmit={handleSignUp}>
<CardHeader>
<CardTitle>创建账号</CardTitle>
<CardDescription>开启智能文档处理的新体验</CardDescription>
</CardHeader>
<CardContent className="space-y-4">
<div className="space-y-2">
<Label htmlFor="signup-username">用户名</Label>
<div className="relative">
<User className="absolute left-3 top-2.5 h-4 w-4 text-muted-foreground" />
<Input
id="signup-username"
placeholder="仅字母、数字和下划线"
className="pl-9 bg-muted/30 border-none focus-visible:ring-primary"
value={username}
onChange={(e) => setUsername(e.target.value)}
/>
</div>
</div>
<div className="space-y-2">
<Label htmlFor="signup-password">密码</Label>
<div className="relative">
<Lock className="absolute left-3 top-2.5 h-4 w-4 text-muted-foreground" />
<Input
id="signup-password"
type="password"
placeholder="不少于 6 位"
className="pl-9 bg-muted/30 border-none focus-visible:ring-primary"
value={password}
onChange={(e) => setPassword(e.target.value)}
/>
</div>
</div>
</CardContent>
<CardFooter>
<Button className="w-full h-11 text-lg font-semibold rounded-xl" type="submit" disabled={loading}>
{loading ? '注册中...' : '注册账号'}
</Button>
</CardFooter>
</form>
</TabsContent>
</Tabs>
</Card>

<div className="grid grid-cols-2 gap-4 text-center text-xs text-muted-foreground">
<div className="flex flex-col items-center gap-1">
<CheckCircle2 size={16} className="text-primary" />
<span>智能解析</span>
</div>
<div className="flex flex-col items-center gap-1">
<CheckCircle2 size={16} className="text-primary" />
<span>极速填表</span>
</div>
</div>

<div className="text-center text-sm text-muted-foreground">
© 2026 智联文档 | 多源数据融合系统
</div>
</div>
</div>
);
};

export default Login;
@@ -1,16 +0,0 @@
/**
 * Sample Page
 */

import PageMeta from "../components/common/PageMeta";

export default function SamplePage() {
return (
<>
<PageMeta title="Home" description="Home Page Introduction" />
<div>
<h3>This is a sample page</h3>
</div>
</>
);
}
@@ -11,7 +11,8 @@ import {
ChevronDown,
ChevronUp,
Trash2,
AlertCircle
AlertCircle,
HelpCircle
} from 'lucide-react';
import { Card, CardContent, CardHeader, CardTitle, CardDescription } from '@/components/ui/card';
import { Button } from '@/components/ui/button';
@@ -24,9 +25,9 @@ import { Skeleton } from '@/components/ui/skeleton';

type Task = {
task_id: string;
status: 'pending' | 'processing' | 'success' | 'failure';
status: 'pending' | 'processing' | 'success' | 'failure' | 'unknown';
created_at: string;
completed_at?: string;
updated_at?: string;
message?: string;
result?: any;
error?: string;
@@ -38,54 +39,38 @@ const TaskHistory: React.FC = () => {
const [loading, setLoading] = useState(true);
const [expandedTask, setExpandedTask] = useState<string | null>(null);

// Mock data for demonstration
useEffect(() => {
// 模拟任务数据,实际应该从后端获取
setTasks([
{
task_id: 'task-001',
status: 'success',
created_at: new Date(Date.now() - 3600000).toISOString(),
completed_at: new Date(Date.now() - 3500000).toISOString(),
task_type: 'document_parse',
message: '文档解析完成',
result: {
doc_id: 'doc-001',
filename: 'report_q1_2026.docx',
extracted_fields: ['标题', '作者', '日期', '金额']
}
},
{
task_id: 'task-002',
status: 'success',
created_at: new Date(Date.now() - 7200000).toISOString(),
completed_at: new Date(Date.now() - 7100000).toISOString(),
task_type: 'excel_analysis',
message: 'Excel 分析完成',
result: {
filename: 'sales_data.xlsx',
row_count: 1250,
charts_generated: 3
}
},
{
task_id: 'task-003',
status: 'processing',
created_at: new Date(Date.now() - 600000).toISOString(),
task_type: 'template_fill',
message: '正在填充表格...'
},
{
task_id: 'task-004',
status: 'failure',
created_at: new Date(Date.now() - 86400000).toISOString(),
completed_at: new Date(Date.now() - 86390000).toISOString(),
task_type: 'document_parse',
message: '解析失败',
error: '文件格式不支持或文件已损坏'
// 获取任务历史数据
const fetchTasks = async () => {
try {
setLoading(true);
const response = await backendApi.getTasks(50, 0);
if (response.success && response.tasks) {
// 转换后端数据格式为前端格式
const convertedTasks: Task[] = response.tasks.map((t: any) => ({
task_id: t.task_id,
status: t.status || 'unknown',
created_at: t.created_at || new Date().toISOString(),
updated_at: t.updated_at,
message: t.message || '',
result: t.result,
error: t.error,
task_type: t.task_type || 'document_parse'
}));
setTasks(convertedTasks);
} else {
setTasks([]);
}
]);
setLoading(false);
} catch (error) {
console.error('获取任务列表失败:', error);
toast.error('获取任务列表失败');
setTasks([]);
} finally {
setLoading(false);
}
};
||||
useEffect(() => {
|
||||
fetchTasks();
|
||||
}, []);
|
||||
|
||||
const getStatusBadge = (status: string) => {
|
||||
@@ -96,6 +81,8 @@ const TaskHistory: React.FC = () => {
return <Badge className="bg-destructive text-white text-[10px]"><XCircle size={12} className="mr-1" />失败</Badge>;
case 'processing':
return <Badge className="bg-amber-500 text-white text-[10px]"><Loader2 size={12} className="mr-1 animate-spin" />处理中</Badge>;
case 'unknown':
return <Badge className="bg-gray-500 text-white text-[10px]"><HelpCircle size={12} className="mr-1" />未知</Badge>;
default:
return <Badge className="bg-gray-500 text-white text-[10px]"><Clock size={12} className="mr-1" />等待</Badge>;
}
@@ -133,15 +120,22 @@ const TaskHistory: React.FC = () => {
};

const handleDelete = async (taskId: string) => {
setTasks(prev => prev.filter(t => t.task_id !== taskId));
toast.success('任务已删除');
try {
await backendApi.deleteTask(taskId);
setTasks(prev => prev.filter(t => t.task_id !== taskId));
toast.success('任务已删除');
} catch (error) {
console.error('删除任务失败:', error);
toast.error('删除任务失败');
}
};

const stats = {
total: tasks.length,
success: tasks.filter(t => t.status === 'success').length,
processing: tasks.filter(t => t.status === 'processing').length,
failure: tasks.filter(t => t.status === 'failure').length
failure: tasks.filter(t => t.status === 'failure').length,
unknown: tasks.filter(t => t.status === 'unknown').length
};

return (
@@ -151,7 +145,7 @@ const TaskHistory: React.FC = () => {
<h1 className="text-3xl font-extrabold tracking-tight">任务历史</h1>
<p className="text-muted-foreground">查看和管理您所有的文档处理任务记录</p>
</div>
<Button variant="outline" className="rounded-xl gap-2" onClick={() => window.location.reload()}>
<Button variant="outline" className="rounded-xl gap-2" onClick={() => fetchTasks()}>
<RefreshCcw size={18} />
<span>刷新</span>
</Button>
@@ -194,7 +188,8 @@ const TaskHistory: React.FC = () => {
"w-12 h-12 rounded-xl flex items-center justify-center shrink-0",
task.status === 'success' ? "bg-emerald-500/10 text-emerald-500" :
task.status === 'failure' ? "bg-destructive/10 text-destructive" :
"bg-amber-500/10 text-amber-500"
task.status === 'processing' ? "bg-amber-500/10 text-amber-500" :
"bg-gray-500/10 text-gray-500"
)}>
{task.status === 'processing' ? (
<Loader2 size={24} className="animate-spin" />
@@ -212,16 +207,16 @@ const TaskHistory: React.FC = () => {
</Badge>
</div>
<p className="text-sm text-muted-foreground">
{task.message || '任务执行中...'}
{task.message || (task.status === 'unknown' ? '无法获取状态' : '任务执行中...')}
</p>
<div className="flex items-center gap-4 text-xs text-muted-foreground">
<span className="flex items-center gap-1">
<Clock size={12} />
{format(new Date(task.created_at), 'yyyy-MM-dd HH:mm:ss')}
{task.created_at ? format(new Date(task.created_at), 'yyyy-MM-dd HH:mm:ss') : '时间未知'}
</span>
{task.completed_at && (
{task.updated_at && task.status !== 'processing' && (
<span>
耗时: {Math.round((new Date(task.completed_at).getTime() - new Date(task.created_at).getTime()) / 1000)} 秒
更新: {format(new Date(task.updated_at), 'HH:mm:ss')}
</span>
)}
</div>

@@ -1,4 +1,4 @@
import React, { useState, useEffect } from 'react';
import React, { useState, useEffect, useCallback, useRef } from 'react';
import { useDropzone } from 'react-dropzone';
import {
TableProperties,
@@ -14,7 +14,12 @@ import {
RefreshCcw,
ChevronDown,
ChevronUp,
Loader2
Loader2,
Files,
Trash2,
Eye,
File,
Plus
} from 'lucide-react';
import { Button } from '@/components/ui/button';
import { Card, CardContent, CardHeader, CardTitle, CardDescription } from '@/components/ui/card';
@@ -26,6 +31,14 @@ import { format } from 'date-fns';
import { toast } from 'sonner';
import { cn } from '@/lib/utils';
import { Skeleton } from '@/components/ui/skeleton';
import {
Dialog,
DialogContent,
DialogHeader,
DialogTitle,
} from "@/components/ui/dialog";
import { ScrollArea } from '@/components/ui/scroll-area';
import { useTemplateFill } from '@/context/TemplateFillContext';

type DocumentItem = {
doc_id: string;
@@ -41,73 +54,34 @@ type DocumentItem = {
};
};

type TemplateField = {
cell: string;
name: string;
field_type: string;
required: boolean;
hint?: string;
};

const TemplateFill: React.FC = () => {
const [step, setStep] = useState<'upload-template' | 'select-source' | 'preview' | 'filling'>('upload-template');
const [templateFile, setTemplateFile] = useState<File | null>(null);
const [templateFields, setTemplateFields] = useState<TemplateField[]>([]);
const [sourceDocs, setSourceDocs] = useState<DocumentItem[]>([]);
const [selectedDocs, setSelectedDocs] = useState<string[]>([]);
const {
step, setStep,
templateFile, setTemplateFile,
templateFields, setTemplateFields,
sourceFiles, setSourceFiles, addSourceFiles, removeSourceFile,
sourceFilePaths, setSourceFilePaths,
sourceDocIds, setSourceDocIds, addSourceDocId, removeSourceDocId,
templateId, setTemplateId,
filledResult, setFilledResult,
reset
} = useTemplateFill();

const [loading, setLoading] = useState(false);
const [filling, setFilling] = useState(false);
const [filledResult, setFilledResult] = useState<any>(null);
const [previewDoc, setPreviewDoc] = useState<{ name: string; content: string } | null>(null);
const [previewOpen, setPreviewOpen] = useState(false);
const [sourceMode, setSourceMode] = useState<'upload' | 'select'>('upload');
const [uploadedDocuments, setUploadedDocuments] = useState<DocumentItem[]>([]);
const [docsLoading, setDocsLoading] = useState(false);
const sourceFileInputRef = useRef<HTMLInputElement>(null);

// Load available source documents
useEffect(() => {
loadSourceDocuments();
}, []);

const loadSourceDocuments = async () => {
setLoading(true);
try {
const result = await backendApi.getDocuments(undefined, 100);
if (result.success) {
// Filter to only non-Excel documents that can be used as data sources
const docs = (result.documents || []).filter((d: DocumentItem) =>
['docx', 'md', 'txt', 'xlsx'].includes(d.doc_type)
);
setSourceDocs(docs);
}
} catch (err: any) {
toast.error('加载数据源失败');
} finally {
setLoading(false);
}
};

const onTemplateDrop = async (acceptedFiles: File[]) => {
// 模板拖拽
const onTemplateDrop = useCallback((acceptedFiles: File[]) => {
const file = acceptedFiles[0];
if (!file) return;

const ext = file.name.split('.').pop()?.toLowerCase();
if (!['xlsx', 'xls', 'docx'].includes(ext || '')) {
toast.error('仅支持 xlsx/xls/docx 格式的模板文件');
return;
if (file) {
setTemplateFile(file);
}

setTemplateFile(file);
setLoading(true);

try {
const result = await backendApi.uploadTemplate(file);
if (result.success) {
setTemplateFields(result.fields || []);
setStep('select-source');
toast.success('模板上传成功');
}
} catch (err: any) {
toast.error('模板上传失败: ' + (err.message || '未知错误'));
} finally {
setLoading(false);
}
};
}, []);

const { getRootProps: getTemplateProps, getInputProps: getTemplateInputProps, isDragActive: isTemplateDragActive } = useDropzone({
onDrop: onTemplateDrop,
@@ -116,33 +90,157 @@ const TemplateFill: React.FC = () => {
'application/vnd.ms-excel': ['.xls'],
'application/vnd.openxmlformats-officedocument.wordprocessingml.document': ['.docx']
},
maxFiles: 1
maxFiles: 1,
multiple: false
});

const handleFillTemplate = async () => {
if (!templateFile || selectedDocs.length === 0) {
toast.error('请选择数据源文档');
// 源文档拖拽
const onSourceDrop = useCallback((e: React.DragEvent) => {
e.preventDefault();
const files = Array.from(e.dataTransfer.files).filter(f => {
const ext = f.name.split('.').pop()?.toLowerCase();
return ['xlsx', 'xls', 'docx', 'md', 'txt'].includes(ext || '');
});
if (files.length > 0) {
addSourceFiles(files.map(f => ({ file: f })));
}
}, [addSourceFiles]);

const handleSourceFileSelect = (e: React.ChangeEvent<HTMLInputElement>) => {
const files = Array.from(e.target.files || []);
if (files.length > 0) {
addSourceFiles(files.map(f => ({ file: f })));
toast.success(`已添加 ${files.length} 个文件`);
}
e.target.value = '';
};

// 仅添加源文档不上传
const handleAddSourceFiles = () => {
if (sourceFiles.length === 0) {
toast.error('请先选择源文档');
return;
}
toast.success(`已添加 ${sourceFiles.length} 个源文档,可继续添加更多`);
};

// 加载已上传文档
const loadUploadedDocuments = useCallback(async () => {
setDocsLoading(true);
try {
const result = await backendApi.getDocuments(undefined, 100);
if (result.success) {
// 过滤可作为数据源的文档类型
const docs = (result.documents || []).filter((d: DocumentItem) =>
['docx', 'md', 'txt', 'xlsx', 'xls'].includes(d.doc_type)
);
setUploadedDocuments(docs);
}
} catch (err: any) {
console.error('加载文档失败:', err);
} finally {
setDocsLoading(false);
}
}, []);

// 删除文档
const handleDeleteDocument = async (docId: string, e: React.MouseEvent) => {
e.stopPropagation();
if (!confirm('确定要删除该文档吗?')) return;
try {
const result = await backendApi.deleteDocument(docId);
if (result.success) {
setUploadedDocuments(prev => prev.filter(d => d.doc_id !== docId));
removeSourceDocId(docId);
toast.success('文档已删除');
} else {
toast.error(result.message || '删除失败');
}
} catch (err: any) {
toast.error('删除失败: ' + (err.message || '未知错误'));
}
};

useEffect(() => {
if (sourceMode === 'select') {
loadUploadedDocuments();
}
}, [sourceMode, loadUploadedDocuments]);

const handleJointUploadAndFill = async () => {
if (!templateFile) {
toast.error('请先上传模板文件');
return;
}

setFilling(true);
setStep('filling');
// 检查是否选择了数据源
if (sourceMode === 'upload' && sourceFiles.length === 0) {
toast.error('请上传源文档或从已上传文档中选择');
return;
}
if (sourceMode === 'select' && sourceDocIds.length === 0) {
toast.error('请选择源文档');
return;
}

setLoading(true);

try {
// 调用后端填表接口,传递选中的文档ID
const result = await backendApi.fillTemplate(
'temp-template-id',
templateFields,
selectedDocs // 传递源文档ID列表
);
setFilledResult(result);
setStep('preview');
toast.success('表格填写完成');
if (sourceMode === 'select') {
// 使用已上传文档作为数据源
const result = await backendApi.uploadTemplate(templateFile);

if (result.success) {
setTemplateFields(result.fields || []);
setTemplateId(result.template_id || 'temp');
toast.success('开始智能填表');
setStep('filling');

// 使用 source_doc_ids 进行填表
const fillResult = await backendApi.fillTemplate(
result.template_id || 'temp',
result.fields || [],
sourceDocIds,
[],
'请从以下文档中提取相关信息填写表格'
);

setFilledResult(fillResult);
setStep('preview');
toast.success('表格填写完成');
}
} else {
// 使用联合上传API
const result = await backendApi.uploadTemplateAndSources(
templateFile,
sourceFiles.map(sf => sf.file)
);

if (result.success) {
setTemplateFields(result.fields || []);
setTemplateId(result.template_id);
setSourceFilePaths(result.source_file_paths || []);
toast.success('文档上传成功,开始智能填表');
setStep('filling');

// 自动开始填表
const fillResult = await backendApi.fillTemplate(
result.template_id,
result.fields || [],
[],
result.source_file_paths || [],
'请从以下文档中提取相关信息填写表格'
);

setFilledResult(fillResult);
setStep('preview');
toast.success('表格填写完成');
}
}
} catch (err: any) {
toast.error('填表失败: ' + (err.message || '未知错误'));
setStep('select-source');
toast.error('处理失败: ' + (err.message || '未知错误'));
} finally {
setFilling(false);
setLoading(false);
}
};

@@ -150,7 +248,11 @@ const TemplateFill: React.FC = () => {
if (!templateFile || !filledResult) return;

try {
const blob = await backendApi.exportFilledTemplate('temp', filledResult.filled_data || {}, 'xlsx');
const blob = await backendApi.exportFilledTemplate(
templateId || 'temp',
filledResult.filled_data || {},
'xlsx'
);
const url = URL.createObjectURL(blob);
const a = document.createElement('a');
a.href = url;
@@ -163,12 +265,18 @@ const TemplateFill: React.FC = () => {
}
};

const resetFlow = () => {
setStep('upload-template');
setTemplateFile(null);
setTemplateFields([]);
setSelectedDocs([]);
setFilledResult(null);
const getFileIcon = (filename: string) => {
const ext = filename.split('.').pop()?.toLowerCase();
if (['xlsx', 'xls'].includes(ext || '')) {
return <FileSpreadsheet size={20} className="text-emerald-500" />;
}
if (ext === 'docx') {
return <FileText size={20} className="text-blue-500" />;
}
if (['md', 'txt'].includes(ext || '')) {
return <FileText size={20} className="text-orange-500" />;
}
return <File size={20} className="text-gray-500" />;
};

return (
@@ -180,208 +288,248 @@ const TemplateFill: React.FC = () => {
根据您的表格模板,自动聚合多源文档信息进行精准填充
</p>
</div>
{step !== 'upload-template' && (
<Button variant="outline" className="rounded-xl gap-2" onClick={resetFlow}>
{step !== 'upload' && (
<Button variant="outline" className="rounded-xl gap-2" onClick={reset}>
<RefreshCcw size={18} />
<span>重新开始</span>
</Button>
)}
</section>

{/* Progress Steps */}
<div className="flex items-center justify-center gap-4">
{['上传模板', '选择数据源', '填写预览'].map((label, idx) => {
const stepIndex = ['upload-template', 'select-source', 'preview'].indexOf(step);
const isActive = idx <= stepIndex;
const isCurrent = idx === stepIndex;

return (
<React.Fragment key={idx}>
<div className={cn(
"flex items-center gap-2 px-4 py-2 rounded-full transition-all",
isActive ? "bg-primary text-primary-foreground" : "bg-muted text-muted-foreground"
)}>
<div className={cn(
"w-6 h-6 rounded-full flex items-center justify-center text-xs font-bold",
isCurrent ? "bg-white/20" : ""
)}>
{idx + 1}
</div>
<span className="text-sm font-medium">{label}</span>
</div>
{idx < 2 && (
<div className={cn(
"w-12 h-0.5",
idx < stepIndex ? "bg-primary" : "bg-muted"
)} />
)}
</React.Fragment>
);
})}
</div>

{/* Step 1: Upload Template */}
{step === 'upload-template' && (
<div
{...getTemplateProps()}
className={cn(
"border-2 border-dashed rounded-3xl p-16 transition-all duration-300 flex flex-col items-center justify-center text-center cursor-pointer group",
isTemplateDragActive ? "border-primary bg-primary/5" : "border-muted-foreground/20 hover:border-primary/50 hover:bg-primary/5"
)}
>
<input {...getTemplateInputProps()} />
<div className="w-20 h-20 rounded-2xl bg-primary/10 text-primary flex items-center justify-center mb-6 group-hover:scale-110 transition-transform">
{loading ? <Loader2 className="animate-spin" size={40} /> : <Upload size={40} />}
</div>
<div className="space-y-2 max-w-md">
<p className="text-xl font-bold tracking-tight">
{isTemplateDragActive ? '释放以开始上传' : '点击或拖拽上传表格模板'}
</p>
<p className="text-sm text-muted-foreground">
支持 Excel (.xlsx, .xls) 或 Word (.docx) 格式的表格模板
</p>
</div>
<div className="mt-6 flex gap-3">
<Badge variant="outline" className="bg-emerald-500/10 text-emerald-600 border-emerald-200">
<FileSpreadsheet size={14} className="mr-1" /> Excel 模板
</Badge>
<Badge variant="outline" className="bg-blue-500/10 text-blue-600 border-blue-200">
<FileText size={14} className="mr-1" /> Word 模板
</Badge>
</div>
</div>
)}

{/* Step 2: Select Source Documents */}
{step === 'select-source' && (
<div className="space-y-6">
{/* Template Info */}
{/* Step 1: Upload - Joint Upload of Template + Source Docs */}
{step === 'upload' && (
<div className="grid grid-cols-1 lg:grid-cols-2 gap-6">
{/* Template Upload */}
<Card className="border-none shadow-md">
<CardHeader className="pb-4">
<CardTitle className="text-lg flex items-center gap-2">
<FileSpreadsheet className="text-primary" size={20} />
已上传模板
</CardTitle>
</CardHeader>
<CardContent>
<div className="flex items-center gap-4">
<div className="w-12 h-12 rounded-xl bg-emerald-500/10 text-emerald-500 flex items-center justify-center">
<FileSpreadsheet size={24} />
</div>
<div className="flex-1">
<p className="font-bold">{templateFile?.name}</p>
<p className="text-sm text-muted-foreground">
{templateFields.length} 个字段待填写
</p>
</div>
<Button variant="ghost" size="sm" onClick={() => setStep('upload-template')}>
重新选择
</Button>
</div>

{/* Template Fields Preview */}
<div className="mt-4 p-4 bg-muted/30 rounded-xl">
<p className="text-xs font-bold uppercase tracking-widest text-muted-foreground mb-3">待填写字段</p>
<div className="flex flex-wrap gap-2">
{templateFields.map((field, idx) => (
<Badge key={idx} variant="outline" className="bg-background">
{field.name}
</Badge>
))}
</div>
</div>
</CardContent>
</Card>

{/* Source Documents Selection */}
<Card className="border-none shadow-md">
<CardHeader className="pb-4">
<CardTitle className="text-lg flex items-center gap-2">
<FileText className="text-primary" size={20} />
选择数据源文档
表格模板
</CardTitle>
<CardDescription>
从已上传的文档中选择作为填表的数据来源,支持 Excel 和非结构化文档
上传需要填写的 Excel/Word 模板文件
</CardDescription>
</CardHeader>
<CardContent>
{loading ? (
<div className="space-y-3">
{[1, 2, 3].map(i => <Skeleton key={i} className="h-16 w-full rounded-xl" />)}
</div>
) : sourceDocs.length > 0 ? (
<div className="space-y-3">
{sourceDocs.map(doc => (
<div
key={doc.doc_id}
className={cn(
"flex items-center gap-4 p-4 rounded-xl border-2 transition-all cursor-pointer",
selectedDocs.includes(doc.doc_id)
? "border-primary bg-primary/5"
: "border-border hover:bg-muted/30"
)}
onClick={() => {
setSelectedDocs(prev =>
prev.includes(doc.doc_id)
? prev.filter(id => id !== doc.doc_id)
: [...prev, doc.doc_id]
);
}}
>
<div className={cn(
"w-6 h-6 rounded-md border-2 flex items-center justify-center transition-all",
selectedDocs.includes(doc.doc_id)
? "border-primary bg-primary text-white"
: "border-muted-foreground/30"
)}>
{selectedDocs.includes(doc.doc_id) && <CheckCircle2 size={14} />}
</div>
<div className={cn(
"w-10 h-10 rounded-lg flex items-center justify-center",
doc.doc_type === 'xlsx' ? "bg-emerald-500/10 text-emerald-500" : "bg-blue-500/10 text-blue-500"
)}>
{doc.doc_type === 'xlsx' ? <FileSpreadsheet size={20} /> : <FileText size={20} />}
</div>
<div className="flex-1 min-w-0">
<p className="font-semibold truncate">{doc.original_filename}</p>
<p className="text-xs text-muted-foreground">
{doc.doc_type.toUpperCase()} • {format(new Date(doc.created_at), 'yyyy-MM-dd')}
</p>
</div>
{doc.metadata?.columns && (
<Badge variant="outline" className="text-xs">
{doc.metadata.columns.length} 列
</Badge>
)}
</div>
))}
{!templateFile ? (
<div
{...getTemplateProps()}
className={cn(
"border-2 border-dashed rounded-2xl p-8 transition-all duration-300 flex flex-col items-center justify-center text-center cursor-pointer group min-h-[200px]",
isTemplateDragActive ? "border-primary bg-primary/5" : "border-muted-foreground/20 hover:border-primary/50 hover:bg-primary/5"
)}
>
<input {...getTemplateInputProps()} />
<div className="w-14 h-14 rounded-xl bg-primary/10 text-primary flex items-center justify-center mb-4 group-hover:scale-110 transition-transform">
{loading ? <Loader2 className="animate-spin" size={28} /> : <Upload size={28} />}
</div>
<p className="font-medium">
{isTemplateDragActive ? '释放以上传' : '点击或拖拽上传模板'}
</p>
<p className="text-xs text-muted-foreground mt-1">
支持 .xlsx .xls .docx
</p>
</div>
) : (
<div className="text-center py-12 text-muted-foreground">
<FileText size={48} className="mx-auto mb-4 opacity-30" />
<p>暂无数据源文档,请先上传文档</p>
<div className="flex items-center gap-3 p-4 bg-emerald-500/5 rounded-xl border border-emerald-200">
<div className="w-10 h-10 rounded-lg bg-emerald-500/10 text-emerald-500 flex items-center justify-center">
<FileSpreadsheet size={20} />
</div>
<div className="flex-1 min-w-0">
<p className="font-medium truncate">{templateFile.name}</p>
<p className="text-xs text-muted-foreground">
{(templateFile.size / 1024).toFixed(1)} KB
</p>
</div>
<Button variant="ghost" size="sm" onClick={() => setTemplateFile(null)}>
<X size={16} />
</Button>
</div>
)}
</CardContent>
</Card>

{/* Source Documents Upload */}
<Card className="border-none shadow-md">
<CardHeader className="pb-4">
<CardTitle className="text-lg flex items-center gap-2">
<Files className="text-primary" size={20} />
源文档
</CardTitle>
<CardDescription>
选择包含数据的源文档作为填表依据
</CardDescription>
{/* Source Mode Tabs */}
<div className="flex gap-2 mt-2">
<Button
variant={sourceMode === 'upload' ? 'default' : 'outline'}
size="sm"
onClick={() => setSourceMode('upload')}
>
<Upload size={14} className="mr-1" />
上传文件
</Button>
<Button
variant={sourceMode === 'select' ? 'default' : 'outline'}
size="sm"
onClick={() => setSourceMode('select')}
>
<Files size={14} className="mr-1" />
从文档中心选择
</Button>
</div>
</CardHeader>
<CardContent>
{sourceMode === 'upload' ? (
<>
<div className="border-2 border-dashed rounded-2xl p-8 transition-all duration-300 flex flex-col items-center justify-center text-center cursor-pointer group min-h-[200px] border-muted-foreground/20 hover:border-primary/50 hover:bg-primary/5">
<input
id="source-file-input"
type="file"
multiple={true}
accept=".xlsx,.xls,.docx,.md,.txt"
onChange={handleSourceFileSelect}
className="hidden"
/>
<label htmlFor="source-file-input" className="cursor-pointer flex flex-col items-center">
<div className="w-14 h-14 rounded-xl bg-blue-500/10 text-blue-500 flex items-center justify-center mb-4 group-hover:scale-110 transition-transform">
{loading ? <Loader2 className="animate-spin" size={28} /> : <Upload size={28} />}
</div>
<p className="font-medium">
点击上传源文档
</p>
<p className="text-xs text-muted-foreground mt-1">
支持 .xlsx .xls .docx .md .txt
</p>
</label>
</div>
<div
onDragOver={(e) => { e.preventDefault(); }}
onDrop={onSourceDrop}
className="mt-2 text-center text-xs text-muted-foreground"
>
或拖拽文件到此处
</div>

{/* Selected Source Files */}
{sourceFiles.length > 0 && (
<div className="mt-4 space-y-2">
{sourceFiles.map((sf, idx) => (
<div key={idx} className="flex items-center gap-3 p-3 bg-muted/50 rounded-xl">
{getFileIcon(sf.file.name)}
<div className="flex-1 min-w-0">
<p className="text-sm font-medium truncate">{sf.file.name}</p>
<p className="text-xs text-muted-foreground">
{(sf.file.size / 1024).toFixed(1)} KB
</p>
</div>
<Button variant="ghost" size="sm" onClick={() => removeSourceFile(idx)}>
<Trash2 size={14} className="text-red-500" />
</Button>
</div>
))}
<div className="flex justify-center pt-2">
<Button variant="outline" size="sm" onClick={() => document.getElementById('source-file-input')?.click()}>
<Plus size={14} className="mr-1" />
继续添加更多文档
</Button>
</div>
</div>
)}
</>
) : (
<>
{/* Uploaded Documents Selection */}
{docsLoading ? (
<div className="space-y-2">
{[1, 2, 3].map(i => (
<Skeleton key={i} className="h-16 w-full rounded-xl" />
))}
</div>
) : uploadedDocuments.length > 0 ? (
<div className="space-y-2">
{sourceDocIds.length > 0 && (
<div className="flex items-center justify-between p-3 bg-primary/5 rounded-xl border border-primary/20">
<span className="text-sm font-medium">已选择 {sourceDocIds.length} 个文档</span>
<Button variant="ghost" size="sm" onClick={() => loadUploadedDocuments()}>
<RefreshCcw size={14} className="mr-1" />
刷新列表
</Button>
</div>
)}
<div className="max-h-[300px] overflow-y-auto space-y-2">
{uploadedDocuments.map((doc) => (
<div
key={doc.doc_id}
className={cn(
"flex items-center gap-3 p-3 rounded-xl border-2 transition-all cursor-pointer",
sourceDocIds.includes(doc.doc_id)
? "border-primary bg-primary/5"
: "border-border hover:bg-muted/30"
)}
onClick={() => {
if (sourceDocIds.includes(doc.doc_id)) {
removeSourceDocId(doc.doc_id);
} else {
addSourceDocId(doc.doc_id);
}
}}
>
<div className={cn(
"w-6 h-6 rounded-md border-2 flex items-center justify-center transition-all shrink-0",
sourceDocIds.includes(doc.doc_id)
? "border-primary bg-primary text-white"
: "border-muted-foreground/30"
)}>
{sourceDocIds.includes(doc.doc_id) && <CheckCircle2 size={14} />}
</div>
{getFileIcon(doc.original_filename)}
<div className="flex-1 min-w-0">
<p className="text-sm font-medium truncate">{doc.original_filename}</p>
<p className="text-xs text-muted-foreground">
{doc.doc_type.toUpperCase()} • {format(new Date(doc.created_at), 'yyyy-MM-dd')}
</p>
</div>
<Button
variant="ghost"
size="sm"
onClick={(e) => handleDeleteDocument(doc.doc_id, e)}
className="shrink-0"
>
<Trash2 size={14} className="text-red-500" />
</Button>
</div>
))}
</div>
</div>
) : (
<div className="text-center py-8 text-muted-foreground">
<Files size={32} className="mx-auto mb-2 opacity-30" />
<p className="text-sm">暂无可用的已上传文档</p>
</div>
)}
</>
)}
</CardContent>
</Card>

{/* Action Button */}
<div className="col-span-1 lg:col-span-2 flex justify-center">
  <Button
    size="lg"
    className="rounded-xl px-12 shadow-lg shadow-primary/20 gap-2"
    disabled={!templateFile || loading}
    onClick={handleJointUploadAndFill}
  >
    {loading ? (
      <>
        <Loader2 className="animate-spin" size={20} />
        <span>正在处理...</span>
      </>
    ) : (
      <>
        <Sparkles size={20} />
        <span>上传并智能填表</span>
      </>
    )}
  </Button>

@@ -389,49 +537,7 @@ const TemplateFill: React.FC = () => {

</div>
)}

{/* Step 3: Preview Results */}
{step === 'preview' && filledResult && (
  <Card className="border-none shadow-md">
    <CardHeader>
      <CardTitle className="text-lg flex items-center gap-2">
        <CheckCircle2 className="text-emerald-500" size={20} />
        填表完成
      </CardTitle>
      <CardDescription>
        系统已根据 {selectedDocs.length} 份文档自动完成表格填写
      </CardDescription>
    </CardHeader>
    <CardContent className="space-y-6">
      {/* Filled Data Preview */}
      <div className="p-6 bg-muted/30 rounded-2xl">
        <div className="space-y-4">
          {templateFields.map((field, idx) => (
            <div key={idx} className="flex items-center gap-4">
              <div className="w-32 text-sm font-medium text-muted-foreground">{field.name}</div>
              <div className="flex-1 p-3 bg-background rounded-xl border">
                {(filledResult.filled_data || {})[field.name] || '-'}
              </div>
            </div>
          ))}
        </div>
      </div>

      {/* Action Buttons */}
      <div className="flex justify-center gap-4">
        <Button variant="outline" className="rounded-xl gap-2" onClick={resetFlow}>
          <RefreshCcw size={18} />
          <span>继续填表</span>
        </Button>
        <Button className="rounded-xl gap-2 shadow-lg shadow-primary/20" onClick={handleExport}>
          <Download size={18} />
          <span>导出结果</span>
        </Button>
      </div>
    </CardContent>
  </Card>
)}

{/* Step 2: Filling State */}
{step === 'filling' && (
  <Card className="border-none shadow-md">
    <CardContent className="py-16 flex flex-col items-center justify-center">

@@ -440,11 +546,117 @@

      </div>
      <h3 className="text-xl font-bold mb-2">AI 正在智能分析并填表</h3>
      <p className="text-muted-foreground text-center max-w-md">
        系统正在从 {sourceFiles.length || sourceFilePaths.length} 份文档中检索相关信息...
      </p>
    </CardContent>
  </Card>
)}

{/* Step 3: Preview Results */}
{step === 'preview' && filledResult && (
  <div className="space-y-6">
    <Card className="border-none shadow-md">
      <CardHeader>
        <CardTitle className="text-lg flex items-center gap-2">
          <CheckCircle2 className="text-emerald-500" size={20} />
          填表完成
        </CardTitle>
        <CardDescription>
          系统已根据 {sourceFiles.length || sourceFilePaths.length} 份文档自动完成表格填写
        </CardDescription>
      </CardHeader>
      <CardContent>
        {/* Filled Data Preview */}
        <div className="p-6 bg-muted/30 rounded-2xl">
          <div className="space-y-4">
            {templateFields.map((field, idx) => {
              const value = filledResult.filled_data?.[field.name];
              const displayValue = Array.isArray(value)
                ? value.filter(v => v && String(v).trim()).join(', ') || '-'
                : value || '-';
              return (
                <div key={idx} className="flex items-center gap-4">
                  <div className="w-40 text-sm font-medium text-muted-foreground">{field.name}</div>
                  <div className="flex-1 p-3 bg-background rounded-xl border">
                    {displayValue}
                  </div>
                </div>
              );
            })}
          </div>
        </div>

        {/* Source Files Info */}
        <div className="mt-4 flex flex-wrap gap-2">
          {sourceFiles.map((sf, idx) => (
            <Badge key={idx} variant="outline" className="bg-blue-500/5">
              {getFileIcon(sf.file.name)}
              <span className="ml-1">{sf.file.name}</span>
            </Badge>
          ))}
        </div>

        {/* Action Buttons */}
        <div className="flex justify-center gap-4 mt-6">
          <Button variant="outline" className="rounded-xl gap-2" onClick={reset}>
            <RefreshCcw size={18} />
            <span>继续填表</span>
          </Button>
          <Button className="rounded-xl gap-2 shadow-lg shadow-primary/20" onClick={handleExport}>
            <Download size={18} />
            <span>导出结果</span>
          </Button>
        </div>
      </CardContent>
    </Card>

    {/* Fill Details */}
    {filledResult.fill_details && filledResult.fill_details.length > 0 && (
      <Card className="border-none shadow-md">
        <CardHeader>
          <CardTitle className="text-lg">填写详情</CardTitle>
        </CardHeader>
        <CardContent>
          <div className="space-y-3">
            {filledResult.fill_details.map((detail: any, idx: number) => (
              <div key={idx} className="flex items-start gap-3 p-3 bg-muted/30 rounded-xl text-sm">
                <div className="w-1 h-1 rounded-full bg-primary mt-2" />
                <div className="flex-1">
                  <div className="font-medium">{detail.field}</div>
                  <div className="text-muted-foreground text-xs mt-1">
                    来源: {detail.source} | 置信度: {detail.confidence ? (detail.confidence * 100).toFixed(0) + '%' : 'N/A'}
                  </div>
                  {detail.warning && (
                    <div className="mt-2 p-2 bg-yellow-50 border border-yellow-200 rounded-lg text-yellow-700 text-xs">
                      ⚠️ {detail.warning}
                    </div>
                  )}
                  {detail.values && detail.values.length > 1 && !detail.warning && (
                    <div className="mt-2 text-xs text-muted-foreground">
                      多值: {detail.values.join(', ')}
                    </div>
                  )}
                </div>
              </div>
            ))}
          </div>
        </CardContent>
      </Card>
    )}
  </div>
)}

{/* Preview Dialog */}
<Dialog open={previewOpen} onOpenChange={setPreviewOpen}>
  <DialogContent className="max-w-2xl">
    <DialogHeader>
      <DialogTitle>{previewDoc?.name || '文档预览'}</DialogTitle>
    </DialogHeader>
    <ScrollArea className="max-h-[60vh]">
      <pre className="text-sm whitespace-pre-wrap">{previewDoc?.content}</pre>
    </ScrollArea>
  </DialogContent>
</Dialog>
</div>
);
};

59
logs/rag_disable_note.txt
Normal file
@@ -0,0 +1,59 @@
Temporary RAG Service Disable Notice
========================
Date: 2026-04-08

Changes:
----------
As requested, the RAG vector-retrieval feature has been temporarily disabled:

1. Modified file: backend/app/services/rag_service.py

2. Key changes:
- Added a self._disabled = True flag in RAGService.__init__
- index_field() - checks _disabled, skips the actual indexing and logs instead
- index_document_content() - checks _disabled, skips the actual indexing and logs instead
- retrieve() - checks _disabled, returns an empty list and logs
- get_vector_count() - checks _disabled, returns 0 and logs
- clear() - checks _disabled, skips the actual clearing and logs

3. Behavior changes:
- All RAG index-building operations are logged (with a [RAG DISABLED] prefix)
- All RAG retrieval operations return empty results
- The vector count always returns 0
- Actual vector-database operations are skipped

4. How to restore:
- Change self._disabled = True to self._disabled = False in RAGService.__init__
- Restart the service to re-enable RAG

Purpose:
------
Keep the front-end UI and code structure for RAG index building while not actually
calling the vector-database API; re-enable later when needed.

Affected scope:
---------
- /api/v1/rag/search - RAG search endpoint (returns empty results)
- /api/v1/rag/status - RAG status endpoint (returns vector_count=0)
- /api/v1/rag/rebuild - RAG rebuild endpoint (logs only)
- RAG index building on Excel/document upload (logs only)

========================
Follow-up (2026-04-08):
========================
Modified file: backend/app/services/table_rag_service.py

Key changes:
- Added a self._disabled = True flag in TableRAGService.__init__
- build_table_rag_index() - the RAG indexing part is skipped, logs only
- index_document_table() - the RAG indexing part is skipped, logs only

Behavior changes:
- On Excel upload, MySQL storage still proceeds normally
- AI field descriptions are still generated (via the LLM)
- Only the vector-database indexing operations are skipped

How to restore:
- Change self._disabled = True to self._disabled = False in TableRAGService.__init__
- And change self._disabled = True to self._disabled = False in rag_service.py
- Both must be set to False to fully restore RAG
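The guard pattern this note describes can be sketched as follows. This is a minimal illustration: the method bodies are placeholders, and only the `_disabled` flag and the `[RAG DISABLED]` logging behavior come from the note itself.

```python
import logging

logger = logging.getLogger("rag_service")

class RAGService:
    """Sketch of the temporary disable guard; real methods do vector I/O."""

    def __init__(self):
        # Flip to False and restart the service to re-enable RAG.
        self._disabled = True

    def retrieve(self, query: str, top_k: int = 5) -> list:
        if self._disabled:
            logger.info("[RAG DISABLED] retrieve skipped for query: %s", query)
            return []  # disabled: retrieval always returns empty results
        raise NotImplementedError("real vector search goes here")

    def get_vector_count(self) -> int:
        if self._disabled:
            logger.info("[RAG DISABLED] vector count forced to 0")
            return 0
        raise NotImplementedError("real vector count goes here")
```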
354
比赛备赛规划.md
@@ -50,18 +50,18 @@
| `prompt_service.py` | ✅ Done | Prompt template management |
| `text_analysis_service.py` | ✅ Done | Text analysis |
| `chart_generator_service.py` | ✅ Done | Chart generation service |
| `template_fill_service.py` | ✅ Done | Template filling service: multi-row extraction, direct extraction from structured data, JSON fault tolerance, Word table handling |

### 2.2 API Endpoints (`backend/app/api/endpoints/`)

| Endpoint file | Route | Status |
|----------|------|----------|
| `upload.py` | `/api/v1/upload/excel` | ✅ Excel file upload and parsing |
| `upload.py` | `/api/v1/upload/document` | ✅ Document upload and parsing |
| `documents.py` | `/api/v1/documents/*` | ✅ Document management (list, delete, search) |
| `ai_analyze.py` | `/api/v1/analyze/*` | ✅ AI analysis (Excel, Markdown, streaming) |
| `rag.py` | `/api/v1/rag/*` | ⚠️ RAG retrieval (currently returns empty results) |
| `tasks.py` | `/api/v1/tasks/*` | ✅ Async task status queries |
| `templates.py` | `/api/v1/templates/*` | ✅ Template management (multi-row export, Word export, Word structured-field parsing) |
| `visualization.py` | `/api/v1/visualization/*` | ✅ Visualization charts |
| `health.py` | `/api/v1/health` | ✅ Health check |

@@ -70,71 +70,67 @@

| Page file | Function | Status |
|----------|------|------|
| `Documents.tsx` | Main document management page | ✅ Done |
| `TemplateFill.tsx` | Smart template-filling page | ✅ Done |
| `ExcelParse.tsx` | Excel parsing page | ✅ Done |

### 2.4 Document Parsing Capabilities

| Format | Status | Notes |
|------|----------|------|
| Excel (.xlsx/.xls) | ✅ Done | pandas with XML fallback parsing, multi-sheet support |
| Markdown (.md) | ✅ Done | Regex plus AI-based section splitting |
| Word (.docx) | ✅ Done | python-docx parsing, table extraction and field recognition |
| Text (.txt) | ✅ Done | chardet encoding detection, text cleaning and structured extraction |

---

## III. Core Feature Implementation Details

### 3.1 Template Filling Module (✅ Done)

**Core flow**:
```
Upload a template form (Word/Excel)
        ↓
Parse the template; extract the fields to fill and their hint prompts
        ↓
Load source data from the given source document IDs (MongoDB or files)
        ↓
Prefer direct extraction from structured data (Excel rows)
        ↓
Fall back to LLM extraction from text when direct extraction fails
        ↓
Write the extracted data into the original template positions (preserving template formatting)
        ↓
Export the completed form (Excel/Word)
```

**Key features**:
- **In-place template filling**: opens the original template file and writes data into its tables/cells
- **Multi-row data support**: each field can yield multiple values; export expands rows automatically
- **Structured data first**: extracts directly from Excel rows, no LLM needed
- **JSON fault tolerance**: handles corrupted/truncated JSON returned by the LLM
- **Markdown cleanup**: strips markdown formatting from LLM output

### 3.2 Word Document Parsing (✅ Done)

**Implemented**:
- `docx_parser.py` - Word document parser
- Paragraph text extraction
- Table content extraction (supports the competition table format: field name | hint | value)
- `parse_tables_for_template()` - parses table templates and extracts fields
- `extract_template_fields_from_docx()` - extracts template field definitions
- `_infer_field_type_from_hint()` - infers the field type from the hint text
- **API endpoint**: `/api/v1/templates/parse-word-structure` - upload a Word document, extract structured fields, and store them in MongoDB
- **API endpoint**: `/api/v1/templates/word-fields/{doc_id}` - fetch the stored template field info for a document

### 3.3 Text Document Parsing (✅ Done)

**Implemented**:
- `txt_parser.py` - plain-text file parser
- Automatic encoding detection (chardet)
- Text cleaning (strips control characters, normalizes whitespace)
- Structured data extraction (emails, URLs, phone numbers, dates, amounts)

---

@@ -192,20 +188,20 @@ docs/test/

## VI. Work Plan (Suggested)

### Priority 1: End-to-end testing
- Run accuracy tests with real test data
- Verify that multi-row data exports correctly
- Verify that Word template parsing works

### Priority 2: Demo packaging and documentation
- Prepare the project demo slides
- Record a demo video
- Polish the README deployment docs

### Priority 3: Optimization
- Improve response times
- Harden error handling
- Add more test cases

---

@@ -215,29 +211,32 @@ docs/test/

2. **Database**: database storage is not mandatory and can be skipped
3. **Deployment**: local deployment is sufficient; no public server required
4. **Evaluation data**: the preliminary round uses only the data provided so far
5. **RAG**: temporarily disabled; does not affect the core evaluation features (direct file reading is used instead)

---

*Document version: v1.5*
*Last updated: 2026-04-09*

---

## VIII. Technical Implementation Details

### 8.1 Template Filling Flow

#### Flow diagram
```
┌────────────────┐     ┌────────────────┐     ┌────────────────┐
│ Upload template│ ──► │ Select sources │ ──► │   Smart fill   │
└────────────────┘     └────────────────┘     └────────────────┘
                                                      │
        ┌─────────────────────────────────────────────┼────────────────────────┐
        │                                             │                        │
        ▼                                             ▼                        ▼
┌───────────────────────┐             ┌──────────────────────┐      ┌────────────────┐
│ Structured extraction │             │    LLM extraction    │      │ Export result  │
│ (read rows directly)  │             │ (text understanding) │      │  (Excel/Word)  │
└───────────────────────┘             └──────────────────────┘      └────────────────┘
```

#### Core components

@@ -247,8 +246,10 @@ docs/test/

| Template upload | `templates.py` `/templates/upload` | Receives the template file, extracts fields |
| Field extraction | `template_fill_service.py` | Extracts field definitions from Word/Excel tables |
| Document parsing | `docx_parser.py`, `xlsx_parser.py`, `txt_parser.py` | Parses source document content |
| Smart filling | `template_fill_service.py` `fill_template()` | Structured extraction + LLM extraction |
| Multi-row support | `template_fill_service.py` `FillResult` | `values` array support |
| JSON tolerance | `template_fill_service.py` `_fix_json()` | Repairs corrupted JSON |
| Result export | `templates.py` `/templates/export` | Multi-row Excel + Word export |

### 8.2 Source Document Loading

@@ -268,7 +269,9 @@ docs/test/

```python
# Extract table template fields
from docx_parser import DocxParser
parser = DocxParser()
fields = parser.extract_template_fields_from_docx(file_path)

# Return format
# [
```

@@ -295,6 +298,24 @@ fields = docx_parser.extract_template_fields_from_docx(file_path)

### 8.5 API Endpoints

#### POST `/api/v1/templates/upload`

Upload a template file and extract its field definitions.

**Response**:
```json
{
  "success": true,
  "template_id": "/path/to/saved/template.docx",
  "filename": "模板.docx",
  "file_type": "docx",
  "fields": [
    {"cell": "A1", "name": "姓名", "field_type": "text", "required": true, "hint": "提取人员姓名"}
  ],
  "field_count": 1
}
```

#### POST `/api/v1/templates/fill`

Fill request:

@@ -306,35 +327,232 @@ fields = docx_parser.extract_template_fields_from_docx(file_path)

```json
  ],
  "source_doc_ids": ["mongodb_doc_id_1", "mongodb_doc_id_2"],
  "source_file_paths": [],
  "user_hint": "请从xxx文档中提取"
}
```

**Response (with multi-row support)**:
```json
{
  "success": true,
  "filled_data": {
    "姓名": ["张三", "李四", "王五"],
    "年龄": ["25", "30", "28"]
  },
  "fill_details": [
    {
      "field": "姓名",
      "cell": "A1",
      "values": ["张三", "李四", "王五"],
      "value": "张三",
      "source": "结构化数据直接提取",
      "confidence": 1.0
    }
  ],
  "source_doc_count": 2,
  "max_rows": 3
}
```

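As a sketch, a client call to this endpoint could be built like this. Note that the opening key of the request example is cut off by the diff hunk, so `template_fields` is an assumed key name; the URL assumes the local dev server from the startup section.

```python
def build_fill_request(template_fields, source_doc_ids, user_hint=""):
    """Build the body for POST /api/v1/templates/fill.

    "template_fields" is an assumed key name: the start of the request
    example above is elided by the hunk, so only the other keys are known.
    """
    return {
        "template_fields": template_fields,
        "source_doc_ids": source_doc_ids,
        "source_file_paths": [],
        "user_hint": user_hint,
    }

body = build_fill_request(
    [{"cell": "A1", "name": "姓名", "hint": "提取人员姓名"}],
    ["mongodb_doc_id_1", "mongodb_doc_id_2"],
    user_hint="请从xxx文档中提取",
)

# Sending it with httpx (listed in the dependency section):
# import httpx
# resp = httpx.post("http://127.0.0.1:8000/api/v1/templates/fill", json=body)
# filled = resp.json()["filled_data"]
```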
#### POST `/api/v1/templates/export`

Export request (creates a new file):
```json
{
  "template_id": "模板ID",
  "filled_data": {"姓名": ["张三", "李四"], "金额": ["10000", "20000"]},
  "format": "xlsx"
}
```

#### POST `/api/v1/templates/fill-and-export`

**Fill the original template and export it** (recommended for the competition)

Opens the original template file directly, writes the data into its tables/cells, then exports the result. **The original template formatting is preserved.**

**Request**:
```json
{
  "template_path": "/path/to/original/template.docx",
  "filled_data": {
    "姓名": ["张三", "李四", "王五"],
    "年龄": ["25", "30", "28"]
  },
  "format": "docx"
}
```

**Response**: the filled Word/Excel file (as a file stream)

**Highlights**:
- Opens the original template file
- Matches field names to column indexes via the header row
- Writes values into the matching column's cells
- Multi-row data automatically extends the table
- Preserves the original template's formatting and styles

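The header matching and row expansion above can be sketched on a plain list-of-lists table. This is a simplification under stated assumptions: the real implementation works on python-docx/openpyxl objects and keeps cell styling, whereas here a table is just rows of strings with row 0 as the header.

```python
from typing import Dict, List

def fill_table(table: List[List[str]],
               filled_data: Dict[str, List[str]]) -> List[List[str]]:
    """Fill a header-row table in place, extending it to fit multi-row data."""
    header = table[0]
    # Map each field name to its column index via the header row.
    col_of = {name: header.index(name) for name in filled_data if name in header}
    max_rows = max((len(v) for v in filled_data.values()), default=0)
    # Extend the table with empty rows until every value list fits.
    while len(table) - 1 < max_rows:
        table.append([""] * len(header))
    for name, col in col_of.items():
        for i, value in enumerate(filled_data[name]):
            table[1 + i][col] = value
    return table
```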
#### POST `/api/v1/templates/parse-word-structure`

**Upload a Word document and extract structured fields** (competition-specific)

Uploads a Word document, extracts the field definitions (field name, hint, field type) from its table template, and stores them in MongoDB.

**Request**: multipart/form-data
- file: the Word file

**Response**:
```json
{
  "success": true,
  "doc_id": "mongodb_doc_id",
  "filename": "模板.docx",
  "file_path": "/path/to/saved/template.docx",
  "field_count": 5,
  "fields": [
    {
      "cell": "T0R1",
      "name": "字段名",
      "hint": "提示词",
      "field_type": "text",
      "required": true
    }
  ],
  "tables": [...],
  "metadata": {
    "paragraph_count": 10,
    "table_count": 1,
    "word_count": 500,
    "has_tables": true
  }
}
```

#### GET `/api/v1/templates/word-fields/{doc_id}`

**Fetch a Word document's template field info**

Returns the template field info of a previously uploaded Word document, looked up by doc_id.

**Response**:
```json
{
  "success": true,
  "doc_id": "mongodb_doc_id",
  "filename": "模板.docx",
  "fields": [...],
  "tables": [...],
  "field_count": 5,
  "metadata": {...}
}
```

### 8.6 Multi-Row Data Handling

**FillResult data structure**:
```python
@dataclass
class FillResult:
    field: str
    values: List[Any] = None   # supports multiple values (array)
    value: Any = ""            # kept for compatibility (first value)
    source: str = ""           # source document
    confidence: float = 1.0    # confidence score
```

**Export logic**:
- Compute the maximum row count across all fields
- For each row, take the value at that index from each field
- Pad missing rows with empty strings

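The export logic above can be sketched as follows (a simplification: the real exporter writes into Excel/Word cells rather than returning lists of rows):

```python
from typing import Dict, List

def expand_rows(filled_data: Dict[str, List[str]]) -> List[List[str]]:
    """Turn per-field value lists into output rows, padded with ''. """
    fields = list(filled_data)
    max_rows = max((len(v) for v in filled_data.values()), default=0)
    rows = []
    for i in range(max_rows):
        # Take the i-th value of each field; pad short fields with "".
        rows.append([filled_data[f][i] if i < len(filled_data[f]) else ""
                     for f in fields])
    return rows
```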
### 8.7 JSON Fault Tolerance

When the LLM returns corrupted or truncated JSON, the system:

1. Strips markdown code-block markers (```json, ```)
2. Matches brackets to find the largest complete JSON prefix
3. Removes trailing commas
4. Extracts the `values` arrays with regular expressions
5. Falls back to extracting all quoted strings directly

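A minimal sketch of this repair sequence is shown below. It is illustrative only: the real `_fix_json()` lives in `template_fill_service.py`, and step 4 (regex extraction of `values` arrays) is folded into the final quoted-string fallback here.

```python
import json
import re

def repair_llm_json(text: str):
    """Best-effort repair of LLM JSON output (sketch of the steps above)."""
    # 1. Strip markdown code-fence markers.
    text = re.sub(r"```(?:json)?", "", text).strip()
    # 2. Keep the largest prefix with balanced braces/brackets.
    depth, end = 0, -1
    for i, ch in enumerate(text):
        if ch in "{[":
            depth += 1
        elif ch in "}]":
            depth -= 1
            if depth == 0:
                end = i
    if end != -1:
        text = text[: end + 1]
    # 3. Remove trailing commas before a closing brace/bracket.
    text = re.sub(r",\s*([}\]])", r"\1", text)
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        # 5. Fallback: extract all quoted strings.
        return re.findall(r'"([^"]*)"', text)
```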
### 8.8 Structured-Data-First Extraction

For documents with a `rows` structure (e.g. Excel), the system:

1. Looks for a matching column directly in `structured_data.rows`
2. Uses fuzzy matching (the field name contains, or is contained in, the column name)
3. Extracts every row value in that column
4. Skips the LLM entirely, which is faster and more accurate

```python
# Internal logic
if structured.get("rows"):
    columns = structured.get("columns", [])
    values = _extract_column_values(rows, columns, field_name)
```

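A sketch of the fuzzy column matching follows. The real `_extract_column_values` is internal to the service; this version assumes rows are dicts keyed by column name.

```python
from typing import Any, Dict, List

def extract_column_values(rows: List[Dict[str, Any]], columns: List[str],
                          field_name: str) -> List[str]:
    """Fuzzy-match a template field to a column and collect its values."""
    # Fuzzy match: the field name contains the column name, or vice versa.
    match = next((c for c in columns
                  if field_name in c or c in field_name), None)
    if match is None:
        return []
    # Collect the matched column's value from every row, skipping blanks.
    return [str(r[match]) for r in rows if r.get(match) not in (None, "")]
```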
---

## IX. Dependencies

### Python dependencies

```
# requirements.txt must include
fastapi>=0.104.0
uvicorn>=0.24.0
motor>=3.3.0          # async MongoDB driver
sqlalchemy>=2.0.0     # MySQL ORM
pandas>=2.0.0         # Excel handling
openpyxl>=3.1.0       # Excel writing
python-docx>=0.8.0    # Word handling
chardet>=4.0.0        # encoding detection
httpx>=0.25.0         # HTTP client
```

### Front-end dependencies

```
# package.json must include
react>=18.0.0
react-dropzone>=14.0.0
lucide-react>=0.300.0
sonner>=1.0.0         # toast notifications
```

---

## X. Getting Started

### Back end

```bash
cd backend
.\venv\Scripts\Activate.ps1             # or Activate.bat
pip install -r requirements.txt         # make sure dependencies are complete
.\venv\Scripts\python.exe -m uvicorn app.main:app --host 127.0.0.1 --port 8000 --reload
```

### Front end

```bash
cd frontend
npm install
npm run dev
```

### Environment variables

Configure in `backend/.env`:
```
MONGODB_URL=mongodb://localhost:27017
MONGODB_DB_NAME=document_system
MYSQL_HOST=localhost
MYSQL_PORT=3306
MYSQL_USER=root
MYSQL_PASSWORD=your_password
MYSQL_DATABASE=document_system
LLM_API_KEY=your_api_key
LLM_BASE_URL=https://api.minimax.chat
LLM_MODEL_NAME=MiniMax-Text-01
```