Compare commits: faff1a5977...main (33 commits)
| Author | SHA1 | Date |
|---|---|---|
| | 47c89d888f | |
| | 6701df613b | |
| | ecad9ccd82 | |
| | 51350e3002 | |
| | 8e713be1ca | |
| | f2af27245d | |
| | a9dc0d8b91 | |
| | 902c28166b | |
| | 4a53be7eeb | |
| | 8b5b24fa2a | |
| | ed66aa346d | |
| | 5b82d40be0 | |
| | bedf1af9c0 | |
| | 5fca4eb094 | |
| | 0dbf74db9d | |
| | 858b594171 | |
| | ed0f51f2a4 | |
| | ecc0c79475 | |
| | 6befc510d8 | |
| | 8f66c235fa | |
| | 886d5ae0cc | |
| | 6752c5c231 | |
| | 610d475ce0 | |
| | 496b96508d | |
| | 07ebdc09bc | |
| | 7f67fa89de | |
| | c1886fb68f | |
| | 78417c898a | |
| | d5df5b8283 | |
| | 718f864926 | |
| | e5711b3f05 | |
| | df35105d16 | |
| | 2c2ab56d2d | |
24
.gitignore
vendored
@@ -1,4 +1,5 @@
|
|||||||
/.git/
|
/.git/
|
||||||
|
/.gitignore
|
||||||
/.idea/
|
/.idea/
|
||||||
/.vscode/
|
/.vscode/
|
||||||
/backend/venv/
|
/backend/venv/
|
||||||
@@ -18,11 +19,7 @@
|
|||||||
/frontend/.idea/
|
/frontend/.idea/
|
||||||
/frontend/.env
|
/frontend/.env
|
||||||
/frontend/*.log
|
/frontend/*.log
|
||||||
/技术路线.md
|
|
||||||
/开发路径.md
|
|
||||||
/开发日志_2026-03-16.md
|
|
||||||
/frontendTest/
|
|
||||||
/docs/
|
|
||||||
/frontend/src/api/
|
/frontend/src/api/
|
||||||
/frontend/src/api/index.js
|
/frontend/src/api/index.js
|
||||||
/frontend/src/api/index.ts
|
/frontend/src/api/index.ts
|
||||||
@@ -30,9 +27,22 @@
|
|||||||
/frontend/src/api/index.py
|
/frontend/src/api/index.py
|
||||||
/frontend/src/api/index.go
|
/frontend/src/api/index.go
|
||||||
/frontend/src/api/index.java
|
/frontend/src/api/index.java
|
||||||
|
|
||||||
|
/frontend - 副本/
|
||||||
|
|
||||||
/docs/
|
/docs/
|
||||||
/frontend - 副本/*
|
/frontendTest/
|
||||||
/supabase.txt
|
/supabase.txt
|
||||||
|
|
||||||
**/__pycache__/*
|
# 取消跟踪的文件 / Untracked files
|
||||||
|
比赛备赛规划.md
|
||||||
|
Q&A.xlsx
|
||||||
|
package.json
|
||||||
|
技术路线.md
|
||||||
|
开发路径.md
|
||||||
|
开发日志_2026-03-16.md
|
||||||
|
/logs/
|
||||||
|
|
||||||
|
# Python cache
|
||||||
|
**/__pycache__/**
|
||||||
**.pyc
|
**.pyc
|
||||||
|
|||||||
238
README.md
Normal file
@@ -0,0 +1,238 @@
# FilesReadSystem

## Project Introduction

A document understanding and multi-source data fusion system built on large language models (LLMs), developed for the 17th China University Student Service Outsourcing Innovation and Entrepreneurship Competition (Topic A23). The system uses LLMs to parse and analyze a range of document formats, extracts structured data from them, and fills template tables automatically from natural-language instructions.

---

## Technology Stack

| Layer | Component | Description |
|:---|:---|:---|
| Backend | FastAPI + Uvicorn | RESTful API and async task scheduling |
| Frontend | React + TypeScript + Vite | File upload, table configuration, chat UI |
| Async Tasks | Celery + Redis | Long-running parsing and AI extraction |
| Document DB | MongoDB (Motor) | Metadata, extraction results, document chunk storage |
| Relational DB | MySQL (SQLAlchemy) | Structured data storage |
| Cache | Redis | Caching and task queue |
| Vector Search | FAISS | Efficient similarity search |
| AI Integration | LangChain-style + MiniMax API | RAG pipeline, prompt management |
| Document Parsing | python-docx, pandas, openpyxl, markdown-it | Multi-format support |

---

## Project Architecture

```
┌─────────────────────────────────────────────────────────────────┐
│                         User Interface                          │
│                (React + TypeScript + shadcn/ui)                 │
└─────────────────────────────────────────────────────────────────┘
                                 │
                                 ▼
┌─────────────────────────────────────────────────────────────────┐
│                         FastAPI Backend                         │
│  ┌─────────────┐  ┌──────────────┐  ┌─────────────────────────┐ │
│  │ Upload API  │  │  RAG Search  │  │    Natural Language     │ │
│  │ /documents  │  │ /rag/search  │  │  /instruction/execute   │ │
│  └─────────────┘  └──────────────┘  └─────────────────────────┘ │
│  ┌─────────────┐  ┌──────────────┐  ┌─────────────────────────┐ │
│  │ AI Analyze  │  │ Template Fill│  │      Visualization      │ │
│  │ /ai/analyze │  │ /templates   │  │     /visualization      │ │
│  └─────────────┘  └──────────────┘  └─────────────────────────┘ │
└─────────────────────────────────────────────────────────────────┘
                                 │
           ┌─────────────────────┼─────────────────────┐
           ▼                     ▼                     ▼
  ┌─────────────────┐   ┌─────────────────┐   ┌─────────────────┐
  │     MongoDB     │   │      MySQL      │   │      Redis      │
  │   (Documents)   │   │  (Structured)   │   │  (Cache/Queue)  │
  └─────────────────┘   └─────────────────┘   └─────────────────┘
                                 │
                                 ▼
                        ┌─────────────────┐
                        │      FAISS      │
                        │ (Vector Index)  │
                        └─────────────────┘
```

---

## Directory Structure

```
FilesReadSystem/
├── backend/                              # Backend service (Python + FastAPI)
│   ├── app/
│   │   ├── api/endpoints/                # API endpoints
│   │   │   ├── ai_analyze.py             # AI analysis
│   │   │   ├── documents.py              # Document management
│   │   │   ├── instruction.py            # Natural language instruction
│   │   │   ├── rag.py                    # RAG retrieval
│   │   │   ├── tasks.py                  # Task management
│   │   │   ├── templates.py              # Template management
│   │   │   ├── upload.py                 # File upload
│   │   │   └── visualization.py          # Visualization
│   │   ├── core/
│   │   │   ├── database/                 # Database connections
│   │   │   └── document_parser/          # Document parsers
│   │   ├── services/                     # Business logic services
│   │   │   ├── llm_service.py            # LLM service
│   │   │   ├── rag_service.py            # RAG pipeline
│   │   │   ├── template_fill_service.py  # Template filling
│   │   │   ├── excel_ai_service.py       # Excel AI analysis
│   │   │   ├── word_ai_service.py        # Word AI analysis
│   │   │   └── table_rag_service.py      # Table RAG
│   │   └── instruction/                  # Instruction parsing & execution
│   ├── requirements.txt                  # Python dependencies
│   └── README.md
│
├── frontend/                             # Frontend project (React + TypeScript)
│   ├── src/
│   │   ├── pages/                        # Page components
│   │   │   ├── Dashboard.tsx             # Dashboard
│   │   │   ├── Documents.tsx             # Document management
│   │   │   ├── TemplateFill.tsx          # Template fill
│   │   │   └── InstructionChat.tsx       # Instruction chat
│   │   ├── components/ui/                # shadcn/ui components
│   │   ├── contexts/                     # React contexts
│   │   ├── db/                           # API call wrappers
│   │   └── supabase/functions/           # Edge functions
│   ├── package.json
│   └── README.md
│
├── docs/                                 # Documentation & test data
├── logs/                                 # Application logs
└── README.md                             # This file
```

---

## Key Features

- **Multi-format Document Parsing**
  - Excel (.xlsx)
  - Word (.docx)
  - Markdown (.md)
  - Plain Text (.txt)

- **AI-Powered Analysis**
  - Document content understanding and summarization
  - Automatic extraction of table data
  - Joint reasoning across multiple documents

- **RAG (Retrieval-Augmented Generation)**
  - Semantic vector similarity search
  - Context-aware answer generation

- **Template Auto-fill**
  - Intelligent recognition of table templates
  - Filling driven by natural-language instructions
  - Batch data import and export

- **Natural Language Instructions**
  - Intent recognition and parsing
  - Automatic execution of multi-step tasks

---

## API Endpoints

| Method | Path | Description |
|:---|:---|:---|
| GET | `/health` | Health check |
| POST | `/upload/document` | Single file upload |
| POST | `/upload/documents` | Batch upload |
| GET | `/documents` | Document library |
| GET | `/tasks/{task_id}` | Task status |
| POST | `/rag/search` | RAG semantic search |
| POST | `/templates/upload` | Template upload |
| POST | `/templates/fill` | Execute template fill |
| POST | `/ai/analyze/excel` | Excel AI analysis |
| POST | `/ai/analyze/word` | Word AI analysis |
| POST | `/instruction/recognize` | Intent recognition |
| POST | `/instruction/execute` | Execute instruction |
| GET | `/visualization/statistics` | Statistics charts |

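
As a rough illustration of how a client might combine the upload and task-status endpoints, the sketch below builds request URLs and decides when polling can stop. The base URL mirrors the uvicorn command in this README. The lack of a route prefix, the `status` field name, and the helper names `endpoint`, `task_status_url`, and `is_finished` are assumptions, since the response schema of `/tasks/{task_id}` is not shown here.

```python
from urllib.parse import quote

# Assumption: the backend listens where the README's uvicorn command binds it,
# and the routes are mounted without an extra prefix.
BASE_URL = "http://127.0.0.1:8000"

# Assumption: these are the terminal task states reported by the API.
TERMINAL_STATES = {"success", "failure"}


def endpoint(path: str, base_url: str = BASE_URL) -> str:
    """Join the base URL and an API path without doubling slashes."""
    return base_url.rstrip("/") + "/" + path.lstrip("/")


def task_status_url(task_id: str, base_url: str = BASE_URL) -> str:
    """URL for GET /tasks/{task_id}; the ID is percent-encoded defensively."""
    return endpoint(f"/tasks/{quote(task_id, safe='')}", base_url)


def is_finished(payload: dict) -> bool:
    """True once a (hypothetical) status payload reports a terminal state."""
    return payload.get("status") in TERMINAL_STATES
```

A client would typically POST a file to `endpoint("/upload/document")`, read the task ID from the response, and poll `task_status_url(task_id)` until `is_finished` returns true.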

---

## Environment Setup

### Backend

```bash
cd backend

# Create virtual environment
python -m venv venv

# Activate virtual environment
# Windows PowerShell:
.\venv\Scripts\Activate.ps1
# Windows CMD:
.\venv\Scripts\Activate.bat

# Install dependencies
pip install -r requirements.txt

# Copy environment template
copy .env.example .env
# Edit .env with your API keys
```

### Frontend

```bash
cd frontend

# Install dependencies
npm install

# Or using pnpm
pnpm install
```

---

## Starting the Project

### Backend Startup

```bash
cd backend
./venv/Scripts/python.exe -m uvicorn app.main:app --host 127.0.0.1 --port 8000 --reload
```

### Frontend Startup

```bash
cd frontend
npm run dev
# or
pnpm dev
```

Frontend URL: http://localhost:5173

---

## Configuration

### Environment Variables

| Variable | Description |
|:---|:---|
| `MONGODB_URL` | MongoDB connection URL |
| `MYSQL_HOST` | MySQL host |
| `REDIS_URL` | Redis connection URL |
| `MINIMAX_API_KEY` | MiniMax API key |
| `MINIMAX_API_URL` | MiniMax API URL |

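
One way the backend could read these variables at startup is sketched below. The `Settings` dataclass and `load_settings` helper are hypothetical names for illustration, and treating all five variables as required is an assumption; the project's actual configuration module is not shown in this diff.

```python
import os
from dataclasses import dataclass
from typing import Mapping

# Variable names come from the table above; treating all as required is an assumption.
REQUIRED_VARS = (
    "MONGODB_URL",
    "MYSQL_HOST",
    "REDIS_URL",
    "MINIMAX_API_KEY",
    "MINIMAX_API_URL",
)


@dataclass(frozen=True)
class Settings:
    mongodb_url: str
    mysql_host: str
    redis_url: str
    minimax_api_key: str
    minimax_api_url: str


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Build Settings from a mapping, failing early if anything is missing."""
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return Settings(**{name.lower(): env[name] for name in REQUIRED_VARS})
```

Failing at startup with the full list of missing names tends to be easier to debug than a connection error raised later by one of the database clients.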

---

## License

ISC
@@ -29,9 +29,14 @@ REDIS_URL="redis://localhost:6379/0"
|
|||||||
|
|
||||||
# ==================== LLM AI 配置 ====================
|
# ==================== LLM AI 配置 ====================
|
||||||
# 大语言模型 API 配置
|
# 大语言模型 API 配置
|
||||||
LLM_API_KEY="your_api_key_here"
|
# 支持 OpenAI 兼容格式 (DeepSeek, 智谱 GLM, 阿里等)
|
||||||
LLM_BASE_URL=""
|
# 智谱 AI (Zhipu AI) GLM 系列:
|
||||||
LLM_MODEL_NAME=""
|
# - 模型: glm-4-flash (快速文本模型), glm-4 (标准), glm-4-plus (高性能)
|
||||||
|
# - API: https://open.bigmodel.cn
|
||||||
|
# - API Key: https://open.bigmodel.cn/usercenter/apikeys
|
||||||
|
LLM_API_KEY="ca79ad9f96524cd5afc3e43ca97f347d.cpiLLx2oyitGvTeU"
|
||||||
|
LLM_BASE_URL="https://open.bigmodel.cn/api/paas/v4"
|
||||||
|
LLM_MODEL_NAME="glm-4v-plus"
|
||||||
|
|
||||||
# ==================== Supabase 配置 ====================
|
# ==================== Supabase 配置 ====================
|
||||||
# Supabase 项目配置
|
# Supabase 项目配置
|
||||||
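
The `LLM_*` variables in this hunk describe an OpenAI-compatible endpoint. A minimal sketch of how a request could be assembled from them follows; it reads the values from the environment so that a real key never needs to live in a tracked file. The `/chat/completions` path and Bearer authorization follow the OpenAI-compatible convention and are assumptions here, as is the helper name `build_chat_request`; the project's `llm_service.py` may do this differently.

```python
import json
import os
from typing import Mapping, Tuple


def build_chat_request(
    prompt: str, env: Mapping[str, str] = os.environ
) -> Tuple[str, dict, bytes]:
    """Return (url, headers, body) for an OpenAI-compatible chat completion call."""
    # Assumption: the provider exposes the conventional /chat/completions route.
    url = env["LLM_BASE_URL"].rstrip("/") + "/chat/completions"
    headers = {
        "Authorization": f"Bearer {env['LLM_API_KEY']}",
        "Content-Type": "application/json",
    }
    body = json.dumps(
        {
            "model": env["LLM_MODEL_NAME"],
            "messages": [{"role": "user", "content": prompt}],
        }
    ).encode("utf-8")
    return url, headers, body
```

The function only builds the request; sending it (for example with `urllib.request` or an HTTP client library) is left out so the sketch stays independent of any particular client.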
|
|||||||
38
backend/=3.0.0
Normal file
@@ -0,0 +1,38 @@
|
|||||||
|
Requirement already satisfied: sentence-transformers in c:\python312\lib\site-packages (2.2.2)
|
||||||
|
Requirement already satisfied: transformers<5.0.0,>=4.6.0 in c:\python312\lib\site-packages (from sentence-transformers) (4.57.6)
|
||||||
|
Requirement already satisfied: tqdm in c:\python312\lib\site-packages (from sentence-transformers) (4.66.1)
|
||||||
|
Requirement already satisfied: torch>=1.6.0 in c:\python312\lib\site-packages (from sentence-transformers) (2.10.0)
|
||||||
|
Requirement already satisfied: torchvision in c:\python312\lib\site-packages (from sentence-transformers) (0.25.0)
|
||||||
|
Requirement already satisfied: numpy in c:\python312\lib\site-packages (from sentence-transformers) (1.26.2)
|
||||||
|
Requirement already satisfied: scikit-learn in c:\python312\lib\site-packages (from sentence-transformers) (1.8.0)
|
||||||
|
Requirement already satisfied: scipy in c:\python312\lib\site-packages (from sentence-transformers) (1.16.3)
|
||||||
|
Requirement already satisfied: nltk in c:\python312\lib\site-packages (from sentence-transformers) (3.9.3)
|
||||||
|
Requirement already satisfied: sentencepiece in c:\python312\lib\site-packages (from sentence-transformers) (0.2.1)
|
||||||
|
Requirement already satisfied: huggingface-hub>=0.4.0 in c:\python312\lib\site-packages (from sentence-transformers) (0.36.2)
|
||||||
|
Requirement already satisfied: filelock in c:\python312\lib\site-packages (from huggingface-hub>=0.4.0->sentence-transformers) (3.25.2)
|
||||||
|
Requirement already satisfied: fsspec>=2023.5.0 in c:\python312\lib\site-packages (from huggingface-hub>=0.4.0->sentence-transformers) (2026.2.0)
|
||||||
|
Requirement already satisfied: packaging>=20.9 in c:\python312\lib\site-packages (from huggingface-hub>=0.4.0->sentence-transformers) (23.2)
|
||||||
|
Requirement already satisfied: pyyaml>=5.1 in c:\python312\lib\site-packages (from huggingface-hub>=0.4.0->sentence-transformers) (6.0.1)
|
||||||
|
Requirement already satisfied: requests in c:\python312\lib\site-packages (from huggingface-hub>=0.4.0->sentence-transformers) (2.31.0)
|
||||||
|
Requirement already satisfied: typing-extensions>=3.7.4.3 in c:\python312\lib\site-packages (from huggingface-hub>=0.4.0->sentence-transformers) (4.15.0)
|
||||||
|
Requirement already satisfied: sympy>=1.13.3 in c:\python312\lib\site-packages (from torch>=1.6.0->sentence-transformers) (1.14.0)
|
||||||
|
Requirement already satisfied: networkx>=2.5.1 in c:\python312\lib\site-packages (from torch>=1.6.0->sentence-transformers) (3.6.1)
|
||||||
|
Requirement already satisfied: jinja2 in c:\python312\lib\site-packages (from torch>=1.6.0->sentence-transformers) (3.1.6)
|
||||||
|
Requirement already satisfied: setuptools in c:\python312\lib\site-packages (from torch>=1.6.0->sentence-transformers) (82.0.1)
|
||||||
|
Requirement already satisfied: colorama in c:\python312\lib\site-packages (from tqdm->sentence-transformers) (0.4.6)
|
||||||
|
Requirement already satisfied: regex!=2019.12.17 in c:\python312\lib\site-packages (from transformers<5.0.0,>=4.6.0->sentence-transformers) (2026.2.28)
|
||||||
|
Requirement already satisfied: tokenizers<=0.23.0,>=0.22.0 in c:\python312\lib\site-packages (from transformers<5.0.0,>=4.6.0->sentence-transformers) (0.22.2)
|
||||||
|
Requirement already satisfied: safetensors>=0.4.3 in c:\python312\lib\site-packages (from transformers<5.0.0,>=4.6.0->sentence-transformers) (0.7.0)
|
||||||
|
Requirement already satisfied: click in c:\python312\lib\site-packages (from nltk->sentence-transformers) (8.3.1)
|
||||||
|
Requirement already satisfied: joblib in c:\python312\lib\site-packages (from nltk->sentence-transformers) (1.5.3)
|
||||||
|
Requirement already satisfied: threadpoolctl>=3.2.0 in c:\python312\lib\site-packages (from scikit-learn->sentence-transformers) (3.6.0)
|
||||||
|
Requirement already satisfied: pillow!=8.3.*,>=5.3.0 in c:\python312\lib\site-packages (from torchvision->sentence-transformers) (12.1.1)
|
||||||
|
Requirement already satisfied: mpmath<1.4,>=1.1.0 in c:\python312\lib\site-packages (from sympy>=1.13.3->torch>=1.6.0->sentence-transformers) (1.3.0)
|
||||||
|
Requirement already satisfied: MarkupSafe>=2.0 in c:\python312\lib\site-packages (from jinja2->torch>=1.6.0->sentence-transformers) (3.0.3)
|
||||||
|
Requirement already satisfied: charset-normalizer<4,>=2 in c:\python312\lib\site-packages (from requests->huggingface-hub>=0.4.0->sentence-transformers) (3.4.6)
|
||||||
|
Requirement already satisfied: idna<4,>=2.5 in c:\python312\lib\site-packages (from requests->huggingface-hub>=0.4.0->sentence-transformers) (3.11)
|
||||||
|
Requirement already satisfied: urllib3<3,>=1.21.1 in c:\python312\lib\site-packages (from requests->huggingface-hub>=0.4.0->sentence-transformers) (2.6.3)
|
||||||
|
Requirement already satisfied: certifi>=2017.4.17 in c:\python312\lib\site-packages (from requests->huggingface-hub>=0.4.0->sentence-transformers) (2026.2.25)
|
||||||
|
|
||||||
|
[notice] A new release of pip is available: 24.2 -> 26.0.1
|
||||||
|
[notice] To update, run: python.exe -m pip install --upgrade pip
|
||||||
@@ -13,6 +13,7 @@ from app.api.endpoints import (
|
|||||||
visualization,
|
visualization,
|
||||||
analysis_charts,
|
analysis_charts,
|
||||||
health,
|
health,
|
||||||
|
instruction, # 智能指令
|
||||||
)
|
)
|
||||||
|
|
||||||
# 创建主路由
|
# 创建主路由
|
||||||
@@ -29,3 +30,4 @@ api_router.include_router(templates.router) # 表格模板
|
|||||||
api_router.include_router(ai_analyze.router) # AI分析
|
api_router.include_router(ai_analyze.router) # AI分析
|
||||||
api_router.include_router(visualization.router) # 可视化
|
api_router.include_router(visualization.router) # 可视化
|
||||||
api_router.include_router(analysis_charts.router) # 分析图表
|
api_router.include_router(analysis_charts.router) # 分析图表
|
||||||
|
api_router.include_router(instruction.router) # 智能指令
|
||||||
|
|||||||
@@ -10,6 +10,8 @@ import os
|
|||||||
|
|
||||||
from app.services.excel_ai_service import excel_ai_service
|
from app.services.excel_ai_service import excel_ai_service
|
||||||
from app.services.markdown_ai_service import markdown_ai_service
|
from app.services.markdown_ai_service import markdown_ai_service
|
||||||
|
from app.services.template_fill_service import template_fill_service
|
||||||
|
from app.services.word_ai_service import word_ai_service
|
||||||
|
|
||||||
logger = logging.getLogger(__name__)
|
logger = logging.getLogger(__name__)
|
||||||
|
|
||||||
@@ -215,9 +217,12 @@ async def analyze_markdown(
|
|||||||
return result
|
return result
|
||||||
|
|
||||||
finally:
|
finally:
|
||||||
# 清理临时文件
|
# 清理临时文件,确保在所有情况下都能清理
|
||||||
if os.path.exists(tmp_path):
|
try:
|
||||||
os.unlink(tmp_path)
|
if tmp_path and os.path.exists(tmp_path):
|
||||||
|
os.unlink(tmp_path)
|
||||||
|
except Exception as cleanup_error:
|
||||||
|
logger.warning(f"临时文件清理失败: {tmp_path}, error: {cleanup_error}")
|
||||||
|
|
||||||
except HTTPException:
|
except HTTPException:
|
||||||
raise
|
raise
|
||||||
@@ -279,8 +284,12 @@ async def analyze_markdown_stream(
|
|||||||
)
|
)
|
||||||
|
|
||||||
finally:
|
finally:
|
||||||
if os.path.exists(tmp_path):
|
# 清理临时文件,确保在所有情况下都能清理
|
||||||
os.unlink(tmp_path)
|
try:
|
||||||
|
if tmp_path and os.path.exists(tmp_path):
|
||||||
|
os.unlink(tmp_path)
|
||||||
|
except Exception as cleanup_error:
|
||||||
|
logger.warning(f"临时文件清理失败: {tmp_path}, error: {cleanup_error}")
|
||||||
|
|
||||||
except HTTPException:
|
except HTTPException:
|
||||||
raise
|
raise
|
||||||
@@ -289,7 +298,7 @@ async def analyze_markdown_stream(
|
|||||||
raise HTTPException(status_code=500, detail=f"流式分析失败: {str(e)}")
|
raise HTTPException(status_code=500, detail=f"流式分析失败: {str(e)}")
|
||||||
|
|
||||||
|
|
||||||
@router.get("/analyze/md/outline")
|
@router.post("/analyze/md/outline")
|
||||||
async def get_markdown_outline(
|
async def get_markdown_outline(
|
||||||
file: UploadFile = File(...)
|
file: UploadFile = File(...)
|
||||||
):
|
):
|
||||||
@@ -323,9 +332,154 @@ async def get_markdown_outline(
|
|||||||
result = await markdown_ai_service.extract_outline(tmp_path)
|
result = await markdown_ai_service.extract_outline(tmp_path)
|
||||||
return result
|
return result
|
||||||
finally:
|
finally:
|
||||||
if os.path.exists(tmp_path):
|
# 清理临时文件,确保在所有情况下都能清理
|
||||||
os.unlink(tmp_path)
|
try:
|
||||||
|
if tmp_path and os.path.exists(tmp_path):
|
||||||
|
os.unlink(tmp_path)
|
||||||
|
except Exception as cleanup_error:
|
||||||
|
logger.warning(f"临时文件清理失败: {tmp_path}, error: {cleanup_error}")
|
||||||
|
|
||||||
except Exception as e:
|
except Exception as e:
|
||||||
logger.error(f"获取 Markdown 大纲失败: {str(e)}")
|
logger.error(f"获取 Markdown 大纲失败: {str(e)}")
|
||||||
raise HTTPException(status_code=500, detail=f"获取大纲失败: {str(e)}")
|
raise HTTPException(status_code=500, detail=f"获取大纲失败: {str(e)}")
|
||||||
|
|
||||||
|
|
||||||
|
@router.post("/analyze/txt")
|
||||||
|
async def analyze_txt(
|
||||||
|
file: UploadFile = File(...),
|
||||||
|
):
|
||||||
|
"""
|
||||||
|
上传并使用 AI 分析 TXT 文本文件,提取结构化数据
|
||||||
|
|
||||||
|
将非结构化文本转换为结构化表格数据,便于后续填表使用
|
||||||
|
|
||||||
|
Args:
|
||||||
|
file: 上传的 TXT 文件
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
dict: 分析结果,包含结构化表格数据
|
||||||
|
"""
|
||||||
|
if not file.filename:
|
||||||
|
raise HTTPException(status_code=400, detail="文件名为空")
|
||||||
|
|
||||||
|
file_ext = file.filename.split('.')[-1].lower()
|
||||||
|
if file_ext not in ['txt', 'text']:
|
||||||
|
raise HTTPException(
|
||||||
|
status_code=400,
|
||||||
|
detail=f"不支持的文件类型: {file_ext},仅支持 .txt"
|
||||||
|
)
|
||||||
|
|
||||||
|
try:
|
||||||
|
# 读取文件内容
|
||||||
|
content = await file.read()
|
||||||
|
|
||||||
|
# 保存到临时文件
|
||||||
|
with tempfile.NamedTemporaryFile(mode='wb', suffix='.txt', delete=False) as tmp:
|
||||||
|
tmp.write(content)
|
||||||
|
tmp_path = tmp.name
|
||||||
|
|
||||||
|
try:
|
||||||
|
logger.info(f"开始 AI 分析 TXT 文件: {file.filename}")
|
||||||
|
|
||||||
|
# 使用 template_fill_service 的 AI 分析方法
|
||||||
|
result = await template_fill_service.analyze_txt_with_ai(
|
||||||
|
content=content.decode('utf-8', errors='replace'),
|
||||||
|
filename=file.filename
|
||||||
|
)
|
||||||
|
|
||||||
|
if result:
|
||||||
|
logger.info(f"TXT AI 分析成功: {file.filename}")
|
||||||
|
return {
|
||||||
|
"success": True,
|
||||||
|
"filename": file.filename,
|
||||||
|
"structured_data": result
|
||||||
|
}
|
||||||
|
else:
|
||||||
|
logger.warning(f"TXT AI 分析返回空结果: {file.filename}")
|
||||||
|
return {
|
||||||
|
"success": False,
|
||||||
|
"filename": file.filename,
|
||||||
|
"error": "AI 分析未能提取到结构化数据",
|
||||||
|
"structured_data": None
|
||||||
|
}
|
||||||
|
|
||||||
|
finally:
|
||||||
|
# 清理临时文件
|
||||||
|
if os.path.exists(tmp_path):
|
||||||
|
os.unlink(tmp_path)
|
||||||
|
|
||||||
|
except HTTPException:
|
||||||
|
raise
|
||||||
|
except Exception as e:
|
||||||
|
logger.error(f"TXT AI 分析过程中出错: {str(e)}")
|
||||||
|
raise HTTPException(status_code=500, detail=f"分析失败: {str(e)}")
|
||||||
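
The extension check in `analyze_txt` uses `file.filename.split('.')[-1]`, which returns the whole name when there is no dot, so a file literally named `txt` would pass. A small sketch of a stricter check is shown below; `file_extension` and `is_allowed` are hypothetical helpers, not functions from this repository.

```python
from pathlib import PurePath
from typing import Iterable


def file_extension(filename: str) -> str:
    """Lowercased extension without the dot, or '' when the name has none."""
    return PurePath(filename).suffix.lower().lstrip(".")


def is_allowed(filename: str, allowed: Iterable[str]) -> bool:
    """True when the file's extension is one of the allowed values."""
    return file_extension(filename) in set(allowed)
```

With this helper, `is_allowed(file.filename, {"txt", "text"})` would reject extensionless names while still accepting `notes.TXT`.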
|
|
||||||
|
|
||||||
|
# ==================== Word 文档 AI 解析 ====================
|
||||||
|
|
||||||
|
@router.post("/analyze/word")
|
||||||
|
async def analyze_word(
|
||||||
|
file: UploadFile = File(...),
|
||||||
|
user_hint: str = Query("", description="用户提示词,如'请提取表格数据'")
|
||||||
|
):
|
||||||
|
"""
|
||||||
|
使用 AI 解析 Word 文档,提取结构化数据
|
||||||
|
|
||||||
|
适用于从非结构化的 Word 文档中提取表格数据、键值对等信息
|
||||||
|
|
||||||
|
Args:
|
||||||
|
file: 上传的 Word 文件
|
||||||
|
user_hint: 用户提示词
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
dict: 包含结构化数据的解析结果
|
||||||
|
"""
|
||||||
|
if not file.filename:
|
||||||
|
raise HTTPException(status_code=400, detail="文件名为空")
|
||||||
|
|
||||||
|
file_ext = file.filename.split('.')[-1].lower()
|
||||||
|
if file_ext not in ['docx']:
|
||||||
|
raise HTTPException(
|
||||||
|
status_code=400,
|
||||||
|
detail=f"不支持的文件类型: {file_ext},仅支持 .docx"
|
||||||
|
)
|
||||||
|
|
||||||
|
try:
|
||||||
|
# 保存上传的文件
|
||||||
|
content = await file.read()
|
||||||
|
suffix = f".{file_ext}"
|
||||||
|
with tempfile.NamedTemporaryFile(delete=False, suffix=suffix) as tmp:
|
||||||
|
tmp.write(content)
|
||||||
|
tmp_path = tmp.name
|
||||||
|
|
||||||
|
try:
|
||||||
|
# 使用 AI 解析 Word 文档
|
||||||
|
result = await word_ai_service.parse_word_with_ai(
|
||||||
|
file_path=tmp_path,
|
||||||
|
user_hint=user_hint or "请提取文档中的所有结构化数据,包括表格、键值对等"
|
||||||
|
)
|
||||||
|
|
||||||
|
if result.get("success"):
|
||||||
|
return {
|
||||||
|
"success": True,
|
||||||
|
"filename": file.filename,
|
||||||
|
"result": result
|
||||||
|
}
|
||||||
|
else:
|
||||||
|
return {
|
||||||
|
"success": False,
|
||||||
|
"filename": file.filename,
|
||||||
|
"error": result.get("error", "AI 解析失败"),
|
||||||
|
"result": None
|
||||||
|
}
|
||||||
|
|
||||||
|
finally:
|
||||||
|
# 清理临时文件
|
||||||
|
if os.path.exists(tmp_path):
|
||||||
|
os.unlink(tmp_path)
|
||||||
|
|
||||||
|
except HTTPException:
|
||||||
|
raise
|
||||||
|
except Exception as e:
|
||||||
|
logger.error(f"Word AI 分析过程中出错: {str(e)}")
|
||||||
|
raise HTTPException(status_code=500, detail=f"分析失败: {str(e)}")
|
||||||
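
In this file the three Markdown endpoints now wrap temporary-file cleanup in `try/except`, while the new `analyze_txt` and `analyze_word` endpoints still call `os.unlink` unguarded. One possible way to share the guarded pattern is a context manager, sketched below. The name `temp_upload` is hypothetical, and catching `OSError` rather than `Exception` is a choice made for this sketch.

```python
import contextlib
import logging
import os
import tempfile
from typing import Iterator, Optional

logger = logging.getLogger(__name__)


@contextlib.contextmanager
def temp_upload(content: bytes, suffix: str = "") -> Iterator[str]:
    """Write content to a temporary file, yield its path, and always clean up."""
    tmp_path: Optional[str] = None
    try:
        with tempfile.NamedTemporaryFile(mode="wb", suffix=suffix, delete=False) as tmp:
            tmp_path = tmp.name  # record the path first so a failed write is still cleaned up
            tmp.write(content)
        yield tmp_path
    finally:
        try:
            if tmp_path and os.path.exists(tmp_path):
                os.unlink(tmp_path)
        except OSError as cleanup_error:
            # A cleanup failure is logged and must not mask the endpoint's own result or error.
            logger.warning(
                "Temporary file cleanup failed: %s, error: %s", tmp_path, cleanup_error
            )
```

An endpoint body could then read `with temp_upload(content, suffix=".docx") as tmp_path:` and drop its own `finally` block.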
|
|||||||
@@ -23,6 +23,52 @@ logger = logging.getLogger(__name__)
|
|||||||
router = APIRouter(prefix="/upload", tags=["文档上传"])
|
router = APIRouter(prefix="/upload", tags=["文档上传"])
|
||||||
|
|
||||||
|
|
||||||
|
# ==================== 辅助函数 ====================
|
||||||
|
|
||||||
|
async def update_task_status(
|
||||||
|
task_id: str,
|
||||||
|
status: str,
|
||||||
|
progress: int = 0,
|
||||||
|
message: str = "",
|
||||||
|
result: dict = None,
|
||||||
|
error: str = None
|
||||||
|
):
|
||||||
|
"""
|
||||||
|
更新任务状态,同时写入 Redis 和 MongoDB
|
||||||
|
|
||||||
|
Args:
|
||||||
|
task_id: 任务ID
|
||||||
|
status: 状态
|
||||||
|
progress: 进度
|
||||||
|
message: 消息
|
||||||
|
result: 结果
|
||||||
|
error: 错误信息
|
||||||
|
"""
|
||||||
|
meta = {"progress": progress, "message": message}
|
||||||
|
if result:
|
||||||
|
meta["result"] = result
|
||||||
|
if error:
|
||||||
|
meta["error"] = error
|
||||||
|
|
||||||
|
# 尝试写入 Redis
|
||||||
|
try:
|
||||||
|
await redis_db.set_task_status(task_id, status, meta)
|
||||||
|
except Exception as e:
|
||||||
|
logger.warning(f"Redis 任务状态更新失败: {e}")
|
||||||
|
|
||||||
|
# 尝试写入 MongoDB(作为备用)
|
||||||
|
try:
|
||||||
|
await mongodb.update_task(
|
||||||
|
task_id=task_id,
|
||||||
|
status=status,
|
||||||
|
message=message,
|
||||||
|
result=result,
|
||||||
|
error=error
|
||||||
|
)
|
||||||
|
except Exception as e:
|
||||||
|
logger.warning(f"MongoDB 任务状态更新失败: {e}")
|
||||||
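
The helper above writes each status update to Redis and then to MongoDB, logging rather than raising when either store fails. The sketch below shows the same best-effort dual write, plus a matching read that falls back to the second store, using in-memory stand-ins. `InMemoryStore`, `write_status`, and `read_status` are hypothetical names and do not reflect the project's `redis_db` or `mongodb` interfaces. Unlike the `mongodb.update_task` call shown in the diff, this sketch stores `progress` in both places.

```python
import asyncio
import logging
from typing import Any, Dict, Optional

logger = logging.getLogger(__name__)


class InMemoryStore:
    """Hypothetical stand-in for a task store; set fail=True to simulate an outage."""

    def __init__(self, fail: bool = False) -> None:
        self.fail = fail
        self.tasks: Dict[str, Dict[str, Any]] = {}

    async def save(self, task_id: str, record: Dict[str, Any]) -> None:
        if self.fail:
            raise ConnectionError("store unavailable")
        self.tasks[task_id] = record

    async def load(self, task_id: str) -> Optional[Dict[str, Any]]:
        if self.fail:
            raise ConnectionError("store unavailable")
        return self.tasks.get(task_id)


async def write_status(primary, fallback, task_id: str, status: str,
                       progress: int = 0, message: str = "") -> int:
    """Write the record to both stores; return how many writes succeeded."""
    record = {"status": status, "progress": progress, "message": message}
    written = 0
    for name, store in (("primary", primary), ("fallback", fallback)):
        try:
            await store.save(task_id, dict(record))
            written += 1
        except Exception as exc:  # best effort: one failing store must not block the other
            logger.warning("Task status update failed on %s store: %s", name, exc)
    return written


async def read_status(primary, fallback, task_id: str) -> Optional[Dict[str, Any]]:
    """Return the first record found, trying the primary store before the fallback."""
    for store in (primary, fallback):
        try:
            record = await store.load(task_id)
        except Exception:
            continue
        if record is not None:
            return record
    return None
```

Returning the number of successful writes lets a caller notice the case where both stores are down, which the logging-only approach would otherwise hide.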
|
|
||||||
|
|
||||||
# ==================== 请求/响应模型 ====================
|
# ==================== 请求/响应模型 ====================
|
||||||
|
|
||||||
class UploadResponse(BaseModel):
|
class UploadResponse(BaseModel):
|
||||||
@@ -77,6 +123,17 @@ async def upload_document(
|
|||||||
task_id = str(uuid.uuid4())
|
task_id = str(uuid.uuid4())
|
||||||
|
|
||||||
try:
|
try:
|
||||||
|
# 保存任务记录到 MongoDB(如果 Redis 不可用时仍能查询)
|
||||||
|
try:
|
||||||
|
await mongodb.insert_task(
|
||||||
|
task_id=task_id,
|
||||||
|
task_type="document_parse",
|
||||||
|
status="pending",
|
||||||
|
message=f"文档 {file.filename} 已提交处理"
|
||||||
|
)
|
||||||
|
except Exception as mongo_err:
|
||||||
|
logger.warning(f"MongoDB 保存任务记录失败: {mongo_err}")
|
||||||
|
|
||||||
content = await file.read()
|
content = await file.read()
|
||||||
saved_path = file_service.save_uploaded_file(
|
saved_path = file_service.save_uploaded_file(
|
||||||
content,
|
content,
|
||||||
@@ -122,6 +179,17 @@ async def upload_documents(
|
|||||||
saved_paths = []
|
saved_paths = []
|
||||||
|
|
||||||
try:
|
try:
|
||||||
|
# 保存任务记录到 MongoDB
|
||||||
|
try:
|
||||||
|
await mongodb.insert_task(
|
||||||
|
task_id=task_id,
|
||||||
|
task_type="batch_parse",
|
||||||
|
status="pending",
|
||||||
|
message=f"已提交 {len(files)} 个文档处理"
|
||||||
|
)
|
||||||
|
except Exception as mongo_err:
|
||||||
|
logger.warning(f"MongoDB 保存批量任务记录失败: {mongo_err}")
|
||||||
|
|
||||||
for file in files:
|
for file in files:
|
||||||
if not file.filename:
|
if not file.filename:
|
||||||
continue
|
continue
|
||||||
@@ -159,9 +227,9 @@ async def process_document(
|
|||||||
"""处理单个文档"""
|
"""处理单个文档"""
|
||||||
try:
|
try:
|
||||||
# 状态: 解析中
|
# 状态: 解析中
|
||||||
await redis_db.set_task_status(
|
await update_task_status(
|
||||||
task_id, status="processing",
|
task_id, status="processing",
|
||||||
meta={"progress": 10, "message": "正在解析文档"}
|
progress=10, message="正在解析文档"
|
||||||
)
|
)
|
||||||
|
|
||||||
# 解析文档
|
# 解析文档
|
||||||
@@ -172,9 +240,9 @@ async def process_document(
|
|||||||
raise Exception(result.error or "解析失败")
|
raise Exception(result.error or "解析失败")
|
||||||
|
|
||||||
# 状态: 存储中
|
# 状态: 存储中
|
||||||
await redis_db.set_task_status(
|
await update_task_status(
|
||||||
task_id, status="processing",
|
task_id, status="processing",
|
||||||
meta={"progress": 30, "message": "正在存储数据"}
|
progress=30, message="正在存储数据"
|
||||||
)
|
)
|
||||||
|
|
||||||
# 存储到 MongoDB
|
# 存储到 MongoDB
|
||||||
@@ -191,9 +259,9 @@ async def process_document(
|
|||||||
|
|
||||||
# 如果是 Excel,存储到 MySQL + AI生成描述 + RAG索引
|
# 如果是 Excel,存储到 MySQL + AI生成描述 + RAG索引
|
||||||
if doc_type in ["xlsx", "xls"]:
|
if doc_type in ["xlsx", "xls"]:
|
||||||
await redis_db.set_task_status(
|
await update_task_status(
|
||||||
task_id, status="processing",
|
task_id, status="processing",
|
||||||
meta={"progress": 50, "message": "正在存储到MySQL并生成字段描述"}
|
progress=50, message="正在存储到MySQL并生成字段描述"
|
||||||
)
|
)
|
||||||
|
|
||||||
try:
|
try:
|
||||||
@@ -215,9 +283,9 @@ async def process_document(
|
|||||||
|
|
||||||
else:
|
else:
|
||||||
# 非结构化文档
|
# 非结构化文档
|
||||||
await redis_db.set_task_status(
|
await update_task_status(
|
||||||
task_id, status="processing",
|
task_id, status="processing",
|
||||||
meta={"progress": 60, "message": "正在建立索引"}
|
progress=60, message="正在建立索引"
|
||||||
)
|
)
|
||||||
|
|
||||||
# 如果文档中有表格数据,提取并存储到 MySQL + RAG
|
# 如果文档中有表格数据,提取并存储到 MySQL + RAG
|
||||||
@@ -238,17 +306,13 @@ async def process_document(
|
|||||||
await index_document_to_rag(doc_id, original_filename, result, doc_type)
|
await index_document_to_rag(doc_id, original_filename, result, doc_type)
|
||||||
|
|
||||||
# 完成
|
# 完成
|
||||||
await redis_db.set_task_status(
|
await update_task_status(
|
||||||
task_id, status="success",
|
task_id, status="success",
|
||||||
meta={
|
progress=100, message="处理完成",
|
||||||
"progress": 100,
|
result={
|
||||||
"message": "处理完成",
|
|
||||||
"doc_id": doc_id,
|
"doc_id": doc_id,
|
||||||
"result": {
|
"doc_type": doc_type,
|
||||||
"doc_id": doc_id,
|
"filename": original_filename
|
||||||
"doc_type": doc_type,
|
|
||||||
"filename": original_filename
|
|
||||||
}
|
|
||||||
}
|
}
|
||||||
)
|
)
|
||||||
|
|
||||||
@@ -256,18 +320,19 @@ async def process_document(
|
|||||||
|
|
||||||
except Exception as e:
|
except Exception as e:
|
||||||
logger.error(f"文档处理失败: {str(e)}")
|
logger.error(f"文档处理失败: {str(e)}")
|
||||||
await redis_db.set_task_status(
|
await update_task_status(
|
||||||
task_id, status="failure",
|
task_id, status="failure",
|
||||||
meta={"error": str(e)}
|
progress=0, message="处理失败",
|
||||||
|
error=str(e)
|
||||||
)
|
)
|
||||||
|
|
||||||
|
|
||||||
async def process_documents_batch(task_id: str, files: List[dict]):
|
async def process_documents_batch(task_id: str, files: List[dict]):
|
||||||
"""Process documents in batch"""
|
"""Process documents in batch"""
|
||||||
try:
|
try:
|
||||||
await redis_db.set_task_status(
|
await update_task_status(
|
||||||
task_id, status="processing",
|
task_id, status="processing",
|
||||||
meta={"progress": 0, "message": "开始批量处理"}
|
progress=0, message="开始批量处理"
|
||||||
)
|
)
|
||||||
|
|
||||||
results = []
|
results = []
|
||||||
@@ -318,37 +383,43 @@ async def process_documents_batch(task_id: str, files: List[dict]):
|
|||||||
results.append({"filename": file_info["filename"], "success": False, "error": str(e)})
|
results.append({"filename": file_info["filename"], "success": False, "error": str(e)})
|
||||||
|
|
||||||
progress = int((i + 1) / len(files) * 100)
|
progress = int((i + 1) / len(files) * 100)
|
||||||
await redis_db.set_task_status(
|
await update_task_status(
|
||||||
task_id, status="processing",
|
task_id, status="processing",
|
||||||
meta={"progress": progress, "message": f"已处理 {i+1}/{len(files)}"}
|
progress=progress, message=f"已处理 {i+1}/{len(files)}"
|
||||||
)
|
)
|
||||||
|
|
||||||
await redis_db.set_task_status(
|
await update_task_status(
|
||||||
task_id, status="success",
|
task_id, status="success",
|
||||||
meta={"progress": 100, "message": "批量处理完成", "results": results}
|
progress=100, message="批量处理完成",
|
||||||
|
result={"results": results}
|
||||||
)
|
)
|
||||||
|
|
||||||
except Exception as e:
|
except Exception as e:
|
||||||
logger.error(f"批量处理失败: {str(e)}")
|
logger.error(f"批量处理失败: {str(e)}")
|
||||||
await redis_db.set_task_status(
|
await update_task_status(
|
||||||
task_id, status="failure",
|
task_id, status="failure",
|
||||||
meta={"error": str(e)}
|
progress=0, message="批量处理失败",
|
||||||
|
error=str(e)
|
||||||
)
|
)
|
||||||
|
|
||||||
|
|
||||||
async def index_document_to_rag(doc_id: str, filename: str, result: ParseResult, doc_type: str):
|
async def index_document_to_rag(doc_id: str, filename: str, result: ParseResult, doc_type: str):
|
||||||
"""Index an unstructured document into RAG"""
|
"""Index an unstructured document into RAG (chunked indexing)"""
|
||||||
try:
|
try:
|
||||||
content = result.data.get("content", "")
|
content = result.data.get("content", "")
|
||||||
if content:
|
if content:
|
||||||
|
# Pass the full content to the RAG service, which chunks and indexes it automatically
|
||||||
rag_service.index_document_content(
|
rag_service.index_document_content(
|
||||||
doc_id=doc_id,
|
doc_id=doc_id,
|
||||||
content=content[:5000],
|
content=content,  # pass the full content; the RAG service chunks it
|
||||||
metadata={
|
metadata={
|
||||||
"filename": filename,
|
"filename": filename,
|
||||||
"doc_type": doc_type
|
"doc_type": doc_type
|
||||||
}
|
},
|
||||||
|
chunk_size=500,  # 500 characters per chunk
|
||||||
|
chunk_overlap=50  # 50 characters of overlap between chunks
|
||||||
)
|
)
|
||||||
|
logger.info(f"RAG 索引完成: {filename}, doc_id={doc_id}")
|
||||||
except Exception as e:
|
except Exception as e:
|
||||||
logger.warning(f"RAG 索引失败: {str(e)}")
|
logger.warning(f"RAG 索引失败: {str(e)}")
|
||||||
|
|
||||||
|
|||||||
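The new side of this hunk passes the full content with `chunk_size=500` and `chunk_overlap=50` and leaves the splitting to the RAG service. The diff does not show how `rag_service.index_document_content` splits the text, so the following is only a minimal sketch of fixed-size character chunking with overlap. `chunk_text` is a hypothetical helper introduced for illustration, not the service's actual code.

```python
from typing import List


def chunk_text(content: str, chunk_size: int = 500, chunk_overlap: int = 50) -> List[str]:
    """Split text into fixed-size character chunks that overlap by chunk_overlap.

    Hypothetical helper: the real RAG service may chunk differently.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks: List[str] = []
    start = 0
    while start < len(content):
        chunks.append(content[start:start + chunk_size])
        if start + chunk_size >= len(content):
            break  # this window already reached the end of the text
        start += step
    return chunks
```

With the defaults, each chunk after the first starts 450 characters after the previous one, so the last 50 characters of one chunk repeat at the start of the next. This keeps sentences that straddle a boundary retrievable from at least one chunk.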
@@ -19,26 +19,43 @@ async def health_check() -> Dict[str, Any]:
|
|||||||
返回各数据库连接状态和应用信息
|
返回各数据库连接状态和应用信息
|
||||||
"""
|
"""
|
||||||
# Check the connection status of each database
|
# Check the connection status of each database
|
||||||
mysql_status = "connected"
|
mysql_status = "unknown"
|
||||||
mongodb_status = "connected"
|
mongodb_status = "unknown"
|
||||||
redis_status = "connected"
|
redis_status = "unknown"
|
||||||
|
|
||||||
try:
|
try:
|
||||||
if mysql_db.async_engine is None:
|
if mysql_db.async_engine is None:
|
||||||
mysql_status = "disconnected"
|
mysql_status = "disconnected"
|
||||||
except Exception:
|
else:
|
||||||
|
# Run a real query to verify the connection
|
||||||
|
from sqlalchemy import text
|
||||||
|
async with mysql_db.async_engine.connect() as conn:
|
||||||
|
await conn.execute(text("SELECT 1"))
|
||||||
|
mysql_status = "connected"
|
||||||
|
except Exception as e:
|
||||||
|
logger.warning(f"MySQL 健康检查失败: {e}")
|
||||||
mysql_status = "error"
|
mysql_status = "error"
|
||||||
|
|
||||||
try:
|
try:
|
||||||
if mongodb.client is None:
|
if mongodb.client is None:
|
||||||
mongodb_status = "disconnected"
|
mongodb_status = "disconnected"
|
||||||
except Exception:
|
else:
|
||||||
|
# Verify with a real ping
|
||||||
|
await mongodb.client.admin.command('ping')
|
||||||
|
mongodb_status = "connected"
|
||||||
|
except Exception as e:
|
||||||
|
logger.warning(f"MongoDB 健康检查失败: {e}")
|
||||||
mongodb_status = "error"
|
mongodb_status = "error"
|
||||||
|
|
||||||
try:
|
try:
|
||||||
if not redis_db.is_connected:
|
if not redis_db.is_connected or redis_db.client is None:
|
||||||
redis_status = "disconnected"
|
redis_status = "disconnected"
|
||||||
except Exception:
|
else:
|
||||||
|
# Verify with a real ping
|
||||||
|
await redis_db.client.ping()
|
||||||
|
redis_status = "connected"
|
||||||
|
except Exception as e:
|
||||||
|
logger.warning(f"Redis 健康检查失败: {e}")
|
||||||
redis_status = "error"
|
redis_status = "error"
|
||||||
|
|
||||||
return {
|
return {
|
||||||
|
|||||||
439
backend/app/api/endpoints/instruction.py
Normal file
@@ -0,0 +1,439 @@
|
|||||||
|
"""
|
||||||
|
Smart instruction API endpoints
|
||||||
|
|
||||||
|
Parses and executes natural-language instructions
|
||||||
|
"""
|
||||||
|
import logging
|
||||||
|
import uuid
|
||||||
|
from typing import Any, Dict, List, Optional
|
||||||
|
|
||||||
|
from fastapi import APIRouter, HTTPException, Query, BackgroundTasks
|
||||||
|
from pydantic import BaseModel
|
||||||
|
|
||||||
|
from app.instruction.intent_parser import intent_parser
|
||||||
|
from app.instruction.executor import instruction_executor
|
||||||
|
from app.core.database import mongodb
|
||||||
|
|
||||||
|
logger = logging.getLogger(__name__)
|
||||||
|
|
||||||
|
router = APIRouter(prefix="/instruction", tags=["智能指令"])
|
||||||
|
|
||||||
|
|
||||||
|
# ==================== Request/response models ====================
|
||||||
|
|
||||||
|
class InstructionRequest(BaseModel):
|
||||||
|
instruction: str
|
||||||
|
doc_ids: Optional[List[str]] = None  # IDs of the associated documents
|
||||||
|
context: Optional[Dict[str, Any]] = None  # extra context
|
||||||
|
|
||||||
|
|
||||||
|
class IntentRecognitionResponse(BaseModel):
|
||||||
|
success: bool
|
||||||
|
intent: str
|
||||||
|
params: Dict[str, Any]
|
||||||
|
message: str
|
||||||
|
|
||||||
|
|
||||||
|
class InstructionExecutionResponse(BaseModel):
|
||||||
|
success: bool
|
||||||
|
intent: str
|
||||||
|
result: Dict[str, Any]
|
||||||
|
message: str
|
||||||
|
|
||||||
|
|
||||||
|
# ==================== Endpoints ====================
|
||||||
|
|
||||||
|
@router.post("/recognize", response_model=IntentRecognitionResponse)
|
||||||
|
async def recognize_intent(request: InstructionRequest):
|
||||||
|
"""
|
||||||
|
意图识别接口
|
||||||
|
|
||||||
|
将自然语言指令解析为结构化的意图和参数
|
||||||
|
|
||||||
|
示例指令:
|
||||||
|
- "提取文档中的医院数量和床位数"
|
||||||
|
- "根据这些数据填表"
|
||||||
|
- "总结一下这份文档"
|
||||||
|
- "对比这两个文档的差异"
|
||||||
|
"""
|
||||||
|
try:
|
||||||
|
intent, params = await intent_parser.parse(request.instruction)
|
||||||
|
|
||||||
|
# Attach the associated document references
|
||||||
|
if request.doc_ids:
|
||||||
|
params["document_refs"] = [f"doc_{doc_id}" for doc_id in request.doc_ids]
|
||||||
|
|
||||||
|
intent_names = {
|
||||||
|
"extract": "信息提取",
|
||||||
|
"fill_table": "表格填写",
|
||||||
|
"summarize": "摘要总结",
|
||||||
|
"question": "智能问答",
|
||||||
|
"search": "文档搜索",
|
||||||
|
"compare": "对比分析",
|
||||||
|
"transform": "格式转换",
|
||||||
|
"edit": "文档编辑",
|
||||||
|
"unknown": "未知"
|
||||||
|
}
|
||||||
|
|
||||||
|
return IntentRecognitionResponse(
|
||||||
|
success=True,
|
||||||
|
intent=intent,
|
||||||
|
params=params,
|
||||||
|
message=f"识别到意图: {intent_names.get(intent, intent)}"
|
||||||
|
)
|
||||||
|
|
||||||
|
except Exception as e:
|
||||||
|
logger.error(f"意图识别失败: {e}")
|
||||||
|
return IntentRecognitionResponse(
|
||||||
|
success=False,
|
||||||
|
intent="error",
|
||||||
|
params={},
|
||||||
|
message=f"意图识别失败: {str(e)}"
|
||||||
|
)
|
||||||
|
|
||||||
|
|
||||||
|
@router.post("/execute")
|
||||||
|
async def execute_instruction(
|
||||||
|
background_tasks: BackgroundTasks,
|
||||||
|
request: InstructionRequest,
|
||||||
|
async_execute: bool = Query(False, description="是否异步执行(仅返回任务ID)")
|
||||||
|
):
|
||||||
|
"""
|
||||||
|
指令执行接口
|
||||||
|
|
||||||
|
解析并执行自然语言指令
|
||||||
|
|
||||||
|
示例:
|
||||||
|
- 指令: "提取文档1中的医院数量"
|
||||||
|
返回: {"extracted_data": {"医院数量": ["38710个"]}}
|
||||||
|
|
||||||
|
- 指令: "填表"
|
||||||
|
返回: {"filled_data": {...}}
|
||||||
|
|
||||||
|
设置 async_execute=true 可异步执行,返回任务ID用于查询进度
|
||||||
|
"""
|
||||||
|
task_id = str(uuid.uuid4())
|
||||||
|
|
||||||
|
if async_execute:
|
||||||
|
# Async mode: return the task ID immediately and run in the background
|
||||||
|
background_tasks.add_task(
|
||||||
|
_execute_instruction_task,
|
||||||
|
task_id=task_id,
|
||||||
|
instruction=request.instruction,
|
||||||
|
doc_ids=request.doc_ids,
|
||||||
|
context=request.context
|
||||||
|
)
|
||||||
|
|
||||||
|
return {
|
||||||
|
"success": True,
|
||||||
|
"task_id": task_id,
|
||||||
|
"message": "指令已提交执行",
|
||||||
|
"status_url": f"/api/v1/tasks/{task_id}"
|
||||||
|
}
|
||||||
|
|
||||||
|
# Sync mode: wait for execution to finish
|
||||||
|
return await _execute_instruction_task(task_id, request.instruction, request.doc_ids, request.context)
|
||||||
|
|
||||||
|
|
||||||
|
async def _execute_instruction_task(
|
||||||
|
task_id: str,
|
||||||
|
instruction: str,
|
||||||
|
doc_ids: Optional[List[str]],
|
||||||
|
context: Optional[Dict[str, Any]]
|
||||||
|
) -> InstructionExecutionResponse:
|
||||||
|
"""Background task that executes an instruction"""
|
||||||
|
from app.core.database import mongodb as mongo_client  # redis_db was imported here but never used
|
||||||
|
|
||||||
|
try:
|
||||||
|
# Record the task
|
||||||
|
try:
|
||||||
|
await mongo_client.insert_task(
|
||||||
|
task_id=task_id,
|
||||||
|
task_type="instruction_execute",
|
||||||
|
status="processing",
|
||||||
|
message="正在执行指令"
|
||||||
|
)
|
||||||
|
except Exception:
|
||||||
|
pass
|
||||||
|
|
||||||
|
# Build the execution context
|
||||||
|
ctx: Dict[str, Any] = context or {}
|
||||||
|
|
||||||
|
# If document IDs were provided, fetch the document contents
|
||||||
|
if doc_ids:
|
||||||
|
docs = []
|
||||||
|
for doc_id in doc_ids:
|
||||||
|
doc = await mongo_client.get_document(doc_id)
|
||||||
|
if doc:
|
||||||
|
docs.append(doc)
|
||||||
|
|
||||||
|
if docs:
|
||||||
|
ctx["source_docs"] = docs
|
||||||
|
logger.info(f"指令执行上下文: 关联了 {len(docs)} 个文档")
|
||||||
|
|
||||||
|
# Execute the instruction
|
||||||
|
result = await instruction_executor.execute(instruction, ctx)
|
||||||
|
|
||||||
|
# Update the task status
|
||||||
|
try:
|
||||||
|
await mongo_client.update_task(
|
||||||
|
task_id=task_id,
|
||||||
|
status="success",
|
||||||
|
message="执行完成",
|
||||||
|
result=result
|
||||||
|
)
|
||||||
|
except Exception:
|
||||||
|
pass
|
||||||
|
|
||||||
|
return InstructionExecutionResponse(
|
||||||
|
success=result.get("success", False),
|
||||||
|
intent=result.get("intent", "unknown"),
|
||||||
|
result=result,
|
||||||
|
message=result.get("message", "执行完成")
|
||||||
|
)
|
||||||
|
|
||||||
|
except Exception as e:
|
||||||
|
logger.error(f"指令执行失败: {e}")
|
||||||
|
try:
|
||||||
|
await mongo_client.update_task(
|
||||||
|
task_id=task_id,
|
||||||
|
status="failure",
|
||||||
|
message="执行失败",
|
||||||
|
error=str(e)
|
||||||
|
)
|
||||||
|
except Exception:
|
||||||
|
pass
|
||||||
|
|
||||||
|
return InstructionExecutionResponse(
|
||||||
|
success=False,
|
||||||
|
intent="error",
|
||||||
|
result={"error": str(e)},
|
||||||
|
message=f"指令执行失败: {str(e)}"
|
||||||
|
)
|
||||||
|
|
||||||
|
|
||||||
|
@router.post("/chat")
|
||||||
|
async def instruction_chat(
|
||||||
|
background_tasks: BackgroundTasks,
|
||||||
|
request: InstructionRequest,
|
||||||
|
async_execute: bool = Query(False, description="是否异步执行(仅返回任务ID)")
|
||||||
|
):
|
||||||
|
"""
|
||||||
|
指令对话接口
|
||||||
|
|
||||||
|
支持多轮对话的指令执行
|
||||||
|
|
||||||
|
示例对话流程:
|
||||||
|
1. 用户: "上传一些文档"
|
||||||
|
2. 系统: "请上传文档"
|
||||||
|
3. 用户: "提取其中的医院数量"
|
||||||
|
4. 系统: 返回提取结果
|
||||||
|
|
||||||
|
设置 async_execute=true 可异步执行,返回任务ID用于查询进度
|
||||||
|
"""
|
||||||
|
task_id = str(uuid.uuid4())
|
||||||
|
|
||||||
|
if async_execute:
|
||||||
|
# Async mode: return the task ID immediately and run in the background
|
||||||
|
background_tasks.add_task(
|
||||||
|
_execute_chat_task,
|
||||||
|
task_id=task_id,
|
||||||
|
instruction=request.instruction,
|
||||||
|
doc_ids=request.doc_ids,
|
||||||
|
context=request.context
|
||||||
|
)
|
||||||
|
|
||||||
|
return {
|
||||||
|
"success": True,
|
||||||
|
"task_id": task_id,
|
||||||
|
"message": "指令已提交执行",
|
||||||
|
"status_url": f"/api/v1/tasks/{task_id}"
|
||||||
|
}
|
||||||
|
|
||||||
|
# Sync mode: wait for execution to finish
|
||||||
|
return await _execute_chat_task(task_id, request.instruction, request.doc_ids, request.context)
|
||||||
|
|
||||||
|
|
||||||
|
async def _execute_chat_task(
|
||||||
|
task_id: str,
|
||||||
|
instruction: str,
|
||||||
|
doc_ids: Optional[List[str]],
|
||||||
|
context: Optional[Dict[str, Any]]
|
||||||
|
):
|
||||||
|
"""Background task that handles an instruction chat turn"""
|
||||||
|
from app.core.database import mongodb as mongo_client
|
||||||
|
|
||||||
|
try:
|
||||||
|
# Record the task
|
||||||
|
try:
|
||||||
|
await mongo_client.insert_task(
|
||||||
|
task_id=task_id,
|
||||||
|
task_type="instruction_chat",
|
||||||
|
status="processing",
|
||||||
|
message="正在处理对话"
|
||||||
|
)
|
||||||
|
except Exception:
|
||||||
|
pass
|
||||||
|
|
||||||
|
# Build the context
|
||||||
|
ctx: Dict[str, Any] = context or {}
|
||||||
|
|
||||||
|
# Fetch the associated documents
|
||||||
|
if doc_ids:
|
||||||
|
docs = []
|
||||||
|
for doc_id in doc_ids:
|
||||||
|
doc = await mongo_client.get_document(doc_id)
|
||||||
|
if doc:
|
||||||
|
docs.append(doc)
|
||||||
|
if docs:
|
||||||
|
ctx["source_docs"] = docs
|
||||||
|
|
||||||
|
# Execute the instruction
|
||||||
|
result = await instruction_executor.execute(instruction, ctx)
|
||||||
|
|
||||||
|
# Pick a user-friendly response message based on the intent type
|
||||||
|
response_messages = {
|
||||||
|
"extract": f"已提取 {len(result.get('extracted_data', {}))} 个字段的数据",
|
||||||
|
"fill_table": f"填表完成,填写了 {len(result.get('result', {}).get('filled_data', {}))} 个字段",
|
||||||
|
"summarize": "已生成文档摘要",
|
||||||
|
"question": "已找到相关答案",
|
||||||
|
"search": f"找到 {len(result.get('results', []))} 条相关内容",
|
||||||
|
"compare": f"对比了 {len(result.get('comparison', []))} 个文档",
|
||||||
|
"edit": "编辑操作已完成",
|
||||||
|
"transform": "格式转换已完成",
|
||||||
|
"unknown": "无法理解该指令,请尝试更明确的描述"
|
||||||
|
}
|
||||||
|
|
||||||
|
response = {
|
||||||
|
"success": result.get("success", False),
|
||||||
|
"intent": result.get("intent", "unknown"),
|
||||||
|
"result": result,
|
||||||
|
"message": response_messages.get(result.get("intent", ""), result.get("message", "")),
|
||||||
|
"hint": _get_intent_hint(result.get("intent", ""))
|
||||||
|
}
|
||||||
|
|
||||||
|
# Update the task status
|
||||||
|
try:
|
||||||
|
await mongo_client.update_task(
|
||||||
|
task_id=task_id,
|
||||||
|
status="success",
|
||||||
|
message="处理完成",
|
||||||
|
result=response
|
||||||
|
)
|
||||||
|
except Exception:
|
||||||
|
pass
|
||||||
|
|
||||||
|
return response
|
||||||
|
|
||||||
|
except Exception as e:
|
||||||
|
logger.error(f"指令对话失败: {e}")
|
||||||
|
try:
|
||||||
|
await mongo_client.update_task(
|
||||||
|
task_id=task_id,
|
||||||
|
status="failure",
|
||||||
|
message="处理失败",
|
||||||
|
error=str(e)
|
||||||
|
)
|
||||||
|
except Exception:
|
||||||
|
pass
|
||||||
|
|
||||||
|
return {
|
||||||
|
"success": False,
|
||||||
|
"error": str(e),
|
||||||
|
"message": f"处理失败: {str(e)}"
|
||||||
|
}
|
||||||
|
|
||||||
|
|
||||||
|
def _get_intent_hint(intent: str) -> Optional[str]:
|
||||||
|
"""Return a next-step hint for the given intent"""
|
||||||
|
hints = {
|
||||||
|
"extract": "您可以继续说 '提取更多字段' 或 '将数据填入表格'",
|
||||||
|
"fill_table": "您可以提供表格模板或说 '帮我创建一个表格'",
|
||||||
|
"question": "您可以继续提问或说 '总结一下这些内容'",
|
||||||
|
"search": "您可以查看搜索结果或说 '对比这些内容'",
|
||||||
|
"unknown": "您可以尝试: '提取数据'、'填表'、'总结'、'问答' 等指令"
|
||||||
|
}
|
||||||
|
return hints.get(intent)
|
||||||
|
|
||||||
|
|
||||||
|
@router.get("/intents")
|
||||||
|
async def list_supported_intents():
|
||||||
|
"""
|
||||||
|
获取支持的意图类型列表
|
||||||
|
|
||||||
|
返回所有可用的自然语言指令类型
|
||||||
|
"""
|
||||||
|
return {
|
||||||
|
"intents": [
|
||||||
|
{
|
||||||
|
"intent": "extract",
|
||||||
|
"name": "信息提取",
|
||||||
|
"examples": [
|
||||||
|
"提取文档中的医院数量",
|
||||||
|
"抽取所有机构的名称",
|
||||||
|
"找出表格中的数据"
|
||||||
|
],
|
||||||
|
"params": ["field_refs", "document_refs"]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"intent": "fill_table",
|
||||||
|
"name": "表格填写",
|
||||||
|
"examples": [
|
||||||
|
"填表",
|
||||||
|
"根据这些数据填写表格",
|
||||||
|
"帮我填到Excel里"
|
||||||
|
],
|
||||||
|
"params": ["template", "document_refs"]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"intent": "summarize",
|
||||||
|
"name": "摘要总结",
|
||||||
|
"examples": [
|
||||||
|
"总结一下这份文档",
|
||||||
|
"生成摘要",
|
||||||
|
"概括主要内容"
|
||||||
|
],
|
||||||
|
"params": ["document_refs"]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"intent": "question",
|
||||||
|
"name": "智能问答",
|
||||||
|
"examples": [
|
||||||
|
"这段话说的是什么?",
|
||||||
|
"有多少家医院?",
|
||||||
|
"解释一下这个概念"
|
||||||
|
],
|
||||||
|
"params": ["question", "focus"]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"intent": "search",
|
||||||
|
"name": "文档搜索",
|
||||||
|
"examples": [
|
||||||
|
"搜索相关内容",
|
||||||
|
"找找看有哪些机构",
|
||||||
|
"查询医院相关的数据"
|
||||||
|
],
|
||||||
|
"params": ["field_refs", "question"]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"intent": "compare",
|
||||||
|
"name": "对比分析",
|
||||||
|
"examples": [
|
||||||
|
"对比这两个文档",
|
||||||
|
"比较一下差异",
|
||||||
|
"找出不同点"
|
||||||
|
],
|
||||||
|
"params": ["document_refs"]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"intent": "edit",
|
||||||
|
"name": "文档编辑",
|
||||||
|
"examples": [
|
||||||
|
"润色这段文字",
|
||||||
|
"修改格式",
|
||||||
|
"添加注释"
|
||||||
|
],
|
||||||
|
"params": []
|
||||||
|
}
|
||||||
|
]
|
||||||
|
}
|
||||||
@@ -1,13 +1,13 @@
|
|||||||
"""
|
"""
|
||||||
Task management API endpoints
|
Task management API endpoints
|
||||||
|
|
||||||
Provides asynchronous task status queries
|
Provides asynchronous task status queries and task history
|
||||||
"""
|
"""
|
||||||
from typing import Optional
|
from typing import Optional
|
||||||
|
|
||||||
from fastapi import APIRouter, HTTPException
|
from fastapi import APIRouter, HTTPException
|
||||||
|
|
||||||
from app.core.database import redis_db
|
from app.core.database import redis_db, mongodb
|
||||||
|
|
||||||
router = APIRouter(prefix="/tasks", tags=["任务管理"])
|
router = APIRouter(prefix="/tasks", tags=["任务管理"])
|
||||||
|
|
||||||
@@ -23,25 +23,94 @@ async def get_task_status(task_id: str):
|
|||||||
Returns:
|
Returns:
|
||||||
任务状态信息
|
任务状态信息
|
||||||
"""
|
"""
|
||||||
|
# Prefer the status stored in Redis
|
||||||
status = await redis_db.get_task_status(task_id)
|
status = await redis_db.get_task_status(task_id)
|
||||||
|
|
||||||
if not status:
|
if status:
|
||||||
# When Redis is unavailable, assume the task has finished (the document was processed successfully)
|
|
||||||
# The frontend receives this response when polling
|
|
||||||
return {
|
return {
|
||||||
"task_id": task_id,
|
"task_id": task_id,
|
||||||
"status": "success",
|
"status": status.get("status", "unknown"),
|
||||||
"progress": 100,
|
"progress": status.get("meta", {}).get("progress", 0),
|
||||||
"message": "任务处理完成",
|
"message": status.get("meta", {}).get("message"),
|
||||||
"result": None,
|
"result": status.get("meta", {}).get("result"),
|
||||||
"error": None
|
"error": status.get("meta", {}).get("error")
|
||||||
}
|
}
|
||||||
|
|
||||||
|
# When Redis has no entry or is unavailable, fall back to MongoDB
|
||||||
|
mongo_task = await mongodb.get_task(task_id)
|
||||||
|
if mongo_task:
|
||||||
|
return {
|
||||||
|
"task_id": mongo_task.get("task_id"),
|
||||||
|
"status": mongo_task.get("status", "unknown"),
|
||||||
|
"progress": 100 if mongo_task.get("status") == "success" else 0,
|
||||||
|
"message": mongo_task.get("message"),
|
||||||
|
"result": mongo_task.get("result"),
|
||||||
|
"error": mongo_task.get("error")
|
||||||
|
}
|
||||||
|
|
||||||
|
# The task does not exist or its status is unknown
|
||||||
return {
|
return {
|
||||||
"task_id": task_id,
|
"task_id": task_id,
|
||||||
"status": status.get("status", "unknown"),
|
"status": "unknown",
|
||||||
"progress": status.get("meta", {}).get("progress", 0),
|
"progress": 0,
|
||||||
"message": status.get("meta", {}).get("message"),
|
"message": "无法获取任务状态(Redis和MongoDB均不可用)",
|
||||||
"result": status.get("meta", {}).get("result"),
|
"result": None,
|
||||||
"error": status.get("meta", {}).get("error")
|
"error": None
|
||||||
}
|
}
|
||||||
|
|
||||||
|
|
||||||
|
@router.get("/")
|
||||||
|
async def list_tasks(limit: int = 50, skip: int = 0):
|
||||||
|
"""
|
||||||
|
获取任务历史列表
|
||||||
|
|
||||||
|
Args:
|
||||||
|
limit: 返回数量限制
|
||||||
|
skip: 跳过数量
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
任务列表
|
||||||
|
"""
|
||||||
|
try:
|
||||||
|
tasks = await mongodb.list_tasks(limit=limit, skip=skip)
|
||||||
|
return {
|
||||||
|
"success": True,
|
||||||
|
"tasks": tasks,
|
||||||
|
"count": len(tasks)
|
||||||
|
}
|
||||||
|
except Exception as e:
|
||||||
|
# Return an empty list when MongoDB is unavailable
|
||||||
|
return {
|
||||||
|
"success": False,
|
||||||
|
"tasks": [],
|
||||||
|
"count": 0,
|
||||||
|
"error": str(e)
|
||||||
|
}
|
||||||
|
|
||||||
|
|
||||||
|
@router.delete("/{task_id}")
|
||||||
|
async def delete_task(task_id: str):
|
||||||
|
"""
|
||||||
|
删除任务
|
||||||
|
|
||||||
|
Args:
|
||||||
|
task_id: 任务ID
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
是否删除成功
|
||||||
|
"""
|
||||||
|
try:
|
||||||
|
# Delete from Redis
|
||||||
|
if redis_db.is_connected and redis_db.client:
|
||||||
|
key = f"task:{task_id}"
|
||||||
|
await redis_db.client.delete(key)
|
||||||
|
|
||||||
|
# Delete from MongoDB
|
||||||
|
deleted = await mongodb.delete_task(task_id)
|
||||||
|
|
||||||
|
return {
|
||||||
|
"success": True,
|
||||||
|
"deleted": deleted
|
||||||
|
}
|
||||||
|
except Exception as e:
|
||||||
|
raise HTTPException(status_code=500, detail=f"删除任务失败: {str(e)}")
|
||||||
|
|||||||
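The rewritten `get_task_status` resolves a task in three steps: Redis first, then MongoDB, then an `unknown` placeholder. A minimal, self-contained sketch of that lookup order is shown below. The `Lookup` type, `resolve_task_status`, and the in-memory `fake_redis` / `fake_mongo` stand-ins are hypothetical names used only for illustration; the real code calls `redis_db.get_task_status` and `mongodb.get_task`.

```python
import asyncio
from typing import Any, Awaitable, Callable, Dict, Optional

# A lookup takes a task ID and returns the stored record, or None if absent.
Lookup = Callable[[str], Awaitable[Optional[Dict[str, Any]]]]


async def resolve_task_status(task_id: str, from_redis: Lookup, from_mongo: Lookup) -> Dict[str, Any]:
    """Resolve a task: Redis first, then MongoDB, then an 'unknown' placeholder."""
    cached = await from_redis(task_id)
    if cached:
        meta = cached.get("meta", {})  # Redis keeps progress details under "meta"
        return {
            "task_id": task_id,
            "status": cached.get("status", "unknown"),
            "progress": meta.get("progress", 0),
            "message": meta.get("message"),
            "result": meta.get("result"),
            "error": meta.get("error"),
        }

    stored = await from_mongo(task_id)
    if stored:
        status = stored.get("status", "unknown")
        return {
            "task_id": task_id,
            "status": status,
            # MongoDB records carry no progress, so it is derived from the status.
            "progress": 100 if status == "success" else 0,
            "message": stored.get("message"),
            "result": stored.get("result"),
            "error": stored.get("error"),
        }

    return {"task_id": task_id, "status": "unknown", "progress": 0,
            "message": None, "result": None, "error": None}


# In-memory stand-ins for the two stores (demonstration only).
_REDIS = {"t1": {"status": "processing", "meta": {"progress": 50, "message": "halfway"}}}
_MONGO = {"t2": {"task_id": "t2", "status": "success", "message": "done",
                 "result": {"doc_ids": ["a"]}}}


async def fake_redis(task_id: str) -> Optional[Dict[str, Any]]:
    return _REDIS.get(task_id)


async def fake_mongo(task_id: str) -> Optional[Dict[str, Any]]:
    return _MONGO.get(task_id)
```

Unlike the previous behaviour, a missing Redis entry no longer reports `success` with 100% progress, so a poller cannot mistake a lost or failed task for a completed one.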
@@ -5,21 +5,62 @@
|
|||||||
"""
|
"""
|
||||||
import io
|
import io
|
||||||
import logging
|
import logging
|
||||||
|
import uuid
|
||||||
from typing import List, Optional
|
from typing import List, Optional
|
||||||
|
|
||||||
from fastapi import APIRouter, File, HTTPException, Query, UploadFile
|
from fastapi import APIRouter, File, HTTPException, Query, UploadFile, BackgroundTasks
|
||||||
from fastapi.responses import StreamingResponse
|
from fastapi.responses import StreamingResponse
|
||||||
import pandas as pd
|
import pandas as pd
|
||||||
from pydantic import BaseModel
|
from pydantic import BaseModel
|
||||||
|
|
||||||
from app.services.template_fill_service import template_fill_service, TemplateField
|
from app.services.template_fill_service import template_fill_service, TemplateField
|
||||||
from app.services.file_service import file_service
|
from app.services.file_service import file_service
|
||||||
|
from app.core.database import mongodb
|
||||||
|
from app.core.document_parser import ParserFactory
|
||||||
|
|
||||||
logger = logging.getLogger(__name__)
|
logger = logging.getLogger(__name__)
|
||||||
|
|
||||||
router = APIRouter(prefix="/templates", tags=["表格模板"])
|
router = APIRouter(prefix="/templates", tags=["表格模板"])
|
||||||
|
|
||||||
|
|
||||||
|
# ==================== Helper functions ====================
|
||||||
|
|
||||||
|
async def update_task_status(
|
||||||
|
task_id: str,
|
||||||
|
status: str,
|
||||||
|
progress: int = 0,
|
||||||
|
message: str = "",
|
||||||
|
result: Optional[dict] = None,
|
||||||
|
error: Optional[str] = None
|
||||||
|
):
|
||||||
|
"""
|
||||||
|
Update the task status, writing to both Redis and MongoDB
|
||||||
|
"""
|
||||||
|
from app.core.database import redis_db
|
||||||
|
|
||||||
|
meta = {"progress": progress, "message": message}
|
||||||
|
if result:
|
||||||
|
meta["result"] = result
|
||||||
|
if error:
|
||||||
|
meta["error"] = error
|
||||||
|
|
||||||
|
try:
|
||||||
|
await redis_db.set_task_status(task_id, status, meta)
|
||||||
|
except Exception as e:
|
||||||
|
logger.warning(f"Redis 任务状态更新失败: {e}")
|
||||||
|
|
||||||
|
try:
|
||||||
|
await mongodb.update_task(
|
||||||
|
task_id=task_id,
|
||||||
|
status=status,
|
||||||
|
message=message,
|
||||||
|
result=result,
|
||||||
|
error=error
|
||||||
|
)
|
||||||
|
except Exception as e:
|
||||||
|
logger.warning(f"MongoDB 任务状态更新失败: {e}")
|
||||||
|
|
||||||
|
|
||||||
# ==================== Request/response models ====================
|
# ==================== Request/response models ====================
|
||||||
|
|
||||||
class TemplateFieldRequest(BaseModel):
|
class TemplateFieldRequest(BaseModel):
|
||||||
@@ -38,6 +79,7 @@ class FillRequest(BaseModel):
|
|||||||
source_doc_ids: Optional[List[str]] = None  # list of MongoDB document IDs
|
source_doc_ids: Optional[List[str]] = None  # list of MongoDB document IDs
|
||||||
source_file_paths: Optional[List[str]] = None  # list of source document file paths
|
source_file_paths: Optional[List[str]] = None  # list of source document file paths
|
||||||
user_hint: Optional[str] = None
|
user_hint: Optional[str] = None
|
||||||
|
task_id: Optional[str] = None  # optional task ID, used for task history tracking
|
||||||
|
|
||||||
|
|
||||||
class ExportRequest(BaseModel):
|
class ExportRequest(BaseModel):
|
||||||
@@ -109,6 +151,240 @@ async def upload_template(
|
|||||||
raise HTTPException(status_code=500, detail=f"上传失败: {str(e)}")
|
raise HTTPException(status_code=500, detail=f"上传失败: {str(e)}")
|
||||||
|
|
||||||
|
|
||||||
|
@router.post("/upload-joint")
|
||||||
|
async def upload_joint_template(
|
||||||
|
background_tasks: BackgroundTasks,
|
||||||
|
template_file: UploadFile = File(..., description="模板文件"),
|
||||||
|
source_files: List[UploadFile] = File(..., description="源文档文件列表"),
|
||||||
|
):
|
||||||
|
"""
|
||||||
|
联合上传模板和源文档,一键完成解析和存储
|
||||||
|
|
||||||
|
1. 保存模板文件并提取字段
|
||||||
|
2. 异步处理源文档(解析+存MongoDB)
|
||||||
|
3. 返回模板信息和源文档ID列表
|
||||||
|
|
||||||
|
Args:
|
||||||
|
template_file: 模板文件 (xlsx/xls/docx)
|
||||||
|
source_files: 源文档列表 (docx/xlsx/md/txt)
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
模板ID、字段列表、源文档ID列表
|
||||||
|
"""
|
||||||
|
if not template_file.filename:
|
||||||
|
raise HTTPException(status_code=400, detail="模板文件名为空")
|
||||||
|
|
||||||
|
# Validate the template format
|
||||||
|
template_ext = template_file.filename.split('.')[-1].lower()
|
||||||
|
if template_ext not in ['xlsx', 'xls', 'docx']:
|
||||||
|
raise HTTPException(
|
||||||
|
status_code=400,
|
||||||
|
detail=f"不支持的模板格式: {template_ext},仅支持 xlsx/xls/docx"
|
||||||
|
)
|
||||||
|
|
||||||
|
# Validate the source document formats
|
||||||
|
valid_exts = ['docx', 'xlsx', 'xls', 'md', 'txt']
|
||||||
|
for sf in source_files:
|
||||||
|
if sf.filename:
|
||||||
|
sf_ext = sf.filename.split('.')[-1].lower()
|
||||||
|
if sf_ext not in valid_exts:
|
||||||
|
raise HTTPException(
|
||||||
|
status_code=400,
|
||||||
|
detail=f"不支持的源文档格式: {sf_ext},仅支持 docx/xlsx/xls/md/txt"
|
||||||
|
)
|
||||||
|
|
||||||
|
try:
|
||||||
|
# 1. Save the template file
|
||||||
|
template_content = await template_file.read()
|
||||||
|
template_path = file_service.save_uploaded_file(
|
||||||
|
template_content,
|
||||||
|
template_file.filename,
|
||||||
|
subfolder="templates"
|
||||||
|
)
|
||||||
|
|
||||||
|
# 2. Save and parse the source documents, extracting content used to generate table headers
|
||||||
|
source_file_info = []
|
||||||
|
source_contents = []
|
||||||
|
for sf in source_files:
|
||||||
|
if sf.filename:
|
||||||
|
sf_content = await sf.read()
|
||||||
|
sf_ext = sf.filename.split('.')[-1].lower()
|
||||||
|
sf_path = file_service.save_uploaded_file(
|
||||||
|
sf_content,
|
||||||
|
sf.filename,
|
||||||
|
subfolder=sf_ext
|
||||||
|
)
|
||||||
|
source_file_info.append({
|
||||||
|
"path": sf_path,
|
||||||
|
"filename": sf.filename,
|
||||||
|
"ext": sf_ext
|
||||||
|
})
|
||||||
|
# Parse the source document to get its content (used by the AI to generate table headers)
|
||||||
|
try:
|
||||||
|
from app.core.document_parser import ParserFactory
|
||||||
|
parser = ParserFactory.get_parser(sf_path)
|
||||||
|
parse_result = parser.parse(sf_path)
|
||||||
|
if parse_result.success and parse_result.data:
|
||||||
|
# Get the raw content
|
||||||
|
content = (parse_result.data.get("content") or "")[:5000]
|
||||||
|
|
||||||
|
# Get the titles (either at the top level or inside structured_data)
|
||||||
|
titles = parse_result.data.get("titles", [])
|
||||||
|
if not titles and parse_result.data.get("structured_data"):
|
||||||
|
titles = parse_result.data.get("structured_data", {}).get("titles", [])
|
||||||
|
titles = titles[:10] if titles else []
|
||||||
|
|
||||||
|
# Get the table count (tables are either at the top level or inside structured_data)
|
||||||
|
tables = parse_result.data.get("tables", [])
|
||||||
|
if not tables and parse_result.data.get("structured_data"):
|
||||||
|
tables = parse_result.data.get("structured_data", {}).get("tables", [])
|
||||||
|
tables_count = len(tables) if tables else 0
|
||||||
|
|
||||||
|
# Build a summary of the table contents (helps the AI understand the source document structure)
|
||||||
|
tables_summary = ""
|
||||||
|
if tables:
|
||||||
|
tables_summary = "\n【文档中的表格】:\n"
|
||||||
|
for idx, table in enumerate(tables[:5]):  # at most 5 tables
|
||||||
|
if isinstance(table, dict):
|
||||||
|
headers = table.get("headers", [])
|
||||||
|
rows = table.get("rows", [])
|
||||||
|
if headers:
|
||||||
|
tables_summary += f"表格{idx+1}表头: {', '.join(str(h) for h in headers)}\n"
|
||||||
|
if rows:
|
||||||
|
tables_summary += f"表格{idx+1}前3行: "
|
||||||
|
for row_idx, row in enumerate(rows[:3]):
|
||||||
|
if isinstance(row, list):
|
||||||
|
tables_summary += " | ".join(str(c) for c in row) + "; "
|
||||||
|
elif isinstance(row, dict):
|
||||||
|
tables_summary += " | ".join(str(row.get(h, "")) for h in headers) + "; "
|
||||||
|
tables_summary += "\n"
|
||||||
|
|
||||||
|
source_contents.append({
|
||||||
|
"filename": sf.filename,
|
||||||
|
"doc_type": sf_ext,
|
||||||
|
"content": content,
|
||||||
|
"titles": titles,
|
||||||
|
"tables_count": tables_count,
|
||||||
|
"tables_summary": tables_summary
|
||||||
|
})
|
||||||
|
logger.info(f"[DEBUG] source_contents built: filename={sf.filename}, content_len={len(content)}, titles_count={len(titles)}, tables_count={tables_count}")
|
||||||
|
if tables_summary:
|
||||||
|
logger.info(f"[DEBUG] tables_summary preview: {tables_summary[:300]}")
|
||||||
|
except Exception as e:
|
||||||
|
logger.warning(f"解析源文档失败 {sf.filename}: {e}")
|
||||||
|
|
||||||
|
# 3. Generate table headers from the source document contents
|
||||||
|
template_fields = await template_fill_service.get_template_fields_from_file(
|
||||||
|
template_path,
|
||||||
|
template_ext,
|
||||||
|
source_contents=source_contents  # pass the source document contents
|
||||||
|
)
|
||||||
|
|
||||||
|
# 4. Process the source documents into MongoDB asynchronously
|
||||||
|
task_id = str(uuid.uuid4())
|
||||||
|
if source_file_info:
|
||||||
|
# Save the task record to MongoDB
|
||||||
|
try:
|
||||||
|
await mongodb.insert_task(
|
||||||
|
task_id=task_id,
|
||||||
|
task_type="source_process",
|
||||||
|
status="pending",
|
||||||
|
message=f"开始处理 {len(source_file_info)} 个源文档"
|
||||||
|
)
|
||||||
|
except Exception as mongo_err:
|
||||||
|
logger.warning(f"MongoDB 保存任务记录失败: {mongo_err}")
|
||||||
|
|
||||||
|
background_tasks.add_task(
|
||||||
|
process_source_documents,
|
||||||
|
task_id=task_id,
|
||||||
|
files=source_file_info
|
||||||
|
)
|
||||||
|
|
||||||
|
logger.info(f"联合上传完成: 模板={template_file.filename}, 源文档={len(source_file_info)}个")
|
||||||
|
|
||||||
|
return {
|
||||||
|
"success": True,
|
||||||
|
"template_id": template_path,
|
||||||
|
"filename": template_file.filename,
|
||||||
|
"file_type": template_ext,
|
||||||
|
"fields": [
|
||||||
|
{
|
||||||
|
"cell": f.cell,
|
||||||
|
"name": f.name,
|
||||||
|
"field_type": f.field_type,
|
||||||
|
"required": f.required,
|
||||||
|
"hint": f.hint
|
||||||
|
}
|
||||||
|
for f in template_fields
|
||||||
|
],
|
||||||
|
"field_count": len(template_fields),
|
||||||
|
"source_file_paths": [f["path"] for f in source_file_info],
|
||||||
|
"source_filenames": [f["filename"] for f in source_file_info],
|
||||||
|
"task_id": task_id
|
||||||
|
}
|
||||||
|
|
||||||
|
except HTTPException:
|
||||||
|
raise
|
||||||
|
except Exception as e:
|
||||||
|
logger.error(f"联合上传失败: {str(e)}")
|
||||||
|
raise HTTPException(status_code=500, detail=f"联合上传失败: {str(e)}")
|
||||||
|
|
||||||
|
|
||||||
|
async def process_source_documents(task_id: str, files: List[dict]):
|
||||||
|
"""Process the source documents asynchronously and store them in MongoDB"""
|
||||||
|
try:
|
||||||
|
await update_task_status(
|
||||||
|
task_id, status="processing",
|
||||||
|
progress=0, message="开始处理源文档"
|
||||||
|
)
|
||||||
|
|
||||||
|
doc_ids = []
|
||||||
|
for i, file_info in enumerate(files):
|
||||||
|
try:
|
||||||
|
parser = ParserFactory.get_parser(file_info["path"])
|
||||||
|
result = parser.parse(file_info["path"])
|
||||||
|
|
||||||
|
if result.success:
|
||||||
|
doc_id = await mongodb.insert_document(
|
||||||
|
doc_type=file_info["ext"],
|
||||||
|
content=result.data.get("content", ""),
|
||||||
|
metadata={
|
||||||
|
**result.metadata,
|
||||||
|
"original_filename": file_info["filename"],
|
||||||
|
"file_path": file_info["path"]
|
||||||
|
},
|
||||||
|
structured_data=result.data.get("structured_data")
|
||||||
|
)
|
||||||
|
doc_ids.append(doc_id)
|
||||||
|
logger.info(f"源文档处理成功: {file_info['filename']}, doc_id: {doc_id}")
|
||||||
|
else:
|
||||||
|
logger.error(f"源文档解析失败: {file_info['filename']}, error: {result.error}")
|
||||||
|
|
||||||
|
except Exception as e:
|
||||||
|
logger.error(f"源文档处理异常: {file_info['filename']}, error: {str(e)}")
|
||||||
|
|
||||||
|
progress = int((i + 1) / len(files) * 100)
|
||||||
|
await update_task_status(
|
||||||
|
task_id, status="processing",
|
||||||
|
progress=progress, message=f"已处理 {i+1}/{len(files)}"
|
||||||
|
)
|
||||||
|
|
||||||
|
await update_task_status(
|
||||||
|
task_id, status="success",
|
||||||
|
progress=100, message="源文档处理完成",
|
||||||
|
result={"doc_ids": doc_ids}
|
||||||
|
)
|
||||||
|
logger.info(f"所有源文档处理完成: {len(doc_ids)}个")
|
||||||
|
|
||||||
|
except Exception as e:
|
||||||
|
logger.error(f"源文档批量处理失败: {str(e)}")
|
||||||
|
await update_task_status(
|
||||||
|
task_id, status="failure",
|
||||||
|
progress=0, message="源文档处理失败",
|
||||||
|
error=str(e)
|
||||||
|
)
|
||||||
|
|
||||||
|
|
||||||
@router.post("/fields")
|
@router.post("/fields")
|
||||||
async def extract_template_fields(
|
async def extract_template_fields(
|
||||||
template_id: str = Query(..., description="模板ID/文件路径"),
|
template_id: str = Query(..., description="模板ID/文件路径"),
|
||||||
@@ -164,7 +440,27 @@ async def fill_template(
|
|||||||
Returns:
|
Returns:
|
||||||
The fill result
|
The fill result
|
||||||
"""
|
"""
|
||||||
|
# Generate a task_id, or reuse the one passed in
|
||||||
|
task_id = request.task_id or str(uuid.uuid4())
|
||||||
|
|
||||||
try:
|
try:
|
||||||
|
# Create the task record in MongoDB
|
||||||
|
try:
|
||||||
|
await mongodb.insert_task(
|
||||||
|
task_id=task_id,
|
||||||
|
task_type="template_fill",
|
||||||
|
status="processing",
|
||||||
|
message=f"开始填表任务: {len(request.template_fields)} 个字段"
|
||||||
|
)
|
||||||
|
except Exception as mongo_err:
|
||||||
|
logger.warning(f"MongoDB 创建任务记录失败: {mongo_err}")
|
||||||
|
|
||||||
|
# Update progress: started
|
||||||
|
await update_task_status(
|
||||||
|
task_id, "processing",
|
||||||
|
progress=0, message="开始处理..."
|
||||||
|
)
|
||||||
|
|
||||||
# Convert the fields
|
# Convert the fields
|
||||||
fields = [
|
fields = [
|
||||||
TemplateField(
|
TemplateField(
|
||||||
@@ -177,17 +473,51 @@ async def fill_template(
|
|||||||
for f in request.template_fields
|
for f in request.template_fields
|
||||||
]
|
]
|
||||||
|
|
||||||
|
# Derive the file type from template_id
|
||||||
|
template_file_type = "xlsx"  # default type
|
||||||
|
if request.template_id:
|
||||||
|
ext = request.template_id.split('.')[-1].lower()
|
||||||
|
if ext in ["xlsx", "xls"]:
|
||||||
|
template_file_type = "xlsx"
|
||||||
|
elif ext == "docx":
|
||||||
|
template_file_type = "docx"
|
||||||
|
|
||||||
|
# Update progress: about to start filling
|
||||||
|
await update_task_status(
|
||||||
|
task_id, "processing",
|
||||||
|
progress=10, message=f"准备填写 {len(fields)} 个字段..."
|
||||||
|
)
|
||||||
|
|
||||||
# Run the fill
|
# Run the fill
|
||||||
result = await template_fill_service.fill_template(
|
result = await template_fill_service.fill_template(
|
||||||
template_fields=fields,
|
template_fields=fields,
|
||||||
source_doc_ids=request.source_doc_ids,
|
source_doc_ids=request.source_doc_ids,
|
||||||
source_file_paths=request.source_file_paths,
|
source_file_paths=request.source_file_paths,
|
||||||
user_hint=request.user_hint
|
user_hint=request.user_hint,
|
||||||
|
template_id=request.template_id,
|
||||||
|
template_file_type=template_file_type,
|
||||||
|
task_id=task_id
|
||||||
)
|
)
|
||||||
|
|
||||||
return result
|
# Mark the task as succeeded
|
||||||
|
await update_task_status(
|
||||||
|
task_id, "success",
|
||||||
|
progress=100, message="填表完成",
|
||||||
|
result={
|
||||||
|
"field_count": len(fields),
|
||||||
|
"max_rows": result.get("max_rows", 0)
|
||||||
|
}
|
||||||
|
)
|
||||||
|
|
||||||
|
return {**result, "task_id": task_id}
|
||||||
|
|
||||||
except Exception as e:
|
except Exception as e:
|
||||||
|
# Mark the task as failed
|
||||||
|
await update_task_status(
|
||||||
|
task_id, "failure",
|
||||||
|
progress=0, message="填表失败",
|
||||||
|
error=str(e)
|
||||||
|
)
|
||||||
logger.error(f"填写表格失败: {str(e)}")
|
logger.error(f"填写表格失败: {str(e)}")
|
||||||
raise HTTPException(status_code=500, detail=f"填写失败: {str(e)}")
|
raise HTTPException(status_code=500, detail=f"填写失败: {str(e)}")
|
||||||
|
|
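The extension check added to `fill_template` in the hunk above can be isolated into a small helper. This is a minimal sketch, not the project's code: `infer_template_type` is a hypothetical name, and it assumes the same fallback to `"xlsx"` that the diff uses. It relies on `os.path.splitext`, so an ID that has no extension at all yields an empty extension instead of the whole string.

```python
import os


def infer_template_type(template_id, default="xlsx"):
    """Map a template path/ID to "xlsx" or "docx" (hypothetical helper)."""
    if not template_id:
        return default
    # splitext returns (root, ".ext"); ext is "" when there is no extension
    ext = os.path.splitext(template_id)[1].lstrip(".").lower()
    if ext in ("xlsx", "xls"):
        return "xlsx"
    if ext == "docx":
        return "docx"
    # Unknown extensions fall back to the default, as in the diff
    return default
```

Keeping the mapping in one function would also make it easier to reuse in the export endpoints, if they need the same decision.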
||||||
@@ -280,51 +610,79 @@ async def _export_to_excel(filled_data: dict, template_id: str) -> StreamingResp
|
|||||||
|
|
||||||
async def _export_to_word(filled_data: dict, template_id: str) -> StreamingResponse:
|
async def _export_to_word(filled_data: dict, template_id: str) -> StreamingResponse:
|
||||||
"""Export to Word format."""
|
"""Export to Word format."""
|
||||||
|
import re
|
||||||
|
import tempfile
|
||||||
|
import os
|
||||||
from docx import Document
|
from docx import Document
|
||||||
from docx.shared import Pt, RGBColor
|
from docx.shared import Pt, RGBColor
|
||||||
from docx.enum.text import WD_ALIGN_PARAGRAPH
|
from docx.enum.text import WD_ALIGN_PARAGRAPH
|
||||||
|
|
||||||
doc = Document()
|
def clean_text(text: str) -> str:
|
||||||
|
"""Clean the text by removing illegal characters that can cause problems in Word."""
|
||||||
|
if not text:
|
||||||
|
return ""
|
||||||
|
# Remove control characters
|
||||||
|
text = re.sub(r'[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]', '', text)
|
||||||
|
return text.strip()
|
||||||
|
|
||||||
# Add the title
|
try:
|
||||||
title = doc.add_heading('填写结果', level=1)
|
# Save to a temporary file first, then read it into memory, to ensure the document is complete
|
||||||
title.alignment = WD_ALIGN_PARAGRAPH.CENTER
|
with tempfile.NamedTemporaryFile(delete=False, suffix='.docx') as tmp_file:
|
||||||
|
tmp_path = tmp_file.name
|
||||||
|
|
||||||
# Add the fill time and template info
|
doc = Document()
|
||||||
from datetime import datetime
|
doc.add_heading('填写结果', level=1)
|
||||||
info_para = doc.add_paragraph()
|
|
||||||
info_para.add_run(f"模板ID: {template_id}\n").bold = True
|
|
||||||
info_para.add_run(f"导出时间: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")
|
|
||||||
|
|
||||||
doc.add_paragraph()  # blank line
|
from datetime import datetime
|
||||||
|
info_para = doc.add_paragraph()
|
||||||
|
template_filename = template_id.split('/')[-1].split('\\')[-1] if template_id else '未知'
|
||||||
|
info_para.add_run(f"模板文件: {clean_text(template_filename)}\n").bold = True
|
||||||
|
info_para.add_run(f"导出时间: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")
|
||||||
|
doc.add_paragraph()
|
||||||
|
|
||||||
# Add the field table
|
table = doc.add_table(rows=1, cols=3)
|
||||||
table = doc.add_table(rows=1, cols=3)
|
table.style = 'Table Grid'
|
||||||
table.style = 'Light Grid Accent 1'
|
|
||||||
|
|
||||||
# Table header
|
header_cells = table.rows[0].cells
|
||||||
header_cells = table.rows[0].cells
|
header_cells[0].text = '字段名'
|
||||||
header_cells[0].text = '字段名'
|
header_cells[1].text = '填写值'
|
||||||
header_cells[1].text = '填写值'
|
header_cells[2].text = '状态'
|
||||||
header_cells[2].text = '状态'
|
|
||||||
|
|
||||||
for field_name, field_value in filled_data.items():
|
for field_name, field_value in filled_data.items():
|
||||||
row_cells = table.add_row().cells
|
row_cells = table.add_row().cells
|
||||||
row_cells[0].text = field_name
|
row_cells[0].text = clean_text(str(field_name))
|
||||||
row_cells[1].text = str(field_value) if field_value else ''
|
|
||||||
row_cells[2].text = '已填写' if field_value else '为空'
|
|
||||||
|
|
||||||
# Save to BytesIO
|
if isinstance(field_value, list):
|
||||||
output = io.BytesIO()
|
clean_values = [clean_text(str(v)) for v in field_value if v]
|
||||||
doc.save(output)
|
display_value = ', '.join(clean_values) if clean_values else ''
|
||||||
output.seek(0)
|
else:
|
||||||
|
display_value = clean_text(str(field_value)) if field_value else ''
|
||||||
|
|
||||||
filename = f"filled_template.docx"
|
row_cells[1].text = display_value
|
||||||
|
row_cells[2].text = '已填写' if display_value else '为空'
|
||||||
|
|
||||||
|
# Save to the temporary file
|
||||||
|
doc.save(tmp_path)
|
||||||
|
|
||||||
|
# Read the file contents
|
||||||
|
with open(tmp_path, 'rb') as f:
|
||||||
|
file_content = f.read()
|
||||||
|
|
||||||
|
finally:
|
||||||
|
# Clean up the temporary file
|
||||||
|
if os.path.exists(tmp_path):
|
||||||
|
try:
|
||||||
|
os.unlink(tmp_path)
|
||||||
|
except OSError:
|
||||||
|
pass
|
||||||
|
|
||||||
|
output = io.BytesIO(file_content)
|
||||||
|
filename = "filled_template.docx"
|
||||||
|
|
||||||
return StreamingResponse(
|
return StreamingResponse(
|
||||||
io.BytesIO(output.getvalue()),
|
output,
|
||||||
media_type="application/vnd.openxmlformats-officedocument.wordprocessingml.document",
|
media_type="application/vnd.openxmlformats-officedocument.wordprocessingml.document",
|
||||||
headers={"Content-Disposition": f"attachment; filename={filename}"}
|
headers={"Content-Disposition": f"attachment; filename*=UTF-8''{filename}"}
|
||||||
)
|
)
|
||||||
|
|
||||||
|
|
||||||
|
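Two details of the Word export hunk above are easy to check in isolation: the control-character stripping done by `clean_text`, and the `filename*=UTF-8''...` form of the `Content-Disposition` header. The sketch below reuses the regular expression from the diff. `attachment_header` is a hypothetical helper, not part of the project; it assumes that a filename containing spaces or non-ASCII characters should be percent-encoded as UTF-8, which is what the `filename*` parameter expects. The constant name `filled_template.docx` used in the diff is plain ASCII, so it passes through unchanged.

```python
import re
from urllib.parse import quote

# Same character class as clean_text in the diff: C0 control characters
# except tab, newline and carriage return, plus DEL.
_CONTROL_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]")


def clean_text(text):
    """Strip control characters that are not allowed in Word XML."""
    if not text:
        return ""
    return _CONTROL_CHARS.sub("", text).strip()


def attachment_header(filename):
    """Build a Content-Disposition value with a percent-encoded filename*."""
    # safe="" so that "/" and other reserved characters are encoded too
    return f"attachment; filename*=UTF-8''{quote(filename, safe='')}"
```

For example, `attachment_header("a b.docx")` encodes the space as `%20`, while tabs inside a cell value survive `clean_text` because `\x09` is outside the stripped range.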
|||||||
@@ -5,6 +5,7 @@ from fastapi import APIRouter, UploadFile, File, HTTPException, Query
|
|||||||
from fastapi.responses import StreamingResponse
|
from fastapi.responses import StreamingResponse
|
||||||
from typing import Optional
|
from typing import Optional
|
||||||
import logging
|
import logging
|
||||||
|
import os
|
||||||
import pandas as pd
|
import pandas as pd
|
||||||
import io
|
import io
|
||||||
|
|
||||||
@@ -126,7 +127,7 @@ async def upload_excel(
|
|||||||
content += f"... (共 {len(sheet_data['rows'])} 行)\n\n"
|
content += f"... (共 {len(sheet_data['rows'])} 行)\n\n"
|
||||||
|
|
||||||
doc_metadata = {
|
doc_metadata = {
|
||||||
"filename": saved_path.split("/")[-1] if "/" in saved_path else saved_path.split("\\")[-1],
|
"filename": os.path.basename(saved_path),
|
||||||
"original_filename": file.filename,
|
"original_filename": file.filename,
|
||||||
"saved_path": saved_path,
|
"saved_path": saved_path,
|
||||||
"file_size": len(content),
|
"file_size": len(content),
|
||||||
@@ -253,7 +254,7 @@ async def export_excel(
|
|||||||
output.seek(0)
|
output.seek(0)
|
||||||
|
|
||||||
# Build the file name
|
# Build the file name
|
||||||
original_name = file_path.split('/')[-1] if '/' in file_path else file_path
|
original_name = os.path.basename(file_path)
|
||||||
if columns:
|
if columns:
|
||||||
export_name = f"export_{sheet_name or 'data'}_{len(column_list) if columns else 'all'}_cols.xlsx"
|
export_name = f"export_{sheet_name or 'data'}_{len(column_list) if columns else 'all'}_cols.xlsx"
|
||||||
else:
|
else:
|
||||||
|
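One caveat about the switch to `os.path.basename` in the hunks above: it splits only on the separators of the operating system the server runs on, so on a POSIX host a stored Windows-style path such as `C:\data\report.xlsx` comes back unchanged, whereas the replaced code in `upload_excel` split on both `/` and `\`. If paths with either separator can reach this code, a separator-agnostic variant is possible. The sketch below is an assumption about how that could look, with `portable_basename` as a hypothetical name; it uses `pathlib.PureWindowsPath`, which accepts both separators on any platform.

```python
from pathlib import PureWindowsPath


def portable_basename(path):
    """Return the last path component, accepting "/" and "\\" separators."""
    # PureWindowsPath does no filesystem access and treats both
    # separators as path delimiters regardless of the host OS.
    return PureWindowsPath(path).name
```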
|||||||
@@ -59,6 +59,11 @@ class MongoDB:
|
|||||||
"""RAG index collection: stores the semantic index of fields."""
|
"""RAG index collection: stores the semantic index of fields."""
|
||||||
return self.db["rag_index"]
|
return self.db["rag_index"]
|
||||||
|
|
||||||
|
@property
|
||||||
|
def tasks(self):
|
||||||
|
"""Tasks collection: stores the task history."""
|
||||||
|
return self.db["tasks"]
|
||||||
|
|
||||||
# ==================== Document operations ====================
|
# ==================== Document operations ====================
|
||||||
|
|
||||||
async def insert_document(
|
async def insert_document(
|
||||||
@@ -242,8 +247,128 @@ class MongoDB:
|
|||||||
await self.rag_index.create_index("table_name")
|
await self.rag_index.create_index("table_name")
|
||||||
await self.rag_index.create_index("field_name")
|
await self.rag_index.create_index("field_name")
|
||||||
|
|
||||||
|
# Indexes for the tasks collection
|
||||||
|
await self.tasks.create_index("task_id", unique=True)
|
||||||
|
await self.tasks.create_index("created_at")
|
||||||
|
|
||||||
logger.info("MongoDB 索引创建完成")
|
logger.info("MongoDB 索引创建完成")
|
||||||
|
|
||||||
|
# ==================== Task history operations ====================
|
||||||
|
|
||||||
|
async def insert_task(
|
||||||
|
self,
|
||||||
|
task_id: str,
|
||||||
|
task_type: str,
|
||||||
|
status: str = "pending",
|
||||||
|
message: str = "",
|
||||||
|
result: Optional[Dict[str, Any]] = None,
|
||||||
|
error: Optional[str] = None,
|
||||||
|
) -> str:
|
||||||
|
"""
|
||||||
|
Insert a task record.
|
||||||
|
|
||||||
|
Args:
|
||||||
|
task_id: Task ID
|
||||||
|
task_type: Task type
|
||||||
|
status: Task status
|
||||||
|
message: Task message
|
||||||
|
result: Task result
|
||||||
|
error: Error message
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
ID of the inserted document
|
||||||
|
"""
|
||||||
|
task = {
|
||||||
|
"task_id": task_id,
|
||||||
|
"task_type": task_type,
|
||||||
|
"status": status,
|
||||||
|
"message": message,
|
||||||
|
"result": result,
|
||||||
|
"error": error,
|
||||||
|
"created_at": datetime.utcnow(),
|
||||||
|
"updated_at": datetime.utcnow(),
|
||||||
|
}
|
||||||
|
result_obj = await self.tasks.insert_one(task)
|
||||||
|
return str(result_obj.inserted_id)
|
||||||
|
|
||||||
|
async def update_task(
|
||||||
|
self,
|
||||||
|
task_id: str,
|
||||||
|
status: Optional[str] = None,
|
||||||
|
message: Optional[str] = None,
|
||||||
|
result: Optional[Dict[str, Any]] = None,
|
||||||
|
error: Optional[str] = None,
|
||||||
|
) -> bool:
|
||||||
|
"""
|
||||||
|
Update the task status.
|
||||||
|
|
||||||
|
Args:
|
||||||
|
task_id: Task ID
|
||||||
|
status: Task status
|
||||||
|
message: Task message
|
||||||
|
result: Task result
|
||||||
|
error: Error message
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
Whether the update succeeded
|
||||||
|
"""
|
||||||
|
from bson import ObjectId
|
||||||
|
|
||||||
|
update_data = {"updated_at": datetime.utcnow()}
|
||||||
|
if status is not None:
|
||||||
|
update_data["status"] = status
|
||||||
|
if message is not None:
|
||||||
|
update_data["message"] = message
|
||||||
|
if result is not None:
|
||||||
|
update_data["result"] = result
|
||||||
|
if error is not None:
|
||||||
|
update_data["error"] = error
|
||||||
|
|
||||||
|
update_result = await self.tasks.update_one(
|
||||||
|
{"task_id": task_id},
|
||||||
|
{"$set": update_data}
|
||||||
|
)
|
||||||
|
return update_result.modified_count > 0
|
||||||
|
|
||||||
|
async def get_task(self, task_id: str) -> Optional[Dict[str, Any]]:
|
||||||
|
"""Fetch a task by task_id."""
|
||||||
|
task = await self.tasks.find_one({"task_id": task_id})
|
||||||
|
if task:
|
||||||
|
task["_id"] = str(task["_id"])
|
||||||
|
return task
|
||||||
|
|
||||||
|
async def list_tasks(
|
||||||
|
self,
|
||||||
|
limit: int = 50,
|
||||||
|
skip: int = 0,
|
||||||
|
) -> List[Dict[str, Any]]:
|
||||||
|
"""
|
||||||
|
List tasks.
|
||||||
|
|
||||||
|
Args:
|
||||||
|
limit: Number of tasks to return
|
||||||
|
skip: Number of tasks to skip
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
List of tasks
|
||||||
|
"""
|
||||||
|
cursor = self.tasks.find().sort("created_at", -1).skip(skip).limit(limit)
|
||||||
|
tasks = []
|
||||||
|
async for task in cursor:
|
||||||
|
task["_id"] = str(task["_id"])
|
||||||
|
# Convert datetime values to strings
|
||||||
|
if task.get("created_at"):
|
||||||
|
task["created_at"] = task["created_at"].isoformat()
|
||||||
|
if task.get("updated_at"):
|
||||||
|
task["updated_at"] = task["updated_at"].isoformat()
|
||||||
|
tasks.append(task)
|
||||||
|
return tasks
|
||||||
|
|
||||||
|
async def delete_task(self, task_id: str) -> bool:
|
||||||
|
"""Delete a task."""
|
||||||
|
result = await self.tasks.delete_one({"task_id": task_id})
|
||||||
|
return result.deleted_count > 0
|
||||||
|
|
||||||
|
|
||||||
# ==================== Global singleton ====================
|
# ==================== Global singleton ====================
|
||||||
|
|
||||||
|
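The task methods added to the `MongoDB` class above follow a simple contract: `insert_task` writes a full record, and `update_task` changes only the fields that are passed as non-`None` while always refreshing `updated_at`. The sketch below mirrors that contract with a plain dictionary so it can be run without a database. `InMemoryTaskStore` is a hypothetical stand-in, not the project's class, and it is synchronous where the real methods are `async`. It also uses timezone-aware timestamps, since `datetime.utcnow()` is deprecated as of Python 3.12; whether the project wants that change is an assumption.

```python
from datetime import datetime, timezone


class InMemoryTaskStore:
    """Dictionary-backed stand-in for the tasks collection (illustrative only)."""

    def __init__(self):
        self._tasks = {}

    def insert_task(self, task_id, task_type, status="pending", message=""):
        now = datetime.now(timezone.utc)
        self._tasks[task_id] = {
            "task_id": task_id,
            "task_type": task_type,
            "status": status,
            "message": message,
            "result": None,
            "error": None,
            "created_at": now,
            "updated_at": now,
        }
        return task_id

    def update_task(self, task_id, status=None, message=None, result=None, error=None):
        task = self._tasks.get(task_id)
        if task is None:
            return False  # nothing matched, like an update that finds no document
        changes = {"status": status, "message": message, "result": result, "error": error}
        # Only overwrite fields that were actually supplied
        task.update({k: v for k, v in changes.items() if v is not None})
        task["updated_at"] = datetime.now(timezone.utc)
        return True

    def get_task(self, task_id):
        return self._tasks.get(task_id)
```

A typical lifecycle is `pending`, then `processing`, then `success` or `failure`, matching the status strings used by the endpoints in this compare view.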
|||||||
@@ -59,7 +59,13 @@ class DocxParser(BaseParser):
|
|||||||
paragraphs = []
|
paragraphs = []
|
||||||
for para in doc.paragraphs:
|
for para in doc.paragraphs:
|
||||||
if para.text.strip():
|
if para.text.strip():
|
||||||
paragraphs.append(para.text)
|
paragraphs.append({
|
||||||
|
"text": para.text,
|
||||||
|
"style": str(para.style.name) if para.style else "Normal"
|
||||||
|
})
|
||||||
|
|
||||||
|
# Extract the plain text of each paragraph (used for AI parsing)
|
||||||
|
paragraphs_text = [p["text"] for p in paragraphs if p["text"].strip()]
|
||||||
|
|
||||||
# Extract table contents
|
# Extract table contents
|
||||||
tables_data = []
|
tables_data = []
|
||||||
@@ -77,8 +83,25 @@ class DocxParser(BaseParser):
|
|||||||
"column_count": len(table_rows[0]) if table_rows else 0
|
"column_count": len(table_rows[0]) if table_rows else 0
|
||||||
})
|
})
|
||||||
|
|
||||||
# Merge all text
|
# Extract information about images and embedded objects
|
||||||
full_text = "\n".join(paragraphs)
|
images_info = self._extract_images_info(doc, path)
|
||||||
|
|
||||||
|
# Merge all text (including image descriptions)
|
||||||
|
full_text_parts = []
|
||||||
|
full_text_parts.append("【文档正文】")
|
||||||
|
full_text_parts.extend(paragraphs_text)
|
||||||
|
|
||||||
|
if tables_data:
|
||||||
|
full_text_parts.append("\n【文档表格】")
|
||||||
|
for idx, table in enumerate(tables_data):
|
||||||
|
full_text_parts.append(f"--- 表格 {idx + 1} ---")
|
||||||
|
for row in table["rows"]:
|
||||||
|
full_text_parts.append(" | ".join(str(cell) for cell in row))
|
||||||
|
|
||||||
|
if images_info.get("image_count", 0) > 0:
|
||||||
|
full_text_parts.append(f"\n【文档图片】文档包含 {images_info['image_count']} 张图片/图表")
|
||||||
|
|
||||||
|
full_text = "\n".join(full_text_parts)
|
||||||
|
|
||||||
# Build the metadata
|
# Build the metadata
|
||||||
metadata = {
|
metadata = {
|
||||||
@@ -89,7 +112,9 @@ class DocxParser(BaseParser):
|
|||||||
"table_count": len(tables_data),
|
"table_count": len(tables_data),
|
||||||
"word_count": len(full_text),
|
"word_count": len(full_text),
|
||||||
"char_count": len(full_text.replace("\n", "")),
|
"char_count": len(full_text.replace("\n", "")),
|
||||||
"has_tables": len(tables_data) > 0
|
"has_tables": len(tables_data) > 0,
|
||||||
|
"has_images": images_info.get("image_count", 0) > 0,
|
||||||
|
"image_count": images_info.get("image_count", 0)
|
||||||
}
|
}
|
||||||
|
|
||||||
# Return the result
|
# Return the result
|
||||||
@@ -97,12 +122,16 @@ class DocxParser(BaseParser):
|
|||||||
success=True,
|
success=True,
|
||||||
data={
|
data={
|
||||||
"content": full_text,
|
"content": full_text,
|
||||||
"paragraphs": paragraphs,
|
"paragraphs": paragraphs_text,
|
||||||
|
"paragraphs_with_style": paragraphs,
|
||||||
"tables": tables_data,
|
"tables": tables_data,
|
||||||
|
"images": images_info,
|
||||||
"word_count": len(full_text),
|
"word_count": len(full_text),
|
||||||
"structured_data": {
|
"structured_data": {
|
||||||
"paragraphs": paragraphs,
|
"paragraphs": paragraphs,
|
||||||
"tables": tables_data
|
"paragraphs_text": paragraphs_text,
|
||||||
|
"tables": tables_data,
|
||||||
|
"images": images_info
|
||||||
}
|
}
|
||||||
},
|
},
|
||||||
metadata=metadata
|
metadata=metadata
|
||||||
@@ -115,6 +144,59 @@ class DocxParser(BaseParser):
|
|||||||
error=f"解析 Word 文档失败: {str(e)}"
|
error=f"解析 Word 文档失败: {str(e)}"
|
||||||
)
|
)
|
||||||
|
|
||||||
|
def extract_images_as_base64(self, file_path: str) -> List[Dict[str, str]]:
|
||||||
|
"""
|
||||||
|
Extract all images from a Word document and return them as a list of base64-encoded entries.
|
||||||
|
|
||||||
|
Args:
|
||||||
|
file_path: Path to the Word file
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
List of images; each entry contains the base64 data and the image type
|
||||||
|
"""
|
||||||
|
import zipfile
|
||||||
|
import base64
|
||||||
|
from io import BytesIO
|
||||||
|
|
||||||
|
images = []
|
||||||
|
|
||||||
|
try:
|
||||||
|
with zipfile.ZipFile(file_path, 'r') as zf:
|
||||||
|
# Look for image files under the word/media directory
|
||||||
|
for filename in zf.namelist():
|
||||||
|
if filename.startswith('word/media/'):
|
||||||
|
# Determine the image type
|
||||||
|
ext = filename.split('.')[-1].lower()
|
||||||
|
mime_types = {
|
||||||
|
'png': 'image/png',
|
||||||
|
'jpg': 'image/jpeg',
|
||||||
|
'jpeg': 'image/jpeg',
|
||||||
|
'gif': 'image/gif',
|
||||||
|
'bmp': 'image/bmp'
|
||||||
|
}
|
||||||
|
mime_type = mime_types.get(ext, 'image/png')
|
||||||
|
|
||||||
|
try:
|
||||||
|
# Read the image data and convert it to base64
|
||||||
|
image_data = zf.read(filename)
|
||||||
|
base64_data = base64.b64encode(image_data).decode('utf-8')
|
||||||
|
|
||||||
|
images.append({
|
||||||
|
"filename": filename,
|
||||||
|
"mime_type": mime_type,
|
||||||
|
"base64": base64_data,
|
||||||
|
"size": len(image_data)
|
||||||
|
})
|
||||||
|
logger.info(f"提取图片: {filename}, 大小: {len(image_data)} bytes")
|
||||||
|
except Exception as e:
|
||||||
|
logger.warning(f"提取图片失败 {filename}: {str(e)}")
|
||||||
|
|
||||||
|
except Exception as e:
|
||||||
|
logger.error(f"打开 Word 文档提取图片失败: {str(e)}")
|
||||||
|
|
||||||
|
logger.info(f"共提取 {len(images)} 张图片")
|
||||||
|
return images
|
||||||
|
|
||||||
def extract_key_sentences(self, text: str, max_sentences: int = 10) -> List[str]:
|
def extract_key_sentences(self, text: str, max_sentences: int = 10) -> List[str]:
|
||||||
"""
|
"""
|
||||||
Extract key sentences from the text
|
Extract key sentences from the text
|
||||||
@@ -268,6 +350,60 @@ class DocxParser(BaseParser):
|
|||||||
|
|
||||||
return fields
|
return fields
|
||||||
|
|
||||||
|
def _extract_images_info(self, doc: Document, path: Path) -> Dict[str, Any]:
|
||||||
|
"""
|
||||||
|
Extract information about images and embedded objects in a Word document.
|
||||||
|
|
||||||
|
Args:
|
||||||
|
doc: Document object
|
||||||
|
path: File path
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
Dictionary of image information
|
||||||
|
"""
|
||||||
|
import zipfile
|
||||||
|
from io import BytesIO
|
||||||
|
|
||||||
|
image_count = 0
|
||||||
|
image_descriptions = []
|
||||||
|
inline_shapes_count = 0
|
||||||
|
|
||||||
|
try:
|
||||||
|
# Method 1: count images via inline shapes
|
||||||
|
try:
|
||||||
|
inline_shapes_count = len(doc.inline_shapes)
|
||||||
|
if inline_shapes_count > 0:
|
||||||
|
image_count = inline_shapes_count
|
||||||
|
image_descriptions.append(f"文档包含 {inline_shapes_count} 个嵌入式图形/图片")
|
||||||
|
except Exception:
|
||||||
|
pass
|
||||||
|
|
||||||
|
# Method 2: open the file as a ZIP archive and look for embedded image files
|
||||||
|
try:
|
||||||
|
with zipfile.ZipFile(path, 'r') as zf:
|
||||||
|
# Look for image files under the word/media directory
|
||||||
|
media_files = [f for f in zf.namelist() if f.startswith('word/media/')]
|
||||||
|
if media_files and not inline_shapes_count:
|
||||||
|
image_count = len(media_files)
|
||||||
|
image_descriptions.append(f"文档包含 {image_count} 个嵌入图片")
|
||||||
|
|
||||||
|
# Check whether the headers/footers contain images
|
||||||
|
header_images = [f for f in zf.namelist() if 'header' in f.lower() and f.endswith(('.png', '.jpg', '.jpeg', '.gif', '.bmp'))]
|
||||||
|
if header_images:
|
||||||
|
image_descriptions.append(f"页眉/页脚包含 {len(header_images)} 个图片")
|
||||||
|
except Exception:
|
||||||
|
pass
|
||||||
|
|
||||||
|
except Exception as e:
|
||||||
|
logger.warning(f"提取图片信息失败: {str(e)}")
|
||||||
|
|
||||||
|
return {
|
||||||
|
"image_count": image_count,
|
||||||
|
"inline_shapes_count": inline_shapes_count,
|
||||||
|
"descriptions": image_descriptions,
|
||||||
|
"has_images": image_count > 0
|
||||||
|
}
|
||||||
|
|
||||||
def _infer_field_type_from_hint(self, hint: str) -> str:
|
def _infer_field_type_from_hint(self, hint: str) -> str:
|
||||||
"""
|
"""
|
||||||
Infer the field type from the hint
|
Infer the field type from the hint
|
||||||
|
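The image extraction added to `DocxParser` above depends on a `.docx` file being a ZIP archive whose embedded images live under `word/media/`. The following sketch shows that mechanism on an in-memory archive, so it runs without python-docx or a real document. `list_media` and `make_fake_docx` are hypothetical names used only for illustration, and the archive built here is not a valid Word file. One deliberate difference from the diff: unknown extensions map to `application/octet-stream` here, whereas the diff falls back to `image/png`.

```python
import base64
import io
import zipfile

MIME_TYPES = {
    "png": "image/png",
    "jpg": "image/jpeg",
    "jpeg": "image/jpeg",
    "gif": "image/gif",
    "bmp": "image/bmp",
}


def list_media(docx_bytes):
    """Return base64-encoded entries for every file under word/media/."""
    images = []
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as zf:
        for name in zf.namelist():
            # Skip everything outside word/media/ and any directory entries
            if not name.startswith("word/media/") or name.endswith("/"):
                continue
            ext = name.rsplit(".", 1)[-1].lower()
            data = zf.read(name)
            images.append({
                "filename": name,
                "mime_type": MIME_TYPES.get(ext, "application/octet-stream"),
                "base64": base64.b64encode(data).decode("ascii"),
                "size": len(data),
            })
    return images


def make_fake_docx():
    """Build a tiny ZIP that only mimics the layout of a .docx archive."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("word/document.xml", "<document/>")
        zf.writestr("word/media/image1.png", b"\x89PNG fake")
    return buf.getvalue()
```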
|||||||
@@ -104,8 +104,15 @@ class XlsxParser(BaseParser):
|
|||||||
# pandas failed to read the file; fall back to the XML approach
|
# pandas failed to read the file; fall back to the XML approach
|
||||||
df = self._read_excel_sheet_xml(file_path, sheet_name=target_sheet, header_row=header_row)
|
df = self._read_excel_sheet_xml(file_path, sheet_name=target_sheet, header_row=header_row)
|
||||||
|
|
||||||
# Check whether the DataFrame is empty
|
# Check whether the DataFrame is empty (it still counts as valid if it has column names)
|
||||||
if df is None or df.empty:
|
if df is None:
|
||||||
|
return ParseResult(
|
||||||
|
success=False,
|
||||||
|
error=f"工作表 '{target_sheet}' 读取失败"
|
||||||
|
)
|
||||||
|
|
||||||
|
# A DataFrame with no rows but with column names (e.g. a template file) still counts as valid
|
||||||
|
if df.empty and len(df.columns) == 0:
|
||||||
return ParseResult(
|
return ParseResult(
|
||||||
success=False,
|
success=False,
|
||||||
error=f"工作表 '{target_sheet}' 为空,请检查 Excel 文件内容"
|
error=f"工作表 '{target_sheet}' 为空,请检查 Excel 文件内容"
|
||||||
@@ -310,24 +317,70 @@ class XlsxParser(BaseParser):
|
|||||||
import zipfile
|
import zipfile
|
||||||
from xml.etree import ElementTree as ET
|
from xml.etree import ElementTree as ET
|
||||||
|
|
||||||
|
# Common namespaces
|
||||||
|
COMMON_NAMESPACES = [
|
||||||
|
'http://schemas.openxmlformats.org/spreadsheetml/2006/main',
|
||||||
|
'http://schemas.openxmlformats.org/spreadsheetml/2005/main',
|
||||||
|
'http://schemas.openxmlformats.org/spreadsheetml/2004/main',
|
||||||
|
'http://schemas.openxmlformats.org/spreadsheetml/2003/main',
|
||||||
|
]
|
||||||
|
|
||||||
try:
|
try:
|
||||||
with zipfile.ZipFile(file_path, 'r') as z:
|
with zipfile.ZipFile(file_path, 'r') as z:
|
||||||
if 'xl/workbook.xml' not in z.namelist():
|
# Try several possible paths for workbook.xml
|
||||||
|
possible_paths = ['xl/workbook.xml', 'xl\\workbook.xml', 'workbook.xml']
|
||||||
|
content = None
|
||||||
|
for path in possible_paths:
|
||||||
|
if path in z.namelist():
|
||||||
|
content = z.read(path)
|
||||||
|
logger.info(f"找到 workbook.xml at: {path}")
|
||||||
|
break
|
||||||
|
|
||||||
|
if content is None:
|
||||||
|
logger.warning(f"未找到 workbook.xml,文件列表: {z.namelist()[:10]}")
|
||||||
return []
|
return []
|
||||||
content = z.read('xl/workbook.xml')
|
|
||||||
root = ET.fromstring(content)
|
root = ET.fromstring(content)
|
||||||
|
|
||||||
# Namespace
|
|
||||||
ns = {'main': 'http://schemas.openxmlformats.org/spreadsheetml/2006/main'}
|
|
||||||
|
|
||||||
sheet_names = []
|
sheet_names = []
|
||||||
for sheet in root.findall('.//main:sheet', ns):
|
|
||||||
name = sheet.get('name')
|
# Method 1: try a namespace-qualified lookup
|
||||||
if name:
|
for ns in COMMON_NAMESPACES:
|
||||||
sheet_names.append(name)
|
sheet_elements = root.findall(f'.//{{{ns}}}sheet')
|
||||||
|
if sheet_elements:
|
||||||
|
for sheet in sheet_elements:
|
||||||
|
name = sheet.get('name')
|
||||||
|
if name:
|
||||||
|
sheet_names.append(name)
|
||||||
|
if sheet_names:
|
||||||
|
logger.info(f"使用命名空间 {ns} 提取工作表: {sheet_names}")
|
||||||
|
return sheet_names
|
||||||
|
|
||||||
|
# Method 2: ignore namespaces and look for all sheet elements directly
|
||||||
|
if not sheet_names:
|
||||||
|
for elem in root.iter():
|
||||||
|
if elem.tag.endswith('sheet') and elem.tag != 'sheets':
|
||||||
|
name = elem.get('name')
|
||||||
|
if name:
|
||||||
|
sheet_names.append(name)
|
||||||
|
for child in elem:
|
||||||
|
if child.tag.endswith('sheet') or child.tag == 'sheet':
|
||||||
|
name = child.get('name')
|
||||||
|
if name and name not in sheet_names:
|
||||||
|
sheet_names.append(name)
|
||||||
|
|
||||||
|
# Method 3: match sheet names in the raw XML text with a regular expression
|
||||||
|
if not sheet_names:
|
||||||
|
import re
|
||||||
|
xml_str = content.decode('utf-8', errors='ignore')
|
||||||
|
matches = re.findall(r'<sheet\s+[^>]*name=["\']([^"\']+)["\']', xml_str, re.IGNORECASE)
|
||||||
|
if matches:
|
||||||
|
sheet_names = matches
|
||||||
|
logger.info(f"使用正则提取工作表: {sheet_names}")
|
||||||
|
|
||||||
logger.info(f"从 XML 提取工作表: {sheet_names}")
|
logger.info(f"从 XML 提取工作表: {sheet_names}")
|
||||||
return sheet_names
|
return sheet_names
|
||||||
|
|
||||||
except Exception as e:
|
except Exception as e:
|
||||||
logger.error(f"从 XML 提取工作表名称失败: {e}")
|
logger.error(f"从 XML 提取工作表名称失败: {e}")
|
||||||
return []
|
return []
|
||||||
@@ -349,6 +402,32 @@ class XlsxParser(BaseParser):
|
|||||||
import zipfile
|
import zipfile
|
||||||
from xml.etree import ElementTree as ET
|
from xml.etree import ElementTree as ET
|
||||||
|
|
||||||
|
# Common namespaces
|
||||||
|
COMMON_NAMESPACES = [
|
||||||
|
'http://schemas.openxmlformats.org/spreadsheetml/2006/main',
|
||||||
|
'http://schemas.openxmlformats.org/spreadsheetml/2005/main',
|
||||||
|
'http://schemas.openxmlformats.org/spreadsheetml/2004/main',
|
||||||
|
'http://schemas.openxmlformats.org/spreadsheetml/2003/main',
|
||||||
|
]
|
||||||
|
|
||||||
|
def find_elements_with_ns(root, tag_name):
|
||||||
|
"""Find elements flexibly, whatever namespace they are in."""
|
||||||
|
results = []
|
||||||
|
# Method 1: use the fixed list of namespaces
|
||||||
|
for ns in COMMON_NAMESPACES:
|
||||||
|
try:
|
||||||
|
elems = root.findall(f'.//{{{ns}}}{tag_name}')
|
||||||
|
if elems:
|
||||||
|
results.extend(elems)
|
||||||
|
except Exception:
|
||||||
|
pass
|
||||||
|
# Method 2: match on the local tag name, ignoring the namespace
|
||||||
|
if not results:
|
||||||
|
for elem in root.iter():
|
||||||
|
if elem.tag == tag_name or elem.tag.endswith('}' + tag_name):
|
||||||
|
results.append(elem)
|
||||||
|
return results
|
||||||
|
|
||||||
with zipfile.ZipFile(file_path, 'r') as z:
|
with zipfile.ZipFile(file_path, 'r') as z:
|
||||||
# Get the worksheet names
|
# Get the worksheet names
|
||||||
sheet_names = self._extract_sheet_names_from_xml(file_path)
|
sheet_names = self._extract_sheet_names_from_xml(file_path)
|
||||||
@@ -359,57 +438,68 @@ class XlsxParser(BaseParser):
|
|||||||
target_sheet = sheet_name if sheet_name and sheet_name in sheet_names else sheet_names[0]
|
target_sheet = sheet_name if sheet_name and sheet_name in sheet_names else sheet_names[0]
|
||||||
sheet_index = sheet_names.index(target_sheet) + 1 # sheet1.xml, sheet2.xml, ...
|
sheet_index = sheet_names.index(target_sheet) + 1 # sheet1.xml, sheet2.xml, ...
|
||||||
|
|
||||||
# Read the shared strings
|
# Read the shared strings, trying several paths
|
||||||
shared_strings = []
|
shared_strings = []
|
||||||
if 'xl/sharedStrings.xml' in z.namelist():
|
ss_paths = ['xl/sharedStrings.xml', 'xl\\sharedStrings.xml', 'sharedStrings.xml']
|
||||||
ss_content = z.read('xl/sharedStrings.xml')
|
for ss_path in ss_paths:
|
||||||
ss_root = ET.fromstring(ss_content)
|
if ss_path in z.namelist():
|
||||||
ns = {'main': 'http://schemas.openxmlformats.org/spreadsheetml/2006/main'}
|
try:
|
||||||
for si in ss_root.findall('.//main:si', ns):
|
ss_content = z.read(ss_path)
|
||||||
t = si.find('.//main:t', ns)
|
ss_root = ET.fromstring(ss_content)
|
||||||
if t is not None:
|
for si in find_elements_with_ns(ss_root, 'si'):
|
||||||
shared_strings.append(t.text or '')
|
t_elements = [c for c in si.iter() if c.tag.endswith('}t') or c.tag == 't']
|
||||||
else:
|
if t_elements:
|
||||||
shared_strings.append('')
|
shared_strings.append(t_elements[0].text or '')
|
||||||
|
else:
|
||||||
|
shared_strings.append('')
|
||||||
|
break
|
||||||
|
except Exception as e:
|
||||||
|
logger.warning(f"读取 sharedStrings 失败: {e}")
|
||||||
|
|
||||||
# Read the worksheet
|
# Read the worksheet, trying several possible paths
|
||||||
sheet_file = f'xl/worksheets/sheet{sheet_index}.xml'
|
sheet_content = None
|
||||||
if sheet_file not in z.namelist():
|
sheet_paths = [
|
||||||
raise ValueError(f"工作表文件 {sheet_file} 不存在")
|
f'xl/worksheets/sheet{sheet_index}.xml',
|
||||||
|
f'xl\\worksheets\\sheet{sheet_index}.xml',
|
||||||
|
f'worksheets/sheet{sheet_index}.xml',
|
||||||
|
]
|
||||||
|
for sp in sheet_paths:
|
||||||
|
if sp in z.namelist():
|
||||||
|
sheet_content = z.read(sp)
|
||||||
|
break
|
||||||
|
|
||||||
|
if sheet_content is None:
|
||||||
|
raise ValueError(f"工作表文件 sheet{sheet_index}.xml 不存在")
|
||||||
|
|
||||||
sheet_content = z.read(sheet_file)
|
|
||||||
root = ET.fromstring(sheet_content)
|
root = ET.fromstring(sheet_content)
|
||||||
ns = {'main': 'http://schemas.openxmlformats.org/spreadsheetml/2006/main'}
|
|
||||||
|
|
||||||
# Collect the data of every row
|
# Collect the data of every row
|
||||||
all_rows = []
|
all_rows = []
|
||||||
headers = {}
|
headers = {}
|
||||||
|
|
||||||
for row in root.findall('.//main:row', ns):
|
for row in find_elements_with_ns(root, 'row'):
|
||||||
row_idx = int(row.get('r', 0))
|
row_idx = int(row.get('r', 0))
|
||||||
row_cells = {}
|
row_cells = {}
|
||||||
for cell in row.findall('main:c', ns):
|
for cell in find_elements_with_ns(row, 'c'):
|
||||||
cell_ref = cell.get('r', '')
|
cell_ref = cell.get('r', '')
|
||||||
col_letters = ''.join(filter(str.isalpha, cell_ref))
|
col_letters = ''.join(filter(str.isalpha, cell_ref))
|
||||||
cell_type = cell.get('t', 'n')
|
cell_type = cell.get('t', 'n')
|
||||||
v = cell.find('main:v', ns)
|
v_elements = find_elements_with_ns(cell, 'v')
|
||||||
|
v = v_elements[0] if v_elements else None
|
||||||
|
|
||||||
if v is not None and v.text:
|
if v is not None and v.text:
|
||||||
if cell_type == 's':
|
if cell_type == 's':
|
||||||
# shared string
|
|
||||||
try:
|
try:
|
||||||
row_cells[col_letters] = shared_strings[int(v.text)]
|
row_cells[col_letters] = shared_strings[int(v.text)]
|
||||||
except (ValueError, IndexError):
|
except (ValueError, IndexError):
|
||||||
row_cells[col_letters] = v.text
|
row_cells[col_letters] = v.text
|
||||||
elif cell_type == 'b':
|
elif cell_type == 'b':
|
||||||
# boolean
|
|
||||||
row_cells[col_letters] = v.text == '1'
|
row_cells[col_letters] = v.text == '1'
|
||||||
else:
|
else:
|
||||||
row_cells[col_letters] = v.text
|
row_cells[col_letters] = v.text
|
||||||
else:
|
else:
|
||||||
row_cells[col_letters] = None
|
row_cells[col_letters] = None
|
||||||
|
|
||||||
# 处理表头行
|
|
||||||
if row_idx == header_row + 1:
|
if row_idx == header_row + 1:
|
||||||
headers = {**row_cells}
|
headers = {**row_cells}
|
||||||
elif row_idx > header_row + 1:
|
elif row_idx > header_row + 1:
|
||||||
@@ -417,7 +507,6 @@ class XlsxParser(BaseParser):
|
|||||||
|
|
||||||
# 构建 DataFrame
|
# 构建 DataFrame
|
||||||
if headers:
|
if headers:
|
||||||
# 按原始列顺序排列
|
|
||||||
col_order = list(headers.keys())
|
col_order = list(headers.keys())
|
||||||
df = pd.DataFrame(all_rows)
|
df = pd.DataFrame(all_rows)
|
||||||
if not df.empty:
|
if not df.empty:
|
||||||
|
|||||||
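The hunk above swaps prefix-based lookups such as `root.findall('.//main:row', ns)` for a `find_elements_with_ns` helper whose body does not appear in this part of the diff. The following is a minimal sketch of how such a helper might behave, assuming it matches elements by local tag name regardless of namespace URI. The function name `find_elements_by_local_name` and the sample XML are hypothetical, not the project's code.

```python
import xml.etree.ElementTree as ET
from typing import List


def find_elements_by_local_name(parent: ET.Element, local_name: str) -> List[ET.Element]:
    """Return descendants of `parent` whose tag, ignoring any namespace, equals `local_name`.

    Hypothetical stand-in for the diff's `find_elements_with_ns` helper.
    """
    matches = []
    for elem in parent.iter():
        if elem is parent:
            continue
        # ElementTree stores namespaced tags as "{uri}local"; drop the "{uri}" part.
        tag = elem.tag.rsplit("}", 1)[-1] if isinstance(elem.tag, str) else ""
        if tag == local_name:
            matches.append(elem)
    return matches


SHEET_XML = (
    '<worksheet xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main">'
    '<sheetData><row r="1"><c r="A1" t="s"><v>0</v></c><c r="B1"><v>42</v></c></row></sheetData>'
    "</worksheet>"
)

root = ET.fromstring(SHEET_XML)
rows = find_elements_by_local_name(root, "row")
cells = find_elements_by_local_name(rows[0], "c")
cell_refs = [c.get("r") for c in cells]
```

On Python 3.8 and later, `root.findall('.//{*}row')` is a built-in way to get a similar namespace-agnostic match.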
@@ -0,0 +1,14 @@
|
|||||||
|
"""
|
||||||
|
Instruction execution module
|
||||||
|
|
||||||
|
Supports intelligent document operations, including intent parsing and instruction execution
|
||||||
|
"""
|
||||||
|
from .intent_parser import IntentParser, intent_parser
|
||||||
|
from .executor import InstructionExecutor, instruction_executor
|
||||||
|
|
||||||
|
__all__ = [
|
||||||
|
"IntentParser",
|
||||||
|
"intent_parser",
|
||||||
|
"InstructionExecutor",
|
||||||
|
"instruction_executor",
|
||||||
|
]
|
||||||
|
|||||||
@@ -0,0 +1,572 @@
|
|||||||
|
"""
|
||||||
|
Instruction executor module
|
||||||
|
|
||||||
|
Converts natural-language instructions into executable operations
|
||||||
|
"""
|
||||||
|
import logging
|
||||||
|
import json
|
||||||
|
from typing import Any, Dict, List, Optional
|
||||||
|
|
||||||
|
from app.services.template_fill_service import template_fill_service
|
||||||
|
from app.services.rag_service import rag_service
|
||||||
|
from app.services.markdown_ai_service import markdown_ai_service
|
||||||
|
from app.core.database import mongodb
|
||||||
|
|
||||||
|
logger = logging.getLogger(__name__)
|
||||||
|
|
||||||
|
|
||||||
|
class InstructionExecutor:
|
||||||
|
"""指令执行器"""
|
||||||
|
|
||||||
|
def __init__(self):
|
||||||
|
self.intent_parser = None # 将通过 set_intent_parser 设置
|
||||||
|
|
||||||
|
def set_intent_parser(self, intent_parser):
|
||||||
|
"""设置意图解析器"""
|
||||||
|
self.intent_parser = intent_parser
|
||||||
|
|
||||||
|
async def execute(self, instruction: str, context: Optional[Dict[str, Any]] = None) -> Dict[str, Any]:
|
||||||
|
"""
|
||||||
|
执行指令
|
||||||
|
|
||||||
|
Args:
|
||||||
|
instruction: 自然语言指令
|
||||||
|
context: 执行上下文(包含文档信息等)
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
执行结果
|
||||||
|
"""
|
||||||
|
if self.intent_parser is None:
|
||||||
|
from app.instruction.intent_parser import intent_parser
|
||||||
|
self.intent_parser = intent_parser
|
||||||
|
|
||||||
|
context = context or {}
|
||||||
|
|
||||||
|
# 解析意图
|
||||||
|
intent, params = await self.intent_parser.parse(instruction)
|
||||||
|
|
||||||
|
# 根据意图类型执行相应操作
|
||||||
|
if intent == "extract":
|
||||||
|
return await self._execute_extract(params, context)
|
||||||
|
elif intent == "fill_table":
|
||||||
|
return await self._execute_fill_table(params, context)
|
||||||
|
elif intent == "summarize":
|
||||||
|
return await self._execute_summarize(params, context)
|
||||||
|
elif intent == "question":
|
||||||
|
return await self._execute_question(params, context)
|
||||||
|
elif intent == "search":
|
||||||
|
return await self._execute_search(params, context)
|
||||||
|
elif intent == "compare":
|
||||||
|
return await self._execute_compare(params, context)
|
||||||
|
elif intent == "edit":
|
||||||
|
return await self._execute_edit(params, context)
|
||||||
|
elif intent == "transform":
|
||||||
|
return await self._execute_transform(params, context)
|
||||||
|
else:
|
||||||
|
return {
|
||||||
|
"success": False,
|
||||||
|
"error": f"未知意图类型: {intent}",
|
||||||
|
"message": "无法理解该指令,请尝试更明确的描述"
|
||||||
|
}
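The `if`/`elif` chain in `execute` maps each intent string to one `_execute_*` coroutine. The same routing can be expressed as a dictionary lookup. The sketch below is a toy illustration under assumptions: `MiniExecutor` and its two handlers are hypothetical and only mimic the shape of the result dictionaries used in this diff.

```python
import asyncio
from typing import Any, Awaitable, Callable, Dict

Handler = Callable[[Dict[str, Any], Dict[str, Any]], Awaitable[Dict[str, Any]]]


class MiniExecutor:
    """Toy executor that routes an intent string to a handler via a dict lookup."""

    def __init__(self) -> None:
        self._handlers: Dict[str, Handler] = {
            "summarize": self._summarize,
            "compare": self._compare,
        }

    async def execute(
        self, intent: str, params: Dict[str, Any], context: Dict[str, Any]
    ) -> Dict[str, Any]:
        handler = self._handlers.get(intent)
        if handler is None:
            # Same fallback role as the final `else` branch of the chain.
            return {"success": False, "error": f"unknown intent: {intent}"}
        return await handler(params, context)

    async def _summarize(self, params: Dict[str, Any], context: Dict[str, Any]) -> Dict[str, Any]:
        docs = context.get("source_docs", [])
        return {"success": bool(docs), "intent": "summarize", "count": len(docs)}

    async def _compare(self, params: Dict[str, Any], context: Dict[str, Any]) -> Dict[str, Any]:
        docs = context.get("source_docs", [])
        if len(docs) < 2:
            return {"success": False, "error": "compare needs at least 2 documents"}
        return {"success": True, "intent": "compare", "count": len(docs)}


executor = MiniExecutor()
ok = asyncio.run(executor.execute("summarize", {}, {"source_docs": [{"content": "a"}]}))
too_few = asyncio.run(executor.execute("compare", {}, {"source_docs": [{"content": "a"}]}))
unknown = asyncio.run(executor.execute("translate", {}, {}))
```

With a table like this, adding an intent means registering one more entry rather than extending the conditional.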
|
||||||
|
|
||||||
|
async def _execute_extract(self, params: Dict[str, Any], context: Dict[str, Any]) -> Dict[str, Any]:
|
||||||
|
"""执行信息提取"""
|
||||||
|
try:
|
||||||
|
target_fields = params.get("target_fields") or params.get("field_refs", [])
|
||||||
|
doc_ids = params.get("document_refs", [])
|
||||||
|
|
||||||
|
if not target_fields:
|
||||||
|
return {
|
||||||
|
"success": False,
|
||||||
|
"error": "未指定要提取的字段",
|
||||||
|
"message": "请明确说明要提取哪些字段,如:'提取医院数量和床位数'"
|
||||||
|
}
|
||||||
|
|
||||||
|
# 如果指定了文档,验证文档存在
|
||||||
|
if doc_ids and "all_docs" not in doc_ids:
|
||||||
|
valid_docs = []
|
||||||
|
for doc_ref in doc_ids:
|
||||||
|
doc_id = doc_ref.replace("doc_", "")
|
||||||
|
doc = await mongodb.get_document(doc_id)
|
||||||
|
if doc:
|
||||||
|
valid_docs.append(doc)
|
||||||
|
if not valid_docs:
|
||||||
|
return {
|
||||||
|
"success": False,
|
||||||
|
"error": "指定的文档不存在",
|
||||||
|
"message": "请检查文档编号是否正确"
|
||||||
|
}
|
||||||
|
context["source_docs"] = valid_docs
|
||||||
|
|
||||||
|
# 构建字段列表
|
||||||
|
fields = []
|
||||||
|
for i, field_name in enumerate(target_fields):
|
||||||
|
fields.append({
|
||||||
|
"name": field_name,
|
||||||
|
"cell": f"A{i+1}",
|
||||||
|
"field_type": "text",
|
||||||
|
"required": False
|
||||||
|
})
|
||||||
|
|
||||||
|
# 调用填表服务
|
||||||
|
result = await template_fill_service.fill_template(
|
||||||
|
template_fields=fields,
|
||||||
|
source_doc_ids=[doc.get("_id") for doc in context.get("source_docs", [])] if context.get("source_docs") else None,
|
||||||
|
user_hint=f"请提取字段: {', '.join(target_fields)}"
|
||||||
|
)
|
||||||
|
|
||||||
|
return {
|
||||||
|
"success": True,
|
||||||
|
"intent": "extract",
|
||||||
|
"extracted_data": result.get("filled_data", {}),
|
||||||
|
"fields": target_fields,
|
||||||
|
"message": f"成功提取 {len(result.get('filled_data', {}))} 个字段"
|
||||||
|
}
|
||||||
|
|
||||||
|
except Exception as e:
|
||||||
|
logger.error(f"提取执行失败: {e}")
|
||||||
|
return {
|
||||||
|
"success": False,
|
||||||
|
"error": str(e),
|
||||||
|
"message": f"提取失败: {str(e)}"
|
||||||
|
}
|
||||||
|
|
||||||
|
async def _execute_fill_table(self, params: Dict[str, Any], context: Dict[str, Any]) -> Dict[str, Any]:
|
||||||
|
"""执行填表操作"""
|
||||||
|
try:
|
||||||
|
template_file = context.get("template_file")
|
||||||
|
if not template_file:
|
||||||
|
return {
|
||||||
|
"success": False,
|
||||||
|
"error": "未提供表格模板",
|
||||||
|
"message": "请先上传要填写的表格模板"
|
||||||
|
}
|
||||||
|
|
||||||
|
# 获取源文档
|
||||||
|
source_docs = context.get("source_docs", [])
|
||||||
|
source_doc_ids = [doc.get("_id") for doc in source_docs if doc.get("_id")]
|
||||||
|
|
||||||
|
# 获取字段
|
||||||
|
fields = context.get("template_fields", [])
|
||||||
|
|
||||||
|
# 调用填表服务
|
||||||
|
result = await template_fill_service.fill_template(
|
||||||
|
template_fields=fields,
|
||||||
|
source_doc_ids=source_doc_ids if source_doc_ids else None,
|
||||||
|
source_file_paths=context.get("source_file_paths"),
|
||||||
|
user_hint=params.get("user_hint"),
|
||||||
|
template_id=template_file if isinstance(template_file, str) else None,
|
||||||
|
template_file_type=(params.get("template") or {}).get("type", "xlsx")
|
||||||
|
)
|
||||||
|
|
||||||
|
return {
|
||||||
|
"success": True,
|
||||||
|
"intent": "fill_table",
|
||||||
|
"result": result,
|
||||||
|
"message": f"填表完成,成功填写 {len(result.get('filled_data', {}))} 个字段"
|
||||||
|
}
|
||||||
|
|
||||||
|
except Exception as e:
|
||||||
|
logger.error(f"填表执行失败: {e}")
|
||||||
|
return {
|
||||||
|
"success": False,
|
||||||
|
"error": str(e),
|
||||||
|
"message": f"填表失败: {str(e)}"
|
||||||
|
}
|
||||||
|
|
||||||
|
async def _execute_summarize(self, params: Dict[str, Any], context: Dict[str, Any]) -> Dict[str, Any]:
|
||||||
|
"""执行摘要总结"""
|
||||||
|
try:
|
||||||
|
docs = context.get("source_docs", [])
|
||||||
|
if not docs:
|
||||||
|
return {
|
||||||
|
"success": False,
|
||||||
|
"error": "没有可用的文档",
|
||||||
|
"message": "请先上传要总结的文档"
|
||||||
|
}
|
||||||
|
|
||||||
|
summaries = []
|
||||||
|
for doc in docs[:5]: # 最多处理5个文档
|
||||||
|
content = doc.get("content", "")[:5000] # 限制内容长度
|
||||||
|
if content:
|
||||||
|
summaries.append({
|
||||||
|
"filename": doc.get("metadata", {}).get("original_filename", "未知"),
|
||||||
|
"content_preview": content[:500] + "..." if len(content) > 500 else content
|
||||||
|
})
|
||||||
|
|
||||||
|
return {
|
||||||
|
"success": True,
|
||||||
|
"intent": "summarize",
|
||||||
|
"summaries": summaries,
|
||||||
|
"message": f"找到 {len(summaries)} 个文档可供参考"
|
||||||
|
}
|
||||||
|
|
||||||
|
except Exception as e:
|
||||||
|
logger.error(f"摘要执行失败: {e}")
|
||||||
|
return {
|
||||||
|
"success": False,
|
||||||
|
"error": str(e),
|
||||||
|
"message": f"摘要生成失败: {str(e)}"
|
||||||
|
}
|
||||||
|
|
||||||
|
async def _execute_question(self, params: Dict[str, Any], context: Dict[str, Any]) -> Dict[str, Any]:
|
||||||
|
"""执行问答"""
|
||||||
|
try:
|
||||||
|
question = params.get("question", "")
|
||||||
|
if not question:
|
||||||
|
return {
|
||||||
|
"success": False,
|
||||||
|
"error": "未提供问题",
|
||||||
|
"message": "请输入要回答的问题"
|
||||||
|
}
|
||||||
|
|
||||||
|
# 使用 RAG 检索相关文档
|
||||||
|
docs = context.get("source_docs", [])
|
||||||
|
rag_results = []
|
||||||
|
|
||||||
|
for doc in docs:
|
||||||
|
doc_id = doc.get("_id", "")
|
||||||
|
if doc_id:
|
||||||
|
results = rag_service.retrieve_by_doc_id(doc_id, top_k=3)
|
||||||
|
rag_results.extend(results)
|
||||||
|
|
||||||
|
# 构建上下文
|
||||||
|
context_text = "\n\n".join([
|
||||||
|
r.get("content", "") for r in rag_results[:5]
|
||||||
|
]) if rag_results else ""
|
||||||
|
|
||||||
|
# 如果没有 RAG 结果,使用文档内容
|
||||||
|
if not context_text:
|
||||||
|
context_text = "\n\n".join([
|
||||||
|
doc.get("content", "")[:3000] for doc in docs[:3] if doc.get("content")
|
||||||
|
])
|
||||||
|
|
||||||
|
return {
|
||||||
|
"success": True,
|
||||||
|
"intent": "question",
|
||||||
|
"question": question,
|
||||||
|
"context_preview": context_text[:500] + "..." if len(context_text) > 500 else context_text,
|
||||||
|
"message": "已找到相关上下文,可进行问答"
|
||||||
|
}
|
||||||
|
|
||||||
|
except Exception as e:
|
||||||
|
logger.error(f"问答执行失败: {e}")
|
||||||
|
return {
|
||||||
|
"success": False,
|
||||||
|
"error": str(e),
|
||||||
|
"message": f"问答处理失败: {str(e)}"
|
||||||
|
}
|
||||||
|
|
||||||
|
async def _execute_search(self, params: Dict[str, Any], context: Dict[str, Any]) -> Dict[str, Any]:
|
||||||
|
"""执行搜索"""
|
||||||
|
try:
|
||||||
|
field_refs = params.get("field_refs", [])
|
||||||
|
query = " ".join(field_refs) if field_refs else params.get("question", "")
|
||||||
|
|
||||||
|
if not query:
|
||||||
|
return {
|
||||||
|
"success": False,
|
||||||
|
"error": "未提供搜索关键词",
|
||||||
|
"message": "请输入要搜索的关键词"
|
||||||
|
}
|
||||||
|
|
||||||
|
# 使用 RAG 检索
|
||||||
|
results = rag_service.retrieve(query, top_k=10, min_score=0.3)
|
||||||
|
|
||||||
|
return {
|
||||||
|
"success": True,
|
||||||
|
"intent": "search",
|
||||||
|
"query": query,
|
||||||
|
"results": [
|
||||||
|
{
|
||||||
|
"content": r.get("content", "")[:200],
|
||||||
|
"score": r.get("score", 0),
|
||||||
|
"doc_id": r.get("doc_id", "")
|
||||||
|
}
|
||||||
|
for r in results[:10]
|
||||||
|
],
|
||||||
|
"message": f"找到 {len(results)} 条相关结果"
|
||||||
|
}
|
||||||
|
|
||||||
|
except Exception as e:
|
||||||
|
logger.error(f"搜索执行失败: {e}")
|
||||||
|
return {
|
||||||
|
"success": False,
|
||||||
|
"error": str(e),
|
||||||
|
"message": f"搜索失败: {str(e)}"
|
||||||
|
}
|
||||||
|
|
||||||
|
async def _execute_compare(self, params: Dict[str, Any], context: Dict[str, Any]) -> Dict[str, Any]:
|
||||||
|
"""执行对比分析"""
|
||||||
|
try:
|
||||||
|
docs = context.get("source_docs", [])
|
||||||
|
if len(docs) < 2:
|
||||||
|
return {
|
||||||
|
"success": False,
|
||||||
|
"error": "对比需要至少2个文档",
|
||||||
|
"message": "请上传至少2个文档进行对比"
|
||||||
|
}
|
||||||
|
|
||||||
|
# 提取文档基本信息
|
||||||
|
comparison = []
|
||||||
|
for i, doc in enumerate(docs[:5]):
|
||||||
|
comparison.append({
|
||||||
|
"index": i + 1,
|
||||||
|
"filename": doc.get("metadata", {}).get("original_filename", "未知"),
|
||||||
|
"doc_type": doc.get("doc_type", "未知"),
|
||||||
|
"content_length": len(doc.get("content", "")),
|
||||||
|
"has_tables": bool(doc.get("structured_data", {}).get("tables")),
|
||||||
|
})
|
||||||
|
|
||||||
|
return {
|
||||||
|
"success": True,
|
||||||
|
"intent": "compare",
|
||||||
|
"comparison": comparison,
|
||||||
|
"message": f"对比了 {len(comparison)} 个文档的基本信息"
|
||||||
|
}
|
||||||
|
|
||||||
|
except Exception as e:
|
||||||
|
logger.error(f"对比执行失败: {e}")
|
||||||
|
return {
|
||||||
|
"success": False,
|
||||||
|
"error": str(e),
|
||||||
|
"message": f"对比分析失败: {str(e)}"
|
||||||
|
}
|
||||||
|
|
||||||
|
async def _execute_edit(self, params: Dict[str, Any], context: Dict[str, Any]) -> Dict[str, Any]:
|
||||||
|
"""执行文档编辑操作"""
|
||||||
|
try:
|
||||||
|
docs = context.get("source_docs", [])
|
||||||
|
if not docs:
|
||||||
|
return {
|
||||||
|
"success": False,
|
||||||
|
"error": "没有可用的文档",
|
||||||
|
"message": "请先上传要编辑的文档"
|
||||||
|
}
|
||||||
|
|
||||||
|
doc = docs[0] # 默认编辑第一个文档
|
||||||
|
content = doc.get("content", "")
|
||||||
|
original_filename = doc.get("metadata", {}).get("original_filename", "未知文档")
|
||||||
|
|
||||||
|
if not content:
|
||||||
|
return {
|
||||||
|
"success": False,
|
||||||
|
"error": "文档内容为空",
|
||||||
|
"message": "该文档没有可编辑的内容"
|
||||||
|
}
|
||||||
|
|
||||||
|
# 使用 LLM 进行文本润色/编辑
|
||||||
|
prompt = f"""请对以下文档内容进行编辑处理。
|
||||||
|
|
||||||
|
原文内容:
|
||||||
|
{content[:8000]}
|
||||||
|
|
||||||
|
编辑要求:
|
||||||
|
- 润色表述,使其更加专业流畅
|
||||||
|
- 修正明显的语法错误
|
||||||
|
- 保持原意不变
|
||||||
|
- 只返回编辑后的内容,不要解释
|
||||||
|
|
||||||
|
请直接输出编辑后的内容:"""
|
||||||
|
|
||||||
|
messages = [
|
||||||
|
{"role": "system", "content": "你是一个专业的文本编辑助手。请直接输出编辑后的内容。"},
|
||||||
|
{"role": "user", "content": prompt}
|
||||||
|
]
|
||||||
|
|
||||||
|
from app.services.llm_service import llm_service
|
||||||
|
response = await llm_service.chat(messages=messages, temperature=0.3, max_tokens=8000)
|
||||||
|
edited_content = llm_service.extract_message_content(response)
|
||||||
|
|
||||||
|
return {
|
||||||
|
"success": True,
|
||||||
|
"intent": "edit",
|
||||||
|
"edited_content": edited_content,
|
||||||
|
"original_filename": original_filename,
|
||||||
|
"message": "文档编辑完成,内容已返回"
|
||||||
|
}
|
||||||
|
|
||||||
|
except Exception as e:
|
||||||
|
logger.error(f"编辑执行失败: {e}")
|
||||||
|
return {
|
||||||
|
"success": False,
|
||||||
|
"error": str(e),
|
||||||
|
"message": f"编辑处理失败: {str(e)}"
|
||||||
|
}
|
||||||
|
|
||||||
|
async def _execute_transform(self, params: Dict[str, Any], context: Dict[str, Any]) -> Dict[str, Any]:
|
||||||
|
"""
|
||||||
|
执行格式转换操作
|
||||||
|
|
||||||
|
支持:
|
||||||
|
- Word -> Excel
|
||||||
|
- Excel -> Word
|
||||||
|
- Markdown -> Word
|
||||||
|
- Word -> Markdown
|
||||||
|
"""
|
||||||
|
try:
|
||||||
|
docs = context.get("source_docs", [])
|
||||||
|
if not docs:
|
||||||
|
return {
|
||||||
|
"success": False,
|
||||||
|
"error": "没有可用的文档",
|
||||||
|
"message": "请先上传要转换的文档"
|
||||||
|
}
|
||||||
|
|
||||||
|
# 获取目标格式
|
||||||
|
template_info = params.get("template", {})
|
||||||
|
target_type = template_info.get("type", "")
|
||||||
|
|
||||||
|
if not target_type:
|
||||||
|
# 尝试从指令中推断
|
||||||
|
instruction = params.get("instruction", "")
|
||||||
|
if "excel" in instruction.lower() or "xlsx" in instruction.lower():
|
||||||
|
target_type = "xlsx"
|
||||||
|
elif "word" in instruction.lower() or "docx" in instruction.lower():
|
||||||
|
target_type = "docx"
|
||||||
|
elif "markdown" in instruction.lower() or "md" in instruction.lower():
|
||||||
|
target_type = "md"
|
||||||
|
|
||||||
|
if not target_type:
|
||||||
|
return {
|
||||||
|
"success": False,
|
||||||
|
"error": "未指定目标格式",
|
||||||
|
"message": "请说明要转换成什么格式(如:转成Excel、转成Word)"
|
||||||
|
}
|
||||||
|
|
||||||
|
doc = docs[0]
|
||||||
|
content = doc.get("content", "")
|
||||||
|
structured_data = doc.get("structured_data", {})
|
||||||
|
original_filename = doc.get("metadata", {}).get("original_filename", "未知文档")
|
||||||
|
|
||||||
|
# 构建转换内容
|
||||||
|
if structured_data.get("tables"):
|
||||||
|
# 有表格数据,生成表格格式的内容
|
||||||
|
tables = structured_data.get("tables", [])
|
||||||
|
table_content = []
|
||||||
|
for i, table in enumerate(tables[:3]): # 最多处理3个表格
|
||||||
|
headers = table.get("headers", [])
|
||||||
|
rows = table.get("rows", [])[:20] # 最多20行
|
||||||
|
if headers:
|
||||||
|
table_content.append(f"【表格 {i+1}】")
|
||||||
|
table_content.append(" | ".join(str(h) for h in headers))
|
||||||
|
table_content.append(" | ".join(["---"] * len(headers)))
|
||||||
|
for row in rows:
|
||||||
|
if isinstance(row, list):
|
||||||
|
table_content.append(" | ".join(str(c) for c in row))
|
||||||
|
elif isinstance(row, dict):
|
||||||
|
table_content.append(" | ".join(str(row.get(h, "")) for h in headers))
|
||||||
|
table_content.append("")
|
||||||
|
|
||||||
|
if target_type == "xlsx":
|
||||||
|
# 生成 Excel 格式的数据(JSON)
|
||||||
|
excel_data = []
|
||||||
|
for table in tables[:1]: # 只处理第一个表格
|
||||||
|
headers = table.get("headers", [])
|
||||||
|
rows = table.get("rows", [])[:100]
|
||||||
|
for row in rows:
|
||||||
|
if isinstance(row, list):
|
||||||
|
excel_data.append(dict(zip(headers, row)))
|
||||||
|
elif isinstance(row, dict):
|
||||||
|
excel_data.append(row)
|
||||||
|
|
||||||
|
return {
|
||||||
|
"success": True,
|
||||||
|
"intent": "transform",
|
||||||
|
"transform_type": "to_excel",
|
||||||
|
"target_format": "xlsx",
|
||||||
|
"excel_data": excel_data,
|
||||||
|
"headers": headers,
|
||||||
|
"message": f"已转换为 Excel 格式,包含 {len(excel_data)} 行数据"
|
||||||
|
}
|
||||||
|
elif target_type in ["docx", "word"]:
|
||||||
|
# 生成 Word 格式的文本
|
||||||
|
word_content = f"# {original_filename}\n\n"
|
||||||
|
word_content += "\n".join(table_content)
|
||||||
|
|
||||||
|
return {
|
||||||
|
"success": True,
|
||||||
|
"intent": "transform",
|
||||||
|
"transform_type": "to_word",
|
||||||
|
"target_format": "docx",
|
||||||
|
"content": word_content,
|
||||||
|
"message": "已转换为 Word 格式"
|
||||||
|
}
|
||||||
|
elif target_type == "md":
|
||||||
|
# 生成 Markdown 格式
|
||||||
|
md_content = f"# {original_filename}\n\n"
|
||||||
|
md_content += "\n".join(table_content)
|
||||||
|
|
||||||
|
return {
|
||||||
|
"success": True,
|
||||||
|
"intent": "transform",
|
||||||
|
"transform_type": "to_markdown",
|
||||||
|
"target_format": "md",
|
||||||
|
"content": md_content,
|
||||||
|
"message": "已转换为 Markdown 格式"
|
||||||
|
}
|
||||||
|
|
||||||
|
# 无表格数据,使用纯文本内容转换
|
||||||
|
if target_type == "xlsx":
|
||||||
|
# 将文本内容转为 Excel 格式(每行作为一列)
|
||||||
|
lines = [line.strip() for line in content.split("\n") if line.strip()][:100]
|
||||||
|
excel_data = [{"行号": i+1, "内容": line} for i, line in enumerate(lines)]
|
||||||
|
|
||||||
|
return {
|
||||||
|
"success": True,
|
||||||
|
"intent": "transform",
|
||||||
|
"transform_type": "to_excel",
|
||||||
|
"target_format": "xlsx",
|
||||||
|
"excel_data": excel_data,
|
||||||
|
"headers": ["行号", "内容"],
|
||||||
|
"message": f"已将文本内容转换为 Excel,包含 {len(excel_data)} 行"
|
||||||
|
}
|
||||||
|
elif target_type in ["docx", "word"]:
|
||||||
|
return {
|
||||||
|
"success": True,
|
||||||
|
"intent": "transform",
|
||||||
|
"transform_type": "to_word",
|
||||||
|
"target_format": "docx",
|
||||||
|
"content": content,
|
||||||
|
"message": "文档内容已准备好,可下载为 Word 格式"
|
||||||
|
}
|
||||||
|
elif target_type == "md":
|
||||||
|
# 简单的文本转 Markdown
|
||||||
|
md_lines = []
|
||||||
|
for line in content.split("\n"):
|
||||||
|
line = line.strip()
|
||||||
|
if line:
|
||||||
|
# 简单处理:如果行不长且不是列表格式,作为段落
|
||||||
|
if len(line) < 100 and not line.startswith(("-", "*", "1.", "2.", "3.")):
|
||||||
|
md_lines.append(line)
|
||||||
|
else:
|
||||||
|
md_lines.append(line)
|
||||||
|
else:
|
||||||
|
md_lines.append("")
|
||||||
|
|
||||||
|
return {
|
||||||
|
"success": True,
|
||||||
|
"intent": "transform",
|
||||||
|
"transform_type": "to_markdown",
|
||||||
|
"target_format": "md",
|
||||||
|
"content": "\n".join(md_lines),
|
||||||
|
"message": "已转换为 Markdown 格式"
|
||||||
|
}
|
||||||
|
|
||||||
|
return {
|
||||||
|
"success": False,
|
||||||
|
"error": "不支持的目标格式",
|
||||||
|
"message": f"暂不支持转换为 {target_type} 格式"
|
||||||
|
}
|
||||||
|
|
||||||
|
except Exception as e:
|
||||||
|
logger.error(f"格式转换失败: {e}")
|
||||||
|
return {
|
||||||
|
"success": False,
|
||||||
|
"error": str(e),
|
||||||
|
"message": f"格式转换失败: {str(e)}"
|
||||||
|
}
|
||||||
|
|
||||||
|
|
||||||
|
# 全局单例
|
||||||
|
instruction_executor = InstructionExecutor()
|
||||||
|
|||||||
@@ -0,0 +1,242 @@
|
|||||||
|
"""
|
||||||
|
Intent parser module
|
||||||
|
|
||||||
|
Parses natural-language user instructions and identifies the intent and parameters
|
||||||
|
"""
|
||||||
|
import re
|
||||||
|
import logging
|
||||||
|
from typing import Any, Dict, List, Optional, Tuple
|
||||||
|
|
||||||
|
logger = logging.getLogger(__name__)
|
||||||
|
|
||||||
|
|
||||||
|
class IntentParser:
|
||||||
|
"""意图解析器"""
|
||||||
|
|
||||||
|
# 意图类型定义
|
||||||
|
INTENT_EXTRACT = "extract" # 信息提取
|
||||||
|
INTENT_FILL_TABLE = "fill_table" # 填表
|
||||||
|
INTENT_SUMMARIZE = "summarize" # 摘要总结
|
||||||
|
INTENT_QUESTION = "question" # 问答
|
||||||
|
INTENT_SEARCH = "search" # 搜索
|
||||||
|
INTENT_COMPARE = "compare" # 对比分析
|
||||||
|
INTENT_TRANSFORM = "transform" # 格式转换
|
||||||
|
INTENT_EDIT = "edit" # 编辑文档
|
||||||
|
INTENT_UNKNOWN = "unknown" # 未知
|
||||||
|
|
||||||
|
# 意图关键词映射
|
||||||
|
INTENT_KEYWORDS = {
|
||||||
|
INTENT_EXTRACT: ["提取", "抽取", "获取", "找出", "查找", "识别", "找到"],
|
||||||
|
INTENT_FILL_TABLE: ["填表", "填写", "填充", "录入", "导入到表格", "填写到"],
|
||||||
|
INTENT_SUMMARIZE: ["总结", "摘要", "概括", "概述", "归纳", "提炼"],
|
||||||
|
INTENT_QUESTION: ["问答", "回答", "解释", "什么是", "为什么", "如何", "怎样", "多少", "几个"],
|
||||||
|
INTENT_SEARCH: ["搜索", "查找", "检索", "查询", "找"],
|
||||||
|
INTENT_COMPARE: ["对比", "比较", "差异", "区别", "不同"],
|
||||||
|
INTENT_TRANSFORM: ["转换", "转化", "变成", "转为", "导出"],
|
||||||
|
INTENT_EDIT: ["修改", "编辑", "调整", "改写", "润色", "优化"],
|
||||||
|
}
|
||||||
|
|
||||||
|
# 实体模式定义
|
||||||
|
ENTITY_PATTERNS = {
|
||||||
|
"number": [r"\d+", r"[一二三四五六七八九十百千万]+"],
|
||||||
|
"date": [r"\d{4}年", r"\d{1,2}月", r"\d{1,2}日"],
|
||||||
|
"percentage": [r"\d+(?:\.\d+)?%", r"\d+(?:\.\d+)?‰"],
|
||||||
|
"currency": [r"\d+(?:\.\d+)?万元", r"\d+(?:\.\d+)?亿元", r"\d+(?:\.\d+)?元"],
|
||||||
|
}
|
||||||
|
|
||||||
|
def __init__(self):
|
||||||
|
self.intent_history: List[Dict[str, Any]] = []
|
||||||
|
|
||||||
|
async def parse(self, text: str) -> Tuple[str, Dict[str, Any]]:
|
||||||
|
"""
|
||||||
|
解析自然语言指令
|
||||||
|
|
||||||
|
Args:
|
||||||
|
text: 用户输入的自然语言
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
(意图类型, 参数字典)
|
||||||
|
"""
|
||||||
|
text = text.strip()
|
||||||
|
if not text:
|
||||||
|
return self.INTENT_UNKNOWN, {}
|
||||||
|
|
||||||
|
# 记录历史
|
||||||
|
self.intent_history.append({"text": text, "intent": None})
|
||||||
|
|
||||||
|
# 识别意图
|
||||||
|
intent = self._recognize_intent(text)
|
||||||
|
|
||||||
|
# 提取参数
|
||||||
|
params = self._extract_params(text, intent)
|
||||||
|
|
||||||
|
# 更新历史
|
||||||
|
if self.intent_history:
|
||||||
|
self.intent_history[-1]["intent"] = intent
|
||||||
|
|
||||||
|
logger.info(f"意图解析: text={text[:50]}..., intent={intent}, params={params}")
|
||||||
|
|
||||||
|
return intent, params
|
||||||
|
|
||||||
|
def _recognize_intent(self, text: str) -> str:
|
||||||
|
"""识别意图类型"""
|
||||||
|
intent_scores: Dict[str, float] = {}
|
||||||
|
|
||||||
|
for intent, keywords in self.INTENT_KEYWORDS.items():
|
||||||
|
score = 0
|
||||||
|
for keyword in keywords:
|
||||||
|
if keyword in text:
|
||||||
|
score += 1
|
||||||
|
if score > 0:
|
||||||
|
intent_scores[intent] = score
|
||||||
|
|
||||||
|
if not intent_scores:
|
||||||
|
return self.INTENT_UNKNOWN
|
||||||
|
|
||||||
|
# 返回得分最高的意图
|
||||||
|
return max(intent_scores, key=intent_scores.get)
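`_recognize_intent` counts keyword hits per intent and returns the highest scorer. Because several keyword lists overlap (for example "查找" and "找"), ties and near-ties are easy to hit. The sketch below copies only the extract and search keyword lists from `INTENT_KEYWORDS` to show how such inputs resolve. The function names `score_intents` and `pick_intent` are illustrative, not part of the diff.

```python
from typing import Dict, List

# Two entries copied from INTENT_KEYWORDS; note the overlapping keywords.
KEYWORDS: Dict[str, List[str]] = {
    "extract": ["提取", "抽取", "获取", "找出", "查找", "识别", "找到"],
    "search": ["搜索", "查找", "检索", "查询", "找"],
}


def score_intents(text: str) -> Dict[str, int]:
    """Count how many keywords of each intent occur in `text`, omitting zero scores."""
    scores: Dict[str, int] = {}
    for intent, words in KEYWORDS.items():
        hits = sum(1 for w in words if w in text)
        if hits:
            scores[intent] = hits
    return scores


def pick_intent(text: str) -> str:
    scores = score_intents(text)
    if not scores:
        return "unknown"
    # max() returns the first key with the highest score, so ties fall to
    # whichever intent was inserted into the dict first.
    return max(scores, key=scores.get)


tie_scores = score_intents("找出医院数量")    # "找出" (extract) vs. "找" (search)
tie_winner = pick_intent("找出医院数量")
overlap_scores = score_intents("查找医院数量")  # "查找" counts for both, "找" adds to search
overlap_winner = pick_intent("查找医院数量")
```

A tie is settled by dictionary order rather than by meaning, so overlapping keywords may deserve weights or an explicit priority list if this matters in practice.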
|
||||||
|
|
||||||
|
def _extract_params(self, text: str, intent: str) -> Dict[str, Any]:
|
||||||
|
"""提取参数"""
|
||||||
|
params: Dict[str, Any] = {
|
||||||
|
"entities": self._extract_entities(text),
|
||||||
|
"document_refs": self._extract_document_refs(text),
|
||||||
|
"field_refs": self._extract_field_refs(text),
|
||||||
|
"template_refs": self._extract_template_refs(text),
|
||||||
|
}
|
||||||
|
|
||||||
|
# 根据意图类型提取特定参数
|
||||||
|
if intent == self.INTENT_QUESTION:
|
||||||
|
params["question"] = text
|
||||||
|
params["focus"] = self._extract_question_focus(text)
|
||||||
|
elif intent == self.INTENT_FILL_TABLE:
|
||||||
|
params["template"] = self._extract_template_info(text)
|
||||||
|
elif intent == self.INTENT_EXTRACT:
|
||||||
|
params["target_fields"] = self._extract_target_fields(text)
|
||||||
|
|
||||||
|
return params
|
||||||
|
|
||||||
|
def _extract_entities(self, text: str) -> Dict[str, List[str]]:
|
||||||
|
"""提取实体"""
|
||||||
|
entities: Dict[str, List[str]] = {}
|
||||||
|
|
||||||
|
for entity_type, patterns in self.ENTITY_PATTERNS.items():
|
||||||
|
matches = []
|
||||||
|
for pattern in patterns:
|
||||||
|
found = re.findall(pattern, text)
|
||||||
|
matches.extend(found)
|
||||||
|
if matches:
|
||||||
|
entities[entity_type] = list(set(matches))
|
||||||
|
|
||||||
|
return entities
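`_extract_entities` relies on `re.findall`, whose return value depends on whether the pattern contains capturing groups. Patterns with an optional decimal part need a non-capturing group if the whole match is wanted. A small demonstration, with a made-up sample sentence:

```python
import re

text = "增长12.5%,占比30%"

# With a capturing group, findall returns only the group's text, not the full match.
with_group = re.findall(r"\d+(\.\d+)?%", text)

# With a non-capturing group, findall returns the full matched strings.
without_group = re.findall(r"\d+(?:\.\d+)?%", text)
```

Here `with_group` holds fragments of the decimal part (or empty strings), while `without_group` holds the complete percentages.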
|
||||||
|
|
||||||
|
def _extract_document_refs(self, text: str) -> List[str]:
|
||||||
|
"""提取文档引用"""
|
||||||
|
# 匹配 "文档1"、"doc1"、"第一个文档" 等
|
||||||
|
refs = []
|
||||||
|
|
||||||
|
# 数字索引: 文档1, doc1, 第1个文档
|
||||||
|
num_patterns = [
|
||||||
|
r"(?:文档|doc)\s*(\d+)",
|
||||||
|
r"第(\d+)个文档",
|
||||||
|
r"第(\d+)份",
|
||||||
|
]
|
||||||
|
for pattern in num_patterns:
|
||||||
|
matches = re.findall(pattern, text.lower())
|
||||||
|
refs.extend([f"doc_{m}" for m in matches])
|
||||||
|
|
||||||
|
# "所有文档"、"全部文档"
|
||||||
|
if any(kw in text for kw in ["所有", "全部", "整个"]):
|
||||||
|
refs.append("all_docs")
|
||||||
|
|
||||||
|
return refs
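Document references such as "文档1" or "doc2" are matched with regular expressions, and the difference between a character class and an alternation matters here. A character class `[...]` matches any run of the listed characters, whereas `(?:a|b)` matches whole words. The sample sentence below is invented for illustration:

```python
import re

char_class = r"[文档doc]+(\d+)"     # any run of the characters 文, 档, d, o, c
alternation = r"(?:文档|doc)(\d+)"  # the literal word 文档 or doc

sample = "请对比文档1和doc2,忽略period3"

class_hits = re.findall(char_class, sample)
alt_hits = re.findall(alternation, sample)
```

The character class also picks up the trailing "od3" inside "period3", which the alternation ignores.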
|
||||||
|
|
||||||
|
def _extract_field_refs(self, text: str) -> List[str]:
|
||||||
|
"""提取字段引用"""
|
||||||
|
fields = []
|
||||||
|
|
||||||
|
# 匹配引号内的字段名
|
||||||
|
quoted = re.findall(r"['\"『「]([^'\"』」]+)['\"』」]", text)
|
||||||
|
fields.extend(quoted)
|
||||||
|
|
||||||
|
# 匹配 "xxx字段"、"xxx列" 等
|
||||||
|
field_patterns = [
|
||||||
|
r"([^\s]+)字段",
|
||||||
|
r"([^\s]+)列",
|
||||||
|
r"([^\s]+)数据",
|
||||||
|
]
|
||||||
|
for pattern in field_patterns:
|
||||||
|
matches = re.findall(pattern, text)
|
||||||
|
fields.extend(matches)
|
||||||
|
|
||||||
|
return list(set(fields))
|
||||||
|
|
||||||
|
def _extract_template_refs(self, text: str) -> List[str]:
|
||||||
|
"""提取模板引用"""
|
||||||
|
templates = []
|
||||||
|
|
||||||
|
# 匹配 "表格模板"、"Excel模板"、"表1" 等
|
||||||
|
template_patterns = [
|
||||||
|
r"([^\s]+模板)",
|
||||||
|
r"表(\d+)",
|
||||||
|
r"([^\s]+表格)",
|
||||||
|
]
|
||||||
|
for pattern in template_patterns:
|
||||||
|
matches = re.findall(pattern, text)
|
||||||
|
templates.extend(matches)
|
||||||
|
|
||||||
|
return list(set(templates))
|
||||||
|
|
||||||
|
def _extract_question_focus(self, text: str) -> Optional[str]:
|
||||||
|
"""提取问题焦点"""
|
||||||
|
# "什么是XXX"、"XXX是什么"
|
||||||
|
match = re.search(r"什么是([^??]+)", text)
|
||||||
|
if match:
|
||||||
|
return match.group(1).strip()
|
||||||
|
|
||||||
|
# "XXX有多少"
|
||||||
|
match = re.search(r"([^?]+)有多少", text)
|
||||||
|
if match:
|
||||||
|
return match.group(1).strip()
|
||||||
|
|
||||||
|
return None
|
||||||
|
|
||||||
|
def _extract_template_info(self, text: str) -> Optional[Dict[str, str]]:
|
||||||
|
"""提取模板信息"""
|
||||||
|
template_info: Dict[str, str] = {}
|
||||||
|
|
||||||
|
# 提取模板类型
|
||||||
|
if "excel" in text.lower() or "xlsx" in text.lower() or "电子表格" in text:
|
||||||
|
template_info["type"] = "xlsx"
|
||||||
|
elif "word" in text.lower() or "docx" in text.lower() or "文档" in text:
|
||||||
|
template_info["type"] = "docx"
|
||||||
|
|
||||||
|
return template_info if template_info else None
|
||||||
|
|
||||||
|
def _extract_target_fields(self, text: str) -> List[str]:
|
||||||
|
"""提取目标字段"""
|
||||||
|
fields = []
|
||||||
|
|
||||||
|
# 匹配 "提取XXX和YYY"、"抽取XXX、YYY"
|
||||||
|
patterns = [
|
||||||
|
r"提取([^和与、,,]+?)(?:和|与|、|,|,|plus)",
|
||||||
|
r"抽取([^和与、,,]+?)(?:和|与|、|,|,|plus)",
|
||||||
|
]
|
||||||
|
|
||||||
|
for pattern in patterns:
|
||||||
|
matches = re.findall(pattern, text)
|
||||||
|
fields.extend([m.strip() for m in matches if m.strip()])
|
||||||
|
|
||||||
|
return list(set(fields))
|
||||||
|
|
||||||
|
def get_intent_history(self) -> List[Dict[str, Any]]:
|
||||||
|
"""获取意图历史"""
|
||||||
|
return self.intent_history
|
||||||
|
|
||||||
|
def clear_history(self):
|
||||||
|
"""清空历史"""
|
||||||
|
self.intent_history = []
|
||||||
|
|
||||||
|
|
||||||
|
# 全局单例
|
||||||
|
intent_parser = IntentParser()
|
||||||
|
|||||||
@@ -1,6 +1,13 @@
|
|||||||
"""
|
"""
|
||||||
FastAPI 应用主入口
|
FastAPI 应用主入口
|
||||||
"""
|
"""
|
||||||
|
# ========== Suppress noisy MongoDB log output ==========
|
||||||
|
import logging
|
||||||
|
logging.getLogger("pymongo").setLevel(logging.WARNING)
|
||||||
|
logging.getLogger("pymongo.topology").setLevel(logging.WARNING)
|
||||||
|
logging.getLogger("urllib3").setLevel(logging.WARNING)
|
||||||
|
# ==============================================
|
||||||
|
|
||||||
import logging
|
import logging
|
||||||
import logging.handlers
|
import logging.handlers
|
||||||
import sys
|
import sys
|
||||||
|
|||||||
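The block added to the entry point raises the level of `pymongo`, `pymongo.topology`, and `urllib3` loggers to `WARNING`. In Python's logging hierarchy, a child logger without its own level inherits the effective level of its parent, so setting the parent is often enough. The sketch below uses hypothetical logger names (`noisy_lib`, `noisy_lib.topology`) to avoid touching real library loggers.

```python
import logging

# Hypothetical logger names standing in for "pymongo" and "pymongo.topology".
parent = logging.getLogger("noisy_lib")
child = logging.getLogger("noisy_lib.topology")

parent.setLevel(logging.WARNING)

# The child has no level of its own (NOTSET), so it inherits WARNING from its parent.
child_level = child.getEffectiveLevel()
debug_enabled = child.isEnabledFor(logging.DEBUG)
warning_enabled = child.isEnabledFor(logging.WARNING)
```

Setting the child explicitly, as the diff does, is still harmless and guards against a library that assigns its own level to the sub-logger.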
@@ -65,7 +65,17 @@ class LLMService:
|
|||||||
return response.json()
|
return response.json()
|
||||||
|
|
||||||
except httpx.HTTPStatusError as e:
|
except httpx.HTTPStatusError as e:
|
||||||
logger.error(f"LLM API 请求失败: {e.response.status_code} - {e.response.text}")
|
error_detail = e.response.text
|
||||||
|
logger.error(f"LLM API 请求失败: {e.response.status_code} - {error_detail}")
|
||||||
|
# 尝试解析错误信息
|
||||||
|
try:
|
||||||
|
import json
|
||||||
|
err_json = json.loads(error_detail)
|
||||||
|
err_code = err_json.get("error", {}).get("code", "unknown")
|
||||||
|
err_msg = err_json.get("error", {}).get("message", "unknown")
|
||||||
|
logger.error(f"API 错误码: {err_code}, 错误信息: {err_msg}")
|
||||||
|
except (ValueError, AttributeError):
|
||||||
|
pass
|
||||||
raise
|
raise
|
||||||
except Exception as e:
|
except Exception as e:
|
||||||
logger.error(f"LLM API 调用异常: {str(e)}")
|
logger.error(f"LLM API 调用异常: {str(e)}")
|
||||||
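The error-handling hunk above tries to read `error.code` and `error.message` from the response body and silently gives up if that fails. The same best-effort parsing can be isolated in a small function that never raises. This is a sketch that assumes the body shape the diff itself reads, `{"error": {"code": ..., "message": ...}}`; the name `parse_api_error` is hypothetical.

```python
import json
from typing import Tuple


def parse_api_error(body: str) -> Tuple[str, str]:
    """Best-effort extraction of (code, message) from an error response body.

    Anything that is not {"error": {"code": ..., "message": ...}} yields
    ("unknown", "unknown").
    """
    try:
        payload = json.loads(body)
    except ValueError:  # json.JSONDecodeError is a ValueError subclass
        return "unknown", "unknown"
    error = payload.get("error") if isinstance(payload, dict) else None
    if not isinstance(error, dict):
        return "unknown", "unknown"
    return str(error.get("code", "unknown")), str(error.get("message", "unknown"))


good = parse_api_error('{"error": {"code": "rate_limit", "message": "slow down"}}')
not_json = parse_api_error("<html>502 Bad Gateway</html>")
wrong_shape = parse_api_error('{"error": "plain string"}')
```

Checking types explicitly keeps the fallback narrow, so unrelated exceptions are not swallowed along the way.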
@@ -328,6 +338,154 @@ Excel 数据概览:
|
|||||||
"analysis": None
|
"analysis": None
|
||||||
}
|
}
|
||||||
|
|
||||||
|
async def chat_with_images(
|
||||||
|
self,
|
||||||
|
text: str,
|
||||||
|
images: List[Dict[str, str]],
|
||||||
|
temperature: float = 0.7,
|
||||||
|
max_tokens: Optional[int] = None
|
||||||
|
) -> Dict[str, Any]:
|
||||||
|
"""
|
||||||
|
调用视觉模型 API(支持图片输入)
|
||||||
|
|
||||||
|
Args:
|
||||||
|
text: 文本内容
|
||||||
|
images: 图片列表,每项包含 base64 编码和 mime_type
|
||||||
|
格式: [{"base64": "...", "mime_type": "image/png"}, ...]
|
||||||
|
temperature: 温度参数
|
||||||
|
max_tokens: 最大 token 数
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
Dict[str, Any]: API 响应结果
|
||||||
|
"""
|
||||||
|
headers = {
|
||||||
|
"Authorization": f"Bearer {self.api_key}",
|
||||||
|
"Content-Type": "application/json"
|
||||||
|
}
|
||||||
|
|
||||||
|
# 构建图片内容
|
||||||
|
image_contents = []
|
||||||
|
for img in images:
|
||||||
|
image_contents.append({
|
||||||
|
"type": "image_url",
|
||||||
|
"image_url": {
|
||||||
|
"url": f"data:{img['mime_type']};base64,{img['base64']}"
|
||||||
|
}
|
||||||
|
})
|
||||||
|
|
||||||
|
# 构建消息
|
||||||
|
messages = [
|
||||||
|
{
|
||||||
|
"role": "user",
|
||||||
|
"content": [
|
||||||
|
{
|
||||||
|
"type": "text",
|
||||||
|
"text": text
|
||||||
|
},
|
||||||
|
*image_contents
|
||||||
|
]
|
||||||
|
}
|
||||||
|
]
|
||||||
|
|
||||||
|
payload = {
|
||||||
|
"model": self.model_name,
|
||||||
|
"messages": messages,
|
||||||
|
"temperature": temperature
|
||||||
|
}
|
||||||
|
|
||||||
|
if max_tokens:
|
||||||
|
payload["max_tokens"] = max_tokens
|
||||||
|
|
||||||
|
try:
|
||||||
|
async with httpx.AsyncClient(timeout=120.0) as client:
|
||||||
|
response = await client.post(
|
||||||
|
f"{self.base_url}/chat/completions",
|
||||||
|
headers=headers,
|
||||||
|
json=payload
|
||||||
|
)
|
||||||
|
response.raise_for_status()
|
||||||
|
return response.json()
|
||||||
|
|
||||||
|
except httpx.HTTPStatusError as e:
|
||||||
|
error_detail = e.response.text
|
||||||
|
logger.error(f"视觉模型 API 请求失败: {e.response.status_code} - {error_detail}")
|
||||||
|
# 尝试解析错误信息
|
||||||
|
try:
|
||||||
|
import json
|
||||||
|
err_json = json.loads(error_detail)
|
||||||
|
err_code = err_json.get("error", {}).get("code", "unknown")
|
||||||
|
err_msg = err_json.get("error", {}).get("message", "unknown")
|
||||||
|
logger.error(f"API 错误码: {err_code}, 错误信息: {err_msg}")
|
||||||
|
logger.error(f"请求模型: {self.model_name}, base_url: {self.base_url}")
|
||||||
|
except Exception:  # parsing the error body is best-effort; do not swallow KeyboardInterrupt/SystemExit
|
||||||
|
pass
|
||||||
|
raise
|
||||||
|
except Exception as e:
|
||||||
|
logger.error(f"视觉模型 API 调用异常: {str(e)}")
|
||||||
|
raise
|
||||||
|
|
||||||
|
async def analyze_images(
|
||||||
|
self,
|
||||||
|
images: List[Dict[str, str]],
|
||||||
|
user_prompt: str = ""
|
||||||
|
) -> Dict[str, Any]:
|
||||||
|
"""
|
||||||
|
分析图片内容(使用视觉模型)
|
||||||
|
|
||||||
|
Args:
|
||||||
|
images: 图片列表,每项包含 base64 编码和 mime_type
|
||||||
|
user_prompt: 用户提示词
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
Dict[str, Any]: 分析结果
|
||||||
|
"""
|
||||||
|
prompt = f"""你是一个专业的视觉分析专家。请分析以下图片内容。
|
||||||
|
|
||||||
|
{user_prompt if user_prompt else "请详细描述图片中的内容,包括文字、数据、图表、流程等所有可见信息。"}
|
||||||
|
|
||||||
|
请按照以下 JSON 格式输出:
|
||||||
|
{{
|
||||||
|
"description": "图片内容的详细描述",
|
||||||
|
"text_content": "图片中的文字内容(如有)",
|
||||||
|
"data_extracted": {{"键": "值"}} // 如果图片中有表格或数据
|
||||||
|
}}
|
||||||
|
|
||||||
|
如果图片不包含有用信息,请返回空的描述。"""
|
||||||
|
|
||||||
|
try:
|
||||||
|
response = await self.chat_with_images(
|
||||||
|
text=prompt,
|
||||||
|
images=images,
|
||||||
|
temperature=0.1,
|
||||||
|
max_tokens=4000
|
||||||
|
)
|
||||||
|
|
||||||
|
content = self.extract_message_content(response)
|
||||||
|
|
||||||
|
# 解析 JSON
|
||||||
|
import json
|
||||||
|
try:
|
||||||
|
result = json.loads(content)
|
||||||
|
return {
|
||||||
|
"success": True,
|
||||||
|
"analysis": result,
|
||||||
|
"model": self.model_name
|
||||||
|
}
|
||||||
|
except json.JSONDecodeError:
|
||||||
|
return {
|
||||||
|
"success": True,
|
||||||
|
"analysis": {"description": content},
|
||||||
|
"model": self.model_name
|
||||||
|
}
|
||||||
|
|
||||||
|
except Exception as e:
|
||||||
|
logger.error(f"图片分析失败: {str(e)}")
|
||||||
|
return {
|
||||||
|
"success": False,
|
||||||
|
"error": str(e),
|
||||||
|
"analysis": None
|
||||||
|
}
|
||||||
|
|
||||||
|
|
||||||
# 全局单例
|
# 全局单例
|
||||||
llm_service = LLMService()
|
llm_service = LLMService()
|
||||||
|
|||||||
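The `chat_with_images` method added in the hunk above sends each image as a base64 data URL inside an OpenAI-style `image_url` content part, after a single `text` part. The following is a minimal standalone sketch of that payload construction only. The helper name `build_vision_payload` and the model name used in the example are assumptions for illustration and do not appear in the diff.

```python
from typing import Any, Dict, List, Optional


def build_vision_payload(
    model: str,
    text: str,
    images: List[Dict[str, str]],
    temperature: float = 0.7,
    max_tokens: Optional[int] = None,
) -> Dict[str, Any]:
    """Build a chat payload with one text part followed by N image parts.

    Each item in `images` is expected to look like
    {"base64": "...", "mime_type": "image/png"}, as in the diff's docstring.
    """
    image_parts = [
        {
            "type": "image_url",
            "image_url": {"url": f"data:{img['mime_type']};base64,{img['base64']}"},
        }
        for img in images
    ]
    payload: Dict[str, Any] = {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [{"type": "text", "text": text}, *image_parts],
            }
        ],
        "temperature": temperature,
    }
    # Only send max_tokens when the caller supplied one.
    if max_tokens is not None:
        payload["max_tokens"] = max_tokens
    return payload


# Hypothetical usage: "QUJD" is the base64 encoding of b"ABC", not a real image.
example = build_vision_payload(
    "example-vision-model",
    "Describe the image.",
    [{"base64": "QUJD", "mime_type": "image/png"}],
    max_tokens=256,
)
```

Keeping payload construction separate from the HTTP call makes the message shape easy to unit-test without network access.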
446
backend/app/services/multi_doc_reasoning_service.py
Normal file
@@ -0,0 +1,446 @@
|
|||||||
|
"""
|
||||||
|
多文档关联推理服务
|
||||||
|
|
||||||
|
跨文档信息关联和推理
|
||||||
|
"""
|
||||||
|
import logging
|
||||||
|
import re
|
||||||
|
from typing import Any, Dict, List, Optional, Set, Tuple
|
||||||
|
from collections import defaultdict
|
||||||
|
|
||||||
|
from app.services.llm_service import llm_service
|
||||||
|
from app.services.rag_service import rag_service
|
||||||
|
|
||||||
|
logger = logging.getLogger(__name__)
|
||||||
|
|
||||||
|
|
||||||
|
class MultiDocReasoningService:
|
||||||
|
"""
|
||||||
|
多文档关联推理服务
|
||||||
|
|
||||||
|
功能:
|
||||||
|
1. 实体跨文档追踪 - 追踪同一实体在不同文档中的描述
|
||||||
|
2. 关系抽取与推理 - 抽取实体间关系并进行推理
|
||||||
|
3. 信息补全 - 根据多个文档的信息互补填充缺失数据
|
||||||
|
4. 冲突检测 - 检测不同文档间的信息冲突
|
||||||
|
"""
|
||||||
|
|
||||||
|
def __init__(self):
|
||||||
|
self.llm = llm_service
|
||||||
|
|
||||||
|
async def analyze_cross_documents(
|
||||||
|
self,
|
||||||
|
documents: List[Dict[str, Any]],
|
||||||
|
query: Optional[str] = None,
|
||||||
|
entity_types: Optional[List[str]] = None
|
||||||
|
) -> Dict[str, Any]:
|
||||||
|
"""
|
||||||
|
跨文档分析
|
||||||
|
|
||||||
|
Args:
|
||||||
|
documents: 文档列表
|
||||||
|
query: 查询意图(可选)
|
||||||
|
entity_types: 要追踪的实体类型列表,如 ["机构", "人物", "地点", "数量"]
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
跨文档分析结果
|
||||||
|
"""
|
||||||
|
if not documents:
|
||||||
|
return {"success": False, "error": "没有可用的文档"}
|
||||||
|
|
||||||
|
entity_types = entity_types or ["机构", "数量", "时间", "地点"]
|
||||||
|
|
||||||
|
try:
|
||||||
|
# 1. 提取各文档中的实体
|
||||||
|
entities_per_doc = await self._extract_entities_from_docs(documents, entity_types)
|
||||||
|
|
||||||
|
# 2. 跨文档实体对齐
|
||||||
|
aligned_entities = self._align_entities_across_docs(entities_per_doc)
|
||||||
|
|
||||||
|
# 3. 关系抽取
|
||||||
|
relations = await self._extract_relations(documents)
|
||||||
|
|
||||||
|
# 4. 构建知识图谱
|
||||||
|
knowledge_graph = self._build_knowledge_graph(aligned_entities, relations)
|
||||||
|
|
||||||
|
# 5. 信息补全
|
||||||
|
completed_info = await self._complete_missing_info(knowledge_graph, documents)
|
||||||
|
|
||||||
|
# 6. 冲突检测
|
||||||
|
conflicts = self._detect_conflicts(aligned_entities)
|
||||||
|
|
||||||
|
return {
|
||||||
|
"success": True,
|
||||||
|
"entities": aligned_entities,
|
||||||
|
"relations": relations,
|
||||||
|
"knowledge_graph": knowledge_graph,
|
||||||
|
"completed_info": completed_info,
|
||||||
|
"conflicts": conflicts,
|
||||||
|
"summary": self._generate_summary(aligned_entities, conflicts)
|
||||||
|
}
|
||||||
|
|
||||||
|
except Exception as e:
|
||||||
|
logger.error(f"跨文档分析失败: {e}")
|
||||||
|
return {"success": False, "error": str(e)}
|
||||||
|
|
||||||
|
async def _extract_entities_from_docs(
|
||||||
|
self,
|
||||||
|
documents: List[Dict[str, Any]],
|
||||||
|
entity_types: List[str]
|
||||||
|
) -> List[Dict[str, Any]]:
|
||||||
|
"""从各文档中提取实体"""
|
||||||
|
entities_per_doc = []
|
||||||
|
|
||||||
|
for idx, doc in enumerate(documents):
|
||||||
|
doc_id = doc.get("_id", f"doc_{idx}")
|
||||||
|
content = doc.get("content", "")[:8000] # 限制长度
|
||||||
|
|
||||||
|
# 使用 LLM 提取实体
|
||||||
|
prompt = f"""从以下文档中提取指定的实体类型信息。
|
||||||
|
|
||||||
|
实体类型: {', '.join(entity_types)}
|
||||||
|
|
||||||
|
文档内容:
|
||||||
|
{content}
|
||||||
|
|
||||||
|
请按以下 JSON 格式输出(只需输出 JSON):
|
||||||
|
{{
|
||||||
|
"entities": [
|
||||||
|
{{"type": "机构", "name": "实体名称", "value": "相关数值(如有)", "context": "上下文描述"}},
|
||||||
|
...
|
||||||
|
]
|
||||||
|
}}
|
||||||
|
|
||||||
|
只提取在文档中明确提到的实体,不要推测。"""
|
||||||
|
|
||||||
|
messages = [
|
||||||
|
{"role": "system", "content": "你是一个实体提取专家。请严格按JSON格式输出。"},
|
||||||
|
{"role": "user", "content": prompt}
|
||||||
|
]
|
||||||
|
|
||||||
|
try:
|
||||||
|
response = await self.llm.chat(messages=messages, temperature=0.1, max_tokens=3000)
|
||||||
|
content_response = self.llm.extract_message_content(response)
|
||||||
|
|
||||||
|
# 解析 JSON
|
||||||
|
import json
|
||||||
|
import re
|
||||||
|
cleaned = content_response.strip()
|
||||||
|
json_match = re.search(r'\{[\s\S]*\}', cleaned)
|
||||||
|
if json_match:
|
||||||
|
result = json.loads(json_match.group())
|
||||||
|
entities = result.get("entities", [])
|
||||||
|
entities_per_doc.append({
|
||||||
|
"doc_id": doc_id,
|
||||||
|
"doc_name": doc.get("metadata", {}).get("original_filename", f"文档{idx+1}"),
|
||||||
|
"entities": entities
|
||||||
|
})
|
||||||
|
logger.info(f"文档 {doc_id} 提取到 {len(entities)} 个实体")
|
||||||
|
except Exception as e:
|
||||||
|
logger.warning(f"文档 {doc_id} 实体提取失败: {e}")
|
||||||
|
|
||||||
|
return entities_per_doc
|
||||||
|
|
||||||
|
def _align_entities_across_docs(
|
||||||
|
self,
|
||||||
|
entities_per_doc: List[Dict[str, Any]]
|
||||||
|
) -> Dict[str, List[Dict[str, Any]]]:
|
||||||
|
"""
|
||||||
|
跨文档实体对齐
|
||||||
|
|
||||||
|
将同一实体在不同文档中的描述进行关联
|
||||||
|
"""
|
||||||
|
aligned: Dict[str, List[Dict[str, Any]]] = defaultdict(list)
|
||||||
|
|
||||||
|
for doc_data in entities_per_doc:
|
||||||
|
doc_id = doc_data["doc_id"]
|
||||||
|
doc_name = doc_data["doc_name"]
|
||||||
|
|
||||||
|
for entity in doc_data.get("entities", []):
|
||||||
|
entity_name = entity.get("name", "")
|
||||||
|
if not entity_name:
|
||||||
|
continue
|
||||||
|
|
||||||
|
# 标准化实体名(去除空格和括号内容)
|
||||||
|
normalized = self._normalize_entity_name(entity_name)
|
||||||
|
|
||||||
|
aligned[normalized].append({
|
||||||
|
"original_name": entity_name,
|
||||||
|
"type": entity.get("type", "未知"),
|
||||||
|
"value": entity.get("value", ""),
|
||||||
|
"context": entity.get("context", ""),
|
||||||
|
"source_doc": doc_name,
|
||||||
|
"source_doc_id": doc_id
|
||||||
|
})
|
||||||
|
|
||||||
|
# 合并相同实体
|
||||||
|
result = {}
|
||||||
|
for normalized, appearances in aligned.items():
|
||||||
|
if len(appearances) > 1:
|
||||||
|
result[normalized] = appearances
|
||||||
|
logger.info(f"实体对齐: {normalized} 在 {len(appearances)} 个文档中出现")
|
||||||
|
|
||||||
|
return result
|
||||||
|
|
||||||
|
def _normalize_entity_name(self, name: str) -> str:
|
||||||
|
"""标准化实体名称"""
|
||||||
|
# 去除空格
|
||||||
|
name = name.strip()
|
||||||
|
# 去除括号内容
|
||||||
|
name = re.sub(r'[((].*?[))]', '', name)  # handle both full-width and ASCII brackets
|
||||||
|
# 去除"第X名"等
|
||||||
|
name = re.sub(r'^第\d+[名位个]', '', name)
|
||||||
|
return name.strip()
|
||||||
|
|
||||||
|
async def _extract_relations(
|
||||||
|
self,
|
||||||
|
documents: List[Dict[str, Any]]
|
||||||
|
) -> List[Dict[str, str]]:
|
||||||
|
"""从文档中抽取关系"""
|
||||||
|
relations = []
|
||||||
|
|
||||||
|
# 合并所有文档内容
|
||||||
|
combined_content = "\n\n".join([
|
||||||
|
f"【{doc.get('metadata', {}).get('original_filename', f'文档{i}')}】\n{doc.get('content', '')[:3000]}"
|
||||||
|
for i, doc in enumerate(documents)
|
||||||
|
])
|
||||||
|
|
||||||
|
prompt = f"""从以下文档内容中抽取实体之间的关系。
|
||||||
|
|
||||||
|
文档内容:
|
||||||
|
{combined_content[:8000]}
|
||||||
|
|
||||||
|
请识别以下类型的关系:
|
||||||
|
- 包含关系 (A包含B)
|
||||||
|
- 隶属关系 (A隶属于B)
|
||||||
|
- 合作关系 (A与B合作)
|
||||||
|
- 对比关系 (A vs B)
|
||||||
|
- 时序关系 (A先于B发生)
|
||||||
|
|
||||||
|
请按以下 JSON 格式输出(只需输出 JSON):
|
||||||
|
{{
|
||||||
|
"relations": [
|
||||||
|
{{"entity1": "实体1", "entity2": "实体2", "relation": "关系类型", "description": "关系描述"}},
|
||||||
|
...
|
||||||
|
]
|
||||||
|
}}
|
||||||
|
|
||||||
|
如果没有找到明确的关系,返回空数组。"""
|
||||||
|
|
||||||
|
messages = [
|
||||||
|
{"role": "system", "content": "你是一个关系抽取专家。请严格按JSON格式输出。"},
|
||||||
|
{"role": "user", "content": prompt}
|
||||||
|
]
|
||||||
|
|
||||||
|
try:
|
||||||
|
response = await self.llm.chat(messages=messages, temperature=0.1, max_tokens=3000)
|
||||||
|
content_response = self.llm.extract_message_content(response)
|
||||||
|
|
||||||
|
import json
|
||||||
|
import re
|
||||||
|
cleaned = content_response.strip()
|
||||||
|
json_match = re.search(r'\{[\s\S]*\}', cleaned)  # single braces: this is a plain raw string, not an f-string
|
||||||
|
if json_match:
|
||||||
|
result = json.loads(json_match.group())
|
||||||
|
relations = result.get("relations", [])
|
||||||
|
logger.info(f"抽取到 {len(relations)} 个关系")
|
||||||
|
except Exception as e:
|
||||||
|
logger.warning(f"关系抽取失败: {e}")
|
||||||
|
|
||||||
|
return relations
|
||||||
|
|
||||||
|
def _build_knowledge_graph(
|
||||||
|
self,
|
||||||
|
aligned_entities: Dict[str, List[Dict[str, Any]]],
|
||||||
|
relations: List[Dict[str, str]]
|
||||||
|
) -> Dict[str, Any]:
|
||||||
|
"""构建知识图谱"""
|
||||||
|
nodes = []
|
||||||
|
edges = []
|
||||||
|
node_ids = set()
|
||||||
|
|
||||||
|
# 添加实体节点
|
||||||
|
for entity_name, appearances in aligned_entities.items():
|
||||||
|
if len(appearances) < 1:
|
||||||
|
continue
|
||||||
|
|
||||||
|
first_appearance = appearances[0]
|
||||||
|
node_id = f"entity_{len(nodes)}"
|
||||||
|
|
||||||
|
# 收集该实体在所有文档中的值
|
||||||
|
values = [a.get("value", "") for a in appearances if a.get("value")]
|
||||||
|
primary_value = values[0] if values else ""
|
||||||
|
|
||||||
|
nodes.append({
|
||||||
|
"id": node_id,
|
||||||
|
"name": entity_name,
|
||||||
|
"type": first_appearance.get("type", "未知"),
|
||||||
|
"value": primary_value,
|
||||||
|
"occurrence_count": len(appearances),
|
||||||
|
"sources": [a.get("source_doc", "") for a in appearances]
|
||||||
|
})
|
||||||
|
node_ids.add(entity_name)
|
||||||
|
|
||||||
|
# 添加关系边
|
||||||
|
for relation in relations:
|
||||||
|
entity1 = self._normalize_entity_name(relation.get("entity1", ""))
|
||||||
|
entity2 = self._normalize_entity_name(relation.get("entity2", ""))
|
||||||
|
|
||||||
|
if entity1 in node_ids and entity2 in node_ids:
|
||||||
|
edges.append({
|
||||||
|
"source": entity1,
|
||||||
|
"target": entity2,
|
||||||
|
"relation": relation.get("relation", "相关"),
|
||||||
|
"description": relation.get("description", "")
|
||||||
|
})
|
||||||
|
|
||||||
|
return {
|
||||||
|
"nodes": nodes,
|
||||||
|
"edges": edges,
|
||||||
|
"stats": {
|
||||||
|
"entity_count": len(nodes),
|
||||||
|
"relation_count": len(edges)
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
async def _complete_missing_info(
|
||||||
|
self,
|
||||||
|
knowledge_graph: Dict[str, Any],
|
||||||
|
documents: List[Dict[str, Any]]
|
||||||
|
) -> List[Dict[str, Any]]:
|
||||||
|
"""根据多个文档补全信息"""
|
||||||
|
completed = []
|
||||||
|
|
||||||
|
for node in knowledge_graph.get("nodes", []):
|
||||||
|
if not node.get("value") and node.get("occurrence_count", 0) > 1:
|
||||||
|
# 实体在多个文档中出现但没有数值,尝试从 RAG 检索补充
|
||||||
|
query = f"{node['name']} 数值 数据"
|
||||||
|
results = rag_service.retrieve(query, top_k=3, min_score=0.3)
|
||||||
|
|
||||||
|
if results:
|
||||||
|
completed.append({
|
||||||
|
"entity": node["name"],
|
||||||
|
"type": node.get("type", "未知"),
|
||||||
|
"source": "rag_inference",
|
||||||
|
"context": results[0].get("content", "")[:200],
|
||||||
|
"confidence": results[0].get("score", 0)
|
||||||
|
})
|
||||||
|
|
||||||
|
return completed
|
||||||
|
|
||||||
|
def _detect_conflicts(
|
||||||
|
self,
|
||||||
|
aligned_entities: Dict[str, List[Dict[str, Any]]]
|
||||||
|
) -> List[Dict[str, Any]]:
|
||||||
|
"""检测不同文档间的信息冲突"""
|
||||||
|
conflicts = []
|
||||||
|
|
||||||
|
for entity_name, appearances in aligned_entities.items():
|
||||||
|
if len(appearances) < 2:
|
||||||
|
continue
|
||||||
|
|
||||||
|
# 检查数值冲突
|
||||||
|
values = {}
|
||||||
|
for appearance in appearances:
|
||||||
|
val = appearance.get("value", "")
|
||||||
|
if val:
|
||||||
|
source = appearance.get("source_doc", "未知来源")
|
||||||
|
values[source] = val
|
||||||
|
|
||||||
|
if len(values) > 1:
|
||||||
|
unique_values = set(values.values())
|
||||||
|
if len(unique_values) > 1:
|
||||||
|
conflicts.append({
|
||||||
|
"entity": entity_name,
|
||||||
|
"type": "value_conflict",
|
||||||
|
"details": values,
|
||||||
|
"description": f"实体 '{entity_name}' 在不同文档中有不同数值: {values}"
|
||||||
|
})
|
||||||
|
|
||||||
|
return conflicts
|
||||||
|
|
||||||
|
def _generate_summary(
|
||||||
|
self,
|
||||||
|
aligned_entities: Dict[str, List[Dict[str, Any]]],
|
||||||
|
conflicts: List[Dict[str, Any]]
|
||||||
|
) -> str:
|
||||||
|
"""生成摘要"""
|
||||||
|
summary_parts = []
|
||||||
|
|
||||||
|
total_entities = sum(len(appearances) for appearances in aligned_entities.values())
|
||||||
|
multi_doc_entities = sum(1 for appearances in aligned_entities.values() if len(appearances) > 1)
|
||||||
|
|
||||||
|
summary_parts.append(f"跨文档分析完成:发现 {total_entities} 个实体")
|
||||||
|
summary_parts.append(f"其中 {multi_doc_entities} 个实体在多个文档中被提及")
|
||||||
|
|
||||||
|
if conflicts:
|
||||||
|
summary_parts.append(f"检测到 {len(conflicts)} 个潜在冲突")
|
||||||
|
|
||||||
|
return "; ".join(summary_parts)
|
||||||
|
|
||||||
|
async def answer_cross_doc_question(
|
||||||
|
self,
|
||||||
|
question: str,
|
||||||
|
documents: List[Dict[str, Any]]
|
||||||
|
) -> Dict[str, Any]:
|
||||||
|
"""
|
||||||
|
跨文档问答
|
||||||
|
|
||||||
|
Args:
|
||||||
|
question: 问题
|
||||||
|
documents: 文档列表
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
答案结果
|
||||||
|
"""
|
||||||
|
# 先进行跨文档分析
|
||||||
|
analysis_result = await self.analyze_cross_documents(documents, query=question)
|
||||||
|
|
||||||
|
# 构建上下文
|
||||||
|
context_parts = []
|
||||||
|
|
||||||
|
# 添加实体信息
|
||||||
|
for entity_name, appearances in analysis_result.get("entities", {}).items():
|
||||||
|
contexts = [f"{a.get('source_doc')}: {a.get('context', '')}" for a in appearances[:2]]
|
||||||
|
if contexts:
|
||||||
|
context_parts.append(f"【{entity_name}】{' | '.join(contexts)}")
|
||||||
|
|
||||||
|
# 添加关系信息
|
||||||
|
for relation in analysis_result.get("relations", [])[:5]:
|
||||||
|
context_parts.append(f"【关系】{relation.get('entity1')} {relation.get('relation')} {relation.get('entity2')}: {relation.get('description', '')}")
|
||||||
|
|
||||||
|
context_text = "\n\n".join(context_parts) if context_parts else "未找到相关实体和关系"
|
||||||
|
|
||||||
|
# 使用 LLM 生成答案
|
||||||
|
prompt = f"""基于以下跨文档分析结果,回答用户问题。
|
||||||
|
|
||||||
|
问题: {question}
|
||||||
|
|
||||||
|
分析结果:
|
||||||
|
{context_text}
|
||||||
|
|
||||||
|
请直接回答问题,如果分析结果中没有相关信息,请说明"根据提供的文档无法回答该问题"。"""
|
||||||
|
|
||||||
|
messages = [
|
||||||
|
{"role": "system", "content": "你是一个基于文档的问答助手。请根据提供的信息回答问题。"},
|
||||||
|
{"role": "user", "content": prompt}
|
||||||
|
]
|
||||||
|
|
||||||
|
try:
|
||||||
|
response = await self.llm.chat(messages=messages, temperature=0.2, max_tokens=2000)
|
||||||
|
answer = self.llm.extract_message_content(response)
|
||||||
|
|
||||||
|
return {
|
||||||
|
"success": True,
|
||||||
|
"question": question,
|
||||||
|
"answer": answer,
|
||||||
|
"supporting_entities": list(analysis_result.get("entities", {}).keys())[:10],
|
||||||
|
"relations_count": len(analysis_result.get("relations", []))
|
||||||
|
}
|
||||||
|
except Exception as e:
|
||||||
|
logger.error(f"跨文档问答失败: {e}")
|
||||||
|
return {"success": False, "error": str(e)}
|
||||||
|
|
||||||
|
|
||||||
|
# 全局单例
|
||||||
|
multi_doc_reasoning_service = MultiDocReasoningService()
|
||||||
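The alignment and conflict-detection steps in `MultiDocReasoningService` above can be exercised without any LLM call. The sketch below re-implements them as free functions under stated assumptions: the function names, the sample documents, and the handling of both full-width and ASCII brackets are illustrative and are not taken verbatim from the diff.

```python
import re
from collections import defaultdict
from typing import Any, Dict, List


def normalize_entity_name(name: str) -> str:
    """Drop bracketed notes and a leading rank such as '第3名', then trim."""
    name = re.sub(r"[((].*?[))]", "", name.strip())
    name = re.sub(r"^第\d+[名位个]", "", name)
    return name.strip()


def align_entities(entities_per_doc: List[Dict[str, Any]]) -> Dict[str, List[Dict[str, Any]]]:
    """Group entity mentions by normalized name; keep names mentioned more than once."""
    aligned: Dict[str, List[Dict[str, Any]]] = defaultdict(list)
    for doc in entities_per_doc:
        for entity in doc.get("entities", []):
            key = normalize_entity_name(entity.get("name", ""))
            if key:
                aligned[key].append({**entity, "source_doc": doc["doc_name"]})
    return {key: items for key, items in aligned.items() if len(items) > 1}


def detect_value_conflicts(aligned: Dict[str, List[Dict[str, Any]]]) -> List[Dict[str, Any]]:
    """Report entities whose non-empty values differ between source documents."""
    conflicts = []
    for name, appearances in aligned.items():
        values = {a["source_doc"]: a["value"] for a in appearances if a.get("value")}
        if len(set(values.values())) > 1:
            conflicts.append({"entity": name, "type": "value_conflict", "details": values})
    return conflicts


# Hypothetical input in the shape produced by _extract_entities_from_docs.
docs = [
    {"doc_name": "report_a.docx",
     "entities": [{"type": "机构", "name": "示例公司(总部)", "value": "120"}]},
    {"doc_name": "report_b.docx",
     "entities": [{"type": "机构", "name": "示例公司", "value": "135"}]},
]
aligned = align_entities(docs)
conflicts = detect_value_conflicts(aligned)
```

Note that the grouping counts mentions, not distinct documents, so an entity mentioned twice in a single document also passes the `> 1` filter. Values are compared as raw strings, so "120" and "120.0" would be reported as a conflict.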
@@ -2,21 +2,32 @@
|
|||||||
RAG 服务模块 - 检索增强生成
|
RAG 服务模块 - 检索增强生成
|
||||||
|
|
||||||
使用 sentence-transformers + Faiss 实现向量检索
|
使用 sentence-transformers + Faiss 实现向量检索
|
||||||
|
支持 BM25 关键词检索 + 向量检索混合融合
|
||||||
"""
|
"""
|
||||||
import json
|
|
||||||
import logging
|
import logging
|
||||||
import os
|
import os
|
||||||
import pickle
|
import pickle
|
||||||
from typing import Any, Dict, List, Optional
|
import re
|
||||||
|
import math
|
||||||
|
from typing import Any, Dict, List, Optional, Tuple
|
||||||
|
from collections import Counter, defaultdict
|
||||||
|
|
||||||
import faiss
|
import faiss
|
||||||
import numpy as np
|
import numpy as np
|
||||||
from sentence_transformers import SentenceTransformer
|
|
||||||
|
|
||||||
from app.config import settings
|
from app.config import settings
|
||||||
|
|
||||||
logger = logging.getLogger(__name__)
|
logger = logging.getLogger(__name__)
|
||||||
|
|
||||||
|
# 尝试导入 sentence-transformers
|
||||||
|
try:
|
||||||
|
from sentence_transformers import SentenceTransformer
|
||||||
|
SENTENCE_TRANSFORMERS_AVAILABLE = True
|
||||||
|
except ImportError as e:
|
||||||
|
logger.warning(f"sentence-transformers 导入失败: {e}")
|
||||||
|
SENTENCE_TRANSFORMERS_AVAILABLE = False
|
||||||
|
SentenceTransformer = None
|
||||||
|
|
||||||
|
|
||||||
class SimpleDocument:
|
class SimpleDocument:
|
||||||
"""简化文档对象"""
|
"""简化文档对象"""
|
||||||
@@ -25,20 +36,156 @@ class SimpleDocument:
|
|||||||
self.metadata = metadata
|
self.metadata = metadata
|
||||||
|
|
||||||
|
|
||||||
|
class BM25:
|
||||||
|
"""
|
||||||
|
BM25 关键词检索算法
|
||||||
|
|
||||||
|
一种基于词频和文档频率的信息检索算法,比纯向量搜索更适合关键词精确匹配
|
||||||
|
"""
|
||||||
|
|
||||||
|
def __init__(self, k1: float = 1.5, b: float = 0.75):
|
||||||
|
self.k1 = k1 # 词频饱和参数
|
||||||
|
self.b = b # 文档长度归一化参数
|
||||||
|
self.documents: List[str] = []
|
||||||
|
self.doc_ids: List[str] = []
|
||||||
|
self.avg_doc_length = 0
|
||||||
|
self.doc_freqs: Dict[str, int] = {} # 词 -> 包含该词的文档数
|
||||||
|
self.idf: Dict[str, float] = {} # 词 -> IDF 值
|
||||||
|
self.doc_lengths: List[int] = []
|
||||||
|
self.doc_term_freqs: List[Dict[str, int]] = [] # 每个文档的词频
|
||||||
|
|
||||||
|
def _tokenize(self, text: str) -> List[str]:
|
||||||
|
"""分词(简单的中文分词)"""
|
||||||
|
if not text:
|
||||||
|
return []
|
||||||
|
# 简单分词:按标点和空格分割
|
||||||
|
tokens = re.findall(r'[\u4e00-\u9fff]+|[a-zA-Z0-9]+', text.lower())
|
||||||
|
# 过滤单字符
|
||||||
|
return [t for t in tokens if len(t) > 1]
|
||||||
|
|
||||||
|
def fit(self, documents: List[str], doc_ids: List[str]):
|
||||||
|
"""
|
||||||
|
构建 BM25 索引
|
||||||
|
|
||||||
|
Args:
|
||||||
|
documents: 文档内容列表
|
||||||
|
doc_ids: 文档 ID 列表
|
||||||
|
"""
|
||||||
|
self.documents = documents
|
||||||
|
self.doc_ids = doc_ids
|
||||||
|
n = len(documents)
|
||||||
|
|
||||||
|
# 统计文档频率
|
||||||
|
self.doc_freqs = defaultdict(int)
|
||||||
|
self.doc_lengths = []
|
||||||
|
self.doc_term_freqs = []
|
||||||
|
|
||||||
|
for doc in documents:
|
||||||
|
tokens = self._tokenize(doc)
|
||||||
|
self.doc_lengths.append(len(tokens))
|
||||||
|
doc_tf = Counter(tokens)
|
||||||
|
self.doc_term_freqs.append(doc_tf)
|
||||||
|
|
||||||
|
for term in doc_tf:
|
||||||
|
self.doc_freqs[term] += 1
|
||||||
|
|
||||||
|
# 计算平均文档长度
|
||||||
|
self.avg_doc_length = sum(self.doc_lengths) / n if n > 0 else 0
|
||||||
|
|
||||||
|
# 计算 IDF
|
||||||
|
for term, df in self.doc_freqs.items():
|
||||||
|
# IDF = log((n - df + 0.5) / (df + 0.5) + 1); the +1 inside the log keeps IDF positive
|
||||||
|
self.idf[term] = math.log((n - df + 0.5) / (df + 0.5) + 1)
|
||||||
|
|
||||||
|
logger.info(f"BM25 索引构建完成: {n} 个文档, {len(self.idf)} 个词项")
|
||||||
|
|
||||||
|
def search(self, query: str, top_k: int = 10) -> List[Tuple[int, float]]:
|
||||||
|
"""
|
||||||
|
搜索相关文档
|
||||||
|
|
||||||
|
Args:
|
||||||
|
query: 查询文本
|
||||||
|
top_k: 返回前 k 个结果
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
[(文档索引, BM25分数), ...]
|
||||||
|
"""
|
||||||
|
if not self.documents:
|
||||||
|
return []
|
||||||
|
|
||||||
|
query_tokens = self._tokenize(query)
|
||||||
|
if not query_tokens:
|
||||||
|
return []
|
||||||
|
|
||||||
|
scores = []
|
||||||
|
n = len(self.documents)
|
||||||
|
|
||||||
|
for idx in range(n):
|
||||||
|
score = self._calculate_score(query_tokens, idx)
|
||||||
|
scores.append((idx, score))
|
||||||
|
|
||||||
|
# 按分数降序排序
|
||||||
|
scores.sort(key=lambda x: x[1], reverse=True)
|
||||||
|
|
||||||
|
return scores[:top_k]
|
||||||
|
|
||||||
|
def _calculate_score(self, query_tokens: List[str], doc_idx: int) -> float:
|
||||||
|
"""计算单个文档的 BM25 分数"""
|
||||||
|
doc_tf = self.doc_term_freqs[doc_idx]
|
||||||
|
doc_len = self.doc_lengths[doc_idx]
|
||||||
|
score = 0.0
|
||||||
|
|
||||||
|
for term in query_tokens:
|
||||||
|
if term not in self.idf:
|
||||||
|
continue
|
||||||
|
|
||||||
|
tf = doc_tf.get(term, 0)
|
||||||
|
idf = self.idf[term]
|
||||||
|
|
||||||
|
# BM25 公式
|
||||||
|
numerator = tf * (self.k1 + 1)
|
||||||
|
denominator = tf + self.k1 * (1 - self.b + self.b * doc_len / self.avg_doc_length)
|
||||||
|
|
||||||
|
score += idf * numerator / denominator
|
||||||
|
|
||||||
|
return score
|
||||||
|
|
||||||
|
def get_scores(self, query: str) -> List[float]:
|
||||||
|
"""获取所有文档的 BM25 分数"""
|
||||||
|
if not self.documents:
|
||||||
|
return []
|
||||||
|
|
||||||
|
query_tokens = self._tokenize(query)
|
||||||
|
if not query_tokens:
|
||||||
|
return [0.0] * len(self.documents)
|
||||||
|
|
||||||
|
return [self._calculate_score(query_tokens, idx) for idx in range(len(self.documents))]
|
||||||
|
|
||||||
|
|
||||||
class RAGService:
|
class RAGService:
|
||||||
"""RAG 检索增强服务"""
|
"""RAG 检索增强服务"""
|
||||||
|
|
||||||
|
# 默认分块参数
|
||||||
|
DEFAULT_CHUNK_SIZE = 500 # 每个文本块的大小(字符数)
|
||||||
|
DEFAULT_CHUNK_OVERLAP = 50 # 块之间的重叠(字符数)
|
||||||
|
|
||||||
def __init__(self):
|
def __init__(self):
|
||||||
self.embedding_model: Optional[SentenceTransformer] = None
|
self.embedding_model = None
|
||||||
self.index: Optional[faiss.Index] = None
|
self.index: Optional[faiss.Index] = None
|
||||||
self.documents: List[Dict[str, Any]] = []
|
self.documents: List[Dict[str, Any]] = []
|
||||||
self.doc_ids: List[str] = []
|
self.doc_ids: List[str] = []
|
||||||
self._dimension: int = 0
|
self._dimension: int = 384 # 默认维度
|
||||||
self._initialized = False
|
self._initialized = False
|
||||||
self._persist_dir = settings.FAISS_INDEX_DIR
|
self._persist_dir = settings.FAISS_INDEX_DIR
|
||||||
# 临时禁用 RAG API 调用,仅记录日志
|
# BM25 索引
|
||||||
self._disabled = True
|
self.bm25: Optional[BM25] = None
|
||||||
logger.info("RAG 服务已禁用(_disabled=True),仅记录索引操作日志")
|
self._bm25_enabled = True # 始终启用 BM25
|
||||||
|
# 检查是否可用
|
||||||
|
self._disabled = not SENTENCE_TRANSFORMERS_AVAILABLE
|
||||||
|
if self._disabled:
|
||||||
|
logger.warning("RAG service disabled (sentence-transformers unavailable); index and retrieve calls return early while _disabled is True")
|
||||||
|
else:
|
||||||
|
logger.info("RAG 服务已启用(向量检索 + BM25 混合检索)")
|
||||||
|
|
||||||
def _init_embeddings(self):
|
def _init_embeddings(self):
|
||||||
"""初始化嵌入模型"""
|
"""初始化嵌入模型"""
|
||||||
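The `BM25` class added in the hunk above scores a document by summing, over query terms, `idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_doc_len))`, with `idf = log((n - df + 0.5) / (df + 0.5) + 1)`. The sketch below reproduces that scoring as a single function so the ranking can be checked by hand. The function names and the sample corpus are assumptions for illustration.

```python
import math
import re
from collections import Counter
from typing import Dict, List


def tokenize(text: str) -> List[str]:
    """Same rule as the diff: runs of CJK characters or ASCII alphanumerics, length > 1."""
    tokens = re.findall(r"[\u4e00-\u9fff]+|[a-zA-Z0-9]+", text.lower())
    return [t for t in tokens if len(t) > 1]


def bm25_scores(query: str, docs: List[str], k1: float = 1.5, b: float = 0.75) -> List[float]:
    """Return one BM25 score per document, in input order."""
    term_freqs = [Counter(tokenize(d)) for d in docs]
    lengths = [sum(tf.values()) for tf in term_freqs]
    n = len(docs)
    avg_len = sum(lengths) / n if n else 0.0

    # Document frequency: number of documents containing each term.
    doc_freq: Dict[str, int] = Counter()
    for tf in term_freqs:
        doc_freq.update(tf.keys())
    idf = {t: math.log((n - df + 0.5) / (df + 0.5) + 1) for t, df in doc_freq.items()}

    query_terms = tokenize(query)
    scores = []
    for tf, doc_len in zip(term_freqs, lengths):
        score = 0.0
        for term in query_terms:
            if term not in idf:
                continue  # term never seen in the corpus
            f = tf.get(term, 0)
            score += idf[term] * f * (k1 + 1) / (f + k1 * (1 - b + b * doc_len / avg_len))
        scores.append(score)
    return scores


corpus = [
    "faiss vector index",
    "bm25 keyword search",
    "hybrid search with bm25 and faiss",
]
scores = bm25_scores("bm25 keyword", corpus)
```

With this corpus the second document ranks first (it contains both query terms and is shorter than average), the third ranks second, and the first scores zero. One consequence of the tokenizer worth keeping in mind: an unbroken run of Chinese characters becomes a single token, so a Chinese query matches only when the same run appears in the document.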
@@ -88,6 +235,63 @@ class RAGService:
|
|||||||
norms = np.where(norms == 0, 1, norms)
|
norms = np.where(norms == 0, 1, norms)
|
||||||
return vectors / norms
|
return vectors / norms
|
||||||
|
|
||||||
|
def _split_into_chunks(self, text: str, chunk_size: int = None, overlap: int = None) -> List[str]:
|
||||||
|
"""
|
||||||
|
将长文本分割成块
|
||||||
|
|
||||||
|
Args:
|
||||||
|
text: 待分割的文本
|
||||||
|
chunk_size: 每个块的大小(字符数)
|
||||||
|
overlap: 块之间的重叠字符数
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
文本块列表
|
||||||
|
"""
|
||||||
|
if chunk_size is None:
|
||||||
|
chunk_size = self.DEFAULT_CHUNK_SIZE
|
||||||
|
if overlap is None:
|
||||||
|
overlap = self.DEFAULT_CHUNK_OVERLAP
|
||||||
|
|
||||||
|
if len(text) <= chunk_size:
|
||||||
|
return [text] if text.strip() else []
|
||||||
|
|
||||||
|
chunks = []
|
||||||
|
start = 0
|
||||||
|
text_len = len(text)
|
||||||
|
|
||||||
|
while start < text_len:
|
||||||
|
# 计算当前块的结束位置
|
||||||
|
end = start + chunk_size
|
||||||
|
|
||||||
|
# 如果不是最后一块,尝试在句子边界处切割
|
||||||
|
if end < text_len:
|
||||||
|
# 向前查找最后一个句号、逗号、换行或分号
|
||||||
|
cut_positions = []
|
||||||
|
for i in range(end, max(start, end - 100), -1):
|
||||||
|
if text[i] in '。;,,\n、':
|
||||||
|
cut_positions.append(i + 1)
|
||||||
|
break
|
||||||
|
|
||||||
|
if cut_positions:
|
||||||
|
end = cut_positions[0]
|
||||||
|
else:
|
||||||
|
# 如果没找到句子边界,尝试向后查找
|
||||||
|
for i in range(end, min(text_len, end + 50)):
|
||||||
|
if text[i] in '。;,,\n、':
|
||||||
|
end = i + 1
|
||||||
|
break
|
||||||
|
|
||||||
|
chunk = text[start:end].strip()
|
||||||
|
if chunk:
|
||||||
|
chunks.append(chunk)
|
||||||
|
|
||||||
|
# 移动起始位置(考虑重叠)
|
||||||
|
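start = text_len if end >= text_len else (end - overlap if end - overlap > start else end)  # stop after the final chunk and always move forward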
|
||||||
|
if start <= 0:
|
||||||
|
start = end
|
||||||
|
|
||||||
|
return chunks
|
||||||
|
|
||||||
def index_field(
|
def index_field(
|
||||||
self,
|
self,
|
||||||
table_name: str,
|
table_name: str,
|
||||||
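`_split_into_chunks` in the hunk above cuts text into overlapping windows and prefers to end each window at a nearby punctuation mark. The following is a simplified sketch of the same idea, not the code from the diff: it only looks backward for a boundary, stops explicitly after the last chunk, and refuses parameter combinations that could stall the loop. The `lookback` parameter, the `BOUNDARIES` set, and the function name are assumptions introduced for illustration.

```python
from typing import List

# Assumed boundary characters: full-width and ASCII punctuation plus newline.
BOUNDARIES = set("。;;,,、\n")


def split_into_chunks(
    text: str, chunk_size: int = 500, overlap: int = 50, lookback: int = 100
) -> List[str]:
    """Split text into chunks of at most chunk_size characters with overlap."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")

    chunks: List[str] = []
    start, n = 0, len(text)
    while start < n:
        end = min(start + chunk_size, n)
        if end < n:
            # Search backward for a boundary, but never so far back that
            # the next start (end - overlap) would fail to advance.
            floor = max(start + overlap, end - lookback)
            for i in range(end - 1, floor, -1):
                if text[i] in BOUNDARIES:
                    end = i + 1
                    break
        chunk = text[start:end].strip()
        if chunk:
            chunks.append(chunk)
        if end >= n:
            break  # final chunk emitted; avoids a tail made only of overlap
        start = end - overlap
    return chunks


sample = "x" * 480 + "。" + "y" * 600
parts = split_into_chunks(sample)
```

In this example the first chunk ends right after the full-width period at index 480, and each later chunk begins 50 characters before the previous one ended.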
@@ -124,9 +328,20 @@ class RAGService:
|
|||||||
self,
|
self,
|
||||||
doc_id: str,
|
doc_id: str,
|
||||||
content: str,
|
content: str,
|
||||||
metadata: Optional[Dict[str, Any]] = None
|
metadata: Optional[Dict[str, Any]] = None,
|
||||||
|
chunk_size: int = None,
|
||||||
|
chunk_overlap: int = None
|
||||||
):
|
):
|
||||||
"""将文档内容索引到向量数据库"""
|
"""
|
||||||
|
将文档内容索引到向量数据库(自动分块)
|
||||||
|
|
||||||
|
Args:
|
||||||
|
doc_id: 文档唯一标识
|
||||||
|
content: 文档内容
|
||||||
|
metadata: 文档元数据
|
||||||
|
chunk_size: 文本块大小(字符数),默认500
|
||||||
|
chunk_overlap: 块之间的重叠字符数,默认50
|
||||||
|
"""
|
||||||
if self._disabled:
|
if self._disabled:
|
||||||
logger.info(f"[RAG DISABLED] 文档索引操作已跳过: {doc_id}")
|
logger.info(f"[RAG DISABLED] 文档索引操作已跳过: {doc_id}")
|
||||||
return
|
return
|
||||||
@@ -139,18 +354,70 @@ class RAGService:
|
|||||||
logger.debug(f"文档跳过索引 (无嵌入模型): {doc_id}")
|
logger.debug(f"文档跳过索引 (无嵌入模型): {doc_id}")
|
||||||
return
|
return
|
||||||
|
|
||||||
doc = SimpleDocument(
|
# 分割文档为小块
|
||||||
page_content=content,
|
if chunk_size is None:
|
||||||
metadata=metadata or {"doc_id": doc_id}
|
chunk_size = self.DEFAULT_CHUNK_SIZE
|
||||||
)
|
if chunk_overlap is None:
|
||||||
self._add_documents([doc], [doc_id])
|
chunk_overlap = self.DEFAULT_CHUNK_OVERLAP
|
||||||
logger.debug(f"已索引文档: {doc_id}")
|
|
||||||
|
chunks = self._split_into_chunks(content, chunk_size, chunk_overlap)
|
||||||
|
|
||||||
|
if not chunks:
|
||||||
|
logger.warning(f"文档内容为空,跳过索引: {doc_id}")
|
||||||
|
return
|
||||||
|
|
||||||
|
# 为每个块创建文档对象
|
||||||
|
documents = []
|
||||||
|
chunk_ids = []
|
||||||
|
|
||||||
|
for i, chunk in enumerate(chunks):
|
||||||
|
chunk_id = f"{doc_id}_chunk_{i}"
|
||||||
|
chunk_metadata = metadata.copy() if metadata else {}
|
||||||
|
chunk_metadata.update({
|
||||||
|
"chunk_index": i,
|
||||||
|
"total_chunks": len(chunks),
|
||||||
|
"doc_id": doc_id
|
||||||
|
})
|
||||||
|
|
||||||
|
documents.append(SimpleDocument(
|
||||||
|
page_content=chunk,
|
||||||
|
metadata=chunk_metadata
|
||||||
|
))
|
||||||
|
chunk_ids.append(chunk_id)
|
||||||
|
|
||||||
|
# 批量添加文档
|
||||||
|
self._add_documents(documents, chunk_ids)
|
||||||
|
logger.info(f"已索引文档 {doc_id},共 {len(chunks)} 个块")
|
||||||
|
|
||||||
def _add_documents(self, documents: List[SimpleDocument], doc_ids: List[str]):
|
def _add_documents(self, documents: List[SimpleDocument], doc_ids: List[str]):
|
||||||
"""批量添加文档到向量索引"""
|
"""批量添加文档到向量索引"""
|
||||||
if not documents:
|
if not documents:
|
||||||
return
|
return
|
||||||
|
|
||||||
|
# 总是将文档存储在内存中(用于 BM25 和关键词搜索)
|
||||||
|
for doc, did in zip(documents, doc_ids):
|
||||||
|
self.documents.append({"id": did, "content": doc.page_content, "metadata": doc.metadata})
|
||||||
|
self.doc_ids.append(did)
|
||||||
|
|
||||||
|
# 构建 BM25 索引
|
||||||
|
if self._bm25_enabled and documents:
|
||||||
|
bm25_texts = [doc.page_content for doc in documents]
|
||||||
|
if self.bm25 is None:
|
||||||
|
self.bm25 = BM25()
|
||||||
|
self.bm25.fit(bm25_texts, doc_ids)
|
||||||
|
else:
|
||||||
|
# 增量添加:重新构建(BM25 不支持增量)
|
||||||
|
all_texts = [d["content"] for d in self.documents]
|
||||||
|
all_ids = self.doc_ids.copy()
|
||||||
|
self.bm25 = BM25()
|
||||||
|
self.bm25.fit(all_texts, all_ids)
|
||||||
|
logger.debug(f"BM25 索引更新: {len(documents)} 个文档")
|
||||||
|
|
||||||
|
# 如果没有嵌入模型,跳过向量索引
|
||||||
|
if self.embedding_model is None:
|
||||||
|
logger.debug(f"文档跳过向量索引 (无嵌入模型): {len(documents)} 个文档")
|
||||||
|
return
|
||||||
|
|
||||||
texts = [doc.page_content for doc in documents]
|
texts = [doc.page_content for doc in documents]
|
||||||
embeddings = self.embedding_model.encode(texts, convert_to_numpy=True)
|
embeddings = self.embedding_model.encode(texts, convert_to_numpy=True)
|
||||||
embeddings = self._normalize_vectors(embeddings).astype('float32')
|
embeddings = self._normalize_vectors(embeddings).astype('float32')
|
||||||
@@ -162,12 +429,18 @@ class RAGService:
|
|||||||
id_array = np.array(id_list, dtype='int64')
|
id_array = np.array(id_list, dtype='int64')
|
||||||
self.index.add_with_ids(embeddings, id_array)
|
self.index.add_with_ids(embeddings, id_array)
|
||||||
|
|
||||||
for doc, did in zip(documents, doc_ids):
|
def retrieve(self, query: str, top_k: int = 5, min_score: float = 0.3) -> List[Dict[str, Any]]:
|
||||||
self.documents.append({"id": did, "content": doc.page_content, "metadata": doc.metadata})
|
"""
|
||||||
self.doc_ids.append(did)
|
根据查询检索相关文档块(混合检索:向量 + BM25)
|
||||||
|
|
||||||
def retrieve(self, query: str, top_k: int = 5) -> List[Dict[str, Any]]:
|
Args:
|
||||||
"""根据查询检索相关文档"""
|
query: 查询文本
|
||||||
|
top_k: 返回的最大结果数
|
||||||
|
min_score: 最低相似度分数阈值
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
相关文档块列表,每项包含 content, metadata, score, doc_id, chunk_index
|
||||||
|
"""
|
||||||
if self._disabled:
|
if self._disabled:
|
||||||
logger.info(f"[RAG DISABLED] 检索操作已跳过: query={query}, top_k={top_k}")
|
logger.info(f"[RAG DISABLED] 检索操作已跳过: query={query}, top_k={top_k}")
|
||||||
return []
|
return []
|
||||||
@@ -175,28 +448,241 @@ class RAGService:
|
|||||||
if not self._initialized:
|
if not self._initialized:
|
||||||
self._init_vector_store()
|
self._init_vector_store()
|
||||||
|
|
||||||
if self.index is None or self.index.ntotal == 0:
|
# 获取向量检索结果
|
||||||
|
vector_results = self._vector_search(query, top_k * 2, min_score)
|
||||||
|
|
||||||
|
# 获取 BM25 检索结果
|
||||||
|
bm25_results = self._bm25_search(query, top_k * 2)
|
||||||
|
|
||||||
|
# 混合融合
|
||||||
|
hybrid_results = self._hybrid_fusion(vector_results, bm25_results, top_k)
|
||||||
|
|
||||||
|
if hybrid_results:
|
||||||
|
logger.info(f"混合检索到 {len(hybrid_results)} 条相关文档块 (向量:{len(vector_results)}, BM25:{len(bm25_results)})")
|
||||||
|
return hybrid_results
|
||||||
|
|
||||||
|
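# Fallback: BM25 only (unreachable for top_k >= 1, since non-empty BM25 results always produce non-empty fused results)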
|
||||||
|
if bm25_results:
|
||||||
|
logger.info(f"降级到 BM25 检索: {len(bm25_results)} 条")
|
||||||
|
return bm25_results
|
||||||
|
|
||||||
|
# Fallback: plain keyword search
|
||||||
|
logger.info("降级到关键词搜索")
|
||||||
|
return self._keyword_search(query, top_k)
|
||||||
|
|
||||||
|
def _vector_search(self, query: str, top_k: int, min_score: float) -> List[Dict[str, Any]]:
|
||||||
|
"""向量检索"""
|
||||||
|
if self.index is None or self.index.ntotal == 0 or self.embedding_model is None:
|
||||||
return []
|
return []
|
||||||
|
|
||||||
query_embedding = self.embedding_model.encode([query], convert_to_numpy=True)
|
try:
|
||||||
query_embedding = self._normalize_vectors(query_embedding).astype('float32')
|
query_embedding = self.embedding_model.encode([query], convert_to_numpy=True)
|
||||||
|
query_embedding = self._normalize_vectors(query_embedding).astype('float32')
|
||||||
|
|
||||||
scores, indices = self.index.search(query_embedding, min(top_k, self.index.ntotal))
|
scores, indices = self.index.search(query_embedding, min(top_k * 2, self.index.ntotal))
|
||||||
|
|
||||||
results = []
|
results = []
|
||||||
for score, idx in zip(scores[0], indices[0]):
|
for score, idx in zip(scores[0], indices[0]):
|
||||||
if idx < 0:
|
if idx < 0:
|
||||||
continue
|
continue
|
||||||
doc = self.documents[idx]
|
if score < min_score:
|
||||||
results.append({
|
continue
|
||||||
"content": doc["content"],
|
doc = self.documents[idx]
|
||||||
"metadata": doc["metadata"],
|
results.append({
|
||||||
"score": float(score),
|
"content": doc["content"],
|
||||||
"doc_id": doc["id"]
|
"metadata": doc["metadata"],
|
||||||
|
"score": float(score),
|
||||||
|
"doc_id": doc["id"],
|
||||||
|
"chunk_index": doc["metadata"].get("chunk_index", 0),
|
||||||
|
"search_type": "vector"
|
||||||
|
})
|
||||||
|
|
||||||
|
return results
|
||||||
|
except Exception as e:
|
||||||
|
logger.warning(f"向量检索失败: {e}")
|
||||||
|
return []
|
||||||
|
|
||||||
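Review note: `_vector_search` normalizes both the stored embeddings and the query embedding before searching, then compares each score with `min_score`. The sketch below shows why that threshold can be read as a cosine similarity. It assumes the FAISS index is an inner-product index, which this hunk does not show; `l2_normalize` and `inner_product` are hypothetical pure-Python helpers, not part of the PR.

```python
import math
from typing import List, Sequence


def l2_normalize(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit length; a zero vector is returned as zeros."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0:
        return [0.0] * len(vec)
    return [x / norm for x in vec]


def inner_product(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product; equals cosine similarity when both inputs are unit length."""
    return sum(x * y for x, y in zip(a, b))


# Assumption: the index scores by inner product, so after normalization
# every score lies in [-1, 1] and can be compared against min_score.
query = l2_normalize([3.0, 4.0])
doc = l2_normalize([4.0, 3.0])
similarity = inner_product(query, doc)
```

If the index used L2 distance instead, smaller values would mean closer matches and the `score < min_score` filter would need to be inverted.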
|
def _bm25_search(self, query: str, top_k: int) -> List[Dict[str, Any]]:
|
||||||
|
"""BM25 检索"""
|
||||||
|
if not self.bm25 or not self.documents:
|
||||||
|
return []
|
||||||
|
|
||||||
|
try:
|
||||||
|
bm25_scores = self.bm25.get_scores(query)
|
||||||
|
if not bm25_scores:
|
||||||
|
return []
|
||||||
|
|
||||||
|
# Min-max normalize BM25 scores to [0, 1]
|
||||||
|
max_score = max(bm25_scores) if bm25_scores else 1
|
||||||
|
min_score_bm = min(bm25_scores) if bm25_scores else 0
|
||||||
|
score_range = max_score - min_score_bm if max_score != min_score_bm else 1
|
||||||
|
|
||||||
|
results = []
|
||||||
|
for idx, score in enumerate(bm25_scores):
|
||||||
|
if score <= 0:
|
||||||
|
continue
|
||||||
|
# Normalize
|
||||||
|
normalized_score = (score - min_score_bm) / score_range if score_range > 0 else 0
|
||||||
|
doc = self.documents[idx]
|
||||||
|
results.append({
|
||||||
|
"content": doc["content"],
|
||||||
|
"metadata": doc["metadata"],
|
||||||
|
"score": float(normalized_score),
|
||||||
|
"doc_id": doc["id"],
|
||||||
|
"chunk_index": doc["metadata"].get("chunk_index", 0),
|
||||||
|
"search_type": "bm25"
|
||||||
|
})
|
||||||
|
|
||||||
|
# Sort by score, descending
|
||||||
|
results.sort(key=lambda x: x["score"], reverse=True)
|
||||||
|
return results[:top_k]
|
||||||
|
|
||||||
|
except Exception as e:
|
||||||
|
logger.warning(f"BM25 检索失败: {e}")
|
||||||
|
return []
|
||||||
|
|
||||||
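Review note: the min-max normalization in `_bm25_search` can be isolated into a small helper, sketched below. `min_max_normalize` is a hypothetical name. The helper converts its input to a list first, so it also works if the scorer returns a NumPy array; some BM25 implementations do, and a bare `if not bm25_scores:` check would raise on an array with more than one element. Whether `self.bm25.get_scores` returns a list or an array is not visible in this diff.

```python
from typing import List, Sequence


def min_max_normalize(scores: Sequence[float]) -> List[float]:
    """Scale scores to [0, 1]; a constant input maps to all zeros."""
    values = [float(s) for s in scores]  # also accepts array-like input
    if not values:
        return []
    lowest, highest = min(values), max(values)
    span = highest - lowest
    if span == 0:
        # Same outcome as the diff: (score - min) / 1 == 0 for every score.
        return [0.0 for _ in values]
    return [(v - lowest) / span for v in values]
```

Note that the lowest-scoring document always receives 0.0 under this scheme, even when it matched the query.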
|
def _hybrid_fusion(
|
||||||
|
self,
|
||||||
|
vector_results: List[Dict[str, Any]],
|
||||||
|
bm25_results: List[Dict[str, Any]],
|
||||||
|
top_k: int
|
||||||
|
) -> List[Dict[str, Any]]:
|
||||||
|
"""
|
||||||
|
Fuse the vector and BM25 search results.
|
||||||
|
|
||||||
|
Uses a weighted Reciprocal Rank Fusion (RRF) variant with 1-based ranks and no smoothing constant:
|
||||||
|
Score = weight_vector * (1 / rank_vector) + weight_bm25 * (1 / rank_bm25)
|
||||||
|
|
||||||
|
Args:
|
||||||
|
vector_results: vector search results
|
||||||
|
bm25_results: BM25 search results
|
||||||
|
top_k: number of results to return
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
Fused results, sorted by fused score in descending order
|
||||||
|
"""
|
||||||
|
if not vector_results and not bm25_results:
|
||||||
|
return []
|
||||||
|
|
||||||
|
# Fusion weights
|
||||||
|
weight_vector = 0.6
|
||||||
|
weight_bm25 = 0.4
|
||||||
|
|
||||||
|
# Map doc_id -> per-source score contributions
|
||||||
|
doc_scores: Dict[str, Dict[str, float]] = {}
|
||||||
|
|
||||||
|
# Add vector search results
|
||||||
|
for rank, result in enumerate(vector_results):
|
||||||
|
doc_id = result["doc_id"]
|
||||||
|
if doc_id not in doc_scores:
|
||||||
|
doc_scores[doc_id] = {"vector": 0, "bm25": 0, "content": result["content"], "metadata": result["metadata"]}
|
||||||
|
# Weighted reciprocal rank (rank is 0-based here, hence rank + 1)
|
||||||
|
doc_scores[doc_id]["vector"] = weight_vector / (rank + 1)
|
||||||
|
|
||||||
|
# Add BM25 search results
|
||||||
|
for rank, result in enumerate(bm25_results):
|
||||||
|
doc_id = result["doc_id"]
|
||||||
|
if doc_id not in doc_scores:
|
||||||
|
doc_scores[doc_id] = {"vector": 0, "bm25": 0, "content": result["content"], "metadata": result["metadata"]}
|
||||||
|
doc_scores[doc_id]["bm25"] = weight_bm25 / (rank + 1)
|
||||||
|
|
||||||
|
# Compute fused scores
|
||||||
|
fused_results = []
|
||||||
|
for doc_id, scores in doc_scores.items():
|
||||||
|
fused_score = scores["vector"] + scores["bm25"]
|
||||||
|
# Keep the raw vector search score for reference
|
||||||
|
vector_score = next((r["score"] for r in vector_results if r["doc_id"] == doc_id), 0.0)
|
||||||
|
fused_results.append({
|
||||||
|
"content": scores["content"],
|
||||||
|
"metadata": scores["metadata"],
|
||||||
|
"score": fused_score,
|
||||||
|
"doc_id": doc_id,
|
||||||
|
"chunk_index": scores["metadata"].get("chunk_index", 0),
|
||||||
|
"vector_score": vector_score,
|
||||||
|
"bm25_score": scores["bm25"],
|
||||||
|
"search_type": "hybrid"
|
||||||
})
|
})
|
||||||
|
|
||||||
logger.debug(f"检索到 {len(results)} 条相关文档")
|
# Sort by fused score, descending
|
||||||
return results
|
fused_results.sort(key=lambda x: x["score"], reverse=True)
|
||||||
|
|
||||||
|
logger.debug(f"混合融合: {len(fused_results)} 个文档, 向量:{len(vector_results)}, BM25:{len(bm25_results)}")
|
||||||
|
|
||||||
|
return fused_results[:top_k]
|
||||||
|
|
||||||
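Review note: the fusion in `_hybrid_fusion` is a weighted reciprocal-rank sum. The sketch below restates it as a standalone function so the scoring can be checked by hand. `weighted_rrf` and its parameters are hypothetical names, not part of the PR. With `k=0` it reproduces the formula in the docstring; the usual RRF formulation adds a smoothing constant (60 is a commonly used value) so that the top rank does not dominate. The sketch assumes each ranking lists a document at most once.

```python
from typing import Dict, List, Sequence, Tuple


def weighted_rrf(
    rankings: Dict[str, Sequence[str]],
    weights: Dict[str, float],
    k: int = 0,
    top_k: int = 5,
) -> List[Tuple[str, float]]:
    """Fuse ranked ID lists: score(d) = sum over sources of w / (k + rank)."""
    fused: Dict[str, float] = {}
    for source, ranked_ids in rankings.items():
        weight = weights.get(source, 1.0)
        # Ranks are 1-based, matching weight / (rank + 1) with a 0-based index.
        for rank, doc_id in enumerate(ranked_ids, start=1):
            fused[doc_id] = fused.get(doc_id, 0.0) + weight / (k + rank)
    ordered = sorted(fused.items(), key=lambda item: item[1], reverse=True)
    return ordered[:top_k]


example = weighted_rrf(
    rankings={"vector": ["a", "b"], "bm25": ["b", "c"]},
    weights={"vector": 0.6, "bm25": 0.4},
)
```

In the example, document `b` is ranked second by vector search and first by BM25, so it scores 0.6/2 + 0.4/1 and ends up ahead of `a`, which only the vector search returned.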
|
def _keyword_search(self, query: str, top_k: int = 5) -> List[Dict[str, Any]]:
|
||||||
|
"""
|
||||||
|
关键词搜索后备方案
|
||||||
|
|
||||||
|
Args:
|
||||||
|
query: 查询文本
|
||||||
|
top_k: 返回的最大结果数
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
相关文档块列表
|
||||||
|
"""
|
||||||
|
if not self.documents:
|
||||||
|
return []
|
||||||
|
|
||||||
|
# 提取查询关键词
|
||||||
|
keywords = []
|
||||||
|
for char in query:
|
||||||
|
if '\u4e00' <= char <= '\u9fff': # 中文字符
|
||||||
|
keywords.append(char)
|
||||||
|
# 添加英文单词
|
||||||
|
import re
|
||||||
|
english_words = re.findall(r'[a-zA-Z]+', query)
|
||||||
|
keywords.extend(english_words)
|
||||||
|
|
||||||
|
if not keywords:
|
||||||
|
return []
|
||||||
|
|
||||||
|
results = []
|
||||||
|
for doc in self.documents:
|
||||||
|
content = doc["content"]
|
||||||
|
# 计算关键词匹配分数
|
||||||
|
score = 0
|
||||||
|
matched_keywords = 0
|
||||||
|
for kw in keywords:
|
||||||
|
if kw in content:
|
||||||
|
score += 1
|
||||||
|
matched_keywords += 1
|
||||||
|
|
||||||
|
if matched_keywords > 0:
|
||||||
|
# 归一化分数
|
||||||
|
score = score / max(len(keywords), 1)
|
||||||
|
results.append({
|
||||||
|
"content": content,
|
||||||
|
"metadata": doc["metadata"],
|
||||||
|
"score": score,
|
||||||
|
"doc_id": doc["id"],
|
||||||
|
"chunk_index": doc["metadata"].get("chunk_index", 0)
|
||||||
|
})
|
||||||
|
|
||||||
|
# 按分数排序
|
||||||
|
results.sort(key=lambda x: x["score"], reverse=True)
|
||||||
|
|
||||||
|
logger.debug(f"关键词搜索返回 {len(results[:top_k])} 条结果")
|
||||||
|
return results[:top_k]
|
||||||
|
|
||||||
|
def retrieve_by_doc_id(self, doc_id: str, top_k: int = 10) -> List[Dict[str, Any]]:
|
||||||
|
"""
|
||||||
|
获取指定文档的所有块
|
||||||
|
|
||||||
|
Args:
|
||||||
|
doc_id: 文档ID
|
||||||
|
top_k: 返回的最大结果数
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
该文档的所有块
|
||||||
|
"""
|
||||||
|
# 获取属于该文档的所有块
|
||||||
|
doc_chunks = [d for d in self.documents if d["metadata"].get("doc_id") == doc_id]
|
||||||
|
|
||||||
|
# 按 chunk_index 排序
|
||||||
|
doc_chunks.sort(key=lambda x: x["metadata"].get("chunk_index", 0))
|
||||||
|
|
||||||
|
# 返回指定数量
|
||||||
|
return doc_chunks[:top_k]
|
||||||
|
|
||||||
def retrieve_by_table(self, table_name: str, top_k: int = 5) -> List[Dict[str, Any]]:
|
def retrieve_by_table(self, table_name: str, top_k: int = 5) -> List[Dict[str, Any]]:
|
||||||
"""检索指定表的字段"""
|
"""检索指定表的字段"""
|
||||||
|
|||||||
File diff suppressed because it is too large
639
backend/app/services/word_ai_service.py
Normal file
@@ -0,0 +1,639 @@
|
|||||||
|
"""
|
||||||
|
Word 文档 AI 解析服务
|
||||||
|
|
||||||
|
使用 LLM (GLM) 对 Word 文档进行深度理解,提取结构化数据
|
||||||
|
"""
|
||||||
|
import logging
|
||||||
|
from typing import Dict, Any, List, Optional
|
||||||
|
import json
|
||||||
|
|
||||||
|
from app.services.llm_service import llm_service
|
||||||
|
from app.core.document_parser.docx_parser import DocxParser
|
||||||
|
|
||||||
|
logger = logging.getLogger(__name__)
|
||||||
|
|
||||||
|
|
||||||
|
class WordAIService:
|
||||||
|
"""Word 文档 AI 解析服务"""
|
||||||
|
|
||||||
|
def __init__(self):
|
||||||
|
self.llm = llm_service
|
||||||
|
self.parser = DocxParser()
|
||||||
|
|
||||||
|
async def parse_word_with_ai(
|
||||||
|
self,
|
||||||
|
file_path: str,
|
||||||
|
user_hint: str = ""
|
||||||
|
) -> Dict[str, Any]:
|
||||||
|
"""
|
||||||
|
使用 AI 解析 Word 文档,提取结构化数据
|
||||||
|
|
||||||
|
适用于从非结构化的 Word 文档中提取表格数据、键值对等信息
|
||||||
|
|
||||||
|
Args:
|
||||||
|
file_path: Word 文件路径
|
||||||
|
user_hint: 用户提示词,指定要提取的内容类型
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
Dict: 包含结构化数据的解析结果
|
||||||
|
"""
|
||||||
|
try:
|
||||||
|
# 1. 先用基础解析器提取原始内容
|
||||||
|
parse_result = self.parser.parse(file_path)
|
||||||
|
|
||||||
|
if not parse_result.success:
|
||||||
|
return {
|
||||||
|
"success": False,
|
||||||
|
"error": parse_result.error,
|
||||||
|
"structured_data": None
|
||||||
|
}
|
||||||
|
|
||||||
|
# 2. 获取原始数据
|
||||||
|
raw_data = parse_result.data
|
||||||
|
paragraphs = raw_data.get("paragraphs", [])
|
||||||
|
paragraphs_with_style = raw_data.get("paragraphs_with_style", [])
|
||||||
|
tables = raw_data.get("tables", [])
|
||||||
|
content = raw_data.get("content", "")
|
||||||
|
images_info = raw_data.get("images", {})
|
||||||
|
metadata = parse_result.metadata or {}
|
||||||
|
|
||||||
|
image_count = images_info.get("image_count", 0)
|
||||||
|
image_descriptions = images_info.get("descriptions", [])
|
||||||
|
|
||||||
|
logger.info(f"Word 基础解析完成: {len(paragraphs)} 个段落, {len(tables)} 个表格, {image_count} 张图片")
|
||||||
|
|
||||||
|
# 3. 提取图片数据(用于视觉分析)
|
||||||
|
images_base64 = []
|
||||||
|
if image_count > 0:
|
||||||
|
try:
|
||||||
|
images_base64 = self.parser.extract_images_as_base64(file_path)
|
||||||
|
logger.info(f"提取到 {len(images_base64)} 张图片的 base64 数据")
|
||||||
|
except Exception as e:
|
||||||
|
logger.warning(f"提取图片 base64 失败: {str(e)}")
|
||||||
|
|
||||||
|
# 4. 根据内容类型选择 AI 解析策略
|
||||||
|
# 如果有图片,先分析图片
|
||||||
|
image_analysis = ""
|
||||||
|
if images_base64:
|
||||||
|
image_analysis = await self._analyze_images_with_ai(images_base64, user_hint)
|
||||||
|
logger.info(f"图片 AI 分析完成: {len(image_analysis)} 字符")
|
||||||
|
|
||||||
|
# 优先处理:表格 > (表格+文本) > 纯文本
|
||||||
|
if tables and len(tables) > 0:
|
||||||
|
structured_data = await self._extract_tables_with_ai(
|
||||||
|
tables, paragraphs, image_count, user_hint, metadata, image_analysis
|
||||||
|
)
|
||||||
|
elif paragraphs and len(paragraphs) > 0:
|
||||||
|
structured_data = await self._extract_from_text_with_ai(
|
||||||
|
paragraphs, content, image_count, image_descriptions, user_hint, image_analysis
|
||||||
|
)
|
||||||
|
else:
|
||||||
|
structured_data = {
|
||||||
|
"success": True,
|
||||||
|
"type": "empty",
|
||||||
|
"message": "文档内容为空"
|
||||||
|
}
|
||||||
|
|
||||||
|
# 添加图片分析结果
|
||||||
|
if image_analysis:
|
||||||
|
structured_data["image_analysis"] = image_analysis
|
||||||
|
|
||||||
|
return structured_data
|
||||||
|
|
||||||
|
except Exception as e:
|
||||||
|
logger.error(f"AI 解析 Word 文档失败: {str(e)}")
|
||||||
|
return {
|
||||||
|
"success": False,
|
||||||
|
"error": str(e),
|
||||||
|
"structured_data": None
|
||||||
|
}
|
||||||
|
|
||||||
|
async def _extract_tables_with_ai(
|
||||||
|
self,
|
||||||
|
tables: List[Dict],
|
||||||
|
paragraphs: List[str],
|
||||||
|
image_count: int,
|
||||||
|
user_hint: str,
|
||||||
|
metadata: Dict,
|
||||||
|
image_analysis: str = ""
|
||||||
|
) -> Dict[str, Any]:
|
||||||
|
"""
|
||||||
|
使用 AI 从 Word 表格和文本中提取结构化数据
|
||||||
|
|
||||||
|
Args:
|
||||||
|
tables: 表格列表
|
||||||
|
paragraphs: 段落列表
|
||||||
|
image_count: 图片数量
|
||||||
|
user_hint: 用户提示
|
||||||
|
metadata: 文档元数据
|
||||||
|
image_analysis: 图片 AI 分析结果
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
结构化数据
|
||||||
|
"""
|
||||||
|
try:
|
||||||
|
# 构建表格文本描述
|
||||||
|
tables_text = self._build_tables_description(tables)
|
||||||
|
|
||||||
|
# 构建段落描述
|
||||||
|
paragraphs_text = "\n".join(paragraphs[:50]) if paragraphs else "(无正文文本)"
|
||||||
|
if len(paragraphs) > 50:
|
||||||
|
paragraphs_text += f"\n...(共 {len(paragraphs)} 个段落,仅显示前50个)"
|
||||||
|
|
||||||
|
# 图片提示
|
||||||
|
image_hint = f"注意:此文档包含 {image_count} 张图片/图表。" if image_count > 0 else ""
|
||||||
|
|
||||||
|
prompt = f"""你是一个专业的数据提取专家。请从以下 Word 文档的完整内容中提取结构化数据。
|
||||||
|
|
||||||
|
【用户需求】
|
||||||
|
{user_hint if user_hint else "请提取文档中的所有结构化数据,包括表格数据、键值对、列表项等。"}
|
||||||
|
|
||||||
|
【文档正文(段落)】
|
||||||
|
{paragraphs_text}
|
||||||
|
|
||||||
|
【文档表格】
|
||||||
|
{tables_text}
|
||||||
|
|
||||||
|
【文档图片信息】
|
||||||
|
{image_hint}
|
||||||
|
|
||||||
|
请按照以下 JSON 格式输出:
|
||||||
|
{{
|
||||||
|
"type": "table_data",
|
||||||
|
"headers": ["列1", "列2", ...],
|
||||||
|
"rows": [["行1列1", "行1列2", ...], ["行2列1", "行2列2", ...], ...],
|
||||||
|
"key_values": {{"键1": "值1", "键2": "值2", ...}},
|
||||||
|
"list_items": ["项1", "项2", ...],
|
||||||
|
"description": "文档内容描述"
|
||||||
|
}}
|
||||||
|
|
||||||
|
重点:
|
||||||
|
- 优先从表格中提取结构化数据
|
||||||
|
- 如果表格中有表头,headers 是表头,rows 是数据行
|
||||||
|
- 如果文档中有键值对(如 名称: 张三),提取到 key_values 中
|
||||||
|
- 如果文档中有列表项,提取到 list_items 中
|
||||||
|
- 图片内容无法直接提取,但请在 description 中说明图片的大致主题(如"包含流程图"、"包含数据图表"等)
|
||||||
|
"""
|
||||||
|
|
||||||
|
messages = [
|
||||||
|
{"role": "system", "content": "你是一个专业的数据提取助手。请严格按JSON格式输出。"},
|
||||||
|
{"role": "user", "content": prompt}
|
||||||
|
]
|
||||||
|
|
||||||
|
response = await self.llm.chat(
|
||||||
|
messages=messages,
|
||||||
|
temperature=0.1,
|
||||||
|
max_tokens=50000
|
||||||
|
)
|
||||||
|
|
||||||
|
content = self.llm.extract_message_content(response)
|
||||||
|
|
||||||
|
# 解析 JSON
|
||||||
|
result = self._parse_json_response(content)
|
||||||
|
|
||||||
|
if result:
|
||||||
|
logger.info(f"AI 表格提取成功: {len(result.get('rows', []))} 行数据, key_values={len(result.get('key_values', {}))}, list_items={len(result.get('list_items', []))}")
|
||||||
|
return {
|
||||||
|
"success": True,
|
||||||
|
"type": "table_data",
|
||||||
|
"headers": result.get("headers", []),
|
||||||
|
"rows": result.get("rows", []),
|
||||||
|
"description": result.get("description", ""),
|
||||||
|
"key_values": result.get("key_values", {}),
|
||||||
|
"list_items": result.get("list_items", [])
|
||||||
|
}
|
||||||
|
else:
|
||||||
|
# 如果 AI 返回格式不对,尝试直接解析表格
|
||||||
|
return self._fallback_table_parse(tables)
|
||||||
|
|
||||||
|
except Exception as e:
|
||||||
|
logger.error(f"AI 表格提取失败: {str(e)}")
|
||||||
|
return self._fallback_table_parse(tables)
|
||||||
|
|
||||||
|
async def _extract_from_text_with_ai(
|
||||||
|
self,
|
||||||
|
paragraphs: List[str],
|
||||||
|
full_text: str,
|
||||||
|
image_count: int,
|
||||||
|
image_descriptions: List[str],
|
||||||
|
user_hint: str,
|
||||||
|
image_analysis: str = ""
|
||||||
|
) -> Dict[str, Any]:
|
||||||
|
"""
|
||||||
|
使用 AI 从 Word 纯文本中提取结构化数据
|
||||||
|
|
||||||
|
Args:
|
||||||
|
paragraphs: 段落列表
|
||||||
|
full_text: 完整文本
|
||||||
|
image_count: 图片数量
|
||||||
|
image_descriptions: 图片描述列表
|
||||||
|
user_hint: 用户提示
|
||||||
|
image_analysis: 图片 AI 分析结果
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
结构化数据
|
||||||
|
"""
|
||||||
|
try:
|
||||||
|
# 限制文本长度
|
||||||
|
text_preview = full_text[:8000] if len(full_text) > 8000 else full_text
|
||||||
|
|
||||||
|
# 图片提示
|
||||||
|
image_hint = f"\n【文档图片】此文档包含 {image_count} 张图片/图表。" if image_count > 0 else ""
|
||||||
|
if image_descriptions:
|
||||||
|
image_hint += "\n" + "\n".join(image_descriptions)
|
||||||
|
|
||||||
|
prompt = f"""你是一个专业的数据提取专家。请从以下 Word 文档的完整内容中提取结构化数据。
|
||||||
|
|
||||||
|
【用户需求】
|
||||||
|
{user_hint if user_hint else "请识别并提取文档中的关键信息,包括:表格数据、键值对、列表项等。"}
|
||||||
|
|
||||||
|
【文档正文】{image_hint}
|
||||||
|
{text_preview}
|
||||||
|
|
||||||
|
请按照以下 JSON 格式输出:
|
||||||
|
{{
|
||||||
|
"type": "structured_text",
|
||||||
|
"tables": [{{"headers": [...], "rows": [...]}}],
|
||||||
|
"key_values": {{"键1": "值1", "键2": "值2", ...}},
|
||||||
|
"list_items": ["项1", "项2", ...],
|
||||||
|
"summary": "文档内容摘要"
|
||||||
|
}}
|
||||||
|
|
||||||
|
重点:
|
||||||
|
- 如果文档包含表格数据,提取到 tables 中
|
||||||
|
- 如果文档包含键值对(如 名称: 张三),提取到 key_values 中
|
||||||
|
- 如果文档包含列表项,提取到 list_items 中
|
||||||
|
- 如果文档包含图片,请根据上下文推断图片内容(如"流程图"、"数据折线图"等)并在 summary 中说明
|
||||||
|
- 如果无法提取到结构化数据,至少提供一个详细的摘要
|
||||||
|
"""
|
||||||
|
|
||||||
|
messages = [
|
||||||
|
{"role": "system", "content": "你是一个专业的数据提取助手。请严格按JSON格式输出。"},
|
||||||
|
{"role": "user", "content": prompt}
|
||||||
|
]
|
||||||
|
|
||||||
|
response = await self.llm.chat(
|
||||||
|
messages=messages,
|
||||||
|
temperature=0.1,
|
||||||
|
max_tokens=50000
|
||||||
|
)
|
||||||
|
|
||||||
|
content = self.llm.extract_message_content(response)
|
||||||
|
|
||||||
|
result = self._parse_json_response(content)
|
||||||
|
|
||||||
|
if result:
|
||||||
|
logger.info(f"AI 文本提取成功: type={result.get('type')}")
|
||||||
|
return {
|
||||||
|
"success": True,
|
||||||
|
"type": result.get("type", "structured_text"),
|
||||||
|
"tables": result.get("tables", []),
|
||||||
|
"key_values": result.get("key_values", {}),
|
||||||
|
"list_items": result.get("list_items", []),
|
||||||
|
"summary": result.get("summary", ""),
|
||||||
|
"raw_text_preview": text_preview[:500]
|
||||||
|
}
|
||||||
|
else:
|
||||||
|
return {
|
||||||
|
"success": True,
|
||||||
|
"type": "text",
|
||||||
|
"summary": text_preview[:500],
|
||||||
|
"raw_text_preview": text_preview[:500]
|
||||||
|
}
|
||||||
|
|
||||||
|
except Exception as e:
|
||||||
|
logger.error(f"AI 文本提取失败: {str(e)}")
|
||||||
|
return {
|
||||||
|
"success": False,
|
||||||
|
"error": str(e)
|
||||||
|
}
|
||||||
|
|
||||||
|
async def _analyze_images_with_ai(
|
||||||
|
self,
|
||||||
|
images: List[Dict[str, str]],
|
||||||
|
user_hint: str = ""
|
||||||
|
) -> str:
|
||||||
|
"""
|
||||||
|
使用视觉模型分析 Word 文档中的图片
|
||||||
|
|
||||||
|
Args:
|
||||||
|
images: 图片列表,每项包含 base64 和 mime_type
|
||||||
|
user_hint: 用户提示
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
图片分析结果文本
|
||||||
|
"""
|
||||||
|
try:
|
||||||
|
# 调用 LLM 的视觉分析功能
|
||||||
|
result = await self.llm.analyze_images(
|
||||||
|
images=images,
|
||||||
|
user_prompt=user_hint or "请详细描述图片内容,提取所有文字和数据信息。"
|
||||||
|
)
|
||||||
|
|
||||||
|
if result.get("success"):
|
||||||
|
analysis = result.get("analysis", {})
|
||||||
|
if isinstance(analysis, dict):
|
||||||
|
description = analysis.get("description", "")
|
||||||
|
text_content = analysis.get("text_content", "")
|
||||||
|
data_extracted = analysis.get("data_extracted", {})
|
||||||
|
|
||||||
|
result_text = f"【图片分析结果】\n{description}"
|
||||||
|
if text_content:
|
||||||
|
result_text += f"\n\n【图片中的文字】\n{text_content}"
|
||||||
|
if data_extracted:
|
||||||
|
result_text += f"\n\n【提取的数据】\n{json.dumps(data_extracted, ensure_ascii=False)}"
|
||||||
|
return result_text
|
||||||
|
else:
|
||||||
|
return str(analysis)
|
||||||
|
else:
|
||||||
|
logger.warning(f"图片 AI 分析失败: {result.get('error')}")
|
||||||
|
return ""
|
||||||
|
|
||||||
|
except Exception as e:
|
||||||
|
logger.error(f"图片 AI 分析异常: {str(e)}")
|
||||||
|
return ""
|
||||||
|
|
||||||
|
def _build_tables_description(self, tables: List[Dict]) -> str:
|
||||||
|
"""构建表格的文本描述"""
|
||||||
|
result = []
|
||||||
|
|
||||||
|
for idx, table in enumerate(tables):
|
||||||
|
rows = table.get("rows", [])
|
||||||
|
if not rows:
|
||||||
|
continue
|
||||||
|
|
||||||
|
result.append(f"\n--- 表格 {idx + 1} ---")
|
||||||
|
|
||||||
|
for row_idx, row in enumerate(rows[:50]): # 限制每表格最多50行
|
||||||
|
if isinstance(row, list):
|
||||||
|
result.append(" | ".join(str(cell).strip() for cell in row))
|
||||||
|
elif isinstance(row, dict):
|
||||||
|
result.append(str(row))
|
||||||
|
|
||||||
|
if len(rows) > 50:
|
||||||
|
result.append(f"...(共 {len(rows)} 行,仅显示前50行)")
|
||||||
|
|
||||||
|
return "\n".join(result) if result else "(无表格内容)"
|
||||||
|
|
||||||
|
def _parse_json_response(self, content: str) -> Optional[Dict]:
|
||||||
|
"""解析 JSON 响应,处理各种格式问题"""
|
||||||
|
import re
|
||||||
|
|
||||||
|
# 清理 markdown 标记
|
||||||
|
cleaned = content.strip()
|
||||||
|
cleaned = re.sub(r'^```json\s*', '', cleaned, flags=re.MULTILINE)
|
||||||
|
cleaned = re.sub(r'^```\s*', '', cleaned, flags=re.MULTILINE)
|
||||||
|
cleaned = cleaned.strip()
|
||||||
|
|
||||||
|
# 找到 JSON 开始位置
|
||||||
|
json_start = -1
|
||||||
|
for i, c in enumerate(cleaned):
|
||||||
|
if c == '{':
|
||||||
|
json_start = i
|
||||||
|
break
|
||||||
|
|
||||||
|
if json_start == -1:
|
||||||
|
logger.warning("无法找到 JSON 开始位置")
|
||||||
|
return None
|
||||||
|
|
||||||
|
json_text = cleaned[json_start:]
|
||||||
|
|
||||||
|
# 尝试直接解析
|
||||||
|
try:
|
||||||
|
return json.loads(json_text)
|
||||||
|
except json.JSONDecodeError:
|
||||||
|
pass
|
||||||
|
|
||||||
|
# 尝试修复并解析
|
||||||
|
try:
|
||||||
|
# 找到闭合括号
|
||||||
|
depth = 0
|
||||||
|
end_pos = -1
|
||||||
|
for i, c in enumerate(json_text):
|
||||||
|
if c == '{':
|
||||||
|
depth += 1
|
||||||
|
elif c == '}':
|
||||||
|
depth -= 1
|
||||||
|
if depth == 0:
|
||||||
|
end_pos = i + 1
|
||||||
|
break
|
||||||
|
|
||||||
|
if end_pos > 0:
|
||||||
|
fixed = json_text[:end_pos]
|
||||||
|
# Remove trailing commas before a closing brace or bracket
|
||||||
|
fixed = re.sub(r',\s*([}\]])', r'\1', fixed)
|
||||||
|
return json.loads(fixed)
|
||||||
|
except Exception as e:
|
||||||
|
logger.warning(f"JSON 修复失败: {e}")
|
||||||
|
|
||||||
|
return None
|
||||||
|
|
||||||
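Review note: `_parse_json_response` finds the first `{`, tries `json.loads`, and then falls back to manual brace counting. A shorter way to get the same effect is `json.JSONDecoder.raw_decode`, which parses one JSON value and ignores whatever follows it. The sketch below is one possible shape under that approach; `extract_json_object` is a hypothetical name, and the trailing-comma repair is a heuristic that could also alter a comma inside a string value.

```python
import json
import re
from typing import Any, Dict, Optional

_FENCE = re.compile(r"^```(?:json)?\s*$", re.MULTILINE)
_TRAILING_COMMA = re.compile(r",\s*([}\]])")


def extract_json_object(text: str) -> Optional[Dict[str, Any]]:
    """Return the first JSON object found in an LLM reply, or None."""
    cleaned = _FENCE.sub("", text).strip()
    start = cleaned.find("{")
    if start == -1:
        return None
    body = cleaned[start:]
    decoder = json.JSONDecoder()
    # Attempt 1: parse as-is. Attempt 2: strip trailing commas, then parse.
    for candidate in (body, _TRAILING_COMMA.sub(r"\1", body)):
        try:
            obj, _end = decoder.raw_decode(candidate)
        except json.JSONDecodeError:
            continue
        if isinstance(obj, dict):
            return obj
    return None
```

Because `raw_decode` is string-aware, braces that appear inside string values do not confuse it, unlike a plain depth counter.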
|
def _fallback_table_parse(self, tables: List[Dict]) -> Dict[str, Any]:
|
||||||
|
"""当 AI 解析失败时,直接解析表格"""
|
||||||
|
if not tables:
|
||||||
|
return {
|
||||||
|
"success": True,
|
||||||
|
"type": "empty",
|
||||||
|
"data": {},
|
||||||
|
"message": "无表格内容"
|
||||||
|
}
|
||||||
|
|
||||||
|
all_rows = []
|
||||||
|
all_headers = None
|
||||||
|
|
||||||
|
for table in tables:
|
||||||
|
rows = table.get("rows", [])
|
||||||
|
if not rows:
|
||||||
|
continue
|
||||||
|
|
||||||
|
# 查找真正的表头行(跳过标题行)
|
||||||
|
header_row_idx = 0
|
||||||
|
for idx, row in enumerate(rows[:5]): # 只检查前5行
|
||||||
|
if not isinstance(row, list):
|
||||||
|
continue
|
||||||
|
# 如果某一行包含"表"字开头且单元格内容很长,这可能是标题行
|
||||||
|
first_cell = str(row[0]) if row else ""
|
||||||
|
if first_cell.startswith("表") and len(first_cell) > 15:
|
||||||
|
header_row_idx = idx + 1
|
||||||
|
continue
|
||||||
|
# 如果某一行有超过3个空单元格,可能是无效行
|
||||||
|
empty_count = sum(1 for cell in row if not str(cell).strip())
|
||||||
|
if empty_count > 3:
|
||||||
|
header_row_idx = idx + 1
|
||||||
|
continue
|
||||||
|
# 找到第一行看起来像表头的行(短单元格,大部分有内容)
|
||||||
|
avg_len = sum(len(str(c)) for c in row) / len(row) if row else 0
|
||||||
|
if avg_len < 20: # 表头通常比数据行短
|
||||||
|
header_row_idx = idx
|
||||||
|
break
|
||||||
|
|
||||||
|
if header_row_idx >= len(rows):
|
||||||
|
continue
|
||||||
|
|
||||||
|
# 使用找到的表头行
|
||||||
|
if rows and isinstance(rows[header_row_idx], list):
|
||||||
|
headers = rows[header_row_idx]
|
||||||
|
if all_headers is None:
|
||||||
|
all_headers = headers
|
||||||
|
|
||||||
|
# 数据行(从表头之后开始)
|
||||||
|
for row in rows[header_row_idx + 1:]:
|
||||||
|
if isinstance(row, list) and len(row) == len(headers):
|
||||||
|
all_rows.append(row)
|
||||||
|
|
||||||
|
if all_headers and all_rows:
|
||||||
|
return {
|
||||||
|
"success": True,
|
||||||
|
"type": "table_data",
|
||||||
|
"headers": all_headers,
|
||||||
|
"rows": all_rows,
|
||||||
|
"description": "直接从 Word 表格提取"
|
||||||
|
}
|
||||||
|
|
||||||
|
return {
|
||||||
|
"success": True,
|
||||||
|
"type": "raw",
|
||||||
|
"tables": tables,
|
||||||
|
"message": "表格数据(未AI处理)"
|
||||||
|
}
|
||||||
|
|
||||||
|
async def fill_template_with_ai(
|
||||||
|
self,
|
||||||
|
file_path: str,
|
||||||
|
template_fields: List[Dict[str, Any]],
|
||||||
|
user_hint: str = ""
|
||||||
|
) -> Dict[str, Any]:
|
||||||
|
"""
|
||||||
|
使用 AI 解析 Word 文档并填写模板
|
||||||
|
|
||||||
|
这是主要入口函数,前端调用此函数即可完成:
|
||||||
|
1. AI 解析 Word 文档
|
||||||
|
2. 根据模板字段提取数据
|
||||||
|
3. 返回填写结果
|
||||||
|
|
||||||
|
Args:
|
||||||
|
file_path: Word 文件路径
|
||||||
|
template_fields: 模板字段列表 [{"name": "字段名", "hint": "提示词"}, ...]
|
||||||
|
user_hint: 用户提示
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
填写结果
|
||||||
|
"""
|
||||||
|
try:
|
||||||
|
# 1. AI 解析文档
|
||||||
|
parse_result = await self.parse_word_with_ai(file_path, user_hint)
|
||||||
|
|
||||||
|
if not parse_result.get("success"):
|
||||||
|
return {
|
||||||
|
"success": False,
|
||||||
|
"error": parse_result.get("error", "解析失败"),
|
||||||
|
"filled_data": {},
|
||||||
|
"source": "ai_parse_failed"
|
||||||
|
}
|
||||||
|
|
||||||
|
# 2. 根据字段类型提取数据
|
||||||
|
filled_data = {}
|
||||||
|
extract_details = []
|
||||||
|
|
||||||
|
parse_type = parse_result.get("type", "")
|
||||||
|
|
||||||
|
if parse_type == "table_data":
|
||||||
|
# 表格数据:直接匹配列名
|
||||||
|
headers = parse_result.get("headers", [])
|
||||||
|
rows = parse_result.get("rows", [])
|
||||||
|
|
||||||
|
for field in template_fields:
|
||||||
|
field_name = field.get("name", "")
|
||||||
|
values = self._extract_field_from_table(headers, rows, field_name)
|
||||||
|
filled_data[field_name] = values
|
||||||
|
extract_details.append({
|
||||||
|
"field": field_name,
|
||||||
|
"values": values,
|
||||||
|
"source": "ai_table_extraction",
|
||||||
|
"confidence": 0.9 if values else 0.0
|
||||||
|
})
|
||||||
|
|
||||||
|
elif parse_type == "structured_text":
|
||||||
|
# 结构化文本:尝试从 key_values 和 list_items 提取
|
||||||
|
key_values = parse_result.get("key_values", {})
|
||||||
|
list_items = parse_result.get("list_items", [])
|
||||||
|
|
||||||
|
for field in template_fields:
|
||||||
|
field_name = field.get("name", "")
|
||||||
|
value = key_values.get(field_name, "")
|
||||||
|
if not value and list_items:
|
||||||
|
value = list_items[0] if list_items else ""
|
||||||
|
filled_data[field_name] = [value] if value else []
|
||||||
|
extract_details.append({
|
||||||
|
"field": field_name,
|
||||||
|
"values": [value] if value else [],
|
||||||
|
"source": "ai_text_extraction",
|
||||||
|
"confidence": 0.7 if value else 0.0
|
||||||
|
})
|
||||||
|
|
||||||
|
else:
|
||||||
|
# 其他类型:返回原始解析结果供后续处理
|
||||||
|
for field in template_fields:
|
||||||
|
field_name = field.get("name", "")
|
||||||
|
filled_data[field_name] = []
|
||||||
|
extract_details.append({
|
||||||
|
"field": field_name,
|
||||||
|
"values": [],
|
||||||
|
"source": "no_ai_data",
|
||||||
|
"confidence": 0.0
|
||||||
|
})
|
||||||
|
|
||||||
|
# 3. 返回结果
|
||||||
|
max_rows = max(len(v) for v in filled_data.values()) if filled_data else 1
|
||||||
|
|
||||||
|
return {
|
||||||
|
"success": True,
|
||||||
|
"filled_data": filled_data,
|
||||||
|
"fill_details": extract_details,
|
||||||
|
"ai_parse_result": {
|
||||||
|
"type": parse_type,
|
||||||
|
"description": parse_result.get("description", "")
|
||||||
|
},
|
||||||
|
"source_doc_count": 1,
|
||||||
|
"max_rows": max_rows
|
||||||
|
}
|
||||||
|
|
||||||
|
except Exception as e:
|
||||||
|
logger.error(f"AI 填表失败: {str(e)}")
|
||||||
|
return {
|
||||||
|
"success": False,
|
||||||
|
"error": str(e),
|
||||||
|
"filled_data": {},
|
||||||
|
"fill_details": []
|
||||||
|
}
|
||||||
|
|
||||||
|
def _extract_field_from_table(
|
||||||
|
self,
|
||||||
|
headers: List[str],
|
||||||
|
rows: List[List],
|
||||||
|
field_name: str
|
||||||
|
) -> List[str]:
|
||||||
|
"""从表格中提取指定字段的值"""
|
||||||
|
# 查找匹配的列
|
||||||
|
target_col_idx = None
|
||||||
|
for col_idx, header in enumerate(headers):
|
||||||
|
if str(header).strip() and field_name.strip() and (field_name.lower() in str(header).lower() or str(header).lower() in field_name.lower()):
|
||||||
|
target_col_idx = col_idx
|
||||||
|
break
|
||||||
|
|
||||||
|
if target_col_idx is None:
|
||||||
|
return []
|
||||||
|
|
||||||
|
# 提取该列所有值
|
||||||
|
values = []
|
||||||
|
for row in rows:
|
||||||
|
if isinstance(row, list) and target_col_idx < len(row):
|
||||||
|
val = str(row[target_col_idx]).strip()
|
||||||
|
if val:
|
||||||
|
values.append(val)
|
||||||
|
|
||||||
|
return values
|
||||||
|
|
||||||
|
|
||||||
|
# 全局单例
|
||||||
|
word_ai_service = WordAIService()
|
||||||
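Review note: `_extract_field_from_table` matches a template field to a column with a two-way substring test and stops at the first hit. Since an empty string is a substring of every string, a blank header cell matches any field name. The sketch below shows one way to make the lookup stricter: exact matches win, and blank headers or blank field names never match. `find_column` is a hypothetical helper, not part of the PR.

```python
from typing import Optional, Sequence


def find_column(headers: Sequence[object], field_name: str) -> Optional[int]:
    """Return the index of the header that best matches field_name, or None."""
    wanted = field_name.strip().lower()
    if not wanted:
        return None
    normalized = [str(h).strip().lower() for h in headers]
    # Pass 1: exact match after trimming and lowercasing.
    for idx, header in enumerate(normalized):
        if header == wanted:
            return idx
    # Pass 2: substring match in either direction, skipping blank headers.
    for idx, header in enumerate(normalized):
        if header and (wanted in header or header in wanted):
            return idx
    return None
```

For headers `["", "Name", "Full Name"]` and the field `"full name"`, this returns column 2, whereas a first-hit substring test would stop at the blank column 0.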
@@ -1,4 +1,4 @@
|
|||||||
# ============================================================
|
# ============================================================
|
||||||
# 基于大语言模型的文档理解与多源数据融合系统
|
# 基于大语言模型的文档理解与多源数据融合系统
|
||||||
# Python 依赖清单
|
# Python 依赖清单
|
||||||
# ============================================================
|
# ============================================================
|
||||||
|
|||||||
@@ -1,13 +1,16 @@
|
|||||||
import { RouterProvider } from 'react-router-dom';
|
import { RouterProvider } from 'react-router-dom';
|
||||||
import { AuthProvider } from '@/context/AuthContext';
|
import { AuthProvider } from '@/contexts/AuthContext';
|
||||||
|
import { TemplateFillProvider } from '@/context/TemplateFillContext';
|
||||||
import { router } from '@/routes';
|
import { router } from '@/routes';
|
||||||
import { Toaster } from 'sonner';
|
import { Toaster } from 'sonner';
|
||||||
|
|
||||||
function App() {
|
function App() {
|
||||||
return (
|
return (
|
||||||
<AuthProvider>
|
<AuthProvider>
|
||||||
<RouterProvider router={router} />
|
<TemplateFillProvider>
|
||||||
<Toaster position="top-right" richColors closeButton />
|
<RouterProvider router={router} />
|
||||||
|
<Toaster position="top-right" richColors closeButton />
|
||||||
|
</TemplateFillProvider>
|
||||||
</AuthProvider>
|
</AuthProvider>
|
||||||
);
|
);
|
||||||
}
|
}
|
||||||
|
|||||||
@@ -1,6 +1,6 @@
|
|||||||
import React from 'react';
|
import React from 'react';
|
||||||
import { Navigate, useLocation } from 'react-router-dom';
|
import { Navigate, useLocation } from 'react-router-dom';
|
||||||
import { useAuth } from '@/context/AuthContext';
|
import { useAuth } from '@/contexts/AuthContext';
|
||||||
|
|
||||||
export const RouteGuard: React.FC<{ children: React.ReactNode }> = ({ children }) => {
|
export const RouteGuard: React.FC<{ children: React.ReactNode }> = ({ children }) => {
|
||||||
const { user, loading } = useAuth();
|
const { user, loading } = useAuth();
|
||||||
|
|||||||
@@ -1,85 +0,0 @@
|
|||||||
import React, { createContext, useContext, useEffect, useState } from 'react';
|
|
||||||
import { supabase } from '@/db/supabase';
|
|
||||||
import { User } from '@supabase/supabase-js';
|
|
||||||
import { Profile } from '@/types/types';
|
|
||||||
|
|
||||||
interface AuthContextType {
|
|
||||||
user: User | null;
|
|
||||||
profile: Profile | null;
|
|
||||||
signIn: (email: string, password: string) => Promise<{ error: any }>;
|
|
||||||
signUp: (email: string, password: string) => Promise<{ error: any }>;
|
|
||||||
signOut: () => Promise<{ error: any }>;
|
|
||||||
loading: boolean;
|
|
||||||
}
|
|
||||||
|
|
||||||
const AuthContext = createContext<AuthContextType | undefined>(undefined);
|
|
||||||
|
|
||||||
export const AuthProvider: React.FC<{ children: React.ReactNode }> = ({ children }) => {
|
|
||||||
const [user, setUser] = useState<User | null>(null);
|
|
||||||
const [profile, setProfile] = useState<Profile | null>(null);
|
|
||||||
const [loading, setLoading] = useState(true);
|
|
||||||
|
|
||||||
useEffect(() => {
|
|
||||||
// Check active sessions and sets the user
|
|
||||||
supabase.auth.getSession().then(({ data: { session } }) => {
|
|
||||||
setUser(session?.user ?? null);
|
|
||||||
if (session?.user) fetchProfile(session.user.id);
|
|
||||||
else setLoading(false);
|
|
||||||
});
|
|
||||||
|
|
||||||
// Listen for changes on auth state (sign in, sign out, etc.)
|
|
||||||
const { data: { subscription } } = supabase.auth.onAuthStateChange((_event, session) => {
|
|
||||||
setUser(session?.user ?? null);
|
|
||||||
if (session?.user) fetchProfile(session.user.id);
|
|
||||||
else {
|
|
||||||
setProfile(null);
|
|
||||||
setLoading(false);
|
|
||||||
}
|
|
||||||
});
|
|
||||||
|
|
||||||
return () => subscription.unsubscribe();
|
|
||||||
}, []);
|
|
||||||
|
|
||||||
const fetchProfile = async (uid: string) => {
|
|
||||||
try {
|
|
||||||
const { data, error } = await supabase
|
|
||||||
.from('profiles')
|
|
||||||
.select('*')
|
|
||||||
.eq('id', uid)
|
|
||||||
.maybeSingle();
|
|
||||||
|
|
||||||
if (error) throw error;
|
|
||||||
setProfile(data);
|
|
||||||
} catch (err) {
|
|
||||||
console.error('Error fetching profile:', err);
|
|
||||||
} finally {
|
|
||||||
setLoading(false);
|
|
||||||
}
|
|
||||||
};
|
|
||||||
|
|
||||||
const signIn = async (email: string, password: string) => {
|
|
||||||
return await supabase.auth.signInWithPassword({ email, password });
|
|
||||||
};
|
|
||||||
|
|
||||||
const signUp = async (email: string, password: string) => {
|
|
||||||
return await supabase.auth.signUp({ email, password });
|
|
||||||
};
|
|
||||||
|
|
||||||
const signOut = async () => {
|
|
||||||
return await supabase.auth.signOut();
|
|
||||||
};
|
|
||||||
|
|
||||||
return (
|
|
||||||
<AuthContext.Provider value={{ user, profile, signIn, signUp, signOut, loading }}>
|
|
||||||
{children}
|
|
||||||
</AuthContext.Provider>
|
|
||||||
);
|
|
||||||
};
|
|
||||||
|
|
||||||
export const useAuth = () => {
|
|
||||||
const context = useContext(AuthContext);
|
|
||||||
if (context === undefined) {
|
|
||||||
throw new Error('useAuth must be used within an AuthProvider');
|
|
||||||
}
|
|
||||||
return context;
|
|
||||||
};
|
|
||||||
136
frontend/src/context/TemplateFillContext.tsx
Normal file
@@ -0,0 +1,136 @@
|
|||||||
|
import React, { createContext, useContext, useState, ReactNode } from 'react';
|
||||||
|
|
||||||
|
type SourceFile = {
|
||||||
|
file: File;
|
||||||
|
preview?: string;
|
||||||
|
};
|
||||||
|
|
||||||
|
type TemplateField = {
|
||||||
|
cell: string;
|
||||||
|
name: string;
|
||||||
|
field_type: string;
|
||||||
|
required: boolean;
|
||||||
|
hint?: string;
|
||||||
|
};
|
||||||
|
|
||||||
|
type Step = 'upload' | 'filling' | 'preview';
|
||||||
|
|
||||||
|
interface TemplateFillState {
|
||||||
|
step: Step;
|
||||||
|
templateFile: File | null;
|
||||||
|
templateFields: TemplateField[];
|
||||||
|
sourceFiles: SourceFile[];
|
||||||
|
sourceFilePaths: string[];
|
||||||
|
sourceDocIds: string[];
|
||||||
|
templateId: string;
|
||||||
|
filledResult: any;
|
||||||
|
setStep: (step: Step) => void;
|
||||||
|
setTemplateFile: (file: File | null) => void;
|
||||||
|
setTemplateFields: (fields: TemplateField[]) => void;
|
||||||
|
setSourceFiles: (files: SourceFile[]) => void;
|
||||||
|
addSourceFiles: (files: SourceFile[]) => void;
|
||||||
|
removeSourceFile: (index: number) => void;
|
||||||
|
setSourceFilePaths: (paths: string[]) => void;
|
||||||
|
setSourceDocIds: (ids: string[]) => void;
|
||||||
|
addSourceDocId: (id: string) => void;
|
||||||
|
removeSourceDocId: (id: string) => void;
|
||||||
|
setTemplateId: (id: string) => void;
|
||||||
|
setFilledResult: (result: any) => void;
|
||||||
|
reset: () => void;
|
||||||
|
}
|
||||||
|
|
||||||
|
const initialState = {
|
||||||
|
step: 'upload' as Step,
|
||||||
|
templateFile: null,
|
||||||
|
templateFields: [],
|
||||||
|
sourceFiles: [],
|
||||||
|
sourceFilePaths: [],
|
||||||
|
sourceDocIds: [],
|
||||||
|
templateId: '',
|
||||||
|
filledResult: null,
|
||||||
|
setStep: () => {},
|
||||||
|
setTemplateFile: () => {},
|
||||||
|
setTemplateFields: () => {},
|
||||||
|
setSourceFiles: () => {},
|
||||||
|
addSourceFiles: () => {},
|
||||||
|
removeSourceFile: () => {},
|
||||||
|
setSourceFilePaths: () => {},
|
||||||
|
setSourceDocIds: () => {},
|
||||||
|
addSourceDocId: () => {},
|
||||||
|
removeSourceDocId: () => {},
|
||||||
|
setTemplateId: () => {},
|
||||||
|
setFilledResult: () => {},
|
||||||
|
reset: () => {},
|
||||||
|
};
|
||||||
|
|
||||||
|
const TemplateFillContext = createContext<TemplateFillState>(initialState);
|
||||||
|
|
||||||
|
export const TemplateFillProvider: React.FC<{ children: ReactNode }> = ({ children }) => {
|
||||||
|
const [step, setStep] = useState<Step>('upload');
|
||||||
|
const [templateFile, setTemplateFile] = useState<File | null>(null);
|
||||||
|
const [templateFields, setTemplateFields] = useState<TemplateField[]>([]);
|
||||||
|
const [sourceFiles, setSourceFiles] = useState<SourceFile[]>([]);
|
||||||
|
const [sourceFilePaths, setSourceFilePaths] = useState<string[]>([]);
|
||||||
|
const [sourceDocIds, setSourceDocIds] = useState<string[]>([]);
|
||||||
|
const [templateId, setTemplateId] = useState<string>('');
|
||||||
|
const [filledResult, setFilledResult] = useState<any>(null);
|
||||||
|
|
||||||
|
const addSourceFiles = (files: SourceFile[]) => {
|
||||||
|
setSourceFiles(prev => [...prev, ...files]);
|
||||||
|
};
|
||||||
|
|
||||||
|
const removeSourceFile = (index: number) => {
|
||||||
|
setSourceFiles(prev => prev.filter((_, i) => i !== index));
|
||||||
|
};
|
||||||
|
|
||||||
|
const addSourceDocId = (id: string) => {
|
||||||
|
setSourceDocIds(prev => prev.includes(id) ? prev : [...prev, id]);
|
||||||
|
};
|
||||||
|
|
||||||
|
const removeSourceDocId = (id: string) => {
|
||||||
|
setSourceDocIds(prev => prev.filter(docId => docId !== id));
|
||||||
|
};
|
||||||
|
|
||||||
|
const reset = () => {
|
||||||
|
setStep('upload');
|
||||||
|
setTemplateFile(null);
|
||||||
|
setTemplateFields([]);
|
||||||
|
setSourceFiles([]);
|
||||||
|
setSourceFilePaths([]);
|
||||||
|
setSourceDocIds([]);
|
||||||
|
setTemplateId('');
|
||||||
|
setFilledResult(null);
|
||||||
|
};
|
||||||
|
|
||||||
|
return (
|
||||||
|
<TemplateFillContext.Provider
|
||||||
|
value={{
|
||||||
|
step,
|
||||||
|
templateFile,
|
||||||
|
templateFields,
|
||||||
|
sourceFiles,
|
||||||
|
sourceFilePaths,
|
||||||
|
sourceDocIds,
|
||||||
|
templateId,
|
||||||
|
filledResult,
|
||||||
|
setStep,
|
||||||
|
setTemplateFile,
|
||||||
|
setTemplateFields,
|
||||||
|
setSourceFiles,
|
||||||
|
addSourceFiles,
|
||||||
|
removeSourceFile,
|
||||||
|
setSourceFilePaths,
|
||||||
|
setSourceDocIds,
|
||||||
|
addSourceDocId,
|
||||||
|
removeSourceDocId,
|
||||||
|
setTemplateId,
|
||||||
|
setFilledResult,
|
||||||
|
reset,
|
||||||
|
}}
|
||||||
|
>
|
||||||
|
{children}
|
||||||
|
</TemplateFillContext.Provider>
|
||||||
|
);
|
||||||
|
};
|
||||||
|
|
||||||
|
export const useTemplateFill = () => useContext(TemplateFillContext);
|
||||||
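Review note: the list helpers in `TemplateFillProvider` (`addSourceDocId`, `removeSourceDocId`, `removeSourceFile`) are pure array transformations wrapped in state updaters. Below is a minimal sketch of the same logic outside React. The helper names `addUnique`, `removeValue`, and `removeAt` are hypothetical and do not appear in the commit.

```typescript
// Hypothetical pure helpers; only the array logic mirrors the provider in this diff.

// Mirrors addSourceDocId: append the id only if it is not already present.
// Returns the same array reference when nothing changes.
function addUnique(ids: string[], id: string): string[] {
  return ids.indexOf(id) !== -1 ? ids : ids.concat([id]);
}

// Mirrors removeSourceDocId: drop every occurrence of the id.
function removeValue(ids: string[], id: string): string[] {
  return ids.filter((docId) => docId !== id);
}

// Mirrors removeSourceFile: drop the element at the given position.
function removeAt<T>(items: T[], index: number): T[] {
  return items.filter((_item, i) => i !== index);
}
```

Helpers of this shape can be unit-tested without rendering a provider. Returning the unchanged array for a duplicate id matches the original updater, which returns `prev` in that case.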
@@ -400,6 +400,49 @@ export const backendApi = {
|
|||||||
}
|
}
|
||||||
},
|
},
|
||||||
|
|
||||||
|
/**
|
||||||
|
* Get the task history list
|
||||||
|
*/
|
||||||
|
async getTasks(
|
||||||
|
limit: number = 50,
|
||||||
|
skip: number = 0
|
||||||
|
): Promise<{ success: boolean; tasks: any[]; count: number }> {
|
||||||
|
const url = `${BACKEND_BASE_URL}/tasks?limit=${limit}&skip=${skip}`;
|
||||||
|
|
||||||
|
try {
|
||||||
|
const response = await fetch(url);
|
||||||
|
if (!response.ok) {
|
||||||
|
const error = await response.json();
|
||||||
|
throw new Error(error.detail || '获取任务列表失败');
|
||||||
|
}
|
||||||
|
return await response.json();
|
||||||
|
} catch (error) {
|
||||||
|
console.error('获取任务列表失败:', error);
|
||||||
|
throw error;
|
||||||
|
}
|
||||||
|
},
|
||||||
|
|
||||||
|
/**
|
||||||
|
* Delete a task
|
||||||
|
*/
|
||||||
|
async deleteTask(taskId: string): Promise<{ success: boolean; deleted: boolean }> {
|
||||||
|
const url = `${BACKEND_BASE_URL}/tasks/${taskId}`;
|
||||||
|
|
||||||
|
try {
|
||||||
|
const response = await fetch(url, {
|
||||||
|
method: 'DELETE'
|
||||||
|
});
|
||||||
|
if (!response.ok) {
|
||||||
|
const error = await response.json();
|
||||||
|
throw new Error(error.detail || '删除任务失败');
|
||||||
|
}
|
||||||
|
return await response.json();
|
||||||
|
} catch (error) {
|
||||||
|
console.error('删除任务失败:', error);
|
||||||
|
throw error;
|
||||||
|
}
|
||||||
|
},
|
||||||
|
|
||||||
/**
|
/**
|
||||||
* Poll the task status until it completes
|
* Poll the task status until it completes
|
||||||
*/
|
*/
|
||||||
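Review note: `getTasks` and `deleteTask` build their URLs by direct string interpolation. Below is a minimal sketch of the same URL shapes with the task id percent-encoded, in case ids can ever contain characters such as `/` or spaces. The function names `tasksListUrl` and `taskUrl` are hypothetical, and the `base` argument stands in for `BACKEND_BASE_URL`, whose value is not shown in this diff.

```typescript
// Hypothetical URL builders; `base` stands in for BACKEND_BASE_URL.

// Same shape as the URL in getTasks, with the same defaults (limit 50, skip 0).
function tasksListUrl(base: string, limit: number = 50, skip: number = 0): string {
  return `${base}/tasks?limit=${limit}&skip=${skip}`;
}

// Same shape as the URL in deleteTask, but the id is percent-encoded so that
// reserved characters cannot change the path.
function taskUrl(base: string, taskId: string): string {
  return `${base}/tasks/${encodeURIComponent(taskId)}`;
}
```

For ids that are plain alphanumeric strings, `taskUrl` produces exactly the URL that the code in the diff produces.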
@@ -656,6 +699,46 @@ export const backendApi = {
|
|||||||
}
|
}
|
||||||
},
|
},
|
||||||
|
|
||||||
|
/**
|
||||||
|
* Upload the template and source documents together
|
||||||
|
*/
|
||||||
|
async uploadTemplateAndSources(
|
||||||
|
templateFile: File,
|
||||||
|
sourceFiles: File[]
|
||||||
|
): Promise<{
|
||||||
|
success: boolean;
|
||||||
|
template_id: string;
|
||||||
|
filename: string;
|
||||||
|
file_type: string;
|
||||||
|
fields: TemplateField[];
|
||||||
|
field_count: number;
|
||||||
|
source_file_paths: string[];
|
||||||
|
source_filenames: string[];
|
||||||
|
task_id: string;
|
||||||
|
}> {
|
||||||
|
const formData = new FormData();
|
||||||
|
formData.append('template_file', templateFile);
|
||||||
|
sourceFiles.forEach(file => formData.append('source_files', file));
|
||||||
|
|
||||||
|
const url = `${BACKEND_BASE_URL}/templates/upload-joint`;
|
||||||
|
|
||||||
|
try {
|
||||||
|
const response = await fetch(url, {
|
||||||
|
method: 'POST',
|
||||||
|
body: formData,
|
||||||
|
});
|
||||||
|
|
||||||
|
if (!response.ok) {
|
||||||
|
const error = await response.json();
|
||||||
|
throw new Error(error.detail || '联合上传失败');
|
||||||
|
}
|
||||||
|
return await response.json();
|
||||||
|
} catch (error) {
|
||||||
|
console.error('联合上传失败:', error);
|
||||||
|
throw error;
|
||||||
|
}
|
||||||
|
},
|
||||||
|
|
||||||
/**
|
/**
|
||||||
* Run the table fill
|
* Run the table fill
|
||||||
*/
|
*/
|
||||||
@@ -724,6 +807,41 @@ export const backendApi = {
|
|||||||
}
|
}
|
||||||
},
|
},
|
||||||
|
|
||||||
|
/**
|
||||||
|
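* Fill the original template and export it
|
||||||
|
*
|
||||||
|
* Opens the original template file directly, writes the data into its tables/cells, then exports it
|
||||||
|
* Intended for the competition scenario: the original template formatting is preserved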
|
||||||
|
*/
|
||||||
|
async fillAndExportTemplate(
|
||||||
|
templatePath: string,
|
||||||
|
filledData: Record<string, any>,
|
||||||
|
format: 'xlsx' | 'docx' = 'xlsx'
|
||||||
|
): Promise<Blob> {
|
||||||
|
const url = `${BACKEND_BASE_URL}/templates/fill-and-export`;
|
||||||
|
|
||||||
|
try {
|
||||||
|
const response = await fetch(url, {
|
||||||
|
method: 'POST',
|
||||||
|
headers: { 'Content-Type': 'application/json' },
|
||||||
|
body: JSON.stringify({
|
||||||
|
template_path: templatePath,
|
||||||
|
filled_data: filledData,
|
||||||
|
format,
|
||||||
|
}),
|
||||||
|
});
|
||||||
|
|
||||||
|
if (!response.ok) {
|
||||||
|
const error = await response.json();
|
||||||
|
throw new Error(error.detail || '填充模板失败');
|
||||||
|
}
|
||||||
|
return await response.blob();
|
||||||
|
} catch (error) {
|
||||||
|
console.error('填充模板失败:', error);
|
||||||
|
throw error;
|
||||||
|
}
|
||||||
|
},
|
||||||
|
|
||||||
// ==================== Excel-specific endpoints (kept for compatibility) ====================
|
// ==================== Excel-specific endpoints (kept for compatibility) ====================
|
||||||
|
|
||||||
/**
|
/**
|
||||||
@@ -1105,7 +1223,7 @@ export const aiApi = {
|
|||||||
|
|
||||||
try {
|
try {
|
||||||
const response = await fetch(url, {
|
const response = await fetch(url, {
|
||||||
method: 'GET',
|
method: 'POST',
|
||||||
body: formData,
|
body: formData,
|
||||||
});
|
});
|
||||||
|
|
||||||
@@ -1121,6 +1239,48 @@ export const aiApi = {
|
|||||||
}
|
}
|
||||||
},
|
},
|
||||||
|
|
||||||
|
/**
|
||||||
|
* Upload a TXT file and analyze it with AI to extract structured data
|
||||||
|
*/
|
||||||
|
async analyzeTxt(
|
||||||
|
file: File
|
||||||
|
): Promise<{
|
||||||
|
success: boolean;
|
||||||
|
filename?: string;
|
||||||
|
structured_data?: {
|
||||||
|
table?: {
|
||||||
|
columns?: string[];
|
||||||
|
rows?: string[][];
|
||||||
|
};
|
||||||
|
summary?: string;
|
||||||
|
key_value_pairs?: Array<{ key: string; value: string }>;
|
||||||
|
numeric_data?: Array<{ name: string; value: number; unit?: string }>;
|
||||||
|
};
|
||||||
|
error?: string;
|
||||||
|
}> {
|
||||||
|
const formData = new FormData();
|
||||||
|
formData.append('file', file);
|
||||||
|
|
||||||
|
const url = `${BACKEND_BASE_URL}/ai/analyze/txt`;
|
||||||
|
|
||||||
|
try {
|
||||||
|
const response = await fetch(url, {
|
||||||
|
method: 'POST',
|
||||||
|
body: formData,
|
||||||
|
});
|
||||||
|
|
||||||
|
if (!response.ok) {
|
||||||
|
const error = await response.json();
|
||||||
|
throw new Error(error.detail || 'TXT AI 分析失败');
|
||||||
|
}
|
||||||
|
|
||||||
|
return await response.json();
|
||||||
|
} catch (error) {
|
||||||
|
console.error('TXT AI 分析失败:', error);
|
||||||
|
throw error;
|
||||||
|
}
|
||||||
|
},
|
||||||
|
|
||||||
/**
|
/**
|
||||||
* Generate statistics and charts
|
* Generate statistics and charts
|
||||||
*/
|
*/
|
||||||
@@ -1219,4 +1379,211 @@ export const aiApi = {
|
|||||||
throw error;
|
throw error;
|
||||||
}
|
}
|
||||||
},
|
},
|
||||||
|
|
||||||
|
// ==================== Word AI parsing ====================
|
||||||
|
|
||||||
|
/**
|
||||||
|
* Parse a Word document with AI and extract structured data
|
||||||
|
*/
|
||||||
|
async analyzeWordWithAI(
|
||||||
|
file: File,
|
||||||
|
userHint: string = ''
|
||||||
|
): Promise<{
|
||||||
|
success: boolean;
|
||||||
|
type?: string;
|
||||||
|
headers?: string[];
|
||||||
|
rows?: string[][];
|
||||||
|
key_values?: Record<string, string>;
|
||||||
|
list_items?: string[];
|
||||||
|
summary?: string;
|
||||||
|
error?: string;
|
||||||
|
}> {
|
||||||
|
const formData = new FormData();
|
||||||
|
formData.append('file', file);
|
||||||
|
if (userHint) {
|
||||||
|
formData.append('user_hint', userHint);
|
||||||
|
}
|
||||||
|
|
||||||
|
const url = `${BACKEND_BASE_URL}/ai/analyze/word`;
|
||||||
|
|
||||||
|
try {
|
||||||
|
const response = await fetch(url, {
|
||||||
|
method: 'POST',
|
||||||
|
body: formData,
|
||||||
|
});
|
||||||
|
|
||||||
|
if (!response.ok) {
|
||||||
|
const error = await response.json();
|
||||||
|
throw new Error(error.detail || 'Word AI 解析失败');
|
||||||
|
}
|
||||||
|
|
||||||
|
return await response.json();
|
||||||
|
} catch (error) {
|
||||||
|
console.error('Word AI 解析失败:', error);
|
||||||
|
throw error;
|
||||||
|
}
|
||||||
|
},
|
||||||
|
|
||||||
|
/**
|
||||||
|
* Parse a Word document with AI and fill the template
|
||||||
|
* Done in one call: AI parsing + template filling
|
||||||
|
*/
|
||||||
|
async fillTemplateFromWordAI(
|
||||||
|
file: File,
|
||||||
|
templateFields: TemplateField[],
|
||||||
|
userHint: string = ''
|
||||||
|
): Promise<FillResult> {
|
||||||
|
const formData = new FormData();
|
||||||
|
formData.append('file', file);
|
||||||
|
formData.append('template_fields', JSON.stringify(templateFields));
|
||||||
|
if (userHint) {
|
||||||
|
formData.append('user_hint', userHint);
|
||||||
|
}
|
||||||
|
|
||||||
|
const url = `${BACKEND_BASE_URL}/ai/analyze/word/fill-template`;
|
||||||
|
|
||||||
|
try {
|
||||||
|
const response = await fetch(url, {
|
||||||
|
method: 'POST',
|
||||||
|
body: formData,
|
||||||
|
});
|
||||||
|
|
||||||
|
if (!response.ok) {
|
||||||
|
const error = await response.json();
|
||||||
|
throw new Error(error.detail || 'Word AI 填表失败');
|
||||||
|
}
|
||||||
|
|
||||||
|
return await response.json();
|
||||||
|
} catch (error) {
|
||||||
|
console.error('Word AI 填表失败:', error);
|
||||||
|
throw error;
|
||||||
|
}
|
||||||
|
},
|
||||||
|
|
||||||
|
// ==================== Smart instructions ====================
|
||||||
|
|
||||||
|
/**
|
||||||
|
* Recognize the intent of a natural-language instruction
|
||||||
|
*/
|
||||||
|
async recognizeIntent(
|
||||||
|
instruction: string,
|
||||||
|
docIds?: string[]
|
||||||
|
): Promise<{
|
||||||
|
success: boolean;
|
||||||
|
intent: string;
|
||||||
|
params: Record<string, any>;
|
||||||
|
message: string;
|
||||||
|
}> {
|
||||||
|
const url = `${BACKEND_BASE_URL}/instruction/recognize`;
|
||||||
|
|
||||||
|
try {
|
||||||
|
const response = await fetch(url, {
|
||||||
|
method: 'POST',
|
||||||
|
headers: { 'Content-Type': 'application/json' },
|
||||||
|
body: JSON.stringify({ instruction, doc_ids: docIds }),
|
||||||
|
});
|
||||||
|
|
||||||
|
if (!response.ok) {
|
||||||
|
const error = await response.json();
|
||||||
|
throw new Error(error.detail || '意图识别失败');
|
||||||
|
}
|
||||||
|
|
||||||
|
return await response.json();
|
||||||
|
} catch (error) {
|
||||||
|
console.error('意图识别失败:', error);
|
||||||
|
throw error;
|
||||||
|
}
|
||||||
|
},
|
||||||
|
|
||||||
|
/**
|
||||||
|
* Execute a natural-language instruction
|
||||||
|
*/
|
||||||
|
async executeInstruction(
|
||||||
|
instruction: string,
|
||||||
|
docIds?: string[],
|
||||||
|
context?: Record<string, any>
|
||||||
|
): Promise<{
|
||||||
|
success: boolean;
|
||||||
|
intent: string;
|
||||||
|
result: Record<string, any>;
|
||||||
|
message: string;
|
||||||
|
}> {
|
||||||
|
const url = `${BACKEND_BASE_URL}/instruction/execute`;
|
||||||
|
|
||||||
|
try {
|
||||||
|
const response = await fetch(url, {
|
||||||
|
method: 'POST',
|
||||||
|
headers: { 'Content-Type': 'application/json' },
|
||||||
|
body: JSON.stringify({ instruction, doc_ids: docIds, context }),
|
||||||
|
});
|
||||||
|
|
||||||
|
if (!response.ok) {
|
||||||
|
const error = await response.json();
|
||||||
|
throw new Error(error.detail || '指令执行失败');
|
||||||
|
}
|
||||||
|
|
||||||
|
return await response.json();
|
||||||
|
} catch (error) {
|
||||||
|
console.error('指令执行失败:', error);
|
||||||
|
throw error;
|
||||||
|
}
|
||||||
|
},
|
||||||
|
|
||||||
|
/**
|
||||||
|
* Smart chat (instruction execution with multi-turn conversation support)
|
||||||
|
*/
|
||||||
|
async instructionChat(
|
||||||
|
instruction: string,
|
||||||
|
docIds?: string[],
|
||||||
|
context?: Record<string, any>
|
||||||
|
): Promise<{
|
||||||
|
success: boolean;
|
||||||
|
intent: string;
|
||||||
|
result: Record<string, any>;
|
||||||
|
message: string;
|
||||||
|
hint?: string;
|
||||||
|
}> {
|
||||||
|
const url = `${BACKEND_BASE_URL}/instruction/chat`;
|
||||||
|
|
||||||
|
try {
|
||||||
|
const response = await fetch(url, {
|
||||||
|
method: 'POST',
|
||||||
|
headers: { 'Content-Type': 'application/json' },
|
||||||
|
body: JSON.stringify({ instruction, doc_ids: docIds, context }),
|
||||||
|
});
|
||||||
|
|
||||||
|
if (!response.ok) {
|
||||||
|
const error = await response.json();
|
||||||
|
throw new Error(error.detail || '对话处理失败');
|
||||||
|
}
|
||||||
|
|
||||||
|
return await response.json();
|
||||||
|
} catch (error) {
|
||||||
|
console.error('对话处理失败:', error);
|
||||||
|
throw error;
|
||||||
|
}
|
||||||
|
},
|
||||||
|
|
||||||
|
/**
|
||||||
|
* Get the list of supported instruction types
|
||||||
|
*/
|
||||||
|
async getSupportedIntents(): Promise<{
|
||||||
|
intents: Array<{
|
||||||
|
intent: string;
|
||||||
|
name: string;
|
||||||
|
examples: string[];
|
||||||
|
params: string[];
|
||||||
|
}>;
|
||||||
|
}> {
|
||||||
|
const url = `${BACKEND_BASE_URL}/instruction/intents`;
|
||||||
|
|
||||||
|
try {
|
||||||
|
const response = await fetch(url);
|
||||||
|
if (!response.ok) throw new Error('获取指令列表失败');
|
||||||
|
return await response.json();
|
||||||
|
} catch (error) {
|
||||||
|
console.error('获取指令列表失败:', error);
|
||||||
|
throw error;
|
||||||
|
}
|
||||||
|
},
|
||||||
};
|
};
|
||||||
|
|||||||
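Review note: every method added to `backendApi` and `aiApi` in this file repeats the pattern `throw new Error(error.detail || fallback)`. Below is a minimal sketch of that fallback logic as one helper. The name `errorMessageFrom` is hypothetical, and the handling of a non-string `detail` is an assumption about what the backend might return, not something this diff confirms.

```typescript
// Hypothetical helper; not part of the commit.
// Picks the message to show for a failed request, given the parsed error body.
function errorMessageFrom(body: unknown, fallback: string): string {
  if (typeof body === 'object' && body !== null) {
    const detail = (body as { detail?: unknown }).detail;
    // A non-empty string detail is used as-is; an empty one falls back,
    // matching `error.detail || fallback` in the diff.
    if (typeof detail === 'string') {
      return detail.length > 0 ? detail : fallback;
    }
    // Assumption: a structured detail (object or array) is serialized
    // instead of being rendered as "[object Object]".
    if (detail !== undefined && detail !== null) {
      return JSON.stringify(detail);
    }
  }
  return fallback;
}
```

A call site would then read `throw new Error(errorMessageFrom(await response.json(), '...'))`. Note that `response.json()` itself rejects when the body is not valid JSON, which neither the code in the diff nor this sketch handles.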
@@ -1,4 +1,4 @@
|
|||||||
import React, { useState, useEffect, useCallback } from 'react';
|
import React, { useState, useEffect, useCallback, useRef } from 'react';
|
||||||
import { useDropzone } from 'react-dropzone';
|
import { useDropzone } from 'react-dropzone';
|
||||||
import {
|
import {
|
||||||
FileText,
|
FileText,
|
||||||
@@ -23,7 +23,8 @@ import {
|
|||||||
List,
|
List,
|
||||||
MessageSquareCode,
|
MessageSquareCode,
|
||||||
Tag,
|
Tag,
|
||||||
HelpCircle
|
HelpCircle,
|
||||||
|
Plus
|
||||||
} from 'lucide-react';
|
} from 'lucide-react';
|
||||||
import { Button } from '@/components/ui/button';
|
import { Button } from '@/components/ui/button';
|
||||||
import { Input } from '@/components/ui/input';
|
import { Input } from '@/components/ui/input';
|
||||||
@@ -72,8 +73,10 @@ const Documents: React.FC = () => {
|
|||||||
// Upload-related state
|
// Upload-related state
|
||||||
const [uploading, setUploading] = useState(false);
|
const [uploading, setUploading] = useState(false);
|
||||||
const [uploadedFile, setUploadedFile] = useState<File | null>(null);
|
const [uploadedFile, setUploadedFile] = useState<File | null>(null);
|
||||||
|
const [uploadedFiles, setUploadedFiles] = useState<File[]>([]);
|
||||||
const [parseResult, setParseResult] = useState<ExcelParseResult | null>(null);
|
const [parseResult, setParseResult] = useState<ExcelParseResult | null>(null);
|
||||||
const [expandedSheet, setExpandedSheet] = useState<string | null>(null);
|
const [expandedSheet, setExpandedSheet] = useState<string | null>(null);
|
||||||
|
const [uploadExpanded, setUploadExpanded] = useState(false);
|
||||||
|
|
||||||
// AI analysis state
|
// AI analysis state
|
||||||
const [analyzing, setAnalyzing] = useState(false);
|
const [analyzing, setAnalyzing] = useState(false);
|
||||||
@@ -210,75 +213,119 @@ const Documents: React.FC = () => {
|
|||||||
|
|
||||||
// File upload handling
|
// File upload handling
|
||||||
const onDrop = async (acceptedFiles: File[]) => {
|
const onDrop = async (acceptedFiles: File[]) => {
|
||||||
const file = acceptedFiles[0];
|
if (acceptedFiles.length === 0) return;
|
||||||
if (!file) return;
|
|
||||||
|
|
||||||
setUploadedFile(file);
|
|
||||||
setUploading(true);
|
setUploading(true);
|
||||||
setParseResult(null);
|
let successCount = 0;
|
||||||
setAiAnalysis(null);
|
let failCount = 0;
|
||||||
setAnalysisCharts(null);
|
const successfulFiles: File[] = [];
|
||||||
setExpandedSheet(null);
|
|
||||||
setMdAnalysis(null);
|
|
||||||
setMdSections([]);
|
|
||||||
setMdStreamingContent('');
|
|
||||||
|
|
||||||
const ext = file.name.split('.').pop()?.toLowerCase();
|
// Upload the files one by one
|
||||||
|
for (const file of acceptedFiles) {
|
||||||
|
const ext = file.name.split('.').pop()?.toLowerCase();
|
||||||
|
|
||||||
try {
|
try {
|
||||||
// Excel files use the dedicated upload endpoint
|
if (ext === 'xlsx' || ext === 'xls') {
|
||||||
if (ext === 'xlsx' || ext === 'xls') {
|
const result = await backendApi.uploadExcel(file, {
|
||||||
const result = await backendApi.uploadExcel(file, {
|
parseAllSheets: parseOptions.parseAllSheets,
|
||||||
parseAllSheets: parseOptions.parseAllSheets,
|
headerRow: parseOptions.headerRow
|
||||||
headerRow: parseOptions.headerRow
|
});
|
||||||
});
|
if (result.success) {
|
||||||
if (result.success) {
|
successCount++;
|
||||||
toast.success(`解析成功: ${file.name}`);
|
successfulFiles.push(file);
|
||||||
setParseResult(result);
|
// The first Excel file provides the parse result used for the preview
|
||||||
loadDocuments(); // refresh the document list
|
if (successCount === 1) {
|
||||||
if (result.metadata?.sheet_count === 1) {
|
setUploadedFile(file);
|
||||||
setExpandedSheet(Object.keys(result.data?.sheets || {})[0] || null);
|
setParseResult(result);
|
||||||
|
if (result.metadata?.sheet_count === 1) {
|
||||||
|
setExpandedSheet(Object.keys(result.data?.sheets || {})[0] || null);
|
||||||
|
}
|
||||||
|
}
|
||||||
|
loadDocuments();
|
||||||
|
} else {
|
||||||
|
failCount++;
|
||||||
|
toast.error(`${file.name}: ${result.error || '解析失败'}`);
|
||||||
|
}
|
||||||
|
} else if (ext === 'md' || ext === 'markdown') {
|
||||||
|
const result = await backendApi.uploadDocument(file);
|
||||||
|
if (result.task_id) {
|
||||||
|
successCount++;
|
||||||
|
successfulFiles.push(file);
|
||||||
|
if (successCount === 1) {
|
||||||
|
setUploadedFile(file);
|
||||||
|
}
|
||||||
|
// Poll the task status
|
||||||
|
let attempts = 0;
|
||||||
|
const checkStatus = async () => {
|
||||||
|
while (attempts < 30) {
|
||||||
|
try {
|
||||||
|
const status = await backendApi.getTaskStatus(result.task_id);
|
||||||
|
if (status.status === 'success') {
|
||||||
|
loadDocuments();
|
||||||
|
return;
|
||||||
|
} else if (status.status === 'failure') {
|
||||||
|
return;
|
||||||
|
}
|
||||||
|
} catch (e) {
|
||||||
|
console.error('检查状态失败', e);
|
||||||
|
}
|
||||||
|
await new Promise(resolve => setTimeout(resolve, 2000));
|
||||||
|
attempts++;
|
||||||
|
}
|
||||||
|
};
|
||||||
|
checkStatus();
|
||||||
|
} else {
|
||||||
|
failCount++;
|
||||||
}
|
}
|
||||||
} else {
|
} else {
|
||||||
toast.error(result.error || '解析失败');
|
// Other documents use the generic upload endpoint
|
||||||
}
|
const result = await backendApi.uploadDocument(file);
|
||||||
} else if (ext === 'md' || ext === 'markdown') {
|
if (result.task_id) {
|
||||||
// Markdown file: fetch the outline
|
successCount++;
|
||||||
await fetchMdOutline();
|
successfulFiles.push(file);
|
||||||
} else {
|
if (successCount === 1) {
|
||||||
// Other documents use the generic upload endpoint
|
setUploadedFile(file);
|
||||||
const result = await backendApi.uploadDocument(file);
|
|
||||||
if (result.task_id) {
|
|
||||||
toast.success(`文件 ${file.name} 已提交处理`);
|
|
||||||
// Poll the task status
|
|
||||||
let attempts = 0;
|
|
||||||
const checkStatus = async () => {
|
|
||||||
while (attempts < 30) {
|
|
||||||
try {
|
|
||||||
const status = await backendApi.getTaskStatus(result.task_id);
|
|
||||||
if (status.status === 'success') {
|
|
||||||
toast.success(`文件 ${file.name} 处理完成`);
|
|
||||||
loadDocuments();
|
|
||||||
return;
|
|
||||||
} else if (status.status === 'failure') {
|
|
||||||
toast.error(`文件 ${file.name} 处理失败`);
|
|
||||||
return;
|
|
||||||
}
|
|
||||||
} catch (e) {
|
|
||||||
console.error('检查状态失败', e);
|
|
||||||
}
|
|
||||||
await new Promise(resolve => setTimeout(resolve, 2000));
|
|
||||||
attempts++;
|
|
||||||
}
|
}
|
||||||
toast.error(`文件 ${file.name} 处理超时`);
|
// Poll the task status
|
||||||
};
|
let attempts = 0;
|
||||||
checkStatus();
|
const checkStatus = async () => {
|
||||||
|
while (attempts < 30) {
|
||||||
|
try {
|
||||||
|
const status = await backendApi.getTaskStatus(result.task_id);
|
||||||
|
if (status.status === 'success') {
|
||||||
|
loadDocuments();
|
||||||
|
return;
|
||||||
|
} else if (status.status === 'failure') {
|
||||||
|
return;
|
||||||
|
}
|
||||||
|
} catch (e) {
|
||||||
|
console.error('检查状态失败', e);
|
||||||
|
}
|
||||||
|
await new Promise(resolve => setTimeout(resolve, 2000));
|
||||||
|
attempts++;
|
||||||
|
}
|
||||||
|
};
|
||||||
|
checkStatus();
|
||||||
|
} else {
|
||||||
|
failCount++;
|
||||||
|
}
|
||||||
}
|
}
|
||||||
|
} catch (error: any) {
|
||||||
|
failCount++;
|
||||||
|
toast.error(`${file.name}: ${error.message || '上传失败'}`);
|
||||||
}
|
}
|
||||||
} catch (error: any) {
|
}
|
||||||
toast.error(error.message || '上传失败');
|
|
||||||
} finally {
|
setUploading(false);
|
||||||
setUploading(false);
|
loadDocuments();
|
||||||
|
|
||||||
|
if (successCount > 0) {
|
||||||
|
toast.success(`成功上传 ${successCount} 个文件`);
|
||||||
|
setUploadedFiles(prev => [...prev, ...successfulFiles]);
|
||||||
|
setUploadExpanded(true);
|
||||||
|
}
|
||||||
|
if (failCount > 0) {
|
||||||
|
toast.error(`${failCount} 个文件上传失败`);
|
||||||
}
|
}
|
||||||
};
|
};
|
||||||
|
|
||||||
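Review note: in the new `onDrop`, the `md`/`markdown` branch and the final `else` branch both call `backendApi.uploadDocument` and run the same polling loop, so the extension check effectively selects between two routes. Below is a minimal sketch of that routing decision. The names `extensionOf`, `uploadRouteFor`, and `UploadRoute` are hypothetical and do not appear in the commit.

```typescript
// Hypothetical routing helpers; not part of the commit.
type UploadRoute = 'excel' | 'document';

// Lower-cased text after the last dot, or '' when the name has no dot.
// Note: the diff uses file.name.split('.').pop()?.toLowerCase(), which
// returns the whole name for a file without a dot; this sketch returns ''.
function extensionOf(fileName: string): string {
  const parts = fileName.split('.');
  return parts.length > 1 ? parts[parts.length - 1].toLowerCase() : '';
}

// 'excel' maps to backendApi.uploadExcel, 'document' to backendApi.uploadDocument.
function uploadRouteFor(fileName: string): UploadRoute {
  const ext = extensionOf(fileName);
  return ext === 'xlsx' || ext === 'xls' ? 'excel' : 'document';
}
```

With a helper like this, the loop body could branch once on `uploadRouteFor(file.name)` and share a single copy of the polling code between Markdown and other document types.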
@@ -291,7 +338,7 @@ const Documents: React.FC = () => {
|
|||||||
'text/markdown': ['.md'],
|
'text/markdown': ['.md'],
|
||||||
'text/plain': ['.txt']
|
'text/plain': ['.txt']
|
||||||
},
|
},
|
||||||
maxFiles: 1
|
multiple: true
|
||||||
});
|
});
|
||||||
|
|
||||||
// AI analysis handling
|
// AI analysis handling
|
||||||
@@ -449,6 +496,7 @@ const Documents: React.FC = () => {
|
|||||||
|
|
||||||
const handleDeleteFile = () => {
|
const handleDeleteFile = () => {
|
||||||
setUploadedFile(null);
|
setUploadedFile(null);
|
||||||
|
setUploadedFiles([]);
|
||||||
setParseResult(null);
|
setParseResult(null);
|
||||||
setAiAnalysis(null);
|
setAiAnalysis(null);
|
||||||
setAnalysisCharts(null);
|
setAnalysisCharts(null);
|
||||||
@@ -456,6 +504,17 @@ const Documents: React.FC = () => {
|
|||||||
toast.success('文件已清除');
|
toast.success('文件已清除');
|
||||||
};
|
};
|
||||||
|
|
||||||
|
const handleRemoveUploadedFile = (index: number) => {
|
||||||
|
setUploadedFiles(prev => {
|
||||||
|
const newFiles = prev.filter((_, i) => i !== index);
|
||||||
|
if (newFiles.length === 0) {
|
||||||
|
setUploadedFile(null);
|
||||||
|
}
|
||||||
|
return newFiles;
|
||||||
|
});
|
||||||
|
toast.success('文件已从列表移除');
|
||||||
|
};
|
||||||
|
|
||||||
const handleDelete = async (docId: string) => {
|
const handleDelete = async (docId: string) => {
|
||||||
try {
|
try {
|
||||||
const result = await backendApi.deleteDocument(docId);
|
const result = await backendApi.deleteDocument(docId);
|
||||||
@@ -615,7 +674,7 @@ const Documents: React.FC = () => {
|
|||||||
<h1 className="text-3xl font-extrabold tracking-tight">文档中心</h1>
|
<h1 className="text-3xl font-extrabold tracking-tight">文档中心</h1>
|
||||||
<p className="text-muted-foreground">上传文档,自动解析并使用 AI 进行深度分析</p>
|
<p className="text-muted-foreground">上传文档,自动解析并使用 AI 进行深度分析</p>
|
||||||
</div>
|
</div>
|
||||||
<Button variant="outline" className="rounded-xl gap-2" onClick={loadDocuments}>
|
<Button variant="outline" className="rounded-xl gap-2" onClick={() => loadDocuments()}>
|
||||||
<RefreshCcw size={18} />
|
<RefreshCcw size={18} />
|
||||||
<span>刷新</span>
|
<span>刷新</span>
|
||||||
</Button>
|
</Button>
|
||||||
@@ -640,7 +699,83 @@ const Documents: React.FC = () => {
|
|||||||
</CardHeader>
|
</CardHeader>
|
||||||
{uploadPanelOpen && (
|
{uploadPanelOpen && (
|
||||||
<CardContent className="space-y-4">
|
<CardContent className="space-y-4">
|
||||||
{!uploadedFile ? (
|
{uploadedFiles.length > 0 || uploadedFile ? (
|
||||||
|
<div className="space-y-3">
|
||||||
|
{/* File list header */}
|
||||||
|
<div
|
||||||
|
className="flex items-center justify-between p-3 bg-muted/50 rounded-xl cursor-pointer hover:bg-muted/70 transition-colors"
|
||||||
|
onClick={() => setUploadExpanded(!uploadExpanded)}
|
||||||
|
>
|
||||||
|
<div className="flex items-center gap-3">
|
||||||
|
<div className="w-10 h-10 rounded-lg bg-primary/10 text-primary flex items-center justify-center">
|
||||||
|
<Upload size={20} />
|
||||||
|
</div>
|
||||||
|
<div>
|
||||||
|
<p className="font-semibold text-sm">
|
||||||
|
已上传 {(uploadedFiles.length > 0 ? uploadedFiles : [uploadedFile]).length} 个文件
|
||||||
|
</p>
|
||||||
|
<p className="text-xs text-muted-foreground">
|
||||||
|
{uploadExpanded ? '点击收起' : '点击展开查看'}
|
||||||
|
</p>
|
||||||
|
</div>
|
||||||
|
</div>
|
||||||
|
<div className="flex items-center gap-2">
|
||||||
|
<Button
|
||||||
|
variant="ghost"
|
||||||
|
size="sm"
|
||||||
|
onClick={(e) => {
|
||||||
|
e.stopPropagation();
|
||||||
|
handleDeleteFile();
|
||||||
|
}}
|
||||||
|
className="text-destructive hover:text-destructive"
|
||||||
|
>
|
||||||
|
<Trash2 size={14} className="mr-1" />
|
||||||
|
清空
|
||||||
|
</Button>
|
||||||
|
{uploadExpanded ? <ChevronUp size={16} /> : <ChevronDown size={16} />}
|
||||||
|
</div>
|
||||||
|
</div>
|
||||||
|
|
||||||
|
{/* Expanded file list */}
|
||||||
|
{uploadExpanded && (
|
||||||
|
<div className="space-y-2 border rounded-xl p-3">
|
||||||
|
{(uploadedFiles.length > 0 ? uploadedFiles : [uploadedFile]).filter(Boolean).map((file, index) => (
|
||||||
|
<div key={index} className="flex items-center gap-3 p-2 bg-background rounded-lg">
|
||||||
|
<div className={cn(
|
||||||
|
"w-8 h-8 rounded flex items-center justify-center",
|
||||||
|
isExcelFile(file?.name || '') ? "bg-emerald-500/10 text-emerald-500" : "bg-blue-500/10 text-blue-500"
|
||||||
|
)}>
|
||||||
|
{isExcelFile(file?.name || '') ? <FileSpreadsheet size={16} /> : <FileText size={16} />}
|
||||||
|
</div>
|
||||||
|
<div className="flex-1 min-w-0">
|
||||||
|
<p className="text-sm truncate">{file?.name}</p>
|
||||||
|
<p className="text-xs text-muted-foreground">{formatFileSize(file?.size || 0)}</p>
|
||||||
|
</div>
|
||||||
|
<Button
|
||||||
|
variant="ghost"
|
||||||
|
size="icon"
|
||||||
|
className="text-destructive hover:bg-destructive/10"
|
||||||
|
onClick={() => handleRemoveUploadedFile(index)}
|
||||||
|
>
|
||||||
|
<Trash2 size={14} />
|
||||||
|
</Button>
|
||||||
|
</div>
|
||||||
|
))}
|
||||||
|
|
||||||
|
{/* "Add more files" button */}
|
||||||
|
<div
|
||||||
|
{...getRootProps()}
|
||||||
|
className="flex items-center justify-center gap-2 p-3 border-2 border-dashed rounded-lg cursor-pointer hover:border-primary/50 hover:bg-primary/5 transition-colors"
|
||||||
|
onClick={(e) => e.stopPropagation()}
|
||||||
|
>
|
||||||
|
<input {...getInputProps()} multiple={true} />
|
||||||
|
<Plus size={16} className="text-muted-foreground" />
|
||||||
|
<span className="text-sm text-muted-foreground">继续添加更多文件</span>
|
||||||
|
</div>
|
||||||
|
</div>
|
||||||
|
)}
|
||||||
|
</div>
|
||||||
|
) : (
|
||||||
<div
|
<div
|
||||||
{...getRootProps()}
|
{...getRootProps()}
|
||||||
className={cn(
|
className={cn(
|
||||||
@@ -649,7 +784,7 @@ const Documents: React.FC = () => {
|
|||||||
uploading && "opacity-50 pointer-events-none"
|
uploading && "opacity-50 pointer-events-none"
|
||||||
)}
|
)}
|
||||||
>
|
>
|
||||||
<input {...getInputProps()} />
|
<input {...getInputProps()} multiple={true} />
|
||||||
<div className="w-14 h-14 rounded-xl bg-primary/10 text-primary flex items-center justify-center mb-4 group-hover:scale-110 transition-transform">
|
<div className="w-14 h-14 rounded-xl bg-primary/10 text-primary flex items-center justify-center mb-4 group-hover:scale-110 transition-transform">
|
||||||
{uploading ? <Loader2 className="animate-spin" size={28} /> : <Upload size={28} />}
|
{uploading ? <Loader2 className="animate-spin" size={28} /> : <Upload size={28} />}
|
||||||
</div>
|
</div>
|
||||||
@@ -671,30 +806,6 @@ const Documents: React.FC = () => {
|
|||||||
</Badge>
|
</Badge>
|
||||||
</div>
|
</div>
|
||||||
</div>
|
</div>
|
||||||
) : (
|
|
||||||
<div className="space-y-4">
|
|
||||||
<div className="flex items-center gap-3 p-3 bg-muted/30 rounded-xl">
|
|
||||||
<div className={cn(
|
|
||||||
"w-10 h-10 rounded-lg flex items-center justify-center",
|
|
||||||
isExcelFile(uploadedFile.name) ? "bg-emerald-500/10 text-emerald-500" : "bg-blue-500/10 text-blue-500"
|
|
||||||
)}>
|
|
||||||
{isExcelFile(uploadedFile.name) ? <FileSpreadsheet size={20} /> : <FileText size={20} />}
|
|
||||||
</div>
|
|
||||||
<div className="flex-1 min-w-0">
|
|
||||||
<p className="font-semibold text-sm truncate">{uploadedFile.name}</p>
|
|
||||||
<p className="text-xs text-muted-foreground">{formatFileSize(uploadedFile.size)}</p>
|
|
||||||
</div>
|
|
||||||
<Button variant="ghost" size="icon" className="text-destructive hover:bg-destructive/10" onClick={handleDeleteFile}>
|
|
||||||
<Trash2 size={16} />
|
|
||||||
</Button>
|
|
||||||
</div>
|
|
||||||
|
|
||||||
{isExcelFile(uploadedFile.name) && (
|
|
||||||
<Button onClick={() => onDrop([uploadedFile])} className="w-full" disabled={uploading}>
|
|
||||||
{uploading ? '解析中...' : '重新解析'}
|
|
||||||
</Button>
|
|
||||||
)}
|
|
||||||
</div>
|
|
||||||
)}
|
)}
|
||||||
</CardContent>
|
</CardContent>
|
||||||
)}
|
)}
|
||||||
|
|||||||
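A note on the multi-file upload hunk above: the header counts `(uploadedFiles.length > 0 ? uploadedFiles : [uploadedFile]).length`, while the expanded list applies `.filter(Boolean)` to the same expression, so the count and the list could disagree if `uploadedFile` is ever null. One way to keep them in sync is to resolve the list once. The sketch below is only an illustration: the helper name `resolveUploadedFiles` and the simplified `UploadedFileLike` type are assumptions and do not appear in the diff.

```typescript
// Hypothetical minimal shape; the real component most likely holds File objects.
interface UploadedFileLike {
  name: string;
  size: number;
}

// Hypothetical helper: prefer the multi-file list, fall back to the legacy
// single file, and drop null/undefined entries in one place.
function resolveUploadedFiles(
  uploadedFiles: UploadedFileLike[],
  uploadedFile: UploadedFileLike | null | undefined
): UploadedFileLike[] {
  const candidates: Array<UploadedFileLike | null | undefined> =
    uploadedFiles.length > 0 ? uploadedFiles : [uploadedFile];
  // The type guard narrows the result so callers need no optional chaining.
  return candidates.filter(
    (f): f is UploadedFileLike => f !== null && f !== undefined
  );
}
```

With such a helper, both the header count and the `.map(...)` in the expanded list could read from the same array, and the `file?.name || ''` fallbacks would no longer be needed.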
File diff suppressed because it is too large
@@ -1,603 +0,0 @@
|
|||||||
import React, { useState, useEffect } from 'react';
|
|
||||||
import {
|
|
||||||
TableProperties,
|
|
||||||
Plus,
|
|
||||||
FilePlus,
|
|
||||||
CheckCircle2,
|
|
||||||
Download,
|
|
||||||
Clock,
|
|
||||||
RefreshCcw,
|
|
||||||
Sparkles,
|
|
||||||
Zap,
|
|
||||||
FileCheck,
|
|
||||||
FileSpreadsheet,
|
|
||||||
Trash2,
|
|
||||||
ChevronDown,
|
|
||||||
ChevronUp,
|
|
||||||
BarChart3,
|
|
||||||
FileText,
|
|
||||||
TrendingUp,
|
|
||||||
Info,
|
|
||||||
AlertCircle,
|
|
||||||
Loader2
|
|
||||||
} from 'lucide-react';
|
|
||||||
import { Button } from '@/components/ui/button';
|
|
||||||
import { Card, CardContent, CardHeader, CardTitle, CardDescription, CardFooter } from '@/components/ui/card';
|
|
||||||
import { Badge } from '@/components/ui/badge';
|
|
||||||
import { useAuth } from '@/context/AuthContext';
|
|
||||||
import { templateApi, documentApi, taskApi } from '@/db/api';
|
|
||||||
import { backendApi, aiApi } from '@/db/backend-api';
|
|
||||||
import { supabase } from '@/db/supabase';
|
|
||||||
import { format } from 'date-fns';
|
|
||||||
import { toast } from 'sonner';
|
|
||||||
import { cn } from '@/lib/utils';
|
|
||||||
import { Skeleton } from '@/components/ui/skeleton';
|
|
||||||
import {
|
|
||||||
Dialog,
|
|
||||||
DialogContent,
|
|
||||||
DialogHeader,
|
|
||||||
DialogTitle,
|
|
||||||
DialogTrigger,
|
|
||||||
DialogFooter,
|
|
||||||
DialogDescription
|
|
||||||
} from '@/components/ui/dialog';
|
|
||||||
import { Checkbox } from '@/components/ui/checkbox';
|
|
||||||
import { ScrollArea } from '@/components/ui/scroll-area';
|
|
||||||
import { Input } from '@/components/ui/input';
|
|
||||||
import { Label } from '@/components/ui/label';
|
|
||||||
import { Textarea } from '@/components/ui/textarea';
|
|
||||||
import { Select, SelectContent, SelectItem, SelectTrigger, SelectValue } from '@/components/ui/select';
|
|
||||||
import { useDropzone } from 'react-dropzone';
|
|
||||||
import { Markdown } from '@/components/ui/markdown';
|
|
||||||
|
|
||||||
type Template = any;
|
|
||||||
type Document = any;
|
|
||||||
type FillTask = any;
|
|
||||||
|
|
||||||
const FormFill: React.FC = () => {
|
|
||||||
const { profile } = useAuth();
|
|
||||||
const [templates, setTemplates] = useState<Template[]>([]);
|
|
||||||
const [documents, setDocuments] = useState<Document[]>([]);
|
|
||||||
const [tasks, setTasks] = useState<any[]>([]);
|
|
||||||
const [loading, setLoading] = useState(true);
|
|
||||||
|
|
||||||
// Selection state
|
|
||||||
const [selectedTemplate, setSelectedTemplate] = useState<string | null>(null);
|
|
||||||
const [selectedDocs, setSelectedDocs] = useState<string[]>([]);
|
|
||||||
const [creating, setCreating] = useState(false);
|
|
||||||
const [openTaskDialog, setOpenTaskDialog] = useState(false);
|
|
||||||
const [viewingTask, setViewingTask] = useState<any | null>(null);
|
|
||||||
|
|
||||||
// Excel upload state
|
|
||||||
const [excelFile, setExcelFile] = useState<File | null>(null);
|
|
||||||
const [excelParseResult, setExcelParseResult] = useState<any>(null);
|
|
||||||
const [excelAnalysis, setExcelAnalysis] = useState<any>(null);
|
|
||||||
const [excelAnalyzing, setExcelAnalyzing] = useState(false);
|
|
||||||
const [expandedSheet, setExpandedSheet] = useState<string | null>(null);
|
|
||||||
const [aiOptions, setAiOptions] = useState({
|
|
||||||
userPrompt: '请分析这些数据,并提取关键信息用于填表,包括数值、分类、摘要等。',
|
|
||||||
analysisType: 'general' as 'general' | 'summary' | 'statistics' | 'insights'
|
|
||||||
});
|
|
||||||
|
|
||||||
const loadData = async () => {
|
|
||||||
if (!profile) return;
|
|
||||||
try {
|
|
||||||
const [t, d, ts] = await Promise.all([
|
|
||||||
templateApi.listTemplates((profile as any).id),
|
|
||||||
documentApi.listDocuments((profile as any).id),
|
|
||||||
taskApi.listTasks((profile as any).id)
|
|
||||||
]);
|
|
||||||
setTemplates(t);
|
|
||||||
setDocuments(d);
|
|
||||||
setTasks(ts);
|
|
||||||
} catch (err: any) {
|
|
||||||
toast.error('数据加载失败');
|
|
||||||
} finally {
|
|
||||||
setLoading(false);
|
|
||||||
}
|
|
||||||
};
|
|
||||||
|
|
||||||
useEffect(() => {
|
|
||||||
loadData();
|
|
||||||
}, [profile]);
|
|
||||||
|
|
||||||
// Excel upload handlers
|
|
||||||
const onExcelDrop = async (acceptedFiles: File[]) => {
|
|
||||||
const file = acceptedFiles[0];
|
|
||||||
if (!file) return;
|
|
||||||
|
|
||||||
if (!file.name.match(/\.(xlsx|xls)$/i)) {
|
|
||||||
toast.error('仅支持 .xlsx 和 .xls 格式的 Excel 文件');
|
|
||||||
return;
|
|
||||||
}
|
|
||||||
|
|
||||||
setExcelFile(file);
|
|
||||||
setExcelParseResult(null);
|
|
||||||
setExcelAnalysis(null);
|
|
||||||
setExpandedSheet(null);
|
|
||||||
|
|
||||||
try {
|
|
||||||
const result = await backendApi.uploadExcel(file);
|
|
||||||
if (result.success) {
|
|
||||||
toast.success(`Excel 解析成功: ${file.name}`);
|
|
||||||
setExcelParseResult(result);
|
|
||||||
} else {
|
|
||||||
toast.error(result.error || '解析失败');
|
|
||||||
}
|
|
||||||
} catch (error: any) {
|
|
||||||
toast.error(error.message || '上传失败');
|
|
||||||
}
|
|
||||||
};
|
|
||||||
|
|
||||||
const { getRootProps, getInputProps, isDragActive } = useDropzone({
|
|
||||||
onDrop: onExcelDrop,
|
|
||||||
accept: {
|
|
||||||
'application/vnd.openxmlformats-officedocument.spreadsheetml.sheet': ['.xlsx'],
|
|
||||||
'application/vnd.ms-excel': ['.xls']
|
|
||||||
},
|
|
||||||
maxFiles: 1
|
|
||||||
});
|
|
||||||
|
|
||||||
const handleAnalyzeExcel = async () => {
|
|
||||||
if (!excelFile || !excelParseResult?.success) {
|
|
||||||
toast.error('请先上传并解析 Excel 文件');
|
|
||||||
return;
|
|
||||||
}
|
|
||||||
|
|
||||||
setExcelAnalyzing(true);
|
|
||||||
setExcelAnalysis(null);
|
|
||||||
|
|
||||||
try {
|
|
||||||
const result = await aiApi.analyzeExcel(excelFile, {
|
|
||||||
userPrompt: aiOptions.userPrompt,
|
|
||||||
analysisType: aiOptions.analysisType
|
|
||||||
});
|
|
||||||
|
|
||||||
if (result.success) {
|
|
||||||
toast.success('AI 分析完成');
|
|
||||||
setExcelAnalysis(result);
|
|
||||||
} else {
|
|
||||||
toast.error(result.error || 'AI 分析失败');
|
|
||||||
}
|
|
||||||
} catch (error: any) {
|
|
||||||
toast.error(error.message || 'AI 分析失败');
|
|
||||||
} finally {
|
|
||||||
setExcelAnalyzing(false);
|
|
||||||
}
|
|
||||||
};
|
|
||||||
|
|
||||||
const handleUseExcelData = () => {
|
|
||||||
if (!excelParseResult?.success) {
|
|
||||||
toast.error('请先解析 Excel 文件');
|
|
||||||
return;
|
|
||||||
}
|
|
||||||
|
|
||||||
// Mark the parsed Excel data as a "document" and add it to the selection list
|
|
||||||
toast.success('Excel 数据已添加到数据源,请在任务对话框中选择');
|
|
||||||
// Logic for passing the Excel data to the backend to create a task can be added here
|
|
||||||
};
|
|
||||||
|
|
||||||
const handleDeleteExcel = () => {
|
|
||||||
setExcelFile(null);
|
|
||||||
setExcelParseResult(null);
|
|
||||||
setExcelAnalysis(null);
|
|
||||||
setExpandedSheet(null);
|
|
||||||
toast.success('Excel 文件已清除');
|
|
||||||
};
|
|
||||||
|
|
||||||
const handleUploadTemplate = async (e: React.ChangeEvent<HTMLInputElement>) => {
|
|
||||||
const file = e.target.files?.[0];
|
|
||||||
if (!file || !profile) return;
|
|
||||||
|
|
||||||
try {
|
|
||||||
toast.loading('正在上传模板...');
|
|
||||||
await templateApi.uploadTemplate(file, (profile as any).id);
|
|
||||||
toast.dismiss();
|
|
||||||
toast.success('模板上传成功');
|
|
||||||
loadData();
|
|
||||||
} catch (err) {
|
|
||||||
toast.dismiss();
|
|
||||||
toast.error('上传模板失败');
|
|
||||||
}
|
|
||||||
};
|
|
||||||
|
|
||||||
const handleCreateTask = async () => {
|
|
||||||
if (!profile || !selectedTemplate || selectedDocs.length === 0) {
|
|
||||||
toast.error('请先选择模板和数据源文档');
|
|
||||||
return;
|
|
||||||
}
|
|
||||||
|
|
||||||
setCreating(true);
|
|
||||||
try {
|
|
||||||
const task = await taskApi.createTask((profile as any).id, selectedTemplate, selectedDocs);
|
|
||||||
if (task) {
|
|
||||||
toast.success('任务已创建,正在进行智能填表...');
|
|
||||||
setOpenTaskDialog(false);
|
|
||||||
|
|
||||||
// Invoke edge function
|
|
||||||
supabase.functions.invoke('fill-template', {
|
|
||||||
body: { taskId: task.id }
|
|
||||||
}).then(({ error }) => {
|
|
||||||
if (error) toast.error('填表任务执行失败');
|
|
||||||
else {
|
|
||||||
toast.success('表格填写完成!');
|
|
||||||
loadData();
|
|
||||||
}
|
|
||||||
});
|
|
||||||
loadData();
|
|
||||||
}
|
|
||||||
} catch (err: any) {
|
|
||||||
toast.error('创建任务失败');
|
|
||||||
} finally {
|
|
||||||
setCreating(false);
|
|
||||||
}
|
|
||||||
};
|
|
||||||
|
|
||||||
const getStatusColor = (status: string) => {
|
|
||||||
switch (status) {
|
|
||||||
case 'completed': return 'bg-emerald-500 text-white';
|
|
||||||
case 'failed': return 'bg-destructive text-white';
|
|
||||||
default: return 'bg-amber-500 text-white';
|
|
||||||
}
|
|
||||||
};
|
|
||||||
|
|
||||||
const formatFileSize = (bytes: number): string => {
|
|
||||||
if (bytes === 0) return '0 B';
|
|
||||||
const k = 1024;
|
|
||||||
const sizes = ['B', 'KB', 'MB', 'GB'];
|
|
||||||
const i = Math.floor(Math.log(bytes) / Math.log(k));
|
|
||||||
return `${(bytes / Math.pow(k, i)).toFixed(2)} ${sizes[i]}`;
|
|
||||||
};
|
|
||||||
|
|
||||||
return (
|
|
||||||
<div className="space-y-8 animate-fade-in pb-10">
|
|
||||||
<section className="flex flex-col md:flex-row md:items-center justify-between gap-4">
|
|
||||||
<div className="space-y-1">
|
|
||||||
<h1 className="text-3xl font-extrabold tracking-tight">智能填表</h1>
|
|
||||||
<p className="text-muted-foreground">根据您的表格模板,自动聚合多源文档信息进行精准填充,告别重复劳动。</p>
|
|
||||||
</div>
|
|
||||||
<div className="flex items-center gap-3">
|
|
||||||
<Dialog open={openTaskDialog} onOpenChange={setOpenTaskDialog}>
|
|
||||||
<DialogTrigger asChild>
|
|
||||||
<Button className="rounded-xl shadow-lg shadow-primary/20 gap-2 h-11 px-6">
|
|
||||||
<FilePlus size={18} />
|
|
||||||
<span>新建填表任务</span>
|
|
||||||
</Button>
|
|
||||||
</DialogTrigger>
|
|
||||||
<DialogContent className="max-w-4xl max-h-[90vh] flex flex-col p-0 overflow-hidden border-none shadow-2xl rounded-3xl">
|
|
||||||
<DialogHeader className="p-8 pb-4 bg-muted/50">
|
|
||||||
<DialogTitle className="text-2xl font-bold flex items-center gap-2">
|
|
||||||
<Sparkles size={24} className="text-primary" />
|
|
||||||
开启智能填表之旅
|
|
||||||
</DialogTitle>
|
|
||||||
<DialogDescription>
|
|
||||||
选择一个表格模板及若干个数据源文档,AI 将自动为您分析并填写。
|
|
||||||
</DialogDescription>
|
|
||||||
</DialogHeader>
|
|
||||||
|
|
||||||
<ScrollArea className="flex-1 p-8 pt-4">
|
|
||||||
<div className="space-y-8">
|
|
||||||
{/* Step 1: Select Template */}
|
|
||||||
<div className="space-y-4">
|
|
||||||
<div className="flex items-center justify-between">
|
|
||||||
<h4 className="font-bold flex items-center gap-2 text-primary uppercase tracking-widest text-xs">
|
|
||||||
<span className="w-5 h-5 rounded-full bg-primary text-white flex items-center justify-center text-[10px]">1</span>
|
|
||||||
选择表格模板
|
|
||||||
</h4>
|
|
||||||
<label className="cursor-pointer text-xs font-semibold text-primary hover:underline flex items-center gap-1">
|
|
||||||
<Plus size={12} /> 上传新模板
|
|
||||||
<input type="file" className="hidden" onChange={handleUploadTemplate} accept=".docx,.xlsx" />
|
|
||||||
</label>
|
|
||||||
</div>
|
|
||||||
{templates.length > 0 ? (
|
|
||||||
<div className="grid grid-cols-1 sm:grid-cols-2 gap-3">
|
|
||||||
{templates.map(t => (
|
|
||||||
<div
|
|
||||||
key={t.id}
|
|
||||||
className={cn(
|
|
||||||
"p-4 rounded-2xl border-2 transition-all cursor-pointer flex items-center gap-3 group relative overflow-hidden",
|
|
||||||
selectedTemplate === t.id ? "border-primary bg-primary/5" : "border-border hover:border-primary/50"
|
|
||||||
)}
|
|
||||||
onClick={() => setSelectedTemplate(t.id)}
|
|
||||||
>
|
|
||||||
<div className={cn(
|
|
||||||
"w-10 h-10 rounded-xl flex items-center justify-center shrink-0 transition-colors",
|
|
||||||
selectedTemplate === t.id ? "bg-primary text-white" : "bg-muted text-muted-foreground"
|
|
||||||
)}>
|
|
||||||
<TableProperties size={20} />
|
|
||||||
</div>
|
|
||||||
<div className="flex-1 min-w-0">
|
|
||||||
<p className="font-bold text-sm truncate">{t.name}</p>
|
|
||||||
<p className="text-[10px] text-muted-foreground uppercase">{t.type}</p>
|
|
||||||
</div>
|
|
||||||
{selectedTemplate === t.id && (
|
|
||||||
<div className="absolute top-0 right-0 w-8 h-8 bg-primary text-white flex items-center justify-center rounded-bl-xl">
|
|
||||||
<CheckCircle2 size={14} />
|
|
||||||
</div>
|
|
||||||
)}
|
|
||||||
</div>
|
|
||||||
))}
|
|
||||||
</div>
|
|
||||||
) : (
|
|
||||||
<div className="p-8 text-center bg-muted/30 rounded-2xl border border-dashed text-sm italic text-muted-foreground">
|
|
||||||
暂无模板,请先点击右上角上传。
|
|
||||||
</div>
|
|
||||||
)}
|
|
||||||
</div>
|
|
||||||
|
|
||||||
{/* Step 2: Upload & Analyze Excel */}
|
|
||||||
<div className="space-y-4">
|
|
||||||
<h4 className="font-bold flex items-center gap-2 text-primary uppercase tracking-widest text-xs">
|
|
||||||
<span className="w-5 h-5 rounded-full bg-primary text-white flex items-center justify-center text-[10px]">1.5</span>
|
|
||||||
Excel 数据源
|
|
||||||
</h4>
|
|
||||||
<div className="bg-muted/20 rounded-2xl p-6">
|
|
||||||
{!excelFile ? (
|
|
||||||
<div
|
|
||||||
{...getRootProps()}
|
|
||||||
className={cn(
|
|
||||||
"border-2 border-dashed rounded-xl p-8 transition-all duration-300 flex flex-col items-center justify-center text-center cursor-pointer group",
|
|
||||||
isDragActive ? "border-primary bg-primary/5" : "border-muted-foreground/20 hover:border-primary/50 hover:bg-muted/30"
|
|
||||||
)}
|
|
||||||
>
|
|
||||||
<input {...getInputProps()} />
|
|
||||||
<div className="w-12 h-12 rounded-xl bg-primary/10 text-primary flex items-center justify-center mb-3 group-hover:scale-110 transition-transform">
|
|
||||||
<FileSpreadsheet size={24} />
|
|
||||||
</div>
|
|
||||||
<p className="font-semibold text-sm">
|
|
||||||
{isDragActive ? '释放以开始上传' : '点击或拖拽 Excel 文件'}
|
|
||||||
</p>
|
|
||||||
<p className="text-xs text-muted-foreground mt-1">支持 .xlsx 和 .xls 格式</p>
|
|
||||||
</div>
|
|
||||||
) : (
|
|
||||||
<div className="space-y-4">
|
|
||||||
<div className="flex items-center gap-3 p-3 bg-background rounded-xl">
|
|
||||||
<div className="w-10 h-10 rounded-lg bg-emerald-500/10 text-emerald-500 flex items-center justify-center">
|
|
||||||
<FileSpreadsheet size={20} />
|
|
||||||
</div>
|
|
||||||
<div className="flex-1 min-w-0">
|
|
||||||
<p className="font-semibold text-sm truncate">{excelFile.name}</p>
|
|
||||||
<p className="text-xs text-muted-foreground">{formatFileSize(excelFile.size)}</p>
|
|
||||||
</div>
|
|
||||||
<div className="flex gap-2">
|
|
||||||
<Button
|
|
||||||
variant="ghost"
|
|
||||||
size="icon"
|
|
||||||
className="text-destructive hover:bg-destructive/10"
|
|
||||||
onClick={handleDeleteExcel}
|
|
||||||
>
|
|
||||||
<Trash2 size={16} />
|
|
||||||
</Button>
|
|
||||||
</div>
|
|
||||||
</div>
|
|
||||||
|
|
||||||
{/* AI Analysis Options */}
|
|
||||||
{excelParseResult?.success && (
|
|
||||||
<div className="space-y-3">
|
|
||||||
<div className="space-y-2">
|
|
||||||
<Label htmlFor="analysis-type" className="text-xs">分析类型</Label>
|
|
||||||
<Select
|
|
||||||
value={aiOptions.analysisType}
|
|
||||||
onValueChange={(value: any) => setAiOptions({ ...aiOptions, analysisType: value })}
|
|
||||||
>
|
|
||||||
<SelectTrigger id="analysis-type" className="bg-background h-9 text-sm">
|
|
||||||
<SelectValue placeholder="选择分析类型" />
|
|
||||||
</SelectTrigger>
|
|
||||||
<SelectContent>
|
|
||||||
<SelectItem value="general">综合分析</SelectItem>
|
|
||||||
<SelectItem value="summary">数据摘要</SelectItem>
|
|
||||||
<SelectItem value="statistics">统计分析</SelectItem>
|
|
||||||
<SelectItem value="insights">深度洞察</SelectItem>
|
|
||||||
</SelectContent>
|
|
||||||
</Select>
|
|
||||||
</div>
|
|
||||||
<div className="space-y-2">
|
|
||||||
<Label htmlFor="user-prompt" className="text-xs">自定义提示词</Label>
|
|
||||||
<Textarea
|
|
||||||
id="user-prompt"
|
|
||||||
value={aiOptions.userPrompt}
|
|
||||||
onChange={(e) => setAiOptions({ ...aiOptions, userPrompt: e.target.value })}
|
|
||||||
className="bg-background resize-none text-sm"
|
|
||||||
rows={2}
|
|
||||||
/>
|
|
||||||
</div>
|
|
||||||
<Button
|
|
||||||
onClick={handleAnalyzeExcel}
|
|
||||||
disabled={excelAnalyzing}
|
|
||||||
className="w-full gap-2 h-9"
|
|
||||||
variant="outline"
|
|
||||||
>
|
|
||||||
{excelAnalyzing ? <Loader2 className="animate-spin" size={14} /> : <Sparkles size={14} />}
|
|
||||||
{excelAnalyzing ? '分析中...' : 'AI 分析'}
|
|
||||||
</Button>
|
|
||||||
{excelParseResult?.success && (
|
|
||||||
<Button
|
|
||||||
onClick={handleUseExcelData}
|
|
||||||
className="w-full gap-2 h-9"
|
|
||||||
>
|
|
||||||
<CheckCircle2 size={14} />
|
|
||||||
使用此数据源
|
|
||||||
</Button>
|
|
||||||
)}
|
|
||||||
</div>
|
|
||||||
)}
|
|
||||||
|
|
||||||
{/* Excel Analysis Result */}
|
|
||||||
{excelAnalysis && (
|
|
||||||
<div className="mt-4 p-4 bg-background rounded-xl max-h-60 overflow-y-auto">
|
|
||||||
<div className="flex items-center gap-2 mb-3">
|
|
||||||
<Sparkles size={16} className="text-primary" />
|
|
||||||
<span className="font-semibold text-sm">AI 分析结果</span>
|
|
||||||
</div>
|
|
||||||
<Markdown content={excelAnalysis.analysis?.analysis || ''} className="text-sm" />
|
|
||||||
</div>
|
|
||||||
)}
|
|
||||||
</div>
|
|
||||||
)}
|
|
||||||
</div>
|
|
||||||
</div>
|
|
||||||
|
|
||||||
{/* Step 3: Select Documents */}
|
|
||||||
<div className="space-y-4">
|
|
||||||
<h4 className="font-bold flex items-center gap-2 text-primary uppercase tracking-widest text-xs">
|
|
||||||
<span className="w-5 h-5 rounded-full bg-primary text-white flex items-center justify-center text-[10px]">2</span>
|
|
||||||
选择其他数据源文档
|
|
||||||
</h4>
|
|
||||||
{documents.filter(d => d.status === 'completed').length > 0 ? (
|
|
||||||
<div className="space-y-2 max-h-40 overflow-y-auto pr-2 custom-scrollbar">
|
|
||||||
{documents.filter(d => d.status === 'completed').map(doc => (
|
|
||||||
<div
|
|
||||||
key={doc.id}
|
|
||||||
className={cn(
|
|
||||||
"flex items-center gap-3 p-3 rounded-xl border transition-all cursor-pointer",
|
|
||||||
selectedDocs.includes(doc.id) ? "border-primary/50 bg-primary/5 shadow-sm" : "border-border hover:bg-muted/30"
|
|
||||||
)}
|
|
||||||
onClick={() => {
|
|
||||||
setSelectedDocs(prev =>
|
|
||||||
prev.includes(doc.id) ? prev.filter(id => id !== doc.id) : [...prev, doc.id]
|
|
||||||
);
|
|
||||||
}}
|
|
||||||
>
|
|
||||||
<Checkbox checked={selectedDocs.includes(doc.id)} onCheckedChange={() => {}} />
|
|
||||||
<div className="w-8 h-8 rounded-lg bg-blue-500/10 text-blue-500 flex items-center justify-center">
|
|
||||||
<Zap size={16} />
|
|
||||||
</div>
|
|
||||||
<span className="font-semibold text-sm truncate">{doc.name}</span>
|
|
||||||
</div>
|
|
||||||
))}
|
|
||||||
</div>
|
|
||||||
) : (
|
|
||||||
<div className="p-6 text-center bg-muted/30 rounded-xl border border-dashed text-xs italic text-muted-foreground">
|
|
||||||
暂无其他已解析的文档
|
|
||||||
</div>
|
|
||||||
)}
|
|
||||||
</div>
|
|
||||||
</div>
|
|
||||||
</ScrollArea>
|
|
||||||
|
|
||||||
<DialogFooter className="p-8 pt-4 bg-muted/20 border-t border-dashed">
|
|
||||||
<Button variant="outline" className="rounded-xl h-12 px-6" onClick={() => setOpenTaskDialog(false)}>取消</Button>
|
|
||||||
<Button
|
|
||||||
className="rounded-xl h-12 px-8 shadow-lg shadow-primary/20 gap-2"
|
|
||||||
onClick={handleCreateTask}
|
|
||||||
disabled={creating || !selectedTemplate || (selectedDocs.length === 0 && !excelParseResult?.success)}
|
|
||||||
>
|
|
||||||
{creating ? <RefreshCcw className="animate-spin h-5 w-5" /> : <Zap className="h-5 w-5 fill-current" />}
|
|
||||||
<span>启动智能填表引擎</span>
|
|
||||||
</Button>
|
|
||||||
</DialogFooter>
|
|
||||||
</DialogContent>
|
|
||||||
</Dialog>
|
|
||||||
</div>
|
|
||||||
</section>
|
|
||||||
|
|
||||||
{/* Task List */}
|
|
||||||
<div className="grid grid-cols-1 md:grid-cols-2 lg:grid-cols-3 gap-6">
|
|
||||||
{loading ? (
|
|
||||||
Array.from({ length: 3 }).map((_, i) => (
|
|
||||||
<Skeleton key={i} className="h-48 w-full rounded-3xl bg-muted" />
|
|
||||||
))
|
|
||||||
) : tasks.length > 0 ? (
|
|
||||||
tasks.map((task) => (
|
|
||||||
<Card key={task.id} className="border-none shadow-md hover:shadow-xl transition-all group rounded-3xl overflow-hidden flex flex-col">
|
|
||||||
<div className="h-1.5 w-full" style={{ backgroundColor: task.status === 'completed' ? '#10b981' : task.status === 'failed' ? '#ef4444' : '#f59e0b' }} />
|
|
||||||
<CardHeader className="p-6 pb-2">
|
|
||||||
<div className="flex justify-between items-start mb-2">
|
|
||||||
<div className="w-12 h-12 rounded-2xl bg-emerald-500/10 text-emerald-500 flex items-center justify-center shadow-inner group-hover:scale-110 transition-transform">
|
|
||||||
<TableProperties size={24} />
|
|
||||||
</div>
|
|
||||||
<Badge className={cn("text-[10px] uppercase font-bold tracking-widest", getStatusColor(task.status))}>
|
|
||||||
{task.status === 'completed' ? '已完成' : task.status === 'failed' ? '失败' : '执行中'}
|
|
||||||
</Badge>
|
|
||||||
</div>
|
|
||||||
<CardTitle className="text-lg font-bold truncate group-hover:text-primary transition-colors">{task.templates?.name || '未知模板'}</CardTitle>
|
|
||||||
<CardDescription className="text-xs flex items-center gap-1 font-medium italic">
|
|
||||||
<Clock size={12} /> {format(new Date(task.created_at!), 'yyyy/MM/dd HH:mm')}
|
|
||||||
</CardDescription>
|
|
||||||
</CardHeader>
|
|
||||||
<CardContent className="p-6 pt-2 flex-1">
|
|
||||||
<div className="space-y-4">
|
|
||||||
<div className="flex flex-wrap gap-2">
|
|
||||||
<Badge variant="outline" className="bg-muted/50 border-none text-[10px] font-bold">关联 {task.document_ids?.length} 份数据源</Badge>
|
|
||||||
</div>
|
|
||||||
{task.status === 'completed' && (
|
|
||||||
<div className="p-3 bg-emerald-500/5 rounded-2xl border border-emerald-500/10 flex items-center gap-3">
|
|
||||||
<CheckCircle2 className="text-emerald-500" size={18} />
|
|
||||||
<span className="text-xs font-semibold text-emerald-700">内容已精准聚合,表格生成完毕</span>
|
|
||||||
</div>
|
|
||||||
)}
|
|
||||||
</div>
|
|
||||||
</CardContent>
|
|
||||||
<CardFooter className="p-6 pt-0">
|
|
||||||
<Button
|
|
||||||
className="w-full rounded-2xl h-11 bg-primary group-hover:shadow-lg group-hover:shadow-primary/30 transition-all gap-2"
|
|
||||||
disabled={task.status !== 'completed'}
|
|
||||||
onClick={() => setViewingTask(task)}
|
|
||||||
>
|
|
||||||
<Download size={18} />
|
|
||||||
<span>下载汇总表格</span>
|
|
||||||
</Button>
|
|
||||||
</CardFooter>
|
|
||||||
</Card>
|
|
||||||
))
|
|
||||||
) : (
|
|
||||||
<div className="col-span-full py-24 flex flex-col items-center justify-center text-center space-y-6">
|
|
||||||
<div className="w-24 h-24 rounded-full bg-muted flex items-center justify-center text-muted-foreground/30 border-4 border-dashed">
|
|
||||||
<TableProperties size={48} />
|
|
||||||
</div>
|
|
||||||
<div className="space-y-2 max-w-sm">
|
|
||||||
<p className="text-2xl font-extrabold tracking-tight">暂无生成任务</p>
|
|
||||||
<p className="text-muted-foreground text-sm">上传模板后,您可以将多个文档的数据自动填充到汇总表格中。</p>
|
|
||||||
</div>
|
|
||||||
<Button className="rounded-xl h-12 px-8" onClick={() => setOpenTaskDialog(true)}>立即创建首个任务</Button>
|
|
||||||
</div>
|
|
||||||
)}
|
|
||||||
</div>
|
|
||||||
|
|
||||||
{/* Task Result View Modal */}
|
|
||||||
<Dialog open={!!viewingTask} onOpenChange={(open) => !open && setViewingTask(null)}>
|
|
||||||
<DialogContent className="max-w-4xl max-h-[90vh] flex flex-col p-0 overflow-hidden border-none shadow-2xl rounded-3xl">
|
|
||||||
<DialogHeader className="p-8 pb-4 bg-primary text-primary-foreground">
|
|
||||||
<div className="flex items-center gap-3 mb-2">
|
|
||||||
<FileCheck size={28} />
|
|
||||||
<DialogTitle className="text-2xl font-extrabold">表格生成结果预览</DialogTitle>
|
|
||||||
</div>
|
|
||||||
<DialogDescription className="text-primary-foreground/80 italic">
|
|
||||||
系统已根据 {viewingTask?.document_ids?.length} 份文档信息自动填充完毕。
|
|
||||||
</DialogDescription>
|
|
||||||
</DialogHeader>
|
|
||||||
<ScrollArea className="flex-1 p-8 bg-muted/10">
|
|
||||||
<div className="prose dark:prose-invert max-w-none">
|
|
||||||
<div className="bg-card p-8 rounded-2xl shadow-sm border min-h-[400px]">
|
|
||||||
<Badge variant="outline" className="mb-4">数据已脱敏</Badge>
|
|
||||||
<div className="whitespace-pre-wrap font-sans text-sm leading-relaxed">
|
|
||||||
<h2 className="text-xl font-bold mb-4">汇总结果报告</h2>
|
|
||||||
<p className="text-muted-foreground mb-6">以下是根据您上传的多个文档提取并生成的汇总信息:</p>
|
|
||||||
|
|
||||||
<div className="p-4 bg-muted/30 rounded-xl border border-dashed border-primary/20 italic">
|
|
||||||
正在从云端安全下载解析结果并渲染视图...
|
|
||||||
</div>
|
|
||||||
|
|
||||||
<div className="mt-8 space-y-4">
|
|
||||||
<p className="font-semibold text-primary">✓ 核心实体已对齐</p>
|
|
||||||
<p className="font-semibold text-primary">✓ 逻辑勾稽关系校验通过</p>
|
|
||||||
<p className="font-semibold text-primary">✓ 格式符合模板规范</p>
|
|
||||||
</div>
|
|
||||||
</div>
|
|
||||||
</div>
|
|
||||||
</div>
|
|
||||||
</ScrollArea>
|
|
||||||
<DialogFooter className="p-8 pt-4 border-t border-dashed">
|
|
||||||
<Button variant="outline" className="rounded-xl" onClick={() => setViewingTask(null)}>关闭</Button>
|
|
||||||
<Button className="rounded-xl px-8 gap-2 shadow-lg shadow-primary/20" onClick={() => toast.success("正在导出文件...")}>
|
|
||||||
<Download size={18} />
|
|
||||||
导出为 {viewingTask?.templates?.type?.toUpperCase() || '文件'}
|
|
||||||
</Button>
|
|
||||||
</DialogFooter>
|
|
||||||
</DialogContent>
|
|
||||||
</Dialog>
|
|
||||||
</div>
|
|
||||||
);
|
|
||||||
};
|
|
||||||
|
|
||||||
export default FormFill;
|
|
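A note on `formatFileSize` in the removed `FormFill` component above: the unit index comes straight from `Math.floor(Math.log(bytes) / Math.log(k))`, so a value of 1 TiB or more indexes past the end of `sizes`, and a value below 1 produces a negative index. If the helper is reused elsewhere, a clamped variant may be safer. The following is a minimal sketch under that assumption; the name `formatFileSizeSafe` is hypothetical and not part of the diff.

```typescript
// Hypothetical hardened variant of the formatFileSize helper from the diff.
function formatFileSizeSafe(bytes: number): string {
  // Treat zero, negative, NaN and Infinity as an empty size.
  if (!Number.isFinite(bytes) || bytes <= 0) return '0 B';
  const k = 1024;
  const sizes = ['B', 'KB', 'MB', 'GB'];
  const raw = Math.floor(Math.log(bytes) / Math.log(k));
  // Clamp so that very small or very large inputs still map to a valid unit.
  const i = Math.min(Math.max(raw, 0), sizes.length - 1);
  return `${(bytes / Math.pow(k, i)).toFixed(2)} ${sizes[i]}`;
}
```

For example, `formatFileSizeSafe(1536)` yields `1.50 KB`, and anything at or above 1 TiB is reported in GB rather than with an `undefined` unit.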
||||||
@@ -10,7 +10,11 @@ import {
|
|||||||
TableProperties,
|
TableProperties,
|
||||||
ChevronRight,
|
ChevronRight,
|
||||||
ArrowRight,
|
ArrowRight,
|
||||||
Loader2
|
Loader2,
|
||||||
|
Download,
|
||||||
|
Search,
|
||||||
|
MessageSquare,
|
||||||
|
CheckCircle
|
||||||
} from 'lucide-react';
|
} from 'lucide-react';
|
||||||
import { Button } from '@/components/ui/button';
|
import { Button } from '@/components/ui/button';
|
||||||
import { Input } from '@/components/ui/input';
|
import { Input } from '@/components/ui/input';
|
||||||
@@ -26,12 +30,15 @@ type ChatMessage = {
|
|||||||
role: 'user' | 'assistant';
|
role: 'user' | 'assistant';
|
||||||
content: string;
|
content: string;
|
||||||
created_at: string;
|
created_at: string;
|
||||||
|
intent?: string;
|
||||||
|
result?: any;
|
||||||
};
|
};
|
||||||
|
|
||||||
const InstructionChat: React.FC = () => {
|
const InstructionChat: React.FC = () => {
|
||||||
const [messages, setMessages] = useState<ChatMessage[]>([]);
|
const [messages, setMessages] = useState<ChatMessage[]>([]);
|
||||||
const [input, setInput] = useState('');
|
const [input, setInput] = useState('');
|
||||||
const [loading, setLoading] = useState(false);
|
const [loading, setLoading] = useState(false);
|
||||||
|
const [currentDocIds, setCurrentDocIds] = useState<string[]>([]);
|
||||||
const scrollAreaRef = useRef<HTMLDivElement>(null);
|
const scrollAreaRef = useRef<HTMLDivElement>(null);
|
||||||
|
|
||||||
useEffect(() => {
|
useEffect(() => {
|
||||||
@@ -43,27 +50,47 @@ const InstructionChat: React.FC = () => {
|
|||||||
role: 'assistant',
|
role: 'assistant',
|
||||||
content: `您好!我是智联文档 AI 助手。
|
content: `您好!我是智联文档 AI 助手。
|
||||||
|
|
||||||
我可以帮您完成以下操作:
|
**📄 文档智能操作**
|
||||||
|
- "提取文档中的医院数量和床位数"
|
||||||
|
- "帮我找出所有机构的名称"
|
||||||
|
|
||||||
📄 **文档管理**
|
**📊 数据填表**
|
||||||
- "帮我列出最近上传的所有文档"
|
- "根据这些数据填表"
|
||||||
- "删除三天前的 docx 文档"
|
- "将提取的信息填写到Excel模板"
|
||||||
|
|
||||||
📊 **Excel 分析**
|
**📝 内容处理**
|
||||||
- "分析一下最近上传的 Excel 文件"
|
- "总结一下这份文档"
|
||||||
- "帮我统计销售报表中的数据"
|
- "对比这两个文档的差异"
|
||||||
|
|
||||||
📝 **智能填表**
|
**🔍 智能问答**
|
||||||
- "根据员工信息表创建一个考勤汇总表"
|
- "文档里说了些什么?"
|
||||||
- "用财务文档填充报销模板"
|
- "有多少家医院?"
|
||||||
|
|
||||||
请告诉我您想做什么?`,
|
请告诉我您想做什么?`,
|
||||||
created_at: new Date().toISOString()
|
created_at: new Date().toISOString()
|
||||||
}
|
}
|
||||||
]);
|
]);
|
||||||
|
|
||||||
|
// Fetch the list of uploaded document IDs
|
||||||
|
loadDocuments();
|
||||||
}
|
}
|
||||||
}, []);
|
}, []);
|
||||||
|
|
||||||
|
const loadDocuments = async () => {
|
||||||
|
try {
|
||||||
|
const result = await backendApi.getDocuments(undefined, 50);
|
||||||
|
if (result.success && result.documents) {
|
||||||
|
const docIds = result.documents.map((d: any) => d.doc_id);
|
||||||
|
setCurrentDocIds(docIds);
|
||||||
|
if (docIds.length > 0) {
|
||||||
|
console.log(`已加载 ${docIds.length} 个文档`);
|
||||||
|
}
|
||||||
|
}
|
||||||
|
} catch (err) {
|
||||||
|
console.error('获取文档列表失败:', err);
|
||||||
|
}
|
||||||
|
};
|
||||||
|
|
||||||
useEffect(() => {
|
useEffect(() => {
|
||||||
// Scroll to bottom
|
// Scroll to bottom
|
||||||
if (scrollAreaRef.current) {
|
if (scrollAreaRef.current) {
|
||||||
@@ -89,95 +116,126 @@ const InstructionChat: React.FC = () => {
 setLoading(true);
 
 try {
-// TODO: 后端对话接口,暂用模拟响应
-await new Promise(resolve => setTimeout(resolve, 1500));
+// 使用真实的智能指令 API
+const response = await backendApi.instructionChat(
+input.trim(),
+currentDocIds.length > 0 ? currentDocIds : undefined
+);
 
-// 简单的命令解析演示
-const userInput = userMessage.content.toLowerCase();
-let response = '';
+// 根据意图类型生成友好响应
+let responseContent = '';
+const resultData = response.result;
 
-if (userInput.includes('列出') || userInput.includes('列表')) {
-const result = await backendApi.getDocuments(undefined, 10);
-if (result.success && result.documents && result.documents.length > 0) {
-response = `已为您找到 ${result.documents.length} 个文档:\n\n`;
-result.documents.slice(0, 5).forEach((doc: any, idx: number) => {
-response += `${idx + 1}. **${doc.original_filename}** (${doc.doc_type.toUpperCase()})\n`;
-response += ` - 大小: ${(doc.file_size / 1024).toFixed(1)} KB\n`;
-response += ` - 时间: ${new Date(doc.created_at).toLocaleDateString()}\n\n`;
-});
-if (result.documents.length > 5) {
-response += `...还有 ${result.documents.length - 5} 个文档`;
+switch (response.intent) {
+case 'extract':
+// 信息提取结果
+const extracted = resultData?.extracted_data || {};
+const keys = Object.keys(extracted);
+if (keys.length > 0) {
+responseContent = `✅ 已提取到 ${keys.length} 个字段的数据:\n\n`;
+for (const [key, value] of Object.entries(extracted)) {
+const values = Array.isArray(value) ? value : [value];
+responseContent += `**${key}**: ${values.slice(0, 3).join(', ')}${values.length > 3 ? '...' : ''}\n`;
+}
+responseContent += `\n💡 您可以将这些数据填入表格。`;
+} else {
+responseContent = '未能从文档中提取到相关数据。请尝试更明确的字段名称。';
 }
-} else {
-response = '暂未找到已上传的文档,您可以先上传一些文档试试。';
-}
-} else if (userInput.includes('分析') || userInput.includes('excel') || userInput.includes('报表')) {
-response = `好的,我可以帮您分析 Excel 文件。
+break;
 
-请告诉我:
-1. 您想分析哪个 Excel 文件?
-2. 需要什么样的分析?(数据摘要/统计分析/图表生成)
+case 'fill_table':
+// 填表结果
+const filled = resultData?.result?.filled_data || {};
+const filledKeys = Object.keys(filled);
+if (filledKeys.length > 0) {
+responseContent = `✅ 填表完成!成功填写 ${filledKeys.length} 个字段:\n\n`;
+for (const [key, value] of Object.entries(filled)) {
+const values = Array.isArray(value) ? value : [value];
+responseContent += `**${key}**: ${values.slice(0, 3).join(', ')}\n`;
+}
+responseContent += `\n📋 请到【智能填表】页面查看或导出结果。`;
+} else {
+responseContent = '填表未能提取到数据。请检查模板表头和数据源内容。';
+}
+break;
 
-或者您可以直接告诉我您想从数据中了解什么,我来为您生成分析。`;
-} else if (userInput.includes('填表') || userInput.includes('模板')) {
-response = `好的,要进行智能填表,我需要:
+case 'summarize':
+// 摘要结果
+const summaries = resultData?.summaries || [];
+if (summaries.length > 0) {
+responseContent = `📄 找到 ${summaries.length} 个文档的摘要:\n\n`;
+summaries.forEach((s: any, idx: number) => {
+responseContent += `**${idx + 1}. ${s.filename}**\n${s.content_preview}\n\n`;
+});
+} else {
+responseContent = '未能生成摘要。请确保已上传文档。';
+}
+break;
 
-1. **上传表格模板** - 您要填写的表格模板文件(Excel 或 Word 格式)
-2. **选择数据源** - 包含要填写内容的源文档
+case 'question':
+// 问答结果
+if (resultData?.answer) {
+responseContent = `**问题**: ${resultData.question}\n\n**答案**: ${resultData.answer}`;
+} else {
+responseContent = resultData?.message || '我找到了相关信息,请查看上文。';
+}
+break;
 
-您可以去【智能填表】页面完成这些操作,或者告诉我您具体想填什么类型的表格,我来指导您操作。`;
-} else if (userInput.includes('删除')) {
-response = `要删除文档,请告诉我:
+case 'search':
+// 搜索结果
+const searchResults = resultData?.results || [];
+if (searchResults.length > 0) {
+responseContent = `🔍 找到 ${searchResults.length} 条相关内容:\n\n`;
+searchResults.slice(0, 5).forEach((r: any, idx: number) => {
+responseContent += `**${idx + 1}.** ${r.content?.substring(0, 100)}...\n\n`;
+});
+} else {
+responseContent = '未找到相关内容。请尝试其他关键词。';
+}
+break;
 
-- 要删除的文件名是什么?
-- 或者您可以到【文档中心】页面手动选择并删除文档
+case 'compare':
+// 对比结果
+const comparison = resultData?.comparison || [];
+if (comparison.length > 0) {
+responseContent = `📊 对比了 ${comparison.length} 个文档:\n\n`;
+comparison.forEach((c: any) => {
+responseContent += `- **${c.filename}**: ${c.doc_type}, ${c.content_length} 字\n`;
+});
+} else {
+responseContent = '需要至少2个文档才能进行对比。';
+}
+break;
 
-⚠️ 删除操作不可恢复,请确认后再操作。`;
-} else if (userInput.includes('帮助') || userInput.includes('help')) {
-response = `**我可以帮您完成以下操作:**
+case 'unknown':
+responseContent = `我理解您想要: "${input.trim()}"\n\n但我目前无法完成此操作。您可以尝试:\n\n1. **提取数据**: "提取医院数量和床位数"\n2. **填表**: "根据这些数据填表"\n3. **总结**: "总结这份文档"\n4. **问答**: "文档里说了什么?"\n5. **搜索**: "搜索相关内容"`;
+break;
 
-📄 **文档管理**
-- 列出/搜索已上传的文档
-- 查看文档详情和元数据
-- 删除不需要的文档
-
-📊 **Excel 处理**
-- 分析 Excel 文件内容
-- 生成数据统计和图表
-- 导出处理后的数据
-
-📝 **智能填表**
-- 上传表格模板
-- 从文档中提取信息填入模板
-- 导出填写完成的表格
-
-📋 **任务历史**
-- 查看历史处理任务
-- 重新执行或导出结果
-
-请直接告诉我您想做什么!`;
-} else {
-response = `我理解您想要: "${input.trim()}"
-
-目前我还在学习如何更好地理解您的需求。您可以尝试:
-
-1. **上传文档** - 去【文档中心】上传 docx/md/txt 文件
-2. **分析 Excel** - 去【Excel解析】上传并分析 Excel 文件
-3. **智能填表** - 去【智能填表】创建填表任务
-
-或者您可以更具体地描述您想做的事情,我会尽力帮助您!`;
+default:
+responseContent = response.message || resultData?.message || '已完成您的请求。';
 }
 
 const assistantMessage: ChatMessage = {
 id: Math.random().toString(36).substring(7),
 role: 'assistant',
-content: response,
-created_at: new Date().toISOString()
+content: responseContent,
+created_at: new Date().toISOString(),
+intent: response.intent,
+result: resultData
 };
 
 setMessages(prev => [...prev, assistantMessage]);
 } catch (err: any) {
-toast.error('请求失败,请重试');
+console.error('指令执行失败:', err);
+toast.error(err.message || '请求失败,请重试');
+
+const errorMessage: ChatMessage = {
+id: Math.random().toString(36).substring(7),
+role: 'assistant',
+content: `抱歉,处理您的请求时遇到了问题:${err.message}\n\n请稍后重试,或尝试更简单的指令。`,
+created_at: new Date().toISOString()
+};
+setMessages(prev => [...prev, errorMessage]);
 } finally {
 setLoading(false);
 }
@@ -189,10 +247,10 @@ const InstructionChat: React.FC = () => {
 };
 
 const quickActions = [
-{ label: '列出所有文档', icon: FileText, action: () => setInput('列出所有已上传的文档') },
-{ label: '分析 Excel 数据', icon: TableProperties, action: () => setInput('分析一下 Excel 文件') },
-{ label: '智能填表', icon: Sparkles, action: () => setInput('我想进行智能填表') },
-{ label: '帮助', icon: Sparkles, action: () => setInput('帮助') }
+{ label: '提取医院数量', icon: Search, action: () => setInput('提取文档中的医院数量和床位数') },
+{ label: '智能填表', icon: TableProperties, action: () => setInput('根据这些数据填表') },
+{ label: '总结文档', icon: MessageSquare, action: () => setInput('总结一下这份文档') },
+{ label: '智能问答', icon: Bot, action: () => setInput('文档里说了些什么?') }
 ];
 
 return (
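The `extract` branch in the InstructionChat changes above builds its reply by listing each extracted field with at most three values. A minimal TypeScript sketch of that formatting step follows, pulled out as a pure function so it can be checked in isolation. The name `formatExtracted`, the `maxValues` parameter, and the English message texts are assumptions made for illustration; none of them appear in the diff.

```typescript
// Hypothetical helper, not part of the diff: it mirrors how the new
// `extract` branch turns `extracted_data` into a chat message.
type ExtractedData = Record<string, unknown>;

function formatExtracted(extracted: ExtractedData, maxValues = 3): string {
  const entries = Object.entries(extracted);
  if (entries.length === 0) {
    // Placeholder English text; the component itself shows a Chinese message.
    return 'No matching data could be extracted.';
  }

  const lines = entries.map(([key, value]) => {
    // A field may hold a single value or a list; normalise to a list first.
    const values: unknown[] = Array.isArray(value) ? value : [value];
    const shown = values.slice(0, maxValues).join(', ');
    const ellipsis = values.length > maxValues ? '...' : '';
    return `**${key}**: ${shown}${ellipsis}`;
  });

  return `Extracted ${entries.length} field(s):\n\n${lines.join('\n')}`;
}
```

Keeping the formatting separate from the `switch` would also let the `fill_table` branch reuse it, since that branch applies the same "first three values" rule.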
@@ -1,184 +0,0 @@
-import React, { useState } from 'react';
-import { useNavigate, useLocation } from 'react-router-dom';
-import { useAuth } from '@/context/AuthContext';
-import { Button } from '@/components/ui/button';
-import { Input } from '@/components/ui/input';
-import { Label } from '@/components/ui/label';
-import { Card, CardContent, CardDescription, CardFooter, CardHeader, CardTitle } from '@/components/ui/card';
-import { Tabs, TabsContent, TabsList, TabsTrigger } from '@/components/ui/tabs';
-import { FileText, Lock, User, CheckCircle2, AlertCircle } from 'lucide-react';
-import { toast } from 'sonner';
-
-const Login: React.FC = () => {
-const [username, setUsername] = useState('');
-const [password, setPassword] = useState('');
-const [loading, setLoading] = useState(false);
-const { signIn, signUp } = useAuth();
-const navigate = useNavigate();
-const location = useLocation();
-
-const handleLogin = async (e: React.FormEvent) => {
-e.preventDefault();
-if (!username || !password) return toast.error('请输入用户名和密码');
-
-setLoading(true);
-try {
-const email = `${username}@miaoda.com`;
-const { error } = await signIn(email, password);
-if (error) throw error;
-toast.success('登录成功');
-navigate('/');
-} catch (err: any) {
-toast.error(err.message || '登录失败');
-} finally {
-setLoading(false);
-}
-};
-
-const handleSignUp = async (e: React.FormEvent) => {
-e.preventDefault();
-if (!username || !password) return toast.error('请输入用户名和密码');
-
-setLoading(true);
-try {
-const email = `${username}@miaoda.com`;
-const { error } = await signUp(email, password);
-if (error) throw error;
-toast.success('注册成功,请登录');
-} catch (err: any) {
-toast.error(err.message || '注册失败');
-} finally {
-setLoading(false);
-}
-};
-
-return (
-<div className="min-h-screen flex items-center justify-center bg-[radial-gradient(ellipse_at_top_left,_var(--tw-gradient-stops))] from-primary/10 via-background to-background p-4 relative overflow-hidden">
-{/* Decorative elements */}
-<div className="absolute top-0 left-0 w-96 h-96 bg-primary/5 rounded-full blur-3xl -translate-x-1/2 -translate-y-1/2" />
-<div className="absolute bottom-0 right-0 w-64 h-64 bg-primary/5 rounded-full blur-3xl translate-x-1/3 translate-y-1/3" />
-
-<div className="w-full max-w-md space-y-8 relative animate-fade-in">
-<div className="text-center space-y-2">
-<div className="inline-flex items-center justify-center w-16 h-16 rounded-2xl bg-primary text-primary-foreground shadow-2xl shadow-primary/30 mb-4 animate-slide-in">
-<FileText size={32} />
-</div>
-<h1 className="text-4xl font-extrabold tracking-tight gradient-text">智联文档</h1>
-<p className="text-muted-foreground">多源数据融合与智能文档处理系统</p>
-</div>
-
-<Card className="border-border/50 shadow-2xl backdrop-blur-sm bg-card/95">
-<Tabs defaultValue="login" className="w-full">
-<TabsList className="grid w-full grid-cols-2 rounded-t-xl h-12 bg-muted/50 p-1">
-<TabsTrigger value="login" className="rounded-lg data-[state=active]:bg-background data-[state=active]:shadow-sm">登录</TabsTrigger>
-<TabsTrigger value="signup" className="rounded-lg data-[state=active]:bg-background data-[state=active]:shadow-sm">注册</TabsTrigger>
-</TabsList>
-
-<TabsContent value="login">
-<form onSubmit={handleLogin}>
-<CardHeader>
-<CardTitle>欢迎回来</CardTitle>
-<CardDescription>使用您的账号登录智联文档系统</CardDescription>
-</CardHeader>
-<CardContent className="space-y-4">
-<div className="space-y-2">
-<Label htmlFor="username">用户名</Label>
-<div className="relative">
-<User className="absolute left-3 top-2.5 h-4 w-4 text-muted-foreground" />
-<Input
-id="username"
-placeholder="请输入用户名"
-className="pl-9 bg-muted/30 border-none focus-visible:ring-primary"
-value={username}
-onChange={(e) => setUsername(e.target.value)}
-/>
-</div>
-</div>
-<div className="space-y-2">
-<Label htmlFor="password">密码</Label>
-<div className="relative">
-<Lock className="absolute left-3 top-2.5 h-4 w-4 text-muted-foreground" />
-<Input
-id="password"
-type="password"
-placeholder="请输入密码"
-className="pl-9 bg-muted/30 border-none focus-visible:ring-primary"
-value={password}
-onChange={(e) => setPassword(e.target.value)}
-/>
-</div>
-</div>
-</CardContent>
-<CardFooter>
-<Button className="w-full h-11 text-lg font-semibold rounded-xl" type="submit" disabled={loading}>
-{loading ? '登录中...' : '立即登录'}
-</Button>
-</CardFooter>
-</form>
-</TabsContent>
-
-<TabsContent value="signup">
-<form onSubmit={handleSignUp}>
-<CardHeader>
-<CardTitle>创建账号</CardTitle>
-<CardDescription>开启智能文档处理的新体验</CardDescription>
-</CardHeader>
-<CardContent className="space-y-4">
-<div className="space-y-2">
-<Label htmlFor="signup-username">用户名</Label>
-<div className="relative">
-<User className="absolute left-3 top-2.5 h-4 w-4 text-muted-foreground" />
-<Input
-id="signup-username"
-placeholder="仅字母、数字和下划线"
-className="pl-9 bg-muted/30 border-none focus-visible:ring-primary"
-value={username}
-onChange={(e) => setUsername(e.target.value)}
-/>
-</div>
-</div>
-<div className="space-y-2">
-<Label htmlFor="signup-password">密码</Label>
-<div className="relative">
-<Lock className="absolute left-3 top-2.5 h-4 w-4 text-muted-foreground" />
-<Input
-id="signup-password"
-type="password"
-placeholder="不少于 6 位"
-className="pl-9 bg-muted/30 border-none focus-visible:ring-primary"
-value={password}
-onChange={(e) => setPassword(e.target.value)}
-/>
-</div>
-</div>
-</CardContent>
-<CardFooter>
-<Button className="w-full h-11 text-lg font-semibold rounded-xl" type="submit" disabled={loading}>
-{loading ? '注册中...' : '注册账号'}
-</Button>
-</CardFooter>
-</form>
-</TabsContent>
-</Tabs>
-</Card>
-
-<div className="grid grid-cols-2 gap-4 text-center text-xs text-muted-foreground">
-<div className="flex flex-col items-center gap-1">
-<CheckCircle2 size={16} className="text-primary" />
-<span>智能解析</span>
-</div>
-<div className="flex flex-col items-center gap-1">
-<CheckCircle2 size={16} className="text-primary" />
-<span>极速填表</span>
-</div>
-</div>
-
-<div className="text-center text-sm text-muted-foreground">
-© 2026 智联文档 | 多源数据融合系统
-</div>
-</div>
-</div>
-);
-};
-
-export default Login;
@@ -1,16 +0,0 @@
-/**
-* Sample Page
-*/
-
-import PageMeta from "../components/common/PageMeta";
-
-export default function SamplePage() {
-return (
-<>
-<PageMeta title="Home" description="Home Page Introduction" />
-<div>
-<h3>This is a sample page</h3>
-</div>
-</>
-);
-}
@@ -11,7 +11,8 @@ import {
 ChevronDown,
 ChevronUp,
 Trash2,
-AlertCircle
+AlertCircle,
+HelpCircle
 } from 'lucide-react';
 import { Card, CardContent, CardHeader, CardTitle, CardDescription } from '@/components/ui/card';
 import { Button } from '@/components/ui/button';
@@ -24,9 +25,9 @@ import { Skeleton } from '@/components/ui/skeleton';
 
 type Task = {
 task_id: string;
-status: 'pending' | 'processing' | 'success' | 'failure';
+status: 'pending' | 'processing' | 'success' | 'failure' | 'unknown';
 created_at: string;
-completed_at?: string;
+updated_at?: string;
 message?: string;
 result?: any;
 error?: string;
@@ -38,54 +39,38 @@ const TaskHistory: React.FC = () => {
 const [loading, setLoading] = useState(true);
 const [expandedTask, setExpandedTask] = useState<string | null>(null);
 
-// Mock data for demonstration
-useEffect(() => {
-// 模拟任务数据,实际应该从后端获取
-setTasks([
-{
-task_id: 'task-001',
-status: 'success',
-created_at: new Date(Date.now() - 3600000).toISOString(),
-completed_at: new Date(Date.now() - 3500000).toISOString(),
-task_type: 'document_parse',
-message: '文档解析完成',
-result: {
-doc_id: 'doc-001',
-filename: 'report_q1_2026.docx',
-extracted_fields: ['标题', '作者', '日期', '金额']
-}
-},
-{
-task_id: 'task-002',
-status: 'success',
-created_at: new Date(Date.now() - 7200000).toISOString(),
-completed_at: new Date(Date.now() - 7100000).toISOString(),
-task_type: 'excel_analysis',
-message: 'Excel 分析完成',
-result: {
-filename: 'sales_data.xlsx',
-row_count: 1250,
-charts_generated: 3
-}
-},
-{
-task_id: 'task-003',
-status: 'processing',
-created_at: new Date(Date.now() - 600000).toISOString(),
-task_type: 'template_fill',
-message: '正在填充表格...'
-},
-{
-task_id: 'task-004',
-status: 'failure',
-created_at: new Date(Date.now() - 86400000).toISOString(),
-completed_at: new Date(Date.now() - 86390000).toISOString(),
-task_type: 'document_parse',
-message: '解析失败',
-error: '文件格式不支持或文件已损坏'
+// 获取任务历史数据
+const fetchTasks = async () => {
+try {
+setLoading(true);
+const response = await backendApi.getTasks(50, 0);
+if (response.success && response.tasks) {
+// 转换后端数据格式为前端格式
+const convertedTasks: Task[] = response.tasks.map((t: any) => ({
+task_id: t.task_id,
+status: t.status || 'unknown',
+created_at: t.created_at || new Date().toISOString(),
+updated_at: t.updated_at,
+message: t.message || '',
+result: t.result,
+error: t.error,
+task_type: t.task_type || 'document_parse'
+}));
+setTasks(convertedTasks);
+} else {
+setTasks([]);
 }
-]);
-setLoading(false);
+} catch (error) {
+console.error('获取任务列表失败:', error);
+toast.error('获取任务列表失败');
+setTasks([]);
+} finally {
+setLoading(false);
+}
+};
+
+useEffect(() => {
+fetchTasks();
 }, []);
 
 const getStatusBadge = (status: string) => {
@@ -96,6 +81,8 @@ const TaskHistory: React.FC = () => {
 return <Badge className="bg-destructive text-white text-[10px]"><XCircle size={12} className="mr-1" />失败</Badge>;
 case 'processing':
 return <Badge className="bg-amber-500 text-white text-[10px]"><Loader2 size={12} className="mr-1 animate-spin" />处理中</Badge>;
+case 'unknown':
+return <Badge className="bg-gray-500 text-white text-[10px]"><HelpCircle size={12} className="mr-1" />未知</Badge>;
 default:
 return <Badge className="bg-gray-500 text-white text-[10px]"><Clock size={12} className="mr-1" />等待</Badge>;
 }
@@ -133,15 +120,22 @@ const TaskHistory: React.FC = () => {
 };
 
 const handleDelete = async (taskId: string) => {
-setTasks(prev => prev.filter(t => t.task_id !== taskId));
-toast.success('任务已删除');
+try {
+await backendApi.deleteTask(taskId);
+setTasks(prev => prev.filter(t => t.task_id !== taskId));
+toast.success('任务已删除');
+} catch (error) {
+console.error('删除任务失败:', error);
+toast.error('删除任务失败');
+}
 };
 
 const stats = {
 total: tasks.length,
 success: tasks.filter(t => t.status === 'success').length,
 processing: tasks.filter(t => t.status === 'processing').length,
-failure: tasks.filter(t => t.status === 'failure').length
+failure: tasks.filter(t => t.status === 'failure').length,
+unknown: tasks.filter(t => t.status === 'unknown').length
 };
 
 return (
@@ -151,7 +145,7 @@ const TaskHistory: React.FC = () => {
 <h1 className="text-3xl font-extrabold tracking-tight">任务历史</h1>
 <p className="text-muted-foreground">查看和管理您所有的文档处理任务记录</p>
 </div>
-<Button variant="outline" className="rounded-xl gap-2" onClick={() => window.location.reload()}>
+<Button variant="outline" className="rounded-xl gap-2" onClick={() => fetchTasks()}>
 <RefreshCcw size={18} />
 <span>刷新</span>
 </Button>
@@ -194,7 +188,8 @@ const TaskHistory: React.FC = () => {
 "w-12 h-12 rounded-xl flex items-center justify-center shrink-0",
 task.status === 'success' ? "bg-emerald-500/10 text-emerald-500" :
 task.status === 'failure' ? "bg-destructive/10 text-destructive" :
-"bg-amber-500/10 text-amber-500"
+task.status === 'processing' ? "bg-amber-500/10 text-amber-500" :
+"bg-gray-500/10 text-gray-500"
 )}>
 {task.status === 'processing' ? (
 <Loader2 size={24} className="animate-spin" />
@@ -212,16 +207,16 @@ const TaskHistory: React.FC = () => {
 </Badge>
 </div>
 <p className="text-sm text-muted-foreground">
-{task.message || '任务执行中...'}
+{task.message || (task.status === 'unknown' ? '无法获取状态' : '任务执行中...')}
 </p>
 <div className="flex items-center gap-4 text-xs text-muted-foreground">
 <span className="flex items-center gap-1">
 <Clock size={12} />
-{format(new Date(task.created_at), 'yyyy-MM-dd HH:mm:ss')}
+{task.created_at ? format(new Date(task.created_at), 'yyyy-MM-dd HH:mm:ss') : '时间未知'}
 </span>
-{task.completed_at && (
+{task.updated_at && task.status !== 'processing' && (
 <span>
-耗时: {Math.round((new Date(task.completed_at).getTime() - new Date(task.created_at).getTime()) / 1000)} 秒
+更新: {format(new Date(task.updated_at), 'HH:mm:ss')}
 </span>
 )}
 </div>
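The TaskHistory changes above map backend records to the front-end `Task` shape with `t.status || 'unknown'`, which still lets any unexpected non-empty status string through. As a hedged variation (not what the diff does), the sketch below validates the status against the known set before accepting it. `normalizeTask`, `isKnownStatus`, and the injectable `now` clock are hypothetical names introduced only for this example.

```typescript
// Hypothetical sketch, not part of the diff: a stricter version of the
// backend-to-frontend task mapping shown in fetchTasks.
type TaskStatus = 'pending' | 'processing' | 'success' | 'failure' | 'unknown';

interface Task {
  task_id: string;
  status: TaskStatus;
  created_at: string;
  updated_at?: string;
  message: string;
  task_type: string;
}

const KNOWN_STATUSES: readonly string[] = ['pending', 'processing', 'success', 'failure'];

function isKnownStatus(value: unknown): value is TaskStatus {
  return typeof value === 'string' && KNOWN_STATUSES.indexOf(value) !== -1;
}

function normalizeTask(
  raw: Record<string, unknown>,
  now: () => string = () => new Date().toISOString(),
): Task {
  const status = raw.status;
  const createdAt = raw.created_at;
  const updatedAt = raw.updated_at;
  const message = raw.message;
  const taskType = raw.task_type;

  return {
    task_id: String(raw.task_id),
    // Anything outside the known set is shown as 'unknown' instead of
    // falling through to the "waiting" badge.
    status: isKnownStatus(status) ? status : 'unknown',
    created_at: typeof createdAt === 'string' ? createdAt : now(),
    updated_at: typeof updatedAt === 'string' ? updatedAt : undefined,
    message: typeof message === 'string' ? message : '',
    // Same default task type as the diff uses.
    task_type: typeof taskType === 'string' ? taskType : 'document_parse',
  };
}
```

With a helper like this, the mapping inside `fetchTasks` could shrink to `response.tasks.map(t => normalizeTask(t))`, assuming the backend returns plain JSON objects.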
@@ -1,4 +1,4 @@
-import React, { useState, useEffect } from 'react';
+import React, { useState, useEffect, useCallback, useRef } from 'react';
 import { useDropzone } from 'react-dropzone';
 import {
 TableProperties,
@@ -14,7 +14,12 @@ import {
 RefreshCcw,
 ChevronDown,
 ChevronUp,
-Loader2
+Loader2,
+Files,
+Trash2,
+Eye,
+File,
+Plus
 } from 'lucide-react';
 import { Button } from '@/components/ui/button';
 import { Card, CardContent, CardHeader, CardTitle, CardDescription } from '@/components/ui/card';
@@ -26,6 +31,14 @@ import { format } from 'date-fns';
 import { toast } from 'sonner';
 import { cn } from '@/lib/utils';
 import { Skeleton } from '@/components/ui/skeleton';
+import {
+Dialog,
+DialogContent,
+DialogHeader,
+DialogTitle,
+} from "@/components/ui/dialog";
+import { ScrollArea } from '@/components/ui/scroll-area';
+import { useTemplateFill } from '@/context/TemplateFillContext';
 
 type DocumentItem = {
 doc_id: string;
@@ -41,73 +54,34 @@ type DocumentItem = {
 };
 };
 
-type TemplateField = {
-cell: string;
-name: string;
-field_type: string;
-required: boolean;
-hint?: string;
-};
-
 const TemplateFill: React.FC = () => {
-const [step, setStep] = useState<'upload-template' | 'select-source' | 'preview' | 'filling'>('upload-template');
-const [templateFile, setTemplateFile] = useState<File | null>(null);
-const [templateFields, setTemplateFields] = useState<TemplateField[]>([]);
-const [sourceDocs, setSourceDocs] = useState<DocumentItem[]>([]);
-const [selectedDocs, setSelectedDocs] = useState<string[]>([]);
+const {
+step, setStep,
+templateFile, setTemplateFile,
+templateFields, setTemplateFields,
+sourceFiles, setSourceFiles, addSourceFiles, removeSourceFile,
+sourceFilePaths, setSourceFilePaths,
+sourceDocIds, setSourceDocIds, addSourceDocId, removeSourceDocId,
+templateId, setTemplateId,
+filledResult, setFilledResult,
+reset
+} = useTemplateFill();
+
 const [loading, setLoading] = useState(false);
-const [filling, setFilling] = useState(false);
-const [filledResult, setFilledResult] = useState<any>(null);
+const [previewDoc, setPreviewDoc] = useState<{ name: string; content: string } | null>(null);
+const [previewOpen, setPreviewOpen] = useState(false);
+const [sourceMode, setSourceMode] = useState<'upload' | 'select'>('upload');
+const [uploadedDocuments, setUploadedDocuments] = useState<DocumentItem[]>([]);
+const [docsLoading, setDocsLoading] = useState(false);
+const sourceFileInputRef = useRef<HTMLInputElement>(null);
 
-// Load available source documents
-useEffect(() => {
-loadSourceDocuments();
-}, []);
-
-const loadSourceDocuments = async () => {
-setLoading(true);
-try {
-const result = await backendApi.getDocuments(undefined, 100);
-if (result.success) {
-// Filter to only non-Excel documents that can be used as data sources
-const docs = (result.documents || []).filter((d: DocumentItem) =>
-['docx', 'md', 'txt', 'xlsx'].includes(d.doc_type)
-);
-setSourceDocs(docs);
-}
-} catch (err: any) {
-toast.error('加载数据源失败');
-} finally {
-setLoading(false);
-}
-};
-
-const onTemplateDrop = async (acceptedFiles: File[]) => {
+// 模板拖拽
+const onTemplateDrop = useCallback((acceptedFiles: File[]) => {
 const file = acceptedFiles[0];
-if (!file) return;
-
-const ext = file.name.split('.').pop()?.toLowerCase();
-if (!['xlsx', 'xls', 'docx'].includes(ext || '')) {
-toast.error('仅支持 xlsx/xls/docx 格式的模板文件');
-return;
+if (file) {
+setTemplateFile(file);
 }
-
-setTemplateFile(file);
-setLoading(true);
-
-try {
-const result = await backendApi.uploadTemplate(file);
-if (result.success) {
-setTemplateFields(result.fields || []);
-setStep('select-source');
-toast.success('模板上传成功');
-}
-} catch (err: any) {
-toast.error('模板上传失败: ' + (err.message || '未知错误'));
-} finally {
-setLoading(false);
-}
-};
+}, []);
 
 const { getRootProps: getTemplateProps, getInputProps: getTemplateInputProps, isDragActive: isTemplateDragActive } = useDropzone({
 onDrop: onTemplateDrop,
@@ -116,33 +90,157 @@ const TemplateFill: React.FC = () => {
|
|||||||
'application/vnd.ms-excel': ['.xls'],
|
'application/vnd.ms-excel': ['.xls'],
|
||||||
'application/vnd.openxmlformats-officedocument.wordprocessingml.document': ['.docx']
|
'application/vnd.openxmlformats-officedocument.wordprocessingml.document': ['.docx']
|
||||||
},
|
},
|
||||||
maxFiles: 1
|
maxFiles: 1,
|
||||||
|
multiple: false
|
||||||
});
|
});
|
||||||
|
|
||||||
const handleFillTemplate = async () => {
|
// Source document drag-and-drop
|
||||||
if (!templateFile || selectedDocs.length === 0) {
|
const onSourceDrop = useCallback((e: React.DragEvent) => {
|
||||||
toast.error('请选择数据源文档');
|
e.preventDefault();
|
||||||
|
const files = Array.from(e.dataTransfer.files).filter(f => {
|
||||||
|
const ext = f.name.split('.').pop()?.toLowerCase();
|
||||||
|
return ['xlsx', 'xls', 'docx', 'md', 'txt'].includes(ext || '');
|
||||||
|
});
|
||||||
|
if (files.length > 0) {
|
||||||
|
addSourceFiles(files.map(f => ({ file: f })));
|
||||||
|
}
|
||||||
|
}, [addSourceFiles]);
|
||||||
|
|
||||||
|
const handleSourceFileSelect = (e: React.ChangeEvent<HTMLInputElement>) => {
|
||||||
|
const files = Array.from(e.target.files || []);
|
||||||
|
if (files.length > 0) {
|
||||||
|
addSourceFiles(files.map(f => ({ file: f })));
|
||||||
|
toast.success(`已添加 ${files.length} 个文件`);
|
||||||
|
}
|
||||||
|
e.target.value = '';
|
||||||
|
};
|
||||||
|
|
||||||
|
// Only add source documents; do not upload them yet
|
||||||
|
const handleAddSourceFiles = () => {
|
||||||
|
if (sourceFiles.length === 0) {
|
||||||
|
toast.error('请先选择源文档');
|
||||||
|
return;
|
||||||
|
}
|
||||||
|
toast.success(`已添加 ${sourceFiles.length} 个源文档,可继续添加更多`);
|
||||||
|
};
|
||||||
|
|
||||||
|
// Load already-uploaded documents
|
||||||
|
const loadUploadedDocuments = useCallback(async () => {
|
||||||
|
setDocsLoading(true);
|
||||||
|
try {
|
||||||
|
const result = await backendApi.getDocuments(undefined, 100);
|
||||||
|
if (result.success) {
|
||||||
|
// Keep only document types that can serve as data sources
|
||||||
|
const docs = (result.documents || []).filter((d: DocumentItem) =>
|
||||||
|
['docx', 'md', 'txt', 'xlsx', 'xls'].includes(d.doc_type)
|
||||||
|
);
|
||||||
|
setUploadedDocuments(docs);
|
||||||
|
}
|
||||||
|
} catch (err: any) {
|
||||||
|
console.error('加载文档失败:', err);
|
||||||
|
} finally {
|
||||||
|
setDocsLoading(false);
|
||||||
|
}
|
||||||
|
}, []);
|
||||||
|
|
||||||
|
// Delete a document
|
||||||
|
const handleDeleteDocument = async (docId: string, e: React.MouseEvent) => {
|
||||||
|
e.stopPropagation();
|
||||||
|
if (!confirm('确定要删除该文档吗?')) return;
|
||||||
|
try {
|
||||||
|
const result = await backendApi.deleteDocument(docId);
|
||||||
|
if (result.success) {
|
||||||
|
setUploadedDocuments(prev => prev.filter(d => d.doc_id !== docId));
|
||||||
|
removeSourceDocId(docId);
|
||||||
|
toast.success('文档已删除');
|
||||||
|
} else {
|
||||||
|
toast.error(result.message || '删除失败');
|
||||||
|
}
|
||||||
|
} catch (err: any) {
|
||||||
|
toast.error('删除失败: ' + (err.message || '未知错误'));
|
||||||
|
}
|
||||||
|
};
|
||||||
|
|
||||||
|
useEffect(() => {
|
||||||
|
if (sourceMode === 'select') {
|
||||||
|
loadUploadedDocuments();
|
||||||
|
}
|
||||||
|
}, [sourceMode, loadUploadedDocuments]);
|
||||||
|
|
||||||
|
const handleJointUploadAndFill = async () => {
|
||||||
|
if (!templateFile) {
|
||||||
|
toast.error('请先上传模板文件');
|
||||||
return;
|
return;
|
||||||
}
|
}
|
||||||
|
|
||||||
setFilling(true);
|
// Check that a data source has been selected
|
||||||
setStep('filling');
|
if (sourceMode === 'upload' && sourceFiles.length === 0) {
|
||||||
|
toast.error('请上传源文档或从已上传文档中选择');
|
||||||
|
return;
|
||||||
|
}
|
||||||
|
if (sourceMode === 'select' && sourceDocIds.length === 0) {
|
||||||
|
toast.error('请选择源文档');
|
||||||
|
return;
|
||||||
|
}
|
||||||
|
|
||||||
|
setLoading(true);
|
||||||
|
|
||||||
try {
|
try {
|
||||||
// Call the backend fill endpoint, passing the selected document IDs
|
if (sourceMode === 'select') {
|
||||||
const result = await backendApi.fillTemplate(
|
// Use already-uploaded documents as the data source
|
||||||
'temp-template-id',
|
const result = await backendApi.uploadTemplate(templateFile);
|
||||||
templateFields,
|
|
||||||
selectedDocs // Pass the list of source document IDs
|
if (result.success) {
|
||||||
);
|
setTemplateFields(result.fields || []);
|
||||||
setFilledResult(result);
|
setTemplateId(result.template_id || 'temp');
|
||||||
setStep('preview');
|
toast.success('开始智能填表');
|
||||||
toast.success('表格填写完成');
|
setStep('filling');
|
||||||
|
|
||||||
|
// Fill the template using source_doc_ids
|
||||||
|
const fillResult = await backendApi.fillTemplate(
|
||||||
|
result.template_id || 'temp',
|
||||||
|
result.fields || [],
|
||||||
|
sourceDocIds,
|
||||||
|
[],
|
||||||
|
'请从以下文档中提取相关信息填写表格'
|
||||||
|
);
|
||||||
|
|
||||||
|
setFilledResult(fillResult);
|
||||||
|
setStep('preview');
|
||||||
|
toast.success('表格填写完成');
|
||||||
|
}
|
||||||
|
} else {
|
||||||
|
// Use the joint upload API
|
||||||
|
const result = await backendApi.uploadTemplateAndSources(
|
||||||
|
templateFile,
|
||||||
|
sourceFiles.map(sf => sf.file)
|
||||||
|
);
|
||||||
|
|
||||||
|
if (result.success) {
|
||||||
|
setTemplateFields(result.fields || []);
|
||||||
|
setTemplateId(result.template_id);
|
||||||
|
setSourceFilePaths(result.source_file_paths || []);
|
||||||
|
toast.success('文档上传成功,开始智能填表');
|
||||||
|
setStep('filling');
|
||||||
|
|
||||||
|
// Start filling automatically
|
||||||
|
const fillResult = await backendApi.fillTemplate(
|
||||||
|
result.template_id,
|
||||||
|
result.fields || [],
|
||||||
|
[],
|
||||||
|
result.source_file_paths || [],
|
||||||
|
'请从以下文档中提取相关信息填写表格'
|
||||||
|
);
|
||||||
|
|
||||||
|
setFilledResult(fillResult);
|
||||||
|
setStep('preview');
|
||||||
|
toast.success('表格填写完成');
|
||||||
|
}
|
||||||
|
}
|
||||||
} catch (err: any) {
|
} catch (err: any) {
|
||||||
toast.error('填表失败: ' + (err.message || '未知错误'));
|
toast.error('处理失败: ' + (err.message || '未知错误'));
|
||||||
setStep('select-source');
|
|
||||||
} finally {
|
} finally {
|
||||||
setFilling(false);
|
setLoading(false);
|
||||||
}
|
}
|
||||||
};
|
};
|
||||||
|
|
||||||
@@ -150,7 +248,11 @@ const TemplateFill: React.FC = () => {
|
|||||||
if (!templateFile || !filledResult) return;
|
if (!templateFile || !filledResult) return;
|
||||||
|
|
||||||
try {
|
try {
|
||||||
const blob = await backendApi.exportFilledTemplate('temp', filledResult.filled_data || {}, 'xlsx');
|
const blob = await backendApi.exportFilledTemplate(
|
||||||
|
templateId || 'temp',
|
||||||
|
filledResult.filled_data || {},
|
||||||
|
'xlsx'
|
||||||
|
);
|
||||||
const url = URL.createObjectURL(blob);
|
const url = URL.createObjectURL(blob);
|
||||||
const a = document.createElement('a');
|
const a = document.createElement('a');
|
||||||
a.href = url;
|
a.href = url;
|
||||||
@@ -163,12 +265,18 @@ const TemplateFill: React.FC = () => {
|
|||||||
}
|
}
|
||||||
};
|
};
|
||||||
|
|
||||||
const resetFlow = () => {
|
const getFileIcon = (filename: string) => {
|
||||||
setStep('upload-template');
|
const ext = filename.split('.').pop()?.toLowerCase();
|
||||||
setTemplateFile(null);
|
if (['xlsx', 'xls'].includes(ext || '')) {
|
||||||
setTemplateFields([]);
|
return <FileSpreadsheet size={20} className="text-emerald-500" />;
|
||||||
setSelectedDocs([]);
|
}
|
||||||
setFilledResult(null);
|
if (ext === 'docx') {
|
||||||
|
return <FileText size={20} className="text-blue-500" />;
|
||||||
|
}
|
||||||
|
if (['md', 'txt'].includes(ext || '')) {
|
||||||
|
return <FileText size={20} className="text-orange-500" />;
|
||||||
|
}
|
||||||
|
return <File size={20} className="text-gray-500" />;
|
||||||
};
|
};
|
||||||
|
|
||||||
return (
|
return (
|
||||||
@@ -180,208 +288,248 @@ const TemplateFill: React.FC = () => {
|
|||||||
根据您的表格模板,自动聚合多源文档信息进行精准填充
|
根据您的表格模板,自动聚合多源文档信息进行精准填充
|
||||||
</p>
|
</p>
|
||||||
</div>
|
</div>
|
||||||
{step !== 'upload-template' && (
|
{step !== 'upload' && (
|
||||||
<Button variant="outline" className="rounded-xl gap-2" onClick={resetFlow}>
|
<Button variant="outline" className="rounded-xl gap-2" onClick={reset}>
|
||||||
<RefreshCcw size={18} />
|
<RefreshCcw size={18} />
|
||||||
<span>重新开始</span>
|
<span>重新开始</span>
|
||||||
</Button>
|
</Button>
|
||||||
)}
|
)}
|
||||||
</section>
|
</section>
|
||||||
|
|
||||||
{/* Progress Steps */}
|
{/* Step 1: Upload - Joint Upload of Template + Source Docs */}
|
||||||
<div className="flex items-center justify-center gap-4">
|
{step === 'upload' && (
|
||||||
{['上传模板', '选择数据源', '填写预览'].map((label, idx) => {
|
<div className="grid grid-cols-1 lg:grid-cols-2 gap-6">
|
||||||
const stepIndex = ['upload-template', 'select-source', 'preview'].indexOf(step);
|
{/* Template Upload */}
|
||||||
const isActive = idx <= stepIndex;
|
|
||||||
const isCurrent = idx === stepIndex;
|
|
||||||
|
|
||||||
return (
|
|
||||||
<React.Fragment key={idx}>
|
|
||||||
<div className={cn(
|
|
||||||
"flex items-center gap-2 px-4 py-2 rounded-full transition-all",
|
|
||||||
isActive ? "bg-primary text-primary-foreground" : "bg-muted text-muted-foreground"
|
|
||||||
)}>
|
|
||||||
<div className={cn(
|
|
||||||
"w-6 h-6 rounded-full flex items-center justify-center text-xs font-bold",
|
|
||||||
isCurrent ? "bg-white/20" : ""
|
|
||||||
)}>
|
|
||||||
{idx + 1}
|
|
||||||
</div>
|
|
||||||
<span className="text-sm font-medium">{label}</span>
|
|
||||||
</div>
|
|
||||||
{idx < 2 && (
|
|
||||||
<div className={cn(
|
|
||||||
"w-12 h-0.5",
|
|
||||||
idx < stepIndex ? "bg-primary" : "bg-muted"
|
|
||||||
)} />
|
|
||||||
)}
|
|
||||||
</React.Fragment>
|
|
||||||
);
|
|
||||||
})}
|
|
||||||
</div>
|
|
||||||
|
|
||||||
{/* Step 1: Upload Template */}
|
|
||||||
{step === 'upload-template' && (
|
|
||||||
<div
|
|
||||||
{...getTemplateProps()}
|
|
||||||
className={cn(
|
|
||||||
"border-2 border-dashed rounded-3xl p-16 transition-all duration-300 flex flex-col items-center justify-center text-center cursor-pointer group",
|
|
||||||
isTemplateDragActive ? "border-primary bg-primary/5" : "border-muted-foreground/20 hover:border-primary/50 hover:bg-primary/5"
|
|
||||||
)}
|
|
||||||
>
|
|
||||||
<input {...getTemplateInputProps()} />
|
|
||||||
<div className="w-20 h-20 rounded-2xl bg-primary/10 text-primary flex items-center justify-center mb-6 group-hover:scale-110 transition-transform">
|
|
||||||
{loading ? <Loader2 className="animate-spin" size={40} /> : <Upload size={40} />}
|
|
||||||
</div>
|
|
||||||
<div className="space-y-2 max-w-md">
|
|
||||||
<p className="text-xl font-bold tracking-tight">
|
|
||||||
{isTemplateDragActive ? '释放以开始上传' : '点击或拖拽上传表格模板'}
|
|
||||||
</p>
|
|
||||||
<p className="text-sm text-muted-foreground">
|
|
||||||
支持 Excel (.xlsx, .xls) 或 Word (.docx) 格式的表格模板
|
|
||||||
</p>
|
|
||||||
</div>
|
|
||||||
<div className="mt-6 flex gap-3">
|
|
||||||
<Badge variant="outline" className="bg-emerald-500/10 text-emerald-600 border-emerald-200">
|
|
||||||
<FileSpreadsheet size={14} className="mr-1" /> Excel 模板
|
|
||||||
</Badge>
|
|
||||||
<Badge variant="outline" className="bg-blue-500/10 text-blue-600 border-blue-200">
|
|
||||||
<FileText size={14} className="mr-1" /> Word 模板
|
|
||||||
</Badge>
|
|
||||||
</div>
|
|
||||||
</div>
|
|
||||||
)}
|
|
||||||
|
|
||||||
{/* Step 2: Select Source Documents */}
|
|
||||||
{step === 'select-source' && (
|
|
||||||
<div className="space-y-6">
|
|
||||||
{/* Template Info */}
|
|
||||||
<Card className="border-none shadow-md">
|
<Card className="border-none shadow-md">
|
||||||
<CardHeader className="pb-4">
|
<CardHeader className="pb-4">
|
||||||
<CardTitle className="text-lg flex items-center gap-2">
|
<CardTitle className="text-lg flex items-center gap-2">
|
||||||
<FileSpreadsheet className="text-primary" size={20} />
|
<FileSpreadsheet className="text-primary" size={20} />
|
||||||
已上传模板
|
表格模板
|
||||||
</CardTitle>
|
|
||||||
</CardHeader>
|
|
||||||
<CardContent>
|
|
||||||
<div className="flex items-center gap-4">
|
|
||||||
<div className="w-12 h-12 rounded-xl bg-emerald-500/10 text-emerald-500 flex items-center justify-center">
|
|
||||||
<FileSpreadsheet size={24} />
|
|
||||||
</div>
|
|
||||||
<div className="flex-1">
|
|
||||||
<p className="font-bold">{templateFile?.name}</p>
|
|
||||||
<p className="text-sm text-muted-foreground">
|
|
||||||
{templateFields.length} 个字段待填写
|
|
||||||
</p>
|
|
||||||
</div>
|
|
||||||
<Button variant="ghost" size="sm" onClick={() => setStep('upload-template')}>
|
|
||||||
重新选择
|
|
||||||
</Button>
|
|
||||||
</div>
|
|
||||||
|
|
||||||
{/* Template Fields Preview */}
|
|
||||||
<div className="mt-4 p-4 bg-muted/30 rounded-xl">
|
|
||||||
<p className="text-xs font-bold uppercase tracking-widest text-muted-foreground mb-3">待填写字段</p>
|
|
||||||
<div className="flex flex-wrap gap-2">
|
|
||||||
{templateFields.map((field, idx) => (
|
|
||||||
<Badge key={idx} variant="outline" className="bg-background">
|
|
||||||
{field.name}
|
|
||||||
</Badge>
|
|
||||||
))}
|
|
||||||
</div>
|
|
||||||
</div>
|
|
||||||
</CardContent>
|
|
||||||
</Card>
|
|
||||||
|
|
||||||
{/* Source Documents Selection */}
|
|
||||||
<Card className="border-none shadow-md">
|
|
||||||
<CardHeader className="pb-4">
|
|
||||||
<CardTitle className="text-lg flex items-center gap-2">
|
|
||||||
<FileText className="text-primary" size={20} />
|
|
||||||
选择数据源文档
|
|
||||||
</CardTitle>
|
</CardTitle>
|
||||||
<CardDescription>
|
<CardDescription>
|
||||||
从已上传的文档中选择作为填表的数据来源,支持 Excel 和非结构化文档
|
上传需要填写的 Excel/Word 模板文件
|
||||||
</CardDescription>
|
</CardDescription>
|
||||||
</CardHeader>
|
</CardHeader>
|
||||||
<CardContent>
|
<CardContent>
|
||||||
{loading ? (
|
{!templateFile ? (
|
||||||
<div className="space-y-3">
|
<div
|
||||||
{[1, 2, 3].map(i => <Skeleton key={i} className="h-16 w-full rounded-xl" />)}
|
{...getTemplateProps()}
|
||||||
</div>
|
className={cn(
|
||||||
) : sourceDocs.length > 0 ? (
|
"border-2 border-dashed rounded-2xl p-8 transition-all duration-300 flex flex-col items-center justify-center text-center cursor-pointer group min-h-[200px]",
|
||||||
<div className="space-y-3">
|
isTemplateDragActive ? "border-primary bg-primary/5" : "border-muted-foreground/20 hover:border-primary/50 hover:bg-primary/5"
|
||||||
{sourceDocs.map(doc => (
|
)}
|
||||||
<div
|
>
|
||||||
key={doc.doc_id}
|
<input {...getTemplateInputProps()} />
|
||||||
className={cn(
|
<div className="w-14 h-14 rounded-xl bg-primary/10 text-primary flex items-center justify-center mb-4 group-hover:scale-110 transition-transform">
|
||||||
"flex items-center gap-4 p-4 rounded-xl border-2 transition-all cursor-pointer",
|
{loading ? <Loader2 className="animate-spin" size={28} /> : <Upload size={28} />}
|
||||||
selectedDocs.includes(doc.doc_id)
|
</div>
|
||||||
? "border-primary bg-primary/5"
|
<p className="font-medium">
|
||||||
: "border-border hover:bg-muted/30"
|
{isTemplateDragActive ? '释放以上传' : '点击或拖拽上传模板'}
|
||||||
)}
|
</p>
|
||||||
onClick={() => {
|
<p className="text-xs text-muted-foreground mt-1">
|
||||||
setSelectedDocs(prev =>
|
支持 .xlsx .xls .docx
|
||||||
prev.includes(doc.doc_id)
|
</p>
|
||||||
? prev.filter(id => id !== doc.doc_id)
|
|
||||||
: [...prev, doc.doc_id]
|
|
||||||
);
|
|
||||||
}}
|
|
||||||
>
|
|
||||||
<div className={cn(
|
|
||||||
"w-6 h-6 rounded-md border-2 flex items-center justify-center transition-all",
|
|
||||||
selectedDocs.includes(doc.doc_id)
|
|
||||||
? "border-primary bg-primary text-white"
|
|
||||||
: "border-muted-foreground/30"
|
|
||||||
)}>
|
|
||||||
{selectedDocs.includes(doc.doc_id) && <CheckCircle2 size={14} />}
|
|
||||||
</div>
|
|
||||||
<div className={cn(
|
|
||||||
"w-10 h-10 rounded-lg flex items-center justify-center",
|
|
||||||
doc.doc_type === 'xlsx' ? "bg-emerald-500/10 text-emerald-500" : "bg-blue-500/10 text-blue-500"
|
|
||||||
)}>
|
|
||||||
{doc.doc_type === 'xlsx' ? <FileSpreadsheet size={20} /> : <FileText size={20} />}
|
|
||||||
</div>
|
|
||||||
<div className="flex-1 min-w-0">
|
|
||||||
<p className="font-semibold truncate">{doc.original_filename}</p>
|
|
||||||
<p className="text-xs text-muted-foreground">
|
|
||||||
{doc.doc_type.toUpperCase()} • {format(new Date(doc.created_at), 'yyyy-MM-dd')}
|
|
||||||
</p>
|
|
||||||
</div>
|
|
||||||
{doc.metadata?.columns && (
|
|
||||||
<Badge variant="outline" className="text-xs">
|
|
||||||
{doc.metadata.columns.length} 列
|
|
||||||
</Badge>
|
|
||||||
)}
|
|
||||||
</div>
|
|
||||||
))}
|
|
||||||
</div>
|
</div>
|
||||||
) : (
|
) : (
|
||||||
<div className="text-center py-12 text-muted-foreground">
|
<div className="flex items-center gap-3 p-4 bg-emerald-500/5 rounded-xl border border-emerald-200">
|
||||||
<FileText size={48} className="mx-auto mb-4 opacity-30" />
|
<div className="w-10 h-10 rounded-lg bg-emerald-500/10 text-emerald-500 flex items-center justify-center">
|
||||||
<p>暂无数据源文档,请先上传文档</p>
|
<FileSpreadsheet size={20} />
|
||||||
|
</div>
|
||||||
|
<div className="flex-1 min-w-0">
|
||||||
|
<p className="font-medium truncate">{templateFile.name}</p>
|
||||||
|
<p className="text-xs text-muted-foreground">
|
||||||
|
{(templateFile.size / 1024).toFixed(1)} KB
|
||||||
|
</p>
|
||||||
|
</div>
|
||||||
|
<Button variant="ghost" size="sm" onClick={() => setTemplateFile(null)}>
|
||||||
|
<X size={16} />
|
||||||
|
</Button>
|
||||||
</div>
|
</div>
|
||||||
)}
|
)}
|
||||||
</CardContent>
|
</CardContent>
|
||||||
</Card>
|
</Card>
|
||||||
|
|
||||||
|
{/* Source Documents Upload */}
|
||||||
|
<Card className="border-none shadow-md">
|
||||||
|
<CardHeader className="pb-4">
|
||||||
|
<CardTitle className="text-lg flex items-center gap-2">
|
||||||
|
<Files className="text-primary" size={20} />
|
||||||
|
源文档
|
||||||
|
</CardTitle>
|
||||||
|
<CardDescription>
|
||||||
|
选择包含数据的源文档作为填表依据
|
||||||
|
</CardDescription>
|
||||||
|
{/* Source Mode Tabs */}
|
||||||
|
<div className="flex gap-2 mt-2">
|
||||||
|
<Button
|
||||||
|
variant={sourceMode === 'upload' ? 'default' : 'outline'}
|
||||||
|
size="sm"
|
||||||
|
onClick={() => setSourceMode('upload')}
|
||||||
|
>
|
||||||
|
<Upload size={14} className="mr-1" />
|
||||||
|
上传文件
|
||||||
|
</Button>
|
||||||
|
<Button
|
||||||
|
variant={sourceMode === 'select' ? 'default' : 'outline'}
|
||||||
|
size="sm"
|
||||||
|
onClick={() => setSourceMode('select')}
|
||||||
|
>
|
||||||
|
<Files size={14} className="mr-1" />
|
||||||
|
从文档中心选择
|
||||||
|
</Button>
|
||||||
|
</div>
|
||||||
|
</CardHeader>
|
||||||
|
<CardContent>
|
||||||
|
{sourceMode === 'upload' ? (
|
||||||
|
<>
|
||||||
|
<div className="border-2 border-dashed rounded-2xl p-8 transition-all duration-300 flex flex-col items-center justify-center text-center cursor-pointer group min-h-[200px] border-muted-foreground/20 hover:border-primary/50 hover:bg-primary/5">
|
||||||
|
<input
|
||||||
|
id="source-file-input"
|
||||||
|
type="file"
|
||||||
|
multiple={true}
|
||||||
|
accept=".xlsx,.xls,.docx,.md,.txt"
|
||||||
|
onChange={handleSourceFileSelect}
|
||||||
|
className="hidden"
|
||||||
|
/>
|
||||||
|
<label htmlFor="source-file-input" className="cursor-pointer flex flex-col items-center">
|
||||||
|
<div className="w-14 h-14 rounded-xl bg-blue-500/10 text-blue-500 flex items-center justify-center mb-4 group-hover:scale-110 transition-transform">
|
||||||
|
{loading ? <Loader2 className="animate-spin" size={28} /> : <Upload size={28} />}
|
||||||
|
</div>
|
||||||
|
<p className="font-medium">
|
||||||
|
点击上传源文档
|
||||||
|
</p>
|
||||||
|
<p className="text-xs text-muted-foreground mt-1">
|
||||||
|
支持 .xlsx .xls .docx .md .txt
|
||||||
|
</p>
|
||||||
|
</label>
|
||||||
|
</div>
|
||||||
|
<div
|
||||||
|
onDragOver={(e) => { e.preventDefault(); }}
|
||||||
|
onDrop={onSourceDrop}
|
||||||
|
className="mt-2 text-center text-xs text-muted-foreground"
|
||||||
|
>
|
||||||
|
或拖拽文件到此处
|
||||||
|
</div>
|
||||||
|
|
||||||
|
{/* Selected Source Files */}
|
||||||
|
{sourceFiles.length > 0 && (
|
||||||
|
<div className="mt-4 space-y-2">
|
||||||
|
{sourceFiles.map((sf, idx) => (
|
||||||
|
<div key={idx} className="flex items-center gap-3 p-3 bg-muted/50 rounded-xl">
|
||||||
|
{getFileIcon(sf.file.name)}
|
||||||
|
<div className="flex-1 min-w-0">
|
||||||
|
<p className="text-sm font-medium truncate">{sf.file.name}</p>
|
||||||
|
<p className="text-xs text-muted-foreground">
|
||||||
|
{(sf.file.size / 1024).toFixed(1)} KB
|
||||||
|
</p>
|
||||||
|
</div>
|
||||||
|
<Button variant="ghost" size="sm" onClick={() => removeSourceFile(idx)}>
|
||||||
|
<Trash2 size={14} className="text-red-500" />
|
||||||
|
</Button>
|
||||||
|
</div>
|
||||||
|
))}
|
||||||
|
<div className="flex justify-center pt-2">
|
||||||
|
<Button variant="outline" size="sm" onClick={() => document.getElementById('source-file-input')?.click()}>
|
||||||
|
<Plus size={14} className="mr-1" />
|
||||||
|
继续添加更多文档
|
||||||
|
</Button>
|
||||||
|
</div>
|
||||||
|
</div>
|
||||||
|
)}
|
||||||
|
</>
|
||||||
|
) : (
|
||||||
|
<>
|
||||||
|
{/* Uploaded Documents Selection */}
|
||||||
|
{docsLoading ? (
|
||||||
|
<div className="space-y-2">
|
||||||
|
{[1, 2, 3].map(i => (
|
||||||
|
<Skeleton key={i} className="h-16 w-full rounded-xl" />
|
||||||
|
))}
|
||||||
|
</div>
|
||||||
|
) : uploadedDocuments.length > 0 ? (
|
||||||
|
<div className="space-y-2">
|
||||||
|
{sourceDocIds.length > 0 && (
|
||||||
|
<div className="flex items-center justify-between p-3 bg-primary/5 rounded-xl border border-primary/20">
|
||||||
|
<span className="text-sm font-medium">已选择 {sourceDocIds.length} 个文档</span>
|
||||||
|
<Button variant="ghost" size="sm" onClick={() => loadUploadedDocuments()}>
|
||||||
|
<RefreshCcw size={14} className="mr-1" />
|
||||||
|
刷新列表
|
||||||
|
</Button>
|
||||||
|
</div>
|
||||||
|
)}
|
||||||
|
<div className="max-h-[300px] overflow-y-auto space-y-2">
|
||||||
|
{uploadedDocuments.map((doc) => (
|
||||||
|
<div
|
||||||
|
key={doc.doc_id}
|
||||||
|
className={cn(
|
||||||
|
"flex items-center gap-3 p-3 rounded-xl border-2 transition-all cursor-pointer",
|
||||||
|
sourceDocIds.includes(doc.doc_id)
|
||||||
|
? "border-primary bg-primary/5"
|
||||||
|
: "border-border hover:bg-muted/30"
|
||||||
|
)}
|
||||||
|
onClick={() => {
|
||||||
|
if (sourceDocIds.includes(doc.doc_id)) {
|
||||||
|
removeSourceDocId(doc.doc_id);
|
||||||
|
} else {
|
||||||
|
addSourceDocId(doc.doc_id);
|
||||||
|
}
|
||||||
|
}}
|
||||||
|
>
|
||||||
|
<div className={cn(
|
||||||
|
"w-6 h-6 rounded-md border-2 flex items-center justify-center transition-all shrink-0",
|
||||||
|
sourceDocIds.includes(doc.doc_id)
|
||||||
|
? "border-primary bg-primary text-white"
|
||||||
|
: "border-muted-foreground/30"
|
||||||
|
)}>
|
||||||
|
{sourceDocIds.includes(doc.doc_id) && <CheckCircle2 size={14} />}
|
||||||
|
</div>
|
||||||
|
{getFileIcon(doc.original_filename)}
|
||||||
|
<div className="flex-1 min-w-0">
|
||||||
|
<p className="text-sm font-medium truncate">{doc.original_filename}</p>
|
||||||
|
<p className="text-xs text-muted-foreground">
|
||||||
|
{doc.doc_type.toUpperCase()} • {format(new Date(doc.created_at), 'yyyy-MM-dd')}
|
||||||
|
</p>
|
||||||
|
</div>
|
||||||
|
<Button
|
||||||
|
variant="ghost"
|
||||||
|
size="sm"
|
||||||
|
onClick={(e) => handleDeleteDocument(doc.doc_id, e)}
|
||||||
|
className="shrink-0"
|
||||||
|
>
|
||||||
|
<Trash2 size={14} className="text-red-500" />
|
||||||
|
</Button>
|
||||||
|
</div>
|
||||||
|
))}
|
||||||
|
</div>
|
||||||
|
</div>
|
||||||
|
) : (
|
||||||
|
<div className="text-center py-8 text-muted-foreground">
|
||||||
|
<Files size={32} className="mx-auto mb-2 opacity-30" />
|
||||||
|
<p className="text-sm">暂无可用的已上传文档</p>
|
||||||
|
</div>
|
||||||
|
)}
|
||||||
|
</>
|
||||||
|
)}
|
||||||
|
</CardContent>
|
||||||
|
</Card>
|
||||||
|
|
||||||
{/* Action Button */}
|
{/* Action Button */}
|
||||||
<div className="flex justify-center">
|
<div className="col-span-1 lg:col-span-2 flex justify-center">
|
||||||
<Button
|
<Button
|
||||||
size="lg"
|
size="lg"
|
||||||
className="rounded-xl px-8 shadow-lg shadow-primary/20 gap-2"
|
className="rounded-xl px-12 shadow-lg shadow-primary/20 gap-2"
|
||||||
disabled={selectedDocs.length === 0 || filling}
|
disabled={!templateFile || loading}
|
||||||
onClick={handleFillTemplate}
|
onClick={handleJointUploadAndFill}
|
||||||
>
|
>
|
||||||
{filling ? (
|
{loading ? (
|
||||||
<>
|
<>
|
||||||
<Loader2 className="animate-spin" size={20} />
|
<Loader2 className="animate-spin" size={20} />
|
||||||
<span>AI 正在分析并填表...</span>
|
<span>正在处理...</span>
|
||||||
</>
|
</>
|
||||||
) : (
|
) : (
|
||||||
<>
|
<>
|
||||||
<Sparkles size={20} />
|
<Sparkles size={20} />
|
||||||
<span>开始智能填表</span>
|
<span>上传并智能填表</span>
|
||||||
</>
|
</>
|
||||||
)}
|
)}
|
||||||
</Button>
|
</Button>
|
||||||
@@ -389,49 +537,7 @@ const TemplateFill: React.FC = () => {
|
|||||||
</div>
|
</div>
|
||||||
)}
|
)}
|
||||||
|
|
||||||
{/* Step 3: Preview Results */}
|
{/* Step 2: Filling State */}
|
||||||
{step === 'preview' && filledResult && (
|
|
||||||
<Card className="border-none shadow-md">
|
|
||||||
<CardHeader>
|
|
||||||
<CardTitle className="text-lg flex items-center gap-2">
|
|
||||||
<CheckCircle2 className="text-emerald-500" size={20} />
|
|
||||||
填表完成
|
|
||||||
</CardTitle>
|
|
||||||
<CardDescription>
|
|
||||||
系统已根据 {selectedDocs.length} 份文档自动完成表格填写
|
|
||||||
</CardDescription>
|
|
||||||
</CardHeader>
|
|
||||||
<CardContent className="space-y-6">
|
|
||||||
{/* Filled Data Preview */}
|
|
||||||
<div className="p-6 bg-muted/30 rounded-2xl">
|
|
||||||
<div className="space-y-4">
|
|
||||||
{templateFields.map((field, idx) => (
|
|
||||||
<div key={idx} className="flex items-center gap-4">
|
|
||||||
<div className="w-32 text-sm font-medium text-muted-foreground">{field.name}</div>
|
|
||||||
<div className="flex-1 p-3 bg-background rounded-xl border">
|
|
||||||
{(filledResult.filled_data || {})[field.name] || '-'}
|
|
||||||
</div>
|
|
||||||
</div>
|
|
||||||
))}
|
|
||||||
</div>
|
|
||||||
</div>
|
|
||||||
|
|
||||||
{/* Action Buttons */}
|
|
||||||
<div className="flex justify-center gap-4">
|
|
||||||
<Button variant="outline" className="rounded-xl gap-2" onClick={resetFlow}>
|
|
||||||
<RefreshCcw size={18} />
|
|
||||||
<span>继续填表</span>
|
|
||||||
</Button>
|
|
||||||
<Button className="rounded-xl gap-2 shadow-lg shadow-primary/20" onClick={handleExport}>
|
|
||||||
<Download size={18} />
|
|
||||||
<span>导出结果</span>
|
|
||||||
</Button>
|
|
||||||
</div>
|
|
||||||
</CardContent>
|
|
||||||
</Card>
|
|
||||||
)}
|
|
||||||
|
|
||||||
{/* Filling State */}
|
|
||||||
{step === 'filling' && (
|
{step === 'filling' && (
|
||||||
<Card className="border-none shadow-md">
|
<Card className="border-none shadow-md">
|
||||||
<CardContent className="py-16 flex flex-col items-center justify-center">
|
<CardContent className="py-16 flex flex-col items-center justify-center">
|
||||||
@@ -440,11 +546,117 @@ const TemplateFill: React.FC = () => {
|
|||||||
</div>
|
</div>
|
||||||
<h3 className="text-xl font-bold mb-2">AI 正在智能分析并填表</h3>
|
<h3 className="text-xl font-bold mb-2">AI 正在智能分析并填表</h3>
|
||||||
<p className="text-muted-foreground text-center max-w-md">
|
<p className="text-muted-foreground text-center max-w-md">
|
||||||
系统正在从 {selectedDocs.length} 份文档中检索相关信息,生成字段描述,并使用 RAG 增强填写准确性...
|
系统正在从 {sourceFiles.length || sourceFilePaths.length} 份文档中检索相关信息...
|
||||||
</p>
|
</p>
|
||||||
</CardContent>
|
</CardContent>
|
||||||
</Card>
|
</Card>
|
||||||
)}
|
)}
|
||||||
|
|
||||||
|
{/* Step 3: Preview Results */}
|
||||||
|
{step === 'preview' && filledResult && (
|
||||||
|
<div className="space-y-6">
|
||||||
|
<Card className="border-none shadow-md">
|
||||||
|
<CardHeader>
|
||||||
|
<CardTitle className="text-lg flex items-center gap-2">
|
||||||
|
<CheckCircle2 className="text-emerald-500" size={20} />
|
||||||
|
填表完成
|
||||||
|
</CardTitle>
|
||||||
|
<CardDescription>
|
||||||
|
系统已根据 {sourceFiles.length || sourceFilePaths.length} 份文档自动完成表格填写
|
||||||
|
</CardDescription>
|
||||||
|
</CardHeader>
|
||||||
|
<CardContent>
|
||||||
|
{/* Filled Data Preview */}
|
||||||
|
<div className="p-6 bg-muted/30 rounded-2xl">
|
||||||
|
<div className="space-y-4">
|
||||||
|
{templateFields.map((field, idx) => {
|
||||||
|
const value = filledResult.filled_data?.[field.name];
|
||||||
|
const displayValue = Array.isArray(value)
|
||||||
|
? value.filter(v => v && String(v).trim()).join(', ') || '-'
|
||||||
|
: value || '-';
|
||||||
|
return (
|
||||||
|
<div key={idx} className="flex items-center gap-4">
|
||||||
|
<div className="w-40 text-sm font-medium text-muted-foreground">{field.name}</div>
|
||||||
|
<div className="flex-1 p-3 bg-background rounded-xl border">
|
||||||
|
{displayValue}
|
||||||
|
</div>
|
||||||
|
</div>
|
||||||
|
);
|
||||||
|
})}
|
||||||
|
</div>
|
||||||
|
</div>
|
||||||
|
|
||||||
|
{/* Source Files Info */}
|
||||||
|
<div className="mt-4 flex flex-wrap gap-2">
|
||||||
|
{sourceFiles.map((sf, idx) => (
|
||||||
|
<Badge key={idx} variant="outline" className="bg-blue-500/5">
|
||||||
|
{getFileIcon(sf.file.name)}
|
||||||
|
<span className="ml-1">{sf.file.name}</span>
|
||||||
|
</Badge>
|
||||||
|
))}
|
||||||
|
</div>
|
||||||
|
|
||||||
|
{/* Action Buttons */}
|
||||||
|
<div className="flex justify-center gap-4 mt-6">
|
||||||
|
<Button variant="outline" className="rounded-xl gap-2" onClick={reset}>
|
||||||
|
<RefreshCcw size={18} />
|
||||||
|
<span>继续填表</span>
|
||||||
|
</Button>
|
||||||
|
<Button className="rounded-xl gap-2 shadow-lg shadow-primary/20" onClick={handleExport}>
|
||||||
|
<Download size={18} />
|
||||||
|
<span>导出结果</span>
|
||||||
|
</Button>
|
||||||
|
</div>
|
||||||
|
</CardContent>
|
||||||
|
</Card>
|
||||||
|
|
||||||
|
{/* Fill Details */}
|
||||||
|
{filledResult.fill_details && filledResult.fill_details.length > 0 && (
|
||||||
|
<Card className="border-none shadow-md">
|
||||||
|
<CardHeader>
|
||||||
|
<CardTitle className="text-lg">填写详情</CardTitle>
|
||||||
|
</CardHeader>
|
||||||
|
<CardContent>
|
||||||
|
<div className="space-y-3">
|
||||||
|
{filledResult.fill_details.map((detail: any, idx: number) => (
|
||||||
|
<div key={idx} className="flex items-start gap-3 p-3 bg-muted/30 rounded-xl text-sm">
|
||||||
|
<div className="w-1 h-1 rounded-full bg-primary mt-2" />
|
||||||
|
<div className="flex-1">
|
||||||
|
<div className="font-medium">{detail.field}</div>
|
||||||
|
<div className="text-muted-foreground text-xs mt-1">
|
||||||
|
来源: {detail.source} | 置信度: {detail.confidence ? (detail.confidence * 100).toFixed(0) + '%' : 'N/A'}
|
||||||
|
</div>
|
||||||
|
{detail.warning && (
|
||||||
|
<div className="mt-2 p-2 bg-yellow-50 border border-yellow-200 rounded-lg text-yellow-700 text-xs">
|
||||||
|
⚠️ {detail.warning}
|
||||||
|
</div>
|
||||||
|
)}
|
||||||
|
{detail.values && detail.values.length > 1 && !detail.warning && (
|
||||||
|
<div className="mt-2 text-xs text-muted-foreground">
|
||||||
|
多值: {detail.values.join(', ')}
|
||||||
|
</div>
|
||||||
|
)}
|
||||||
|
</div>
|
||||||
|
</div>
|
||||||
|
))}
|
||||||
|
</div>
|
||||||
|
</CardContent>
|
||||||
|
</Card>
|
||||||
|
)}
|
||||||
|
</div>
|
||||||
|
)}
|
||||||
|
|
||||||
|
{/* Preview Dialog */}
|
||||||
|
<Dialog open={previewOpen} onOpenChange={setPreviewOpen}>
|
||||||
|
<DialogContent className="max-w-2xl">
|
||||||
|
<DialogHeader>
|
||||||
|
<DialogTitle>{previewDoc?.name || '文档预览'}</DialogTitle>
|
||||||
|
</DialogHeader>
|
||||||
|
<ScrollArea className="max-h-[60vh]">
|
||||||
|
<pre className="text-sm whitespace-pre-wrap">{previewDoc?.content}</pre>
|
||||||
|
</ScrollArea>
|
||||||
|
</DialogContent>
|
||||||
|
</Dialog>
|
||||||
</div>
|
</div>
|
||||||
);
|
);
|
||||||
};
|
};
|
||||||
|
|||||||
@@ -1,854 +0,0 @@
|
|||||||
diff --git a/backend/app/api/endpoints/templates.py b/backend/app/api/endpoints/templates.py
|
|
||||||
index 572d56e..706f281 100644
|
|
||||||
--- a/backend/app/api/endpoints/templates.py
|
|
||||||
+++ b/backend/app/api/endpoints/templates.py
|
|
||||||
@@ -13,7 +13,7 @@ import pandas as pd
 from pydantic import BaseModel

 from app.services.template_fill_service import template_fill_service, TemplateField
-from app.services.excel_storage_service import excel_storage_service
+from app.services.file_service import file_service

 logger = logging.getLogger(__name__)

@@ -28,13 +28,15 @@ class TemplateFieldRequest(BaseModel):
     name: str
     field_type: str = "text"
     required: bool = True
+    hint: str = ""


 class FillRequest(BaseModel):
     """Fill request"""
     template_id: str
     template_fields: List[TemplateFieldRequest]
-    source_doc_ids: Optional[List[str]] = None
+    source_doc_ids: Optional[List[str]] = None  # list of MongoDB document IDs
+    source_file_paths: Optional[List[str]] = None  # list of source document file paths
     user_hint: Optional[str] = None


@@ -71,7 +73,6 @@ async def upload_template(
|
|
||||||
|
|
||||||
try:
|
|
||||||
# 保存文件
|
|
||||||
- from app.services.file_service import file_service
|
|
||||||
content = await file.read()
|
|
||||||
saved_path = file_service.save_uploaded_file(
|
|
||||||
content,
|
|
||||||
@@ -87,7 +88,7 @@ async def upload_template(
|
|
||||||
|
|
||||||
return {
|
|
||||||
"success": True,
|
|
||||||
- "template_id": saved_path, # 使用文件路径作为ID
|
|
||||||
+ "template_id": saved_path,
|
|
||||||
"filename": file.filename,
|
|
||||||
"file_type": file_ext,
|
|
||||||
"fields": [
|
|
||||||
@@ -95,7 +96,8 @@ async def upload_template(
|
|
||||||
"cell": f.cell,
|
|
||||||
"name": f.name,
|
|
||||||
"field_type": f.field_type,
|
|
||||||
- "required": f.required
|
|
||||||
+ "required": f.required,
|
|
||||||
+ "hint": f.hint
|
|
||||||
}
|
|
||||||
for f in template_fields
|
|
||||||
],
|
|
||||||
@@ -135,7 +137,8 @@ async def extract_template_fields(
|
|
||||||
"cell": f.cell,
|
|
||||||
"name": f.name,
|
|
||||||
"field_type": f.field_type,
|
|
||||||
- "required": f.required
|
|
||||||
+ "required": f.required,
|
|
||||||
+ "hint": f.hint
|
|
||||||
}
|
|
||||||
for f in fields
|
|
||||||
]
|
|
||||||
@@ -153,7 +156,7 @@ async def fill_template(
|
|
||||||
"""
|
|
||||||
执行表格填写
|
|
||||||
|
|
||||||
- 根据提供的字段定义,从已上传的文档中检索信息并填写
|
|
||||||
+ 根据提供的字段定义,从源文档中检索信息并填写
|
|
||||||
|
|
||||||
Args:
|
|
||||||
request: 填写请求
|
|
||||||
@@ -168,7 +171,8 @@ async def fill_template(
|
|
||||||
cell=f.cell,
|
|
||||||
name=f.name,
|
|
||||||
field_type=f.field_type,
|
|
||||||
- required=f.required
|
|
||||||
+ required=f.required,
|
|
||||||
+ hint=f.hint
|
|
||||||
)
|
|
||||||
for f in request.template_fields
|
|
||||||
]
|
|
||||||
@@ -177,6 +181,7 @@ async def fill_template(
|
|
||||||
result = await template_fill_service.fill_template(
|
|
||||||
template_fields=fields,
|
|
||||||
source_doc_ids=request.source_doc_ids,
|
|
||||||
+ source_file_paths=request.source_file_paths,
|
|
||||||
user_hint=request.user_hint
|
|
||||||
)
|
|
||||||
|
|
||||||
@@ -194,6 +199,8 @@ async def export_filled_template(
|
|
||||||
"""
|
|
||||||
导出填写后的表格
|
|
||||||
|
|
||||||
+ 支持 Excel (.xlsx) 和 Word (.docx) 格式
|
|
||||||
+
|
|
||||||
Args:
|
|
||||||
request: 导出请求
|
|
||||||
|
|
||||||
@@ -201,25 +208,124 @@ async def export_filled_template(
|
|
||||||
文件流
|
|
||||||
"""
|
|
||||||
try:
|
|
||||||
- # 创建 DataFrame
|
|
||||||
- df = pd.DataFrame([request.filled_data])
|
|
||||||
+ if request.format == "xlsx":
|
|
||||||
+ return await _export_to_excel(request.filled_data, request.template_id)
|
|
||||||
+ elif request.format == "docx":
|
|
||||||
+ return await _export_to_word(request.filled_data, request.template_id)
|
|
||||||
+ else:
|
|
||||||
+ raise HTTPException(
|
|
||||||
+ status_code=400,
|
|
||||||
+ detail=f"不支持的导出格式: {request.format},仅支持 xlsx/docx"
|
|
||||||
+ )
|
|
||||||
|
|
||||||
- # 导出为 Excel
|
|
||||||
- output = io.BytesIO()
|
|
||||||
- with pd.ExcelWriter(output, engine='openpyxl') as writer:
|
|
||||||
- df.to_excel(writer, index=False, sheet_name='填写结果')
|
|
||||||
+ except HTTPException:
|
|
||||||
+ raise
|
|
||||||
+ except Exception as e:
|
|
||||||
+ logger.error(f"导出失败: {str(e)}")
|
|
||||||
+ raise HTTPException(status_code=500, detail=f"导出失败: {str(e)}")
|
|
||||||
|
|
||||||
- output.seek(0)
|
|
||||||
|
|
||||||
- # 生成文件名
|
|
||||||
- filename = f"filled_template.{request.format}"
|
|
||||||
+async def _export_to_excel(filled_data: dict, template_id: str) -> StreamingResponse:
|
|
||||||
+ """导出为 Excel 格式"""
|
|
||||||
+ # 将字典转换为单行 DataFrame
|
|
||||||
+ df = pd.DataFrame([filled_data])
|
|
||||||
|
|
||||||
- return StreamingResponse(
|
|
||||||
- io.BytesIO(output.getvalue()),
|
|
||||||
- media_type="application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
|
|
||||||
- headers={"Content-Disposition": f"attachment; filename={filename}"}
|
|
||||||
- )
|
|
||||||
+ output = io.BytesIO()
|
|
||||||
+ with pd.ExcelWriter(output, engine='openpyxl') as writer:
|
|
||||||
+ df.to_excel(writer, index=False, sheet_name='填写结果')
|
|
||||||
|
|
||||||
- except Exception as e:
|
|
||||||
- logger.error(f"导出失败: {str(e)}")
|
|
||||||
- raise HTTPException(status_code=500, detail=f"导出失败: {str(e)}")
|
|
||||||
+ output.seek(0)
|
|
||||||
+
|
|
||||||
+ filename = "filled_template.xlsx"
|
|
||||||
+
|
|
||||||
+ return StreamingResponse(
|
|
||||||
+ io.BytesIO(output.getvalue()),
|
|
||||||
+ media_type="application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
|
|
||||||
+ headers={"Content-Disposition": f"attachment; filename={filename}"}
|
|
||||||
+ )
|
|
||||||
+
|
|
||||||
+
|
|
||||||
+async def _export_to_word(filled_data: dict, template_id: str) -> StreamingResponse:
+    """Export as Word format"""
+    from docx import Document
+    from docx.shared import Pt, RGBColor
+    from docx.enum.text import WD_ALIGN_PARAGRAPH
+
+    doc = Document()
+
+    # Add the title
+    title = doc.add_heading('填写结果', level=1)
+    title.alignment = WD_ALIGN_PARAGRAPH.CENTER
+
+    # Add the export time and template info
+    from datetime import datetime
+    info_para = doc.add_paragraph()
+    info_para.add_run(f"模板ID: {template_id}\n").bold = True
+    info_para.add_run(f"导出时间: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")
+
+    doc.add_paragraph()  # blank line
+
+    # Add the field table
+    table = doc.add_table(rows=1, cols=3)
+    table.style = 'Light Grid Accent 1'
+
+    # Header row
+    header_cells = table.rows[0].cells
+    header_cells[0].text = '字段名'
+    header_cells[1].text = '填写值'
+    header_cells[2].text = '状态'
+
+    for field_name, field_value in filled_data.items():
+        row_cells = table.add_row().cells
+        row_cells[0].text = field_name
+        row_cells[1].text = '' if field_value in (None, '') else str(field_value)  # keep 0/False
+        row_cells[2].text = '为空' if field_value in (None, '') else '已填写'
+
+    # Save to BytesIO
+    output = io.BytesIO()
+    doc.save(output)
+    output.seek(0)
+
+    filename = "filled_template.docx"
+
+    return StreamingResponse(
+        io.BytesIO(output.getvalue()),
+        media_type="application/vnd.openxmlformats-officedocument.wordprocessingml.document",
+        headers={"Content-Disposition": f"attachment; filename={filename}"}
+    )
+
+
+@router.post("/export/excel")
+async def export_to_excel(
+    filled_data: dict,
+    template_id: str = Query(..., description="模板ID")
+):
+    """
+    Export specifically as Excel format
+
+    Args:
+        filled_data: filled-in data
+        template_id: template ID
+
+    Returns:
+        Excel file stream
+    """
+    return await _export_to_excel(filled_data, template_id)
+
+
+@router.post("/export/word")
+async def export_to_word(
+    filled_data: dict,
+    template_id: str = Query(..., description="模板ID")
+):
+    """
+    Export specifically as Word format
+
+    Args:
+        filled_data: filled-in data
+        template_id: template ID
+
+    Returns:
+        Word file stream
+    """
+    return await _export_to_word(filled_data, template_id)
|
|
||||||
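Review note: the `if/elif` dispatch in `export_filled_template` could also be written as a lookup table, so that the list of supported formats in the 400 error message cannot drift from the branches. The following is a minimal stdlib-only sketch, not the project's code. `ExportTarget`, `EXPORT_FORMATS`, `resolve_export` and `content_disposition` are hypothetical names; the media types and file names are copied from the diff, and a real endpoint would presumably map the `ValueError` to an `HTTPException`.

```python
from typing import NamedTuple


class ExportTarget(NamedTuple):
    media_type: str
    filename: str


# Media types and file names copied from the diff above.
EXPORT_FORMATS = {
    "xlsx": ExportTarget(
        "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
        "filled_template.xlsx",
    ),
    "docx": ExportTarget(
        "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
        "filled_template.docx",
    ),
}


def resolve_export(fmt: str) -> ExportTarget:
    """Return the response metadata for a format, or raise ValueError."""
    try:
        return EXPORT_FORMATS[fmt]
    except KeyError:
        # The supported list is derived from the table, so it stays in sync.
        supported = "/".join(EXPORT_FORMATS)
        raise ValueError(
            f"unsupported export format: {fmt}, only {supported} supported"
        ) from None


def content_disposition(target: ExportTarget) -> dict:
    """Build the download header in the same shape the endpoint uses."""
    return {"Content-Disposition": f"attachment; filename={target.filename}"}
```

With this shape, adding a third format would be a one-entry change to `EXPORT_FORMATS` plus its exporter function.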
diff --git a/backend/app/core/document_parser/docx_parser.py b/backend/app/core/document_parser/docx_parser.py
|
|
||||||
index 75e79da..03c341d 100644
|
|
||||||
--- a/backend/app/core/document_parser/docx_parser.py
|
|
||||||
+++ b/backend/app/core/document_parser/docx_parser.py
|
|
||||||
@@ -161,3 +161,133 @@ class DocxParser(BaseParser):
|
|
||||||
fields[field_name] = match.group(1)
|
|
||||||
|
|
||||||
return fields
|
|
||||||
+
|
|
||||||
+ def parse_tables_for_template(
|
|
||||||
+ self,
|
|
||||||
+ file_path: str
|
|
||||||
+ ) -> Dict[str, Any]:
|
|
||||||
+ """
|
|
||||||
+ 解析 Word 文档中的表格,提取模板字段
|
|
||||||
+
|
|
||||||
+ 专门用于比赛场景:解析表格模板,识别需要填写的字段
|
|
||||||
+
|
|
||||||
+ Args:
|
|
||||||
+ file_path: Word 文件路径
|
|
||||||
+
|
|
||||||
+ Returns:
|
|
||||||
+ 包含表格字段信息的字典
|
|
||||||
+ """
|
|
||||||
+ from docx import Document
|
|
||||||
+ from docx.table import Table
|
|
||||||
+ from docx.oxml.ns import qn
|
|
||||||
+
|
|
||||||
+ doc = Document(file_path)
|
|
||||||
+
|
|
||||||
+ template_info = {
|
|
||||||
+ "tables": [],
|
|
||||||
+ "fields": [],
|
|
||||||
+ "field_count": 0
|
|
||||||
+ }
|
|
||||||
+
|
|
||||||
+ for table_idx, table in enumerate(doc.tables):
|
|
||||||
+ table_info = {
|
|
||||||
+ "table_index": table_idx,
|
|
||||||
+ "rows": [],
|
|
||||||
+ "headers": [],
|
|
||||||
+ "data_rows": [],
|
|
||||||
+ "field_hints": {} # 字段名称 -> 提示词/描述
|
|
||||||
+ }
|
|
||||||
+
|
|
||||||
+ # 提取表头(第一行)
|
|
||||||
+ if table.rows:
|
|
||||||
+ header_cells = [cell.text.strip() for cell in table.rows[0].cells]
|
|
||||||
+ table_info["headers"] = header_cells
|
|
||||||
+
|
|
||||||
+ # 提取数据行
|
|
||||||
+ for row_idx, row in enumerate(table.rows[1:], 1):
|
|
||||||
+ row_data = [cell.text.strip() for cell in row.cells]
|
|
||||||
+ table_info["data_rows"].append(row_data)
|
|
||||||
+ table_info["rows"].append({
|
|
||||||
+ "row_index": row_idx,
|
|
||||||
+ "cells": row_data
|
|
||||||
+ })
|
|
||||||
+
|
|
||||||
+ # Try to extract hints from the second/third column
+ # Competition templates usually use the layout: field name | hint | value to fill in
+ if table.rows and len(table.rows[0].cells) >= 2:
|
|
||||||
+ for row_idx, row in enumerate(table.rows[1:], 1):
|
|
||||||
+ cells = [cell.text.strip() for cell in row.cells]
|
|
||||||
+ if len(cells) >= 2 and cells[0]:
|
|
||||||
+ # 第一列是字段名
|
|
||||||
+ field_name = cells[0]
|
|
||||||
+ # 第二列可能是提示词或描述
|
|
||||||
+ hint = cells[1] if len(cells) > 1 else ""
|
|
||||||
+ table_info["field_hints"][field_name] = hint
|
|
||||||
+
|
|
||||||
+ template_info["fields"].append({
|
|
||||||
+ "table_index": table_idx,
|
|
||||||
+ "row_index": row_idx,
|
|
||||||
+ "field_name": field_name,
|
|
||||||
+ "hint": hint,
|
|
||||||
+ "expected_value": cells[2] if len(cells) > 2 else ""
|
|
||||||
+ })
|
|
||||||
+
|
|
||||||
+ template_info["tables"].append(table_info)
|
|
||||||
+
|
|
||||||
+ template_info["field_count"] = len(template_info["fields"])
|
|
||||||
+ return template_info
|
|
||||||
+
|
|
||||||
+ def extract_template_fields_from_docx(
|
|
||||||
+ self,
|
|
||||||
+ file_path: str
|
|
||||||
+ ) -> List[Dict[str, Any]]:
|
|
||||||
+ """
|
|
||||||
+ 从 Word 文档中提取模板字段定义
|
|
||||||
+
|
|
||||||
+ 适用于比赛评分表格:表格第一列是字段名,第二列是提示词/填写示例
|
|
||||||
+
|
|
||||||
+ Args:
|
|
||||||
+ file_path: Word 文件路径
|
|
||||||
+
|
|
||||||
+ Returns:
|
|
||||||
+ 字段定义列表
|
|
||||||
+ """
|
|
||||||
+ template_info = self.parse_tables_for_template(file_path)
|
|
||||||
+
|
|
||||||
+ fields = []
|
|
||||||
+ for field in template_info["fields"]:
|
|
||||||
+ fields.append({
|
|
||||||
+ "cell": f"T{field['table_index']}R{field['row_index']}", # TableXRowY 格式
|
|
||||||
+ "name": field["field_name"],
|
|
||||||
+ "hint": field["hint"],
|
|
||||||
+ "table_index": field["table_index"],
|
|
||||||
+ "row_index": field["row_index"],
|
|
||||||
+ "field_type": self._infer_field_type_from_hint(field["hint"]),
|
|
||||||
+ "required": True
|
|
||||||
+ })
|
|
||||||
+
|
|
||||||
+ return fields
|
|
||||||
+
|
|
||||||
+    def _infer_field_type_from_hint(self, hint: str) -> str:
+        """
+        Infer the field type from the hint text
+
+        Args:
+            hint: field hint text
+
+        Returns:
+            Field type (text/number/date)
+        """
+        hint_lower = hint.lower()
+
+        # Date keywords
+        date_keywords = ["年", "月", "日", "日期", "时间", "出生"]
+        if any(kw in hint for kw in date_keywords):
+            return "date"
+
+        # Number keywords
+        number_keywords = ["数量", "金额", "人数", "面积", "增长", "比率", "%", "率"]
+        if any(kw in hint_lower for kw in number_keywords):
+            return "number"
+
+        return "text"
|
|
||||||
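Review note: the keyword-based inference in `_infer_field_type_from_hint` is order-sensitive, which is easy to miss when reading the diff. The standalone sketch below reuses the keyword lists from the diff; the module-level function name and constants are illustrative only. It drops the `.lower()` call, on the assumption that lowercasing has no effect on these keywords because none of them contains a cased letter.

```python
DATE_KEYWORDS = ["年", "月", "日", "日期", "时间", "出生"]
NUMBER_KEYWORDS = ["数量", "金额", "人数", "面积", "增长", "比率", "%", "率"]


def infer_field_type_from_hint(hint: str) -> str:
    """Classify a hint as "date", "number" or "text" by substring match."""
    # Date keywords are tested first, so they win when both groups match.
    if any(kw in hint for kw in DATE_KEYWORDS):
        return "date"
    if any(kw in hint for kw in NUMBER_KEYWORDS):
        return "number"
    return "text"
```

Because the match is a plain substring test, a hint such as `年增长率` is classified as `date` (it contains `年`) even though it describes a rate. Whether that matters depends on how `field_type` is used downstream, which this diff does not show.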
diff --git a/backend/app/services/template_fill_service.py b/backend/app/services/template_fill_service.py
|
|
||||||
index 2612354..94930fb 100644
|
|
||||||
--- a/backend/app/services/template_fill_service.py
|
|
||||||
+++ b/backend/app/services/template_fill_service.py
|
|
||||||
@@ -4,13 +4,12 @@
|
|
||||||
从非结构化文档中检索信息并填写到表格模板
|
|
||||||
"""
|
|
||||||
import logging
|
|
||||||
-from dataclasses import dataclass
|
|
||||||
+from dataclasses import dataclass, field
|
|
||||||
from typing import Any, Dict, List, Optional
|
|
||||||
|
|
||||||
from app.core.database import mongodb
|
|
||||||
-from app.services.rag_service import rag_service
|
|
||||||
from app.services.llm_service import llm_service
|
|
||||||
-from app.services.excel_storage_service import excel_storage_service
|
|
||||||
+from app.core.document_parser import ParserFactory
|
|
||||||
|
|
||||||
logger = logging.getLogger(__name__)
|
|
||||||
|
|
||||||
@@ -22,6 +21,17 @@ class TemplateField:
|
|
||||||
name: str # 字段名称
|
|
||||||
field_type: str = "text" # 字段类型: text/number/date
|
|
||||||
required: bool = True
|
|
||||||
+ hint: str = "" # 字段提示词
|
|
||||||
+
|
|
||||||
+
|
|
||||||
+@dataclass
|
|
||||||
+class SourceDocument:
|
|
||||||
+ """源文档"""
|
|
||||||
+ doc_id: str
|
|
||||||
+ filename: str
|
|
||||||
+ doc_type: str
|
|
||||||
+ content: str = ""
|
|
||||||
+ structured_data: Dict[str, Any] = field(default_factory=dict)
|
|
||||||
|
|
||||||
|
|
||||||
@dataclass
|
|
||||||
@@ -38,12 +48,12 @@ class TemplateFillService:
|
|
||||||
|
|
||||||
def __init__(self):
|
|
||||||
self.llm = llm_service
|
|
||||||
- self.rag = rag_service
|
|
||||||
|
|
||||||
async def fill_template(
|
|
||||||
self,
|
|
||||||
template_fields: List[TemplateField],
|
|
||||||
source_doc_ids: Optional[List[str]] = None,
|
|
||||||
+ source_file_paths: Optional[List[str]] = None,
|
|
||||||
user_hint: Optional[str] = None
|
|
||||||
) -> Dict[str, Any]:
|
|
||||||
"""
|
|
||||||
@@ -51,7 +61,8 @@ class TemplateFillService:
|
|
||||||
|
|
||||||
Args:
|
|
||||||
template_fields: 模板字段列表
|
|
||||||
- source_doc_ids: 源文档ID列表,不指定则从所有文档检索
|
|
||||||
+ source_doc_ids: 源文档 MongoDB ID 列表
|
|
||||||
+ source_file_paths: 源文档文件路径列表
|
|
||||||
user_hint: 用户提示(如"请从合同文档中提取")
|
|
||||||
|
|
||||||
Returns:
|
|
||||||
@@ -60,28 +71,23 @@ class TemplateFillService:
|
|
||||||
filled_data = {}
|
|
||||||
fill_details = []
|
|
||||||
|
|
||||||
+ # 1. 加载源文档内容
|
|
||||||
+ source_docs = await self._load_source_documents(source_doc_ids, source_file_paths)
|
|
||||||
+
|
|
||||||
+ if not source_docs:
|
|
||||||
+ logger.warning("没有找到源文档,填表结果将全部为空")
|
|
||||||
+
|
|
||||||
+ # 2. 对每个字段进行提取
|
|
||||||
for field in template_fields:
|
|
||||||
try:
|
|
||||||
- # 1. 从 RAG 检索相关上下文
|
|
||||||
- rag_results = await self._retrieve_context(field.name, user_hint)
|
|
||||||
-
|
|
||||||
- if not rag_results:
|
|
||||||
- # 如果没有检索到结果,尝试直接询问 LLM
|
|
||||||
- result = FillResult(
|
|
||||||
- field=field.name,
|
|
||||||
- value="",
|
|
||||||
- source="未找到相关数据",
|
|
||||||
- confidence=0.0
|
|
||||||
- )
|
|
||||||
- else:
|
|
||||||
- # 2. 构建 Prompt 让 LLM 提取信息
|
|
||||||
- result = await self._extract_field_value(
|
|
||||||
- field=field,
|
|
||||||
- rag_context=rag_results,
|
|
||||||
- user_hint=user_hint
|
|
||||||
- )
|
|
||||||
-
|
|
||||||
- # 3. 存储结果
|
|
||||||
+ # 从源文档中提取字段值
|
|
||||||
+ result = await self._extract_field_value(
|
|
||||||
+ field=field,
|
|
||||||
+ source_docs=source_docs,
|
|
||||||
+ user_hint=user_hint
|
|
||||||
+ )
|
|
||||||
+
|
|
||||||
+ # 存储结果
|
|
||||||
filled_data[field.name] = result.value
|
|
||||||
fill_details.append({
|
|
||||||
"field": field.name,
|
|
||||||
@@ -107,75 +113,113 @@ class TemplateFillService:
|
|
||||||
return {
|
|
||||||
"success": True,
|
|
||||||
"filled_data": filled_data,
|
|
||||||
- "fill_details": fill_details
|
|
||||||
+ "fill_details": fill_details,
|
|
||||||
+ "source_doc_count": len(source_docs)
|
|
||||||
}
|
|
||||||
|
|
||||||
- async def _retrieve_context(
|
|
||||||
+ async def _load_source_documents(
|
|
||||||
self,
|
|
||||||
- field_name: str,
|
|
||||||
- user_hint: Optional[str] = None
|
|
||||||
- ) -> List[Dict[str, Any]]:
|
|
||||||
+ source_doc_ids: Optional[List[str]] = None,
|
|
||||||
+ source_file_paths: Optional[List[str]] = None
|
|
||||||
+ ) -> List[SourceDocument]:
|
|
||||||
"""
|
|
||||||
- 从 RAG 检索相关上下文
|
|
||||||
+ 加载源文档内容
|
|
||||||
|
|
||||||
Args:
|
|
||||||
- field_name: 字段名称
|
|
||||||
- user_hint: 用户提示
|
|
||||||
+ source_doc_ids: MongoDB 文档 ID 列表
|
|
||||||
+ source_file_paths: 源文档文件路径列表
|
|
||||||
|
|
||||||
Returns:
|
|
||||||
- 检索结果列表
|
|
||||||
+ 源文档列表
|
|
||||||
"""
|
|
||||||
- # 构建查询文本
|
|
||||||
- query = field_name
|
|
||||||
- if user_hint:
|
|
||||||
- query = f"{user_hint} {field_name}"
|
|
||||||
-
|
|
||||||
- # 检索相关文档片段
|
|
||||||
- results = self.rag.retrieve(query=query, top_k=5)
|
|
||||||
-
|
|
||||||
- return results
|
|
||||||
+ source_docs = []
|
|
||||||
+
|
|
||||||
+ # 1. 从 MongoDB 加载文档
|
|
||||||
+ if source_doc_ids:
|
|
||||||
+ for doc_id in source_doc_ids:
|
|
||||||
+ try:
|
|
||||||
+ doc = await mongodb.get_document(doc_id)
|
|
||||||
+ if doc:
|
|
||||||
+ source_docs.append(SourceDocument(
|
|
||||||
+ doc_id=doc_id,
|
|
||||||
+ filename=doc.get("metadata", {}).get("original_filename", "unknown"),
|
|
||||||
+ doc_type=doc.get("doc_type", "unknown"),
|
|
||||||
+ content=doc.get("content", ""),
|
|
||||||
+ structured_data=doc.get("structured_data", {})
|
|
||||||
+ ))
|
|
||||||
+ logger.info(f"从MongoDB加载文档: {doc_id}")
|
|
||||||
+ except Exception as e:
|
|
||||||
+ logger.error(f"从MongoDB加载文档失败 {doc_id}: {str(e)}")
|
|
||||||
+
|
|
||||||
+ # 2. 从文件路径加载文档
|
|
||||||
+ if source_file_paths:
|
|
||||||
+ for file_path in source_file_paths:
|
|
||||||
+ try:
|
|
||||||
+ parser = ParserFactory.get_parser(file_path)
|
|
||||||
+ result = parser.parse(file_path)
|
|
||||||
+ if result.success:
|
|
||||||
+ source_docs.append(SourceDocument(
|
|
||||||
+ doc_id=file_path,
|
|
||||||
+ filename=result.metadata.get("filename", file_path.split("/")[-1]),
|
|
||||||
+ doc_type=result.metadata.get("extension", "unknown").replace(".", ""),
|
|
||||||
+ content=result.data.get("content", ""),
|
|
||||||
+ structured_data=result.data.get("structured_data", {})
|
|
||||||
+ ))
|
|
||||||
+ logger.info(f"从文件加载文档: {file_path}")
|
|
||||||
+ except Exception as e:
|
|
||||||
+ logger.error(f"从文件加载文档失败 {file_path}: {str(e)}")
|
|
||||||
+
|
|
||||||
+ return source_docs
|
|
||||||
|
|
||||||
async def _extract_field_value(
|
|
||||||
self,
|
|
||||||
field: TemplateField,
|
|
||||||
- rag_context: List[Dict[str, Any]],
|
|
||||||
+ source_docs: List[SourceDocument],
|
|
||||||
user_hint: Optional[str] = None
|
|
||||||
) -> FillResult:
|
|
||||||
"""
|
|
||||||
- 使用 LLM 从上下文中提取字段值
|
|
||||||
+ 使用 LLM 从源文档中提取字段值
|
|
||||||
|
|
||||||
Args:
|
|
||||||
field: 字段定义
|
|
||||||
- rag_context: RAG 检索到的上下文
|
|
||||||
+ source_docs: 源文档列表
|
|
||||||
user_hint: 用户提示
|
|
||||||
|
|
||||||
Returns:
|
|
||||||
提取结果
|
|
||||||
"""
|
|
||||||
+ if not source_docs:
|
|
||||||
+ return FillResult(
|
|
||||||
+ field=field.name,
|
|
||||||
+ value="",
|
|
||||||
+ source="无源文档",
|
|
||||||
+ confidence=0.0
|
|
||||||
+ )
|
|
||||||
+
|
|
||||||
# 构建上下文文本
|
|
||||||
- context_text = "\n\n".join([
|
|
||||||
- f"【文档 {i+1}】\n{doc['content']}"
|
|
||||||
- for i, doc in enumerate(rag_context)
|
|
||||||
- ])
|
|
||||||
+ context_text = self._build_context_text(source_docs, max_length=8000)
|
|
||||||
+
|
|
||||||
+ # 构建提示词
|
|
||||||
+ hint_text = field.hint if field.hint else f"请提取{field.name}的信息"
|
|
||||||
+ if user_hint:
|
|
||||||
+ hint_text = f"{user_hint}。{hint_text}"
|
|
||||||
|
|
||||||
- # 构建 Prompt
|
|
||||||
- prompt = f"""你是一个数据提取专家。请根据以下文档内容,提取指定字段的信息。
|
|
||||||
+ prompt = f"""你是一个专业的数据提取专家。请根据以下文档内容,提取指定字段的信息。
|
|
||||||
|
|
||||||
需要提取的字段:
|
|
||||||
- 字段名称:{field.name}
|
|
||||||
- 字段类型:{field.field_type}
|
|
||||||
+- 填写提示:{hint_text}
|
|
||||||
- 是否必填:{'是' if field.required else '否'}
|
|
||||||
|
|
||||||
-{'用户提示:' + user_hint if user_hint else ''}
|
|
||||||
-
|
|
||||||
参考文档内容:
|
|
||||||
{context_text}
|
|
||||||
|
|
||||||
请严格按照以下 JSON 格式输出,不要添加任何解释:
|
|
||||||
{{
|
|
||||||
"value": "提取到的值,如果没有找到则填写空字符串",
|
|
||||||
- "source": "数据来源的文档描述",
|
|
||||||
- "confidence": 0.0到1.0之间的置信度
|
|
||||||
+ "source": "数据来源的文档描述(如:来自xxx文档)",
|
|
||||||
+ "confidence": 0.0到1.0之间的置信度,表示对提取结果的信心程度
|
|
||||||
}}
|
|
||||||
"""
|
|
||||||
|
|
||||||
@@ -226,6 +270,54 @@ class TemplateFillService:
|
|
||||||
confidence=0.0
|
|
||||||
)
|
|
||||||
|
|
||||||
+ def _build_context_text(self, source_docs: List[SourceDocument], max_length: int = 8000) -> str:
|
|
||||||
+ """
|
|
||||||
+ 构建上下文文本
|
|
||||||
+
|
|
||||||
+ Args:
|
|
||||||
+ source_docs: 源文档列表
|
|
||||||
+ max_length: 最大字符数
|
|
||||||
+
|
|
||||||
+ Returns:
|
|
||||||
+ 上下文文本
|
|
||||||
+ """
|
|
||||||
+ contexts = []
|
|
||||||
+ total_length = 0
|
|
||||||
+
|
|
||||||
+ for doc in source_docs:
|
|
||||||
+ # 优先使用结构化数据(表格),其次使用文本内容
|
|
||||||
+ doc_content = ""
|
|
||||||
+
|
|
||||||
+ if doc.structured_data and doc.structured_data.get("tables"):
|
|
||||||
+ # 如果有表格数据,优先使用
|
|
||||||
+ tables = doc.structured_data.get("tables", [])
|
|
||||||
+ for table in tables:
|
|
||||||
+ if isinstance(table, dict):
|
|
||||||
+ rows = table.get("rows", [])
|
|
||||||
+ if rows:
|
|
||||||
+ doc_content += f"\n【文档: {doc.filename} 表格数据】\n"
|
|
||||||
+ for row in rows[:20]: # 限制每表最多20行
|
|
||||||
+ if isinstance(row, list):
|
|
||||||
+ doc_content += " | ".join(str(cell) for cell in row) + "\n"
|
|
||||||
+ elif isinstance(row, dict):
|
|
||||||
+ doc_content += " | ".join(str(v) for v in row.values()) + "\n"
|
|
||||||
+ elif doc.content:
|
|
||||||
+ doc_content = doc.content[:5000] # 限制文本长度
|
|
||||||
+
|
|
||||||
+ if doc_content:
|
|
||||||
+ doc_context = f"【文档: {doc.filename} ({doc.doc_type})】\n{doc_content}"
|
|
||||||
+ if total_length + len(doc_context) <= max_length:
|
|
||||||
+ contexts.append(doc_context)
|
|
||||||
+ total_length += len(doc_context)
|
|
||||||
+ else:
|
|
||||||
+ # 如果超出长度,截断
|
|
||||||
+ remaining = max_length - total_length
|
|
||||||
+ if remaining > 100:
|
|
||||||
+ contexts.append(doc_context[:remaining])
|
|
||||||
+ break
|
|
||||||
+
|
|
||||||
+ return "\n\n".join(contexts) if contexts else "(源文档内容为空)"
|
|
||||||
+
|
|
||||||
async def get_template_fields_from_file(
|
|
||||||
self,
|
|
||||||
file_path: str,
|
|
||||||
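Review note: the character-budget packing in `_build_context_text` can be isolated from the document handling to make its edge cases easier to reason about. The sketch below is a simplified stand-in, not the service code. `build_context` and `min_tail` are hypothetical names, the fallback string is an English rendering of the original placeholder, and it assumes packing stops at the first block that does not fit (the indentation of the `break` is not recoverable from this view).

```python
from typing import List


def build_context(blocks: List[str], max_length: int = 8000, min_tail: int = 100) -> str:
    """Join blocks until the character budget is used up.

    A block that does not fit is truncated to the remaining budget, but only
    if more than `min_tail` characters remain; packing stops at that point.
    """
    kept: List[str] = []
    total = 0
    for block in blocks:
        if total + len(block) <= max_length:
            kept.append(block)
            total += len(block)
        else:
            remaining = max_length - total
            if remaining > min_tail:
                kept.append(block[:remaining])
            break
    # Separators are not counted against the budget, so the joined string can
    # exceed max_length by two characters per separator.
    return "\n\n".join(kept) if kept else "(source documents are empty)"
```

One consequence of this design is that the same truncated context is presumably rebuilt for every field, since `_extract_field_value` calls the builder once per field; computing it once per `fill_template` call might be worth considering.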
@@ -236,7 +328,7 @@ class TemplateFillService:
|
|
||||||
|
|
||||||
Args:
|
|
||||||
file_path: 模板文件路径
|
|
||||||
- file_type: 文件类型
|
|
||||||
+ file_type: 文件类型 (xlsx/xls/docx)
|
|
||||||
|
|
||||||
Returns:
|
|
||||||
字段列表
|
|
||||||
@@ -245,43 +337,108 @@ class TemplateFillService:
|
|
||||||
|
|
||||||
try:
|
|
||||||
if file_type in ["xlsx", "xls"]:
|
|
||||||
- # 从 Excel 读取表头
|
|
||||||
- import pandas as pd
|
|
||||||
- df = pd.read_excel(file_path, nrows=5)
|
|
||||||
+ fields = await self._get_template_fields_from_excel(file_path)
|
|
||||||
+ elif file_type == "docx":
|
|
||||||
+ fields = await self._get_template_fields_from_docx(file_path)
|
|
||||||
|
|
||||||
- for idx, col in enumerate(df.columns):
|
|
||||||
- # 获取单元格位置 (A, B, C, ...)
|
|
||||||
- cell = self._column_to_cell(idx)
|
|
||||||
+ except Exception as e:
|
|
||||||
+ logger.error(f"提取模板字段失败: {str(e)}")
|
|
||||||
|
|
||||||
- fields.append(TemplateField(
|
|
||||||
- cell=cell,
|
|
||||||
- name=str(col),
|
|
||||||
- field_type=self._infer_field_type(df[col]),
|
|
||||||
- required=True
|
|
||||||
- ))
|
|
||||||
+ return fields
|
|
||||||
|
|
||||||
- elif file_type == "docx":
|
|
||||||
- # 从 Word 表格读取
|
|
||||||
- from docx import Document
|
|
||||||
- doc = Document(file_path)
|
|
||||||
-
|
|
||||||
- for table_idx, table in enumerate(doc.tables):
|
|
||||||
- for row_idx, row in enumerate(table.rows):
|
|
||||||
- for col_idx, cell in enumerate(row.cells):
|
|
||||||
- cell_text = cell.text.strip()
|
|
||||||
- if cell_text:
|
|
||||||
- fields.append(TemplateField(
|
|
||||||
- cell=self._column_to_cell(col_idx),
|
|
||||||
- name=cell_text,
|
|
||||||
- field_type="text",
|
|
||||||
- required=True
|
|
||||||
- ))
|
|
||||||
+ async def _get_template_fields_from_excel(self, file_path: str) -> List[TemplateField]:
|
|
||||||
+ """从 Excel 模板提取字段"""
|
|
||||||
+ fields = []
|
|
||||||
+
|
|
||||||
+ try:
|
|
||||||
+ import pandas as pd
|
|
||||||
+ df = pd.read_excel(file_path, nrows=5)
|
|
||||||
+
|
|
||||||
+ for idx, col in enumerate(df.columns):
|
|
||||||
+ cell = self._column_to_cell(idx)
|
|
||||||
+ col_str = str(col)
|
|
||||||
+
|
|
||||||
+ fields.append(TemplateField(
|
|
||||||
+ cell=cell,
|
|
||||||
+ name=col_str,
|
|
||||||
+ field_type=self._infer_field_type_from_value(df[col].iloc[0] if len(df) > 0 else ""),
|
|
||||||
+ required=True,
|
|
||||||
+ hint=""
|
|
||||||
+ ))
|
|
||||||
|
|
||||||
except Exception as e:
|
|
||||||
- logger.error(f"提取模板字段失败: {str(e)}")
|
|
||||||
+ logger.error(f"从Excel提取字段失败: {str(e)}")
|
|
||||||
|
|
||||||
return fields
|
|
||||||
|
|
||||||
+ async def _get_template_fields_from_docx(self, file_path: str) -> List[TemplateField]:
|
|
||||||
+ """从 Word 模板提取字段"""
|
|
||||||
+ fields = []
|
|
||||||
+
|
|
||||||
+ try:
|
|
||||||
+ from docx import Document
|
|
||||||
+
|
|
||||||
+ doc = Document(file_path)
|
|
||||||
+
|
|
||||||
+ for table_idx, table in enumerate(doc.tables):
|
|
||||||
+ for row_idx, row in enumerate(table.rows):
|
|
||||||
+ cells = [cell.text.strip() for cell in row.cells]
|
|
||||||
+
|
|
||||||
+ # 假设第一列是字段名
|
|
||||||
+ if cells and cells[0]:
|
|
||||||
+ field_name = cells[0]
|
|
||||||
+ hint = cells[1] if len(cells) > 1 else ""
|
|
||||||
+
|
|
||||||
+ # 跳过空行或标题行
|
|
||||||
+ if field_name and field_name not in ["", "字段名", "名称", "项目"]:
|
|
||||||
+ fields.append(TemplateField(
|
|
||||||
+ cell=f"T{table_idx}R{row_idx}",
|
|
||||||
+ name=field_name,
|
|
||||||
+ field_type=self._infer_field_type_from_hint(hint),
|
|
||||||
+ required=True,
|
|
||||||
+ hint=hint
|
|
||||||
+ ))
|
|
||||||
+
|
|
||||||
+ except Exception as e:
|
|
||||||
+ logger.error(f"从Word提取字段失败: {str(e)}")
|
|
||||||
+
|
|
||||||
+ return fields
|
|
||||||
+
|
|
||||||
+ def _infer_field_type_from_hint(self, hint: str) -> str:
|
|
||||||
+ """从提示词推断字段类型"""
|
|
||||||
+ hint_lower = hint.lower()
|
|
||||||
+
|
|
||||||
+ date_keywords = ["年", "月", "日", "日期", "时间", "出生"]
|
|
||||||
+ if any(kw in hint for kw in date_keywords):
|
|
||||||
+ return "date"
|
|
||||||
+
|
|
||||||
+ number_keywords = ["数量", "金额", "人数", "面积", "增长", "比率", "%", "率", "总计", "合计"]
|
|
||||||
+ if any(kw in hint_lower for kw in number_keywords):
|
|
||||||
+ return "number"
|
|
||||||
+
|
|
||||||
+ return "text"
|
|
||||||
+
|
|
||||||
+ def _infer_field_type_from_value(self, value: Any) -> str:
|
|
||||||
+ """从示例值推断字段类型"""
|
|
||||||
+ if value is None or value == "":
|
|
||||||
+ return "text"
|
|
||||||
+
|
|
||||||
+ value_str = str(value)
|
|
||||||
+
|
|
||||||
+ # 检查日期模式
|
|
||||||
+ import re
|
|
||||||
+ if re.search(r'\d{4}[年/-]\d{1,2}[月/-]\d{1,2}', value_str):
|
|
||||||
+ return "date"
|
|
||||||
+
|
|
||||||
+ # 检查数值
|
|
||||||
+ try:
|
|
||||||
+ float(value_str.replace(',', '').replace('%', ''))
|
|
||||||
+ return "number"
|
|
||||||
+ except ValueError:
|
|
||||||
+ pass
|
|
||||||
+
|
|
||||||
+ return "text"
|
|
||||||
+
|
|
||||||
def _column_to_cell(self, col_idx: int) -> str:
|
|
||||||
"""将列索引转换为单元格列名 (0 -> A, 1 -> B, ...)"""
|
|
||||||
result = ""
|
|
||||||
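Review note: only the first and last lines of `_column_to_cell` are visible in this diff, so the loop body below is an assumption that is consistent with the visible `col_idx = col_idx // 26 - 1` step. It is the usual conversion from a 0-based index to Excel column letters, written as a free function for illustration.

```python
def column_to_cell(col_idx: int) -> str:
    """Convert a 0-based column index to Excel letters (0 -> A, 26 -> AA)."""
    if col_idx < 0:
        raise ValueError("col_idx must be non-negative")
    result = ""
    while col_idx >= 0:
        # Take the lowest base-26 "digit", then shift. The -1 accounts for
        # Excel columns having no zero digit (Z is followed by AA, not BA).
        result = chr(ord("A") + col_idx % 26) + result
        col_idx = col_idx // 26 - 1
    return result
```

Note that the Word path in this change uses a different cell label, `T{table_idx}R{row_idx}`, so consumers of `TemplateField.cell` have to handle both formats.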
@@ -290,17 +447,6 @@ class TemplateFillService:
|
|
||||||
col_idx = col_idx // 26 - 1
|
|
||||||
return result
|
|
||||||
|
|
||||||
- def _infer_field_type(self, series) -> str:
|
|
||||||
- """推断字段类型"""
|
|
||||||
- import pandas as pd
|
|
||||||
-
|
|
||||||
- if pd.api.types.is_numeric_dtype(series):
|
|
||||||
- return "number"
|
|
||||||
- elif pd.api.types.is_datetime64_any_dtype(series):
|
|
||||||
- return "date"
|
|
||||||
- else:
|
|
||||||
- return "text"
|
|
||||||
-
|
|
||||||
|
|
||||||
# ==================== 全局单例 ====================
|
|
||||||
|
|
||||||
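Review note: `_infer_field_type_from_value` receives `df[col].iloc[0]`, and pandas reports an empty cell as a float NaN. In the version in this diff, `str(nan)` is `"nan"`, which `float()` accepts, so an empty sample cell would likely be classified as `number`. The sketch below shows one possible guard. It is a standalone variant for illustration, with the same regex and parsing steps as the diff; the module-level names are not from the project.

```python
import math
import re
from typing import Any

DATE_PATTERN = re.compile(r"\d{4}[年/-]\d{1,2}[月/-]\d{1,2}")


def infer_field_type_from_value(value: Any) -> str:
    """Classify a sample cell value as "date", "number" or "text"."""
    # Treat None, empty strings and float NaN (how pandas reports an empty
    # cell) as "no sample", which falls back to "text".
    if value is None or value == "":
        return "text"
    if isinstance(value, float) and math.isnan(value):
        return "text"

    value_str = str(value)
    if DATE_PATTERN.search(value_str):
        return "date"
    try:
        # Thousands separators and percent signs are stripped before parsing.
        float(value_str.replace(",", "").replace("%", ""))
        return "number"
    except ValueError:
        return "text"
```

The removed `_infer_field_type` inspected the dtype of the whole column, while the new helper looks at a single sample value; the guard above only addresses the NaN case, not that broader change in behavior.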
@@ -1,53 +0,0 @@
|
|||||||
diff --git a/frontend/src/db/backend-api.ts b/frontend/src/db/backend-api.ts
|
|
||||||
index 8944353..94ac852 100644
|
|
||||||
--- a/frontend/src/db/backend-api.ts
|
|
||||||
+++ b/frontend/src/db/backend-api.ts
|
|
||||||
@@ -92,6 +92,7 @@ export interface TemplateField {
   name: string;
   field_type: string;
   required: boolean;
+  hint?: string;
 }

 // Template fill result
@@ -625,7 +626,10 @@ export const backendApi = {
    */
   async fillTemplate(
     templateId: string,
-    templateFields: TemplateField[]
+    templateFields: TemplateField[],
+    sourceDocIds?: string[],
+    sourceFilePaths?: string[],
+    userHint?: string
   ): Promise<FillResult> {
     const url = `${BACKEND_BASE_URL}/templates/fill`;

@@ -636,6 +640,9 @@ export const backendApi = {
       body: JSON.stringify({
         template_id: templateId,
         template_fields: templateFields,
+        source_doc_ids: sourceDocIds || [],
+        source_file_paths: sourceFilePaths || [],
+        user_hint: userHint || null,
       }),
     });

diff --git a/frontend/src/pages/TemplateFill.tsx b/frontend/src/pages/TemplateFill.tsx
|
|
||||||
index 8c330a9..f9a4a39 100644
|
|
||||||
--- a/frontend/src/pages/TemplateFill.tsx
|
|
||||||
+++ b/frontend/src/pages/TemplateFill.tsx
|
|
||||||
@@ -128,8 +128,12 @@ const TemplateFill: React.FC = () => {
     setStep('filling');

     try {
-      // Call the backend fill endpoint
-      const result = await backendApi.fillTemplate('temp-template-id', templateFields);
+      // Call the backend fill endpoint, passing the selected document IDs
+      const result = await backendApi.fillTemplate(
+        'temp-template-id',
+        templateFields,
+        selectedDocs  // pass the list of source document IDs
+      );
       setFilledResult(result);
       setStep('preview');
       toast.success('表格填写完成');
@@ -1,221 +0,0 @@
|
|||||||
diff --git "a/\346\257\224\350\265\233\345\244\207\350\265\233\350\247\204\345\210\222.md" "b/\346\257\224\350\265\233\345\244\207\350\265\233\350\247\204\345\210\222.md"
|
|
||||||
index bcb48fd..440a12d 100644
|
|
||||||
--- "a/\346\257\224\350\265\233\345\244\207\350\265\233\350\247\204\345\210\222.md"
|
|
||||||
+++ "b/\346\257\224\350\265\233\345\244\207\350\265\233\350\247\204\345\210\222.md"
|
|
||||||
@@ -50,7 +50,7 @@
 | `prompt_service.py` | ✅ Done | Prompt template management |
 | `text_analysis_service.py` | ✅ Done | Text analysis |
 | `chart_generator_service.py` | ✅ Done | Chart generation service |
-| `template_fill_service.py` | ❌ Not done | Template fill service |
+| `template_fill_service.py` | ✅ Done | Template fill service; fills tables by reading source documents directly |

 ### 2.2 API endpoints (`backend/app/api/endpoints/`)

@@ -61,7 +61,7 @@
 | `ai_analyze.py` | `/api/v1/analyze/*` | ✅ AI analysis (Excel, Markdown, streaming) |
 | `rag.py` | `/api/v1/rag/*` | ⚠️ RAG retrieval (currently returns empty results) |
 | `tasks.py` | `/api/v1/tasks/*` | ✅ Async task status query |
-| `templates.py` | `/api/v1/templates/*` | ✅ Template management |
+| `templates.py` | `/api/v1/templates/*` | ✅ Template management (including Word export) |
 | `visualization.py` | `/api/v1/visualization/*` | ✅ Visualization charts |
 | `health.py` | `/api/v1/health` | ✅ Health check |

@@ -78,8 +78,8 @@
 |------|----------|------|
 | Excel (.xlsx/.xls) | ✅ Done | pandas + XML fallback parsing |
 | Markdown (.md) | ✅ Done | Regex + AI section splitting |
-| Word (.docx) | ❌ Not done | Not yet implemented |
-| Text (.txt) | ❌ Not done | Not yet implemented |
+| Word (.docx) | ✅ Done | python-docx parsing; supports table extraction and field recognition |
+| Text (.txt) | ✅ Done | chardet encoding detection; supports text cleaning and structured extraction |

 ---

@@ -87,7 +87,7 @@

 ### 3.1 Template fill module (top priority)

-**This is the competition's core evaluated feature and must be completed.**
+**Current status**: ✅ Done

 ```
 User uploads a template table (Word/Excel)
@@ -103,30 +103,34 @@ AI extracts information from the source data based on field hints
 Returns the filled-in table
 ```

-**To be implemented**:
-- [ ] `template_fill_service.py` - core template fill service
-- [ ] Word template parsing (`docx_parser.py` needs to be created)
-- [ ] Text template parsing (`txt_parser.py` needs to be created)
-- [ ] Template field recognition and hint extraction
-- [ ] Multi-document data aggregation and conflict handling
-- [ ] Export results to Word/Excel
+**Implemented**:
+- [x] `template_fill_service.py` - core template fill service
+- [x] Word template parsing (`docx_parser.py` - parse_tables_for_template, extract_template_fields_from_docx)
+- [x] Text template parsing (`txt_parser.py` - done)
+- [x] Template field recognition and hint extraction
+- [x] Multi-document data aggregation and conflict handling
+- [x] Export results to Word/Excel

 ### 3.2 Word document parsing

-**当前状态**:仅有框架,尚未实现具体解析逻辑
|
|
||||||
+**当前状态**:✅ 已完成
|
|
||||||
|
|
||||||
-**需要实现**:
|
|
||||||
-- [ ] `docx_parser.py` - Word 文档解析器
|
|
||||||
-- [ ] 提取段落文本
|
|
||||||
-- [ ] 提取表格内容
|
|
||||||
-- [ ] 提取关键信息(标题、列表等)
|
|
||||||
+**已实现功能**:
|
|
||||||
+- [x] `docx_parser.py` - Word 文档解析器
|
|
||||||
+- [x] 提取段落文本
|
|
||||||
+- [x] 提取表格内容
|
|
||||||
+- [x] 提取关键信息(标题、列表等)
|
|
||||||
+- [x] 表格模板字段提取 (`parse_tables_for_template`, `extract_template_fields_from_docx`)
|
|
||||||
+- [x] 字段类型推断 (`_infer_field_type_from_hint`)
|
|
||||||
|
|
||||||
### 3.3 Text 文档解析
|
|
||||||
|
|
||||||
-**需要实现**:
|
|
||||||
-- [ ] `txt_parser.py` - 文本文件解析器
|
|
||||||
-- [ ] 编码自动检测
|
|
||||||
-- [ ] 文本清洗
|
|
||||||
+**当前状态**:✅ 已完成
|
|
||||||
+
|
|
||||||
+**已实现功能**:
|
|
||||||
+- [x] `txt_parser.py` - 文本文件解析器
|
|
||||||
+- [x] 编码自动检测 (chardet)
|
|
||||||
+- [x] 文本清洗
|
|
||||||
|
|
||||||
### 3.4 文档模板匹配(已有框架)
|
|
||||||
|
|
||||||
@@ -215,5 +219,122 @@ docs/test/
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
-*文档版本: v1.0*
|
|
||||||
-*最后更新: 2026-04-08*
|
|
||||||
\ No newline at end of file
|
|
||||||
+*文档版本: v1.1*
|
|
||||||
+*最后更新: 2026-04-08*
|
|
||||||
+
|
|
||||||
+---
|
|
||||||
+
|
|
||||||
+## 八、技术实现细节
|
|
||||||
+
|
|
||||||
+### 8.1 模板填表流程(已实现)
|
|
||||||
+
|
|
||||||
+#### 流程图
|
|
||||||
+```
|
|
||||||
+┌─────────────┐ ┌─────────────┐ ┌─────────────┐
|
|
||||||
+│ 上传模板 │ ──► │ 选择数据源 │ ──► │ AI 智能填表 │
|
|
||||||
+└─────────────┘ └─────────────┘ └─────────────┘
|
|
||||||
+ │
|
|
||||||
+ ▼
|
|
||||||
+ ┌─────────────┐
|
|
||||||
+ │ 导出结果 │
|
|
||||||
+ └─────────────┘
|
|
||||||
+```
|
|
||||||
+
|
|
||||||
+#### 核心组件
|
|
||||||
+
|
|
||||||
+| 组件 | 文件 | 说明 |
|
|
||||||
+|------|------|------|
|
|
||||||
+| 模板上传 | `templates.py` `/templates/upload` | 接收模板文件,提取字段 |
|
|
||||||
+| 字段提取 | `template_fill_service.py` | 从 Word/Excel 表格提取字段定义 |
|
|
||||||
+| 文档解析 | `docx_parser.py`, `xlsx_parser.py`, `txt_parser.py` | 解析源文档内容 |
|
|
||||||
+| 智能填表 | `template_fill_service.py` `fill_template()` | 使用 LLM 从源文档提取信息 |
|
|
||||||
+| 结果导出 | `templates.py` `/templates/export` | 导出为 Excel 或 Word |
|
|
||||||
+
|
|
||||||
+### 8.2 源文档加载方式
|
|
||||||
+
|
|
||||||
+模板填表服务支持两种方式加载源文档:
|
|
||||||
+
|
|
||||||
+1. **通过 MongoDB 文档 ID**:`source_doc_ids`
|
|
||||||
+ - 文档已上传并存入 MongoDB
|
|
||||||
+ - 服务直接查询 MongoDB 获取文档内容
|
|
||||||
+
|
|
||||||
+2. **通过文件路径**:`source_file_paths`
|
|
||||||
+ - 直接读取本地文件
|
|
||||||
+ - 使用对应的解析器解析内容
|
|
||||||
+
|
|
||||||
+### 8.3 Word 表格模板解析
|
|
||||||
+
|
|
||||||
+比赛评分表格通常是 Word 格式,`docx_parser.py` 提供了专门的解析方法:
|
|
||||||
+
|
|
||||||
+```python
|
|
||||||
+# 提取表格模板字段
|
|
||||||
+fields = docx_parser.extract_template_fields_from_docx(file_path)
|
|
||||||
+
|
|
||||||
+# 返回格式
|
|
||||||
+# [
|
|
||||||
+# {
|
|
||||||
+# "cell": "T0R1", # 表格0,行1
|
|
||||||
+# "name": "字段名",
|
|
||||||
+# "hint": "提示词",
|
|
||||||
+# "field_type": "text/number/date",
|
|
||||||
+# "required": True
|
|
||||||
+# },
|
|
||||||
+# ...
|
|
||||||
+# ]
|
|
||||||
+```
|
|
||||||
+
|
|
||||||
+### 8.4 字段类型推断
|
|
||||||
+
|
|
||||||
+系统支持从提示词自动推断字段类型:
|
|
||||||
+
|
|
||||||
+| 关键词 | 推断类型 | 示例 |
|
|
||||||
+|--------|----------|------|
|
|
||||||
+| 年、月、日、日期、时间、出生 | date | 出生日期 |
|
|
||||||
+| 数量、金额、比率、%、率、合计 | number | 增长比率 |
|
|
||||||
+| 其他 | text | 姓名、地址 |
|
|
||||||
+
|
|
||||||
+### 8.5 API 接口
|
|
||||||
+
|
|
||||||
+#### POST `/api/v1/templates/fill`
|
|
||||||
+
|
|
||||||
+填写请求:
|
|
||||||
+```json
|
|
||||||
+{
|
|
||||||
+ "template_id": "模板ID",
|
|
||||||
+ "template_fields": [
|
|
||||||
+ {"cell": "A1", "name": "姓名", "field_type": "text", "required": true, "hint": "提取人员姓名"}
|
|
||||||
+ ],
|
|
||||||
+ "source_doc_ids": ["mongodb_doc_id_1", "mongodb_doc_id_2"],
|
|
||||||
+ "source_file_paths": [],
|
|
||||||
+ "user_hint": "请从合同文档中提取"
|
|
||||||
+}
|
|
||||||
+```
|
|
||||||
+
|
|
||||||
+响应:
|
|
||||||
+```json
|
|
||||||
+{
|
|
||||||
+ "success": true,
|
|
||||||
+ "filled_data": {"姓名": "张三"},
|
|
||||||
+ "fill_details": [
|
|
||||||
+ {
|
|
||||||
+ "field": "姓名",
|
|
||||||
+ "cell": "A1",
|
|
||||||
+ "value": "张三",
|
|
||||||
+ "source": "来自:合同文档.docx",
|
|
||||||
+ "confidence": 0.95
|
|
||||||
+ }
|
|
||||||
+ ],
|
|
||||||
+ "source_doc_count": 2
|
|
||||||
+}
|
|
||||||
+```
|
|
||||||
+
|
|
||||||
+#### POST `/api/v1/templates/export`
|
|
||||||
+
|
|
||||||
+导出请求:
|
|
||||||
+```json
|
|
||||||
+{
|
|
||||||
+ "template_id": "模板ID",
|
|
||||||
+ "filled_data": {"姓名": "张三", "金额": "10000"},
|
|
||||||
+ "format": "xlsx" // 或 "docx"
|
|
||||||
+}
|
|
||||||
+```
|
|
||||||
\ No newline at end of file
|
|
||||||
@@ -1,144 +0,0 @@
|
|||||||
# Template Fill Feature Changelog

**Date of change**: 2026-04-08
**Type of change**: Feature completion
**Scope**: Word table parsing and template filling
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
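## Overview

This change completes Word table parsing, table-template construction, and template filling. It implements the core flow of reading data from source documents (MongoDB or files) and filling in the template with an LLM.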
|
|
||||||
|
|
||||||
### Files changed

| File | Lines changed | Description |
|------|---------------|-------------|
| backend/app/api/endpoints/templates.py | +156 | API endpoints completed; Word export added |
| backend/app/core/document_parser/docx_parser.py | +130 | Word table parsing enhanced |
| backend/app/services/template_fill_service.py | +340 | Core fill service rewritten |
| frontend/src/db/backend-api.ts | +9 | Frontend API updated |
| frontend/src/pages/TemplateFill.tsx | +8 | Frontend page updated |
| 比赛备赛规划.md | +169 | Documentation updated |
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## Detailed changes

### 1. backend/app/core/document_parser/docx_parser.py

**New methods**:

- `parse_tables_for_template(file_path)` - parses the tables in a Word document and extracts template fields
- `extract_template_fields_from_docx(file_path)` - extracts template field definitions from a Word document
- `_infer_field_type_from_hint(hint)` - infers the field type from the hint text

**Notes**:

- Built specifically for the competition scenario: parses table templates and identifies the fields that need to be filled in
- Takes the field name from the first table column and the hint/description from the second column
- Infers the field type automatically (text/number/date)
|
|
||||||
|
|
||||||
### 2. backend/app/services/template_fill_service.py

**Refactoring**:

- No longer depends on the RAG service; reads source documents directly from MongoDB or from files
- Adds the `SourceDocument` data class
- Completes `fill_template()`, which now accepts `source_doc_ids` and `source_file_paths`
- Adds `_load_source_documents()` - loads source document content
- Adds `_extract_field_value()` - extracts a field value with the LLM
- Adds `_build_context_text()` - builds the context (table data takes priority)
- Completes `_get_template_fields_from_docx()` - Word template field extraction

**Core flow**:
```
1. Load the source documents (MongoDB or files)
2. Call the LLM for each field to extract its value
3. Return the fill result
```
|
|
||||||
|
|
||||||
### 3. backend/app/api/endpoints/templates.py

**Additions**:

- `FillRequest` gains the `source_doc_ids`, `source_file_paths`, and `user_hint` fields
- `ExportRequest` gains the `format` field
- `_export_to_word()` - exports to Word format
- `/templates/export/excel` - dedicated Excel export
- `/templates/export/word` - dedicated Word export
|
|
||||||
|
|
||||||
### 4. frontend/src/db/backend-api.ts

**Updates**:

- The `TemplateField` interface gains a `hint` field
- `fillTemplate()` gains the `sourceDocIds`, `sourceFilePaths`, and `userHint` parameters

### 5. frontend/src/pages/TemplateFill.tsx

**Updates**:

- `handleFillTemplate()` passes `selectedDocs` as the `sourceDocIds` argument
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## API changes

### POST /api/v1/templates/fill

**Request body**:
|
|
||||||
```json
|
|
||||||
{
|
|
||||||
"template_id": "模板ID",
|
|
||||||
"template_fields": [
|
|
||||||
{
|
|
||||||
"cell": "A1",
|
|
||||||
"name": "姓名",
|
|
||||||
"field_type": "text",
|
|
||||||
"required": true,
|
|
||||||
"hint": "提取人员姓名"
|
|
||||||
}
|
|
||||||
],
|
|
||||||
"source_doc_ids": ["mongodb_doc_id"],
|
|
||||||
"source_file_paths": [],
|
|
||||||
"user_hint": "请从xxx文档中提取"
|
|
||||||
}
|
|
||||||
```
|
|
||||||
|
|
||||||
**Response**:
|
|
||||||
```json
|
|
||||||
{
|
|
||||||
"success": true,
|
|
||||||
"filled_data": {"姓名": "张三"},
|
|
||||||
"fill_details": [...],
|
|
||||||
"source_doc_count": 1
|
|
||||||
}
|
|
||||||
```
|
|
||||||
|
|
||||||
### POST /api/v1/templates/export

**Now supports `format=docx`**, which exports the result in Word format
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## Technical details

### Field type inference

| Keyword | Inferred type |
|---------|---------------|
| 年、月、日、日期、时间、出生 (year, month, day, date, time, birth) | date |
| 数量、金额、比率、%、率、合计 (quantity, amount, ratio, %, rate, total) | number |
| Anything else | text |
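
The keyword table can be read as a plain substring lookup. The following is a minimal sketch, not the project's actual `_infer_field_type_from_hint`: the function name, the keyword tuples, and the choice to test date keywords before number keywords are all assumptions made for illustration.

```python
# Keyword lists copied from the table above; the tuple names are hypothetical.
DATE_KEYWORDS = ("年", "月", "日", "日期", "时间", "出生")
NUMBER_KEYWORDS = ("数量", "金额", "比率", "%", "率", "合计")


def infer_field_type_from_hint(hint: str) -> str:
    """Return "date", "number", or "text" based on keywords found in the hint."""
    # Assumption: date keywords take precedence over number keywords.
    if any(keyword in hint for keyword in DATE_KEYWORDS):
        return "date"
    if any(keyword in hint for keyword in NUMBER_KEYWORDS):
        return "number"
    return "text"
```

With this precedence, a hint such as "出生率" (birth rate) would be classified as `date` because it contains "出生", so a real implementation may need a different ordering or longer keywords.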
|
|
||||||
|
|
||||||
### Context construction

Priority when building the context from source document content:
1. Structured data (table data)
2. Raw text content (limited to 5000 characters)
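
One way to read this priority order is sketched below. It is an illustration under stated assumptions, not the code of `_build_context_text()`: the fields of `SourceDocument`, the `[filename]` header line, and the decision to use raw text only when a document has no tables are all hypothetical.

```python
from dataclasses import dataclass, field

MAX_TEXT_CHARS = 5000  # raw-text limit stated above


@dataclass
class SourceDocument:
    """Hypothetical shape; the real data class may differ."""
    filename: str
    text: str = ""
    tables: list = field(default_factory=list)  # each table is a list of rows


def build_context_text(docs):
    """Prefer table rows; fall back to truncated raw text (assumed behaviour)."""
    parts = []
    for doc in docs:
        parts.append(f"[{doc.filename}]")
        if doc.tables:
            for table in doc.tables:
                for row in table:
                    parts.append(" | ".join(str(cell) for cell in row))
        elif doc.text:
            parts.append(doc.text[:MAX_TEXT_CHARS])
    return "\n".join(parts)
```

Capping the raw text keeps the prompt size bounded per document, which matters for the response-time limit.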
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## Related documents

- [比赛备赛规划.md](../比赛备赛规划.md) - feature status and technical implementation details updated
|
|
||||||
20
package.json
@@ -1,20 +0,0 @@
|
|||||||
{
|
|
||||||
"name": "filesreadsystem",
|
|
||||||
"version": "1.0.0",
|
|
||||||
"description": "",
|
|
||||||
"main": "index.js",
|
|
||||||
"directories": {
|
|
||||||
"doc": "docs"
|
|
||||||
},
|
|
||||||
"scripts": {
|
|
||||||
"test": "echo \"Error: no test specified\" && exit 1"
|
|
||||||
},
|
|
||||||
"repository": {
|
|
||||||
"type": "git",
|
|
||||||
"url": "https://gitea.kronecker.cc/OurCodesAreAllRight/FilesReadSystem.git"
|
|
||||||
},
|
|
||||||
"keywords": [],
|
|
||||||
"author": "",
|
|
||||||
"license": "ISC",
|
|
||||||
"type": "commonjs"
|
|
||||||
}
|
|
||||||
340
比赛备赛规划.md
@@ -1,340 +0,0 @@
|
|||||||
# 比赛备赛规划文档
|
|
||||||
|
|
||||||
## 一、赛题核心理解
|
|
||||||
|
|
||||||
### 1.1 赛题名称
|
|
||||||
**A23 - 基于大语言模型的文档理解与多源数据融合**
|
|
||||||
参赛院校:金陵科技学院
|
|
||||||
|
|
||||||
### 1.2 核心任务
|
|
||||||
1. **文档解析**:解析 docx/md/xlsx/txt 四种格式的源数据文档
|
|
||||||
2. **模板填写**:根据模板表格要求,从源文档中提取数据填写到 Word/Excel 模板
|
|
||||||
3. **准确率与速度**:准确率优先,速度作为辅助评分因素
|
|
||||||
|
|
||||||
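### 1.3 Scoring rules

| Factor | Description |
|--------|-------------|
| Accuracy | Proportion of filled values that match the sample table |
| Response time | Time from importing the documents to obtaining the result ≤ 90 s × number of documents |
| Evaluation method | The organizers provide an empty table template plus a manually filled sample table; the system fills in the template automatically and the two are compared |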
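
The response-time rule is simple arithmetic: a template that reads three documents has a budget of 90 s × 3 = 270 s. The helper names below are hypothetical and only illustrate the rule.

```python
def time_budget_seconds(document_count, per_document_seconds=90):
    """Total time allowed for one template, per the 90 s x documents rule."""
    if document_count < 1:
        raise ValueError("at least one source document is required")
    return per_document_seconds * document_count


def within_budget(elapsed_seconds, document_count):
    """True when a measured run stays inside the allowed time."""
    return elapsed_seconds <= time_budget_seconds(document_count)
```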
|
|
||||||
|
|
||||||
### 1.4 关键Q&A摘录
|
|
||||||
|
|
||||||
| 问题 | 解答要点 |
|
|
||||||
|------|----------|
|
|
||||||
| Q2: 模板与文档的关系 | 前2个表格只涉及1份文档;第3-4个涉及多份文档;第5个涉及大部分文档(从易到难) |
|
|
||||||
| Q5: 响应时间定义 | 从导入文档到最终得到结果的时间 ≤ 90s × 文档数量 |
|
|
||||||
| Q7: 需要读取哪些文件 | 每个模板只读取指定的数据文件,不需要读取全部 |
|
|
||||||
| Q10: 部署方式 | 不要求部署到服务器,本地部署即可 |
|
|
||||||
| Q14: 模板匹配 | 模板已指定数据文件,不需要算法匹配 |
|
|
||||||
| Q16: 数据库存储 | 可跳过,不强制要求 |
|
|
||||||
| Q20: 创新点 | 不用管,随意发挥 |
|
|
||||||
| Q21: 填写依据 | 按照测试表格模板给的提示词进行填写 |
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## 二、已完成功能清单
|
|
||||||
|
|
||||||
### 2.1 后端服务 (`backend/app/services/`)
|
|
||||||
|
|
||||||
| 服务文件 | 功能状态 | 说明 |
|
|
||||||
|----------|----------|------|
|
|
||||||
| `file_service.py` | ✅ 已完成 | 文件上传、保存、类型识别 |
|
|
||||||
| `excel_storage_service.py` | ✅ 已完成 | Excel 存储到 MySQL,支持 XML 回退解析 |
|
|
||||||
| `table_rag_service.py` | ⚠️ 已禁用 | RAG 索引构建(当前禁用,仅记录日志) |
|
|
||||||
| `llm_service.py` | ✅ 已完成 | LLM 调用、流式输出、多模型支持 |
|
|
||||||
| `markdown_ai_service.py` | ✅ 已完成 | Markdown AI 分析、分章节提取、流式输出、图表生成 |
|
|
||||||
| `excel_ai_service.py` | ✅ 已完成 | Excel AI 分析 |
|
|
||||||
| `visualization_service.py` | ✅ 已完成 | 图表生成(matplotlib) |
|
|
||||||
| `rag_service.py` | ⚠️ 已禁用 | FAISS 向量检索(当前禁用) |
|
|
||||||
| `prompt_service.py` | ✅ 已完成 | Prompt 模板管理 |
|
|
||||||
| `text_analysis_service.py` | ✅ 已完成 | 文本分析 |
|
|
||||||
| `chart_generator_service.py` | ✅ 已完成 | 图表生成服务 |
|
|
||||||
| `template_fill_service.py` | ✅ 已完成 | 模板填写服务,支持直接读取源文档进行填表 |
|
|
||||||
|
|
||||||
### 2.2 API 接口 (`backend/app/api/endpoints/`)
|
|
||||||
|
|
||||||
| 接口文件 | 路由 | 功能状态 |
|
|
||||||
|----------|------|----------|
|
|
||||||
| `upload.py` | `/api/v1/upload/excel` | ✅ Excel 文件上传与解析 |
|
|
||||||
| `documents.py` | `/api/v1/documents/*` | ✅ 文档管理(列表、删除、搜索) |
|
|
||||||
| `ai_analyze.py` | `/api/v1/analyze/*` | ✅ AI 分析(Excel、Markdown、流式) |
|
|
||||||
| `rag.py` | `/api/v1/rag/*` | ⚠️ RAG 检索(当前返回空) |
|
|
||||||
| `tasks.py` | `/api/v1/tasks/*` | ✅ 异步任务状态查询 |
|
|
||||||
| `templates.py` | `/api/v1/templates/*` | ✅ 模板管理 (含 Word 导出) |
|
|
||||||
| `visualization.py` | `/api/v1/visualization/*` | ✅ 可视化图表 |
|
|
||||||
| `health.py` | `/api/v1/health` | ✅ 健康检查 |
|
|
||||||
|
|
||||||
### 2.3 前端页面 (`frontend/src/pages/`)
|
|
||||||
|
|
||||||
| 页面文件 | 功能 | 状态 |
|
|
||||||
|----------|------|------|
|
|
||||||
| `Documents.tsx` | 主文档管理页面 | ✅ 已完成 |
|
|
||||||
| `ExcelParse.tsx` | Excel 解析页面 | ✅ 已完成 |
|
|
||||||
|
|
||||||
### 2.4 文档解析能力
|
|
||||||
|
|
||||||
| 格式 | 解析状态 | 说明 |
|
|
||||||
|------|----------|------|
|
|
||||||
| Excel (.xlsx/.xls) | ✅ 已完成 | pandas + XML 回退解析 |
|
|
||||||
| Markdown (.md) | ✅ 已完成 | 正则 + AI 分章节 |
|
|
||||||
| Word (.docx) | ✅ 已完成 | python-docx 解析,支持表格提取和字段识别 |
|
|
||||||
| Text (.txt) | ✅ 已完成 | chardet 编码检测,支持文本清洗和结构化提取 |
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## 三、待完成功能(核心缺块)
|
|
||||||
|
|
||||||
### 3.1 模板填写模块(最优先)
|
|
||||||
|
|
||||||
**当前状态**:✅ 已完成
|
|
||||||
|
|
||||||
```
|
|
||||||
用户上传模板表格(Word/Excel)
|
|
||||||
↓
|
|
||||||
解析模板,提取需要填写的字段和提示词
|
|
||||||
↓
|
|
||||||
根据模板指定的源文档列表读取源数据
|
|
||||||
↓
|
|
||||||
AI 根据字段提示词从源数据中提取信息
|
|
||||||
↓
|
|
||||||
将提取的数据填入模板对应位置
|
|
||||||
↓
|
|
||||||
返回填写完成的表格
|
|
||||||
```
|
|
||||||
|
|
||||||
**已完成实现**:
|
|
||||||
- [x] `template_fill_service.py` - 模板填写核心服务
|
|
||||||
- [x] Word 模板解析 (`docx_parser.py` - parse_tables_for_template, extract_template_fields_from_docx)
|
|
||||||
- [x] Text 模板解析 (`txt_parser.py` - 已完成)
|
|
||||||
- [x] 模板字段识别与提示词提取
|
|
||||||
- [x] 多文档数据聚合与冲突处理
|
|
||||||
- [x] 结果导出为 Word/Excel
|
|
||||||
|
|
||||||
### 3.2 Word 文档解析
|
|
||||||
|
|
||||||
**当前状态**:✅ 已完成
|
|
||||||
|
|
||||||
**已实现功能**:
|
|
||||||
- [x] `docx_parser.py` - Word 文档解析器
|
|
||||||
- [x] 提取段落文本
|
|
||||||
- [x] 提取表格内容
|
|
||||||
- [x] 提取关键信息(标题、列表等)
|
|
||||||
- [x] 表格模板字段提取 (`parse_tables_for_template`, `extract_template_fields_from_docx`)
|
|
||||||
- [x] 字段类型推断 (`_infer_field_type_from_hint`)
|
|
||||||
|
|
||||||
### 3.3 Text 文档解析
|
|
||||||
|
|
||||||
**当前状态**:✅ 已完成
|
|
||||||
|
|
||||||
**已实现功能**:
|
|
||||||
- [x] `txt_parser.py` - 文本文件解析器
|
|
||||||
- [x] 编码自动检测 (chardet)
|
|
||||||
- [x] 文本清洗
|
|
||||||
|
|
||||||
### 3.4 文档模板匹配(已有框架)
|
|
||||||
|
|
||||||
根据 Q&A,模板已指定数据文件,不需要算法匹配。当前已有上传功能,需确认模板与数据文件的关联逻辑是否完善。
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## 四、参赛材料准备
|
|
||||||
|
|
||||||
### 4.1 必交材料
|
|
||||||
|
|
||||||
| 材料 | 要求 | 当前状态 | 行动项 |
|
|
||||||
|------|------|----------|--------|
|
|
||||||
| 项目概要介绍 | PPT 格式 | ❌ 待制作 | 制作 PPT |
|
|
||||||
| 项目简介 PPT | - | ❌ 待制作 | 制作 PPT |
|
|
||||||
| 项目详细方案 | 文档 | ⚠️ 部分完成 | 完善文档 |
|
|
||||||
| 项目演示视频 | - | ❌ 待制作 | 录制演示视频 |
|
|
||||||
| 训练素材说明 | 来源说明 | ⚠️ 已有素材 | 整理素材文档 |
|
|
||||||
| 关键模块设计文档 | 概要设计 | ⚠️ 已有部分 | 完善文档 |
|
|
||||||
| 可运行 Demo | 核心代码 | ✅ 已完成 | 打包可运行版本 |
|
|
||||||
|
|
||||||
### 4.2 Demo 提交要求
|
|
||||||
|
|
||||||
根据 Q&A:
|
|
||||||
- 可以只提交核心代码,不需要完整运行环境
|
|
||||||
- 现场答辩可使用自带笔记本电脑
|
|
||||||
- 需要提供部署和运行说明(README)
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## 五、测试验证计划
|
|
||||||
|
|
||||||
### 5.1 使用现有测试数据
|
|
||||||
|
|
||||||
```
|
|
||||||
docs/test/
|
|
||||||
├── 2023年文化和旅游发展统计公报.md
|
|
||||||
├── 2024年卫生健康事业发展统计公报.md
|
|
||||||
├── 第三次全国工业普查主要数据公报.md
|
|
||||||
```
|
|
||||||
|
|
||||||
### 5.2 模板填写测试流程
|
|
||||||
|
|
||||||
1. 准备一个 Word/Excel 模板表格
|
|
||||||
2. 指定源数据文档
|
|
||||||
3. 上传模板和文档
|
|
||||||
4. 执行模板填写
|
|
||||||
5. 检查填写结果准确率
|
|
||||||
6. 记录响应时间
|
|
||||||
|
|
||||||
### 5.3 性能目标
|
|
||||||
|
|
||||||
| 指标 | 目标 | 当前状态 |
|
|
||||||
|------|------|----------|
|
|
||||||
| 信息提取准确率 | ≥80% | 需测试验证 |
|
|
||||||
| 单次响应时间 | ≤90s × 文档数 | 需测试验证 |
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## 六、工作计划(建议)
|
|
||||||
|
|
||||||
### 第一优先级:模板填写核心功能
|
|
||||||
- 完成 Word 文档解析
|
|
||||||
- 完成模板填写服务
|
|
||||||
- 端到端测试验证
|
|
||||||
|
|
||||||
### 第二优先级:Demo 打包与文档
|
|
||||||
- 制作项目演示 PPT
|
|
||||||
- 录制演示视频
|
|
||||||
- 完善 README 部署文档
|
|
||||||
|
|
||||||
### 第三优先级:测试优化
|
|
||||||
- 使用真实测试数据进行准确率测试
|
|
||||||
- 优化响应时间
|
|
||||||
- 完善错误处理
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## 七、注意事项
|
|
||||||
|
|
||||||
1. **创新点**:根据 Q&A,不必纠结创新点数量限制
|
|
||||||
2. **数据库**:不强制要求数据库存储,可跳过
|
|
||||||
3. **部署**:本地部署即可,不需要公网服务器
|
|
||||||
4. **评测数据**:初赛仅使用目前提供的数据
|
|
||||||
5. **RAG 功能**:当前已临时禁用,不影响核心评测功能
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
*文档版本: v1.1*
|
|
||||||
*最后更新: 2026-04-08*
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## 八、技术实现细节
|
|
||||||
|
|
||||||
### 8.1 模板填表流程(已实现)
|
|
||||||
|
|
||||||
#### 流程图
|
|
||||||
```
|
|
||||||
┌─────────────┐ ┌─────────────┐ ┌─────────────┐
|
|
||||||
│ 上传模板 │ ──► │ 选择数据源 │ ──► │ AI 智能填表 │
|
|
||||||
└─────────────┘ └─────────────┘ └─────────────┘
|
|
||||||
│
|
|
||||||
▼
|
|
||||||
┌─────────────┐
|
|
||||||
│ 导出结果 │
|
|
||||||
└─────────────┘
|
|
||||||
```
|
|
||||||
|
|
||||||
#### 核心组件
|
|
||||||
|
|
||||||
| 组件 | 文件 | 说明 |
|
|
||||||
|------|------|------|
|
|
||||||
| 模板上传 | `templates.py` `/templates/upload` | 接收模板文件,提取字段 |
|
|
||||||
| 字段提取 | `template_fill_service.py` | 从 Word/Excel 表格提取字段定义 |
|
|
||||||
| 文档解析 | `docx_parser.py`, `xlsx_parser.py`, `txt_parser.py` | 解析源文档内容 |
|
|
||||||
| 智能填表 | `template_fill_service.py` `fill_template()` | 使用 LLM 从源文档提取信息 |
|
|
||||||
| 结果导出 | `templates.py` `/templates/export` | 导出为 Excel 或 Word |
|
|
||||||
|
|
||||||
### 8.2 Loading source documents

The template fill service can load source documents in two ways:

1. **By MongoDB document ID**: `source_doc_ids`
   - The document has already been uploaded and stored in MongoDB
   - The service queries MongoDB directly for the document content

2. **By file path**: `source_file_paths`
   - Reads the local file directly
   - Parses the content with the matching parser
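
The two loading routes can be sketched as a small planning step that decides, for each input, where its content would come from. This is an assumption-laden illustration rather than the code of `_load_source_documents()`: the function name, the extension-to-parser mapping, and the `(loader, reference)` tuples are hypothetical, and no database or file is actually opened.

```python
from pathlib import Path

# Assumed mapping from file extension to the parser modules named in this document.
PARSER_BY_EXTENSION = {
    ".docx": "docx_parser",
    ".xlsx": "xlsx_parser",
    ".xls": "xlsx_parser",
    ".txt": "txt_parser",
}


def plan_source_loading(source_doc_ids=None, source_file_paths=None):
    """Return (loader, reference) pairs: MongoDB IDs first, then local files."""
    plan = [("mongodb", doc_id) for doc_id in (source_doc_ids or [])]
    for path in source_file_paths or []:
        suffix = Path(path).suffix.lower()
        if suffix not in PARSER_BY_EXTENSION:
            raise ValueError(f"no parser registered for {path!r}")
        plan.append((PARSER_BY_EXTENSION[suffix], path))
    return plan
```

Rejecting unknown extensions early avoids sending an unparsed binary file to the LLM.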
|
|
||||||
|
|
||||||
### 8.3 Word 表格模板解析
|
|
||||||
|
|
||||||
比赛评分表格通常是 Word 格式,`docx_parser.py` 提供了专门的解析方法:
|
|
||||||
|
|
||||||
```python
|
|
||||||
# 提取表格模板字段
|
|
||||||
fields = docx_parser.extract_template_fields_from_docx(file_path)
|
|
||||||
|
|
||||||
# 返回格式
|
|
||||||
# [
|
|
||||||
# {
|
|
||||||
# "cell": "T0R1", # 表格0,行1
|
|
||||||
# "name": "字段名",
|
|
||||||
# "hint": "提示词",
|
|
||||||
# "field_type": "text/number/date",
|
|
||||||
# "required": True
|
|
||||||
# },
|
|
||||||
# ...
|
|
||||||
# ]
|
|
||||||
```
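
The return format above could be produced from plain table rows as sketched here. This is not the body of `extract_template_fields_from_docx`: the function name, the `skip_header` flag, and the fixed `"text"` type are assumptions. With python-docx, the rows could be collected from `Document(path).tables` by reading `cell.text` for each cell in `row.cells`.

```python
def extract_fields_from_rows(tables, skip_header=True):
    """Build field definitions from tables given as lists of rows of strings.

    Column 0 is treated as the field name and column 1 as the hint.
    Skipping row 0 as a header is an assumption, not confirmed behaviour.
    """
    fields = []
    for t_index, rows in enumerate(tables):
        for r_index, row in enumerate(rows):
            if skip_header and r_index == 0:
                continue
            name = row[0].strip() if row else ""
            if not name:
                continue  # rows without a field name carry nothing to fill
            hint = row[1].strip() if len(row) > 1 else ""
            fields.append({
                "cell": f"T{t_index}R{r_index}",  # table index, row index
                "name": name,
                "hint": hint,
                "field_type": "text",  # placeholder; type inference omitted here
                "required": True,
            })
    return fields
```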
|
|
||||||
|
|
||||||
### 8.4 字段类型推断
|
|
||||||
|
|
||||||
系统支持从提示词自动推断字段类型:
|
|
||||||
|
|
||||||
| 关键词 | 推断类型 | 示例 |
|
|
||||||
|--------|----------|------|
|
|
||||||
| 年、月、日、日期、时间、出生 | date | 出生日期 |
|
|
||||||
| 数量、金额、比率、%、率、合计 | number | 增长比率 |
|
|
||||||
| 其他 | text | 姓名、地址 |
|
|
||||||
|
|
||||||
### 8.5 API 接口
|
|
||||||
|
|
||||||
#### POST `/api/v1/templates/fill`
|
|
||||||
|
|
||||||
填写请求:
|
|
||||||
```json
|
|
||||||
{
|
|
||||||
"template_id": "模板ID",
|
|
||||||
"template_fields": [
|
|
||||||
{"cell": "A1", "name": "姓名", "field_type": "text", "required": true, "hint": "提取人员姓名"}
|
|
||||||
],
|
|
||||||
"source_doc_ids": ["mongodb_doc_id_1", "mongodb_doc_id_2"],
|
|
||||||
"source_file_paths": [],
|
|
||||||
"user_hint": "请从合同文档中提取"
|
|
||||||
}
|
|
||||||
```
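
A request body of this shape can be assembled with the same defaults the frontend applies (`[]` for missing lists, `null` for a missing hint). The helper name `build_fill_payload` is hypothetical; only the JSON keys come from the request shown above.

```python
import json


def build_fill_payload(template_id, template_fields,
                       source_doc_ids=None, source_file_paths=None,
                       user_hint=None):
    """Mirror the frontend defaults: empty lists and a null hint."""
    return {
        "template_id": template_id,
        "template_fields": template_fields,
        "source_doc_ids": source_doc_ids or [],
        "source_file_paths": source_file_paths or [],
        "user_hint": user_hint or None,
    }


payload = build_fill_payload(
    "temp-template-id",
    [{"cell": "A1", "name": "姓名", "field_type": "text",
      "required": True, "hint": "提取人员姓名"}],
    source_doc_ids=["mongodb_doc_id_1"],
)
# ensure_ascii=False keeps the Chinese field names readable in the body.
body = json.dumps(payload, ensure_ascii=False)
```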
|
|
||||||
|
|
||||||
响应:
|
|
||||||
```json
|
|
||||||
{
|
|
||||||
"success": true,
|
|
||||||
"filled_data": {"姓名": "张三"},
|
|
||||||
"fill_details": [
|
|
||||||
{
|
|
||||||
"field": "姓名",
|
|
||||||
"cell": "A1",
|
|
||||||
"value": "张三",
|
|
||||||
"source": "来自:合同文档.docx",
|
|
||||||
"confidence": 0.95
|
|
||||||
}
|
|
||||||
],
|
|
||||||
"source_doc_count": 2
|
|
||||||
}
|
|
||||||
```
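
When several source documents yield a value for the same field, the `confidence` score in `fill_details` offers one possible way to resolve the conflict. The document does not say how the service does this, so the rule below (keep the highest-confidence value per field) and the name `merge_fill_details` are purely assumptions.

```python
def merge_fill_details(details):
    """Collapse fill_details entries into filled_data.

    Assumed rule: keep the candidate with the highest confidence per field.
    """
    best = {}
    for detail in details:
        current = best.get(detail["field"])
        if current is None or detail["confidence"] > current["confidence"]:
            best[detail["field"]] = detail
    return {name: detail["value"] for name, detail in best.items()}
```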
|
|
||||||
|
|
||||||
#### POST `/api/v1/templates/export`
|
|
||||||
|
|
||||||
导出请求:
|
|
||||||
```json
|
|
||||||
{
|
|
||||||
"template_id": "模板ID",
|
|
||||||
"filled_data": {"姓名": "张三", "金额": "10000"},
|
|
||||||
"format": "xlsx" // 或 "docx"
|
|
||||||
}
|
|
||||||
```
|
|
||||||