Initial commit: Family Planner application
Complete family planning application with:

- React frontend with TypeScript
- Node.js/Express backend with TypeScript
- Python ingestion service for document processing
- Planning ingestion service with LLM integration
- Shared UI components and type definitions
- OAuth integration for calendar synchronization
- Comprehensive documentation

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
This commit is contained in:
20
planning-ingestion/.env.example
Normal file
@@ -0,0 +1,20 @@
# Server Configuration
PORT=8000
NODE_ENV=development

# Security - REQUIRED: Master key for encrypting API keys (32+ chars)
MASTER_KEY=your-secure-master-key-minimum-32-characters-long

# Database
DATABASE_PATH=./data/planning.db

# Upload Configuration
UPLOAD_DIR=./uploads
MAX_FILE_SIZE_MB=50

# Confidence threshold for LLM fallback (0.0 to 1.0)
CONFIDENCE_THRESHOLD=0.7

# LLM Configuration (set via CLI or API, not here)
# LLM_PROVIDER=openai|anthropic
# LLM_API_KEY=<encrypted in database>
9
planning-ingestion/.gitignore
vendored
Normal file
@@ -0,0 +1,9 @@
node_modules/
dist/
uploads/
data/
*.log
.env
.DS_Store
*.sqlite3
*.db
492
planning-ingestion/README.md
Normal file
@@ -0,0 +1,492 @@
# Planning Ingestion Service

Professional Node.js service for ingesting and normalizing schedules. Supports images (OCR), PDF, and Excel, with intelligent normalization to a single standard JSON format via heuristics and an LLM (OpenAI/Anthropic).

## Features

- ✅ **Multi-format ingestion**: images (Tesseract OCR), PDF, Excel (.xlsx/.xls)
- ✅ **Intelligent normalization**: heuristics plus an LLM fallback for ambiguous input
- ✅ **Single standard JSON**: contractual format for all output
- ✅ **Security**: API keys encrypted with AES-256-GCM, never stored in plaintext
- ✅ **CLI & HTTP**: command-line interface and REST API
- ✅ **Persistence**: SQLite with encrypted API keys

## Architecture

```
planning-ingestion/
├── src/
│   ├── types/schema.ts        # TypeScript & Zod schemas
│   ├── crypto/encryption.ts   # AES-256-GCM encryption
│   ├── database/db.ts         # SQLite (documents, schedules, api_keys)
│   ├── extractors/
│   │   ├── ocr.ts             # Tesseract for images
│   │   ├── pdf.ts             # pdf-parse for PDFs
│   │   └── excel.ts           # xlsx for Excel
│   ├── normalizer/parser.ts   # Parsing heuristics
│   ├── llm/client.ts          # OpenAI & Anthropic
│   ├── services/ingestion.ts  # Main service
│   ├── cli.ts                 # CLI
│   └── server.ts              # HTTP API
├── data/                      # SQLite database
├── uploads/                   # Uploaded files
├── package.json
├── tsconfig.json
└── .env
```

## Installation

```bash
npm install
```

## Configuration

Create a `.env` file at the project root:

```env
# Server
PORT=8000
NODE_ENV=development

# Security - REQUIRED: master key used to encrypt API keys (32+ characters)
MASTER_KEY=your-secure-master-key-minimum-32-characters

# Database
DATABASE_PATH=./data/planning.db

# Upload
UPLOAD_DIR=./uploads
MAX_FILE_SIZE_MB=50

# Confidence threshold (0.0-1.0): below this, the LLM is used
CONFIDENCE_THRESHOLD=0.7
```

⚠️ **MASTER_KEY** is critical: change it and never commit it.

## CLI

### 1. Configure the LLM API key

Interactive prompt (recommended):

```bash
npm run cli setup:api
# Choose: openai or anthropic
# Enter the API key
```

Or with a flag:

```bash
npm run cli setup:api -- --provider openai
```

The key is encrypted and stored in SQLite, never in plaintext.

### 2. Ingest a document

```bash
npm run cli ingest path/to/planning.pdf
# Returns: Document ID: abc-123-xyz
```

Supported formats: `.png`, `.jpg`, `.jpeg`, `.pdf`, `.xlsx`, `.xls`

### 3. Normalize to standard JSON

```bash
npm run cli normalize --document abc-123-xyz
# Options:
#   --scope weekly|monthly
#   --subject enfants|vacances
# Returns: Schedule ID: def-456-uvw
```

### 4. Export the JSON

```bash
npm run cli export:schedule --id def-456-uvw --out schedule.json
```

The `schedule.json` file is ready to be consumed by your application.
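As a sketch of what consuming the export looks like, the snippet below groups contract events by date so a UI can render one column per day. The sample events are illustrative; in practice the array comes from reading the file produced by `export:schedule`.

```typescript
// Minimal shape of the contract JSON that this consumer relies on.
interface ScheduleEvent {
  title: string;
  date: string;        // YYYY-MM-DD
  start_time: string;  // HH:MM
  end_time: string;
}

// Group events by date so a calendar view can render one column per day.
function groupByDate(events: ScheduleEvent[]): Map<string, ScheduleEvent[]> {
  const byDate = new Map<string, ScheduleEvent[]>();
  for (const ev of events) {
    const list = byDate.get(ev.date) ?? [];
    list.push(ev);
    byDate.set(ev.date, list);
  }
  return byDate;
}

// Illustrative sample; in practice:
// JSON.parse(fs.readFileSync("schedule.json", "utf8")).events
const sample: ScheduleEvent[] = [
  { title: "Piscine", date: "2025-01-15", start_time: "14:00", end_time: "15:30" },
  { title: "Maths", date: "2025-01-15", start_time: "08:30", end_time: "10:00" }
];
const grouped = groupByDate(sample);
```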

## HTTP API

### Start the server

```bash
npm run build
npm start
# or in dev:
npm run dev
```

The server listens on `http://localhost:8000`.

### Endpoints

#### `POST /auth/api-key`

Store the LLM API key securely.

**Body**:

```json
{
  "provider": "openai",
  "apiKey": "sk-..."
}
```

**Response**:

```json
{
  "message": "API key stored securely"
}
```

#### `POST /ingest`

Ingest a file (multipart/form-data).

**cURL**:

```bash
curl -X POST http://localhost:8000/ingest \
  -F "file=@planning.pdf"
```

**Response**:

```json
{
  "document_id": "abc-123-xyz",
  "filename": "planning.pdf"
}
```

#### `POST /ingest/normalize`

Normalize a document to standard JSON.

**Body**:

```json
{
  "document_id": "abc-123-xyz",
  "scope": "weekly",
  "subject": "enfants"
}
```

**Response**:

```json
{
  "schedule_id": "def-456-uvw"
}
```

#### `GET /schedules/:id`

Retrieve the normalized standard JSON.

**cURL**:

```bash
curl http://localhost:8000/schedules/def-456-uvw
```

**Response** (contract JSON):

```json
{
  "version": "1.0",
  "calendar_scope": "weekly",
  "timezone": "Europe/Paris",
  "period": {
    "start": "2025-01-13",
    "end": "2025-01-19"
  },
  "entities": ["enfants"],
  "events": [
    {
      "title": "Mathématiques",
      "date": "2025-01-13",
      "start_time": "08:30",
      "end_time": "10:00",
      "location": "Salle B12",
      "tags": ["cours"],
      "notes": null,
      "confidence": 0.95,
      "source_cells": []
    }
  ],
  "extraction": {
    "method": "pdf",
    "model": "internal",
    "heuristics": ["weekday-headers", "time-patterns", "date-extraction"]
  }
}
```

#### `GET /documents/:id`

Retrieve a document's metadata.

**Response**:

```json
{
  "id": "abc-123-xyz",
  "filename": "planning.pdf",
  "document_type": "pdf",
  "mime_type": "application/pdf",
  "size_bytes": 245830,
  "uploaded_at": "2025-01-12T14:30:00.000Z"
}
```

## Standard JSON (output contract)

All output strictly follows this schema:

```typescript
{
  version: "1.0",
  calendar_scope: "weekly" | "monthly",
  timezone: "Europe/Paris",
  period: {
    start: "YYYY-MM-DD",
    end: "YYYY-MM-DD"
  },
  entities: string[], // e.g. ["enfants", "vacances"]
  events: [
    {
      title: string,
      date: "YYYY-MM-DD",
      start_time: "HH:MM", // 24h
      end_time: "HH:MM",
      location: string | null,
      tags: string[],
      notes: string | null,
      confidence: number, // 0.0 to 1.0
      source_cells: string[] // e.g. ["A1", "B2"] (Excel only)
    }
  ],
  extraction: {
    method: "ocr" | "pdf" | "excel" | "llm",
    model: string, // "tesseract", "gpt-4o", "claude-3.5-sonnet", "internal"
    heuristics: string[]
  }
}
```

## I/O Examples

### Input: scanned image of a weekly schedule

File: `planning_semaine.jpg` (handwritten or printed scan)

```bash
npm run cli ingest planning_semaine.jpg
# Document ID: img-001

npm run cli normalize --document img-001 --scope weekly --subject enfants
# Schedule ID: sch-001

npm run cli export:schedule --id sch-001 --out semaine.json
```

**Output `semaine.json`**:

```json
{
  "version": "1.0",
  "calendar_scope": "weekly",
  "timezone": "Europe/Paris",
  "period": {
    "start": "2025-01-13",
    "end": "2025-01-19"
  },
  "entities": ["enfants"],
  "events": [
    {
      "title": "Piscine",
      "date": "2025-01-15",
      "start_time": "14:00",
      "end_time": "15:30",
      "location": "Centre aquatique",
      "tags": ["sport"],
      "notes": null,
      "confidence": 0.88,
      "source_cells": []
    }
  ],
  "extraction": {
    "method": "llm",
    "model": "gpt-4o",
    "heuristics": ["llm-analysis"]
  }
}
```

### Input: Excel file with multiple sheets

File: `planning_mensuel.xlsx`

Sheet "Janvier" containing:

| Lundi | Mardi | Mercredi | Jeudi | Vendredi |
|-------|-------|----------|-------|----------|
| 8h-9h : Français | 8h-9h : Maths | 8h-9h : Sport | ... | ... |

```bash
curl -X POST http://localhost:8000/ingest -F "file=@planning_mensuel.xlsx"
# { "document_id": "xls-002" }

curl -X POST http://localhost:8000/ingest/normalize \
  -H "Content-Type: application/json" \
  -d '{"document_id":"xls-002","scope":"monthly","subject":"enfants"}'
# { "schedule_id": "sch-002" }

curl http://localhost:8000/schedules/sch-002 > mensuel.json
```

**Output `mensuel.json`**:

```json
{
  "version": "1.0",
  "calendar_scope": "monthly",
  "timezone": "Europe/Paris",
  "period": {
    "start": "2025-01-01",
    "end": "2025-01-31"
  },
  "entities": ["enfants"],
  "events": [
    {
      "title": "Français",
      "date": "2025-01-06",
      "start_time": "08:00",
      "end_time": "09:00",
      "location": null,
      "tags": ["cours"],
      "notes": null,
      "confidence": 0.98,
      "source_cells": ["B2"]
    },
    {
      "title": "Maths",
      "date": "2025-01-07",
      "start_time": "08:00",
      "end_time": "09:00",
      "location": null,
      "tags": ["cours"],
      "notes": null,
      "confidence": 0.98,
      "source_cells": ["C2"]
    }
  ],
  "extraction": {
    "method": "excel",
    "model": "internal",
    "heuristics": ["weekday-headers", "hour-columns", "table-detection"]
  }
}
```

## Security

- **API key encryption**: AES-256-GCM with a unique IV and auth tag per key
- **Master key**: derived from `.env` via scrypt (salted), never logged
- **No leaks**: endpoints never return an API key in plaintext
- **Hardened SQLite**: `api_keys` table with encrypted columns

## Normalization

### Heuristics

The internal parser detects:

- **French weekdays**: lun, mar, mer, jeu, ven, sam, dim
- **French months**: janvier, février, ..., décembre
- **Times**: `8h30`, `8:30`, `08:30` → `08:30`
- **Scope**: `weekly` when a single week is identifiable, otherwise `monthly`
- **Entities**: keywords ("enfant", "vacances", "congé")
- **Events**: detection of time ranges plus titles
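The time heuristic above can be sketched as follows (a hypothetical helper, not the actual parser code): it accepts `8h30`, `8:30`, `08:30`, or bare `8h` and emits zero-padded 24h `HH:MM`.

```typescript
// Normalize French-style times ("8h30", "8h") and colon times ("8:30")
// to zero-padded 24h "HH:MM". Returns null when nothing matches.
function normalizeTime(raw: string): string | null {
  const m = raw.trim().match(/^(\d{1,2})\s*[h:]\s*(\d{0,2})$/i);
  if (!m) return null;
  const hours = Number(m[1]);
  const minutes = m[2] ? Number(m[2]) : 0;
  if (hours > 23 || minutes > 59) return null;
  const pad = (n: number) => String(n).padStart(2, "0");
  return `${pad(hours)}:${pad(minutes)}`;
}
```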

### LLM Fallback

If `confidence < CONFIDENCE_THRESHOLD` (default 0.7), the LLM is called with a strict prompt that must return **only** the contract JSON.

Models:

- OpenAI: `gpt-4o`
- Anthropic: `claude-3-5-sonnet-20241022`
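The fallback decision reduces to a simple threshold check (hypothetical names; the real gating lives in `services/ingestion.ts`):

```typescript
// Shape returned by every extractor (OCR, PDF, Excel).
interface ExtractionResult {
  text: string;
  confidence: number; // 0.0 to 1.0, as reported by the extractor
  method: string;
}

// The heuristic result is trusted unless its confidence falls
// below the configured threshold, in which case the LLM is called.
function needsLLMFallback(result: ExtractionResult, threshold = 0.7): boolean {
  return result.confidence < threshold;
}
```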

## Validation

All produced JSON is validated with Zod against the `StandardScheduleSchema` schema.
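The real service validates with Zod; as a dependency-free illustration of the same top-level checks, a hand-rolled guard might look like this (an approximation, not the actual `StandardScheduleSchema`):

```typescript
// Stand-in for the Zod check: verifies the top-level contract
// fields and the YYYY-MM-DD format of the period bounds.
function looksLikeStandardSchedule(value: unknown): boolean {
  if (typeof value !== "object" || value === null) return false;
  const v = value as Record<string, unknown>;
  const dateRe = /^\d{4}-\d{2}-\d{2}$/;
  if (v.version !== "1.0") return false;
  if (v.calendar_scope !== "weekly" && v.calendar_scope !== "monthly") return false;
  const period = v.period as Record<string, unknown> | undefined;
  if (!period || !dateRe.test(String(period.start)) || !dateRe.test(String(period.end))) return false;
  return Array.isArray(v.entities) && Array.isArray(v.events);
}
```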

## Logs

Minimal logging with no sensitive data. In production, configure monitoring (e.g. Winston transports to a file).

## Tests

Add your own unit/integration tests (e.g. Vitest, Jest):

```bash
npm test
```

## Build

```bash
npm run build
# Produces dist/
node dist/server.js
```

## Deployment

1. Clone the repo on the server
2. Create `.env` with a secure `MASTER_KEY`
3. Install: `npm install`
4. Configure the API key: `npm run cli setup:api -- --provider openai`
5. Build: `npm run build`
6. Start: `npm start`

Use PM2, systemd, or Docker in production.

## Troubleshooting

**"MASTER_KEY not set" error**:

- Check that `.env` contains a `MASTER_KEY` of 32+ characters

**LLM is never triggered**:

- Check that the confidence is below `CONFIDENCE_THRESHOLD`
- Check that the API key is configured: `npm run cli setup:api`

**Inaccurate OCR**:

- Use high-resolution images
- Preprocess the image (contrast, rotation)
- The LLM will be called when confidence is low

**Excel parsed incorrectly**:

- Check cell formats (dates, times)
- The LLM can resolve ambiguities

## License

Proprietary. All rights reserved.
2742
planning-ingestion/package-lock.json
generated
Normal file
File diff suppressed because it is too large
38
planning-ingestion/package.json
Normal file
@@ -0,0 +1,38 @@
{
  "name": "planning-ingestion",
  "version": "1.0.0",
  "type": "module",
  "description": "Planning ingestion service with OCR, PDF, Excel support and LLM normalization",
  "main": "dist/server.js",
  "bin": {
    "planning": "dist/cli.js"
  },
  "scripts": {
    "build": "tsc",
    "start": "node dist/server.js",
    "dev": "tsx watch src/server.ts",
    "cli": "tsx src/cli.ts"
  },
  "dependencies": {
    "express": "^4.19.0",
    "multer": "^1.4.5-lts.2",
    "better-sqlite3": "^9.4.0",
    "tesseract.js": "^5.0.0",
    "pdf-parse": "^1.1.1",
    "xlsx": "^0.18.5",
    "openai": "^4.28.0",
    "@anthropic-ai/sdk": "^0.17.0",
    "commander": "^11.1.0",
    "zod": "^3.22.4",
    "dotenv": "^16.4.0"
  },
  "devDependencies": {
    "@types/express": "^4.17.21",
    "@types/multer": "^1.4.11",
    "@types/better-sqlite3": "^7.6.9",
    "@types/pdf-parse": "^1.1.4",
    "@types/node": "^20.11.0",
    "typescript": "^5.3.0",
    "tsx": "^4.7.0"
  }
}
141
planning-ingestion/src/cli.ts
Normal file
@@ -0,0 +1,141 @@
#!/usr/bin/env node
import { Command } from "commander";
import { config } from "dotenv";
import fs from "fs";
import * as readline from "readline";
import { DatabaseService } from "./database/db.js";
import { EncryptionService } from "./crypto/encryption.js";
import { IngestionService } from "./services/ingestion.js";
import type { LLMProvider, CalendarScope } from "./types/schema.js";

config();

const program = new Command();

function getServices() {
  const dbPath = process.env.DATABASE_PATH || "./data/planning.db";
  const masterKey = process.env.MASTER_KEY;

  if (!masterKey) {
    console.error("Error: MASTER_KEY not set in .env");
    process.exit(1);
  }

  const db = new DatabaseService(dbPath);
  const encryption = new EncryptionService(masterKey);
  const confidenceThreshold = parseFloat(process.env.CONFIDENCE_THRESHOLD || "0.7");
  const ingestion = new IngestionService(db, encryption, confidenceThreshold);

  return { db, encryption, ingestion };
}

async function promptInput(question: string): Promise<string> {
  const rl = readline.createInterface({
    input: process.stdin,
    output: process.stdout
  });

  return new Promise((resolve) => {
    rl.question(question, (answer) => {
      rl.close();
      resolve(answer);
    });
  });
}

program.name("planning").description("Planning Ingestion CLI").version("1.0.0");

program
  .command("setup:api")
  .option("--provider <provider>", "LLM provider: openai or anthropic")
  .description("Configure LLM API key")
  .action(async (options) => {
    let provider = options.provider as LLMProvider;

    if (!provider) {
      provider = (await promptInput("Provider (openai/anthropic): ")) as LLMProvider;
    }

    if (provider !== "openai" && provider !== "anthropic") {
      console.error("Error: provider must be 'openai' or 'anthropic'");
      process.exit(1);
    }

    const apiKey = await promptInput(`Enter your ${provider} API key: `);

    if (!apiKey || apiKey.length < 10) {
      console.error("Error: Invalid API key");
      process.exit(1);
    }

    const { ingestion } = getServices();
    await ingestion.saveAPIKey(provider, apiKey);

    console.log(`✓ API key for ${provider} saved securely`);
  });

program
  .command("ingest")
  .argument("<path>", "Path to planning file")
  .description("Ingest a planning document")
  .action(async (path) => {
    if (!fs.existsSync(path)) {
      console.error(`Error: File not found: ${path}`);
      process.exit(1);
    }

    const { ingestion } = getServices();
    const filename = path.split(/[/\\]/).pop() || "document";

    console.log(`Ingesting: ${filename}...`);

    const documentId = await ingestion.ingestDocument(path, filename);

    console.log(`✓ Document ingested successfully`);
    console.log(`Document ID: ${documentId}`);
  });

program
  .command("normalize")
  .requiredOption("--document <id>", "Document ID to normalize")
  .option("--scope <scope>", "Calendar scope: weekly or monthly")
  .option("--subject <subject>", "Subject: enfants, vacances, etc.")
  .description("Normalize a document to standard JSON")
  .action(async (options) => {
    const { ingestion } = getServices();

    console.log(`Normalizing document ${options.document}...`);

    const scope = options.scope as CalendarScope | undefined;
    const subject = options.subject;

    const scheduleId = await ingestion.normalizeDocument(options.document, {
      scope,
      subject
    });

    console.log(`✓ Normalization complete`);
    console.log(`Schedule ID: ${scheduleId}`);
  });

program
  .command("export:schedule")
  .requiredOption("--id <id>", "Schedule ID")
  .requiredOption("--out <file>", "Output JSON file")
  .description("Export schedule to JSON file")
  .action(async (options) => {
    const { ingestion } = getServices();

    const schedule = await ingestion.getSchedule(options.id);

    if (!schedule) {
      console.error(`Error: Schedule ${options.id} not found`);
      process.exit(1);
    }

    fs.writeFileSync(options.out, JSON.stringify(schedule, null, 2));

    console.log(`✓ Schedule exported to ${options.out}`);
  });

program.parse();
45
planning-ingestion/src/crypto/encryption.ts
Normal file
@@ -0,0 +1,45 @@
import crypto from "crypto";

const ALGORITHM = "aes-256-gcm";
const IV_LENGTH = 16;
const AUTH_TAG_LENGTH = 16;

export class EncryptionService {
  private masterKey: Buffer;

  constructor(masterKeyString: string) {
    if (!masterKeyString || masterKeyString.length < 32) {
      throw new Error("MASTER_KEY must be at least 32 characters");
    }
    this.masterKey = crypto.scryptSync(masterKeyString, "salt", 32);
  }

  encrypt(text: string): { encrypted: string; iv: string; authTag: string } {
    const iv = crypto.randomBytes(IV_LENGTH);
    const cipher = crypto.createCipheriv(ALGORITHM, this.masterKey, iv);

    let encrypted = cipher.update(text, "utf8", "hex");
    encrypted += cipher.final("hex");

    const authTag = cipher.getAuthTag();

    return {
      encrypted,
      iv: iv.toString("hex"),
      authTag: authTag.toString("hex")
    };
  }

  decrypt(encrypted: string, ivHex: string, authTagHex: string): string {
    const iv = Buffer.from(ivHex, "hex");
    const authTag = Buffer.from(authTagHex, "hex");

    const decipher = crypto.createDecipheriv(ALGORITHM, this.masterKey, iv);
    decipher.setAuthTag(authTag);

    let decrypted = decipher.update(encrypted, "hex", "utf8");
    decrypted += decipher.final("utf8");

    return decrypted;
  }
}
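A quick self-contained roundtrip check of the AES-256-GCM construction used by `EncryptionService` (same algorithm, scrypt key derivation, and hex encodings, restated compactly so it runs standalone; the sample master key is made up):

```typescript
import crypto from "crypto";

// Same construction as EncryptionService: scrypt-derived 32-byte key,
// AES-256-GCM, random 16-byte IV, auth tag kept alongside the ciphertext.
const key = crypto.scryptSync("a-master-key-of-at-least-32-characters!!", "salt", 32);

function encrypt(text: string) {
  const iv = crypto.randomBytes(16);
  const cipher = crypto.createCipheriv("aes-256-gcm", key, iv);
  const encrypted = cipher.update(text, "utf8", "hex") + cipher.final("hex");
  return { encrypted, iv: iv.toString("hex"), authTag: cipher.getAuthTag().toString("hex") };
}

function decrypt(encrypted: string, ivHex: string, authTagHex: string): string {
  const decipher = crypto.createDecipheriv("aes-256-gcm", key, Buffer.from(ivHex, "hex"));
  decipher.setAuthTag(Buffer.from(authTagHex, "hex"));
  return decipher.update(encrypted, "hex", "utf8") + decipher.final("utf8");
}

const box = encrypt("sk-test-api-key");
```

Tampering with `box.encrypted` or `box.authTag` makes `decipher.final()` throw, which is the point of GCM over plain CBC here.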
126
planning-ingestion/src/database/db.ts
Normal file
@@ -0,0 +1,126 @@
import Database from "better-sqlite3";
import crypto from "crypto";
import path from "path";
import fs from "fs";
import type { Document, Schedule, APIKeyConfig, LLMProvider, StandardSchedule } from "../types/schema.js";

export class DatabaseService {
  private db: Database.Database;

  constructor(dbPath: string) {
    const dir = path.dirname(dbPath);
    if (!fs.existsSync(dir)) {
      fs.mkdirSync(dir, { recursive: true });
    }

    this.db = new Database(dbPath);
    this.db.pragma("journal_mode = WAL");
    this.initTables();
  }

  private initTables() {
    this.db.exec(`
      CREATE TABLE IF NOT EXISTS documents (
        id TEXT PRIMARY KEY,
        filename TEXT NOT NULL,
        file_path TEXT NOT NULL,
        document_type TEXT NOT NULL,
        mime_type TEXT NOT NULL,
        size_bytes INTEGER NOT NULL,
        uploaded_at TEXT NOT NULL,
        raw_text TEXT
      );

      CREATE TABLE IF NOT EXISTS schedules (
        id TEXT PRIMARY KEY,
        document_id TEXT NOT NULL,
        schedule_json TEXT NOT NULL,
        created_at TEXT NOT NULL,
        FOREIGN KEY (document_id) REFERENCES documents(id)
      );

      CREATE TABLE IF NOT EXISTS api_keys (
        provider TEXT PRIMARY KEY,
        encrypted_key TEXT NOT NULL,
        iv TEXT NOT NULL,
        auth_tag TEXT NOT NULL,
        created_at TEXT NOT NULL
      );

      CREATE INDEX IF NOT EXISTS idx_documents_type ON documents(document_type);
      CREATE INDEX IF NOT EXISTS idx_schedules_document ON schedules(document_id);
    `);
  }

  // Documents
  insertDocument(doc: Omit<Document, "id" | "uploaded_at">): Document {
    const id = crypto.randomUUID();
    const uploaded_at = new Date().toISOString();

    const stmt = this.db.prepare(`
      INSERT INTO documents (id, filename, file_path, document_type, mime_type, size_bytes, uploaded_at, raw_text)
      VALUES (?, ?, ?, ?, ?, ?, ?, ?)
    `);

    stmt.run(id, doc.filename, doc.file_path, doc.document_type, doc.mime_type, doc.size_bytes, uploaded_at, doc.raw_text || null);

    return { id, uploaded_at, ...doc };
  }

  getDocument(id: string): Document | undefined {
    const stmt = this.db.prepare("SELECT * FROM documents WHERE id = ?");
    return stmt.get(id) as Document | undefined;
  }

  // Schedules
  insertSchedule(documentId: string, scheduleJson: StandardSchedule): Schedule {
    const id = crypto.randomUUID();
    const created_at = new Date().toISOString();

    const stmt = this.db.prepare(`
      INSERT INTO schedules (id, document_id, schedule_json, created_at)
      VALUES (?, ?, ?, ?)
    `);

    stmt.run(id, documentId, JSON.stringify(scheduleJson), created_at);

    return {
      id,
      document_id: documentId,
      schedule_json: scheduleJson,
      created_at
    };
  }

  getSchedule(id: string): Schedule | undefined {
    const stmt = this.db.prepare("SELECT * FROM schedules WHERE id = ?");
    const row = stmt.get(id) as any;

    if (!row) return undefined;

    return {
      ...row,
      schedule_json: JSON.parse(row.schedule_json)
    };
  }

  // API Keys
  saveAPIKey(config: Omit<APIKeyConfig, "created_at">): void {
    const created_at = new Date().toISOString();

    const stmt = this.db.prepare(`
      INSERT OR REPLACE INTO api_keys (provider, encrypted_key, iv, auth_tag, created_at)
      VALUES (?, ?, ?, ?, ?)
    `);

    stmt.run(config.provider, config.encrypted_key, config.iv, config.auth_tag || "", created_at);
  }

  getAPIKey(provider: LLMProvider): APIKeyConfig | undefined {
    const stmt = this.db.prepare("SELECT * FROM api_keys WHERE provider = ?");
    return stmt.get(provider) as APIKeyConfig | undefined;
  }

  close() {
    this.db.close();
  }
}
38
planning-ingestion/src/extractors/excel.ts
Normal file
@@ -0,0 +1,38 @@
import XLSX from "xlsx";
import type { ExtractionResult } from "./ocr.js";

export interface ExcelData {
  sheets: {
    name: string;
    data: any[][];
    range: string;
  }[];
}

export class ExcelExtractor {
  async extractFromExcel(filePath: string): Promise<ExtractionResult & { structuredData: ExcelData }> {
    const workbook = XLSX.readFile(filePath);

    const sheets = workbook.SheetNames.map((name) => {
      const worksheet = workbook.Sheets[name];
      const data = XLSX.utils.sheet_to_json(worksheet, { header: 1 }) as any[][];
      const range = worksheet["!ref"] || "A1";

      return { name, data, range };
    });

    const text = sheets
      .map((sheet) => {
        const rows = sheet.data.map((row: any[]) => row.join("\t")).join("\n");
        return `[${sheet.name}]\n${rows}`;
      })
      .join("\n\n");

    return {
      text,
      confidence: 0.95,
      method: "excel",
      structuredData: { sheets }
    };
  }
}
26
planning-ingestion/src/extractors/ocr.ts
Normal file
@@ -0,0 +1,26 @@
import Tesseract from "tesseract.js";
import type { DocumentType } from "../types/schema.js";

export interface ExtractionResult {
  text: string;
  confidence: number;
  method: string;
}

export class OCRExtractor {
  async extractFromImage(filePath: string): Promise<ExtractionResult> {
    const worker = await Tesseract.createWorker("fra");

    try {
      const { data } = await worker.recognize(filePath);

      return {
        text: data.text,
        confidence: data.confidence / 100,
        method: "ocr"
      };
    } finally {
      await worker.terminate();
    }
  }
}
16
planning-ingestion/src/extractors/pdf.ts
Normal file
@@ -0,0 +1,16 @@
import fs from "fs";
import pdfParse from "pdf-parse";
import type { ExtractionResult } from "./ocr.js";

export class PDFExtractor {
  async extractFromPDF(filePath: string): Promise<ExtractionResult> {
    const dataBuffer = fs.readFileSync(filePath);
    const data = await pdfParse(dataBuffer);

    return {
      text: data.text,
      confidence: data.text.length > 100 ? 0.9 : 0.6,
      method: "pdf"
    };
  }
}
118
planning-ingestion/src/llm/client.ts
Normal file
@@ -0,0 +1,118 @@
import OpenAI from "openai";
import Anthropic from "@anthropic-ai/sdk";
import type { LLMProvider, StandardSchedule } from "../types/schema.js";

export class LLMClient {
  private openai?: OpenAI;
  private anthropic?: Anthropic;

  constructor(provider: LLMProvider, apiKey: string) {
    if (provider === "openai") {
      this.openai = new OpenAI({ apiKey });
    } else {
      this.anthropic = new Anthropic({ apiKey });
    }
  }

  async normalizeWithLLM(rawText: string, extractedData?: any): Promise<StandardSchedule> {
    const prompt = this.buildPrompt(rawText, extractedData);

    if (this.openai) {
      return this.callOpenAI(prompt);
    } else if (this.anthropic) {
      return this.callAnthropic(prompt);
    }

    throw new Error("No LLM client configured");
  }

  private buildPrompt(rawText: string, extractedData?: any): string {
    return `Tu es un expert en analyse de planning scolaire et familial.

Analyse le texte suivant et extrait UNIQUEMENT un JSON valide conforme au schéma ci-dessous.

**CONTRAINTES STRICTES:**
- Réponds UNIQUEMENT avec le JSON, aucun texte avant ou après
- Format dates: YYYY-MM-DD
- Format heures: HH:MM (24h)
- calendar_scope: "weekly" ou "monthly"
- entities: tableau de strings ("enfants", "vacances", etc.)
- Chaque event doit avoir: title, date, start_time, end_time, location (ou null), tags[], notes (ou null), confidence (0-1)

**Schéma JSON attendu:**
{
  "version": "1.0",
  "calendar_scope": "weekly|monthly",
  "timezone": "Europe/Paris",
  "period": { "start": "YYYY-MM-DD", "end": "YYYY-MM-DD" },
  "entities": ["enfants"],
  "events": [
    {
      "title": "string",
      "date": "YYYY-MM-DD",
      "start_time": "HH:MM",
      "end_time": "HH:MM",
      "location": "string or null",
      "tags": ["string"],
      "notes": "string or null",
      "confidence": 0.9,
      "source_cells": []
    }
  ],
  "extraction": {
    "method": "llm",
    "model": "gpt-4o ou claude-3.5-sonnet",
    "heuristics": ["llm-analysis"]
  }
}

**Texte à analyser:**
${rawText}

${extractedData ? `**Données structurées supplémentaires:**\n${JSON.stringify(extractedData, null, 2)}` : ""}

Réponds UNIQUEMENT avec le JSON valide:`;
  }

  private async callOpenAI(prompt: string): Promise<StandardSchedule> {
    const response = await this.openai!.chat.completions.create({
      model: "gpt-4o",
      messages: [{ role: "user", content: prompt }],
      temperature: 0.1,
      max_tokens: 4000
    });

    const content = response.choices[0].message.content;
    if (!content) throw new Error("Empty LLM response");

    // Extract JSON from response (handle potential markdown wrapping)
    const jsonMatch = content.match(/```json\n?([\s\S]*?)\n?```/) || content.match(/({[\s\S]*})/);
    const jsonStr = jsonMatch ? jsonMatch[1] : content;

    const result = JSON.parse(jsonStr.trim());
    result.extraction.model = "gpt-4o";

    return result as StandardSchedule;
  }

  private async callAnthropic(prompt: string): Promise<StandardSchedule> {
    const response = await this.anthropic!.messages.create({
      model: "claude-3-5-sonnet-20241022",
      max_tokens: 4000,
      temperature: 0.1,
      messages: [{ role: "user", content: prompt }]
    });

    const content = response.content[0];
    if (content.type !== "text") throw new Error("Unexpected response type");

    // Extract JSON from response
    const jsonMatch = content.text.match(/```json\n?([\s\S]*?)\n?```/) || content.text.match(/({[\s\S]*})/);
    const jsonStr = jsonMatch ? jsonMatch[1] : content.text;

    const result = JSON.parse(jsonStr.trim());
    result.extraction.model = "claude-3.5-sonnet";

    return result as StandardSchedule;
  }
}
177
planning-ingestion/src/normalizer/parser.ts
Normal file
@@ -0,0 +1,177 @@
import type { StandardSchedule, CalendarScope, Event } from "../types/schema.js";

const DAYS_FR = ["lun", "mar", "mer", "jeu", "ven", "sam", "dim"];
const MONTHS_FR = [
  "janvier",
  "février",
  "mars",
  "avril",
  "mai",
  "juin",
  "juillet",
  "août",
  "septembre",
  "octobre",
  "novembre",
  "décembre"
];

export interface ParseOptions {
  scope?: CalendarScope;
  subject?: string;
}

export class PlanningParser {
  private detectCalendarScope(text: string): CalendarScope {
    const lowerText = text.toLowerCase();

    const weekIndicators = ["semaine", "lundi", "mardi", "mercredi", "jeudi", "vendredi"];
    const monthIndicators = MONTHS_FR;

    const weekScore = weekIndicators.filter((w) => lowerText.includes(w)).length;
    const monthScore = monthIndicators.filter((m) => lowerText.includes(m)).length;

    return weekScore > monthScore ? "weekly" : "monthly";
  }

  private extractDates(text: string): { start: string; end: string } | null {
    // Pattern: DD/MM/YYYY or YYYY-MM-DD
    const datePattern = /(\d{1,2}[\/\-]\d{1,2}[\/\-]\d{4}|\d{4}[\/\-]\d{2}[\/\-]\d{2})/g;
    const matches = text.match(datePattern);

    if (!matches || matches.length === 0) {
      return null;
    }

    const dates = matches.map((d) => this.normalizeDate(d)).filter((d) => d !== null) as string[];

    if (dates.length === 0) return null;

    dates.sort();

    return {
      start: dates[0],
      end: dates[dates.length - 1]
    };
  }

  private normalizeDate(dateStr: string): string | null {
    // Convert DD/MM/YYYY to YYYY-MM-DD
    const ddmmyyyy = dateStr.match(/^(\d{1,2})[\/\-](\d{1,2})[\/\-](\d{4})$/);
    if (ddmmyyyy) {
      const [, day, month, year] = ddmmyyyy;
      return `${year}-${month.padStart(2, "0")}-${day.padStart(2, "0")}`;
    }

    // Already YYYY-MM-DD
    const yyyymmdd = dateStr.match(/^(\d{4})[\/\-](\d{2})[\/\-](\d{2})$/);
    if (yyyymmdd) {
      return dateStr.replace(/\//g, "-");
    }

    return null;
  }

  private normalizeTime(timeStr: string): string {
    // Convert "8h30", "8:30", "08:30" to "08:30"
    const match = timeStr.match(/(\d{1,2})[:h]?(\d{2})?/i);
    if (!match) return "00:00";

    const hours = match[1].padStart(2, "0");
    const minutes = match[2] ? match[2].padStart(2, "0") : "00";

    return `${hours}:${minutes}`;
  }

  private extractEvents(text: string, lines: string[]): Event[] {
    const events: Event[] = [];

    // Simple heuristic: look for time patterns followed by activity names
    const timePattern = /(\d{1,2}[:h]\d{2})\s*[-–]\s*(\d{1,2}[:h]\d{2})/g;

    lines.forEach((line) => {
      const matches = Array.from(line.matchAll(timePattern));

      matches.forEach((match) => {
        const startTime = this.normalizeTime(match[1]);
        const endTime = this.normalizeTime(match[2]);

        // Extract title (rest of the line after time)
        const title = line.replace(match[0], "").trim();

        if (title) {
          events.push({
            title,
            date: "1970-01-01", // Placeholder, will be refined
            start_time: startTime,
            end_time: endTime,
            location: null,
            tags: [],
            notes: null,
            confidence: 0.6,
            source_cells: []
          });
        }
      });
    });

    return events;
  }

  private detectEntities(text: string): string[] {
    const entities: string[] = [];
    const lowerText = text.toLowerCase();

    if (lowerText.match(/enfant|école|cours|activité/)) {
      entities.push("enfants");
    }

    if (lowerText.match(/vacances|congé|camp|centre aéré/)) {
      entities.push("vacances");
    }

    return entities;
  }

  parseToStandard(text: string, options: ParseOptions = {}): StandardSchedule {
    const lines = text.split("\n").filter((line) => line.trim().length > 0);

    const scope = options.scope || this.detectCalendarScope(text);
    const dates = this.extractDates(text) || {
      start: new Date().toISOString().split("T")[0],
      end: new Date(Date.now() + 7 * 24 * 60 * 60 * 1000).toISOString().split("T")[0]
    };

    const entities = this.detectEntities(text);
    if (options.subject && !entities.includes(options.subject)) {
      entities.push(options.subject);
    }

    const events = this.extractEvents(text, lines);

    // Assign dates to events based on scope
    if (scope === "weekly" && events.length > 0) {
      const startDate = new Date(dates.start);
      events.forEach((event, index) => {
        const dayOffset = Math.floor(index / Math.max(1, events.length / 7));
        const eventDate = new Date(startDate);
        eventDate.setDate(startDate.getDate() + dayOffset);
        event.date = eventDate.toISOString().split("T")[0];
      });
    }

    return {
      version: "1.0",
      calendar_scope: scope,
      timezone: "Europe/Paris",
      period: dates,
      entities,
      events,
      extraction: {
        method: "heuristics" as any,
        model: "internal",
        heuristics: ["weekday-headers", "time-patterns", "date-extraction"]
      }
    };
  }
}
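The parser's date and time heuristics above are self-contained enough to try in isolation. A minimal standalone sketch (the two helpers re-declared as plain functions, outside the class, purely for illustration):

```typescript
// Standalone copies of PlanningParser's normalization helpers.

// Convert DD/MM/YYYY (or DD-MM-YYYY) to YYYY-MM-DD; pass YYYY-MM-DD through.
function normalizeDate(dateStr: string): string | null {
  const ddmmyyyy = dateStr.match(/^(\d{1,2})[\/\-](\d{1,2})[\/\-](\d{4})$/);
  if (ddmmyyyy) {
    const [, day, month, year] = ddmmyyyy;
    return `${year}-${month.padStart(2, "0")}-${day.padStart(2, "0")}`;
  }
  if (/^(\d{4})[\/\-](\d{2})[\/\-](\d{2})$/.test(dateStr)) {
    return dateStr.replace(/\//g, "-");
  }
  return null;
}

// Convert "8h30", "8:30", "08:30" (and bare "8h") to zero-padded "HH:MM".
function normalizeTime(timeStr: string): string {
  const match = timeStr.match(/(\d{1,2})[:h]?(\d{2})?/i);
  if (!match) return "00:00";
  const hours = match[1].padStart(2, "0");
  const minutes = match[2] ? match[2].padStart(2, "0") : "00";
  return `${hours}:${minutes}`;
}

console.log(normalizeDate("7/9/2025")); // "2025-09-07"
console.log(normalizeTime("8h30"));     // "08:30"
```

Note that the DD/MM/YYYY interpretation is day-first (French convention), which is why `7/9/2025` maps to September 7, not July 9.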
156
planning-ingestion/src/server.ts
Normal file
@@ -0,0 +1,156 @@
import express from "express";
import multer from "multer";
import { config } from "dotenv";
import path from "path";
import fs from "fs";
import { DatabaseService } from "./database/db.js";
import { EncryptionService } from "./crypto/encryption.js";
import { IngestionService } from "./services/ingestion.js";
import { z } from "zod";
import type { LLMProvider, CalendarScope } from "./types/schema.js";

config();

const app = express();
const port = parseInt(process.env.PORT || "8000");

// Setup services
const dbPath = process.env.DATABASE_PATH || "./data/planning.db";
const masterKey = process.env.MASTER_KEY;

if (!masterKey) {
  console.error("Error: MASTER_KEY not set in .env");
  process.exit(1);
}

const uploadDir = process.env.UPLOAD_DIR || "./uploads";
if (!fs.existsSync(uploadDir)) {
  fs.mkdirSync(uploadDir, { recursive: true });
}

const db = new DatabaseService(dbPath);
const encryption = new EncryptionService(masterKey);
const confidenceThreshold = parseFloat(process.env.CONFIDENCE_THRESHOLD || "0.7");
const ingestion = new IngestionService(db, encryption, confidenceThreshold);

// Multer setup
const storage = multer.diskStorage({
  destination: uploadDir,
  filename: (req, file, cb) => {
    const uniqueName = `${Date.now()}-${file.originalname}`;
    cb(null, uniqueName);
  }
});

const upload = multer({
  storage,
  limits: {
    fileSize: (parseInt(process.env.MAX_FILE_SIZE_MB || "50") || 50) * 1024 * 1024
  }
});

app.use(express.json());

// Health check
app.get("/health", (req, res) => {
  res.json({ status: "ok", timestamp: new Date().toISOString() });
});

// POST /auth/api-key - Store API key
app.post("/auth/api-key", async (req, res) => {
  try {
    const schema = z.object({
      provider: z.enum(["openai", "anthropic"]),
      apiKey: z.string().min(10)
    });

    const { provider, apiKey } = schema.parse(req.body);

    await ingestion.saveAPIKey(provider as LLMProvider, apiKey);

    res.json({ message: "API key stored securely" });
  } catch (error) {
    res.status(400).json({ error: error instanceof Error ? error.message : "Invalid request" });
  }
});

// POST /ingest - Ingest file
app.post("/ingest", upload.single("file"), async (req, res) => {
  try {
    if (!req.file) {
      return res.status(400).json({ error: "No file provided" });
    }

    const documentId = await ingestion.ingestDocument(req.file.path, req.file.originalname);

    res.json({
      document_id: documentId,
      filename: req.file.originalname
    });
  } catch (error) {
    res.status(500).json({ error: error instanceof Error ? error.message : "Ingestion failed" });
  }
});

// POST /ingest/normalize - Normalize document
app.post("/ingest/normalize", async (req, res) => {
  try {
    const schema = z.object({
      document_id: z.string(),
      scope: z.enum(["weekly", "monthly"]).optional(),
      subject: z.string().optional()
    });

    const { document_id, scope, subject } = schema.parse(req.body);

    const scheduleId = await ingestion.normalizeDocument(document_id, {
      scope: scope as CalendarScope | undefined,
      subject
    });

    res.json({ schedule_id: scheduleId });
  } catch (error) {
    res.status(500).json({ error: error instanceof Error ? error.message : "Normalization failed" });
  }
});

// GET /schedules/:id - Get schedule
app.get("/schedules/:id", async (req, res) => {
  try {
    const schedule = await ingestion.getSchedule(req.params.id);

    if (!schedule) {
      return res.status(404).json({ error: "Schedule not found" });
    }

    res.json(schedule);
  } catch (error) {
    res.status(500).json({ error: error instanceof Error ? error.message : "Failed to retrieve schedule" });
  }
});

// GET /documents/:id - Get document metadata
app.get("/documents/:id", async (req, res) => {
  try {
    const doc = db.getDocument(req.params.id);

    if (!doc) {
      return res.status(404).json({ error: "Document not found" });
    }

    res.json({
      id: doc.id,
      filename: doc.filename,
      document_type: doc.document_type,
      mime_type: doc.mime_type,
      size_bytes: doc.size_bytes,
      uploaded_at: doc.uploaded_at
    });
  } catch (error) {
    res.status(500).json({ error: error instanceof Error ? error.message : "Failed to retrieve document" });
  }
});

app.listen(port, () => {
  console.log(`Planning Ingestion Server listening on port ${port}`);
});
144
planning-ingestion/src/services/ingestion.ts
Normal file
@@ -0,0 +1,144 @@
import path from "path";
import fs from "fs";
import { DatabaseService } from "../database/db.js";
import { EncryptionService } from "../crypto/encryption.js";
import { OCRExtractor } from "../extractors/ocr.js";
import { PDFExtractor } from "../extractors/pdf.js";
import { ExcelExtractor } from "../extractors/excel.js";
import { PlanningParser, ParseOptions } from "../normalizer/parser.js";
import { LLMClient } from "../llm/client.js";
import type { DocumentType, LLMProvider, StandardSchedule } from "../types/schema.js";

export class IngestionService {
  constructor(
    private db: DatabaseService,
    private encryption: EncryptionService,
    private confidenceThreshold: number = 0.7
  ) {}

  async ingestDocument(filePath: string, filename: string): Promise<string> {
    const stats = fs.statSync(filePath);
    const mimeType = this.detectMimeType(filename);
    const documentType = this.detectDocumentType(mimeType);

    const doc = this.db.insertDocument({
      filename,
      file_path: filePath,
      document_type: documentType,
      mime_type: mimeType,
      size_bytes: stats.size
    });

    return doc.id;
  }

  async normalizeDocument(documentId: string, options: ParseOptions = {}): Promise<string> {
    const doc = this.db.getDocument(documentId);
    if (!doc) throw new Error(`Document ${documentId} not found`);

    // Step 1: Extract text
    const extractionResult = await this.extractText(doc.file_path, doc.document_type);

    // Step 2: Parse with heuristics
    const parser = new PlanningParser();
    let schedule = parser.parseToStandard(extractionResult.text, options);

    // Step 3: If confidence is low, use LLM
    const avgConfidence =
      schedule.events.reduce((sum, e) => sum + e.confidence, 0) / Math.max(1, schedule.events.length);

    if (avgConfidence < this.confidenceThreshold) {
      const llmConfig = await this.getLLMConfig();
      if (llmConfig) {
        const llmClient = new LLMClient(llmConfig.provider, llmConfig.apiKey);
        schedule = await llmClient.normalizeWithLLM(
          extractionResult.text,
          doc.document_type === "excel" ? (extractionResult as any).structuredData : undefined
        );
      }
    }

    // Update extraction method
    if (!schedule.extraction.method || schedule.extraction.method === ("heuristics" as any)) {
      schedule.extraction.method = doc.document_type === "image" ? "ocr" : doc.document_type;
    }

    // Step 4: Save schedule
    const savedSchedule = this.db.insertSchedule(documentId, schedule);

    return savedSchedule.id;
  }

  async getSchedule(scheduleId: string): Promise<StandardSchedule | undefined> {
    const schedule = this.db.getSchedule(scheduleId);
    return schedule?.schedule_json;
  }

  async saveAPIKey(provider: LLMProvider, apiKey: string): Promise<void> {
    const encrypted = this.encryption.encrypt(apiKey);

    this.db.saveAPIKey({
      provider,
      encrypted_key: encrypted.encrypted,
      iv: encrypted.iv,
      auth_tag: encrypted.authTag
    });
  }

  private async getLLMConfig(): Promise<{ provider: LLMProvider; apiKey: string } | null> {
    const openaiConfig = this.db.getAPIKey("openai");
    if (openaiConfig) {
      const apiKey = this.encryption.decrypt(openaiConfig.encrypted_key, openaiConfig.iv, openaiConfig.auth_tag || "");
      return { provider: "openai", apiKey };
    }

    const anthropicConfig = this.db.getAPIKey("anthropic");
    if (anthropicConfig) {
      const apiKey = this.encryption.decrypt(
        anthropicConfig.encrypted_key,
        anthropicConfig.iv,
        anthropicConfig.auth_tag || ""
      );
      return { provider: "anthropic", apiKey };
    }

    return null;
  }

  private async extractText(filePath: string, type: DocumentType): Promise<any> {
    switch (type) {
      case "image": {
        const ocr = new OCRExtractor();
        return ocr.extractFromImage(filePath);
      }
      case "pdf": {
        const pdf = new PDFExtractor();
        return pdf.extractFromPDF(filePath);
      }
      case "excel": {
        const excel = new ExcelExtractor();
        return excel.extractFromExcel(filePath);
      }
    }
  }

  private detectMimeType(filename: string): string {
    const ext = path.extname(filename).toLowerCase();
    const mimeTypes: Record<string, string> = {
      ".pdf": "application/pdf",
      ".xlsx": "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
      ".xls": "application/vnd.ms-excel",
      ".png": "image/png",
      ".jpg": "image/jpeg",
      ".jpeg": "image/jpeg"
    };
    return mimeTypes[ext] || "application/octet-stream";
  }

  private detectDocumentType(mimeType: string): DocumentType {
    if (mimeType.startsWith("image/")) return "image";
    if (mimeType === "application/pdf") return "pdf";
    if (mimeType.includes("spreadsheet") || mimeType.includes("excel")) return "excel";
    throw new Error(`Unsupported MIME type: ${mimeType}`);
  }
}
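The LLM fallback in `normalizeDocument` is driven by a simple average over per-event confidence scores. A standalone sketch of that decision (re-declared as a plain function for illustration; note that an empty event list averages to 0 and therefore always triggers the fallback):

```typescript
// Average-confidence gate behind the heuristics-vs-LLM decision.
interface ScoredEvent {
  confidence: number; // 0-1, as assigned by the heuristic parser
}

function needsLLMFallback(events: ScoredEvent[], threshold = 0.7): boolean {
  // Math.max(1, …) avoids division by zero for empty schedules.
  const avg = events.reduce((sum, e) => sum + e.confidence, 0) / Math.max(1, events.length);
  return avg < threshold;
}

console.log(needsLLMFallback([{ confidence: 0.6 }, { confidence: 0.6 }])); // true
console.log(needsLLMFallback([{ confidence: 0.9 }]));                      // false
console.log(needsLLMFallback([]));                                         // true
```

The heuristic parser assigns a flat confidence of 0.6 per event, so with the default threshold of 0.7 any purely heuristic parse will route through the LLM when an API key is configured.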
69
planning-ingestion/src/types/schema.ts
Normal file
@@ -0,0 +1,69 @@
import { z } from "zod";

export const CalendarScopeSchema = z.enum(["weekly", "monthly"]);
export type CalendarScope = z.infer<typeof CalendarScopeSchema>;

export const EventSchema = z.object({
  title: z.string(),
  date: z.string().regex(/^\d{4}-\d{2}-\d{2}$/),
  start_time: z.string().regex(/^\d{2}:\d{2}$/),
  end_time: z.string().regex(/^\d{2}:\d{2}$/),
  location: z.string().nullable(),
  tags: z.array(z.string()),
  notes: z.string().nullable(),
  confidence: z.number().min(0).max(1),
  source_cells: z.array(z.string()).optional()
});

export type Event = z.infer<typeof EventSchema>;

export const StandardScheduleSchema = z.object({
  version: z.literal("1.0"),
  calendar_scope: CalendarScopeSchema,
  timezone: z.string(),
  period: z.object({
    start: z.string().regex(/^\d{4}-\d{2}-\d{2}$/),
    end: z.string().regex(/^\d{4}-\d{2}-\d{2}$/)
  }),
  entities: z.array(z.string()),
  events: z.array(EventSchema),
  extraction: z.object({
    method: z.enum(["ocr", "pdf", "excel", "llm"]),
    model: z.string(),
    heuristics: z.array(z.string())
  })
});

export type StandardSchedule = z.infer<typeof StandardScheduleSchema>;

export const DocumentTypeSchema = z.enum(["image", "pdf", "excel"]);
export type DocumentType = z.infer<typeof DocumentTypeSchema>;

export const LLMProviderSchema = z.enum(["openai", "anthropic"]);
export type LLMProvider = z.infer<typeof LLMProviderSchema>;

export interface Document {
  id: string;
  filename: string;
  file_path: string;
  document_type: DocumentType;
  mime_type: string;
  size_bytes: number;
  uploaded_at: string;
  raw_text?: string;
}

export interface Schedule {
  id: string;
  document_id: string;
  schedule_json: StandardSchedule;
  created_at: string;
}

export interface APIKeyConfig {
  provider: LLMProvider;
  encrypted_key: string;
  iv: string;
  auth_tag: string;
  created_at: string;
}
20
planning-ingestion/tsconfig.json
Normal file
@@ -0,0 +1,20 @@
{
  "compilerOptions": {
    "target": "ES2022",
    "module": "ES2022",
    "moduleResolution": "node",
    "lib": ["ES2022"],
    "outDir": "./dist",
    "rootDir": "./src",
    "strict": true,
    "esModuleInterop": true,
    "skipLibCheck": true,
    "forceConsistentCasingInFileNames": true,
    "resolveJsonModule": true,
    "declaration": true,
    "declarationMap": true,
    "sourceMap": true
  },
  "include": ["src/**/*"],
  "exclude": ["node_modules", "dist"]
}