Initial commit: Family Planner application

Complete family planning application with:
- React frontend with TypeScript
- Node.js/Express backend with TypeScript
- Python ingestion service for document processing
- Planning ingestion service with LLM integration
- Shared UI components and type definitions
- OAuth integration for calendar synchronization
- Comprehensive documentation

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Author: philippe
Date: 2025-10-14 10:43:33 +02:00
Commit: fdd72c1135
239 changed files with 44160 additions and 0 deletions

planning-ingestion/.env.example

@@ -0,0 +1,20 @@
# Server Configuration
PORT=8000
NODE_ENV=development
# Security - REQUIRED: Master key for encrypting API keys (32+ chars)
MASTER_KEY=your-secure-master-key-minimum-32-characters-long
# Database
DATABASE_PATH=./data/planning.db
# Upload Configuration
UPLOAD_DIR=./uploads
MAX_FILE_SIZE_MB=50
# Confidence threshold for LLM fallback (0.0 to 1.0)
CONFIDENCE_THRESHOLD=0.7
# LLM Configuration (set via CLI or API, not here)
# LLM_PROVIDER=openai|anthropic
# LLM_API_KEY=<encrypted in database>

planning-ingestion/.gitignore

@@ -0,0 +1,9 @@
node_modules/
dist/
uploads/
data/
*.log
.env
.DS_Store
*.sqlite3
*.db

planning-ingestion/README.md

@@ -0,0 +1,492 @@
# Planning Ingestion Service

A professional Node.js service for ingesting and normalizing schedules. Supports images (OCR), PDF, and Excel, with intelligent normalization to a single standard JSON format via heuristics and an LLM (OpenAI/Anthropic).

## Features

- **Multi-format ingestion**: images (Tesseract OCR), PDF, Excel (.xlsx/.xls)
- **Intelligent normalization**: heuristics + LLM fallback on ambiguity
- **Single standard JSON**: contractual format for all output
- **Security**: API keys encrypted with AES-256-GCM, never stored in plaintext
- **CLI & HTTP**: command-line interface and REST API
- **Persistence**: SQLite with encrypted API keys

## Architecture

```
planning-ingestion/
├── src/
│   ├── types/schema.ts        # TypeScript & Zod schemas
│   ├── crypto/encryption.ts   # AES-256-GCM encryption
│   ├── database/db.ts         # SQLite (documents, schedules, api_keys)
│   ├── extractors/
│   │   ├── ocr.ts             # Tesseract for images
│   │   ├── pdf.ts             # pdf-parse for PDFs
│   │   └── excel.ts           # xlsx for Excel
│   ├── normalizer/parser.ts   # Parsing heuristics
│   ├── llm/client.ts          # OpenAI & Anthropic
│   ├── services/ingestion.ts  # Main service
│   ├── cli.ts                 # CLI
│   └── server.ts              # HTTP API
├── data/                      # SQLite database
├── uploads/                   # Uploaded files
├── package.json
├── tsconfig.json
└── .env
```

## Installation

```bash
npm install
```

## Configuration

Create a `.env` file at the project root:

```env
# Server
PORT=8000
NODE_ENV=development
# Security - REQUIRED: master key used to encrypt API keys (32+ characters)
MASTER_KEY=your-secure-master-key-minimum-32-characters-long
# Database
DATABASE_PATH=./data/planning.db
# Upload
UPLOAD_DIR=./uploads
MAX_FILE_SIZE_MB=50
# Confidence threshold (0.0-1.0): below this, the LLM is used
CONFIDENCE_THRESHOLD=0.7
```

⚠️ **MASTER_KEY** is critical: change it and never commit it.
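
One simple way to generate a suitable key (any method producing 32+ random characters works):

```bash
node -e "console.log(require('crypto').randomBytes(32).toString('hex'))"
```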

## CLI

### 1. Configure the LLM API key

Interactive prompt (recommended):

```bash
npm run cli setup:api
# Choose: openai or anthropic
# Enter the API key
```

Or with a flag:

```bash
npm run cli setup:api -- --provider openai
```

The key is encrypted and stored in SQLite, never in plaintext.

### 2. Ingest a document

```bash
npm run cli ingest path/to/planning.pdf
# Returns: Document ID: abc-123-xyz
```

Supported formats: `.png`, `.jpg`, `.jpeg`, `.pdf`, `.xlsx`, `.xls`

### 3. Normalize to the standard JSON

```bash
npm run cli normalize --document abc-123-xyz
# Options:
#   --scope weekly|monthly
#   --subject enfants|vacances
# Returns: Schedule ID: def-456-uvw
```

### 4. Export the JSON

```bash
npm run cli export:schedule --id def-456-uvw --out schedule.json
```

The resulting `schedule.json` is ready to be consumed by your application.

## HTTP API

### Start the server

```bash
npm run build
npm start
# or in dev mode:
npm run dev
```

The server listens on `http://localhost:8000`.
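
To check that it is up, query the health endpoint defined in `src/server.ts`:

```bash
curl http://localhost:8000/health
# {"status":"ok","timestamp":"..."}
```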

### Endpoints

#### `POST /auth/api-key`

Store the LLM API key securely.

**Body**:
```json
{
  "provider": "openai",
  "apiKey": "sk-..."
}
```

**Response**:
```json
{
  "message": "API key stored securely"
}
```

#### `POST /ingest`

Ingest a file (multipart/form-data).

**cURL**:
```bash
curl -X POST http://localhost:8000/ingest \
  -F "file=@planning.pdf"
```

**Response**:
```json
{
  "document_id": "abc-123-xyz",
  "filename": "planning.pdf"
}
```

#### `POST /ingest/normalize`

Normalize a document to the standard JSON.

**Body**:
```json
{
  "document_id": "abc-123-xyz",
  "scope": "weekly",
  "subject": "enfants"
}
```

**Response**:
```json
{
  "schedule_id": "def-456-uvw"
}
```

#### `GET /schedules/:id`

Retrieve the normalized standard JSON.

**cURL**:
```bash
curl http://localhost:8000/schedules/def-456-uvw
```

**Response** (contractual JSON):
```json
{
  "version": "1.0",
  "calendar_scope": "weekly",
  "timezone": "Europe/Paris",
  "period": {
    "start": "2025-01-13",
    "end": "2025-01-19"
  },
  "entities": ["enfants"],
  "events": [
    {
      "title": "Mathématiques",
      "date": "2025-01-13",
      "start_time": "08:30",
      "end_time": "10:00",
      "location": "Salle B12",
      "tags": ["cours"],
      "notes": null,
      "confidence": 0.95,
      "source_cells": []
    }
  ],
  "extraction": {
    "method": "pdf",
    "model": "internal",
    "heuristics": ["weekday-headers", "time-patterns", "date-extraction"]
  }
}
```

#### `GET /documents/:id`

Retrieve a document's metadata.

**Response**:
```json
{
  "id": "abc-123-xyz",
  "filename": "planning.pdf",
  "document_type": "pdf",
  "mime_type": "application/pdf",
  "size_bytes": 245830,
  "uploaded_at": "2025-01-12T14:30:00.000Z"
}
```

## Standard JSON (Output Contract)

All output strictly follows this schema:

```typescript
{
  version: "1.0",
  calendar_scope: "weekly" | "monthly",
  timezone: "Europe/Paris",
  period: {
    start: "YYYY-MM-DD",
    end: "YYYY-MM-DD"
  },
  entities: string[], // e.g. ["enfants", "vacances"]
  events: [
    {
      title: string,
      date: "YYYY-MM-DD",
      start_time: "HH:MM", // 24h
      end_time: "HH:MM",
      location: string | null,
      tags: string[],
      notes: string | null,
      confidence: number, // 0.0 to 1.0
      source_cells: string[] // e.g. ["A1", "B2"] (Excel only)
    }
  ],
  extraction: {
    method: "ocr" | "pdf" | "excel" | "llm",
    model: string, // "tesseract", "gpt-4o", "claude-3.5-sonnet", "internal"
    heuristics: string[]
  }
}
```
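
A consumer can enforce this contract with the Zod schema shipped in `src/types/schema.ts`. A minimal sketch (adjust the relative import path to where your code lives):

```typescript
import { readFileSync } from "fs";
import { StandardScheduleSchema } from "./src/types/schema.js";

// Throws a detailed ZodError if the file violates the contract.
const raw = JSON.parse(readFileSync("schedule.json", "utf8"));
const schedule = StandardScheduleSchema.parse(raw);
console.log(`${schedule.events.length} events (${schedule.calendar_scope})`);
```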

## I/O Examples

### Input: scanned image of a weekly schedule

File: `planning_semaine.jpg` (handwritten or printed scan)

```bash
npm run cli ingest planning_semaine.jpg
# Document ID: img-001
npm run cli normalize --document img-001 --scope weekly --subject enfants
# Schedule ID: sch-001
npm run cli export:schedule --id sch-001 --out semaine.json
```

**Output `semaine.json`**:
```json
{
  "version": "1.0",
  "calendar_scope": "weekly",
  "timezone": "Europe/Paris",
  "period": {
    "start": "2025-01-13",
    "end": "2025-01-19"
  },
  "entities": ["enfants"],
  "events": [
    {
      "title": "Piscine",
      "date": "2025-01-15",
      "start_time": "14:00",
      "end_time": "15:30",
      "location": "Centre aquatique",
      "tags": ["sport"],
      "notes": null,
      "confidence": 0.88,
      "source_cells": []
    }
  ],
  "extraction": {
    "method": "llm",
    "model": "gpt-4o",
    "heuristics": ["llm-analysis"]
  }
}
```

### Input: Excel file with multiple sheets

File: `planning_mensuel.xlsx`, with a "Janvier" sheet containing:

| Lundi | Mardi | Mercredi | Jeudi | Vendredi |
|-------|-------|----------|-------|----------|
| 8h-9h : Français | 8h-9h : Maths | 8h-9h : Sport | ... | ... |

```bash
curl -X POST http://localhost:8000/ingest -F "file=@planning_mensuel.xlsx"
# { "document_id": "xls-002" }
curl -X POST http://localhost:8000/ingest/normalize \
  -H "Content-Type: application/json" \
  -d '{"document_id":"xls-002","scope":"monthly","subject":"enfants"}'
# { "schedule_id": "sch-002" }
curl http://localhost:8000/schedules/sch-002 > mensuel.json
```

**Output `mensuel.json`**:
```json
{
  "version": "1.0",
  "calendar_scope": "monthly",
  "timezone": "Europe/Paris",
  "period": {
    "start": "2025-01-01",
    "end": "2025-01-31"
  },
  "entities": ["enfants"],
  "events": [
    {
      "title": "Français",
      "date": "2025-01-06",
      "start_time": "08:00",
      "end_time": "09:00",
      "location": null,
      "tags": ["cours"],
      "notes": null,
      "confidence": 0.98,
      "source_cells": ["B2"]
    },
    {
      "title": "Maths",
      "date": "2025-01-07",
      "start_time": "08:00",
      "end_time": "09:00",
      "location": null,
      "tags": ["cours"],
      "notes": null,
      "confidence": 0.98,
      "source_cells": ["C2"]
    }
  ],
  "extraction": {
    "method": "excel",
    "model": "internal",
    "heuristics": ["weekday-headers", "hour-columns", "table-detection"]
  }
}
```

## Security

- **API key encryption**: AES-256-GCM with a unique IV and auth tag per key (see the round-trip sketch below)
- **Master key**: derived from the `.env` value via scrypt (salted), never logged
- **No leaks**: endpoints never return an API key in plaintext
- **Locked-down SQLite**: `api_keys` table with encrypted columns
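
A minimal round-trip sketch of the `EncryptionService` from `src/crypto/encryption.ts` (illustrative only; in the real flow `IngestionService.saveAPIKey` does this for you):

```typescript
import { EncryptionService } from "./src/crypto/encryption.js";

const enc = new EncryptionService(process.env.MASTER_KEY!); // throws if shorter than 32 chars
const { encrypted, iv, authTag } = enc.encrypt("sk-my-api-key");
// All three values are stored; tampering with any of them makes
// GCM authentication fail on decryption.
const plaintext = enc.decrypt(encrypted, iv, authTag);
console.assert(plaintext === "sk-my-api-key");
```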

## Normalization

### Heuristics

The built-in parser detects:
- **French weekdays**: lun, mar, mer, jeu, ven, sam, dim
- **French months**: janvier, février, ..., décembre
- **Times**: `8h30`, `8:30`, `08:30` → `08:30`
- **Scope**: `weekly` if a week is identifiable, otherwise `monthly`
- **Entities**: keywords ("enfant", "vacances", "congé")
- **Events**: time-range detection + titles (illustrated in the sketch after this list)
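
A sketch of the heuristic pass using `PlanningParser` from `src/normalizer/parser.ts` (the input text is illustrative; adjust the import path to your layout):

```typescript
import { PlanningParser } from "./src/normalizer/parser.js";

const parser = new PlanningParser();
const schedule = parser.parseToStandard(
  "Semaine du 13/01/2025 au 19/01/2025\nlundi 8h30 - 10h00 Mathématiques",
  { subject: "enfants" }
);
// "semaine"/"lundi" outweigh month keywords, so the scope is "weekly";
// the date range becomes period.start/period.end, and "8h30" is
// normalized to "08:30".
console.log(schedule.calendar_scope, schedule.period, schedule.events[0].start_time);
```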

### LLM Fallback

When `confidence < CONFIDENCE_THRESHOLD` (default 0.7), the LLM is called with a strict prompt that must return **only** the contractual JSON.

Models:
- OpenAI: `gpt-4o`
- Anthropic: `claude-3-5-sonnet-20241022`

## Validation

Every produced JSON document is validated with Zod against `StandardScheduleSchema`.

## Logs

Logging is deliberately terse and never includes sensitive data. In production, configure proper monitoring (e.g. Winston transports to a file).

## Tests

Add your own unit/integration tests (e.g. Vitest, Jest):

```bash
npm test
```
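
For example, a starting point with Vitest (assumes `vitest` is added to `devDependencies`; it is not included in this commit):

```typescript
// tests/encryption.test.ts
import { describe, expect, it } from "vitest";
import { EncryptionService } from "../src/crypto/encryption.js";

describe("EncryptionService", () => {
  it("round-trips a secret", () => {
    const enc = new EncryptionService("a".repeat(32));
    const { encrypted, iv, authTag } = enc.encrypt("sk-test");
    expect(enc.decrypt(encrypted, iv, authTag)).toBe("sk-test");
  });

  it("rejects master keys shorter than 32 characters", () => {
    expect(() => new EncryptionService("too-short")).toThrow();
  });
});
```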

## Build

```bash
npm run build
# outputs dist/
node dist/server.js
```

## Deployment

1. Clone the repository on the server
2. Create `.env` with a secure `MASTER_KEY`
3. Install dependencies: `npm install`
4. Configure the API key: `npm run cli setup:api -- --provider openai`
5. Build: `npm run build`
6. Start: `npm start`

Use PM2, systemd, or Docker in production.
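
For example with PM2 (illustrative; adjust the app name and paths to your setup):

```bash
npm install -g pm2
pm2 start dist/server.js --name planning-ingestion
pm2 save   # persist the process list across reboots
```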

## Troubleshooting

**"MASTER_KEY not set" error**:
- Check that `.env` defines a `MASTER_KEY` of 32+ characters

**The LLM is never triggered**:
- Check that the parsing confidence is below `CONFIDENCE_THRESHOLD`
- Check that an API key is configured: `npm run cli setup:api`

**Inaccurate OCR**:
- Use high-resolution images
- Preprocess the image (contrast, rotation)
- The LLM is called automatically when confidence is low

**Excel parsed incorrectly**:
- Check the cell formats (dates, times)
- The LLM can resolve ambiguities

## License

Proprietary. All rights reserved.

planning-ingestion/package-lock.json (generated)

File diff suppressed because it is too large.

planning-ingestion/package.json

@@ -0,0 +1,38 @@
{
  "name": "planning-ingestion",
  "version": "1.0.0",
  "type": "module",
  "description": "Planning ingestion service with OCR, PDF, Excel support and LLM normalization",
  "main": "dist/server.js",
  "bin": {
    "planning": "dist/cli.js"
  },
  "scripts": {
    "build": "tsc",
    "start": "node dist/server.js",
    "dev": "tsx watch src/server.ts",
    "cli": "tsx src/cli.ts"
  },
  "dependencies": {
    "express": "^4.19.0",
    "multer": "^1.4.5-lts.2",
    "better-sqlite3": "^9.4.0",
    "tesseract.js": "^5.0.0",
    "pdf-parse": "^1.1.1",
    "xlsx": "^0.18.5",
    "openai": "^4.28.0",
    "@anthropic-ai/sdk": "^0.17.0",
    "commander": "^11.1.0",
    "zod": "^3.22.4",
    "dotenv": "^16.4.0"
  },
  "devDependencies": {
    "@types/express": "^4.17.21",
    "@types/multer": "^1.4.11",
    "@types/better-sqlite3": "^7.6.9",
    "@types/pdf-parse": "^1.1.4",
    "@types/node": "^20.11.0",
    "typescript": "^5.3.0",
    "tsx": "^4.7.0"
  }
}

planning-ingestion/src/cli.ts

@@ -0,0 +1,141 @@
#!/usr/bin/env node
import { Command } from "commander";
import { config } from "dotenv";
import fs from "fs";
import * as readline from "readline";
import { DatabaseService } from "./database/db.js";
import { EncryptionService } from "./crypto/encryption.js";
import { IngestionService } from "./services/ingestion.js";
import type { LLMProvider, CalendarScope } from "./types/schema.js";

config();

const program = new Command();

function getServices() {
  const dbPath = process.env.DATABASE_PATH || "./data/planning.db";
  const masterKey = process.env.MASTER_KEY;
  if (!masterKey) {
    console.error("Error: MASTER_KEY not set in .env");
    process.exit(1);
  }
  const db = new DatabaseService(dbPath);
  const encryption = new EncryptionService(masterKey);
  const confidenceThreshold = parseFloat(process.env.CONFIDENCE_THRESHOLD || "0.7");
  const ingestion = new IngestionService(db, encryption, confidenceThreshold);
  return { db, encryption, ingestion };
}

async function promptInput(question: string): Promise<string> {
  const rl = readline.createInterface({
    input: process.stdin,
    output: process.stdout
  });
  return new Promise((resolve) => {
    rl.question(question, (answer) => {
      rl.close();
      resolve(answer);
    });
  });
}

program.name("planning").description("Planning Ingestion CLI").version("1.0.0");

program
  .command("setup:api")
  .option("--provider <provider>", "LLM provider: openai or anthropic")
  .description("Configure LLM API key")
  .action(async (options) => {
    let provider = options.provider as LLMProvider;
    if (!provider) {
      provider = (await promptInput("Provider (openai/anthropic): ")) as LLMProvider;
    }
    if (provider !== "openai" && provider !== "anthropic") {
      console.error("Error: provider must be 'openai' or 'anthropic'");
      process.exit(1);
    }
    const apiKey = await promptInput(`Enter your ${provider} API key: `);
    if (!apiKey || apiKey.length < 10) {
      console.error("Error: Invalid API key");
      process.exit(1);
    }
    const { ingestion } = getServices();
    await ingestion.saveAPIKey(provider, apiKey);
    console.log(`✓ API key for ${provider} saved securely`);
  });

program
  .command("ingest")
  .argument("<path>", "Path to planning file")
  .description("Ingest a planning document")
  .action(async (path) => {
    if (!fs.existsSync(path)) {
      console.error(`Error: File not found: ${path}`);
      process.exit(1);
    }
    const { ingestion } = getServices();
    const filename = path.split(/[/\\]/).pop() || "document";
    console.log(`Ingesting: ${filename}...`);
    const documentId = await ingestion.ingestDocument(path, filename);
    console.log(`✓ Document ingested successfully`);
    console.log(`Document ID: ${documentId}`);
  });

program
  .command("normalize")
  .requiredOption("--document <id>", "Document ID to normalize")
  .option("--scope <scope>", "Calendar scope: weekly or monthly")
  .option("--subject <subject>", "Subject: enfants, vacances, etc.")
  .description("Normalize a document to standard JSON")
  .action(async (options) => {
    const { ingestion } = getServices();
    console.log(`Normalizing document ${options.document}...`);
    const scope = options.scope as CalendarScope | undefined;
    const subject = options.subject;
    const scheduleId = await ingestion.normalizeDocument(options.document, {
      scope,
      subject
    });
    console.log(`✓ Normalization complete`);
    console.log(`Schedule ID: ${scheduleId}`);
  });

program
  .command("export:schedule")
  .requiredOption("--id <id>", "Schedule ID")
  .requiredOption("--out <file>", "Output JSON file")
  .description("Export schedule to JSON file")
  .action(async (options) => {
    const { ingestion } = getServices();
    const schedule = await ingestion.getSchedule(options.id);
    if (!schedule) {
      console.error(`Error: Schedule ${options.id} not found`);
      process.exit(1);
    }
    fs.writeFileSync(options.out, JSON.stringify(schedule, null, 2));
    console.log(`✓ Schedule exported to ${options.out}`);
  });

program.parse();

planning-ingestion/src/crypto/encryption.ts

@@ -0,0 +1,45 @@
import crypto from "crypto";

const ALGORITHM = "aes-256-gcm";
const IV_LENGTH = 16;
const AUTH_TAG_LENGTH = 16;

export class EncryptionService {
  private masterKey: Buffer;

  constructor(masterKeyString: string) {
    if (!masterKeyString || masterKeyString.length < 32) {
      throw new Error("MASTER_KEY must be at least 32 characters");
    }
    // scrypt with a fixed salt: derivation is deterministic, so the same
    // MASTER_KEY in .env always yields the same AES key.
    this.masterKey = crypto.scryptSync(masterKeyString, "salt", 32);
  }

  encrypt(text: string): { encrypted: string; iv: string; authTag: string } {
    const iv = crypto.randomBytes(IV_LENGTH);
    const cipher = crypto.createCipheriv(ALGORITHM, this.masterKey, iv);
    let encrypted = cipher.update(text, "utf8", "hex");
    encrypted += cipher.final("hex");
    const authTag = cipher.getAuthTag();
    return {
      encrypted,
      iv: iv.toString("hex"),
      authTag: authTag.toString("hex")
    };
  }

  decrypt(encrypted: string, ivHex: string, authTagHex: string): string {
    const iv = Buffer.from(ivHex, "hex");
    const authTag = Buffer.from(authTagHex, "hex");
    const decipher = crypto.createDecipheriv(ALGORITHM, this.masterKey, iv);
    decipher.setAuthTag(authTag);
    let decrypted = decipher.update(encrypted, "hex", "utf8");
    decrypted += decipher.final("utf8");
    return decrypted;
  }
}

planning-ingestion/src/database/db.ts

@@ -0,0 +1,126 @@
import Database from "better-sqlite3";
import crypto from "crypto";
import path from "path";
import fs from "fs";
import type { Document, Schedule, APIKeyConfig, LLMProvider, StandardSchedule } from "../types/schema.js";

export class DatabaseService {
  private db: Database.Database;

  constructor(dbPath: string) {
    const dir = path.dirname(dbPath);
    if (!fs.existsSync(dir)) {
      fs.mkdirSync(dir, { recursive: true });
    }
    this.db = new Database(dbPath);
    this.db.pragma("journal_mode = WAL");
    this.initTables();
  }

  private initTables() {
    this.db.exec(`
      CREATE TABLE IF NOT EXISTS documents (
        id TEXT PRIMARY KEY,
        filename TEXT NOT NULL,
        file_path TEXT NOT NULL,
        document_type TEXT NOT NULL,
        mime_type TEXT NOT NULL,
        size_bytes INTEGER NOT NULL,
        uploaded_at TEXT NOT NULL,
        raw_text TEXT
      );
      CREATE TABLE IF NOT EXISTS schedules (
        id TEXT PRIMARY KEY,
        document_id TEXT NOT NULL,
        schedule_json TEXT NOT NULL,
        created_at TEXT NOT NULL,
        FOREIGN KEY (document_id) REFERENCES documents(id)
      );
      CREATE TABLE IF NOT EXISTS api_keys (
        provider TEXT PRIMARY KEY,
        encrypted_key TEXT NOT NULL,
        iv TEXT NOT NULL,
        auth_tag TEXT NOT NULL,
        created_at TEXT NOT NULL
      );
      CREATE INDEX IF NOT EXISTS idx_documents_type ON documents(document_type);
      CREATE INDEX IF NOT EXISTS idx_schedules_document ON schedules(document_id);
    `);
  }

  // Documents
  insertDocument(doc: Omit<Document, "id" | "uploaded_at">): Document {
    const id = crypto.randomUUID();
    const uploaded_at = new Date().toISOString();
    const stmt = this.db.prepare(`
      INSERT INTO documents (id, filename, file_path, document_type, mime_type, size_bytes, uploaded_at, raw_text)
      VALUES (?, ?, ?, ?, ?, ?, ?, ?)
    `);
    stmt.run(id, doc.filename, doc.file_path, doc.document_type, doc.mime_type, doc.size_bytes, uploaded_at, doc.raw_text || null);
    return { id, uploaded_at, ...doc };
  }

  getDocument(id: string): Document | undefined {
    const stmt = this.db.prepare("SELECT * FROM documents WHERE id = ?");
    return stmt.get(id) as Document | undefined;
  }

  // Schedules
  insertSchedule(documentId: string, scheduleJson: StandardSchedule): Schedule {
    const id = crypto.randomUUID();
    const created_at = new Date().toISOString();
    const stmt = this.db.prepare(`
      INSERT INTO schedules (id, document_id, schedule_json, created_at)
      VALUES (?, ?, ?, ?)
    `);
    stmt.run(id, documentId, JSON.stringify(scheduleJson), created_at);
    return {
      id,
      document_id: documentId,
      schedule_json: scheduleJson,
      created_at
    };
  }

  getSchedule(id: string): Schedule | undefined {
    const stmt = this.db.prepare("SELECT * FROM schedules WHERE id = ?");
    const row = stmt.get(id) as any;
    if (!row) return undefined;
    return {
      ...row,
      schedule_json: JSON.parse(row.schedule_json)
    };
  }

  // API Keys
  saveAPIKey(config: Omit<APIKeyConfig, "created_at">): void {
    const created_at = new Date().toISOString();
    const stmt = this.db.prepare(`
      INSERT OR REPLACE INTO api_keys (provider, encrypted_key, iv, auth_tag, created_at)
      VALUES (?, ?, ?, ?, ?)
    `);
    stmt.run(config.provider, config.encrypted_key, config.iv, config.auth_tag || "", created_at);
  }

  getAPIKey(provider: LLMProvider): APIKeyConfig | undefined {
    const stmt = this.db.prepare("SELECT * FROM api_keys WHERE provider = ?");
    return stmt.get(provider) as APIKeyConfig | undefined;
  }

  close() {
    this.db.close();
  }
}

planning-ingestion/src/extractors/excel.ts

@@ -0,0 +1,38 @@
import XLSX from "xlsx";
import type { ExtractionResult } from "./ocr.js";

export interface ExcelData {
  sheets: {
    name: string;
    data: any[][];
    range: string;
  }[];
}

export class ExcelExtractor {
  async extractFromExcel(filePath: string): Promise<ExtractionResult & { structuredData: ExcelData }> {
    const workbook = XLSX.readFile(filePath);
    const sheets = workbook.SheetNames.map((name) => {
      const worksheet = workbook.Sheets[name];
      const data = XLSX.utils.sheet_to_json(worksheet, { header: 1 }) as any[][];
      const range = worksheet["!ref"] || "A1";
      return { name, data, range };
    });
    const text = sheets
      .map((sheet) => {
        const rows = sheet.data.map((row: any[]) => row.join("\t")).join("\n");
        return `[${sheet.name}]\n${rows}`;
      })
      .join("\n\n");
    return {
      text,
      confidence: 0.95,
      method: "excel",
      structuredData: { sheets }
    };
  }
}

planning-ingestion/src/extractors/ocr.ts

@@ -0,0 +1,26 @@
import Tesseract from "tesseract.js";

export interface ExtractionResult {
  text: string;
  confidence: number;
  method: string;
}

export class OCRExtractor {
  async extractFromImage(filePath: string): Promise<ExtractionResult> {
    // French-language worker; Tesseract reports confidence on a 0-100 scale.
    const worker = await Tesseract.createWorker("fra");
    try {
      const { data } = await worker.recognize(filePath);
      return {
        text: data.text,
        confidence: data.confidence / 100,
        method: "ocr"
      };
    } finally {
      await worker.terminate();
    }
  }
}

planning-ingestion/src/extractors/pdf.ts

@@ -0,0 +1,16 @@
import fs from "fs";
import pdfParse from "pdf-parse";
import type { ExtractionResult } from "./ocr.js";

export class PDFExtractor {
  async extractFromPDF(filePath: string): Promise<ExtractionResult> {
    const dataBuffer = fs.readFileSync(filePath);
    const data = await pdfParse(dataBuffer);
    return {
      text: data.text,
      // Very short extractions usually indicate a scanned or unusual PDF,
      // so confidence drops and the LLM fallback is more likely to trigger.
      confidence: data.text.length > 100 ? 0.9 : 0.6,
      method: "pdf"
    };
  }
}

planning-ingestion/src/llm/client.ts

@@ -0,0 +1,118 @@
import OpenAI from "openai";
import Anthropic from "@anthropic-ai/sdk";
import type { LLMProvider, StandardSchedule } from "../types/schema.js";

export class LLMClient {
  private openai?: OpenAI;
  private anthropic?: Anthropic;

  constructor(provider: LLMProvider, apiKey: string) {
    if (provider === "openai") {
      this.openai = new OpenAI({ apiKey });
    } else {
      this.anthropic = new Anthropic({ apiKey });
    }
  }

  async normalizeWithLLM(rawText: string, extractedData?: any): Promise<StandardSchedule> {
    const prompt = this.buildPrompt(rawText, extractedData);
    if (this.openai) {
      return this.callOpenAI(prompt);
    } else if (this.anthropic) {
      return this.callAnthropic(prompt);
    }
    throw new Error("No LLM client configured");
  }

  // The prompt is intentionally in French: the service targets French
  // planning documents.
  private buildPrompt(rawText: string, extractedData?: any): string {
    return `Tu es un expert en analyse de planning scolaire et familial.
Analyse le texte suivant et extrait UNIQUEMENT un JSON valide conforme au schéma ci-dessous.
**CONTRAINTES STRICTES:**
- Réponds UNIQUEMENT avec le JSON, aucun texte avant ou après
- Format dates: YYYY-MM-DD
- Format heures: HH:MM (24h)
- calendar_scope: "weekly" ou "monthly"
- entities: tableau de strings ("enfants", "vacances", etc.)
- Chaque event doit avoir: title, date, start_time, end_time, location (ou null), tags[], notes (ou null), confidence (0-1)
**Schéma JSON attendu:**
{
  "version": "1.0",
  "calendar_scope": "weekly|monthly",
  "timezone": "Europe/Paris",
  "period": { "start": "YYYY-MM-DD", "end": "YYYY-MM-DD" },
  "entities": ["enfants"],
  "events": [
    {
      "title": "string",
      "date": "YYYY-MM-DD",
      "start_time": "HH:MM",
      "end_time": "HH:MM",
      "location": "string or null",
      "tags": ["string"],
      "notes": "string or null",
      "confidence": 0.9,
      "source_cells": []
    }
  ],
  "extraction": {
    "method": "llm",
    "model": "gpt-4o ou claude-3.5-sonnet",
    "heuristics": ["llm-analysis"]
  }
}
**Texte à analyser:**
${rawText}
${extractedData ? `**Données structurées supplémentaires:**\n${JSON.stringify(extractedData, null, 2)}` : ""}
Réponds UNIQUEMENT avec le JSON valide:`;
  }

  private async callOpenAI(prompt: string): Promise<StandardSchedule> {
    const response = await this.openai!.chat.completions.create({
      model: "gpt-4o",
      messages: [{ role: "user", content: prompt }],
      temperature: 0.1,
      max_tokens: 4000
    });
    const content = response.choices[0].message.content;
    if (!content) throw new Error("Empty LLM response");
    // Extract JSON from response (handle potential markdown wrapping)
    const jsonMatch = content.match(/```json\n?([\s\S]*?)\n?```/) || content.match(/({[\s\S]*})/);
    const jsonStr = jsonMatch ? jsonMatch[1] : content;
    const result = JSON.parse(jsonStr.trim());
    result.extraction.model = "gpt-4o";
    return result as StandardSchedule;
  }

  private async callAnthropic(prompt: string): Promise<StandardSchedule> {
    const response = await this.anthropic!.messages.create({
      model: "claude-3-5-sonnet-20241022",
      max_tokens: 4000,
      temperature: 0.1,
      messages: [{ role: "user", content: prompt }]
    });
    const content = response.content[0];
    if (content.type !== "text") throw new Error("Unexpected response type");
    // Extract JSON from response
    const jsonMatch = content.text.match(/```json\n?([\s\S]*?)\n?```/) || content.text.match(/({[\s\S]*})/);
    const jsonStr = jsonMatch ? jsonMatch[1] : content.text;
    const result = JSON.parse(jsonStr.trim());
    result.extraction.model = "claude-3.5-sonnet";
    return result as StandardSchedule;
  }
}

planning-ingestion/src/normalizer/parser.ts

@@ -0,0 +1,177 @@
import type { StandardSchedule, CalendarScope, Event } from "../types/schema.js";

const DAYS_FR = ["lun", "mar", "mer", "jeu", "ven", "sam", "dim"];
const MONTHS_FR = [
  "janvier",
  "février",
  "mars",
  "avril",
  "mai",
  "juin",
  "juillet",
  "août",
  "septembre",
  "octobre",
  "novembre",
  "décembre"
];

export interface ParseOptions {
  scope?: CalendarScope;
  subject?: string;
}

export class PlanningParser {
  private detectCalendarScope(text: string): CalendarScope {
    const lowerText = text.toLowerCase();
    const weekIndicators = ["semaine", "lundi", "mardi", "mercredi", "jeudi", "vendredi"];
    const monthIndicators = MONTHS_FR;
    const weekScore = weekIndicators.filter((w) => lowerText.includes(w)).length;
    const monthScore = monthIndicators.filter((m) => lowerText.includes(m)).length;
    return weekScore > monthScore ? "weekly" : "monthly";
  }

  private extractDates(text: string): { start: string; end: string } | null {
    // Pattern: DD/MM/YYYY or YYYY-MM-DD
    const datePattern = /(\d{1,2}[\/\-]\d{1,2}[\/\-]\d{4}|\d{4}[\/\-]\d{2}[\/\-]\d{2})/g;
    const matches = text.match(datePattern);
    if (!matches || matches.length === 0) {
      return null;
    }
    const dates = matches.map((d) => this.normalizeDate(d)).filter((d) => d !== null) as string[];
    if (dates.length === 0) return null;
    dates.sort();
    return {
      start: dates[0],
      end: dates[dates.length - 1]
    };
  }

  private normalizeDate(dateStr: string): string | null {
    // Convert DD/MM/YYYY to YYYY-MM-DD
    const ddmmyyyy = dateStr.match(/^(\d{1,2})[\/\-](\d{1,2})[\/\-](\d{4})$/);
    if (ddmmyyyy) {
      const [, day, month, year] = ddmmyyyy;
      return `${year}-${month.padStart(2, "0")}-${day.padStart(2, "0")}`;
    }
    // Already YYYY-MM-DD
    const yyyymmdd = dateStr.match(/^(\d{4})[\/\-](\d{2})[\/\-](\d{2})$/);
    if (yyyymmdd) {
      return dateStr.replace(/\//g, "-");
    }
    return null;
  }

  private normalizeTime(timeStr: string): string {
    // Convert "8h30", "8:30", "08:30" to "08:30"
    const match = timeStr.match(/(\d{1,2})[:h]?(\d{2})?/i);
    if (!match) return "00:00";
    const hours = match[1].padStart(2, "0");
    const minutes = match[2] ? match[2].padStart(2, "0") : "00";
    return `${hours}:${minutes}`;
  }

  private extractEvents(text: string, lines: string[]): Event[] {
    const events: Event[] = [];
    // Simple heuristic: look for time patterns followed by activity names
    const timePattern = /(\d{1,2}[:h]\d{2})\s*[-]\s*(\d{1,2}[:h]\d{2})/g;
    lines.forEach((line) => {
      const matches = Array.from(line.matchAll(timePattern));
      matches.forEach((match) => {
        const startTime = this.normalizeTime(match[1]);
        const endTime = this.normalizeTime(match[2]);
        // Extract title (rest of the line after time)
        const title = line.replace(match[0], "").trim();
        if (title) {
          events.push({
            title,
            date: "1970-01-01", // Placeholder, will be refined
            start_time: startTime,
            end_time: endTime,
            location: null,
            tags: [],
            notes: null,
            confidence: 0.6,
            source_cells: []
          });
        }
      });
    });
    return events;
  }

  private detectEntities(text: string): string[] {
    const entities: string[] = [];
    const lowerText = text.toLowerCase();
    if (lowerText.match(/enfant|école|cours|activité/)) {
      entities.push("enfants");
    }
    if (lowerText.match(/vacances|congé|camp|centre aéré/)) {
      entities.push("vacances");
    }
    return entities;
  }

  parseToStandard(text: string, options: ParseOptions = {}): StandardSchedule {
    const lines = text.split("\n").filter((line) => line.trim().length > 0);
    const scope = options.scope || this.detectCalendarScope(text);
    const dates = this.extractDates(text) || {
      start: new Date().toISOString().split("T")[0],
      end: new Date(Date.now() + 7 * 24 * 60 * 60 * 1000).toISOString().split("T")[0]
    };
    const entities = this.detectEntities(text);
    if (options.subject && !entities.includes(options.subject)) {
      entities.push(options.subject);
    }
    const events = this.extractEvents(text, lines);
    // Assign dates to events based on scope
    if (scope === "weekly" && events.length > 0) {
      const startDate = new Date(dates.start);
      events.forEach((event, index) => {
        const dayOffset = Math.floor(index / Math.max(1, events.length / 7));
        const eventDate = new Date(startDate);
        eventDate.setDate(startDate.getDate() + dayOffset);
        event.date = eventDate.toISOString().split("T")[0];
      });
    }
    return {
      version: "1.0",
      calendar_scope: scope,
      timezone: "Europe/Paris",
      period: dates,
      entities,
      events,
      extraction: {
        // "heuristics" is not part of the schema enum; IngestionService
        // rewrites it to the real extraction method before saving.
        method: "heuristics" as any,
        model: "internal",
        heuristics: ["weekday-headers", "time-patterns", "date-extraction"]
      }
    };
  }
}

planning-ingestion/src/server.ts

@@ -0,0 +1,156 @@
import express from "express";
import multer from "multer";
import { config } from "dotenv";
import fs from "fs";
import { DatabaseService } from "./database/db.js";
import { EncryptionService } from "./crypto/encryption.js";
import { IngestionService } from "./services/ingestion.js";
import { z } from "zod";
import type { LLMProvider, CalendarScope } from "./types/schema.js";

config();

const app = express();
const port = parseInt(process.env.PORT || "8000");

// Setup services
const dbPath = process.env.DATABASE_PATH || "./data/planning.db";
const masterKey = process.env.MASTER_KEY;
if (!masterKey) {
  console.error("Error: MASTER_KEY not set in .env");
  process.exit(1);
}
const uploadDir = process.env.UPLOAD_DIR || "./uploads";
if (!fs.existsSync(uploadDir)) {
  fs.mkdirSync(uploadDir, { recursive: true });
}

const db = new DatabaseService(dbPath);
const encryption = new EncryptionService(masterKey);
const confidenceThreshold = parseFloat(process.env.CONFIDENCE_THRESHOLD || "0.7");
const ingestion = new IngestionService(db, encryption, confidenceThreshold);

// Multer setup
const storage = multer.diskStorage({
  destination: uploadDir,
  filename: (req, file, cb) => {
    const uniqueName = `${Date.now()}-${file.originalname}`;
    cb(null, uniqueName);
  }
});
const upload = multer({
  storage,
  limits: {
    fileSize: (parseInt(process.env.MAX_FILE_SIZE_MB || "50") || 50) * 1024 * 1024
  }
});

app.use(express.json());

// Health check
app.get("/health", (req, res) => {
  res.json({ status: "ok", timestamp: new Date().toISOString() });
});

// POST /auth/api-key - Store API key
app.post("/auth/api-key", async (req, res) => {
  try {
    const schema = z.object({
      provider: z.enum(["openai", "anthropic"]),
      apiKey: z.string().min(10)
    });
    const { provider, apiKey } = schema.parse(req.body);
    await ingestion.saveAPIKey(provider as LLMProvider, apiKey);
    res.json({ message: "API key stored securely" });
  } catch (error) {
    res.status(400).json({ error: error instanceof Error ? error.message : "Invalid request" });
  }
});

// POST /ingest - Ingest file
app.post("/ingest", upload.single("file"), async (req, res) => {
  try {
    if (!req.file) {
      return res.status(400).json({ error: "No file provided" });
    }
    const documentId = await ingestion.ingestDocument(req.file.path, req.file.originalname);
    res.json({
      document_id: documentId,
      filename: req.file.originalname
    });
  } catch (error) {
    res.status(500).json({ error: error instanceof Error ? error.message : "Ingestion failed" });
  }
});

// POST /ingest/normalize - Normalize document
app.post("/ingest/normalize", async (req, res) => {
  try {
    const schema = z.object({
      document_id: z.string(),
      scope: z.enum(["weekly", "monthly"]).optional(),
      subject: z.string().optional()
    });
    const { document_id, scope, subject } = schema.parse(req.body);
    const scheduleId = await ingestion.normalizeDocument(document_id, {
      scope: scope as CalendarScope | undefined,
      subject
    });
    res.json({ schedule_id: scheduleId });
  } catch (error) {
    res.status(500).json({ error: error instanceof Error ? error.message : "Normalization failed" });
  }
});

// GET /schedules/:id - Get schedule
app.get("/schedules/:id", async (req, res) => {
  try {
    const schedule = await ingestion.getSchedule(req.params.id);
    if (!schedule) {
      return res.status(404).json({ error: "Schedule not found" });
    }
    res.json(schedule);
  } catch (error) {
    res.status(500).json({ error: error instanceof Error ? error.message : "Failed to retrieve schedule" });
  }
});

// GET /documents/:id - Get document metadata
app.get("/documents/:id", async (req, res) => {
  try {
    const doc = db.getDocument(req.params.id);
    if (!doc) {
      return res.status(404).json({ error: "Document not found" });
    }
    res.json({
      id: doc.id,
      filename: doc.filename,
      document_type: doc.document_type,
      mime_type: doc.mime_type,
      size_bytes: doc.size_bytes,
      uploaded_at: doc.uploaded_at
    });
  } catch (error) {
    res.status(500).json({ error: error instanceof Error ? error.message : "Failed to retrieve document" });
  }
});

app.listen(port, () => {
  console.log(`Planning Ingestion Server listening on port ${port}`);
});

planning-ingestion/src/services/ingestion.ts

@@ -0,0 +1,144 @@
import path from "path";
import fs from "fs";
import { DatabaseService } from "../database/db.js";
import { EncryptionService } from "../crypto/encryption.js";
import { OCRExtractor } from "../extractors/ocr.js";
import { PDFExtractor } from "../extractors/pdf.js";
import { ExcelExtractor } from "../extractors/excel.js";
import { PlanningParser, ParseOptions } from "../normalizer/parser.js";
import { LLMClient } from "../llm/client.js";
import type { DocumentType, LLMProvider, StandardSchedule } from "../types/schema.js";

export class IngestionService {
  constructor(
    private db: DatabaseService,
    private encryption: EncryptionService,
    private confidenceThreshold: number = 0.7
  ) {}

  async ingestDocument(filePath: string, filename: string): Promise<string> {
    const stats = fs.statSync(filePath);
    const mimeType = this.detectMimeType(filename);
    const documentType = this.detectDocumentType(mimeType);
    const doc = this.db.insertDocument({
      filename,
      file_path: filePath,
      document_type: documentType,
      mime_type: mimeType,
      size_bytes: stats.size
    });
    return doc.id;
  }

  async normalizeDocument(documentId: string, options: ParseOptions = {}): Promise<string> {
    const doc = this.db.getDocument(documentId);
    if (!doc) throw new Error(`Document ${documentId} not found`);
    // Step 1: Extract text
    const extractionResult = await this.extractText(doc.file_path, doc.document_type);
    // Step 2: Parse with heuristics
    const parser = new PlanningParser();
    let schedule = parser.parseToStandard(extractionResult.text, options);
    // Step 3: If confidence is low, use LLM
    const avgConfidence =
      schedule.events.reduce((sum, e) => sum + e.confidence, 0) / Math.max(1, schedule.events.length);
    if (avgConfidence < this.confidenceThreshold) {
      const llmConfig = await this.getLLMConfig();
      if (llmConfig) {
        const llmClient = new LLMClient(llmConfig.provider, llmConfig.apiKey);
        schedule = await llmClient.normalizeWithLLM(
          extractionResult.text,
          doc.document_type === "excel" ? (extractionResult as any).structuredData : undefined
        );
      }
    }
    // Update extraction method
    if (!schedule.extraction.method || schedule.extraction.method === ("heuristics" as any)) {
      schedule.extraction.method = doc.document_type === "image" ? "ocr" : doc.document_type;
    }
    // Step 4: Save schedule
    const savedSchedule = this.db.insertSchedule(documentId, schedule);
    return savedSchedule.id;
  }

  async getSchedule(scheduleId: string): Promise<StandardSchedule | undefined> {
    const schedule = this.db.getSchedule(scheduleId);
    return schedule?.schedule_json;
  }

  async saveAPIKey(provider: LLMProvider, apiKey: string): Promise<void> {
    const encrypted = this.encryption.encrypt(apiKey);
    this.db.saveAPIKey({
      provider,
      encrypted_key: encrypted.encrypted,
      iv: encrypted.iv,
      auth_tag: encrypted.authTag
    });
  }

  private async getLLMConfig(): Promise<{ provider: LLMProvider; apiKey: string } | null> {
    const openaiConfig = this.db.getAPIKey("openai");
    if (openaiConfig) {
      const apiKey = this.encryption.decrypt(openaiConfig.encrypted_key, openaiConfig.iv, openaiConfig.auth_tag || "");
      return { provider: "openai", apiKey };
    }
    const anthropicConfig = this.db.getAPIKey("anthropic");
    if (anthropicConfig) {
      const apiKey = this.encryption.decrypt(
        anthropicConfig.encrypted_key,
        anthropicConfig.iv,
        anthropicConfig.auth_tag || ""
      );
      return { provider: "anthropic", apiKey };
    }
    return null;
  }

  private async extractText(filePath: string, type: DocumentType): Promise<any> {
    switch (type) {
      case "image": {
        const ocr = new OCRExtractor();
        return ocr.extractFromImage(filePath);
      }
      case "pdf": {
        const pdf = new PDFExtractor();
        return pdf.extractFromPDF(filePath);
      }
      case "excel": {
        const excel = new ExcelExtractor();
        return excel.extractFromExcel(filePath);
      }
    }
  }

  private detectMimeType(filename: string): string {
    const ext = path.extname(filename).toLowerCase();
    const mimeTypes: Record<string, string> = {
      ".pdf": "application/pdf",
      ".xlsx": "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
      ".xls": "application/vnd.ms-excel",
      ".png": "image/png",
      ".jpg": "image/jpeg",
      ".jpeg": "image/jpeg"
    };
    return mimeTypes[ext] || "application/octet-stream";
  }

  private detectDocumentType(mimeType: string): DocumentType {
    if (mimeType.startsWith("image/")) return "image";
    if (mimeType === "application/pdf") return "pdf";
    if (mimeType.includes("spreadsheet") || mimeType.includes("excel")) return "excel";
    throw new Error(`Unsupported MIME type: ${mimeType}`);
  }
}

planning-ingestion/src/types/schema.ts

@@ -0,0 +1,69 @@
import { z } from "zod";

export const CalendarScopeSchema = z.enum(["weekly", "monthly"]);
export type CalendarScope = z.infer<typeof CalendarScopeSchema>;

export const EventSchema = z.object({
  title: z.string(),
  date: z.string().regex(/^\d{4}-\d{2}-\d{2}$/),
  start_time: z.string().regex(/^\d{2}:\d{2}$/),
  end_time: z.string().regex(/^\d{2}:\d{2}$/),
  location: z.string().nullable(),
  tags: z.array(z.string()),
  notes: z.string().nullable(),
  confidence: z.number().min(0).max(1),
  source_cells: z.array(z.string()).optional()
});
export type Event = z.infer<typeof EventSchema>;

export const StandardScheduleSchema = z.object({
  version: z.literal("1.0"),
  calendar_scope: CalendarScopeSchema,
  timezone: z.string(),
  period: z.object({
    start: z.string().regex(/^\d{4}-\d{2}-\d{2}$/),
    end: z.string().regex(/^\d{4}-\d{2}-\d{2}$/)
  }),
  entities: z.array(z.string()),
  events: z.array(EventSchema),
  extraction: z.object({
    method: z.enum(["ocr", "pdf", "excel", "llm"]),
    model: z.string(),
    heuristics: z.array(z.string())
  })
});
export type StandardSchedule = z.infer<typeof StandardScheduleSchema>;

export const DocumentTypeSchema = z.enum(["image", "pdf", "excel"]);
export type DocumentType = z.infer<typeof DocumentTypeSchema>;

export const LLMProviderSchema = z.enum(["openai", "anthropic"]);
export type LLMProvider = z.infer<typeof LLMProviderSchema>;

export interface Document {
  id: string;
  filename: string;
  file_path: string;
  document_type: DocumentType;
  mime_type: string;
  size_bytes: number;
  uploaded_at: string;
  raw_text?: string;
}

export interface Schedule {
  id: string;
  document_id: string;
  schedule_json: StandardSchedule;
  created_at: string;
}

export interface APIKeyConfig {
  provider: LLMProvider;
  encrypted_key: string;
  iv: string;
  auth_tag: string;
  created_at: string;
}

planning-ingestion/tsconfig.json

@@ -0,0 +1,20 @@
{
  "compilerOptions": {
    "target": "ES2022",
    "module": "ES2022",
    "moduleResolution": "node",
    "lib": ["ES2022"],
    "outDir": "./dist",
    "rootDir": "./src",
    "strict": true,
    "esModuleInterop": true,
    "skipLibCheck": true,
    "forceConsistentCasingInFileNames": true,
    "resolveJsonModule": true,
    "declaration": true,
    "declarationMap": true,
    "sourceMap": true
  },
  "include": ["src/**/*"],
  "exclude": ["node_modules", "dist"]
}