Carrefour NC · Skazy Data

Supplier invoice OCR pipeline

AI-powered pipeline · Hybrid · Gemini
6 microservices
Mail2Docs · ClassiDocs
BillScan · DocHub
ocr-main · Orchestration
Tech stack
Python · AI services
Gemini Flash · OCR · Classification
Laravel · Central API hub
Docker Compose · Orchestration
Streamlit · Business UI
MariaDB · Persistence
Processing pipeline
01 · Mail2Docs · IMAP · Graph API
02 · ClassiDocs · Gemini · PyPDF
03 · BillScan · Structured OCR
04 · DocHub · Laravel · RBAC
05 · Export METIFA · MLF.* · CFP
About

End-to-end pipeline covering AI classification, OCR extraction and METIFA export, automating supplier invoice processing for Carrefour New Caledonia. It handles the local XPF currency and the METI format.

Final output
MLF.{ref}
AAA · FFE · FFG · FFH
FFL × N article lines
FFP · ZZZ
Currency CFP · MTTVA = 0
skazy.nc · New Caledonia
Service 01 · Acquisition
Mail2Docs

Automatic PDF attachment collection from a dedicated mailbox via IMAP or Microsoft Graph API.

Features

  • 4 authentication modes
  • Filter by sender, subject, date, extension
  • Mark as read + automatic archiving
  • SHA-1 filename deduplication
  • Real-time structured logs

Animated diagram

Flow: supplier (scanned PDF) → dedicated mailbox (IMAP or Graph · SSL · OAuth2) → backend main.py (download · filter · hash · archive) → storage ./data/YYYY-MM-DD/{hash}.pdf → Streamlit UI (live logs)
Call chain: subprocess.Popen → IMAP/Graph → MIME download → hash + write → UI logs
Protocols: IMAP Basic Auth (basic / xoauth2 · SSL · Azure AD) · Microsoft Graph API (Azure AD auth · app token)
Volumes and output: ./data/ (raw attachments · YYYY-MM-DD/) · ./logs/ (job_*.log, read in real time)

Authentication modes

IMAP Basic Auth (Gmail · App Password): basic
XOAUTH2 IMAP (Gmail · OAuth2): xoauth2
Microsoft 365 (Device Code · Azure AD): xoauth2_365
Microsoft Graph (Office 365 · OAuth2): graph_app
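
For the xoauth2 modes, IMAP login uses the SASL XOAUTH2 mechanism instead of a password. A minimal sketch of the client response string, assuming the OAuth2 access token has already been obtained elsewhere; the function name is hypothetical and is not taken from Mail2Docs.

```python
def build_xoauth2_string(user: str, access_token: str) -> bytes:
    """Build the SASL XOAUTH2 client response (imaplib base64-encodes it)."""
    return f"user={user}\x01auth=Bearer {access_token}\x01\x01".encode("utf-8")


# Possible use with the standard library (not executed here, host is an example):
#   import imaplib
#   conn = imaplib.IMAP4_SSL("imap.gmail.com")
#   conn.authenticate("XOAUTH2", lambda _challenge: build_xoauth2_string(user, token))
```

The fields are separated by the control character 0x01 and the string ends with two of them.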

Output structure

DATA_DIR/
YYYY/MM/DD/
{date}_{hash10}_{sujet}.pdf
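
The naming scheme above could be produced along these lines. This is a sketch under assumptions: the hash is taken here over the attachment bytes (the source only says SHA-1), {date} is rendered as YYYY-MM-DD, and the subject slug rule and function name are hypothetical.

```python
import hashlib
import re
from datetime import date
from pathlib import Path


def attachment_path(data_dir: Path, received: date, subject: str, content: bytes) -> Path:
    # Assumption: {hash10} is the first 10 hex digits of the SHA-1 of the content.
    digest = hashlib.sha1(content).hexdigest()[:10]
    # Assumption: the subject is reduced to a filesystem-safe slug.
    slug = re.sub(r"[^A-Za-z0-9]+", "-", subject).strip("-")[:40] or "no-subject"
    name = f"{received:%Y-%m-%d}_{digest}_{slug}.pdf"
    return data_dir / f"{received:%Y}" / f"{received:%m}" / f"{received:%d}" / name
```

With this scheme, the same attachment with the same subject on the same day maps to the same path, which is one way to detect duplicates before writing.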

Filter variables

FROM_FILTER
SUBJECT_FILTER
DAYS_BACK
UNREAD_ONLY
ALLOWED_EXTENSIONS
ARCHIVE
MARK_AS_READ
GRAPH_TOP
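
A sketch of how these variables might be loaded, assuming they are environment variables. The types, the default values and the comma-separated format of ALLOWED_EXTENSIONS are assumptions for illustration, not documented behaviour.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Tuple


def _as_bool(value: str) -> bool:
    """Interpret common truthy strings; anything else is False."""
    return value.strip().lower() in {"1", "true", "yes", "on"}


@dataclass(frozen=True)
class MailFilters:
    from_filter: str
    subject_filter: str
    days_back: int
    unread_only: bool
    allowed_extensions: Tuple[str, ...]
    archive: bool
    mark_as_read: bool
    graph_top: int


def load_filters(env: Mapping[str, str] = os.environ) -> MailFilters:
    # All defaults below are assumptions, not the service's real defaults.
    extensions = env.get("ALLOWED_EXTENSIONS", ".pdf")
    return MailFilters(
        from_filter=env.get("FROM_FILTER", ""),
        subject_filter=env.get("SUBJECT_FILTER", ""),
        days_back=int(env.get("DAYS_BACK", "7")),
        unread_only=_as_bool(env.get("UNREAD_ONLY", "true")),
        allowed_extensions=tuple(
            e.strip().lower() for e in extensions.split(",") if e.strip()
        ),
        archive=_as_bool(env.get("ARCHIVE", "false")),
        mark_as_read=_as_bool(env.get("MARK_AS_READ", "false")),
        graph_top=int(env.get("GRAPH_TOP", "50")),
    )


def extension_allowed(filename: str, filters: MailFilters) -> bool:
    """Case-insensitive check of the attachment extension."""
    return filename.lower().endswith(filters.allowed_extensions)
```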
Service 02 · Classification
ClassiDocs

AI classification and splitting of multi-document PDFs by supplier and type, page by page via the Gemini Files API.

Features

  • PDF upload via Gemini Files API
  • Page-by-page classification
  • Injected supplier / store lists
  • Alias mapping: Arizona → Express, Champion → Market
  • PyPDF splitting + structured naming

Animated diagram — intelligent splitting

Flow: multi-document PDF (mixed pages) → Gemini Files API (page by page → groupe_doc) → classification JSON per page → PyPDF split (1 PDF per group) → output Fac/ Four/ YYYY/ MM/DD
Per-page JSON: {"type":"Facture", "groupe":1, "fournisseur":"X", "magasin":"Y"}
Lists: /shops + /suppliers fetched from DocHub → injected into the Gemini prompts

Pipeline

Incoming PDF · multi-doc
Files API · Gemini upload
Classification · page by page
Grouping · doc_group
PDF split · PyPDF
Filing · supplier/date
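
The grouping step can be sketched as follows, assuming each page's classification result is a dict carrying an integer doc_group key (the exact response shape is an assumption, and the function name is hypothetical).

```python
from collections import defaultdict
from typing import Dict, List


def pages_by_group(page_results: List[dict]) -> Dict[int, List[int]]:
    """Map each doc_group to the zero-based page indices that belong to it."""
    groups: Dict[int, List[int]] = defaultdict(list)
    for page_index, result in enumerate(page_results):
        groups[int(result["doc_group"])].append(page_index)
    return dict(groups)


# Splitting could then use pypdf (not executed here):
#   from pypdf import PdfReader, PdfWriter
#   reader = PdfReader("incoming.pdf")
#   for group, indices in pages_by_group(results).items():
#       writer = PdfWriter()
#       for i in indices:
#           writer.add_page(reader.pages[i])
#       writer.write(f"group_{group}.pdf")
```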

Document types

Invoice · Billing document
Credit note · Credit memo
Other · Delivery note, quote, etc.

Prompt rules

  • Injected store lists
  • Injected supplier lists
  • doc_group (int)
  • ISO date YYYY-MM-DD
  • Blank page → new group
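
These rules lend themselves to a check on each page result before splitting. A sketch, assuming English key names (doc_group, date, supplier, store); the keys Gemini actually returns may differ, and the function name is hypothetical.

```python
import re
from datetime import date
from typing import List, Set

ISO_DATE = re.compile(r"\d{4}-\d{2}-\d{2}")


def validate_page(result: dict, suppliers: Set[str], stores: Set[str]) -> List[str]:
    """Return the list of rule violations for one page (empty if valid)."""
    errors = []
    group = result.get("doc_group")
    # bool is a subclass of int, so exclude it explicitly.
    if not isinstance(group, int) or isinstance(group, bool):
        errors.append("doc_group must be an integer")
    raw_date = str(result.get("date", ""))
    try:
        if not ISO_DATE.fullmatch(raw_date):
            raise ValueError(raw_date)
        date.fromisoformat(raw_date)  # also rejects impossible dates
    except ValueError:
        errors.append("date must be ISO YYYY-MM-DD")
    if result.get("supplier") not in suppliers:
        errors.append("supplier not in injected list")
    if result.get("store") not in stores:
        errors.append("store not in injected list")
    return errors
```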

Output

documents_decoupes/
YYYY/MM/DD/
{type}/
{fournisseur}/
Facture_{N°}.pdf
google-genai · PyPDF · Streamlit
Service 03 · OCR Extraction
BillScan

Structured OCR extraction of line items from PDF invoices and receipts. Automatic pre-tax/tax arithmetic checks.

Features

  • Bilingual prompt: French + Anglo-Saxon number formats
  • XPF integer amounts only
  • Decimal ROUND_HALF_UP checks
  • reference vs code_ean (EAN-13)
  • Optional DocHub API push

Animated diagram — extraction + checks

Flow: PDF invoice (N lines) → hybrid OCR (traditional + LayoutLM + NLP entities + Gemini Flash · FR + EN formats) → Gemini JSON → consistency checks
Gemini JSON: [{"code_ean", "quantite", "prix_unit_ht", "tgc", "designation"}, …]
Consistency: ht = unit × qte · Σ ligne_ht · ht × (1 + tgc) · Σ ttc_ligne
Blocking error: incorrect identification → prevents METI injection
Non-blocking alert: price discrepancy → the operator decides with full knowledge of the facts

Gemini JSON schema

label · string
reference · internal code
code_ean · EAN-13
quantity · integer
tgc · % NC tax
prix_unitaire_ht · XPF
prix_total_ht · XPF

Number formats

French
1 234,56 → 1235
Anglo-Saxon
48,106.00 → 48106
XPF NC
ROUND_HALF_UP · 0 decimals
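
A sketch of how both number formats could be normalised to an integer XPF amount with Decimal and ROUND_HALF_UP. The separator heuristic is an assumption: when both "," and "." appear, the rightmost one is treated as the decimal mark, and a lone comma is treated as a French decimal comma, so an Anglo-Saxon amount written without decimals (such as "48,106") would need extra context. The function name is hypothetical.

```python
import re
from decimal import Decimal, ROUND_HALF_UP


def parse_amount_xpf(raw: str) -> int:
    """Parse a French or Anglo-Saxon formatted amount into whole XPF."""
    s = re.sub(r"\s", "", raw)  # drops regular and non-breaking spaces
    if "," in s and "." in s:
        if s.rfind(",") > s.rfind("."):
            # "1.234,50": dot groups thousands, comma is the decimal mark
            s = s.replace(".", "").replace(",", ".")
        else:
            # "48,106.00": comma groups thousands, dot is the decimal mark
            s = s.replace(",", "")
    elif "," in s:
        # Assumption: a lone comma is a French decimal comma
        s = s.replace(",", ".")
    return int(Decimal(s).quantize(Decimal("1"), rounding=ROUND_HALF_UP))
```

Decimal is used rather than float so that half values round up deterministically instead of depending on binary representation.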

Arithmetic checks

ht = unit × qty · Line consistency · ✓ blocking
Σ line_ht = total_ht · Total consistency · ✓ blocking
ttc = ht × (1 + tgc) · Tax consistency · ✓ blocking
Σ ttc_line = total_ttc · TTC consistency · △ warning
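
The four checks above can be sketched as follows. Assumptions: amounts are already integer XPF, tgc is expressed as a percentage, and equality is exact after ROUND_HALF_UP rounding (the source does not state a tolerance). The class and function names are hypothetical.

```python
from dataclasses import dataclass
from decimal import Decimal, ROUND_HALF_UP
from typing import List, Tuple


def xpf(value: Decimal) -> int:
    """Round to whole XPF, half up."""
    return int(value.quantize(Decimal("1"), rounding=ROUND_HALF_UP))


@dataclass
class Line:
    quantity: int
    unit_ht: int
    total_ht: int
    tgc_percent: Decimal
    total_ttc: int


def check_invoice(lines: List[Line], total_ht: int, total_ttc: int) -> List[Tuple[str, str]]:
    """Return (severity, message) findings; an empty list means all checks pass."""
    findings = []
    for i, line in enumerate(lines, start=1):
        if line.unit_ht * line.quantity != line.total_ht:
            findings.append(("blocking", f"line {i}: ht != unit x qty"))
        expected_ttc = xpf(Decimal(line.total_ht) * (1 + line.tgc_percent / 100))
        if expected_ttc != line.total_ttc:
            findings.append(("blocking", f"line {i}: ttc != ht x (1 + tgc)"))
    if sum(l.total_ht for l in lines) != total_ht:
        findings.append(("blocking", "sum of line_ht != total_ht"))
    if sum(l.total_ttc for l in lines) != total_ttc:
        findings.append(("warning", "sum of ttc_line != total_ttc"))
    return findings
```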
Service 04 · Central hub
DocHub

Central Laravel API: OCR ingestion, three-phase business validation, RBAC and METIFA export to the METI NC ERP system.

Features

  • 3-phase pipeline: Product → Price → Tax
  • Route-level RBAC with group.* wildcards
  • REST API Basic Auth: shops, suppliers, document
  • METIFA export: FFE + FFL + FFP, CFP currency
  • Livewire UI · PDF proxy preview

Animated diagram — JSON pipeline → METI NC

Flow: BillScan JSON (POST /api) → Phase 1 Product Validator (EAN → product · supplier_ref) → Phase 2 Price Validator (vs METI · promos) → Phase 3 Tax Validator (TGC vs TVA) → remediation (Livewire interface · 3-role RBAC) → METI NC
Reference data: METI CSV → DocHub · daily CRON · purge + reimport
Statuses: validated · errored · pending · done · done_errors

Data models

Document

ref · date
supplier_id
store_id
total_ht · ttc
status · type

Document\Line

name · ean
qty · tax
unit_price
total_price
product_id

Supplier

code · name
alias
type (F/L/E)
api_enabled

CostPrice

meti_ref
supplier · store
tar_dttr
cost_price

Validation pipeline

Phase 1

ProductValidator

  • unidentified item
  • supplier duplicate
  • supplier not found
Phase 2

PriceValidator

  • incorrect promo price
  • incorrect pre-tax (HT) unit price
  • price not found
Phase 3

TaxValidator

  • incorrect TGC tax
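
DocHub itself is a Laravel application. The Python sketch below only illustrates the Product → Price → Tax ordering. The reference-data shape, the error codes and the choice to stop at the first failing phase are all assumptions.

```python
from typing import Callable, Dict, List, Tuple

Validator = Callable[[dict, dict], List[str]]


def product_validator(line: dict, ref: dict) -> List[str]:
    return [] if line["ean"] in ref["products"] else ["unidentified_item"]


def price_validator(line: dict, ref: dict) -> List[str]:
    expected = ref["prices"].get(line["ean"])
    if expected is None:
        return ["price_not_found"]
    return [] if expected == line["unit_price"] else ["unit_price_mismatch"]


def tax_validator(line: dict, ref: dict) -> List[str]:
    return [] if ref["taxes"].get(line["ean"]) == line["tax"] else ["incorrect_tgc"]


PHASES: List[Tuple[str, Validator]] = [
    ("Product", product_validator),
    ("Price", price_validator),
    ("Tax", tax_validator),
]


def validate_line(line: dict, ref: dict) -> Dict[str, List[str]]:
    """Run the phases in order; an empty dict means the line is valid."""
    for name, validator in PHASES:
        errors = validator(line, ref)
        if errors:
            # Assumption: later phases are skipped once a phase fails,
            # since price and tax cannot be checked on an unknown product.
            return {name: errors}
    return {}
```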

METI reference solution

CSV exported from METI
deposited on VM, dedicated folder
Daily CRON · purge + reimport
items · barcodes · prices · promos
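
The daily purge + reimport can be done in a single transaction so that a failed import never leaves the reference table empty. A sketch using sqlite3 for self-containment (the source lists MariaDB); the table name, the semicolon-delimited CSV layout and the column subset are assumptions loosely based on the CostPrice model.

```python
import csv
import io
import sqlite3


def reimport_cost_prices(conn: sqlite3.Connection, csv_text: str) -> int:
    """Replace the whole cost_prices table with the CSV content."""
    reader = csv.DictReader(io.StringIO(csv_text), delimiter=";")
    # Parse everything first: a malformed file fails before any purge.
    rows = [
        (r["meti_ref"], r["supplier"], r["store"], int(r["cost_price"]))
        for r in reader
    ]
    # "with conn" commits on success and rolls back if an exception is raised.
    with conn:
        conn.execute("DELETE FROM cost_prices")
        conn.executemany(
            "INSERT INTO cost_prices (meti_ref, supplier, store, cost_price) "
            "VALUES (?, ?, ?, ?)",
            rows,
        )
    return len(rows)
```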

REST API

GET /api/shops · STORE_CODE + alias
GET /api/suppliers · FOU_CDFO, FOU_NM + alias
POST /api/document · OCR import, document validation
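
A client-side sketch of a Basic Auth call to POST /api/document using only the standard library. The base URL, credentials and payload shape are placeholders; only the route and the authentication scheme come from the list above.

```python
import base64
import json
import urllib.request


def basic_auth_header(user: str, password: str) -> str:
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return f"Basic {token}"


def build_document_request(
    base_url: str, user: str, password: str, document: dict
) -> urllib.request.Request:
    """Prepare (but do not send) the POST /api/document request."""
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/api/document",
        data=json.dumps(document).encode("utf-8"),
        headers={
            "Authorization": basic_auth_header(user, password),
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Sending would then be: urllib.request.urlopen(request, timeout=30)
```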

METIFA export to METI NC

AAA · FFE · FFG · FFH · FFL · FFP · ZZZ

Fixed-width flat file · Currency CFP (≠ ISO XPF) · MTTVA = 0 · multi-document ZIP
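
The record sequence can be illustrated with a small fixed-width writer. Only the record tags, their order, the MLF.{ref} file name, the CFP currency code and MTTVA = 0 come from this page. Every field position, width and payload below is hypothetical and does not reflect the real METIFA layout.

```python
from typing import Iterable, List, Tuple


def text_field(value: str, width: int) -> str:
    """Left-aligned, space-padded text field."""
    if len(value) > width:
        raise ValueError(f"{value!r} exceeds {width} characters")
    return value.ljust(width)


def int_field(value: int, width: int) -> str:
    """Right-aligned, zero-padded non-negative integer field."""
    digits = str(value)
    if value < 0 or len(digits) > width:
        raise ValueError(f"{value} does not fit in {width} digits")
    return digits.zfill(width)


def build_metifa(ref: str, lines: Iterable[Tuple[str, int, int]]) -> Tuple[str, str]:
    """Return (file name, file content) for one document.

    lines holds (ean, quantity, total_ht) tuples. All widths are hypothetical.
    """
    records: List[str] = [
        "AAA" + text_field(ref, 20),
        "FFE" + text_field("CFP", 3),  # currency code CFP, not ISO XPF
        "FFG",
        "FFH",
    ]
    for ean, quantity, total_ht in lines:
        records.append(
            "FFL" + text_field(ean, 13) + int_field(quantity, 6) + int_field(total_ht, 12)
        )
    records.append("FFP" + int_field(0, 12))  # MTTVA = 0
    records.append("ZZZ")
    return f"MLF.{ref}", "\n".join(records) + "\n"
```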

Governance & control

Beyond extraction

The solution also manages user rights, credit note reconciliation and batch completeness tracking.

Security

3-role RBAC

Fine-grained permissions per business profile.

Administrator · thresholds + users
Accountant · remediation
Auditor · read-only + logs
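
Route-level checks with group.* wildcards can be sketched with shell-style pattern matching. The three roles come from the list above; the permission names and the role-to-permission mapping are hypothetical.

```python
from fnmatch import fnmatchcase
from typing import Dict, List

# Hypothetical mapping; the real permission names are not given on this page.
ROLE_PERMISSIONS: Dict[str, List[str]] = {
    "administrator": ["*"],
    "accountant": ["documents.*"],
    "auditor": ["documents.view", "logs.view"],
}


def is_allowed(role: str, route_permission: str) -> bool:
    """True if any pattern granted to the role matches the route permission."""
    patterns = ROLE_PERMISSIONS.get(role, [])
    return any(fnmatchcase(route_permission, pattern) for pattern in patterns)
```

An unknown role gets no patterns and is therefore denied by default.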
Accounting

Quantitative / qualitative credit notes

Two reconciliation logics.

Quantitative · Goods receipt
Qualitative · ERP dispute files
Monitoring

Completeness control

Batch tracking dashboard.

Files submitted vs OCR-processed
zero loss between source and injection
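
The completeness control amounts to a set comparison between what was submitted and what reached OCR. A minimal sketch, assuming files are tracked by a stable identifier such as the file name or hash; the function name and report keys are hypothetical.

```python
from typing import Dict, Iterable


def completeness_report(submitted: Iterable[str], processed: Iterable[str]) -> Dict[str, object]:
    """Compare submitted files with OCR-processed files."""
    submitted_set, processed_set = set(submitted), set(processed)
    return {
        "missing": sorted(submitted_set - processed_set),      # submitted, never processed
        "unexpected": sorted(processed_set - submitted_set),   # processed, never submitted
        "complete": submitted_set <= processed_set,
    }
```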