AgenticEDA: Automated Exploratory Data Analysis with LangGraph
An automated EDA service powered by a LangGraph agent with a local LLM (Ollama). Upload datasets or connect SQL databases, and get comprehensive analysis reports with visualizations.
What It Does
AgenticEDA is an automated exploratory data analysis service powered by a LangGraph agent with a local LLM (Ollama). Upload a CSV, TSV, Excel, or Parquet dataset (or connect a SQL database), and the agent runs a comprehensive EDA covering statistics, correlations, outlier detection, and visualizations, then generates a detailed markdown report.
POST /v1/analyze
{
"dataset_id": "ds_abc123",
"description": "Customer churn dataset"
}
→ Runs 10 EDA tools via the LangGraph agent
→ Generates a markdown report with plot artifacts
→ Returns { run_id, report_path, status }
Tech Stack
| Component | Technology |
|---|---|
| API | FastAPI (uvicorn) |
| Agent | LangGraph (StateGraph) |
| LLM | Ollama (default: llama3.1, local only) |
| Data | Pandas, matplotlib, seaborn |
| Storage | SQLAlchemy (read-only SQL connections) |
| Formats | CSV, TSV, Excel (openpyxl), Parquet (pyarrow) |
All LLM calls go to a local Ollama instance: no external API calls are made, and no data leaves your machine.
Architecture
src/agenticeda/
├── main.py                 # FastAPI app factory
├── api/routes.py           # REST endpoints
├── agent/
│   ├── graph.py            # LangGraph StateGraph with 6 nodes
│   └── prompts.py          # SYSTEM_PROMPT, PLAN_PROMPT, SYNTHESIZE_PROMPT
├── tools/eda_tools.py      # 10 StructuredTools for EDA operations
├── data/
│   ├── loaders.py          # CSV/TSV/Excel/Parquet/SQL loading
│   └── session.py          # In-memory dataset-run session store
├── models/schemas.py       # Pydantic request/response models
├── report/builder.py       # Markdown report generation
└── config.py               # pydantic-settings config
LangGraph Agent Flow
load_context → plan_eda → agent → tools → collect_results → synthesize → END
↓ (conditional)
FINISH? → YES → END
NO → tools (up to 12 steps)
The agent graph has 6 nodes:
- load_context: Loads dataset metadata (shape, columns, dtypes) into the agent context
- plan_eda: LLM generates an EDA plan based on the dataset description and available tools
- agent: LLM decides which tool to call next, or sends a `FINISH` signal
- tools: Executes the chosen EDA tool, returns results
- collect_results: Aggregates tool outputs into the run context
- synthesize: LLM generates the final markdown report from all collected results
The agent uses a 12-step maximum guard to prevent runaway tool calls.
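
The agent/tools loop and its step guard can be illustrated without LangGraph. The following is a minimal, dependency-free sketch, not the project's `graph.py`: the `decide` callback is a hypothetical stand-in for the LLM call, and `run_agent_loop`, `toy_tools`, and the two toy deciders are names invented for illustration. It only shows how a `FINISH` signal and a 12-step limit bound the loop.

```python
from typing import Callable, Dict, List, Tuple

MAX_STEPS = 12     # step limit stated in this README
FINISH = "FINISH"  # sentinel the agent returns when it is done


def run_agent_loop(
    decide: Callable[[List[Tuple[str, str]]], str],
    tools: Dict[str, Callable[[], str]],
    max_steps: int = MAX_STEPS,
) -> List[Tuple[str, str]]:
    """Alternate between an agent decision and a tool call.

    `decide` (hypothetical) sees the results collected so far and
    returns either a tool name or FINISH.
    """
    results: List[Tuple[str, str]] = []
    for _ in range(max_steps):
        choice = decide(results)
        if choice == FINISH:
            break
        if choice not in tools:
            # Record the bad request instead of crashing the run.
            results.append((choice, "error: unknown tool"))
            continue
        results.append((choice, tools[choice]()))
    return results


def never_done(_results):
    """Toy decider that never finishes, so the guard must stop it."""
    return "get_dataset_overview"


def one_pass(results):
    """Toy decider that calls each planned tool once, then finishes."""
    plan = ["get_dataset_overview", "get_missing_values"]
    return plan[len(results)] if len(results) < len(plan) else FINISH


toy_tools = {
    "get_dataset_overview": lambda: "3 rows x 2 columns",
    "get_missing_values": lambda: "no nulls",
}

runaway = run_agent_loop(never_done, toy_tools)
planned = run_agent_loop(one_pass, toy_tools)
print(len(runaway), [name for name, _ in planned])
```

In the sketch, the runaway decider is cut off after 12 tool calls, while the planned decider stops early by returning `FINISH`.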
EDA Tools
| Tool | Description |
|---|---|
| `get_dataset_overview` | Shape, columns, dtypes, memory usage |
| `get_missing_values` | Null counts and percentages per column |
| `get_numeric_summary` | Mean, median, std, min, max, quartiles |
| `get_categorical_summary` | Value counts, unique counts, mode |
| `get_correlations` | Pearson/Spearman correlation matrix |
| `detect_outliers` | IQR-based outlier detection per column |
| `plot_distribution` | Histogram with KDE for numeric columns |
| `plot_correlation_heatmap` | Seaborn heatmap of correlations |
| `plot_scatter` | Scatter plot for specified column pair |
| `run_custom_query` | Custom pandas query (read-only, sandboxed) |
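
As an illustration of what IQR-based outlier detection computes, here is a minimal sketch using only the standard library. It is not the project's `detect_outliers` implementation (which presumably operates on pandas columns). The function name `iqr_outliers` and the multiplier `k = 1.5` are assumptions: 1.5 is the conventional choice, but this README does not state which value the tool uses.

```python
import statistics
from typing import Dict, List


def iqr_outliers(values: List[float], k: float = 1.5) -> Dict[str, object]:
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR].

    k = 1.5 is the conventional multiplier (an assumption here).
    """
    # method="inclusive" interpolates linearly between data points.
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return {
        "lower": lower,
        "upper": upper,
        "outliers": [v for v in values if v < lower or v > upper],
    }


report = iqr_outliers([10, 12, 11, 13, 12, 11, 100])
print(report)
# → {'lower': 8.75, 'upper': 14.75, 'outliers': [100]}
```

For this sample, Q1 is 11.0 and Q3 is 12.5, so the fences are 8.75 and 14.75 and only the value 100 is flagged.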
Safety and Isolation
- Read-only SQL connections (no write operations)
- Sandboxed custom queries: forbidden patterns include `@`, `import`, `exec`, `eval`, `__` (double underscore)
- matplotlib non-interactive backend (`Agg`) for server-side rendering
- Output isolation: each run gets its own directory with plot PNGs
- Session isolation: separate dataset_id strings for multi-dataset support
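
The forbidden-pattern check for custom queries can be sketched as a simple substring filter. This is a hedged illustration, not the project's actual sandbox: the names `FORBIDDEN_PATTERNS`, `is_query_allowed`, and `check_query` are hypothetical, and case-insensitive substring matching is an assumption about how the patterns are applied.

```python
# Patterns listed in this README; everything else here is an assumption.
FORBIDDEN_PATTERNS = ("@", "import", "exec", "eval", "__")


def is_query_allowed(expr: str) -> bool:
    """Return False if the expression contains any forbidden pattern."""
    lowered = expr.lower()  # assumed: matching ignores case
    return not any(pattern in lowered for pattern in FORBIDDEN_PATTERNS)


def check_query(expr: str) -> str:
    """Return the expression unchanged, or raise ValueError if rejected."""
    if not is_query_allowed(expr):
        raise ValueError("query rejected: contains a forbidden pattern")
    return expr


print(is_query_allowed("age > 30 and churned == 1"))  # plain filter
print(is_query_allowed("__import__('os').system('ls')"))  # dunder + import
```

Substring matching is deliberately conservative: in this sketch a column named `evaluation_score` would also be rejected because it contains `eval`. A blocklist like this reduces risk but is not a complete sandbox on its own.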
API Endpoints
| Endpoint | Method | Description |
|---|---|---|
| `/health` | GET | Health check |
| `/v1/upload` | POST | Upload dataset file (multipart) |
| `/v1/connect/sql` | POST | Connect to SQL database (read-only) |
| `/v1/analyze` | POST | Start EDA analysis run |
| `/v1/reports/{run_id}` | GET | Get report by run ID |
Quick Start
# Install
pip install agentic-eda
# Start Ollama (must have llama3.1 or another model)
ollama pull llama3.1
# Start the service
agentic-eda serve
# → API at http://localhost:8000
# Upload and analyze
curl -X POST -F "file=@data.csv" http://localhost:8000/v1/upload
# → { dataset_id: "ds_abc123" }
curl -X POST -H "Content-Type: application/json" \
-d '{"dataset_id": "ds_abc123", "description": "Sales data Q1 2026"}' \
http://localhost:8000/v1/analyze
# → { run_id: "run_def456", status: "running" }
# Get report
curl http://localhost:8000/v1/reports/run_def456
# → Markdown report with embedded plots
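
The `/v1/analyze` call above can also be issued from Python. The sketch below uses only the standard library and mirrors the curl command. The helper names `build_analyze_request` and `send_json` are hypothetical, and it assumes the service answers with a JSON body as the curl example suggests. The request is built but not sent, so the snippet runs without a live server.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # default address from the Quick Start


def build_analyze_request(
    dataset_id: str, description: str, base_url: str = BASE_URL
) -> urllib.request.Request:
    """Build the POST /v1/analyze request without sending it."""
    body = json.dumps({"dataset_id": dataset_id, "description": description})
    return urllib.request.Request(
        url=f"{base_url}/v1/analyze",
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_json(request: urllib.request.Request, timeout: float = 30.0) -> dict:
    """Send a request and decode the response, assumed to be JSON."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


req = build_analyze_request("ds_abc123", "Sales data Q1 2026")
print(req.get_method(), req.full_url)
```

With the service running, `send_json(req)` would send the request and return the decoded response.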
Why It's Interesting
AgenticEDA combines three things that rarely appear together: a local-only LLM (no data exfiltration), LangGraph agent orchestration, and production-grade FastAPI service design. The LangGraph agent doesn't just run tools blindly: it plans the analysis, executes tools in a meaningful order, and synthesizes everything into a coherent report. The tool implementations are thorough (IQR outlier detection, multiple plot types, correlation analysis), and the session-based architecture supports multiple concurrent datasets and runs. It's a genuinely useful tool for data scientists who want automated EDA without sending their data to any cloud service.