📊 AgenticEDA — Automated Exploratory Data Analysis with LangGraph

Pranesh Nikhar · Jan 3, 2026 · 3 min read

 󰈤

An automated EDA service powered by a LangGraph agent with local LLM (Ollama). Upload datasets, connect SQL databases, and get comprehensive analysis reports with visualizations.

🎯 What It Does

Upload a CSV, TSV, Excel, or Parquet dataset, or connect a SQL database, and the agent runs a full EDA pass: statistics, correlations, outlier detection, and visualizations. It then writes the results up as a detailed markdown report.

POST /v1/analyze
{
  "dataset_id": "ds_abc123",
  "description": "Customer churn dataset"
}

→ Runs 10 EDA tools via LangGraph agent
→ Generates markdown report with plot artifacts
→ Returns { run_id, report_path, status }

🧱 Tech Stack

Component	Technology
API	FastAPI (uvicorn)
Agent	LangGraph (StateGraph)
LLM	Ollama (default: llama3.1, local only)
Data	Pandas, matplotlib, seaborn
Storage	SQLAlchemy (read-only SQL connections)
Formats	CSV, TSV, Excel (openpyxl), Parquet (pyarrow)

All LLM calls go to a local Ollama instance — no external API calls, no data leaves your machine.

🏗️ Architecture

src/agenticeda/
├── main.py                   # FastAPI app factory
├── api/routes.py             # REST endpoints
├── agent/
│   ├── graph.py              # LangGraph StateGraph with 6 nodes
│   └── prompts.py            # SYSTEM_PROMPT, PLAN_PROMPT, SYNTHESIZE_PROMPT
├── tools/eda_tools.py        # 10 StructuredTools for EDA operations
├── data/
│   ├── loaders.py            # CSV/TSV/Excel/Parquet/SQL loading
│   └── session.py            # In-memory dataset-run session store
├── models/schemas.py         # Pydantic request/response models
├── report/builder.py         # Markdown report generation
└── config.py                 # pydantic-settings config
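
As a rough illustration of what a loader such as data/loaders.py might do, the sketch below picks a pandas reader from the file extension. The names detect_format and load_dataset are hypothetical and not taken from the project, and the accepted extensions are an assumption; only the pandas readers and the openpyxl and pyarrow engines come from the stack listed above.

```python
from pathlib import Path

# Hypothetical mapping from file extension to a format label.
_FORMATS = {
    ".csv": "csv",
    ".tsv": "tsv",
    ".xlsx": "excel",
    ".parquet": "parquet",
}


def detect_format(filename: str) -> str:
    """Return a format label for a supported file, or raise ValueError."""
    suffix = Path(filename).suffix.lower()
    try:
        return _FORMATS[suffix]
    except KeyError:
        raise ValueError(f"Unsupported file type: {suffix or filename!r}") from None


def load_dataset(path: str):
    """Load a file into a pandas DataFrame based on its extension."""
    import pandas as pd  # imported lazily so detect_format works without pandas

    fmt = detect_format(path)
    if fmt == "csv":
        return pd.read_csv(path)
    if fmt == "tsv":
        return pd.read_csv(path, sep="\t")
    if fmt == "excel":
        return pd.read_excel(path, engine="openpyxl")
    return pd.read_parquet(path, engine="pyarrow")
```

Rejecting unknown extensions before any parsing keeps a bad upload from reaching the agent at all.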

🤖 LangGraph Agent Flow

load_context → plan_eda → agent → tools → collect_results → synthesize → END
                                  ↓ (conditional)
                              FINISH? → YES → END
                              NO → tools (up to 12 steps)

The agent graph has 6 nodes:

load_context: Loads dataset metadata (shape, columns, dtypes) into the agent context
plan_eda: LLM generates an EDA plan based on the dataset description and available tools
agent: LLM decides which tool to call next, or sends FINISH signal
tools: Executes the chosen EDA tool, returns results
collect_results: Aggregates tool outputs into the run context
synthesize: LLM generates the final markdown report from all collected results

The agent uses a 12-step maximum guard to prevent runaway tool calls.
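
The agent/tools loop and its step guard can be sketched in plain Python. This is not the project's graph.py and it does not use the LangGraph API: the state keys, the route_after_agent function, and the "finish" label are assumptions made for illustration. Only the FINISH signal and the 12-step limit come from the description above.

```python
from typing import Callable, Dict

MAX_STEPS = 12  # step limit stated in the post


def route_after_agent(state: dict) -> str:
    """Conditional edge: run another tool, or leave the loop."""
    if state["next_tool"] == "FINISH" or state["steps"] >= MAX_STEPS:
        return "finish"
    return "tools"


def run_loop(decide: Callable[[dict], str],
             tools: Dict[str, Callable[[], str]]) -> dict:
    """Alternate agent and tools steps until FINISH or the guard trips."""
    state = {"steps": 0, "results": [], "next_tool": ""}
    while True:
        # `decide` stands in for the LLM call made by the agent node.
        state["next_tool"] = decide(state)
        if route_after_agent(state) == "finish":
            return state
        # The tools node runs the chosen tool and records its output.
        state["results"].append(tools[state["next_tool"]]())
        state["steps"] += 1
```

Checking the step count inside the routing function means the guard holds even if the model never emits FINISH.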

🛠️ EDA Tools

Tool	Description
`get_dataset_overview`	Shape, columns, dtypes, memory usage
`get_missing_values`	Null counts and percentages per column
`get_numeric_summary`	Mean, median, std, min, max, quartiles
`get_categorical_summary`	Value counts, unique counts, mode
`get_correlations`	Pearson/Spearman correlation matrix
`detect_outliers`	IQR-based outlier detection per column
`plot_distribution`	Histogram with KDE for numeric columns
`plot_correlation_heatmap`	Seaborn heatmap of correlations
`plot_scatter`	Scatter plot for specified column pair
`run_custom_query`	Custom pandas query (read-only, sandboxed)
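
To make the IQR rule behind a tool like detect_outliers concrete, here is a minimal sketch using only the standard library. The function name iqr_outliers and the 1.5 multiplier are assumptions (1.5 is the conventional default), and the project itself presumably computes quartiles with pandas rather than the statistics module.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return (lower_fence, upper_fence, outliers) using the k * IQR rule."""
    # The "inclusive" method interpolates linearly between data points.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    outliers = [v for v in values if v < lower or v > upper]
    return lower, upper, outliers


lower, upper, outliers = iqr_outliers([10, 12, 11, 13, 12, 11, 95])
print(lower, upper, outliers)  # 8.75 14.75 [95]
```

Values outside the fences are flagged rather than removed, which leaves the decision about them to whoever reads the report.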

🔒 Safety and Isolation

Read-only SQL connections (no write operations)
Sandboxed custom queries: forbidden patterns include @, import, exec, eval, __ (double underscore)
matplotlib non-interactive backend (Agg) for server-side rendering
Output isolation: each run gets its own directory with plot PNGs
Session isolation: separate dataset_id strings for multi-dataset support
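
A pattern check like the one described for custom queries could look roughly like this. The function names and the case-insensitive substring matching are assumptions; only the list of forbidden patterns comes from the post.

```python
# Patterns listed in the post as forbidden in custom queries.
FORBIDDEN_PATTERNS = ("@", "import", "exec", "eval", "__")


def find_forbidden(query: str) -> list:
    """Return the forbidden patterns that occur in the query, in list order."""
    lowered = query.lower()
    return [p for p in FORBIDDEN_PATTERNS if p in lowered]


def is_query_allowed(query: str) -> bool:
    """True when the query contains none of the forbidden patterns."""
    return not find_forbidden(query)
```

Plain substring matching is deliberately blunt: it would also reject a query that mentions a column named evaluation_score, so a real implementation may want token-level checks.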

🌐 API Endpoints

Endpoint	Method	Description
`/health`	GET	Health check
`/v1/upload`	POST	Upload dataset file (multipart)
`/v1/connect/sql`	POST	Connect to SQL database (read-only)
`/v1/analyze`	POST	Start EDA analysis run
`/v1/reports/{run_id}`	GET	Get report by run ID

🚀 Quick Start

# Install
pip install agentic-eda

# Pull the default model (Ollama must be installed and running; other models also work)
ollama pull llama3.1

# Start the service
agentic-eda serve
# → API at http://localhost:8000

# Upload and analyze
curl -X POST -F "file=@data.csv" http://localhost:8000/v1/upload
# → { dataset_id: "ds_abc123" }

curl -X POST -H "Content-Type: application/json" \
  -d '{"dataset_id": "ds_abc123", "description": "Sales data Q1 2026"}' \
  http://localhost:8000/v1/analyze
# → { run_id: "run_def456", status: "running" }

# Get report
curl http://localhost:8000/v1/reports/run_def456
# → Markdown report with embedded plots
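
The same analyze call can be made from Python with the standard library. This is a sketch that assumes the service returns a JSON body as shown in the comments above; the helper names build_analyze_request and start_analysis are hypothetical.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # address from the quick start above


def build_analyze_request(dataset_id: str, description: str,
                          base_url: str = BASE_URL) -> urllib.request.Request:
    """Build the POST /v1/analyze request without sending it."""
    body = json.dumps(
        {"dataset_id": dataset_id, "description": description}
    ).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/v1/analyze",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def start_analysis(dataset_id: str, description: str) -> dict:
    """Send the request; needs the service running locally."""
    req = build_analyze_request(dataset_id, description)
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

Splitting request construction from sending makes the payload easy to inspect or test without a running server.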

💡 Why It’s Interesting

AgenticEDA combines three things that rarely appear together: a local-only LLM (no data leaves the machine), LangGraph agent orchestration, and a FastAPI service layer. The agent doesn’t just run tools blindly: it plans the analysis, chooses which tool to call at each step, and synthesizes the results into a single report. The tools cover IQR outlier detection, multiple plot types, and correlation analysis, and the session-based architecture supports multiple concurrent datasets and runs. It’s a useful option for data scientists who want automated EDA without sending their data to a cloud service.
