BERT Topic Modeling CLI for Echo Episodes
This CLI tool performs topic modeling on Echo episodes using BERTopic. It connects to Neo4j, retrieves a given user's episodes along with their pre-computed embeddings, and discovers thematic clusters with HDBSCAN.
Features
- Connects to Neo4j database to fetch episodes
- Uses pre-computed embeddings (no need to regenerate them)
- Performs semantic topic clustering with BERTopic
- Displays topics with:
  - Top keywords per topic
  - Episode count per topic
  - Sample episodes for each topic
- Configurable minimum topic size
- Environment variable support for easy configuration
Prerequisites
- Python 3.8+
- Access to Neo4j database with episodes stored
- Pre-computed embeddings stored in Neo4j (in the contentEmbedding field)
Installation
- Navigate to the bert directory:
cd apps/webapp/app/bert
- Install dependencies:
pip install -r requirements.txt
Configuration
The CLI can read Neo4j connection details from:
- Environment variables (recommended): create a .env file or export the variables:
  export NEO4J_URI=bolt://localhost:7687
  export NEO4J_USERNAME=neo4j
  export NEO4J_PASSWORD=your_password
- Command-line options: pass credentials directly as flags
- From project root: the tool automatically loads .env from the project root
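The three sources above can be combined in a single lookup. The following is a minimal sketch, not the tool's actual code: the helper name `resolve_neo4j_config`, the `cli_args` dictionary keys, and the precedence (flags first, then environment variables, then the documented defaults) are all assumptions made for illustration.

```python
import os

# Defaults taken from the option descriptions in this README.
DEFAULTS = {"uri": "bolt://localhost:7687", "username": "neo4j"}


def resolve_neo4j_config(cli_args, env=None):
    """Merge CLI flags, environment variables, and defaults.

    Assumed precedence (hypothetical): flag > environment variable > default.
    """
    env = os.environ if env is None else env
    uri = cli_args.get("neo4j_uri") or env.get("NEO4J_URI") or DEFAULTS["uri"]
    username = (
        cli_args.get("neo4j_username")
        or env.get("NEO4J_USERNAME")
        or DEFAULTS["username"]
    )
    # The password has no default, so a missing value is an error.
    password = cli_args.get("neo4j_password") or env.get("NEO4J_PASSWORD")
    if not password:
        raise ValueError("Neo4j password is required (flag or NEO4J_PASSWORD)")
    return {"uri": uri, "username": username, "password": password}
```

Passing `env` explicitly keeps the helper easy to test without touching the real process environment.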
Usage
Basic Usage
Using environment variables (most common):
python main.py <user_id>
Advanced Options
python main.py <user_id> [OPTIONS]
Options:
- --min-topic-size INTEGER: Minimum number of episodes per topic (default: 10)
- --nr-topics INTEGER: Target number of topics for reduction (optional)
- --propose-spaces: Generate space proposals using OpenAI (requires OPENAI_API_KEY)
- --openai-api-key TEXT: OpenAI API key for space proposals (or use the OPENAI_API_KEY env var)
- --json: Output only final results in JSON format (suppresses all other output)
- --neo4j-uri TEXT: Neo4j connection URI (default: bolt://localhost:7687)
- --neo4j-username TEXT: Neo4j username (default: neo4j)
- --neo4j-password TEXT: Neo4j password (required)
Examples
- Basic usage with environment variables:
  python main.py user-123
- Custom minimum topic size:
  python main.py user-123 --min-topic-size 10
- Explicit credentials:
  python main.py user-123 \
    --neo4j-uri bolt://neo4j:7687 \
    --neo4j-username neo4j \
    --neo4j-password mypassword
- Using Docker Compose Neo4j:
  python main.py user-123 \
    --neo4j-uri bolt://localhost:7687 \
    --neo4j-password 27192e6432564f4788d55c15131bd5ac
- With space proposals:
  python main.py user-123 --propose-spaces
- JSON output mode (for programmatic use):
  python main.py user-123 --json
- JSON output with space proposals:
  python main.py user-123 --propose-spaces --json
Get Help
python main.py --help
Output Formats
Human-Readable Output (Default)
The CLI outputs:
================================================================================
BERT TOPIC MODELING FOR ECHO EPISODES
================================================================================
User ID: user-123
Min Topic Size: 20
================================================================================
✓ Connected to Neo4j at bolt://localhost:7687
✓ Fetched 150 episodes with embeddings
🔍 Running BERTopic analysis (min_topic_size=20)...
✓ Topic modeling complete
================================================================================
TOPIC MODELING RESULTS
================================================================================
Total Topics Found: 5
Total Episodes: 150
================================================================================
────────────────────────────────────────────────────────────────────────────────
Topic 0: 45 episodes
────────────────────────────────────────────────────────────────────────────────
Keywords: authentication, login, user, security, session, password, token, oauth, jwt, credentials
Sample Episodes (showing up to 3):
1. [uuid-123]
Discussing authentication flow for the new user login system...
2. [uuid-456]
Implementing OAuth2 with JWT tokens for secure sessions...
3. [uuid-789]
Password reset functionality with email verification...
────────────────────────────────────────────────────────────────────────────────
Topic 1: 32 episodes
────────────────────────────────────────────────────────────────────────────────
Keywords: database, neo4j, query, graph, cypher, nodes, relationships, index, performance, optimization
Sample Episodes (showing up to 3):
...
Topic -1 (Outliers): 8 episodes
================================================================================
✓ Analysis complete!
================================================================================
✓ Neo4j connection closed
JSON Output Mode (--json flag)
When using the --json flag, the tool outputs only a clean JSON object with no debug logs:
{
  "topics": {
    "0": {
      "keywords": ["authentication", "login", "user", "security", "session"],
      "episodeIds": ["uuid-123", "uuid-456", "uuid-789"]
    },
    "1": {
      "keywords": ["database", "neo4j", "query", "graph", "cypher"],
      "episodeIds": ["uuid-abc", "uuid-def"]
    }
  },
  "spaces": [
    {
      "name": "User Authentication",
      "intent": "Episodes about user authentication, login systems, and security belong in this space.",
      "confidence": 85,
      "topics": [0, 3],
      "estimatedEpisodes": 120
    }
  ]
}
JSON Output Structure:
- topics: Dictionary of topic IDs with keywords and episode UUIDs
- spaces: Array of space proposals (only if --propose-spaces is used)
  - name: Space name (2-5 words)
  - intent: Classification intent (1-2 sentences)
  - confidence: Confidence score (0-100)
  - topics: Source topic IDs that form this space
  - estimatedEpisodes: Estimated number of episodes in this space
Use Cases for JSON Mode:
- Programmatic consumption by other tools
- Piping output to jq or other JSON processors
- Integration with CI/CD pipelines
- Automated space creation workflows
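For programmatic consumption, a caller might invoke the CLI with --json and parse its standard output. The sketch below assumes the JSON structure documented above and that main.py is in the current working directory. The helper names `run_cli`, `summarize_topics`, and `confident_spaces`, and the confidence threshold of 80, are hypothetical.

```python
import json
import subprocess
import sys


def run_cli(user_id, propose_spaces=False):
    """Run the CLI in JSON mode and return the parsed result."""
    cmd = [sys.executable, "main.py", user_id, "--json"]
    if propose_spaces:
        cmd.append("--propose-spaces")
    # check=True raises CalledProcessError on a non-zero exit code.
    completed = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return json.loads(completed.stdout)


def summarize_topics(result):
    """Return (topic_id, episode_count, keywords) tuples, largest topic first."""
    rows = [
        (topic_id, len(topic["episodeIds"]), topic["keywords"])
        for topic_id, topic in result["topics"].items()
    ]
    return sorted(rows, key=lambda row: row[1], reverse=True)


def confident_spaces(result, threshold=80):
    """Names of proposed spaces at or above the threshold (hypothetical cutoff)."""
    # "spaces" is only present when --propose-spaces was used.
    return [s["name"] for s in result.get("spaces", []) if s["confidence"] >= threshold]
```

A typical call would be `summarize_topics(run_cli("user-123"))`; the parsing helpers also work on JSON saved to a file.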
How It Works
- Connection: Establishes connection to Neo4j database
- Data Fetching: Queries all episodes for the given userId that have:
  - Non-null contentEmbedding field
  - Non-empty content
- Topic Modeling: Runs BERTopic with:
  - Pre-computed embeddings (no re-embedding needed)
  - HDBSCAN clustering (automatic cluster discovery)
  - Keyword extraction via c-TF-IDF
- Results: Displays topics with keywords and sample episodes
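The topic modeling and results steps can be sketched as follows. This is an illustration under assumptions rather than the tool's actual implementation: `fit_topics` and `build_topics` are hypothetical names, and BERTopic is used with its default settings apart from min_topic_size and nr_topics. Passing `embeddings=` to `fit_transform` is what lets BERTopic skip re-embedding.

```python
from collections import defaultdict


def fit_topics(docs, embeddings, min_topic_size=10, nr_topics=None):
    """Cluster documents with BERTopic using pre-computed embeddings."""
    # Third-party imports are local so build_topics() works without them.
    import numpy as np
    from bertopic import BERTopic

    model = BERTopic(min_topic_size=min_topic_size, nr_topics=nr_topics)
    labels, _probs = model.fit_transform(docs, embeddings=np.asarray(embeddings))
    # get_topic() returns (word, c-TF-IDF score) pairs for a topic id.
    keywords = {
        topic_id: [word for word, _score in model.get_topic(topic_id)]
        for topic_id in set(labels)
        if topic_id != -1
    }
    return list(labels), keywords


def build_topics(labels, uuids, keywords):
    """Group episode UUIDs by topic label, skipping the -1 outlier bucket."""
    grouped = defaultdict(list)
    for label, uuid in zip(labels, uuids):
        if label != -1:
            grouped[label].append(uuid)
    return {
        str(topic_id): {"keywords": keywords.get(topic_id, []), "episodeIds": ids}
        for topic_id, ids in sorted(grouped.items())
    }
```

The dictionary returned by `build_topics` mirrors the "topics" object shown in the JSON output section.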
Neo4j Query
The tool uses this Cypher query to fetch episodes:
MATCH (e:Episode {userId: $userId})
WHERE e.contentEmbedding IS NOT NULL
AND size(e.contentEmbedding) > 0
AND e.content IS NOT NULL
AND e.content <> ''
RETURN e.uuid as uuid,
e.content as content,
e.contentEmbedding as embedding,
e.createdAt as createdAt
ORDER BY e.createdAt DESC
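A minimal sketch of running this query with the Neo4j Python driver is shown below. The function names `fetch_episodes` and `split_columns` are hypothetical, and the tool's own code may structure this differently.

```python
EPISODE_QUERY = """
MATCH (e:Episode {userId: $userId})
WHERE e.contentEmbedding IS NOT NULL
  AND size(e.contentEmbedding) > 0
  AND e.content IS NOT NULL
  AND e.content <> ''
RETURN e.uuid AS uuid,
       e.content AS content,
       e.contentEmbedding AS embedding,
       e.createdAt AS createdAt
ORDER BY e.createdAt DESC
"""


def fetch_episodes(uri, username, password, user_id):
    """Return one dict per episode with uuid, content, embedding, createdAt."""
    # Local import so split_columns() can be used without the driver installed.
    from neo4j import GraphDatabase

    with GraphDatabase.driver(uri, auth=(username, password)) as driver:
        with driver.session() as session:
            result = session.run(EPISODE_QUERY, userId=user_id)
            return [record.data() for record in result]


def split_columns(rows):
    """Split fetched rows into parallel lists for the topic model."""
    uuids = [row["uuid"] for row in rows]
    contents = [row["content"] for row in rows]
    embeddings = [row["embedding"] for row in rows]
    return uuids, contents, embeddings
```

Keeping the three lists parallel matters because the topic labels come back in the same order as the documents that were passed in.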
Tuning Parameters
--min-topic-size:
- Smaller values (5-10): More granular topics, may include noise
- Larger values (20-30): Broader topics, more coherent but fewer clusters
- Recommended: Start with 20 and adjust based on your data
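When comparing runs with different --min-topic-size values, a few summary numbers can make the trade-off easier to judge. The helper below is a hypothetical sketch, not part of the CLI. It assumes a list of per-episode topic labels in which -1 marks outliers, as in the sample output.

```python
from collections import Counter


def summarize_labels(labels):
    """Summarize one clustering run from its per-episode topic labels."""
    counts = Counter(labels)
    outliers = counts.pop(-1, 0)  # -1 is the outlier bucket
    total = len(labels)
    return {
        "topics": len(counts),
        "outlier_share": outliers / total if total else 0.0,
        "smallest_topic": min(counts.values(), default=0),
    }
```

A rising outlier share or a falling topic count as the value increases suggests the setting has become too coarse for the data.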
Troubleshooting
No episodes found
- Verify the userId exists in Neo4j
- Check that episodes have contentEmbedding populated
- Ensure episodes have a non-empty content field
Connection errors
- Verify Neo4j is running: docker ps | grep neo4j
- Check the URI format: it should be bolt://host:port
- Verify credentials are correct
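The URI format check can be automated before attempting a connection. This sketch validates only the bolt://host:port form named above. The function name `check_bolt_uri` is hypothetical, and the Neo4j driver itself also accepts other schemes that are not covered here.

```python
from urllib.parse import urlparse


def check_bolt_uri(uri):
    """Return a list of problems found in a bolt://host:port URI (empty if none)."""
    parsed = urlparse(uri)
    problems = []
    if parsed.scheme != "bolt":
        problems.append("scheme is not bolt")
    if not parsed.hostname:
        problems.append("missing host")
    try:
        port = parsed.port  # raises ValueError for a non-numeric port
    except ValueError:
        problems.append("invalid port")
    else:
        if port is None:
            problems.append("missing port")
    return problems
```

For example, `check_bolt_uri("bolt://localhost")` reports only the missing port.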
Too few/many topics
- Adjust the --min-topic-size parameter
- Need more topics: decrease the value (e.g., --min-topic-size 10)
- Need fewer topics: increase the value (e.g., --min-topic-size 30)
Dependencies
- bertopic>=0.16.0 - Topic modeling
- neo4j>=5.14.0 - Neo4j Python driver
- click>=8.1.0 - CLI framework
- numpy>=1.24.0 - Numerical operations
- python-dotenv>=1.0.0 - Environment variable loading
License
Part of the Echo project.