# BERT Topic Modeling CLI for Echo Episodes

This CLI tool performs topic modeling on Echo episodes using BERTopic. It connects to Neo4j, retrieves a given user's episodes together with their pre-computed embeddings, and discovers thematic clusters using HDBSCAN.

## Features

- Connects to a Neo4j database to fetch episodes
- Uses pre-computed embeddings (no need to regenerate them)
- Performs semantic topic clustering with BERTopic
- Displays topics with:
  - Top keywords per topic
  - Episode count per topic
  - Sample episodes for each topic
- Configurable minimum topic size
- Environment variable support for easy configuration

## Prerequisites

- Python 3.8+
- Access to a Neo4j database with episodes stored
- Pre-computed embeddings stored in Neo4j (in the `contentEmbedding` field)

## Installation

1. Navigate to the bert directory:

   ```bash
   cd apps/webapp/app/bert
   ```

2. Install dependencies:

   ```bash
   pip install -r requirements.txt
   ```

## Configuration

The CLI can read Neo4j connection details from:

1. **Environment variables** (recommended) - Create a `.env` file or export:

   ```bash
   export NEO4J_URI=bolt://localhost:7687
   export NEO4J_USERNAME=neo4j
   export NEO4J_PASSWORD=your_password
   ```

2. **Command-line options** - Pass credentials directly as flags

3. **From project root** - The tool automatically loads `.env` from the project root

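The sketch below illustrates one way these sources could be combined. It is not the tool's actual code: the function name `resolve_neo4j_config` is hypothetical, and the precedence (explicit flags first, then environment variables, then the defaults listed under Advanced Options) is an assumption rather than documented behavior.

```python
import os
from typing import Mapping, Optional

# Defaults as documented in the option list of this README; the password has none.
DEFAULTS = {"uri": "bolt://localhost:7687", "username": "neo4j"}
ENV_NAMES = {
    "uri": "NEO4J_URI",
    "username": "NEO4J_USERNAME",
    "password": "NEO4J_PASSWORD",
}


def resolve_neo4j_config(
    flags: Mapping[str, Optional[str]],
    env: Mapping[str, str] = os.environ,
) -> dict:
    """Resolve each setting from flags, then env vars, then defaults (assumed order)."""
    config = {}
    for key, env_name in ENV_NAMES.items():
        value = flags.get(key) or env.get(env_name) or DEFAULTS.get(key)
        if value is None:
            # Only the password can end up here, since it has no default.
            raise ValueError(f"Missing Neo4j {key}: pass a flag or set {env_name}")
        config[key] = value
    return config
```

For example, `resolve_neo4j_config({"password": "secret"}, {})` would fall back to the default URI and username, while a missing password raises a `ValueError`.
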
## Usage

### Basic Usage

Using environment variables (most common):

```bash
python main.py <user_id>
```

### Advanced Options

```bash
python main.py <user_id> [OPTIONS]
```

**Options:**

- `--min-topic-size INTEGER`: Minimum number of episodes per topic (default: 10)
- `--nr-topics INTEGER`: Target number of topics for reduction (optional)
- `--propose-spaces`: Generate space proposals using OpenAI (requires OPENAI_API_KEY)
- `--openai-api-key TEXT`: OpenAI API key for space proposals (or use OPENAI_API_KEY env var)
- `--json`: Output only final results in JSON format (suppresses all other output)
- `--neo4j-uri TEXT`: Neo4j connection URI (default: bolt://localhost:7687)
- `--neo4j-username TEXT`: Neo4j username (default: neo4j)
- `--neo4j-password TEXT`: Neo4j password (required)

### Examples

1. **Basic usage with environment variables:**

   ```bash
   python main.py user-123
   ```

2. **Custom minimum topic size:**

   ```bash
   python main.py user-123 --min-topic-size 20
   ```

3. **Explicit credentials:**

   ```bash
   python main.py user-123 \
     --neo4j-uri bolt://neo4j:7687 \
     --neo4j-username neo4j \
     --neo4j-password mypassword
   ```

4. **Using Docker Compose Neo4j:**

   ```bash
   python main.py user-123 \
     --neo4j-uri bolt://localhost:7687 \
     --neo4j-password 27192e6432564f4788d55c15131bd5ac
   ```

5. **With space proposals:**

   ```bash
   python main.py user-123 --propose-spaces
   ```

6. **JSON output mode (for programmatic use):**

   ```bash
   python main.py user-123 --json
   ```

7. **JSON output with space proposals:**

   ```bash
   python main.py user-123 --propose-spaces --json
   ```

### Get Help

```bash
python main.py --help
```

## Output Formats

### Human-Readable Output (Default)

A run with `--min-topic-size 20` prints output like the following:

```
================================================================================
BERT TOPIC MODELING FOR ECHO EPISODES
================================================================================
User ID: user-123
Min Topic Size: 20
================================================================================

✓ Connected to Neo4j at bolt://localhost:7687
✓ Fetched 150 episodes with embeddings

🔍 Running BERTopic analysis (min_topic_size=20)...
✓ Topic modeling complete

================================================================================
TOPIC MODELING RESULTS
================================================================================
Total Topics Found: 5
Total Episodes: 150
================================================================================

────────────────────────────────────────────────────────────────────────────────
Topic 0: 45 episodes
────────────────────────────────────────────────────────────────────────────────
Keywords: authentication, login, user, security, session, password, token, oauth, jwt, credentials

Sample Episodes (showing up to 3):
1. [uuid-123]
Discussing authentication flow for the new user login system...

2. [uuid-456]
Implementing OAuth2 with JWT tokens for secure sessions...

3. [uuid-789]
Password reset functionality with email verification...

────────────────────────────────────────────────────────────────────────────────
Topic 1: 32 episodes
────────────────────────────────────────────────────────────────────────────────
Keywords: database, neo4j, query, graph, cypher, nodes, relationships, index, performance, optimization

Sample Episodes (showing up to 3):
...

Topic -1 (Outliers): 8 episodes

================================================================================
✓ Analysis complete!
================================================================================

✓ Neo4j connection closed
```

### JSON Output Mode (--json flag)

When using the `--json` flag, the tool outputs only a clean JSON object with no debug logs:

```json
{
  "topics": {
    "0": {
      "keywords": ["authentication", "login", "user", "security", "session"],
      "episodeIds": ["uuid-123", "uuid-456", "uuid-789"]
    },
    "1": {
      "keywords": ["database", "neo4j", "query", "graph", "cypher"],
      "episodeIds": ["uuid-abc", "uuid-def"]
    }
  },
  "spaces": [
    {
      "name": "User Authentication",
      "intent": "Episodes about user authentication, login systems, and security belong in this space.",
      "confidence": 85,
      "topics": [0, 3],
      "estimatedEpisodes": 120
    }
  ]
}
```

**JSON Output Structure:**
- `topics`: Dictionary of topic IDs with keywords and episode UUIDs
- `spaces`: Array of space proposals (only if `--propose-spaces` is used)
  - `name`: Space name (2-5 words)
  - `intent`: Classification intent (1-2 sentences)
  - `confidence`: Confidence score (0-100)
  - `topics`: Source topic IDs that form this space
  - `estimatedEpisodes`: Estimated number of episodes in this space

**Use Cases for JSON Mode:**
- Programmatic consumption by other tools
- Piping output to jq or other JSON processors
- Integration with CI/CD pipelines
- Automated space creation workflows

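As a rough illustration of programmatic consumption, the sketch below parses a result shaped like the JSON structure described in this section and keeps only the space proposals at or above a confidence threshold. The helper `select_spaces` and the threshold of 80 are hypothetical and not part of the CLI; the sample data is a shortened copy of the example in this README.

```python
import json

# Shortened copy of the example output; real data would come from the CLI's stdout.
SAMPLE = """
{
  "topics": {
    "0": {"keywords": ["authentication", "login"], "episodeIds": ["uuid-123", "uuid-456"]},
    "1": {"keywords": ["database", "neo4j"], "episodeIds": ["uuid-abc"]}
  },
  "spaces": [
    {"name": "User Authentication", "intent": "Episodes about user authentication.",
     "confidence": 85, "topics": [0, 3], "estimatedEpisodes": 120}
  ]
}
"""


def select_spaces(result: dict, min_confidence: int = 80) -> list:
    """Return space proposals whose confidence meets the threshold."""
    # "spaces" is only present when --propose-spaces was used, so default to [].
    return [s for s in result.get("spaces", []) if s["confidence"] >= min_confidence]


result = json.loads(SAMPLE)
for space in select_spaces(result):
    print(f"{space['name']} ({space['confidence']}%): topics {space['topics']}")
```

In a real workflow, the JSON text would typically be captured from `python main.py <user_id> --propose-spaces --json`, for instance with `subprocess.run(..., capture_output=True, text=True)`.
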
## How It Works

1. **Connection**: Establishes a connection to the Neo4j database
2. **Data Fetching**: Queries all episodes for the given `userId` that have:
   - Non-null `contentEmbedding` field
   - Non-empty content
3. **Topic Modeling**: Runs BERTopic with:
   - Pre-computed embeddings (no re-embedding needed)
   - HDBSCAN clustering (automatic cluster discovery)
   - Keyword extraction via c-TF-IDF
4. **Results**: Displays topics with keywords and sample episodes

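The topic modeling step above can be sketched as follows. This is a minimal illustration, not the tool's actual code: the function names `run_topic_model` and `group_by_topic` are hypothetical, and it relies on BERTopic's default HDBSCAN clustering and c-TF-IDF keyword extraction rather than any custom configuration the tool may use.

```python
from collections import defaultdict


def run_topic_model(docs, embeddings, min_topic_size=10):
    """Fit BERTopic on documents whose embeddings were computed elsewhere."""
    # Imported lazily so group_by_topic stays usable without these packages installed.
    import numpy as np
    from bertopic import BERTopic

    topic_model = BERTopic(min_topic_size=min_topic_size)
    # Passing `embeddings` lets BERTopic skip its own embedding step.
    topics, _probs = topic_model.fit_transform(docs, embeddings=np.asarray(embeddings))
    return topic_model, topics


def group_by_topic(uuids, topics):
    """Group episode UUIDs by assigned topic ID; -1 is BERTopic's outlier topic."""
    grouped = defaultdict(list)
    for uuid, topic in zip(uuids, topics):
        grouped[topic].append(uuid)
    return dict(grouped)
```

After fitting, `topic_model.get_topic(topic_id)` returns the `(keyword, score)` pairs for a topic, which is one way the per-topic keyword lists could be produced.
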
## Neo4j Query

The tool uses this Cypher query to fetch episodes:

```cypher
MATCH (e:Episode {userId: $userId})
WHERE e.contentEmbedding IS NOT NULL
  AND size(e.contentEmbedding) > 0
  AND e.content IS NOT NULL
  AND e.content <> ''
RETURN e.uuid as uuid,
       e.content as content,
       e.contentEmbedding as embedding,
       e.createdAt as createdAt
ORDER BY e.createdAt DESC
```

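For readers who want to run the query themselves, here is a hedged sketch using the Neo4j Python driver's `execute_query` method. The function names `fetch_episodes` and `split_columns` are hypothetical, and the tool's own data-access code may be structured differently.

```python
EPISODE_QUERY = """
MATCH (e:Episode {userId: $userId})
WHERE e.contentEmbedding IS NOT NULL
  AND size(e.contentEmbedding) > 0
  AND e.content IS NOT NULL
  AND e.content <> ''
RETURN e.uuid as uuid,
       e.content as content,
       e.contentEmbedding as embedding,
       e.createdAt as createdAt
ORDER BY e.createdAt DESC
"""


def fetch_episodes(uri, username, password, user_id):
    """Run the episode query and return one plain dict per episode."""
    # Imported lazily so split_columns stays usable without the driver installed.
    from neo4j import GraphDatabase

    with GraphDatabase.driver(uri, auth=(username, password)) as driver:
        # Extra keyword arguments are passed to the query as parameters.
        records, _summary, _keys = driver.execute_query(EPISODE_QUERY, userId=user_id)
        return [record.data() for record in records]


def split_columns(rows):
    """Split row dicts into parallel lists of UUIDs, documents, and embeddings."""
    uuids = [row["uuid"] for row in rows]
    docs = [row["content"] for row in rows]
    embeddings = [row["embedding"] for row in rows]
    return uuids, docs, embeddings
```

The parallel lists keep each document aligned with its embedding, which is what a call such as BERTopic's `fit_transform(docs, embeddings=...)` expects.
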
## Tuning Parameters

- **`--min-topic-size`**:
  - Smaller values (5-10): More granular topics, may include noise
  - Larger values (20-30): Broader topics, more coherent but fewer clusters
  - Recommended: Start with 20 and adjust based on your data

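One way to compare settings is to run the CLI in JSON mode for several values and look at how many topics result and how large they are. The sketch below assumes it is run from the directory containing `main.py`; the helper names are hypothetical, and skipping a `"-1"` key is a precaution in case outliers appear in the JSON output, which this README does not specify.

```python
import json
import subprocess
import sys


def topic_sizes(result):
    """Map topic ID to episode count, skipping an outlier topic "-1" if present."""
    return {
        topic_id: len(topic["episodeIds"])
        for topic_id, topic in result["topics"].items()
        if topic_id != "-1"
    }


def run_cli(user_id, min_topic_size):
    """Invoke the CLI in JSON mode and parse its stdout."""
    completed = subprocess.run(
        [sys.executable, "main.py", user_id,
         "--min-topic-size", str(min_topic_size), "--json"],
        capture_output=True, text=True, check=True,
    )
    return json.loads(completed.stdout)


def sweep(user_id, sizes=(10, 20, 30)):
    """Print the topic count and topic sizes for each candidate setting."""
    for size in sizes:
        counts = topic_sizes(run_cli(user_id, size))
        print(f"min_topic_size={size}: {len(counts)} topics, sizes={sorted(counts.values())}")
```

Calling `sweep("user-123")` would then print one summary line per setting, which makes it easier to see where topics start to merge or fragment.
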
## Troubleshooting

### No episodes found

- Verify the userId exists in Neo4j
- Check that episodes have `contentEmbedding` populated
- Ensure episodes have non-empty `content` field

### Connection errors

- Verify Neo4j is running: `docker ps | grep neo4j`
- Check URI format: should be `bolt://host:port`
- Verify credentials are correct

### Too few/many topics

- Adjust `--min-topic-size` parameter
  - Need more topics: decrease the value (e.g., `--min-topic-size 10`)
  - Need fewer topics: increase the value (e.g., `--min-topic-size 30`)

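When a connection error looks like a malformed URI, a quick sanity check can help before digging further. The sketch below is a hypothetical helper, not part of the CLI; besides `bolt`, it also accepts the other URI schemes the Neo4j driver recognizes.

```python
from urllib.parse import urlparse

# URI schemes understood by the Neo4j drivers.
KNOWN_SCHEMES = {"bolt", "bolt+s", "bolt+ssc", "neo4j", "neo4j+s", "neo4j+ssc"}


def check_neo4j_uri(uri):
    """Return a list of human-readable problems found in a Neo4j URI."""
    problems = []
    parsed = urlparse(uri)
    if parsed.scheme not in KNOWN_SCHEMES:
        problems.append(f"unexpected scheme {parsed.scheme!r}")
    if not parsed.hostname:
        problems.append("missing host")
    try:
        parsed.port  # accessing the property raises ValueError for a non-numeric port
    except ValueError:
        problems.append("port is not a valid number")
    return problems
```

An empty list means the URI is at least well-formed; it does not confirm that the server is reachable or that the credentials are valid.
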
## Dependencies

- `bertopic>=0.16.0` - Topic modeling
- `neo4j>=5.14.0` - Neo4j Python driver
- `click>=8.1.0` - CLI framework
- `numpy>=1.24.0` - Numerical operations
- `python-dotenv>=1.0.0` - Environment variable loading

## License

Part of the Echo project.