Feat: Space v3

* feat: space v3

* feat: connected space creation

* fix:

* fix: session_id for memory ingestion

* chore: simplify gitignore patterns for agent directories

---------

Co-authored-by: Manoj <saimanoj58@gmail.com>
Authored by Harshith Mullapudi on 2025-10-30 12:30:56 +05:30; committed by GitHub
parent c5407be54d
commit c869096be8
60 changed files with 3809 additions and 5809 deletions


@ -41,10 +41,7 @@ NEO4J_USERNAME=neo4j
NEO4J_PASSWORD=27192e6432564f4788d55c15131bd5ac
OPENAI_API_KEY=
MAGIC_LINK_SECRET=27192e6432564f4788d55c15131bd5ac
NEO4J_AUTH=neo4j/27192e6432564f4788d55c15131bd5ac
OLLAMA_URL=http://ollama:11434


@ -7,32 +7,6 @@ on:
workflow_dispatch:
jobs:
build-init:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v2
with:
ref: main
- name: Set up QEMU
uses: docker/setup-qemu-action@v1
- name: Set up Docker Buildx
uses: docker/setup-buildx-action@v1
- name: Login to Docker Registry
run: echo "${{ secrets.DOCKER_PASSWORD }}" | docker login -u "${{ secrets.DOCKER_USERNAME }}" --password-stdin
- name: Build and Push Frontend Docker Image
uses: docker/build-push-action@v2
with:
context: .
file: ./apps/init/Dockerfile
platforms: linux/amd64,linux/arm64
push: true
tags: redplanethq/init:${{ github.ref_name }}
build-webapp:
runs-on: ubuntu-latest

.gitignore (17 changes)

@ -46,13 +46,14 @@ registry/
.cursor
CLAUDE.md
AGENTS.md
.claude
.clinerules/byterover-rules.md
.kilocode/rules/byterover-rules.md
.roo/rules/byterover-rules.md
.windsurf/rules/byterover-rules.md
.cursor/rules/byterover-rules.mdc
.kiro/steering/byterover-rules.md
.qoder/rules/byterover-rules.md
.augment/rules/byterover-rules.md
.clinerules
.kilocode
.roo
.windsurf
.cursor
.kiro
.qoder
.augment


@ -55,7 +55,7 @@ CORE memory achieves **88.24%** average accuracy in Locomo dataset across all re
## Overview
**Problem**
Developers waste time re-explaining context to AI tools. Hit token limits in Claude? Start fresh and lose everything. Switch from ChatGPT/Claude to Cursor? Explain your context again. Your conversations, decisions, and insights vanish between sessions. With every new AI tool, the cost of context switching grows.
@ -64,6 +64,7 @@ Developers waste time re-explaining context to AI tools. Hit token limits in Cla
CORE is an open-source unified, persistent memory layer for all your AI tools. Your context follows you from Cursor to Claude to ChatGPT to Claude Code. One knowledge graph remembers who said what, when, and why. Connect once, remember everywhere. Stop managing context and start building.
## 🚀 CORE Self-Hosting
Want to run CORE on your own infrastructure? Self-hosting gives you complete control over your data and deployment.
**Quick Deploy Options:**
@ -80,15 +81,20 @@ Want to run CORE on your own infrastructure? Self-hosting gives you complete con
### Setup
1. Clone the repository:
```
git clone https://github.com/RedPlanetHQ/core.git
cd core
```
2. Configure environment variables in `core/.env`:
```
OPENAI_API_KEY=your_openai_api_key
```
3. Start the service
```
docker-compose up -d
```
@ -100,6 +106,7 @@ Once deployed, you can configure your AI providers (OpenAI, Anthropic) and start
Note: We tried open-source models like Ollama or GPT OSS, but fact generation was not good. We are still figuring out how to improve that, and will then also support OSS models.
## 🚀 CORE Cloud
**Build your unified memory graph in 5 minutes:**
Don't want to manage infrastructure? CORE Cloud lets you build your personal memory system instantly - no setup, no servers, just memory that works.
@ -115,24 +122,24 @@ Don't want to manage infrastructure? CORE Cloud lets you build your personal mem
## 🧩 Key Features
### 🧠 **Unified, Portable Memory**:
Add and recall your memory across **Cursor, Windsurf, Claude Desktop, Claude Code, Gemini CLI, AWS's Kiro, VS Code, and Roo Code** via MCP
![core-claude](https://github.com/user-attachments/assets/56c98288-ee87-4cd0-8b02-860aca1c7f9a)
### 🕸️ **Temporal + Reified Knowledge Graph**:
Remember the story behind every fact—track who said what, when, and why with rich relationships and full provenance, not just flat storage
![core-memory-graph](https://github.com/user-attachments/assets/5d1ee659-d519-4624-85d1-e0497cbdd60a)
### 🌐 **Browser Extension**:
Save conversations and content from ChatGPT, Grok, Gemini, Twitter, YouTube, blog posts, and any webpage directly into your CORE memory.
**How to Use Extension**
1. [Download the Extension](https://chromewebstore.google.com/detail/core-extension/cglndoindnhdbfcbijikibfjoholdjcc) from the Chrome Web Store.
2. Login to [CORE dashboard](https://core.heysol.ai)
- Navigate to Settings (bottom left)
@ -141,13 +148,12 @@ Save conversations and content from ChatGPT, Grok, Gemini, Twitter, YouTube, blo
https://github.com/user-attachments/assets/6e629834-1b9d-4fe6-ae58-a9068986036a
### 💬 **Chat with Memory**:
Ask questions like "What are my writing preferences?" with instant insights from your connected knowledge
![chat-with-memory](https://github.com/user-attachments/assets/d798802f-bd51-4daf-b2b5-46de7d206f66)
### ⚡ **Auto-Sync from Apps**:
Automatically capture relevant context from Linear, Slack, Notion, GitHub and other connected apps into your CORE memory
@ -156,16 +162,12 @@ Automatically capture relevant context from Linear, Slack, Notion, GitHub and ot
![core-slack](https://github.com/user-attachments/assets/d5fefe38-221e-4076-8a44-8ed673960f03)
### 🔗 **MCP Integration Hub**:
Connect Linear, Slack, GitHub, Notion once to CORE—then use all their tools in Claude, Cursor, or any MCP client with a single URL
![core-linear-claude](https://github.com/user-attachments/assets/7d59d92b-8c56-4745-a7ab-9a3c0341aa32)
## How CORE creates memory
<img width="12885" height="3048" alt="memory-ingest-diagram" src="https://github.com/user-attachments/assets/c51679de-8260-4bee-bebf-aff32c6b8e13" />
@ -179,7 +181,6 @@ COREs ingestion pipeline has four phases designed to capture evolving context
The Result: Instead of a flat database, CORE gives you a memory that grows and changes with you - preserving context, evolution, and ownership so agents can actually use it.
![memory-ingest-eg](https://github.com/user-attachments/assets/1d0a8007-153a-4842-9586-f6f4de43e647)
## How CORE recalls from memory
@ -204,7 +205,7 @@ Explore our documentation to get the most out of CORE
- [Connect Core MCP with Claude](https://docs.heysol.ai/providers/claude)
- [Connect Core MCP with Cursor](https://docs.heysol.ai/providers/cursor)
- [Connect Core MCP with Claude Code](https://docs.heysol.ai/providers/claude-code)
- [Connect Core MCP with Codex](https://docs.heysol.ai/providers/codex)
- [Basic Concepts](https://docs.heysol.ai/overview)
- [API Reference](https://docs.heysol.ai/api-reference/get-user-profile)
@ -249,21 +250,11 @@ Have questions or feedback? We're here to help:
<a href="https://github.com/RedPlanetHQ/core/graphs/contributors">
<img src="https://contrib.rocks/image?repo=RedPlanetHQ/core" />
</a>
<<<<<<< Updated upstream
<<<<<<< HEAD
=======
>>>>>>> Stashed changes
>>>>>>> 62db6c1 (feat: automatic space identification)

apps/init/.gitignore (51 changes)

@ -1,51 +0,0 @@
# See https://help.github.com/articles/ignoring-files/ for more about ignoring files.
# Dependencies
node_modules
.pnp
.pnp.js
# Local env files
.env
.env.local
.env.development.local
.env.test.local
.env.production.local
# Testing
coverage
# Turbo
.turbo
# Vercel
.vercel
# Build Outputs
.next/
out/
build
dist
.tshy/
.tshy-build/
# Debug
npm-debug.log*
yarn-debug.log*
yarn-error.log*
# Misc
.DS_Store
*.pem
docker-compose.dev.yaml
clickhouse/
.vscode/
registry/
.cursor
CLAUDE.md
.claude


@ -1,70 +0,0 @@
ARG NODE_IMAGE=node:20.11.1-bullseye-slim@sha256:5a5a92b3a8d392691c983719dbdc65d9f30085d6dcd65376e7a32e6fe9bf4cbe
FROM ${NODE_IMAGE} AS pruner
WORKDIR /core
COPY --chown=node:node . .
RUN npx -q turbo@2.5.3 prune --scope=@redplanethq/init --docker
RUN find . -name "node_modules" -type d -prune -exec rm -rf '{}' +
# Base strategy to have layer caching
FROM ${NODE_IMAGE} AS base
RUN apt-get update && apt-get install -y openssl dumb-init postgresql-client
WORKDIR /core
COPY --chown=node:node .gitignore .gitignore
COPY --from=pruner --chown=node:node /core/out/json/ .
COPY --from=pruner --chown=node:node /core/out/pnpm-lock.yaml ./pnpm-lock.yaml
COPY --from=pruner --chown=node:node /core/out/pnpm-workspace.yaml ./pnpm-workspace.yaml
## Dev deps
FROM base AS dev-deps
WORKDIR /core
# Corepack is used to install pnpm
RUN corepack enable
ENV NODE_ENV development
RUN pnpm install --ignore-scripts --no-frozen-lockfile
## Production deps
FROM base AS production-deps
WORKDIR /core
# Corepack is used to install pnpm
RUN corepack enable
ENV NODE_ENV production
RUN pnpm install --prod --no-frozen-lockfile
## Builder (builds the init CLI)
FROM base AS builder
WORKDIR /core
# Corepack is used to install pnpm
RUN corepack enable
COPY --from=pruner --chown=node:node /core/out/full/ .
COPY --from=dev-deps --chown=node:node /core/ .
COPY --chown=node:node turbo.json turbo.json
COPY --chown=node:node .configs/tsconfig.base.json .configs/tsconfig.base.json
RUN pnpm run build --filter=@redplanethq/init...
# Runner
FROM ${NODE_IMAGE} AS runner
RUN apt-get update && apt-get install -y openssl postgresql-client ca-certificates
WORKDIR /core
RUN corepack enable
ENV NODE_ENV production
COPY --from=base /usr/bin/dumb-init /usr/bin/dumb-init
COPY --from=pruner --chown=node:node /core/out/full/ .
COPY --from=production-deps --chown=node:node /core .
COPY --from=builder --chown=node:node /core/apps/init/dist ./apps/init/dist
# Copy the trigger dump file
COPY --chown=node:node apps/init/trigger.dump ./apps/init/trigger.dump
# Copy and set up entrypoint script
COPY --chown=node:node apps/init/entrypoint.sh ./apps/init/entrypoint.sh
RUN chmod +x ./apps/init/entrypoint.sh
USER node
WORKDIR /core/apps/init
ENTRYPOINT ["dumb-init", "--"]
CMD ["./entrypoint.sh"]


@ -1,197 +0,0 @@
# Core CLI
🧠 **CORE - Contextual Observation & Recall Engine**
A Command-Line Interface for setting up and managing the Core development environment.
## Installation
```bash
npm install -g @redplanethq/core
```
## Commands
### `core init`
**One-time setup command** - Initializes the Core development environment with full configuration.
### `core start`
**Daily usage command** - Starts all Core services (Docker containers).
### `core stop`
**Daily usage command** - Stops all Core services (Docker containers).
## Getting Started
### Prerequisites
- **Node.js** (v18.20.0 or higher)
- **Docker** and **Docker Compose**
- **Git**
- **pnpm** package manager
### Initial Setup
1. **Clone the Core repository:**
```bash
git clone https://github.com/redplanethq/core.git
cd core
```
2. **Run the initialization command:**
```bash
core init
```
3. **The CLI will guide you through the complete setup process:**
#### Step 1: Prerequisites Check
- The CLI shows a checklist of required tools
- Confirms you're in the Core repository directory
- Exits with instructions if prerequisites aren't met
#### Step 2: Environment Configuration
- Copies `.env.example` to `.env` in the root directory
- Copies `trigger/.env.example` to `trigger/.env`
- Skips copying if `.env` files already exist
#### Step 3: Docker Services Startup
- Starts main Core services: `docker compose up -d`
- Starts Trigger.dev services: `docker compose up -d` (in trigger/ directory)
- Shows real-time output with progress indicators
#### Step 4: Database Health Check
- Verifies PostgreSQL is running on `localhost:5432`
- Retries for up to 60 seconds if needed
#### Step 5: Trigger.dev Setup (Interactive)
- **If Trigger.dev is not configured:**
1. Prompts you to open http://localhost:8030
2. Asks you to login to Trigger.dev
3. Guides you to create an organization and project
4. Collects your Project ID and Secret Key
5. Updates `.env` with your Trigger.dev configuration
6. Restarts Core services with new configuration
- **If Trigger.dev is already configured:**
- Skips setup and shows "Configuration already exists" message
#### Step 6: Docker Registry Login
- Displays docker login command with credentials from `.env`
- Waits for you to complete the login process
#### Step 7: Trigger.dev Task Deployment
- Automatically runs: `npx trigger.dev@v4-beta login -a http://localhost:8030`
- Deploys tasks with: `pnpm trigger:deploy`
- Shows manual deployment instructions if automatic deployment fails
#### Step 8: Setup Complete!
- Confirms all services are running
- Shows service URLs and connection information
## Daily Usage
After initial setup, use these commands for daily development:
### Start Services
```bash
core start
```
Starts all Docker containers for Core development.
### Stop Services
```bash
core stop
```
Stops all Docker containers.
## Service URLs
After setup, these services will be available:
- **Core Application**: http://localhost:3033
- **Trigger.dev**: http://localhost:8030
- **PostgreSQL**: localhost:5432
## Troubleshooting
### Repository Not Found
If you run commands outside the Core repository:
- The CLI will ask you to confirm you're in the Core repository
- If not, it provides instructions to clone the repository
- Navigate to the Core repository directory before running commands again
### Docker Issues
- Ensure Docker is running
- Check Docker Compose is installed
- Verify you have sufficient system resources
### Trigger.dev Setup Issues
- Check container logs: `docker logs trigger-webapp --tail 50`
- Ensure you can access http://localhost:8030
- Verify your network allows connections to localhost
### Environment Variables
The CLI automatically manages these environment variables:
- `TRIGGER_PROJECT_ID` - Your Trigger.dev project ID
- `TRIGGER_SECRET_KEY` - Your Trigger.dev secret key
- Docker registry credentials for deployment
### Manual Trigger.dev Deployment
If automatic deployment fails, run manually:
```bash
npx trigger.dev@v4-beta login -a http://localhost:8030
pnpm trigger:deploy
```
## Development Workflow
1. **First time setup:** `core init`
2. **Daily development:**
- `core start` - Start your development environment
- Do your development work
- `core stop` - Stop services when done
## Support
For issues and questions:
- Check the main Core repository: https://github.com/redplanethq/core
- Review Docker container logs for troubleshooting
- Ensure all prerequisites are properly installed
## Features
- 🚀 **One-command setup** - Complete environment initialization
- 🔄 **Smart configuration** - Skips already configured components
- 📱 **Real-time feedback** - Live progress indicators and output
- 🐳 **Docker integration** - Full container lifecycle management
- 🔧 **Interactive setup** - Guided configuration process
- 🎯 **Error handling** - Graceful failure with recovery instructions
---
**Happy coding with Core!** 🎉


@ -1,22 +0,0 @@
#!/bin/sh
# Exit on any error
set -e
echo "Starting init CLI..."
# Wait for database to be ready
echo "Waiting for database connection..."
until pg_isready -h "${DB_HOST:-localhost}" -p "${DB_PORT:-5432}" -U "${POSTGRES_USER:-docker}"; do
echo "Database is unavailable - sleeping"
sleep 2
done
echo "Database is ready!"
# Run the init command
echo "Running init command..."
node ./dist/esm/index.js init
echo "Init completed successfully!"
exit 0


@ -1,145 +0,0 @@
{
"name": "@redplanethq/init",
"version": "0.1.0",
"description": "A init service to create trigger instance",
"type": "module",
"license": "MIT",
"repository": {
"type": "git",
"url": "https://github.com/redplanethq/core",
"directory": "apps/init"
},
"publishConfig": {
"access": "public"
},
"keywords": [
"typescript"
],
"files": [
"dist",
"trigger.dump"
],
"bin": {
"core": "./dist/esm/index.js"
},
"tshy": {
"selfLink": false,
"main": false,
"module": false,
"dialects": [
"esm"
],
"project": "./tsconfig.json",
"exclude": [
"**/*.test.ts"
],
"exports": {
"./package.json": "./package.json",
".": "./src/index.ts"
}
},
"devDependencies": {
"@epic-web/test-server": "^0.1.0",
"@types/gradient-string": "^1.1.2",
"@types/ini": "^4.1.1",
"@types/object-hash": "3.0.6",
"@types/polka": "^0.5.7",
"@types/react": "^18.2.48",
"@types/resolve": "^1.20.6",
"@types/rimraf": "^4.0.5",
"@types/semver": "^7.5.0",
"@types/source-map-support": "0.5.10",
"@types/ws": "^8.5.3",
"cpy-cli": "^5.0.0",
"execa": "^8.0.1",
"find-up": "^7.0.0",
"rimraf": "^5.0.7",
"ts-essentials": "10.0.1",
"tshy": "^3.0.2",
"tsx": "4.17.0"
},
"scripts": {
"clean": "rimraf dist .tshy .tshy-build .turbo",
"typecheck": "tsc -p tsconfig.src.json --noEmit",
"build": "tshy",
"test": "vitest",
"test:e2e": "vitest --run -c ./e2e/vitest.config.ts"
},
"dependencies": {
"@clack/prompts": "^0.10.0",
"@depot/cli": "0.0.1-cli.2.80.0",
"@opentelemetry/api": "1.9.0",
"@opentelemetry/api-logs": "0.52.1",
"@opentelemetry/exporter-logs-otlp-http": "0.52.1",
"@opentelemetry/exporter-trace-otlp-http": "0.52.1",
"@opentelemetry/instrumentation": "0.52.1",
"@opentelemetry/instrumentation-fetch": "0.52.1",
"@opentelemetry/resources": "1.25.1",
"@opentelemetry/sdk-logs": "0.52.1",
"@opentelemetry/sdk-node": "0.52.1",
"@opentelemetry/sdk-trace-base": "1.25.1",
"@opentelemetry/sdk-trace-node": "1.25.1",
"@opentelemetry/semantic-conventions": "1.25.1",
"ansi-escapes": "^7.0.0",
"braces": "^3.0.3",
"c12": "^1.11.1",
"chalk": "^5.2.0",
"chokidar": "^3.6.0",
"cli-table3": "^0.6.3",
"commander": "^9.4.1",
"defu": "^6.1.4",
"dotenv": "^16.4.5",
"dotenv-expand": "^12.0.2",
"esbuild": "^0.23.0",
"eventsource": "^3.0.2",
"evt": "^2.4.13",
"fast-npm-meta": "^0.2.2",
"git-last-commit": "^1.0.1",
"gradient-string": "^2.0.2",
"has-flag": "^5.0.1",
"import-in-the-middle": "1.11.0",
"import-meta-resolve": "^4.1.0",
"ini": "^5.0.0",
"jsonc-parser": "3.2.1",
"magicast": "^0.3.4",
"minimatch": "^10.0.1",
"mlly": "^1.7.1",
"nypm": "^0.5.4",
"nanoid": "3.3.8",
"object-hash": "^3.0.0",
"open": "^10.0.3",
"knex": "3.1.0",
"p-limit": "^6.2.0",
"p-retry": "^6.1.0",
"partysocket": "^1.0.2",
"pkg-types": "^1.1.3",
"polka": "^0.5.2",
"pg": "8.16.3",
"resolve": "^1.22.8",
"semver": "^7.5.0",
"signal-exit": "^4.1.0",
"source-map-support": "0.5.21",
"std-env": "^3.7.0",
"supports-color": "^10.0.0",
"tiny-invariant": "^1.2.0",
"tinyexec": "^0.3.1",
"tinyglobby": "^0.2.10",
"uuid": "11.1.0",
"ws": "^8.18.0",
"xdg-app-paths": "^8.3.0",
"zod": "3.23.8",
"zod-validation-error": "^1.5.0"
},
"engines": {
"node": ">=18.20.0"
},
"exports": {
"./package.json": "./package.json",
".": {
"import": {
"types": "./dist/esm/index.d.ts",
"default": "./dist/esm/index.js"
}
}
}
}

View File

@ -1,14 +0,0 @@
import { Command } from "commander";
import { initCommand } from "../commands/init.js";
import { VERSION } from "./version.js";
const program = new Command();
program.name("core").description("Core CLI - A Command-Line Interface for Core").version(VERSION);
program
.command("init")
.description("Initialize Core development environment (run once)")
.action(initCommand);
program.parse(process.argv);


@ -1,3 +0,0 @@
import { env } from "../utils/env.js";
export const VERSION = env.VERSION;


@ -1,36 +0,0 @@
import { intro, outro, note } from "@clack/prompts";
import { printCoreBrainLogo } from "../utils/ascii.js";
import { initTriggerDatabase, updateWorkerImage } from "../utils/trigger.js";
export async function initCommand() {
// Display the CORE brain logo
printCoreBrainLogo();
intro("🚀 Core Development Environment Setup");
try {
await initTriggerDatabase();
await updateWorkerImage();
note(
[
"Your services will start running:",
"",
"• Core Application: http://localhost:3033",
"• Trigger.dev: http://localhost:8030",
"• PostgreSQL: localhost:5432",
"",
"You can now start developing with Core!",
"",
" When logging in to the Core Application, you can find the login URL in the Docker container logs:",
" docker logs core-app --tail 50",
].join("\n"),
"🚀 Services Running"
);
outro("🎉 Setup Complete!");
process.exit(0);
} catch (error: any) {
outro(`❌ Setup failed: ${error.message}`);
process.exit(1);
}
}


@ -1,3 +0,0 @@
#!/usr/bin/env node
import "./cli/index.js";


@ -1,29 +0,0 @@
import chalk from "chalk";
import { VERSION } from "../cli/version.js";
export function printCoreBrainLogo(): void {
const brain = `
o o o
o o---o---o o
o---o o o---o---o
o o---o---o---o o
o---o o o---o---o
o o---o---o o
o o o
`;
console.log(chalk.cyan(brain));
console.log(
chalk.bold.white(
` 🧠 CORE - Contextual Observation & Recall Engine ${VERSION ? chalk.gray(`(${VERSION})`) : ""}\n`
)
);
}


@ -1,24 +0,0 @@
import { z } from "zod";
const EnvironmentSchema = z.object({
// Version
VERSION: z.string().default("0.1.24"),
// Database
DB_HOST: z.string().default("localhost"),
DB_PORT: z.string().default("5432"),
TRIGGER_DB: z.string().default("trigger"),
POSTGRES_USER: z.string().default("docker"),
POSTGRES_PASSWORD: z.string().default("docker"),
// Trigger database
TRIGGER_TASKS_IMAGE: z.string().default("redplanethq/proj_core:latest"),
// Node environment
NODE_ENV: z
.union([z.literal("development"), z.literal("production"), z.literal("test")])
.default("development"),
});
export type Environment = z.infer<typeof EnvironmentSchema>;
export const env = EnvironmentSchema.parse(process.env);


@ -1,182 +0,0 @@
import Knex from "knex";
import path from "path";
import { fileURLToPath } from "url";
import { env } from "./env.js";
import { spinner, note, log } from "@clack/prompts";
const __filename = fileURLToPath(import.meta.url);
const __dirname = path.dirname(__filename);
/**
* Returns a PostgreSQL database URL for the given database name.
* Throws if required environment variables are missing.
*/
export function getDatabaseUrl(dbName: string): string {
const { POSTGRES_USER, POSTGRES_PASSWORD, DB_HOST, DB_PORT } = env;
if (!POSTGRES_USER || !POSTGRES_PASSWORD || !DB_HOST || !DB_PORT || !dbName) {
throw new Error(
"One or more required environment variables are missing: POSTGRES_USER, POSTGRES_PASSWORD, DB_HOST, DB_PORT, dbName"
);
}
return `postgresql://${POSTGRES_USER}:${POSTGRES_PASSWORD}@${DB_HOST}:${DB_PORT}/${dbName}`;
}
/**
* Checks if the database specified by TRIGGER_DB exists, and creates it if it does not.
* Returns { exists: boolean, created: boolean } - exists indicates success, created indicates if database was newly created.
*/
export async function ensureDatabaseExists(): Promise<{ exists: boolean; created: boolean }> {
const { TRIGGER_DB } = env;
if (!TRIGGER_DB) {
throw new Error("TRIGGER_DB environment variable is missing");
}
// Build a connection string to the default 'postgres' database
const adminDbUrl = getDatabaseUrl("postgres");
// Create a Knex instance for the admin connection
const adminKnex = Knex({
client: "pg",
connection: adminDbUrl,
});
const s = spinner();
s.start("Checking for Trigger.dev database...");
try {
// Check if the database exists
const result = await adminKnex.select(1).from("pg_database").where("datname", TRIGGER_DB);
if (result.length === 0) {
s.message("Database not found. Creating...");
// Database does not exist, create it
await adminKnex.raw(`CREATE DATABASE "${TRIGGER_DB}"`);
s.stop("Database created.");
return { exists: true, created: true };
} else {
s.stop("Database exists.");
return { exists: true, created: false };
}
} catch (err) {
s.stop("Failed to ensure database exists.");
log.warning("Failed to ensure database exists: " + (err as Error).message);
return { exists: false, created: false };
} finally {
await adminKnex.destroy();
}
}
// Main initialization function
export async function initTriggerDatabase() {
const { TRIGGER_DB } = env;
if (!TRIGGER_DB) {
throw new Error("TRIGGER_DB environment variable is missing");
}
// Ensure the database exists
const { exists, created } = await ensureDatabaseExists();
if (!exists) {
throw new Error("Failed to create or verify database exists");
}
// Only run pg_restore if the database was newly created
if (!created) {
note("Database already exists, skipping restore from trigger.dump");
return;
}
// Run pg_restore with the trigger.dump file
const dumpFilePath = path.join(__dirname, "../../../trigger.dump");
const connectionString = getDatabaseUrl(TRIGGER_DB);
const s = spinner();
s.start("Restoring database from trigger.dump...");
try {
// Use execSync and capture stdout/stderr, send to spinner.log
const { spawn } = await import("child_process");
await new Promise<void>((resolve, reject) => {
const child = spawn(
"pg_restore",
["--verbose", "--no-acl", "--no-owner", "-d", connectionString, dumpFilePath],
{ stdio: ["ignore", "pipe", "pipe"] }
);
child.stdout.on("data", (data) => {
s.message(data.toString());
});
child.stderr.on("data", (data) => {
s.message(data.toString());
});
child.on("close", (code) => {
if (code === 0) {
s.stop("Database restored successfully from trigger.dump");
resolve();
} else {
s.stop("Failed to restore database.");
log.warning(`Failed to restore database: pg_restore exited with code ${code}`);
reject(new Error(`Database restore failed: pg_restore exited with code ${code}`));
}
});
child.on("error", (err) => {
s.stop("Failed to restore database.");
log.warning("Failed to restore database: " + err.message);
reject(new Error(`Database restore failed: ${err.message}`));
});
});
} catch (error: any) {
s.stop("Failed to restore database.");
log.warning("Failed to restore database: " + error.message);
throw new Error(`Database restore failed: ${error.message}`);
}
}
export async function updateWorkerImage() {
const { TRIGGER_DB, TRIGGER_TASKS_IMAGE } = env;
if (!TRIGGER_DB) {
throw new Error("TRIGGER_DB environment variable is missing");
}
const connectionString = getDatabaseUrl(TRIGGER_DB);
const knex = Knex({
client: "pg",
connection: connectionString,
});
const s = spinner();
s.start("Updating worker image reference...");
try {
// Get the first record from WorkerDeployment table
const firstWorkerDeployment = await knex("WorkerDeployment").select("id").first();
if (!firstWorkerDeployment) {
s.stop("No WorkerDeployment records found, skipping image update");
note("No WorkerDeployment records found, skipping image update");
return;
}
// Update the imageReference column with the TRIGGER_TASKS_IMAGE value
await knex("WorkerDeployment").where("id", firstWorkerDeployment.id).update({
imageReference: TRIGGER_TASKS_IMAGE,
updatedAt: new Date(),
});
s.stop(`Successfully updated worker image reference to: ${TRIGGER_TASKS_IMAGE}`);
} catch (error: any) {
s.stop("Failed to update worker image.");
log.warning("Failed to update worker image: " + error.message);
throw new Error(`Worker image update failed: ${error.message}`);
} finally {
await knex.destroy();
}
}

Binary file not shown.


@ -1,40 +0,0 @@
{
"include": ["./src/**/*.ts"],
"exclude": ["./src/**/*.test.ts"],
"compilerOptions": {
"target": "es2022",
"lib": ["ES2022", "DOM", "DOM.Iterable", "DOM.AsyncIterable"],
"module": "NodeNext",
"moduleResolution": "NodeNext",
"moduleDetection": "force",
"verbatimModuleSyntax": false,
"jsx": "react",
"strict": true,
"alwaysStrict": true,
"strictPropertyInitialization": true,
"skipLibCheck": true,
"forceConsistentCasingInFileNames": true,
"noUnusedLocals": false,
"noUnusedParameters": false,
"noImplicitAny": true,
"noImplicitReturns": true,
"noImplicitThis": true,
"noFallthroughCasesInSwitch": true,
"resolveJsonModule": true,
"removeComments": false,
"esModuleInterop": true,
"emitDecoratorMetadata": false,
"experimentalDecorators": false,
"downlevelIteration": true,
"isolatedModules": true,
"noUncheckedIndexedAccess": true,
"pretty": true,
"isolatedDeclarations": false,
"composite": true,
"sourceMap": true
}
}


@ -1,8 +0,0 @@
import { configDefaults, defineConfig } from "vitest/config";
export default defineConfig({
test: {
globals: true,
exclude: [...configDefaults.exclude, "e2e/**/*"],
},
});


@ -92,3 +92,69 @@ export const sessionCompactionQueue = new Queue("session-compaction-queue", {
},
},
});
/**
* BERT topic analysis queue
* Handles CPU-intensive topic modeling on user episodes
*/
export const bertTopicQueue = new Queue("bert-topic-queue", {
connection: getRedisConnection(),
defaultJobOptions: {
attempts: 2, // Only 2 attempts due to long runtime
backoff: {
type: "exponential",
delay: 5000,
},
removeOnComplete: {
age: 7200, // Keep completed jobs for 2 hours
count: 100,
},
removeOnFail: {
age: 172800, // Keep failed jobs for 48 hours (for debugging)
},
},
});
/**
* Space assignment queue
* Handles assigning episodes to spaces based on semantic matching
*/
export const spaceAssignmentQueue = new Queue("space-assignment-queue", {
connection: getRedisConnection(),
defaultJobOptions: {
attempts: 3,
backoff: {
type: "exponential",
delay: 2000,
},
removeOnComplete: {
age: 3600,
count: 1000,
},
removeOnFail: {
age: 86400,
},
},
});
/**
* Space summary queue
* Handles generating summaries for spaces
*/
export const spaceSummaryQueue = new Queue("space-summary-queue", {
connection: getRedisConnection(),
defaultJobOptions: {
attempts: 3,
backoff: {
type: "exponential",
delay: 2000,
},
removeOnComplete: {
age: 3600,
count: 1000,
},
removeOnFail: {
age: 86400,
},
},
});
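For reference, jobs are pushed onto these queues with BullMQ's `Queue.add`. A minimal producer-side sketch for the space-summary queue follows; the import path, job name, and payload keys are assumptions for illustration only and are not taken from this diff (the real enqueue helpers are imported from `~/lib/queue-adapter.server`, which is not shown here):

```typescript
// Hypothetical sketch; import path, job name, and payload shape are assumed.
import { spaceSummaryQueue } from "~/lib/queue.server";

export async function enqueueSpaceSummaryExample(params: {
  spaceId: string;
  userId: string;
}) {
  await spaceSummaryQueue.add("space-summary", params, {
    // Reuse a deterministic jobId so only one pending summary job exists per space.
    jobId: `space-summary-${params.spaceId}`,
  });
}
```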


@ -66,7 +66,6 @@ export async function initWorkers(): Promise<void> {
queue: conversationTitleQueue,
name: "conversation-title",
},
{
worker: sessionCompactionWorker,
queue: sessionCompactionQueue,


@ -18,24 +18,39 @@ import {
processConversationTitleCreation,
type CreateConversationTitlePayload,
} from "~/jobs/conversation/create-title.logic";
import {
processSessionCompaction,
type SessionCompactionPayload,
} from "~/jobs/session/session-compaction.logic";
import {
processTopicAnalysis,
type TopicAnalysisPayload,
} from "~/jobs/bert/topic-analysis.logic";
import {
enqueueIngestEpisode,
enqueueSpaceAssignment,
enqueueSessionCompaction,
enqueueBertTopicAnalysis,
enqueueSpaceSummary,
} from "~/lib/queue-adapter.server";
import { logger } from "~/services/logger.service";
import {
processSpaceAssignment,
type SpaceAssignmentPayload,
} from "~/jobs/spaces/space-assignment.logic";
import {
processSpaceSummary,
type SpaceSummaryPayload,
} from "~/jobs/spaces/space-summary.logic";
/**
* Episode ingestion worker
* Processes individual episode ingestion jobs with per-user concurrency
* Processes individual episode ingestion jobs with global concurrency
*
* Note: Per-user concurrency is achieved by using userId as part of the jobId
* when adding jobs to the queue, ensuring only one job per user runs at a time
* Note: BullMQ uses a global concurrency limit (see the worker's concurrency option below).
* Trigger.dev uses per-user concurrency via concurrencyKey.
* For most open-source deployments, global concurrency is sufficient.
*/
export const ingestWorker = new Worker(
"ingest-queue",
@ -47,11 +62,12 @@ export const ingestWorker = new Worker(
// Callbacks to enqueue follow-up jobs
enqueueSpaceAssignment,
enqueueSessionCompaction,
enqueueBertTopicAnalysis,
);
},
{
connection: getRedisConnection(),
concurrency: 5, // Process up to 5 jobs in parallel
concurrency: 1, // Global limit: process one job at a time
},
);
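The earlier note (removed above) described achieving per-user concurrency by folding the userId into the jobId. As a rough illustration of that idea in BullMQ, independent of this diff, a producer could deduplicate ingest jobs per user as sketched below; the connection settings and payload keys are assumptions, only the queue name matches the worker above:

```typescript
import { Queue } from "bullmq";

// Hypothetical sketch only; connection settings and payload keys are assumed.
const ingestQueue = new Queue("ingest-queue", {
  connection: { host: "localhost", port: 6379 },
});

export async function enqueueIngestForUser(payload: {
  userId: string;
  episodeBody: string;
}) {
  await ingestQueue.add("ingest-episode", payload, {
    // BullMQ ignores an add() whose jobId already exists, so at most one
    // ingest job per user is queued at any given time.
    jobId: `ingest-${payload.userId}`,
  });
}
```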
@ -108,6 +124,65 @@ export const sessionCompactionWorker = new Worker(
},
);
/**
* BERT topic analysis worker
* Handles CPU-intensive topic modeling
*/
export const bertTopicWorker = new Worker(
"bert-topic-queue",
async (job) => {
const payload = job.data as TopicAnalysisPayload;
return await processTopicAnalysis(
payload,
// Callback to enqueue space summary
enqueueSpaceSummary,
);
},
{
connection: getRedisConnection(),
concurrency: 2, // Process up to 2 analyses in parallel (CPU-intensive)
},
);
/**
* Space assignment worker
* Handles assigning episodes to spaces based on semantic matching
*
* Note: Global concurrency of 1 ensures sequential processing.
* Trigger.dev uses per-user concurrency via concurrencyKey.
*/
export const spaceAssignmentWorker = new Worker(
"space-assignment-queue",
async (job) => {
const payload = job.data as SpaceAssignmentPayload;
return await processSpaceAssignment(
payload,
// Callback to enqueue space summary
enqueueSpaceSummary,
);
},
{
connection: getRedisConnection(),
concurrency: 1, // Global limit: process one job at a time
},
);
/**
* Space summary worker
* Handles generating summaries for spaces
*/
export const spaceSummaryWorker = new Worker(
"space-summary-queue",
async (job) => {
const payload = job.data as SpaceSummaryPayload;
return await processSpaceSummary(payload);
},
{
connection: getRedisConnection(),
concurrency: 1, // Process one space summary at a time
},
);
/**
* Graceful shutdown handler
*/
@ -116,8 +191,10 @@ export async function closeAllWorkers(): Promise<void> {
ingestWorker.close(),
documentIngestWorker.close(),
conversationTitleWorker.close(),
sessionCompactionWorker.close(),
bertTopicWorker.close(),
spaceSummaryWorker.close(),
spaceAssignmentWorker.close(),
]);
logger.log("All BullMQ workers closed");
}


@ -0,0 +1,250 @@
import { exec } from "child_process";
import { promisify } from "util";
import { identifySpacesForTopics } from "~/jobs/spaces/space-identification.logic";
import { assignEpisodesToSpace } from "~/services/graphModels/space";
import { logger } from "~/services/logger.service";
import { SpaceService } from "~/services/space.server";
import { prisma } from "~/trigger/utils/prisma";
const execAsync = promisify(exec);
export interface TopicAnalysisPayload {
userId: string;
workspaceId: string;
minTopicSize?: number;
nrTopics?: number;
}
export interface TopicAnalysisResult {
topics: {
[topicId: string]: {
keywords: string[];
episodeIds: string[];
};
};
}
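For illustration, a value matching this interface might look like the sketch below; every id and keyword is invented, and only the shape mirrors what the Python script is expected to print:

```typescript
// Hypothetical example output; all ids and keywords are made up.
const exampleResult: TopicAnalysisResult = {
  topics: {
    "0": {
      keywords: ["react", "hooks", "state", "component"],
      episodeIds: ["ep-101", "ep-102", "ep-117"],
    },
    "-1": {
      // "-1" is the outlier (noise) topic in BERTopic-style clustering.
      keywords: ["misc"],
      episodeIds: ["ep-090"],
    },
  },
};
```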
/**
* Run BERT analysis using exec (for BullMQ/Docker)
*/
async function runBertWithExec(
userId: string,
minTopicSize: number,
nrTopics?: number,
): Promise<string> {
let command = `python3 /core/apps/webapp/python/main.py ${userId} --json`;
if (minTopicSize) {
command += ` --min-topic-size ${minTopicSize}`;
}
if (nrTopics) {
command += ` --nr-topics ${nrTopics}`;
}
console.log(`[BERT Topic Analysis] Executing: ${command}`);
const { stdout, stderr } = await execAsync(command, {
timeout: 300000, // 5 minutes
maxBuffer: 10 * 1024 * 1024, // 10MB buffer for large outputs
});
if (stderr) {
console.warn(`[BERT Topic Analysis] Warnings:`, stderr);
}
return stdout;
}
/**
* Process BERT topic analysis on user's episodes
* This is the common logic shared between Trigger.dev and BullMQ
*
* NOTE: This function does NOT update workspace.metadata.lastTopicAnalysisAt
* That should be done by the caller BEFORE enqueueing this job to prevent
* duplicate analyses caused by race conditions.
*/
export async function processTopicAnalysis(
payload: TopicAnalysisPayload,
enqueueSpaceSummary?: (params: {
spaceId: string;
userId: string;
}) => Promise<any>,
pythonRunner?: (
userId: string,
minTopicSize: number,
nrTopics?: number,
) => Promise<string>,
): Promise<TopicAnalysisResult> {
const { userId, workspaceId, minTopicSize = 10, nrTopics } = payload;
console.log(`[BERT Topic Analysis] Starting analysis for user: ${userId}`);
console.log(
`[BERT Topic Analysis] Parameters: minTopicSize=${minTopicSize}, nrTopics=${nrTopics || "auto"}`,
);
try {
const startTime = Date.now();
// Run BERT analysis using provided runner or default exec
const runner = pythonRunner || runBertWithExec;
const stdout = await runner(userId, minTopicSize, nrTopics);
const duration = Date.now() - startTime;
console.log(`[BERT Topic Analysis] Completed in ${duration}ms`);
// Parse the JSON output
const result: TopicAnalysisResult = JSON.parse(stdout);
// Log summary
const topicCount = Object.keys(result.topics).length;
const totalEpisodes = Object.values(result.topics).reduce(
(sum, topic) => sum + topic.episodeIds.length,
0,
);
console.log(
`[BERT Topic Analysis] Found ${topicCount} topics covering ${totalEpisodes} episodes`,
);
// Step 2: Identify spaces for topics using LLM
try {
logger.info("[BERT Topic Analysis] Starting space identification", {
userId,
topicCount,
});
const spaceProposals = await identifySpacesForTopics({
userId,
topics: result.topics,
});
logger.info("[BERT Topic Analysis] Space identification completed", {
userId,
proposalCount: spaceProposals.length,
});
// Step 3: Create or find spaces and assign episodes
// Get existing spaces from PostgreSQL
const existingSpacesFromDb = await prisma.space.findMany({
where: { workspaceId },
});
const existingSpacesByName = new Map(
existingSpacesFromDb.map((s) => [s.name.toLowerCase(), s]),
);
for (const proposal of spaceProposals) {
try {
// Check if space already exists (case-insensitive match)
let spaceId: string;
const existingSpace = existingSpacesByName.get(
proposal.name.toLowerCase(),
);
if (existingSpace) {
// Use existing space
spaceId = existingSpace.id;
logger.info("[BERT Topic Analysis] Using existing space", {
spaceName: proposal.name,
spaceId,
});
} else {
// Create new space (creates in both PostgreSQL and Neo4j)
// Skip automatic space assignment since we're manually assigning from BERT topics
const spaceService = new SpaceService();
const newSpace = await spaceService.createSpace({
name: proposal.name,
description: proposal.intent,
userId,
workspaceId,
});
spaceId = newSpace.id;
logger.info("[BERT Topic Analysis] Created new space", {
spaceName: proposal.name,
spaceId,
intent: proposal.intent,
});
}
// Collect all episode IDs from the topics in this proposal
const episodeIds: string[] = [];
for (const topicId of proposal.topics) {
const topic = result.topics[topicId];
if (topic) {
episodeIds.push(...topic.episodeIds);
}
}
// Assign all episodes from these topics to the space
if (episodeIds.length > 0) {
await assignEpisodesToSpace(episodeIds, spaceId, userId);
logger.info("[BERT Topic Analysis] Assigned episodes to space", {
spaceName: proposal.name,
spaceId,
episodeCount: episodeIds.length,
topics: proposal.topics,
});
// Step 4: Trigger space summary if callback provided
if (enqueueSpaceSummary) {
await enqueueSpaceSummary({ spaceId, userId });
logger.info("[BERT Topic Analysis] Triggered space summary", {
spaceName: proposal.name,
spaceId,
});
}
}
} catch (spaceError) {
logger.error(
"[BERT Topic Analysis] Failed to process space proposal",
{
proposal,
error: spaceError,
},
);
// Continue with other proposals
}
}
} catch (spaceIdentificationError) {
logger.error(
"[BERT Topic Analysis] Space identification failed, returning topics only",
{
error: spaceIdentificationError,
},
);
// Return topics even if space identification fails
}
return result;
} catch (error) {
console.error(`[BERT Topic Analysis] Error:`, error);
if (error instanceof Error) {
// Check for timeout
if (error.message.includes("ETIMEDOUT")) {
throw new Error(
`Topic analysis timed out after 5 minutes. User may have too many episodes.`,
);
}
// Check for Python errors
if (error.message.includes("python3: not found")) {
throw new Error(`Python 3 is not installed or not available in PATH.`);
}
// Check for Neo4j connection errors
if (error.message.includes("Failed to connect to Neo4j")) {
throw new Error(
`Could not connect to Neo4j. Check NEO4J_URI and credentials.`,
);
}
// Check for no episodes
if (error.message.includes("No episodes found")) {
throw new Error(`No episodes found for userId: ${userId}`);
}
}
throw error;
}
}


@ -7,6 +7,10 @@ import { prisma } from "~/trigger/utils/prisma";
import { EpisodeType } from "@core/types";
import { deductCredits, hasCredits } from "~/trigger/utils/utils";
import { assignEpisodesToSpace } from "~/services/graphModels/space";
import {
shouldTriggerTopicAnalysis,
updateLastTopicAnalysisTime,
} from "~/services/bertTopicAnalysis.server";
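`shouldTriggerTopicAnalysis` and `updateLastTopicAnalysisTime` come from a module that is not shown in this diff. A rough sketch of what such helpers could look like, inferred only from how they are called below and from the topic-analysis job's doc comment; the Prisma model and field names here are assumptions:

```typescript
// Hypothetical sketch; the real implementation in
// ~/services/bertTopicAnalysis.server may differ substantially.
import { prisma } from "~/trigger/utils/prisma";

const NEW_EPISODE_THRESHOLD = 20; // "20+ new episodes", per the trigger below

export async function shouldTriggerTopicAnalysisSketch(
  userId: string,
  workspaceId: string,
): Promise<boolean> {
  const workspace = await prisma.workspace.findUnique({
    where: { id: workspaceId },
  });
  // Assumed metadata field, as referenced in the topic-analysis doc comment.
  const lastRun = (workspace?.metadata as any)?.lastTopicAnalysisAt;
  const since = lastRun ? new Date(lastRun) : new Date(0);
  // Assumed way of counting new episodes since the last analysis.
  const newEpisodes = await prisma.ingestionQueue.count({
    where: { userId, workspaceId, createdAt: { gt: since } },
  });
  return newEpisodes >= NEW_EPISODE_THRESHOLD;
}

export async function updateLastTopicAnalysisTimeSketch(
  workspaceId: string,
): Promise<void> {
  const workspace = await prisma.workspace.findUnique({
    where: { id: workspaceId },
  });
  await prisma.workspace.update({
    where: { id: workspaceId },
    data: {
      metadata: {
        ...((workspace?.metadata as object) ?? {}),
        lastTopicAnalysisAt: new Date().toISOString(),
      },
    },
  });
}
```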
export const IngestBodyRequest = z.object({
episodeBody: z.string(),
@ -55,6 +59,12 @@ export async function processEpisodeIngestion(
sessionId: string;
source: string;
}) => Promise<any>,
enqueueBertTopicAnalysis?: (params: {
userId: string;
workspaceId: string;
minTopicSize?: number;
nrTopics?: number;
}) => Promise<any>,
): Promise<IngestEpisodeResult> {
try {
logger.log(`Processing job for user ${payload.userId}`);
@ -250,6 +260,44 @@ export async function processEpisodeIngestion(
});
}
// Auto-trigger BERT topic analysis if threshold met (20+ new episodes)
try {
if (
currentStatus === IngestionStatus.COMPLETED &&
enqueueBertTopicAnalysis
) {
const shouldTrigger = await shouldTriggerTopicAnalysis(
payload.userId,
payload.workspaceId,
);
if (shouldTrigger) {
logger.info(
`Triggering BERT topic analysis after reaching 20+ new episodes`,
{
userId: payload.userId,
workspaceId: payload.workspaceId,
},
);
await enqueueBertTopicAnalysis({
userId: payload.userId,
workspaceId: payload.workspaceId,
minTopicSize: 10,
});
// Update the last analysis timestamp
await updateLastTopicAnalysisTime(payload.workspaceId);
}
}
} catch (topicAnalysisError) {
// Don't fail the ingestion if topic analysis fails
logger.warn(`Failed to trigger topic analysis after ingestion:`, {
error: topicAnalysisError,
userId: payload.userId,
});
}
return { success: true, episodeDetails };
} catch (err: any) {
await prisma.ingestionQueue.update({


@ -36,7 +36,7 @@ export interface SessionCompactionResult {
}
// Zod schema for LLM response validation
const CompactionResultSchema = z.object({
export const CompactionResultSchema = z.object({
summary: z.string().describe("Consolidated narrative of the entire session"),
confidence: z
.number()
@ -45,7 +45,7 @@ const CompactionResultSchema = z.object({
.describe("Confidence score of the compaction quality"),
});
const CONFIG = {
export const CONFIG = {
minEpisodesForCompaction: 5, // Minimum episodes to trigger compaction
compactionThreshold: 1, // Trigger after N new episodes
maxEpisodesPerBatch: 50, // Process in batches if needed

File diff suppressed because it is too large


@ -0,0 +1,229 @@
/**
* Space Identification Logic
*
* Uses LLM to identify appropriate spaces for topics discovered by BERT analysis
*/
import { makeModelCall } from "~/lib/model.server";
import { getAllSpacesForUser } from "~/services/graphModels/space";
import { getEpisode } from "~/services/graphModels/episode";
import { logger } from "~/services/logger.service";
import type { SpaceNode } from "@core/types";
export interface TopicData {
keywords: string[];
episodeIds: string[];
}
export interface SpaceProposal {
name: string;
intent: string;
confidence: number;
reason: string;
topics: string[]; // Array of topic IDs
}
interface IdentifySpacesParams {
userId: string;
topics: Record<string, TopicData>;
}
/**
* Identify spaces for topics using LLM analysis
* Takes top 10 keywords and top 5 episodes per topic
*/
export async function identifySpacesForTopics(
params: IdentifySpacesParams,
): Promise<SpaceProposal[]> {
const { userId, topics } = params;
// Get existing spaces for the user
const existingSpaces = await getAllSpacesForUser(userId);
// Prepare topic data with top 10 keywords and top 5 episodes
const topicsForAnalysis = await Promise.all(
Object.entries(topics).map(async ([topicId, topicData]) => {
// Take top 10 keywords
const topKeywords = topicData.keywords.slice(0, 10);
// Take top 5 episodes and fetch their content
const topEpisodeIds = topicData.episodeIds.slice(0, 5);
const episodes = await Promise.all(
topEpisodeIds.map((id) => getEpisode(id)),
);
return {
topicId,
keywords: topKeywords,
episodes: episodes
.filter((e) => e !== null)
.map((e) => ({
content: e!.content.substring(0, 500), // Limit to 500 chars per episode
})),
episodeCount: topicData.episodeIds.length,
};
}),
);
// Build the prompt
const prompt = buildSpaceIdentificationPrompt(
existingSpaces,
topicsForAnalysis,
);
logger.info("Identifying spaces for topics", {
userId,
topicCount: Object.keys(topics).length,
existingSpaceCount: existingSpaces.length,
});
// Call LLM with structured output
let responseText = "";
await makeModelCall(
false, // not streaming
[{ role: "user", content: prompt }],
(text) => {
responseText = text;
},
{
temperature: 0.7,
},
"high", // Use high complexity for space identification
);
// Parse the response
const proposals = parseSpaceProposals(responseText);
logger.info("Space identification completed", {
userId,
proposalCount: proposals.length,
});
return proposals;
}
/**
* Build the prompt for space identification
*/
function buildSpaceIdentificationPrompt(
existingSpaces: SpaceNode[],
topics: Array<{
topicId: string;
keywords: string[];
episodes: Array<{ content: string }>;
episodeCount: number;
}>,
): string {
const existingSpacesSection =
existingSpaces.length > 0
? `## Existing Spaces
The user currently has these spaces:
${existingSpaces.map((s) => `- **${s.name}**: ${s.description || "No description"} (${s.contextCount || 0} episodes)`).join("\n")}
When identifying new spaces, consider if topics fit into existing spaces or if new spaces are needed.`
: `## Existing Spaces
The user currently has no spaces defined. This is a fresh start for space organization.`;
const topicsSection = `## Topics Discovered
BERT topic modeling has identified ${topics.length} distinct topics from the user's episodes. Each topic represents a cluster of semantically related content.
${topics
.map(
(t, idx) => `### Topic ${idx + 1} (ID: ${t.topicId})
**Episode Count**: ${t.episodeCount}
**Top Keywords**: ${t.keywords.join(", ")}
**Sample Episodes** (showing ${t.episodes.length} of ${t.episodeCount}):
${t.episodes.map((e, i) => `${i + 1}. ${e.content}`).join("\n")}
`,
)
.join("\n")}`;
return `You are a knowledge organization expert. Your task is to analyze discovered topics and identify appropriate "spaces" (thematic containers) for organizing episodic memories.
${existingSpacesSection}
${topicsSection}
## Task
Analyze the topics above and identify spaces that would help organize this content meaningfully. For each space:
1. **Consider existing spaces first**: If topics clearly belong to existing spaces, assign them there
2. **Create new spaces when needed**: If topics represent distinct themes not covered by existing spaces
3. **Group related topics**: Multiple topics can be assigned to the same space if they share a theme
4. **Aim for 20-50 episodes per space**: This is the sweet spot for space cohesion
5. **Focus on user intent**: What would help the user find and understand this content later?
## Output Format
Return your analysis as a JSON array of space proposals. Each proposal should have:
\`\`\`json
[
{
"name": "Space name (use existing space name if assigning to existing space)",
"intent": "Clear description of what this space represents",
"confidence": 0.85,
"reason": "Brief explanation of why these topics belong together",
"topics": ["topic-id-1", "topic-id-2"]
}
]
\`\`\`
**Important Guidelines**:
- **confidence**: 0.0-1.0 scale indicating how confident you are this is a good grouping
- **topics**: Array of topic IDs (use the exact IDs from above like "0", "1", "-1", etc.)
- **name**: For existing spaces, use the EXACT name. For new spaces, create a clear, concise name
- Only propose spaces with confidence >= 0.6
- Each topic should only appear in ONE space proposal
- Topic "-1" is the outlier topic (noise) - only include if it genuinely fits a theme
Return ONLY the JSON array, no additional text.`;
}
/**
* Parse space proposals from LLM response
*/
function parseSpaceProposals(responseText: string): SpaceProposal[] {
try {
// Extract JSON from markdown code blocks if present
const jsonMatch = responseText.match(/```(?:json)?\s*(\[[\s\S]*?\])\s*```/);
const jsonText = jsonMatch ? jsonMatch[1] : responseText;
const proposals = JSON.parse(jsonText.trim());
if (!Array.isArray(proposals)) {
throw new Error("Response is not an array");
}
// Validate and filter proposals
return proposals
.filter((p) => {
return (
p.name &&
p.intent &&
typeof p.confidence === "number" &&
p.confidence >= 0.6 &&
Array.isArray(p.topics) &&
p.topics.length > 0
);
})
.map((p) => ({
name: p.name.trim(),
intent: p.intent.trim(),
confidence: p.confidence,
reason: (p.reason || "").trim(),
topics: p.topics.map((t: any) => String(t)),
}));
} catch (error) {
logger.error("Failed to parse space proposals", {
error,
responseText: responseText.substring(0, 500),
});
return [];
}
}


@ -0,0 +1,721 @@
import { logger } from "~/services/logger.service";
import { SpaceService } from "~/services/space.server";
import { makeModelCall } from "~/lib/model.server";
import { runQuery } from "~/lib/neo4j.server";
import { updateSpaceStatus, SPACE_STATUS } from "~/trigger/utils/space-status";
import type { CoreMessage } from "ai";
import { z } from "zod";
import { getSpace, updateSpace } from "~/trigger/utils/space-utils";
import { getSpaceEpisodeCount } from "~/services/graphModels/space";
export interface SpaceSummaryPayload {
userId: string;
spaceId: string; // Single space only
triggerSource?: "assignment" | "manual" | "scheduled";
}
interface SpaceEpisodeData {
uuid: string;
content: string;
originalContent: string;
source: string;
createdAt: Date;
validAt: Date;
metadata: any;
sessionId: string | null;
}
interface SpaceSummaryData {
spaceId: string;
spaceName: string;
spaceDescription?: string;
contextCount: number;
summary: string;
keyEntities: string[];
themes: string[];
confidence: number;
lastUpdated: Date;
isIncremental: boolean;
}
// Zod schema for LLM response validation
const SummaryResultSchema = z.object({
summary: z.string(),
keyEntities: z.array(z.string()),
themes: z.array(z.string()),
confidence: z.number().min(0).max(1),
});
const CONFIG = {
maxEpisodesForSummary: 20, // Limit episodes for performance
minEpisodesForSummary: 1, // Minimum episodes to generate summary
summaryEpisodeThreshold: 5, // Minimum new episodes required to trigger summary (configurable)
};
export interface SpaceSummaryResult {
success: boolean;
spaceId: string;
triggerSource: string;
summary?: {
statementCount: number;
confidence: number;
themesCount: number;
} | null;
reason?: string;
}
/**
* Core business logic for space summary generation
* This is shared between Trigger.dev and BullMQ implementations
*/
export async function processSpaceSummary(
payload: SpaceSummaryPayload,
): Promise<SpaceSummaryResult> {
const { userId, spaceId, triggerSource = "manual" } = payload;
logger.info(`Starting space summary generation`, {
userId,
spaceId,
triggerSource,
});
try {
// Update status to processing
await updateSpaceStatus(spaceId, SPACE_STATUS.PROCESSING, {
userId,
operation: "space-summary",
metadata: { triggerSource, phase: "start_summary" },
});
// Generate summary for the single space
const summaryResult = await generateSpaceSummary(
spaceId,
userId,
triggerSource,
);
if (summaryResult) {
// Store the summary
await storeSummary(summaryResult);
// Update status to ready after successful completion
await updateSpaceStatus(spaceId, SPACE_STATUS.READY, {
userId,
operation: "space-summary",
metadata: {
triggerSource,
phase: "completed_summary",
contextCount: summaryResult.contextCount,
confidence: summaryResult.confidence,
},
});
logger.info(`Generated summary for space ${spaceId}`, {
statementCount: summaryResult.contextCount,
confidence: summaryResult.confidence,
themes: summaryResult.themes.length,
triggerSource,
});
return {
success: true,
spaceId,
triggerSource,
summary: {
statementCount: summaryResult.contextCount,
confidence: summaryResult.confidence,
themesCount: summaryResult.themes.length,
},
};
} else {
// No summary generated - this could be due to insufficient episodes or no new episodes
// This is not an error state, so update status to ready
await updateSpaceStatus(spaceId, SPACE_STATUS.READY, {
userId,
operation: "space-summary",
metadata: {
triggerSource,
phase: "no_summary_needed",
reason: "Insufficient episodes or no new episodes to summarize",
},
});
logger.info(
`No summary generated for space ${spaceId} - insufficient or no new episodes`,
);
return {
success: true,
spaceId,
triggerSource,
summary: null,
reason: "No episodes to summarize",
};
}
} catch (error) {
// Update status to error on exception
try {
await updateSpaceStatus(spaceId, SPACE_STATUS.ERROR, {
userId,
operation: "space-summary",
metadata: {
triggerSource,
phase: "exception",
error: error instanceof Error ? error.message : "Unknown error",
},
});
} catch (statusError) {
logger.warn(`Failed to update status to error for space ${spaceId}`, {
statusError,
});
}
logger.error(
`Error in space summary generation for space ${spaceId}:`,
error as Record<string, unknown>,
);
throw error;
}
}
async function generateSpaceSummary(
spaceId: string,
userId: string,
triggerSource?: "assignment" | "manual" | "scheduled",
): Promise<SpaceSummaryData | null> {
try {
// 1. Get space details
const spaceService = new SpaceService();
const space = await spaceService.getSpace(spaceId, userId);
if (!space) {
logger.warn(`Space ${spaceId} not found for user ${userId}`);
return null;
}
// 2. Check episode count threshold (skip for manual triggers)
if (triggerSource !== "manual") {
const currentEpisodeCount = await getSpaceEpisodeCount(spaceId, userId);
const lastSummaryEpisodeCount = space.contextCount || 0;
const episodeDifference = currentEpisodeCount - lastSummaryEpisodeCount;
if (
episodeDifference < CONFIG.summaryEpisodeThreshold ||
lastSummaryEpisodeCount !== 0
) {
logger.info(
`Skipping summary generation for space ${spaceId}: only ${episodeDifference} new episodes (threshold: ${CONFIG.summaryEpisodeThreshold})`,
{
currentEpisodeCount,
lastSummaryEpisodeCount,
episodeDifference,
threshold: CONFIG.summaryEpisodeThreshold,
},
);
return null;
}
logger.info(
`Proceeding with summary generation for space ${spaceId}: ${episodeDifference} new episodes (threshold: ${CONFIG.summaryEpisodeThreshold})`,
{
currentEpisodeCount,
lastSummaryEpisodeCount,
episodeDifference,
},
);
}
// 2. Check for existing summary
const existingSummary = await getExistingSummary(spaceId);
const isIncremental = existingSummary !== null;
// 3. Get episodes (all or new ones based on existing summary)
const episodes = await getSpaceEpisodes(
spaceId,
userId,
isIncremental ? existingSummary?.lastUpdated : undefined,
);
// Handle case where no new episodes exist for incremental update
if (isIncremental && episodes.length === 0) {
logger.info(
`No new episodes found for space ${spaceId}, skipping summary update`,
);
return null;
}
// Check minimum episode requirement for new summaries only
if (!isIncremental && episodes.length < CONFIG.minEpisodesForSummary) {
logger.info(
`Space ${spaceId} has insufficient episodes (${episodes.length}) for new summary`,
);
return null;
}
// 4. Process episodes using unified approach
let summaryResult;
if (episodes.length > CONFIG.maxEpisodesForSummary) {
logger.info(
`Large space detected (${episodes.length} episodes). Processing in batches.`,
);
// Process in batches, each building on previous result
const batches: SpaceEpisodeData[][] = [];
for (let i = 0; i < episodes.length; i += CONFIG.maxEpisodesForSummary) {
batches.push(episodes.slice(i, i + CONFIG.maxEpisodesForSummary));
}
let currentSummary = existingSummary?.summary || null;
let currentThemes = existingSummary?.themes || [];
let cumulativeConfidence = 0;
for (const [batchIndex, batch] of batches.entries()) {
logger.info(
`Processing batch ${batchIndex + 1}/${batches.length} with ${batch.length} episodes`,
);
const batchResult = await generateUnifiedSummary(
space.name,
space.description as string,
batch,
currentSummary,
currentThemes,
);
if (batchResult) {
currentSummary = batchResult.summary;
currentThemes = batchResult.themes;
cumulativeConfidence += batchResult.confidence;
} else {
logger.warn(`Failed to process batch ${batchIndex + 1}`);
}
// Small delay between batches
if (batchIndex < batches.length - 1) {
await new Promise((resolve) => setTimeout(resolve, 500));
}
}
summaryResult = currentSummary
? {
summary: currentSummary,
themes: currentThemes,
confidence: Math.min(cumulativeConfidence / batches.length, 1.0),
}
: null;
} else {
logger.info(
`Processing ${episodes.length} episodes with unified approach`,
);
// Use unified approach for smaller spaces
summaryResult = await generateUnifiedSummary(
space.name,
space.description as string,
episodes,
existingSummary?.summary || null,
existingSummary?.themes || [],
);
}
if (!summaryResult) {
logger.warn(`Failed to generate LLM summary for space ${spaceId}`);
return null;
}
// Get the actual current counts from Neo4j
const currentEpisodeCount = await getSpaceEpisodeCount(spaceId, userId);
return {
spaceId: space.uuid,
spaceName: space.name,
spaceDescription: space.description as string,
contextCount: currentEpisodeCount,
summary: summaryResult.summary,
keyEntities: summaryResult.keyEntities || [],
themes: summaryResult.themes,
confidence: summaryResult.confidence,
lastUpdated: new Date(),
isIncremental,
};
} catch (error) {
logger.error(
`Error generating summary for space ${spaceId}:`,
error as Record<string, unknown>,
);
return null;
}
}
async function generateUnifiedSummary(
spaceName: string,
spaceDescription: string | undefined,
episodes: SpaceEpisodeData[],
previousSummary: string | null = null,
previousThemes: string[] = [],
): Promise<{
summary: string;
themes: string[];
confidence: number;
keyEntities?: string[];
} | null> {
try {
const prompt = createUnifiedSummaryPrompt(
spaceName,
spaceDescription,
episodes,
previousSummary,
previousThemes,
);
// Space summary generation requires HIGH complexity (creative synthesis, narrative generation)
let responseText = "";
await makeModelCall(
false,
prompt,
(text: string) => {
responseText = text;
},
undefined,
"high",
);
return parseSummaryResponse(responseText);
} catch (error) {
logger.error(
"Error generating unified summary:",
error as Record<string, unknown>,
);
return null;
}
}
function createUnifiedSummaryPrompt(
spaceName: string,
spaceDescription: string | undefined,
episodes: SpaceEpisodeData[],
previousSummary: string | null,
previousThemes: string[],
): CoreMessage[] {
// If there are no episodes and no previous summary, we cannot generate a meaningful summary
if (episodes.length === 0 && previousSummary === null) {
throw new Error(
"Cannot generate summary without episodes or existing summary",
);
}
const episodesText = episodes
.map(
(episode) =>
`- ${episode.content} (Source: ${episode.source}, Session: ${episode.sessionId || "N/A"})`,
)
.join("\n");
// Extract key entities and themes from episode content
const contentWords = episodes
.map((ep) => ep.content.toLowerCase())
.join(" ")
.split(/\s+/)
.filter((word) => word.length > 3);
const wordFrequency = new Map<string, number>();
contentWords.forEach((word) => {
wordFrequency.set(word, (wordFrequency.get(word) || 0) + 1);
});
const topEntities = Array.from(wordFrequency.entries())
.sort(([, a], [, b]) => b - a)
.slice(0, 10)
.map(([word]) => word);
const isUpdate = previousSummary !== null;
return [
{
role: "system",
content: `You are an expert at analyzing and summarizing episodes within semantic spaces based on the space's intent and purpose. Your task is to ${isUpdate ? "update an existing summary by integrating new episodes" : "create a comprehensive summary of episodes"}.
CRITICAL RULES:
1. Base your summary ONLY on insights derived from the actual content/episodes provided
2. Use the space's INTENT/PURPOSE (from description) to guide what to summarize and how to organize it
3. Write in a factual, neutral tone - avoid promotional language ("pivotal", "invaluable", "cutting-edge")
4. Be specific and concrete - reference actual content, patterns, and insights found in the episodes
5. If episodes are insufficient for meaningful insights, state that more data is needed
INTENT-DRIVEN SUMMARIZATION:
Your summary should SERVE the space's intended purpose. Examples:
- "Learning React" Summarize React concepts, patterns, techniques learned
- "Project X Updates" Summarize progress, decisions, blockers, next steps
- "Health Tracking" Summarize metrics, trends, observations, insights
- "Guidelines for React" Extract actionable patterns, best practices, rules
- "Evolution of design thinking" Track how thinking changed over time, decision points
The intent defines WHY this space exists - organize content to serve that purpose.
INSTRUCTIONS:
${
isUpdate
? `1. Review the existing summary and themes carefully
2. Analyze the new episodes for patterns and insights that align with the space's intent
3. Identify connecting points between existing knowledge and new episodes
4. Update the summary to seamlessly integrate new information while preserving valuable existing insights
5. Evolve themes by adding new ones or refining existing ones based on the space's purpose
6. Organize the summary to serve the space's intended use case`
: `1. Analyze the semantic content and relationships within the episodes
2. Identify topics/sections that align with the space's INTENT and PURPOSE
3. Create a coherent summary that serves the space's intended use case
4. Organize the summary based on the space's purpose (not generic frequency-based themes)`
}
${isUpdate ? "7" : "5"}. Assess your confidence in the ${isUpdate ? "updated" : ""} summary quality (0.0-1.0)
INTENT-ALIGNED ORGANIZATION:
- Organize sections based on what serves the space's purpose
- Topics don't need minimum episode counts - relevance to intent matters most
- Each section should provide value aligned with the space's intended use
- For "guidelines" spaces: focus on actionable patterns
- For "tracking" spaces: focus on temporal patterns and changes
- For "learning" spaces: focus on concepts and insights gained
- Let the space's intent drive the structure, not rigid rules
${
isUpdate
? `CONNECTION FOCUS:
- Entity relationships that span across batches/time
- Theme evolution and expansion
- Temporal patterns and progressions
- Contradictions or confirmations of existing insights
- New insights that complement existing knowledge`
: ""
}
RESPONSE FORMAT:
Provide your response inside <output></output> tags with valid JSON. Include both HTML summary and markdown format.
<output>
{
"summary": "${isUpdate ? "Updated HTML summary that integrates new insights with existing knowledge. Write factually about what the statements reveal - mention specific entities, relationships, and patterns found in the data. Avoid marketing language. Use HTML tags for structure." : "Factual HTML summary based on patterns found in the statements. Report what the data actually shows - specific entities, relationships, frequencies, and concrete insights. Avoid promotional language. Use HTML tags like <p>, <strong>, <ul>, <li> for structure. Keep it concise and evidence-based."}",
"keyEntities": ["entity1", "entity2", "entity3"],
"themes": ["${isUpdate ? 'updated_theme1", "new_theme2", "evolved_theme3' : 'theme1", "theme2", "theme3'}"],
"confidence": 0.85
}
</output>
JSON FORMATTING RULES:
- HTML content in summary field is allowed and encouraged
- Escape quotes within strings as \"
- Escape HTML angle brackets if needed: &lt; and &gt;
- Use proper HTML tags for structure: <p>, <strong>, <em>, <ul>, <li>, <h3>, etc.
- HTML content should be well-formed and semantic
GUIDELINES:
${
isUpdate
? `- Preserve valuable insights from existing summary
- Integrate new information by highlighting connections
- Themes should evolve naturally, don't replace wholesale
- The updated summary should read as a coherent whole
- Make the summary user-friendly and explain what value this space provides`
: `- Report only what the episodes actually reveal - be specific and concrete
- Cite actual content and patterns found in the episodes
- Avoid generic descriptions that could apply to any space
- Use neutral, factual language - no "comprehensive", "robust", "cutting-edge" etc.
- Themes must be backed by at least 3 supporting episodes with clear evidence
- Better to have fewer, well-supported themes than many weak ones
- Confidence should reflect actual data quality and coverage, not aspirational goals`
}`,
},
{
role: "user",
content: `SPACE INFORMATION:
Name: "${spaceName}"
Intent/Purpose: ${spaceDescription || "No specific intent provided - organize naturally based on content"}
${
isUpdate
? `EXISTING SUMMARY:
${previousSummary}
EXISTING THEMES:
${previousThemes.join(", ")}
NEW EPISODES TO INTEGRATE (${episodes.length} episodes):`
: `EPISODES IN THIS SPACE (${episodes.length} episodes):`
}
${episodesText}
${
episodes.length > 0
? `TOP WORDS BY FREQUENCY:
${topEntities.join(", ")}`
: ""
}
${
isUpdate
? "Please identify connections between the existing summary and new episodes, then update the summary to integrate the new insights coherently. Organize the summary to SERVE the space's intent/purpose. Remember: only summarize insights from the actual episode content."
: "Please analyze the episodes and provide a comprehensive summary that SERVES the space's intent/purpose. Organize sections based on what would be most valuable for this space's intended use case. If the intent is unclear, organize naturally based on content patterns. Only summarize insights from actual episode content."
}`,
},
];
}
async function getExistingSummary(spaceId: string): Promise<{
summary: string;
themes: string[];
lastUpdated: Date;
contextCount: number;
} | null> {
try {
const existingSummary = await getSpace(spaceId);
if (existingSummary?.summary) {
return {
summary: existingSummary.summary,
themes: existingSummary.themes,
lastUpdated: existingSummary.summaryGeneratedAt || new Date(),
contextCount: existingSummary.contextCount || 0,
};
}
return null;
} catch (error) {
logger.warn(`Failed to get existing summary for space ${spaceId}:`, {
error,
});
return null;
}
}
async function getSpaceEpisodes(
spaceId: string,
userId: string,
sinceDate?: Date,
): Promise<SpaceEpisodeData[]> {
// Query episodes directly using Space-[:HAS_EPISODE]->Episode relationships
const params: any = { spaceId, userId };
let dateCondition = "";
if (sinceDate) {
dateCondition = "AND e.createdAt > $sinceDate";
params.sinceDate = sinceDate.toISOString();
}
const query = `
MATCH (space:Space {uuid: $spaceId, userId: $userId})-[:HAS_EPISODE]->(e:Episode {userId: $userId})
WHERE e IS NOT NULL ${dateCondition}
RETURN DISTINCT e
ORDER BY e.createdAt DESC
`;
const result = await runQuery(query, params);
return result.map((record) => {
const episode = record.get("e").properties;
return {
uuid: episode.uuid,
content: episode.content,
originalContent: episode.originalContent,
source: episode.source,
createdAt: new Date(episode.createdAt),
validAt: new Date(episode.validAt),
metadata: JSON.parse(episode.metadata || "{}"),
sessionId: episode.sessionId,
};
});
}
function parseSummaryResponse(response: string): {
summary: string;
themes: string[];
confidence: number;
keyEntities?: string[];
} | null {
try {
// Extract content from <output> tags
const outputMatch = response.match(/<output>([\s\S]*?)<\/output>/);
if (!outputMatch) {
logger.warn("No <output> tags found in LLM summary response");
logger.debug("Full LLM response:", { response });
return null;
}
let jsonContent = outputMatch[1].trim();
let parsed;
try {
parsed = JSON.parse(jsonContent);
} catch (jsonError) {
logger.warn("JSON parsing failed, attempting cleanup and retry", {
originalError: jsonError,
jsonContent: jsonContent.substring(0, 500) + "...", // Log first 500 chars
});
// More aggressive cleanup for malformed JSON
jsonContent = jsonContent
.replace(/([^\\])"/g, '$1\\"') // Escape unescaped quotes
.replace(/^"/g, '\\"') // Escape quotes at start
.replace(/\\\\"/g, '\\"'); // Fix double-escaped quotes
parsed = JSON.parse(jsonContent);
}
// Validate the response structure
const validationResult = SummaryResultSchema.safeParse(parsed);
if (!validationResult.success) {
logger.warn("Invalid LLM summary response format:", {
error: validationResult.error,
parsedData: parsed,
});
return null;
}
return validationResult.data;
} catch (error) {
logger.error(
"Error parsing LLM summary response:",
error as Record<string, unknown>,
);
logger.debug("Failed response content:", { response });
return null;
}
}
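// Illustrative only (not part of this commit): a hypothetical helper showing the kind of model
// output this parser accepts. Field names mirror SummaryResultSchema; the values are made up.
export function exampleParseSummaryResponse() {
  const sampleResponse = `<output>
{
  "summary": "<p>Episodes cover <strong>React hooks</strong> and data-fetching patterns.</p>",
  "keyEntities": ["react", "hooks"],
  "themes": ["frontend patterns"],
  "confidence": 0.8
}
</output>`;
  // Returns the validated object on success, or null when the <output> tags are missing,
  // the JSON is malformed, or schema validation fails.
  return parseSummaryResponse(sampleResponse);
}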
async function storeSummary(summaryData: SpaceSummaryData): Promise<void> {
try {
// Store in PostgreSQL for API access and persistence
await updateSpace(summaryData);
// Also store in Neo4j for graph-based queries
const query = `
MATCH (space:Space {uuid: $spaceId})
SET space.summary = $summary,
space.keyEntities = $keyEntities,
space.themes = $themes,
space.summaryConfidence = $confidence,
space.summaryContextCount = $contextCount,
space.summaryLastUpdated = datetime($lastUpdated)
RETURN space
`;
await runQuery(query, {
spaceId: summaryData.spaceId,
summary: summaryData.summary,
keyEntities: summaryData.keyEntities,
themes: summaryData.themes,
confidence: summaryData.confidence,
contextCount: summaryData.contextCount,
lastUpdated: summaryData.lastUpdated.toISOString(),
});
logger.info(`Stored summary for space ${summaryData.spaceId}`, {
themes: summaryData.themes.length,
keyEntities: summaryData.keyEntities.length,
confidence: summaryData.confidence,
});
} catch (error) {
logger.error(
`Error storing summary for space ${summaryData.spaceId}:`,
error as Record<string, unknown>,
);
throw error;
}
}

View File

@ -15,7 +15,8 @@ import type { z } from "zod";
import type { IngestBodyRequest } from "~/jobs/ingest/ingest-episode.logic";
import type { CreateConversationTitlePayload } from "~/jobs/conversation/create-title.logic";
import type { SessionCompactionPayload } from "~/jobs/session/session-compaction.logic";
import { type SpaceAssignmentPayload } from "~/trigger/spaces/space-assignment";
import type { SpaceAssignmentPayload } from "~/jobs/spaces/space-assignment.logic";
import type { SpaceSummaryPayload } from "~/jobs/spaces/space-summary.logic";
type QueueProvider = "trigger" | "bullmq";
@ -144,22 +145,86 @@ export async function enqueueSessionCompaction(
/**
* Enqueue space assignment job
* (Helper for common job logic to call)
*/
export async function enqueueSpaceAssignment(
payload: SpaceAssignmentPayload,
): Promise<void> {
): Promise<{ id?: string }> {
const provider = env.QUEUE_PROVIDER as QueueProvider;
if (provider === "trigger") {
const { triggerSpaceAssignment } = await import(
"~/trigger/spaces/space-assignment"
);
await triggerSpaceAssignment(payload);
const handler = await triggerSpaceAssignment(payload);
return { id: handler.id };
} else {
// For BullMQ, space assignment is not implemented yet
// You can add it later when needed
console.warn("Space assignment not implemented for BullMQ yet");
// BullMQ
const { spaceAssignmentQueue } = await import("~/bullmq/queues");
const job = await spaceAssignmentQueue.add("space-assignment", payload, {
jobId: `space-assignment-${payload.userId}-${payload.mode}-${Date.now()}`,
attempts: 3,
backoff: { type: "exponential", delay: 2000 },
});
return { id: job.id };
}
}
/**
* Enqueue space summary job
*/
export async function enqueueSpaceSummary(
payload: SpaceSummaryPayload,
): Promise<{ id?: string }> {
const provider = env.QUEUE_PROVIDER as QueueProvider;
if (provider === "trigger") {
const { triggerSpaceSummary } = await import(
"~/trigger/spaces/space-summary"
);
const handler = await triggerSpaceSummary(payload);
return { id: handler.id };
} else {
// BullMQ
const { spaceSummaryQueue } = await import("~/bullmq/queues");
const job = await spaceSummaryQueue.add("space-summary", payload, {
jobId: `space-summary-${payload.spaceId}-${Date.now()}`,
attempts: 3,
backoff: { type: "exponential", delay: 2000 },
});
return { id: job.id };
}
}
/**
* Enqueue BERT topic analysis job
*/
export async function enqueueBertTopicAnalysis(payload: {
userId: string;
workspaceId: string;
minTopicSize?: number;
nrTopics?: number;
}): Promise<{ id?: string }> {
const provider = env.QUEUE_PROVIDER as QueueProvider;
if (provider === "trigger") {
const { bertTopicAnalysisTask } = await import(
"~/trigger/bert/topic-analysis"
);
const handler = await bertTopicAnalysisTask.trigger(payload, {
queue: "bert-topic-analysis",
concurrencyKey: payload.userId,
tags: [payload.userId, "bert-analysis"],
});
return { id: handler.id };
} else {
// BullMQ
const { bertTopicQueue } = await import("~/bullmq/queues");
const job = await bertTopicQueue.add("topic-analysis", payload, {
jobId: `bert-${payload.userId}-${Date.now()}`,
attempts: 2, // Only 2 attempts for expensive operations
backoff: { type: "exponential", delay: 5000 },
});
return { id: job.id };
}
}
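// Rough usage sketch (illustrative, not part of this commit): a hypothetical caller that enqueues
// both jobs. Payload fields follow the payload types referenced above; the ids are placeholders.
// Each helper routes to Trigger.dev or BullMQ based on QUEUE_PROVIDER and returns { id?: string }.
export async function exampleEnqueueSpaceJobs(): Promise<void> {
  const summaryJob = await enqueueSpaceSummary({
    userId: "user_123", // placeholder
    workspaceId: "ws_456", // placeholder
    spaceId: "space_789", // placeholder
    triggerSource: "manual",
  });

  const bertJob = await enqueueBertTopicAnalysis({
    userId: "user_123",
    workspaceId: "ws_456",
    minTopicSize: 10, // optional tuning knob
  });

  console.log("queued jobs", { summaryId: summaryJob.id, bertId: bertJob.id });
}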

View File

@ -29,12 +29,6 @@ Exclude:
Anything not explicitly consented to share
don't store anything the user did not explicitly consent to share.`;
const githubDescription = `Everything related to my GitHub work - repos I'm working on, projects I contribute to, code I'm writing, PRs I'm reviewing. Basically my coding life on GitHub.`;
const healthDescription = `My health and wellness stuff - how I'm feeling, what I'm learning about my body, experiments I'm trying, patterns I notice. Whatever matters to me about staying healthy.`;
const fitnessDescription = `My workouts and training - what I'm doing at the gym, runs I'm going on, progress I'm making, goals I'm chasing. Anything related to physical exercise and getting stronger.`;
export async function createWorkspace(
input: CreateWorkspaceDto,
): Promise<Workspace> {
@ -56,32 +50,7 @@ export async function createWorkspace(
await ensureBillingInitialized(workspace.id);
// Create default spaces
await Promise.all([
spaceService.createSpace({
name: "Profile",
description: profileRule,
userId: input.userId,
workspaceId: workspace.id,
}),
spaceService.createSpace({
name: "GitHub",
description: githubDescription,
userId: input.userId,
workspaceId: workspace.id,
}),
spaceService.createSpace({
name: "Health",
description: healthDescription,
userId: input.userId,
workspaceId: workspace.id,
}),
spaceService.createSpace({
name: "Fitness",
description: fitnessDescription,
userId: input.userId,
workspaceId: workspace.id,
}),
]);
await Promise.all([]);
try {
const response = await sendEmail({ email: "welcome", to: user.email });

View File

@ -19,7 +19,10 @@ import {
import { getModel } from "~/lib/model.server";
import { UserTypeEnum } from "@core/types";
import { nanoid } from "nanoid";
import { getOrCreatePersonalAccessToken } from "~/services/personalAccessToken.server";
import {
deletePersonalAccessToken,
getOrCreatePersonalAccessToken,
} from "~/services/personalAccessToken.server";
import {
hasAnswer,
hasQuestion,
@ -126,6 +129,7 @@ const { loader, action } = createHybridActionApiRoute(
});
result.consumeStream(); // no await
await deletePersonalAccessToken(pat?.id);
return result.toUIMessageStreamResponse({
originalMessages: validatedMessages,

View File

@ -1,6 +1,7 @@
import { json } from "@remix-run/node";
import { z } from "zod";
import { prisma } from "~/db.server";
import { createHybridLoaderApiRoute } from "~/services/routeBuilders/apiBuilder.server";
// Schema for logs search parameters

View File

@ -7,7 +7,10 @@ import { SpaceService } from "~/services/space.server";
import { json } from "@remix-run/node";
import { prisma } from "~/db.server";
import { apiCors } from "~/utils/apiCors";
import { isTriggerDeployment } from "~/lib/queue-adapter.server";
import {
enqueueSpaceAssignment,
isTriggerDeployment,
} from "~/lib/queue-adapter.server";
const spaceService = new SpaceService();
@ -74,6 +77,14 @@ const { action } = createHybridActionApiRoute(
workspaceId: user.Workspace.id,
});
await enqueueSpaceAssignment({
userId: user.id,
workspaceId: user.Workspace.id,
mode: "new_space",
newSpaceId: space.id,
batchSize: 25, // Analyze recent statements for the new space
});
return json({ space, success: true });
}

View File

@ -0,0 +1,107 @@
import { prisma } from "~/trigger/utils/prisma";
import { logger } from "~/services/logger.service";
import { runQuery } from "~/lib/neo4j.server";
interface WorkspaceMetadata {
lastTopicAnalysisAt?: string;
[key: string]: any;
}
/**
* Check if we should trigger a BERT topic analysis for this workspace
* Criteria: 20+ new episodes since last analysis (or no previous analysis)
*/
export async function shouldTriggerTopicAnalysis(
userId: string,
workspaceId: string,
): Promise<boolean> {
try {
// Get workspace metadata
const workspace = await prisma.workspace.findUnique({
where: { id: workspaceId },
select: { metadata: true },
});
if (!workspace) {
logger.warn(`Workspace not found: ${workspaceId}`);
return false;
}
const metadata = (workspace.metadata || {}) as WorkspaceMetadata;
const lastAnalysisAt = metadata.lastTopicAnalysisAt;
// Count episodes since last analysis
const query = lastAnalysisAt
? `
MATCH (e:Episode {userId: $userId})
WHERE e.createdAt > datetime($lastAnalysisAt)
RETURN count(e) as newEpisodeCount
`
: `
MATCH (e:Episode {userId: $userId})
RETURN count(e) as totalEpisodeCount
`;
const result = await runQuery(query, {
userId,
lastAnalysisAt,
});
const episodeCount = lastAnalysisAt
? result[0]?.get("newEpisodeCount")?.toNumber() || 0
: result[0]?.get("totalEpisodeCount")?.toNumber() || 0;
logger.info(
`[Topic Analysis Check] User: ${userId}, New episodes: ${episodeCount}, Last analysis: ${lastAnalysisAt || "never"}`,
);
// Trigger if 20+ new episodes
return episodeCount >= 20;
} catch (error) {
logger.error(
`[Topic Analysis Check] Error checking episode count:`,
error,
);
return false;
}
}
/**
* Update workspace metadata with last topic analysis timestamp
*/
export async function updateLastTopicAnalysisTime(
workspaceId: string,
): Promise<void> {
try {
const workspace = await prisma.workspace.findUnique({
where: { id: workspaceId },
select: { metadata: true },
});
if (!workspace) {
logger.warn(`Workspace not found: ${workspaceId}`);
return;
}
const metadata = (workspace.metadata || {}) as WorkspaceMetadata;
await prisma.workspace.update({
where: { id: workspaceId },
data: {
metadata: {
...metadata,
lastTopicAnalysisAt: new Date().toISOString(),
},
},
});
logger.info(
`[Topic Analysis] Updated last analysis timestamp for workspace: ${workspaceId}`,
);
} catch (error) {
logger.error(
`[Topic Analysis] Error updating last analysis timestamp:`,
error,
);
}
}
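// Illustrative sketch (not part of this commit): one way a post-ingestion hook could gate the
// expensive BERT analysis using these helpers. The hook name is hypothetical, and whether the
// timestamp is bumped here or inside the analysis job itself is an assumption.
import { enqueueBertTopicAnalysis } from "~/lib/queue-adapter.server";

export async function maybeRunTopicAnalysis(
  userId: string,
  workspaceId: string,
): Promise<void> {
  // Only queue a run when 20+ new episodes have accumulated since the last analysis.
  if (await shouldTriggerTopicAnalysis(userId, workspaceId)) {
    await enqueueBertTopicAnalysis({ userId, workspaceId });
    // Record the run so the next check counts episodes from this point forward.
    await updateLastTopicAnalysisTime(workspaceId);
  }
}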

View File

@ -45,6 +45,43 @@ export async function createSpace(
};
}
/**
* Get all active spaces for a user
*/
export async function getAllSpacesForUser(
userId: string,
): Promise<SpaceNode[]> {
const query = `
MATCH (s:Space {userId: $userId})
WHERE s.isActive = true
// Count episodes assigned to each space
OPTIONAL MATCH (s)-[:HAS_EPISODE]->(e:Episode {userId: $userId})
WITH s, count(e) as episodeCount
RETURN s, episodeCount
ORDER BY s.createdAt DESC
`;
const result = await runQuery(query, { userId });
return result.map((record) => {
const spaceData = record.get("s").properties;
const episodeCount = record.get("episodeCount") || 0;
return {
uuid: spaceData.uuid,
name: spaceData.name,
description: spaceData.description,
userId: spaceData.userId,
createdAt: new Date(spaceData.createdAt),
updatedAt: new Date(spaceData.updatedAt),
isActive: spaceData.isActive,
contextCount: Number(episodeCount),
};
});
}
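// Minimal usage sketch (illustrative; the helper name is hypothetical and assumes SpaceNode
// exposes name and contextCount as mapped above): list a user's active spaces with their
// assigned episode counts.
export async function logSpacesForUser(userId: string): Promise<void> {
  const spaces = await getAllSpacesForUser(userId);
  for (const space of spaces) {
    console.log(`${space.name}: ${space.contextCount} episodes`);
  }
}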
/**
* Get a specific space by ID
*/

View File

@ -58,6 +58,7 @@ async function createMcpServer(
// Handle memory tools and integration meta-tools
if (
name.startsWith("memory_") ||
name === "get_session_id" ||
name === "get_integrations" ||
name === "get_integration_actions" ||
name === "execute_integration_action"

View File

@ -1,262 +0,0 @@
import { logger } from "~/services/logger.service";
import {
getCompactedSessionBySessionId,
getCompactionStats,
getSessionEpisodes,
type CompactedSessionNode,
} from "~/services/graphModels/compactedSession";
import { enqueueSessionCompaction } from "~/lib/queue-adapter.server";
/**
* Configuration for session compaction
*/
export const COMPACTION_CONFIG = {
minEpisodesForCompaction: 5, // Minimum episodes to trigger initial compaction
compactionThreshold: 1, // Trigger update after N new episodes
autoCompactionEnabled: true, // Enable automatic compaction
};
/**
* SessionCompactionService - Manages session compaction lifecycle
*/
export class SessionCompactionService {
/**
* Check if a session should be compacted
*/
async shouldCompact(sessionId: string, userId: string): Promise<{
shouldCompact: boolean;
reason: string;
episodeCount?: number;
newEpisodeCount?: number;
}> {
try {
// Get existing compact
const existingCompact = await getCompactedSessionBySessionId(sessionId, userId);
if (!existingCompact) {
// No compact exists, check if we have enough episodes
const episodeCount = await this.getSessionEpisodeCount(sessionId, userId);
if (episodeCount >= COMPACTION_CONFIG.minEpisodesForCompaction) {
return {
shouldCompact: true,
reason: "initial_compaction",
episodeCount,
};
}
return {
shouldCompact: false,
reason: "insufficient_episodes",
episodeCount,
};
}
// Compact exists, check if we have enough new episodes
const newEpisodeCount = await this.getNewEpisodeCount(
sessionId,
userId,
existingCompact.endTime
);
if (newEpisodeCount >= COMPACTION_CONFIG.compactionThreshold) {
return {
shouldCompact: true,
reason: "update_compaction",
newEpisodeCount,
};
}
return {
shouldCompact: false,
reason: "insufficient_new_episodes",
newEpisodeCount,
};
} catch (error) {
logger.error(`Error checking if session should compact`, {
sessionId,
userId,
error: error instanceof Error ? error.message : String(error),
});
return {
shouldCompact: false,
reason: "error",
};
}
}
/**
* Get total episode count for a session
*/
private async getSessionEpisodeCount(
sessionId: string,
userId: string
): Promise<number> {
const episodes = await getSessionEpisodes(sessionId, userId);
return episodes.length;
}
/**
* Get count of new episodes since last compaction
*/
private async getNewEpisodeCount(
sessionId: string,
userId: string,
afterTime: Date
): Promise<number> {
const episodes = await getSessionEpisodes(sessionId, userId, afterTime);
return episodes.length;
}
/**
* Trigger compaction for a session
*/
async triggerCompaction(
sessionId: string,
userId: string,
source: string,
triggerSource: "auto" | "manual" | "threshold" = "auto"
): Promise<{ success: boolean; taskId?: string; error?: string }> {
try {
// Check if compaction should be triggered
const check = await this.shouldCompact(sessionId, userId);
if (!check.shouldCompact) {
logger.info(`Compaction not needed`, {
sessionId,
userId,
reason: check.reason,
});
return {
success: false,
error: `Compaction not needed: ${check.reason}`,
};
}
// Trigger the compaction task
logger.info(`Triggering session compaction`, {
sessionId,
userId,
source,
triggerSource,
reason: check.reason,
});
const handle = await enqueueSessionCompaction({
userId,
sessionId,
source,
triggerSource,
});
logger.info(`Session compaction triggered`, {
sessionId,
userId,
taskId: handle.id,
});
return {
success: true,
taskId: handle.id,
};
} catch (error) {
logger.error(`Failed to trigger compaction`, {
sessionId,
userId,
error: error instanceof Error ? error.message : String(error),
});
return {
success: false,
error: error instanceof Error ? error.message : "Unknown error",
};
}
}
/**
* Get compacted session for recall
*/
async getCompactForRecall(
sessionId: string,
userId: string
): Promise<CompactedSessionNode | null> {
try {
return await getCompactedSessionBySessionId(sessionId, userId);
} catch (error) {
logger.error(`Error fetching compact for recall`, {
sessionId,
userId,
error: error instanceof Error ? error.message : String(error),
});
return null;
}
}
/**
* Get compaction statistics for a user
*/
async getStats(userId: string): Promise<{
totalCompacts: number;
totalEpisodes: number;
averageCompressionRatio: number;
mostRecentCompaction: Date | null;
}> {
try {
return await getCompactionStats(userId);
} catch (error) {
logger.error(`Error fetching compaction stats`, {
userId,
error: error instanceof Error ? error.message : String(error),
});
return {
totalCompacts: 0,
totalEpisodes: 0,
averageCompressionRatio: 0,
mostRecentCompaction: null,
};
}
}
/**
* Auto-trigger compaction after episode ingestion
* Called from ingestion pipeline
*/
async autoTriggerAfterIngestion(
sessionId: string | null | undefined,
userId: string,
source: string
): Promise<void> {
// Skip if no sessionId or auto-compaction disabled
if (!sessionId || !COMPACTION_CONFIG.autoCompactionEnabled) {
return;
}
try {
const check = await this.shouldCompact(sessionId, userId);
if (check.shouldCompact) {
logger.info(`Auto-triggering compaction after ingestion`, {
sessionId,
userId,
reason: check.reason,
});
// Trigger compaction (awaited here; any failure is caught below so ingestion is never blocked)
await this.triggerCompaction(sessionId, userId, source, "auto");
}
} catch (error) {
// Log error but don't fail ingestion
logger.error(`Error in auto-trigger compaction`, {
sessionId,
userId,
error: error instanceof Error ? error.message : String(error),
});
}
}
}
// Singleton instance
export const sessionCompactionService = new SessionCompactionService();

View File

@ -16,8 +16,6 @@ import {
updateSpace,
} from "./graphModels/space";
import { prisma } from "~/trigger/utils/prisma";
import { trackFeatureUsage } from "./telemetry.server";
import { enqueueSpaceAssignment } from "~/lib/queue-adapter.server";
export class SpaceService {
/**
@ -65,26 +63,7 @@ export class SpaceService {
logger.info(`Created space ${space.id} successfully`);
// Track space creation
trackFeatureUsage("space_created", params.userId).catch(console.error);
// Trigger automatic LLM assignment for the new space
try {
await enqueueSpaceAssignment({
userId: params.userId,
workspaceId: params.workspaceId,
mode: "new_space",
newSpaceId: space.id,
batchSize: 25, // Analyze recent statements for the new space
});
logger.info(`Triggered LLM space assignment for new space ${space.id}`);
} catch (error) {
// Don't fail space creation if LLM assignment fails
logger.warn(
`Failed to trigger LLM assignment for space ${space.id}:`,
error as Record<string, unknown>,
);
}
// trackFeatureUsage("space_created", params.userId).catch(console.error);
return space;
}
@ -197,9 +176,6 @@ export class SpaceService {
logger.info(`Nothing to update to graph`);
}
// Track space update
trackFeatureUsage("space_updated", userId).catch(console.error);
logger.info(`Updated space ${spaceId} successfully`);
return space;
}

View File

@ -0,0 +1,53 @@
import { task } from "@trigger.dev/sdk/v3";
import { python } from "@trigger.dev/python";
import {
processTopicAnalysis,
type TopicAnalysisPayload,
} from "~/jobs/bert/topic-analysis.logic";
import { spaceSummaryTask } from "~/trigger/spaces/space-summary";
/**
* Python runner for Trigger.dev using python.runScript
*/
async function runBertWithTriggerPython(
userId: string,
minTopicSize: number,
nrTopics?: number,
): Promise<string> {
const args = [userId, "--json"];
if (nrTopics) {
args.push("--nr-topics", String(nrTopics));
}
console.log(
`[BERT Topic Analysis] Running with Trigger.dev Python: args=${args.join(" ")}`,
);
const result = await python.runScript("./python/main.py", args);
return result.stdout;
}
/**
* Trigger.dev task for BERT topic analysis
*
* This is a thin wrapper around the common logic in jobs/bert/topic-analysis.logic.ts
*/
export const bertTopicAnalysisTask = task({
id: "bert-topic-analysis",
queue: {
name: "bert-topic-analysis",
concurrencyLimit: 3, // Max 3 parallel analyses to avoid CPU overload
},
run: async (payload: TopicAnalysisPayload) => {
return await processTopicAnalysis(
payload,
// Callback to enqueue space summary
async (params) => {
await spaceSummaryTask.trigger(params);
},
// Python runner for Trigger.dev
runBertWithTriggerPython,
);
},
});

View File

@ -6,6 +6,7 @@ import {
} from "~/jobs/ingest/ingest-episode.logic";
import { triggerSpaceAssignment } from "../spaces/space-assignment";
import { triggerSessionCompaction } from "../session/session-compaction";
import { bertTopicAnalysisTask } from "../bert/topic-analysis";
const ingestionQueue = queue({
name: "ingestion-queue",
@ -32,6 +33,14 @@ export const ingestTask = task({
async (params) => {
await triggerSessionCompaction(params);
},
// Callback for BERT topic analysis
async (params) => {
await bertTopicAnalysisTask.trigger(params, {
queue: "bert-topic-analysis",
concurrencyKey: params.userId,
tags: [params.userId, "bert-analysis"],
});
},
);
},
});

View File

@ -1,9 +1,9 @@
import { task } from "@trigger.dev/sdk";
import { z } from "zod";
import { IngestionQueue, IngestionStatus } from "@core/database";
import { IngestionStatus } from "@core/database";
import { logger } from "~/services/logger.service";
import { prisma } from "../utils/prisma";
import { IngestBodyRequest, ingestTask } from "./ingest";
import { type IngestBodyRequest, ingestTask } from "./ingest";
export const RetryNoCreditBodyRequest = z.object({
workspaceId: z.string(),
@ -43,9 +43,7 @@ export const retryNoCreditsTask = task({
};
}
logger.log(
`Found ${noCreditItems.length} NO_CREDITS episodes to retry`,
);
logger.log(`Found ${noCreditItems.length} NO_CREDITS episodes to retry`);
const results = {
total: noCreditItems.length,

File diff suppressed because it is too large

View File

@ -1,557 +0,0 @@
import { task } from "@trigger.dev/sdk/v3";
import { logger } from "~/services/logger.service";
import { makeModelCall } from "~/lib/model.server";
import { runQuery } from "~/lib/neo4j.server";
import type { CoreMessage } from "ai";
import { z } from "zod";
import {
EXPLICIT_PATTERN_TYPES,
IMPLICIT_PATTERN_TYPES,
type SpacePattern,
type PatternDetectionResult,
} from "@core/types";
import { createSpacePattern, getSpace } from "../utils/space-utils";
interface SpacePatternPayload {
userId: string;
workspaceId: string;
spaceId: string;
triggerSource?:
| "summary_complete"
| "manual"
| "assignment"
| "scheduled"
| "new_space"
| "growth_threshold"
| "ingestion_complete";
}
interface SpaceStatementData {
uuid: string;
fact: string;
subject: string;
predicate: string;
object: string;
createdAt: Date;
validAt: Date;
content?: string; // For implicit pattern analysis
}
interface SpaceThemeData {
themes: string[];
summary: string;
}
// Zod schemas for LLM response validation
const ExplicitPatternSchema = z.object({
name: z.string(),
type: z.string(),
summary: z.string(),
evidence: z.array(z.string()),
confidence: z.number().min(0).max(1),
});
const ImplicitPatternSchema = z.object({
name: z.string(),
type: z.string(),
summary: z.string(),
evidence: z.array(z.string()),
confidence: z.number().min(0).max(1),
});
const PatternAnalysisSchema = z.object({
explicitPatterns: z.array(ExplicitPatternSchema),
implicitPatterns: z.array(ImplicitPatternSchema),
});
const CONFIG = {
minStatementsForPatterns: 5,
maxPatternsPerSpace: 20,
minPatternConfidence: 0.85,
};
export const spacePatternTask = task({
id: "space-pattern",
run: async (payload: SpacePatternPayload) => {
const { userId, workspaceId, spaceId, triggerSource = "manual" } = payload;
logger.info(`Starting space pattern detection`, {
userId,
workspaceId,
spaceId,
triggerSource,
});
try {
// Get space data and check if it has enough content
const space = await getSpaceForPatternAnalysis(spaceId);
if (!space) {
return {
success: false,
spaceId,
error: "Space not found or insufficient data",
};
}
// Get statements for pattern analysis
const statements = await getSpaceStatementsForPatterns(spaceId, userId);
if (statements.length < CONFIG.minStatementsForPatterns) {
logger.info(
`Space ${spaceId} has insufficient statements (${statements.length}) for pattern detection`,
);
return {
success: true,
spaceId,
triggerSource,
patterns: {
explicitPatterns: [],
implicitPatterns: [],
totalPatternsFound: 0,
},
};
}
// Detect patterns
const patternResult = await detectSpacePatterns(space, statements);
if (patternResult) {
// Store patterns
await storePatterns(
patternResult.explicitPatterns,
patternResult.implicitPatterns,
spaceId,
);
logger.info(`Generated patterns for space ${spaceId}`, {
explicitPatterns: patternResult.explicitPatterns.length,
implicitPatterns: patternResult.implicitPatterns.length,
totalPatterns: patternResult.totalPatternsFound,
triggerSource,
});
return {
success: true,
spaceId,
triggerSource,
patterns: {
explicitPatterns: patternResult.explicitPatterns.length,
implicitPatterns: patternResult.implicitPatterns.length,
totalPatternsFound: patternResult.totalPatternsFound,
},
};
} else {
logger.warn(`Failed to detect patterns for space ${spaceId}`);
return {
success: false,
spaceId,
triggerSource,
error: "Failed to detect patterns",
};
}
} catch (error) {
logger.error(
`Error in space pattern detection for space ${spaceId}:`,
error as Record<string, unknown>,
);
throw error;
}
},
});
async function getSpaceForPatternAnalysis(
spaceId: string,
): Promise<SpaceThemeData | null> {
try {
const space = await getSpace(spaceId);
if (!space || !space.themes || space.themes.length === 0) {
logger.warn(
`Space ${spaceId} not found or has no themes for pattern analysis`,
);
return null;
}
return {
themes: space.themes,
summary: space.summary || "",
};
} catch (error) {
logger.error(
`Error getting space for pattern analysis:`,
error as Record<string, unknown>,
);
return null;
}
}
async function getSpaceStatementsForPatterns(
spaceId: string,
userId: string,
): Promise<SpaceStatementData[]> {
const query = `
MATCH (s:Statement)
WHERE s.userId = $userId
AND s.spaceIds IS NOT NULL
AND $spaceId IN s.spaceIds
AND s.invalidAt IS NULL
MATCH (s)-[:HAS_SUBJECT]->(subj:Entity)
MATCH (s)-[:HAS_PREDICATE]->(pred:Entity)
MATCH (s)-[:HAS_OBJECT]->(obj:Entity)
RETURN s, subj.name as subject, pred.name as predicate, obj.name as object
ORDER BY s.createdAt DESC
`;
const result = await runQuery(query, {
spaceId,
userId,
});
return result.map((record) => {
const statement = record.get("s").properties;
return {
uuid: statement.uuid,
fact: statement.fact,
subject: record.get("subject"),
predicate: record.get("predicate"),
object: record.get("object"),
createdAt: new Date(statement.createdAt),
validAt: new Date(statement.validAt),
content: statement.fact, // Use fact as content for implicit analysis
};
});
}
async function detectSpacePatterns(
space: SpaceThemeData,
statements: SpaceStatementData[],
): Promise<PatternDetectionResult | null> {
try {
// Extract explicit patterns from themes
const explicitPatterns = await extractExplicitPatterns(
space.themes,
space.summary,
statements,
);
// Extract implicit patterns from statement analysis
const implicitPatterns = await extractImplicitPatterns(statements);
return {
explicitPatterns,
implicitPatterns,
totalPatternsFound: explicitPatterns.length + implicitPatterns.length,
processingStats: {
statementsAnalyzed: statements.length,
themesProcessed: space.themes.length,
implicitPatternsExtracted: implicitPatterns.length,
},
};
} catch (error) {
logger.error(
"Error detecting space patterns:",
error as Record<string, unknown>,
);
return null;
}
}
async function extractExplicitPatterns(
themes: string[],
summary: string,
statements: SpaceStatementData[],
): Promise<Omit<SpacePattern, "id" | "createdAt" | "updatedAt" | "spaceId">[]> {
if (themes.length === 0) return [];
const prompt = createExplicitPatternPrompt(themes, summary, statements);
// Pattern extraction requires HIGH complexity (insight synthesis, pattern recognition)
let responseText = "";
await makeModelCall(false, prompt, (text: string) => {
responseText = text;
}, undefined, 'high');
const patterns = parseExplicitPatternResponse(responseText);
return patterns.map((pattern) => ({
name: pattern.name || `${pattern.type} pattern`,
source: "explicit" as const,
type: pattern.type,
summary: pattern.summary,
evidence: pattern.evidence,
confidence: pattern.confidence,
userConfirmed: "pending" as const,
}));
}
async function extractImplicitPatterns(
statements: SpaceStatementData[],
): Promise<Omit<SpacePattern, "id" | "createdAt" | "updatedAt" | "spaceId">[]> {
if (statements.length < CONFIG.minStatementsForPatterns) return [];
const prompt = createImplicitPatternPrompt(statements);
// Implicit pattern discovery requires HIGH complexity (pattern recognition from statements)
let responseText = "";
await makeModelCall(false, prompt, (text: string) => {
responseText = text;
}, undefined, 'high');
const patterns = parseImplicitPatternResponse(responseText);
return patterns.map((pattern) => ({
name: pattern.name || `${pattern.type} pattern`,
source: "implicit" as const,
type: pattern.type,
summary: pattern.summary,
evidence: pattern.evidence,
confidence: pattern.confidence,
userConfirmed: "pending" as const,
}));
}
function createExplicitPatternPrompt(
themes: string[],
summary: string,
statements: SpaceStatementData[],
): CoreMessage[] {
const statementsText = statements
.map((stmt) => `[${stmt.uuid}] ${stmt.fact}`)
.join("\n");
const explicitTypes = Object.values(EXPLICIT_PATTERN_TYPES).join('", "');
return [
{
role: "system",
content: `You are an expert at extracting structured patterns from themes and supporting evidence.
Your task is to convert high-level themes into explicit patterns with supporting statement evidence.
INSTRUCTIONS:
1. For each theme, create a pattern that explains what it reveals about the user
2. Give each pattern a short, descriptive name (2-4 words)
3. Find supporting statement IDs that provide evidence for each pattern
4. Assess confidence based on evidence strength and theme clarity
5. Use appropriate pattern types from these guidelines: "${explicitTypes}"
- "theme": High-level thematic content areas
- "topic": Specific subject matter or topics of interest
- "domain": Knowledge or work domains the user operates in
- "interest_area": Areas of personal interest or hobby
6. You may suggest new pattern types if none of the guidelines fit well
RESPONSE FORMAT:
Provide your response inside <output></output> tags with valid JSON.
<output>
{
"explicitPatterns": [
{
"name": "Short descriptive name for the pattern",
"type": "theme",
"summary": "Description of what this pattern reveals about the user",
"evidence": ["statement_id_1", "statement_id_2"],
"confidence": 0.85
}
]
}
</output>`,
},
{
role: "user",
content: `THEMES TO ANALYZE:
${themes.map((theme, i) => `${i + 1}. ${theme}`).join("\n")}
SPACE SUMMARY:
${summary}
SUPPORTING STATEMENTS:
${statementsText}
Please extract explicit patterns from these themes and map them to supporting statement evidence.`,
},
];
}
function createImplicitPatternPrompt(
statements: SpaceStatementData[],
): CoreMessage[] {
const statementsText = statements
.map(
(stmt) =>
`[${stmt.uuid}] ${stmt.fact} (${stmt.subject} → ${stmt.predicate} → ${stmt.object})`,
)
.join("\n");
const implicitTypes = Object.values(IMPLICIT_PATTERN_TYPES).join('", "');
return [
{
role: "system",
content: `You are an expert at discovering implicit behavioral patterns from statement analysis.
Your task is to identify hidden patterns in user behavior, preferences, and habits from statement content.
INSTRUCTIONS:
1. Analyze statement content for behavioral patterns, not explicit topics
2. Give each pattern a short, descriptive name (2-4 words)
3. Look for recurring behaviors, preferences, and working styles
4. Identify how the user approaches tasks, makes decisions, and interacts
5. Use appropriate pattern types from these guidelines: "${implicitTypes}"
- "preference": Personal preferences and choices
- "habit": Recurring behaviors and routines
- "workflow": Work and process patterns
- "communication_style": How user communicates and expresses ideas
- "decision_pattern": Decision-making approaches and criteria
- "temporal_pattern": Time-based behavioral patterns
- "behavioral_pattern": General behavioral tendencies
- "learning_style": How user learns and processes information
- "collaboration_style": How user works with others
6. You may suggest new pattern types if none of the guidelines fit well
7. Focus on what the statements reveal about how the user thinks, works, or behaves
RESPONSE FORMAT:
Provide your response inside <output></output> tags with valid JSON.
<output>
{
"implicitPatterns": [
{
"name": "Short descriptive name for the pattern",
"type": "preference",
"summary": "Description of what this behavioral pattern reveals",
"evidence": ["statement_id_1", "statement_id_2"],
"confidence": 0.75
}
]
}
</output>`,
},
{
role: "user",
content: `STATEMENTS TO ANALYZE FOR IMPLICIT PATTERNS:
${statementsText}
Please identify implicit behavioral patterns, preferences, and habits from these statements.`,
},
];
}
function parseExplicitPatternResponse(response: string): Array<{
name: string;
type: string;
summary: string;
evidence: string[];
confidence: number;
}> {
try {
const outputMatch = response.match(/<output>([\s\S]*?)<\/output>/);
if (!outputMatch) {
logger.warn("No <output> tags found in explicit pattern response");
return [];
}
const parsed = JSON.parse(outputMatch[1].trim());
const validationResult = z
.object({
explicitPatterns: z.array(ExplicitPatternSchema),
})
.safeParse(parsed);
if (!validationResult.success) {
logger.warn("Invalid explicit pattern response format:", {
error: validationResult.error,
});
return [];
}
return validationResult.data.explicitPatterns.filter(
(p) =>
p.confidence >= CONFIG.minPatternConfidence && p.evidence.length >= 3, // Ensure at least 3 evidence statements
);
} catch (error) {
logger.error(
"Error parsing explicit pattern response:",
error as Record<string, unknown>,
);
return [];
}
}
function parseImplicitPatternResponse(response: string): Array<{
name: string;
type: string;
summary: string;
evidence: string[];
confidence: number;
}> {
try {
const outputMatch = response.match(/<output>([\s\S]*?)<\/output>/);
if (!outputMatch) {
logger.warn("No <output> tags found in implicit pattern response");
return [];
}
const parsed = JSON.parse(outputMatch[1].trim());
const validationResult = z
.object({
implicitPatterns: z.array(ImplicitPatternSchema),
})
.safeParse(parsed);
if (!validationResult.success) {
logger.warn("Invalid implicit pattern response format:", {
error: validationResult.error,
});
return [];
}
return validationResult.data.implicitPatterns.filter(
(p) =>
p.confidence >= CONFIG.minPatternConfidence && p.evidence.length >= 3, // Ensure at least 3 evidence statements
);
} catch (error) {
logger.error(
"Error parsing implicit pattern response:",
error as Record<string, unknown>,
);
return [];
}
}
async function storePatterns(
explicitPatterns: Omit<
SpacePattern,
"id" | "createdAt" | "updatedAt" | "spaceId"
>[],
implicitPatterns: Omit<
SpacePattern,
"id" | "createdAt" | "updatedAt" | "spaceId"
>[],
spaceId: string,
): Promise<void> {
try {
const allPatterns = [...explicitPatterns, ...implicitPatterns];
if (allPatterns.length === 0) return;
// Store in PostgreSQL
await createSpacePattern(spaceId, allPatterns);
logger.info(`Stored ${allPatterns.length} patterns`, {
explicit: explicitPatterns.length,
implicit: implicitPatterns.length,
});
} catch (error) {
logger.error("Error storing patterns:", error as Record<string, unknown>);
throw error;
}
}
// Helper function to trigger the task
export async function triggerSpacePattern(payload: SpacePatternPayload) {
return await spacePatternTask.trigger(payload, {
concurrencyKey: `space-pattern-${payload.spaceId}`, // Prevent parallel runs for the same space
tags: [payload.userId, payload.spaceId, payload.triggerSource || "manual"],
});
}

View File

@ -1,62 +1,11 @@
import { queue, task } from "@trigger.dev/sdk/v3";
import { logger } from "~/services/logger.service";
import { SpaceService } from "~/services/space.server";
import { makeModelCall } from "~/lib/model.server";
import { runQuery } from "~/lib/neo4j.server";
import { updateSpaceStatus, SPACE_STATUS } from "../utils/space-status";
import type { CoreMessage } from "ai";
import { z } from "zod";
import { triggerSpacePattern } from "./space-pattern";
import { getSpace, updateSpace } from "../utils/space-utils";
import {
processSpaceSummary,
type SpaceSummaryPayload,
} from "~/jobs/spaces/space-summary.logic";
import { EpisodeType } from "@core/types";
import { getSpaceEpisodeCount } from "~/services/graphModels/space";
import { addToQueue } from "~/lib/ingest.server";
interface SpaceSummaryPayload {
userId: string;
workspaceId: string;
spaceId: string; // Single space only
triggerSource?: "assignment" | "manual" | "scheduled";
}
interface SpaceEpisodeData {
uuid: string;
content: string;
originalContent: string;
source: string;
createdAt: Date;
validAt: Date;
metadata: any;
sessionId: string | null;
}
interface SpaceSummaryData {
spaceId: string;
spaceName: string;
spaceDescription?: string;
contextCount: number;
summary: string;
keyEntities: string[];
themes: string[];
confidence: number;
lastUpdated: Date;
isIncremental: boolean;
}
// Zod schema for LLM response validation
const SummaryResultSchema = z.object({
summary: z.string(),
keyEntities: z.array(z.string()),
themes: z.array(z.string()),
confidence: z.number().min(0).max(1),
});
const CONFIG = {
maxEpisodesForSummary: 20, // Limit episodes for performance
minEpisodesForSummary: 1, // Minimum episodes to generate summary
summaryEpisodeThreshold: 5, // Minimum new episodes required to trigger summary (configurable)
};
export type { SpaceSummaryPayload };
export const spaceSummaryQueue = queue({
name: "space-summary-queue",
@ -67,735 +16,17 @@ export const spaceSummaryTask = task({
id: "space-summary",
queue: spaceSummaryQueue,
run: async (payload: SpaceSummaryPayload) => {
const { userId, workspaceId, spaceId, triggerSource = "manual" } = payload;
logger.info(`Starting space summary generation`, {
userId,
workspaceId,
spaceId,
triggerSource,
logger.info(`[Trigger.dev] Starting space summary task`, {
userId: payload.userId,
spaceId: payload.spaceId,
triggerSource: payload.triggerSource,
});
try {
// Update status to processing
await updateSpaceStatus(spaceId, SPACE_STATUS.PROCESSING, {
userId,
operation: "space-summary",
metadata: { triggerSource, phase: "start_summary" },
});
// Generate summary for the single space
const summaryResult = await generateSpaceSummary(
spaceId,
userId,
triggerSource,
);
if (summaryResult) {
// Store the summary
await storeSummary(summaryResult);
// Update status to ready after successful completion
await updateSpaceStatus(spaceId, SPACE_STATUS.READY, {
userId,
operation: "space-summary",
metadata: {
triggerSource,
phase: "completed_summary",
contextCount: summaryResult.contextCount,
confidence: summaryResult.confidence,
},
});
logger.info(`Generated summary for space ${spaceId}`, {
statementCount: summaryResult.contextCount,
confidence: summaryResult.confidence,
themes: summaryResult.themes.length,
triggerSource,
});
return {
success: true,
spaceId,
triggerSource,
summary: {
statementCount: summaryResult.contextCount,
confidence: summaryResult.confidence,
themesCount: summaryResult.themes.length,
},
};
} else {
// No summary generated - this could be due to insufficient episodes or no new episodes
// This is not an error state, so update status to ready
await updateSpaceStatus(spaceId, SPACE_STATUS.READY, {
userId,
operation: "space-summary",
metadata: {
triggerSource,
phase: "no_summary_needed",
reason: "Insufficient episodes or no new episodes to summarize",
},
});
logger.info(
`No summary generated for space ${spaceId} - insufficient or no new episodes`,
);
return {
success: true,
spaceId,
triggerSource,
summary: null,
reason: "No episodes to summarize",
};
}
} catch (error) {
// Update status to error on exception
try {
await updateSpaceStatus(spaceId, SPACE_STATUS.ERROR, {
userId,
operation: "space-summary",
metadata: {
triggerSource,
phase: "exception",
error: error instanceof Error ? error.message : "Unknown error",
},
});
} catch (statusError) {
logger.warn(`Failed to update status to error for space ${spaceId}`, {
statusError,
});
}
logger.error(
`Error in space summary generation for space ${spaceId}:`,
error as Record<string, unknown>,
);
throw error;
}
// Use common business logic
return await processSpaceSummary(payload);
},
});
async function generateSpaceSummary(
spaceId: string,
userId: string,
triggerSource?: "assignment" | "manual" | "scheduled",
): Promise<SpaceSummaryData | null> {
try {
// 1. Get space details
const spaceService = new SpaceService();
const space = await spaceService.getSpace(spaceId, userId);
if (!space) {
logger.warn(`Space ${spaceId} not found for user ${userId}`);
return null;
}
// 2. Check episode count threshold (skip for manual triggers)
if (triggerSource !== "manual") {
const currentEpisodeCount = await getSpaceEpisodeCount(spaceId, userId);
const lastSummaryEpisodeCount = space.contextCount || 0;
const episodeDifference = currentEpisodeCount - lastSummaryEpisodeCount;
if (
episodeDifference < CONFIG.summaryEpisodeThreshold ||
lastSummaryEpisodeCount !== 0
) {
logger.info(
`Skipping summary generation for space ${spaceId}: only ${episodeDifference} new episodes (threshold: ${CONFIG.summaryEpisodeThreshold})`,
{
currentEpisodeCount,
lastSummaryEpisodeCount,
episodeDifference,
threshold: CONFIG.summaryEpisodeThreshold,
},
);
return null;
}
logger.info(
`Proceeding with summary generation for space ${spaceId}: ${episodeDifference} new episodes (threshold: ${CONFIG.summaryEpisodeThreshold})`,
{
currentEpisodeCount,
lastSummaryEpisodeCount,
episodeDifference,
},
);
}
// 2. Check for existing summary
const existingSummary = await getExistingSummary(spaceId);
const isIncremental = existingSummary !== null;
// 3. Get episodes (all or new ones based on existing summary)
const episodes = await getSpaceEpisodes(
spaceId,
userId,
isIncremental ? existingSummary?.lastUpdated : undefined,
);
// Handle case where no new episodes exist for incremental update
if (isIncremental && episodes.length === 0) {
logger.info(
`No new episodes found for space ${spaceId}, skipping summary update`,
);
return null;
}
// Check minimum episode requirement for new summaries only
if (!isIncremental && episodes.length < CONFIG.minEpisodesForSummary) {
logger.info(
`Space ${spaceId} has insufficient episodes (${episodes.length}) for new summary`,
);
return null;
}
// 4. Process episodes using unified approach
let summaryResult;
if (episodes.length > CONFIG.maxEpisodesForSummary) {
logger.info(
`Large space detected (${episodes.length} episodes). Processing in batches.`,
);
// Process in batches, each building on previous result
const batches: SpaceEpisodeData[][] = [];
for (let i = 0; i < episodes.length; i += CONFIG.maxEpisodesForSummary) {
batches.push(episodes.slice(i, i + CONFIG.maxEpisodesForSummary));
}
let currentSummary = existingSummary?.summary || null;
let currentThemes = existingSummary?.themes || [];
let cumulativeConfidence = 0;
for (const [batchIndex, batch] of batches.entries()) {
logger.info(
`Processing batch ${batchIndex + 1}/${batches.length} with ${batch.length} episodes`,
);
const batchResult = await generateUnifiedSummary(
space.name,
space.description as string,
batch,
currentSummary,
currentThemes,
);
if (batchResult) {
currentSummary = batchResult.summary;
currentThemes = batchResult.themes;
cumulativeConfidence += batchResult.confidence;
} else {
logger.warn(`Failed to process batch ${batchIndex + 1}`);
}
// Small delay between batches
if (batchIndex < batches.length - 1) {
await new Promise((resolve) => setTimeout(resolve, 500));
}
}
summaryResult = currentSummary
? {
summary: currentSummary,
themes: currentThemes,
confidence: Math.min(cumulativeConfidence / batches.length, 1.0),
}
: null;
} else {
logger.info(
`Processing ${episodes.length} episodes with unified approach`,
);
// Use unified approach for smaller spaces
summaryResult = await generateUnifiedSummary(
space.name,
space.description as string,
episodes,
existingSummary?.summary || null,
existingSummary?.themes || [],
);
}
if (!summaryResult) {
logger.warn(`Failed to generate LLM summary for space ${spaceId}`);
return null;
}
// Get the actual current counts from Neo4j
const currentEpisodeCount = await getSpaceEpisodeCount(spaceId, userId);
return {
spaceId: space.uuid,
spaceName: space.name,
spaceDescription: space.description as string,
contextCount: currentEpisodeCount,
summary: summaryResult.summary,
keyEntities: summaryResult.keyEntities || [],
themes: summaryResult.themes,
confidence: summaryResult.confidence,
lastUpdated: new Date(),
isIncremental,
};
} catch (error) {
logger.error(
`Error generating summary for space ${spaceId}:`,
error as Record<string, unknown>,
);
return null;
}
}
async function generateUnifiedSummary(
spaceName: string,
spaceDescription: string | undefined,
episodes: SpaceEpisodeData[],
previousSummary: string | null = null,
previousThemes: string[] = [],
): Promise<{
summary: string;
themes: string[];
confidence: number;
keyEntities?: string[];
} | null> {
try {
const prompt = createUnifiedSummaryPrompt(
spaceName,
spaceDescription,
episodes,
previousSummary,
previousThemes,
);
// Space summary generation requires HIGH complexity (creative synthesis, narrative generation)
let responseText = "";
await makeModelCall(
false,
prompt,
(text: string) => {
responseText = text;
},
undefined,
"high",
);
return parseSummaryResponse(responseText);
} catch (error) {
logger.error(
"Error generating unified summary:",
error as Record<string, unknown>,
);
return null;
}
}
function createUnifiedSummaryPrompt(
spaceName: string,
spaceDescription: string | undefined,
episodes: SpaceEpisodeData[],
previousSummary: string | null,
previousThemes: string[],
): CoreMessage[] {
// If there are no episodes and no previous summary, we cannot generate a meaningful summary
if (episodes.length === 0 && previousSummary === null) {
throw new Error(
"Cannot generate summary without episodes or existing summary",
);
}
const episodesText = episodes
.map(
(episode) =>
`- ${episode.content} (Source: ${episode.source}, Session: ${episode.sessionId || "N/A"})`,
)
.join("\n");
// Extract key entities and themes from episode content
const contentWords = episodes
.map((ep) => ep.content.toLowerCase())
.join(" ")
.split(/\s+/)
.filter((word) => word.length > 3);
const wordFrequency = new Map<string, number>();
contentWords.forEach((word) => {
wordFrequency.set(word, (wordFrequency.get(word) || 0) + 1);
});
const topEntities = Array.from(wordFrequency.entries())
.sort(([, a], [, b]) => b - a)
.slice(0, 10)
.map(([word]) => word);
const isUpdate = previousSummary !== null;
return [
{
role: "system",
content: `You are an expert at analyzing and summarizing episodes within semantic spaces based on the space's intent and purpose. Your task is to ${isUpdate ? "update an existing summary by integrating new episodes" : "create a comprehensive summary of episodes"}.
CRITICAL RULES:
1. Base your summary ONLY on insights derived from the actual content/episodes provided
2. Use the space's INTENT/PURPOSE (from description) to guide what to summarize and how to organize it
3. Write in a factual, neutral tone - avoid promotional language ("pivotal", "invaluable", "cutting-edge")
4. Be specific and concrete - reference actual content, patterns, and insights found in the episodes
5. If episodes are insufficient for meaningful insights, state that more data is needed
INTENT-DRIVEN SUMMARIZATION:
Your summary should SERVE the space's intended purpose. Examples:
- "Learning React" Summarize React concepts, patterns, techniques learned
- "Project X Updates" Summarize progress, decisions, blockers, next steps
- "Health Tracking" Summarize metrics, trends, observations, insights
- "Guidelines for React" Extract actionable patterns, best practices, rules
- "Evolution of design thinking" Track how thinking changed over time, decision points
The intent defines WHY this space exists - organize content to serve that purpose.
INSTRUCTIONS:
${
isUpdate
? `1. Review the existing summary and themes carefully
2. Analyze the new episodes for patterns and insights that align with the space's intent
3. Identify connecting points between existing knowledge and new episodes
4. Update the summary to seamlessly integrate new information while preserving valuable existing insights
5. Evolve themes by adding new ones or refining existing ones based on the space's purpose
6. Organize the summary to serve the space's intended use case`
: `1. Analyze the semantic content and relationships within the episodes
2. Identify topics/sections that align with the space's INTENT and PURPOSE
3. Create a coherent summary that serves the space's intended use case
4. Organize the summary based on the space's purpose (not generic frequency-based themes)`
}
${isUpdate ? "7" : "5"}. Assess your confidence in the ${isUpdate ? "updated" : ""} summary quality (0.0-1.0)
INTENT-ALIGNED ORGANIZATION:
- Organize sections based on what serves the space's purpose
- Topics don't need minimum episode counts - relevance to intent matters most
- Each section should provide value aligned with the space's intended use
- For "guidelines" spaces: focus on actionable patterns
- For "tracking" spaces: focus on temporal patterns and changes
- For "learning" spaces: focus on concepts and insights gained
- Let the space's intent drive the structure, not rigid rules
${
isUpdate
? `CONNECTION FOCUS:
- Entity relationships that span across batches/time
- Theme evolution and expansion
- Temporal patterns and progressions
- Contradictions or confirmations of existing insights
- New insights that complement existing knowledge`
: ""
}
RESPONSE FORMAT:
Provide your response inside <output></output> tags with valid JSON. Include both HTML summary and markdown format.
<output>
{
"summary": "${isUpdate ? "Updated HTML summary that integrates new insights with existing knowledge. Write factually about what the statements reveal - mention specific entities, relationships, and patterns found in the data. Avoid marketing language. Use HTML tags for structure." : "Factual HTML summary based on patterns found in the statements. Report what the data actually shows - specific entities, relationships, frequencies, and concrete insights. Avoid promotional language. Use HTML tags like <p>, <strong>, <ul>, <li> for structure. Keep it concise and evidence-based."}",
"keyEntities": ["entity1", "entity2", "entity3"],
"themes": ["${isUpdate ? 'updated_theme1", "new_theme2", "evolved_theme3' : 'theme1", "theme2", "theme3'}"],
"confidence": 0.85
}
</output>
JSON FORMATTING RULES:
- HTML content in summary field is allowed and encouraged
- Escape quotes within strings as \"
- Escape HTML angle brackets if needed: &lt; and &gt;
- Use proper HTML tags for structure: <p>, <strong>, <em>, <ul>, <li>, <h3>, etc.
- HTML content should be well-formed and semantic
GUIDELINES:
${
isUpdate
? `- Preserve valuable insights from existing summary
- Integrate new information by highlighting connections
- Themes should evolve naturally, don't replace wholesale
- The updated summary should read as a coherent whole
- Make the summary user-friendly and explain what value this space provides`
: `- Report only what the episodes actually reveal - be specific and concrete
- Cite actual content and patterns found in the episodes
- Avoid generic descriptions that could apply to any space
- Use neutral, factual language - no "comprehensive", "robust", "cutting-edge" etc.
- Themes must be backed by at least 3 supporting episodes with clear evidence
- Better to have fewer, well-supported themes than many weak ones
- Confidence should reflect actual data quality and coverage, not aspirational goals`
}`,
},
{
role: "user",
content: `SPACE INFORMATION:
Name: "${spaceName}"
Intent/Purpose: ${spaceDescription || "No specific intent provided - organize naturally based on content"}
${
isUpdate
? `EXISTING SUMMARY:
${previousSummary}
EXISTING THEMES:
${previousThemes.join(", ")}
NEW EPISODES TO INTEGRATE (${episodes.length} episodes):`
: `EPISODES IN THIS SPACE (${episodes.length} episodes):`
}
${episodesText}
${
episodes.length > 0
? `TOP WORDS BY FREQUENCY:
${topEntities.join(", ")}`
: ""
}
${
isUpdate
? "Please identify connections between the existing summary and new episodes, then update the summary to integrate the new insights coherently. Organize the summary to SERVE the space's intent/purpose. Remember: only summarize insights from the actual episode content."
: "Please analyze the episodes and provide a comprehensive summary that SERVES the space's intent/purpose. Organize sections based on what would be most valuable for this space's intended use case. If the intent is unclear, organize naturally based on content patterns. Only summarize insights from actual episode content."
}`,
},
];
}
async function getExistingSummary(spaceId: string): Promise<{
summary: string;
themes: string[];
lastUpdated: Date;
contextCount: number;
} | null> {
try {
const existingSummary = await getSpace(spaceId);
if (existingSummary?.summary) {
return {
summary: existingSummary.summary,
themes: existingSummary.themes,
lastUpdated: existingSummary.summaryGeneratedAt || new Date(),
contextCount: existingSummary.contextCount || 0,
};
}
return null;
} catch (error) {
logger.warn(`Failed to get existing summary for space ${spaceId}:`, {
error,
});
return null;
}
}
async function getSpaceEpisodes(
spaceId: string,
userId: string,
sinceDate?: Date,
): Promise<SpaceEpisodeData[]> {
// Query episodes directly using Space-[:HAS_EPISODE]->Episode relationships
const params: any = { spaceId, userId };
let dateCondition = "";
if (sinceDate) {
dateCondition = "AND e.createdAt > $sinceDate";
params.sinceDate = sinceDate.toISOString();
}
const query = `
MATCH (space:Space {uuid: $spaceId, userId: $userId})-[:HAS_EPISODE]->(e:Episode {userId: $userId})
WHERE e IS NOT NULL ${dateCondition}
RETURN DISTINCT e
ORDER BY e.createdAt DESC
`;
const result = await runQuery(query, params);
return result.map((record) => {
const episode = record.get("e").properties;
return {
uuid: episode.uuid,
content: episode.content,
originalContent: episode.originalContent,
source: episode.source,
createdAt: new Date(episode.createdAt),
validAt: new Date(episode.validAt),
metadata: JSON.parse(episode.metadata || "{}"),
sessionId: episode.sessionId,
};
});
}
function parseSummaryResponse(response: string): {
summary: string;
themes: string[];
confidence: number;
keyEntities?: string[];
} | null {
try {
// Extract content from <output> tags
const outputMatch = response.match(/<output>([\s\S]*?)<\/output>/);
if (!outputMatch) {
logger.warn("No <output> tags found in LLM summary response");
logger.debug("Full LLM response:", { response });
return null;
}
let jsonContent = outputMatch[1].trim();
let parsed;
try {
parsed = JSON.parse(jsonContent);
} catch (jsonError) {
logger.warn("JSON parsing failed, attempting cleanup and retry", {
originalError: jsonError,
jsonContent: jsonContent.substring(0, 500) + "...", // Log first 500 chars
});
// More aggressive cleanup for malformed JSON
jsonContent = jsonContent
.replace(/([^\\])"/g, '$1\\"') // Escape unescaped quotes
.replace(/^"/g, '\\"') // Escape quotes at start
.replace(/\\\\"/g, '\\"'); // Fix double-escaped quotes
parsed = JSON.parse(jsonContent);
}
// Validate the response structure
const validationResult = SummaryResultSchema.safeParse(parsed);
if (!validationResult.success) {
logger.warn("Invalid LLM summary response format:", {
error: validationResult.error,
parsedData: parsed,
});
return null;
}
return validationResult.data;
} catch (error) {
logger.error(
"Error parsing LLM summary response:",
error as Record<string, unknown>,
);
logger.debug("Failed response content:", { response });
return null;
}
}
async function storeSummary(summaryData: SpaceSummaryData): Promise<void> {
try {
// Store in PostgreSQL for API access and persistence
await updateSpace(summaryData);
// Also store in Neo4j for graph-based queries
const query = `
MATCH (space:Space {uuid: $spaceId})
SET space.summary = $summary,
space.keyEntities = $keyEntities,
space.themes = $themes,
space.summaryConfidence = $confidence,
space.summaryContextCount = $contextCount,
space.summaryLastUpdated = datetime($lastUpdated)
RETURN space
`;
await runQuery(query, {
spaceId: summaryData.spaceId,
summary: summaryData.summary,
keyEntities: summaryData.keyEntities,
themes: summaryData.themes,
confidence: summaryData.confidence,
contextCount: summaryData.contextCount,
lastUpdated: summaryData.lastUpdated.toISOString(),
});
logger.info(`Stored summary for space ${summaryData.spaceId}`, {
themes: summaryData.themes.length,
keyEntities: summaryData.keyEntities.length,
confidence: summaryData.confidence,
});
} catch (error) {
logger.error(
`Error storing summary for space ${summaryData.spaceId}:`,
error as Record<string, unknown>,
);
throw error;
}
}
/**
* Process space summary sequentially: ingest document then trigger patterns
*/
async function processSpaceSummarySequentially({
userId,
workspaceId,
spaceId,
spaceName,
summaryContent,
triggerSource,
}: {
userId: string;
workspaceId: string;
spaceId: string;
spaceName: string;
summaryContent: string;
triggerSource:
| "summary_complete"
| "manual"
| "assignment"
| "scheduled"
| "new_space"
| "growth_threshold"
| "ingestion_complete";
}): Promise<void> {
// Step 1: Ingest summary as document synchronously
await ingestSpaceSummaryDocument(spaceId, userId, spaceName, summaryContent);
logger.info(
`Successfully ingested space summary document for space ${spaceId}`,
);
// Step 2: Now trigger space patterns (patterns will have access to the ingested summary)
await triggerSpacePattern({
userId,
workspaceId,
spaceId,
triggerSource,
});
logger.info(
`Sequential processing completed for space ${spaceId}: summary ingested → patterns triggered`,
);
}
/**
* Ingest space summary as document synchronously
*/
async function ingestSpaceSummaryDocument(
spaceId: string,
userId: string,
spaceName: string,
summaryContent: string,
): Promise<void> {
// Create the ingest body
const ingestBody = {
episodeBody: summaryContent,
referenceTime: new Date().toISOString(),
metadata: {
documentType: "space_summary",
spaceId,
spaceName,
generatedAt: new Date().toISOString(),
},
source: "space",
spaceId,
sessionId: spaceId,
type: EpisodeType.DOCUMENT,
};
// Add to queue
await addToQueue(ingestBody, userId);
logger.info(`Queued space summary for synchronous ingestion`);
return;
}
// Helper function to trigger the task
export async function triggerSpaceSummary(payload: SpaceSummaryPayload) {
return await spaceSummaryTask.trigger(payload, {

View File

@ -1,4 +1,3 @@
import { type SpacePattern } from "@core/types";
import { prisma } from "./prisma";
export const getSpace = async (spaceId: string) => {
@ -11,22 +10,6 @@ export const getSpace = async (spaceId: string) => {
return space;
};
export const createSpacePattern = async (
spaceId: string,
allPatterns: Omit<
SpacePattern,
"id" | "createdAt" | "updatedAt" | "spaceId"
>[],
) => {
return await prisma.spacePattern.createMany({
data: allPatterns.map((pattern) => ({
...pattern,
spaceId,
userConfirmed: pattern.userConfirmed as any, // Temporary cast until Prisma client is regenerated
})),
});
};
export const updateSpace = async (summaryData: {
spaceId: string;
summary: string;
@ -41,7 +24,7 @@ export const updateSpace = async (summaryData: {
summary: summaryData.summary,
themes: summaryData.themes,
contextCount: summaryData.contextCount,
summaryGeneratedAt: new Date().toISOString()
summaryGeneratedAt: new Date().toISOString(),
},
});
};

View File

@ -1,3 +1,4 @@
import { randomUUID } from "node:crypto";
import { EpisodeTypeEnum } from "@core/types";
import { addToQueue } from "~/lib/ingest.server";
import { logger } from "~/services/logger.service";
@ -19,24 +20,24 @@ const SearchParamsSchema = {
description:
"Search query optimized for knowledge graph retrieval. Choose the right query structure based on your search intent:\n\n" +
"1. **Entity-Centric Queries** (Best for graph search):\n" +
" - ✅ GOOD: \"User's preferences for code style and formatting\"\n" +
" - ✅ GOOD: \"Project authentication implementation decisions\"\n" +
" - ❌ BAD: \"user code style\"\n" +
' - ✅ GOOD: "User\'s preferences for code style and formatting"\n' +
' - ✅ GOOD: "Project authentication implementation decisions"\n' +
' - ❌ BAD: "user code style"\n' +
" - Format: [Person/Project] + [relationship/attribute] + [context]\n\n" +
"2. **Multi-Entity Relationship Queries** (Excellent for episode graph):\n" +
" - ✅ GOOD: \"User and team discussions about API design patterns\"\n" +
" - ✅ GOOD: \"relationship between database schema and performance optimization\"\n" +
" - ❌ BAD: \"user team api design\"\n" +
' - ✅ GOOD: "User and team discussions about API design patterns"\n' +
' - ✅ GOOD: "relationship between database schema and performance optimization"\n' +
' - ❌ BAD: "user team api design"\n' +
" - Format: [Entity1] + [relationship type] + [Entity2] + [context]\n\n" +
"3. **Semantic Question Queries** (Good for vector search):\n" +
" - ✅ GOOD: \"What causes authentication errors in production? What are the security requirements?\"\n" +
" - ✅ GOOD: \"How does caching improve API response times compared to direct database queries?\"\n" +
" - ❌ BAD: \"auth errors production\"\n" +
' - ✅ GOOD: "What causes authentication errors in production? What are the security requirements?"\n' +
' - ✅ GOOD: "How does caching improve API response times compared to direct database queries?"\n' +
' - ❌ BAD: "auth errors production"\n' +
" - Format: Complete natural questions with full context\n\n" +
"4. **Concept Exploration Queries** (Good for BFS traversal):\n" +
" - ✅ GOOD: \"concepts and ideas related to database indexing and query optimization\"\n" +
" - ✅ GOOD: \"topics connected to user authentication and session management\"\n" +
" - ❌ BAD: \"database indexing concepts\"\n" +
' - ✅ GOOD: "concepts and ideas related to database indexing and query optimization"\n' +
' - ✅ GOOD: "topics connected to user authentication and session management"\n' +
' - ❌ BAD: "database indexing concepts"\n' +
" - Format: [concept] + related/connected + [domain/context]\n\n" +
"Avoid keyword soup queries - use complete phrases with proper context for best results.",
},
@ -75,6 +76,11 @@ const IngestSchema = {
description:
"The conversation text to store. Include both what the user asked and what you answered. Keep it concise but complete.",
},
sessionId: {
type: "string",
description:
"IMPORTANT: Session ID (UUID) is required to track the conversation session. If you don't have a sessionId in your context, you MUST call the get_session_id tool first to obtain one before calling memory_ingest.",
},
spaceIds: {
type: "array",
items: {
@ -84,14 +90,14 @@ const IngestSchema = {
"Optional: Array of space UUIDs (from memory_get_spaces). Add this to organize the memory by project. Example: If discussing 'core' project, include the 'core' space ID. Leave empty to store in general memory.",
},
},
required: ["message"],
required: ["message", "sessionId"],
};
export const memoryTools = [
{
name: "memory_ingest",
description:
"Store conversation in memory for future reference. USE THIS TOOL: At the END of every conversation after fully answering the user. WHAT TO STORE: 1) User's question or request, 2) Your solution or explanation, 3) Important decisions made, 4) Key insights discovered. HOW TO USE: Put the entire conversation summary in the 'message' field. Optionally add spaceIds array to organize by project. Returns: Success confirmation with storage ID.",
"Store conversation in memory for future reference. USE THIS TOOL: At the END of every conversation after fully answering the user. WHAT TO STORE: 1) User's question or request, 2) Your solution or explanation, 3) Important decisions made, 4) Key insights discovered. HOW TO USE: Put the entire conversation summary in the 'message' field. IMPORTANT: You MUST provide a sessionId - if you don't have one in your context, call get_session_id tool first to obtain it. Optionally add spaceIds array to organize by project. Returns: Success confirmation with storage ID.",
inputSchema: IngestSchema,
},
{
@ -150,6 +156,20 @@ export const memoryTools = [
},
},
},
{
name: "get_session_id",
description:
"Get a new session ID for the MCP connection. USE THIS TOOL: When you need a session ID and don't have one yet. This generates a unique UUID to identify your MCP session. IMPORTANT: If any other tool requires a sessionId parameter and you don't have one, call this tool first to get a session ID. Returns: A UUID string to use as sessionId.",
inputSchema: {
type: "object",
properties: {
new: {
type: "boolean",
description: "Set to true to get a new sessionId.",
},
},
},
},
{
name: "get_integrations",
description:
@ -243,6 +263,8 @@ export async function callMemoryTool(
return await handleUserProfile(userId);
case "memory_get_space":
return await handleGetSpace({ ...args, userId });
case "get_session_id":
return await handleGetSessionId();
case "get_integrations":
return await handleGetIntegrations({ ...args, userId });
case "get_integration_actions":
@ -334,6 +356,7 @@ async function handleMemoryIngest(args: any) {
source: args.source,
type: EpisodeTypeEnum.CONVERSATION,
spaceIds,
sessionId: args.sessionId,
},
args.userId,
);
@ -462,7 +485,7 @@ async function handleGetSpace(args: any) {
const spaceDetails = {
id: space.id,
name: space.name,
description: space.description,
summary: space.summary,
};
return {
@ -489,6 +512,35 @@ async function handleGetSpace(args: any) {
}
}
// Handler for get_session_id
async function handleGetSessionId() {
try {
const sessionId = randomUUID();
return {
content: [
{
type: "text",
text: JSON.stringify({ sessionId }),
},
],
isError: false,
};
} catch (error) {
logger.error(`MCP get session id error: ${error}`);
return {
content: [
{
type: "text",
text: `Error generating session ID: ${error instanceof Error ? error.message : String(error)}`,
},
],
isError: true,
};
}
}
// Handler for get_integrations
async function handleGetIntegrations(args: any) {
try {

View File

@ -44,6 +44,7 @@
"@radix-ui/react-icons": "^1.3.0",
"@radix-ui/react-label": "^2.0.2",
"@radix-ui/react-popover": "^1.0.7",
"@radix-ui/react-progress": "^1.1.4",
"@radix-ui/react-scroll-area": "^1.0.5",
"@radix-ui/react-select": "^2.0.0",
"@radix-ui/react-separator": "^1.1.7",
@ -53,7 +54,6 @@
"@radix-ui/react-tabs": "^1.0.4",
"@radix-ui/react-toast": "^1.1.5",
"@radix-ui/react-tooltip": "^1.2.7",
"@radix-ui/react-progress": "^1.1.4",
"@remix-run/express": "2.16.7",
"@remix-run/node": "2.1.0",
"@remix-run/react": "2.16.7",
@ -80,6 +80,7 @@
"@tiptap/pm": "^2.11.9",
"@tiptap/react": "^2.11.9",
"@tiptap/starter-kit": "2.11.9",
"@trigger.dev/python": "4.0.4",
"@trigger.dev/react-hooks": "4.0.4",
"@trigger.dev/sdk": "4.0.4",
"ai": "5.0.78",
@ -125,25 +126,25 @@
"react": "^18.2.0",
"react-calendar-heatmap": "^1.10.0",
"react-dom": "^18.2.0",
"react-hotkeys-hook": "^4.5.0",
"react-markdown": "10.1.0",
"react-resizable-panels": "^1.0.9",
"react-hotkeys-hook": "^4.5.0",
"react-virtualized": "^9.22.6",
"resumable-stream": "2.2.8",
"remix-auth": "^4.2.0",
"remix-auth-oauth2": "^3.4.1",
"remix-themes": "^2.0.4",
"remix-typedjson": "0.3.1",
"remix-utils": "^7.7.0",
"resumable-stream": "2.2.8",
"sigma": "^3.0.2",
"stripe": "19.0.0",
"simple-oauth2": "^5.1.0",
"stripe": "19.0.0",
"tailwind-merge": "^2.6.0",
"tiptap-markdown": "0.9.0",
"tailwind-scrollbar-hide": "^2.0.0",
"tailwindcss-animate": "^1.0.7",
"tailwindcss-textshadow": "^2.1.3",
"tiny-invariant": "^1.3.1",
"tiptap-markdown": "0.9.0",
"zod": "3.25.76",
"zod-error": "1.5.0",
"zod-validation-error": "^1.5.0"

View File

@ -0,0 +1,299 @@
# BERT Topic Modeling CLI for Echo Episodes
This CLI tool performs topic modeling on Echo episodes using BERTopic. It connects to Neo4j, retrieves episodes with their pre-computed embeddings for a given user, and discovers thematic clusters using HDBSCAN clustering.
## Features
- Connects to Neo4j database to fetch episodes
- Uses pre-computed embeddings (no need to regenerate them)
- Performs semantic topic clustering with BERTopic
- Displays topics with:
- Top keywords per topic
- Episode count per topic
- Sample episodes for each topic
- Configurable minimum topic size
- Environment variable support for easy configuration
## Prerequisites
- Python 3.8+
- Access to Neo4j database with episodes stored
- Pre-computed embeddings stored in Neo4j (in `contentEmbedding` field)
## Installation
1. Navigate to the bert directory:
```bash
cd apps/webapp/app/bert
```
2. Install dependencies:
```bash
pip install -r requirements.txt
```
## Configuration
The CLI can read Neo4j connection details from:
1. **Environment variables** (recommended) - Create a `.env` file or export:
```bash
export NEO4J_URI=bolt://localhost:7687
export NEO4J_USERNAME=neo4j
export NEO4J_PASSWORD=your_password
```
2. **Command-line options** - Pass credentials directly as flags
3. **From project root** - The tool automatically loads `.env` from the project root
## Usage
### Basic Usage
Using environment variables (most common):
```bash
python main.py <user_id>
```
### Advanced Options
```bash
python main.py <user_id> [OPTIONS]
```
**Options:**
- `--min-topic-size INTEGER`: Minimum number of episodes per topic (default: 10)
- `--nr-topics INTEGER`: Target number of topics for reduction (optional)
- `--propose-spaces`: Generate space proposals using OpenAI (requires OPENAI_API_KEY)
- `--openai-api-key TEXT`: OpenAI API key for space proposals (or use OPENAI_API_KEY env var)
- `--json`: Output only final results in JSON format (suppresses all other output)
- `--neo4j-uri TEXT`: Neo4j connection URI (default: bolt://localhost:7687)
- `--neo4j-username TEXT`: Neo4j username (default: neo4j)
- `--neo4j-password TEXT`: Neo4j password (required)
### Examples
1. **Basic usage with environment variables:**
```bash
python main.py user-123
```
2. **Custom minimum topic size:**
```bash
python main.py user-123 --min-topic-size 10
```
3. **Explicit credentials:**
```bash
python main.py user-123 \
--neo4j-uri bolt://neo4j:7687 \
--neo4j-username neo4j \
--neo4j-password mypassword
```
4. **Using Docker compose Neo4j:**
```bash
python main.py user-123 \
--neo4j-uri bolt://localhost:7687 \
--neo4j-password 27192e6432564f4788d55c15131bd5ac
```
5. **With space proposals:**
```bash
python main.py user-123 --propose-spaces
```
6. **JSON output mode (for programmatic use):**
```bash
python main.py user-123 --json
```
7. **JSON output with space proposals:**
```bash
python main.py user-123 --propose-spaces --json
```
### Get Help
```bash
python main.py --help
```
## Output Formats
### Human-Readable Output (Default)
The CLI outputs:
```
================================================================================
BERT TOPIC MODELING FOR ECHO EPISODES
================================================================================
User ID: user-123
Min Topic Size: 20
================================================================================
✓ Connected to Neo4j at bolt://localhost:7687
✓ Fetched 150 episodes with embeddings
🔍 Running BERTopic analysis (min_topic_size=20)...
✓ Topic modeling complete
================================================================================
TOPIC MODELING RESULTS
================================================================================
Total Topics Found: 5
Total Episodes: 150
================================================================================
────────────────────────────────────────────────────────────────────────────────
Topic 0: 45 episodes
────────────────────────────────────────────────────────────────────────────────
Keywords: authentication, login, user, security, session, password, token, oauth, jwt, credentials
Sample Episodes (showing up to 3):
1. [uuid-123]
Discussing authentication flow for the new user login system...
2. [uuid-456]
Implementing OAuth2 with JWT tokens for secure sessions...
3. [uuid-789]
Password reset functionality with email verification...
────────────────────────────────────────────────────────────────────────────────
Topic 1: 32 episodes
────────────────────────────────────────────────────────────────────────────────
Keywords: database, neo4j, query, graph, cypher, nodes, relationships, index, performance, optimization
Sample Episodes (showing up to 3):
...
Topic -1 (Outliers): 8 episodes
================================================================================
✓ Analysis complete!
================================================================================
✓ Neo4j connection closed
```
### JSON Output Mode (--json flag)
When using the `--json` flag, the tool outputs only a clean JSON object with no debug logs:
```json
{
"topics": {
"0": {
"keywords": ["authentication", "login", "user", "security", "session"],
"episodeIds": ["uuid-123", "uuid-456", "uuid-789"]
},
"1": {
"keywords": ["database", "neo4j", "query", "graph", "cypher"],
"episodeIds": ["uuid-abc", "uuid-def"]
}
},
"spaces": [
{
"name": "User Authentication",
"intent": "Episodes about user authentication, login systems, and security belong in this space.",
"confidence": 85,
"topics": [0, 3],
"estimatedEpisodes": 120
}
]
}
```
**JSON Output Structure:**
- `topics`: Dictionary of topic IDs with keywords and episode UUIDs
- `spaces`: Array of space proposals (only if `--propose-spaces` is used)
- `name`: Space name (2-5 words)
- `intent`: Classification intent (1-2 sentences)
- `confidence`: Confidence score (0-100)
- `topics`: Source topic IDs that form this space
- `estimatedEpisodes`: Estimated number of episodes in this space
**Use Cases for JSON Mode:**
- Programmatic consumption by other tools (see the sketch below)
- Piping output to jq or other JSON processors
- Integration with CI/CD pipelines
- Automated space creation workflows
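For example, a downstream script can shell out to the CLI and consume the JSON directly. This is a minimal sketch with hypothetical values: it assumes `main.py` is in the current directory, the `NEO4J_*` environment variables are already set, and `user-123` is a placeholder user id.
```python
import json
import subprocess

# Run the CLI in JSON mode; --json suppresses all human-readable logging,
# so stdout contains only the JSON document.
result = subprocess.run(
    ["python", "main.py", "user-123", "--json"],
    capture_output=True,
    text=True,
    check=True,
)

data = json.loads(result.stdout)
for topic_id, topic in data["topics"].items():
    print(topic_id, topic["keywords"][:5], f"{len(topic['episodeIds'])} episodes")
```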
## How It Works
1. **Connection**: Establishes connection to Neo4j database
2. **Data Fetching**: Queries all episodes for the given userId that have:
- Non-null `contentEmbedding` field
- Non-empty content
3. **Topic Modeling**: Runs BERTopic (see the sketch after this list) with:
- Pre-computed embeddings (no re-embedding needed)
- HDBSCAN clustering (automatic cluster discovery)
- Keyword extraction via c-TF-IDF
4. **Results**: Displays topics with keywords and sample episodes
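To illustrate the modeling step in isolation, the sketch below reuses pre-computed vectors the same way the CLI does. It is a minimal, hypothetical example: the contents and embeddings are synthetic placeholders standing in for episodes fetched from Neo4j, not real data.
```python
import numpy as np
from bertopic import BERTopic

# Synthetic placeholder data standing in for episodes fetched from Neo4j
# (in the real CLI these come from e.content and e.contentEmbedding).
rng = np.random.default_rng(42)
centers = rng.normal(size=(3, 384))
embeddings = np.vstack(
    [c + 0.05 * rng.normal(size=(20, 384)) for c in centers]
).astype(np.float32)
contents = [f"episode {i} about topic {i // 20}: auth login database query cache" for i in range(60)]

# embedding_model=None tells BERTopic to rely on the vectors passed to
# fit_transform, so no documents are re-embedded; HDBSCAN clusters them.
model = BERTopic(embedding_model=None, min_topic_size=10)
topics, probs = model.fit_transform(contents, embeddings=embeddings)

print(model.get_topic_info())  # one row per topic, plus -1 for outliers
```
With placeholder vectors the clusters themselves are meaningless; the point is only the call shape: pass `embeddings=` and set `embedding_model=None` so nothing is re-encoded.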
## Neo4j Query
The tool uses this Cypher query to fetch episodes:
```cypher
MATCH (e:Episode {userId: $userId})
WHERE e.contentEmbedding IS NOT NULL
AND size(e.contentEmbedding) > 0
AND e.content IS NOT NULL
AND e.content <> ''
RETURN e.uuid as uuid,
e.content as content,
e.contentEmbedding as embedding,
e.createdAt as createdAt
ORDER BY e.createdAt DESC
```
## Tuning Parameters
- **`--min-topic-size`**:
- Smaller values (5-10): More granular topics, may include noise
- Larger values (20-30): Broader topics, more coherent but fewer clusters
- Recommended: Start with 20 and adjust based on your data
## Troubleshooting
### No episodes found
- Verify the userId exists in Neo4j
- Check that episodes have `contentEmbedding` populated
- Ensure episodes have non-empty `content` field
### Connection errors
- Verify Neo4j is running: `docker ps | grep neo4j`
- Check URI format: should be `bolt://host:port`
- Verify credentials are correct
### Too few/many topics
- Adjust `--min-topic-size` parameter
- Need more topics: decrease the value (e.g., `--min-topic-size 10`)
- Need fewer topics: increase the value (e.g., `--min-topic-size 30`)
## Dependencies
- `bertopic>=0.16.0` - Topic modeling
- `neo4j>=5.14.0` - Neo4j Python driver
- `click>=8.1.0` - CLI framework
- `numpy>=1.24.0` - Numerical operations
- `python-dotenv>=1.0.0` - Environment variable loading
## License
Part of the Echo project.

384
apps/webapp/python/main.py Normal file
View File

@ -0,0 +1,384 @@
#!/usr/bin/env python3
"""
BERT Topic Modeling CLI for Echo Episodes
This CLI tool connects to Neo4j, retrieves episodes with their embeddings for a given userId,
and performs topic modeling using BERTopic to discover thematic clusters.
"""
import os
import sys
import json
from typing import List, Tuple, Dict, Any
import click
import numpy as np
from neo4j import GraphDatabase
from bertopic import BERTopic
from bertopic.vectorizers import ClassTfidfTransformer
from dotenv import load_dotenv
from sklearn.feature_extraction.text import CountVectorizer
from umap import UMAP
from hdbscan import HDBSCAN
class Neo4jConnection:
"""Manages Neo4j database connection."""
def __init__(self, uri: str, username: str, password: str, quiet: bool = False):
"""Initialize Neo4j connection.
Args:
uri: Neo4j connection URI (e.g., bolt://localhost:7687)
username: Neo4j username
password: Neo4j password
quiet: If True, suppress output messages
"""
self.quiet = quiet
try:
self.driver = GraphDatabase.driver(uri, auth=(username, password))
# Verify connection
self.driver.verify_connectivity()
if not quiet:
click.echo(f"✓ Connected to Neo4j at {uri}")
except Exception as e:
if not quiet:
click.echo(f"✗ Failed to connect to Neo4j: {e}", err=True)
sys.exit(1)
def close(self):
"""Close the Neo4j connection."""
if self.driver:
self.driver.close()
if not self.quiet:
click.echo("✓ Neo4j connection closed")
def get_episodes_with_embeddings(self, user_id: str) -> Tuple[List[str], List[str], np.ndarray]:
"""Fetch all episodes with their embeddings for a given user.
Args:
user_id: The user ID to fetch episodes for
Returns:
Tuple of (episode_uuids, episode_contents, embeddings_array)
"""
query = """
MATCH (e:Episode {userId: $userId})
WHERE e.contentEmbedding IS NOT NULL
AND size(e.contentEmbedding) > 0
AND e.content IS NOT NULL
AND e.content <> ''
RETURN e.uuid as uuid,
e.content as content,
e.contentEmbedding as embedding,
e.createdAt as createdAt
ORDER BY e.createdAt DESC
"""
with self.driver.session() as session:
result = session.run(query, userId=user_id)
records = list(result)
if not records:
if not self.quiet:
click.echo(f"✗ No episodes found for userId: {user_id}", err=True)
sys.exit(1)
uuids = []
contents = []
embeddings = []
for record in records:
uuids.append(record["uuid"])
contents.append(record["content"])
embeddings.append(record["embedding"])
embeddings_array = np.array(embeddings, dtype=np.float32)
if not self.quiet:
click.echo(f"✓ Fetched {len(contents)} episodes with embeddings")
return uuids, contents, embeddings_array
def run_bertopic_analysis(
contents: List[str],
embeddings: np.ndarray,
min_topic_size: int = 20,
nr_topics: int = None,
quiet: bool = False
) -> Tuple[BERTopic, List[int], List[float]]:
"""Run BERTopic clustering on episode contents with improved configuration.
Args:
contents: List of episode content strings
embeddings: Pre-computed embeddings for the episodes
min_topic_size: Minimum number of documents per topic
nr_topics: Target number of topics (optional, for topic reduction)
quiet: If True, suppress output messages
Returns:
Tuple of (bertopic_model, topic_assignments, probabilities)
"""
if not quiet:
click.echo(f"\n🔍 Running BERTopic analysis (min_topic_size={min_topic_size})...")
# Step 1: Configure UMAP for dimensionality reduction
# More aggressive reduction helps find distinct clusters
umap_model = UMAP(
n_neighbors=15, # Balance between local/global structure
n_components=5, # Reduce to 5 dimensions
min_dist=0.0, # Allow tight clusters
metric='cosine', # Use cosine similarity
random_state=42
)
# Step 2: Configure HDBSCAN for clustering
# Tuned to find more granular topics
hdbscan_model = HDBSCAN(
min_cluster_size=min_topic_size, # Minimum episodes per topic
min_samples=5, # More sensitive to local density
metric='euclidean',
cluster_selection_method='eom', # Excess of mass method
prediction_data=True
)
# Step 3: Configure vectorizer with stopword removal
# Remove common English stopwords that pollute topic keywords
vectorizer_model = CountVectorizer(
stop_words='english', # Remove common English words
min_df=2, # Word must appear in at least 2 docs
max_df=0.95, # Ignore words in >95% of docs
ngram_range=(1, 2) # Include unigrams and bigrams
)
# Step 4: Configure c-TF-IDF with better parameters
ctfidf_model = ClassTfidfTransformer(
reduce_frequent_words=True, # Further reduce common words
bm25_weighting=True # Use BM25 for better keyword extraction
)
# Step 5: Initialize BERTopic with all custom components
model = BERTopic(
embedding_model=None, # Use pre-computed embeddings
umap_model=umap_model,
hdbscan_model=hdbscan_model,
vectorizer_model=vectorizer_model,
ctfidf_model=ctfidf_model,
top_n_words=15, # More keywords per topic
nr_topics=nr_topics, # Optional topic reduction
calculate_probabilities=True,
verbose=(not quiet)
)
# Fit the model with pre-computed embeddings
topics, probs = model.fit_transform(contents, embeddings=embeddings)
# Get topic count
unique_topics = len(set(topics)) - (1 if -1 in topics else 0)
if not quiet:
click.echo(f"✓ Topic modeling complete - Found {unique_topics} topics")
return model, topics, probs
def print_topic_results(
model: BERTopic,
topics: List[int],
uuids: List[str],
contents: List[str]
):
"""Print formatted topic results.
Args:
model: Fitted BERTopic model
topics: Topic assignments for each episode
uuids: Episode UUIDs
contents: Episode contents
"""
# Get topic info
topic_info = model.get_topic_info()
num_topics = len(topic_info) - 1 # Exclude outlier topic (-1)
click.echo(f"\n{'='*80}")
click.echo(f"TOPIC MODELING RESULTS")
click.echo(f"{'='*80}")
click.echo(f"Total Topics Found: {num_topics}")
click.echo(f"Total Episodes: {len(contents)}")
click.echo(f"{'='*80}\n")
# Print each topic
for idx, row in topic_info.iterrows():
topic_id = row['Topic']
count = row['Count']
# Skip outlier topic
if topic_id == -1:
click.echo(f"Topic -1 (Outliers): {count} episodes\n")
continue
# Get top words for this topic
topic_words = model.get_topic(topic_id)
click.echo(f"{''*80}")
click.echo(f"Topic {topic_id}: {count} episodes")
click.echo(f"{''*80}")
# Print top keywords
if topic_words:
keywords = [word for word, score in topic_words[:10]]
click.echo(f"Keywords: {', '.join(keywords)}")
# Print sample episodes
topic_episodes = [(uuid, content) for uuid, content, topic
in zip(uuids, contents, topics) if topic == topic_id]
click.echo(f"\nSample Episodes (showing up to 3):")
for i, (uuid, content) in enumerate(topic_episodes[:3]):
# Truncate content for display
truncated = content[:200] + "..." if len(content) > 200 else content
click.echo(f" {i+1}. [{uuid}]")
click.echo(f" {truncated}\n")
click.echo()
def build_json_output(
model: BERTopic,
topics: List[int],
uuids: List[str]
) -> Dict[str, Any]:
"""Build JSON output structure.
Args:
model: Fitted BERTopic model
topics: Topic assignments for each episode
uuids: Episode UUIDs
Returns:
Dictionary with topics data
"""
# Build topics dictionary
topics_dict = {}
topic_info = model.get_topic_info()
for idx, row in topic_info.iterrows():
topic_id = row['Topic']
# Skip outlier topic
if topic_id == -1:
continue
# Get keywords
topic_words = model.get_topic(topic_id)
keywords = [word for word, score in topic_words[:10]] if topic_words else []
# Get episode IDs for this topic
episode_ids = [uuid for uuid, topic in zip(uuids, topics) if topic == topic_id]
topics_dict[topic_id] = {
"keywords": keywords,
"episodeIds": episode_ids
}
return {"topics": topics_dict}
@click.command()
@click.argument('user_id', type=str)
@click.option(
'--min-topic-size',
default=10,
type=int,
help='Minimum number of episodes per topic (default: 10, lower = more granular topics)'
)
@click.option(
'--nr-topics',
default=None,
type=int,
help='Target number of topics for reduction (optional, e.g., 20 for ~20 topics)'
)
@click.option(
'--neo4j-uri',
envvar='NEO4J_URI',
default='bolt://localhost:7687',
help='Neo4j connection URI (default: bolt://localhost:7687)'
)
@click.option(
'--neo4j-username',
envvar='NEO4J_USERNAME',
default='neo4j',
help='Neo4j username (default: neo4j)'
)
@click.option(
'--neo4j-password',
envvar='NEO4J_PASSWORD',
required=True,
help='Neo4j password (required, can use NEO4J_PASSWORD env var)'
)
@click.option(
'--json',
'json_output',
is_flag=True,
default=False,
help='Output only final results in JSON format (suppresses all other output)'
)
def main(user_id: str, min_topic_size: int, nr_topics: int, neo4j_uri: str, neo4j_username: str, neo4j_password: str, json_output: bool):
"""
Run BERTopic analysis on episodes for a given USER_ID.
This tool connects to Neo4j, retrieves all episodes with embeddings for the specified user,
and performs topic modeling to discover thematic clusters.
Examples:
# Using environment variables from .env file
python main.py user-123
# With custom min topic size
python main.py user-123 --min-topic-size 10
# With explicit Neo4j credentials
python main.py user-123 --neo4j-uri bolt://localhost:7687 --neo4j-password mypassword
"""
# Print header only if not in JSON mode
if not json_output:
click.echo(f"\n{'='*80}")
click.echo("BERT TOPIC MODELING FOR ECHO EPISODES")
click.echo(f"{'='*80}")
click.echo(f"User ID: {user_id}")
click.echo(f"Min Topic Size: {min_topic_size}")
if nr_topics:
click.echo(f"Target Topics: ~{nr_topics}")
click.echo(f"{'='*80}\n")
# Connect to Neo4j (quiet mode if JSON output)
neo4j_conn = Neo4jConnection(neo4j_uri, neo4j_username, neo4j_password, quiet=json_output)
try:
# Fetch episodes with embeddings
uuids, contents, embeddings = neo4j_conn.get_episodes_with_embeddings(user_id)
# Run BERTopic analysis
model, topics, probs = run_bertopic_analysis(contents, embeddings, min_topic_size, nr_topics, quiet=json_output)
# Output results
if json_output:
# JSON output mode - only print JSON
output = build_json_output(model, topics, uuids)
click.echo(json.dumps(output, indent=2))
else:
# Normal output mode - print formatted results
print_topic_results(model, topics, uuids, contents)
click.echo(f"{'='*80}")
click.echo("✓ Analysis complete!")
click.echo(f"{'='*80}\n")
finally:
# Always close connection
neo4j_conn.close()
if __name__ == '__main__':
# Load environment variables from .env file if present
load_dotenv()
main()

View File

@ -0,0 +1,8 @@
bertopic>=0.16.0
neo4j>=5.14.0
click>=8.1.0
numpy>=1.24.0
python-dotenv>=1.0.0
scikit-learn>=1.3.0
umap-learn>=0.5.4
hdbscan>=0.8.33

View File

@ -1,6 +1,7 @@
import { defineConfig } from "@trigger.dev/sdk/v3";
import { syncEnvVars } from "@trigger.dev/build/extensions/core";
import { prismaExtension } from "@trigger.dev/build/extensions/prisma";
import { pythonExtension } from "@trigger.dev/python/extension";
export default defineConfig({
project: process.env.TRIGGER_PROJECT_ID as string,
@ -23,6 +24,9 @@ export default defineConfig({
dirs: ["./app/trigger"],
build: {
extensions: [
pythonExtension({
scripts: ["./python/*.py"],
}),
syncEnvVars(() => ({
// ANTHROPIC_API_KEY: process.env.ANTHROPIC_API_KEY as string,
// API_BASE_URL: process.env.API_BASE_URL as string,

View File

@ -55,7 +55,16 @@ RUN pnpm run build --filter=webapp...
# Runner
FROM ${NODE_IMAGE} AS runner
RUN apt-get update && apt-get install -y openssl netcat-openbsd ca-certificates
RUN apt-get update && apt-get install -y \
openssl \
netcat-openbsd \
ca-certificates \
python3 \
python3-pip \
python3-venv \
gcc \
g++ \
&& rm -rf /var/lib/apt/lists/*
WORKDIR /core
RUN corepack enable
ENV NODE_ENV production
@ -69,6 +78,13 @@ COPY --from=builder --chown=node:node /core/apps/webapp/build ./apps/webapp/buil
COPY --from=builder --chown=node:node /core/apps/webapp/public ./apps/webapp/public
COPY --from=builder --chown=node:node /core/scripts ./scripts
# Install BERT Python dependencies
COPY --chown=node:node apps/webapp/python/requirements.txt ./apps/webapp/python/requirements.txt
RUN pip3 install --no-cache-dir -r ./apps/webapp/python/requirements.txt
# Copy BERT scripts
COPY --chown=node:node apps/webapp/python/main.py ./apps/webapp/python/main.py
EXPOSE 3000
USER node

View File

@ -14,9 +14,7 @@ description: "Get started with CORE in 5 minutes"
## Requirements
These are the minimum requirements for running the webapp and background job components. They can run on the same, or on separate machines.
It's fine to run everything on the same machine for testing. To be able to scale your workers, you will want to run them separately.
These are the minimum requirements for running CORE.
### Prerequisites
@ -27,7 +25,6 @@ To run CORE, you will need:
### System Requirements
**Webapp & Database Machine:**
- 4+ vCPU
- 8+ GB RAM
- 20+ GB Storage
@ -41,7 +38,7 @@ CORE offers multiple deployment approaches depending on your needs:
For a one-click deployment experience, use Railway:
[![Deploy on Railway](https://railway.com/button.svg)](https://railway.com/deploy/6aEd9C?referralCode=LHvbIb&utm_medium=integration&utm_source=template&utm_campaign=generic)
[![Deploy on Railway](https://railway.com/button.svg)](https://railway.com/deploy/core?referralCode=LHvbIb&utm_medium=integration&utm_source=template&utm_campaign=generic)
Railway will automatically set up all required services and handle the infrastructure for you.

View File

@ -16,7 +16,7 @@ We provide version-tagged releases for self-hosted deployments. It's highly advi
For a quick one-click deployment, you can use Railway:
[![Deploy on Railway](https://railway.com/button.svg)](https://railway.com/deploy/6aEd9C?referralCode=LHvbIb&utm_medium=integration&utm_source=template&utm_campaign=generic)
[![Deploy on Railway](https://railway.com/button.svg)](https://railway.com/deploy/core?referralCode=LHvbIb&utm_medium=integration&utm_source=template&utm_campaign=generic)
Alternatively, you can follow our [Docker deployment guide](/self-hosting/docker) for manual setup.

View File

@ -52,4 +52,6 @@ MODEL=gpt-4.1-2025-04-14
## for opensource embedding model
# EMBEDDING_MODEL=mxbai-embed-large
QUEUE_PROVIDER=bullmq
QUEUE_PROVIDER=bullmq
TELEMETRY_ENABLED=false

View File

@ -33,6 +33,7 @@ services:
- ENABLE_EMAIL_LOGIN=${ENABLE_EMAIL_LOGIN}
- OLLAMA_URL=${OLLAMA_URL}
- EMBEDDING_MODEL=${EMBEDDING_MODEL}
- EMBEDDING_MODEL_SIZE=${EMBEDDING_MODEL_SIZE}
- MODEL=${MODEL}
- TRIGGER_PROJECT_ID=${TRIGGER_PROJECT_ID}
- TRIGGER_SECRET_KEY=${TRIGGER_SECRET_KEY}

View File

@ -0,0 +1,2 @@
-- AlterTable
ALTER TABLE "Workspace" ADD COLUMN "metadata" JSONB NOT NULL DEFAULT '{}';

View File

@ -694,6 +694,8 @@ model Workspace {
slug String @unique
icon String?
metadata Json @default("{}")
integrations String[]
userId String? @unique

2077
pnpm-lock.yaml generated

File diff suppressed because it is too large