4 Commits

Author SHA1 Message Date
Jake d0cd215210 ensuring proper submodule 2026-03-22 08:47:56 +00:00
Jake 986c8103c4 feat: 🔒 Starting the refactor 2026-03-22 08:18:49 +00:00
Jake-Pullen 90c88b068b Merge pull request #1 from Jake-Pullen/ai_in_the_middle (feat: AI Powered enhanced queries to get better results) 2026-03-07 11:22:11 +00:00
Jake 26c0049fd8 feat: AI Powered enhanced queries to get better results 2026-03-07 11:08:21 +00:00
13 changed files with 403 additions and 77 deletions
.gitmodules +4

@@ -0,0 +1,4 @@
+[submodule "toon-python"]
+	path = toon-python
+	url = https://github.com/toon-format/toon-python.git
+	branch = main
+2

@@ -17,3 +17,5 @@
 * entity chunking & re-ranking
 * Logging in Ingestion
+* database retrieve for tag or entity
+*
config.yaml +18 -4

@@ -8,11 +8,14 @@ models:
   enrich: "lm_studio/qwen-" # will have an identifier, based on amount of active LLMs see ./load_ingestion_llms.sh
   embedding: "text-embedding-qwen3-embedding-8b"
   retrieval: "lm_studio/qwen/qwen3-30b-a3b-2507"
+  expansion: "lm_studio/qwen/qwen3-30b-a3b-2507"

 # --- Ingestion Settings ---
 ingestion:
-  data_dir: "/home/cosmic/DnD"
-  db_path: "./data/dmv.db"
+  data_dir: "/home/jake/dnd_test/"
+  db_path: "./data/"
+  db_name: "dmv.db"
+  toon_dir: "./data/toon_files"
   active_llms: 2
   parallel_requests_per_llm: 2
   chunk_size: 800
@@ -27,8 +30,12 @@ ingestion_agent:
     Analyze the provided notes and extract a concise synopsis and relevant metadata.
     synopsis = A one-sentence summary of the document.
     tags = Relevant tags (NPCs, Locations, Items, Plot Points).
-    entities = a list of Key names of people, places, or factions.
-    "note -> synopsis:str, tags: list[str], entities: list[str]"
+    entities = A list of Key names of people, places, or factions found in the document.
+    relationships = A list of object relationships between entities. For each pair of entities that appear together,
+    specify their relationship type (ally, enemy, mentor, servant, family, business_partner, etc.)
+    and connection strength (1-5 based on how often they appear together).
+    Format: [{"entity1": "Name", "entity2": "Name", "type": "relationship_type", "strength": int}, ...]
+    Output ONLY the metadata dictionary with these keys.

 retrieval_agent:
   retrieval_signature: |
@@ -36,3 +43,10 @@ retrieval_agent:
     Given the context and the question, answer the question.
     Do not make things up, base all of your answers on the context.
     Always site the file location of your source of information.
+
+expansion_agent:
+  expansion_signature: |
+    You are a query expansion expert, specialised in Dungeons and Dragons.
+    Given a user's question, generate 3-5 similar but enhanced search queries that would help find more relevant information.
+    Each expanded query should be distinct and add different perspective to the original question.
+    Return only the queries as a JSON list with key "queries"."""
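The new ingestion signature asks the enrichment model for relationships as a list of dicts. A minimal sketch of defensively validating that shape before storing it; the helper name and the fallback values are illustrative, not part of the repo:

# Hypothetical validator for the relationship format the signature requests:
# [{"entity1": "Name", "entity2": "Name", "type": "relationship_type", "strength": int}, ...]
def validate_relationships(raw: list) -> list[dict]:
    valid = []
    for rel in raw or []:
        if not isinstance(rel, dict) or not {"entity1", "entity2"} <= rel.keys():
            continue  # skip anything the LLM malformed
        valid.append({
            "entity1": str(rel["entity1"]),
            "entity2": str(rel["entity2"]),
            "type": str(rel.get("type", "co-occurs_with")),
            "strength": min(5, max(1, int(rel.get("strength", 1)))),  # clamp to the 1-5 range the prompt asks for
        })
    return valid

print(validate_relationships(
    [{"entity1": "Goblin King", "entity2": "Orc Commander", "type": "enemy", "strength": 9}]
))  # strength clamped to 5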
+13

@@ -0,0 +1,13 @@
+the idea here is to drop the vectors and semantic search, in favour of an optimised knowledge base and llm tool calling.
+
+the current implementation loads the closest chunks based on semantic similarity.
+
+what if:
+we ingest and enrich with a focus on tagging entities (knowing our qa will be around entities)
+we transform, grouping all entity related information together
+we load that grouped information out into toon files.
+we give the agent a tool to load 1 or more toon files based on entities in the question.
+
+the context window for a modern llm is big enough to fit the entire campaign notes, but we still risk poisoning or confusion if we fill the context window with irrelevant notes.
+
+also wonder if we should give the full file at enrichment rather than chunks? worth experimenting...
config_loader.py +13 -3

@@ -6,6 +6,16 @@ def load_config(config_path="config.yaml"):
         return yaml.safe_load(f)

-# Usage example:
-# CFG = load_config()
-# print(CFG['api']['base_url'])
+def update_ingestion_signature(new_signature: str):
+    """Update the ingestion signature in config.yaml for relationship extraction."""
+    import yaml
+    with open("config.yaml") as f:
+        cfg = yaml.safe_load(f)
+    cfg["ingestion_agent"]["ingestion_signature"] = new_signature
+    with open("config.yaml", "w") as f:
+        yaml.dump(cfg, f, default_flow_style=False)
+    return cfg
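A quick usage sketch for the new helper; the replacement signature text here is invented for illustration:

from config_loader import load_config, update_ingestion_signature

# rewrites config.yaml in place and returns the updated config dict
cfg = update_ingestion_signature("Extract synopsis, tags, entities and relationships from the note.")
print(cfg["ingestion_agent"]["ingestion_signature"])

# later calls to load_config() read the rewritten file
assert load_config()["ingestion_agent"]["ingestion_signature"] == cfg["ingestion_agent"]["ingestion_signature"]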
embedding.py +2 -1

@@ -1,5 +1,6 @@
 import requests
 from langchain_core.embeddings import Embeddings

 from config_loader import load_config

 CFG = load_config()
@@ -37,7 +38,7 @@ class LocalLMEmbeddings(Embeddings):
         for i in range(0, len(texts), self.batch_size):
             batch = texts[i : i + self.batch_size]
-            print(f"🚀 Processing batch {(i // self.batch_size) + 1} (Size: {len(batch)})...")
+            # print(f"🚀 Processing batch {(i // self.batch_size) + 1} (Size: {len(batch)})...")
             batch_vectors = self._post_request(batch)
             all_embeddings.extend(batch_vectors)
experts/ingestion_agent.py +28 -1

@@ -11,10 +11,37 @@ class IngestionSignature(dspy.Signature):
     note: str = dspy.InputField(desc="The DM notes or session recap content.")
     answer: dict[str, str | List] = dspy.OutputField(
-        desc="the metadata dictionary with the keys; synopsis, tags, entities"
+        desc="the metadata dictionary with the keys; synopsis, tags, entities, relationships"
     )

 class IngestionAgent(dspy.Module):
     def __init__(self):
         self.ingest = dspy.Predict(IngestionSignature)

+    def ingest_with_relationships(self, note: str) -> dict:
+        """Ingest notes and return metadata including extracted relationships."""
+        response = self.ingest(note=note)
+        result = response.answer
+        if not isinstance(result, dict):
+            result = {
+                "synopsis": "Failed to parse",
+                "tags": [],
+                "entities": [],
+                "relationships": [],
+            }
+        if "relationships" not in result:
+            entities = result.get("entities", [])
+            relationships = []
+            for i, ent1 in enumerate(entities):
+                for ent2 in entities[i + 1 :]:
+                    relationships.append(
+                        {"entity1": ent1, "entity2": ent2, "type": "co-occurs_with", "strength": 1}
+                    )
+            result["relationships"] = relationships
+        return result
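The fallback above pairs every entity with every later entity; a standalone sketch showing the same pairs via itertools.combinations:

from itertools import combinations

entities = ["Goblin King", "Orc Commander", "Elf Scout"]
# identical pairs to the nested loop in ingest_with_relationships
relationships = [
    {"entity1": a, "entity2": b, "type": "co-occurs_with", "strength": 1}
    for a, b in combinations(entities, 2)
]
print(len(relationships))  # 3 entities -> 3 unordered pairs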
+77 -57

@@ -1,35 +1,33 @@
 import os
-import turso
+from pathlib import Path

 import dspy

 from config_loader import load_config
-from embedding import LocalLMEmbeddings
+from toon_utils import decode_entity_toon, sanitize_entity_name

 CFG = load_config()
-DATABASE_PATH = CFG["ingestion"]["db_path"]
-EMBEDDING_MODEL = CFG["models"]["embedding"]
-API_BASE = CFG["api"]["base_url"]
+TOON_DIR = CFG["ingestion"]["toon_dir"]
 RETRIEVAL_CONFIG = CFG["retrieval_agent"]

-def retrieve_from_turso(embedded_question, k=5):
-    query = f"""
-        SELECT file_path, synopsis, tags, entities, chunk_data,
-        vector_distance_cos(embedding, vector32('{embedded_question[0]}')) AS distance
-        FROM notes
-        ORDER BY distance ASC
-        LIMIT {k};
-    """
-    con = turso.connect(DATABASE_PATH)
-    cur = con.cursor()
-    cur.execute(query)
-    rows = cur.fetchall()
-    return rows
+class EntityLookupSignature(dspy.Signature):
+    """Look up entity information from TOON files."""
+
+    question: str = dspy.InputField(desc="The user's question containing entity names.")
+    answer: str = dspy.OutputField(
+        desc="Comma-separated list of entity names found in the question."
+    )
+
+class FileLookupSignature(dspy.Signature):
+    """Extract file paths mentioned in questions."""
+
+    question: str = dspy.InputField()
+    answer: str = dspy.OutputField(desc="Comma-separated list of file paths.")

-# --- DSPy Signature ---
 class DnDContextQA(dspy.Signature):
     f"{RETRIEVAL_CONFIG['retrieval_signature']}"
@@ -41,56 +39,78 @@ class DnDContextQA(dspy.Signature):
 class DnDRAG(dspy.Module):
     def __init__(self):
         super().__init__()
-        self.embeddings_model = LocalLMEmbeddings(
-            model=EMBEDDING_MODEL,
-            base_url=API_BASE,
-            batch_size=1, # we only send 1 question at a time.
+        self.retrieval_lm = dspy.LM(
+            model=CFG["models"]["retrieval"],
+            api_base=CFG["api"]["base_url"] + CFG["api"]["api_version"],
         )
-        # Tools exposed to the ReAct loop
-        self.tools = [self.load_file]
-        self.generate_answer = dspy.ReAct(signature=DnDContextQA, tools=self.tools)
+        self.entity_extractor = dspy.Predict(EntityLookupSignature)
+        self.file_extractor = dspy.Predict(FileLookupSignature)
+        self.generate_answer = dspy.ReAct(
+            signature=DnDContextQA, tools=[self.load_entity, self.load_file]
+        )

     def forward(self, question):
-        # TODO: Add step here to LLM Expand
-        # given the current question, generate 3-5 distinct search queries.
-        # embed all the questions
-        embedded_question = self.embeddings_model._post_request(question)
-        # store the 5 from all 3-5 questions (15 - 25 results)
-        results = retrieve_from_turso(embedded_question, k=5) # k is limit to return
-
-        # Format context as before
-        context_parts = []
-        for i, row in enumerate(results):
-            source = row[0]  # file_path
-            synopsis = row[1]  # synopsis
-            tags = row[2]  # tags
-            entities = row[3]  # entities
-            content = row[4]  # chunk_data
-            context_parts.append(f"""
---- Chunk {i + 1} from {source} ---
-synopsis: {synopsis},
-tags: {tags},
-entities: {entities}
-{content}
-""")
-        # print('Closest embedding hits')
-        # for part in context_parts:
-        #     print(part)
-        context = "\n\n".join(context_parts)
+        print("Processing query with TOON-based retrieval...")
+        with dspy.context(lm=self.retrieval_lm):
+            entities_resp = self.entity_extractor(question=question)
+        entity_list = [e.strip() for e in entities_resp.answer.split(",")]
+
+        all_results = []
+        for entity_name in entity_list:
+            if not entity_name:
+                continue
+            entity_data = self.load_entity(entity_name)
+            if entity_data:
+                all_results.append(f"Entity: {entity_name}\n{entity_data}")
+
+        with dspy.context(lm=self.retrieval_lm):
+            files_resp = self.file_extractor(question=question)
+        file_list = [f.strip() for f in files_resp.answer.split(",")]
+        for file_path in file_list:
+            if not file_path:
+                continue
+            file_content = self.load_file(file_path)
+            if file_content:
+                all_results.append(f"File: {file_path}\n{file_content}")
+
+        context = "\n\n".join(all_results) if all_results else "No relevant information found."

         prediction = self.generate_answer(context=context, question=question)
         return dspy.Prediction(answer=prediction.answer, context=context)

-    def load_file(self, file_path) -> str | None:
-        """Load and return specified file."""
+    def load_entity(self, entity_name: str) -> str | None:
+        """Load and decode entity data from TOON file."""
+        sanitized = sanitize_entity_name(entity_name)
+        toon_path = Path(TOON_DIR) / f"{sanitized}.toon"
+        if not toon_path.exists():
+            return None
+        try:
+            with open(toon_path, "r", encoding="utf-8") as f:
+                content = f.read()
+            decoded = decode_entity_toon(content)
+            return str(decoded)
+        except Exception as e:
+            print(f"Error loading entity {entity_name}: {e}")
+            return None
+
+    def load_file(self, file_path: str) -> str | None:
+        """Load and return specified file content."""
         if os.path.exists(file_path):
             try:
-                with open(file_path) as file:
-                    return file.read()
-            except Exception:
+                with open(file_path, encoding="utf-8") as f:
+                    return f.read()
+            except Exception as e:
+                print(f"Error reading file {file_path}: {e}")
                 return None
         else:
             return None
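A minimal sketch of driving the new module; the dspy.configure call and the question are assumptions, and DnDRAG is the class from the diff above:

import dspy
from config_loader import load_config

CFG = load_config()
# a default LM for the ReAct answer loop; forward() temporarily switches to
# self.retrieval_lm via dspy.context() for the extractor steps
dspy.configure(lm=dspy.LM(
    model=CFG["models"]["retrieval"],
    api_base=CFG["api"]["base_url"] + CFG["api"]["api_version"],
))

rag = DnDRAG()
pred = rag(question="What do we know about the Goblin King?")
print(pred.answer)
print(pred.context[:200])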
+11 -9

@@ -12,10 +12,12 @@ from tqdm import tqdm
 from config_loader import load_config
 from embedding import LocalLMEmbeddings
 from experts.ingestion_agent import IngestionAgent
+from toon_utils import save_entities_from_chunks

 CFG = load_config()
 DATA_DIR = CFG["ingestion"]["data_dir"]
 DATABASE_PATH = CFG["ingestion"]["db_path"]
+DATABASE_NAME = CFG["ingestion"]["db_name"]
 MODEL_BASE = CFG["models"]["enrich"]
 EMBEDDING_MODEL = CFG["models"]["embedding"]
 API_BASE = CFG["api"]["base_url"]
@@ -139,13 +141,10 @@ def embed_chunks(chunks: List[Any], batch_size: int = EMBEDDING_BATCH_SIZE) -> L
     # Process chunks in batches
     for i in tqdm(range(0, total_chunks, batch_size), desc="Embedding batches"):
         batch = chunks[i : i + batch_size]
-        print(f"🚀 Processing batch {(i // batch_size) + 1} (Size: {len(batch)})...")
         batch_content = [chunk.page_content for chunk in batch]

         try:
-            # Use model's batched embedding method
-            # batch_embeddings = embeddings_model.embed_query(batch_content)
             batch_embeddings = embeddings_model.embed_documents(batch_content)

             # Process each chunk in the batch
             for j, (chunk, embedding) in enumerate(zip(batch, batch_embeddings)):
                 # Extract metadata
@@ -208,7 +207,7 @@ def embed_chunks(chunks: List[Any], batch_size: int = EMBEDDING_BATCH_SIZE) -> L
                 {
                     "file_path": normalize_path(chunk.metadata.get("full_path", "unknown")),
                     "file_name": chunk.metadata.get("source", "unknown"),
-                    "chunk_data": content,
+                    "chunk_data": chunk.page_content,
                     "synopsis": "Embedding failed",
                     "tags": ["error"],
                     "entities": [],
@@ -228,7 +227,7 @@ def save_to_db(chunk_dicts):
     Each dict maps to a row in the 'notes' table.
     """
     print("connecting to db")
-    con = turso.connect(DATABASE_PATH)
+    con = turso.connect(DATABASE_PATH + DATABASE_NAME)
     print("opening cursor")
     cur = con.cursor()
@@ -252,7 +251,7 @@ def save_to_db(chunk_dicts):
                 entry["chunk_data"],
                 entry["synopsis"],
                 ",".join(entry["tags"]),  # Store as comma-separated string
-                ",".join(entry["entities"]),  # Store as comma-separated string
+                ",".join(e.get("name", str(e)) if isinstance(e, dict) else str(e) for e in entry["entities"]),  # Store as comma-separated string
                 embedding_str,
                 entry["timestamp"],
             )
@@ -267,7 +266,8 @@ def save_to_db(chunk_dicts):
 def create_db():
-    con = turso.connect(DATABASE_PATH)
+    Path(DATABASE_PATH).mkdir(exist_ok=True)
+    con = turso.connect(DATABASE_PATH + DATABASE_NAME)
     cur = con.cursor()

     cur.execute("""
@@ -334,7 +334,7 @@ def delete_from_db(embedded_chunks):
     print(f"Deleting existing rows for {len(file_paths)} file(s)")
-    con = turso.connect(DATABASE_PATH)
+    con = turso.connect(DATABASE_PATH + DATABASE_NAME)
     cur = con.cursor()

     # Use a single DELETE statement with IN clause for efficiency
@@ -371,6 +371,8 @@ def main():
     embedded_chunks = embed_chunks(enriched_chunks)
     print(f"Embedded {len(embedded_chunks)} chunks.")

+    save_entities_from_chunks(embedded_chunks)
+
     # remove existing rows from notes table that match file path
     delete_from_db(embedded_chunks)
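The entities column join now tolerates both plain strings and dicts, since the enrichment model may return either shape; a standalone illustration of that expression:

# entities may arrive as strings or as dicts such as {"name": ...}
entities = ["Goblin King", {"name": "Orc Commander"}, {"entity": "Elf Scout"}]

joined = ",".join(
    e.get("name", str(e)) if isinstance(e, dict) else str(e) for e in entities
)
# a dict without a "name" key falls back to its repr:
print(joined)  # Goblin King,Orc Commander,{'entity': 'Elf Scout'}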
+11

@@ -0,0 +1,11 @@
+from toon_utils import encode_entity_toon, sanitize_entity_name
+
+test_name = "Goblin King"
+sanitized = sanitize_entity_name(test_name)
+print(f"Original: {test_name} -> Sanitized: {sanitized}")
+relationships = [
+    {"entity1": "Goblin King", "entity2": "Orc Commander", "type": "enemy", "strength": 5}
+]
+content_refs = [{"file": "session_001.txt", "chunk_index": 0}]
+toon_data = encode_entity_toon(test_name, "npc", relationships, content_refs)
+print(f"TOON encoded (first 200 chars): {toon_data[:200]}")
toon_utils.py +221

@@ -0,0 +1,221 @@
+import sys
+from pathlib import Path
+from typing import Any
+
+sys.path.insert(0, "/home/jake/source/dungeon_masters_vault/toon-python/src")
+
+try:
+    from toon_format import decode as toon_decode
+    from toon_format import encode as toon_encode
+except ImportError:
+    raise ImportError(
+        "toon_format not found. Ensure the toon-python library is installed and available.\n"
+        "Install with: pip install -e /path/to/toon-python"
+    )
+
+from config_loader import load_config
+
+CFG = load_config()
+TOON_DIR = Path(CFG["ingestion"]["toon_dir"])
+
+
+def sanitize_entity_name(name: str) -> str:
+    """Convert entity name to valid filename: lowercase, underscores for spaces, remove special chars."""
+    import re
+
+    name = name.lower().strip()
+    name = name.replace(" ", "_")
+    name = re.sub(r"[^a-z0-9_]", "", name)
+    return name
+
+
+def encode_entity_toon(
+    entity_name: str, entity_type: str, relationships: list[dict], content_references: list[dict]
+) -> str:
+    """Encode entity data to TOON format."""
+    data = {
+        "entity": [{"name": entity_name, "type": entity_type}],
+        "relationships": relationships,
+        "content_references": content_references,
+    }
+    return toon_encode(data)
+
+
+def decode_entity_toon(toon_content: str) -> dict[str, Any]:
+    """Decode TOON content back to Python dictionary."""
+    return toon_decode(toon_content)
+
+
+def save_entity_toon(
+    entity_name: str,
+    entity_type: str,
+    relationships: list[dict],
+    content_references: list[dict],
+    output_dir: Path | None = None,
+) -> Path:
+    """Save entity data as a TOON file and return the path."""
+    if output_dir is None:
+        output_dir = Path(TOON_DIR)
+    output_dir.mkdir(parents=True, exist_ok=True)
+    sanitized_name = sanitize_entity_name(entity_name)
+    toon_path = output_dir / f"{sanitized_name}.toon"
+    toon_content = encode_entity_toon(entity_name, entity_type, relationships, content_references)
+    with open(toon_path, "w", encoding="utf-8") as f:
+        f.write(toon_content)
+    return toon_path
+
+
+def load_entity_toon(entity_name: str, input_dir: Path | None = None) -> dict[str, Any] | None:
+    """Load and decode a TOON file for an entity."""
+    if input_dir is None:
+        input_dir = Path(TOON_DIR)
+    sanitized_name = sanitize_entity_name(entity_name)
+    toon_path = input_dir / f"{sanitized_name}.toon"
+    if not toon_path.exists():
+        return None
+    with open(toon_path, "r", encoding="utf-8") as f:
+        content = f.read()
+    return decode_entity_toon(content)
+
+
+def build_co_occurrence_graph(chunks_with_entities: list[dict]) -> dict[str, dict]:
+    """
+    Build a co-occurrence graph from enriched chunks.
+    Each chunk contains entities field with list of entity names found in that chunk.
+    Returns: dict mapping each entity to dict of related entities
+    """
+    graph = {}
+    for chunk_data in chunks_with_entities:
+        entities_in_chunk = chunk_data.get("entities", [])
+        if not isinstance(entities_in_chunk, list) or len(entities_in_chunk) < 2:
+            continue
+        for i, entity1 in enumerate(entities_in_chunk):
+            if entity1 not in graph:
+                graph[entity1] = {}
+            for entity2 in entities_in_chunk[i + 1 :]:
+                if entity2 not in graph[entity1]:
+                    graph[entity1][entity2] = {
+                        "relationship_type": "co-occurs_with",
+                        "count": 0,
+                        "sources": [],
+                    }
+                graph[entity1][entity2]["count"] += 1
+                source_info = {
+                    "file": chunk_data.get("file_name", "unknown"),
+                    "chunk_index": chunk_data.get("original_index", 0),
+                }
+                if source_info not in graph[entity1][entity2]["sources"]:
+                    graph[entity1][entity2]["sources"].append(source_info)
+    return graph
+
+
+def format_relationships_for_toon(relationships: dict[str, dict]) -> list[dict]:
+    """Convert relationship graph data to TOON-friendly format."""
+    result = []
+    for related_entity, info in relationships.items():
+        result.append(
+            {
+                "entity_name": related_entity,
+                "relationship_type": info.get("relationship_type", "co-occurs_with"),
+                "connection_strength": info.get("count", 1),
+                "source_count": len(info.get("sources", [])),
+            }
+        )
+    return result
+
+
+def save_entities_from_chunks(
+    enriched_chunks: list[dict], output_dir: Path | None = None
+) -> dict[str, str]:
+    """
+    Extract unique entities from chunks and save as individual TOON files.
+
+    Args:
+        enriched_chunks: List of chunk dicts with 'entities' and 'relationships' fields
+        output_dir: Directory to save TOON files (defaults to config toon_dir)
+
+    Returns:
+        Dict mapping entity names to their TOON file paths
+    """
+    if output_dir is None:
+        output_dir = TOON_DIR
+    output_dir.mkdir(parents=True, exist_ok=True)
+    entity_to_file_map = {}
+    for chunk_data in enriched_chunks:
+        entities = chunk_data.get("entities", [])
+        relationships = chunk_data.get("relationships", [])
+        if not isinstance(entities, list) or len(entities) == 0:
+            continue
+        source_info = {
+            "file": chunk_data.get("file_name", "unknown"),
+            "chunk_index": chunk_data.get("original_index", 0),
+        }
+        for entity_item in entities:
+            if isinstance(entity_item, dict):
+                entity_name = entity_item.get("name", entity_item.get("entity", ""))
+            else:
+                entity_name = str(entity_item)
+            if not entity_name:
+                continue
+            sanitized = sanitize_entity_name(entity_name)
+            if sanitized not in entity_to_file_map:
+                toon_path = output_dir / f"{sanitized}.toon"
+                entity_type = "npc"
+                content_refs = [source_info]
+                rels_for_entity = format_relationships_for_toon(
+                    {
+                        r.get("entity2", r.get("entity_name", "")): r
+                        for r in relationships
+                        if r.get("entity1") == entity_name or r.get("entity_name") == entity_name
+                    }
+                )
+                toon_content = encode_entity_toon(
+                    entity_name, entity_type, rels_for_entity, content_refs
+                )
+                with open(toon_path, "w", encoding="utf-8") as f:
+                    f.write(toon_content)
+                entity_to_file_map[sanitized] = str(toon_path)
+            else:
+                toon_path = Path(entity_to_file_map[sanitized])
+                existing = load_entity_toon(entity_name, output_dir) or {}
+                if "content_references" not in existing:
+                    existing["content_references"] = []
+                existing["content_references"].append(source_info)
+                with open(toon_path, "w", encoding="utf-8") as f:
+                    f.write(toon_encode(existing))
+    return entity_to_file_map
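A hedged round-trip sketch of the helpers above; the chunk dicts follow the field names build_co_occurrence_graph expects, and the sample values are invented:

from toon_utils import (
    build_co_occurrence_graph,
    format_relationships_for_toon,
    save_entity_toon,
    load_entity_toon,
)

chunks = [
    {"file_name": "session_001.txt", "original_index": 0,
     "entities": ["Goblin King", "Orc Commander"]},
    {"file_name": "session_002.txt", "original_index": 3,
     "entities": ["Goblin King", "Orc Commander", "Elf Scout"]},
]

graph = build_co_occurrence_graph(chunks)
# "Goblin King" co-occurs with "Orc Commander" twice and "Elf Scout" once
rels = format_relationships_for_toon(graph["Goblin King"])

path = save_entity_toon("Goblin King", "npc", rels,
                        [{"file": "session_001.txt", "chunk_index": 0}])
print(path)  # e.g. data/toon_files/goblin_king.toon under the configured toon_dir
print(load_entity_toon("Goblin King"))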
toon-python (submodule) +1

Submodule toon-python added at 90861444e5