1 Commits

Author SHA1 Message Date
Jake 1e20a5452f fix: 🐛 more stable ingestion 2026-03-08 17:28:29 +00:00
5 changed files with 41 additions and 24 deletions
-3
@@ -1,3 +0,0 @@
-[submodule "toon-python"]
-path = toon-python
-url = git@github.com:toon-format/toon-python.git
+7 -3
@@ -11,11 +11,15 @@
 ## Planned Next
-* AI in the middle - make the llm generate multiple queries for a wider search
+* database retrieve for tag or entity
 ## Planned Later
 * entity chunking & re-ranking
 * Logging in Ingestion
-* database retrieve for tag or entity
+* More robust ingestion - llm response sometimes out of expected
+*
+## Done
+* AI in the middle - make the llm generate multiple queries for a wider search
+29 -15
@@ -16,7 +16,7 @@ ingestion:
 db_path: "./data/"
 db_name: "dmv.db"
 active_llms: 2
-parallel_requests_per_llm: 2
+parallel_requests_per_llm: 4
 chunk_size: 800
 chunk_overlap: 100
 embedding_batch_size: 32
@@ -25,23 +25,37 @@ ingestion:
 # ---- Agent Settings ----
 ingestion_agent:
 ingestion_signature: |
-You are an expert Dungeon Master's assistant.
-Analyze the provided notes and extract a concise synopsis and relevant metadata.
-synopsis = A one-sentence summary of the document.
-tags = Relevant tags (NPCs, Locations, Items, Plot Points).
-entities = a list of Key names of people, places, or factions.
-"note -> synopsis:str, tags: list[str], entities: list[str]"
+You are an expert Dungeon Master's assistant specialized in campaign note enrichment.
+Your task is to analyze DnD session notes and extract structured metadata.
+Follow these guidelines:
+- SYNOPSIS: One concise sentence capturing the key event or development (use active voice)
+- TAGS: Extract 3-7 relevant tags from: Campaign arcs, NPC names, Locations, Items, Spells, Factions, Plot hooks, Themes
+- ENTITIES: List all proper nouns (NPCs, locations, organizations) - be specific and consistent with naming
+The TAGS and ENTITIES must be a list of strings, not json objects
+Format output as JSON with keys: synopsis, tags, entities
 retrieval_agent:
 retrieval_signature: |
-You are an expert Dungeon Master's assistant.
-Given the context and the question, answer the question.
-Do not make things up, base all of your answers on the context.
-Always site the file location of your source of information.
+You are an expert Dungeon Master's assistant helping to run a campaign.
+When answering questions about your DnD world:
+1. Strictly use ONLY the provided context from campaign notes
+2. If information is incomplete, infer plausibly based on established lore (flag inferences)
+3. Always cite sources: "Per [filename], [quote/summary]"
+4. Maintain character voice and narrative style when appropriate
+5. For rules questions, distinguish between rules-as-written and DM interpretation
+Provide comprehensive answers that help you run the game, including relevant details about NPCs, locations, or plot points.
 expansion_agent:
 expansion_signature: |
-You are a query expansion expert, specialised in Dungeons and Dragons.
-Given a user's question, generate 3-5 similar but enhanced search queries that would help find more relevant information.
-Each expanded query should be distinct and add different perspective to the original question.
-Return only the queries as a JSON list with key "queries"."""
+You are a query expansion expert specialized in Dungeons & Dragons campaign management.
+Given a user question about their DnD world, generate 3-5 enhanced search queries that:
+- Cover different aspects (characters, locations, lore, rules)
+- Include synonyms and related terms (e.g., "dragon" → "wyrm", "scales" → "armor")
+- Address potential follow-up questions the DM might have
+- Vary specificity (broad to narrow)
+Return ONLY a JSON array with key "queries". Keep queries concise (5-10 words each).
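The new expansion prompt asks for "a JSON array with key "queries"", which a model could plausibly satisfy with either a bare array or an object holding a `queries` list. The following is a minimal sketch of a parser that tolerates both shapes, assuming the raw model reply arrives as a string. `parse_expanded_queries` and its `max_queries` parameter are hypothetical names for illustration and are not part of this commit.

```python
import json
from typing import List


def parse_expanded_queries(raw: str, max_queries: int = 5) -> List[str]:
    """Extract query strings from an LLM reply (hypothetical helper).

    Accepts either {"queries": [...]} or a bare JSON list and returns
    an empty list when the reply is not usable.
    """
    try:
        data = json.loads(raw)
    except (json.JSONDecodeError, TypeError):
        return []

    # Unwrap the object form; a bare list passes through unchanged.
    if isinstance(data, dict):
        data = data.get("queries", [])
    if not isinstance(data, list):
        return []

    seen = set()
    queries = []
    for item in data:
        if not isinstance(item, str):
            continue  # skip non-string entries instead of failing
        query = item.strip()
        # Drop empty strings and case-insensitive duplicates.
        if query and query.lower() not in seen:
            seen.add(query.lower())
            queries.append(query)
    return queries[:max_queries]
```

Returning an empty list on malformed output lets a caller fall back to the user's original question rather than abort the search; whether the project handles that case this way is not shown in the diff.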
+5 -2
@@ -176,8 +176,8 @@ def embed_chunks(chunks: List[Any], batch_size: int = EMBEDDING_BATCH_SIZE) -> L
 print(f"⚠️ Batch processing failed at index {i}: {e}")
 # Fallback: process individually (if needed)
 for j, chunk in enumerate(batch):
-try:
 content = chunk.page_content
+try:
 embedding = embeddings_model.embed_query(content)
 file_path_orig = chunk.metadata.get("full_path", "unknown")
@@ -250,7 +250,10 @@ def save_to_db(chunk_dicts):
 entry["chunk_data"],
 entry["synopsis"],
 ",".join(entry["tags"]), # Store as comma-separated string
-",".join(entry["entities"]), # Store as comma-separated string
+",".join(
+str(e) if isinstance(e, str) else e.get("name", str(e))
+for e in entry["entities"]
+), # Store as comma-separated string
 embedding_str,
 entry["timestamp"],
 )
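The entity handling in the hunk above covers strings and objects with a `name` key, but `e.get` would raise `AttributeError` for any other type (a number or `None`, for example), and `tags` is still joined without any coercion. A minimal sketch of one shared normalizer follows. `normalize_names` and `join_names` are hypothetical helpers, not code from this commit, and the choices to drop objects that lack `name` and to replace embedded commas are assumptions that differ from the committed fallback to `str(e)`.

```python
from typing import Any, Iterable, List, Optional


def normalize_names(items: Optional[Iterable[Any]]) -> List[str]:
    """Coerce LLM-produced tags or entities to plain strings (hypothetical helper)."""
    names = []
    for item in items or []:
        if isinstance(item, str):
            name = item
        elif isinstance(item, dict):
            # Assumption: objects carry their label under "name", as in the diff.
            name = str(item.get("name", ""))
        elif item is None:
            continue
        else:
            name = str(item)
        # Commas would corrupt the comma-separated column, so replace them
        # and collapse any leftover runs of whitespace.
        name = " ".join(name.replace(",", " ").split())
        if name:
            names.append(name)
    return names


def join_names(items: Optional[Iterable[Any]]) -> str:
    """Build the comma-separated string stored in the database row."""
    return ",".join(normalize_names(items))
```

With helpers like these, both the `tags` and `entities` arguments could be written as `join_names(entry["tags"])` and `join_names(entry["entities"])`, so that one malformed LLM response does not abort the insert.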
Submodule toon-python deleted from 90861444e5