Progress Log

Documenting our thinking as we build the knowledge base

Current Status

Phase: Wan KB COMPLETE - Two models live

Knowledge Bases Live:

  • LTX 2 (kb/ltx2/)
  • Wan (kb/wan/)

Next: Consider HunyuanVideo, FLUX, or CogVideoX extraction.

February 3, 2026 - Night

Wan Knowledge Base COMPLETE

Built comprehensive static HTML knowledge base for the Wan ecosystem at kb/wan/.

KB Structure (10 sections, ~1,200 lines):

  • Overview: Wan family explanation (2.1, 2.2, VACE, Fun)
  • Choosing a Model: Decision tree + 2.1 vs 2.2 comparison table
  • Hardware: VRAM requirements for all 10+ models
  • Generation Modes: T2V, I2V, FLF, S2V with recommended settings
  • Control Methods: VACE, Fun Control, Camera (ReCamMaster), Motion (ATI, WanAnimate)
  • Character & Likeness: Phantom, MAGREF, EchoShot, Lynx
  • Lip-Sync & Audio: HuMo, MultiTalk, InfiniteTalk, FantasyTalking
  • Speed & Optimization: LightX2V, CausVid, Wan2GP, TeaCache
  • Training: LoRA tips, frameworks
  • Troubleshooting: 15+ common errors with solutions
Content synthesis approach: Rather than dumping the knowledge extracted from all 316K messages, we curated the most useful items from the Discord extractions and combined them with technical specs from external docs (GitHub READMEs, tutorials). The result is a readable reference guide rather than a data dump.

February 3, 2026 - Late

Wan enrichment COMPLETE - Ready for static KB

Gathered all source materials for the Wan knowledge base. Now ready to synthesize into static HTML pages.

External sources gathered:

  • 60+ URLs from official repos, community projects, ComfyUI docs
  • Technical content fetched: VRAM requirements, features, installation steps
  • Covers: Phantom, MAGREF, HuMo, MultiTalk, LightX2V, CausVid, ReCamMaster, VideoX-Fun, and more

#updates channel extracted:

  • 1,987 curated posts from @pom (Aug 2023 - Jan 2026)
  • 946 knowledge items extracted ($1.44)
  • High-value editorial content: 214 resources, 187 community creations, 142 workflows
  • Covers full history: AnimateDiff → SVD → SDXL → Wan → LTX → FLUX
Key insight: The #updates channel has a very different profile than raw chat - it's curated highlights, with 83% of posts having 10+ reactions. It's high on resources and community creations, low on troubleshooting.

February 3, 2026 - Evening

Wan extraction 100% COMPLETE

Completed the 5 previously failed months (Jul-Nov 2025). Full Wan ecosystem now extracted.

Final 5 months extracted:

  • July 2025: 32,584 msgs → $6.42
  • August 2025: 41,050 msgs → $7.68
  • September 2025: 25,790 msgs → $4.95
  • October 2025: 18,574 msgs → $3.45
  • November 2025: 10,533 msgs → $2.10

Total Wan extraction:

  • ~316,000 messages across 5 channels
  • 11 months of wan_chatter (Feb 2025 - Feb 2026)
  • Full runs of wan_gens, wan_training, wan_comfyui, wan_resources
  • Estimated cost: ~$65-70 total
Next steps: (1) Combine all extractions into a NotebookLM-ready file, (2) add external sources (official docs, blog posts, #updates channel), (3) synthesize into a cohesive static KB with better attribution.

February 3, 2026 - Earlier

Wan extraction 90% complete - Pipeline insights

Ran full Wan ecosystem extraction over ~5.5 hours. Most channels complete, 5 monthly chunks failed due to network errors.

Completed extractions (~3.5MB total):

  • wan_chatter: Feb-Jun 2025, Dec 2025-Feb 2026 (8 months)
  • wan_gens: 487KB - gallery/showcase content
  • wan_training: 457KB - LoRA training knowledge
  • wan_comfyui: 233KB - workflow implementation
  • wan_resources: 206KB - curated resources

Failed (need retry): wan_chatter Jul, Aug, Sep, Oct, Nov 2025 - network connection errors mid-extraction.

Pipeline insight - Extraction is just step 1: The raw extraction produces thousands of fragmented knowledge items. For a good static KB, we need additional steps: (1) enrich with external sources (official docs, blog posts), (2) synthesize/deduplicate with another LLM pass, (3) improve attribution format from "— Username" to "— Discord, Jan 2026" with links where possible.
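As a sketch of step (3), the attribution rewrite could be as simple as replacing the trailing username with a source-and-date string. The function name, parameters, and default date below are our own assumptions, not the pipeline's actual code:

```python
def reformat_attribution(line, month_year="Jan 2026", link=None):
    """Rewrite a trailing '— Username' attribution into '— Discord, Jan 2026'.

    `month_year` and `link` are hypothetical parameters; in the real
    pipeline they would come from the message's metadata.
    """
    if "—" not in line:
        return line  # nothing to rewrite
    body, _username = line.rsplit("—", 1)
    attribution = f"— Discord, {month_year}"
    if link:
        attribution += f" ({link})"
    return f"{body.rstrip()} {attribution}"
```

For example, `reformat_attribution("Use shift 5 for distilled LoRAs — Kijai")` returns `"Use shift 5 for distilled LoRAs — Discord, Jan 2026"`.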

Updated project plan: Added Step 3 (External Sources) and Step 4 (Synthesis) to the pipeline. See docs/project-plan.md.

Actual cost: ~$45 for Wan (vs $35 estimate) - more messages than expected (316K vs 200K).

February 2, 2026

LTX 2 KB validated - Project plan complete

Major milestone: LTX 2 knowledge base is functional end-to-end. NotebookLM tested and "works pretty well" per user feedback. Static HTML KB live.

What we built:

  • NotebookLM upload: Combined 8 extraction files into for_notebooklm/ltx2/2026-02-01/ltx2_january_combined.md (695KB, organized with section headers)
  • Static HTML KB: kb/ltx2/ - Comprehensive page with sticky TOC, collapsible sections, hardware tables, settings tables, troubleshooting guides
  • KB index: kb/index.html - Model selector page (LTX2 active, Wan/FLUX/others coming soon)
  • Project plan: docs/project-plan.md - Full scope, Wan ecosystem breakdown, cost estimates, 4-week timeline
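The NotebookLM upload step amounts to concatenating the extraction files under one title with a section header per source. A minimal sketch, assuming the inputs arrive as (name, text) pairs rather than file paths:

```python
def combine_sections(sections, title="LTX 2 - Combined Extractions"):
    """Merge (name, markdown_text) pairs into one document with
    a '## Source: ...' header per extraction file."""
    parts = [f"# {title}\n"]
    for name, text in sections:
        parts.append(f"\n## Source: {name}\n\n{text.strip()}\n")
    return "".join(parts)
```

The section headers matter: they let NotebookLM cite which channel an answer came from.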
Key insight - NotebookLM vs Static KB: These serve complementary roles. NotebookLM excels at specific Q&A ("What VRAM for 720p 97 frames?"). Static KB excels at browsing, decision trees ("Should I use Wan 2.1 or 2.2?"), and rich media (embedded videos, downloadable workflows). Build both.

Wan ecosystem is complex: Asked user to test NotebookLM with Wan questions. Learned it's not one model but an entire ecosystem:

  • Generations: Wan 2.1 (standard DiT) vs 2.2 (MoE architecture with High/Low noise split)
  • Control systems: VACE (full video control, inpainting, style transfer) vs Fun Control (Canny/Depth/Pose inputs)
  • Character models: Phantom (T2V consistency), MAGREF (I2V likeness), HuMo (audio-reactive)
  • Lip-sync: MultiTalk, InfiniteTalk, HuMo S2V
  • Optimization LoRAs: LightX2V, CausVid, Pusa
  • Implementations: WanVideoWrapper (Kijai) vs ComfyUI Native

Media handling solved: Discord CDN URLs expire, but @pom built a refresh API endpoint. Will use that initially; migrate to Cloudflare R2 if reliability issues arise.
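For deciding when a refresh is needed: signed Discord CDN attachment URLs carry an `ex` query parameter holding a hex Unix expiry timestamp. A small check along these lines could gate calls to the refresh endpoint (the endpoint itself isn't sketched, since we only know it exists):

```python
import time
from urllib.parse import parse_qs, urlparse

def cdn_url_expired(url, now=None):
    """Return True if a Discord CDN URL's `ex` parameter (a hex Unix
    timestamp) has passed. URLs without `ex` are treated as non-expiring."""
    params = parse_qs(urlparse(url).query)
    if "ex" not in params:
        return False
    expiry = int(params["ex"][0], 16)
    return (now if now is not None else time.time()) >= expiry
```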

Cost estimates for full project:

  • LTX 2: $7.65 (complete)
  • Wan ecosystem (~200K msgs): ~$35
  • FLUX (~80K msgs): ~$14
  • All models combined (~800K msgs): ~$140

Next: Start Wan extraction - scope channels, estimate costs, begin processing.

February 1, 2026 - Evening

LTX 2 January extraction COMPLETE

Processed all LTX-related channels for January 2026. Total: ~44,500 messages across 4 channels, extracting ~4,345 knowledge items.

Channel                    Messages    Items    Cost
ltx_chatter (full month)     34,751    3,053   $5.52
ltx_training                  2,850      358   $0.59
ltx_gens                      4,100      554   $0.83
ltx_resources                 2,891      380   $0.71
Total                       ~44,500   ~4,345   $7.65

Output: 8 markdown files in data/ directory (~19,000 lines total), ready for NotebookLM. Also JSON versions for structured use.

Lessons learned:

  • Running extractions in parallel hits rate limits (30K tokens/min); run sequentially for reliability.
  • Cost tracking was close to accurate: actual $7.65 vs the estimated $5-6. Each ~400-message chunk costs ~$0.15-0.20.
  • The forum channel (ltx_resources) works with the time-chunked approach - it still captures valuable content even without thread structure.
Insight: We now have comprehensive LTX 2 knowledge ready for NotebookLM testing. The extracted data includes technical discoveries, troubleshooting guides, hardware requirements, limitations, workflows, and community creations - everything needed for a useful knowledge base.

Next: Combine files into single comprehensive document, test in NotebookLM, then build static HTML knowledge base.

February 1, 2026 - Afternoon

LTX 2 chunked extraction working

Built extract_chat_chunks.py - processes chat in time-ordered chunks to capture ALL knowledge, not just Q&A pairs.

Why chunked approach: Chat contains more than Q&A - discoveries, comparisons, tips shared proactively, hardware benchmarks, links to resources. Processing in 400-message chunks preserves conversation context.
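The chunking itself could look like the sketch below. The small overlap between adjacent chunks is our own assumption (so a conversation that straddles a boundary lands whole in at least one chunk); we don't know whether extract_chat_chunks.py actually overlaps.

```python
def chunk_messages(messages, chunk_size=400, overlap=20):
    """Split a time-ordered message list into fixed-size chunks for
    LLM extraction, with a small overlap between adjacent chunks."""
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(messages), step):
        chunks.append(messages[start:start + chunk_size])
        if start + chunk_size >= len(messages):
            break
    return chunks
```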

12 extraction categories:

  • Original 8: discoveries, troubleshooting, comparisons, tips, news, workflows, settings, concepts
  • New 4: resources (links to models/repos), limitations (what doesn't work), hardware (VRAM/RAM requirements), community_creations (LoRAs/nodes people made)

Prompt improvements (learned from pom's code):

  • Added accuracy guidelines: "Do NOT jump to conclusions unsupported by evidence"
  • Use reactions as quality signal (marked with ★) but don't over-index
  • Explicit skip instructions for jokes, casual chat, unsubstantiated claims
Insight: The new categories capture critical info. The Jan 7 extraction found 46 limitations (things LTX 2 can't do well), 44 hardware requirements (specific VRAM/RAM figures for different GPUs), and 44 resource links (HuggingFace models, GitHub repos).

Sample extractions:

  • Limitation: "Can't do people turning around - gets back-to-front mutation horrors"
  • Hardware: "3090: safe at 81 frames, OOMs at 121 frames"
  • Resource: LTX official workflows link, SageAttention installation guide

Output: data/ltx_chatter_20260106_knowledge.md and data/ltx_chatter_20260107_knowledge.md - clean markdown ready for NotebookLM.

February 1, 2026 - Morning

Prototype extraction successful

Built and tested extraction scripts for both forum threads and chat Q&A. Results are high quality.

Forum thread extraction (4 threads tested):

  • FlippinRad Motion Morph LoRA (394 msgs, 80 reactions) - Extracted LoRA details, requirements, 6 issues with solutions, 8 contributors
  • Wan HuMo SVI Pro v5 Workflow (935 msgs) - Lip-sync workflow with HuMo, settings, 5 issues/solutions
  • SYSTMS Transition Workflow (140 msgs) - VACE transitions, shift settings, 6 troubleshooting entries
  • Creative Video Upscaler (406 msgs) - Multi-pass 480p→1080p upscaling, AnimateDiff techniques

Chat Q&A extraction (99 pairs from wan_chatter):

  • 3 troubleshooting fixes (TeaCache compatibility, VACE frame errors, direction mask inversion)
  • 5 tips (InfiniteTalk recommendation, VACE 14B preference, Qwen LoRAs for likeness)
  • 3 settings recommendations (speed/quality optimization compatibility chart, Krea LoRA settings)
  • 4 concept explanations (direction masks, self forcing, block-based LoRA training)
Insight: Forum threads yield richer, more structured knowledge (~$0.05/thread with Sonnet). Chat Q&A is thinner but captures troubleshooting that doesn't appear in forum posts. Both are valuable.

Scripts created:

  • scripts/extract_forum_thread.py - Process a single forum thread
  • scripts/extract_chat_qa.py - Extract Q&A pairs from chatter channels

Output files:

  • data/thread_*_knowledge.json - Extracted forum knowledge
  • data/chat_qa_*.json - Extracted Q&A knowledge

January 30, 2026 - Morning

Discovered forum structure & planned cost-effective extraction

Key realization: Resources channels use Discord's forum feature, not regular chat. The thread_id field identifies which "post" each message belongs to.

Actual forum post counts:

  • wan_resources: 50 posts (not 6,600 messages - those are replies within posts)
  • ltx_resources: 45 posts
  • resources: 114 posts
  • Total: ~209 curated workflow/resource posts
Insight: The initial query showed 6,605 "messages without reference_id", which looked like 6,605 posts. But these are actually all the messages across ~50 forum threads - each forum post averages ~200 comments/replies. The thread_id field (not reference_id) is what groups forum messages.
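In SQL terms, counting forum posts means grouping on thread_id rather than counting messages with a null reference_id. A runnable sketch against a toy schema (the real discord_messages table certainly has more columns than this):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE discord_messages (
    id INTEGER PRIMARY KEY, channel TEXT,
    thread_id INTEGER, reference_id INTEGER, content TEXT)""")
conn.executemany(
    "INSERT INTO discord_messages VALUES (?, ?, ?, ?, ?)",
    [(1, "wan_resources", 100, None, "Post: HuMo workflow"),
     (2, "wan_resources", 100, 1, "reply"),
     (3, "wan_resources", 100, 1, "another reply"),
     (4, "wan_resources", 200, None, "Post: VACE transitions"),
     (5, "wan_resources", 200, 4, "reply")])

# One row per forum *post*, with its reply volume.
rows = conn.execute("""
    SELECT thread_id, COUNT(*) AS msgs
    FROM discord_messages
    WHERE channel = 'wan_resources'
    GROUP BY thread_id
    ORDER BY msgs DESC""").fetchall()
```

Here `rows` is `[(100, 3), (200, 2)]`: two posts, not five "messages".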

LLM cost analysis:

  • Processing all 1M messages naively: ~$1,500+ with Opus (too expensive)
  • Smart filtering to ~100K high-value messages: ~$60-110 with Opus
  • Same with Sonnet: ~$15-25

High-value subsets identified:

  • Messages with 3+ reactions: 42K (community-validated)
  • Messages with attachments: 155K (workflows, examples)
  • Long messages (>300 chars): ~21K (substantive content)
  • Kijai's messages: 104K (expert knowledge)
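Together these subsets suggest a simple OR-filter. The thresholds come from the numbers above; the field names are assumptions about the message schema:

```python
def is_high_value(msg, experts=("Kijai",)):
    """True if a message falls into any high-value subset:
    community-validated, has attachments, substantive length,
    or authored by a known expert."""
    return (msg.get("reaction_count", 0) >= 3
            or msg.get("attachment_count", 0) > 0
            or len(msg.get("content", "")) > 300
            or msg.get("author") in experts)
```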

Decision: Don't trust the existing daily summaries. They cover only 87 days, and earlier ones may have errors (GPT verification was only added in late January 2026). We'll regenerate from raw messages to cover the full 2.5-year archive.

January 29, 2026 - Late Evening

Found the summary generation source code

Nathan found the code that generates daily summaries: brain-of-bdnc/news_summary.py

How summaries are generated:

  • Model: Claude Sonnet 4.5 for generation, GPT-5.2 with "high reasoning effort" for verification
  • Chunking: 1000 messages at a time, then combined to top 3-5 items
  • Verification checks: Attribution errors, unsupported claims, logical leaps, invented details

The prompt explicitly prioritizes (in order):

  1. Original creations by community members (nodes, workflows, tools, LoRAs, scripts)
  2. Notable achievements and demonstrations
  3. High-engagement content (reactions/comments signal community interest)
  4. New features people are excited about
  5. Shared workflows with examples

Key prompt instructions:

  • "Do NOT jump to conclusions unsupported by evidence"
  • "Only report what is explicitly stated or clearly demonstrated"
  • "Distinguish between facts, opinions, and speculation"
  • "Always credit creators with bold usernames"
Insight: The summaries ARE capturing reference knowledge - but framed as "news". When someone discovers "FP32 compute improves quality", it's captured as a news item even though it's durable reference knowledge. For a KB, we need to re-process to extract the timeless content and organize by topic rather than date.

Important caveat: Peter (@pom) noted that the GPT-5.2 verification step was only added this week. Summaries before ~late January 2026 may contain inaccuracies (attribution errors, unsupported claims, etc.). This adds even more reason to re-process everything rather than using summaries as-is.

Proposed KB approach:

  1. Re-process summaries - Extract reference content, strip the news framing
  2. Cross-reference with raw Q&A - Summaries miss troubleshooting that happens in back-and-forth chat
  3. Organize by topic - All Z-Image tips together, all Wan troubleshooting together, not scattered across dates

January 29, 2026 - Evening

Daily summaries contain more than "news"

Re-examined daily summaries after initially thinking they were mostly "news" (model releases, community activity). Found they actually contain significant reference knowledge:

  • Technical settings: FP32 vs BF16 compute flags, sampler recommendations, resolution tables
  • Workflow techniques: Dual-model approaches (Base + Turbo), step counts for different effects
  • Training knowledge: LoRA strength conversions, captioning best practices, specific commit versions
  • Troubleshooting: SageAttention breaking Z-Image Base, facial changes during relighting
Insight: Daily summaries may be better than raw chat for many KB use cases - they're already synthesized, structured, and attributed. The "news" framing was too narrow.

January 29, 2026 - Afternoon

Synthesized first troubleshooting guide from raw chat

Took the extracted Q&A data from wan_chatter and synthesized it into a structured troubleshooting guide. Created both JSON and Markdown outputs.

Results: 14 troubleshooting entries, 6 tips, 5 FAQs. Examples:

  • mat1/mat2 CLIP loader fix: pip install transformers==4.48.0
  • NAG attention error: disable WanVideo Apply NAG node
  • Sampler preview missing: check ComfyUI settings, not Manager
  • Shift values: use 5 for distilled LoRAs, increase for higher res

Files: data/troubleshooting_wan_chatter.json, data/troubleshooting_wan_chatter.md

January 29, 2026 - Afternoon

Extracted reference knowledge from raw chat

Built scripts/extract_reference_knowledge.py to find Q&A patterns, errors, and solutions buried in Discord messages. Ran against wan_chatter channel (50K messages).

Results surprised us:

  • 11,544 potential questions (~23% of messages match question patterns)
  • 5,605 Q&A reply pairs (using reference_id to track who replied to what)
  • 662 messages mentioning fixes/solutions
  • 793 error-related discussions
Insight: There's substantial reference knowledge in raw chat that doesn't appear in daily summaries. The reference_id field is key - it lets us connect questions to their answers.
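The pairing logic is essentially a dictionary lookup on reference_id. The question heuristic below is a simplified stand-in for the script's actual patterns, and the field names are assumptions:

```python
import re

QUESTION = re.compile(r"\?|how do i|anyone know|why does", re.IGNORECASE)

def pair_qa(messages):
    """Pair replies with the questions they answer via reference_id."""
    by_id = {m["id"]: m for m in messages}
    pairs = []
    for m in messages:
        parent = by_id.get(m.get("reference_id"))
        if parent and QUESTION.search(parent["content"]):
            pairs.append((parent["content"], m["content"]))
    return pairs
```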

January 29, 2026 - Morning

Discussion: What makes a good knowledge base?

Nathan shared that NotebookLM (chat-with-docs) has been the most useful KB approach he's tried. This led to thinking about what makes knowledge useful:

  • "News" vs "Reference" - Daily summaries capture what happened (news), but users often need how-to information (reference)
  • Update frequency - AI video/image space moves fast. Content becomes outdated quickly.
  • Audience - Primarily technical users who want to get unstuck or learn techniques

Initial hypothesis: Daily summaries = news, raw chat = buried reference knowledge. (This hypothesis was later revised - see evening entry.)

January 28, 2026

Analyzed top contributors

Built script to find who contributes most to the community. Key finding: Kijai has sent 103,556 messages - about 10% of all messages in the database. Clear power-law distribution.

Top 5: Kijai (103K), pom (34K), Juampab12 (26K), spacepxl (21K), Juan Gea (20K)
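The ranking itself is a one-liner with collections.Counter (the author field name is an assumption about the message rows):

```python
from collections import Counter

def top_contributors(messages, n=5):
    """Rank authors by message count, descending."""
    return Counter(m["author"] for m in messages).most_common(n)
```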

Created stats.html to display top 20 contributors with their most active channels.

January 28, 2026

Database gap filled!

Peter (@pom) filled in the 12-month data gap (Feb 2024 - Jan 2025). Database now has:

  • 1,046,692 messages (up from 727K)
  • 6,624 members (up from 4,477)
  • 272,750 messages recovered from the gap period

This data includes FLUX release, Stable Diffusion 3, CogVideoX, and early HunyuanVideo discussions.

January 28, 2026

Project started

Goal: Transform the Banodoco Discord database into a useful knowledge base about open source AI tools (video generation, image generation, training, ComfyUI, etc.)

Initial exploration revealed:

  • 4 core tables: discord_messages, discord_members, discord_channels, daily_summaries
  • 29 months of data (Aug 2023 - Jan 2026)
  • AI-generated daily summaries with structured JSON, attribution, and media links
  • A 12-month gap in the data (Feb 2024 - Jan 2025) - later filled

Created database.html to visualize the database structure and coverage.