System Design: Codebase Knowledge Builder
Please DON’T remove notes for AI
Requirements
Notes for AI: Keep it simple and clear. If the requirements are abstract, write concrete user stories
User Story: As a developer onboarding to a new codebase, I want a tutorial automatically generated from its GitHub repository or local directory, optionally in a specific language. This tutorial should explain the core abstractions, their relationships (visualized), and how they work together, using beginner-friendly language, analogies, and multi-line descriptions where needed, so I can understand the project structure and key concepts quickly without manually digging through all the code.
Input:
- A publicly accessible GitHub repository URL or a local directory path.
- A project name (optional, will be derived from the URL/directory if not provided).
- Desired language for the tutorial (optional, defaults to English).
Output:
- A directory named after the project containing:
  - An `index.md` file with:
    - A high-level project summary (potentially translated).
    - A Mermaid flowchart diagram visualizing relationships between abstractions (using potentially translated names/labels).
    - An ordered list of links to chapter files (using potentially translated names).
  - Individual Markdown files for each chapter (`01_chapter_one.md`, `02_chapter_two.md`, etc.) detailing core abstractions in a logical order (potentially translated content).
Flow Design
Notes for AI:
- Consider the design patterns of agent, map-reduce, rag, and workflow. Apply them if they fit.
- Present a concise, high-level description of the workflow.
Applicable Design Pattern:
This project primarily uses a Workflow pattern to decompose the tutorial generation process into sequential steps. The chapter writing step utilizes a BatchNode (a form of MapReduce) to process each abstraction individually.
- Workflow: The overall process follows a defined sequence: fetch code -> identify abstractions -> analyze relationships -> determine order -> write chapters -> combine tutorial into files.
- Batch Processing: The `WriteChapters` node processes each identified abstraction independently (map) before the final tutorial files are structured (reduce).
Flow high-level Design:
1. `FetchRepo`: Crawls the specified GitHub repository URL or local directory using the appropriate utility (`crawl_github_files` or `crawl_local_files`), retrieving relevant source code file contents.
2. `IdentifyAbstractions`: Analyzes the codebase using an LLM to identify up to 10 core abstractions, generate beginner-friendly descriptions (potentially translated if language != English), and list the indices of files related to each abstraction.
3. `AnalyzeRelationships`: Uses an LLM to analyze the identified abstractions (referenced by index) and their related code to generate a high-level project summary and describe the relationships/interactions between these abstractions (summary and labels potentially translated if language != English), specifying source and target abstraction indices and a concise label for each interaction.
4. `OrderChapters`: Determines the most logical order (as indices) in which to present the abstractions in the tutorial, considering input context that might be translated. The output order itself is language-independent.
5. `WriteChapters` (BatchNode): Iterates through the ordered list of abstraction indices. For each abstraction, it calls an LLM to write a detailed, beginner-friendly chapter (content potentially fully translated if language != English), using the relevant code files (accessed via indices) and summaries of previously generated chapters (potentially translated) as context.
6. `CombineTutorial`: Creates an output directory, generates a Mermaid diagram from the relationship data (using potentially translated names/labels), and writes the project summary (potentially translated), relationship diagram, chapter links (using potentially translated names), and individually generated chapter files (potentially translated content) into it. Fixed text like “Chapters”, “Source Repository”, and the attribution footer remains in English.
```mermaid
flowchart TD
    A[FetchRepo] --> B[IdentifyAbstractions];
    B --> C[AnalyzeRelationships];
    C --> D[OrderChapters];
    D --> E[Batch WriteChapters];
    E --> F[CombineTutorial];
```
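The sequence above can be sketched in plain Python. This is an illustrative stand-in for the actual framework's node/flow machinery (PocketFlow-style `prep`/`exec`/`post` phases), not a definitive implementation; the `FetchRepo` body below is a placeholder rather than the real crawler logic:

```python
class Node:
    """Minimal stand-in for a workflow node with prep/exec/post phases."""

    def prep(self, shared):
        return None

    def exec(self, prep_res):
        return prep_res

    def post(self, shared, prep_res, exec_res):
        pass

    def run(self, shared):
        prep_res = self.prep(shared)
        exec_res = self.exec(prep_res)
        self.post(shared, prep_res, exec_res)


class Flow:
    """Runs nodes in the fixed sequence shown in the flowchart."""

    def __init__(self, *nodes):
        self.nodes = nodes

    def run(self, shared):
        for node in self.nodes:
            node.run(shared)


class FetchRepo(Node):
    """Placeholder: a real implementation calls the crawler utilities."""

    def post(self, shared, prep_res, exec_res):
        shared["files"] = [("main.py", "print('hello')")]  # stand-in result
```

The full pipeline would then be `Flow(FetchRepo(), IdentifyAbstractions(), ...)` run against the shared store described under Node Design.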
Utility Functions
Notes for AI:
- Understand the utility function definition thoroughly by reviewing the doc.
- Include only the necessary utility functions, based on nodes in the flow.
`crawl_github_files` (`utils/crawl_github_files.py`)
- External Dependency: requests, gitpython (optional, for SSH)
- Input: `repo_url` (str), `token` (str, optional), `max_file_size` (int, optional), `use_relative_paths` (bool, optional), `include_patterns` (set, optional), `exclude_patterns` (set, optional)
- Output: `dict` containing `files` (dict[str, str]) and `stats`.
- Necessity: Required by `FetchRepo` to download and read source code from GitHub if a `repo_url` is provided. Handles API calls or SSH cloning, filtering, and file reading.
`crawl_local_files` (`utils/crawl_local_files.py`)
- External Dependency: None
- Input: `directory` (str), `max_file_size` (int, optional), `use_relative_paths` (bool, optional), `include_patterns` (set, optional), `exclude_patterns` (set, optional)
- Output: `dict` containing `files` (dict[str, str]).
- Necessity: Required by `FetchRepo` to read source code from a local directory if a `local_dir` path is provided. Handles directory walking, filtering, and file reading.
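A minimal sketch of what the local crawler might look like, assuming `fnmatch`-style patterns; the real `crawl_local_files` also handles `use_relative_paths` and may collect extra metadata, so treat this as illustrative:

```python
import fnmatch
import os


def crawl_local_files(directory, max_file_size=100_000,
                      include_patterns=None, exclude_patterns=None):
    """Walk a directory, filter by pattern and size, and read file contents."""
    files = {}
    for root, _dirs, names in os.walk(directory):
        for name in names:
            path = os.path.join(root, name)
            rel = os.path.relpath(path, directory)
            # Exclusions win over inclusions; both are optional.
            if exclude_patterns and any(fnmatch.fnmatch(rel, p) for p in exclude_patterns):
                continue
            if include_patterns and not any(fnmatch.fnmatch(rel, p) for p in include_patterns):
                continue
            if os.path.getsize(path) > max_file_size:
                continue
            with open(path, encoding="utf-8", errors="ignore") as f:
                files[rel] = f.read()
    return {"files": files}
```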
`call_llm` (`utils/call_llm.py`)
- External Dependency: LLM Provider API (e.g., Google GenAI)
- Input: `prompt` (str), `use_cache` (bool, optional)
- Output: `response` (str)
- Necessity: Used by `IdentifyAbstractions`, `AnalyzeRelationships`, `OrderChapters`, and `WriteChapters` for code analysis and content generation. Needs careful prompt engineering and YAML validation (implicit via `yaml.safe_load`, which raises errors on malformed output).
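A sketch of how `call_llm` could implement the `use_cache` flag. The `_provider` parameter and the `llm_cache.json` filename are hypothetical stand-ins for the real provider call and cache location:

```python
import hashlib
import json
import os

_CACHE_FILE = "llm_cache.json"  # hypothetical cache location


def call_llm(prompt, use_cache=True, _provider=None):
    """Return an LLM response, caching by prompt hash when use_cache is set.

    _provider is an illustrative hook standing in for the actual API call
    (e.g., a Google GenAI client in the real utility).
    """
    cache = {}
    if use_cache and os.path.exists(_CACHE_FILE):
        with open(_CACHE_FILE) as f:
            cache = json.load(f)
    key = hashlib.sha256(prompt.encode()).hexdigest()
    if use_cache and key in cache:
        return cache[key]
    response = _provider(prompt)  # real code: call the LLM provider API
    if use_cache:
        cache[key] = response
        with open(_CACHE_FILE, "w") as f:
            json.dump(cache, f)
    return response
```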
Node Design
Shared Store
Notes for AI: Try to minimize data redundancy
The shared store structure is organized as follows:

```python
shared = {
    # --- Inputs ---
    "repo_url": None,        # Provided by the user/main script if using GitHub
    "local_dir": None,       # Provided by the user/main script if using a local directory
    "project_name": None,    # Optional, derived from repo_url/local_dir if not provided
    "github_token": None,    # Optional, from argument or environment variable
    "output_dir": "output",  # Default or user-specified base directory for output
    "include_patterns": set(),  # File patterns to include
    "exclude_patterns": set(),  # File patterns to exclude
    "max_file_size": 100000,    # Default or user-specified max file size
    "language": "english",      # Default or user-specified language for the tutorial

    # --- Intermediate/Output Data ---
    "files": [],         # Output of FetchRepo: list of (file_path: str, file_content: str) tuples
    "abstractions": [],  # Output of IdentifyAbstractions: list of {"name": str, "description": str,
                         # "files": [int]} (indices into shared["files"]; name/description potentially translated)
    "relationships": {   # Output of AnalyzeRelationships
        "summary": None, # Overall project summary (potentially translated)
        "details": [],   # List of {"from": int, "to": int, "label": str (potentially translated)}
                         # describing relationships between abstraction indices
    },
    "chapter_order": [], # Output of OrderChapters: list of indices into shared["abstractions"],
                         # determining tutorial order
    "chapters": [],      # Output of WriteChapters: list of chapter content strings (Markdown,
                         # potentially translated), ordered according to chapter_order
    "final_output_dir": None,  # Output of CombineTutorial: path to the final generated tutorial
                               # directory (e.g., "output/my_project")
}
```
Node Steps
Notes for AI: Carefully decide whether to use Batch/Async Node/Flow. Removed explicit try/except in exec, relying on Node’s built-in fault tolerance.
`FetchRepo`
- Purpose: Download the repository code (from GitHub) or read from a local directory, loading relevant files into memory using the appropriate crawler utility.
- Type: Regular
- Steps:
  - `prep`: Read `repo_url`, `local_dir`, `project_name`, `github_token`, `output_dir`, `include_patterns`, `exclude_patterns`, and `max_file_size` from the shared store. Derive `project_name` from `repo_url` or `local_dir` if not present in shared. Set the `use_relative_paths` flag.
  - `exec`: If `repo_url` is present, call `crawl_github_files(...)`. Otherwise, call `crawl_local_files(...)`. Convert the resulting `files` dictionary into a list of `(path, content)` tuples.
  - `post`: Write the list of `files` tuples and the derived `project_name` (if applicable) to the shared store.
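The `project_name` derivation in `prep` might look like this; the exact rules are an assumption, since the design only says the name comes from the URL or directory:

```python
import os


def derive_project_name(repo_url=None, local_dir=None):
    """Derive a default project name when none is supplied (sketch of the prep step)."""
    if repo_url:
        # "https://github.com/org/my-repo.git" -> "my-repo"
        name = repo_url.rstrip("/").split("/")[-1]
        return name[:-4] if name.endswith(".git") else name
    return os.path.basename(os.path.abspath(local_dir))
```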
`IdentifyAbstractions`
- Purpose: Analyze the code to identify key concepts/abstractions using indices. Generates potentially translated names and descriptions if language is not English.
- Type: Regular
- Steps:
  - `prep`: Read `files` (list of tuples), `project_name`, and `language` from the shared store. Create context using the `create_llm_context` helper, which adds file indices. Format the list of `index # path` entries for the prompt.
  - `exec`: Construct a prompt for `call_llm`. If language is not English, add instructions to generate `name` and `description` in the target language. Ask the LLM to identify roughly 5–10 core abstractions, provide a simple description for each, and list the relevant file indices (e.g., `- 0 # path/to/file.py`). Request YAML list output. Parse and validate the YAML, ensuring indices are within bounds and converting entries like `0 # path...` to just the integer `0`.
  - `post`: Write the validated list of `abstractions` (e.g., `[{"name": "Node", "description": "...", "files": [0, 3, 5]}, ...]`) containing file indices and potentially translated `name`/`description` to the shared store.
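The validation step that converts entries like `0 # path...` into bare integers could be sketched as follows (the helper name is illustrative):

```python
def parse_index_entries(entries, num_files):
    """Convert LLM output entries like '0 # path/to/file.py' (or plain ints)
    into validated integer indices into shared["files"]."""
    indices = []
    for entry in entries:
        raw = str(entry).split("#")[0].strip()  # drop the '# path' comment part
        idx = int(raw)
        if not 0 <= idx < num_files:
            raise ValueError(f"file index {idx} out of bounds (0..{num_files - 1})")
        indices.append(idx)
    return indices
```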
`AnalyzeRelationships`
- Purpose: Generate a project summary and describe how the identified abstractions interact, using indices and concise labels. Generates potentially translated summary and labels if language is not English.
- Type: Regular
- Steps:
  - `prep`: Read `abstractions`, `files`, `project_name`, and `language` from the shared store. Format context for the LLM, including abstraction names and indices, descriptions, and content snippets from related files (referenced by `index # path` using the `get_content_for_indices` helper); names and descriptions are potentially translated. Prepare the list of `index # AbstractionName` entries (potentially translated) for the prompt.
  - `exec`: Construct a prompt for `call_llm`. If language is not English, add instructions to generate `summary` and `label` in the target language, and note that input names might be translated. Ask for (1) a high-level summary and (2) a list of relationships, each specifying `from_abstraction` (e.g., `0 # Abstraction1`), `to_abstraction` (e.g., `1 # Abstraction2`), and a concise `label`. Request structured YAML output. Parse and validate, converting referenced abstractions to indices (`from: 0, to: 1`).
  - `post`: Write the `relationships` dictionary (`{"summary": "...", "details": [{"from": 0, "to": 1, "label": "..."}, ...]}`) with indices and potentially translated `summary`/`label` to the shared store.
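A plausible shape for the `get_content_for_indices` helper, assuming `files` is the list of `(path, content)` tuples produced by `FetchRepo` (the exact signature is an assumption):

```python
def get_content_for_indices(files, indices):
    """Map 'idx # path' keys to file content for the requested indices,
    so the LLM context ties each snippet back to a file index."""
    return {f"{i} # {files[i][0]}": files[i][1] for i in indices}
```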
`OrderChapters`
- Purpose: Determine the sequence (as indices) in which abstractions should be presented. Considers potentially translated input context.
- Type: Regular
- Steps:
  - `prep`: Read `abstractions`, `relationships`, `project_name`, and `language` from the shared store. Prepare context including the list of `index # AbstractionName` entries (potentially translated) and textual descriptions of the relationships, referencing indices and using the potentially translated `label`. Note in the context if the summary/names might be translated.
  - `exec`: Construct a prompt for `call_llm` asking it to order the abstractions based on importance, foundational concepts, or dependencies. Request output as an ordered YAML list of `index # AbstractionName` entries. Parse and validate, extracting only the indices and ensuring each is present exactly once.
  - `post`: Write the validated ordered list of indices (`chapter_order`) to the shared store.
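The parse-and-validate step for the ordering could be sketched as follows (the function name is illustrative):

```python
def validate_chapter_order(entries, num_abstractions):
    """Extract indices from entries like '2 # SomeName' and check that every
    abstraction index appears exactly once."""
    order = [int(str(e).split("#")[0].strip()) for e in entries]
    if sorted(order) != list(range(num_abstractions)):
        raise ValueError("order must contain each abstraction index exactly once")
    return order
```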
`WriteChapters`
- Purpose: Generate the detailed content for each chapter of the tutorial. Generates potentially fully translated chapter content if language is not English.
- Type: BatchNode
- Steps:
  - `prep`: Read `chapter_order` (indices), `abstractions`, `files`, `project_name`, and `language` from the shared store. Initialize an empty instance variable `self.chapters_written_so_far`. Return an iterable list where each item corresponds to an abstraction index from `chapter_order`. Each item should contain the chapter number, potentially translated abstraction details, a map of related file content (`{"idx # path": content}`), the full chapter listing (potentially translated names), a chapter filename map, previous/next chapter info (potentially translated names), and the language.
  - `exec(item)`: Construct a prompt for `call_llm`. If language is not English, add detailed instructions to write the entire chapter in the target language, translating explanations, examples, etc., while noting which input context might already be translated. Ask the LLM to write a beginner-friendly Markdown chapter, providing potentially translated concept details, a summary of previously written chapters (potentially translated), and relevant code snippets. Append the generated (potentially translated) chapter content to `self.chapters_written_so_far` for the next iteration's context. Return the chapter content.
  - `post(shared, prep_res, exec_res_list)`: `exec_res_list` contains the generated chapter Markdown strings (potentially translated), in the correct order. Assign this list directly to `shared["chapters"]`. Clean up `self.chapters_written_so_far`.
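The chapter filename map mentioned in `prep` could be built with a helper like this (illustrative; names translated into non-Latin scripts would need a transliteration step before this simple sanitizer works):

```python
import re


def chapter_filename(chapter_num, abstraction_name):
    """Build '01_some_name.md'-style filenames from abstraction names."""
    safe = re.sub(r"[^a-z0-9]+", "_", abstraction_name.lower()).strip("_")
    return f"{chapter_num:02d}_{safe}.md"
```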
`CombineTutorial`
- Purpose: Assemble the final tutorial files, including a Mermaid diagram using potentially translated names/labels. Fixed text remains English.
- Type: Regular
- Steps:
  - `prep`: Read `project_name`, `relationships` (potentially translated summary/labels), `chapter_order` (indices), `abstractions` (potentially translated name/description), `chapters` (list of potentially translated content), `repo_url`, and `output_dir` from the shared store. Generate a Mermaid `flowchart TD` string based on `relationships["details"]`, using indices to identify nodes (with potentially translated names) and the concise, potentially translated `label` for edges. Construct the content for `index.md` (including the potentially translated summary, the Mermaid diagram, and ordered links to chapters, using potentially translated names derived from `chapter_order` and `abstractions`). Define the output directory path (e.g., `./output_dir/project_name`). Prepare a list of `{"filename": "01_...", "content": "..."}` entries for the chapters, adding the English attribution footer to each chapter's content and to the index content.
  - `exec`: Create the output directory. Write the generated `index.md` content. Iterate through the prepared chapter file list and write each chapter's content to its corresponding `.md` file in the output directory.
  - `post`: Write the final `output_path` to `shared["final_output_dir"]`. Log completion.
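The Mermaid generation in `prep` could be sketched as follows, keying nodes by abstraction index so edges stay unambiguous even when names are translated (the function name is illustrative):

```python
def build_mermaid(abstractions, details):
    """Render relationship data as a Mermaid flowchart string."""
    lines = ["flowchart TD"]
    for i, a in enumerate(abstractions):
        name = a["name"]  # potentially translated
        lines.append(f'    A{i}["{name}"]')
    for rel in details:
        lines.append(f'    A{rel["from"]} -- "{rel["label"]}" --> A{rel["to"]}')
    return "\n".join(lines)
```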