
Implement Evaluation with Continue.dev

Inspect the original prompt language first, then copy or adapt it once you know how it fits your workflow.

Linked challenge: Mathematical Proof Assistant

Format: Text-first · Category: testing

Prompt source

Original prompt text with formatting preserved for inspection.

1 line · 1 section · No variables · 0 checklist items
Integrate Continue.dev to create an iterative testing loop for your proof assistant. Define test cases based on the provided `eval_data.json` (which contains new mathematical statements and their expected proofs/counter-examples). Use Continue.dev to automatically run tests against your LlamaIndex agent, compare generated proofs against ground truth, and report discrepancies. Focus on how Continue.dev helps in rapid iteration and debugging of agent behavior.
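
Continue.dev is an IDE assistant rather than a test framework, so the loop it drives is usually an ordinary script you re-run and debug from the editor. Below is a minimal sketch of that loop, assuming `eval_data.json` holds a list of `{"statement", "expected"}` records and the agent is wrapped in a hypothetical `prove(statement) -> str` callable in a `my_agent` module (both names are assumptions, not part of the original prompt):

```python
# eval_harness.py -- a sketch of the iterative loop the prompt describes.
# Assumptions not in the original prompt: eval_data.json is a list of
# {"statement": ..., "expected": ...} records, and the LlamaIndex agent is
# wrapped in a prove(statement) -> str callable in a my_agent module.
import json


def load_cases(path: str = "eval_data.json") -> list[dict]:
    """Load eval cases: each pairs a statement with its expected proof or counter-example."""
    with open(path) as f:
        return json.load(f)


def run_eval(prove, cases: list[dict]) -> list[dict]:
    """Run the agent on every case and collect discrepancies against ground truth."""
    failures = []
    for case in cases:
        generated = prove(case["statement"])
        # Naive exact comparison; a semantic or rubric-based checker can slot in here.
        if generated.strip() != case["expected"].strip():
            failures.append(
                {"statement": case["statement"], "expected": case["expected"], "got": generated}
            )
    return failures


if __name__ == "__main__":
    from my_agent import prove  # hypothetical wrapper around the LlamaIndex agent

    cases = load_cases()
    failures = run_eval(prove, cases)
    for f in failures:
        print(f"FAIL: {f['statement']}")
        print(f"  expected: {f['expected']}")
        print(f"  got:      {f['got']}")
    print(f"{len(failures)} discrepancies out of {len(cases)} cases")
```

Continue.dev's contribution is then iteration speed: re-run the script from the editor, feed failing cases back to the assistant, and adjust the agent or the prompt without leaving the IDE.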

Adaptation plan

Keep the source prompt as a stable reference, then make changes in a predictable order so each run is easier to evaluate against the last.

Keep stable

Preserve the rubric, target behavior, and pass-fail criteria as the baseline for evaluation.
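
One way to hold that baseline still is to snapshot the agent's current outputs before editing anything, so every later run diffs against the same reference. A sketch reusing the harness above (the file name and `my_agent` wrapper are assumptions):

```python
# baseline.py -- snapshot the agent's current outputs before editing the prompt,
# so later runs diff against a fixed reference. File name and my_agent wrapper
# are assumptions; reuses load_cases from the eval_harness sketch above.
import json

from eval_harness import load_cases
from my_agent import prove  # hypothetical LlamaIndex agent wrapper

BASELINE_PATH = "baseline_results.json"


def save_baseline() -> None:
    """Record one output per eval case; commit the file alongside the prompt."""
    results = {case["statement"]: prove(case["statement"]) for case in load_cases()}
    with open(BASELINE_PATH, "w") as f:
        json.dump(results, f, indent=2)


if __name__ == "__main__":
    save_baseline()
```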

Tune next

Adjust fixtures, mocks, and thresholds to the system under test instead of weakening the assertions.
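
Concretely, that can mean keeping the tunable numbers in one place and leaving the assertion untouched. A pytest-style sketch, where the threshold value and the `similarity` helper are illustrative assumptions rather than part of the original prompt:

```python
# test_proofs.py -- tunables live in fixtures and constants;
# the assertion itself never weakens.
from difflib import SequenceMatcher

import pytest

from my_agent import prove  # hypothetical LlamaIndex agent wrapper

SIMILARITY_THRESHOLD = 0.85  # tune per system under test; the assertion stays fixed


def similarity(a: str, b: str) -> float:
    """Crude textual similarity; swap in a semantic checker without touching tests."""
    return SequenceMatcher(None, a, b).ratio()


@pytest.fixture
def proof_case():
    # Adapt this fixture to the system under test instead of editing the test body.
    return {
        "statement": "For all integers n, n^2 >= 0.",
        "expected": "Proof: the square of any real number is non-negative.",
    }


def test_proof_matches_ground_truth(proof_case):
    generated = prove(proof_case["statement"])
    # The pass-fail criterion is the baseline: only the fixture and threshold move.
    assert similarity(generated, proof_case["expected"]) >= SIMILARITY_THRESHOLD
```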

Verify after

Make sure the prompt catches regressions instead of just mirroring the happy-path examples.
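
A cheap check is to run the suite against a deliberately broken agent and confirm every case fails; if a uniformly wrong agent passes, the suite is only mirroring the happy path. A sketch reusing `load_cases` and `run_eval` from the harness above:

```python
# test_regression_guard.py -- confirm the suite actually flags wrong proofs.
from eval_harness import load_cases, run_eval


def broken_prove(statement: str) -> str:
    """Deliberately wrong stand-in agent: asserts everything without proof."""
    return "Trivially true."


def test_suite_flags_a_broken_agent():
    cases = load_cases()
    failures = run_eval(broken_prove, cases)
    # If the uniformly wrong agent passes any case, the suite is only
    # mirroring happy-path examples rather than catching regressions.
    assert len(failures) == len(cases)
```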