DSC 180B Capstone Project

Think, Search, Correct: A Self-Reflection Mechanism for Adaptive Tool-Use Agents

Team: Megha Kataki (mkataki@ucsd.edu), Arshia Vadhani (avadhani@ucsd.edu), Joyce Lu (jol072@ucsd.edu), Jiaxin Yang (jiy016@ucsd.edu)

Mentor: Kun Zhou (kuzhou@ucsd.edu)

We improved a search-based reasoning agent for multi-hop question answering by adding reflection and verification mechanisms that make the agent less likely to answer from unsupported internal guesses, better at recovering missed evidence, and more capable of issuing stronger follow-up searches. The main stakeholders are LLM users who rely on these models for fact checking or for complex multi-hop questions.

Features of the project: Tool-use agents · Multi-hop QA · Self-reflection · Hallucination control

Accuracy: 📈 0.433 (up from 0.292)

F1 Score: 📈 0.458 (up from 0.363)

Searches per Query: 📈 2.17 (baseline 1.14)

Overview

Large language models (LLMs) can make everyday tasks easier, but they still struggle when a task requires iterative external search, selective evidence extraction, and synthesis across multiple retrieved sources without hallucinating. Our project studies this through a search-based agent built on the Search-o1 framework and tackles a practical question: how can we make a search agent more reliable, so that it decides for itself when to search, what to search for, and whether it has gathered enough evidence to answer without producing false information?

We focused on improving the agent pipeline: we started from a Search-o1 baseline, analyzed its failure modes, and then added new components such as judge and reflection modules. These mechanisms improved performance, raising both accuracy and F1 score on complex multi-hop queries.

Question → Web Search / Retrieve → Extraction → Refinement Loop → Judge / Reflect → Hallucination Checks → Better Follow-up Search or Final Answer

Problem Statement

Our project begins with a simple observation from the Quarter 1 case study: although Search-o1 can interleave reasoning with external retrieval, the baseline system still breaks down on multi-hop questions in several recurring ways. On HotpotQA, these failures often led the agent to stop too early, miss evidence that had already been retrieved, or continue searching without making meaningful progress.

1. Skipped search calls and unsupported answers

In many failed cases, the model skipped a needed search call and instead answered directly from its internal knowledge, which was often insufficient or outdated. This frequently produced misleading outputs.

2. Information loss during extraction

Even when relevant documents or webpages were retrieved, the Reason-in-Documents step did not always preserve the useful evidence. As a result, the agent could conclude that no helpful information had been found even though the answer was already present in the fetched pages.

3. Weak multi-hop search planning

The baseline also struggled to turn partial evidence into a better next query. When an early search step was incomplete or slightly off topic, the model often failed to correct course, which limited its ability to perform effective multi-hop retrieval when solving complex questions.

What do we do?

These patterns suggest that external search alone is not enough. A tool-use agent also needs mechanisms for checking whether an answer is grounded, noticing when useful evidence has been missed, and reflecting on whether the current search path should be revised. These are the gaps our project aims to close.

Pipeline Diagram

Baseline Framework: Search-o1

Search-o1 is a search-augmented reasoning framework that loops between a Large Reasoning Model (LRM), a web search tool, and a document reasoning module. Compared with direct generation, it gives the model a way to gather external evidence; compared with standard RAG, it supports iterative search instead of a single retrieve-then-answer pass.

In our implementation, the baseline uses the Jina search API (instead of the Bing Search API with Serper.dev used in the original work) to retrieve web results, fetches web page content through Jina parsing, and uses Qwen2.5-3B-Instruct as the reasoning backbone.

What the baseline already does well
  • Supports iterative external search
  • Lets the model reason over retrieved documents instead of snippets alone
  • Performs better than pure direct generation on knowledge-intensive tasks
What it still struggles with
  • Premature answer generation
  • Missed evidence after retrieval
  • Weak planning capacity in multi-hop search

Method

Our system builds on the Search-o1 framework and focuses on improving reliability during the search–reason–answer loop. We introduced several verification, judge, and reflection components that target the failure modes identified in the baseline. Our development process proceeded in two stages: first an architecture with multiple control gates (Phase 1), then a simplified final pipeline after empirical testing (Phase 2).

Phase 1: Initial Architecture

Hallucination Detection

Before a final answer is accepted, the system checks whether the reasoning trace contains evidence retrieved from external sources. If unsupported claims are detected, the agent is forced to perform an additional search step before answering.
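A minimal sketch of how such a grounding gate might work. The names (`is_grounded`, `finalize_answer`) and the word-overlap heuristic are illustrative assumptions, not the project's actual code; the real system inspects the reasoning trace for retrieved evidence.

```python
def is_grounded(answer: str, evidence_docs: list[str], threshold: float = 0.6) -> bool:
    """Heuristic grounding check: the fraction of substantive answer tokens
    that also appear in the retrieved evidence must exceed a threshold,
    otherwise the answer is treated as unsupported."""
    answer_tokens = {t for t in answer.lower().split() if len(t) > 3}
    if not answer_tokens:
        return True  # nothing substantive to verify
    evidence_text = " ".join(evidence_docs).lower()
    supported = sum(1 for t in answer_tokens if t in evidence_text)
    return supported / len(answer_tokens) >= threshold


def finalize_answer(answer, evidence_docs, issue_search):
    """Gate: accept the answer only if it is grounded in retrieved evidence;
    otherwise hand control back to the search loop."""
    if is_grounded(answer, evidence_docs):
        return answer
    return issue_search(answer)  # force an additional search step
```

The key design choice is that the gate cannot be bypassed: an ungrounded answer never reaches the user directly, only a renewed search does.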

Retrieval Judge and Query Refinement (Gate 1)

A lightweight judge function evaluates the relevance of the initial search snippets. If the results are weak or unrelated, the system prompts the model to reformulate the query before retrieving full documents.
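A sketch of what a prompt-based snippet judge with query reformulation could look like. The `llm` parameter stands in for any prompt-to-text callable (in our setting, the Qwen backbone); the function names and prompt wording are illustrative, not the project's exact prompts.

```python
def judge_snippets(question: str, snippets: list[str], llm) -> bool:
    """Ask the backbone model whether the search snippets are relevant.
    `llm` is any callable mapping a prompt string to a text response."""
    prompt = (
        f"Question: {question}\n"
        "Search snippets:\n"
        + "\n".join(f"- {s}" for s in snippets)
        + "\nDo these snippets help answer the question? Answer yes or no."
    )
    verdict = llm(prompt).strip().lower()
    return verdict.startswith("yes")


def refine_query(question: str, old_query: str, llm) -> str:
    """If the judge rejects the snippets, ask the model for a better query."""
    prompt = (
        f"The query '{old_query}' returned irrelevant results for the "
        f"question: {question}\nPropose a better search query."
    )
    return llm(prompt).strip()
```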

Extraction Refinement Loop

When the document reasoning step reports that no useful information was found, a secondary step checks the retrieved page content directly. If relevant cues are detected, the reasoning module is asked to reprocess the same evidence.
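The recovery step can be sketched as follows. The cue heuristic (checking whether long words from the question appear in the raw page) and the function names are simplifying assumptions for illustration; `reason_in_docs` stands in for the document reasoning module.

```python
def recover_extraction(question: str, page_text: str, extracted: str,
                       reason_in_docs, max_retries: int = 1) -> str:
    """If the reasoning module reports nothing useful but the raw page
    contains cue terms from the question, ask it to reprocess the page."""
    cues = [w for w in question.lower().split() if len(w) > 4]
    for _ in range(max_retries):
        found_something = (extracted.strip()
                           and "no helpful information" not in extracted.lower())
        if found_something:
            break
        if any(c in page_text.lower() for c in cues):
            # Relevant cues exist in the raw page: trigger a second pass.
            extracted = reason_in_docs(question, page_text)
        else:
            break  # page really does lack relevant content
    return extracted
```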

Content Judge and Reflection (Gate 2)

After extraction, a judge function evaluates whether the retrieved evidence is strong enough to answer the question. When it is not, the system produces feedback that guides the next search action instead of continuing down the same reasoning path.
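A sketch of the judge-then-reflect pattern, assuming a generic `llm` callable; the prompt wording and return structure are illustrative rather than the project's exact implementation.

```python
def judge_and_reflect(question: str, evidence: str, llm) -> dict:
    """Gate 2 sketch: decide whether the evidence suffices; if not,
    produce strategic feedback describing what is still missing."""
    verdict = llm(
        f"Question: {question}\nEvidence: {evidence}\n"
        "Is this evidence sufficient to answer the question? Answer yes or no."
    )
    if verdict.strip().lower().startswith("yes"):
        return {"sufficient": True, "feedback": None}
    # Insufficient: ask for a gap analysis and one concrete follow-up query.
    feedback = llm(
        f"Question: {question}\nEvidence so far: {evidence}\n"
        "State what information is still missing and suggest one specific "
        "follow-up search query."
    )
    return {"sufficient": False, "feedback": feedback.strip()}
```

The feedback string, rather than the raw failure, is what drives the next search action.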

Observation from Phase 1

While Phase 1 improved control over the reasoning process, we found that several gates introduced unnecessary complexity and sometimes interfered with the model's own search strategy. Execution logs revealed that some modules contributed little to performance while increasing latency and the number of search loops.

Phase 2: Refinement

Removing the snippet judge

We removed the Gate 1 evaluation because short search summaries often lacked sufficient context. We instead have the model prioritize recall, fetching the full documents of top-ranked search results rather than judging based on incomplete snippets.

Removing routing control

We also removed the routing judge, which caused redundant search loops. The Gate 2 reflection mechanism replaces it and provides strategic guidance for the next search step.

Refinement reflection feedback

The reflection module analyzes the gap between the user's question and the current findings, helping the model formulate a specific, strategic follow-up query that guides the agent's next step.

Retaining effective components

Hallucination detection, the extraction refinement loop, and the content judge were retained because they consistently improved the agent's ability to verify evidence and recover missed information.

Final Pipeline

Hallucination Detection → Retrieval → Extraction Refinement Loop → Content Judge & Reflection

The final model is simpler than our initial design but more stable and effective. By focusing on evidence verification and search refinement, the agent becomes better at multi-hop reasoning rather than prematurely producing unsupported answers.
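The control flow of the final pipeline can be sketched as a single loop with injected components. Every callable here (`search`, `extract`, `judge`, `grounded`, `answer`) is a hypothetical stand-in for the corresponding module; the point is the flow, not the implementations.

```python
def run_agent(question, search, extract, judge, grounded, answer, max_turns=5):
    """Illustrative control loop: retrieve, extract, judge, and either
    answer (if grounded) or let reflection feedback drive the next query."""
    evidence = []
    query = question
    for _ in range(max_turns):
        docs = search(query)
        evidence.append(extract(question, docs))
        sufficient, feedback = judge(question, evidence)
        if sufficient:
            candidate = answer(question, evidence)
            if grounded(candidate, evidence):
                return candidate  # evidence-backed final answer
            query = question      # ungrounded answer: force another cycle
        else:
            query = feedback      # reflection proposes the next search query
    return answer(question, evidence)  # budget exhausted: best effort
```

Note how the reflection feedback, not the original question, becomes the second-hop query; this is what makes the extra searches strategic rather than repetitive.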

Evaluation Setup

We evaluate the baseline and the improved agent on a 120-question subset of the HotpotQA test set, a benchmark designed to evaluate multi-hop reasoning and document synthesis. The full test split contains more than 7,000 questions, but our experiments process full web pages iteratively, which is token-intensive under our search and parsing budget, so we restrict the evaluation to this 120-question subset.

Dataset

  • HotpotQA test split
  • 120-sample subset for final evaluation
  • Chosen to preserve a meaningful benchmark under API and token limits

Model and settings

  • Backbone model: Qwen2.5-3B-Instruct
  • max_search_limit = 10
  • max_turn = 15, top_k = 10, max_doc_len = 3000

Metrics

  • Accuracy: whether the ground-truth answer appears in the prediction
  • F1: token overlap between prediction and ground truth
  • Average searches per query as a behavioral signal
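The two answer-quality metrics above can be sketched directly; this follows the common token-overlap style used for HotpotQA-type evaluation, though the project's exact evaluation script may normalize text differently.

```python
from collections import Counter


def accuracy(prediction: str, gold: str) -> bool:
    """Containment accuracy: the ground-truth answer appears in the prediction."""
    return gold.lower() in prediction.lower()


def token_f1(prediction: str, gold: str) -> float:
    """Token-overlap F1 between prediction and ground truth."""
    pred, ref = prediction.lower().split(), gold.lower().split()
    common = Counter(pred) & Counter(ref)  # multiset intersection
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)
```

For example, a verbose prediction containing the right answer counts as accurate but is penalized by F1 for its extra tokens.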

What we track

  • Quantitative metrics per batch
  • Extraction histories and prompts
  • Search traces for qualitative error analysis

πŸ“ Demo to Run the Project

For more information, check our GitHub repository.


1. SSH into the Cluster

Connect to your remote cluster where the model will run:

ssh username@your-cluster-address
cd /path/to/repository

2. Launch the environment with A30 GPU

launch-scipy-ml.sh -W DSC180A_FA25_A00 -c 8 -m 32 -g 1 -v a30

3. Install Dependencies

pip install -r requirements.txt

4. Set up your API keys

export JINA_API_KEY="your_jina_key"
export SERPER_API_KEY="your_serper_key"

5. Run the Search-o1 script

python scripts/run_search_o1.py \
      --dataset_name hotpotqa \
      --split test \
      --subset_num 10 \
      --max_search_limit 5 \
      --max_turn 10 \
      --top_k 10 \
      --max_doc_len 3000 \
      --use_jina True \
      --model_path "Qwen/Qwen2.5-3B-Instruct" \
      --jina_api_key $JINA_API_KEY \
      --bing_subscription_key $SERPER_API_KEY \
      --bing_endpoint "https://google.serper.dev/search"

6. Check Outputs Saved to outputs/

After the script finishes, check the generated results in the outputs/ directory.

Results

Quantitative Performance

The final architecture improves both answer quality and search behavior. The gains are not just numerical; they reflect a more evidence-driven agent that searches more strategically and recovers information it previously left unused.

Metric | Baseline Search-o1 | Improved Agent | Change
Accuracy | 0.292 | 0.433 | +0.141
F1 Score | 0.363 | 0.458 | +0.095
Average searches per query | 1.14 | 2.17 | +1.03
Hallucination corrections observed | – | 12 | new safety behavior
Extraction recoveries observed | – | 4 | new recovery behavior
(Screenshot: full results on 120 samples.)

What the higher search count means

The improved agent does not merely search more because it is inefficient. It searches more because it is less willing to stop at weak evidence and more willing to use partial clues to continue multi-hop reasoning.

Why the gains matter

The jump in accuracy and F1 suggests that the reflection and verification additions help translate extra tool access into more reliable question answering instead of extra noise.

Qualitative Performance

The strongest evidence for our approach is not only the aggregate score improvement, but also the kinds of corrections the final system can make during execution.

Case 1: Hallucination check prevents unsupported answers

In the baseline setting, the model sometimes attempted to answer directly from internal memory. The hallucination check interrupts that behavior and forces a search cycle before the answer can be finalized.

Takeaway: the agent becomes more evidence-dependent instead of overconfident.

Case 2: Reflection improves follow-up search

When the current evidence is incomplete, the reflection module now proposes a specific next-step query rather than generic advice. This helps the model move from a vague clue to a more productive second-hop search.

Takeaway: better search planning is a major source of the final gain.

Case 3: Extraction refinement recovers missed information

In several examples, useful evidence was already present in the fetched pages, but the reasoning module did not extract it on the first pass. The refinement loop detects that mismatch and triggers another analysis step.

Takeaway: reliability improves not only through retrieval, but also through better use of retrieved content.

What these cases show overall

The final system is better because it changes agent behavior: it verifies more often, stops trusting weak evidence, and turns partial results into more purposeful follow-up actions.

(Screenshots: hallucination detection steps, judge and reflection traces, and extraction refinement examples.)

Limitations and Future Work

Current limitations

  • Our evaluation was conducted on a 120-question subset of HotpotQA rather than the full benchmark. This was mainly due to token limits and the cost of running iterative retrieval and reasoning steps.
  • The pipeline introduces additional reasoning passes over retrieved documents. While this helps recover missed evidence, it also increases latency and token usage compared with the baseline.
  • The judge and reflection components rely on prompt-based evaluation. In some cases the feedback is still coarse and vague, which limits how precisely the agent can diagnose its own reasoning failures.

Future work

  • Run larger-scale experiments on the full HotpotQA benchmark when token limits are not a constraint; this would allow a more comprehensive evaluation of the system's performance.
  • Develop more structured diagnostic features for reflection, so that the agent can distinguish between different failure types such as missing evidence, weak queries, or incomplete reasoning.
  • Extend the framework beyond retrieval-only settings and study how similar reflection mechanisms interact with other external tools in more complex tool-use agents.
Takeaway

Overall, our experiments suggest that relatively lightweight reflection and verification steps can improve the behavior of search-based reasoning agents. Rather than retraining the underlying model, small changes that check evidence and revise the search strategy can lead to more reliable multi-hop reasoning.

Team Contributions

Arshia Vadhani

  • Attempted to implement components for the Phase 1 pipeline, including a basic judge and reflection prompting function for content refinement (Gate 3).
  • Collaborated with Joyce to correct and update the Gate 3 functions to improve overall accuracy and ensure the generated answers met the required output format.
  • Contributed to the report by documenting the design and implementation of Gate 3's judge, reflect, and refine components.
  • Contributed to the Discussion section, summarizing results, outlining system limitations, and proposing future improvements to the agent.
  • Contributed to the final project poster, working on the domain introduction and contextual background.

Megha Kataki

  • Implemented the initial judge function (Gate 1), determining where document-level judgments should occur within the pipeline.
  • Ensured that each document retrieval call was evaluated properly, first using a heuristic-based evaluation approach and later transitioning to a Large Reasoning Model (LRM) approach.
  • Experimented with different judgment and reflection strategies and ran full dataset evaluations to improve the final performance metrics.
  • Contributed to the Introduction section of the report.
  • Managed report formatting in Overleaf and migrated the document from Google Docs to Overleaf.
  • Contributed to the design of the project poster and authored the following sections: Data, Methods, Qualitative Results, Quantitative Results, Conclusion, Impact, and References.

Joyce Lu

  • Implemented multiple components for the Phase 1 pipeline, including the Retrieval Judge and Reflection module (Gate 1), the Hallucination Detection module, the Extraction Refinement Loop, and the Content Judge and Reflection module (Gate 2).
  • Led architectural refinements in Phase 2 by identifying bottlenecks in the pipeline (Gate 1 and Gate 3) and optimizing system performance by refining the Content Judge and Reflection logic (Gate 2).
  • Contributed to the Methods section of the report, documenting the baseline framework and the iterative development process across Phase 1 and Phase 2.
  • Conducted the final evaluation on the HotpotQA dataset and contributed to the Results section documenting both quantitative and qualitative findings.

Jiaxin Yang

  • Implemented the initial reflective functions (Gate 1) and helped refine the prompt organization and reflective reasoning in Phase 1, merging them with Megha's judge function and trying a heuristic approach for evaluation.
  • Contributed to the report's abstract, summarizing the overview along with Report 1's takeaways and improvements.
  • Contributed to the conclusion section of the report, summarizing the agent's limitations, future work, and improvements.
  • Contributed to the project website design and implementation.

References

  1. Li, Xiaoxi, Guanting Dong, Jiajie Jin, Yuyao Zhang, Yujia Zhou, Yutao Zhu, Peitian Zhang, and Zhicheng Dou. 2025. "Search-o1: Agentic Search-Enhanced Large Reasoning Models."
  2. Yang, An, Baosong Yang, Binyuan Hui, et al. 2024. "Qwen2 Technical Report."
  3. Yang, Zhilin, Peng Qi, Saizheng Zhang, Yoshua Bengio, William W. Cohen, Ruslan Salakhutdinov, and Christopher D. Manning. 2018. "HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering." arXiv preprint arXiv:1809.09600.