Team: Megha Kataki (mkataki@ucsd.edu), Arshia Vadhani (avadhani@ucsd.edu), Joyce Lu (jol072@ucsd.edu), Jiaxin Yang (jiy016@ucsd.edu)
Mentor: Kun Zhou (kuzhou@ucsd.edu)
We improved a search-based reasoning agent for multi-hop question answering by adding reflection and verification mechanisms that make the agent less likely to answer from unsupported internal guesses, better at recovering missed evidence, and more capable of issuing stronger follow-up searches. The main stakeholders are LLM users who rely on these systems for factual checks or complex multi-hop questions.
Highlights: Accuracy 0.433 (up from 0.292) · F1 0.458 (up from 0.363) · Average searches per query 2.17 (baseline 1.14)
Large language models (LLMs) can make many tasks easier, but they still struggle when a task requires complex external search, selective evidence extraction, and synthesis across multiple retrieved sources, where hallucination is common. Our project studies this through a search-based agent built on the Search-o1 framework and tackles a practical question: how can we make a search agent more reliable, letting it decide when to search, what to search for to reduce false information, and whether it has enough evidence to answer?
We focused on improving the agent pipeline: we started from a Search-o1 baseline, analyzed its failures, and then added new components such as judges and reflection. These mechanisms improved performance, including accuracy and F1 score, on complex multi-hop queries.
Our project begins with a simple observation from the Quarter 1 case study: although Search-o1 can interleave reasoning with external retrieval, the baseline system still breaks down on multi-hop questions in several recurring ways. On HotpotQA, these failures often led the agent to stop too early, miss evidence that had already been retrieved, or continue searching without making meaningful progress.
In many failed cases, the model skipped a needed search and instead answered directly from its internal knowledge, which was often insufficient or outdated. This frequently produced misleading outputs.
Even when relevant documents or webpages were retrieved, the Reason-in-Documents step did not always preserve the useful evidence. As a result, the agent could conclude that no helpful information had been found even though the answer was already present in the fetched pages.
The baseline also struggled to turn partial evidence into a better next query. When an early search step was incomplete or slightly off topic, the model often failed to correct course, which limited its ability to perform effective multi-hop retrieval when solving complex questions.
These patterns suggest that external search alone is not enough. A tool-use agent also needs mechanisms for checking whether an answer is grounded, noticing when useful evidence has been missed, and reflecting on whether the current search path should be revised. These are the gaps our project aims to close.
Search-o1 is a search-augmented reasoning framework that loops between an LRM (Large Reasoning Model), a web search tool, and a document reasoning module. Compared with direct generation, it gives the model a way to gather external evidence; compared with standard RAG, it supports iterative search instead of a single retrieve-then-answer pass.
In our implementation, the baseline model uses the Jina search API (instead of the Bing Search API with Serper.dev used in the original model) to retrieve web results, fetches web page content through Jina parsing, and uses Qwen2.5-3B-Instruct as the reasoning backbone.
Our system builds on the Search-o1 framework and focuses on improving reliability during the search → reason → answer loop. We introduced several verification, judge, and reflection components that target the failure modes identified in the baseline. Our development process proceeded in two stages: an architecture with multiple control gates (Phase 1), followed by a simplified final pipeline after empirical testing (Phase 2).
Before a final answer is accepted, the system checks whether the reasoning trace contains evidence retrieved from external sources. If unsupported claims are detected, the agent is forced to perform an additional search step before answering.
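A minimal sketch of this gate, assuming the reasoning trace tags retrieved evidence with a marker string (the `[Retrieved]` marker and the token-overlap heuristic are illustrative assumptions, not the exact implementation):

```python
import re

# Hypothetical marker that the pipeline inserts before retrieved evidence.
EVIDENCE_MARKER = "[Retrieved]"

def needs_extra_search(reasoning_trace: str, final_answer: str) -> bool:
    """Return True if the answer is not backed by retrieved evidence."""
    if EVIDENCE_MARKER not in reasoning_trace:
        return True  # no external evidence at all: force a search step
    # Require at least one content word of the answer to appear in the trace.
    answer_words = [w for w in re.findall(r"\w+", final_answer.lower()) if len(w) > 3]
    evidence = reasoning_trace.lower()
    return not any(w in evidence for w in answer_words)
```

When this returns `True`, the agent is routed into another search cycle instead of finalizing the answer.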
A lightweight judge function evaluates the relevance of initial search snippets. If the results are weak or unrelated, the system prompts the model to reformulate the query before retrieving full documents.
When the document reasoning step reports that no useful information was found, a secondary step checks the retrieved page content directly. If relevant cues are detected, the reasoning module is asked to reprocess the same evidence.
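This recovery step can be sketched as follows; `reasoner` stands in for the Reason-in-Documents call, and the cue-matching heuristic is an assumption rather than the exact implementation:

```python
def recover_missed_evidence(question, page_text, extraction_result, reasoner):
    """If extraction reported 'no helpful information' but the page
    mentions key terms from the question, re-run extraction once."""
    if "no helpful information" not in extraction_result.lower():
        return extraction_result  # extraction succeeded: keep it
    cues = [w for w in question.lower().split() if len(w) > 4]
    if any(c in page_text.lower() for c in cues):
        # Relevant cues are present in the raw page: reprocess the evidence.
        return reasoner(question, page_text)
    return extraction_result
```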
After extraction, a judge function evaluates whether the retrieved evidence is strong enough to answer the question. When it is not, the system produces feedback that guides the next search action instead of continuing down the same reasoning path.
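A sketch of this judge, assuming a text-in/text-out `llm` callable and a hypothetical YES/NO prompt format:

```python
def judge_and_feedback(question, evidence, llm):
    """Ask an LLM judge whether evidence suffices; if not, return
    feedback to steer the next search. `llm` is a text-in/text-out callable."""
    verdict = llm(
        f"Question: {question}\nEvidence: {evidence}\n"
        "Is this evidence sufficient to answer? Reply YES or NO, then explain."
    )
    if verdict.strip().upper().startswith("YES"):
        return None  # sufficient: proceed to answer
    # Insufficient: the judge's explanation becomes guidance for the next search.
    return verdict.split("\n", 1)[-1].strip() or "Search for the missing entity."
```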
While Phase 1 improved control over the reasoning process, we found that several gates introduced unnecessary complexity and sometimes interfered with the model's own search strategy. Execution logs revealed that some modules contributed little to performance while increasing latency and search loops.
We removed the Gate 1 evaluation because short search summaries often lacked sufficient context. We want the model to prioritize recall and fetch full documents for top-ranked search results instead of judging based on incomplete snippets.
We also removed the routing judge, which caused redundant search loops. The Gate 2 reflection mechanism replaces it and provides strategic guidance for the next search step.
The reflection module analyzes the gap between the user's question and the current findings. This helps the model formulate a specific, strategic follow-up query to guide the agent's next step.
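A minimal sketch of the reflection prompt, with `llm` as a hypothetical text-in/text-out callable and the prompt wording as an assumption:

```python
def reflect_next_query(question, findings, llm):
    """Have the model name the gap between the question and current
    findings, then emit a targeted follow-up search query (one line)."""
    prompt = (
        f"Question: {question}\n"
        f"Current findings: {findings}\n"
        "What specific fact is still missing? "
        "Reply with a single search query that would retrieve it."
    )
    # Keep only the first line so the output is usable as a search query.
    return llm(prompt).strip().splitlines()[0]
```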
Hallucination detection, the extraction refinement loop, and the content judge were retained because they consistently improved the agent's ability to verify evidence and recover missed information.
Hallucination Detection → Retrieval → Extraction Refinement Loop → Content Judge & Reflection
The final pipeline is simpler than our initial design but more stable and effective. By focusing on evidence verification and search refinement, the agent becomes better at multi-hop reasoning rather than prematurely producing unsupported answers.
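The final loop can be sketched end to end; all helper signatures (`search`, `fetch`, `reasoner`, `judge`) are hypothetical stand-ins for the actual components:

```python
def answer_with_search(question, search, fetch, reasoner, judge, max_turns=10):
    """Illustrative outer loop: retrieve, extract, judge, and reflect
    until the evidence is judged sufficient or the turn budget runs out."""
    evidence, query = "", question
    for _ in range(max_turns):
        pages = [fetch(url) for url in search(query)]                   # retrieval
        evidence += " ".join(reasoner(question, p) for p in pages)      # extraction
        feedback = judge(question, evidence)                            # content judge
        if feedback is None:                                            # evidence suffices
            return reasoner(question, evidence)                         # final answer
        query = feedback                      # reflection guides the next search
    return reasoner(question, evidence)       # best effort after budget exhausted
```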
We evaluate the baseline and improved agent on a 120-question subset of the HotpotQA test set, a benchmark designed to evaluate multi-hop reasoning and document synthesis. The full test split contains more than 7,000 questions, but our experiments process full web pages iteratively, which is token-intensive under our search and parsing budget, so we restrict evaluation to this subset.
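HotpotQA answers are conventionally scored with exact match and token-level F1; a minimal sketch of the F1 computation (illustrative, not necessarily the exact script we used):

```python
from collections import Counter

def token_f1(prediction: str, gold: str) -> float:
    """Token-level F1 as commonly used for HotpotQA answer scoring."""
    pred, ref = prediction.lower().split(), gold.lower().split()
    common = Counter(pred) & Counter(ref)   # multiset intersection of tokens
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)
```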
For more information, check our GitHub repository.
Connect to your remote cluster where the model will run:
```shell
ssh username@your-cluster-address
cd /path/to/repository
launch-scipy-ml.sh -W DSC180A_FA25_A00 -c 8 -m 32 -g 1 -v a30
pip install -r requirements.txt
export JINA_API_KEY="your_jina_key"
export SERPER_API_KEY="your_serper_key"
```
```shell
python scripts/run_search_o1.py \
    --dataset_name hotpotqa \
    --split test \
    --subset_num 10 \
    --max_search_limit 5 \
    --max_turn 10 \
    --top_k 10 \
    --max_doc_len 3000 \
    --use_jina True \
    --model_path "Qwen/Qwen2.5-3B-Instruct" \
    --jina_api_key $JINA_API_KEY \
    --bing_subscription_key $SERPER_API_KEY \
    --bing_endpoint "https://google.serper.dev/search"
```
After the script finishes, check the generated results in the outputs/ directory.
The final architecture improves both answer quality and search behavior. The gains are not just numerical; they reflect a more evidence-driven agent that searches more strategically and recovers information it previously left unused.
| Metric | Baseline Search-o1 | Improved Agent | Change |
|---|---|---|---|
| Accuracy | 0.292 | 0.433 | +0.141 |
| F1 Score | 0.363 | 0.458 | +0.095 |
| Average searches per query | 1.14 | 2.17 | +1.03 |
| Hallucination corrections observed | – | 12 | new safety behavior |
| Extraction recoveries observed | – | 4 | new recovery behavior |
The improved agent does not merely search more because it is inefficient. It searches more because it is less willing to stop at weak evidence and more willing to use partial clues to continue multi-hop reasoning.
The jump in accuracy and F1 suggests that the reflection and verification additions help translate extra tool access into more reliable question answering instead of extra noise.
The strongest evidence for our approach is not only the aggregate score improvement, but also the kinds of corrections the final system can make during execution.
In the baseline setting, the model sometimes attempted to answer directly from internal memory. The hallucination check interrupts that behavior and forces a search cycle before the answer can be finalized.
Takeaway: the agent becomes more evidence-dependent instead of overconfident.
When the current evidence is incomplete, the reflection module now proposes a specific next-step query rather than generic advice. This helps the model move from a vague clue to a more productive second-hop search.
Takeaway: better search planning is a major source of the final gain.
In several examples, useful evidence was already present in the fetched pages, but the reasoning module did not extract it on the first pass. The refinement loop detects that mismatch and triggers another analysis step.
Takeaway: reliability improves not only through retrieval, but also through better use of retrieved content.
The final system is better because it changes agent behavior: it verifies more often, stops trusting weak evidence, and turns partial results into more purposeful follow-up actions.
Overall, our experiments suggest that relatively lightweight reflection and verification steps can improve the behavior of search-based reasoning agents. Rather than retraining the underlying model, these small changes, which check evidence and revise the search strategy, can lead to more reliable multi-hop reasoning.