Felo Research Test Report on DeepResearch Bench
In the DeepResearch Bench evaluation, Felo Research achieved an overall RACE score of 49.37, ahead of Gemini-2.5-Pro Deep Research (48.92) and OpenAI Deep Research (46.45). Its four dimension scores all fall between 0.47 and 0.51, a balanced profile with no significant weaknesses.
1. Background Overview
DeepResearch Bench is a comprehensive benchmark designed for Deep Research Agents (AI systems specialized in intensive research tasks). It aims to evaluate how well AI handles high-level research assignments. The benchmark has the following characteristics:
- Broad task coverage: Includes 100 PhD-level research tasks, created by domain experts, spanning 22 fields (such as science and technology, finance, arts, history, and more).
- Rigorous design: Tasks are derived from real-world deep-retrieval queries, ensuring alignment with genuine research needs.
- Dual evaluation framework:
  - RACE (Reference-based Adaptive Criteria-driven Evaluation): Dynamically generates evaluation standards, scoring reports across four weighted dimensions: Comprehensiveness, Insight (Depth), Instruction-Following, and Readability.
  - FACT (Framework for Factual Abundance and Citation Trustworthiness): Focuses on the reliability of retrieval and citations, measuring Citation Accuracy and Effective Citations per task.
Overall, DeepResearch Bench is widely used to assess AI research agents on depth of analysis, critical thinking, and writing quality.
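The two frameworks can be illustrated with a minimal Python sketch. It is not the benchmark's actual implementation: the weights are made-up placeholders (RACE generates its criteria dynamically), the `Citation` type and both function names are hypothetical, and the FACT formulas assume that Citation Accuracy is the share of citations that support their statement and that Effective Citations is the number of supported citations per task.

```python
from dataclasses import dataclass

# Hypothetical weights for one task. RACE generates its criteria dynamically,
# so these numbers are illustrative only.
EXAMPLE_WEIGHTS = {
    "comprehensiveness": 0.30,
    "insight": 0.35,
    "instruction_following": 0.20,
    "readability": 0.15,
}


def weighted_race_score(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Combine per-dimension scores into a single weighted score."""
    total_weight = sum(weights.values())
    return sum(scores[dim] * w for dim, w in weights.items()) / total_weight


@dataclass
class Citation:
    statement: str
    url: str
    supported: bool  # True if the cited page backs the statement


def fact_metrics(tasks: list[list[Citation]]) -> tuple[float, float]:
    """Return (citation accuracy, effective citations per task).

    Assumed definitions: accuracy = supported / all citations;
    effective citations = supported citations averaged over tasks.
    """
    all_citations = [c for task in tasks for c in task]
    supported = sum(c.supported for c in all_citations)
    accuracy = supported / len(all_citations) if all_citations else 0.0
    effective_per_task = supported / len(tasks) if tasks else 0.0
    return accuracy, effective_per_task
```

Dividing by the sum of the weights keeps the combined score on the same scale as the inputs even if the weights do not add up to exactly 1.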
2. Felo Research – Test Results
On DeepResearch Bench, Felo Research achieved the following RACE scores on a 0 to 1 scale (full results available on GitHub):
Dimension | Score |
---|---|
Comprehensiveness | 0.4748 |
Insight | 0.497 |
Instruction-Following | 0.5089 |
Readability | 0.4958 |
Overall (Weighted) | 0.4937 |
These scores are consistent across all four dimensions, ranging from 0.4748 (Comprehensiveness) to 0.5089 (Instruction-Following), which indicates balanced strengths without any major weaknesses.
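The "balanced" claim can be checked with a few lines of arithmetic. The sketch below uses only the scores from the table above; the variable names are illustrative, and the unweighted mean is shown purely for comparison, since the reported overall score is weighted.

```python
# RACE scores copied from the table above (0 to 1 scale).
race_scores = {
    "Comprehensiveness": 0.4748,
    "Insight": 0.497,
    "Instruction-Following": 0.5089,
    "Readability": 0.4958,
}
reported_overall = 0.4937

# Spread between the strongest and weakest dimension.
spread = max(race_scores.values()) - min(race_scores.values())
# Plain average, for comparison with the weighted overall score.
unweighted_mean = sum(race_scores.values()) / len(race_scores)
weakest = min(race_scores, key=race_scores.get)
strongest = max(race_scores, key=race_scores.get)

print(f"weakest: {weakest}, strongest: {strongest}")
print(f"spread: {spread:.4f}")
print(f"unweighted mean: {unweighted_mean:.4f} (reported overall: {reported_overall})")
print(f"overall on the 0-100 scale: {reported_overall * 100:.2f}")
```

With these values the spread is about 0.034 and the unweighted mean is about 0.4941, slightly above the reported 0.4937, which is what one would expect if the dimensions carry unequal weights.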
3. Interpretation & Benchmark Comparison
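From the official DeepResearch Bench leaderboard (scores on a 0 to 100 scale; a dash marks a value that is not reported):
Model | RACE Overall | RACE Comprehensiveness | RACE Insight (Depth) | RACE Instruction-Following | RACE Readability | FACT Citation Accuracy | FACT Effective Citations |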
---|---|---|---|---|---|---|---|
Deep Research Agent | |||||||
Felo Research | 49.37 | 47.48 | 49.7 | 50.89 | 49.58 | - | - |
Gemini-2.5-Pro Deep Research | 48.92 | 48.45 | 48.3 | 49.29 | 49.77 | 78.3 | 165.34 |
OpenAI Deep Research | 46.45 | 46.46 | 43.73 | 49.39 | 47.22 | 75.01 | 39.79 |
Claude-Researcher | 45 | 45.34 | 42.79 | 47.58 | 44.66 | - | - |
Kimi-Researcher | 44.64 | 44.96 | 41.97 | 47.14 | 45.59 | - | - |
Doubao-DeepResearch | 44.34 | 44.84 | 40.56 | 47.95 | 44.69 | 52.86 | 52.62 |
Perplexity-Research | 40.46 | 39.1 | 35.65 | 46.11 | 43.08 | 82.63 | 31.2 |
Grok Deeper Search | 38.22 | 36.08 | 30.89 | 46.59 | 42.17 | 73.08 | 8.58 |
LLM with Search Tools | |||||||
Perplexity-Sonar-Reasoning-Pro | 37.76 | 34.96 | 31.65 | 44.93 | 42.42 | 45.19 | 9.39 |
Perplexity-Sonar-Reasoning | 37.75 | 34.73 | 32.59 | 44.42 | 42.39 | 52.58 | 13.37 |
Claude-3.7-Sonnet w/Search | 36.63 | 35.95 | 31.29 | 44.05 | 36.07 | 87.32 | 24.51 |
Perplexity-Sonar-Pro | 36.19 | 33.92 | 29.69 | 43.39 | 41.07 | 79.72 | 16.75 |
Gemini-2.5-Pro-Preview | 31.9 | 31.75 | 24.61 | 40.24 | 32.76 | - | - |
GPT-4o-Search-Preview | 30.74 | 27.81 | 20.44 | 41.01 | 37.6 | 86.63 | 5.05 |
Perplexity-Sonar | 30.64 | 27.14 | 21.62 | 40.7 | 37.46 | 76.41 | 10.68 |
GPT-4.1 w/Search | 29.31 | 25.59 | 18.42 | 40.63 | 36.49 | 89.85 | 4.27 |
Gemini-2.5-Flash-Preview | 29.19 | 28.97 | 21.62 | 37.8 | 29.97 | - | - |
GPT-4o-Mini-Search-Preview | 27.62 | 24.24 | 16.62 | 38.59 | 35.27 | 81.69 | 4.62 |
GPT-4.1-Mini w/Search | 26.62 | 22.86 | 15.39 | 38.18 | 34.49 | 84.54 | 4.1 |
Claude-3.5-Sonnet w/Search | 23.95 | 21.28 | 16.2 | 32.41 | 29.87 | 94.06 | 9.35 |
- Felo Research's overall RACE score is 49.37.
- This is 0.45 points above Gemini-2.5-Pro Deep Research (48.92), placing it first in the table.
- It leads the other deep research agents by wider margins: OpenAI (46.45), Claude (45), Kimi (44.64), and Doubao (44.34).
- It is far ahead of the “LLM with Search Tools” group, whose best overall score is 37.76.
In other words, Felo Research has matched and narrowly surpassed the previous leaderboard leader on overall RACE score, establishing itself among the top-tier research-focused AI systems.
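The margins behind this comparison can be reproduced directly from the RACE Overall column. The following is a small sketch; the `margins_over` helper is a hypothetical name introduced for illustration.

```python
# RACE Overall scores copied from the Deep Research Agent rows of the table above.
race_overall = {
    "Felo Research": 49.37,
    "Gemini-2.5-Pro Deep Research": 48.92,
    "OpenAI Deep Research": 46.45,
    "Claude-Researcher": 45.00,
    "Kimi-Researcher": 44.64,
    "Doubao-DeepResearch": 44.34,
    "Perplexity-Research": 40.46,
    "Grok Deeper Search": 38.22,
}


def margins_over(baseline: str, scores: dict[str, float]) -> dict[str, float]:
    """Point difference between `baseline` and every other system."""
    base = scores[baseline]
    return {name: round(base - s, 2) for name, s in scores.items() if name != baseline}


# Rank systems from highest to lowest overall score.
ranking = sorted(race_overall, key=race_overall.get, reverse=True)
margins = margins_over("Felo Research", race_overall)

for name in ranking[1:]:
    print(f"Felo Research leads {name} by {margins[name]:.2f} points")
```

The lead over Gemini-2.5-Pro Deep Research is under half a point, while the lead over every other agent in the group is close to three points or more.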
4. What Makes Felo Research Unique
1) Deep Search with Reflection Mechanism
- Plan-driven search: Before diving into retrieval, Felo Research generates a clear step-by-step plan and rewrites queries intelligently for each stage. This ensures both breadth of coverage and depth of results.
- Reflection and iteration: After collecting initial results, the system performs a “reflection step” that identifies gaps between the retrieved information and the user’s actual research question. If gaps are found, it automatically re-invokes the Web Search and Page Reading tools to fill them. This reduces the risk of one-sided or incomplete outputs.
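The plan-then-reflect loop described above might look roughly like the sketch below. This is an illustration under stated assumptions, not Felo Research's actual code: `research_loop`, its parameters, and the toy `plan`, `search`, and `find_gaps` stand-ins are all hypothetical, and a real system would back them with a planner model, web search, and page reading.

```python
from typing import Callable


def research_loop(
    question: str,
    plan: Callable[[str], list[str]],
    search: Callable[[str], list[str]],
    find_gaps: Callable[[str, list[str]], list[str]],
    max_rounds: int = 3,
) -> list[str]:
    """Plan queries, gather evidence, then re-search until no gaps remain."""
    evidence: list[str] = []
    queries = plan(question)  # step-by-step plan expressed as search queries
    for _ in range(max_rounds):
        for query in queries:
            evidence.extend(search(query))
        # Reflection step: turn missing information into follow-up queries.
        queries = find_gaps(question, evidence)
        if not queries:
            break
    return evidence


# Toy stand-ins (hypothetical) so the sketch runs without network access.
def toy_plan(question: str) -> list[str]:
    return [f"{question} overview", f"{question} recent data"]


def toy_search(query: str) -> list[str]:
    return [f"note on: {query}"]


def toy_find_gaps(question: str, evidence: list[str]) -> list[str]:
    needed = f"note on: {question} limitations"
    return [] if needed in evidence else [f"{question} limitations"]


notes = research_loop("solid-state batteries", toy_plan, toy_search, toy_find_gaps)
print(notes)
```

The `max_rounds` cap is a design choice in this sketch: it bounds the cost of the loop when the gap check keeps proposing new queries.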
2) Cross-Language Search Capability
- Community-aware retrieval: Felo Research detects which language communities host the richest knowledge sources for the task.
- Cross-lingual query rewriting: Queries are adapted into other languages when needed, tapping into high-quality resources across linguistic boundaries.
- Boosting reliability and authority: By pulling from multilingual sources, results become both broader in coverage and stronger in credibility.
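As a rough illustration of community-aware, cross-lingual query rewriting, the sketch below maps a topic to the languages assumed to host the richest sources and emits one query per language. The lookup tables and the `rewrite_queries` function are hypothetical placeholders; in practice a language-selection model and a translation service would take their place.

```python
# Hypothetical lookup tables standing in for a language-selection model
# and a translation service.
TOPIC_LANGUAGES = {
    "anime industry": ["ja", "en"],
    "bundesliga finances": ["de", "en"],
}
TRANSLATIONS = {
    ("anime industry revenue", "ja"): "アニメ産業 収益",
}


def rewrite_queries(query: str, topic: str, default_lang: str = "en") -> list[tuple[str, str]]:
    """Return (language, query) pairs for each language community worth searching."""
    languages = TOPIC_LANGUAGES.get(topic, [default_lang])
    rewritten: list[tuple[str, str]] = []
    for lang in languages:
        if lang == default_lang:
            text = query  # the original query is assumed to be in the default language
        else:
            text = TRANSLATIONS.get((query, lang))  # None if no translation is available
        if text is not None:
            rewritten.append((lang, text))
    return rewritten


print(rewrite_queries("anime industry revenue", "anime industry"))
```

Languages without an available translation are skipped rather than searched with an untranslated query, so the fallback is always the original-language search.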
5. Conclusion
Felo Research achieved an overall score of 0.4937 (49.37 on the 100-point leaderboard scale) on DeepResearch Bench, placing it ahead of Gemini-2.5-Pro Deep Research and OpenAI Deep Research. Its scores are evenly balanced across the four RACE dimensions, with Instruction-Following the strongest (0.5089) and Readability and Insight close behind.
The system’s competitive edge comes from its unique research strategy:
- A “plan-driven + reflection-iteration” loop ensures both wide-ranging and in-depth coverage of information.
- Cross-lingual search unlocks access to authoritative and diverse sources beyond a single language community.
Together, these design choices are intended to improve comprehensiveness and factual reliability, and they position Felo Research as one of the most competitive research-oriented AI systems available today.