OpenAI Deep Research vs. DeepSeek R1: Which One Is Better at Knowledge Work?

Sam Altman said OpenAI will keep releasing reasoning models that outperform DeepSeek’s R1. LIONEL BONAVENTURE/AFP via Getty Images

About two weeks after DeepSeek’s R1 model rattled Silicon Valley, OpenAI is stepping up with Deep Research, an agentic A.I. built on a version of its latest o3 model. Unveiled on Feb. 2, Deep Research is designed to tackle complex knowledge work and outperforms DeepSeek R1 in logic-intensive tasks, OpenAI said in a blog post. It scored 26 percent accuracy on Humanity’s Last Exam (a benchmark that tests an A.I.’s expert-level reasoning capabilities), nearly tripling R1’s 9 percent. CEO Sam Altman claims Deep Research “could do a single-digit percentage of all economically valuable tasks in the world.”

Deep Research can autonomously scrape the web in real time and compile financial reports, scientific studies or policy evaluations in minutes. It also offers a feature similar to DeepSeek R1’s “DeepThink” mode, which visually displays the model’s reasoning process (chain of thought) before delivering answers.

Deep Research is currently available to ChatGPT Pro subscribers, who pay $200 a month, with a limit of 100 queries per month.

Deep Research is specifically targeted at professionals “who do intensive knowledge work” and rely heavily on precise, reliable research, OpenAI said. However, it can also serve other use cases, such as grocery shopping and buying a car, which typically require extensive research. The A.I. can generate comprehensive reports complete with charts, citations and a summary of its reasoning, making its output easy for human users to digest.

“These agents are coming for knowledge worker jobs, from researchers and tax preparers to customer service reps and beyond,” Zach Lloyd, founder and CEO of Warp, an Altman-backed startup that makes A.I. developer tools, told Observer.

In comparison, DeepSeek R1’s Think Mode and Search feature currently lack research depth and struggle when given minimal input to guide their responses. Moreover, the A.I. often rejects user queries that involve deeper research on certain countries and topics because of built-in guardrails. For instance, when asked questions about sensitive political topics in China, such as “What happened during the Tiananmen Square protests and massacre in 1989?” and “Who is Xi Jinping?” the A.I. only responded, “Sorry, that’s beyond my current scope. Let’s talk about something else.”

In a recent LinkedIn post, Aravind Srinivas, co-founder and CEO of Perplexity AI, criticized DeepSeek R1’s answer bias. “It’s essential to remember that an A.I. model cannot be a censored propaganda machine for China or, even worse, propagating falsehood,” he wrote. “The more censored and false the outputs are, the more dangerous it is to use these models, or their outputs (directly as a user) and indirectly (for distillation).” 

Despite these quirks, R1 excels at most tasks that require step-by-step reasoning, particularly coding and scientific or mathematical problems, where displaying logical progression is essential.

In an X post last month, ahead of Deep Research’s launch, Altman said OpenAI will “obviously deliver much better models” than DeepSeek through its upcoming releases.

Experts remain divided on whether the agentic A.I. is already capable of replacing human jobs. Grant Passmore, co-founder and CEO of the A.I. development and reasoning platform Imandra, believes a 26 percent accuracy rate on Humanity’s Last Exam is still insufficient for trusting these tools in real-world applications. “To meet the accuracy standards required for practical applications, it is essential for agents to integrate symbolic A.I. and automated logical reasoning,” he told Observer. 

Siqi Chen, CEO of A.I.-powered business planning software Runway, disagrees. “Based on my usage experience, Deep Research certainly crosses the threshold Altman referred to,” he told Observer. “The complexity of tasks in knowledge job roles is often far less demanding than the types of challenges you’d encounter on the Humanity’s Last Exam benchmark.”