How Good is DeepSeek in Driving An Agentic Architecture? – A Comparative Case Study

Munawar Hafiz, CEO of OpenRefactory, writes about a fascinating comparison of DeepSeek and other models driving agentic AI systems. Edited by Charlie Bedard.

DeepSeek has swept the world technology news in the last couple of weeks. It brings new ways of thinking about the cost structure and the business model of many of the current global AI leaders. The key benefits of DeepSeek include:

1. DeepSeek has been released as open source under an MIT license, although information about the training data is not available.
2. DeepSeek R1 offers performance similar to the OpenAI o1 model at a fraction of the cost (30X less expensive).
3. When compared against OpenAI's published benchmarks, DeepSeek performed close to the o1 model on natural language processing tasks and a little better than o1 on math and code generation tasks.

People have already started test-driving these technologies. These tests make qualitative evaluations by comparing the outputs generated by DeepSeek with those generated by OpenAI. One such demonstration on YouTube shows that DeepSeek-generated code may have some compilation errors, but that fixing those errors is not hard.

Even after the errors are fixed, the solution provided by OpenAI for an open-ended coding problem is more elegant than the one provided by DeepSeek. Kyle Orland provides a detailed comparison of outputs across different domains (creative writing, mathematics, code generation). The observation, as expected from a qualitative study, was that there was no clear winner. Quoting from that post: “DeepSeek’s R1 model definitely distinguished itself by citing reliable sources to identify the billionth prime number and with some quality creative writing in the dad jokes and Abraham Lincoln’s basketball prompts. However, the model failed on the hidden code and complex number set prompts, making basic errors in counting and/or arithmetic that one or both of the OpenAI models avoided.”

In this article we describe an experiment that puts both the reasoning and the generation capabilities to the test. We will not be testing code generation directly; others have done that. Instead, we will compare how six models, from OpenAI, Google, Anthropic, and DeepSeek, performed when driving an agentic AI architecture to carry out a specific task.

Use Case for the Agent and the Architecture

We have developed an agent that, given the GitHub URL of an open source package, attempts to build the package automatically and run the unit tests.
The system comprises two agents:
  1. an LLM agent, which generates commands, and
  2. an executor agent, which executes those commands and returns output to the LLM agent, thus creating a feedback loop.


The process starts with the creation of an Ubuntu 22.04 base Docker image. The LLM agent is provided with the README file and prompted to generate a command for building the package and executing unit tests automatically. The LLM agent is instructed to output “DONE” upon successful completion of the build. The command from the LLM agent is then extracted and sent to the executor agent Docker container. The terminal output is relayed back to the LLM agent, which is then asked to provide the next command. If the build is successful, the LLM agent outputs “DONE” instead of a command. The executor agent terminates its container upon receiving “DONE”.
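To make the feedback loop concrete, here is a minimal sketch of how such a loop could be wired up. It is illustrative only: the `ask_llm` callable, the container name, the turn limit, and the prompt wording are assumptions rather than the actual implementation. The executor simply relays each command to the Docker container via `docker exec` and feeds the terminal output back to the LLM agent until the model replies “DONE”.

```python
import subprocess

MAX_TURNS = 30  # assumed safety cap so a stuck model cannot loop forever


def run_build_loop(ask_llm, container, readme_text):
    """Drive the LLM-agent / executor-agent feedback loop described above.

    ask_llm   -- hypothetical callable: takes the conversation so far (a list
                 of strings) and returns the model's next reply as a string.
    container -- name of the running Ubuntu 22.04 Docker container.
    """
    history = [
        "You are given the README of an open source package. "
        "Reply with one shell command at a time to build the package and run "
        "its unit tests. When the build and tests succeed, reply with exactly "
        "DONE.\n\n" + readme_text
    ]
    for _ in range(MAX_TURNS):
        reply = ask_llm(history).strip()
        if reply == "DONE":
            # Build succeeded; the executor tears down its container.
            subprocess.run(["docker", "rm", "-f", container], check=False)
            return True
        # Executor agent: run the model's command inside the container.
        result = subprocess.run(
            ["docker", "exec", container, "bash", "-lc", reply],
            capture_output=True,
            text=True,
        )
        # Relay the terminal output back to the LLM agent for the next turn.
        history.append(reply)
        history.append(result.stdout + result.stderr)
    return False  # give up after MAX_TURNS exchanges
```

The key design point is the loop itself: the LLM agent never touches the file system directly; it only sees README text and terminal output, and the executor only runs what the LLM agent proposes.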

In this study, we evaluate the performance of several LLM-driven build automation agents operating within a controlled execution environment.

Test Setup

We randomly selected 15 projects from the Jenkins dependency tree for evaluation. Jenkins has 172 dependencies; we picked 15 of them without considering any criteria such as project size or popularity (a sketch of this sampling step appears after the model list below). This was to avoid any prior bias about some projects being harder to build than others. Because these are dependencies of Jenkins, all are Java projects and therefore use common Java build tools (e.g., Maven or Gradle). In the future, we plan to extend the experiment to projects written in other languages. For the experiment, a total of six models were employed to drive the agents and attempt to build the projects and run the tests:
    1. DeepSeek R1,
    2. ChatGPT O1-Preview,
    3. Gemini-2.0-Flash,
    4. Claude 3.5 Sonnet,
    5. ChatGPT-4o-Latest, and
    6. Gemini-2.0-Flash-Exp.
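
The sampling step mentioned above could look like the following minimal sketch. It assumes the 172 dependency repositories are listed one URL per line in a file; the file name `jenkins_dependencies.txt` and the fixed seed are placeholders for illustration, not part of the actual setup.

```python
import random

# Hypothetical input: one GitHub URL per line for each of the 172 Jenkins dependencies.
with open("jenkins_dependencies.txt") as f:
    dependencies = [line.strip() for line in f if line.strip()]

# Uniform random sample of 15 projects, with no size or popularity criteria,
# so that no prior notion of "hard to build" biases the selection.
random.seed(42)  # fixed seed only to make this illustration reproducible
selected = random.sample(dependencies, k=15)
print("\n".join(selected))
```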


The following table shows a comparison of the models.
