Fabio Lauria

Beyond the algorithm: How artificial intelligence models are trained and refined

April 7, 2025

How to train an artificial intelligence model

Training artificial intelligence models represents one of the most complex challenges in contemporary technological development. Far more than a simple algorithmic issue, effective model training requires a methodical, multidisciplinary approach that integrates data science, domain knowledge, and software engineering. As James Luke points out in his book "Beyond Algorithms: Delivering AI for Business," the success of an AI implementation depends far more on data management and systemic design than on the algorithms themselves. The landscape is evolving rapidly, with innovations such as the DeepSeek-R1 model redefining cost and accessibility.

The foundation: data collection and management

Quality rather than quantity

Contrary to what is often believed, the quantity of data is not always the determining factor for success. The quality and representativeness of the data are significantly more important. In this context, it is crucial to integrate different sources:

  • Proprietary data: Collected and ethically anonymized from existing implementations
  • Authorized data: Sourced from reliable suppliers who meet strict quality standards
  • Open source datasets: Carefully verified to ensure diversity and accuracy
  • Synthetic data: Artificially generated to fill gaps and solve privacy problems

This integration creates a comprehensive training base that captures real-world scenarios while maintaining ethical and privacy standards.
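To make the integration concrete, here is a minimal sketch in Python using pandas. The file names are hypothetical placeholders; the point is to tag each record with its provenance so the blend can be audited later.

```python
import pandas as pd

# Hypothetical file names; substitute your actual sources.
sources = {
    "proprietary": "anonymized_internal.csv",
    "licensed": "vendor_feed.csv",
    "open_source": "public_dataset.csv",
    "synthetic": "generated_fill.csv",
}

frames = []
for origin, path in sources.items():
    df = pd.read_csv(path)
    df["origin"] = origin  # keep provenance so each source can be audited later
    frames.append(df)

corpus = pd.concat(frames, ignore_index=True)

# Check that no single source dominates the blend.
print(corpus["origin"].value_counts(normalize=True))
```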

The challenge of data preparation

The "data wrangling" process accounts for up to 80 percent of the effort required in artificial intelligence projects. This phase involves:

  • Data cleaning: Elimination of inconsistencies, duplications and outliers
  • Data transformation: Conversion to formats suitable for processing
  • Data integration: Fusion of different sources that often use incompatible schemas and formats
  • Handling missing data: Strategies such as statistical imputation or use of proxy data
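A minimal pandas sketch of these steps might look like the following. The file and column names are hypothetical, and real pipelines are far more extensive, but the shape of the work is the same.

```python
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.read_csv("raw_records.csv")  # hypothetical input file

# Cleaning: drop exact duplicates, clip numeric outliers to the 1st/99th percentile.
df = df.drop_duplicates()
numeric = df.select_dtypes("number").columns
df[numeric] = df[numeric].clip(
    lower=df[numeric].quantile(0.01),
    upper=df[numeric].quantile(0.99),
    axis=1,
)

# Transformation: normalize inconsistent date strings into a single datetime type.
df["created_at"] = pd.to_datetime(df["created_at"], errors="coerce")

# Missing data: statistical imputation (median) for numeric gaps.
df[numeric] = SimpleImputer(strategy="median").fit_transform(df[numeric])
```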

As Hilary Packer, CTO of American Express, pointed out, "The aha! moment for us, honestly, was data. You can do the best model selection in the world ... but data is the key. Validation and accuracy are the holy grail right now in generative AI."

Model architecture: right sizing

The choice of model architecture should be guided by the specific nature of the problem to be solved, rather than by familiarity or personal preference. Different types of problems require different approaches:

  • Transformer-based language models for tasks requiring deep linguistic understanding
  • Convolutional neural networks for image and pattern recognition
  • Graph neural networks for analyzing complex relationships between entities
  • Reinforcement learning for optimization and decision problems
  • Hybrid architectures that combine multiple approaches for complex use cases

Architectural optimization requires systematic evaluation of different configurations, balancing performance against computational requirements. This trade-off has become even more relevant with the advent of models such as DeepSeek-R1, which offer advanced reasoning capabilities at significantly lower cost.
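As an illustration of that evaluation loop, here is a small scikit-learn sketch on synthetic data. The candidate configurations are placeholders, but the habit of measuring accuracy, parameter count, and latency side by side applies at any scale.

```python
import time
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Candidate configurations, from lightweight to heavier.
for hidden in [(32,), (128,), (128, 64), (256, 128)]:
    model = MLPClassifier(hidden_layer_sizes=hidden, max_iter=300, random_state=0)
    model.fit(X_tr, y_tr)
    start = time.perf_counter()
    accuracy = model.score(X_te, y_te)
    latency_ms = (time.perf_counter() - start) * 1000
    n_params = sum(w.size for w in model.coefs_) + sum(b.size for b in model.intercepts_)
    print(f"{hidden}: accuracy={accuracy:.3f}, params={n_params}, eval={latency_ms:.1f} ms")
```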

Advanced training methodologies

Distillation of the model

Distillation has emerged as a particularly powerful tool in the current AI ecosystem. This process enables the creation of smaller, more specialized models that inherit the reasoning capabilities of larger, more complex models such as DeepSeek-R1.

As the DeepSeek case highlights, the company has distilled its reasoning capabilities into several smaller models, including open-source models from Meta's Llama family and Alibaba's Qwen family. These smaller models can then be optimized for specific tasks, accelerating the trend toward fast, specialized models.
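DeepSeek's own pipeline is far more elaborate, but the core recipe of distillation is compact. Here is a minimal PyTorch sketch of the classic soft-label approach, in which the student learns from both the teacher's softened output distribution and the ground-truth labels.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Blend soft-target KL (teacher knowledge) with hard-label cross-entropy."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)  # rescale gradients after temperature softening
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

# In the training loop, the teacher runs in inference mode:
# with torch.no_grad():
#     teacher_logits = teacher(batch)
# loss = distillation_loss(student(batch), teacher_logits, batch_labels)
```

The temperature T softens the teacher's distribution so the student also learns from the relative probabilities the teacher assigns to wrong answers, not just its top prediction.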

Sam Witteveen, a machine learning developer, notes: "We are starting to enter a world where people are using multiple models. They're not just using one model all the time." This includes low-cost closed models such as Gemini Flash and GPT-4o Mini, which "work very well for 80 percent of use cases."

Multi-task learning

Instead of training separate models for related capabilities, multi-task learning allows models to share knowledge across different functions, as the sketch after this list illustrates:

  • Models simultaneously optimize for multiple related objectives
  • Basic functionality benefits from broader exposure to different tasks
  • Performance improves in all tasks, particularly those with limited data
  • Computational efficiency increases through component sharing
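A minimal PyTorch sketch of the shared-encoder pattern looks like this; the dimensions and the two task heads are hypothetical.

```python
import torch
import torch.nn as nn

class MultiTaskModel(nn.Module):
    """One shared encoder, one lightweight head per task."""
    def __init__(self, input_dim=128, hidden=256, n_classes_a=5, n_classes_b=3):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.head_a = nn.Linear(hidden, n_classes_a)  # e.g., topic classification
        self.head_b = nn.Linear(hidden, n_classes_b)  # e.g., sentiment

    def forward(self, x):
        shared = self.encoder(x)
        return self.head_a(shared), self.head_b(shared)

model = MultiTaskModel()
x = torch.randn(16, 128)
logits_a, logits_b = model(x)
# A joint objective means both tasks update the shared encoder.
loss = nn.functional.cross_entropy(logits_a, torch.randint(0, 5, (16,))) \
     + nn.functional.cross_entropy(logits_b, torch.randint(0, 3, (16,)))
```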

Supervised fine-tuning (SFT)

For companies operating in very specific domains, where information is not widely available on the Web or in the books typically used for training language models, supervised fine-tuning (SFT) is an effective option.

DeepSeek demonstrated that it is possible to get good results with "thousands" of question-and-answer data sets. For example, IBM engineer Chris Hay showed how he set up a small model using his own math-specific data sets and got extremely fast answers that outperformed OpenAI's o1 model on the same tasks.
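As a rough illustration, a bare-bones SFT run with the Hugging Face transformers library looks like the sketch below. The base model, hyperparameters, and the two Q&A pairs are all placeholders; a real run would use a stronger base model and thousands of examples.

```python
from datasets import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model_name = "gpt2"  # stand-in; choose whichever open base model fits your domain
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)

# Hypothetical domain-specific Q&A pairs; real SFT runs use thousands.
pairs = [
    {"question": "What is our refund window?", "answer": "Thirty days from delivery."},
    {"question": "Which plans include API access?", "answer": "Pro and Enterprise."},
]

def to_features(row):
    text = f"Question: {row['question']}\nAnswer: {row['answer']}"
    return tokenizer(text, truncation=True, max_length=256)

dataset = Dataset.from_list(pairs).map(to_features, remove_columns=["question", "answer"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="sft-out", num_train_epochs=3,
                           per_device_train_batch_size=2),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```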

Reinforcement learning (RL)

Companies that wish to train a model with further alignment to specific preferences (for example, making a customer support chatbot empathetic but concise) will want to implement reinforcement learning (RL) techniques. This approach is particularly useful if a company wants its chatbot to adapt its tone and recommendations based on user feedback.
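Production preference-tuning pipelines (RLHF and its variants) are substantial engineering projects. The toy sketch below instead uses a bare REINFORCE update over a handful of response styles, with a hand-written reward standing in for real user feedback, purely to show how a reward signal steers a policy toward preferred behavior.

```python
import numpy as np

# Toy setup: the "policy" picks a response style; user feedback is the reward.
styles = ["formal", "empathetic", "empathetic_concise"]
true_reward = {"formal": 0.2, "empathetic": 0.5, "empathetic_concise": 0.9}  # hypothetical

logits = np.zeros(len(styles))
lr = 0.1
rng = np.random.default_rng(0)

for step in range(2000):
    probs = np.exp(logits) / np.exp(logits).sum()
    action = rng.choice(len(styles), p=probs)
    # Simulated feedback; in production this comes from ratings or thumbs up/down.
    reward = true_reward[styles[action]] + rng.normal(0, 0.1)
    # REINFORCE update: raise the probability of actions that earned high reward.
    grad = -probs
    grad[action] += 1.0
    logits += lr * reward * grad

print(dict(zip(styles, probs.round(2))))  # mass should concentrate on the best style
```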

Retrieval-Augmented Generation (RAG)

For most companies, Retrieval-Augmented Generation (RAG) is the simplest and most secure path. It is a relatively straightforward process that allows organizations to anchor their models with proprietary data contained in their databases, ensuring that the outputs are accurate and domain-specific.
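A minimal sketch of the retrieval step is below, using TF-IDF similarity as a toy stand-in for a production embedding model and vector database. The documents and question are hypothetical.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical proprietary snippets; in practice these come from your databases.
documents = [
    "The Enterprise plan includes a 99.9% uptime SLA and dedicated support.",
    "Refunds are processed within 30 days of the delivery date.",
    "API rate limits: 100 requests per minute on the Pro plan.",
]

vectorizer = TfidfVectorizer().fit(documents)
doc_vectors = vectorizer.transform(documents)

def retrieve(query, k=2):
    """Return the k snippets most similar to the query."""
    scores = cosine_similarity(vectorizer.transform([query]), doc_vectors)[0]
    return [documents[i] for i in scores.argsort()[::-1][:k]]

question = "How long do refunds take?"
context = "\n".join(retrieve(question))
prompt = f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {question}"
# `prompt` is then sent to whichever language model you use.
print(prompt)
```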

This approach also helps counteract some of the hallucination problems associated with models such as DeepSeek, which currently hallucinate in 14 percent of cases compared to 8 percent for OpenAI's o3 model, according to a study conducted by Vectara.

For most companies, the combination of model distillation and RAG is where the magic lies. It has become remarkably easy to implement, even for teams with limited data science or programming skills.

Evaluation and refinement: beyond accuracy metrics

Effective AI is not measured in terms of raw accuracy alone; it requires a comprehensive evaluation framework that considers several dimensions, one of which is sketched in code after the list:

  • Functional accuracy: Frequency with which the model produces correct results
  • Robustness: Consistency of performance with varying inputs and conditions
  • Equity: Consistent performance across different user groups and scenarios
  • Calibration: Alignment between confidence scores and actual accuracy
  • Efficiency: Computational and memory requirements
  • Explainability: Transparency of decision-making processes, an aspect in which DeepSeek's distilled models excel, showing their reasoning process
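Calibration is the least familiar item on this list, so here is a small numpy sketch of expected calibration error (ECE) on simulated predictions. A well-calibrated model that says "90 percent confident" should be right about 90 percent of the time.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Average gap between stated confidence and observed accuracy, weighted by bin size."""
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap
    return ece

# Simulated model outputs: confidence scores and whether each answer was right.
rng = np.random.default_rng(0)
conf = rng.uniform(0.5, 1.0, size=1000)
correct = rng.random(1000) < conf * 0.9  # a slightly overconfident model
print(f"accuracy={correct.mean():.3f}, ECE={expected_calibration_error(conf, correct):.3f}")
```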

The impact of the cost curve

The most immediate impact of DeepSeek's release is its aggressive price reduction. The technology industry expected costs to fall over time, but few anticipated how quickly this would happen. DeepSeek has demonstrated that powerful, open models can be both cheap and efficient, creating opportunities for widespread experimentation and cost-effective deployment.

Amr Awadallah, CEO of Vectara, underscored this point, noting that the real tipping point is not just the cost of training but the cost of inference, which for DeepSeek is roughly one-thirtieth of what OpenAI's o1 or o3 models cost per token. "The margins that OpenAI, Anthropic and Google Gemini have been able to capture will now have to be reduced by at least 90 percent because they cannot remain competitive with such high prices," Awadallah said.

And these costs will continue to fall. Anthropic CEO Dario Amodei recently said that the cost of developing models continues to drop by a factor of roughly four each year; at that pace, a capability that costs $1.00 per unit today would cost about $0.06 two years from now. As a result, the rates LLM providers charge for usage will continue to decline as well.

"I fully expect the cost to go to zero," said Ashok Srivastava, CDO of Intuit, a company that has strongly pushed AI in its tax and accounting software offerings such as TurboTax and Quickbooks. "...and latency will go to zero. They will simply become core capabilities that we can use."

Conclusion: The future of enterprise AI is open, cheap and data-driven

DeepSeek and OpenAI's Deep Research are more than just new tools in the AI arsenal; they are signs of a profound shift in which companies will deploy masses of purpose-built models that are extremely cost-effective, competent, and rooted in the company's own data and approach.

For companies, the message is clear: the tools to build powerful domain-specific AI applications are at your fingertips, and you risk falling behind if you do not use them. But real success will come from how you curate data, leverage techniques such as RAG and distillation, and innovate beyond the pre-training phase.

As AmEx's Packer put it: companies that manage their data properly will be the ones to lead the next wave of innovation in AI.

Fabio Lauria

CEO & Founder | Electe

As CEO of Electe, I help SMEs make data-driven decisions. I write about artificial intelligence in business.
