
LLMs: Frequently Asked Questions
Large language models (LLMs) have revolutionized the way we interact with AI, but their inner workings often remain mysterious.
This article aims to answer, simply and clearly, the most frequently asked questions people have when starting to use—or deploy—LLMs in real-world use cases.
Whether you are curious, a beginner, or already an advanced user, you will find essential insights here to better understand and leverage this technology.
What really differentiates a token from a word?
A token is not the same as a word. It is the unit of text the model actually processes, chosen by its tokenizer to represent language efficiently. A token can be part of a word (like "ing" in "running"), a whole word, or even several short words together. The Llama 2 tokenizer, for example, has a vocabulary of 32,000 tokens to cover everything the model can understand and generate.
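To make the difference concrete, here is a small Python sketch using the open-source tiktoken library (an OpenAI tokenizer, used here purely as an illustration; Llama models ship their own tokenizer):

# Illustrative tokenization with tiktoken's cl100k_base encoding.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

for text in ["running", "unbelievably", "New York City"]:
    token_ids = enc.encode(text)
    pieces = [enc.decode([tid]) for tid in token_ids]
    print(f"{text!r} -> {len(token_ids)} token(s): {pieces}")

Common short words often map to a single token, while rarer or longer words are split into several pieces.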
Why do some prompts work better than others?
Every word in your prompt influences the probability distribution of the next token generated by the LLM. A well-structured prompt increases the likelihood that the desired tokens are selected. That’s why techniques like concrete examples (few-shot) or instructions in uppercase have such a significant impact on result quality.
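As an illustration, here is a minimal few-shot prompt sketch in Python; call_llm is a hypothetical helper standing in for whatever API client you use:

# The two worked examples steer the probability distribution toward
# the desired output format before the real review is classified.
FEW_SHOT_PROMPT = """Classify the sentiment of each review as POSITIVE or NEGATIVE.

Review: "The battery lasts all day, I love it."
Sentiment: POSITIVE

Review: "Broke after two days, total waste of money."
Sentiment: NEGATIVE

Review: "{review}"
Sentiment:"""

def build_prompt(review: str) -> str:
    return FEW_SHOT_PROMPT.format(review=review)

# answer = call_llm(build_prompt("Shipping was slow but the product works fine."))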
How can LLM hallucinations be avoided?
Hallucinations cannot be completely eliminated, but several techniques help reduce them: include explicit instructions telling the LLM to answer "I don't know" when uncertain, ground answers in external data via RAG instead of relying solely on training knowledge, and keep a human in the loop to validate critical results.
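A minimal sketch of such a grounding-oriented prompt, with the exact wording purely illustrative:

# System prompt that authorizes "I don't know" and restricts the model
# to the supplied context (a RAG pipeline would provide that context).
SYSTEM_PROMPT = """You are a careful assistant.
Answer ONLY from the context provided below.
If the context does not contain the answer, reply exactly: "I don't know."
Do not invent names, numbers, or sources."""

def build_messages(context: str, question: str) -> list[dict]:
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
    ]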
What is the difference between fine-tuning and prompt engineering?
Prompt engineering changes the instructions given to the model without altering its weights, while fine-tuning retrains certain layers of the model with new data. Prompt engineering is less expensive, faster to implement, and generally sufficient for most use cases. Fine-tuning is only justified for very specific needs after exhausting prompt engineering possibilities.
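To give an idea of what "retraining certain layers" looks like in practice, here is a minimal parameter-efficient fine-tuning (LoRA) sketch with the Hugging Face transformers and peft libraries; the base model and hyperparameters are placeholders, not recommendations:

# LoRA adds a small set of trainable adapter weights instead of
# retraining the whole network, which keeps fine-tuning affordable.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base_model = AutoModelForCausalLM.from_pretrained("gpt2")
lora_config = LoraConfig(r=8, lora_alpha=16, lora_dropout=0.05, task_type="CAUSAL_LM")
model = get_peft_model(base_model, lora_config)

# Prints how few parameters are actually trainable compared to the full model.
model.print_trainable_parameters()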
Should I choose an open or closed model?
Closed models (GPT, Claude, Gemini) offer high performance without requiring infrastructure, but create vendor lock-in. Open models (Llama, Mistral) allow full control and maximum privacy, but require technical skills and dedicated infrastructure. The choice depends on your privacy constraints, budget, and technical expertise.
How does RAG actually work?
RAG combines retrieval and generation: your documents are converted into embedding vectors and stored in a vector database or search index. When a user asks a question, the system finds the most relevant passages and injects them into the LLM’s prompt. The model then generates its answer based on this external information, allowing it to access knowledge that is not present in its training data.
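Here is a minimal retrieval-and-generation sketch in Python; embed and call_llm are hypothetical stand-ins for your embedding model and LLM client, and a real system would use a vector database rather than an in-memory list:

import numpy as np

documents = [
    "Our refund policy allows returns within 30 days.",
    "Support is available Monday to Friday, 9am-6pm.",
    "Premium plans include priority support.",
]
# doc_vectors = [embed(d) for d in documents]  # computed once, offline

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def retrieve(question: str, doc_vectors: list, k: int = 2) -> list[str]:
    q_vec = embed(question)  # hypothetical embedding call
    scores = [cosine_similarity(q_vec, d) for d in doc_vectors]
    best = sorted(range(len(documents)), key=lambda i: scores[i], reverse=True)[:k]
    return [documents[i] for i in best]

def answer(question: str, doc_vectors: list) -> str:
    context = "\n".join(retrieve(question, doc_vectors))
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
    return call_llm(prompt)  # hypothetical LLM call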
What makes agents so complex to develop?
Agents must handle multi-step tasks, recover from execution errors, and maintain a consistent state throughout the process. Unlike classic LLMs that simply generate text, agents must plan, act, analyze results, and adapt. This complexity is amplified when they interact with unpredictable environments like the web.
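A minimal plan-act-observe loop, stripped of everything a real framework adds (call_llm and the tools are hypothetical):

def run_agent(goal: str, tools: dict, max_steps: int = 5) -> str:
    history = []
    for _ in range(max_steps):
        # Plan: ask the model for the next action given the goal and past observations.
        decision = call_llm(
            f"Goal: {goal}\nHistory: {history}\n"
            "Reply with '<tool> <argument>' or 'FINISH: <answer>'."
        )
        if decision.startswith("FINISH:"):
            return decision.removeprefix("FINISH:").strip()
        tool_name, _, argument = decision.partition(" ")
        try:
            observation = tools[tool_name](argument)   # act
        except Exception as exc:                       # recover from execution errors
            observation = f"Error: {exc}"
        history.append((decision, observation))        # maintain state across steps
    return "Stopped: step limit reached."

Real agents add retries, guardrails, and persistent memory around each of these steps, which is where most of the engineering effort goes.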
How do I measure an LLM’s performance on my use cases?
Create a representative test set of your real use cases with input examples and expected outputs. Test different models and prompt techniques on these examples, then measure accuracy, consistency, and relevance of results. Academic benchmarks like those from livech.ai provide a general indication, but only your specific tests will reveal performance for your real needs.
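A minimal evaluation harness over such a test set might look like this; call_llm is a hypothetical client, and the exact-match scoring should be replaced by whatever metric fits your task:

TEST_SET = [
    {"input": "Invoice INV-042, total 120 EUR", "expected": "120 EUR"},
    {"input": "Invoice INV-043, total 89 EUR", "expected": "89 EUR"},
]

def evaluate(prompt_template: str) -> float:
    correct = 0
    for case in TEST_SET:
        output = call_llm(prompt_template.format(text=case["input"]))  # hypothetical call
        if case["expected"].lower() in output.lower():
            correct += 1
    return correct / len(TEST_SET)

# accuracy = evaluate("Extract the total amount from: {text}")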
Can I trust LLMs for critical tasks?
LLMs are not 100% reliable and should never be used unsupervised for critical tasks. Tests show success rates of only 3 or 4 out of 5 attempts on some complex tasks. For sensitive fields like medicine or finance, implement robust safeguards and always keep human validation in the loop.
How can I protect my confidential data with LLMs?
Clearly distinguish between consumer applications (ChatGPT) and enterprise APIs (GPT-4 API). Professional APIs do not use your data for training and offer confidentiality guarantees similar to standard cloud services. For maximum privacy, choose open models hosted on your infrastructure or use techniques like homomorphic encryption.
What is the real cost of implementing an LLM solution?
Costs vary greatly depending on the approach: a few euros per month for experiments via API, thousands of euros for production deployments, and tens of thousands for custom fine-tuning. Start with API prototypes to validate your use cases before investing in heavier infrastructure. Prompt engineering alone can often cover 80% of your needs at minimal cost.
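A back-of-the-envelope API estimate illustrates the order of magnitude; the per-token prices below are placeholders, so check your provider's current pricing:

PRICE_PER_1M_INPUT_TOKENS = 1.0   # EUR, illustrative only
PRICE_PER_1M_OUTPUT_TOKENS = 3.0  # EUR, illustrative only

def monthly_cost(requests_per_day: int, input_tokens: int, output_tokens: int) -> float:
    cost_per_request = (input_tokens * PRICE_PER_1M_INPUT_TOKENS
                        + output_tokens * PRICE_PER_1M_OUTPUT_TOKENS) / 1_000_000
    return cost_per_request * requests_per_day * 30

# e.g. 200 requests a day, ~1,500 input and ~300 output tokens each
print(f"{monthly_cost(200, 1500, 300):.2f} EUR per month")

With these assumed prices, that comes to roughly 14 EUR per month, which is why simple API experiments stay cheap.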
What is the future of LLMs in the coming years?
LLMs are evolving towards greater autonomy (agents), specialization (domain models), and efficiency (smaller but more powerful models). Reasoning techniques are developing, enabling more advanced planning capabilities. However, challenges of reliability, cost, and privacy remain central. Integration into existing workflows will be key to mass adoption.