LLM Interviews : Prompt Engineering
- I’ll use any resource that helps me prepare faster.
- I’m not directly working toward interviews, but preparing for them has taught me so much over the years (check the main page of the website for my 8-year-old YouTube channel).
- I’ll cite every piece of content I use.
- I’ll aggressively reference anything I believe explained things better than me.
Prompt Engineering & Basics of LLM
- Prompt Engineering & Basics of LLM
- Larger models with tens of billions of parameters may be trained on tens of trillions of tokens (e.g., 20B parameters on 60+ trillion tokens) Great Resource on this and another
- General Rules
- Avoid
- 5.3 Defining Examples
- 6. Explain in-context learning
- 7. Explain types of prompt engineering
- 8. What are some aspects to keep in mind while using few-shot prompting?
- 9. What are certain strategies to write good prompts?
- 10. What is hallucination, and how can it be controlled using prompt engineering?
- 11. How to improve the reasoning ability of LLM through prompt engineering?
- 12. How to improve LLM reasoning if your Chain-of-Thought (COT) prompt fails?
1. What is the difference between Predictive/Discriminative AI and Generative AI?
Predictive/discriminative AI generally works in non-conversational areas; typical examples include CNNs, tree-based algorithms, and NER models that make predictions, recommendations, decisions, forecasts, and so on.
On the other hand, generative AI creates content, code, music, and marketing material, and can translate data into different formats.
Earlier examples of generative AI models were variational autoencoders and GANs, like this and this, which amazed me SO MUCH at that time.
With the rise of transformer models, we have seen examples of code generation, music generation, video-to-video transformations, text-to-video generations, and LLMs.
Resource from the amazing ML Mastery blog, which taught millions; kudos to them
Generative AI still has limitations, though:
- High resource usage
- Hallucinations
- Algorithms are still being made more stable; it took years to get from GANs to where we are today
- Difficulty removing sensitive data / doing content filtering
2. What is a token in the language model?
A token is the smallest unit of data that still encapsulates enough information to be mimicked by the model. It doesn’t necessarily have to be a syllable or word piece; it can also be a patch of an image vector.
Based on the size of the training data, the number of tokens can number in the billions or trillions — and, per the pretraining scaling law, the more tokens used for training, the better the quality of the AI model.
Side Note on Number of Tokens in Famous Models
According to the Chinchilla scaling law and related research, the compute-optimal number of training tokens is roughly 20 times the number of model parameters, although in practice many recent models are trained on far more tokens than that ratio suggests. For example:
- A 5 billion parameter model is trained on about 1 trillion tokens.
- A 9 billion parameter model is trained on about 2.1 trillion tokens.
- A 3 billion parameter model is trained on about 2.8 trillion tokens.
- A 6 billion parameter model is trained on over 6.1 trillion tokens.
- Larger models with tens of billions of parameters may be trained on tens of trillions of tokens (e.g., 20B parameters on 60+ trillion tokens). Great resource on this and another.
Resource from Eden AI, LLM Billing
In the context of LLMs, models are trained to predict the next token correctly.
Take the sentence “The cat is beautiful.” It is tokenized, for example, as:
Token ID | Token | Notes |
---|---|---|
1 | The | Article |
32 | Ġcat | Word with space prefix |
542 | Ġis | Common verb with space |
123 | Ġbeautiful | Adjective with space |
13 | . | Punctuation |
At each step, it predicts the next word (or token) based on the previous ones:
Input Tokens | Text So Far | What Comes Next? |
---|---|---|
[1] | “The” | What is the next word? |
[1, 32] | “The cat” | What is the next word? |
[1, 32, 542] | “The cat is” | What is the next word? |
[1, 32, 542, 123] | “The cat is beautiful” | What is the next word? |
[1, 32, 542, 123, 13] | “The cat is beautiful.” | What is the next word? I think I should stop. |
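If you want to inspect real token IDs yourself, here is a minimal sketch using the `tiktoken` library (the IDs in the table above are illustrative; actual IDs depend on the tokenizer you load):

```python
# pip install tiktoken
import tiktoken

# Load a BPE tokenizer (cl100k_base is used by several OpenAI chat models).
enc = tiktoken.get_encoding("cl100k_base")

text = "The cat is beautiful."
token_ids = enc.encode(text)

print(token_ids)                                  # a short list of integers
print([enc.decode([t]) for t in token_ids])       # the text piece behind each ID
print(f"{len(token_ids)} tokens for {len(text.split())} words")
```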
Tokens are also among the very first metrics to consider for LLM deployments. When deciding on a GPU for a given LLM, or when choosing between a smaller (1.7B) and a larger (70B) model on the same GPU, token-based metrics are very useful, aside from the semantic performance/evaluation of the models.
Metric | Description | When It Matters |
---|---|---|
Time to First Token (TTFT) | Time from sending the prompt to receiving the first token. | 🧠 Important for interactive apps, chatbots, coding assistants, low-latency UIs |
Tokens per Second (TPS) | Speed at which the model generates tokens after the first one. | 📊 Crucial for batch processing, summarization, and longer outputs |
Prompt Length (Input Tokens) | Number of tokens in the input. Affects memory and context cost. | 🧾 Key for choosing context window size, managing VRAM usage |
Output Length (Generated Tokens) | Number of tokens the model needs to generate. | 🎯 Used to estimate latency, throughput, and inference cost |
Total Tokens (Input + Output) | Combined token count per request. Used for throughput and cost estimation. | 💸 Necessary for pricing, rate-limiting, and SLA planning |
3. How to estimate the cost of running SaaS-based and Open Source LLM models?
This is a very good question to distinguish people who deploy, or decide how to deploy, LLM models. Our purpose as a team was to evaluate whether we should deploy a model on GPUs (via RunPod, Lambda Labs, Modal, or AWS Bedrock, Google Cloud Vertex, Azure) or use public SaaS-based APIs (such as OpenAI, Anthropic, Cohere, or the Mistral API).
I’m not name-dropping (brand-dropping) here; rather, I’m trying to place names you can use during interviews. I wish they’d hire me; I don’t earn money through this GitHub-hosted website, I swear :) .
Let’s go through an example.
Let’s say we have an AI app and we estimate 100,000 API requests per month; demand/traffic is low on weekends, and we expect peak requests during the afternoon. Our app reads a 3-page-long PDF and returns a summary of it, which is generally 250 words long.
You should read this
3.1 SaaS-based Models
API pricing is mostly based on:
- per-model pricing (the more intelligent/bigger the model, the higher the cost)
- input/output token-based pricing (each model has its own input and output token cost)
Let’s go with OpenAI pricing (15 May 2025):
Model | Description | Input (1M tokens) | Cached Input (1M tokens) | Output (1M tokens) |
---|---|---|---|---|
GPT-4.1 | Smartest model for complex tasks | $2.00 | $0.50 | $8.00 |
GPT-4.1 Mini | Affordable model balancing speed & intelligence | $0.40 | $0.10 | $1.60 |
GPT-4.1 Nano | Fastest, most cost-effective model for low-latency tasks | $0.100 | $0.025 | $0.400 |
We’ve experimented and found that GPT-4.1 Mini is sufficient for our task, so we go with it.
Scaling is not our problem here; it’s handled by OpenAI (although we will still face issues).
- A 3-page long PDF should be 500 * 3 = 1500 words.
- 1 word is approximately 0.75 tokens (source).
- This means we’ll have:
1500 words * 0.75 tokens/word = 1125 input tokens per request.
- On average, 250 words per output means:
250 words * 0.75 tokens/word = 187.5 output tokens per request.
- Input tokens:
1125 tokens * 100,000 requests * $0.40 / 1 million tokens = $45
- Output tokens:
187.5 tokens * 100,000 requests * $1.60 / 1 million tokens = $30
- Total: $45.00 (input) + $30.00 (output) = $75.00 per month
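A minimal sketch that reproduces this arithmetic in Python; the prices and the words-to-tokens factor are the assumptions stated above, so swap in your own numbers:

```python
# Assumptions from the example above (adjust for your own app/model).
REQUESTS_PER_MONTH = 100_000
WORDS_PER_INPUT = 1500          # 3 pages * 500 words
WORDS_PER_OUTPUT = 250
TOKENS_PER_WORD = 0.75          # rough conversion factor used above
INPUT_PRICE_PER_M = 0.40        # GPT-4.1 Mini input, $ per 1M tokens
OUTPUT_PRICE_PER_M = 1.60       # GPT-4.1 Mini output, $ per 1M tokens

input_tokens = WORDS_PER_INPUT * TOKENS_PER_WORD      # 1125 tokens/request
output_tokens = WORDS_PER_OUTPUT * TOKENS_PER_WORD    # 187.5 tokens/request

input_cost = input_tokens * REQUESTS_PER_MONTH * INPUT_PRICE_PER_M / 1_000_000
output_cost = output_tokens * REQUESTS_PER_MONTH * OUTPUT_PRICE_PER_M / 1_000_000

print(f"Input:  ${input_cost:.2f}")                   # $45.00
print(f"Output: ${output_cost:.2f}")                  # $30.00
print(f"Total:  ${input_cost + output_cost:.2f} per month")  # $75.00
```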
However, OpenAI (and any API provider) can have downtime, and they all have rate limits, so you should consider adding another API provider as a backup in case an API request fails.
For example, in our apps we have frequently hit the 60-requests-per-minute limit (x-ratelimit-limit-requests). In order to solve that:
- Either use batch processing, or delay the requests in the codebase if response time is not critical
- Use another LLM API as a backup (see the sketch after the table below)
FIELD | SAMPLE VALUE | DESCRIPTION |
---|---|---|
x-ratelimit-limit-requests | 60 | Maximum number of requests allowed before exhausting the rate limit. |
x-ratelimit-limit-tokens | 150000 | Maximum number of tokens allowed before exhausting the rate limit. |
x-ratelimit-remaining-requests | 59 | Remaining number of requests allowed before exhausting the rate limit. |
x-ratelimit-remaining-tokens | 149984 | Remaining number of tokens allowed before exhausting the rate limit. |
x-ratelimit-reset-requests | 1s | Time remaining until the request rate limit resets. |
x-ratelimit-reset-tokens | 6m0s | Time remaining until the token rate limit resets. |
But it’s obvious that $75 per month is far cheaper than deploying LLM models yourself.
3.2 Open Source Model Deployment
Things get crazy here. Let’s discuss why you may need to choose open-source models:
- 🛡️ In the defense sector, companies don’t want to use public APIs.
- 🧮 You may use LLMs to categorize DB rows consistently/constantly, and you have millions of records.
- 🧠 OpenAI may not solve your problem; you may want to fine-tune a better model.
- 🌐 In remote locations or areas with limited connectivity, open-source models can be deployed locally, reducing reliance on the internet.
- 📍 Data sovereignty: you may need to keep data in specific geographical regions to comply with legal or internal policies.
Let’s assume you’ve found your model and are now looking for a GPU to run it on.
Before that, let’s estimate our peak usage:
- 📈 100,000 API requests in a month
- 💤 On weekends demand/traffic is not high.
- ☀️ We expect peak requests during afternoon.
So:
- You assume weekdays carry 80% of the traffic (a bit above the uniform 5 weekdays / 7 days ≈ 0.71).
- You assume 50% of the traffic happens in the 4 afternoon hours, and the remaining 20 hours behave similarly.
- With roughly 21 weekdays in a month, that gives:
100,000 * 0.8 * 0.5 / ( ~21 weekdays * 4 hours * 60 mins ) ≈ 8 requests per minute
on average during peak hours.
- Let’s quadruple that amount so that even in the busiest minutes we can still keep up.
→ So at peak we plan for ~32 requests per minute.
You’ve fine-tuned DeepSeek-R1 8B and you’re looking for a GPU that can run inference.
- 🧪 You use DeepSeek-R1 8B
- 🔢 Q8 quantization is OK for you.
- 📦 You want to serve one request at a time (batch size of 1)
I’ve found an amazing website that does realistic calculations of the VRAM/GPU needs per model; it’s a great resource I discovered just today. You should experiment over here:
👉 Can You Run This LLM? VRAM Calculator
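The calculator does the detailed work, but a rough back-of-the-envelope version is useful for intuition. The sketch below counts weights only and ignores the KV cache and runtime overhead, so treat it as a lower bound rather than the calculator's method:

```python
def rough_weight_vram_gb(params_billion: float, bits_per_param: int) -> float:
    """Very rough lower bound: weights only, no KV cache or runtime overhead."""
    bytes_total = params_billion * 1e9 * bits_per_param / 8
    return bytes_total / 1024**3

# An 8B model at different quantization levels (illustrative):
for bits in (16, 8, 4):
    print(f"{bits}-bit: ~{rough_weight_vram_gb(8, bits):.1f} GB just for weights")
# FP16 ≈ 14.9 GB, Q8 ≈ 7.5 GB, Q4 ≈ 3.7 GB -- plus a KV cache that grows with
# sequence length and batch size, which is exactly why the calculator matters.
```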
Let’s say you choose RTX 4060 (16 GB) with 1 GPU.
- 🔄 At peak times, you have ~32 requests per minute.
- 🧾 The max tokens per input (sequence length) is 1125. Input size affects time-to-first-token speed.
- ⚙️ Set the number of GPUs to 1; we can increase it if the calculations show it’s not enough.
- 👥 Set concurrent users to 1; we will play with it to see if we can process the peak demand.
The website (kudos to them, the numbers seem realistic compared with my experiments) says:
“Generation Speed: ~51 tok/sec per single user.”
However, we know that:
- ⏱️ At peak we expect ~32 requests per minute.
- With a single concurrent stream, that implies each API call needs to finish in about 1.9 seconds.
- 🧠 We generate 187.5 output tokens per request, which at ~51 tok/sec takes about 3.7 seconds per request.
- 📉 Also, as the number of concurrent users grows, the tokens per second per user will drop even further!
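A quick capacity sanity check in Python, using only the numbers assumed above (peak requests per minute, output tokens per request, and the calculator's ~51 tok/sec figure):

```python
# Assumed numbers from the estimation above.
peak_requests_per_min = 32        # derived peak load (quadrupled average)
output_tokens_per_request = 187.5
tokens_per_second = 51            # single-user speed reported by the calculator

time_budget_s = 60 / peak_requests_per_min                      # ~1.9 s per request
time_needed_s = output_tokens_per_request / tokens_per_second   # ~3.7 s per request

print(f"Budget per request: {time_budget_s:.2f} s")
print(f"Needed per request: {time_needed_s:.2f} s")
print("GPU is enough" if time_needed_s <= time_budget_s else "GPU is NOT enough")
```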
So this implies the GPU is not enough! What to do now?
- 🪶 Try Q4 quantization or smaller if model quality doesn’t drop!
- 🖥️ Rent a bigger GPU; it seems 2x RTX 6000 gives ~188 tok/sec.
- 🧩 Go with a multi-GPU solution, but it’ll have an engineering cost! I have never done that honestly :)
- ⚗️ Fine-tune a smaller model; fine-tuning is not that hard to experiment with!
- 🤹 Go with batching at peak times (batch-process 2 users at once), but it’ll have an engineering cost! Again, I have never done that honestly :)
- 🌥️ If you’re still allowed to use a public API, use it at peak times! A suitable GPU already handles most of the near-peak demand; instead of trying to implement multi-GPU, just route the overflow to a public API at overload times. You’ll need to write the backend for that.
Also beware:
- 💸 Deploying/training LLMs will eat up your team’s time, and salaries cost too!
3.3 Why are output tokens more expensive than input tokens?
This question has great answers here and here.
Shortly: When a model encodes tokens, it processes all of them at once, which is efficient. But when generating (decoding), it predicts one token at a time in order. After predicting each token, it must update its calculations to predict the next one. This step-by-step process makes decoding slower and more expensive than encoding.
Aspect | Input Tokens (Encoding) | Output Tokens (Decoding) |
---|---|---|
Compute (FLOPs) | Each input token is processed once | Each output token requires a full transformer pass |
Scaling Behavior | Without KV cache, compute grows roughly quadratically with input length | With KV cache, compute grows linearly as output tokens increase |
KV Cache Role | Built during encoding | Accessed and extended at every output generation step |
Memory Bandwidth | Relatively low — data read sequentially | High — random memory access due to autoregressive steps |
Latency Sensitivity | Tokens processed in parallel batches, so latency is low | Tokens generated one by one, so latency is critical |
Number of Transformer Runs | One run per input token | One run per output token |
Cost per Million Tokens | About $5 (e.g., GPT-4o) | About $15 (3× more expensive) |
Main Cost Drivers | Mostly compute power (FLOPs) | Memory bandwidth, cache maintenance, and latency overhead |
- Compute cost per token is similar, thanks to KV cache (avoids recomputing keys/values).
- Output token generation is sequential, each step adds to memory and latency cost.
- KV cache grows with sequence length, increasing memory bandwidth needs.
- Memory bandwidth (not just memory size) is a bottleneck—especially at inference scale.
- Pricing reflects practical engineering constraints (not just FLOPs), hence the higher price for outputs.
- “With KV caching, it costs almost the same FLOPs to take 100 input tokens and generate 900 output tokens, or vice versa.” – leogao
- “The model has to run once per output token, with gradually growing context.” – mishka
- “Memory bandwidth becomes the bottleneck, especially for output token generation.” – Peter Chng
4. Explain the Temperature parameter and how to set it.
We have discussed that LLMs are next-word prediction machines. But an LLM doesn’t magically come up with the next word; it assigns probabilities to several candidate next words. The next word can be any of those candidates, right? There are a few ways to complete any sentence while still staying in the same context.
“My cat is ….” beautiful? gorgeous? cute? evil? All of these and more are possible. But “gpu” can’t be there, right?
Word | Probability |
---|---|
beautiful | 0.35 |
gorgeous | 0.25 |
cute | 0.20 |
evil | 0.15 |
friendly | 0.04 |
gpu | 0.01 |
LLM models assign a probability to each possible next word.
- If we set the temperature to 0, the most probable word is always selected, and the sentence continues in the direction that word gives the meaning.
- If we set the temperature to 0.8, the second or third most probable word can also be randomly selected, and the sentence continues in the direction that word gives the meaning.
Creativity comes from this controlled randomness. Setting the temperature to 0.8 increases the likelihood of choosing “evil”, and the sentence becomes: “My cat is evil, I hate her.” Setting the temperature to 0 ensures “beautiful” is chosen, and the sentence becomes: “My cat is beautiful, I love her.”
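Under the hood, sampling with temperature T is equivalent to re-weighting each probability as p^(1/T) and renormalizing (T → 0 approaches the greedy pick; higher T flattens the distribution). A minimal sketch using the illustrative table above (not real model outputs):

```python
import random

candidates = {
    "beautiful": 0.35, "gorgeous": 0.25, "cute": 0.20,
    "evil": 0.15, "friendly": 0.04, "gpu": 0.01,
}

def sample_with_temperature(probs: dict, temperature: float) -> str:
    if temperature == 0:                       # greedy: always the most probable word
        return max(probs, key=probs.get)
    weights = {w: p ** (1 / temperature) for w, p in probs.items()}
    total = sum(weights.values())
    return random.choices(list(weights), [w / total for w in weights.values()])[0]

print(sample_with_temperature(candidates, 0))    # always "beautiful"
print(sample_with_temperature(candidates, 0.8))  # usually "beautiful", sometimes "evil"
print(sample_with_temperature(candidates, 2.0))  # much flatter; even "gpu" shows up
```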
Beautiful illustration by the Mastering LLM Blog
But in practice:
- Choose low temperature when:
- You need consistent, reliable, and precise outputs
- The task requires factual accuracy or authoritative answers
- You want to build user trust by avoiding unpredictable results
- The application demands low cognitive load, so users get clear recommendations without confusion
- Minimizing user decision fatigue is important
- The domain is sensitive or high-stakes, where errors are costly (e.g., medical, legal)
- Choose high temperature when:
- Creativity, novelty, or variety is desired (e.g., brainstorming, storytelling)
- You want to keep users engaged and curious with unpredictable responses
- Encouraging users to explore different ideas or prompt refinements is beneficial
- Trade-offs to consider:
- Low temperature can make the model seem more confident but may cause users to over-trust incorrect answers
- High temperature increases creativity but can lead to incoherent or irrelevant responses
- High temperature may cause decision fatigue if users are overwhelmed by too many options
- Low temperature might make the experience feel stale or repetitive over time
- Summary:
- Use low temperature for reliability and clarity
- Use high temperature for creativity and engagement
- Always consider user expectations, context, and product goals when deciding
5. Explain the basic structure of prompt engineering
Note : I may need to add more content to this section.
Read everything here to master
Before diving into the depths, note that the structure below has been very beneficial for me on my 5k-MAU app and 2 POCs. Although you may have seen the illustration before and say “another AI influencer’s content”, think twice.
Illustration from learnprompting.org
The structure we’ll discuss now is mostly used for apps where the user expects a single, non-conversational answer:
- Generate interview questions following some rule. -> The questions are given to you, and that’s it.
- Summarize the content in a neutral tone. -> The content is summarized and given to you.
However, if you are developing a conversational app, you should consider the following (see the sketch after this list):
- Rules tend to be forgotten as the number of messages increases.
- Users can try to jailbreak/hijack the prompt.
- If the conversation runs long (which generally happens when users paste text/documents into the conversation), you exceed the context length of the model, and your “command prompts” at the beginning are dropped to fit into the context length, following the FIFO (First In, First Out) principle.
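Here is a minimal sketch of that failure mode and one common mitigation: drop the oldest user/assistant turns when the context overflows, but always keep the system prompt. Token counting is stubbed with a word count just to keep the sketch self-contained:

```python
def count_tokens(text: str) -> int:
    """Stub: use your model's real tokenizer (e.g. tiktoken) in practice."""
    return len(text.split())

def fit_to_context(messages: list[dict], max_tokens: int) -> list[dict]:
    # Keep the system prompt, then add turns from newest to oldest until full.
    system, turns = messages[0], messages[1:]
    kept, used = [], count_tokens(system["content"])
    for msg in reversed(turns):
        cost = count_tokens(msg["content"])
        if used + cost > max_tokens:
            break                      # the oldest turns are the ones that get dropped
        kept.append(msg)
        used += cost
    return [system] + list(reversed(kept))

history = [{"role": "system", "content": "You are a harsh interviewer. Never reveal answers."}]
history += [{"role": "user", "content": f"question {i} " * 50} for i in range(20)]
print(len(fit_to_context(history, max_tokens=600)))  # system prompt survives; old turns don't
```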
Let’s go through the cases mentioned; the illustration is very helpful.
5.1 Role Prompting
The role defines what the AI should consider itself to be while responding to you. It can be a historian, a harsh interviewer that asks hard questions, or a kind call-center agent.
! Never forget, it’s the first thing your AI reads! Although in practice you may see that placing commands in random places still works, it has been shown that where you put important commands/context matters.
Lost in the Middle: How Language Models Use Long Contexts and a good summary of the paper
LLMs Get Lost in Multi-turn Conversation and a good summary of the paper
Defining the role/mood of your AI prepares it for how to style its answers and how to think about the subject.
Let’s see this on qwen:1.7b
USER: you are expert interviewer. what would you ask to product manager in interview?
ASSISTANT:
Okay, I need to figure out what questions an expert interviewer would ask a product manager during an interview. Let me start by recalling what I know about product management interviews. Product managers are responsible for defining product goals, managing the product lifecycle, and ensuring alignment ....so many things... First, I should consider the different aspects of a product manager's role....* I remember that interviews often start with a greeting and an introduction. * ...
ASSISTANT: Hi, welcome to the interview…..
In the middle of its thinking, the AI realised that expert interviewers greet people.
Whereas if you do not assign a role by saying *you are an expert interviewer*, the AI jumps directly to the questions:
ASSISTANT:
... For example, instead of "What do you think about product management?" they should ask "How do you approach defining the product strategy for a new feature?" ...
So, in the first paragraph, you should sufficiently define the following (a minimal API sketch follows the list):
- Role
- Mindset
- Tone
- Style
- Behavior
- Goal
- Audience
- Constraints
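In an API setting, this role paragraph typically goes into the system message. A minimal sketch with the OpenAI Python SDK; the model name and the wording of the role are just placeholders:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

role_prompt = (
    "You are an expert interviewer for product manager roles. "
    "Tone: professional but warm. Style: one question at a time. "
    "Goal: assess strategy, prioritization, and stakeholder skills. "
    "Constraint: never reveal the grading rubric."
)

response = client.chat.completions.create(
    model="gpt-4.1-mini",  # placeholder; pick your own model
    messages=[
        {"role": "system", "content": role_prompt},
        {"role": "user", "content": "Start the interview."},
    ],
)
print(response.choices[0].message.content)
```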
5.2 Instruction Prompt
You’ve assigned the role to the AI; next, you should constrain it to focus on a specific task within its responsibility.
In this section, you should be specific and command the AI what to do, starting directly with a verb: Specify, Summarize, Translate, Do, Classify, Divide, Distinguish.
Then you should continue with the action points in general.
USER: Translate the given text into Turkish, localising the synonyms for the Turkish language instead of assuming their direct translations are meaningful in the target language.
You should also consider using markdown tags while defining the rules; it has been observed to work:
General Rules
- Try to match the same number of words, as the translation shouldn’t increase the number of pages.
Avoid
- Never translate technical words in the article; duckDB is not “ördekDB” in Turkish!
5.3 Defining Examples
I’ve built 3 big apps with LLMs and RAG. I can assure you that good examples are 10 times better than long instruction prompting. Whatever you define in the instructions, after enough context the AI either won’t follow it or will get bogged down in details. Good examples will work even if you don’t have any instruction prompt at all.
You need to define good examples on:
- What to do (that’s for sure)
- What NOT to do
Like
### ✅ Good Example:
Q: Can you explain self-attention in Transformers?
A: Sure! Think of each word looking at every other word and asking: “Should I care about you?”...
### ❌ Bad Example:
A: Self-attention is a mechanism that allows each token to... [copy-pasted Wikipedia-level dump]...
If you would like to get JSON output by using ==JSON Mode== or ==Structured Outputs==, you should define the response format in the prompt:
{
"title": "The Rise of Transformers in NLP",
"author": "Jane Doe",
"main_points": [
"Transformers revolutionized NLP by using self-attention.",
"They enable parallel processing of tokens.",
"They have become the backbone of modern language models."
]
}
or
{
"title": "<str: title of the movie>",
"author": "<str: author>",
"main_points": "<list of str: 3-4 word summaries of the main points>"
}
or
in the OpenAI Python SDK:
from typing import List

from pydantic import BaseModel, Field
class MovieSummary(BaseModel):
title: str = Field(..., description="Title of the movie")
author: str = Field(..., description="Author of the movie or article")
main_points: List[str] = Field(
...,
description="List of main points (each 3-4 words) summarizing the content"
)
OpenAI reads the field descriptions from here, too! I’ve benefited from this a lot.
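To actually enforce that schema, you can point the SDK's structured-output parsing helper at the model class. A minimal sketch; the method lives under the SDK's beta namespace in recent openai-python versions, so double-check the exact call against your SDK version, and the model name is a placeholder:

```python
from openai import OpenAI

client = OpenAI()

completion = client.beta.chat.completions.parse(   # structured-outputs parsing helper
    model="gpt-4.1-mini",                          # placeholder model name
    messages=[
        {"role": "system", "content": "Summarize articles about movies."},
        {"role": "user", "content": "Summarize: The Rise of Transformers in NLP ..."},
    ],
    response_format=MovieSummary,                  # the pydantic class defined above
)
summary = completion.choices[0].message.parsed     # a validated MovieSummary instance
print(summary.title, summary.main_points)
```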
6. Explain in-context learning
Language Models are Few-Shot Learners
An AI model can’t be retrained with fresh data every day; it’s just impossible. But the model has its intelligence, so it can interpret data given to it at inference time. RAG (retrieval-augmented generation) uses this principle: pipelines bring data into the prompt so that your AI is informed about it.
Continue the following story fragment in the style of Shakespearean English:
Story start: "The sun dipped below the horizon, casting a golden glow on the ancient castle."
Continuation:
Lo, the amber light did kiss yon aged stones,
Where whispered secrets dwell in shadowed tones.
Then the AI will follow the pattern you’ve shown it in the prompt.
! TBH I am ashamed to explain this; nowadays we unfortunately need to know these gatekeeping names :/
What distinguishes ICL?
- No Parameter Updates: The model’s weights remain fixed; learning happens through the prompt context alone.
- Transient Knowledge: The knowledge gained is temporary and specific to the current prompt; it is not stored permanently.
- Few-Shot Learning: Often called few-shot learning or few-shot prompting, as the model requires only a handful of examples to generalize to the task.
- Leverages Pretraining: Better model, better ICL.
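A minimal few-shot / in-context-learning sketch: the task format and label set come purely from the examples packed into the prompt, with no weight updates (the prompt text itself is just an illustration):

```python
few_shot_prompt = """Classify the sentiment of each review as Positive or Negative.

Review: "The battery lasts two full days, amazing."
Sentiment: Positive

Review: "The screen cracked within a week."
Sentiment: Negative

Review: "Setup took five minutes and everything just worked."
Sentiment:"""

# Send `few_shot_prompt` to any LLM; the model infers the task and the label set
# from the examples alone -- its weights are never updated.
print(few_shot_prompt)
```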
7. Explain types of prompt engineering
TBC.
8. What are some aspects to keep in mind while using few-shot prompting?
TBC.
9. What are certain strategies to write good prompts?
TBC.
10. What is hallucination, and how can it be controlled using prompt engineering?
TBC.
11. How to improve the reasoning ability of LLM through prompt engineering?
TBC.
12. How to improve LLM reasoning if your Chain-of-Thought (COT) prompt fails?
TBC.
Side Notes:
- 90% of this blog is/will be written by me.
- I may use ChatGPT to create 10% of the blog, just to make the content easier to read or to format algorithms in markdown. But nothing more than that; just for help.
- This doesn’t mean I create posts with ChatGPT and then proof-read and extend them. It means I create the posts myself; along the way I found that ChatGPT can diagram/format algorithms better than me, so I prompt it with what I want to explain, and it returns a better-formatted, more fun-to-read version of just that part of the blog.
- The reason is that I already spend time creating content, and I’m learning while writing, but I understand the content before writing, so I want to minimize writing time.