How AI Works: Steer, Don't Pray. A Complete Map for Steering AI

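You use ChatGPT, Claude, or Copilot every day. You write prompts, rephrase them, and ask again when the AI gets it wrong. But have you ever wondered what the AI is actually doing the moment you hit send?

Does it look the answer up in some giant database? How does it know which word in your sentence matters? Why does the same question asked twice get different answers? And why does it sometimes tell you something completely wrong with total confidence?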

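What this article answers

This article tackles a question more fundamental than "10 ChatGPT tips": how AI actually works, and how understanding that mechanism lets you steer it better.

I'll break the process into three stages: how AI reads, how it thinks and writes, and how you steer it. Each concept ends with practical advice. By the end, your mental model of AI should be substantially upgraded.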

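Concretely, the journey goes like this:

Stage 1, "How AI reads": start with tokens to see what the world looks like to an AI; then Context Window (a metaphor for what the model sees in one call, not persistent memory), Attention (how AI decides what matters), and KV Cache (the optimization that makes computation fast). These four concepts explain how AI receives and understands your input.

Stage 2, "How AI thinks / writes": Autoregressive Generation reveals that AI produces answers one token at a time; Temperature & Sampling are the knobs that control output style; Hallucination explains why AI says wrong things confidently. These three concepts explain how AI produces its response.

Stage 3, "How you steer": System Prompt vs User Prompt turns all the earlier mechanisms into tools you actually operate; finally, a hands-on customer support bot example shows these concepts working together.

No concept stands alone; together they form a causal chain. Understand the chain and you hold the reins.

Before you read

This article assumes a basic software development background (you can read TypeScript and know what a cache and RAM are) but no AI or machine learning expertise.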

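Stage 1 · How AI reads

1. Token: the smallest unit through which AI sees the world

Before looking at how AI works, one foundational idea has to be in place: AI processes tokens, not the "text" your eyes see.

When you type some text, the AI does not read it character by character. It first splits the text into tokens, language units somewhere between a letter and a word, then converts each token into a list of numbers (a vector). Those numbers are what the model actually computes on. The example below uses the English phrase please review this code (Chinese is split differently, but the mechanism is the same):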

"please review this code"
   ↓ tokenization
["please", " review", " this", " code"]
   ↓ token IDs
[7847, 12043, 5521, 9982]
   ↓ embeddings
[[0.23, -0.81, ...], [0.11, 0.42, ...], ...]
   ↓
what the model actually computes on

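Why not use text directly? Because computers only understand numbers at the lowest level. An AI model is essentially a giant mathematical function: numbers in, numbers out. Tokens are the key intermediate unit in turning human language into numbers the model can compute on.

This leads to a few ideas you can use right away:

All of an AI's "capacity" is measured in tokens, not characters or words. The Context Window limit and API billing discussed later are both token-based.
Token efficiency differs between Chinese and English: the same meaning takes a different number of tokens in different languages.
How text is split directly affects what the AI can do. The best-known example is "how many r's are in strawberry": AI often miscounts because it sees token chunks such as straw + berry; the letter-by-letter sequence s-t-r-a-w-b-e-r-r-y is not directly visible to the model.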
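
To make the strawberry point concrete, here is a minimal sketch of a greedy longest-match tokenizer. The vocabulary is made up for illustration and the algorithm is not BPE; real tokenizers learn their vocabulary from data, so the actual split for any given model may differ.

```typescript
// Toy greedy longest-match tokenizer. TOY_VOCAB is hypothetical and exists only
// for illustration; real models learn their vocabulary (for example via BPE).
const TOY_VOCAB: string[] = ["straw", "berry", "s", "t", "r", "a", "w", "b", "e", "y"];

function toyTokenize(text: string, vocab: string[] = TOY_VOCAB): string[] {
  // Try longer vocabulary entries first so "straw" wins over "s".
  const sorted = [...vocab].sort((a, b) => b.length - a.length);
  const tokens: string[] = [];
  let i = 0;
  while (i < text.length) {
    const match = sorted.find((entry) => text.startsWith(entry, i));
    // Fall back to a single character when nothing in the vocabulary matches.
    const piece = match ?? text[i];
    tokens.push(piece);
    i += piece.length;
  }
  return tokens;
}

// The model-side view: two opaque chunks rather than ten letters.
const strawberryTokens = toyTokenize("strawberry");
console.log(strawberryTokens);

// Counting letters needs the character-level view that tokenization hides.
const rCount = [..."strawberry"].filter((ch) => ch === "r").length;
console.log(rCount);
```

The model only ever receives the IDs of the two chunks, which is why a letter-counting question is harder for it than it looks.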

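Tokens have one more practical meaning: they are the billing unit. APIs charge separately for input and output tokens, regardless of request count or character count. When you design a high-frequency AI application, "how many tokens does one prompt use" translates directly into "how much does this feature cost per month".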

// In practice, use the model provider's tokenizer or API usage response.
// Different models use different tokenizers, so character-based formulas drift fast.
const usage = await client.messages.countTokens({
  model: "claude-opus-4-8",
  system: systemPrompt,
  messages: [
    {
      role: "user",
      content: `${documentContent}\n\nQuestion: ${userQuestion}`,
    },
  ],
});

console.log(`This input is about ${usage.input_tokens} tokens`);

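Once you see tokens as the billing unit, KV Cache and Context Window management stop being technical details and become design decisions tied directly to cost.

This article won't go into token-splitting algorithms (BPE), how embedding vectors carry meaning, or how AI tokenizes images, audio, or even gene sequences.

Want to go deeper on tokens?

For how tokens are cut, why token efficiency differs so much between languages, and how gene sequences and weather data get turned into tokens, see my other article: Tokens: How AI Turns the World into Numbers.

For this article, you only need to carry one core idea forward:

The basic unit AI processes is the token, and to the AI a token is a list of numbers. Every mechanism that follows (memory, attention, caching, generation) is built on this unit.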

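2. Context Window: AI's working memory

With tokens in hand, the first concept to face is the Context Window. "Working memory" here is a metaphor: it means the content the model can see at once in a single API call, not persistent storage like a disk.

The Context Window is the upper limit on the total number of tokens the AI can see at once in this conversation. Your question, the AI's answers, the files you upload, and the System Prompt behind the scenes all sit in the same pool and share the same budget.

Like RAM, cleared every time

Key analogy

The most precise analogy: the Context Window is like a program's RAM. It is valid only during the current run and is cleared when the program exits. It has nothing like a disk's persistent storage.

While your program runs, RAM holds only what is being processed right now. Close the program and RAM is wiped. Each API call works the same way: the model sees only what you put into the Context Window this time. It has no persistent memory across sessions, and it does not go and "look up" the data it was trained on.

The LLM itself has no memory

Common misconception

The LLM itself has no memory at all. When ChatGPT or Claude Code seems to "remember" last week's conversation or your project settings, that feeling comes from the application's management of context: the application assembles past conversation, uploaded files, the System Prompt, and so on, and puts them back into the Context Window on every call. That is why it appears to continue the earlier thread.

Strip away that application logic and send a single API call to the model, and it sees only what you passed in this time. Whatever was said in the previous turn has no bearing on this one. Starting a new conversation "forgets" earlier content for the same reason: it is a brand-new Context Window.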

Context Window (all tokens the model can see in this API call)

System Prompt · fixed settings
Conversation history · accumulated
Uploaded files · specific to this call
Your question this turn · latest input
─────────────────────────
Remaining space · how many tokens left?

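This is clearest at the API level: the model has no hidden memory. Whatever you want it to remember, you must send again, with the full history, on every call: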

// The model has no memory. Send the full conversation history on every call.
interface Message {
  role: "system" | "user" | "assistant";
  content: string;
}

async function chat(history: Message[], newMessage: string) {
  const messages: Message[] = [
    ...history, // ← full past conversation, resent every time
    { role: "user", content: newMessage },
  ];

  const response = await client.messages.create({
    model: "claude-opus-4-8",
    max_tokens: 1024,
    messages,
  });

  return response;
}

// Omit history and the model has no idea what you talked about before

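What happens when the context fills up?

When the context budget is used up, the system has to make trade-offs. This comes up often in practice. Three strategies are common:

Truncation: the most direct option. Drop whatever exceeds the limit, usually the oldest turns, and keep the newest. The cost is that the AI loses the early conversation entirely.

Summarization: have the AI compress older turns into a summary, then replace the original turns with it. The overall direction survives, but details are lost.

Sliding window: always keep only the most recent N tokens. In a very long conversation, the AI gradually "forgets" what was said long ago.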

// Simplified truncation: when over budget, drop from the oldest messages
function truncateToFit(
  messages: Message[],
  maxTokens: number,
  countTokens: (text: string) => number,
): Message[] {
  let total = messages.reduce((sum, m) => sum + countTokens(m.content), 0);

  // Keep system prompt and latest message; drop second-oldest onward
  while (total > maxTokens && messages.length > 2) {
    const removed = messages.splice(1, 1)[0]; // remove oldest non-system message
    total -= countTokens(removed.content);
  }

  return messages;
}

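What this means for steering AI

Steering takeaways

A bigger Context Window handles bigger tasks. Current models commonly support contexts of 200K tokens or more, so you can load an entire contract, a sizable codebase, or a very long conversation history in one go.

But "it fits" does not mean "it helps". The longer the context, the harder it is for the Attention mechanism (covered later) to focus, and the higher the cost. Dumping irrelevant material into the context dilutes the AI's grip on what matters. The counterintuitive principle: good context management is about precise selection, and stuffing the window to the brim often backfires. A curated 5K-token context containing only relevant information often outperforms a 100K-token context full of noise.

Assemble the context deliberately on every call. If you want the AI to remember something, put it into this call's context. This is the first rule of steering AI: your context determines what the AI can see.

In real applications, context management is engineering that needs deliberate design. A common approach is to manage the context in layers: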

// In real apps, context is assembled — not dumped in randomly
interface ContextLayers {
  systemPrompt: string;      // fixed role definition (cache-friendly)
  retrievedDocs: string[];   // RAG chunks for this turn
  conversationHistory: Message[]; // history (length-managed)
  currentQuery: string;      // user's question this turn
}

function assembleContext(layers: ContextLayers, budget: number): Message[] {
  const messages: Message[] = [];

  // In practice systemPrompt usually goes in the API `system` field (see §8);
  // here we focus on assembling `messages`.
  // 1. Fixed system prompt — highest priority, always kept
  // 2. Retrieved docs — only the most relevant chunks, within budget
  const docs = layers.retrievedDocs.slice(0, 3).join("\n\n");

  // 3. Conversation history — keep recent turns until budget is tight
  const history = trimHistoryToBudget(layers.conversationHistory, budget);

  messages.push(...history);
  messages.push({
    role: "user",
    content: `Reference material:\n${docs}\n\nQuestion: ${layers.currentQuery}`,
  });

  return messages;
}

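The point of this code is the idea: in mature AI applications, nobody dumps everything on the model blindly. The context for each call is deliberately selected and assembled, because its quality directly determines both output quality and cost. (trimHistoryToBudget is a helper the snippet assumes rather than a library function.)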
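
The assembleContext snippet calls trimHistoryToBudget without defining it. One possible sketch follows, under two assumptions that are not from the article: the budget is counted in tokens, and a crude four-characters-per-token estimate stands in for a real tokenizer. Character-based estimates drift between models and languages, so pass the provider's counter as countTokens for anything cost-sensitive.

```typescript
interface Message {
  role: "system" | "user" | "assistant";
  content: string;
}

// Rough stand-in for a real tokenizer (assumption: about 4 characters per token).
const approxTokens = (text: string): number => Math.ceil(text.length / 4);

// Keep the most recent messages that fit within the budget, in chronological order.
// The input array is left unchanged.
function trimHistoryToBudget(
  history: Message[],
  budget: number,
  countTokens: (text: string) => number = approxTokens,
): Message[] {
  const kept: Message[] = [];
  let used = 0;
  // Walk from newest to oldest and stop at the first message that does not fit.
  for (let i = history.length - 1; i >= 0; i--) {
    const cost = countTokens(history[i].content);
    if (used + cost > budget) break;
    kept.unshift(history[i]); // restore chronological order
    used += cost;
  }
  return kept;
}
```

Stopping at the first message that does not fit keeps the retained history contiguous, which avoids handing the model a conversation with holes in it.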

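3. Attention: how AI decides what matters

You now know the Context Window may hold tens of thousands of tokens. So the question is: does the AI really weigh every token equally when it processes them?

No. That is why the Attention mechanism exists. It is the core of modern AI (the Transformer architecture) and the fundamental reason the AI does not treat everything in the context the same.

A classic example

Consider this sentence:

"The animal didn't cross the street because it was too tired."

When the AI processes the word it, it has to work out what it refers to: animal or street?

The Attention mechanism computes a "relevance score" between it and every other token in the context, and animal scores highest, because in the training data the concept "tired" is strongly associated with living things and not with streets.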

it → animal   ████████████   high relevance
it → tired    ███████        medium (semantic)
it → street   ██             low
it → cross    █              very low

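The AI judges relevance through mathematical computation, not because it "feels" that animal matters more. That computation is Attention.

The mechanism: Q, K, V

Here we go a little deeper, using a familiar data structure. Attention can be thought of as a "fuzzy-lookup HashMap".

In an ordinary HashMap, a key must match exactly to retrieve a value. Attention differs: every key is matched, just to a different degree, and all the values are then blended, weighted by how well each key matched.

Each token plays three roles at once in Attention: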

Q (Query)  → "What am I looking for?"     current token's query
K (Key)    → "What can I offer?"          each token's outward "label"
V (Value)  → "What do I actually carry?"  each token's real contribution

The computation goes like this:

1. Take the current token's Q
2. Dot product Q with every token's K → relevance scores
3. Softmax the scores → weights that sum to 1
4. Weighted average of every token's V
5. Result = this token's vector after "seeing" the full context

Collapse those five steps into one expression and you get scaled dot-product attention from the Transformer paper:

Attention(Q, K, V) = softmax(QK^T / √d_k) V

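Taking it apart:

QK^T: every Query is dotted with every Key, producing a matrix of relevance scores (step 2)
√d_k: the scaling factor, where d_k is the dimension of the Key vectors; without it, dot products grow large as the dimension increases and Softmax saturates toward near one-hot weights
softmax(⋯): turns scores into weights that sum to 1 (step 3)
Multiply by V: blend all Values using those weights (steps 4 and 5)

The skeleton of this flow can be expressed in simplified TypeScript (the √d_k scaling is omitted, but the logic is the same; real implementations are highly optimized matrix operations):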

// Simplified single-head attention to illustrate the flow
type Vector = number[];

function dotProduct(a: Vector, b: Vector): number {
  return a.reduce((sum, val, i) => sum + val * b[i], 0);
}

function softmax(scores: number[]): number[] {
  const max = Math.max(...scores);
  const exps = scores.map((s) => Math.exp(s - max)); // subtract max to avoid overflow
  const sum = exps.reduce((a, b) => a + b, 0);
  return exps.map((e) => e / sum);
}

function attention(
  query: Vector, // current token's Q
  keys: Vector[], // all tokens' K
  values: Vector[], // all tokens' V
): Vector {
  // Step 2: dot Q with each K → relevance scores
  const scores = keys.map((k) => dotProduct(query, k));

  // Step 3: Softmax → weights that sum to 1
  const weights = softmax(scores);

  // Step 4: weighted average of all V
  const dim = values[0].length;
  const output: Vector = new Array(dim).fill(0);

  values.forEach((v, i) => {
    for (let d = 0; d < dim; d++) {
      output[d] += weights[i] * v[d];
    }
  });

  return output; // Step 5: vector after "seeing" full context
}

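The key is the weights variable: it is the AI's judgment of how much each token in the context matters. Tokens with high weight strongly shape the result; tokens with low weight are nearly ignored.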
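
To see the weights with actual numbers, here is a small self-contained run of the same flow, this time including the 1/√d_k scaling from the formula. The two-dimensional vectors are invented for illustration; real models use learned vectors with hundreds or thousands of dimensions.

```typescript
type Vec = number[];

const dot = (a: Vec, b: Vec): number => a.reduce((sum, x, i) => sum + x * b[i], 0);

function softmaxWeights(scores: number[]): number[] {
  const max = Math.max(...scores);
  const exps = scores.map((s) => Math.exp(s - max)); // subtract max for numerical stability
  const total = exps.reduce((a, b) => a + b, 0);
  return exps.map((e) => e / total);
}

// Single-query attention with the 1/sqrt(d_k) scaling applied to the scores.
function scaledAttention(query: Vec, keys: Vec[], values: Vec[]) {
  const scale = Math.sqrt(query.length);
  const weights = softmaxWeights(keys.map((k) => dot(query, k) / scale));
  const output = values[0].map((_, d) =>
    values.reduce((sum, v, i) => sum + weights[i] * v[d], 0),
  );
  return { weights, output };
}

// Made-up vectors for three context tokens, say "animal", "street", "cross".
const demoKeys: Vec[] = [[1, 0], [0, 1], [0.5, 0.5]];
const demoValues: Vec[] = [[10, 0], [0, 10], [5, 5]];
const demoQuery: Vec = [2, 0]; // points the same way as the first key

const demo = scaledAttention(demoQuery, demoKeys, demoValues);
console.log(demo.weights); // the first token receives the largest weight
console.log(demo.output);  // the blend is pulled toward the first token's value
```

No token is excluded outright: even the worst-matching key keeps a small weight, which is the "fuzzy lookup" behavior described above.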

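Multi-Head Attention: looking from several angles at once

In practice, Attention does not run just once. Several copies run in parallel, each called a head. The original Transformer used 8 heads per layer, and large models today use dozens or more. Each head learns a different attention pattern. The labels below are idealized; real heads are rarely this cleanly specialized: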

Head 1 → syntax (subject ↔ verb)
Head 2 → semantics (synonyms, near-meaning)
Head 3 → reference (it / they / this → what?)
Head 4 → position (distance in the sequence)
   ⋮

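Finally, the outputs of all heads are merged into a single token vector that reflects several perspectives at once. This is why AI can track a sentence's syntax, semantics, references, and tone at the same time.

A philosophical note on "understanding"

It is worth pausing here. When the AI "understands" that it refers to animal, it does not truly "know" what an animal is or what tiredness is. What it does is find, in the statistical patterns of its training data, that the token tired co-occurs with animate nouns far more often than with streets.

Capability boundary

In other words, Attention captures statistical association, which is not the same as genuine semantic understanding. The distinction matters because it explains the AI's capability boundary: where patterns in the training data are clear, the AI behaves as if it truly understands; where patterns are sparse or contradictory, the cracks show. In the Hallucination section you will see how this leads directly to being "confidently wrong".

Knowing that Attention sits on the "statistical association" side lets you keep a healthy attitude toward AI: don't underestimate it (the statistical patterns it captures are extremely rich, enough for many complex tasks), and don't deify it (it has no independent ability to judge facts).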

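Positional Encoding and Lost in the Middle

Mathematically, self-attention is blind to token order: if you feed only embeddings and never tell the model where each token sits, "the cat sat on the mat" and "on the mat sat the cat" look the same to it. Modern LLMs therefore attach a positional encoding to every token, marking its index within the whole context, counting up from 0 at the start to the end. A common implementation is RoPE (Rotary Positional Embedding). With this information the model can track word order and references, and knows which part of the task it is handling while generating.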
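
As a rough illustration of how RoPE injects position, the sketch below rotates each pair of vector dimensions by an angle proportional to the token's position. The function name and the base of 10000 are assumptions chosen for the example, not taken from any specific model. The useful property is that the dot product between a rotated query and a rotated key depends only on how far apart the two positions are.

```typescript
// Rotate each (even, odd) pair of dimensions by an angle proportional to position.
// base = 10000 follows a common convention; treat it as an assumption here.
function applyRope(vec: number[], position: number, base = 10000): number[] {
  const dim = vec.length; // must be even
  const out = new Array<number>(dim);
  for (let i = 0; i < dim; i += 2) {
    const theta = position * Math.pow(base, -i / dim);
    const cos = Math.cos(theta);
    const sin = Math.sin(theta);
    out[i] = vec[i] * cos - vec[i + 1] * sin;
    out[i + 1] = vec[i] * sin + vec[i + 1] * cos;
  }
  return out;
}

const dotProduct2 = (a: number[], b: number[]): number =>
  a.reduce((sum, x, i) => sum + x * b[i], 0);

const q = [1, 2, 3, 4];
const k = [0.5, -1, 2, 0.25];

// Same relative offset (2 positions apart) at different absolute positions.
const scoreEarly = dotProduct2(applyRope(q, 3), applyRope(k, 1));
const scoreLate = dotProduct2(applyRope(q, 7), applyRope(k, 5));
// A different relative offset (1 position apart).
const scoreCloser = dotProduct2(applyRope(q, 3), applyRope(k, 2));

console.log(scoreEarly, scoreLate, scoreCloser);
```

The first two scores agree up to floating-point error while the third differs, so the attention score carries relative distance without any separate position lookup.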

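In practice, an application usually assembles the context chronologically: System Prompt first, conversation history stacked after it, and the user's latest question at the end. When generating the next token, the model looks back from the end of the sequence over the whole content, so the latest question naturally sits near the position the model is currently working on.

A common misconception: positional encoding marks where a token sits in the sequence, not how recent it is in time. Nor does the model's use of context fall off steadily with distance from the current turn. What Liu et al. (2023) reported is a U-shaped curve in task accuracy: models use information best when it sits at the start or the end of the context and worst when it sits in the middle. The sketch below treats that curve as a rough proxy for effective attention: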

Accuracy by position of the relevant information (simplified; a rough proxy for effective attention)

high │ ████                              ████
     │ ████                              ████
 low │      ████  ████  ████  ████  ████
     └──────────────────────────────────────
       start         middle              end
       (System       (easily ignored)    (latest question)
        Prompt)

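Lost in the Middle

The System Prompt at the start has small position indices yet tends to be used well; old turns or document passages in the middle are the most likely to be ignored; the end sits right next to the generation point, so the latest instruction also carries weight. This phenomenon is called Lost in the Middle.

The cause is several factors stacked together, not a single mechanism:

Positional encoding lets the model sense token order and relative distance, and brings a positional attention bias with it.

Structural bias in the training data is another layer. In pre-training text, task cues such as abstracts, introductions, and conclusions often appear at the beginning and end; instruction fine-tuning also tends to put instructions at the very front of the context. Learning from large amounts of data in this shape may reinforce the statistical association that the start and end deserve the most attention. Liu et al. (2023) discuss hypotheses of this kind and note that they combine with architectural factors rather than being the sole cause.

Other contributors include the context lengths seen during training, the causal mask, and Softmax competition over long contexts. Together they make middle content in long contexts systematically underweighted.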

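What this means for steering AI

The Attention mechanism explains several things you have observed when using AI:

Steering takeaways

Put important information in the right place (Lost in the Middle). The U-shaped pattern above means that in a very long document, putting the most critical instruction at the very beginning or the very end works noticeably better than burying it in the middle. If key evidence has to sit in the middle, pull out the main points in a summary first, number the sections, or ask the model to check section by section, so it is less likely to piece the whole thing together from the head and tail alone.

The longer the context, the harder it is for Attention to focus. Ten tokens attending to each other is trivial; 100,000 tokens attending to each other means billions of pairwise scores, and more noise. That is why an overstuffed context can reduce the model's grip on details.

Repetition tends to help. If you state a constraint once in the System Prompt and again in the User Prompt, the model has two groups of tokens carrying that information, one near the start and one near the end, so the constraint is more likely to receive attention and be followed. Treat this as a useful heuristic rather than a guarantee.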
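
The "harder to focus" point has a simple arithmetic core: full self-attention scores every token against every token, so the work grows with the square of the context length. A back-of-the-envelope sketch (per head and per layer, ignoring causal masking, which roughly halves the count, and ignoring implementation optimizations):

```typescript
// Number of query-key pairs scored by full self-attention over n tokens.
function attentionPairs(tokenCount: number): number {
  return tokenCount * tokenCount;
}

console.log(attentionPairs(10));      // 100
console.log(attentionPairs(100_000)); // 10000000000
```

Going from 10 tokens to 100,000 multiplies the context length by 10,000 but the number of pairs by 100 million.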

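4. KV Cache: the key to fast computation

Once you understand Q, K, and V, the next concept follows naturally, because the KV Cache grows directly out of Attention's intermediate results.

The problem: recompute every time?

Recall: when Attention processes a token, it compares that token's Q against the K of every token in the context, then uses the resulting weights to blend their V.

Now picture this: your System Prompt is 1,000 tokens long. Every time the user sends a new message, does the AI have to recompute K and V for those 1,000 tokens?

Without a KV Cache, yes, every single time. And since the content of those 1,000 tokens has not changed, that is a lot of repeated work.

Core concept

A KV Cache stores the K and V that have already been computed. Within one generation, it spares the model from recomputing earlier tokens at every step. Across calls, providers extend the same idea to shared prefixes (often called prompt or prefix caching): when the next request starts with the same prefix, the cached values are read instead of recomputed.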

Without KV Cache:
Every call
  → System Prompt (1000 tokens) recompute K, V
  → Conversation history recompute K, V
  → New message compute K, V
  → Generate answer

With KV Cache:
First call
  → System Prompt (1000 tokens) compute K, V → store in cache

Second call (System Prompt unchanged)
  → System Prompt → read cache ✓ (no recompute!)
  → New message compute K, V
  → Generate answer

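The principle is the same as Redis or a CDN cache: compute once, store it, and fetch it next time. The only difference is that what gets cached here is the K and V vectors of the Attention mechanism.

Cache Hit vs Cache Miss

Whether the KV Cache hits depends on whether the prefix has changed: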

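// Here chat(system, user) is a hypothetical helper that takes the system prompt first;
// it is not the chat(history, newMessage) function from §2.
// ✅ Cache hit: stable system prompt; different user messages still hit cache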
const systemPrompt = "You are a professional support agent. Be friendly and concise.";

await chat(systemPrompt, "Where is my order?"); // system prompt KV computed once
await chat(systemPrompt, "How do I request a refund?"); // system prompt read from cache ✓

// ❌ Cache miss: one word changed in system prompt → full cache invalidation
const v1 = "You are a professional support agent. Be friendly and concise.";
const v2 = "You are a professional support helper. Be friendly and concise."; // "agent" → "helper"

await chat(v1, "Where is my order?"); // compute once
await chat(v2, "Where is my order?"); // prefix diverges → cache miss → full recompute ✗

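Cost trap

Changing one word is enough to lose the cache. Prefix matching is that strict: the cache is compared token by token from the first token, and from the first position that differs, everything after it is invalid. Hosted prompt caches are often coarser still and may discard the entire marked block.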
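
Prefix matching itself is easy to sketch. The function below counts how many leading tokens two requests share; only that shared prefix could be served from cache. Splitting on spaces is a stand-in for a real tokenizer, and the function name is made up for this example.

```typescript
// Count how many leading tokens two requests share; only that prefix is reusable.
function sharedPrefixLength(cached: string[], incoming: string[]): number {
  let i = 0;
  while (i < cached.length && i < incoming.length && cached[i] === incoming[i]) i++;
  return i;
}

// Whitespace split stands in for a real tokenizer (assumption for illustration).
const toTokens = (text: string): string[] => text.split(" ");

const cachedPrompt = toTokens("You are a professional support agent. Be friendly and concise.");
const editedPrompt = toTokens("You are a professional support helper. Be friendly and concise.");

console.log(sharedPrefixLength(cachedPrompt, cachedPrompt)); // 10: full reuse
console.log(sharedPrefixLength(cachedPrompt, editedPrompt)); // 5: everything from "helper." onward is recomputed
```

The edit sits in the middle of the prompt, yet the five unchanged tokens after it are lost as well, because their K and V were computed with the old token in front of them.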

The right context order

Once you understand prefix matching, you can see why AI providers keep saying "put stable content at the front of the context":

Context order (top to bottom; prefix matched from the top)

System Prompt · stable · front, easiest to cache
Few-shot examples · stable · same
Uploaded documents · call-specific, often large
Conversation history · grows over time
User message this turn · changes every time · last

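If you put changing content (such as user data) at the front and the stable System Prompt after it, every change in the user data invalidates all the cache behind it. Get the order wrong and the cache is useless.

Anthropic's Prompt Caching

Claude offers an explicit Prompt Caching feature that lets you mark which content should be cached. Providers enforce a minimum cacheable length, so a one-line system prompt like the ones in the earlier example would be too short to cache in practice: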

const response = await client.messages.create({
  model: "claude-opus-4-8",
  max_tokens: 1024,
  system: [
    {
      type: "text",
      text: longStableSystemPrompt, // long, stable system prompt
      cache_control: { type: "ephemeral" }, // ← mark for caching
    },
  ],
  messages: [{ role: "user", content: userMessage }],
});

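Tokens served from cache cost far less and return faster than recomputed ones. For high-frequency applications the difference is substantial.

What this means for steering AI

Steering takeaways

Keep the System Prompt stable. Do not modify its wording dynamically on every call, not even by one extra space. Only a stable prefix keeps hitting the cache and keeps latency and cost down.

Put large documents and fixed knowledge up front. Product manuals, codebases, and specs (material the conversation refers to repeatedly but that does not change) belong in the front of the context, marked for caching, so later questions do not recompute their KV.

Caches expire. Claude's Prompt Cache lives for about 5 minutes by default, and each hit refreshes the timer; with no new call in that window, it expires. High-frequency applications get high hit rates, while low-frequency ones miss on almost every call. Factor this into cost estimates when designing your architecture.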

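This is not a micro-optimization

Why does this matter so much for cost? Picture a support bot whose System Prompt plus product manual total 10,000 tokens and which is called 5,000 times a day. Without caching, those 10,000 tokens are recomputed and billed on every call: 10,000 × 5,000 = 50 million input tokens a day for content that never changes. With caching, the stable content is computed once per cache lifetime, and cache hits are typically billed at a small fraction of the normal price. The KV Cache often decides whether an AI product can be profitable, which puts it well beyond "micro-optimization".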
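
The arithmetic above can be turned into a small estimator. The price and the cache-read multiplier below are placeholder numbers, not any provider's actual rates, and the calculation is deliberately simplified: it assumes every call hits the cache and ignores any surcharge for writing to the cache.

```typescript
interface CacheCostInput {
  prefixTokens: number;          // stable tokens per call (system prompt + manual)
  callsPerDay: number;
  pricePerMillionTokens: number; // hypothetical input price, not a real quote
  cacheReadMultiplier: number;   // hypothetical fraction billed for cached reads
}

function dailyPrefixCost(input: CacheCostInput) {
  const tokensPerDay = input.prefixTokens * input.callsPerDay;
  const uncached = (tokensPerDay / 1_000_000) * input.pricePerMillionTokens;
  // Simplification: every call hits the cache; cache-write costs are ignored.
  const cached = uncached * input.cacheReadMultiplier;
  return { tokensPerDay, uncached, cached };
}

const estimate = dailyPrefixCost({
  prefixTokens: 10_000,
  callsPerDay: 5_000,
  pricePerMillionTokens: 3,
  cacheReadMultiplier: 0.1,
});
console.log(estimate);
```

With these placeholder numbers the stable prefix costs about 150 units a day uncached and about 15 cached. Real traffic with gaps longer than the cache lifetime would land somewhere in between.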

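It is also why, when some features of an AI application are notably cheap and others notably expensive, the difference often comes down to cache hit rate. A stable prefix, fixed content up front, no dynamic edits to the System Prompt: these seemingly trivial engineering habits add up to real money.

Stage 2 · How AI thinks / writes

5. Autoregressive Generation: how AI "produces" an answer

The first four sections covered how AI reads. From here on, we look at how AI writes. To understand that, we first have to break a very common but wrong mental model.

Breaking the "database lookup" illusion

People unfamiliar with AI assume it works like this: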

Your question → AI "looks up" a giant knowledge base → finds answer → returns it

Breaking the illusion

This model is wrong. AI does not work like a search engine or a database query. What it actually does is:

Your question → predict "the most likely next token" → append it
        → predict again → append → repeat until done

This is Autoregressive Generation: use what has been produced so far to predict the next token, and generate one token after another.

The concrete flow

Suppose your prompt is: The capital of Taiwan is

Step 1: Context = "The capital of Taiwan is"
        → predict next token distribution
        → candidates: " Taipei" 88%, " Ta" 8%, other 4%
        → pick " Taipei"

Step 2: Context = "The capital of Taiwan is Taipei"  ← note: output becomes input
        → predict next token
        → candidates: "." 70%, "," 15%, other 15%
        → pick "."

Step 3: Context = "The capital of Taiwan is Taipei."
        → predict next token
        → candidates: <EOS> 92%, other 8%
        → pick <EOS> → stop generation

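Note the key point at each step: the token just produced immediately becomes part of the next step's input. That is where the name "autoregressive" comes from: output is fed back as input.

The essence of this loop can be expressed in TypeScript (predictNextToken, sample, and END_OF_SEQUENCE are placeholders for the model's internals):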

// Autoregressive generation is fundamentally a loop
async function generate(prompt: string, maxTokens: number): Promise<string> {
  let context = prompt;

  for (let i = 0; i < maxTokens; i++) {
    // Each step: predict next-token distribution from current context
    const distribution = await predictNextToken(context);

    // Pick one token from the distribution (§6 covers "how to pick")
    const nextToken = sample(distribution);

    if (nextToken === END_OF_SEQUENCE) break;

    // Key: append output back into context for the next round
    context += nextToken;
  }

  return context;
}

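Why this mechanism matters so much

AI has no explicit global view. It does not draft the whole answer in its head before emitting it; when it writes the first token, the tenth has not been decided. This explains why AI sometimes changes direction midway or contradicts itself: it moves forward one step at a time and cannot go back.

Errors accumulate. Once a token is chosen, it enters the context and affects every later prediction. If one step goes wrong, everything after it has to build on that mistake. This is the root of many answers that drift further and further off.

Answer length is not decided in advance. The AI does not decide to write 200 words and then start; at every step it evaluates the probability of the end-of-sequence token, and when that probability is high enough, it stops.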
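
The loop shown earlier relies on placeholders, so here is a runnable toy version. The "model" is just a hypothetical lookup table that maps a context to its single most likely next token; a real model returns a probability distribution over its whole vocabulary. The names toyModel and toyGenerate are invented for this sketch.

```typescript
const END = "<EOS>";

// Hypothetical stand-in for a model: current context -> most likely next token.
const toyModel: Record<string, string> = {
  "The capital of Taiwan is": " Taipei",
  "The capital of Taiwan is Taipei": ".",
  "The capital of Taiwan is Taipei.": END,
};

function toyGenerate(prompt: string, maxTokens: number): string {
  let context = prompt;
  for (let step = 0; step < maxTokens; step++) {
    const next = toyModel[context] ?? END; // unknown context: stop
    if (next === END) break;
    context += next; // the output token becomes part of the next input
  }
  return context;
}

console.log(toyGenerate("The capital of Taiwan is", 10));
// prints "The capital of Taiwan is Taipei."
console.log(toyGenerate("The capital of Taiwan is", 1));
// prints "The capital of Taiwan is Taipei" (cut off by the token limit)
```

The second call shows that length is governed by two things only: the end-of-sequence token and the max token limit. Nothing in the loop plans the sentence ahead of time.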

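How this maps to familiar development concepts

If you have integrated a streaming API and watched the answer appear piece by piece, that is not a front-end typewriter effect. It is the AI's real generation process: every token that appears comes from one full forward pass of the model.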

// The "typing" you see in a stream is autoregressive generation itself
const stream = await client.messages.stream({
  model: "claude-opus-4-8",
  max_tokens: 1024,
  messages: [{ role: "user", content: "What is the capital of Taiwan?" }],
});

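for await (const event of stream) {
  // Only text deltas carry a `text` field; other delta types (such as tool input JSON) do not.
  if (event.type === "content_block_delta" && event.delta.type === "text_delta") {
    // A chunk may bundle several tokens; each token still comes from its own forward pass.
    process.stdout.write(event.delta.text);
  }
}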

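This also explains a few practical things:

Longer answers take longer: every extra token is one more full forward pass.
Output tokens usually cost more than input tokens: the input is processed in one parallel pass (and may hit the cache), while every output token requires its own sequential pass.
Asking the AI to "answer briefly" really works: you shift its tendency to predict the end-of-sequence token, so it stops sooner.

Why "let the AI think step by step" is more accurate

Why CoT works

Once you understand autoregression, you can see why Chain-of-Thought works: asking the AI to "think step by step" makes it write out intermediate steps first, and those steps enter the context as scaffolding for later predictions. When asked for the final answer directly, the model has to get it right in one jump, which is error-prone; spelling out the reasoning splits the load across many steps and usually improves accuracy. This is also why many models have a built-in "reasoning mode": in essence, they autoregressively generate a long stretch of reasoning before giving the final answer.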

// ❌ Ask for the answer directly: model must get it right in one shot
await chat("", "A store sold 23 units on day one. Day two was 3× day one. How many total?");

// ✅ Ask for step-by-step reasoning: intermediate steps scaffold later tokens
await chat(
  "",
  "A store sold 23 units on day one. Day two was 3× day one. How many total? Walk through each step, then give the answer.",
);

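6. Temperature & Sampling: the knobs that control output

In the previous section we stopped at the sample(distribution) function. Now let's open it up: once the AI has a probability distribution over the next token, how does it actually pick one?

That way of picking is the sampling strategy, and Temperature is its most central knob.

What Temperature is

At every generation step, the AI produces a probability distribution over candidate tokens: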

Next token candidates:
" Paris"  → 80%
" Lyon"   → 15%
" Mars"   →  3%
" Rome"   →  1%
other     →  1%

Temperature controls the shape of this distribution, making it sharper or flatter.

Temperature = 0 (approaching 0): sharpest
" Paris" → 99.9%   " Lyon" → 0.1%   other → ~0%
→ almost always picks the top token; stable, predictable

Temperature = 1.0 (default): original distribution
" Paris" → 80%   " Lyon" → 15%   " Mars" → 3%
→ sample by original probabilities; some variation

Temperature = 2.0 (high): much flatter
" Paris" → ≈54%   " Lyon" → ≈23%   " Mars" → ≈10%   " Rome" → ≈6%   other → ≈6% (counting "other" as one token)
→ probabilities level out; low-probability tokens get a real chance; more random

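The easiest analogy to remember: Temperature is a "fairness" knob on a die.

Low Temperature → the die is loaded and lands on 6 almost every time.
High Temperature → the die is fairer, and every face has a chance.

Mathematically, Temperature scales each score (logit) before Softmax: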

function applyTemperature(logits: number[], temperature: number): number[] {
  if (temperature === 0) {
    // Temperature 0: greedy — fully deterministic
    const maxIndex = logits.indexOf(Math.max(...logits));
    return logits.map((_, i) => (i === maxIndex ? 1 : 0));
  }

  // Divide by temperature: lower → sharper; higher → flatter
  const scaled = logits.map((l) => l / temperature);
  return softmax(scaled);
}
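
To check what temperature does to the example distribution, the sketch below applies it to the probabilities directly. Taking the logarithm of each probability recovers the logits up to a constant, which Softmax ignores. Treating "other" as a single token is a simplification made for this example.

```typescript
// Reshape a probability distribution with a temperature (> 0).
// log(p) equals the logit up to an additive constant, which softmax cancels out.
function reshape(probs: number[], temperature: number): number[] {
  const scaled = probs.map((p) => Math.log(p) / temperature);
  const max = Math.max(...scaled);
  const exps = scaled.map((s) => Math.exp(s - max));
  const total = exps.reduce((a, b) => a + b, 0);
  return exps.map((e) => e / total);
}

// " Paris", " Lyon", " Mars", " Rome", and "other" treated as one token.
const base = [0.8, 0.15, 0.03, 0.01, 0.01];

for (const t of [0.2, 1.0, 2.0]) {
  const line = reshape(base, t)
    .map((p) => (p * 100).toFixed(1) + "%")
    .join("  ");
  console.log(`T=${t}: ${line}`);
}
```

At T = 1 the distribution comes back unchanged; at T = 0.2 nearly all the mass sits on " Paris"; at T = 2 " Paris" drops to roughly 54% while the tail gains ground. The ranking of the tokens never changes, only how far apart they are.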

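This explains a few common observations

Why does the same prompt give a different answer each time? Because when Temperature > 0, the AI is sampling randomly, not looking up a table. Samples naturally differ from one draw to the next.

Why use low Temperature for code and high Temperature for copywriting?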

Code generation: Temperature = 0
  → you want the correct answer, not a creative one
  → the highest-probability token is usually syntactically right

Creative writing: Temperature = 0.9 ~ 1.2
  → you want surprising phrasing
  → let low-probability tokens through for variety

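Two other common parameters: Top-P and Top-K

Besides Temperature, there are two more sampling parameters you will see in API docs.

Top-P (Nucleus Sampling): instead of reshaping the distribution, it samples only from the smallest set of top candidates whose cumulative probability reaches P.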

Top-P = 0.9, candidate list:
" Paris" → 80%   ← cumulative 80%
" Lyon"  → 15%   ← cumulative 95% > 90%, include it and stop
" Mars"  →  3%   ← excluded
" Rome"  →  1%   ← excluded

→ sample only between " Paris" and " Lyon"

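In other words, the token that pushes the cumulative probability past the threshold stays in the sampling pool, and the relative probabilities are then renormalized within the pool. The effect is to automatically filter out extremely unlikely tokens that could lead to nonsense or garbage.

Top-K: simpler and blunter. Sample only from the K most probable tokens.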

Top-K = 2:
→ sample only from " Paris" (80%) and " Lyon" (15%); everything else excluded
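
Both filters can be sketched in a few lines. This is an illustration of the idea rather than any library's implementation; the Candidate type and function names are invented for the example, and the probabilities are the ones used above.

```typescript
interface Candidate {
  token: string;
  prob: number;
}

// Rescale the surviving candidates so their probabilities sum to 1 again.
function renormalize(pool: Candidate[]): Candidate[] {
  const total = pool.reduce((sum, c) => sum + c.prob, 0);
  return pool.map((c) => ({ token: c.token, prob: c.prob / total }));
}

function topK(candidates: Candidate[], k: number): Candidate[] {
  const sorted = [...candidates].sort((a, b) => b.prob - a.prob);
  return renormalize(sorted.slice(0, k));
}

function topP(candidates: Candidate[], p: number): Candidate[] {
  const sorted = [...candidates].sort((a, b) => b.prob - a.prob);
  const pool: Candidate[] = [];
  let cumulative = 0;
  for (const c of sorted) {
    pool.push(c); // the token that crosses the threshold stays in the pool
    cumulative += c.prob;
    if (cumulative >= p) break;
  }
  return renormalize(pool);
}

const candidates: Candidate[] = [
  { token: " Paris", prob: 0.8 },
  { token: " Lyon", prob: 0.15 },
  { token: " Mars", prob: 0.03 },
  { token: " Rome", prob: 0.01 },
];

console.log(topP(candidates, 0.9)); // " Paris" and " Lyon", rescaled to about 0.842 and 0.158
console.log(topK(candidates, 2));   // the same two tokens for this distribution
```

The difference shows up when the distribution changes shape: Top-K always keeps exactly K tokens, while Top-P keeps fewer when the model is confident and more when it is not.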

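In practice, adjusting Temperature is enough in most cases; leave Top-P and Top-K at their defaults.

A practical reference table

Allowed ranges are provider-specific: at the time of writing, Anthropic's Messages API accepts temperature from 0 to 1, while OpenAI's accepts 0 to 2, so the rows above 1.0 apply only where the API allows them. Even at 0, providers do not promise bit-for-bit identical outputs.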

Task type              Temperature   Top-P    Notes
─────────────────────────────────────────────────────────
Code generation         0 ~ 0.2      0.9      stable, correct
Extraction / classify   0            -        fully deterministic
Q&A / summarization     0.3 ~ 0.7    0.9      accuracy + fluency
General chat            0.7 ~ 1.0    0.95     natural variation
Creative writing        1.0 ~ 1.2    0.95     encourage surprise
Brainstorming           1.2+         1.0      maximum diversity

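An important correction

Many people think high Temperature = a "smarter" or "more creative" AI, and low Temperature = a "dumber" or "more conservative" AI.

Common misconception

That understanding is wrong.

Temperature does not affect the AI's capability, only how it picks from the probability distribution. A weak model at high Temperature just produces more chaotic nonsense; it does not get smarter. Capability comes from training; Temperature is only a knob for output style.

7. Hallucination: why AI says wrong things

With generation and sampling understood, we can now face the question that troubles everyone: why does AI talk nonsense with such confidence?

Redefining the word

Redefining Hallucination

The word "hallucination" is a bit misleading; it suggests the AI is "seeing things that aren't there". A more precise description: the AI produces content that sounds plausible but is actually incorrect, and it does not know it is wrong.

After sections 5 and 6, Hallucination is no longer a mysterious bug but a predictable structural phenomenon.

Why Hallucination is inevitable

Back to the essence of Autoregressive Generation:

The essence of the generation mechanism

What the AI does is "predict the probability distribution of the next token, then sample". The goal of this mechanism is to produce a sequence that sounds plausible; verifying factual correctness is a completely different task.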

Wrong mental model:
AI → query knowledge base → find fact → output

Right mental model:
AI → given context, predict "the most likely next token" → output

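Take the sentence "In 1952, Einstein proposed...". What most likely comes next? Based on the statistical patterns in the training data, a description that sounds like a physics theory, whether or not such a thing ever existed. The AI has no independent fact checker; it only has a statistical intuition for which words most likely come next.

Where Hallucination happens most

Obscure information. The less text about a topic in the training data, the blurrier the AI's probability distribution and the easier it is to sample a wrong token. Ask for the populations of the world's ten largest cities and accuracy is high; ask for the history of a small township and it drops sharply.

Exact numbers. Numbers are an inherent weakness of language models: "1994" and "1942" look very similar at the token level but differ enormously in meaning. AI is decent at "roughly which era" and unreliable at "exactly which year".

Late in a long context. The longer the conversation, the more noise in the context and the harder it is for Attention to focus on key facts. By turn 30, the AI's grip on a fact established in turn 1 is weaker than it was at the start.

Tasks that ask the AI to "make things up". With "write the history of a fictional company", the AI does exactly what the Hallucination mechanism does; you have simply authorized it. This is also why AI performs well at creative writing: the task is precisely to produce plausible-sounding fiction.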

Two types of Hallucination

Intrinsic Hallucination
  → output directly contradicts context you provided
  → you gave a doc saying A; AI answers B

Extrinsic Hallucination
  → output cannot be verified from your context
  → AI "fills in" information you never provided
  → more common, harder to spot

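Extrinsic Hallucination is the most dangerous because its output usually does not contradict the material you provided. The AI just says a bit more than you can verify, mixed in with correct content, which makes it hard to spot at a glance.

Using this understanding to steer Hallucination

Steering takeaways

Give context; don't rely on memory. The AI predicts far more accurately from text you provide than from what it absorbed in training. Instead of asking "What year did Einstein win the Nobel Prize?", paste in the source and ask "According to the following material, what year did Einstein win the Nobel Prize?". This is the core idea of RAG (Retrieval-Augmented Generation): retrieve relevant material first, then have the AI answer from it.

Ask the AI to flag uncertainty. Add to the System Prompt: "If you are not sure, say so explicitly; do not guess." This cannot eliminate Hallucination, but when the probability distribution is unclear it nudges the AI toward "I'm not sure" and away from a plausible-sounding wrong answer.

Always verify high-risk information. Legal provisions, medical data, exact dates, cited sources: do not trust the AI's output on these directly; go back to the original source. The AI's best role here is "draft for you to verify", with the final sign-off left to a person.

Constrain the output with structure. Instead of letting the AI answer freely, give it options: "Which of the following is correct? A, B, or C". Narrowing the token choices narrows the room for Hallucination.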

// ❌ Rely on "memory": prone to hallucination
const bad = await chat(
  "You are an assistant",
  "What is our company's return policy window in days?",
);

// ✅ Provide context: put facts in the prompt
const good = await chat(
  "You are an assistant. Answer only from the user's provided material. If it's not there, say \"not mentioned in the material.\"",
  `Return policy document:\n${policyDocument}\n\nQuestion: How many days for returns?`,
);
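
The "constrain the output with structure" advice pairs naturally with a check on the application side. A minimal sketch, assuming the model was asked to reply with exactly one of A, B, or C; the option list and the function name are made up for this example:

```typescript
const ALLOWED = ["A", "B", "C"] as const;
type Choice = (typeof ALLOWED)[number];

// Accept the reply only if it is exactly one of the allowed options.
function parseChoice(modelOutput: string): Choice | null {
  const cleaned = modelOutput.trim().toUpperCase();
  return (ALLOWED as readonly string[]).includes(cleaned) ? (cleaned as Choice) : null;
}

console.log(parseChoice(" b\n"));                // B
console.log(parseChoice("Probably B, I think")); // null: retry or hand off to a person
```

Anything outside the allowed set is rejected rather than interpreted, so a drifting answer becomes a visible failure that the application can retry or escalate.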

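An important mindset adjustment

Mindset adjustment

Hallucination will not be "fixed away". It is a structural product of the generation mechanism, not a bug. Better models reduce its frequency, but as long as the AI's essence is predicting a probability distribution over tokens, the possibility of incorrect output remains.

The point of understanding this is to know how to use AI well: let it do what it is good at (reasoning, synthesis, drafting, analysis), and add a layer of human verification where it is weak (exact facts, numbers, citations).

Stage 3 · How you steer

8. System Prompt vs User Prompt: the basic steering tools

This is the hands-on section of the article. Every mechanism from the previous seven sections finds its operational meaning here.

Start with the structure

When you call Claude through the API, a complete request looks like this: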

const response = await client.messages.create({
  model: "claude-opus-4-8",
  max_tokens: 1024,
  system: "You are a professional technical writing assistant...", // ← System Prompt
  messages: [
    { role: "user", content: "Draft API documentation for me" }, // ← User Prompt
    { role: "assistant", content: "Sure — please share the API spec..." },
    { role: "user", content: "Here is the spec: ..." }, // ← User Prompt
  ],
});

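System Prompt and User Prompt both enter the same Context Window, but they differ in role, in stability, and in the level at which they influence the AI's behavior.

System Prompt: defining who the AI is

The System Prompt is the "settings layer" that exists before the conversation starts. Its jobs are: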

Define role     → "You are a professional legal assistant"
Set constraints → "Only answer contract-related questions"
Specify format  → "Use bullet points; stay under 200 words"
Set tone        → "Formal tone; no emoji"
Inject knowledge → "Here is our product documentation: ..."

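Seen through the mechanisms covered earlier, the System Prompt has a few properties:

It is always at the front of the context. The U-shaped pattern favors the start and the end; placing the System Prompt first benefits from primacy bias and also fits the KV Cache prefix requirement. Its instructions therefore usually carry high priority.

It is the best candidate for the KV Cache. The System Prompt is stable and identical on every call, which is exactly what prefix caching needs. A well-designed System Prompt saves a great deal of repeated computation.

Common misconception

It sets "default behavior", not "absolute law". Many people get this wrong. The System Prompt tells the AI how it should normally behave, but if the User Prompt gives a clear opposite instruction, the behavior may be overridden. Truly hard limits need extra reinforcement by design.

User Prompt: the specific task for each turn

The User Prompt is what the user types on each turn: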

Assign task      → "Draft an apology email to a customer"
Provide material → "Here is the contract text: ..."
Give feedback    → "This version is too formal — make it conversational"
Add constraints  → "Also keep it under 100 words"

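Its properties:

It is dynamic. It differs every turn, so it is a poor fit for prefix caching and does not need it.

Its instructions sometimes override the System Prompt. This is flexibility by design, and also a potential risk. If your application should not let users change the AI's behavior, write the key limits in the System Prompt clearly and firmly.

It sits late in the context, where the model also pays close attention. Your latest instruction naturally carries a lot of influence and usually does not need repeating.

How the two interact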

System Prompt              User Prompt
────────────────────────────────────────────────
"Reply in English"    +    "what is AI?"
  → AI explains AI in English

"Only legal questions" +   "write me a poem"
  → AI should refuse (not guaranteed — depends how firm the system prompt is)

"Keep tone formal"    +    "explain this casually"
  → user prompt overrides system; AI usually goes casual
    (when they conflict, the nearer, clearer instruction tends to win)

Putting the first seven sections to work

A well-designed System Prompt should take into account every mechanism you have learned:

const systemPrompt = `
You are the technical writing assistant for Lumen Tech.

# Hard constraints (up front — high Attention weight at the start)
- Answer only from material the user provides. Do not add outside information.
- If unsure, say "I'm not sure" — do not guess.

# Output format (structure narrows the token space)
1. Conclusion (one sentence)
2. Explanation (bullets, max three)
3. Code example if needed

# Long threads (respect Context Window limits)
If the conversation grows, prioritize the latest task; older detail can drop.
`.trim();

// Stable system prompt → good KV Cache candidate → lower cost

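The mechanism behind each line

Constraints at the top → uses the model's strong attention to the start of the context
Asking it to flag uncertainty → counters Hallucination
Restricting output with a format → tames the divergence Temperature introduces
Hinting at length management → reflects the finite Context Window
Keeping the System Prompt unchanged → lets the KV Cache hit

A very practical way to think about it

Think of the System Prompt as the job description you hand a new hire, and the User Prompt as the work you assign each day.

With a clear job description, the employee knows how to decide in ambiguous situations; with a vague one, they can only guess.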

Job description (System Prompt) covers:
  → scope of this role
  → how to handle uncertainty
  → output format and tone standards
  → what to decide alone vs. when to say "I don't know"

Work assignment (User Prompt) covers:
  → the specific task this turn
  → materials or constraints for this turn only
  → feedback on the last output

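9. In practice: tying every concept together with a support bot

Theory done. Let's use a concrete example to see how these eight concepts work together in a real application.

Suppose you are building a customer support bot for your company. It has to answer product questions in a professional, friendly tone, and it must never invent policies the company does not have. We'll go step by step and see which mechanism each decision maps to.

Step 1: design the System Prompt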

const systemPrompt = `
You are the support assistant for Lumen Tech.

# Core constraints
- Answer only from "reference material." If it's not there, say "I'll connect you with a specialist."
- Never guess or invent specs, prices, or return policies.
- If unsure, say so — don't sound plausible without evidence.

# Tone
- Professional, friendly, concise.
- No emoji.

# Answer format
- Answer the question first, then add necessary detail.
- Use numbered steps when describing procedures.
`.trim();

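What did we use here?

The core constraints come first, so the primacy effect of the U-shaped pattern and the KV Cache both work in favor of the most important rules.
"Don't guess; say so when unsure" counters Hallucination directly, especially the most dangerous Extrinsic kind (inventing policies that do not exist).
This System Prompt never changes, which creates the conditions for KV Cache hits.

Step 2: add knowledge with RAG

A support bot cannot rely on the model's "memory" to answer company-specific policy questions, because that information is not in the training data at all; asking anyway only triggers Hallucination. The right approach is to retrieve relevant documents and put them in the context: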

async function answerCustomerQuestion(question: string) {
  // 1. Retrieve relevant chunks from the knowledge base
  const relevantDocs = await vectorSearch(question, { limit: 3 });

  // 2. Assemble retrieved material into context
  const userContent = `
Reference material:
${relevantDocs.map((d) => d.content).join("\n---\n")}

Customer question: ${question}
  `.trim();

  // 3. Call the model
  const response = await client.messages.create({
    model: "claude-opus-4-8",
    max_tokens: 1024,
    temperature: 0.3, // low: support needs stable, predictable answers
    system: [
      {
        type: "text",
        text: systemPrompt,
        cache_control: { type: "ephemeral" },
      },
    ],
    messages: [{ role: "user", content: userContent }],
  });

  return response;
}

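What did we use here?

Retrieved material fills the Context Window, so the AI answers from the data in front of it and leans less on vague impressions left over from training. This is the most effective way to reduce Hallucination.
temperature: 0.3: support needs stability and predictability, not creativity. Low Temperature makes similar questions more likely to get consistent answers.
cache_control: lets the stable System Prompt hit the cache, cutting cost and latency in high-frequency support scenarios.
limit: 3 takes only the three most relevant chunks, in line with "more context is not better", so noise does not dilute Attention. The name topK is deliberately avoided here to prevent confusion with Top-K sampling in §6.

Step 3: manage context across turns

A customer may ask several questions in a row. On every turn you have to send the history back (because the AI has no memory), but you cannot let the context grow without limit: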

async function handleConversation(
  history: Message[],
  newQuestion: string,
): Promise<Message[]> {
  // Cap history length — avoid blowing context budget or losing Attention focus
  const trimmedHistory = history.slice(-6); // keep last 3 Q&A pairs

  const relevantDocs = await vectorSearch(newQuestion, { limit: 3 });

  const messages: Message[] = [
    ...trimmedHistory,
    {
      role: "user",
      content: `Reference material:\n${formatDocs(relevantDocs)}\n\nQuestion: ${newQuestion}`,
    },
  ];

  const response = await client.messages.create({
    model: "claude-opus-4-8",
    max_tokens: 1024,
    temperature: 0.3,
    system: [{ type: "text", text: systemPrompt, cache_control: { type: "ephemeral" } }],
    messages,
  });

  return [...messages, { role: "assistant", content: extractText(response) }];
}

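What did we use here?

history.slice(-6): actively manages the Context Window so long conversations do not blow the budget. It also addresses the accuracy drop late in long contexts noted in the Hallucination section: dropping stale turns helps the AI focus on the current question.
The retained history is sent back on every call, because each API call is stateless and the model will not remember the previous turn by itself.
The relevantDocs retrieved each turn are dynamic, so placing them in the latest User Prompt makes sense. What really suits long-lived caching is the stable System Prompt, few-shot examples, or large fixed documents that the conversation refers to repeatedly. When designing a RAG application, treat "fixed, cacheable knowledge" and "per-turn retrieval results" separately, or you will overestimate the cache hit rate.

Walking the whole chain

When a customer asks "The headphones I bought are broken. Can I return them?", here is what happens behind the scenes: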

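Token: the question is split into tokens and turned into vectors.
RAG + Context Window: the system retrieves the return policy document and assembles it into the context.
KV Cache: the stable System Prompt hits the cache and is not recomputed.
Attention: the AI computes which passages of the policy document relate most strongly to tokens such as "return" and "broken".
Autoregressive Generation: the AI generates the answer one token at a time.
Temperature: a low setting keeps the answer stable and on track.
Hallucination protection: because the answer is grounded in a real retrieved document and the System Prompt says "do not guess", the AI is far less likely to invent a return policy that does not exist.
System / User Prompt division of labor: System defines the support role and its hard rules; User brings this turn's specific question and material.

Eight concepts, all present in a single answer. That is why understanding the mechanism matters: every design decision has a clear reason behind it, with less trial and error by feel.

10. Integration: one complete map

With all eight concepts covered, you now hold a complete map. Let's connect them into one causal chain and see how they link together.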

Token
  → AI's basic unit. All computation starts here; capacity and billing are token-based.

Context Window
  → AI's working memory. Finite like RAM; cleared every call.

Attention
  → How AI decides what matters. Each token gets a different weight via Q, K, V.
     This is where Lost in the Middle comes from.

KV Cache
  → Optimization built from Attention's K and V. Stable prefixes cache;
     that's why system prompts should be stable and up front.

Autoregressive Generation
  → How AI produces answers. One token at a time, no global plan;
     errors compound — source of many odd behaviors.

Temperature & Sampling
  → Knobs for how to pick from the distribution. Controls determinism,
     not intelligence. Low = stable; high = varied.

Hallucination
  → Structural outcome of generation, not a bug. AI optimizes for plausible;
     factual correctness needs extra tooling. Use it wisely; don't trust blindly.

System Prompt vs User Prompt
  → Turns all of the above into levers you actually pull.
     System defines the role (stable, front); user assigns the task (dynamic, can override).

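The value of this map

When the AI's behavior confuses you, you can go back to the mechanism level to understand why it does that, and then know what you can change.

Why does it lose its memory in a new conversation? Because the Context Window is empty.

Why is an instruction in the middle of a long document ignored? Because of Lost in the Middle.

Why does changing one word make the API slower and more expensive? Because the cache was invalidated.

Why does the same question get a different answer each time? Because sampling at Temperature > 0 is random.

Why is it confidently wrong? Because it predicts what is plausible; it does not verify what is correct.

You now have structural answers to all of these.

The essence of steering AI is understanding how the machine runs and then working with its mechanisms. Memorizing a pile of prompt tricks is only a shortcut; once you understand the foundations, those tricks follow from the mechanisms, and you don't have to memorize them like incantations.

Techniques are the surface; mechanisms are the root

Looking back, none of the practical advice in this article came out of thin air. Put important information at the start or the end because of Lost in the Middle; keep the System Prompt stable because of KV Cache prefix matching; give the AI material and rely less on what it "remembers" because generation predicts what is plausible, and verifying what is correct takes other means; ask the AI to think step by step because autoregression turns intermediate steps into scaffolding for later reasoning. Techniques are the surface; mechanisms are the root. Hold on to the root and you can work out how to respond to any new AI tool, new model, or strange new behavior yourself, without searching for "how to write a prompt for XXX" every time.

That is the real gap between "using AI" and "steering AI".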