Cache Augmented Generation (CAG)

Cache Augmented Generation (CAG) enhances LLM efficiency by preloading static knowledge into a key-value cache, eliminating real-time retrieval. It offers low latency, fewer retrieval errors, and a simpler architecture, making it well suited to static knowledge bases and time-sensitive tasks.

What is Cache Augmented Generation (CAG)?

Cache Augmented Generation (CAG) is a novel approach to enhancing the performance and efficiency of large language models (LLMs) by leveraging preloaded knowledge in the form of precomputed key-value (KV) caches. Unlike Retrieval Augmented Generation (RAG), which dynamically retrieves external knowledge during inference, CAG eliminates retrieval steps altogether by embedding all relevant knowledge directly into the model’s extended context window before inference. This preloading strategy allows LLMs to generate responses using the precomputed information, significantly reducing latency and simplifying system architecture.

By storing the processed knowledge in a key-value cache, CAG ensures that the model has immediate access to the necessary context for answering queries. This approach is especially advantageous in scenarios where the knowledge base is static, relatively small, or when low latency is a priority.

How Does CAG Work?

CAG operates through three primary phases:

1. External Knowledge Preloading

  • All relevant documents or datasets are preloaded into the model’s context window before inference.
  • The preloaded content is processed into a key-value (KV) cache, capturing the model’s internal representation of the knowledge. For example:
    import torch
    from transformers import DynamicCache

    def preprocess_knowledge(model, tokenizer, prompt):
        # One forward pass over the knowledge prompt populates the KV cache.
        input_ids = tokenizer.encode(prompt, return_tensors="pt").to(model.device)
        past_key_values = DynamicCache()
        with torch.no_grad():
            outputs = model(input_ids=input_ids, past_key_values=past_key_values, use_cache=True)
        return outputs.past_key_values

This step ensures that the model has immediate access to the preprocessed knowledge, bypassing the need for real-time retrieval.
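For concreteness, here is a minimal sketch of how this preloading step might be driven, building on the preprocess_knowledge function above and assuming a Hugging Face transformers causal LM; the model name, document contents, and instruction text are placeholders:

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_name = "meta-llama/Llama-3.1-8B-Instruct"  # placeholder; any long-context causal LM works
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16, device_map="auto")

    # Concatenate the static knowledge base into a single prompt.
    documents = ["<contents of document 1>", "<contents of document 2>"]  # placeholder documents
    knowledge_prompt = "Answer questions using only the context below.\n\n" + "\n\n".join(documents)

    # Build the KV cache once and record its length for later resets (see Cache Management below).
    kv_cache = preprocess_knowledge(model, tokenizer, knowledge_prompt)
    original_length = kv_cache.key_cache[0].shape[-2]  # number of preloaded knowledge tokens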

2. Inference with Precomputed Cache

  • When a query is made, the precomputed KV cache is combined with the query input.
  • The model generates a response based solely on the cached knowledge and the query, avoiding additional computations for knowledge retrieval. For instance:
    def generate_response(model, tokenizer, query, kv_cache, max_new_tokens=256):
        next_ids = tokenizer.encode(query, return_tensors="pt").to(model.device)
        generated = []
        for _ in range(max_new_tokens):  # greedy decoding on top of the preloaded cache
            logits = model(input_ids=next_ids, past_key_values=kv_cache, use_cache=True).logits
            next_ids = logits[:, -1:].argmax(dim=-1)
            generated.append(next_ids.item())
        return tokenizer.decode(generated, skip_special_tokens=True)

3. Cache Management

  • As new queries are processed, their tokens are appended and the cache grows. A reset mechanism truncates the cache back to its original length to maintain performance and ensure each subsequent query is evaluated against only the preloaded knowledge. Example cache reset (an end-to-end sketch combining all three phases follows below):
    def clean_up(kv_cache, original_length):
        # Truncate every layer's key and value tensors back to the preloaded knowledge length.
        for i in range(len(kv_cache.key_cache)):
            kv_cache.key_cache[i] = kv_cache.key_cache[i][:, :, :original_length, :]
            kv_cache.value_cache[i] = kv_cache.value_cache[i][:, :, :original_length, :]
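
Putting the three phases together, a multi-query loop might look like the sketch below. It assumes the model, tokenizer, kv_cache, and original_length objects from the preloading sketch earlier; the queries and prompt template are placeholders:

    # Answer several queries against the same preloaded knowledge, resetting the cache between them.
    queries = ["What does the warranty cover?", "How do I request a refund?"]  # placeholder queries
    for query in queries:
        answer = generate_response(model, tokenizer, "\nQuestion: " + query + "\nAnswer:", kv_cache)
        print(answer)
        clean_up(kv_cache, original_length)  # drop the query/answer tokens, keep only the knowledge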

Key Advantages of CAG

  1. Low Latency
    Since there is no need for real-time retrieval, CAG offers faster response times compared to RAG. This makes it ideal for time-sensitive applications.
  2. Improved Accuracy
    By preloading all relevant documents, CAG ensures the model processes a comprehensive dataset, reducing the risk of context gaps or retrieval errors.
  3. Simplified Architecture
    Unlike RAG, which requires a complex retrieval pipeline, CAG’s architecture is streamlined, reducing system complexity and maintenance overhead.
  4. Efficiency at Scale
    Once the knowledge is preloaded and cached, subsequent queries are processed with minimal computational overhead, making CAG efficient for repeated queries within the same knowledge domain.

Limitations of CAG

  1. Context Window Size
     CAG relies on the model’s context window to hold the preloaded knowledge. Even long-context LLMs (many support on the order of 128,000 tokens) cap how much knowledge can be preloaded; a rough feasibility check is sketched after this list.
  2. Knowledge Base Size
    CAG is best suited for static and manageable knowledge bases. For large or dynamic datasets, the model might struggle to fit all relevant information into the context window.
  3. Static Knowledge
    CAG assumes the knowledge base remains unchanged during inference. It is less effective for use cases requiring real-time updates or dynamic knowledge integration.
  4. Cost Implications
    Large context windows increase computational costs during preloading, making CAG less economical for scenarios involving frequent updates or changes to the knowledge base.
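
Before committing to CAG, it can help to check whether the knowledge base actually fits in the window. Below is a small sketch of such a check, assuming a Hugging Face tokenizer as in the earlier examples; the 128,000-token limit and the headroom reserved for queries and answers are illustrative assumptions:

    def fits_in_context(tokenizer, documents, context_limit=128_000, dialogue_headroom=4_000):
        # Rough feasibility check: total knowledge tokens plus room for queries and answers.
        knowledge_tokens = sum(len(tokenizer.encode(doc)) for doc in documents)
        return knowledge_tokens + dialogue_headroom <= context_limit

    if not fits_in_context(tokenizer, documents):
        print("Knowledge base exceeds the context budget; consider RAG or a hybrid approach.")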

How is CAG Used?

Practical Applications

CAG is commonly applied in scenarios where the knowledge base is static, manageable in size, and low latency is critical:

  1. Customer Support Chatbots
  • Example: Preloading common troubleshooting steps for software products to provide instant responses to users.
  • Benefit: Eliminates retrieval errors and speeds up response times.
  2. Document Analysis
  • Example: Financial institutions analyzing quarterly reports or legal firms querying regulatory documents.
  • Benefit: Ensures consistent and accurate responses by preloading all relevant documents into the model.
  3. Healthcare Assistants
  • Example: Preloading medical guidelines to assist with patient queries.
  • Benefit: Maintains continuity across multi-turn dialogues and ensures accurate referencing.
  4. Education and Training
  • Example: Answering frequently asked questions in corporate training programs.
  • Benefit: Simplifies deployment while ensuring consistent responses.

Comparisons with Retrieval Augmented Generation (RAG)

| Feature | CAG | RAG |
| --- | --- | --- |
| Knowledge Handling | Preloads knowledge into the context window. | Dynamically retrieves knowledge at runtime. |
| System Complexity | Simplified, no retrieval pipeline required. | Requires additional components for retrieval. |
| Latency | Low, as retrieval steps are eliminated. | Higher due to real-time retrieval processes. |
| Scalability | Limited by context window size. | Scales well with large, dynamic datasets. |
| Error Risks | No retrieval errors. | Vulnerable to retrieval and ranking errors. |
| Best Use Cases | Static, low-latency tasks. | Dynamic, large, or frequently updated tasks. |

Examples of Use Cases

CAG in Action

  1. HR Systems
    A company uses CAG to preload employee policies into the model. Employees can query the system for specific guidelines, and responses are generated instantly.
  2. Legal Assistance
    A legal assistant preloads relevant case laws into the model’s context to provide quick answers to legal queries without using a retrieval system.
  3. Customer Service
    A SaaS product’s chatbot uses CAG to preload FAQs and troubleshooting guides, ensuring smooth and fast customer interactions.

RAG for Dynamic Scenarios

  1. News Aggregation
    A news app uses RAG to fetch and summarize the latest articles, dynamically retrieving the most relevant information for user queries.
  2. E-commerce Search
    RAG is used to retrieve product details and availability from a large and frequently updated catalog.
  3. Research Platforms
    A scientific research platform employs RAG to fetch relevant papers and studies from large external databases.

Implementation Example: Preloading Knowledge in Python

The sketch below condenses the preloading step into a single snippet. It is illustrative rather than definitive: the model name and the knowledge file path are placeholders, and any Hugging Face causal LM with a sufficiently long context window could be substituted.
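
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer, DynamicCache

    model_name = "meta-llama/Llama-3.1-8B-Instruct"  # placeholder; substitute any long-context causal LM
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16, device_map="auto")

    # Static knowledge, concatenated offline into one file (placeholder path).
    knowledge = open("knowledge_base.txt", encoding="utf-8").read()
    input_ids = tokenizer.encode(knowledge, return_tensors="pt").to(model.device)

    # A single forward pass builds the KV cache that all later queries reuse.
    with torch.no_grad():
        kv_cache = model(input_ids=input_ids, past_key_values=DynamicCache(), use_cache=True).past_key_values
    original_length = kv_cache.key_cache[0].shape[-2]  # remembered so the cache can be reset between queries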
This preloading mechanism ensures that the model processes queries without requiring external retrieval, enabling efficient and low-latency performance.

When to Use CAG

  1. Static Knowledge Bases
    Ideal for cases where the knowledge base is unlikely to change frequently.
  2. Low-Latency Applications
    Suitable for customer support, education, or healthcare systems where quick responses are necessary.
  3. Cost-Effective Scenarios
    Beneficial when the preloaded knowledge remains consistent across multiple queries, reducing computational overhead.

CAG is an efficient alternative to RAG for tasks requiring speed, simplicity, and consistency. However, it is limited by the size and static nature of the knowledge base.

Research on Cache Augmented Generation (CAG)

  1. Adaptive Contextual Caching for Mobile Edge Large Language Model Service
    Authors: Guangyuan Liu, Yinqiu Liu, Jiacheng Wang, Hongyang Du, Dusit Niyato, Jiawen Kang, Zehui Xiong
    This paper addresses the challenges faced in mobile edge Large Language Model (LLM) deployments, such as limited computational resources and high retrieval latency. It proposes an Adaptive Contextual Caching (ACC) framework, which uses deep reinforcement learning (DRL) to optimize cache replacement policies by considering user context, document similarity, and cache miss overhead. Experimental results show that ACC achieves over 80% cache hit rates after 11 training episodes, significantly reducing retrieval latency by up to 40% compared to traditional methods. Furthermore, it minimizes the local caching overhead by up to 55%, making it suitable for scalable, low-latency LLM services in resource-constrained environments. The work highlights the potential of ACC to enhance efficiency in edge LLM systems.
  2. Towards More Economical Context-Augmented LLM Generation by Reusing Stored KV Cache
    Authors: Hanchen Li, Yuhan Liu, Yihua Cheng, Kuntai Du, Junchen Jiang
    This study explores the reuse of Key-Value (KV) caches to reduce prefill delays in LLM applications, particularly for repeated input texts. It investigates whether such cache reuse can also be economically viable when utilizing public cloud services for storage and processing. The authors propose a validated analytical model to assess the cloud costs (in compute, storage, and network) of storing and reusing KV caches across various workload parameters. The study demonstrates that KV cache reuse saves both delay and cloud costs for workloads with long contexts, encouraging further efforts in building more economical context-augmented LLM systems.
  3. MPIC: Position-Independent Multimodal Context Caching System for Efficient MLLM Serving
    Authors: Shiju Zhao, Junhao Hu, Rongxiao Huang, Jiaqi Zheng, Guihai Chen
    This paper introduces MPIC, a Position-Independent Multimodal Context Caching system, aimed at addressing inefficiencies in multimodal large language model (MLLM) inference. Traditional systems recompute the entire KV cache even for slight differences in context, leading to inefficiencies. MPIC offers a position-independent caching system that stores KV caches locally or remotely and parallelizes cache computation and loading during inference. Integrated reuse and recompute mechanisms mitigate accuracy degradation while achieving up to 54% reduction in response time compared to existing methods. This work highlights the potential for improved efficiency in multimodal LLM serving systems.