The Evolution of AI Memory: From RAG to CAG and Beyond

Sam Ozturk
5 min read · Jan 7, 2025


In the rapidly evolving landscape of artificial intelligence, how we handle information retrieval and generation is undergoing a fascinating transformation. Two key approaches have emerged: Retrieval-Augmented Generation (RAG) and its evolutionary successor, Cache-Augmented Generation (CAG). Let’s dive deep into these technologies and explore their implications for the future of AI.

Understanding RAG: The Foundation

Retrieval-Augmented Generation has become a cornerstone of modern AI systems. At its core, RAG combines two powerful capabilities:

  1. Retrieval: The ability to search through and fetch relevant information from a knowledge base
  2. Generation: The capacity to create coherent, contextually appropriate responses

Imagine a librarian who not only knows where every book is but can also synthesize information from multiple sources to answer your questions. That’s essentially what RAG does, but at machine speed and scale.
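
In code, the idea reduces to “retrieve, then generate.” The sketch below is purely illustrative: retriever and llm are placeholders for whatever search component and language model you actually use.

def answer_with_rag(query: str, retriever, llm) -> str:
    # 1. Retrieval: fetch the passages most relevant to the query
    passages = retriever(query)
    context = "\n".join(passages)
    # 2. Generation: let the model answer, grounded in the retrieved context
    return llm(f"Context:\n{context}\n\nQuestion: {query}")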

Enter CAG: The Next Evolution

Cache-Augmented Generation builds upon RAG’s foundation but adds a crucial optimization layer. Think of it as adding a “memory cache” to our librarian’s capabilities. Here’s how it works (a minimal sketch follows the list):

  1. Caching Mechanism: CAG maintains a cache of previously generated responses
  2. Smart Retrieval: When a similar query arrives, it can quickly retrieve and adapt cached responses
  3. Efficiency Gains: This approach reduces computational overhead and improves response consistency
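
At its simplest, the cache is just a lookup in front of the generator. The minimal sketch below covers only the exact-match case; rag_answer is a placeholder for a full RAG pipeline, and similarity matching plus expiry are added in the deep dive further down.

response_cache = {}  # maps query text -> previously generated response

def answer_with_cag(query: str, rag_answer) -> str:
    # Cached: return a previously generated response when we have one
    if query in response_cache:
        return response_cache[query]
    # Otherwise fall back to normal RAG generation and remember the result
    response = rag_answer(query)
    response_cache[query] = response
    return response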

RAG vs. CAG: A Detailed Comparison

Let’s break down the key differences:

Performance

  • RAG: Must process each query from scratch, leading to consistent but potentially slower response times
  • CAG: Can leverage cached results, offering significantly faster responses for similar queries

Resource Usage

  • RAG: Higher computational resources required for each query
  • CAG: More efficient resource utilization through caching, though requires storage for the cache

Consistency

  • RAG: May produce slightly different responses to similar queries
  • CAG: Higher consistency in responses due to cache utilization

Adaptability

  • RAG: More flexible with entirely new queries
  • CAG: Excellent for common queries, but may need fallback to traditional generation for novel questions

Technical Deep Dive: From RAG to CAG

Let’s examine how RAG works under the hood, implement a basic version, and then layer CAG on top of it.

RAG Architecture

The RAG pipeline consists of three main components:

1. Document Indexing

  • Document chunking
  • Embedding generation
  • Vector store indexing

2. Retrieval System

  • Query embedding
  • Similarity search
  • Top-k retrieval

3. Generation Module

  • Context assembly
  • Prompt construction
  • LLM generation

Here’s a basic implementation. We’ll build a minimal RAG pipeline first and then add the caching layer on top of it.
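
The BasicRAG class below is a deliberately small sketch so the example stays self-contained: SimpleEmbeddings is a stand-in hash-based embedder and the “vector store” is just a NumPy array. Those two names, and the optional llm callable, are assumptions for illustration; in a real system you would plug in a proper embedding model, a vector database, and an LLM client.

from typing import Callable, List, Optional
import numpy as np


class SimpleEmbeddings:
    """Toy embedder: hashed bag-of-words, unit-normalized. Swap in a real model."""
    def embed_query(self, text: str, dim: int = 768) -> np.ndarray:
        vec = np.zeros(dim)
        for token in text.lower().split():
            vec[hash(token) % dim] += 1.0
        norm = np.linalg.norm(vec)
        return vec / norm if norm > 0 else vec


class BasicRAG:
    """Minimal RAG pipeline: embed documents, retrieve top-k, assemble a prompt."""
    def __init__(self, documents: List[str],
                 llm: Optional[Callable[[str], str]] = None, k: int = 3):
        self.embeddings = SimpleEmbeddings()
        self.documents = documents
        # Document indexing: one embedding per (already chunked) document
        self.doc_vectors = np.stack([self.embeddings.embed_query(d) for d in documents])
        self.llm = llm  # any callable prompt -> str; if None, the prompt itself is returned
        self.k = k

    def retrieve(self, query: str) -> List[str]:
        # Retrieval: similarity search over the indexed vectors
        q = self.embeddings.embed_query(query)
        scores = self.doc_vectors @ q
        top = np.argsort(scores)[::-1][: self.k]
        return [self.documents[i] for i in top]

    def query(self, question: str) -> str:
        # Generation: assemble the context, build the prompt, call the LLM
        context = "\n".join(self.retrieve(question))
        prompt = (f"Answer the question using only the context below.\n\n"
                  f"Context:\n{context}\n\nQuestion: {question}")
        return self.llm(prompt) if self.llm else prompt

With BasicRAG in place, here’s the CAG layer that wraps it with a response cache: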

from typing import Dict, List, Optional, Tuple
import numpy as np
from datetime import datetime, timedelta


class CacheEntry:
    def __init__(self, response: str, timestamp: datetime):
        self.response = response
        self.timestamp = timestamp
        self.access_count = 1


class CAG:
    def __init__(self, base_rag: BasicRAG, cache_ttl: int = 3600):
        self.rag = base_rag
        self.cache: Dict[str, CacheEntry] = {}
        self.cache_ttl = timedelta(seconds=cache_ttl)
        self.embedding_cache: Dict[str, np.ndarray] = {}
        # Minimum cosine similarity for treating a cached query as "the same question"
        self.similarity_threshold = 0.95

    def _compute_query_embedding(self, query: str) -> np.ndarray:
        # Cache query embeddings for efficiency
        if query not in self.embedding_cache:
            self.embedding_cache[query] = self.rag.embeddings.embed_query(query)
        return self.embedding_cache[query]

    def _find_similar_cached_query(self, query: str) -> Tuple[Optional[str], float]:
        # Linear scan over cached queries; the caller decides whether the best
        # match is close enough (see similarity_threshold)
        query_embedding = self._compute_query_embedding(query)

        max_similarity = -1.0
        most_similar_query = None

        for cached_query in self.cache.keys():
            cached_embedding = self._compute_query_embedding(cached_query)
            # Dot product equals cosine similarity for unit-normalized embeddings
            similarity = float(np.dot(query_embedding, cached_embedding))

            if similarity > max_similarity:
                max_similarity = similarity
                most_similar_query = cached_query

        return most_similar_query, max_similarity

    def query(self, question: str) -> str:
        # Check for an exact cache hit that has not expired
        if question in self.cache:
            entry = self.cache[question]
            if datetime.now() - entry.timestamp < self.cache_ttl:
                entry.access_count += 1
                return entry.response
            else:
                del self.cache[question]

        # Check for a semantically similar cached question
        similar_query, similarity = self._find_similar_cached_query(question)
        if similar_query and similarity > self.similarity_threshold:
            entry = self.cache[similar_query]
            if datetime.now() - entry.timestamp < self.cache_ttl:
                entry.access_count += 1
                return f"[Cached] {entry.response}"

        # Fall back to RAG for new queries and cache the result
        response = self.rag.query(question)
        self.cache[question] = CacheEntry(response, datetime.now())
        return response

    def maintain_cache(self, max_size: int = 1000):
        """Clean up old or less frequently accessed entries"""
        if len(self.cache) > max_size:
            entries = list(self.cache.items())
            # Least-used, oldest entries first; evict everything beyond max_size
            entries.sort(key=lambda x: (x[1].access_count, x[1].timestamp))
            to_remove = entries[:-max_size]
            for k, _ in to_remove:
                del self.cache[k]

Advanced Optimization Techniques

1. Cache Management Strategies

class AdvancedCAG(CAG):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # Track cache behaviour so eviction strategies can be compared
        self.cache_stats = {
            'hits': 0,
            'misses': 0,
            'evictions': 0
        }

    def implement_lru_cache(self, max_size: int = 1000):
        """Implements a Least Recently Used-style eviction strategy
        (approximated here by each entry's insertion timestamp)."""
        if len(self.cache) <= max_size:
            return

        # Oldest entries first; keep only the max_size most recent ones
        sorted_entries = sorted(
            self.cache.items(),
            key=lambda x: x[1].timestamp
        )

        for key, _ in sorted_entries[:-max_size]:
            del self.cache[key]
            self.cache_stats['evictions'] += 1

    def implement_lfu_cache(self, max_size: int = 1000):
        """Implements a Least Frequently Used eviction strategy"""
        if len(self.cache) <= max_size:
            return

        # Least-accessed entries first; keep only the max_size most used ones
        sorted_entries = sorted(
            self.cache.items(),
            key=lambda x: x[1].access_count
        )

        for key, _ in sorted_entries[:-max_size]:
            del self.cache[key]
            self.cache_stats['evictions'] += 1

2. Embedding Optimization

For large-scale applications, we can optimize embedding storage and retrieval:

import faiss  # pip install faiss-cpu


class OptimizedCAG(AdvancedCAG):
    def __init__(self, *args, embedding_dim: int = 768, **kwargs):
        super().__init__(*args, **kwargs)
        # Inner-product index wrapped so we can attach our own integer ids; set
        # embedding_dim to your model's output size (e.g. 1536 for OpenAI embeddings)
        self.embedding_index = faiss.IndexIDMap(faiss.IndexFlatIP(embedding_dim))
        self.id_to_query: Dict[int, str] = {}

    def _update_embedding_index(self, query: str, embedding: np.ndarray):
        query_id = len(self.id_to_query)
        self.id_to_query[query_id] = query
        self.embedding_index.add_with_ids(
            embedding.reshape(1, -1).astype(np.float32),
            np.array([query_id], dtype=np.int64)
        )

    def query(self, question: str) -> str:
        is_new = question not in self.cache
        response = super().query(question)
        # Index freshly cached queries so future lookups can go through FAISS
        if is_new and question in self.cache:
            self._update_embedding_index(question, self._compute_query_embedding(question))
        return response

    def _find_similar_cached_query(self, query: str) -> Tuple[Optional[str], float]:
        query_embedding = self._compute_query_embedding(query)

        if self.embedding_index.ntotal == 0:
            return None, 0.0

        # Use FAISS for efficient similarity search (inner product equals cosine
        # similarity for unit-normalized embeddings)
        D, I = self.embedding_index.search(
            query_embedding.reshape(1, -1).astype(np.float32),
            k=1
        )

        if D[0][0] > self.similarity_threshold:
            candidate = self.id_to_query.get(int(I[0][0]))
            # Guard against entries that were evicted from the response cache
            if candidate in self.cache:
                return candidate, float(D[0][0])
        return None, 0.0

Performance Considerations

When implementing CAG, several factors affect performance:

1. Memory Usage

  • Cache size grows with unique queries
  • Embedding storage can be substantial
  • Need for efficient cache eviction policies

2. Computational Overhead

  • Embedding computation for similarity matching
  • Cache maintenance operations
  • Vector similarity calculations

3. Latency Tradeoffs

  • Cache hits: on the order of microseconds
  • Similarity matching: on the order of milliseconds
  • Full RAG generation: typically seconds

Here’s a simple benchmark implementation:

import time


class CAGBenchmark:
    def __init__(self, cag_system: CAG):
        self.cag = cag_system
        self.metrics = {
            'cache_hits': 0,
            'cache_misses': 0,
            'avg_response_time': 0.0,
            'total_queries': 0
        }

    def run_benchmark(self, queries: List[str]):
        for query in queries:
            start_time = time.time()
            response = self.cag.query(query)
            end_time = time.time()

            # Update the running average response time
            self.metrics['total_queries'] += 1
            self.metrics['avg_response_time'] = (
                (self.metrics['avg_response_time'] *
                 (self.metrics['total_queries'] - 1) +
                 (end_time - start_time)) /
                self.metrics['total_queries']
            )

            # Semantic cache hits are tagged with a "[Cached]" prefix by CAG.query;
            # exact hits are returned untagged and therefore count as misses here
            if '[Cached]' in response:
                self.metrics['cache_hits'] += 1
            else:
                self.metrics['cache_misses'] += 1

        return self.metrics
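
To tie the pieces together, here’s a hypothetical end-to-end run. The documents and queries are made up, and llm is left as None so BasicRAG simply returns the assembled prompt instead of calling a model.

docs = [
    "RAG retrieves relevant documents before generating an answer.",
    "CAG adds a response cache on top of a RAG pipeline.",
    "Cache entries expire after a configurable TTL.",
]
rag = BasicRAG(documents=docs, llm=None, k=2)
cag = CAG(base_rag=rag, cache_ttl=3600)

benchmark = CAGBenchmark(cag)
results = benchmark.run_benchmark([
    "What does RAG do before generating?",
    "What does RAG do before it generates?",  # near-duplicate; a hit depends on the embedder and threshold
    "How long do cache entries live?",
])
print(results)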

Future Optimizations

Looking ahead, several promising optimizations are being explored:

1. Hierarchical Caching

  • Multiple cache layers with different retention policies
  • Distributed cache architecture for scale

2. Dynamic Similarity Thresholds

  • Adaptive similarity thresholds based on query patterns (a small sketch follows this list)
  • Context-aware cache matching

3. Predictive Caching

  • Pre-generating responses for likely future queries
  • Query pattern analysis for cache optimization
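
As a small illustration of the adaptive-threshold idea above, the hypothetical AdaptiveThresholdCAG below nudges similarity_threshold up or down based on feedback about cached answers. The class name, the record_feedback method, and the adjustment constants are assumptions for illustration, not an established technique.

class AdaptiveThresholdCAG(AdvancedCAG):
    """Hypothetical sketch: adapt the cache-matching threshold from user feedback."""

    def record_feedback(self, was_cached: bool, was_helpful: bool):
        if not was_cached:
            return
        if was_helpful:
            # Cached answers keep being accepted: relax the match requirement slightly
            self.similarity_threshold = max(0.90, self.similarity_threshold - 0.001)
        else:
            # A cached answer missed the mark: demand closer matches from now on
            self.similarity_threshold = min(0.99, self.similarity_threshold + 0.01)

Because CAG.query compares candidate matches against self.similarity_threshold, raising or lowering it immediately changes how aggressively cached responses are reused.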

The Future of RAG and CAG

As we look ahead, several exciting developments are on the horizon:

Hybrid Approaches

The future likely lies in intelligent hybrid systems that combine the strengths of both RAG and CAG. These systems would:

  • Dynamically decide whether to use cached responses or generate new ones
  • Employ sophisticated cache management strategies
  • Adapt to changing information and user needs

Enhanced Caching Strategies

Future developments will likely focus on:

  • More intelligent cache prioritization
  • Better cache invalidation strategies
  • Advanced similarity matching for cached responses

Broader Applications

These technologies will find new applications in:

  • Enterprise knowledge management
  • Real-time customer service
  • Educational systems
  • Healthcare information systems

Why This Matters

The evolution from RAG to CAG represents more than just a technical improvement. It’s a step toward more efficient, consistent, and scalable AI systems. For businesses and organizations, this means:

  1. Cost Efficiency: Reduced computational costs through smart caching
  2. Better User Experience: Faster, more consistent responses
  3. Scalability: Ability to handle larger query volumes without proportional resource increases

Conclusion

The journey from RAG to CAG illustrates the constant innovation in AI technology. While RAG continues to be valuable for many applications, CAG represents an important optimization that addresses real-world scalability and efficiency challenges. As these technologies continue to evolve, we can expect to see even more sophisticated approaches that combine the best of both worlds.

Link to the paper: https://arxiv.org/abs/2412.15605
