September 1, 2024

5 mins read

Understanding RAG System Latency: Breaking Down the Components

Ganesh Voona

Co-Founder

Key Factors Impacting Speed in Retrieval-Augmented Generation Systems

At stockinsights.ai, we’re at the forefront of developing advanced Retrieval-Augmented Generation (RAG) systems. These systems combine powerful search techniques with large language models (LLMs) to deliver accurate and insightful responses. But what really affects the speed of these systems? Let’s break it down simply.

What is a RAG System?

A RAG system performs two main tasks:
A. Retrieves Information: It searches a database for relevant data.
B. Generates Responses: It uses an LLM to turn that data into a natural language answer.
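
In code, those two steps look roughly like the sketch below. This is a minimal illustration rather than our production pipeline: the model names are assumptions, and vector_search is a placeholder for the database query described in the next section.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def embed(text: str) -> list[float]:
    # Turn the question into an embedding vector (embedding model name is an assumption).
    response = client.embeddings.create(model="text-embedding-3-small", input=text)
    return response.data[0].embedding

def vector_search(query_vector: list[float], top_k: int = 5) -> list[dict]:
    # Placeholder for the vector-database query; the MongoDB sketch later in
    # this post shows what this step can look like.
    raise NotImplementedError

def answer(question: str) -> str:
    # A. Retrieve: find the documents most relevant to the question.
    documents = vector_search(embed(question))

    # B. Generate: hand the retrieved context to the LLM for a natural-language answer.
    context = "\n\n".join(doc["text"] for doc in documents)
    completion = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model name, not necessarily what we run
        messages=[
            {"role": "system", "content": "Answer using only the provided context."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return completion.choices[0].message.content
```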

Retrieval Latency: The Vector Database Search

The first step is retrieving information. At stockinsights.ai, we store public company filings as embeddings in vector databases and use vector search algorithms to find relevant documents. Here’s what affects this retrieval:

A. Larger Clusters with More Nodes

1. Larger Clusters: In a vector database, a cluster consists of multiple nodes (servers or instances) that store and process data. More nodes mean the workload is distributed, improving query, indexing, and retrieval speeds.

2. Speed Improvement: More nodes can handle search requests simultaneously, reducing overall response time.

B. Index Size vs. Memory Capacity

1. Index: A vector database index stores vector embeddings and enables fast similarity searches. As data grows, the index size increases.

2. Memory Limitations: Each node has limited RAM. If the index exceeds available memory, the system relies on slower disk storage.
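
A quick back-of-the-envelope estimate shows why this matters. The figures below are illustrative assumptions, not our actual corpus, but the arithmetic is the same: vector count × dimensions × bytes per float, plus index overhead.

```python
# Rough estimate of a vector index's memory footprint.
# All figures are illustrative assumptions for the example.
num_vectors = 5_000_000   # embeddings stored in the index
dimensions = 1536         # e.g. a 1536-dimensional embedding model
bytes_per_float = 4       # float32
index_overhead = 1.5      # assumed multiplier for the index structure itself

index_bytes = num_vectors * dimensions * bytes_per_float * index_overhead
print(f"Estimated index size: {index_bytes / 1024**3:.1f} GiB")
# ~43 GiB here: larger than the RAM of a typical small search node,
# so queries would spill to disk and trigger page faults.
```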

C. Impact on Performance

1. Slower Performance: When the index exceeds memory capacity, the system must fetch data from disk, introducing latency. Each access to data that isn’t in RAM triggers a “page fault”, and reading from disk is orders of magnitude slower than reading from memory.

2. Trade-offs: Larger clusters can enhance performance, but if the index size surpasses memory, the benefits may diminish.

At stockinsights.ai, we’ve optimized our setup to keep retrieval times under 10 seconds using MongoDB’s dedicated search nodes. For more tips on optimizing vector search, check out MongoDB’s Search Nodes Blog.
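
For readers on MongoDB Atlas, a retrieval query against a vector search index looks roughly like this. The connection string, database, collection, field, and index names below are placeholders; only the $vectorSearch aggregation stage itself is the Atlas API.

```python
from pymongo import MongoClient

# Placeholder connection string, database, collection, and index names.
client = MongoClient("mongodb+srv://<user>:<password>@<cluster>/")
collection = client["filings_db"]["filings"]

def vector_search(query_vector: list[float], top_k: int = 5) -> list[dict]:
    # Atlas Vector Search aggregation stage; numCandidates controls the
    # recall/latency trade-off (more candidates = better recall, more work).
    pipeline = [
        {
            "$vectorSearch": {
                "index": "filings_vector_index",   # assumed index name
                "path": "embedding",               # field holding the vectors
                "queryVector": query_vector,
                "numCandidates": 200,
                "limit": top_k,
            }
        },
        {"$project": {"text": 1, "score": {"$meta": "vectorSearchScore"}}},
    ]
    return list(collection.aggregate(pipeline))
```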

Response Latency: The LLM Speed

Once the data is retrieved, the LLM generates a response. This part of the process involves several factors that impact latency:

A. LLM Model: Larger models are more capable but slower; they need more time to process the input and to generate each token of the response.

B. Overall Architecture: Often, multiple LLM calls are needed to produce the final answer.

C. Inference Speed: Faster token processing is crucial. Reducing tokens or requests can help.

To optimize LLM speed:

1. Generate Fewer Tokens: Reducing token count can cut latency. Request concise responses to limit output length.

2. Use Fewer Input Tokens: Cutting input tokens helps, but the impact may be minimal. Use techniques like fine-tuning or context filtering to manage large inputs.

3. Make Fewer Requests: Combine multiple steps into a single prompt to avoid round-trip latency. Structure prompts to gather multiple results at once.

4. Parallelize Tasks: Execute independent tasks in parallel to save time. For sequential tasks, consider speculative execution. (A parallelization sketch follows after this list.)

5. Make Your Users Wait Less: Use streaming, chunking, and progress indicators to reduce perceived wait time and keep users informed. (A streaming sketch also follows below.)

6. Don’t Default to an LLM: For some tasks, traditional methods or pre-computed responses may be faster. Utilize caching and UI components when appropriate.
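
To illustrate item 4, here is a minimal sketch of running two independent LLM calls concurrently with the async OpenAI client. The model name and prompts are placeholders; the point is that total latency becomes roughly the slower of the two calls rather than their sum.

```python
import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI()  # assumes OPENAI_API_KEY is set in the environment

async def ask(prompt: str) -> str:
    completion = await client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model name
        messages=[{"role": "user", "content": prompt}],
    )
    return completion.choices[0].message.content

async def main() -> None:
    # Two independent sub-tasks: issue them concurrently instead of one after
    # the other, so the wall-clock cost is roughly max(call1, call2).
    summary, risks = await asyncio.gather(
        ask("Summarize the revenue discussion in the retrieved filing."),
        ask("List the risk factors mentioned in the retrieved filing."),
    )
    print(summary, risks, sep="\n\n")

asyncio.run(main())
```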
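And for item 5, streaming starts showing tokens as soon as they are generated, which shrinks perceived latency even when total generation time is unchanged. A sketch with the OpenAI client (again, the model and prompt are placeholders):

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# stream=True yields chunks as tokens are generated instead of waiting for the
# full completion, so the first words reach the user almost immediately.
stream = client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative model name
    messages=[{"role": "user", "content": "Summarize the retrieved filing in three sentences."}],
    stream=True,
)

for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()
```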

At stockinsights.ai, we use OpenAI’s LLM. For further guidance on optimizing latency, check out OpenAI’s Latency Optimization Guide.