
Building LLM Applications with Rust: When Speed and Parallelism Matter

Hey there! So you're thinking about using Rust for your next LLM project? Smart move! While Python and TypeScript dominate the AI space, Rust offers some serious advantages, especially for CPU-intensive operations. Let's dive into why Rust might be your new best friend for LLM applications.

Why Rust for LLM Applications?

Python's great for prototyping and has amazing libraries like Hugging Face's Transformers, but when performance matters, Rust shines. Here's why:

Performance Without Compromise

Rust gives you C-level performance without the headaches of manual memory management. For LLM applications that crunch through massive vector operations, this is huge.

// Example: Fast vector operations in Rust
fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    let dot_product: f32 = a.iter().zip(b.iter()).map(|(x, y)| x * y).sum();
    let norm_a: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    
    dot_product / (norm_a * norm_b)
}

Fearless Concurrency

LLM workloads love CPU cores, and Rust gives you true multi-threaded parallelism without fighting Python's GIL or squeezing CPU-bound work through JavaScript's single-threaded event loop.

use rayon::prelude::*;

fn batch_process_embeddings(texts: Vec<String>, model: &Model) -> Vec<Vec<f32>> {
    texts.par_iter()
        .map(|text| model.embed(text))
        .collect()
}

Memory Safety Without Garbage Collection

No random pauses from garbage collection like in Python or JavaScript. This means more consistent latency for your API endpoints.

Rust's Memory Management: A Game-Changer for LLM Applications

If you're building LLM applications that need to process gigabytes of text or manipulate huge embedding vectors, memory management becomes critical. This is where Rust's ownership model truly shines.

How Rust's Ownership System Works

At its core, Rust's memory management revolves around three key principles:

  1. Ownership: Every value has exactly one owner
  2. Borrowing: References to values can be borrowed, either mutably (one exclusive reference) or immutably (multiple shared references)
  3. Lifetimes: The compiler tracks how long references are valid

Let's break this down with examples relevant to LLM processing:

fn tokenize_and_process(text: String) -> Vec<f32> {
    // 'text' is owned by this function now
    let tokens = tokenize(text); // 'text' is moved to tokenize function
    // 'text' is no longer accessible here!
    
    // process tokens and return result
    process(tokens)
}

fn tokenize_and_process_borrowed(text: &str) -> Vec<f32> {
    // We're just borrowing 'text' (a string slice), not taking ownership
    let tokens = tokenize_borrowed(text); // We pass a reference
    // 'text' is still valid here!
    
    process(tokens)
}

Why This Matters for LLM Applications

In LLM applications, you're often dealing with:

  1. Large Datasets: Processing gigabytes of text
  2. Shared Models: Multiple requests using the same loaded model
  3. Parallel Processing: Distributing work across cores

Let's look at how Rust's ownership model helps with each:

Large Datasets: Zero-Copy Parsing

use rayon::prelude::*; // for par_bridge()

fn process_large_file(filename: &str) -> Result<Vec<Embedding>, Error> {
    let file_content = std::fs::read_to_string(filename)?;
    
    // Create a view into the string without copying it
    let lines = file_content.lines();
    
    // Process each line in parallel
    let embeddings = lines.par_bridge()
        .map(|line| embed_text(line))
        .collect();
        
    Ok(embeddings)
}

With Rust, you can process slices of your data without unnecessary copying. The compiler ensures no one modifies the data while you're reading from it.
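
To make that concrete, here's a minimal, self-contained sketch (chunk_by_paragraph is just a helper I'm inventing for illustration): every &str it returns is a view into the original buffer, so splitting a huge document into chunks allocates nothing beyond the Vec of pointers.

// Every &str returned here is just a (pointer, length) view into 'document';
// the text itself is never copied.
fn chunk_by_paragraph(document: &str) -> Vec<&str> {
    document
        .split("\n\n")                    // split on blank lines
        .map(str::trim)                   // trimming also returns a sub-slice, not a copy
        .filter(|chunk| !chunk.is_empty())
        .collect()
}

fn main() {
    let document = String::from("First paragraph.\n\nSecond paragraph.\n\n\nThird one.");
    let chunks = chunk_by_paragraph(&document);

    // The compiler would reject any attempt to mutate or drop 'document'
    // while 'chunks' still borrows from it.
    for (i, chunk) in chunks.iter().enumerate() {
        println!("chunk {}: {:?}", i, chunk);
    }
}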

Shared Models: Safe Concurrent Access

use rayon::prelude::*;
use std::sync::Arc;

struct LlmService {
    model: Arc<Model>, // Atomic Reference Counted pointer
}

impl LlmService {
    fn process_batch(&self, inputs: Vec<String>) -> Vec<Output> {
        // Multiple threads can safely access the model concurrently
        // because Arc<Model> is immutable and thread-safe
        inputs.par_iter()
            .map(|input| self.model.process(input))
            .collect()
    }
}

By using Arc<T> (Atomic Reference Counting), Rust allows safe sharing of your loaded model across threads. The compiler guarantees there are no data races.
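
If that feels abstract, here's a tiny, self-contained sketch with a made-up FakeModel standing in for real weights: cloning the Arc copies a pointer and bumps an atomic counter, while the weights stay in one place and every thread gets read-only access.

use std::sync::Arc;
use std::thread;

// Stand-in for a loaded model; imagine gigabytes of weights here.
struct FakeModel {
    weights: Vec<f32>,
}

impl FakeModel {
    fn score(&self, input: &str) -> f32 {
        // Read-only access: any number of threads can call this at once.
        input.len() as f32 * self.weights[0]
    }
}

fn main() {
    let model = Arc::new(FakeModel { weights: vec![0.5; 1_000] });

    let handles: Vec<_> = (0..4)
        .map(|i| {
            // Cheap: this clones the pointer, not the weights.
            let model = Arc::clone(&model);
            thread::spawn(move || {
                println!("thread {} -> {}", i, model.score("some prompt"));
            })
        })
        .collect();

    for handle in handles {
        handle.join().unwrap();
    }
    // Asking for '&mut' access through the Arc wouldn't compile; you'd be
    // pushed toward Arc<Mutex<FakeModel>> and explicit locking instead.
}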

Parallel Processing: No Data Races

use rayon::prelude::*;

fn process_embeddings(mut embeddings: Vec<Vec<f32>>) {
    // Split the mutable slice into non-overlapping chunks
    // (keep at least one element per chunk so par_chunks_mut never gets a zero size)
    let chunk_size = (embeddings.len() / num_cpus::get()).max(1);

    embeddings.par_chunks_mut(chunk_size)
        .for_each(|chunk| {
            // Each thread gets exclusive access to its own chunk
            // No possibility of data races!
            for embedding in chunk {
                normalize_in_place(embedding);
            }
        });
}

Rust's borrow checker ensures that each thread has exclusive access to its chunk of data, eliminating data races by design, not by runtime checks.
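
The same guarantee holds with nothing but the standard library. Here's a minimal sketch (my normalize_in_place is just one possible implementation of the helper used above): split_at_mut hands out two non-overlapping mutable halves, and scoped threads normalize them in parallel; overlapping slices simply wouldn't compile.

use std::thread;

fn normalize_in_place(embedding: &mut [f32]) {
    let norm: f32 = embedding.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm > 0.0 {
        for x in embedding.iter_mut() {
            *x /= norm;
        }
    }
}

fn main() {
    let mut embeddings = vec![vec![3.0_f32, 4.0], vec![1.0, 0.0], vec![0.0, 2.0], vec![6.0, 8.0]];

    // Two disjoint mutable halves of the same Vec.
    let mid = embeddings.len() / 2;
    let (left, right) = embeddings.split_at_mut(mid);

    // Scoped threads (std since Rust 1.63) may borrow local data;
    // each thread gets exclusive access to its own half.
    thread::scope(|s| {
        s.spawn(|| left.iter_mut().for_each(|e| normalize_in_place(e)));
        s.spawn(|| right.iter_mut().for_each(|e| normalize_in_place(e)));
    });

    println!("{:?}", embeddings);
}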

No Garbage Collection = Predictable Performance

For ML serving, consistent latency is crucial. Python and JavaScript use garbage collection, which can cause unpredictable pauses. Rust's deterministic memory management means:

  1. No GC pauses during critical inference
  2. Predictable memory usage patterns
  3. Lower overall memory footprint

Let's look at a comparison of memory use patterns:

// In Rust - memory is allocated and freed at predictable points
fn process_batch(texts: Vec<String>) -> Vec<Vec<f32>> {
    // Memory for embeddings is allocated here
    let mut embeddings = Vec::with_capacity(texts.len());
    
    for text in texts {
        // Each iteration allocates exactly what it needs
        let embedding = generate_embedding(&text);
        embeddings.push(embedding);
        // 'text' is automatically freed here when it goes out of scope
    }
    
    // Return embeddings - ownership is transferred to caller
    embeddings
    // No garbage collection needed - memory is freed deterministically
}

In Python, the equivalent function would rely on the garbage collector to eventually clean up intermediate allocations, potentially causing latency spikes during high-load periods.

Zero-Cost Abstractions

Another key advantage is Rust's zero-cost abstractions. You can write high-level, expressive code without paying a runtime performance penalty:

// This high-level code...
fn find_similar_embeddings(query: &[f32], embeddings: &[Vec<f32>]) -> Vec<usize> {
    embeddings.iter()
        .enumerate()
        .map(|(idx, emb)| (idx, cosine_similarity(query, emb)))
        .filter(|(_, score)| *score > 0.8)
        .map(|(idx, _)| idx)
        .collect()
}

// ...compiles down to essentially the same machine code as this low-level version:
fn find_similar_embeddings_manual(query: &[f32], embeddings: &[Vec<f32>]) -> Vec<usize> {
    let mut results = Vec::new();
    for i in 0..embeddings.len() {
        let similarity = cosine_similarity(query, &embeddings[i]);
        if similarity > 0.8 {
            results.push(i);
        }
    }
    results
}

This means you can write clean, maintainable code without sacrificing performance.

Real-World Examples of Rust-Powered LLM Applications

1. Embeddings Generator Service

Let's say you need to generate embeddings for millions of documents:

use std::sync::Arc;
use tokio::sync::Semaphore;
use rust_bert::pipelines::sentence_embeddings::{SentenceEmbeddingsBuilder, SentenceEmbeddingsModel, SentenceEmbeddingsModelType};

#[derive(Clone)]
struct EmbeddingService {
    model: Arc<SentenceEmbeddingsModel>,
    semaphore: Arc<Semaphore>,
}

impl EmbeddingService {
    pub fn new() -> Self {
        // Load the model once
        let model = SentenceEmbeddingsBuilder::remote(
            SentenceEmbeddingsModelType::AllMiniLmL6V2
        ).create_model().unwrap();
        
        // Limit concurrent requests based on available cores
        let num_cores = num_cpus::get();
        let semaphore = Arc::new(Semaphore::new(num_cores));
        
        Self {
            model: Arc::new(model),
            semaphore,
        }
    }
    
    pub async fn embed_batch(&self, texts: Vec<String>) -> Vec<Vec<f32>> {
        // Acquire semaphore permit to control concurrency
        let permit = self.semaphore.acquire().await.unwrap();
        
        // This is CPU intensive work, so we'll spawn it to a dedicated thread
        let model = self.model.clone();
        let embeddings = tokio::task::spawn_blocking(move || {
            model.encode(&texts).unwrap()
        }).await.unwrap();
        
        // Explicitly drop the permit to release the concurrency slot
        drop(permit);
        
        embeddings
    }
}

#[tokio::main]
async fn main() {
    let service = EmbeddingService::new();
    
    // Example batch processing
    let texts = vec![
        "Hello, world!".to_string(),
        "This is an example sentence".to_string(),
        "Rust is amazing for ML applications".to_string(),
    ];
    
    let embeddings = service.embed_batch(texts).await;
    println!("Generated {} embeddings", embeddings.len());
}

This service pushes the CPU-heavy encode call onto Tokio's dedicated blocking thread pool via spawn_blocking, keeping the async worker threads free so your API stays responsive during heavy computation.
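
To show where this fits in a real service, here's a hedged sketch of an HTTP entry point you could swap in for the main above, assuming axum 0.7 (the /embed route, the port, and the lack of error handling are all just choices for the example). The handler stays lightweight because embed_batch already offloads the heavy work:

use axum::{extract::State, routing::post, Json, Router};

// Assumes the EmbeddingService defined above; it derives Clone, and its
// Arc fields make cloning per-request cheap.
async fn embed_handler(
    State(service): State<EmbeddingService>,
    Json(texts): Json<Vec<String>>,
) -> Json<Vec<Vec<f32>>> {
    Json(service.embed_batch(texts).await)
}

#[tokio::main]
async fn main() {
    let app = Router::new()
        .route("/embed", post(embed_handler))
        .with_state(EmbeddingService::new());

    let listener = tokio::net::TcpListener::bind("0.0.0.0:3000").await.unwrap();
    axum::serve(listener, app).await.unwrap();
}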

2. Parallel Text Processing Pipeline

Text processing before feeding into an LLM can be CPU-intensive. Here's how you might parallelize it:

use rayon::prelude::*;
use regex::Regex;
use std::collections::HashSet;

struct TextProcessor {
    stopwords: HashSet<String>,
    url_regex: Regex,
    email_regex: Regex,
}

impl TextProcessor {
    pub fn new() -> Self {
        let stopwords = vec!["the", "and", "or", "a", "an", "in", "to", "for"]
            .into_iter()
            .map(String::from)
            .collect();
            
        Self {
            stopwords,
            url_regex: Regex::new(r"https?://\S+").unwrap(),
            email_regex: Regex::new(r"\S+@\S+\.\S+").unwrap(),
        }
    }
    
    pub fn clean_text(&self, text: &str) -> String {
        // Remove URLs and emails
        let text = self.url_regex.replace_all(text, "[URL]");
        let text = self.email_regex.replace_all(&text, "[EMAIL]");
        
        // Convert to lowercase
        let text = text.to_lowercase();
        
        // Filter out stopwords and tokenize
        text.split_whitespace()
            .filter(|word| !self.stopwords.contains(*word))
            .collect::<Vec<_>>()
            .join(" ")
    }
    
    pub fn process_batch(&self, texts: Vec<String>) -> Vec<String> {
        texts.par_iter()
            .map(|text| self.clean_text(text))
            .collect()
    }
}

fn main() {
    let processor = TextProcessor::new();
    
    let texts = vec![
        "Check out https://example.com for more info!".to_string(),
        "Contact me at user@example.com if you have questions.".to_string(),
        "The quick brown fox jumps over the lazy dog.".to_string(),
    ];
    
    let processed = processor.process_batch(texts);
    
    for text in processed {
        println!("{}", text);
    }
}

The par_iter() from Rayon makes this embarrassingly parallel with almost no extra code!
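
One usage note: Rayon lazily builds a global pool sized to your logical core count, which can compete with Tokio's threads in a mixed async service. If you want explicit control, here's a small sketch (init_rayon_pool is just my own helper name):

use rayon::ThreadPoolBuilder;

// Call this once, early in main, before the first par_iter() call
// initializes the global pool with its defaults.
fn init_rayon_pool(num_threads: usize) {
    ThreadPoolBuilder::new()
        .num_threads(num_threads) // e.g. leave a couple of cores free for an async runtime
        .build_global()
        .expect("the global Rayon pool can only be initialized once");
}

Calling init_rayon_pool(6) at the top of the main above would cap process_batch at six worker threads.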

3. Vector Database with Rust

Building a simple vector database for semantic search:

use std::collections::HashMap;
use serde::{Serialize, Deserialize};
use rayon::prelude::*;

#[derive(Clone, Serialize, Deserialize)]
struct Document {
    id: String,
    text: String,
    embedding: Vec<f32>,
}

struct VectorDB {
    documents: Vec<Document>,
}

impl VectorDB {
    pub fn new() -> Self {
        Self { documents: Vec::new() }
    }
    
    pub fn add_document(&mut self, doc: Document) {
        self.documents.push(doc);
    }
    
    pub fn add_documents(&mut self, docs: Vec<Document>) {
        self.documents.extend(docs);
    }
    
    pub fn search(&self, query_embedding: &[f32], top_k: usize) -> Vec<(&Document, f32)> {
        // Compute similarities in parallel
        let mut results: Vec<(&Document, f32)> = self.documents.par_iter()
            .map(|doc| {
                let similarity = cosine_similarity(&doc.embedding, query_embedding);
                (doc, similarity)
            })
            .collect();
        
        // Sort by similarity score (descending)
        results.sort_unstable_by(|(_, score1), (_, score2)| {
            score2.partial_cmp(score1).unwrap_or(std::cmp::Ordering::Equal)
        });
        
        // Return top-k results
        results.truncate(top_k);
        results
    }
}

fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    let dot_product: f32 = a.iter().zip(b.iter()).map(|(x, y)| x * y).sum();
    let norm_a: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    
    dot_product / (norm_a * norm_b)
}

fn main() {
    let mut db = VectorDB::new();
    
    // Add some example documents
    let docs = vec![
        Document {
            id: "1".to_string(),
            text: "Rust is a systems programming language".to_string(),
            embedding: vec![0.1, 0.2, 0.3, 0.4],
        },
        Document {
            id: "2".to_string(),
            text: "Python is great for data science".to_string(),
            embedding: vec![0.2, 0.3, 0.4, 0.5],
        },
        Document {
            id: "3".to_string(),
            text: "Large language models are transforming AI".to_string(),
            embedding: vec![0.3, 0.4, 0.5, 0.6],
        },
    ];
    
    db.add_documents(docs);
    
    // Example query
    let query_embedding = vec![0.2, 0.3, 0.4, 0.5];
    let results = db.search(&query_embedding, 2);
    
    for (doc, score) in results {
        println!("ID: {}, Score: {:.4}, Text: {}", doc.id, score, doc.text);
    }
}
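
Since Document already derives Serialize and Deserialize, persisting the index is only a few lines away. Here's a hedged sketch using serde_json (the save/load names and the JSON format are my own choices; a production system would likely want a binary format and an approximate nearest-neighbor index instead of this brute-force scan):

use std::fs::File;
use std::io::{BufReader, BufWriter};

impl VectorDB {
    // Write the whole index to disk as JSON.
    pub fn save(&self, path: &str) -> Result<(), Box<dyn std::error::Error>> {
        let file = File::create(path)?;
        serde_json::to_writer(BufWriter::new(file), &self.documents)?;
        Ok(())
    }

    // Load a previously saved index back into memory.
    pub fn load(path: &str) -> Result<Self, Box<dyn std::error::Error>> {
        let file = File::open(path)?;
        let documents: Vec<Document> = serde_json::from_reader(BufReader::new(file))?;
        Ok(Self { documents })
    }
}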

Memory Management in Action: A Concrete Example

Let's build a simple but realistic example of memory management for an LLM inference server. This demonstrates how Rust's ownership model helps when dealing with large volumes of requests:

use std::sync::Arc;
use tokio::sync::{mpsc, Semaphore};

// A heavyweight model that we want to load once and share
struct LargeLanguageModel {
    // Imagine this contains gigabytes of parameters
    weights: Vec<f32>,
}

impl LargeLanguageModel {
    fn new() -> Self {
        println!("Loading model weights (this would be slow in real life)...");
        // In reality, this would load gigabytes of data
        Self { weights: vec![0.1; 1_000_000] }
    }
    
    fn generate(&self, prompt: &str) -> String {
        // Simulate CPU-intensive work
        std::thread::sleep(std::time::Duration::from_millis(100));
        format!("Response to: {}", prompt)
    }
}

// Our inference server
struct InferenceServer {
    model: Arc<LargeLanguageModel>,
    // Limit concurrent requests to prevent OOM
    semaphore: Arc<Semaphore>,
}

impl InferenceServer {
    fn new() -> Self {
        // Load model once, share it across all request handlers
        let model = Arc::new(LargeLanguageModel::new());
        let semaphore = Arc::new(Semaphore::new(4)); // Allow 4 concurrent requests
        
        Self { model, semaphore }
    }
    
    async fn handle_request(&self, prompt: String) -> String {
        // Get permission to process (or wait if at capacity)
        let permit = self.semaphore.acquire().await.unwrap();
        
        // Clone Arc to share ownership with the new thread
        let model = self.model.clone();
        
        // Offload CPU-intensive work to a dedicated thread
        let result = tokio::task::spawn_blocking(move || {
            model.generate(&prompt)
        }).await.unwrap();
        
        // Permit is implicitly dropped here, releasing a slot
        
        result
    }
}

#[tokio::main]
async fn main() {
    // Create our server
    let server = Arc::new(InferenceServer::new());
    
    // Create a channel for incoming requests
    let (tx, mut rx) = mpsc::channel(100);
    
    // Spawn some tasks to simulate client requests
    for i in 0..10 {
        let tx = tx.clone();
        tokio::spawn(async move {
            let prompt = format!("Request {}", i);
            tx.send(prompt).await.unwrap();
        });
    }
    
    // Process incoming requests
    let server_ref = server.clone();
    tokio::spawn(async move {
        while let Some(prompt) = rx.recv().await {
            let server = server_ref.clone();
            tokio::spawn(async move {
                let response = server.handle_request(prompt.clone()).await;
                println!("Prompt: {}, Response: {}", prompt, response);
            });
        }
    });
    
    // Wait a bit to let everything finish
    tokio::time::sleep(tokio::time::Duration::from_secs(2)).await;
}

Let's break down the memory management here:

  1. The model is loaded once and wrapped in an Arc (Atomic Reference Counter), allowing safe shared access across threads
  2. Each request handler clones only the Arc pointer (a cheap refcount bump), never the model weights themselves
  3. Resources are released automatically when they go out of scope
  4. Concurrency is controlled with a semaphore to prevent memory overuse

The ownership rules behind all of this are enforced at compile time by Rust's borrow checker; the only runtime costs are an atomic reference count and a semaphore, both negligible next to model inference.

Getting Started with Rust for LLM Apps

If you're coming from Python or TypeScript, here's a quick guide to get you started:

  1. Install Rust

    curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
    
  2. Essential Crates for LLM Work

    • rust-bert: Rust implementation of popular language models
    • tokenizers: Fast tokenization (from Hugging Face)
    • ndarray: NumPy-like operations
    • rayon: Easy parallelism
    • tokio: Async runtime
  3. Mixing with Other Languages

    • Use PyO3 to create Python bindings for your Rust code (a small sketch follows this list)
    • Use WebAssembly to integrate with TypeScript/JavaScript
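
As a taste of the PyO3 route, here's a hedged sketch that exposes the cosine_similarity function from earlier to Python (the llm_utils module name is arbitrary, and the exact macro signatures shift a little between PyO3 versions, so check the docs for the version you pin):

use pyo3::prelude::*;

/// Cosine similarity between two embedding vectors, callable from Python.
#[pyfunction]
fn cosine_similarity(a: Vec<f32>, b: Vec<f32>) -> f32 {
    let dot: f32 = a.iter().zip(b.iter()).map(|(x, y)| x * y).sum();
    let norm_a: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    dot / (norm_a * norm_b)
}

#[pymodule]
fn llm_utils(m: &Bound<'_, PyModule>) -> PyResult<()> {
    m.add_function(wrap_pyfunction!(cosine_similarity, m)?)?;
    Ok(())
}

Build it with maturin, then import llm_utils from Python and call it like any other function.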

When to Use Rust vs. Python/TypeScript

Rust isn't always the right choice. Here's my take:

  • Use Rust when:

    • You need maximum performance
    • Your app is CPU-bound
    • You're building something that needs to scale
    • Consistent latency matters
    • You can afford longer dev time
  • Stick with Python/TypeScript when:

    • You need rapid prototyping
    • The ecosystem has what you need
    • Performance isn't your bottleneck
    • You have team experience in those languages

Conclusion

Rust's ownership system is a game-changer for LLM applications where performance and memory efficiency matter. By catching memory issues at compile time and enabling safe concurrency without runtime overhead, Rust lets you push the limits of what your hardware can handle.

The mental model shift from garbage collection to ownership might take some time to adapt to, but the benefits for CPU-intensive LLM applications are massive: predictable performance, lower memory usage, and safe parallelism out of the box.

Have you tried using Rust with LLMs? I'd love to hear about your experiences in the comments below! Next week, I'll dig into how to use Rust to optimize token processing for custom LLM fine-tuning.

Happy coding! 🦀