Building LLM Applications with Rust: When Speed and Parallelism Matter
Hey there! So you're thinking about using Rust for your next LLM project? Smart move! While Python and TypeScript dominate the AI space, Rust offers some serious advantages, especially for CPU-intensive operations. Let's dive into why Rust might be your new best friend for LLM applications.
Why Rust for LLM Applications?
Python's great for prototyping and has amazing libraries like Hugging Face's Transformers, but when performance matters, Rust shines. Here's why:
Performance Without Compromise
Rust gives you C-level performance without the headaches of manual memory management. For LLM applications that crunch through massive vector operations, this is huge.
// Example: Fast vector operations in Rust
fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
let dot_product: f32 = a.iter().zip(b.iter()).map(|(x, y)| x * y).sum();
let norm_a: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
let norm_b: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
dot_product / (norm_a * norm_b)
}
Fearless Concurrency
LLM workloads love CPU cores, and Rust makes parallelism far safer and simpler than working around Python's GIL or juggling workers on TypeScript's single-threaded event loop.
use rayon::prelude::*;
fn batch_process_embeddings(texts: Vec<String>, model: &Model) -> Vec<Vec<f32>> {
// Rayon distributes the embedding calls across a thread pool
texts.par_iter()
.map(|text| model.embed(text))
.collect()
}
Memory Safety Without Garbage Collection
No random pauses from garbage collection like in Python or JavaScript. This means more consistent latency for your API endpoints.
Rust's Memory Management: A Game-Changer for LLM Applications
If you're building LLM applications that need to process gigabytes of text or manipulate huge embedding vectors, memory management becomes critical. This is where Rust's ownership model truly shines.
How Rust's Ownership System Works
At its core, Rust's memory management revolves around three key principles:
- Ownership: Every value has exactly one owner
- Borrowing: References to values can be borrowed, either mutably (one exclusive reference) or immutably (multiple shared references)
- Lifetimes: The compiler tracks how long references are valid
Let's break this down with examples relevant to LLM processing:
fn tokenize_and_process(text: String) -> Vec<f32> {
// 'text' is owned by this function now
let tokens = tokenize(text); // 'text' is moved to tokenize function
// 'text' is no longer accessible here!
// process tokens and return result
process(tokens)
}
fn tokenize_and_process_borrowed(text: &str) -> Vec<f32> {
// We're just borrowing 'text', not taking ownership
let tokens = tokenize_borrowed(text); // We pass a reference
// 'text' is still valid here!
process(tokens)
}
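Borrowing also comes in a mutable flavor: exactly one &mut reference may exist at a time, which is what lets you safely update an embedding in place. Here's a minimal, self-contained sketch (normalize_in_place is a hypothetical helper, similar to the one referenced later in this post):
fn normalize_in_place(embedding: &mut Vec<f32>) {
    // Exclusive (&mut) access: no other reference can observe the half-updated vector
    let norm: f32 = embedding.iter().map(|x| x * x).sum::<f32>().sqrt();
    for x in embedding.iter_mut() {
        *x /= norm;
    }
}

fn main() {
    let mut embedding = vec![3.0_f32, 4.0];
    normalize_in_place(&mut embedding); // one exclusive mutable borrow
    let a = &embedding; // multiple shared (read-only) borrows are fine...
    let b = &embedding;
    println!("{:?} {:?}", a, b);
    // ...but taking `&mut embedding` here, while `a` and `b` are live, would not compile
}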
Why This Matters for LLM Applications
In LLM applications, you're often dealing with:
- Large Datasets: Processing gigabytes of text
- Shared Models: Multiple requests using the same loaded model
- Parallel Processing: Distributing work across cores
Let's look at how Rust's ownership model helps with each:
Large Datasets: Zero-Copy Parsing
use rayon::prelude::*; // brings par_bridge() into scope

fn process_large_file(filename: &str) -> Result<Vec<Embedding>, Error> {
let file_content = std::fs::read_to_string(filename)?;
// Create a view into the string without copying it
let lines = file_content.lines();
// Process each line in parallel
let embeddings = lines.par_bridge()
.map(|line| embed_text(line))
.collect();
Ok(embeddings)
}
With Rust, you can process slices of your data without unnecessary copying. The compiler ensures no one modifies the data while you're reading from it.
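To make that concrete, here's a tiny sketch of zero-copy slicing (split_into_windows is a hypothetical helper): every window below is just a &str view into the original buffer, and the compiler refuses to let you mutate or drop that buffer while the views are alive.
// Group whitespace-separated tokens into windows of `size` without copying any text:
// each &str in the result borrows directly from `doc`.
fn split_into_windows(doc: &str, size: usize) -> Vec<Vec<&str>> {
    let words: Vec<&str> = doc.split_whitespace().collect();
    words.chunks(size).map(|w| w.to_vec()).collect()
}

fn main() {
    let doc = String::from("large language models need fast preprocessing pipelines");
    let windows = split_into_windows(&doc, 3);
    // `windows` borrows from `doc`; mutating `doc` at this point would be a compile error
    println!("{:?}", windows);
}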
Shared Models: Safe Concurrent Access
use std::sync::Arc;
use rayon::prelude::*;

struct LlmService {
model: Arc<Model>, // Atomic Reference Counted pointer
}
impl LlmService {
fn process_batch(&self, inputs: Vec<String>) -> Vec<Output> {
// Multiple threads can safely access the model concurrently
// because Arc<Model> is immutable and thread-safe
inputs.par_iter()
.map(|input| self.model.process(input))
.collect()
}
}
By using Arc<T> (atomic reference counting), Rust allows safe sharing of your loaded model across threads. The compiler guarantees there are no data races.
Parallel Processing: No Data Races
use rayon::prelude::*;

fn process_embeddings(mut embeddings: Vec<Vec<f32>>) {
// Split the mutable vector into non-overlapping chunks (at least one element each)
let chunk_size = (embeddings.len() / num_cpus::get()).max(1);
embeddings.par_chunks_mut(chunk_size)
.for_each(|chunk| {
// Each thread gets exclusive access to its own chunk
// No possibility of data races!
for embedding in chunk {
normalize_in_place(embedding);
}
});
}
Rust's borrow checker ensures that each thread has exclusive access to its chunk of data, eliminating data races by design, not by runtime checks.
No Garbage Collection = Predictable Performance
For ML serving, consistent latency is crucial. Python and JavaScript use garbage collection, which can cause unpredictable pauses. Rust's deterministic memory management means:
- No GC pauses during critical inference
- Predictable memory usage patterns
- Lower overall memory footprint
Let's look at a comparison of memory use patterns:
// In Rust - memory is allocated and freed at predictable points
fn process_batch(texts: Vec<String>) -> Vec<Vec<f32>> {
// Memory for embeddings is allocated here
let mut embeddings = Vec::with_capacity(texts.len());
for text in texts {
// Each iteration allocates exactly what it needs
let embedding = generate_embedding(&text);
embeddings.push(embedding);
// 'text' is automatically freed here when it goes out of scope
}
// Return embeddings - ownership is transferred to caller
embeddings
// No garbage collection needed - memory is freed deterministically
}
In Python, the equivalent function would rely on the garbage collector to eventually clean up intermediate allocations, potentially causing latency spikes during high-load periods.
Zero-Cost Abstractions
Another key advantage is Rust's zero-cost abstractions. You can write high-level, expressive code without paying a runtime performance penalty:
// This high-level code...
fn find_similar_embeddings(query: &[f32], embeddings: &[Vec<f32>]) -> Vec<usize> {
embeddings.iter()
.enumerate()
.map(|(idx, emb)| (idx, cosine_similarity(query, emb)))
.filter(|(_, score)| *score > 0.8)
.map(|(idx, _)| idx)
.collect()
}
// ...compiles down to essentially the same machine code as this low-level version:
fn find_similar_embeddings_manual(query: &[f32], embeddings: &[Vec<f32>]) -> Vec<usize> {
let mut results = Vec::new();
for i in 0..embeddings.len() {
let similarity = cosine_similarity(query, &embeddings[i]);
if similarity > 0.8 {
results.push(i);
}
}
results
}
This means you can write clean, maintainable code without sacrificing performance.
Real-World Examples of Rust-Powered LLM Applications
1. Embeddings Generator Service
Let's say you need to generate embeddings for millions of documents:
use std::sync::Arc;
use tokio::sync::Semaphore;
use rust_bert::pipelines::sentence_embeddings::{SentenceEmbeddingsBuilder, SentenceEmbeddingsModel, SentenceEmbeddingsModelType};
#[derive(Clone)]
struct EmbeddingService {
model: Arc<SentenceEmbeddingsModel>,
semaphore: Arc<Semaphore>,
}
impl EmbeddingService {
pub fn new() -> Self {
// Load the model once
let model = SentenceEmbeddingsBuilder::remote(
SentenceEmbeddingsModelType::AllMiniLmL6V2
).create_model().unwrap();
// Limit concurrent requests based on available cores
let num_cores = num_cpus::get();
let semaphore = Arc::new(Semaphore::new(num_cores));
Self {
model: Arc::new(model),
semaphore,
}
}
pub async fn embed_batch(&self, texts: Vec<String>) -> Vec<Vec<f32>> {
// Acquire semaphore permit to control concurrency
let permit = self.semaphore.acquire().await.unwrap();
// This is CPU intensive work, so we'll spawn it to a dedicated thread
let model = self.model.clone();
let embeddings = tokio::task::spawn_blocking(move || {
model.encode(&texts).unwrap()
}).await.unwrap();
// Explicitly release the permit so the next batch can start
drop(permit);
embeddings
}
}
#[tokio::main]
async fn main() {
let service = EmbeddingService::new();
// Example batch processing
let texts = vec![
"Hello, world!".to_string(),
"This is an example sentence".to_string(),
"Rust is amazing for ML applications".to_string(),
];
let embeddings = service.embed_batch(texts).await;
println!("Generated {} embeddings", embeddings.len());
}
This service uses Rust's ability to spawn blocking work outside the async runtime, preventing your API from becoming unresponsive during heavy computation.
2. Parallel Text Processing Pipeline
Text processing before feeding into an LLM can be CPU-intensive. Here's how you might parallelize it:
use rayon::prelude::*;
use regex::Regex;
use std::collections::HashSet;
struct TextProcessor {
stopwords: HashSet<String>,
url_regex: Regex,
email_regex: Regex,
}
impl TextProcessor {
pub fn new() -> Self {
let stopwords = vec!["the", "and", "or", "a", "an", "in", "to", "for"]
.into_iter()
.map(String::from)
.collect();
Self {
stopwords,
url_regex: Regex::new(r"https?://\S+").unwrap(),
email_regex: Regex::new(r"\S+@\S+\.\S+").unwrap(),
}
}
pub fn clean_text(&self, text: &str) -> String {
// Remove URLs and emails
let text = self.url_regex.replace_all(text, "[URL]");
let text = self.email_regex.replace_all(&text, "[EMAIL]");
// Convert to lowercase
let text = text.to_lowercase();
// Filter out stopwords and tokenize
text.split_whitespace()
.filter(|word| !self.stopwords.contains(*word))
.collect::<Vec<_>>()
.join(" ")
}
pub fn process_batch(&self, texts: Vec<String>) -> Vec<String> {
texts.par_iter()
.map(|text| self.clean_text(text))
.collect()
}
}
fn main() {
let processor = TextProcessor::new();
let texts = vec![
"Check out https://example.com for more info!".to_string(),
"Contact me at user@example.com if you have questions.".to_string(),
"The quick brown fox jumps over the lazy dog.".to_string(),
];
let processed = processor.process_batch(texts);
for text in processed {
println!("{}", text);
}
}
Rayon's par_iter() makes this embarrassingly parallel with almost no extra code!
3. Vector Database with Rust
Building a simple vector database for semantic search:
use std::collections::HashMap;
use serde::{Serialize, Deserialize};
use rayon::prelude::*;
#[derive(Clone, Serialize, Deserialize)]
struct Document {
id: String,
text: String,
embedding: Vec<f32>,
}
struct VectorDB {
documents: Vec<Document>,
}
impl VectorDB {
pub fn new() -> Self {
Self { documents: Vec::new() }
}
pub fn add_document(&mut self, doc: Document) {
self.documents.push(doc);
}
pub fn add_documents(&mut self, docs: Vec<Document>) {
self.documents.extend(docs);
}
pub fn search(&self, query_embedding: &[f32], top_k: usize) -> Vec<(&Document, f32)> {
// Compute similarities in parallel
let mut results: Vec<(&Document, f32)> = self.documents.par_iter()
.map(|doc| {
let similarity = cosine_similarity(&doc.embedding, query_embedding);
(doc, similarity)
})
.collect();
// Sort by similarity score (descending)
results.sort_unstable_by(|(_, score1), (_, score2)| {
score2.partial_cmp(score1).unwrap_or(std::cmp::Ordering::Equal)
});
// Return top-k results
results.truncate(top_k);
results
}
}
fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
let dot_product: f32 = a.iter().zip(b.iter()).map(|(x, y)| x * y).sum();
let norm_a: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
let norm_b: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
dot_product / (norm_a * norm_b)
}
fn main() {
let mut db = VectorDB::new();
// Add some example documents
let docs = vec![
Document {
id: "1".to_string(),
text: "Rust is a systems programming language".to_string(),
embedding: vec![0.1, 0.2, 0.3, 0.4],
},
Document {
id: "2".to_string(),
text: "Python is great for data science".to_string(),
embedding: vec![0.2, 0.3, 0.4, 0.5],
},
Document {
id: "3".to_string(),
text: "Large language models are transforming AI".to_string(),
embedding: vec![0.3, 0.4, 0.5, 0.6],
},
];
db.add_documents(docs);
// Example query
let query_embedding = vec![0.2, 0.3, 0.4, 0.5];
let results = db.search(&query_embedding, 2);
for (doc, score) in results {
println!("ID: {}, Score: {:.4}, Text: {}", doc.id, score, doc.text);
}
}
Memory Management in Action: A Concrete Example
Let's build a simple but realistic example of memory management for an LLM inference server. This demonstrates how Rust's ownership model helps when dealing with large volumes of requests:
use std::sync::Arc;
use tokio::sync::{mpsc, Semaphore};
// A heavyweight model that we want to load once and share
struct LargeLanguageModel {
// Imagine this contains gigabytes of parameters
weights: Vec<f32>,
}
impl LargeLanguageModel {
fn new() -> Self {
println!("Loading model weights (this would be slow in real life)...");
// In reality, this would load gigabytes of data
Self { weights: vec![0.1; 1_000_000] }
}
fn generate(&self, prompt: &str) -> String {
// Simulate CPU-intensive work
std::thread::sleep(std::time::Duration::from_millis(100));
format!("Response to: {}", prompt)
}
}
// Our inference server
struct InferenceServer {
model: Arc<LargeLanguageModel>,
// Limit concurrent requests to prevent OOM
semaphore: Arc<Semaphore>,
}
impl InferenceServer {
fn new() -> Self {
// Load model once, share it across all request handlers
let model = Arc::new(LargeLanguageModel::new());
let semaphore = Arc::new(Semaphore::new(4)); // Allow 4 concurrent requests
Self { model, semaphore }
}
async fn handle_request(&self, prompt: String) -> String {
// Get permission to process (or wait if at capacity)
let _permit = self.semaphore.acquire().await.unwrap();
// Clone Arc to share ownership with the new thread
let model = self.model.clone();
// Offload CPU-intensive work to a dedicated thread
let result = tokio::task::spawn_blocking(move || {
model.generate(&prompt)
}).await.unwrap();
// Permit is implicitly dropped here, releasing a slot
result
}
}
#[tokio::main]
async fn main() {
// Create our server
let server = Arc::new(InferenceServer::new());
// Create a channel for incoming requests
let (tx, mut rx) = mpsc::channel(100);
// Spawn some tasks to simulate client requests
for i in 0..10 {
let tx = tx.clone();
tokio::spawn(async move {
let prompt = format!("Request {}", i);
tx.send(prompt).await.unwrap();
});
}
// Process incoming requests
let server_ref = server.clone();
tokio::spawn(async move {
while let Some(prompt) = rx.recv().await {
let server = server_ref.clone();
tokio::spawn(async move {
let response = server.handle_request(prompt.clone()).await;
println!("Prompt: {}, Response: {}", prompt, response);
});
}
});
// Wait a bit to let everything finish
tokio::time::sleep(tokio::time::Duration::from_secs(2)).await;
}
Let's break down the memory management here:
- The model is loaded once and wrapped in an Arc (atomic reference counter), allowing safe shared access across threads
- Each request handler shares the model through a cheap Arc clone rather than copying the weights
- Resources are released automatically when they go out of scope
- Concurrency is controlled with a semaphore to prevent memory overuse
All of this is enforced at compile time by Rust's borrow checker, with zero runtime overhead.
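To see that compile-time enforcement in isolation, here's a tiny, self-contained sketch (plain std threads, no async) of the sharing pattern the server relies on; the commented-out line is exactly the kind of mutation the borrow checker refuses to compile:
use std::sync::Arc;
use std::thread;

struct Model {
    weights: Vec<f32>,
}

fn main() {
    let model = Arc::new(Model { weights: vec![0.0; 1024] });

    let handles: Vec<_> = (0..4)
        .map(|i| {
            let model = Arc::clone(&model);
            thread::spawn(move || {
                // Shared, read-only access from any number of threads is fine
                println!("thread {} sees {} weights", i, model.weights.len());
            })
        })
        .collect();

    // model.weights.push(1.0); // would not compile: no mutable access through a shared Arc

    for handle in handles {
        handle.join().unwrap();
    }
}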
Getting Started with Rust for LLM Apps
If you're coming from Python or TypeScript, here's a quick guide to get you started:
Install Rust:
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
Essential Crates for LLM Work:
- rust-bert: Rust implementation of popular language models
- tokenizers: Fast tokenization (from Hugging Face)
- ndarray: NumPy-like operations
- rayon: Easy parallelism
- tokio: Async runtime
Mixing with Other Languages:
- Use PyO3 to create Python bindings for your Rust code (see the sketch below)
- Use WebAssembly to integrate with TypeScript/JavaScript
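For the PyO3 route, here's a minimal sketch of what a binding can look like. The module name llm_utils is made up, the exact #[pymodule] signature varies a bit between PyO3 versions, and you'd typically build and install it into your Python environment with maturin develop:
use pyo3::prelude::*;

/// Cosine similarity between two embedding vectors, callable from Python.
#[pyfunction]
fn cosine_similarity(a: Vec<f32>, b: Vec<f32>) -> f32 {
    let dot: f32 = a.iter().zip(b.iter()).map(|(x, y)| x * y).sum();
    let norm_a: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    dot / (norm_a * norm_b)
}

/// The Python module: `import llm_utils; llm_utils.cosine_similarity([...], [...])`
#[pymodule]
fn llm_utils(_py: Python<'_>, m: &PyModule) -> PyResult<()> {
    m.add_function(wrap_pyfunction!(cosine_similarity, m)?)?;
    Ok(())
}
From Python's side it's just another importable module, except the hot loop runs as native code.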
When to Use Rust vs. Python/TypeScript
Rust isn't always the right choice. Here's my take:
Use Rust when:
- You need maximum performance
- Your app is CPU-bound
- You're building something to scale
- Consistent latency matters
- You can afford longer dev time
Stick with Python/TypeScript when:
- You need rapid prototyping
- The ecosystem has what you need
- Performance isn't your bottleneck
- You have team experience in those languages
Conclusion
Rust's ownership system is a game-changer for LLM applications where performance and memory efficiency matter. By catching memory issues at compile time and enabling safe concurrency without runtime overhead, Rust lets you push the limits of what your hardware can handle.
The mental model shift from garbage collection to ownership might take some time to adapt to, but the benefits for CPU-intensive LLM applications are massive: predictable performance, lower memory usage, and safe parallelism out of the box.
Have you tried using Rust with LLMs? I'd love to hear about your experiences in the comments below! Next week, I'll dig into how to use Rust to optimize token processing for custom LLM fine-tuning.
Happy coding! 🦀