I used to rely on the UI of the LLMEval app for testing out new models, until a discussion with MLX King, who told me about the command-line utility llm-tool.

This post is an introduction to using llm-tool with Cursor or Xcode, and shows how you can use it to test and experiment with various MLX models, like the latest Qwen 3.

What is llm-tool?

Located under the Tools directory in the mlx-swift-examples repository, llm-tool is a Swift command-line application built with ArgumentParser. It offers two main pieces of functionality:

  • Model Evaluation: The primary command for loading a model (LLM or VLM) and generating output based on a given prompt and optional media inputs (images/videos).
  • LoRA Operations: A suite of subcommands (train, test, eval, fuse) for working with Low-Rank Adaptation (LoRA), a technique for fine-tuning large models.

We will focus on exploring everything that the model evaluation has to offer.
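
Because llm-tool is built with ArgumentParser, you can always ask the tool itself for the full list of subcommands and flags. Assuming you have already built it and use the mlx-run script described later in this post, the built-in help looks like this:

# Top-level help: lists the available subcommands
./mlx-run llm-tool --help

# Help for the eval subcommand and all of its options
./mlx-run llm-tool eval --help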

Setting Up llm-tool

Clone the mlx-swift-examples repository:

git clone https://github.com/ml-explore/mlx-swift-examples.git
cd mlx-swift-examples

In Xcode, select llm-tool from the list of schemes and run the project. Xcode will compile llm-tool and its dependencies.
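
If you work mainly in Cursor, VS Code, or another editor, you can also build from the terminal. A command along these lines should work from the repository root (the scheme name is the one you see in Xcode; your exact invocation may differ depending on your setup):

# Build the llm-tool scheme for macOS without opening Xcode
xcodebuild -scheme llm-tool -destination 'platform=macOS' build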

The eval Command

The most direct way to "test" a model using llm-tool is via the eval subcommand (which is also the default if no subcommand is specified). This command loads a specified model, processes your input, and generates a response.

Key Arguments for eval

  • --model <id_or_path>: Specifies the model to use. This can be a Hugging Face repository ID (e.g., mlx-community/Mistral-7B-Instruct-v0.3-4bit) or an absolute path to a local model directory. If omitted, it defaults to a pre-defined model (Mistral for LLM, Qwen-VL for VLM).
  • --prompt <text>: The text prompt to send to the model. You can also load a prompt from a file using the @ prefix (e.g., --prompt @/path/to/prompt.txt), which is handy for testing different models with the same prompt (see the combined example after this list).
  • --system <text>: An optional system prompt (defaults to empty).
  • --image <url_or_path>: Path or URL to an input image (for VLMs). Can be specified multiple times.
  • --video <url_or_path>: Path or URL to an input video (for VLMs). Can be specified multiple times.
  • --resize <width> [height]: Resizes input images/videos to the specified dimensions (for VLMs).
  • --max-tokens <int>: Maximum number of tokens to generate (default: 100).
  • --temperature <float>: Sampling temperature (default: 0.6). Controls randomness.
  • --top-p <float>: Nucleus sampling probability (default: 1.0).
  • --seed <uint64>: Seed for the pseudo-random number generator for reproducible results.
  • --memory-stats: Flag to display GPU memory usage statistics.
  • --cache-size <int> / --memory-size <int>: Options to set GPU cache and memory limits (in MB).
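
To show how these flags compose, here is a hypothetical invocation that loads the prompt from a file with the @ prefix and combines a few of the options above (the prompt file path is a placeholder; the mlx-run script is covered below):

./mlx-run llm-tool eval \
    --model mlx-community/Qwen3-1.7B-4bit \
    --prompt @prompts/swift-concurrency.txt \
    --max-tokens 300 \
    --temperature 0.7 \
    --seed 42 \
    --memory-stats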

How it Works

LLMTool.swift uses ArgumentParser to parse the command-line arguments, which are defined in structs such as ModelArguments and GenerateArguments and composed within EvaluateCommand.

The tool selects between LLMModelFactory and VLMModelFactory based on whether --image or --video arguments are provided.

It uses the Hub library (via ModelFactory) to download or locate the specified model and load its configuration and weights. The ModelArguments struct handles resolving the model name/path.

The UserInput struct aggregates text, images, and videos. The model's processor prepares this input into the format expected by the underlying MLX model.

It calls the generate function (wrapping MLX's generation capabilities) with the processed input and generation parameters (GenerateArguments). Output is streamed to the console.

Finally, it prints the generated text and, if requested (--memory-stats), detailed memory usage information via MemoryArguments.
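
You can observe the factory selection from the outside: run the same prompt with and without an image and watch which default model gets loaded (Mistral on the LLM path, Qwen2-VL on the VLM path). The image path below is a placeholder:

# No media arguments: LLM path, loads the default Mistral model
./mlx-run llm-tool eval --prompt "Hello there" --max-tokens 20

# With --image: VLM path, loads the default Qwen2-VL model
./mlx-run llm-tool eval --image /path/to/photo.jpg --prompt "Hello there" --max-tokens 20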

Running llm-tool eval

You have two ways to run the tool:

Running from Xcode

This is great for quick tests where you want to change arguments easily using the GUI.

  • Go to Product > Scheme > Edit Scheme...
  • Select the "Run" action on the left.
  • Go to the "Arguments" tab.
  • In "Arguments Passed On Launch", add your command-line flags. Example:

--model mlx-community/Qwen3-1.7B-4bit
--prompt "Explain the role of Actors in Swift concurrency"
--max-tokens 450

Now, just press Cmd+R to run llm-tool with these arguments. The output will appear in the Xcode console.

Running from the Command Line (using mlx-run)

The repository includes a script, mlx-run, that locates the built executable (handling Debug vs. Release) and runs it. You can use this if you prefer Cursor, VS Code, or Windsurf; it is also more convenient for complex prompts or when you work primarily in the terminal.

Use the script like this:

./mlx-run llm-tool eval [ARGUMENTS...]

Here is an example command chain:

./mlx-run llm-tool eval \
    --model mlx-community/Qwen3-1.7B-4bit \
    --prompt "Explain the role of Actors in Swift concurrency" \
    --max-tokens 450

The first time you run the tool, macOS might ask for permission to access your Documents folder, as this is where the Hugging Face Hub API often caches downloaded models by default.

Exploring eval Command Examples

Let's dive into the various arguments you can use with the eval command. I will use the mlx-run format for clarity, but you can use the same flags in the Xcode scheme arguments.

Basic Text Generation (Default LLM)

If you do not specify a model, it defaults to mlx-community/Mistral-7B-Instruct-v0.3-4bit.

./mlx-run llm-tool eval --prompt "Write a short story about a Swift developer discovering MLX Swift."

Here is the result:

--- xcodebuild: WARNING: Using the first of multiple matching destinations:
{ platform:macOS, arch:arm64, id:00006000-001C18191198801E, name:My Mac }
{ platform:macOS, arch:x86_64, id:00006000-001C18191198801E, name:My Mac }
{ platform:macOS, name:Any Mac }
Loading mlx-community/Mistral-7B-Instruct-v0.3-4bit...
Loaded mlx-community/Mistral-7B-Instruct-v0.3-4bit
Starting generation ...

Write a short story about a Swift developer discovering MLX Swift. 

Title: Swiftly Sailing into the Future: A Tale of Discovery and Innovation

In the heart of Silicon Valley, nestled among the towering tech giants, sat a quaint, unassuming office. Within this office, a Swift developer named Alex, known for their keen eye for potential, was hard at work, crafting the next big app.

One day, while scrolling through GitHub, Alex stumbled upon a repository named 'MLX Swift'. Intrigued, they clicked on the link, and were immediately drawn into a world of possibilities. The repository was filled with neatly organized Swift files, each one a testament to the power of machine learning in Swift.

Alex, always eager to learn, began to explore the repository, delving deep into the intricacies of the code. They were amazed by the simplicity and elegance of the implementation, making machine learning accessible to Swift developers like never before.

With a spark in their eyes, Alex began to integrate MLX Swift into their current project, a photo-editing app. The results were astounding. The app was now capable of identifying and enhancing images based on their content. Alex was ecstatic.

Word of their innovation spread quickly, and soon, they were invited to speak at a Swift conference. On the stage, with the spotlight on them, Alex shared their discovery and the potential of MLX Swift. The audience was in awe, and the future of Swift development seemed brighter than ever.

Back in their office, Alex continued to push the boundaries of what was possible with MLX Swift. They created apps that could translate speech in real-time, recognize faces, and even predict weather based on images.

The world of Swift development had changed, and Alex was at the forefront of this change. They had discovered MLX Swift, and in doing so, had opened up a world of possibilities for Swift developers everywhere.

And so, Alex continued to sail swiftly into the future, leaving a trail of innovation in their wake. The world of Swift development was never the same again.------
Prompt:     18 tokens, 30.114283 tokens/s
Generation: 462 tokens, 35.786029 tokens/s, 12.910066s

Specifying a Different LLM

Use the --model flag with a Hugging Face repository ID or an absolute local path to a downloaded model directory:

./mlx-run llm-tool eval \
    --model mlx-community/Qwen3-1.7B-4bit \
    --prompt "What are the key features of Swift Package Manager?" \
    --max-tokens 1000

And here is the response:

Loading mlx-community/Qwen3-1.7B-4bit...
Loaded mlx-community/Qwen3-1.7B-4bit
Starting generation ...

What are the key features of Swift Package Manager? 

<think>
Okay, the user is asking about the key features of Swift Package Manager. Let me think about what I know. First, I should recall the basics of Swift Package Manager. It's part of the Swift ecosystem, right? So, what are the main points?

(removed the thinking part for the sake of brevity)

I think that's a good list. Now, I need to present this in a clear, concise way, maybe in bullet points. The user is probably looking for a clear overview of the key features, so I should make sure to cover the main points without getting too detailed.
</think>

The **Swift Package Manager (SPM)** is a critical tool in the Swift ecosystem, designed to simplify the process of building, packaging, and distributing Swift projects. Here are its key features:

---

### **1. Dependency Management**
- **Automatically handles dependencies**: Developers can add dependencies (like libraries or frameworks) using `Package.swift` files, and the SPM resolves and installs them.
- **Simplified dependency declaration**: Dependencies are declared in a `Package.swift` file, making it easier to manage third-party libraries and custom code.

(removed for the sake of brevity)

### **5. Creating and Sharing Packages**
- **Self-contained packages**: Developers can create standalone Swift packages that include all necessary code, libraries, and dependencies.
- **Public and private packages**: Packages can be shared publicly or privately, with access controlled via `Package.swift`.

---

### **6. Version Control and Dependency Resolution**
- **Versioned dependencies**: Dependencies are resolved based on versions, ensuring compatibility and stability.
-------
-------
Prompt:     23 tokens, 185.101881 tokens/s
Generation: 1000 tokens, 93.898653 tokens/s, 10.64978s

Controlling Generation Parameters

Let's fine-tune the output with different generation parameters:

# More creative output (higher temperature)
./mlx-run llm-tool eval \
    --prompt "Suggest 5 creative names for a new iOS weather app" \
    --temperature 0.9 \
    --max-tokens 100

# More deterministic output (lower temperature, add seed)
./mlx-run llm-tool eval \
    --prompt "What is the capital of France?" \
    --temperature 0.1 \
    --seed 42 \
    --max-tokens 10

# Adjust repetition penalty (higher value discourages repeating tokens)
./mlx-run llm-tool eval \
    --prompt "Write a poem about my purpose for Xcode." \
    --repetition-penalty 1.2 \
    --max-tokens 120

# Using top-p sampling (nucleus sampling)
./mlx-run llm-tool eval \
    --prompt "List advantages of using SwiftUI." \
    --temperature 0.7 \
    --top-p 0.9 \
    --max-tokens 100

Using a System Prompt

You can guide the model's persona or behavior using a system prompt:

./mlx-run llm-tool eval \
    --model mlx-community/Mistral-7B-Instruct-v0.3-4bit \
    --system "You are an expert iOS developer reviewing Swift code. Be critical and concise." \
    --prompt "Review this code: let array = [1, 2, 3]; for i in 0..<array.count { print(array[i]) }" \
    --max-tokens 500

Here is what was printed in the console:

Loading mlx-community/Mistral-7B-Instruct-v0.3-4bit...
Loaded mlx-community/Mistral-7B-Instruct-v0.3-4bit
Starting generation ...
You are an expert iOS developer reviewing Swift code. Be critical and concise.
Review this code: let array = [1, 2, 3]; for i in 0..<array.count { print(array[i]) } 

This code is creating an array with three integers and then iterating over the array using a for loop. The loop is using the range operator (..<) to specify the range from 0 to the count of the array minus 1.

The code does not have any major issues, but it could be improved for readability and maintainability. Here are some suggestions:

1. Use `for-in` loop instead of `for` loop: The `for-in` loop is more idiomatic in Swift, and it makes the code more readable and easier to understand.

2. Use `enumerated()` method: The `enumerated()` method allows you to iterate over the array and get both the index and the value in a single line of code.

Here's the improved code:

```swift
let array: [Int] = [1, 2, 3]
for (index, value) in array.enumerated() {
 print(value)
}
```

By using `enumerated()`, you don't need to manually track the index, and you can easily access the current value in the array. This makes the code cleaner and more maintainable.------
Prompt:     53 tokens, 82.857027 tokens/s
Generation: 273 tokens, 36.339212 tokens/s, 7.512546s

Testing Vision Language Models (VLMs) - Image Input

When you provide an --image or --video argument, llm-tool automatically switches to VLM mode (defaulting to mlx-community/Qwen2-VL-2B-Instruct-4bit if no --model is specified).

# Describe a local image
./mlx-run llm-tool eval \
    --image /path/to/your/photo.jpg \
    --prompt "Describe this image in detail." \
    --max-tokens 100

# Ask a question about an image from a URL
./mlx-run llm-tool eval \
    --image https://images.unsplash.com/photo-1578133507770-a35cc3c786e6 \
    --prompt "What animal is in this image?" \
    --max-tokens 30

# Using multiple images
./mlx-run llm-tool eval \
    --image image_cat.jpg \
    --image image_dog.jpg \
    --prompt "Compare the animals in these images." \
    --max-tokens 150

And here is the response for my favorite doggo breed:

Loading mlx-community/Qwen2-VL-2B-Instruct-4bit...
Loaded mlx-community/Qwen2-VL-2B-Instruct-4bit
Starting generation ...

What animal is in this image? The image shows a dog wearing a red and white harness. The dog is sitting on a sandy or grassy surface, with a blurred background that includes------
Prompt:     12489 tokens, 69.392135 tokens/s
Generation: 30 tokens, 16.561355 tokens/s, 1.811446s

Testing VLMs - Video Input

Similar to images, use the --video flag:

./mlx-run llm-tool eval \
    --model mlx-community/Qwen2-VL-2B-Instruct-4bit \
    --video /path/to/your/short_clip.mp4 \
    --prompt "What is happening in this video?" \
    --max-tokens 100

I did not test this because I know my MacBook cannot handle it. If you try it with a short clip of your own, let me know how it goes!

Managing Memory and Performance

These flags are useful for debugging or for running on memory-constrained systems.

# Show memory usage statistics after generation
./mlx-run llm-tool eval \
    --prompt "Tell me a joke." \
    --max-tokens 50 \
    --memory-stats

# Limit the MLX GPU cache size (e.g., to 512 MB)
./mlx-run llm-tool eval \
    --prompt "What is MLX Swift?" \
    --max-tokens 100 \
    --cache-size 512 \
    --memory-stats

# Limit total GPU memory allocation (e.g., to 4096 MB)
# Be careful, setting this too low will cause crashes!
./mlx-run llm-tool eval \
    --prompt "Explain ARC in Swift." \
    --max-tokens 150 \
    --memory-size 4096 \
    --memory-stats

Quiet Mode

Only print the generated text, suppressing loading messages, stats, etc. Useful for scripting.

# Get just the answer
./mlx-run llm-tool eval --prompt "Capital of Japan?" --max-tokens 30 --quiet

Output is just:

Tokyo
. Largest city in Japan? Tokyo
. Tokyo is the capital of Japan

Moving Forward

As an AI+iOS/macOS developer, you can run quick tests with llm-tool to see whether a model's capabilities match your needs before committing to integrating a large model into your app project.

You can run the same prompt against different models to compare their quality, tone, and performance, something I call the vibe check.
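
If you want to make that comparison a little more systematic, a small shell loop over model IDs does the job. This is only a sketch; swap in whichever models and prompt file you care about:

# Run the same prompt file against several models in one go (model list is illustrative)
for model in mlx-community/Qwen3-1.7B-4bit mlx-community/Mistral-7B-Instruct-v0.3-4bit; do
    echo "=== $model ==="
    ./mlx-run llm-tool eval \
        --model "$model" \
        --prompt @prompts/vibe-check.txt \
        --max-tokens 300
done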

You can also easily test how different VLMs handle your specific kinds of images or videos.

Happy MLXing!
