
Exploring Cursor: Using Chat to Learn a New Framework


This chapter is taken from my book, "AI Assisted Coding for iOS Development". You can buy it to learn more about Cursor:

Exploring AI Assisted Coding


I recently gifted myself a device with an A18 Pro chip, and I want to see how much of its power I can use for on-device intelligence. No, I am not talking about Apple Intelligence. I want to see how it handles OpenAI’s Whisper model for speech-to-text, and different LLMs small yet powerful enough to run on the 6.9” display monster.

I came across WhisperKit, an open-source Swift package that integrates the speech recognition model with Apple’s CoreML framework for efficient, local inference on Apple devices.

GitHub - argmaxinc/WhisperKit: On-device Speech Recognition for Apple Silicon

I tried the sample project and could feel the difference between running it on the 14 Pro Max and the 16 Pro Max, with the latter being much faster. I decided to learn how it works and create a barebones sample project called Whispering for anyone who wants a starting point for working with WhisperKit.

GitHub - rudrankriyam/WhisperKit-Sample: A sample project to work with WhisperKit across Apple Platforms

But I did not have any idea where to start! The README.md mentions a few lines on how to get started, but not enough for my understanding, and the sample project is a single complex file of 1,000+ lines, which is too difficult for me to comprehend.

So, I decided to see how much I could use Cursor to learn from the sample project while creating Whispering in the editor. This post is about my journey, and I hope you can pick up a few tricks and tips to make the most of this feature!

Note: While I am using Cursor here, you can apply the same methodology with GitHub Copilot in VS Code or with Windsurf.

Getting Started

This is the current content of ContentView:

import SwiftUI
import WhisperKit
 
struct ContentView: View {
    var body: some View {
        VStack {
            Image(systemName: "globe")
                .imageScale(.large)
                .foregroundStyle(.tint)
            Text("Hello, world!")
        }
        .padding()
    }
}

The first trick I use with open-source projects is to change the “g” in the GitHub URL to “u” (github.com becomes uithub.com). This gives you the whole repository, or a set of files, as LLM-ready context, along with a token count so you know how many tokens you will use. For example, you can do a long-context chat with 200K tokens on Claude in Cursor Chat, which is sufficient for a small- to medium-sized open-source project.

First, I will use the README.md and have Chat summarise how to use the package for my project. The URL changes from:

https://github.com/argmaxinc/WhisperKit/blob/main/README.md

to:

https://uithub.com/argmaxinc/WhisperKit/blob/main/README.md

The whole project at the time of writing this chapter is around 176K tokens, but I do not need that much context initially.

Using Chat

To open the chat window, you can use the keyboard shortcut Shift + Command + L on Cursor or Control + Command + I on Copilot Chat.

I started with a simple prompt, giving it the contents of README.md and then asking it:

<instruction>
Please explain in detail how to get started with WhisperKit
</instruction>

It gave me a summary of WhisperKit and a simple implementation based on the few lines of code in README.md. I decided to just mess around and try it out myself. I would usually apply the code directly, but this time, I wrote the code myself so that I had context on what was happening.
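For reference, the quick-start in README.md boils down to creating a WhisperKit instance and calling transcribe with an audio path. Here is a minimal sketch along those lines (my paraphrase of the idea, not the exact code the Chat returned; the default initializer fetches a model on first run):

import WhisperKit

// Minimal README-style usage sketch: create the pipeline, transcribe a file, join the text.
Task {
  do {
    let pipe = try await WhisperKit()
    let results = try await pipe.transcribe(audioPath: "path/to/audio.wav")
    print(results.map { $0.text }.joined(separator: " "))
  } catch {
    print("Transcription failed: \(error)")
  }
}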

Over the months of using Cursor and AI-assisted coding, I have realized that taking the shortcut of AI helping you code is fun, but there should be a balance so you are not alienated from your codebase.

Initial Code

However, I love the tab-tab-tab, so while writing the code myself, I utilized Cursor Tab's fast autocompletion.

After some tab-driven development, this is what my class looks like:

struct ContentView: View {
  @StateObject private var transcriptionService = TranscriptionService()
  @State private var transcriptionResult = ""
 
  var body: some View {
    VStack {
      Image(systemName: "globe")
        .imageScale(.large)
        .foregroundStyle(.tint)
 
      Text(transcriptionResult)
    }
    .padding()
    .task {
      await transcriptionService.setup()
      let result = await transcriptionService.transcribe(audio: URL(fileURLWithPath: Bundle.main.path(forResource: "test", ofType: "mp3")!))
      transcriptionResult = result
    }
  }
}
 
#Preview {
  ContentView()
}
 
class TranscriptionService: ObservableObject {
  private var whisperKit: WhisperKit?
 
  init() {
  }
 
  func setup() async {
    do {
      whisperKit = try await WhisperKit()
    } catch {
      print("Error setting up WhisperKit: \(error)")
    }
  }
 
  func transcribe(audio: URL) async -> String {
    do {
      let result = try await whisperKit?.transcribe(audioPath: audio.absoluteString)
      return result?.first?.text ?? ""
    } catch {
      print("Error transcribing audio: \(error)")
      return ""
    }
  }
}
 

I create an instance of WhisperKit and transcribe the audio at a particular path. That’s a good start. Based on the documentation, this should show me how WhisperKit works.

Adding Test Audio

I want Cursor to add a random speech audio file called test to the project. I will add @ContentView.swift for context to the chat conversation:

@ContentView.swift Please add the test file to the project and ensure that the code I have written is correct and will work well according to the documentation of WhisperKit.

Claude gave me an improved version of the code but said I should add the file myself. I asked again, this time with @Web so that it could search the web for a file:

Can you add the file from the @Web to the project yourself and to the xcodeproj?

The response was disappointing: it said it could not directly add files to the Xcode project or modify the file system. However, it gave me links to a clear male speech recording, an audio clip from Mozilla’s dataset, and a short TED talk sample!

I ran the command to download the clear English speech example from the Open Speech Repository:

curl -o test.mp3 https://www.voiptroubleshooter.com/open_speech/american/OSR_us_000_0010_8k.wav

I know that Xcode 16 uses buildable folder references, so if I change directory into the Whispering folder and run the command there, the file should be added automatically. So, in the integrated terminal, which can be opened via Control + Shift + `, I used the inline feature (Command + K) to generate the command:

Change directory to inside Whispering folder and run the
curl -o test.mp3 https://www.voiptroubleshooter.com/open_speech/american/OSR_us_000_0010_8k.wav command

It generated the following command:

cd Whispering && curl -o test.mp3 https://www.voiptroubleshooter.com/open_speech/american/OSR_us_000_0010_8k.wav

I could have written cd Whispering && myself, but I did this for those unaware that the inline feature can also generate more complex commands. The command added the test.mp3 file inside Xcode, with the target membership of Whispering set, too!

Improved Code Suggestions

Coming back to the suggestions that I got in the Chat, it added a waveform image and a progress view when it transcribes the given audio file:

struct ContentView: View {
  @StateObject private var transcriptionService = TranscriptionService()
  @State private var transcriptionResult = ""
  @State private var isTranscribing = false
 
  var body: some View {
    VStack {
      Image(systemName: "waveform")
        .imageScale(.large)
        .foregroundStyle(.tint)
 
      if isTranscribing {
        ProgressView("Transcribing...")
      }
 
      Text(transcriptionResult)
        .padding()
    }
    .padding()
    .task {
      isTranscribing = true
      do {
        await transcriptionService.setup()
 
        if let path = Bundle.main.path(forResource: "test", ofType: "mp3") {
          let url = URL(fileURLWithPath: path)
          transcriptionResult = await transcriptionService.transcribe(audio: url)
        } else {
          transcriptionResult = "Error: Audio file not found"
        }
      } catch {
        transcriptionResult = "Error: \(error.localizedDescription)"
      }
      isTranscribing = false
    }
  }
}
 
class TranscriptionService: ObservableObject {
  private var whisperKit: WhisperKit?
 
  func setup() async {
    do {
      let config = WhisperKitConfig(model: "base")
      whisperKit = try await WhisperKit(config)
    } catch {
      print("Error setting up WhisperKit: \(error)")
    }
  }
 
  func transcribe(audio: URL) async -> String {
    do {
      let result = try await whisperKit?.transcribe(audioPath: audio.path)
      return result?.text ?? ""
    } catch {
      print("Error transcribing audio: \(error)")
      return "Error: \(error.localizedDescription)"
    }
  }
}

It also explicitly set the model “base” to fetch in the configuration before initializing the WhisperKit instance. When I ran the project, I got the following error:

/Users/rudrankriyam/Downloads/Whispering/Whispering/ContentView.swift:71:22: error: value of type '[TranscriptionResult]' has no member 'text'
      return result?.text ?? ""
             ~~~~~~~ ^~~~

When you see an error in the terminal, you can select those lines and add them directly to Chat using Shift + Command + L (or the equivalent in VS Code), then hit Enter:

Ah, I see the error. The transcription result structure has changed in WhisperKit. Let's fix the code:

It gave me the solution to join the text from the transcription result:

if let results = try await whisperKit?.transcribe(audioPath: audio.path) {
  // Join all segments into a single string
  return results.map { $0.text }.joined(separator: " ")
}

I investigated why it went wrong and realized that the older method, which returns a single TranscriptionResult, is deprecated, and the newer one returns an array of TranscriptionResult instead. The Chat relied only on the README.md, so it followed the older method. In README.md, though, the example is a single expression, where Swift implicitly infers the type of the result as TranscriptionResult, and everything works; Claude split it across two lines, and the compiler resolved the call to the array-returning version instead. AI is not going to take over the world yet, haha.
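To see why the single-line example in README.md compiles while the two-line version does not, here is a self-contained sketch using hypothetical stand-in types (not WhisperKit's real API) that reproduces the same overload behaviour:

// Hypothetical stand-ins, only to illustrate Swift's overload resolution by return type.
struct Segment { let text: String }

struct FakePipe {
  // "Old", deprecated-style API: a single result
  func transcribe(audioPath: String) -> Segment? { Segment(text: "hello") }
  // "New" API: an array of results
  func transcribe(audioPath: String) -> [Segment] { [Segment(text: "hello")] }
}

let pipe = FakePipe()

// One expression, as in README.md: chaining `.text` makes the compiler pick the
// overload that returns Segment?, so this compiles.
let oneLiner = pipe.transcribe(audioPath: "test.wav")?.text

// Split across two statements, as Claude wrote it: the compiler settles on the
// array-returning overload (the annotation here stands in for whatever nudged it
// there in the real API), and `.text` is no longer a member of the result.
let results: [Segment] = pipe.transcribe(audioPath: "test.wav")
// results.text  // error: value of type '[Segment]' has no member 'text'
let joined = results.map { $0.text }.joined(separator: " ")

print(oneLiner ?? "", joined)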

And with that, the project built successfully with SweetPad inside Cursor/VS Code! (SweetPad is covered in subsequent chapters.)

** BUILD SUCCEEDED ** [11.178 sec]

First Build

After running the first build and waiting a while, I got the error message:

Error transcribing audio: Error Domain=com.apple.coreaudio.avfaudio Code=1685348671 "(null)" UserInfo={failed call=ExtAudioFileOpenURL((__bridge CFURLRef)fileURL, &_extAudioFile)}

My first instinct, out of habit, was to search for it on Google (yeah, I am not used to Perplexity yet), but I wanted to use the debug-with-AI feature instead. I added those lines to Chat using Shift + Command + L, or the equivalent in VS Code, and hit Enter:

This error occurs because there's an issue with the audio file format or access. Let's fix this by ensuring proper audio file handling:

It added error handling and better file validation to the class:

class TranscriptionService: ObservableObject {
  private var whisperKit: WhisperKit?
 
  func setup() async {
    do {
      let config = WhisperKitConfig(model: "base")
      whisperKit = try await WhisperKit(config)
    } catch {
      print("Error setting up WhisperKit: \(error)")
    }
  }
 
  func transcribe(audio: URL) async -> String {
    do {
      // Verify file exists and is readable
      guard FileManager.default.fileExists(atPath: audio.path) else {
        return "Error: Audio file not found at path"
      }
 
      // Print file information for debugging
      print("Attempting to transcribe file at: \(audio.path)")
      print("File exists: \(FileManager.default.fileExists(atPath: audio.path))")
 
      // Try to transcribe
      if let results = try await whisperKit?.transcribe(audioPath: audio.path) {
        return results.map { $0.text }.joined(separator: " ")
      }
      return "No transcription results"
    } catch {
      print("Error transcribing audio: \(error)")
      return "Error: \(error.localizedDescription)"
    }
  }
}

Then, it modified ContentView to ensure proper file handling:

struct ContentView: View {
  @StateObject private var transcriptionService = TranscriptionService()
  @State private var transcriptionResult = ""
  @State private var isTranscribing = false
 
  var body: some View {
    VStack {
      Image(systemName: "waveform")
        .imageScale(.large)
        .foregroundStyle(.tint)
 
      if isTranscribing {
        ProgressView("Transcribing...")
      }
 
      Text(transcriptionResult)
        .padding()
    }
    .padding()
    .task {
      isTranscribing = true
 
      // Get the file URL and verify it exists
      if let audioURL = Bundle.main.url(forResource: "test", withExtension: "mp3") {
        print("Found audio file at: \(audioURL.path)")
 
        // Verify file exists
        if FileManager.default.fileExists(atPath: audioURL.path) {
          transcriptionResult = await transcriptionService.transcribe(audio: audioURL)
        } else {
          transcriptionResult = "Error: Audio file not found in bundle"
        }
      } else {
        transcriptionResult = "Error: Could not create URL for audio file"
      }
 
      isTranscribing = false
    }
  }
}

Running the app again got me the following in the console:

Found audio file at: /private/var/containers/Bundle/Application/6CE5A9E7-22AF-4FC5-8154-E1B64B5027A8/Whispering.app/test.mp3
Attempting to transcribe file at: /private/var/containers/Bundle/Application/6CE5A9E7-22AF-4FC5-8154-E1B64B5027A8/Whispering.app/test.mp3
File exists: true

But I still got the audio-related error. (In hindsight, the curl command had downloaded an 8 kHz WAV file but saved it with an .mp3 extension, which probably did not help.) Asking it again, it recommended converting the file to WAV format instead:

This error typically occurs with audio format issues. Let's try converting the audio to a WAV format with specific parameters that WhisperKit works well with. Here's what we'll do:

I ran the given command, and the resulting test.wav was automatically added to the project:

ffmpeg -i test.mp3 -ar 16000 -ac 1 -c:a pcm_s16le test.wav

This created a WAV file with the following properties (a quick way to verify them from Swift is shown after the list):

  • 16kHz sample rate
  • Mono channel
  • 16-bit PCM encoding
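If you want to double-check the converted file from Swift before handing it to WhisperKit, a small sanity check with AVFoundation (my own addition, not something the Chat suggested) could look like this:

import AVFoundation

// Print the converted file's format so you can confirm the parameters listed above.
func printAudioFormat(of url: URL) {
  do {
    let file = try AVAudioFile(forReading: url)
    let format = file.fileFormat
    print("Sample rate: \(format.sampleRate) Hz")            // expect 16000.0
    print("Channels: \(format.channelCount)")                 // expect 1
    print("Common format: \(format.commonFormat.rawValue)")   // linear PCM
  } catch {
    print("Could not open audio file: \(error)")
  }
}

// Usage, assuming the file is in the app bundle:
// if let url = Bundle.main.url(forResource: "test", withExtension: "wav") {
//   printAudioFormat(of: url)
// }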

Then, it updated the ContentView as:

.task {
  isTranscribing = true
 
  await transcriptionService.setup()
 
  if let audioPath = Bundle.main.path(forResource: "test", ofType: "wav") {  // Changed to wav
    print("Audio path: \(audioPath)")
    transcriptionResult = await transcriptionService.transcribe(audioPath: audioPath)
  } else {
    transcriptionResult = "Error: Audio file not found"
  }
 
  isTranscribing = false
}

Running the project again gave me a successful transcription, on-device!

Improving with Speech Recognition

The next step in improving the app is to add an option to record the user's speech directly and then feed the audio buffer to WhisperKit. I do not see any mention of how to do this, but I know the sample project does it.

So, this time, we provide it with the context of the sample app instead:

https://uithub.com/argmaxinc/WhisperKit/tree/main/Examples/WhisperAX/WhisperAX?accept=text%2Fhtml&maxTokens=10000000&ext=swift

At the time of writing, it is roughly 22.6K tokens, which is good enough for a normal chat. I will create a new chat and provide it with the above context:

<instruction>
I want to extend the app to accept the user's speech when they click the start recording button. When they click the stop button, it should send the audio buffer to WhisperKit to transcribe and show the result in a view, a scrollable view for larger texts.

Follow Apple's Human Interface Guidelines to create a clean user interface with accessibility in mind.
</instruction>

It provided me with the following code:

// ... existing code ...
 
struct ContentView: View {
  // Add new state variables for recording
  @State private var audioRecorder: AVAudioRecorder?
  @State private var recordingSession: AVAudioSession?
  @State private var isRecording = false
  @State private var transcribedText = ""
  @State private var recordingURL: URL?
 
  // ... existing state variables ...
 
  var body: some View {
    NavigationSplitView(columnVisibility: $columnVisibility) {
      // ... existing sidebar code ...
    } detail: {
      VStack {
        // Transcription results view
        ScrollView {
          Text(transcribedText)
            .frame(maxWidth: .infinity, alignment: .leading)
            .padding()
            .accessibilityLabel("Transcription results")
        }
 
        // Recording controls
        HStack(spacing: 20) {
          Button(action: {
            if isRecording {
              stopRecording()
            } else {
              startRecording()
            }
          }) {
            Image(systemName: isRecording ? "stop.circle.fill" : "mic.circle.fill")
              .resizable()
              .frame(width: 44, height: 44)
              .foregroundColor(isRecording ? .red : .blue)
          }
          .accessibilityLabel(isRecording ? "Stop recording" : "Start recording")
          .disabled(modelState != .loaded)
 
          if isRecording {
            // Show recording indicator
            Text("Recording...")
              .foregroundColor(.red)
          }
        }
        .padding()
      }
    }
    .onAppear {
      setupAudioSession()
    }
  }
 
  // Setup audio recording
  private func setupAudioSession() {
    recordingSession = AVAudioSession.sharedInstance()
    do {
      try recordingSession?.setCategory(.playAndRecord, mode: .default)
      try recordingSession?.setActive(true)
    } catch {
      print("Failed to set up recording session: \(error)")
    }
  }
 
  private func startRecording() {
    let audioFilename = getDocumentsDirectory().appendingPathComponent("recording.wav")
    recordingURL = audioFilename
 
    let settings = [AVFormatIDKey: Int(kAudioFormatLinearPCM), ...]

The complete project is available in the Whispering repository: https://github.com/rudrankriyam/Whispering
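For completeness, here is a rough sketch of how that recording half could be wired up with AVAudioRecorder; this is my own approximation, not the chat's output or the repository's code, and it assumes the AVAudioSession setup and microphone permission from the snippet above are in place:

import AVFoundation

// Rough sketch of a start/stop recording flow whose output URL can be handed to
// TranscriptionService.transcribe(audio:) from earlier. The 16 kHz / mono / 16-bit
// PCM settings mirror the ffmpeg conversion used above.
final class Recorder {
  private var audioRecorder: AVAudioRecorder?
  private(set) var recordingURL: URL?

  func startRecording() throws {
    let url = FileManager.default.temporaryDirectory.appendingPathComponent("recording.wav")
    recordingURL = url

    let settings: [String: Any] = [
      AVFormatIDKey: Int(kAudioFormatLinearPCM),
      AVSampleRateKey: 16_000,
      AVNumberOfChannelsKey: 1,
      AVLinearPCMBitDepthKey: 16,
      AVLinearPCMIsFloatKey: false
    ]

    audioRecorder = try AVAudioRecorder(url: url, settings: settings)
    audioRecorder?.record()
  }

  func stopRecording() -> URL? {
    audioRecorder?.stop()
    audioRecorder = nil
    return recordingURL
  }
}

On stop, the returned URL can go straight into the transcribe method shown earlier.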
 
Play with it, and let me know what you think!
 
Happy whispering!
