Android: Allow switching the voice typing library to Whisper #11158

Open pull request with 24 commits to merge into base branch dev

@personalizedrefrigerator (Collaborator) commented Oct 1, 2024

Summary

Important

I'm currently hosting the built Whisper models on this GitHub repository under my personal account. It may make sense to host these files elsewhere.

This pull request allows switching the voice typing library to Whisper (whisper-tiny) as an alternative to Vosk. For now, it makes Whisper the default voice typing library and lets the user switch back to Vosk in settings > note.

Unlike Vosk, whisper-tiny supports punctuation in the generated text.

This pull request uses ONNX Runtime (https://github.com/microsoft/onnxruntime) to run Whisper on Android (model release and build instructions). ONNX Runtime is a general-purpose runtime for trained machine learning models.

This pull request uses whisper-tiny rather than whisper-small or whisper-large. On my Android tablet, whisper-small seems to crash the application (out of memory?).

Screen recordings

Whisper

whisper.webm

In the above screen recording:

  • I said: Testing. This is a test of voice typing using the Whisper voice recognition library. This is a test. After a few seconds, what I just said is added to the document.
  • Whisper transcribed: Testing. This is a test of voice type using the whisper voice recognition library. This is a test. After a few seconds, what I just said is added to the document.

Vosk

vosk.webm

In the above screen recording:

  • I said: Testing. This is a test of voice typing using the Vosk voice recognition library. This is a test. After a few seconds, what I just said is added to the document.
  • Vosk transcribed: justin this is the test a voice typing using the waske voice recognition library this is a test after a few seconds what i just said is added to the document

Size & performance notes

CPU usage comparison

Whisper uses significantly more CPU than Vosk. On my tablet, Whisper uses roughly 50% CPU while converting speech to text, and currently runs a conversion every few seconds. The interval between conversions can be adjusted. By comparison, Vosk uses about 10% CPU.
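
To picture the trade-off behind that interval: audio can accumulate in a buffer and only be handed to the recognizer once a configurable number of seconds is available, so a longer interval means fewer (CPU-heavy) conversions but more delay before text appears. The following is a minimal sketch under that assumption, not the code in this pull request; `createIntervalBuffer` and its parameters are hypothetical names.

```typescript
// Hypothetical sketch: groups recorded samples into fixed-length chunks.
// Each returned chunk would correspond to one speech-to-text conversion.
const createIntervalBuffer = (sampleRateHz: number, intervalSeconds: number) => {
	const chunkSize = Math.round(sampleRateHz * intervalSeconds);
	if (chunkSize < 1) {
		throw new Error('The interval must cover at least one sample.');
	}

	let pending: number[] = [];
	return {
		// Adds newly recorded samples and returns every complete chunk.
		push(newSamples: number[]): number[][] {
			pending = pending.concat(newSamples);
			const chunks: number[][] = [];
			while (pending.length >= chunkSize) {
				chunks.push(pending.slice(0, chunkSize));
				pending = pending.slice(chunkSize);
			}
			return chunks;
		},
	};
};
```

With `createIntervalBuffer(16000, 4)`, for example, a chunk would be produced for every four seconds of 16 kHz audio; raising the second argument reduces how often the recognizer runs.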

Whisper CPU usage (Android/debug mode):
image

Vosk CPU usage (Android/debug mode):
image

Note: The graphs above are for different audio input.

Download size

whisper-tiny has a download size of 47.1 MB (zipped) and uses a single model for all languages. This is slightly larger than Vosk's 41.2 MB download size for English.

Note: Currently, the "downloading" message still says "Downloading [locale] language files"

APK size increase

ONNX Runtime adds libonnxruntime.so and libortextensions.so for all output platforms.

libonnxruntime.so:

| x86 | x86_64 | armeabi-v7a | arm64-v8a |
| --- | --- | --- | --- |
| +19.3 MB | +19.4 MB | +11.7 MB | +16.8 MB |

libortextensions.so:

| x86 | x86_64 | armeabi-v7a | arm64-v8a |
| --- | --- | --- | --- |
| +2.3 MB | +2.3 MB | +1.4 MB | +2.1 MB |

Joplin builds a single APK that targets all four platforms. With the current build output, this corresponds to roughly a 75 MB increase in size.
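
The "roughly 75 MB" figure can be checked by adding up the per-ABI numbers listed above. This is only a small arithmetic sketch; the variable names are illustrative and the values are copied from the figures in this description.

```typescript
// Per-ABI size increases in MB, copied from the figures above.
const abis = ['x86', 'x86_64', 'armeabi-v7a', 'arm64-v8a'] as const;
type Abi = typeof abis[number];

const libonnxruntimeMb: Record<Abi, number> = {
	'x86': 19.3, 'x86_64': 19.4, 'armeabi-v7a': 11.7, 'arm64-v8a': 16.8,
};
const libortextensionsMb: Record<Abi, number> = {
	'x86': 2.3, 'x86_64': 2.3, 'armeabi-v7a': 1.4, 'arm64-v8a': 2.1,
};

// A single APK that targets all four ABIs ships both libraries for each ABI,
// so the individual increases add up.
const totalIncreaseMb = abis.reduce(
	(sum, abi) => sum + libonnxruntimeMb[abi] + libortextensionsMb[abi],
	0,
);

console.log(totalIncreaseMb.toFixed(1)); // 75.3
```

A per-ABI build only pays for one column of each table, which is why splitting by ABI reduces the size so much.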

Removing react-native-vosk causes the APK size to decrease from 183.3 MB to 146.2 MB. This, however, is still significantly larger than the previous APK size of 100 MB.

Note

If I enable per-ABI builds (diff), each per-ABI APK is still smaller than the current APK size of about 100 MB:

 58M app-arm64-v8a-release.apk
 48M app-armeabi-v7a-release.apk
 62M app-x86_64-release.apk
 62M app-x86-release.apk

Demo APK

https://github.com/personalizedrefrigerator/joplin/releases/tag/whisper-typing-0.0.2

@personalizedrefrigerator marked this pull request as draft October 1, 2024 00:03
Comment on lines +140 to +142
// Needed for Whisper speech-to-text
implementation 'com.microsoft.onnxruntime:onnxruntime-android:latest.release'
implementation 'com.microsoft.onnxruntime:onnxruntime-extensions-android:latest.release'
Collaborator Author

  • The implementation uses the Kotlin/Java bindings to ONNX. Originally, I tried recording audio in the correct format with Kotlin, sending the recorded data to React Native, then running Whisper through the ONNX React Native bindings (implementation). However, this failed with an error similar to the following:
    max_length > sequence_length was false. max_length (448) shall be greater than input sequence length (80000)
    
    Here, the input sequence was the raw audio data provided to ONNX, and max_length is the maximum output sequence length.
    • Using the React Native ONNX bindings also requires transferring the audio data from the audio recorder to React Native, then from React Native to ONNX, which adds overhead.
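
To make the numbers in that error concrete, the sketch below mirrors the condition stated in the message and shows what 80000 raw samples would amount to. It is not ONNX Runtime's code: `assertFitsMaxLength` is a hypothetical helper, and the 16 kHz sample rate is an assumption (it is the rate Whisper models are usually fed), not something stated in this pull request.

```typescript
// Hypothetical helper that mirrors the condition from the error message.
const assertFitsMaxLength = (inputSequenceLength: number, maxLength: number) => {
	if (!(maxLength > inputSequenceLength)) {
		throw new Error(
			`max_length (${maxLength}) shall be greater than input sequence length (${inputSequenceLength})`,
		);
	}
};

// Assumption: 16 kHz mono audio.
const assumedSampleRateHz = 16000;
const rawSampleCount = 80000;

// Under that assumption, 80000 samples is a short clip of audio...
const secondsOfAudio = rawSampleCount / assumedSampleRateHz;

// ...but when the raw samples are counted as the input sequence,
// they far exceed an output limit of 448.
let errorMessage = '';
try {
	assertFitsMaxLength(rawSampleCount, 448);
} catch (error) {
	errorMessage = (error as Error).message;
}
console.log(secondsOfAudio, errorMessage);
```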

@personalizedrefrigerator marked this pull request as ready for review October 5, 2024 16:49
@personalizedrefrigerator changed the title from "Proof of concept: Android: Allow switching the voice typing library to Whisper" to "Android: Allow switching the voice typing library to Whisper" Oct 5, 2024
@personalizedrefrigerator (Collaborator, Author) commented Oct 14, 2024

  • Verify that downloaded Vosk models can still be accessed (i.e. that the saved model location doesn't change).
    Edit: This needs to be fixed — Vosk models seem to be stored in different places before and after this PR.

@personalizedrefrigerator marked this pull request as draft October 14, 2024 20:22
@personalizedrefrigerator marked this pull request as ready for review October 14, 2024 20:52
let urlTemplate = rtrimSlashes(Setting.value('voiceTypingBaseUrl').trim());

if (!urlTemplate) {
urlTemplate = 'https://github.com/personalizedrefrigerator/joplin-voice-typing-test/releases/download/test-release/{task}.zip';
Collaborator Author

I'm currently hosting the built Whisper models on this GitHub repository under my personal account. It may make sense to host these files elsewhere (e.g. under a GitHub repo in the Joplin organization).
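
Because the download location is built from a URL template with a configurable base, moving the models elsewhere would presumably only require changing the default or the setting. Here is a minimal sketch of how such a template might be resolved. The `rtrimSlashes` stand-in, the `resolveModelUrl` helper, and the way `{task}` is substituted are all assumptions for illustration, not the code in this pull request.

```typescript
// Stand-in for the rtrimSlashes helper used in the diff; assumed to strip trailing slashes.
const rtrimSlashes = (value: string) => value.replace(/\/+$/, '');

// Default template taken from the diff above.
const defaultUrlTemplate = 'https://github.com/personalizedrefrigerator/joplin-voice-typing-test/releases/download/test-release/{task}.zip';

// Hypothetical helper: falls back to the default when no base URL is configured,
// then substitutes the {task} placeholder. What {task} stands for is an assumption.
const resolveModelUrl = (configuredBaseUrl: string, task: string) => {
	const urlTemplate = rtrimSlashes(configuredBaseUrl.trim()) || defaultUrlTemplate;
	return urlTemplate.replace('{task}', encodeURIComponent(task));
};

// An empty setting falls back to the default host.
console.log(resolveModelUrl('', 'example-model'));
// A custom template (for instance, one under the Joplin organization) overrides it.
console.log(resolveModelUrl('https://example.com/models/{task}.zip/', 'example-model'));
```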

2 participants