Android: Allow switching the voice typing library to Whisper #11158

personalizedrefrigerator · 2024-10-01T00:03:18Z

Summary

Important

I'm currently hosting the built Whisper models on this GitHub repository under my personal account. It may make sense to host these files elsewhere.

This pull request allows switching the voice typing library to Whisper (whisper-tiny) as an alternative to Vosk. For now, it makes Whisper the default voice typing library and lets the user switch back to Vosk in Settings > Note.

Unlike Vosk, whisper-tiny supports punctuation in the generated text.

This pull request uses ONNX runtime (https://github.com/microsoft/onnxruntime) to run Whisper on Android (model release and build instructions). ONNX runtime is a general-purpose runtime for trained machine learning models.

This pull request uses whisper-tiny rather than whisper-small or whisper-large. On my Android tablet, whisper-small seems to crash the application, possibly because it runs out of memory.

Screen recordings

Whisper

whisper.webm

In the above screen recording:

I said: Testing. This is a test of voice typing using the Whisper voice recognition library. This is a test. After a few seconds, what I just said is added to the document.
Whisper transcribed: Testing. This is a test of voice type using the whisper voice recognition library. This is a test. After a few seconds, what I just said is added to the document.

Vosk

vosk.webm

In the above screen recording:

I said: Testing. This is a test of voice typing using the Vosk voice recognition library. This is a test. After a few seconds, what I just said is added to the document.
Vosk transcribed: justin this is the test a voice typing using the waske voice recognition library this is a test after a few seconds what i just said is added to the document
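One way to compare the two transcripts above more objectively is word error rate (WER): the word-level edit distance between what was said and what was transcribed, divided by the number of words said. WER is not used in this pull request. The sketch below is a minimal TypeScript version that assumes simple, English-only normalization (lowercasing and dropping punctuation), so Whisper's punctuation is neither rewarded nor penalized; `toWords` and `wordErrorRate` are illustrative names.

```typescript
// Word error rate (WER): word-level Levenshtein distance between a reference
// transcript and a hypothesis, divided by the number of reference words.
// Assumption: English-only normalization (lowercase, drop punctuation).
function toWords(text: string): string[] {
	return text
		.toLowerCase()
		.replace(/[^a-z0-9'\s]/g, ' ')
		.split(/\s+/)
		.filter(word => word.length > 0);
}

function wordErrorRate(reference: string, hypothesis: string): number {
	const ref = toWords(reference);
	const hyp = toWords(hypothesis);
	if (ref.length === 0) return hyp.length === 0 ? 0 : 1;

	// At the start of row i, prev[j] is the edit distance between the first
	// (i - 1) reference words and the first j hypothesis words.
	let prev: number[] = [];
	for (let j = 0; j <= hyp.length; j++) prev.push(j);

	for (let i = 1; i <= ref.length; i++) {
		const curr: number[] = [i];
		for (let j = 1; j <= hyp.length; j++) {
			const substitution = prev[j - 1] + (ref[i - 1] === hyp[j - 1] ? 0 : 1);
			const deletion = prev[j] + 1;
			const insertion = curr[j - 1] + 1;
			curr.push(Math.min(substitution, deletion, insertion));
		}
		prev = curr;
	}
	return prev[hyp.length] / ref.length;
}
```

Applied to the two recordings above, the Vosk transcript (with substitutions such as "justin" and "waske") would be expected to score worse than the Whisper transcript, which differs from what was said by a single word ("type" instead of "typing").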

Size & performance notes

CPU usage comparison

Whisper uses significantly more CPU than Vosk. On my tablet, Whisper uses roughly 50% CPU while converting speech to text, which it currently does every few seconds (this interval can be adjusted). By comparison, Vosk uses about 10% CPU.
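The trade-off between CPU usage and update frequency can be sketched as a class that accumulates recorded audio and only runs the expensive model every `intervalMs` milliseconds: a longer interval means fewer model runs and lower average CPU use, at the cost of slower updates. This is a hypothetical TypeScript sketch (the processing in this pull request runs in Kotlin, on a separate thread). It assumes 16 kHz mono float samples and caps the history at Whisper's 30-second input window; `PeriodicTranscriber` and `transcribe` are illustrative names, not part of the pull request.

```typescript
// Hypothetical sketch, assuming 16 kHz mono float samples.
const SAMPLE_RATE_HZ = 16000;
// Whisper models process audio in 30-second windows; older history is dropped.
const MAX_HISTORY_SAMPLES = 30 * SAMPLE_RATE_HZ;

class PeriodicTranscriber {
	private history: Float32Array = new Float32Array(0);
	private lastRunMs = -Infinity;
	private readonly intervalMs: number;
	private readonly transcribe: (audio: Float32Array) => void;

	// intervalMs: minimum time between model runs (larger means less CPU, slower updates).
	public constructor(intervalMs: number, transcribe: (audio: Float32Array) => void) {
		this.intervalMs = intervalMs;
		this.transcribe = transcribe;
	}

	// Called with each chunk delivered by the audio recorder.
	public onAudio(chunk: Float32Array, nowMs: number): void {
		// Append the new chunk to the audio history.
		const combined = new Float32Array(this.history.length + chunk.length);
		combined.set(this.history);
		combined.set(chunk, this.history.length);
		// Keep only the most recent window of audio.
		const start = Math.max(0, combined.length - MAX_HISTORY_SAMPLES);
		this.history = combined.subarray(start);

		// Run the (expensive) model on the whole history, at most once per intervalMs.
		if (nowMs - this.lastRunMs >= this.intervalMs) {
			this.lastRunMs = nowMs;
			this.transcribe(this.history);
		}
	}
}
```

With `intervalMs` set to a few thousand milliseconds, this roughly matches the "every few seconds" behavior described above; raising it reduces how often the model runs.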

Whisper CPU usage (Android/debug mode):

Vosk CPU usage (Android/debug mode):

Note: The graphs above are for different audio input.

Download size

whisper-tiny has a download size of 47.1 MB (zipped) and uses a single model for all languages. This is slightly larger than Vosk's 41.2 MB download for English.

Note: Currently, the "downloading" message still says "Downloading [locale] language files", even though whisper-tiny uses a single model for all languages.

APK size increase

ONNX runtime adds libonnxruntime.so and libortextensions.so for all output platforms.

Library	x86	x86_64	armeabi-v7a	arm64-v8a
libonnxruntime.so	+19.3 MB	+19.4 MB	+11.7 MB	+16.8 MB
libortextensions.so	+2.3 MB	+2.3 MB	+1.4 MB	+2.1 MB

Joplin builds a single APK that targets all four platforms. With the current build output, this corresponds to roughly a 75 MB increase in size.
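The roughly 75 MB figure follows from adding the per-ABI sizes listed above: a universal APK bundles both native libraries for every ABI, while a per-ABI APK would only carry the libraries for its own ABI. A small TypeScript check of that arithmetic, using only the numbers reported above:

```typescript
// Per-ABI size increases (MB), copied from the size figures above.
const abis = ['x86', 'x86_64', 'armeabi-v7a', 'arm64-v8a'];
const libonnxruntimeMb = [19.3, 19.4, 11.7, 16.8];
const libortextensionsMb = [2.3, 2.3, 1.4, 2.1];

// Increase for an APK that targets a single ABI (both libraries for that ABI).
const perAbiIncreaseMb = libonnxruntimeMb.map((size, i) => size + libortextensionsMb[i]);

// A single universal APK contains the libraries for all four ABIs.
const universalIncreaseMb = perAbiIncreaseMb.reduce((total, size) => total + size, 0);

abis.forEach((abi, i) => console.log(`${abi}: +${perAbiIncreaseMb[i].toFixed(1)} MB`));
console.log(`universal APK: +${universalIncreaseMb.toFixed(1)} MB`); // universal APK: +75.3 MB
```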

Removing react-native-vosk reduces the APK size from 183.3 MB to 146.2 MB. However, this is still significantly larger than the previous APK size of about 100 MB.

Note

If I enable per-ABI builds (diff), each per-ABI APK is still smaller than the current APK size of about 100 MB:

 58M app-arm64-v8a-release.apk
 48M app-armeabi-v7a-release.apk
 62M app-x86_64-release.apk
 62M app-x86-release.apk

Demo APK

https://github.com/personalizedrefrigerator/joplin/releases/tag/whisper-typing-0.0.2


personalizedrefrigerator · 2024-10-01T01:21:34Z

packages/app-mobile/android/app/build.gradle

+    // Needed for Whisper speech-to-text
+    implementation 'com.microsoft.onnxruntime:onnxruntime-android:latest.release'
+    implementation 'com.microsoft.onnxruntime:onnxruntime-extensions-android:latest.release'


The implementation uses the Kotlin/Java bindings for ONNX Runtime. Originally, I tried recording audio in the correct format with Kotlin, sending the recorded data to React Native, and then running Whisper with the ONNX Runtime React Native bindings (implementation). However, this failed with an error similar to the following:
max_length > sequence_length was false. max_length (448) shall be greater than input sequence length (80000)
Here, the input sequence was the raw audio data being provided to ONNX. max_length is the maximum output sequence length.
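The numbers in this error can be sanity-checked. Assuming the recorder captured 16 kHz mono audio (the sample rate Whisper models expect), 80000 samples is 5 seconds of audio, far more than the 448-token output limit, so the check fails whenever the raw audio is treated as the input sequence. A minimal TypeScript sketch; `durationSeconds` and `passesLengthCheck` are hypothetical helpers that only restate the condition from the error message:

```typescript
// Assumption: audio recorded at 16 kHz, the sample rate Whisper models expect.
const whisperSampleRate = 16000;
// max_length reported in the error: the maximum output sequence length.
const maxLength = 448;

// Duration, in seconds, of a buffer of raw audio samples.
function durationSeconds(sampleCount: number, sampleRate: number = whisperSampleRate): number {
	return sampleCount / sampleRate;
}

// Restates the failing condition: "max_length > sequence_length".
function passesLengthCheck(maxOutputLength: number, inputSequenceLength: number): boolean {
	return maxOutputLength > inputSequenceLength;
}

const rawAudioSamples = 80000;
console.log(durationSeconds(rawAudioSamples)); // 5
console.log(passesLengthCheck(maxLength, rawAudioSamples)); // false
```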

Using the React Native ONNX bindings also requires transferring the audio data from the audio recorder to React Native and then from React Native to ONNX, which adds overhead.

personalizedrefrigerator · 2024-10-14T15:20:40Z

Verify that downloaded Vosk models can still be accessed (i.e. that the saved model location doesn't change).
Edit: This needs to be fixed. Vosk models seem to be stored in different places before and after this PR.


personalizedrefrigerator · 2024-10-15T17:03:34Z

packages/app-mobile/services/voiceTyping/whisper.ts

+		let urlTemplate = rtrimSlashes(Setting.value('voiceTypingBaseUrl').trim());
+
+		if (!urlTemplate) {
+			urlTemplate = 'https://github.com/personalizedrefrigerator/joplin-voice-typing-test/releases/download/test-release/{task}.zip';


I'm currently hosting the built Whisper models on this GitHub repository under my personal account. It may make sense to host these files elsewhere (e.g. under a GitHub repo in the Joplin organization).
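The `{task}` placeholder in the default URL template is presumably replaced with the name of the model to download. A minimal sketch of that fallback-and-substitute logic, assuming a plain string substitution; `trimTrailingSlashes`, `modelDownloadUrl` and the `whisper-tiny` task name are hypothetical stand-ins, not Joplin's `rtrimSlashes` or the actual code in `whisper.ts`:

```typescript
const defaultUrlTemplate =
	'https://github.com/personalizedrefrigerator/joplin-voice-typing-test/releases/download/test-release/{task}.zip';

// Hypothetical stand-in for rtrimSlashes: removes any trailing "/" characters.
function trimTrailingSlashes(url: string): string {
	return url.replace(/\/+$/, '');
}

// Hypothetical: builds a model download URL from the voiceTypingBaseUrl setting,
// falling back to the default template when the setting is empty.
function modelDownloadUrl(baseUrlSetting: string, task: string): string {
	let urlTemplate = trimTrailingSlashes(baseUrlSetting.trim());
	if (!urlTemplate) {
		urlTemplate = defaultUrlTemplate;
	}
	// Replace every occurrence of the placeholder (split/join avoids regex escaping).
	return urlTemplate.split('{task}').join(encodeURIComponent(task));
}

console.log(modelDownloadUrl('', 'whisper-tiny'));
// https://github.com/personalizedrefrigerator/joplin-voice-typing-test/releases/download/test-release/whisper-tiny.zip
```

With this structure, hosting the models elsewhere would only require changing the default template (or the voiceTypingBaseUrl setting).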

personalizedrefrigerator added 12 commits September 25, 2024 21:29

Broken ONNX Whisper usage from RN (sequence_length error)

590ff07

Android: Partially-working Whisper for voice typing

ebb77a4

Refactoring, move ONNX processing to new thread

6f2f2c1

Remove now-unused React Native ONNX library

37878b9

The React Native ONNX library was failing to load Whisper (reporting an incorrect sequence_length). Similar code that uses the Kotlin/Java ONNX library works fine.

Change how data is processed: Maintain a buffer of audio history and provide that to Whisper at each time step, rather than only the latest 5-10s

491b79d

Improve voice typing with long input sequences

b1b9cd2

Fix speech postprocessing, language codes

6a2e836

Refactoring

8684f18

Allow switching between VOSK and Whisper

349ee50

Update Whisper output more frequently

5085ed7

Prefer whisper-tiny

0e15125

Request RECORD_AUDIO permission before starting voice typing

d5e6453

personalizedrefrigerator marked this pull request as draft October 1, 2024 00:03

personalizedrefrigerator added 2 commits September 30, 2024 18:12

Fixing failing tests

df29dcc

Improve commenting

3f6bd5c

personalizedrefrigerator commented Oct 1, 2024


personalizedrefrigerator and others added 2 commits September 30, 2024 18:26

spaces -> tabs

9fcacb6

Merge branch 'dev' into pr/whisper

685bacb

personalizedrefrigerator mentioned this pull request Oct 1, 2024

Enhance voice typing with AI #10282

Open

personalizedrefrigerator marked this pull request as ready for review October 5, 2024 16:49

personalizedrefrigerator changed the title ~~Proof of concept: Android: Allow switching the voice typing library to Whisper~~ Android: Allow switching the voice typing library to Whisper Oct 5, 2024

Merge branch 'dev' into pr/whisper

121c0a4

Merge remote-tracking branch 'upstream/dev' into pr/whisper

cf25009

personalizedrefrigerator marked this pull request as draft October 14, 2024 20:22

Fix Vosk model UUID stored in wrong location

d1681cd

personalizedrefrigerator marked this pull request as ready for review October 14, 2024 20:52

personalizedrefrigerator added 3 commits October 14, 2024 13:59

Fix getUuidPath missing from vosk.ts

99f00b7

Fix voice typing dialog text wrap

d4c425b

Improve sentence break heuristic

625eade

personalizedrefrigerator and others added 2 commits October 14, 2024 17:46

Improve preview/finalizing when there is a long pause after the last content.

e51f41e

Merge branch 'dev' into pr/whisper

af588cd

personalizedrefrigerator commented Oct 15, 2024

