Replies: 3 comments
-
@VladPetk Hey Vlad! Thank you so much for the update and for sharing your work and results on GitHub. It is very helpful and useful :) I wanted to give you a couple of suggestions to improve your work and results:
I have three that I made just for that purpose: all three datasets at this link are great for BPE IMHO, as they are much more balanced and homogeneous. It is a well-known fact that training on music which is separated into channels produces much better results. Solo instruments do not produce good results with auto-regressive models. So try that first.
BPE_DRAFT_DEMO_DATA_AND_TRAIN.zip

To be honest, BPE does not work well with music for many reasons (from my experience). Primarily because the numerical complexity of music is much greater than that of text or images. For example, a triplet encoding with 128 values for time, 128 for durations, and 128 for pitches would require 128^3 (over two million) combinations, which with BPE produces a dictionary far too large for current models/architectures to handle. So the main and most important step is to develop the most compressed encoding for music if you want to use BPE.

I can also recommend that you not use mixed datasets such as MMD. Instead, try to use more homogeneous datasets, like POP909+POP1k7 for pop music, or ASAP+GiantMIDI+ATEPP for classical music. This should also help to improve the results. However, as I said before, it is best to use music split into parts/channels, so you can also try to make a custom dataset from MMD or LAKH by selecting MIDIs that have piano in parts/channels.

Last but not least, here is a sample I made a long time ago which used BPE and the POP909 dataset. While it played well, the music was not really beautiful or memorable, so there is also work that needs to be done to fix that.

Anyway, if you have any questions about any of it or if you need help/advice from me about any of that stuff, feel free to reach out at any time. And thank you again for sharing the BPE work/results :) Happy Holidays to you! Alex.

PS. For long seq_len I recommend using torch.amp (fp16 precision) and torch.cuda.sdp_kernel (memory-efficient or flash attention).
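A minimal sketch of that PS in PyTorch 2.x terms; `model`, `optimizer`, and `loader` are placeholders for your own training setup:

```python
# Hedged sketch: fp16 autocast + fused attention kernels for long seq_len.
# `model`, `optimizer`, and `loader` are assumed to already exist.
import torch

scaler = torch.cuda.amp.GradScaler()  # rescales the loss to avoid fp16 underflow

for batch in loader:
    optimizer.zero_grad(set_to_none=True)
    # Restrict scaled-dot-product attention to the memory-efficient/flash kernels
    with torch.backends.cuda.sdp_kernel(
        enable_flash=True, enable_mem_efficient=True, enable_math=False
    ):
        with torch.cuda.amp.autocast(dtype=torch.float16):
            loss = model(batch.cuda())  # assumes the model returns its loss
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
```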
-
Hi Alex, apologies for the belated reply. I wanted to try out your suggestions first :)

First of all, the idea about using more homogeneous datasets definitely worked, thanks for that! I constructed a classical piano dataset (approx. 3,200 pieces only) and actually managed to train a model on it that performed better in terms of accuracy (~84%) compared to the one trained on 60K piano pieces (~82%). It's strange, I didn't expect a large model to learn effectively from such a small amount of data, but here we are. Though I guess the improvement in accuracy is not only due to the similar(ish) genres of the pieces - I also have more confidence in the quality of those MIDIs.

About feeding the data to the model sequentially: I ran a few tests, and doing it sequentially led to a lot of instability in training (as might be expected in ML generally, afaik). But perhaps you did it differently? I also tried using padding instead of separator tokens, but both approaches exhibited pretty much the same performance.

About REMI. You're absolutely right that an encoding like that can alter a piece. That's why I only used quantized pieces; then REMI changes pretty much nothing (except for the chosen velocity bins, etc., of course). The motivation behind using REMI was to give the model more info about the timing of the piece (bar positions and time signatures), and it does seem to work, as the generated output stays within bars and fits neatly if transcribed into notes. This bit was important for me, as I envision my model as a kind of helper/inspiration in making music, so being able to effortlessly transfer the output into a DAW was very important. Now that I think of it, though, I haven't tried using a more relaxed encoding on quantized input - perhaps it would work just as well and even reduce the number of tokens needed.

And about BPE. You might very well be right. After doing some more training on various models, it does seem that no-BPE may be just as good. Still worth a shot, I think, if you have lots of data and are going for very long sequences.

And finally, I've also been using mixed precision - it's great! Though for my purposes I don't need long sequences.

Happy holidays to you! And thanks again for all this. Vlad
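PS. The two batching strategies mentioned above (separator tokens vs. padding) look roughly like this as a toy sketch; `SEP_ID`, `PAD_ID`, and `MAX_LEN` are illustrative values, not the ones from these experiments:

```python
import torch

SEP_ID, PAD_ID, MAX_LEN = 1, 0, 1024  # illustrative token ids / context size

def pack_with_separators(pieces, max_len=MAX_LEN, sep_id=SEP_ID):
    """Concatenate pieces with a separator token, then chunk into windows."""
    stream = []
    for piece in pieces:
        stream.extend(piece)
        stream.append(sep_id)
    return [stream[i:i + max_len] for i in range(0, len(stream), max_len)]

def pad_each_piece(pieces, max_len=MAX_LEN, pad_id=PAD_ID):
    """One (possibly truncated) piece per row, right-padded to max_len."""
    rows = [p[:max_len] + [pad_id] * max(0, max_len - len(p)) for p in pieces]
    return torch.tensor(rows)
```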
-
@VladPetk I am glad my suggestions were helpful and useful :) If you want to discuss all this further, feel free to write at any time :) Alex
-
Hey Alex,
I've finally gotten the first results from training a solo piano model with byte-pair encoding.
First, thanks for all your work. I've used quite a bit of your code in my project.
I tried out several approaches but settled on using the x-transformers model trained on a REMI-encoded subset of MMD data.
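For reference, this is roughly what such an x-transformers setup looks like; the hyperparameters below are illustrative placeholders, not my exact settings:

```python
# Hedged sketch of an autoregressive decoder built with lucidrains'
# x-transformers; every hyperparameter here is an illustrative placeholder.
import torch
from x_transformers import TransformerWrapper, Decoder, AutoregressiveWrapper

model = TransformerWrapper(
    num_tokens=2000,                      # BPE vocab size (363 without BPE)
    max_seq_len=1024,
    attn_layers=Decoder(dim=512, depth=8, heads=8),
)
model = AutoregressiveWrapper(model)      # adds shift-by-one loss and sampling

tokens = torch.randint(0, 2000, (1, 1024))  # stand-in for REMI/BPE token ids
loss = model(tokens)
loss.backward()
```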
In short, the model with BPE (vocab size of 2000 vs. 363 without BPE) performed better based on my subjective evaluations, i.e., listening to the generated output. The output was generally less confused and just more musical, so to speak. That is despite the BPE model achieving somewhat lower accuracy than the non-BPE one (70% vs. 80%).
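To make the BPE idea concrete, here is a toy, library-free sketch of the classic merge loop over token-id sequences (not the actual implementation I used):

```python
# Toy BPE: repeatedly replace the most frequent adjacent token pair with a
# new id, growing the vocab from `base_vocab` toward `target_vocab`.
from collections import Counter

def most_frequent_pair(seqs):
    pairs = Counter()
    for seq in seqs:
        pairs.update(zip(seq, seq[1:]))
    return pairs.most_common(1)[0][0] if pairs else None

def merge_pair(seq, pair, new_id):
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(new_id)  # replace the pair with the merged token
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out

def learn_bpe(seqs, base_vocab, target_vocab):
    merges, next_id = {}, base_vocab
    while next_id < target_vocab:
        pair = most_frequent_pair(seqs)
        if pair is None:
            break
        merges[pair] = next_id
        seqs = [merge_pair(s, pair, next_id) for s in seqs]
        next_id += 1
    return merges, seqs

# e.g. merges, compressed = learn_bpe(seqs, base_vocab=363, target_vocab=2000)
```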
Also, I found that using REMI encoding (vs. structured, a version of which I believe you use - at least in this repo) performs better in terms of rhythm. That's probably due to it having bar tokens and specifying the relative positions of notes in a bar. Of course, to achieve that I used only quantized MIDIs, which in itself probably also improved the rhythmic structure.
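If anyone wants to try REMI quickly, the miditok library provides it. A hedged sketch (miditok's API has changed between versions, so the names below follow the 2.x style and may not match yours):

```python
# Hedged sketch of REMI tokenization via miditok (2.x-style API).
from miditok import REMI, TokenizerConfig

config = TokenizerConfig(
    use_time_signatures=True,   # emit TimeSig tokens alongside Bar/Position
    num_velocities=32,          # number of velocity bins (illustrative)
)
tokenizer = REMI(config)
tokens = tokenizer("quantized_piece.mid")  # placeholder path to a MIDI file
print(tokens)  # Bar / Position / Pitch / Velocity / Duration tokens
```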
I started creating a repo; it has a more detailed description of my results. I will try to add more output and details to it soon. https://github.com/VladPetk/Piano_music_transformer/tree/main
I was not really interested in creating whole pieces. I was after generating nice ideas/continuations (as I like to dabble in composition in my free time), so a max_seq_len of 1024 was more than enough for me for now. But if you're interested in generating full compositions, using BPE might have an advantage there too, as it effectively compresses the data and you can fit more of it into the same max_seq_len.
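A quick way to estimate that gain, assuming `raw_seqs` and `bpe_seqs` are placeholder lists holding the same pieces tokenized before and after BPE:

```python
# Estimate how much more music fits in a fixed context window after BPE.
raw_total = sum(len(s) for s in raw_seqs)
bpe_total = sum(len(s) for s in bpe_seqs)
ratio = raw_total / bpe_total
print(f"BPE compression: {ratio:.2f}x")
print(f"A 1024-token window now covers ~{1024 * ratio:.0f} base tokens")
```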
Hope this info is helpful!
Vlad