musicaiz.tokenizers.REMITokenizer

class musicaiz.tokenizers.REMITokenizer(file: Union[str, TextIO, pathlib.Path], args: Optional[musicaiz.tokenizers.remi.REMITokenizerArguments] = None)

This class provides methods to compute the REMI encoding. The REMI encoding for piano pieces (mono-track) was introduced in: Huang, Y. S., & Yang, Y. H. (2020, October). Pop music transformer: Beat-based modeling and generation of expressive pop piano compositions. In Proceedings of the 28th ACM International Conference on Multimedia (pp. 1180-1188).

For multi-track pieces, the REMI encoding was adapted by: Zeng, M., Tan, X., Wang, R., Ju, Z., Qin, T., & Liu, T. Y. (2021). Musicbert: Symbolic music understanding with large-scale pre-training. arXiv preprint arXiv:2106.05630.

This implementation handles both mono-track and multi-track pieces.

This encoding divides an X/4 bar into 16 sub-beats, which means that each quarter note (crotchet) is divided into 4 sub-beats (16th notes). To give developers more control over the beat division, that value can be changed to other divisions as a function of the selected note length. The music is quantized but, as with the sub-beat tokens, the quantize argument specifies whether to quantize or not. Note durations are expressed as symbolic lengths, e.g., a duration equal to 1 is a whole note and a duration of 16 is a 16th note.
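The grid arithmetic described above can be sketched in plain Python. This is only an illustration and not musicaiz code: the function names, the 4/4 bar, and the round-to-nearest quantization rule are assumptions made for the example.

```python
from fractions import Fraction

# Assumed default grid: 4 sub-beats per quarter note (16th notes).
SUBBEATS_PER_QUARTER = 4


def subbeats_in_bar(quarters_per_bar: int = 4,
                    subbeats_per_quarter: int = SUBBEATS_PER_QUARTER) -> int:
    """Number of grid positions in a bar (16 for a 4/4 bar on a 16th-note grid)."""
    return quarters_per_bar * subbeats_per_quarter


def duration_in_subbeats(symbolic: int,
                         subbeats_per_quarter: int = SUBBEATS_PER_QUARTER) -> Fraction:
    """Length of a 1/symbolic note in sub-beats (1 = whole note, 16 = 16th note)."""
    # A whole note spans 4 quarter notes, so a 1/symbolic note spans 4/symbolic quarters.
    return Fraction(4 * subbeats_per_quarter, symbolic)


def quantize_onset(onset_in_quarters: float,
                   subbeats_per_quarter: int = SUBBEATS_PER_QUARTER) -> int:
    """Snap an onset (in quarter notes from the bar start) to a 1-based grid position."""
    return round(onset_in_quarters * subbeats_per_quarter) + 1
```

With these hypothetical helpers, `duration_in_subbeats(1)` spans the whole 16-position grid, while `duration_in_subbeats(16)` occupies a single position.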

This hierarchical tokenization is organized as follows:

Bar -> [BAR] Position from 1/16 to 16/16
Position -> [POS=1/16] [TEMPO=X] [INST=X] [PITCH=X] [DUR=1] [VEL=X] …

Note that if a position or sub-beat does not contain notes, it will not be present in the tokenization. This avoids useless or “empty” tokens.
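The layout above, including the omission of empty positions, can be illustrated with a small standalone sketch. It does not use or reproduce the musicaiz implementation: the `Note` tuple, the `bar_to_tokens` helper, and the exact token spellings are hypothetical and only mirror the scheme shown in this section.

```python
from collections import defaultdict
from typing import List, NamedTuple


class Note(NamedTuple):
    """Hypothetical note record used only for this example."""
    position: int  # 1-based sub-beat inside the bar
    inst: int      # instrument (program) number
    pitch: int     # MIDI pitch
    dur: int       # symbolic duration (1 = whole note, 16 = 16th note)
    vel: int       # MIDI velocity


def bar_to_tokens(notes: List[Note], tempo: int, subbeats: int = 16) -> List[str]:
    """Emit BAR, then one POS group per sub-beat that actually contains notes."""
    by_position = defaultdict(list)
    for note in notes:
        by_position[note.position].append(note)

    tokens = ["BAR"]
    # Only occupied positions are visited, so empty sub-beats produce no tokens.
    for position in sorted(by_position):
        tokens.append(f"POS={position}/{subbeats}")
        tokens.append(f"TEMPO={tempo}")
        for note in by_position[position]:
            tokens.extend([
                f"INST={note.inst}",
                f"PITCH={note.pitch}",
                f"DUR={note.dur}",
                f"VEL={note.vel}",
            ])
    return tokens


example = bar_to_tokens(
    [Note(position=1, inst=0, pitch=60, dur=4, vel=90),
     Note(position=9, inst=0, pitch=64, dur=4, vel=90)],
    tempo=120,
)
```

Here `example` contains position groups only for sub-beats 1 and 9; the fourteen empty sub-beats of the bar contribute nothing to the sequence.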

Attributes

file: Optional[Union[str, TextIO, Path]] = None

__init__(file: Union[str, TextIO, pathlib.Path], args: Optional[musicaiz.tokenizers.remi.REMITokenizerArguments] = None)

Methods

`__init__`(file[, args])
`add_token_to_vocabulary`()
`get_tokens_analytics`(tokens)
`get_vocabulary`([vocab_filename])	This method gets the vocabulary of a tokenized dataset from all the token-sequences.txt files in the directory dataset_path.
`split_tokens_by_bar`(piece_tokens)	Split tokens list by bar
`split_tokens_by_subbeat`(piece_tokens)	Split tokens list by subbeat
`to_txt`(all_files_tokens, file_name, path)
`tokenize_bars`([tokens])	This method tokenizes a given list of musicaiz bar objects.
`tokenize_file`()	This method tokenizes a Musa (MIDI) object.
`tokens_to_musa`(tokens[, sub_beat, resolution])	Converts a valid token sequence (str) into Musa objects.
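As a rough idea of what splitting a token list by bar involves, the following standalone sketch groups tokens at each bar marker. It is not the library's `split_tokens_by_bar`: the function name `split_by_bar_marker`, the assumption that bars start with a `BAR` (or `BAR=...`) token, and the choice to drop tokens that precede the first bar are all hypothetical.

```python
from typing import List


def split_by_bar_marker(tokens: List[str], bar_token: str = "BAR") -> List[List[str]]:
    """Group a flat token list into one sub-list per bar."""
    bars: List[List[str]] = []
    for token in tokens:
        if token == bar_token or token.startswith(bar_token + "="):
            # A bar marker opens a new group.
            bars.append([token])
        elif bars:
            # Any other token belongs to the most recently opened bar.
            bars[-1].append(token)
        # Tokens seen before the first bar marker are dropped in this sketch.
    return bars


flat = ["BAR", "POS=1/16", "PITCH=60", "BAR", "POS=5/16", "PITCH=62"]
grouped = split_by_bar_marker(flat)
```

For the `flat` list above, `grouped` holds two sub-lists, each starting with its own `BAR` token.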