# Variational Inference with adversarial learning for end-to-end Singing Voice Conversion based on VITS
中文文档 (Chinese documentation)

- The `bigvgan-mix-v2` branch has good audio quality.
- The `RoFormer-HiFTNet` branch has fast inference speed.
- No more upgrades are planned.

About this project:

- This project targets deep-learning beginners; basic knowledge of Python and PyTorch is a prerequisite.
- This project aims to help deep-learning beginners move past boring, purely theoretical study and master the basics by combining them with practice.
- This project does not support real-time voice conversion (whisper would need to be replaced if real-time conversion is what you are looking for).
- This project will not develop one-click packages for other purposes.

Features:

- A minimum of 6 GB VRAM is required for training.
- Support for multiple speakers.
- Create unique speakers through speaker mixing.
- It can even convert voices with light accompaniment.
- You can edit F0 using Excel.

AI_Elysia_LoveStory.mp4

Powered by @ShadowVap

## Model properties

| Feature | From | Status | Function |
| :--- | :--- | :--- | :--- |
| whisper | OpenAI | ✅ | strong noise immunity |
| bigvgan | NVIDIA | ✅ | anti-aliasing and snake activation |
| natural speech | Microsoft | ✅ | reduces mispronunciation |
| neural source-filter | NII | ✅ | solves the problem of F0 discontinuity |
| speaker encoder | Google | ✅ | timbre encoding and clustering |
| GRL for speaker | Ubisoft | ✅ | prevents timbre leakage from the encoder |
| SNAC | Samsung | ✅ | one-shot cloning for VITS |
| SCLN | Microsoft | ✅ | improves cloning |
| Diffusion | HuaWei | ✅ | improves sound quality |
| PPG perturbation | this project | ✅ | improves noise immunity and de-timbre |
| HuBERT perturbation | this project | ✅ | improves noise immunity and de-timbre |
| VAE perturbation | this project | ✅ | improves sound quality |
| MIX encoder | this project | ✅ | improves conversion stability |
| USP infer | this project | ✅ | improves conversion stability |
| HiFTNet | Columbia University | ✅ | NSF-iSTFTNet for speedup |
| RoFormer | Zhuiyi Technology | ✅ | rotary positional embedding |

Because data perturbation is used, training takes longer than in comparable projects.

USP: Unvoiced and Silence with Pitch during inference.
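To make the USP idea concrete: during inference, unvoiced and silent frames (where F0 = 0) are given pitch values rather than being left empty, which stabilizes conversion. Below is a minimal, illustrative numpy sketch of that reading; the function name `fill_unvoiced_pitch` is hypothetical, and this is not the project's actual implementation.

```python
import numpy as np

def fill_unvoiced_pitch(f0: np.ndarray) -> np.ndarray:
    """Illustrative only: interpolate pitch through unvoiced/silent
    frames (f0 == 0) so every frame carries a pitch value."""
    f0 = f0.astype(np.float64)
    voiced = f0 > 0
    if not voiced.any():
        return f0  # nothing to anchor the interpolation on
    idx = np.arange(len(f0))
    # Linear interpolation between neighbouring voiced frames;
    # np.interp holds the edges at the nearest voiced value.
    f0[~voiced] = np.interp(idx[~voiced], idx[voiced], f0[voiced])
    return f0

# Example: a short contour with an unvoiced gap in the middle.
print(fill_unvoiced_pitch(np.array([220.0, 0.0, 0.0, 233.1, 0.0])))
```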
Leveraging Content-based Features from Multiple Acoustic Models for Singing Voice Conversion

Plug-In-Diffusion

## Setup Environment

1. Install PyTorch.

2. Install project dependencies:

   ```shell
   pip install -i https://pypi.tuna.tsinghua.edu.cn/simple -r requirements.txt
   ```

   Note: whisper is already built in; do not install it again, otherwise it will cause conflicts and errors.

3. Download the timbre encoder (Speaker-Encoder by @mueller91) and put `best_model.pth.tar` into `speaker_pretrain/`.

4. Download the whisper model whisper-large-v2. Make sure to download `large-v2.pt` and put it into `whisper_pretrain/`.

5. Download the hubert_soft model and put `hubert-soft-0d54a1f4.pt` into `hubert_pretrain/`.

6. Download the pitch extractor crepe full and put `full.pth` into `crepe/assets`. Note: crepe's `full.pth` is 84.9 MB, not 6 KB.

7. Download the pretrained model `sovits5.0.pretrain.pth`, put it into `vits_pretrain/`, and test it:

   ```shell
   python svc_inference.py --config configs/base.yaml --model ./vits_pretrain/sovits5.0.pretrain.pth --spk ./configs/singers/singer0001.npy --wave test.wav
   ```

## Dataset preparation

Necessary pre-processing:

1. Separate the vocals and accompaniment with UVR (skip if there is no accompaniment).
2. Cut the audio into shorter clips with slicer; whisper takes input of less than 30 seconds.
3. Manually check the generated clips and remove any that are shorter than 2 seconds or contain obvious noise.
4. Adjust the loudness if necessary; Adobe Audition is recommended.
5. Put the dataset into the `dataset_raw` directory following the structure below.

```
dataset_raw
├───speaker0
│   ├───000001.wav
│   ├───...
│   └───000xxx.wav
└───speaker1
    ├───000001.wav
    ├───...
    └───000xxx.wav
```

## Data preprocessing

```shell
python svc_preprocessing.py -t 2
```

`-t`: number of threads; it should not exceed the CPU core count, and 2 is usually enough.

After preprocessing you will get output with the following structure:

```
data_svc/
└── waves-16k
│    └── speaker0
│    │      ├── 000001.wav
│    │      └── 000xxx.wav
│    └── speaker1
│           ├── 000001.wav
│           └── 000xxx.wav
└── waves-32k
│    └── speaker0
│    │      ├── 000001.wav
│    │      └── 000xxx.wav
│    └── speaker1
│           ├── 000001.wav
│           └── 000xxx.wav
└── pitch
│    └── speaker0
│    │      ├── 000001.pit.npy
│    │      └── 000xxx.pit.npy
│    └── speaker1
│           ├── 000001.pit.npy
│           └── 000xxx.pit.npy
└── hubert
│    └── speaker0
│    │      ├── 000001.vec.npy
│    │      └── 000xxx.vec.npy
│    └── speaker1
│           ├── 000001.vec.npy
│           └── 000xxx.vec.npy
└── whisper
│    └── speaker0
│    │      ├── 000001.ppg.npy
│    │      └── 000xxx.ppg.npy
│    └── speaker1
│           ├── 000001.ppg.npy
│           └── 000xxx.ppg.npy
└── speaker
│    └── speaker0
│    │      ├── 000001.spk.npy
│    │      └── 000xxx.spk.npy
│    └── speaker1
│           ├── 000001.spk.npy
│           └── 000xxx.spk.npy
└── singer
     ├── speaker0.spk.npy
     └── speaker1.spk.npy
```

These outputs are produced by the following steps.

Re-sampling: generate audio with a sampling rate of 16000 Hz in `./data_svc/waves-16k`:

```shell
python prepare/preprocess_a.py -w ./dataset_raw -o ./data_svc/waves-16k -s 16000
```

and with a sampling rate of 32000 Hz in `./data_svc/waves-32k`:

```shell
python prepare/preprocess_a.py -w ./dataset_raw -o ./data_svc/waves-32k -s 32000
```

Use the 16k audio to extract pitch:

```shell
python prepare/preprocess_crepe.py -w data_svc/waves-16k/ -p data_svc/pitch
```

Use the 16k audio to extract ppg:

```shell
python prepare/preprocess_ppg.py -w data_svc/waves-16k/ -p data_svc/whisper
```

Use the 16k audio to extract hubert:

```shell
python prepare/preprocess_hubert.py -w data_svc/waves-16k/ -v data_svc/hubert
```

Use the 16k audio to extract the timbre code:

```shell
python prepare/preprocess_speaker.py data_svc/waves-16k/ data_svc/speaker
```

Extract the average timbre code for inference; it can also replace the per-utterance timbre when generating the training index, acting as the speaker's unified timbre for training:

```shell
python prepare/preprocess_speaker_ave.py data_svc/speaker/ data_svc/singer
```
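For intuition, the averaging step above amounts to taking the mean of the per-utterance `*.spk.npy` timbre codes. A minimal numpy sketch under that assumption follows; `average_speaker` is a hypothetical helper, and `prepare/preprocess_speaker_ave.py` remains the authoritative implementation.

```python
import numpy as np
from pathlib import Path

def average_speaker(spk_dir: str, out_file: str) -> None:
    """Average per-utterance *.spk.npy timbre codes into a single
    speaker embedding (the unified timbre used for inference)."""
    vecs = [np.load(p) for p in sorted(Path(spk_dir).glob("*.spk.npy"))]
    assert vecs, f"no .spk.npy files found in {spk_dir}"
    np.save(out_file, np.mean(vecs, axis=0))

# e.g. average_speaker("data_svc/speaker/speaker0",
#                      "data_svc/singer/speaker0.spk.npy")
```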
Use the 32k audio to extract the linear spectrum:

```shell
python prepare/preprocess_spec.py -w data_svc/waves-32k/ -s data_svc/specs
```

Use the 32k audio to generate the training index:

```shell
python prepare/preprocess_train.py
```

Training file debugging:

```shell
python prepare/preprocess_zzz.py
```

## Train

If fine-tuning from the pre-trained model, you need to download the pre-trained model `sovits5.0.pretrain.pth`, put it under the project root, and change this line in `configs/base.yaml`:

```
pretrain: "./vits_pretrain/sovits5.0.pretrain.pth"
```

Also adjust the learning rate appropriately, e.g. 5e-5.

`batch_size`: for a GPU with 6 GB VRAM, 6 is the recommended value; 8 will work, but the step speed will be much slower.

Start training:

```shell
python svc_trainer.py -c configs/base.yaml -n sovits5.0
```

Resume training:

```shell
python svc_trainer.py -c configs/base.yaml -n sovits5.0 -p chkpt/sovits5.0/sovits5.0_***.pt
```

Log visualization:

```shell
tensorboard --logdir logs/
```

## Inference

Export the inference model (text encoder, Flow network, Decoder network):

```shell
python svc_export.py --config configs/base.yaml --checkpoint_path chkpt/sovits5.0/***.pt
```

If there is no need to adjust F0, just run the following command:

```shell
python svc_inference.py --config configs/base.yaml --model sovits5.0.pth --spk ./data_svc/singer/your_singer.spk.npy --wave test.wav --shift 0
```

If F0 will be adjusted manually, follow these steps:

Use whisper to extract the content encoding, generating `test.ppg.npy`:

```shell
python whisper/inference.py -w test.wav -p test.ppg.npy
```

Use hubert to extract the content vector, separately rather than in one click, to reduce GPU memory usage:

```shell
python hubert/inference.py -w test.wav -v test.vec.npy
```

Extract the F0 parameters into CSV text format, open the CSV file in Excel, and manually correct any wrong F0 values by referring to Audition or SonicVisualiser:

```shell
python pitch/inference.py -w test.wav -p test.csv
```

Final inference:

```shell
python svc_inference.py --config configs/base.yaml --model sovits5.0.pth --spk ./data_svc/singer/your_singer.spk.npy --wave test.wav --ppg test.ppg.npy --vec test.vec.npy --pit test.csv --shift 0
```

Notes:

- When `--ppg` is specified, running inference on the same audio multiple times avoids repeated extraction of the content encoding; if it is not specified, it is extracted automatically.
- When `--vec` is specified, running inference on the same audio multiple times avoids repeated extraction of the content vector; if it is not specified, it is extracted automatically.
- When `--pit` is specified, the manually tuned F0 parameters are loaded; if it is not specified, they are extracted automatically.
- The output file `svc_out.wav` is generated in the current directory.

Arguments:

| args | name |
| :--- | :--- |
| --config | config path |
| --model | model path |
| --spk | speaker |
| --wave | wave input |
| --ppg | wave ppg |
| --vec | wave hubert |
| --pit | wave pitch |
| --shift | pitch shift |

Post-processing by VAD:

```shell
python svc_inference_post.py --ref test.wav --svc svc_out.wav --out svc_out_post.wav
```

## Create singer

Named by pure coincidence: average -> ave -> eva; Eve (eva) represents conception and reproduction.

```shell
python svc_eva.py
```

```python
eva_conf = {
    './configs/singers/singer0022.npy': 0,
    './configs/singers/singer0030.npy': 0,
    './configs/singers/singer0047.npy': 0.5,
    './configs/singers/singer0051.npy': 0.5,
}
```

The generated singer file will be `eva.spk.npy`.
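Conceptually, the mixing in `eva_conf` is a weighted sum of the listed speaker embedding `.npy` files. A minimal numpy sketch of that idea, assuming the embeddings share one shape; `svc_eva.py` is the actual implementation.

```python
import numpy as np

# Weights mirroring the non-zero entries of the eva_conf example above.
eva_conf = {
    './configs/singers/singer0047.npy': 0.5,
    './configs/singers/singer0051.npy': 0.5,
}

# Weighted sum of speaker embeddings; zero-weight singers contribute nothing.
mix = sum(w * np.load(path) for path, w in eva_conf.items() if w > 0)
np.save('eva.spk.npy', mix)
```

The resulting `eva.spk.npy` can then be passed to `svc_inference.py` via `--spk` like any other singer file.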
## Dataset

| Name | URL |
| :--- | :--- |
| KiSing | http://shijt.site/index.php/2021/05/16/kising-the-first-open-source-mandarin-singing-voice-synthesis-corpus/ |
| PopCS | https://github.com/MoonInTheRiver/DiffSinger/blob/master/resources/apply_form.md |
| opencpop | https://wenet.org.cn/opencpop/download/ |
| Multi-Singer | https://github.com/Multi-Singer/Multi-Singer.github.io |
| M4Singer | https://github.com/M4Singer/M4Singer/blob/master/apply_form.md |
| CSD | https://zenodo.org/record/4785016#.YxqrTbaOMU4 |
| KSS | https://www.kaggle.com/datasets/bryanpark/korean-single-speaker-speech-dataset |
| JVS MuSic | https://sites.google.com/site/shinnosuketakamichi/research-topics/jvs_music |
| PJS | https://sites.google.com/site/shinnosuketakamichi/research-topics/pjs_corpus |
| JSUT Song | https://sites.google.com/site/shinnosuketakamichi/publication/jsut-song |
| MUSDB18 | https://sigsep.github.io/datasets/musdb.html#musdb18-compressed-stems |
| DSD100 | https://sigsep.github.io/datasets/dsd100.html |
| Aishell-3 | http://www.aishelltech.com/aishell_3 |
| VCTK | https://datashare.ed.ac.uk/handle/10283/2651 |
| Korean Songs | http://urisori.co.kr/urisori-en/doku.php/ |

## Code sources and references

- https://github.com/facebookresearch/speech-resynthesis (paper)
- https://github.com/jaywalnut310/vits (paper)
- https://github.com/openai/whisper/ (paper)
- https://github.com/NVIDIA/BigVGAN (paper)
- https://github.com/mindslab-ai/univnet (paper)
- https://github.com/nii-yamagishilab/project-NN-Pytorch-scripts/tree/master/project/01-nsf
- https://github.com/huawei-noah/Speech-Backbones/tree/main/Grad-TTS
- https://github.com/brentspell/hifi-gan-bwe
- https://github.com/mozilla/TTS
- https://github.com/bshall/soft-vc
- https://github.com/maxrmorrison/torchcrepe
- https://github.com/MoonInTheRiver/DiffSinger
- https://github.com/OlaWod/FreeVC (paper)
- https://github.com/yl4579/HiFTNet (paper)
- One-shot Voice Conversion by Separating Speaker and Content Representations with Instance Normalization
- SNAC: Speaker-normalized Affine Coupling Layer in Flow-based Architecture for Zero-Shot Multi-Speaker Text-to-Speech
- Adapter-Based Extension of Multi-Speaker Text-to-Speech Model for New Speakers
- AdaSpeech: Adaptive Text to Speech for Custom Voice
- AdaVITS: Tiny VITS for Low Computing Resource Speaker Adaptation
- Cross-Speaker Prosody Transfer on Any Text for Expressive Speech Synthesis
- Learn to Sing by Listening: Building Controllable Virtual Singer by Unsupervised Learning from Voice Recordings
- Adversarial Speaker Disentanglement Using Unannotated External Data for Self-supervised Representation Based Voice Conversion
- Multilingual Speech Synthesis and Cross-Language Voice Cloning: GRL
- RoFormer: Enhanced Transformer with rotary position embedding

## Method of preventing timbre leakage based on data perturbation

- https://github.com/auspicious3000/contentvec/blob/main/contentvec/data/audio/audio_utils_1.py
- https://github.com/revsic/torch-nansy/blob/main/utils/augment/praat.py
- https://github.com/revsic/torch-nansy/blob/main/utils/augment/peq.py
- https://github.com/biggytruck/SpeechSplit2/blob/main/utils.py
- https://github.com/OlaWod/FreeVC/blob/main/preprocess_sr.py

## Contributors

- https://github.com/Francis-Komizu/Sovits

## Relevant projects

- LoRA-SVC: decoder-only svc
- Grad-SVC: diffusion-based svc

## Original evidence

- 2022.04.12 https://mp.weixin.qq.com/s/autNBYCsG4_SvWt2-Ll_zA
- 2022.04.22 https://github.com/PlayVoice/VI-SVS
- 2022.07.26 https://mp.weixin.qq.com/s/qC4TJy-4EVdbpvK2cQb1TA
- 2022.09.08 https://github.com/PlayVoice/VI-SVC

This project was copied by svc-develop-team/so-vits-svc.