Huggingface Model Matrix
The Huggingface models supported by SoloDesk AI are those that have been tested on a reference machine (REF2026) and found to run on consumer-grade hardware.
The model information matrix gets updated periodically as new models are added and new testing is performed.
The last update was on March 9, 2026.
| Model Name | Disk (GB) | GPU Memory (GB) | Source | Parameters | Description | Supported | Notes |
|---|---|---|---|---|---|---|---|
| Stable Diffusion 1.5 | 5.1 | 3.2 | stable-diffusion-v1-5/stable-diffusion-v1-5 | 983M | Text+image+mask to image. | Yes | The foundational Stable Diffusion model. The basis of many derivatives |
| Stable Diffusion XL 1.0 Base | 13.2 | 8.2 | stabilityai/stable-diffusion-xl-base-1.0 | 2.6B | Text+image+mask to image. | Yes | The foundational Stable Diffusion XL model. The basis of many derivatives |
| Amused | 3.27 | 3.4 | amused/amused-512 | | Lightweight text+image+mask to image. | Yes | Fast and light weight. Marginal quality. Deprecated library support. |
| Lumina 2 | 19.7 | 12.5 | Alpha-VLLM/Lumina-Image-2.0 | 2B | High quality text-to-image generator. | Yes | Use this model when text is desired. |
| Sana | 15 | 12.4 | Efficient-Large-Model/Sana_1600M_1024px_diffusers | 1.6B | Fast text-to-image model from Nvidia. | Yes | Among the fastest at 1024x1024 renders. Marginal quality. |
| PixArt Alpha | 20.3 | 11.6 | PixArt-alpha/PixArt-XL-2-1024-MS | 1.2B | High quality text-to-image generator. | Yes | Seems unnecessary now that PixArt Sigma exists. |
| PixArt Sigma | 20.3 | 11.6 | PixArt-alpha/PixArt-Sigma-XL-2-1024-MS | 1.2B | High quality text-to-image generator. | Yes | High quality 1536x768 renders. |
| Kolors | 16.5 | 6.1 | Kwai-Kolors/Kolors-diffusers | | Text/image-to-image generator. | Yes | Now runs smoothly with recent memory management changes. |
| Illustrious | 6.8 | 6.7 | OnomaAIResearch/Illustrious-XL-v2.0 | 3.5B | Text+image to image. | Yes | Based on SDXL architecture. Used for cartoon-style illustrations. |
| Pony Diffusion V6 | 6.5 | 8.3 | stablediffusionapi/Pony-Diffusion-V6-XL | 3.5B | Text+image to image. | Yes | Based on SDXL architecture. |
| Ultra Epic AI Realism | 6.5 | 8.2 | stablediffusionapi/ultraepicairealism-v10 | 2.6B | Text+image+mask-to-image. | Yes | A realistic and uncensored Stable Diffusion SDXL derivative. |
| Z-Image Base | 19.1 | 14.8 | Tongyi-MAI/Z-Image | 6B | A high quality text+image-to-image model from Alibaba. | Yes | Huggingface version works. Excellent text rendering. Civitai versions are fragmented. |
| Z-Image Turbo | 30.5 | 14.8 | Tongyi-MAI/Z-Image-Turbo | 6B | A high quality text+image-to-image model from Alibaba. | Yes | Huggingface version works. Civitai versions are fragmented. |
| Stable Video Diffusion | 4.2 | 6-23 | stabilityai/stable-video-diffusion-img2vid | 1.7B | 14-frame image-to-video | Yes | Loads about 6GB, runs about 11GB, and has a big memory spike at finish. |
| Stable Video Diffusion XT | 4.2 | 6-23 | stabilityai/stable-video-diffusion-img2vid-xt | | 25-frame image-to-video | Yes | User’s machine should meet the full specs of REF2026 to run without issues. |
| Stable Video Diffusion XT 1.1 | | | stabilityai/stable-video-diffusion-img2vid-xt-1-1 | | Image-to-video. | No | Gated model. Not accessible from downloader. Might work, but never tested. |
| Stable Video 3D (SV3D) | | | stabilityai/sv3d | | Image-to-3D. | No | Gated model. Not accessible from downloader. |
| LTX Video | 26.4 | 5.5 | Lightricks/LTX-Video | 2B | Text+image to video. | Yes | Renders 49 frames in 33 seconds on REF2025, 22 seconds on REF2026. |
| LTX-2 Video | >26.2 | | Lightricks/LTX-2 | 19B | Text+image to video. | No | Extra large model will require special hacks to run on consumer grade hardware. |
| Wan Video 2.1 | 27 | 13.1 | Wan-AI/Wan2.1-T2V-1.3B-Diffusers | 1.3B | Text-to-video. | Yes | Renders 640x480 at about 4 seconds/frame on REF2026. |
| Wan Video 2.2 | | | Wan-AI/Wan2.2-TI2V-5B-Diffusers | 5B | Text+Image to video. | Yes | Uses an image prompt and a text prompt to guide the motion. |
| AnimateDiff | 1.6 | | guoyww/animatediff-motion-adapter-v1-5-3 | | Motion adapter. Makes SD 1.5 models do short videos. | Yes | Designed for 16 frame renders. |
| AnimateDiffXL | | | guoyww/animatediff-motion-adapter-sdxl-beta | | Motion adapter. Makes SDXL models do short videos. | Yes | Designed for 16 frame renders. |
| Sky Reels V2 | | | Skywork/SkyReels-V2-I2V-1.3B-540P | 1.3B | Image-to-Video | No | Planned for testing. |
| Sky Reels V2 | | | Skywork/SkyReels-V2-DF-1.3B-540P | 1.3B | Text-to-video | No | Planned for testing. |
| Audio LDM2 | 4.2 | 3.1 | cvssp/audioldm2 | 1.1B | Text-to-audio. | Yes | Light weight and fast render. |
| Audio LDM2 Large | 5.9 | 3.9 | cvssp/audioldm2-large | 1.5B | Text-to-audio. | Yes | Light weight and fast render despite “large” model characterization. |
| Music LDM2 | 4.2 | 3.1 | cvssp/audioldm2-music | 1.1B | Text-to-audio music model. | Yes | Can generate 20 seconds of audio in 26 seconds on REF2025, 13 seconds on REF2026. |
| Stable Audio Open 1.0 | | | stabilityai/stable-audio-open-1.0 | | Text-to-audio | No | Gated model. Not accessible from downloader. |
| MusicGen-Melody | | | facebook/musicgen-melody | 1.5B | Text-to-audio | No | To be tested. |
| MusicGen-Medium | 16 | 14.9 | facebook/musicgen-medium | 1.5B | Text-to-audio music model. | Yes | Renders 30 seconds of audio in 49 seconds on REF2026. |
| MusicGen-Small | 2.3 | 3.9 | facebook/musicgen-small | 300M | A fast text-to-audio music model. | Yes | Renders 30 seconds of audio in 17 seconds on REF2026. |
| MusicGen-Stereo-Medium | 4.1 | 9.4 | facebook/musicgen-stereo-medium | 1.5B | A faster, stereophonic version of MusicGen-Medium. | Yes | Renders 30 seconds of audio in 30 seconds on REF2026. |
| MusicGen-Stereo-Small | 1.2 | 2.4 | facebook/musicgen-stereo-small | 300M | A stereophonic version of MusicGen-Small. | Yes | Renders 30 seconds of audio in 19 seconds on REF2026. |
| CSM 1B | | | sesame/csm-1b | 1B | Text-to-speech | No | Gated model. Not accessible from downloader. |
| Marvis TTS | 2.9 | 3.5 | Marvis-AI/marvis-tts-250m-v0.2-transformers | 250M | Text-to-speech | Yes | Based on the gated Sesame CSM-1B model. Currently supports only default speaker. |
| Suno Bark | 4.2 | 5.7 | suno/bark | | Text-to-speech | Yes | Supports default speaker as well as speaker embedding files. |
| VibeVoice | | | microsoft/VibeVoice-1.5B | 1.5B | Text-to-voice | No | Planned for testing pending Transformers integration. |
| ShapE Text | 3.3 | 3.3-4.6 | openai/shap-e | | Text-to-3D mesh. | Yes | Fast and light weight. Marginal quality. |
| ShapE Image | 4 | 3.6 | openai/shap-e-img2img | | Image-to-3D mesh. | Yes | Fast and light weight. Marginal quality. |
| Hunyuan3D 2.0 Single View | 4.9 | 7.1 | tencent/Hunyuan3D-2 | | Image to 3D | Yes | 3D mesh only. Textures not yet supported. |
| Hunyuan3D 2.0 Multiview | 4.9 | 7.1 | tencent/Hunyuan3D-2mv | | Image to 3D | Yes | 3D mesh only. Textures not yet supported. |
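
The Disk and GPU Memory columns can be used to shortlist models for a given machine. The following is a minimal sketch of that lookup, not part of SoloDesk: the `Model` tuple and the `models_that_fit` helper are hypothetical names, only a handful of rows are copied from the matrix, and a range such as 6-23 is represented by its upper end. The figures were measured on the reference machine, so actual usage on other hardware may differ.

```python
from typing import NamedTuple


class Model(NamedTuple):
    name: str
    disk_gb: float
    gpu_gb_max: float  # upper end of the GPU Memory column


# A few rows copied from the matrix (illustrative subset only).
MODELS = [
    Model("Stable Diffusion 1.5", 5.1, 3.2),
    Model("Stable Diffusion XL 1.0 Base", 13.2, 8.2),
    Model("Lumina 2", 19.7, 12.5),
    Model("Stable Video Diffusion", 4.2, 23.0),  # listed as 6-23
    Model("MusicGen-Small", 2.3, 3.9),
]


def models_that_fit(models, gpu_gb, free_disk_gb):
    """Return names of models whose listed needs fit the given budget."""
    return [
        m.name
        for m in models
        if m.gpu_gb_max <= gpu_gb and m.disk_gb <= free_disk_gb
    ]


# Example: an 8 GB GPU with 50 GB of free disk space.
print(models_that_fit(MODELS, gpu_gb=8.0, free_disk_gb=50.0))
# → ['Stable Diffusion 1.5', 'MusicGen-Small']
```

Using the upper end of a memory range is the conservative choice, since the table notes that some video models spike well above their steady-state usage.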
For the Huggingface models, the user can download the model files directly from the Huggingface repositories or use the downloader utility, which is invoked from a menu in the SoloDesk user interface. Models are downloaded in the Huggingface diffusers format, which is stored as a Windows folder containing multiple files and subfolders. Because SoloDesk reads this format directly, the user can also use a third-party downloader of their choice. The folder-based format also lets users keep their model archive on a drive separate from the Windows operating system drive.
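
As an illustration of the third-party route, the sketch below downloads a repository into a folder on an arbitrary drive using `snapshot_download` from the `huggingface_hub` package. The `owner/name` folder layout, the helper names, and the `D:/model-archive` path are assumptions made for this example; the layout SoloDesk actually expects is not specified here. The final helper relies on the fact that diffusers pipelines normally ship a top-level `model_index.json`, which Transformers-based models such as MusicGen do not have.

```python
from pathlib import Path


def local_model_dir(archive_root, repo_id):
    """Map 'owner/name' to <archive_root>/owner/name (assumed layout)."""
    owner, name = repo_id.split("/", 1)
    return Path(archive_root) / owner / name


def download_model(archive_root, repo_id):
    """Download a full repository snapshot into the archive folder."""
    # Imported lazily so the path helpers work without the package installed.
    from huggingface_hub import snapshot_download

    target = local_model_dir(archive_root, repo_id)
    target.mkdir(parents=True, exist_ok=True)
    snapshot_download(repo_id=repo_id, local_dir=target)
    return target


def looks_like_diffusers_folder(folder):
    """Heuristic: diffusers pipelines carry a top-level model_index.json."""
    return (Path(folder) / "model_index.json").is_file()


# Example (requires network access and the huggingface_hub package):
# folder = download_model("D:/model-archive", "stabilityai/stable-diffusion-xl-base-1.0")
# print(looks_like_diffusers_folder(folder))
```

Check the Disk (GB) column before downloading, since several repositories in the matrix exceed 20 GB.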
Some of the models on the site are "gated" models: the user must provide some personal information to the owner of the model repository before they can access the model. The SoloDesk model downloader cannot access gated models at this time due to technical limitations, so gated models are currently not supported.
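
Whether a repository is gated can usually be checked before attempting a download. The sketch below is one possible approach and not a description of how SoloDesk works: it assumes the `huggingface_hub` package, whose `model_info` call returns metadata that includes a `gated` field on recent versions. The `is_gated` helper and its `fetch_info` parameter are hypothetical names; the parameter exists only so the logic can be exercised without network access.

```python
def is_gated(repo_id, fetch_info=None):
    """Return True if the Hub metadata marks the repository as gated."""
    if fetch_info is None:
        # Imported lazily; requires the huggingface_hub package and network access.
        from huggingface_hub import model_info

        fetch_info = model_info
    info = fetch_info(repo_id)
    # `gated` is False for open repositories and a non-empty string otherwise;
    # older package versions may lack the attribute, which is treated as open.
    return bool(getattr(info, "gated", False))


# Example (requires network access):
# print(is_gated("stabilityai/sv3d"))
```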