HoML CLI Documentation
Install HoML
Go to the download page to get the HoML CLI for your system. Once installed, run the following command to set up the HoML server:
homl server install
Pull a model from Hugging Face Hub
Download a model to your local machine. You can use a shorthand alias for curated models.
homl pull qwen3:0.6b
Or use the full Hugging Face model ID:
homl pull Qwen/Qwen3-0.6B
To refresh the model's configuration and override any local changes, use the --config flag. This is useful if you've made changes to the launch parameters and want to revert to the defaults.
homl pull qwen3:0.6b --config
Run a model
Run a downloaded model. This will start the model and make it available for chat and API access.
homl run qwen3:0.6b
For a faster startup, you can use eager mode. It loads the model more quickly at the cost of higher per-request latency, with similar throughput.
homl run qwen3:0.6b --eager
Run a model in interactive chat mode
Start a conversation with a model.
homl chat qwen3:0.6b
Run a model in complete mode
Ask the model for a text completion. This works for models that don't have a chat template.
homl complete gemma-3:270m-it "3.1415926"
Eager Mode for Faster Model Loading
To improve your experience, HoML now uses Eager Mode by default for interactive sessions and automatic model switching. This significantly reduces model startup times.
Specifically:
* The homl chat command automatically starts the model in Eager Mode.
* When the server switches models due to an API request, the new model is also loaded in Eager Mode.
This results in much faster model loading. For example, we've observed startup time improvements from 38s to 18s for gpt-oss:20b and from 22s to 8s for qwen3:0.6b.
If you need the lowest possible latency for individual requests and don't mind a longer initial startup, you can still use the standard homl run <model_name> command without the --eager flag.
List local models
List all models that are available locally.
homl list
Check running models
Check the status of models that are currently running.
homl ps
Stop a model
Stop a running model to free up resources.
homl stop qwen3:0.6b
Automatic GPU Memory Management
HoML is designed to manage your GPU resources efficiently. When you make a request to the OpenAI-compatible API for a specific model, HoML automatically loads it into memory. Currently, only one model can run at a time. If you make a request for a different model, HoML will unload the previous one and load the new one.
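As a minimal sketch, assuming the server exposes the standard OpenAI-compatible chat completions endpoint at /v1/chat/completions on the configured port (8080 is the value used in the configuration example below), a request like this loads the model on demand:
# Endpoint path and port are assumptions; adjust them to match your homl config
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "qwen3:0.6b", "messages": [{"role": "user", "content": "Hello!"}]}'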
To free up your GPU for other applications, models are automatically unloaded after a period of inactivity. The default idle timeout is 10 minutes. You can configure this timeout using the homl config set model_unload_idle_time <seconds> command.
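For example, to unload idle models after five minutes of inactivity:
homl config set model_unload_idle_time 300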
Authenticate with Hugging Face
Set your Hugging Face token to pull private or gated models. You can provide the token directly or load it automatically from the default Hugging Face cache (typically ~/.cache/huggingface/token, written by huggingface-cli login).
homl auth hugging-face <your-token>
Or load it automatically:
homl auth hugging-face --auto
Manage HoML Server
You can manage the HoML server with the following commands:
homl server stop
homl server restart
homl server log
Manage HoML Configuration
You can manage the HoML configuration with the following commands:
homl config list
Get a config value:
homl config get port
Set a config value:
homl config set port 8080
Manage Model-Specific Configuration
You can manage model-specific configurations, such as launch parameters, with the following commands:
homl config model <model_name> --params <launch_params>
Get a model's config:
homl config model <model_name> --get
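As an illustrative sketch only: the parameter string below is hypothetical, since the exact launch-parameter syntax depends on your inference backend and isn't specified here.
# Hypothetical launch parameters; consult your backend's documentation for valid flags
homl config model qwen3:0.6b --params "--max-model-len 4096"
homl config model qwen3:0.6b --get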