Hermes4.3-Z

An experimental fine-tune of NousResearch/Hermes-4.3-36B using novel training data and methodologies.

Model Details

  • Base Model: NousResearch/Hermes-4.3-36B
  • Fine-tuned by: Daemontatox
  • Purpose: Chat/Conversational AI
  • Training: Experimental dataset and fine-tuning methodology
  • Parameters: 36B
  • Language: Multilingual

Training

This model uses an experimental approach to fine-tuning with a custom dataset designed for enhanced conversational capabilities.

Inference

Transformers

from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the model and tokenizer
model = AutoModelForCausalLM.from_pretrained(
    "Daemontatox/Hermes4.3-Z",
    device_map="auto",
    torch_dtype="auto"
)
tokenizer = AutoTokenizer.from_pretrained("Daemontatox/Hermes4.3-Z")

messages = [
    {"role": "user", "content": "Hello, how are you?"}
]

# Apply the model's chat template and move inputs to the model device
input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt"
).to(model.device)

outputs = model.generate(
    input_ids,
    max_new_tokens=512,
    temperature=0.7,
    top_p=0.9,
    do_sample=True
)

# Decode only the newly generated tokens, not the prompt
response = tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True)
print(response)
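
For interactive use you can stream tokens to stdout as they are generated. A minimal sketch using Transformers' built-in TextStreamer, reusing the model, tokenizer, and input_ids from above:

from transformers import TextStreamer

# Print tokens as they are generated, skipping the echoed prompt
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

model.generate(
    input_ids,
    max_new_tokens=512,
    temperature=0.7,
    top_p=0.9,
    do_sample=True,
    streamer=streamer
)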

vLLM

from vllm import LLM, SamplingParams

llm = LLM(
    model="Daemontatox/Hermes4.3-Z",
    tensor_parallel_size=1,
    dtype="auto"
)

sampling_params = SamplingParams(
    temperature=0.7,
    top_p=0.9,
    max_tokens=512
)

# Note: llm.generate treats prompts as raw text (no chat template applied)
prompts = ["Hello, how are you?"]
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(output.outputs[0].text)
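
To have vLLM apply the model's chat template for you, recent vLLM versions expose an llm.chat() helper. A minimal sketch, reusing the llm and sampling_params from above (the interface may vary slightly by vLLM version):

messages = [
    {"role": "user", "content": "Hello, how are you?"}
]

# llm.chat applies the tokenizer's chat template before generating
outputs = llm.chat(messages, sampling_params)
print(outputs[0].outputs[0].text)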

vLLM OpenAI-Compatible Server

# Start the OpenAI-compatible server
vllm serve Daemontatox/Hermes4.3-Z \
    --dtype auto \
    --max-model-len 8192

Then query it with the OpenAI Python client:

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="token-abc123"
)

response = client.chat.completions.create(
    model="Daemontatox/Hermes4.3-Z",
    messages=[
        {"role": "user", "content": "Hello, how are you?"}
    ],
    temperature=0.7,
    max_tokens=512
)

print(response.choices[0].message.content)
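
The server also supports streaming responses through the standard OpenAI streaming interface:

# Stream the completion token-by-token
stream = client.chat.completions.create(
    model="Daemontatox/Hermes4.3-Z",
    messages=[
        {"role": "user", "content": "Hello, how are you?"}
    ],
    temperature=0.7,
    max_tokens=512,
    stream=True
)

for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)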

TensorRT-LLM

# Convert to TensorRT-LLM format
# (convert_checkpoint.py ships in the TensorRT-LLM repo's examples
# directory for the matching model architecture)
python convert_checkpoint.py \
    --model_dir Daemontatox/Hermes4.3-Z \
    --output_dir ./trt_ckpt \
    --dtype float16 \
    --tp_size 1

# Build TensorRT engine
trtllm-build \
    --checkpoint_dir ./trt_ckpt \
    --output_dir ./trt_engine \
    --gemm_plugin float16 \
    --max_batch_size 8 \
    --max_input_len 4096 \
    --max_output_len 512

Then run inference with the TensorRT-LLM Python LLM API (exact flag and API names vary somewhat across TensorRT-LLM releases):

from tensorrt_llm import LLM, SamplingParams

# Load the compiled engine
llm = LLM(model="./trt_engine")

sampling_params = SamplingParams(
    temperature=0.7,
    top_p=0.9,
    max_tokens=512
)

prompts = ["Hello, how are you?"]
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(output.outputs[0].text)

Modular MAX

# Serve with MAX Engine
max serve Daemontatox/Hermes4.3-Z \
    --port 8000
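
MAX serve exposes an OpenAI-compatible endpoint, so the same openai client pattern from the vLLM section should work here. A minimal sketch, assuming the server above is running on port 8000:

from openai import OpenAI

# Point the standard OpenAI client at the MAX server
client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="EMPTY"
)

response = client.chat.completions.create(
    model="Daemontatox/Hermes4.3-Z",
    messages=[
        {"role": "user", "content": "Hello, how are you?"}
    ],
    max_tokens=512
)

print(response.choices[0].message.content)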

Alternatively, you can load the model directly through the MAX Python API. The API evolves quickly between MAX releases, so treat the following as illustrative and check Modular's current docs for the exact interface:

from max import engine

# NOTE: illustrative only -- the loading and generation calls
# below may differ depending on the installed MAX version
model = engine.InferenceSession(
    "Daemontatox/Hermes4.3-Z",
    device="cuda"
)

prompt = "Hello, how are you?"
output = model.generate(
    prompt,
    max_tokens=512,
    temperature=0.7,
    top_p=0.9
)

print(output.text)

Using MAX with Python API

from max.pipelines import pipeline

# Create a text-generation pipeline
# (illustrative -- the pipelines API may differ across MAX versions)
pipe = pipeline(
    "text-generation",
    model="Daemontatox/Hermes4.3-Z",
    device="cuda"
)

# Generate
result = pipe(
    "Hello, how are you?",
    max_new_tokens=512,
    temperature=0.7,
    top_p=0.9
)

print(result[0]["generated_text"])

Limitations

This is an experimental model. Test thoroughly before production use.
