Hey all,
I’m exploring building a code inspector (Mythos at home) with Hugging Face models. I’m currently working with gemma4, and while I can load the smaller versions just fine, when I try to add a bunch of source code to a prompt I get errors saying I don’t have enough memory. One attempt tried to allocate ~1.7 TB!
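For scale, here’s a quick way to see how many tokens a source dump actually turns into before generating (a minimal sketch; the model ID and file path are just examples, not my actual setup):

```python
from transformers import AutoTokenizer

# Example checkpoint; substitute whatever model you're actually loading
tokenizer = AutoTokenizer.from_pretrained("google/gemma-3-4b-it")

# Example path; any source file or concatenated dump works
with open("some_module.py") as f:
    source = f.read()

n_tokens = len(tokenizer(source)["input_ids"])
print(f"prompt is {n_tokens} tokens; tokenizer.model_max_length = {tokenizer.model_max_length}")
```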
I’ve written a function:

```python
def query_llm(system_message, user_message, assistant_message):
    messages = [
        {"role": "system", "content": system_message},
        {"role": "user", "content": user_message},
        {"role": "assistant", "content": assistant_message},
    ]
    # Render the chat template to a plain string without tokenizing yet
    text = processor.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
        enable_thinking=False,
    )
    inputs = processor(text=text, return_tensors="pt").to(model.device)
    input_len = inputs["input_ids"].shape[-1]
    # Generate, then decode only the newly generated tokens
    outputs = model.generate(**inputs, max_new_tokens=1024)
    response = processor.decode(outputs[0][input_len:], skip_special_tokens=False)
    return response
```
And I’m passing the source code in as the assistant message. Is this just the wrong approach? Is there any wisdom/guidance on how to do local code analysis?
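For reference, this is roughly how I’m calling it (the file path and prompts are just examples):

```python
# Example invocation; src/parser.py is a placeholder path
with open("src/parser.py") as f:
    code = f.read()

review = query_llm(
    system_message="You are a code inspector. Report bugs and risky patterns.",
    user_message="Please analyze the following module.",
    assistant_message=code,  # the whole source file goes in as the assistant turn
)
print(review)
```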