If your existing production stack is already mostly settled, you can treat my earlier vLLM comments as a from-scratch architecture example and skip that part. The more important point is this: if you publish raw LLM responses directly, it is hard to keep quality stable at scale. The basic pattern is to put a layer between the model output and the published page, usually by having the model produce structured output first.
Short version
If I were building this, I would not start by replacing your backend.
You already have:
- PostgreSQL
- Java / Spring Boot
- React / Node.js
- AWS hosting
- a production app
- fixed operator-page UI sections
- an API-based plan using OpenRouter or similar providers
So I would keep that stack and add an asynchronous profile content lifecycle around it.
The core flow would be:
operator data
↓
normalized facts
↓
fact pack
↓
structured generation
↓
validation report
↓
duplicate / SEO quality checks
↓
repair or review
↓
public-unverified / verified / published content
The model writes prose.
The application owns facts, consistency, validation, duplicate detection, and publishing decisions.
That distinction matters. Even a very capable LLM can produce good-looking but invalid text if the raw output is used directly. I would treat model output as a draft, not as the production artifact.
1. Keep the backend, make the LLM provider an adapter
I would not move to a new backend unless there is a strong reason.
Spring Boot can remain the source of truth. PostgreSQL can store raw operator data, generation jobs, generated versions, validation results, review states, and publication states.
The LLM provider should be an adapter:
interface ProfileGenerationClient {
    GeneratedProfile generate(ProfileFactPack factPack, GenerationConfig config);
}
Initial implementation:
OpenRouterProfileGenerationClient
Possible future implementations:
DirectProviderClient
InternalFineTunedModelClient
SelfHostedModelClient
For every generation, I would store metadata:
{
"provider": "openrouter",
"requested_model": "<MODEL_ID>",
"resolved_model": "<RESOLVED_MODEL_IF_AVAILABLE>",
"prompt_version": "profile_prompt_v7",
"schema_version": "operator_profile_schema_v3",
"fact_pack_version": "fact_pack_v2",
"temperature": 0.3,
"max_tokens": 1200,
"input_hash": "<INPUT_HASH>",
"fact_pack_hash": "<FACT_PACK_HASH>",
"output_hash": "<OUTPUT_HASH>"
}
Without this, it becomes difficult to debug quality changes later.
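One way the hash fields could be produced is to hash a canonical JSON serialization, so that key order does not change the result. This is a minimal sketch in Python for brevity (the same idea carries over to the Spring Boot service); `stable_hash` and `build_generation_metadata` are hypothetical names, not part of any provider API.

```python
import hashlib
import json
from typing import Any


def stable_hash(payload: Any) -> str:
    """SHA-256 over a canonical JSON form (sorted keys, compact separators)."""
    canonical = json.dumps(
        payload, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def build_generation_metadata(
    operator_input: dict, fact_pack: dict, output: dict, config: dict
) -> dict:
    """Assemble the metadata record stored next to each generated version."""
    return {
        **config,  # provider, requested_model, prompt_version, and so on
        "input_hash": stable_hash(operator_input),
        "fact_pack_hash": stable_hash(fact_pack),
        "output_hash": stable_hash(output),
    }
```

With stable hashes, two generations that share the same `fact_pack_hash` and `prompt_version` but differ in `output_hash` point at model or sampling variation rather than a data change.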
2. Start with the content contract
Before prompt engineering, I would define the exact output contract.
Since your UI is fixed, the model should not return arbitrary prose. It should return structured content for your fixed sections.
Example:
{
"bio": "...",
"services_offered": [
{
"name": "...",
"description": "...",
"source_fact_ids": ["skill_12", "category_3"]
}
],
"service_areas": [
{
"name": "Austin, TX",
"source_fact_ids": ["location_primary"]
}
],
"faqs": [
{
"question": "...",
"answer": "...",
"source_fact_ids": ["skill_12", "rate_1"]
}
],
"seo": {
"title": "...",
"meta_description": "..."
},
"claims_used": [
{
"claim": "The operator provides appliance repair in Austin, TX.",
"source_fact_ids": ["category_3", "location_primary"]
}
],
"unsupported_claims": [],
"risk_flags": []
}
The important part is source_fact_ids.
The model should not only write text. It should say which input facts support the generated claim. That makes downstream validation much easier.
OpenRouter structured outputs can help enforce the response shape.
But structured output is not the same as factual output.
This can be valid JSON and still be business-invalid:
{
"bio": "Austin-based certified appliance repair specialist with same-day service.",
"claims_used": ["certified", "same-day service"],
"unsupported_claims": []
}
If the operator did not provide certification or availability facts, that content should be rejected even if the JSON is valid.
So I would split validation into:
JSON/schema validation:
checks shape
business validation:
checks factuality, forbidden claims, duplicates, SEO risk, and publishability
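The split between the two validation layers can be sketched as two separate functions. This is a minimal Python sketch under assumptions: the function names are hypothetical, the schema check covers only three fields, and the forbidden-claim check is naive substring matching, which a real validator would need to refine.

```python
from typing import Any


def validate_schema(profile: dict[str, Any]) -> list[str]:
    """Shape check only: required keys exist and have the expected types."""
    errors = []
    required = {"bio": str, "claims_used": list, "unsupported_claims": list}
    for key, expected_type in required.items():
        if key not in profile:
            errors.append(f"missing field: {key}")
        elif not isinstance(profile[key], expected_type):
            errors.append(f"wrong type for field: {key}")
    return errors


def validate_business(
    profile: dict[str, Any], allowed_fact_ids: set[str], forbidden_claims: list[str]
) -> list[str]:
    """Content check: forbidden phrases and claims without supporting facts."""
    errors = []
    bio = profile.get("bio", "").lower()
    for phrase in forbidden_claims:
        if phrase.lower() in bio:  # naive substring match, for illustration
            errors.append(f"forbidden claim in bio: {phrase}")
    for claim in profile.get("claims_used", []):
        is_obj = isinstance(claim, dict)
        text = claim.get("claim", "") if is_obj else str(claim)
        ids = claim.get("source_fact_ids", []) if is_obj else []
        if not ids:
            errors.append(f"claim has no source facts: {text}")
        for fact_id in ids:
            if fact_id not in allowed_fact_ids:
                errors.append(f"claim cites unknown fact id: {fact_id}")
    return errors
```

Run against the "certified / same-day service" example above, `validate_schema` returns no errors while `validate_business` reports both the forbidden phrases and the claims that cite no facts.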
3. Build a fact pack before generation
I would not send the raw operator record directly to the model.
Convert raw operator data into a fact pack first.
Example:
{
"operator_id": "op_123",
"allowed_facts": [
{
"id": "service_primary",
"type": "service",
"value": "appliance repair"
},
{
"id": "location_primary",
"type": "location",
"value": "Austin, TX"
},
{
"id": "experience_years",
"type": "experience",
"value": 7
},
{
"id": "skill_1",
"type": "skill",
"value": "washer repair"
}
],
"forbidden_claims": [
"licensed",
"insured",
"certified",
"top-rated",
"best",
"guaranteed",
"same-day service",
"24/7 emergency service",
"5-star reviews"
],
"missing_fact_classes": [
"insurance",
"certifications",
"reviews",
"availability",
"service_radius"
],
"content_limits": {
"max_bio_words": 140,
"max_faq_count": 2,
"allow_faq": true
}
}
Missing data should become explicit constraints.
For example:
insurance = null
should become:
Do not claim insured.
And:
reviews_summary = null
should become:
Do not claim highly reviewed, 5-star, top-rated, or customer-loved.
The model should not decide what missing data means. The application should decide.
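The "missing data becomes a constraint" rule can be sketched as a small lookup. This is an illustrative Python sketch: the field names, the phrase lists, and `build_constraints` are assumptions made for the example, not a fixed vocabulary.

```python
# Hypothetical mapping from nullable operator fields to phrases that must not
# appear when the field is missing. Field names and phrases are illustrative.
MISSING_FIELD_RULES = {
    "insurance": ["insured"],
    "certifications": ["certified", "licensed"],
    "reviews_summary": ["highly reviewed", "5-star", "top-rated", "customer-loved"],
    "availability": ["same-day service", "24/7 emergency service"],
}

# Claims the application never allows, whatever the operator provided.
ALWAYS_FORBIDDEN = ["best", "guaranteed"]


def build_constraints(operator_record: dict) -> dict:
    """Turn missing fields into explicit forbidden claims for the fact pack."""
    forbidden = list(ALWAYS_FORBIDDEN)
    missing = []
    for field, phrases in MISSING_FIELD_RULES.items():
        value = operator_record.get(field)
        if value in (None, "", [], {}):  # treat empty values as missing
            missing.append(field)
            forbidden.extend(phrases)
    return {"forbidden_claims": forbidden, "missing_fact_classes": missing}
```

The point of the design is that the mapping lives in application code, where it can be reviewed and versioned, rather than in the prompt.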
4. Use a multi-step generation flow
I would avoid this:
input row → one prompt → final paragraph → publish
That is fragile at scale.
I would use a workflow:
1. Normalize operator input
2. Build fact pack
3. Decide content policy
4. Generate content plan
5. Validate content plan
6. Generate structured profile JSON
7. Validate schema
8. Validate factuality
9. Validate forbidden claims
10. Validate SEO/content quality
11. Check duplicate / near-duplicate risk
12. Repair or regenerate
13. Decide publishing state
14. Store content version + validation report
This is close to the workflow patterns described by Anthropic, especially prompt chaining and evaluator-optimizer.
The model should not own the whole workflow.
The model can write the words.
The application should decide what is allowed, what is invalid, what needs review, and what can be published.
5. Insert a content-plan step
Before final content generation, I would ask for a plan.
Example:
{
"bio_plan": {
"angle": "practical local appliance repair help",
"facts_to_use": [
"service_primary",
"location_primary",
"experience_years",
"skill_1"
],
"facts_to_avoid": [
"insurance",
"certifications",
"reviews",
"availability"
]
},
"faq_plan": [
{
"question_type": "service_scope",
"source_fact_ids": ["service_primary", "skill_1"]
}
],
"skip_sections": [
{
"section": "certifications",
"reason": "no certification facts were provided"
}
]
}
Then validate the plan before generating final copy.
If the plan already includes:
certified technician
same-day service
top-rated
5-star reviews
and those facts are not in the fact pack, reject the plan before the final content is generated.
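A plan check along these lines could look like the following. This is a minimal Python sketch that assumes the plan and fact pack shapes shown above; `validate_plan` is a hypothetical helper, and only the `angle` text is scanned for forbidden phrases.

```python
def validate_plan(plan: dict, fact_pack: dict) -> list[str]:
    """Reject a content plan that cites unknown facts or forbidden phrases."""
    allowed_ids = {fact["id"] for fact in fact_pack.get("allowed_facts", [])}
    forbidden = [p.lower() for p in fact_pack.get("forbidden_claims", [])]
    errors = []

    # Every fact id the plan intends to use must exist in the fact pack.
    bio_plan = plan.get("bio_plan", {})
    referenced = list(bio_plan.get("facts_to_use", []))
    for faq in plan.get("faq_plan", []):
        referenced.extend(faq.get("source_fact_ids", []))
    for fact_id in referenced:
        if fact_id not in allowed_ids:
            errors.append(f"unknown fact id: {fact_id}")

    # Free-text parts of the plan are scanned for forbidden phrases.
    angle = bio_plan.get("angle", "").lower()
    for phrase in forbidden:
        if phrase in angle:
            errors.append(f"forbidden phrase in plan: {phrase}")
    return errors
```

An empty error list means the plan may proceed to final generation; anything else stops the run before the more expensive generation call.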
6. Store validation reports
For every generated profile, I would store a validation report.
Example:
{
"schema": {
"status": "pass",
"errors": []
},
"factuality": {
"status": "fail",
"unsupported_claims": [
{
"claim": "insured",
"reason": "insurance was not present in the fact pack"
}
]
},
"forbidden_claims": {
"status": "fail",
"violations": ["insured"]
},
"seo_quality": {
"status": "warn",
"issues": [
"FAQ answer is generic",
"bio uses low-specificity wording"
]
},
"duplication": {
"status": "pass",
"nearest_profile_id": "op_987",
"similarity": 0.78
},
"decision": "repair"
}
This report is useful for:
- debugging failed generations
- explaining why a profile went to review
- improving prompts
- comparing models
- building future evals
- creating future fine-tuning or preference data
Without validation reports, you only have “the model wrote something.”
With validation reports, you have a system you can improve.
7. Separate generated, public-unverified, verified, and published
I would not use READY to mean “trusted.”
I would separate these states:
| State | Meaning |
| --- | --- |
| GENERATED_READY | Generated and passed automated checks |
| PUBLIC_UNVERIFIED | Publicly visible, but not manually/proof verified |
| VERIFIED | Important operator facts have been verified |
| REVIEW_REQUIRED | Should not be auto-published |
| PUBLISHED | Currently rendered on the live page |
The key distinction:
generated != verified
Your idea of generating quickly and adding a verified tag later is reasonable. I would just make that distinction explicit in the data model and UI.
A profile can be generated in 1–2 minutes and shown as public-unverified.
It can become verified later after proof, human review, or platform verification.
8. Use risk-based review, not full human-in-the-loop
I would not review every generated profile before publication unless the category is sensitive or legally risky.
Full human-in-the-loop can be too slow for onboarding.
Instead:
Auto-publish as PUBLIC_UNVERIFIED if:
- schema is valid
- no unsupported claims
- no forbidden claims
- duplicate score is low
- fact density is sufficient
- no suspicious operator patterns
- no high-risk service category
Send to REVIEW_REQUIRED if:
- unsupported claims were detected
- forbidden claims were detected
- duplicate similarity is high
- sparse input produced long output
- repeated repair attempts failed
- operator data looks suspicious
- service category is high risk
This keeps onboarding fast while still protecting quality.
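The two rule lists can be expressed as one decision function. This is a Python sketch under stated assumptions: the thresholds are placeholders, the `CheckResults` fields are hypothetical names for the outputs of the earlier checks, and the repair loop is left out so the function only chooses between auto-publish and review.

```python
from dataclasses import dataclass

# Placeholder thresholds; real values should come from your own data.
MAX_DUPLICATE_SIMILARITY = 0.90
MIN_FACT_COUNT = 3
MAX_WORDS_PER_FACT = 40
MAX_REPAIR_ATTEMPTS = 2


@dataclass
class CheckResults:
    schema_valid: bool
    unsupported_claims: int
    forbidden_claims: int
    duplicate_similarity: float
    fact_count: int
    output_words: int
    repair_attempts: int
    suspicious_operator: bool
    high_risk_category: bool


def decide_state(c: CheckResults) -> tuple[str, list[str]]:
    """Return the target state plus the reasons that blocked auto-publishing."""
    reasons = []
    if not c.schema_valid:
        reasons.append("schema invalid")
    if c.unsupported_claims > 0:
        reasons.append("unsupported claims")
    if c.forbidden_claims > 0:
        reasons.append("forbidden claims")
    if c.duplicate_similarity > MAX_DUPLICATE_SIMILARITY:
        reasons.append("duplicate similarity is high")
    if c.fact_count < MIN_FACT_COUNT:
        reasons.append("fact density is insufficient")
    if c.output_words > c.fact_count * MAX_WORDS_PER_FACT:
        reasons.append("sparse input produced long output")
    if c.repair_attempts > MAX_REPAIR_ATTEMPTS:
        reasons.append("repeated repair attempts failed")
    if c.suspicious_operator:
        reasons.append("operator data looks suspicious")
    if c.high_risk_category:
        reasons.append("service category is high risk")
    state = "REVIEW_REQUIRED" if reasons else "PUBLIC_UNVERIFIED"
    return state, reasons
```

Returning the reasons alongside the state gives reviewers the explanation for why a profile landed in their queue.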
9. Treat SEO quality as a policy, not a prompt phrase
I would avoid making the main instruction:
Write SEO-friendly content.
That can produce filler, keyword stuffing, and city/service boilerplate.
I would define the target as:
useful, fact-grounded, operator-specific, non-duplicative content
The risk is not “AI wrote it.”
The risk is generating many low-value, near-duplicate, weakly grounded pages.
SEO/content quality gate:
- Does this profile contain enough operator-specific facts?
- Are service areas supported by input data?
- Are FAQs grounded in actual facts?
- Is the title/meta keyword-stuffed?
- Is this page too similar to other city/service pages?
- Is this sparse profile being inflated into a long page?
- Should this page be short, noindex, or review-required until more facts are collected?
Most important rule:
Sparse inputs should produce short profiles, not inflated pages.
If the operator only provides a city and one service, do not generate a long bio and five FAQs. That creates both hallucination risk and SEO risk.
10. Measure uniqueness instead of asking for it
I would not rely on this instruction:
Write a unique description.
I would measure uniqueness.
| Layer | Check |
| --- | --- |
| 1 | normalized text hash |
| 2 | repeated phrase / sentence pattern |
| 3 | n-gram overlap |
| 4 | embedding similarity |
| 5 | same-city + same-service comparison |
| 6 | operator-data duplicate detection |
Since you already use PostgreSQL, pgvector is a practical option for vector similarity search.
Example:
SELECT
id,
operator_id,
service,
city,
embedding <=> <QUERY_EMBEDDING> AS cosine_distance
FROM operator_profile_versions
WHERE service = <SERVICE>
AND city = <CITY>
ORDER BY embedding <=> <QUERY_EMBEDDING>
LIMIT 10;
Possible policy:
if exact_hash_match:
    reject
if ngram_overlap > threshold:
    regenerate
if embedding_similarity > threshold and same_city_same_service:
    review_required
if operator_data_duplicate_score > threshold:
    block_or_manual_review
The thresholds should come from your own data.
Key idea:
Uniqueness should be a measured property, not a prompt instruction.
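The first and third layers (normalized text hash and n-gram overlap) need nothing beyond the standard library. This is a minimal Python sketch; the function names are hypothetical, Jaccard overlap of word trigrams is one possible overlap measure among several, and the 0.5 threshold is a placeholder.

```python
import hashlib
import re


def normalize(text: str) -> str:
    """Lowercase, replace punctuation with spaces, collapse whitespace."""
    return " ".join(re.sub(r"[^a-z0-9\s]", " ", text.lower()).split())


def text_hash(text: str) -> str:
    """Hash of the normalized text, used for exact-duplicate detection."""
    return hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()


def ngrams(text: str, n: int = 3) -> set[tuple[str, ...]]:
    tokens = normalize(text).split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def ngram_overlap(a: str, b: str, n: int = 3) -> float:
    """Jaccard overlap of word n-grams, between 0.0 and 1.0."""
    grams_a, grams_b = ngrams(a, n), ngrams(b, n)
    if not grams_a or not grams_b:
        return 0.0
    return len(grams_a & grams_b) / len(grams_a | grams_b)


def duplicate_decision(
    candidate: str, existing: list[str], overlap_threshold: float = 0.5
) -> str:
    """Apply the hash check first, then the n-gram overlap check."""
    candidate_hash = text_hash(candidate)
    if any(candidate_hash == text_hash(e) for e in existing):
        return "reject"
    if any(ngram_overlap(candidate, e) > overlap_threshold for e in existing):
        return "regenerate"
    return "accept"
```

In practice `existing` would be limited to profiles with the same city and service. Note that the SQL example returns a cosine distance, so an embedding similarity threshold would be applied to `1 - cosine_distance`.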
11. Make async generation reliable
Your workflow has this shape:
1. Save operator record in Postgres
2. Push generation job to queue
That creates a classic dual-write problem.
The DB write can succeed while queue publish fails. Or queue publish can happen twice. Or the worker can receive the same job more than once.
I would use the transactional outbox pattern.
Flow:
Spring Boot transaction:
- save operator record
- insert generation_job
- insert outbox_event
Outbox publisher:
- reads unpublished outbox rows
- sends message to SQS or worker queue
- marks outbox row as published
Worker:
- consumes job
- checks idempotency key
- builds fact pack
- generates content
- validates content
- writes content version + validation report
If you use SQS Standard queues, design for at-least-once delivery. AWS documents that messages may be delivered more than once and that consumers should be idempotent.
Job payload:
{
"job_id": "<JOB_ID>",
"operator_id": "<OPERATOR_ID>",
"input_hash": "<INPUT_HASH>",
"fact_pack_hash": "<FACT_PACK_HASH>",
"prompt_version": "<PROMPT_VERSION>",
"schema_version": "<SCHEMA_VERSION>",
"attempt_number": 1,
"idempotency_key": "<IDEMPOTENCY_KEY>"
}
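Both halves of this (the atomic job plus outbox write, and the idempotent consumer) can be illustrated with an in-memory SQLite database. This is a Python sketch only: the table layout is simplified, `processed_job` is a hypothetical table, and in production the same transactions would run in Spring Boot against PostgreSQL.

```python
import json
import sqlite3


def init_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE generation_job (
            job_id TEXT PRIMARY KEY, operator_id TEXT NOT NULL);
        CREATE TABLE outbox_event (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            payload TEXT NOT NULL,
            published INTEGER NOT NULL DEFAULT 0);
        CREATE TABLE processed_job (idempotency_key TEXT PRIMARY KEY);
    """)
    return conn


def enqueue_job(conn: sqlite3.Connection, job: dict) -> None:
    """Write the job row and its outbox event in one transaction."""
    with conn:  # commits on success, rolls back if either insert fails
        conn.execute(
            "INSERT INTO generation_job (job_id, operator_id) VALUES (?, ?)",
            (job["job_id"], job["operator_id"]),
        )
        conn.execute(
            "INSERT INTO outbox_event (payload) VALUES (?)", (json.dumps(job),)
        )


def handle_message(conn: sqlite3.Connection, job: dict) -> bool:
    """Return True if the job was processed, False on a duplicate delivery."""
    try:
        with conn:
            conn.execute(
                "INSERT INTO processed_job (idempotency_key) VALUES (?)",
                (job["idempotency_key"],),
            )
            # Fact pack building, generation, validation, and the version
            # write would go here, ideally inside this same transaction.
    except sqlite3.IntegrityError:
        return False  # key already recorded: skip the redelivered message
    return True
```

The unique constraint on `idempotency_key` does the deduplication, so a second delivery of the same message becomes a no-op instead of a second generation.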
12. Build private evals before choosing the model
Public leaderboards are useful for discovery, but they do not measure your exact task.
I would create an offline eval set:
100-300 real or representative operator records
Include difficult cases:
- rich operator data
- sparse operator data
- same city + same service
- missing rate
- missing experience
- missing insurance
- missing certifications
- no reviews
- ambiguous service area
- bot-like duplicate registrations
Evaluate models and prompts on:
schema_pass_rate
unsupported_claim_rate
forbidden_claim_rate
required_fact_inclusion_rate
duplicate_risk_rate
sparse_profile_inflation_rate
FAQ_grounding_rate
repair_attempts_per_accepted_output
human_acceptance_rate
latency
accepted_output_cost
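Several of these metrics fall straight out of the stored validation reports. This is a minimal Python sketch that assumes the report shape shown earlier; the `"accept"` decision value and the `repair_attempts` field are assumptions added for the example.

```python
def summarize_eval(reports: list[dict]) -> dict:
    """Aggregate per-profile validation reports into eval metrics."""
    total = len(reports)
    if total == 0:
        return {}

    def rate(predicate) -> float:
        return sum(1 for r in reports if predicate(r)) / total

    accepted = sum(1 for r in reports if r["decision"] == "accept")
    repairs = sum(r.get("repair_attempts", 0) for r in reports)
    return {
        "schema_pass_rate": rate(lambda r: r["schema"]["status"] == "pass"),
        "unsupported_claim_rate": rate(
            lambda r: bool(r["factuality"]["unsupported_claims"])
        ),
        "forbidden_claim_rate": rate(
            lambda r: bool(r["forbidden_claims"]["violations"])
        ),
        "repair_attempts_per_accepted_output": (
            repairs / accepted if accepted else None
        ),
    }
```

Running this once per model and prompt version over the same eval set gives directly comparable numbers.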
Do not choose based on five nice-looking examples.
Choose based on accepted-output cost:
accepted_output_cost =
    first_generation_cost
    + repair_generation_cost
    + validation_cost
    + duplicate_regeneration_cost
    + human_review_cost (if triggered)
A cheaper model may be more expensive in production if it causes more repairs and reviews.
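A worked comparison makes the point concrete. In this Python sketch the cost components are totals over one eval run, and dividing by the number of accepted outputs is my own reading of "accepted-output cost"; every number is made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class EvalRun:
    first_generation_cost: float
    repair_generation_cost: float
    validation_cost: float
    duplicate_regeneration_cost: float
    human_review_cost: float
    accepted_outputs: int


def accepted_output_cost(run: EvalRun) -> float:
    """Total spend of the run divided by the number of accepted outputs."""
    if run.accepted_outputs == 0:
        return float("inf")  # nothing usable came out of the run
    total = (
        run.first_generation_cost
        + run.repair_generation_cost
        + run.validation_cost
        + run.duplicate_regeneration_cost
        + run.human_review_cost
    )
    return total / run.accepted_outputs


# Illustrative numbers for 100 eval records per model (not measurements).
cheap_model = EvalRun(1.00, 0.80, 0.20, 0.30, 6.00, accepted_outputs=80)
pricier_model = EvalRun(3.00, 0.10, 0.20, 0.05, 1.00, accepted_outputs=95)
```

With these invented figures the cheaper model costs about 0.104 per accepted output and the pricier one about 0.046, because human review dominates the cheaper model's total.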
13. Model shortlist I would test
I would still avoid choosing the model from public vibes alone.
But if I had to build an initial shortlist, I would test models that cover different tradeoffs:
| Candidate | Why test it |
| --- | --- |
| google/gemma-4-26B-A4B-it | First practical candidate; strong size/performance profile |
| google/gemma-4-31B-it | Gemma-family quality ceiling |
| Qwen/Qwen3.6-27B | Dense 27B challenger |
| Qwen/Qwen3.6-35B-A3B | Efficient MoE challenger |
| mistralai/Mistral-Small-4-119B-2603 | Heavier quality comparison |
| CohereLabs/command-a-plus-05-2026-w4a4 | Enterprise/business prose comparison |
| moonshotai/Kimi-K2-Instruct-0905 | Upper-bound comparison |
| meta-llama/Llama-3.3-70B-Instruct | Stable baseline |
Why Gemma 4 should be included
I would definitely include the Gemma 4 family, especially:
google/gemma-4-26B-A4B-it
google/gemma-4-31B-it
google/gemma-4-26B-A4B-it is interesting because it is a Mixture-of-Experts model. OpenRouter describes it as 25.2B total parameters with only 3.8B active per token, 256K context, structured output support, function calling, reasoning mode, and Apache 2.0 licensing.
For this task, I would treat it as the first practical candidate.
I would use:
Gemma 4 26B A4B:
first model to try
strong size/performance candidate
good API-evaluation candidate
Gemma 4 31B:
quality ceiling inside Gemma 4
useful to check whether A4B loses anything important
Why Qwen3.6 should be included
I would also test:
Qwen/Qwen3.6-27B
Qwen/Qwen3.6-35B-A3B
Qwen/Qwen3.6-27B is a strong dense comparison point. Its model card says the artifacts are compatible with Hugging Face Transformers, vLLM, SGLang, KTransformers, and similar runtimes.
I would test it for:
- instruction following
- JSON/schema stability
- factual discipline
- natural business prose
- repair rate
Qwen/Qwen3.6-35B-A3B is also worth testing as an efficient MoE-style challenger.
Why Mistral Small 4 and Command A+ might be useful
I would include Mistral Small 4 if budget and latency allow it.
The reason is not that every profile needs heavy reasoning. I would use it to see whether a stronger model reduces:
- unsupported claims
- repair attempts
- generic filler
- duplicate-like phrasing
- awkward FAQ output
I would also test Command A+ if business/enterprise prose is important.
Command A+ is interesting as an enterprise/business prose comparison, not necessarily as the first production choice.
How I would choose
I would not choose based on first-generation prose quality alone.
For each model, I would measure:
| Metric | Meaning |
| --- | --- |
| schema_pass_rate | Does it follow the JSON contract? |
| unsupported_claim_rate | Does it invent facts? |
| forbidden_claim_rate | Does it output banned claims? |
| duplicate_risk_rate | Does it produce near-duplicate text? |
| sparse_profile_inflation_rate | Does it inflate weak input? |
| repair_attempts_per_accepted_output | How often does it need fixing? |
| human_acceptance_rate | Do reviewers accept it? |
| accepted_output_cost | True cost after validation/repair/review |
My initial practical bet would be:
Start with:
google/gemma-4-26B-A4B-it
Qwen/Qwen3.6-27B
mistralai/Mistral-Small-4-119B-2603
Then expand to:
google/gemma-4-31B-it
Qwen/Qwen3.6-35B-A3B
CohereLabs/command-a-plus-05-2026-w4a4
If Gemma 4 26B A4B gives strong validation pass rates and low repair rates, I would favor it as the first production candidate because of its size/performance profile.
If Qwen3.6 follows constraints better, I would choose Qwen.
If Mistral Small 4 dramatically reduces unsupported claims and repair attempts, I would consider paying more for it.
The model decision should come after the pipeline exists, because the pipeline defines what “good” means.
14. Improve operator input UX
If the operator data is weak, the model has only two safe choices:
write short content
or ask for more data
The unsafe choice is:
inflate sparse data into a long profile
So I would improve the onboarding form.
Collect structured fields like:
- primary service
- secondary services
- city / service area
- years of experience
- license / certification
- insurance
- languages
- availability
- rate / price range
- specialties
- customer type
- examples of work
- short self-written note
- proof fields for verified claims
Then use a fact density score:
| Fact density | Content policy |
| --- | --- |
| high | full profile, services, FAQ, SEO title/meta |
| medium | shorter bio, limited FAQ |
| low | short profile only, ask for more facts, maybe public-unverified or noindex |
This may improve SEO quality more than changing the model.
The best way to make useful pages is to collect useful facts.
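A fact density score can start as a simple count of filled structured fields. This is a Python sketch with assumptions: the field names mirror the onboarding fields listed above, and the cutoffs (8 and 4) and the word and FAQ limits are placeholders to tune against your own data.

```python
# Hypothetical field names based on the onboarding fields listed above.
STRUCTURED_FIELDS = [
    "primary_service", "secondary_services", "service_area",
    "years_of_experience", "certifications", "insurance", "languages",
    "availability", "rate", "specialties", "customer_type",
    "work_examples", "self_written_note",
]

# Placeholder limits per density tier; the tiers follow the table above.
CONTENT_POLICY = {
    "high": {"max_bio_words": 140, "max_faq_count": 2,
             "noindex": False, "ask_for_more_facts": False},
    "medium": {"max_bio_words": 80, "max_faq_count": 1,
               "noindex": False, "ask_for_more_facts": False},
    "low": {"max_bio_words": 40, "max_faq_count": 0,
            "noindex": True, "ask_for_more_facts": True},
}


def fact_density(record: dict) -> str:
    """Classify a record by how many structured fields are filled in."""
    filled = sum(
        1 for field in STRUCTURED_FIELDS
        if record.get(field) not in (None, "", [], {})
    )
    if filled >= 8:
        return "high"
    if filled >= 4:
        return "medium"
    return "low"


def content_policy(record: dict) -> dict:
    """Look up the generation limits for this record's density tier."""
    return CONTENT_POLICY[fact_density(record)]
```

The resulting limits can be copied into the fact pack's `content_limits`, so the sparse-profile rule is enforced by data rather than by prompt wording.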
15. Use content versioning
Do not overwrite generated content in place.
Possible tables:
operator
operator_profile
operator_profile_version
generation_job
generation_outbox
generation_validation_report
profile_embedding
manual_review_task
operator_edit
Each generated version should store:
operator_id
profile_version_id
generated_json
published_json
validation_report
source_fact_hash
prompt_version
schema_version
provider
model
generation_params
created_at
published_at
verified_at
This matters because:
- the operator may edit AI content
- the platform may verify claims later
- a new model may regenerate content
- reviewers may approve or reject changes
- you need rollback
- edits become useful future eval/fine-tuning data
16. Do not start with fine-tuning
Fine-tuning can help later, but I would not start there.
First build:
- content schema
- fact pack
- validators
- duplicate checks
- private evals
- validation reports
- review states
Only after that would I consider fine-tuning.
Later, you can use:
operator facts
+ generated output
+ validation report
+ operator edits
+ reviewer decisions
to create:
SFT data:
fact pack → good structured profile JSON
Preference data:
chosen good output vs rejected bad output
Verifier data:
fact pack + generated profile → validation report
If you fine-tune, I would start with LoRA/QLoRA rather than full fine-tuning.
But that is a later phase.
Practical build order
Phase 1: Offline prototype
1. Collect 100-300 representative operator records
2. Define content schema
3. Define fact pack schema
4. Define forbidden claims
5. Generate outputs with 2-4 models
6. Validate schema
7. Validate factuality
8. Check duplicate risk
9. Human-review 30-50 outputs
10. Tune prompt/schema/validators
Phase 2: MVP generation pipeline
1. Add generation_job table
2. Add content version table
3. Add validation report table
4. Add outbox table
5. Add worker
6. Add OpenRouter adapter
7. Add structured output
8. Add schema validation
9. Add basic fact/forbidden-claim checks
10. Add repair loop
Phase 3: SEO and duplicate safety
1. Add fact density scoring
2. Add sparse profile policy
3. Add n-gram duplicate checks
4. Add embeddings
5. Add pgvector similarity search
6. Add same-city/service duplicate policy
7. Add noindex/review-required rules for weak pages
Phase 4: Review and verification
1. Add PUBLIC_UNVERIFIED state
2. Add REVIEW_REQUIRED state
3. Add VERIFIED state
4. Add reviewer UI
5. Add operator edit UI
6. Store edits and review decisions
Phase 5: Model and tuning improvements
1. Run private evals regularly
2. Compare models by accepted-output cost
3. Add best-of-N generation if needed
4. Build verifier/reward model if useful
5. Consider LoRA/QLoRA or DPO after enough data exists
What I would avoid
I would avoid:
input row → one prompt → final paragraph → publish
I would avoid treating READY as trusted.
I would avoid writing long content for sparse operators.
I would avoid asking the model to make pages unique without measuring duplication.
I would avoid making “SEO-friendly” the main instruction.
I would avoid fine-tuning before you have evals and validation data.
I would avoid coupling business logic directly to one LLM provider.
Final summary
If I were building this from scratch, I would build a system that controls whether generated content is:
allowed
grounded
distinct
useful
publishable
reviewable
verifiable
versioned
The LLM is only the prose-generation component.
My first priorities would be:
1. content contract
2. fact pack
3. structured output
4. validation report
5. duplicate scoring
6. SEO/content quality policy
7. public-unverified vs verified states
8. private evals
9. reliable async jobs
10. operator input improvement
11. model comparison by accepted-output cost
The central rule:
The model can write the words, but the application should own the truth, consistency, publishing policy, and quality gates.