Need generative model, high-quality description generation

Problem statement

I am building operator profile descriptions for a local-services marketplace from structured inputs like skill, city, state, rate, and experience. I need descriptions that sound human-written, stay factually correct, and remain diverse across many operator pages.

I tried Hugging Face/open-source local models, Qwen, Phi-3, and free-tier Google API models, but the results are still not satisfactory for production quality. So far, the API-based result was the best, but I want suggestions for a better non-API or hybrid approach for this use case.

What I tried: Fixed templates became repetitive at scale and risk near-duplicate quality issues; then I tried a hybrid pipeline where I first extract facts and then rewrite with a model, and I tested local/open models like Qwen and Phi-3 plus free-tier Google API models, but only the API-based output was reasonably good so far.

What I need suggestions for: the best approach to generate long 3-paragraph and 5 other types of human-like business descriptions from structured facts, keep facts fixed while improving writing quality, reduce repetition across 10,000+ pages without massive hardcoded templates, and build a feasible SEO-, GEO-, and large-scale programmatic-content pipeline with strong quality control.

If you don’t insist on having the LLM complete everything by itself, this may be simpler:


Short answer

I would treat this less as a “find the perfect generative model” problem and more as a pipeline design problem.

For this kind of description-generation task, I would probably use vLLM as the inference backend, run one or more reputable Hugging Face models behind it, and put most of the engineering effort into the surrounding pipeline:

  1. normalize the structured input,
  2. build a fact pack,
  3. generate a content plan,
  4. generate the description,
  5. validate factuality,
  6. validate style and banned claims,
  7. check near-duplicates,
  8. repair or regenerate,
  9. log model/prompt/schema versions and eval results.

The model still matters, of course. But if the task is to generate thousands of profile/location/service descriptions, the main risk is usually not “the paragraph is not poetic enough.” The main risks are:

  • unsupported facts,
  • generic filler,
  • near-duplicate pages,
  • unsafe claims,
  • SEO-thin pages,
  • inability to compare model/prompt changes later.

So I would keep the model swappable and make the pipeline the main product.

Useful references:


Why I would not optimize only for “the best model”

There are many decent open models on Hugging Face now. Some Qwen, Llama, Mistral, Gemma, and Command-family models can produce good profile or marketing prose.

But for this use case, a better model alone does not solve the main operational problems.

A stronger model may still:

  • hallucinate credentials,
  • add unsupported service areas,
  • overstate experience,
  • invent availability,
  • invent review quality,
  • produce generic SEO-ish filler,
  • repeat similar sentence structures across thousands of pages,
  • silently change behavior after a model, prompt, or runtime update,
  • produce good-looking text that fails a business rule.

That is why I would avoid a pure “model shootout” approach.

A model shootout is still useful, but only after defining task-specific evals. General benchmark strength is not the same as quality on this exact task.

OpenAI’s eval guidance is useful here because it frames evals as a way to test AI systems despite generative variability:

Hamel Husain’s writing is also useful from a practical engineering point of view:

Chip Huyen’s production LLM article is also a good reference for the idea that LLM applications should be tested as systems, not just prompts:

The short version:

Do not ask “which model writes the nicest description?” first.
Ask “which pipeline reliably turns structured facts into useful, factual, non-duplicative descriptions?”


Proposed backend shape

I would use this architecture:

Admin / API
  ↓
FastAPI
  ↓
Postgres
  ↓
Celery or Temporal
  ↓
Workers
  ├─ normalize_input
  ├─ build_fact_pack
  ├─ generate_content_plan
  ├─ generate_description
  ├─ fact_check
  ├─ style_check
  ├─ duplicate_check
  ├─ repair_or_regenerate
  └─ publish_or_export
        ↓
vLLM OpenAI-compatible server
        ↓
HF model weights

Suggested starting stack:

Inference:
  vLLM

API:
  FastAPI

Database:
  Postgres

Vector similarity:
  pgvector

Queue / jobs:
  Celery + Redis for MVP
  Temporal later if workflows become complex

Validation:
  Pydantic
  Instructor or similar structured-output helper

Storage:
  S3 / R2 / MinIO

Monitoring:
  structured logs
  token/latency/cost counters
  eval dashboards

Why vLLM?

vLLM gives you an OpenAI-compatible HTTP server, which makes it easier to keep your application code stable while swapping the underlying HF model:

It also supports structured outputs, which is useful if you want the model to return a schema like this:

{
  "content_plan": {
    "angle": "experienced bilingual local technician",
    "paragraphs": [
      "Introduce the service and location",
      "Mention supported skills and experience",
      "Close with practical customer benefit"
    ]
  },
  "included_facts": [
    "Austin, TX",
    "7 years of experience",
    "washer repair",
    "dryer repair",
    "$85/hour"
  ],
  "unsupported_claims": [],
  "final_description": "<generated description>"
}

Reference:

The point is not that structured output magically guarantees truth. It does not. The point is that it gives the rest of your application something inspectable.


Why a pipeline fits this task better than one-shot generation

This task is a good match for a fixed workflow.

Anthropic’s “Building Effective Agents” post is useful here because it separates relatively deterministic workflows from more open-ended agents. In particular, it describes:

  • prompt chaining,
  • routing,
  • parallelization,
  • orchestrator-workers,
  • evaluator-optimizer.

Reference:

For this problem, I would use something closer to prompt chaining and evaluator-optimizer, not a fully autonomous agent.

A simple generation pipeline might look like this:

Raw row
  ↓
Normalized facts
  ↓
Fact pack
  ↓
Content plan
  ↓
Draft description
  ↓
Factuality check
  ↓
Style / banned-claim check
  ↓
Duplicate check
  ↓
Repair or regenerate
  ↓
Approved output

That is easier to test than a giant prompt that says:

Write a unique, high-quality, SEO-friendly, factual local service description.

The giant prompt may work for 20 examples. It is much less safe for 10,000+ examples.


Step 1: Normalize the input first

Before calling the LLM, normalize the input into a strict schema.

Example:

{
  "profile_id": "<PROFILE_ID>",
  "service": "appliance repair",
  "city": "Austin",
  "state": "TX",
  "rate": {
    "amount": 85,
    "currency": "USD",
    "unit": "hour"
  },
  "experience_years": 7,
  "skills": [
    "washer repair",
    "dryer repair",
    "refrigerator diagnostics"
  ],
  "languages": [
    "English",
    "Spanish"
  ],
  "certifications": [],
  "insurance": null,
  "reviews_summary": null
}

This is not just cleanup. It prevents the model from guessing what missing fields mean.

For example:

  • if certifications is empty, do not allow “certified”;
  • if insurance is null, do not allow “insured”;
  • if reviews_summary is null, do not allow “highly reviewed” or “5-star”;
  • if no availability is provided, do not allow “same-day service”;
  • if no service radius is provided, do not invent nearby cities.

The LLM should receive not only the raw facts but also the allowed and forbidden claims.


Step 2: Build a fact pack

I would explicitly build a fact pack before writing.

Example:

{
  "allowed_claims": [
    "The provider offers appliance repair in Austin, TX.",
    "The provider has 7 years of experience.",
    "The provider handles washer repair, dryer repair, and refrigerator diagnostics.",
    "The provider speaks English and Spanish.",
    "The listed rate is $85/hour."
  ],
  "forbidden_claims": [
    "licensed",
    "insured",
    "certified",
    "top-rated",
    "best in Austin",
    "guaranteed same-day service",
    "5-star reviews",
    "background checked",
    "family-owned",
    "emergency service"
  ],
  "missing_fields": [
    "certifications",
    "insurance",
    "reviews",
    "availability",
    "service_radius"
  ]
}

This makes the generation task much easier:

Write a description using only these allowed claims.
Do not use any forbidden claims.
Omit missing facts naturally.

This is also useful for auditing later.

If a generated page says “insured”, you can check whether insured was ever present in the fact pack. If it was not, the output is invalid.


Step 3: Generate a content plan before final prose

Instead of asking for the final description immediately, ask the model to make a small plan.

Example output:

{
  "angle": "practical local appliance repair help",
  "paragraph_plan": [
    {
      "goal": "Introduce service, location, and main skills",
      "facts_to_use": ["service", "city", "state", "skills"]
    },
    {
      "goal": "Mention experience and rate without sounding salesy",
      "facts_to_use": ["experience_years", "rate"]
    },
    {
      "goal": "Close with a customer-oriented sentence",
      "facts_to_use": ["languages"]
    }
  ],
  "style_constraints": [
    "professional",
    "plainspoken",
    "no exaggerated marketing claims",
    "no unsupported credentials"
  ]
}

This intermediate step gives you something to validate before prose generation.

If the plan already includes “certified technician” but the fact pack has no certification, reject the plan before generating the final text.


Step 4: Generate the description

Then generate the actual description.

Example prompt shape:

You write local service marketplace profile descriptions.

Use ONLY the facts in FACT_PACK.
Do not invent credentials, awards, insurance, guarantees, reviews, availability, service radius, or ranking claims.
If a fact is missing, omit it naturally.

Write in a warm, professional, human style.
Avoid clichés such as:
- dedicated professional
- top-notch
- go-to expert
- best in the area
- unparalleled service
- committed to excellence

Return JSON matching OUTPUT_SCHEMA.

FACT_PACK:
<FACT_PACK>

CONTENT_PLAN:
<CONTENT_PLAN>

OUTPUT_SCHEMA:
<OUTPUT_SCHEMA>

This is more controllable than:

Write a high-quality profile description.

Step 5: Validate factuality

After generating the description, validate it.

I would start with a combination of:

  1. deterministic checks,
  2. schema checks,
  3. LLM-based claim checking,
  4. sampled human review.

Example deterministic check:

BANNED_PHRASES = [
    "licensed",
    "insured",
    "certified",
    "top-rated",
    "best",
    "guaranteed",
    "same-day",
    "5-star",
    "award-winning",
]

def banned_phrase_check(text: str, allowed_claims: list[str]) -> list[str]:
    violations = []
    lower_text = text.lower()

    for phrase in BANNED_PHRASES:
        if phrase in lower_text and not any(phrase in claim.lower() for claim in allowed_claims):
            violations.append(phrase)

    return violations

Example LLM verifier output:

{
  "status": "fail",
  "unsupported_claims": [
    {
      "claim": "offers same-day service",
      "reason": "availability was not present in the input facts"
    }
  ],
  "missing_required_facts": [],
  "recommended_action": "repair"
}

This is where an evaluator-optimizer pattern becomes useful:

  • writer generates,
  • verifier checks,
  • repair model fixes only the invalid parts,
  • final validator runs again.

Useful references:

Important caveat: do not blindly trust an LLM judge. Use it as one signal. For critical rules, use deterministic checks too.


Step 6: Validate style

The style checker should not only ask “is this good writing?”

It should check task-specific failure modes:

  • Does it sound like generic SEO filler?
  • Does it repeat common marketing clichĂ©s?
  • Is it too similar to the template?
  • Does it overpromise?
  • Does it mention unavailable facts?
  • Is it useful to a real customer?

Example style checker output:

{
  "status": "fail",
  "issues": [
    {
      "type": "cliche",
      "span": "dedicated professional",
      "reason": "overused generic phrase"
    },
    {
      "type": "thin_content",
      "span": "provides quality service for all your needs",
      "reason": "generic phrase that adds no profile-specific value"
    }
  ],
  "recommended_action": "repair"
}

Repair prompt:

Revise the description to remove the listed style issues.
Do not add new facts.
Preserve all valid factual claims.
Do not change city, state, service, rate, years of experience, skills, or languages.

DESCRIPTION:
<DESCRIPTION>

STYLE_ISSUES:
<STYLE_ISSUES>

Step 7: Check duplicates and near-duplicates

For 10,000+ generated pages, exact duplicates are not the only problem.

You also need to catch near-duplicates like:

  • same paragraph structure,
  • same opening line with only city/service swapped,
  • same conclusion sentence,
  • same generic claims,
  • same semantic content in different words.

I would use multiple layers:

Layer 1:
  normalized text hash

Layer 2:
  n-gram overlap

Layer 3:
  embedding similarity

Layer 4:
  same city + same service group comparison

Layer 5:
  sampled human review

For embedding similarity, pgvector is a practical starting point because it lets you store vectors alongside normal Postgres data.

Reference:

Example table:

CREATE TABLE profile_outputs (
    id BIGSERIAL PRIMARY KEY,
    profile_id TEXT NOT NULL,
    service TEXT NOT NULL,
    city TEXT NOT NULL,
    state TEXT NOT NULL,
    output_text TEXT NOT NULL,
    embedding vector(768),
    model_repo TEXT NOT NULL,
    model_revision TEXT,
    prompt_version TEXT NOT NULL,
    schema_version TEXT NOT NULL,
    created_at TIMESTAMPTZ DEFAULT now()
);

Example duplicate check:

SELECT
    id,
    profile_id,
    service,
    city,
    output_text,
    embedding <=> <QUERY_EMBEDDING> AS cosine_distance
FROM profile_outputs
WHERE service = <SERVICE>
  AND city = <CITY>
ORDER BY embedding <=> <QUERY_EMBEDDING>
LIMIT 10;

The exact thresholds need empirical tuning. For example:

if exact_hash_match:
    reject

if ngram_overlap > 0.65:
    regenerate

if embedding_similarity > 0.92 and same_service:
    regenerate_with_new_angle

if embedding_similarity > 0.86 and same_city_same_service:
    send_to_review

The thresholds above are placeholders, not universal constants.


Step 8: Use evals before model selection

I would select models only after defining task-specific evals.

Example eval set:

Eval What it catches Example failure
Schema validity Invalid JSON or missing fields no final_description field
Factuality Claims not in input “insured” when not provided
Required facts Important facts omitted city or service missing
Forbidden claims Risky words or claims “best”, “certified”, “guaranteed”
Style Generic filler “for all your needs”
Duplication Too similar to existing pages same paragraph pattern
Helpfulness Thin or useless page no concrete differentiating facts

Model comparison should then look like:

Model A:
  factuality pass: 94%
  schema pass: 98%
  duplicate fail: 12%
  style fail: 18%
  average repair attempts: 0.7
  average latency: 1.2s
  average output tokens: 170

Model B:
  factuality pass: 91%
  schema pass: 99%
  duplicate fail: 6%
  style fail: 10%
  average repair attempts: 0.5
  average latency: 1.8s
  average output tokens: 165

This is much more useful than:

Model A sounds better than Model B in a few examples.

Useful references:


Step 9: Keep model/prompt/schema versions

Save enough metadata to reproduce or debug each output.

Minimum metadata:

{
  "profile_id": "<PROFILE_ID>",
  "output_id": "<OUTPUT_ID>",
  "model_repo": "<MODEL_REPO>",
  "model_revision": "<MODEL_REVISION>",
  "runtime": "vLLM",
  "runtime_version": "<VLLM_VERSION>",
  "prompt_version": "<PROMPT_VERSION>",
  "schema_version": "<SCHEMA_VERSION>",
  "temperature": 0.4,
  "top_p": 0.9,
  "max_tokens": 500,
  "input_hash": "<INPUT_HASH>",
  "fact_pack_hash": "<FACT_PACK_HASH>",
  "created_at": "<TIMESTAMP>"
}

For HF models, I would also pin the model revision or commit when testing and recording results.

Reference:

This matters because otherwise you cannot answer:

  • Did quality change because the model changed?
  • Did the prompt change?
  • Did the input data change?
  • Did the validation rules change?
  • Did the runtime change?
  • Which outputs need regeneration?

Step 10: Be careful with SEO / programmatic content

If this is for many local-service pages, do not think only about naturalness.

Think about usefulness and uniqueness.

Google’s guidance is important here. Google says generative AI can be useful for research and structuring content, but using generative AI or similar tools to generate many pages without adding value for users may violate its scaled content abuse policy.

References:

So I would not frame the pipeline as:

Generate lots of unique-looking pages.

I would frame it as:

Generate useful profile descriptions from real structured facts,
reject unsupported claims,
detect thin/duplicative pages,
and avoid publishing pages that do not add user value.

For programmatic SEO context, these are useful:

For local-service profile pages, the page should ideally have real differentiators, not just paraphrased boilerplate:

  • service category,
  • city and state,
  • actual skills,
  • years of experience,
  • real rate information,
  • real credentials if available,
  • real languages,
  • real availability if available,
  • real review summary if available,
  • real examples of work if available.

If most rows do not contain enough differentiating data, the pipeline should not hide that problem with fluent prose. It should flag those rows as low-information.


Suggested implementation path

I would start small.

Phase 1: Offline evaluation

Take 100–300 representative rows.

Include edge cases:

  • missing rate,
  • missing experience,
  • many skills,
  • only one skill,
  • no certifications,
  • has certification,
  • multiple languages,
  • high-overlap rows,
  • same city and service,
  • sparse profiles.

Run 2–4 candidate HF models behind vLLM.

Do not judge only by reading samples. Run evals.

Outputs from this phase:

- prompt v1
- fact schema v1
- output schema v1
- validation rules v1
- duplicate thresholds v0
- model comparison table
- human review notes

Phase 2: MVP backend

Build:

FastAPI
Postgres
pgvector
Celery + Redis
vLLM
Pydantic / Instructor

Celery is a reasonable MVP queue because it is a mature distributed task queue:

Postgres + pgvector is enough for initial metadata + vector similarity:

Phase 3: Add repair loops and review queues

Add statuses like:

pending
generating
validating
repairing
duplicate_check
review_required
approved
rejected
published

Add separate queues:

generation
validation
embedding
repair
export

Add max attempt counts:

max_generation_attempts: 3
max_repair_attempts: 2
human_review_after: 2 failed repair attempts

Phase 4: Move to durable workflows if needed

If the workflow becomes more complex, Temporal may be a better fit than Celery for the whole process.

Temporal is useful when you need durable execution, retries, and recovery across long-running workflows:

I would not necessarily start with Temporal if the team wants a quick MVP. But if human review, partial reruns, repair loops, and auditability become central, Temporal becomes attractive.


Example pipeline contract

A useful contract is:

The model is allowed to write prose.
The application owns facts, rules, validation, retries, and publishing.

That means:

  • the model does not decide whether a claim is allowed;
  • the model does not decide whether a page is publishable;
  • the model does not decide whether two pages are too similar;
  • the model does not silently change the data contract;
  • the model does not erase metadata needed for debugging.

The app should own those things.


Example prompt template

SYSTEM:
You write local service marketplace profile descriptions.

Hard rules:
- Use only the facts in FACT_PACK.
- Do not invent credentials, insurance, certifications, awards, reviews, rankings, guarantees, service areas, availability, or business history.
- If a fact is missing, omit it naturally.
- Avoid generic SEO filler.
- Avoid clichés.
- Keep the description useful to a real customer comparing providers.

Return JSON matching OUTPUT_SCHEMA.

FACT_PACK:
<FACT_PACK>

CONTENT_PLAN:
<CONTENT_PLAN>

OUTPUT_SCHEMA:
<OUTPUT_SCHEMA>

Example output schema:

{
  "type": "object",
  "properties": {
    "final_description": {
      "type": "string"
    },
    "included_facts": {
      "type": "array",
      "items": {"type": "string"}
    },
    "unsupported_claims": {
      "type": "array",
      "items": {"type": "string"}
    },
    "style_notes": {
      "type": "array",
      "items": {"type": "string"}
    }
  },
  "required": [
    "final_description",
    "included_facts",
    "unsupported_claims"
  ]
}

Example validator contract

{
  "factuality": {
    "status": "pass",
    "unsupported_claims": []
  },
  "forbidden_claims": {
    "status": "pass",
    "violations": []
  },
  "style": {
    "status": "fail",
    "issues": [
      "Contains generic phrase: 'for all your needs'"
    ]
  },
  "duplicate": {
    "status": "pass",
    "nearest_output_id": "<OUTPUT_ID>",
    "similarity": 0.78
  },
  "recommended_action": "repair"
}

This kind of object is much easier to debug than a plain paragraph.


Model choice

For the writer model, I would shortlist a few reputable HF models that run well under vLLM and evaluate them with the above pipeline.

I would not choose based only on public chat benchmarks.

I would choose based on:

  • schema pass rate,
  • factuality pass rate,
  • repair rate,
  • duplicate rate,
  • style pass rate,
  • latency,
  • throughput,
  • cost,
  • operational stability.

The best model for this pipeline is the one that produces the highest rate of valid, useful, non-duplicative outputs after the full validation pipeline, not necessarily the one that writes the most impressive one-off paragraph.


What I would avoid

I would avoid this:

One API endpoint:
  input row → prompt → final paragraph → publish

It is too hard to debug and too easy to scale mistakes.

I would also avoid:

Pick a strong model and trust the prompt.

Prompts are important, but prompts are not enforcement.

I would avoid publishing all generated outputs automatically before you have at least:

  • factuality validation,
  • banned-claim checks,
  • duplicate checks,
  • evals,
  • sampled human review,
  • versioned logs.

Practical minimal version

If you want a minimal version, I would build this first:

1. CSV or database rows
2. normalize into Pydantic schema
3. create fact pack
4. call vLLM writer model
5. validate JSON output
6. run banned-phrase checks
7. run LLM factuality verifier
8. embed final text
9. check nearest neighbors in pgvector
10. save output + validation metadata
11. export approved rows

This is already much safer than one-shot generation.


Final recommendation

I would use vLLM as the serving layer and keep HF models interchangeable.

Then I would invest most of the effort in:

  • input normalization,
  • fact packs,
  • structured outputs,
  • validation,
  • repair loops,
  • duplicate detection,
  • evals,
  • audit logs,
  • conservative publishing rules.

That makes the system more robust than trying to find one magic model.

The model matters, but the pipeline matters more.

A good model inside a weak pipeline will still hallucinate, duplicate, and drift.

A decent model inside a strong pipeline can be measured, repaired, compared, and replaced.

Thanks for the detailed guidance — this is very helpful. I agree with your main point: for our use case, the hard part is not finding one “best” model, but designing a reliable pipeline around structured operator data.

Our setup

  • Stack: PostgreSQL + Java (Spring Boot) + React/Node.js, hosted on AWS

  • We’re in production, and the app/web is running successfully. This is a stage-2 upgrade where we’re adding AI for content generation.

  • For cost reasons, we’ll likely start with an API-based approach (free tier or OpenRouter), not self-hosted models.

What we’re generating
We have operator pages. The UI is fixed and shared across all operators. The problem is the content: bio, service areas, location, services offered, FAQ, etc. These must be:

  • Unique

  • Factually accurate (locked to operator inputs)

  • SEO-friendly

  • Scalable to 1,000+ existing operators and all future registrations

Our workflow (with REST API)

  1. User clicks Create Profile in the frontend.

  2. Frontend sends operator data to Spring Boot.

  3. Spring Boot saves the raw operator record in PostgreSQL with status PENDING.

  4. Spring Boot pushes a generation job into a queue/worker system.

  5. Worker reads the operator data and calls OpenRouter.

  6. OpenRouter returns structured content JSON.

  7. Worker validates the JSON and stores the generated content in PostgreSQL.

  8. Status changes to READY or PUBLISHED.

  9. Frontend fetches the content and renders it in the fixed UI sections.

  10. If something fails, status becomes FAILED and only that case goes to retry or manual review.

This takes about 1–2 minutes, which is acceptable for the operator experience.

On duplicates and bots
We’re aware that bots and duplicate/near-duplicate content are real risks at scale. We’re considering either:

  • Human-in-the-loop, or

  • Human-on-the-loop (post-generation review),

depending on feasibility. If we go with human-in-the-loop for profile creation, it could take 1–2 days per profile, which may cause operator dissatisfaction and negative rumors about the platform.

Our compromise is:

  • Generate the profile in 1–2 minutes automatically.

  • Later, after verification (e.g., within a week), add a verified tag to the profile.

This balances customer trust and operator experience.

On the generation layer
Your suggestions about the generation pipeline are completely valid: risk of hallucinations, unsupported claims, and generic SEO filler are real concerns. Your point that “one API endpoint: input row → prompt → final paragraph → publish” is too fragile for scale is exactly what we’re trying to avoid.

We’re already considering:

  • Prompt optimization so the model performs accurately and stays within allowed facts.

  • A pipeline that owns normalization, fact restrictions, validation, duplicate checks, and publishing rules in the application layer, while the model mainly handles wording.

  • Structured outputs from OpenRouter, with validation before publishing.

Honestly, for the generation layer, I’m really glad for your help and guidance throughout this process. One last thing I wanted to ask — how would you personally approach building this system from scratch? Like, from which point would you start, and how would you structure the generation flow for scalability and SEO quality?

Oh. If your existing production stack is already mostly settled, you can safely treat my earlier vLLM comments as just a from-scratch architecture example and skip that part. The more important point is this: if you use raw LLM responses directly, it is hard to keep quality stable at scale. In many cases, the basic pattern is to put a layer between the model output and the published page — usually by having the model produce structured output first:


Short version

If I were building this, I would not start by replacing your backend.

You already have:

  • PostgreSQL
  • Java / Spring Boot
  • React / Node.js
  • AWS hosting
  • a production app
  • fixed operator-page UI sections
  • an API-based plan using OpenRouter or similar providers

So I would keep that stack and add an asynchronous profile content lifecycle around it.

The core flow would be:

operator data
  ↓
normalized facts
  ↓
fact pack
  ↓
structured generation
  ↓
validation report
  ↓
duplicate / SEO quality checks
  ↓
repair or review
  ↓
public-unverified / verified / published content

The model writes prose.
The application owns facts, consistency, validation, duplicate detection, and publishing decisions.

That distinction matters. Even a very capable LLM can produce good-looking but invalid text if the raw output is used directly. I would treat model output as a draft, not as the production artifact.

Useful references:


1. Keep the backend, make the LLM provider an adapter

I would not move to a new backend unless there is a strong reason.

Spring Boot can remain the source of truth. PostgreSQL can store raw operator data, generation jobs, generated versions, validation results, review states, and publication states.

The LLM provider should be an adapter:

interface ProfileGenerationClient {
    GeneratedProfile generate(ProfileFactPack factPack, GenerationConfig config);
}

Initial implementation:

OpenRouterProfileGenerationClient

Possible future implementations:

DirectProviderClient
InternalFineTunedModelClient
SelfHostedModelClient

For every generation, I would store metadata:

{
  "provider": "openrouter",
  "requested_model": "<MODEL_ID>",
  "resolved_model": "<RESOLVED_MODEL_IF_AVAILABLE>",
  "prompt_version": "profile_prompt_v7",
  "schema_version": "operator_profile_schema_v3",
  "fact_pack_version": "fact_pack_v2",
  "temperature": 0.3,
  "max_tokens": 1200,
  "input_hash": "<INPUT_HASH>",
  "fact_pack_hash": "<FACT_PACK_HASH>",
  "output_hash": "<OUTPUT_HASH>"
}

Without this, it becomes difficult to debug quality changes later.


2. Start with the content contract

Before prompt engineering, I would define the exact output contract.

Since your UI is fixed, the model should not return arbitrary prose. It should return structured content for your fixed sections.

Example:

{
  "bio": "...",
  "services_offered": [
    {
      "name": "...",
      "description": "...",
      "source_fact_ids": ["skill_12", "category_3"]
    }
  ],
  "service_areas": [
    {
      "name": "Austin, TX",
      "source_fact_ids": ["location_primary"]
    }
  ],
  "faqs": [
    {
      "question": "...",
      "answer": "...",
      "source_fact_ids": ["skill_12", "rate_1"]
    }
  ],
  "seo": {
    "title": "...",
    "meta_description": "..."
  },
  "claims_used": [
    {
      "claim": "The operator provides appliance repair in Austin, TX.",
      "source_fact_ids": ["category_3", "location_primary"]
    }
  ],
  "unsupported_claims": [],
  "risk_flags": []
}

The important part is source_fact_ids.

The model should not only write text. It should say which input facts support the generated claim. That makes downstream validation much easier.

OpenRouter structured outputs can help enforce the response shape:

But structured output is not the same as factual output.

This can be valid JSON and still be business-invalid:

{
  "bio": "Austin-based certified appliance repair specialist with same-day service.",
  "claims_used": ["certified", "same-day service"],
  "unsupported_claims": []
}

If the operator did not provide certification or availability facts, that content should be rejected even if the JSON is valid.

So I would split validation into:

JSON/schema validation:
  checks shape

business validation:
  checks factuality, forbidden claims, duplicates, SEO risk, and publishability

3. Build a fact pack before generation

I would not send the raw operator record directly to the model.

Convert raw operator data into a fact pack first.

Example:

{
  "operator_id": "op_123",
  "allowed_facts": [
    {
      "id": "service_primary",
      "type": "service",
      "value": "appliance repair"
    },
    {
      "id": "location_primary",
      "type": "location",
      "value": "Austin, TX"
    },
    {
      "id": "experience_years",
      "type": "experience",
      "value": 7
    },
    {
      "id": "skill_1",
      "type": "skill",
      "value": "washer repair"
    }
  ],
  "forbidden_claims": [
    "licensed",
    "insured",
    "certified",
    "top-rated",
    "best",
    "guaranteed",
    "same-day service",
    "24/7 emergency service",
    "5-star reviews"
  ],
  "missing_fact_classes": [
    "insurance",
    "certifications",
    "reviews",
    "availability",
    "service_radius"
  ],
  "content_limits": {
    "max_bio_words": 140,
    "max_faq_count": 2,
    "allow_faq": true
  }
}

Missing data should become explicit constraints.

For example:

insurance = null

should become:

Do not claim insured.

And:

reviews_summary = null

should become:

Do not claim highly reviewed, 5-star, top-rated, or customer-loved.

The model should not decide what missing data means. The application should decide.


4. Use a multi-step generation flow

I would avoid this:

input row → one prompt → final paragraph → publish

That is fragile at scale.

I would use a workflow:

1. Normalize operator input
2. Build fact pack
3. Decide content policy
4. Generate content plan
5. Validate content plan
6. Generate structured profile JSON
7. Validate schema
8. Validate factuality
9. Validate forbidden claims
10. Validate SEO/content quality
11. Check duplicate / near-duplicate risk
12. Repair or regenerate
13. Decide publishing state
14. Store content version + validation report

This is close to the workflow patterns described by Anthropic, especially prompt chaining and evaluator-optimizer:

The model should not own the whole workflow.

The model can write the words.
The application should decide what is allowed, what is invalid, what needs review, and what can be published.


5. Insert a content-plan step

Before final content generation, I would ask for a plan.

Example:

{
  "bio_plan": {
    "angle": "practical local appliance repair help",
    "facts_to_use": [
      "service_primary",
      "location_primary",
      "experience_years",
      "skill_1"
    ],
    "facts_to_avoid": [
      "insurance",
      "certifications",
      "reviews",
      "availability"
    ]
  },
  "faq_plan": [
    {
      "question_type": "service_scope",
      "source_fact_ids": ["service_primary", "skill_1"]
    }
  ],
  "skip_sections": [
    {
      "section": "certifications",
      "reason": "no certification facts were provided"
    }
  ]
}

Then validate the plan before generating final copy.

If the plan already includes:

certified technician
same-day service
top-rated
5-star reviews

and those facts are not in the fact pack, reject the plan before the final content is generated.


6. Store validation reports

For every generated profile, I would store a validation report.

Example:

{
  "schema": {
    "status": "pass",
    "errors": []
  },
  "factuality": {
    "status": "fail",
    "unsupported_claims": [
      {
        "claim": "insured",
        "reason": "insurance was not present in the fact pack"
      }
    ]
  },
  "forbidden_claims": {
    "status": "pass",
    "violations": []
  },
  "seo_quality": {
    "status": "warn",
    "issues": [
      "FAQ answer is generic",
      "bio uses low-specificity wording"
    ]
  },
  "duplication": {
    "status": "pass",
    "nearest_profile_id": "op_987",
    "similarity": 0.78
  },
  "decision": "repair"
}

This report is useful for:

  • debugging failed generations
  • explaining why a profile went to review
  • improving prompts
  • comparing models
  • building future evals
  • creating future fine-tuning or preference data

Without validation reports, you only have “the model wrote something.”
With validation reports, you have a system you can improve.


7. Separate generated, public-unverified, verified, and published

I would not use READY to mean “trusted.”

I would separate these states:

State Meaning
GENERATED_READY Generated and passed automated checks
PUBLIC_UNVERIFIED Publicly visible, but not manually/proof verified
VERIFIED Important operator facts have been verified
REVIEW_REQUIRED Should not be auto-published
PUBLISHED Currently rendered on the live page

The key distinction:

generated != verified

Your idea of generating quickly and adding a verified tag later is reasonable. I would just make that distinction explicit in the data model and UI.

A profile can be generated in 1–2 minutes and shown as public-unverified.
It can become verified later after proof, human review, or platform verification.


8. Use risk-based review, not full human-in-the-loop

I would not review every generated profile before publication unless the category is sensitive or legally risky.

Full human-in-the-loop can be too slow for onboarding.

Instead:

Auto-publish as PUBLIC_UNVERIFIED if:
  - schema is valid
  - no unsupported claims
  - no forbidden claims
  - duplicate score is low
  - fact density is sufficient
  - no suspicious operator patterns
  - no high-risk service category

Send to review if:

REVIEW_REQUIRED if:
  - unsupported claims were detected
  - forbidden claims were detected
  - duplicate similarity is high
  - sparse input produced long output
  - repeated repair attempts failed
  - operator data looks suspicious
  - service category is high risk

This keeps onboarding fast while still protecting quality.


9. Treat SEO quality as a policy, not a prompt phrase

I would avoid making the main instruction:

Write SEO-friendly content.

That can produce filler, keyword stuffing, and city/service boilerplate.

I would define the target as:

useful, fact-grounded, operator-specific, non-duplicative content

Relevant Google references:

The risk is not “AI wrote it.”
The risk is generating many low-value, near-duplicate, weakly grounded pages.

SEO/content quality gate:

- Does this profile contain enough operator-specific facts?
- Are service areas supported by input data?
- Are FAQs grounded in actual facts?
- Is the title/meta keyword-stuffed?
- Is this page too similar to other city/service pages?
- Is this sparse profile being inflated into a long page?
- Should this page be short, noindex, or review-required until more facts are collected?

Most important rule:

Sparse inputs should produce short profiles, not inflated pages.

If the operator only provides a city and one service, do not generate a long bio and five FAQs. That creates both hallucination risk and SEO risk.


10. Measure uniqueness instead of asking for it

I would not rely on this instruction:

Write a unique description.

I would measure uniqueness.

Layer Check
1 normalized text hash
2 repeated phrase / sentence pattern
3 n-gram overlap
4 embedding similarity
5 same-city + same-service comparison
6 operator-data duplicate detection

Since you already use PostgreSQL, pgvector is a practical option for vector similarity search.

Example:

SELECT
    id,
    operator_id,
    service,
    city,
    embedding <=> <QUERY_EMBEDDING> AS cosine_distance
FROM operator_profile_versions
WHERE service = <SERVICE>
  AND city = <CITY>
ORDER BY embedding <=> <QUERY_EMBEDDING>
LIMIT 10;

Possible policy:

if exact_hash_match:
    reject

if ngram_overlap > threshold:
    regenerate

if embedding_similarity > threshold and same_city_same_service:
    review_required

if operator_data_duplicate_score > threshold:
    block_or_manual_review

The thresholds should come from your own data.

Key idea:

Uniqueness should be a measured property, not a prompt instruction.

11. Make async generation reliable

Your workflow has this shape:

1. Save operator record in Postgres
2. Push generation job to queue

That creates a classic dual-write problem.

The DB write can succeed while queue publish fails. Or queue publish can happen twice. Or the worker can receive the same job more than once.

I would use the transactional outbox pattern:

Flow:

Spring Boot transaction:
  - save operator record
  - insert generation_job
  - insert outbox_event

Outbox publisher:
  - reads unpublished outbox rows
  - sends message to SQS or worker queue
  - marks outbox row as published

Worker:
  - consumes job
  - checks idempotency key
  - builds fact pack
  - generates content
  - validates content
  - writes content version + validation report

If you use SQS Standard queues, design for at-least-once delivery. AWS documents that messages may be delivered more than once and consumers should be idempotent:

Job payload:

{
  "job_id": "<JOB_ID>",
  "operator_id": "<OPERATOR_ID>",
  "input_hash": "<INPUT_HASH>",
  "fact_pack_hash": "<FACT_PACK_HASH>",
  "prompt_version": "<PROMPT_VERSION>",
  "schema_version": "<SCHEMA_VERSION>",
  "attempt_number": 1,
  "idempotency_key": "<IDEMPOTENCY_KEY>"
}

12. Build private evals before choosing the model

Public leaderboards are useful for discovery, but they do not measure your exact task.

I would create an offline eval set:

100-300 real or representative operator records

Include difficult cases:

- rich operator data
- sparse operator data
- same city + same service
- missing rate
- missing experience
- missing insurance
- missing certifications
- no reviews
- ambiguous service area
- bot-like duplicate registrations

Evaluate models and prompts on:

schema_pass_rate
unsupported_claim_rate
forbidden_claim_rate
required_fact_inclusion_rate
duplicate_risk_rate
sparse_profile_inflation_rate
FAQ_grounding_rate
repair_attempts_per_accepted_output
human_acceptance_rate
latency
accepted_output_cost

References:

Do not choose based on five nice-looking examples.

Choose based on accepted-output cost:

accepted_output_cost =
  first_generation_cost
  + repair_generation_cost
  + validation_cost
  + duplicate-regeneration cost
  + human-review cost, if triggered

A cheaper model may be more expensive in production if it causes more repairs and reviews.


13. Model shortlist I would test

I would still avoid choosing the model from public vibes alone.

But if I had to build an initial shortlist, I would test models that cover different tradeoffs:

Candidate Why test it
google/gemma-4-26B-A4B-it First practical candidate; strong size/performance profile
google/gemma-4-31B-it Gemma-family quality ceiling
Qwen/Qwen3.6-27B Dense 27B challenger
Qwen/Qwen3.6-35B-A3B Efficient MoE challenger
mistralai/Mistral-Small-4-119B-2603 Heavier quality comparison
CohereLabs/command-a-plus-05-2026-w4a4 Enterprise/business prose comparison
moonshotai/Kimi-K2-Instruct-0905 Upper-bound comparison
meta-llama/Llama-3.3-70B-Instruct Stable baseline

Why Gemma 4 should be included

I would definitely include the Gemma 4 family, especially:

google/gemma-4-26B-A4B-it
google/gemma-4-31B-it

google/gemma-4-26B-A4B-it is interesting because it is a Mixture-of-Experts model. OpenRouter describes it as 25.2B total parameters with only 3.8B active per token, 256K context, structured output support, function calling, reasoning mode, and Apache 2.0 licensing:

For this task, I would treat it as the first practical candidate.

I would use:

Gemma 4 26B A4B:
  first model to try
  strong size/performance candidate
  good API-evaluation candidate

Gemma 4 31B:
  quality ceiling inside Gemma 4
  useful to check whether A4B loses anything important

Why Qwen3.6 should be included

I would also test:

Qwen/Qwen3.6-27B
Qwen/Qwen3.6-35B-A3B

Qwen/Qwen3.6-27B is a strong dense comparison point. Its model card says the artifacts are compatible with Hugging Face Transformers, vLLM, SGLang, KTransformers, and similar runtimes:

I would test it for:

- instruction following
- JSON/schema stability
- factual discipline
- natural business prose
- repair rate

Qwen/Qwen3.6-35B-A3B is also worth testing as an efficient MoE-style challenger:

Why Mistral Small 4 and Command A+ might be useful

I would include Mistral Small 4 if budget and latency allow it:

I would not use it because every profile needs heavy reasoning. I would use it to see whether a stronger model reduces:

- unsupported claims
- repair attempts
- generic filler
- duplicate-like phrasing
- awkward FAQ output

I would also test Command A+ if business/enterprise prose is important:

Command A+ is interesting as an enterprise/business prose comparison, not necessarily as the first production choice.

How I would choose

I would not choose based on first-generation prose quality alone.

For each model, I would measure:

Metric Meaning
schema_pass_rate Does it follow the JSON contract?
unsupported_claim_rate Does it invent facts?
forbidden_claim_rate Does it output banned claims?
duplicate_risk_rate Does it produce near-duplicate text?
sparse_profile_inflation_rate Does it inflate weak input?
repair_attempts_per_accepted_output How often does it need fixing?
human_acceptance_rate Do reviewers accept it?
accepted_output_cost True cost after validation/repair/review

My initial practical bet would be:

Start with:
  google/gemma-4-26B-A4B-it
  Qwen/Qwen3.6-27B
  mistralai/Mistral-Small-4-119B-2603

Then expand to:
  google/gemma-4-31B-it
  Qwen/Qwen3.6-35B-A3B
  CohereLabs/command-a-plus-05-2026-w4a4

If Gemma 4 26B A4B gives strong validation pass rates and low repair rates, I would favor it as the first production candidate because of its size/performance profile.

If Qwen3.6 follows constraints better, I would choose Qwen.

If Mistral Small 4 dramatically reduces unsupported claims and repair attempts, I would consider paying more for it.

The model decision should come after the pipeline exists, because the pipeline defines what “good” means.


14. Improve operator input UX

If the operator data is weak, the model has only two safe choices:

write short content
or ask for more data

The unsafe choice is:

inflate sparse data into a long profile

So I would improve the onboarding form.

Collect structured fields like:

- primary service
- secondary services
- city / service area
- years of experience
- license / certification
- insurance
- languages
- availability
- rate / price range
- specialties
- customer type
- examples of work
- short self-written note
- proof fields for verified claims

Then use a fact density score:

Fact density Content policy
high full profile, services, FAQ, SEO title/meta
medium shorter bio, limited FAQ
low short profile only, ask for more facts, maybe public-unverified or noindex

This may improve SEO quality more than changing the model.

The best way to make useful pages is to collect useful facts.


15. Use content versioning

Do not overwrite generated content in place.

Possible tables:

operator
operator_profile
operator_profile_version
generation_job
generation_outbox
generation_validation_report
profile_embedding
manual_review_task
operator_edit

Each generated version should store:

operator_id
profile_version_id
generated_json
published_json
validation_report
source_fact_hash
prompt_version
schema_version
provider
model
generation_params
created_at
published_at
verified_at

This matters because:

  • the operator may edit AI content
  • the platform may verify claims later
  • a new model may regenerate content
  • reviewers may approve or reject changes
  • you need rollback
  • edits become useful future eval/fine-tuning data

16. Do not start with fine-tuning

Fine-tuning can help later, but I would not start there.

First build:

- content schema
- fact pack
- validators
- duplicate checks
- private evals
- validation reports
- review states

Only after that would I consider fine-tuning.

Later, you can use:

operator facts
  + generated output
  + validation report
  + operator edits
  + reviewer decisions

to create:

SFT data:
  fact pack → good structured profile JSON

Preference data:
  chosen good output vs rejected bad output

Verifier data:
  fact pack + generated profile → validation report

If you fine-tune, I would start with LoRA/QLoRA rather than full fine-tuning:

But that is a later phase.


Practical build order

Phase 1: Offline prototype

1. Collect 100-300 representative operator records
2. Define content schema
3. Define fact pack schema
4. Define forbidden claims
5. Generate outputs with 2-4 models
6. Validate schema
7. Validate factuality
8. Check duplicate risk
9. Human-review 30-50 outputs
10. Tune prompt/schema/validators

Phase 2: MVP generation pipeline

1. Add generation_job table
2. Add content version table
3. Add validation report table
4. Add outbox table
5. Add worker
6. Add OpenRouter adapter
7. Add structured output
8. Add schema validation
9. Add basic fact/forbidden-claim checks
10. Add repair loop

Phase 3: SEO and duplicate safety

1. Add fact density scoring
2. Add sparse profile policy
3. Add n-gram duplicate checks
4. Add embeddings
5. Add pgvector similarity search
6. Add same-city/service duplicate policy
7. Add noindex/review-required rules for weak pages

Phase 4: Review and verification

1. Add PUBLIC_UNVERIFIED state
2. Add REVIEW_REQUIRED state
3. Add VERIFIED state
4. Add reviewer UI
5. Add operator edit UI
6. Store edits and review decisions

Phase 5: Model and tuning improvements

1. Run private evals regularly
2. Compare models by accepted-output cost
3. Add best-of-N generation if needed
4. Build verifier/reward model if useful
5. Consider LoRA/QLoRA or DPO after enough data exists

What I would avoid

I would avoid:

input row → one prompt → final paragraph → publish

I would avoid treating READY as trusted.

I would avoid writing long content for sparse operators.

I would avoid asking the model to make pages unique without measuring duplication.

I would avoid making “SEO-friendly” the main instruction.

I would avoid fine-tuning before you have evals and validation data.

I would avoid coupling business logic directly to one LLM provider.


Final summary

If I were building this from scratch, I would build a system that controls whether generated content is:

allowed
grounded
distinct
useful
publishable
reviewable
verifiable
versioned

The LLM is only the prose-generation component.

My first priorities would be:

1. content contract
2. fact pack
3. structured output
4. validation report
5. duplicate scoring
6. SEO/content quality policy
7. public-unverified vs verified states
8. private evals
9. reliable async jobs
10. operator input improvement
11. model comparison by accepted-output cost

The central rule:

The model can write the words, but the application should own the truth, consistency, publishing policy, and quality gates.