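ProfitsLocal / WebJuice · Project Overview

Live-maintained overview of the project structure · pre-rendered from README + docs/v3. Tabs: Overview / Funnel Inventory / Control Layer / Roadmap.

🗺️ Business logic map (live-maintained · read this first)

Live-maintained. This is the single diagram of how the whole business runs: **multiple entry points → converge to one company identity → unified gather flow → cost-tiered screening funnel → progressive deepening → master.md → site build.** Architecture details: docs/v3/SPEC-FUNNEL-ORCHESTRATION.md · docs/v3/SPEC-GATHER-MODULE.md. Legend: ✅ built · 🔄 in progress / planned · ⚠️ gap (entry not wired / not unified). _Updated 2026-05-31._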

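[ENTRY · multiple] ────────────► all converge to "one company identity" (name / phone / address / unique ID)
  · Docker maps scrape (pl:scrape-docker / gosom)             ✅
  · Google Places API (pl:places-search-intake)               ✅
  · Licence database: ~420k-row SQLite                        ✅
  · Google search → fetch results (tinyfish + ddg)            ✅
  · An image you send                                         ⚠️ entry not wired
  · A link you send                                           ⚠️ entry not wired
                              │
                              ▼
[UNIFIED FLOW · every entry shares this one pipe]
  1. Search (multi-engine · 5 routes)                                          ✅
  2. AI judges relevance + same-business check (blocks namesakes · red line)   ✅ (identity judge R143)
  3. Find the official website                                                 ✅
  4. Crawl official site + Google Maps + Places API + social (multi-source)    ✅ (social fetch via OpenCLI)
  5. Cross-verify → organise → master.md                                       🔄 (verify layer + consolidation · R146)
                              │
                              ▼
[SCREENING FUNNEL · cheapest and fastest first · exclude non-customers early]
  Stage 0 (free / seconds):   no contact info / closed / test name → exclude            ✅ exclusion-filter
  Stage 1 (free / DB lookup): licence revoked or expired → exclude                      ✅ #9 (fires only on identity-confirmed licences)
  Stage 2 (cheap):            too large / chain / government / competitor → exclude     ✅ exclusion-filter
  Stage 3 (mid):              how big is the problem = how much we can help (audit score)  ✅ audit grading
  Stage 4 (mid):              still operating? able to pay? (activity signals)          ⚠️ partial
                              │   each layer drops the unqualified; only survivors continue (cost rises with depth)
                              ▼
[DEEP CAPTURE · only for leads that reach this point · the most expensive steps]
  · Full-site crawl + real photos + reviews + social background                ✅ parts all exist
  · → rich master.md (site-build material)                                     🔄 R146 Phase E
                              │
                              ▼
                          [BUILD THEIR WEBSITE]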

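Three iron rules:

Entry points vary endlessly; downstream there is only one pipe. Every lead first converges to a unique company identity, then follows the same "search → judge relevance → find official site → multi-source gather → master.md" flow.
Funnel = cost tiering. The cheapest exclusions run first (free DB lookups / heuristic rules); the most expensive work (full-site crawl + photo capture) runs only on leads that reach the bottom of the funnel. Never spend money on a disqualified lead.
master.md is the finished product at the bottom of the funnel: a rich material document for "customers we have committed to build for" (redesign flavor when an official site exists; background-research flavor when it does not).

Solid vs. to-do: the "parts" of the unified flow + funnel mostly exist (identity judge, social fetch, licence gate, and external content mining have all landed). To do: (a) the image/link entries are not yet wired into the convergence point; (b) the funnel's cost-tiering master control (entry convergence → staged gates → progressive deepening) is not yet an explicit controller (pl:run-funnel is the seed); (c) the gather backbone (accurate page scale, free-first crawler, verify layer, master.md consolidation) is planned in SPEC-GATHER-MODULE.md.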

FUNNEL INVENTORY · existing work, by stage (2026-05-31)

Companion to README.md § The Funnel and SPEC-GATHER-MODULE.md. Grounded code inventory so we EXTEND, not rebuild (CLAUDE.md §7). (inferred) = surfaced by sweep, path not independently confirmed — verify before relying. Overall: ~72% of the funnel is wired & current; the identity-canonical write lane and a single end-to-end controller are the big gaps.

Stage 1 · ENTRY POINTS (all converge to one entity)

Channel	CLI	Status
Docker/gosom maps scrape	`pl:scrape-docker`	✅ (D43 fix: batch-start now chains it)
Google Places API	`pl:places-search-intake`	✅
Single business (phone/name/Maps URL)	`pl:single-enrich` (auto-chains audit)	✅
Licence DB (422k SQLite)	`pl:license-lookup` / `pl:license-build-index` / `pl:license-csv-sync`	✅
From a PHOTO	`pl:ingest-image` + `core/leads/image-lead-discovery-v2.js`	⚠️ entity created, but VLM auto-OCR is TODO(G-6.1) — fields still manual
From a LINK (arbitrary URL)	—	❌ NOT wired (planned in `data/sop1/intake-channels.json`)

Registry of channels: data/sop1/intake-channels.json.

Stage 2 · IDENTITY CONVERGENCE

Entity store: core/leads/discovery-store.js (upsert/merge/score/phase) · schema validation core/leads/entity-schema.js · score core/leads/discovery-score.js · routing core/leads/grade-router.js.
Entity JSON: data/leads/entities/<key>.json { latest{name,phone,email,website,address,city,niche,…}, status, phase, enrichment, license, deploy }.
Identity: deterministic core/enrichment/identity-match.js (R143-fixed) → tiered core/enrichment/identity/resolve-identity.js (tier0→2→1, write_allowed:false).
GAP (CRITICAL): no canonical-write lane — resolveIdentity verdicts never promote to entity (gated pending real-page clearance). dedup-merge marks loser merged but doesn't promote winner canonically.
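
A minimal sketch of what "converge to one entity" can look like, assuming a phone-first identity key and an in-memory Map in place of the real store. The helper names (normalizePhone, identityKey, upsertEntity), the key policy, and the `sources` field are hypothetical illustrations, not the discovery-store.js API; real identity matching goes through identity-match.js and resolveIdentity.

```javascript
// Hypothetical helpers: NOT the real discovery-store.js API.

// Keep digits only so "(02) 9000 1234" and "02-9000-1234" compare equal.
function normalizePhone(phone) {
  return String(phone || "").replace(/\D/g, "");
}

// Lowercase, strip punctuation, collapse whitespace.
function normalizeName(name) {
  return String(name || "")
    .toLowerCase()
    .replace(/[^a-z0-9\s]/g, " ")
    .replace(/\s+/g, " ")
    .trim();
}

// Assumed key policy: phone when present, otherwise name + city.
function identityKey(record) {
  const phone = normalizePhone(record.phone);
  if (phone) return `phone:${phone}`;
  return `name:${normalizeName(record.name)}|${normalizeName(record.city)}`;
}

// Upsert into an in-memory store. Existing non-empty fields are kept,
// and every contributing channel is remembered in `sources`.
function upsertEntity(store, record, source) {
  const key = identityKey(record);
  const entity = store.get(key) || { latest: {}, status: "new", sources: [] };
  for (const [field, value] of Object.entries(record)) {
    if (value && !entity.latest[field]) entity.latest[field] = value;
  }
  if (!entity.sources.includes(source)) entity.sources.push(source);
  store.set(key, entity);
  return key;
}

const store = new Map();
const keyA = upsertEntity(store, { name: "Acme Plumbing", phone: "(02) 9000 1234" }, "gosom");
const keyB = upsertEntity(
  store,
  { name: "ACME Plumbing Pty", phone: "02-9000-1234", website: "https://acme.example" },
  "places"
);
console.log(keyA === keyB, store.size); // true 1
```

Two channels reporting the same phone land on one entity, and the second channel only fills fields the first left empty. A phone-only key is deliberately naive: it cannot catch namesakes or shared numbers, which is the job of the identity judge.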

Stage 3 · UNIFIED ENRICHMENT FLOW

Router core/leads/enrichment.js (5 routes: official/fb/ig/li/reviews + reverse-phone) · core/extractors/tinyfish.js (T0) → core/scrape/ddg.js (fallback).
AI relevance + identity: core/llm/match-judge.js (judgeEnrichmentMatches + judgePageIdentity).
External content mining: core/enrichment/mine-background.js (R145, identity-gated, quarantine) + OpenCLI core/enrichment/fetch/opencli-fetch.js (R137, cleared) — pl:mine-background.
Places enrich pl:places-enrich; reviews/GBP core/leads/reviews-adapter.js, core/handoff/gbp-*. Gate core/leads/enrichment-gate.js. Batch pl:run-enrichment-batch (+ identity observe wired R138).

Stage 4 · SCREENING / FILTER (cheap-first)

core/leads/exclusion-filter.js (3 layers: data-quality / business-type / timing) + core/leads/niche-config.json.
Cheap audit core/scoring/cheap-audit-v2.js (T0) → detailed core/scoring/detailed-audit.js (T1) → reviews+vision (T2).
Grading core/scoring/lead-grading.js (investment_level/product_tier/pricing) · gate core/scoring/qualification-scorecard.js (7 hard gates + 5D score≥60).
Licence kill core/leads/licence-kill-observe.js (R144, OBSERVE only, identity-gated). Archive core/leads/terminal-archive.js.
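
The "hard gates + score threshold" shape of the qualification gate can be sketched as below. Only the threshold of 60 comes from the line above; the gate names, the three-gate list (the real scorecard has 7), and the assumption that five dimensions are scored 0-20 and summed are illustrative guesses, not the contents of qualification-scorecard.js.

```javascript
// Hypothetical gates: the real scorecard defines 7 hard gates.
const HARD_GATES = [
  { name: "has_contact", check: (lead) => Boolean(lead.phone || lead.email) },
  { name: "is_operating", check: (lead) => lead.operating === true },
  { name: "not_chain", check: (lead) => lead.chain !== true },
];

const PASS_SCORE = 60; // threshold taken from the inventory line

function qualify(lead) {
  // A single failed hard gate disqualifies, whatever the score is.
  const failed = HARD_GATES.filter((gate) => !gate.check(lead)).map((gate) => gate.name);
  // Assumption: five dimensions scored 0-20 each, summed to a 0-100 total.
  const score = Object.values(lead.dimensions || {}).reduce((sum, value) => sum + value, 0);
  return { qualified: failed.length === 0 && score >= PASS_SCORE, failed, score };
}

const goodLead = {
  phone: "0290001234",
  operating: true,
  dimensions: { need: 15, fit: 15, reach: 10, budget: 15, activity: 10 },
};
console.log(qualify(goodLead).qualified); // true
```

Keeping the hard gates separate from the score means a high-scoring lead that fails one gate is still rejected, and the `failed` list gives a ready-made drop reason.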

Stage 5 · DEEP CAPTURE (existing-site)

Crawl core/audit/multi-page-crawl.js (sitemap-aware · Firecrawl PAID → Playwright fallback · captures url/title/meta/rawHtml/text/images/links/headings).
Extractors: contact-extraction.js, logo-extractor.js, activity-audit.js, form-audit.js; (inferred) tech-stack/domain-history/ai-geo/image-optimization (verify).
Brief: core/audit/redesign-brief-builder.js DEEP → writes clients/<slug>/v2/core-extract.json {real_facts, brand_signals, ai_extensions, qualification}.

Stage 6 · CONSOLIDATION → master.md

Builder core/reports/master-md-builder.js (frontmatter + 5 CN sections) · CLI scripts/leads/build-master-md.js · refresh core/leads/master-md-refresh.js.
Feeds: detailed+visual+reviews+techstack+sitemap+activity+geo+pagespeed+form+domain+redesign-brief+grading.
Output: clients/<slug>/v2/master.md (+ themed HTML via huashu-md-html).
GAP: external_facts (R145) NOT yet read by master.md (deferred). redesign-brief↔master.md are separate paths (clarify: brief = stage-5 input to stage-6).
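
As a rough illustration of "frontmatter + sections" with the redesign brief treated as one input among the feeds, the sketch below assembles a markdown string. The frontmatter fields, section titles, and the buildMasterMd signature are hypothetical; master-md-builder.js defines its own five sections.

```javascript
// Hypothetical shape: not the real master-md-builder.js interface.
function buildMasterMd({ slug, grade, sections }) {
  const frontmatter = ["---", `slug: ${slug}`, `grade: ${grade}`, "---"].join("\n");
  const body = sections
    .filter((section) => section.lines.length > 0) // skip feeds that produced nothing
    .map((section) => `## ${section.title}\n\n${section.lines.map((line) => `- ${line}`).join("\n")}`)
    .join("\n\n");
  return `${frontmatter}\n\n${body}\n`;
}

const masterMd = buildMasterMd({
  slug: "acme-plumbing",
  grade: "A",
  sections: [
    { title: "Redesign brief", lines: ["Family business", "Emergency call-outs"] },
    { title: "Reviews", lines: [] }, // empty feed, omitted from the output
    { title: "Audit", lines: ["No contact form on the homepage"] },
  ],
});
console.log(masterMd);
```

Passing the brief in as a named section, instead of letting it write its own narrative file, is one way to keep a single builder as the only place where the terminal artifact is produced.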

Stage 7 · ORCHESTRATION

pl:run-funnel (R124): discovery→enrich→audit+grade+master · resumable · dry-default · excludes identity-canonical lane.
leads:run-pipeline: detailed-audit→vision→reviews→internal-report · does NOT auto-invoke build-master-md.
pl:pipeline-batch-start/step, pl:task-dispatcher/listener (SOP-0 queue).
GAP: no single CLI runs entry→screen→enrich→deep→master→publish end-to-end; audit→master not auto-chained.
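
"Resumable" and "dry-default" can be sketched as a step runner that skips steps recorded in a checkpoint and only executes when explicitly asked. The step names, the checkpoint shape, and the `execute` option are assumptions for illustration, not the pl:run-funnel internals.

```javascript
// Hypothetical runner: not the real pl:run-funnel implementation.
function runSteps(steps, checkpoint, { execute = false } = {}) {
  const log = [];
  for (const step of steps) {
    if (checkpoint.done.includes(step.name)) {
      log.push(`skip ${step.name}`); // finished in an earlier run
      continue;
    }
    if (!execute) {
      log.push(`would run ${step.name}`); // dry run: report only, change nothing
      continue;
    }
    step.run();
    checkpoint.done.push(step.name); // record progress so a rerun can resume here
    log.push(`ran ${step.name}`);
  }
  return log;
}

let executed = 0;
const steps = ["discovery", "enrich", "audit"].map((name) => ({ name, run: () => { executed += 1; } }));
const checkpoint = { done: ["discovery"] };

const dryLog = runSteps(steps, checkpoint); // dry by default
const realLog = runSteps(steps, checkpoint, { execute: true });
console.log(dryLog, realLog);
```

Because progress is recorded after each step, a step that throws leaves the checkpoint pointing at the last completed step, so the next run picks up from the failure.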

OVERLAP / DUPLICATION RISKS

Entity-write paths in discovery-store.js not unified under one transactional upsert (locking exists, not unified).
Enrichment task-spawn decided in 3 places (enrichment / enrichment-gate / cheap-audit-queue) — centralize + idempotent.
Contact info from site HTML (contact-extraction) vs external search (enrichment) both write entity.latest with no precedence — site should win (authoritative).
ABN validity in two places (license-lookup vs identity-match) — consolidate one validator.
redesign-brief-builder vs master-md-builder both synthesize narrative — make brief an explicit INPUT to master.md, not a parallel path.
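
The "site should win" rule for contact fields amounts to a source-precedence check on every write. A minimal sketch follows, assuming a per-field `provenance` record next to `latest`; that record, the rank table, and writeField are hypothetical and do not exist in the entity schema described above.

```javascript
// Assumed precedence: higher number wins. Source names are illustrative.
const SOURCE_RANK = { site: 2, search: 1 };

function writeField(entity, field, value, source) {
  const current = entity.provenance[field];
  // Refuse to overwrite a value that came from a more authoritative source.
  if (current && SOURCE_RANK[current] > SOURCE_RANK[source]) return false;
  entity.latest[field] = value;
  entity.provenance[field] = source;
  return true;
}

const entity = { latest: {}, provenance: {} };
writeField(entity, "phone", "0290001234", "site");
const overwritten = writeField(entity, "phone", "0411000000", "search");
console.log(overwritten, entity.latest.phone); // false 0290001234
```

Routing both contact-extraction and enrichment through one function like this would also cover the "not unified under one upsert" risk for these fields, since precedence is decided in a single place.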

GAPS (priority)

CRITICAL: identity-canonical write lane (Stage 2). Was blocked on real-page clearance; now unblocked (OpenCLI cleared R138; gold-set directional false_same=0 at R143), though full clearance still needs the 300-500 gold set.
HIGH: "from a link" entry point; photo-entry VLM auto-OCR (G-6.1).
MED: audit→master.md auto-chain; single E2E controller; dedup-merge canonical promotion.
The SPEC-GATHER-MODULE plan (free-first crawl, scope-vs-mining, verify layer, master.md consolidation) addresses the backbone of Stages 5-6.

Agents · Skills · Discord — the control + modular layer (2026-05-31)

How the funnel is DRIVEN (Discord + Hermes) and made MODULAR (skills the agents call). Companion to README.md § The Funnel, FUNNEL-INVENTORY.md, SPEC-FUNNEL-ORCHESTRATION.md. Sources: SOP_0_TASK_SYSTEM.md, DISCORD-CHANNELS-PRD.md. Legend ✅ live · ⚠️ code-exists-not-wired · ❌ missing/aspirational.

Discord — the operator surface (POST-mostly; commands via #website-tasks)

Channel	Env	Status	Purpose
#website-tasks	WEBSITE_TASKS_FORUM_CHANNEL_ID	✅	command/task entry → intent-router → CLI
#website-leads	WEBSITE_LEADS_DISCORD_CHANNEL_ID	✅	per-lead threads (no-demo) · grade/phase tags
#website-projects	WEBSITE_PROJECTS_DISCORD_CHANNEL_ID	⚠️	demo-ready leads + sales stages (not actively written)
#website-templates	WEBSITE_TEMPLATES_DISCORD_CHANNEL_ID	✅	template family threads
#lead-discovery-runs	LEAD_DISCOVERY_RUNS_DISCORD_CHANNEL_ID	⚠️	batch run progress (code exists, not emitting)
#paid-websites	PAID_WEBSITES_DISCORD_CHANNEL_ID	❌	M5+ paid build/revision stages

Flows (who posts): batch progress core/funnel/pipeline-batch-thread.js; per-lead thread core/funnel/lead-thread-sync.js; per-stage (9 stages) core/funnel/audit-stage-messages.js; cheap-audit verdict core/leads/cheap-audit-queue.js; archive core/leads/terminal-archive.js; paid intake/revision core/funnel/paid-intake-ops.js; build handoff / review / live-publish core/contracts/discord-messages.js. Message contract SoT: core/contracts/discord-messages.js.

Agents

Hermes (core/funnel/hermes-cron.js → local python ~/Developer/Hermes Agent): reads #website-tasks, runs per-lead crons (grade-A 4h / grade-B 12h), calls pl:context (read) + posts decision drafts for operator approval, advances phase. Skill exposed: profitslocal-lead-ops via registerLeadCron(...,{skill}). Status: ⚠️ local-only, aspirational (no VPS).
Commerce/Stripe (core/funnel/submission-router.js + paid-intake-ops.js): Stripe webhook → order/entitlement/revision-quota → case memory → agent task → #paid-websites. Status: ✅ MVP verified (Opa). Not a persistent agent — a synchronous dispatcher.
How agents invoke work: npm CLIs, NOT direct skill calls. Discord task → core/tasks/intent-router.js (codex_cli→claude_cli→ollama→regex, 8 kinds) → resolves to a pl:*/leads:* CLI. Agent tasks: data/agent-tasks/<client>/*.json executed by operator. No runtime skill-runner.
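
The resolver chain with a regex last resort, followed by a hard-coded kind→CLI map, can be sketched like this. The kind names and regex patterns are hypothetical, and the sketch is synchronous for brevity, whereas resolvers that shell out to codex_cli, claude_cli, or ollama would be asynchronous. Only the CLI names pl:single-enrich and pl:run-funnel come from this document.

```javascript
// Hypothetical kinds: the real router has 8.
const KIND_TO_CLI = {
  single_enrich: "pl:single-enrich",
  run_funnel: "pl:run-funnel",
};

// Last-resort resolver: plain keyword matching.
function regexResolver(text) {
  if (/\benrich\b/i.test(text)) return "single_enrich";
  if (/\bfunnel\b/i.test(text)) return "run_funnel";
  return null;
}

// Try resolvers in order; the first one that yields a known kind wins.
function resolveIntent(text, resolvers) {
  for (const resolver of resolvers) {
    let kind = null;
    try {
      kind = resolver(text);
    } catch (err) {
      kind = null; // a failing resolver falls through to the next one
    }
    if (kind && KIND_TO_CLI[kind]) return { kind, cli: KIND_TO_CLI[kind] };
  }
  return null;
}

const offlineModel = () => { throw new Error("model offline"); };
const resolved = resolveIntent("please enrich this lead", [offlineModel, regexResolver]);
console.log(resolved);
```

The hard-coded KIND_TO_CLI table is the part that has to be edited for every new step, which is the coupling the modularization target below wants to remove.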

Skills (19 · `skills/*/SKILL.md`) — modular intent, not yet runtime-executable

Discovery/screen: profitslocal-lead-discovery, profitslocal-lead-filter, profitslocal-entity-enrichment, image-lead-discovery, site-audit. Collect/build-prep: profitslocal-collect, profitslocal-build-research-pack, profitslocal-data-checkpoint, profitslocal-assemble-handoff, profitslocal-audit-handoff. Audit/QA: profitslocal-quality-audit, website-copy-audit, website-ui-audit, pl-audit-rubric. Build/spec/voice: pl-local-trade-page-spec, pl-au-trade-voice, website-redesign-preservation, template-lab. Orchestration: profitslocal-lead-ops (the Hermes-callable one).

Invocation today: each skill has a matching CLI; Discord routes kind→CLI; only profitslocal-lead-ops is Hermes-callable. Skills are narrative prompt artifacts, not executables — no skill-loader / skill-registry / skill:run CLI.

The modularization target (Matthew's intent)

Make funnel steps modular, agent-callable skills so Hermes (and Claude/Commerce agents) can invoke them uniformly. Today: CLIs exist for every step, but the "skill" layer is docs + a hard-coded intent map. The target = a real skill surface (discoverable + runnable) that the agents call, instead of editing intent-router.js per new step.

INTEGRATION GAPS (Discord/agents/skills ↔ funnel stages 1-7)

No runtime skill-runner — skills are prompts, not executables; can't skill:run <name> --args. Intent-router hard-maps kind→CLI (must edit code to add a skill).
Discord↔skill loop missing — flow is Discord→intent-router→CLI→Discord; no skill in the middle; event-driven skill use needs a manual Claude Code session.
Two channels code-exists-not-wired — #lead-discovery-runs (batch threads) + #website-projects (demo-ready) not actually emitting → operator blind to batch progress + demo backlog.
Stage 8 (build) / 9 (publish) not skill-wrapped — CLIs only; agents can't invoke directly.
Hermes local-only — no always-on deployment; can't run when operator offline.
Identity/screening/gather modules (this session) not yet skill- or Discord-surfaced — resolveIdentity, mine-background, licence-kill, run-funnel exist as CLIs/modules but aren't in the intent-router kinds or skill registry.

Where this plugs into the framework

The funnel orchestrator (SPEC-FUNNEL-ORCHESTRATION Phase 1) is the natural home to (a) emit the missing Discord batch/demo threads (drop-accounting → #lead-discovery-runs), and (b) be the first consumer of a real skill surface.
ADR-11 (proposed): a skill-runner contract — each funnel step = a skill with a machine-readable manifest (inputs/outputs/cost_tier/CLI binding) that BOTH the intent-router and Hermes resolve against, so adding a step doesn't require editing router code. The ops:skill-contract-audit gate (built R136) already checks SKILL.md currency — extend it to validate the runnable manifest.
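
One possible shape for the ADR-11 manifest, sketched under heavy assumptions: the field names follow the inputs/outputs/cost_tier/CLI-binding list above, but the exact schema, the `kind` values, the skill↔CLI pairing, and the `--name=value` flag convention are all hypothetical.

```javascript
const COST_TIERS = ["free", "cheap", "mid", "expensive"];

// Hypothetical manifests: the pairing of skill and CLI is illustrative only.
const MANIFESTS = [
  {
    skill: "profitslocal-entity-enrichment",
    kind: "entity_enrich",
    cost_tier: "cheap",
    cli: "pl:single-enrich",
    inputs: ["phone"],
  },
];

function validateManifest(manifest) {
  const errors = [];
  for (const field of ["skill", "kind", "cli"]) {
    if (typeof manifest[field] !== "string" || manifest[field] === "") errors.push(`missing ${field}`);
  }
  if (!COST_TIERS.includes(manifest.cost_tier)) errors.push("bad cost_tier");
  if (!Array.isArray(manifest.inputs)) errors.push("inputs must be an array");
  return errors;
}

// Index valid manifests by kind; reject the whole set on the first bad one.
function buildRegistry(manifests) {
  const byKind = new Map();
  for (const manifest of manifests) {
    const errors = validateManifest(manifest);
    if (errors.length) throw new Error(`${manifest.skill || "?"}: ${errors.join(", ")}`);
    byKind.set(manifest.kind, manifest);
  }
  return byKind;
}

// Resolve a kind plus arguments into an npm command line (as an argv array).
function toCommand(registry, kind, args) {
  const manifest = registry.get(kind);
  if (!manifest) return null;
  const missing = manifest.inputs.filter((name) => !(name in args));
  if (missing.length) throw new Error(`missing inputs: ${missing.join(", ")}`);
  return ["npm", "run", manifest.cli, "--", ...manifest.inputs.map((name) => `--${name}=${args[name]}`)];
}

const registry = buildRegistry(MANIFESTS);
const command = toCommand(registry, "entity_enrich", { phone: "0290001234" });
console.log(command.join(" "));
```

With a registry like this, both the intent-router and Hermes would look up a kind instead of carrying their own maps, and adding a step becomes a data change. The same validateManifest check is the kind of rule an extended ops:skill-contract-audit could enforce.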

SPEC · Funnel Orchestration (stages 1-7) · codex R147 · 2026-05-31

Status: PLAN (codex R147). Owns the WHOLE funnel: entry convergence → cost-gated ordering → identity gating → progressive deepening → ONE master.md terminal artifact → drop accounting. The deep-capture/gather internals are a separate track (SPEC-GATHER-MODULE.md, invoked only AFTER cheap gates pass). Picture: README.md § The Funnel · live https://pl-business-map.pages.dev

Why this exists (codex R147)

Inventory (FUNNEL-INVENTORY.md) shows ~72% of parts are wired, but the system is not yet a coherent funnel: no single controller converges entries → runs cheapest-first exclusions → deepens only survivors → emits one master.md. This spec makes the system BEHAVE like the README funnel before we polish deep-capture quality.

Core principle — cost staging (Matthew)

Cheapest + fastest exclusions first; expensive work (full crawl, photos, reviews, vision) runs ONLY on leads that survive. Never spend money on disqualified leads.

entry (any channel) → ONE identity → [Stage0 free] → [Stage1 free-db] → [Stage2 cheap] →
  [Stage3 mid: problem-size] → [Stage4 mid: running/ability-to-pay] → [DEEP capture] → ONE master.md → build
                         ↑ each gate drops the unqualified + records WHY (drop accounting)
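
The pipeline above can be read as an ordered list of gates that short-circuits on the first drop. A minimal sketch, assuming each stage is a function returning a drop reason or null; the stage names, predicates, and reason codes are hypothetical stand-ins for exclusion-filter, the licence check, and the audits.

```javascript
const TIER_ORDER = { free: 0, cheap: 1, mid: 2, expensive: 3 };

// Hypothetical predicates: each returns a drop reason, or null to pass.
const STAGES = [
  { name: "deep", cost_tier: "expensive", check: () => null },
  { name: "stage0", cost_tier: "free", check: (lead) => (lead.phone || lead.email ? null : "no_contact") },
  { name: "stage1", cost_tier: "free", check: (lead) => (lead.licence === "revoked" ? "licence_revoked" : null) },
  { name: "stage2", cost_tier: "cheap", check: (lead) => (lead.chain ? "chain" : null) },
];

function runFunnel(lead, stages) {
  // Sort by tier so cheap checks always run first (the sort is stable within a tier).
  const ordered = [...stages].sort((a, b) => TIER_ORDER[a.cost_tier] - TIER_ORDER[b.cost_tier]);
  const visited = [];
  for (const stage of ordered) {
    visited.push(stage.name);
    const reason = stage.check(lead);
    if (reason) {
      // First failing gate ends the run: later, costlier stages are never reached.
      const drop = { stage: stage.name, reason_code: reason, cost_tier: stage.cost_tier };
      return { survived: false, visited, drop };
    }
  }
  return { survived: true, visited, drop: null };
}

const chainResult = runFunnel({ phone: "0290001234", chain: true }, STAGES);
console.log(chainResult.visited, chainResult.drop);
```

Even though the expensive stage is listed first in STAGES, sorting by tier keeps it last, and a dropped lead never reaches it. That makes the cost ordering a property of the runner instead of a convention each caller has to remember.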

Build order (codex R147)

Phase 1 · Orchestrator skeleton + drop accounting ← FIRST. One controller (extend pl:run-funnel) that:

converges entry → entity;
runs the EXISTING cost-gated stages IN ORDER (exclusion-filter L1 → licence-kill(observe) → exclusion L2 → cheap-audit → qualification/grading);
does NOT run detailed crawl/reviews/deep until cheap gates pass;
auto-chains surviving leads into build-master-md;
emits drop accounting (entered / excluded@stageX / survived-to-deep / master-built).
Reuses all existing stage components; no rebuild.

Phase 2 · Stage-6 consolidation slice: make redesign-brief an explicit INPUT to ONE master-md-builder; read external_facts (R145) behind an observe/flag path; reviews/GBP as named source blocks; ensure the terminal artifact is reliable.
Phase 3 · Identity observe/proposed-canonical lane: resolveIdentity verdict → proposed canonical patch → observe log → (gated) promotion. NO automatic canonical writes until 300-500 gold clearance. Lane exists structurally so the funnel has the right shape + accumulates real-run evidence (avoids painful retrofit).
Phase 4 · Gather backbone = the R146 sequence B(page-scale)→A(free-crawl)→D(verify)→C(unify)→E(richer master.md), as depth improvement INSIDE deep-capture.
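
The Phase 3 lane (verdict → proposed patch → observe log → gated promotion) might be structured as below. The verdict and patch shapes and the function names are hypothetical; `write_allowed` mirrors the flag named in the inventory and defaults to false, so nothing is written to the entity unless a caller opts in.

```javascript
// Hypothetical shapes: not the real resolveIdentity output.
function proposePatch(entity, verdict) {
  const changes = {};
  for (const [field, value] of Object.entries(verdict.fields)) {
    if (entity.latest[field] !== value) changes[field] = { from: entity.latest[field], to: value };
  }
  return { entity_key: entity.key, verdict: verdict.result, changes };
}

function applyLane(entity, verdict, observeLog, { write_allowed = false } = {}) {
  const patch = proposePatch(entity, verdict);
  observeLog.push(patch); // always logged, so real-run evidence accumulates
  if (!write_allowed || verdict.result !== "same") return { promoted: false, patch };
  for (const [field, change] of Object.entries(patch.changes)) entity.latest[field] = change.to;
  return { promoted: true, patch };
}

const observeLog = [];
const lead = { key: "phone:0290001234", latest: { name: "Acme Plumbing" } };
const verdict = { result: "same", fields: { name: "Acme Plumbing", website: "https://acme.example" } };

const observed = applyLane(lead, verdict, observeLog); // observe-only by default
console.log(observed.promoted, lead.latest.website); // false undefined
```

The promotion branch exists from day one but stays unreachable in normal runs, which is the point of building the lane before clearance: the log fills with proposed patches that can later be checked against the gold set.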

ADRs (R147)

ADR-6 · Entry convergence contract: every channel (gosom/places/single/licence/image/link) emits a normalized entity {identity: name/phone/address/unique-id, source, source_url} into the store BEFORE the funnel runs. The funnel starts from the entity, never from a channel.
ADR-7 · Cost-gated stage ordering: stages are an ordered list with a cost_tier (free/cheap/mid/expensive); a lead only advances if the prior gate passes; expensive stages are unreachable for dropped leads (structural, not convention).
ADR-8 · Drop accounting: every drop records {stage, reason_code, cost_tier}; per-batch rollup {entered, by-stage drops, survived, master-built}; anomaly flag on drop-rate (reuse rollupScreeningObservability shape, R144).
ADR-9 · Identity gating in the funnel: the funnel consults resolveIdentity for cross-source merges + (Phase 3) proposed canonical patches; observe-only until clearance; no namesake bleed (R143 discipline).
ADR-10 · One terminal artifact: master.md is THE bottom-of-funnel output; the orchestrator guarantees it's built for every survivor; redesign-flavor (has-website) vs background-flavor (no-website) chosen by entity state.
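
For ADR-8, the per-batch rollup can be derived purely from the drop records. The sketch below assumes the drop shape {stage, reason_code, cost_tier} named above; the output field names and the `maxDropRate` anomaly threshold are hypothetical and are not the rollupScreeningObservability shape.

```javascript
// Hypothetical rollup: flags any stage whose share of entered leads exceeds maxDropRate.
function rollupDrops(entered, drops, masterBuilt, maxDropRate = 0.9) {
  const byStage = {};
  for (const drop of drops) byStage[drop.stage] = (byStage[drop.stage] || 0) + 1;
  const survived = entered - drops.length;
  const anomalies = Object.entries(byStage)
    .filter(([, count]) => entered > 0 && count / entered > maxDropRate)
    .map(([stage]) => stage);
  return { entered, by_stage: byStage, survived, master_built: masterBuilt, anomalies };
}

const drops = [
  ...Array.from({ length: 6 }, () => ({ stage: "stage0", reason_code: "no_contact", cost_tier: "free" })),
  ...Array.from({ length: 2 }, () => ({ stage: "stage2", reason_code: "chain", cost_tier: "cheap" })),
];
const batch = rollupDrops(10, drops, 2, 0.5);
console.log(batch);
```

Here 6 of 10 leads dropped at stage0, which exceeds the 0.5 threshold and is flagged, while `survived` and `master_built` both read 2, matching ADR-10's guarantee that every survivor gets a master.md.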

Scope guard

Phase 1 REUSES existing stage CLIs/modules (exclusion-filter, cheap-audit, qualification, build-master-md) — it's a controller, not a reimplementation. pl:run-funnel is the seed to extend.
Identity canonical writes stay gated (Phase 3 observe-only) until the 300-500 gold-set clears false_same=0 (currently on hold per Matthew; directional already 0 at R143).
Deep-capture quality (page-scale accuracy, free-crawl, verify) is Phase 4 = the separate SPEC-GATHER-MODULE track.