# Communication Translator Project Plan
This document is the long-term product vision and design philosophy.
For current implementation state and release readiness, use:
- `README.md`
- `RELEASE_CANDIDATE.md`
- `THISPER_STATUS.md`
- `THISPER_IMPLEMENTATION_PLAN.md`
## Working Name
Use a placeholder name until the product identity becomes obvious through use.
Suggested internal names:
- Thisper
- TypeFlow
- Fidelity Keyboard
For now, use a neutral working title:
**Project Codename: Thisper**
---
## Core Product Vision
Build a typing-first and speech-capable communication tool that lets me write or speak naturally and quickly, then cleans the output to improve readability **without changing my meaning or replacing my voice**.
This is **not** meant to be a generic AI assistant, chatbot, summarizer, or writing tool.
It is a **fidelity-preserving input translation system**.
Its job is to help me communicate the way I naturally think, while reducing the friction other people have when reading what I produce.
---
## One-Sentence Product Definition
A cross-application typing and speech translation layer that preserves original meaning and voice while improving readability.
---
## Problem Statement
Normal tools do not fit how I think or communicate.
Current problems:
- Standard autocorrect only fixes words, not readability.
- Predictive text is shallow and often changes intent.
- Most AI rewrite tools sound obviously AI-generated.
- Dictation tools like Wispr Flow improve readability, but focus mainly on speech.
- My preferred input method is typing, not speech.
- My natural writing style is fast, dense, highly connected, and often difficult for other people to follow.
- Existing tools either:
  - change too much,
  - flatten my tone,
  - introduce AI-sounding language,
  - or fail to preserve factual precision.
I need a system that acts as a **translation layer**, not a replacement voice.
---
## Why This Project Exists
This system exists to solve a real communication gap:
- I can type very quickly.
- I think in relationships and connected meaning, not simple step-by-step output.
- My raw writing often carries the correct meaning, but other people struggle to process the density, pacing, grammar, or structure.
- I want a system that lets me continue writing naturally, then makes the result easier for others to read.
- I do not want the system to rewrite me into a generic AI voice.
- I do not want to lose factual precision, uncertainty, or emotional tone unless I explicitly request a change.
The goal is:
**clean output, preserved self**
---
## Non-Goals
This project is **not** intended to be:
- a full chatbot
- a generic AI writer
- a generic note-taking app
- a journaling system
- a replacement for Journal
- a cloud-only product
- a social media writing assistant
- a summarizer by default
- a grammar tool that prioritizes correctness over meaning
- a tool that rewrites everything into professional corporate speech
If the project starts drifting into any of the above, stop and re-evaluate.
---
## Primary Design Principles
### 1. Preserve meaning
The system must not change factual claims, uncertainty, intent, or core message unless explicitly asked.
### 2. Preserve voice
The system should keep my tone, cadence, style, and general phrasing as much as possible.
### 3. Improve readability
The system should make text easier for other people to read by improving punctuation, sentence boundaries, grammar, and flow.
### 4. Minimize AI smell
The output should not sound like a chatbot wrote it.
### 5. Typing-first
This tool must treat typing as a first-class input method, not a fallback behind speech.
### 6. Speech-capable
Speech support is useful, but secondary to typing for my needs.
### 7. Cross-app use
This should work across applications rather than living only inside one app.
### 8. Trust through transparency
The user should be able to see what changed.
### 9. Speed matters
The system should feel immediate, especially in typed workflows.
### 10. Pluggable intelligence
The architecture should support local, cloud, or hybrid backends without hard-coding the project to one provider.
---
## Target Users
### Primary user
Me.
This project is being built first to solve my own communication and translation needs.
### Secondary users
People who:
- think faster than they comfortably communicate
- prefer typing over speech
- produce dense or hard-to-follow writing
- want cleanup without losing their style
- dislike obvious AI rewriting
- need help bridging the gap between raw output and readable output
Potential overlap:
- autistic users
- ADHD users
- disabled users
- technical users
- trauma survivors who need precision and control
- anyone whose natural communication style does not fit normal tools
---
## User Experience Goal
I should be able to:
1. Type naturally at full speed.
2. Speak naturally when useful.
3. Capture raw input without friction.
4. Run a cleanup/translation pass.
5. Get output that is easier to read but still clearly mine.
6. Use the result in any app.
The ideal feeling is:
**“I typed like myself, and the system made it readable without turning it into someone else.”**
---
## Core Use Cases
### Use Case 1: Typed cleanup
I paste or type raw text into the tool and receive a cleaned version that preserves my voice.
### Use Case 2: Selected-text rewrite
I select text in another application, trigger the tool, and get a cleaned version back.
### Use Case 3: Clipboard bridge
I copy raw text, run it through the translator, and paste the improved output elsewhere.
### Use Case 4: Speech capture
I speak into the system and receive a highly accurate transcript with readability cleanup.
### Use Case 5: Audience adaptation
I choose a mode such as readable, concise, or formal without losing core meaning.
### Use Case 6: Diff review
I inspect exactly what changed before accepting the result.
---
## Primary Modes
These modes should be explicit and limited. Avoid mode explosion.
### 1. Clean
Fix punctuation, capitalization, sentence boundaries, whitespace, and obvious grammar issues while staying extremely close to the original.
### 2. Readable
Improve clarity and flow slightly more than Clean while still preserving voice and meaning.
### 3. Formal
Make the text more appropriate for legal, support, or professional contexts while preserving core message and accuracy.
### 4. Concise
Reduce length without removing important meaning.
### 5. Preserve Voice
The strictest style-preserving mode. Minimal cleanup, maximum fidelity.
Default mode should likely be:
**Preserve Voice** or **Clean**
---
## Transformation Rules
The default transformation engine must obey rules like these:
1. Preserve meaning exactly unless a different mode explicitly allows restructuring.
2. Preserve uncertainty exactly.
3. Preserve factual claims exactly.
4. Preserve emotional tone unless asked to soften or harden it.
5. Do not summarize unless explicitly requested.
6. Do not inject stock AI phrases.
7. Do not over-polish.
8. Do not remove intensity unless needed for readability or safety.
9. When uncertain, stay closer to the original.
10. Always prefer fidelity over prettiness.
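One way to keep these rules enforceable is to hold them as data and assemble the instruction text per mode. The following is a minimal sketch, not the project's actual prompt: the names `Mode`, `BASE_RULES`, `MODE_RULES`, `DEFAULT_MODE`, and `build_instructions` are hypothetical, the per-mode wording paraphrases the mode descriptions in this plan, and the choice of Preserve Voice as the default is an assumption (the plan leaves it open between Preserve Voice and Clean).

```python
from enum import Enum


class Mode(Enum):
    CLEAN = "clean"
    READABLE = "readable"
    FORMAL = "formal"
    CONCISE = "concise"
    PRESERVE_VOICE = "preserve_voice"


# The ten transformation rules from this plan, applied in every mode.
BASE_RULES = [
    "Preserve meaning exactly unless a different mode explicitly allows restructuring.",
    "Preserve uncertainty exactly.",
    "Preserve factual claims exactly.",
    "Preserve emotional tone unless asked to soften or harden it.",
    "Do not summarize unless explicitly requested.",
    "Do not inject stock AI phrases.",
    "Do not over-polish.",
    "Do not remove intensity unless needed for readability or safety.",
    "When uncertain, stay closer to the original.",
    "Always prefer fidelity over prettiness.",
]

# Hypothetical one-line instruction per mode, paraphrased from the mode list.
MODE_RULES = {
    Mode.CLEAN: "Fix punctuation, capitalization, sentence boundaries, whitespace, and obvious grammar only.",
    Mode.READABLE: "Improve clarity and flow slightly more than Clean.",
    Mode.FORMAL: "Make the text suitable for legal, support, or professional contexts.",
    Mode.CONCISE: "Reduce length without removing important meaning.",
    Mode.PRESERVE_VOICE: "Apply minimal cleanup with maximum fidelity to the original style.",
}

# Assumption: the plan says the default is "likely" Preserve Voice or Clean.
DEFAULT_MODE = Mode.PRESERVE_VOICE


def build_instructions(mode: Mode = DEFAULT_MODE) -> str:
    """Return a numbered instruction list: the mode line first, then the base rules."""
    lines = [MODE_RULES[mode]] + BASE_RULES
    return "\n".join(f"{i}. {line}" for i, line in enumerate(lines, start=1))
```

Keeping the base rules separate from the mode line means a new mode can only add one instruction and can never silently drop a fidelity rule.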
---
## Product Scope Strategy
To avoid drift, build this in phases.
### Phase 1: Desktop text-to-text translator
This is the real MVP.
Must include:
- text input box
- paste raw text
- output pane
- selectable modes
- diff view
- copy output
- very simple settings
- one backend at first
- preserve-style-first behavior
Do not add speech yet unless it is trivial.
### Phase 2: System-wide desktop utility
Add:
- hotkey to open translator
- clipboard pipeline
- selected-text workflow
- tray app or background helper
- faster repeated usage across apps
### Phase 3: Speech input
Add:
- microphone capture
- streaming or chunked transcript
- cleaned transcript output
- same transformation modes
### Phase 4: Android keyboard
Build a real keyboard, not a fake dictation shell.
Must support:
- normal typing
- optional cleanup button
- optional rewrite action
- optional dictation later
### Phase 5: Optional local/hybrid backends
Add support for:
- local model providers
- cloud model providers
- fallback chains
- user-selectable provider strategy
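A fallback chain could look roughly like the sketch below, which tries providers in order and reports which one produced the result. This is only an illustration for a later phase: `rewrite_with_fallback`, `AllProvidersFailed`, and the provider names in the usage note are hypothetical, and each provider is modeled as a plain callable that takes text and returns text.

```python
from typing import Callable, Sequence

Rewrite = Callable[[str], str]


class AllProvidersFailed(RuntimeError):
    """Raised when every provider in the chain raised an exception."""


def rewrite_with_fallback(
    text: str, providers: Sequence[tuple[str, Rewrite]]
) -> tuple[str, str]:
    """Try each (name, rewrite) pair in order; return (provider_name, output)."""
    errors = []
    for name, rewrite in providers:
        try:
            return name, rewrite(text)
        except Exception as exc:  # a failing backend must not abort the chain
            errors.append(f"{name}: {exc}")
    raise AllProvidersFailed("; ".join(errors) or "no providers configured")
```

For example, a chain of `[("cloud", cloud_rewrite), ("local", local_rewrite)]` would fall through to the local provider if the cloud call raised. Returning the provider name alongside the output supports the privacy goal of telling the user what left the device.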
### Phase 6: Journal integration
Only after the standalone tool proves itself.
Journal should consume this system, not contain its entire logic.
---
## MVP Definition
### MVP Goal
A desktop app that takes typed text and transforms it into more readable text while preserving the original voice and meaning.
### MVP Must Have
- input area
- output area
- mode selector
- copy button
- diff display
- rewrite button
- settings for backend/mode behavior
- at least one reliable backend
- strong preserve-style prompt rules
### MVP Should Not Have
- mobile
- iPhone
- full keyboard integration
- many modes
- user accounts
- journaling features
- complex profiles
- many AI providers
- voice-first workflow
- massive settings surface
---
## Technical Architecture
## High-Level Architecture
### 1. Input Layer
Responsible for collecting text or speech.
Possible components:
- text editor/input box
- clipboard intake
- selected-text capture
- speech capture
- keyboard integration later
### 2. Preprocessing Layer
Responsible for lightweight cleanup before AI.
Examples:
- trim whitespace
- normalize line breaks
- detect paragraphs
- optional sentence hints
- optional typo normalization
This layer should be deterministic where possible.
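A deterministic preprocessing pass might be sketched as follows, using only the standard library. The function names `preprocess` and `split_paragraphs` are hypothetical, and the exact normalization choices (collapsing runs of spaces, allowing at most one blank line between paragraphs) are assumptions rather than settled behavior.

```python
import re


def preprocess(raw: str) -> str:
    """Normalize whitespace and line breaks without touching the words."""
    # Normalize Windows and old Mac line endings to "\n".
    text = raw.replace("\r\n", "\n").replace("\r", "\n")
    # Collapse runs of spaces or tabs and trim each line.
    lines = [re.sub(r"[ \t]+", " ", line).strip() for line in text.split("\n")]
    text = "\n".join(lines)
    # Allow at most one blank line between paragraphs.
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()


def split_paragraphs(raw: str) -> list[str]:
    """Detect paragraphs as blocks separated by a blank line."""
    return [p for p in preprocess(raw).split("\n\n") if p]
```

Because the same input always yields the same output, this step can be tested exhaustively and never needs a diff review of its own.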
### 3. Transformation Layer
Responsible for style-preserving cleanup and rewrite operations.
This should be abstracted behind interfaces so providers can be swapped.
Possible provider types:
- cloud LLM
- local LLM
- hybrid chain
- rules + LLM combination
### 4. Review Layer
Responsible for trust and transparency.
Examples:
- side-by-side view
- inline diff
- changed text highlighting
- accept/reject whole output
- maybe per-block review later
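An inline diff for the review layer could be built on Python's standard `difflib`, as in this minimal sketch. The helper names `word_diff` and `render_inline` and the `[-removed-]` / `{+added+}` markers are illustrative choices, not a decided format; a real UI would more likely use highlighting.

```python
import difflib


def word_diff(original: str, cleaned: str) -> list[tuple[str, str]]:
    """Return (tag, text) pairs where tag is 'same', 'removed', or 'added'."""
    a, b = original.split(), cleaned.split()
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    result = []
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op == "equal":
            result.append(("same", " ".join(a[i1:i2])))
            continue
        if op in ("delete", "replace"):
            result.append(("removed", " ".join(a[i1:i2])))
        if op in ("insert", "replace"):
            result.append(("added", " ".join(b[j1:j2])))
    return result


def render_inline(original: str, cleaned: str) -> str:
    """Render the diff as one line of text with change markers."""
    parts = []
    for tag, text in word_diff(original, cleaned):
        if tag == "removed":
            parts.append("[-" + text + "-]")
        elif tag == "added":
            parts.append("{+" + text + "+}")
        else:
            parts.append(text)
    return " ".join(parts)
```

Diffing on words rather than characters keeps the output readable for prose, at the cost of marking a whole word as changed when only its punctuation moved.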
### 5. Output Layer
Responsible for making the result usable.
Examples:
- copy to clipboard
- replace selected text
- save to file
- send to app
- Journal integration later
---
## Backend Strategy
Backends should be pluggable.
Use abstractions such as:
- `IRewriteProvider`
- `ITranscriptionProvider`
- `IFormattingProvider`
This prevents provider lock-in.
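As a rough illustration of that abstraction, the sketch below expresses `IRewriteProvider` as a Python `typing.Protocol`. Only the interface name comes from this plan; the `rewrite(text, mode)` signature, the `PassthroughProvider` stand-in, and the `Translator` wrapper are assumptions made for the example.

```python
from typing import Protocol


class IRewriteProvider(Protocol):
    """Anything with a matching rewrite() method counts as a provider."""

    def rewrite(self, text: str, mode: str) -> str: ...


class PassthroughProvider:
    """Placeholder backend: trims the text and capitalizes its first letter."""

    def rewrite(self, text: str, mode: str) -> str:
        text = text.strip()
        return text[:1].upper() + text[1:]


class Translator:
    """Depends only on the interface, never on a concrete backend."""

    def __init__(self, provider: IRewriteProvider) -> None:
        self._provider = provider

    def translate(self, text: str, mode: str = "preserve_voice") -> str:
        return self._provider.rewrite(text, mode)
```

Swapping in a cloud or local backend then means writing one new class with a `rewrite` method; `Translator` and everything above it stay unchanged.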
### Backend priorities
1. reliability
2. fidelity
3. latency
4. low AI smell
5. cost
6. local support later
### Initial backend recommendation
Start with one provider only.
Do not build a multi-provider ensemble in the MVP.
That can come later if needed.
---
## Recommended Processing Pipeline
### Typed input pipeline
1. User types or pastes raw text.
2. Preprocessing normalizes text.
3. Rewrite provider transforms according to selected mode.
4. Diff is shown.
5. User copies or replaces text.
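The typed input steps above can be sketched as a single function that keeps the raw original next to the cleaned result. `TranslationResult` and `run_typed_pipeline` are hypothetical names, and the normalize and rewrite stages are passed in as plain callables so the sketch does not assume any particular backend.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class TranslationResult:
    raw: str      # the untouched original is always kept for review
    cleaned: str  # output of the rewrite step
    mode: str

    @property
    def changed(self) -> bool:
        return self.raw != self.cleaned


def run_typed_pipeline(
    raw: str,
    mode: str,
    normalize: Callable[[str], str],
    rewrite: Callable[[str, str], str],
) -> TranslationResult:
    """Steps 1-3: take raw text, normalize it, rewrite it for the given mode."""
    normalized = normalize(raw)
    cleaned = rewrite(normalized, mode)
    # Steps 4-5 (diff, copy/replace) work from the returned raw/cleaned pair.
    return TranslationResult(raw=raw, cleaned=cleaned, mode=mode)
```

Returning both strings in one immutable object makes it hard for later UI code to lose access to the original.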
### Speech pipeline
1. User speaks.
2. ASR provider transcribes in chunks or stream.
3. Transcript is normalized.
4. Rewrite provider applies selected cleanup mode.
5. User reviews and accepts output.
---
## UX Requirements
### Required UX qualities
- fast
- clean
- low friction
- minimal clicks
- obvious trust signals
- easy to understand
- no clutter
- no aggressive AI presence
### Important UX rules
- always preserve access to the raw original
- always make changes inspectable
- never hide major rewrites
- do not drown the UI in settings
- default to the safest mode
---
## Performance Goals
### For text-to-text
- small inputs should feel nearly immediate
- the UI must never freeze
- processing should happen asynchronously
- copy/reuse must be fast
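One way to keep the interface responsive while a backend call is in flight is to push the blocking call onto a worker thread. The sketch below uses `asyncio.to_thread` (Python 3.9+); `slow_rewrite` is a stand-in for a real backend, and the tick counter in `demo` merely simulates UI work continuing during the call.

```python
import asyncio
import time


def slow_rewrite(text: str) -> str:
    """Stand-in for a blocking backend call (network or local model)."""
    time.sleep(0.2)
    return text.strip()


async def rewrite_async(text: str) -> str:
    # Run the blocking call in a worker thread so the event loop stays free.
    return await asyncio.to_thread(slow_rewrite, text)


async def demo() -> tuple[str, int]:
    task = asyncio.create_task(rewrite_async("  raw text  "))
    ticks = 0
    while not task.done():
        ticks += 1                 # stands in for UI updates
        await asyncio.sleep(0.01)  # yield control, as a UI loop would
    return task.result(), ticks


if __name__ == "__main__":
    cleaned, ticks = asyncio.run(demo())
    print(cleaned, ticks)
```

The loop keeps ticking while the rewrite runs, which is the property the "UI must never freeze" requirement asks for. A desktop UI toolkit would use its own threading or async mechanism, but the shape is the same.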
### For speech later
- transcript should appear progressively
- cleanup should happen incrementally where possible
- avoid long blocking waits
- prefer “usable now, better refined in background” over “perfect after delay”
---
## Trust and Safety Philosophy
This is a communication aid, not a truth engine.
The system should:
- preserve what I said
- preserve uncertainty
- avoid hallucinating facts
- avoid inventing claims
- avoid changing meaning without permission
The most important safety rule is:
**Do not silently distort the message.**
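Part of this rule can be checked mechanically before the user ever sees the output. The following is a minimal sketch of such a guard, assuming a small hand-picked hedge list: `HEDGES` and `fidelity_warnings` are hypothetical, and the check only catches dropped hedging words and changed numbers, not subtler shifts in meaning.

```python
import re

# Hypothetical, deliberately small list of hedging words to watch.
HEDGES = {"maybe", "probably", "possibly", "might", "could", "seems", "think", "unsure"}


def _tokens(text: str) -> list[str]:
    """Lowercased words and numbers; punctuation is ignored."""
    return re.findall(r"[a-z']+|\d+(?:\.\d+)?", text.lower())


def fidelity_warnings(original: str, cleaned: str) -> list[str]:
    """Flag hedges that disappeared and numbers that were added or dropped."""
    before, after = _tokens(original), _tokens(cleaned)
    warnings = []
    for word in sorted(HEDGES):
        if before.count(word) > after.count(word):
            warnings.append(f"hedge dropped: {word}")
    for token in sorted(set(after) - set(before)):
        if token[0].isdigit():
            warnings.append(f"number added: {token}")
    for token in sorted(set(before) - set(after)):
        if token[0].isdigit():
            warnings.append(f"number dropped: {token}")
    return warnings
```

For example, comparing "i think it was maybe 3 days" with "It was 4 days." reports both dropped hedges and the changed number, while identical inputs produce no warnings. Warnings like these would be surfaced in the review step rather than used to block output.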
---
## Privacy Philosophy
Privacy matters, but forcing everything local too early may block the project.
Approach:
- make privacy explicit
- allow backend choice
- do not hardwire cloud dependence
- support local later
- let the user know what leaves the device
The system should be able to grow toward:
- local-only mode
- hybrid mode
- cloud mode
But MVP can use a cloud provider if needed for quality.
---
## Integration Philosophy
This project should be standalone first.
It may later integrate with:
- Journal
- editors
- browsers
- messaging apps
- email workflows
But the core must stay focused:
**input translation across contexts**
---
## Risks
### 1. Scope creep
Trying to build speech, desktop, mobile keyboard, local AI, and Journal integration all at once.
Mitigation:
- follow phases strictly
- do not build future phases early
### 2. AI voice contamination
Outputs become bland, generic, or chatbot-like.
Mitigation:
- preserve-style prompts
- diff review
- strict mode rules
- compare against original constantly
### 3. Provider dependence
A cloud provider changes policy, pricing, or quality.
Mitigation:
- provider abstraction
- backend pluggability
### 4. Overengineering
Building a giant architecture before proving the core use case.
Mitigation:
- keep MVP small
- prove value first
### 5. Latency frustration
The tool feels too slow to be useful.
Mitigation:
- async architecture
- fast UI
- small input workflows first
- optimize perceived speed
### 6. Drift into “generic AI app”
Project becomes another assistant shell instead of a focused translation tool.
Mitigation:
- revisit product definition regularly
- reject features that do not support the core vision
---
## Decision Filters
Before adding any feature, ask:
1. Does this help preserve meaning?
2. Does this help preserve voice?
3. Does this improve readability?
4. Does this help use the tool across apps?
5. Does this keep the tool focused?
6. Can this wait until a later phase?
If the answer is unclear, do not add it yet.
---
## Immediate Development Priorities
### Priority 1
Write the exact behavior spec for each mode:
- Clean
- Readable
- Formal
- Concise
- Preserve Voice
### Priority 2
Build the text-to-text desktop MVP.
### Priority 3
Test outputs against real examples of my raw writing.
### Priority 4
Tune prompts and system rules until the result feels like:
- me
- but easier to read
### Priority 5
Add diff and trust tooling before getting fancy.
---
## Success Criteria
The project is succeeding if:
- I can write naturally without slowing down.
- The output remains recognizably mine.
- Other people can follow it more easily.
- The text does not sound generically AI-generated.
- I trust the system not to corrupt my meaning.
- I can use it in multiple contexts, not just one app.
- It reduces friction in real communication.
---
## Failure Criteria
The project is failing if:
- the output sounds like a chatbot
- my meaning changes too often
- it becomes another generic AI wrapper
- it gets overloaded with features before the core works
- the UI becomes cluttered
- it is too slow to use comfortably
- it only works in one narrow context
- it stops feeling like a tool for me
---
## Final Reminder
This project is not about making me sound like someone else.
It is about making **my actual communication** more readable without losing:
- meaning
- tone
- precision
- intensity
- identity
That is the standard.
When in doubt, return to this sentence:
**Clean the output. Do not replace the person.**