rust_browser/_bmad-output/implementation-artifacts/2-1-html5-tokenizer-completeness.md

# Story 2.1: HTML5 Tokenizer Completeness

Status: done

## Story

As a web user,
I want all HTML content to be parsed correctly regardless of markup patterns,
So that pages render properly even with unusual or complex HTML.

## Acceptance Criteria

1. **All tokenizer states implemented per WHATWG HTML §13.2.5:** Data, RCDATA, RAWTEXT, script data, PLAINTEXT, tag open, end tag open, tag name, RCDATA less-than sign, script data less-than sign, script data escaped, and all related sub-states produce correct tokens.

2. **Character references resolved correctly per context:** Named entities (full HTML5 set: 2,231 entries), numeric (decimal `&#NNN;` and hex `&#xHHH;`), handled differently in attribute values vs. text content per §13.2.5.73 (character reference state).

3. **CDATA sections, processing instructions, and bogus comments** tokenized correctly per spec without errors.

4. **Malformed HTML** (unclosed tags, missing quotes, invalid characters, null bytes) handled with graceful recovery per the spec's error handling rules — no panics.

5. **Golden tests** cover edge-case tokenizer states. `docs/HTML5_Implementation_Checklist.md` updated. `just ci` passes.

## Tasks / Subtasks

- [x] Task 1: Implement full tokenizer state machine (AC: #1)
  - [x] 1.1 Refactor tokenizer from linear scan to explicit state machine with `TokenizerState` enum matching WHATWG §13.2.5 states
  - [x] 1.2 Implement RCDATA state for `<title>`, `<textarea>` (currently treated as raw text — must decode character refs but not tags)
  - [x] 1.3 Implement RAWTEXT state for `<iframe>`, `<noembed>`, `<noframes>`, `<xmp>` (no entity decoding, no tags until matching end tag)
  - [x] 1.4 Implement PLAINTEXT state for `<plaintext>` (no end tag ever closes it)
  - [x] 1.5 Implement script data states: script data, script data escaped, script data double-escaped, and their sub-states (less-than sign, end tag open/name, escape start/dash variants) per §13.2.5.15–§13.2.5.28
  - [x] 1.6 Implement proper tag open / end tag open / tag name states with spec-accurate error recovery (e.g., `<1` emits `<` as character, `</>` is ignored)
  - [x] 1.7 Implement before/after attribute name, attribute value (quoted/unquoted) states with proper error handling
  - [x] 1.8 Handle self-closing flag (`/>`) with appropriate parse errors for non-void elements
  - [x] 1.9 Implement bogus comment state (e.g., `<!DOCTYPE` errors, `<? ... >`)
  - [x] 1.10 Implement markup declaration open state dispatching: `<!--`, `<!DOCTYPE`, `<![CDATA[`

- [x] Task 2: Complete character reference handling (AC: #2)
  - [x] 2.1 Replace the ~100-entity lookup table in `entities.rs` with the full HTML5 named character reference table (2,231 entries) — use a `phf` static map or sorted array with binary search for compile-time efficiency
  - [x] 2.2 Implement context-sensitive character reference consumption: in attributes, `&` followed by alphanumeric + `=` must NOT be consumed as a reference (legacy compat)
  - [x] 2.3 Handle ambiguous ampersand parse errors per spec
  - [x] 2.4 Handle numeric character reference edge cases: null → U+FFFD, surrogates → U+FFFD, out-of-range → U+FFFD, legacy Windows-1252 remapping table for 0x80–0x9F range

- [x] Task 3: CDATA, PI, and bogus comment handling (AC: #3)
  - [x] 3.1 Implement CDATA section tokenization (`<![CDATA[...]]>`) — only valid in foreign content (SVG/MathML); in HTML content emit as bogus comment
  - [x] 3.2 Ensure processing instructions (`<?...>`) are consumed as bogus comments per the HTML spec (not XML PI behavior)

- [x] Task 4: Error recovery and robustness (AC: #4)
  - [x] 4.1 Handle null characters: replace with U+FFFD in appropriate states, parse error in others
  - [x] 4.2 Handle unexpected characters in each state per spec (e.g., `<` in attribute value unquoted triggers parse error but continues)
  - [x] 4.3 Handle abrupt EOF in each state — emit appropriate tokens and parse errors
  - [x] 4.4 Ensure existing safety limits (10MB input, 500K tokens, 1024 nesting depth) remain enforced
  - [x] 4.5 Add `ParseError` tracking: collect parse errors with location info (line/column) but don't halt tokenization

- [x] Task 5: Tests and documentation (AC: #5)
  - [x] 5.1 Add unit tests for each new tokenizer state: RCDATA, RAWTEXT, PLAINTEXT, script data states
  - [x] 5.2 Add unit tests for full character reference resolution: named (multi-codepoint entities like `&NotSquareSubset;`), numeric edge cases, context-sensitive in attributes
  - [x] 5.3 Add golden tests for pages exercising edge-case tokenizer behavior
  - [x] 5.4 Update `docs/HTML5_Implementation_Checklist.md` — check off tokenizer-related items
  - [x] 5.5 Run `just ci` and ensure all tests pass

## Dev Notes

### Current Implementation (What Exists)

The tokenizer lives in `crates/html/src/tokenizer.rs` (~320 lines). It uses a **linear character-scanning approach** — NOT a state machine. Key characteristics:

- **`tokenize(input: &str) -> Vec<Token>`** — main entry point, iterates chars in a loop
- **Token enum variants:** `Doctype`, `StartTag`, `EndTag`, `Character`, `Comment`, `RawText`, `Eof`
- **Attribute parsing:** handles quoted (single/double) and unquoted values, lowercases names
- **Raw text elements:** `<script>` and `<style>` skip inner content until matching end tag — but this is NOT the same as the spec's RAWTEXT/script-data states
- **Entity decoding:** `entities.rs` has ~100 named entities + numeric/hex support
- **Safety limits:** MAX_HTML_INPUT_BYTES (10MB), MAX_HTML_TOKENS (500K), MAX_HTML_NESTING_DEPTH (1024)

### Architecture Constraints

- **Layer 1 crate** — depends only on `dom`, `shared`, and `tracing`. No upward deps.
- **Arena-based IDs** — tokenizer produces tokens consumed by tree_builder which creates `NodeId`-based DOM
- **No unsafe** — enforced by `scripts/check_unsafe.sh`
- **Error handling** — use `thiserror` derive, never panic on malformed input
- **Spec citations** — use `// HTML §13.2.5.x` inline comment format
- **File size** — enforced by `scripts/check_file_size.sh`; if tokenizer.rs grows large, split into `src/tokenizer/` module directory

### Key Design Decision: State Machine Refactor

The current linear scan must be refactored to an explicit state machine. Recommended approach:

```rust
enum TokenizerState {
    Data,
    Rcdata,
    Rawtext,
    ScriptData,
    Plaintext,
    TagOpen,
    EndTagOpen,
    TagName,
    // ... all WHATWG §13.2.5 states
}
```

**Do NOT** use a separate `fn` per state (excessive function call overhead). Use a `match` on state in a loop — this is the standard efficient approach for spec-conformant HTML tokenizers.

### Entity Table Strategy

Current `entities.rs` has ~100 hardcoded entries. For the full 2,231-entry HTML5 table:

- **Option A (recommended):** Use `phf` crate for compile-time perfect hash map — O(1) lookup, zero runtime cost. Already used in the Rust ecosystem for this exact purpose.
- **Option B:** Sorted array with binary search — no new dependency but O(log n) lookup.
- **Do NOT** use a runtime `HashMap` — wasteful for static data.

The entity data can be sourced from the WHATWG JSON file or generated at build time.

If adding `phf` as a dependency: add rationale comment in Cargo.toml per project rules (new dependencies require rationale).

### RCDATA vs RAWTEXT vs Script Data

These are commonly confused. Critical distinctions:

| State | Used for | Decodes entities? | Recognizes tags? | End condition |
|-------|----------|--------------------|-------------------|---------------|
| RCDATA | `<title>`, `<textarea>` | Yes | No (except matching end tag) | Matching end tag |
| RAWTEXT | `<iframe>`, `<noembed>`, `<noframes>`, `<xmp>` | No | No (except matching end tag) | Matching end tag |
| Script data | `<script>` | No | No, but has escape sequences (`<!--`, `-->`) | Matching `</script>` |
| PLAINTEXT | `<plaintext>` | No | No | Never (runs to EOF) |

### Files to Modify

- `crates/html/src/tokenizer.rs` — major refactor to state machine
- `crates/html/src/entities.rs` — expand to full HTML5 entity table
- `crates/html/src/lib.rs` — update public API if Token enum changes
- `crates/html/src/tests/tokenizer_tests.rs` — extensive new tests
- `tests/goldens/fixtures/` + `tests/goldens/expected/` — new golden tests
- `docs/HTML5_Implementation_Checklist.md` — update checked items
- `crates/html/Cargo.toml` — possibly add `phf` dependency

### What NOT to Change

- **`tree_builder.rs`** — Story 2.2 handles insertion modes. Only touch if the Token enum shape changes require it.
- **`dom/`** — No DOM changes needed for tokenizer work.
- **Other crates** — Tokenizer is internal to `html` crate.

### Previous Story Learnings (Epic 1)

- **Code review catches real bugs:** Story 1-13's review found `unset` not handled in pseudo-element styling. Expect similar edge-case misses.
- **Golden tests are essential:** Every rendering-affecting change needs golden test coverage.
- **Update checklists:** Always update `docs/HTML5_Implementation_Checklist.md` at the end.
- **Commit pattern:** Recent commits follow format: `Implement <feature> with code review fixes (§section)`.

### Testing Strategy

- **Unit tests** in `crates/html/src/tests/tokenizer_tests.rs` — test each state machine state independently
- **If test file exceeds ~200 lines**, split into `tokenizer_state_tests.rs`, `entity_tests.rs`, etc. (per architecture doc)
- **Golden tests** in `tests/goldens/` — HTML pages that exercise RCDATA (`<title>` with entities), RAWTEXT (`<xmp>`), script data escaping
- **Property-based tests** (optional) with `proptest` — fuzz tokenizer with random byte sequences, assert no panics

### References

- [WHATWG HTML Living Standard §13.2.5 — Tokenization](https://html.spec.whatwg.org/multipage/parsing.html#tokenization)
- [HTML5 Named Character References (JSON)](https://html.spec.whatwg.org/entities.json)
- [Source: crates/html/src/tokenizer.rs] — current implementation
- [Source: crates/html/src/entities.rs] — current entity table
- [Source: crates/html/src/tests/tokenizer_tests.rs] — current tests
- [Source: docs/HTML5_Implementation_Checklist.md] — checklist to update
- [Source: _bmad-output/planning-artifacts/architecture.md] — architecture constraints
- [Source: _bmad-output/planning-artifacts/epics.md] — Epic 2 requirements

## Dev Agent Record

### Agent Model Used
Claude Opus 4.6 (1M context)

### Debug Log References
- Fixed golden test 091-xml-processing-instructions: node IDs shifted because PIs now emit bogus comment tokens (spec-correct behavior)
- Promoted WPT test wpt-css-css-tables-table-cell-inline-size-box-sizing-quirks from known_fail to pass (side effect of improved tokenizer)
- Updated test_invalid_entity_remains_literal: `&amp` without semicolon now correctly resolves to `&` per HTML §13.2.5.73

### Completion Notes List
- Refactored tokenizer from 320-line linear scan to full WHATWG §13.2.5 state machine (~2100 lines in module directory)
- Implemented all 80 tokenizer states as an enum with match-in-loop pattern
- Added RCDATA state (title, textarea) with entity decoding but no tag recognition
- Added RAWTEXT state (style, xmp, iframe, noembed, noframes) with no entity decoding (noscript excluded — only RAWTEXT when scripting enabled)
- Added PLAINTEXT state (never exits)
- Added ScriptData states with full escape/double-escape handling (§13.2.5.15-28)
- Replaced ~100-entity lookup table with full HTML5 table (2,125 entries) using sorted array + binary search
- No new dependencies added — used sorted array approach instead of phf
- Implemented character reference state machine (named, numeric decimal, numeric hex)
- Added context-sensitive attribute character reference handling (legacy compat)
- Added Windows-1252 remapping for numeric refs in 0x80-0x9F range
- Added CDATA section tokenization (emits chars, parse error in HTML content)
- Processing instructions now correctly emit as bogus comments
- Added ParseError tracking with line/column info (33 error kinds)
- Added null character handling (U+FFFD replacement) in all states
- Added EOF handling in all states with appropriate token emission
- Added 39 unit tests covering RCDATA, RAWTEXT, PLAINTEXT, script data (incl. escape/double-escape), char refs, CDATA, PIs, error recovery, duplicate attrs, control/noncharacter refs
- Added golden test 275-rcdata-rawtext-states

### Change Log
- 2026-03-14: Full tokenizer refactor to WHATWG §13.2.5 state machine (Tasks 1-5)
- 2026-03-14: Code review fixes: removed dead Token::RawText variant, added duplicate attribute detection, added noncharacter/control char ref checks, fixed noscript RAWTEXT handling, renamed golden 093 to 275, added script escape tests

### File List
- crates/html/src/tokenizer/mod.rs (new) — Main tokenizer state machine, Token/Attribute types
- crates/html/src/tokenizer/states.rs (new) — TokenizerState enum with all 80 WHATWG states
- crates/html/src/tokenizer.rs (deleted) — Replaced by tokenizer module directory
- crates/html/src/entities.rs (modified) — Simplified to binary search on full entity table
- crates/html/src/entity_table.rs (new) — Full HTML5 named character reference table (2,125 entries)
- crates/html/src/lib.rs (modified) — Added entity_table module
- crates/html/src/tree_builder.rs (modified) — Removed dead Token::RawText branches
- crates/html/src/tests/mod.rs (modified) — Added tokenizer_state_tests module
- crates/html/src/tests/tokenizer_tests.rs (modified) — Updated test for spec-correct behavior
- crates/html/src/tests/tokenizer_state_tests.rs (new) — 39 tests for new tokenizer states
- tests/goldens/fixtures/275-rcdata-rawtext-states.html (new) — Golden test fixture
- tests/goldens/expected/275-rcdata-rawtext-states.layout.txt (new) — Expected layout output
- tests/goldens/expected/275-rcdata-rawtext-states.dl.txt (new) — Expected display list output
- tests/goldens/expected/091-xml-processing-instructions.layout.txt (modified) — Updated node IDs
- tests/goldens.rs (modified) — Added golden_275 test function
- tests/external/wpt/wpt_manifest.toml (modified) — Promoted newly-passing WPT test
- docs/HTML5_Implementation_Checklist.md (modified) — Checked off tokenizer and parse error items