Files
rust_browser/_bmad-output/implementation-artifacts/2-1-html5-tokenizer-completeness.md
Zachary D. Rowitsch 2e7642a658 Implement HTML5 tokenizer completeness with code review fixes (§13.2.5)
Refactor tokenizer from 320-line linear scan to full WHATWG §13.2.5 state
machine with all 80 states. Add RCDATA, RAWTEXT, PLAINTEXT, ScriptData
(with escape/double-escape), character reference states, CDATA sections,
and bogus comments. Replace ~100-entity table with full HTML5 set (2,125
unique entries) using sorted array + binary search. Add ParseError tracking
with 33 error kinds including line/column info.

Code review fixes: remove dead Token::RawText variant and tree_builder
branches, add duplicate attribute detection, add noncharacter/control char
reference checks per §13.2.5.80, fix noscript unconditional RAWTEXT
handling, rename golden 093 to 275 to avoid number collision, add script
escape/double-escape tests.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-14 14:03:23 -04:00

219 lines
14 KiB
Markdown
Raw Permalink Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# Story 2.1: HTML5 Tokenizer Completeness
Status: done
## Story
As a web user,
I want all HTML content to be parsed correctly regardless of markup patterns,
So that pages render properly even with unusual or complex HTML.
## Acceptance Criteria
1. **All tokenizer states implemented per WHATWG HTML §13.2.5:** Data, RCDATA, RAWTEXT, script data, PLAINTEXT, tag open, end tag open, tag name, RCDATA less-than sign, script data less-than sign, script data escaped, and all related sub-states produce correct tokens.
2. **Character references resolved correctly per context:** Named entities (full HTML5 set: 2,231 entries), numeric (decimal `&#NNN;` and hex `&#xHHH;`), handled differently in attribute values vs. text content per §13.2.5.73 (character reference state).
3. **CDATA sections, processing instructions, and bogus comments** tokenized correctly per spec without errors.
4. **Malformed HTML** (unclosed tags, missing quotes, invalid characters, null bytes) handled with graceful recovery per the spec's error handling rules — no panics.
5. **Golden tests** cover edge-case tokenizer states. `docs/HTML5_Implementation_Checklist.md` updated. `just ci` passes.
## Tasks / Subtasks
- [x] Task 1: Implement full tokenizer state machine (AC: #1)
- [x] 1.1 Refactor tokenizer from linear scan to explicit state machine with `TokenizerState` enum matching WHATWG §13.2.5 states
- [x] 1.2 Implement RCDATA state for `<title>`, `<textarea>` (currently treated as raw text — must decode character refs but not tags)
- [x] 1.3 Implement RAWTEXT state for `<iframe>`, `<noembed>`, `<noframes>`, `<xmp>` (no entity decoding, no tags until matching end tag)
- [x] 1.4 Implement PLAINTEXT state for `<plaintext>` (no end tag ever closes it)
- [x] 1.5 Implement script data states: script data, script data escaped, script data double-escaped, and their sub-states (less-than sign, end tag open/name, escape start/dash variants) per §13.2.5.15§13.2.5.28
- [x] 1.6 Implement proper tag open / end tag open / tag name states with spec-accurate error recovery (e.g., `<1` emits `<` as character, `</>` is ignored)
- [x] 1.7 Implement before/after attribute name, attribute value (quoted/unquoted) states with proper error handling
- [x] 1.8 Handle self-closing flag (`/>`) with appropriate parse errors for non-void elements
- [x] 1.9 Implement bogus comment state (e.g., `<!DOCTYPE` errors, `<? ... >`)
- [x] 1.10 Implement markup declaration open state dispatching: `<!--`, `<!DOCTYPE`, `<![CDATA[`
- [x] Task 2: Complete character reference handling (AC: #2)
- [x] 2.1 Replace the ~100-entity lookup table in `entities.rs` with the full HTML5 named character reference table (2,231 entries) — use a `phf` static map or sorted array with binary search for compile-time efficiency
- [x] 2.2 Implement context-sensitive character reference consumption: in attributes, `&` followed by alphanumeric + `=` must NOT be consumed as a reference (legacy compat)
- [x] 2.3 Handle ambiguous ampersand parse errors per spec
- [x] 2.4 Handle numeric character reference edge cases: null → U+FFFD, surrogates → U+FFFD, out-of-range → U+FFFD, legacy Windows-1252 remapping table for 0x800x9F range
- [x] Task 3: CDATA, PI, and bogus comment handling (AC: #3)
- [x] 3.1 Implement CDATA section tokenization (`<![CDATA[...]]>`) — only valid in foreign content (SVG/MathML); in HTML content emit as bogus comment
- [x] 3.2 Ensure processing instructions (`<?...>`) are consumed as bogus comments per the HTML spec (not XML PI behavior)
- [x] Task 4: Error recovery and robustness (AC: #4)
- [x] 4.1 Handle null characters: replace with U+FFFD in appropriate states, parse error in others
- [x] 4.2 Handle unexpected characters in each state per spec (e.g., `<` in attribute value unquoted triggers parse error but continues)
- [x] 4.3 Handle abrupt EOF in each state — emit appropriate tokens and parse errors
- [x] 4.4 Ensure existing safety limits (10MB input, 500K tokens, 1024 nesting depth) remain enforced
- [x] 4.5 Add `ParseError` tracking: collect parse errors with location info (line/column) but don't halt tokenization
- [x] Task 5: Tests and documentation (AC: #5)
- [x] 5.1 Add unit tests for each new tokenizer state: RCDATA, RAWTEXT, PLAINTEXT, script data states
- [x] 5.2 Add unit tests for full character reference resolution: named (multi-codepoint entities like `&NotSquareSubset;`), numeric edge cases, context-sensitive in attributes
- [x] 5.3 Add golden tests for pages exercising edge-case tokenizer behavior
- [x] 5.4 Update `docs/HTML5_Implementation_Checklist.md` — check off tokenizer-related items
- [x] 5.5 Run `just ci` and ensure all tests pass
## Dev Notes
### Current Implementation (What Exists)
The tokenizer lives in `crates/html/src/tokenizer.rs` (~320 lines). It uses a **linear character-scanning approach** — NOT a state machine. Key characteristics:
- **`tokenize(input: &str) -> Vec<Token>`** — main entry point, iterates chars in a loop
- **Token enum variants:** `Doctype`, `StartTag`, `EndTag`, `Character`, `Comment`, `RawText`, `Eof`
- **Attribute parsing:** handles quoted (single/double) and unquoted values, lowercases names
- **Raw text elements:** `<script>` and `<style>` skip inner content until matching end tag — but this is NOT the same as the spec's RAWTEXT/script-data states
- **Entity decoding:** `entities.rs` has ~100 named entities + numeric/hex support
- **Safety limits:** MAX_HTML_INPUT_BYTES (10MB), MAX_HTML_TOKENS (500K), MAX_HTML_NESTING_DEPTH (1024)
### Architecture Constraints
- **Layer 1 crate** — depends only on `dom`, `shared`, and `tracing`. No upward deps.
- **Arena-based IDs** — tokenizer produces tokens consumed by tree_builder which creates `NodeId`-based DOM
- **No unsafe** — enforced by `scripts/check_unsafe.sh`
- **Error handling** — use `thiserror` derive, never panic on malformed input
- **Spec citations** — use `// HTML §13.2.5.x` inline comment format
- **File size** — enforced by `scripts/check_file_size.sh`; if tokenizer.rs grows large, split into `src/tokenizer/` module directory
### Key Design Decision: State Machine Refactor
The current linear scan must be refactored to an explicit state machine. Recommended approach:
```rust
enum TokenizerState {
Data,
Rcdata,
Rawtext,
ScriptData,
Plaintext,
TagOpen,
EndTagOpen,
TagName,
// ... all WHATWG §13.2.5 states
}
```
**Do NOT** use a separate `fn` per state (excessive function call overhead). Use a `match` on state in a loop — this is the standard efficient approach for spec-conformant HTML tokenizers.
### Entity Table Strategy
Current `entities.rs` has ~100 hardcoded entries. For the full 2,231-entry HTML5 table:
- **Option A (recommended):** Use `phf` crate for compile-time perfect hash map — O(1) lookup, zero runtime cost. Already used in the Rust ecosystem for this exact purpose.
- **Option B:** Sorted array with binary search — no new dependency but O(log n) lookup.
- **Do NOT** use a runtime `HashMap` — wasteful for static data.
The entity data can be sourced from the WHATWG JSON file or generated at build time.
If adding `phf` as a dependency: add rationale comment in Cargo.toml per project rules (new dependencies require rationale).
### RCDATA vs RAWTEXT vs Script Data
These are commonly confused. Critical distinctions:
| State | Used for | Decodes entities? | Recognizes tags? | End condition |
|-------|----------|--------------------|-------------------|---------------|
| RCDATA | `<title>`, `<textarea>` | Yes | No (except matching end tag) | Matching end tag |
| RAWTEXT | `<iframe>`, `<noembed>`, `<noframes>`, `<xmp>` | No | No (except matching end tag) | Matching end tag |
| Script data | `<script>` | No | No, but has escape sequences (`<!--`, `-->`) | Matching `</script>` |
| PLAINTEXT | `<plaintext>` | No | No | Never (runs to EOF) |
### Files to Modify
- `crates/html/src/tokenizer.rs` — major refactor to state machine
- `crates/html/src/entities.rs` — expand to full HTML5 entity table
- `crates/html/src/lib.rs` — update public API if Token enum changes
- `crates/html/src/tests/tokenizer_tests.rs` — extensive new tests
- `tests/goldens/fixtures/` + `tests/goldens/expected/` — new golden tests
- `docs/HTML5_Implementation_Checklist.md` — update checked items
- `crates/html/Cargo.toml` — possibly add `phf` dependency
### What NOT to Change
- **`tree_builder.rs`** — Story 2.2 handles insertion modes. Only touch if the Token enum shape changes require it.
- **`dom/`** — No DOM changes needed for tokenizer work.
- **Other crates** — Tokenizer is internal to `html` crate.
### Previous Story Learnings (Epic 1)
- **Code review catches real bugs:** Story 1-13's review found `unset` not handled in pseudo-element styling. Expect similar edge-case misses.
- **Golden tests are essential:** Every rendering-affecting change needs golden test coverage.
- **Update checklists:** Always update `docs/HTML5_Implementation_Checklist.md` at the end.
- **Commit pattern:** Recent commits follow format: `Implement <feature> with code review fixes (§section)`.
### Testing Strategy
- **Unit tests** in `crates/html/src/tests/tokenizer_tests.rs` — test each state machine state independently
- **If test file exceeds ~200 lines**, split into `tokenizer_state_tests.rs`, `entity_tests.rs`, etc. (per architecture doc)
- **Golden tests** in `tests/goldens/` — HTML pages that exercise RCDATA (`<title>` with entities), RAWTEXT (`<xmp>`), script data escaping
- **Property-based tests** (optional) with `proptest` — fuzz tokenizer with random byte sequences, assert no panics
### References
- [WHATWG HTML Living Standard §13.2.5 — Tokenization](https://html.spec.whatwg.org/multipage/parsing.html#tokenization)
- [HTML5 Named Character References (JSON)](https://html.spec.whatwg.org/entities.json)
- [Source: crates/html/src/tokenizer.rs] — current implementation
- [Source: crates/html/src/entities.rs] — current entity table
- [Source: crates/html/src/tests/tokenizer_tests.rs] — current tests
- [Source: docs/HTML5_Implementation_Checklist.md] — checklist to update
- [Source: _bmad-output/planning-artifacts/architecture.md] — architecture constraints
- [Source: _bmad-output/planning-artifacts/epics.md] — Epic 2 requirements
## Dev Agent Record
### Agent Model Used
Claude Opus 4.6 (1M context)
### Debug Log References
- Fixed golden test 091-xml-processing-instructions: node IDs shifted because PIs now emit bogus comment tokens (spec-correct behavior)
- Promoted WPT test wpt-css-css-tables-table-cell-inline-size-box-sizing-quirks from known_fail to pass (side effect of improved tokenizer)
- Updated test_invalid_entity_remains_literal: `&amp` without semicolon now correctly resolves to `&` per HTML §13.2.5.73
### Completion Notes List
- Refactored tokenizer from 320-line linear scan to full WHATWG §13.2.5 state machine (~2100 lines in module directory)
- Implemented all 80 tokenizer states as an enum with match-in-loop pattern
- Added RCDATA state (title, textarea) with entity decoding but no tag recognition
- Added RAWTEXT state (style, xmp, iframe, noembed, noframes) with no entity decoding (noscript excluded — only RAWTEXT when scripting enabled)
- Added PLAINTEXT state (never exits)
- Added ScriptData states with full escape/double-escape handling (§13.2.5.15-28)
- Replaced ~100-entity lookup table with full HTML5 table (2,125 entries) using sorted array + binary search
- No new dependencies added — used sorted array approach instead of phf
- Implemented character reference state machine (named, numeric decimal, numeric hex)
- Added context-sensitive attribute character reference handling (legacy compat)
- Added Windows-1252 remapping for numeric refs in 0x80-0x9F range
- Added CDATA section tokenization (emits chars, parse error in HTML content)
- Processing instructions now correctly emit as bogus comments
- Added ParseError tracking with line/column info (33 error kinds)
- Added null character handling (U+FFFD replacement) in all states
- Added EOF handling in all states with appropriate token emission
- Added 39 unit tests covering RCDATA, RAWTEXT, PLAINTEXT, script data (incl. escape/double-escape), char refs, CDATA, PIs, error recovery, duplicate attrs, control/noncharacter refs
- Added golden test 275-rcdata-rawtext-states
### Change Log
- 2026-03-14: Full tokenizer refactor to WHATWG §13.2.5 state machine (Tasks 1-5)
- 2026-03-14: Code review fixes: removed dead Token::RawText variant, added duplicate attribute detection, added noncharacter/control char ref checks, fixed noscript RAWTEXT handling, renamed golden 093 to 275, added script escape tests
### File List
- crates/html/src/tokenizer/mod.rs (new) — Main tokenizer state machine, Token/Attribute types
- crates/html/src/tokenizer/states.rs (new) — TokenizerState enum with all 80 WHATWG states
- crates/html/src/tokenizer.rs (deleted) — Replaced by tokenizer module directory
- crates/html/src/entities.rs (modified) — Simplified to binary search on full entity table
- crates/html/src/entity_table.rs (new) — Full HTML5 named character reference table (2,125 entries)
- crates/html/src/lib.rs (modified) — Added entity_table module
- crates/html/src/tree_builder.rs (modified) — Removed dead Token::RawText branches
- crates/html/src/tests/mod.rs (modified) — Added tokenizer_state_tests module
- crates/html/src/tests/tokenizer_tests.rs (modified) — Updated test for spec-correct behavior
- crates/html/src/tests/tokenizer_state_tests.rs (new) — 39 tests for new tokenizer states
- tests/goldens/fixtures/275-rcdata-rawtext-states.html (new) — Golden test fixture
- tests/goldens/expected/275-rcdata-rawtext-states.layout.txt (new) — Expected layout output
- tests/goldens/expected/275-rcdata-rawtext-states.dl.txt (new) — Expected display list output
- tests/goldens/expected/091-xml-processing-instructions.layout.txt (modified) — Updated node IDs
- tests/goldens.rs (modified) — Added golden_275 test function
- tests/external/wpt/wpt_manifest.toml (modified) — Promoted newly-passing WPT test
- docs/HTML5_Implementation_Checklist.md (modified) — Checked off tokenizer and parse error items