# Story 2.1: HTML5 Tokenizer Completeness

Status: done

## Story

As a web user,
I want all HTML content to be parsed correctly regardless of markup patterns,
So that pages render properly even with unusual or complex HTML.

## Acceptance Criteria

1. **All tokenizer states implemented per WHATWG HTML §13.2.5:** Data, RCDATA, RAWTEXT, script data, PLAINTEXT, tag open, end tag open, tag name, RCDATA less-than sign, script data less-than sign, script data escaped, and all related sub-states produce correct tokens.
2. **Character references resolved correctly per context:** Named entities (full HTML5 set: 2,231 entries) and numeric references (decimal `&#NNN;` and hex `&#xHHH;`) are resolved, with different handling in attribute values vs. text content per §13.2.5.73 (character reference state).
3. **CDATA sections, processing instructions, and bogus comments** tokenized correctly per spec without errors.
4. **Malformed HTML** (unclosed tags, missing quotes, invalid characters, null bytes) handled with graceful recovery per the spec's error handling rules — no panics.
5. **Golden tests** cover edge-case tokenizer states. `docs/HTML5_Implementation_Checklist.md` updated. `just ci` passes.

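
The attribute-context rule in AC #2 can be reduced to a small predicate. A sketch under hypothetical names (the real tokenizer tracks this inside the character reference state rather than as a standalone function):

```rust
/// Legacy-compat rule from HTML §13.2.5.73: a named character reference
/// matched inside an attribute value WITHOUT its trailing semicolon is
/// only consumed if the next input character is not an ASCII alphanumeric
/// or "=". This keeps URLs like `?a=b&not=1` from turning into `?a=b¬=1`.
fn consume_named_ref_in_attribute(match_has_semicolon: bool, next: Option<char>) -> bool {
    if match_has_semicolon {
        return true; // a full `&name;` match is always consumed
    }
    !matches!(next, Some(c) if c.is_ascii_alphanumeric() || c == '=')
}

fn main() {
    // `&not=` inside an attribute value: NOT consumed as a reference.
    assert!(!consume_named_ref_in_attribute(false, Some('=')));
    // `&not;` has its semicolon, so it is always consumed.
    assert!(consume_named_ref_in_attribute(true, Some('=')));
    // `&not ` (followed by a space): consumed per the legacy rule.
    assert!(consume_named_ref_in_attribute(false, Some(' ')));
}
```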
## Tasks / Subtasks

- [x] Task 1: Implement full tokenizer state machine (AC: #1)
  - [x] 1.1 Refactor tokenizer from linear scan to explicit state machine with `TokenizerState` enum matching WHATWG §13.2.5 states
  - [x] 1.2 Implement RCDATA state for `<title>`, `<textarea>` (currently treated as raw text — must decode character refs but not tags)
  - [x] 1.3 Implement RAWTEXT state for `<iframe>`, `<noembed>`, `<noframes>`, `<xmp>` (no entity decoding, no tags until matching end tag)
  - [x] 1.4 Implement PLAINTEXT state for `<plaintext>` (no end tag ever closes it)
  - [x] 1.5 Implement script data states: script data, script data escaped, script data double-escaped, and their sub-states (less-than sign, end tag open/name, escape start/dash variants) per §13.2.5.15–§13.2.5.28
  - [x] 1.6 Implement proper tag open / end tag open / tag name states with spec-accurate error recovery (e.g., `<1` emits `<` as a character, `</>` is ignored)
  - [x] 1.7 Implement before/after attribute name and attribute value (quoted/unquoted) states with proper error handling
  - [x] 1.8 Handle the self-closing flag (`/>`) with appropriate parse errors for non-void elements
  - [x] 1.9 Implement bogus comment state (e.g., `<!DOCTYPE` errors, `<? ... >`)
  - [x] 1.10 Implement markup declaration open state dispatching: `<!--`, `<!DOCTYPE`, `<![CDATA[`
- [x] Task 2: Complete character reference handling (AC: #2)
  - [x] 2.1 Replace the ~100-entity lookup table in `entities.rs` with the full HTML5 named character reference table (2,231 entries) — use a `phf` static map or sorted array with binary search for compile-time efficiency
  - [x] 2.2 Implement context-sensitive character reference consumption: in attribute values, a named reference match without its trailing semicolon must NOT be consumed when the next character is alphanumeric or `=` (legacy compat)
  - [x] 2.3 Handle ambiguous ampersand parse errors per spec
  - [x] 2.4 Handle numeric character reference edge cases: null → U+FFFD, surrogates → U+FFFD, out-of-range → U+FFFD, legacy Windows-1252 remapping table for the 0x80–0x9F range
- [x] Task 3: CDATA, PI, and bogus comment handling (AC: #3)
  - [x] 3.1 Implement CDATA section tokenization (`<![CDATA[...]]>`) — only valid in foreign content (SVG/MathML); in HTML content emit as bogus comment
  - [x] 3.2 Ensure processing instructions (`<?...>`) are consumed as bogus comments per the HTML spec (not XML PI behavior)
- [x] Task 4: Error recovery and robustness (AC: #4)
  - [x] 4.1 Handle null characters: replace with U+FFFD in appropriate states, parse error in others
  - [x] 4.2 Handle unexpected characters in each state per spec (e.g., `<` in attribute value unquoted triggers a parse error but continues)
  - [x] 4.3 Handle abrupt EOF in each state — emit appropriate tokens and parse errors
  - [x] 4.4 Ensure existing safety limits (10MB input, 500K tokens, 1024 nesting depth) remain enforced
  - [x] 4.5 Add `ParseError` tracking: collect parse errors with location info (line/column) but don't halt tokenization
- [x] Task 5: Tests and documentation (AC: #5)
  - [x] 5.1 Add unit tests for each new tokenizer state: RCDATA, RAWTEXT, PLAINTEXT, script data states
  - [x] 5.2 Add unit tests for full character reference resolution: named (multi-codepoint entities like `⊏̸`), numeric edge cases, context-sensitive in attributes
  - [x] 5.3 Add golden tests for pages exercising edge-case tokenizer behavior
  - [x] 5.4 Update `docs/HTML5_Implementation_Checklist.md` — check off tokenizer-related items
  - [x] 5.5 Run `just ci` and ensure all tests pass

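
The numeric edge cases in task 2.4 follow mechanically from §13.2.5.80 (numeric character reference end state). A self-contained sketch (function name hypothetical; the real tokenizer also records a parse error in each special arm):

```rust
/// Map a parsed numeric character reference value to the character the
/// HTML spec requires (§13.2.5.80). Error reporting is omitted here.
fn numeric_ref_to_char(code: u32) -> char {
    match code {
        0x00 => '\u{FFFD}',               // null -> REPLACEMENT CHARACTER
        0xD800..=0xDFFF => '\u{FFFD}',    // surrogates -> U+FFFD
        c if c > 0x10_FFFF => '\u{FFFD}', // out of range -> U+FFFD
        // Legacy Windows-1252 remapping for the C1 control range
        // (0x81, 0x8D, 0x8F, 0x90, 0x9D have no entry and pass through):
        0x80 => '\u{20AC}', 0x82 => '\u{201A}', 0x83 => '\u{0192}',
        0x84 => '\u{201E}', 0x85 => '\u{2026}', 0x86 => '\u{2020}',
        0x87 => '\u{2021}', 0x88 => '\u{02C6}', 0x89 => '\u{2030}',
        0x8A => '\u{0160}', 0x8B => '\u{2039}', 0x8C => '\u{0152}',
        0x8E => '\u{017D}', 0x91 => '\u{2018}', 0x92 => '\u{2019}',
        0x93 => '\u{201C}', 0x94 => '\u{201D}', 0x95 => '\u{2022}',
        0x96 => '\u{2013}', 0x97 => '\u{2014}', 0x98 => '\u{02DC}',
        0x99 => '\u{2122}', 0x9A => '\u{0161}', 0x9B => '\u{203A}',
        0x9C => '\u{0153}', 0x9E => '\u{017E}', 0x9F => '\u{0178}',
        // Surrogates and out-of-range were handled above, so this
        // conversion can only fail for values we already replaced.
        c => char::from_u32(c).unwrap_or('\u{FFFD}'),
    }
}

fn main() {
    assert_eq!(numeric_ref_to_char(0x20AC), '€'); // &#8364; -> euro sign
    assert_eq!(numeric_ref_to_char(0x80), '€');   // &#128; remapped, not U+0080
    assert_eq!(numeric_ref_to_char(0x00), '\u{FFFD}');
    assert_eq!(numeric_ref_to_char(0xD800), '\u{FFFD}');
}
```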
## Dev Notes

### Current Implementation (What Exists)

The tokenizer lives in `crates/html/src/tokenizer.rs` (~320 lines). It uses a **linear character-scanning approach** — NOT a state machine. Key characteristics:

- **`tokenize(input: &str) -> Vec<Token>`** — main entry point, iterates chars in a loop
- **Token enum variants:** `Doctype`, `StartTag`, `EndTag`, `Character`, `Comment`, `RawText`, `Eof`
- **Attribute parsing:** handles quoted (single/double) and unquoted values, lowercases names
- **Raw text elements:** `<script>` and `<style>` skip inner content until matching end tag — but this is NOT the same as the spec's RAWTEXT/script-data states
- **Entity decoding:** `entities.rs` has ~100 named entities + numeric/hex support
- **Safety limits:** MAX_HTML_INPUT_BYTES (10MB), MAX_HTML_TOKENS (500K), MAX_HTML_NESTING_DEPTH (1024)

### Architecture Constraints

- **Layer 1 crate** — depends only on `dom`, `shared`, and `tracing`. No upward deps.
- **Arena-based IDs** — tokenizer produces tokens consumed by tree_builder, which creates the `NodeId`-based DOM
- **No unsafe** — enforced by `scripts/check_unsafe.sh`
- **Error handling** — use `thiserror` derive, never panic on malformed input
- **Spec citations** — use `// HTML §13.2.5.x` inline comment format
- **File size** — enforced by `scripts/check_file_size.sh`; if tokenizer.rs grows large, split into a `src/tokenizer/` module directory

### Key Design Decision: State Machine Refactor

The current linear scan must be refactored to an explicit state machine. Recommended approach:

```rust
enum TokenizerState {
    Data,
    Rcdata,
    Rawtext,
    ScriptData,
    Plaintext,
    TagOpen,
    EndTagOpen,
    TagName,
    // ... all WHATWG §13.2.5 states
}
```

**Do NOT** use a separate `fn` per state (excessive function call overhead). Use a `match` on state in a loop — this is the standard efficient approach for spec-conformant HTML tokenizers.

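The match-in-loop shape can be illustrated with a two-state sketch (all names hypothetical; the real enum covers every §13.2.5 state, and the real tag open state dispatches `!`, `/`, and `?` before the fallback shown here):

```rust
#[derive(Debug, PartialEq)]
enum Tok {
    Character(char),
}

enum State {
    Data,
    TagOpen,
}

/// Two-state sketch: Data emits characters; TagOpen demonstrates the
/// §13.2.5.6 recovery rule ("<1" re-emits "<" as a character).
fn tokenize_sketch(input: &str) -> Vec<Tok> {
    let mut state = State::Data;
    let mut out = Vec::new();
    for ch in input.chars() {
        match state {
            State::Data => match ch {
                '<' => state = State::TagOpen,
                c => out.push(Tok::Character(c)),
            },
            State::TagOpen => {
                if ch.is_ascii_alphabetic() {
                    // Real tokenizer: create a start tag token and
                    // reconsume in the TagName state.
                    state = State::Data;
                } else {
                    // "<" followed by a non-letter: emit "<" as a
                    // character and reprocess the current character.
                    out.push(Tok::Character('<'));
                    out.push(Tok::Character(ch));
                    state = State::Data;
                }
            }
        }
    }
    out
}

fn main() {
    assert_eq!(
        tokenize_sketch("<1x"),
        vec![Tok::Character('<'), Tok::Character('1'), Tok::Character('x')]
    );
}
```

Each loop iteration does one `match` dispatch and mutates `state` in place, so adding the remaining states grows the `match`, not the call graph.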
### Entity Table Strategy

Current `entities.rs` has ~100 hardcoded entries. For the full 2,231-entry HTML5 table:

- **Option A (recommended):** Use the `phf` crate for a compile-time perfect hash map — O(1) lookup, zero runtime cost. Already used in the Rust ecosystem for this exact purpose.
- **Option B:** Sorted array with binary search — no new dependency, but O(log n) lookup.
- **Do NOT** use a runtime `HashMap` — wasteful for static data.

The entity data can be sourced from the WHATWG JSON file or generated at build time.

If adding `phf` as a dependency: add a rationale comment in Cargo.toml per project rules (new dependencies require rationale).

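The sorted-array variant (Option B) is small enough to sketch. The names and the handful of entries below are illustrative, not the generated table:

```rust
/// Tiny illustrative slice of the entity table: (name, replacement)
/// pairs, sorted by name so binary search works. The real table has
/// thousands of entries, including legacy no-semicolon forms.
static ENTITIES: &[(&str, &str)] = &[
    ("AMP", "&"),
    ("AMP;", "&"),
    ("amp", "&"),
    ("amp;", "&"),
    ("copy;", "\u{A9}"),
    ("nbsp;", "\u{A0}"),
];

fn lookup_entity(name: &str) -> Option<&'static str> {
    ENTITIES
        .binary_search_by(|(entry, _)| entry.cmp(&name)) // O(log n)
        .ok()
        .map(|i| ENTITIES[i].1)
}

fn main() {
    assert_eq!(lookup_entity("amp;"), Some("&"));
    assert_eq!(lookup_entity("copy;"), Some("\u{A9}"));
    assert_eq!(lookup_entity("bogus;"), None);
}
```

Note that spec-conformant named-reference matching is longest-prefix, not exact lookup; a sorted table supports that too, since all candidates for a given prefix sit in one contiguous range.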
### RCDATA vs RAWTEXT vs Script Data

These are commonly confused. Critical distinctions:

| State | Used for | Decodes entities? | Recognizes tags? | End condition |
|-------|----------|-------------------|------------------|---------------|
| RCDATA | `<title>`, `<textarea>` | Yes | No (except matching end tag) | Matching end tag |
| RAWTEXT | `<style>`, `<iframe>`, `<noembed>`, `<noframes>`, `<xmp>` | No | No (except matching end tag) | Matching end tag |
| Script data | `<script>` | No | No, but has escape sequences (`<!--`, `-->`) | Matching `</script>` |
| PLAINTEXT | `<plaintext>` | No | No | Never (runs to EOF) |

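One way to read the table: the tree builder picks the tokenizer's next state from the tag it just opened. A sketch under hypothetical names, simplified from the tree construction rules (the `noscript` guard matches the code review fix noted in this story):

```rust
#[derive(Debug, PartialEq)]
enum State {
    Data,
    Rcdata,
    Rawtext,
    ScriptData,
    Plaintext,
}

/// Choose the tokenizer state after opening `tag`. `noscript` is
/// RAWTEXT only when scripting is enabled; otherwise it is parsed
/// as normal content.
fn content_state_for(tag: &str, scripting: bool) -> State {
    match tag {
        "title" | "textarea" => State::Rcdata,
        "style" | "xmp" | "iframe" | "noembed" | "noframes" => State::Rawtext,
        "noscript" if scripting => State::Rawtext,
        "script" => State::ScriptData,
        "plaintext" => State::Plaintext,
        _ => State::Data,
    }
}

fn main() {
    assert_eq!(content_state_for("title", false), State::Rcdata);
    assert_eq!(content_state_for("noscript", true), State::Rawtext);
    assert_eq!(content_state_for("noscript", false), State::Data);
    assert_eq!(content_state_for("div", true), State::Data);
}
```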
### Files to Modify

- `crates/html/src/tokenizer.rs` — major refactor to state machine
- `crates/html/src/entities.rs` — expand to full HTML5 entity table
- `crates/html/src/lib.rs` — update public API if Token enum changes
- `crates/html/src/tests/tokenizer_tests.rs` — extensive new tests
- `tests/goldens/fixtures/` + `tests/goldens/expected/` — new golden tests
- `docs/HTML5_Implementation_Checklist.md` — update checked items
- `crates/html/Cargo.toml` — possibly add `phf` dependency

### What NOT to Change

- **`tree_builder.rs`** — Story 2.2 handles insertion modes. Only touch it if changes to the Token enum's shape require it.
- **`dom/`** — No DOM changes needed for tokenizer work.
- **Other crates** — The tokenizer is internal to the `html` crate.

### Previous Story Learnings (Epic 1)

- **Code review catches real bugs:** Story 1-13's review found `unset` not handled in pseudo-element styling. Expect similar edge-case misses.
- **Golden tests are essential:** Every rendering-affecting change needs golden test coverage.
- **Update checklists:** Always update `docs/HTML5_Implementation_Checklist.md` at the end.
- **Commit pattern:** Recent commits follow the format `Implement <feature> with code review fixes (§section)`.

### Testing Strategy

- **Unit tests** in `crates/html/src/tests/tokenizer_tests.rs` — test each state machine state independently
- **If a test file exceeds ~200 lines**, split into `tokenizer_state_tests.rs`, `entity_tests.rs`, etc. (per architecture doc)
- **Golden tests** in `tests/goldens/` — HTML pages that exercise RCDATA (`<title>` with entities), RAWTEXT (`<xmp>`), script data escaping
- **Property-based tests** (optional) with `proptest` — fuzz the tokenizer with random byte sequences, assert no panics

### References

- [WHATWG HTML Living Standard §13.2.5 — Tokenization](https://html.spec.whatwg.org/multipage/parsing.html#tokenization)
- [HTML5 Named Character References (JSON)](https://html.spec.whatwg.org/entities.json)
- [Source: crates/html/src/tokenizer.rs] — current implementation
- [Source: crates/html/src/entities.rs] — current entity table
- [Source: crates/html/src/tests/tokenizer_tests.rs] — current tests
- [Source: docs/HTML5_Implementation_Checklist.md] — checklist to update
- [Source: _bmad-output/planning-artifacts/architecture.md] — architecture constraints
- [Source: _bmad-output/planning-artifacts/epics.md] — Epic 2 requirements

## Dev Agent Record

### Agent Model Used

Claude Opus 4.6 (1M context)

### Debug Log References

- Fixed golden test 091-xml-processing-instructions: node IDs shifted because PIs now emit bogus comment tokens (spec-correct behavior)
- Promoted WPT test wpt-css-css-tables-table-cell-inline-size-box-sizing-quirks from known_fail to pass (side effect of improved tokenizer)
- Updated test_invalid_entity_remains_literal: `&amp` without a semicolon now correctly resolves to `&` per HTML §13.2.5.73

### Completion Notes List

- Refactored tokenizer from 320-line linear scan to full WHATWG §13.2.5 state machine (~2100 lines in module directory)
- Implemented all 80 tokenizer states as an enum with match-in-loop pattern
- Added RCDATA state (title, textarea) with entity decoding but no tag recognition
- Added RAWTEXT state (style, xmp, iframe, noembed, noframes) with no entity decoding (noscript excluded — only RAWTEXT when scripting enabled)
- Added PLAINTEXT state (never exits)
- Added ScriptData states with full escape/double-escape handling (§13.2.5.15–28)
- Replaced ~100-entity lookup table with full HTML5 table (2,125 entries) using sorted array + binary search
- No new dependencies added — used sorted array approach instead of phf
- Implemented character reference state machine (named, numeric decimal, numeric hex)
- Added context-sensitive attribute character reference handling (legacy compat)
- Added Windows-1252 remapping for numeric refs in 0x80–0x9F range
- Added CDATA section tokenization (emits chars, parse error in HTML content)
- Processing instructions now correctly emit as bogus comments
- Added ParseError tracking with line/column info (33 error kinds)
- Added null character handling (U+FFFD replacement) in all states
- Added EOF handling in all states with appropriate token emission
- Added 39 unit tests covering RCDATA, RAWTEXT, PLAINTEXT, script data (incl. escape/double-escape), char refs, CDATA, PIs, error recovery, duplicate attrs, control/noncharacter refs
- Added golden test 275-rcdata-rawtext-states

### Change Log

- 2026-03-14: Full tokenizer refactor to WHATWG §13.2.5 state machine (Tasks 1-5)
- 2026-03-14: Code review fixes: removed dead Token::RawText variant, added duplicate attribute detection, added noncharacter/control char ref checks, fixed noscript RAWTEXT handling, renamed golden 093 to 275, added script escape tests

### File List

- crates/html/src/tokenizer/mod.rs (new) — Main tokenizer state machine, Token/Attribute types
- crates/html/src/tokenizer/states.rs (new) — TokenizerState enum with all 80 WHATWG states
- crates/html/src/tokenizer.rs (deleted) — Replaced by tokenizer module directory
- crates/html/src/entities.rs (modified) — Simplified to binary search on full entity table
- crates/html/src/entity_table.rs (new) — Full HTML5 named character reference table (2,125 entries)
- crates/html/src/lib.rs (modified) — Added entity_table module
- crates/html/src/tree_builder.rs (modified) — Removed dead Token::RawText branches
- crates/html/src/tests/mod.rs (modified) — Added tokenizer_state_tests module
- crates/html/src/tests/tokenizer_tests.rs (modified) — Updated test for spec-correct behavior
- crates/html/src/tests/tokenizer_state_tests.rs (new) — 39 tests for new tokenizer states
- tests/goldens/fixtures/275-rcdata-rawtext-states.html (new) — Golden test fixture
- tests/goldens/expected/275-rcdata-rawtext-states.layout.txt (new) — Expected layout output
- tests/goldens/expected/275-rcdata-rawtext-states.dl.txt (new) — Expected display list output
- tests/goldens/expected/091-xml-processing-instructions.layout.txt (modified) — Updated node IDs
- tests/goldens.rs (modified) — Added golden_275 test function
- tests/external/wpt/wpt_manifest.toml (modified) — Promoted newly-passing WPT test
- docs/HTML5_Implementation_Checklist.md (modified) — Checked off tokenizer and parse error items