Story 2.1: HTML5 Tokenizer Completeness
Status: done
Story
As a web user, I want all HTML content to be parsed correctly regardless of markup patterns, so that pages render properly even with unusual or complex HTML.
Acceptance Criteria
- All tokenizer states implemented per WHATWG HTML §13.2.5: Data, RCDATA, RAWTEXT, script data, PLAINTEXT, tag open, end tag open, tag name, RCDATA less-than sign, script data less-than sign, script data escaped, and all related sub-states produce correct tokens.
- Character references resolved correctly per context: named entities (full HTML5 set: 2,231 entries), numeric (decimal `&#NNN;` and hex `&#xHHH;`), handled differently in attribute values vs. text content per §13.2.5.73 (character reference state).
- CDATA sections, processing instructions, and bogus comments tokenized correctly per spec without errors.
- Malformed HTML (unclosed tags, missing quotes, invalid characters, null bytes) handled with graceful recovery per the spec's error handling rules, with no panics.
- Golden tests cover edge-case tokenizer states. `docs/HTML5_Implementation_Checklist.md` updated. `just ci` passes.
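The attribute-value rule in the second criterion trips up many implementations. A minimal sketch of the decision, assuming a helper shape that is not the crate's actual API:

```rust
/// Sketch of the legacy-compat rule from HTML §13.2.5.73: inside an
/// attribute value, a named-reference match that lacks its trailing ';'
/// is NOT consumed when the next input character is alphanumeric or '='.
/// Function name and signature are illustrative only.
fn consume_as_reference_in_attribute(match_has_semicolon: bool, next: Option<char>) -> bool {
    if match_has_semicolon {
        return true; // `&amp;` is always a reference
    }
    // `&not=1` in `href="?a=1&not=1"` stays literal; `&amp ` still resolves.
    !matches!(next, Some(c) if c == '=' || c.is_ascii_alphanumeric())
}
```

In text content this check does not apply; there the match is simply resolved (with a missing-semicolon parse error when applicable).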
Tasks / Subtasks
- Task 1: Implement full tokenizer state machine (AC: #1)
  - 1.1 Refactor tokenizer from linear scan to explicit state machine with a `TokenizerState` enum matching WHATWG §13.2.5 states
  - 1.2 Implement RCDATA state for `<title>`, `<textarea>` (currently treated as raw text; must decode character refs but not tags)
  - 1.3 Implement RAWTEXT state for `<iframe>`, `<noembed>`, `<noframes>`, `<xmp>` (no entity decoding, no tags until matching end tag)
  - 1.4 Implement PLAINTEXT state for `<plaintext>` (no end tag ever closes it)
  - 1.5 Implement script data states: script data, script data escaped, script data double-escaped, and their sub-states (less-than sign, end tag open/name, escape start/dash variants) per §13.2.5.15–§13.2.5.28
  - 1.6 Implement proper tag open / end tag open / tag name states with spec-accurate error recovery (e.g., `<1` emits `<` as a character, `</>` is ignored)
  - 1.7 Implement before/after attribute name and attribute value (quoted/unquoted) states with proper error handling
  - 1.8 Handle the self-closing flag (`/>`) with appropriate parse errors for non-void elements
  - 1.9 Implement bogus comment state (e.g., `<!DOCTYPE` errors, `<? ... >`)
  - 1.10 Implement markup declaration open state dispatching: `<!--`, `<!DOCTYPE`, `<![CDATA[`
- Task 2: Complete character reference handling (AC: #2)
  - 2.1 Replace the ~100-entity lookup table in `entities.rs` with the full HTML5 named character reference table (2,231 entries); use a `phf` static map or a sorted array with binary search for compile-time efficiency
  - 2.2 Implement context-sensitive character reference consumption: in attributes, a named reference without its semicolon followed by an alphanumeric or `=` must NOT be consumed as a reference (legacy compat)
  - 2.3 Handle ambiguous ampersand parse errors per spec
  - 2.4 Handle numeric character reference edge cases: null → U+FFFD, surrogates → U+FFFD, out-of-range → U+FFFD, legacy Windows-1252 remapping table for the 0x80–0x9F range
- Task 3: CDATA, PI, and bogus comment handling (AC: #3)
  - 3.1 Implement CDATA section tokenization (`<![CDATA[...]]>`); only valid in foreign content (SVG/MathML), in HTML content emit as a bogus comment
  - 3.2 Ensure processing instructions (`<?...>`) are consumed as bogus comments per the HTML spec (not XML PI behavior)
- Task 4: Error recovery and robustness (AC: #4)
  - 4.1 Handle null characters: replace with U+FFFD in appropriate states, parse error in others
  - 4.2 Handle unexpected characters in each state per spec (e.g., `<` in an unquoted attribute value triggers a parse error but continues)
  - 4.3 Handle abrupt EOF in each state: emit appropriate tokens and parse errors
  - 4.4 Ensure existing safety limits (10MB input, 500K tokens, 1024 nesting depth) remain enforced
  - 4.5 Add `ParseError` tracking: collect parse errors with location info (line/column) but don't halt tokenization
- Task 5: Tests and documentation (AC: #5)
  - 5.1 Add unit tests for each new tokenizer state: RCDATA, RAWTEXT, PLAINTEXT, script data states
  - 5.2 Add unit tests for full character reference resolution: named (multi-codepoint entities like ⊏̸), numeric edge cases, context-sensitive handling in attributes
  - 5.3 Add golden tests for pages exercising edge-case tokenizer behavior
  - 5.4 Update `docs/HTML5_Implementation_Checklist.md` and check off tokenizer-related items
  - 5.5 Run `just ci` and ensure all tests pass
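Task 2.4's numeric edge cases reduce to a small total mapping. A sketch with an abbreviated remap table (the full table in HTML §13.2.5.80 covers every remapped code point in the 0x80–0x9F block; the function name is illustrative, not the crate's API):

```rust
/// Map a parsed numeric character reference to the character to emit.
/// Sketch only: the C1 remap slice below is a four-entry sample of the
/// spec's Windows-1252 table.
fn resolve_numeric_ref(code: u32) -> char {
    const C1_REMAP: &[(u32, char)] = &[
        (0x80, '\u{20AC}'), // EURO SIGN
        (0x82, '\u{201A}'), // SINGLE LOW-9 QUOTATION MARK
        (0x8A, '\u{0160}'), // LATIN CAPITAL LETTER S WITH CARON
        (0x99, '\u{2122}'), // TRADE MARK SIGN
    ];
    match code {
        0x00 => '\u{FFFD}',              // null -> REPLACEMENT CHARACTER
        0xD800..=0xDFFF => '\u{FFFD}',   // surrogate -> REPLACEMENT CHARACTER
        c if c > 0x10FFFF => '\u{FFFD}', // out of range -> REPLACEMENT CHARACTER
        0x80..=0x9F => C1_REMAP
            .iter()
            .find(|&&(k, _)| k == code)
            .map(|&(_, ch)| ch)
            // C1 code points without a remap entry pass through unchanged
            // (still a control-character-reference parse error).
            .unwrap_or_else(|| char::from_u32(code).unwrap()),
        c => char::from_u32(c).unwrap_or('\u{FFFD}'),
    }
}
```

Each of these cases also records a parse error; the mapping above only decides which character is emitted.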
Dev Notes
Current Implementation (What Exists)
The tokenizer lives in crates/html/src/tokenizer.rs (~320 lines). It uses a linear character-scanning approach — NOT a state machine. Key characteristics:
- `tokenize(input: &str) -> Vec<Token>` — main entry point, iterates chars in a loop
- Token enum variants: `Doctype`, `StartTag`, `EndTag`, `Character`, `Comment`, `RawText`, `Eof`
- Attribute parsing: handles quoted (single/double) and unquoted values, lowercases names
- Raw text elements: `<script>` and `<style>` skip inner content until the matching end tag, but this is NOT the same as the spec's RAWTEXT/script-data states
- Entity decoding: `entities.rs` has ~100 named entities plus numeric/hex support
- Safety limits: `MAX_HTML_INPUT_BYTES` (10MB), `MAX_HTML_TOKENS` (500K), `MAX_HTML_NESTING_DEPTH` (1024)
Architecture Constraints
- Layer 1 crate — depends only on `dom`, `shared`, and `tracing`. No upward deps.
- Arena-based IDs — tokenizer produces tokens consumed by the tree_builder, which creates a `NodeId`-based DOM
- No unsafe — enforced by `scripts/check_unsafe.sh`
- Error handling — use a `thiserror` derive, never panic on malformed input
- Spec citations — use the `// HTML §13.2.5.x` inline comment format
- File size — enforced by `scripts/check_file_size.sh`; if tokenizer.rs grows large, split into a `src/tokenizer/` module directory
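The `ParseError` tracking Task 4.5 asks for can be as small as a kind plus a position. A std-only sketch (the crate's version would use the `thiserror` derive per the constraint above; the kind names below are just three of the spec's error codes):

```rust
/// Illustrative subset of the parse error kinds (the spec defines many more).
#[derive(Debug, Clone, PartialEq)]
enum ParseErrorKind {
    UnexpectedNullCharacter,
    EofInTag,
    AbruptClosingOfEmptyComment,
}

/// A recorded parse error: tokenization continues after recording it.
#[derive(Debug, Clone, PartialEq)]
struct ParseError {
    kind: ParseErrorKind,
    line: u32,
    column: u32,
}

impl std::fmt::Display for ParseError {
    fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
        write!(f, "{:?} at {}:{}", self.kind, self.line, self.column)
    }
}
```

The tokenizer would push these into a `Vec<ParseError>` as it runs, never returning early because of one.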
Key Design Decision: State Machine Refactor
The current linear scan must be refactored to an explicit state machine. Recommended approach:
```rust
enum TokenizerState {
    Data,
    Rcdata,
    Rawtext,
    ScriptData,
    Plaintext,
    TagOpen,
    EndTagOpen,
    TagName,
    // ... all WHATWG §13.2.5 states
}
```
Do NOT use a separate fn per state (excessive function call overhead). Use a match on state in a loop — this is the standard efficient approach for spec-conformant HTML tokenizers.
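A minimal sketch of that match-in-loop shape, using a three-state toy (names and the return type are illustrative; the real machine also emits tokens, records parse errors, and handles reconsumption):

```rust
// Toy state machine: each loop iteration consumes one character and the
// match on `state` decides the transition, exactly the dispatch pattern
// recommended above.
enum State { Data, TagOpen, TagName }

fn run(input: &str) -> Vec<String> {
    let mut state = State::Data;
    let mut tags = Vec::new();
    let mut name = String::new();
    for c in input.chars() {
        state = match state {
            State::Data => match c {
                '<' => State::TagOpen, // HTML §13.2.5.1
                _ => State::Data,
            },
            State::TagOpen => match c {
                'a'..='z' | 'A'..='Z' => {
                    name.push(c.to_ascii_lowercase());
                    State::TagName // HTML §13.2.5.6
                }
                _ => State::Data, // real recovery would re-emit '<'
            },
            State::TagName => match c {
                '>' => {
                    tags.push(std::mem::take(&mut name));
                    State::Data // HTML §13.2.5.8
                }
                _ => {
                    name.push(c.to_ascii_lowercase());
                    State::TagName
                }
            },
        };
    }
    tags
}
```

Because every transition is a plain enum assignment inside one loop, the compiler can turn the dispatch into a jump table with no per-state call overhead.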
Entity Table Strategy
Current `entities.rs` has ~100 hardcoded entries. For the full 2,231-entry HTML5 table:
- Option A (recommended): use the `phf` crate for a compile-time perfect hash map — O(1) lookup, zero runtime cost. Already used in the Rust ecosystem for this exact purpose.
- Option B: a sorted array with binary search — no new dependency, but O(log n) lookup.
- Do NOT use a runtime `HashMap` — wasteful for static data.
The entity data can be sourced from the WHATWG JSON file or generated at build time.
If adding `phf` as a dependency, add a rationale comment in `Cargo.toml` per project rules (new dependencies require rationale).
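A sketch of Option B with a tiny illustrative slice of the table (entries sorted by byte value, keys stored without the leading `&`; `NotSquareSubset;` shows a multi-codepoint expansion):

```rust
/// Five-entry illustrative slice of the named character reference table.
/// The real table has 2,231 names and is kept sorted so binary search
/// can find an exact match in O(log n).
static ENTITIES: &[(&str, &str)] = &[
    ("AMP;", "&"),
    ("NotSquareSubset;", "\u{228F}\u{0338}"), // multi-codepoint expansion
    ("amp", "&"), // legacy name without semicolon
    ("amp;", "&"),
    ("lt;", "<"),
];

fn lookup_entity(name: &str) -> Option<&'static str> {
    ENTITIES
        .binary_search_by(|&(key, _)| key.cmp(name))
        .ok()
        .map(|i| ENTITIES[i].1)
}
```

Note the real consumer must try the longest match against the input stream (e.g., `&ampx` still matches `amp`), which a sorted table also supports via prefix range scans; this sketch only does exact lookup.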
RCDATA vs RAWTEXT vs Script Data
These are commonly confused. Critical distinctions:
| State | Used for | Decodes entities? | Recognizes tags? | End condition |
|---|---|---|---|---|
| RCDATA | `<title>`, `<textarea>` | Yes | No (except matching end tag) | Matching end tag |
| RAWTEXT | `<iframe>`, `<noembed>`, `<noframes>`, `<xmp>` | No | No (except matching end tag) | Matching end tag |
| Script data | `<script>` | No | No, but has escape sequences (`<!--`, `-->`) | Matching `</script>` |
| PLAINTEXT | `<plaintext>` | No | No | Never (runs to EOF) |
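The "matching end tag" column is the whole trick: the tokenizer leaves these states only at a case-insensitive appropriate end tag. A deliberately simplified sketch of that scan, assuming a hypothetical helper rather than the crate's API (the spec additionally requires the tag name to be followed by whitespace, `/`, or `>` before it counts):

```rust
/// Everything before the case-insensitive `</tag` close is character
/// data in RAWTEXT; nothing inside is decoded. Hypothetical helper,
/// ASCII lowercasing preserves byte offsets so the slice is safe.
fn rawtext_span<'a>(input: &'a str, tag: &str) -> &'a str {
    let close = format!("</{}", tag.to_ascii_lowercase());
    match input.to_ascii_lowercase().find(&close) {
        Some(i) => &input[..i],
        None => input, // abrupt EOF: the rest is all character data
    }
}
```

RCDATA differs only in that the returned span would then go through entity decoding; PLAINTEXT never scans for a close at all.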
Files to Modify
- `crates/html/src/tokenizer.rs` — major refactor to state machine
- `crates/html/src/entities.rs` — expand to full HTML5 entity table
- `crates/html/src/lib.rs` — update public API if the Token enum changes
- `crates/html/src/tests/tokenizer_tests.rs` — extensive new tests
- `tests/goldens/fixtures/` + `tests/goldens/expected/` — new golden tests
- `docs/HTML5_Implementation_Checklist.md` — update checked items
- `crates/html/Cargo.toml` — possibly add `phf` dependency
What NOT to Change
- `tree_builder.rs` — Story 2.2 handles insertion modes. Only touch if Token enum shape changes require it.
- `dom/` — no DOM changes needed for tokenizer work.
- Other crates — the tokenizer is internal to the `html` crate.
Previous Story Learnings (Epic 1)
- Code review catches real bugs: Story 1-13's review found `unset` not handled in pseudo-element styling. Expect similar edge-case misses.
- Golden tests are essential: every rendering-affecting change needs golden test coverage.
- Update checklists: always update `docs/HTML5_Implementation_Checklist.md` at the end.
- Commit pattern: recent commits follow the format `Implement <feature> with code review fixes (§section)`.
Testing Strategy
- Unit tests in `crates/html/src/tests/tokenizer_tests.rs` — test each state machine state independently
- If the test file exceeds ~200 lines, split into `tokenizer_state_tests.rs`, `entity_tests.rs`, etc. (per architecture doc)
- Golden tests in `tests/goldens/` — HTML pages that exercise RCDATA (`<title>` with entities), RAWTEXT (`<xmp>`), script data escaping
- Property-based tests (optional) with `proptest` — fuzz the tokenizer with random byte sequences, assert no panics
References
- WHATWG HTML Living Standard §13.2.5 — Tokenization
- HTML5 Named Character References (JSON)
- [Source: crates/html/src/tokenizer.rs] — current implementation
- [Source: crates/html/src/entities.rs] — current entity table
- [Source: crates/html/src/tests/tokenizer_tests.rs] — current tests
- [Source: docs/HTML5_Implementation_Checklist.md] — checklist to update
- [Source: _bmad-output/planning-artifacts/architecture.md] — architecture constraints
- [Source: _bmad-output/planning-artifacts/epics.md] — Epic 2 requirements
Dev Agent Record
Agent Model Used
Claude Opus 4.6 (1M context)
Debug Log References
- Fixed golden test 091-xml-processing-instructions: node IDs shifted because PIs now emit bogus comment tokens (spec-correct behavior)
- Promoted WPT test wpt-css-css-tables-table-cell-inline-size-box-sizing-quirks from known_fail to pass (side effect of improved tokenizer)
- Updated test_invalid_entity_remains_literal: named entities without a trailing semicolon now correctly resolve per HTML §13.2.5.73
Completion Notes List
- Refactored tokenizer from 320-line linear scan to full WHATWG §13.2.5 state machine (~2100 lines in module directory)
- Implemented all 80 tokenizer states as an enum with match-in-loop pattern
- Added RCDATA state (title, textarea) with entity decoding but no tag recognition
- Added RAWTEXT state (style, xmp, iframe, noembed, noframes) with no entity decoding (noscript excluded — only RAWTEXT when scripting enabled)
- Added PLAINTEXT state (never exits)
- Added ScriptData states with full escape/double-escape handling (§13.2.5.15-28)
- Replaced ~100-entity lookup table with full HTML5 table (2,125 entries) using sorted array + binary search
- No new dependencies added — used sorted array approach instead of phf
- Implemented character reference state machine (named, numeric decimal, numeric hex)
- Added context-sensitive attribute character reference handling (legacy compat)
- Added Windows-1252 remapping for numeric refs in 0x80-0x9F range
- Added CDATA section tokenization (emits chars, parse error in HTML content)
- Processing instructions now correctly emit as bogus comments
- Added ParseError tracking with line/column info (33 error kinds)
- Added null character handling (U+FFFD replacement) in all states
- Added EOF handling in all states with appropriate token emission
- Added 39 unit tests covering RCDATA, RAWTEXT, PLAINTEXT, script data (incl. escape/double-escape), char refs, CDATA, PIs, error recovery, duplicate attrs, control/noncharacter refs
- Added golden test 275-rcdata-rawtext-states
Change Log
- 2026-03-14: Full tokenizer refactor to WHATWG §13.2.5 state machine (Tasks 1-5)
- 2026-03-14: Code review fixes: removed dead Token::RawText variant, added duplicate attribute detection, added noncharacter/control char ref checks, fixed noscript RAWTEXT handling, renamed golden 093 to 275, added script escape tests
File List
- crates/html/src/tokenizer/mod.rs (new) — Main tokenizer state machine, Token/Attribute types
- crates/html/src/tokenizer/states.rs (new) — TokenizerState enum with all 80 WHATWG states
- crates/html/src/tokenizer.rs (deleted) — Replaced by tokenizer module directory
- crates/html/src/entities.rs (modified) — Simplified to binary search on full entity table
- crates/html/src/entity_table.rs (new) — Full HTML5 named character reference table (2,125 entries)
- crates/html/src/lib.rs (modified) — Added entity_table module
- crates/html/src/tree_builder.rs (modified) — Removed dead Token::RawText branches
- crates/html/src/tests/mod.rs (modified) — Added tokenizer_state_tests module
- crates/html/src/tests/tokenizer_tests.rs (modified) — Updated test for spec-correct behavior
- crates/html/src/tests/tokenizer_state_tests.rs (new) — 39 tests for new tokenizer states
- tests/goldens/fixtures/275-rcdata-rawtext-states.html (new) — Golden test fixture
- tests/goldens/expected/275-rcdata-rawtext-states.layout.txt (new) — Expected layout output
- tests/goldens/expected/275-rcdata-rawtext-states.dl.txt (new) — Expected display list output
- tests/goldens/expected/091-xml-processing-instructions.layout.txt (modified) — Updated node IDs
- tests/goldens.rs (modified) — Added golden_275 test function
- tests/external/wpt/wpt_manifest.toml (modified) — Promoted newly-passing WPT test
- docs/HTML5_Implementation_Checklist.md (modified) — Checked off tokenizer and parse error items