Story 2.1: HTML5 Tokenizer Completeness
Status: done
Story
As a web user, I want all HTML content to be parsed correctly regardless of markup patterns, so that pages render properly even with unusual or complex HTML.
Acceptance Criteria
- All tokenizer states implemented per WHATWG HTML §13.2.5: Data, RCDATA, RAWTEXT, script data, PLAINTEXT, tag open, end tag open, tag name, RCDATA less-than sign, script data less-than sign, script data escaped, and all related sub-states produce correct tokens.
- Character references resolved correctly per context: named entities (full HTML5 set: 2,231 entries), numeric (decimal `&#NNN;` and hex `&#xHHH;`), handled differently in attribute values vs. text content per §13.2.5.73 (character reference state).
- CDATA sections, processing instructions, and bogus comments tokenized correctly per spec without errors.
- Malformed HTML (unclosed tags, missing quotes, invalid characters, null bytes) handled with graceful recovery per the spec's error handling rules, with no panics.
- Golden tests cover edge-case tokenizer states. `docs/HTML5_Implementation_Checklist.md` updated. `just ci` passes.
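The attribute-value rule in the second criterion trips up many implementations. A minimal sketch of the decision, assuming a helper shape that is not the crate's actual API:

```rust
/// Sketch of the legacy-compat rule from HTML §13.2.5.73: inside an
/// attribute value, a named-reference match that lacks its trailing ';'
/// is NOT consumed when the next input character is alphanumeric or '='.
/// Function name and signature are illustrative only.
fn consume_as_reference_in_attribute(match_has_semicolon: bool, next: Option<char>) -> bool {
    if match_has_semicolon {
        return true; // `&amp;` is always a reference
    }
    // `&not=1` in `href="?a=1&not=1"` stays literal; `&amp ` still resolves.
    !matches!(next, Some(c) if c == '=' || c.is_ascii_alphanumeric())
}
```

In text content this check does not apply; there the match is simply resolved (with a missing-semicolon parse error when applicable).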
Tasks / Subtasks
- Task 1: Implement full tokenizer state machine (AC: #1)
  - 1.1 Refactor tokenizer from linear scan to explicit state machine with a `TokenizerState` enum matching WHATWG §13.2.5 states
  - 1.2 Implement RCDATA state for `<title>`, `<textarea>` (currently treated as raw text; must decode character refs but not tags)
  - 1.3 Implement RAWTEXT state for `<iframe>`, `<noembed>`, `<noframes>`, `<xmp>` (no entity decoding, no tags until matching end tag)
  - 1.4 Implement PLAINTEXT state for `<plaintext>` (no end tag ever closes it)
  - 1.5 Implement script data states: script data, script data escaped, script data double-escaped, and their sub-states (less-than sign, end tag open/name, escape start/dash variants) per §13.2.5.15–§13.2.5.28
  - 1.6 Implement proper tag open / end tag open / tag name states with spec-accurate error recovery (e.g., `<1` emits `<` as a character, `</>` is ignored)
  - 1.7 Implement before/after attribute name and attribute value (quoted/unquoted) states with proper error handling
  - 1.8 Handle the self-closing flag (`/>`) with appropriate parse errors for non-void elements
  - 1.9 Implement bogus comment state (e.g., `<!DOCTYPE` errors, `<? ... >`)
  - 1.10 Implement markup declaration open state dispatching: `<!--`, `<!DOCTYPE`, `<![CDATA[`
- Task 2: Complete character reference handling (AC: #2)
  - 2.1 Replace the ~100-entity lookup table in `entities.rs` with the full HTML5 named character reference table (2,231 entries); use a `phf` static map or a sorted array with binary search for compile-time efficiency
  - 2.2 Implement context-sensitive character reference consumption: in attributes, a named reference without its semicolon followed by an alphanumeric or `=` must NOT be consumed as a reference (legacy compat)
  - 2.3 Handle ambiguous ampersand parse errors per spec
  - 2.4 Handle numeric character reference edge cases: null → U+FFFD, surrogates → U+FFFD, out-of-range → U+FFFD, legacy Windows-1252 remapping table for the 0x80–0x9F range
- Task 3: CDATA, PI, and bogus comment handling (AC: #3)
  - 3.1 Implement CDATA section tokenization (`<![CDATA[...]]>`); only valid in foreign content (SVG/MathML), in HTML content emit as a bogus comment
  - 3.2 Ensure processing instructions (`<?...>`) are consumed as bogus comments per the HTML spec (not XML PI behavior)
- Task 4: Error recovery and robustness (AC: #4)
  - 4.1 Handle null characters: replace with U+FFFD in appropriate states, parse error in others
  - 4.2 Handle unexpected characters in each state per spec (e.g., `<` in an unquoted attribute value triggers a parse error but continues)
  - 4.3 Handle abrupt EOF in each state: emit appropriate tokens and parse errors
  - 4.4 Ensure existing safety limits (10MB input, 500K tokens, 1024 nesting depth) remain enforced
  - 4.5 Add `ParseError` tracking: collect parse errors with location info (line/column) but don't halt tokenization
- Task 5: Tests and documentation (AC: #5)
  - 5.1 Add unit tests for each new tokenizer state: RCDATA, RAWTEXT, PLAINTEXT, script data states
  - 5.2 Add unit tests for full character reference resolution: named (multi-codepoint entities like ⊏̸), numeric edge cases, context-sensitive handling in attributes
  - 5.3 Add golden tests for pages exercising edge-case tokenizer behavior
  - 5.4 Update `docs/HTML5_Implementation_Checklist.md` and check off tokenizer-related items
  - 5.5 Run `just ci` and ensure all tests pass
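Task 2.4's numeric edge cases reduce to a small total mapping. A sketch with an abbreviated remap table (the full table in HTML §13.2.5.80 covers every remapped code point in the 0x80–0x9F block; the function name is illustrative, not the crate's API):

```rust
/// Map a parsed numeric character reference to the character to emit.
/// Sketch only: the C1 remap slice below is a four-entry sample of the
/// spec's Windows-1252 table.
fn resolve_numeric_ref(code: u32) -> char {
    const C1_REMAP: &[(u32, char)] = &[
        (0x80, '\u{20AC}'), // EURO SIGN
        (0x82, '\u{201A}'), // SINGLE LOW-9 QUOTATION MARK
        (0x8A, '\u{0160}'), // LATIN CAPITAL LETTER S WITH CARON
        (0x99, '\u{2122}'), // TRADE MARK SIGN
    ];
    match code {
        0x00 => '\u{FFFD}',              // null -> REPLACEMENT CHARACTER
        0xD800..=0xDFFF => '\u{FFFD}',   // surrogate -> REPLACEMENT CHARACTER
        c if c > 0x10FFFF => '\u{FFFD}', // out of range -> REPLACEMENT CHARACTER
        0x80..=0x9F => C1_REMAP
            .iter()
            .find(|&&(k, _)| k == code)
            .map(|&(_, ch)| ch)
            // C1 code points without a remap entry pass through unchanged
            // (still a control-character-reference parse error).
            .unwrap_or_else(|| char::from_u32(code).unwrap()),
        c => char::from_u32(c).unwrap_or('\u{FFFD}'),
    }
}
```

Each of these cases also records a parse error; the mapping above only decides which character is emitted.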
Dev Notes
Current Implementation (What Exists)
The tokenizer lives in crates/html/src/tokenizer.rs (~320 lines). It uses a linear character-scanning approach — NOT a state machine. Key characteristics:
- `tokenize(input: &str) -> Vec<Token>` — main entry point, iterates chars in a loop
- Token enum variants: `Doctype`, `StartTag`, `EndTag`, `Character`, `Comment`, `RawText`, `Eof`
- Attribute parsing: handles quoted (single/double) and unquoted values, lowercases names
- Raw text elements: `<script>` and `<style>` skip inner content until the matching end tag, but this is NOT the same as the spec's RAWTEXT/script-data states
- Entity decoding: `entities.rs` has ~100 named entities plus numeric/hex support
- Safety limits: `MAX_HTML_INPUT_BYTES` (10MB), `MAX_HTML_TOKENS` (500K), `MAX_HTML_NESTING_DEPTH` (1024)
Architecture Constraints
- Layer 1 crate — depends only on `dom`, `shared`, and `tracing`. No upward deps.
- Arena-based IDs — tokenizer produces tokens consumed by the tree_builder, which creates a `NodeId`-based DOM
- No unsafe — enforced by `scripts/check_unsafe.sh`
- Error handling — use a `thiserror` derive, never panic on malformed input
- Spec citations — use the `// HTML §13.2.5.x` inline comment format
- File size — enforced by `scripts/check_file_size.sh`; if tokenizer.rs grows large, split into a `src/tokenizer/` module directory
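The `ParseError` tracking Task 4.5 asks for can be as small as a kind plus a position. A std-only sketch (the crate's version would use the `thiserror` derive per the constraint above; the kind names below are just three of the spec's error codes):

```rust
/// Illustrative subset of the parse error kinds (the spec defines many more).
#[derive(Debug, Clone, PartialEq)]
enum ParseErrorKind {
    UnexpectedNullCharacter,
    EofInTag,
    AbruptClosingOfEmptyComment,
}

/// A recorded parse error: tokenization continues after recording it.
#[derive(Debug, Clone, PartialEq)]
struct ParseError {
    kind: ParseErrorKind,
    line: u32,
    column: u32,
}

impl std::fmt::Display for ParseError {
    fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
        write!(f, "{:?} at {}:{}", self.kind, self.line, self.column)
    }
}
```

The tokenizer would push these into a `Vec<ParseError>` as it runs, never returning early because of one.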
Key Design Decision: State Machine Refactor
The current linear scan must be refactored to an explicit state machine. Recommended approach:
```rust
enum TokenizerState {
    Data,
    Rcdata,
    Rawtext,
    ScriptData,
    Plaintext,
    TagOpen,
    EndTagOpen,
    TagName,
    // ... all WHATWG §13.2.5 states
}
```
Do NOT use a separate fn per state (excessive function call overhead). Use a match on state in a loop — this is the standard efficient approach for spec-conformant HTML tokenizers.
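A minimal sketch of that match-in-loop shape, using a three-state toy (names and the return type are illustrative; the real machine also emits tokens, records parse errors, and handles reconsumption):

```rust
// Toy state machine: each loop iteration consumes one character and the
// match on `state` decides the transition, exactly the dispatch pattern
// recommended above.
enum State { Data, TagOpen, TagName }

fn run(input: &str) -> Vec<String> {
    let mut state = State::Data;
    let mut tags = Vec::new();
    let mut name = String::new();
    for c in input.chars() {
        state = match state {
            State::Data => match c {
                '<' => State::TagOpen, // HTML §13.2.5.1
                _ => State::Data,
            },
            State::TagOpen => match c {
                'a'..='z' | 'A'..='Z' => {
                    name.push(c.to_ascii_lowercase());
                    State::TagName // HTML §13.2.5.6
                }
                _ => State::Data, // real recovery would re-emit '<'
            },
            State::TagName => match c {
                '>' => {
                    tags.push(std::mem::take(&mut name));
                    State::Data // HTML §13.2.5.8
                }
                _ => {
                    name.push(c.to_ascii_lowercase());
                    State::TagName
                }
            },
        };
    }
    tags
}
```

Because every transition is a plain enum assignment inside one loop, the compiler can turn the dispatch into a jump table with no per-state call overhead.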
Entity Table Strategy
Current `entities.rs` has ~100 hardcoded entries. For the full 2,231-entry HTML5 table:
- Option A (recommended): use the `phf` crate for a compile-time perfect hash map — O(1) lookup, zero runtime cost. Already used in the Rust ecosystem for this exact purpose.
- Option B: a sorted array with binary search — no new dependency, but O(log n) lookup.
- Do NOT use a runtime `HashMap` — wasteful for static data.
The entity data can be sourced from the WHATWG JSON file or generated at build time.
If adding `phf` as a dependency, add a rationale comment in `Cargo.toml` per project rules (new dependencies require rationale).
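A sketch of Option B with a tiny illustrative slice of the table (entries sorted by byte value, keys stored without the leading `&`; `NotSquareSubset;` shows a multi-codepoint expansion):

```rust
/// Five-entry illustrative slice of the named character reference table.
/// The real table has 2,231 names and is kept sorted so binary search
/// can find an exact match in O(log n).
static ENTITIES: &[(&str, &str)] = &[
    ("AMP;", "&"),
    ("NotSquareSubset;", "\u{228F}\u{0338}"), // multi-codepoint expansion
    ("amp", "&"), // legacy name without semicolon
    ("amp;", "&"),
    ("lt;", "<"),
];

fn lookup_entity(name: &str) -> Option<&'static str> {
    ENTITIES
        .binary_search_by(|&(key, _)| key.cmp(name))
        .ok()
        .map(|i| ENTITIES[i].1)
}
```

Note the real consumer must try the longest match against the input stream (e.g., `&ampx` still matches `amp`), which a sorted table also supports via prefix range scans; this sketch only does exact lookup.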
RCDATA vs RAWTEXT vs Script Data
These are commonly confused. Critical distinctions:
| State | Used for | Decodes entities? | Recognizes tags? | End condition |
|---|---|---|---|---|
| RCDATA | `<title>`, `<textarea>` | Yes | No (except matching end tag) | Matching end tag |
| RAWTEXT | `<iframe>`, `<noembed>`, `<noframes>`, `<xmp>` | No | No (except matching end tag) | Matching end tag |
| Script data | `<script>` | No | No, but has escape sequences (`<!--`, `-->`) | Matching `</script>` |
| PLAINTEXT | `<plaintext>` | No | No | Never (runs to EOF) |
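The "matching end tag" column is the whole trick: the tokenizer leaves these states only at a case-insensitive appropriate end tag. A deliberately simplified sketch of that scan, assuming a hypothetical helper rather than the crate's API (the spec additionally requires the tag name to be followed by whitespace, `/`, or `>` before it counts):

```rust
/// Everything before the case-insensitive `</tag` close is character
/// data in RAWTEXT; nothing inside is decoded. Hypothetical helper,
/// ASCII lowercasing preserves byte offsets so the slice is safe.
fn rawtext_span<'a>(input: &'a str, tag: &str) -> &'a str {
    let close = format!("</{}", tag.to_ascii_lowercase());
    match input.to_ascii_lowercase().find(&close) {
        Some(i) => &input[..i],
        None => input, // abrupt EOF: the rest is all character data
    }
}
```

RCDATA differs only in that the returned span would then go through entity decoding; PLAINTEXT never scans for a close at all.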
Files to Modify
- `crates/html/src/tokenizer.rs` — major refactor to state machine
- `crates/html/src/entities.rs` — expand to full HTML5 entity table
- `crates/html/src/lib.rs` — update public API if the Token enum changes
- `crates/html/src/tests/tokenizer_tests.rs` — extensive new tests
- `tests/goldens/fixtures/` + `tests/goldens/expected/` — new golden tests
- `docs/HTML5_Implementation_Checklist.md` — update checked items
- `crates/html/Cargo.toml` — possibly add `phf` dependency
What NOT to Change
- `tree_builder.rs` — Story 2.2 handles insertion modes. Only touch if Token enum shape changes require it.
- `dom/` — no DOM changes needed for tokenizer work.
- Other crates — the tokenizer is internal to the `html` crate.
Previous Story Learnings (Epic 1)
- Code review catches real bugs: Story 1-13's review found `unset` not handled in pseudo-element styling. Expect similar edge-case misses.
- Golden tests are essential: every rendering-affecting change needs golden test coverage.
- Update checklists: always update `docs/HTML5_Implementation_Checklist.md` at the end.
- Commit pattern: recent commits follow the format `Implement <feature> with code review fixes (§section)`.
Testing Strategy
- Unit tests in `crates/html/src/tests/tokenizer_tests.rs` — test each state machine state independently
- If the test file exceeds ~200 lines, split into `tokenizer_state_tests.rs`, `entity_tests.rs`, etc. (per architecture doc)
- Golden tests in `tests/goldens/` — HTML pages that exercise RCDATA (`<title>` with entities), RAWTEXT (`<xmp>`), script data escaping
- Property-based tests (optional) with `proptest` — fuzz the tokenizer with random byte sequences, assert no panics
References
- WHATWG HTML Living Standard §13.2.5 — Tokenization
- HTML5 Named Character References (JSON)
- [Source: crates/html/src/tokenizer.rs] — current implementation
- [Source: crates/html/src/entities.rs] — current entity table
- [Source: crates/html/src/tests/tokenizer_tests.rs] — current tests
- [Source: docs/HTML5_Implementation_Checklist.md] — checklist to update
- [Source: _bmad-output/planning-artifacts/architecture.md] — architecture constraints
- [Source: _bmad-output/planning-artifacts/epics.md] — Epic 2 requirements
Dev Agent Record
Agent Model Used
Claude Opus 4.6 (1M context)
Debug Log References
- Fixed golden test 091-xml-processing-instructions: node IDs shifted because PIs now emit bogus comment tokens (spec-correct behavior)
- Promoted WPT test wpt-css-css-tables-table-cell-inline-size-box-sizing-quirks from known_fail to pass (side effect of improved tokenizer)
- Updated test_invalid_entity_remains_literal: named entities without a trailing semicolon now correctly resolve per HTML §13.2.5.73
Completion Notes List
- Refactored tokenizer from 320-line linear scan to full WHATWG §13.2.5 state machine (~2100 lines in module directory)
- Implemented all 80 tokenizer states as an enum with match-in-loop pattern
- Added RCDATA state (title, textarea) with entity decoding but no tag recognition
- Added RAWTEXT state (style, xmp, iframe, noembed, noframes) with no entity decoding (noscript excluded — only RAWTEXT when scripting enabled)
- Added PLAINTEXT state (never exits)
- Added ScriptData states with full escape/double-escape handling (§13.2.5.15-28)
- Replaced ~100-entity lookup table with full HTML5 table (2,125 entries) using sorted array + binary search
- No new dependencies added — used sorted array approach instead of phf
- Implemented character reference state machine (named, numeric decimal, numeric hex)
- Added context-sensitive attribute character reference handling (legacy compat)
- Added Windows-1252 remapping for numeric refs in 0x80-0x9F range
- Added CDATA section tokenization (emits chars, parse error in HTML content)
- Processing instructions now correctly emit as bogus comments
- Added ParseError tracking with line/column info (33 error kinds)
- Added null character handling (U+FFFD replacement) in all states
- Added EOF handling in all states with appropriate token emission
- Added 39 unit tests covering RCDATA, RAWTEXT, PLAINTEXT, script data (incl. escape/double-escape), char refs, CDATA, PIs, error recovery, duplicate attrs, control/noncharacter refs
- Added golden test 275-rcdata-rawtext-states
Change Log
- 2026-03-14: Full tokenizer refactor to WHATWG §13.2.5 state machine (Tasks 1-5)
- 2026-03-14: Code review fixes: removed dead Token::RawText variant, added duplicate attribute detection, added noncharacter/control char ref checks, fixed noscript RAWTEXT handling, renamed golden 093 to 275, added script escape tests
File List
- crates/html/src/tokenizer/mod.rs (new) — Main tokenizer state machine, Token/Attribute types
- crates/html/src/tokenizer/states.rs (new) — TokenizerState enum with all 80 WHATWG states
- crates/html/src/tokenizer.rs (deleted) — Replaced by tokenizer module directory
- crates/html/src/entities.rs (modified) — Simplified to binary search on full entity table
- crates/html/src/entity_table.rs (new) — Full HTML5 named character reference table (2,125 entries)
- crates/html/src/lib.rs (modified) — Added entity_table module
- crates/html/src/tree_builder.rs (modified) — Removed dead Token::RawText branches
- crates/html/src/tests/mod.rs (modified) — Added tokenizer_state_tests module
- crates/html/src/tests/tokenizer_tests.rs (modified) — Updated test for spec-correct behavior
- crates/html/src/tests/tokenizer_state_tests.rs (new) — 39 tests for new tokenizer states
- tests/goldens/fixtures/275-rcdata-rawtext-states.html (new) — Golden test fixture
- tests/goldens/expected/275-rcdata-rawtext-states.layout.txt (new) — Expected layout output
- tests/goldens/expected/275-rcdata-rawtext-states.dl.txt (new) — Expected display list output
- tests/goldens/expected/091-xml-processing-instructions.layout.txt (modified) — Updated node IDs
- tests/goldens.rs (modified) — Added golden_275 test function
- tests/external/wpt/wpt_manifest.toml (modified) — Promoted newly-passing WPT test
- docs/HTML5_Implementation_Checklist.md (modified) — Checked off tokenizer and parse error items