Refactor tokenizer from 320-line linear scan to full WHATWG §13.2.5 state machine with all 80 states.

Add RCDATA, RAWTEXT, PLAINTEXT, ScriptData (with escape/double-escape), character reference states, CDATA sections, and bogus comments. Replace the ~100-entity table with the full HTML5 set (2,125 unique entries), stored as a sorted array and looked up by binary search. Add ParseError tracking with 33 error kinds, including line/column info.

Code review fixes:
- Remove the dead Token::RawText variant and dead tree_builder branches
- Add duplicate attribute detection
- Add noncharacter/control character reference checks per §13.2.5.80
- Fix noscript being handled as RAWTEXT unconditionally
- Rename golden 093 to 275 to avoid a number collision
- Add script escape/double-escape tests

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
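For readers unfamiliar with the sorted array + binary search approach mentioned above, here is a minimal sketch. The names `ENTITIES` and `lookup_entity` are hypothetical and not taken from this commit, and the table holds only five illustrative entries rather than the full HTML5 set. It shows exact-name lookup only; the spec's longest-prefix matching for named character references is not covered.

```rust
/// Hypothetical entity table of (name, expansion) pairs.
/// The slice MUST stay sorted by name in byte order, or binary search breaks.
static ENTITIES: &[(&str, &str)] = &[
    ("amp;", "&"),
    ("gt;", ">"),
    ("lt;", "<"),
    ("nbsp;", "\u{00A0}"),
    ("quot;", "\""),
];

/// Look up an entity by its exact name (without the leading '&').
/// Returns the expansion if the name is in the table, otherwise None.
fn lookup_entity(name: &str) -> Option<&'static str> {
    ENTITIES
        .binary_search_by(|(key, _)| key.cmp(&name))
        .ok()
        .map(|idx| ENTITIES[idx].1)
}
```

A sorted static slice needs no allocation or hashing at startup, and lookup cost grows logarithmically with the table size, which is likely why it suits a table of a couple of thousand fixed entries.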
8 lines
520 B
Plaintext
DisplayList items=6
Text rect=(8, 9.6, 22.449337, 16) text="This" color=#000000 font_size=16
Text rect=(33.50367, 9.6, 8.657856, 16) text="is" color=#000000 font_size=16
Text rect=(45.21586, 9.6, 120.30543, 16) text="<b>preformatted</b>" color=#000000 font_size=16
Text rect=(168.57562, 9.6, 36.546257, 16) text="&amp;" color=#000000 font_size=16
Text rect=(208.17622, 9.6, 20.675476, 16) text="raw" color=#000000 font_size=16
Text rect=(231.90604, 9.6, 21.321585, 16) text="text" color=#000000 font_size=16