v8
V8 is Google’s open source high-performance JavaScript and WebAssembly engine, written in C++.
Loading...
Searching...
No Matches
regexp-compiler.cc
Go to the documentation of this file.
1// Copyright 2019 the V8 project authors. All rights reserved.
2// Use of this source code is governed by a BSD-style license that can be
3// found in the LICENSE file.
4
6
7#include <optional>
8
15
16#ifdef V8_INTL_SUPPORT
18#include "unicode/locid.h"
19#include "unicode/uniset.h"
20#include "unicode/utypes.h"
21#endif // V8_INTL_SUPPORT
22
23namespace v8::internal {
24
25using namespace regexp_compiler_constants; // NOLINT(build/namespaces)
26
27// -------------------------------------------------------------------
28// Implementation of the Irregexp regular expression engine.
29//
30// The Irregexp regular expression engine is intended to be a complete
31// implementation of ECMAScript regular expressions. It generates either
32// bytecodes or native code.
33
34// The Irregexp regexp engine is structured in three steps.
35// 1) The parser generates an abstract syntax tree. See ast.cc.
36// 2) From the AST a node network is created. The nodes are all
37// subclasses of RegExpNode. The nodes represent states when
38// executing a regular expression. Several optimizations are
39// performed on the node network.
40// 3) From the nodes we generate either byte codes or native code
41// that can actually execute the regular expression (perform
42// the search). The code generation step is described in more
43// detail below.
44
45// Code generation.
46//
47// The nodes are divided into four main categories.
48// * Choice nodes
49// These represent places where the regular expression can
50// match in more than one way. For example on entry to an
51// alternation (foo|bar) or a repetition (*, +, ? or {}).
52// * Action nodes
53// These represent places where some action should be
54// performed. Examples include recording the current position
55// in the input string to a register (in order to implement
56// captures) or other actions on register for example in order
57// to implement the counters needed for {} repetitions.
58// * Matching nodes
59// These attempt to match some element part of the input string.
60// Examples of elements include character classes, plain strings
61// or back references.
62// * End nodes
63// These are used to implement the actions required on finding
64// a successful match or failing to find a match.
65//
66// The code generated (whether as byte codes or native code) maintains
67// some state as it runs. This consists of the following elements:
68//
69// * The capture registers. Used for string captures.
70// * Other registers. Used for counters etc.
71// * The current position.
72// * The stack of backtracking information. Used when a matching node
73// fails to find a match and needs to try an alternative.
74//
75// Conceptual regular expression execution model:
76//
77// There is a simple conceptual model of regular expression execution
78// which will be presented first. The actual code generated is a more
79// efficient simulation of the simple conceptual model:
80//
81// * Choice nodes are implemented as follows:
82// For each choice except the last {
83// push current position
84// push backtrack code location
85// <generate code to test for choice>
86// backtrack code location:
87// pop current position
88// }
89// <generate code to test for last choice>
90//
91// * Actions nodes are generated as follows
92// <push affected registers on backtrack stack>
93// <generate code to perform action>
94// push backtrack code location
95// <generate code to test for following nodes>
96// backtrack code location:
97// <pop affected registers to restore their state>
98// <pop backtrack location from stack and go to it>
99//
100// * Matching nodes are generated as follows:
101// if input string matches at current position
102// update current position
103// <generate code to test for following nodes>
104// else
105// <pop backtrack location from stack and go to it>
106//
107// Thus it can be seen that the current position is saved and restored
108// by the choice nodes, whereas the registers are saved and restored by
109// by the action nodes that manipulate them.
110//
111// The other interesting aspect of this model is that nodes are generated
112// at the point where they are needed by a recursive call to Emit(). If
113// the node has already been code generated then the Emit() call will
114// generate a jump to the previously generated code instead. In order to
115// limit recursion it is possible for the Emit() function to put the node
116// on a work list for later generation and instead generate a jump. The
117// destination of the jump is resolved later when the code is generated.
118//
119// Actual regular expression code generation.
120//
121// Code generation is actually more complicated than the above. In order to
122// improve the efficiency of the generated code some optimizations are
123// performed
124//
125// * Choice nodes have 1-character lookahead.
126// A choice node looks at the following character and eliminates some of
127// the choices immediately based on that character. This is not yet
128// implemented.
129// * Simple greedy loops store reduced backtracking information.
130// A quantifier like /.*foo/m will greedily match the whole input. It will
131// then need to backtrack to a point where it can match "foo". The naive
132// implementation of this would push each character position onto the
133// backtracking stack, then pop them off one by one. This would use space
134// proportional to the length of the input string. However since the "."
135// can only match in one way and always has a constant length (in this case
136// of 1) it suffices to store the current position on the top of the stack
137// once. Matching now becomes merely incrementing the current position and
138// backtracking becomes decrementing the current position and checking the
139// result against the stored current position. This is faster and saves
140// space.
141// * The current state is virtualized.
142// This is used to defer expensive operations until it is clear that they
143// are needed and to generate code for a node more than once, allowing
144// specialized an efficient versions of the code to be created. This is
145// explained in the section below.
146//
147// Execution state virtualization.
148//
149// Instead of emitting code, nodes that manipulate the state can record their
150// manipulation in an object called the Trace. The Trace object can record a
151// current position offset, an optional backtrack code location on the top of
152// the virtualized backtrack stack and some register changes. When a node is
153// to be emitted it can flush the Trace or update it. Flushing the Trace
154// will emit code to bring the actual state into line with the virtual state.
155// Avoiding flushing the state can postpone some work (e.g. updates of capture
156// registers). Postponing work can save time when executing the regular
157// expression since it may be found that the work never has to be done as a
158// failure to match can occur. In addition it is much faster to jump to a
159// known backtrack code location than it is to pop an unknown backtrack
160// location from the stack and jump there.
161//
162// The virtual state found in the Trace affects code generation. For example
163// the virtual state contains the difference between the actual current
164// position and the virtual current position, and matching code needs to use
165// this offset to attempt a match in the correct location of the input
166// string. Therefore code generated for a non-trivial trace is specialized
167// to that trace. The code generator therefore has the ability to generate
168// code for each node several times. In order to limit the size of the
169// generated code there is an arbitrary limit on how many specialized sets of
170// code may be generated for a given node. If the limit is reached, the
171// trace is flushed and a generic version of the code for a node is emitted.
172// This is subsequently used for that node. The code emitted for non-generic
173// trace is not recorded in the node and so it cannot currently be reused in
174// the event that code generation is requested for an identical trace.
175
176namespace {
177
178constexpr base::uc32 MaxCodeUnit(const bool one_byte) {
179 static_assert(String::kMaxOneByteCharCodeU <=
180 std::numeric_limits<uint16_t>::max());
181 static_assert(String::kMaxUtf16CodeUnitU <=
182 std::numeric_limits<uint16_t>::max());
184}
185
186constexpr uint32_t CharMask(const bool one_byte) {
189 return MaxCodeUnit(one_byte);
190}
191
192} // namespace
193
195
197 text->AddElement(TextElement::Atom(this), zone);
198}
199
201 text->AddElement(TextElement::ClassRanges(this), zone);
202}
203
205 for (int i = 0; i < elements()->length(); i++)
206 text->AddElement(elements()->at(i), zone);
207}
208
212
216
218 switch (text_type()) {
219 case ATOM:
220 return atom()->length();
221
222 case CLASS_RANGES:
223 return 1;
224 }
225 UNREACHABLE();
226}
227
229 public:
230 explicit RecursionCheck(RegExpCompiler* compiler) : compiler_(compiler) {
231 compiler->IncrementRecursionDepth();
232 }
234
235 private:
237};
238
239// Attempts to compile the regexp using an Irregexp code generator. Returns
240// a fixed array or a null handle depending on whether it succeeded.
241RegExpCompiler::RegExpCompiler(Isolate* isolate, Zone* zone, int capture_count,
242 RegExpFlags flags, bool one_byte)
243 : next_register_(JSRegExp::RegistersForCaptureCount(capture_count)),
244 unicode_lookaround_stack_register_(kNoRegister),
245 unicode_lookaround_position_register_(kNoRegister),
246 work_list_(nullptr),
247 recursion_depth_(0),
248 flags_(flags),
249 one_byte_(one_byte),
250 reg_exp_too_big_(false),
251 limiting_recursion_(false),
252 optimize_(v8_flags.regexp_optimization),
253 read_backward_(false),
254 current_expansion_factor_(1),
255 frequency_collator_(),
256 isolate_(isolate),
257 zone_(zone) {
260}
261
263 Isolate* isolate, RegExpMacroAssembler* macro_assembler, RegExpNode* start,
264 int capture_count, DirectHandle<String> pattern) {
266
267 ZoneVector<RegExpNode*> work_list(zone());
268 work_list_ = &work_list;
269 Label fail;
271 Trace new_trace;
272 start->Emit(this, &new_trace);
275 while (!work_list.empty()) {
276 RegExpNode* node = work_list.back();
277 work_list.pop_back();
278 node->set_on_work_list(false);
279 if (!node->label()->is_bound()) node->Emit(this, &new_trace);
280 }
281 if (reg_exp_too_big_) {
282 if (v8_flags.correctness_fuzzer_suppressions) {
283 FATAL("Aborting on excess zone allocation");
284 }
287 }
288
290 isolate->IncreaseTotalRegexpCodeGenerated(code);
291 work_list_ = nullptr;
292
293 return {code, next_register_};
294}
295
298 Interval range = static_cast<DeferredClearCaptures*>(this)->range();
299 return range.Contains(that);
300 } else {
301 return reg() == that;
302 }
303}
304
306 for (DeferredAction* action = actions_; action != nullptr;
307 action = action->next()) {
308 if (action->Mentions(reg)) return true;
309 }
310 return false;
311}
312
314 DCHECK_EQ(0, *cp_offset);
315 for (DeferredAction* action = actions_; action != nullptr;
316 action = action->next()) {
317 if (action->Mentions(reg)) {
318 if (action->action_type() == ActionNode::STORE_POSITION) {
319 *cp_offset = static_cast<DeferredCapture*>(action)->cp_offset();
320 return true;
321 } else {
322 return false;
323 }
324 }
325 }
326 return false;
327}
328
329// A (dynamically-sized) set of unsigned integers that behaves especially well
330// on small integers (< kFirstLimit). May do zone-allocation.
331class DynamicBitSet : public ZoneObject {
332 public:
333 V8_EXPORT_PRIVATE bool Get(unsigned value) const {
334 if (value < kFirstLimit) {
335 return (first_ & (1 << value)) != 0;
336 } else if (remaining_ == nullptr) {
337 return false;
338 } else {
339 return remaining_->Contains(value);
340 }
341 }
342
343 // Destructively set a value in this set.
344 void Set(unsigned value, Zone* zone) {
345 if (value < kFirstLimit) {
346 first_ |= (1 << value);
347 } else {
348 if (remaining_ == nullptr)
349 remaining_ = zone->New<ZoneList<unsigned>>(1, zone);
350 if (remaining_->is_empty() || !remaining_->Contains(value))
351 remaining_->Add(value, zone);
352 }
353 }
354
355 private:
356 static constexpr unsigned kFirstLimit = 32;
357
358 uint32_t first_ = 0;
359 ZoneList<unsigned>* remaining_ = nullptr;
360};
361
363 Zone* zone) {
364 int max_register = RegExpCompiler::kNoRegister;
365 for (DeferredAction* action = actions_; action != nullptr;
366 action = action->next()) {
367 if (action->action_type() == ActionNode::CLEAR_CAPTURES) {
368 Interval range = static_cast<DeferredClearCaptures*>(action)->range();
369 for (int i = range.from(); i <= range.to(); i++)
370 affected_registers->Set(i, zone);
371 if (range.to() > max_register) max_register = range.to();
372 } else {
373 affected_registers->Set(action->reg(), zone);
374 if (action->reg() > max_register) max_register = action->reg();
375 }
376 }
377 return max_register;
378}
379
381 int max_register,
382 const DynamicBitSet& registers_to_pop,
383 const DynamicBitSet& registers_to_clear) {
384 for (int reg = max_register; reg >= 0; reg--) {
385 if (registers_to_pop.Get(reg)) {
386 assembler->PopRegister(reg);
387 } else if (registers_to_clear.Get(reg)) {
388 int clear_to = reg;
389 while (reg > 0 && registers_to_clear.Get(reg - 1)) {
390 reg--;
391 }
392 assembler->ClearRegisters(reg, clear_to);
393 }
394 }
395}
396
398 int max_register,
399 const DynamicBitSet& affected_registers,
400 DynamicBitSet* registers_to_pop,
401 DynamicBitSet* registers_to_clear,
402 Zone* zone) {
403 // Count pushes performed to force a stack limit check occasionally.
404 int pushes = 0;
405
406 for (int reg = 0; reg <= max_register; reg++) {
407 if (!affected_registers.Get(reg)) continue;
408
409 // The chronologically first deferred action in the trace
410 // is used to infer the action needed to restore a register
411 // to its previous state (or not, if it's safe to ignore it).
412 enum DeferredActionUndoType { IGNORE, RESTORE, CLEAR };
413 DeferredActionUndoType undo_action = IGNORE;
414
415 int value = 0;
416 bool absolute = false;
417 bool clear = false;
418 static const int kNoStore = kMinInt;
419 int store_position = kNoStore;
420 // This is a little tricky because we are scanning the actions in reverse
421 // historical order (newest first).
422 for (DeferredAction* action = actions_; action != nullptr;
423 action = action->next()) {
424 if (action->Mentions(reg)) {
425 switch (action->action_type()) {
428 static_cast<Trace::DeferredSetRegisterForLoop*>(action);
429 if (!absolute) {
430 value += psr->value();
431 absolute = true;
432 }
433 // SET_REGISTER_FOR_LOOP is only used for newly introduced loop
434 // counters. They can have a significant previous value if they
435 // occur in a loop. TODO(lrn): Propagate this information, so
436 // we can set undo_action to IGNORE if we know there is no value to
437 // restore.
438 undo_action = RESTORE;
439 DCHECK_EQ(store_position, kNoStore);
440 DCHECK(!clear);
441 break;
442 }
444 if (!absolute) {
445 value++;
446 }
447 DCHECK_EQ(store_position, kNoStore);
448 DCHECK(!clear);
449 undo_action = RESTORE;
450 break;
453 static_cast<Trace::DeferredCapture*>(action);
454 if (!clear && store_position == kNoStore) {
455 store_position = pc->cp_offset();
456 }
457
458 // For captures we know that stores and clears alternate.
459 // Other register, are never cleared, and if the occur
460 // inside a loop, they might be assigned more than once.
461 if (reg <= 1) {
462 // Registers zero and one, aka "capture zero", is
463 // always set correctly if we succeed. There is no
464 // need to undo a setting on backtrack, because we
465 // will set it again or fail.
466 undo_action = IGNORE;
467 } else {
468 undo_action = pc->is_capture() ? CLEAR : RESTORE;
469 }
470 DCHECK(!absolute);
471 DCHECK_EQ(value, 0);
472 break;
473 }
475 // Since we're scanning in reverse order, if we've already
476 // set the position we have to ignore historically earlier
477 // clearing operations.
478 if (store_position == kNoStore) {
479 clear = true;
480 }
481 undo_action = RESTORE;
482 DCHECK(!absolute);
483 DCHECK_EQ(value, 0);
484 break;
485 }
486 default:
487 UNREACHABLE();
488 }
489 }
490 }
491 // Prepare for the undo-action (e.g., push if it's going to be popped).
492 if (undo_action == RESTORE) {
493 pushes++;
496 DCHECK_GT(assembler->stack_limit_slack_slot_count(), 0);
497 if (pushes == assembler->stack_limit_slack_slot_count()) {
499 pushes = 0;
500 }
501
502 assembler->PushRegister(reg, stack_check);
503 registers_to_pop->Set(reg, zone);
504 } else if (undo_action == CLEAR) {
505 registers_to_clear->Set(reg, zone);
506 }
507 // Perform the chronologically last action (or accumulated increment)
508 // for the register.
509 if (store_position != kNoStore) {
510 assembler->WriteCurrentPositionToRegister(reg, store_position);
511 } else if (clear) {
512 assembler->ClearRegisters(reg, reg);
513 } else if (absolute) {
514 assembler->SetRegister(reg, value);
515 } else if (value != 0) {
516 assembler->AdvanceRegister(reg, value);
517 }
518 }
519}
520
521// This is called as we come into a loop choice node and some other tricky
522// nodes. It normalizes the state of the code generator to ensure we can
523// generate generic code.
524void Trace::Flush(RegExpCompiler* compiler, RegExpNode* successor) {
525 RegExpMacroAssembler* assembler = compiler->macro_assembler();
526
527 DCHECK(!is_trivial());
528
529 if (actions_ == nullptr && backtrack() == nullptr) {
530 // Here we just have some deferred cp advances to fix and we are back to
531 // a normal situation. We may also have to forget some information gained
532 // through a quick check that was already performed.
533 if (cp_offset_ != 0) assembler->AdvanceCurrentPosition(cp_offset_);
534 // Create a new trivial state and generate the node with that.
535 Trace new_state;
536 successor->Emit(compiler, &new_state);
537 return;
538 }
539
540 // Generate deferred actions here along with code to undo them again.
541 DynamicBitSet affected_registers;
542
543 if (backtrack() != nullptr) {
544 // Here we have a concrete backtrack location. These are set up by choice
545 // nodes and so they indicate that we have a deferred save of the current
546 // position which we may need to emit here.
547 assembler->PushCurrentPosition();
548 }
549
550 int max_register =
551 FindAffectedRegisters(&affected_registers, compiler->zone());
552 DynamicBitSet registers_to_pop;
553 DynamicBitSet registers_to_clear;
554 PerformDeferredActions(assembler, max_register, affected_registers,
555 &registers_to_pop, &registers_to_clear,
556 compiler->zone());
557 if (cp_offset_ != 0) {
558 assembler->AdvanceCurrentPosition(cp_offset_);
559 }
560
561 // Create a new trivial state and generate the node with that.
562 Label undo;
563 assembler->PushBacktrack(&undo);
564 if (successor->KeepRecursing(compiler)) {
565 Trace new_state;
566 successor->Emit(compiler, &new_state);
567 } else {
568 compiler->AddWork(successor);
569 assembler->GoTo(successor->label());
570 }
571
572 // On backtrack we need to restore state.
573 assembler->BindJumpTarget(&undo);
574 RestoreAffectedRegisters(assembler, max_register, registers_to_pop,
575 registers_to_clear);
576 if (backtrack() == nullptr) {
577 assembler->Backtrack();
578 } else {
579 assembler->PopCurrentPosition();
580 assembler->GoTo(backtrack());
581 }
582}
583
585 RegExpMacroAssembler* assembler = compiler->macro_assembler();
586
587 // Omit flushing the trace. We discard the entire stack frame anyway.
588
589 if (!label()->is_bound()) {
590 // We are completely independent of the trace, since we ignore it,
591 // so this code can be used as the generic version.
592 assembler->Bind(label());
593 }
594
595 // Throw away everything on the backtrack stack since the start
596 // of the negative submatch and restore the character position.
597 assembler->ReadCurrentPositionFromRegister(current_position_register_);
598 assembler->ReadStackPointerFromRegister(stack_pointer_register_);
599 if (clear_capture_count_ > 0) {
600 // Clear any captures that might have been performed during the success
601 // of the body of the negative look-ahead.
602 int clear_capture_end = clear_capture_start_ + clear_capture_count_ - 1;
603 assembler->ClearRegisters(clear_capture_start_, clear_capture_end);
604 }
605 // Now that we have unwound the stack we find at the top of the stack the
606 // backtrack that the BeginNegativeSubmatch node got.
607 assembler->Backtrack();
608}
609
610void EndNode::Emit(RegExpCompiler* compiler, Trace* trace) {
611 if (!trace->is_trivial()) {
612 trace->Flush(compiler, this);
613 return;
614 }
615 RegExpMacroAssembler* assembler = compiler->macro_assembler();
616 if (!label()->is_bound()) {
617 assembler->Bind(label());
618 }
619 switch (action_) {
620 case ACCEPT:
621 assembler->Succeed();
622 return;
623 case BACKTRACK:
624 assembler->GoTo(trace->backtrack());
625 return;
626 case NEGATIVE_SUBMATCH_SUCCESS:
627 // This case is handled in a different virtual method.
628 UNREACHABLE();
629 }
631}
632
634 if (guards_ == nullptr) guards_ = zone->New<ZoneList<Guard*>>(1, zone);
635 guards_->Add(guard, zone);
636}
637
639 RegExpNode* on_success) {
641 on_success->zone()->New<ActionNode>(SET_REGISTER_FOR_LOOP, on_success);
643 result->data_.u_store_register.value = val;
644 return result;
645}
646
649 on_success->zone()->New<ActionNode>(INCREMENT_REGISTER, on_success);
651 return result;
652}
653
655 RegExpNode* on_success) {
657 on_success->zone()->New<ActionNode>(STORE_POSITION, on_success);
659 result->data_.u_position_register.is_capture = is_capture;
660 return result;
661}
662
665 on_success->zone()->New<ActionNode>(CLEAR_CAPTURES, on_success);
666 result->data_.u_clear_captures.range_from = range.from();
667 result->data_.u_clear_captures.range_to = range.to();
668 return result;
669}
670
671ActionNode* ActionNode::BeginPositiveSubmatch(int stack_reg, int position_reg,
672 RegExpNode* body,
673 ActionNode* success_node) {
675 body->zone()->New<ActionNode>(BEGIN_POSITIVE_SUBMATCH, body);
677 result->data_.u_submatch.current_position_register = position_reg;
678 result->data_.u_submatch.success_node = success_node;
679 return result;
680}
681
682ActionNode* ActionNode::BeginNegativeSubmatch(int stack_reg, int position_reg,
683 RegExpNode* on_success) {
685 on_success->zone()->New<ActionNode>(BEGIN_NEGATIVE_SUBMATCH, on_success);
687 result->data_.u_submatch.current_position_register = position_reg;
688 return result;
689}
690
691ActionNode* ActionNode::PositiveSubmatchSuccess(int stack_reg, int position_reg,
692 int clear_register_count,
693 int clear_register_from,
694 RegExpNode* on_success) {
695 ActionNode* result = on_success->zone()->New<ActionNode>(
696 POSITIVE_SUBMATCH_SUCCESS, on_success);
698 result->data_.u_submatch.current_position_register = position_reg;
699 result->data_.u_submatch.clear_register_count = clear_register_count;
700 result->data_.u_submatch.clear_register_from = clear_register_from;
701 return result;
702}
703
705 int repetition_register,
706 int repetition_limit,
707 RegExpNode* on_success) {
709 on_success->zone()->New<ActionNode>(EMPTY_MATCH_CHECK, on_success);
711 result->data_.u_empty_match_check.repetition_register = repetition_register;
712 result->data_.u_empty_match_check.repetition_limit = repetition_limit;
713 return result;
714}
715
718 on_success->zone()->New<ActionNode>(MODIFY_FLAGS, on_success);
720 return result;
721}
722
723#define DEFINE_ACCEPT(Type) \
724 void Type##Node::Accept(NodeVisitor* visitor) { visitor->Visit##Type(this); }
726#undef DEFINE_ACCEPT
727
728// -------------------------------------------------------------------
729// Emit code.
730
732 Guard* guard, Trace* trace) {
733 switch (guard->op()) {
734 case Guard::LT:
735 DCHECK(!trace->mentions_reg(guard->reg()));
736 macro_assembler->IfRegisterGE(guard->reg(), guard->value(),
737 trace->backtrack());
738 break;
739 case Guard::GEQ:
740 DCHECK(!trace->mentions_reg(guard->reg()));
741 macro_assembler->IfRegisterLT(guard->reg(), guard->value(),
742 trace->backtrack());
743 break;
744 }
745}
746
747namespace {
748
749#ifdef DEBUG
750bool ContainsOnlyUtf16CodeUnits(unibrow::uchar* chars, int length) {
751 static_assert(sizeof(unibrow::uchar) == 4);
752 for (int i = 0; i < length; i++) {
753 if (chars[i] > String::kMaxUtf16CodeUnit) return false;
754 }
755 return true;
756}
757#endif // DEBUG
758
759// Returns the number of characters in the equivalence class, omitting those
760// that cannot occur in the source string because it is Latin1. This is called
761// both for unicode modes /ui and /vi, and also for legacy case independent
762// mode /i. In the case of Unicode modes we handled surrogate pair expansions
763// earlier so at this point it's all about single-code-unit expansions.
764int GetCaseIndependentLetters(Isolate* isolate, base::uc16 character,
765 RegExpCompiler* compiler, unibrow::uchar* letters,
766 int letter_length) {
767 bool one_byte_subject = compiler->one_byte();
768 bool unicode = IsEitherUnicode(compiler->flags());
769 static const base::uc16 kMaxAscii = 0x7f;
770 if (!unicode && character <= kMaxAscii) {
771 // Fast case for common characters.
772 base::uc16 upper = character & ~0x20;
773 if ('A' <= upper && upper <= 'Z') {
774 letters[0] = upper;
775 letters[1] = upper | 0x20;
776 return 2;
777 }
778 letters[0] = character;
779 return 1;
780 }
781#ifdef V8_INTL_SUPPORT
782
783 if (!unicode && RegExpCaseFolding::IgnoreSet().contains(character)) {
784 if (one_byte_subject && character > String::kMaxOneByteCharCode) {
785 // This function promises not to return a character that is impossible
786 // for the subject encoding.
787 return 0;
788 }
789 letters[0] = character;
790 DCHECK(ContainsOnlyUtf16CodeUnits(letters, 1));
791 return 1;
792 }
793 bool in_special_add_set =
794 RegExpCaseFolding::SpecialAddSet().contains(character);
795
796 icu::UnicodeSet set;
797 set.add(character);
798 set = set.closeOver(unicode ? USET_SIMPLE_CASE_INSENSITIVE
799 : USET_CASE_INSENSITIVE);
800
801 UChar32 canon = 0;
802 if (in_special_add_set && !unicode) {
803 canon = RegExpCaseFolding::Canonicalize(character);
804 }
805
806 int32_t range_count = set.getRangeCount();
807 int items = 0;
808 for (int32_t i = 0; i < range_count; i++) {
809 UChar32 start = set.getRangeStart(i);
810 UChar32 end = set.getRangeEnd(i);
811 CHECK(end - start + items <= letter_length);
812 for (UChar32 cu = start; cu <= end; cu++) {
813 if (one_byte_subject && cu > String::kMaxOneByteCharCode) continue;
814 if (!unicode && in_special_add_set &&
815 RegExpCaseFolding::Canonicalize(cu) != canon) {
816 continue;
817 }
818 letters[items++] = static_cast<unibrow::uchar>(cu);
819 }
820 }
821 DCHECK(ContainsOnlyUtf16CodeUnits(letters, items));
822 return items;
823#else
824 int length =
825 isolate->jsregexp_uncanonicalize()->get(character, '\0', letters);
826 // Unibrow returns 0 or 1 for characters where case independence is
827 // trivial.
828 if (length == 0) {
829 letters[0] = character;
830 length = 1;
831 }
832
833 if (one_byte_subject) {
834 int new_length = 0;
835 for (int i = 0; i < length; i++) {
836 if (letters[i] <= String::kMaxOneByteCharCode) {
837 letters[new_length++] = letters[i];
838 }
839 }
840 length = new_length;
841 }
842
843 DCHECK(ContainsOnlyUtf16CodeUnits(letters, length));
844 return length;
845#endif // V8_INTL_SUPPORT
846}
847
848inline bool EmitSimpleCharacter(Isolate* isolate, RegExpCompiler* compiler,
849 base::uc16 c, Label* on_failure, int cp_offset,
850 bool check, bool preloaded) {
851 RegExpMacroAssembler* assembler = compiler->macro_assembler();
852 bool bound_checked = false;
853 if (!preloaded) {
854 assembler->LoadCurrentCharacter(cp_offset, on_failure, check);
855 bound_checked = true;
856 }
857 assembler->CheckNotCharacter(c, on_failure);
858 return bound_checked;
859}
860
861// Only emits non-letters (things that don't have case). Only used for case
862// independent matches.
863inline bool EmitAtomNonLetter(Isolate* isolate, RegExpCompiler* compiler,
864 base::uc16 c, Label* on_failure, int cp_offset,
865 bool check, bool preloaded) {
866 RegExpMacroAssembler* macro_assembler = compiler->macro_assembler();
867 bool one_byte = compiler->one_byte();
868 unibrow::uchar chars[4];
869 int length = GetCaseIndependentLetters(isolate, c, compiler, chars, 4);
870 if (length < 1) {
871 // This can't match. Must be an one-byte subject and a non-one-byte
872 // character. We do not need to do anything since the one-byte pass
873 // already handled this.
874 CHECK(one_byte);
875 return false; // Bounds not checked.
876 }
877 bool checked = false;
878 // We handle the length > 1 case in a later pass.
879 if (length == 1) {
880 // GetCaseIndependentLetters promises not to return characters that can't
881 // match because of the subject encoding. This case is already handled by
882 // the one-byte pass.
883 CHECK_IMPLIES(one_byte, chars[0] <= String::kMaxOneByteCharCodeU);
884 if (!preloaded) {
885 macro_assembler->LoadCurrentCharacter(cp_offset, on_failure, check);
886 checked = check;
887 }
888 macro_assembler->CheckNotCharacter(chars[0], on_failure);
889 }
890 return checked;
891}
892
893bool ShortCutEmitCharacterPair(RegExpMacroAssembler* macro_assembler,
894 bool one_byte, base::uc16 c1, base::uc16 c2,
895 Label* on_failure) {
896 const uint32_t char_mask = CharMask(one_byte);
897 base::uc16 exor = c1 ^ c2;
898 // Check whether exor has only one bit set.
899 if (((exor - 1) & exor) == 0) {
900 // If c1 and c2 differ only by one bit.
901 // Ecma262UnCanonicalize always gives the highest number last.
902 DCHECK(c2 > c1);
903 base::uc16 mask = char_mask ^ exor;
904 macro_assembler->CheckNotCharacterAfterAnd(c1, mask, on_failure);
905 return true;
906 }
907 DCHECK(c2 > c1);
908 base::uc16 diff = c2 - c1;
909 if (((diff - 1) & diff) == 0 && c1 >= diff) {
910 // If the characters differ by 2^n but don't differ by one bit then
911 // subtract the difference from the found character, then do the or
912 // trick. We avoid the theoretical case where negative numbers are
913 // involved in order to simplify code generation.
914 base::uc16 mask = char_mask ^ diff;
915 macro_assembler->CheckNotCharacterAfterMinusAnd(c1 - diff, diff, mask,
916 on_failure);
917 return true;
918 }
919 return false;
920}
921
922// Only emits letters (things that have case). Only used for case independent
923// matches.
924inline bool EmitAtomLetter(Isolate* isolate, RegExpCompiler* compiler,
925 base::uc16 c, Label* on_failure, int cp_offset,
926 bool check, bool preloaded) {
927 RegExpMacroAssembler* macro_assembler = compiler->macro_assembler();
928 bool one_byte = compiler->one_byte();
929 unibrow::uchar chars[4];
930 int length = GetCaseIndependentLetters(isolate, c, compiler, chars, 4);
931 // The 0 and 1 case are handled by earlier passes.
932 if (length <= 1) return false;
933 // We may not need to check against the end of the input string
934 // if this character lies before a character that matched.
935 if (!preloaded) {
936 macro_assembler->LoadCurrentCharacter(cp_offset, on_failure, check);
937 }
938 Label ok;
939 switch (length) {
940 case 2: {
941 if (ShortCutEmitCharacterPair(macro_assembler, one_byte, chars[0],
942 chars[1], on_failure)) {
943 } else {
944 macro_assembler->CheckCharacter(chars[0], &ok);
945 macro_assembler->CheckNotCharacter(chars[1], on_failure);
946 macro_assembler->Bind(&ok);
947 }
948 break;
949 }
950 case 4:
951 macro_assembler->CheckCharacter(chars[3], &ok);
952 [[fallthrough]];
953 case 3:
954 macro_assembler->CheckCharacter(chars[0], &ok);
955 macro_assembler->CheckCharacter(chars[1], &ok);
956 macro_assembler->CheckNotCharacter(chars[2], on_failure);
957 macro_assembler->Bind(&ok);
958 break;
959 default:
960 UNREACHABLE();
961 }
962 return true;
963}
964
965void EmitBoundaryTest(RegExpMacroAssembler* masm, int border,
966 Label* fall_through, Label* above_or_equal,
967 Label* below) {
968 if (below != fall_through) {
969 masm->CheckCharacterLT(border, below);
970 if (above_or_equal != fall_through) masm->GoTo(above_or_equal);
971 } else {
972 masm->CheckCharacterGT(border - 1, above_or_equal);
973 }
974}
975
976void EmitDoubleBoundaryTest(RegExpMacroAssembler* masm, int first, int last,
977 Label* fall_through, Label* in_range,
978 Label* out_of_range) {
979 if (in_range == fall_through) {
980 if (first == last) {
981 masm->CheckNotCharacter(first, out_of_range);
982 } else {
983 masm->CheckCharacterNotInRange(first, last, out_of_range);
984 }
985 } else {
986 if (first == last) {
987 masm->CheckCharacter(first, in_range);
988 } else {
989 masm->CheckCharacterInRange(first, last, in_range);
990 }
991 if (out_of_range != fall_through) masm->GoTo(out_of_range);
992 }
993}
994
995// even_label is for ranges[i] to ranges[i + 1] where i - start_index is even.
996// odd_label is for ranges[i] to ranges[i + 1] where i - start_index is odd.
997void EmitUseLookupTable(RegExpMacroAssembler* masm,
998 ZoneList<base::uc32>* ranges, uint32_t start_index,
999 uint32_t end_index, base::uc32 min_char,
1000 Label* fall_through, Label* even_label,
1001 Label* odd_label) {
1002 static const uint32_t kSize = RegExpMacroAssembler::kTableSize;
1003 static const uint32_t kMask = RegExpMacroAssembler::kTableMask;
1004
1005 base::uc32 base = (min_char & ~kMask);
1006 USE(base);
1007
1008 // Assert that everything is on one kTableSize page.
1009 for (uint32_t i = start_index; i <= end_index; i++) {
1010 DCHECK_EQ(ranges->at(i) & ~kMask, base);
1011 }
1012 DCHECK(start_index == 0 || (ranges->at(start_index - 1) & ~kMask) <= base);
1013
1014 char templ[kSize];
1015 Label* on_bit_set;
1016 Label* on_bit_clear;
1017 int bit;
1018 if (even_label == fall_through) {
1019 on_bit_set = odd_label;
1020 on_bit_clear = even_label;
1021 bit = 1;
1022 } else {
1023 on_bit_set = even_label;
1024 on_bit_clear = odd_label;
1025 bit = 0;
1026 }
1027 for (uint32_t i = 0; i < (ranges->at(start_index) & kMask) && i < kSize;
1028 i++) {
1029 templ[i] = bit;
1030 }
1031 uint32_t j = 0;
1032 bit ^= 1;
1033 for (uint32_t i = start_index; i < end_index; i++) {
1034 for (j = (ranges->at(i) & kMask); j < (ranges->at(i + 1) & kMask); j++) {
1035 templ[j] = bit;
1036 }
1037 bit ^= 1;
1038 }
1039 for (uint32_t i = j; i < kSize; i++) {
1040 templ[i] = bit;
1041 }
1042 Factory* factory = masm->isolate()->factory();
1043 // TODO(erikcorry): Cache these.
1044 Handle<ByteArray> ba = factory->NewByteArray(kSize, AllocationType::kOld);
1045 for (uint32_t i = 0; i < kSize; i++) {
1046 ba->set(i, templ[i]);
1047 }
1048 masm->CheckBitInTable(ba, on_bit_set);
1049 if (on_bit_clear != fall_through) masm->GoTo(on_bit_clear);
1050}
1051
1052void CutOutRange(RegExpMacroAssembler* masm, ZoneList<base::uc32>* ranges,
1053 uint32_t start_index, uint32_t end_index, uint32_t cut_index,
1054 Label* even_label, Label* odd_label) {
1055 bool odd = (((cut_index - start_index) & 1) == 1);
1056 Label* in_range_label = odd ? odd_label : even_label;
1057 Label dummy;
1058 EmitDoubleBoundaryTest(masm, ranges->at(cut_index),
1059 ranges->at(cut_index + 1) - 1, &dummy, in_range_label,
1060 &dummy);
1061 DCHECK(!dummy.is_linked());
1062 // Cut out the single range by rewriting the array. This creates a new
1063 // range that is a merger of the two ranges on either side of the one we
1064 // are cutting out. The oddity of the labels is preserved.
1065 for (uint32_t j = cut_index; j > start_index; j--) {
1066 ranges->at(j) = ranges->at(j - 1);
1067 }
1068 for (uint32_t j = cut_index + 1; j < end_index; j++) {
1069 ranges->at(j) = ranges->at(j + 1);
1070 }
1071}
1072
1073// Unicode case. Split the search space into kSize spaces that are handled
1074// with recursion.
1075void SplitSearchSpace(ZoneList<base::uc32>* ranges, uint32_t start_index,
1076 uint32_t end_index, uint32_t* new_start_index,
1077 uint32_t* new_end_index, base::uc32* border) {
1078 static const uint32_t kSize = RegExpMacroAssembler::kTableSize;
1079 static const uint32_t kMask = RegExpMacroAssembler::kTableMask;
1080
1081 base::uc32 first = ranges->at(start_index);
1082 base::uc32 last = ranges->at(end_index) - 1;
1083
1084 *new_start_index = start_index;
1085 *border = (ranges->at(start_index) & ~kMask) + kSize;
1086 while (*new_start_index < end_index) {
1087 if (ranges->at(*new_start_index) > *border) break;
1088 (*new_start_index)++;
1089 }
1090 // new_start_index is the index of the first edge that is beyond the
1091 // current kSize space.
1092
1093 // For very large search spaces we do a binary chop search of the non-Latin1
1094 // space instead of just going to the end of the current kSize space. The
1095 // heuristics are complicated a little by the fact that any 128-character
1096 // encoding space can be quickly tested with a table lookup, so we don't
1097 // wish to do binary chop search at a smaller granularity than that. A
1098 // 128-character space can take up a lot of space in the ranges array if,
1099 // for example, we only want to match every second character (eg. the lower
1100 // case characters on some Unicode pages).
1101 uint32_t binary_chop_index = (end_index + start_index) / 2;
1102 // The first test ensures that we get to the code that handles the Latin1
1103 // range with a single not-taken branch, speeding up this important
1104 // character range (even non-Latin1 charset-based text has spaces and
1105 // punctuation).
1106 if (*border - 1 > String::kMaxOneByteCharCode && // Latin1 case.
1107 end_index - start_index > (*new_start_index - start_index) * 2 &&
1108 last - first > kSize * 2 && binary_chop_index > *new_start_index &&
1109 ranges->at(binary_chop_index) >= first + 2 * kSize) {
1110 uint32_t scan_forward_for_section_border = binary_chop_index;
1111 uint32_t new_border = (ranges->at(binary_chop_index) | kMask) + 1;
1112
1113 while (scan_forward_for_section_border < end_index) {
1114 if (ranges->at(scan_forward_for_section_border) > new_border) {
1115 *new_start_index = scan_forward_for_section_border;
1116 *border = new_border;
1117 break;
1118 }
1119 scan_forward_for_section_border++;
1120 }
1121 }
1122
1123 DCHECK(*new_start_index > start_index);
1124 *new_end_index = *new_start_index - 1;
1125 if (ranges->at(*new_end_index) == *border) {
1126 (*new_end_index)--;
1127 }
1128 if (*border >= ranges->at(end_index)) {
1129 *border = ranges->at(end_index);
1130 *new_start_index = end_index; // Won't be used.
1131 *new_end_index = end_index - 1;
1132 }
1133}
1134
1135// Gets a series of segment boundaries representing a character class. If the
1136// character is in the range between an even and an odd boundary (counting from
1137// start_index) then go to even_label, otherwise go to odd_label. We already
1138// know that the character is in the range of min_char to max_char inclusive.
1139// Either label can be nullptr indicating backtracking. Either label can also
1140// be equal to the fall_through label.
1141void GenerateBranches(RegExpMacroAssembler* masm, ZoneList<base::uc32>* ranges,
1142 uint32_t start_index, uint32_t end_index,
1143 base::uc32 min_char, base::uc32 max_char,
1144 Label* fall_through, Label* even_label,
1145 Label* odd_label) {
1148
1149 base::uc32 first = ranges->at(start_index);
1150 base::uc32 last = ranges->at(end_index) - 1;
1151
1152 DCHECK_LT(min_char, first);
1153
1154 // Just need to test if the character is before or on-or-after
1155 // a particular character.
1156 if (start_index == end_index) {
1157 EmitBoundaryTest(masm, first, fall_through, even_label, odd_label);
1158 return;
1159 }
1160
1161 // Another almost trivial case: There is one interval in the middle that is
1162 // different from the end intervals.
1163 if (start_index + 1 == end_index) {
1164 EmitDoubleBoundaryTest(masm, first, last, fall_through, even_label,
1165 odd_label);
1166 return;
1167 }
1168
1169 // It's not worth using table lookup if there are very few intervals in the
1170 // character class.
1171 if (end_index - start_index <= 6) {
1172 // It is faster to test for individual characters, so we look for those
1173 // first, then try arbitrary ranges in the second round.
1174 static uint32_t kNoCutIndex = -1;
1175 uint32_t cut = kNoCutIndex;
1176 for (uint32_t i = start_index; i < end_index; i++) {
1177 if (ranges->at(i) == ranges->at(i + 1) - 1) {
1178 cut = i;
1179 break;
1180 }
1181 }
1182 if (cut == kNoCutIndex) cut = start_index;
1183 CutOutRange(masm, ranges, start_index, end_index, cut, even_label,
1184 odd_label);
1185 DCHECK_GE(end_index - start_index, 2);
1186 GenerateBranches(masm, ranges, start_index + 1, end_index - 1, min_char,
1187 max_char, fall_through, even_label, odd_label);
1188 return;
1189 }
1190
1191 // If there are a lot of intervals in the regexp, then we will use tables to
1192 // determine whether the character is inside or outside the character class.
1193 static const int kBits = RegExpMacroAssembler::kTableSizeBits;
1194
1195 if ((max_char >> kBits) == (min_char >> kBits)) {
1196 EmitUseLookupTable(masm, ranges, start_index, end_index, min_char,
1197 fall_through, even_label, odd_label);
1198 return;
1199 }
1200
1201 if ((min_char >> kBits) != first >> kBits) {
1202 masm->CheckCharacterLT(first, odd_label);
1203 GenerateBranches(masm, ranges, start_index + 1, end_index, first, max_char,
1204 fall_through, odd_label, even_label);
1205 return;
1206 }
1207
1208 uint32_t new_start_index = 0;
1209 uint32_t new_end_index = 0;
1210 base::uc32 border = 0;
1211
1212 SplitSearchSpace(ranges, start_index, end_index, &new_start_index,
1213 &new_end_index, &border);
1214
1215 Label handle_rest;
1216 Label* above = &handle_rest;
1217 if (border == last + 1) {
1218 // We didn't find any section that started after the limit, so everything
1219 // above the border is one of the terminal labels.
1220 above = (end_index & 1) != (start_index & 1) ? odd_label : even_label;
1221 DCHECK(new_end_index == end_index - 1);
1222 }
1223
1224 DCHECK_LE(start_index, new_end_index);
1225 DCHECK_LE(new_start_index, end_index);
1226 DCHECK_LT(start_index, new_start_index);
1227 DCHECK_LT(new_end_index, end_index);
1228 DCHECK(new_end_index + 1 == new_start_index ||
1229 (new_end_index + 2 == new_start_index &&
1230 border == ranges->at(new_end_index + 1)));
1231 DCHECK_LT(min_char, border - 1);
1232 DCHECK_LT(border, max_char);
1233 DCHECK_LT(ranges->at(new_end_index), border);
1234 DCHECK(border < ranges->at(new_start_index) ||
1235 (border == ranges->at(new_start_index) &&
1236 new_start_index == end_index && new_end_index == end_index - 1 &&
1237 border == last + 1));
1238 DCHECK(new_start_index == 0 || border >= ranges->at(new_start_index - 1));
1239
1240 masm->CheckCharacterGT(border - 1, above);
1241 Label dummy;
1242 GenerateBranches(masm, ranges, start_index, new_end_index, min_char,
1243 border - 1, &dummy, even_label, odd_label);
1244 if (handle_rest.is_linked()) {
1245 masm->Bind(&handle_rest);
1246 bool flip = (new_start_index & 1) != (start_index & 1);
1247 GenerateBranches(masm, ranges, new_start_index, end_index, border, max_char,
1248 &dummy, flip ? odd_label : even_label,
1249 flip ? even_label : odd_label);
1250 }
1251}
1252
1253void EmitClassRanges(RegExpMacroAssembler* macro_assembler,
1254 RegExpClassRanges* cr, bool one_byte, Label* on_failure,
1255 int cp_offset, bool check_offset, bool preloaded,
1256 Zone* zone) {
1257 ZoneList<CharacterRange>* ranges = cr->ranges(zone);
1259
1260 // Now that all processing (like case-insensitivity) is done, clamp the
1261 // ranges to the set of ranges that may actually occur in the subject string.
1262 if (one_byte) CharacterRange::ClampToOneByte(ranges);
1263
1264 const int ranges_length = ranges->length();
1265 if (ranges_length == 0) {
1266 if (!cr->is_negated()) {
1267 macro_assembler->GoTo(on_failure);
1268 }
1269 if (check_offset) {
1270 macro_assembler->CheckPosition(cp_offset, on_failure);
1271 }
1272 return;
1273 }
1274
1275 const base::uc32 max_char = MaxCodeUnit(one_byte);
1276 if (ranges_length == 1 && ranges->at(0).IsEverything(max_char)) {
1277 if (cr->is_negated()) {
1278 macro_assembler->GoTo(on_failure);
1279 } else {
1280 // This is a common case hit by non-anchored expressions.
1281 if (check_offset) {
1282 macro_assembler->CheckPosition(cp_offset, on_failure);
1283 }
1284 }
1285 return;
1286 }
1287
1288 if (!preloaded) {
1289 macro_assembler->LoadCurrentCharacter(cp_offset, on_failure, check_offset);
1290 }
1291
1292 if (cr->is_standard(zone) && macro_assembler->CheckSpecialClassRanges(
1293 cr->standard_type(), on_failure)) {
1294 return;
1295 }
1296
1297 static constexpr int kMaxRangesForInlineBranchGeneration = 16;
1298 if (ranges_length > kMaxRangesForInlineBranchGeneration) {
1299 // For large range sets, emit a more compact instruction sequence to avoid
1300 // a potentially problematic increase in code size.
1301 // Note the flipped logic below (we check InRange if negated, NotInRange if
1302 // not negated); this is necessary since the method falls through on
1303 // failure whereas we want to fall through on success.
1304 if (cr->is_negated()) {
1305 if (macro_assembler->CheckCharacterInRangeArray(ranges, on_failure)) {
1306 return;
1307 }
1308 } else {
1309 if (macro_assembler->CheckCharacterNotInRangeArray(ranges, on_failure)) {
1310 return;
1311 }
1312 }
1313 }
1314
1315 // Generate a flat list of range boundaries for consumption by
1316 // GenerateBranches. See the comment on that function for how the list should
1317 // be structured
1318 ZoneList<base::uc32>* range_boundaries =
1319 zone->New<ZoneList<base::uc32>>(ranges_length * 2, zone);
1320
1321 bool zeroth_entry_is_failure = !cr->is_negated();
1322
1323 for (int i = 0; i < ranges_length; i++) {
1324 CharacterRange& range = ranges->at(i);
1325 if (range.from() == 0) {
1326 DCHECK_EQ(i, 0);
1327 zeroth_entry_is_failure = !zeroth_entry_is_failure;
1328 } else {
1329 range_boundaries->Add(range.from(), zone);
1330 }
1331 // `+ 1` to convert from inclusive to exclusive `to`.
1332 // [from, to] == [from, to+1[.
1333 range_boundaries->Add(range.to() + 1, zone);
1334 }
1335 int end_index = range_boundaries->length() - 1;
1336 if (range_boundaries->at(end_index) > max_char) {
1337 end_index--;
1338 }
1339
1340 Label fall_through;
1341 GenerateBranches(macro_assembler, range_boundaries,
1342 0, // start_index.
1343 end_index,
1344 0, // min_char.
1345 max_char, &fall_through,
1346 zeroth_entry_is_failure ? &fall_through : on_failure,
1347 zeroth_entry_is_failure ? on_failure : &fall_through);
1348 macro_assembler->Bind(&fall_through);
1349}
1350
1351} // namespace
1352
1353RegExpNode::~RegExpNode() = default;
1354
1356 Trace* trace) {
1357 // If we are generating a greedy loop then don't stop and don't reuse code.
1358 if (trace->stop_node() != nullptr) {
1359 return CONTINUE;
1360 }
1361
1362 RegExpMacroAssembler* macro_assembler = compiler->macro_assembler();
1363 if (trace->is_trivial()) {
1364 if (label_.is_bound() || on_work_list() || !KeepRecursing(compiler)) {
1365 // If a generic version is already scheduled to be generated or we have
1366 // recursed too deeply then just generate a jump to that code.
1367 macro_assembler->GoTo(&label_);
1368 // This will queue it up for generation of a generic version if it hasn't
1369 // already been queued.
1370 compiler->AddWork(this);
1371 return DONE;
1372 }
1373 // Generate generic version of the node and bind the label for later use.
1374 macro_assembler->Bind(&label_);
1375 return CONTINUE;
1376 }
1377
1378 // We are being asked to make a non-generic version. Keep track of how many
1379 // non-generic versions we generate so as not to overdo it.
1380 trace_count_++;
1381 if (KeepRecursing(compiler) && compiler->optimize() &&
1382 trace_count_ < kMaxCopiesCodeGenerated) {
1383 return CONTINUE;
1384 }
1385
1386 // If we get here code has been generated for this node too many times or
1387 // recursion is too deep. Time to switch to a generic version. The code for
1388 // generic versions above can handle deep recursion properly.
1389 bool was_limiting = compiler->limiting_recursion();
1390 compiler->set_limiting_recursion(true);
1391 trace->Flush(compiler, this);
1392 compiler->set_limiting_recursion(was_limiting);
1393 return DONE;
1394}
1395
1397 return !compiler->limiting_recursion() &&
1398 compiler->recursion_depth() <= RegExpCompiler::kMaxRecursion;
1399}
1400
1401void ActionNode::FillInBMInfo(Isolate* isolate, int offset, int budget,
1402 BoyerMooreLookahead* bm, bool not_at_start) {
1403 std::optional<RegExpFlags> old_flags;
1404 if (action_type_ == MODIFY_FLAGS) {
1405 // It is not guaranteed that we hit the resetting modify flags node, due to
1406 // recursion budget limitation for filling in BMInfo. Therefore we reset the
1407 // flags manually to the previous state after recursing.
1408 old_flags = bm->compiler()->flags();
1409 bm->compiler()->set_flags(flags());
1410 }
1411 if (action_type_ == BEGIN_POSITIVE_SUBMATCH) {
1412 // We use the node after the lookaround to fill in the eats_at_least info
1413 // so we have to use the same node to fill in the Boyer-Moore info.
1414 success_node()->on_success()->FillInBMInfo(isolate, offset, budget - 1, bm,
1415 not_at_start);
1416 } else if (action_type_ != POSITIVE_SUBMATCH_SUCCESS) {
1417 // We don't use the node after a positive submatch success because it
1418 // rewinds the position. Since we returned 0 as the eats_at_least value for
1419 // this node, we don't need to fill in any data.
1420 on_success()->FillInBMInfo(isolate, offset, budget - 1, bm, not_at_start);
1421 }
1422 SaveBMInfo(bm, not_at_start, offset);
1423 if (old_flags.has_value()) {
1424 bm->compiler()->set_flags(*old_flags);
1425 }
1426}
1427
1429 RegExpCompiler* compiler, int filled_in,
1430 bool not_at_start) {
1431 if (action_type_ == SET_REGISTER_FOR_LOOP) {
1432 on_success()->GetQuickCheckDetailsFromLoopEntry(details, compiler,
1433 filled_in, not_at_start);
1434 } else if (action_type_ == BEGIN_POSITIVE_SUBMATCH) {
1435 // We use the node after the lookaround to fill in the eats_at_least info
1436 // so we have to use the same node to fill in the QuickCheck info.
1437 success_node()->on_success()->GetQuickCheckDetails(details, compiler,
1438 filled_in, not_at_start);
1439 } else if (action_type() != POSITIVE_SUBMATCH_SUCCESS) {
1440 // We don't use the node after a positive submatch success because it
1441 // rewinds the position. Since we returned 0 as the eats_at_least value
1442 // for this node, we don't need to fill in any data.
1443 std::optional<RegExpFlags> old_flags;
1444 if (action_type() == MODIFY_FLAGS) {
1445 // It is not guaranteed that we hit the resetting modify flags node, as
1446 // GetQuickCheckDetails doesn't travers the whole graph. Therefore we
1447 // reset the flags manually to the previous state after recursing.
1448 old_flags = compiler->flags();
1449 compiler->set_flags(flags());
1450 }
1451 on_success()->GetQuickCheckDetails(details, compiler, filled_in,
1452 not_at_start);
1453 if (old_flags.has_value()) {
1454 compiler->set_flags(*old_flags);
1455 }
1456 }
1457}
1458
1459void AssertionNode::FillInBMInfo(Isolate* isolate, int offset, int budget,
1460 BoyerMooreLookahead* bm, bool not_at_start) {
1461 // Match the behaviour of EatsAtLeast on this node.
1462 if (assertion_type() == AT_START && not_at_start) return;
1463 on_success()->FillInBMInfo(isolate, offset, budget - 1, bm, not_at_start);
1464 SaveBMInfo(bm, not_at_start, offset);
1465}
1466
1468 QuickCheckDetails* details, RegExpCompiler* compiler, int filled_in,
1469 bool not_at_start) {
1470 RegExpNode* node = continue_node();
1471 return node->GetQuickCheckDetails(details, compiler, filled_in, not_at_start);
1472}
1473
1474namespace {
1475
1476// Takes the left-most 1-bit and smears it out, setting all bits to its right.
1477inline uint32_t SmearBitsRight(uint32_t v) {
1478 v |= v >> 1;
1479 v |= v >> 2;
1480 v |= v >> 4;
1481 v |= v >> 8;
1482 v |= v >> 16;
1483 return v;
1484}
1485
1486} // namespace
1487
1489 bool found_useful_op = false;
1490 const uint32_t char_mask = CharMask(asc);
1491 mask_ = 0;
1492 value_ = 0;
1493 int char_shift = 0;
1494 for (int i = 0; i < characters_; i++) {
1495 Position* pos = &positions_[i];
1496 if ((pos->mask & String::kMaxOneByteCharCode) != 0) {
1497 found_useful_op = true;
1498 }
1499 mask_ |= (pos->mask & char_mask) << char_shift;
1500 value_ |= (pos->value & char_mask) << char_shift;
1501 char_shift += asc ? 8 : 16;
1502 }
1503 return found_useful_op;
1504}
1505
1506uint32_t RegExpNode::EatsAtLeast(bool not_at_start) {
1507 return not_at_start ? eats_at_least_.eats_at_least_from_not_start
1508 : eats_at_least_.eats_at_least_from_possibly_start;
1509}
1510
1512 // SET_REGISTER_FOR_LOOP is only used to initialize loop counters, and it
1513 // implies that the following node must be a LoopChoiceNode. If we need to
1514 // set registers to constant values for other reasons, we could introduce a
1515 // new action type SET_REGISTER that doesn't imply anything about its
1516 // successor.
1517 UNREACHABLE();
1518}
1519
1521 RegExpCompiler* compiler,
1522 int characters_filled_in,
1523 bool not_at_start) {
1524 // See comment in RegExpNode::EatsAtLeastFromLoopEntry.
1525 UNREACHABLE();
1526}
1527
1529 DCHECK_EQ(alternatives_->length(), 2); // There's just loop and continue.
1530
1531 if (read_backward()) {
1532 // The eats_at_least value is not used if reading backward. The
1533 // EatsAtLeastPropagator should've zeroed it as well.
1534 DCHECK_EQ(eats_at_least_info()->eats_at_least_from_possibly_start, 0);
1535 DCHECK_EQ(eats_at_least_info()->eats_at_least_from_not_start, 0);
1536 return {};
1537 }
1538
1539 // Figure out how much the loop body itself eats, not including anything in
1540 // the continuation case. In general, the nodes in the loop body should report
1541 // that they eat at least the number eaten by the continuation node, since any
1542 // successful match in the loop body must also include the continuation node.
1543 // However, in some cases involving positive lookaround, the loop body under-
1544 // reports its appetite, so use saturated math here to avoid negative numbers.
1545 // For this to work correctly, we explicitly need to use signed integers here.
1546 uint8_t loop_body_from_not_start = base::saturated_cast<uint8_t>(
1547 static_cast<int>(loop_node_->EatsAtLeast(true)) -
1548 static_cast<int>(continue_node_->EatsAtLeast(true)));
1549 uint8_t loop_body_from_possibly_start = base::saturated_cast<uint8_t>(
1550 static_cast<int>(loop_node_->EatsAtLeast(false)) -
1551 static_cast<int>(continue_node_->EatsAtLeast(true)));
1552
1553 // Limit the number of loop iterations to avoid overflow in subsequent steps.
1554 int loop_iterations = base::saturated_cast<uint8_t>(min_loop_iterations());
1555
1558 base::saturated_cast<uint8_t>(loop_iterations * loop_body_from_not_start +
1559 continue_node_->EatsAtLeast(true));
1560 if (loop_iterations > 0 && loop_body_from_possibly_start > 0) {
1561 // First loop iteration eats at least one, so all subsequent iterations
1562 // and the after-loop chunk are guaranteed to not be at the start.
1563 result.eats_at_least_from_possibly_start = base::saturated_cast<uint8_t>(
1564 loop_body_from_possibly_start +
1565 (loop_iterations - 1) * loop_body_from_not_start +
1566 continue_node_->EatsAtLeast(true));
1567 } else {
1568 // Loop body might eat nothing, so only continue node contributes.
1569 result.eats_at_least_from_possibly_start =
1570 continue_node_->EatsAtLeast(false);
1571 }
1572 return result;
1573}
1574
1576 Trace* bounds_check_trace, Trace* trace,
1577 bool preload_has_checked_bounds,
1578 Label* on_possible_success,
1579 QuickCheckDetails* details,
1580 bool fall_through_on_failure,
1581 ChoiceNode* predecessor) {
1582 DCHECK_NOT_NULL(predecessor);
1583 if (details->characters() == 0) return false;
1584 GetQuickCheckDetails(details, compiler, 0,
1585 trace->at_start() == Trace::FALSE_VALUE);
1586 if (details->cannot_match()) return false;
1587 if (!details->Rationalize(compiler->one_byte())) return false;
1588 DCHECK(details->characters() == 1 ||
1589 compiler->macro_assembler()->CanReadUnaligned());
1590 uint32_t mask = details->mask();
1591 uint32_t value = details->value();
1592
1593 RegExpMacroAssembler* assembler = compiler->macro_assembler();
1594
1595 if (trace->characters_preloaded() != details->characters()) {
1596 DCHECK(trace->cp_offset() == bounds_check_trace->cp_offset());
1597 // The bounds check is performed using the minimum number of characters
1598 // any choice would eat, so if the bounds check fails, then none of the
1599 // choices can succeed, so we can just immediately backtrack, rather
1600 // than go to the next choice. The number of characters preloaded may be
1601 // less than the number used for the bounds check.
1602 int eats_at_least = predecessor->EatsAtLeast(
1603 bounds_check_trace->at_start() == Trace::FALSE_VALUE);
1604 DCHECK_GE(eats_at_least, details->characters());
1605 assembler->LoadCurrentCharacter(
1606 trace->cp_offset(), bounds_check_trace->backtrack(),
1607 !preload_has_checked_bounds, details->characters(), eats_at_least);
1608 }
1609
1610 bool need_mask = true;
1611
1612 if (details->characters() == 1) {
1613 // If number of characters preloaded is 1 then we used a byte or 16 bit
1614 // load so the value is already masked down.
1615 const uint32_t char_mask = CharMask(compiler->one_byte());
1616 if ((mask & char_mask) == char_mask) need_mask = false;
1617 mask &= char_mask;
1618 } else {
1619 // For 2-character preloads in one-byte mode or 1-character preloads in
1620 // two-byte mode we also use a 16 bit load with zero extend.
1621 static const uint32_t kTwoByteMask = 0xFFFF;
1622 static const uint32_t kFourByteMask = 0xFFFFFFFF;
1623 if (details->characters() == 2 && compiler->one_byte()) {
1624 if ((mask & kTwoByteMask) == kTwoByteMask) need_mask = false;
1625 } else if (details->characters() == 1 && !compiler->one_byte()) {
1626 if ((mask & kTwoByteMask) == kTwoByteMask) need_mask = false;
1627 } else {
1628 if (mask == kFourByteMask) need_mask = false;
1629 }
1630 }
1631
1632 if (fall_through_on_failure) {
1633 if (need_mask) {
1634 assembler->CheckCharacterAfterAnd(value, mask, on_possible_success);
1635 } else {
1636 assembler->CheckCharacter(value, on_possible_success);
1637 }
1638 } else {
1639 if (need_mask) {
1640 assembler->CheckNotCharacterAfterAnd(value, mask, trace->backtrack());
1641 } else {
1642 assembler->CheckNotCharacter(value, trace->backtrack());
1643 }
1644 }
1645 return true;
1646}
1647
1648// Here is the meat of GetQuickCheckDetails (see also the comment on the
1649// super-class in the .h file).
1650//
1651// We iterate along the text object, building up for each character a
1652// mask and value that can be used to test for a quick failure to match.
1653// The masks and values for the positions will be combined into a single
1654// machine word for the current character width in order to be used in
1655// generating a quick check.
1657 RegExpCompiler* compiler,
1658 int characters_filled_in,
1659 bool not_at_start) {
1660 // Do not collect any quick check details if the text node reads backward,
1661 // since it reads in the opposite direction than we use for quick checks.
1662 if (read_backward()) return;
1663 Isolate* isolate = compiler->macro_assembler()->isolate();
1664 DCHECK(characters_filled_in < details->characters());
1665 int characters = details->characters();
1666 const uint32_t char_mask = CharMask(compiler->one_byte());
1667 for (int k = 0; k < elements()->length(); k++) {
1668 TextElement elm = elements()->at(k);
1669 if (elm.text_type() == TextElement::ATOM) {
1670 base::Vector<const base::uc16> quarks = elm.atom()->data();
1671 for (int i = 0; i < characters && i < quarks.length(); i++) {
1673 details->positions(characters_filled_in);
1674 base::uc16 c = quarks[i];
1675 if (IsIgnoreCase(compiler->flags())) {
1676 unibrow::uchar chars[4];
1677 int length =
1678 GetCaseIndependentLetters(isolate, c, compiler, chars, 4);
1679 if (length == 0) {
1680 // This can happen because all case variants are non-Latin1, but we
1681 // know the input is Latin1.
1682 details->set_cannot_match();
1683 pos->determines_perfectly = false;
1684 return;
1685 }
1686 if (length == 1) {
1687 // This letter has no case equivalents, so it's nice and simple
1688 // and the mask-compare will determine definitely whether we have
1689 // a match at this character position.
1690 pos->mask = char_mask;
1691 pos->value = chars[0];
1692 pos->determines_perfectly = true;
1693 } else {
1694 uint32_t common_bits = char_mask;
1695 uint32_t bits = chars[0];
1696 for (int j = 1; j < length; j++) {
1697 uint32_t differing_bits = ((chars[j] & common_bits) ^ bits);
1698 common_bits ^= differing_bits;
1699 bits &= common_bits;
1700 }
1701 // If length is 2 and common bits has only one zero in it then
1702 // our mask and compare instruction will determine definitely
1703 // whether we have a match at this character position. Otherwise
1704 // it can only be an approximate check.
1705 uint32_t one_zero = (common_bits | ~char_mask);
1706 if (length == 2 && ((~one_zero) & ((~one_zero) - 1)) == 0) {
1707 pos->determines_perfectly = true;
1708 }
1709 pos->mask = common_bits;
1710 pos->value = bits;
1711 }
1712 } else {
1713 // Don't ignore case. Nice simple case where the mask-compare will
1714 // determine definitely whether we have a match at this character
1715 // position.
1716 if (c > char_mask) {
1717 details->set_cannot_match();
1718 pos->determines_perfectly = false;
1719 return;
1720 }
1721 pos->mask = char_mask;
1722 pos->value = c;
1723 pos->determines_perfectly = true;
1724 }
1725 characters_filled_in++;
1726 DCHECK(characters_filled_in <= details->characters());
1727 if (characters_filled_in == details->characters()) {
1728 return;
1729 }
1730 }
1731 } else {
1733 details->positions(characters_filled_in);
1734 RegExpClassRanges* tree = elm.class_ranges();
1735 ZoneList<CharacterRange>* ranges = tree->ranges(zone());
1736 if (tree->is_negated() || ranges->is_empty()) {
1737 // A quick check uses multi-character mask and compare. There is no
1738 // useful way to incorporate a negative char class into this scheme
1739 // so we just conservatively create a mask and value that will always
1740 // succeed.
1741 // Likewise for empty ranges (empty ranges can occur e.g. when
1742 // compiling for one-byte subjects and impossible (non-one-byte) ranges
1743 // have been removed).
1744 pos->mask = 0;
1745 pos->value = 0;
1746 } else {
1747 int first_range = 0;
1748 while (ranges->at(first_range).from() > char_mask) {
1749 first_range++;
1750 if (first_range == ranges->length()) {
1751 details->set_cannot_match();
1752 pos->determines_perfectly = false;
1753 return;
1754 }
1755 }
1756 CharacterRange range = ranges->at(first_range);
1757 const base::uc32 first_from = range.from();
1758 const base::uc32 first_to =
1759 (range.to() > char_mask) ? char_mask : range.to();
1760 const uint32_t differing_bits = (first_from ^ first_to);
1761 // A mask and compare is only perfect if the differing bits form a
1762 // number like 00011111 with one single block of trailing 1s.
1763 if ((differing_bits & (differing_bits + 1)) == 0 &&
1764 first_from + differing_bits == first_to) {
1765 pos->determines_perfectly = true;
1766 }
1767 uint32_t common_bits = ~SmearBitsRight(differing_bits);
1768 uint32_t bits = (first_from & common_bits);
1769 for (int i = first_range + 1; i < ranges->length(); i++) {
1770 range = ranges->at(i);
1771 const base::uc32 from = range.from();
1772 if (from > char_mask) continue;
1773 const base::uc32 to =
1774 (range.to() > char_mask) ? char_mask : range.to();
1775 // Here we are combining more ranges into the mask and compare
1776 // value. With each new range the mask becomes more sparse and
1777 // so the chances of a false positive rise. A character class
1778 // with multiple ranges is assumed never to be equivalent to a
1779 // mask and compare operation.
1780 pos->determines_perfectly = false;
1781 uint32_t new_common_bits = (from ^ to);
1782 new_common_bits = ~SmearBitsRight(new_common_bits);
1783 common_bits &= new_common_bits;
1784 bits &= new_common_bits;
1785 uint32_t new_differing_bits = (from & common_bits) ^ bits;
1786 common_bits ^= new_differing_bits;
1787 bits &= common_bits;
1788 }
1789 pos->mask = common_bits;
1790 pos->value = bits;
1791 }
1792 characters_filled_in++;
1793 DCHECK(characters_filled_in <= details->characters());
1794 if (characters_filled_in == details->characters()) return;
1795 }
1796 }
1797 DCHECK(characters_filled_in != details->characters());
1798 if (!details->cannot_match()) {
1799 on_success()->GetQuickCheckDetails(details, compiler, characters_filled_in,
1800 true);
1801 }
1802}
1803
1805 for (int i = 0; i < characters_; i++) {
1806 positions_[i].mask = 0;
1807 positions_[i].value = 0;
1808 positions_[i].determines_perfectly = false;
1809 }
1810 characters_ = 0;
1811}
1812
1813void QuickCheckDetails::Advance(int by, bool one_byte) {
1814 if (by >= characters_ || by < 0) {
1815 DCHECK_IMPLIES(by < 0, characters_ == 0);
1816 Clear();
1817 return;
1818 }
1819 DCHECK_LE(characters_ - by, 4);
1821 for (int i = 0; i < characters_ - by; i++) {
1822 positions_[i] = positions_[by + i];
1823 }
1824 for (int i = characters_ - by; i < characters_; i++) {
1825 positions_[i].mask = 0;
1826 positions_[i].value = 0;
1827 positions_[i].determines_perfectly = false;
1828 }
1829 characters_ -= by;
1830 // We could change mask_ and value_ here but we would never advance unless
1831 // they had already been used in a check and they won't be used again because
1832 // it would gain us nothing. So there's no point.
1833}
1834
1835void QuickCheckDetails::Merge(QuickCheckDetails* other, int from_index) {
1836 DCHECK(characters_ == other->characters_);
1837 if (other->cannot_match_) {
1838 return;
1839 }
1840 if (cannot_match_) {
1841 *this = *other;
1842 return;
1843 }
1844 for (int i = from_index; i < characters_; i++) {
1845 QuickCheckDetails::Position* pos = positions(i);
1846 QuickCheckDetails::Position* other_pos = other->positions(i);
1847 if (pos->mask != other_pos->mask || pos->value != other_pos->value ||
1848 !other_pos->determines_perfectly) {
1849 // Our mask-compare operation will be approximate unless we have the
1850 // exact same operation on both sides of the alternation.
1851 pos->determines_perfectly = false;
1852 }
1853 pos->mask &= other_pos->mask;
1854 pos->value &= pos->mask;
1855 other_pos->value &= pos->mask;
1856 uint32_t differing_bits = (pos->value ^ other_pos->value);
1857 pos->mask &= ~differing_bits;
1858 pos->value &= pos->mask;
1859 }
1860}
1861
1863 public:
1864 explicit VisitMarker(NodeInfo* info) : info_(info) {
1865 DCHECK(!info->visited);
1866 info->visited = true;
1867 }
1868 ~VisitMarker() { info_->visited = false; }
1869
1870 private:
1872};
1873
1874// Temporarily sets traversed_loop_initialization_node_.
1876 public:
1878 DCHECK(!node_->traversed_loop_initialization_node_);
1879 node_->traversed_loop_initialization_node_ = true;
1880 }
1882 DCHECK(node_->traversed_loop_initialization_node_);
1883 node_->traversed_loop_initialization_node_ = false;
1884 }
1887
1888 private:
1890};
1891
1892// Temporarily decrements min_loop_iterations_.
1894 public:
1896 DCHECK_GT(node_->min_loop_iterations_, 0);
1897 --node_->min_loop_iterations_;
1898 }
1899 ~IterationDecrementer() { ++node_->min_loop_iterations_; }
1902
1903 private:
1905};
1906
1908 if (info()->replacement_calculated) return replacement();
1909 if (depth < 0) return this;
1910 DCHECK(!info()->visited);
1911 VisitMarker marker(info());
1912 return FilterSuccessor(depth - 1, compiler);
1913}
1914
1916 RegExpCompiler* compiler) {
1917 RegExpNode* next = on_success_->FilterOneByte(depth - 1, compiler);
1918 if (next == nullptr) return set_replacement(nullptr);
1919 on_success_ = next;
1920 return set_replacement(this);
1921}
1922
1923// We need to check for the following characters: 0x39C 0x3BC 0x178.
1925 // TODO(dcarney): this could be a lot more efficient.
1926 return range.Contains(0x039C) || range.Contains(0x03BC) ||
1927 range.Contains(0x0178);
1928}
1929
1930namespace {
1931
1932bool RangesContainLatin1Equivalents(ZoneList<CharacterRange>* ranges) {
1933 for (int i = 0; i < ranges->length(); i++) {
1934 // TODO(dcarney): this could be a lot more efficient.
1935 if (RangeContainsLatin1Equivalents(ranges->at(i))) return true;
1936 }
1937 return false;
1938}
1939
1940} // namespace
1941
1943 RegExpFlags flags = compiler->flags();
1944 if (info()->replacement_calculated) return replacement();
1945 if (depth < 0) return this;
1946 DCHECK(!info()->visited);
1947 VisitMarker marker(info());
1948 int element_count = elements()->length();
1949 for (int i = 0; i < element_count; i++) {
1950 TextElement elm = elements()->at(i);
1951 if (elm.text_type() == TextElement::ATOM) {
1952 base::Vector<const base::uc16> quarks = elm.atom()->data();
1953 for (int j = 0; j < quarks.length(); j++) {
1954 base::uc16 c = quarks[j];
1955 if (!IsIgnoreCase(flags)) {
1956 if (c > String::kMaxOneByteCharCode) return set_replacement(nullptr);
1957 } else {
1958 unibrow::uchar chars[4];
1959 int length = GetCaseIndependentLetters(compiler->isolate(), c,
1960 compiler, chars, 4);
1961 if (length == 0 || chars[0] > String::kMaxOneByteCharCode) {
1962 return set_replacement(nullptr);
1963 }
1964 }
1965 }
1966 } else {
1967 // A character class can also be impossible to match in one-byte mode.
1969 RegExpClassRanges* cr = elm.class_ranges();
1970 ZoneList<CharacterRange>* ranges = cr->ranges(zone());
1972 // Now they are in order so we only need to look at the first.
1973 // If we are in non-Unicode case independent mode then we need
1974 // to be a bit careful here, because the character classes have
1975 // not been case-desugared yet, but there are characters and ranges
1976 // that can become Latin-1 when case is considered.
1977 int range_count = ranges->length();
1978 if (cr->is_negated()) {
1979 if (range_count != 0 && ranges->at(0).from() == 0 &&
1980 ranges->at(0).to() >= String::kMaxOneByteCharCode) {
1981 bool case_complications = !IsEitherUnicode(flags) &&
1982 IsIgnoreCase(flags) &&
1983 RangesContainLatin1Equivalents(ranges);
1984 if (!case_complications) {
1985 return set_replacement(nullptr);
1986 }
1987 }
1988 } else {
1989 if (range_count == 0 ||
1990 ranges->at(0).from() > String::kMaxOneByteCharCode) {
1991 bool case_complications = !IsEitherUnicode(flags) &&
1992 IsIgnoreCase(flags) &&
1993 RangesContainLatin1Equivalents(ranges);
1994 if (!case_complications) {
1995 return set_replacement(nullptr);
1996 }
1997 }
1998 }
1999 }
2000 }
2001 return FilterSuccessor(depth - 1, compiler);
2002}
2003
2005 if (info()->replacement_calculated) return replacement();
2006 if (depth < 0) return this;
2007 if (info()->visited) return this;
2008 {
2009 VisitMarker marker(info());
2010
2011 RegExpNode* continue_replacement =
2012 continue_node_->FilterOneByte(depth - 1, compiler);
2013 // If we can't continue after the loop then there is no sense in doing the
2014 // loop.
2015 if (continue_replacement == nullptr) return set_replacement(nullptr);
2016 }
2017
2018 return ChoiceNode::FilterOneByte(depth - 1, compiler);
2019}
2020
2022 if (info()->replacement_calculated) return replacement();
2023 if (depth < 0) return this;
2024 if (info()->visited) return this;
2025 VisitMarker marker(info());
2026 int choice_count = alternatives_->length();
2027
2028 for (int i = 0; i < choice_count; i++) {
2029 GuardedAlternative alternative = alternatives_->at(i);
2030 if (alternative.guards() != nullptr &&
2031 alternative.guards()->length() != 0) {
2032 set_replacement(this);
2033 return this;
2034 }
2035 }
2036
2037 int surviving = 0;
2038 RegExpNode* survivor = nullptr;
2039 for (int i = 0; i < choice_count; i++) {
2040 GuardedAlternative alternative = alternatives_->at(i);
2041 RegExpNode* replacement =
2042 alternative.node()->FilterOneByte(depth - 1, compiler);
2043 DCHECK(replacement != this); // No missing EMPTY_MATCH_CHECK.
2044 if (replacement != nullptr) {
2045 alternatives_->at(i).set_node(replacement);
2046 surviving++;
2047 survivor = replacement;
2048 }
2049 }
2050 if (surviving < 2) return set_replacement(survivor);
2051
2052 set_replacement(this);
2053 if (surviving == choice_count) {
2054 return this;
2055 }
2056 // Only some of the nodes survived the filtering. We need to rebuild the
2057 // alternatives list.
2058 ZoneList<GuardedAlternative>* new_alternatives =
2059 zone()->New<ZoneList<GuardedAlternative>>(surviving, zone());
2060 for (int i = 0; i < choice_count; i++) {
2061 RegExpNode* replacement =
2062 alternatives_->at(i).node()->FilterOneByte(depth - 1, compiler);
2063 if (replacement != nullptr) {
2064 alternatives_->at(i).set_node(replacement);
2065 new_alternatives->Add(alternatives_->at(i), zone());
2066 }
2067 }
2068 alternatives_ = new_alternatives;
2069 return this;
2070}
2071
2073 int depth, RegExpCompiler* compiler) {
2074 if (info()->replacement_calculated) return replacement();
2075 if (depth < 0) return this;
2076 if (info()->visited) return this;
2077 VisitMarker marker(info());
2078 // Alternative 0 is the negative lookahead, alternative 1 is what comes
2079 // afterwards.
2080 RegExpNode* node = continue_node();
2081 RegExpNode* replacement = node->FilterOneByte(depth - 1, compiler);
2082 if (replacement == nullptr) return set_replacement(nullptr);
2083 alternatives_->at(kContinueIndex).set_node(replacement);
2084
2085 RegExpNode* neg_node = lookaround_node();
2086 RegExpNode* neg_replacement = neg_node->FilterOneByte(depth - 1, compiler);
2087 // If the negative lookahead is always going to fail then
2088 // we don't need to check it.
2089 if (neg_replacement == nullptr) return set_replacement(replacement);
2090 alternatives_->at(kLookaroundIndex).set_node(neg_replacement);
2091 return set_replacement(this);
2092}
2093
2095 RegExpCompiler* compiler,
2096 int characters_filled_in,
2097 bool not_at_start) {
2098 if (body_can_be_zero_length_ || info()->visited) return;
2099 not_at_start = not_at_start || this->not_at_start();
2100 DCHECK_EQ(alternatives_->length(), 2); // There's just loop and continue.
2101 if (traversed_loop_initialization_node_ && min_loop_iterations_ > 0 &&
2102 loop_node_->EatsAtLeast(not_at_start) >
2103 continue_node_->EatsAtLeast(true)) {
2104 // Loop body is guaranteed to execute at least once, and consume characters
2105 // when it does, meaning the only possible quick checks from this point
2106 // begin with the loop body. We may recursively visit this LoopChoiceNode,
2107 // but we temporarily decrease its minimum iteration counter so we know when
2108 // to check the continue case.
2109 IterationDecrementer next_iteration(this);
2110 loop_node_->GetQuickCheckDetails(details, compiler, characters_filled_in,
2111 not_at_start);
2112 } else {
2113 // Might not consume anything in the loop body, so treat it like a normal
2114 // ChoiceNode (and don't recursively visit this node again).
2115 VisitMarker marker(info());
2116 ChoiceNode::GetQuickCheckDetails(details, compiler, characters_filled_in,
2117 not_at_start);
2118 }
2119}
2120
2122 QuickCheckDetails* details, RegExpCompiler* compiler,
2123 int characters_filled_in, bool not_at_start) {
2124 if (traversed_loop_initialization_node_) {
2125 // We already entered this loop once, exited via its continuation node, and
2126 // followed an outer loop's back-edge to before the loop entry point. We
2127 // could try to reset the minimum iteration count to its starting value at
2128 // this point, but that seems like more trouble than it's worth. It's safe
2129 // to keep going with the current (possibly reduced) minimum iteration
2130 // count.
2131 GetQuickCheckDetails(details, compiler, characters_filled_in, not_at_start);
2132 } else {
2133 // We are entering a loop via its counter initialization action, meaning we
2134 // are guaranteed to run the loop body at least some minimum number of times
2135 // before running the continuation node. Set a flag so that this node knows
2136 // (now and any times we visit it again recursively) that it was entered
2137 // from the top.
2138 LoopInitializationMarker marker(this);
2139 GetQuickCheckDetails(details, compiler, characters_filled_in, not_at_start);
2140 }
2141}
2142
2143void LoopChoiceNode::FillInBMInfo(Isolate* isolate, int offset, int budget,
2144 BoyerMooreLookahead* bm, bool not_at_start) {
2145 if (body_can_be_zero_length_ || budget <= 0) {
2146 bm->SetRest(offset);
2147 SaveBMInfo(bm, not_at_start, offset);
2148 return;
2149 }
2150 ChoiceNode::FillInBMInfo(isolate, offset, budget - 1, bm, not_at_start);
2151 SaveBMInfo(bm, not_at_start, offset);
2152}
2153
2155 RegExpCompiler* compiler,
2156 int characters_filled_in,
2157 bool not_at_start) {
2158 not_at_start = (not_at_start || not_at_start_);
2159 int choice_count = alternatives_->length();
2160 DCHECK_LT(0, choice_count);
2161 alternatives_->at(0).node()->GetQuickCheckDetails(
2162 details, compiler, characters_filled_in, not_at_start);
2163 for (int i = 1; i < choice_count; i++) {
2164 QuickCheckDetails new_details(details->characters());
2165 RegExpNode* node = alternatives_->at(i).node();
2166 node->GetQuickCheckDetails(&new_details, compiler, characters_filled_in,
2167 not_at_start);
2168 // Here we merge the quick match details of the two branches.
2169 details->Merge(&new_details, characters_filled_in);
2170 }
2171}
2172
2173namespace {
2174
2175// Check for [0-9A-Z_a-z].
2176void EmitWordCheck(RegExpMacroAssembler* assembler, Label* word,
2177 Label* non_word, bool fall_through_on_word) {
2178 if (assembler->CheckSpecialClassRanges(
2179 fall_through_on_word ? StandardCharacterSet::kWord
2181 fall_through_on_word ? non_word : word)) {
2182 // Optimized implementation available.
2183 return;
2184 }
2185 assembler->CheckCharacterGT('z', non_word);
2186 assembler->CheckCharacterLT('0', non_word);
2187 assembler->CheckCharacterGT('a' - 1, word);
2188 assembler->CheckCharacterLT('9' + 1, word);
2189 assembler->CheckCharacterLT('A', non_word);
2190 assembler->CheckCharacterLT('Z' + 1, word);
2191 if (fall_through_on_word) {
2192 assembler->CheckNotCharacter('_', non_word);
2193 } else {
2194 assembler->CheckCharacter('_', word);
2195 }
2196}
2197
2198// Emit the code to check for a ^ in multiline mode (1-character lookbehind
2199// that matches newline or the start of input).
2200void EmitHat(RegExpCompiler* compiler, RegExpNode* on_success, Trace* trace) {
2201 RegExpMacroAssembler* assembler = compiler->macro_assembler();
2202
2203 // We will load the previous character into the current character register.
2204 Trace new_trace(*trace);
2205 new_trace.InvalidateCurrentCharacter();
2206
2207 // A positive (> 0) cp_offset means we've already successfully matched a
2208 // non-empty-width part of the pattern, and thus cannot be at or before the
2209 // start of the subject string. We can thus skip both at-start and
2210 // bounds-checks when loading the one-character lookbehind.
2211 const bool may_be_at_or_before_subject_string_start =
2212 new_trace.cp_offset() <= 0;
2213
2214 Label ok;
2215 if (may_be_at_or_before_subject_string_start) {
2216 // The start of input counts as a newline in this context, so skip to ok if
2217 // we are at the start.
2218 assembler->CheckAtStart(new_trace.cp_offset(), &ok);
2219 }
2220
2221 // If we've already checked that we are not at the start of input, it's okay
2222 // to load the previous character without bounds checks.
2223 const bool can_skip_bounds_check = !may_be_at_or_before_subject_string_start;
2224 assembler->LoadCurrentCharacter(new_trace.cp_offset() - 1,
2225 new_trace.backtrack(), can_skip_bounds_check);
2226 if (!assembler->CheckSpecialClassRanges(StandardCharacterSet::kLineTerminator,
2227 new_trace.backtrack())) {
2228 // Newline means \n, \r, 0x2028 or 0x2029.
2229 if (!compiler->one_byte()) {
2230 assembler->CheckCharacterAfterAnd(0x2028, 0xFFFE, &ok);
2231 }
2232 assembler->CheckCharacter('\n', &ok);
2233 assembler->CheckNotCharacter('\r', new_trace.backtrack());
2234 }
2235 assembler->Bind(&ok);
2236 on_success->Emit(compiler, &new_trace);
2237}
2238
2239} // namespace
2240
2241// Emit the code to handle \b and \B (word-boundary or non-word-boundary).
2243 RegExpMacroAssembler* assembler = compiler->macro_assembler();
2244 Isolate* isolate = assembler->isolate();
2245 Trace::TriBool next_is_word_character = Trace::UNKNOWN;
2246 bool not_at_start = (trace->at_start() == Trace::FALSE_VALUE);
2247 BoyerMooreLookahead* lookahead = bm_info(not_at_start);
2248 if (lookahead == nullptr) {
2249 int eats_at_least =
2250 std::min(kMaxLookaheadForBoyerMoore, EatsAtLeast(not_at_start));
2251 if (eats_at_least >= 1) {
2253 zone()->New<BoyerMooreLookahead>(eats_at_least, compiler, zone());
2254 FillInBMInfo(isolate, 0, kRecursionBudget, bm, not_at_start);
2255 if (bm->at(0)->is_non_word()) next_is_word_character = Trace::FALSE_VALUE;
2256 if (bm->at(0)->is_word()) next_is_word_character = Trace::TRUE_VALUE;
2257 }
2258 } else {
2259 if (lookahead->at(0)->is_non_word())
2260 next_is_word_character = Trace::FALSE_VALUE;
2261 if (lookahead->at(0)->is_word()) next_is_word_character = Trace::TRUE_VALUE;
2262 }
2263 bool at_boundary = (assertion_type_ == AssertionNode::AT_BOUNDARY);
2264 if (next_is_word_character == Trace::UNKNOWN) {
2265 Label before_non_word;
2266 Label before_word;
2267 if (trace->characters_preloaded() != 1) {
2268 assembler->LoadCurrentCharacter(trace->cp_offset(), &before_non_word);
2269 }
2270 // Fall through on non-word.
2271 EmitWordCheck(assembler, &before_word, &before_non_word, false);
2272 // Next character is not a word character.
2273 assembler->Bind(&before_non_word);
2274 Label ok;
2275 BacktrackIfPrevious(compiler, trace, at_boundary ? kIsNonWord : kIsWord);
2276 assembler->GoTo(&ok);
2277
2278 assembler->Bind(&before_word);
2279 BacktrackIfPrevious(compiler, trace, at_boundary ? kIsWord : kIsNonWord);
2280 assembler->Bind(&ok);
2281 } else if (next_is_word_character == Trace::TRUE_VALUE) {
2282 BacktrackIfPrevious(compiler, trace, at_boundary ? kIsWord : kIsNonWord);
2283 } else {
2284 DCHECK(next_is_word_character == Trace::FALSE_VALUE);
2285 BacktrackIfPrevious(compiler, trace, at_boundary ? kIsNonWord : kIsWord);
2286 }
2287}
2288
2290 RegExpCompiler* compiler, Trace* trace,
2291 AssertionNode::IfPrevious backtrack_if_previous) {
2292 RegExpMacroAssembler* assembler = compiler->macro_assembler();
2293 Trace new_trace(*trace);
2294 new_trace.InvalidateCurrentCharacter();
2295
2296 Label fall_through;
2297 Label* non_word = backtrack_if_previous == kIsNonWord ? new_trace.backtrack()
2298 : &fall_through;
2299 Label* word = backtrack_if_previous == kIsNonWord ? &fall_through
2300 : new_trace.backtrack();
2301
2302 // A positive (> 0) cp_offset means we've already successfully matched a
2303 // non-empty-width part of the pattern, and thus cannot be at or before the
2304 // start of the subject string. We can thus skip both at-start and
2305 // bounds-checks when loading the one-character lookbehind.
2306 const bool may_be_at_or_before_subject_string_start =
2307 new_trace.cp_offset() <= 0;
2308
2309 if (may_be_at_or_before_subject_string_start) {
2310 // The start of input counts as a non-word character, so the question is
2311 // decided if we are at the start.
2312 assembler->CheckAtStart(new_trace.cp_offset(), non_word);
2313 }
2314
2315 // If we've already checked that we are not at the start of input, it's okay
2316 // to load the previous character without bounds checks.
2317 const bool can_skip_bounds_check = !may_be_at_or_before_subject_string_start;
2318 assembler->LoadCurrentCharacter(new_trace.cp_offset() - 1, non_word,
2319 can_skip_bounds_check);
2320 EmitWordCheck(assembler, word, non_word, backtrack_if_previous == kIsNonWord);
2321
2322 assembler->Bind(&fall_through);
2323 on_success()->Emit(compiler, &new_trace);
2324}
2325
2327 RegExpCompiler* compiler,
2328 int filled_in, bool not_at_start) {
2329 if (assertion_type_ == AT_START && not_at_start) {
2330 details->set_cannot_match();
2331 return;
2332 }
2333 return on_success()->GetQuickCheckDetails(details, compiler, filled_in,
2334 not_at_start);
2335}
2336
2338 RegExpMacroAssembler* assembler = compiler->macro_assembler();
2339 switch (assertion_type_) {
2340 case AT_END: {
2341 Label ok;
2342 assembler->CheckPosition(trace->cp_offset(), &ok);
2343 assembler->GoTo(trace->backtrack());
2344 assembler->Bind(&ok);
2345 break;
2346 }
2347 case AT_START: {
2348 if (trace->at_start() == Trace::FALSE_VALUE) {
2349 assembler->GoTo(trace->backtrack());
2350 return;
2351 }
2352 if (trace->at_start() == Trace::UNKNOWN) {
2353 assembler->CheckNotAtStart(trace->cp_offset(), trace->backtrack());
2354 Trace at_start_trace = *trace;
2355 at_start_trace.set_at_start(Trace::TRUE_VALUE);
2356 on_success()->Emit(compiler, &at_start_trace);
2357 return;
2358 }
2359 } break;
2360 case AFTER_NEWLINE:
2361 EmitHat(compiler, on_success(), trace);
2362 return;
2363 case AT_BOUNDARY:
2364 case AT_NON_BOUNDARY: {
2365 EmitBoundaryCheck(compiler, trace);
2366 return;
2367 }
2368 }
2369 on_success()->Emit(compiler, trace);
2370}
2371
2372namespace {
2373
2374bool DeterminedAlready(QuickCheckDetails* quick_check, int offset) {
2375 if (quick_check == nullptr) return false;
2376 if (offset >= quick_check->characters()) return false;
2377 return quick_check->positions(offset)->determines_perfectly;
2378}
2379
2380void UpdateBoundsCheck(int index, int* checked_up_to) {
2381 if (index > *checked_up_to) {
2382 *checked_up_to = index;
2383 }
2384}
2385
2386} // namespace
2387
2388// We call this repeatedly to generate code for each pass over the text node.
2389// The passes are in increasing order of difficulty because we hope one
2390// of the first passes will fail in which case we are saved the work of the
2391// later passes. for example for the case independent regexp /%[asdfghjkl]a/
2392// we will check the '%' in the first pass, the case independent 'a' in the
2393// second pass and the character class in the last pass.
2394//
2395// The passes are done from right to left, so for example to test for /bar/
2396// we will first test for an 'r' with offset 2, then an 'a' with offset 1
2397// and then a 'b' with offset 0. This means we can avoid the end-of-input
2398// bounds check most of the time. In the example we only need to check for
2399// end-of-input when loading the putative 'r'.
2400//
2401// A slight complication involves the fact that the first character may already
2402// be fetched into a register by the previous node. In this case we want to
2403// do the test for that character first. We do this in separate passes. The
2404// 'preloaded' argument indicates that we are doing such a 'pass'. If such a
2405// pass has been performed then subsequent passes will have true in
2406// first_element_checked to indicate that that character does not need to be
2407// checked again.
2408//
2409// In addition to all this we are passed a Trace, which can
2410// contain an AlternativeGeneration object. In this AlternativeGeneration
2411// object we can see details of any quick check that was already passed in
2412// order to get to the code we are now generating. The quick check can involve
2413// loading characters, which means we do not need to recheck the bounds
2414// up to the limit the quick check already checked. In addition the quick
2415// check can have involved a mask and compare operation which may simplify
2416// or obviate the need for further checks at some character positions.
2418 bool preloaded, Trace* trace,
2419 bool first_element_checked, int* checked_up_to) {
2420 RegExpMacroAssembler* assembler = compiler->macro_assembler();
2421 Isolate* isolate = assembler->isolate();
2422 bool one_byte = compiler->one_byte();
2423 Label* backtrack = trace->backtrack();
2424 QuickCheckDetails* quick_check = trace->quick_check_performed();
2425 int element_count = elements()->length();
2426 int backward_offset = read_backward() ? -Length() : 0;
2427 for (int i = preloaded ? 0 : element_count - 1; i >= 0; i--) {
2428 TextElement elm = elements()->at(i);
2429 int cp_offset = trace->cp_offset() + elm.cp_offset() + backward_offset;
2430 if (elm.text_type() == TextElement::ATOM) {
2431 base::Vector<const base::uc16> quarks = elm.atom()->data();
2432 for (int j = preloaded ? 0 : quarks.length() - 1; j >= 0; j--) {
2433 if (first_element_checked && i == 0 && j == 0) continue;
2434 if (DeterminedAlready(quick_check, elm.cp_offset() + j)) continue;
2435 base::uc16 quark = quarks[j];
2436 bool needs_bounds_check =
2437 *checked_up_to < cp_offset + j || read_backward();
2438 bool bounds_checked = false;
2439 switch (pass) {
2440 case NON_LATIN1_MATCH: {
2441 DCHECK(one_byte); // This pass is only done in one-byte mode.
2442 if (IsIgnoreCase(compiler->flags())) {
2443 // We are compiling for a one-byte subject, case independent mode.
2444 // We have to check whether any of the case alternatives are in
2445 // the one-byte range.
2446 unibrow::uchar chars[4];
2447 // Only returns characters that are in the one-byte range.
2448 int length =
2449 GetCaseIndependentLetters(isolate, quark, compiler, chars, 4);
2450 if (length == 0) {
2451 assembler->GoTo(backtrack);
2452 return;
2453 }
2454 } else {
2455 // Case-dependent mode.
2456 if (quark > String::kMaxOneByteCharCode) {
2457 assembler->GoTo(backtrack);
2458 return;
2459 }
2460 }
2461 break;
2462 }
2463 case NON_LETTER_CHARACTER_MATCH:
2464 bounds_checked =
2465 EmitAtomNonLetter(isolate, compiler, quark, backtrack,
2466 cp_offset + j, needs_bounds_check, preloaded);
2467 break;
2468 case SIMPLE_CHARACTER_MATCH:
2469 bounds_checked = EmitSimpleCharacter(isolate, compiler, quark,
2470 backtrack, cp_offset + j,
2471 needs_bounds_check, preloaded);
2472 break;
2473 case CASE_CHARACTER_MATCH:
2474 bounds_checked =
2475 EmitAtomLetter(isolate, compiler, quark, backtrack,
2476 cp_offset + j, needs_bounds_check, preloaded);
2477 break;
2478 default:
2479 break;
2480 }
2481 if (bounds_checked) UpdateBoundsCheck(cp_offset + j, checked_up_to);
2482 }
2483 } else {
2485 if (pass == CHARACTER_CLASS_MATCH) {
2486 if (first_element_checked && i == 0) continue;
2487 if (DeterminedAlready(quick_check, elm.cp_offset())) continue;
2488 RegExpClassRanges* cr = elm.class_ranges();
2489 bool bounds_check = *checked_up_to < cp_offset || read_backward();
2490 EmitClassRanges(assembler, cr, one_byte, backtrack, cp_offset,
2491 bounds_check, preloaded, zone());
2492 UpdateBoundsCheck(cp_offset, checked_up_to);
2493 }
2494 }
2495 }
2496}
2497
2499 TextElement elm = elements()->last();
2500 DCHECK_LE(0, elm.cp_offset());
2501 return elm.cp_offset() + elm.length();
2502}
2503
2506 bool read_backward,
2507 RegExpNode* on_success) {
2508 DCHECK_NOT_NULL(ranges);
2509 // TODO(jgruber): There's no fundamental need to create this
2510 // RegExpClassRanges; we could refactor to avoid the allocation.
2511 return zone->New<TextNode>(zone->New<RegExpClassRanges>(zone, ranges),
2512 read_backward, on_success);
2513}
2514
2516 Zone* zone, CharacterRange lead, ZoneList<CharacterRange>* trail_ranges,
2517 bool read_backward, RegExpNode* on_success) {
2518 ZoneList<TextElement>* elms = zone->New<ZoneList<TextElement>>(2, zone);
2519 if (lead.from() == lead.to()) {
2520 ZoneList<base::uc16> lead_surrogate(1, zone);
2521 lead_surrogate.Add(lead.from(), zone);
2522 RegExpAtom* atom = zone->New<RegExpAtom>(lead_surrogate.ToConstVector());
2523 elms->Add(TextElement::Atom(atom), zone);
2524 } else {
2525 ZoneList<CharacterRange>* lead_ranges = CharacterRange::List(zone, lead);
2527 zone->New<RegExpClassRanges>(zone, lead_ranges)),
2528 zone);
2529 }
2531 zone->New<RegExpClassRanges>(zone, trail_ranges)),
2532 zone);
2533 return zone->New<TextNode>(elms, read_backward, on_success);
2534}
2535
2537 Zone* zone, ZoneList<CharacterRange>* lead_ranges, CharacterRange trail,
2538 bool read_backward, RegExpNode* on_success) {
2539 ZoneList<CharacterRange>* trail_ranges = CharacterRange::List(zone, trail);
2540 ZoneList<TextElement>* elms = zone->New<ZoneList<TextElement>>(2, zone);
2541 elms->Add(
2542 TextElement::ClassRanges(zone->New<RegExpClassRanges>(zone, lead_ranges)),
2543 zone);
2545 zone->New<RegExpClassRanges>(zone, trail_ranges)),
2546 zone);
2547 return zone->New<TextNode>(elms, read_backward, on_success);
2548}
2549
2550// This generates the code to match a text node. A text node can contain
2551// straight character sequences (possibly to be matched in a case-independent
2552// way) and character classes. For efficiency we do not do this in a single
2553// pass from left to right. Instead we pass over the text node several times,
2554// emitting code for some character positions every time. See the comment on
2555// TextEmitPass for details.
2556void TextNode::Emit(RegExpCompiler* compiler, Trace* trace) {
2557 LimitResult limit_result = LimitVersions(compiler, trace);
2558 if (limit_result == DONE) return;
2559 DCHECK(limit_result == CONTINUE);
2560
2561 if (trace->cp_offset() + Length() > RegExpMacroAssembler::kMaxCPOffset) {
2562 compiler->SetRegExpTooBig();
2563 return;
2564 }
2565
2566 if (compiler->one_byte()) {
2567 int dummy = 0;
2568 TextEmitPass(compiler, NON_LATIN1_MATCH, false, trace, false, &dummy);
2569 }
2570
2571 bool first_elt_done = false;
2572 int bound_checked_to = trace->cp_offset() - 1;
2573 bound_checked_to += trace->bound_checked_up_to();
2574
2575 // If a character is preloaded into the current character register then
2576 // check that first to save reloading it.
2577 for (int twice = 0; twice < 2; twice++) {
2578 bool is_preloaded_pass = twice == 0;
2579 if (is_preloaded_pass && trace->characters_preloaded() != 1) continue;
2580 if (IsIgnoreCase(compiler->flags())) {
2581 TextEmitPass(compiler, NON_LETTER_CHARACTER_MATCH, is_preloaded_pass,
2582 trace, first_elt_done, &bound_checked_to);
2583 TextEmitPass(compiler, CASE_CHARACTER_MATCH, is_preloaded_pass, trace,
2584 first_elt_done, &bound_checked_to);
2585 } else {
2586 TextEmitPass(compiler, SIMPLE_CHARACTER_MATCH, is_preloaded_pass, trace,
2587 first_elt_done, &bound_checked_to);
2588 }
2589 TextEmitPass(compiler, CHARACTER_CLASS_MATCH, is_preloaded_pass, trace,
2590 first_elt_done, &bound_checked_to);
2591 first_elt_done = true;
2592 }
2593
2594 Trace successor_trace(*trace);
2595 // If we advance backward, we may end up at the start.
2596 successor_trace.AdvanceCurrentPositionInTrace(
2597 read_backward() ? -Length() : Length(), compiler);
2598 successor_trace.set_at_start(read_backward() ? Trace::UNKNOWN
2600 RecursionCheck rc(compiler);
2601 on_success()->Emit(compiler, &successor_trace);
2602}
2603
2605
2607 // We don't have an instruction for shifting the current character register
2608 // down or for using a shifted value for anything so lets just forget that
2609 // we preloaded any characters into it.
2611 // Adjust the offsets of the quick check performed information. This
2612 // information is used to find out what we already determined about the
2613 // characters by means of mask and compare.
2614 quick_check_performed_.Advance(by, compiler->one_byte());
2615 cp_offset_ += by;
2617 compiler->SetRegExpTooBig();
2618 cp_offset_ = 0;
2619 }
2620 bound_checked_up_to_ = std::max(0, bound_checked_up_to_ - by);
2621}
2622
2623void TextNode::MakeCaseIndependent(Isolate* isolate, bool is_one_byte,
2624 RegExpFlags flags) {
2625 if (!IsIgnoreCase(flags)) return;
2626#ifdef V8_INTL_SUPPORT
2627 // This is done in an earlier step when generating the nodes from the AST
2628 // because we may have to split up into separate nodes.
2629 if (NeedsUnicodeCaseEquivalents(flags)) return;
2630#endif
2631
2632 int element_count = elements()->length();
2633 for (int i = 0; i < element_count; i++) {
2634 TextElement elm = elements()->at(i);
2635 if (elm.text_type() == TextElement::CLASS_RANGES) {
2636 RegExpClassRanges* cr = elm.class_ranges();
2637 // None of the standard character classes is different in the case
2638 // independent case and it slows us down if we don't know that.
2639 if (cr->is_standard(zone())) continue;
2640 ZoneList<CharacterRange>* ranges = cr->ranges(zone());
2641 CharacterRange::AddCaseEquivalents(isolate, zone(), ranges, is_one_byte);
2642 }
2643 }
2644}
2645
2646int TextNode::GreedyLoopTextLength() { return Length(); }
2647
2649 RegExpCompiler* compiler) {
2650 if (read_backward()) return nullptr;
2651 if (elements()->length() != 1) return nullptr;
2652 TextElement elm = elements()->at(0);
2653 if (elm.text_type() != TextElement::CLASS_RANGES) return nullptr;
2654 RegExpClassRanges* node = elm.class_ranges();
2655 ZoneList<CharacterRange>* ranges = node->ranges(zone());
2657 if (node->is_negated()) {
2658 return ranges->length() == 0 ? on_success() : nullptr;
2659 }
2660 if (ranges->length() != 1) return nullptr;
2661 const base::uc32 max_char = MaxCodeUnit(compiler->one_byte());
2662 return ranges->at(0).IsEverything(max_char) ? on_success() : nullptr;
2663}
2664
2665// Finds the fixed match length of a sequence of nodes that goes from
2666// this alternative and back to this choice node. If there are variable
2667// length nodes or other complications in the way then return a sentinel
2668// value indicating that a greedy loop cannot be constructed.
2670 GuardedAlternative* alternative) {
2671 int length = 0;
2672 RegExpNode* node = alternative->node();
2673 // Later we will generate code for all these text nodes using recursion
2674 // so we have to limit the max number.
2675 int recursion_depth = 0;
2676 while (node != this) {
2678 return kNodeIsTooComplexForGreedyLoops;
2679 }
2680 int node_length = node->GreedyLoopTextLength();
2681 if (node_length == kNodeIsTooComplexForGreedyLoops) {
2682 return kNodeIsTooComplexForGreedyLoops;
2683 }
2684 length += node_length;
2685 node = node->AsSeqRegExpNode()->on_success();
2686 }
2687 if (read_backward()) {
2688 length = -length;
2689 }
2690 // Check that we can jump by the whole text length. If not, return sentinel
2691 // to indicate the we can't construct a greedy loop.
2694 return kNodeIsTooComplexForGreedyLoops;
2695 }
2696 return length;
2697}
2698
2700 DCHECK_NULL(loop_node_);
2701 AddAlternative(alt);
2702 loop_node_ = alt.node();
2703}
2704
2706 DCHECK_NULL(continue_node_);
2707 AddAlternative(alt);
2708 continue_node_ = alt.node();
2709}
2710
2712 RegExpMacroAssembler* macro_assembler = compiler->macro_assembler();
2713 if (trace->stop_node() == this) {
2714 // Back edge of greedy optimized loop node graph.
2715 int text_length =
2716 GreedyLoopTextLengthForAlternative(&(alternatives_->at(0)));
2717 DCHECK_NE(kNodeIsTooComplexForGreedyLoops, text_length);
2718 // Update the counter-based backtracking info on the stack. This is an
2719 // optimization for greedy loops (see below).
2720 DCHECK(trace->cp_offset() == text_length);
2721 macro_assembler->AdvanceCurrentPosition(text_length);
2722 macro_assembler->GoTo(trace->loop_label());
2723 return;
2724 }
2725 DCHECK_NULL(trace->stop_node());
2726 if (!trace->is_trivial()) {
2727 trace->Flush(compiler, this);
2728 return;
2729 }
2730 ChoiceNode::Emit(compiler, trace);
2731}
2732
2734 int eats_at_least) {
2735 int preload_characters = std::min(4, eats_at_least);
2736 DCHECK_LE(preload_characters, 4);
2737 if (compiler->macro_assembler()->CanReadUnaligned()) {
2738 bool one_byte = compiler->one_byte();
2739 if (one_byte) {
2740 // We can't preload 3 characters because there is no machine instruction
2741 // to do that. We can't just load 4 because we could be reading
2742 // beyond the end of the string, which could cause a memory fault.
2743 if (preload_characters == 3) preload_characters = 2;
2744 } else {
2745 if (preload_characters > 2) preload_characters = 2;
2746 }
2747 } else {
2748 if (preload_characters > 1) preload_characters = 1;
2749 }
2750 return preload_characters;
2751}
2752
2753// This class is used when generating the alternatives in a choice node. It
2754// records the way the alternative is being code generated.
2756 public:
2758 : possible_success(),
2759 expects_preload(false),
2760 after(),
2761 quick_check_details() {}
2766};
2767
2768// Creates a list of AlternativeGenerations. If the list has a reasonable
2769// size then it is on the stack, otherwise the excess is on the heap.
2771 public:
2772 AlternativeGenerationList(int count, Zone* zone) : alt_gens_(count, zone) {
2773 for (int i = 0; i < count && i < kAFew; i++) {
2774 alt_gens_.Add(a_few_alt_gens_ + i, zone);
2775 }
2776 for (int i = kAFew; i < count; i++) {
2777 alt_gens_.Add(new AlternativeGeneration(), zone);
2778 }
2779 }
2781 for (int i = kAFew; i < alt_gens_.length(); i++) {
2782 delete alt_gens_[i];
2783 alt_gens_[i] = nullptr;
2784 }
2785 }
2786
2787 AlternativeGeneration* at(int i) { return alt_gens_[i]; }
2788
2789 private:
2790 static const int kAFew = 10;
2792 AlternativeGeneration a_few_alt_gens_[kAFew];
2793};
2794
2798
2799namespace {
2800
2801ContainedInLattice AddRange(ContainedInLattice containment, const int* ranges,
2802 int ranges_length, Interval new_range) {
2803 DCHECK_EQ(1, ranges_length & 1);
2804 DCHECK_EQ(String::kMaxCodePoint + 1, ranges[ranges_length - 1]);
2805 if (containment == kLatticeUnknown) return containment;
2806 bool inside = false;
2807 int last = 0;
2808 for (int i = 0; i < ranges_length; inside = !inside, last = ranges[i], i++) {
2809 // Consider the range from last to ranges[i].
2810 // We haven't got to the new range yet.
2811 if (ranges[i] <= new_range.from()) continue;
2812 // New range is wholly inside last-ranges[i]. Note that new_range.to() is
2813 // inclusive, but the values in ranges are not.
2814 if (last <= new_range.from() && new_range.to() < ranges[i]) {
2815 return Combine(containment, inside ? kLatticeIn : kLatticeOut);
2816 }
2817 return kLatticeUnknown;
2818 }
2819 return containment;
2820}
2821
2822int BitsetFirstSetBit(BoyerMoorePositionInfo::Bitset bitset) {
2823 static_assert(BoyerMoorePositionInfo::kMapSize ==
2824 2 * kInt64Size * kBitsPerByte);
2825
2826 // Slight fiddling is needed here, since the bitset is of length 128 while
2827 // CountTrailingZeros requires an integral type and std::bitset can only
2828 // convert to unsigned long long. So we handle the most- and least-significant
2829 // bits separately.
2830
2831 {
2832 static constexpr BoyerMoorePositionInfo::Bitset mask(~uint64_t{0});
2833 BoyerMoorePositionInfo::Bitset masked_bitset = bitset & mask;
2834 static_assert(kInt64Size >= sizeof(decltype(masked_bitset.to_ullong())));
2835 uint64_t lsb = masked_bitset.to_ullong();
2836 if (lsb != 0) return base::bits::CountTrailingZeros(lsb);
2837 }
2838
2839 {
2840 BoyerMoorePositionInfo::Bitset masked_bitset = bitset >> 64;
2841 uint64_t msb = masked_bitset.to_ullong();
2842 if (msb != 0) return 64 + base::bits::CountTrailingZeros(msb);
2843 }
2844
2845 return -1;
2846}
2847
2848} // namespace
2849
2851 w_ = AddRange(w_, kWordRanges, kWordRangeCount, interval);
2852
2853 if (interval.size() >= kMapSize) {
2854 map_count_ = kMapSize;
2855 map_.set();
2856 return;
2857 }
2858
2859 for (int i = interval.from(); i <= interval.to(); i++) {
2860 int mod_character = (i & kMask);
2861 if (!map_[mod_character]) {
2862 map_count_++;
2863 map_.set(mod_character);
2864 }
2865 if (map_count_ == kMapSize) return;
2866 }
2867}
2868
2870 w_ = kLatticeUnknown;
2871 if (map_count_ != kMapSize) {
2872 map_count_ = kMapSize;
2873 map_.set();
2874 }
2875}
2876
2878 Zone* zone)
2879 : length_(length),
2880 compiler_(compiler),
2881 max_char_(MaxCodeUnit(compiler->one_byte())) {
2883 for (int i = 0; i < length; i++) {
2884 bitmaps_->Add(zone->New<BoyerMoorePositionInfo>(), zone);
2885 }
2886}
2887
2888// Find the longest range of lookahead that has the fewest number of different
2889// characters that can occur at a given position. Since we are optimizing two
2890// different parameters at once this is a tradeoff.
2892 int biggest_points = 0;
2893 // If more than 32 characters out of 128 can occur it is unlikely that we can
2894 // be lucky enough to step forwards much of the time.
2895 const int kMaxMax = 32;
2896 for (int max_number_of_chars = 4; max_number_of_chars < kMaxMax;
2897 max_number_of_chars *= 2) {
2898 biggest_points =
2899 FindBestInterval(max_number_of_chars, biggest_points, from, to);
2900 }
2901 if (biggest_points == 0) return false;
2902 return true;
2903}
2904
2905// Find the highest-points range between 0 and length_ where the character
2906// information is not too vague. 'Too vague' means that there are more than
2907// max_number_of_chars that can occur at this position. Calculates the number
2908// of points as the product of width-of-the-range and
2909// probability-of-finding-one-of-the-characters, where the probability is
2910// calculated using the frequency distribution of the sample subject string.
2911int BoyerMooreLookahead::FindBestInterval(int max_number_of_chars,
2912 int old_biggest_points, int* from,
2913 int* to) {
2914 int biggest_points = old_biggest_points;
2915 static const int kSize = RegExpMacroAssembler::kTableSize;
2916 for (int i = 0; i < length_;) {
2917 while (i < length_ && Count(i) > max_number_of_chars) i++;
2918 if (i == length_) break;
2919 int remembered_from = i;
2920
2921 BoyerMoorePositionInfo::Bitset union_bitset;
2922 for (; i < length_ && Count(i) <= max_number_of_chars; i++) {
2923 union_bitset |= bitmaps_->at(i)->raw_bitset();
2924 }
2925
2926 int frequency = 0;
2927
2928 // Iterate only over set bits.
2929 int j;
2930 while ((j = BitsetFirstSetBit(union_bitset)) != -1) {
2931 DCHECK(union_bitset[j]); // Sanity check.
2932 // Add 1 to the frequency to give a small per-character boost for
2933 // the cases where our sampling is not good enough and many
2934 // characters have a frequency of zero. This means the frequency
2935 // can theoretically be up to 2*kSize though we treat it mostly as
2936 // a fraction of kSize.
2937 frequency += compiler_->frequency_collator()->Frequency(j) + 1;
2938 union_bitset.reset(j);
2939 }
2940
2941 // We use the probability of skipping times the distance we are skipping to
2942 // judge the effectiveness of this. Actually we have a cut-off: By
2943 // dividing by 2 we switch off the skipping if the probability of skipping
2944 // is less than 50%. This is because the multibyte mask-and-compare
2945 // skipping in quickcheck is more likely to do well on this case.
2946 bool in_quickcheck_range =
2947 ((i - remembered_from < 4) ||
2948 (compiler_->one_byte() ? remembered_from <= 4 : remembered_from <= 2));
2949 // Called 'probability' but it is only a rough estimate and can actually
2950 // be outside the 0-kSize range.
2951 int probability = (in_quickcheck_range ? kSize / 2 : kSize) - frequency;
2952 int points = (i - remembered_from) * probability;
2953 if (points > biggest_points) {
2954 *from = remembered_from;
2955 *to = i - 1;
2956 biggest_points = points;
2957 }
2958 }
2959 return biggest_points;
2960}
2961
2962// Take all the characters that will not prevent a successful match if they
2963// occur in the subject string in the range between min_lookahead and
2964// max_lookahead (inclusive) measured from the current position. If the
2965// character at max_lookahead offset is not one of these characters, then we
2966// can safely skip forwards by the number of characters in the range.
2967// nibble_table is only used for SIMD variants and encodes the same information
2968// as boolean_skip_table but in only 128 bits. It contains 16 bytes where the
2969// index into the table represent low nibbles of a character, and the stored
2970// byte is a bitset representing matching high nibbles. E.g. to store the
2971// character 'b' (0x62) in the nibble table, we set the 6th bit in row 2.
2973 int min_lookahead, int max_lookahead,
2974 DirectHandle<ByteArray> boolean_skip_table,
2975 DirectHandle<ByteArray> nibble_table) {
2976 const int kSkipArrayEntry = 0;
2977 const int kDontSkipArrayEntry = 1;
2978
2979 std::memset(boolean_skip_table->begin(), kSkipArrayEntry,
2980 boolean_skip_table->length());
2981 const bool fill_nibble_table = !nibble_table.is_null();
2982 if (fill_nibble_table) {
2983 std::memset(nibble_table->begin(), 0, nibble_table->length());
2984 }
2985
2986 for (int i = max_lookahead; i >= min_lookahead; i--) {
2987 BoyerMoorePositionInfo::Bitset bitset = bitmaps_->at(i)->raw_bitset();
2988
2989 // Iterate only over set bits.
2990 int j;
2991 while ((j = BitsetFirstSetBit(bitset)) != -1) {
2992 DCHECK(bitset[j]); // Sanity check.
2993 boolean_skip_table->set(j, kDontSkipArrayEntry);
2994 if (fill_nibble_table) {
2995 int lo_nibble = j & 0x0f;
2996 int hi_nibble = (j >> 4) & 0x07;
2997 int row = nibble_table->get(lo_nibble);
2998 row |= 1 << hi_nibble;
2999 nibble_table->set(lo_nibble, row);
3000 }
3001 bitset.reset(j);
3002 }
3003 }
3004
3005 const int skip = max_lookahead + 1 - min_lookahead;
3006 return skip;
3007}
3008
3009// See comment above on the implementation of GetSkipTable.
3011 const int kSize = RegExpMacroAssembler::kTableSize;
3012
3013 int min_lookahead = 0;
3014 int max_lookahead = 0;
3015
3016 if (!FindWorthwhileInterval(&min_lookahead, &max_lookahead)) return;
3017
3018 // Check if we only have a single non-empty position info, and that info
3019 // contains precisely one character.
3020 bool found_single_character = false;
3021 int single_character = 0;
3022 for (int i = max_lookahead; i >= min_lookahead; i--) {
3024 if (map->map_count() == 0) continue;
3025
3026 if (found_single_character || map->map_count() > 1) {
3027 found_single_character = false;
3028 break;
3029 }
3030
3031 DCHECK(!found_single_character);
3032 DCHECK_EQ(map->map_count(), 1);
3033
3034 found_single_character = true;
3035 single_character = BitsetFirstSetBit(map->raw_bitset());
3036
3037 DCHECK_NE(single_character, -1);
3038 }
3039
3040 int lookahead_width = max_lookahead + 1 - min_lookahead;
3041
3042 if (found_single_character && lookahead_width == 1 && max_lookahead < 3) {
3043 // The mask-compare can probably handle this better.
3044 return;
3045 }
3046
3047 if (found_single_character) {
3048 // TODO(pthier): Add vectorized version.
3049 Label cont, again;
3050 masm->Bind(&again);
3051 masm->LoadCurrentCharacter(max_lookahead, &cont, true);
3052 if (max_char_ > kSize) {
3053 masm->CheckCharacterAfterAnd(single_character,
3055 } else {
3056 masm->CheckCharacter(single_character, &cont);
3057 }
3058 masm->AdvanceCurrentPosition(lookahead_width);
3059 masm->GoTo(&again);
3060 masm->Bind(&cont);
3061 return;
3062 }
3063
3064 Factory* factory = masm->isolate()->factory();
3065 Handle<ByteArray> boolean_skip_table =
3066 factory->NewByteArray(kSize, AllocationType::kOld);
3067 Handle<ByteArray> nibble_table;
3068 const int skip_distance = max_lookahead + 1 - min_lookahead;
3069 if (masm->SkipUntilBitInTableUseSimd(skip_distance)) {
3070 // The current implementation is tailored specifically for 128-bit tables.
3071 static_assert(kSize == 128);
3072 nibble_table =
3074 }
3075 GetSkipTable(min_lookahead, max_lookahead, boolean_skip_table, nibble_table);
3076 DCHECK_NE(0, skip_distance);
3077
3078 masm->SkipUntilBitInTable(max_lookahead, boolean_skip_table, nibble_table,
3079 skip_distance);
3080}
3081
3082/* Code generation for choice nodes.
3083 *
3084 * We generate quick checks that do a mask and compare to eliminate a
3085 * choice. If the quick check succeeds then it jumps to the continuation to
3086 * do slow checks and check subsequent nodes. If it fails (the common case)
3087 * it falls through to the next choice.
3088 *
3089 * Here is the desired flow graph. Nodes directly below each other imply
3090 * fallthrough. Alternatives 1 and 2 have quick checks. Alternative
3091 * 3 doesn't have a quick check so we have to call the slow check.
3092 * Nodes are marked Qn for quick checks and Sn for slow checks. The entire
3093 * regexp continuation is generated directly after the Sn node, up to the
3094 * next GoTo if we decide to reuse some already generated code. Some
3095 * nodes expect preload_characters to be preloaded into the current
3096 * character register. R nodes do this preloading. Vertices are marked
3097 * F for failures and S for success (possible success in the case of quick
3098 * nodes). L, V, < and > are used as arrow heads.
3099 *
3100 * ----------> R
3101 * |
3102 * V
3103 * Q1 -----> S1
3104 * | S /
3105 * F| /
3106 * | F/
3107 * | /
3108 * | R
3109 * | /
3110 * V L
3111 * Q2 -----> S2
3112 * | S /
3113 * F| /
3114 * | F/
3115 * | /
3116 * | R
3117 * | /
3118 * V L
3119 * S3
3120 * |
3121 * F|
3122 * |
3123 * R
3124 * |
3125 * backtrack V
3126 * <----------Q4
3127 * \ F |
3128 * \ |S
3129 * \ F V
3130 * \-----S4
3131 *
3132 * For greedy loops we push the current position, then generate the code that
3133 * eats the input specially in EmitGreedyLoop. The other choice (the
3134 * continuation) is generated by the normal code in EmitChoices, and steps back
3135 * in the input to the starting position when it fails to match. The loop code
3136 * looks like this (U is the unwind code that steps back in the greedy loop).
3137 *
3138 * _____
3139 * / \
3140 * V |
3141 * ----------> S1 |
3142 * /| |
3143 * / |S |
3144 * F/ \_____/
3145 * /
3146 * |<-----
3147 * | \
3148 * V |S
3149 * Q2 ---> U----->backtrack
3150 * | F /
3151 * S| /
3152 * V F /
3153 * S2--/
3154 */
3155
3160
3162#ifdef DEBUG
3163 int choice_count = alternatives_->length();
3164 for (int i = 0; i < choice_count - 1; i++) {
3165 GuardedAlternative alternative = alternatives_->at(i);
3166 ZoneList<Guard*>* guards = alternative.guards();
3167 int guard_count = (guards == nullptr) ? 0 : guards->length();
3168 for (int j = 0; j < guard_count; j++) {
3169 DCHECK(!trace->mentions_reg(guards->at(j)->reg()));
3170 }
3171 }
3172#endif
3173}
3174
3175void ChoiceNode::SetUpPreLoad(RegExpCompiler* compiler, Trace* current_trace,
3176 PreloadState* state) {
3177 if (state->eats_at_least_ == PreloadState::kEatsAtLeastNotYetInitialized) {
3178 // Save some time by looking at most one machine word ahead.
3179 state->eats_at_least_ =
3180 EatsAtLeast(current_trace->at_start() == Trace::FALSE_VALUE);
3181 }
3182 state->preload_characters_ =
3183 CalculatePreloadCharacters(compiler, state->eats_at_least_);
3184
3185 state->preload_is_current_ =
3186 (current_trace->characters_preloaded() == state->preload_characters_);
3187 state->preload_has_checked_bounds_ = state->preload_is_current_;
3188}
3189
3190void ChoiceNode::Emit(RegExpCompiler* compiler, Trace* trace) {
3191 int choice_count = alternatives_->length();
3192
3193 if (choice_count == 1 && alternatives_->at(0).guards() == nullptr) {
3194 alternatives_->at(0).node()->Emit(compiler, trace);
3195 return;
3196 }
3197
3199
3200 LimitResult limit_result = LimitVersions(compiler, trace);
3201 if (limit_result == DONE) return;
3202 DCHECK(limit_result == CONTINUE);
3203
3204 // For loop nodes we already flushed (see LoopChoiceNode::Emit), but for
3205 // other choice nodes we only flush if we are out of code size budget.
3206 if (trace->flush_budget() == 0 && trace->actions() != nullptr) {
3207 trace->Flush(compiler, this);
3208 return;
3209 }
3210
3211 RecursionCheck rc(compiler);
3212
3213 PreloadState preload;
3214 preload.init();
3215 GreedyLoopState greedy_loop_state(not_at_start());
3216
3217 int text_length = GreedyLoopTextLengthForAlternative(&alternatives_->at(0));
3218 AlternativeGenerationList alt_gens(choice_count, zone());
3219
3220 if (choice_count > 1 && text_length != kNodeIsTooComplexForGreedyLoops) {
3221 trace = EmitGreedyLoop(compiler, trace, &alt_gens, &preload,
3222 &greedy_loop_state, text_length);
3223 } else {
3224 preload.eats_at_least_ = EmitOptimizedUnanchoredSearch(compiler, trace);
3225
3226 EmitChoices(compiler, &alt_gens, 0, trace, &preload);
3227 }
3228
3229 // At this point we need to generate slow checks for the alternatives where
3230 // the quick check was inlined. We can recognize these because the associated
3231 // label was bound.
3232 int new_flush_budget = trace->flush_budget() / choice_count;
3233 for (int i = 0; i < choice_count; i++) {
3234 AlternativeGeneration* alt_gen = alt_gens.at(i);
3235 Trace new_trace(*trace);
3236 // If there are actions to be flushed we have to limit how many times
3237 // they are flushed. Take the budget of the parent trace and distribute
3238 // it fairly amongst the children.
3239 if (new_trace.actions() != nullptr) {
3240 new_trace.set_flush_budget(new_flush_budget);
3241 }
3242 bool next_expects_preload =
3243 i == choice_count - 1 ? false : alt_gens.at(i + 1)->expects_preload;
3244 EmitOutOfLineContinuation(compiler, &new_trace, alternatives_->at(i),
3245 alt_gen, preload.preload_characters_,
3246 next_expects_preload);
3247 }
3248}
3249
3251 AlternativeGenerationList* alt_gens,
3252 PreloadState* preload,
3253 GreedyLoopState* greedy_loop_state,
3254 int text_length) {
3255 RegExpMacroAssembler* macro_assembler = compiler->macro_assembler();
3256 // Here we have special handling for greedy loops containing only text nodes
3257 // and other simple nodes. These are handled by pushing the current
3258 // position on the stack and then incrementing the current position each
3259 // time around the switch. On backtrack we decrement the current position
3260 // and check it against the pushed value. This avoids pushing backtrack
3261 // information for each iteration of the loop, which could take up a lot of
3262 // space.
3263 DCHECK(trace->stop_node() == nullptr);
3264 macro_assembler->PushCurrentPosition();
3265 Label greedy_match_failed;
3266 Trace greedy_match_trace;
3267 if (not_at_start()) greedy_match_trace.set_at_start(Trace::FALSE_VALUE);
3268 greedy_match_trace.set_backtrack(&greedy_match_failed);
3269 Label loop_label;
3270 macro_assembler->Bind(&loop_label);
3271 greedy_match_trace.set_stop_node(this);
3272 greedy_match_trace.set_loop_label(&loop_label);
3273 alternatives_->at(0).node()->Emit(compiler, &greedy_match_trace);
3274 macro_assembler->Bind(&greedy_match_failed);
3275
3276 Label second_choice; // For use in greedy matches.
3277 macro_assembler->Bind(&second_choice);
3278
3279 Trace* new_trace = greedy_loop_state->counter_backtrack_trace();
3280
3281 EmitChoices(compiler, alt_gens, 1, new_trace, preload);
3282
3283 macro_assembler->Bind(greedy_loop_state->label());
3284 // If we have unwound to the bottom then backtrack.
3285 macro_assembler->CheckGreedyLoop(trace->backtrack());
3286 // Otherwise try the second priority at an earlier position.
3287 macro_assembler->AdvanceCurrentPosition(-text_length);
3288 macro_assembler->GoTo(&second_choice);
3289 return new_trace;
3290}
3291
3293 Trace* trace) {
3295 if (alternatives_->length() != 2) return eats_at_least;
3296
3297 GuardedAlternative alt1 = alternatives_->at(1);
3298 if (alt1.guards() != nullptr && alt1.guards()->length() != 0) {
3299 return eats_at_least;
3300 }
3301 RegExpNode* eats_anything_node = alt1.node();
3302 if (eats_anything_node->GetSuccessorOfOmnivorousTextNode(compiler) != this) {
3303 return eats_at_least;
3304 }
3305
3306 // Really we should be creating a new trace when we execute this function,
3307 // but there is no need, because the code it generates cannot backtrack, and
3308 // we always arrive here with a trivial trace (since it's the entry to a
3309 // loop. That also implies that there are no preloaded characters, which is
3310 // good, because it means we won't be violating any assumptions by
3311 // overwriting those characters with new load instructions.
3312 DCHECK(trace->is_trivial());
3313
3314 RegExpMacroAssembler* macro_assembler = compiler->macro_assembler();
3315 Isolate* isolate = macro_assembler->isolate();
3316 // At this point we know that we are at a non-greedy loop that will eat
3317 // any character one at a time. Any non-anchored regexp has such a
3318 // loop prepended to it in order to find where it starts. We look for
3319 // a pattern of the form ...abc... where we can look 6 characters ahead
3320 // and step forwards 3 if the character is not one of abc. Abc need
3321 // not be atoms, they can be any reasonably limited character class or
3322 // small alternation.
3323 BoyerMooreLookahead* bm = bm_info(false);
3324 if (bm == nullptr) {
3325 eats_at_least = std::min(kMaxLookaheadForBoyerMoore, EatsAtLeast(false));
3326 if (eats_at_least >= 1) {
3327 bm = zone()->New<BoyerMooreLookahead>(eats_at_least, compiler, zone());
3328 GuardedAlternative alt0 = alternatives_->at(0);
3329 alt0.node()->FillInBMInfo(isolate, 0, kRecursionBudget, bm, false);
3330 }
3331 }
3332 if (bm != nullptr) {
3333 bm->EmitSkipInstructions(macro_assembler);
3334 }
3335 return eats_at_least;
3336}
3337
3339 AlternativeGenerationList* alt_gens,
3340 int first_choice, Trace* trace,
3341 PreloadState* preload) {
3342 RegExpMacroAssembler* macro_assembler = compiler->macro_assembler();
3343 SetUpPreLoad(compiler, trace, preload);
3344
3345 // For now we just call all choices one after the other. The idea ultimately
3346 // is to use the Dispatch table to try only the relevant ones.
3347 int choice_count = alternatives_->length();
3348
3349 int new_flush_budget = trace->flush_budget() / choice_count;
3350
3351 for (int i = first_choice; i < choice_count; i++) {
3352 bool is_last = i == choice_count - 1;
3353 bool fall_through_on_failure = !is_last;
3354 GuardedAlternative alternative = alternatives_->at(i);
3355 AlternativeGeneration* alt_gen = alt_gens->at(i);
3357 ZoneList<Guard*>* guards = alternative.guards();
3358 int guard_count = (guards == nullptr) ? 0 : guards->length();
3359 Trace new_trace(*trace);
3360 new_trace.set_characters_preloaded(
3361 preload->preload_is_current_ ? preload->preload_characters_ : 0);
3362 if (preload->preload_has_checked_bounds_) {
3363 new_trace.set_bound_checked_up_to(preload->preload_characters_);
3364 }
3365 new_trace.quick_check_performed()->Clear();
3367 if (!is_last) {
3368 new_trace.set_backtrack(&alt_gen->after);
3369 }
3370 alt_gen->expects_preload = preload->preload_is_current_;
3371 bool generate_full_check_inline = false;
3372 if (v8_flags.regexp_optimization &&
3374 alternative.node()->EmitQuickCheck(
3375 compiler, trace, &new_trace, preload->preload_has_checked_bounds_,
3376 &alt_gen->possible_success, &alt_gen->quick_check_details,
3377 fall_through_on_failure, this)) {
3378 // Quick check was generated for this choice.
3379 preload->preload_is_current_ = true;
3380 preload->preload_has_checked_bounds_ = true;
3381 // If we generated the quick check to fall through on possible success,
3382 // we now need to generate the full check inline.
3383 if (!fall_through_on_failure) {
3384 macro_assembler->Bind(&alt_gen->possible_success);
3387 new_trace.set_bound_checked_up_to(preload->preload_characters_);
3388 generate_full_check_inline = true;
3389 }
3390 } else if (alt_gen->quick_check_details.cannot_match()) {
3391 if (!fall_through_on_failure) {
3392 macro_assembler->GoTo(trace->backtrack());
3393 }
3394 continue;
3395 } else {
3396 // No quick check was generated. Put the full code here.
3397 // If this is not the first choice then there could be slow checks from
3398 // previous cases that go here when they fail. There's no reason to
3399 // insist that they preload characters since the slow check we are about
3400 // to generate probably can't use it.
3401 if (i != first_choice) {
3402 alt_gen->expects_preload = false;
3403 new_trace.InvalidateCurrentCharacter();
3404 }
3405 generate_full_check_inline = true;
3406 }
3407 if (generate_full_check_inline) {
3408 if (new_trace.actions() != nullptr) {
3409 new_trace.set_flush_budget(new_flush_budget);
3410 }
3411 for (int j = 0; j < guard_count; j++) {
3412 GenerateGuard(macro_assembler, guards->at(j), &new_trace);
3413 }
3414 alternative.node()->Emit(compiler, &new_trace);
3415 preload->preload_is_current_ = false;
3416 }
3417 macro_assembler->Bind(&alt_gen->after);
3418 }
3419}
3420
3422 Trace* trace,
3423 GuardedAlternative alternative,
3424 AlternativeGeneration* alt_gen,
3425 int preload_characters,
3426 bool next_expects_preload) {
3427 if (!alt_gen->possible_success.is_linked()) return;
3428
3429 RegExpMacroAssembler* macro_assembler = compiler->macro_assembler();
3430 macro_assembler->Bind(&alt_gen->possible_success);
3431 Trace out_of_line_trace(*trace);
3432 out_of_line_trace.set_characters_preloaded(preload_characters);
3433 out_of_line_trace.set_quick_check_performed(&alt_gen->quick_check_details);
3434 if (not_at_start_) out_of_line_trace.set_at_start(Trace::FALSE_VALUE);
3435 ZoneList<Guard*>* guards = alternative.guards();
3436 int guard_count = (guards == nullptr) ? 0 : guards->length();
3437 if (next_expects_preload) {
3438 Label reload_current_char;
3439 out_of_line_trace.set_backtrack(&reload_current_char);
3440 for (int j = 0; j < guard_count; j++) {
3441 GenerateGuard(macro_assembler, guards->at(j), &out_of_line_trace);
3442 }
3443 alternative.node()->Emit(compiler, &out_of_line_trace);
3444 macro_assembler->Bind(&reload_current_char);
3445 // Reload the current character, since the next quick check expects that.
3446 // We don't need to check bounds here because we only get into this
3447 // code through a quick check which already did the checked load.
3448 macro_assembler->LoadCurrentCharacter(trace->cp_offset(), nullptr, false,
3449 preload_characters);
3450 macro_assembler->GoTo(&(alt_gen->after));
3451 } else {
3452 out_of_line_trace.set_backtrack(&(alt_gen->after));
3453 for (int j = 0; j < guard_count; j++) {
3454 GenerateGuard(macro_assembler, guards->at(j), &out_of_line_trace);
3455 }
3456 alternative.node()->Emit(compiler, &out_of_line_trace);
3457 }
3458}
3459
3460void ActionNode::Emit(RegExpCompiler* compiler, Trace* trace) {
3461 RegExpMacroAssembler* assembler = compiler->macro_assembler();
3462 LimitResult limit_result = LimitVersions(compiler, trace);
3463 if (limit_result == DONE) return;
3464 DCHECK(limit_result == CONTINUE);
3465
3466 RecursionCheck rc(compiler);
3467
3468 switch (action_type_) {
3469 case STORE_POSITION: {
3470 Trace::DeferredCapture new_capture(data_.u_position_register.reg,
3471 data_.u_position_register.is_capture,
3472 trace);
3473 Trace new_trace = *trace;
3474 new_trace.add_action(&new_capture);
3475 on_success()->Emit(compiler, &new_trace);
3476 break;
3477 }
3478 case INCREMENT_REGISTER: {
3480 data_.u_increment_register.reg);
3481 Trace new_trace = *trace;
3482 new_trace.add_action(&new_increment);
3483 on_success()->Emit(compiler, &new_trace);
3484 break;
3485 }
3486 case SET_REGISTER_FOR_LOOP: {
3487 Trace::DeferredSetRegisterForLoop new_set(data_.u_store_register.reg,
3488 data_.u_store_register.value);
3489 Trace new_trace = *trace;
3490 new_trace.add_action(&new_set);
3491 on_success()->Emit(compiler, &new_trace);
3492 break;
3493 }
3494 case CLEAR_CAPTURES: {
3496 data_.u_clear_captures.range_from, data_.u_clear_captures.range_to));
3497 Trace new_trace = *trace;
3498 new_trace.add_action(&new_capture);
3499 on_success()->Emit(compiler, &new_trace);
3500 break;
3501 }
3504 if (!trace->is_trivial()) {
3505 trace->Flush(compiler, this);
3506 } else {
3507 assembler->WriteCurrentPositionToRegister(
3508 data_.u_submatch.current_position_register, 0);
3509 assembler->WriteStackPointerToRegister(
3510 data_.u_submatch.stack_pointer_register);
3511 on_success()->Emit(compiler, trace);
3512 }
3513 break;
3514 case EMPTY_MATCH_CHECK: {
3515 int start_pos_reg = data_.u_empty_match_check.start_register;
3516 int stored_pos = 0;
3517 int rep_reg = data_.u_empty_match_check.repetition_register;
3518 bool has_minimum = (rep_reg != RegExpCompiler::kNoRegister);
3519 bool know_dist = trace->GetStoredPosition(start_pos_reg, &stored_pos);
3520 if (know_dist && !has_minimum && stored_pos == trace->cp_offset()) {
3521 // If we know we haven't advanced and there is no minimum we
3522 // can just backtrack immediately.
3523 assembler->GoTo(trace->backtrack());
3524 } else if (know_dist && stored_pos < trace->cp_offset()) {
3525 // If we know we've advanced we can generate the continuation
3526 // immediately.
3527 on_success()->Emit(compiler, trace);
3528 } else if (!trace->is_trivial()) {
3529 trace->Flush(compiler, this);
3530 } else {
3531 Label skip_empty_check;
3532 // If we have a minimum number of repetitions we check the current
3533 // number first and skip the empty check if it's not enough.
3534 if (has_minimum) {
3535 int limit = data_.u_empty_match_check.repetition_limit;
3536 assembler->IfRegisterLT(rep_reg, limit, &skip_empty_check);
3537 }
3538 // If the match is empty we bail out, otherwise we fall through
3539 // to the on-success continuation.
3540 assembler->IfRegisterEqPos(data_.u_empty_match_check.start_register,
3541 trace->backtrack());
3542 assembler->Bind(&skip_empty_check);
3543 on_success()->Emit(compiler, trace);
3544 }
3545 break;
3546 }
3548 if (!trace->is_trivial()) {
3549 trace->Flush(compiler, this);
3550 return;
3551 }
3552 assembler->ReadCurrentPositionFromRegister(
3553 data_.u_submatch.current_position_register);
3554 assembler->ReadStackPointerFromRegister(
3555 data_.u_submatch.stack_pointer_register);
3556 int clear_register_count = data_.u_submatch.clear_register_count;
3557 if (clear_register_count == 0) {
3558 on_success()->Emit(compiler, trace);
3559 return;
3560 }
3561 int clear_registers_from = data_.u_submatch.clear_register_from;
3562 Label clear_registers_backtrack;
3563 Trace new_trace = *trace;
3564 new_trace.set_backtrack(&clear_registers_backtrack);
3565 on_success()->Emit(compiler, &new_trace);
3566
3567 assembler->Bind(&clear_registers_backtrack);
3568 int clear_registers_to = clear_registers_from + clear_register_count - 1;
3569 assembler->ClearRegisters(clear_registers_from, clear_registers_to);
3570
3571 DCHECK(trace->backtrack() == nullptr);
3572 assembler->Backtrack();
3573 return;
3574 }
3575 case MODIFY_FLAGS: {
3576 compiler->set_flags(flags());
3577 on_success()->Emit(compiler, trace);
3578 break;
3579 }
3580 default:
3581 UNREACHABLE();
3582 }
3583}
3584
3586 RegExpMacroAssembler* assembler = compiler->macro_assembler();
3587 if (!trace->is_trivial()) {
3588 trace->Flush(compiler, this);
3589 return;
3590 }
3591
3592 LimitResult limit_result = LimitVersions(compiler, trace);
3593 if (limit_result == DONE) return;
3594 DCHECK(limit_result == CONTINUE);
3595
3596 RecursionCheck rc(compiler);
3597
3599 if (IsIgnoreCase(compiler->flags())) {
3600 bool unicode = IsEitherUnicode(compiler->flags());
3601 assembler->CheckNotBackReferenceIgnoreCase(start_reg_, read_backward(),
3602 unicode, trace->backtrack());
3603 } else {
3604 assembler->CheckNotBackReference(start_reg_, read_backward(),
3605 trace->backtrack());
3606 }
3607 // We are going to advance backward, so we may end up at the start.
3608 if (read_backward()) trace->set_at_start(Trace::UNKNOWN);
3609
3610 // Check that the back reference does not end inside a surrogate pair.
3611 if (IsEitherUnicode(compiler->flags()) && !compiler->one_byte()) {
3612 assembler->CheckNotInSurrogatePair(trace->cp_offset(), trace->backtrack());
3613 }
3614 on_success()->Emit(compiler, trace);
3615}
3616
3618 int element_count = elements()->length();
3619 // Set up the offsets of the elements relative to the start. This is a fixed
3620 // quantity since a TextNode can only contain fixed-width things.
3621 int cp_offset = 0;
3622 for (int i = 0; i < element_count; i++) {
3623 TextElement& elm = elements()->at(i);
3624 elm.set_cp_offset(cp_offset);
3625 cp_offset += elm.length();
3626 }
3627}
3628
3629namespace {
3630
3631// Assertion propagation moves information about assertions such as
3632// \b to the affected nodes. For instance, in /.\b./ information must
3633// be propagated to the first '.' that whatever follows needs to know
3634// if it matched a word or a non-word, and to the second '.' that it
3635// has to check if it succeeds a word or non-word. In this case the
3636// result will be something like:
3637//
3638// +-------+ +------------+
3639// | . | | . |
3640// +-------+ ---> +------------+
3641// | word? | | check word |
3642// +-------+ +------------+
3643class AssertionPropagator : public AllStatic {
3644 public:
3645 static void VisitText(TextNode* that) {}
3646
3647 static void VisitAction(ActionNode* that) {
3648 // If the next node is interested in what it follows then this node
3649 // has to be interested too so it can pass the information on.
3650 that->info()->AddFromFollowing(that->on_success()->info());
3651 }
3652
3653 static void VisitChoice(ChoiceNode* that, int i) {
3654 // Anything the following nodes need to know has to be known by
3655 // this node also, so it can pass it on.
3656 that->info()->AddFromFollowing(that->alternatives()->at(i).node()->info());
3657 }
3658
3659 static void VisitLoopChoiceContinueNode(LoopChoiceNode* that) {
3660 that->info()->AddFromFollowing(that->continue_node()->info());
3661 }
3662
3663 static void VisitLoopChoiceLoopNode(LoopChoiceNode* that) {
3664 that->info()->AddFromFollowing(that->loop_node()->info());
3665 }
3666
3667 static void VisitNegativeLookaroundChoiceLookaroundNode(
3668 NegativeLookaroundChoiceNode* that) {
3670 }
3671
3672 static void VisitNegativeLookaroundChoiceContinueNode(
3673 NegativeLookaroundChoiceNode* that) {
3675 }
3676
3677 static void VisitBackReference(BackReferenceNode* that) {}
3678
3679 static void VisitAssertion(AssertionNode* that) {}
3680};
3681
3682// Propagates information about the minimum size of successful matches from
3683// successor nodes to their predecessors. Note that all eats_at_least values
3684// are initialized to zero before analysis.
3685class EatsAtLeastPropagator : public AllStatic {
3686 public:
3687 static void VisitText(TextNode* that) {
3688 // The eats_at_least value is not used if reading backward.
3689 if (!that->read_backward()) {
3690 // We are not at the start after this node, and thus we can use the
3691 // successor's eats_at_least_from_not_start value.
3692 uint8_t eats_at_least = base::saturated_cast<uint8_t>(
3693 that->Length() + that->on_success()
3694 ->eats_at_least_info()
3695 ->eats_at_least_from_not_start);
3696 that->set_eats_at_least_info(EatsAtLeastInfo(eats_at_least));
3697 }
3698 }
3699
3700 static void VisitAction(ActionNode* that) {
3701 switch (that->action_type()) {
3703 // For a begin positive submatch we propagate the eats_at_least
3704 // data from the successor of the success node, ignoring the body of
3705 // the lookahead, which eats nothing, since it is a zero-width
3706 // assertion.
3707 // TODO(chromium:42201836) This is better than discarding all
3708 // information when there is a positive lookahead, but it loses some
3709 // information that could be useful, since the body of the lookahead
3710 // could tell us something about how close to the end of the string we
3711 // are.
3712 that->set_eats_at_least_info(
3713 *that->success_node()->on_success()->eats_at_least_info());
3714 break;
3715 }
3717 // We do not propagate eats_at_least data through positive submatch
3718 // success because it rewinds input.
3719 DCHECK(that->eats_at_least_info()->IsZero());
3720 break;
3722 // SET_REGISTER_FOR_LOOP indicates a loop entry point, which means the
3723 // loop body will run at least the minimum number of times before the
3724 // continuation case can run.
3725 that->set_eats_at_least_info(
3726 that->on_success()->EatsAtLeastFromLoopEntry());
3727 break;
3729 default:
3730 // Otherwise, the current node eats at least as much as its successor.
3731 // Note: we can propagate eats_at_least data for BEGIN_NEGATIVE_SUBMATCH
3732 // because NegativeLookaroundChoiceNode ignores its lookaround successor
3733 // when computing eats-at-least and quick check information.
3734 that->set_eats_at_least_info(*that->on_success()->eats_at_least_info());
3735 break;
3736 }
3737 }
3738
3739 static void VisitChoice(ChoiceNode* that, int i) {
3740 // The minimum possible match from a choice node is the minimum of its
3741 // successors.
3742 EatsAtLeastInfo eats_at_least =
3743 i == 0 ? EatsAtLeastInfo(UINT8_MAX) : *that->eats_at_least_info();
3744 eats_at_least.SetMin(
3745 *that->alternatives()->at(i).node()->eats_at_least_info());
3746 that->set_eats_at_least_info(eats_at_least);
3747 }
3748
3749 static void VisitLoopChoiceContinueNode(LoopChoiceNode* that) {
3750 if (!that->read_backward()) {
3751 that->set_eats_at_least_info(
3752 *that->continue_node()->eats_at_least_info());
3753 }
3754 }
3755
3756 static void VisitLoopChoiceLoopNode(LoopChoiceNode* that) {}
3757
3758 static void VisitNegativeLookaroundChoiceLookaroundNode(
3759 NegativeLookaroundChoiceNode* that) {}
3760
3761 static void VisitNegativeLookaroundChoiceContinueNode(
3762 NegativeLookaroundChoiceNode* that) {
3763 that->set_eats_at_least_info(*that->continue_node()->eats_at_least_info());
3764 }
3765
3766 static void VisitBackReference(BackReferenceNode* that) {
3767 if (!that->read_backward()) {
3768 that->set_eats_at_least_info(*that->on_success()->eats_at_least_info());
3769 }
3770 }
3771
3772 static void VisitAssertion(AssertionNode* that) {
3773 EatsAtLeastInfo eats_at_least = *that->on_success()->eats_at_least_info();
3774 if (that->assertion_type() == AssertionNode::AT_START) {
3775 // If we know we are not at the start and we are asked "how many
3776 // characters will you match if you succeed?" then we can answer anything
3777 // since false implies false. So let's just set the max answer
3778 // (UINT8_MAX) since that won't prevent us from preloading a lot of
3779 // characters for the other branches in the node graph.
3780 eats_at_least.eats_at_least_from_not_start = UINT8_MAX;
3781 }
3782 that->set_eats_at_least_info(eats_at_least);
3783 }
3784};
3785
3786} // namespace
3787
3788// -------------------------------------------------------------------
3789// Analysis
3790
3791// Iterates the node graph and provides the opportunity for propagators to set
3792// values that depend on successor nodes.
3793template <typename... Propagators>
3794class Analysis : public NodeVisitor {
3795 public:
3796 Analysis(Isolate* isolate, bool is_one_byte, RegExpFlags flags)
3797 : isolate_(isolate),
3798 is_one_byte_(is_one_byte),
3799 flags_(flags),
3801
3803 StackLimitCheck check(isolate());
3804 if (check.HasOverflowed()) {
3805 if (v8_flags.correctness_fuzzer_suppressions) {
3806 FATAL("Analysis: Aborting on stack overflow");
3807 }
3808 fail(RegExpError::kAnalysisStackOverflow);
3809 return;
3810 }
3811 if (that->info()->been_analyzed || that->info()->being_analyzed) return;
3812 that->info()->being_analyzed = true;
3813 that->Accept(this);
3814 that->info()->being_analyzed = false;
3815 that->info()->been_analyzed = true;
3816 }
3817
3818 bool has_failed() { return error_ != RegExpError::kNone; }
3820 DCHECK(error_ != RegExpError::kNone);
3821 return error_;
3822 }
3823 void fail(RegExpError error) { error_ = error; }
3824
3825 Isolate* isolate() const { return isolate_; }
3826
3827 void VisitEnd(EndNode* that) override {
3828 // nothing to do
3829 }
3830
3831// Used to call the given static function on each propagator / variadic template
3832// argument.
3833#define STATIC_FOR_EACH(expr) \
3834 do { \
3835 int dummy[] = {((expr), 0)...}; \
3836 USE(dummy); \
3837 } while (false)
3838
3839 void VisitText(TextNode* that) override {
3840 that->MakeCaseIndependent(isolate(), is_one_byte_, flags());
3841 EnsureAnalyzed(that->on_success());
3842 if (has_failed()) return;
3843 that->CalculateOffsets();
3844 STATIC_FOR_EACH(Propagators::VisitText(that));
3845 }
3846
3847 void VisitAction(ActionNode* that) override {
3848 if (that->action_type() == ActionNode::MODIFY_FLAGS) {
3849 set_flags(that->flags());
3850 }
3851 EnsureAnalyzed(that->on_success());
3852 if (has_failed()) return;
3853 STATIC_FOR_EACH(Propagators::VisitAction(that));
3854 }
3855
3856 void VisitChoice(ChoiceNode* that) override {
3857 for (int i = 0; i < that->alternatives()->length(); i++) {
3858 EnsureAnalyzed(that->alternatives()->at(i).node());
3859 if (has_failed()) return;
3860 STATIC_FOR_EACH(Propagators::VisitChoice(that, i));
3861 }
3862 }
3863
3864 void VisitLoopChoice(LoopChoiceNode* that) override {
3865 DCHECK_EQ(that->alternatives()->length(), 2); // Just loop and continue.
3866
3867 // First propagate all information from the continuation node.
3868 // Due to the unusual visitation order, we need to manage the flags manually
3869 // as if we were visiting the loop node before visiting the continuation.
3870 RegExpFlags orig_flags = flags();
3871
3872 EnsureAnalyzed(that->continue_node());
3873 if (has_failed()) return;
3874 // Propagators don't access global state (including flags), so we don't need
3875 // to reset them here.
3876 STATIC_FOR_EACH(Propagators::VisitLoopChoiceContinueNode(that));
3877
3878 RegExpFlags continuation_flags = flags();
3879
3880 // Check the loop last since it may need the value of this node
3881 // to get a correct result.
3882 set_flags(orig_flags);
3883 EnsureAnalyzed(that->loop_node());
3884 if (has_failed()) return;
3885 // Propagators don't access global state (including flags), so we don't need
3886 // to reset them here.
3887 STATIC_FOR_EACH(Propagators::VisitLoopChoiceLoopNode(that));
3888
3889 set_flags(continuation_flags);
3890 }
3891
3893 NegativeLookaroundChoiceNode* that) override {
3894 DCHECK_EQ(that->alternatives()->length(), 2); // Lookaround and continue.
3895
3896 EnsureAnalyzed(that->lookaround_node());
3897 if (has_failed()) return;
3899 Propagators::VisitNegativeLookaroundChoiceLookaroundNode(that));
3900
3901 EnsureAnalyzed(that->continue_node());
3902 if (has_failed()) return;
3904 Propagators::VisitNegativeLookaroundChoiceContinueNode(that));
3905 }
3906
3908 EnsureAnalyzed(that->on_success());
3909 if (has_failed()) return;
3910 STATIC_FOR_EACH(Propagators::VisitBackReference(that));
3911 }
3912
3913 void VisitAssertion(AssertionNode* that) override {
3914 EnsureAnalyzed(that->on_success());
3915 if (has_failed()) return;
3916 STATIC_FOR_EACH(Propagators::VisitAssertion(that));
3917 }
3918
3919#undef STATIC_FOR_EACH
3920
3921 private:
3922 RegExpFlags flags() const { return flags_; }
3924
3926 const bool is_one_byte_;
3929
3931};
3932
3933RegExpError AnalyzeRegExp(Isolate* isolate, bool is_one_byte, RegExpFlags flags,
3934 RegExpNode* node) {
3936 isolate, is_one_byte, flags);
3937 DCHECK_EQ(node->info()->been_analyzed, false);
3938 analysis.EnsureAnalyzed(node);
3939 DCHECK_IMPLIES(analysis.has_failed(), analysis.error() != RegExpError::kNone);
3940 return analysis.has_failed() ? analysis.error() : RegExpError::kNone;
3941}
3942
3943void BackReferenceNode::FillInBMInfo(Isolate* isolate, int offset, int budget,
3945 bool not_at_start) {
3946 // Working out the set of characters that a backreference can match is too
3947 // hard, so we just say that any character can match.
3948 bm->SetRest(offset);
3949 SaveBMInfo(bm, not_at_start, offset);
3950}
3951
3952static_assert(BoyerMoorePositionInfo::kMapSize ==
3954
3955void ChoiceNode::FillInBMInfo(Isolate* isolate, int offset, int budget,
3956 BoyerMooreLookahead* bm, bool not_at_start) {
3958 budget = (budget - 1) / alts->length();
3959 for (int i = 0; i < alts->length(); i++) {
3960 GuardedAlternative& alt = alts->at(i);
3961 if (alt.guards() != nullptr && alt.guards()->length() != 0) {
3962 bm->SetRest(offset); // Give up trying to fill in info.
3964 return;
3965 }
3966 alt.node()->FillInBMInfo(isolate, offset, budget, bm, not_at_start);
3967 }
3969}
3970
3971void TextNode::FillInBMInfo(Isolate* isolate, int initial_offset, int budget,
3972 BoyerMooreLookahead* bm, bool not_at_start) {
3973 if (initial_offset >= bm->length()) return;
3974 if (read_backward()) return;
3975 int offset = initial_offset;
3976 int max_char = bm->max_char();
3977 for (int i = 0; i < elements()->length(); i++) {
3978 if (offset >= bm->length()) {
3979 if (initial_offset == 0) set_bm_info(not_at_start, bm);
3980 return;
3981 }
3982 TextElement text = elements()->at(i);
3983 if (text.text_type() == TextElement::ATOM) {
3984 RegExpAtom* atom = text.atom();
3985 for (int j = 0; j < atom->length(); j++, offset++) {
3986 if (offset >= bm->length()) {
3987 if (initial_offset == 0) set_bm_info(not_at_start, bm);
3988 return;
3989 }
3990 base::uc16 character = atom->data()[j];
3991 if (IsIgnoreCase(bm->compiler()->flags())) {
3992 unibrow::uchar chars[4];
3993 int length = GetCaseIndependentLetters(isolate, character,
3994 bm->compiler(), chars, 4);
3995 for (int k = 0; k < length; k++) {
3996 bm->Set(offset, chars[k]);
3997 }
3998 } else {
3999 if (character <= max_char) bm->Set(offset, character);
4000 }
4001 }
4002 } else {
4003 DCHECK_EQ(TextElement::CLASS_RANGES, text.text_type());
4004 RegExpClassRanges* class_ranges = text.class_ranges();
4005 ZoneList<CharacterRange>* ranges = class_ranges->ranges(zone());
4006 if (class_ranges->is_negated()) {
4007 bm->SetAll(offset);
4008 } else {
4009 for (int k = 0; k < ranges->length(); k++) {
4010 CharacterRange& range = ranges->at(k);
4011 if (static_cast<int>(range.from()) > max_char) continue;
4012 int to = std::min(max_char, static_cast<int>(range.to()));
4013 bm->SetInterval(offset, Interval(range.from(), to));
4014 }
4015 }
4016 offset++;
4017 }
4018 }
4019 if (offset >= bm->length()) {
4020 if (initial_offset == 0) set_bm_info(not_at_start, bm);
4021 return;
4022 }
4023 on_success()->FillInBMInfo(isolate, offset, budget - 1, bm,
4024 true); // Not at start after a text node.
4025 if (initial_offset == 0) set_bm_info(not_at_start, bm);
4026}
4027
4029 RegExpNode* on_success) {
4035
4036 ChoiceNode* optional_step_back = zone()->New<ChoiceNode>(2, zone());
4037
4038 int stack_register = UnicodeLookaroundStackRegister();
4039 int position_register = UnicodeLookaroundPositionRegister();
4041 zone(), lead_surrogates, true, on_success);
4042 RegExpLookaround::Builder builder(true, step_back, stack_register,
4043 position_register);
4045 zone(), trail_surrogates, false, builder.on_match_success());
4046
4047 optional_step_back->AddAlternative(
4048 GuardedAlternative(builder.ForMatch(match_trail)));
4049 optional_step_back->AddAlternative(GuardedAlternative(on_success));
4050
4051 return optional_step_back;
4052}
4053
4055 bool is_one_byte) {
4056 // Wrap the body of the regexp in capture #0.
4057 RegExpNode* captured_body =
4058 RegExpCapture::ToNode(data->tree, 0, this, accept());
4059 RegExpNode* node = captured_body;
4060 if (!data->tree->IsAnchoredAtStart() && !IsSticky(flags())) {
4061 // Add a .*? at the beginning, outside the body capture, unless
4062 // this expression is anchored at the beginning or sticky.
4064 0, RegExpTree::kInfinity, false,
4065 zone()->New<RegExpClassRanges>(StandardCharacterSet::kEverything), this,
4066 captured_body, data->contains_anchor);
4067
4068 if (data->contains_anchor) {
4069 // Unroll loop once, to take care of the case that might start
4070 // at the start of input.
4071 ChoiceNode* first_step_node = zone()->New<ChoiceNode>(2, zone());
4072 first_step_node->AddAlternative(GuardedAlternative(captured_body));
4073 first_step_node->AddAlternative(GuardedAlternative(zone()->New<TextNode>(
4074 zone()->New<RegExpClassRanges>(StandardCharacterSet::kEverything),
4075 false, loop_node)));
4076 node = first_step_node;
4077 } else {
4078 node = loop_node;
4079 }
4080 }
4081 if (is_one_byte) {
4082 node = node->FilterOneByte(RegExpCompiler::kMaxRecursion, this);
4083 // Do it again to propagate the new nodes to places where they were not
4084 // put because they had not been calculated yet.
4085 if (node != nullptr) {
4086 node = node->FilterOneByte(RegExpCompiler::kMaxRecursion, this);
4087 }
4088 } else if (IsEitherUnicode(flags()) &&
4089 (IsGlobal(flags()) || IsSticky(flags()))) {
4091 }
4092
4093 if (node == nullptr) node = zone()->New<EndNode>(EndNode::BACKTRACK, zone());
4094 // We can run out of registers during preprocessing. Indicate an error in case
4095 // we do.
4096 if (reg_exp_too_big_) {
4097 data->error = RegExpError::kTooLarge;
4098 }
4099 return node;
4100}
4101
4107
4108} // namespace v8::internal
Isolate * isolate_
friend Zone
Definition asm-types.cc:195
#define CONTINUE
SourcePosition pos
int length() const
Definition vector.h:64
static ActionNode * EmptyMatchCheck(int start_register, int repetition_register, int repetition_limit, RegExpNode *on_success)
static ActionNode * ModifyFlags(RegExpFlags flags, RegExpNode *on_success)
struct v8::internal::ActionNode::@137::@143 u_clear_captures
static ActionNode * StorePosition(int reg, bool is_capture, RegExpNode *on_success)
static ActionNode * PositiveSubmatchSuccess(int stack_pointer_reg, int restore_reg, int clear_capture_count, int clear_capture_from, RegExpNode *on_success)
union v8::internal::ActionNode::@137 data_
static ActionNode * BeginNegativeSubmatch(int stack_pointer_reg, int position_reg, RegExpNode *on_success)
static ActionNode * SetRegisterForLoop(int reg, int val, RegExpNode *on_success)
struct v8::internal::ActionNode::@137::@140 u_position_register
void FillInBMInfo(Isolate *isolate, int offset, int budget, BoyerMooreLookahead *bm, bool not_at_start) override
struct v8::internal::ActionNode::@137::@144 u_modify_flags
static ActionNode * IncrementRegister(int reg, RegExpNode *on_success)
struct v8::internal::ActionNode::@137::@138 u_store_register
void Emit(RegExpCompiler *compiler, Trace *trace) override
struct v8::internal::ActionNode::@137::@139 u_increment_register
void GetQuickCheckDetails(QuickCheckDetails *details, RegExpCompiler *compiler, int filled_in, bool not_at_start) override
static ActionNode * BeginPositiveSubmatch(int stack_pointer_reg, int position_reg, RegExpNode *body, ActionNode *success_node)
struct v8::internal::ActionNode::@137::@141 u_submatch
struct v8::internal::ActionNode::@137::@142 u_empty_match_check
static ActionNode * ClearCaptures(Interval range, RegExpNode *on_success)
AlternativeGeneration * at(int i)
ZoneList< AlternativeGeneration * > alt_gens_
void EnsureAnalyzed(RegExpNode *that)
Analysis(Isolate *isolate, bool is_one_byte, RegExpFlags flags)
void VisitAssertion(AssertionNode *that) override
void VisitLoopChoice(LoopChoiceNode *that) override
DISALLOW_IMPLICIT_CONSTRUCTORS(Analysis)
void VisitText(TextNode *that) override
void VisitChoice(ChoiceNode *that) override
void fail(RegExpError error)
void set_flags(RegExpFlags flags)
void VisitEnd(EndNode *that) override
void VisitAction(ActionNode *that) override
RegExpFlags flags() const
void VisitBackReference(BackReferenceNode *that) override
void VisitNegativeLookaroundChoice(NegativeLookaroundChoiceNode *that) override
Isolate * isolate() const
void EmitBoundaryCheck(RegExpCompiler *compiler, Trace *trace)
void FillInBMInfo(Isolate *isolate, int offset, int budget, BoyerMooreLookahead *bm, bool not_at_start) override
void BacktrackIfPrevious(RegExpCompiler *compiler, Trace *trace, IfPrevious backtrack_if_previous)
void Emit(RegExpCompiler *compiler, Trace *trace) override
void GetQuickCheckDetails(QuickCheckDetails *details, RegExpCompiler *compiler, int filled_in, bool not_at_start) override
void FillInBMInfo(Isolate *isolate, int offset, int budget, BoyerMooreLookahead *bm, bool not_at_start) override
void Emit(RegExpCompiler *compiler, Trace *trace) override
void SetInterval(int map_number, const Interval &interval)
ZoneList< BoyerMoorePositionInfo * > * bitmaps_
int GetSkipTable(int min_lookahead, int max_lookahead, DirectHandle< ByteArray > boolean_skip_table, DirectHandle< ByteArray > nibble_table=DirectHandle< ByteArray >{})
BoyerMoorePositionInfo * at(int i)
BoyerMooreLookahead(int length, RegExpCompiler *compiler, Zone *zone)
bool FindWorthwhileInterval(int *from, int *to)
int FindBestInterval(int max_number_of_chars, int old_biggest_points, int *from, int *to)
void EmitSkipInstructions(RegExpMacroAssembler *masm)
void Set(int map_number, int character)
void SetInterval(const Interval &interval)
static void Canonicalize(ZoneList< CharacterRange > *ranges)
base::uc32 from() const
Definition regexp-ast.h:140
base::uc32 to() const
Definition regexp-ast.h:141
static V8_EXPORT_PRIVATE void AddCaseEquivalents(Isolate *isolate, Zone *zone, ZoneList< CharacterRange > *ranges, bool is_one_byte)
static void ClampToOneByte(ZoneList< CharacterRange > *ranges)
static ZoneList< CharacterRange > * List(Zone *zone, CharacterRange range)
Definition regexp-ast.h:114
static CharacterRange Range(base::uc32 from, base::uc32 to)
Definition regexp-ast.h:105
int EmitOptimizedUnanchoredSearch(RegExpCompiler *compiler, Trace *trace)
ZoneList< GuardedAlternative > * alternatives_
void AddAlternative(GuardedAlternative node)
int GreedyLoopTextLengthForAlternative(GuardedAlternative *alternative)
ZoneList< GuardedAlternative > * alternatives()
int CalculatePreloadCharacters(RegExpCompiler *compiler, int eats_at_least)
Trace * EmitGreedyLoop(RegExpCompiler *compiler, Trace *trace, AlternativeGenerationList *alt_gens, PreloadState *preloads, GreedyLoopState *greedy_loop_state, int text_length)
void FillInBMInfo(Isolate *isolate, int offset, int budget, BoyerMooreLookahead *bm, bool not_at_start) override
void GetQuickCheckDetails(QuickCheckDetails *details, RegExpCompiler *compiler, int characters_filled_in, bool not_at_start) override
virtual bool try_to_emit_quick_check_for_alternative(bool is_first)
void EmitChoices(RegExpCompiler *compiler, AlternativeGenerationList *alt_gens, int first_choice, Trace *trace, PreloadState *preloads)
RegExpNode * FilterOneByte(int depth, RegExpCompiler *compiler) override
void EmitOutOfLineContinuation(RegExpCompiler *compiler, Trace *trace, GuardedAlternative alternative, AlternativeGeneration *alt_gen, int preload_characters, bool next_expects_preload)
void SetUpPreLoad(RegExpCompiler *compiler, Trace *current_trace, PreloadState *preloads)
void AssertGuardsMentionRegisters(Trace *trace)
void GenerateGuard(RegExpMacroAssembler *macro_assembler, Guard *guard, Trace *trace)
void Emit(RegExpCompiler *compiler, Trace *trace) override
V8_INLINE bool is_null() const
Definition handles.h:693
V8_EXPORT_PRIVATE bool Get(unsigned value) const
void Set(unsigned value, Zone *zone)
void Emit(RegExpCompiler *compiler, Trace *trace) override
Handle< ByteArray > NewByteArray(int length, AllocationType allocation=AllocationType::kYoung)
Isolate * isolate() const
Definition factory.h:1281
int Frequency(int in_character)
void AddGuard(Guard *guard, Zone *zone)
ZoneList< Guard * > * guards()
v8::internal::Factory * factory()
Definition isolate.h:1527
IterationDecrementer(const IterationDecrementer &)=delete
IterationDecrementer(LoopChoiceNode *node)
IterationDecrementer & operator=(const IterationDecrementer &)=delete
V8_INLINE bool is_linked() const
Definition label.h:66
void AddContinueAlternative(GuardedAlternative alt)
void AddLoopAlternative(GuardedAlternative alt)
RegExpNode * FilterOneByte(int depth, RegExpCompiler *compiler) override
void GetQuickCheckDetails(QuickCheckDetails *details, RegExpCompiler *compiler, int characters_filled_in, bool not_at_start) override
void Emit(RegExpCompiler *compiler, Trace *trace) override
void FillInBMInfo(Isolate *isolate, int offset, int budget, BoyerMooreLookahead *bm, bool not_at_start) override
void GetQuickCheckDetailsFromLoopEntry(QuickCheckDetails *details, RegExpCompiler *compiler, int characters_filled_in, bool not_at_start) override
EatsAtLeastInfo EatsAtLeastFromLoopEntry() override
LoopInitializationMarker & operator=(const LoopInitializationMarker &)=delete
LoopInitializationMarker(const LoopInitializationMarker &)=delete
void GetQuickCheckDetails(QuickCheckDetails *details, RegExpCompiler *compiler, int characters_filled_in, bool not_at_start) override
RegExpNode * FilterOneByte(int depth, RegExpCompiler *compiler) override
void Emit(RegExpCompiler *compiler, Trace *trace) override
void Merge(QuickCheckDetails *other, int from_index)
void set_characters(int characters)
Position * positions(int index)
void Advance(int by, bool one_byte)
RecursionCheck(RegExpCompiler *compiler)
base::Vector< const base::uc16 > data() const
Definition regexp-ast.h:483
void AppendToText(RegExpText *text, Zone *zone) override
static RegExpNode * ToNode(RegExpTree *body, int index, RegExpCompiler *compiler, RegExpNode *on_success)
void AppendToText(RegExpText *text, Zone *zone) override
ZoneList< CharacterRange > * ranges(Zone *zone)
Definition regexp-ast.h:353
void set_flags(RegExpFlags flags)
ZoneVector< RegExpNode * > * work_list_
RegExpMacroAssembler * macro_assembler()
RegExpNode * OptionallyStepBackToLeadSurrogate(RegExpNode *on_success)
RegExpMacroAssembler * macro_assembler_
RegExpNode * PreprocessRegExp(RegExpCompileData *data, bool is_one_byte)
RegExpCompiler(Isolate *isolate, Zone *zone, int capture_count, RegExpFlags flags, bool is_one_byte)
FrequencyCollator * frequency_collator()
CompilationResult Assemble(Isolate *isolate, RegExpMacroAssembler *assembler, RegExpNode *start, int capture_count, DirectHandle< String > pattern)
virtual void SkipUntilBitInTable(int cp_offset, Handle< ByteArray > table, Handle< ByteArray > nibble_table, int advance_by)=0
virtual void Bind(Label *label)=0
virtual void PushBacktrack(Label *label)=0
virtual void CheckCharacterAfterAnd(unsigned c, unsigned and_with, Label *on_equal)=0
virtual void CheckCharacter(unsigned c, Label *on_equal)=0
virtual void ReadCurrentPositionFromRegister(int reg)=0
virtual DirectHandle< HeapObject > GetCode(DirectHandle< String > source, RegExpFlags flags)=0
virtual void IfRegisterLT(int reg, int comparand, Label *if_lt)=0
virtual void CheckGreedyLoop(Label *on_tos_equals_current_position)=0
virtual void AdvanceCurrentPosition(int by)=0
V8_EXPORT_PRIVATE void LoadCurrentCharacter(int cp_offset, Label *on_end_of_input, bool check_bounds=true, int characters=1, int eats_at_least=kUseCharactersValue)
virtual void BindJumpTarget(Label *label)
virtual void GoTo(Label *label)=0
virtual bool SkipUntilBitInTableUseSimd(int advance_by)
virtual void IfRegisterGE(int reg, int comparand, Label *if_ge)=0
void set_bm_info(bool not_at_start, BoyerMooreLookahead *bm)
virtual EatsAtLeastInfo EatsAtLeastFromLoopEntry()
virtual void FillInBMInfo(Isolate *isolate, int offset, int budget, BoyerMooreLookahead *bm, bool not_at_start)
virtual RegExpNode * GetSuccessorOfOmnivorousTextNode(RegExpCompiler *compiler)
BoyerMooreLookahead * bm_info(bool not_at_start)
bool KeepRecursing(RegExpCompiler *compiler)
LimitResult LimitVersions(RegExpCompiler *compiler, Trace *trace)
void SaveBMInfo(BoyerMooreLookahead *bm, bool not_at_start, int offset)
uint32_t EatsAtLeast(bool not_at_start)
bool EmitQuickCheck(RegExpCompiler *compiler, Trace *bounds_check_trace, Trace *trace, bool preload_has_checked_bounds, Label *on_possible_success, QuickCheckDetails *details_return, bool fall_through_on_failure, ChoiceNode *predecessor)
virtual RegExpNode * FilterOneByte(int depth, RegExpCompiler *compiler)
virtual void GetQuickCheckDetailsFromLoopEntry(QuickCheckDetails *details, RegExpCompiler *compiler, int characters_filled_in, bool not_at_start)
virtual void Emit(RegExpCompiler *compiler, Trace *trace)=0
static const int kRecursionBudget
static const int kNodeIsTooComplexForGreedyLoops
virtual void GetQuickCheckDetails(QuickCheckDetails *details, RegExpCompiler *compiler, int characters_filled_in, bool not_at_start)=0
static RegExpNode * ToNode(int min, int max, bool is_greedy, RegExpTree *body, RegExpCompiler *compiler, RegExpNode *on_success, bool not_at_start=false)
void AppendToText(RegExpText *text, Zone *zone) override
ZoneList< TextElement > * elements()
Definition regexp-ast.h:538
virtual void AppendToText(RegExpText *text, Zone *zone)
static const int kInfinity
Definition regexp-ast.h:196
RegExpNode * FilterSuccessor(int depth, RegExpCompiler *compiler)
RegExpNode * FilterOneByte(int depth, RegExpCompiler *compiler) override
static const uint32_t kMaxOneByteCharCodeU
Definition string.h:501
static const int32_t kMaxOneByteCharCode
Definition string.h:500
static const uint32_t kMaxUtf16CodeUnitU
Definition string.h:503
static const base::uc32 kMaxCodePoint
Definition string.h:504
static const int kMaxUtf16CodeUnit
Definition string.h:502
TextType text_type() const
Definition regexp-ast.h:501
TextElement(TextType text_type, RegExpTree *tree)
Definition regexp-ast.h:516
RegExpClassRanges * class_ranges() const
Definition regexp-ast.h:510
static TextElement ClassRanges(RegExpClassRanges *class_ranges)
void set_cp_offset(int cp_offset)
Definition regexp-ast.h:498
static TextElement Atom(RegExpAtom *atom)
RegExpAtom * atom() const
Definition regexp-ast.h:505
static TextNode * CreateForSurrogatePair(Zone *zone, CharacterRange lead, ZoneList< CharacterRange > *trail_ranges, bool read_backward, RegExpNode *on_success)
static TextNode * CreateForCharacterRanges(Zone *zone, ZoneList< CharacterRange > *ranges, bool read_backward, RegExpNode *on_success)
RegExpNode * GetSuccessorOfOmnivorousTextNode(RegExpCompiler *compiler) override
ZoneList< TextElement > * elements()
void GetQuickCheckDetails(QuickCheckDetails *details, RegExpCompiler *compiler, int characters_filled_in, bool not_at_start) override
int GreedyLoopTextLength() override
RegExpNode * FilterOneByte(int depth, RegExpCompiler *compiler) override
void Emit(RegExpCompiler *compiler, Trace *trace) override
void TextEmitPass(RegExpCompiler *compiler, TextEmitPassType pass, bool preloaded, Trace *trace, bool first_element_checked, int *checked_up_to)
void FillInBMInfo(Isolate *isolate, int offset, int budget, BoyerMooreLookahead *bm, bool not_at_start) override
void MakeCaseIndependent(Isolate *isolate, bool is_one_byte, RegExpFlags flags)
ActionNode::ActionType action_type()
QuickCheckDetails * quick_check_performed()
void PerformDeferredActions(RegExpMacroAssembler *macro, int max_register, const DynamicBitSet &affected_registers, DynamicBitSet *registers_to_pop, DynamicBitSet *registers_to_clear, Zone *zone)
void set_stop_node(RegExpNode *node)
void Flush(RegExpCompiler *compiler, RegExpNode *successor)
void set_at_start(TriBool at_start)
DeferredAction * actions_
void AdvanceCurrentPositionInTrace(int by, RegExpCompiler *compiler)
void set_characters_preloaded(int count)
void add_action(DeferredAction *new_action)
int FindAffectedRegisters(DynamicBitSet *affected_registers, Zone *zone)
void set_quick_check_performed(QuickCheckDetails *d)
DeferredAction * actions()
void RestoreAffectedRegisters(RegExpMacroAssembler *macro, int max_register, const DynamicBitSet &registers_to_pop, const DynamicBitSet &registers_to_clear)
bool mentions_reg(int reg)
bool GetStoredPosition(int reg, int *cp_offset)
void set_backtrack(Label *backtrack)
void set_bound_checked_up_to(int to)
QuickCheckDetails quick_check_performed_
void set_loop_label(Label *label)
void set_flush_budget(int to)
static V8_EXPORT_PRIVATE void FatalProcessOutOfMemory(Isolate *isolate, const char *location, const OOMDetails &details=kNoOOMDetails)
V8_INLINE int length() const
Definition zone-list.h:101
T & at(int i) const
Definition zone-list.h:88
base::Vector< const T > ToConstVector() const
Definition zone-list.h:110
void Add(const T &element, Zone *zone)
T * New(Args &&... args)
Definition zone.h:114
Zone * zone_
Register const value_
Handle< Code > code
JSRegExp::Flags flags_
const MapRef map_
int start
uint32_t count
Handle< SharedFunctionInfo > info
int end
Label label
int32_t offset
std::string pattern
Node * node
ZoneVector< RpoNumber > & result
LiftoffRegister reg
Label label_
std::vector< Point > points
Point to
uint32_t const mask
const int length_
Definition mul-fft.cc:473
int int32_t
Definition unicode.cc:40
unsigned int uchar
Definition unicode.h:21
constexpr unsigned CountTrailingZeros(T value)
Definition bits.h:144
constexpr bool IsPowerOfTwo(T value)
Definition bits.h:187
uint32_t uc32
Definition strings.h:19
bool contains(const C &container, const T &element)
uint16_t uc16
Definition strings.h:18
constexpr int kMinInt
Definition globals.h:375
constexpr int kInt64Size
Definition globals.h:402
constexpr int kBitsPerByte
Definition globals.h:682
static const base::uc32 kTrailSurrogateStart
constexpr bool IsEitherUnicode(RegExpFlags f)
ContainedInLattice Combine(ContainedInLattice a, ContainedInLattice b)
bool NeedsUnicodeCaseEquivalents(RegExpFlags flags)
BUILTIN_FP_CALL BUILTIN_FP_CALL BUILTIN_FP_CALL BUILTIN_FP_CALL BUILTIN_FP_CALL BUILTIN_FP_CALL BUILTIN_FP_CALL BUILTIN_FP_CALL BUILTIN_FP_CALL BUILTIN_FP_CALL int character
Flag flags[]
Definition flags.cc:3797
constexpr int kNoRegister
static const base::uc32 kTrailSurrogateEnd
V8_EXPORT_PRIVATE FlagValues v8_flags
RegExpError AnalyzeRegExp(Isolate *isolate, bool is_one_byte, RegExpFlags flags, RegExpNode *node)
static const base::uc32 kLeadSurrogateStart
static const base::uc32 kLeadSurrogateEnd
bool RangeContainsLatin1Equivalents(CharacterRange range)
Local< T > Handle
OptimizedCompilationInfo * info_
Definition pipeline.cc:305
uint32_t recursion_depth
RegExpCompiler * compiler_
#define STATIC_FOR_EACH(expr)
#define DEFINE_ACCEPT(Type)
#define FOR_EACH_NODE_TYPE(VISIT)
ZoneList< base::uc16 > * characters_
SmallRegExpTreeVector alternatives_
Node * node_
#define FATAL(...)
Definition logging.h:47
#define DCHECK_LE(v1, v2)
Definition logging.h:490
#define DCHECK_NULL(val)
Definition logging.h:491
#define CHECK_IMPLIES(lhs, rhs)
#define CHECK(condition)
Definition logging.h:124
#define DCHECK_NOT_NULL(val)
Definition logging.h:492
#define DCHECK_IMPLIES(v1, v2)
Definition logging.h:493
#define DCHECK_NE(v1, v2)
Definition logging.h:486
#define DCHECK_GE(v1, v2)
Definition logging.h:488
#define DCHECK(condition)
Definition logging.h:482
#define DCHECK_LT(v1, v2)
Definition logging.h:489
#define DCHECK_EQ(v1, v2)
Definition logging.h:485
#define DCHECK_GT(v1, v2)
Definition logging.h:487
#define USE(...)
Definition macros.h:293
#define V8_EXPORT_PRIVATE
Definition macros.h:460
static const int kEatsAtLeastNotYetInitialized
std::unique_ptr< ValueMirror > value