v8
V8 is Google’s open source high-performance JavaScript and WebAssembly engine, written in C++.
Loading...
Searching...
No Matches
experimental-bytecode.h
Go to the documentation of this file.
1// Copyright 2020 the V8 project authors. All rights reserved.
2// Use of this source code is governed by a BSD-style license that can be
3// found in the LICENSE file.
4
5#ifndef V8_REGEXP_EXPERIMENTAL_EXPERIMENTAL_BYTECODE_H_
6#define V8_REGEXP_EXPERIMENTAL_EXPERIMENTAL_BYTECODE_H_
7
9#include "src/base/strings.h"
10#include "src/base/vector.h"
12
13// ----------------------------------------------------------------------------
14// Definition and semantics of the EXPERIMENTAL bytecode.
15// Background:
16// - Russ Cox's blog post series on regular expression matching, in particular
17// https://swtch.com/~rsc/regexp/regexp2.html
18// - The re2 regular regexp library: https://github.com/google/re2
19//
20// This comment describes the bytecode used by the experimental regexp engine
21// and its abstract semantics in terms of a VM. An implementation of the
22// semantics that avoids exponential runtime can be found in `NfaInterpreter`.
23//
24// The experimental bytecode describes a non-deterministic finite automaton. It
25// runs on a multithreaded virtual machine (VM), i.e. in several threads
26// concurrently. (These "threads" don't need to be actual operating system
27// threads.) Apart from a list of threads, the VM maintains an immutable
28// shared input string which threads can read from. Each thread is given by a
29// program counter (PC, index of the current instruction), a fixed number of
30// registers of indices into the input string, and a monotonically increasing
31// index which represents the current position within the input string.
32//
33// For the precise encoding of the instruction set, see the definition `struct
34// RegExpInstruction` below. Currently we support the following instructions:
35// - CONSUME_RANGE: Check whether the codepoint of the current character is
36// contained in a non-empty closed interval [min, max] specified in the
37// instruction payload. If false, advance to the next CONSUME_RANGE in the
38// current list, or abort this thread if this was the last range. If true,
39// advance to the next instruction after the current list of ranges.
40// - RANGE_COUNT: Check that the current character can be accepted by any of the
41// next n CONSUME_RANGE instructions, where n is specified in the instruction
42// payload. Abort this thread if none of these ranges match. Otherwise,
43// advance the input position by 1 and continue with the next instruction
44// after the n ranges.
45// - ACCEPT: Stop this thread and signify the end of a match at the current
46// input position.
47// - FORK: If executed by a thread t, spawn a new thread t0 whose register
48// values and input position agree with those of t, but whose PC value is set
49// to the value specified in the instruction payload. The register values of
50// t and t0 agree directly after the FORK, but they can diverge. Thread t
51// continues with the instruction directly after the current FORK
52// instruction.
53// - JMP: Instead of incrementing the PC value after execution of this
54// instruction by 1, set PC of this thread to the value specified in the
55// instruction payload and continue there.
56// - SET_REGISTER_TO_CP: Set a register specified in the payload to the current
57// position (CP) within the input, then continue with the next instruction.
58// - CLEAR_REGISTER: Clear the register specified in the payload by resetting
59// it to the initial value -1.
60//
61// Special care must be exercised with respect to thread priority. It is
62// possible that more than one thread executes an ACCEPT statement. The output
63// of the program is given by the contents of the matching thread's registers,
64// so this is ambiguous in case of multiple matches. To resolve the ambiguity,
65// every implementation of the VM must output the match that a backtracking
66// implementation would output (i.e. behave the same as Irregexp).
67//
68// A backtracking implementation of the VM maintains a stack of postponed
69// threads. Upon encountering a FORK statement, this VM will create a copy of
70// the current thread, set the copy's PC value according to the instruction
71// payload, and push it to the stack of postponed threads. The VM will then
72// continue execution of the current thread.
73//
74// If at some point a thread t executes a MATCH statement, the VM stops and
75// outputs the registers of t. Postponed threads are discarded. On the other
76// hand, if a thread t is aborted because some input character didn't pass a
77// check, then the VM pops the topmost postponed thread and continues execution
78// with this thread. If there are no postponed threads, then the VM outputs
79// failure, i.e. no matches.
80//
81// Equivalently, we can describe the behavior of the backtracking VM in terms
82// of priority: Threads are linearly ordered by priority, and matches generated
83// by threads with high priority must be preferred over matches generated by
84// threads with low priority, regardless of the chronological order in which
85// matches were found. If a thread t executes a FORK statement and spawns a
86// thread t0, then the priority of t0 is such that the following holds:
87// * t0 < t, i.e. t0 has lower priority than t.
88// * For all threads u such that u != t and u != t0, we have t0 < u iff t < u,
89// i.e. the t0 compares to other threads the same as t.
90// For example, if there are currently 3 threads s, t, u such that s < t < u,
91// then after t executes a fork, the thread priorities will be s < t0 < t < u.
92
93namespace v8 {
94namespace internal {
95
96// Bytecode format.
97// Currently very simple fixed-size: The opcode is encoded in the first 4
98// bytes, the payload takes another 4 bytes.
121
122 struct Uc16Range {
123 base::uc16 min; // Inclusive.
124 base::uc16 max; // Inclusive.
125 };
126
128 public:
129 LookaroundPayload() = default;
130 LookaroundPayload(uint32_t lookaround_index, bool is_positive,
132 : payload_(Type::update(
133 IsPositive::update(LookaroundIndex::encode(lookaround_index),
135 type)) {}
136
137 uint32_t index() const { return LookaroundIndex::decode(payload_); }
138 bool is_positive() const { return IsPositive::decode(payload_); }
140
141 private:
145
146 uint32_t payload_;
147 };
148
152 result.payload.consume_range = Uc16Range{min, max};
153 return result;
154 }
155
157 return ConsumeRange(0x0000, 0xFFFF);
158 }
159
161 // This is encoded as the empty CONSUME_RANGE of characters 0xFFFF <= c <=
162 // 0x0000.
163 return ConsumeRange(0xFFFF, 0x0000);
164 }
165
169 result.payload.num_ranges = num_ranges;
170 return result;
171 }
172
173 static RegExpInstruction Fork(int32_t alt_index) {
176 result.payload.pc = alt_index;
177 return result;
178 }
179
180 static RegExpInstruction Jmp(int32_t alt_index) {
182 result.opcode = JMP;
183 result.payload.pc = alt_index;
184 return result;
185 }
186
190 return result;
191 }
192
196 result.payload.register_index = register_index;
197 return result;
198 }
199
203 result.payload.assertion_type = t;
204 return result;
205 }
206
210 result.payload.register_index = register_index;
211 return result;
212 }
213
220
224 result.payload.quantifier_id = quantifier_id;
225 return result;
226 }
227
231 result.payload.group_id = group_id;
232 return result;
233 }
234
238 result.payload.lookaround_id = lookaround_id;
239 return result;
240 }
241
245 result.payload.pc = pc;
246 return result;
247 }
248
254
260
261 static RegExpInstruction StartLookaround(int lookaround_index,
262 bool is_positive,
266 result.payload.lookaround =
267 LookaroundPayload(lookaround_index, is_positive, type);
268 return result;
269 }
270
276
277 static RegExpInstruction WriteLookTable(int32_t index) {
280 result.payload.lookaround_id = index;
281 return result;
282 }
283
284 static RegExpInstruction ReadLookTable(int32_t index, bool is_positive,
288 result.payload.lookaround = LookaroundPayload(index, is_positive, type);
289 return result;
290 }
291
292 // Returns whether an instruction is `FILTER_GROUP`, `FILTER_QUANTIFIER` or
293 // `FILTER_CHILD`.
294 static bool IsFilter(const RegExpInstruction& instruction) {
295 return instruction.opcode == RegExpInstruction::Opcode::FILTER_GROUP ||
296 instruction.opcode == RegExpInstruction::Opcode::FILTER_QUANTIFIER ||
297 instruction.opcode == RegExpInstruction::Opcode::FILTER_CHILD;
298 }
299
301 union {
302 // Payload of CONSUME_RANGE:
304 // Payload of RANGE_COUNT
305 int32_t num_ranges;
306 // Payload of FORK, JMP and FILTER_CHILD, the next/forked program counter
307 // (pc):
308 int32_t pc;
309 // Payload of SET_REGISTER_TO_CP and CLEAR_REGISTER:
311 // Payload of ASSERTION:
313 // Payload of SET_QUANTIFIER_TO_CLOCK and FILTER_QUANTIFIER:
315 // Payload of FILTER_GROUP:
316 int32_t group_id;
317 // Payload of WRITE_LOOKAROUND_TABLE and FILTER_LOOKAROUND:
319 // Payload of READ_LOOKAROUND_TABLE and START_LOOKAROUND:
322 static_assert(sizeof(payload) == 4);
323};
324static_assert(sizeof(RegExpInstruction) == 8);
325// TODO(mbid,v8:10765): This is rather wasteful. We can fit the opcode in 2-3
326// bits, so the remaining 29/30 bits can be used as payload. Problem: The
327// payload of CONSUME_RANGE consists of two 16-bit values `min` and `max`, so
328// this wouldn't fit. We could encode the payload of a CONSUME_RANGE
329// instruction by the start of the interval and its length instead, and then
330// only allows lengths that fit into 14/13 bits. A longer range can then be
331// encoded as a disjunction of smaller ranges.
332//
333// Another thought: CONSUME_RANGEs are only valid if the payloads are such that
334// min <= max. Thus there are
335//
336// 2^16 + 2^16 - 1 + ... + 1
337// = 2^16 * (2^16 + 1) / 2
338// = 2^31 + 2^15
339//
340// valid payloads for a CONSUME_RANGE instruction. If we want to fit
341// instructions into 4 bytes, we would still have almost 2^31 instructions left
342// over if we encode everything as tight as possible. For example, we could
343// use another 2^29 values for JMP, another 2^29 for FORK, 1 value for ACCEPT,
344// and then still have almost 2^30 instructions left over for something like
345// zero-width assertions and captures.
346
347std::ostream& operator<<(std::ostream& os, const RegExpInstruction& inst);
348std::ostream& operator<<(std::ostream& os,
350std::ostream& operator<<(std::ostream& os,
351 const RegExpInstruction::LookaroundPayload& inst);
352
353} // namespace internal
354} // namespace v8
355
356#endif // V8_REGEXP_EXPERIMENTAL_EXPERIMENTAL_BYTECODE_H_
static constexpr T decode(U value)
Definition bit-field.h:66
LookaroundPayload(uint32_t lookaround_index, bool is_positive, RegExpLookaround::Type type)
ZoneVector< RpoNumber > & result
uint16_t uc16
Definition strings.h:18
std::ostream & operator<<(std::ostream &os, AtomicMemoryOrder order)
static RegExpInstruction Jmp(int32_t alt_index)
static RegExpInstruction Accept()
static RegExpInstruction FilterChild(int32_t pc)
static RegExpInstruction EndLoop()
static RegExpInstruction SetQuantifierToClock(int32_t quantifier_id)
static RegExpInstruction ConsumeAnyChar()
static RegExpInstruction BeginLoop()
static RegExpInstruction FilterGroup(int32_t group_id)
static RegExpInstruction Fork(int32_t alt_index)
static RegExpInstruction ConsumeRange(base::uc16 min, base::uc16 max)
static RegExpInstruction SetRegisterToCp(int32_t register_index)
static RegExpInstruction FilterLookaround(int32_t lookaround_id)
static RegExpInstruction FilterQuantifier(int32_t quantifier_id)
union v8::internal::RegExpInstruction::@129 payload
static bool IsFilter(const RegExpInstruction &instruction)
static RegExpInstruction StartLookaround(int lookaround_index, bool is_positive, RegExpLookaround::Type type)
static RegExpInstruction ClearRegister(int32_t register_index)
static RegExpInstruction ReadLookTable(int32_t index, bool is_positive, RegExpLookaround::Type type)
static RegExpInstruction WriteLookTable(int32_t index)
static RegExpInstruction RangeCount(int32_t num_ranges)
static RegExpInstruction EndLookaround()
static RegExpInstruction Assertion(RegExpAssertion::Type t)