simdjson  3.11.0
Ridiculously Fast JSON
simdjson::dom::parser Class Reference

A persistent document parser.

Public Member Functions

simdjson_inline parser (size_t max_capacity=SIMDJSON_MAXSIZE_BYTES) noexcept
 Create a JSON parser.

simdjson_inline parser (parser &&other) noexcept
 Take another parser's buffers and state.

simdjson_inline parser & operator= (parser &&other) noexcept
 Take another parser's buffers and state.
 
 ~parser ()=default
 Deallocate the JSON parser.
 
simdjson_result< element > load (const std::string &path) & noexcept
 Load a JSON document from a file and return a reference to it.

simdjson_result< element > load (const std::string &path) && =delete

simdjson_result< element > load_into_document (document &doc, const std::string &path) & noexcept
 Load a JSON document from a file into a provided document instance and return a temporary reference to it.

simdjson_result< element > load_into_document (document &doc, const std::string &path) && =delete
 
simdjson_result< element > parse (const uint8_t *buf, size_t len, bool realloc_if_needed=true) & noexcept
 Parse a JSON document and return a temporary reference to it.

simdjson_result< element > parse (const uint8_t *buf, size_t len, bool realloc_if_needed=true) && =delete

simdjson_inline simdjson_result< element > parse (const char *buf, size_t len, bool realloc_if_needed=true) & noexcept

simdjson_inline simdjson_result< element > parse (const char *buf, size_t len, bool realloc_if_needed=true) && =delete

simdjson_inline simdjson_result< element > parse (const std::string &s) & noexcept

simdjson_inline simdjson_result< element > parse (const std::string &s) && =delete

simdjson_inline simdjson_result< element > parse (const padded_string &s) & noexcept

simdjson_inline simdjson_result< element > parse (const padded_string &s) && =delete

simdjson_inline simdjson_result< element > parse (const padded_string_view &v) & noexcept

simdjson_inline simdjson_result< element > parse (const padded_string_view &v) && =delete
 
simdjson_result< element > parse_into_document (document &doc, const uint8_t *buf, size_t len, bool realloc_if_needed=true) & noexcept
 Parse a JSON document into a provided document instance and return a temporary reference to it.

simdjson_result< element > parse_into_document (document &doc, const uint8_t *buf, size_t len, bool realloc_if_needed=true) && =delete

simdjson_inline simdjson_result< element > parse_into_document (document &doc, const char *buf, size_t len, bool realloc_if_needed=true) & noexcept

simdjson_inline simdjson_result< element > parse_into_document (document &doc, const char *buf, size_t len, bool realloc_if_needed=true) && =delete

simdjson_inline simdjson_result< element > parse_into_document (document &doc, const std::string &s) & noexcept

simdjson_inline simdjson_result< element > parse_into_document (document &doc, const std::string &s) && =delete

simdjson_inline simdjson_result< element > parse_into_document (document &doc, const padded_string &s) & noexcept

simdjson_inline simdjson_result< element > parse_into_document (document &doc, const padded_string &s) && =delete
 
simdjson_result< document_stream > load_many (const std::string &path, size_t batch_size=dom::DEFAULT_BATCH_SIZE) noexcept
 Load a file containing many JSON documents.

simdjson_result< document_stream > parse_many (const uint8_t *buf, size_t len, size_t batch_size=dom::DEFAULT_BATCH_SIZE) noexcept
 Parse a buffer containing many JSON documents.

simdjson_result< document_stream > parse_many (const char *buf, size_t len, size_t batch_size=dom::DEFAULT_BATCH_SIZE) noexcept

simdjson_result< document_stream > parse_many (const std::string &s, size_t batch_size=dom::DEFAULT_BATCH_SIZE) noexcept

simdjson_result< document_stream > parse_many (const std::string &&s, size_t batch_size)=delete

simdjson_result< document_stream > parse_many (const padded_string &s, size_t batch_size=dom::DEFAULT_BATCH_SIZE) noexcept

simdjson_result< document_stream > parse_many (const padded_string &&s, size_t batch_size)=delete
 
simdjson_warn_unused error_code allocate (size_t capacity, size_t max_depth=DEFAULT_MAX_DEPTH) noexcept
 Ensure this parser has enough memory to process JSON documents up to capacity bytes in length and max_depth depth.

simdjson_inline size_t capacity () const noexcept
 The largest document this parser can support without reallocating.

simdjson_inline size_t max_capacity () const noexcept
 The largest document this parser can automatically support.

simdjson_pure simdjson_inline size_t max_depth () const noexcept
 The maximum level of nested objects and arrays supported by this parser.

simdjson_inline void set_max_capacity (size_t max_capacity) noexcept
 Set max_capacity.
 

Public Attributes

bool threaded {false}
 When SIMDJSON_THREADS_ENABLED is not defined, the parser instance cannot use threads.
 

Detailed Description

A persistent document parser.

The parser is designed to be reused, holding the internal buffers necessary to do parsing, as well as memory for a single document. The parsed document is overwritten on each parse.

This class cannot be copied, only moved, to avoid unintended allocations.

Note
Moving a parser instance may invalidate "dom::element" instances. If you need to preserve both the "dom::element" instances and the parser, consider wrapping the parser instance in a std::unique_ptr instance:

std::unique_ptr<dom::parser> parser(new dom::parser{});
auto error = parser->load(f).get(root);

You can then move std::unique_ptr safely.

Note
This is not thread safe: one parser cannot produce two documents at the same time!

Definition at line 30 of file parser.h.

Constructor & Destructor Documentation

◆ parser() [1/2]

simdjson_inline simdjson::dom::parser::parser ( size_t  max_capacity = SIMDJSON_MAXSIZE_BYTES)
explicit noexcept

Create a JSON parser.

The new parser will have zero capacity.

Parameters
max_capacity  The maximum document length the parser can automatically handle. The parser will allocate more capacity as needed (when it sees documents too big to handle) up to this amount. The parser still starts with zero capacity no matter what this number is: to allocate an initial capacity, call allocate() after constructing the parser. Defaults to SIMDJSON_MAXSIZE_BYTES (the largest single document simdjson can process).

Definition at line 23 of file parser-inl.h.

◆ parser() [2/2]

simdjson_inline simdjson::dom::parser::parser ( parser &&  other)
default noexcept

Take another parser's buffers and state.

Parameters
other  The parser to take. Its capacity is zeroed.

Member Function Documentation

◆ allocate()

simdjson_warn_unused error_code simdjson::dom::parser::allocate ( size_t  capacity,
size_t  max_depth = DEFAULT_MAX_DEPTH 
)
inline noexcept

Ensure this parser has enough memory to process JSON documents up to capacity bytes in length and max_depth depth.

Parameters
capacity  The new capacity.
max_depth  The new max_depth. Defaults to DEFAULT_MAX_DEPTH.
Returns
The error, if there is one.

Definition at line 199 of file parser-inl.h.

◆ capacity()

simdjson_inline size_t simdjson::dom::parser::capacity ( ) const
noexcept

The largest document this parser can support without reallocating.

Returns
Current capacity, in bytes.

Definition at line 188 of file parser-inl.h.

◆ load()

simdjson_result< element > simdjson::dom::parser::load ( const std::string &  path) &
inline noexcept

Load a JSON document from a file and return a reference to it.

dom::parser parser;
const element doc = parser.load("jsonexamples/twitter.json");

The function is eager: the file's content is loaded in memory inside the parser instance and immediately parsed. The file can be deleted after the parser.load call.

IMPORTANT: Document Lifetime

The JSON document still lives in the parser: this is the most efficient way to parse JSON documents because it reuses the same buffers, but you must use the document before you destroy the parser or call parse() again.

Moving the parser instance is safe, but it invalidates the element instances. You may store the parser instance without moving it by wrapping it inside a unique_ptr instance like so: std::unique_ptr<dom::parser> parser(new dom::parser{});.

Parser Capacity

If the parser's current capacity is less than the file length, it will allocate enough capacity to handle it (up to max_capacity).

Windows and Unicode

Windows users who need to read files with non-ANSI characters in the name should set their code page to UTF-8 (65001) before calling this function. This should be the default with Windows 11 and later. Further, they may use the AreFileApisANSI function to determine whether the filename is interpreted using the ANSI or the system default OEM codepage, and they may call SetFileApisToOEM accordingly.

Parameters
path  The path to load.
Returns
The document, or an error:
  • IO_ERROR if there was an error opening or reading the file. Be mindful that on some 32-bit systems, the file size might be limited to 2 GB.
  • MEMALLOC if the parser does not have enough capacity and memory allocation fails.
  • CAPACITY if the parser does not have enough capacity and len > max_capacity.
  • other json errors if parsing fails. You should not rely on these errors to always be the same for the same document: they may vary under runtime dispatch (so they may vary depending on your system and hardware).

Definition at line 94 of file parser-inl.h.

◆ load_into_document()

simdjson_result< element > simdjson::dom::parser::load_into_document ( document &  doc,
const std::string &  path 
) &
inline noexcept

Load a JSON document from a file into a provided document instance and return a temporary reference to it.

It is similar to the function load except that instead of parsing into the internal document instance associated with the parser, it allows the user to provide a document instance.

dom::parser parser;
dom::document doc;
element doc_root = parser.load_into_document(doc, "jsonexamples/twitter.json");

The function is eager: the file's content is loaded in memory inside the parser instance and immediately parsed. The file can be deleted after the parser.load_into_document call.

IMPORTANT: Document Lifetime

After the call to load_into_document, the parser is no longer needed.

The JSON document lives in the document instance: you must keep the document instance alive while you navigate through it (i.e., use the returned value from load_into_document). You are encouraged to reuse the document instance many times with new data to avoid reallocations:

dom::document doc;
element doc_root1 = parser.load_into_document(doc, "jsonexamples/twitter.json");
//... doc_root1 is a pointer inside doc
element doc_root2 = parser.load_into_document(doc, "jsonexamples/twitter.json");
//... doc_root2 is a pointer inside doc
// at this point doc_root1 is no longer safe

Moving the document instance is safe, but it invalidates the element instances. After moving a document, you can recover safe access to the document root with its root() method.

Parameters
doc  The document instance where the parsed data will be stored (on success).
path  The path to load.
Returns
The document, or an error:
  • IO_ERROR if there was an error opening or reading the file. Be mindful that on some 32-bit systems, the file size might be limited to 2 GB.
  • MEMALLOC if the parser does not have enough capacity and memory allocation fails.
  • CAPACITY if the parser does not have enough capacity and len > max_capacity.
  • other json errors if parsing fails. You should not rely on these errors to always be the same for the same document: they may vary under runtime dispatch (so they may vary depending on your system and hardware).

Definition at line 98 of file parser-inl.h.

◆ load_many()

simdjson_result< document_stream > simdjson::dom::parser::load_many ( const std::string &  path,
size_t  batch_size = dom::DEFAULT_BATCH_SIZE 
)
inline noexcept

Load a file containing many JSON documents.

dom::parser parser;
for (const element doc : parser.load_many(path)) {
  cout << std::string(doc["title"]) << endl;
}

The file is loaded in memory and can be safely deleted after the parser.load_many(path) function has returned. The memory is held by the parser instance.

The function is lazy: no more than one JSON document may be parsed at a time, and possibly no document has been parsed by the time parser.load_many(path) returns.

If there is a UTF-8 BOM, the parser skips it.

Format

The file must contain a series of one or more JSON documents, concatenated into a single buffer, separated by whitespace. It effectively parses until it has a fully valid document, then starts parsing the next document at that point. (It does this with more parallelism and lookahead than you might think, though.)

Documents that consist of an object or array may omit the whitespace between them, concatenating with no separator. Documents that consist of a single primitive (i.e. documents that are not arrays or objects) MUST be separated with whitespace.

The documents must not exceed batch_size bytes (by default 1MB) or they will fail to parse. Setting batch_size to excessively large or excessively small values may negatively impact performance.

Error Handling

All errors are returned during iteration: if there is a global error such as memory allocation, it will be yielded as the first result. Iteration always stops after the first error.

As with all other simdjson methods, non-exception error handling is readily available through the same interface, requiring you to check the error before using the document:

dom::parser parser;
dom::document_stream docs;
auto error = parser.load_many(path).get(docs);
if (error) { cerr << error << endl; exit(1); }
for (auto doc : docs) {
  std::string_view title;
  if ((error = doc["title"].get(title))) { cerr << error << endl; exit(1); }
  cout << title << endl;
}

Threads

When compiled with SIMDJSON_THREADS_ENABLED, this method will use a single thread under the hood to do some lookahead.

Parser Capacity

If the parser's current capacity is less than batch_size, it will allocate enough capacity to handle it (up to max_capacity).

Parameters
path  File name pointing at the concatenated JSON to parse.
batch_size  The batch size to use. MUST be larger than the largest document. The sweet spot is cache-related: small enough to fit in cache, yet big enough to parse as many documents as possible in one tight loop. Defaults to 1MB (as simdjson::dom::DEFAULT_BATCH_SIZE), which has been a reasonable sweet spot in our tests. If you set the batch_size to a value smaller than simdjson::dom::MINIMAL_BATCH_SIZE (currently 32 bytes), it will be replaced by simdjson::dom::MINIMAL_BATCH_SIZE.
Returns
The stream, or an error. An empty input will yield 0 documents rather than an EMPTY error. Errors:
  • IO_ERROR if there was an error opening or reading the file.
  • MEMALLOC if the parser does not have enough capacity and memory allocation fails.
  • CAPACITY if the parser does not have enough capacity and batch_size > max_capacity.
  • other json errors if parsing fails. You should not rely on these errors to always be the same for the same document: they may vary under runtime dispatch (so they may vary depending on your system and hardware).

Definition at line 105 of file parser-inl.h.

◆ max_capacity()

simdjson_inline size_t simdjson::dom::parser::max_capacity ( ) const
noexcept

The largest document this parser can automatically support.

The parser may reallocate internal buffers as needed up to this amount.

Returns
Maximum capacity, in bytes.

Definition at line 191 of file parser-inl.h.

◆ max_depth()

simdjson_pure simdjson_inline size_t simdjson::dom::parser::max_depth ( ) const
noexcept

The maximum level of nested objects and arrays supported by this parser.

Returns
Maximum depth, in levels of nesting.

Definition at line 194 of file parser-inl.h.

◆ operator=()

simdjson_inline parser & simdjson::dom::parser::operator= ( parser &&  other)
default noexcept

Take another parser's buffers and state.

Parameters
other  The parser to take. Its capacity is zeroed.

◆ parse()

simdjson_result< element > simdjson::dom::parser::parse ( const uint8_t *  buf,
size_t  len,
bool  realloc_if_needed = true 
) &
inline noexcept

Parse a JSON document and return a temporary reference to it.

dom::parser parser;
element doc_root = parser.parse(buf, len);

The function eagerly parses the input: the input can be modified and discarded after the parser.parse(buf, len) call has completed.

IMPORTANT: Document Lifetime

The JSON document still lives in the parser: this is the most efficient way to parse JSON documents because it reuses the same buffers, but you must use the document before you destroy the parser or call parse() again.

Moving the parser instance is safe, but it invalidates the element instances. You may store the parser instance without moving it by wrapping it inside a unique_ptr instance like so: std::unique_ptr<dom::parser> parser(new dom::parser{});.

REQUIRED: Buffer Padding

The buffer must have at least SIMDJSON_PADDING extra allocated bytes. It does not matter what those bytes are initialized to, as long as they are allocated. These bytes will be read: if you are using a sanitizer that verifies that no uninitialized byte is read, then you should initialize the SIMDJSON_PADDING bytes to avoid runtime warnings.

If realloc_if_needed is true (the default), it is assumed that the buffer does not have enough padding, and it is copied into an enlarged temporary buffer before parsing. Thus the following is safe:

const char *json = R"({"key":"value"})";
const size_t json_len = std::strlen(json);
simdjson::dom::parser parser;
simdjson::dom::element element = parser.parse(json, json_len);

If you set realloc_if_needed to false (e.g., parser.parse(json, json_len, false)), you must provide a buffer with at least SIMDJSON_PADDING extra bytes at the end. The benefit of setting realloc_if_needed to false is that you avoid a temporary memory allocation and a copy.

The padded bytes may be read. It is not important how you initialize these bytes though we recommend a sensible default like null character values or spaces. For example, the following low-level code is safe:

const char *json = R"({"key":"value"})";
const size_t json_len = std::strlen(json);
std::unique_ptr<char[]> padded_json_copy{new char[json_len + SIMDJSON_PADDING]};
std::memcpy(padded_json_copy.get(), json, json_len);
std::memset(padded_json_copy.get() + json_len, '\0', SIMDJSON_PADDING);
simdjson::dom::parser parser;
simdjson::dom::element element = parser.parse(padded_json_copy.get(), json_len, false);

std::string references

If you pass a mutable std::string reference (std::string&), the parser will seek to extend its capacity to SIMDJSON_PADDING bytes beyond the end of the string.

Whenever you pass a std::string reference, the parser will access the bytes beyond the end of the string but before the end of the allocated memory (std::string::capacity()). If you are using a sanitizer that checks for reading uninitialized bytes or std::string's container-overflow checks, you may encounter sanitizer warnings. You can safely ignore these warnings. Or you can call simdjson::pad(std::string&) to pad the string with SIMDJSON_PADDING spaces: this function returns a simdjson::padded_string_view which can be passed to the parser's parse function:

std::string json = R"({ "foo": 1 })";
element doc = parser.parse(simdjson::pad(json));

Parser Capacity

If the parser's current capacity is less than len, it will allocate enough capacity to handle it (up to max_capacity).

Parameters
buf  The JSON to parse. Must have at least len + SIMDJSON_PADDING allocated bytes, unless realloc_if_needed is true.
len  The length of the JSON.
realloc_if_needed  Whether to reallocate and enlarge the JSON buffer to add padding.
Returns
An element pointing at the root of the document, or an error:
  • MEMALLOC if realloc_if_needed is true or the parser does not have enough capacity, and memory allocation fails.
  • CAPACITY if the parser does not have enough capacity and len > max_capacity.
  • other json errors if parsing fails. You should not rely on these errors to always be the same for the same document: they may vary under runtime dispatch (so they may vary depending on your system and hardware).

Definition at line 153 of file parser-inl.h.

◆ parse_into_document()

simdjson_result< element > simdjson::dom::parser::parse_into_document ( document &  doc,
const uint8_t *  buf,
size_t  len,
bool  realloc_if_needed = true 
) &
inline noexcept

Parse a JSON document into a provided document instance and return a temporary reference to it.

It is similar to the function parse except that instead of parsing into the internal document instance associated with the parser, it allows the user to provide a document instance.

dom::parser parser;
dom::document doc;
element doc_root = parser.parse_into_document(doc, buf, len);

The function eagerly parses the input: the input can be modified and discarded after the parser.parse_into_document(doc, buf, len) call has completed.

IMPORTANT: Document Lifetime

After the call to parse_into_document, the parser is no longer needed.

The JSON document lives in the document instance: you must keep the document instance alive while you navigate through it (i.e., use the returned value from parse_into_document). You are encouraged to reuse the document instance many times with new data to avoid reallocations:

dom::document doc;
element doc_root1 = parser.parse_into_document(doc, buf1, len);
//... doc_root1 is a pointer inside doc
element doc_root2 = parser.parse_into_document(doc, buf1, len);
//... doc_root2 is a pointer inside doc
// at this point doc_root1 is no longer safe

Moving the document instance is safe, but it invalidates the element instances. After moving a document, you can recover safe access to the document root with its root() method.

Parameters
doc  The document instance where the parsed data will be stored (on success).
buf  The JSON to parse. Must have at least len + SIMDJSON_PADDING allocated bytes, unless realloc_if_needed is true.
len  The length of the JSON.
realloc_if_needed  Whether to reallocate and enlarge the JSON buffer to add padding.
Returns
An element pointing at the root of the document, or an error:
  • MEMALLOC if realloc_if_needed is true or the parser does not have enough capacity, and memory allocation fails.
  • CAPACITY if the parser does not have enough capacity and len > max_capacity.
  • other json errors if parsing fails. You should not rely on these errors to always be the same for the same document: they may vary under runtime dispatch (so they may vary depending on your system and hardware).

Definition at line 113 of file parser-inl.h.

◆ parse_many()

simdjson_result< document_stream > simdjson::dom::parser::parse_many ( const uint8_t *  buf,
size_t  len,
size_t  batch_size = dom::DEFAULT_BATCH_SIZE 
)
inline noexcept

Parse a buffer containing many JSON documents.

dom::parser parser;
for (element doc : parser.parse_many(buf, len)) {
  cout << std::string(doc["title"]) << endl;
}

No copy of the input buffer is made.

The function is lazy: no more than one JSON document may be parsed at a time, and possibly no document has been parsed by the time parser.parse_many(buf, len) returns.

The caller is responsible for ensuring that the input string data remains unchanged and is not deleted during the loop. In particular, the following is unsafe and will not compile:

auto docs = parser.parse_many("[\"temporary data\"]"_padded);
// here the string "[\"temporary data\"]" may no longer exist in memory
// the parser instance may not have even accessed the input yet
for (element doc : docs) {
  cout << std::string(doc["title"]) << endl;
}

The following is safe:

auto json = "[\"temporary data\"]"_padded;
auto docs = parser.parse_many(json);
for (element doc : docs) {
  cout << std::string(doc["title"]) << endl;
}

If there is a UTF-8 BOM, the parser skips it.

Format

The buffer must contain a series of one or more JSON documents, concatenated into a single buffer, separated by whitespace. It effectively parses until it has a fully valid document, then starts parsing the next document at that point. (It does this with more parallelism and lookahead than you might think, though.)

Documents that consist of an object or array may omit the whitespace between them, concatenating with no separator. Documents that consist of a single primitive (i.e. documents that are not arrays or objects) MUST be separated with whitespace.

The documents must not exceed batch_size bytes (by default 1MB) or they will fail to parse. Setting batch_size to excessively large or excessively small values may negatively impact performance.

Error Handling

All errors are returned during iteration: if there is a global error such as memory allocation, it will be yielded as the first result. Iteration always stops after the first error.

As with all other simdjson methods, non-exception error handling is readily available through the same interface, requiring you to check the error before using the document:

dom::parser parser;
dom::document_stream docs;
auto error = parser.parse_many(buf, len).get(docs);
if (error) { cerr << error << endl; exit(1); }
for (auto doc : docs) {
  std::string_view title;
  if ((error = doc["title"].get(title))) { cerr << error << endl; exit(1); }
  cout << title << endl;
}

REQUIRED: Buffer Padding

The buffer must have at least SIMDJSON_PADDING extra allocated bytes. It does not matter what those bytes are initialized to, as long as they are allocated. These bytes will be read: if you are using a sanitizer that verifies that no uninitialized byte is read, then you should initialize the SIMDJSON_PADDING bytes to avoid runtime warnings.

Threads

When compiled with SIMDJSON_THREADS_ENABLED, this method will use a single thread under the hood to do some lookahead.

Parser Capacity

If the parser's current capacity is less than batch_size, it will allocate enough capacity to handle it (up to max_capacity).

Parameters
buf  The concatenated JSON to parse. Must have at least len + SIMDJSON_PADDING allocated bytes.
len  The length of the concatenated JSON.
batch_size  The batch size to use. MUST be larger than the largest document. The sweet spot is cache-related: small enough to fit in cache, yet big enough to parse as many documents as possible in one tight loop. Defaults to 1MB (as simdjson::dom::DEFAULT_BATCH_SIZE), which has been a reasonable sweet spot in our tests.
Returns
The stream, or an error. An empty input will yield 0 documents rather than an EMPTY error. Errors:
  • MEMALLOC if the parser does not have enough capacity and memory allocation fails.
  • CAPACITY if the parser does not have enough capacity and batch_size > max_capacity.
  • other json errors if parsing fails. You should not rely on these errors to always be the same for the same document: they may vary under runtime dispatch (so they may vary depending on your system and hardware).

This is an overloaded member function, provided for convenience. It differs from the above function only in what argument(s) it accepts.

Definition at line 170 of file parser-inl.h.

◆ set_max_capacity()

simdjson_inline void simdjson::dom::parser::set_max_capacity ( size_t  max_capacity)
noexcept

Set max_capacity.

This is the largest document this parser can automatically support.

The parser may reallocate internal buffers as needed up to this amount as documents are passed to it.

Note: To avoid limiting the memory to an absurd value, such as zero or two bytes, if you try to set max_capacity to a value lower than MINIMAL_DOCUMENT_CAPACITY, then the maximal capacity is set to MINIMAL_DOCUMENT_CAPACITY.

This call will not allocate or deallocate, even if capacity is currently above max_capacity.

Parameters
max_capacity  The new maximum capacity, in bytes.

Definition at line 247 of file parser-inl.h.


The documentation for this class was generated from the following files: