simdjson
3.11.0
Ridiculously Fast JSON
|
A persistent document parser. More...
Public Member Functions | |
simdjson_inline | parser (size_t max_capacity=SIMDJSON_MAXSIZE_BYTES) noexcept |
Create a JSON parser. More... | |
simdjson_inline | parser (parser &&other) noexcept |
Take another parser's buffers and state. More... | |
simdjson_inline parser & | operator= (parser &&other) noexcept |
Take another parser's buffers and state. More... | |
~parser ()=default | |
Deallocate the JSON parser. | |
simdjson_result< element > | load (const std::string &path) &noexcept |
Load a JSON document from a file and return a reference to it. More... | |
simdjson_result< element > | load (const std::string &path) &&=delete |
simdjson_result< element > | load_into_document (document &doc, const std::string &path) &noexcept |
Load a JSON document from a file into a provided document instance and return a temporary reference to it. More... | |
simdjson_result< element > | load_into_document (document &doc, const std::string &path) &&=delete |
simdjson_result< element > | parse (const uint8_t *buf, size_t len, bool realloc_if_needed=true) &noexcept |
Parse a JSON document and return a temporary reference to it. More... | |
simdjson_result< element > | parse (const uint8_t *buf, size_t len, bool realloc_if_needed=true) &&=delete |
simdjson_inline simdjson_result< element > | parse (const char *buf, size_t len, bool realloc_if_needed=true) &noexcept |
simdjson_inline simdjson_result< element > | parse (const char *buf, size_t len, bool realloc_if_needed=true) &&=delete |
simdjson_inline simdjson_result< element > | parse (const std::string &s) &noexcept |
simdjson_inline simdjson_result< element > | parse (const std::string &s) &&=delete |
simdjson_inline simdjson_result< element > | parse (const padded_string &s) &noexcept |
simdjson_inline simdjson_result< element > | parse (const padded_string &s) &&=delete |
simdjson_inline simdjson_result< element > | parse (const padded_string_view &v) &noexcept |
simdjson_inline simdjson_result< element > | parse (const padded_string_view &v) &&=delete |
simdjson_result< element > | parse_into_document (document &doc, const uint8_t *buf, size_t len, bool realloc_if_needed=true) &noexcept |
Parse a JSON document into a provided document instance and return a temporary reference to it. More... | |
simdjson_result< element > | parse_into_document (document &doc, const uint8_t *buf, size_t len, bool realloc_if_needed=true) &&=delete |
simdjson_inline simdjson_result< element > | parse_into_document (document &doc, const char *buf, size_t len, bool realloc_if_needed=true) &noexcept |
simdjson_inline simdjson_result< element > | parse_into_document (document &doc, const char *buf, size_t len, bool realloc_if_needed=true) &&=delete |
simdjson_inline simdjson_result< element > | parse_into_document (document &doc, const std::string &s) &noexcept |
simdjson_inline simdjson_result< element > | parse_into_document (document &doc, const std::string &s) &&=delete |
simdjson_inline simdjson_result< element > | parse_into_document (document &doc, const padded_string &s) &noexcept |
simdjson_inline simdjson_result< element > | parse_into_document (document &doc, const padded_string &s) &&=delete |
simdjson_result< document_stream > | load_many (const std::string &path, size_t batch_size=dom::DEFAULT_BATCH_SIZE) noexcept |
Load a file containing many JSON documents. More... | |
simdjson_result< document_stream > | parse_many (const uint8_t *buf, size_t len, size_t batch_size=dom::DEFAULT_BATCH_SIZE) noexcept |
Parse a buffer containing many JSON documents. More... | |
simdjson_result< document_stream > | parse_many (const char *buf, size_t len, size_t batch_size=dom::DEFAULT_BATCH_SIZE) noexcept |
simdjson_result< document_stream > | parse_many (const std::string &s, size_t batch_size=dom::DEFAULT_BATCH_SIZE) noexcept |
simdjson_result< document_stream > | parse_many (const std::string &&s, size_t batch_size)=delete |
simdjson_result< document_stream > | parse_many (const padded_string &s, size_t batch_size=dom::DEFAULT_BATCH_SIZE) noexcept |
simdjson_result< document_stream > | parse_many (const padded_string &&s, size_t batch_size)=delete |
simdjson_warn_unused error_code | allocate (size_t capacity, size_t max_depth=DEFAULT_MAX_DEPTH) noexcept |
Ensure this parser has enough memory to process JSON documents up to capacity bytes in length and max_depth depth. More... | |
simdjson_inline size_t | capacity () const noexcept |
The largest document this parser can support without reallocating. More... | |
simdjson_inline size_t | max_capacity () const noexcept |
The largest document this parser can automatically support. More... | |
simdjson_pure simdjson_inline size_t | max_depth () const noexcept |
The maximum level of nested objects and arrays supported by this parser. More... | |
simdjson_inline void | set_max_capacity (size_t max_capacity) noexcept |
Set max_capacity. More... | |
Public Attributes | |
bool | threaded {false} |
When SIMDJSON_THREADS_ENABLED is not defined, the parser instance cannot use threads. | |
A persistent document parser.
The parser is designed to be reused, holding the internal buffers necessary to do parsing, as well as memory for a single document. The parsed document is overwritten on each parse.
This class cannot be copied, only moved, to avoid unintended allocations.
std::unique_ptr<dom::parser> parser(new dom::parser{});
auto error = parser->load(f).get(root);
You can then move std::unique_ptr safely.
|
explicit noexcept |
Create a JSON parser.
The new parser will have zero capacity.
max_capacity | The maximum document length the parser can automatically handle. The parser will allocate more capacity on an as needed basis (when it sees documents too big to handle) up to this amount. The parser still starts with zero capacity no matter what this number is: to allocate an initial capacity, call allocate() after constructing the parser. Defaults to SIMDJSON_MAXSIZE_BYTES (the largest single document simdjson can process). |
Definition at line 23 of file parser-inl.h.
|
default noexcept |
Take another parser's buffers and state.
other | The parser to take. Its capacity is zeroed. |
|
inline noexcept |
Ensure this parser has enough memory to process JSON documents up to capacity bytes in length and max_depth depth.
capacity | The new capacity. |
max_depth | The new max_depth. Defaults to DEFAULT_MAX_DEPTH. |
Definition at line 199 of file parser-inl.h.
|
noexcept |
The largest document this parser can support without reallocating.
Definition at line 188 of file parser-inl.h.
|
inline noexcept |
Load a JSON document from a file and return a reference to it.
dom::parser parser;
const element doc = parser.load("jsonexamples/twitter.json");
The function is eager: the file's content is loaded in memory inside the parser instance and immediately parsed. The file can be deleted after the parser.load call.
The JSON document still lives in the parser: this is the most efficient way to parse JSON documents because it reuses the same buffers, but you must use the document before you destroy the parser or call parse() again.
Moving the parser instance is safe, but it invalidates the element instances. You may store the parser instance without moving it by wrapping it inside a unique_ptr instance like so: std::unique_ptr<dom::parser> parser(new dom::parser{});
If the parser's current capacity is less than the file length, it will allocate enough capacity to handle it (up to max_capacity).
Windows users who need to read files with non-ANSI characters in the name should set their code page to UTF-8 (65001) before calling this function. This should be the default with Windows 11 and later. Further, they may use the AreFileApisANSI function to determine whether the filename is interpreted using the ANSI or the system default OEM codepage, and they may call SetFileApisToOEM accordingly.
path | The path to load. |
Definition at line 94 of file parser-inl.h.
|
inline noexcept |
Load a JSON document from a file into a provided document instance and return a temporary reference to it.
It is similar to the function load except that, instead of parsing into the internal document instance associated with the parser, it allows the user to provide a document instance.
dom::parser parser;
dom::document doc;
element doc_root = parser.load_into_document(doc, "jsonexamples/twitter.json");
The function is eager: the file's content is loaded in memory inside the parser instance and immediately parsed. The file can be deleted after the parser.load_into_document call.
After the call to load_into_document, the parser is no longer needed.
The JSON document lives in the document instance: you must keep the document instance alive while you navigate through it (i.e., use the returned value from load_into_document). You are encouraged to reuse the document instance many times with new data to avoid reallocations:
dom::document doc;
element doc_root1 = parser.load_into_document(doc, "jsonexamples/twitter.json");
//... doc_root1 is a pointer inside doc
element doc_root2 = parser.load_into_document(doc, "jsonexamples/twitter.json");
//... doc_root2 is a pointer inside doc
// at this point doc_root1 is no longer safe
Moving the document instance is safe, but it invalidates the element instances. After moving a document, you can recover safe access to the document root with its root() method.
doc | The document instance where the parsed data will be stored (on success). |
path | The path to load. |
Definition at line 98 of file parser-inl.h.
|
inline noexcept |
Load a file containing many JSON documents.
dom::parser parser;
for (const element doc : parser.load_many(path)) {
  cout << std::string(doc["title"]) << endl;
}
The file is loaded in memory and can be safely deleted after the parser.load_many(path) function has returned. The memory is held by the parser instance.
The function is lazy: it may be that no more than one JSON document at a time is parsed. Possibly, no document may have been parsed when the parser.load_many(path) function returned.
If there is a UTF-8 BOM, the parser skips it.
The file must contain a series of one or more JSON documents, concatenated into a single buffer, separated by whitespace. It effectively parses until it has a fully valid document, then starts parsing the next document at that point. (It does this with more parallelism and lookahead than you might think, though.)
Documents that consist of an object or array may omit the whitespace between them, concatenating with no separator. Documents that consist of a single primitive (i.e., documents that are not arrays or objects) MUST be separated with whitespace.
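The separation rule can be illustrated with a small helper that builds such a concatenated stream. This is only a sketch using the standard library: `is_container` and `concatenate_documents` are hypothetical names and not part of simdjson, and each input string is assumed to be one valid JSON document.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// True if the document's first non-whitespace character opens an object or array.
bool is_container(const std::string &doc) {
  const std::size_t first = doc.find_first_not_of(" \t\r\n");
  return first != std::string::npos && (doc[first] == '{' || doc[first] == '[');
}

// Joins documents into one stream. A space is inserted only where the rule
// requires it, that is, whenever a primitive sits on either side of the boundary.
std::string concatenate_documents(const std::vector<std::string> &docs) {
  std::string out;
  for (std::size_t i = 0; i < docs.size(); ++i) {
    if (i > 0 && !(is_container(docs[i - 1]) && is_container(docs[i]))) {
      out += ' ';
    }
    out += docs[i];
  }
  return out;
}
```

In practice it is simpler, and equally valid, to always put a newline between documents; the helper only makes explicit where a separator is mandatory.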
The documents must not exceed batch_size bytes (by default 1MB) or they will fail to parse. Setting batch_size to excessively large or excessively small values may negatively impact performance.
All errors are returned during iteration: if there is a global error such as memory allocation, it will be yielded as the first result. Iteration always stops after the first error.
As with all other simdjson methods, non-exception error handling is readily available through the same interface, requiring you to check the error before using the document:
dom::parser parser;
dom::document_stream docs;
auto error = parser.load_many(path).get(docs);
if (error) { cerr << error << endl; exit(1); }
for (auto doc : docs) {
  std::string_view title;
  if ((error = doc["title"].get(title))) { cerr << error << endl; exit(1); }
  cout << title << endl;
}
When compiled with SIMDJSON_THREADS_ENABLED, this method will use a single thread under the hood to do some lookahead.
If the parser's current capacity is less than batch_size, it will allocate enough capacity to handle it (up to max_capacity).
path | File name pointing at the concatenated JSON to parse. |
batch_size | The batch size to use. MUST be larger than the largest document. The sweet spot is cache-related: small enough to fit in cache, yet big enough to parse as many documents as possible in one tight loop. Defaults to 1MB (as simdjson::dom::DEFAULT_BATCH_SIZE), which has been a reasonable sweet spot in our tests. If you set the batch_size to a value smaller than simdjson::dom::MINIMAL_BATCH_SIZE (currently 32B), it will be replaced by simdjson::dom::MINIMAL_BATCH_SIZE. |
Definition at line 105 of file parser-inl.h.
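The batch_size rules described above can be summarized in a short model. This is a sketch rather than simdjson's code: `kMinimalBatchSize` stands in for simdjson::dom::MINIMAL_BATCH_SIZE (32 bytes according to the text above), and the function names are hypothetical. The fit check is deliberately conservative and requires the batch to be strictly larger than the document.

```cpp
#include <cstddef>

// Stand-in for simdjson::dom::MINIMAL_BATCH_SIZE (32 bytes per the text above).
constexpr std::size_t kMinimalBatchSize = 32;

// Batch size actually used for a requested value: requests below the
// minimum are raised to the minimum.
constexpr std::size_t effective_batch_size(std::size_t requested) {
  return requested < kMinimalBatchSize ? kMinimalBatchSize : requested;
}

// Conservative check that a document of document_length bytes can be parsed:
// the effective batch size must be strictly larger than the document.
constexpr bool batch_can_hold(std::size_t document_length, std::size_t requested) {
  return document_length < effective_batch_size(requested);
}
```

A caller who knows the size of the largest document in a file could use a check like this to choose a batch_size before calling load_many.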
|
noexcept |
The largest document this parser can automatically support.
The parser may reallocate internal buffers as needed up to this amount.
Definition at line 191 of file parser-inl.h.
|
noexcept |
The maximum level of nested objects and arrays supported by this parser.
Definition at line 194 of file parser-inl.h.
Take another parser's buffers and state.
other | The parser to take. Its capacity is zeroed. |
|
inline noexcept |
Parse a JSON document and return a temporary reference to it.
dom::parser parser;
element doc_root = parser.parse(buf, len);
The function eagerly parses the input: the input can be modified and discarded after the parser.parse(buf, len) call has completed.
The JSON document still lives in the parser: this is the most efficient way to parse JSON documents because it reuses the same buffers, but you must use the document before you destroy the parser or call parse() again.
Moving the parser instance is safe, but it invalidates the element instances. You may store the parser instance without moving it by wrapping it inside a unique_ptr instance like so: std::unique_ptr<dom::parser> parser(new dom::parser{});
The buffer must have at least SIMDJSON_PADDING extra allocated bytes. It does not matter what those bytes are initialized to, as long as they are allocated. These bytes will be read: if you are using a sanitizer that verifies that no uninitialized byte is read, then you should initialize the SIMDJSON_PADDING bytes to avoid runtime warnings.
If realloc_if_needed is true (the default), it is assumed that the buffer does not have enough padding, and it is copied into an enlarged temporary buffer before parsing. Thus the following is safe:
const char *json = R"({"key":"value"})";
const size_t json_len = std::strlen(json);
simdjson::dom::parser parser;
simdjson::dom::element element = parser.parse(json, json_len);
If you set realloc_if_needed to false (e.g., parser.parse(json, json_len, false)), you must provide a buffer with at least SIMDJSON_PADDING extra bytes at the end. The benefit of setting realloc_if_needed to false is that you avoid a temporary memory allocation and a copy.
The padded bytes may be read. It is not important how you initialize these bytes though we recommend a sensible default like null character values or spaces. For example, the following low-level code is safe:
const char *json = R"({"key":"value"})";
const size_t json_len = std::strlen(json);
std::unique_ptr<char[]> padded_json_copy{new char[json_len + SIMDJSON_PADDING]};
std::memcpy(padded_json_copy.get(), json, json_len);
std::memset(padded_json_copy.get() + json_len, '\0', SIMDJSON_PADDING);
simdjson::dom::parser parser;
simdjson::dom::element element = parser.parse(padded_json_copy.get(), json_len, false);
If you pass a mutable std::string reference (std::string&), the parser will seek to extend its capacity to SIMDJSON_PADDING bytes beyond the end of the string.
Whenever you pass a std::string reference, the parser will access the bytes beyond the end of the string but before the end of the allocated memory (std::string::capacity()). If you are using a sanitizer that checks for reading uninitialized bytes or std::string's container-overflow checks, you may encounter sanitizer warnings. You can safely ignore these warnings, or you can call simdjson::pad(std::string&) to pad the string with SIMDJSON_PADDING spaces: this function returns a simdjson::padded_string_view, which can be passed to the parser's parse function:
std::string json = R"({ "foo": 1 })";
element doc = parser.parse(simdjson::pad(json));
If the parser's current capacity is less than len, it will allocate enough capacity to handle it (up to max_capacity).
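As a rough illustration of the growth rule just described, the sketch below models it with plain integers. It is not simdjson's implementation: `growth_result` and `ensure_capacity_model` are hypothetical names, and growing to exactly `len` bytes is an assumption made for simplicity.

```cpp
#include <cstddef>

// Outcome of the modeled capacity check.
struct growth_result {
  bool ok;               // false models a capacity error (document too large)
  std::size_t capacity;  // parser capacity after the call, in bytes
};

// Hypothetical model of the documented rule: grow only when the document is
// larger than the current capacity, and never beyond max_capacity.
growth_result ensure_capacity_model(std::size_t current, std::size_t len,
                                    std::size_t max_capacity) {
  if (len > max_capacity) {
    return {false, current};  // too large: nothing is allocated
  }
  if (len <= current) {
    return {true, current};   // existing buffers are reused as-is
  }
  return {true, len};         // assumed: grow to exactly the needed size
}
```

Under this model, a freshly constructed parser (capacity zero) grows on its first parse, and later documents of the same size or smaller reuse the buffers without reallocating.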
buf | The JSON to parse. Must have at least len + SIMDJSON_PADDING allocated bytes, unless realloc_if_needed is true. |
len | The length of the JSON. |
realloc_if_needed | Whether to reallocate and enlarge the JSON buffer to add padding. |
Definition at line 153 of file parser-inl.h.
|
inline noexcept |
Parse a JSON document into a provided document instance and return a temporary reference to it.
It is similar to the function parse except that, instead of parsing into the internal document instance associated with the parser, it allows the user to provide a document instance.
dom::parser parser;
dom::document doc;
element doc_root = parser.parse_into_document(doc, buf, len);
The function eagerly parses the input: the input can be modified and discarded after the parser.parse_into_document(doc, buf, len) call has completed.
After the call to parse_into_document, the parser is no longer needed.
The JSON document lives in the document instance: you must keep the document instance alive while you navigate through it (i.e., use the returned value from parse_into_document). You are encouraged to reuse the document instance many times with new data to avoid reallocations:
dom::document doc;
element doc_root1 = parser.parse_into_document(doc, buf1, len);
//... doc_root1 is a pointer inside doc
element doc_root2 = parser.parse_into_document(doc, buf1, len);
//... doc_root2 is a pointer inside doc
// at this point doc_root1 is no longer safe
Moving the document instance is safe, but it invalidates the element instances. After moving a document, you can recover safe access to the document root with its root() method.
doc | The document instance where the parsed data will be stored (on success). |
buf | The JSON to parse. Must have at least len + SIMDJSON_PADDING allocated bytes, unless realloc_if_needed is true. |
len | The length of the JSON. |
realloc_if_needed | Whether to reallocate and enlarge the JSON buffer to add padding. |
Definition at line 113 of file parser-inl.h.
|
inline noexcept |
Parse a buffer containing many JSON documents.
dom::parser parser;
for (element doc : parser.parse_many(buf, len)) {
  cout << std::string(doc["title"]) << endl;
}
No copy of the input buffer is made.
The function is lazy: it may be that no more than one JSON document at a time is parsed. Possibly, no document may have been parsed when the parser.parse_many(buf, len) function returned.
The caller is responsible for ensuring that the input string data remains unchanged and is not deleted during the loop. In particular, the following is unsafe and will not compile:
auto docs = parser.parse_many("[\"temporary data\"]"_padded);
// here the string "[\"temporary data\"]" may no longer exist in memory
// the parser instance may not have even accessed the input yet
for (element doc : docs) {
  cout << std::string(doc["title"]) << endl;
}
The following is safe:
auto json = "[\"temporary data\"]"_padded;
auto docs = parser.parse_many(json);
for (element doc : docs) {
  cout << std::string(doc["title"]) << endl;
}
If there is a UTF-8 BOM, the parser skips it.
The buffer must contain a series of one or more JSON documents, concatenated into a single buffer, separated by whitespace. It effectively parses until it has a fully valid document, then starts parsing the next document at that point. (It does this with more parallelism and lookahead than you might think, though.)
Documents that consist of an object or array may omit the whitespace between them, concatenating with no separator. Documents that consist of a single primitive (i.e., documents that are not arrays or objects) MUST be separated with whitespace.
The documents must not exceed batch_size bytes (by default 1MB) or they will fail to parse. Setting batch_size to excessively large or excessively small values may negatively impact performance.
All errors are returned during iteration: if there is a global error such as memory allocation, it will be yielded as the first result. Iteration always stops after the first error.
As with all other simdjson methods, non-exception error handling is readily available through the same interface, requiring you to check the error before using the document:
dom::parser parser;
dom::document_stream docs;
auto error = parser.parse_many(buf, len).get(docs);
if (error) { cerr << error << endl; exit(1); }
for (auto doc : docs) {
  std::string_view title;
  if ((error = doc["title"].get(title))) { cerr << error << endl; exit(1); }
  cout << title << endl;
}
The buffer must have at least SIMDJSON_PADDING extra allocated bytes. It does not matter what those bytes are initialized to, as long as they are allocated. These bytes will be read: if you are using a sanitizer that verifies that no uninitialized byte is read, then you should initialize the SIMDJSON_PADDING bytes to avoid runtime warnings.
When compiled with SIMDJSON_THREADS_ENABLED, this method will use a single thread under the hood to do some lookahead.
If the parser's current capacity is less than batch_size, it will allocate enough capacity to handle it (up to max_capacity).
buf | The concatenated JSON to parse. Must have at least len + SIMDJSON_PADDING allocated bytes. |
len | The length of the concatenated JSON. |
batch_size | The batch size to use. MUST be larger than the largest document. The sweet spot is cache-related: small enough to fit in cache, yet big enough to parse as many documents as possible in one tight loop. Defaults to 1MB (as simdjson::dom::DEFAULT_BATCH_SIZE), which has been a reasonable sweet spot in our tests. |
This is an overloaded member function, provided for convenience. It differs from the above function only in what argument(s) it accepts.
Definition at line 170 of file parser-inl.h.
|
noexcept |
Set max_capacity.
This is the largest document this parser can automatically support.
The parser may reallocate internal buffers as needed up to this amount as documents are passed to it.
Note: To avoid limiting the memory to an absurd value, such as zero or two bytes, if you try to set max_capacity to a value lower than MINIMAL_DOCUMENT_CAPACITY, the maximal capacity is set to MINIMAL_DOCUMENT_CAPACITY instead.
This call will not allocate or deallocate, even if capacity is currently above max_capacity.
max_capacity | The new maximum capacity, in bytes. |
Definition at line 247 of file parser-inl.h.
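The behavior of set_max_capacity described above (a lower bound on the ceiling, and no effect on memory already allocated) can be modeled in a few lines. This is a sketch, not simdjson's code: `parser_limits` and `set_max_capacity_model` are hypothetical names, and the value 32 used for the minimum is an assumed placeholder for MINIMAL_DOCUMENT_CAPACITY.

```cpp
#include <cstddef>

// Placeholder standing in for simdjson's MINIMAL_DOCUMENT_CAPACITY;
// the value 32 is an assumption made for this sketch.
constexpr std::size_t kMinimalDocumentCapacity = 32;

// Hypothetical model of the two sizes tracked by a parser.
struct parser_limits {
  std::size_t capacity;      // currently allocated capacity, in bytes
  std::size_t max_capacity;  // ceiling for automatic growth, in bytes
};

// Models set_max_capacity: the ceiling is clamped from below, and the
// current capacity is left untouched even if it exceeds the new ceiling.
void set_max_capacity_model(parser_limits &limits, std::size_t max_capacity) {
  limits.max_capacity = max_capacity < kMinimalDocumentCapacity
                            ? kMinimalDocumentCapacity
                            : max_capacity;
}
```

Note that in this model a parser holding 4096 bytes keeps them after its ceiling is lowered to 2048, matching the statement that the call neither allocates nor deallocates.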