Supported formats
RFC 5322
Description
This backend offers a high-performance, adaptable email parser designed for efficient data and metadata extraction. It prioritizes parsing methods that align closely with how modern Mail User Agents handle email content, rather than strictly following official specifications. This approach allows for more realistic and comprehensive analysis of email data, capturing details that are often missed by traditional parsers focused solely on technical standards.
Info: Available in Contextal Platform 1.0 and later.
Features
- Header decoding: minimal parsing, generic validation
- RFC 2047 and RFC 2184 header character set decoding
- Body decoding (identity, quoted-printable, base64)
- MIME multipart support (each concrete part becomes a child object)
- Charset-aware text conversion to UTF-8 (with replacement) of all inline part bodies
- Extensive extraction of metadata, anomalies, and flaws
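The decoding steps listed above can be illustrated with Python's standard email package. This is only a sketch of the general technique, not the backend's implementation; the sample message and the helper names decode_header_value and body_as_text are made up for illustration.

```python
from email import message_from_bytes
from email.header import decode_header, make_header

# Hypothetical sample: an RFC 2047 encoded Subject and a quoted-printable
# body declared as iso-8859-1.
RAW = (
    b"Subject: =?iso-8859-1?Q?Gr=FC=DFe?=\r\n"
    b"MIME-Version: 1.0\r\n"
    b"Content-Type: text/plain; charset=iso-8859-1\r\n"
    b"Content-Transfer-Encoding: quoted-printable\r\n"
    b"\r\n"
    b"Gr=FC=DFe\r\n"
)


def decode_header_value(value: str) -> str:
    """Decode RFC 2047 encoded-words into a plain Unicode string."""
    return str(make_header(decode_header(value)))


def body_as_text(msg) -> str:
    """Undo the transfer encoding, then convert the declared charset to text.

    Undecodable bytes are replaced rather than raising an error.
    """
    payload = msg.get_payload(decode=True)  # identity / quoted-printable / base64
    charset = msg.get_content_charset() or "us-ascii"
    try:
        return payload.decode(charset, errors="replace")
    except LookupError:
        # Unknown charset label: fall back to UTF-8 with replacement.
        return payload.decode("utf-8", errors="replace")


msg = message_from_bytes(RAW)
subject = decode_header_value(msg["Subject"])  # "Grüße"
body = body_as_text(msg).strip()               # "Grüße"
```

Decoding with replacement keeps processing going on malformed input, at the cost of losing the offending bytes.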
Symbols
Object
- DUP_BCC → duplicate Bcc header
- DUP_CC → duplicate Cc header
- DUP_ENVELOPE_TO → duplicate Envelope-To header
- DUP_FROM → duplicate From header
- DUP_IN_REPLY_TO → duplicate In-Reply-To header
- DUP_MESSAGE_ID → duplicate Message-ID header
- DUP_REPLY_TO → duplicate Reply-To header
- DUP_RETURN_PATH → duplicate Return-Path header
- DUP_SUBJECT → duplicate Subject header
- DUP_TO → duplicate To header
- FROM_LIST → the mail is apparently from a mailing list
- INVALID_DATE → the Date header is syntactically invalid
- INVALID_HEADERS → one or more of the headers are syntactically invalid
- INVALID_MIME_VER → the Mime-Version header is syntactically invalid or reports an invalid version
- LIMITS_REACHED → limits triggered while processing the message
- MISSING_DATE → the mandatory Date header is not present
- MISSING_FROM → the mandatory From header is not present
- MISSING_MESSAGE_ID → the mandatory Message-ID header is not present
- MISSING_MIME_VER → the Mime-Version header is not present
- MISSING_SUBJECT → the mandatory Subject header is not present
- MISSING_TO → the To header is not present
- RESENT → the message appears to have been resent
Children
- CHARSET_ATTM → one or more attachments (i.e. non-inline parts) carry a charset (some malware does this)
- INVALID_BODY_ENC → this child is a message body and contains flaws in the way it is encoded
- INVALID_HEADERS → one or more of the part headers are syntactically invalid
- TOOBIG → this part was not extracted as it exceeds the limits
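To make the relationship between header presence and the DUP_* and MISSING_* object symbols concrete, here is a minimal sketch that derives them from a list of header names. The symbol names come from the list above; the function header_symbols and the counting logic are assumptions for illustration and do not describe how the backend actually assigns symbols.

```python
from collections import Counter

# Header name -> symbol, as listed in the Object symbols above.
DUP_SYMBOLS = {
    "bcc": "DUP_BCC",
    "cc": "DUP_CC",
    "envelope-to": "DUP_ENVELOPE_TO",
    "from": "DUP_FROM",
    "in-reply-to": "DUP_IN_REPLY_TO",
    "message-id": "DUP_MESSAGE_ID",
    "reply-to": "DUP_REPLY_TO",
    "return-path": "DUP_RETURN_PATH",
    "subject": "DUP_SUBJECT",
    "to": "DUP_TO",
}
MISSING_SYMBOLS = {
    "date": "MISSING_DATE",
    "from": "MISSING_FROM",
    "message-id": "MISSING_MESSAGE_ID",
    "mime-version": "MISSING_MIME_VER",
    "subject": "MISSING_SUBJECT",
    "to": "MISSING_TO",
}


def header_symbols(header_names):
    """Return the sorted DUP_*/MISSING_* symbols implied by the header names."""
    counts = Counter(name.lower() for name in header_names)
    symbols = {sym for name, sym in DUP_SYMBOLS.items() if counts[name] > 1}
    symbols |= {sym for name, sym in MISSING_SYMBOLS.items() if counts[name] == 0}
    return sorted(symbols)


example = header_symbols(["From", "To", "To", "Subject", "Date", "MIME-Version"])
# → ['DUP_TO', 'MISSING_MESSAGE_ID']
```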
Example Metadata
{
  "org": "ctx",
  "object_id": "a7e70dafb3bbc49ff7e284d084ea80e7a687903712c30d54388cfb986062550d",
  "object_type": "Email",
  "object_subtype": null,
  "recursion_level": 1,
  "size": 11301,
  "hashes": {
    "sha256": "a7e70dafb3bbc49ff7e284d084ea80e7a687903712c30d54388cfb986062550d",
    "sha1": "1dff07e2f13f7b2171042d924b3a6b647f04671c",
    "sha512": "b8de8e5000bbf5d335cbdb37751c5f1ecfd041383d2501c3136e54f5c90e0cfa19ae3f7d595da0c2c7326976f150e3ab72bc03f834b1af35960f8f8aa5b215f7",
    "md5": "bcb6d5b532df71d89154946206449487"
  },
  "ctime": 1725899412.988861,
  "ok": {
    "symbols": [],
    "object_metadata": {
      "_backend_version": "1.0.0",
      "charset": "iso-8859-1",
      "date_ts": 1598959647,
      "has_html_body": true,
      "has_text_body": false,
      "hdrs_health": {
        "bad_name": false,
        "bad_value": false,
        "bad_value_encoding": false,
        "bad_value_params": false,
        "bad_value_quoting": false
      },
      "headers": [
        {
          "dup": false,
          "name": "from",
          "value": "e-mail server bl****ware.com < tc***n@bl****ware.com >"
        },
        {
          "dup": false,
          "name": "message-id",
          "value": "<20200901042726.3b5c769a539eaed6@bl****ware.com>"
        },
        {
          "dup": false,
          "name": "reply-to",
          "value": "info@ph****api.live"
        },
        {
          "dup": false,
          "name": "return-path",
          "value": "<tc***n@bl****ware.com>"
        },
        {
          "dup": false,
          "name": "subject",
          "value": "email suspension warning for tc***n@bl****ware.com"
        },
        {
          "dup": false,
          "name": "to",
          "value": "tc***n@bl****ware.com"
        }
      ],
      "mime_type": "text/html",
      "multipart": false,
      "n_attachments": 0
    },
    "children": [
      {
        "org": "ctx",
        "object_id": "b93e1ad4e327fc4ced0d76d8db5c3c170cd50ca74585131b0c99e88b08e4326b",
        "object_type": "HTML",
        "object_subtype": null,
        "recursion_level": 2,
        "size": 8879,
        [...]
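As a rough illustration of how the metadata above can be consumed once it has been loaded as JSON, the sketch below looks up header values by name. The trimmed SAMPLE record reuses fields from the example; the helper header_values is hypothetical, and how the record is obtained from the platform is outside the scope of this sketch.

```python
import json

# Trimmed copy of the example metadata, keeping only the fields used below.
SAMPLE = json.loads("""
{
  "object_type": "Email",
  "ok": {
    "symbols": [],
    "object_metadata": {
      "has_html_body": true,
      "n_attachments": 0,
      "headers": [
        {"dup": false, "name": "reply-to", "value": "info@ph****api.live"},
        {"dup": false, "name": "to", "value": "tc***n@bl****ware.com"}
      ]
    }
  }
}
""")


def header_values(record, name):
    """Return every value recorded for a header (names are stored lowercase)."""
    meta = record.get("ok", {}).get("object_metadata", {})
    wanted = name.lower()
    return [h["value"] for h in meta.get("headers", []) if h["name"] == wanted]


reply_to = header_values(SAMPLE, "Reply-To")
# → ['info@ph****api.live']
```

Returning a list rather than a single value leaves room for duplicated headers, which the metadata marks with the dup field.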
Example Queries
object_type == "Email"
&& @has_symbol("INVALID_HEADERS")
- This query matches an Email object with syntactically invalid headers.
object_type == "Email"
&& @has_descendant(object_type == "Text"
&& @match_object_meta($natural_language_sentiment.compound < 0))
- This query matches an Email object from which, at any level of recursion, a Text object with negative language sentiment was extracted.
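The descendant query can be read as a recursive walk over the object tree. The following sketch mimics that logic over records shaped like the example metadata; it is not the platform's query engine. It assumes that children are nested under ok.children, as in the example, and that the sentiment score is stored at object_metadata.natural_language_sentiment.compound, which is a guess based on the query text.

```python
def has_descendant(record, predicate):
    """True if any child, grandchild, etc. of the record satisfies the predicate."""
    for child in record.get("ok", {}).get("children", []):
        if predicate(child) or has_descendant(child, predicate):
            return True
    return False


def is_negative_text(obj):
    """Assumed layout: the compound sentiment score lives in object_metadata."""
    meta = obj.get("ok", {}).get("object_metadata", {})
    compound = meta.get("natural_language_sentiment", {}).get("compound")
    return obj.get("object_type") == "Text" and compound is not None and compound < 0


def matches(record):
    """Python rendering of the second example query."""
    return record.get("object_type") == "Email" and has_descendant(record, is_negative_text)


# Hypothetical tree: Email -> HTML -> Text with a negative compound score.
text = {
    "object_type": "Text",
    "ok": {"object_metadata": {"natural_language_sentiment": {"compound": -0.4}}},
}
html = {"object_type": "HTML", "ok": {"object_metadata": {}, "children": [text]}}
email = {"object_type": "Email", "ok": {"object_metadata": {}, "children": [html]}}

result = matches(email)
# → True
```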
Configuration Options
- max_processed_size → maximum size of the input object that will be processed (default: 262144000)
- max_children → maximum number of child objects to create (default: 100)
- max_child_input_size → maximum size of a single input child object (default: 41943040)
- max_child_output_size → maximum size of a single output child object (default: 41943040)
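For orientation, the defaults above can be modelled as a small settings object. This is a sketch under assumptions: the values are taken to be bytes (so the defaults correspond to 250 MiB and 40 MiB), the class EmailLimits and its methods are hypothetical, and the exact comparison the backend uses when enforcing limits is not documented here.

```python
from dataclasses import dataclass

MIB = 1024 * 1024


@dataclass(frozen=True)
class EmailLimits:
    """Default values copied from the configuration options above."""

    max_processed_size: int = 262144000     # 250 MiB, assuming bytes
    max_children: int = 100
    max_child_input_size: int = 41943040    # 40 MiB, assuming bytes
    max_child_output_size: int = 41943040   # 40 MiB, assuming bytes

    def accepts_input(self, size: int) -> bool:
        """Assumed rule: inputs up to and including the limit are processed."""
        return size <= self.max_processed_size

    def accepts_child(self, output_size: int, children_so_far: int) -> bool:
        """Assumed rule: a child is created only while both limits hold."""
        return (
            children_so_far < self.max_children
            and output_size <= self.max_child_output_size
        )


limits = EmailLimits()
small_ok = limits.accepts_child(output_size=8879, children_so_far=0)   # True
too_many = limits.accepts_child(output_size=8879, children_so_far=100)  # False
```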