Text
Supported formats
Text objects in ASCII or Unicode formats.
Description
The Contextal Platform can extract text from many different file formats, which makes the text backend one of the most frequently used. The backend extracts various statistics, indicators, and other data that can later be contextualized.
Available in Contextal Platform 1.0 and later.
Features
Text Statistics
Details about the text structure are collected, such as the number of characters, digits, words, lines, whitespace characters, and base64 objects.
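As a rough illustration of the kind of counts collected here, the following Python sketch computes a few of them for a string. The function name `text_stats` and the exact counting rules (for example, what counts as a word or a digit) are assumptions made for illustration and not the backend's actual implementation; the key names mirror the fields shown in Example Metadata.

```python
def text_stats(text: str) -> dict:
    """Compute simple structural counts for a piece of text (illustrative only)."""
    return {
        # Total number of Unicode code points.
        "unicode_char_count": len(text),
        # Code points in the 7-bit ASCII range.
        "ascii_char_count": sum(1 for ch in text if ord(ch) < 128),
        # ASCII digits 0-9 only (an assumption; the backend may count differently).
        "digit_count": sum(1 for ch in text if ch in "0123456789"),
        "whitespace_count": sum(1 for ch in text if ch.isspace()),
        "newline_count": text.count("\n"),
        # Words are approximated as whitespace-separated tokens.
        "word_count": len(text.split()),
    }


stats = text_stats("H\u00e9llo world 42\nsecond line\n")
```

A difference between `unicode_char_count` and `ascii_char_count`, as in this sample input, indicates that the text contains at least one non-ASCII character.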
Natural Language Detection and Analysis
The backend can detect the natural (human) language used in the text (object metadata: natural_language). For English text, sentiment analysis and profanity detection are additionally performed.
Programming Language Detection
Common scripting languages that can be easily executed or are often part of threat toolkits are detected (object metadata: programming_language). These include Python, shell scripts, JavaScript, PowerShell, and others. Compiled languages are not detected, as we don't see a practical use case for that.
Credit Card Number Detection
When a technically valid credit card number is detected, the CC_NUMBER symbol is assigned to the object.
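The text above does not spell out what "technically valid" means. A widely used validity test for card numbers is the Luhn checksum, so the sketch below assumes that check together with a typical length of 13 to 19 digits. `luhn_valid` is a hypothetical helper written for illustration, not the backend's code.

```python
def luhn_valid(candidate: str) -> bool:
    """Return True if the digits in `candidate` pass the Luhn checksum."""
    # Drop the separators commonly used when card numbers are written out.
    compact = candidate.replace(" ", "").replace("-", "")
    if not (compact.isascii() and compact.isdigit()):
        return False
    # Assumption: accept the usual card number lengths only.
    if not 13 <= len(compact) <= 19:
        return False
    total = 0
    # Walk from the rightmost digit, doubling every second digit.
    for position, ch in enumerate(reversed(compact)):
        digit = int(ch)
        if position % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0
```

A checksum match only shows that a digit sequence is well formed, which fits the description of CC_NUMBER as a possible credit card number rather than a confirmed one.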
Possible Password Collection
The backend will collect possible passwords, which could be used for contextual auto-decryption purposes.
Crypto Wallet Collection
The backend recognizes and collects valid crypto wallet addresses, which can then be used for detection and threat intelligence purposes.
URI Extraction
Common URI formats will be extracted for further analysis.
Symbols
Object
- CHAR_DECODING_ERRORS → issues were faced while converting the input data into UTF-8
- CODE_ALL_COMMENTS → a programming language was detected but the code was all commented out
- CC_NUMBER → a possible credit card number was detected in the text
- MANY_NUMBERS → the text contains 10-50% of numbers
- MOSTLY_NUMBERS → the text contains more than 50% of numbers (but not all)
- ALL_NUMBERS → the text only contains numbers
- ALL_ASCII → the text only contains ASCII characters
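The three number-related symbols (MANY_NUMBERS, MOSTLY_NUMBERS, ALL_NUMBERS) split the share of numbers in the text into bands. The sketch below shows one way such a classification could work. It assumes that "numbers" means ASCII digits, that whitespace is ignored when computing the share, and that the 10% and 50% boundaries fall into MANY_NUMBERS; none of these details are confirmed by the text above, and `number_symbol` is a hypothetical name.

```python
from typing import Optional


def number_symbol(text: str) -> Optional[str]:
    """Map the share of digits in `text` to one of the number-related symbols."""
    # Assumption: whitespace is ignored when computing the share of digits.
    chars = [ch for ch in text if not ch.isspace()]
    if not chars:
        return None
    digits = sum(1 for ch in chars if ch in "0123456789")
    ratio = digits / len(chars)
    if ratio == 1.0:
        return "ALL_NUMBERS"
    if ratio > 0.5:
        return "MOSTLY_NUMBERS"
    if ratio >= 0.1:
        return "MANY_NUMBERS"
    # Below 10% no number-related symbol applies.
    return None
```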
Example Metadata
{
  "org": "ctx",
  "object_id": "febfdd4fc96fba982031a6b62a263b280e4ea25aaaf0755518afccd814082092",
  "object_type": "Text",
  "object_subtype": null,
  "recursion_level": 2,
  "size": 539,
  "hashes": {
    "md5": "d4eedad6edb3602d109bed3ded89dd34",
    "sha256": "febfdd4fc96fba982031a6b62a263b280e4ea25aaaf0755518afccd814082092",
    "sha512": "97ee71c2ebc90fffa06e2b71207857119fd678af9d0d30844f1af464ceb0d786a1202b4f21f97939a491d1fdadc052a672fce6bd5a577391e4ebfe8f32b40efe",
    "sha1": "02eaf7974a857d1c8c1926981ddf4ec044736657"
  },
  "ctime": 1759246193.668021,
  "entropy": 4.5824717384979525,
  "relation_metadata": {
    "DocumentText": {}
  },
  "ok": {
    "symbols": [
      "OCR"
    ],
    "object_metadata": {
      "_backend_version": "2.0.0",
      "ascii_char_count": 536,
      "base64_count": 0,
      "crypto_addresses": [],
      "digit_count": 0,
      "encoding": "utf-8",
      "natural_language": "English",
      "natural_language_profanity_count": 0,
      "natural_language_sentiment": {
        "compound": -0.9701721930828288,
        "neg": 0.24022346368715083,
        "neu": 0.7597765363128491,
        "pos": 0
      },
      "newline_count": 15,
      "possible_passwords": [],
      "unicode_char_count": 537,
      "unique_domains": [],
      "unique_hosts": [],
      "uris": [],
      "whitespace_count": 80,
      "word_count": 78
    },
    "children": []
  }
}
Note: The OCR symbol in the above metadata indicates that the text was obtained through optical character recognition of graphic data.
Example Queries
object_type == "Text"
&& @match_object_meta($programming_language == "JavaScript")
&& @match_object_meta($newline_count == 0)
- This query matches JavaScript code obfuscated into a single line.
object_type == "Text"
&& @match_object_meta($natural_language_profanity_count > 0)
- This query matches English text with profanities. For non-English texts the second condition will always be false.
Configuration Options
- max_processed_size → maximum text input size (default: 10485760)
- max_children → maximum number of children objects to create (default: 50)
- natural_language_max_char_whitespace_ratio → maximum number_of_characters / number_of_whitespaces ratio to consider running the natural language detection (default: 20.0)
- natural_language_min_confidence_level → minimum natural language confidence level to report. From 0.0 to 1.0. (default: 0.2)
- create_url_children → whether to create URL children for further processing. As of Contextal Platform 1.0 only URLs coming from OCR'd text will be taken into account. (default: true)
- create_domain_children → whether to create Domain children out of collected domain names for further processing (default: true)
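To make natural_language_max_char_whitespace_ratio more concrete, the sketch below applies the ratio test to character and whitespace counts. It is only an illustration: the function name `should_detect_language`, the choice of which character count feeds the ratio, the inclusive comparison, and the handling of text with no whitespace are all assumptions.

```python
def should_detect_language(char_count: int, whitespace_count: int,
                           max_ratio: float = 20.0) -> bool:
    """Decide whether natural language detection is worth running."""
    # Assumption: text without any whitespace is treated as exceeding the ratio.
    if whitespace_count == 0:
        return False
    # Run detection only when characters per whitespace stay within the limit.
    return char_count / whitespace_count <= max_ratio


# Counts taken from the Example Metadata (unicode_char_count, whitespace_count).
run_detection = should_detect_language(537, 80)
```

With the counts from the Example Metadata, the ratio is 537 / 80, or roughly 6.7, which is below the default of 20.0 and consistent with natural_language being reported for that object.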