Text

Supported formats

Text objects in ASCII or Unicode formats.

Description

The Contextal Platform can extract text from many different file formats, which makes the text backend one of the most frequently used. The backend extracts statistics, indicators, and other data that can later be contextualized.

Info: Available in Contextal Platform 1.0 and later.

Features

Text Statistics

Details about the text structure are collected, such as the number of characters, digits, words, lines, whitespace characters, and base64 objects.
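To make the kind of counts involved more concrete, here is a minimal Python sketch. It is not the backend's implementation: the function name text_statistics is hypothetical, the dictionary keys are borrowed from the example metadata on this page, and the exact counting rules (for instance, what qualifies as a word or a digit) are assumptions.

```python
def text_statistics(text: str) -> dict:
    """Collect simple structural counts (illustrative; counting rules are assumed)."""
    return {
        # Total number of Unicode code points in the text.
        "unicode_char_count": len(text),
        # Code points that fall in the 7-bit ASCII range.
        "ascii_char_count": sum(1 for ch in text if ch.isascii()),
        # ASCII digits only; the real backend may count differently.
        "digit_count": sum(1 for ch in text if ch in "0123456789"),
        "whitespace_count": sum(1 for ch in text if ch.isspace()),
        "newline_count": text.count("\n"),
        # Words are approximated as whitespace-separated tokens.
        "word_count": len(text.split()),
    }


stats = text_statistics("Invoice 42\nTotal: 7 EUR\n")
```
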

Natural Language Detection and Analysis

The backend detects the natural (human) language used in the text (object metadata: natural_language). For English text, sentiment analysis and profanity detection are also performed.

Programming Language Detection

Common scripting languages that can be easily executed or are often part of threat toolkits are detected (object metadata: programming_language). These include Python, shell scripts, JavaScript, PowerShell, and others. Compiled languages are not detected, as we see no practical use case for that.

Credit Card Number Detection

When a technically valid credit card number is detected, the CC_NUMBER symbol is assigned to the object.
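This page does not say how technical validity is determined. A common check for card numbers is the Luhn checksum, so the sketch below assumes that is at least part of the test. The function name luhn_valid and the 13 to 19 digit length limit are assumptions made for illustration only.

```python
def luhn_valid(number: str) -> bool:
    """Return True if the digits pass the Luhn checksum (assumed validity test)."""
    # Tolerate the usual visual separators.
    compact = number.replace(" ", "").replace("-", "")
    # Assumed length window for payment card numbers.
    if not (compact.isascii() and compact.isdigit() and 13 <= len(compact) <= 19):
        return False
    total = 0
    # Walk from the rightmost digit, doubling every second one.
    for index, ch in enumerate(reversed(compact)):
        digit = int(ch)
        if index % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0


is_test_card_valid = luhn_valid("4111 1111 1111 1111")
```

A checksum match only shows that a number is well formed, which fits the wording of the CC_NUMBER symbol: a possible credit card number, not a confirmed one.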

Possible Password Collection

The backend collects possible passwords (object metadata: possible_passwords), which can be used for contextual auto-decryption.

Crypto Wallet Collection

The backend recognizes and collects valid cryptocurrency addresses (object metadata: crypto_addresses), which can be used for detection and threat intelligence.

URI Extraction

Common URI formats are extracted for further analysis (object metadata: uris).
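As a rough illustration of URI extraction, the following Python sketch pulls http, https, and ftp URIs out of free text and derives the set of hosts. The regular expression, the supported schemes, and the function name extract_uris are all assumptions. The output keys only mirror the uris and unique_hosts fields from the example metadata and do not reflect the backend's actual logic.

```python
import re
from urllib.parse import urlsplit

# Assumed pattern: a scheme followed by any run of non-space, non-quote characters.
URI_PATTERN = re.compile(r"\b(?:https?|ftp)://[^\s<>\"']+", re.IGNORECASE)


def extract_uris(text: str) -> dict:
    """Extract URIs in order of appearance, plus the sorted set of their hosts."""
    uris = []
    for match in URI_PATTERN.finditer(text):
        # Drop punctuation that usually belongs to the sentence, not the URI.
        uri = match.group(0).rstrip(".,;:!?)")
        if uri not in uris:
            uris.append(uri)
    hosts = sorted({urlsplit(u).hostname for u in uris if urlsplit(u).hostname})
    return {"uris": uris, "unique_hosts": hosts}


result = extract_uris("See https://example.com/a?b=1, then ftp://files.example.org/x.")
```
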

Symbols

Object

CHAR_DECODING_ERRORS → errors occurred while converting the input data to UTF-8
CODE_ALL_COMMENTS → a programming language was detected, but all of the code was commented out
CC_NUMBER → a possible credit card number was detected in the text
MANY_NUMBERS → numbers make up 10-50% of the text
MOSTLY_NUMBERS → numbers make up more than 50% of the text (but not all of it)
ALL_NUMBERS → the text contains only numbers
ALL_ASCII → the text only contains ASCII characters
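The three number-related symbols are mutually exclusive bands, which the following Python sketch spells out. How the percentage is computed is not stated on this page. This sketch assumes it is the share of ASCII digits among non-whitespace characters and that a value of exactly 50% still counts as MANY_NUMBERS. The function name number_symbol is hypothetical.

```python
from typing import Optional


def number_symbol(text: str) -> Optional[str]:
    """Map the digit share of a text to a symbol name (thresholds from the list above)."""
    # Assumption: whitespace is ignored when computing the share.
    chars = [ch for ch in text if not ch.isspace()]
    if not chars:
        return None
    digits = sum(1 for ch in chars if ch in "0123456789")
    if digits == len(chars):
        return "ALL_NUMBERS"
    ratio = digits / len(chars)
    if ratio > 0.5:
        return "MOSTLY_NUMBERS"
    if ratio >= 0.1:
        return "MANY_NUMBERS"
    return None


symbol = number_symbol("1234a")
```
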

Example Metadata

{
  "org": "ctx",
  "object_id": "febfdd4fc96fba982031a6b62a263b280e4ea25aaaf0755518afccd814082092",
  "object_type": "Text",
  "object_subtype": null,
  "recursion_level": 2,
  "size": 539,
  "hashes": {
    "md5": "d4eedad6edb3602d109bed3ded89dd34",
    "sha256": "febfdd4fc96fba982031a6b62a263b280e4ea25aaaf0755518afccd814082092",
    "sha512": "97ee71c2ebc90fffa06e2b71207857119fd678af9d0d30844f1af464ceb0d786a1202b4f21f97939a491d1fdadc052a672fce6bd5a577391e4ebfe8f32b40efe",
    "sha1": "02eaf7974a857d1c8c1926981ddf4ec044736657"
  },
  "ctime": 1759246193.668021,
  "entropy": 4.5824717384979525,
  "relation_metadata": {
    "DocumentText": {}
  },
  "ok": {
    "symbols": [
      "OCR"
    ],
    "object_metadata": {
      "_backend_version": "2.0.0",
      "ascii_char_count": 536,
      "base64_count": 0,
      "crypto_addresses": [],
      "digit_count": 0,
      "encoding": "utf-8",
      "natural_language": "English",
      "natural_language_profanity_count": 0,
      "natural_language_sentiment": {
        "compound": -0.9701721930828288,
        "neg": 0.24022346368715083,
        "neu": 0.7597765363128491,
        "pos": 0
      },
      "newline_count": 15,
      "possible_passwords": [],
      "unicode_char_count": 537,
      "unique_domains": [],
      "unique_hosts": [],
      "uris": [],
      "whitespace_count": 80,
      "word_count": 78
    },
    "children": []
  }
}

Note: The OCR symbol in the above metadata indicates that the text was obtained through optical character recognition of graphic data.
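For readers who post-process such metadata outside the platform, here is a small Python sketch that reads a trimmed copy of the record above and pulls out a few fields. The helper name summarize is hypothetical, and the sketch assumes the record is available as a JSON string. How metadata is exported from the platform is not covered here.

```python
import json

# Trimmed copy of the example metadata above (only the fields used below).
RAW_RECORD = """
{
  "object_type": "Text",
  "ok": {
    "symbols": ["OCR"],
    "object_metadata": {
      "natural_language": "English",
      "natural_language_sentiment": {"compound": -0.9701721930828288},
      "word_count": 78
    }
  }
}
"""


def summarize(record: dict) -> dict:
    """Pick a few fields of interest, tolerating missing keys."""
    ok = record.get("ok", {})
    meta = ok.get("object_metadata", {})
    sentiment = meta.get("natural_language_sentiment") or {}
    return {
        "from_ocr": "OCR" in ok.get("symbols", []),
        "language": meta.get("natural_language"),
        "compound_sentiment": sentiment.get("compound"),
        "word_count": meta.get("word_count"),
    }


summary = summarize(json.loads(RAW_RECORD))
```
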

Example Queries

object_type == "Text"
    && @match_object_meta($programming_language == "JavaScript")
    && @match_object_meta($newline_count == 0)

This query matches JavaScript code squeezed into a single line, as is common with obfuscated scripts.

object_type == "Text"
    && @match_object_meta($natural_language_profanity_count > 0)

This query matches English text that contains profanities. For non-English text, the second condition is always false.

Configuration Options

max_processed_size → maximum text input size (default: 10485760)
max_children → maximum number of child objects to create (default: 50)
natural_language_max_char_whitespace_ratio → maximum number_of_characters / number_of_whitespaces ratio to consider running the natural language detection (default: 20.0)
natural_language_min_confidence_level → minimum confidence level, from 0.0 to 1.0, required to report a natural language (default: 0.2)
create_url_children → whether to create URL children for further processing. As of Contextal Platform 1.0, only URLs coming from OCR'd text are taken into account (default: true)
create_domain_children → whether to create Domain children out of collected domain names for further processing (default: true)
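The natural_language_max_char_whitespace_ratio option can be read as a gate: text with very little whitespace relative to its length is unlikely to be prose, so language detection is skipped. Below is a minimal Python sketch of that reading. The function name should_detect_language is hypothetical, and two details are assumptions: that the character count includes whitespace, and that text without any whitespace is skipped.

```python
def should_detect_language(text: str, max_ratio: float = 20.0) -> bool:
    """Decide whether to run language detection, based on the characters/whitespace ratio."""
    whitespace = sum(1 for ch in text if ch.isspace())
    if whitespace == 0:
        # Assumption: no whitespace at all is treated as exceeding any ratio.
        return False
    return len(text) / whitespace <= max_ratio


run_detection = should_detect_language("the quick brown fox")
```

With the default of 20.0, an ordinary sentence passes the gate, while a long unbroken token such as a base64 blob does not.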
