Data Sources
In HEAT, a Data Source represents a managed connection to an external storage or database service. Once you register a Data Source with HEAT, you can reference it within Session Templates to store or retrieve data during the execution of a workflow node.
Supported Data Source Types
Currently, HEAT supports connecting to:
- CosmosDB
- MSSQL
- PostgreSQL
- MongoDB
- S3 (e.g., Amazon S3 or MinIO-compatible services)
- Azure Blob Storage
Each Data Source is configured with:
- Name & Description: Human-readable identifiers to distinguish it from others.
- Connection String: The details HEAT needs to authenticate and connect.
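As a rough illustration of these fields in practice, the sketch below registers a Data Source over HTTP. This is a minimal sketch, not HEAT's documented API: the endpoint path, payload field names, and auth header are assumptions for illustration only; consult your deployment's API reference for the actual schema.

```python
import os

import requests

# Hypothetical registration payload; field names are illustrative,
# not HEAT's documented schema.
data_source = {
    "name": "TelemetryPostgres",
    "description": "PostgreSQL instance holding raw telemetry rows",
    "type": "PostgreSQL",
    # Keep the connection string out of source control (see Best Practices).
    "connectionString": os.environ["TELEMETRY_PG_CONNECTION_STRING"],
}

# The endpoint path and auth header are placeholders for whatever your
# HEAT deployment actually exposes.
response = requests.post(
    "https://heat.example.com/api/datasources",
    json=data_source,
    headers={"Authorization": f"Bearer {os.environ['HEAT_API_TOKEN']}"},
    timeout=30,
)
response.raise_for_status()
```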
Default & Built-In Data Sources
Depending on your deployment (e.g., On-Prem, Azure), HEAT comes with certain default Data Sources out of the box:
- MinIO-based Object Store: A local S3-compatible service useful for storing binary objects (e.g., CSV or JSON uploads).
- CosmosDB Emulator & Azurite (On-Prem): Simulated Azure services for local deployments, allowing you to test and develop without needing full cloud connectivity.
- Live CosmosDB & Azure Storage (Azure): Production-grade cloud services available in Azure-based deployments.
Although these built-ins suffice for many scenarios, you can still define additional Data Sources to meet specific requirements.
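Because the built-in object store is S3-compatible, you can sanity-check it with any standard S3 client. The sketch below uses boto3; the endpoint URL, credentials, and bucket names are deployment-specific placeholders, not values HEAT guarantees.

```python
import boto3

# Endpoint and credentials are placeholders; substitute the values
# from your own HEAT deployment.
s3 = boto3.client(
    "s3",
    endpoint_url="http://minio.heat.local:9000",
    aws_access_key_id="YOUR_ACCESS_KEY",
    aws_secret_access_key="YOUR_SECRET_KEY",
)

# Listing buckets confirms the object store is reachable and the
# credentials are valid.
for bucket in s3.list_buckets()["Buckets"]:
    print(bucket["Name"])
```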
Referencing Data Sources in Session Templates
When configuring nodes (e.g., input nodes, transformation nodes, or output nodes) within a Session Template, you often need to specify which Data Source the node should read from or write to. For instance:
- Input Node: Might store raw CSV data received via an Ingest into an S3 bucket or as a blob in Azure Blob Storage.
- Transform Node: Could query a PostgreSQL database for enrichment or join data from MSSQL with a machine-learning model stored in CosmosDB.
- Output Node: Might send final computed metrics to MongoDB or store them as JSON documents in Azure Blob Storage.
By defining these Data Source references in your node configurations, you maintain a clear, structured workflow where each stage knows exactly where to push and pull data.
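To make this concrete, a node configuration that references Data Sources might look like the sketch below. The key names (`dataSource`, `query`, `container`, and so on) are illustrative assumptions, not HEAT's documented Session Template schema; the point is simply that each node names the registered source it reads from or writes to.

```python
# Hypothetical transform-node configuration; key names are illustrative only.
transform_node = {
    "name": "EnrichTelemetry",
    "type": "transform",
    # Read side: the registered PostgreSQL Data Source to query.
    "input": {
        "dataSource": "TrainingDataPostgres",
        "query": "SELECT device_id, region FROM devices",
    },
    # Write side: where the enriched records should land.
    "output": {
        "dataSource": "TelemetryCosmosDB",
        "container": "enriched-telemetry",
    },
}
```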
Best Practices
- Use Meaningful Names: Since data sources appear throughout Session Templates and workflow nodes, choose a name and description that reflect the source's function or contents (e.g., `TrainingDataPostgres` or `TelemetryCosmosDB`).
- Secure Your Connection Strings: Treat connection strings like credentials; store them securely and limit access to those who truly need them. Once created, a connection string can only be updated via the public API; it cannot be retrieved externally after it has been provided.
- Validate Connectivity: Ensure your HEAT cluster can reach the data source over the network. Firewalls, access controls, and misconfigured endpoints are common culprits for connectivity failures (a minimal probe sketch follows this list).
- Plan for Growth: If your workflows need parallel access or large-scale data operations, confirm that your chosen data source can handle the throughput. Some engines may require additional licensing or hardware configuration.
- Keep Dependencies Clear: In complex Session Templates, different nodes may point to different data sources. Keeping references well-documented prevents confusion and helps teams quickly see which systems are involved in each stage of the pipeline.
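As a quick first check for the connectivity concerns above, a plain TCP probe from a machine inside the HEAT cluster's network often tells you whether a firewall or endpoint misconfiguration is in the way. This sketch uses only the Python standard library; the host and port are placeholders for your data source's actual endpoint.

```python
import socket


def can_reach(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Placeholder endpoint: substitute your data source's host and port
# (e.g., 5432 for PostgreSQL, 1433 for MSSQL, 27017 for MongoDB).
if not can_reach("postgres.internal.example.com", 5432):
    print("Data source unreachable: check firewalls, DNS, and endpoint config")
```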
Next Steps
- Learn how nodes reference Data Sources in Session Templates.
- Set up your first ingestion pipeline in Ingests.
- Check out the Cluster Manager for monitoring data source-related metrics and connectivity logs.