Technical · Architecture · Tutorial

Understanding Cross-Source Queries in Industrial Data

A deep dive into how cross-source query execution works and why it's the key to unlocking insights from distributed OT systems.

Grey Dziuba · January 15, 2026 · 4 min read

One of the core innovations in Conduit is our cross-source query engine. In this post, we'll explore what cross-source queries are, how they work, and why they're essential for modern industrial data architectures.

The Traditional Approach: Centralization

Historically, when organizations wanted to analyze data from multiple systems, they followed a predictable pattern:

  1. Extract data from source systems
  2. Transform it into a common format
  3. Load it into a central data warehouse or lake

This ETL (Extract, Transform, Load) approach has been the standard for decades. But in industrial environments, it creates significant challenges:

  • Latency: Batch ETL means your "current" data is always hours or days old
  • Cost: Moving and storing petabytes of time-series data is expensive
  • Governance: Duplicated data creates compliance and security concerns
  • Maintenance: ETL pipelines are fragile and require constant attention

Enter Cross-Source Queries

Cross-source query execution flips this model on its head. Instead of moving data to where the query runs, we move the query to where the data lives.

How It Works

When you submit a query to Conduit, here's what happens (see the sketch after this list):

1. Parse the query and identify required data sources
2. Generate optimized sub-queries for each source system
3. Execute sub-queries in parallel against source systems
4. Stream results back to the merge layer
5. Merge, correlate, and return unified results
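
To make this concrete, here is a minimal Python sketch of the flow under some loose assumptions: the plan structure, the per-source connector callables, and the merge function are hypothetical stand-ins, not Conduit's internals or API.

from concurrent.futures import ThreadPoolExecutor

# Hypothetical plan produced by step 1: one sub-query per source, plus join keys.
plan = {
    "subqueries": {
        "splunk": "SELECT timestamp, temperature, asset_id FROM temperatures WHERE ...",
        "mqtt": "SELECT alarm_type, severity, asset_id, start_time, end_time FROM alarms WHERE ...",
    },
    "join_keys": ["asset_id"],
}

def run_cross_source_query(plan, connectors, merge):
    """Steps 2-5: run the per-source sub-queries in parallel, collect the
    partial results as they come back, then merge them into one result set."""
    with ThreadPoolExecutor() as pool:
        futures = {
            name: pool.submit(connectors[name], sql)  # step 3: parallel execution
            for name, sql in plan["subqueries"].items()
        }
        partials = {name: f.result() for name, f in futures.items()}  # step 4: gather results
    return merge(partials, plan["join_keys"])  # step 5: merge and correlate

Here connectors maps each source name to a callable that runs a sub-query against that system, and merge is whatever correlation logic the query calls for.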

Let's walk through a concrete example.

Example: Cross-System Correlation

Suppose you want to correlate temperature readings from your Splunk historian with alarm events from MQTT:

SELECT
  p.timestamp,
  p.temperature,
  a.alarm_type,
  a.severity
FROM splunk.temperatures p
JOIN mqtt.alarms a
  ON p.asset_id = a.asset_id
  AND p.timestamp BETWEEN a.start_time AND a.end_time
WHERE p.timestamp > NOW() - INTERVAL '24 hours'

Conduit breaks this into two parallel operations:

Sub-query 1 (Splunk):

SELECT timestamp, temperature, asset_id
FROM temperatures
WHERE timestamp > NOW() - INTERVAL '24 hours'

Sub-query 2 (MQTT):

SELECT alarm_type, severity, asset_id, start_time, end_time
FROM alarms
WHERE start_time > NOW() - INTERVAL '24 hours'

These execute simultaneously. Results stream back to Conduit, where the join operation correlates records by asset and time window.
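
The merge step is essentially an interval join. A simplified sketch in Python, with nested loops standing in for what a real engine would do with streaming and indexing:

def correlate(temps, alarms):
    """Join temperature rows to alarm rows on asset_id and the alarm's time window."""
    results = []
    for t in temps:       # rows from the Splunk sub-query
        for a in alarms:  # rows from the MQTT sub-query
            if (t["asset_id"] == a["asset_id"]
                    and a["start_time"] <= t["timestamp"] <= a["end_time"]):
                results.append({
                    "timestamp": t["timestamp"],
                    "temperature": t["temperature"],
                    "alarm_type": a["alarm_type"],
                    "severity": a["severity"],
                })
    return results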

Query Optimization

Naive cross-source execution would be slow. The key to performance is intelligent query planning:

Predicate Pushdown

Filter conditions are pushed to source systems, reducing data transfer:

Original: SELECT * FROM splunk.temps WHERE value > 100
Pushed:   Splunk executes "value > 100" filter locally
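
One way to picture the planner's decision: split each filter into those the source can evaluate natively (pushed down) and those that must run in the merge layer after rows return. The capability set below is a hypothetical illustration, not Conduit's actual planner.

def split_filters(filters, source_capabilities):
    """Separate filters the source can evaluate from those applied after fetch."""
    pushed, local = [], []
    for f in filters:
        (pushed if f["op"] in source_capabilities else local).append(f)
    return pushed, local

pushed, local = split_filters(
    [{"column": "value", "op": ">", "operand": 100}],
    source_capabilities={">", "<", "=", "BETWEEN"},
)
# pushed -> rendered as "WHERE value > 100" in the Splunk sub-query
# local  -> evaluated by the merge layer on the returned rows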

Projection Pruning

Only requested columns are retrieved:

Original: SELECT temperature FROM splunk.readings
Pruned:   Splunk returns only temperature column, not all 50 columns

Join Reordering

Joins are executed in the optimal order to minimize intermediate result sizes.

Parallel Execution

Independent sub-queries execute in parallel across source systems.

Handling Heterogeneous Data

Industrial systems store data differently:

  • Historians use time-series models (timestamp, tag, value)
  • SCADA uses event-driven models (state changes)
  • MES uses relational models (orders, batches, products)

Conduit's semantic layer maps these different models to a unified schema (see the sketch after this list). When you query "temperature for asset X", Conduit knows:

  • In Splunk, this is index ot_data with field T-101.PV
  • In MQTT, this is topic building1/reactor1/temperature
  • In the OPC-UA server, this is node ns=2;s=Building1.Reactor1.Temperature
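
One way to picture this is a per-signal registry that the query planner consults. The structure below is a hypothetical illustration built from the identifiers above (the asset key "reactor_1" is made up), not Conduit's actual configuration format.

# Hypothetical semantic-layer registry entry for a temperature signal
semantic_map = {
    ("reactor_1", "temperature"): {
        "splunk": {"index": "ot_data", "field": "T-101.PV"},
        "mqtt": {"topic": "building1/reactor1/temperature"},
        "opcua": {"node_id": "ns=2;s=Building1.Reactor1.Temperature"},
    },
}

def resolve(asset, signal, source):
    """Translate a logical (asset, signal) pair into a source-specific address."""
    return semantic_map[(asset, signal)][source]

print(resolve("reactor_1", "temperature", "mqtt"))
# -> {'topic': 'building1/reactor1/temperature'}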

Performance Characteristics

Cross-source queries have different performance characteristics than centralized queries:

| Aspect         | Centralized        | Cross-Source          |
| -------------- | ------------------ | --------------------- |
| Query latency  | Lower (local data) | Higher (network hops) |
| Data freshness | Batch delayed      | Real-time             |
| Storage cost   | High (copies)      | Low (no copies)       |
| Governance     | Complex            | Simple                |

For most operational queries, the slight latency increase is worth the benefits of real-time data and simplified architecture.

When to Use Cross-Source Queries

Cross-source queries excel for:

  • Operational dashboards requiring real-time data
  • Ad-hoc analysis across multiple systems
  • Compliance queries where data residency matters
  • Integration without ETL pipelines

They're less suitable for:

  • Heavy analytics requiring repeated scans of historical data
  • Machine learning training on large datasets

For these use cases, consider using Conduit to populate a purpose-built analytics store.
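
For example, a scheduled job might run a cross-source query through Conduit and land the unified result in Parquet or a warehouse table for heavy analytics. The snippet below is purely illustrative: the endpoint, payload, and response shape are assumptions, not a documented Conduit API.

import pandas as pd
import requests

CONDUIT_URL = "https://conduit.example.com/api/query"  # hypothetical endpoint

def materialize(sql, output_path):
    """Run a cross-source query and write the unified result set to Parquet."""
    resp = requests.post(CONDUIT_URL, json={"sql": sql}, timeout=300)
    resp.raise_for_status()
    rows = resp.json()["rows"]  # assumed response shape
    pd.DataFrame(rows).to_parquet(output_path, index=False)

materialize(
    "SELECT * FROM splunk.temperatures WHERE timestamp > NOW() - INTERVAL '30 days'",
    "temperatures_last_30_days.parquet",
)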

Conclusion

Cross-source query execution is a paradigm shift in industrial data architecture. By moving queries to data instead of data to queries, organizations can get real-time insights without the cost and complexity of centralized data lakes.

Want to see cross-source queries in action? Request a demo and we'll show you cross-system correlation on your own data.
