Skip to content

Query the Session Archive

Omnia automatically archives completed sessions from the warm Postgres store to object storage (S3, GCS, or Azure Blob) as Parquet files. This guide shows how to query that archive from any major SQL engine.

There is no dedicated sync controller or ETL pipeline to run. The archive is written in a standard Hive-partitioned Parquet layout that every major warehouse can read as an external table, so you point your engine of choice at the bucket and run SQL directly.

The long-term storage path has three pieces, all shipped in the core Helm chart:

  1. SessionRetentionPolicy — a cluster-scoped CRD that declares how long sessions stay in the warm store before being moved to cold.
  2. Compaction CronJob — the omnia-compaction Kubernetes CronJob runs on a schedule (default daily), reads sessions that have exceeded their warm retention, and writes them to the configured object store as Parquet.
  3. Cold archive provider — the Go package that performs the actual write. Configured via cold.backend, cold.bucket, cold.region, and cold.endpoint on the session-api deployment.

Enable it in your Helm values:

sessionRetention:
compaction:
enabled: true
schedule: "0 3 * * *" # daily at 03:00 UTC
defaultPolicy:
coldArchive:
enabled: true
backend: s3 # s3 | gcs | azure
bucket: my-omnia-archive
region: us-east-1

Files are written under a configurable prefix (default sessions/) using Hive-style partitioning so partition pruning works in every query engine:

s3://my-omnia-archive/sessions/
├── _manifest.json # internal index, skip this
├── year=2026/month=04/day=10/agent=my-agent/
│ ├── part-0000.parquet
│ └── part-0001.parquet
├── year=2026/month=04/day=10/agent=another-agent/
│ └── part-0000.parquet
└── year=2026/month=04/day=11/agent=my-agent/
└── part-0000.parquet
  • Compression: Snappy (configurable).
  • Max file size: 128 MB default, so partitions may contain multiple part-*.parquet files.
  • Agent name sanitization: characters outside [a-zA-Z0-9_-] are replaced with _ in the partition path.
  • _manifest.json: an internal JSON index used by Omnia for O(1) session lookups. It is not a Parquet file — exclude it from your warehouse’s file pattern (e.g. *.parquet rather than *).

Every file has the same schema:

ColumnTypeNotes
idstringSession ID
agent_namestringAgent that owned the session
namespacestringKubernetes namespace
workspace_namestring (optional)Workspace
statusstringTerminal session status
created_atint64Nanoseconds since Unix epoch
updated_atint64Nanoseconds since Unix epoch
expires_atint64Nanoseconds since Unix epoch (0 when unset)
ended_atint64Nanoseconds since Unix epoch (0 when unset)
message_countint32
tool_call_countint32
total_input_tokensint64
total_output_tokensint64
estimated_cost_usdfloat64
tagsstringJSON-encoded map[string]string
statestringJSON-encoded session state
last_message_previewstring (optional)
messages_jsonstringJSON-encoded full message history

The messages_json column is the interesting one for analytics: it holds the complete conversation (role, content, tool calls, tool results, token counts) as a JSON string. Every supported warehouse can parse it in place.

Timestamp columns are stored as int64 nanoseconds so they round-trip through Parquet losslessly. Convert them to native timestamps in your query:

  • Snowflake: TO_TIMESTAMP_NTZ(created_at / 1000000000)
  • BigQuery: TIMESTAMP_MICROS(DIV(created_at, 1000))
  • ClickHouse: fromUnixTimestamp64Nano(created_at)
  • Athena / Trino: from_unixtime(created_at / 1000000000)
  • DuckDB: make_timestamp(created_at // 1000)

Create an external stage pointing at your bucket, then an external table over the Parquet files:

CREATE OR REPLACE STAGE omnia_archive_stage
URL = 's3://my-omnia-archive/sessions/'
STORAGE_INTEGRATION = my_s3_integration
FILE_FORMAT = (TYPE = PARQUET);
CREATE OR REPLACE EXTERNAL TABLE omnia_sessions
WITH LOCATION = @omnia_archive_stage
PATTERN = '.*[.]parquet'
FILE_FORMAT = (TYPE = PARQUET)
AUTO_REFRESH = TRUE;

Query the conversation content using VARIANT:

SELECT
VALUE:id::STRING AS session_id,
VALUE:agent_name::STRING AS agent,
TO_TIMESTAMP_NTZ(VALUE:created_at::NUMBER / 1000000000) AS created_at,
VALUE:total_input_tokens::NUMBER AS input_tokens,
VALUE:estimated_cost_usd::FLOAT AS cost_usd,
PARSE_JSON(VALUE:messages_json::STRING) AS messages
FROM omnia_sessions
WHERE VALUE:agent_name::STRING = 'my-agent'
AND VALUE:created_at::NUMBER >= DATE_PART('epoch_nanosecond', '2026-04-01'::TIMESTAMP);

For high-frequency BI dashboards where external table read cost becomes noticeable, use Snowpipe to load the files into a native table on an S3 ObjectCreated event — Omnia does not need to be involved.

Create a Hive-partitioned external table:

CREATE OR REPLACE EXTERNAL TABLE `my_project.omnia.sessions`
WITH PARTITION COLUMNS (
year INT64,
month INT64,
day INT64,
agent STRING
)
OPTIONS (
format = 'PARQUET',
hive_partition_uri_prefix = 'gs://my-omnia-archive/sessions/',
uris = ['gs://my-omnia-archive/sessions/year=*/month=*/day=*/agent=*/*.parquet']
);

Partition pruning works automatically because BigQuery recognizes the year=/month=/day=/agent= segments:

SELECT
id,
agent_name,
TIMESTAMP_MICROS(DIV(created_at, 1000)) AS created_at,
total_input_tokens,
estimated_cost_usd,
JSON_EXTRACT_ARRAY(messages_json, '$') AS messages
FROM `my_project.omnia.sessions`
WHERE year = 2026 AND month = 4
AND agent_name = 'my-agent';

For native-table performance, use BigQuery Data Transfer Service to ingest the same files into a partitioned managed table — again, no Omnia-side changes needed.

ClickHouse reads S3 Parquet natively via the s3 table function:

SELECT
id,
agent_name,
fromUnixTimestamp64Nano(created_at) AS created_at,
total_input_tokens,
estimated_cost_usd,
JSONExtractArrayRaw(messages_json) AS messages
FROM s3(
'https://my-omnia-archive.s3.us-east-1.amazonaws.com/sessions/year=2026/month=04/**/*.parquet',
'AKIA...', 'SECRET...',
'Parquet'
)
WHERE agent_name = 'my-agent';

Or create a persistent table using the S3 engine:

CREATE TABLE omnia_sessions
ENGINE = S3(
'https://my-omnia-archive.s3.us-east-1.amazonaws.com/sessions/**/*.parquet',
'AKIA...', 'SECRET...',
'Parquet'
);

Athena reads Parquet on S3 via a Glue-registered external table:

CREATE EXTERNAL TABLE omnia_sessions (
id STRING,
agent_name STRING,
namespace STRING,
workspace_name STRING,
status STRING,
created_at BIGINT,
updated_at BIGINT,
expires_at BIGINT,
ended_at BIGINT,
message_count INT,
tool_call_count INT,
total_input_tokens BIGINT,
total_output_tokens BIGINT,
estimated_cost_usd DOUBLE,
tags STRING,
state STRING,
last_message_preview STRING,
messages_json STRING
)
PARTITIONED BY (year INT, month INT, day INT, agent STRING)
STORED AS PARQUET
LOCATION 's3://my-omnia-archive/sessions/'
TBLPROPERTIES ('parquet.compression' = 'SNAPPY');
MSCK REPAIR TABLE omnia_sessions;

Then query with partition pruning:

SELECT
id,
agent_name,
from_unixtime(created_at / 1000000000) AS created_at,
CAST(json_extract(messages_json, '$') AS ARRAY<JSON>) AS messages
FROM omnia_sessions
WHERE year = 2026 AND month = 4
AND agent_name = 'my-agent';

For local / ad-hoc exploration, DuckDB reads Parquet on object storage directly:

INSTALL httpfs;
LOAD httpfs;
SET s3_region = 'us-east-1';
SET s3_access_key_id = 'AKIA...';
SET s3_secret_access_key = 'SECRET...';
SELECT
id,
agent_name,
make_timestamp(created_at // 1000) AS created_at,
total_input_tokens,
estimated_cost_usd,
json_extract(messages_json, '$')::JSON AS messages
FROM read_parquet('s3://my-omnia-archive/sessions/**/*.parquet', hive_partitioning = true)
WHERE year = 2026 AND month = 4
AND agent_name = 'my-agent';

DuckDB infers the year, month, day, and agent partition columns automatically when hive_partitioning = true is set.

Retention is declarative via SessionRetentionPolicy. The compaction CronJob enforces the warm→cold move, and optionally the cold purge after a configurable age. Cleanup of files in the bucket past the cold retention period happens in the same job — no separate lifecycle policy needed, though you can still configure an S3 Lifecycle rule as a backstop if you prefer belt-and-braces.

See the SessionRetentionPolicy CRD reference for full field documentation.

Earlier versions of Omnia shipped a SessionAnalyticsSync CRD that imported the Parquet archive into Snowflake-native tables via a Kubernetes controller. It was removed before general availability because every major warehouse can query object-store Parquet natively as external tables, and the managed ingestion paths (Snowpipe, BigQuery Data Transfer Service, ClickHouse INSERT FROM s3(), Glue crawlers) are strictly better tools for high-throughput native-table loads than anything we could ship inside the operator.

If you need native-table performance for high-volume BI dashboards, use your warehouse’s own ingestion pipeline pointed at the Omnia archive bucket. The Parquet schema above is stable and Hive-partitioned, so any standard ingestion tool will work without custom adapters.