Hydrolix Release Notes v5.8.5⚓︎
Notable new features⚓︎
Autoscaling Improvements⚓︎
- HDX Scaler efficiency has been improved by optimizing how metrics are parsed and processed.
- HDX Scaler now restarts only those autoscalers whose configurations have changed, preserving weighted average calculation state for unaffected scalers and reducing compute resource usage.
- The Vertical Pod Autoscaler (VPA) adds immediate upscale behavior when error conditions are detected.
- The VPA responds to error conditions such as OOM kills, CPU saturation, and ephemeral storage eviction.
Improved Configuration Publishing Performance⚓︎
- The common object revision tree query is now optimized for speed, and its results are cached, accelerating configuration generation for larger configurations.
Upgrade instructions⚓︎
This release introduces a migration to enforce unique credential names.
The migration runs automatically during upgrade and may rename existing credentials if duplicates are found. No action is required, but Hydrolix recommends checking any automation that references credentials by name.
Apply the new Hydrolix operator⚓︎
If you have a self-managed installation, apply the new operator directly with the kubectl command examples below. If you're using Hydrolix-supplied tools to manage your installation, follow the procedure prescribed by those tools.
GKE⚓︎
EKS⚓︎
LKE and AKS⚓︎
Monitor the upgrade process⚓︎
Kubernetes jobs named init-cluster and init-turbine-api will automatically run to upgrade your entire installation to match the new operator's version number. This will take a few minutes, during which time you can observe your pods' restarts with your preferred Kubernetes monitoring tool.
Ensure both the init-cluster and init-turbine-api jobs have completed successfully and that the turbine-api pod has restarted without errors. After that, view the UI and use the API of your new installation as a final check.
If the turbine-api pod doesn't restart properly or other functionality is missing, check the logs of the init-cluster and init-turbine-api jobs for details about failures. This can be done using the k9s utility or with the kubectl command:
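For example, the job logs can be read directly (a sketch; substitute your installation's namespace):

```shell
# Inspect the upgrade jobs for failure details; <namespace> is a placeholder
# for the namespace your Hydrolix cluster runs in.
kubectl logs job/init-cluster -n <namespace>
kubectl logs job/init-turbine-api -n <namespace>
```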
If you still need help, contact Hydrolix support.
Changelog⚓︎
Updates⚓︎
Config API updates⚓︎
- Updated default internal PostgreSQL from version 12 to version 15.2. Version 12 has reached its end-of-life date.
Operator updates⚓︎
- Upgraded Pingora from version 0.4.0 to version 0.6.0 to resolve a DoS vulnerability (CVE-2025-8671) where repeated HTTP/2 resets could force excessive memory consumption due to a mismatch in buffer allocation.
- Upgraded Traefik from 3.4 to 3.5.3 to accommodate validation of Google service accounts and to preserve client IPs.
- Chproxy has been updated from 0.5.1 to 0.6.1.
Improvements⚓︎
Cluster operations improvements⚓︎
- Improved logs concerning Kubernetes API exceptions when the operator fails to apply changes to resources. Extended logs now catch the base class `DynamicAPIError` and use the built-in method `summary()` to extract the status code and relevant information. Where the error appears temporary, the operator no longer logs an incorrect `CONFLICT` error.
- Introduced a status record for all overrides within the HydrolixCluster spec. Users can now review the validity of an override at any time, whether it is active or not. Override settings can now be updated while the override is active.
- Implemented a secret named `kafka-tls-secret` for passing Kafka TLS keys. The old `kafka_tls_key` tunable has been deprecated and will be ignored. The Kubernetes Secret name is now specified with the `kafka_tls_secret_name` tunable; the secret contains three key/value pairs:
  - `KAFKA_TLS_CERT`: PEM-format certificate
  - `KAFKA_TLS_CA`: CA certificate
  - `KAFKA_TLS_KEY`: PEM-format key
- Improved HDX Scaler efficiency by optimizing how metrics are parsed and processed. This update reduces CPU usage in environments with multiple autoscaling targets, improving overall cluster responsiveness.
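The `kafka-tls-secret` described above can be created with kubectl; a minimal sketch, assuming your certificate, CA, and key are local PEM files (the file names and namespace are illustrative placeholders):

```shell
# Create the Kafka TLS secret with the three expected keys.
# client.crt, ca.crt, and client.key are placeholder paths.
kubectl create secret generic kafka-tls-secret \
  --from-file=KAFKA_TLS_CERT=client.crt \
  --from-file=KAFKA_TLS_CA=ca.crt \
  --from-file=KAFKA_TLS_KEY=client.key \
  -n <namespace>
```

Reference the secret from the `kafka_tls_secret_name` tunable rather than the deprecated `kafka_tls_key` tunable.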
- HDX Scaler: Improved vertical pod autoscaling (VPA) by responding to error conditions such as OOM kills, CPU saturation, and ephemeral storage eviction. Adds immediate upscale behavior when error conditions are detected, plus new configuration parameters (`cpu_sat_min_unused_percent`, `cpu_saturation_seconds`) to fine-tune CPU saturation thresholds.
- Traefik `hdx-auth` plugin: added Google OAuth2 support
  - Validated Google service account tokens before forwarding requests
  - Upgraded Traefik to v3.5.x to enable required packages
  - Improved handling for invalid, expired, or unverified email
  - Created these five Prometheus metrics:
    - `hdx_auth_cache_operation_duration_seconds` (summary): cache operation latency in seconds
    - `hdx_auth_cache_hit_total` (counter): total authorization cache hits since pod start
    - `hdx_auth_cache_miss_total` (counter): total authorization cache misses since pod start
    - `hdx_auth_cache_deletes_total` (counter): total authorization cache deletions since pod start
    - `hdx_auth_cache_length` (gauge): total number of items in the cache
- Pods targeted for vertical autoscaling will now log basic CPU and memory metrics when a global `hdx_pod_metrics_enabled: true` tunable is applied to the cluster.
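One possible way to apply the tunable, assuming your HydrolixCluster resource is named `hdx` (the resource name and namespace are assumptions; adjust them to your installation):

```shell
# Enable cluster-wide pod metrics logging via the hdx_pod_metrics_enabled tunable.
kubectl patch hydrolixcluster hdx -n <namespace> --type merge \
  -p '{"spec": {"hdx_pod_metrics_enabled": true}}'
```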
- Environment variables can now be specified on a per-pod basis with the `env` tunable embedded in an HTN tunable.
- Per-cluster metrics can now be created, properly handling global metrics like queue lengths and pending job counts. Previously, only per-pod metrics were possible. HDX Scaler can use these to make better decisions and has an improved UI to handle these new metrics.
- The Horizontal Pod Autoscaler now avoids erratic behavior during startup by reading scaling information from Kubernetes ConfigMaps. Averages and cooldown-window scaling information are now persisted every 60 seconds.
- The Horizontal Pod Autoscaler has improved initial scaling responsiveness due to correction for zero-initialization bias.
- Traefik now preserves client IP addresses for each cloud provider, using `X-Forwarded-*` headers via proxy protocol (for EKS and LKE) or `externalTrafficPolicy=Local` (for other clouds). New tunables were introduced with this improvement:
  - `traefik_pre_stop_seconds`: number of seconds to sleep when Traefik pods are terminating
  - `traefik_use_local_policy`: sets `externalTrafficPolicy` to `Local` for the Traefik service
  - `targeting`: a dictionary of targeting-related Kubernetes settings
  - `traefik_load_balancer_class`: name of the load balancer you want to use. For example, set to `service.k8s.aws/nlb` to use the AWS Load Balancer Controller
- HDX Scaler now restarts only those autoscalers whose configurations have changed, preserving weighted average calculation state for unaffected scalers and reducing compute resource usage.
Config API improvements⚓︎
- New SIEM sources now get their own credential and credential ID upon creation, rather than storing credential information directly in the SIEM source. For backwards compatibility, secrets can still be passed as part of the SIEM source.
- Service accounts can no longer create, modify, or delete other service accounts. They also can no longer create authorization tokens. Service accounts with appropriate permissions can still list and show other service accounts.
- Improved configuration publishing performance by speeding up the common object revision tree query and caching the results. This speeds up configuration generation for larger configurations.
- Removed vestigial support in the `/config/v1/login` API endpoint for the ancillary values `HYDROLIX_USER_ID`, `HYDROLIX_EXPIRY_DATE`, and `HYDROLIX_ORGANIZATION_ID`. These values were once used in cookies to allow the UI to track additional user information. The UI now retrieves this information from the API, and these variables are no longer necessary.
- Centralized Keycloak events into the `hydro.audit_log` table by migrating events every five minutes and pruning migrated events daily. Supported more query parameters on the `/config/v1/auth_logs` endpoint.
- Added validations to the storage deletion endpoint to check whether a storage definition is used in a spread list. Retained the default behavior, refusing to `DELETE /config/v1/orgs/{org_id}/storages/{id}/` if any reference exists. The endpoint now accepts the `force_operation` flag to cascade deletes across all storage maps.
- Allowed creating role policies by name instead of ID for organization, project, and table scopes.
  - For orgs and projects, use the name.
  - For tables, use `project_name.table_name`.
  - If the name doesn't resolve uniquely, an error is returned suggesting the use of `scope_id` instead.
- Enforced unique credential names across the entire cluster to prevent confusion and duplication. Cluster upgrade runs a migration script to ensure existing credentials have unique names and adds validation to block duplicate creation or updates.
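The cascading storage delete mentioned above can be sketched with curl, assuming `force_operation` is passed as a query parameter and a bearer token is used for authentication (both are assumptions; check the API reference for the exact form):

```shell
# Force-delete a storage definition, cascading across all storage maps.
# HDX_HOST, ORG_ID, STORAGE_ID, and HDX_TOKEN are placeholder variables.
curl -X DELETE \
  "https://$HDX_HOST/config/v1/orgs/$ORG_ID/storages/$STORAGE_ID/?force_operation=true" \
  -H "Authorization: Bearer $HDX_TOKEN"
```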
Core improvements⚓︎
- Lowered the socket timeout setting in ClickHouse from `1000` to `20` seconds to improve cancel response time on queries.
Merge improvements⚓︎
- Optimized merge cleanup locking by introducing a fast path that fetches existing lock records with a `SELECT` instead of an `UPSERT`. This reduces contention under high database load by allowing merge peers to read lock IDs optimistically before acquiring write locks.
Query improvements⚓︎
- Added `cancel_recv_time_secs` and `num_streams` fields to `query_detail_runtime_stats` and query head/peer logs. This helps surface cancellation delay and stream parallelism for troubleshooting and tuning.
- Added `effective_query_timerange` to `query_detail_runtime_stats` to support further query speed optimizations.
UI improvements⚓︎
- Pretransforms are now supported in the UI.
Bug fixes⚓︎
Cluster operation bug fixes⚓︎
- Fixed the default template for Grafana URL deployment from Kubernetes operator. A previously missing trailing slash was causing infinite redirect loops.
- Removed a trailing slash from `hydrolix_url` when constructing a URL for Kibana.
- Excluded `cloudsqladmin`, `rdsadmin`, and a few template PostgreSQL databases during initialization. Services other than the Hydrolix operator use databases in the PostgreSQL cluster.
- The Horizontal Pod Autoscaler, when configured in "range mode," no longer skips scale-down operations.
- The HDX Scaler autoscaler now correctly calculates desired replica counts when using external pod metrics.
- Vector log files are now written to the correct folder structure in object storage with properly formatted template variables, allowing the periodic cleanup service to find and delete old logs.
- HDX Scaler now terminates orphaned scalers when their configuration is removed, preventing unwanted scaling operations from continuing.
- Duplicate credential names are no longer allowed during version migration.
Config API bug fixes⚓︎
- The `init-turbine-api` job now ensures that `hdx_primary` storage settings are in sync with environment variables on every start. Should a user directly update environment variables concerning the default `hdx_primary` storage, the API will now honor those updates. Also, deletion of the `hdx_primary` storage is prohibited.
- Fixed RBAC permissions on the `config_blob` endpoint. Access controls are now enforced.
- Corrected an issue where keys were not properly deleted in Kubernetes when a credential or other object containing secrets was deleted.
- Authorization tokens are now refused for a disabled account. This removes the ability of a user to continue to use a token from before the account was disabled.
- Fixed an API migration issue that incorrectly reported unapplied changes to the `AuditLog` model during deployment or migration commands. The affected models are now marked as unmanaged, preventing false warnings when running `manage.py migrate` or release commands.
- Ensured that automatically generated credential names don't clash with pre-existing credential names during cluster upgrade.
- Summary table column names can no longer be added via the API, since summary tables are currently incompatible with column renaming.
Operator bug fixes⚓︎
- Horizontal Pod Autoscaling (HPA) behavior: pauses horizontal scaling when the target app has zero replicas to avoid stalled states.
  - Logs a warning on first detection and an info message on recovery
  - Adds the Prometheus metric `hdxscaler_hpa_scaling_stalled` to visualize time spent stalled
  - Applies when scaling based on metrics emitted by another app
  - Applies to: v5.4.x and earlier

  In earlier releases, deployments that scaled to zero based on metrics from another app could remain stalled indefinitely. For example, if merge-peers scaled to zero while merge-controller continued exposing metrics, the peers might never restart automatically. This release adds detection and logging for that condition. When the target app has no replicas, HDX Scaler now pauses scaling, logs a warning, and resumes once replicas are available.

  For older versions, monitor deployments that scale horizontally from other services. If any appear stuck at zero replicas, manually scale them up or upgrade to this version.
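On affected older versions, a deployment stuck at zero replicas can be restored manually with kubectl; a sketch, where the deployment name and namespace are illustrative placeholders:

```shell
# Manually restore replicas for a deployment stuck at zero so that
# metrics-driven autoscaling can resume.
kubectl scale deployment merge-peer --replicas=1 -n <namespace>
```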
- Fixed a race condition and token expiration issues that caused intermittent invalid token errors from ClickHouse. This bug particularly affected users of Grafana alert queries.
- The operator did not detect changes made to the `curated` secret using replace rather than patch, for example: `cat curated.yaml | kubectl replace -f -`. The operator now detects changes to the `curated` secret made using replace.
- Removed the unnecessary `REPLICATION` permission from the `pgmonitor` user. This permission caused failures during user provisioning in such environments as Amazon RDS.
- Intake-head pods now terminate in the correct order during scale-down events, preventing data loss during autoscaling operations. The indexer has been moved to a separate Kubernetes sidecar container.
- Traefik metrics endpoints are no longer exposed to the public when `ip_allowlist` is set to `0.0.0.0/0`.
Query bug fixes⚓︎
- `hdx_query_max_execution_time` now works as advertised. Context is passed to the query engine, and `CancelledError` exceptions are handled properly.
- Column names are no longer logged for users without access to the corresponding tables.
- Avoided a mutex-related deadlock in `query-head`.
- Fixed a type inference and integer overflow bug that affected JSON field queries with numbers of different sizes.
- Running many concurrent queries and restarting query peers no longer results in deadlocks.
- Fixed rare segfaults in `query-head` and `intake-head` during query execution and configuration loading.