Hydrolix Release Notes v5.8.5⚓︎
Notable new features⚓︎
Autoscaling Improvements⚓︎
- HDX Scaler efficiency has been improved by optimizing how metrics are parsed and processed.
- HDX Scaler now restarts only those autoscalers whose configurations have changed, preserving weighted average calculation state for unaffected scalers and reducing compute resource usage.
- The Vertical Pod Autoscaler (VPA) adds immediate upscale behavior when error conditions are detected.
- The VPA responds to error conditions such as OOM kills, CPU saturation, and ephemeral storage eviction.
Improved Configuration Publishing Performance⚓︎
- The common object revision tree query is now optimized for speed, and its results are cached, accelerating configuration generation for larger configurations.
Upgrade instructions⚓︎
This release introduces a migration to enforce unique credential names.
The migration runs automatically during upgrade and may rename existing credentials if duplicates are found. No action is required, but Hydrolix recommends checking any automation that references credentials by name.
Apply the new Hydrolix operator⚓︎
If you have a self-managed installation, apply the new operator directly with the kubectl command examples below. If you're using Hydrolix-supplied tools to manage your installation, follow the procedure prescribed by those tools.
GKE⚓︎
EKS⚓︎
LKE and AKS⚓︎
Monitor the upgrade process⚓︎
Kubernetes jobs named init-cluster and init-turbine-api will automatically run to upgrade your entire installation to match the new operator's version number. This will take a few minutes, during which time you can observe your pods' restarts with your preferred Kubernetes monitoring tool.
Ensure both the init-cluster and init-turbine-api jobs have completed successfully and that the turbine-api pod has restarted without errors. After that, view the UI and use the API of your new installation as a final check.
If the turbine-api pod doesn't restart properly or other functionality is missing, check the logs of the init-cluster and init-turbine-api jobs for details about failures. This can be done using the k9s utility or with the kubectl command:
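For example, the job logs can be read directly (a sketch; substitute your installation's namespace):

```shell
# Inspect the upgrade jobs for failure details; <namespace> is a placeholder
# for the namespace your Hydrolix cluster runs in.
kubectl logs job/init-cluster -n <namespace>
kubectl logs job/init-turbine-api -n <namespace>
```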
If you still need help, contact Hydrolix support.
Changelog⚓︎
Updates⚓︎
Config API updates⚓︎
- Updated default internal PostgreSQL from version 12 to version 15.2. Version 12 has reached its end-of-life date.
Operator updates⚓︎
- Upgraded Pingora from version 0.4.0 to version 0.6.0 to resolve a DoS vulnerability (CVE-2025-8671) where repeated HTTP/2 resets could force excessive memory consumption due to a mismatch in buffer allocation.
- Upgraded Traefik from 3.4 to 3.5.3 to accommodate validation of Google service accounts and to preserve client IPs.
- Chproxy has been updated from 0.5.1 to 0.6.1.
Improvements⚓︎
Cluster operations improvements⚓︎
- Improved logs concerning Kubernetes API exceptions when the operator fails to apply changes to resources. Extended logs now catch the base class `DynamicAPIError` and use the built-in method `summary()` to extract the status code and relevant information. Where the error appears temporary, the operator no longer logs an incorrect `CONFLICT` error.
- Introduced a status record for all overrides within the HydrolixCluster spec. Users can now review the validity of an override at any time, whether it is active or not. Override settings can now be updated while the override is active.
- Implemented a secret named `kafka-tls-secret` for passing Kafka TLS keys. The old `kafka_tls_key` tunable has been deprecated and will be ignored. The Kubernetes Secret name is now specified with the `kafka_tls_secret_name` tunable; the secret contains three key/value pairs:
  - `KAFKA_TLS_CERT`: PEM-format certificate
  - `KAFKA_TLS_CA`: CA certificate
  - `KAFKA_TLS_KEY`: PEM-format key
- Improved HDX Scaler efficiency by optimizing how metrics are parsed and processed. This update reduces CPU usage in environments with multiple autoscaling targets, improving overall cluster responsiveness.
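The `kafka-tls-secret` described above can be created with kubectl; a minimal sketch, assuming your certificate, CA, and key are local PEM files (the file names and namespace are illustrative placeholders):

```shell
# Create the Kafka TLS secret with the three expected keys.
# client.crt, ca.crt, and client.key are placeholder paths.
kubectl create secret generic kafka-tls-secret \
  --from-file=KAFKA_TLS_CERT=client.crt \
  --from-file=KAFKA_TLS_CA=ca.crt \
  --from-file=KAFKA_TLS_KEY=client.key \
  -n <namespace>
```

Reference the secret from the `kafka_tls_secret_name` tunable rather than the deprecated `kafka_tls_key` tunable.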
- HDX Scaler: Improved vertical pod autoscaling (VPA) by responding to error conditions such as OOM kills, CPU saturation, and ephemeral storage eviction. Adds immediate upscale behavior when error conditions are detected, plus new configuration parameters (`cpu_sat_min_unused_percent`, `cpu_saturation_seconds`) to fine-tune CPU saturation thresholds.
- Traefik `hdx-auth` plugin: added Google OAuth2 support
  - Validated Google service account tokens before forwarding requests
  - Upgraded Traefik to v3.5.x to enable required packages
  - Improved handling for invalid, expired, or unverified email
  - Created these five Prometheus metrics:
    - `hdx_auth_cache_operation_duration_seconds` (summary): cache operation latency in seconds
    - `hdx_auth_cache_hit_total` (counter): total authorization cache hits since pod start
    - `hdx_auth_cache_miss_total` (counter): total authorization cache misses since pod start
    - `hdx_auth_cache_deletes_total` (counter): total authorization cache deletions since pod start
    - `hdx_auth_cache_length` (gauge): total number of items in the cache
- Pods targeted for vertical autoscaling will now log basic CPU and memory metrics when a global `hdx_pod_metrics_enabled: true` tunable is applied to the cluster.
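One possible way to apply the tunable, assuming your HydrolixCluster resource is named `hdx` (the resource name and namespace are assumptions; adjust them to your installation):

```shell
# Enable cluster-wide pod metrics logging via the hdx_pod_metrics_enabled tunable.
kubectl patch hydrolixcluster hdx -n <namespace> --type merge \
  -p '{"spec": {"hdx_pod_metrics_enabled": true}}'
```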
- Environment variables can now be specified on a per-pod basis with the `env` tunable embedded in an HTN tunable.
- Per-cluster metrics can now be created, properly handling global metrics like queue lengths and pending job counts. Previously, only per-pod metrics were possible. HDX Scaler can use these to make better decisions and has an improved UI to handle these new metrics.
- The Horizontal Pod Autoscaler now avoids erratic behavior during startup by reading scaling information from Kubernetes ConfigMaps. Averages and cooldown-window scaling information are now persisted every 60 seconds.
- The Horizontal Pod Autoscaler has improved initial scaling responsiveness due to correction for zero-initialization bias.
- Traefik now preserves client IP addresses for each cloud provider, using `X-Forwarded-*` headers via proxy protocol (for EKS and LKE) or `externalTrafficPolicy=Local` (for other clouds). New tunables were introduced with this improvement:
  - `traefik_pre_stop_seconds`: number of seconds to sleep when Traefik pods are terminating
  - `traefik_use_local_policy`: sets `externalTrafficPolicy` to `Local` for the Traefik service
  - `targeting`: a dictionary of targeting-related Kubernetes settings
  - `traefik_load_balancer_class`: name of the load balancer you want to use. For example, set to `service.k8s.aws/nlb` to use the AWS Load Balancer Controller
- HDX Scaler now restarts only those autoscalers whose configurations have changed, preserving weighted average calculation state for unaffected scalers and reducing compute resource usage.
Config API improvements⚓︎
- New SIEM sources now get their own credential and credential ID upon creation, rather than storing credential information directly in the SIEM source. For backwards compatibility, secrets can still be passed as part of the SIEM source.
- Service accounts can no longer create, modify, or delete other service accounts. They also can no longer create authorization tokens. Service accounts with appropriate permissions can still list and show other service accounts.
- Improved configuration publishing performance by speeding up the common object revision tree query and caching the results. This speeds up configuration generation for larger configurations.
- Removed vestigial support in the `/config/v1/login` API endpoint for the ancillary values `HYDROLIX_USER_ID`, `HYDROLIX_EXPIRY_DATE`, and `HYDROLIX_ORGANIZATION_ID`. These values were once used in cookies to allow the UI to track additional user information. The UI now retrieves this information from the API, and these variables are no longer necessary.
- Centralized Keycloak events into the `hydro.audit_log` table by migrating events every five minutes and pruning migrated events daily. Supported more query parameters on the `/config/v1/auth_logs` endpoint.
- Added validations to the storage deletion endpoint to check whether a storage definition is used in a spread list. Retained the default behavior, refusing to `DELETE /config/v1/orgs/{org_id}/storages/{id}/` if any reference exists. The endpoint now accepts the `force_operation` flag to cascade deletes across all storage maps.
- Allowed creating role policies by name instead of ID for organization, project, and table scopes.
  - For orgs and projects, use the name.
  - For tables, use `project_name.table_name`.
  - If the name doesn't resolve uniquely, an error is returned suggesting the use of `scope_id` instead.
- Enforced unique credential names across the entire cluster to prevent confusion and duplication. Cluster upgrade runs a migration script to ensure existing credentials have unique names and adds validation to block duplicate creation or updates.
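The cascading storage delete mentioned above can be sketched with curl, assuming `force_operation` is passed as a query parameter and a bearer token is used for authentication (both are assumptions; check the API reference for the exact form):

```shell
# Force-delete a storage definition, cascading across all storage maps.
# HDX_HOST, ORG_ID, STORAGE_ID, and HDX_TOKEN are placeholder variables.
curl -X DELETE \
  "https://$HDX_HOST/config/v1/orgs/$ORG_ID/storages/$STORAGE_ID/?force_operation=true" \
  -H "Authorization: Bearer $HDX_TOKEN"
```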
Core improvements⚓︎
- Lowered the socket timeout setting in ClickHouse from `1000` to `20` seconds to improve cancel response time on queries.
Merge improvements⚓︎
- Optimized merge cleanup locking by introducing a fast path that fetches existing lock records with a `SELECT` instead of an `UPSERT`. This reduces contention under high database load by allowing merge peers to read lock IDs optimistically before acquiring write locks.
Query improvements⚓︎
- Added `cancel_recv_time_secs` and `num_streams` fields to `query_detail_runtime_stats` and query head/peer logs. This helps surface cancellation delay and stream parallelism for troubleshooting and tuning.
- Added `effective_query_timerange` to `query_detail_runtime_stats` to support further query speed optimizations.
UI improvements⚓︎
- Pretransforms are now supported in the UI.
Bug fixes⚓︎
Cluster operation bug fixes⚓︎
- Fixed the default template for Grafana URL deployment from Kubernetes operator. A previously missing trailing slash was causing infinite redirect loops.
- Removed a trailing slash from `hydrolix_url` when constructing a URL for Kibana.
- Excluded `cloudsqladmin`, `rdsadmin`, and a few template PostgreSQL databases during initialization. Services other than the Hydrolix operator use databases in the PostgreSQL cluster.
- The Horizontal Pod Autoscaler, when configured in "range mode," no longer skips scale-down operations.
- The HDX Scaler autoscaler now correctly calculates desired replica counts when using external pod metrics.
- Vector log files are now written to the correct folder structure in object storage with properly formatted template variables, allowing the periodic cleanup service to find and delete old logs.
- HDX Scaler now terminates orphaned scalers when their configuration is removed, preventing unwanted scaling operations from continuing.
- Duplicate credential names are no longer allowed during version migration.
Config API bug fixes⚓︎
- The `init-turbine-api` job now ensures that `hdx_primary` storage settings are in sync with environment variables on every start. Should a user directly update environment variables concerning the default `hdx_primary` storage, the API will now honor those updates. Also, deletion of the `hdx_primary` storage is prohibited.
- Fixed RBAC permissions on the `config_blob` endpoint. Access controls are now enforced.
- Corrected an issue where keys were not properly deleted in Kubernetes when a credential or other object containing secrets was deleted.
- Authorization tokens are now refused for a disabled account. This removes the ability of a user to continue to use a token from before the account was disabled.
- Fixed an API migration issue that incorrectly reported unapplied changes to the `AuditLog` model during deployment or migration commands. The affected models are now marked as unmanaged, preventing false warnings when running `manage.py migrate` or release commands.
- Ensured that automatically generated credential names don't clash with pre-existing credential names during cluster upgrade.
- Summary table column names can no longer be added via the API, since summary tables are currently incompatible with column renaming.
Operator bug fixes⚓︎
- Horizontal Pod Autoscaling (HPA) behavior: pauses horizontal scaling when the target app has zero replicas to avoid stalled states.
  - Logs a warning on first detection and an info message on recovery
  - Adds the Prometheus metric `hdxscaler_hpa_scaling_stalled` to visualize time spent stalled
  - Applies when scaling based on metrics emitted by another app
  - Applies to: v5.4.x and earlier

  In earlier releases, deployments that scaled to zero based on metrics from another app could remain stalled indefinitely. For example, if merge-peers scaled to zero while merge-controller continued exposing metrics, the peers might never restart automatically. This release adds detection and logging for that condition. When the target app has no replicas, HDX Scaler now pauses scaling, logs a warning, and resumes once replicas are available.

  For older versions, monitor deployments that scale horizontally from other services. If any appear stuck at zero replicas, manually scale them up or upgrade to this version.
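On affected older versions, a deployment stuck at zero replicas can be restored manually with kubectl; a sketch, where the deployment name and namespace are illustrative placeholders:

```shell
# Manually restore replicas for a deployment stuck at zero so that
# metrics-driven autoscaling can resume.
kubectl scale deployment merge-peer --replicas=1 -n <namespace>
```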
- Fixed a race condition and token expiration issues that caused intermittent invalid token errors from ClickHouse. This bug particularly affected users of Grafana alert queries.
- The operator did not detect changes made to the `curated` secret using replace rather than patch, for example: `cat curated.yaml | kubectl replace -f -`. The operator now detects changes to the `curated` secret made using replace.
- Removed the unnecessary `REPLICATION` permission from the `pgmonitor` user. This permission caused failures during user provisioning in such environments as Amazon RDS.
- Intake-head pods now terminate in the correct order during scale-down events, preventing data loss during autoscaling operations. The indexer has been moved to a separate Kubernetes sidecar container.
- Traefik metrics endpoints are no longer exposed to the public when `ip_allowlist` is set to `0.0.0.0/0`.
Query bug fixes⚓︎
- `hdx_query_max_execution_time` now works as advertised. Context is passed to the query engine, and `CancelledError` exceptions are handled properly.
- Column names are no longer logged for users without access to the corresponding tables.
- Avoided a mutex-related deadlock in `query-head`.
- Fixed a type inference and integer overflow bug that affected JSON field queries with numbers of different sizes.
- Running many concurrent queries and restarting query peers no longer results in deadlocks.
- Fixed rare segfaults in `query-head` and `intake-head` during query execution and configuration loading.