Hardware Sizing Guidelines
This document provides guidance for sizing GridOS clusters with Foundation and Connect. The small, medium, and large sizing options are general benchmarks, not exact specifications. Connect is a scalable and resilient integration service, and the optimal cluster size depends on how the service is used. Actual resource requirements may vary based on workload characteristics, data retention policies, and future growth. Use these guidelines as a starting point, and adjust sizing based on observed usage patterns and performance metrics.
Workload Characteristics
Different workload characteristics affect how services should be scaled and how resource limits are applied. The main factors that drive resource requirements are message throughput and message size.
When processing large volumes of small messages, scale Connect horizontally to increase throughput through concurrency. When handling large payloads, scale Connect vertically to ensure sufficient memory to process incoming data.
Note that increased integration workloads impact not only the integration service, but also the persistence provider, log indexer, and other GridOS Foundation services.
For any questions on specific use cases or integration scenarios, contact Connect Support.
For more information about scaling the Connect Kubernetes Deployments and the required configuration changes, see Scaling the Connect Kubernetes Deployments.
Cluster Sizing Guidelines
To meet the different processing needs for Connect, three standard GridOS cluster sizes are defined. Each size specifies CPU and memory resources per node, with separate storage recommendations for Elasticsearch and PostgreSQL.
These guidelines are designed to support the expected levels of message throughput handled by Connect. As the number of messages processed by Connect increases, additional compute resources might be required to maintain stability, reliability, and performance across the integration service.
The following table outlines the recommended hardware sizing for small, medium, and large deployments, based on anticipated hourly message volumes. All compute resource values are specified per node. Each cluster should include at least four nodes: three of them run control plane components, and all of them run worker components.
Always validate your configuration under expected load conditions before production deployment.
| Cluster Size | Hourly Message Volume (~2 KB messages) | Node Count | vCPUs (Cores) per Node | Memory (GiB) per Node |
|---|---|---|---|---|
| Small | Up to 200,000 | 4 | 8 | 32 |
| Medium | Up to 1,000,000 | 4 | 16 | 64 |
| Large | Up to 5,000,000 | 4 | 32 | 128 |
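To verify that provisioned nodes match these figures, you can inspect each node's allocatable capacity with standard kubectl, for example:

```sh
# List the CPU and memory that Kubernetes can actually schedule on each node.
kubectl get nodes -o custom-columns=NAME:.metadata.name,CPU:.status.allocatable.cpu,MEMORY:.status.allocatable.memory
```

Allocatable capacity is slightly lower than the raw node size because the kubelet and system daemons reserve a share of each node.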
Elasticsearch and PostgreSQL Sizing Guidelines
When deploying Elasticsearch and PostgreSQL, it is important to consider their specific resource requirements. For best performance, use storage designed for critical, Input/Output Operations Per Second (IOPS)-intensive, and throughput-intensive workloads that require low latency, such as Amazon EBS Provisioned IOPS SSD volumes.
Elasticsearch
Elasticsearch data nodes are categorized as dataHot or dataCold. Their usage is controlled by the Index Lifecycle Management (ILM) policy, which defines where data is stored, how long it is retained, and when it transitions between storage phases (for example, from hot to cold). In the context of Connect, only dataHot nodes are used. The following sizing guidelines apply only to dataHot nodes.
| Cluster Size | Number of Data Nodes | CPU Limits (vCPUs) | Memory Limits (GiB) | Storage (GiB) per Node |
|---|---|---|---|---|
| Small | 3 | 2 | 6 | 50 |
| Medium | 3 | 4 | 8 | 200 |
| Large | 3 | 8 | 12 | 600 |
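As a point of reference, the hot-phase retention behavior described above is expressed in an ILM policy. The following sketch shows what a minimal hot-only policy with time-based deletion could look like; the policy name, rollover thresholds, and retention period are illustrative assumptions, not the policy shipped with Connect:

```sh
# Illustrative hot-only ILM policy; tune rollover and retention to your workload.
curl -X PUT "http://localhost:9200/_ilm/policy/connect-hot-only" \
  -H 'Content-Type: application/json' -d '{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": { "max_age": "7d", "max_primary_shard_size": "25gb" }
        }
      },
      "delete": {
        "min_age": "30d",
        "actions": { "delete": {} }
      }
    }
  }
}'
```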
PostgreSQL
The primary database for Connect is a highly available PostgreSQL instance, connect-postgresql, deployed alongside the Connect microservices.
Resource usage and storage requirements depend mainly on the database growth rate and the size of the Write-Ahead Log (WAL).
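Because sizing is driven by growth, it helps to measure the database size and WAL write position periodically and extrapolate. A minimal check might look like the following; the pod name (connect-postgresql-0) and the user are assumptions that depend on your deployment:

```sh
# Report the current database size and WAL position; compare snapshots over time
# to estimate the growth rate.
kubectl exec -it connect-postgresql-0 -- psql -U postgres -c \
  "SELECT pg_size_pretty(pg_database_size(current_database())) AS size, pg_current_wal_lsn() AS wal_lsn;"
```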
The following table lists recommended resource allocations for each cluster size.
| Cluster Size | Number of Servers | CPU Limits (vCPUs) | Memory Limits (GiB) | Storage (GiB) per Server |
|---|---|---|---|---|
| Small | 3 | 2 | 4 | 5 |
| Medium | 3 | 4 | 8 | 20 |
| Large | 3 | 8 | 16 | 50 |
Scaling the Connect Kubernetes Deployments
This section explains how to scale the GridOS Connect service Deployments, and which configuration changes are required.
It applies only to services managed by the `connect` Helm chart, not to the third-party charts distributed with Connect (`connect-openbao`, `connect-postgresql`, and `connect-victoria-metrics`).
Scaling Recommendations
Specific recommendations for the number of pods or resource limits cannot be provided, as the optimal configuration depends entirely on the use case — the number of flows, message throughput, payload sizes, and integration patterns all play a role.
Scaling decisions should be based on evidence: conduct performance testing that reflects your actual workload, observe the results, and adjust accordingly.
Connect Services Deployment Scaling Support
The Connect services are deployed as a set of Kubernetes Deployments via Helm.
Of all Connect Deployments, only the flow-server supports running as a cluster (multiple pods).
All other Connect Deployments are designed to run as a single pod (replicas: 1) and must not be scaled horizontally.
Do not increase the `replicas` value of any Connect Deployment other than the flow-server.
Scaling the Flow Server
The flow-server uses Apache Pekko Cluster Sharding to distribute flow pipeline execution across multiple pods. This allows the flow-server to scale horizontally to increase message throughput and resilience.
The default replica count is 3, as defined in the Connect chart values.
You can increase this value to add more flow-server pods to the cluster.
Updating maxNumberOfShards
When scaling the number of flow-server replicas, you must also update the mc.flow-server.maxNumberOfShards configuration property.
This property controls the number of shards used by Pekko Cluster Sharding to distribute flow pipelines across pods. The recommended value follows the Pekko documentation formula:
maxNumberOfShards = 10 × number of flow-server pods
The default value is 30, which corresponds to 3 pods.
To override this value, add the following to your values.yaml override file (see Chart Value Override Recommendations):
```yaml
flowserver:
  replicas: 5                               # (1)
  config:
    application.yml:
      mc.flow-server.maxNumberOfShards: 50  # (2)
```
| 1 | Number of flow-server pods. |
| 2 | Set to 10 × replicas (e.g. 10 × 5 = 50). |
Apply the changes using Helm or Helmfile as described in Deploy with Helmfile.
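For example, a plain Helm upgrade might look like the following; the release name, chart reference, and namespace are placeholders that depend on your installation:

```sh
# Apply the updated values to the running release.
helm upgrade connect <chart-reference> -f values.yaml -n <namespace>
```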
Source Processor Distribution
When running multiple flow-server pods, source processors are distributed across the cluster depending on their type:
- Most source processors (for example, `restApi`) are deployed on every node in the cluster, so incoming work is naturally spread across all pods.
- Some source processors are singletons and run on only one node at a time, regardless of how many pods are running. Examples include `readFiles`, `schedule`, and `receiveFromRabbitMq`.
Singleton sources are not parallelized automatically by adding more replicas. If parallel execution is required, the flow developer can apply manual sharding, for example by deploying multiple copies of a flow where each instance reads files with a different name prefix, as sketched below.
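The following sketch illustrates the idea; the flow syntax shown is hypothetical and only meant to convey the pattern of splitting one logical flow into copies with disjoint inputs:

```yaml
# Hypothetical flow definitions: two copies of the same flow, each claiming a
# disjoint file-name range so they can run in parallel on different pods.
flows:
  - name: import-files-a-to-m
    source:
      type: readFiles
      filenamePattern: "[a-m]*.csv"   # hypothetical parameter
  - name: import-files-n-to-z
    source:
      type: readFiles
      filenamePattern: "[n-z]*.csv"   # hypothetical parameter
```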
Vertical Scaling
Any Connect service can be vertically scaled by increasing its CPU and memory limits. This is particularly relevant for the flow-server when handling large message payloads, but applies to all services. See Chart Value Override Recommendations for guidance on adjusting resource limits.
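For example, assuming the chart exposes standard Kubernetes resource settings under each service key (verify the exact path against your chart's values), raising the flow-server limits might look like this:

```yaml
flowserver:
  resources:          # assumed key path; confirm in the connect chart values
    requests:
      cpu: "2"
      memory: 4Gi
    limits:
      cpu: "4"
      memory: 8Gi
```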