
Getting started with OpenShift ServiceMesh Federation

Introduction

When you are serious about running a microservice architecture at scale, you will definitely run into challenges at some point. Besides all the benefits microservices deliver, they also add a layer of complexity and bring all the challenges of a distributed system.

And as the number of services grows, it becomes increasingly difficult to understand the interactions between all of them. Finding and tracing errors or latencies can become quite a hard job, and rolling out new services in a secure and reliable fashion can be a struggle for dev and ops teams.

Red Hat OpenShift Service Mesh addresses a variety of challenges in a microservice architecture by creating a centralized point of control in an application. It adds a transparent layer to existing distributed applications without requiring any changes to the application’s code.

OpenShift Service Mesh is based on the open source Maistra project, an opinionated distribution of Istio designed to work with OpenShift. It combines Kiali, Jaeger, and Prometheus into a platform managed according to the OperatorHub lifecycle.

OpenShift Service Mesh can do a lot of things. The number of features and capabilities can be a bit overwhelming, but at its core it provides four main things:

  • Traffic management
  • Telemetry and Observability 
  • Service Identity and Security
  • Policy enforcement

Service Mesh Federation

OpenShift Service Mesh 2.1 introduced a new feature, “Federation”: a deployment model that lets you share services and workloads across separate meshes managed in distinct administrative domains.

The OpenShift ServiceMesh Federation feature differs from upstream Istio. One of these differences is its multi-tenant implementation, where a cluster may be divided into multiple “tenant” service meshes. Each tenant is a distinct administrative domain and can live in the same cluster as other tenants or in a different one. Upstream Istio, in contrast, assumes that one cluster is one tenant and relies on direct connectivity between the Kubernetes API servers.

Federation enables use cases like:

  • Sharing services in a secure way across different meshes and providers
  • Advanced Deployment strategies
  • Cross-region High Availability
  • Supporting Migration scenarios

Example Use Case: Testing with production data

Here is what we want to set up: we are going to create a production service mesh running our production workload (the world-famous Istio bookinfo application).

Then we will set up a stage mesh tenant within the same OpenShift cluster, where we will deploy and test the new version of the bookinfo details microservice. We want to test both the functional behavior and the performance with data from production. To achieve that, we will first federate the two meshes, export the new version of the details service from the stage environment, import it into the production mesh, and create a VirtualService mirroring all traffic to the new version in the stage environment.

You can find all the sources and a deploy script in this GitHub repository: https://github.com/ortwinschneider/mesh-federation

Let’s get started!

OpenShift ServiceMesh Installation

We need at least one OpenShift cluster where we install OpenShift ServiceMesh 2.1+. Basically, we have to install four Operators in the following order (a CLI sketch follows below):

  1. OpenShift Elasticsearch (Optional)
  2. Jaeger
  3. Kiali
  4. Red Hat OpenShift Service Mesh 

You’ll find the complete installation instructions here: Installing the Operators – Service Mesh 2.x | Service Mesh | OpenShift Container Platform 4.9
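
If you prefer the CLI over the OperatorHub UI, the operators can also be installed with OLM Subscription resources. A minimal sketch for the Service Mesh operator, assuming the usual package name, channel, and catalog source (the other operators follow the same pattern):

# Sketch: install the Red Hat OpenShift Service Mesh operator via OLM.
# Package name, channel, and catalog source are assumptions; verify them with
# `oc get packagemanifests -n openshift-marketplace | grep servicemesh`.
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: servicemeshoperator
  namespace: openshift-operators
spec:
  channel: stable
  name: servicemeshoperator
  source: redhat-operators
  sourceNamespace: openshift-marketplace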

Installing the production Service Mesh

First, we log in to OpenShift and create new projects for the production mesh control plane and the bookinfo application:

oc new-project prod-mesh
oc new-project prod-bookinfo

The interaction between meshes is handled through ingress and egress gateways, which are the only identities transferred between meshes. You need an additional ingress and egress gateway pair for each mesh you are federating with. The gateways in each mesh must be configured to communicate with each other, whether they are in the same or in different clusters.

  • All outbound requests to a federated mesh are transferred through a local egress gateway
  • All inbound requests from a federated mesh are transferred through a local ingress gateway

Therefore, we configure additional ingress and egress gateways in the ServiceMeshControlPlane custom resource:

apiVersion: maistra.io/v2
kind: ServiceMeshControlPlane
metadata:
  namespace: prod-mesh
  name: prod-mesh
spec:
  cluster:
    name: production-cluster
  addons:
    grafana:
      enabled: true
    jaeger:
      install:
        storage:
          type: Memory
    kiali:
      enabled: true
    prometheus:
      enabled: true
  policy:
    type: Istiod
  telemetry:
    type: Istiod
  tracing:
    sampling: 10000
    type: Jaeger
  version: v2.1
  runtime:
    defaults:
      container:
        imagePullPolicy: Always
  proxy:
    accessLogging:
      file:
        name: /dev/stdout
  gateways:
    additionalEgress:
      stage-mesh-egress:
        enabled: true
        requestedNetworkView:
        - network-stage-mesh
        routerMode: sni-dnat
        service:
          metadata:
            labels:
              federation.maistra.io/egress-for: stage-mesh
          ports:
          - port: 15443
            name: tls
          - port: 8188
            name: http-discovery # note HTTP here
    additionalIngress:
      stage-mesh-ingress:
        enabled: true
        routerMode: sni-dnat
        service:
          type: ClusterIP
          metadata:
            labels:
              federation.maistra.io/ingress-for: stage-mesh
          ports:
          - port: 15443
            name: tls
          - port: 8188
            name: https-discovery # note HTTPS here
  security:
    trust:
      domain: prod-mesh.local

Some things to note here:

In our ServiceMeshControlPlane resource, we need to configure the trust domain, as it will be referenced by the other mesh to allow incoming requests. We also want to view endpoints associated with the other mesh network in Kiali, so the requestedNetworkView field must be set to network-<ServiceMeshPeer name>. The routerMode should always be sni-dnat, and the port names must be tls and http(s)-discovery. Symmetrically, this configuration must also be done on the stage-mesh control plane.
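
For orientation, here is just the federation-relevant excerpt of the control plane above, with the key fields annotated:

gateways:
  additionalEgress:
    stage-mesh-egress:            # one additional egress gateway per peer
      requestedNetworkView:
      - network-stage-mesh        # network-<ServiceMeshPeer name>
      routerMode: sni-dnat        # always sni-dnat for federation
      service:
        ports:
        - port: 8188
          name: http-discovery    # egress discovery port is plain HTTP
  additionalIngress:
    stage-mesh-ingress:           # one additional ingress gateway per peer
      routerMode: sni-dnat
      service:
        ports:
        - port: 8188
          name: https-discovery   # ingress discovery port is HTTPS
security:
  trust:
    domain: prod-mesh.local       # referenced as trustDomain by the peer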

Now we make the prod-bookinfo project a member of the mesh by using a ServiceMeshMemberRoll resource.

apiVersion: maistra.io/v1
kind: ServiceMeshMemberRoll
metadata:
  name: default
  namespace: prod-mesh
spec:
  members:
  - prod-bookinfo

Let’s apply both resources and optionally wait for the installation to complete. This will set up the complete control plane with Jaeger, Kiali, Prometheus, Istiod, ingress and egress gateways, etc.

oc apply -f prod-mesh/smcp.yaml
oc apply -f prod-mesh/smmr.yaml
oc wait --for condition=Ready -n prod-mesh smmr/default --timeout 300s

We are ready to deploy our “production” workload, the bookinfo application, with Gateway, VirtualService, and DestinationRule definitions:

oc apply -n prod-bookinfo -f https://raw.githubusercontent.com/Maistra/istio/maistra-2.0/samples/bookinfo/platform/kube/bookinfo.yaml
oc apply -n prod-bookinfo -f https://raw.githubusercontent.com/Maistra/istio/maistra-2.0/samples/bookinfo/networking/bookinfo-gateway.yaml
oc apply -n prod-bookinfo -f https://raw.githubusercontent.com/Maistra/istio/maistra-2.0/samples/bookinfo/networking/destination-rule-all.yaml

We can check whether all pods are running with the sidecar injected. You should see 2/2 in the READY column:

oc get pods

NAME                             READY   STATUS    RESTARTS   AGE
details-v1-86dfdc4b95-fcmkq      2/2     Running   0          2d20h
productpage-v1-658849bb5-pdpnq   2/2     Running   0          2d20h
ratings-v1-76b8c9cbf9-z98v6      2/2     Running   0          2d20h
reviews-v1-58b8568645-dks9d      2/2     Running   0          2d21h
reviews-v2-5d8f8b6775-cb8df      2/2     Running   0          2d20h
reviews-v3-666b89cfdf-z9r5d      2/2     Running   0          2d20h

Installing the staging Service Mesh

After installing the production control plane and workload, it’s time to set up the stage environment. We follow the same steps as with the production mesh. First, we create the projects:

oc new-project stage-mesh
oc new-project stage-bookinfo

Next we install the stage-mesh control plane in the stage-mesh project and add the stage-bookinfo project to the mesh:

apiVersion: maistra.io/v2
kind: ServiceMeshControlPlane
metadata:
  namespace: stage-mesh
  name: stage-mesh
spec:
  cluster:
    name: stage-cluster
  addons:
    grafana:
      enabled: true
    jaeger:
      install:
        storage:
          type: Memory
    kiali:
      enabled: true
    prometheus:
      enabled: true
  policy:
    type: Istiod
  telemetry:
    type: Istiod
  tracing:
    sampling: 10000
    type: Jaeger
  version: v2.1
  proxy:
    accessLogging:
      file:
        name: /dev/stdout
  runtime:
    defaults:
      container:
        imagePullPolicy: Always
  gateways:
    additionalEgress:
      prod-mesh-egress:
        enabled: true
        requestedNetworkView:
        - network-prod-mesh
        routerMode: sni-dnat
        service:
          metadata:
            labels:
              federation.maistra.io/egress-for: prod-mesh
          ports:
          - port: 15443
            name: tls
          - port: 8188
            name: http-discovery # note HTTP here
    additionalIngress:
      prod-mesh-ingress:
        enabled: true
        routerMode: sni-dnat
        service:
          type: ClusterIP
          metadata:
            labels:
              federation.maistra.io/ingress-for: prod-mesh
          ports:
          - port: 15443
            name: tls
          - port: 8188
            name: https-discovery # note HTTPS here
  security:
    trust:
      domain: stage-mesh.local

apiVersion: maistra.io/v1
kind: ServiceMeshMemberRoll
metadata:
  name: default
  namespace: stage-mesh
spec:
  members:
  - stage-bookinfo

Let’s apply both resources and optionally wait for the installation to complete.

oc apply -f stage-mesh/smcp.yaml
oc apply -f stage-mesh/smmr.yaml
oc wait --for condition=Ready -n stage-mesh smmr/default --timeout 300s

Now in the stage environment we are going to deploy only the “new” version of the details-service, not the complete bookinfo application. This is the version we want to test with data from production.

oc apply -n stage-bookinfo -f stage-mesh/stage-detail-v2-deployment.yaml
oc apply -n stage-bookinfo -f stage-mesh/stage-detail-v2-service.yaml

Service Mesh Peering

As of now, we have two distinct Service Meshes running in one OpenShift cluster, one for production and the other for staging. The next step is to federate these Service Meshes in order to share services. To validate certificates presented by the other mesh, we need the CA root certificate of the remote mesh. The default certificate of the other mesh is located in a ConfigMap called istio-ca-root-cert in the remote mesh’s control plane namespace.

We create a new ConfigMap containing the certificate of the other mesh. The key must still be root-cert.pem. This configuration has to be done symmetrically on both meshes.
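
The template files in the repository presumably look like the following sketch; the {{...}} placeholder is filled in by the sed commands below:

# Assumed layout of prod-mesh/stage-mesh-ca-root-cert.yaml:
# the placeholder is replaced with the stage mesh's CA root certificate.
apiVersion: v1
kind: ConfigMap
metadata:
  name: stage-mesh-ca-root-cert   # referenced by the ServiceMeshPeer later on
  namespace: prod-mesh
data:
  root-cert.pem: |-
    {{STAGE_MESH_CERT}}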

PROD_MESH_CERT=$(oc get configmap -n prod-mesh istio-ca-root-cert -o jsonpath='{.data.root-cert\.pem}')
STAGE_MESH_CERT=$(oc get configmap -n stage-mesh istio-ca-root-cert -o jsonpath='{.data.root-cert\.pem}')
sed "s:{{STAGE_MESH_CERT}}:$STAGE_MESH_CERT:g" prod-mesh/stage-mesh-ca-root-cert.yaml | oc apply -f -
sed "s:{{PROD_MESH_CERT}}:$PROD_MESH_CERT:g" stage-mesh/prod-mesh-ca-root-cert.yaml | oc apply -f -

The actual federation is done by using a ServiceMeshPeer custom resource. We need a peering configuration on both the prod-mesh and the stage-mesh side.

kind: ServiceMeshPeer
apiVersion: federation.maistra.io/v1
metadata:
  name: stage-mesh
  namespace: prod-mesh
spec:
  remote:
    addresses:
    - prod-mesh-ingress.stage-mesh.svc.cluster.local
    discoveryPort: 8188
    servicePort: 15443
  gateways:
    ingress:
      name: stage-mesh-ingress
    egress:
      name: stage-mesh-egress
  security:
    trustDomain: stage-mesh.local
    clientID: stage-mesh.local/ns/stage-mesh/sa/prod-mesh-egress-service-account
    certificateChain:
      kind: ConfigMap
      name: stage-mesh-ca-root-cert

In spec.remote we configure the address of the peer’s ingress gateway and, optionally, the ports handling discovery and service requests. This address has to be reachable from the local cluster. We also reference our local ingress and egress gateway names handling the federation traffic. The spec.security section configures the remote trust domain, the remote egress gateway client ID, and the CA root certificate. It is also important to give the ServiceMeshPeer resource the name of the remote mesh!
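
Only the prod-mesh side is shown above; the stage-mesh side (stage-mesh/smp.yaml) must mirror it. A sketch, derived from the naming conventions used above:

kind: ServiceMeshPeer
apiVersion: federation.maistra.io/v1
metadata:
  name: prod-mesh                  # named after the remote mesh
  namespace: stage-mesh
spec:
  remote:
    addresses:
    # the ingress gateway in prod-mesh that handles stage-mesh federation
    - stage-mesh-ingress.prod-mesh.svc.cluster.local
    discoveryPort: 8188
    servicePort: 15443
  gateways:
    ingress:
      name: prod-mesh-ingress      # local ingress gateway for this peer
    egress:
      name: prod-mesh-egress       # local egress gateway for this peer
  security:
    trustDomain: prod-mesh.local
    clientID: prod-mesh.local/ns/prod-mesh/sa/stage-mesh-egress-service-account
    certificateChain:
      kind: ConfigMap
      name: prod-mesh-ca-root-cert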

Let’s apply the ServiceMeshPeer resources for the stage and prod environment.

oc apply -f prod-mesh/smp.yaml
oc apply -f stage-mesh/smp.yaml

Once the ServiceMeshPeer resources are configured on both sides, the connections between the gateways will be established. We have to make sure that the connections are up by checking the status:

oc -n prod-mesh get servicemeshpeer stage-mesh -o json | jq .status
oc -n stage-mesh get servicemeshpeer prod-mesh -o json | jq .status

{
  "discoveryStatus": {
    "active": [
      {
        "pod": "istiod-prod-mesh-7b59b74c-xfg88",
        "remotes": [
          {
            "connected": true,
            "lastConnected": "2022-01-07T13:50:12Z",
            "lastFullSync": "2022-01-10T14:30:11Z",
            "source": "10.129.0.97"
          }
        ],
        "watch": {
          "connected": true,
          "lastConnected": "2022-01-07T13:50:27Z",
          "lastDisconnect": "2022-01-07T13:48:36Z",
          "lastDisconnectStatus": "503 Service Unavailable",
          "lastFullSync": "2022-01-08T11:27:41Z"
        }
      }
    ]
  }
}

The status values for inbound (remotes[0].connected) and outbound (watch.connected) connections must be true. It may take a moment, as a full synchronization happens every 5 minutes. If you don’t see a successful connection status for a while, check the logs of the istiod pod. You can ignore the warning in the istiod logs about “remote trust domain not matching the current trust domain…”.
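
For example, assuming the istiod deployment follows the istiod-<control plane name> pattern visible in the pod name above:

# Inspect the federation-related log lines of the prod-mesh istiod
oc logs -n prod-mesh deployment/istiod-prod-mesh | grep -i federation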

Exporting Services

Now that the meshes are connected, we can export services with an ExportedServiceSet resource. Exporting a service configures the ingress gateway to accept inbound requests for the exported service. Inbound requests are validated against the client ID configured for the remote egress gateway.

We must declare every service we want to make available to other meshes. It is possible to select services by name or namespace, and we can also use wildcards. Important: we can only export services that are visible to the mesh’s namespace.

In our case we want to export the new version of the details-service from the stage-bookinfo project.

kind: ExportedServiceSet
apiVersion: federation.maistra.io/v1
metadata:
  name: prod-mesh
  namespace: stage-mesh
spec:
  exportRules:
  - type: LabelSelector
    labelSelector:
      namespace: stage-bookinfo
      selector:
        matchLabels:
          app: details
      alias:
        namespace: bookinfo
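
The example above selects by label. For comparison, a sketch of a name-based exportRules section (using the NameSelector type that also appears in the import example later); a wildcard name would export every service in the namespace:

spec:
  exportRules:
  - type: NameSelector
    nameSelector:
      namespace: stage-bookinfo
      name: "*"              # wildcard: exports all services in stage-bookinfo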

Important: The name of the ExportedServiceSet must match the ServiceMeshPeer name!

Internally, a unique name is used to identify the service for the other mesh, e.g. details.stage-bookinfo.svc.cluster.local may be exported under the name details.stage-bookinfo.svc.prod-mesh-exports.local. Exported services will only see traffic from the ingress gateway, not the original requestor.

Let’s apply the ExportedServiceSet:

oc apply -f stage-mesh/ess.yaml

and check the status of the exported services (this may also take a moment to get updated):

oc get exportedserviceset prod-mesh -n stage-mesh -o json | jq .status

{
  "exportedServices": [
    {
      "exportedName": "details.stage-bookinfo.svc.prod-mesh-exports.local",
      "localService": {
        "hostname": "details.stage-bookinfo.svc.cluster.local",
        "name": "details",
        "namespace": "stage-bookinfo"
      }
    }
  ]
}

Importing Services

Importing services is very similar to exporting services and lets you explicitly specify which services exported from another mesh should be accessible within your service mesh. 

The ImportedServiceSet custom resource is used to select services for import. Importing a service configures the mesh to send requests for the service to the egress gateway specified in the ServiceMeshPeer resource. 

We are going to import the previously exported details-service from the stage-mesh in order to mirror traffic to the service:

apiVersion: federation.maistra.io/v1
kind: ImportedServiceSet
metadata:
  name: stage-mesh
  namespace: prod-mesh
spec:
  importRules:
  - importAsLocal: false
    nameSelector:
      alias:
        name: details-v2
        namespace: prod-bookinfo
      name: details
      namespace: stage-bookinfo
    type: NameSelector

As you can see, we are importing the service under an alias name and namespace. Let’s apply the ImportedServiceSet:

oc apply -f prod-mesh/iss.yaml

And again checking the status:

oc -n prod-mesh get importedservicesets stage-mesh -o json | jq .status

{
  "importedServices": [
    {
      "exportedName": "details.stage-bookinfo.svc.prod-mesh-exports.local",
      "localService": {
        "hostname": "details-v2.prod-bookinfo.svc.stage-mesh-imports.local",
        "name": "details-v2",
        "namespace": "prod-bookinfo"
      }
    }
  ]
}

If you import services with importAsLocal: true, the domain suffix will be svc.cluster.local, just like your normal local services. If you already have a local service with this name, the remote endpoints will be added to that service’s endpoints, and traffic is load balanced across your local and imported services.
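
As a sketch (assuming the same API shape as above), the import rule with importAsLocal: true would look like this; the remote endpoints would then surface under a local cluster.local name:

spec:
  importRules:
  - importAsLocal: true            # merge remote endpoints under svc.cluster.local
    nameSelector:
      alias:
        name: details
        namespace: prod-bookinfo   # appears as details.prod-bookinfo.svc.cluster.local
      name: details
      namespace: stage-bookinfo
    type: NameSelector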

Routing traffic to imported services

We are almost done. So far we have set up two Service Meshes with distinct administrative domains, deployed our workload, and included it in the mesh. Then we federated the two meshes with a peering configuration and exported and imported the new version of the details service.

We can treat our imported service like any other service and, e.g., route traffic by applying VirtualServices, DestinationRules, etc.

What we initially wanted to achieve is mirroring the traffic going to the production details-service to the new version of the details-service running in the stage environment, in order to test this new version with real data from production. 

This can be done by simply applying a VirtualService for the details-service:

kind: VirtualService
apiVersion: networking.istio.io/v1alpha3
metadata:
  name: details
  namespace: prod-bookinfo
spec:
  hosts:
  - details.prod-bookinfo.svc.cluster.local
  http:
  - mirror:
      host: details-v2.prod-bookinfo.svc.stage-mesh-imports.local
    route:
    - destination:
        host: details.prod-bookinfo.svc.cluster.local
        subset: v1
      weight: 100

oc apply -n prod-bookinfo -f prod-mesh/vs-mirror-details.yaml

Instead of mirroring, we could also split traffic and perform a canary rollout across the two meshes, sending, for example, 20% of the traffic to the stage environment. With this approach we could also gradually migrate workloads from one cluster to another.

kind: VirtualService
apiVersion: networking.istio.io/v1alpha3
metadata:
  name: details
  namespace: prod-bookinfo
spec:
  hosts:
  - details.prod-bookinfo.svc.cluster.local
  http:
  - route:
    - destination:
        host: details.prod-bookinfo.svc.cluster.local
        subset: v1
      weight: 80
    - destination:
        host: details-v2.prod-bookinfo.svc.stage-mesh-imports.local
      weight: 20

oc apply -f prod-mesh/vs-split-details.yaml

Now, to see federation in action, we need to create some load in the bookinfo app in prod-mesh.

BOOKINFO_URL=$(oc -n prod-mesh get route istio-ingressgateway -o json | jq -r .spec.host)
while true; do sleep 1; curl http://$BOOKINFO_URL/productpage &> /dev/null; done

The best way to visualize the traffic flowing through the mesh is by using Kiali. Kiali is a management and visualization dashboard, rendering traffic from and to services. The visualization is based on metrics pulled from the local mesh.

Open the Kiali UI for the prod-mesh and stage-mesh environments and take a look at the Graph page. You can adjust the graph to show request percentages or response times, animate the traffic, and more.

oc get route -n prod-mesh kiali -o jsonpath='{.spec.host}'

When splitting traffic, the Kiali graph shows the requests being distributed between the prod and stage environments. Please note that when mirroring traffic, the mirror destination will not show up in the graph, as there are no specific metrics exposed for mirroring. However, you can validate the traffic coming in from prod with Kiali in the stage environment.
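
The Kiali route for the stage mesh can be retrieved in the same way:

oc get route -n stage-mesh kiali -o jsonpath='{.spec.host}'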

Conclusion

With OpenShift Service Mesh federation, it is possible to set up several service meshes in one or more OpenShift clusters. It follows a “zero trust” approach: each mesh is a distinct administrative domain, and information is only shared through explicit configuration, which provides the necessary security.

OpenShift Service Mesh federation enables some exciting new use cases by extending the overall capabilities and leveraging the powerful underlying hybrid cloud container platform.

You don’t have an OpenShift environment? Get started with the Developer Sandbox for Red Hat OpenShift.

By Ortwin Schneider

Ortwin Schneider currently works as a Principal Technical Marketing Manager on the Customer and Field Engagement team in the Hybrid Platforms business unit at Red Hat. Previously, he was a Specialist Solution Architect for Application Services in the manufacturing vertical on Red Hat Germany’s sales team, responsible for solution design and for helping customers in the manufacturing industry adopt the cloud-native path with Red Hat’s comprehensive middleware portfolio. He is passionate about the intersection of cloud computing, software architectures, and the Internet of Things (IoT), and is a technologist with experience in software engineering, enterprise architecture, and tech support.
