I’m struggling with the last step of a configuration using MetalLB, Kubernetes and Istio on a bare-metal instance: returning a web page from a service to the outside world via an Istio VirtualService route. I’ve just updated the instance to:
- MetalLB (version 0.7.3)
- Kubernetes (version 1.12.2)
- Istio (version 1.0.3)
I’ll start with what does work.
All complementary services have been deployed and most are working:
- Kubernetes Dashboard on http://localhost:8001
- Prometheus Dashboard on http://localhost:10010 (I had something else on 9009)
- Envoy Admin on http://localhost:15000
- Grafana (Istio Dashboard) on http://localhost:3000
- Jaeger on http://localhost:16686
I say most because since the upgrade to Istio 1.0.3 I've lost the telemetry from istio-ingressgateway in the Jaeger dashboard and I'm not sure how to bring it back. I've deleted the pod and re-created it, to no avail.
Outside of that, MetalLB and K8S appear to be working fine and the load-balancer is configured correctly (using ARP).
kubectl get svc -n istio-system
NAME                     TYPE           CLUSTER-IP       EXTERNAL-IP     PORT(S)                                 AGE
grafana                  ClusterIP      10.109.247.149   <none>          3000/TCP                                9d
istio-citadel            ClusterIP      10.110.129.92    <none>          8060/TCP,9093/TCP                       28d
istio-egressgateway      ClusterIP      10.99.39.29      <none>          80/TCP,443/TCP                          28d
istio-galley             ClusterIP      10.98.219.217    <none>          443/TCP,9093/TCP                        28d
istio-ingressgateway     LoadBalancer   10.108.175.231   192.168.1.191   80:31380/TCP,443:31390/TCP,31400:31400/TCP,15011:30805/TCP,8060:32514/TCP,853:30601/TCP,15030:31159/TCP,15031:31838/TCP   28d
istio-pilot              ClusterIP      10.97.248.195    <none>          15010/TCP,15011/TCP,8080/TCP,9093/TCP   28d
istio-policy             ClusterIP      10.98.133.209    <none>          9091/TCP,15004/TCP,9093/TCP             28d
istio-sidecar-injector   ClusterIP      10.102.158.147   <none>          443/TCP                                 28d
istio-telemetry          ClusterIP      10.103.141.244   <none>          9091/TCP,15004/TCP,9093/TCP,42422/TCP   28d
jaeger-agent             ClusterIP      None             <none>          5775/UDP,6831/UDP,6832/UDP,5778/TCP     27h
jaeger-collector         ClusterIP      10.104.66.65     <none>          14267/TCP,14268/TCP,9411/TCP            27h
jaeger-query             LoadBalancer   10.97.70.76      192.168.1.193   80:30516/TCP                            27h
prometheus               ClusterIP      10.105.176.245   <none>          9090/TCP                                28d
zipkin                   ClusterIP      None             <none>          9411/TCP                                27h
If I expose my deployment directly using:
kubectl expose deployment enrich-dev --type=LoadBalancer --name=enrich-expose
it all works perfectly and I can reach the web page from the external load-balanced IP address (I deleted the exposed service after this test).
NAME             TYPE           CLUSTER-IP      EXTERNAL-IP     PORT(S)           AGE
enrich-expose    LoadBalancer   10.108.43.157   192.168.1.192   31380:30170/TCP   73s
enrich-service   ClusterIP      10.98.163.217   <none>          80/TCP            57m
kubernetes       ClusterIP      10.96.0.1       <none>          443/TCP           36d
If I create a K8S Service in the default namespace (I've tried multiple)
apiVersion: v1
kind: Service
metadata:
  name: enrich-service
  labels:
    run: enrich-service
spec:
  ports:
  - name: http
    port: 80
    protocol: TCP
  selector:
    app: enrich
followed by a gateway and a route (VirtualService), the only response I get from outside the mesh is a 404. You'll see in the VirtualService's gateways list I'm using the reserved word mesh, but I've tried both that and naming the specific gateway. I've also tried different match prefixes for specific URIs, as well as the port match you can see below.
Gateway
apiVersion: networking.istio.io/v1alpha3
kind: Gateway
metadata:
  name: enrich-dev-gateway
spec:
  selector:
    istio: ingressgateway
  servers:
  - port:
      number: 80
      name: http
      protocol: HTTP
    hosts:
    - "*"
VirtualService
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: enrich-virtualservice
spec:
  hosts:
  - "enrich-service.default"
  gateways:
  - mesh
  http:
  - match:
    - port: 80
    route:
    - destination:
        host: enrich-service.default
        subset: v1
---
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: enrich-destination
spec:
  host: enrich-service.default
  trafficPolicy:
    loadBalancer:
      simple: LEAST_CONN
  subsets:
  - name: v1
    labels:
      app: enrich
I've double-checked that DNS isn't the problem, because I can open a shell in the ingress-gateway pod, either via busybox or using the K8S dashboard, and run both
nslookup enrich-service.default
and
curl -f http://enrich-service.default/
and both work successfully, so I know the ingress-gateway pod can resolve and reach the service. Sidecar auto-injection is enabled in both the default namespace and the istio-system namespace.
The logs for the ingress-gateway show the 404:
[2018-11-01T03:07:54.351Z] "GET /metadata HTTP/1.1" 404 - 0 0 1 - "192.168.1.90" "curl/7.58.0" "6c1796be-0791-4a07-ac0a-5fb07bc3818c" "enrich-service.default" "-" - - 192.168.224.168:80 192.168.1.90:43500
[2018-11-01T03:26:39.339Z] "GET / HTTP/1.1" 404 - 0 0 1 - "192.168.1.90" "curl/7.58.0" "ed956af4-77b0-46e6-bd26-c153e29837d7" "enrich-service.default" "-" - - 192.168.224.168:80 192.168.1.90:53960
192.168.224.168:80 is the IP address of the gateway. 192.168.1.90:53960 is the IP address of my external client.
Any suggestions? I've tried hitting this from multiple angles for a couple of days now and I feel I'm just missing something simple. Are there particular logs I should be looking at?
Just to close this question out with the solution to the problem in my case. The mistake in configuration started all the way back at Kubernetes cluster initialisation: I had set the pod-network-cidr to the same address range as the local LAN on which the Kubernetes installation was deployed, i.e. the network of the Ubuntu desktop host used the same IP subnet as the one I'd assigned to the container network.
For the most part, everything operated fine as detailed above, until the Istio proxy tried to route packets from an external load-balancer IP address to an internal IP address that happened to be on the same subnet. Project Calico with Kubernetes seemed able to cope with it, as that's effectively Layer 3/4 policy, but Istio had a problem with it at L7 (even though it was sitting on Calico underneath).
The solution was to tear down my entire Kubernetes deployment. I was paranoid and went so far as to uninstall Kubernetes, then reinstalled and redeployed with a pod network in the 172 range, which has nothing to do with my local LAN. I also made the same change in the Project Calico configuration file so that the pod networks match. After that change, everything worked as expected.
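For anyone wanting to check for this before initialising a cluster, here is a minimal sketch in Python that tests whether a planned pod network overlaps the LAN. The specific ranges are assumptions for illustration only: 192.168.1.0/24 for the LAN (the addresses above are in 192.168.1.x, but the mask is a guess), 192.168.0.0/16 for the original pod network (consistent with the gateway pod address 192.168.224.168 seen in the logs), and 172.16.0.0/16 as one possible choice in the 172 range. The function name is hypothetical.

```python
import ipaddress


def networks_overlap(pod_cidr: str, lan_cidr: str) -> bool:
    """Return True if the pod network and the LAN share any addresses."""
    pod = ipaddress.ip_network(pod_cidr)
    lan = ipaddress.ip_network(lan_cidr)
    return pod.overlaps(lan)


# All three ranges below are assumed values, not taken from a real config.
LAN_CIDR = "192.168.1.0/24"      # assumed LAN subnet
OLD_POD_CIDR = "192.168.0.0/16"  # assumed original pod network
NEW_POD_CIDR = "172.16.0.0/16"   # one possible pod network in the 172 range

if __name__ == "__main__":
    for label, cidr in (("old", OLD_POD_CIDR), ("new", NEW_POD_CIDR)):
        clash = networks_overlap(cidr, LAN_CIDR)
        print(f"{label} pod network {cidr} overlaps LAN {LAN_CIDR}: {clash}")
```

With these assumed values the old pod network overlaps the LAN and the new one does not, which matches the behaviour I saw before and after the rebuild.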
I suspect that a more public configuration, where the cluster is directly attached to a BGP router rather than using MetalLB in an L2 configuration on a subset of your LAN, wouldn't exhibit this issue either. I've documented it in more detail in this post:
Microservices: .Net, Linux, Kubernetes and Istio make a powerful combination