diff --git a/website/docs/troubleshooting/Scenarios/_category_.json b/website/docs/troubleshooting/Scenarios/_category_.json deleted file mode 100644 index 0d44758281..0000000000 --- a/website/docs/troubleshooting/Scenarios/_category_.json +++ /dev/null @@ -1,3 +0,0 @@ -{ - "collapsed": false -} diff --git a/website/docs/troubleshooting/Scenarios/alb/alb_fix.md b/website/docs/troubleshooting/Scenarios/alb/alb_fix.md deleted file mode 100644 index 7ef6f65d13..0000000000 --- a/website/docs/troubleshooting/Scenarios/alb/alb_fix.md +++ /dev/null @@ -1,500 +0,0 @@ ---- -title: "Troubleshooting The Load balancer Scenario" -sidebar_position: 30 ---- - -The task for you in this troubleshooting scenario is to investigate the deployment for AWS Load Balancer Controller as well as the ingress object created by following the prompts with the script. At the end of this session, you should be able to see the ui app on your EKS cluster using ALB ingress through the browsers as depicted in the image. - -![ingress](./assets/ingress.webp) - -## Let's start the troubleshooting - -### Step 1: - -First, we need to verify the status of our pods and get ingress for ingress object creation. To do so, we will use `kubectl` tool. - -```bash -$ kubectl get pod -n ui -NAME READY STATUS RESTARTS AGE -ui-68495c748c-jkh2z 1/1 Running 0 85s -``` - -### Step 2: - -In _Step 1_, we checked the pods status for our application and aws-load-balancer-controller. The _aws-load-balancer-controller_ deployment is responsible for ALB creation for any ingress objects applied to the cluster. - -Upon looking for ingress object, did you observe any ALB DNS name to access your application with the ingress object? You can also verify ALB creation in the AWS Management Console. In a successful installation scenario, the ingress object should have an ALB DNS name shown like the example below. However in this case, the ADDRESS section where the ALB DNS should have populated is empty. - -```bash -$ kubectl get ingress/ui -n ui -NAME CLASS HOSTS ADDRESS PORTS AGE -ui alb * 80 105s - -#---This is the expected output when the ingress was deployed correctly-- -NAME CLASS HOSTS ADDRESS PORTS AGE -ingress-2048 * k8s-ui-ingress2-xxxxxxxxxx-yyyyyyyyyy.region-code.elb.amazonaws.com 80 2m32s -``` - -### Step 3: - -Check further into the ingress for any events indicating why we do not see the ALB DNS. You can retrieve those logs by running the following command. The event logs should point you towards what the issue might be with ingress creation. - -```bash -$ kubectl describe ingress/ui -n ui -Name: ui -Labels: -Namespace: ui -Address: -Ingress Class: alb -Default backend: -Rules: - Host Path Backends - ---- ---- -------- - * - / service-ui:80 () -Annotations: alb.ingress.kubernetes.io/healthcheck-path: /actuator/health/liveness - alb.ingress.kubernetes.io/scheme: internet-facing - alb.ingress.kubernetes.io/target-type: ip -Events: - Type Reason Age From Message - ---- ------ ---- ---- ------- - Warning FailedBuildModel 2m23s (x16 over 5m9s) ingress Failed build model due to couldn't auto-discover subnets: unable to resolve at least one subnet (0 match VPC and tags: [kubernetes.io/role/elb]) - -``` - -Refer the documentation on prerequisites for setting up ALB with EKS: https://kubernetes-sigs.github.io/aws-load-balancer-controller/v2.4/deploy/subnet_discovery/ - -### Step 4: - -_Step 3_ points to issues with the subnet auto-discovery for load balancer controller deployment. 
Ensure that all the public subnets have correct tags `tag:kubernetes.io/role/elb,Values=1'` - -:::info -Keep in mind that public subnet means the route table for the subnet has an Internet Gateway allowing traffic to and from the internet. -::: - -1. To find the all subnets through the command line, filter through existing ones with the following tag "Key: `alpha.eksctl.io/cluster-name` Value: `${EKS_CLUSTER_NAME}`". There should be four subnets. **Note:** _For your convenience we have added the cluster name as env variable with the variable `$EKS_CLUSTER_NAME`._ - -```bash -$ aws ec2 describe-subnets --filters "Name=tag:alpha.eksctl.io/cluster-name,Values=${EKS_CLUSTER_NAME}" --query 'Subnets[].SubnetId[]' -[ - "subnet-xxxxxxxxxxxxxxxxx", - "subnet-xxxxxxxxxxxxxxxxx", - "subnet-xxxxxxxxxxxxxxxxx", - "subnet-xxxxxxxxxxxxxxxxx", - "subnet-xxxxxxxxxxxxxxxxx", - "subnet-xxxxxxxxxxxxxxxxx" -] -``` - -2. Then by adding in the subnet ID into the route tables CLI filter one at a time, `--filters 'Name=association.subnet-id,Values=subnet-xxxxxxxxxxxxxxxxx'`, identify which subnets are public. - -``` -aws ec2 describe-route-tables --filters 'Name=association.subnet-id,Values=' --query 'RouteTables[].Routes[].[DestinationCidrBlock,GatewayId]' -``` - -Here a script that will help to iterate over the list of subnets - -```bash - -$ for subnet_id in $(aws ec2 describe-subnets --filters "Name=tag:alpha.eksctl.io/cluster-name,Values=${EKS_CLUSTER_NAME}" --query 'Subnets[].SubnetId[]' --output text); do echo "Subnect: ${subnet_id}"; aws ec2 describe-route-tables --filters "Name=association.subnet-id,Values=${subnet_id}" --query 'RouteTables[].Routes[].[DestinationCidrBlock,GatewayId]'; done -``` - -If the output shows `0.0.0.0/0` route to an Internet gateway ID, this is a public subnet. See below example. - -``` -WSParticipantRole:~/environment $ aws ec2 describe-route-tables --filters "Name=association.subnet-id,Values=subnet-xxxxxxxxxxxxx0470" --query 'RouteTables[].Routes[].[DestinationCidrBlock,GatewayId]' -[ - [ - "10.42.0.0/16", - "local" - ], - [ - "0.0.0.0/0", - "igw-xxxxxxxxxxxxxxxxx" - ] -] -``` - -3. Once you have all the public subnet ID's, describe subnets with the appropriate tag and confirm that the public subnet ID's that you identified are missing. In our case, none of our subnets have the correct tags. - -```bash -$ aws ec2 describe-subnets --filters 'Name=tag:kubernetes.io/role/elb,Values=1' --query 'Subnets[].SubnetId' -[] -``` - -4. Then add the correct tags. To help you a little bit, we have added the 3 public subnets to the `env` variables with the names `PUBLIC_SUBNET_1, PUBLIC_SUBNET_2 and PUBLIC_SUBNET_3` - -``` -aws ec2 create-tags --resources subnet-xxxxxxxxxxxxxxxxx subnet-xxxxxxxxxxxxxxxxx subnet-xxxxxxxxxxxxxxxxx --tags 'Key="kubernetes.io/role/elb",Value=1' -``` - -```bash -$ aws ec2 create-tags --resources $PUBLIC_SUBNET_1 $PUBLIC_SUBNET_2 $PUBLIC_SUBNET_3 --tags 'Key="kubernetes.io/role/elb",Value=1' -``` - -5. Confirm the tags are created. You should see the public subnet ID's populated following the command below. - -```bash -$ aws ec2 describe-subnets --filters 'Name=tag:kubernetes.io/role/elb,Values=1' --query 'Subnets[].SubnetId' -[ - "subnet-xxxxxxxxxxxxxxxxx", - "subnet-xxxxxxxxxxxxxxxxx", - "subnet-xxxxxxxxxxxxxxxxx" -] -``` - -6. 
Now restart the controller deployment using the kubectl rollout restart command: - -```bash timeout=180 -$ kubectl -n kube-system rollout restart deploy aws-load-balancer-controller -deployment.apps/aws-load-balancer-controller restarted -``` - -7. Now, check again the ingress deployment: - -```bash expectError=true -$ kubectl describe ingress/ui -n ui - Warning FailedDeployModel 68s ingress Failed deploy model due to AccessDenied: User: arn:aws:sts::xxxxxxxxxxxx:assumed-role/alb-controller-20240611131524228000000002/1718115201989397805 is not authorized to perform: elasticloadbalancing:CreateLoadBalancer on resource: arn:aws:elasticloadbalancing:us-west-2:xxxxxxxxxxxx:loadbalancer/app/k8s-ui-ui-5ddc3ba496/* because no identity-based policy allows the elasticloadbalancing:CreateLoadBalancer action - status code: 403, request id: b862fb9c-480b-44b5-ba6f-426a3884b6b6 - Warning FailedDeployModel 26s (x5 over 66s) ingress (combined from similar events): Failed deploy model due to AccessDenied: User: arn:aws:sts::xxxxxxxxxxxx:assumed-role/alb-controller-20240611131524228000000002/1718115201989397805 is not authorized to perform: elasticloadbalancing:CreateLoadBalancer on resource: arn:aws:elasticloadbalancing:us-west-2:xxxxxxxxxxxx:loadbalancer/app/k8s-ui-ui-5ddc3ba496/* because no identity-based policy allows the elasticloadbalancing:CreateLoadBalancer action - status code: 403, request id: 197cf2f7-2f68-44f2-92ae-ff5b36cb150f -``` - -:::tip -In AWS generally for creation/deletion/update of any resource, you will observe a corresponding API call which are recorded in CloudTrail. Look for any CloudTrail events for CreateLoadBalancer API calls. Do you observe any such calls in the last 1 hour of this lab setup? -::: - -### Step 5 - -With this setup, we’re leveraging IAM Roles for Service Accounts, which essentially allows pods to assume IAM roles using service accounts in Kubernetes and OIDC provider associated with your EKS cluster. Locate the service account that load balancer controller is using and find out the IAM role associated with it, to identify the IAM entity that would make API calls to provision your load balancer. -Try running: - -```bash -$ kubectl get serviceaccounts -n kube-system -l app.kubernetes.io/name=aws-load-balancer-controller -o yaml -``` - -```yaml {8} -apiVersion: v1 -items: - - apiVersion: v1 - automountServiceAccountToken: true - kind: ServiceAccount - metadata: - annotations: - eks.amazonaws.com/role-arn: arn:aws:iam::xxxxxxxxxxxx:role/alb-controller-20240611131524228000000002 - meta.helm.sh/release-name: aws-load-balancer-controller - meta.helm.sh/release-namespace: kube-system - creationTimestamp: "2024-06-11T13:15:32Z" - labels: - app.kubernetes.io/instance: aws-load-balancer-controller - app.kubernetes.io/managed-by: Helm - app.kubernetes.io/name: aws-load-balancer-controller - app.kubernetes.io/version: v2.7.1 - helm.sh/chart: aws-load-balancer-controller-1.7.1 - name: aws-load-balancer-controller-sa - namespace: kube-system - resourceVersion: "4950707" - uid: 6d842045-f2b4-4406-869b-f2addc67ff4d -kind: List -metadata: - resourceVersion: "" -``` - -:::tip -Can you verify if there’s a call in your CloudTrail events with the IAM role listed in the output for above command? If not, take a look at the logs from your controller. -::: - -### Hint 6 - -You can check the logs from controller pods to find additional details which could be preventing the load balancer to create. Let's check the logs using the command below. 
- -```bash -$ kubectl logs -n kube-system -l app.kubernetes.io/name=aws-load-balancer-controller -``` - -For example the output may show something similar to the below output. - -``` -{"level":"error","ts":"2024-06-11T14:24:24Z","msg":"Reconciler error","controller":"ingress","object":{"name":"ui","namespace":"ui"},"namespace":"ui","name":"ui","reconcileID":"49d27bbb-96e5-43b4-b115-b7a07e757148","error":"AccessDenied: User: arn:aws:sts::xxxxxxxxxxxx:assumed-role/alb-controller-20240611131524228000000002/1718115201989397805 is not authorized to perform: elasticloadbalancing:CreateLoadBalancer on resource: arn:aws:elasticloadbalancing:us-west-2:xxxxxxxxxxxx:loadbalancer/app/k8s-ui-ui-5ddc3ba496/* because no identity-based policy allows the elasticloadbalancing:CreateLoadBalancer action\n\tstatus code: 403, request id: a24a1620-3a75-46b7-b3c3-9c80fada159e"} -``` - -As you can see the error indicates the IAM role does not have the correct permissions, in this case the permissions to create the load balancer `elasticloadbalancing:CreateLoadBalancer`. - -:::tip -Verify the correct permissions required by the IAM role in the documentations here [[1]](https://kubernetes-sigs.github.io/aws-load-balancer-controller/v2.4/deploy/installation/#setup-iam-manually) where you can find the latest IAM permissions json file required for the LB Controller. After the changes, you have to wait a few minutes for the changes to reflect, since IAM uses an eventual consistency model. To make the changes, locate the IAM role through the AWS console and add the missing permissions that are shown in the log. In this case CreateLoadBalancer is missing. -::: - -Now let's fix it. To avoid conflicts with the automation of the workshop, we have already provisioned the correct permissions into the account and added the environment variable `LOAD_BALANCER_CONTROLLER_ROLE_NAME` that contains the role name and `LOAD_BALANCER_CONTROLLER_POLICY_ARN_FIX` which contains the correct IAM policy arn, and `LOAD_BALANCER_CONTROLLER_POLICY_ARN_ISSUE` that contains the incorrect IAM policy arn. - -So, to fix it we will just need to attach the correct IAM policy, as follows: - -```bash -$ aws iam attach-role-policy --role-name ${LOAD_BALANCER_CONTROLLER_ROLE_NAME} --policy-arn ${LOAD_BALANCER_CONTROLLER_POLICY_ARN_FIX} -``` - -and detach the incorrect IAM policy from the role: - -```bash -$ aws iam detach-role-policy --role-name ${LOAD_BALANCER_CONTROLLER_ROLE_NAME} --policy-arn ${LOAD_BALANCER_CONTROLLER_POLICY_ARN_ISSUE} -``` - -Try accessing the new Ingress URL in the browser as before to check if you can access the UI app: - -```bash -$ kubectl get ingress -n ui ui -o jsonpath="{.status.loadBalancer.ingress[*].hostname}{'\n'}" -k8s-ui-ui-5ddc3ba496-1208241872.us-west-2.elb.amazonaws.com -``` - -:::tip -It can take a couple of minutes for the Load Balancer to be available once created. -::: - -Also, feel free to go to CloudTrail again and verify the API call for CreateLoadBalancer is there. - -### Step 7 - -Even though the ingress creation succeeded, when you try accessing the app in browser there is an error stating, "Backend service does not exist". - -![ALb-Backend-DoesNotExist](./assets/alb-does-not-exist.webp) - -Since ingress is created, that would mean that there is an issue with communication from the Kubernetes ingress to the service. Check the deployment and service using: - -```bash -$ kubectl -n ui get service/ui -o yaml -``` - -```yaml {27} -apiVersion: v1 -kind: Service -metadata: - annotations: - ... 
- labels: - app.kubernetes.io/component: service - app.kubernetes.io/created-by: eks-workshop - app.kubernetes.io/instance: ui - app.kubernetes.io/managed-by: Helm - app.kubernetes.io/name: ui - helm.sh/chart: ui-0.0.1 - name: ui - namespace: ui - resourceVersion: "4950875" - uid: dc832144-b2a1-41cd-b7a1-8979111da677 -spec: - ... - ports: - - name: http - port: 80 - protocol: TCP - targetPort: http - selector: - app.kubernetes.io/component: service - app.kubernetes.io/instance: ui - app.kubernetes.io/name: ui-app - sessionAffinity: None - type: ClusterIP -status: - loadBalancer: {} -``` - -And now check the ingress configuration: - -```bash -$ kubectl get ingress/ui -n ui -o yaml -``` - -```yaml {23} -apiVersion: networking.k8s.io/v1 -kind: Ingress -metadata: - annotations: - alb.ingress.kubernetes.io/healthcheck-path: /actuator/health/liveness - alb.ingress.kubernetes.io/scheme: internet-facing - alb.ingress.kubernetes.io/target-type: ip - ... - finalizers: - - ingress.k8s.aws/resources - generation: 1 - name: ui - namespace: ui - resourceVersion: "4950883" - uid: 327b899c-405e-431b-8d67-32578435f0b9 -spec: - ingressClassName: alb - rules: - - http: - paths: - - backend: - service: - name: service-ui - port: - number: 80 - path: / - pathType: Prefix -... -``` - -From the outputs, observe the ingress spec and the service name `name: service-ui` that it is pointing to versus what the service name should be. - -We will need to edit the ingress spec to point to correct service name using the command below, which contains the fix: - -```bash -$ kubectl apply -k ~/environment/eks-workshop/modules/troubleshooting/alb/creating-alb/fix_ingress -``` - -To look like: - -```yaml {10} -spec: - ingressClassName: alb - rules: - - http: - paths: - - path: / - pathType: Prefix - backend: - service: - name: ui - port: - number: 80 -``` - -Try accessing the ALB again using the domain name shared in the get ingress output and check if you can access the app now? - -### Step 8 - -Now we observe a 503 error when accessing the ALB: - -![ALb-503-ERROR](./assets/alb-503.webp) - -503 would suggest a server-side issue, specifically with the service being unavailable. But we ensured that the service was running on the cluster when we ran get service command in _Step 7_. - -In Kubernetes, a service is just a construct to expose deployments either externally or within the cluster. Services rely on selectors to be able to send traffic to the correct backend deployment. To verify that we have our service pointing to the correct deployment, check the endpoints that are dynamically configured by kube-proxy on service creation. Run the following command: - -```bash -$ kubectl -n ui get endpoints ui -NAME ENDPOINTS AGE -ui 13d -``` - -The endpoints in command above should be pointing to IPs of the app pods running in _ui_ namespace. Can you identify if the selectors are setup correctly in service? - -### Step 9: - -Taking a look at the deployment spec using command below, verify the selector value being used versus the one used in your service. - -```bash -$ kubectl -n ui get deploy/ui -o yaml -``` - -```yaml {34} -apiVersion: apps/v1 -kind: Deployment -metadata: - annotations: - ... - name: ui - namespace: ui - .. 
-spec: - progressDeadlineSeconds: 600 - replicas: 1 - revisionHistoryLimit: 10 - selector: - matchLabels: - app.kubernetes.io/component: service - app.kubernetes.io/instance: ui - app.kubernetes.io/name: ui - strategy: - rollingUpdate: - maxSurge: 25% - maxUnavailable: 25% - type: RollingUpdate - template: - metadata: - annotations: - prometheus.io/path: /actuator/prometheus - prometheus.io/port: "8080" - prometheus.io/scrape: "true" - creationTimestamp: null - labels: - app.kubernetes.io/component: service - app.kubernetes.io/created-by: eks-workshop - app.kubernetes.io/instance: ui - app.kubernetes.io/name: ui - spec: - containers: -... - -``` - -And - -```bash -$ kubectl -n ui get svc ui -o yaml -``` - -```yaml {22} -apiVersion: v1 -kind: Service -metadata: - annotations: - ... - labels: - app.kubernetes.io/component: service - app.kubernetes.io/created-by: eks-workshop - app.kubernetes.io/instance: ui - app.kubernetes.io/managed-by: Helm - app.kubernetes.io/name: ui - helm.sh/chart: ui-0.0.1 - name: ui - namespace: ui - resourceVersion: "5000404" - uid: dc832144-b2a1-41cd-b7a1-8979111da677 -spec: - ... - selector: - app.kubernetes.io/component: service - app.kubernetes.io/instance: ui - app.kubernetes.io/name: ui-app - sessionAffinity: None - type: ClusterIP -... -``` - -Notice what the `service/ui` selector is using and what the actual `deployment/ui` labels are. To fix the issue, we need to update the `service/ui` selector `app.kubernetes.io/name: ui-app` to `app.kubernetes.io/name: ui`. - -:::tip -You can either update the service selector with: - -- `kubectl edit service -n ` or -- `kubectl patch service -n --type='json' -p='[{"op": "replace", "path": "/spec/selector", "value": {"key1": "value1", "key2": "value2"}}]'` - ::: - -for your convenience, we have added a kustomize script that update the selector, just execute the following command: - -```bash -$ kubectl apply -k ~/environment/eks-workshop/modules/troubleshooting/alb/creating-alb/fix_ui -``` - -Now refresh the browsers and you should see the ui application: - -![ALB-UI-APP](./assets/alb-working.webp) - -**Go ahead and enjoy a break, you’ve earned it!!** - -## Wrapping it up - -Here’s the general flow of how Load Balancer Controller works: - -1. The controller watches for [ingress events](https://kubernetes.io/docs/concepts/services-networking/ingress/#ingress-controllers) from the API server. When it finds ingress resources that satisfy its requirements, it begins the creation of AWS resources. - -2. An [ALB](https://docs.aws.amazon.com/elasticloadbalancing/latest/application/introduction.html) (ELBv2) is created in AWS for the new ingress resource. This ALB can be internet-facing or internal. You can also specify the subnets it's created in using annotations. - -3. [Target Groups](https://docs.aws.amazon.com/elasticloadbalancing/latest/application/load-balancer-target-groups.html) are created in AWS for each unique Kubernetes service described in the ingress resource. - -4. [Listeners](https://docs.aws.amazon.com/elasticloadbalancing/latest/application/load-balancer-listeners.html) are created for every port detailed in your ingress resource annotations. When no port is specified, sensible defaults (80 or 443) are used. Certificates may also be attached via annotations. - -5. [Rules](https://docs.aws.amazon.com/elasticloadbalancing/latest/application/listener-update-rules.html) are created for each path specified in your ingress resource. This ensures traffic to a specific path is routed to the correct Kubernetes Service. 
- ---- diff --git a/website/docs/troubleshooting/Scenarios/index.md b/website/docs/troubleshooting/Scenarios/index.md deleted file mode 100644 index 9fcf0379f6..0000000000 --- a/website/docs/troubleshooting/Scenarios/index.md +++ /dev/null @@ -1,12 +0,0 @@ ---- -title: "Scenarios" -sidebar_position: 20 ---- - -Even with careful planning and preparation, unexpected issues can sometimes arise when working with technology or completing complex tasks. This section provides examples of common troubleshooting scenarios you may encounter, along with step-by-step guidance on how to diagnose and resolve the problems. - -These are the scenarios covered: - -- AWS Load Balancer Controller -- Node not ready (Coming soon) -- others.. diff --git a/website/docs/troubleshooting/alb/alb_fix_1.md b/website/docs/troubleshooting/alb/alb_fix_1.md new file mode 100644 index 0000000000..fab629245f --- /dev/null +++ b/website/docs/troubleshooting/alb/alb_fix_1.md @@ -0,0 +1,165 @@ +--- +title: "Section 1 - Fixing Tag Issue" +sidebar_position: 30 +--- + +The task for you in this troubleshooting scenario is to investigate the deployment for AWS Load Balancer Controller as well as the ingress object created by following the prompts with the script. At the end of this session, you should be able to see the ui app on your EKS cluster using ALB ingress through the browsers as depicted in the image. + +![ingress](./assets/ingress.webp) + +## Let's start the troubleshooting + +### Step 1: + +First, we need to verify the status of our pods and get ingress for ingress object creation. To do so, we will use `kubectl` tool. + +```bash +$ kubectl get pod -n ui +NAME READY STATUS RESTARTS AGE +ui-68495c748c-jkh2z 1/1 Running 0 85s +``` + +### Step 2: + +In _Step 1_, we checked the pods status for our application and aws-load-balancer-controller. The _aws-load-balancer-controller_ deployment is responsible for ALB creation for any ingress objects applied to the cluster. + +Upon looking for ingress object, did you observe any ALB DNS name to access your application with the ingress object? You can also verify ALB creation in the AWS Management Console. In a successful installation scenario, the ingress object should have an ALB DNS name shown like the example below. However in this case, the ADDRESS section where the ALB DNS should have populated is empty. + +```bash +$ kubectl get ingress/ui -n ui +NAME CLASS HOSTS ADDRESS PORTS AGE +ui alb * 80 105s + +#---This is the expected output when the ingress was deployed correctly-- +NAME CLASS HOSTS ADDRESS PORTS AGE +ingress-2048 * k8s-ui-ingress2-xxxxxxxxxx-yyyyyyyyyy.region-code.elb.amazonaws.com 80 2m32s +``` + +### Step 3: + +Check further into the ingress for any events indicating why we do not see the ALB DNS. You can retrieve those logs by running the following command. The event logs should point you towards what the issue might be with ingress creation. 
+ +```bash +$ kubectl describe ingress/ui -n ui +Name: ui +Labels: +Namespace: ui +Address: +Ingress Class: alb +Default backend: +Rules: + Host Path Backends + ---- ---- -------- + * + / service-ui:80 () +Annotations: alb.ingress.kubernetes.io/healthcheck-path: /actuator/health/liveness + alb.ingress.kubernetes.io/scheme: internet-facing + alb.ingress.kubernetes.io/target-type: ip +Events: + Type Reason Age From Message + ---- ------ ---- ---- ------- + Warning FailedBuildModel 2m23s (x16 over 5m9s) ingress Failed build model due to couldn't auto-discover subnets: unable to resolve at least one subnet (0 match VPC and tags: [kubernetes.io/role/elb]) + +``` + +Refer the documentation on prerequisites for setting up ALB with EKS: https://kubernetes-sigs.github.io/aws-load-balancer-controller/v2.4/deploy/subnet_discovery/ + +### Step 4: + +_Step 3_ points to issues with the subnet auto-discovery for load balancer controller deployment. Ensure that all the public subnets have correct tags `tag:kubernetes.io/role/elb,Values=1'` + +:::info +Keep in mind that public subnet means the route table for the subnet has an Internet Gateway allowing traffic to and from the internet. +::: + +1. To find the all subnets through the command line, filter through existing ones with the following tag "Key: `alpha.eksctl.io/cluster-name` Value: `${EKS_CLUSTER_NAME}`". There should be four subnets. **Note:** _For your convenience we have added the cluster name as env variable with the variable `$EKS_CLUSTER_NAME`._ + +```bash +$ aws ec2 describe-subnets --filters "Name=tag:alpha.eksctl.io/cluster-name,Values=${EKS_CLUSTER_NAME}" --query 'Subnets[].SubnetId[]' +[ + "subnet-xxxxxxxxxxxxxxxxx", + "subnet-xxxxxxxxxxxxxxxxx", + "subnet-xxxxxxxxxxxxxxxxx", + "subnet-xxxxxxxxxxxxxxxxx", + "subnet-xxxxxxxxxxxxxxxxx", + "subnet-xxxxxxxxxxxxxxxxx" +] +``` + +2. Then by adding in the subnet ID into the route tables CLI filter one at a time, `--filters 'Name=association.subnet-id,Values=subnet-xxxxxxxxxxxxxxxxx'`, identify which subnets are public. + +``` +aws ec2 describe-route-tables --filters 'Name=association.subnet-id,Values=' --query 'RouteTables[].Routes[].[DestinationCidrBlock,GatewayId]' +``` + +Here a script that will help to iterate over the list of subnets + +```bash + +$ for subnet_id in $(aws ec2 describe-subnets --filters "Name=tag:alpha.eksctl.io/cluster-name,Values=${EKS_CLUSTER_NAME}" --query 'Subnets[].SubnetId[]' --output text); do echo "Subnect: ${subnet_id}"; aws ec2 describe-route-tables --filters "Name=association.subnet-id,Values=${subnet_id}" --query 'RouteTables[].Routes[].[DestinationCidrBlock,GatewayId]'; done +``` + +If the output shows `0.0.0.0/0` route to an Internet gateway ID, this is a public subnet. See below example. + +``` +WSParticipantRole:~/environment $ aws ec2 describe-route-tables --filters "Name=association.subnet-id,Values=subnet-xxxxxxxxxxxxx0470" --query 'RouteTables[].Routes[].[DestinationCidrBlock,GatewayId]' +[ + [ + "10.42.0.0/16", + "local" + ], + [ + "0.0.0.0/0", + "igw-xxxxxxxxxxxxxxxxx" + ] +] +``` + +3. Once you have all the public subnet ID's, describe subnets with the appropriate tag and confirm that the public subnet ID's that you identified are missing. In our case, none of our subnets have the correct tags. + +```bash +$ aws ec2 describe-subnets --filters 'Name=tag:kubernetes.io/role/elb,Values=1' --query 'Subnets[].SubnetId' +[] +``` + +4. Then add the correct tags. 
To help you a little bit, we have added the three public subnet IDs to the `env` variables `PUBLIC_SUBNET_1`, `PUBLIC_SUBNET_2` and `PUBLIC_SUBNET_3`.

```
aws ec2 create-tags --resources subnet-xxxxxxxxxxxxxxxxx subnet-xxxxxxxxxxxxxxxxx subnet-xxxxxxxxxxxxxxxxx --tags 'Key="kubernetes.io/role/elb",Value=1'
```

```bash
$ aws ec2 create-tags --resources $PUBLIC_SUBNET_1 $PUBLIC_SUBNET_2 $PUBLIC_SUBNET_3 --tags 'Key="kubernetes.io/role/elb",Value=1'
```

5. Confirm the tags were created. You should see the public subnet IDs in the output of the command below.

```bash
$ aws ec2 describe-subnets --filters 'Name=tag:kubernetes.io/role/elb,Values=1' --query 'Subnets[].SubnetId'
[
  "subnet-xxxxxxxxxxxxxxxxx",
  "subnet-xxxxxxxxxxxxxxxxx",
  "subnet-xxxxxxxxxxxxxxxxx"
]
```

6. Now restart the controller deployment using the `kubectl rollout restart` command:

```bash timeout=180
$ kubectl -n kube-system rollout restart deploy aws-load-balancer-controller
deployment.apps/aws-load-balancer-controller restarted
```

7. Now, check the ingress again:

```bash expectError=true
$ kubectl describe ingress/ui -n ui
  Warning  FailedDeployModel  68s  ingress  Failed deploy model due to AccessDenied: User: arn:aws:sts::xxxxxxxxxxxx:assumed-role/alb-controller-20240611131524228000000002/1718115201989397805 is not authorized to perform: elasticloadbalancing:CreateLoadBalancer on resource: arn:aws:elasticloadbalancing:us-west-2:xxxxxxxxxxxx:loadbalancer/app/k8s-ui-ui-5ddc3ba496/* because no identity-based policy allows the elasticloadbalancing:CreateLoadBalancer action
           status code: 403, request id: b862fb9c-480b-44b5-ba6f-426a3884b6b6
  Warning  FailedDeployModel  26s (x5 over 66s)  ingress  (combined from similar events): Failed deploy model due to AccessDenied: User: arn:aws:sts::xxxxxxxxxxxx:assumed-role/alb-controller-20240611131524228000000002/1718115201989397805 is not authorized to perform: elasticloadbalancing:CreateLoadBalancer on resource: arn:aws:elasticloadbalancing:us-west-2:xxxxxxxxxxxx:loadbalancer/app/k8s-ui-ui-5ddc3ba496/* because no identity-based policy allows the elasticloadbalancing:CreateLoadBalancer action
           status code: 403, request id: 197cf2f7-2f68-44f2-92ae-ff5b36cb150f
```

:::tip
In AWS, creating, deleting, or updating a resource generally results in a corresponding API call, which is recorded in CloudTrail. Look for CloudTrail events for `CreateLoadBalancer` API calls. Do you observe any such calls within the last hour of this lab setup?
:::
diff --git a/website/docs/troubleshooting/alb/alb_fix_5.md b/website/docs/troubleshooting/alb/alb_fix_5.md
new file mode 100644
index 0000000000..3cf1d840d6
--- /dev/null
+++ b/website/docs/troubleshooting/alb/alb_fix_5.md
@@ -0,0 +1,91 @@
---
title: "Section 2 - Fixing Policy Issue"
sidebar_position: 31
---

### Step 5

With this setup, we’re leveraging IAM Roles for Service Accounts (IRSA), which essentially allows pods to assume IAM roles using Kubernetes service accounts and the OIDC provider associated with your EKS cluster. Locate the service account that the load balancer controller is using and find the IAM role associated with it, to identify the IAM entity that makes the API calls that provision your load balancer.
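If you only want the IAM role ARN, a jsonpath query along the lines of the sketch below prints just that annotation (this assumes the service account carries the standard `eks.amazonaws.com/role-arn` IRSA annotation, which you can confirm in the full output that follows).

```bash
# Sketch: print only the IAM role ARN annotation from the controller's service account.
$ kubectl get serviceaccounts -n kube-system \
    -l app.kubernetes.io/name=aws-load-balancer-controller \
    -o jsonpath='{.items[0].metadata.annotations.eks\.amazonaws\.com/role-arn}{"\n"}'
```

The next command dumps the complete ServiceAccount object, including the Helm release metadata.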
+Try running: + +```bash +$ kubectl get serviceaccounts -n kube-system -l app.kubernetes.io/name=aws-load-balancer-controller -o yaml +``` + +```yaml {8} +apiVersion: v1 +items: + - apiVersion: v1 + automountServiceAccountToken: true + kind: ServiceAccount + metadata: + annotations: + eks.amazonaws.com/role-arn: arn:aws:iam::xxxxxxxxxxxx:role/alb-controller-20240611131524228000000002 + meta.helm.sh/release-name: aws-load-balancer-controller + meta.helm.sh/release-namespace: kube-system + creationTimestamp: "2024-06-11T13:15:32Z" + labels: + app.kubernetes.io/instance: aws-load-balancer-controller + app.kubernetes.io/managed-by: Helm + app.kubernetes.io/name: aws-load-balancer-controller + app.kubernetes.io/version: v2.7.1 + helm.sh/chart: aws-load-balancer-controller-1.7.1 + name: aws-load-balancer-controller-sa + namespace: kube-system + resourceVersion: "4950707" + uid: 6d842045-f2b4-4406-869b-f2addc67ff4d +kind: List +metadata: + resourceVersion: "" +``` + +:::tip +Can you verify if there’s a call in your CloudTrail events with the IAM role listed in the output for above command? If not, take a look at the logs from your controller. +::: + +### Step 6 + +You can check the logs from controller pods to find additional details which could be preventing the load balancer to create. Let's check the logs using the command below. + +```bash +$ kubectl logs -n kube-system -l app.kubernetes.io/name=aws-load-balancer-controller +``` + +For example the output may show something similar to the below output. + +``` +{"level":"error","ts":"2024-06-11T14:24:24Z","msg":"Reconciler error","controller":"ingress","object":{"name":"ui","namespace":"ui"},"namespace":"ui","name":"ui","reconcileID":"49d27bbb-96e5-43b4-b115-b7a07e757148","error":"AccessDenied: User: arn:aws:sts::xxxxxxxxxxxx:assumed-role/alb-controller-20240611131524228000000002/1718115201989397805 is not authorized to perform: elasticloadbalancing:CreateLoadBalancer on resource: arn:aws:elasticloadbalancing:us-west-2:xxxxxxxxxxxx:loadbalancer/app/k8s-ui-ui-5ddc3ba496/* because no identity-based policy allows the elasticloadbalancing:CreateLoadBalancer action\n\tstatus code: 403, request id: a24a1620-3a75-46b7-b3c3-9c80fada159e"} +``` + +As you can see the error indicates the IAM role does not have the correct permissions, in this case the permissions to create the load balancer `elasticloadbalancing:CreateLoadBalancer`. + +:::tip +Verify the correct permissions required by the IAM role in the documentations here [[1]](https://kubernetes-sigs.github.io/aws-load-balancer-controller/v2.4/deploy/installation/#setup-iam-manually) where you can find the latest IAM permissions json file required for the LB Controller. After the changes, you have to wait a few minutes for the changes to reflect, since IAM uses an eventual consistency model. To make the changes, locate the IAM role through the AWS console and add the missing permissions that are shown in the log. In this case CreateLoadBalancer is missing. +::: + +Now let's fix it. To avoid conflicts with the automation of the workshop, we have already provisioned the correct permissions into the account and added the environment variable `LOAD_BALANCER_CONTROLLER_ROLE_NAME` that contains the role name and `LOAD_BALANCER_CONTROLLER_POLICY_ARN_FIX` which contains the correct IAM policy arn, and `LOAD_BALANCER_CONTROLLER_POLICY_ARN_ISSUE` that contains the incorrect IAM policy arn. 
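Before swapping anything, it can be useful to confirm which managed policies are currently attached to that role. A small sketch, assuming the `LOAD_BALANCER_CONTROLLER_ROLE_NAME` variable described above is set:

```bash
# Sketch: list the managed policies currently attached to the controller's IAM role,
# so you can compare what is attached before and after the fix.
$ aws iam list-attached-role-policies \
    --role-name ${LOAD_BALANCER_CONTROLLER_ROLE_NAME} \
    --query 'AttachedPolicies[].PolicyArn'
```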
+ +So, to fix it we will just need to attach the correct IAM policy, as follows: + +```bash +$ aws iam attach-role-policy --role-name ${LOAD_BALANCER_CONTROLLER_ROLE_NAME} --policy-arn ${LOAD_BALANCER_CONTROLLER_POLICY_ARN_FIX} +``` + +and detach the incorrect IAM policy from the role: + +```bash +$ aws iam detach-role-policy --role-name ${LOAD_BALANCER_CONTROLLER_ROLE_NAME} --policy-arn ${LOAD_BALANCER_CONTROLLER_POLICY_ARN_ISSUE} +``` + +Try accessing the new Ingress URL in the browser as before to check if you can access the UI app: + +```bash +$ kubectl get ingress -n ui ui -o jsonpath="{.status.loadBalancer.ingress[*].hostname}{'\n'}" +k8s-ui-ui-5ddc3ba496-1208241872.us-west-2.elb.amazonaws.com +``` + +:::tip +It can take a couple of minutes for the Load Balancer to be available once created. +::: + +Also, feel free to go to CloudTrail again and verify the API call for CreateLoadBalancer is there. diff --git a/website/docs/troubleshooting/alb/alb_fix_7.md b/website/docs/troubleshooting/alb/alb_fix_7.md new file mode 100644 index 0000000000..ded94a1af2 --- /dev/null +++ b/website/docs/troubleshooting/alb/alb_fix_7.md @@ -0,0 +1,236 @@ +--- +title: "Section 3 - Fixing Manifest Configs" +sidebar_position: 32 +--- + +### Step 7 + +Even though the ingress creation succeeded, when you try accessing the app in browser there is an error stating, "Backend service does not exist". + +![ALb-Backend-DoesNotExist](./assets/alb-does-not-exist.webp) + +Since ingress is created, that would mean that there is an issue with communication from the Kubernetes ingress to the service. Check the deployment and service using: + +```bash +$ kubectl -n ui get service/ui -o yaml +``` + +```yaml {27} +apiVersion: v1 +kind: Service +metadata: + annotations: + ... + labels: + app.kubernetes.io/component: service + app.kubernetes.io/created-by: eks-workshop + app.kubernetes.io/instance: ui + app.kubernetes.io/managed-by: Helm + app.kubernetes.io/name: ui + helm.sh/chart: ui-0.0.1 + name: ui + namespace: ui + resourceVersion: "4950875" + uid: dc832144-b2a1-41cd-b7a1-8979111da677 +spec: + ... + ports: + - name: http + port: 80 + protocol: TCP + targetPort: http + selector: + app.kubernetes.io/component: service + app.kubernetes.io/instance: ui + app.kubernetes.io/name: ui-app + sessionAffinity: None + type: ClusterIP +status: + loadBalancer: {} +``` + +And now check the ingress configuration: + +```bash +$ kubectl get ingress/ui -n ui -o yaml +``` + +```yaml {23} +apiVersion: networking.k8s.io/v1 +kind: Ingress +metadata: + annotations: + alb.ingress.kubernetes.io/healthcheck-path: /actuator/health/liveness + alb.ingress.kubernetes.io/scheme: internet-facing + alb.ingress.kubernetes.io/target-type: ip + ... + finalizers: + - ingress.k8s.aws/resources + generation: 1 + name: ui + namespace: ui + resourceVersion: "4950883" + uid: 327b899c-405e-431b-8d67-32578435f0b9 +spec: + ingressClassName: alb + rules: + - http: + paths: + - backend: + service: + name: service-ui + port: + number: 80 + path: / + pathType: Prefix +... +``` + +From the outputs, observe the ingress spec and the service name `name: service-ui` that it is pointing to versus what the service name should be. 
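To put the two names side by side without reading through the full manifests, a sketch like this may help; the jsonpath expression assumes the single rule and path shown in the ingress above:

```bash
# Sketch: print the backend service name referenced by the ingress,
# then list the services that actually exist in the ui namespace.
$ kubectl get ingress/ui -n ui \
    -o jsonpath='{.spec.rules[0].http.paths[0].backend.service.name}{"\n"}'
service-ui
$ kubectl get services -n ui
```

The ingress points to `service-ui`, but no service with that name exists in the namespace.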
+ +We will need to edit the ingress spec to point to correct service name using the command below, which contains the fix: + +```bash +$ kubectl apply -k ~/environment/eks-workshop/modules/troubleshooting/alb/creating-alb/fix_ingress +``` + +To look like: + +```yaml {10} +spec: + ingressClassName: alb + rules: + - http: + paths: + - path: / + pathType: Prefix + backend: + service: + name: ui + port: + number: 80 +``` + +Try accessing the ALB again using the domain name shared in the get ingress output and check if you can access the app now? + +### Step 8 + +Now we observe a 503 error when accessing the ALB: + +![ALb-503-ERROR](./assets/alb-503.webp) + +503 would suggest a server-side issue, specifically with the service being unavailable. But we ensured that the service was running on the cluster when we ran get service command in _Step 7_. + +In Kubernetes, a service is just a construct to expose deployments either externally or within the cluster. Services rely on selectors to be able to send traffic to the correct backend deployment. To verify that we have our service pointing to the correct deployment, check the endpoints that are dynamically configured by kube-proxy on service creation. Run the following command: + +```bash +$ kubectl -n ui get endpoints ui +NAME ENDPOINTS AGE +ui 13d +``` + +The endpoints in command above should be pointing to IPs of the app pods running in _ui_ namespace. Can you identify if the selectors are setup correctly in service? + +### Step 9: + +Taking a look at the deployment spec using command below, verify the selector value being used versus the one used in your service. + +```bash +$ kubectl -n ui get deploy/ui -o yaml +``` + +```yaml {34} +apiVersion: apps/v1 +kind: Deployment +metadata: + annotations: + ... + name: ui + namespace: ui + .. +spec: + progressDeadlineSeconds: 600 + replicas: 1 + revisionHistoryLimit: 10 + selector: + matchLabels: + app.kubernetes.io/component: service + app.kubernetes.io/instance: ui + app.kubernetes.io/name: ui + strategy: + rollingUpdate: + maxSurge: 25% + maxUnavailable: 25% + type: RollingUpdate + template: + metadata: + annotations: + prometheus.io/path: /actuator/prometheus + prometheus.io/port: "8080" + prometheus.io/scrape: "true" + creationTimestamp: null + labels: + app.kubernetes.io/component: service + app.kubernetes.io/created-by: eks-workshop + app.kubernetes.io/instance: ui + app.kubernetes.io/name: ui + spec: + containers: +... + +``` + +And + +```bash +$ kubectl -n ui get svc ui -o yaml +``` + +```yaml {22} +apiVersion: v1 +kind: Service +metadata: + annotations: + ... + labels: + app.kubernetes.io/component: service + app.kubernetes.io/created-by: eks-workshop + app.kubernetes.io/instance: ui + app.kubernetes.io/managed-by: Helm + app.kubernetes.io/name: ui + helm.sh/chart: ui-0.0.1 + name: ui + namespace: ui + resourceVersion: "5000404" + uid: dc832144-b2a1-41cd-b7a1-8979111da677 +spec: + ... + selector: + app.kubernetes.io/component: service + app.kubernetes.io/instance: ui + app.kubernetes.io/name: ui-app + sessionAffinity: None + type: ClusterIP +... +``` + +Notice what the `service/ui` selector is using and what the actual `deployment/ui` labels are. To fix the issue, we need to update the `service/ui` selector `app.kubernetes.io/name: ui-app` to `app.kubernetes.io/name: ui`. 
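A quick way to see the mismatch on one screen is to print the service's selector next to the deployment's pod template labels; a sketch:

```bash
# Sketch: compare the service selector with the deployment's pod template labels.
# The app.kubernetes.io/name values should match, but here they do not.
$ kubectl -n ui get svc ui -o jsonpath='{.spec.selector}{"\n"}'
$ kubectl -n ui get deploy ui -o jsonpath='{.spec.template.metadata.labels}{"\n"}'
```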
:::tip
You can either update the service selector with:

- `kubectl edit service <service-name> -n <namespace>` or
- `kubectl patch service <service-name> -n <namespace> --type='json' -p='[{"op": "replace", "path": "/spec/selector", "value": {"key1": "value1", "key2": "value2"}}]'`

:::

For your convenience, we have added a kustomize script that updates the selector. Just execute the following command:

```bash
$ kubectl apply -k ~/environment/eks-workshop/modules/troubleshooting/alb/creating-alb/fix_ui
```

Now refresh the browser and you should see the UI application:

![ALB-UI-APP](./assets/alb-working.webp)

### Go ahead and enjoy a break, you’ve earned it!!
diff --git a/website/docs/troubleshooting/alb/alb_fix_wrapping_up.md b/website/docs/troubleshooting/alb/alb_fix_wrapping_up.md
new file mode 100644
index 0000000000..9dcee5a2a4
--- /dev/null
+++ b/website/docs/troubleshooting/alb/alb_fix_wrapping_up.md
@@ -0,0 +1,20 @@
---
title: "Wrapping it up"
sidebar_position: 33
---

## Wrapping it up

Here’s the general flow of how the Load Balancer Controller works:

1. The controller watches for [ingress events](https://kubernetes.io/docs/concepts/services-networking/ingress/#ingress-controllers) from the API server. When it finds ingress resources that satisfy its requirements, it begins the creation of AWS resources.

2. An [ALB](https://docs.aws.amazon.com/elasticloadbalancing/latest/application/introduction.html) (ELBv2) is created in AWS for the new ingress resource. This ALB can be internet-facing or internal. You can also specify the subnets it's created in using annotations.

3. [Target Groups](https://docs.aws.amazon.com/elasticloadbalancing/latest/application/load-balancer-target-groups.html) are created in AWS for each unique Kubernetes service described in the ingress resource.

4. [Listeners](https://docs.aws.amazon.com/elasticloadbalancing/latest/application/load-balancer-listeners.html) are created for every port detailed in your ingress resource annotations. When no port is specified, sensible defaults (80 or 443) are used. Certificates may also be attached via annotations.

5. [Rules](https://docs.aws.amazon.com/elasticloadbalancing/latest/application/listener-update-rules.html) are created for each path specified in your ingress resource. This ensures traffic to a specific path is routed to the correct Kubernetes Service.
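If you would like to see those AWS resources for yourself once the controller has reconciled the ingress, a sketch along these lines can be used. The `k8s-ui-ui` name prefix matches the ALB created earlier in this lab; the target group name prefix is an assumption, since the controller generates those names.

```bash
# Sketch: list the ALB and target groups the controller created for the ui ingress.
# Adjust the name prefixes if your resources were generated with different names.
$ aws elbv2 describe-load-balancers \
    --query "LoadBalancers[?starts_with(LoadBalancerName, 'k8s-ui-ui')].[LoadBalancerName,DNSName,Scheme]"
$ aws elbv2 describe-target-groups \
    --query "TargetGroups[?starts_with(TargetGroupName, 'k8s-ui')].[TargetGroupName,Port,TargetType]"
```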
+ +--- diff --git a/website/docs/troubleshooting/Scenarios/alb/assets/IntScen-ALBC-1-2.webp b/website/docs/troubleshooting/alb/assets/IntScen-ALBC-1-2.webp similarity index 100% rename from website/docs/troubleshooting/Scenarios/alb/assets/IntScen-ALBC-1-2.webp rename to website/docs/troubleshooting/alb/assets/IntScen-ALBC-1-2.webp diff --git a/website/docs/troubleshooting/Scenarios/alb/assets/alb-503.webp b/website/docs/troubleshooting/alb/assets/alb-503.webp similarity index 100% rename from website/docs/troubleshooting/Scenarios/alb/assets/alb-503.webp rename to website/docs/troubleshooting/alb/assets/alb-503.webp diff --git a/website/docs/troubleshooting/Scenarios/alb/assets/alb-does-not-exist.webp b/website/docs/troubleshooting/alb/assets/alb-does-not-exist.webp similarity index 100% rename from website/docs/troubleshooting/Scenarios/alb/assets/alb-does-not-exist.webp rename to website/docs/troubleshooting/alb/assets/alb-does-not-exist.webp diff --git a/website/docs/troubleshooting/Scenarios/alb/assets/alb-working.webp b/website/docs/troubleshooting/alb/assets/alb-working.webp similarity index 100% rename from website/docs/troubleshooting/Scenarios/alb/assets/alb-working.webp rename to website/docs/troubleshooting/alb/assets/alb-working.webp diff --git a/website/docs/troubleshooting/Scenarios/alb/assets/ingress.webp b/website/docs/troubleshooting/alb/assets/ingress.webp similarity index 100% rename from website/docs/troubleshooting/Scenarios/alb/assets/ingress.webp rename to website/docs/troubleshooting/alb/assets/ingress.webp diff --git a/website/docs/troubleshooting/Scenarios/alb/index.md b/website/docs/troubleshooting/alb/index.md similarity index 62% rename from website/docs/troubleshooting/Scenarios/alb/index.md rename to website/docs/troubleshooting/alb/index.md index a35f375662..62b95f1c2b 100644 --- a/website/docs/troubleshooting/Scenarios/alb/index.md +++ b/website/docs/troubleshooting/alb/index.md @@ -8,7 +8,7 @@ description: "Expose HTTP and HTTPS routes to the outside world using Ingress AP ::required-time -On this scenario we will learn how to troubleshoot various AWS Load Balancer Controller deployment issues, as well as ingress objects created. If you want to learn more about how a Load balancer controller works please check out the [Fundamentals module] (./fundamentals/) +On this scenario we will learn how to troubleshoot various AWS Load Balancer Controller deployment issues, as well as ingress objects created. If you want to learn more about how a Load balancer controller works please check out the [Fundamentals module](/docs/fundamentals/) :::tip Before you start Prepare your environment for this section: @@ -17,16 +17,35 @@ Prepare your environment for this section: $ prepare-environment troubleshooting/alb ``` -This will make the following changes to your lab environment: +The preparation of the lab might take a couple of minutes and it will make the following changes to your lab environment: - Pre-configure the base application from the introduction module - Configure the AWS Load Balancer Controller in the Amazon EKS cluster - Configure an ingress to get access to the UI via an AWS Load Balancer - Introduce an issue to the configuration, so we can learn how to troubleshoot these types of issues - ::: + +::: You can view the Terraform that applies these changes [here](https://github.com/VAR::MANIFESTS_OWNER/VAR::MANIFESTS_REPOSITORY/tree/VAR::MANIFESTS_REF/manifests/modules/troubleshooting/alb/.workshop/terraform). 
+:::info Root Cause Analysis (RCA) Methodology + +While we wait for the scenario to finalize its configuration, lets talk about the _RCA Methodology_ really quick. + +The Root Cause Analysis (RCA) helps in identifying how and why an event or failure happened, allowing for corrective and preventive measures to be put in place and the RCA generally serves as input to a remediation process whereby corrective actions are taken to prevent the problem from reoccurring. + +**_The method steps:_** + +1. Identify and describe the problem clearly. +2. Collect data +3. Establish a timeline from the normal situation until the problem occurs. +4. Identify Root Cause +5. Distinguish between the root cause and other causal factors (e.g., using event correlation). +6. Establish a causal graph between the root cause and the problem. +7. Although the word "cause" is singular in RCA, experience shows that generally causes are plural. Therefore, look for multiple causes when carrying out RCA. + +::: + Now let's verify if the service and ingress is up and running, so we can start troubleshooting the scenario. ```bash diff --git a/website/docs/troubleshooting/Scenarios/alb/tests/hook-add-ingress.sh b/website/docs/troubleshooting/alb/tests/hook-add-ingress.sh similarity index 100% rename from website/docs/troubleshooting/Scenarios/alb/tests/hook-add-ingress.sh rename to website/docs/troubleshooting/alb/tests/hook-add-ingress.sh diff --git a/website/docs/troubleshooting/Scenarios/alb/tests/hook-suite.sh b/website/docs/troubleshooting/alb/tests/hook-suite.sh similarity index 100% rename from website/docs/troubleshooting/Scenarios/alb/tests/hook-suite.sh rename to website/docs/troubleshooting/alb/tests/hook-suite.sh diff --git a/website/docs/troubleshooting/index.md b/website/docs/troubleshooting/index.md index 4f3776fcf6..847e6758a5 100644 --- a/website/docs/troubleshooting/index.md +++ b/website/docs/troubleshooting/index.md @@ -1,9 +1,30 @@ --- -title: "Troubleshooting" +title: "Troubleshooting Scenarios" sidebar_position: 1 weight: 40 --- -Within this chapter, we intent to enable users on different troubleshooting techniques for their Amazon EKS cluster environment. Here, you will find the most common troubleshooting scenarios reported by customers to AWS Support. Keep in mind that we will use previous concepts from the other chapters while going through each scenario. +Even with careful planning and preparation, unexpected issues can sometimes arise when working with technology or completing complex tasks. This module provides examples of common troubleshooting scenarios to issues reported to AWS support, along with step-by-step guidance on how to diagnose and resolve the problems. -Before we start with the scenarios, I would like you to understand the basics of troubleshooting and for that we have introduced the Troubleshooting Methodologies section. +Keep in mind that we will use previous concepts from the other chapters while going through each scenario. + +### These are the scenarios covered in this module: + +- **AWS Load Balancer Controller** +- **Node not ready (Coming soon)** +- others.. + +:::info Troubleshooting Methodologies +As you progress through the scenarios, we will be introducing an overview of different troubleshooting methodologies. For example, all our scenarios are based in the **Reproductions method**. + +#### Reproductions Method + +Systems and applications come in varying sizes and complexities, which means that you cannot always rely on a full-scale reproduction. 
We recommend starting with a cut-down reproduction, focusing solely on the components involved. There are times where the issue is specific to the environment and there is a combination of factors necessary for it to occur - in this case, you may need a more complex or even full-scale reproduction, but in our experience that is far less common. + +Being able to reproduce an issue allows you: + +- Observe and experiment in a controlled environment, without affecting users of the system. +- Allows your team to hand over the problem to the team responsible for the failing components, so your team can continue to focus on more pressing matters and mitigation. +- Provide reproduction instructions to the designers or builders of that component, they can perform a deep, targeted investigation. + +::: diff --git a/website/docs/troubleshooting/methodologies/5050.md b/website/docs/troubleshooting/methodologies/5050.md deleted file mode 100644 index 2637b19d39..0000000000 --- a/website/docs/troubleshooting/methodologies/5050.md +++ /dev/null @@ -1,12 +0,0 @@ ---- -title: "50/50 method aka divide and conquer" -sidebar_position: 12 ---- - -Especially effective with large footprint request-response or transaction-based systems, and those with black boxes with limited observability, this method involves continually dividing the system into halves, similar to a binary search, to find the offending component in a system. - -Take the following Web API architecture: - -Client -> CloudFront -> Load Balancer -> Proxy Server -> API Gateway -> Load Balancer -> Container Service - -If you receive reports of errors or latency, without sufficient instrumentation throughout the stack, it may be difficult to locate the cause of the issue. Introducing additional observability in the application may not be achievable in the short term, so you could use the 50/50 method in a controlled reproduction. The first step would be gathering details on the problematic requests and test end-to-end to confirm it is reproducible. Then, to divide the stack into two, you could create a simulated client sitting alongside the proxy servers (in front of API Gateway). From these clients, you could send a request directly to API Gateway and see if the same errors/latency is seen. If you do not see the issue, you know the issue is somewhere between the client and the proxy server. If you do see the issue, you know it is somewhere between API Gateway and the container service and you can then test directly against the load balancer and container service, bypassing API Gateway. diff --git a/website/docs/troubleshooting/methodologies/_category_.json b/website/docs/troubleshooting/methodologies/_category_.json deleted file mode 100644 index 0d44758281..0000000000 --- a/website/docs/troubleshooting/methodologies/_category_.json +++ /dev/null @@ -1,3 +0,0 @@ -{ - "collapsed": false -} diff --git a/website/docs/troubleshooting/methodologies/index.md b/website/docs/troubleshooting/methodologies/index.md deleted file mode 100644 index 21be7b488a..0000000000 --- a/website/docs/troubleshooting/methodologies/index.md +++ /dev/null @@ -1,11 +0,0 @@ ---- -title: "Troubleshooting Methodologies" -sidebar_position: 10 -chapter: true -sidebar_custom_props: { "info": true } -description: "Here will be describing the methodology" ---- - -This section will provide an overview of various methods of troubleshooting and is followed by tips learned over years of troubleshooting. 
- -Problem Analysis, Problem Solving, Root Cause Analysis and Decision Making are all well-defined processes, with numerous techniques and systems developed over the years. Many of these systems have books and multi-day workshops, covering the methods in extreme depth. In our experience, many of these systems provide valuable information and we highly recommend using a mixed approach, borrowing from each, using the most suitable method for the given situation. Some methods are light-weight and only focus on root cause analysis, whereas others tie together root cause analysis and mitigation. Here we will focus on Root Cause Analysis and a number of other methods we use at Amazon. diff --git a/website/docs/troubleshooting/methodologies/others.md b/website/docs/troubleshooting/methodologies/others.md deleted file mode 100644 index 4e4e4d714a..0000000000 --- a/website/docs/troubleshooting/methodologies/others.md +++ /dev/null @@ -1,39 +0,0 @@ ---- -title: "Other Methods" -sidebar_position: 14 ---- - -This section contains details and references for a number of other problem solving methods. We suggest researching other methods as many of them include useful techniques that can be used, regardless of the primary method chosen. Here is a list of other methods, including a number of proprietary problem solving systems. - -### The Scientific method - -An ever-evolving and age-old system used by scientists world-wide, the scientific method is a process for experimentation that is used to explore observations and answer questions. Generally, it consists of the following steps: - -- Gather information or make an observation -- State the Problem or ask a question -- Form a hypothesis -- Make a prediction -- Test the hypothesis -- Observe Results & Draw conclusions -- Repeat as necessary - -The scientific method is almost always an iterative process, with the outcome of one round of hypothesis testing feeding into the next. In this workshop, we assume mutating changes are a last resort when troubleshooting, unless the issue is able to be reproduced in a controlled environment. So we will be focusing on "read only" root cause analysis where possible. For an excellent overview of the scientific method, please see the following pages: - -- [Troubleshooting with the Scientific Method ](https://www.inetdaemon.com/tutorials/troubleshooting/scientific_method.shtml) -- [The scientific method ](https://www.khanacademy.org/science/biology/intro-to-biology/science-of-biology/a/the-science-of-biology) - -### Ishikawa diagrams - -Ishikawa diagrams (also called fishbone diagrams or cause-and-effect diagrams) help you to determine the root cause by identifying possible causes of an issue. We suggest building collaborative diagrams, inviting your team members to ensure all the potential causes are identified. It is also important to note that some causes may have multiple sub-causes, so we also suggest expanding your diagram in a hierarchical manner to encompass all possible causes. - -For more information and examples of Ishikawa diagrams, please see the following page: - -Fishbone Diagram: A Tool to Organize a Problem’s Cause and Effect - -### Trial-and-error - -Also called the scattergun approach, trial-and-error is often a last resort, or a method used when panic sets in. Whilst it can be used in controlled environments where it is clear that an issue is caused by or fixed by a lever or configuration, it is far from ideal. 
Whilst flipping a switch may fix the problem, making mutating changes before understanding root cause is not ideal, as the issue may not be properly fixed, or there could be another factor at play, with the issue being indirectly "fixed" and root cause still not being understood. We often also change too many variables and don't observe enough in between changes. - -As an example, say you have a web application running on an EC2 instance. One day, the application stops working and you see failures when trying to access the application. Looking at CloudWatch metrics, CPU Utilization is at 100%. You stop the EC2 instance, change the instance type to one with more vCPUs and start it again. The instance starts, the application is bootstrapped, but you still see the same issue - errors and 100% CPU Utilization. Perhaps it is a problem with the new version of the application, so your team roll back to the previous version. The issue persists. Maybe it is running out of memory and CPU is spiking because it is continually paging, moving data from memory to disk and vice-versa continuously. You change the instance type to one with more memory and this time, everything works! - -Whilst you were able to bring the application back to a working state, there was no methodical approach taken to identify the root cause so that a targeted fix could be put in. The application is now running, but until we understand the root cause, we don't know how long until it happens again. diff --git a/website/docs/troubleshooting/methodologies/rca.md b/website/docs/troubleshooting/methodologies/rca.md deleted file mode 100644 index e32e7b9a13..0000000000 --- a/website/docs/troubleshooting/methodologies/rca.md +++ /dev/null @@ -1,76 +0,0 @@ ---- -title: "Root Cause Analysis (RCA)" -sidebar_position: 11 ---- - -Root Cause Analysis (RCA) helps in identifying how and why an event or failure happened, allowing for corrective and preventive measures to be put in place. - -### Importance of root cause analysis - -In the heat of an investigation, it can be tempting to find and implement a quick fix. However, the risk is two-fold: - -Further investigation and deep root cause analysis with persistent fix(es) may be deprioritized. -Correcting the immediate cause may eliminate a symptom of a problem, but not the problem itself. As the underlying cause has not been eliminated, the issue may occur again in the future. -RCA generally serves as input to a remediation process whereby corrective actions are taken to prevent the problem from reoccurring. - -### Method - -The method varies somewhat depending on the flavour used, however most schools of root cause analysis cover the following steps: - -1. Identify and describe the problem clearly. -2. Collect data -3. Establish a timeline from the normal situation until the problem occurs. -4. Identify Root Cause -5. Distinguish between the root cause and other causal factors (e.g., using event correlation). -6. Establish a causal graph between the root cause and the problem. -7. Although the word "cause" is singular in RCA, experience shows that generally causes are plural. Therefore, look for multiple causes when carrying out RCA. - -### Method - -The method varies somewhat depending on the flavour used, however most schools of root cause analysis cover the following steps: - -1. Identify and describe the problem clearly. -2. Collect data -3. Establish a timeline from the normal situation until the problem occurs. -4. Identify Root Cause -5. 
### Correction of Errors - -Once root cause has been established and you are no longer seeing impact, use a Correction of Errors process. This will enable you to: - -1. Prevent a recurrence. -2. Should preventative measures fail, reduce the impact of the next occurrence. -3. Improve observability and investigative ability to determine root cause should something similar happen again. - -We suggest using AWS Systems Manager Incident Manager, an incident management console designed to help you mitigate and recover from incidents affecting your AWS-hosted applications. Post-incident analysis guides you through identifying improvements to your incident response, including time to detection and mitigation. An analysis can also help you understand the root cause of the incidents. Incident Manager creates recommended action items to improve your incident response. - -### Five whys - -Five whys is a linear, iterative root cause analysis method used to explore the cause-and-effect relationships underlying a particular problem. The idea is simple: start with the problem and keep asking why until you get to all contributing and root causes, peeling away the layers of symptoms as you progress. - -At Amazon, we leverage the five whys technique during our Correction of Errors process. We have found that whilst "5 Whys" is a helpful mnemonic, teams are encouraged to ask why more than five times as they complete their root cause analysis, especially when considering environmental and systemic factors that may have contributed prior to the immediate incident. A sufficiently deep five whys analysis will likely surface multiple root causes or contributing causes, so branching your causal tree is generally expected and encouraged. - -The process is as follows: - -1. Identify the problem. -2. Ask why the problem happened, and record the reason. -3. Decide whether the reason is the root cause. -4. Ask whether the reason could have been prevented. -5. Ask whether the reason could have been detected before it happened. -6. If the reason is human error, ask why it was possible. -7. If the answer you just provided doesn’t identify the root cause of the problem, repeat the process using the reason as the problem. Stop when you are confident that you have found the root causes. - -Some key points to remember: - -- Five whys should be applied in a blame-free way, where the focus is on finding the "why" rather than blaming "who". -- If you see "human error" as a root cause in the RCA, it may indicate a missing check or fail-safe mechanism. Therefore, you should always ask why the human error was possible. -- Consider having an entirely separate causal tree for the duration of the impact (for example, "Why did this impact last as long as it did?"). - -If you use the [AWS Systems Manager Incident Manager](https://docs.aws.amazon.com/incident-manager/latest/userguide/analysis.html) COE template, the five whys section is included as part of the Incident questions under the Prevention section. - -### Establishing a timeline - -Regardless of the method used, establishing a timeline is critical. It allows us to understand the state of the system at the time and the sequence of events leading to the failure. It is perfectly fine to start with known events and fill in the gaps as the investigation continues. We recommend leveraging services like AWS CloudTrail and other auditing systems to ensure the timeline is built from objective data sources, rather than human memory. For any blind spots, anecdote and memory can be useful, but whenever possible, use data: trust, but verify.
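As a small, read-only sketch of building a timeline from objective sources, CloudTrail's 90-day event history can be queried directly from the CLI; the event name, time window and output fields below are placeholders to adapt to your own investigation:

```bash
# List who called a given API and when, within the investigation window
aws cloudtrail lookup-events \
  --lookup-attributes AttributeKey=EventName,AttributeValue=CreateTags \
  --start-time 2024-01-01T00:00:00Z \
  --end-time 2024-01-02T00:00:00Z \
  --query 'Events[].[EventTime,EventName,Username]' \
  --output table
```

Merging this with application logs and Kubernetes events gives a timeline grounded in data rather than memory.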
diff --git a/website/docs/troubleshooting/methodologies/reproductions.md b/website/docs/troubleshooting/methodologies/reproductions.md deleted file mode 100644 index 5df8f63122..0000000000 --- a/website/docs/troubleshooting/methodologies/reproductions.md +++ /dev/null @@ -1,8 +0,0 @@ ---- -title: "Reproductions" -sidebar_position: 13 ---- - -Being able to reproduce an issue allows you to observe and experiment in a controlled environment, without affecting users of the system. It also allows your team to hand the problem over to the team responsible for the failing components, so your team can continue to focus on more pressing matters and mitigation. If you can provide reproduction instructions to the designers or builders of that component, they can perform a deep, targeted investigation. - -Systems and applications come in varying sizes and complexities, which means that you cannot always rely on a full-scale reproduction. We recommend starting with a cut-down reproduction, focusing solely on the components involved. There are times when the issue is specific to the environment and a combination of factors is necessary for it to occur; in this case, you may need a more complex or even full-scale reproduction, but in our experience that is far less common. diff --git a/website/lab-timing-data.json b/website/lab-timing-data.json index 0b474e713f..106a4986c8 100644 --- a/website/lab-timing-data.json +++ b/website/lab-timing-data.json @@ -251,10 +251,10 @@ "file": "security/cluster-access-management/index.md", "executionTimeSeconds": 207 }, - "troubleshooting/scenarios/alb/index.md": { + "troubleshooting/alb/index.md": { "title": "Troubleshooting Module", - "directory": "troubleshooting/scenarios/alb", - "file": "troubleshooting/scenarios/alb/index.md", + "directory": "troubleshooting/alb", + "file": "troubleshooting/alb/index.md", "executionTimeSeconds": 780 } }