Building Tableau Server Container Images
- Customers require different drivers for their environments
- Many drivers are licensed such that Tableau can't package them
- Customers may be using technologies that don't offer other means of injecting drivers into containers; plain Docker, for instance, has no equivalent of Kubernetes initContainers.
Still, it would be nice if Tableau provided a base image that customers could extend with any necessary drivers, scripts, and other customizations. I'm likely to implement this model myself in the meantime.
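To make that concrete, here's a minimal sketch of the model I have in mind, assuming a hypothetical Tableau-provided base image. The image name, driver URL, and paths are all placeholders:

```dockerfile
# Hypothetical customer-side Dockerfile extending a Tableau-provided base
# image. The base image name and driver URL are placeholders; no such
# official base image exists today.
FROM tableau/tableau-server:latest

# Install whatever drivers this environment needs. The Tableau Server
# container is RHEL-based, so an rpm via yum is the likely mechanism.
RUN yum install -y https://example.com/drivers/my-database-driver.rpm

# Add any custom scripts or other customizations.
COPY customizations/ /docker/customizations/
```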
Tableau Server Container Image Size
Upgrading Tableau Server in a Container
- Create a special "upgrade" image that contains both your current and desired versions of Tableau Server... yuck. Or,
- Take a backup of your current environment and restore it to an environment running the new images (sketched below). Also yuck.
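The backup-and-restore option boils down to the standard tsm maintenance commands. A sketch, with the backup name and the `<date>` placeholder being illustrative:

```bash
# On the current environment: take a backup (-d appends the date to the name).
tsm maintenance backup -f pre-upgrade -d

# Move the resulting .tsbak into the new environment's backup directory,
# then restore it there. The server must be stopped during a restore.
tsm stop
tsm maintenance restore -f pre-upgrade-<date>.tsbak
tsm start
```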
I can think of a few ways to improve this process using Kubernetes functionality, which is my next project. I'll write an article on the process once I've got something worth sharing.
Tableau Server Kubernetes Manifests
- Combined all Tableau Server node pods into a single StatefulSet – it fits the use case almost perfectly (see the sketch after this list).
- Use volumeClaimTemplates for the data directory volumes instead of making the PVCs manually.
- Rewrote the startup and initialization logic to account for clusters of varying sizes. There's plenty more to do here, but it's a good start.
- Rewrote the readinessProbe check – it's almost right now. Different pods will run different processes, and I want to account for those differences dynamically, whatever they might be.
- Added podAntiAffinity to ensure pods are scheduled on separate Kubernetes nodes. We're trying to tolerate pod failures gracefully in order to use spot instance pricing, so we need to minimize the impact of such failures.
- I wanted to avoid needing a ReadWriteMany PVC, so I rewrote the bootstrap.json file process for multi-node environments. More on this below.
- Added a preStop lifecycle hook to fail over the repository if it's running on a pod that's terminating. More on this below, too.
- I implemented this with kustomize so I can more easily deploy and manage different environments from the same base manifests.
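Here's a heavily trimmed sketch of the resulting StatefulSet. The image name, script paths, labels, and storage size are placeholders, and almost all of the real configuration is omitted:

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: tableau-server
spec:
  serviceName: tableau-server
  replicas: 3
  podManagementPolicy: Parallel       # all pods start together (no OrderedReady)
  selector:
    matchLabels:
      app: tableau-server
  template:
    metadata:
      labels:
        app: tableau-server
    spec:
      affinity:
        podAntiAffinity:              # never co-locate two Tableau pods
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchLabels:
                  app: tableau-server
              topologyKey: kubernetes.io/hostname
      containers:
        - name: tableau-server
          image: my-registry/tableau-server:latest        # placeholder image
          readinessProbe:
            exec:
              command: ["/docker/readiness-check.sh"]     # placeholder script
            periodSeconds: 30
          lifecycle:
            preStop:
              exec:
                command: ["/docker/failover-repository.sh"]  # placeholder script
          volumeMounts:
            - name: data
              mountPath: /var/opt/tableau
  volumeClaimTemplates:               # one data PVC per pod, created automatically
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 100Gi
```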
Tableau Server Cluster Initialization
- The initial node (pod 0 of the StatefulSet) is deployed and goes through an initialization script. If your configuration file's topology includes appzookeeper, that's filtered out at this stage because we always need to start with exactly one instance of it.
- After pod 0 initializes, the script determines how many nodes your config.json file specifies and waits for those nodes to register. All other pods in the StatefulSet will need to register at this time, which is why my manifests don't yet support OrderedReady pod management – all pods start in parallel.
- The other pods in the StatefulSet will be waiting for a "bootstrap.json" file that allows them to register with the initial node pod. Tableau's manifests implied using a ReadWriteMany volume to distribute this file, but it's easier to fetch directly with the tsm command-line utility (sketched after this list). This method requires you to build your image with the TSM_REMOTE_UID and TSM_REMOTE_USERNAME variables set, but it makes the whole thing way easier to deploy, in my opinion. I shared this feedback directly with the dev team at Tableau.
- Once the other pods register, the initial pod continues with the cluster configuration. It configures and deploys all the services specified in your config.json file, except for the coordination service.
- After the configuration is applied, the initialization script will review your config.json file for your desired coordination service config. If you've specified a 3- or 5-node coordination service ensemble, it will deploy properly using the tsm topology deploy-coordination-service command, as is tradition.
- Finally, it starts the services and configures the initial user.
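For reference, the worker-pod side of that bootstrap exchange looks roughly like this. The hostname, paths, and polling interval are illustrative, and authentication is assumed to be handled by the TSM_REMOTE_* variables baked into the image at build time:

```bash
#!/usr/bin/env bash
# Sketch of a worker pod fetching bootstrap.json from the initial node.
set -euo pipefail

INITIAL_POD="tableau-server-0.tableau-server"   # stable StatefulSet DNS name
BOOTSTRAP_FILE="/tmp/bootstrap.json"

# Wait until TSM on pod 0 is up and answering on its admin port.
until curl -ksf "https://${INITIAL_POD}:8850" >/dev/null; do
  echo "Waiting for TSM on ${INITIAL_POD}..."
  sleep 15
done

# Pull the bootstrap file directly from pod 0 with the tsm CLI; no
# ReadWriteMany volume required.
tsm topology nodes get-bootstrap-file \
  -s "https://${INITIAL_POD}:8850" \
  -u "${TSM_REMOTE_USERNAME}" \
  --file "${BOOTSTRAP_FILE}"

# Register this pod with the cluster. On a bare Linux install this is the
# initialize-tsm step; in the container, the entrypoint's init logic
# consumes the bootstrap file in much the same way.
initialize-tsm -b "${BOOTSTRAP_FILE}" --accepteula
```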
Tolerating Pod Failure
- Amazon EC2 Spot Instances will give you a two-minute warning before reclaiming compute capacity.
- Google Cloud Spot VMs will give you a 30-second warning before preemption.
- Azure Spot Virtual Machines will also give you a 30-second warning before eviction.
- It's not strictly accurate, but I'm just going to refer to this event as "node failure," since no two cloud providers seem to use the same terminology.
Since Tableau Server's startup time is longer than any of those timeframes, we can't just start a new pod when receiving that warning. To benefit from spot pricing, we need to make Tableau Server capable of tolerating some amount of failure. We can configure Tableau Server for high availability by ensuring we have three or more pods in a cluster and configuring instances of each process across multiple nodes. Even configured for HA, there's still some impact if the Active Repository process goes down — the application can become unavailable for five minutes before signaling the Passive Repository to take over. We can do better than that since we have a bit of advance notice!
- Implement a preStop lifecycle hook that determines whether the pod is responsible for Tableau Server's active repository and, if so, issues a failover command (sketched after this list). This process takes seconds, not minutes.
- Implement a podAntiAffinity rule to ensure that no two Tableau Server pods get scheduled on the same Kubernetes node simultaneously.
- Use the cloud provider's auto-scaling functionality to request additional spot instances whenever we need more.
- Configure the auto-scaling functionality to select from many different compatible instance types, ensuring we don't run into issues finding a machine when we need one. I do this on AWS... I'm not sure how it works for other cloud providers yet.
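Here's roughly what that preStop hook does. The parsing of `tsm status -v` output is illustrative, and both it and the failover flag are worth verifying against your Tableau Server version:

```bash
#!/usr/bin/env bash
# Sketch of the preStop hook: fail the repository over before this pod dies.
set -euo pipefail

# Find this pod's node block in `tsm status -v` and check whether it hosts
# the active repository. The output parsing here is illustrative.
if tsm status -v | awk -v host="$(hostname)" '
      /^node/ { in_node = ($0 ~ host) }
      in_node && /\(Active Repository\)/ { found = 1 }
      END { exit !found }'; then
  # Promote the passive repository; this takes seconds rather than the
  # ~5 minutes an unplanned failover costs.
  tsm topology failover-repository --preferred
fi
```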
I've tested this, and I've been pleased with how well it works, though I want to run more tests. Even in the worst case, recovery times beat traditional Tableau Server HA failover recovery... but they need to get better still, since node failures are basically guaranteed when using spot instances.
Tableau. Still. Requires. Static. IP. Addresses.
- It's not easy to implement static IPs in Kubernetes, but it can be done.
- We must use a Container Network Interface (CNI) and IP Address Management (IPAM) plugin that supports static IP addresses. I've tested Calico CNI + Calico IPAM on AWS with success.
- It requires a more significant setup before deploying Tableau Server – one does not simply change a CNI and IPAM plugin.
- This model keeps us from using StatefulSets, because we need to annotate each pod with its own static IP address (see the sketch after this list).
- Because the static IPs follow the pods, this model is compatible with spot instances and more frequent pod termination.
- Again, it's not easy to do, but it can be done.
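A sketch of what that annotation looks like with Calico IPAM; the address and image are placeholders. Because every pod needs its own annotation, each pod gets its own manifest rather than sharing a StatefulSet template:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: tableau-server-0
  labels:
    app: tableau-server
  annotations:
    # Calico IPAM honors this annotation and pins the pod to this address.
    cni.projectcalico.org/ipAddrs: '["192.168.100.10"]'   # placeholder IP
spec:
  containers:
    - name: tableau-server
      image: my-registry/tableau-server:latest            # placeholder image
```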
The alternative is to skip static IPs and reconfigure Tableau Server whenever a pod's IP address changes:
- Any time a pod changes IP addresses, we need to reconfigure Tableau Server.
- We don't need a more significant setup before deploying Tableau Server.
- That reconfiguration is done with "tsm pending-changes apply" – even if there are no pending changes. This isn't documented anywhere, but it works (see the sketch after this list).
- It's the "tail wagging the dog," so to speak. One pod changes IP addresses, and services on every other pod need to be reconfigured because of it.
- Since "tsm pending-changes apply" requires a server restart, there will be application downtime.
- Because of that downtime, it's not ideal for using spot instances or frequent pod termination.
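For completeness, the reconfiguration itself is just the one command, using the documented non-interactive flags:

```bash
# Force Tableau Server to pick up the new pod IP addresses. This works even
# when the pending-changes list is empty, though that behavior is undocumented.
tsm pending-changes apply --ignore-prompt --ignore-warnings
```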
