The Fifth Elephant 2025 Annual Conference CfP
Speak at The Fifth Elephant 2025 Annual Conference
Shaikh Md Ashif Iqbal
@sashifiqbal_atlassian
Submitted May 20, 2025
In the dynamic landscape of distributed systems and streaming ETL (Extract, Transform, Load) pipelines, maintaining backwards compatibility poses a significant challenge, especially as complexity scales. Services that interact through multiple versions of SDKs must ensure seamless communication and data flow, even as new features are developed. Traditionally, manual testing for compatibility across versions has been both time-consuming and prone to errors, particularly as development velocity increases.
This challenge became evident in our work on a platform developed at Atlassian called Lithium, where we faced the complexity of ensuring compatibility across numerous SDK versions in a dynamic and ephemeral environment. To address this, we developed a robust framework for automating the verification of compatibility across SDK and service versions. This framework uses containerized test environments, custom environment generation with Docker Compose, and targeted version testing to streamline the process. Integrated into the developer workflow via CI/CD pipelines, it conducts comprehensive compatibility tests for every code change, preventing regressions and enhancing reliability. Developers can replicate test environments locally for in-depth debugging, facilitating rapid iteration and resolution. By automating compatibility testing, we have significantly reduced manual effort, increased developer confidence, and ensured a smooth experience during major platform updates.
Ashif is a Senior Software Engineer with nearly 7 years of experience, including significant contributions to designing backwards compatibility testing solutions and service discovery mechanisms to enhance system stability and scalability.
LinkedIn: Ashif Iqbal
Slides Incoming
This is the first talk about backwards compatibility for the Lithium platform, but below are some links about Lithium:
Hello everyone, I’m Ashif from the Data Portability Organization at Atlassian. Our primary mission is to develop platforms and solutions that simplify data portability within Atlassian. Data portability is crucial for us, as it underpins numerous experiences and enterprise features, in addition to facilitating the migration of customer data from on-premises servers to Atlassian Cloud. Some of the features built on this foundation include BRIE (Backup & Restore) and Sandbox.
We have created a platform named Lithium to facilitate data movement, which we refer to as ETL++. The ++ signifies that while it is indeed ETL, it incorporates several unique features:
I won’t delve deeply into how Lithium operates, as Robert’s talk covers that more effectively. However, I will highlight the essential components that illustrate the challenges we began to encounter:
lithium-control-plane: This component features two topics, controlplane-events and dataplane-events, which facilitate sending and receiving events from the data plane components via their SDKs.

lithium-sdk: Located in the data plane, this SDK manages communication with the control plane and provides constructs for data movement, enabling feature teams to focus solely on their features without needing to understand Kafka internals. Feature teams implement this SDK to integrate with the Lithium pipeline.

In production, we have hundreds of hosts implementing our SDK, each contributing to different parts of the ETL pipeline. Consequently, the production environment resembles a complex landscape, where pipelines may involve SDKs from various publicly available versions. Additionally, communication with the control plane occurs across all of these SDK versions.
The challenges we faced stemmed from the rapid development of this new platform. Early on, individual developers often worked independently on features, with minimal awareness of each other’s work. This led to regression issues, primarily due to insufficient backward compatibility testing.
We identified two key areas where maintaining backward compatibility was essential:
At first glance, contract testing seemed sufficient for our use case, and while that is partially correct, we also needed to validate results and conduct feature testing. Essentially, we sought a solution that allowed us to write tests in a straightforward manner, assert as we typically would, and let the framework handle the heavy lifting. Finding no existing solution, we decided to build one ourselves. Moreover, we aimed to incorporate additional features, such as targeted tests for specific SDK versions. This was crucial for scenarios where we might deprecate a feature or introduce something new, necessitating tests only for newer versions.
It would be remiss of me not to show what we had previously before discussing our new solution. Initially, we relied on a lengthy and tedious manual process to test backward compatibility against the latest publicly released version. Our main codebase resides in a monorepo that houses the control plane service, a proxy service designed to accept HTTP traffic into our otherwise event-driven platform. This repository also contains the dataplane-sdk, along with three other services implementing the dataplane-sdk, facilitating simultaneous work on the SDK and corresponding ETL component changes. This setup provides quick feedback during the development loop. Additionally, we maintained another repository that housed a test service implementing the latest publicly available version of the dataplane-sdk, which was used to test changes made in the control plane service of the main repository.
The previous approach presented several challenges:
Let me show you one of the tests written using the new framework to illustrate how it appears to developers, followed by a deeper dive into the details.
@Tag("Parallel")
class TerminateWorkplanTest : ParallelBaseE2E() {
    @Test
    fun `terminating workplan from STOPPED state should work`() {
        val workplanCreationRequest =
            getSampleWorkplanCreationRequest("sample-workplan-creation-request-with-sink-disabled")
        WorkplanUtility.createWorkplanAndTestForSuccess(
            workplanCreationRequest,
            getHttpRequestBuilder().addNonAdminOwnerHeaders(),
            httpClient,
        )
        WorkplanUtility.testWorkplanStatus(
            workplanCreationRequest.getWorkplanId(),
            WorkplanStatus.STOPPED,
            getHttpRequestBuilder().addNonAdminOwnerHeaders(),
            httpClient,
        )
        WorkplanUtility.terminateWorkplanAndTestForSuccess(
            workplanCreationRequest.getWorkplanId(),
            getHttpRequestBuilder().addNonAdminOwnerHeaders(),
            httpClient,
        )
        WorkplanUtility.testWorkplanNotFound(
            workplanCreationRequest.getWorkplanId(),
            getHttpRequestBuilder().addNonAdminOwnerHeaders(),
            httpClient,
        )
    }
}
This may appear to be a standard JUnit test, and you would be correct—it is. We have built specific capabilities tailored to our use case, but overall, the tests remain familiar to developers. These capabilities are provided by the ParallelBaseE2E base class, along with some custom-developed annotations and tags.
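To give a feel for what the base class contributes, here is a hypothetical sketch of a ParallelBaseE2E-style class built on the JDK's HttpClient. The real ParallelBaseE2E is internal to Lithium; the port, the E2E_BASE_URL variable, and the header name behind addNonAdminOwnerHeaders() are illustrative assumptions, not the actual implementation.

```kotlin
import java.net.URI
import java.net.http.HttpClient
import java.net.http.HttpRequest
import java.time.Duration

// Hypothetical sketch only: supplies the shared httpClient and request
// builder that the sample test above passes into WorkplanUtility calls.
open class ParallelBaseE2ESketch {
    // One shared client per test class, reused across all assertions.
    val httpClient: HttpClient = HttpClient.newBuilder()
        .connectTimeout(Duration.ofSeconds(10))
        .build()

    // Base URL of the environment spun up by Docker Compose.
    // The env var name and default port are assumptions for this sketch.
    private val baseUrl: String = System.getenv("E2E_BASE_URL") ?: "http://localhost:7075"

    // Mirrors getHttpRequestBuilder() from the sample test: a builder
    // already pointed at the environment under test.
    fun getHttpRequestBuilder(): HttpRequest.Builder =
        HttpRequest.newBuilder().uri(URI.create(baseUrl))
}

// Mirrors addNonAdminOwnerHeaders() from the sample test; the header
// name and value are invented for illustration.
fun HttpRequest.Builder.addNonAdminOwnerHeaders(): HttpRequest.Builder =
    header("X-Owner-Role", "non-admin")
```

With such a base class in place, tests only assemble requests and assert on outcomes; environment wiring stays out of the test body.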
Here is another test utilizing the MinimumLibraryTargetVersion annotation to map this test to specific versions, as this feature was not available in older versions:
@Tag("Parallel")
class CustomPartionerTest : ParallelBaseE2E() {
    @Nested
    @MinimumLibraryTargetVersion("3.4.0")
    inner class WithoutCustomPartitioner {
        @Test
        fun `With Max Sink Processors = 3 And Non Transactional Sink`() {
            ...
        }
    }
}
```
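The talk does not show how the annotation is enforced, but conceptually it reduces to comparing the SDK version of the environment under test against the annotation's minimum and skipping the test when it falls short (in JUnit 5 this would typically be done with an ExecutionCondition extension). A minimal sketch of that comparison, where the annotation declaration and the shouldRun helper are assumptions, not Lithium's actual code:

```kotlin
// Hypothetical sketch of the version-gating behind MinimumLibraryTargetVersion.
annotation class MinimumLibraryTargetVersion(val version: String)

// Compares dotted numeric versions component by component,
// so "3.10.0" correctly ranks above "3.4.0".
fun isAtLeast(current: String, minimum: String): Boolean {
    val a = current.split(".").map { it.toInt() }
    val b = minimum.split(".").map { it.toInt() }
    for (i in 0 until maxOf(a.size, b.size)) {
        val x = a.getOrElse(i) { 0 }
        val y = b.getOrElse(i) { 0 }
        if (x != y) return x > y
    }
    return true
}

// A test runs when it is unannotated, or when the version under test
// (e.g. injected via an environment variable) meets the minimum.
fun shouldRun(annotation: MinimumLibraryTargetVersion?, currentVersion: String): Boolean =
    annotation == null || isAtLeast(currentVersion, annotation.version)
```

A condition like this lets one test suite run unchanged against every published SDK version, with newer-feature tests silently skipped on older versions.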
Now, let’s explore the architecture of the framework, which consists of three components:
First, let’s discuss the test-writing framework, a custom base E2E class that provides the following capabilities:
These tests are executed in an environment established using Docker Compose files. The structure of these files is as follows:
lithium-e2e-tests/
│
├── docker-compose-test-environments/
│   ├── kafka.yml
│   ├── database.yml
│   ├── udpp-control-plane-svc/
│   │   ├── latest-master.yml
│   │   └── specified-build.yml
│   ├── udpp-extract-service/
│   │   ├── latest-master.yml
│   │   └── specified-build.yml
│   ├── .
│   ├── .
│   └── .
└── scripts/
    ├── library-version/
    │   ├── branch.sh
    │   └── main.sh
    ├── single-components/
    │   ├── current-controlplane.sh
    │   └── current-loader.sh
    ├── .
    ├── .
    ├── .
    └── all-current.sh
Now, let’s take a look at what the Docker Compose file looks like for one of the services:
version: "3.8"
services:
  udpp-extract-service:
    image: ${IMAGE_PREFIX:-docker.atl-paas.net/sox/atlassian}/udpp-extract-service:latest
    ports:
      - "7075:8300"
    healthcheck:
      test: curl --fail -H X-Slauth-Mechanism:slauth -H X-Slauth-Subject:udpp-extract-service -H X-Slauth-Authorization:true -H X-Asap-Issue:udpp-extract-service http://localhost:8300/healthcheck || exit 1
      start_period: 50s
      retries: 10
      timeout: 180s
      interval: 10s
    environment:
      SPRING_PROFILES_ACTIVE: local,e2e-test
      LIBRETTO_CODE_SERVER_LOCAL_URL: http://libretto-code-server:8080
      MEMORY_OPTS: -Xmx512M
      MICROS_AWS_REGION: micros-aws-e2e-region
      MICROS_SERVICE: udpp-extract-service
      MICROS_INSTANCE_ID: 9991
      MICROS_ENV: e2e-test
      MICROS_ENVTYPE: e2e-test
      SERVER_SSL_ENABLED: false
      USER: ${USER}
      JMX_OPTS: "-Dcom.sun.management.jmxremote.rmi.port=9015 -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.port=9015 -Dcom.sun.management.jmxremote.authenticate=false"
      LITHIUM_FF_FILE_PATH: /opt/service/feature-flag.json
      MAX_PROCESSORS: 200
    depends_on:
      kafka-broker-1:
        condition: service_healthy
These Docker Compose YAML files provide a highly granular way to spin up the desired environment with a single command, as shown below:
#!/bin/bash
docker-compose \
  -f docker-compose-test-environments/sox-database.yml \
  -f docker-compose-test-environments/sox-single-kafka-broker.yml \
  -f docker-compose-test-environments/udpp-cp-rest-proxy/sox-specified-build-single-instance.yml \
  -f docker-compose-test-environments/udpp-control-plane-svc/sox-specified-build-single-instance.yml \
  -f docker-compose-test-environments/lithium-dataplane-test-svc/custom-version-triple-instance.yml \
  up --pull=always -d --wait
But how do these compose files and scripts work? The magic lies in their generation, which I will discuss next.
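To make the generation idea concrete, here is a hypothetical sketch of the kind of helper that could assemble a docker-compose invocation from a list of compose fragments. The actual generator and the generated shell scripts are not shown in the talk; the file-path prefix and flags below simply follow the sample script above.

```kotlin
// Hypothetical sketch: builds a docker-compose command line from compose
// fragment file names, mirroring the generated shell scripts shown earlier.
fun composeCommand(fragments: List<String>): List<String> {
    val cmd = mutableListOf("docker-compose")
    for (fragment in fragments) {
        // Each fragment contributes one "-f <file>" pair.
        cmd += "-f"
        cmd += "docker-compose-test-environments/$fragment"
    }
    // Same flags as the sample script: pull fresh images, run detached,
    // and wait for container healthchecks before returning.
    cmd += listOf("up", "--pull=always", "-d", "--wait")
    return cmd
}
```

Because each compose file is an independent fragment, a script for any version combination is just a different list of fragments fed to the same helper.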
We create and publish these Docker images from several sources. Let’s review them one by one:
Here’s a glimpse of a testing pipeline in action, showcasing the number of tests being run across various combinations.
[Screenshot of the testing pipeline to be inserted]
What did we achieve after implementing these changes?
I believe one of the most significant maturity metrics for a platform is its ability to accept contributions from others. Previously, we struggled with this due to the extensive manual testing effort required, creating a steep learning curve for those outside the development team. However, with our new processes, we have successfully accepted numerous contributions from colleagues in other teams, particularly our clients.
It’s incredibly reassuring to know that if the pipeline is GREEN, there is a high degree of confidence that the PR merged will not introduce regressions. I emphasize “high degree of confidence” because a testing framework is merely a tool; its effectiveness depends on how well it is utilized. If tests are written, the framework ensures that the PR aligns with those tests.
Q&A