Preface
We received a request from one of our clients to optimize the report-generator feature in their cloud platform.
The objectives of this optimization exercise were to:
- Make report generation faster.
- Parallelize report generation wherever possible.
- Make report generation more robust.
- Reduce the costs associated with report generation.
Existing Implementation
- The existing implementation of report generation was monolithic and synchronous.
- It was developed in PHP, with Postgres as the data store.
- In production, whenever required, it was scaled vertically rather than horizontally.
Proposed Changes
- Restructure the report definition to make it more granular and to enable parallel processing of each widget; a sketch of such a definition follows this item. Widgets in the report were divided into the following categories:
  - Table widgets
  - Chart widgets
  - Text widgets
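As an illustration, a granular report definition stored as a single JSON document might look like the sketch below. All field names (reportId, widgets, type, query, and so on) are hypothetical assumptions, not the client's actual schema; the point is that each widget is self-contained.

```json
{
  "reportId": "monthly-usage",
  "title": "Monthly Usage Report",
  "widgets": [
    { "id": "w1", "type": "table", "query": "SELECT account, SUM(usage) FROM usage GROUP BY account" },
    { "id": "w2", "type": "chart", "chartType": "line", "metric": "daily_active_users" },
    { "id": "w3", "type": "text",  "template": "Usage summary for {{month}}" }
  ]
}
```

Since no widget depends on another widget's output, a worker can be assigned per widget and the results assembled at the end.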
- Make the report-generation workflow asynchronous. The single report-generation endpoint was decomposed into the endpoints below (sketched after this item):
  - Report-generation job submission.
  - Report-status polling.
  - Report download.
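A minimal sketch of the decomposed API, assuming a Spring Web REST style; the paths, DTOs, and collaborator interfaces (ReportQueue, StatusStore, ArtifactStore) are all illustrative assumptions rather than the client's actual code.

```java
import java.net.URI;
import org.springframework.http.HttpStatus;
import org.springframework.http.ResponseEntity;
import org.springframework.web.bind.annotation.*;

// Hypothetical collaborators; names and shapes are assumptions for the sketch.
interface ReportQueue { String enqueue(ReportRequest request); }   // e.g. publishes to Kafka
interface StatusStore { JobStatus lookup(String jobId); }          // e.g. reads from DynamoDB
interface ArtifactStore { URI presignedUrl(String jobId); }        // e.g. pre-signed S3 URL

record ReportRequest(String reportId) {}
record JobReceipt(String jobId) {}
record JobStatus(String jobId, String state) {}

@RestController
@RequestMapping("/reports")
public class ReportController {
    private final ReportQueue queue;
    private final StatusStore statusStore;
    private final ArtifactStore artifacts;

    public ReportController(ReportQueue queue, StatusStore statusStore, ArtifactStore artifacts) {
        this.queue = queue;
        this.statusStore = statusStore;
        this.artifacts = artifacts;
    }

    // 1. Job submission: enqueue the request and return a job id immediately (202 Accepted).
    @PostMapping
    public ResponseEntity<JobReceipt> submit(@RequestBody ReportRequest request) {
        return ResponseEntity.accepted().body(new JobReceipt(queue.enqueue(request)));
    }

    // 2. Status polling: the client polls until the state is DONE or FAILED.
    @GetMapping("/{jobId}/status")
    public JobStatus status(@PathVariable String jobId) {
        return statusStore.lookup(jobId);
    }

    // 3. Download: redirect (302) to a pre-signed URL for the finished report.
    @GetMapping("/{jobId}/download")
    public ResponseEntity<Void> download(@PathVariable String jobId) {
        return ResponseEntity.status(HttpStatus.FOUND)
                .location(artifacts.presignedUrl(jobId))
                .build();
    }
}
```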
- Use NoSQL data stores for report definitions and report instances; a status-store sketch follows this item.
  - MongoDB to store the report-definition JSON. MongoDB's document model is a natural fit for JSON documents, and at the time it offered better JSON-querying capabilities than the latest Postgres version.
  - DynamoDB to manage and store report status.
  - S3 to store the generated reports.
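For the status store, each report job can be tracked as a small DynamoDB item keyed by job id. A minimal sketch using the AWS SDK for Java v2 follows; the table name and attribute names are assumptions, not the client's actual schema.

```java
import java.time.Instant;
import java.util.Map;
import software.amazon.awssdk.services.dynamodb.DynamoDbClient;
import software.amazon.awssdk.services.dynamodb.model.AttributeValue;
import software.amazon.awssdk.services.dynamodb.model.PutItemRequest;

public class ReportStatusStore {
    private static final String TABLE = "report_jobs"; // hypothetical table name

    private final DynamoDbClient dynamo = DynamoDbClient.create();

    // Record a status transition, e.g. PENDING -> RUNNING -> DONE, for a report job.
    public void updateStatus(String jobId, String state, String s3Key) {
        dynamo.putItem(PutItemRequest.builder()
                .tableName(TABLE)
                .item(Map.of(
                        "job_id",     AttributeValue.fromS(jobId),
                        "state",      AttributeValue.fromS(state),
                        "s3_key",     AttributeValue.fromS(s3Key),   // where the finished report lives in S3
                        "updated_at", AttributeValue.fromS(Instant.now().toString())))
                .build());
    }
}
```

The status-polling endpoint reads this item back, and the download endpoint uses the stored S3 key to build a pre-signed URL.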
- Decompose the monolithic report generator into a microservice model.
  - Multiple microservices built on Kafka Streams; a topology sketch follows this list.
  - Each service contains a partial topology of processors that processes a specific part of the report.
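A minimal Kafka Streams sketch of one such service, here one that handles only table widgets; the topic names, serdes, and the pass-through rendering step are illustrative assumptions.

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

public class TableWidgetService {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "table-widget-service");  // hypothetical app id
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();

        // An upstream splitter fans widgets out to per-type topics; this service consumes
        // only table widgets, renders each one, and emits the result for reassembly.
        KStream<String, String> tableWidgets = builder.stream("report-widgets-table");
        tableWidgets
                .mapValues(TableWidgetService::renderTable)  // run the widget's query, render rows
                .to("report-widgets-rendered");              // an aggregator service assembles the report

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }

    // Placeholder rendering step; the real service would execute the widget's query against the store.
    private static String renderTable(String widgetJson) {
        return widgetJson; // assumption: pass-through for the sketch
    }
}
```

Because each per-widget-type service owns a partial topology, a slow or failing widget type does not block the others, which is what makes the pipeline resilient to partial failures.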
Architecture
Performance Improvements
- The new architecture was rolled out in a phased manner behind a feature flag: early access (EA) for some customers, then, after a fortnight of monitoring, general availability (GA) for all users.
- Performance benchmarks showed that the total throughput of the report generator increased roughly tenfold on average.
  - The existing synchronous, monolithic implementation processed around 40-50 reports per hour.
  - The new implementation raised throughput to more than 500 reports per hour.
- The new implementation was also more robust and resilient to partial failures.