Project References

Below you will find a list of some of the projects I have carried out as a software development consultant.

Near real-time Data Ingestion and Data Replication Pipeline

Near real-time data replication system from MySQL to Redshift on AWS, built with Kafka and Spark Streaming in Scala and Python. The system also used Prometheus, Kibana, and Sqoop. Worked closely with other data engineers, product owners and data scientists. Replicated very large tables in near real time.

Tools/Languages used: Kafka, Spark, Scala, Python, Prometheus, Kibana, Sqoop, MySQL, Redshift, S3, Ansible, Jenkins, EMR, Data Migration Service, AWS

End-to-end Data Processing Architecture and Recommendation System

End-to-end data processing architecture and implementation: from log structuring, through storage in Redshift and S3 and machine learning with Python, to serving recommendations and analytics via an API from Elasticsearch. Worked closely with the CTO, senior product managers and DevOps engineers to ship to production. A near real-time recommendation system was deployed and proven effective in an independent A/B test. Helped interview, hire and onboard data engineers. Represented the Big Data area in front of investors.

Tools/Languages used: Python, Node.js, PySpark, CloudFormation, Redshift, MySQL, S3, ELB, ECS, Jenkins, Docker, AWS, Elasticsearch

API Gateway and Marketplace From Scratch

Member of a team that shipped an API gateway and API marketplace as a digital product. The platform is currently internet-facing and has withstood penetration testing by a third party. Developed security features: last-mile security based on TLS and an authorization DSL with capabilities similar to AWS IAM. Implemented API call logging and API call aggregation to facilitate analytics and billing features. Implemented Terraform deployment scripts.

Tools/Languages used: Go, TypeScript, Docker, ECS, Athena, Kinesis Firehose, Terraform, AWS

Kubernetes Operator

Member of a team of three engineers that delivered a Kubernetes operator to deploy the company’s flagship product. The product consists of multiple services, including stateful ones, and uses Istio. The architecture consists of a frontend, a backend and the Kubernetes operator, which is implemented as a Kubernetes service. Used the Kubernetes API and debugged the source code of Metacontroller (an open-source Kubernetes project).

Go Backend with Swagger on AWS Fargate

Implemented a backend in Go via go-swagger. Worked closely with frontend and computer vision engineers, product owner and designer to define and evolve the API spec via Swagger. Implemented the complete backend including database migration logic and tests. Implemented Terraform scripts to deploy the backend securely on an isolated subnet within an AWS VPC using VPC endpoints, NLB and API Gateway with an API key.

Open Source Repository for Stock Market Analysis (Timeseries)

Co-developed an open-source repository for the analysis of stock market data to showcase a customer's data analytics capabilities. The repository is now featured on the AWS Open Data Registry.

Tools/Languages used: Pandas, Jupyter, Keras

Custom Zonal OCR Workflow with Cloud APIs and Open Source Image Processing

Evaluated cloud, third-party and desktop OCR services. Implemented a zonal OCR prototype in a team of two using open-source image processing libraries and cloud services.

Design Sprint for Products Based on NLP and Image Search

Participated as a machine learning expert in a design thinking workshop for building NLP-based and image search products. Facilitated discussion around product and engineering requirements and the application of state-of-the-art approaches. Wrote a 30-page report on related work, which also included the output of models on data similar to the customer’s data.

Tools/Languages used: Python, Google Colab, OpenCV

Modern Data Pipeline Tool

External member of an in-house team building a modern data pipeline tool. The tool supports data pipelines on AWS Redshift and EMR. Informed software and cloud architecture in technical discussions with developers, the team lead and the product owner. Implemented features, unit and integration tests, and end-user demos. Created and hosted Python and Anaconda packages on Artifactory. Set up test jobs on Jenkins. Supported other engineers with code reviews and debugging help.

Tools/Languages used: Python 3.7, Spark, Jenkins, Docker, MySQL, Postgres, Artifactory

Microservice Architecture with Go and Postgres

Optimized slow database queries in Postgres. Optimized data storage via Change Data Capture and Slowly Changing Dimensions. Implemented a SWIFT parser in Go via code generation. Implemented comprehensive integration tests. Delivered extensive project documentation for project handover.

Tools/Languages used: Go, Postgres, Docker, Jupyter
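
The Slowly Changing Dimension idea mentioned above can be illustrated with a minimal sketch. It is written in Python for brevity rather than in the project's Go and Postgres stack, and it assumes the Type 2 variant (versioned rows with validity dates). The names DimRow and apply_change and the sample account data are hypothetical and not taken from the project.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class DimRow:
    """One version of a dimension record (hypothetical schema)."""
    key: str
    value: str
    valid_from: date
    valid_to: Optional[date] = None  # None marks the current version


def apply_change(history: List[DimRow], key: str, value: str, as_of: date) -> List[DimRow]:
    """Apply one change event while keeping history (SCD Type 2)."""
    current = next(
        (row for row in history if row.key == key and row.valid_to is None),
        None,
    )
    if current is not None:
        if current.value == value:
            return history  # nothing changed, so no new row is stored
        current.valid_to = as_of  # close the previous version
    history.append(DimRow(key, value, as_of))
    return history


# Illustrative change events for a single key.
history: List[DimRow] = []
apply_change(history, "acct-1", "EUR", date(2020, 1, 1))
apply_change(history, "acct-1", "EUR", date(2020, 2, 1))  # unchanged value, ignored
apply_change(history, "acct-1", "USD", date(2020, 3, 1))
```

Only actual changes produce new rows, which is where the storage saving comes from: the unchanged event in the middle is dropped, while the full history of the key remains queryable.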

Spark Training Materials and Hands-on Training of Data Scientists

Prepared training materials on Spark architecture and sketching algorithms, commissioned by a top-tier IT company. Taught data scientists efficient algorithms for Big Data with Spark using a hands-on, problem-solving approach.

Tools/Languages used: Scala, Spark, Python, PySpark
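
As an illustration of what a sketching algorithm is, here is a minimal Count-Min sketch in plain Python. The description above does not name the specific algorithms covered in the training, so the choice of Count-Min is an assumption, as are the class name, the parameters and the use of SHA-256 as the hash function.

```python
import hashlib


class CountMinSketch:
    """Approximate frequency counter using fixed memory (illustrative only)."""

    def __init__(self, width: int = 256, depth: int = 4) -> None:
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, item: str, row: int) -> int:
        # Derive one hash function per row by prefixing the row number.
        digest = hashlib.sha256(f"{row}:{item}".encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "little") % self.width

    def add(self, item: str, count: int = 1) -> None:
        for row in range(self.depth):
            self.table[row][self._index(item, row)] += count

    def estimate(self, item: str) -> int:
        # Collisions can only inflate a counter, so the minimum across
        # rows never underestimates the true count.
        return min(self.table[row][self._index(item, row)] for row in range(self.depth))


sketch = CountMinSketch()
for word in ["spark", "spark", "spark", "kafka"]:
    sketch.add(word)
```

The memory footprint is width times depth counters regardless of how many distinct items are seen, which is what makes this family of algorithms attractive for large data sets.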

CI Pipeline to Build VM Images on Google Cloud

Developed a pipeline to build VM images for data scientists with Packer. The pipeline can be triggered from a third-party CI/CD system. Deployed Airflow and implemented workflows that launch jobs on VMs over SSH using Paramiko. Implemented an App Engine web service, protected by Google Identity-Aware Proxy, which allows managers to browse reports generated by experiments. Implemented a Deployment Manager script in Python.

Tools/Languages used: Python, Google Cloud, Compute Engine, Airflow


Data Pipeline for Large Financial Timeseries Data

Participated in design thinking workshops as a data engineer to inform future data products. In a team of three, delivered a clickable proof-of-concept prototype for financial data in just three days. The system ingested and structured gigabytes of financial data in Redshift. The small aggregated data sets were moved to MySQL, from where a Go backend served them to a React app. Developed a simple idempotent pipeline in Bash.

Tools/Languages used: Python, Redshift, MySQL, Pandas, EC2, S3, AWS