Project - Resume Processing Pipeline
Building a Google Cloud Dataflow pipeline that can process a virtually unlimited volume of resumes — from streaming ingestion through PubSub, to batched CSV output, to graph import into Neo4j.
- Client
- Curriculo
- Year
- Service
- Data Engineering, Cloud Infrastructure
Overview
Curriculo's direct partners and prospects are among Brazil's largest companies, so the infrastructure needs to handle high volumes from the start. I built a streaming pipeline on Google Cloud Dataflow that decouples ingestion from processing and scales horizontally — designed to process a virtually unlimited number of resumes without architectural changes.
Key Contributions
Streaming Ingestion
Built a Go-based Dataflow consumer that subscribes to a PubSub topic and streams incoming resume events into the pipeline. The consumer writes each resume to CSV files bucketed by time window, producing batches that can be processed independently. Cloud Scheduler jobs trigger batch execution on a defined cadence, keeping throughput predictable and the pipeline resilient to spikes.
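The bucketing step can be sketched as follows. This is a simplified illustration in Python rather than the production Go consumer: the window size, field names, and file-naming scheme are assumptions for the example, not the real schema.

```python
import csv
import io
from datetime import datetime

# Assumed window size for the sketch; in production this would be
# configuration, not a constant.
WINDOW_MINUTES = 5

def window_key(ts: datetime, minutes: int = WINDOW_MINUTES) -> str:
    """Floor a timestamp to its window start; the result names one CSV batch."""
    floored = ts.replace(minute=(ts.minute // minutes) * minutes,
                         second=0, microsecond=0)
    return floored.strftime("%Y%m%dT%H%M")

def bucket_events(events):
    """Group resume events into per-window CSV payloads, one file per window."""
    buckets = {}
    for ev in events:
        buckets.setdefault(window_key(ev["received_at"]), []).append(ev)

    payloads = {}
    for key, evs in buckets.items():
        buf = io.StringIO()
        writer = csv.DictWriter(
            buf, fieldnames=["resume_id", "candidate", "received_at"])
        writer.writeheader()
        for ev in evs:
            # Serialize the timestamp so the CSV batch is self-contained.
            writer.writerow({**ev, "received_at": ev["received_at"].isoformat()})
        payloads[f"resumes-{key}.csv"] = buf.getvalue()
    return payloads
```

Because each file covers a closed time window, a downstream batch job can pick up any file independently, which is what lets Cloud Scheduler drive processing on a fixed cadence regardless of ingestion spikes.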
The architecture makes scaling a configuration concern rather than a code change: Dataflow handles the parallelism, so growing resume volume is absorbed without touching the pipeline code.
Custom Neo4j Importer
The classified results need to be loaded into Neo4j for graph-based matching. Google provides a built-in Dataflow template for writing to Neo4j, but it requires Neo4j Enterprise Edition. Since Curriculo runs on Community Edition, I built a custom Python importer that reads the same JSON template configuration the official Neo4j template uses — node mappings, relationship definitions, property assignments — and interprets them to load data into Neo4j directly.
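The core of the importer is turning each node mapping from the template configuration into a parameterized Cypher statement. The sketch below shows that idea with a deliberately simplified mapping shape (`label`, `key_property`, `properties`); the official template's JSON is richer, so treat the field names here as assumptions.

```python
def node_merge_cypher(mapping: dict) -> str:
    """Build a parameterized MERGE statement from one simplified node mapping.

    MERGE on the key property keeps the import idempotent: re-running a
    batch updates existing nodes instead of duplicating them.
    """
    label = mapping["label"]
    key = mapping["key_property"]
    props = mapping.get("properties", [])

    set_clause = ""
    if props:
        assignments = ", ".join(f"n.{p} = row.{p}" for p in props)
        set_clause = f" SET {assignments}"

    # UNWIND lets one statement load a whole batch of rows at once.
    return (f"UNWIND $rows AS row "
            f"MERGE (n:{label} {{{key}: row.{key}}})"
            f"{set_clause}")
```

The generated statement would then be executed against Community Edition with the Neo4j Python driver, e.g. `session.run(cypher, rows=batch)`, with an analogous builder for relationship mappings.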
This gives us two things: a working pipeline on Community Edition today, and a clean upgrade path. If Curriculo moves to Neo4j Enterprise, the existing mapping configuration can be used with the official Dataflow template with minimal changes.
- Google Cloud Dataflow
- Go
- Python
- PubSub
- Neo4j
- Cloud Scheduler
- Resume Throughput
- Unlimited
- Stream Consumer
- Go
- Graph Import
- Neo4j
- Pipeline
- Dataflow