Project - Resume Processing Pipeline
Building a Google Cloud Dataflow pipeline that can process a virtually unlimited volume of resumes — from streaming ingestion through PubSub, to batched CSV output, to graph import into Neo4j.
- Client
- Curriculo
- Year
- Service
- Data Engineering, Cloud Infrastructure
Overview
Curriculo's direct partners and prospects are among Brazil's largest companies, so the infrastructure needs to handle high volumes from the start. I built a streaming pipeline on Google Cloud Dataflow that decouples ingestion from processing and scales horizontally — designed to process a virtually unlimited number of resumes without architectural changes.
Key Contributions
Streaming Ingestion
Built a Go-based Dataflow consumer that subscribes to a PubSub topic and streams incoming resume events into the pipeline. The consumer writes each resume to CSV files bucketed by time window, producing batches that can be processed independently. Cloud Scheduler jobs trigger batch execution on a defined cadence, keeping throughput predictable and the pipeline resilient to spikes.
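The bucketing step can be sketched as follows. This is a simplified illustration in Python rather than the production Go consumer: the window size, field names, and file-naming scheme are assumptions for the example, not the real schema.

```python
import csv
import io
from datetime import datetime

# Assumed window size for the sketch; in production this would be
# configuration, not a constant.
WINDOW_MINUTES = 5

def window_key(ts: datetime, minutes: int = WINDOW_MINUTES) -> str:
    """Floor a timestamp to its window start; the result names one CSV batch."""
    floored = ts.replace(minute=(ts.minute // minutes) * minutes,
                         second=0, microsecond=0)
    return floored.strftime("%Y%m%dT%H%M")

def bucket_events(events):
    """Group resume events into per-window CSV payloads, one file per window."""
    buckets = {}
    for ev in events:
        buckets.setdefault(window_key(ev["received_at"]), []).append(ev)

    payloads = {}
    for key, evs in buckets.items():
        buf = io.StringIO()
        writer = csv.DictWriter(
            buf, fieldnames=["resume_id", "candidate", "received_at"])
        writer.writeheader()
        for ev in evs:
            # Serialize the timestamp so the CSV batch is self-contained.
            writer.writerow({**ev, "received_at": ev["received_at"].isoformat()})
        payloads[f"resumes-{key}.csv"] = buf.getvalue()
    return payloads
```

Because each file covers a closed time window, a downstream batch job can pick up any file independently, which is what lets Cloud Scheduler drive processing on a fixed cadence regardless of ingestion spikes.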
The architecture makes scaling a configuration concern rather than a code change: Dataflow handles the parallelism, so growing resume volume is absorbed without touching the pipeline code.
Custom Neo4j Importer
The classified results need to be loaded into Neo4j for graph-based matching. Google provides a built-in Dataflow template for writing to Neo4j, but it requires Neo4j Enterprise Edition. Since Curriculo runs on Community Edition, I built a custom Python importer that reads the same JSON template configuration the official Neo4j template uses — node mappings, relationship definitions, property assignments — and interprets them to load data into Neo4j directly.
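The core of the importer is turning each node mapping from the template configuration into a parameterized Cypher statement. The sketch below shows that idea with a deliberately simplified mapping shape (`label`, `key_property`, `properties`); the official template's JSON is richer, so treat the field names here as assumptions.

```python
def node_merge_cypher(mapping: dict) -> str:
    """Build a parameterized MERGE statement from one simplified node mapping.

    MERGE on the key property keeps the import idempotent: re-running a
    batch updates existing nodes instead of duplicating them.
    """
    label = mapping["label"]
    key = mapping["key_property"]
    props = mapping.get("properties", [])

    set_clause = ""
    if props:
        assignments = ", ".join(f"n.{p} = row.{p}" for p in props)
        set_clause = f" SET {assignments}"

    # UNWIND lets one statement load a whole batch of rows at once.
    return (f"UNWIND $rows AS row "
            f"MERGE (n:{label} {{{key}: row.{key}}})"
            f"{set_clause}")
```

The generated statement would then be executed against Community Edition with the Neo4j Python driver, e.g. `session.run(cypher, rows=batch)`, with an analogous builder for relationship mappings.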
This gives us two things: a working pipeline on Community Edition today, and a clean upgrade path. If Curriculo moves to Neo4j Enterprise, the existing mapping configuration can be used with the official Dataflow template with minimal changes.
- Google Cloud Dataflow
- Go
- Python
- PubSub
- Neo4j
- Cloud Scheduler
- Resume Throughput
- Unlimited
- Stream Consumer
- Go
- Graph Import
- Neo4j
- Pipeline
- Dataflow