DESCAPY
Description & Motivations
The goal of this project was to build a production-ready, enterprise-grade event streaming pipeline on a single VPS using almost entirely Open Source (OSS) tools.
I chose to work with Wi-Fi “wardriving” data because it felt tangible and “real-world.” I could have used Python Faker to generate fake profiles or messages to feed into Kafka, and I might still do that in the future since it would let me control and significantly increase the message volume to push the stack to its full capability. For this project, though, I wanted to capture live physical signals around my neighborhood.
My approach centers on several key pillars:
- Engineering Excellence: I prioritized loose coupling to ensure the system doesn’t rely on a single script. If the database goes down, Kafka buffers the messages until it recovers; if storage goes down, the live dashboard remains functional; and if Kafka itself fails, only the ingestion stops while the downstream analytics remain accessible.
- Infrastructure as Code: I aimed for zero clicking on startup. Almost everything—from Kafka topics to Airflow connections—is defined in code, making the stack reproducible and ephemeral.
- Security & Networking: I aimed to build a hardened environment for the public internet. By leveraging Cloudflare Zero Trust, Caddy as a reverse proxy, and strict Docker networking, I tried to ensure that no services are exposed insecurely. Even if a vulnerability was discovered, the ephemeral nature of the stack and the lack of sensitive stored data make the environment a low value target.
- Learning: Tools like Kafka and Spark have a nearly infinite learning curve. This project was my attempt to scratch the surface of some of these tools in a complex, multi-stage environment.
Repo - https://github.com/whitecacti/docker_external/tree/mainline
System Architecture & General Flow
Raspberry Pi → Kafka → S3 (MinIO) → Airflow → Spark → Postgres → Superset
Hardware
- Raspberry Pi 4, 4GB RAM. Honestly, this is so little compute that even an older Pi would likely suffice.
- Alfa AWUS036ACH Wi-Fi adapter with two ARS-N19 antennas.
- VK-162 USB GPS dongle.
- Anker 20,000mAh Powerbank with 65W output. The setup only pulls about 5W, so the power bank capacity and output are essentially overkill.
- Solis Lite hotspot for the “interwebs.” Since writing to a Kafka topic only consumes ~30MB/hour, the cost is trivial—especially with Solis’s occasional $2/GB sales with 99-year expiration.
- The Brainz: A VPS with 14 ARM cores and 32GB RAM. This provides significant overhead for the Docker-compose stack. Even with all services running at the same time, the host uses about 22GB of RAM.
Data Capture
To gather data, I drove through residential neighborhoods at about 15–30 mph with the antennas on my car's dash. I used Scapy in Python to sniff packets rather than the typical aircrack-ng suite, as Scapy allows for much cleaner programmatic capture from the radio interface.
I scan channels 1, 6, and 11 about every second, as they are the most popular 2.4 GHz channels. Although the adapter supports 5 GHz, I focus on the 2.4 GHz band because it lets me capture Wi-Fi signals from much further away.
As the script runs, it batches unique MAC/SSID/Location combinations and writes them to the Kafka cluster via an authenticated cURL command. This process happens asynchronously, allowing the Wi-Fi adapter to continue capturing packets without interruption. Driving through a typical Seattle neighborhood, I get about 50–80 messages per second.
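A minimal sketch of that capture loop is below; the interface name, GPS lookup, and send step are placeholders rather than the production code:

```python
# Minimal sketch of the capture loop (not the production script).
# Assumes a monitor-mode interface "wlan1" and placeholder helpers
# get_gps_fix() / send_batch() that stand in for the real ones.
import os
from scapy.all import sniff, Dot11, Dot11Beacon, Dot11Elt

IFACE = "wlan1"          # Alfa adapter in monitor mode (hypothetical name)
CHANNELS = [1, 6, 11]    # most popular 2.4 GHz channels
seen = set()             # unique (mac, ssid, location) combinations
batch = []

def get_gps_fix():
    # Placeholder: the real script reads the VK-162 dongle.
    return (47.60, -122.33)

def send_batch(records):
    # Placeholder: the real script ships the batch to Kafka over HTTP.
    print(f"would send {len(records)} records")

def handle(pkt):
    if not pkt.haslayer(Dot11Beacon):
        return
    mac = pkt[Dot11].addr2
    ssid = pkt[Dot11Elt].info.decode(errors="ignore")  # first element is typically the SSID
    lat, lon = get_gps_fix()
    key = (mac, ssid, round(lat, 4), round(lon, 4))
    if key not in seen:
        seen.add(key)
        batch.append({"mac": mac, "ssid": ssid, "lat": lat, "lon": lon})

while True:
    for ch in CHANNELS:
        os.system(f"iw dev {IFACE} set channel {ch}")  # hop channels roughly every second
        sniff(iface=IFACE, prn=handle, store=0, timeout=1)
    if batch:
        send_batch(batch)   # fire-and-forget so sniffing isn't blocked
        batch.clear()
```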
Data Ingestion & Streaming - Apache Kafka
At its core, event streaming captures real-time data—like Wi-Fi packets—and queues it for processing. Kafka serves as the backbone, organizing these events into Topics. This setup allows Producers to write data asynchronously (so the adapter can keep capturing packets without waiting), while Consumers, like my S3 Sink, process that data at their own pace.
This decoupled architecture ensures resilience. If the Postgres database fails, ingestion continues; if the Raspberry Pi stops, the dashboard remains live. To harden the system, I access Kafka solely through a REST proxy secured by Caddy and Cloudflare Zero Trust.
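For illustration, a produce call through the REST proxy looks roughly like this; the URL, topic name, and credentials are placeholders:

```python
# Sketch of producing one batch through the Kafka REST proxy.
# URL, topic name, and credentials are placeholders, not the real ones.
import requests

REST_PROXY = "https://kafka.xyz.com"     # fronted by Caddy + Cloudflare
TOPIC = "wifi_raw"                       # hypothetical topic name

def produce(records):
    payload = {"records": [{"value": r} for r in records]}
    resp = requests.post(
        f"{REST_PROXY}/topics/{TOPIC}",
        json=payload,
        headers={"Content-Type": "application/vnd.kafka.json.v2+json"},
        auth=("user", "pass"),           # basic auth terminated at the proxy
        timeout=10,
    )
    resp.raise_for_status()

produce([{"mac": "aa:bb:cc:dd:ee:ff", "ssid": "example", "lat": 47.6, "lon": -122.3}])
```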
During the build, I initially implemented SASL_PLAINTEXT authentication but realized it was redundant. Because the stack runs in a single Docker network and ingestion flows through the proxy, the extra internal authentication layer wasn’t necessary.
Finally, I treat the Kafka cluster as disposable. By defining topics and configurations in external scripts, I can wipe and rebuild the entire infrastructure in minutes without losing any important data.
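As an example of what topics-as-code can look like, a bootstrap script along these lines would do it (topic names and configs here are illustrative, not the project's actual ones):

```python
# Illustrative topic-bootstrap script; topic names and configs are examples only.
from confluent_kafka.admin import AdminClient, NewTopic

admin = AdminClient({"bootstrap.servers": "kafka:9092"})  # internal Docker hostname

topics = [
    # Raw, high-volume topic with a short retention window (1 day).
    NewTopic("wifi_raw", num_partitions=3, replication_factor=1,
             config={"retention.ms": str(24 * 60 * 60 * 1000)}),
]

for topic, future in admin.create_topics(topics).items():
    try:
        future.result()        # raises if creation failed
        print(f"created {topic}")
    except Exception as exc:
        print(f"{topic}: {exc}")  # e.g. already exists on a re-run
```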
ksqlDB
Although not strictly necessary for this project, this was an extremely cool thing to get working. I learned that you can create a “stream on top of another stream” and have certain things be filtered out. This is extremely valuable for saving on storage costs; I can keep a raw topic with a short 1-day retention period for high-volume data, while simultaneously creating a secondary, filtered stream that persists only high-value data for much longer. I’ve also played with ksqlDB for flattening and filtering nested JSON, like the Wikipedia recent changes feed, to save space before the data hits the S3 sink.
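A rough example of that stream-on-a-stream pattern, sent to ksqlDB’s REST endpoint; the stream, topic, and column names are made up for illustration:

```python
# Sketch: defining a filtered stream on top of the raw stream via ksqlDB's REST API.
# Stream, topic, and column names are illustrative.
import requests

KSQLDB = "http://ksqldb-server:8088"

statements = """
CREATE STREAM wifi_raw_stream (mac VARCHAR, ssid VARCHAR, lat DOUBLE, lon DOUBLE)
  WITH (KAFKA_TOPIC='wifi_raw', VALUE_FORMAT='JSON');

CREATE STREAM wifi_named_networks
  WITH (KAFKA_TOPIC='wifi_named_networks', VALUE_FORMAT='JSON') AS
  SELECT mac, ssid, lat, lon
  FROM wifi_raw_stream
  WHERE ssid IS NOT NULL AND ssid != ''
  EMIT CHANGES;
"""

resp = requests.post(f"{KSQLDB}/ksql",
                     json={"ksql": statements, "streamsProperties": {}},
                     timeout=30)
resp.raise_for_status()
print(resp.json())
```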
Data Storage - MinIO S3
The data storage layer consists of a MinIO Docker container. Storing ~30 MB of data per hour is cheap; the entire project, including testing and manipulation, is likely under a gig. I probably could have saved time using AWS S3, but I decided to spin up my own S3-compatible storage instead. Running docker-compose up and configuring the storage was the easy part. The real challenge was getting DNS resolution to work across a single VPS.
For example, when I access s3.xyz.com from my phone, the path is:
Me -> DNS -> s3.xyz.com -> VPS IP -> Caddy -> Container
However, this breaks down for other containers on the same host (like Kafka) trying to reach the S3 server. The internal request hits the public DNS, points to the VPS IP, and then gets lost—an issue often called NAT loopback or hairpin NAT. The fix was adding s3.xyz.com to the /etc/hosts file of the Kafka container so it resolves directly to the internal Docker network IP of the MinIO container.
echo "172.21.0.2 s3.xyz.com" | sudo tee -a /etc/hosts
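A quick way to confirm where the name actually resolves from inside a container is a one-liner like this (the domain is the same placeholder as above):

```python
# Quick check of what an in-container process actually resolves the name to.
import socket

print(socket.gethostbyname("s3.xyz.com"))
# Public DNS would return the VPS IP; after the /etc/hosts entry it should
# print the MinIO container's internal Docker network address (e.g. 172.21.0.2).
```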
I also used this MinIO storage in Spark, where I’ve implemented a simplified Medallion Architecture. The bronze_bucket holds raw data from the Kafka S3 Sink, while the silver_bucket contains the data after it’s been processed by Spark.
Orchestration - Apache Airflow
I used a Dockerized Apache Airflow setup based on the official Compose file. By defining connections and variables directly in code, the environment becomes easy to reproduce, ensuring seamless startups and resets.
My main ETL DAG does the following (a simplified sketch follows this list):
- Truncate: Clears the descapy table in Postgres to prepare for fresh data.
- Spark Submit: Uses spark-submit to trigger a processing job on the Spark cluster.
- Copy: Executes a Postgres COPY command to pull the refined data from the MinIO Silver bucket into the database.
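A trimmed-down sketch of that DAG is below; the connection IDs, file paths, and the COPY statement are placeholders rather than the exact production values:

```python
# Simplified sketch of the ETL DAG; connection IDs, bucket paths, and the
# COPY statement are placeholders, not the exact production values.
from datetime import datetime

from airflow import DAG
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator
from airflow.providers.common.sql.operators.sql import SQLExecuteQueryOperator

with DAG(
    dag_id="descapy_etl",
    start_date=datetime(2024, 1, 1),
    schedule=None,          # triggered after a capture session
    catchup=False,
) as dag:

    truncate = SQLExecuteQueryOperator(
        task_id="truncate_descapy",
        conn_id="postgres_default",
        sql="TRUNCATE TABLE descapy;",
    )

    spark_job = SparkSubmitOperator(
        task_id="spark_bronze_to_silver",
        conn_id="spark_default",
        application="/opt/airflow/jobs/bronze_to_silver.py",  # hypothetical path
    )

    copy_silver = SQLExecuteQueryOperator(
        task_id="copy_silver_to_postgres",
        conn_id="postgres_default",
        # pg_parquet-style COPY from the MinIO silver bucket (illustrative path and syntax).
        sql="COPY descapy FROM 's3://silver-bucket/descapy.parquet';",
    )

    truncate >> spark_job >> copy_silver
```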
Data Processing - Apache Spark
Spark handles the heavy (in this case very light) lifting for data processing. I have a 2-worker cluster plus one master node set up via Docker Compose, with each node allocated a few GB of RAM.
The “fun” part was figuring out how to get spark-submit jobs to work and how to load all the necessary dependencies. The biggest takeaway was the importance of version synchronization: Airflow, Spark, and all related services must run the exact same Python version (3.11 in my case). Without this, the cluster throws constant, cryptic compatibility errors.
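For reference, a stripped-down version of the kind of job that gets submitted might look like this, assuming placeholder bucket names, MinIO endpoint, and credentials:

```python
# Stripped-down bronze-to-silver job; endpoint, credentials, and bucket
# names are placeholders. Submitted to the cluster with something like:
#   spark-submit --master spark://spark-master:7077 \
#       --packages org.apache.hadoop:hadoop-aws:3.3.4 bronze_to_silver.py
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (
    SparkSession.builder.appName("descapy_bronze_to_silver")
    # Point the s3a connector at the internal MinIO endpoint instead of AWS.
    .config("spark.hadoop.fs.s3a.endpoint", "http://minio:9000")
    .config("spark.hadoop.fs.s3a.access.key", "CHANGE_ME")
    .config("spark.hadoop.fs.s3a.secret.key", "CHANGE_ME")
    .config("spark.hadoop.fs.s3a.path.style.access", "true")
    .getOrCreate()
)

# Raw JSON dropped by the Kafka S3 sink (bronze layer).
raw = spark.read.json("s3a://bronze-bucket/topics/wifi_raw/")

# Light cleanup: drop empty SSIDs and exact duplicates, normalize MACs.
silver = (
    raw.filter(F.col("ssid").isNotNull() & (F.col("ssid") != ""))
       .withColumn("mac", F.lower(F.col("mac")))
       .dropDuplicates(["mac", "ssid", "lat", "lon"])
)

# Parquet output that Postgres later pulls in via COPY (silver layer).
silver.write.mode("overwrite").parquet("s3a://silver-bucket/descapy.parquet")
```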
Database Layer - PostgreSQL
Since Superset doesn’t natively support S3 as a direct data source, I spun up a Postgres database. To make this work with the rest of the stack, I added a few things:
- pg_parquet: Allows the database to read Spark-generated Parquet files.
- AWS Dependencies: Necessary for the database to pull data from our internal MinIO S3 server.
- PostGIS: Added to support geospatial data. While not strictly required for the current dashboard, it’s there for future improvements like isolating or filtering specific capture locations.
I used an init.sql file to automatically load these extensions on startup. Like the rest of the stack, I view this database as disposable and easily reproducible; I want to be able to lose all the data and restore it in minutes. For now, this ephemeral setup is perfect, though I might look into a more permanent PostgreSQL instance down the road.
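The init.sql itself boils down to a few CREATE EXTENSION statements; here is a minimal sketch of the same thing run through psycopg, with placeholder connection details:

```python
# What the init.sql accomplishes, expressed via psycopg for illustration;
# connection details are placeholders.
import psycopg

with psycopg.connect("host=localhost dbname=descapy user=postgres password=CHANGE_ME") as conn:
    with conn.cursor() as cur:
        cur.execute("CREATE EXTENSION IF NOT EXISTS pg_parquet;")   # Parquet COPY support
        cur.execute("CREATE EXTENSION IF NOT EXISTS postgis;")      # geospatial types
        # Sanity check: list what's actually installed.
        cur.execute("SELECT extname, extversion FROM pg_extension ORDER BY extname;")
        for name, version in cur.fetchall():
            print(name, version)
```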
Presentation - Apache Superset
I used Apache Superset to visualize the data. The resulting dashboards look great and are very responsive.
I’m honestly impressed with how Superset stacks up against enterprise closed source offerings like Tableau or QuickSight (or whatever they’re calling it this week). It delivers a professional, polished presentation with zero recurring licensing costs.
Security Considerations
My goal was to create a hardened environment where no services are exposed insecurely. I use ufw-docker to secure the host’s firewall and ensure that most services are bound to the loopback interface (127.0.0.1), restricting access to the host machine and preventing exposure to external networks.
Public access is strictly managed through a Caddy reverse proxy and Cloudflare Zero Trust. Even if my domain is exposed, the actual VPS IP remains hidden behind Cloudflare. The traffic flow follows a strict path: xyz.xyz.com -> Cloudflare Zero Trust -> Caddy -> Docker Container.
To keep the GitHub repo clean and secure, I use a “Push” deployment model. All secrets and environment variables are stored only on my local development machine. My move_build.sh script uses rsync to push these secrets to the VPS at runtime before starting the Docker Compose stack. Combined with .gitignore, this ensures that no sensitive credentials are ever committed to version control.
Improvements and Other Ideas
- Currently, each MAC/SSID combo may show 10+ captured locations. I plan to implement a trilateration algorithm to estimate the router’s actual location based on the relative signal strengths of different capture points (a rough weighted-centroid sketch appears after this list).
- Test a pg_dump setup on a cron schedule that automatically backs up to the MinIO server. Also, do some disaster recovery testing with that backup to ensure the data can be easily restored. This and this give me nightmares.
- Currently, starting the capture process is pretty manual. I have the Raspberry Pi connected to a Cytrence Kiwi Pro KVM, which is then hooked up to my Mac. This setup allows me to access the Pi’s terminal and start the capture. My goal is to automate this by creating a systemd service that handles the adapter state and starts the script on boot.
- I’m not sure what the ideal channel-switching interval is—while 1 second works well, it would be interesting to test different configurations (like 500ms or 250ms or less) to see if I can increase packet density without missing important frames.
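As a starting point for the first idea above, a crude stand-in for true trilateration is an RSSI-weighted centroid of the capture points; a sketch with hypothetical field names:

```python
# Crude stand-in for trilateration: estimate a router's position as the
# RSSI-weighted centroid of its capture points. Field names are hypothetical.
from typing import List, Tuple

def estimate_position(points: List[Tuple[float, float, int]]) -> Tuple[float, float]:
    """points: (lat, lon, rssi_dbm) observations for one MAC/SSID."""
    # Convert dBm (e.g. -30 strong, -90 weak) into positive weights.
    weights = [max(1.0, 100.0 + rssi) for _, _, rssi in points]
    total = sum(weights)
    lat = sum(p[0] * w for p, w in zip(points, weights)) / total
    lon = sum(p[1] * w for p, w in zip(points, weights)) / total
    return lat, lon

# Example: three observations of the same router from different street positions.
obs = [(47.6010, -122.3330, -75), (47.6012, -122.3328, -55), (47.6014, -122.3326, -80)]
print(estimate_position(obs))
```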