Shayan Geek

Deduplicating Documents at Scale

By Shayan Sadeghi, 2025-12-20

Deduplication sounds simple until you actually try to do it on real data. If you have a few thousand short texts, almost any solution works. When you move to millions of long documents, multiple languages, noisy data, and the requirement to find very similar (but not identical) documents, everything changes: algorithms, data structures, memory layout,…

Data Engineering

How I Fixed a Slow Data Pipeline with Python Async – A Hands-On Guide

By Shayan Sadeghi, 2025-11-21

Hey there! A couple of weeks ago I hit a classic data-engineering wall. I had a pipeline that needed to: This process has to run on data for about 3K libraries. If the whole process takes 1 second, the full run takes about 2 days. The CPU was sitting idle 99% of the time just waiting for HTTP…

Management

Whiplash: A Brutal Symphony of Perfection — Lessons in Leadership, Growth, and Obsession

By Shayan Sadeghi, 2025-11-07

Spoiler Alert: This article contains key plot details from Whiplash (2014). In every ambitious person’s journey, there’s a moment when the excitement of learning slowly turns into an obsession with being perfect. It’s the moment when “I want to get better” becomes “I have to be the best.” And right there, the danger begins…

Data Engineering

The Data Engineer’s Dilemma: Batch, Stream, or Hybrid?

By Shayan Sadeghi, 2025-10-29

There’s a moment in every data engineer’s journey when the excitement of building pipelines meets a difficult, quiet question: Should this run in batch, or should it be real-time? It sounds technical, but it’s actually philosophical. Behind it lies a deeper question: What are we really optimizing for: freshness, simplicity, or reliability? Because you…

Data Engineering | Elasticsearch

Elasticsearch Part 4: Analytical Queries

By Shayan Sadeghi, 2025-10-15

Welcome back to our deep dive into Elasticsearch! So far, we’ve mastered the art of finding the right documents. We’ve become experts in queries, filters, and the mighty bool query. But what happens after you’ve retrieved your results? How do you make sense of the bigger picture? This is where the true analytical power of Elasticsearch shines: Aggregations. If queries…

Data Engineering | Elasticsearch

Elasticsearch Queries – Part 3: Bool Queries and Pagination

By Shayan Sadeghi, 2025-09-28

Introduction If you’ve been following this series, you already know: Now it’s time for the real workhorse: the bool query. Why? Because no real-world search problem is solved by just one condition. Users expect relevance and restrictions: The bool query is how you glue all of these conditions together. By the way, I should also mention…

Data Engineering | Elasticsearch

Elasticsearch Queries – Part 2: Practical Query Types

By Shayan Sadeghi, 2025-09-16

In Part 1 of this series I walked through the foundations of Elasticsearch queries: the mental model, why mapping is your best friend, and how to choose between filters and matches. Now it’s time to roll up our sleeves and look at some of the practical query types that you’ll actually use when building real-world…

Data Engineering | Elasticsearch

Elasticsearch Queries – Part 1: Queries and Filters

By Shayan Sadeghi, 2025-09-08

When I first got to know Elasticsearch, I told myself: “Well, this is just another database… right?” But I was wrong. Elasticsearch is actually different. It kind of feels like a mix between a search engine and a database. To be honest, I’m still not very comfortable with it myself 🙂 But in this post, which is the first…

Data Engineering

My Favorite Python Libraries for Fast Data Exploration

By Shayan Sadeghi, 2025-09-01

Let me be honest: when I sit down with a fresh dataset, I’m not looking for ceremony. I’m looking for clarity. That first hour matters more than most people admit. I want to get a feel for the terrain—what’s messy, what’s surprising, what’s worth digging into. If I can’t answer “what’s going on here?” in…

Data Engineering | Data Pipeline

Batch Processing in Apache Airflow

By Shayan Sadeghi, 2025-08-12

You’ve probably heard the term batch processing before. In this post, we’ll talk about what it means and how to use it in Apache Airflow. These days, with the mind-boggling amount of data growing by the second, one of the most important skills you can have is knowing how to process massive datasets efficiently. Tools like…