NLP Without Labels: How to Cluster N Legal Processes of the Colombian State and Turn Chaos into a Production Classifier

What do you do when you have 600,000 legal complaints, zero labeled data, and a government entity waiting for results? This talk walks through the full process of building an unsupervised NLP classification system for the Procuraduría General de la Nación. Starting from raw administrative text (noisy, full of abbreviations and institutional jargon), I'll show how TF-IDF, truncated SVD, and KMeans were combined to organize more than half a million records into 64 semantically coherent groups, without a single manual label. But clustering is only the starting point. I'll cover how the clusters were validated, how a Logistic Regression classifier was trained on the cluster assignments to make the system deployable, and how the final pipeline was packaged as a .pkl file that non-technical colleagues use in production today. Along the way we'll face real problems: elbow curves that don't behave, 1:20 size imbalances between clusters, and the tension between mathematical elegance and institutional usability. Because in the public sector, a model nobody uses isn't a model; it's a PDF gathering dust.
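As a rough illustration of the pipeline described above, here is a minimal sketch using scikit-learn. It is not the speaker's actual code. The toy corpus, the number of SVD components, the two clusters (the talk uses 64), and the choice to train the classifier on its own TF-IDF features are all assumptions made only to keep the example small and runnable.

```python
import pickle

from sklearn.cluster import KMeans
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus standing in for the real complaint texts.
texts = [
    "delay in pension payment by the agency",
    "pension payment not received for months",
    "agency delayed the pension payment again",
    "irregular contract awarded without bidding",
    "public contract awarded without open bidding",
    "contract bidding process was irregular",
]

# Step 1: TF-IDF -> truncated SVD -> KMeans produces a cluster id per document.
# n_components=2 and n_clusters=2 are toy values chosen for this tiny corpus.
tfidf = TfidfVectorizer()
svd = TruncatedSVD(n_components=2, random_state=0)
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)

reduced = svd.fit_transform(tfidf.fit_transform(texts))
pseudo_labels = kmeans.fit_predict(reduced)

# Step 2: treat the cluster ids as labels and fit a supervised classifier,
# so new documents can be assigned to a group without re-running the clustering.
classifier = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
classifier.fit(texts, pseudo_labels)

# Step 3: serialize the fitted pipeline; in practice the bytes would be written
# to a .pkl file and loaded by whoever runs the model.
payload = pickle.dumps(classifier)
restored = pickle.loads(payload)

prediction = restored.predict(["pension payment delayed"])
```

Training a separate classifier on the cluster assignments means the deployed artifact is a single object with a `predict` method, which is easier to hand to colleagues than a multi-stage clustering script. Note that a pickle should only be loaded from a trusted source, and with the same library versions that produced it.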
