Machine Learning, Data Science, Core Python, DevOps

NLP Without Labels: How to Cluster N Legal Processes of the Colombian State and Turn Chaos into a Production Classifier

FORMAT: Talk LEVEL: Advanced LANGUAGE: Spanish

What do you do when you have 600,000 legal complaints, zero labeled data, and a government entity waiting for results? This talk walks through the full process of building an unsupervised NLP classification system for the Procuraduría General de la Nación. Starting from raw administrative text (noisy, full of abbreviations and institutional jargon), I'll show how TF-IDF, truncated SVD, and KMeans were combined to organize more than half a million records into 64 semantically coherent groups, without a single manual label. But clustering is only the starting point. I'll cover how the clusters were validated, how a Logistic Regression classifier was trained on the cluster assignments to make the system deployable, and how the final pipeline was packaged as a .pkl file that non-technical colleagues use in production today. Along the way we'll face real problems: elbow curves that don't behave, 1:20 size imbalances between clusters, and the tension between mathematical elegance and institutional usability. Because in the public sector, a model nobody uses isn't a model; it's a PDF gathering dust.
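As a rough illustration of the workflow the abstract describes, the sketch below chains TF-IDF, truncated SVD, and KMeans to produce cluster ids, trains a Logistic Regression classifier on those ids, and round-trips the result through pickle. It is a minimal sketch under stated assumptions, not the speaker's actual pipeline: the toy corpus, `N_CLUSTERS = 3`, the `Normalizer` step, `class_weight="balanced"`, and the choice to train the classifier on TF-IDF features directly are all illustrative choices that the abstract does not confirm.

```python
import pickle

from sklearn.cluster import KMeans
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer

# Hypothetical toy corpus standing in for the ~600,000 real complaints.
docs = [
    "queja por demora en el pago de pension",
    "demora en reconocimiento de pension de vejez",
    "reclamo por pago de pension atrasada",
    "denuncia por irregularidades en contrato de obra",
    "irregularidades en la adjudicacion del contrato",
    "contrato de obra publica sin licitacion",
    "queja por falta de atencion en hospital",
    "hospital niega atencion medica urgente",
    "falta de medicamentos en el hospital publico",
]

N_CLUSTERS = 3  # assumption for the toy data; the talk reports 64 groups

# Step 1: unsupervised grouping with TF-IDF -> truncated SVD -> KMeans.
# The Normalizer is an assumption (a common LSA convention), not from the talk.
embed = make_pipeline(
    TfidfVectorizer(),
    TruncatedSVD(n_components=3, random_state=0),
    Normalizer(copy=False),
)
X = embed.fit_transform(docs)
kmeans = KMeans(n_clusters=N_CLUSTERS, n_init=10, random_state=0)
cluster_ids = kmeans.fit_predict(X)

# Step 2: treat the cluster ids as labels and fit a supervised classifier.
# class_weight="balanced" is one possible response to uneven cluster sizes.
classifier = make_pipeline(
    TfidfVectorizer(),
    LogisticRegression(max_iter=1000, class_weight="balanced"),
)
classifier.fit(docs, cluster_ids)

# Step 3: serialize the fitted pipeline. In-memory bytes are used here;
# writing a .pkl file would use pickle.dump(classifier, file_handle) instead.
blob = pickle.dumps(classifier)
restored = pickle.loads(blob)

# The restored object accepts raw text and returns a cluster id.
prediction = restored.predict(["demora en el pago de la pension"])
```

Packaging the vectorizer and classifier in a single pipeline object means whoever loads the pickle only needs to call `predict` on raw strings, which is one way to keep a model usable by non-technical colleagues. Pickle files should only be loaded from trusted sources, and with a scikit-learn version compatible with the one used to create them.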

Speaker

Jonatan Esteban Gonzalez Balaguera

Professional @ Procuraduría General de la Nación

I'm a physicist with a master's in theoretical physics and a second master's in Visual Analytics and Big Data, currently pursuing a specialization in statistics at Universidad Nacional de Colombia. I work as an analyst at the Procuraduría General de la Nación, applying machine learning, NLP, and geospatial analysis to preventive oversight and monitoring problems. My path runs from simulating superconductor systems to developing deforestation detection tools and electoral analysis, always with Python as the common thread.
