Summary of "Primera sesión: IA en la Gestión de Documentos" (First Session: AI in Document Management, 2025)
This lecture, delivered by Professor Dr. Luis Miguel García Velázquez, focuses on the application of artificial intelligence (AI), specifically deep learning, to the management and de-identification of clinical medical records. The session blends theoretical insights on AI methodologies with a practical case study centered on medical data protection, emphasizing responsible AI use and interdisciplinary collaboration.
Main Ideas and Concepts
1. Context and Motivation
- The project addresses a critical health issue: Type 2 diabetes mellitus in Mexico, with clinical data collected from the Mexican Social Security Institute’s family medicine information system.
- Since 2006, medical records have been digitized, creating a large database of clinical notes, biometric data, and other patient information.
- The objective is to analyze these records to identify conditions associated with complications in diabetic patients, not to predict future events but to inform preventive recommendations.
- The challenge is to handle sensitive personal data, especially in free-text clinical notes, which vary widely in format and content.
2. Problem of De-identification in Clinical Records
- Structured data fields can be easily anonymized by removing marked columns.
- Free-text fields pose a significant challenge because personal data (names, addresses, etc.) are embedded in unstructured narrative text.
- Manual anonymization is impractical for hundreds of thousands of records.
- The team developed AI algorithms to automatically detect and redact personal data from these free-text notes.
- The approach must respect ethical considerations and institutional requirements to ensure data privacy.
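The rule-based pre-filtering mentioned above can be pictured with a minimal sketch. This is not the project's actual code: the title list, the regex, and the `[NOMBRE]` placeholder are illustrative assumptions about how a title-plus-name rule might redact free text.

```python
import re

# Illustrative rule: redact capitalized name sequences that follow common
# clinical titles (Dr., Dra., etc.). Titles and placeholder are assumptions.
TITLE_PATTERN = re.compile(
    r"\b(?:Dra|Dr|Lic|Enf)\.?(?:\s+[A-ZÁÉÍÓÚÑ][a-záéíóúñ]+){1,3}"
)

def redact_titled_names(note: str) -> str:
    """Replace title + name sequences with a placeholder tag."""
    return TITLE_PATTERN.sub("[NOMBRE]", note)

print(redact_titled_names("Paciente atendido por Dr. Juan Pérez el lunes."))
```

Rules like this are fast and transparent, but they only catch patterns someone thought to write down, which is why the session pairs them with learned models and manual review.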
3. Artificial Intelligence Methodology
- AI here mainly refers to machine learning, particularly deep learning, which builds mathematical models from examples.
- The process involves two components:
- Training algorithm: Iteratively improves the model by learning from labeled examples.
- Trained model: A fixed function that processes new data to perform tasks like detecting personal data.
- Models are context-specific; a model trained on clinical notes will not work well on legal or tax documents without retraining.
- The methodology involves:
- Manual tagging of a sample of notes by medical staff.
- Use of rule-based automated pre-processing to speed up manual review.
- Iterative refinement of both rules and models through cross-validation between human experts and algorithms.
- Transfer learning by adapting pre-trained language models to Spanish clinical texts.
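The split between "training algorithm" and "trained model" can be made concrete with a toy sketch (invented for illustration, far simpler than the deep learning actually used): training consumes labeled examples and emits a fixed function, which is then applied to new tokens. The toy also shows why models are context-specific, since it only "knows" what appeared in its training data.

```python
# Toy sketch of the two components: a training step that builds a model from
# labeled examples, and the resulting fixed model applied to new data.
# Labels and tokens are invented; real systems generalize far beyond this.

def train(examples):
    """Training algorithm: learn which tokens were labeled as personal data."""
    personal = {tok.lower() for tok, label in examples if label == "PERSONAL"}

    def model(token):
        """Trained model: a fixed function that classifies new tokens."""
        return "PERSONAL" if token.lower() in personal else "OTHER"

    return model

labeled = [("Juan", "PERSONAL"), ("diabetes", "OTHER"), ("Pérez", "PERSONAL")]
model = train(labeled)
print(model("juan"))     # seen during training
print(model("glucosa"))  # unseen token
```

A model memorizing clinical vocabulary this way would be useless on legal or tax documents, which mirrors the retraining point above.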
4. Illustrative Exercise: Word Classification and AI Intuition
- Luis Miguel used a simple exercise with words (animals, verbs, nouns, etc.) to illustrate how AI algorithms classify data.
- The AI groups words based on statistical co-occurrence patterns in a text corpus, without understanding meaning explicitly.
- Different corpora produce different classifications, reflecting the text’s nature (e.g., fairy tales show gender bias in character roles).
- This demonstrates:
- How AI mimics human cognitive processes like clustering.
- The importance of the training data (corpus) in shaping AI behavior.
- The presence of biases in data that AI can both reveal and perpetuate.
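The co-occurrence intuition behind the exercise can be sketched in a few lines. The tiny corpus, the whole-sentence window, and the overlap-based similarity measure are all illustrative assumptions, not the lecturer's exercise: the point is only that words with similar statistical company get grouped together, with no explicit notion of meaning.

```python
from collections import Counter
from itertools import combinations

# Tiny invented corpus; "perro" and "gato" appear in near-identical contexts.
corpus = [
    "el perro corre en el parque",
    "el gato duerme en la casa",
    "el perro duerme en la casa",
    "el gato corre en el parque",
]

def cooccurrences(sentences):
    """Count, for each word, how often every other word shares a sentence."""
    counts = {}
    for sentence in sentences:
        for a, b in combinations(sentence.split(), 2):
            counts.setdefault(a, Counter())[b] += 1
            counts.setdefault(b, Counter())[a] += 1
    return counts

def similarity(counts, w1, w2):
    """Overlap of co-occurrence profiles: shared neighbor mass."""
    c1, c2 = counts[w1], counts[w2]
    return sum(min(c1[w], c2[w]) for w in set(c1) | set(c2))

counts = cooccurrences(corpus)
print(similarity(counts, "perro", "gato"))   # high: same contexts
print(similarity(counts, "perro", "corre"))  # lower: different roles
```

Swapping in a different corpus changes the counts and therefore the groupings, which is the mechanism by which biases in the training text surface in the model's behavior.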
5. Responsible AI Use and Interdisciplinary Collaboration
- Understanding AI’s inner workings, even intuitively, is crucial for responsible development and deployment.
- AI models are not perfect; they are forced to output a decision even when uncertain, and they can propagate biases.
- The purpose of the AI tool must be clearly defined upfront (e.g., de-identification vs. anonymization).
- Interdisciplinary teams combining computer scientists, medical professionals, social workers, and archival experts are essential.
- Human oversight remains necessary, especially for legal and ethical accountability.
- Transparency and documentation of AI processes are vital for traceability, legal compliance, and user trust.
- AI accelerates workflows (up to 13x faster in this case) but does not replace human expertise.
6. Practical Outcomes and Extensions
- The project resulted in:
- A trained AI model for de-identification of clinical notes.
- A cleaned dataset of over 28,000 annotated notes ready for clinical analysis.
- A prototype interface for medical staff to visualize patient data summaries.
- The methodology was adapted to other types of notes (e.g., social work records), demonstrating flexibility.
- Future work includes expanding AI tools for multimodal document types (legal, clinical, tax) using hierarchical or modular models.
Detailed Methodology / Steps Presented
- Data Preparation:
- Collect clinical notes from digitized medical records.
- Identify structured vs. unstructured data fields.
- Manually tag a subset of notes for personal data.
- Develop initial rule-based filters (e.g., recognizing titles like “doctor”).
- Use automated tools to pre-filter large datasets.
- Model Training:
- Use deep learning algorithms pre-trained on general language data.
- Fine-tune models on the manually tagged clinical notes.
- Employ iterative training and validation cycles.
- Use selective memory techniques to prioritize frequently used rules and discard rarely applied ones.
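The "selective memory" step above can be sketched as a rule book that tracks how often each rule fires and prunes the rest. The rule set, hit threshold, and class design are illustrative assumptions, not the project's implementation.

```python
from collections import Counter

class RuleBook:
    """Toy sketch: keep rules that fire often, forget rarely used ones."""

    def __init__(self, rules):
        self.rules = dict(rules)  # name -> predicate on a token
        self.hits = Counter()

    def apply(self, token):
        """Return the name of the first rule that matches, tallying usage."""
        for name, pred in self.rules.items():
            if pred(token):
                self.hits[name] += 1
                return name
        return None

    def prune(self, min_hits):
        """Forget rules that fired fewer than min_hits times."""
        self.rules = {n: p for n, p in self.rules.items()
                      if self.hits[n] >= min_hits}

book = RuleBook({
    "title": lambda t: t.rstrip(".").lower() in {"dr", "dra"},
    "digits": lambda t: t.isdigit(),
})
for tok in ["Dr.", "Dra", "glucosa", "120"]:
    book.apply(tok)
book.prune(min_hits=2)  # "digits" fired only once, so it is forgotten
print(sorted(book.rules))
```

Pruning keeps the rule set small and fast as it grows over iterations, at the cost of occasionally dropping a rule that would have mattered on rarer notes.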
Category
Educational