IIIA | theses

Title: Privacy-preserving approximate nearest neighbour

Themes: Information Retrieval, Privacy

Description: One of the major challenges in protecting privacy is the computational overhead introduced by privacy-preserving strategies such as homomorphic encryption. This issue is particularly evident in time-sensitive tasks, where even small delays in computing results can have severe negative effects on user experience and perceived service quality.
During the research internship, the student will develop a framework to efficiently integrate privacy-preserving matching strategies, such as homomorphic encryption and secure nearest neighbor computation, with approximate nearest neighbor solutions such as FAISS.

Supervisors: G. Faggioli

Title: Data-native impact metrics for datasets and contributors (credit transfer + "data h-index")

Themes: Data Citation, Knowledge Bases

Description: Modern science depends on shared datasets, but evaluation systems still mostly reward papers; as a result, data creators and curators often receive weak, inconsistent recognition. This thesis designs and prototypes a data-native impact metric—an h-index–style measure for data—based on a credit transfer procedure that propagates credit through reuse chains (dataset → derived dataset → paper/workflow → downstream reuse). The work includes: (i) defining a formal credit model (roles, direct vs transitive credit, decay rules, duplicate/merged records), (ii) implementing scalable computation over a provenance/citation graph (e.g., using a graph database), (iii) studying robustness to missing links and “gaming” behaviors, and (iv) validating the metric on real domains (biomedicine and cultural heritage) using repository metadata and paper–dataset links. The student will deliver an open prototype (data pipeline + algorithms) and an empirical report showing when the metric is stable, fair, and immediately usable by institutions without changing publishing workflows.
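As a rough illustration of the credit transfer idea described above, the following Python sketch propagates credit backwards along reuse chains and then computes an h-index-style score over the resulting per-dataset credits. The function names, the fixed `decay` factor, and the toy graph are assumptions made for this example only; the actual credit model (roles, decay rules, handling of duplicates) is what the thesis would define.

```python
def transfer_credit(reuse_edges, direct_credit, decay=0.5):
    """Propagate credit from reusing items back to the items they reuse.

    reuse_edges: list of (source, derived) pairs, meaning `derived` reuses `source`.
    direct_credit: dict mapping an item to the credit it receives directly.
    decay: share of credit passed one step upstream (illustrative assumption).
    The reuse graph is assumed to be acyclic.
    """
    sources = {}
    for source, derived in reuse_edges:
        sources.setdefault(derived, []).append(source)

    total = {}

    def push(item, amount):
        # Credit the item, then pass a decayed share to everything it reuses.
        total[item] = total.get(item, 0.0) + amount
        for source in sources.get(item, []):
            push(source, amount * decay)

    for item, credit in direct_credit.items():
        push(item, credit)
    return total


def data_h_index(credits):
    """Largest h such that at least h items have credit >= h."""
    h = 0
    for rank, credit in enumerate(sorted(credits, reverse=True), start=1):
        if credit < rank:
            break
        h = rank
    return h


# Toy reuse graph: D1 -> D2 -> P1, and D1 -> P2 (all names hypothetical).
edges = [("D1", "D2"), ("D2", "P1"), ("D1", "P2")]
totals = transfer_credit(edges, {"P1": 4, "P2": 2, "D2": 1})
creator_score = data_h_index([totals["D1"], totals["D2"]])
```

In this toy setting the creator of D1 and D2 is credited for downstream papers that never cite D1 directly, which is the behaviour a citation count alone would miss.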

Supervisors: G. Silvello

Title: Quality certification for cited data slices when no ontology exists (constraint discovery + trust labels)

Themes: Data Citation, Knowledge Bases

Description: Data extracted through queries can be incomplete, inconsistent, or semantically “impossible,” but most datasets do not come with a domain ontology that would enable formal validation. This thesis builds a quality certification service that attaches a machine-readable trust label to a query-derived data slice, even in the “no ontology available” case. The work includes: (i) designing a certificate format that separates what can always be guaranteed (identity, provenance trace, version/fixity, integrity hashes) from what can be inferred (schema-based and learned constraints), (ii) implementing constraint discovery and profiling (keys, functional dependencies, range rules, null/outlier patterns) plus validation checks, (iii) optionally integrating ontology-based checks for a biomedical case study to compare “constraint-only” vs “ontology-grounded” certification, and (iv) evaluating detection accuracy and false alarms under controlled corruption (“dirty data injection”) and real datasets. The student will produce a working tool/API that outputs graded certificates and a quantitative evaluation demonstrating how far trustworthy reuse is possible without relying on a pre-existing ontology.
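To make the "constraint-only" certification idea more concrete, here is a minimal Python sketch that checks one functional dependency and one range rule on a small data slice and summarises the outcome as a graded label. The function names, the certificate layout, and the sample rows are hypothetical; a real service would discover the constraints rather than receive them as arguments.

```python
def fd_violations(rows, lhs, rhs):
    """Return the lhs value tuples that map to more than one rhs value."""
    seen = {}
    violations = set()
    for row in rows:
        key = tuple(row[c] for c in lhs)
        value = tuple(row[c] for c in rhs)
        if key in seen and seen[key] != value:
            violations.add(key)
        seen.setdefault(key, value)
    return violations


def out_of_range(rows, column, low, high):
    """Return indexes of rows whose value is missing or outside [low, high]."""
    return [
        i for i, row in enumerate(rows)
        if row[column] is None or not (low <= row[column] <= high)
    ]


def certify(rows, fds, ranges):
    """Run all checks and return a small machine-readable label (assumed format)."""
    checks = {}
    for lhs, rhs in fds:
        name = f"fd:{','.join(lhs)}->{','.join(rhs)}"
        checks[name] = not fd_violations(rows, lhs, rhs)
    for column, (low, high) in ranges.items():
        checks[f"range:{column}"] = not out_of_range(rows, column, low, high)
    passed = sum(checks.values())
    return {"checks": checks, "grade": f"{passed}/{len(checks)}"}


# Hypothetical data slice with one inconsistent key and one impossible age.
rows = [
    {"patient": "p1", "birth_year": 1980, "age": 44},
    {"patient": "p1", "birth_year": 1981, "age": 43},
    {"patient": "p2", "birth_year": 1975, "age": 210},
]
certificate = certify(rows, fds=[(["patient"], ["birth_year"])], ranges={"age": (0, 120)})
```

Injecting such errors on purpose into clean data, as in the controlled corruption experiments mentioned above, gives a ground truth against which detection accuracy and false alarms can be measured.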

Supervisors: G. Silvello

Title: From query results to reusable micro-evidence objects (nanopublication-inspired publishing + indexing)

Themes: Data Citation, Knowledge Bases

Description: Data citations usually remain human-oriented text strings, which limits machine reuse: computers cannot easily query, combine, and verify “what exactly was used” in a scientific claim. This thesis implements an end-to-end pipeline that transforms query results into reusable micro-evidence objects: small, resolvable, machine-actionable packages, inspired by nanopublications, that contain the data slice, the query, provenance, contributor attribution, and retrieval instructions. The work includes: (i) defining a minimal evidence-object schema (RDF/JSON-LD or similar), (ii) building generators for at least one query language (SQL or SPARQL), (iii) implementing a lightweight publication and indexing service (a “micro-evidence server”) with persistent identifiers, search, and retrieval APIs, and (iv) demonstrating downstream reuse: composing evidence objects into a mini knowledge graph and running analytics such as reuse tracking and contributor credit aggregation. The student will deliver a functional prototype, performance measurements (throughput, storage, retrieval latency), and a demonstration showing how machine-actionable evidence objects enable verification and reuse beyond what traditional citation strings can support.
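A micro-evidence object of the kind described above could look roughly like the following Python sketch, which packages a query result as a JSON-LD-style dictionary and lets a consumer verify the data slice against its hash. The `ex:` vocabulary, the `urn:sha256:` identifier scheme, and the sample values are placeholders invented for this example; only the PROV-O property names come from an existing vocabulary.

```python
import hashlib
import json


def slice_digest(rows):
    """SHA-256 over a canonical JSON serialisation of the data slice."""
    payload = json.dumps(rows, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def make_evidence_object(query, rows, source, contributors):
    """Bundle query, data slice, provenance and attribution (assumed schema)."""
    digest = slice_digest(rows)
    return {
        "@context": {
            "prov": "http://www.w3.org/ns/prov#",
            "ex": "https://example.org/evidence#",  # placeholder namespace
        },
        "@id": f"urn:sha256:{digest}",  # placeholder identifier scheme
        "@type": "ex:EvidenceObject",
        "ex:query": query,
        "ex:dataSlice": rows,
        "ex:sha256": digest,
        "prov:wasDerivedFrom": source,
        "prov:wasAttributedTo": contributors,
    }


def verify(obj):
    """Check that the embedded data slice still matches the recorded hash."""
    return slice_digest(obj["ex:dataSlice"]) == obj["ex:sha256"]


evidence = make_evidence_object(
    query="SELECT gene, value FROM expression WHERE value > 5",
    rows=[{"gene": "G1", "value": 7}],
    source="https://example.org/dataset/v1",
    contributors=["https://example.org/people/curator-1"],
)
```

Deriving the identifier from the content hash is one possible design choice: it makes objects self-verifying, at the cost of needing a separate mechanism to link successive versions of the same slice.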

Supervisors: G. Silvello

Title: Ontology-Guided Clinical Information Extraction from Italian Emergency Department Reports

Themes: Knowledge Representation, Health Informatics

Description: Clinical notes in emergency settings are often terse, unstructured, and rich in domain-specific abbreviations, especially in low-resource languages like Italian. This thesis will address the challenge of extracting structured medical knowledge from such texts using an ontology-driven pipeline. The student will design and implement a hybrid information extraction system combining rule-based components (for handling negations and clinical patterns) with fine-tuned transformer models (e.g., Italian BERT). A portion of annotated ED notes will be used to evaluate system performance. The student will contribute to building a reusable component of the DE-ESCALATE pipeline and generate outputs compatible with semantic representations (e.g., RDF triples), ultimately enabling downstream reasoning and predictive modeling on emergency department data.

Supervisors: G. Silvello

Title: Multimodal Knowledge Graph Construction and Embedding for Clinical Outcome Prediction

Themes: Knowledge Representation, Health Informatics

Description: Emergency department records often include fragmented information spread across structured fields, clinical notes, and diagnostic images. This thesis will focus on building patient-centric knowledge graphs that integrate entities and relations extracted from these heterogeneous sources. The student will design a graph schema aligned with the DE-ESCALATE ontology, implement data integration routines, and apply graph embedding techniques (e.g., RDF2Vec, GraphSAGE) to convert these graphs into numerical representations. These embeddings will be used to train and evaluate outcome prediction models for heart failure patients (e.g., 30-day mortality or readmission). The project offers hands-on experience in graph construction, machine learning, and explainable AI, with the opportunity to contribute to a real-world, multimodal healthcare application.

Supervisors: G. Silvello

Title: Design of a FAIR-Compliant Biomedical Data Platform for Clinical NLP and Predictive Modeling

Themes: Knowledge Representation, Health Informatics

Description: Clinical AI research is often hindered by poor data accessibility and lack of reproducible infrastructure. This thesis will support the DE-ESCALATE project by building a FAIR-compliant data platform to manage and share multimodal ED data (structured, textual, and imaging-derived). The student will design metadata schemas, implement data pipelines for anonymization and versioning, and integrate access control mechanisms for secure distribution. They will also support interoperability with annotation tools like MetaTron and create leaderboard-ready benchmarks for future CLEF evaluation tasks. The work will provide practical experience in biomedical data engineering and result in a reusable platform enabling collaborative research across institutions and languages.

Supervisors: G. Silvello

Title: Privacy-Preserving Federated Analytics for Multimodal Health Data

Themes: Knowledge Representation, Privacy

Description: Background: Modern biomedical research increasingly relies on combining clinical, imaging, and genomics data across institutions, but legal, ethical, and technical constraints prevent data centralization. The HEREDITARY project addresses this challenge through federated infrastructures and privacy-preserving analytics compliant with GDPR and EHDS requirements. Work to be done: The student will design and implement federated analytics and learning workflows over distributed multimodal datasets, integrating techniques such as secure aggregation, differential privacy, and privacy-aware query execution. Performance and privacy trade-offs will be evaluated using representative HEREDITARY use cases. Expected achievements: The thesis will deliver a validated prototype or experimental framework demonstrating scalable, secure analytics over sensitive health data, with quantitative evaluation of efficiency, privacy guarantees, and analytical accuracy, and clear guidelines for deployment in real-world federated environments.
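One of the techniques named above, secure aggregation, can be sketched in a few lines of Python: every pair of sites shares a random mask that one adds and the other subtracts, so individual contributions are hidden while the masks cancel in the sum. This is a toy illustration under strong assumptions: the seeded `random.Random` stands in for real pairwise key agreement and is not cryptographically secure, site dropouts are ignored, and all names are hypothetical.

```python
import random

MODULUS = 2**32  # all arithmetic is done modulo this value


def masked_updates(values, seed=0):
    """Return each site's value hidden behind pairwise cancelling masks."""
    rng = random.Random(seed)  # stand-in for pairwise key agreement (not secure)
    masked = [v % MODULUS for v in values]
    for i in range(len(values)):
        for j in range(i + 1, len(values)):
            mask = rng.randrange(MODULUS)
            masked[i] = (masked[i] + mask) % MODULUS  # site i adds the shared mask
            masked[j] = (masked[j] - mask) % MODULUS  # site j subtracts it
    return masked


def aggregate(masked):
    """The coordinator only sums the masked values; the masks cancel out."""
    return sum(masked) % MODULUS


# Hypothetical per-site patient counts for some cohort query.
site_counts = [12, 7, 30]
hidden = masked_updates(site_counts)
total = aggregate(hidden)
```

The coordinator learns the total but not the per-site counts. Combining this with differential privacy would additionally bound what the total itself reveals, at some cost in accuracy, which is the kind of trade-off the thesis is meant to quantify.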

Supervisors: G. Silvello

Title: Semantic Query Optimization in Federated Polystore Systems for Biomedical Data

Themes: Data Integration

Description: Background: HEREDITARY relies on ontology-based data access (OBDA) and polystore architectures to enable unified querying of heterogeneous biomedical data sources, including structured records, graphs, images, and genomics repositories. However, naïve semantic query translation can lead to inefficient execution in federated settings. Work to be done: The student will study semantic query rewriting and unfolding techniques and implement optimization strategies that consider data locality, heterogeneity of storage engines, network costs, and privacy constraints. The work will include developing and testing cost models and execution strategies on selected HEREDITARY workloads. Expected achievements: The thesis will produce optimized query execution methods with measurable performance improvements over baseline approaches, along with a reusable optimization module or evaluation framework applicable to large-scale multimodal biomedical data systems.

Supervisors: G. Silvello

Title: Visual Analytics for Explainable Multimodal AI in Health Data Integration

Themes: Data Integration

Description: Background: Multimodal AI models used in HEREDITARY generate complex results that are difficult for clinicians and policymakers to interpret without appropriate visualization and interaction tools. Explainability, uncertainty awareness, and provenance tracking are essential to foster trust and adoption of AI-based health analytics. Work to be done: The student will design and implement interactive visual analytics components that integrate semantic data, model outputs, uncertainty measures, and metadata from multimodal learning pipelines. Prototypes will be developed and evaluated through selected HEREDITARY case studies, with attention to usability and interpretability. Expected achievements: The thesis will deliver a functional visual analytics prototype and an evaluation demonstrating improved interpretability of multimodal AI results, contributing design guidelines and technical components for trustworthy and user-centered health analytics systems.

Supervisors: G. Silvello

Title: Query Obfuscation for Neural IR

Themes: Information Retrieval, Privacy, Large Language Models

Description: When we use a search engine, we may end up disclosing private information. In fact, queries that involve diseases and symptoms, locations, or financial matters are among the most popular queries issued to search engines. Query obfuscation consists of taking a private query and transforming it so that it discloses less of the original information need while still retrieving relevant information.

During the research internship, the student will work on developing query obfuscation techniques, especially for neural IR systems. Furthermore, the student will investigate how to use Large Language Models (LLMs) to generate such obfuscations.

Supervisors: G. Faggioli, N. Ferro
