Collective contextual anomaly detection in complex networks
Authorship
M.A.F.
Máster Universitario en Computación de Altas Prestaciones / High Performance Computing from the University of A Coruña and the University of Santiago de Compostela
Defense date
09.02.2025 10:30
Summary
Complex networks are ubiquitous across diverse domains: from human-made systems such as the Internet and banking transaction networks to natural phenomena like the brain’s neuronal networks. Their analysis has become increasingly important. Anomaly detection, and especially contextual collective anomaly detection, is one of the most powerful tools for extracting meaningful insights from these structures. However, the high time and space complexity of existing methods prevents their application to large-scale networks, limiting their practical utility. In this work, we propose a scalable framework that leverages Apache Spark for parallel processing to detect contextual collective anomalies in large complex networks.
Direction
Fernández Pena, Anselmo Tomás (Tutorships)
PICHEL CAMPOS, JUAN CARLOS (Co-tutorships)
Court
GARCIA LOUREIRO, ANTONIO JESUS (Chairman)
QUESADA BARRIUSO, PABLO (Secretary)
González Domínguez, Jorge (Member)
Improvement study of an aluminum sulfate production plant
Authorship
N.A.L.
Master in Chemical and Bioprocess Engineering
Defense date
09.09.2025 16:30
Summary
The aluminum cation Al3+ is one of the most effective cations for neutralizing the negative charge present in organic matter. This neutralization promotes the formation of flocs composed of the Al3+ cation and negatively charged colloids of organic matter: several particles coalesce into a more voluminous and heavier structure (the floc), capable of settling by gravity or being separated by filtration. The use of coagulants and flocculants is therefore a crucial step in the water purification process. In this context, aluminum sulfate is used in many domestic and industrial processes. Its main use is as a flocculant for the coagulation and flocculation of the colloidal material present in collected water, facilitating the subsequent separation of the flocs formed and achieving potable water quality for distribution that meets the public health and organoleptic requirements for human consumption under current regulations.
Direction
GARRIDO FERNANDEZ, JUAN MANUEL (Tutorships)
Court
ROCA BORDELLO, ENRIQUE (Chairman)
Montes Piñeiro, Carlos (Secretary)
HOSPIDO QUINTANA, ALMUDENA (Member)
Cancer drug response prediction through the integration of multi-omic data: A Machine Learning-based approach
Authorship
L.A.M.
Master in Massive Data Analysis Technologies: Big Data
Defense date
09.09.2025 19:00
Summary
A primary goal in cancer research is to identify the genetic alterations that drive tumorigenesis and to predict drug response in order to develop targeted therapies that exploit these changes. In this context, multi-omics factors have been demonstrated to play a fundamental role. As these data are inherently heterogeneous, their integration is a crucial step in multi-omics analysis. In this work, a complete workflow for cancer compound response prediction through the integration of multi-omics data is reported: from the ETL pipeline for data cleansing to a vertical integration with feature selection. Finally, GACE-1 (Graph-based Attention model for Compound-response Estimation) is presented as a novel approach to drug response prediction, capable of handling heterogeneous graphs with edge features by applying multi-head attention over the different node types. The proposed unsupervised feature selection method proves its capability to provide a sufficiently informative subset of features, assessed by an improvement in the mean and median of the correlation with the IC50 distribution. The GACE-1 model is trained to predict responses to unseen compounds and evaluated with common regression metrics. Overall, GACE-1 achieves high performance in the drug response prediction task, outperforming several models in the literature in terms of correlation.
Direction
VIDAL AGUIAR, JUAN CARLOS (Tutorships)
Calvo Almeida, Shaila (Co-tutorships)
Court
GARCIA POLO, FRANCISCO JAVIER (Chairman)
FERNANDEZ PICHEL, MARCOS (Secretary)
López Martínez, Paula (Member)
Environmental assessment of the Galician garden: tomato case study
Authorship
A.E.A.J.
Master in Environmental Engineering (3rd ed)
Defense date
09.08.2025 11:00
Summary
In Galicia, the cultivation of local tomato varieties such as Negro de Santiago, Avoa de Osedo, and Corazón de Buey has sparked growing interest among farmers and consumers. These varieties not only stand out for their unique flavor, highly valued by the most discerning palates, but also contribute positively to the regional economy. In particular, the Negro de Santiago tomato is prized for its late harvest, between three and four months, which allows high-quality local tomatoes to be available on the market for a longer period. This study aims to evaluate the production of these three native varieties from both an environmental and an economic perspective, in order to demonstrate the benefits associated with the consumption of local products in Galician households compared to conventional varieties. To this end, six representative farms were selected, with areas between 0.5 and 1 hectare, located in municipalities with significant tomato production (Betanzos, Bergondo, Cerceda, Coristanco, and Laracha), taking 1 kg of tomato as the functional unit. The research is based on the Life Cycle Assessment (LCA) methodology, complemented by a specific questionnaire designed to collect inventory data and all the information necessary for a comprehensive assessment of the production scenarios. The results obtained are compared with those of commercial tomato production systems documented in the scientific literature. The main tasks of the study are: (1) collection of inventory data on the cultivation of the selected varieties; (2) analysis of the environmental profile and identification of critical points; (3) comparison of environmental performance with reference studies on non-local varieties. The results showed that native tomato cultivation under organic production systems presents a better environmental profile than conventional systems, especially in categories such as climate change, eutrophication, and human toxicity.
However, conventional scenarios showed higher economic returns due to their high productivity per hectare. This finding highlights the need to consider both environmental impacts and economic benefits in agronomic decision-making, seeking a balance between sustainability and financial viability.
Direction
GONZALEZ GARCIA, SARA (Tutorships)
Court
Rojo Alboreca, Alberto (Chairman)
PARADELO NUÑEZ, REMIGIO (Secretary)
GIL GONZALEZ, ALVARO (Member)
Parallel Training of Kolmogorov-Arnold Networks: Performance Benchmark and Analysis on High-Performance Computing Systems
Authorship
G.C.
Máster Universitario en Computación de Altas Prestaciones / High Performance Computing from the University of A Coruña and the University of Santiago de Compostela
Defense date
09.03.2025 11:00
Summary
This thesis presents a comprehensive performance benchmark and analysis of parallel training strategies for Kolmogorov-Arnold Networks (KANs), a novel neural network architecture that replaces traditional linear weights with learnable univariate functions. As KANs demonstrate superior interpretability and efficiency compared to Multi-Layer Perceptrons (MLPs), understanding their parallel training characteristics becomes crucial for scaling these networks to larger problems and datasets. The research investigates four key aspects of KAN parallel training: strong scaling performance, weak scaling behavior, communication overhead analysis, and model scaling characteristics. Experiments were conducted on the FinisTerrae III supercomputer at CESGA, utilizing up to 8 GPUs across multiple nodes with PyTorch's Distributed Data Parallel (DDP) framework. Key findings include: (1) KANs achieve 74.7% parallel efficiency at 8 GPUs with a speedup of 5.97x, demonstrating good strong scaling properties; (2) weak scaling analysis reveals excellent stability with only a 9.7% coefficient of variation in per-GPU throughput; (3) communication overhead remains low, ranging from 1.3% for intra-node to 6.1% for inter-node configurations; (4) model scaling analysis shows that while larger KANs consume more memory, they achieve better memory utilization efficiency. The thesis provides practical guidelines for deploying KAN training on HPC systems and establishes performance baselines for future research. The comprehensive performance evaluation framework developed can be extended to other neural network architectures and distributed training scenarios.
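The strong-scaling figures quoted in the abstract follow from two standard definitions; the sketch below (illustrative only, not part of the thesis code) shows how speedup and parallel efficiency relate:

```python
# Illustrative only: recomputes the strong-scaling metrics quoted in the
# abstract, not code from the thesis itself.

def speedup(t_serial: float, t_parallel: float) -> float:
    """Strong-scaling speedup: single-worker time over parallel time."""
    return t_serial / t_parallel

def parallel_efficiency(s: float, n_workers: int) -> float:
    """Fraction of ideal linear speedup achieved on n_workers."""
    return s / n_workers

# The abstract reports a 5.97x speedup on 8 GPUs, i.e. roughly 74.6-74.7%
# efficiency (74.7% presumably comes from the unrounded speedup).
eff = parallel_efficiency(5.97, 8)
print(f"{eff:.1%}")  # → 74.6%
```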
Direction
QUESADA BARRIUSO, PABLO (Tutorships)
García Selfa, David (Co-tutorships)
Court
CABALEIRO DOMINGUEZ, JOSE CARLOS (Chairman)
PIÑEIRO POMAR, CESAR ALFREDO (Secretary)
González Domínguez, Jorge (Member)
Life cycle assessment of Profand Zaragoza S.L.U.'s salmon escalope
Authorship
V.G.C.S.
Master in Environmental Engineering (3rd ed)
Defense date
09.08.2025 12:30
Summary
This Master's Thesis presents the first comprehensive Life Cycle Assessment (LCA) of the salmon escalope produced at Profand Zaragoza. The cradle-to-gate study encompassed salmon farming in Norway, refrigerated transport to Spain, portioning into salmon escalopes, packaging in trays containing 85% recycled PET, and distribution to Spanish wholesalers. The functional unit (FU) chosen was 1 kg of ready-to-ship escalope. Impact was quantified using the 100-year Global Warming Potential (GWP100), drawing mainly on emission factors from Ecoinvent 3.10. A carbon footprint of 4.89 kg CO2-eq/FU was calculated, consistent with the literature and representing a medium impact relative to other meat products. Approximately 85% of emissions arise in the upstream phase: aquaculture and transport to Zaragoza are the dominant hotspots. Processing and packaging account for 13%, driven by electricity consumption, natural gas boilers and packaging polymers, while national distribution adds only a marginal share. A sensitivity analysis showed that the overall footprint can increase by 12% depending on the impact estimate used for the aquaculture phase, which confirms the need for more detailed primary inventories. The main limitations stem from scarce farm-specific data, the exclusion of other environmental categories relevant to seafood products, and the simplified treatment of refrigerant impacts. Nevertheless, the compiled inventory serves as an operational baseline for Profand, enabling environmental improvement and future scenario modelling. Four priority improvement strategies have been identified: collecting primary aquaculture data, incorporating renewable energy and heat recovery, packaging ecodesign, and enhanced waste management and traceability. A follow-up LCA is proposed once these measures are implemented, to quantify progress and move towards product certification.
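The stage breakdown reported above (total footprint split into upstream, processing, and distribution shares) amounts to simple contribution analysis over a per-stage inventory. A minimal sketch, with hypothetical per-stage numbers chosen only to reproduce the quoted shares, not Profand's actual inventory:

```python
# Hypothetical per-stage emissions (kg CO2-eq per FU = 1 kg of escalope),
# chosen to match the ~85% / 13% / ~2% split quoted in the abstract.
stages_kg_co2eq = {
    "upstream (aquaculture + transport to Zaragoza)": 4.16,
    "processing and packaging": 0.64,
    "national distribution": 0.09,
}

# Total footprint and each stage's share of it
total = sum(stages_kg_co2eq.values())  # ~4.89 kg CO2-eq/FU
for stage, value in stages_kg_co2eq.items():
    print(f"{stage}: {value / total:.0%}")
```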
Direction
HOSPIDO QUINTANA, ALMUDENA (Tutorships)
Mella Arguerey, María de las Mercedes (Co-tutorships)
Court
FERNANDEZ ESCRIBANO, JOSE ANGEL (Chairman)
MAURICIO IGLESIAS, MIGUEL (Secretary)
EIBES GONZALEZ, GEMMA MARIA (Member)
Comparative Environmental Assessment between Conventional Desalination and the REWAISE System using Life Cycle Assessment
Authorship
M.C.G.
Master in Environmental Engineering (3rd ed)
Defense date
09.08.2025 12:00
Summary
This Master's Thesis presents a comparative environmental assessment between conventional desalination and the innovative REWAISE system using Life Cycle Assessment (LCA). This novel system integrates advanced technologies to recover water and valuable materials from seawater brine. The results show that the REWAISE system offers a significant reduction in impacts across several key categories, including climate change, human toxicity, and resource use, thanks to its circular and integrative approach. The work uses detailed modelling with the SimaPro software and the ReCiPe methodology, and proposes a system expansion approach that allows comparison between a conventional system without coproducts and an innovative one that generates valuable coproducts. The conclusions reinforce the environmental feasibility of the REWAISE model as a more sustainable alternative aligned with the European objectives of ecological transition and circular economy.
Direction
HOSPIDO QUINTANA, ALMUDENA (Tutorships)
Fernández Braña, Álvaro (Co-tutorships)
Court
FERNANDEZ ESCRIBANO, JOSE ANGEL (Chairman)
MAURICIO IGLESIAS, MIGUEL (Secretary)
EIBES GONZALEZ, GEMMA MARIA (Member)
Evaluation of the Quality of Urban Waste Compost for Its Use as an Organic Amendment
Authorship
D.F.L.
Master in Environmental Engineering (3rd ed)
Defense date
09.08.2025 10:00
Summary
This study analyses the quality of five organic composts derived from urban and plant waste, with the aim of establishing their suitability as amendments to agricultural and urban soils. Physicochemical analyses, soil respiration tests, and plant growth trials were performed using two soil types with distinct properties: one agricultural (botanical) and the other urban (subsoil). The results indicated that composts produced from municipal solid waste (MSW) present disadvantages due to their lack of maturity, higher electrical conductivity, and, in some cases, elevated levels of heavy metals, whereas the plant-based composts exhibit better characteristics for agricultural use. In the mineralization tests, both the origin of the compost and the soil were found to have a considerable impact on microbial activity, with higher respiration rates in the agricultural soil. Plant growth tests showed variations between species: while ryegrass responded positively to MSW composts, barley showed reduced yields, suggesting their limited suitability for more demanding crops. Owing to these restrictions, MSW composts are not suitable for growing in pots, but when used in agricultural soils in Galicia they should not pose a major problem. Furthermore, their potential use for environmental restoration should not be ruled out.
Direction
PARADELO NUÑEZ, REMIGIO (Tutorships)
Court
PRIETO LAMAS, BEATRIZ LORETO (Chairman)
GIL GONZALEZ, ALVARO (Secretary)
BALBOA MENDEZ, SABELA (Member)
Estimating future customer value using Markov chains and HPC
Authorship
J.D.G.N.
Máster Universitario en Computación de Altas Prestaciones / High Performance Computing from the University of A Coruña and the University of Santiago de Compostela
Defense date
09.03.2025 10:30
Summary
Customer Lifetime Value (CLV) is a crucial indicator in the financial sector for strategic decision-making and the personalization of products and services. This project proposes combining the predictive power of Markov chains with the processing capabilities of High-Performance Computing (HPC) architectures to estimate it. The goal is to design and implement an efficient Markov-chain-based algorithm that can model customer transitions between different possible states and project their value over time. Implementing this in HPC environments will enable complex simulations to be performed in reduced times, overcoming the limitations of traditional approaches.
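The idea of modelling state transitions and projecting value over time can be sketched in a few lines. This is a generic, illustrative Markov-chain CLV computation with made-up states, probabilities, and per-period values, not the thesis's actual algorithm or data:

```python
# Illustrative sketch: expected discounted value of a customer whose state
# evolves according to a Markov transition matrix.

def clv(p, rewards, start, horizon, discount=0.95):
    """Expected discounted value over `horizon` periods.
    p[i][j]    = probability of moving from state i to state j,
    rewards[i] = expected value generated per period in state i,
    start[i]   = initial state distribution."""
    n = len(rewards)
    dist = list(start)
    total = 0.0
    for t in range(horizon):
        # expected reward this period, discounted to present value
        total += (discount ** t) * sum(dist[i] * rewards[i] for i in range(n))
        # propagate the state distribution one step through the chain
        dist = [sum(dist[i] * p[i][j] for i in range(n)) for j in range(n)]
    return total

# Hypothetical states: active, dormant, churned (absorbing); numbers invented.
P = [[0.8, 0.15, 0.05],
     [0.3, 0.5,  0.2 ],
     [0.0, 0.0,  1.0 ]]
value = clv(P, rewards=[100.0, 10.0, 0.0], start=[1.0, 0.0, 0.0], horizon=24)
```

In an HPC setting, the same computation is embarrassingly parallel across customers (or customer segments), which is what makes it a natural fit for distributed execution.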
Direction
QUESADA BARRIUSO, PABLO (Tutorships)
Mazaira Gómez, Juan Manuel (Co-tutorships)
Court
CABALEIRO DOMINGUEZ, JOSE CARLOS (Chairman)
PIÑEIRO POMAR, CESAR ALFREDO (Secretary)
González Domínguez, Jorge (Member)
Influence of different compost types on the mobility of copper, lead, arsenic, and phosphorus in urban gardens of Santiago de Compostela
Authorship
W.H.G.A.
Master in Environmental Engineering (3rd ed)
Defense date
09.08.2025 10:30
Summary
The use of compost as an organic amendment in soils for agriculture is an increasingly common and highly important practice. This approach promotes the circular economy, improves waste management, and fosters sustainability in urban agriculture, aligning with the objectives of the 2030 Agenda for Sustainable Development. This study, conducted in Santiago de Compostela, focuses on analyzing the influence of different compost compositions on contaminant mobility in urban agricultural soils. Building upon previous experiments on copper (Cu) at 20 t/ha developed by the Department of Soil Science and Agricultural Chemistry at the University of Santiago de Compostela, this research was expanded to include application rates of 100 t/ha and to investigate additional elements: lead (Pb), arsenic (As), and phosphorus (P). Adsorption experiments were carried out using urban garden soils from Belvís de Arriba (B5), Belvís de Abaixo (H24), and Campo das Hortas (H16). Soils were treated with three types of compost: municipal solid waste compost from the SOGAMA plant (SOGAMA), municipal vegetable waste compost (Vegetal), and mixed compost consisting of green waste blended with domestic waste, collected and treated by the Santiago City Council (Mezcla), comparing them with unamended soils (Ningún). Isotherm modeling consistently revealed that the Freundlich model is more suitable for phosphorus, arsenic, and copper, reflecting the heterogeneous nature of adsorption and the complexity of interactions with organic matter. For phosphorus, increasing the compost application rate from 20 t/ha to 100 t/ha generally reduced the adsorption capacity (Kf), indicating a potential leaching risk. Similarly, for arsenic and copper, high application rates (100 t/ha) tended to decrease the maximum adsorption capacity (Qm), even though the affinity (KL) could increase, suggesting lower total retention.
Lead exhibited extremely low concentrations, which precluded robust modeling but indicated strong immobility of the element or undetectable release. In practical terms, the Vegetal compost proved to be the most balanced and safest option, maintaining low levels of arsenic and lead and improving phosphorus and copper adsorption in certain soils. The use of composts like Mezcla and SOGAMA requires intensive monitoring, especially at high application rates, due to the risk of contaminant release or mobilization. This research highlights the importance of understanding the mobility of multiple contaminants to enhance sustainable agricultural practices and environmental health, in addition to confirming that specific composts do not increase contaminant mobility, thereby ensuring their safe use in urban gardens.
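The isotherm parameters discussed in this abstract (the Freundlich coefficient Kf and the Langmuir maximum capacity Qm and affinity KL) can be illustrated with a short, self-contained Python sketch. The data below are synthetic, not the thesis measurements; the fitting routine is the standard linearised log-log regression for the Freundlich model:

```python
import math

def freundlich(C, Kf, n):
    """Freundlich isotherm: q = Kf * C**(1/n) (heterogeneous surfaces)."""
    return Kf * C ** (1.0 / n)

def langmuir(C, Qm, KL):
    """Langmuir isotherm: q = Qm*KL*C / (1 + KL*C) (monolayer, capped at Qm)."""
    return Qm * KL * C / (1.0 + KL * C)

def fit_freundlich(C_vals, q_vals):
    """Linearised fit: log q = log Kf + (1/n) log C, by ordinary least squares."""
    xs = [math.log(c) for c in C_vals]
    ys = [math.log(q) for q in q_vals]
    xbar = sum(xs) / len(xs)
    ybar = sum(ys) / len(ys)
    slope = (sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
             / sum((x - xbar) ** 2 for x in xs))
    intercept = ybar - slope * xbar
    return math.exp(intercept), 1.0 / slope  # (Kf, n)

# Synthetic equilibrium data from a known Freundlich soil (Kf=12, n=2.5)
C = [0.5, 1.0, 2.0, 5.0, 10.0, 20.0]            # equilibrium concentration, mg/L
q = [freundlich(c, 12.0, 2.5) for c in C]       # adsorbed amount, mg/kg
Kf_est, n_est = fit_freundlich(C, q)
```

A drop in the fitted Kf when moving from the 20 t/ha to the 100 t/ha series is exactly the pattern the abstract reports as a leaching-risk indicator.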
Direction
PARADELO NUÑEZ, REMIGIO (Tutorships)
Court
MOSQUERA CORRAL, ANUSKA (Chairman)
GIL GONZALEZ, ALVARO (Secretary)
PRIETO LAMAS, BEATRIZ LORETO (Member)
Optimization and Scalability of Distributed Fine-Tuning of LLMs on AWS Cloud-HPC Systems
Authorship
D.G.F.
Máster Universitario en Computación de Altas Prestaciones / High Performance Computing from the University of A Coruña and the University of Santiago de Compostela
Defense date
09.02.2025 11:00
Summary
In recent years, Large Language Models (LLMs) have become a cornerstone of modern artificial intelligence. However, their development and training present significant computational challenges, particularly when working with architectures that contain billions of parameters. This Master's Thesis aims to analyze the feasibility, performance, and scalability of distributed fine-tuning of LLMs using different distributed training frameworks in cloud environments. To this end, a comparative study is conducted between the frameworks DeepSpeed, PyTorch Fully Sharded Data Parallel (FSDP), and AWS Neuron Optimum, deployed across various instance configurations on Amazon Web Services (AWS). Through the execution of several experiments, metrics such as throughput, training loss, cost, and execution time are evaluated across multiple scenarios (single GPU, multi-GPU, and multi-node). The experimental results allow practical conclusions and recommendations to be drawn regarding which framework and architecture to use, depending on the available resources, budget, and project goals.
Direction
PICHEL CAMPOS, JUAN CARLOS (Tutorships)
Court
GARCIA LOUREIRO, ANTONIO JESUS (Chairman)
QUESADA BARRIUSO, PABLO (Secretary)
González Domínguez, Jorge (Member)
Improving rainfall prediction using machine learning
Authorship
S.M.G.S.
Master in Environmental Engineering (3rd ed)
Defense date
09.08.2025 13:15
Summary
Weather forecasting currently performed by meteorological agencies is based on the use of numerical models. These numerical models require significant computing infrastructure to run, and they provide predictions of different variables (temperature, precipitation, wind, etc.) at a certain temporal and spatial resolution for the coming days. An example of these models is WRF (Weather Research and Forecasting Model), which is used by MeteoGalicia to support its forecasting processes. In addition to the data generated by these models, meteorological agencies such as MeteoGalicia produce environmental observation data. In the case of rainfall, MeteoGalicia has a network of meteorological stations where precipitation (among many other variables) is measured, as well as a weather radar that generates observations throughout the territory with a specific spatial and temporal resolution. The aim of this work is to test the effectiveness of a machine learning model to improve the predictions generated by a model like WRF. For this purpose, different usage scenarios will be proposed in which one or more machine learning models will be generated using historical data from the stations and the models as training bases. The goal of these models will be to improve the predictions made by the WRF model in a specific area of the Galician region.
Direction
RIOS VIQUEIRA, JOSE RAMON (Tutorships)
VILLARROYA FERNANDEZ, SEBASTIAN (Co-tutorships)
Court
FERNANDEZ ESCRIBANO, JOSE ANGEL (Chairman)
Triñanes Fernández, Joaquín Ángel (Secretary)
HOSPIDO QUINTANA, ALMUDENA (Member)
Advanced Flight Predictive Analytics using PySpark and High-Performance Computing
Authorship
F.S.K.
Máster Universitario en Computación de Altas Prestaciones / High Performance Computing from the University of A Coruña and the University of Santiago de Compostela
Defense date
09.03.2025 11:30
Summary
This project aims to analyze and predict airline flight delays using PySpark and High-Performance Computing (HPC). It processes large-scale flight data through distributed computing to generate insights and predictive models. By leveraging machine learning, the project supports improved scheduling, risk assessment, and decision-making in aviation. Key objectives: analyze flight delay patterns using scalable and distributed systems; build predictive models (Linear Regression and Naive Bayes) using features such as time, day, and airline; enhance computational performance by utilizing HPC resources; create visualizations to support understanding and feature selection; and evaluate system performance and scalability as data volume increases.
Direction
QUESADA BARRIUSO, PABLO (Tutorships)
LÓPEZ TABOADA, GUILLERMO (Co-tutorships)
Court
CABALEIRO DOMINGUEZ, JOSE CARLOS (Chairman)
PIÑEIRO POMAR, CESAR ALFREDO (Secretary)
González Domínguez, Jorge (Member)
Analysis and start up of a pilot plant using hybrid technologies for the reuse of nitrogen-rich effluents for the production of ammonia gas
Authorship
M.M.S.
Master in Environmental Engineering (3rd ed)
Defense date
09.08.2025 17:00
Summary
This paper explores the evolution of wastewater treatment plants toward circular models focused on resource recovery, highlighting the valorization of waste streams rich in ammonia nitrogen. A pilot plant for the recovery of ammonia gas using a hybrid membrane contactor and pressure swing adsorption (PSA) technology is presented. The stabilization, membrane contactor optimization, and PSA optimization phases are detailed. The use of ion-selective electrodes was validated for high-pH samples. The study confirms the technical feasibility of this hybrid technology for recovering ammonia from complex waste matrices, identifying key parameters and challenges for its scaling.
Direction
MOSQUERA CORRAL, ANUSKA (Tutorships)
Santos Taboada, Antón (Co-tutorships)
Court
FEIJOO COSTA, GUMERSINDO (Chairman)
ALDREY VAZQUEZ, JOSE ANTONIO (Secretary)
López Romalde, Jesús Ángel (Member)
Techno-economic and environmental assessment of cellulose recovery in urban wastewater treatment plants (WWTPs)
Authorship
M.P.F.
Master in Chemical and Bioprocess Engineering
Defense date
09.09.2025 17:00
Summary
Sustainable water resource management and wastewater treatment are key pillars in the transition towards a circular economy. Traditionally, wastewater treatment plants (WWTPs) have operated under a linear model focused solely on sanitation. However, the sector is evolving towards the concept of water resource recovery facilities, where the recovery of resources such as energy, nutrients, and valuable materials is promoted, thereby reducing environmental impacts. In this context, this study aims to simulate different WWTP scenarios and investigate the impact of implementing rotating belt filter (RBF) technology. Key operational parameters, such as aeration energy consumption and sludge production, will be analysed, complemented by an environmental and economic assessment. Furthermore, the potential of cellulose recovery as a strategy to reduce the use of fossil materials, minimize environmental impacts, and promote more efficient and sustainable WWTP management will also be evaluated.
Direction
GONZALEZ GARCIA, SARA (Tutorships)
Ferreiro Crespo, Iago (Co-tutorships)
Court
ROCA BORDELLO, ENRIQUE (Chairman)
Montes Piñeiro, Carlos (Secretary)
HOSPIDO QUINTANA, ALMUDENA (Member)
Review and Identification of Best Practices in Power BI Semantic Models
Authorship
C.P.I.
Master in Massive Data Analysis Technologies: Big Data
Defense date
09.09.2025 19:30
Summary
Power BI has become a key tool for data analysis and visualization in business environments. However, its widespread adoption has highlighted the lack of a systematic and automatable framework for applying best practices in the design of semantic models. Although many of these best practices are well known within the technical community, they are often scattered and lack clear organization. This work proposes a validation system based entirely on DAX code, which uses internal views from the Analysis Services engine (Dynamic Management Views, DMVs) to automatically detect structural and performance-related issues in data models. A total of 29 best practices were designed, grouped by structural or technical aspect. To validate the approach, an experimental model was built, including intentional errors that were subsequently detected with the scripts developed. The results show that certain bad practices have a measurable negative impact on performance, with improvements of up to 38 percent observed after correction. While not all corrections led to noticeable gains in controlled environments, it became evident that improper configurations can degrade performance even in small-scale models, reinforcing the need for a preventive approach. A replicable quantitative evaluation methodology was also defined, based on the use of demanding visualizations and Power BI's Performance Analyzer, allowing developers to objectively measure the impact of best practices on their own reports. Overall, the proposed solution enables automated validation of quality standards, serves as a training and governance tool, and promotes a culture of continuous improvement and technical sustainability in Power BI report development.
Direction
Triñanes Fernández, Joaquín Ángel (Tutorships)
Jácome Rodríguez, Cristina (Co-tutorships)
Court
GARCIA POLO, FRANCISCO JAVIER (Chairman)
FERNANDEZ PICHEL, MARCOS (Secretary)
López Martínez, Paula (Member)
Environmental evaluation of the valorization process of olive oil by-products into phenolic extracts
Authorship
D.R.S.
Master in Environmental Engineering (3rd ed)
Defense date
09.08.2025 16:30
Summary
The production of virgin olive oil is becoming increasingly important worldwide, especially in Spain and other Mediterranean regions. However, the volume of waste generated in this process, such as olive mill waste (alpeorujo), a pollutant, phytotoxic and antimicrobial agent, is also very significant. For this reason, an option for the recovery of this waste will be examined, by means of a two-phase extraction system in which environmentally friendly solvents will be used, such as natural deep eutectic solvents (known as NADES). With this method, polyphenols can be obtained from olive mill waste. The aim of this work is to carry out a Life Cycle Analysis of the described process, from the elaboration of an inventory to the definition of the impacts and their evaluation. This will require the simulation of the system, which will lead to an environmental modelling in which the different impacts to be considered will be defined.
Direction
GONZALEZ GARCIA, SARA (Tutorships)
Court
FEIJOO COSTA, GUMERSINDO (Chairman)
ALDREY VAZQUEZ, JOSE ANTONIO (Secretary)
López Romalde, Jesús Ángel (Member)
Optimizing Vector Algorithm Processing through the integration of Ray IO and HPC in the Cloud
Authorship
T.S.O.
Máster Universitario en Computación de Altas Prestaciones / High Performance Computing from the University of A Coruña and the University of Santiago de Compostela
Defense date
09.03.2025 10:00
Summary
The adoption of vector embedding models is driven mainly by their ability to transform complex and unstructured data into a structured, machine-comprehensible format. This transformation is key to solving tasks that require a deep understanding of content, context, and subtle relationships within the data, making embeddings indispensable in the fields of artificial intelligence (AI) and machine learning. To manage the computational demands of processing large volumes of complex data (images, videos, audio, or unstructured text) with vector embeddings, intensive compute power is required. This necessitates efficient data management alongside software architectures and infrastructure such as distributed computing, dimensionality reduction, optimized model architectures, batch processing, hardware acceleration, cloud computing, and high-performance computing (HPC) resources. Accordingly, this work implements a cloud-based solution using Amazon Web Services (AWS) to ingest complex datasets for cleaning, standardization, enrichment, transformation, and metadata extraction via Ray IO. The results of these preprocessing steps are then forwarded to a virtual HPC cluster, where vector processing combines MPI, OpenMP, and GPU acceleration. For example, the pipeline might receive a set of videos in an S3 bucket; a REST API records the requested processing steps in an RDS database; a single API call triggers the Ray IO cluster to extract metadata, clean and enrich inputs via deep-learning models, apply vectorization algorithms, and partition the data. Once the videos are transformed into vectors, a Slurm command kicks off the HPC cluster, which loads the vectorized data and performs similarity measurements, data clustering, and pattern recognition, ultimately producing a final report summarizing the results.
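The similarity-measurement stage of the pipeline described above can be illustrated with a minimal, dependency-free sketch. The "embeddings" here are toy vectors, not outputs of the actual deep-learning models, and the brute-force ranking stands in for the MPI/OpenMP/GPU-accelerated version run on the HPC cluster:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def most_similar(query, corpus):
    """Rank corpus items by similarity to the query vector (brute force)."""
    scored = [(name, cosine_similarity(query, vec)) for name, vec in corpus.items()]
    return sorted(scored, key=lambda t: t[1], reverse=True)

# Toy 4-dimensional "video embeddings"
corpus = {
    "clip_a": [0.9, 0.1, 0.0, 0.2],
    "clip_b": [0.1, 0.8, 0.3, 0.0],
    "clip_c": [0.85, 0.15, 0.05, 0.25],
}
query = [0.88, 0.12, 0.02, 0.22]
ranking = most_similar(query, corpus)
```

In the distributed setting, the corpus would be the partitioned vector data handed off by the Ray IO cluster, with each HPC rank scoring its own partition before the results are gathered into the final report.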
Direction
QUESADA BARRIUSO, PABLO (Tutorships)
Pardo Martínez, Xoan Carlos (Co-tutorships)
Padrón González, Emilio J. (Co-tutorships)
Court
CABALEIRO DOMINGUEZ, JOSE CARLOS (Chairman)
PIÑEIRO POMAR, CESAR ALFREDO (Secretary)
González Domínguez, Jorge (Member)