Multi-Relational Learning with SQL All the Way

Author: 
Date created: 
2016-11-24
Identifier: 
etd9859
Keywords: 
Statistical Relational Learning (SRL)
Multi-Relational Database
FactorBase
Sufficient Statistics
Log-Linear Model
Bayesian networks (BNs)
Dependency Networks (DNs)
Link Analysis
Generative Modelling
Discriminative Learning
Abstract: 

Which doctors prescribe which drugs to which patients? Who upvotes which answers on what topics on Quora? Who follows whom on Twitter or Weibo? These relationships are all visible in data, and they contain a wealth of information that can be extracted as knowledge. Statistical Relational Learning (SRL) is a recent and growing field that extends traditional machine learning from a single table to multiple interrelated tables. It aims to provide integrated statistical analysis of heterogeneous, interdependent, complex data. In this thesis, I focus on modelling the interactions between different attributes and the links themselves in such heterogeneous and richly interconnected data. First, I describe the FactorBase system, which combines advanced analytics from SRL with database systems. Within FactorBase, all statistical objects are stored as first-class citizens alongside the raw data. This new SQL-based framework pushes multi-relational model discovery into a relational database management system. Second, to address the scalability issue of computing cross-table sufficient statistics, a new Virtual Join algorithm is proposed and implemented in FactorBase. Bayesian networks (BNs) and Dependency Networks (DNs) are two major classes of SRL models. Third, I use FactorBase to extend the state-of-the-art BN learning algorithm to generative modelling with link uncertainty. The learned model simultaneously captures correlations between link types, link features, and attributes of nodes. Finally, a fast hybrid approach is proposed for instance-level discriminative learning of DNs, with competitive predictive power but substantially better scalability.

Document type: 
Thesis
Rights: 
This thesis may be printed or downloaded for non-commercial research and scholarly purposes. Copyright remains with the author.
Senior supervisor: 
Oliver Schulte
Department: 
Applied Sciences: School of Computing Science
Thesis type: 
(Thesis) Ph.D.