Who learns better Bayesian network structures: Accuracy and speed of structure learning algorithms

Abstract

Three classes of algorithms to learn the structure of Bayesian networks from data are common in the literature: constraint-based algorithms, which use conditional independence tests to learn the dependence structure of the data; score-based algorithms, which use goodness-of-fit scores as objective functions to maximise; and hybrid algorithms that combine both approaches. Constraint-based and score-based algorithms have been shown to learn the same structures when conditional independence and goodness of fit are both assessed using entropy and the topological ordering of the network is known (Cowell, 2001). In this paper, we investigate how these three classes of algorithms perform outside the assumptions above in terms of speed and accuracy of network reconstruction for both discrete and Gaussian Bayesian networks. We approach this question by recognising that structure learning is defined by the combination of a statistical criterion and an algorithm that determines how the criterion is applied to the data. Removing the confounding effect of different choices for the statistical criterion, we find using both simulated and real-world complex data that constraint-based algorithms are often less accurate than score-based algorithms, but are seldom faster (even at large sample sizes); and that hybrid algorithms are neither faster nor more accurate than constraint-based algorithms. This suggests that commonly held beliefs on structure learning in the literature are strongly influenced by the choice of particular statistical criteria rather than just by the properties of the algorithms themselves.


Notes

Background and notation

Let $\mathbf{X} = \{X_1, \ldots, X_N\}$ be a set of random variables associated with the nodes of a directed acyclic graph (DAG) $\mathcal{G}$. We indicate with $A$ the set of arcs of $\mathcal{G}$. Graphical separation in $\mathcal{G}$ implies conditional independence between the respective variables. As a result, the following factorization holds:

$$P(\mathbf{X} \mid \Theta) = \prod_{i=1}^{N} P(X_i \mid \Pi_{X_i}, \Theta_{X_i}),$$

where $\Theta$ indicates the set of parameters of the global distribution of $\mathbf{X}$, $P(\mathbf{X} \mid \Theta)$. The global distribution decomposes into one local distribution for each $X_i$ (with parameters $\Theta_{X_i}$) conditional on its parents $\Pi_{X_i}$.

The DAG $\mathcal{G}$ does not uniquely identify a Bayesian Network (BN).

A v-structure in a BN is a pattern of arcs like $X_j \rightarrow X_i \leftarrow X_k$, in which $X_j$ and $X_k$ are not adjacent.

An equivalence class of BNs is defined by the same:

  • underlying undirected graph and
  • v-structures

Gaussian Bayesian Networks (GBNs) [5] assume that the $X_i$ are univariate normal random variables linked by linear dependencies to their parents,

$$X_i \mid \Pi_{X_i} \sim N\!\left(\mu_{X_i} + \Pi_{X_i} \beta_{X_i},\; \sigma^2_{X_i}\right),$$

in what is essentially a linear regression model of $X_i$ against the $\Pi_{X_i}$ with regression coefficients $\beta_{X_i}$. $\mathbf{X}$ is then multivariate normal, and we generally assume that its covariance matrix $\Sigma$ is positive definite. Equivalently [6], we can consider the precision matrix $\Omega = \Sigma^{-1}$ and parametrize $X_i \mid \Pi_{X_i}$ with the partial correlations

$$\rho_{ij} = \mathrm{Cor}\!\left(X_i, X_j \mid \mathbf{X} \setminus \{X_i, X_j\}\right)$$

between $X_i$ and each parent $X_j \in \Pi_{X_i}$ given the rest, since

$$\beta_{ij} = \rho_{ij} \sqrt{\frac{\Omega_{jj}}{\Omega_{ii}}}.$$
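To make the linear-regression view concrete, here is a minimal Python sketch that samples a hypothetical three-node GBN $X_1 \rightarrow X_3 \leftarrow X_2$ by drawing each node from a normal distribution given its parents; the graph and all coefficients are illustrative, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Hypothetical GBN X1 -> X3 <- X2, sampled in topological order:
# each node is normal given its parents, X_i = mu_i + Pi_i @ beta_i + eps_i.
x1 = rng.normal(loc=0.0, scale=1.0, size=n)
x2 = rng.normal(loc=1.0, scale=0.5, size=n)
x3 = 2.0 + 0.8 * x1 - 1.2 * x2 + rng.normal(scale=0.3, size=n)

X = np.column_stack([x1, x2, x3])
print(np.cov(X, rowvar=False))  # the joint sample is multivariate normal
```

Sampling in topological order is exactly the factorization into local distributions; stacking the samples yields a draw from the joint multivariate normal.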


Climate data (skin temperature) network case-study

The climate case study involves modeling global surface temperature anomalies using Bayesian networks (BNs). The study investigates the dependencies within climate data, including both local and long-range (teleconnected) spatial dependencies. Below is a breakdown of the procedure and the algorithms used:

Data Preparation

  1. Data Source:

• Monthly surface temperature values on a global 10° grid (approx. 1,000 km resolution) from the NCEP/NCAR reanalysis for the period 1981–2010.

  2. Preprocessing:

• Calculated anomalies by removing the mean annual cycle from the raw temperature data, month by month, over the 30-year period (a minimal sketch follows below).
• Represented each grid point as a node in the BN, resulting in 648 nodes (18 latitude × 36 longitude).
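A minimal sketch of the anomaly computation, assuming the reanalysis temperatures sit in a numpy array of shape (months, lat, lon) covering whole years; the function and variable names are illustrative.

```python
import numpy as np

def monthly_anomalies(temps):
    """Remove the mean annual cycle: subtract each calendar month's
    long-term mean from every observation of that month.

    temps: array of shape (n_months, n_lat, n_lon), with n_months % 12 == 0.
    """
    anomalies = np.empty_like(temps)
    for month in range(12):
        climatology = temps[month::12].mean(axis=0)  # long-term mean for this month
        anomalies[month::12] = temps[month::12] - climatology
    return anomalies

# 30 years of monthly fields on the 18 x 36 grid (648 nodes).
temps = np.random.default_rng(1).normal(size=(360, 18, 36))
anoms = monthly_anomalies(temps)
data = anoms.reshape(360, -1)  # one column per grid point / BN node
print(data.shape)  # (360, 648)
```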

Modeling Approach

  1. Assumptions:

• The temperature at each grid point follows a Gaussian distribution.
• BNs model spatial dependencies, where nodes represent grid points and arcs encode dependencies.

  2. Algorithm Choices:

• Compared constraint-based, score-based, and hybrid structure learning algorithms.
• Used the extended Bayesian Information Criterion ($\mathrm{BIC}_\gamma$) for flexibility in enforcing sparsity in the learned networks.

Structure Learning Algorithms Used

  1. Constraint-Based Algorithms:

• PC-Stable
• Grow-Shrink (GS)

  2. Score-Based Algorithms:

• Tabu Search
• Hill Climbing (HC)

  3. Hybrid Algorithms:

• Max-Min Hill Climbing (MMHC)
• H2PC

Adjustments for Complex Data

Recognized that constraint-based algorithms (e.g., PC-Stable, GS) struggle with complex climate data due to:

• High connectivity in locally dense regions.

• Conflicts in arc directions leading to invalid CPDAGs.

Introduced $\mathrm{BIC}_\gamma$ to enforce sparsity and address these issues:

• The regularization coefficient $\gamma$ penalizes the number of parameters, and hence the number of arcs.

Evaluation Metrics

  1. Accuracy:

• Log-likelihood of the learned BN.
• Analysis of long-distance arcs (teleconnections) and their suitability for inference.
• Conditional dependence structure, assessed via unshielded v-structures.

  2. Speed:

• Measured by the number of calls to the statistical criterion.

  3. Inference Validation:

• Tested propagation of El Niño-like evidence (e.g., high tropical Pacific temperatures) and its effect on regional probabilities (see the conditioning sketch below).
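In a GBN, propagating evidence amounts to conditioning a multivariate normal distribution. A minimal sketch, assuming the joint mean and covariance of the learned network are available; the toy numbers and the choice of evidence node are illustrative.

```python
import numpy as np

def condition_gaussian(mu, sigma, ev_idx, ev_values):
    """Condition N(mu, sigma) on X[ev_idx] = ev_values and return the
    conditional mean and covariance of the remaining variables."""
    q_idx = np.setdiff1d(np.arange(len(mu)), ev_idx)
    s_qq = sigma[np.ix_(q_idx, q_idx)]
    s_qe = sigma[np.ix_(q_idx, ev_idx)]
    s_ee = sigma[np.ix_(ev_idx, ev_idx)]
    k = s_qe @ np.linalg.inv(s_ee)
    mu_cond = mu[q_idx] + k @ (ev_values - mu[ev_idx])
    sigma_cond = s_qq - k @ s_qe.T
    return q_idx, mu_cond, sigma_cond

# Toy 3-node network: set "El Nino-like" evidence on node 0 and read off
# how the conditional means of the other nodes shift.
mu = np.zeros(3)
sigma = np.array([[1.0, 0.6, 0.2],
                  [0.6, 1.0, 0.3],
                  [0.2, 0.3, 1.0]])
q, m, s = condition_gaussian(mu, sigma, np.array([0]), np.array([2.0]))
print(q, m)  # nodes correlated with the evidence move towards it
```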

Key Observations

  1. Constraint-Based Algorithms:

• Best for small, sparse networks.
• Often fail to produce valid CPDAGs in dense, complex data.

  2. Score-Based Algorithms (Tabu Search and HC):

• Learned large, dense networks capturing both local and teleconnected dependencies.
• Performed better in propagating evidence and capturing global climatic phenomena.

  3. Hybrid Algorithms (MMHC, H2PC):

• Balanced speed and accuracy for moderately dense networks.
• Struggled to match the performance of score-based algorithms in larger networks.

Findings

• Only score-based algorithms effectively modeled complex data with teleconnections, crucial for understanding global climate variability.

• Constraint-based algorithms performed well in small-scale scenarios but failed to generalize to dense, real-world networks.

• The study underscored the importance of algorithm selection and parameter tuning for complex spatial data modeling.


Background

Bayesian Networks (BNs)

Bayesian networks (BNs) are graphical models representing the joint probability distribution over a set of random variables. The structure is defined by:

  • A Directed Acyclic Graph (DAG) where each node corresponds to a variable.
  • Conditional independence relationships between variables, encoded by the structure of $\mathcal{G}$.

The joint probability distribution factorizes as:

$$P(X_1, \ldots, X_N \mid \Theta) = \prod_{i=1}^{N} P(X_i \mid \Pi_{X_i}, \Theta_{X_i}),$$

where:

  • $\Pi_{X_i}$: set of parent nodes of $X_i$ in $\mathcal{G}$.
  • $\Theta_{X_i}$: parameters associated with $X_i$'s conditional distribution.
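As a small worked example, for the hypothetical chain $X_1 \rightarrow X_2 \rightarrow X_3$ the factorization reads:

$$P(X_1, X_2, X_3) = P(X_1)\, P(X_2 \mid X_1)\, P(X_3 \mid X_2).$$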


Structure Learning

Structure learning involves finding the DAG $\mathcal{G}$ that best explains the observed data $\mathcal{D}$. Learning a BN is typically decomposed into two tasks:

  1. Structure Learning: find the DAG $\mathcal{G}$ encoding the dependencies among the variables.
  2. Parameter Learning: estimate the parameters $\Theta$ given the learned structure.

Bayes' theorem splits this as:

$$P(\mathcal{G}, \Theta \mid \mathcal{D}) = P(\mathcal{G} \mid \mathcal{D}) \cdot P(\Theta \mid \mathcal{G}, \mathcal{D}),$$

where:

  • $P(\mathcal{G} \mid \mathcal{D})$: posterior probability of the structure.
  • $P(\Theta \mid \mathcal{G}, \mathcal{D})$: posterior probability of the parameters given the structure.


Bayesian Information Criterion (BIC)

The Bayesian Information Criterion (BIC) is a score function commonly used to evaluate a BN structure. It approximates the log-marginal likelihood as:

$$\mathrm{BIC}(\mathcal{G}; \mathcal{D}) = \sum_{i=1}^{N} \left[ \log P(X_i \mid \Pi_{X_i}, \Theta_{X_i}) - \frac{|\Theta_{X_i}|}{2} \log n \right],$$

where:

  • $n$: sample size.
  • $|\Theta_{X_i}|$: number of parameters of $X_i$'s conditional distribution.


Extended BIC ($\mathrm{BIC}_\gamma$)

To handle complex data while keeping the learned networks sparse, an extended version of BIC is used, incorporating an additional regularization term that penalizes the number of parameters:

$$\mathrm{BIC}_\gamma(\mathcal{G}; \mathcal{D}) = \sum_{i=1}^{N} \left[ \log P(X_i \mid \Pi_{X_i}, \Theta_{X_i}) - \frac{|\Theta_{X_i}|}{2} \log n - \gamma\, |\Theta_{X_i}| \log N \right],$$

where $\gamma \geq 0$ is a regularization coefficient ($\gamma = 0$ recovers the standard BIC).
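As a sketch, the node-wise contribution for a Gaussian node can be computed from the residual variance of the local regression on its parents. This follows the definitions above and is an illustration, not the paper's implementation; the γ penalty uses the form given above.

```python
import numpy as np

def gaussian_node_bic(x, parents, gamma=0.0, n_nodes=1):
    """BIC (and BIC_gamma) contribution of one Gaussian node.

    x: (n,) response; parents: (n, p) matrix of parent values (p may be 0);
    n_nodes: number of nodes N in the network, used by the gamma penalty.
    """
    n = len(x)
    design = np.column_stack([np.ones(n), parents]) if parents.size else np.ones((n, 1))
    beta, *_ = np.linalg.lstsq(design, x, rcond=None)
    resid = x - design @ beta
    sigma2 = resid @ resid / n                       # MLE of the residual variance
    loglik = -0.5 * n * (np.log(2 * np.pi * sigma2) + 1)
    k = design.shape[1] + 1                          # coefficients + variance
    return loglik - 0.5 * k * np.log(n) - gamma * k * np.log(n_nodes)

rng = np.random.default_rng(4)
x1 = rng.normal(size=500)
x2 = 0.7 * x1 + rng.normal(size=500)
print(gaussian_node_bic(x2, x1[:, None]))        # with the true parent: higher
print(gaussian_node_bic(x2, np.empty((500, 0)))) # without parents: lower
```

Summing these contributions over all nodes gives the network score that the score-based algorithms below maximize.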


Conditional Independence Tests

For Discrete BNs

The conditional independence of two variables $X$ and $Y$, given a set of variables $\mathbf{Z}$, is tested using:

  1. G-Test (Log-Likelihood Ratio Test):

$$G^2(X, Y \mid \mathbf{Z}) = 2 \sum_{x, y, z} n_{xyz} \log \frac{n_{xyz}\, n_z}{n_{xz}\, n_{yz}},$$

where $n_{xyz}$ are the observed frequencies of the configurations of $X$, $Y$, and $\mathbf{Z}$, and $n_{xz}$, $n_{yz}$, $n_z$ are the corresponding marginal counts.

  2. Pearson's Chi-Square Test:

$$\chi^2(X, Y \mid \mathbf{Z}) = \sum_{x, y, z} \frac{(n_{xyz} - m_{xyz})^2}{m_{xyz}}, \quad \text{with } m_{xyz} = \frac{n_{xz}\, n_{yz}}{n_z}.$$
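A minimal sketch of both discrete tests, assuming integer-coded categories (0, …, K-1); the function name is illustrative, and the conditioning variable is a single 1-D array (several conditioning variables would first be combined into one configuration index).

```python
import numpy as np
from scipy.stats import chi2

def discrete_ci_tests(x, y, z):
    """G^2 and Pearson X^2 statistics (with p-values) for X _||_ Y | Z."""
    g2 = x2 = 0.0
    df = 0
    for zval in np.unique(z):
        mask = z == zval
        obs = np.zeros((x.max() + 1, y.max() + 1))
        np.add.at(obs, (x[mask], y[mask]), 1)               # n_xyz for this z
        exp = np.outer(obs.sum(1), obs.sum(0)) / obs.sum()  # m_xyz = n_xz n_yz / n_z
        ok = exp > 0
        pos = ok & (obs > 0)
        g2 += 2 * (obs[pos] * np.log(obs[pos] / exp[pos])).sum()
        x2 += ((obs[ok] - exp[ok]) ** 2 / exp[ok]).sum()
        df += (obs.shape[0] - 1) * (obs.shape[1] - 1)
    return g2, x2, chi2.sf(g2, df), chi2.sf(x2, df)

rng = np.random.default_rng(2)
z = rng.integers(0, 2, 500)
x = (z + rng.integers(0, 2, 500)) % 2
y = (z + rng.integers(0, 2, 500)) % 2   # X and Y are linked only through Z
print(discrete_ci_tests(x, y, z))       # large p-values: independence holds
```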

For Gaussian BNs

Tests rely on partial correlation coefficients:

  • Gaussian Mutual Information Test:

$$\mathrm{MI}_g(X, Y \mid \mathbf{Z}) = -\frac{1}{2} \log\left(1 - \rho^2_{XY \mid \mathbf{Z}}\right),$$

where $\rho_{XY \mid \mathbf{Z}}$ is the partial correlation of $X$ and $Y$ given $\mathbf{Z}$; the statistic $2n\,\mathrm{MI}_g$ is asymptotically $\chi^2_1$ distributed.
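A minimal sketch of this test, computing the partial correlation from the inverse of the empirical correlation matrix; the function name is illustrative and the asymptotic χ²₁ approximation is the standard one, not necessarily the exact implementation used in the paper.

```python
import numpy as np
from scipy.stats import chi2

def gaussian_mi_test(data, i, j, cond):
    """Test X_i _||_ X_j | X_cond for jointly Gaussian data (n x p array)."""
    cols = [i, j] + list(cond)
    omega = np.linalg.inv(np.corrcoef(data[:, cols], rowvar=False))
    rho = -omega[0, 1] / np.sqrt(omega[0, 0] * omega[1, 1])  # partial correlation
    n = data.shape[0]
    stat = -n * np.log(1 - rho**2)        # 2n * MI_g, asymptotically chi^2_1
    return rho, stat, chi2.sf(stat, df=1)

rng = np.random.default_rng(3)
zc = rng.normal(size=(1000, 1))
d = np.hstack([zc + rng.normal(size=(1000, 1)),   # X depends on Z
               zc + rng.normal(size=(1000, 1)),   # Y depends on Z
               zc])
print(gaussian_mi_test(d, 0, 1, [2]))  # high p-value: X _||_ Y | Z
```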

Structural Hamming Distance (SHD)

The Structural Hamming Distance (SHD) measures the difference between the learned structure $\mathcal{G}$ and a reference structure $\mathcal{G}_{\mathrm{ref}}$; a minimal code sketch follows the list below. It is the count of:

  • Missing edges.
  • Extra edges.
  • Incorrectly oriented edges.
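A minimal sketch over adjacency matrices (A[i, j] = 1 for an arc i → j; an undirected CPDAG edge has both entries set). Each differing edge, whether missing, extra, or differently oriented, counts as one error; this is the usual convention, though implementations vary in details.

```python
import numpy as np

def shd(a, b):
    """Structural Hamming Distance between adjacency matrices a and b."""
    dist = 0
    n = a.shape[0]
    for i in range(n):
        for j in range(i + 1, n):
            ea = (a[i, j], a[j, i])   # state of the i-j edge in a
            eb = (b[i, j], b[j, i])   # state of the i-j edge in b
            if ea != eb:
                dist += 1             # missing, extra, or misoriented
    return dist

a = np.array([[0, 1, 0], [0, 0, 1], [0, 0, 0]])  # 0 -> 1 -> 2
b = np.array([[0, 1, 0], [0, 0, 0], [0, 1, 0]])  # 0 -> 1 <- 2
print(shd(a, b))  # 1: the 1-2 edge is oriented differently
```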

Simulation Metrics

  1. Goodness of Fit:

    • Log-likelihood of the data given the learned structure.
  2. Speed:

    • Measured by the number of calls to the scoring or testing criterion.
  3. Inference Validation:

    • Propagating evidence in the learned BN to validate teleconnections and other dependencies.

Climate Case Study Adjustments

Problem with Constraint-Based Algorithms

Constraint-based algorithms like PC-Stable struggle with dense, locally connected regions, leading to:

  1. Conflicting arc directions.
  2. Invalid CPDAGs (cannot be converted into valid DAGs).

Introduction of $\mathrm{BIC}_\gamma$

To address this, the regularized $\mathrm{BIC}_\gamma$ is used, with the independence test criterion adjusted accordingly: following the score-based definition of a test, $X$ and $Y$ are declared conditionally independent given $\mathbf{Z}$ whenever adding the corresponding arc does not improve the score,

$$X \perp\!\!\!\perp Y \mid \mathbf{Z} \iff \mathrm{BIC}_\gamma(X \mid \mathbf{Z} \cup \{Y\}) - \mathrm{BIC}_\gamma(X \mid \mathbf{Z}) \leq 0.$$


This mathematical foundation supports structure learning for climate modeling, enabling better handling of complex data.


Climate Case Study: Algorithms Used

Constraint-Based Algorithms

Constraint-based algorithms learn the structure of Bayesian Networks (BNs) by identifying independence relationships among variables using conditional independence (CI) tests.

1. PC-Stable Algorithm

The PC-Stable algorithm is an improved version of the PC algorithm, ensuring consistent results regardless of variable order.

Procedure:

  1. Initialization:

    • Start with a fully connected undirected graph over the variables.
  2. Edge Removal (Skeleton Discovery):

    • For each pair of nodes $X_i$ and $X_j$, perform CI tests with conditioning sets of increasing size.
    • Remove the edge if $X_i$ and $X_j$ are conditionally independent given some set $\mathbf{S}$.
  3. Orient Edges (V-Structure Discovery):

    • Identify v-structures $X_j \rightarrow X_i \leftarrow X_k$, where $X_j$ and $X_k$ are not adjacent and $X_i$ is their common neighbor.
  4. Propagation of Edge Directions:

    • Apply rules to orient remaining edges without introducing cycles.
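A compact sketch of the skeleton phase (steps 1–2), assuming a generic ci_test(data, i, j, cond) that returns a p-value, such as the Gaussian test sketched earlier; the "stable" part is that each node's neighbor set is frozen at the start of every level, so the result does not depend on the order of the variables.

```python
from itertools import combinations

def pc_stable_skeleton(data, n_nodes, ci_test, alpha=0.05, max_cond=3):
    """Skeleton phase of PC-Stable: start fully connected, prune edges."""
    adj = {i: set(range(n_nodes)) - {i} for i in range(n_nodes)}
    for level in range(max_cond + 1):
        frozen = {i: set(adj[i]) for i in adj}   # freeze neighbors per level
        for i in range(n_nodes):
            for j in list(adj[i]):
                if j < i:
                    continue                     # test each pair once
                others = frozen[i] - {j}
                for cond in combinations(sorted(others), level):
                    if ci_test(data, i, j, list(cond)) > alpha:
                        adj[i].discard(j)        # independent: drop the edge
                        adj[j].discard(i)
                        break
    return adj
```

V-structure identification and the orientation rules (steps 3–4) would then operate on this skeleton.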

2. Grow-Shrink (GS) Algorithm

The GS algorithm is another constraint-based method focused on local structure learning.

Procedure:

  1. Forward Phase (Grow):

    • Identify the Markov blanket of each variable $X_i$ by iteratively adding variables that are dependent on $X_i$.
  2. Backward Phase (Shrink):

    • Remove false positives from the Markov blanket by performing CI tests.
  3. Structure Construction:

    • Combine local Markov blankets into a global structure, followed by orientation of edges.
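A minimal sketch of the grow and shrink phases for one variable, again assuming a generic ci_test(data, i, j, cond) returning a p-value; "dependent" here means rejecting independence at level alpha.

```python
def grow_shrink_mb(data, target, n_nodes, ci_test, alpha=0.05):
    """Grow-Shrink estimate of the Markov blanket of one variable."""
    mb = []
    changed = True
    while changed:                        # Grow: add variables dependent on
        changed = False                   # target given the current blanket.
        for v in range(n_nodes):
            if v != target and v not in mb and ci_test(data, target, v, mb) <= alpha:
                mb.append(v)
                changed = True
    for v in list(mb):                    # Shrink: drop false positives that
        rest = [w for w in mb if w != v]  # are independent given the rest.
        if ci_test(data, target, v, rest) > alpha:
            mb.remove(v)
    return mb
```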

Score-Based Algorithms

Score-based algorithms search the space of possible network structures and assign a score to each based on its fit to the data.

3. Tabu Search

Tabu search is an iterative optimization algorithm that avoids revisiting previously explored solutions.

Procedure:

  1. Initialization:

    • Start with an empty graph or a random DAG.
  2. Local Search:

    • Modify the current graph by adding, deleting, or reversing edges to maximize a scoring function (e.g., $\mathrm{BIC}_\gamma$).
  3. Tabu List:

    • Maintain a list of recently visited graphs to prevent cycling.
  4. Stopping Condition:

    • Terminate when no further improvements are found or a predefined number of iterations is reached.

4. Hill Climbing (HC)

Hill climbing is a greedy optimization algorithm that iteratively improves the network structure.

Procedure:

  1. Initialization:

    • Start with an empty graph or a random DAG.
  2. Iterative Improvement:

    • Evaluate all possible single-edge changes (addition, deletion, reversal).
    • Update the graph with the change that gives the highest improvement in the scoring function (e.g., $\mathrm{BIC}_\gamma$).
  3. Stopping Condition:

    • Terminate when no single-edge modification improves the score.
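A minimal sketch of hill climbing over DAG adjacency matrices, assuming a decomposable score(data, node, parent_indices) such as a per-node BIC (the Gaussian node score sketched earlier could be wrapped this way). Only arc additions and deletions are shown to keep it short; reversals work analogously, and tabu search differs mainly in also keeping a list of recent moves that may not be undone.

```python
import numpy as np

def is_acyclic(adj):
    """Kahn's algorithm: True if the directed graph has no cycles."""
    adj = adj.copy()
    indeg = adj.sum(axis=0)
    queue = [i for i in range(len(adj)) if indeg[i] == 0]
    seen = 0
    while queue:
        u = queue.pop()
        seen += 1
        for v in np.flatnonzero(adj[u]):
            adj[u, v] = 0
            indeg[v] -= 1
            if indeg[v] == 0:
                queue.append(v)
    return seen == len(adj)

def hill_climb(data, n_nodes, score):
    """Greedy search: apply the best single-arc toggle until no gain."""
    adj = np.zeros((n_nodes, n_nodes), dtype=int)
    node_score = [score(data, i, []) for i in range(n_nodes)]
    while True:
        best_delta, best_move = 0.0, None
        for i in range(n_nodes):
            for j in range(n_nodes):
                if i == j:
                    continue
                adj[i, j] ^= 1                    # toggle arc i -> j
                if is_acyclic(adj):
                    parents = list(np.flatnonzero(adj[:, j]))
                    delta = score(data, j, parents) - node_score[j]
                    if delta > best_delta:
                        best_delta, best_move = delta, (i, j)
                adj[i, j] ^= 1                    # undo the toggle
        if best_move is None:
            return adj                            # local optimum reached
        i, j = best_move
        adj[i, j] ^= 1
        node_score[j] = score(data, j, list(np.flatnonzero(adj[:, j])))
```

Because the score decomposes over nodes, each candidate move only requires rescoring the one node whose parent set changes, which is what makes greedy search tractable.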

Hybrid Algorithms

Hybrid algorithms combine constraint-based and score-based approaches to leverage the strengths of both.

5. Max-Min Hill Climbing (MMHC)

MMHC first restricts the search space using CI tests and then scores potential structures.

Procedure:

  1. Skeleton Discovery (Constraint-Based):

    • Use CI tests to identify a candidate set of edges for each node.
  2. Structure Optimization (Score-Based):

    • Perform a local search (e.g., hill climbing) within the restricted search space to maximize a scoring function.

6. H2PC Algorithm

H2PC uses heuristic optimizations to speed up constraint-based skeleton discovery and integrates score-based methods for final structure refinement.

Procedure:

  1. Heuristic Skeleton Discovery:

    • Identify candidate edges using CI tests with heuristics to reduce the number of tests.
  2. Structure Optimization:

    • Refine the structure using a score-based algorithm, typically hill climbing.

Comparison of Algorithms

Constraint-Based:

  • Strengths:

    • Efficient for sparse networks.
    • Relies on statistical tests for independence.
  • Weaknesses:

    • Struggles with dense networks.
    • Can fail to produce valid CPDAGs.

Score-Based:

  • Strengths:

    • Handles dense networks effectively.
    • Captures higher-order dependencies (e.g., teleconnections).
  • Weaknesses:

    • Computationally intensive for large networks.

Hybrid:

  • Strengths:

    • Combines the efficiency of constraint-based methods with the accuracy of score-based methods.
    • Suitable for moderately dense networks.
  • Weaknesses:

    • Less effective than score-based algorithms for very dense networks.

This suite of algorithms provided the foundation for modeling climate data in the study, with adjustments to ensure valid structures and efficient computation for complex spatial dependencies.