
Converging to Stability in Two-Sided Bandits: The Case of Unknown Preferences on Both Sides of a Matching Market

Authors: Gaurab Pokharel¹ and Sanmay Das¹

Affiliations: ¹Virginia Tech


Overview

Classic matching-market mechanisms (e.g., Gale–Shapley) assume agents know their preferences. In many real deployments, preferences must be learned from experience: repeated interactions reveal noisy signals about match quality, and agents must decide whom to propose to while anticipating competition and acceptance.

This project studies two-sided bandits: repeated matching where proposers must learn rewards for different partners and must reason about whether a proposal will be accepted—because acceptance depends on the receiving side’s preferences and other agents’ competing proposals.


Setting (Players, Arms, and Stability)

At each round, each player proposes to an arm. Arms receive one or more proposals and choose at most one to accept; successful matches generate stochastic rewards and update beliefs. The goal is to converge to a stable matching, where no blocking pair prefers each other over their current matches.
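To make the stability criterion concrete, here is a minimal sketch (the function and argument names are our own for illustration, not code from the paper) that checks a one-to-one matching for blocking pairs:

```python
def is_stable(matching, player_prefs, arm_prefs):
    """Return None if the one-to-one matching is stable, else a blocking pair.

    matching: dict player -> arm (assumed one-to-one here)
    player_prefs[p] / arm_prefs[a]: rankings, most preferred first.
    """
    player_of = {a: p for p, a in matching.items()}
    for p, matched_arm in matching.items():
        pref = player_prefs[p]
        # Arms this player strictly prefers to its current match:
        for a in pref[: pref.index(matched_arm)]:
            q = player_of.get(a)
            # (p, a) blocks if arm a is unmatched or also prefers p to its match.
            if q is None or arm_prefs[a].index(p) < arm_prefs[a].index(q):
                return (p, a)
    return None
```

In the learning setting, neither side can run this check directly, since the preference lists must be estimated from noisy rewards.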

We consider three information regimes:

  1. APCK (Arm Preferences Common Knowledge): arms’ preferences are known and common knowledge.
  2. APKP (Arm Preferences Known but Private): arms know their preferences, but players do not.
  3. APU (Arm Preferences Unknown): arms also learn their preferences from interaction.

Algorithms

Baseline: CA-UCB (APCK)

When arm preferences are common knowledge, players can construct a “plausible set” of arms they could realistically win and propose using UCB-style optimism, avoiding unnecessary conflicts.
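As a rough illustration of this idea (a simplified sketch under our own naming, not the paper's exact CA-UCB specification), a player might filter to arms whose last winner it outranks, then propose by UCB index:

```python
import math

def plausible_set(player, arms, arm_prefs, last_winner):
    """Arms this player could plausibly win, given last round's matches.

    arm_prefs[a] ranks players (most preferred first); last_winner[a] is
    the player matched to arm a in the previous round, or None.
    """
    out = []
    for a in arms:
        w = last_winner.get(a)
        if w is None or w == player or arm_prefs[a].index(player) < arm_prefs[a].index(w):
            out.append(a)
    return out

def ucb_propose(player, plausible, means, counts, t, c=2.0):
    """Propose to the plausible arm with the highest UCB index."""
    def index(a):
        n = counts[(player, a)]
        if n == 0:
            return float("inf")  # force exploration of untried arms
        return means[(player, a)] + math.sqrt(c * math.log(t) / n)
    return max(plausible, key=index)
```

Restricting optimism to the plausible set is what avoids repeatedly proposing to arms the player cannot win.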

OCA-UCB (APKP)

When arm preferences are private, players maintain optimistic beliefs about how arms rank them, and update those beliefs using minimal conflict feedback (who won a contested arm). This recovers a conflict-avoiding behavior similar to CA-UCB while learning arms’ rankings over time.
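One simple way to encode such optimistic beliefs, assuming only win/lose conflict feedback (a sketch, not the paper's exact update rule):

```python
class OptimisticRankBelief:
    """A player's optimistic model of how one arm ranks the players.

    Starts by assuming the arm prefers this player to everyone; each
    lost conflict reveals one player the arm ranks above us.
    """

    def __init__(self, me):
        self.me = me
        self.above = set()  # players observed to outrank us at this arm

    def observe_conflict(self, winner):
        # Minimal feedback: we proposed to a contested arm and saw who won.
        if winner != self.me:
            self.above.add(winner)

    def can_win_against(self, rivals):
        # Optimistically assume we beat anyone not yet observed above us.
        return all(r not in self.above for r in rivals)
```

Because beliefs only ever move from "I win" to "I lose" as evidence arrives, the player's plausible set shrinks toward the truth without any explicit communication.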

APCK (CA-UCB) vs APKP (OCA-UCB) stability
Figure 1. Left: APCK model with CA-UCB; Right: APKP model with OCA-UCB. Results averaged over 100 runs. Top row: varying market size with uniformly random preferences; bottom row: varying player preference heterogeneity at $N=K=10$. OCA-UCB converges to stability under dual-sided uncertainty, though more slowly due to the increased complexity.

PCA-SCA (APU)

When arms themselves do not know their preferences, conflict outcomes are initially noisy. PCA-SCA handles this by:

  • having arms maintain reward estimates and confidence intervals over players,
  • resolving overlaps probabilistically early on,
  • having players choose arms by combining (i) reward optimism and (ii) optimistic estimates of win probability in contested proposals.

The framework supports both UCB-style and Thompson-style belief tracking.
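A sketch of both sides of this loop (simplified; the names and the uniform tie-breaking rule are our illustrative assumptions, not the paper's exact mechanism):

```python
import random

def arm_accept(proposers, lcb, ucb, rng=random):
    """Arm side: pick among proposers whose reward CI overlaps the leader's.

    lcb/ucb map player -> lower/upper confidence bounds on this arm's
    reward for that player. While intervals overlap, acceptance is
    probabilistic; once they separate, the best proposer wins outright.
    """
    leader = max(proposers, key=lambda p: lcb[p])
    contenders = [p for p in proposers if ucb[p] >= lcb[leader]]
    return rng.choice(contenders)

def player_score(reward_ucb, win_prob_opt):
    """Player side: combine reward optimism with optimistic win probability."""
    return reward_ucb * win_prob_opt
```

A player would propose to the arm maximizing `player_score`; as arms' intervals separate, `arm_accept` becomes deterministic and the dynamics approach the known-preference case.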

Stability Under UCB
Figure 2. Results of APU model experiments using UCB to track preferences. The left panel shows uniformly random preferences; the right panel varies player preference heterogeneity ($N=K=10$). Both markets converge to stability in expectation, and convergence shows no dependence on player preference heterogeneity.
Stability Under TS
Figure 3. Results of APU model experiments using Thompson Sampling to track preferences. As in Figure 2, the left panel shows uniformly random preferences and the right panel varies player preference heterogeneity ($N=K=10$). Both markets converge to stability in expectation, with no dependence on player preference heterogeneity. Compared with Figure 2, PCA-TS converges to stability more quickly and more smoothly than PCA-UCB.

Theory Highlights (High-Level)

  • In APKP, optimistic belief updates are enough to preserve convergence guarantees similar to the common-knowledge case.
  • In APU, as arms learn and their confidence intervals separate, feedback becomes effectively deterministic; players’ win-probability estimates converge, and the resulting behavior becomes equivalent to the idealized conflict-avoiding execution in the limit.

Simulation Takeaways

Across market sizes and preference heterogeneity:

  • The proposed algorithms converge toward stable matchings.
  • Player regret declines as learning progresses.
  • In the fully-unknown APU model, Thompson Sampling often converges faster and more smoothly than UCB.
convergence proxy comparing UCB vs TS
Figure 4. Convergence comparison of UCB (solid) and Thompson Sampling (dotted) in the APU model with varying market sizes and uniformly random preferences. Thompson Sampling achieves stability faster and more reliably, while the UCB-based approach exhibits intermittent periods of instability before ultimately converging.

Visualization of the Optimism Function

Figure 5. Visualization of different levels of "optimism" for different values of $\kappa$. Note that as $\kappa$ increases, players become more optimistic about their chances of winning a conflict even when their empirical estimate is below 0.5.
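The paper's exact $\kappa$-parameterized function is not reproduced here; as an illustration, a simple power transform has the qualitative behavior the caption describes, inflating the empirical win estimate more as $\kappa$ grows:

```python
def optimism(p_hat, kappa):
    """Hypothetical optimism transform (an illustration, not the paper's form).

    For kappa > 1 the transformed estimate exceeds the raw one, so a
    player can act as if its win chance is above 0.5 even when the
    empirical estimate p_hat is below 0.5.
    """
    return p_hat ** (1.0 / kappa)
```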

Why this matters

Learning in matching markets shows up in hiring pipelines, internships/residencies, platform matching, and school choice variants where preferences are implicit and noisy. This project offers a principled route to decentralized, no-explicit-communication learning dynamics that still converge to stable outcomes, even when both sides begin uncertain.


Citation

@article{Pokharel_Das_2023,
  title={Converging to Stability in Two-Sided Bandits: The Case of Unknown Preferences on Both Sides of a Matching Market},
  url={http://arxiv.org/abs/2302.06176},
  DOI={10.48550/arXiv.2302.06176},
  note={arXiv:2302.06176 [cs]},
  number={arXiv:2302.06176},
  publisher={arXiv},
  author={Pokharel, Gaurab and Das, Sanmay},
  year={2023},
  month=feb
}


Paper PDF