
SOCAL Lab · Kyungpook National University

Prospective students

Statistical Optimization & Collaborative AI Learning (SOCAL Lab), Department of Computer Science and Engineering, Kyungpook National University.

Overview

Mission and fit

SOCAL Lab develops statistically principled methods for settings where standard machine learning pipelines degrade: decentralized and privacy-sensitive data, long-tailed and imbalanced regimes, and deployment of large language models under accountability constraints. We recruit motivated Ph.D., M.S., and undergraduate researchers who are prepared to combine careful theory, reproducible experimentation, and clear scientific communication.

The topic lists below are illustrative research threads intended to help you articulate your interests in an inquiry email. They are not a guarantee of immediate openings on every thread; thesis and project directions are determined jointly based on fit, feasibility, and funding.

Visual primer

Two recurring settings


Centralized vs. federated learning: only model updates, not full raw rows, are aggregated centrally when local silos must be protected.
Synthetic tail augmentation: enriching sparse regions of the target distribution to improve model behavior where the original sample is thin.

Research areas

Representative directions

1. Federated learning (data privacy)

Collaborative learning when raw records cannot be centralized, with emphasis on statistical behavior under client heterogeneity and privacy constraints.

  • Federated optimization and aggregation under non-IID partitions and distribution shift.
  • Privacy-preserving federated protocols (e.g., differential privacy, secure aggregation, and risk-aware design) and their interaction with model utility.
  • Communication- and computation-aware federated training under realistic deployment constraints (dropouts, stragglers).
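To make the federated setting above concrete, here is a minimal weighted-averaging (FedAvg-style) sketch on a least-squares model. The client construction, learning rate, and local objective are illustrative assumptions for this page, not the lab's actual codebase; the point is only that raw rows stay on each client and only parameters are communicated.

```python
import numpy as np

def local_step(w, X, y, lr=0.05, epochs=5):
    # A few epochs of full-batch gradient descent on this client's
    # local least-squares loss; the raw rows (X, y) never leave the client.
    w = w.copy()
    n = len(y)
    for _ in range(epochs):
        grad = X.T @ (X @ w - y) / n
        w = w - lr * grad
    return w

def fedavg(clients, w0, rounds=30):
    # Each round the server broadcasts w, clients train locally, and
    # only the returned parameter vectors are averaged centrally,
    # weighted by local sample counts.
    w = w0.copy()
    for _ in range(rounds):
        updates = [local_step(w, X, y) for X, y in clients]
        sizes = np.array([len(y) for _, y in clients], dtype=float)
        w = np.average(updates, axis=0, weights=sizes / sizes.sum())
    return w

# Two clients with deliberately different (non-IID) input distributions.
rng = np.random.default_rng(0)
true_w = np.array([2.0, -1.0])
clients = []
for shift in (0.0, 3.0):
    X = rng.normal(shift, 1.0, size=(120, 2))
    y = X @ true_w + rng.normal(0.0, 0.1, size=120)
    clients.append((X, y))

w_global = fedavg(clients, np.zeros(2))
```

Research on the threads above starts where this toy stops: heterogeneity makes the averaged fixed point drift from the global optimum, and privacy mechanisms perturb the communicated updates.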

2. Synthetic data

Generation and augmentation of structured tabular data when direct release of sensitive records is restricted, including federated and differentially private synthesis.

  • High-fidelity tabular synthesis with semantic or structural constraints (including LLM-assisted constraint discovery where appropriate).
  • Federated or cross-party synthesis preserving correlations while limiting disclosure of raw private tables.
  • Augmentation and resampling to improve coverage in low-density / tail regions, with evaluation beyond headline averages.
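The tail-augmentation idea in the last bullet can be sketched as a simple jitter-resampling baseline, a crude stand-in for a learned generator. The quantile cutoff, noise scale, and function names here are illustrative assumptions:

```python
import numpy as np

def augment_tail(X, y, tail_quantile=0.9, factor=5, noise_scale=0.05, seed=0):
    # Oversample rows whose continuous target y falls in the thin upper
    # tail, jittering features by a small fraction of each column's
    # spread so the new rows are not exact copies.
    rng = np.random.default_rng(seed)
    cut = np.quantile(y, tail_quantile)
    mask = y >= cut
    X_tail, y_tail = X[mask], y[mask]
    idx = rng.integers(0, len(y_tail), size=factor * len(y_tail))
    jitter = rng.normal(0.0, noise_scale, size=(len(idx), X.shape[1]))
    X_new = X_tail[idx] + jitter * X.std(axis=0)
    y_new = y_tail[idx]
    return np.vstack([X, X_new]), np.concatenate([y, y_new])

# Toy example: a right-skewed target whose upper tail is sparse.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
y = rng.exponential(scale=1.0, size=200)
X_aug, y_aug = augment_tail(X, y)
```

As the bullet notes, evaluating such augmentation goes beyond headline averages, e.g. measuring error conditional on tail membership rather than over the whole sample.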

3. AI alignment (LLMs)

Evaluation- and governance-oriented work on large language models in institutional settings, emphasizing measurable risk and accountability.

  • Privacy-aware evaluation and auditing workflows for deployed or procured LLM systems.
  • Security, defense, and governance perspectives on heterogeneous LLM deployments (structured surveys and taxonomies).
  • Policy–implementation gaps in public-sector technology procurement and post-deployment accountability.

Expectations

Mindset and preparation

  • Mindset and conduct. We expect members to be self-motivated and determined: research progress depends on sustained initiative through ambiguous phases of a project. Communication skills are essential—clear writing and spoken explanation of assumptions, limitations, and empirical findings are part of the research product, not an afterthought.
  • Technical preparation. A strong foundation in probability and statistics, together with scientific programming (e.g., Python), is required for most projects. Coursework or equivalent experience in machine learning, numerical linear algebra / optimization, and deep learning is highly beneficial, depending on the thread. Students should expect to read primary literature critically and to iterate on experiments with disciplined logging and evaluation protocols.
  • Collaboration. The lab values collegiality, intellectual honesty about negative or null results, and respect for privacy and data-use agreements in applied work.

Next step

How to inquire

Email natekang@knu.ac.kr with subject line SOCAL Lab — Prospective Student. Include (i) your CV or transcript summary, (ii) one or two topic bullets above that best match your interests, and (iii) a short paragraph on prior research or implementation experience (course projects are welcome) and what you hope to learn in a 6–12 month horizon.
