Jonghyun Yun

Data scientist

Biography

Jonghyun Yun is a data scientist at the Institute of Statistical Data Science. He completed his PhD in the Department of Statistics at the University of Illinois at Urbana-Champaign under the guidance of Dr. Yuguo Chen. He was an Assistant Professor in the Department of Mathematics at UT Arlington. Prior to joining UTA, he was an Assistant Professor in the Department of Mathematical Sciences and a core faculty member of the Border Biomedical Research Center at the University of Texas at El Paso, from 2015 to 2016. From 2012 to 2015, he was a postdoctoral researcher at UT Southwestern Medical Center under the guidance of Drs. Guanghua Xiao and Yang Xie.

Interests

Big data analysis
Bayesian inference
Time series / sequential data
Causal inference
NLP, ML, RL
Python, R, C/C++
Spark, Scala, PyTorch, TensorFlow

Education

Ph.D. in Statistics, 2012
University of Illinois at Urbana–Champaign
MA in Applied Statistics, 2006
Yonsei University
BAs in Applied Statistics and Business Administration, 2004
Yonsei University

Strengths

Big-data analytics
Machine learning
Prediction modeling
Visualization
Dimension reduction
Hidden Markov model
Reinforcement learning
Time series
Natural language processing
Bayesian inference
Monte Carlo method
Causal inference
Cybersecurity
Fraud detection
Biostatistics and Bioinformatics
Multiple hypothesis testing
Anomaly detection
Next-generation sequencing
Smart infrastructure
Log-data analysis
R, Python, Spark, Scala, C/C++, MATLAB
PyTorch, TensorFlow
SQL, Git, Shell script
Parallel/Distributed computing

Experience

Cybersecurity Data Scientist

American Airlines

February 2022 – Present Fort Worth, TX

Developing loyalty fraud detection models to capture fraud at the early stages of account takeover. Designing and deploying a system that creates fraud incident reports and leverages feedback from subject-matter experts (SMEs) to reinforce detection performance. The system has increased fraud detection efficiency by 94.7%.

Data Scientist

Institute of Statistical Data Intelligence

September 2019 – Present Mansfield, TX

Developing and applying cutting-edge ML for prediction modeling, Bayesian models, time series, causal inference, visualization, and segmentation of big data. Applying NLP and survival models to analyze timestamped sequences of action data (log data). Developing network modeling frameworks to discover dynamic interactions between customers and merchandise. Parallel programming in C/C++ for complex Bayesian inference. Processing, cleansing, and validating the integrity of data. Presenting analyses and visualizations using R and Python, and developing software packages.

Assistant Professor

UT Arlington

September 2016 – August 2019 Arlington, TX

Assistant Professor

UT El Paso

September 2015 – June 2016 El Paso, TX

Postdoctoral Researcher

UT Southwestern Medical Center

September 2012 – August 2015 Dallas, TX

Featured Publications

J Yun, K R Ryu, S Ham

January 2022 Automation in Construction

Spatial Analysis Leveraging Machine Learning and GIS of Socio-Geographic Factors Affecting Cost Overrun Occurrence in Roadway Projects

This study analyzes cost overrun occurrence (COO) in the context of socioeconomic (SE) conditions, leveraging machine learning techniques and geographic information systems, because little is known about the relationship between SE factors and cost overruns in transportation infrastructure improvement projects. We extract socio-geospatial features from multiple data sources and establish a random forest model to discover their associations with COO. The developed models reveal highly significant features affecting COO, which include original amounts, original duration, management districts, number of lanes, population over 16 years old, commuting behavior, industrial topography, and average temperature, indicating that socioeconomic conditions play an important role in actual project expenses. Our findings will help practitioners and decision-makers better forecast and reflect the likely impacts of socioeconomic conditions surrounding the project in their planning, budgeting, and operation and maintenance. The software for the statistical analysis can be found at github.com/jonghyun-yun/dico.

PDF DOI

J Yun, S Kang, A D Tehrani, S Ham

October 2020 Mathematics

Image Analysis and Functional Data Clustering for Random Shape Aggregate Models

This study presents a random shape aggregate model by establishing a functional mixture model for images of aggregate shapes. Mesoscale simulation that accounts for the heterogeneous properties of concrete is a highly cost- and time-effective method for predicting the mechanical behavior of concrete. Because the design of the mesoscale concrete model is so significant, the shape of the aggregate is the most important parameter for obtaining a reliable simulation result. We propose image analysis and functional data clustering for random shape aggregate models (IFAM). This novel technique learns the morphological characteristics of aggregates using images of real aggregates as inputs. IFAM provides random aggregates across a broad range of heterogeneous shapes using samples drawn from the estimated functional mixture model as outputs. Our learning algorithm is fully automated and allows flexible learning of complex characteristics. Therefore, unlike similar studies, IFAM does not require users to perform time-consuming tuning of their model to provide realistic aggregate morphology. Using comparative studies, we demonstrate that the random aggregate structures constructed by IFAM closely resemble real aggregates in an inhomogeneous concrete medium. Thanks to our fully data-driven method, users can choose their own libraries of real aggregates for training the model and generate random aggregates that closely resemble the target libraries.

PDF DOI

J Yun, M Shin, I H Jin, F Liang

August 2020 JSCS

Stochastic approximation Hamiltonian Monte Carlo

Recently, Hamiltonian Monte Carlo (HMC) has become widespread as one of the more reliable approaches to efficient sample generation. However, HMC has difficulty sampling from a multimodal posterior distribution because the HMC chain cannot cross the energy barriers between modes, owing to its energy conservation property. In this paper, we propose a Stochastic Approximation Hamiltonian Monte Carlo (SAHMC) algorithm for generating samples from multimodal densities under the HMC framework. SAHMC can adaptively lower the energy barrier so that the Hamiltonian trajectory moves between modes more frequently and more easily. Our simulation studies show that SAHMC explores a multimodal target distribution more efficiently than HMC-based implementations.
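To make the energy-barrier issue in this abstract concrete, here is a minimal sketch of plain HMC with a leapfrog integrator. It is not the SAHMC algorithm from the paper. The one-dimensional two-component Gaussian mixture target, the step size, the trajectory length, and all function names are assumptions chosen purely for illustration.

```python
import math
import random


def potential(x):
    """Potential energy U(x) = -log density (up to a constant) of an
    equal mixture of N(-3, 1) and N(3, 1). Hypothetical target."""
    a = -0.5 * (x + 3.0) ** 2
    b = -0.5 * (x - 3.0) ** 2
    m = max(a, b)  # log-sum-exp for numerical stability
    return -(m + math.log(math.exp(a - m) + math.exp(b - m)))


def grad_potential(x):
    """Derivative dU/dx of the mixture potential above."""
    a = -0.5 * (x + 3.0) ** 2
    b = -0.5 * (x - 3.0) ** 2
    m = max(a, b)
    wa, wb = math.exp(a - m), math.exp(b - m)
    return (wa * (x + 3.0) + wb * (x - 3.0)) / (wa + wb)


def leapfrog(x, p, step, n_steps):
    """Simulate Hamiltonian dynamics with the leapfrog integrator."""
    p = p - 0.5 * step * grad_potential(x)  # initial half step for momentum
    for _ in range(n_steps - 1):
        x = x + step * p                    # full step for position
        p = p - step * grad_potential(x)    # full step for momentum
    x = x + step * p
    p = p - 0.5 * step * grad_potential(x)  # final half step for momentum
    return x, p


def hmc(n_samples, x0, step=0.1, n_steps=20, seed=0):
    """Plain HMC: resample momentum, integrate, then accept or reject."""
    rng = random.Random(seed)
    x, samples = x0, []
    for _ in range(n_samples):
        p = rng.gauss(0.0, 1.0)
        h_old = potential(x) + 0.5 * p * p
        x_new, p_new = leapfrog(x, p, step, n_steps)
        h_new = potential(x_new) + 0.5 * p_new * p_new
        # Metropolis correction for the integrator's discretization error
        if rng.random() < math.exp(min(0.0, h_old - h_new)):
            x = x_new
        samples.append(x)
    return samples


if __name__ == "__main__":
    draws = hmc(2000, x0=-3.0)
    # Count sign changes as a crude proxy for moves between the two modes.
    switches = sum(1 for u, v in zip(draws, draws[1:]) if u * v < 0)
    print("mode switches:", switches)
```

Because each trajectory approximately conserves the total energy drawn at its start, a chain sitting in one mode can only reach the other when the sampled momentum supplies enough kinetic energy to clear the barrier between them. Moving the mixture means further apart in this sketch is one way to observe that effect.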

DOI

J Yun, F Yang, Y Chen

May 2017 JASA

Augmented particle filters

Particle filters have been widely used for online filtering problems in state–space models (SSMs). The currently available proposal distributions depend either only on the state dynamics, or only on the observation, or on both sources of information but are not available for general SSMs. In this article, we develop a new particle filtering algorithm, called the augmented particle filter (APF), for online filtering problems in SSMs. The APF combines two sets of particles from the observation equation and the state equation, and the state space is augmented to facilitate the weight computation. Theoretical justification of the APF is provided, and the connection between the APF and the optimal particle filter (OPF) in some special SSMs is investigated. The APF shares similar properties with the OPF, but the APF can be applied to a much wider range of models than the OPF. Simulation studies show that the APF performs similarly to or better than the OPF when the OPF is available, and the APF can perform better than other filtering algorithms in the literature when the OPF is not available.
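For readers unfamiliar with the baseline this abstract refers to, the following is a minimal sketch of a bootstrap particle filter, in which the proposal depends only on the state dynamics. It is not the APF from the paper. The linear Gaussian model (x_t = phi * x_{t-1} + noise, y_t = x_t + noise), the parameter values, and the function name are assumptions made for illustration only.

```python
import math
import random


def bootstrap_particle_filter(observations, n_particles=500, phi=0.9,
                              sigma_x=1.0, sigma_y=1.0, seed=0):
    """Return the filtered mean E[x_t | y_1..y_t] at each time step for the
    assumed model x_t = phi * x_{t-1} + N(0, sigma_x^2), y_t = x_t + N(0, sigma_y^2)."""
    rng = random.Random(seed)
    # Initial particles drawn from an assumed N(0, sigma_x^2) prior.
    particles = [rng.gauss(0.0, sigma_x) for _ in range(n_particles)]
    filtered_means = []
    for y in observations:
        # Propose: push each particle through the state equation only.
        particles = [phi * x + rng.gauss(0.0, sigma_x) for x in particles]
        # Weight: observation likelihood, computed in log space for stability.
        log_w = [-0.5 * ((y - x) / sigma_y) ** 2 for x in particles]
        m = max(log_w)
        w = [math.exp(lw - m) for lw in log_w]
        total = sum(w)
        w = [wi / total for wi in w]
        filtered_means.append(sum(wi * x for wi, x in zip(w, particles)))
        # Resample: multinomial resampling proportional to the weights.
        particles = rng.choices(particles, weights=w, k=n_particles)
    return filtered_means


if __name__ == "__main__":
    # Simulate a short series from the same assumed model, then filter it.
    rng = random.Random(42)
    x, ys = 0.0, []
    for _ in range(50):
        x = 0.9 * x + rng.gauss(0.0, 1.0)
        ys.append(x + rng.gauss(0.0, 1.0))
    means = bootstrap_particle_filter(ys)
    print(len(means), "filtered means computed")
```

Since the proposal here ignores the current observation, many particles can land far from it when the observation is very informative, and the weights then concentrate on a few particles. That weakness is what motivates proposals that also use the observation.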

DOI

J Yun, T Wang, G Xiao

February 2014 Biometrics

Bayesian Hidden Markov Models to Identify RNA-Protein Interaction Sites in PAR-CLIP

The photoactivatable ribonucleoside enhanced cross-linking immunoprecipitation (PAR-CLIP) has been increasingly used for the global mapping of RNA-protein interaction sites. There are two key features of the PAR-CLIP experiments: The sequence read tags are likely to form an enriched peak around each RNA-protein interaction site; and the cross-linking procedure is likely to introduce a specific mutation in each sequence read tag at the interaction site. Several ad hoc methods have been developed to identify the RNA-protein interaction sites using either sequence read counts or mutation counts alone; however, rigorous statistical methods for analyzing PAR-CLIP are still lacking. In this study, we propose an integrative model to establish a joint distribution of observed read and mutation counts. To pinpoint the interaction sites at single base-pair resolution, we developed a novel modeling approach that adopts non-homogeneous hidden Markov models to incorporate the nucleotide sequence at each genomic location. Both simulation studies and data application showed that our method outperforms the ad hoc methods, and provides reliable inferences for the RNA-protein binding sites from PAR-CLIP data.

PDF DOI