About Me


I am a statistician working for Fifth Third (5/3) Bank in Cincinnati, OH. At the bank, I lead a team of quantitative analysts who develop predictive models for enterprise risk and work on the annual Comprehensive Capital Analysis and Review (CCAR) submissions. Before joining Fifth Third Bank, I worked for LexisNexis Risk Solutions, JPM Chase, and PayPal in various interesting areas, including database marketing, risk modeling, and fraud detection with machine learning.

I truly enjoy working with data and collaborating with other statisticians. In my spare time, I like learning new computing languages and reading good books and papers on statistics and machine learning.

Let’s link up on LinkedIn

Developed R Packages

– R package “mob” (https://CRAN.R-project.org/package=mob), Monotonic Optimal Binning for the risk scorecard development
– R package “yager” (https://CRAN.R-project.org/package=yager), General Regression Neural Networks for functional approximation and classification
– R package “yap” (https://CRAN.R-project.org/package=yap), Probabilistic Neural Networks for pattern recognition

Selected Publications

– Modeling Practices of Operational Loss Forecasts, SAS Analytics Conference, 2015
– Modeling Fractional Outcomes with SAS ®, SAS Global Forum, 2014
– Modeling Practices of Risk Parameters for Consumer Portfolio, SAS Analytics Conference, 2011
– Rapid Model Refresh in Online Fraud Detection Engine, SAS Data Mining Conference, 2010
– Generalizations of Generalized Additive Model: A Case of Credit Risk Modeling, SAS Global Forum, 2009
– A Class of Predictive Models for Multilevel Risks, SAS Global Forum, 2009
– Count Data Models in SAS, SAS Global Forum, 2008 (Best Contributed Paper)
– Adjustment of Selection Bias in the Marketing Campaign, INFORMS Marketing Science Conference, 2008
– Behavior-based Predictive Models, SAS Data Mining Conference, 2008
– Generalized Additive Model and Applications in Direct Marketing, Direct Marketing Association (DMA) Analytical Journal, 2008
– Improve Credit Scoring by Generalized Additive Model, SAS Global Conference, 2007

Technical Competencies

– Data Mining: Decision Trees, Multivariate Adaptive Regression Splines (MARS), Generalized Additive Models, Projection Pursuit Regression, Neural Networks, Bagging, Boosting, Bumping, and Decision Stump.
– Statistics: Generalized Linear Models, Count Outcome Models, Proportion Outcome Models, Longitudinal Models, Finite Mixture Models, Quantile Regression, Multivariate Analysis, and Time Series.
– Programming: R / S+, Python, Julia, Matlab / Octave, and SAS.
– Database: Teradata (BTEQ), Oracle (PL/SQL), DB2, SQL Server (T-SQL), MySQL, SQLite, and MongoDB.
– Utilities: Linux, Cygwin, Emacs (ESS), Vim, SED, Shell, Pig Latin, and HDF5.
– Risk Modeling: Credit Risk Models (PD / EAD / LGD) and Scorecard Development.

50 thoughts on “About Me”

  1. Like your posts very much. Great job.
    One quick question: is the 401(k) participation data set (Papke and Wooldridge, 1996) confidential? I searched for it but could not find it.

    Keep up the excellent work!

  2. Hi, WenSui, sorry to trouble you. I am a first-year student from China. I have been getting notifications via email and have learnt a lot from your blog. I would be even more glad if you could write about naive Bayes classification using SAS code.

  3. Hi, WenSui! You are a great stat-analyst. I am a SAS programmer and this is my favorite learning website. I would like to ask whether you have developed something (SAS code) for collapsing the levels of class predictor variables (for logistic regression). I’ve seen your WOE transformation program, but nothing about recoding class variables. (I would be glad to get a reply from you.)
    Thanks a lot for your work!!

  4. Hi WenSui,
    Thanks for writing the macro “A SAS Macro Implementing Monotonic WOE Transformation in Scorecard Development”. I have 2 suggestions, if they can be incorporated.

    1. A weight variable, to adjust for oversampling / prior probabilities, as one of the macro parameters.
    2. A way of estimating the missing values. Since we are mostly looking for a scorecard-type model and the final method is a logistic regression, can we actually use the log(odds) or WOE to estimate the missing values (reverse engineering!), by assuming monotonic behavior in the log(odds) and replacing each missing value with the attribute value of the class / bin with the nearest log(odds)? A floor and a cap could also be assigned for each variable at 5% and 95%. In addition, if a variable has more than 30% of its data missing, a dummy could be created, and if between 0 and 30% is missing, the values could be imputed.

    In my view, it might be better than median substitution, especially for risk scorecards: a median substitution may correspond to a slight drop in scorecard separation and may affect whether the variable ends up in or out of the model.


  5. Hi WeiSui, Thanks for sharing your excellent work and idea. Can you also upload the test data set here so that we could figure out the details by ourselves?

  6. Hello, WenSui !

    I’m a bit curious about your recent blog concerning performance of parallel processing. Based on other parallel processing “through-put” numbers I’ve seen here & there, I was expecting that splitting the work across 8 CPU cores would have increased the speed approximately 6-fold, not the 3-fold you measured. I’m wondering if the “Ubuntu VM with 8 cores” is really 4 physical cores with hyper-threading, or if perhaps you think this particular task might have been disk-bound.

    Thanks for sharing so much with us with your postings.


    Doug Dame

  7. You are correct, Doug. It is 4 physical cores with hyper-threading.
    I appreciate your insight.

  8. Excellent WOE SAS algorithm. Do you have any references for the procedure you employ that I can review or cite?

  9. Dear Ryan
    First of all, I truly appreciate your interest in my SAS code. Unfortunately, I don’t have a reference. It is more of a home-brew algorithm that came from my own mind.

  10. Regarding “By-Group Aggregation in Parallel” have you tried data.table?

    It should be a lot faster than the alternatives for code like that



    # assuming the Rdata contains a data.frame called data


  11. You are correct. data.table is indeed a lot faster. However, the purpose of this post is to showcase how to use parallelism.

    Thanks for reading my blog.

  12. Hiya, I’m attempting to use your excellent ‘A SAS Macro Implementing Monotonic WOE Transformation in Scorecard Development’ code but am running into some problems with the default SAS server folder where the output files are being directed.
    Can you please let me know how to change the code so that I can choose where the output files are directed?
    Thanks in advance.

  13. Hello, I’ve been reading your blog and I am interested in the Oct2Py part. Do you have some more examples?

  14. Hello sir, I’ve been reading your blog and I am interested in the Oct2Py part. Do you have some more examples?

  15. Hi Wenshui,

    For LGD modeling, how should we use WOE to work with categorical variables?

    Many thanks

  16. Do you have any thoughts on models for Gain/Loss on loan sales and Gain/loss on foreclosed assets?

  17. Hi, is the dataset credit_count.txt publicly available? I wanted to follow your examples. Thanks.

  18. Hi WenSui
    Thanks for the monotonic binning code in Python. It looks very interesting.
    I am keen to replicate the results on your Accepts.csv data. Any chance you could supply it?
    It would be a great help in understanding the code and properly applying it to my data.

  19. Hi, WenSui,

    I am trying to follow your interesting post “Fitting Generalized Regression Neural Network with Python.” How can I obtain the csdata.txt file? I have googled around without success.

    Many thanks in advance,

  20. I’m your biggest fan. Thank you for the monotonic function on smbinning. I’ve been using R, and I am still limited to inputting categorical variables rather than my preferred continuous variables (as explanatory variables). I’m struggling with transforming the binned variables into WoE numeric variables. Please help!

  21. It should really depend on the charge-off practice with your organization. To my mind, 12-month might be too long for the unsecured consumer product.

  22. Thank you for the response.
    The approval rate for this sector (unsecured consumer loans, mostly low income) is pretty low and, historically, the default rates have also been very low. Usually, the term has been 12 to 36 months. I am exploring ways to assess and manage the risk of such portfolios. Do you have any suggestions?


  23. Hi WenSui
    Similar request for the monotonic binning code in Python. I am eager to replicate the results on your Accepts.csv data. Any chance you could please supply it?
    It would be a great help in understanding the code and properly applying it to my data.

  24. I read with interest your paper on the estimation of count models in SAS. I have a question about how the figures comparing predicted and observed counts were produced. Did you compute the predicted count for each observation and round it before comparing with the distribution of the observed counts, or did you instead compute, for each observation, the probability of a count of k (for k = 0 to 8 in your example) and compare those with the observed distribution (as, for example, in Long and Freese, Regression Models for Categorical Dependent Variables Using Stata)?
    If the latter option is preferred, would it be possible to provide some guidance on how to do it in SAS?
    Many thanks

  25. Thanks for reading my blog.
    To answer your question, I calculated the probability of each count outcome for each case and then aggregated.
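
The approach described in the reply above (per-case probabilities, then aggregation) can be sketched roughly as follows. This is only a minimal illustration in Python, assuming a plain Poisson model; the function names, the fitted means, and the observed counts are all hypothetical and are not taken from the paper, which works in SAS and covers other count models as well.

```python
import math


def poisson_pmf(k, mu):
    """P(Y = k) for a Poisson variable with mean mu."""
    return math.exp(-mu) * mu ** k / math.factorial(k)


def predicted_count_distribution(fitted_means, max_count):
    """For each k in 0..max_count, average P(Y = k) over all cases."""
    n = len(fitted_means)
    return [
        sum(poisson_pmf(k, mu) for mu in fitted_means) / n
        for k in range(max_count + 1)
    ]


def observed_count_distribution(observed, max_count):
    """Share of cases whose observed count equals k, for k in 0..max_count."""
    n = len(observed)
    return [
        sum(1 for y in observed if y == k) / n
        for k in range(max_count + 1)
    ]


# Hypothetical inputs: one fitted mean per case (from some already-fitted
# Poisson regression) and the matching observed counts.
fitted_means = [0.5, 1.0, 2.0, 0.2]
observed = [0, 1, 3, 0]

predicted = predicted_count_distribution(fitted_means, 8)
actual = observed_count_distribution(observed, 8)

for k, (p, a) in enumerate(zip(predicted, actual)):
    print(f"count={k}: predicted={p:.3f} observed={a:.3f}")
```

Note that no individual prediction is rounded here: each case contributes a full probability vector, and only the averages are compared with the observed frequencies.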
