Yet Another Blog in Statistical Computing

I can calculate the motion of heavenly bodies but not the madness of people. -Isaac Newton

About Me

with 53 comments


My name is WenSui Liu and I am a statistician working for Fifth Third (5/3) Bank in Cincinnati, OH. At 5/3, I lead a team of quantitative analysts developing operational risk models for enterprise risk and work on annual submissions for the Comprehensive Capital Analysis and Review (CCAR). Before joining 5/3, I worked for LexisNexis Risk Solutions, JPMorgan Chase, and PayPal in various interesting areas, including database marketing, risk modeling, and fraud detection with machine learning.

I truly enjoy working with data and collaborating with other statisticians. In my spare time, I like learning new computing languages and reading good books and papers on statistics and machine learning.

Let’s link up on LinkedIn.

Selected Publications

– Modeling Practices of Operational Loss Forecasts, SAS Analytics Conference, 2015
– Modeling Fractional Outcomes with SAS ®, SAS Global Forum, 2014
– Modeling Practices of Risk Parameters for Consumer Portfolio, SAS Analytics Conference, 2011
– Rapid Model Refresh in Online Fraud Detection Engine, SAS Data Mining Conference, 2010
– Generalizations of Generalized Additive Model: A Case of Credit Risk Modeling, SAS Global Forum, 2009
– A Class of Predictive Models for Multilevel Risks, SAS Global Forum, 2009
– Count Data Models in SAS, SAS Global Forum, 2008 (Best Contributed Paper)
– Adjustment of Selection Bias in the Marketing Campaign, INFORMS Marketing Science Conference, 2008
– Behavior-based Predictive Models, SAS Data Mining Conference, 2008
– Generalized Additive Model and Applications in Direct Marketing, Direct Marketing Association (DMA) Analytical Journal, 2008
– Improve Credit Scoring by Generalized Additive Model, SAS Global Conference, 2007

Technical Competencies

– Data Mining: Decision Trees, Multivariate Adaptive Regression Splines (MARS), Generalized Additive Models, Projection Pursuit Regression, Neural Networks, Bagging, Boosting, Bumping, and Decision Stump.
– Statistics: Generalized Linear Models, Count Outcome Models, Proportion Outcome Models, Longitudinal Models, Finite Mixture Models, Quantile Regression, Multivariate Analysis, and Time Series.
– Programming: R / S+, Python, Julia, Matlab / Octave, and SAS.
– Database: Teradata (BTEQ), Oracle (PL/SQL), DB2, SQL server (T-SQL), MySQL, SQLite, and MongoDB.
– Utilities: Linux, Cygwin, Emacs (ESS), Vim, SED, Shell, Pig Latin, and HDF5.
– Risk Modeling: Credit Risk Models (PD / EAD / LGD) and Scorecard Development.


Written by statcompute

April 7, 2007 at 8:16 pm

53 Responses


  1. Excellent job. Excellent Statistician.

    James Xiao

    March 7, 2012 at 1:13 pm

  2. Like your posts very much. Great job.
    One quick question: is the 401(k) participation data set (Papke and Wooldridge, 1996) confidential? I searched for it but could not find it.

    Keep up excellent work!


    May 27, 2012 at 10:02 am

  3. Hi, WenSui, sorry to trouble you. I am a first-year student from China. I receive notifications via email and have learned a lot from your blog. I would be even happier if you could write about naive Bayes classification using SAS code.


    October 25, 2012 at 10:28 pm

  4. Hi, nice blog, thanks for sharing. Would you please consider adding a link to my website on your page? Please email me back. Thanks!

    Aaron Grey


    October 27, 2012 at 3:52 am

  5. Hi, WenSui! You are a great statistical analyst. I am a SAS programmer, and this is my favorite learning website. May I ask whether you have developed any SAS code for collapsing levels of class predictor variables (for logistic regression)… I’ve seen your WOE transformation program, but nothing about recoding class variables. (I would be glad for your reply.)
    Thanks a lot for your work!!


    November 13, 2012 at 4:58 pm

  6. Hi WenSui,
    Thanks for writing the macro “A SAS Macro Implementing Monotonic WOE Transformation in Scorecard Development”. I have two suggestions, if they can be incorporated.

    1. A weight variable, as one of the macro parameters, to adjust for oversampling / prior probabilities.
    2. For estimating missing values: since we are mostly building a scorecard-type model whose final step is a logistic regression, could we use the log(odds) or WOE to estimate the missing values (reverse engineering!) by exploiting the monotonic behavior of the log(odds) and replacing missing values with the attribute value of the bin whose log(odds) is nearest? A floor and cap could also be assigned for each variable at the 5th and 95th percentiles. In addition, if a variable has more than 30% of its data missing, a dummy could be created; if between 0% and 30% is missing, the values could be imputed.

    In my view this might be better than median substitution, especially for risk scorecards: median substitution may cause a slight drop in scorecard separation and may affect whether the variable ends up in or out of the model.



    February 14, 2013 at 3:48 pm
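The nearest-log(odds) imputation idea in the comment above can be sketched as follows. This is a hypothetical Python illustration, not the blog's macro; the bin labels and all counts are made up:

```python
import math

# toy bins from a fitted monotonic WOE binning: bin label -> (bads, goods)
bins = {
    "low":  (10, 90),
    "mid":  (30, 70),
    "high": (60, 40),
}
# bads and goods observed among the records with a missing value
missing = (25, 75)

def log_odds(bads, goods):
    return math.log(bads / goods)

# assign the missing group to the bin whose log(odds) is nearest
target = log_odds(*missing)
nearest = min(bins, key=lambda b: abs(log_odds(*bins[b]) - target))
print(nearest)  # 'mid'
```

The missing group then inherits that bin's WOE, instead of a median-substituted raw value.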

  7. Hi WeiSui, Thanks for sharing your excellent work and idea. Can you also upload the test data set here so that we could figure out the details by ourselves?


    May 21, 2013 at 5:05 pm

  8. Hello, WenSui !

    I’m a bit curious about your recent blog post concerning the performance of parallel processing. Based on other parallel-processing throughput numbers I’ve seen here and there, I was expecting that splitting the work across 8 CPU cores would have increased the speed approximately 6-fold, not the 3-fold you measured. I’m wondering if the “Ubuntu VM with 8 cores” is really 4 physical cores with hyper-threading, or if perhaps this particular task might have been disk-bound.

    Thanks for sharing so much with us with your postings.


    Doug Dame


    May 29, 2013 at 12:23 am

  9. You are correct, Doug. It is 4 physical cores with hyper-threading.
    Appreciate your insight.


    May 29, 2013 at 6:29 am

  10. Excellent WOE SAS algorithm. Do you have any references for the procedure you employ that I could review or cite?

    Ryan Vaughn

    August 5, 2013 at 5:40 pm

  11. Dear Ryan
    First of all, I truly appreciate your interest in my sas code. Unfortunately, I don’t have the reference. It is more like a home-brew algorithm coming from my mind.


    August 5, 2013 at 5:43 pm

  12. No problem. Thank you very much for sharing. You have saved me a lot of work. :)


    August 5, 2013 at 5:54 pm

  13. Really appreciate your contributions. I just revised some of your SAS code; it’s helpful.

    Kun Zou

    September 11, 2013 at 11:23 am

  14. Regarding “By-Group Aggregation in Parallel” have you tried data.table?

    It should be a lot faster than the alternatives for code like that.



    # assuming the Rdata contains a data.frame called data



    October 6, 2014 at 6:20 am

  15. You are correct. Data.table is indeed a lot faster. However, the purpose of this post is to showcase how to use parallelism.

    Thanks for reading my blog.


    October 6, 2014 at 9:11 am

  16. Hi, I’m attempting to use your excellent ‘A SAS Macro Implementing Monotonic WOE Transformation in Scorecard Development’ code but am running into some problems with the default SAS server folder to which the output files are directed.
    Can you please let me know how to update the code so that I can change where the output files go?
    Thanks in advance.


    November 18, 2014 at 1:43 am

  17. Could you show me how to mathematically derive beta0 and beta1 in your SAS getpdo macro?

    Jian Yang

    November 26, 2014 at 6:47 pm
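Assuming the getpdo macro follows the usual points-to-double-the-odds (PDO) scaling convention (an assumption about the macro, not a statement of fact), beta0 and beta1 fall out of two anchor conditions: a base score at given base odds, and PDO extra points whenever the odds double. A minimal Python sketch with illustrative parameter values:

```python
import math

pdo = 20          # points to double the odds
base_score = 600  # score assigned at base_odds
base_odds = 50    # good:bad odds at the base score

# score = beta0 + beta1 * log(odds); doubling the odds must add exactly
# pdo points, so beta1 * log(2) = pdo, and beta0 anchors the base score
beta1 = pdo / math.log(2)
beta0 = base_score - beta1 * math.log(base_odds)

def score(odds):
    return beta0 + beta1 * math.log(odds)

print(round(score(base_odds)))                          # 600
print(round(score(2 * base_odds) - score(base_odds)))   # 20
```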

  18. Nice sir.


    May 6, 2015 at 9:03 am

  19. Hi WenSui, good work. I am a student trying to run

    mdl <- acp(y ~ -1, data = cnt)

    but I cannot, because I do not have the cnt data.
    Can you help me please?

    Juan Pablo

    May 28, 2015 at 10:50 pm

  20. Hello, I’ve been reading your blog and I am interested in the Oct2Py part. Do you have some more examples?


    July 19, 2015 at 11:28 am

  21. Hello sir, I’ve been reading your blog and I am interested in the Oct2Py part. Do you have some more examples?

    Carlos Mario

    July 19, 2015 at 11:29 am

  22. Hello WenSui, where can I get the file “flights.db”? thanks.


    November 17, 2015 at 2:31 am

  23. Great work, master!


    November 18, 2015 at 3:47 pm

  24. Wensui, this is phenomenal work with a lot of dedication. Thank you for your generosity in sharing.


    December 14, 2015 at 9:30 am

  25. Hi Wensui, Thanks a lot for a great work. Could you please share the sample file to work on the scorecard codes.

    Shameel Ahmed Shakir

    January 25, 2016 at 11:13 am

  26. Hi WenSui,

    For LGD modeling, how should we use WOE to handle categorical variables?

    Many thanks


    March 12, 2016 at 6:31 pm

  27. Please email me and we can discuss. Thanks.


    March 12, 2016 at 8:08 pm

  28. Great work. It would be nice if you could supply the link to the files that you used. Thanks.


    March 21, 2016 at 12:59 am

  29. Do you have any thoughts on models for Gain/Loss on loan sales and Gain/loss on foreclosed assets?


    May 27, 2016 at 4:38 pm

  30. Hi WenSui,

    Great blog! I’d love for you to contribute to our blog at

    We are developing a blog to entice and encourage users to come to our site. Our main site will be a job board with jobs from all over the country in the data, analytics, and business intelligence realm.

    Please let me know if interested. Thanks!

    Ajay Mistry


    June 21, 2016 at 3:21 pm

  31. Sure, dear Ajay, please let me know how I might contribute.


    June 21, 2016 at 8:28 pm

  32. Hi, is the dataset credit_count.txt publicly available? I wanted to follow your examples. Thanks.


    June 27, 2016 at 9:26 pm

  33. Comment on your latest blog post:

    Your data.table code can be sped up significantly:

    data.table = {
    ### DATA.TABLE ###
    hflights <- data.table(hflights)
    hflights[, wday := 'weekday']
    hflights[DayOfWeek == 6, wday := 'weekend']
    hflights[DayOfWeek == 7, wday := 'weekend']
    hflights[, delay := ArrDelay + DepDelay]
    }

    Zachary Deane-Mayer

    October 31, 2016 at 10:32 am

  34. Very smart way to avoid ifelse().


    October 31, 2016 at 10:53 am
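The ifelse()-avoiding trick above — assign a default value, then overwrite subsets in place — has a direct analogue in pandas. A rough Python sketch with made-up rows, not the benchmark code itself:

```python
import pandas as pd

# toy flights-like data; only DayOfWeek matters for this trick
df = pd.DataFrame({"DayOfWeek": [1, 5, 6, 7]})

# assign the default first, then overwrite the weekend subset in place,
# instead of evaluating a vectorized if/else over every row
df["wday"] = "weekday"
df.loc[df["DayOfWeek"].isin([6, 7]), "wday"] = "weekend"

print(df["wday"].tolist())  # ['weekday', 'weekday', 'weekend', 'weekend']
```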

  35. Hi WenSui,
    Thanks for the monotonic binning code in Python – it looks very interesting.
    I am keen to replicate the results on your Accepts.csv data – any chance you could supply it?
    It would be a great help in understanding the code and properly applying it to my data.

    Mark Kenton

    January 27, 2017 at 11:54 am

  36. Hi, WenSui,

    I am trying to follow your interesting post “Fitting Generalized Regression Neural Network with Python.” How can I obtain the csdata.txt file? I have googled around without success.

    Many thanks in advance,


    February 2, 2017 at 4:58 pm

  37. I’m your biggest fan. Thank you for the monotonic function for smbinning. I’ve been using R, and I am still limited to inputting categorical variables rather than my preferred continuous variables (as explanatory variables). I’m struggling with transforming the binned variables into numeric WoE variables. Please help!


    March 13, 2017 at 7:35 am

  38. Is everyone supposed to know how to get the file credit_count.txt?

    I stumbled upon this website in search of a Keras example (which I found).

    Is this the file,


    March 17, 2017 at 1:32 am

  39. Hello Mr. Liu,
    You have a great blog! I have a quick question: are expected loss models (mentioned in the link below) used to assess credit risk in a consumer lending portfolio (individuals, mostly unsecured), usually for a term of 12-36 months?


    April 9, 2017 at 12:10 pm

  40. It should really depend on the charge-off practice at your organization. To my mind, 12 months might be too long for an unsecured consumer product.


    April 9, 2017 at 12:49 pm

  41. Thank you, for the response.
    The approval rates for this sector – unsecured consumer loans, mostly low income – are pretty low, and historically the default rates have also been very low. Usually, the term has been from 12-36 months. I am exploring ways to assess and manage the risk of such portfolios. Do you have any suggestions?



    April 18, 2017 at 1:15 am

  42. Hi WenSui
    A similar request for the monotonic binning code in Python: I am eager to replicate the results on your Accepts.csv data – any chance you could please supply it?
    It would be a great help in understanding the code and properly applying it to my data.


    April 26, 2017 at 10:57 am

  43. Hello Mr. Liu,
    Where can I download the credit_count.txt dataset?


    May 17, 2017 at 4:57 am

  44. Hi Abler, were you able to download the credit_count.txt file? I didn’t see the file at the link shared by Mr. Liu.


    August 18, 2017 at 12:09 pm

  45. Hi, where is the dataset “credit_count.txt”? Please provide links to the datasets used in your examples.


    September 12, 2017 at 11:02 pm

  46. Mert

    September 12, 2017 at 11:35 pm

  47. Hi WenSui,

    Thank you for the blog post regarding hurdle models. Does the model somewhat imply that a single claim will be reported if there is a claim?

    Zongwen TAN

    September 19, 2017 at 10:26 am
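To the question above: in a hurdle model the binary component only decides whether any claim occurs, and the zero-truncated count component then allows one or more claims, so a claim need not be a single claim. A small Python sketch of a hurdle-Poisson pmf (the parameter values are made up for illustration):

```python
import math

pi0 = 0.6   # P(no claim) from the binary hurdle component
lam = 1.5   # rate of the zero-truncated Poisson component

def hurdle_pmf(k):
    # mass at zero comes entirely from the hurdle; positive counts come
    # from a Poisson renormalized to exclude zero
    if k == 0:
        return pi0
    pois_k = math.exp(-lam) * lam ** k / math.factorial(k)
    return (1 - pi0) * pois_k / (1 - math.exp(-lam))

# the pmf sums to one, and counts above one carry positive probability,
# so the model does NOT force a single claim given that a claim occurs
total = sum(hurdle_pmf(k) for k in range(50))
print(round(total, 6), hurdle_pmf(2) > 0)  # 1.0 True
```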

  48. Hi WenSui Liu
    Thank you for the autoencoder-for-dimension-reduction code. I have a problem with it: I cannot run the code on my dataset, which consists of image features, and I get an error at lines 10 and 11. If possible, please describe what these two lines do.
    Thanks a lot and good luck.


    October 1, 2017 at 10:56 am

  49. I read with interest your paper on the estimation of count models in SAS. I have a question about the way the figures comparing predicted and observed counts were produced. Did you compute, for each observation, a predicted count that you round and compare with the distribution of the observed counts, or did you instead compute, for each observation, the probabilities of a count of k (for k = 0 to 8 in your example) and compare those with the observed distribution (as, for example, in Long and Freese, Regression Models for Categorical Dependent Variables Using Stata)?
    If the latter option is preferred, would it be possible to provide some guidance on performing it in SAS?
    Many thanks


    January 8, 2018 at 7:03 am

  50. Thanks for reading my blog.
    To answer your question, I calculated the probability of each count outcome for each case and then aggregated.


    January 13, 2018 at 3:52 pm
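The aggregation described in the reply — compute P(Y = k) for each case, then aggregate across cases — can be sketched as follows. This is a Python illustration with hypothetical predicted means, not the SAS code from the paper:

```python
import math

# hypothetical per-case predicted means from a fitted Poisson count model
mu = [0.5, 1.2, 2.0, 0.8]

def poisson_pmf(k, m):
    return math.exp(-m) * m ** k / math.factorial(k)

# for each count k, average P(Y = k) across cases; this predicted
# distribution is then plotted against the observed frequency of each k
max_k = 8
predicted = [sum(poisson_pmf(k, m) for m in mu) / len(mu)
             for k in range(max_k + 1)]

print([round(p, 3) for p in predicted])
```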

  51. For today’s post — 2018.02.26 — I am getting this error message after the line:

    est1 <- data.frame(beta = fit1$par, stder = stder, z_values = fit1$par / stder)

    Error in data.frame(beta = fit1$par, stder = stder, z_values = fit1$par/stder) :
    object 'stder' not found

    ?? Where is stder defined?


    Wayne Gray

    February 26, 2018 at 6:49 am

  52. Thanks for pointing it out, dear Prof. Gray.
    It was a typo and I have just corrected it.


    February 26, 2018 at 8:05 am
