I am a statistician at Fifth Third (5/3) Bank in Cincinnati, OH, where I lead a team of quantitative analysts developing predictive models for enterprise risk and working on the annual Comprehensive Capital Analysis and Review (CCAR) submissions. Before joining Fifth Third Bank, I worked for LexisNexis Risk Solutions, JPM Chase, and PayPal in various interesting areas, including database marketing, risk modeling, and fraud detection with machine learning.

I truly enjoy working with data and collaborating with other statisticians. In my spare time, I like learning new programming languages and reading good books and papers on statistics and machine learning.

**Let’s link up on LinkedIn**

https://www.linkedin.com/in/liuwensui

**Developed R Packages**

– R package “mob” (https://CRAN.R-project.org/package=mob), Monotonic Optimal Binning for the risk scorecard development

– R package “yager” (https://CRAN.R-project.org/package=yager), General Regression Neural Networks for functional approximation and classification

– R package “yap” (https://CRAN.R-project.org/package=yap), Probabilistic Neural Networks for pattern recognition

**Selected Publications**

– Modeling Practices of Operational Loss Forecasts, SAS Analytics Conference, 2015

– Modeling Fractional Outcomes with SAS®, SAS Global Forum, 2014

– Modeling Practices of Risk Parameters for Consumer Portfolio, SAS Analytics Conference, 2011

– Rapid Model Refresh in Online Fraud Detection Engine, SAS Data Mining Conference, 2010

– Generalizations of Generalized Additive Model: A Case of Credit Risk Modeling, SAS Global Forum, 2009

– A Class of Predictive Models for Multilevel Risks, SAS Global Forum, 2009

– Count Data Models in SAS, SAS Global Forum, 2008 (Best Contributed Paper)

– Adjustment of Selection Bias in the Marketing Campaign, INFORMS Marketing Science Conference, 2008

– Behavior-based Predictive Models, SAS Data Mining Conference, 2008

– Generalized Additive Model and Applications in Direct Marketing, Direct Marketing Association (DMA) Analytical Journal, 2008

– Improve Credit Scoring by Generalized Additive Model, SAS Global Conference, 2007

**Technical Competencies**

– Data Mining: Decision Trees, Multivariate Adaptive Regression Splines (MARS), Generalized Additive Models, Projection Pursuit Regression, Neural Networks, Bagging, Boosting, Bumping, and Decision Stump.

– Statistics: Generalized Linear Models, Count Outcome Models, Proportion Outcome Models, Longitudinal Models, Finite Mixture Models, Quantile Regression, Multivariate Analysis, and Time Series.

– Programming: R / S+, Python, Julia, Matlab / Octave, and SAS.

– Database: Teradata (BTEQ), Oracle (PL/SQL), DB2, SQL Server (T-SQL), MySQL, SQLite, and MongoDB.

– Utilities: Linux, Cygwin, Emacs (ESS), Vim, SED, Shell, Pig Latin, and HDF5.

– Risk Modeling: Credit Risk Models (PD / EAD / LGD) and Scorecard Development.

Excellent job. Excellent Statistician.

Like your posts very much. Great job.

One quick question: is the 401(k) participation data set (Papke and Wooldridge, 1996) confidential? I searched for it but could not find it.

Keep up excellent work!

Hi, WenSui, sorry to trouble you. I am a first-year student from China. I get notifications by email and have learnt a lot from your blog. I would be very glad if you could write something on naive Bayes classification using SAS code.

Thanks.

Best.

akward

Hi, Nice blog thanks for sharing. Would you please consider adding a link to my website on your page. Please email me back. Thanks!

Aaron Grey

aarongrey112@gmail.com

Hi, WenSui! You are a great stat-analyst. I am a SAS programmer and this is my favorite learning website. I would like to ask whether you have developed something (SAS code) for collapsing levels of class predictor variables (for logistic regression)… I’ve seen your WOE transformation program, but nothing about recoding class variables. (I would be glad to get your reply.)

Thanks a lot for your works!!

L.

Hi WenSui,

Thanks for writing the macro “A SAS Macro Implementing Monotonic WOE Transformation in Scorecard Development”. I have 2 suggestions, if they can be incorporated.

1. A weight variable, to adjust for oversampling / prior probabilities, as one of the macro parameters.

2. For estimating the missing values: since we are mostly looking for a scorecard-type model and the final method is a logistic regression, can we actually use the log(odds) or WOE to estimate the missing values (reverse engineering!), by considering the monotonic behavior in the log(odds) and replacing the missing values with the attribute value of the class / bin with the nearest log(odds)? Also, a floor and cap can be assigned for each variable at 5% and 95%. Also, if a variable has above 30% of its data missing, a dummy can be created, and if between 0-30% is missing, it can be imputed.

In my view it might be better than median substitution, especially for risk scorecards: a median substitution may correspond to a slight drop in scorecard separation and may impact whether this variable is in or out of the model.

regards,

Sriram

ram24x7esi@gmail.com
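The nearest-log(odds) idea in the second suggestion could be sketched roughly as follows. This is only an illustration in Python, not the SAS macro: the function names, the 0.5 smoothing constant, the use of the bin median as the fill value, and the toy data are all assumptions made for the example, and the floor/cap and 30% dummy rules are left out.

```python
import math
from statistics import median

def woe_by_bin(records, edges):
    """Return {bin_label: (woe, median_value)}; the label None marks the missing bin.

    records: list of (x, y), x a number or None, y = 1 (bad) or 0 (good).
    edges: ascending cut points; the label is the number of edges below x.
    Assumes the sample has at least one bad and one good record.
    """
    total_bad = sum(y for _, y in records)
    total_good = len(records) - total_bad
    groups = {}
    for x, y in records:
        label = None if x is None else sum(x > e for e in edges)
        groups.setdefault(label, []).append((x, y))
    stats = {}
    for label, rows in groups.items():
        bad = sum(y for _, y in rows)
        good = len(rows) - bad
        # 0.5 is an ad hoc smoothing constant (an assumption) to avoid log(0)
        woe = math.log(((bad + 0.5) / total_bad) / ((good + 0.5) / total_good))
        values = [x for x, _ in rows if x is not None]
        stats[label] = (woe, median(values) if values else None)
    return stats

def impute_by_nearest_woe(records, edges):
    """Fill missing x with the median of the bin whose WOE is closest to the missing bin's."""
    stats = woe_by_bin(records, edges)
    if None not in stats:
        return [x for x, _ in records]
    missing_woe = stats[None][0]
    nearest = min((k for k in stats if k is not None),
                  key=lambda k: abs(stats[k][0] - missing_woe))
    fill = stats[nearest][1]
    return [fill if x is None else x for x, _ in records]

# Hypothetical toy data: the missing group has the same bad rate as the lowest bin
records = (
    [(2, 1), (4, 1), (6, 1), (8, 0)]            # bin 0: x <= 10
    + [(12, 1), (14, 1), (16, 0), (18, 0)]      # bin 1: 10 < x <= 20
    + [(22, 0), (24, 0), (26, 0), (28, 0)]      # bin 2: x > 20
    + [(None, 1), (None, 1), (None, 1), (None, 0)]
)
imputed = impute_by_nearest_woe(records, edges=[10, 20])
```

In this toy sample the missing group behaves like the lowest bin, so the missing values are filled with that bin's median rather than with the overall median.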

Hi WenSui, thanks for sharing your excellent work and ideas. Could you also upload the test data set here so that we can work through the details ourselves?

Hello, WenSui !

I’m a bit curious about your recent blog concerning performance of parallel processing. Based on other parallel processing “through-put” numbers I’ve seen here & there, I was expecting that splitting the work across 8 CPU cores would have increased the speed approximately 6-fold, not the 3-fold you measured. I’m wondering if the “Ubuntu VM with 8 cores” is really 4 physical cores with hyper-threading, or if perhaps you think this particular task might have been disk-bound.

Thanks for sharing so much with us with your postings.

regards

Doug Dame

Doug_Dame@yahoo.com

You are correct, Doug. It is 4 physical cores with hyper-threading.

appreciate your insight.
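For a rough sense of the numbers in this exchange, Amdahl's law can be used as a back-of-the-envelope check. It is brought in here only as a simplifying model, not something measured in the post, and it treats each core as a full worker, which hyper-threaded logical cores are not.

```python
def amdahl_speedup(parallel_fraction, workers):
    """Speedup predicted by Amdahl's law for a given parallel fraction."""
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / workers)

def implied_parallel_fraction(speedup, workers):
    """Invert Amdahl's law: the parallel fraction that would yield the observed speedup."""
    return (1.0 - 1.0 / speedup) / (1.0 - 1.0 / workers)

# A 3-fold speedup read against 4 physical cores versus 8 logical cores
p_if_4_workers = implied_parallel_fraction(3.0, 4)   # 8/9, about 0.89
p_if_8_workers = implied_parallel_fraction(3.0, 8)   # 16/21, about 0.76

# Parallel fraction a 6-fold speedup on 8 workers would require
p_needed_for_6x = implied_parallel_fraction(6.0, 8)  # 20/21, about 0.95
```

Under this model, a 3-fold gain on 4 physical cores corresponds to a task that is roughly 89% parallel, which fits the hyper-threading explanation without having to assume the job was disk-bound.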

Excellent WOE SAS algorithm. Do you have any references for the procedure you employ that I can review or cite?

Dear Ryan

First of all, I truly appreciate your interest in my SAS code. Unfortunately, I don’t have a reference. It is more of a home-brew algorithm that I came up with myself.

Best

wensui

No problem. Thank you very much for sharing. You have saved me a lot of work. :)

Really appreciate your contributions. I just revised some of your SAS code, and it’s helpful.

Thanks

Regarding “By-Group Aggregation in Parallel” have you tried data.table?

It should be a lot faster than the alternatives for code like that

library(data.table)

load("2008.Rdata")

# assuming the Rdata contains a data.frame called data

setDT(data)

`data[,mean(Distance),keyby=Month]`

You are correct. data.table is indeed a lot faster. However, the purpose of this post is to showcase how to use parallelism.

Thanks for reading my blog.

Hiya, I’m attempting to use your excellent ‘A SAS Macro Implementing Monotonic WOE Transformation in Scorecard Development’ code but am running into some problems with the default SAS server folder where the output files are being directed.

Can you please let me know how to update the code to change where the output files are directed?

Thanks in advance.

Clancy.

Could you show me how to mathematically derive beta0 and beta1 in your SAS getpdo macro?
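A derivation under one common scorecard scaling convention is sketched below. The macro's own parameterization is not shown in this thread, so the symbols are assumptions: the score is linear in the log odds, a base score $P_0$ is pinned to base odds $\theta_0$, and the score rises by PDO points each time the odds double.

```latex
\begin{aligned}
\text{score} &= \beta_0 + \beta_1 \ln(\text{odds}) \\
P_0 &= \beta_0 + \beta_1 \ln(\theta_0) \\
P_0 + \text{PDO} &= \beta_0 + \beta_1 \ln(2\theta_0) \\[4pt]
\text{subtracting the second line from the third:}\quad
\text{PDO} &= \beta_1 \ln 2
\;\Rightarrow\; \beta_1 = \frac{\text{PDO}}{\ln 2} \\
\beta_0 &= P_0 - \beta_1 \ln(\theta_0)
\end{aligned}
```

For example, with $P_0 = 600$, $\theta_0 = 50$, and PDO $= 20$, this gives $\beta_1 \approx 28.85$ and $\beta_0 \approx 487.12$. If the odds are defined the other way round (bad to good instead of good to bad), the sign of $\beta_1$ flips.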

Nice sir.

Hi WenSui, good work. I am a student and tried to run the code from

https://statcompute.wordpress.com/2015/03/29/autoregressive-conditional-poisson-model-i/

but I cannot run

mdl <- acp(y ~ -1, data = cnt)

because I do not have the cnt data.

Can you help me, please?

Thanks.

Juan Pablo

Hello, I’ve been reading your blog and I am interested in the Oct2Py part. Do you have some more examples?

Hello sir, I’ve been reading your blog and I am interested in the Oct2Py part. Do you have some more examples?

Hello WenSui, where can I get the file “flights.db”? thanks.

Great work, master!

Wensui, this is phenomenal work with a lot of dedication. Thank you for your generosity in sharing.

Hi Wensui, thanks a lot for the great work. Could you please share the sample file used with the scorecard code?

Hi WenSui,

For LGD modeling, how should we use WOE to work with categorical variables?

Many thanks

Please email me and we can discuss. Thanks.

Great work. It would be nice if you could supply the link to the files that you used. Thanks.

Do you have any thoughts on models for Gain/Loss on loan sales and Gain/loss on foreclosed assets?

Hi WenSui,

Great blog! I’d love for you to contribute to our blog at http://www.blog.hotdatajobs.com

We are developing a blog to entice and encourage users to come to our site. Our main site, http://www.hotdatajobs.com, will be a job board with jobs from all over the country in the data, analytics, and business intelligence realm.

Please let me know if interested. Thanks!

Ajay Mistry

http://www.hotdatajobs.com

sure, dear Ajay, please let me know how I might contribute.

Hi, is the dataset credit_count.txt publicly available? I wanted to follow your examples. Thanks.

Comment on your latest blog post: https://statcompute.wordpress.com/2016/10/31/fastest-way-to-add-new-variables-to-a-large-data-frame/

Your data.table code can be sped up significantly:

data.table = {

### DATA.TABLE ###

hflights <- data.table(hflights)

hflights[,wday := 'weekday']

hflights[DayOfWeek == 6, wday := 'weekend']

hflights[DayOfWeek == 7, wday := 'weekend']

hflights[,delay := ArrDelay + DepDelay]

}

Very smart way to avoid ifelse().

Hi WenSui

Thanks for the monotonic binning code in Python – looks v interesting

Am keen to replicate results on your Accepts.csv data – any chance you could supply?

It would be a great help to understanding the code and properly applying to my data

thanks

Mark

Hi, WenSui,

I am trying to follow your interesting post “Fitting Generalized Regression Neural Network with Python.” How can I obtain the csdata.txt file? I have googled around without success.

Many thanks in advance,

I’m your biggest fan. Thank you for the monotonic function on SMBinning. I’ve been using R, and I am still limited to inputting categorical variables rather than my preferred continuous variables (as explanatory variables). I’m struggling with transforming the binned variables into WoE numeric variables. Please help!

Is everyone supposed to know how to get the file credit_count.txt?

I stumbled upon this website in search of keras example (which I found).

Is this the file, http://pages.stern.nyu.edu/~wgreene/Text/econometricanalysis.htm

Hello Mr. Liu,

You have a great blog! I have a quick question: are expected loss models (mentioned in the link below) used to assess credit risk in consumer lending portfolios (for individuals, mostly unsecured), usually with a term of 12-36 months?

https://statcompute.wordpress.com/2012/03/06/modeling-practices-of-loss-forecasting-for-consumer-banking-portfolio/

It really depends on the charge-off practice within your organization. To my mind, 12 months might be too long for an unsecured consumer product.

Thank you for the response.

The approval rates for this sector – unsecured consumer loans, mostly low income – are pretty low, and historically the default rates have also been very low. Usually, the term has been 12-36 months. I am exploring ways to assess and manage the risk of such portfolios. Do you have any suggestions?

Thanks!

Hi WenSui

Similar request for the monotonic binning code in Python. I am eager to replicate results on your Accepts.csv data – any chance you could please supply it?

It would be a great help to understanding the code and properly applying to my data

thanks

Hello Mr. Liu

Where can I download credit_count.txt dataset?

http://pages.stern.nyu.edu/~wgreene/Text/econometricanalysis.htm

I read with interest your paper on the estimation of count models in SAS. I have a question about how the figures comparing predicted and observed counts were produced. Did you compute the predicted count for each observation and round it to compare with the distribution of the observed counts, or did you rather compute for each observation the probabilities of a count of k (for k = 0 to 8 in your example) and compare them with the observed distribution, as for example in Long and Freese (Regression Models for Categorical Dependent Variables Using Stata)?

If the latter option is preferred, would it be possible to provide some guidance on how to do it in SAS?

Many thanks

Thanks for reading my blog.

To answer your question, I calculated the probability of each count outcome for each case and then aggregated.
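The "probability of each count for each case, then aggregated" approach might look like the sketch below. It is written in Python rather than SAS and assumes a plain Poisson model purely for illustration; the fitted means and observed counts are made-up values, not data from the paper.

```python
import math
from collections import Counter

def poisson_pmf(k, mu):
    """P(Y = k) for a Poisson variable with mean mu."""
    return math.exp(-mu) * mu ** k / math.factorial(k)

def mean_predicted_probs(fitted_means, max_count):
    """Average P(Y = k) over all cases, for k = 0..max_count."""
    n = len(fitted_means)
    return [sum(poisson_pmf(k, mu) for mu in fitted_means) / n
            for k in range(max_count + 1)]

def observed_shares(counts, max_count):
    """Share of cases with each observed count k = 0..max_count."""
    freq = Counter(counts)
    n = len(counts)
    return [freq.get(k, 0) / n for k in range(max_count + 1)]

# Hypothetical observed counts and per-case fitted means from some count model
observed = [0, 0, 1, 2, 0, 3, 1, 0]
fitted_means = [0.4, 0.5, 1.1, 1.8, 0.3, 2.5, 0.9, 0.6]

predicted = mean_predicted_probs(fitted_means, max_count=8)
actual = observed_shares(observed, max_count=8)
for k, (p, a) in enumerate(zip(predicted, actual)):
    print(f"count={k}  predicted={p:.4f}  observed={a:.4f}")
```

For a zero-inflated or negative binomial model, only the per-case probability function would change; the averaging step stays the same.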

Curious about how to build “Expected Loss Models”. I am still new to loss modeling.

Which libraries have been used for the contribution titled “GRNN vs. GAM”?

For GAM, I used the GAM library by Professor Hastie. For GRNN, I used the YAGeR package developed by myself (https://github.com/statcompute/yager).

What is your email? Let me take a look before getting back to you. Thanks for testing my code.