Yet Another Blog in Statistical Computing

I can calculate the motion of heavenly bodies but not the madness of people. – Isaac Newton

About Me

with 44 comments


My name is WenSui Liu, and I am a statistician working for Fifth Third (5/3) Bank in Cincinnati, OH. At 5/3 Bank, I lead a team of quantitative analysts developing operational risk models for enterprise risk and working on the annual submissions for the Comprehensive Capital Analysis and Review (CCAR). Before joining 5/3, I worked for LexisNexis Risk Solutions, JPMorgan Chase, and PayPal in various interesting areas, including database marketing, risk modeling, and fraud detection with machine learning.

I truly enjoy working on data and collaborating with other statisticians. In my spare time, I like learning new computing languages and reading good books and papers in statistics and machine learning.

Let’s link up on LinkedIn:
https://www.linkedin.com/in/liuwensui

Selected Publications

– Modeling Practices of Operational Loss Forecasts, SAS Analytics Conference, 2015
– Modeling Fractional Outcomes with SAS®, SAS Global Forum, 2014
– Modeling Practices of Risk Parameters for Consumer Portfolio, SAS Analytics Conference, 2011
– Rapid Model Refresh in Online Fraud Detection Engine, SAS Data Mining Conference, 2010
– Generalizations of Generalized Additive Model: A Case of Credit Risk Modeling, SAS Global Forum, 2009
– A Class of Predictive Models for Multilevel Risks, SAS Global Forum, 2009
– Count Data Models in SAS, SAS Global Forum, 2008 (Best Contributed Paper)
– Adjustment of Selection Bias in the Marketing Campaign, INFORMS Marketing Science Conference, 2008
– Behavior-based Predictive Models, SAS Data Mining Conference, 2008
– Generalized Additive Model and Applications in Direct Marketing, Direct Marketing Association (DMA) Analytical Journal, 2008
– Improve Credit Scoring by Generalized Additive Model, SAS Global Conference, 2007

Technical Competencies

– Data Mining: Decision Trees, Multivariate Adaptive Regression Splines (MARS), Generalized Additive Models, Projection Pursuit Regression, Neural Networks, Bagging, Boosting, Bumping, and Decision Stump.
– Statistics: Generalized Linear Models, Count Outcome Models, Proportion Outcome Models, Longitudinal Models, Finite Mixture Models, Quantile Regression, Multivariate Analysis, and Time Series.
– Programming: R / S+, Python, Julia, Matlab / Octave, and SAS.
– Database: Teradata (BTEQ), Oracle (PL/SQL), DB2, SQL server (T-SQL), MySQL, SQLite, and MongoDB.
– Utilities: Linux, Cygwin, Emacs (ESS), Vim, SED, Shell, Pig Latin, and HDF5.
– Risk Modeling: Credit Risk Models (PD / EAD / LGD) and Scorecard Development.

Written by statcompute

April 7, 2007 at 8:16 pm

44 Responses


  1. Excellent job. Excellent Statistician.

    James Xiao

    March 7, 2012 at 1:13 pm

  2. Like your posts very much. Great job.
    One quick question: is the 401(k) participation data set (Papke and Wooldridge, 1996) confidential? I searched for it but could not find it.

    Keep up excellent work!

    Aman

    May 27, 2012 at 10:02 am

  3. Hi, WenSui. Sorry to trouble you; I am a first-year student from China. I get the notifications via email and have learned a lot from your blog. I would be even happier if you could write about naive Bayes classification using SAS code.
    Thanks.
    Best.

    taowenwu

    October 25, 2012 at 10:28 pm

  4. Hi, Nice blog thanks for sharing. Would you please consider adding a link to my website on your page. Please email me back. Thanks!

    Aaron Grey
    aarongrey112@gmail.com

    Aaron

    October 27, 2012 at 3:52 am

  5. Hi, WenSui! You are a great stat-analyst. I am a SAS programmer, and this is my favorite learning website. May I ask whether you have developed anything (SAS code) for collapsing levels of class predictor variables (for logistic regression)? I’ve seen your WOE transformation program, but nothing about recoding class variables. (I would be glad for a reply.)
    Thanks a lot for your works!!
    L.

    Laura

    November 13, 2012 at 4:58 pm

  6. Hi WenSui,
    Thanks for writing the macro “A SAS Macro Implementing Monotonic WOE Transformation in Scorecard Development”. I have two suggestions, if they can be incorporated:

    1. A weight variable, as one of the macro parameters, to adjust for oversampling / prior probabilities.
    2. For estimating missing values: since we are mostly building a scorecard-type model whose final form is a logistic regression, could we use the log(odds) or WOE to estimate the missing values (reverse engineering!)? Given the monotonic behavior of the log(odds), missing values could be replaced with the attribute value of the class / bin whose log(odds) is nearest. A floor and cap could also be assigned for each variable at the 5th and 95th percentiles. In addition, if a variable has more than 30% of its data missing, a dummy could be created; if between 0 and 30% is missing, the values could be imputed.

    In my view this might be better than median substitution. Especially for risk scorecards, median substitution may correspond to a slight drop in scorecard separation and may affect whether the variable stays in or out of the model.

    regards,
    Sriram
    ram24x7esi@gmail.com

    ram24x7esi

    February 14, 2013 at 3:48 pm
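A minimal R sketch of the log(odds)-based imputation idea in the comment above (all names are hypothetical; y is assumed to be a 0/1 bad flag, and degenerate bins with all goods or all bads are ignored for brevity):

    library(data.table)

    impute_by_logodds <- function(df, x, y, nbin = 10) {
      dt <- as.data.table(df)[, .(x = get(x), y = get(y))]
      # bin the non-missing values and record log(odds) and median per bin
      bins <- dt[!is.na(x)][, bin := cut(x, unique(quantile(x, 0:nbin / nbin)),
                                         include.lowest = TRUE)]
      smry <- bins[, .(logodds = log(mean(y) / (1 - mean(y))),
                       med = median(x)), keyby = bin]
      # log(odds) of the records where x is missing
      miss_lo <- dt[is.na(x), log(mean(y) / (1 - mean(y)))]
      # return the representative value of the bin with the nearest log(odds)
      smry[which.min(abs(logodds - miss_lo)), med]
    }

The returned value would then replace the missing x, instead of the overall median.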

  7. Hi WenSui, thanks for sharing your excellent work and ideas. Can you also upload the test data set here so that we can figure out the details ourselves?

    Victor

    May 21, 2013 at 5:05 pm

  8. Hello, WenSui !

    I’m a bit curious about your recent blog post on the performance of parallel processing. Based on other parallel-processing throughput numbers I’ve seen here and there, I was expecting that splitting the work across 8 CPU cores would have increased the speed approximately 6-fold, not the 3-fold you measured. I’m wondering if the “Ubuntu VM with 8 cores” is really 4 physical cores with hyper-threading, or if perhaps you think this particular task might have been disk-bound.

    Thanks for sharing so much with us through your postings.

    regards

    Doug Dame
    Doug_Dame@yahoo.com

    Doug Dame

    May 29, 2013 at 12:23 am

  9. You are correct, Doug. It is 4 physical cores with hyper-threading.
    Appreciate your insight.

    statcompute

    May 29, 2013 at 6:29 am
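For anyone checking the same thing on their own machine, R’s parallel package can report physical versus logical cores directly (a small generic sketch, not code from the original post):

    library(parallel)

    # logical = FALSE counts physical cores; logical = TRUE counts hardware
    # threads, so a hyper-threaded 4-core machine reports 8
    detectCores(logical = FALSE)  # e.g. 4
    detectCores(logical = TRUE)   # e.g. 8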

  10. Excellent WOE SAS algorithm. Do you have any references for the procedure you employ that I can review or cite?

    Ryan Vaughn

    August 5, 2013 at 5:40 pm

  11. Dear Ryan
    First of all, I truly appreciate your interest in my SAS code. Unfortunately, I don’t have a reference; it is more of a home-brewed algorithm out of my own head.
    Best
    wensui

    statcompute

    August 5, 2013 at 5:43 pm

  12. No problem. Thank you very much for sharing. You have saved me a lot of work. :)

    Ryan

    August 5, 2013 at 5:54 pm

  13. Really appreciate your contributions. I just revised some of your SAS code; it’s helpful.
    Thanks

    Kun Zou

    September 11, 2013 at 11:23 am

  14. Regarding “By-Group Aggregation in Parallel”, have you tried data.table?

    It should be a lot faster than the alternatives for code like this:

    library(data.table)

    load("2008.Rdata")

    # assuming the Rdata contains a data.frame called data
    setDT(data)

    data[, mean(Distance), keyby = Month]

    Anonymous

    October 6, 2014 at 6:20 am

  15. You are correct; data.table is indeed a lot faster. However, the purpose of the post was to showcase how to use parallelism.

    Thanks for reading my blog.

    statcompute

    October 6, 2014 at 9:11 am
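For reference, a parallel flavor of the same by-group aggregation might look like the mclapply() sketch below (assuming the same 2008.Rdata with a data.frame called data; this is not the exact code from the post, and mc.cores > 1 requires a Unix-alike):

    library(parallel)

    load("2008.Rdata")

    # split Distance by Month and average each chunk on its own core
    res <- mclapply(split(data$Distance, data$Month), mean, mc.cores = 8)
    unlist(res)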

  16. Hi, I’m attempting to use your excellent “A SAS Macro Implementing Monotonic WOE Transformation in Scorecard Development” code but running into some problems with the default SAS server folder to which the output files are directed.
    Can you please let me know how to update the code so I can choose where the output files go?
    Thanks in advance.
    Clancy.

    clancy

    November 18, 2014 at 1:43 am

  17. Could you show me how to mathematically derive beta0 and beta1 in your SAS getpdo macro?

    Jian Yang

    November 26, 2014 at 6:47 pm
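For readers with the same question, here is the generic points / odds / PDO derivation (a sketch of the standard scorecard scaling convention; the getpdo macro itself may differ in details). A score is linear in the log odds, score = beta0 + beta1 * log(odds). Requiring a base score S0 at odds O0 and an increase of PDO points whenever the odds double gives S0 = beta0 + beta1 * log(O0) and S0 + PDO = beta0 + beta1 * log(2 * O0). Subtracting the first equation from the second yields beta1 = PDO / log(2), and therefore beta0 = S0 - beta1 * log(O0).

    # generic scaling, e.g. 600 points at 20:1 odds with PDO = 20; the anchor
    # values are hypothetical, not necessarily those used in the macro
    get_betas <- function(S0 = 600, O0 = 20, PDO = 20) {
      beta1 <- PDO / log(2)
      beta0 <- S0 - beta1 * log(O0)
      c(beta0 = beta0, beta1 = beta1)
    }
    get_betas()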

  18. Nice sir.

    Jayesh

    May 6, 2015 at 9:03 am

  19. Hi WenSui, good work. I am a student trying to run the code in
    https://statcompute.wordpress.com/2015/03/29/autoregressive-conditional-poisson-model-i/

    but I cannot run

    mdl <- acp(y ~ -1, data = cnt)

    because I do not have the cnt data.

    Can you help me please?
    Thanks.
    Juan Pablo

    juan pablo

    May 28, 2015 at 10:50 pm
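Since the original cnt data is not posted, one way to at least get the call above running is to simulate a stand-in count series (purely illustrative; the fitted model says nothing about the original data):

    library(acp)

    set.seed(1)
    # hypothetical stand-in for the original cnt data: a column of counts
    cnt <- data.frame(y = rpois(200, lambda = 5))
    mdl <- acp(y ~ -1, data = cnt)
    summary(mdl)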

  20. Hello, I’ve been reading your blog and I am interested in the Oct2Py part. Do you have some more examples?

    Anonymous

    July 19, 2015 at 11:28 am

  21. Hello sir, I’ve been reading your blog and I am interested in the Oct2Py part. Do you have some more examples?

    Carlos Mario

    July 19, 2015 at 11:29 am

  22. Hello WenSui, where can I get the file “flights.db”? Thanks.

    Anonymous

    November 17, 2015 at 2:31 am

  23. Great work, master!

    Anonymous

    November 18, 2015 at 3:47 pm

  24. Wensui, this is phenomenal work with a lot of dedication. Thank you for your generosity in sharing.

    nhammond36

    December 14, 2015 at 9:30 am

  25. Hi WenSui, thanks a lot for the great work. Could you please share the sample file for working through the scorecard code?

    Shameel Ahmed Shakir

    January 25, 2016 at 11:13 am

  26. Hi WenSui,

    For LGD modeling, how should we use WOE to work with categorical variables?

    Many thanks

    Anonymous

    March 12, 2016 at 6:31 pm

  27. Please email me and we can discuss. Thanks.

    statcompute

    March 12, 2016 at 8:08 pm
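For readers with the same question, the textbook WOE for a categorical variable replaces each level with the log of the level’s share of goods over its share of bads. A small base-R sketch (hypothetical names; y is assumed coded 0 = good, 1 = bad):

    woe_map <- function(x, y) {
      # counts of goods and bads per level
      tab <- table(x, y)
      dist_good <- tab[, "0"] / sum(tab[, "0"])
      dist_bad  <- tab[, "1"] / sum(tab[, "1"])
      log(dist_good / dist_bad)
    }

    # replace each level with its numeric WOE value
    # df$x_woe <- woe_map(df$x, df$y)[as.character(df$x)]

Since an LGD outcome is a fraction rather than a binary flag, one common workaround is to compute the WOE against a binarized loss (e.g. loss > 0) or to use the level-wise mean LGD instead.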

  28. Great work. It would be nice if you could supply the link to the files that you used. Thanks.

    Ellis

    March 21, 2016 at 12:59 am

  29. Do you have any thoughts on models for Gain/Loss on loan sales and Gain/loss on foreclosed assets?

    SS

    May 27, 2016 at 4:38 pm

  30. Hi WenSui,

    Great blog! I’d love for you to contribute to our blog at http://www.blog.hotdatajobs.com

    We are developing a blog to entice and encourage users to come to our site. Our main site, http://www.hotdatajobs.com, will be a job board with jobs from all over the country related to the data, analytics, and business intelligence realm.

    Please let me know if interested. Thanks!

    Ajay Mistry
    http://www.hotdatajobs.com

    Ajay Mistry

    June 21, 2016 at 3:21 pm

  31. Sure, Ajay. Please let me know how I might contribute.

    statcompute

    June 21, 2016 at 8:28 pm

  32. Hi, is the dataset credit_count.txt publicly available? I wanted to follow your examples. Thanks.

    Anonymous

    June 27, 2016 at 9:26 pm

  33. Comment on your latest blog post: https://statcompute.wordpress.com/2016/10/31/fastest-way-to-add-new-variables-to-a-large-data-frame/

    Your data.table code can be sped up significantly:

    data.table = {
      ### DATA.TABLE ###
      # presumably the expression handed to a benchmark harness in the post
      hflights <- data.table(hflights)
      # default everything to 'weekday', then overwrite the two weekend
      # day codes by reference instead of running ifelse() on the column
      hflights[, wday := 'weekday']
      hflights[DayOfWeek == 6, wday := 'weekend']
      hflights[DayOfWeek == 7, wday := 'weekend']
      hflights[, delay := ArrDelay + DepDelay]
    }

    Zachary Deane-Mayer

    October 31, 2016 at 10:32 am

  34. Very smart way to avoid ifelse().

    statcompute

    October 31, 2016 at 10:53 am
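For contrast, the ifelse() version that the sub-assignment above sidesteps would look something like this sketch (same hflights data.table as in the previous comment):

    # vectorized, but ifelse() allocates full-length intermediate vectors,
    # which is the overhead the grouped sub-assignment avoids
    hflights[, wday := ifelse(DayOfWeek %in% c(6, 7), 'weekend', 'weekday')]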

  35. Hi WenSui,
    Thanks for the monotonic binning code in Python – it looks very interesting.
    I am keen to replicate the results on your Accepts.csv data – any chance you could supply it?
    It would be a great help in understanding the code and properly applying it to my data.
    Thanks,
    Mark

    Mark Kenton

    January 27, 2017 at 11:54 am

  36. Hi, WenSui,

    I am trying to follow your interesting post “Fitting Generalized Regression Neural Network with Python.” How can I obtain the csdata.txt file? I have googled around without success.

    Many thanks in advance,

    TMS

    February 2, 2017 at 4:58 pm

  37. I’m your biggest fan. Thank you for the monotonic function on SMBinning. I’ve been using R, and I am still limited to inputting categorical variables rather than my preferred continuous variables (as explanatory variables). I’m struggling with transforming the binned variables into WoE numeric variables. Please help!

    James

    March 13, 2017 at 7:35 am

  38. Is everyone supposed to know how to get the file credit_count.txt?

    I stumbled upon this website in search of a keras example (which I found).

    Is this the file? http://pages.stern.nyu.edu/~wgreene/Text/econometricanalysis.htm

    auro

    March 17, 2017 at 1:32 am

  39. Hello Mr. Liu,
    You have a great blog! I have a quick question: are expected loss models (mentioned in the link below) used to assess credit risk in consumer lending (for individuals, mostly unsecured) portfolios, usually for a term of 12-36 months?
    https://statcompute.wordpress.com/2012/03/06/modeling-practices-of-loss-forecasting-for-consumer-banking-portfolio/

    SK

    April 9, 2017 at 12:10 pm

  40. It should really depend on the charge-off practice at your organization. To my mind, 12 months might be too long for an unsecured consumer product.

    statcompute

    April 9, 2017 at 12:49 pm

  41. Thank you, for the response.
    The approval rates for this sector – unsecured consumer loans, mostly low-income – are pretty low, and historically the default rates have also been very low. Usually, the term has been 12-36 months. I am exploring ways to assess and manage the risk of such portfolios. Do you have any suggestions?

    Thanks!

    SK

    April 18, 2017 at 1:15 am

  42. Hi WenSui,
    A similar request for the monotonic binning code in Python: I am eager to replicate the results on your Accepts.csv data – any chance you could please supply it?
    It would be a great help in understanding the code and properly applying it to my data.
    Thanks

    coursera18

    April 26, 2017 at 10:57 am

  43. Hello Mr. Liu,
    Where can I download the credit_count.txt dataset?

    Abler

    May 17, 2017 at 4:57 am

