LeastSquares.gms : Demonstrate the use of numpy.linalg.lstsq on the diabetes test problem using linalg ols

Description

```
Use the libInclude linalg ols to solve the least squares regression problem.
This model demonstrates how to get the same statistical values as with
the LS solver (dropped in GAMS 34). It also demonstrates how to use the
feature in a loop with subsets of the feature set p.
```
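For comparison, the coefficients and the statistics the model recovers (sigma, r2, df, standard errors, t values) can be reproduced directly with `numpy.linalg.lstsq`. The sketch below uses a small synthetic stand-in for the diabetes cohort (which the model loads from diabetes_data.gdx), so the matrix `A`, the true coefficients, and the noise level are illustrative assumptions, not the real data:

```python
import numpy as np

# Synthetic stand-in for the diabetes design matrix A(i,p); the GAMS model
# loads the real cohort from diabetes_data.gdx.
rng = np.random.default_rng(42)
n, k = 20, 3
A = np.column_stack([rng.normal(size=(n, k - 1)), np.ones(n)])  # last column = intercept b0
x_true = np.array([2.0, -1.0, 0.5])                             # assumed true coefficients
y = A @ x_true + rng.normal(scale=0.1, size=n)

# Ordinary least squares via numpy.linalg.lstsq
estimate, _, rank, _ = np.linalg.lstsq(A, y, rcond=None)

# The statistics that the LS solver / linalg ols report
resid  = y - A @ estimate                             # residuals
rss    = resid @ resid                                # residual sum of squares
df     = n - k                                        # degrees of freedom
resvar = rss / df                                     # residual variance
sigma  = np.sqrt(resvar)                              # standard error
r2     = 1 - rss / ((y - y.mean()) @ (y - y.mean()))  # R-squared
covar  = resvar * np.linalg.inv(A.T @ A)              # variance-covariance matrix
se     = np.sqrt(np.diag(covar))                      # standard errors of coefficients
tval   = estimate / se                                # t values
```

With a full-rank `A`, these formulas match what the dropped LS solver wrote to ls.gdx up to the tolerances checked by the gdxdiff call in the listing.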

Category : GAMS Data Utilities library

Main file : LeastSquares.gms   includes :  LeastSquares.gms  ls.gdx  diabetes_data.gdx
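The confidence-interval loop near the end of the listing scales each standard error by a Student-t quantile obtained from the stolib function library. Below is a stdlib sketch of the same arithmetic using the normal quantile as a stand-in for the t quantile (close for large degrees of freedom, as in the diabetes cohort); the `estimate` and `se` values are hypothetical placeholders, not output of the model:

```python
from statistics import NormalDist
import numpy as np

# Hypothetical coefficient estimates and standard errors; the GAMS model
# obtains these from linalg ols on the real data.
estimate = np.array([2.0, -1.0])
se       = np.array([0.10, 0.20])

# Normal quantile stands in for stolib's iCDFStudentT (adequate for large df).
levels = {"90%": 0.90, "95%": 0.95, "97.5%": 0.975, "99%": 0.99}
confint = {}
for name, av in levels.items():
    t = -NormalDist().inv_cdf((1 - av) / 2)   # two-sided critical value
    confint[name] = {"LO": estimate - t * se, "UP": estimate + t * se}
```

With scipy available, `scipy.stats.t.ppf((1 - av) / 2, df)` would reproduce the Student-t quantile of the GAMS code exactly.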

``````
$title Demonstrate the use of numpy.linalg.lstsq on the diabetes test problem using linalg ols (LeastSquares,SEQ=141)

$ontext
Use the libInclude linalg ols to solve the least squares regression problem.
This model demonstrates how to get the same statistical values as with
the LS solver (dropped in GAMS 34). It also demonstrates how to use the
feature in a loop with subsets of the feature set p.
$offtext
Set i   'patients'
    p   'features';

Parameter
A(i<,p<) 'features collected from a cohort of diabetes patients'
y(i)     'quantitative measure of disease progression one year after baseline';

$gdxin diabetes_data.gdx
$load A y
$gdxin

$if not set EXPLICITINTERCEPT $set EXPLICITINTERCEPT 1
$ifThen %EXPLICITINTERCEPT%==1
$onMulti
set p / b0 Intercept /;
$offMulti
A(i,'b0') = 1;
$endIf

* When the LS solver is available use this to create the ls.gdx with the statistics
$ifThen not gamsVersion 340
variables x(p), sse 'sum of squared errors';

equation fit(i) 'equation to fit', sumsq;

sumsq..   sse =n= 0;
fit(i)..  y(i) =e= sum(p, x(p)*A(i,p));

model leastsq /all/;
option lp = ls; solve leastsq using lp minimizing sse;
$exit
$endIf
$log *** Use ls.gdx from previous run with LS solver

Scalars
sigma   'Standard error'
r2      'R-Squared'
df      'Degrees of freedom'
resvar  'Residual variance';
Parameters
estimate(p) 'Estimated coefficients'
se(p)       'Standard errors'
tval(p)     't values'
pval(p)     'p values'
resid(i)    'residuals'
fitted(i)   'fitted values for dependent variable'
covar(p,p)  'variance-covariance matrix';

$FuncLibIn stolib stodclib
Functions
cdfT     / stolib.CDFStudentT  /
icdfT    / stolib.iCDFstudentT /;

$libInclude linalg ols -CDFStudentT=cdfT -iCDFStudentT=icdfT -covar -sigma -r2 -df -rss -resvar -se -tval -pval -fitted -resid i p A y estimate

Set alpha / "90%", "95%", "97.5%", "99%" /;
Parameter
av(alpha) / "90%"   .9
"95%"   .95
"97.5%" .975
"99%"   .99 /;

parameter confint(alpha,p,*) 'confidence intervals';
scalar q, t;
loop(alpha,
t = -icdfT((1-av(alpha))/2, df);
confint(alpha, p, 'LO') = estimate(p) - t*se(p);
confint(alpha, p, 'UP') = estimate(p) + t*se(p);
);
execute.checkErrorLevel 'gdxdiff ls.gdx ls_np.gdx eps=1e-4 releps=1e-4 > %system.nullFile%';

* Now demonstrate dynamic use of OLS with subsets of features
Set pp(p)    'subset of features'
iter     'iterations' / iter1*iter20 /
bestp(p) 'best subset of features'
Parameter
beste(p) 'best estimate';

loop(iter,
option clear=pp; pp(p)$(uniform(0,1)<0.5) = yes;
$if %EXPLICITINTERCEPT%==1  pp('b0') = yes;
$  libInclude linalg ols -mergeType=REPLACE -rss i pp A y estimate