2.1
Our discussion of measurement error models draws heavily from Fuller7. When modeling error, the covariates can be treated as random or fixed values. Structural models consider the covariates to be random quantities and functional models consider the covariates to be fixed9. We consider a structural model. Let $Y = X\beta + \epsilon$, where $X$ is a (random) matrix of covariates of dimension $n \times p$, $\beta$ is a vector of coefficients of length $p$, $\epsilon$ is a vector of Normally distributed i.i.d. random errors of length $n$, and $Y$ is the resultant outcome vector, also of length $n$. In an additive measurement error model, we assume that what is observed is not $X$ but rather the "contaminated" or "error-prone" matrix $W = X + U$, where $U$ is a random $n \times p$ matrix.
When a model is fit that ignores measurement error, i.e., it assumes that the true model is $Y = W\beta + \epsilon$, the resulting estimates $\hat{\beta}_{naive}$ are said to be naive and satisfy

$$E[\hat{\beta}_{naive}] = (\Sigma_X + \Sigma_U)^{-1} \Sigma_X \beta \qquad (1)$$

where $\beta$ is the true coefficient vector, $\Sigma_X$ is the covariance matrix of the covariates and $\Sigma_U$ is the covariance matrix of the measurement error. In the case of linear regression with a single covariate, (1) simplifies to an attenuating factor that biases the coefficient estimates towards zero. However, with multiple covariates the bias may increase, decrease, and even change the sign of the estimated coefficients. Notably, measurement error affecting a single covariate can bias coefficient estimates in all of the covariates, even those that are not measured with error9.
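As a small numerical illustration of the bias described above, the following sketch evaluates the large-sample value of the naive estimate, $(\Sigma_X + \Sigma_U)^{-1}\Sigma_X\beta$, for two toy configurations. The function name `naive_limit` and all numerical values are illustrative assumptions, not quantities taken from this paper.

```python
import numpy as np

def naive_limit(sigma_x, sigma_u, beta):
    """Large-sample value of the naive estimate: (Sigma_X + Sigma_U)^{-1} Sigma_X beta."""
    sigma_x = np.asarray(sigma_x, dtype=float)
    sigma_u = np.asarray(sigma_u, dtype=float)
    beta = np.asarray(beta, dtype=float)
    return np.linalg.solve(sigma_x + sigma_u, sigma_x @ beta)

# Single covariate (hypothetical values): equal signal and error variance
# attenuate the coefficient by a factor of one half.
one = naive_limit([[1.0]], [[1.0]], [2.0])

# Two correlated covariates, only the first measured with error: the
# coefficient of the error-free covariate is biased as well.
sigma_x = np.array([[1.0, 0.5], [0.5, 1.0]])
sigma_u = np.array([[0.5, 0.0], [0.0, 0.0]])
two = naive_limit(sigma_x, sigma_u, [1.0, 1.0])
```

In the second configuration the first coefficient is attenuated while the second, although measured without error, is inflated.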
2.2
Ma and Li10 presented methods to account for measurement error while performing variable selection in parametric and semi-parametric settings. Focusing on the parametric setting, they proposed a wide-scoping method that can be used in more than just generalized linear models. The method relies on deriving the full likelihood of each observation and its corresponding score function, $S_i(\beta)$, choosing a penalty function $p_\lambda(\beta)$ and finding its derivative, $p'_\lambda(\beta)$, then solving the penalized estimating equations:

$$\sum_{i=1}^{n} S_i(\beta) - n\, p'_\lambda(\beta) = 0.$$

Solving the penalized equations can be very difficult computationally, especially in the high dimensional setting. Therefore, we will compare our method with faster methods that are variants of the Lasso, which can be solved much more quickly.
2.3
Sorensen et al.4 analyze the Lasso11 in the presence of measurement error by studying the properties of

$$\hat{\beta}_{Lasso} = \arg\min_{\beta} \left\{ \|Y - W\beta\|_2^2 + \lambda \|\beta\|_1 \right\}.$$

$\hat{\beta}_{Lasso}$ is asymptotically biased when $\Sigma_U \neq 0$, since its limit is $(\Sigma_X + \Sigma_U)^{-1}\Sigma_X\beta$ rather than $\beta$. Notice this is the same bias that is introduced when naive linear regression is performed on observed covariates. Sorensen et al.4 derive a lower bound on the magnitude of the non-zero coefficient elements below which the corresponding covariate will not be selected, and an upper bound on the estimation error of $\hat{\beta}_{Lasso}$. They show that with increasing measurement error the lower bound increases, i.e., increasing measurement error adds non-informative noise to the system, and so for the signal associated with the relevant covariates to be identified the signal must increase. Increased measurement error also leads to an increase in the upper bound of the estimation error. Sign consistent selection is also impacted by the presence of covariate measurement error. Sorensen et al.4 set a lower bound on the probability of sign consistent selection in this setting. The result requires that the Irrepresentability Condition with Measurement Error (IC-ME) holds. The IC-ME requires that the measurements of the relevant and irrelevant covariates have limited correlation, relative to the size of the relevant measured covariate correlation. Note the sample correlation of the irrelevant covariates is not considered. By studying the form of the lower bound, it can be concluded that (at least when using the Lasso) measurement error introduces a greater distortion on the selection of irrelevant covariates than it does in the selection of relevant covariates.
Sorensen et al.4 introduced an iterative method to obtain the Regularized Corrected Lasso with constraint on the radius $R$:

$$\hat{\beta}_{RCL} = \arg\min_{\beta : \|\beta\|_1 \le R} \left\{ \frac{1}{n}\|Y - W\beta\|_2^2 - \beta^T \Sigma_U \beta \right\} \qquad (3)$$

The main results of their simulation study were consistent with their analytical results, namely that the corrected Lasso had a slightly lower selection rate for the true covariates than the naive Lasso, but was also more conservative in including irrelevant covariates. Further, both measures of prediction error that they considered were lower for the corrected Lasso.
The major drawback of the corrected Lasso method is that it is very computationally intensive, involving an iterative calculation where each step involves a projection of an updated $\hat{\beta}$ onto the $\ell_1$-ball for a given radius $R$. The iterative process must be conducted for each fixed value of the radius $R$. The selected values of $R$ provide a path of possible solutions for $\hat{\beta}_{RCL}$. Hence, the approach seems impractical for large-scale problems and for repeated application in a simulation study.
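To make the cost of each iteration concrete, the sketch below shows one standard way to project a vector onto the $\ell_1$-ball of radius $R$, using a sort-based soft-thresholding rule. This is an illustrative assumption about how such a projection can be computed; it is not taken from Sorensen et al., and the function name `project_l1_ball` is hypothetical.

```python
import numpy as np

def project_l1_ball(v, radius):
    """Euclidean projection of v onto {b : ||b||_1 <= radius} (sort-based rule)."""
    v = np.asarray(v, dtype=float)
    if radius <= 0:
        raise ValueError("radius must be positive")
    if np.abs(v).sum() <= radius:
        return v.copy()  # already inside the ball
    u = np.sort(np.abs(v))[::-1]          # magnitudes in decreasing order
    css = np.cumsum(u)
    k = np.arange(1, u.size + 1)
    # Largest index at which the soft-threshold level stays below the magnitude.
    rho = np.nonzero(u * k > css - radius)[0][-1]
    theta = (css[rho] - radius) / (rho + 1.0)
    return np.sign(v) * np.maximum(np.abs(v) - theta, 0.0)

projected = project_l1_ball([3.0, 1.0, 0.0], radius=2.0)
```

Each projection requires a sort of the coefficient vector, and the corrected Lasso repeats it at every iteration and for every candidate radius, which is where the computational burden noted above comes from.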
2.4
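A recent paper by Datta and Zou5 proposes an alternative approach which they refer to as the Convex Conditioned Lasso (CoCoLasso). Consider the following reformulation of the Lasso problem,

$$\hat{\beta} = \arg\min_{\beta} \left\{ \frac{1}{2}\beta^T \Sigma \beta - \rho^T \beta + \lambda \|\beta\|_1 \right\}, \qquad \Sigma = \frac{1}{n}X^T X, \quad \rho = \frac{1}{n}X^T Y. \qquad (5)$$

The CoCoLasso is based on the Loh and Wainwright corrections12 for the predictor-outcome correlation $\rho$ and variance matrix $\Sigma$ in the presence of measurement error. When error-prone covariates $W$ are measured in place of $X$, we can get corrected estimates $\hat{\Sigma}$ and $\tilde{\rho}$:

$$\hat{\Sigma} = \frac{1}{n}W^T W - \Sigma_U, \qquad \tilde{\rho} = \frac{1}{n}W^T Y, \qquad (6)$$

where $\Sigma_U$ is the (assumed known) variance of the measurement error in the measured $W$. These estimators are unbiased. A measurement error corrected Lasso estimate could then be derived by substituting $\hat{\Sigma}$ and $\tilde{\rho}$ into (5). The problem with this idea is that the corrected matrix $\hat{\Sigma}$ may not be a valid covariance matrix, since it may fail to be positive semi-definite. If $\hat{\Sigma}$ has a negative eigenvalue, then this Lasso function would be non-convex and unbounded. To overcome this obstacle, the key to the CoCoLasso5 is calculating the projection of $\hat{\Sigma}$ onto the space of positive definite matrices:

$$\tilde{\Sigma} = (\hat{\Sigma})_+ = \arg\min_{\Sigma_1 \succeq 0} \|\hat{\Sigma} - \Sigma_1\|_{\max} \qquad (7)$$

The CoCoLasso then solves a standard Lasso problem in which $\Sigma$ and $\rho$ are replaced with the corrected values from (6) and (7), yielding the CoCoLasso estimator:

$$\hat{\beta}_{CoCo} = \arg\min_{\beta} \left\{ \frac{1}{2}\beta^T \tilde{\Sigma} \beta - \tilde{\rho}^T \beta + \lambda \|\beta\|_1 \right\}.$$

When $\hat{\Sigma}$ is not positive definite, the projection from (7) can be challenging to compute. However, the projection only needs to be done once, unlike the Sorensen correction4, which requires a projection at each iteration.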
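The following sketch illustrates why the corrected covariance matrix can fail to be positive semi-definite and what a projection step does to it. It assumes the correction takes the form $W^TW/n - \Sigma_U$, and it uses eigenvalue clipping as a simple stand-in for the projection: clipping gives the nearest positive semi-definite matrix in Frobenius norm, which need not coincide with the projection used by the CoCoLasso. The function names and the toy dimensions are illustrative assumptions.

```python
import numpy as np

def corrected_moments(W, Y, sigma_u):
    """Corrected covariance and cross-moment estimates (assumed form)."""
    n = W.shape[0]
    sigma_hat = W.T @ W / n - sigma_u
    rho_tilde = W.T @ Y / n
    return sigma_hat, rho_tilde

def clip_to_psd(sigma_hat, floor=0.0):
    """Stand-in projection: replace eigenvalues below `floor` with `floor`."""
    sym = (sigma_hat + sigma_hat.T) / 2.0
    vals, vecs = np.linalg.eigh(sym)
    return (vecs * np.maximum(vals, floor)) @ vecs.T

# Toy example with more covariates than observations, so W^T W / n is
# rank deficient and subtracting Sigma_U produces negative eigenvalues.
rng = np.random.default_rng(0)
W = rng.normal(size=(5, 10))
Y = rng.normal(size=5)
sigma_hat, rho_tilde = corrected_moments(W, Y, 0.5 * np.eye(10))
sigma_psd = clip_to_psd(sigma_hat)
```

Because the projection depends only on $W$ and $\Sigma_U$, it is computed once and reused for every value of the penalty parameter.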
Our proposed variable selection algorithm, MEBoost (Measurement Error Boosting), is based on an iterative functional gradient descent type algorithm that generates variable selection paths. The key idea is that, instead of following a path defined by the gradient of a loss function (e.g., the likelihood), the "descent" follows the direction defined by an estimating equation. The algorithmic structure of MEBoost is shared with ThrEEBoost (Thresholded Estimating Equation Boost,13), a general-purpose technique for variable selection based on estimating equations. While ThrEEBoost described an approach to performing variable selection in the presence of correlated outcomes by leveraging the Generalized Estimating Equations14, MEBoost achieves improved variable selection performance in the presence of measurement error by following a path defined by a measurement error corrected score function due to Nakamura, which is described in Section 3.1. Nakamura's approach is applicable to linear regression models with normal additive or multiplicative measurement error. Closed-form corrected score functions are also derived for Poisson, Gamma, and Wald regression. Nakamura comments that no closed form correction can be created for logistic regression. By using this family of corrected score functions, the MEBoost algorithm is more broadly applicable than the corrected Lasso and CoCoLasso, neither of which is obviously generalizable beyond linear regression.
3.1
Nakamura8 proposed a set of corrected score functions for performing estimation and inference in the generalized linear regression model where covariates are subject to additive measurement error with known variance matrix $\Sigma_U$. In general, the corrected score function $S^*$, based on the covariates measured with error ($W$), has expectation equal to the score function, $S$, based on the true covariates ($X$). For the normal linear model, Nakamura proposed the following correction to the negative log-likelihood to account for measurement error:

$$l^*(\beta, \sigma^2; Y, W) = \frac{n}{2}\log\sigma^2 + \frac{1}{2\sigma^2}\left[ (Y - W\beta)^T (Y - W\beta) - n\beta^T \Sigma_U \beta \right] \qquad (9)$$

Differentiating (9) with respect to $\beta$ (and changing sign, since (9) is a negative log-likelihood), we obtain the corrected score function:

$$S^*(\beta; Y, W, \sigma^2) = \frac{1}{\sigma^2}\left[ W^T (Y - W\beta) + n\Sigma_U \beta \right]$$

In this case the corrected score function is the 'naive' score function, $S(\beta; Y, W, \sigma^2)$, with a measurement error correction determined by the sample size, model error, measurement errors, and the coefficient value: $n\Sigma_U\beta/\sigma^2$. The naive score function is the score function from the true model calculated with the measured covariates:

$$S(\beta; Y, W, \sigma^2) = \frac{1}{\sigma^2} W^T (Y - W\beta)$$

The corrected variance estimate will be calculated as the value of $\sigma^2$ that minimizes (9), which in the normal case is:

$$\hat{\sigma}^{*2} = \frac{1}{n}(Y - W\beta)^T (Y - W\beta) - \beta^T \Sigma_U \beta$$

Similarly to the corrected score function, the corrected variance estimate is the naive variance estimate, $\frac{1}{n}(Y - W\beta)^T(Y - W\beta)$, with a measurement error correction. The correction reduces the estimated variance, thus subtracting the noise introduced by the measurement error. In the variance case the correction factor is determined only by the true coefficient vector and the measurement error variance.
As another example, the correction for Poisson distributed data is the following:

$$S^*(\beta; Y, W) = \sum_{i=1}^{n} \left[ Y_i W_i - (W_i - \Sigma_U \beta)\exp\left( W_i^T \beta - \tfrac{1}{2}\beta^T \Sigma_U \beta \right) \right]$$

which we apply in our data application (see Section 5). Nakamura8 also provides corrections for multiplicative measurement error in linear regression, as well as measurement error in Gamma and Wald regression. In what follows, we use the normal linear additive measurement error corrected score function as part of an iterative path-following algorithm that performs variable selection in the presence of covariate measurement error.
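The defining property of a corrected score, that its expectation matches the score based on the true covariates, can be checked numerically. The sketch below assumes the normal linear corrected score has the form naive score plus $n\Sigma_U\beta/\sigma^2$, and that the corrected variance is the naive residual variance minus $\beta^T\Sigma_U\beta$ (with divisor $n$, an assumption). Function names and simulated values are illustrative.

```python
import numpy as np

def naive_score(beta, W, Y, sigma2):
    """Score of the normal linear model evaluated at the measured covariates."""
    return W.T @ (Y - W @ beta) / sigma2

def corrected_score(beta, W, Y, sigma2, sigma_u):
    """Naive score plus the measurement error correction n * Sigma_U beta / sigma^2."""
    n = W.shape[0]
    return naive_score(beta, W, Y, sigma2) + n * (sigma_u @ beta) / sigma2

def corrected_variance(beta, W, Y, sigma_u):
    """Naive residual variance minus beta^T Sigma_U beta."""
    resid = Y - W @ beta
    return resid @ resid / W.shape[0] - beta @ sigma_u @ beta

# Simulated check with hypothetical values: at the true beta the corrected
# score averages to roughly zero, while the naive score does not.
rng = np.random.default_rng(0)
n = 50000
beta = np.array([1.0, -1.0])
sigma_u = 0.5 * np.eye(2)
X = rng.normal(size=(n, 2))
W = X + rng.normal(scale=np.sqrt(0.5), size=(n, 2))
Y = X @ beta + rng.normal(size=n)
avg_corrected = corrected_score(beta, W, Y, 1.0, sigma_u) / n
avg_naive = naive_score(beta, W, Y, 1.0) / n
```

In this simulation `avg_naive` settles near $-\Sigma_U\beta$, which is exactly the term the correction adds back.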
3.2
Our proposed variable selection algorithm, MEBoost, consists of applying ThrEEBoost with the corrected score function and corrected variance estimate described in the previous section. Algorithm 1 summarizes the MEBoost procedure.
Let $\tau \in [0, 1]$ be the fixed thresholding parameter. Starting with a $\beta$ estimate of 0 and an initial value of $\sigma^2$, the corrected score function $S^*$ is calculated at these values, and the magnitude of each component of $S^*$ is recorded. The indices of elements to update are identified by a thresholding rule, $J_t = \{ j : |S^*_j| \ge \tau \cdot \max_k |S^*_k| \}$. The next point in the variable selection path, $\beta^{(t+1)}$, is obtained by adding a small value, $\gamma$, to each of these elements in the direction corresponding to the signs of each $S^*_j$ for $j \in J_t$. This updated $\beta$ is used to calculate an updated corrected $\sigma^2$. The algorithm continues for $T$ iterations, where $T$ is typically chosen to be large (e.g., 1,000).
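The iteration just described can be sketched in a few lines for the normal linear model. This is a minimal illustration under stated assumptions, not the authors' implementation: the symbols `tau` (threshold), `gamma` (step size) and `n_iter` (number of iterations $T$) are our labels, the corrected score is assumed to be $[W^T(Y - W\beta) + n\Sigma_U\beta]/\sigma^2$, and the floor on the corrected variance is an added safeguard.

```python
import numpy as np

def meboost_path(W, Y, sigma_u, tau=0.6, gamma=0.01, n_iter=1000):
    """Thresholded sign updates along a corrected score; returns (n_iter + 1, p) path."""
    n, p = W.shape
    beta = np.zeros(p)
    path = np.zeros((n_iter + 1, p))       # row 0 is the starting value beta = 0
    for t in range(1, n_iter + 1):
        resid = Y - W @ beta
        # Corrected variance estimate, floored to stay positive (our safeguard).
        sigma2 = max(resid @ resid / n - beta @ sigma_u @ beta, 1e-8)
        score = (W.T @ resid + n * (sigma_u @ beta)) / sigma2
        magnitude = np.abs(score)
        if magnitude.max() == 0.0:
            path[t:] = beta                 # score is exactly zero: nothing to update
            break
        # Thresholding rule: update components within a factor tau of the largest.
        update = magnitude >= tau * magnitude.max()
        beta = beta + gamma * np.sign(score) * update
        path[t] = beta
    return path
```

Note that in this form $\sigma^2$ rescales every component of the score by the same positive factor, so it affects neither the signs nor which components pass the threshold; it is kept here only to mirror the description.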
The step size $\gamma$ and the number of iterations $T$ interact to determine the specific variable selection path that results from the algorithm. The smaller the value of $\gamma$, the smaller the distance between $\beta$ estimates on the selection path, while a larger value of $\gamma$ leads to larger jumps in the selection path. Ideally, a very small value of $\gamma$ (e.g., 0.01) would be used, but then a large number of iterations, $T$, may be required to generate a selection path. This is the trade-off one is required to make when determining the step size. A selection path incremented by only a small value is preferable to a path which takes large steps, but the time required for a large number of iterations may become prohibitive. At each iteration $t$, those elements of the coefficient vector that are still zero have not yet been selected. A conservative selection approach takes a combination of small $\gamma$ and $T$, whereas a more aggressive approach takes a combination of larger $\gamma$ and $T$. In the case when $\tau = 1$, the MEBoost algorithm only updates the element(s) with the maximum absolute value. For any combination of $\gamma$ and $T$, this is the most conservative approach that can be taken and will lead to sparser models than when a threshold is considered. It also requires a much larger value of $T$.
The threshold parameter $\tau$ determines how many coefficients are updated at each iteration; it offers a compromise between updating each coefficient at every iteration ($\tau = 0$, similar to standard gradient descent) and updating only the coefficient corresponding to the element of the estimating equation with largest magnitude ($\tau = 1$). In the context of Generalized Linear Models without measurement error, Wolfson15 showed that setting $\tau = 1$ yields an update rule that is asymptotically equivalent (as the step size goes to zero and the number of iterations grows) to following the path of minimizers of an $\ell_1$-penalized projected artificial log-likelihood ratio whose tangent is the GLM score function. By allowing multiple directions to be updated at each iteration, MEBoost can explore a much wider range of variable selection paths; as we discuss later, cross-validation can be used to select the parameter $\tau$ which leads to the optimal level of thresholding. In ThrEEBoost13, it was shown that a threshold in the range of 0.4-0.8 may perform better than thresholds closer to 0 or 1.
3.2.1
Nakamura's measurement corrected score functions are derivatives of corrected negative log likelihoods. In the normal case, the correction is exactly that described in Sorensen et al. (see Equation (3)). Hence, the arguments of Rosset16 can be applied to show that 1) MEBoost applied with a small step size $\gamma$ and threshold value $\tau = 1$, and 2) the solutions to (3), have the same local behavior. Specifically, under some regularity conditions15, in the limit of a vanishing step size and a growing number of iterations, MEBoost's iterative steps match the sequence of solutions to (3).
3.2.2
For a fixed $\tau$, identifying a final model involves choosing a point on the variable selection path generated by Algorithm 1; this is akin to choosing the penalty parameter in the Lasso. Cross-validation using a loss function relevant to the problem at hand (e.g., mean squared error) can be used to select a $\hat{\beta}$ on the path. Cross-validation can similarly be used to select the best value of $\tau$. The full procedure is described in Algorithm 2.
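A rough sketch of this kind of selection is given below. It is not a transcription of Algorithm 2: the fold construction, the squared-error loss on the measured covariates, and the callable `fit_path` (any function returning a coefficient path for a training set and a threshold) are illustrative assumptions.

```python
import numpy as np

def cv_select(W, Y, fit_path, taus, n_folds=5, seed=0):
    """Pick (loss, tau, path index) minimizing K-fold squared error on measured data.

    fit_path(W_train, Y_train, tau) must return an array of shape (n_points, p)
    with the same n_points for every call.
    """
    rng = np.random.default_rng(seed)
    n = W.shape[0]
    folds = np.array_split(rng.permutation(n), n_folds)
    best = (np.inf, None, None)
    for tau in taus:
        total = None
        for held_out in folds:
            train = np.setdiff1d(np.arange(n), held_out)
            path = fit_path(W[train], Y[train], tau)
            # Residuals for every point on the path: shape (n_points, n_held_out).
            resid = Y[held_out][None, :] - path @ W[held_out].T
            fold_loss = (resid ** 2).mean(axis=1)
            total = fold_loss if total is None else total + fold_loss
        mean_loss = total / n_folds
        index = int(np.argmin(mean_loss))
        if mean_loss[index] < best[0]:
            best = (float(mean_loss[index]), tau, index)
    return best
```

After selection, the path would be refit on the full data with the chosen threshold and the coefficient vector at the chosen index taken as the final model.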
To examine the impact of measurement error in the covariates on variable selection we performed a simulation study. We evaluated MEBoost by comparing it to two variable selection methods: the Convex Conditioned Lasso (CoCoLasso), and the "naive" Lasso which does not correct for measurement error.
4.1
Data were generated from a linear regression model with iid normal errors, $Y = X\beta + \epsilon$. The sample size for all studies is 80. The true covariates are drawn from a multivariate normal distribution whose covariance matrix is a block diagonal matrix with diagonal entries equal to 1, and 10 by 10 blocks corresponding to a group of 10 covariates with an exchangeable correlation structure with a common pairwise correlation. In all simulations the true model has 10 non-zero coefficients and 90 zero coefficients, arranged so that the relevant covariates in the first block were correlated.
The measured covariates were generated as $W = X + U$ for $U$ a matrix whose columns were generated as described below. To explore the impact of different types of measurement error, we considered 10 different scenarios for generating the columns of $U$ and varying the assumptions made about it. In the first five scenarios, $U$ is assumed to be normally distributed with mean zero and covariance matrix $\Sigma_U$, and the scenarios explore different structures for $\Sigma_U$. In each of Scenarios 1-5, we correctly specify the distribution of $U$ when applying MEBoost and the CoCoLasso. Scenarios 6-10 explore cases where the distribution of $U$ is incorrectly specified.
1. Basecase:,w here
theidentitym atrix,and
.
2. Varyingforjin 1-10.Thispattern repeatsacrosstheblocksof10 covariates.Therelevantcovariateshavesim ilar
5. Som eU’snotm easuredw ith error:,w here
and
for
.
6. O verestim ated generatedasinScenario1,butw especify
.
7. Underestim atedgenerated asinScenario1,butw especify
.
8. M isspecified correlation:U generated as in Scenario 3,butw e ignore the correlation and specify in running M EBoostand
9. M easurem ent error is distributed uniform ly:Each entry of U is generated independently from a Uniform distribution,
.M EBoostandCoCoLassoarerun assum ing
w ith
.
10. M easurem enterrorisdistributed asym m etrically:Each entry ofU isgenerated independentlyfrom ashifted exponentialdistribution,
.M EBoostandCoCoLassoarerun assum ing
w ith
.
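The data-generating setup above can be sketched as follows for a base case with independent normal measurement errors. The correlation `rho`, the error scale `noise_sd`, the coefficient values, and the measurement error variance used in the example call are placeholders chosen for illustration; they are not the values used in this study.

```python
import numpy as np

def block_exchangeable_cov(n_blocks, block_size, rho):
    """Block diagonal covariance: unit diagonal, pairwise correlation rho within blocks."""
    block = np.full((block_size, block_size), rho)
    np.fill_diagonal(block, 1.0)
    return np.kron(np.eye(n_blocks), block)

def simulate(n, sigma_x, sigma_u, beta, noise_sd, rng):
    """Draw true covariates X, measured covariates W = X + U, and outcome Y."""
    p = sigma_x.shape[0]
    X = rng.multivariate_normal(np.zeros(p), sigma_x, size=n)
    U = rng.multivariate_normal(np.zeros(p), sigma_u, size=n)
    Y = X @ beta + rng.normal(0.0, noise_sd, size=n)
    return X, X + U, Y

# Placeholder configuration: 10 blocks of 10 covariates, n = 80,
# first block relevant, independent errors with a common variance.
rng = np.random.default_rng(3)
cov = block_exchangeable_cov(n_blocks=10, block_size=10, rho=0.3)
beta = np.r_[np.ones(10), np.zeros(90)]
X, W, Y = simulate(80, cov, 0.25 * np.eye(100), beta, 1.0, rng)
```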
Cross-validation (using the measured covariates) was used to select the optimal value of $\tau$ and number of MEBoost iterations, as well as the value of $\lambda$ in the CoCoLasso and naive Lasso. We compared MEBoost, CoCoLasso, and naive Lasso on two metrics of prediction error: mean squared error based on the true covariates (MSE), and mean squared prediction error based on the measured covariates (MSE-M). These metrics were estimated using independent test sets generated during each individual simulation. We also computed the distance of $\hat{\beta}$ from the true $\beta$, and variable selection sensitivity and specificity. For each scenario the metrics presented are the average over 1,000 simulations, and are calculated at intervals of 0.05 along the selection path. Because the MEBoost algorithm may change multiple indices at each iteration it may not have values along each interval in the path. To account for this, a linear approximation of the relevant statistic was made at each point in the path.
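For concreteness, the selection and prediction metrics can be computed as in the sketch below. It assumes that a covariate counts as selected when its estimated coefficient is non-zero, and that both prediction metrics are averages of squared residuals on a test set; the function names are illustrative.

```python
import numpy as np

def selection_metrics(beta_hat, beta_true):
    """Sensitivity and specificity of the set of non-zero estimated coefficients."""
    selected = np.asarray(beta_hat) != 0
    relevant = np.asarray(beta_true) != 0
    sensitivity = (selected & relevant).sum() / relevant.sum()
    specificity = (~selected & ~relevant).sum() / (~relevant).sum()
    return float(sensitivity), float(specificity)

def mean_squared_error(Y, covariates, beta_hat):
    """Pass the true X for MSE, or the measured W for MSE-M."""
    return float(np.mean((Y - covariates @ beta_hat) ** 2))

sens, spec = selection_metrics([0.5, 0.0, 0.1, 0.0], [1.0, 1.0, 0.0, 0.0])
```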
We note that in this simulation study we chose to investigate model performance based on both the true and error-prone covariates. The motivation for techniques like ours which account for measurement error is to uncover the underlying relationship between the error-free covariates $X$ and the outcome $Y$. Hence, in an ideal world, values of $X$ would be available on some subset (or an independent set) of observations so that prediction error could be assessed and the "best" model chosen. However, in practice we will often only have access to the error-prone covariates $W$ for model fitting. So, if error-free measurements $X$ are not (and may never be) available, is it worthwhile to correct for measurement error? Buonaccorsi17 argued against correction, using the logic that the future predictions will be based on (error-prone) $W$, not on (error-free) $X$. Indeed, it can be shown in simple linear regression that, in a large sample, the expected value of MSE-M for an estimate ignoring measurement error is less than or equal to that of a corrected estimate. However, as seen in the results section that follows, we found that correcting for measurement error decreased prediction error regardless of whether predictions were computed using error-free or error-prone covariates. Since we often only have mismeasured data available, it is reassuring to see that we are able to use the measured covariates to perform cross-validation to select a model that will provide us with an accurate relationship between the outcome and the true covariates. This finding is discussed in greater detail below.
4.2
Table 1 presents the minimum MSE, MSE-M, distance from the true $\beta$, sensitivity, and specificity at the minimum MSE for the three variable selection methods across the 10 scenarios. In all ten scenarios, MEBoost had the lowest MSE, MSE-M, and distance from the true $\beta$. The CoCoLasso has 16.6%-71.7% higher prediction error from the true covariates than MEBoost, and in the case where measurement error is overestimated, the prediction error from the CoCoLasso is 5.26 times that of MEBoost. This is likely due to the fact that the Loh and Wainwright correction $\hat{\Sigma}$ in (6) is more negative, and hence requires a "longer" (and hence potentially more distorting) projection onto the space of positive definite matrices.
In terms of variable selection, MEBoost had a greater sensitivity and lower specificity than CoCoLasso in each case, while the Lasso had the lowest specificity. The Lasso struggles most when correlation is present in the measurement error. Its MSE is about 2.5 times that of MEBoost when we allow MEBoost to account for the correlation. All methods perform poorly when we misspecify by ignoring the correlation. The sensitivity and specificity are at high levels for most simulations, with the exception of the misspecified $\Sigma_U$ that ignored correlation. Overestimating $\Sigma_U$ led to a more conservative selection process with a high specificity, while underestimating $\Sigma_U$ had a higher sensitivity. The distance from the true $\beta$ can also tell us about performance. Again, the scenario where we misspecify $\Sigma_U$ by ignoring correlation performs worst.
We applied our method to baseline data collected in the Box Lunch Study, a randomized trial of the effects of portion size availability on weight change. In the study, a total of 219 subjects were randomized to one of four groups: in three groups, subjects were provided a free daily lunch with a fixed number of calories (400, 800, and 1600). The control group was not provided a free lunch.
We considered the problem of predicting the number of times subjects reported binging on food in the last month, using Poisson regression with 99 explanatory variables. All variables were measured at baseline. 16 of the 99 explanatory variables were self-reported measures; of these 16, 8 were measures of food consumption and therefore possibly subject to substantial measurement error. Another 8 may have also been measured with error. Kipnis18 examined a nutritional study with a 24 hour recall, and found that the correlation between the true and reported consumption of protein and energy was only 0.336. We assume this relationship exists in each of our variables measured with error. Assuming the measurement error variance $\sigma_u^2$ is independent of the variance of the true covariate $\sigma_x^2$, we can obtain:

$$\mathrm{Corr}(X, W) = \frac{\sigma_x}{\sqrt{\sigma_x^2 + \sigma_u^2}} = 0.336$$

and hence $\sigma_u^2 / (\sigma_x^2 + \sigma_u^2) = 1 - 0.336^2 \approx 0.887$. This is the value we will need to provide MEBoost for our assumption of the measurement error. We assume this level of measurement error for each 24 hour dietary recall variable. After scaling our predictors to have zero mean and unit variance, we applied our method with the Nakamura correction. Since our measured data has its variance ($\sigma_w^2 = \sigma_x^2 + \sigma_u^2$) scaled to equal 1, we assumed that the 8 dietary recall covariates measured with error had $\sigma_u^2 = 0.887$. Since dietary variables may be more prone to measurement error than other variables, we scaled the assumed error of the other 8 variables to be half that of the nutritional variables. The remaining variables were assumed to be measured without error. We conducted a sensitivity analysis to assess the performance of our method by setting $\sigma_u^2$ to other values, including 0.25.
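The arithmetic behind the assumed measurement error variance can be reproduced directly. The sketch assumes that the measurement error is independent of the true covariate and that the measured covariate has been scaled to unit variance, so that the squared correlation equals the share of variance due to the true covariate; the function name is illustrative.

```python
def error_variance_share(corr_true_measured):
    """Share of the measured covariate's variance that is due to measurement error."""
    return 1.0 - corr_true_measured ** 2

# Correlation of 0.336 between true and reported intake, measured variance scaled to 1.
sigma_u2_dietary = error_variance_share(0.336)  # about 0.887
```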
To select tuning parameters, we employed 8-fold cross validation based on the deviance on a training set consisting of 70% of the data. The performance of our model was evaluated on the remaining test set. We present the models derived from MEBoost performed with three different thresholds: 0.2, 0.6 (the approximate value estimated using cross-validation), and 0.9.
Table 2 shows the selected variables and estimated prediction error (MSE-M, bottom row) for various MEBoost models along with results from the naive Lasso. We did not compare to the Measurement Error Lasso or the CoCoLasso because implementing these techniques in a problem of this size was computationally infeasible. The deviance and MSE-M were lowest for the model selected by MEBoost assuming the highest measurement error ($\sigma_u^2 = 0.887$) and a threshold value of 0.6. This model selected just 4 variables, which were a subset of the 7 chosen with the naive Lasso. The other two MEBoost models included up to two additional variables relative to the MEBoost model that minimized MSE-M. Regardless of the assumption about the level of measurement error, using a threshold value of $\tau = 0.2$ leads to the inclusion of several variables with small coefficients, and a much higher deviance and prediction error. Of particular note is that the naive Lasso (and MEBoost with the lower threshold) included the variable corresponding to the number of daily calories consumed at breakfast, while the best performing MEBoost models (with $\tau = 0.6$ and 0.9) did not. Since it is based on a 24-hour dietary recall, this variable may be particularly susceptible to measurement error induced by recall bias.
We examined the variable selection problem in regression when the number of potential covariates is large compared to the sample size and when these potential covariates are measured with measurement error. We proposed MEBoost, a computationally simple descent-based approach which follows a path determined by measurement error-corrected estimating equations. We compared MEBoost, via simulation and in a real data example, with the recently-proposed Convex Conditioned Lasso (CoCoLasso) as well as the naive Lasso which assumes that covariates are measured without error. In almost all simulation scenarios, MEBoost performed best in terms of prediction error and coefficient bias. The CoCoLasso is more conservative with the highest specificity in each case, but sensitivity and prediction are better with MEBoost. In the comparison of selection paths, we saw that MEBoost was more aggressive in identifying variables to be included in the model more quickly than the CoCoLasso. These differences were most apparent when the measurement error had a larger variance and a more complex correlation structure. Specifically, when faced with a data set of 1000 observations and 1000 covariates, MEBoost obtained a solution in 1.3 seconds, while the CoCoLasso needed 6:17.
As shown in the simulation study, MEBoost has lower prediction error than the Lasso on independent test data when predictions are based on the true (i.e., non-error-prone) covariates. It is interesting to note that MEBoost retains some advantage, albeit a more modest one, over the Lasso when predictions are based on error-prone covariates. This finding appears to contradict the intuition that accounting for covariate measurement error provides no benefit when the goal is prediction and error-free covariates will never be available. However, the observed benefit in our simulation is likely due to the fact that MEBoost is somewhat more flexible than the Lasso, as it uses an additional parameter, the threshold $\tau$, which allows it to explore the model space more comprehensively. Nevertheless, it is reassuring that by using the error-prone covariates to perform cross-validation and select a model, MEBoost still allows us to select a model that offers an improvement in prediction in the setting where we will have correctly measured covariates.
MEBoost, while a promising approach, has some limitations. One limitation, which is shared with many methods that correct for measurement error, is that we assume that the covariance matrix of the measurement error process is known, an assumption which in many settings may be unrealistic. In some cases, it may be possible to estimate these structures using external data sources, but absent such data one could perform a sensitivity analysis with different measurement error variances and correlation structures, as we demonstrate in the real data application. Another challenging aspect of model selection with error-prone covariates is that, even if the set of candidate models is generated via a technique which accounts for measurement error, the process of selecting a final model (e.g., via cross-validation) still uses covariates that are measured with error. However, we showed in our simulation study that MEBoost performs well in selecting a model which recovers the relationship between the true (error-free) covariates and the outcome, even when using error-prone covariates to select the final model. This finding suggests that the procedure for generating a "path" of candidate models has a greater influence on prediction error and variable selection accuracy than the procedure picking a final model from among those candidates.
To conclude, we note that while we only considered linear and Poisson regression in this paper, MEBoost can easily be applied to other regression models by, e.g., using the estimating equations presented by Nakamura8 or others which correct for measurement error. In contrast, the approaches of Sorensen4 and Datta5 exploit the structure of the linear regression model and it is not obvious how they could be extended to the broader family of generalized linear models. The robustness and simplicity of MEBoost, along with its strong performance against other methods in the linear model case, suggests that this novel method is a reliable way to deal with variable selection in the presence of measurement error.
1. Spiegelman D, McDermott A, Rosner B. Regression calibration method for correcting measurement-error bias in nutritional epidemiology. The American journal of clinical nutrition. 1997;65(4 Suppl):1179S–1186S.
2. Fraser Gary E, Stram Daniel O. Regression calibration when foods (measured with error) are the variables of interest: markedly non-Gaussian data with many zeroes. American journal of epidemiology. 2012;175(4):325–31.
3. Rosner B, Spiegelman D, Willett WC. Correction of logistic regression relative risk estimates and confidence intervals for random within-person measurement error. American journal of epidemiology. 1992;136(11):1400–13.
4. Sørensen Øystein, Frigessi Arnoldo, Thoresen Magne. Measurement Error in Lasso: Impact and Correction. arXiv.org. 2012.
5. Datta A., Zou H. CoCoLasso for High-dimensional Error-in-variables Regression. Annals of Statistics. 2017;(Accepted).
6. Stefanski Leonard A., Carroll Raymond J. Covariate Measurement Error in Logistic Regression. The Annals of Statistics. 1985;13(4):1335–1351.
7. Fuller Wayne A., ed. Measurement Error Models. Wiley Series in Probability and Statistics. Hoboken, NJ, USA: John Wiley & Sons, Inc.; 1987.
8. Nakamura T. Corrected score function for errors-in-variables models: Methodology and application to generalized linear models. Biometrika. 1990;77(1):127–137.
9. Buonaccorsi John. Measurement Error: Models, Methods and Applications. Boca Raton: CRC Press; 2010.
10. Ma Yanyuan, Li Runze. Variable selection in measurement error models. Bernoulli. 2010;16(1):274–300.
11. Tibshirani Robert. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society Series B. 1996;58:267–288.
12. Loh Po-Ling, Wainwright Martin J. High-dimensional regression with noisy and missing data: Provable guarantees with nonconvexity. The Annals of Statistics. 2012;40(3):1637–1664.
13. Brown Ben, Miller Christopher J., Wolfson Julian. ThrEEBoost: Thresholded Boosting for Variable Selection and Prediction via Estimating Equations. Journal of Computational and Graphical Statistics. 2017;:1–10.
14. Liang Kung-Yee, Zeger Scott L. Longitudinal data analysis using generalized linear models. Biometrika. 1986;73(1):13–22.
15. Wolfson Julian. EEBoost: A General Method for Prediction and Variable Selection Based on Estimating Equations. Journal of the American Statistical Association. 2011;106(493):296–305.
16. Rosset Saharon, Zhu Ji, Hastie Trevor. Boosting as a Regularized Path to a Maximum Margin Classifier. Journal of Machine Learning Research. 2004;5:941–973.
17. Buonaccorsi John P. Prediction in the Presence of Measurement Error: General Discussion and an Example Predicting Defoliation. Biometrics. 1995;51(4):1562–1569.
18. Kipnis Victor, Subar Amy F, Midthune Douglas, et al. Structure of dietary measurement error: results of the OPEN biomarker study. American journal of epidemiology. 2003;158(1):14–21; discussion 22–6.
TABLE 1 Performance metrics for the 1,000 simulations in various measurement error scenarios. The models were selected at the point with minimum MSE-M.