MEBoost: Variable Selection in the Presence of Measurement Error
2017·Arxiv
1 INTRODUCTION

Variable selection is a well-studied problem in situations where covariates are measured without error. However, it is common for covariate measurements to be error-prone or subject to random variation around some mean value. Consider, for instance, a study wherein subjects report their daily food intake on the basis of a dietary recall questionnaire. There is variation from day to day in an individual’s calorie consumption, but it is also well established in the nutrition literature that there is error associated with the recall or measurement of the number of calories in a meal [1,2]. In the usual regression setting, ignoring measurement error leads to biased coefficient estimation [3], and hence the presence of measurement error has the potential to affect the performance of variable selection procedures. In this example, we may be able to create a predictive model based on these mismeasured dietary recall data, and then apply the model to more expensive data that can be measured with reduced or eliminated measurement error, such as with the help of a nutritionist or through prepackaged meals.

There has been relatively little research done about variable selection in the presence of measurement error. Sørensen et al. [4] introduced a variation of the Lasso that allows for Normal, i.i.d., additive covariate measurement error. Datta and Zou [5] proposed the Convex Conditioned Lasso (CoCoLasso), which corrects for both additive and multiplicative measurement error in the normal case. Both of these methods are applicable to linear models for continuous outcomes, but do not easily extend to regression models for other outcome types (e.g., binary or count data). Meanwhile, there is a sizable statistical literature on methods for performing estimation and inference for low-dimensional regression parameters in the presence of measurement error [3,6,7], but these approaches do not address the variable selection problem and cannot be applied in large p, small n problems.

We propose a novel method for variable selection in the presence of measurement error, MEBoost, which leverages estimating equations that have been proposed for low-dimensional estimation and inference in this setting. MEBoost is a computationally efficient path-following algorithm that moves iteratively in directions defined by these estimating equations, only requiring the calculation (not the solution) of an estimating equation at each step. As a result, it is much faster than alternative approaches involving, e.g., a matrix projection calculation at each step. MEBoost is also flexible: the version that we describe is based on estimating equations proposed by Nakamura [8], which apply to some generalized linear models, and the underlying MEBoost algorithm can easily incorporate measurement error-corrected estimating equations for other regression models. We conducted a simulation study to compare MEBoost to the Convex Conditioned Lasso (CoCoLasso) proposed by Datta and Zou [5] and the “naive” Lasso which ignores measurement error. We also applied MEBoost to data from the Box Lunch Study, a clinical trial in nutrition where caloric intake across a number of food categories was based on self-report and hence measured with error.

image

2.1

Our discussion of measurement error models draws heavily from Fuller [7]. When modeling error, the covariates can be treated as random or fixed values. Structural models consider the covariates to be random quantities and functional models consider the covariates to be fixed [9]. We consider a structural model. Let Y = Xβ + ε, where X is a (random) matrix of covariates of dimension n × p, β is a vector of coefficients of length p, ε is a vector of Normally distributed i.i.d. random errors of length n, and Y is the resultant outcome vector, also of length n. In an additive measurement error model, we assume that what is observed is not X but rather the “contaminated” or “error-prone” matrix W = X + U, where U is a random n × p matrix.

When a model is fit that ignores measurement error, i.e., it assumes that the true model is Y = Wβ_W + ε, the resulting estimates β̂_W are said to be naive and satisfy

image

where β is the true coefficient vector, Σ_XX is the covariance matrix of the covariates and Δ ≡ Σ_UU is the covariance matrix of the measurement error. In the case of linear regression with a single covariate, (1) simplifies to an attenuating factor that biases the coefficient estimates towards zero. However, with multiple covariates the bias may inflate, shrink, or even change the sign of the estimated coefficients. Notably, measurement error affecting a single covariate can bias the coefficient estimates of all of the covariates, even those that are not measured with error [9].
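The bias described above can be checked numerically. The following is a minimal sketch, not part of the original study: it assumes the standard large-sample limit of the naive least squares estimate, (Σ_XX + Δ)^{-1} Σ_XX β, and all names and parameter values (`sigma_xx`, `delta`, `beta_limit`, the sample size) are illustrative choices. Only the first covariate is mismeasured, yet the coefficient of the third, truly irrelevant covariate is pulled away from zero.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100_000, 3
beta = np.array([1.0, -0.5, 0.0])  # third covariate is truly irrelevant

# Illustrative (assumed) covariances: unit variances, pairwise correlation 0.5
sigma_xx = np.full((p, p), 0.5) + 0.5 * np.eye(p)
delta = np.diag([0.75, 0.0, 0.0])  # only the first covariate is mismeasured

X = rng.multivariate_normal(np.zeros(p), sigma_xx, size=n)
U = rng.normal(size=(n, p)) * np.sqrt(np.diag(delta))  # additive error
W = X + U
Y = X @ beta + rng.normal(size=n)

# Naive least squares fit on the error-prone covariates W
beta_naive, *_ = np.linalg.lstsq(W, Y, rcond=None)

# Large-sample limit of the naive estimate
beta_limit = np.linalg.solve(sigma_xx + delta, sigma_xx @ beta)

print("naive estimate:", np.round(beta_naive, 3))
print("limit         :", np.round(beta_limit, 3))
```

With these values the first coefficient is attenuated, and the third coefficient has a clearly non-zero limit even though its covariate is measured exactly.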

2.2

Ma and Li [10] presented methods to account for measurement error while performing variable selection in parametric and semi-parametric settings. Focusing on the parametric setting, they proposed a wide-scoping method that can be used in more than just generalized linear models. The method relies on deriving the full likelihood of each observation and its corresponding score function, S*_eff(W_i, Y_i, β), choosing a penalty function and finding its derivative, p′(β), then solving the penalized estimating equations:

image

Solving the penalized equations can be very difficult computationally, especially in the high-dimensional setting. Therefore, we compare our method with faster methods that are variants of the Lasso, which can be solved much more quickly.

2.3

Sørensen et al. [4] analyze the Lasso [11] in the presence of measurement error by studying the properties of

image

β̂_{Lasso,λn} is asymptotically biased when λ_n/n → 0 as n → ∞, since E[β̂′_{Lasso,λn}] = β′(Σ_XX + Δ)^{-1}Σ_XX. Notice this is the same bias that is introduced when naive linear regression is performed on observed covariates. Sørensen et al. [4] derive a lower bound on the magnitude of the non-zero coefficient elements below which the corresponding covariate will not be selected, and an upper bound on the L1 estimation error ||β̂_W − β||_1. They show that with increasing measurement error the lower bound increases, i.e., increasing measurement error adds non-informative noise to the system, and so for the signal associated with the relevant covariates to be identified the signal must increase. Increased measurement error also leads to an increase in the upper bound of the estimation error. Sign consistent selection is also impacted by the presence of covariate measurement error. Sørensen et al. [4] set a lower bound on the probability of sign consistent selection in this setting. The result requires that the Irrepresentability Condition with Measurement Error (IC-ME) holds. The IC-ME requires that the measurements of the relevant and irrelevant covariates have limited correlation, relative to the size of the relevant measured covariate correlation. Note that the sample correlation of the irrelevant covariates is not considered. By studying the form of the lower bound, it can be concluded that (at least when using the Lasso) measurement error introduces a greater distortion on the selection of irrelevant covariates than it does in the selection of relevant covariates.

image

Sørensen et al. [4] introduced an iterative method to obtain the Regularized Corrected Lasso with constraint on the radius R:

image

The main results of their simulation study were consistent with their analytical results, namely that the corrected Lasso had a slightly lower selection rate for the true covariates than the naive Lasso, but was also more conservative in including irrelevant covariates. Further, the prediction error, as measured by both ||β̂ − β||_1 and ||β̂ − β||_2, was lower for the corrected Lasso.

The major drawback of the corrected Lasso method is that it is very computationally intensive, involving an iterative calculation where each step involves a projection of an updated β̂ onto the L1-ball for a given radius R. The iterative process must be conducted for each fixed value of the radius R. The selected values of R provide a path of possible solutions for β̂_RCL. Hence, the approach seems impractical for large-scale problems and for repeated application in a simulation study.

2.4

A recent paper by Datta and Zou [5] proposes an alternative approach which they refer to as the Convex Conditioned Lasso (CoCoLasso). Consider the following reformulation of the Lasso problem,

image

The CoCoLasso is based on the Loh and Wainwright corrections [12] for the predictor-outcome correlation ρ and variance matrix Σ in the presence of measurement error. When error-prone covariates W are measured in place of X, we can get corrected estimates ρ̃ and Σ̂:

image

where Δ is the (assumed known) covariance matrix of the measurement error in W. These estimators are unbiased. A measurement error corrected Lasso estimate could then be derived by substituting ρ̃ and Σ̂ into (5). The problem with this idea is that the corrected matrix Σ̂ may not be a valid covariance matrix, since it may fail to be positive semi-definite. If Σ̂ has a negative eigenvalue, then this Lasso function would be non-convex and unbounded. To overcome this obstacle, the key to the CoCoLasso [5] is calculating the projection of Σ̂ onto the space of positive definite matrices:

image

The CoCoLasso then solves a standard Lasso problem in which Σ̂ and ρ are replaced with the corrected values from (6) and (7), yielding the CoCoLasso estimator:

image

When Σ̂ is not positive definite, the projection from (7) can be challenging to compute. However, the projection only needs to be done once, unlike the Sørensen correction [4], which requires a projection at each iteration.
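The steps above (correct the moments, project, then solve a Lasso problem) can be sketched as follows. This is a simplified illustration under stated assumptions, not the CoCoLasso implementation: the corrected moments use the additive-error forms Σ̂ = W′W/n − Δ and ρ̃ = W′Y/n; the projection is replaced by plain eigenvalue clipping (a Frobenius-norm projection), which is a stand-in for the projection the CoCoLasso authors actually use; and the function names, the `eps` floor, and the coordinate descent solver are hypothetical choices made for this sketch.

```python
import numpy as np

def corrected_moments(W, Y, delta):
    """Moment estimates corrected for additive error with known covariance delta."""
    n = W.shape[0]
    sigma_hat = W.T @ W / n - delta
    rho_tilde = W.T @ Y / n
    return sigma_hat, rho_tilde

def project_psd(sigma_hat, eps=1e-4):
    """Stand-in projection: clip eigenvalues at eps (NOT the CoCoLasso projection)."""
    sym = (sigma_hat + sigma_hat.T) / 2
    vals, vecs = np.linalg.eigh(sym)
    return (vecs * np.maximum(vals, eps)) @ vecs.T

def lasso_from_moments(sigma, rho, lam, n_iter=500):
    """Coordinate descent for 0.5 * b' sigma b - rho' b + lam * ||b||_1."""
    beta = np.zeros(len(rho))
    for _ in range(n_iter):
        for j in range(len(rho)):
            # Correlation with the partial residual that excludes coordinate j
            r_j = rho[j] - sigma[j] @ beta + sigma[j, j] * beta[j]
            beta[j] = np.sign(r_j) * max(abs(r_j) - lam, 0.0) / sigma[j, j]
    return beta

rng = np.random.default_rng(1)
n, p = 40, 60  # p > n, so the corrected matrix must have negative eigenvalues
delta = 0.75 * np.eye(p)
beta_true = np.r_[np.ones(5), np.zeros(p - 5)]
X = rng.normal(size=(n, p))
W = X + rng.normal(scale=np.sqrt(0.75), size=(n, p))
Y = X @ beta_true + rng.normal(size=n)

sigma_hat, rho_tilde = corrected_moments(W, Y, delta)
sigma_psd = project_psd(sigma_hat)
beta_hat = lasso_from_moments(sigma_psd, rho_tilde, lam=0.3)
print("min eigenvalue before projection:", np.linalg.eigvalsh(sigma_hat).min())
print("min eigenvalue after projection :", np.linalg.eigvalsh(sigma_psd).min())
```

The point of the sketch is the order of operations: the projection is computed once, before the Lasso solve, rather than inside every iteration.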

Our proposed variable selection algorithm, MEBoost (Measurement Error Boosting), is based on an iterative functional gradient descent type algorithm that generates variable selection paths. The key idea is that, instead of following a path defined by the gradient of a loss function (e.g., the likelihood), the “descent” follows the direction defined by an estimating equation g(Y, X, β). The algorithmic structure of MEBoost is shared with ThrEEBoost (Thresholded Estimating Equation Boost [13]), a general-purpose technique for variable selection based on estimating equations. While ThrEEBoost described an approach to performing variable selection in the presence of correlated outcomes by leveraging the Generalized Estimating Equations [14], MEBoost achieves improved variable selection performance in the presence of measurement error by following a path defined by a measurement error corrected score function due to Nakamura, which is described in Section 3.1. Nakamura’s approach is applicable to linear regression models with normal additive or multiplicative measurement error. Closed-form corrected score functions are also derived for Poisson, Gamma, and Wald regression. Nakamura comments that no closed-form correction can be created for logistic regression. By using this family of corrected score functions, the MEBoost algorithm is more broadly applicable than the corrected Lasso and CoCoLasso, neither of which is obviously generalizable beyond linear regression.

image

3.1

Nakamura [8] proposed a set of corrected score functions for performing estimation and inference in the generalized linear regression model where covariates are subject to additive measurement error with known variance matrix Δ. In general, the corrected score function S*, based on the covariates measured with error (W), has expectation equal to the score function, S, based on the true covariates (X). For the normal linear model, Nakamura proposed the following correction to the negative log-likelihood to account for measurement error:

image

Differentiating (9) with respect to β, we obtain the corrected score function:

image

In this case the corrected score function is the ‘naive’ score function, S(Y, W, β)′, with a measurement error correction determined by the sample size, model error, measurement errors, and the coefficient value: nσ^{-2}β′Δ. The naive score function is the score function from the true model calculated with the measured covariates:

image

The corrected variance estimate is obtained by solving ∂l*/∂σ = 0, which in the normal case gives:

image

Similarly to the corrected score function, the corrected variance estimate is the naive variance estimate, n^{-1}(Y − Wβ*)′(Y − Wβ*), with a measurement error correction. The correction reduces the estimated variance, thus subtracting the noise introduced by the measurement error. In the variance case the correction factor is determined only by the true coefficient vector and the measurement error variance.
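As a hedged illustration of the corrected quantities described above, the sketch below codes the corrected score as the naive score plus nσ^{-2}Δβ (the column form of the correction nσ^{-2}β′Δ) and the corrected variance as the naive variance minus β′Δβ. The second form is an assumption consistent with the description in the text, since the displayed equations are not reproduced here; the function names and simulation settings are illustrative. The simulation checks that, at the true β, the corrected score averages to zero while the naive score does not.

```python
import numpy as np

def corrected_score(Y, W, beta, sigma2, delta):
    """Naive normal-model score in beta plus the correction n * delta @ beta / sigma2."""
    n = len(Y)
    naive = W.T @ (Y - W @ beta) / sigma2
    return naive + n * (delta @ beta) / sigma2

def corrected_sigma2(Y, W, beta, delta):
    """Naive residual variance minus beta' delta beta (assumed form of the correction)."""
    resid = Y - W @ beta
    return resid @ resid / len(Y) - beta @ delta @ beta

rng = np.random.default_rng(2)
n = 200_000
beta = np.array([1.0, -1.0, 0.5])
delta = 0.75 * np.eye(3)
sigma2 = 1.5 ** 2

X = rng.normal(size=(n, 3))
W = X + rng.normal(scale=np.sqrt(0.75), size=(n, 3))
Y = X @ beta + rng.normal(scale=1.5, size=n)

naive_mean = W.T @ (Y - W @ beta) / (sigma2 * n)            # per-observation naive score
corrected_mean = corrected_score(Y, W, beta, sigma2, delta) / n
sigma2_hat = corrected_sigma2(Y, W, beta, delta)

print("naive score mean    :", np.round(naive_mean, 3))
print("corrected score mean:", np.round(corrected_mean, 3))
print("corrected variance  :", round(float(sigma2_hat), 3))
```

In this setup the naive score mean settles near −Δβ/σ², which is exactly the term the correction adds back.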

As another example, the correction for Poisson distributed data is the following:

image

which we apply in our data application (see Section 5). Nakamura [8] also provides corrections for multiplicative measurement error in linear regression, as well as measurement error in Gamma and Wald regression. In what follows, we use the normal linear additive measurement error corrected score function as part of an iterative path-following algorithm that performs variable selection in the presence of covariate measurement error.
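For the Poisson case, a corrected score can be sketched under the assumption of a log link and normal additive error U ∼ N(0, Δ). The form used below, Σ_i [Y_i W_i − (W_i − Δβ) exp(W_i′β − β′Δβ/2)], follows from the normal moment generating function and is offered as one standard form of such a correction rather than a transcription of the displayed equation; the function name and simulation values are hypothetical.

```python
import numpy as np

def poisson_corrected_score(Y, W, beta, delta):
    """Corrected Poisson score (log link), assuming U ~ N(0, delta) additive error."""
    shift = delta @ beta
    # exp(W beta - beta' delta beta / 2) has conditional mean exp(X beta)
    mu_star = np.exp(W @ beta - 0.5 * beta @ shift)
    return W.T @ Y - (W - shift).T @ mu_star

rng = np.random.default_rng(3)
n = 200_000
beta = np.array([0.3, -0.2])
delta = 0.25 * np.eye(2)

X = rng.normal(size=(n, 2))
W = X + rng.normal(scale=0.5, size=(n, 2))
Y = rng.poisson(np.exp(X @ beta))

naive_mean = W.T @ (Y - np.exp(W @ beta)) / n
corrected_mean = poisson_corrected_score(Y, W, beta, delta) / n

print("naive score mean    :", np.round(naive_mean, 3))
print("corrected score mean:", np.round(corrected_mean, 3))
```

At the true β the corrected score averages to roughly zero, while the naive Poisson score computed from W is visibly biased.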

3.2

Our proposed variable selection algorithm, MEBoost, consists of applying ThrEEBoost with the corrected score function and corrected variance estimate described in the previous section. Algorithm 1 summarizes the MEBoost procedure.

Let τ ∈ [0, 1] be the fixed thresholding parameter. Starting with a β estimate of 0 and σ̂² = 1, the corrected score function S* is calculated at these values, and the magnitude of each component of ν ≡ S* is recorded. The indices of elements to update are identified by a thresholding rule, J_t = {j : |ν_j| ≥ τ · max_j |ν_j|}. The next point in the variable selection path, β^(1), is obtained by adding a small value, γ, to each of these elements in the direction corresponding to the signs of each ν_j for j ∈ J_t. This updated β^(1) is used to calculate an updated corrected σ²_(1). The algorithm continues for T iterations, where T is typically chosen to be large (e.g., 1,000).
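The procedure just described can be sketched in a few lines for the normal linear model. This is a minimal illustration under assumptions, not the authors' implementation: the corrected score and variance take the forms naive score + nσ^{-2}Δβ and naive variance − β′Δβ, the function name `meboost_path` and its defaults are hypothetical, and the small positive floor on σ̂² is an added safeguard that the text does not mention.

```python
import numpy as np

def meboost_path(Y, W, delta, tau=0.6, gamma=0.01, n_steps=1000):
    """Thresholded boosting along the corrected score; returns an (n_steps+1, p) path."""
    n, p = W.shape
    beta = np.zeros(p)
    sigma2 = 1.0
    path = np.zeros((n_steps + 1, p))
    for t in range(n_steps):
        # Corrected score: naive score plus the measurement error correction
        nu = (W.T @ (Y - W @ beta) + n * (delta @ beta)) / sigma2
        # Thresholding rule: update every index within a factor tau of the largest |nu_j|
        active = np.abs(nu) >= tau * np.abs(nu).max()
        beta = beta + gamma * np.sign(nu) * active
        # Corrected variance estimate, floored to stay positive (added assumption)
        resid = Y - W @ beta
        sigma2 = max(resid @ resid / n - beta @ delta @ beta, 1e-8)
        path[t + 1] = beta
    return path

rng = np.random.default_rng(4)
n, p = 400, 20
beta_true = np.r_[2.0, 2.0, 2.0, np.zeros(p - 3)]
delta = 0.25 * np.eye(p)
X = rng.normal(size=(n, p))
W = X + rng.normal(scale=0.5, size=(n, p))
Y = X @ beta_true + rng.normal(size=n)

path = meboost_path(Y, W, delta, tau=0.6, gamma=0.01, n_steps=150)
top3 = set(np.argsort(-np.abs(path[-1]))[:3].tolist())
print("largest coefficients at the end of the path:", sorted(top3))
```

Each row of `path` is one candidate model; choosing among the rows is a separate step. Note that in the linear case σ̂² rescales every component of ν equally, so it does not change which indices pass the threshold.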

The parameters γ, T and τ interact to determine the specific variable selection path that results from the algorithm. The smaller the value of γ, the smaller the distance between β estimates on the selection path, while a larger value of γ leads to larger jumps in the selection path. Ideally, a very small value of γ (e.g., 0.01) would be used, but if ||β||_1 is large, a large number of iterations, T, may be required to generate a selection path. This is the trade-off one is required to make when determining the step size. A selection path incremented by only a small value is preferable to a path which takes large steps, but the time required for a large number of iterations may become prohibitive. At any iteration t, those elements of the coefficient vector that are still zero have not yet been selected. A conservative selection approach takes a combination of small γ and T, whereas a more aggressive approach takes a combination of larger values of γ and T.

The parameter τ determines how many coefficients are updated at each iteration; it offers a compromise between updating each coefficient at every iteration (τ = 0, similar to standard gradient descent) and updating only the coefficient corresponding to the element of the estimating equation with largest magnitude (τ = 1). In the context of Generalized Linear Models without measurement error, Wolfson [15] showed that setting τ = 1 yields an update rule that is asymptotically equivalent (as T → ∞, γ → 0, and T · γ → 0) to following the path of minimizers of an L1-penalized projected artificial log-likelihood ratio whose tangent is the GLM score function. In the case when τ = 1, the MEBoost algorithm only updates the element(s) with the maximum absolute value. For any combination of γ and T, this is the most conservative approach that can be taken and will lead to sparser models than when a threshold τ < 1 is used. It also requires a much larger value of T. By allowing multiple directions to be updated at each iteration, MEBoost can explore a much wider range of variable selection paths; as we discuss later, cross-validation can be used to select the parameter τ which leads to the optimal level of thresholding. In ThrEEBoost [13], it was shown that a threshold in the range of 0.4-0.8 may perform better than thresholds closer to 0 or 1.

image

image

3.2.1

Nakamura’s measurement error corrected score functions are derivatives of corrected negative log-likelihoods. In the normal case, the correction is exactly that described in Sørensen et al. (see Equation (3)). Hence, the arguments of Rosset [16] can be applied to show that 1) MEBoost applied with S* and threshold value τ = 1, and 2) the solutions to (3), have the same local behavior. Specifically, under some regularity conditions [15], as T → ∞ and ε → 0 with T · ε → 0, MEBoost’s iterative steps match the sequence of solutions to (3).

3.2.2

For a fixed τ, identifying a final model involves choosing a point on the variable selection path generated by Algorithm 1; this is akin to choosing the penalty parameter in the Lasso. Cross-validation using a loss function relevant to the problem at hand (e.g., mean squared error) can be used to select a β̂ on the path. Cross-validation can similarly be used to select the best value of τ. The full procedure is described in Algorithm 2.

image

image

To examine the impact of measurement error in the covariates on variable selection, we performed a simulation study. We evaluated MEBoost by comparing it to two variable selection methods: the Convex Conditioned Lasso (CoCoLasso), and the “naive” Lasso which does not correct for measurement error.

4.1

Data were generated from a linear regression model with i.i.d. normal errors, Y = Xβ + ε, where ε_i ∼ N(0, σ²_ε) and σ_ε = 1.5. The sample size for all studies is 80. The true covariates are drawn from a multivariate normal distribution, X ∼ MVN(0, Σ_XX). Σ_XX is a block diagonal matrix with diagonal entries equal to 1, and 10 by 10 blocks corresponding to a group of 10 covariates with an exchangeable correlation structure with common pairwise correlation φ = 0.3. In all simulations the true model has 10 non-zero coefficients and 90 zero coefficients, i.e., β = (1_10, 0_90), so that the relevant covariates in the first block were correlated.
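The data-generating design above can be reproduced with a short sketch. The values (n = 80, 100 covariates, blocks of 10, φ = 0.3, σ_ε = 1.5) come from the text; the variable names and the random seed are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(5)
n, p, block_size, phi = 80, 100, 10, 0.3

# One exchangeable 10 x 10 block: ones on the diagonal, phi elsewhere
block = np.full((block_size, block_size), phi)
np.fill_diagonal(block, 1.0)

# Block diagonal covariance: 10 independent blocks of 10 correlated covariates
sigma_xx = np.kron(np.eye(p // block_size), block)

# First block relevant (coefficient 1), remaining 90 covariates irrelevant
beta = np.r_[np.ones(10), np.zeros(90)]

X = rng.multivariate_normal(np.zeros(p), sigma_xx, size=n)
Y = X @ beta + rng.normal(scale=1.5, size=n)
print(X.shape, Y.shape)
```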

The measured covariates were generated as W = X + U for U a matrix whose columns were generated as described below. To explore the impact of different types of measurement error, we considered 10 different scenarios for generating the columns of U and varying the assumptions made about it. In the first five scenarios, U is assumed to be normally distributed with mean zero and covariance matrix Ω, and the scenarios explore different structures for Ω. In each of Scenarios 1-5, we correctly specify the distribution of U when applying MEBoost and the CoCoLasso. Scenarios 6-10 explore cases where the distribution of U is incorrectly specified.

1. Base case: U ∼ N(0, δ²Ω_1), where Ω_1 = I, the identity matrix, and δ² = 0.75.

2. Varying δ²: δ²_j = 0.3375 + 0.075j for j in 1-10. This pattern repeats across the blocks of 10 covariates. The relevant covariates have similar

image

5. Some covariates not measured with error: U ∼ N(0, δ²Ω_5), where δ² = 0.75, diag(Ω_5) = [0, 1, 0, 1, ...] and Ω_{5,ij} = 0 for i ≠ j.

6. Overestimated δ²: U generated as in Scenario 1, but we specify δ² = 1.5.

7. Underestimated δ²: U generated as in Scenario 1, but we specify δ² = 0.375.

8. Misspecified correlation: U generated as in Scenario 3, but we ignore the correlation and specify Ω = δ²I in running MEBoost and

image

9. Measurement error is distributed uniformly: Each entry U_ij of U is generated independently from a Uniform distribution, U_ij ∼ U(−1.5, 1.5). MEBoost and CoCoLasso are run assuming U ∼ N(0, δ²I) with δ² = 0.75 = Var(U_ij).

10. Measurement error is distributed asymmetrically: Each entry U_ij of U is generated independently from a shifted exponential distribution, U_ij + √0.75 ∼ exp(√0.75). MEBoost and CoCoLasso are run assuming U ∼ N(0, δ²I) with δ² = 0.75 = Var(U_ij).
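Several of the fully specified scenarios above can be sketched as a small generator. This is an illustration under assumptions: only Scenarios 1, 2, 5, 9 and 10 are covered, the function name `measurement_error` is hypothetical, and exp(√0.75) in Scenario 10 is read as an exponential distribution with mean (scale) √0.75, which is the reading that gives the stated variance of 0.75 and a mean of zero after the shift.

```python
import numpy as np

def measurement_error(scenario, n, p, rng):
    """Draw an n x p error matrix U for a subset of the scenarios in the text."""
    if scenario == 1:   # base case: i.i.d. normal, variance 0.75
        return rng.normal(scale=np.sqrt(0.75), size=(n, p))
    if scenario == 2:   # variance 0.3375 + 0.075 * j, repeating over blocks of 10
        j = np.arange(p) % 10 + 1
        return rng.normal(size=(n, p)) * np.sqrt(0.3375 + 0.075 * j)
    if scenario == 5:   # every other covariate is measured without error
        mask = np.arange(p) % 2  # diag(Omega_5) = [0, 1, 0, 1, ...]
        return rng.normal(scale=np.sqrt(0.75), size=(n, p)) * mask
    if scenario == 9:   # uniform on (-1.5, 1.5): variance 3^2 / 12 = 0.75
        return rng.uniform(-1.5, 1.5, size=(n, p))
    if scenario == 10:  # shifted exponential with mean 0 and variance 0.75
        scale = np.sqrt(0.75)
        return rng.exponential(scale=scale, size=(n, p)) - scale
    raise ValueError(f"scenario {scenario} is not covered by this sketch")

rng = np.random.default_rng(6)
for s in (1, 9, 10):
    U = measurement_error(s, 20_000, 10, rng)
    print(s, round(float(U.mean()), 3), round(float(U.var()), 3))
```

Scenarios 9 and 10 match the normal case in mean and variance but not in shape, which is what makes them useful for probing robustness to a misspecified error distribution.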

image

ates) was used to select the optimal value of τ and number of MEBoost iterations, as well as the value of λ in the CoCoLasso and naive Lasso. We compared MEBoost, CoCoLasso, and naive Lasso on two metrics of prediction error: mean squared error based on the true covariates (MSE = (1/n)(Y − Xβ̂)′(Y − Xβ̂)), and mean squared error of prediction based on the measured covariates (MSE-M = (1/n)(Y − Wβ̂)′(Y − Wβ̂)). These metrics were estimated using independent test sets generated during each individual simulation. We also computed the L1 distance from the true β, and variable selection sensitivity and specificity. For each scenario the metrics presented are the average over 1,000 simulations, and are calculated at intervals of 0.05 along ||β̂||_1 ∈ {0.05, 0.1, 0.15, ..., 15}; the true value is ||β||_1 = 10. Because the MEBoost algorithm may change multiple indices at each iteration, it may not have values along each interval in the path. To account for this, a linear approximation of the relevant statistic was made at each point in the path.
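The evaluation metrics named above can be written compactly. In this sketch the MSE function covers both MSE (pass the true covariates X) and MSE-M (pass the measured covariates W); sensitivity and specificity use the usual definitions, namely the share of truly non-zero coefficients that are selected and the share of truly zero coefficients that are left out, which is an assumption about the exact definitions used. Function names and the toy inputs are illustrative.

```python
import numpy as np

def mse(Y, covariates, beta_hat):
    """Mean squared prediction error; pass X for MSE or W for MSE-M."""
    resid = Y - covariates @ beta_hat
    return resid @ resid / len(Y)

def l1_distance(beta_hat, beta_true):
    return np.abs(beta_hat - beta_true).sum()

def sensitivity_specificity(beta_hat, beta_true):
    selected = beta_hat != 0
    relevant = beta_true != 0
    sensitivity = np.mean(selected[relevant])     # relevant covariates selected
    specificity = np.mean(~selected[~relevant])   # irrelevant covariates excluded
    return sensitivity, specificity

# Tiny worked example with hand-checkable values
beta_true = np.array([1.0, 1.0, 0.0, 0.0])
beta_hat = np.array([0.5, 0.0, 0.2, 0.0])
X_test = np.eye(4)
Y_test = X_test @ beta_true

print(mse(Y_test, X_test, beta_hat))                 # (0.25 + 1 + 0.04 + 0) / 4
print(l1_distance(beta_hat, beta_true))              # 0.5 + 1 + 0.2
print(sensitivity_specificity(beta_hat, beta_true))  # one of two in each group
```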

image

We note that in this simulation study we chose to investigate model performance based on both the true and error-prone covariates. The motivation for techniques like ours which account for measurement error is to uncover the underlying relationship between the error-free covariates X and the outcome Y. Hence, in an ideal world, values of X would be available on some subset (or an independent set) of observations so that prediction error could be assessed and the “best” model chosen. However, in practice we will often only have access to the error-prone covariates W for model fitting. So, if error-free measurements X are not (and may never be) available, is it worthwhile to correct for measurement error? Buonaccorsi [17] argued against correction, using the logic that future predictions will be based on (error-prone) W, not on (error-free) X. Indeed, it can be shown in simple linear regression that, in a large sample, the expected value of MSE-M for an estimate ignoring measurement error is less than or equal to that of a corrected estimate. However, as seen in the results section that follows, we found that correcting for measurement error decreased prediction error regardless of whether predictions were computed using error-free or error-prone covariates. Since we often only have mismeasured data available, it is reassuring to see that we are able to use the measured covariates to perform cross-validation to select a model that will provide us with an accurate relationship between the outcome and the true covariates. This finding is discussed in greater detail below.

4.2

Table 1 presents the minimum MSE, MSE-M, L1 distance from the true β, sensitivity, and specificity at the minimum MSE for the three variable selection methods across the 10 scenarios. In all ten scenarios, MEBoost had the lowest MSE, MSE-M, and L1 distance from the true β. The CoCoLasso has 16.6%-71.7% higher prediction error from the true covariates than MEBoost and, in the case where measurement error is overestimated, the prediction error from the CoCoLasso is 5.26 times that of MEBoost. This is likely due to the fact that the Loh and Wainwright correction Σ̂ in (6) is more negative (further from positive definite), and hence requires a “longer” (and hence potentially more distorting) projection onto the space of positive definite matrices.

In terms of variable selection, MEBoost had a greater sensitivity and lower specificity than CoCoLasso in each case, while the Lasso had the lowest specificity. The Lasso struggles most when correlation is present in the measurement error: its MSE is about 2.5 times that of MEBoost when we allow MEBoost to account for the correlation. All methods perform poorly when we misspecify Δ by ignoring the correlation. The sensitivity and specificity are at high levels for most simulations, with the exception of the misspecified Δ that ignored correlation. Overestimating δ led to a more conservative selection process with a high specificity, while underestimating δ led to a higher sensitivity. The L1 distance from the true β can also tell us about performance. Again, the scenario where we misspecify Δ by ignoring correlation performs worst.

We applied our method to baseline data collected in the Box Lunch Study, a randomized trial of the effects of portion size availability on weight change. In the study, a total of 219 subjects were randomized to one of four groups: in three groups, subjects were provided a free daily lunch with a fixed number of calories (400, 800, and 1600). The control group was not provided a free lunch.

We considered the problem of predicting the number of times subjects reported binging on food in the last month, using Poisson regression with 99 explanatory variables. All variables were measured at baseline. 16 of the 99 explanatory variables were self-reported measures; of these 16, 8 were measures of food consumption and therefore possibly subject to substantial measurement error, whose variance we denote δ²_D. Another 8 may also have been measured with error, with variance denoted δ²_M. Kipnis [18] examined a nutritional study with a 24 hour recall, and found that the correlation between the true and reported consumption of protein and energy was only 0.336. We assume this relationship exists in each of our variables measured with error. Assuming the measurement error U_i, with variance Var(U_i) ≡ δ²_i, is independent of the true covariate X_i, with variance Var(X_i) ≡ σ²_{Xi}, we can obtain:

image

and hence δ²_i / Var(W_i) = 1 − 0.336² = 0.887, the share of the variance of W_i that is attributable to measurement error. This is the value we will need to provide MEBoost for our assumption of the measurement error. We assume this level of measurement error for each 24 hour dietary recall variable. After scaling our predictors to have zero mean and unit variance, we applied our method with the Nakamura correction. Since our measured data has its variance (δ²_i + σ²_{Xi}) scaled to equal 1, we assumed that the 8 dietary recall covariates measured with error had δ̂²_D = 0.887. Since dietary variables may be more prone to measurement error than other variables, we scaled the assumed error of the other 8 variables to be half that of the nutritional variables: δ̂²_M = δ̂²_D/2. The remaining variables were assumed to be measured without error. We conducted a sensitivity analysis to assess the performance of our method by setting δ̂²_D = 0.5 and 0.25.

To select tuning parameters, we employed 8-fold cross-validation based on the deviance on a training set consisting of 70% of the data. The performance of our model was evaluated on the remaining test set. We present the models derived from MEBoost performed with three different thresholds τ: 0.2, 0.6 (the approximate value estimated using cross-validation), and 0.9.

Table 2 shows the selected variables and estimated prediction error (MSE-M, bottom row) for various MEBoost models along with results from the naive Lasso. We did not compare to the Measurement Error Lasso or the CoCoLasso because implementing these techniques in a problem of this size was computationally infeasible. The deviance and MSE-M were lowest for the model selected by MEBoost assuming the highest measurement error (δ̂²_D = 0.887) and a threshold value of 0.6. This model (δ̂²_D = 0.887 and τ = 0.6) selected just 4 variables, which were a subset of the 7 chosen with the naive Lasso. The other two MEBoost models included up to two additional variables relative to the MEBoost model that minimized MSE-M (selected with δ̂²_D = 0.887 and τ = 0.6). Regardless of the assumption about the level of measurement error, using a threshold value of τ = 0.2 leads to the inclusion of several variables with small coefficients, and a much higher deviance and prediction error. Of particular note is that the naive Lasso (and MEBoost with the lower threshold) included the variable corresponding to the number of daily calories consumed at breakfast, while the best performing MEBoost models (with τ = 0.6 and 0.9) did not. Since it is based on a 24-hour dietary recall, this variable may be particularly susceptible to measurement error induced by recall bias.

image

We examined the variable selection problem in regression when the number of potential covariates is large compared to the sample size and when these potential covariates are measured with error. We proposed MEBoost, a computationally simple descent-based approach which follows a path determined by measurement error-corrected estimating equations. We compared MEBoost, via simulation and in a real data example, with the recently-proposed Convex Conditioned Lasso (CoCoLasso) as well as the naive Lasso, which assumes that covariates are measured without error. In almost all simulation scenarios, MEBoost performed best in terms of prediction error and coefficient bias. The CoCoLasso is more conservative, with the highest specificity in each case, but sensitivity and prediction are better with MEBoost. In the comparison of selection paths, we saw that MEBoost was more aggressive than the CoCoLasso, identifying variables to be included in the model more quickly. These differences were most apparent when the measurement error had a larger variance and a more complex correlation structure. MEBoost was also faster: when faced with a data set of 1000 observations and 1000 covariates, MEBoost obtained a solution in 1.3 seconds, while the CoCoLasso needed 6:17.

As shown in the simulation study, MEBoost has lower prediction error than the Lasso on independent test data when predictions are based on the true (i.e., non-error-prone) covariates. It is interesting to note that MEBoost retains some advantage, albeit a more modest one, over the Lasso when predictions are based on error-prone covariates. This finding appears to contradict the intuition that accounting for covariate measurement error provides no benefit when the goal is prediction and error-free covariates will never be available. However, the observed benefit in our simulation is likely due to the fact that MEBoost is somewhat more flexible than the Lasso, as it uses an additional parameter, the threshold τ, which allows it to explore the model space more comprehensively. Nevertheless, it is reassuring that by using the error-prone covariates to perform cross-validation and select a model, MEBoost still allows us to select a model that offers an improvement in prediction in the setting where we will have correctly measured covariates.

MEBoost, while a promising approach, has some limitations. One limitation, which is shared with many methods that correct for measurement error, is that we assume that the covariance matrix of the measurement error process is known, an assumption which in many settings may be unrealistic. In some cases, it may be possible to estimate these structures using external data sources, but absent such data one could perform a sensitivity analysis with different measurement error variances and correlation structures, as we demonstrate in the real data application. Another challenging aspect of model selection with error-prone covariates is that, even if the set of candidate models is generated via a technique which accounts for measurement error, the process of selecting a final model (e.g., via cross-validation) still uses covariates that are measured with error. However, we showed in our simulation study that MEBoost performs well in selecting a model which recovers the relationship between the true (error-free) covariates and the outcome, even when using error-prone covariates to select the final model. This finding suggests that the procedure for generating a “path” of candidate models has a greater influence on prediction error and variable selection accuracy than the procedure for picking a final model from among those candidates.

To conclude, we note that while we only considered linear and Poisson regression in this paper, MEBoost can easily be applied to other regression models by, e.g., using the estimating equations presented by Nakamura [8] or others which correct for measurement error. In contrast, the approaches of Sørensen [4] and Datta [5] exploit the structure of the linear regression model and it is not obvious how they could be extended to the broader family of generalized linear models. The robustness and simplicity of MEBoost, along with its strong performance against other methods in the linear model case, suggest that this novel method is a reliable way to deal with variable selection in the presence of measurement error.

1. Spiegelman D, McDermott A, Rosner B. Regression calibration method for correcting measurement-error bias in nutritional epidemiology. The American Journal of Clinical Nutrition. 1997;65(4 Suppl):1179S–1186S.

2. Fraser Gary E, Stram Daniel O. Regression calibration when foods (measured with error) are the variables of interest: markedly non-Gaussian data with many zeroes. American Journal of Epidemiology. 2012;175(4):325–31.

3. Rosner B, Spiegelman D, Willett WC. Correction of logistic regression relative risk estimates and confidence intervals for random within-person measurement error. American Journal of Epidemiology. 1992;136(11):1400–13.

4. Sørensen Øystein, Frigessi Arnoldo, Thoresen Magne. Measurement Error in Lasso: Impact and Correction. arXiv.org. 2012.

5. Datta A., Zou H. CoCoLasso for High-dimensional Error-in-variables Regression. Annals of Statistics. 2017;(Accepted).

6. Stefanski Leonard A., Carroll Raymond J. Covariate Measurement Error in Logistic Regression. The Annals of Statistics. 1985;13(4):1335–1351.

7. Fuller Wayne A., ed. Measurement Error Models. Wiley Series in Probability and Statistics. Hoboken, NJ, USA: John Wiley & Sons, Inc.; 1987.

8. Nakamura T. Corrected score function for errors-in-variables models: Methodology and application to generalized linear models. Biometrika. 1990;77(1):127–137.

9. Buonaccorsi John. Measurement Error: Models, Methods and Applications. Boca Raton: CRC Press; 2010.

10. Ma Yanyuan, Li Runze. Variable selection in measurement error models. Bernoulli. 2010;16(1):274–300.

11. Tibshirani Robert. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society Series B. 1996;58:267–288.

12. Loh Po-Ling, Wainwright Martin J. High-dimensional regression with noisy and missing data: Provable guarantees with nonconvexity. The Annals of Statistics. 2012;40(3):1637–1664.

13. Brown Ben, Miller Christopher J., Wolfson Julian. ThrEEBoost: Thresholded Boosting for Variable Selection and Prediction via Estimating Equations. Journal of Computational and Graphical Statistics. 2017:1–10.

14. Liang Kung-Yee, Zeger Scott L. Longitudinal data analysis using generalized linear models. Biometrika. 1986;73(1):13–22.

15. Wolfson Julian. EEBoost: A General Method for Prediction and Variable Selection Based on Estimating Equations. Journal of the American Statistical Association. 2011;106(493):296–305.

16. Rosset Saharon, Zhu Ji, Hastie Trevor. Boosting as a Regularized Path to a Maximum Margin Classifier. Journal of Machine Learning Research. 2004;5:941–973.

17. Buonaccorsi John P. Prediction in the Presence of Measurement Error: General Discussion and an Example Predicting Defoliation. Biometrics. 1995;51(4):1562–1569.

18. Kipnis Victor, Subar Amy F, Midthune Douglas, et al. Structure of dietary measurement error: results of the OPEN biomarker study. American Journal of Epidemiology. 2003;158(1):14–21; discussion 22–6.

image

image

TABLE 1 Performance metrics for the 1,000 simulations in various measurement error scenarios. The models were selected at the point with minimum MSE-M.

image

image

image

image

