I am fitting a locally weighted regression using housing data, where I only include observations within a certain distance of the house in question. For cross-validation of the bandwidth I am collecting elements of the "hat matrix" (where y hat = hat matrix * y). I was convinced I could grab the diagonal elements of the hat matrix using lm.influence()$hat. In particular, I am interested in the one element of the hat matrix that corresponds to the observation at which I am running the locally weighted regression.

When I looked more closely at the lm.influence()$hat output, I realized that the observations used in my regression do not appear to be the same observations for which the hat matrix returns values. I had assumed the "names" associated with lm.influence()$hat were the observation numbers for the regression data. Am I wrong? I've included a code snippet and its output. I am confused as to why the observations to which I give positive weights in the regression do not appear to be the same as the "names" in the hat matrix output. Do you know what mistake I am making?
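Before the actual snippet, here is a minimal sketch on made-up data of the comparison I am trying to make. The data frame `toy` and its columns are hypothetical, not my housing data. The sketch assumes that the names on lm.influence()$hat are the data frame's row names (which stop matching row positions once rows have been dropped) and that zero-weight cases are left out of the result; I have not confirmed that this is what is happening with housedata.

```r
# Toy data (hypothetical; NOT the housing data)
set.seed(1)
full <- data.frame(x = rnorm(12), y = rnorm(12))
toy  <- full[-c(2, 5), ]          # dropping rows keeps the original row names
toy$w <- 0
toy$w[c(1, 3, 4, 6, 8, 9)] <- 1   # row positions given a weight of 1

fit <- lm(y ~ x, data = toy, weights = w)
h   <- lm.influence(fit)$hat      # diagonal of the hat matrix

pos   <- which(toy$w > 0)         # row positions, as in which(housedata$w > 0)
label <- rownames(toy)[pos]       # row names of those same rows

# Hat diagonal computed by hand from the positively weighted rows only
X <- cbind(1, toy$x[pos])
h_manual <- diag(X %*% solve(crossprod(X)) %*% t(X))

# Assumption being checked: names(h) are row names, not row positions
same_labels <- identical(names(h), label)
same_values <- isTRUE(all.equal(unname(h), unname(h_manual)))

# Look up one observation's hat value by row name rather than by position
obs   <- 4
h_obs <- h[rownames(toy)[obs]]
```

If the assumption holds, `same_labels` and `same_values` are both TRUE while `as.integer(names(h))` differs from `pos`, which would be one way the two listings could have the same length without showing the same numbers.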
> obs <- 451 # this is the location/observation in the data for which we are currently running the regression, for example
> require(fields)
> # calculate the distance all other observations are from this observation
> Di=t(rdist.earth(cbind(housedata$longitude[obs],housedata$latitude[obs]),
+                  cbind(housedata$longitude,housedata$latitude) ))
>
> ##########################
> b=.3 # this is the relevant distance threshold
>
> housedata$w <- 0 # generate a "weights" variable
> housedata$w[Di<b] <- 1 # give all observations closer than b a weight of 1
> print(which(housedata$w>0)) # this tells me which observations are included in this regression
 [1] 333 336 340 345 346 376 378 406 414 418 419 425 426 427 428 429 430 431
[19] 436 438 441 444 450 451 456 457 458 461 462 463 464 465 467 468 469 470
[37] 471 474 475 476 479 481 483 488 494 496 508 512 514 518 525 526 528 530
[55] 531 533 538 539 544 548 563 572 576 584 585 587 591 594 595 600 601 607
[73] 613 615 616 617 618 624 631 637 638 641 645 647 652 653 654 655 656 659
[91] 663 678 681 685 688 689 691 693 694 711 712
> # run the linear regression only including the observations within the distance threshold
> result.b <- lm(adjprice~lotsize+squareft+garagesqft+numbath+numbed+time,
+                data=housedata,
+                weights=w )
> # collect the hat matrix
> print(lm.influence(result.b)$hat)
       345        348        352        357        358        389        391
0.06332126 0.06332126 0.05592105 0.09368046 0.10605304 0.05592105 0.09757274
       419        427        431        432        438        439        440
0.03762151 0.10091480 0.04979739 0.05659565 0.05160888 0.03915642 0.10149422
       441        442        443        444        449        451        722
0.05572360 0.03086186 0.05624229 0.04658039 0.09087753 0.06436925 0.09952022
       725        731        732        737        738        739        742
0.08183102 0.06732644 0.05362610 0.04742278 0.05196055 0.02725287 0.03086186
       743        744        745        746        748        749        750
0.03848066 0.06161776 0.03352387 0.09729289 0.04968367 0.04588662 0.04620045
       751        752        755        756        757        760        762
0.08194437 0.07748418 0.20282956 0.05679513 0.05283027 0.08194437 0.05737857
       764        769        775        777        789        793        795
0.14753830 0.04742278 0.04409041 0.04675800 0.05739381 0.05739381 0.04125143
       799        806        807        809        811        812        814
0.11049178 0.05286319 0.04125143 0.13971558 0.03192842 0.04254609 0.06587966
       819        820        825        829        844        853        857
0.23414783 0.02942560 0.04627927 0.04968367 0.04968367 0.04627927 0.02689040
       865        866        868        872        875        876        881
0.10691998 0.09988275 0.06171944 0.08152409 0.11049178 0.04627927 0.05572857
       882        888        894        896        897        898        899
0.10646147 0.04149530 0.12769051 0.04092457 0.06117365 0.04092457 0.04316847
       905        912        918        919        922        926        928
0.17072235 0.04125143 0.06117365 0.14435872 0.04309004 0.06117365 0.05196055
       933        934        935        936        937        940        944
0.06065717 0.03094961 0.18271286 0.10755273 0.05196055 0.06117365 0.06117365
       959        962        966        969        971        973        975
0.13231524 0.06752826 0.06752826 0.06752826 0.06752826 0.06117365 0.06752826
       976        994        995
0.04149530 0.04125143 0.06158475

I only noticed this problem because several times the observation in question wasn't even part of the hat matrix output. Am I incorrect in assuming that the output from print(which(housedata$w>0)) should be the same as the "names" from print(lm.influence(result.b)$hat)? Both have the same length (in this case 88 observations), but they don't appear to be the same observations.

Thanks to anyone who can help me clear this up,
Aaron