SAS - Replicate multiple observations across rows











up vote
0
down vote

favorite
1












I have a data structure that looks like this:



DATA have ; 
INPUT famid indid implicate imp_inc;
CARDS ;
1 1 1 40000
1 1 2 25000
1 1 3 34000
1 1 4 23555
1 1 5 49850
1 2 1 1000
1 2 2 2000
1 2 3 3000
1 2 4 4000
1 2 5 5000
1 3 1 .
1 3 2 .
1 3 3 .
1 3 4 .
1 3 5 .
2 1 1 40000
2 1 2 45000
2 1 3 50000
2 1 4 34000
2 1 5 23500
2 2 1 .
2 2 2 .
2 2 3 .
2 2 4 .
2 2 5 .
2 3 1 41000
2 3 2 39000
2 3 3 24000
2 3 4 32000
2 3 5 53000
RUN ;


So, we have family id, individual id, implicate number and imputed income for each implicate.



What i need is to replicate the results of the first individual in each family (all of the five implicates) for the remaining individuals within each family, replacing whatever values we previously had on those cells, like this:



DATA want ; 
INPUT famid indid implicate imp_inc;
CARDS ;
1 1 1 40000
1 1 2 25000
1 1 3 34000
1 1 4 23555
1 1 5 49850
1 2 1 40000
1 2 2 25000
1 2 3 34000
1 2 4 23555
1 2 5 49850
1 3 1 40000
1 3 2 25000
1 3 3 34000
1 3 4 23555
1 3 5 49850
2 1 1 40000
2 1 2 45000
2 1 3 50000
2 1 4 34000
2 1 5 23500
2 2 1 40000
2 2 2 45000
2 2 3 50000
2 2 4 34000
2 2 5 23500
2 3 1 40000
2 3 2 45000
2 3 3 50000
2 3 4 34000
2 3 5 23500
RUN ;


In this example I'm trying to replicate only one variable but in my project I will have to do this for dozens of variables.



So far, I came up with this solution:



%let implist_1=imp_inc;

%macro copyv1(list);
%let nwords=%sysfunc(countw(&list));
%do i=1 %to &nwords;
%let varl=%scan(&list, &i);
proc means data=have max noprint;
var &varl;
by famid implicate;
where indid=1;
OUTPUT OUT=copy max=max_&varl;
run;
data want;
set have;
drop &varl;
run;
data want (drop=_TYPE_ _FREQ_);
merge want copy;
by famid implicate;
rename max_&varl=&varl;
run;
%end;
%mend;
%copyv1(&imp_list1);


This works well for one or two variables. However it is tremendously slow once you do it for 400 variables in a data-set with the size of 1.5 GB.



I'm pretty sure there is a faster way to do this with some form of proc sql or first.var etc., but i'm relatively new to SAS and so far I couldn't come up with a better solution.



Thank you very much for your support.



Best regards










share|improve this question
























  • Is the source data already sorted by FAMID so that first record is the one to replicate? What if the value to be replicated is missing on the first record?
    – Tom
    Nov 7 at 17:15

















up vote
0
down vote

favorite
1












I have a data structure that looks like this:



DATA have ; 
INPUT famid indid implicate imp_inc;
CARDS ;
1 1 1 40000
1 1 2 25000
1 1 3 34000
1 1 4 23555
1 1 5 49850
1 2 1 1000
1 2 2 2000
1 2 3 3000
1 2 4 4000
1 2 5 5000
1 3 1 .
1 3 2 .
1 3 3 .
1 3 4 .
1 3 5 .
2 1 1 40000
2 1 2 45000
2 1 3 50000
2 1 4 34000
2 1 5 23500
2 2 1 .
2 2 2 .
2 2 3 .
2 2 4 .
2 2 5 .
2 3 1 41000
2 3 2 39000
2 3 3 24000
2 3 4 32000
2 3 5 53000
RUN ;


So, we have family id, individual id, implicate number and imputed income for each implicate.



What i need is to replicate the results of the first individual in each family (all of the five implicates) for the remaining individuals within each family, replacing whatever values we previously had on those cells, like this:



DATA want ; 
INPUT famid indid implicate imp_inc;
CARDS ;
1 1 1 40000
1 1 2 25000
1 1 3 34000
1 1 4 23555
1 1 5 49850
1 2 1 40000
1 2 2 25000
1 2 3 34000
1 2 4 23555
1 2 5 49850
1 3 1 40000
1 3 2 25000
1 3 3 34000
1 3 4 23555
1 3 5 49850
2 1 1 40000
2 1 2 45000
2 1 3 50000
2 1 4 34000
2 1 5 23500
2 2 1 40000
2 2 2 45000
2 2 3 50000
2 2 4 34000
2 2 5 23500
2 3 1 40000
2 3 2 45000
2 3 3 50000
2 3 4 34000
2 3 5 23500
RUN ;


In this example I'm trying to replicate only one variable but in my project I will have to do this for dozens of variables.



So far, I came up with this solution:



%let implist_1=imp_inc;

%macro copyv1(list);
%let nwords=%sysfunc(countw(&list));
%do i=1 %to &nwords;
%let varl=%scan(&list, &i);
proc means data=have max noprint;
var &varl;
by famid implicate;
where indid=1;
OUTPUT OUT=copy max=max_&varl;
run;
data want;
set have;
drop &varl;
run;
data want (drop=_TYPE_ _FREQ_);
merge want copy;
by famid implicate;
rename max_&varl=&varl;
run;
%end;
%mend;
%copyv1(&imp_list1);


This works well for one or two variables. However it is tremendously slow once you do it for 400 variables in a data-set with the size of 1.5 GB.



I'm pretty sure there is a faster way to do this with some form of proc sql or first.var etc., but i'm relatively new to SAS and so far I couldn't come up with a better solution.



Thank you very much for your support.



Best regards










share|improve this question
























  • Is the source data already sorted by FAMID so that first record is the one to replicate? What if the value to be replicated is missing on the first record?
    – Tom
    Nov 7 at 17:15















up vote
0
down vote

favorite
1









up vote
0
down vote

favorite
1






1





I have a data structure that looks like this:



DATA have ; 
INPUT famid indid implicate imp_inc;
CARDS ;
1 1 1 40000
1 1 2 25000
1 1 3 34000
1 1 4 23555
1 1 5 49850
1 2 1 1000
1 2 2 2000
1 2 3 3000
1 2 4 4000
1 2 5 5000
1 3 1 .
1 3 2 .
1 3 3 .
1 3 4 .
1 3 5 .
2 1 1 40000
2 1 2 45000
2 1 3 50000
2 1 4 34000
2 1 5 23500
2 2 1 .
2 2 2 .
2 2 3 .
2 2 4 .
2 2 5 .
2 3 1 41000
2 3 2 39000
2 3 3 24000
2 3 4 32000
2 3 5 53000
RUN ;


So, we have family id, individual id, implicate number and imputed income for each implicate.



What i need is to replicate the results of the first individual in each family (all of the five implicates) for the remaining individuals within each family, replacing whatever values we previously had on those cells, like this:



DATA want ; 
INPUT famid indid implicate imp_inc;
CARDS ;
1 1 1 40000
1 1 2 25000
1 1 3 34000
1 1 4 23555
1 1 5 49850
1 2 1 40000
1 2 2 25000
1 2 3 34000
1 2 4 23555
1 2 5 49850
1 3 1 40000
1 3 2 25000
1 3 3 34000
1 3 4 23555
1 3 5 49850
2 1 1 40000
2 1 2 45000
2 1 3 50000
2 1 4 34000
2 1 5 23500
2 2 1 40000
2 2 2 45000
2 2 3 50000
2 2 4 34000
2 2 5 23500
2 3 1 40000
2 3 2 45000
2 3 3 50000
2 3 4 34000
2 3 5 23500
RUN ;


In this example I'm trying to replicate only one variable but in my project I will have to do this for dozens of variables.



So far, I came up with this solution:



%let implist_1=imp_inc;

%macro copyv1(list);
%let nwords=%sysfunc(countw(&list));
%do i=1 %to &nwords;
%let varl=%scan(&list, &i);
proc means data=have max noprint;
var &varl;
by famid implicate;
where indid=1;
OUTPUT OUT=copy max=max_&varl;
run;
data want;
set have;
drop &varl;
run;
data want (drop=_TYPE_ _FREQ_);
merge want copy;
by famid implicate;
rename max_&varl=&varl;
run;
%end;
%mend;
%copyv1(&imp_list1);


This works well for one or two variables. However it is tremendously slow once you do it for 400 variables in a data-set with the size of 1.5 GB.



I'm pretty sure there is a faster way to do this with some form of proc sql or first.var etc., but i'm relatively new to SAS and so far I couldn't come up with a better solution.



Thank you very much for your support.



Best regards










share|improve this question















I have a data structure that looks like this:



DATA have ; 
INPUT famid indid implicate imp_inc;
CARDS ;
1 1 1 40000
1 1 2 25000
1 1 3 34000
1 1 4 23555
1 1 5 49850
1 2 1 1000
1 2 2 2000
1 2 3 3000
1 2 4 4000
1 2 5 5000
1 3 1 .
1 3 2 .
1 3 3 .
1 3 4 .
1 3 5 .
2 1 1 40000
2 1 2 45000
2 1 3 50000
2 1 4 34000
2 1 5 23500
2 2 1 .
2 2 2 .
2 2 3 .
2 2 4 .
2 2 5 .
2 3 1 41000
2 3 2 39000
2 3 3 24000
2 3 4 32000
2 3 5 53000
RUN ;


So, we have family id, individual id, implicate number and imputed income for each implicate.



What i need is to replicate the results of the first individual in each family (all of the five implicates) for the remaining individuals within each family, replacing whatever values we previously had on those cells, like this:



DATA want ; 
INPUT famid indid implicate imp_inc;
CARDS ;
1 1 1 40000
1 1 2 25000
1 1 3 34000
1 1 4 23555
1 1 5 49850
1 2 1 40000
1 2 2 25000
1 2 3 34000
1 2 4 23555
1 2 5 49850
1 3 1 40000
1 3 2 25000
1 3 3 34000
1 3 4 23555
1 3 5 49850
2 1 1 40000
2 1 2 45000
2 1 3 50000
2 1 4 34000
2 1 5 23500
2 2 1 40000
2 2 2 45000
2 2 3 50000
2 2 4 34000
2 2 5 23500
2 3 1 40000
2 3 2 45000
2 3 3 50000
2 3 4 34000
2 3 5 23500
RUN ;


In this example I'm trying to replicate only one variable but in my project I will have to do this for dozens of variables.



So far, I came up with this solution:



%let implist_1=imp_inc;

%macro copyv1(list);
%let nwords=%sysfunc(countw(&list));
%do i=1 %to &nwords;
%let varl=%scan(&list, &i);
proc means data=have max noprint;
var &varl;
by famid implicate;
where indid=1;
OUTPUT OUT=copy max=max_&varl;
run;
data want;
set have;
drop &varl;
run;
data want (drop=_TYPE_ _FREQ_);
merge want copy;
by famid implicate;
rename max_&varl=&varl;
run;
%end;
%mend;
%copyv1(&imp_list1);


This works well for one or two variables. However it is tremendously slow once you do it for 400 variables in a data-set with the size of 1.5 GB.



I'm pretty sure there is a faster way to do this with some form of proc sql or first.var etc., but i'm relatively new to SAS and so far I couldn't come up with a better solution.



Thank you very much for your support.



Best regards







sas rows replicate






share|improve this question















share|improve this question













share|improve this question




share|improve this question








edited Nov 7 at 10:35

























asked Nov 7 at 10:26









Luis Martins

106




106












  • Is the source data already sorted by FAMID so that first record is the one to replicate? What if the value to be replicated is missing on the first record?
    – Tom
    Nov 7 at 17:15




















  • Is the source data already sorted by FAMID so that first record is the one to replicate? What if the value to be replicated is missing on the first record?
    – Tom
    Nov 7 at 17:15


















Is the source data already sorted by FAMID so that first record is the one to replicate? What if the value to be replicated is missing on the first record?
– Tom
Nov 7 at 17:15






Is the source data already sorted by FAMID so that first record is the one to replicate? What if the value to be replicated is missing on the first record?
– Tom
Nov 7 at 17:15














2 Answers
2






active

oldest

votes

















up vote
0
down vote



accepted










Yes, this can be done in DATA step using a first. reference made available via the by statement.



data want;
set have (keep=famid indid implicate imp_inc /* other vars */);

by famid indid implicate; /* by implicate is so step logs an error (at run-time) if data not sorted */

if first.famid then if indid ne 1 then abort;

array across imp_inc /* other vars */;
array hold [1,5] _temporary_; /* or [<n>,5] where <n> means the number of variables in the across array */

if indid = 1 then do; /* hold data for 1st individuals implicate across data */
do _n_ = 1 to dim(across);
hold[_n_,implicate] = across[_n_]; /* store info of each implicate of first individual */
end;
end;
else do;
do _n_ = 1 to dim(across);
across[_n_] = hold[_n_,implicate]; /* apply 1st persons info to subsequent persons */
end;
end;
run;


The DATA step could be significantly faster due to single pass through data, however there is an internal processing cost associated with calculating all those pesky array addresses at run; time, and that cost could become impactful at some <n>



SQL is simpler syntax, clearer understanding and works if have data set is unsorted or has some peculiar sequencing in the by group.






share|improve this answer





















  • With the SQL procedure my entire program takes 11 minutes per iteration to run; with this solution it takes 9 minutes. Thanks a lot!
    – Luis Martins
    Nov 8 at 10:59










  • Great! Welcome to Stack overflow, learn more about the question and answer process
    – Richard
    Nov 8 at 13:35




















up vote
0
down vote













This is fairly straightforward with a bit of SQL:



proc sql;
create table want as
select a.famid, a.indid, a.implicate, b.* from
have a
left join (
select * from have
group by famid
having indid = min(indid)
) b
on
a.famid = b.famid
and a.implicate = b.implicate
order by a.famid, a.indid, a.implicate
;
quit;


The idea is to join the table to a subset of itself containing only the rows corresponding to the first individual within each family.



It is set up to pick the lowest numbered individual within each family, so it will work even if there is no row with indid = 1. If you are sure that there will always be such a row, you can use a slightly simpler query:



proc sql;
create table want as
select a.famid, a.indid, a.implicate, b.* from
have(sortedby = famid) a
left join have(where = (indid = 1)) b
on
a.famid = b.famid
and a.implicate = b.implicate
order by a.famid, a.indid, a.implicate
;
quit;


Specifying sortedby = famid provides a hint to the query optimiser that it can skip one of the initial sorts required for the join, which may improve performance a bit.






share|improve this answer























  • Thank you for your reply. I don't know much about sql, but this seems to replicate all variables right? If I want to select a subset I would need to put select imp_inc instead of select *, right?
    – Luis Martins
    Nov 7 at 11:01










  • You can also specify a comma-separated list of variables, or use data set options, e.g. select * from have(keep = key1-key3 firstvar--lastvar)
    – user667489
    Nov 7 at 11:09










  • This works really nice thank you very much!
    – Luis Martins
    Nov 7 at 15:15











Your Answer






StackExchange.ifUsing("editor", function () {
StackExchange.using("externalEditor", function () {
StackExchange.using("snippets", function () {
StackExchange.snippets.init();
});
});
}, "code-snippets");

StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "1"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});

function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});


}
});














 

draft saved


draft discarded


















StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53187588%2fsas-replicate-multiple-observations-across-rows%23new-answer', 'question_page');
}
);

Post as a guest















Required, but never shown

























2 Answers
2






active

oldest

votes








2 Answers
2






active

oldest

votes









active

oldest

votes






active

oldest

votes








up vote
0
down vote



accepted










Yes, this can be done in DATA step using a first. reference made available via the by statement.



data want;
set have (keep=famid indid implicate imp_inc /* other vars */);

by famid indid implicate; /* by implicate is so step logs an error (at run-time) if data not sorted */

if first.famid then if indid ne 1 then abort;

array across imp_inc /* other vars */;
array hold [1,5] _temporary_; /* or [<n>,5] where <n> means the number of variables in the across array */

if indid = 1 then do; /* hold data for 1st individuals implicate across data */
do _n_ = 1 to dim(across);
hold[_n_,implicate] = across[_n_]; /* store info of each implicate of first individual */
end;
end;
else do;
do _n_ = 1 to dim(across);
across[_n_] = hold[_n_,implicate]; /* apply 1st persons info to subsequent persons */
end;
end;
run;


The DATA step could be significantly faster due to single pass through data, however there is an internal processing cost associated with calculating all those pesky array addresses at run; time, and that cost could become impactful at some <n>



SQL is simpler syntax, clearer understanding and works if have data set is unsorted or has some peculiar sequencing in the by group.






share|improve this answer





















  • With the SQL procedure my entire program takes 11 minutes per iteration to run; with this solution it takes 9 minutes. Thanks a lot!
    – Luis Martins
    Nov 8 at 10:59










  • Great! Welcome to Stack overflow, learn more about the question and answer process
    – Richard
    Nov 8 at 13:35

















up vote
0
down vote



accepted










Yes, this can be done in DATA step using a first. reference made available via the by statement.



data want;
set have (keep=famid indid implicate imp_inc /* other vars */);

by famid indid implicate; /* by implicate is so step logs an error (at run-time) if data not sorted */

if first.famid then if indid ne 1 then abort;

array across imp_inc /* other vars */;
array hold [1,5] _temporary_; /* or [<n>,5] where <n> means the number of variables in the across array */

if indid = 1 then do; /* hold data for 1st individuals implicate across data */
do _n_ = 1 to dim(across);
hold[_n_,implicate] = across[_n_]; /* store info of each implicate of first individual */
end;
end;
else do;
do _n_ = 1 to dim(across);
across[_n_] = hold[_n_,implicate]; /* apply 1st persons info to subsequent persons */
end;
end;
run;


The DATA step could be significantly faster due to single pass through data, however there is an internal processing cost associated with calculating all those pesky array addresses at run; time, and that cost could become impactful at some <n>



SQL is simpler syntax, clearer understanding and works if have data set is unsorted or has some peculiar sequencing in the by group.






share|improve this answer





















  • With the SQL procedure my entire program takes 11 minutes per iteration to run; with this solution it takes 9 minutes. Thanks a lot!
    – Luis Martins
    Nov 8 at 10:59










  • Great! Welcome to Stack overflow, learn more about the question and answer process
    – Richard
    Nov 8 at 13:35















up vote
0
down vote



accepted







up vote
0
down vote



accepted






Yes, this can be done in DATA step using a first. reference made available via the by statement.



data want;
set have (keep=famid indid implicate imp_inc /* other vars */);

by famid indid implicate; /* by implicate is so step logs an error (at run-time) if data not sorted */

if first.famid then if indid ne 1 then abort;

array across imp_inc /* other vars */;
array hold [1,5] _temporary_; /* or [<n>,5] where <n> means the number of variables in the across array */

if indid = 1 then do; /* hold data for 1st individuals implicate across data */
do _n_ = 1 to dim(across);
hold[_n_,implicate] = across[_n_]; /* store info of each implicate of first individual */
end;
end;
else do;
do _n_ = 1 to dim(across);
across[_n_] = hold[_n_,implicate]; /* apply 1st persons info to subsequent persons */
end;
end;
run;


The DATA step could be significantly faster due to single pass through data, however there is an internal processing cost associated with calculating all those pesky array addresses at run; time, and that cost could become impactful at some <n>



SQL is simpler syntax, clearer understanding and works if have data set is unsorted or has some peculiar sequencing in the by group.






share|improve this answer












Yes, this can be done in DATA step using a first. reference made available via the by statement.



data want;
set have (keep=famid indid implicate imp_inc /* other vars */);

by famid indid implicate; /* by implicate is so step logs an error (at run-time) if data not sorted */

if first.famid then if indid ne 1 then abort;

array across imp_inc /* other vars */;
array hold [1,5] _temporary_; /* or [<n>,5] where <n> means the number of variables in the across array */

if indid = 1 then do; /* hold data for 1st individuals implicate across data */
do _n_ = 1 to dim(across);
hold[_n_,implicate] = across[_n_]; /* store info of each implicate of first individual */
end;
end;
else do;
do _n_ = 1 to dim(across);
across[_n_] = hold[_n_,implicate]; /* apply 1st persons info to subsequent persons */
end;
end;
run;


The DATA step could be significantly faster due to single pass through data, however there is an internal processing cost associated with calculating all those pesky array addresses at run; time, and that cost could become impactful at some <n>



SQL is simpler syntax, clearer understanding and works if have data set is unsorted or has some peculiar sequencing in the by group.







share|improve this answer












share|improve this answer



share|improve this answer










answered Nov 7 at 13:22









Richard

7,38321125




7,38321125












  • With the SQL procedure my entire program takes 11 minutes per iteration to run; with this solution it takes 9 minutes. Thanks a lot!
    – Luis Martins
    Nov 8 at 10:59










  • Great! Welcome to Stack overflow, learn more about the question and answer process
    – Richard
    Nov 8 at 13:35




















  • With the SQL procedure my entire program takes 11 minutes per iteration to run; with this solution it takes 9 minutes. Thanks a lot!
    – Luis Martins
    Nov 8 at 10:59










  • Great! Welcome to Stack overflow, learn more about the question and answer process
    – Richard
    Nov 8 at 13:35


















With the SQL procedure my entire program takes 11 minutes per iteration to run; with this solution it takes 9 minutes. Thanks a lot!
– Luis Martins
Nov 8 at 10:59




With the SQL procedure my entire program takes 11 minutes per iteration to run; with this solution it takes 9 minutes. Thanks a lot!
– Luis Martins
Nov 8 at 10:59












Great! Welcome to Stack overflow, learn more about the question and answer process
– Richard
Nov 8 at 13:35






Great! Welcome to Stack overflow, learn more about the question and answer process
– Richard
Nov 8 at 13:35














up vote
0
down vote













This is fairly straightforward with a bit of SQL:



proc sql;
create table want as
select a.famid, a.indid, a.implicate, b.* from
have a
left join (
select * from have
group by famid
having indid = min(indid)
) b
on
a.famid = b.famid
and a.implicate = b.implicate
order by a.famid, a.indid, a.implicate
;
quit;


The idea is to join the table to a subset of itself containing only the rows corresponding to the first individual within each family.



It is set up to pick the lowest numbered individual within each family, so it will work even if there is no row with indid = 1. If you are sure that there will always be such a row, you can use a slightly simpler query:



proc sql;
create table want as
select a.famid, a.indid, a.implicate, b.* from
have(sortedby = famid) a
left join have(where = (indid = 1)) b
on
a.famid = b.famid
and a.implicate = b.implicate
order by a.famid, a.indid, a.implicate
;
quit;


Specifying sortedby = famid provides a hint to the query optimiser that it can skip one of the initial sorts required for the join, which may improve performance a bit.






share|improve this answer























  • Thank you for your reply. I don't know much about sql, but this seems to replicate all variables right? If I want to select a subset I would need to put select imp_inc instead of select *, right?
    – Luis Martins
    Nov 7 at 11:01










  • You can also specify a comma-separated list of variables, or use data set options, e.g. select * from have(keep = key1-key3 firstvar--lastvar)
    – user667489
    Nov 7 at 11:09










  • This works really nice thank you very much!
    – Luis Martins
    Nov 7 at 15:15















up vote
0
down vote













This is fairly straightforward with a bit of SQL:



proc sql;
create table want as
select a.famid, a.indid, a.implicate, b.* from
have a
left join (
select * from have
group by famid
having indid = min(indid)
) b
on
a.famid = b.famid
and a.implicate = b.implicate
order by a.famid, a.indid, a.implicate
;
quit;


The idea is to join the table to a subset of itself containing only the rows corresponding to the first individual within each family.



It is set up to pick the lowest numbered individual within each family, so it will work even if there is no row with indid = 1. If you are sure that there will always be such a row, you can use a slightly simpler query:



proc sql;
create table want as
select a.famid, a.indid, a.implicate, b.* from
have(sortedby = famid) a
left join have(where = (indid = 1)) b
on
a.famid = b.famid
and a.implicate = b.implicate
order by a.famid, a.indid, a.implicate
;
quit;


Specifying sortedby = famid provides a hint to the query optimiser that it can skip one of the initial sorts required for the join, which may improve performance a bit.






share|improve this answer























  • Thank you for your reply. I don't know much about sql, but this seems to replicate all variables right? If I want to select a subset I would need to put select imp_inc instead of select *, right?
    – Luis Martins
    Nov 7 at 11:01










  • You can also specify a comma-separated list of variables, or use data set options, e.g. select * from have(keep = key1-key3 firstvar--lastvar)
    – user667489
    Nov 7 at 11:09










  • This works really nice thank you very much!
    – Luis Martins
    Nov 7 at 15:15













up vote
0
down vote










up vote
0
down vote









This is fairly straightforward with a bit of SQL:



proc sql;
create table want as
select a.famid, a.indid, a.implicate, b.* from
have a
left join (
select * from have
group by famid
having indid = min(indid)
) b
on
a.famid = b.famid
and a.implicate = b.implicate
order by a.famid, a.indid, a.implicate
;
quit;


The idea is to join the table to a subset of itself containing only the rows corresponding to the first individual within each family.



It is set up to pick the lowest numbered individual within each family, so it will work even if there is no row with indid = 1. If you are sure that there will always be such a row, you can use a slightly simpler query:



proc sql;
create table want as
select a.famid, a.indid, a.implicate, b.* from
have(sortedby = famid) a
left join have(where = (indid = 1)) b
on
a.famid = b.famid
and a.implicate = b.implicate
order by a.famid, a.indid, a.implicate
;
quit;


Specifying sortedby = famid provides a hint to the query optimiser that it can skip one of the initial sorts required for the join, which may improve performance a bit.






share|improve this answer














This is fairly straightforward with a bit of SQL:



proc sql;
create table want as
select a.famid, a.indid, a.implicate, b.* from
have a
left join (
select * from have
group by famid
having indid = min(indid)
) b
on
a.famid = b.famid
and a.implicate = b.implicate
order by a.famid, a.indid, a.implicate
;
quit;


The idea is to join the table to a subset of itself containing only the rows corresponding to the first individual within each family.



It is set up to pick the lowest numbered individual within each family, so it will work even if there is no row with indid = 1. If you are sure that there will always be such a row, you can use a slightly simpler query:



proc sql;
create table want as
select a.famid, a.indid, a.implicate, b.* from
have(sortedby = famid) a
left join have(where = (indid = 1)) b
on
a.famid = b.famid
and a.implicate = b.implicate
order by a.famid, a.indid, a.implicate
;
quit;


Specifying sortedby = famid provides a hint to the query optimiser that it can skip one of the initial sorts required for the join, which may improve performance a bit.







share|improve this answer














share|improve this answer



share|improve this answer








edited Nov 7 at 11:11

























answered Nov 7 at 10:48









user667489

7,69721226




7,69721226












  • Thank you for your reply. I don't know much about sql, but this seems to replicate all variables right? If I want to select a subset I would need to put select imp_inc instead of select *, right?
    – Luis Martins
    Nov 7 at 11:01










  • You can also specify a comma-separated list of variables, or use data set options, e.g. select * from have(keep = key1-key3 firstvar--lastvar)
    – user667489
    Nov 7 at 11:09










  • This works really nice thank you very much!
    – Luis Martins
    Nov 7 at 15:15


















  • Thank you for your reply. I don't know much about sql, but this seems to replicate all variables right? If I want to select a subset I would need to put select imp_inc instead of select *, right?
    – Luis Martins
    Nov 7 at 11:01










  • You can also specify a comma-separated list of variables, or use data set options, e.g. select * from have(keep = key1-key3 firstvar--lastvar)
    – user667489
    Nov 7 at 11:09










  • This works really nice thank you very much!
    – Luis Martins
    Nov 7 at 15:15
















Thank you for your reply. I don't know much about sql, but this seems to replicate all variables right? If I want to select a subset I would need to put select imp_inc instead of select *, right?
– Luis Martins
Nov 7 at 11:01




Thank you for your reply. I don't know much about sql, but this seems to replicate all variables right? If I want to select a subset I would need to put select imp_inc instead of select *, right?
– Luis Martins
Nov 7 at 11:01












You can also specify a comma-separated list of variables, or use data set options, e.g. select * from have(keep = key1-key3 firstvar--lastvar)
– user667489
Nov 7 at 11:09




You can also specify a comma-separated list of variables, or use data set options, e.g. select * from have(keep = key1-key3 firstvar--lastvar)
– user667489
Nov 7 at 11:09












This works really nice thank you very much!
– Luis Martins
Nov 7 at 15:15




This works really nice thank you very much!
– Luis Martins
Nov 7 at 15:15


















 

draft saved


draft discarded



















































 


draft saved


draft discarded














StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53187588%2fsas-replicate-multiple-observations-across-rows%23new-answer', 'question_page');
}
);

Post as a guest















Required, but never shown





















































Required, but never shown














Required, but never shown












Required, but never shown







Required, but never shown

































Required, but never shown














Required, but never shown












Required, but never shown







Required, but never shown







這個網誌中的熱門文章

Tangent Lines Diagram Along Smooth Curve

Yusuf al-Mu'taman ibn Hud

Zucchini