Partially match to words from two different files and extract rows sed,awk,grep?












1















I have two different files that I want to partially match one number from the first one with another number from the other one and extract the whole column.



File1:



smt_hsa_3150    932
smt_hsa_28592 682
smt_hsa_5184 657
smt_hsa_430 648
smt_hsa_14100 648
smt_hsa_96 648


File2:



chr11 5933549 5933577 29 + hsa_smt_028592
chr11 45693060 45693086 27 - hsa_smt_000059
chr11 45699803 45699832 30 - hsa_smt_000087
chr2 131291172 131291197 26 - hsa_smt_000096


I need to match smt_hsa_28592 or 28592 with hsa_smt_028592 or 028592.
and then extract to a new file the line from the second file plus the number from 2nd column of the 1st file.



output:



chr11 5933549 5933577 29 + hsa_smt_028592 682   
chr2 131291172 131291197 26 - hsa_smt_000096 648


As I'm new to awk/sed programming I tried first to change the name of the first column of the 1st file from smt_hsa_3150 to hsa_smt_3150, but when I perform



awk '{gsub("smt","hsa")}1'


then I cannot use the same code to change only the second "hsa".
The second problem is how I would be able to match hsa_smt_028592 with smt_has_28592 or smt_hsa_96 with hsa_smt_000096.










share|improve this question























  • Possible duplicate of Inner join on two text files

    – tripleee
    Nov 16 '18 at 11:13
















1















I have two different files that I want to partially match one number from the first one with another number from the other one and extract the whole column.



File1:



smt_hsa_3150    932
smt_hsa_28592 682
smt_hsa_5184 657
smt_hsa_430 648
smt_hsa_14100 648
smt_hsa_96 648


File2:



chr11 5933549 5933577 29 + hsa_smt_028592
chr11 45693060 45693086 27 - hsa_smt_000059
chr11 45699803 45699832 30 - hsa_smt_000087
chr2 131291172 131291197 26 - hsa_smt_000096


I need to match smt_hsa_28592 or 28592 with hsa_smt_028592 or 028592.
and then extract to a new file the line from the second file plus the number from 2nd column of the 1st file.



output:



chr11 5933549 5933577 29 + hsa_smt_028592 682   
chr2 131291172 131291197 26 - hsa_smt_000096 648


As I'm new to awk/sed programming I tried first to change the name of the first column of the 1st file from smt_hsa_3150 to hsa_smt_3150, but when I perform



awk '{gsub("smt","hsa")}1'


then I cannot use the same code to change only the second "hsa".
The second problem is how I would be able to match hsa_smt_028592 with smt_has_28592 or smt_hsa_96 with hsa_smt_000096.










share|improve this question























  • Possible duplicate of Inner join on two text files

    – tripleee
    Nov 16 '18 at 11:13














1












1








1








I have two different files that I want to partially match one number from the first one with another number from the other one and extract the whole column.



File1:



smt_hsa_3150    932
smt_hsa_28592 682
smt_hsa_5184 657
smt_hsa_430 648
smt_hsa_14100 648
smt_hsa_96 648


File2:



chr11 5933549 5933577 29 + hsa_smt_028592
chr11 45693060 45693086 27 - hsa_smt_000059
chr11 45699803 45699832 30 - hsa_smt_000087
chr2 131291172 131291197 26 - hsa_smt_000096


I need to match smt_hsa_28592 or 28592 with hsa_smt_028592 or 028592.
and then extract to a new file the line from the second file plus the number from 2nd column of the 1st file.



output:



chr11 5933549 5933577 29 + hsa_smt_028592 682   
chr2 131291172 131291197 26 - hsa_smt_000096 648


As I'm new to awk/sed programming I tried first to change the name of the first column of the 1st file from smt_hsa_3150 to hsa_smt_3150, but when I perform



awk '{gsub("smt","hsa")}1'


then I cannot use the same code to change only the second "hsa".
The second problem is how I would be able to match hsa_smt_028592 with smt_has_28592 or smt_hsa_96 with hsa_smt_000096.










share|improve this question














I have two different files that I want to partially match one number from the first one with another number from the other one and extract the whole column.



File1:



smt_hsa_3150    932
smt_hsa_28592 682
smt_hsa_5184 657
smt_hsa_430 648
smt_hsa_14100 648
smt_hsa_96 648


File2:



chr11 5933549 5933577 29 + hsa_smt_028592
chr11 45693060 45693086 27 - hsa_smt_000059
chr11 45699803 45699832 30 - hsa_smt_000087
chr2 131291172 131291197 26 - hsa_smt_000096


I need to match smt_hsa_28592 or 28592 with hsa_smt_028592 or 028592.
and then extract to a new file the line from the second file plus the number from 2nd column of the 1st file.



output:



chr11 5933549 5933577 29 + hsa_smt_028592 682   
chr2 131291172 131291197 26 - hsa_smt_000096 648


As I'm new to awk/sed programming I tried first to change the name of the first column of the 1st file from smt_hsa_3150 to hsa_smt_3150, but when I perform



awk '{gsub("smt","hsa")}1'


then I cannot use the same code to change only the second "hsa".
The second problem is how I would be able to match hsa_smt_028592 with smt_has_28592 or smt_hsa_96 with hsa_smt_000096.







awk sed grep






share|improve this question













share|improve this question











share|improve this question




share|improve this question










asked Nov 16 '18 at 8:43









K GelesK Geles

4217




4217













  • Possible duplicate of Inner join on two text files

    – tripleee
    Nov 16 '18 at 11:13



















  • Possible duplicate of Inner join on two text files

    – tripleee
    Nov 16 '18 at 11:13

















Possible duplicate of Inner join on two text files

– tripleee
Nov 16 '18 at 11:13





Possible duplicate of Inner join on two text files

– tripleee
Nov 16 '18 at 11:13












1 Answer
1






active

oldest

votes


















1














For fun using join :



join -11 -26 -o 2.1,2.2,2.3,2.4,2.5,1.1,1.2 <(awk -F' +|_' '{printf "hsa_smt_%06s %sn", $3, $4}' file1.txt | sort -k1) <(sort -k6 file2.txt)


awk command is formatting the first column of your file1 in same format as file2. Then we join the two files (first we need to sort them both).





awk -F' *|_' -> I split on multiple spaces and "_" character



'{printf "hsa_smt_%06s %sn", $3, $4}' -> I print hsa_smt_< third column as an integer of width 6 > <4th column>. Awk will add leading zeros if needed



sort -k6 file2 or awk ... | sort -k1 I sort the files on 6th or 1st columns



join -11 -26 -o 2.1,2.2,2.3,2.4,2.5,1.1,1.2 I join the files on the first column for file 1 (-11) and 6th column for file 2 (-26) then I order the columns (2.1 -> 1st column of file2, etc...)





Pure awk solution :



awk 'NR==FNR{split($1,s,"_");smt[s[3]+0]=$2;next}{split($6,s,"_");k=s[3]+0}k in smt{print $0, smt[k]}' file1.txt file2.txt


In this solution, I split smt_hsa_xxx and hsa_smt_xxx, convert the 3rd part to integers so I can compare them.





NR==FNR{split($1,s,"_");smt[s[3]+0]=$2;next} this block will be executed only on file1 because of the NR==FNR condition. It splits the first column, convert the smt_hsa id to integer (doing +0 is a classic way to convert string to integer in awk), then stores second column inside an array named smt.



next; prevents executing the other blocks for file1



{split($6,s,"_");k=s[3]+0} this block will split the 6th column of file2, convert the id to integer. We could have used " +|_" as delimiter to avoid splitting columns (see first solution).



k in smt{print $0, smt[k]} will check if the id is a key of smt array. Then it prints line from file2 and second column of file1 (which was stored in smt)





I'm pretty sure that the second solution is more performant but may be greedier in memory.



Both give same output, order of lines may differ :



chr11 5933549 5933577 29 + hsa_smt_028592 682
chr2 131291172 131291197 26 - hsa_smt_000096 648





share|improve this answer


























  • Thank you very much, is it possible to have a small explanation about each step of awk?

    – K Geles
    Nov 16 '18 at 9:29








  • 2





    I added some explanations, do not hesitate to comment if you have any questions @K.Gele

    – Corentin Limier
    Nov 16 '18 at 9:48













  • It is much informative! Thank you!

    – K Geles
    Nov 16 '18 at 12:09













  • I have another question regarding joining two files but it will be marked as a duplicate. How should I proceed?

    – K Geles
    Nov 28 '18 at 15:31











  • @KGeles if it is set as duplicate that's because you can get your answer already somewhere ?

    – Corentin Limier
    Nov 28 '18 at 16:55












Your Answer






StackExchange.ifUsing("editor", function () {
StackExchange.using("externalEditor", function () {
StackExchange.using("snippets", function () {
StackExchange.snippets.init();
});
});
}, "code-snippets");

StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "1"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});

function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});


}
});














draft saved

draft discarded


















StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53334229%2fpartially-match-to-words-from-two-different-files-and-extract-rows-sed-awk-grep%23new-answer', 'question_page');
}
);

Post as a guest















Required, but never shown

























1 Answer
1






active

oldest

votes








1 Answer
1






active

oldest

votes









active

oldest

votes






active

oldest

votes









1














For fun using join :



join -11 -26 -o 2.1,2.2,2.3,2.4,2.5,1.1,1.2 <(awk -F' +|_' '{printf "hsa_smt_%06s %sn", $3, $4}' file1.txt | sort -k1) <(sort -k6 file2.txt)


awk command is formatting the first column of your file1 in same format as file2. Then we join the two files (first we need to sort them both).





awk -F' *|_' -> I split on multiple spaces and "_" character



'{printf "hsa_smt_%06s %sn", $3, $4}' -> I print hsa_smt_< third column as an integer of width 6 > <4th column>. Awk will add leading zeros if needed



sort -k6 file2 or awk ... | sort -k1 I sort the files on 6th or 1st columns



join -11 -26 -o 2.1,2.2,2.3,2.4,2.5,1.1,1.2 I join the files on the first column for file 1 (-11) and 6th column for file 2 (-26) then I order the columns (2.1 -> 1st column of file2, etc...)





Pure awk solution :



awk 'NR==FNR{split($1,s,"_");smt[s[3]+0]=$2;next}{split($6,s,"_");k=s[3]+0}k in smt{print $0, smt[k]}' file1.txt file2.txt


In this solution, I split smt_hsa_xxx and hsa_smt_xxx, convert the 3rd part to integers so I can compare them.





NR==FNR{split($1,s,"_");smt[s[3]+0]=$2;next} this block will be executed only on file1 because of the NR==FNR condition. It splits the first column, convert the smt_hsa id to integer (doing +0 is a classic way to convert string to integer in awk), then stores second column inside an array named smt.



next; prevents executing the other blocks for file1



{split($6,s,"_");k=s[3]+0} this block will split the 6th column of file2, convert the id to integer. We could have used " +|_" as delimiter to avoid splitting columns (see first solution).



k in smt{print $0, smt[k]} will check if the id is a key of smt array. Then it prints line from file2 and second column of file1 (which was stored in smt)





I'm pretty sure that the second solution is more performant but may be greedier in memory.



Both give same output, order of lines may differ :



chr11 5933549 5933577 29 + hsa_smt_028592 682
chr2 131291172 131291197 26 - hsa_smt_000096 648





share|improve this answer


























  • Thank you very much, is it possible to have a small explanation about each step of awk?

    – K Geles
    Nov 16 '18 at 9:29








  • 2





    I added some explanations, do not hesitate to comment if you have any questions @K.Gele

    – Corentin Limier
    Nov 16 '18 at 9:48













  • It is much informative! Thank you!

    – K Geles
    Nov 16 '18 at 12:09













  • I have another question regarding joining two files but it will be marked as a duplicate. How should I proceed?

    – K Geles
    Nov 28 '18 at 15:31











  • @KGeles if it is set as duplicate that's because you can get your answer already somewhere ?

    – Corentin Limier
    Nov 28 '18 at 16:55
















1














For fun using join :



join -11 -26 -o 2.1,2.2,2.3,2.4,2.5,1.1,1.2 <(awk -F' +|_' '{printf "hsa_smt_%06s %sn", $3, $4}' file1.txt | sort -k1) <(sort -k6 file2.txt)


awk command is formatting the first column of your file1 in same format as file2. Then we join the two files (first we need to sort them both).





awk -F' *|_' -> I split on multiple spaces and "_" character



'{printf "hsa_smt_%06s %sn", $3, $4}' -> I print hsa_smt_< third column as an integer of width 6 > <4th column>. Awk will add leading zeros if needed



sort -k6 file2 or awk ... | sort -k1 I sort the files on 6th or 1st columns



join -11 -26 -o 2.1,2.2,2.3,2.4,2.5,1.1,1.2 I join the files on the first column for file 1 (-11) and 6th column for file 2 (-26) then I order the columns (2.1 -> 1st column of file2, etc...)





Pure awk solution :



awk 'NR==FNR{split($1,s,"_");smt[s[3]+0]=$2;next}{split($6,s,"_");k=s[3]+0}k in smt{print $0, smt[k]}' file1.txt file2.txt


In this solution, I split smt_hsa_xxx and hsa_smt_xxx, convert the 3rd part to integers so I can compare them.





NR==FNR{split($1,s,"_");smt[s[3]+0]=$2;next} this block will be executed only on file1 because of the NR==FNR condition. It splits the first column, convert the smt_hsa id to integer (doing +0 is a classic way to convert string to integer in awk), then stores second column inside an array named smt.



next; prevents executing the other blocks for file1



{split($6,s,"_");k=s[3]+0} this block will split the 6th column of file2, convert the id to integer. We could have used " +|_" as delimiter to avoid splitting columns (see first solution).



k in smt{print $0, smt[k]} will check if the id is a key of smt array. Then it prints line from file2 and second column of file1 (which was stored in smt)





I'm pretty sure that the second solution is more performant but may be greedier in memory.



Both give same output, order of lines may differ :



chr11 5933549 5933577 29 + hsa_smt_028592 682
chr2 131291172 131291197 26 - hsa_smt_000096 648





share|improve this answer


























  • Thank you very much, is it possible to have a small explanation about each step of awk?

    – K Geles
    Nov 16 '18 at 9:29








  • 2





    I added some explanations, do not hesitate to comment if you have any questions @K.Gele

    – Corentin Limier
    Nov 16 '18 at 9:48













  • It is much informative! Thank you!

    – K Geles
    Nov 16 '18 at 12:09













  • I have another question regarding joining two files but it will be marked as a duplicate. How should I proceed?

    – K Geles
    Nov 28 '18 at 15:31











  • @KGeles if it is set as duplicate that's because you can get your answer already somewhere ?

    – Corentin Limier
    Nov 28 '18 at 16:55














1












1








1







For fun using join :



join -11 -26 -o 2.1,2.2,2.3,2.4,2.5,1.1,1.2 <(awk -F' +|_' '{printf "hsa_smt_%06s %sn", $3, $4}' file1.txt | sort -k1) <(sort -k6 file2.txt)


awk command is formatting the first column of your file1 in same format as file2. Then we join the two files (first we need to sort them both).





awk -F' *|_' -> I split on multiple spaces and "_" character



'{printf "hsa_smt_%06s %sn", $3, $4}' -> I print hsa_smt_< third column as an integer of width 6 > <4th column>. Awk will add leading zeros if needed



sort -k6 file2 or awk ... | sort -k1 I sort the files on 6th or 1st columns



join -11 -26 -o 2.1,2.2,2.3,2.4,2.5,1.1,1.2 I join the files on the first column for file 1 (-11) and 6th column for file 2 (-26) then I order the columns (2.1 -> 1st column of file2, etc...)





Pure awk solution :



awk 'NR==FNR{split($1,s,"_");smt[s[3]+0]=$2;next}{split($6,s,"_");k=s[3]+0}k in smt{print $0, smt[k]}' file1.txt file2.txt


In this solution, I split smt_hsa_xxx and hsa_smt_xxx, convert the 3rd part to integers so I can compare them.





NR==FNR{split($1,s,"_");smt[s[3]+0]=$2;next} this block will be executed only on file1 because of the NR==FNR condition. It splits the first column, convert the smt_hsa id to integer (doing +0 is a classic way to convert string to integer in awk), then stores second column inside an array named smt.



next; prevents executing the other blocks for file1



{split($6,s,"_");k=s[3]+0} this block will split the 6th column of file2, convert the id to integer. We could have used " +|_" as delimiter to avoid splitting columns (see first solution).



k in smt{print $0, smt[k]} will check if the id is a key of smt array. Then it prints line from file2 and second column of file1 (which was stored in smt)





I'm pretty sure that the second solution is more performant but may be greedier in memory.



Both give same output, order of lines may differ :



chr11 5933549 5933577 29 + hsa_smt_028592 682
chr2 131291172 131291197 26 - hsa_smt_000096 648





share|improve this answer















For fun using join :



join -11 -26 -o 2.1,2.2,2.3,2.4,2.5,1.1,1.2 <(awk -F' +|_' '{printf "hsa_smt_%06s %sn", $3, $4}' file1.txt | sort -k1) <(sort -k6 file2.txt)


awk command is formatting the first column of your file1 in same format as file2. Then we join the two files (first we need to sort them both).





awk -F' *|_' -> I split on multiple spaces and "_" character



'{printf "hsa_smt_%06s %sn", $3, $4}' -> I print hsa_smt_< third column as an integer of width 6 > <4th column>. Awk will add leading zeros if needed



sort -k6 file2 or awk ... | sort -k1 I sort the files on 6th or 1st columns



join -11 -26 -o 2.1,2.2,2.3,2.4,2.5,1.1,1.2 I join the files on the first column for file 1 (-11) and 6th column for file 2 (-26) then I order the columns (2.1 -> 1st column of file2, etc...)





Pure awk solution :



awk 'NR==FNR{split($1,s,"_");smt[s[3]+0]=$2;next}{split($6,s,"_");k=s[3]+0}k in smt{print $0, smt[k]}' file1.txt file2.txt


In this solution, I split smt_hsa_xxx and hsa_smt_xxx, convert the 3rd part to integers so I can compare them.





NR==FNR{split($1,s,"_");smt[s[3]+0]=$2;next} this block will be executed only on file1 because of the NR==FNR condition. It splits the first column, convert the smt_hsa id to integer (doing +0 is a classic way to convert string to integer in awk), then stores second column inside an array named smt.



next; prevents executing the other blocks for file1



{split($6,s,"_");k=s[3]+0} this block will split the 6th column of file2, convert the id to integer. We could have used " +|_" as delimiter to avoid splitting columns (see first solution).



k in smt{print $0, smt[k]} will check if the id is a key of smt array. Then it prints line from file2 and second column of file1 (which was stored in smt)





I'm pretty sure that the second solution is more performant but may be greedier in memory.



Both give same output, order of lines may differ :



chr11 5933549 5933577 29 + hsa_smt_028592 682
chr2 131291172 131291197 26 - hsa_smt_000096 648






share|improve this answer














share|improve this answer



share|improve this answer








edited Nov 16 '18 at 11:10

























answered Nov 16 '18 at 9:17









Corentin LimierCorentin Limier

2,0511611




2,0511611













  • Thank you very much, is it possible to have a small explanation about each step of awk?

    – K Geles
    Nov 16 '18 at 9:29








  • 2





    I added some explanations, do not hesitate to comment if you have any questions @K.Gele

    – Corentin Limier
    Nov 16 '18 at 9:48













  • It is much informative! Thank you!

    – K Geles
    Nov 16 '18 at 12:09













  • I have another question regarding joining two files but it will be marked as a duplicate. How should I proceed?

    – K Geles
    Nov 28 '18 at 15:31











  • @KGeles if it is set as duplicate that's because you can get your answer already somewhere ?

    – Corentin Limier
    Nov 28 '18 at 16:55



















  • Thank you very much, is it possible to have a small explanation about each step of awk?

    – K Geles
    Nov 16 '18 at 9:29








  • 2





    I added some explanations, do not hesitate to comment if you have any questions @K.Gele

    – Corentin Limier
    Nov 16 '18 at 9:48













  • It is much informative! Thank you!

    – K Geles
    Nov 16 '18 at 12:09













  • I have another question regarding joining two files but it will be marked as a duplicate. How should I proceed?

    – K Geles
    Nov 28 '18 at 15:31











  • @KGeles if it is set as duplicate that's because you can get your answer already somewhere ?

    – Corentin Limier
    Nov 28 '18 at 16:55

















Thank you very much, is it possible to have a small explanation about each step of awk?

– K Geles
Nov 16 '18 at 9:29







Thank you very much, is it possible to have a small explanation about each step of awk?

– K Geles
Nov 16 '18 at 9:29






2




2





I added some explanations, do not hesitate to comment if you have any questions @K.Gele

– Corentin Limier
Nov 16 '18 at 9:48







I added some explanations, do not hesitate to comment if you have any questions @K.Gele

– Corentin Limier
Nov 16 '18 at 9:48















It is much informative! Thank you!

– K Geles
Nov 16 '18 at 12:09







It is much informative! Thank you!

– K Geles
Nov 16 '18 at 12:09















I have another question regarding joining two files but it will be marked as a duplicate. How should I proceed?

– K Geles
Nov 28 '18 at 15:31





I have another question regarding joining two files but it will be marked as a duplicate. How should I proceed?

– K Geles
Nov 28 '18 at 15:31













@KGeles if it is set as duplicate that's because you can get your answer already somewhere ?

– Corentin Limier
Nov 28 '18 at 16:55





@KGeles if it is set as duplicate that's because you can get your answer already somewhere ?

– Corentin Limier
Nov 28 '18 at 16:55




















draft saved

draft discarded




















































Thanks for contributing an answer to Stack Overflow!


  • Please be sure to answer the question. Provide details and share your research!

But avoid



  • Asking for help, clarification, or responding to other answers.

  • Making statements based on opinion; back them up with references or personal experience.


To learn more, see our tips on writing great answers.




draft saved


draft discarded














StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53334229%2fpartially-match-to-words-from-two-different-files-and-extract-rows-sed-awk-grep%23new-answer', 'question_page');
}
);

Post as a guest















Required, but never shown





















































Required, but never shown














Required, but never shown












Required, but never shown







Required, but never shown

































Required, but never shown














Required, but never shown












Required, but never shown







Required, but never shown







Popular posts from this blog

Xamarin.iOS Cant Deploy on Iphone

Glorious Revolution

Dulmage-Mendelsohn matrix decomposition in Python