Partially match to words from two different files and extract rows sed,awk,grep?

I have two different files that I want to partially match one number from the first one with another number from the other one and extract the whole column.

File1:

smt_hsa_3150    932

smt_hsa_28592   682

smt_hsa_5184    657

smt_hsa_430 648

smt_hsa_14100   648

smt_hsa_96 648

File2:

chr11 5933549 5933577 29 + hsa_smt_028592

chr11 45693060 45693086 27 - hsa_smt_000059

chr11 45699803 45699832 30 - hsa_smt_000087

chr2 131291172 131291197 26 - hsa_smt_000096

I need to match smt_hsa_28592 or 28592 with hsa_smt_028592 or 028592.
and then extract to a new file the line from the second file plus the number from 2nd column of the 1st file.

output:

chr11 5933549 5933577 29 + hsa_smt_028592 682   

chr2 131291172 131291197 26 - hsa_smt_000096 648

As I'm new to awk/sed programming I tried first to change the name of the first column of the 1st file from smt_hsa_3150 to hsa_smt_3150, but when I perform

awk '{gsub("smt","hsa")}1'

then I cannot use the same code to change only the second "hsa".
The second problem is how I would be able to match hsa_smt_028592 with smt_has_28592 or smt_hsa_96 with hsa_smt_000096.

asked Nov 16 '18 at 8:43

K Geles

4217

Possible duplicate of Inner join on two text files

– tripleee
Nov 16 '18 at 11:13

add a comment |

I have two different files that I want to partially match one number from the first one with another number from the other one and extract the whole column.

File1:

smt_hsa_3150    932

smt_hsa_28592   682

smt_hsa_5184    657

smt_hsa_430 648

smt_hsa_14100   648

smt_hsa_96 648

File2:

chr11 5933549 5933577 29 + hsa_smt_028592

chr11 45693060 45693086 27 - hsa_smt_000059

chr11 45699803 45699832 30 - hsa_smt_000087

chr2 131291172 131291197 26 - hsa_smt_000096

I need to match smt_hsa_28592 or 28592 with hsa_smt_028592 or 028592.
and then extract to a new file the line from the second file plus the number from 2nd column of the 1st file.

output:

chr11 5933549 5933577 29 + hsa_smt_028592 682   

chr2 131291172 131291197 26 - hsa_smt_000096 648

As I'm new to awk/sed programming I tried first to change the name of the first column of the 1st file from smt_hsa_3150 to hsa_smt_3150, but when I perform

awk '{gsub("smt","hsa")}1'

then I cannot use the same code to change only the second "hsa".
The second problem is how I would be able to match hsa_smt_028592 with smt_has_28592 or smt_hsa_96 with hsa_smt_000096.

asked Nov 16 '18 at 8:43

K Geles

4217

Possible duplicate of Inner join on two text files

– tripleee
Nov 16 '18 at 11:13

add a comment |

I have two different files that I want to partially match one number from the first one with another number from the other one and extract the whole column.

File1:

smt_hsa_3150    932

smt_hsa_28592   682

smt_hsa_5184    657

smt_hsa_430 648

smt_hsa_14100   648

smt_hsa_96 648

File2:

chr11 5933549 5933577 29 + hsa_smt_028592

chr11 45693060 45693086 27 - hsa_smt_000059

chr11 45699803 45699832 30 - hsa_smt_000087

chr2 131291172 131291197 26 - hsa_smt_000096

I need to match smt_hsa_28592 or 28592 with hsa_smt_028592 or 028592.
and then extract to a new file the line from the second file plus the number from 2nd column of the 1st file.

output:

chr11 5933549 5933577 29 + hsa_smt_028592 682   

chr2 131291172 131291197 26 - hsa_smt_000096 648

As I'm new to awk/sed programming I tried first to change the name of the first column of the 1st file from smt_hsa_3150 to hsa_smt_3150, but when I perform

awk '{gsub("smt","hsa")}1'

then I cannot use the same code to change only the second "hsa".
The second problem is how I would be able to match hsa_smt_028592 with smt_has_28592 or smt_hsa_96 with hsa_smt_000096.

asked Nov 16 '18 at 8:43

K Geles

4217

I have two different files that I want to partially match one number from the first one with another number from the other one and extract the whole column.

File1:

smt_hsa_3150    932

smt_hsa_28592   682

smt_hsa_5184    657

smt_hsa_430 648

smt_hsa_14100   648

smt_hsa_96 648

File2:

chr11 5933549 5933577 29 + hsa_smt_028592

chr11 45693060 45693086 27 - hsa_smt_000059

chr11 45699803 45699832 30 - hsa_smt_000087

chr2 131291172 131291197 26 - hsa_smt_000096

I need to match smt_hsa_28592 or 28592 with hsa_smt_028592 or 028592.
and then extract to a new file the line from the second file plus the number from 2nd column of the 1st file.

output:

chr11 5933549 5933577 29 + hsa_smt_028592 682   

chr2 131291172 131291197 26 - hsa_smt_000096 648

As I'm new to awk/sed programming I tried first to change the name of the first column of the 1st file from smt_hsa_3150 to hsa_smt_3150, but when I perform

awk '{gsub("smt","hsa")}1'

then I cannot use the same code to change only the second "hsa".
The second problem is how I would be able to match hsa_smt_028592 with smt_has_28592 or smt_hsa_96 with hsa_smt_000096.

awk sed grep

asked Nov 16 '18 at 8:43

K Geles

4217

asked Nov 16 '18 at 8:43

K Geles

4217

asked Nov 16 '18 at 8:43

K Geles

4217

asked Nov 16 '18 at 8:43

K Geles

4217

asked Nov 16 '18 at 8:43

K Geles

4217

Possible duplicate of Inner join on two text files

– tripleee
Nov 16 '18 at 11:13

add a comment |

Possible duplicate of Inner join on two text files

– tripleee
Nov 16 '18 at 11:13

Possible duplicate of Inner join on two text files

– tripleee
Nov 16 '18 at 11:13

add a comment |

1 Answer
1

active

oldest

votes

For fun using join :

join -11 -26 -o 2.1,2.2,2.3,2.4,2.5,1.1,1.2 <(awk -F' +|_' '{printf "hsa_smt_%06s %sn", $3, $4}' file1.txt | sort -k1) <(sort -k6 file2.txt)

awk command is formatting the first column of your file1 in same format as file2. Then we join the two files (first we need to sort them both).

awk -F' *|_' -> I split on multiple spaces and "_" character

'{printf "hsa_smt_%06s %sn", $3, $4}' -> I print hsa_smt_< third column as an integer of width 6 > <4th column>. Awk will add leading zeros if needed

sort -k6 file2 or awk ... | sort -k1 I sort the files on 6th or 1st columns

join -11 -26 -o 2.1,2.2,2.3,2.4,2.5,1.1,1.2 I join the files on the first column for file 1 (-11) and 6th column for file 2 (-26) then I order the columns (2.1 -> 1st column of file2, etc...)

Pure awk solution :

awk 'NR==FNR{split($1,s,"_");smt[s[3]+0]=$2;next}{split($6,s,"_");k=s[3]+0}k in smt{print $0, smt[k]}' file1.txt file2.txt

In this solution, I split smt_hsa_xxx and hsa_smt_xxx, convert the 3rd part to integers so I can compare them.

NR==FNR{split($1,s,"_");smt[s[3]+0]=$2;next} this block will be executed only on file1 because of the NR==FNR condition. It splits the first column, convert the smt_hsa id to integer (doing +0 is a classic way to convert string to integer in awk), then stores second column inside an array named smt.

next; prevents executing the other blocks for file1

{split($6,s,"_");k=s[3]+0} this block will split the 6th column of file2, convert the id to integer. We could have used " +|_" as delimiter to avoid splitting columns (see first solution).

k in smt{print $0, smt[k]} will check if the id is a key of smt array. Then it prints line from file2 and second column of file1 (which was stored in smt)

I'm pretty sure that the second solution is more performant but may be greedier in memory.

Both give same output, order of lines may differ :

chr11 5933549 5933577 29 + hsa_smt_028592 682

chr2 131291172 131291197 26 - hsa_smt_000096 648

edited Nov 16 '18 at 11:10

answered Nov 16 '18 at 9:17

Corentin Limier

2,0511611

Thank you very much, is it possible to have a small explanation about each step of awk?

– K Geles
Nov 16 '18 at 9:29

2

I added some explanations, do not hesitate to comment if you have any questions @K.Gele

– Corentin Limier
Nov 16 '18 at 9:48

It is much informative! Thank you!

– K Geles
Nov 16 '18 at 12:09

I have another question regarding joining two files but it will be marked as a duplicate. How should I proceed?

– K Geles
Nov 28 '18 at 15:31

@KGeles if it is set as duplicate that's because you can get your answer already somewhere ?

– Corentin Limier
Nov 28 '18 at 16:55

add a comment |

Your Answer

StackExchange.ifUsing("editor", function () {
StackExchange.using("externalEditor", function () {
StackExchange.using("snippets", function () {
StackExchange.snippets.init();
});
});
}, "code-snippets");

StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "1"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});

function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});

}
});

draft saved

draft discarded

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53334229%2fpartially-match-to-words-from-two-different-files-and-extract-rows-sed-awk-grep%23new-answer', 'question_page');
}
);

Post as a guest

Name

Required, but never shown

1 Answer
1

active

oldest

votes

1 Answer
1

active

oldest

votes

For fun using join :

join -11 -26 -o 2.1,2.2,2.3,2.4,2.5,1.1,1.2 <(awk -F' +|_' '{printf "hsa_smt_%06s %sn", $3, $4}' file1.txt | sort -k1) <(sort -k6 file2.txt)

awk command is formatting the first column of your file1 in same format as file2. Then we join the two files (first we need to sort them both).

awk -F' *|_' -> I split on multiple spaces and "_" character

'{printf "hsa_smt_%06s %sn", $3, $4}' -> I print hsa_smt_< third column as an integer of width 6 > <4th column>. Awk will add leading zeros if needed

sort -k6 file2 or awk ... | sort -k1 I sort the files on 6th or 1st columns

join -11 -26 -o 2.1,2.2,2.3,2.4,2.5,1.1,1.2 I join the files on the first column for file 1 (-11) and 6th column for file 2 (-26) then I order the columns (2.1 -> 1st column of file2, etc...)

Pure awk solution :

awk 'NR==FNR{split($1,s,"_");smt[s[3]+0]=$2;next}{split($6,s,"_");k=s[3]+0}k in smt{print $0, smt[k]}' file1.txt file2.txt

In this solution, I split smt_hsa_xxx and hsa_smt_xxx, convert the 3rd part to integers so I can compare them.

next; prevents executing the other blocks for file1

{split($6,s,"_");k=s[3]+0} this block will split the 6th column of file2, convert the id to integer. We could have used " +|_" as delimiter to avoid splitting columns (see first solution).

k in smt{print $0, smt[k]} will check if the id is a key of smt array. Then it prints line from file2 and second column of file1 (which was stored in smt)

I'm pretty sure that the second solution is more performant but may be greedier in memory.

Both give same output, order of lines may differ :

chr11 5933549 5933577 29 + hsa_smt_028592 682

chr2 131291172 131291197 26 - hsa_smt_000096 648

edited Nov 16 '18 at 11:10

answered Nov 16 '18 at 9:17

Corentin Limier

2,0511611

Thank you very much, is it possible to have a small explanation about each step of awk?

– K Geles
Nov 16 '18 at 9:29

2

I added some explanations, do not hesitate to comment if you have any questions @K.Gele

– Corentin Limier
Nov 16 '18 at 9:48

It is much informative! Thank you!

– K Geles
Nov 16 '18 at 12:09

I have another question regarding joining two files but it will be marked as a duplicate. How should I proceed?

– K Geles
Nov 28 '18 at 15:31

@KGeles if it is set as duplicate that's because you can get your answer already somewhere ?

– Corentin Limier
Nov 28 '18 at 16:55

add a comment |

For fun using join :

join -11 -26 -o 2.1,2.2,2.3,2.4,2.5,1.1,1.2 <(awk -F' +|_' '{printf "hsa_smt_%06s %sn", $3, $4}' file1.txt | sort -k1) <(sort -k6 file2.txt)

awk command is formatting the first column of your file1 in same format as file2. Then we join the two files (first we need to sort them both).

awk -F' *|_' -> I split on multiple spaces and "_" character

'{printf "hsa_smt_%06s %sn", $3, $4}' -> I print hsa_smt_< third column as an integer of width 6 > <4th column>. Awk will add leading zeros if needed

sort -k6 file2 or awk ... | sort -k1 I sort the files on 6th or 1st columns

join -11 -26 -o 2.1,2.2,2.3,2.4,2.5,1.1,1.2 I join the files on the first column for file 1 (-11) and 6th column for file 2 (-26) then I order the columns (2.1 -> 1st column of file2, etc...)

Pure awk solution :

awk 'NR==FNR{split($1,s,"_");smt[s[3]+0]=$2;next}{split($6,s,"_");k=s[3]+0}k in smt{print $0, smt[k]}' file1.txt file2.txt

In this solution, I split smt_hsa_xxx and hsa_smt_xxx, convert the 3rd part to integers so I can compare them.

next; prevents executing the other blocks for file1

{split($6,s,"_");k=s[3]+0} this block will split the 6th column of file2, convert the id to integer. We could have used " +|_" as delimiter to avoid splitting columns (see first solution).

k in smt{print $0, smt[k]} will check if the id is a key of smt array. Then it prints line from file2 and second column of file1 (which was stored in smt)

I'm pretty sure that the second solution is more performant but may be greedier in memory.

Both give same output, order of lines may differ :

chr11 5933549 5933577 29 + hsa_smt_028592 682

chr2 131291172 131291197 26 - hsa_smt_000096 648

edited Nov 16 '18 at 11:10

answered Nov 16 '18 at 9:17

Corentin Limier

2,0511611

Thank you very much, is it possible to have a small explanation about each step of awk?

– K Geles
Nov 16 '18 at 9:29

2

I added some explanations, do not hesitate to comment if you have any questions @K.Gele

– Corentin Limier
Nov 16 '18 at 9:48

It is much informative! Thank you!

– K Geles
Nov 16 '18 at 12:09

I have another question regarding joining two files but it will be marked as a duplicate. How should I proceed?

– K Geles
Nov 28 '18 at 15:31

@KGeles if it is set as duplicate that's because you can get your answer already somewhere ?

– Corentin Limier
Nov 28 '18 at 16:55

add a comment |

For fun using join :

join -11 -26 -o 2.1,2.2,2.3,2.4,2.5,1.1,1.2 <(awk -F' +|_' '{printf "hsa_smt_%06s %sn", $3, $4}' file1.txt | sort -k1) <(sort -k6 file2.txt)

awk command is formatting the first column of your file1 in same format as file2. Then we join the two files (first we need to sort them both).

awk -F' *|_' -> I split on multiple spaces and "_" character

'{printf "hsa_smt_%06s %sn", $3, $4}' -> I print hsa_smt_< third column as an integer of width 6 > <4th column>. Awk will add leading zeros if needed

sort -k6 file2 or awk ... | sort -k1 I sort the files on 6th or 1st columns

join -11 -26 -o 2.1,2.2,2.3,2.4,2.5,1.1,1.2 I join the files on the first column for file 1 (-11) and 6th column for file 2 (-26) then I order the columns (2.1 -> 1st column of file2, etc...)

Pure awk solution :

awk 'NR==FNR{split($1,s,"_");smt[s[3]+0]=$2;next}{split($6,s,"_");k=s[3]+0}k in smt{print $0, smt[k]}' file1.txt file2.txt

In this solution, I split smt_hsa_xxx and hsa_smt_xxx, convert the 3rd part to integers so I can compare them.

next; prevents executing the other blocks for file1

{split($6,s,"_");k=s[3]+0} this block will split the 6th column of file2, convert the id to integer. We could have used " +|_" as delimiter to avoid splitting columns (see first solution).

k in smt{print $0, smt[k]} will check if the id is a key of smt array. Then it prints line from file2 and second column of file1 (which was stored in smt)

I'm pretty sure that the second solution is more performant but may be greedier in memory.

Both give same output, order of lines may differ :

chr11 5933549 5933577 29 + hsa_smt_028592 682

chr2 131291172 131291197 26 - hsa_smt_000096 648

edited Nov 16 '18 at 11:10

answered Nov 16 '18 at 9:17

Corentin Limier

2,0511611

For fun using join :

join -11 -26 -o 2.1,2.2,2.3,2.4,2.5,1.1,1.2 <(awk -F' +|_' '{printf "hsa_smt_%06s %sn", $3, $4}' file1.txt | sort -k1) <(sort -k6 file2.txt)

awk command is formatting the first column of your file1 in same format as file2. Then we join the two files (first we need to sort them both).

awk -F' *|_' -> I split on multiple spaces and "_" character

'{printf "hsa_smt_%06s %sn", $3, $4}' -> I print hsa_smt_< third column as an integer of width 6 > <4th column>. Awk will add leading zeros if needed

sort -k6 file2 or awk ... | sort -k1 I sort the files on 6th or 1st columns

join -11 -26 -o 2.1,2.2,2.3,2.4,2.5,1.1,1.2 I join the files on the first column for file 1 (-11) and 6th column for file 2 (-26) then I order the columns (2.1 -> 1st column of file2, etc...)

Pure awk solution :

awk 'NR==FNR{split($1,s,"_");smt[s[3]+0]=$2;next}{split($6,s,"_");k=s[3]+0}k in smt{print $0, smt[k]}' file1.txt file2.txt

In this solution, I split smt_hsa_xxx and hsa_smt_xxx, convert the 3rd part to integers so I can compare them.

next; prevents executing the other blocks for file1

{split($6,s,"_");k=s[3]+0} this block will split the 6th column of file2, convert the id to integer. We could have used " +|_" as delimiter to avoid splitting columns (see first solution).

k in smt{print $0, smt[k]} will check if the id is a key of smt array. Then it prints line from file2 and second column of file1 (which was stored in smt)

I'm pretty sure that the second solution is more performant but may be greedier in memory.

Both give same output, order of lines may differ :

chr11 5933549 5933577 29 + hsa_smt_028592 682

chr2 131291172 131291197 26 - hsa_smt_000096 648

edited Nov 16 '18 at 11:10

answered Nov 16 '18 at 9:17

Corentin Limier

2,0511611

edited Nov 16 '18 at 11:10

answered Nov 16 '18 at 9:17

Corentin Limier

2,0511611

answered Nov 16 '18 at 9:17

Corentin Limier

2,0511611

answered Nov 16 '18 at 9:17

Corentin Limier

2,0511611

Thank you very much, is it possible to have a small explanation about each step of awk?

– K Geles
Nov 16 '18 at 9:29

2

I added some explanations, do not hesitate to comment if you have any questions @K.Gele

– Corentin Limier
Nov 16 '18 at 9:48

It is much informative! Thank you!

– K Geles
Nov 16 '18 at 12:09

I have another question regarding joining two files but it will be marked as a duplicate. How should I proceed?

– K Geles
Nov 28 '18 at 15:31

@KGeles if it is set as duplicate that's because you can get your answer already somewhere ?

– Corentin Limier
Nov 28 '18 at 16:55

add a comment |

Thank you very much, is it possible to have a small explanation about each step of awk?

– K Geles
Nov 16 '18 at 9:29

2

I added some explanations, do not hesitate to comment if you have any questions @K.Gele

– Corentin Limier
Nov 16 '18 at 9:48

It is much informative! Thank you!

– K Geles
Nov 16 '18 at 12:09

I have another question regarding joining two files but it will be marked as a duplicate. How should I proceed?

– K Geles
Nov 28 '18 at 15:31

@KGeles if it is set as duplicate that's because you can get your answer already somewhere ?

– Corentin Limier
Nov 28 '18 at 16:55

Thank you very much, is it possible to have a small explanation about each step of awk?

– K Geles
Nov 16 '18 at 9:29

I added some explanations, do not hesitate to comment if you have any questions @K.Gele

– Corentin Limier
Nov 16 '18 at 9:48

It is much informative! Thank you!

– K Geles
Nov 16 '18 at 12:09

I have another question regarding joining two files but it will be marked as a duplicate. How should I proceed?

– K Geles
Nov 28 '18 at 15:31

@KGeles if it is set as duplicate that's because you can get your answer already somewhere ?

– Corentin Limier
Nov 28 '18 at 16:55

add a comment |

draft saved

draft discarded

Thanks for contributing an answer to Stack Overflow!

Please be sure to answer the question. Provide details and share your research!

But avoid …

Asking for help, clarification, or responding to other answers.

Making statements based on opinion; back them up with references or personal experience.

To learn more, see our tips on writing great answers.

draft saved

draft discarded

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Name

Required, but never shown

Name

Required, but never shown

This page is only for reference, If you need detailed information, please check here

搜尋此網誌

Vfrdtyky