I am a beginner at using Linux bash for bioinformatics, and I recently ran into an error with this 'awk' command. ChatGPT's suggestions are not helping, and the task is very basic. I have a big file of the human genome and I need to extract the CDS region. This is one of the examples from the file: "CDS 648..2924". The start position is the first number and the end position is the second number. My code:

awk '/CDS/ && /\.\./ { if (match($0, /([0-9]+)\.\.([0-9]+)/, arr)) { print arr[1], arr[2] } }' BRCA1.gb

Every suggestion is appreciated

Note: I know that there are other ways to complete this task, but I need to complete it specifically with 'awk' and 'match'. Thanks a lot! (Picture from the file below)

https://i.sstatic.net/TMqe80wJ.png

  • I don't get the error you mentioned using the awk command you pasted. Commented Jun 1 at 18:03
  • My best guesses are that you're either a) using a gawk-only extension (3rd argument to match()) but not using gawk or b) on Solaris using old, broken awk. FYI ChatGPT usually outputs wrong answers when given software questions. Commented Jun 1 at 18:04
  • @EdMorton Thank you very much for your suggestion, I just tried with gawk and it is working. Wish you all the best. Commented Jun 1 at 18:08
  • @markp-fuso Thank you very much. I just fixed the issue by simply using 'gawk'. Commented Jun 1 at 18:09
  • You're welcome. By the way, your script could be written more concisely and robustly as awk 'match($0, /CDS.*\<([0-9]+)\.\.([0-9]+)/, arr) { print arr[1], arr[2] }' BRCA1.gb Commented Jun 1 at 18:10

1 Answer

The 3-argument form of match is a GNU AWK (gawk) extension, so you cannot use it with mawk or any other awk that lacks it. You can use 3-argument match if you use gawk. Your code

awk '/CDS/ && /\.\./ { if (match($0, /([0-9]+)\.\.([0-9]+)/, arr)) { print arr[1], arr[2] } }' BRCA1.gb

is a valid gawk command, but it could be made more concise. Observe that your if does something for only 1 branch, so you can drop it and fold its condition into the pattern using logical AND, that is

awk '/CDS/ && /\.\./ && match($0, /([0-9]+)\.\.([0-9]+)/, arr) { print arr[1], arr[2] }' BRCA1.gb

Observe that the 2nd regular expression is contained in the regular expression used in match, so you can drop it: if there is no .. in the line, match will not succeed anyway. That is, you could do

awk '/CDS/ && match($0, /([0-9]+)\.\.([0-9]+)/, arr) { print arr[1], arr[2] }' BRCA1.gb

to get the same effect.
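If gawk is not available, the same extraction can probably still be done with awk and match, because the 2-argument form of match is standard: instead of filling an array it sets RSTART and RLENGTH. Below is a minimal sketch of that idea, not a drop-in replacement. The sample line is made up to resemble the excerpt in the question, and the names range and pos are only illustrative.

```shell
# Sample line in the style of the question's excerpt (illustrative, not real data)
sample='     CDS             648..2924'

# 2-argument match() sets RSTART (start offset) and RLENGTH (length) of the match
result=$(printf '%s\n' "$sample" | awk '
/CDS/ && match($0, /[0-9]+\.\.[0-9]+/) {
    range = substr($0, RSTART, RLENGTH)   # the matched text, e.g. "648..2924"
    split(range, pos, "[.][.]")           # pos[1] = start, pos[2] = end
    print pos[1], pos[2]
}')
echo "$result"
```

To run it on the real file, replace the printf pipeline with the file name, as in the commands above. Note that it takes the first start..end pair found on each line containing CDS.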

I acknowledge that you must use match AT ANY PRICE due to the imposed requirements, but I want to note that this could be done a different way if the presence of CDS is a 100% sure way to identify the lines to extract, namely

awk 'BEGIN{FPAT="[0-9]+"}/CDS/{print $1,$2}' BRCA1.gb

Explanation: I inform GNU AWK that fields consist of one or more (+) digits, then for each line containing CDS I print the 1st and 2nd field. Note that FPAT, like 3-argument match, is a gawk extension.
