4

Let's say I have the following file.txt:

foo-a baz
foo bar
foo-c baz

If I run sort file.txt, I get the same output:

foo-a baz
foo bar
foo-c baz

Since (space) is before - in ASCII, I expected the output to be:

foo bar
foo-a baz
foo-c baz

Even if sort looks at entire words, I'd still expect foo to come before both foo-a and foo-c.

I've tried sort -d (dictionary), sort -g (general-numeric) and sort -h (human-numeric) with no success. Is there a way to get the order I want using sort? If not, can another basic utility do it? (I know it would be easy with Python, Perl, Ruby, etc., but I am writing a shell script that needs to be portable.)

3
  • 4
    Does LC_COLLATE=C sort file.txt give you the expected order? Commented Sep 30, 2022 at 23:46
  • 1
    It does. Wow that was fast. Please make your comment an answer so that I can accept it. Cheers! Commented Sep 30, 2022 at 23:49
  • It's pretty well covered in Sort not sorting lines with a pipe '|' in it correctly Commented Oct 1, 2022 at 0:57

2 Answers

5

In the C locale, collation is meant to be based on the order of characters in the ASCII charset, even on the (rare) systems that are not ASCII-based.

In practice, nearly all systems are ASCII-based, and sort implementations in that locale simply compare byte values without caring about which characters those bytes represent.
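As a quick check of the byte values involved, here is a minimal sketch. It relies on the POSIX printf convention that a numeric argument starting with a quote yields the code of the character that follows; the variable names are only illustrative.

```shell
# A leading single quote in a numeric printf argument means
# "use the code of the next character".
space_code=$(LC_ALL=C printf '%d' "' ")
hyphen_code=$(LC_ALL=C printf '%d' "'-")

# Space (32) is lower than hyphen-minus (45), so a byte-wise sort
# places "foo bar" before "foo-a baz".
echo "space=$space_code hyphen=$hyphen_code"
```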

The collation aspect of localisation is controlled by the LC_COLLATE environment variable, though beware that LC_ALL if set takes precedence over all other LC_* variables.
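The precedence rule can be seen with the three-line sample from the question. This is only a sketch: en_US.UTF-8 is an assumed example locale name and may not be installed on a given system, but because LC_ALL is set, the LC_COLLATE value should be ignored either way.

```shell
# Three-line sample from the question.
input=$(printf '%s\n' 'foo-a baz' 'foo bar' 'foo-c baz')

# LC_ALL takes precedence over LC_COLLATE, so this sorts by byte value
# even though LC_COLLATE names a different locale.
sorted=$(printf '%s\n' "$input" | LC_ALL=C LC_COLLATE=en_US.UTF-8 sort)
printf '%s\n' "$sorted"
```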

LC_CTYPE also controls how bytes are decoded into characters on input (and how characters are encoded back into bytes on output). Setting it to C as well helps avoid possible ambiguity if there are non-ASCII characters in the input, and makes decoding/encoding faster (just a pass-through, as hinted above).

And once you set LC_CTYPE to C, you might as well set LC_MESSAGES and all the others too, since error messages in the user's language, for instance, probably couldn't be displayed in ASCII anyway.

So all in all, setting LC_ALL to C fixes all the problems:

$ LC_ALL=C sort file.txt
foo bar
foo-a baz
foo-c baz

More on that at: What does "LC_ALL=C" do?
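For a portable script, one possible approach is to scope the setting to the sort call alone instead of exporting it for the whole script. The helper name sort_bytes below is hypothetical, chosen only for illustration:

```shell
# Hypothetical helper: sort by byte value regardless of the caller's locale.
# The assignment applies only to this one sort invocation, so the rest of
# the script keeps the user's locale for messages and other tools.
sort_bytes() {
    LC_ALL=C sort "$@"
}

result=$(printf '%s\n' 'foo-a baz' 'foo bar' 'foo-c baz' | sort_bytes)
printf '%s\n' "$result"
```

Any arguments are passed through to sort unchanged, so options and file names can be given as usual (for example, sort_bytes file.txt).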

0

Using Raku (formerly known as Perl_6)

Default sorting:

~$ raku -e '.put for lines.sort;'  file


A
Thor is mighty
foo bar
foo-a baz
foo-c baz
thor is mighty
Þor is mighty
þor is mighty
䨝
坽

OR (sort ASCII):

~$ raku -e '.put for lines.sort({ .match(:global, / <:ASCII>+ /) });'  file


坽
䨝
A
Thor is mighty
foo bar
foo-a baz
foo-c baz
þor is mighty
Þor is mighty
thor is mighty

OR (sort Latin script):

~$ raku -e '.put for lines.sort({ .match(:global, /  <:Script<Latin>>+ /) });'  file


坽
䨝
A
Thor is mighty
foo-a baz
foo bar
foo-c baz
thor is mighty
Þor is mighty
þor is mighty

All sorting forms above can be shortened by using the m:g/ … / form, but the .match(:global, / … /) form is more explicit (and helpful) for new users, as the leading . (dot) indicates that the method is called on $_, the topic variable.

Note that in the examples above, any line where the sort criterion is not matched is "passed through" and placed at the head of the output. You can replicate the default output by sorting explicitly on the absence, then the presence, of a particular ASCII/Unicode script, like so (this replicates the default sorting shown at the top):

~$ raku -e '.put for lines.sort({ m:g/ <-:ASCII>+ /, m:g/ <:ASCII>+ /  });'  file


A
Thor is mighty
foo bar
foo-a baz
foo-c baz
thor is mighty
Þor is mighty
þor is mighty
䨝
坽

Sample Input:

foo-a baz
foo bar
foo-c baz

þor is mighty
thor is mighty
Þor is mighty
Thor is mighty

坽
䨝
A

https://docs.raku.org/language/unicode
https://docs.raku.org/routine/sort
https://raku.org
