
Let's assume I have a text file with the following structure (name, score):

 a         0
 a         1
 b         0
 c         0
 d         3
 b         2

And so on. My aim is to sum the scores for every name and order them from highest score to lowest score. So in this case, I want the following output:

 d         3
 b         2
 a         1
 c         0

I do not know in advance which names will be in the file.

I was wondering if there is an efficient way to do this. My text file can contain up to 50,000 entries.

The only way I can think of is to start at line 1, remember that name, then scan the whole file to find every occurrence of that name and sum its scores, repeating this for each new name. That looks horribly inefficient, so I was wondering if there is a better way to do this.
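To make the inefficiency concrete, here is a rough sketch of that approach (hypothetical code, just to illustrate the repeated scanning; the filename my_file.txt is only for illustration):

with open('my_file.txt') as f:
    lines = f.readlines()

totals = {}
for line in lines:
    name, _ = line.split()
    if name in totals:
        continue  # already summed on an earlier pass
    # full rescan of the file for every new name -> O(n^2) overall
    totals[name] = sum(int(s) for n, s in (l.split() for l in lines) if n == name)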

5 Comments

  • Any attempts from your side? Commented Dec 4, 2015 at 11:24
  • @AvinashRaj As stated in the question, I know a way to do it, but I was looking for a better solution. I can add some pseudocode to my question if you want. Commented Dec 4, 2015 at 11:29
  • You can use sorted(split) with the same key (being the letters) Commented Dec 4, 2015 at 11:30
  • You could use a dict with the name as the key and the score as the value. That can be made slightly neater by using defaultdict(int). Commented Dec 4, 2015 at 11:31
  • @caiohamamura: Nigel did clearly describe the only approach he could think of, and said that he felt it was too inefficient (which it is). Surely he doesn't need to show us the code for that inefficient O(n^2) algorithm? Commented Dec 4, 2015 at 11:43

3 Answers


Read all data into a dictionary:

from collections import defaultdict
from operator import itemgetter

scores = defaultdict(int)
with open('my_file.txt') as fobj:
    for line in fobj:
        name, score = line.split()
        scores[name] += int(score)

and the sorting:

for name, score in sorted(scores.items(), key=itemgetter(1), reverse=True):
    print(name, score)

prints:

d 3
b 2
a 1
c 0

Performance

To compare the performance of this answer with the one from @SvenMarnach, I put each approach into a function. Here fobj is a file opened for reading. I use io.StringIO, so I/O delays should, hopefully, not be part of the measurement:
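For reference, such an in-memory file object could be prepared along these lines (my own setup sketch; the actual data used for the timings is not shown in the answer):

import io

# In-memory file; reading from it keeps disk I/O out of the timings
fobj = io.StringIO('a 0\na 1\nb 0\nc 0\nd 3\nb 2\n')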

from collections import Counter

def counter(fobj):
    scores = Counter()
    fobj.seek(0)
    for line in fobj:
        key, score = line.split()
        scores.update({key: int(score)})
    return scores.most_common()

from collections import defaultdict
from operator import itemgetter

def default(fobj):
    scores = defaultdict(int)
    fobj.seek(0)
    for line in fobj:
        name, score = line.split()
        scores[name] += int(score)
    return sorted(scores.items(), key=itemgetter(1), reverse=True)

Results for collections.Counter:

%timeit counter(fobj)
10000 loops, best of 3: 59.1 µs per loop

Results for collections.defaultdict:

%timeit default(fobj)
10000 loops, best of 3: 15.8 µs per loop

It looks like defaultdict is about four times faster. I would not have guessed that. But when it comes to performance, you need to measure.


5 Comments

@caiohamamura: Fair enough, deleted my comment. I didn't see that Mike is using defaultdict(int) (my solution was to put them in a list and then use sum()).
@Kevin That list-based strategy would be slower and (temporarily) consume more RAM than Mike's approach of simply accumulating the data as it's seen.
@PM2Ring: Yep, because it'll create some lists. I forgot that we can use defaultdict(int).
@Mike ... just to confirm: you could replace key=itemgetter(1) with key=lambda s: s[1] and still get the same output, couldn't you?
@IronFist Yes, that is what itemgetter does. The name is a bit more intuitive than reading the lambda function.
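For what it's worth, a quick check of that equivalence (my own snippet, not from the thread):

from operator import itemgetter

pairs = [('a', 1), ('d', 3), ('b', 2)]
# itemgetter(1) and lambda s: s[1] extract the same sort key
assert sorted(pairs, key=itemgetter(1)) == sorted(pairs, key=lambda s: s[1])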

This is a good use case for collections.Counter:

from collections import Counter

scores = Counter()
with open('my_file') as f:
    for line in f:
        key, score = line.split()
        scores.update({key: int(score)})

for key, score in scores.most_common():
    print(key, score)
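As a side note (my variant, not part of the answer itself): Counter returns 0 for missing keys, so the per-line dict passed to update() can be avoided:

from collections import Counter

scores = Counter()
with open('my_file') as f:
    for line in f:
        key, score = line.split()
        scores[key] += int(score)  # missing keys start at 0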

2 Comments

I guess this would be a little slower than using defaultdict, since you need to put each data item into a dict to add it to the Counter. OTOH, it does make it simple to get the sorted list of accumulated scores.
@PM2Ring: Probably, and defaultdict will be far slower than using a plain dict. Speed only matters for this use case if the file is really huge, and if it is, the bottleneck will be I/O, not CPU.
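To illustrate the plain-dict point from the comment above, a minimal sketch (not code from the thread):

scores = {}
with open('my_file.txt') as f:
    for line in f:
        name, score = line.split()
        # dict.get supplies 0 the first time a name is seen
        scores[name] = scores.get(name, 0) + int(score)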

Pandas can do this fairly easily:

import pandas as pd

# The file is whitespace-separated with no header row
data = pd.read_csv('filename.txt', sep=r'\s+', names=['Name', 'Score'])
result = data.groupby('Name').sum().sort_values('Score', ascending=False)
print(result)
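As a usage note (my addition, with a hypothetical output path): the result can be written back out in the same two-column format with to_csv:

result.to_csv('sorted_scores.txt', sep=' ', header=False)  # hypothetical filename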

