
Let's assume I have a text file with the following structure (name, score):

 a         0
 a         1
 b         0
 c         0
 d         3
 b         2

And so on. My aim is to sum the scores for every name and order them from highest score to lowest score. So in this case, I want the following output:

 d         3
 b         2
 a         1
 c         0

I do not know in advance which names will be in the file.

I was wondering if there is an efficient way to do this. My text file can contain up to 50,000 entries.

The only way I can think of is to start at line 1, remember that name, then scan the whole file to find every occurrence of that name and sum its scores, repeating this for each new name. That looks horribly inefficient, so I was wondering if there is a better way to do this.
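To make the inefficiency concrete, here is a rough sketch of that approach (hypothetical code, just to illustrate the repeated scanning; the filename my_file.txt is only for illustration):

with open('my_file.txt') as f:
    lines = f.readlines()

totals = {}
for line in lines:
    name, _ = line.split()
    if name in totals:
        continue  # already summed on an earlier pass
    # full rescan of the file for every new name -> O(n^2) overall
    totals[name] = sum(int(s) for n, s in (l.split() for l in lines) if n == name)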

5 Comments

  • Any attempts from your side? Commented Dec 4, 2015 at 11:24
  • @AvinashRaj As stated in the question, I know a way to do it, but I was looking for a better solution. I can add some pseudocode to my question if you want. Commented Dec 4, 2015 at 11:29
  • You can use sorted(split) with the same key (being the letters) Commented Dec 4, 2015 at 11:30
  • You could use a dict with the name as the key and the score as the value. That can be made slightly neater by using defaultdict(int). Commented Dec 4, 2015 at 11:31
  • @caiohamamura: Nigel did clearly describe the only approach he could think of, and said that he felt it was too inefficient (which it is). Surely he doesn't need to show us the code for that inefficient O(n^2) algorithm? Commented Dec 4, 2015 at 11:43

3 Answers


Read all data into a dictionary:

from collections import defaultdict
from operator import itemgetter

scores = defaultdict(int)
with open('my_file.txt') as fobj:
    for line in fobj:
        name, score = line.split()
        scores[name] += int(score)

and the sorting:

for name, score in sorted(scores.items(), key=itemgetter(1), reverse=True):
    print(name, score)

prints:

d 3
b 2
a 1
c 0

Performance

To compare the performance of this answer with the one from @SvenMarnach, I put each approach into a function. Here fobj is a file opened for reading. I use io.StringIO, so I/O delays should, hopefully, not be part of the measurement:
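For reference, such an in-memory file object could be prepared along these lines (my own setup sketch; the actual data used for the timings is not shown in the answer):

import io

# In-memory file; reading from it keeps disk I/O out of the timings
fobj = io.StringIO('a 0\na 1\nb 0\nc 0\nd 3\nb 2\n')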

from collections import Counter

def counter(fobj):
    scores = Counter()
    fobj.seek(0)
    for line in fobj:
        key, score = line.split()
        scores.update({key: int(score)})
    return scores.most_common()

from collections import defaultdict
from operator import itemgetter

def default(fobj):
    scores = defaultdict(int)
    fobj.seek(0)
    for line in fobj:
        name, score = line.split()
        scores[name] += int(score)
    return sorted(scores.items(), key=itemgetter(1), reverse=True)

Results for collections.Counter:

%timeit counter(fobj)
10000 loops, best of 3: 59.1 µs per loop

Results for collections.defaultdict:

%timeit default(fobj)
10000 loops, best of 3: 15.8 µs per loop

It looks like defaultdict is about four times faster. I would not have guessed that. But when it comes to performance, you need to measure.


5 Comments

@caiohamamura: Fair enough, deleted my comment. I didn't see that Mike is using defaultdict(int) (my solution was to put them in a list and then use sum()).
@Kevin That list-based strategy would be slower and (temporarily) consume more RAM than Mike's approach of simply accumulating the data as it's seen.
@PM2Ring: Yep, because it'll create some lists. I forgot that we can use defaultdict(int).
@Mike ... just to confirm: you could replace key=itemgetter(1) with key=lambda s: s[1] and still get the same output, couldn't you?
@IronFist Yes, that is what itemgetter does. The name is a bit more intuitive than reading the lambda function.
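For what it's worth, a quick check of that equivalence (my own snippet, not from the thread):

from operator import itemgetter

pairs = [('a', 1), ('d', 3), ('b', 2)]
# itemgetter(1) and lambda s: s[1] extract the same sort key
assert sorted(pairs, key=itemgetter(1)) == sorted(pairs, key=lambda s: s[1])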

This is a good use case for collections.Counter:

from collections import Counter

scores = Counter()
with open('my_file') as f:
    for line in f:
        key, score = line.split()
        scores.update({key: int(score)})

for key, score in scores.most_common():
    print(key, score)
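As a side note (my variant, not part of the answer itself): Counter returns 0 for missing keys, so the per-line dict passed to update() can be avoided:

from collections import Counter

scores = Counter()
with open('my_file') as f:
    for line in f:
        key, score = line.split()
        scores[key] += int(score)  # missing keys start at 0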

2 Comments

I guess this would be a little slower than using defaultdict, since you need to put each data item into a dict to add it to the Counter. OTOH, it does make it simple to get the sorted list of accumulated scores.
@PM2Ring: Probably, and defaultdict will be far slower than using a plain dict. Speed only matters for this use case if the file is really huge, and if it is, the bottleneck will be I/O, not CPU.
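To illustrate the plain-dict point from the comment above, a minimal sketch (not code from the thread):

scores = {}
with open('my_file.txt') as f:
    for line in f:
        name, score = line.split()
        # dict.get supplies 0 the first time a name is seen
        scores[name] = scores.get(name, 0) + int(score)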

Pandas can do this fairly easily:

import pandas as pd

# The file is whitespace-separated with no header row
data = pd.read_csv('filename.txt', sep=r'\s+', names=['Name', 'Score'])
result = data.groupby('Name').sum().sort_values('Score', ascending=False)
print(result)
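As a usage note (my addition, with a hypothetical output path): the result can be written back out in the same two-column format with to_csv:

result.to_csv('sorted_scores.txt', sep=' ', header=False)  # hypothetical filename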

