
I am working on the following problem. Let's say I have data (for example, RGB image values as integers), one record per line in a file. I want to read 10000 of these lines, build a frame object (an image frame containing 10000 RGB values), and send it to a downstream function in the processing pipeline. Then I want to read the next 10000 lines, build another frame object, and send it down the pipeline as well.

How can I set up this function so that it keeps making frame objects until the end of the file is reached? Is the following the right way to do it? Are there other neat approaches?

class frame_object(object):
    def __init__(self):
        self.line_cnt = 0
        self.buffer = []

    def make_frame(self, line):
        if self.line_cnt < 10000:
            self.buffer.append(line)
            self.line_cnt += 1
        return self.buffer

1 Answer


You could use generators to create a data pipeline like in the following example:

FRAME_SIZE = 10000


def gen_lines(filename):
    # Yield the file one line at a time, stripping the trailing newline.
    with open(filename, "r") as fp:
        for line in fp:
            yield line.rstrip("\n")


def gen_frames(lines):
    # Group incoming lines into frames of FRAME_SIZE lines each.
    count = 0
    frame = []

    for line in lines:
        if count < FRAME_SIZE:
            frame.append(line)
            count += 1

        if count == FRAME_SIZE:
            # Frame is full: pass it downstream and start a new one.
            yield frame
            frame = []
            count = 0

    # Yield the last, possibly partial, frame.
    if count > 0:
        yield frame


def process_frames(frames):
    for frame in frames:
        # do stuff with frame
        print(len(frame))


lines = gen_lines("/path/to/input.file")
frames = gen_frames(lines)
process_frames(frames)

In this way it's easier to see the data pipeline and to hook in different processing or filtering logic. You can learn more about generators and their use in data-processing pipelines here.
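For example, a filtering step can slot between gen_lines and gen_frames without touching either of them. This is only a sketch; gen_valid_lines and its skip-blank-lines rule are placeholders for whatever validation your data actually needs:

def gen_valid_lines(lines):
    # Pass through only lines that look like valid records
    # (placeholder rule: skip blank lines).
    for line in lines:
        if line.strip():
            yield line


lines = gen_lines("/path/to/input.file")
valid_lines = gen_valid_lines(lines)
frames = gen_frames(valid_lines)
process_frames(frames)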


3 Comments

Just wondering, aren't such problems usually solved with message queues?
It really depends on the context of what you're trying to do. If you're processing a locally accessible file and the output is another file (or a set of files), then I think it would be overkill to use a message queue (like RabbitMQ). Also you probably don't want a lot of data to be in the queue because it could cause memory issues in your message broker.
If it's just single-file processing, then yes, this would be overkill :) I've ended up putting whatever data I have into SQLite (or Mongo) for processing, because it's just way more productive than doing everything by hand.
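A rough sketch of that idea using Python's built-in sqlite3 module (the database name, table layout, and one-integer-per-line assumption are made up for illustration):

import sqlite3

conn = sqlite3.connect("pixels.db")
conn.execute("CREATE TABLE IF NOT EXISTS pixels (value INTEGER)")
with open("/path/to/input.file") as fp:
    # Insert each line as one row; executemany accepts a generator of tuples.
    conn.executemany(
        "INSERT INTO pixels (value) VALUES (?)",
        ((int(line),) for line in fp),
    )
conn.commit()
conn.close()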
