
I am trying to sum an array of arrays and get the average at the same time. The original data is in JSON form. I have to parse my data into an array of arrays in order to render the graph; the graph does not accept an array of hashes.

I first convert the output with the line below.

ActiveSupport::JSON.decode(@output.first(10).to_json)

And the result of the above action is shown below.

output = 
[{"name"=>"aaa", "job"=>"a", "pay"=> 2, ... }, 
 {"name"=>"zzz", "job"=>"a", "pay"=> 4, ... }, 
 {"name"=>"xxx", "job"=>"a", "pay"=> 6, ... }, 
 {"name"=>"yyy", "job"=>"a", "pay"=> 8, ... },
 {"name"=>"aaa", "job"=>"b", "pay"=> 2, ... }, 
 {"name"=>"zzz", "job"=>"b", "pay"=> 4, ... }, 
 {"name"=>"xxx", "job"=>"b", "pay"=> 6, ... }, 
 {"name"=>"yyy", "job"=>"b", "pay"=> 10, ... }, 
] 

Then I retrieved the job and pay by converting to an array of arrays.

a = []
ActiveSupport::JSON.decode(output.to_json).each { |h|
  a << [h['job'], h['pay']]
}

The result of the above operation is as below.

a = [["a", 2], ["a", 4], ["a", 6], ["a", 8],
     ["b", 2], ["b", 4], ["b", 6], ["b", 10]]

The code below gives me the sum for each job in the form of an array of arrays.

a.inject({}) { |h,(job, data)| h[job] ||= 0; h[job] += data; h }.to_a

And the result is as below

[["a", 20], ["b", 22]]

However, I am trying to get the average for each job. The expected output is below.

[["a", 5], ["b", 5.5]]

I could count how many elements there are for each job and divide each sum by its count, but I was wondering if there is an easier and more efficient way to get the average.
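
For reference, the multi-step version I'd like to avoid would look something like this sketch (variable names are mine):

sums   = a.inject(Hash.new(0)) { |h, (job, pay)| h[job] += pay; h }
counts = a.inject(Hash.new(0)) { |h, (job, _)| h[job] += 1; h }
sums.map { |job, total| [job, total / counts[job].to_f] }
#=> [["a", 5.0], ["b", 5.5]]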

  • Can you create a hash of the format {a: [2, 4, 6, 8], b: [2, 4, 6, 20]}, get the average (more easily) and then convert that into the array of arrays? Commented Jun 15, 2017 at 20:06
  • @SaraTibbetts Yeah, that's what I was thinking at first, and I saw many answered questions with arrays of hashes. But I have to filter out the keys that I want first. There are about 10 keys but I only need two. Commented Jun 15, 2017 at 20:12
  • Can you show a bit of how you are doing the filtering? Commented Jun 15, 2017 at 20:14
  • @SaraTibbetts updated. I will use any method that is most efficient as I am trying to do it on 80K rows of data. Commented Jun 15, 2017 at 20:27
  • How would you like to get the average at the same time as summing? If you were going to do this by hand, you would have to count the elements, add them up, and divide by the count. This is a 3-step process: count, add, divide. You cannot determine the average of anything until you know the sum and the count. Commented Jun 15, 2017 at 20:32

5 Answers

output = [
  {"name"=>"aaa", "job"=>"a", "pay"=> 2 }, 
  {"name"=>"zzz", "job"=>"a", "pay"=> 4 }, 
  {"name"=>"xxx", "job"=>"a", "pay"=> 6 }, 
  {"name"=>"yyy", "job"=>"a", "pay"=> 8 },
  {"name"=>"aaa", "job"=>"b", "pay"=> 2 }, 
  {"name"=>"zzz", "job"=>"b", "pay"=> 4 }, 
  {"name"=>"xxx", "job"=>"b", "pay"=> 6 }, 
  {"name"=>"yyy", "job"=>"b", "pay"=> 10 }, 
]

output.group_by { |obj| obj['job'] }.map do |key, list|
  [key, list.map { |obj| obj['pay'] }.reduce(:+) / list.size.to_f]
end

The group_by method will transform your list into a hash with the following structure:

{"a"=>[{"name"=>"aaa", "job"=>"a", "pay"=>2}, ...], "b"=>[{"name"=>"aaa", "job"=>"b", ...]}

After that, for each pair of that hash, we want to calculate the mean of its 'pay' values, and return a pair [key, mean]. We use a map for that, returning a pair with:

  1. The key itself ("a" or "b").
  2. The mean of the values. Note that each value is itself a list of hashes. To retrieve the pay values, we need to extract the 'pay' entry of each hash; that's what list.map { |obj| obj['pay'] } is used for. Finally, calculate the mean by summing all elements with .reduce(:+) and dividing by the list size as a float.

Not the most efficient solution, but it's practical.
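
On Ruby 2.4 or later, the same idea could also be written with Enumerable#sum instead of reduce(:+), which some may find slightly more readable; a sketch with identical behaviour:

output.group_by { |obj| obj['job'] }.map do |key, list|
  [key, list.sum { |obj| obj['pay'] } / list.size.to_f]
end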


Comparing this answer with @EricDuminil's, here's a benchmark with a list of 8,000,000 elements:

def Wikiti(output)
  output.group_by { |obj| obj['job'] }.map do |key, list|
    [key, list.map { |obj| obj['pay'] }.reduce(:+) / list.size.to_f]
  end
end

def EricDuminil(output)
  count_and_sum = output.each_with_object(Hash.new([0, 0])) do |hash, mem|
    job = hash['job']
    count, sum = mem[job]
    mem[job] = count + 1, sum + hash['pay']
  end
  result = count_and_sum.map do |job, (count, sum)|
    [job, sum / count.to_f]
  end
end

require 'benchmark'

Benchmark.bm do |x|
  x.report('Wikiti') { Wikiti(output) }
  x.report('EricDuminil') { EricDuminil(output) }
end

             user         system      total        real
Wikiti       4.100000    0.020000     4.120000 (  4.130373)
EricDuminil  4.250000    0.000000     4.250000 (  4.272685)
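
(The construction of the benchmark input isn't shown above; one plausible way to build a comparable list from the 8 sample rows would be the line below, though the actual data used may have differed.)

output = output * 1_000_000  # the 8 sample hashes repeated to 8,000,000 elements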

13 Comments

Exactly what I am looking for. Loaded with 3000 lines of data, the graph is generated within 1-2 seconds (localhost). Thank you!
@MelvinCh'ng: 3000 lines of data is nowhere near "huge". Any method would do in this case.
@MelvinCh'ng I've added a benchmark with the answers I considered correct.
I've tried a benchmark with all the answers. They all have similar performances. So it makes sense to pick the clearest and most concise answer, which is this one IMHO.
@engineersmnky I've noticed it too, but only with some input. 10% aren't really worth the extra complexity IMHO. It makes sense that your method is slower, you do provide a bit more information than the other answers.

This method should be reasonably efficient. It creates a temporary hash with job name as key and [count, sum] as value:

output = [{ 'name' => 'aaa', 'job' => 'a', 'pay' => 2 },
          { 'name' => 'zzz', 'job' => 'a', 'pay' => 4 },
          { 'name' => 'xxx', 'job' => 'a', 'pay' => 6 },
          { 'name' => 'yyy', 'job' => 'a', 'pay' => 8 },
          { 'name' => 'aaa', 'job' => 'b', 'pay' => 2 },
          { 'name' => 'zzz', 'job' => 'b', 'pay' => 4 },
          { 'name' => 'xxx', 'job' => 'b', 'pay' => 6 },
          { 'name' => 'yyy', 'job' => 'b', 'pay' => 10 }]

count_and_sum = output.each_with_object(Hash.new([0, 0])) do |hash, mem|
  job = hash['job']
  count, sum = mem[job]
  mem[job] = count + 1, sum + hash['pay']
end
#=> {"a"=>[4, 20], "b"=>[4, 22]}

result = count_and_sum.map do |job, (count, sum)|
  [job, sum / count.to_f]
end
#=> [["a", 5.0], ["b", 5.5]]

It requires 2 passes, but the created objects aren't big. In comparison, calling group_by on a huge array of hashes isn't very efficient.
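
A small equivalent variation: Integer#fdiv does the float division in one call, so the second pass could also read as the sketch below.

result = count_and_sum.map { |job, (count, sum)| [job, sum.fdiv(count)] }
#=> [["a", 5.0], ["b", 5.5]]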



How about this (single-pass iterative average calculation)?

accumulator = Hash.new {|h,k| h[k] = Hash.new(0)}
a.each_with_object(accumulator) do |(k,v),obj|
   obj[k][:count] += 1
   obj[k][:sum] += v
   obj[k][:average] = (obj[k][:sum] / obj[k][:count].to_f)
end
#=> {"a"=>{:count=>4, :sum=>20, :average=>5.0}, 
#     "b"=>{:count=>4, :sum=>22, :average=>5.5}}

Obviously the average is just recalculated on every iteration, but since you asked for the sum and the average at the same time, this is probably as close as you are going to get.

Using your "output" instead looks like

output.each_with_object(accumulator) do |h,obj|
   key = h['job']
   obj[key][:count] += 1
   obj[key][:sum] += h['pay']
   obj[key][:average] = (obj[key][:sum] / obj[key][:count].to_f)
end

#=> {"a"=>{:count=>4, :sum=>20, :average=>5.0}, 
#     "b"=>{:count=>4, :sum=>22, :average=>5.5}}

2 Comments

Nice. Hash.new(0) should be enough for the inner hash of your accumulator.
@EricDuminil Fair, and updated. I usually use the block form, but since the Integer is immutable that makes sense. I have fallen into that trap with an empty String or Hash one too many times, so the block form is just instinctual at this point.

As Sara Tibbetts' comment suggests, my first step would be to convert it like this:

new_a = a.reduce({}){ |memo, item| memo[item[0]] ||= []; memo[item[0]] << item[1]; memo}

which puts it in this format

{"a"=>[2, 4, 6, 8], "b"=>[2, 4, 6, 10]}

You can then use slice! to keep only the keys you want:

new_a.slice!(key1, key2, ...)

Then do another pass to get the final format:

new_a.reduce([]) do |memo, (k,v)|
  avg = v.inject{ |sum, el| sum + el }.to_f / v.size
  memo << [k,avg]
  memo
end
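
As an aside, that final pass could also be written with each_with_object, which avoids having to return memo explicitly; a sketch with the same behaviour:

new_a.each_with_object([]) do |(k, v), memo|
  memo << [k, v.inject(:+).to_f / v.size]   # same average computation as above
end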

1 Comment

Why do you mix reduce and inject? If you always return memo at the end of your reduce, it means you need each_with_object.

I elected to use Enumerable#each_with_object with the object being an array of two hashes, the first to compute totals, the second to count the number of values that are totalled. Each hash is defined with Hash.new(0), zero being the default value. See Hash::new for a fuller explanation. In short, if a hash defined h = Hash.new(0) does not have a key k, h[k] returns 0. (h is not modified.) h[k] += 1 expands to h[k] = h[k] + 1. If h does not have a key k, h[k] on the right of the equality returns 0.[1]

output =
[{"name"=>"aaa", "job"=>"a", "pay"=> 2},
 {"name"=>"zzz", "job"=>"a", "pay"=> 4},
 {"name"=>"xxx", "job"=>"a", "pay"=> 6},
 {"name"=>"yyy", "job"=>"a", "pay"=> 8},
 {"name"=>"aaa", "job"=>"b", "pay"=> 2},
 {"name"=>"zzz", "job"=>"b", "pay"=> 4},
 {"name"=>"xxx", "job"=>"b", "pay"=> 6},
 {"name"=>"yyy", "job"=>"b", "pay"=>10}
]

htot, hnbr = output.each_with_object([Hash.new(0), Hash.new(0)]) do |f,(g,h)|
  s = f["job"]
  g[s] += f["pay"]
  h[s] += 1
end
htot.merge(hnbr) { |k,o,n| o.to_f/n }.to_a
  #=> [["a", 5.0], ["b", 5.5]]

If .to_a at the end is dropped, the hash {"a"=>5.0, "b"=>5.5} is returned. The OP might find that more useful than the array.

I've used the form of Hash#merge that uses a block to determine the values of keys that are present in both hashes being merged.

Note that htot = {"a"=>20, "b"=>22} and hnbr = {"a"=>4, "b"=>4}.

[1] If the reader is wondering why h[k] on the left of = doesn't return zero as well, it's because that is a different method: Hash#[]= versus Hash#[].
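
A quick irb-style illustration of that default-value behaviour:

h = Hash.new(0)
h["a"]        #=> 0   (default value returned; "a" is not stored in h)
h.key?("a")   #=> false
h["a"] += 1   # expands to h["a"] = h["a"] + 1, which does store the key
h             #=> {"a"=>1}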
