I have a JSON array like the one below:

[{
        "id": "1",
        "teams": [{
                "name": "barca"
            },
            {
                "name": "real"
            }
        ]
    },
    {
        "id": "2"
    },
    {
        "id": "3",
        "teams": [{
                "name": "atletico"
            },
            {
                "name": "cz"
            }
        ]
    }
]

My aimed POJO is

class Team {
    int id;
    String name;
}

Meaning, for each "team" I want to create a new object, like:

new Team(1,barca)
new Team(1,real)
new Team(2,null)
new Team(3,atletico)
...

Which I believe I did with a custom deserializer like the one below:

            JsonNode rootArray = jsonParser.readValueAsTree();
            for (JsonNode root : rootArray) {
                // asInt()/asText() return the raw values; toString() would keep the JSON quotes
                int id = root.get("id").asInt();
                JsonNode teamsNodeArray = root.get("teams");
                if (teamsNodeArray != null) {
                    for (JsonNode teamNode : teamsNodeArray) {
                        String name = teamNode.get("name").asText();
                        teamList.add(new Team(id, name));
                    }
                } else {
                    teamList.add(new Team(id, null));
                }
            }

Considering I am getting 750k records, having two nested for loops is, I believe, making the code way slower than it should be. It takes ~7 min.

My question is, could you please enlighten me if there is any better way to do this?

PS: I have checked many stackoverflow threads for this, could not find anything that fits so far.

Thank you in advance.

1 Answer


Do not parse the data yourself; use automatic de/serialization whenever possible.

Using Jackson, it could be as simple as:

MyData myData = new ObjectMapper().readValue(rawData, MyData.class);

For your specific example, we generate a really big input (10M rows):

$ head big.json 
[{"id": 1593, "group": "6141", "teams": [{"id": 10502, "name": "10680"}, {"id": 16435, "name": "18351"}]}
,{"id": 28478, "group": "3142", "teams": [{"id": 30951, "name": "3839"}, {"id": 25310, "name": "19839"}]}
,{"id": 29810, "group": "8889", "teams": [{"id": 5586, "name": "8825"}, {"id": 27202, "name": "7335"}]}
...
$ wc -l big.json 
10000000 big.json

Then, define classes matching your data model (e.g.):

public static class Team {
    public int id;
    public String name;
}

public static class Group {
    public int id;
    public String group;
    public List<Team> teams;
}
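Note that plain public fields and a default constructor are all Jackson needs here; rows with missing fields (like the "teams"-less object with id 2 in the question) simply bind to null. As an optional hardening, not something the example above requires, the classes can also be told to ignore unknown fields. A minimal sketch (class names `BindDemo`/`Group`/`Team` are mine):

```java
import com.fasterxml.jackson.annotation.JsonIgnoreProperties;
import com.fasterxml.jackson.databind.ObjectMapper;

import java.util.List;

public class BindDemo {

    @JsonIgnoreProperties(ignoreUnknown = true)
    public static class Team {
        public int id;
        public String name;
    }

    @JsonIgnoreProperties(ignoreUnknown = true)
    public static class Group {
        public int id;
        public String group;
        public List<Team> teams;
    }

    public static void main(String... args) throws Exception {
        // A row with a missing "teams" and an unknown "extra" field still binds:
        // "teams" is simply left null, "extra" is skipped.
        String json = "{\"id\": 2, \"group\": \"x\", \"extra\": true}";
        Group g = new ObjectMapper().readValue(json, Group.class);
        System.out.println(g.id + " " + g.teams);
    }
}
```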

Now you can read the data directly:

List<Group> xs = new ObjectMapper()
                   .readValue(
                       new File(".../big.json"),
                       new TypeReference<List<Group>>() {});

A complete example could be:

public static void main(String... args) throws IOException {

    long t0 = System.currentTimeMillis();

    List<Group> xs = new ObjectMapper().readValue(new File("/home/josejuan/tmp/1/big.json"), new TypeReference<List<Group>>() {});

    long t1 = System.currentTimeMillis();

    // test: add all group id
    long groupIds = xs.stream().mapToLong(x -> x.id).sum();

    long t2 = System.currentTimeMillis();

    System.out.printf("Group id sum := %d, Read time := %d mS, Sum time = %d mS%n", groupIds, t1 - t0, t2 - t1);
}

With output:

Group id sum := 163827035542, Read time := 10710 mS, Sum time = 74 mS

Only 11 seconds to parse 10M rows.

To check data and compare performance, we can read directly from disk:

$ perl -n -e 'print "$1\n" if /"id": ([0-9]+), "group/' big.json | time awk '{s+=$1}END{print s}'
163827035542
4.96user

Taking 5 seconds (so the Java code is only about twice as slow).
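If even the materialized List<Group> is too big for the heap, Jackson's streaming API (JsonParser/JsonToken) can walk the tokens once without building a tree or binding objects. A minimal sketch for the question's exact schema; it assumes "id" only appears in group objects and "name" only inside team objects, and the class and method names are mine:

```java
import com.fasterxml.jackson.core.JsonFactory;
import com.fasterxml.jackson.core.JsonParser;
import com.fasterxml.jackson.core.JsonToken;

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class StreamingFlatten {

    // Emit one "id:name" pair per team, and "id:null" for groups
    // that have no "teams" array, in a single pass over the tokens.
    public static List<String> flatten(JsonParser p) throws IOException {
        List<String> out = new ArrayList<>();
        String id = null;
        int teamsSeen = 0;
        JsonToken t;
        while ((t = p.nextToken()) != null) {
            if (t != JsonToken.FIELD_NAME) continue;
            String field = p.getCurrentName();
            if ("id".equals(field)) {
                // previous group ended without any team -> emit the null pair
                if (id != null && teamsSeen == 0) out.add(id + ":null");
                p.nextToken();               // advance to the id value
                id = p.getText();
                teamsSeen = 0;
            } else if ("name".equals(field)) {
                p.nextToken();               // advance to the name value
                out.add(id + ":" + p.getText());
                teamsSeen++;
            }
        }
        if (id != null && teamsSeen == 0) out.add(id + ":null");
        return out;
    }

    public static void main(String... args) throws IOException {
        String json = "[{\"id\":\"1\",\"teams\":[{\"name\":\"barca\"},{\"name\":\"real\"}]},"
                + "{\"id\":\"2\"},"
                + "{\"id\":\"3\",\"teams\":[{\"name\":\"atletico\"},{\"name\":\"cz\"}]}]";
        try (JsonParser p = new JsonFactory().createParser(json)) {
            flatten(p).forEach(System.out::println);
        }
    }
}
```

For the question's sample this prints 1:barca, 1:real, 2:null, 3:atletico, 3:cz, using constant memory regardless of input size. For a file, pass new File(...) to createParser instead of a String.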

The remaining, non-performance problem of transforming the data can be solved in many ways depending on how you want to use the information. For example, collecting all the teams can be done like this:

List<Team> teams = xs.stream()
                     .flatMap(x -> x.teams.stream())
                     .collect(toList());

Map<Integer, Team> uniqTeams = xs.stream()
                                 .flatMap(x -> x.teams.stream())
                                 .collect(toMap(
                                      x -> x.id,
                                      x -> x,
                                      (a, b) -> a));

6 Comments

Hi... I thought about this, but this solution doesn't give me the "flattening" I want. If I wanted it this way I could just use the annotations... My point is to create new instances from the inner array.
Hi @Bleach, I can't quite understand the structure of your data (e.g. the name field does not appear in the JSON) nor how you need to aggregate it. In any case, once you have the data, it should not be difficult; the 7-minute performance loss is caused by how you parse the JSON (not by the loop).
Hi, I am sorry I made a typo in the source json. Now I edited it, and name is there. So what you're saying is that the performance loss is inevitable in this case?
Both problems, performance ("Only 11 seconds to parse 10M rows") and transformation, have been solved. Take your time to understand the answer and try to apply it to your case :)
Very detailed answer.