I have a JSON array like the one below:

[{
        "id": "1",
        "teams": [{
                "name": "barca"
            },
            {
                "name": "real"
            }
        ]
    },
    {
        "id": "2"
    },
    {
        "id": "3",
        "teams": [{
                "name": "atletico"
            },
            {
                "name": "cz"
            }
        ]
    }
]

My aimed POJO is

class Team {
    int id;
    String name;
}

Meaning, for each "team" I want to create a new object, like:

new Team(1,barca)
new Team(1,real)
new Team(2,null)
new Team(3,atletico)
...

Which I believe I did with a custom deserializer like the one below:

            JsonNode rootArray = jsonParser.readValueAsTree();
            for (JsonNode root : rootArray) {
                // asInt()/asText() return the raw values; toString() would keep the JSON quotes
                int id = root.get("id").asInt();
                JsonNode teamsNodeArray = root.get("teams");
                if (teamsNodeArray != null) {
                    for (JsonNode teamNode : teamsNodeArray) {
                        String name = teamNode.get("name").asText();
                        teamList.add(new Team(id, name));
                    }
                } else {
                    teamList.add(new Team(id, null));
                }
            }

Considering I am getting 750k records, having two nested for loops is, I believe, making the code way slower than it should be. It takes ~7 min.

My question is, could you please enlighten me if there is any better way to do this?

PS: I have checked many stackoverflow threads for this, could not find anything that fits so far.

Thank you in advance.

1 Answer


Do not parse the data yourself; use automatic de/serialization whenever possible.

Using Jackson, it could be as simple as:

MyData myData = new ObjectMapper().readValue(rawData, MyData.class);

For your specific example, we generate a really big input (10M rows):

$ head big.json 
[{"id": 1593, "group": "6141", "teams": [{"id": 10502, "name": "10680"}, {"id": 16435, "name": "18351"}]}
,{"id": 28478, "group": "3142", "teams": [{"id": 30951, "name": "3839"}, {"id": 25310, "name": "19839"}]}
,{"id": 29810, "group": "8889", "teams": [{"id": 5586, "name": "8825"}, {"id": 27202, "name": "7335"}]}
...
$ wc -l big.json 
10000000 big.json

Then, define classes matching your data model (e.g.):

public static class Team {
    public int id;
    public String name;
}

public static class Group {
    public int id;
    public String group;
    public List<Team> teams;
}
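Note that plain public fields and a default constructor are all Jackson needs here; rows with missing fields (like the "teams"-less object with id 2 in the question) simply bind to null. As an optional hardening, not something the example above requires, the classes can also be told to ignore unknown fields. A minimal sketch (class names `BindDemo`/`Group`/`Team` are mine):

```java
import com.fasterxml.jackson.annotation.JsonIgnoreProperties;
import com.fasterxml.jackson.databind.ObjectMapper;

import java.util.List;

public class BindDemo {

    @JsonIgnoreProperties(ignoreUnknown = true)
    public static class Team {
        public int id;
        public String name;
    }

    @JsonIgnoreProperties(ignoreUnknown = true)
    public static class Group {
        public int id;
        public String group;
        public List<Team> teams;
    }

    public static void main(String... args) throws Exception {
        // A row with a missing "teams" and an unknown "extra" field still binds:
        // "teams" is simply left null, "extra" is skipped.
        String json = "{\"id\": 2, \"group\": \"x\", \"extra\": true}";
        Group g = new ObjectMapper().readValue(json, Group.class);
        System.out.println(g.id + " " + g.teams);
    }
}
```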

Now you can read the data directly:

List<Group> xs = new ObjectMapper()
                   .readValue(
                       new File(".../big.json"),
                       new TypeReference<List<Group>>() {});

A complete example could be:

public static void main(String... args) throws IOException {

    long t0 = System.currentTimeMillis();

    List<Group> xs = new ObjectMapper().readValue(new File("/home/josejuan/tmp/1/big.json"), new TypeReference<List<Group>>() {});

    long t1 = System.currentTimeMillis();

    // test: add all group id
    long groupIds = xs.stream().mapToLong(x -> x.id).sum();

    long t2 = System.currentTimeMillis();

    System.out.printf("Group id sum := %d, Read time := %d mS, Sum time = %d mS%n", groupIds, t1 - t0, t2 - t1);
}

With output:

Group id sum := 163827035542, Read time := 10710 mS, Sum time = 74 mS

Only 11 seconds to parse 10M rows.

To check data and compare performance, we can read directly from disk:

$ perl -n -e 'print "$1\n" if /"id": ([0-9]+), "group/' big.json | time awk '{s+=$1}END{print s}'
163827035542
4.96user

Taking 5 seconds (so the Java code is only about twice as slow).
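If even the materialized List<Group> is too big for the heap, Jackson's streaming API (JsonParser/JsonToken) can walk the tokens once without building a tree or binding objects. A minimal sketch for the question's exact schema; it assumes "id" only appears in group objects and "name" only inside team objects, and the class and method names are mine:

```java
import com.fasterxml.jackson.core.JsonFactory;
import com.fasterxml.jackson.core.JsonParser;
import com.fasterxml.jackson.core.JsonToken;

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class StreamingFlatten {

    // Emit one "id:name" pair per team, and "id:null" for groups
    // that have no "teams" array, in a single pass over the tokens.
    public static List<String> flatten(JsonParser p) throws IOException {
        List<String> out = new ArrayList<>();
        String id = null;
        int teamsSeen = 0;
        JsonToken t;
        while ((t = p.nextToken()) != null) {
            if (t != JsonToken.FIELD_NAME) continue;
            String field = p.getCurrentName();
            if ("id".equals(field)) {
                // previous group ended without any team -> emit the null pair
                if (id != null && teamsSeen == 0) out.add(id + ":null");
                p.nextToken();               // advance to the id value
                id = p.getText();
                teamsSeen = 0;
            } else if ("name".equals(field)) {
                p.nextToken();               // advance to the name value
                out.add(id + ":" + p.getText());
                teamsSeen++;
            }
        }
        if (id != null && teamsSeen == 0) out.add(id + ":null");
        return out;
    }

    public static void main(String... args) throws IOException {
        String json = "[{\"id\":\"1\",\"teams\":[{\"name\":\"barca\"},{\"name\":\"real\"}]},"
                + "{\"id\":\"2\"},"
                + "{\"id\":\"3\",\"teams\":[{\"name\":\"atletico\"},{\"name\":\"cz\"}]}]";
        try (JsonParser p = new JsonFactory().createParser(json)) {
            flatten(p).forEach(System.out::println);
        }
    }
}
```

For the question's sample this prints 1:barca, 1:real, 2:null, 3:atletico, 3:cz, using constant memory regardless of input size. For a file, pass new File(...) to createParser instead of a String.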

The remaining, non-performance problem of transforming the data can be solved in many ways depending on how you want to use the information. For example, collecting all the teams can be done like this:

List<Team> teams = xs.stream()
                     .flatMap(x -> x.teams.stream())
                     .collect(toList());

Map<Integer, Team> uniqTeams = xs.stream()
                                 .flatMap(x -> x.teams.stream())
                                 .collect(toMap(
                                      x -> x.id,
                                      x -> x,
                                      (a, b) -> a));

6 Comments

Hi... I thought about this, but this solution doesn't give me the "flattening" I want. If I wanted it this way I could just use the annotations... My point is to create new instances from the inner array.
Hi @Bleach, I can't quite understand the structure of your data (e.g. the name field does not appear in the JSON) nor how you need to aggregate it. In any case, once you have the data, it should not be difficult; the 7-minute performance loss is caused by how you parse the JSON (not by the loop).
Hi, I am sorry I made a typo in the source json. Now I edited it, and name is there. So what you're saying is that the performance loss is inevitable in this case?
Both problems, performance ("Only 11 seconds to parse 10M rows") and transformation, have been solved. Take your time to understand the answer and try to apply it to your case :)
Very detailed answer.