We are using Spring Batch to process a large CSV file with 500K lines. The processing produces two results. First, each line represents one article object, and with that we have no problems: after a chunk completes, we make an API call with the list of processed articles (1,000 per chunk). The API endpoint can filter out duplicates, so we can process one line at a time.
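For illustration, the batching described above can be sketched in plain Java (outside Spring Batch, which normally does this chunking for us in a chunk-oriented step; `Chunks.partition` is a hypothetical helper, not part of our code or of Spring Batch):

```java
import java.util.ArrayList;
import java.util.List;

public class Chunks {
    /** Splits items into fixed-size batches, mirroring a chunk size of 1,000. */
    static <T> List<List<T>> partition(List<T> items, int size) {
        List<List<T>> batches = new ArrayList<>();
        for (int i = 0; i < items.size(); i += size) {
            // subList is a view; copy it if batches outlive the source list
            batches.add(items.subList(i, Math.min(i + size, items.size())));
        }
        return batches;
    }
}
```

Each resulting batch would correspond to one API call with up to 1,000 articles.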
Each line also has a quantity, and the second result should be the sum of quantities per article identifier, per location:
article_code, article_name, size, color, quantity, location, sublocation
123, Nike Shoes, 32, black, 3, store1, sales floor 1
124, Nike shoes, 34, white, 2, store1, sales floor 1
123, Nike Shoes, 32, black, 5, store1, sales floor 2
123, Nike shoes, 32, black, 5, store1, stock room
124, Nike shoes, 34, white, 7, store2, sales floor
123, Nike shoes, 32, black, 3, store2, sales floor
111, Nike shoes, 37, pink, 5, store2, sales floor
This should result in 3 articles being created and 2 API calls to save stock, one per location (article 123 will have a quantity of 13 at location store1 and 3 at store2).
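The summing we need can be sketched in plain Java as follows (a simplified sketch, not our Spring Batch code; only the article_code, quantity, and location columns are used, and row order does not matter):

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class StockAggregation {
    /** Sums quantities per (location, article_code), regardless of input order. */
    static Map<String, Map<String, Integer>> aggregate(List<String[]> rows) {
        Map<String, Map<String, Integer>> totals = new LinkedHashMap<>();
        for (String[] row : rows) {
            // row: [article_code, quantity, location]
            totals.computeIfAbsent(row[2], loc -> new LinkedHashMap<>())
                  .merge(row[0], Integer.parseInt(row[1]), Integer::sum);
        }
        return totals;
    }

    public static void main(String[] args) {
        // The sample rows from the question, reduced to the three relevant columns
        List<String[]> rows = List.of(
                new String[]{"123", "3", "store1"},
                new String[]{"124", "2", "store1"},
                new String[]{"123", "5", "store1"},
                new String[]{"123", "5", "store1"},
                new String[]{"124", "7", "store2"},
                new String[]{"123", "3", "store2"},
                new String[]{"111", "5", "store2"});
        System.out.println(StockAggregation.aggregate(rows));
        // {store1={123=13, 124=2}, store2={124=7, 123=3, 111=5}}
    }
}
```

Each outer map entry would then become one stock API call for that location.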
Currently we have one step that saves articles over the REST API and, as a side effect, persists the quantities in the DB, and another step that picks up that data from the DB, grouped per location, and makes the API call to save stock.
- What is a good approach for storing data across steps if the data is larger than the StepContext limit?
- Does Spring Batch have any elegant way of summing rows in a CSV file that are not ordered by some criterion?
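For context, one alternative we are considering instead of the DB round-trip is an in-memory accumulator shared between steps. A rough sketch (a hypothetical class, not our current code; in Spring it would be a singleton bean updated by the first step's writer and read by the second step, which sidesteps the ExecutionContext size limit):

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

/**
 * Hypothetical cross-step accumulator: the article-saving step calls add(...)
 * once per CSV line; a later step reads snapshot() to build one stock API
 * call per location. Thread-safe, so it also works with multi-threaded steps.
 */
public class StockAccumulator {
    // key: "location|article_code" -> running total of quantities
    private final ConcurrentHashMap<String, LongAdder> totals = new ConcurrentHashMap<>();

    public void add(String location, String articleCode, long quantity) {
        totals.computeIfAbsent(location + "|" + articleCode, k -> new LongAdder())
              .add(quantity);
    }

    /** Point-in-time copy of the totals, sorted by key for stable iteration. */
    public Map<String, Long> snapshot() {
        Map<String, Long> out = new TreeMap<>();
        totals.forEach((key, adder) -> out.put(key, adder.sum()));
        return out;
    }
}
```

The trade-off versus our current DB-backed approach is that the totals are lost if the JVM dies mid-job, so restartability would need separate handling.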