Sunday, December 15, 2013

CSV Field Cutting

I am trying to speed up a (Solaris 10) shell script which has to deal with probably millions of files and is taking longer to run than required. The script uses grep, cut and tr, and I want to rewrite it using just bash builtins for speed. The main problem I am having is truncating the CSV lines. Depending on a variable, I need to truncate each CSV line to 15, 19 or 20 fields. The fields are often empty. The output line has to contain the original CSV commas (and be terminated with a semi-colon). 

For example, a line might be: 10,1,,,24234,,453,2342453455434,4534.0000,423423,, 4,2,,343,2,0,10,,34234,12,545 

This might need to be truncated to: 
10,1,,,24234,,453,2342453455434,4534.0000,423423,, 4,2,,343,2,0,,; 

I have tried using IFS, arrays and the ${xx:n:m} structures, but nothing works correctly. 
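
The kind of thing I am aiming for with the builtins is roughly this (a sketch only; the names N, line and truncate_csv are purely illustrative, not my real script): 

    # Split the line on commas into an array, keep the first N fields,
    # rejoin them with commas and append the semi-colon.
    truncate_csv() {
        local N=$1 line=$2
        local IFS=','
        local -a f
        read -r -a f <<< "$line"
        printf '%s;\n' "${f[*]:0:N}"   # quoted [*] rejoins with the first IFS char (,)
    }

    truncate_csv 5 '10,1,,,24234,,453,2342453455434,4534.0000'
    # -> 10,1,,,24234;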


I wouldn't say that was a tip about cut. It's more of a fundamental programming consideration that works for all languages, at all times. It is called "Take invariants out of loops!". It's a rainy Sunday morning in Manchester (there is no other kind), which makes it the perfect time for a rant. 

Start with your >> redirect to: ${TARGET_FOLDER}/`basename $OUTPUT_FILE` 

The loop does not change TARGET_FOLDER, OUTPUT_FILE or (pedantically) the behaviour of basename. So at every iteration the whole thing MUST produce the same answer, yet every time round the loop it costs a command substitution and a fork of basename. So do it once, before the first iteration, and use the cheapest possible substitution inside the loop. Hence my use of OUTPUT_PATH. 
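
In shell terms the hoist looks something like this (just a sketch; INPUT_FILE, LINE and the loop body are stand-ins for whatever the real script does): 

    OUTPUT_PATH="${TARGET_FOLDER}/$(basename "$OUTPUT_FILE")"   # invariant: computed once

    while read -r LINE
    do
        # ... the per-line work goes here ...
        echo "$LINE" >> "$OUTPUT_PATH"   # only a cheap variable expansion each time round
    done < "$INPUT_FILE"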

OK, we now have >> $OUTPUT_PATH in the loop. But that is also a kind of invariant. Appending several thousand lines to a file one at a time HAS to have the identical outcome to writing all the lines in one output. So take the >> out of the loop, and put a > $OUTPUT_PATH on the done that closes the while loop. This works for exactly the same reason that the < $INPUT_FILE works: every command inside the loop inherits the streams belonging to the loop itself. 

This also fixes an issue with using >>. What happens if the file already existed? You need to have something to remove or empty the output file before the first append. Using > avoids this. 
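
Put together, the loop now looks something like this (again a sketch with stand-in names): 

    OUTPUT_PATH="${TARGET_FOLDER}/$(basename "$OUTPUT_FILE")"

    while read -r LINE
    do
        echo "$LINE"                          # every command inside inherits the loop's streams
    done < "$INPUT_FILE" > "$OUTPUT_PATH"     # opened once, truncated once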

I also understated the costs of the append. It does do all that stuff with opening the file, finding the last partial block, adding a line, rewriting it, and syncing the file to disk. But there is much more. 

Suppose your TARGET_FOLDER is /home/username/stats/csv_data/rework. Every time you append a single line, the file system scans / for the directory called home, opens it, scans it for the directory called username, opens it, scans that for .. well, you get the idea. Bad on a local disc, terrible on NFS or a SAN. 

OK, the loop contents are now reduced to: 

read A; B=$( echo "$A" | cut ... ); echo "${B};" 

This time, the invariant is pretty much "run cut on every line in the file". The reads, assigns and echoes are only there to handle a line at a time. But cut is a multi-line utility: it does whole files. Or to put it another way, cut itself loops over the records in a file. 

So the reduction this time is to have cut itself do the "loop", and get rid of the while do ... done, and the assigns, and the echoes. Basically, make cut do everything it can do best, then dress it up in a pipeline if you need some further tweaking (in this case, appending ";"). Apart from avoiding the shell loop code, it avoids running thousands of copies of the cut process one after the other. 

So what comes out at the end is one pipeline, two standard commands, and no loop whatsoever. Or to put it the other way up, everything in the original loop was an invariant. 
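
Concretely, something along these lines (a sketch: I am assuming comma-delimited fields and a cut down to 20 of them, with sed doing the dressing-up; swap the 20 for whichever of 15/19/20 your variable says): 

    cut -d, -f1-20 "$INPUT_FILE" | sed 's/$/;/' > "$OUTPUT_PATH"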

I will try this on Monday too. I'm interested to see whether the speed-up is x100 or x1000.
