SHELLdorado Newsletter 1/2005 - April 30th, 2005

=================================================================
The "SHELLdorado Newsletter" covers UNIX shell script related
topics. To subscribe to this newsletter, leave your e-mail
address at the SHELLdorado home page:

http://www.shelldorado.com/

View previous issues at the following location:

http://www.shelldorado.com/newsletter/

"Heiner's SHELLdorado" is a place for UNIX shell script
programmers providing

Many shell script examples, shell scripting tips & tricks,
a large collection of shell-related links & more...
=================================================================

Contents

o Shell Tip: How to read a file line-by-line
o Shell Tip: Print a line from a file given its line number
o Shell Tip: How to convert upper-case file names to lower-case
o Shell Tip: Speeding up scripts using "xargs"
o Shell Tip: How to avoid "Argument list too long" errors

-----------------------------------------------------------------
>> Shell Tip: How to read a file line-by-line

-----------------------------------------------------------------

Assume you have a large text file, and want to process it
line by line. How could you do it?

    file=/etc/motd
    for line in `cat $file`        # WRONG
    do
        echo "$line"
    done

...is no solution, because the variable "line" will in turn
contain each (whitespace-delimited) *word* of the file, not
each line. The "while" command is a better candidate for
this job:

    file=/etc/motd
    while read line
    do
        echo "$line"
    done < "$file"

Note that the "read" command automatically processes its
input: it removes leading and trailing whitespace from each
line, and concatenates a line ending with "\" with the one
following. The following commands suppress this behaviour:

    file=/etc/motd
    OIFS=$IFS; IFS=         # Change the input field separator
    while read -r line
    do
        echo "$line"
    done < "$file"
    IFS=$OIFS               # Restore the old value

There still is one disadvantage to this loop: it's slow. If
the processing consists of string manipulations, consider
replacing the loop completely e.g. with an AWK script.
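
For example, numbering all lines of a file takes just one
single AWK process, no matter how long the file is (a
minimal sketch; "NR" is AWK's built-in line counter, and
"$file" is set as in the loops above):

    awk '{ print NR ": " $0 }' "$file"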

Portability:
    "read -r" is available in ksh, ksh93, bash, zsh, and the
    POSIX shell, but not in older Bourne shells (sh).


-----------------------------------------------------------------
>> Shell Tip: Print a line from a file given its line number

-----------------------------------------------------------------

Regular expressions can be very powerful, and there are many
tools (like "egrep") that let you use them on any file. But
what if we simply want to get the 5th line of a file? No
elaborate regular expression is required:

    lineno=5
    sed -n "${lineno}p"

prints the 5th line without involving ^.[]*$ or other
meta-characters resembling the noise of a defective serial
interface. "sed -n" means: do not automatically print each
line. "5p" means: print line 5. We have to write
"${lineno}p" here instead of "$linenop", because otherwise
the shell would try to expand a variable named "linenop",
not knowing that "p" is a "sed" command.

This could be improved upon. Assume the input file is
"/usr/dict/words", and consists of 25143 lines. The "sed"
command above would not only dutifully print line 5, but
also continue to read the following 25138 lines, doing what
it was told to do: ignore them. The following command makes
"sed" stop reading after line 5:

    lineno=5
    sed -n "${lineno}{p;q;}"
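
A short usage sketch, assuming the "/usr/dict/words" example
from above (any text file will do):

    file=/usr/dict/words
    lineno=5
    line=`sed -n "${lineno}{p;q;}" "$file"`
    echo "line $lineno is: $line"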

So you think you have a better solution for this problem?
Prove it: send me your suggestion
(heiner.steven@shelldorado.com, closing date: 2005-05-31),
and I'll measure the speed of all contributions on a Linux
and a Solaris system. The fastest (or most elegant) solution
using only POSIX shell commands will be published in the
next SHELLdorado Newsletter.


-----------------------------------------------------------------
>> Shell Tip: How to convert upper-case file names to lower-case

-----------------------------------------------------------------

Admit it: you sometimes copy files from an operating system
whose name ends in *indows. A frequent annoyance is file
names IN ALL UPPER CASE.

The following command renames them to contain only lower
case characters:

    for file in *
    do
        lcase=`echo "$file" | tr '[A-Z]' '[a-z]'`

        # Does the target file exist already? Do not
        # overwrite it:
        [ -f "$lcase" ] && continue

        # Are the old and new names different?
        [ x"$file" = x"$lcase" ] && continue    # no change

        mv "$file" "$lcase"
    done

The KornShell (and ksh93) has the useful "typeset -l"
option, which will automatically convert the contents of a
variable to lower case:

    $ typeset -l lcase=ABCDE
    $ echo "$lcase"
    abcde

Changing the above loop to use "typeset -l" is left as an
exercise for the reader.


-----------------------------------------------------------------
>> Shell Tip: Speeding up scripts using "xargs"

-----------------------------------------------------------------

The essential part of writing fast scripts is avoiding
unnecessary external processes.

    for file in *.txt
    do
        gzip "$file"
    done

is much slower than just

    gzip *.txt

because the former code may need many "gzip" processes for a
task the latter command accomplishes with only one external
process. But how could we build a command line like the one
above when the input files come from a file, or even
standard input? A naive approach could be

    gzip `cat textfiles.list archivefiles.list`

but this command can easily run into an "Argument list too
long" error, and it breaks file names containing embedded
whitespace characters into pieces. A better solution is
using "xargs":

    cat textfiles.list archivefiles.list | xargs gzip

The "xargs" command reads its input line by line, and build
a command line by appending each line to its arguments
(here: "gzip"). Therefore the input

    a.txt
    b.txt
    c.txt

would result in "xargs" executing the command

    gzip a.txt b.txt c.txt

"xargs" also takes care that the resulting command line does
not get too long, and therefore avoids "Argument list too
long" errors.


-----------------------------------------------------------------
>> Shell Tip: How to avoid "Argument list too long" errors

-----------------------------------------------------------------

Oh no, there it is again: the system's spool directory is
almost full (4018 files); old files need to be removed, and
all useful commands only print the dreaded "Argument list
too long":

    $ cd /var/spool/data
    $ ls *
    ls: Argument list too long
    $ rm *
    rm: Argument list too long

So what exactly about the character '*' is too long? Well,
the current shell does the useful work of converting '*' to
a (large) list of file names matching that pattern. This is
not the problem. Afterwards, it tries to execute the command
(e.g. "/bin/ls") with that file list using the system call
execve(2) (or a similar one). This system call imposes a
limit on the maximum number of bytes that can be used for
arguments and environment variables (*), and fails.

It's important to note that the limitation lies with the
system call, not with the shell's internal lists.
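
The actual limit on your system can be queried with the
POSIX "getconf" utility (the value shown is just an example,
matching the footnote below):

    $ getconf ARG_MAX
    131072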

To work around this problem, we'll use shell-internal
functions, or ways to limit the number of files directly
specified as arguments to a command.

Examples:

o Don't specify any arguments, to get the (hopefully) useful
  default:

      $ ls

o Use shell-internal functionality ("echo" and "for" are
  shell-internal commands):

      $ echo *
      file1 file2 [...]

      $ for file in *; do rm "$file"; done   # be careful!

o Use "xargs":

      $ ls | xargs rm   # careful!

      $ find . -type f -size +100000 -print | xargs ...

o Limit the number of arguments for a command, e.g. by
  splitting up the file name pattern (a sketch using
  "xargs -n" follows this list):

      $ ls [a-l]*
      $ ls [m-z]*
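
As mentioned in the last item, "xargs -n" (POSIX) can do the
splitting automatically, running the command repeatedly with
at most the given number of arguments each time (again, be
careful with unusual file names):

    $ ls | xargs -n 100 rm   # at most 100 names per "rm"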

Using these techniques should help you get around the
problem.

---
(*) Parameter ARG_MAX, often 128K (Linux) or 1 or 2 MB
(Solaris).


----------------------------------------------------------------
If you want to comment on this newsletter, have suggestions for
new topics to be covered in one of the next issues, or even want
to submit an article of your own, send an e-mail to

mailto:heiner.steven@shelldorado.com

=================================================================
To unsubscribe, send a mail with the body "unsubscribe" to
newsletter@shelldorado.com
=================================================================