Clustering sounds like rocket science, but it can also be very simple and useful for quite common tasks. Creating an MP3 collection from music CDs is a good example of how to use multiple PCs to do the job faster.

Creating MP3s from music CDs has two major parts: first you rip WAV files from the CDs, and then you convert them to MP3 files. Both steps are time-consuming, but if you have more than one PC, you can speed up the process.

Planning the Directory Structure

Because the plan is to use multiple PCs and the files should end up in one location, that directory must be available to all the participating PCs. NFS is the obvious choice for Linux. In my examples I used /data/mp3 as the root of the MP3 file tree. If you would like to burn your collection to CD-R or DVD+/-R disks, create subdirectories such as vol1 and vol2 for the entire collection. Keep all of the clustering scripts in the root mp3 directory.

/data/mp3
|
-- all
|   |
|   --- album1
|   |
|   --- album2
|   |
|   ...
|
-- vol1
|   |
|   --- album1
|   |
|   --- album2
|
-- vol2

Here /data/mp3 is the MP3 collection root, and all holds every album subdirectory. Each album directory (album1, album2, and so on) contains the contents of one music CD. vol1 contains the first part of the collection to burn to CD/DVD, vol2 the second, and so on.
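
The layout above could be created with a few mkdir calls. This is only a minimal sketch: the function name make_mp3_tree and the default of two volumes are assumptions made for illustration, and exporting the directory over NFS is left out.

```shell
#!/bin/sh
# make_mp3_tree ROOT [VOLUMES]
# Create the collection layout described above: ROOT/all plus ROOT/vol1..volN.
# The function name and the default of two volumes are illustrative assumptions.
make_mp3_tree() {
    root=$1
    volumes=${2:-2}
    mkdir -p "$root/all" || return 1
    n=1
    while [ "$n" -le "$volumes" ]; do
        mkdir -p "$root/vol$n" || return 1
        n=$((n + 1))
    done
}

# Example: make_mp3_tree /data/mp3 2
```

Album directories are not created here because the ripping script makes them as it goes.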

Ripping WAV Files from the Music CDs

Apart from the basic utilities, install the freedbtool package to get track names from freecddb.org. It saves a lot of typing by fetching the track names from the freecddb database. Get the tarball from Freshmeat’s Freedbtool page, extract it, change to the extracted directory, run make discid, and copy discid and freedbtool.py to an appropriate bin/ directory on every PC you plan to use for ripping.

Another tool you might have to install is cdparanoia, although it is part of most major Linux distributions. cdparanoia reads the audio tracks from a music CD and writes them as WAV files. As its name suggests, it is very thorough. If the disk has scratches, cdparanoia might take hours to process it. Good disks go much faster, taking a couple of minutes in a fast drive. Note that the quality and speed of drives vary widely. CD or DVD recorders are usually better than read-only devices.

The ripping script is a fairly simple shell script. It takes one parameter, the name of the album. It creates a new directory and a lock file, calls cdparanoia to do the ripping, calls freedbtool to get the table of contents (the toc file), generates a renaming script from the toc file, runs the renaming script, removes the lock file, and ejects the CD.

#!/bin/sh
# mkwav.sh
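# Quote the album name and stop if the directory cannot be entered;
# otherwise the rip would land in the current directory.
mkdir -p "$1"
cd "$1" || exit 1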
touch lock
cdparanoia -B

# Get only the first version of toc
freedbtool.py get -n1
dos2unix toc

# Generate the renaming script
awk 'BEGIN{FS="="
           print > "toc.sh" }
     /TTITLE/{sub("TTITLE","")
     $1++

# add leading zero if value less than 10
     if ($1 < 10) $1="0"$1

# replace blanks with underscore
     gsub(" ","_",$2)

# escape some special characters
     gsub("\\(","\\\(",$2)
     gsub("\\)","\\\)",$2)
     gsub("\\&","\\\\&",$2)
     gsub("/","\\\\",$2)
     print "mv track"$1".cdda.wav " $1"_"$2 ".wav"}' toc  | tr  \' _ >> toc.sh

# Run the renaming script
sh toc.sh
rm -f lock
eject /dev/cdrom
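
The renaming step can be tried on a single toc line without a CD in the drive. The sketch below is not the author's script: track_name is a hypothetical helper that prints the target file name for one TTITLE line, and it replaces slashes with underscores (an assumption; the script above substitutes a backslash instead).

```shell
#!/bin/sh
# track_name TOC_LINE
# Print the target file name for one "TTITLEn=Title" line of the toc file.
# Hypothetical helper for illustration; not part of the original scripts.
track_name() {
    printf '%s\n' "$1" | awk '
        /^TTITLE[0-9]+=/ {
            line = $0
            sub(/^TTITLE/, "", line)
            eq = index(line, "=")
            num = substr(line, 1, eq - 1) + 1   # toc numbering starts at 0
            title = substr(line, eq + 1)
            gsub("[ /]", "_", title)            # blanks and slashes become underscores
            printf "%02d_%s.wav\n", num, title
        }'
}

# Possible use with a quoted mv, which avoids escaping special characters:
#   mv -- track01.cdda.wav "$(track_name "TTITLE0=Men In Helicopters")"
```

Passing the name to mv in double quotes means parentheses and ampersands in titles need no special treatment.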

If you have more than one CD/DVD drive in any of your PCs, make a copy of the script and change three lines in the copy:

cdparanoia -d /dev/cdrom1 -B
freedbtool.py get -n1 -d "discid /dev/cdrom1"
eject /dev/cdrom1

If your CD/DVD drive has a device file with a different name, make the appropriate changes. The device names should be the same on every PC (the naming conventions vary between distributions and versions).
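
Instead of keeping one copy of the script per drive, the device name could become a parameter. The sketch below only prints the three device-dependent commands for a given device, so you can see what would change without touching a drive. rip_commands is a hypothetical helper, and /dev/cdrom as the default is an assumption taken from the script above.

```shell
#!/bin/sh
# rip_commands [DEVICE]
# Print (not run) the three device-dependent commands from mkwav.sh.
# Hypothetical dry-run helper; the default device is an assumption.
rip_commands() {
    dev=${1:-/dev/cdrom}
    printf 'cdparanoia -d %s -B\n' "$dev"
    printf 'freedbtool.py get -n1 -d "discid %s"\n' "$dev"
    printf 'eject %s\n' "$dev"
}

# Example: rip_commands /dev/cdrom1
```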

Once you have the ripping script in your MP3 collection root, ssh to the first ripping machine, mount the collection root, cd to it, cd to vol1/, insert the first music CD, and start ripping:

sh ../mkwav.sh Artist_Name\:Album_name

You don’t have to use underscores instead of blanks, but it tends to make life easier. Repeat for each PC. When the disk trays open, replace the disks and start the script again with a new album name.

I use KDE Konsole as my X Window terminal. Each window can have multiple tabs representing multiple terminal sessions. I change sessions with Shift-left/right arrow keys. This way, I keep all of my ripping sessions in a single window.

With my fastest CD drive, it took 221 seconds to rip a good-quality, 70-minute music CD. The resulting WAV files took up 738 MB in total, so the network bandwidth requirement was 3.3 MB/s, which would saturate a Gigabit network with about 25 concurrent ripping sessions. However, the human factor is often the bottleneck in this case. If you are able to change disks and write new album names in 20 seconds, one person can keep about 11 concurrent ripping sessions going (221 seconds divided by 20). Hard disk writing speed is also a very likely bottleneck.
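
The arithmetic behind these figures can be reproduced with a few lines of shell and awk. The function names are made up for this sketch; the numbers are the ones quoted above.

```shell
#!/bin/sh
# rip_rate SIZE_MB SECONDS: data rate of one ripping session in MB/s
rip_rate() {
    awk -v mb="$1" -v s="$2" 'BEGIN { printf "%.1f\n", mb / s }'
}

# max_sessions RIP_SECONDS SWAP_SECONDS: how many drives one person can
# keep busy if every disk change takes SWAP_SECONDS
max_sessions() {
    awk -v rip="$1" -v swap="$2" 'BEGIN { printf "%d\n", rip / swap }'
}

rip_rate 738 221       # prints 3.3
max_sessions 221 20    # prints 11
```

With 25 sessions at 3.3 MB/s each, the total is roughly 83 MB/s, which is the practical Gigabit throughput the estimate above implies.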

Converting WAV Files to MP3s

While you are ripping, you can already start to convert WAV files to MP3s. The script I use skips any subdirectory with a lock file in it, so the ripping script must have finished at least one disk before there is anything to convert. You might need to install Lame; the one I use came from Dag’s RPM repository. It should be available from one of your favorite repositories, so use apt-get, yum, or whatever your favorite advanced package manager is to resolve its dependencies and install it on all the PCs you intend to use for MP3 conversion.

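#!/bin/sh
# mkmp3.sh
# Convert the WAV files in every album subdirectory that is not being ripped.
for i in *
do if [ -d "$i" ] && [ ! -f "$i/lock" ]
     then
     cd "$i" || continue
     for j in *.wav
       do
       # Skip files reserved by another node (and an unmatched *.wav pattern)
       if [ ! -f "$j.reserved" ] && [ -f "$j" ]
          then
             touch "$j.reserved"
             echo "At `date` $HOSTNAME starts to convert $j"
             # Remove the WAV file only if the conversion succeeded
             lame "$j" "$(basename "$j" .wav).mp3" >/dev/null 2>&1 && rm -f "$j"
             rm -f "$j.reserved"
          fi
       done
     cd ..
   fi
done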

The script visits every subdirectory that has no lock file and looks at every WAV file in it. If a file is not yet reserved, the script creates a reservation file for it, runs lame, and then removes the reservation file. Because of these reservation files, you can run the script on any number of PCs at the same time.
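
Checking for a reservation file and then creating it are two separate steps, so two nodes could in principle pick the same WAV file at the same moment. One common way to close that gap is to reserve with mkdir, which either creates the directory or fails. This is a sketch of that alternative, not what the script above does; reserve and release are hypothetical names, and you should verify the behavior on your own NFS setup.

```shell
#!/bin/sh
# reserve FILE: succeeds only for the first caller, because mkdir either
# creates FILE.reserved or fails when it already exists.
reserve() {
    mkdir "$1.reserved" 2>/dev/null
}

# release FILE: give the reservation back.
release() {
    rmdir "$1.reserved"
}

# Possible use inside the conversion loop (j is the WAV file name):
#   if reserve "$j"; then
#       lame "$j" "$(basename "$j" .wav).mp3" && rm -f "$j"
#       release "$j"
#   fi
```

All nodes would have to use the same scheme, since a reservation directory is not detected by a test for a regular file.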

To generate Ogg Vorbis files, use oggenc instead of lame. lame uses a bitrate of 128 kbps by default. Add -b bitrate to change that.
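
As an illustration of the two encoder invocations, the sketch below prints the command line for either format without running it. encode_cmd is a hypothetical helper; it relies only on the -b (bitrate in kbps) option of lame and the -b and -o options of oggenc.

```shell
#!/bin/sh
# encode_cmd FORMAT BITRATE WAVFILE
# Print (not run) the encoder command line for one WAV file.
# Hypothetical helper for illustration only.
encode_cmd() {
    base=$(basename "$3" .wav)
    case $1 in
        mp3) printf 'lame -b %s %s %s.mp3\n' "$2" "$3" "$base" ;;
        ogg) printf 'oggenc -b %s -o %s.ogg %s\n' "$2" "$base" "$3" ;;
        *)   echo "unknown format: $1" >&2; return 1 ;;
    esac
}

# Example: encode_cmd mp3 192 01_Men_In_Helicopters.wav
```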

Once you have the script in your MP3 collection root, ssh to the first encoding machine, mount the directory, cd to it, cd to vol1/, and start the encoding script. Alternatively, you can write another script that does the same on every MP3 encoding cluster member.

#!/bin/sh
# mp3cluster.sh
CLUSTERHOSTS="rk2 rk4 rk23"
COLLECTIONHOST=rk14
DATADIR=data

echo
for i in $CLUSTERHOSTS
  do
  ssh $i "mkdir -p /$COLLECTIONHOST/$DATADIR > /dev/null ; \
          mount $COLLECTIONHOST:/$DATADIR /$COLLECTIONHOST/$DATADIR ; \
          cd  /$COLLECTIONHOST/$DATADIR/mp3/$1 ; \
          sh  ../mkmp3.sh ; \
          cd ; \
          umount /$COLLECTIONHOST/$DATADIR" &
done

Note that the first and only parameter to mp3cluster.sh is the subdirectory (vol1/, vol2/, etc.) to process. Change the variables at the beginning to suitable values and fix the paths as well.

Note that I have generated keys with ssh-keygen and added them to all cluster members’ $HOME/.ssh/authorized_keys files so that ssh runs without a password prompt.

This all combines to produce a simple clustering application done by simple shell scripts. That’s not rocket science!

Because each cluster node reads a WAV file and writes the resulting MP3 file over the network, network speed is often the biggest scalability limiting factor. Both files are big relative to the processing time, so network latency is not an issue. Disk reading and writing speeds might be another limiting factor. Processing 738 MB of WAV files took 388 seconds on an AMD Athlon64 3000+ CPU, so the bandwidth requirement was 1.9 MB/s. Gigabit Ethernet should scale up to about 40 similar CPUs.

Room for Improvement

Because 738 MB of WAV files resulted in only 67 MB of MP3 files, and the encoding script erases the source files as it goes, each machine could store its WAV files locally and send only the MP3 files over the network. Avoiding most of the network traffic this way should improve scalability by an order of magnitude. It would, however, tie the ripping and encoding phases to the same machine. Another possibility is to take the album names from the freecddb database as well; changing disks could then be perhaps five times faster. However, the database doesn’t recognize every CD, and its naming conventions vary, so the names could differ from your preferences.

The biggest objection against improvement is, however, the fact that very few of us have more than ten PCs with CD or DVD drives waiting to create an MP3 collection. The existing software should be good enough, but it’s obvious where someone so inclined could improve it.
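
To put numbers on the local-storage idea in this section, the quoted sizes and times can be divided directly. The helper name ratio is made up for this sketch; the figures are the ones given in the article.

```shell
#!/bin/sh
# ratio A B: print A / B with two decimals
ratio() {
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.2f\n", a / b }'
}

ratio 738 67     # WAV-to-MP3 size ratio: prints 11.01
ratio 738 388    # MB/s of WAV input per encoding node: prints 1.90
ratio 67 388     # MB/s of MP3 output per encoding node: prints 0.17
```

Moving only MP3 files over the network cuts the per-node traffic by roughly a factor of eleven, which is where the order-of-magnitude estimate comes from.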

Adding ID3 Tags

Many MP3 players use ID3 tags, which contain various pieces of track information. Fedora Core 5 contains id3lib for manipulating ID3 tags, so your favorite distribution might also have it. The script that adds ID3 tags isn’t very interesting. It just uses the toc file and id3tag to add information to MP3 files.

#!/bin/sh
# mktag.sh
# Stop if the album directory cannot be entered
cd "$1" || exit 1

# Check if toc file exists.
if [ -f toc ]
then

# Skip if already tagged
 if [ -f id3tag.sh ]
  then
   echo $1 has already been id3tagged
  else

# Generate the tagging script
    awk 'BEGIN{FS="="

# Add a blank line at the beginning so that
# some special characters do not prevent the script from running.
# (No apostrophes in these comments: one would end the quoted awk program.)
         print > "id3tag.sh"}
     /DTITLE/{sub("DTITLE=","")
             if (RECORD == ""){
             split($0,arr," / ")
             ARTIST=arr[1]
             RECORD=arr[2]
             }

# DTITLE can be split to two lines, combine them
             else
             RECORD=RECORD $0
             }
     /DYEAR/{YEAR=$2
            }
     /TTITLE/{sub("TTITLE","")

# Check if TTITLE is on two lines
     if (TRACK == $1)
       TITLE=TITLE $2
     else
       TITLE=$2
     TRACK=$1
     $1++
     if ($1 < 10) $1="0"$1
     print "if [ -f "$1"* ] ; then id3tag -a\""ARTIST"\" -A\""RECORD"\" -y"YEAR" -t"$1" -s\""TITLE"\" "$1"*.mp3 ; fi "
     }' toc  >> id3tag.sh
# Run the generated worker script
    sh id3tag.sh
  fi
fi

Running this script is file system-intensive, so there is no sense in trying to clusterize it. It runs quite fast anyway. To make it more versatile, the script takes the subdirectory to process as a parameter. Usually you would cd to the appropriate higher-level directory, such as vol1/, and run for i in * ; do sh ../mktag.sh "$i" ; done to process the whole directory tree. After you have added ID3 tags, you can look at the results:

$ id3info /rk14/data/mp3/all/Adrian_Belew\:Belewprints/01_Men_In_Helicopters.mp3

*** Tag information for /rk14/data/mp3/all/Adrian_Belew:Belewprints/01_Men_In_Helicopters.mp3
=== TPE1 (Lead performer(s)/Soloist(s)): Adrian Belew
=== TALB (Album/Movie/Show title): Belewprints - The Acoustic Adrian Belew - Volume Two
=== TIT2 (Title/songname/content description): Men In Helicopters
=== TYER (Year): 1998
=== TRCK (Track number/Position in set): 1
*** mp3 info
MPEG1/layer III
Bitrate: 128KBps
Frequency: 44KHz
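
The DTITLE handling in mktag.sh can be tried on its own with a sample line. This is a reduced sketch that handles a single-line DTITLE only; split_dtitle is a hypothetical name, and the sample value is shortened from the album shown above.

```shell
#!/bin/sh
# split_dtitle LINE
# Print the artist and the album from one "DTITLE=Artist / Album" line,
# each on its own line. Single-line titles only; illustrative helper.
split_dtitle() {
    printf '%s\n' "$1" | awk '
        /^DTITLE=/ {
            sub(/^DTITLE=/, "")
            n = split($0, arr, " / ")
            print arr[1]
            print (n > 1 ? arr[2] : "")
        }'
}

# Example: split_dtitle "DTITLE=Adrian Belew / Belewprints"
```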

Making MP3 CD/DVD-Rs

When your excitement about clustering has settled and you have processed enough MP3 files to fill a CD/DVD-R, create a subdirectory for the next volume and move the albums that don’t fit from the first to second volume. Then copy the whole tree of vol1/ to all/ with cp -lav vol1/* all while in the MP3 root directory. -l creates hard links instead of copying the files, so you save a lot of disk space. You could also move the files, but keeping the volume subdirectories acts as a reminder of your MP3 CD/DVD collection contents.
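
To convince yourself that cp -l really links instead of copying, you can compare inode numbers in a scratch directory. The sketch assumes GNU cp, as found on Linux; same_file is a hypothetical helper and the file names are made up.

```shell
#!/bin/sh
# same_file A B: succeed if both names refer to the same inode
same_file() {
    [ "$(ls -di "$1" | awk '{print $1}')" = "$(ls -di "$2" | awk '{print $1}')" ]
}

# Small demonstration in a temporary directory (GNU cp assumed)
demo=$(mktemp -d)
mkdir -p "$demo/vol1/album1" "$demo/all"
echo data > "$demo/vol1/album1/01_Track.mp3"
( cd "$demo" && cp -la vol1/* all )
if same_file "$demo/vol1/album1/01_Track.mp3" "$demo/all/album1/01_Track.mp3"
then
    echo "hard-linked"
fi
rm -rf "$demo"
```

Hard links only work within one file system, so all/ and the volume directories have to live on the same one.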

Renaming the track01, track02, etc. files created by cdparanoia to potentially much longer song titles means that it’s necessary to enable long filenames on the CD file system. Windows machines prefer the Joliet extensions, but Joliet limits filenames to 64 characters (128 bytes). Plain ISO9660 Level 3 allows 31 characters and Level 4 up to 207. Rock Ridge, the preferred solution on Linux, allows up to 255 characters, which is almost always enough. You might want to add -J to the mkisofs parameters in the following script to enable the Joliet extensions, but I left it out because the Joliet filename length limit has been too low so many times.

#!/bin/sh
# Create an ISO image with Rock Ridge extensions from one volume directory.
if [ ! -d "$1" ]
then
  echo Give a sub-directory name!
else
  mkisofs -r -allow-multidot -V "$1" -iso-level 3 -o "$1.iso" "$1"
fi
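
Before building an image, it can be worth listing the files whose names exceed the limit of the file system extension you plan to use. A minimal sketch follows; long_names is a hypothetical helper, the limit is passed in rather than assumed, and awk may count bytes rather than characters depending on the implementation and locale.

```shell
#!/bin/sh
# long_names DIR LIMIT
# List files under DIR whose base name is longer than LIMIT.
# Hypothetical helper; length() may count bytes instead of characters.
long_names() {
    find "$1" -type f | awk -F/ -v limit="$2" 'length($NF) > limit'
}

# Example: long_names vol1 "$limit"   (set limit to the value for your extension)
```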

Code Availability

All the scripts are available from the mp3cluster project homepage. If you think you can improve them, test on other systems, or contribute otherwise, please do so.

Conclusion

Creating an MP3 collection can be about ten times faster if you have enough PCs. It is also a fine candidate for starting to use multiple PCs as a cluster because it is so simple. No, this isn’t a huge scientific cluster, but it provides an interesting and useful example of scaling resource-bound computations, especially as the constraints become network and storage bandwidth.

Raimo Koski is a founder of the Lineox Linux distribution.

Originally published on O’Reilly’s sysadmin site on 12/14/2006.