APLawrence - Information and Resources for Unix and Linux Systems, Bloggers and the self-employed

















Handling missing data in inputs





Missing data can be very annoying to a programmer. In fact, it is so annoying that very often we'll write separate programs to clean up data and eliminate unpleasant conditions so that the main program doesn't have to deal with it. Here, I'll show some examples of the kind of problems we see.

Let's take a common data format, a tab-delimited file. A simplistic Perl program to read such a file might be:


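#!/usr/bin/perl
while (<>) {
 # drop the newline so it can't end up inside the last field
 chomp;
 #split on tab into @x array; the -1 keeps trailing empty fields
 my @x=split /\t/, $_, -1;
 #print first three elements
 print "$x[0]\t$x[1]\t$x[2]\n";
}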
 

An equivalent shell script might be:

# set IFS to a single tab character
IFS="$(printf '\t')"
while read -r a b c d
do
printf '%s\t%s\t%s\n' "$a" "$b" "$c"
done
 

The Perl script works, but the shell script doesn't. Suppose the input file looks like this:

$ cat t;hexdump -c t
1       2       3       4
1               3       4
        2       3       4
1       2               4
                3       4
0000000   1  \t   2  \t   3  \t   4  \n   1  \t  \t   3  \t   4  \n  \t
0000010   2  \t   3  \t   4  \n   1  \t   2  \t  \t   4  \n  \t  \t   3
0000020  \t   4  \n                                                    
0000023
 

The Perl script produces

1       2       3
1               3
        2       3
1       2
                3
 

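but the shell script messes up. In POSIX shells a tab in IFS counts as whitespace, so read treats a run of tabs as a single separator and ignores leading ones, and the remaining fields shift left: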

1       2        3
1       3        4
2       3        4
1       2        4
3       4 
 

Perl's split doesn't have this problem, but if it did, we could handle it by padding the empty fields before splitting:

#!/usr/bin/perl
while (<>) {
 chomp;
 # make sure there is at least one space between adjacent tabs;
 # the lookahead also handles runs of three or more tabs
 s/\t(?=\t)/\t /g;
 #split on tab into @x array
 my @x=split /\t/;
 #print first three elements
 print "$x[0]\t$x[1]\t$x[2]\n";
}
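
As a rough illustration of why the plain split version copes with the sample file, here is a small sketch. Perl's split keeps empty leading and interior fields, and only drops trailing empty ones unless you give it a negative limit. The helper name first_three and the "-" placeholder are my own inventions for this example, not part of the scripts above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return the first three tab-separated fields of a
# line, with missing or empty fields returned as empty strings.
sub first_three {
    my ($line) = @_;
    chomp $line;
    # A limit of -1 keeps trailing empty fields, which split drops by default.
    my @x = split /\t/, $line, -1;
    # Pad short records so we always hand back exactly three values.
    push @x, '' while @x < 3;
    return @x[0 .. 2];
}

# The same five records as the sample file above.
my @sample = ("1\t2\t3\t4\n", "1\t\t3\t4\n", "\t2\t3\t4\n",
              "1\t2\t\t4\n", "\t\t3\t4\n");

for my $line (@sample) {
    # Show empty fields as "-" so the missing data is visible.
    print join("\t", map { length($_) ? $_ : '-' } first_three($line)), "\n";
}
```

Padding to a fixed number of fields means later code never has to test for undefined array elements.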
 

But things can be worse. For example, if we are processing what was once a report format, we may have no delimiters, just empty space. We might see something like this:

Date          Customer             Phone           Terms     Balance

09/04/04      ABCD Corp.                           PPD          0.00
09/04/04      Abba Corp.          555-5555         Net 30     985.00
 
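You can't process that with delimiters, but you can use unpack:

#!/usr/bin/perl
while(<>) {
 # fixed widths: date (8), gap (6), customer (20), phone (17), terms (9), balance (12)
 # "A" fields come back with trailing whitespace stripped
 @x=unpack("A8A6A20A17A9A12",$_);
 # $x[1] is only the gap after the date, so skip it
 print "$x[0]:$x[2]:$x[3]:$x[4]:$x[5]\n";
}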
 

Which will produce:

Date:Customer: Phone:Terms: Balance
::::
09/04/04:ABCD Corp.::PPD:    0.00
09/04/04:Abba Corp.:555-5555:Net 30:  985.00
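
The same idea can be pushed a little further by giving the columns names, so a missing phone number shows up as an empty value rather than a shifted one. This is only a sketch: the widths come from the template above, but the field names, the helper parse_report_line and the sprintf layout used to rebuild the sample lines are assumptions of mine.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Same widths as the unpack template above; the field names are assumptions
# made for this sketch. The second field is the gap after the date.
my $template = "A8 A6 A20 A17 A9 A12";
my @names    = qw(date gap customer phone terms balance);

# Hypothetical helper: turn one report line into a hash of named fields.
sub parse_report_line {
    my ($line) = @_;
    my %rec;
    @rec{@names} = unpack($template, $line);
    delete $rec{gap};
    for my $value (values %rec) {
        $value = '' unless defined $value;   # nothing there at all
        $value =~ s/^\s+//;                  # right-aligned columns keep leading blanks
    }
    return \%rec;
}

# Rebuild the two sample records with sprintf so the columns line up
# with the template widths.
my $layout = "%-14s%-20s%-17s%-9s%8s\n";
my @report = (
    sprintf($layout, "09/04/04", "ABCD Corp.", "",         "PPD",    "0.00"),
    sprintf($layout, "09/04/04", "Abba Corp.", "555-5555", "Net 30", "985.00"),
);

for my $line (@report) {
    my $rec   = parse_report_line($line);
    my $phone = length($rec->{phone}) ? $rec->{phone} : "(no phone)";
    print "$rec->{customer}: $phone, balance $rec->{balance}\n";
}
```

I kept the gap as an "A6" field and threw it away rather than skipping it with "x6", because "x" makes unpack die when a line is shorter than the template, as a blank line would be.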
 

Comma separated value files can be annoying if they also contain commas within quoted fields. You can't use split because of that. There are at least two ways to handle that: either use the Text::ParseWords module:

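#!/usr/bin/perl
use Text::ParseWords;
while(<>) {
 # drop the newline so it doesn't end up in the last field
 chomp;
 # split on commas outside quotes; the 0 strips the quotes from each field
 @x=quotewords(",",0,$_);
 foreach (@x) {
  print " $_";
 }
 print "\n";
}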
 

Or (assuming the data is regular enough), replace commas not inside quotes with a different delimiter and then split it. I think ParseWords is easier.
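
Here is a sketch of the two approaches side by side, so you can see they give the same fields on well-behaved data. The sample records and the helper names fields_parsewords and fields_swap are made up for illustration, and the second approach assumes the data never contains tabs or escaped quotes.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Text::ParseWords;   # core module that provides quotewords()

# Approach 1: let quotewords() honour the quotes.
sub fields_parsewords {
    my ($line) = @_;
    chomp $line;
    return quotewords(",", 0, $line);
}

# Approach 2: swap commas outside double quotes for a tab, then split.
# Assumes the data never contains tabs or escaped quotes.
sub fields_swap {
    my ($line) = @_;
    chomp $line;
    # Quoted strings are copied through unchanged; bare commas become tabs.
    $line =~ s/("[^"]*")|,/defined($1) ? $1 : "\t"/ge;
    my @x = split /\t/, $line, -1;
    s/^"|"$//g for @x;      # drop the surrounding quotes
    return @x;
}

# Hypothetical sample records: a comma inside quotes, and a missing phone.
my @sample = (
    qq{09/04/04,"Abba Corp., Inc.",555-5555,Net 30,985.00\n},
    qq{09/04/04,"ABCD Corp.",,PPD,0.00\n},
);

for my $line (@sample) {
    print join("|", fields_parsewords($line)), "\n";
    print join("|", fields_swap($line)), "\n";
}
```

Both versions keep the empty phone field in the second record as an empty string, which is what you want when the missing data has to stay in its column.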





But sometimes none of that is going to work either. I'm working on a project right now where the input data can have up to three fields, but any of the three can be missing and there are no delimiters and no fixed spacing. The only way to determine what we have is to know that field one, if present, is alpha, field two is a whole integer, and field three will always have a decimal point. So

ABC  982.00
8
15.45
 

means that I have fields 1 and 3 on line 1, only field 2 on line 2, and only field 3 on line 3. It's actually much worse than this; there are other fields, some of which are always present and some of which are not, and it is quite a challenge to normalize this stuff to be able to massage the data. The way to handle it is to do splits on / /, and then determine what we got. So it's something like this:

#!/usr/bin/perl
while(<>) {
 # collapse every run of whitespace (including the newline) to one space
 s/\s+/ /g;
 @x=split / /;
 foreach (@x) {
  # .. determine what we have based on previous field(s) seen and content
 }
}
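
To make the "determine what we have" step a little more concrete, here is a minimal sketch for the three-field example only. It looks at content alone and ignores the "previous field(s) seen" part, which the real data needs. The field names, the patterns and the helper classify_line are assumptions based on the description above, and I've used split ' ' in place of the substitution plus split / /.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: work out which of the three optional fields are
# present on a line from their content alone.
sub classify_line {
    my ($line) = @_;
    my %rec = (alpha => '', count => '', amount => '');
    # split ' ' skips leading whitespace and splits on runs of whitespace
    for my $token (split ' ', $line) {
        if    ($token =~ /^[A-Za-z]+$/) { $rec{alpha}  = $token }   # field one
        elsif ($token =~ /^\d+$/)       { $rec{count}  = $token }   # field two
        elsif ($token =~ /^\d+\.\d+$/)  { $rec{amount} = $token }   # field three
        else  { warn "unrecognized token '$token'\n" }
    }
    return \%rec;
}

for my $line ("ABC  982.00\n", "8\n", "15.45\n") {
    my $r = classify_line($line);
    # Print every field, present or not, with ":" as the delimiter.
    print join(":", @{$r}{qw(alpha count amount)}), "\n";
}
```

For the three sample lines this prints ABC::982.00, then :8:, then ::15.45, so every record ends up with the same three delimited columns.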
 








Thu Mar 17 20:34:15 2005: Subject:   anonymous



With this prime number programme, how do I find the last prime number that did not go into the prime number? I thought it would be something like

printf(" %-8.3F\n", $value++); but it does not work.

cat prime
#!/usr/bin/perl
print "enter a number> ";
$number = <STDIN>;
chomp( $number );
if ( $number !~ /^\d+$/ )
{
    print "invalid input\n";
    exit 1;
}
$prime = 1;
for( $value = 2; $value < $number; $value++ )
{
    if ( $number % $value == 0 )
    {
        $prime = 0;
        # Perl leaves a loop with "last", not "break"
        last;
    }
}
if ( $prime == 1 )
{
    print "prime number\n";
}
else
{
    print "not a prime number\n";
}
exit 0;







Thu Mar 17 22:01:19 2005: Subject:   TonyLawrence

I don't think you understand the code. This is NOT looping through prime numbers, so there is no "last prime number".
