Interview Questions: Linux/Unix

One thing I have neglected to study is the not-too-often-used commands of Linux and Unix.  In an effort to make sure I never forget them, I’ll keep a list of the few that I have been asked about:

  • lsof – lists all of the files currently open in the operating system.
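  • echo $SHELL – displays your login shell (the default set for your account).  That is usually, but not always, the shell you are currently running; at an interactive prompt, echo $0 shows the one actually in use.
  • uname – displays the name of the operating system kernel you are running (e.g. Linux).  By itself it doesn’t tell you much, but uname -a prints everything it knows: kernel name, host name, kernel release and version, and hardware architecture.
  • find vs grep – use grep to search inside files, use find to locate files by name.  Specifically: find . -name filename -print looks for files whose name matches filename.  You can use * as a wildcard, e.g. "*brian*" to look for any file whose name contains brian (case sensitive); quote the pattern so the shell doesn’t expand it before find sees it.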
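
Here is a small sketch of the find-versus-grep difference, run against a throwaway directory.  The file names and contents are made up for illustration, and grep -r is assumed to be available (it is in GNU and BSD grep, though not required by POSIX).

```shell
# Build a throwaway directory so nothing real is touched (file names are hypothetical).
demo_dir=$(mktemp -d)
mkdir -p "$demo_dir/notes"
printf 'call brian tomorrow\n' > "$demo_dir/notes/todo.txt"
printf 'nothing relevant in here\n' > "$demo_dir/brian-resume.txt"

# find looks at file NAMES. Quote the pattern so the shell passes it to find unexpanded.
found_by_name=$(find "$demo_dir" -type f -name '*brian*' -print)

# grep looks at file CONTENTS. -r recurses into subdirectories, -l prints only matching file names.
found_by_content=$(grep -rl 'brian' "$demo_dir")

echo "matched by name:    $found_by_name"
echo "matched by content: $found_by_content"
```

The first search finds brian-resume.txt (right name, irrelevant contents), while the second finds notes/todo.txt (irrelevant name, right contents).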

Get All Phone Numbers From All Word Documents

There is a common interview question where you have a directory structure containing, let’s say, Word documents, and you’re looking for all of the phone numbers in all of them.  I decided to poke around and devise a solution to this problem (credit goes to this post for getting me started).

Searching raw text files for phone numbers is easy, but Word documents are slightly harder since they are stored in a compressed format (a .docx file is a zip archive of XML files).  I found a neat SourceForge project called docx2txt, a handy little Perl program that converts .docx files to easily searchable text files.  A bit of setup is needed to configure the unzip program location in the docx2txt.config file, but that’s it.  :)

Warning: doing this on Windows is a bit more difficult than on Linux, since you’ll have to download Perl and Cygwin to run my solution.

My solution:

Get and set up docx2txt from SourceForge, cd to the root of the folder containing the .docx files, then run this:

$ find . -name "*.docx" | \
      xargs -i perl c:/downloads/docx2txt-1.0/docx2txt.pl {}; \
      find . -name "*.txt" | \
      xargs -i cat {} | \
      grep -ohP "\(?\b[0-9]{3}\)?[-. ]?[0-9]{3}[-. ]?[0-9]{4}\b" | \
      sort -u

Voila!  Yikes…

This is a long, multi-part statement though.  Here’s a bit about each part:

  • find . -name "*.docx" –> recurse through all directories, starting at the current directory, and find all *.docx files
  • | xargs -i perl c:/downloads/docx2txt-1.0/docx2txt.pl {} –> for each .docx file found by the previous step, pass it as an argument to docx2txt.pl, which writes a .txt file alongside it
  • ; find . -name "*.txt" –> wait for the previous pipeline to finish, then find all of the .txt files generated by the first part (plus any other .txt files already in the tree)
  • | xargs -i cat {} –> output the contents of each text file
  • | grep -ohP "\(?\b[0-9]{3}\)?[-. ]?[0-9]{3}[-. ]?[0-9]{4}\b" –> search for all phone numbers of the format ###-###-####, with optional parentheses around the area code and dashes, dots, or spaces as separators.  -o prints only the matched text, -h leaves out file names, and -P enables Perl-style regular expressions.  Regular expressions are not fun, but are made a lot easier by tools like Just Great Software’s RegexBuddy
  • | sort -u –> finally sort the phone numbers and output each one once (removing duplicates)
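
To try the grep and sort steps on their own, here is a minimal sketch that feeds the same regular expression some made-up sample text in place of the converted documents.  The function name extract_phone_numbers is hypothetical, and the sketch assumes GNU grep built with PCRE support (needed for -P).  LC_ALL=C is added so the sort order does not depend on your locale.

```shell
# Read text on stdin and print each distinct phone number once.
# Assumes GNU grep with PCRE support (-P); -o prints only the matched text.
extract_phone_numbers() {
    grep -oP '\(?\b[0-9]{3}\)?[-. ]?[0-9]{3}[-. ]?[0-9]{4}\b' | LC_ALL=C sort -u
}

# Made-up sample text standing in for the converted documents.
printf '%s\n' \
    'Call (888)555-0003 or 888-555-0001.' \
    'Fax: 888.555.0004' \
    'Again: 888-555-0001' \
    'Order 12345 is not a phone number.' |
    extract_phone_numbers
# Prints:
#   (888)555-0003
#   888-555-0001
#   888.555.0004
```

Because every separator in the pattern is optional, a bare run of exactly ten digits (an account number, say) would also be reported as a phone number, so expect a few false positives on real documents.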

You should end up with something like this:

$ find . -name "*.docx" | \
> xargs -i perl c:/downloads/docx2txt-1.0/docx2txt.pl {};\
> find . -name "*.txt" |
> xargs -i cat {} | \
> grep -ohP "\(?\b[0-9]{3}\)?[-. ]?[0-9]{3}[-. ]?[0-9]{4}\b" | \
> sort -u
(888)555-0003
888-555-0001
888-555-0002
888-555-0006
888.555-0005
888.555.0004
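
One caveat: GNU xargs documents -i as deprecated in favor of -I, and xargs still treats quote characters in file names specially, so a document named something like it's here.docx can trip it up.  Below is a hedged variant that uses find -exec instead, which hands file names straight to the command.  The function names are hypothetical, the script path is passed in as a parameter since it varies per machine, and GNU grep with PCRE support is assumed.

```shell
# Convert every .docx under a directory. As in the original pipeline, this relies on
# docx2txt.pl writing a .txt file next to each .docx when no output name is given.
convert_docx_tree() {
    docx2txt_script=$1
    search_root=$2
    find "$search_root" -name '*.docx' -exec perl "$docx2txt_script" {} \;
}

# Print each distinct phone number found in any .txt file under a directory.
# Assumes GNU grep with PCRE support (-P). "{} +" passes many files to each grep call.
phones_in_tree() {
    search_root=$1
    find "$search_root" -name '*.txt' -exec \
        grep -ohP '\(?\b[0-9]{3}\)?[-. ]?[0-9]{3}[-. ]?[0-9]{4}\b' {} + |
        LC_ALL=C sort -u
}

# Usage, with the script location used earlier in this post:
#   convert_docx_tree c:/downloads/docx2txt-1.0/docx2txt.pl .
#   phones_in_tree .
```

Splitting the job into two functions also makes it easy to rerun just the search while tweaking the regular expression, without converting the documents again.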