Archive

Posts Tagged ‘awk’

How to extract website URL using Regular Expression

October 13th, 2010
This example uses Java syntax. The same approach works in AWK, PHP, or any other language with regular expression support.



String RE = "http:.*\\.[a-zA-Z0-9]{2,4}";
Pattern r = Pattern.compile(RE);   // java.util.regex.Pattern

Test string 1:

Matcher m = r.matcher("I am maintaining http://article-stack.com. This will help you to learn.");
if (m.find()) System.out.println(m.group());

Output:

http://article-stack.com

Test string 2:

m = r.matcher("< a href='http://article-stack.com' alt='nothing'>article-stack< /a>");
if (m.find()) System.out.println(m.group());

Output:

http://article-stack.com

Consideration:
The domain suffix (such as “com”) is assumed to be 2 to 4 characters long and to contain only alphanumeric characters.

Improving the previous RE

A valid website name should contain alphanumeric characters and hyphens only, and a hyphen must not come at the start of the website name.

String RE = "http://[a-zA-Z0-9][a-zA-Z0-9\\-]*\\.[a-z]{2,4}";

Sample text

I am maintaining http://article-stack.com. This will help you to learn.
I am maintaining http://article-stack.com. This will help you to learn.http://-article-stack.com
I am maintaining http://article-stack.com. This will help you to learn.

Output:

        (
            [0] => http://article-stack.com
            [1] => http://article-stack.com
            [2] => http://article-stack.com
        )

In addition:

You can modify the above RE for the domain suffix, since it may take a form such as “.co.in”.
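
One way to allow suffixes such as “.co.in” is to let the dot-plus-letters group repeat. The following is a minimal Python sketch of that idea, not the code used for the output above. It assumes URLs written in the usual http:// form, and the names URL_RE and extract_urls are made up for illustration.

```python
import re

# Assumption: the host starts with a letter or digit (never a hyphen) and the
# suffix group may repeat, so both ".com" and ".co.in" are accepted.
URL_RE = re.compile(r"http://[a-zA-Z0-9][a-zA-Z0-9-]*(?:\.[a-z]{2,4})+")

def extract_urls(text):
    """Return every substring of text that looks like a website URL."""
    return URL_RE.findall(text)

sample = "Visit http://article-stack.com or http://example.co.in, not http://-bad.com"
print(extract_urls(sample))
# ['http://article-stack.com', 'http://example.co.in']
```

Like the RE above, this sketch only covers a single host name followed by its suffix. It does not try to handle a “www.” prefix or a path after the domain.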

Regular Expressions: Common elements part 2 (Range Search)

September 30th, 2010



I covered the following elements in the previous part on common elements:

  1. ^ (start),
  2. $ (end),
  3. . (any character),
  4. * (zero or more occurrences),
  5. + (one or more occurrences) and
  6. ? (zero or one occurrence)

What if you need to restrict your search to certain characters only? For example, you have to search for a mobile number.

Range Search

Range search lets you search for specified characters or a range of them. You need to enclose all the characters within square brackets, as follows

 	[abcd987ABCD] 

Example

        [0123456789]+

You can use the above RE to search for mobile numbers. However, there are two things wrong with it.

1) The RE grows longer as more characters are added.
2) Mobile numbers are generally 10 digits long, while this RE matches any run of one or more digits.

Let's resolve these issues.

1
Define a range to shorten the RE. You can define numeric and character ranges as follows:

[0-9] any character from 0 to 9.
[a-z] any character from a to z.
[A-Z] any character from A to Z.

You can limit the set of characters, as in [/c], [01] or [A-J].
Or you can combine ranges, as in [a-z0-9A-U].
You can also include special characters, backslash-escaped characters and spaces, as in [a-z@\. \t].

Examples
1. Search for all words in CAPITAL letters.

     [A-Z]+ 

2. Search for mobile numbers

 	[0-9]+ 

3. Search for email ids (an email id can contain “.” and “_”, and a website name can contain “-”)

 	[a-z0-9\_\.]+@[a-z0-9\-]+\.[a-z]+ 

If you add “^” just after “[”, the RE matches any character except the listed ones. For example:

 [^0-9]* 

The above RE matches any run of characters that are not digits.
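
The three range examples can be tried out with a short Python sketch. The sample sentence and the variable names are made up for illustration, and the last pattern uses + instead of * so that empty matches are not reported.

```python
import re

text = "Mail ADMIN at admin_1@article-stack.com or call 9876543210 TODAY"

# One or more capital letters in a row
capitals = re.findall(r"[A-Z]+", text)

# Email ids: letters, digits, "_" and "." before "@"; "-" allowed in the site name
emails = re.findall(r"[a-z0-9_.]+@[a-z0-9-]+\.[a-z]+", text)

# Runs of characters that are not digits ("+" avoids empty matches)
non_digits = re.findall(r"[^0-9]+", text)

print(capitals)    # ['M', 'ADMIN', 'TODAY']
print(emails)      # ['admin_1@article-stack.com']
print(non_digits)
```

Note that [A-Z]+ also picks up the single capital “M” of “Mail”, which shows why a count is needed as well as a range.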

2
Fix the number of occurrences

* and + only set a minimum number of occurrences. With braces we can set an exact count, a minimum, or both a minimum and a maximum.

r{m} exactly m occurrences of r
r{m,} at least m occurrences of r
r{n,m} n to m occurrences of r

Examples

A mobile number has exactly 10 digits.

	[0-9]{10}

The domain suffix should be 2 to 5 characters long.

 	[a-z0-9\_\.]+@[a-z0-9\-]+\.[a-z]{2,5}
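
One thing to watch with [0-9]{10} is that it also matches the first ten digits of a longer number. The Python sketch below shows this and one possible guard. The lookaround guards (?<![0-9]) and (?![0-9]) are an addition for this sketch, not part of the RE above, and the name find_mobiles is hypothetical.

```python
import re

# Plain version from the text: any ten digits in a row
PLAIN_RE = re.compile(r"[0-9]{10}")

# Assumed refinement: the ten digits must not touch another digit on either side
MOBILE_RE = re.compile(r"(?<![0-9])[0-9]{10}(?![0-9])")

def find_mobiles(text):
    """Return the runs of exactly ten digits found in text."""
    return MOBILE_RE.findall(text)

text = "call 9876543210 or 123456789012"
print(PLAIN_RE.findall(text))   # ['9876543210', '1234567890']
print(find_mobiles(text))       # ['9876543210']
```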

Regular Expressions: Common elements part 1

September 29th, 2010

If you find yourself weak in regular expressions, work through this article patiently. But do not forget to read Regular Expression, an introduction with full of examples first. Otherwise this article will surely scare you.



Example content:

<amty> 1st block </amty>
article-stack .com
<amty src=""> Tag with attributes </amty>
I am running article-stack.com

Elements

^ (Shift + 6) beginning of line
$ (Shift + 4) end of line

Example RE

‘/amty/’ returns every line that contains the word amty, so it returns the 1st and 3rd lines.
‘/^article-stack/’ returns the 2nd line, which starts with article-stack. Note that a line starting with a space or tab will not appear in the result.
‘/article-stack$/’ returns no line, because the 2nd line ends with com.

Note that AWK returns the complete line wherever the pattern is found, while other languages return only the matched text. So in those languages

/amty/ returns 4 occurrences of “amty”.
/^article-stack/ returns “article-stack” only once, where it appears at the start of a line.
/article-stack$/ returns nothing, since it does not appear at the end of any line.
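
The difference between returning whole lines and returning only the matched text can be sketched in Python. The two helper names below are made up for illustration, and the sample content is the one from above with the tag spacing normalised.

```python
import re

content = "\n".join([
    '<amty> 1st block </amty>',
    'article-stack .com',
    '<amty src=""> Tag with attributes </amty>',
    'I am running article-stack.com',
])

def matching_lines(pattern, text):
    """AWK style: return each whole line in which the pattern is found."""
    return [line for line in text.splitlines() if re.search(pattern, line)]

def matched_text(pattern, text):
    """Other-language style: return only the matched pieces.

    re.MULTILINE makes ^ and $ apply to every line, not just the whole text.
    """
    return re.findall(pattern, text, flags=re.MULTILINE)

print(len(matching_lines("amty", content)))      # 2
print(matched_text("amty", content))             # ['amty', 'amty', 'amty', 'amty']
print(matching_lines("^article-stack", content)) # ['article-stack .com']
print(matched_text("article-stack$", content))   # []
```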

Move ahead

. (dot) any single character

‘/.amty/’ returns 4 occurrences (or 2 lines in AWK).

Result

        (
            [0] => <amty
            [1] => /amty
            [2] => <amty
            [3] => /amty
        )

The dot (.) is a regular expression element. So if you are searching for a literal dot, you have to put ‘\’ before it. E.g.

‘/article-stack\.com/’ returns “article-stack.com” from the 4th line only, because the 2nd line contains a space between “stack” and “.com”.

Elements to define occurrence

r* zero or more occurrences of the preceding expression r
r+ one or more occurrences of the preceding expression r
r? zero or one occurrence of the preceding expression r

Sounds difficult? See this example

‘/^<amty>.*<\/amty>/’

Explanation:

  • ^ anchors the match to lines starting with <amty>
  • In \/amty the “\” is needed because “/” delimits the pattern. In the same way, “\” is required before any regular expression element that you want treated as simple text.
  • * means zero or more occurrences of the dot (.), and the dot (.) means any single character.
  • Altogether, the expression extracts all lines that start with <amty> and contain any number of characters between “<amty>” and “</amty>”. The line does not have to end with “</amty>”.

Another example

.+@.+\.com is a simple pattern for filtering email ids.
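
To see why this version is only a rough filter, here is a small Python sketch. The sample line is made up, and the tighter pattern is just one possible refinement using character ranges, not a complete email validator.

```python
import re

line = "Write to amty@article-stack.com today, or to admin@example.com."

# ".+" is greedy, so the match stretches from the start of the line
# to the last ".com" it can reach.
loose = re.findall(r".+@.+\.com", line)

# Restricting which characters may appear on each side of "@"
# keeps each match down to a single address.
tighter = re.findall(r"[a-z0-9_.]+@[a-z0-9-]+\.com", line)

print(loose)    # ['Write to amty@article-stack.com today, or to admin@example.com']
print(tighter)  # ['amty@article-stack.com', 'admin@example.com']
```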
Please note this
If you already know regular expressions well, you will find that many REs in this article are not efficient. They are built just for understanding. I have tuned them in further articles, so keep reading.

How to test Regular expression across the programming languages?

September 29th, 2010



Prerequisite

Regular Expression, an introduction with full of examples

A regular expression (regex or regexp for short) is a special text string for describing a search pattern.

There are various flavours of regular expressions, and most of their elements are common to all of them. Some languages provide additional elements, functions and keywords for more efficient searching. You can test an RE in various languages as follows.

AWK

    awk '/RE/' filename

JAVA

	boolean b = Pattern.matches("RE", "contents"); // true only if the whole input matches RE

PHP

        preg_match_all('/RE/', $contents, $result_array);

PERL
I haven’t gone through Perl syntax, but it is similar to AWK.

C# (.NET)

	Regex pattern = new Regex("RE");
        bool matching = pattern.IsMatch("Contents");

Common elements are also supported by text editors such as TextPad or Notepad++.

You can also try an online tool for testing regular expressions.
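
Python is not covered above, but its standard re module is a convenient way to see how these calls differ. The sketch below uses a made-up input string and pairs each call with the snippet it roughly corresponds to.

```python
import re

contents = "I am running article-stack.com"
RE = r"article-stack\.com"

# Found anywhere in the input, like awk '/RE/' or C# IsMatch
found_anywhere = re.search(RE, contents) is not None

# The whole input must match, like Java Pattern.matches
whole_string = re.fullmatch(RE, contents) is not None

# Collect every match, like PHP preg_match_all
all_matches = re.findall(RE, contents)

print(found_anywhere, whole_string, all_matches)
# True False ['article-stack.com']
```

The second result is False because the RE matches only part of the input. This is worth remembering when a pattern that works in AWK seems to fail with Pattern.matches in Java.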

Regular Expression, an introduction with full of examples

September 26th, 2010

Let’s search for txt files in a folder. If you are on Windows, you will open search and type ‘*.txt’. If you are using Unix, you’ll use ‘ls *.txt’.

‘*’ is a commonly used pattern element. It is sometimes called a wildcard character.



A regular expression (regex or regexp for short) is a special text string for describing a search pattern.

Take another simple example.

Amty*.txt

What does the above pattern do? It searches for all files whose names start with ‘Amty’ and end with ‘.txt’.

A regular expression is nothing but a combination of such elements along with simple text. Programmers and mathematics students can think of an element as a variable. Now see

       127x+43y=z
       -78x-34y=125z

The above expressions can be written as ‘Ax+By=C’. Here you can put in various values of A, B and C to get the expressions above.

Now consider a situation where you have to extract such expressions from a page full of text. For example

Article-stack.com is a 127x+43y=z online sharing and learning site.
-78x-34y=125z Users registrations is must to see restricted contents and articles
Subscription is required for email alerts.

You have to extract all the expressions that look like ‘Ax+By=C’, where each expression is delimited by spaces. So a possible RE would be

RE

[\-0-9]*x[\+\-]{1}[0-9]*y=[^ ]*

Result

            [0] => 127x+43y=z
            [1] => -78x-34y=125z
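
The RE can be checked against the sample page with a few lines of Python. This is only a sketch for trying it out, and the names EQUATION_RE and extract_equations are made up for illustration.

```python
import re

page = (
    "Article-stack.com is a 127x+43y=z online sharing and learning site.\n"
    "-78x-34y=125z Users registrations is must to see restricted contents and articles\n"
    "Subscription is required for email alerts."
)

# Same RE as above: optional sign and digits, "x", a "+" or "-",
# digits, "y=", then everything up to the next space.
EQUATION_RE = re.compile(r"[\-0-9]*x[+\-]{1}[0-9]*y=[^ ]*")

def extract_equations(text):
    """Return every substring of text shaped like Ax+By=C."""
    return EQUATION_RE.findall(text)

print(extract_equations(page))
# ['127x+43y=z', '-78x-34y=125z']
```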

The above RE will not be clear to you until you read about the elements of regular expressions. So keep reading.

AWK: pattern and actions

September 26th, 2010


We write the action in curly braces, while the pattern is written outside them. Like:

Syntax

	pattern { action }

Example

	awk 'NR==52 {print $0;}'
The pattern decides “what to search for?”, while the action decides “what to do?”.
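
For readers who know another language better than AWK, the pattern and action pair can be modelled roughly in Python. This is only a mental model under simplifying assumptions, not how AWK is implemented. The function run and the rule format are made up for illustration.

```python
def run(rules, lines):
    """Apply each (pattern, action) rule to every input line, AWK style."""
    output = []
    for nr, line in enumerate(lines, start=1):   # nr plays the role of NR
        for pattern, action in rules:
            if pattern(nr, line):                # "what to search for?"
                output.append(action(nr, line))  # "what to do?"
    return output

# Rough equivalent of: awk 'NR==2 {print $0;}'
rules = [(lambda nr, line: nr == 2, lambda nr, line: line)]
result = run(rules, ["first", "second", "third"])
print(result)  # ['second']
```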

For example, say you want to remove all spaces and tabs from all fields of a file.

Here, you first need to find the text that consists of spaces and tabs only (the pattern), so that you can replace it with nothing. The replacement is called the action.

Example:

awk '/article-stack\.com/' post-contents.txt

The above command filters and prints (printing is the default action) all the lines that contain “article-stack.com”. Don’t worry about how it works; I’ll explain that in the next session. Here, “/article-stack\.com/” is called the pattern.

Now, instead of printing the complete line, you may prefer to print only the first few words or characters. Sometimes you just want to delete the matched pattern from the file. All such activity is called the action.

I am not giving any example of this, because you need to learn the basic structure and syntax of AWK first.

You can understand patterns deeply only once you are clear on regular expressions.

Basic structure of AWK command

August 2nd, 2010

We have already discussed What is AWK, an introduction. Now you need to understand the basic structure of an AWK command.

An AWK command can be broken into three parts.

Syntax

BEGIN { print "START" }
      { print         }
END   { print "STOP"  }

The BEGIN and END blocks are optional and run only once. The middle block, on the other hand, runs for every line of the given file. This is the block where the actual processing logic is written.

For example

awk '/article-stack\.com/' post-contents.txt

In the above command, the BEGIN and END blocks are missing. The command searches for ‘/article-stack\.com/’ in every line.

Another example: try all of the commands below

awk 'BEGIN{print FILENAME}' filename
awk '{print FILENAME}' filename
awk 'END{print FILENAME}' filename

/* FILENAME is a built-in variable that holds the name of the current input file. We will look at it later in the elements section. */

Here I assume the input file for the above example has more than one line. The second AWK command prints the file name N times, where N is the number of lines in the input file, because the middle block runs for every line. The first and third commands run their block only once. The third prints the file name a single time, while the first typically prints an empty line, because no input file has been opened yet when BEGIN runs.

You can use the BEGIN block to set initial parameters and the END block to print the result as a summary.
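
The three parts can be pictured with a rough Python model: set-up before the loop, work inside the loop, summary after it. This is only an analogy for readers coming from other languages. The function summarize and its default search text are made up for illustration.

```python
def summarize(lines, needle="article-stack.com"):
    # BEGIN block: runs once and sets the initial values
    total = 0
    hits = 0
    # Middle block: runs once for every input line
    for line in lines:
        total += 1
        if needle in line:
            hits += 1
    # END block: runs once and reports the summary
    return f"{hits} of {total} lines mention {needle}"

print(summarize([
    "I am running article-stack.com",
    "nothing here",
    "article-stack.com again",
]))
# 2 of 3 lines mention article-stack.com
```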

Some more examples for better understanding:

1. To print number of lines in a file [one time only]

awk 'END{print NR}' filename

2. To Print no. of entries of a month

AWK: How to count number of entries for a month

August 2nd, 2010

I hope all of you are familiar with AWK. This example will help you to understand AWK in practice.
Sample Data:

	 10-Jul-10
	 23-Jul-10
	:
	 31-Jul-10
	 1-Aug-10
	:
	 4-Aug-10
	 5-Aug-10
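awk 'BEGIN {
		FS = "-"; OFS = ","
	}
	{
		$1 = "";                          # blank out the day, leaving ",Jul,10"
		print substr($0, 2, length($0))   # drop the leading comma
	}' dates.txt |
sort |
awk 'BEGIN {
		getline;                          # read the first line up front
		lastline = $0;
		count = 1;
	}
	{
		if (lastline == $0) {
			count += 1;
		}
		else {
			print lastline ": " count;    # value changed, so report the finished run
			lastline = $0;
			count = 1;
		}
	}
	END {
		print lastline ": " count;
	}'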

Output:
Aug,10: 5
Jul,10: 22

Explanation:
We can break the above command into 3 parts. The first part removes the day from each date and gives filtered output like:

	 Jul,10
	 Jul,10
	 Jul,10
	:
	 Jul,10
	 Aug,10
	 :
	Aug,10
	 Aug,10

The text file or input data may have its dates in any order. So the sort command sorts the output of the first command and so prepares the input for the third command. The third command is again an AWK command. It looks for runs of the same value, and each time the value changes it prints the count.
To understand AWK examples you must read Basic structure of AWK command.
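
For comparison, the same counting logic can be sketched in Python. This is not a replacement for the AWK pipeline, just another view of what it computes. The function name count_by_month and the shortened sample list are made up for illustration. Because a dictionary keeps one counter per month, the intermediate sort step is not needed here.

```python
from collections import Counter

def count_by_month(dates):
    """Count entries per "Mon,YY" key from dates written as D-Mon-YY."""
    counts = Counter()
    for entry in dates:
        day, month, year = entry.strip().split("-")
        counts[f"{month},{year}"] += 1
    return counts

# Assumed sample: a shortened version of the data shown above
sample = [" 10-Jul-10", " 23-Jul-10", " 31-Jul-10", " 1-Aug-10", " 4-Aug-10", " 5-Aug-10"]
for key, n in sorted(count_by_month(sample).items()):
    print(f"{key}: {n}")
# Aug,10: 3
# Jul,10: 3
```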

What is AWK, an introduction

August 2nd, 2010

AWK is a simple yet powerful UNIX filter command. You can use it to filter and format the contents of a text file, and sometimes to modify them. For example:
count the number of entries month-wise in the file below

	 10-Jul-10
	 23-Jul-10
	:
	 31-Jul-10
	 1-Aug-10
	:
	 4-Aug-10
	 5-Aug-10

Syntax idea:
Basic

awk 'NR==52 {print;exit}'

Complex

awk 'BEGIN {
		FS="^";OFS="^";
	}
	{
		for (i=1; i<=NF; i++) {
			gsub(/^[ \t]+|[ \t]+$/,"",$i);
		} print
	}' filename > file_name_new

You can call AWK a programming language as well, because, like other programming languages, it has control and conditional statements. It also lets you create functions.

Instead of typing a long command at the command prompt, you can write an AWK program in a file and run it with awk -f.

There are 3 versions of AWK:

AWK - the original from AT&T
NAWK - A newer, improved version from AT&T
GAWK - The Free Software foundation's version

I love to work in the basic version only, because it has limited functions, some restrictions on regular expressions, and fewer features than NAWK and GAWK, so it challenges your ingenuity. Moreover, if you write a program in AWK, it’ll run in NAWK and GAWK as well.

AWK: How to remove all spaces & tabs from all fields of a file

August 1st, 2010

The following command removes all leading and trailing spaces from all fields of a text file. It also removes tab characters. I am assuming that all the fields in the input file are separated by “^”. If you are using any other separator, set the value of FS in the BEGIN block accordingly.

 awk 'BEGIN {FS="^";OFS="^";}{for (i=1; i<=NF; i++) {gsub(/^[ \t]+|[ \t]+$/,"",$i);} print }' filename > file_name_new
Please note this
This example will also help you to build a data file for your database where you need to insert a blank or null value when the value for a field does not exist.
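
The same clean-up can be sketched in Python for a single record. This is only an illustration of what the AWK command does to each line. The function name trim_fields is made up, and the “^” separator is the same assumption as above.

```python
def trim_fields(line, sep="^"):
    """Strip leading and trailing spaces and tabs from every field of one record."""
    return sep.join(field.strip(" \t") for field in line.split(sep))

print(trim_fields("  a \t^ b^^\tc "))
# a^b^^c
```

The empty field between the two “^” characters is kept, which is what makes the output usable where a blank or null value has to be loaded.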