JOIN
Get Time
features   
A Number or a String:
Parsing Your Input and Formatting Your Output in C++


Author
By Nickolas
TopCoder Member

Introduction
While TopCoder problems are generally hard enough to solve as is, sometimes their solution is complicated with the necessity of extra data conversion: parsing input strings into numerical data to be processed, or formatting the final result as a string of special format.

Such transformations are not as simple as they seem; if worst comes to worst, they can prevent one from solving the problem, and even the easiest cases can take enough time to lose precious points. At one point, while working on one of the Google CodeJam practice problems, I found myself just giving up after an extremely boring hour of parsing input without ever starting the solution itself.

This article is focused on main fast-to-type and convenient means of parsing input and formatting output in cases that are typical for TopCoder problems. All methods are specific to C++, but Java and VB coders may find several helpful ideas.

Typical problems
The problem of parsing input arises when you are given a string (or a set of strings) of a given format that contains data you will need for your computations. The following parsing cases are the most frequent in TopCoder problems:
  • String fragmentation problem: The given string contains several text fragments separated with delimiters, and you need to get all fragments. From here on we assume that the delimiters are sequences of spaces, while text fragments consist of letters and a certain set of special characters. This assumption is legitimate, since otherwise we can scan the string, replace special delimiters with spaces, and replace spaces that are part of the text with some "exotic" character like '&' that wouldn't normally appear in the text. In that case after the fragmentation you will need to scan the result and to replace the "exotic" characters with spaces before later processing.
  • Numbers extracting problem: The given string contains some numbers and some non-numerical information -- you need to extract the numbers, and possibly the other data. If you can neglect the dividers, this problem is the same as the previous one after you have replaced all the non-digit characters with spaces. If not, you will need some other means of extracting every part of the string.
  • Data conversion problem: The given string contains a notation of a number (not necessarily decimal), which you need to get. This problem usually arises after string fragmentation, in situations when you need to convert some of the leftover text fragments into numbers.
  • Time parsing problem: The given string contains a notation of time in a given format (usually "HH:MM"), and you need to get hours and minutes or convert them to minutes only. This problem can be solved easily with any of the general-purpose methods and several special ones, which will be considered separately.
  • Mixed problem: There are a variety of other parsing situations you might encounter, most of which can be broken-down to the cases listed here. For example, you might know the position of the wanted number in the string, or the exact format of the string, and so forth.
One case we won't consider here is one that comes up fairly regularly in TopCoder problems -- a set of strings representing a pattern, maze, or game field, in which each character is a significant part of the whole. Cases like this must be parsed character-by-character, and usually no improvements can be done.

Problems in formatting outputs arise when you need to produce the output as a string of a particular format while including the results of numerical computations. Generally speaking, there are fewer commonly used formatting cases, including:
  • Writing integers: You need to represent a given integer as a string containing its notation (decimal or not).
  • Writing doubles: You need to represent a given double as its notation, usually with a certain number of fractional digits using a set truncation procedure.
  • Mixed problem: The output should be a sequence of strings and numbers composed in a certain order. The problem of formatting time is a nice example of such problems, though it has some problem-specific methods, too.
Methods of attack
  • Manual parsing and formatting: As its name implies, this method involves parsing the string manually, character by character, keeping in memory all possible sequences and the current parser position within them. This method is considered only for several problems in this article; because its implementation greatly depends on the string format, is usually time- and mind consuming, and is extremely error-prone, it is strongly recommended that you use something else. The only exception I can see is the task of parsing and formatting time, since here the string format is quite simple and both parsing and formatting can be done in one line. See the "managing time" section below for more information.
  • Using special functions: There are a great deal of functions that have been created for data conversion and string fragmentation. Some of them accomplish only data conversion tasks like those described above, but others can have wider applications. For example, strtod can be used for number extracting if the numbers are separated with spaces, and sscanf can parse a string of almost any format. Sometimes search functions can be helpful for half-manual parsing, e.g. find_first_of and find_first_not_of methods of class string. This write-up contains no specifications or detailed descriptions of the existing special functions — for more information, consult MSDN or any IDE Help.
  • Other methods: These include methods that originally were not meant for data conversion or string parsing, but can be used rather efficiently. Stringstream is a representative of this group. The …stream family was designed as a tool for managing streams of data, but by representing the string you need to parse as a stream of characters you can work wonders. The main idea is that characters in the stream can be part of a string, or a number notation, or single characters, and all the …stream routines manage them all freely. In some cases stringstream's settings are rather complicated, but it's a universal means of parsing and formatting. Among leading coders, misof is an adherent of using stringstreams for simple things, having mentioned that they are "slow (as in runtime) but convenient". tomek uses istringstream for data conversion string to integer. On the other hand, bmerry considers stream classes uncomfortable because they require a temporary variable with a long typename, and because it is hard to remember all the parameters used for weird formatting. This article provides enough examples, below, that you should be able to see the value of stringstream for yourself.
  • Pre-written templates: The use of pre-written templates is discussed below, along with several tips for writing your own template.
Now that we have reviewed different methods for parsing and formattings, let's see how are they are applied to some typical TopCoder situations.

String fragmentation problem
We are given string s, and we need to get and process all the text fragments it consists of. We assume that delimiters are spaces, though this assumption is essential only for the last approach (which uses stringstream).

This problem allows manual parsing, though the corresponding code is rather lengthy.
	string tmp="";
	int i=0;
	s+=" ";	//so that we needn't check the last text fragment separately
	while (i<s.size())
	{ if (s[i]==' ') 		//check if the current char is a delimiter
	   { if (tmp.size()>0)
	      { // process the current fragment tmp
	           tmp="";
	      }
	   } 
	  else tmp+=s[i];
	  i++;
	}
Using string type member functions find and find_first_not_of for searching the beginnings of spaces and text fragments makes the solution smarter.
	string tmp;
	int j=0,i=s.find_first_not_of(" ",j);
	while (i != string::npos)
	{  j=s.find(" ",i);			//first argument here...
	   tmp=s.substr(i,j-i);
	   i=s.find_first_not_of(" ",j);	//...and here is a string of delimiters
	   // process the current fragment tmp
	}
Special function strtok searches delimiters in a string and replaces them with null character (one delimiter at a time):
	char *st,*buf, sep[]=" .,";		//sep is a string of delimiters
	buf=strdup(s.c_str());
	st=strtok(buf,sep);
	while (st)
	{ //process the current fragment st
	  st=strtok(0,sep);
	}
The solution that uses stringstream is the shortest and the smartest:
	string tmp;
	stringstream a;
	a<<s;
	while (a>>tmp)
	{ //process the current fragment tmp
	}
Extracting numbers
We are given string s, and we need to extract and process all integers it contains.

First we consider that we can neglect all other characters (like in HiddenNumbers (SRM 220, Div1 Easy, Div2 Medium). Then we can replace all characters except for numbers with spaces, and trim ending spaces:
	for(int i=0;i<s.sz;i++)
	  if ((s[i]>'9')||(s[i]<'0')) s[i]=' ';
	s.erase(s.find_last_not_of(" "));
Then we can use one of these approaches:
  • get numbers as text fragments, as described above, and then convert each into a number with an appropriate function;
  • use another approach based on the fact that the text fragments we need are numbers.
The following are several examples of extracting data as numbers. The first is based on the strtod function, which returns not only the extracted number but also the position of the character that stopped scanning the string:
	char *tek,st[20];
	int i;
	strcpy(st,s.c_str());
	tek=st;
	while (tek-st<s.sz)
	{ i=strtod(tek,&tek);
	  // process the current number i
	}
The next is stringstream based, and doesn't even need to trim ending spaces:
	int i;
	stringstream a;
	a<<s;
	while (a>>i)
	{ //process the current number i
	}
The third uses the special functions strtok and atoi:
	char *st,*buf, sep[]=" ";
	int i;
	buf=strdup(s.c_str());
	st=strtok(buf,sep);
	while (st)
	{ i=atoi(st);
	  //process the current number i
	  st=strtok(0,sep);
	}
If we need each piece of information that the string contains, we'll need some more information about the string format to be sure that we're getting the right pieces. Let's examine the parsing task in the SimpleCalculator (SRM 178, Div2 Easy). Here we have an integer n1, a character sign and one more integer n2. The parsing here can look like:
	sscanf(s.c_str(),"%d%c%d",&n1,&sign,&n2);
or
	stringstream A;
	A<<s;
	A>>n1>>sign>>n2;
Data conversion string to number
We are given string s that is a notation of one number, and we need to extract this number into a variable d.

First, let's consider a situation when the number is an integer. There are a number of ways to get the d:
	0) d=0;
	   for(int i=0; i<s.size(); i++)
	   { d+=s[i]-'0';
	     d*=10; }
	1) d=atoi(s.c_str());
	2) sscanf(s.c_str(),"%d",&d);
	3) char *t;
	   d=strtod(s.c_str(),&t);
	4) stringstream A(s);
	   A>>d;
If we have a floating-point number, written in a standard way (i.e. not with a scientific notation, though many functions recognize that too) with an obligatory decimal point. Corresponding ways to get the number are:
	0) d=0;
	   int pos=0, i;
	   for(i=0; s[i]!='.'; i++)
	     d=(d+s[i]-'0')*10;
	   pos=i;
	   i++;
	   for(; i<s.size(); i++)
	     d+=(s[i]-'0')/pow(10,i-pos);
	1) d=atof(s.c_str());
	2) sscanf(s.c_str(),"%f",&d);
	3) stringstream A(s);
	   A>>d;
Formatting doubles
We will consider the most common problem definition: we need to represent a positive double d as its notation with the given number of digits before and after decimal point. For simplicity sake let's assume that we need exactly 4 digits before the decimal point and 4 digits after (this task occurs in PointLifeGame (SRM 221, Div2 Medium)).

Manual formatting is overly long and full of enumeration of possibilities that include round-up and adding leading and trailing zeroes. The sprintf function, however, is perfect for the task:
	char st[10];
	sprintf(st,"%09.4f",d);	//9 stands for total result length
	s=string(st);
The stringstream can manage this task, too, though it takes several hard-to-remember flags:
	stringstream A;
	A<<setw(9)<<setfill('0')<<fixed<<setprecision(4)<<d;
	A>>s;
Other special functions yield worse results. The …toa family doesn't support conversion from doubles at all. ecvt and fcvt support either the total number of digits or the number of digits after decimal point -- but not both -- and neither of the functions actually writes the decimal point. You could use these with lots of extra work, like checking the actual position of the decimal point and inserting it by hand, but even manual formatting is probably preferable.

Managing time
Let's consider one more problem that is an example of parsing and formatting time. Given a string time that contains notation of time in format "HH:MM", we need to extract integer numbers of hours h and of minutes m (cases that include seconds, "am" or "pm" modifiers, or don't include hours should be processed the same way).

Manual parsing is short and elegant here, if we use the property of numeric characters:
	h = (time[0]-'0')*10+(time[1]-'0');
	m = (time[3]-'0')*10+(time[4]-'0');
Other approaches are not much smarter, because this task is extremely easy and can be performed in dozens of ways. For atoi the code is:
	h = atoi( time.c_str() );
	m = atoi( time.c_str()+3 );
For strtod it looks like this:
	char *end;
	h=strtod( time.c_str(), &end);
	m=strtod( time.c_str()+3, &end);
For sscanf we get:
	sscanf( time.c_str(), "%d:%d",&h,&m);
Using stringstream gives us:
	stringstream s;	time[2]=' ';	s<<time;	s>>h>>m;
Now let's consider a pair of integer numbers of hours h and minutes m, which we need to write to a string time as "HH:MM". This task is an example of writing an integer with leading zeros. Manual formatting is easy but comparatively long:
	time="  :  ";
	time[0] = h/10+'0';
	time[1] = h%10+'0';
	time[3] = m/10+'0';
	time[4] = m%10+'0';
The problem with using the special functions here is that some of them don't give you a chance to set the resulting string's width. Thus, if you choose to use itoa, you will need to add zeros manually:
	time="";
	char hs[3],ms[3];
	itoa(h,hs,10);
	itoa(m,ms,10);
	if (strlen(hs)==1) time +="0";
	time +=string(hs)+":";
	if (strlen(ms)==1) time +="0";
	time +=string(ms);
You can use fcvt with no digits after decimal point, but it will also need adding zeros:
	time ="";
	int dec,sign;
	char *hs;
	hs=fcvt(h,0,&dec,&sign);
	if (strlen(hs)==1) time +='0';
	time +=string(hs)+":";
	hs=fcvt(m,0,&dec,&sign);
	if (strlen(hs)==1) time +="0";
	time +=string(hs);
sprintf fulfils the task easily, though using it requires data conversion from char[] to string:
	char st[6];
	sprintf(st,"%02d:%02d",h,m);
	time=string(st);
Using stringstream, with its format functions, gives us:
	#include <iomanip>
	stringstream A;
	A<<setw(2)<<setfill('0')<<h<<':'<<setw(2)<<setfill('0')<<m;
	A>>time;
Should you use a template?
The appeal of using pre-written templates is that, if you have written all the functions you might need for parsing and formatting beforehand, all you will need to do during the Coding Phase is just choose the best one to use.

This approach unites the advantages of other methods, and it is extremely handy. The disadvantage, though, is that when the template is written its author can't be sure that it covers all the parsing and formatting problems they will face during their coder career. (And, as I'm pretty sure that our problem writers' imaginations are far from exhausted, we should expect to encounter many more exotic parsing tasks.)

The sensible advice here is to take a compromise approach -- write a nice short template for cases that are hard to remember, and memorize some quick and multi-purpose methods for both typical and exotic cases.

Experienced coders (tomek, John Dethridge, Eryx, bladerunner) include in their templates a function for string fragmentation into strings or integers, since such parsing tasks are the most frequent in TopCoder problems. Some coders, like bladerunner, have template functions for data conversion that encapsulate a call of some special function with the necessary temporary variables so that they needn't waste time on typing it. One can view their templates in any of their recent solutions. Taking the opposite approach is misof, who doesn't use a template at all; he remembers everything he needs by heart.

Beware
The problems of parsing and data conversion look rather simple once they are solved correctly, but they can still hold pitfalls. The following are several tricky cases.

If you are formatting a double as its notation with a given number of fractional digits, you will need to use some round-up algorithm implemented by the function you've chosen. Remember, though, that the function's round-up might be different from the one the problem statement asks for, so check the boundary cases carefully and consider writing a more complicated conversion.

Thus, in the PointLifeGame (SRM 221, Div2 Medium) you need to round the result down to the nearest ten-thousands. In the PiCalculator (SRM 246, Div1 Easy) "the last digit(s) should be rounded according to the standard rounding rules (less than five round down, more than or equal to five round up)". One more rounding-up problem might occur when working with negative doubles, though it is a rare case for TopCoder problems.

I wouldn't say that you should format the numbers manually, but using a certain combination of floor, ceil and constants like 0.49 before writing the number to a string might be helpful in case of special rounding rules.

Another pitfall is computational speed. Data conversion is a rather slow operation compared to numerical computations, so you shouldn't use them needlessly. For example consider the ApocalypseSomeday (TCHS SRM 2, Div1 Medium) problem. There are various ways to check whether the number contains "666" in its notation. If you convert the number into a string using stringstream and then search for "666" with find member function, you will run into the time limit on the maximum test.

You should also take into account the fact that using different methods of data conversion takes significantly different time. If we compare times of conversions string to integer using the described above methods, we can see that atoi is the fastest, sscanf is about 3 times slower, strtod is 4 times slower than sscanf, and stringstream is 9 times slower than strtod. Therefore, if you need several thousands of data conversions, you should weight the chosen function against others.

And finally, there exist problems concerned with parsing and formatting that aren't limited to them. Sometimes other knowledge is needed, or just reading the "Constraints" section carefully. Thus, in the PiCalculator (SRM 246, Div1 Easy) you should notice that double and even long double type is not enough to store all 25 digits of Pi, so you will need to write
	if (precision==1) return "3.1";
	if (precision==2) return "3.14";
	if (precision==3) return "3.142";
…and so on. Eryx recalls a problem in which "some people had a problem with parsing, because they didn't notice that numbers can have more than one digit."

Conclusion
Different problems and pitfalls might be encountered in different parsing or formatting tasks, but, as misof pointed out, "the most common cause of mistakes is ... trying to do something by hand when there is a library function that does the job." A certain amount of practice is necessary for managing such tasks freely and confidently, however annoying it can be.

Practice Room
Listed below are a couple of TopCoder problems that illustrate the use of the parsing and formatting tasks outlined above. Some of them were mentioned in the write-up, others are a good practice for a novice. The problems are grouped roughly by the task they implement, though some problems need several ones.

String fragmentation
VLNString (SRM 226, Div2 Easy)
SortBooks (SRM 262, Div2 Medium, Div1 Easy)
TreasuresPacking (SRM 308, Div2 Hard)
Recipe (SRM 265, Div1 Medium)

String fragmentation with non-space delimiters
DigitalDisplay (TCHS SRM 6, Div1 Easy)
DirectoryTree (SRM 168, Div1 Medium)
GradingGridIns (SRM 264, Div1 Medium)

Numbers extracting
SimpleCalculator (SRM 178, Div2 Easy)
MassiveNumbers (SRM 236, Div2 Easy)
TwoEquations (SRM 287, Div1 Easy, Div2 Medium)
HiddenNumbers (SRM 220, Div1 Easy, Div2 Medium)

Managing time
VideoEncode (TCO05, Qual 1/3)
SlowKeyboard (TCO06 Qual 7/9/14)
Time (SRM 144, Div2 Easy)
TimeCard (SRM 257, Div2 Hard)
SlowDigitalClock (SRM 260, Div1 Medium)

Formatting doubles
FormatAmt (SRM 149, Div2 Easy)
PointLifeGame (SRM 221, Div2 Medium)
RoomSummary (CRPF Charity Round 1, Div1 Medium)
Planets (TCO'03 Round1, Div1 Medium)