If you have been involved with computer science for any length of time, you have probably heard the term "relational database," but you might not have had the opportunity to actually work with this type of technology. Clearly, relational databases comprise a huge topic, and there are many great books written on the subject. However, the purpose of this article is to give you a chance to learn some database basics and get a better understanding of some of the key components of relational databases. The intended audience for this article is people who have little or no experience with relational databases, or those who have not worked with databases for a while and would like a refresher course. Even if you're a database expert and don't need a refresher, though, hopefully the section on loading TopCoder RSS feeds to Oracle might be helpful to you. I. Introduction Besides being a nice resume addition, though, there are many other great reasons to learn about relational databases. XML files are great for storing data in a uniform format that you can access with languages such as JAVA. However, when you start dealing with 20 million records, working with raw XML files can become impractical very quickly. A database, when tuned properly, can help you find the one record you need out of 20 million in seconds, or even less. Besides offering fast access to data, the database can help ensure the integrity of that data. Think about a person who goes to the bank to withdraw money. What if, at the moment the money is being withdrawn from their account, the computer server that houses the database with their account information crashes? Well, if you are using a relational database and have the withdrawal logic wrapped in a "transaction" (we will discuss this in a later article), then you can rest assured that you can get that database back to a point in time right before the crash, with the data intact. (This is assuming, of course, that you have set up a proper database infrastructure and are running the database in ARCHIVELOG mode so that you can reapply all the “transactions” since the last backup.) While the basic concepts behind relational databases are pretty much universal, in this article we will focus on Oracle. In addition to Oracle holding the top share of the relational-database market (at least according to Wikipedia), TopCoder has recently had some component competitions for Oracle PL/SQL components. If you are in the job hunt, or are just looking to earn some extra money in component competitions, it is probably the single most beneficial database you can learn. II. Environment Setup Note that when you install the software, you will be prompted for a password to use for the system account. In the examples that follow, we will assume you used a password of oracle. If you use something different during the install then just plug in the appropriate value. I recommend that you set the Oracle services startup type to manual if you are running on Windows, as this will prevent the services from consuming a lot of memory when you are not using Oracle. To set the services to manual startup, go to Start->Run and type services.msc. You should get a dialog box similar to the one shown below. Find the OracleServiceXE and OracleXETNSListener services, right-click on each, and select "properties." You can then set the startup type to manual. Note that you will need to start these services up whenever you want to use Oracle on your machine. Alternatively, you can just go to Start->Programs->Oracle 10g Express Edition->Start Database/Stop Database to start/stop the database. III. Tables
An important component of a database table is the primary key. The primary key is the column or set of columns within each row of the table that make that row unique from all the other rows in the table. So in our table above, we would make the primary key on coder_id, since that value should always be unique for each row in this table. In situations where there is not a suitable primary key, we can generate one by using a database sequence in Oracle, or if you are using Microsoft SQL Server you can use an identity column. These values are usually just a number starting at 1, and each time we insert a row into the table the value is incremented. Primary keys are also important to allow us to set up referential integrity between tables, which we will discuss later. To create a table, most databases come with a graphical user interface tool. While this is sufficient in some cases, it will help you to learn Structured Query Language, more commonly known as SQL. SQL is the programming language we use to interact with the database (I'll be covering SQL in more detail in another article). You can also create tables with SQL which is the approach many coders use. Let's create our very first database table! First, we will need to log into SQL*Plus, Oracle's command line utility for interacting with the database. Assuming you used 'oracle' for your password, bring up a command prompt and just type 'sqlplus system/oracle' at the prompt (this assumes the oracle bin directory was added to your path, which the installation should have done for you automatically; if it didn't you should add the bin directory manually). We'll create a new Oracle user, which will be the owner of the tables we are going to create, and we will call this user topcoder. The password to our topcoder user will be topcoder also. We will need to set our topcoder user up with some basic rights and assign it a default tablespace so it can have a place to store its tables. From the SQL*Plus command line, run the following SQL code: Connected to: Oracle Database 10g Express Edition Release 10.2.0.1.0 - Production SQL> create user topcoder identified by topcoder; User created. SQL> alter user topcoder default tablespace users; User altered. SQL> grant connect to topcoder; Grant succeeded. SQL> alter user topcoder quota unlimited on users; User altered. Now we're finally ready to create our coder table for the active algorithm coder list. The SQL code to run create our table is listed below. Note that in SQL*Plus, the command doesn't execute until you follow it with a semicolon -- you can type each line and press enter, and you don't have to worry about SQL*Plus trying to execute the command until you type a semicolon. There are some exceptions -- some commands, like running an external script, will execute without a semicolon -- but in most cases SQL*Plus won't do anything until you indicate the end of a command in this manner. (Also, please note that you won’t be required to type in the line numbers 2, 3, 4, etc. as shown below. Just type in the text and, each time you press return at the end of a line, the next line will begin with a line number to indicate that this text is a continuation of the previous line.) SQL> create table topcoder.coder ( 2 coder_id number primary key 3 ,handle varchar2(20) 4 ,country_name varchar2(70) 5 ,alg_rating number 6 ,alg_vol number 7 ,alg_num_ratings number 8 ); Table created. We have now created a database table. Notice that we specified coder_id as our primary key for the table. Now, to find out what data is in our table, we can use the SQL select statement. We will get into advanced types of queries in the later articles in this series, but for now just run the following code: SQL> select * from topcoder.coder; no rows selected Since we just created the table, there is not going to be any data in it, so let's insert a few records. We'll use the SQL insert statement to insert our records. Note the use of the 'commit' command after the insert statements. The commit command, as you probably guessed, tells the database to commit the changes to the table and make them permanent. If for some reason we don't want to keep these records we inserted, we can execute a 'rollback' command instead, assuming we haven't already committed yet. Here's the code: SQL> insert into topcoder.coder 2 (coder_id,handle,country_name,alg_rating,alg_vol,alg_num_ratings) values 3 (123456,'slowpoke','United States',1020,320,42); 1 row created. SQL> insert into topcoder.coder 2 (coder_id,handle,country_name,alg_rating,alg_vol,alg_num_ratings) values 3 (123457,'fastcoder','China',1941,100,13); 1 row created. SQL> commit; Commit complete. Now if we run our select statement again we will see that there is data in our table: SQL> select * from topcoder.coder; CODER_ID HANDLE ---------- -------------------- COUNTRY_NAME ---------------------------------------------------------------------- ALG_RATING ALG_VOL ALG_NUM_RATINGS ---------- ---------- --------------- 123456 slowpoke United States 1020 320 42 123457 fastcoder China 1941 100 13 If we now try to insert another record with the same primary key, watch what happens: SQL> insert into topcoder.coder 2 (coder_id,handle,country_name,alg_rating,alg_vol,alg_num_ratings) values 3 (123456,'slowpoke','United States',1020,320,42); insert into topcoder.coder * ERROR at line 1: ORA-00001: unique constraint (TOPCODER.SYS_C004124) violated This is why it is absolutely critical to use a primary key on each table. It ensures that you do not end up with duplicate records. Go ahead and delete the records out of the coder table, since we will be loading this table with real data shortly. Oracle's truncate command is a quick way to clear out all the records in a table. SQL> truncate table topcoder.coder; Table truncated. Now let's go ahead and create the other tables we will be using throughout this article. Be sure to run the last command below, which will create the primary key on the round_results table. SQL> create table topcoder.round_list ( 2 round_id number primary key 3 ,full_name varchar2(120) 4 ,short_name varchar2(50) 5 ,round_type_desc varchar2(100) 6 ,round_date date 7 ); Table created. SQL> SQL> create table topcoder.round_results ( 2 round_id number 3 ,coder_id number 4 ,room_id number 5 ,room_name varchar2(50) 6 ,paid number 7 ,old_rating number 8 ,new_rating number 9 ,new_vol number 10 ,num_ratings number 11 ,room_placed number 12 ,division_placed number 13 ,advanced varchar2(1) 14 ,challenge_points number 15 ,system_test_points number 16 ,defense_points number 17 ,submission_points number 18 ,final_points number 19 ,division varchar2(10) 20 ,problems_presented number 21 ,problems_submitted number 22 ,problems_correct number 23 ,problems_failed_by_system_test number 24 ,problems_failed_by_challenge number 25 ,problems_opened number 26 ,problems_left_open number 27 ,challenge_attempts_made number 28 ,challenges_made_successful number 29 ,challenges_made_failed number 30 ,challenge_attempts_received number 31 ,challenges_received_successful number 32 ,challenges_received_failed number 33 ,rated_flag varchar2(1) 34 ,level_one_submission_points number 35 ,level_one_final_points number 36 ,level_one_status varchar2(50) 37 ,level_one_time_elapsed number 38 ,level_one_placed number 39 ,level_one_language varchar2(50) 40 ,level_two_submission_points number 41 ,level_two_final_points number 42 ,level_two_status varchar2(50) 43 ,level_two_time_elapsed number 44 ,level_two_placed number 45 ,level_two_language varchar2(50) 46 ,level_three_submission_points number 47 ,level_three_final_points number 48 ,level_three_status varchar2(50) 49 ,level_three_time_elapsed number 50 ,level_three_placed number 51 ,level_three_language varchar2(50) 52 ); Table created. SQL> SQL> alter table topcoder.round_results add primary key (round_id,coder_id); Table altered. IV. PL/SQL In the package specification, we can list our global variables, type declarations, constants, and any procedure and function declarations that we want external callers to be able to call in our package. In other words, if we have a procedure or function in our package body that isn't declared in the specification, then this procedure or function cannot be called by code outside of our package -- the procedure or function is essentially "private" to our package. PL/SQL also supports polymorphism, through which the same function or procedure name can be declared multiple times with different signatures (i.e. different parameters). The coding conventions I use for PL/SQL are pretty simple, and based somewhat on the same ones Thomas Kyte (author of Expert Oracle and creator of asktom.oracle.com) uses. For parameters to procedures and functions I use a p_ prefix. For local procedure and function variables I use an l_ prefix. For global variables I use a g_ prefix. For type definitions I use a t_ prefix. And finally, for constants I use a c_ prefix. For procedure and function names I use underscores between key words (ex: load_coder_table). Since PL/SQL is not case sensitive, I use lowercase for everything. V. Loading RSS data First, we will create the PL/SQL topcoder.rss_util package we will be using to load the tables. The script you need to run to create the package is here. Go ahead and copy this script text and save it to a file somewhere on your machine (I saved mine as create_rss_util_pkg.sql). Next, you can run the script from SQL*Plus by using the 'at' sign ('@') and providing the path to the script (note that you do not need to put a semicolon after the end of the script name). In the example below, I saved the script to C:\temp\create_rss_util_pkg.sql, so modify the command below to point to wherever you saved the script (be sure to type in the set define off command, as that will prevent SQL*Plus from prompting you for a parameter) . SQL> set define off SQL> @c:\temp\create_rss_util_pkg.sql Package created. Package body created. The easiest way to load an RSS feed, assuming you don't run into firewall issues, is to just load it directly from the website to the Oracle table. That is the method we will be using. If you do have firewall issues then there are other methods you can use that require the use of an intermediate table, but that's beyond the scope of this article. In the case of our active algorithm coder list, when processing the RSS feed we want to insert the coder record (if we don't already have it in our coder table). If we do already have the record in our coder table, we want to update it so we get the latest information for the coder (rating, etc.). If you take a look at the load_coder_tbl procedure from the rss_util package we just created, you will see that we are using Oracle's createuri command to connect directly to the website and extract the active algorithm coder records. We are using a loop to iterate through each coder record, which allows us to determine whether to insert new coder records that we don't already have in our topcoder.coder table, or update records that already exist in the table. When we run this procedure for the first time, we won't have any records in our coder table so all the records will get loaded. For the Oracle SQL experts out there who are wondering why I didn't use Oracle's merge command to load the coder table, I actually tried to do so. It turned out to be about a minute slower than the technique I ended up using, however, which was to manually check to see if the coder record existed to determine whether to insert or update. A few more comments about this load_coder_table procedure:
We are now ready to load the active algorithm RSS feed, so type the following code into SQL*Plus: SQL> declare 2 begin 3 topcoder.rss_util.load_coder_tbl; 4 commit; 5 end; 6 / PL/SQL procedure successfully completed. Note that the script may take a minute or two to run, so don’t be alarmed. Hopefully you got the "PL/SQL procedure successfully completed" message. If you got errors, chances are good you ran into firewall issues. To verify that the table was actually loaded with coder records, run this SQL: SQL> select count(*) from topcoder.coder; COUNT(*) ---------- 5004 Your count will probably be different, but as long as it's not zero it means records were loaded. So now we have loaded our topcoder.coder table with active coders! If you want to pull your coder record (assuming you have competed in an algo competition in the last 6 months) run the following code and change 'dcp' to your own handle: SQL> select * from topcoder.coder where handle='dcp'; CODER_ID HANDLE ---------- -------------------- COUNTRY_NAME ---------------------------------------------------------------------- ALG_RATING ALG_VOL ALG_NUM_RATINGS ---------- ---------- --------------- 21684580 dcp United States 1069 306 35 There are many things you can do with this table. Would you like to know how many coders from Poland have a rating greater than or equal to 1500? Run this query: SQL> select count(*) from topcoder.coder where country_name='Poland' and alg_rating >= 1500; COUNT(*) ---------- 76 This rss_util package also contains some other procedures, namely, load_round_list_tbl and load_round_results_tbl, which we can use to load the round_list and round_results tables respectively. The load_round_list_tbl works similarly to the load_coder_tbl, that is, it either inserts or updates records, which results in our round_list table containing the latest round list information. So when we run this procedure the first time, all round list records will be loaded: SQL> declare 2 begin 3 topcoder.rss_util.load_round_list_tbl; 4 commit; 5 end; 6 / PL/SQL procedure successfully completed. There are two versions of the load_round_results_tbl procedure in our topcoder.rss_util package. The first version takes the round date as a parameter and the second version takes a round id as a parameter (this is an example of polymorphism). The way the round results RSS feed is designed is that you have to pass in a round id. So let’s say for example, that we want to load the round results for SRM 342. First we need to determine the round_id for SRM 342. We can get that information from our round_list table we just loaded using the following query: SQL> select round_id from topcoder.round_list where short_name = 'SRM 342'; ROUND_ID ---------- 10666 Now that we know the round_id, we're ready to call our load_round_results_tbl procedure plugging in the 10666 (round_id) value as the parameter: SQL> declare 2 begin 3 topcoder.rss_util.load_round_results_tbl(10666); 4 commit; 5 end; 6 / PL/SQL procedure successfully completed. But instead of having to go through the trouble of looking up the round id, we can use the other version of this procedure, which allows us to pass in a round date. Since SRM 342 took place on March 14 th, 2007, we would call the procedure as follows: SQL> declare 2 begin 3 topcoder.rss_util.load_round_results_tbl(to_date('20070314','YYYYMMDD')); 4 commit; 5 end; 6 / PL/SQL procedure successfully completed. If you look at the code for the load_round_results_tbl procedure, you will notice that we delete any existing round results data we have for the round before reloading it. This step enables us to reload the round data as many times as we would like, without having to worry about manually deleting the records since the procedure handles it for us. So in the case where the round data might change (due to cheaters being eliminated, etc.) you can just reload it by re-running the procedure. Now we're really starting to get some traction. We have a lot of useful information at our fingertips. The TopCoder site does an outstanding job of supplying various statistics about a match. However, suppose you want to see a statistic that's not on the site? Well now you can! For example, in SRM 342, let's find out which countries won the money and order it by the amount of money in descending order. SQL> select country_name 2 ,sum(r.paid) 3 from topcoder.round_results r 4 ,topcoder.round_list l 5 ,topcoder.coder c 6 where r.coder_id = c.coder_id 7 and r.round_id = l.round_id 8 and l.short_name = 'SRM 342' 9 group by country_name 10 having sum(r.paid) != 0 11 order by 2 desc 12 ; COUNTRY_NAME ---------------------------------------------------------------------- SUM(R.PAID) ----------- China 1407 United States 632 Russian Federation 581 . . (results truncated for brevity) . 26 rows selected. If the SQL in that command looks confusing, don't worry. We'll be covering SQL in more depth in an upcoming article. Hopefully, though, you can see from this example the power you can leverage with SQL queries. So now, whenever we want to load the latest SRM round information into our database, we can use the following steps:
Here's a complete example of how we could load the data for SRM 343 into our database: SQL> declare 2 begin 3 topcoder.rss_util.load_coder_tbl; 4 topcoder.rss_util.load_round_list_tbl; 5 topcoder.rss_util.load_round_results_tbl( 6 to_date('20070322','YYYYMMDD')); 7 commit; 8 end; 9 / PL/SQL procedure successfully completed. Note that if multiple rounds take place on a given date, the above code will load the information for all of those rounds. If you wanted to get really fancy, you could modify the topcoder.rss_util PL/SQL package and create your own procedure that would encapsulate all of these calls into a single procedure to simply load round information. I'll leave that as an exercise for the reader. If you prefer a GUI interface for viewing your query results, Oracle Express Edition does provide one. If you’re running Windows, just go to Start->Programs->Oracle Database 10g Express Edition->Go To Database Home Page. Login with your system account, and you will be taken to a page where you can select an SQL dropdown menu to run an SQL command. This tool does a much nicer job than SQL*Plus of presenting query results in an easier-to-read format. Another free GUI tool you may want to check out is Oracle’s SQL Developer tool. In Part 2, coming soon, we'll look at more key components of relational databases, including indexes, referential integrity, constraints, views, transactions, and triggers. Until then, enjoy exploring your new TopCoder competition database, and good luck! |
|