Scope of this guide

This is a guide for transitioning from standard SQL to the RAW programming language. It assumes basic familiarity with SQL, as concepts are occasionally described in counterpart to their behavior in SQL. It targets users who are familiar with SQL and wish to write their own queries in RAW. For more information and documentation on RAW please visit http://just-ask.raw-labs.com/docs.

Introduction

RAW is a programming language designed with query needs in mind. RAW is similar to SQL for most practical purposes. Like SQL, RAW is a declarative language. Unlike SQL, RAW is also pure, functional programming language, strongly typed, with extensive type inference and with explicit support for querying "raw" (i.e. dirty/messy data) and "complex" data (i.e. arbitrarily nested data structures). Programs written in RAW are queries that get compiled down to machine code before being executed, typically on a distributed cluster.

RAW gives preference to the syntax used in modern-day programming language instead of strictly following the SQL standard. As a result, RAW usage is more straightforward and usually feels more "modern" to the typical programmer. It is possible to run RAW in stricter SQL compatibility mode if needed.

Expressions

In RAW, queries are expressions, i.e. queries are evaluated to produce values. Unlike SQL, queries (expressions) can be used wherever a value is allowed. The following examples illustrate this behaviour:

The following query finds the students whose age is the same as the youngest professor:

SELECT * FROM students WHERE age = (MIN(SELECT age FROM professors))

Note that the student age is being compared with the result of an inner SELECT query.

The following query lists all student names, and for each adds a boolean flag indicating whether the student has the same age as the inner youngest professor:

SELECT name, (age = (MIN(SELECT age FROM professors))) AS same_age_as_youngest_prof FROM students

From a SQL perspective, some of these queries may not be strictly valid. In RAW, however, the behavior is simple and consistent: if a value is allowed, it can be replaced by a query that produces a value of the same type.

Data Types

RAW is a strongly typed language with extensive type inference. This means that while RAW checks that all expressions are valid (i.e. have valid types) before the query is executed, it is also capable of inferring types instead of asking the user to specify them.

RAW supports the following data types:

  • Primitive Types. Primitive types include integer, long, float, double (to represent numbers with different precision), char, string, boolean, date, time, timestamp, interval and regex (for regular expressions).
  • Record Type. A record type contains a sequence of attributes, where each attribute has a name and a type. For instance, "record(name: string, age: int)" represents a record with two fields: name (a string) and age (an integer).
  • Collection Type. The collection is a bag (also known as a multiset) of values. (In the remaining of the tutorial, the terms collection and bag are used interchangeably). Some examples:
    • "collection(record(name: string, age: int))" represents a collection of records, where each record has two fields: name (a string) and age (an integer). This example would be a "table" in SQL.
    • "collection(int)" is a collection of integers. Note that there is no "record". This type is impossible to represent in SQL.
  • Option Type. The option type represents a value that may or may not be present. This is similar to the idea of NULLable types in SQL but safer and more generic. (Examples are given later in the discussion of nullability.).

In addition, there are function types. These are discussed later as needed.

RAW vs SQL data types

The RAW type system is richer than SQL's. In SQL, queries must produce tables as their output while in RAW no such restriction exists. This flexibility expands the capabilities of the RAW query language. It is one of the main reasons that RAW handles complex data structures, such as nested data (XML/JSON) without needing to flat/preprocess the data. One consequence is that some queries type differently between RAW and SQL.

The following illustrates a few RAW queries and discusses the differences with SQL where applicable.


1 + 1

This query outputs the result value "2" of type integer. There is no equivalent query in SQL. SQL queries of the form "SELECT 1 + 1 ..." are not equivalent since these produce a SQL table with a single element, instead of a single integer.


2 * MAX(SELECT age FROM students)

This queries finds the oldest student and returns its age multiplied by two. It outputs a single result value of type integer.


MAX(SELECT age FROM students) = MIN(SELECT age FROM professors)

This query outputs a single boolean value - true or false - depending on whether the oldest student has the same age as the youngest professor.


(column1: "First column", column2: 42)

This query produces a record. The record has two attributes: "column1", whose type is string and value is "First Column" and "column2" whose type is integer and value is 42.


SELECT student.name, (SELECT prof.name FROM professors prof WHERE prof.age = student.age) FROM students student

This query finds all students, and for each, returns its name and the list of professor names with the same age as the student. This query has no direct equivalent in SQL and illustrates the support of nested data structures in RAW.


COLLECTION(1.1, 2.2, 3.3)

This query produces a collection of floating point numbers. This query has no equivalent in SQL, since SQL tables must always contain at least one column, while in this example there is no column.


COLLECTION((name: "Miguel", age: 36), (name: "Ben", age: 38), (name: "Cesar", age: 40), (name: "Nuno", age: 38))

This query produces a collection of four records, each with two fields: "name" of type string and "age" of type "integer". This is equivalent to a SQL table with two columns and four entries.

SELECT keyword

The SELECT keyword is used to query collections of data. It works similarly to its counterpart in SQL with two exceptions:

  • it can be used to query collections of data other than records;
  • it can be used to produce output data other than collections.

Examples common with SQL

The following queries can be performed in standard SQL and in RAW without any syntactic change and producing exactly the same result:


SELECT * FROM students WHERE age > 23

List all students older than 23.


SELECT decade, COUNT(*) FROM students WHERE age > 23 GROUP BY age / 10 AS decade

List number of students per "decade".


SELECT s.name as student_name, p.name as professor_name FROM students s, professors p WHERE s.age > p.age

List student and professor names where the student is older than the professor.

Examples supported in SQL

The following queries while syntactic equivalent to their SQL counterpart produce different result types in RAW.


MAX(SELECT age FROM students)

This query produces a single integer (the oldest student age) in RAW, i.e. the query evaluates to a single number. In SQL, the query produces a table with a single column and a single entry. That is, while in RAW the output type is "int", in SQL it would be "collection(record(max: int))".


SELECT age FROM students

This query produces a collection of integers, with the student ages. There is no record (i.e. "no column") in the output type. In SQL, this query produces a table with a single column. That is: while in RAW the output type is "collection(int)", in SQL it would be "collection(record(age: int))".


SELECT age AS student_age FROM students

This query produces exactly the same output in RAW as in SQL. The reason is the "AS student_age": since the query specifies an explicit attribute name with the AS clause, a record must be created to hold this attribute name. Therefore, in RAW and in SQL, the output type is "collection(record(student_age: int))".

Unsupported in SQL

This section describes features in RAW that are unsupported in SQL.

Querying Nested data

RAW supports arbitrarily nested data structures. These are not supported in standard SQL, although some vendors provide limited support through custom SQL extensions. In RAW, however, handling nested data does not require any additional syntax or specific functions: the syntax is uniform and coherent in all situations; in addition, all queries are supported without limitations.

Assume a data source called "sales" with the following JSON content:

[
    {"country": "CH",
     "products": [
         {"category": "Keyboard", "cost": 50},
         {"category": "Keyboard", "cost": 70},
         {"category": "Monitor", "cost": 450}]},
    {"country": "US",
     "products": [
        {"category": "Keyboard", "cost": 20},
        {"category": "Monitor", "cost": 200}]}
]

To find the products that cost more than a certain threshold:

SELECT p FROM sales s, s.products p WHERE p.cost > 60

The output result (in JSON) is:

[
    {"category": "Keyboard", "cost": 70},
    {"category": "Monitor", "cost": 450},
    {"category": "Monitor", "cost": 200}
]

The syntax "FROM sales s, s.products p" means that every sale is assigned to "s" and then every product of a sale is assigned to the "p". (More formally, every element of the collection "sales" is assigned to the identifier "s" and every element of the inner collection "products" is assigned to the identifier "p".) The input data is a nested data structure (unsupported in SQL) but the output is a collection of records of primitive types (supported in SQL). It is an example of how RAW can be used to input nested data and output tabular SQL-friendly data.


To find the average prices of keyboards per country:

SELECT s.country, AVG(SELECT cost FROM s.products WHERE category="Keyboard") AS avg_cost FROM sales s

The output result (in JSON) is:

[
    {"country": "CH", "avg_cost": 60},
    {"country": "US", "avg_cost": 20}
]

The query reads the source "sales", which is a collection of records with fields "country" (type string) and "products" (type collection(record(category: String, cost: int))). It assigns every element of the collection to the identifier "s". It then outputs the value given by s.country and computes the average cost of products per sale. The inner SELECT query reads from "s.products", i.e. from the list of products in the sale "s" and filters those products whose category is "Keyboard".


Similarly, to find the country names whose average sale prices of keyboards exceeds a certain threshold:

SELECT s.country AS country_name FROM sales s WHERE AVG(SELECT cost FROM s.products WHERE category="Keyboard") > 40

The output result (in JSON) is:

[
    {"country_name": "CH"}
]

Outputting Nested data

In addition to querying nested data, RAW can also output arbitrarily nested data. This section contains a few examples:


To find countries and products where the product cost exceeds a threshold:

SELECT s.country AS sale_country, p AS product FROM sales s, s.products p WHERE p.cost > 60

The output result (in JSON) is:

[
  {
    "sale_country": "CH",
    "product": {
      "category": "Keyboard",
      "cost": 70
    }
  },
  {
    "sale_country": "CH",
    "product": {
      "category": "Monitor",
      "cost": 450
    }
  },
  {
    "sale_country": "US",
    "product": {
      "category": "Monitor",
      "cost": 200
    }
  }
]

In the SELECT clause, two fields are projected. The first is "s.country AS sale_country", which is a string. The second is "p AS product", which contains nested data.

GROUP BY

Due to RAW's support for nested data, the GROUP BY clause in RAW works differently from SQL. The following queries illustrate the different. These re-use the "sales" data example from the previous sections. None of the queries described below is possible in SQL since the original data is nested.

Let's find the products by category:

SELECT category, * AS products FROM sales s, s.products p GROUP BY p.category AS category

The * in the SELECT projection may appear strange at first as it has no equivalent in SQL. In the case of GROUP BY, a series of groups are created, where each group contains the original items that share the GROUP BY condition. In the case of "GROUP BY p.category", there is a group created per "p.category" containing all the original items that have the same "p.category". The group itself is assigned to *, which is an identifier that can be used freely in the query.

A few additional examples will clarify this feature, and shown its potential:


Let's find the product costs per category:

SELECT category, (SELECT cost FROM *) AS costs FROM sales s, s.products p GROUP BY p.category AS category

The output result (in JSON) is:

[
  {
    "category": "Keyboard",
    "costs": [50, 70, 20]
  },
  {
    "category": "Monitor",
    "costs": [450, 200]
  }
]

This query builds on the one above, doing an additional query on the *, which are the grouped data, to retrieve only the cost field. Again note that * is a "collection of data" that can be queried with the SELECT clause as any other collection in RAW.


Let's find the most expensive cost per category:

SELECT category, MAX(SELECT cost FROM *) AS most_expensive FROM sales s, s.products p GROUP BY p.category AS category

Note the use of "MAX(SELECT cost FROM *)". There are two aspects being combined: the MAX(...) and the *.

MAX(...) in RAW is an expression that takes a list of numbers and returns the greatest number. For instance, the following query is valid in RAW. Its output is 42.

MAX(COLLECTION(10,20,42))

As described earlier, in the case of a SELECT with a GROUP BY clause, the * means the "data in each group". Furthermore, the inner "SELECT cost FROM *" creates a collection of numbers, which are the costs in each group. The MAX of that collection is the maximum cost.


To count the number of sales per category:

SELECT category, COUNT(*) AS nsales FROM sales s, s.products p GROUP BY p.category AS category

The output result (in JSON) is:

[
  {
    "category": "Keyboard",
    "nsales": 3
  },
  {
    "category": "Monitor",
    "nsales": 2
  }
]

This query follows a pattern similar to the ones above. It "looks" semantically equivalent to an SQL query. However, the query creates a group per category, and then outputs the category name, while counting the number of elements in the group.

Note that the COUNT(...) function in RAW is also general-purpose. The following is a valid query: its input a collection of strings and outputs the value 3, which is the size of the collection.

COUNT(COLLECTION("A", "B", "C"))

The remainder of this section includes additional examples.


Let's find all sales by category, while counting them, finding the most expensive cost, as well as those that are somewhat expensive:

SELECT category,
        * AS sales,
        COUNT(*) AS nsales,
        MAX(SELECT cost FROM *) AS most_expensive,
        (SELECT category FROM * WHERE cost > 50) AS somewhat_expensive
FROM sales s, s.products p
GROUP BY p.category AS category

A subset of the output (in JSON) is:

[
  {
    "category": "Keyboard",
    "sales": [
      {
        "country": "CH",
        "products": [
          {
            "category": "Keyboard",
            "cost": 50
          },
          {
            "category": "Keyboard",
            "cost": 70
          },
          {
            "category": "Monitor",
            "cost": 450
          }
        ],
        "category": "Keyboard",
        "cost": 50
      },
      {
        "country": "CH",
        "products": [
          {
            "category": "Keyboard",
            "cost": 50
          },
          {
            "category": "Keyboard",
            "cost": 70
          },
          {
            "category": "Monitor",
            "cost": 450
          }
        ],
        "category": "Keyboard",
        "cost": 70
      },
      {
        "country": "US",
        "products": [
          {
            "category": "Keyboard",
            "cost": 20
          },
          {
            "category": "Monitor",
            "cost": 200
          }
        ],
        "category": "Keyboard",
        "cost": 20
      }
    ],
    "nsales": 3,
    "most_expensive": 70,
    "somewhat_expensive": [
      "Keyboard"
    ]
  },
  ...

This query illustrates the consistent behavior in RAW: the * is a collection of data that can itself be queried freely or returned as a result.


To count the number of sales per country and category:

SELECT country,
       category,
       COUNT(*) AS nsales
FROM sales s, s.products p
GROUP BY s.country AS country, p.category AS category

The output result (in JSON) is:

[
  {
    "country": "CH",
    "category": "Keyboard",
    "nsales": 2
  },
  {
    "country": "US",
    "category": "Keyboard",
    "nsales": 1
  },
  {
    "country": "US",
    "category": "Monitor",
    "nsales": 1
  },
  {
    "country": "CH",
    "category": "Monitor",
    "nsales": 1
  }
]

This query is also not possible in SQL since the product category is a nested data structure.

Assignments

It is often convenient to build a complex query in parts. To facilitate this process, RAW includes assignments into the language. Example:

{
  youngest_professor := MIN(SELECT age FROM professors);

  SELECT * FROM students WHERE age > youngest_professor
}

The query finds all students that are older than the youngest professor. The assignment of "youngest_professor" does NOT imply that one query is done after another. In fact, RAW will decide whether the queries are uncorrelated and create an execution plan independently of how the query is written. The assignment is a convenience for the developer and does not directly influence how the query gets executed.


Another example:

{
  young_professor_age := 30;

  young_professors := SELECT * FROM professors WHERE age <= young_professor_age;

  SELECT * FROM students s, young_professors p WHERE s.age > p.age
}

The query joins students with professors, choosing those combinations where students are older than young professors.

Functions and Polymorphism

In addition to assignments, RAW also allows users to build reusable functions that process data. Example:

{
  filter_young(data, young_age) := SELECT * FROM data d WHERE d.age <= young_age;

  SELECT * FROM filter_young(professors, 30)
}

The function "filter_young" takes two arguments: "data" and "young_age". The SELECT filters those elements of "data" where the age is smaller or equal to "young_age". Note that each element of "data" is assigned to "d", and the filter is on "d.age". This is mandatory so that the system knows that "age" is a field of "d", which itself is an element of "data". If the filter were written as "WHERE age <= young_age", RAW would raise a type error since it does not know where "age" is defined.

This query illustrates an important feature of the RAW type system called "type inference". This means that the function "filter_young" is not parameterized with types by the user, which would be the case in SQL. Instead, RAW infers the types itself. In this case, RAW infers that "data" must be a collection - since it is used in a FROM clause - and that it must have a field called "age", since it is filtered by it, and the field "age" must be a comparable (e.g. a number in this case).

The formal name for this feature is "polymorphic type inference" and it is a key part of the RAW type system. The goal is to provide:

  • User convenience: In the case of complex nested data structures, it would be extremely inconvenient to force the user to specify the types;
  • Code reusability: The function "filter_young" can be used on any table that has a field "age" which is comparable. Code is written once and reused many times;
  • Correctness: RAW checks that the query is valid even before the query is run. This is in contrast with dynamic scripting languages, like Python, which provide similar flexibility but may crash at runtime;
  • Performance: RAW can generate optimal code because it knows the precise type information.

The user may optionally specify the argument types if so desired. This is not required in practice but may serve as documentation in some situations, e.g.:

{
  filter_young(data: collection(record(name: string, age: int)),
               young_age: int) :=
      SELECT * FROM data d WHERE d.age <= young_age;

  SELECT * FROM filter_young(professors, 30)
}

The type inference in RAW covers additional cases. For instance, to build a function that groups "things" by age:

{
  group_by_age(data) := SELECT d.age, * FROM data d GROUP BY d.age;

  (group_by_age(students), group_by_age(professors))
}

In the example "group_by_age" is a function that takes "data" and groups the data by field "age". The output is a record, whose first field as the students grouped by age, and in the second field has the professors grouped by age. This function can be reused for "students" and "professors" since they both have a field "age". It cannot be used for collections that do not have a field "age": RAW will report an error to the user.

Parsing data

RAW includes built-in support for parsing logs and text files with convenient syntax. The keyword is PARSE AS which takes a regular expression as its argument and is typically used in combination with the SELECT keyword as follows:

Suppose an input log, called "log_file", with the following contents:

2016-04-01 08:00 ServiceA - Event X happened
2016-04-01 08:01 ServiceB - Restarted

To query this data, use PARSE AS and pass a regular expression (with r"""..."""). Each group in the regular expression becomes a field. A group in a regular expression the part being "captured" by the regular expression, which is wrapped in parenthesis. Learn more about regular expressions.

SELECT * FROM log_file PARSE AS r"""(\d+-\d+-\d+\s+\d+:\d+)\s+(\w+)\s+-\s+(.*)"""

The result in JSON:

[
  {
    "_1": "2016-04-01 08:00",
    "_2": "ServiceA",
    "_3": "Event X happened"
  },
  {
    "_1": "2016-04-01 08:01",
    "_2": "ServiceB",
    "_3": "Restarted"
  }
]

The data is now split per fields. However, fields still have no friendly-names - the auto-generated names are _1 for the first field, _2 for the second, etc - and still do not have the proper type. To give them names and to convert to a type other than string, use the INTO keyword as follows:

SELECT *
FROM log_file
PARSE AS r"""(\d+-\d+-\d+\s+\d+:\d+)\s+(\w+)\s+-\s+(.*)"""
INTO (time: strtodate(_1, "yyyy-MM-dd HH:MM"), service: _2, event: _3)

The result in JSON is now:

[
  {
    "time": "2016-04-01T08:00:00.000Z",
    "service": "ServiceA",
    "event": "Event X happened"
  },
  {
    "time": "2016-04-01T08:01:00.000Z",
    "service": "ServiceB",
    "event": "Restarted"
  }
]

In the example, the INTO keyword takes a record constructor as its argument. The field names of the new record as "time", "service" and "event". The "time" field is _1, the first group of the regular expression, converted from string to date. The "service" field is _2, the second group of the regular expression. The "event" field is _3, the third group of the regular expression. This data can now be queried as usual.

The usual practice is to place the parsing code in a function and then reuse it as needed:

{
  parse_log(data) := SELECT * FROM log_file
                              PARSE AS r"""(\d+-\d+-\d+\s+\d+:\d+)\s+(\w+)\s+-\s+(.*)"""
                              INTO (time: strtodate(_1, "yyyy-MM-dd HH:MM"), service: _2, event: _3);
  
  SELECT * FROM parse_log(log_file)
}

Null or Unknown Values

In SQL, "null" represents not a data value but a marker for an absent or unknown value. In addition, SQL includes a feature called "null propagation". In SQL summing "1 + NULL" should be read as "1 plus unknown" whose result is also unknown, i.e. "NULL".

While a convenient feature, SQL does not coherently follow the "three-valued logic". This is a source of controversy, common mistakes and errors.

Thus far, in RAW, we opted not to support "null propagation" and replace it by a safer mechanism. This decision is also combined with our stated goal of handling dirty/messy data in RAW, where unknown/null values are more commonplace.

RAW uses the "option type" for absent data. Any data value - whether it is a primitive type, like an integer, or a collection, or a record - may be present or absent.

If RAW finds data to be missing option types are automatically inferred from the data sources. Let's assume the following data, called "sales", in CSV format:

ProductID,Price,Seller
Keyboard,40,Miguel
Keyboard,50,-
Monitor,200,Ben
Mouse,20,-

The data has missing values for the "Seller", where the missing values are represented by the -. When registering the source, the user may choose the representation for missing values. In this case, the user should set the file type to CSV during registration, then click on "Advanced Properties" and entering "-" (without quotes) in the field "Nulls". This forces RAW to infer the source data type to be "collection(record(ProductID: string, price: int, Seller: option(string)))". Note the "option(string)" as the type of Seller. This means the Seller is "optionally" a string: it is either a string, or it is absent.

To query this data:

SELECT * FROM sales WHERE Seller IS NULL

The query above works similarly to the SQL version. But unlike SQL, the following query is also valid and unambiguous:

SELECT * FROM sales WHERE Seller = NULL

On the other hand, if the comparison with NULL is against any field that is not nullable, RAW will raise an error. This is considerably safer than SQL, where anything can be compared with NULL, including non-nullable fields, leading to occasional mistakes. The following is NOT valid in RAW since price is not an option type:

SELECT * FROM sales WHERE Price = NULL

A consequence of RAW's missing "null propagation" is that users are forced to handle option values explicitely. For instance, the following is invalid:

SELECT * FROM sales WHERE Seller = "Miguel"

The reason is that "seller" is an option, and options cannot be strings. Option values can contain string values, but are not a string themselves. To do the comparison above, there are a few alternatives. The first is to provide an default value to be used in case the original value is not defined:

SELECT * FROM sales WHERE ISNULL(Seller, "N/A") = "Miguel"

The value "N/A" is used if the Seller is not defined. Otherwise, the seller name is used. We can re-buiild the entire table without undefined values with this syntax:

SELECT ProductID, Price, ISNULL(Seller, "N/A") FROM sales

The result (in CSV) of the query is:

ProductID,Price,Seller
Keyboard,40,Miguel
Keyboard,50,N/A
Monitor,200,Ben
Mouse,20,N/A

Other Differences compared with SQL

There are a few other minor differences between RAW and SQL. These are primarily aesthetic and serve to "modernize" the look-and-feel of RAW compared with modern languages.

The first is the use of quotes. SQL uses " for column names. RAW uses " for strings. In addition, strings that include " can be escaped with triple quotes. This change is common with most modern programming languages:

"""This string has "quotes" inside"""

For field names that include spaces, use ` in RAW.

(`a field name with spaces`: 10)

Instead of CASE statements, RAW prefers IF/THEN/ELSE syntax:

IF (true) THEN "ok" ELSE "not ok"

All these are syntactic changes. These can be reverted if stricter SQL compatibility is desired.

Summary and Next Steps

This document describes RAW by comparing its behaviour with that of standard SQL. It discusses a set of differences between RAW and SQL. All differences fall in one of the following categories:

  • Support for nested data: e.g. eliminate the need to "flatten" hierarchical data; reduce the need for expensive OUTER JOINs in SQL; extensions to the GROUP BY clause;
  • Convenience to the user: e.g. assignments and polymorphic functions;
  • Correctness and safety: e.g. option types, type inference;
  • Performance: e.g. collections of primitive types without encapsulating records; extensions to the GROUP BY clause to perform multiple queries in a single aggregation; polymorphic functions and type inference to generate optimal code;
  • Modern syntax

For the next steps and documentation please visit http://just-ask.raw-labs.com/docs.