Skip to content

Regular expression overview

Regular expressions provide a powerful way to match text patterns. You can use simple wildcards (such as %, _) in LIKE statements, but regular expressions give you more flexibility and matching options.

Usage scenarios

Regular expressions can perform complex string matching and operations. The following lists some common usage scenarios:

-Data Verification: Regular expressions can be used to verify that data meets a specific format, such as checking whether a field contains a valid email address, phone number, or social security number, etc. For example, you can use regular expressions to find all email addresses that are not in compliance with the specification:

```sql
SELECT email
FROM users
WHERE email NOT REGEXP '^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}$';
```

-Data Filter: Regular expressions can be used to search for data containing or not containing a specific pattern. For example, if you want to find out all phone numbers that start with a specific prefix, you can use regular expressions to match those numbers:

```sql
SELECT phone_number
FROM users
WHERE phone_number REGEXP '^180';
```

-Data Cleaning: Regular expressions are also very useful in data cleaning. For example, you may need to clear special characters, spaces, or extract numbers from strings, etc., and regular expressions can easily implement these functions:

```sql
SELECT REGEXP_REPLACE(name, '[^a-zA-Z]', '')
FROM users;
```

This SQL segment will return the result that all non-letter characters in the `name` field are replaced.

-Text Analysis: If you are doing text analysis, regular expressions can help you find out specific vocabulary or phrases in your text, and can even be used for certain forms of natural language processing.

Special characters

Regular expressions use POSIX (Portable Operating System Interface) to extend regular expressions. Here are some special characters to pay attention to:

character description
. Match any single character (except line breaks)
* Indicates that the previous element can be repeated any time (including zero times)
+ Indicates that the previous element can be repeated once or more times.
? Indicates that the previous element can be repeated zero or once.
{n} Indicates that the previous element can be repeated n times.
{n,} Indicates that the previous element can be repeated n times or more.
{n,m} Indicates that the previous element can be repeated n times to m times.
^ Indicates the beginning of the matching string.
$ Indicates the end of the matching string.
[abc] means match a, b, or c.
[^abc] Indicates matching any character that is not a, b, or c.
(abc|def) Indicates matching abc or def.
\d Indicates matching numbers
\s Indicates matching whitespace characters
\w Indicates matching word characters
\D, \S and \W means matching the opposite character set

Note

In regular expressions, these special characters need to be escaped using \, for example \\. means matching an actual . character, not any character.

MatrixOne regular expressions are case sensitive by default. If you want to make case-insensitive matches, you can use the syntax such as REGEXP_LIKE(column, pattern, 'i'). where i means case insensitive.

Reference Document

Regular expression function list

Name Definition
NOT REGEXP Negative expression of REGEXP
REGEXP_INSTR() Match the substring start index of regular expressions
REGEXP_LIKE() Does the string match the regular expression
REGEXP_REPLACE() Replace substring matching regular expressions
REGEXP_SUBSTR() Returns the substring matching the regular expression