PostgreSQL introduction by example

Introduction

This document is a short introduction to PostgreSQL. It aims to get the reader using PostgreSQL quickly.
Its target audience is primarily developers who have experience with MySQL. I would like to point out that I am not going to debate which one is better.
First we are going to install the PostgreSQL server on an Ubuntu 16.04 machine. A tutorial on creating a PostgreSQL user follows, and then a small introduction to the psql command line client. After that comes an example of how to create a table, populate it, and perform queries, followed by a tutorial on basic usage of EXPLAIN and how it can hint at which indexes we should create. Finally I will show you one way to perform suffix searches efficiently.

Installation

PostgreSQL installation on Ubuntu 16.04 is a single command:

sudo apt-get install -y postgresql

The above command installs the PostgreSQL server and also creates a system user called postgres. To run administration commands you need to log in as this user.

ubuntu@postgres-tut:/vagrant$ cat /etc/passwd|grep postgres
postgres:x:112:118:PostgreSQL administrator,,,:/var/lib/postgresql:/bin/bash
ubuntu@postgres-tut:/vagrant$ 

Now set a password for this user:

sudo passwd postgres

Create a user/role

Switch to the postgres user:

su - postgres

Then create a user (role):

createuser user1 -P -e

- the -e argument prints the SQL query executed
- the -P argument prompts for a password for the new role

postgres@postgres-tut:~$ createuser user1 -P -e
Enter password for new role: 
Enter it again: 
CREATE ROLE user1 PASSWORD 'md5cc120d94f7666830c6215fbe6e1bfd52' NOSUPERUSER NOCREATEDB NOCREATEROLE INHERIT LOGIN;
postgres@postgres-tut:~$ 

An alternative way to create a user is using psql (the PostgreSQL interactive terminal):

psql -c "CREATE USER user2 WITH PASSWORD '1234';

This can be useful when you want to create the user from a bash script.
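For instance, a minimal provisioning snippet could look like the following; note that the role name app_user and the password are placeholders of my own, not something the tutorial defines:

#!/bin/bash
# Create the role only if it does not already exist (role name/password are examples).
sudo -u postgres psql -tc "SELECT 1 FROM pg_roles WHERE rolname = 'app_user'" \
    | grep -q 1 || sudo -u postgres psql -c "CREATE USER app_user WITH PASSWORD 'secret'"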

To list the users you can execute the following SQL:

psql -c "select * from pg_catalog.pg_user"
postgres@postgres-tut:~$ psql -c "select * from pg_catalog.pg_user"
usename  | usesysid | usecreatedb | usesuper | userepl | usebypassrls |  passwd  | valuntil | useconfig 
----------+----------+-------------+----------+---------+--------------+----------+----------+-----------
 postgres |       10 | t           | t        | t       | t            | ******** |          | 
 test     |    16384 | f           | f        | f       | f            | ******** |          | 
 user1    |    16385 | f           | f        | f       | f            | ******** |          | 
 user2    |    16386 | f           | f        | f       | f            | ******** |          | 
(4 rows)

postgres@postgres-tut:~$ 
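Alternatively, the \du meta-command of psql lists the roles together with their attributes:

psql -c '\du'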

Create a database

Let’s now create a database to play around with:

createdb -O user1 user1_db -e

The above command creates a database named user1_db which is owned by the user user1. It also uses the default encoding.
To see the default encoding you can execute:

postgres@postgres-tut:~$ psql -c 'SHOW SERVER_ENCODING'
 server_encoding 
-----------------
 UTF8
(1 row)

postgres@postgres-tut:~$ 

You can also create a database with the CREATE DATABASE SQL command.
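For example, the following should be equivalent for user2 (the database name user2_db here is made up):

psql -c "CREATE DATABASE user2_db OWNER user2 ENCODING 'UTF8';"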

Interactive terminal

Switch back to your normal user. Now you can connect to PostgreSQL using the command psql -h 127.0.0.1 -U user1 -d user1_db

ubuntu@postgres-tut:/vagrant$ psql -h 127.0.0.1 -U user1 -d user1_db
Password for user user1: 
psql (9.5.9)
SSL connection (protocol: TLSv1.2, cipher: ECDHE-RSA-AES256-GCM-SHA384, bits: 256, compression: off)
Type "help" for help.

user1_db=> 

Create table

Let’s create a table to store, for example, domain names.

Our table will be named domains and it will consist of two columns.

We need id to be an auto increment value (as in MySQL) and we decide that our domains will be at most 100 characters long.

 CREATE TABLE domains(id SERIAL PRIMARY KEY, domain VARCHAR(100));
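Note that SERIAL is not a real type but a convenience shorthand; to the best of my knowledge the statement above roughly expands to this (illustrative only, do not run both):

-- what SERIAL does behind the scenes
CREATE SEQUENCE domains_id_seq;
CREATE TABLE domains(
    id INTEGER NOT NULL DEFAULT nextval('domains_id_seq') PRIMARY KEY,
    domain VARCHAR(100)
);
ALTER SEQUENCE domains_id_seq OWNED BY domains.id;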

You can see the created tables (SHOW TABLES in MySQL):

user1_db=> \dt
        List of relations
 Schema |  Name   | Type  | Owner 
--------+---------+-------+-------
 public | domains | table | user1
(1 row)

user1_db=> 

A rough equivalent of MySQL’s SHOW CREATE TABLE domains is:

user1_db=> \d domains
                               Table "public.domains"
 Column |          Type          |                      Modifiers                       
--------+------------------------+------------------------------------------------------
 id     | integer                | not null default nextval('domains_id_seq'::regclass)
 domain | character varying(100) | 
Indexes:
    "domains_pkey" PRIMARY KEY, btree (id)

user1_db=> 

Insert into table

Now we are going to insert a domain into the table:

user1_db=> INSERT INTO domains(domain) VALUES('gksoftware.eu');
INSERT 0 1
user1_db=> select * from domains;
 id |    domain     
----+---------------
  1 | gksoftware.eu
(1 row)

user1_db=> INSERT INTO domains(domain) VALUES('gkomninos.com');
INSERT 0 1
user1_db=> select * from domains;
 id |    domain     
----+---------------
  1 | gksoftware.eu
  2 | gkomninos.com
(2 rows)

user1_db=> 

Multi-row inserts are also supported (just like in MySQL):

INSERT INTO table(col1, ..., coln) VALUES
(val1, ..., valn),
(val1, ..., valn);
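For example, inserting two more (made-up) domains in one statement:

INSERT INTO domains(domain) VALUES
('example.org'),
('example.net');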

Let’s try to insert a domain that exceeds 100 characters:

user1_db=> INSERT INTO domains(domain) VALUES('abababababababababaabababaababababagkomninos898u98ninundasdasasdsadsadsadsadasdsadsadsadasdadsad');                                                              
ERROR:  value too long for type character varying(100)
user1_db=> 

Notice the difference from MySQL, which by default truncates the data. PostgreSQL does not allow the insert at all.

Empty the table

To empty the table we use:

user1_db=> TRUNCATE TABLE domains;
TRUNCATE TABLE
user1_db=> 
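Note that plain TRUNCATE does not reset the sequence backing the id column; if you also want the ids to start from 1 again, you can use:

TRUNCATE TABLE domains RESTART IDENTITY;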

Add more data

Download the DMOZ dataset from here (https://dataverse.harvard.edu/file.xhtml;jsessionid=a951ca322132ac28a7de5868d841?fileId=2847112&version=RELEASED&version=.1)

Uncompress the file using 7z:

7z x parsed-domain.csv.7z

This is a CSV file with comma (,) as the separator. The first column is the domain.

Let’s create a new file with only the domains:

cut -d ',' -f1 parsed-domain.csv > domains.txt

There are 2,340,768 domains in our file:

ubuntu@postgres-tut:/vagrant$ wc -l domains.txt 
2340768 domains.txt
ubuntu@postgres-tut:/vagrant$ 

Now we are going to import the data into PostgreSQL using the \copy command:

user1_db=> \copy domains(domain) FROM '/vagrant/domains.txt';
COPY 2340768
user1_db=> 
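The server-side COPY command below is equivalent, but note the difference: \copy reads the file with the client’s permissions, while COPY makes the server process read the file itself and therefore (on this version) requires superuser rights:

COPY domains(domain) FROM '/vagrant/domains.txt';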

Queries

Now that the domains are loaded in the domains table, we can start performing some queries.

user1_db=> select count(id) from domains;
 2340768

user1_db=> 

The above query returns the number of rows in the table. It is the same as the number of lines of the domains.txt file, as expected.

user1_db=> SELECT * FROM domains WHERE domain LIKE 'gksoftware.eu';

user1_db=> 

As you can see, my company’s domain is not in the table.

user1_db=> SELECT * FROM domains WHERE domain LIKE 'gelbeseiten.de';
409169 | gelbeseiten.de
user1_db=> 

and gelbeseiten.de was found.

user1_db=> SELECT * from domains WHERE domain LIKE 'abcdef%';
1080757 | abcdefg-record.net
user1_db=> 

The above returns all the domains starting with abcdef.

user1_db=> SELECT COUNT(id) from domains WHERE domain LIKE '%.gr';
6826
user1_db=> 

And here we got the number of rows whose domains have the .gr TLD.

user1_db=> SELECT * from domains WHERE domain LIKE '%tour%.com.cy';
660234 | ecotouradventures.com.cy
user1_db=> 

The above query returns the rows in which the domain contains the string tour and ends with .com.cy.

Using EXPLAIN

Now we are going to examine how the above queries are executed, in an effort to optimize them.

user1_db=> EXPLAIN SELECT * from domains WHERE domain LIKE 'gksoftware.eu';
 Seq Scan on domains  (cost=0.00..44233.60 rows=1 width=20)
    Filter: ((domain)::text ~~ 'gksoftware.eu'::text)

user1_db=> 
user1_db=> EXPLAIN SELECT * from domains WHERE domain LIKE 'gelbeseiten.de';                                                                                                                                                                  
 Seq Scan on domains  (cost=0.00..44233.60 rows=1 width=20)
    Filter: ((domain)::text ~~ 'gelbeseiten.de'::text)

user1_db=> 
user1_db=> EXPLAIN SELECT * from domains WHERE domain LIKE '%.gr';
 Seq Scan on domains  (cost=0.00..44233.60 rows=23644 width=20)
    Filter: ((domain)::text ~~ '%.gr'::text)

user1_db=> 
user1_db=> EXPLAIN SELECT * from domains WHERE domain LIKE 'abc%';
 Seq Scan on domains  (cost=0.00..44233.60 rows=234 width=20)
    Filter: ((domain)::text ~~ 'abc%'::text)

user1_db=> 

The above output says that PostgreSQL is scanning all rows (a sequential scan) to obtain the results.

We can add some indexes to do better.

Create Indexes

We do not want to store duplicate domains, and we would also like to improve the query speed. For this we can add a unique index to our table.

user1_db=> CREATE UNIQUE INDEX domain_unq_idx ON domains(domain);
CREATE INDEX
user1_db=> 

Creating an index can be a slow operation especially in very large tables.
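As a side note, declaring a UNIQUE constraint has the same effect, since PostgreSQL creates the same kind of unique index behind the scenes:

ALTER TABLE domains ADD CONSTRAINT domains_domain_key UNIQUE (domain);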

Let’s now look at EXPLAIN’s output:

user1_db=> EXPLAIN SELECT * from domains WHERE domain LIKE 'gelbeseiten.de';
 Index Scan using domain_unq_idx on domains  (cost=0.43..8.45 rows=1 width=20)
    Index Cond: ((domain)::text = 'gelbeseiten.de'::text)
    Filter: ((domain)::text ~~ 'gelbeseiten.de'::text)
user1_db=> 

This seems much better; the cost dropped from 44233.60 to 8.45. Let’s now try a prefix search:

user1_db=> EXPLAIN ANALYZE SELECT * from domains WHERE domain LIKE 'gelbeseiten%';
Seq Scan on domains  (cost=0.00..44233.60 rows=234 width=20) (actual time=48.980..220.585 rows=2 loops=1)
    Filter: ((domain)::text ~~ 'gelbeseiten%'::text)
    Rows Removed by Filter: 2340766
Planning time: 0.121 ms
Execution time: 220.627 ms

user1_db=> 

Here I used EXPLAIN ANALYZE. As you can see, we are still doing a sequential scan.

To speed up prefix searches we can add another type of index:

CREATE INDEX domain_varchar_idx ON domains (domain varchar_pattern_ops);

Let’s see how it performs now:

user1_db=> EXPLAIN ANALYZE SELECT * from domains WHERE domain LIKE 'gelbeseiten%';
 Index Scan using domain_varchar_idx on domains  (cost=0.43..8.45 rows=234 width=20) (actual time=0.120..0.128 rows=2 loops=1)
    Index Cond: (((domain)::text ~>=~ 'gelbeseiten'::text) AND ((domain)::text ~<~ 'gelbeseiteo'::text))
    Filter: ((domain)::text ~~ 'gelbeseiten%'::text)
 Planning time: 0.610 ms
 Execution time: 0.176 ms

user1_db=> 

That is much better.

But now look at this suffix search:

user1_db=> EXPLAIN ANALYZE SELECT * from domains WHERE domain LIKE '%.gr';
Seq Scan on domains  (cost=0.00..44233.60 rows=23644 width=20) (actual time=0.258..300.859 rows=6826 loops=1)
    Filter: ((domain)::text ~~ '%.gr'::text)
    Rows Removed by Filter: 2333942
Planning time: 0.141 ms
Execution time: 301.266 ms
user1_db=> 

We are still checking all rows. The reason is that the index we added only works for prefix searches. Here we want to return all rows ending with .gr, so the index cannot be used.

Performing suffix searches

Since we are storing domains, we could also store them reversed, like
gksoftware.eu -> ‘ue.erawtfoskg’

This way we can use a prefix index as above. So, to get all domains with the .eu TLD, we can search for ue.% instead.

Note that this means storing each domain twice (once reversed and once as-is).
Anyway, let’s do that to speed up our queries.

First we need to add another column to the table:

user1_db=> ALTER TABLE domains ADD COLUMN domain_reversed varchar(100) NOT NULL DEFAULT '-';
ALTER TABLE
user1_db=> UPDATE domains SET domain_reversed = REVERSE(domain);
UPDATE 2340768
user1_db=> ALTER TABLE domains ALTER COLUMN domain_reversed DROP DEFAULT;
ALTER TABLE
user1_db=>  CREATE INDEX domains_rev_idx ON domains (domain_reversed varchar_pattern_ops);
CREATE INDEX

What we did above:
- added a column domain_reversed with a default value of -
- filled it with the reversed domains
- removed the default value
- created an index on the new column
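From now on every INSERT or UPDATE must fill domain_reversed as well. One way to automate that is a trigger; here is a minimal sketch (the function and trigger names are my own invention):

CREATE OR REPLACE FUNCTION set_domain_reversed() RETURNS trigger AS $$
BEGIN
    -- keep the reversed copy in sync with the domain column
    NEW.domain_reversed := reverse(NEW.domain);
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER domains_reverse_trg
    BEFORE INSERT OR UPDATE OF domain ON domains
    FOR EACH ROW EXECUTE PROCEDURE set_domain_reversed();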

Let’s see now our query

user1_db=> EXPLAIN ANALYZE SELECT * from domains WHERE domain_reversed LIKE reverse('%.gr');
                                                           QUERY PLAN                                                           
--------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on domains  (cost=200.09..17590.32 rows=23644 width=36) (actual time=4.423..10.507 rows=6826 loops=1)
   Filter: ((domain_reversed)::text ~~ 'rg.%'::text)
   Heap Blocks: exact=5787
   ->  Bitmap Index Scan on domains_rev_idx  (cost=0.00..194.18 rows=6975 width=0) (actual time=3.414..3.414 rows=6826 loops=1)
         Index Cond: (((domain_reversed)::text ~>=~ 'rg.'::text) AND ((domain_reversed)::text ~<~ 'rg/'::text))
 Planning time: 0.265 ms
 Execution time: 11.019 ms
(7 rows)
user1_db=> 

Now it’s much faster. Note the reverse() call in the LIKE condition.

Finally, about searches of the form ’%whatever%’ (a wildcard on both sides): there is no trivial way to speed them up with the indexes shown here. You could try the pg_trgm extension (https://www.postgresql.org/docs/current/static/pgtrgm.html).
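A rough sketch of what that would look like (I have not benchmarked this on the dataset above; on Ubuntu the extension ships in the postgresql-contrib package):

CREATE EXTENSION IF NOT EXISTS pg_trgm;
-- a GIN trigram index can serve LIKE '%whatever%' queries
CREATE INDEX domains_trgm_idx ON domains USING gin (domain gin_trgm_ops);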

Summary

In this article I described how you can install PostgreSQL on Ubuntu 16.04 and how you can create a user and a database. The psql command line client was also described very briefly.
We also saw how to create a simple table, how to insert data with the INSERT SQL statement and the \copy command, how to query the data, and how to use EXPLAIN to guide you in creating indexes that speed up your queries. Finally, a small “hack” was presented that allows you to do suffix searches (at an extra storage cost).
