GCP

Part 1. Getting started

Chapter 1. What is Cloud

TCO: Total Cost of Ownership

Google Cloud SDK

  • gcloud compute instances list
  • gcloud compute ssh <NAME> : Under the hood, Google is using the credentials it obtained when you ran gcloud auth login, generating a new public/private key pair, securely putting the new public key onto the virtual machine, and then using the private key generated to connect to the machine.
  • gcloud config list --format 'value(core.project)' 2>/dev/null: get project id

Chapter 2. Try it out

  1. gcloud sql instances list

  2. In fact, the size of your disk automatically increases as needed.

  3. gcloud sql users set-password root "%" --password "my-changed-long-password-2!" --instance wordpress-db

Reset the password for the root user across all hosts. (The MySQL wildcard character is a percent sign.)

  1. mysql -h 104.197.207.227 -u root -p : my-very-long-password!

  2. database

    • CREATE DATABASE wordpress;
    • CREATE USER wordpress IDENTIFIED BY 'very-long-wordpress-password';
    • GRANT ALL PRIVILEGES ON wordpress.* TO wordpress;
    • FLUSH PRIVILEGES; : Finally let’s tell MySQL to reload the list of users and privileges. If you forget this command, MySQL would know about the changes when it restarts, but you don’t want to restart your Cloud SQL instance for this.

Chapter 3. The Cloud Datacenter

  1. Key questions about Cloud Datacenter
  • Where are these data centers?
  • Are they safe?
  • Should you trust the employees who take care of them?
  • Couldn’t someone steal your data or the source code to your killer app?

  • Secure facilities
  • Encryption
  • Replication
  • Backup
  1. Standards
  • industrywide standards

    such as strict security to enter the premises

  • Google Cloud standards

  1. oxymoronic; amorphic

  2. various levels of isolation

  • Zones
  • Regions
  1. Services are available, and can be affected, at several different levels:
  • Zonal
  • Regional
  • Multiregional
  • Global: At this point, you typically want to use multiple cloud providers (for example, Amazon Web Services alongside Google Cloud) to protect the service against disasters spanning an entire company.
  1. securing resources
  • Privacy
  • Availability
  • Durability
  1. If you have special legal issues to consider (HIPAA, BDSG, and so on), check with a lawyer before storing information with any cloud provider.

Part 2. Storage

Chapter 4. Cloud SQL

  1. SQL SOUNDEX

The SOUNDEX() function converts a string to a short phonetic code (four characters in standard Soundex) based on how the string sounds when spoken. Example filter: SOUNDEX(todoitems.name) LIKE CONCAT(SUBSTRING(SOUNDEX("egg"), 1, 2), "%");
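
To make the idea concrete, here is a minimal Python sketch of the classic American Soundex rules, followed by the same two-character prefix comparison that the filter performs. It is an illustration only: MySQL's SOUNDEX() differs in some details (for instance, it can return codes longer than four characters), and the names `soundex` and `sounds_like` as well as the sample words are assumptions made for this example.

```python
# Letter groups of classic American Soundex; vowels, h, w and y carry no digit.
CODES = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for ch in letters:
        CODES[ch] = digit


def soundex(word: str) -> str:
    """Return the four-character Soundex code of a word ("" if it has no letters)."""
    letters = [c for c in word.lower() if c.isalpha()]
    if not letters:
        return ""
    result = letters[0].upper()
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate two letters with the same digit
        code = CODES.get(ch, "")
        if code and code != prev:
            result += code
        prev = code  # a vowel resets prev, so a repeated digit after it counts again
    return (result + "000")[:4]


def sounds_like(name: str, term: str) -> bool:
    """Mimic: SOUNDEX(name) LIKE CONCAT(SUBSTRING(SOUNDEX(term), 1, 2), '%')."""
    return soundex(name).startswith(soundex(term)[:2])


print(soundex("egg"))               # E200
print(sounds_like("eggs", "egg"))   # True
print(sounds_like("milk", "egg"))   # False
```

Comparing only the first two characters of the code makes the match looser than comparing full codes, which is why "eggs" still matches "egg".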

  1. SSL

mysql -u root --password=really-strong-root-password -h 104.196.23.32 \
  --ssl-ca=server-ca.pem \
  --ssl-cert=client-cert.pem \
  --ssl-key=client-key.pem

  1. Maintenance Window

The last option allows you to choose whether you want updates to arrive earlier or later in the release cycle. Earlier timing means that your instance will be upgraded as soon as the newest version is considered stable, whereas setting this to later will delay the upgrade for a while. In general, only choose earlier if you’re dealing with test instances.

  1. MySQL configuration

max_heap_table_size -> large in-memory temporary tables

By default, there’s a limit to how big those tables can be, which is 16 MB. If you end up going past that limit, MySQL automatically converts that in-memory table to an on-disk MyISAM table.

  1. Scale up and down

A larger disk not only can store more bytes, it provides more IOPS to access those bytes.

The key thing to remember here is that you may find yourself in a situation where you’re running low on disk space, or where your data isn’t growing in size, but it’s being accessed more frequently and needs more IOPS capacity. In either situation, the answer’s the same: make your disk bigger.

Note that you can increase the size of your database, but you can’t decrease it.

  1. high availability (Replication)

A fundamental component to designing highly available systems is removing any single points of failure, with the goal being that your system continues running without any service interruptions, even as many parts of your system fail (usually in new and novel ways every time).

failover replica

read replica : you can use a different instance type; a couple of operations are only possible with read replicas: promoting and disabling replication.

  1. backup

The last 7 days’ backups are always kept.

The backup itself is a disk-level snapshot, which begins with a special user (cloudsqladmin) sending a FLUSH TABLES WITH READ LOCK query to your instance. This command tells MySQL to write all data to disk and prevents writes to your database while that’s happening.

Restore backup:

gcloud sql backups list --instance=todo-list --filter "status = SUCCESSFUL"

gcloud sql instances restore-backup todo-list

  1. Google Cloud Storage Backup

mysqldump : This means that everything you’ve come to expect from mysqldump applies to this export process, including the convenient fact that exports are run with the --single-transaction flag (meaning that at least InnoDB tables won’t be locked while the export runs).

backups are a good fit for the Nearline storage class, as it’s less expensive for infrequently accessed data.

make sure to choose the SQL format (not CSV), which includes all of your table definitions along with your schema, rather than the data alone.

If you put .gz at the end of your export file name, it’ll be automatically compressed using gzip.

  1. Price

Google Cloud considers two basic principles of pricing for computing resources: computing time and storage.

sustained-use discounts: but the short version is that running instances around the clock costs about 30% less than the sticker price.
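
The 30% figure can be reproduced with a little arithmetic. The sketch below assumes the tier schedule that Compute Engine has historically published for sustained use, where each successive quarter of the month is billed at 100%, 80%, 60%, and 40% of the base rate. That schedule and the name `effective_rate` are assumptions for illustration; check the current pricing pages before relying on the numbers.

```python
# Assumed schedule: each quarter of the month is billed at a lower share of the base rate.
TIER_RATES = (1.00, 0.80, 0.60, 0.40)


def effective_rate(fraction_of_month: float) -> float:
    """Return the billed cost as a fraction of one full month at sticker price."""
    if not 0.0 <= fraction_of_month <= 1.0:
        raise ValueError("fraction_of_month must be between 0 and 1")
    tier_width = 1.0 / len(TIER_RATES)
    billed = 0.0
    for i, rate in enumerate(TIER_RATES):
        tier_start = i * tier_width
        # Portion of this tier that the instance actually ran for.
        used = min(max(fraction_of_month - tier_start, 0.0), tier_width)
        billed += used * rate
    return billed


# Running all month: 0.25 * (1.0 + 0.8 + 0.6 + 0.4) = 0.70 of sticker price.
print(round(1 - effective_rate(1.0), 2))  # 0.3
```

Under this schedule an instance that runs for only half the month pays 0.45 of the full-month price, so the discount only becomes substantial for instances that run most of the time.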

  1. Scorecard

Trading data, which is likely to be much larger than the customer data, wouldn’t be well suited for relational storage, but instead would fit better in some sort of data warehouse. (large-scale analytics using BigQuery.)

PostgreSQL 9.5’s native JSON type support

MySQL’s advanced scalability features, such as multimaster or circular replication.

Chapter 5. Document storage

These documents are arbitrary sets of key-value pairs, and the only thing they must have in common is the document type, which matches up with the collection.

Data Locality

In the world of storage, the concept of where to put data is called data locality. Datastore is designed in a way that allows you to choose which documents live near other documents by putting them in the same entity group.

Result Set Query Scale

To deal with this, you’d probably want to index emails as they arrive so that when you want to search your inbox, the time it takes to run any query (for example, searching for specific emails or listing the last 10 messages to arrive) would be proportional only to the number of matching emails (not the total number of emails).

This idea of making queries as expensive, with regards to time, as the number of results is sometimes referred to as scaling with the size of the result set. Datastore uses indexing to accomplish it.
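
As a rough illustration of scaling with the size of the result set, the Python sketch below keeps a sorted index of (sender, id) pairs and uses binary search to jump to the first match, so the work after the initial search is proportional to the number of matching emails. The sample inbox and the helpers `build_index` and `lookup` are hypothetical; this is not how Datastore is implemented, only the general idea of an index.

```python
import bisect

# Hypothetical inbox: each dict stands in for one stored email.
emails = [
    {"id": 1, "sender": "alice@example.com", "subject": "Lunch?"},
    {"id": 2, "sender": "bob@example.com", "subject": "Invoice"},
    {"id": 3, "sender": "alice@example.com", "subject": "Re: Lunch?"},
    {"id": 4, "sender": "carol@example.com", "subject": "Hello"},
]


def build_index(rows, field):
    """Build the index once, when emails arrive: sorted (field value, id) pairs."""
    return sorted((row[field], row["id"]) for row in rows)


def lookup(index, value):
    """Return ids whose indexed field equals value, visiting only matching entries."""
    position = bisect.bisect_left(index, (value,))  # binary search for the first match
    matches = []
    while position < len(index) and index[position][0] == value:
        matches.append(index[position][1])
        position += 1
    return matches


sender_index = build_index(emails, "sender")
print(lookup(sender_index, "alice@example.com"))  # [1, 3]
```

The cost of keeping `sender_index` up to date is paid when an email is written, which is the trade-off the text describes: slower writes in exchange for queries that do not scan the whole inbox.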

Index(TODO):

  • MongoDB - B-tree indexes, Fractal Tree Indexes (3rd party)
  • BigTable, HBase, Apache Cassandra - LSM-tree

Automatic Replication

Keys

Datastore’s keys contain both the type of the data and the data’s unique identifier.

Keys themselves can be hierarchical. If two keys have the same parent, they’re in the same entity group, and the kind (type) of the data is the kind of the bottom-most piece.
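
One way to picture hierarchical keys is as a path of (kind, id) pairs. The Python sketch below models that with plain tuples; it is not the Datastore client library API, and the helper names (`kind`, `parent`, `same_entity_group`) and the TodoList/TodoItem kinds are hypothetical. It assumes the entity group is identified by the first (root) element of the path, which covers the case of two keys sharing the same parent.

```python
# A key is modeled as a path of (kind, id) pairs, root first.
list_key = (("TodoList", "groceries"),)
item_one = (("TodoList", "groceries"), ("TodoItem", "1"))
item_two = (("TodoList", "groceries"), ("TodoItem", "2"))
other_item = (("TodoList", "chores"), ("TodoItem", "1"))


def kind(key):
    """The kind of an entity is the kind of the bottom-most path element."""
    return key[-1][0]


def parent(key):
    """Return the parent key, or None for a root key."""
    return key[:-1] or None


def same_entity_group(a, b):
    """Assumption: keys belong to the same entity group when they share a root."""
    return a[0] == b[0]


print(kind(item_one))                         # TodoItem
print(parent(item_one) == list_key)           # True
print(same_entity_group(item_one, item_two))  # True
```

Note that `item_one` and `other_item` have the same kind and the same id within their parents, yet they are different keys in different entity groups because their paths differ.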

Entities

Consistency and Replication

two key requirements: to be always available and to scale with the result set.

Data Replication: One protocol that Cloud Datastore happens to use involves something called a two-phase commit.

In this method, you break the changes you want saved into two phases: a preparation phase and a commit phase. In the preparation phase, you send a request to a set of replicas, describing a change and asking the replicas to get ready to apply it. Once all of the replicas confirm that they’ve prepared the change, you send a second request instructing all replicas to apply that change. In the case of Datastore, this second (commit) phase is done asynchronously, where some of those changes may hang around in the prepared but not yet applied state.

Any strongly consistent query (for example, a get of an entity) will first push a replica to execute any pending commits of the resource and then run the query, resulting in a strongly consistent result.
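
The prepare/commit flow described above can be sketched as a toy Python model. It assumes in-memory replicas that always accept a prepare request, and it omits abort handling and networking entirely; the `Replica` class and the `put` function are hypothetical names, not Datastore internals.

```python
class Replica:
    """Toy replica: applied data plus changes that are prepared but not yet applied."""

    def __init__(self):
        self.data = {}
        self.pending = []

    def prepare(self, key, value):
        # Phase 1: remember the change and acknowledge it.
        self.pending.append((key, value))
        return True

    def apply_pending(self):
        # Phase 2: apply every prepared change, in order.
        for key, value in self.pending:
            self.data[key] = value
        self.pending.clear()

    def read_eventual(self, key):
        # May miss changes that are still waiting to be applied.
        return self.data.get(key)

    def read_strong(self, key):
        # Push the replica to execute pending commits first, then read.
        self.apply_pending()
        return self.data.get(key)


def put(replicas, key, value):
    """Prepare on every replica; the commit phase is left to happen later."""
    if not all(replica.prepare(key, value) for replica in replicas):
        raise RuntimeError("a replica refused to prepare (abort handling omitted)")


replicas = [Replica() for _ in range(3)]
put(replicas, "item-1", "buy eggs")
print(replicas[0].read_eventual("item-1"))  # None
print(replicas[0].read_strong("item-1"))    # buy eggs
```

The first read returns nothing because the change is still in the prepared state on that replica, which is the window in which an eventually consistent read can be behind the truth.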

When you use the put operation, under the hood Datastore will do quite a bit of work (figure 5.1):

  • Create or update the entity.
  • Determine which indexes need to change as well.
  • Tell the replicas to prepare for the change.
  • Ask the replicas to apply the change when they can.

This concept is called eventual consistency, which means that eventually your indexes will be up to date (consistent) with the data you have stored in your entities. It also means that although the operations you learned about will always return the truth, any queries you run are running over the indexes, which means that the results you get back may be slightly behind the truth.

A get by ID is a strongly consistent query, while a get through a query/index is not.

combining querying with data locality to get strong consistency. Or the Entity Update and Index Update might be consistent with each other.

Consistency with Data Locality

queries inside a single entity group are strongly consistent (not eventually consistent).

an entity group, defined by keys sharing the same parent key, is how you tell Datastore to put entities near each other.

Telling Datastore where you want to query over in terms of the locality gives it a specific range of keys to consider. It needs to make sure that any pending operations in that range of keys are fully committed prior to executing the query, resulting in strong consistency.

The reason for this is that a single entity group can only handle a certain number of requests simultaneously—in the range of about 10 per second.

CAP theorem:

  • Consistency: Relational Database
  • Availability: Google Cloud Datastore
  • Partition Tolerance: Google Cloud Datastore, Relational Database

Backup and restore

  1. Create the bucket

gsutil mb -l US gs://my-data-export

  2. Disable writes

  3. Export

gcloud datastore export gs://my-data-export/export-1

  4. List the exported files

gsutil ls gs://my-data-export/export-1

  5. Import

gcloud beta datastore import gs://my-data-export/export-1/export-1.overall_export_metadata

Price

  1. Google determines Cloud Datastore prices based on two things: the amount of data you store and the number of operations you perform on that data.

  2. Because Datastore offers full ACID (atomicity, consistency, isolation, durability) transaction semantics, you never have to worry about multiple updates accidentally ending up in a half-committed state.

  3. HBase’s parent system, Bigtable

Note (Like the Elasticsearch)

  1. Scalability: Shards
  2. Durability / Availability: Replicas
  3. Node: a running Google Cloud Datastore instance
  4. Cluster: a group of instances that can communicate with each other
  5. Data will be written into one of the shards. Primary is a concept above the shard and replica.

Chapter 6. Large-Scale SQL

  1. Spanner instances feature two aspects: a data-oriented aspect and an infrastructural aspect.

  2. Spanner’s guarantees are focused on:
    • rich querying
    • globally strong consistency
    • high availability
    • high performance
  3. Tables have a few other constraints, such as a maximum cell size of 10 MiB, but in general, Spanner tables shouldn’t be surprising.

  4. a few prerequisites exist for what types of changes are allowed.
  • First, the new column can’t be a primary key.
  • Next, the new column can’t have a NOT NULL requirement.
  1. You can perform three different types of column alterations:
  • Change the type of a column from STRING to BYTES (or BYTES to STRING).
  • Change the size of a BYTES or STRING column, so long as it’s not a primary key column.
  • Add or remove the NOT NULL requirement on a column.
  1. When interleaving tables, the parent’s primary key must be the start of the child’s primary key (for instance, the paychecks primary key must start with the employee_id field) or you’ll get an error.

  2. Spanner makes a big assumption: if you didn’t say that things must stay together, they can and may be split.

  3. Datastore has the same concept but talks about entity groups as the indivisible chunks of data, whereas Spanner talks about the points between the chunks and calls them split points.

  4. a terrible primary key to use: timestamps.

  5. The moral of this story is that when writing new data, you should choose keys that are evenly distributed and never choose keys that are counting or incrementing (such as A, B, C, D, or 1, 2, 3). Instead of using counting numbers of employees, you might want to choose a unique random number or a reversed fixed-size counter. A library, such as Groupon’s locality-uuid package (see https://github.com/groupon/locality-uuid.java), can help with this.

  6. Whenever you update an employee’s name (or create a new employee), however, you need to update the row in the table along with the data in each index that references the name column.

  7. Just as you can interleave one table into another, you can interleave an index with a table.

  8. Databases that support ACID transactional semantics are said to have atomicity (either all the changes happen or none of them do), consistency (when the transaction finishes, everyone gets the same results), isolation (when you read a chunk of data, you’re sure that it didn’t change out from under you), and durability (when the transaction finishes, the changes are truly saved and not lost if a server crashes).

  9. read-only lock and read-write lock: the locking is at the cell level,

  10. Isolation Levels

  11. Cloud Spanner pricing has three different components: computing power, data storage, and network cost.
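
Returning to the key-selection advice earlier in this chapter, the "reversed fixed-size counter" idea can be sketched in a few lines of Python. The function name `reversed_counter_key` and the choice of bit width are assumptions for illustration; Spanner does not require this particular scheme, and a random unique number is an equally valid option.

```python
def reversed_counter_key(counter: int, bits: int = 32) -> int:
    """Reverse the bits of a fixed-size counter so consecutive values land far apart."""
    if not 0 <= counter < 2 ** bits:
        raise ValueError("counter does not fit in the requested number of bits")
    binary = format(counter, f"0{bits}b")  # zero-padded binary string
    return int(binary[::-1], 2)            # reversed string read back as an integer


# Consecutive counters 1, 2, 3 map to keys spread across an 8-bit key space.
print([reversed_counter_key(n, bits=8) for n in (1, 2, 3)])  # [128, 64, 192]
```

Because bit reversal is a one-to-one mapping, the keys stay unique, but new rows no longer all land at the end of the key range, where they would hit the same split.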

Chapter 7. Cloud BigTable

  1. 2 phase reading

  2. Design goals

    • LARGE AMOUNTS OF (REPLICATED) DATA
    • LOW LATENCY, HIGH THROUGHPUT
    • RAPIDLY CHANGING DATA
    • HISTORY OF DATA CHANGES
    • STRONG CONSISTENCY
    • ROW-LEVEL TRANSACTIONS
    • SUBSET SELECTION
  3. multirow transactional semantics

  4. Data model concepts

    • Row Key

      This key can be anything you want, but as you’ll read later, you should choose the format of this key carefully.

      Bigtable allows you to quickly find data using a row key, but it doesn’t allow you to find data using any secondary indexes (they don’t exist).

    • ROW KEY SORTING

      • String IDs
      • Timestamps

        Do not use a timestamp as the key itself (or the start of the key)! Doing so ensures that all write traffic will always be concentrated in a specific area of the key space, which would force all traffic to be handled by a small number of machines (or even a single machine).

      • Combined values
      • Hierarchical structured content
  5. Columns and Column Families

Each of these belongs to a single family, which is a grouping that holds these column qualifiers and acts much more like a static column in a relational database.

column qualifiers can be anything you want and can be thought of as data—something you’d never do in a relational database.

Using this type of structure also means that when you visualize data in Bigtable as a table, most of the cells will be empty, which is why it is called a sparse map.
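
The sparse-map view, together with sorted row keys, can be modeled in a small Python sketch. It assumes an in-memory table where only written cells exist and where row keys combine a user ID with a date (with the date last, so it is not the start of the key). `SparseTable`, `row_key`, and the sample data are hypothetical names for illustration, not the Cloud Bigtable API.

```python
import bisect


class SparseTable:
    """Toy model: rows sorted by key, each row storing only the cells that were written."""

    def __init__(self):
        self.rows = {}          # row key -> {(family, qualifier): value}
        self.sorted_keys = []   # row keys kept in sorted order

    def write(self, key, family, qualifier, value):
        if key not in self.rows:
            bisect.insort(self.sorted_keys, key)
            self.rows[key] = {}
        self.rows[key][(family, qualifier)] = value

    def scan_prefix(self, prefix):
        """Return row keys starting with prefix, using the sort order to find them."""
        position = bisect.bisect_left(self.sorted_keys, prefix)
        found = []
        while (position < len(self.sorted_keys)
               and self.sorted_keys[position].startswith(prefix)):
            found.append(self.sorted_keys[position])
            position += 1
        return found


def row_key(user_id, day):
    """Combined key: the user ID comes first so writes spread across users."""
    return f"{user_id}#{day}"


table = SparseTable()
# Column qualifiers ("item-7", "item-9", ...) are data, not a fixed schema.
table.write(row_key("alice", "20240102"), "completed", "item-7", "1")
table.write(row_key("bob", "20240101"), "completed", "item-3", "1")
table.write(row_key("alice", "20240101"), "completed", "item-9", "1")

print(table.scan_prefix("alice#"))  # ['alice#20240101', 'alice#20240102']
```

Each row holds only the qualifiers that were actually written, so a row for one user never carries empty cells for items that only another user completed.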

  1. TALL VS. WIDE TABLES

A wide table is one with relatively few rows but lots of column families and qualifiers.

A tall table is one with relatively few column families and column qualifiers but quite a few rows, each one corresponding somehow to a particular data point.

This table style contains quite a few differences compared to what we’ve discussed so far. The first and most obvious difference is that rather than growing wider as more items are completed, it will grow longer (or taller) instead.

Although these two tables do ultimately allow you to ask similar questions, it would appear that the tall version allows you to be a bit more specific at the cost of more single-entry lookups to get bulk information.
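
To compare the two layouts side by side, the Python sketch below stores the same completed-item facts once as a wide table and once as a tall table, using plain dictionaries. The data, the "completed" family name, and the helper functions are hypothetical, and real Bigtable reads would go through row lookups and scans instead of dictionary iteration.

```python
# The same facts: (user, completed item).
completed = [("alice", "item-1"), ("alice", "item-2"), ("bob", "item-1")]

# Wide: one row per user, one column qualifier per completed item.
wide = {}
for user, item in completed:
    wide.setdefault(user, {})[("completed", item)] = "1"

# Tall: one row per (user, item) pair, with a single fixed column.
tall = {}
for user, item in completed:
    tall[f"{user}#{item}"] = {("completed", "done"): "1"}


def items_wide(user):
    """Bulk question answered from a single row."""
    return sorted(qualifier for (_family, qualifier) in wide.get(user, {}))


def items_tall(user):
    """Same question, answered by visiting every row with the user's prefix."""
    prefix = f"{user}#"
    return sorted(key[len(prefix):] for key in tall if key.startswith(prefix))


def did_complete_tall(user, item):
    """Specific question answered by one exact row lookup."""
    return f"{user}#{item}" in tall


print(len(wide), len(tall))                # 2 3
print(items_wide("alice"))                 # ['item-1', 'item-2']
print(did_complete_tall("bob", "item-2"))  # False
```

The wide table grows by adding qualifiers to existing rows, while the tall table grows by adding rows, which matches the trade-off described above.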

  1. Infrastructure concepts

  2. Bigtable is one of the more confusing services, particularly when it comes to how replication is handled.

  3. Another tricky area is that Bigtable itself has a concept of a tablet, which isn’t directly exposed via the Cloud Bigtable API.

  4. Elasticsearch

    • Cluster
    • Node: A node is an instance of Elasticsearch
    • Shards and Replicas
  5. BigTable

the basic structure here is that an instance is the top-most concept and can contain many clusters, and each cluster contains several nodes (with a minimum of three).

That said, Bigtable will almost certainly support replication with multiple clusters per instance in the future.