Blog

IBM BIG SQL SELF-SERVE HELP v1

Table of Contents

Abstract

Big SQL Architecture

Best Practices – Big SQL Performance

Using and Administering Big SQL

Overview of Big SQL Logs and Diagnostics

Troubleshooting Big SQL query performance

Troubleshooting Big SQL Hangs

Troubleshooting Big SQL Crashes

Troubleshooting Big SQL Hadoop Native I/O Engine

Troubleshooting Big SQL Scheduler

Troubleshooting Big SQL Load

Troubleshooting Big SQL Installation

General Data Collection

Collecting db2trc in Big SQL

Applying Big SQL v4.1 patches

Credits

Abstract

Big SQL leverages IBM’s strength in SQL engines to provide ANSI SQL access to data via JDBC or ODBC, seamlessly whether that data resides in Hadoop or in a relational database. This means that developers familiar with the SQL programming language can access data in Hadoop without having to learn new languages or skills. With Big SQL, all of your big data is SQL accessible. Big SQL presents a structured view of your existing data and chooses an optimal execution strategy given your available resources: it can leverage MapReduce parallelism when needed for complex data sets, and avoid it in favor of direct access for smaller, low-latency queries.

The intention of this blog is to help you learn about Big SQL architecture, best practices, usage and administration, and troubleshooting problems with Big SQL.

Big SQL Architecture

Figure 1 shows how Big SQL fits within the IBM® Open Platform with Apache Hadoop.


Figure 1. Big SQL Architecture Overview Diagram

Big SQL statements are run by the Big SQL server on your cluster against data on your cluster. The server transforms the specification of a result table into a sequence of internal operations that optimize data retrieval. The Big SQL server is included in the Big SQL service installation.

The Big SQL architecture uses the latest relational database technology from IBM. The data remains on the HDFS cluster, with no relational database management system (RDBMS) structure or restrictions on the layout and organization of the data. The database infrastructure provides a logical view of the data (by allowing storage and management of metadata) and a view of the query compilation, plus the optimization and runtime environment for optimal SQL processing.

  • Applications connect on a specific node based on specific user configurations.
  • SQL statements are routed through this node, which is called the Big SQL management node, or the coordinating node. There can be one or many management nodes in the cluster, but there is only one Big SQL management node. SQL statements are compiled and optimized to generate a parallel execution query plan.
  • Then, a runtime engine distributes the parallel plan to worker nodes on the compute nodes and manages the consumption and return of the result set.
  • A compute node can be a physical server or a logical operating system instance. The worker nodes can contain the temporary tables, the runtime execution, the readers and writers, and the data nodes. The DataNode holds the data.

A worker node is not required on every HDFS data node. It can operate in a configuration where it is deployed on a subset of the Hadoop cluster.

When a worker node receives a query plan, it dispatches special processes that know how to read and write HDFS data natively.

Big SQL uses native and Java open source–based readers (and writers) that are able to ingest different file formats. For more information about the readers and writers, see Big SQL Input/Output.

The Big SQL engine pushes predicates down to these processes so that they can, in turn, apply projection and selection closer to the data. These processes also transform input data into an appropriate format for consumption inside Big SQL.

All of these components can run on one management node, or each can run on a separate management node. You might consider separating the Big SQL management node from the other Hadoop master nodes. This arrangement can give the Big SQL management node enough resources to store intermediate data from the Big SQL data nodes.


Best Practices – Big SQL Performance

1) Specifying multiple paths for Big SQL data directory

During Big SQL installation, the installer prompts for the Big SQL data path(s) as shown in Figure 2.

Figure 2. Install Prompt during Big SQL installation to specify bigsql_db_path

The Big SQL data directory (bigsql_db_path) is used for temporary data during the execution of Big SQL queries.

For good performance, specify multiple paths to spread the Big SQL data directory over as many disks as possible. It is important to get this right at installation time, because it is not easy to change after the Big SQL install is done.
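For illustration only, a hypothetical multi-disk value for bigsql_db_path might look like the following (comma-separated, one directory per physical disk; the actual paths depend on your disk layout):

/data1/bigsql,/data2/bigsql,/data3/bigsql,/data4/bigsql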

2) What storage format to use for Big SQL

Big SQL supports many file formats and compression codecs out of the box, such as Text, RCFile, ORC, Avro, Parquet, etc.

The Big SQL performance team at IBM ran a test using different storage formats on a 5-node cluster, with each node equipped with:

  • 2 processors, 16 cores, Intel® Xeon® CPU E5-2450 0 @ 2.10GHz
  • 10 data disk drives at 1.2 TB each
  • 64 GB of memory

The table depicted in Figure 3 summarizes the performance numbers for TEXT, PARQUET, AVRO, ORC, RCFILE, and TEXT SEQUENCE. Metrics include materialized HDFS size in MB, LOAD time, ANALYZE time, and actual TPC-H query time (total seconds for all 22 queries).

Figure 3. Big SQL Storage Formats Comparison

In summary, Parquet wins on total query and ANALYZE time, and it is the second best in storage. ORC actually used the least HDFS storage. AVRO had the best overall LOAD time.

Learn more details about this topic here: https://developer.ibm.com/hadoop/blog/2014/09/19/big-sql-3-0-file-formats-usage-performance/

In summary, choose the best storage format based on your requirements for storage availability, speed of data import, and analysis.

Tip: For analytical workloads, IBM recommends the Parquet storage format.

3) Data Type considerations for Big SQL

The STRING data type is commonly used in Hive to represent character data of variable (and undetermined) length. In Big SQL, the STRING type is mapped internally to VARCHAR(32K).

We strongly discourage the use of the STRING data type in Big SQL.

To avoid performance problems, change references to STRING to an explicit VARCHAR(n) that most appropriately fits the data size, as the sketch below illustrates.
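A minimal sketch (hypothetical table and sizes); prefer the second form over the first:

create hadoop table customer_bad  (c_name string, c_city string);
create hadoop table customer_good (c_name varchar(100), c_city varchar(50));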

4) Resource sharing

A Hadoop cluster with Big SQL installed has its resources split between Big SQL and all other Hadoop components. The Big SQL percent usage (bigsql_pct_usage) install property specifies how much of the cluster’s resources (CPU and memory) are assigned to Big SQL. During install, specify a value for the bigsql_pct_usage property indicating the percentage of cluster resources you would like to allocate to Big SQL, as shown in Figure 4.

Figure 4. Install Prompt during Big SQL installation to specify bigsql_resource_percent

The default value of 25% may be too low. IBM’s recommended range is 25% to 75%, always leaving at least 25% of resources for the operating system and other services.

It may not be easy to get this value right at install time, since it is difficult to determine the optimal split of resources required between Big SQL and other Hadoop components. The exact resources you need to dedicate to Big SQL versus other Hadoop components depend upon the performance requirements of the Big SQL queries and Hadoop jobs within your workload.

Changing the percentage of resources dedicated to Big SQL post-install is relatively easy and is documented in detail in the BigInsights Knowledge Center 4.1:

https://www.ibm.com/support/knowledgecenter/#!/SSPT3X_4.1.0/com.ibm.swg.im.infosphere.biginsights.analyze.doc/doc/bigsql_config.html

5) Big SQL Table Partitioning

Big SQL provides the ability to partition a table based on a data value. Table partitioning improves query performance by eliminating partitions that do not contain the data values of interest.

Big SQL stores different data partitions as separate files in HDFS and only scans the partitions required by a query, thereby improving runtime. Partition on a column commonly referenced in range delimiting or equality predicates. For instance, a range of dates is ideal for use as partition columns.

Example DDL for specifying partitioning in a CREATE HADOOP TABLE statement:

CREATE HADOOP TABLE LINEITEM (L_ORDERKEY BIGINT NOT NULL, … L_COMMENT VARCHAR(44) NOT NULL) PARTITIONED BY (L_SHIPDATE DATE) STORED AS TEXTFILE
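With this partitioning in place, a query that restricts the partitioning column lets Big SQL scan only the matching partition files; for example (hypothetical date value):

SELECT L_ORDERKEY, L_COMMENT FROM LINEITEM WHERE L_SHIPDATE = DATE('1996-01-01')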

6) Gathering Big SQL Statistics (Analyze command)

Keeping statistics up to date is vital to Big SQL performance. Good statistics can improve query performance many fold, whereas old, inaccurate or non-existent statistics can have the opposite effect.

Use an ANALYZE command similar to the one that follows to update a table’s statistics:

ANALYZE TABLE ORDERS COMPUTE STATISTICS FOR COLUMNS O_ORDERKEY, O_CUSTKEY, O_ORDERSTATUS, O_TOTALPRICE, O_ORDERDATE, O_ORDERPRIORITY, O_CLERK, O_SHIPPRIORITY, O_COMMENT

Make sure to update statistics every time when:

  • a new table is populated, or
  • an existing table’s data undergoes significant changes:
    • new data added
    • old data removed
    • existing data is updated

It is important to note that ANALYZE is a MapReduce job. If Analyze fails, use Resource Manager UI to debug the issue.

If ANALYZE tasks fail with Java heap exceptions, then increase the following properties (see the example after this list):

  • mapreduce.map.memory.mb
  • mapreduce.map.java.opts
  • mapreduce.reduce.memory.mb
  • mapreduce.reduce.java.opts
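A hypothetical illustration of increased values (set via Ambari under the MapReduce2 configuration, or in mapred-site.xml; the right numbers depend on your cluster, and the java.opts heap should stay below the corresponding memory.mb value):

mapreduce.map.memory.mb      = 4096
mapreduce.map.java.opts      = -Xmx3276m
mapreduce.reduce.memory.mb   = 4096
mapreduce.reduce.java.opts   = -Xmx3276m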

For additional details on the ANALYZE command, check here: https://developer.ibm.com/hadoop/blog/2015/05/26/gathering-big-sql-statistics-using-analyze-command/

7) Use of Informational Constraints

Informational constraints are much like regular primary and foreign key constraints, except that they are not enforced. In Big SQL we encourage the use of informational constraints because they allow the optimizer to make better selectivity estimates, which improves the costing of the plan and the execution efficiency of the query.

Example of specifying the Informational Constraint:

alter table orders add primary key (O_ORDERKEY) not enforced;
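An informational foreign key can be declared the same way; a sketch that builds on the primary key above (the constraint name is illustrative):

alter table lineitem add constraint fk_order foreign key (l_orderkey) references orders (o_orderkey) not enforced;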

8) Big SQL Load Best Practices

The LOAD HADOOP statement fetches data from external data sources and loads it into a Big SQL Hadoop table. Use this statement to load data from external sources such as relational databases, or from delimited files that are on the BigInsights cluster, either as files stored on the local file system or on the Hadoop distributed file system.

Load Best Practice 1:

If the source files are on a distributed file system (for example, HDFS) local to the cluster, then the default value for num.map.tasks is 80% of the cluster map task capacity.

By default, if the source file is not on the distributed file system, then num.map.tasks defaults to 4 map tasks.

Having too few or too many map tasks can make LOAD slower. You can tune LOAD property num.map.tasks to customize the number of map tasks.

A good starting point is to set it to the number of Big SQL worker nodes (or a multiple thereof), as shown in the following example:

load hadoop using file url '/tpch1000g/orders/' with source properties ('field.delimiter'='|', 'ignore.extra.fields'='true') into table ORDERS overwrite WITH LOAD PROPERTIES ('num.map.tasks'='145');

Load Best Practice 2:

  • If the source consists of a few large files, then it is more efficient to copy the files to HDFS first, and then load from HDFS (see the sketch after this list).
  • If the source consists of many small files, then loading from either the local file system or HDFS has similar performance.

Reason: The number of map tasks used to load from the local file system is limited to the number of files. If there are only a few large files, then only a small number of map tasks will be created to transfer and load the data. For HDFS sources, the number of map tasks is limited by the number of HDFS blocks, not the number of files.
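A minimal sketch of the copy-then-load approach (hypothetical paths; LOAD syntax as in the earlier example):

hadoop fs -mkdir -p /staging/orders
hadoop fs -put /local/exports/orders_*.dat /staging/orders/
db2 "load hadoop using file url '/staging/orders/' with source properties ('field.delimiter'='|') into table ORDERS overwrite"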

For additional details on Big SQL data ingestion best practices, check https://developer.ibm.com/hadoop/blog/2014/11/07/big-sql-data-ingestion-statements-load-hadoop-statement-best-practices/

 

Using and Administering Big SQL

Connecting to a Big SQL server

1) Using DB2 CLI

You can use DB2 CLI or any client that uses JDBC drivers to define a connection to the Big SQL server.

The following example shows how to connect to BIGSQL as the bigsql user and execute CREATE HADOOP TABLE, insert a row, and query the table:

Connect to Big SQL database:

[root@bdavm535 ~]# su bigsql
[bigsql@bdavm535 root]$ db2 connect to BIGSQL

Database Connection Information
Database server        = DB2/LINUXX8664 10.6.3
SQL authorization ID   = BIGSQL
Local database alias   = BIGSQL

Create Big SQL Table:

[bigsql@bdavm535 root]$ db2 "create hadoop table test2 (col1 int)"
DB20000I  The SQL command completed successfully.

Insert into Big SQL Table:

[bigsql@bdavm535 root]$ db2 "insert into test2 values (1)"
DB20000I  The SQL command completed successfully.

Select from Big SQL Table:

[bigsql@bdavm535 root]$ db2 "select * from test2"
COL1
-----------
1
1 record(s) selected.

2) Using JSqsh

The JSqsh client is installed as part of the BigInsights – Home service.
The JSqsh installation directory is in /usr/ibmpacks/common-utils/current/jsqsh.
Start the client by using the following command to define connection profiles:

/usr/ibmpacks/common-utils/current/jsqsh/bin/jsqsh bigsql

Example:
[root@bdavm537 bin]# ./jsqsh bigsql
Welcome to JSqsh 4.4
Type "\help" for help topics. Using JLine.

Create Big SQL Table:

[bdavm535.svl.ibm.com][root] 1> create hadoop table test3 (col1 int);
0 rows affected (total: 3.57s)

Insert into Big SQL Table

[bdavm535.svl.ibm.com][root] 1> insert into test3 values (0);
1 row affected (total: 0.31s)

Select from Big SQL Table

[bdavm535.svl.ibm.com][root] 1> select * from test3;
+------+
| COL1 |
+------+
|    0 |
+------+
1 row in results(first row: 0.30s; total: 0.31s)

How to Stop and Start Big SQL

1) In Ambari Web UI

  1. Logon to Ambari Web UI
  2. Click on BigInsights – Big SQL
  3. Click on Service Actions button
  4. Select Stop/Start/Restart All

See the example screenshot in Figure 5.

Figure 5. Administering Big SQL (Submitting stop/start/restart) using Ambari Web UI

2) From command line, using a script

Alternatively, you can issue stop / start / forcestop / status using a script.

The following commands to stop/start Big SQL should be run on the Big SQL head node.

  1. First, change directory to the folder containing the bigsql script:

[root@bdavm535 4.1]# cd /usr/ibmpacks/bigsql/4.1/bigsql/bin

  2. For usage, simply type ./bigsql and press Enter:

[root@bdavm535 bin]# ./bigsql
  use: bigsql [start | stop | forcestop | status] [-all|-bigsql |-scheduler|-bigsql1] [-n node]
Actions:
start      – Start the service if not running
stop       – Gracefully stop the service if running
forcestop  – Forcefully stop the service if running
status     – Display the status of the service
Options:
-all       – Perform action on all services (the default behavior)
-bigsql    – Perform action only on Big SQL
-scheduler – Perform action only on Big SQL Scheduler service
-bigsql1   – Perform action only on Big SQL v1
-n node    – When starting/stopping Big SQL, indicates which node to control

  3. Example of executing a status command below:

[bigsql@bdsup006 bin]$ ./bigsql status
SERVICE              HOSTNAME                       NODE      PID STATUS
Big SQL Scheduler    bdsup006.svl.ibm.com              –    14708 Available
Big SQL Worker       bdsup008.svl.ibm.com              1    11869 DB2 Running
Big SQL Master       bdsup006.svl.ibm.com              0    26324 DB2 Running
Big SQL Worker       bdsup009.svl.ibm.com              2    30077 DB2 Running
Big SQL Worker       bdsup010.svl.ibm.com              3    12716 DB2 Running

Adding new Big SQL node

You can expand your Big SQL cluster by adding more worker nodes, one at a time.

  1. Logon to Ambari Web UI
  2. Click on Hosts tab
  3. Click on Actions button, and select Add New Hosts (See Figure 6)

Figure 6. Add new host using Ambari Web UI

After the host is successfully added:

  1. Click on Hosts tab again
  2. Click on the newly added hostname
  3. Make sure to install all Clients
  4. Click on add > Big SQL Worker node (See Figure 7)

Figure 7. Add new Big SQL Worker node using Ambari Web UI

Dropping Big SQL Node

You can drop an existing Big SQL node, one at a time, using the following steps:

  1. Logon to Ambari Web UI
  2. Click on Hosts tab
  3. Click on the host from which you want to drop Big SQL (e.g. Big SQL Head)
  4. Choose Decommission
  5. Execute Delete (See Figure 8)

Figure 8. Decommission Big SQL Head node

Dropping multiple Big SQL Worker nodes

You can use the /var/lib/ambari-server/resources/stacks/BigInsights/4.1/services/BIGSQL/package/scripts/fixtopologyDriver.sh script to drop multiple Big SQL Worker nodes (DEAD or ALIVE).

Step 1: On the Big SQL head node, make a copy of /home/bigsql/sqllib/db2nodes.cfg as /tmp/db2nodesToDelete.cfg, and remove from the copy all lines for nodes you do NOT want to drop.

Step 2: As root, run:
./fixtopologyDriver.sh -f db2nodesToDelete.cfg
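Putting the two steps together, a minimal sketch (run on the head node as root; paths as given above):

cp /home/bigsql/sqllib/db2nodes.cfg /tmp/db2nodesToDelete.cfg
vi /tmp/db2nodesToDelete.cfg   # keep only the lines for the nodes to drop
cd /var/lib/ambari-server/resources/stacks/BigInsights/4.1/services/BIGSQL/package/scripts
./fixtopologyDriver.sh -f /tmp/db2nodesToDelete.cfg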

Note: Logs for this operation can be found at /tmp/bigsql/logs

Dropping ALL BIGSQL

You can run the fullBigSqlCleanup.sh script on the head node to wipe all of Big SQL.

Steps:

cd /var/lib/ambari-agent/cache/stacks/BigInsights/4.1/services/BIGSQL/package/scripts/

./fullBigSqlCleanup.sh

use: fullBigSqlCleanup.sh -c<ambari-cluster-name> -x<ambari-server-port> -s<ambari-server-host>  [-u<ambari-admin-user>] [-p<ambari-admin-password> ]

Options:
-s ambari-server-host
-c ambari cluster name
-x ambari-server-port
-u ambari-admin-user    , defaults to admin if not provided
-p ambari-admin-password, defaults to admin if not provided

Getting BIGSQL HA working for the first time

 To get HA running for the first time:

In the Ambari console, open up the bigsql service. In the action button dropdown, select EnableHA. This will install bigsql on the secondary server and bring it up as a DB2 HADR standby server.

When the enabling of HA succeeds, a number of steps have been completed:

  • DB2 was installed on the secondary host
  • A DB2 instance was created along with the bigsql database
  • A backup was taken of the bigsql database on the head node
  • That backup was copied over to the secondary host and restored
  • The bigsql database in the DB2 instance on the secondary is started as a standby
  • Finally, the bigsql database on the head node is restarted as a primary

When HA is successfully enabled, you can run the db2pd utility with the options -db bigsql -hadr to see that all is in a working state:

[bigsql@bdsup006 ibmpacks]$ db2pd -db bigsql -hadr
Database Member 0 -- Database BIGSQL -- Active -- Up 1 days 23:30:11 -- Date 2016-04-06-12.20.43.712896
HADR_ROLE = PRIMARY
REPLAY_TYPE = PHYSICAL
HADR_SYNCMODE = NEARSYNC
STANDBY_ID = 1
LOG_STREAM_ID = 0
HADR_STATE = PEER
HADR_FLAGS =
PRIMARY_MEMBER_HOST = bdsup006.svl.ibm.com
PRIMARY_INSTANCE = bigsql
PRIMARY_MEMBER = 0
STANDBY_MEMBER_HOST = bdsup007.svl.ibm.com
STANDBY_INSTANCE = bigsql
STANDBY_MEMBER = 0
HADR_CONNECT_STATUS = CONNECTED
HADR_CONNECT_STATUS_TIME = 04/04/2016 12:50:33.899582 (1459799433)
HEARTBEAT_INTERVAL(seconds) = 30
HEARTBEAT_MISSED = 0
HEARTBEAT_EXPECTED = 5698
HADR_TIMEOUT(seconds) = 120
TIME_SINCE_LAST_RECV(seconds) = 0
PEER_WAIT_LIMIT(seconds) = 0
LOG_HADR_WAIT_CUR(seconds) = 0.000
LOG_HADR_WAIT_RECENT_AVG(seconds) = 0.000001
LOG_HADR_WAIT_ACCUMULATED(seconds) = 8.299
LOG_HADR_WAIT_COUNT = 6184
SOCK_SEND_BUF_REQUESTED,ACTUAL(bytes) = 0, 19800
SOCK_RECV_BUF_REQUESTED,ACTUAL(bytes) = 0, 87380
PRIMARY_LOG_FILE,PAGE,POS = S0000001.LOG, 2042, 231854201
STANDBY_LOG_FILE,PAGE,POS = S0000001.LOG, 2042, 231854201
HADR_LOG_GAP(bytes) = 0
STANDBY_REPLAY_LOG_FILE,PAGE,POS = S0000001.LOG, 2042, 231854201
STANDBY_RECV_REPLAY_GAP(bytes) = 0
PRIMARY_LOG_TIME = 04/06/2016 10:40:35.000000 (1459964435)
STANDBY_LOG_TIME = 04/06/2016 10:40:35.000000 (1459964435)
STANDBY_REPLAY_LOG_TIME = 04/06/2016 10:40:35.000000 (1459964435)
STANDBY_RECV_BUF_SIZE(pages) = 4306
STANDBY_RECV_BUF_PERCENT = 0
STANDBY_SPOOL_LIMIT(pages) = 384000
STANDBY_SPOOL_PERCENT = 0
STANDBY_ERROR_TIME = NULL
PEER_WINDOW(seconds) = 600
PEER_WINDOW_END = 04/06/2016 12:30:36.000000 (1459971036)
READS_ON_STANDBY_ENABLED = N

To monitor and automate failover using HADR, the TSA product is installed and configured.

On a healthy cluster, here are the TSA resources configured to monitor the Big SQL head node:

Figure 9. "lssam -V" output

Figure 9 shows bdsup006 as the primary and bdsup007 as the standby.

 

Overview of Big SQL Logs and Diagnostics

Component Logs and Locations:

Big SQL Logs and Diagnostics are located in different places for each component:

Big SQL Server (DB2) – /var/ibm/bigsql/diag/DIAG0000/
Big SQL Scheduler – /var/ibm/bigsql/logs/bigsql-sched.log
Big SQL Native I/O – /var/ibm/bigsql/logs/dfsreader*
Big SQL Java I/O – /var/ibm/bigsql/logs/bigsql.log
Big SQL DDL – /var/ibm/bigsql/logs/bigsql.log

Log Entry Identifier from error messages:

Many Big SQL error messages contain a log entry identifier, a three-part name that enables you to identify which file to refer to for a particular error message. The identifier consists of three parts: LLL-NNN-XXXYYYZZZ

LLL: Identifies which component’s log file contains the error

NNN: Identifies which node of the Big SQL cluster the error originated from.  The node number is the DB2 node number.

Mapping from node number to host name can be found in $BIGSQL_INST_HOME/db2nodes.cfg

XXXYYYZZZ: Identifies the unique message ID, such that you can search for and find this value in the log file on the particular node number.

Each component log file has a unique 3-letter identifier:

BSL – Big SQL log file (/var/ibm/bigsql/logs/bigsql.log). Currently the Java readers, DDL code, and the Java code involved in the C++ readers write to this log.
NRL – Native Reader log files (/var/ibm/bigsql/logs). The C++ readers write to their own log files.
DDL – The db2diag.log file (/var/ibm/bigsql/diag/).
SCL – The scheduler log (/var/ibm/bigsql/logs/bigsql-sched.log). It will only exist on the node on which the scheduler is running.

Example:

[State: ZZZZZ][Code: -5105]: The statement failed because a Big SQL component encountered an error. Component receiving the error: "BigSQL IO". Component returning the error: "UNKNOWN". Log entry identifier: "[BSL-1-8b3791dbb]". SQLCODE=-5105, SQLSTATE=ZZZZZ, DRIVER=3.67.33

The identifier "[BSL-1-8b3791dbb]" tells you three things:

(1) Which diag file(s) to look into
(2) On which machine (hostname)
(3) Which pattern to search for in the diag file.

Mapping the partition (aka node) number to the hostname where the logs may reside:

Look into the ~bigsql/sqllib/db2nodes.cfg file (assuming the Big SQL instance's user ID is "bigsql").

The first column in this file is the node number; the second column is the hostname.

Let’s look at example db2nodes.cfg file below:

=========================
0 earth.foobar.com 0
1 mars.foobar.com 0
2 venus.foobar.com 0
=========================

It tells us that node number 1 maps to mars.foobar.com.

Troubleshooting Big SQL query performance

It is a good idea to start with a general analysis first, then gradually narrow down to the root cause.

The first part of a performance investigation should be narrowing down what type of problem you are experiencing. Then, based on this initial assessment, come up with a plan for what kind of data collection fits. You should avoid heavy data collection if the problem can be quickly answered by browsing your configurations and reviewing best practice documentation.

 

The Initial Questions to ask yourself:

  • What’s the problem?!
  • What is slow?
  • Are all queries slow, or only a few? A single query? Which ones?
  • What is the type/format of the external data?  How was the data created/ingested?
  • When did the problem start happening?  Did it always happen, or did something change? Any software/hardware components change?
  • Is there anything in common with the queries/jobs that are slow?
  • Is there a comparison case that is not slow?  Is there a QA/DEV/PROD cluster that does not exhibit the same symptoms?
  • Is there a lot of concurrency?  Does it happen during peak workload time?
  • Does the problem reproduce?
  • Are you current on software maintenance level?  Any plans to move to newer versions or fixpacks?

The idea here is to narrow down so that you may focus on the right things.

Starting the investigation

Now that we know what the problem is, where can we start to investigate?

Troubleshooting a Big SQL performance problem requires some familiarity with the Big SQL architecture and the recommended practices to avoid known pitfalls. Make sure you have read the architecture and best practices sections of this Self Serve Help.

At this point you should probably have an idea if the problem is a general issue, or if the issue has been narrowed down to a specific job/query/task.

General Issues

  1. Check statistics
  • The first port of call for a Big SQL performance problem: make sure ANALYZE has been run and statistics are up to date. Analyze best practices are published at:

https://developer.ibm.com/hadoop/blog/2015/05/26/gathering-big-sql-statistics-using-analyze-command/

  • Check STATS_TIME and CARD in SYSCAT.TABLES to see if ANALYZE has been run, with the following query example:
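A minimal example (run as the bigsql user from the head node; adjust the TABSCHEMA filter to your schema):

db2 "SELECT TABNAME, STATS_TIME, CARD FROM SYSCAT.TABLES WHERE TABSCHEMA = 'BIGSQL'"

A CARD of -1 with a NULL STATS_TIME indicates that statistics have never been collected for that table.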
  2. Check Big SQL resource configs that have a big performance impact

INSTANCE_MEMORY and DB2_CPU_BINDING

These configuration parameters specify the percentage (memory and CPU) of the cluster dedicated to Big SQL. Both are set to the percentage specified at the time of installation. A recommended practice is to give 50% or more to Big SQL. If you want to change the percentage after installation, run the following command:

autoconfigure using mem_percent 50 workload_type complex is_populated no apply db and dbm

The resource percentage option will in turn determine several lower level properties that have big impact on performance:

Sort space : SORTHEAP and SHEAPTHRES_SHR

SHEAPTHRES_SHR is 50% of memory assigned to Big SQL

SORTHEAP is 1/20th of SHEAPTHRES_SHR

If these values are too low, the optimizer may favor joins other than hash join (which is the preferred, prevalent join method of Big SQL for Hadoop tables).

Bufferpool – IBMDEFAULTBP

An area of memory used to cache temporary working data for a query. Out of the box, 15% of the memory assigned to Big SQL is used for the bufferpool. If this value is too small, the optimizer may start to select sub-optimal plans (with nested loop joins). Additionally, once the bufferpool is full, temporary data will spill to disk (temporary tablespace), and performance may degrade significantly.

DFS reader memory

The native I/O engine requires memory to read data from HDFS and exchange the data with the Big SQL runtime. By default, the readers are allocated 20% of the memory assigned to Big SQL.

Check Big SQL resource sharing functionality: STMM

Big SQL will constantly monitor the memory allocation and how efficiently it is being used.  Memory allocation will be re-assigned between Big SQL consumers (within the given boundaries) to ensure Big SQL is making the best possible use of the available memory.  For example, if Big SQL detects that the bufferpools are infrequently used but sort space is regularly exhausted, then it may decide to reduce the amount of memory available to bufferpools and give this memory to sort space.

If you have a relatively fixed workload, run with STMM enabled for a period of time.  Monitor the bufferpools and sort space.  When they have remained stable for some time, disable STMM using the following command:

         update db cfg for bigsql using SELF_TUNING_MEMORY off

Troubleshooting performance for a specific job/query/task

  1. Check statistics (see previous section)
  2. Check application design
    1. Check data types. STRING is bad for Big SQL, but it is prevalent in the Hadoop and Hive world. This performance degradation can be avoided by:
      1. Changing references from STRING to an explicit VARCHAR(n) that most appropriately fits the data size
      2. Using the Big SQL string size property to lower the default size (32K) of the VARCHAR to which STRING is mapped when creating new tables
    2. Check partition design. Partitioning on a column with many distinct values will create many (small) files and may have a negative impact on performance.  Choose your partitioning key wisely:
      1. Partition on a column commonly referenced in range delimiting or equality predicates
      2. Ranges of dates are ideal for use as partition columns
  3. Gather diagnostic documents
    1. If you know the query, put it into its own file called queryfile.sql, terminated with a ;. Then, from the head node only, run db2support like this:

db2support <output path> -d bigsql -cl 1 -sf queryfile.sql -localhost

This command:
      – gets a db2look collection of all objects in the database
      – gets all the db/dbm/db2set configs and other information
      – gets catalog table exports, which allow us to simulate the query compile for debugging purposes in the lab later
    2. If you do not know the query, or there is more than one query in the job/task that is thought to be slow, then instead of 1. above, collect db2support this way:

db2support <output path> -d bigsql -cl 0 -localhost

This command collects the same db2look, configuration, and catalog table export information as above.

Unlike 1. above, it does not collect db2exfmt output on a query. For any queries that are involved in the identified job, you should get db2exfmt for each one separately, like this:

db2 connect to bigsql
db2 set current explain mode explain
db2 -tvf query1.sql
db2 set current explain mode no
db2exfmt -d bigsql -u bigsql bigsql -o query1.exfmt.out

…where query1.sql contains the query you are explaining terminated by a ;.

  4. Note that if collection 1. or 2. above was done, when you unzip the db2support.zip file, there will be another zip file inside called db2supp_opt.zip. When you unzip that one, you will find the db2look and/or exfmt output, depending on the collection.
  5. Analyze explain plans. Look for some of these:
    1. Not using hash joins as the join type
    2. Large BTQs (broadcast table queues)
    3. Nested loop joins (NLJNs) (can be very bad)
    4. Nested loop joins without a TEMP on the inner (can be very, very bad)
    5. Merge scan joins (MSJOIN)

These are just quick things to check. In reality, having some of these plan properties does not necessarily mean that the plan is sub-optimal. Performance, and the optimizer is no exception to this, is often a game of trade-offs and balances.

Troubleshooting Big SQL Hangs

What is a hang?

“Hang” is one of those terms that can have different meanings depending on where you are sitting.  It’s a general term that people may use to describe anything that is not responsive.

Others can use the term “hang” however they like, but for our purposes here, let’s make a more solid, less subjective definition. Let us define the term hang to refer to a situation in which an application/task/job is really stuck, often due to a product defect somewhere, where the execution cannot break out of its current state and will remain stuck forever until someone or something comes along and intervenes with an abnormal termination, such as killing a process or thread.

Given this definition, the following things are NOT hangs:

  • a query does not return in a timely manner (but there are signs that execution is working normally, such as monitor metrics showing activity like rows read/rows written changing over time)
  • a query is in a normal state of waiting, where some other action needs to happen first to allow the execution to resume. In this case, the other action is initiated by the user.  Example: some forms of locking, waiting for user to commit, waiting for user input, etc.  If you are waiting for a user, it is not a hang.

The following things ARE hangs:

  • A piece of execution is stuck in an infinite loop that will never end. Pro tips: look for high cpu usage in a specific thread that keeps growing.  Check traces to see if the loop is captured.  Check stacks to reveal possible looping codepaths.
  • Internal resource contention that forms a circular state of waiting. For example, thread 1 is waiting on thread 2, thread 2 is waiting on thread 3, and thread 3 is waiting on thread 1. In this case, everybody is waiting on each other, creating a deadlocked type of scenario that will never be released. Examples of this: “deadlatch” problems, internal wait areas (waitpost, or inter-partition communication waits). Pro tip: You’ll need to look at stacks to investigate this type of issue.
  • The OS is really hosed, thrashing with high I/O, swapping, or run-queue/CPU pressure: a “system”-level hang.

Let’s note also that it’s possible for a hang to be isolated to a single query or application.  A hang does not necessarily mean that the entirety of bigsql is stuck.

How to tell if something is really a hang or not?

There are some light-weight things that you can do to narrow down and figure out whether you are dealing with a real hang or perhaps just a performance issue. Doing this narrow-down work helps put together a data collection and properly define the problem description. If you assume it’s a hang, but in reality it’s just a slow-moving query, then collecting hang data may not be sufficient to investigate the performance issue.

Here are some quick tests to help narrow down:

  • Get a "vmstat 2 10". Any chance the server is totally pegged with CPU contention, memory usage, thrashing, etc.?
  • Does other activity work okay? Simple commands? Can you connect to the database? If something like "db2 get snapshot for database on bigsql global" from the head node will not return, then you’ve likely got a full instance hang.
  • If you are able to get some commands to run, I would suggest: run "db2 get snapshot for database on bigsql global" > dbsnap.1, then wait 1 minute, then get it again, naming it dbsnap.2. Have a look at the output and compare: did the "rows read" metric change at all? If it did, then the entire db is not hung; something is still doing work. If you are investigating a specific query or application, use the same technique but at the application level instead of the database level, using "db2 get snapshot for applications on bigsql global" (see the sketch after this list).
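A minimal sketch of the snapshot-comparison technique (run as the bigsql user on the head node):

db2 get snapshot for database on bigsql global > dbsnap.1
sleep 60
db2 get snapshot for database on bigsql global > dbsnap.2
grep -i "rows read" dbsnap.1 dbsnap.2   # if the counts changed, work is still progressing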

What to collect?

You can use the collectData_BigSQL.ksh script for this, specifying the -H option, as follows:

./collectData_BigSQL.ksh -H

Get the script at the following URL:

http://www-01.ibm.com/support/docview.wss?uid=swg21970707

Category 1) General Performance Data Collection

Example command to run: ./collectData_BigSQL.ksh

When to perform this collection?
Use this when you are observing very general performance concerns but have not isolated the problem down to a specific query or task. Choose the number of intervals and the interval times based on how long and how frequently you want to collect data. This also comes in handy for observing overall system-wide behaviours, like tracking the memory usage for an entire day of customer workload.
What does it do?
This command will first print out the usage, and then wait 10 seconds to give you a chance to cancel (Ctrl-C) out of it. Once it gets going, this is the most comprehensive collection and will get a little bit of everything, running with a default of 5 intervals and a sleep time of 5 minutes (300 seconds). Use the -i and -s options to override these defaults as needed.
Category 2) Specific Application Performance Data Collection

Example command to run: ./collectData_BigSQL.ksh -a 12345

When to perform this collection?

Use this collection if you know that a particular application or task is causing your performance issues. First, identify the application handle that is responsible (using an application snapshot, db2pd, or another technique), and use this application handle with the -a option. This method is less general than method 1) above, and so it is able to get some more detailed information that is relevant to just the application at hand.

What does it do?

This command will focus the data collection by only gathering information for the given application handle and its related processes and threads. It runs with a default of 3 iterations with a sleep time of 3 minutes (180 seconds). Use the -i and -s options to override these defaults as needed.

Category 3) Query Performance Data Collection

Example command to run: ./collectData_BigSQL.ksh -Q queryfile.sql

When to perform this collection?

Use this collection if you know the exact sql statement that you are having problems with, and it is reproducible from the command line. This will collect information to observe the performance of just this query as it executes. Actual query results are sent to /dev/null.
For this information to have the most value, you should also collect the optimizer data collection on the query using db2support tool from the head node like this:
db2support <output path> -d bigsql -cl 1 -sf queryfile.sql -localhost
…if you have not already done so.

What does it do?

This option is essentially the same as the -a option from category 2, except that it will connect to the db and execute the query for you so you do not have to do a look-up of the application handle. Default iterations is 3 with sleep time of 180 s. Just like the others, use -i or -s if you want to override these defaults.

Category 4) Deep Query Analysis Data collection

Example command to run: ./collectData_BigSQL.ksh -Q queryfile.sql -T

When to perform this collection?

Use this option if IBM Support needs to study the specifics of how a query is executing internally using a trace; this option will execute the query and collect the required trace and other debugging info. Unlike the -a and regular -Q options, this will not collect many snapshots, since the goal here is to minimize the noise that is collected in the trace.

– Optional trace masks can be provided with -T option.

For example:
./collectData_BigSQL.ksh -Q queryfile.sql -T "-Madd SQLRI -Madd SQLKT -Madd SQLKQ"

What does it do?

Before running the query, this will generate a clean “section” of the query using the undocumented .db2service data collection. A “section” is the compiled binary executable form of the query plan that is executed through the runtime interpreter during query execution. We have internal tools in the lab that can format this binary section, allowing us to learn various details about how the query execution occurs.
Then, a trace is turned on across all hosts, using the option to track only the threads that are associated with the application used to execute the query. Then, the query is invoked. When the script completes, it will dump the trace to a file. This one also runs with a default of 3 iterations and a sleep time of 180 seconds.

Category 5) Hang Data Collection

Example command to run: ./collectData_BigSQL.ksh -H

When to perform this collection?

Use this data collection if it’s believed that something is stuck (a different style of problem from a slow-moving query). This is a catch-all collection of the info needed to investigate a true hang. Sometimes a customer may not really be able to tell the difference between a hang and a slow-moving query. As such, this collection does have enough performance collections in it that it’s better than nothing, but it’s not ideally suited for a performance investigation.

What does it do?

This version of the collection will avoid data collections that involve connecting to the database, running queries, or anything else that requires latches and such in the engine. If something is really hung, often the data collection attempts themselves can hang, so this will stick to only those collections that are latch-safe, like db2pd commands, sending signals to trigger diagnostic dumps, plus OS commands and db2trc, etc. It runs with a default of 3 iterations at 60-second intervals. Given all the dumps and traces, it’s not the best thing to use for long-term execution, but like the rest of this script, you can choose to override those defaults with -i and -s.

Send the output to IBM Support for Analysis.

 

Troubleshooting Big SQL Crashes

 

When a crash occurs, as a starting point we need to collect all the data that was produced by the crash.

To collect the DB2-related diagnostics, as the bigsql user, run the following on at least the node where the crash occurred and the head node:

db2support <path for output file> -d bigsql -localhost

The path can be '.' or any directory where there is sufficient space for the file. This creates a file named db2support.zip; rename the resulting zip file to add the node number to avoid confusion with collections from other nodes.

If a core file was created, collect it with the following command:

db2support -fodc <directory name where the core file can be found>

Note that this takes just the directory name, not the full path. The path is assumed to be $DIAGPATH by the collection tool. Look for core files in the DB2 diag path. To determine the diag path, on the head node as the bigsql user, do the following:

db2 get dbm cfg | grep -i diag

The value of the DIAGPATH parameter provides the path where DB2 writes its diagnostic files. Look for '*core*' in the directories on that path.

The BigSQL logs need to be collected from their respective directory, again at least on the node where the crash occurred and on the head node.

The logs are found under /var/ibm/bigsql/logs

Troubleshooting Big SQL Hadoop Native I/O Engine

Scenario

Your SQL SELECT statement fails with SQL5199N for an I/O component.

Error Message

>SELECT * FROM T1

SQL5199N The statement failed because a connection to a Hadoop I/O component could not be established or maintained. Hadoop I/O component name: "C++ DFSIO". Reason code: "1". Database partition number: "2".

Logs and diagnostics collection steps

  • SQL5199N reason code 1 often indicates that the Big SQL Fenced Mode Process (FMP) hosting the Hadoop I/O component terminated abnormally. It is important to note that in this case you will not have information in the reader logs.
  • The error indicates that the Hadoop C++ DFSIO component failed on database partition number 2.
  • Collect DB2 (invoking component) and C++ DFSIO (failing component) diagnostics from partition 2.
  • Look for an FODC_FmpTrap directory in the DB2 diagnostics directory and try to correlate the time of collection with the time of occurrence
  • Examine the C++ DFS IO components logs and look for entries around the time of occurrence
  • Collect db2support data from the node in question.

db2support /tmp -d bigsql -localhost -f -o db2support.zip

 

Anatomy of a C++ DFSIO log

  • DFS IO logs are located in the /var/ibm/bigsql/logs directory
  • DFS IO log names use the following convention: dfsreader.<hostname>.<db2 instance owner>.log.<log level>.<timestamp>.<process identifier>, e.g. dfsreader.hotellnxbi02.jdwoods.log.ERROR.20140617-154927.1019103
  • The active DFSIO log file for the partition is located at $BIGINSIGHTS_HOME/logs/bigsql-ndfsio.log.<log level>
  • Log level can be FATAL, ERROR, WARNING, or INFO
  • The command db2pd -fmp returns the list of active DB2 Fenced Mode Processes. It can be used to determine if the DFSIO process is still alive.

Troubleshooting Big SQL Scheduler

Scenario

A user issues an SQL SELECT statement that fails with SQL5105N for the scheduler component.

Error Message

>SELECT * FROM T1

2014-06-23 13:21:46,235 ERROR com.ibm.biginsights.bigsql.scheduler.server.cache.DescriptorTableCache [pool-1-thread-2] : ['SCL-0'-6c9bf74fb] Failed to load metadata for table: t1.

Logs and diagnostics collection steps

SCL            – means Scheduler Log
0              – means node 0
6c9bf74fb      – the unique identifier to look for in the log file

 

You should look in the bigsql-sched.log file and grep for 6c9bf74fb.

You may find this message:

2014-06-23 13:21:46,235 ERROR com.ibm.biginsights.bigsql.scheduler.server.cache.DescriptorTableCache [pool-1-thread-2] : ['SCL-0'-6c9bf74fb] Failed to load metadata for table: t1.

So the problem is with the Hive metastore for table t1.

Log into Hive and check to see what Hive says about the table t1:

hive> show tables ;
OK
bigint_tab
boolean_tab
<…>

The table t1 does not exist in the Hive metastore. This is the root cause of the error message.

Troubleshooting Big SQL Load

Finding error information in the bigsql.log file

When the LOAD statement fails, the SQL message that is returned can contain a log entry identifier, such as [BSL-n-xxxxxxx]. This identifier marks the location in the bigsql.log file where more error information can be found. The n position in the identifier is the Big SQL node where the file exists. The default local file location is /var/ibm/bigsql/logs/bigsql.log.

You can also use the function SYSHADOOP.LOG_ENTRY to display a snippet of the bigsql.log file near the log entry ID. This is helpful if you are using a remote JDBC client to connect to Big SQL. The following SELECT statement shows how to use the SYSHADOOP.LOG_ENTRY function to get the information from the bigsql.log file:

SELECT * FROM TABLE(SYSHADOOP.LOG_ENTRY('BSL-0-22b40517b'));

Finding error information in the logs of MapReduce jobs

When the LOAD HADOOP statement fails, the SQL message can contain a job identifier. The job identifier is the Hadoop MapReduce job identifier that is run by the LOAD statement. This job identifier can be used to locate the Hadoop job in the MapReduce Job History web page. The following address is the default URL for the Job History web page:

http://<mapreduce-server>.com:19888/jobhistory

The mapreduce-server is the node where the MapReduce History server is installed. If you are not sure of the node, look at the Ambari web interface for the MapReduce service.

On the Job History web page, find and select the job identifier for the LOAD job. There are three places that you can see job logs:

Application Master

The MapReduce Job xxxx page contains a section called Application Master. In this section, select the link to logs. Expand the Click Here link to see the entire file.

This log file might contain errors when starting, committing, or cleaning up the job. It can also have Out of Memory or other resource errors.

Map

The MapReduce Job xxxx page contains a section with column Task Type. Select the map link, which shows the map tasks that were run. Select each map task link to see the task attempts. Each task attempt contains a logs link. Open the log and expand by selecting the Click Here link to see the entire file.

These log files might have rejected record information, problems related to accessing the source data, or problems related to writing to the target table.

Reduce

The MapReduce Job xxxx page contains a section with column Task Type. Select the reduce link, which shows the reduce tasks that were run. Not all LOAD jobs have reduce tasks, so there might not be any tasks listed.

Select each reduce task link to see the task attempts. Each task attempt contains a logs link. Open the log and expand by selecting the Click Here link to see the entire file.

These log files might have error information about writing to the partitions of the target table.

Rejected records

LOAD might complete successfully and return SQLCODE 5108 even though some of the rows were rejected. LOAD might stop and fail with SQLCODE 5109 if the number of rejected records exceeds the maximum set with the max.rejected.records option.

When a record is rejected, the reason is written to the map task attempt log. The following list represents some of the reasons that records might be rejected. Each reason also contains the corrective action:

Mismatch in source and target data types

  • You might need to use a different data type for your target table.
  • For a database table source, you can CAST the source column to the intended target data type.
  • Consult the JDBC driver documentation for the type of database source to determine the native data type mapping to JDBC.

Null value for a column which does not allow null values

  • For a file source, you can use option rejected.records.dir to write the rejected records to a file. You can then inspect the rejected records file, make the needed corrections, and retry the LOAD statement.

Overflow of decimal value

  • You might need to specify the target column with a different precision and scale to match the source data.
  • You can use the decimal.overflow option to round values to match the target precision and scale.

Multi-row source file

  • If the source file contains records that span multiple lines, then you must use the allow.multiline.record option set to a true value.

Incorrect field delimiter which can cause a record to not get parsed

  • Check the source file and verify that it contains the same field delimiter that you specified in the field.delimiter option.
  • If the field delimiter is a non-printable character, then you can specify it as an escaped character (\t) or the octal value (\001). An example LOAD statement follows this list.
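As an illustration, a hypothetical LOAD that sets the field delimiter explicitly and writes rejected records to a directory for later inspection (paths and table name are examples only):

load hadoop using file url '/tmp/input/sales.dat' with source properties ('field.delimiter'='\t', 'rejected.records.dir'='/tmp/sales-rejects') into table SALES;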

Source file data corrections

There are several LOAD options that can help you load data from a file with the following characteristics:

  • Column names on the top line.
  • Too many fields.
  • Fields in the wrong order.
  • Multi-line records.
  • Non-standard date or timestamp formats.
  • Decimal values that overflow the target precision or scale.
  • Place holder values for NULL.

Use the following options to avoid some LOAD errors:

  • time.format – useful for date and timestamp formats.
  • with.null and replace.string.with.null – placeholder strings for NULL values.
  • lines.count – skip the top lines that do not have data.
  • multiline.record – allows records to span multiple lines in the file.
  • overflow – round decimal values to fit the target precision and scale.
  • indexes – reorder the fields in the source file to match the target column.
  • extra.fields – useful for when the number of source fields does not match the number of target columns.

Troubleshooting Big SQL Installation

Scenario 1

Big SQL install fails when non-English characters are encountered

Error Message

"/usr/lib/python2.6/site-packages/resource_management/core/shell.py", line 290, in _call
    err_msg = Logger.filter_text(("Execution of '%s' returned %d. %s") % (command_alias, code, all_output))
File "/usr/lib/python2.6/site-packages/resource_management/core/logger.py", line 96, in filter_text
    text = text.replace(unprotected_string, protected_string)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe3 in position 242: ordinal not in range(128)

Logs and diagnostics collection steps

1) Verify the existing encoding:

Issue the following sequence of commands as root:

[root@iop-ldap-client site-packages]# python
Python 2.6.6 (r266:84292, Jul 23 2015, 15:22:56)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-11)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> reload(sys)
<module 'sys' (built-in)>
>>> sys.getdefaultencoding()
'ascii'
>>> exit()

2) Change the default encoding from ascii to utf-8

cd /usr/lib/python2.6/site-packages
vi sitecustomize.py

Add the following lines to the sitecustomize.py file:

# encoding=utf8
import sys
reload(sys)
sys.setdefaultencoding('utf8')

3) Verify the new encoding:

[root@iop-ldap-client site-packages]# python
Python 2.6.6 (r266:84292, Jul 23 2015, 15:22:56)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-11)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> reload(sys)
<module 'sys' (built-in)>
>>> sys.getdefaultencoding()
'utf8'
>>> exit()

Scenario 2

BigSQL 4.1.0.2 installation fails with Error: “Failed to install TSAMP with rc:1”

Error Message

1) Installation fails when trying to add host

sudo /var/lib/ambari-agent/cache/stacks/BigInsights/4.1/services/BIGSQL/package/scripts/bigsql-add-host.sh -H /usr/ibmpacks/current/bigsql/bigsql -n bdsup008.svl.ibm.com -o add -s os -u bigsql -b hbase -x hdfs -i hive -e hadoop -g users -f 28051 -k -r -v /var/ibm/bigsql -l '/var/ibm/bigsql/database' -p 32051 -Q bdsup007.svl.ibm.com -N BigInsights -O 4.1.0.0 -T 0

2) Host log:

/tmp/bigsql/logs/bigsql-add-host-2016-01-12_13.00.54.9124.log

2016-01-12.13:00:55.383 @ Line(00113) @ Fun add_pre_processing: Executing: /usr/ibmpacks/current/bigsql/db2/bin/db2fmcu -d
Error stopping the service: initctl command failed with rc=256
2016-01-12.13:00:55.402 @ Line(01292) @ Fun add_pre_processing: Installing TSAMP
2016-01-12.13:00:55.405 @ Line(01126) @ Fun tsampInstall: .Entering TSAMP Install
2016-01-12.13:00:55.411 @ Line(01132) @ Fun tsampInstall: .isCentOS:1
2016-01-12.13:00:55.414 @ Line(01151) @ Fun tsampInstall: .TSAMP explicit install detected
2016-01-12.13:00:55.431 @ Line(01167) @ Fun tsampInstall: .TSAMP is not present on the machine, calling installTSAMP /usr/ibmpacks/current/bigsql/
2016-01-12.13:00:57.557 @ Line(01173) @ Fun tsampInstall: .Error occurred while installing TSAMP, refer to log file /tmp/tsampUtils.log for detail
2016-01-12.13:00:57.560 @ Line(01178) @ Fun tsampInstall: .Completed TSAMP Install with rc=1
2016-01-12.13:00:57.562 @ Line(01298) @ Fun add_pre_processing: Failed to install TSAMP with rc:1
2016-01-12.13:00:57.565 @ Line(01304) @ Fun add_pre_processing: Completed add-host-pre-processing with rc=1

3) /tmp/installSAM.log

prereqSAM: Using default prerequisite checking on the following rpm package: 'pam' 'i686'
prereqSAM: Error: The following rpm package is not installed: 'pam' 'i686'
1 missing package: pam (i686)
prereqSAM: Error: Prerequisite checking for the ITSAMP installation failed: RHEL 6.6 x86_64
prereqSAM: One or more required packages are not installed: pam (i686)
installSAM: <<< Installation on bdsup008 – log ended : Tue Jan 12 13:00:57 PST 2016


Solution

Install the following packages, and retry the Big SQL installation:

yum install pam
yum install pam.i686

 

Scenario 3

Big SQL installation fails with Error: Cannot retrieve repository metadata (repomd.xml) for repository: BI-Ambari-4.1.

Error:

Error: Cannot retrieve repository metadata (repomd.xml) for repository: BI-Ambari-4.1. Please verify its path and try again
Loaded plugins: product-id, security, subscription-manager
This system is not registered to Red Hat Subscription Management. You can use subscription-manager to register.
Setting up Remove Process
No Match for argument: bigsql-samples*
http://10.87.225.176/repos/Ambari/rhel/6/x86_64/2.1.x/GA/2.1/repodata/repomd.xml: [Errno 14] PYCURL ERROR 7 – "couldn't connect to host"
Trying other mirror.
Error: Cannot retrieve repository metadata (repomd.xml) for repository: BI-Ambari-4.1. Please verify its path and try again

Solution

  1. Check the file /etc/yum.repos.d/ambari.repo

 

[hadoop@infbidn5 yum.repos.d]$ cat ambari.repo
[BI-Ambari-4.1]
name=BIGINSIGHTS-4.1
baseurl=http://10.87.225.176/repos/Ambari/rhel/6/x86_64/2.1.x/GA/2.1/
enabled=1
gpgcheck=0

 

  2. Copy and paste the URL from the ambari.repo file into a web browser, to see if it works

http://10.87.225.176/repos/

 

  3. If it does not work, then start the Apache web server by running the following command

apachectl start  

  4. Restart Big SQL

 

General Data Collection

Collecting general environment information from your cluster for IBM Support

  • What hardware: x86 or ppc64 (this can be obtained by running the command "uname -a")
    • [root@bdavm535 ~]# uname -a

Linux bdavm535 2.6.32-504.12.2.el6.x86_64 #1 SMP Sun Feb 1 12:14:02 EST 2015 x86_64 x86_64 x86_64 GNU/Linux

  • What OS: e.g. RHEL 6.3 (this can be obtained by running the command "cat /etc/*lease")
    • [root@bdavm535 ~]# cat /etc/*lease

Red Hat Enterprise Linux Server release 6.6 (Santiago)

 

  • Output of db2level command (run as user bigsql on bigsql head node)

[bigsql@bdavm535 root]$ db2level

DB21085I  This instance or install (instance name, where applicable: “bigsql”) uses “64” bits and DB2 code release “SQL10063” with level identifier “0704010E”. Informational tokens are “DB2 v10.6.0.3”, “s150918”, “IP23850”, and Fix Pack “3”. Product is installed at “/usr/ibmpacks/bigsql/4.1/db2”.

  • Were any special patches applied for Big SQL or DB2?
  • Content of the /home/bigsql/sqllib/db2nodes.cfg file from the Big SQL head node
  • Note: This info rarely changes, unless you perform an upgrade.
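To gather these basics in one pass, a minimal sketch (run as root on the Big SQL head node; the output file name is illustrative):

uname -a > /tmp/env_info.txt

cat /etc/*lease >> /tmp/env_info.txt

su - bigsql -c "db2level" >> /tmp/env_info.txt

cat /home/bigsql/sqllib/db2nodes.cfg >> /tmp/env_info.txt

Attach /tmp/env_info.txt to your PMR along with the rest of the data collection.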

General location for Big SQL logs / config files:

Big SQL logs: /var/ibm/bigsql/logs/ (on all Big SQL nodes)

Big SQL config files: /usr/ibmpacks/bigsql/4.*/bigsql/conf

 

Big SQL Installation logs:

/tmp/bigsql/logs/*

Collect additional information based on the nature of the problem:

  • If the problem is with a DDL => then the DDL statement
    • db2look – DDL extraction tool command
  • If the problem is with an insert/update/delete/load => then the DDL and the insert/update/delete/load SQL statement
  • If the problem is with a query => then the DDL, some sample data (if possible; do NOT send us sensitive data), and the SQL query
  • If any JVM ran out of memory, it dumps *.phd and javacore* files – get those.
  • If any JVM is hung (e.g. scheduler, db2fmp with java-io-engine, etc.), run “kill -3 <processid>”. This will dump a javacore* file. Take two snapshots at a 1-minute interval and provide both javacore files (see the sketch after this list).
  • General – a short reproduction of the problem (if possible).
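For the hung-JVM case, a minimal sketch of taking the two snapshots (replace <processid> with the actual PID of the hung JVM):

kill -3 <processid>

sleep 60

kill -3 <processid>

Each kill -3 asks the JVM to write a javacore* file (typically into its working directory); collect both files.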

Command to Generate DB2Support.zip on all nodes of the cluster:

As the bigsql user, run the following command on your Big SQL head node:

db2_all “db2support /tmp -d bigsql -localhost -f -o db2support.zip”

After the command finishes running, it should create a db2support.zip file under the /tmp directory on each node of your Big SQL cluster.
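One way to confirm the file landed on every node, assuming the same /tmp path everywhere:

rah "ls -l /tmp/db2support.zip"

Each host should report a reasonably sized zip file; a missing or zero-length file on a host usually means db2support failed there.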

Collecting db2trc in Big SQL

db2trc – overview

db2trc is a powerful debugging tool that has been inside the db2 product for a long time.  It is very mature, with a feature set too extensive to cover fully in this high-level how-to.

Almost every single db2 source code function has “trace points”.  A trace point in the code is a way that the executing code can log some debugging info into the trace.  Source code functions have an “entry” trace point at the start of the function, and an “exit” tracepoint at the end.  They may also have data tracepoints that capture dumps of information.

Each trace point has a unique identifier.  Each captured entry also gets a sequence number (really just a counter that increments as entries are written into the trace), so the trace can be read as a chronological view of the functions executed.

I encourage you to read the documentation to see everything it can do.  As a quick teaser, in case you want to become a trace guru, here are some neat things you can do (not covered in this quick overview):

– Limit the trace collection using “masks” so that only some things are traced.

– Hook a “suspend” into a trace point to cause a hang at a key point in the code, allowing you to attach with a debugger.

– Inject sleep points into certain trace points to slow down certain codepaths if you want to force a particular timing issue.

– Get a performance breakdown summary of time spent in all functions

– Hook a db2cos (db2 callout script) invocation into a trace point, so that when db2 hits the tracepoint, it will invoke a shell script where you can put data collection commands in it.

– perfcount trace (different from performance report trace)

This document keeps things simple and covers how to use it in the more common scenarios on Big SQL.

 

You can trace to a memory buffer (there is a performance impact; it shouldn’t be too bad, but there are things to watch out for due to the size of the buffer).

You can trace directly to a file instead of a memory buffer (this is bad for performance, but avoids some of the problems of memory-buffer tracing.  Watch out for the file growing too big!).

Once a trace has been collected, you will have an unreadable binary dump file.  To actually see the trace, you have to format the dump file.

When to get trace, and for what kind of problems should you get it?

Reproducible problem:

Trace is great when the customer knows how to reproduce something.  “I run this query and it fails with sql error blah”

The trace can capture information that leads up to, and includes, the error (including its error handling).  As such, it’s a great tool for getting more information about a particular problem that reproduces on the customer site.

However, I would not start with trace right away in a PMR.  It’s best to begin with simpler analysis, such as investigating the diagnostic log files and looking for known problems, to see if the cause is obvious.  Use a trace later if you can’t solve the problem based on the available logs.

 

*Some* hangs:

If a hang seems to be caused by a contention style of issue (i.e. latching, waitpost, or similar), then don’t use trace.

However, if the hang is related to some kind of looping behaviour (an infinite loop, high CPU in an agent), then trace (especially the flow format) is a great way to observe the loop, provided that the loop crosses trace points.  A tight loop that does not hit any tracepoints of course won’t give you anything here.  Usually for hangs, db2trc is built into the hang data collection scripts and you wouldn’t have to ask for it specifically.  Look for trace output in the hang data collection scripts’ outputs!

 

Performance issues:

Trace adds overhead and makes things slower.  As such, you don’t use trace for performance issues and expect it to have good timings.  However, what it can help with is “where in db2 is the time being spent?  what’s it doing?”  Use the -t option for this to track timestamps.  See also the performance report options.  I don’t recommend this generally unless you are not getting anywhere with standard performance analysis.  I use trace if a deep dive is needed into a particular performance issue.

How to collect trace:

Trace can be a bit confusing to collect, and sometimes you don’t get what you really wanted due to problems in how you collect it!

Turning on a trace to memory buffer:

There are two ways to trace to memory: circular or truncated.  Tracing to memory does NOT automatically dump anything; you still need to explicitly generate the trace dump.

Circular buffer: -l <buffer size>

Entries are added to the memory buffer.  When the buffer is full, it wraps to the front of the buffer, overwriting older entries, and continues in circular fashion.  This is okay, provided that it does not wrap over the entries that you actually wanted to see.  Thus, knowing when to turn the trace off is key.  If tracing a reproducible failure, turn the trace off ASAP after the failure happens so that you don’t overwrite the good trace points that you wanted.

Truncated buffer: -i <buffer size>

Entries are added to the buffer.  When the buffer is full, no more entries are added; all subsequent tracepoints are not captured.  Use this if you know that you want to capture the starting info of something, but do not care about what happens later.  There is then no worry about the trace wrapping past the first entries.

 

Some example memory traces:

db2trc on -l 1024m

Turns trace on to memory using the maximum buffer size of 1 GB.  Circular buffer; it may wrap.

db2trc on -i 64m

Turns trace on to memory using a 64 MB buffer.  This trace will truncate once the buffer is full.

 

IMPORTANT notes when turning on trace:

Trace will only run on the current host that it is executed from.  Depending on the problem being investigated, you will likely want to wrap all of the trace commands in rah to distribute the same command to all hosts. Here’s an example of wrapping a trace in the rah command:

rah “db2trc on -l 128m”

Also, the size of the memory buffer that you choose with the -l or -i option is limited by the size of the shared memory set. This memory set is created when bigsql is started. You can see the size of the memory set by running

db2pd -memsets

…and look at the size of the “Trace” memory set. For example, you may ask for a trace buffer with -l 1024m to get a 1 GB memory buffer, but you may not really get that amount because the set is only sized at 32 MB. Unfortunately, if you want to max out the trace buffer size and get the full 1024m, you need to use:

db2set DB2TRC_DEF_BUFFSIZE=1G

…followed by a bigsql recycle, which is not always convenient or customer-friendly.
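Putting the two commands together, a minimal check-then-set sketch (the 1G value is just the example from above):

db2pd -memsets

db2set DB2TRC_DEF_BUFFSIZE=1G

Check the “Trace” set size in the first command’s output; the db2set only takes effect after bigsql is recycled, so plan it before the trace session.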

 

Dumping trace:

When you trace to memory, you don’t have any output yet; it’s all in memory.  Thus, to get the trace, the memory buffer needs to be dumped to a file like this:

db2trc dump <trace dump file name>

 

As noted, on Big SQL if the trace is running on all hosts, you’ll probably wrap the dump in rah like this:

rah “db2trc dump /tmp/trace.dmp”

Ensure the path you give exists on all hosts (/tmp is convenient for this).

 

Dumping a trace does not turn it off!  It will continue to collect trace data in the buffer (if it’s a circular buffer or a non-full truncated trace).  A trace to file, on the other hand, does not require a dump, but you still have to turn it off.

 

Tracing to file:

To avoid worrying about the trace wrapping or truncating and missing the data you needed to see, trace directly to a file instead:

db2trc on -f <trace file name>

WARNING: Would it be a good idea to trace to file a huge query that joins 100 million rows?  Nope!  The trace dump file can very quickly grow out of control.

You don’t want to create a 50 GB trace dump.  You might consider masking the trace to limit the components.

 

Turning trace off:

Don’t let trace keep running!!

rah “db2trc off”

 

Formatting the trace:

There are two main ways to format the trace: “the flow” (flw) and “the format” (fmt).  A flow output does not include all the details of dumped structures, but it is extremely valuable for learning codepaths due to the entry/exit nature of trace points.  The fmt output has much more detail from each trace point, including any debugging info dumps.

Here’s example syntax for formatting the trace as flow output:

db2trc flw <trace dump file> <output file formatted as flow>

And for formatting the trace in the format (fmt):

db2trc fmt <trace dump file> <trace output in format>

Both flw and fmt format the same trace; they just give you a different view of the information.  Typically, an investigation involves looking at both flw and fmt in parallel.
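Putting the pieces together, here is a minimal end-to-end sketch for tracing a reproducible problem across a Big SQL cluster (buffer size and paths are illustrative):

rah "db2trc on -l 256m"

(reproduce the failing query or statement)

rah "db2trc dump /tmp/trace.dmp"

rah "db2trc off"

db2trc flw /tmp/trace.dmp /tmp/trace.flw

db2trc fmt /tmp/trace.dmp /tmp/trace.fmt

Dump ASAP after the failure so the circular buffer does not wrap past what you need, always turn the trace off, and note that the flw/fmt formatting runs against each host’s local dump file.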

 

Applying Big SQL v4.1 patches

Overview

This section has detailed instructions on how to apply a Big SQL special build in case you were given one by IBM Support. The instructions will help you update the binaries/libraries, while allowing you to roll back to the original binaries should you encounter blocking issues with the special build.

You should always stop Big SQL before installing, or rolling back, a special build.

Run all of these steps on every host where Big SQL is installed unless explicitly instructed otherwise.  You can run through the per-host instructions in parallel.

There are four major steps to installing a special build.  The first is to extract the DB2 image from the RPM.  The second is to lay down the new binaries/libraries from the DB2 image using installFixPack.  The third is to instruct the bigsql instance to use the new libraries.  The final step is to update your local RPM repository with the new patched RPM.

There are two major steps to rolling back a special build.  The first is to instruct the bigsql instance to use the original libraries.  The second is to roll back the application of the db2luw RPM in your local RPM repository.

These steps will be covered in detail in the section “Installing the special build”.

Requirements

  • IBM BigInsights 4.1.0.2 installation
  • Local RPM repository for BigInsights 4.1.0.2
  • Patched db2luw RPM
    • In this specific instance, it is db2luw-rhel6-x86_64-10.6.0.3-s160222-db2rpm-athens.rpm
  • Patched bigsql-dist RPM
    • In this specific instance, it is bigsql-dist_4_1_0_0-5.46.7-el6-x86_64.rpm

Preparing to install the special build 

Stop the BigInsights – BigSQL service in Ambari.

On each host:

  1. As the bigsql user, run db2level to determine where DB2 is installed, and what version is installed there. Validate that the informational tokens on your host match the tokens below.

 

[root@machine1 ~]# su - bigsql

[bigsql@machine1 ~]$ db2level

Server is running DB2 partitioned database environment(6 partitions)

DB21085I  This instance or install (instance name, where applicable: “bigsql”)

uses “64” bits and DB2 code release “SQL10063” with level identifier

“0704010E”.

Informational tokens are “DB2 v10.6.0.3”, “s151125”, “IP24010”, and Fix Pack

“3”.

Product is installed at “/usr/ibmpacks/bigsql/4.1/db2”.

  2. As the root user, create a backup of the existing DB2 installation in /usr/ibmpacks/bigsql/4.1/db2.orig. This backup will be used if you have to roll back the special build installation.

 

[root@machine1]# cp -rp /usr/ibmpacks/bigsql/4.1/db2/ /usr/ibmpacks/bigsql/4.1/db2.orig/
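Before relying on the copy, it may be worth confirming the filesystem had room for a second full installation tree; a quick check (paths from the step above):

du -sh /usr/ibmpacks/bigsql/4.1/db2/ /usr/ibmpacks/bigsql/4.1/db2.orig/

df -h /usr/ibmpacks

The db2.orig backup needs roughly the same amount of space as the original directory, and the two du figures should match.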

 

On the machine that hosts your local repository:

 

  1. Make a copy of your existing db2luw RPM from your local repository.
  2. Make a copy of your existing bigsql-dist RPM from your local repository.

 

Installing the special build

 

On each host:

  1. As the root user, unpack the RPM using the Linux tools rpm2cpio and cpio in a partition with at least 5 GB of free space.

 

[root@bdavm706 db2_server]# rpm2cpio db2luw-rhel6-x86_64-10.6.0.3-s160222-db2rpm-athens.rpm | cpio -idmv

 

Example Output:

./db2luw_4_1_0_2-x86_64-10.6.0.3.tar.gz

2862835 blocks

  2. As the root user, unpack the db2luw_4_1_0_2-x86_64-10.6.0.3.tar.gz file into a temporary location on each host in the cluster where BigSQL is installed.

Example: /tmp/db2_special

  3. Change directory to the server directory that was unpacked from the tarball.

cd /tmp/db2_special/server

  4. In your shell, export the following variable.

export SKIP_SHAREDFS_CHECK=YES

  5. Execute
    /tmp/db2_special/server/installFixPack -f level -f install -p /usr/ibmpacks/bigsql/4.1/db2 -b /usr/ibmpacks/bigsql/4.1/db2

    This command will result in the DB2 special build binaries/libraries being copied to /usr/ibmpacks/bigsql/4.1/db2. The installFixPack tool is told to allow the installation to proceed even though the same product level is already installed (-f level), and to skip trying to update local instances (-f install).
  6. Run the command
    /usr/ibmpacks/bigsql/4.1/db2/instance/db2iupdt -k bigsql

    This command will update the bigsql instance to use the new special build binaries/libraries. Example:
    [root@machine1 server]# export SKIP_SHAREDFS_CHECK=YES
    [root@machine1 server]# ./installFixPack -f level -f install -p /usr/ibmpacks/bigsql/4.1/db2 -b /usr/ibmpacks/bigsql/4.1/db2



    DBI1017I  installFixPack is updating the DB2 product(s) installed in
 location /usr/ibmpacks/bigsql/4.1/db2.
    DB2 installation is being initialized.
    Total number of tasks to be performed: 35
    Total estimated time for all tasks to be performed: 1528 second(s)
    Task #1 start

    [root@machine1 server]# /usr/ibmpacks/bigsql/4.1/db2/instance/db2iupdt -k bigsql
    DBI1446I  The db2iupdt command is running.
DB2 installation is being initialized.
  7. Run db2level (as the bigsql user). Make sure db2level confirms that the special build has been applied.
    [bigsql@machine1 ~]$ db2level
DB21085I  This instance or install (instance name, where applicable: “bigsql”) uses “64” bits and DB2 code release “SQL10063” with level identifier “0704010E”. Informational tokens are “DB2 v10.6.0.3”, “s160222”, “BSQL160222AMD64”, and Fix Pack “3”. Product is installed at “/usr/ibmpacks/bigsql/4.1/db2”.
  8. On the Big SQL head node only, after doing the preceding steps on all hosts:
    As the bigsql user, execute the following:
    db2 update dbm cfg using instance_usage bigsql

 

On the machine that hosts your local repository:

  1. Copy the patched db2luw RPM (db2luw-rhel6-x86_64-10.6.0.3-s160222-db2rpm-athens.rpm) to your existing local repository.
  2. Remove your old db2luw RPM.
  3. Copy the patched bigsql-dist RPM (bigsql-dist_4_1_0_0-5.46.7-el6-x86_64.rpm) to your existing local repository.
  4. Remove your old bigsql-dist RPM.
  5. Run createrepo <path> --update on the local repository directory to regenerate the RPM metadata.
    createrepo /path/to/local/repo --update

 

On each host:

  1. Clean the YUM cache after recreating the local repository
    yum clean all
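One way to sanity-check that the host now sees the refreshed repository (a standard yum command):

yum repolist

The BI-Ambari-4.1 repository should appear in the list with a non-zero package count.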

 

Restart the BigInsights – BigSQL service through Ambari.

 

  1. On the Big SQL head node only after doing the preceding steps on all hosts:
    As the bigsql user, execute the following:
    db2 activate db bigsql
  2. Start Big SQL

 

Validating the installation

 

Validate the installation (applies to installing and rolling back the special build)

 

  1. Execute rah “db2level”. All hosts should show the same level.
  2. Execute db2pd -member all. All the members should show up as Active.
  3. Execute db2_ps to validate that all nodes are running db2sysc and db2ckpwd processes.
  4. Execute db2pd -member all -db bigsql. All databases should show up as Active.

 

Rolling Back Special Build

 

Stop the BigInsights – BigSQL service in Ambari.

 

On each host:

  • As the bigsql user, run db2level to determine where DB2 is installed, and what version is installed there. Validate that the informational tokens on your host match the tokens below.
    [bigsql@machine1 ~]$ db2level
DB21085I  This instance or install (instance name, where applicable: “bigsql”) uses “64” bits and DB2 code release “SQL10063” with level identifier “0704010E”. Informational tokens are “DB2 v10.6.0.3”, “s160222”, “BSQL160222AMD64”, and Fix Pack “3”. Product is installed at “/usr/ibmpacks/bigsql/4.1/db2”.

    If the output looks like the original db2level output, go no further on this host.  The special build was not applied.
  • As the root user, copy the backup of the original DB2 installation from /usr/ibmpacks/bigsql/4.1/db2.orig to /usr/ibmpacks/bigsql/4.1/db2.
    [root@machine1]# cp -prf /usr/ibmpacks/bigsql/4.1/db2.orig/ /usr/ibmpacks/bigsql/4.1/db2/
  • In your shell, export the following variable.
    export SKIP_SHAREDFS_CHECK=YES
  • Run the command
    /usr/ibmpacks/bigsql/4.1/db2/instance/db2iupdt -f level -k bigsql

    This command will update the bigsql instance to use the original DB2 binaries/libraries.  The -f level parameter instructs the tool to allow a lower-level DB2 product to be installed. Example:
    [root@machine1 ~]# export SKIP_SHAREDFS_CHECK=YES
    [root@machine1 ~]# /usr/ibmpacks/bigsql/4.1/db2/instance/db2iupdt -f level -k bigsql

    DBI1070I  Program db2iupdt completed successfully.
  • Run db2level (as bigsql user). Expect output with the following informational tokens.
    [root@machine1 ~]# su - bigsql
    [bigsql@machine1 ~]$ db2level
Server is running DB2 partitioned database environment(6 partitions)
DB21085I  This instance or install (instance name, where applicable: “bigsql”)
uses “64” bits and DB2 code release “SQL10063” with level identifier
“0704010E”.
Informational tokens are “DB2 v10.6.0.3”, “s151125”, “IP24010”, and Fix Pack
“3”.
Product is installed at “/usr/ibmpacks/bigsql/4.1/db2”.

 

On the Big SQL head node only after doing the preceding steps on all hosts:

  1. As the bigsql user, execute the following:
    db2 update dbm cfg using instance_usage bigsql

 

On the machine that hosts your local repository:

  • Copy the original db2luw RPM to your existing local repository. Remove the patched db2luw RPM.
  • Copy the original bigsql-dist RPM to your existing local repository. Remove the patched bigsql-dist RPM.
  • Run createrepo <path> --update on the local repository directory to regenerate the RPM metadata.
  • createrepo /path/to/local/repo --update

 

On each host:

  • Clean the YUM cache after recreating the local repository.
    yum clean all

 

Restart the BigInsights – BigSQL service through Ambari.

 

On the Big SQL head node only after doing the preceding steps on all hosts:

  • As the bigsql user, execute the following:
    db2 activate db bigsql

 

Start the Big SQL service.

 

Troubleshooting tip:

 

If, after applying the patch or after adding a new Big SQL node, you get a -901 error, check the following and fix accordingly.
Cluster-wide, all bigsql nodes must have:

  1. Same db2level
    as bigsql user on the head node, run: db2_all “db2level”
  2. Same bigsql-dist level
    as root on all nodes, run: yum list installed | grep bigsql-dist*
  3. Database should be connectable on all bigsql nodes
    as bigsql user on the head node, run: db2_all “db2 connect to bigsql; db2 terminate”

 

Blog Author:

Zach Zakharian, BigInsights Support Team Lead. (kzakhar@us.ibm.com)

Thanks to the following individuals for their contributions to this Blog:

Seeling Cheung
Simon Harris
Lorysa Cornish
Jamie Nisbet
Mick Legare
Rajib Sarkar
Sripriya Srinivasan
Sampada Basarkar
