I've been the witness of many discussions about what benefit reorganizing a database has.
First in the days of old, when doing so would cost more more work then what was saved
Second when someone started using indexes and said "now we never need to reorganize our data again"
Again when someone said that reorganizing is just "reclaiming diskspace, and that's pretty cheap now"
There there was the "LMTS will reuse all the diskspace easily"
I've even heard someone use the excuse that: "The data is on a SAN, it will reorganize the data internally", and the latest one: "I'm using data compression, so there will be no wasted space"
I'm pretty sure with the move on SAP HANA, I'll soon hear the argument: "its in-memory computing"
But the fact of the matter is that for all databases, indexed or not, if there's a lot of changes in the data, it will get jumbled around. So there'll always be a need (how big will depend on the business) to reorganize the data IF one of the following conditions are true:
1) Reporting reads data sequentially.
2) Your index consists mainly of inserts in one side and thus becomes uneven.
3) You need to reclaim diskspace after deleting data
4) Select statements issued can make use of your data being sorted
5) You need to move data around
And there are other conditions too, but these are the ones I most commonly run into.
The reason for (1) is simply that once data has ben entered into an Oracle database, a highwater mark is set. And a sequential read will even borther to read the empty blocks (I have yet to figure out why this behavior is, but I'm sure there's a good reason to be found somewhere). Thus if you remove and insert a lot of data (like say during a BW load) into your database, you may end up with a lot of empty blocks that sometimes get read, when they dont have to. This will cause performance degredation. Arguably, you can also reclaim wasted space, but if you DO have all these changes continually, you'll likely not be saving a lot of space in the long run. Also, it's worth noting that HOW your inserts are made makes a difference as to weather or not the highwatermark is being pushed.
As for (2), one can make the argument that you really only need to reorganize your index. But in reality, an index is just a table too. And an index that is very heavily loaded with data on only one side (say financial postings by date, will likely consist mainly of newer and newer dates), will get slower and slower to use, as your index becomes deeper. Reorganizing the index will save you absolutely no space, but will likely give you a small performance boost. One could also make an argument that data sorted by date can benefit from partitioning. And this is oftentimes true, but that's a matter for another post.
(3) is pretty selfexplanatory. Oftentimes large dataloads, exports or other is run against a database just once. And if the diskspace is needed for other use, there's no reason why we as administrators shouldn't remove the excess empty space, and reclaiming space from the filesystem requires that the datafiles become smaller, for them to be smaller, we need to shrink (and move) the tables, for that to happen, we need to reorganize. This is the case even if you're using datacompression on your data or indices. Because it is passive cmompression. Once data gets deleted or moved, you will still have a hole in your block.
If at all possible, when someone comes to me in advance and tells me of their temporary need, I create a new tablespace on a new LUN, if this is possible, so that I can easily reclaim the space later, and even drop the LUN back into the diskpool.
For performance reasons, it's sometimes a good idea to reorg your data sorted, which leads to (4). Logically, if your select statement makes good use of data being sorted (say with a "between" statement), it would help if your data was sorted by whatever makes sense. The downside to reorganizing this way is that we as administrators needs to know a lot about the application and what querries are being run against it. Because what's good for the goose, might not be good for the gander. Sorting data one way, without looking at how it was previously sorted, might even cause a performance degradation.
Moving data (5) can frequently be done by setting a datafile offline, moving it, and bringing it online again in a new place. But this requires downtime (at the very least on the filelevel). With reorganization tools like Quest livereorg, or using a reasonably new version of Oracle, there's a redefinition package that will allow us as administrators to recreate a table in a different tablespace, easily. Moving individual datafiles (online) is still a problem, but even as painfully slow as it is, you can actually move the data pieces (tables) inside a datafile one at a time. If you need to. Once someone creates a product that can move ONLY the pieces inside a particular datafile, I'll be more than happy to endorse or even use their product. Meanwhile, for me, it's still wishfull thinking whenever I have to (online) move data to a new storage area or similar.
On top of all of this there's the whole issue of read vs write performance. What's more important ? There's the entire cost angle. If I spend 10 hours making the system save y number of CPU cycles/day, or x amount of I/O operations/day, is that profitable ? After a couple of months the users will get used to slow responsetimes anyway....
So in conclusion, yes, I can find many good arguments for reorganizing a database, but none of them are "because someone said it was a good idea". Reorganization is a reasonably big thing to do, and it will impact performance. But if used correctly, it can save your users a lot of time and griefs. Because as a strategic tool, reorganization can increase the performance of your system.
And now I'm off to report on the last reorganization effort :D
Thursday, December 6, 2012
Thursday, November 29, 2012
Program RSUSR003 reports security violation
I got this alarm on a system recently. And it really annoyed me, because I'd not seen this one before.
So I looked at the source of the alarm. It was a syslog message, and it was pointing to a user and a report. So presumably, running the report would generate the error. I easily replicated this in another environment, by simply running the report.
So I contacted the user and asked what was this all about, and apparently, during an audit someone suggested running this report on a regular basis. And the report is nice. It shows the status of default users, the settings of the passwordlength and complexity etc.
The only problem is that if our setup varies from how the report thinks the world should look, apparently I will get this error in my syslog. And as I cant really go around enforcing a different password policy on the SAP system than on all the other systems, I had to make the error go away.
So setting out to do this, I looked again in the syslog. The error code was E03, meaning in area E0 subid 3. I opened the table TSL1T and looked at area E0.
So it would appear that this particular error only occurs "when expected", as it is called from within reports. So my solution was to simply remove the E03 from the syslog filter in my solution manager alerting... However, there's no "where used" function for syslog messages, so I might be wrong about the severity, but I doubt it.
So I looked at the source of the alarm. It was a syslog message, and it was pointing to a user and a report. So presumably, running the report would generate the error. I easily replicated this in another environment, by simply running the report.
So I contacted the user and asked what was this all about, and apparently, during an audit someone suggested running this report on a regular basis. And the report is nice. It shows the status of default users, the settings of the passwordlength and complexity etc.
The only problem is that if our setup varies from how the report thinks the world should look, apparently I will get this error in my syslog. And as I cant really go around enforcing a different password policy on the SAP system than on all the other systems, I had to make the error go away.
So setting out to do this, I looked again in the syslog. The error code was E03, meaning in area E0 subid 3. I opened the table TSL1T and looked at area E0.
So it would appear that this particular error only occurs "when expected", as it is called from within reports. So my solution was to simply remove the E03 from the syslog filter in my solution manager alerting... However, there's no "where used" function for syslog messages, so I might be wrong about the severity, but I doubt it.
Tuesday, November 13, 2012
Automating windows tasks for SAP systems
It always bothered me that SAP systems are so horribly bad at cleaning up after themselves. Especially anything that runs on Java, windows or is a thridparty plugin/adaptor/app/whatever.And given that I in many cases dont even have an SAP/ABAP backend to program a cleanup in, I had to make due with regular windows expressions and commands.
So the other day, when I had a javabased plugin, running on a windows system that was creating about 500 logfiles/day, I figured, enough is enough, I dont WANT anymore alerts on semifull disks, bad performance or requests for defrag etc. etc. caused by millions of small files taking up all the space on the system.
So I decided to create a small script that would clean up all of the subfolders that contained the offending files.
I created a file called "cleanup.cmd" and placed in on the servers desktop.
In the file, I put a lot of lines that would uniquely identify the files that I was looking for, lines like this:
forfiles /P "<absolute_path_without_wildcards>" /S /M <mask> /D -<age> /C "cmd /c echo @path"
Once each line selected the stuff I wanted from the directories (and with the /S also subdirectories) I wanted, I replaced the "cmd /c echo @path" with "cmd /c del @path" and saved the file.
Then it's just a matter of scheduling it in the windows scheduler to run on a monthly basis.
At some point in time, I'll take the time to create a script that'll be able to accept a parameter or two so that it'll clean up each of the different types of systems I manage, that does not clean up all of its own work- and logfiles (XI, gw connector, charm, TREX, livereorg, redwood, BPC, SBOP, sapconsole etc....). It would be so nice if I could just maintain ONE script.... Oh, the hazzards of automation :D
Wednesday, November 7, 2012
Dependent module libclntsh.a(shr.o) could not be loaded
I've found this error "Dependent module libclntsh.a(shr.o) could not be loaded" on many occasions, but usually only when the Oracle client or version in general had been changed. And often in conjuction with a faling "magic number". Magic being somewhat non-descriptive, when they only mean 32 or 64 bit.... But nevermind teasing the developers ;)
Usually, it's caused by the incorrect setting of the LIBPATH (for AIX), SHLIB_PATH (for HP-UX) or similar library path.
On occasion I've even had to relink my Oracle to get the libraries running correctly again.
So let me just set things straight:
On an SAP system, the user ORA<SID> needs to have the vaiable set to include <path to SYS/exe/run> and /oracle/<SID>/<version>/lib (and remember to differentiate between 32 and 64 bit libraries, if you're using both).
The user <SID>ADM needs the varibale to point to /oracle/client/<version>/instantclient (assuming you're on Oracle version 10 or newer and thus using the instantclient)
The other day, however, I did notice problems that neither spelling errors, incorreclty set variable or even a relink couldn't fix. And it took me a while to find the cause of the problem. The system had been unicode converted, oracle upgraded, had a new client installed, and even got an SAP enhancement package installed. So there were a lot of stuff that could have gone wrong. But for some reason a few (but not all) of the systems would throw the error "Dependent module libclntsh.a(shr.o) could not be loaded" when I tried running brconnect, brspace or other similar calls. Whenever the caller was the <SID>ADM it wouldnt work....
I struggled for a while, especially because the shr.o reference (rather than shr_64.o) would seem to indicate that a 32 bit module was being attempted loaded. And since I was running Oracle 11.2 64 bit, then ofcourse there'd be no 32 bit library available. But I followed the traces around, and finally I found the culprit. In the instantclient directory, there was no softlink named instantclient. There was only the instantclient_<version> directory, and the LIBPATH pointed directly to this. That meant that any test I would subject it to would pass, but for some reason there must be someting hardcoded in the SAP kernel that points to the instastantclient directory/softlink despite setting the variable LIBPATH.....
I quickly created a new softlink, and lo and behold. Everything was working like a charm again. On the upside I got to clean up all the old .dbenv files and obsolete environment variables on the failing systems ;)
Usually, it's caused by the incorrect setting of the LIBPATH (for AIX), SHLIB_PATH (for HP-UX) or similar library path.
On occasion I've even had to relink my Oracle to get the libraries running correctly again.
So let me just set things straight:
On an SAP system, the user ORA<SID> needs to have the vaiable set to include <path to SYS/exe/run> and /oracle/<SID>/<version>/lib (and remember to differentiate between 32 and 64 bit libraries, if you're using both).
The user <SID>ADM needs the varibale to point to /oracle/client/<version>/instantclient (assuming you're on Oracle version 10 or newer and thus using the instantclient)
The other day, however, I did notice problems that neither spelling errors, incorreclty set variable or even a relink couldn't fix. And it took me a while to find the cause of the problem. The system had been unicode converted, oracle upgraded, had a new client installed, and even got an SAP enhancement package installed. So there were a lot of stuff that could have gone wrong. But for some reason a few (but not all) of the systems would throw the error "Dependent module libclntsh.a(shr.o) could not be loaded" when I tried running brconnect, brspace or other similar calls. Whenever the caller was the <SID>ADM it wouldnt work....
I struggled for a while, especially because the shr.o reference (rather than shr_64.o) would seem to indicate that a 32 bit module was being attempted loaded. And since I was running Oracle 11.2 64 bit, then ofcourse there'd be no 32 bit library available. But I followed the traces around, and finally I found the culprit. In the instantclient directory, there was no softlink named instantclient. There was only the instantclient_<version> directory, and the LIBPATH pointed directly to this. That meant that any test I would subject it to would pass, but for some reason there must be someting hardcoded in the SAP kernel that points to the instastantclient directory/softlink despite setting the variable LIBPATH.....
I quickly created a new softlink, and lo and behold. Everything was working like a charm again. On the upside I got to clean up all the old .dbenv files and obsolete environment variables on the failing systems ;)
Friday, November 2, 2012
Incorrect Timezone for the US after DST changed
One can wonder why SAP just doesn't put this information into a support package or something..... But it's ever a problem after The last sunday of October, when most of the world with DST changes time. Because since 2007, the US been changing on the first sunday of November... EST changes on another date than CET, GMT etc.
If you're system is running UTC, the difference drops by one hour (to normal time) whenever DST ends. It's just not the same day everywhere in the world...
When you're logged onto an SAP system, this can then give you some issues, as SAP still thinks that the US changes same as the rest of the world, because transaction STZBC hasn't been updated with the correct information
So for all lazy sysadmins getting stuck with problems about the US being one hour "off key" in november.... Tell your developers to maintan the variable DST settings in SAP... After all, the system is doing exactly what it's supposed to ;)
If you're system is running UTC, the difference drops by one hour (to normal time) whenever DST ends. It's just not the same day everywhere in the world...
When you're logged onto an SAP system, this can then give you some issues, as SAP still thinks that the US changes same as the rest of the world, because transaction STZBC hasn't been updated with the correct information
So for all lazy sysadmins getting stuck with problems about the US being one hour "off key" in november.... Tell your developers to maintan the variable DST settings in SAP... After all, the system is doing exactly what it's supposed to ;)
Resetting the OPS$ Mechanism
It always bothered me that not only is it possible to screw up the OPS$ mechanism in SAP, when running Unix/Oracle flavor. It's also quite easy. And then there's one thing wrong, then another, and the BRTOOLS just dont really cut it when it comes to fixing the errors.... It's always manual labor. And especially whenever someone is doing a system restore, homogenous system copy or similar, there's always problems afterwards.
So I've created a little script that resets all of it for me (and it also creates an ops$ mechanism for the Oracle user so that I can run backups with "-u /" istead of specifying the password, or storing the password in a plain text file).
As ora<SID> I go to sqlplus / as sysdba and issue the commands:
drop user "OPS$<old_sid>adm" cascade; #In case of homogenous system copies, remove incorrect SIDs
drop user "OPS$ora<old_sid>" cascade; #In case of homogenous system copies, remove incorrect SIDs
drop user "OPS$<new_sid>adm" cascade; #Regardless of error, remove the current SID too
drop user "OPS$ora<new_sid>" cascade; #Regardless of error, remove the current SID too
drop table SAPSR3.sapuser; ## possibly other schemaowner, like SAP<SID>, <SID>R3, SAPR3
drop public synonym sapuser;
create user "OPS$<new_sid>adm" default tablespace <<usually psapsr3usr>> temporary tablespace <<usually psaptemp>> identified externally;
grant connect, resource to "OPS$<new_sid>adm";
create table "OPS$<new_sid>adm".sapuser ( USERID VARCHAR2(256), PASSWD VARCHAR2 (256));
insert into "OPS$<new_sid>adm".sapuser values ('SAPSR3', 'sap'); #NOTE that the username and password ofcourse must reflect the schemaowner and current password
alter user sapsr3 identified by sap; #and here the password must be set again
create public synonym sapuser for OPS$<new_sid>adm.sapuser;
create USER "OPS$ora<new_sid>" default tablespace system temporary tablespace <<usually PSAPTEMP>> identified externally;
grant DBA, CONNECT to "OPS$ora<new_sid>";
grant select, update on sapuser to "OPS$<new_sid>adm";
Test that this works by logging in as <sid>adm and issuing the command
Or as ora<sid>:
This little snippet of code has saved me a lot of time once in a while. Sometimes I even think about making it remote executable, so I dont have to actually log on to a system whenever someone calls me with transport issues or other funny stuff following a bungled homogenous system copy.
So I've created a little script that resets all of it for me (and it also creates an ops$ mechanism for the Oracle user so that I can run backups with "-u /" istead of specifying the password, or storing the password in a plain text file).
As ora<SID> I go to sqlplus / as sysdba and issue the commands:
drop user "OPS$<old_sid>adm" cascade; #In case of homogenous system copies, remove incorrect SIDs
drop user "OPS$ora<old_sid>" cascade; #In case of homogenous system copies, remove incorrect SIDs
drop user "OPS$<new_sid>adm" cascade; #Regardless of error, remove the current SID too
drop user "OPS$ora<new_sid>" cascade; #Regardless of error, remove the current SID too
drop table SAPSR3.sapuser; ## possibly other schemaowner, like SAP<SID>, <SID>R3, SAPR3
drop public synonym sapuser;
create user "OPS$<new_sid>adm" default tablespace <<usually psapsr3usr>> temporary tablespace <<usually psaptemp>> identified externally;
grant connect, resource to "OPS$<new_sid>adm";
create table "OPS$<new_sid>adm".sapuser ( USERID VARCHAR2(256), PASSWD VARCHAR2 (256));
insert into "OPS$<new_sid>adm".sapuser values ('SAPSR3', 'sap'); #NOTE that the username and password ofcourse must reflect the schemaowner and current password
alter user sapsr3 identified by sap; #and here the password must be set again
create public synonym sapuser for OPS$<new_sid>adm.sapuser;
create USER "OPS$ora<new_sid>" default tablespace system temporary tablespace <<usually PSAPTEMP>> identified externally;
grant DBA, CONNECT to "OPS$ora<new_sid>";
grant select, update on sapuser to "OPS$<new_sid>adm";
Test that this works by logging in as <sid>adm and issuing the command
R3trans -x
Or as ora<sid>:
sqlplus / as sysdba
connect /
exit This little snippet of code has saved me a lot of time once in a while. Sometimes I even think about making it remote executable, so I dont have to actually log on to a system whenever someone calls me with transport issues or other funny stuff following a bungled homogenous system copy.
Thursday, November 1, 2012
Scripting finding hidden columns
Any good sysadmin, is a lazy sysadmin. The less work we have, the better the systems are running.
So the other day I was tasked with identifying hidden columns in an Oracle database. I decided to automate the process. It was quite simple.
I create a directory <mydir> for holding the script
I then created an sql script find_hidden.sql in the directory containing the following lines of code:
SET PAGESIZE 1000
SET LINESIZE 200
COLUMN owner FORMAT a10
COLUMN hidden FORMAT a8
spool /<mydir>/found_hidden.txt
select
a.owner, a.table_name, a.hidden_column hidden,sum(b.bytes)/(1024*1024) size_MB
from all_tab_cols a
join dba_segments b on a.table_name = b.segment_name
where a.hidden_column = 'YES' and a.owner != 'SYS' and b.segment_type='TABLE'
group by a.owner,a.table_name,a.hidden_column;
spool off
exit
The astoute reader will recognize the script from my earlier writings, with the catch that I'm spooling the results to a file.
This creates a file with the data. Now for creating a script that will mail me the file, if I need it.....
So I create a shellscript mail_hidden.sh, with the following lines of code:
#!/bin/sh
sqlplus / as sysdba @/<mydir>/find_hidden.sql
myfilesize=$(ls -lart /<mydir>/found_hidden.txt | awk '{print $5}')
if [ $myfilesize > 20 ];
then
cat /<mydir>/found_hidden.txt | mailx -s "Found hidden columns in Name of system" <my_mail_adress>
fi
rm /<mydir>/found_hidden.txt
Note that the script will determine if the filesize of the output file is lager than a few bytes..."no rows selected" statement is about 19 bytes ;)
So I just run this on my database(s) every month or so.
Next step will be to automatically run an online reorg of the objects during low activity periods. Because I know I dont want to spend my weekends doing that for the forseeable future.
So the other day I was tasked with identifying hidden columns in an Oracle database. I decided to automate the process. It was quite simple.
I create a directory <mydir> for holding the script
I then created an sql script find_hidden.sql in the directory containing the following lines of code:
SET PAGESIZE 1000
SET LINESIZE 200
COLUMN owner FORMAT a10
COLUMN hidden FORMAT a8
spool /<mydir>/found_hidden.txt
select
a.owner, a.table_name, a.hidden_column hidden,sum(b.bytes)/(1024*1024) size_MB
from all_tab_cols a
join dba_segments b on a.table_name = b.segment_name
where a.hidden_column = 'YES' and a.owner != 'SYS' and b.segment_type='TABLE'
group by a.owner,a.table_name,a.hidden_column;
spool off
exit
The astoute reader will recognize the script from my earlier writings, with the catch that I'm spooling the results to a file.
This creates a file with the data. Now for creating a script that will mail me the file, if I need it.....
So I create a shellscript mail_hidden.sh, with the following lines of code:
#!/bin/sh
sqlplus / as sysdba @/<mydir>/find_hidden.sql
myfilesize=$(ls -lart /<mydir>/found_hidden.txt | awk '{print $5}')
if [ $myfilesize > 20 ];
then
cat /<mydir>/found_hidden.txt | mailx -s "Found hidden columns in Name of system" <my_mail_adress>
fi
rm /<mydir>/found_hidden.txt
Note that the script will determine if the filesize of the output file is lager than a few bytes..."no rows selected" statement is about 19 bytes ;)
So I just run this on my database(s) every month or so.
Next step will be to automatically run an online reorg of the objects during low activity periods. Because I know I dont want to spend my weekends doing that for the forseeable future.
Does not match Component version
One of my clients have a SAP GRC Access Control application installed on their system.
But after upgrading the SAP system to EHP6, I ran into a problem with the transport system suddenly showing the status of transports from the dev system as "Does not match component version". Upon checking, I found the cause to be that the VIRSANH 530_700 component wasn't a part of the test system I'm trying to import the transports into.
Apparently, the customer only wants this component on the development system, as they only use the functionality in the development system.
But after upgrading the SAP system to EHP6, I ran into a problem with the transport system suddenly showing the status of transports from the dev system as "Does not match component version". Upon checking, I found the cause to be that the VIRSANH 530_700 component wasn't a part of the test system I'm trying to import the transports into.
Apparently, the customer only wants this component on the development system, as they only use the functionality in the development system.
So, without the option to install the component on the other systems to get the component versions identical, I had to set the parameter SP_TRANS_SYNC = OFF in the transport tool. The major downside is that this will ignore ALL component version problems, not just the ones affecting the GRC access control.
Wednesday, October 31, 2012
Identifying Hidden Columns in Oracle
Sometimes an Oracle system will start having some hidden columns. This is quite normal behaviour if views or dependants change or are dropped.
However, for an SAP system, this is not desirable behaviour, so I sometimes have to remove hidden columns.
The downside, though, is that first I have to find them. Fancy scripts exists for this, but I find that a single line of code and some formatting is enough.
SET PAGESIZE 1000
SET LINESIZE 200
COLUMN owner FORMAT a10
select distinct owner,table_name,hidden_column from all_tab_cols where hidden_column = 'YES' and owner != 'SYS';
This excludes the sys objects that are not really manipulated by SAP anyway. So the only thing left is to reorganize the tables listed.... I'll get that done this weekend.... Right after the gym....
However, for an SAP system, this is not desirable behaviour, so I sometimes have to remove hidden columns.
The downside, though, is that first I have to find them. Fancy scripts exists for this, but I find that a single line of code and some formatting is enough.
SET PAGESIZE 1000
SET LINESIZE 200
COLUMN owner FORMAT a10
select distinct owner,table_name,hidden_column from all_tab_cols where hidden_column = 'YES' and owner != 'SYS';
This excludes the sys objects that are not really manipulated by SAP anyway. So the only thing left is to reorganize the tables listed.... I'll get that done this weekend.... Right after the gym....
Ora-26040 block corruption
I never saw this one before today. And I really wondered for a while what it could be. It was an Oracle error ora-1578 wrapped around an ora-01110, wrapped around an ora-26040 wrapped around an ora-06512 etc. etc. It was a LOOOONG statement. But it was the error 26040 that was new to me, and that I saw as the point of attack.
And the funny part was that this error didnt even show up during backup runs or similar. I just got alerted to some shortdumps in the SAP BW system....
So my first step was ofcourse to check how many block were corrupted, and if they were located in temporary, index or other easily recreateable blocks.
I quickly logged on to the system as Oracle administrator, and logged on to the RMAN interface and ran this querry: backup validate check logical database;
Much to my dismay, upon looking over the results (in SQLPLUS: select * from v$database_block_corruption;) I found more than 23000 corrupt blocks, even the mere thought of restoring this many blocks were disheartening to me. And this many corrputions means that there's something causing the errors, something that I needed to fix before getting rid of the corruption. So I decided to investigate the cause of the corruption.
What I found was slightly surprising. When I went through my list, I noticed that all of the corruptions were of the type "nologging". So I decided this might be the clue I needed and started digging. In the end it turned out that a BW system is loaded with a lot of reporting data from various sources. And someone had started a load with the option "nologging", which means that the undo segments will not be enabled. This is great if you're loading data larger than what can be put in the undo area, but for daily operations, not so much. As it happens, once a load done in nologging mode gets cancelled, there's no option to do a rollback, so the block written get marked as corrupt, if the transaction cancels (which one of them did)
That also explained why I hadn't seen this error before, noone in their right mind should load business critical data without the option to recover :)
It all turned out well, as the developers got the information and started doing the loads correctly, and I just restored the database...
And the funny part was that this error didnt even show up during backup runs or similar. I just got alerted to some shortdumps in the SAP BW system....
So my first step was ofcourse to check how many block were corrupted, and if they were located in temporary, index or other easily recreateable blocks.
I quickly logged on to the system as Oracle administrator, and logged on to the RMAN interface and ran this querry: backup validate check logical database;
Much to my dismay, upon looking over the results (in SQLPLUS: select * from v$database_block_corruption;) I found more than 23000 corrupt blocks, even the mere thought of restoring this many blocks were disheartening to me. And this many corrputions means that there's something causing the errors, something that I needed to fix before getting rid of the corruption. So I decided to investigate the cause of the corruption.
What I found was slightly surprising. When I went through my list, I noticed that all of the corruptions were of the type "nologging". So I decided this might be the clue I needed and started digging. In the end it turned out that a BW system is loaded with a lot of reporting data from various sources. And someone had started a load with the option "nologging", which means that the undo segments will not be enabled. This is great if you're loading data larger than what can be put in the undo area, but for daily operations, not so much. As it happens, once a load done in nologging mode gets cancelled, there's no option to do a rollback, so the block written get marked as corrupt, if the transaction cancels (which one of them did)
That also explained why I hadn't seen this error before, noone in their right mind should load business critical data without the option to recover :)
It all turned out well, as the developers got the information and started doing the loads correctly, and I just restored the database...
Friday, October 26, 2012
Orphaned Oracle processes
For some unfathomable reason, the backup system here at work seems to stop running once in a while.
This causes the RMAN backup to stop (but not fail), which causes the underlying Oracle shadow processes to hang, because they assume they'll get to continue their work later.
The proceses takes up large amounts of memory and along with various resource locks, starts to cause a bother for the users. So we have to identify and kill those processes.
So I log on as the DB owner and run sqlplus, where I issue this command:
This will display all spids in the v$process that does not have a parrent session.
Then I simply try the unix commands "kill -3" (quit), if this fails then "kill -1" (hangup) and if this fails "kill -9" (abort)
I did, however, at a later date find a nice note indicating that as of Oracle 10.2, there is an alternative way to kill sessions that will also kill the OS process (from sqlplus):
The POST_TRANSACTION clause waits for ongoing transactions to complete before disconnecting the
session, while the IMMEDIATE clause disconnects the session and ongoing transactions are
recovered immediately.
So I'll try the latter method next time this happens, as this would likely cause less collateral damage in the system....
This causes the RMAN backup to stop (but not fail), which causes the underlying Oracle shadow processes to hang, because they assume they'll get to continue their work later.
The proceses takes up large amounts of memory and along with various resource locks, starts to cause a bother for the users. So we have to identify and kill those processes.
So I log on as the DB owner and run sqlplus, where I issue this command:
SELECT spid FROM v$process WHERE NOT EXISTS
( SELECT 1 FROM v$session WHERE paddr = addr);
This will display all spids in the v$process that does not have a parrent session.
Then I simply try the unix commands "kill -3" (quit), if this fails then "kill -1" (hangup) and if this fails "kill -9" (abort)
I did, however, at a later date find a nice note indicating that as of Oracle 10.2, there is an alternative way to kill sessions that will also kill the OS process (from sqlplus):
ALTER SYSTEM DISCONNECT SESSION ’sid,serial#’ POST_TRANSACTION;
ALTER SYSTEM DISCONNECT SESSION ’sid,serial#’ IMMEDIATE;
The POST_TRANSACTION clause waits for ongoing transactions to complete before disconnecting the
session, while the IMMEDIATE clause disconnects the session and ongoing transactions are
recovered immediately.
So I'll try the latter method next time this happens, as this would likely cause less collateral damage in the system....
Subscribe to:
Posts (Atom)