16 Replies. Latest reply on Apr 11, 2012 1:54 PM by 846231

    MKSYSB for Linux

    846231
      Hi All,

      Today we encountered data block corruption on our Linux server (a simple example of a disaster), and our critical application will not start up anymore :(
      We have a backup of the application, but after restoring it the error persists. I suspect the corruption is in the OS itself.

      How do we make a system backup (a counterpart to mksysb on AIX) to avoid this in the future, so that in case of OS data block corruption we can simply restore the system image?

      Can you share your experience of how you handle this kind of disaster?

      Thanks a lot.
        • 1. Re: MKSYSB for Linux
          Billy~Verreynne
          Backing up the system disk (or partition) is overkill IMO.

          A server should at minimum have 2 local drives as a h/w mirror for the system disk. This gives you a fair degree of redundancy should a disk fail. It is not expensive. It is easy to set up.

          Should both disks fail (unlikely, but it can happen), you simply pop the mirror disk from another server (and replace that with a brand new disk) and use that mirror disk to boot the failed server into single user mode, update the network config, and reboot into multi-user. A very fast way to repair a server's failed system disk.

          If you do go for a system disk backup, keep in mind that it needs to be a physical backup in order to provide a bootable image that can be restored onto a similar size disk. This also means that open files will be read (as part of the disk read) if the o/s on that disk is running, and they may be captured in an inconsistent state. So using something like dd on a running system is not a good idea.
          • 2. Re: MKSYSB for Linux
            846231
            Thanks Bill,


            Currently our system disk holds these filesystems:
            Filesystem    1024-blocks      Free %Used    Iused %Iused Mounted on
            /dev/hd4          6553600   2148940   68%    21060     5% /
            /dev/hd2          3932160   1053172   74%    57920    20% /usr
            /dev/hd9var        262144    125348   53%      632     3% /var
            /dev/hd3          4194304   4108884    3%      567     1% /tmp
            /dev/fwdump        131072    130724    1%        4     1% /var/adm/ras/platform
            /dev/hd1         12582912   5285904   58%   267302    19% /home
            /proc                   -         -    -         -     -  /proc
            /dev/hd10opt      5242880   2096140   61%    66056    13% /opt
            Suppose I have a new 1 TB disk mounted as /u01. How can I mirror the systems file to it, and if the system gets corrupted and won't boot, how can I point it to the mirror?


            Thanks a lot,
            • 3. Re: MKSYSB for Linux
              Dude!
              A disk mirror will only help you if the cause of the problem is physical disk damage. Having a good backup of the OS can be useful, but maintaining regular backups is most likely useless unless you can pinpoint the exact time a problem was introduced. If damage occurs, it is usually more efficient to reinstall the system on a new computer, which often includes an upgrade and a new installation.

              Regarding your problem, what clues exist that data block corruption is the source of your problem? What is the error you are experiencing? There is no such thing as a systems file. You will have to copy all your various partitions in order to get a functional system. However, if your problem is indeed data corruption, then copying the data will only duplicate your problem.

              There are several methods to back up a system. It all depends on your requirements, resources and technical experience. You are probably not aware of how many options exist, otherwise you wouldn't be asking about system backup.

              To answer your question: The equivalent of "mksysb for aix" is probably an LVM snapshot:
              http://tldp.org/HOWTO/LVM-HOWTO/snapshots_backup.html

              The easiest way to duplicate a complete disk is to shut down the system, restart from the installation DVD and use the dd command, e.g. "dd if=/dev/sda of=/dev/sdd", or copy partitions: "dd if=/dev/sda1 of=/dev/sdd1", or create an ISO image, e.g. "dd if=/dev/hda of=/home/mycd.iso"
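To make the dd approach concrete, here is a minimal sketch that can be tried safely, because it copies a scratch file instead of a real device. The names image_disk, source.img and backup.img are made up for the example. On a real server the source would be a device such as /dev/sda, read while booted from rescue media so that nothing on the disk is changing.

```shell
#!/bin/sh
# Sketch only: raw-copy a "disk" with dd and check that the copy is identical.
# Scratch files stand in for real devices so nothing on the machine is touched.

# image_disk SRC DST: copy SRC to DST block by block, then compare them.
image_disk() {
    dd if="$1" of="$2" bs=65536 2>/dev/null || return 1
    cmp -s "$1" "$2"
}

workdir=$(mktemp -d)
# 1 MiB of random data plays the role of the system disk.
dd if=/dev/urandom of="$workdir/source.img" bs=1024 count=1024 2>/dev/null

if image_disk "$workdir/source.img" "$workdir/backup.img"; then
    status="copy verified"
else
    status="copy failed"
fi
echo "$status"
rm -rf "$workdir"
```

Restoring is the same copy in the opposite direction, again from rescue media. Note that with real disks of different sizes the cmp step reports a difference at the end of the shorter one, so this check is only meaningful between same-size devices or files.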
              • 4. Re: MKSYSB for Linux
                Billy~Verreynne
                With mirroring I was referring to h/w mirror. Most servers (x86) today have an on-board RAID controller chipset.

                We use that to configure a mirror. The o/s sees a single disk (e.g. /dev/sda) instead of the 2 mirrored disks. So there's nothing to configure o/s wise.

                On newer server models the drive is usually hot swappable, allowing you to pull a mirror disk from the server bay while powered up and replace it with a new disk (the on-board RAID will then start rebuilding the mirror).

                If you do a physical backup, that requires a means to restore that backup - and how do you run the backup s/w to restore a system disk on a server that cannot boot as its system disk is broken? Do you now first pop that system disk into another server (requiring downtime there) in order to run the restore of that system disk there? Not really an efficient approach.

                The last time I did physical system disk backups was on mainframes. The backup was from disk to magnetic tape - and we could boot the mainframe h/w from the boot loader on the tape and run a restore like that, rebuilding the system disk. :-)

                Unless you do a physical system disk backup as a raw backup to a spare disk (exact disk duplicate, allowing you to use this as the new server boot disk), I do not see much use for using the backup and restore approach of system disks.
                • 5. Re: MKSYSB for Linux
                  846231
                  Thanks Dude/Bill,


                  We are just a small company with a tight budget, so Bill's solution is not viable :)

                  I think I will go with the dd command suggested by Dude.
                  > The easiest way to duplicate a complete disk is to shut down the system, restart from the installation DVD and use the dd command.
                  > e.g.
                  >
                  > 1. "dd if=/dev/sda of=/dev/sdd" or
                  > 2. copy partitions: "dd if=/dev/sda1 of=/dev/sdd1", or
                  > 3. create an ISO image, e.g. "dd if=/dev/hda of=/home/mycd.iso"

                  Question: how do I perform recovery if I use option #1, #2, or #3?


                  Thanks a lot,

                  Edited by: KinsaKaUy? on 11-Apr-2012 04:43
                  • 6. Re: MKSYSB for Linux
                    846231
                    Hi All,
                    > Regarding your problem, what clues exist that data block corruption is the source of your problem? What is the error you are experiencing? There is no such thing like a systems file. You will have to copy all your various partitions in order to get a functional system. However, if your problem is indeed data corruption, then copying the data will only duplicate your problem.
                    We had a maintenance power shutdown for 5 days last Holy Week. It seems the server was not used to long holidays: when we started it, it reported a missing file, or that it could not create a "socket" in the folder, or something like a permission error related to the Java and COBOL programs. The Java and COBOL compilers were installed in /opt. I cannot run fsck on /opt because it cannot be unmounted.

                    To my dismay, two (2) servers were affected. I cannot understand why they both hit this kind of error at the same time when we started our application.


                    Any idea why an application cannot create a socket? Normally, when you have done nothing to the application, you can blame it on disk corruption due to the long holiday shutdown. But I cannot understand why two of them were hit at the same time with the same error.

                    The error looks like this:
                    - 2011-02-07 14:10:16,238 [pool-2-thread-1] ERROR (cobol.host.SocketStrategy)
                    Unable to establish connection on port 6506 after waiting 20 seconds.
                    java.net.ConnectException: Cannot connect to socket caused by: No such file or directory
                    Maybe you are the angels sent from heaven to help me.

                    Thanks a lot

                    Edited by: KinsaKaUy? on 11-Apr-2012 04:56

                    Edited by: KinsaKaUy? on 11-Apr-2012 04:59

                    Edited by: KinsaKaUy? on 11-Apr-2012 05:01
                    • 7. Re: MKSYSB for Linux
                      Dude!
                      Duplicating your current system to another disk will only duplicate the problem.

                      Assuming that disk corruption is the cause of your problem is, in my opinion, a false assumption, and hence recovering your system disk is the wrong strategy.

                      You might want to find out the cause of your problem, e.g. networking, firewall or server configuration. From my experience, if you had a power outage then there is a chance that some network admin did not save their device configuration or your system startup is not configured properly.

                      I suggest you Google your error or check Oracle Support; you will find several links that may guide you to the right solution. For instance: CC&B Install issue:ClassNotFound RPCRouterServlet & libcobjvm_sun_150
                      • 8. Re: MKSYSB for Linux
                        846231
                        Is a socket affected by the network or firewall? I also found out that our DNS server has a problem. :(
                        > You might want to find out the cause of your problem, e.g. networking, firewall or server configuration. From my experience, if you had a power outage then there is a chance that some network admin did not save their device configuration or your system startup is not configured properly.
                        > or your system startup is not configured properly.
                        How can the system startup not be configured properly? We did not do anything on the server. We just booted it up, then started the application as we normally do. We even restarted it many times to fsck the file systems, which are all "clean". So I think you are right in the assumption that it is not data corruption ;)


                        Thanks Dude.
                        • 9. Re: MKSYSB for Linux
                          Billy~Verreynne
                          I do not see any evidence of disk corruption in that message.

                          It looks like a network related error. The "Cannot connect to socket caused by: No such file or directory" part of the error could be due to how the network interface layer works.

                          It may need to create a temp file (typically in the /tmp directory) for the socket it creates. That file could already exist (a stale file that cannot be overwritten or removed due to file permissions). Or it could be that the directory/file location used no longer exists, or that there is a file permission issue.
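The three file-related causes just listed (directory gone, no permission, leftover socket file) can be checked from the shell. This is a rough sketch only: check_socket_dir is a name made up for the example, and the directory you pass in has to be whatever location the application is actually configured to use, which the error message does not tell us.

```shell
#!/bin/sh
# Sketch only: check a socket directory for the file-related causes above.
# check_socket_dir is a made-up helper, not part of any product.
check_socket_dir() {
    dir=$1
    if [ ! -d "$dir" ]; then
        echo "missing: $dir does not exist"
        return 1
    fi
    if [ ! -w "$dir" ] || [ ! -x "$dir" ]; then
        echo "permissions: cannot create files in $dir"
        return 1
    fi
    # Look for leftover socket files (type s) anywhere under the directory.
    stale=$(find "$dir" -type s -print 2>/dev/null)
    if [ -n "$stale" ]; then
        echo "leftover socket files:"
        echo "$stale"
        return 2
    fi
    echo "ok: $dir exists, is writable, and holds no socket files"
}

# Demonstrate on a fresh scratch directory; point it at the real one instead.
demo=$(mktemp -d)
check_socket_dir "$demo"
```

Run it as the same user that starts the application, since the write test reflects the permissions of whoever runs the script (root passes it almost everywhere).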

                          It also could have nothing to do with files or directories. From an o/s kernel perspective, there is little difference between a file handle and a socket handle. You can pass a socket handle to an application and it can happily read and write using standard I/O calls, using that handle. (this is for example how the old-style SVR4 super internet daemon worked, passing the socket handle as stdin and stdout handles to a service process).

                          So file/dir errors (dealing with handles) are sometimes used to describe socket handle errors too - as the socket handle behaves the same way as a file handle when using standard I/O on it.

                          I suggest looking at the configuration files for the s/w being run - are the directories and files listed in the config valid for that platform?

                          Check whether connectivity on the applicable port works from that platform. Is it connecting to tcp port 6506 on localhost, or a remote server (IP)? Does that connectivity work? Is iptables (firewall) running and if so, how is it configured?

                          Are there any other similar app processes running? What do netstat and lsof show for these processes? (The lsof command is very useful to see what directories and files are being accessed.)
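As a starting point for the netstat and lsof questions, something along these lines would collect the basics for one port. It is a sketch under assumptions: port_lines is a helper invented here, the default of 6506 is simply the port from the error message, and either tool may be missing or may need root to show other users' processes.

```shell
#!/bin/sh
# Sketch only: collect what netstat and lsof know about one TCP port.
port=${1:-6506}   # 6506 is the port named in the error message

# port_lines PORT: keep netstat lines whose address ends in PORT.
# Linux prints addresses as host:port, AIX and BSD as host.port.
port_lines() {
    grep -E "[.:]$1[[:space:]]"
}

echo "== sockets using port $port =="
if command -v netstat >/dev/null 2>&1; then
    netstat -an | port_lines "$port" || echo "none found"
else
    echo "netstat not installed"
fi

echo "== processes holding port $port =="
if command -v lsof >/dev/null 2>&1; then
    lsof -i "tcp:$port" || echo "none found (or not permitted to see them)"
else
    echo "lsof not installed"
fi
```

If nothing is listening on the port while the application claims to be up, the problem is on the listening side rather than with the client that reports the connect error.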

                          And as Dude suggested, research the error on the net. It is very unlikely that you are the very first person to hit this specific error.

                          Also get the app developers to do better exception management and provide more meaningful call stacks and error traces. It is difficult to diagnose an error without that detail.
                          • 10. Re: MKSYSB for Linux
                            846231
                            Thanks Bill,

                            We tried creating /temp2 with chmod 777 and pointing the application to create its socket there, but the error still persists.
                            > I suggest looking at the configuration files for the s/w being run - are the directories and files listed in the config valid for that platform?
                            The error log does not specifically mention which file or folder it is writing to or creating, so I cannot check which one needs file permissions.
                            > Are there any other similar app processes running? What does netstat and lsof show for these processes? (the lsof command is very useful to see what directories and files are being accessed).
                            There should be no other processes running, since this server has just been rebooted, and we only start this single application, as we normally did last week before the shutdown.
                            > And as Dude suggested, research the error on the net. It is very unlikely that you are the very first person to hit this specific error.
                            Yes, our error is similar to the one Dude linked above, but that person was in the middle of a new installation, whereas we have been in production for a year. We still tried the suggestions there, but to no avail :(

                            Thanks a lot.
                            • 11. Re: MKSYSB for Linux
                              Dude!
                              If your application previously worked after a system restart and nothing was changed, I see no reason why it won't work after a power failure. Since you also have problems with other machines and services, I think it's obvious that you are experiencing an external issue. Whoever is in charge of your network infrastructure should verify or analyze routing, DNS resolution, firewall, and physical link negotiation. Like I mentioned before, a switch or router may have lost its configuration after a power failure.
                              • 12. Re: MKSYSB for Linux
                                846231
                                > It is difficult to diagnose an error without that detail.
                                :)
                                 -  2012-04-11 17:15:05,188 [Thread-13] INFO  (cobol.host.CobolHostStartup) Using active JVM count of 1 for remote cobol execution.
                                 -  2012-04-11 17:15:06,315 [JVM 1 INFO logger] INFO  (cobol.host.ProcessLogger) Remote JVM 1 started with arguments:  1 9905 9906 2
                                 -  2012-04-11 17:15:06,416 [JVM 1 INFO logger] INFO  (cobol.host.ProcessLogger)  -  2012-04-11 17:15:06,331 [main] INFO  (shared.environ.ApplicationProperties) loaded properties from resource spl.properties: {spl.runtime.cobol.sql.cursoredCache.maxRows=10, spl.tools.loaded.applications=base,ccb,cm, spl.runtime.cobol.sql.disableQueryCache=false, spl.runtime.utf8Database=true, spl.runtime.cobol.encoding=UTF8, spl.runtime.cobol.sql.cache.maxTotalEntries=1000, spl.runtime.cobol.cobrcall=false, spl.runtime.cobol.sql.fetchSize=150, spl.runtime.environ.init.dir=/ccbV210/TRUIBMSTG/etc, spl.runtime.sql.highValue=?, spl.runtime.service.extraInstallationServices=CILTINCP, spl.runtime.oracle.statementCacheSize=300}
                                 -  2012-04-11 17:15:06,517 [JVM 1 INFO logger] INFO  (cobol.host.ProcessLogger)  -  2012-04-11 17:15:06,372 [Remote JVM:1 Main ] INFO  (cobol.host.SocketStrategy) Socket strategy set to com.splwg.base.support.cobol.host.sockets.UnixDomainSocketStrategy
                                 -  2012-04-11 17:15:06,626 [JVM 1 INFO logger] INFO  (cobol.host.ProcessLogger)  -  2012-04-11 17:15:06,377 [Remote JVM:1 Main ] INFO  (host.sockets.UnixDomainSocketStrategy) Using this directory for socket files: /ccbV210/TRUIBMSTG/runtime
                                 -  2012-04-11 17:15:06,727 [JVM 1 INFO logger] INFO  (cobol.host.ProcessLogger) Remote JVM 1 listening for requests on port: 9906
                                 -  2012-04-11 17:15:08,372 [Thread-13] INFO  (support.context.ContextFactory) Done creating default context, time 362,874.814 ms(6 min 2.875 sec)
                                 -  2012-04-11 17:15:08,373 [Thread-13] INFO  (web.startup.DeferredXAIStartup) Done initializing XAI application context, time 362,906.814 ms(6 min 2.907 sec)
                                 -  2012-04-11 17:15:16,750 [JVM 1 ERROR logger] ERROR (cobol.host.ProcessLogger) java.lang.RuntimeException: No command runner was registered with this remote JVM after waiting 10000ms
                                 -  2012-04-11 17:15:16,751 [JVM 1 INFO logger] INFO  (cobol.host.ProcessLogger)  -  2012-04-11 17:15:16,750 [Remote JVM:1 Main ] INFO  (cobol.host.RemoteJVM) Shutting down loggers and exiting Remote JVM 1
                                 -  2012-04-11 17:15:16,851 [JVM 1 ERROR logger] ERROR (cobol.host.ProcessLogger)       at com.splwg.base.support.cobol.host.RemoteJVM.waitForServerToRegisterRunner(RemoteJVM.java:163)
                                 -  2012-04-11 17:15:16,960 [JVM 1 ERROR logger] ERROR (cobol.host.ProcessLogger)       at com.splwg.base.support.cobol.host.RemoteJVM.main(RemoteJVM.java:121)
                                 -  2012-04-11 17:15:46,365 [pool-2-thread-1] ERROR (cobol.host.OptimizedDataOutput) Exception flushing stream.
                                java.io.IOException: Error writing to socket caused by: A file or directory in the path name does not exist.
                                
                                        at com.splwg.base.support.cobol.host.sockets.UnixDomainSocketNative.writeToSocket(Native Method)
                                        at com.splwg.base.support.cobol.host.sockets.UnixDomainSocket.writeBuffer(UnixDomainSocket.java:64)
                                        at com.splwg.base.support.cobol.host.sockets.PipeSocket.writeBuffer(PipeSocket.java:113)
                                        at com.splwg.base.support.cobol.host.sockets.PipeSocket$PipeOutputStream.writeBufferAndClear(PipeSocket.java:279)
                                        at com.splwg.base.support.cobol.host.sockets.PipeSocket$PipeOutputStream.flush(PipeSocket.java:307)
                                        at java.io.DataOutputStream.flush(DataOutputStream.java:131)
                                        at com.splwg.base.support.cobol.host.OptimizedDataOutput.flush(OptimizedDataOutput.java:55)
                                        at com.splwg.base.support.cobol.host.OptimizedRemoteExecuterStub.sendRequestGetResponse(OptimizedRemoteExecuterStub.java:80)
                                        at com.splwg.base.support.cobol.host.OptimizedRemoteExecuterStub.invoke(OptimizedRemoteExecuterStub.java:58)
                                        at com.splwg.base.support.cobol.host.RemoteRunnerImpl.invoke(RemoteRunnerImpl.java:109)
                                        at com.splwg.base.support.cobol.host.RemoteJVMConnectionImpl.createRemoteRunner(RemoteJVMConnectionImpl.java:157)
                                        at com.splwg.base.support.cobol.host.RemoteJVMConnectionImpl.<init>(RemoteJVMConnectionImpl.java:76)
                                        at com.splwg.base.support.cobol.host.RemoteJVMFactoryImpl.addConnection(RemoteJVMFactoryImpl.java:80)
                                        at com.splwg.base.support.cobol.host.RotatingCommandRunnerProvider$ConnectionMonitor.addNecessaryConnections(RotatingCommandRunnerProvider.java:401)
                                        at com.splwg.base.support.cobol.host.RotatingCommandRunnerProvider$ConnectionMonitor.doHousekeeping(RotatingCommandRunnerProvider.java:330)
                                        at com.splwg.base.support.cobol.host.RotatingCommandRunnerProvider$ConnectionMonitor.run(RotatingCommandRunnerProvider.java:323)
                                        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:432)
                                        at java.util.concurrent.FutureTask$Sync.innerRunAndReset(FutureTask.java:295)
                                        at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:150)
                                        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$101(ScheduledThreadPoolExecutor.java:80)
                                        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.runPeriodic(ScheduledThreadPoolExecutor.java:157)
                                        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:181)
                                        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:665)
                                        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:690)
                                        at java.lang.Thread.run(Thread.java:810)
                                 -  2012-04-11 17:15:46,368 [pool-2-thread-1] ERROR (cobol.host.OptimizedRemoteExecuterStub) An exception occurred invoking remote command.
                                 -  2012-04-11 17:15:46,368 [pool-2-thread-1] INFO  (cobol.host.RemoteJVMConnectionImpl) Connection to JVM 1 being shunned
                                 -  2012-04-11 17:15:46,369 [pool-2-thread-1] ERROR (cobol.host.RemoteJVMConnectionImpl) An exception has occurred calling the remote JVM
                                 -  2012-04-11 17:15:46,369 [pool-2-thread-1] ERROR (cobol.host.RotatingCommandRunnerProvider) Caught exception in Remote JVM connection housekeeper: com.splwg.shared.common.LoggedException:
                                The following stacked messages were reported as the LoggedException was rethrown:
                                com.splwg.base.support.cobol.host.OptimizedRemoteExecuterStub.sendRequestGetResponse(OptimizedRemoteExecuterStub.java:80): An exception has occurred calling the remote JVM
                                com.splwg.base.support.cobol.host.OptimizedRemoteExecuterStub.sendRequestGetResponse(OptimizedRemoteExecuterStub.java:80): An exception occurred invoking remote command.
                                
                                The root LoggedException was: Exception flushing stream.
                                 -  2012-04-11 17:15:48,463 [JVM 2 INFO logger] INFO  (cobol.host.ProcessLogger) Remote JVM 2 started with arguments:  2 9905 9906 2
                                 -  2012-04-11 17:15:48,565 [JVM 2 INFO logger] INFO  (cobol.host.ProcessLogger)  -  2012-04-11 17:15:48,480 [main] INFO  (shared.environ.ApplicationProperties) loaded properties from resource spl.properties: {spl.runtime.cobol.sql.cursoredCache.maxRows=10, spl.tools.loaded.applications=base,ccb,cm, spl.runtime.cobol.sql.disableQueryCache=false, spl.runtime.utf8Database=true, spl.runtime.cobol.encoding=UTF8, spl.runtime.cobol.sql.cache.maxTotalEntries=1000, spl.runtime.cobol.cobrcall=false, spl.runtime.cobol.sql.fetchSize=150, spl.runtime.environ.init.dir=/ccbV210/TRUIBMSTG/etc, spl.runtime.sql.highValue=?, spl.runtime.service.extraInstallationServices=CILTINCP, spl.runtime.oracle.statementCacheSize=300}
                                 -  2012-04-11 17:15:48,666 [JVM 2 INFO logger] INFO  (cobol.host.ProcessLogger)  -  2012-04-11 17:15:48,513 [Remote JVM:2 Main ] INFO  (cobol.host.SocketStrategy) Socket strategy set to com.splwg.base.support.cobol.host.sockets.UnixDomainSocketStrategy
                                 -  2012-04-11 17:15:48,776 [JVM 2 INFO logger] INFO  (cobol.host.ProcessLogger)  -  2012-04-11 17:15:48,518 [Remote JVM:2 Main ] INFO  (host.sockets.UnixDomainSocketStrategy) Using this directory for socket files: /ccbV210/TRUIBMSTG/runtime
                                 -  2012-04-11 17:15:48,886 [JVM 2 INFO logger] INFO  (cobol.host.ProcessLogger) Remote JVM 2 listening for requests on port: 9906
                                 -  2012-04-11 17:15:58,706 [JVM 2 ERROR logger] ERROR (cobol.host.ProcessLogger) java.lang.RuntimeException: No command runner was registered with this remote JVM after waiting 10000ms
                                 -  2012-04-11 17:15:58,707 [JVM 2 INFO logger] INFO  (cobol.host.ProcessLogger)  -  2012-04-11 17:15:58,706 [Remote JVM:2 Main ] INFO  (cobol.host.RemoteJVM) Shutting down loggers and exiting Remote JVM 2
                                 -  2012-04-11 17:15:58,809 [JVM 2 ERROR logger] ERROR (cobol.host.ProcessLogger)       at com.splwg.base.support.cobol.host.RemoteJVM.waitForServerToRegisterRunner(RemoteJVM.java:163)
                                 -  2012-04-11 17:15:58,909 [JVM 2 ERROR logger] ERROR (cobol.host.ProcessLogger)       at com.splwg.base.support.cobol.host.RemoteJVM.main(RemoteJVM.java:121)
                                 -  2012-04-11 17:15:59,714 [pool-2-thread-1] ERROR (cobol.host.OptimizedRemoteExecuterStub) An exception occurred invoking remote command.
                                 -  2012-04-11 17:15:59,714 [pool-2-thread-1] INFO  (cobol.host.RemoteJVMConnectionImpl) Connection to JVM 2 being shunned
                                 -  2012-04-11 17:15:59,714 [pool-2-thread-1] ERROR (cobol.host.RemoteJVMConnectionImpl) An exception has occurred calling the remote JVM
                                 -  2012-04-11 17:15:59,715 [pool-2-thread-1] ERROR (cobol.host.RotatingCommandRunnerProvider) Caught exception in Remote JVM connection housekeeper: com.splwg.base.support.cobol.host.InputClosedException:
                                The following stacked messages were reported as the LoggedException was rethrown:
                                com.splwg.base.support.cobol.host.OptimizedRemoteExecuterStub.sendRequestGetResponse(OptimizedRemoteExecuterStub.java:83): An exception has occurred calling the remote JVM
                                com.splwg.base.support.cobol.host.OptimizedRemoteExecuterStub.sendRequestGetResponse(OptimizedRemoteExecuterStub.java:83): An exception occurred invoking remote command.
                                
                                The root LoggedException was: The input was closed.
                                Thanks a lot,
                                • 13. Re: MKSYSB for Linux
                                  846231
                                  Thanks dude,

                                  I just want to rule out a server startup issue, because I was the one who started the server and I might be blamed for the error :(
                                  > your system startup is not configured properly
                                  What do you mean by this? How can this happen?

                                  Edited by: KinsaKaUy? on 11-Apr-2012 06:25
                                  • 14. Re: MKSYSB for Linux
                                    Dude!
                                    ... your system startup is not configured properly
                                    Application startup after a system restart may fail if your system was not set up to automatically configure required components and account environments, for instance to start oracleasm, set ORACLE_HOME, disable SELinux, etc.
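A quick way to see whether the account environment is what the application expects is a small pre-flight check before starting it by hand. The sketch below is only an illustration: require_env is a made-up helper, and the variable list (ORACLE_HOME, PATH) is an assumption that should be replaced with whatever your installation really needs.

```shell
#!/bin/sh
# Sketch only: pre-flight checks before starting the application by hand.
# The variable list is an assumption; use whatever your application needs.

# require_env NAME...: report every listed variable that is unset or empty.
require_env() {
    missing=0
    for name in "$@"; do
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "not set: $name"
            missing=1
        fi
    done
    return $missing
}

require_env ORACLE_HOME PATH || echo "fix the environment before starting"

# getenforce ships with SELinux-enabled distributions; skip it elsewhere.
if command -v getenforce >/dev/null 2>&1; then
    echo "SELinux mode: $(getenforce)"
fi
```

Comparing this output between a login shell and the way the application is normally launched (init script, cron, su) often shows which setting went missing after the restart.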