2 Replies Latest reply: Feb 23, 2012 4:36 PM by 805322

    Solaris 8 login problems: no UTMPX, tmchild: exec service failed...etc

    805322
      Hi All,

      I have this really strange problem on a Solaris 8 server. I've tried searching on Google but didn't come across anything useful. Hopefully someone here has more experience and can share some of their knowledge.

      The basic problem is that users are unable to log into the Solaris 8 server after it has been up for a while (anything from a few hours to 10 hours). The server does not run any X servers.

      The symptoms are as follows:
      1. Server responds to pings from other machines
      2. rlogin to the server produces the following errors:
      No utmpx entry, you must exec "login" from lowest level "shell" or Protocol error
      3. rsh will sometimes work but not always (I use rsh to reboot the server when it works)
      4. login at the terminal at the physical server produces the following errors:
      tmchild: exec service failed, errno=5
      INIT: failed write of utmpx entry: "Co"
      5. the following errors are printed on the terminal prior to this happening:
      cannot open /var/spool/mqueue: Not a directory

      I have checked disk space on the server and all mounts are at most 40% full.

      For the utmpx error, the most commonly suggested fix seems to be to delete the /var/adm/utmpx file and create a new one. This only seems to delay the failure.
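
A quick check that follows from symptom 5 is whether the paths the system complains about are still directories. A minimal sketch, assuming a POSIX shell; check_dirs is a hypothetical helper name, and the paths in the usage note are only the ones mentioned in this thread:

```shell
#!/bin/sh
# Sketch: report whether each given path is still a directory.
# check_dirs is a hypothetical helper, not a Solaris command.
check_dirs() {
    bad=0
    for d in "$@"; do
        if [ -d "$d" ]; then
            echo "ok: $d"
        else
            echo "NOT A DIRECTORY: $d"
            bad=1
        fi
    done
    # Non-zero exit status if any path failed the test
    return $bad
}
```

For example, `check_dirs /var/spool/mqueue /var/adm` run as root while the server is still behaving normally, and again once logins start failing.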
        • 1. Re: Solaris 8 login problems: no UTMPX, tmchild: exec service failed...etc
          BryanWood
          What is the file size of utmpx and also wtmpx?

          Seems like something is corrupting utmpx/wtmpx, maybe an automated job that frequently does an "rsh" or "ssh".
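
Besides the raw size, it can help to see whether the file still holds a whole number of records, since a partial record would point to corruption. A minimal sketch, assuming the 372-byte record size used as BS in the rotation script further down in this reply (that value may differ on other releases) and a POSIX shell such as ksh, because the legacy Bourne /bin/sh lacks $(( )) arithmetic; check_records is a hypothetical helper name:

```shell
#!/bin/sh
# Sketch: report whether a utmpx/wtmpx file holds a whole number of records.
# RECSIZE=372 is an assumption borrowed from the BS value in the rotation script.
RECSIZE=${RECSIZE:-372}

check_records() {
    file=$1
    # wc -c prints the byte count; tr strips the padding some wc versions add
    size=$(wc -c < "$file" | tr -d ' ')
    count=$((size / RECSIZE))
    rest=$((size % RECSIZE))
    if [ "$rest" -eq 0 ]; then
        echo "$file: $size bytes, $count records, ok"
    else
        echo "$file: $size bytes, $count records, $rest stray bytes"
        return 1
    fi
}
```

For example, `check_records /var/adm/utmpx` and `check_records /var/adm/wtmpx`, run as root.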

          While the system is behaving normally (after you've removed the utmpx file as you mention in your post), and after say 1 hour (given you say it begins to fail after a few hours), run:
          root# last | more
          The above output should tell you which user is logging in and how frequently. To confirm this theory, look at the 3rd column, which is the IP address or DNS alias of the source machine performing the logins.

          Here is a Perl script that will roll up the last output:
          root#
          root# cat rollup.pl
          #!/usr/bin/perl
          use strict;
          my (%rollup) = ();
          open(LAST,"last|")
            or die "cannot execute last command: $!";
          while(<LAST>){
            next if (/ begins /);
            my @fields = split;
            $rollup{$fields[0]}{source}{$fields[2]}++;
            $rollup{$fields[0]}{all}++;
          }
          foreach my $user (sort
            {$rollup{$b}{all} <=> $rollup{$a}{all}}
            keys %rollup){
            next unless $user;
            print "user $user logins: $rollup{$user}{all}\n";
            foreach my $source (sort
              {$rollup{$user}{source}{$b} <=> $rollup{$user}{source}{$a}}
              keys %{$rollup{$user}{source}}){
              print "  $source logins: $rollup{$user}{source}{$source}\n";
            }
          }
          close(LAST);
          root#
          root# chmod +x rollup.pl
          root# ./rollup.pl
          user bryan logins: 38
            :0.0 logins: 24
            :0 logins: 14
          user root logins: 21
            192.168.1.107 logins: 9
            192.168.1.20 logins: 7
            192.168.1.128 logins: 4
            :0.0 logins: 1
          user reboot logins: 15
            boot logins: 15
          root#
          Another suggestion would be to save a copy of the problematic utmpx file, and try to read its entries with "od -c":
          root# cd /var/adm
          root# cp utmpx utmpx.save
          root# od -c "utmpx.save"
          Lastly, here is a script that truncates the wtmpx file, taken from http://www.linuxmisc.com/3-linux/0c72ad22d625e643.htm:
          #! /bin/sh - 
          # 
          # adm.weekly: once a week adm log rolling with wtmpx compression 
          # 
          PATH=/usr/bin:/bin:/usr/sbin 
          umask 022 
          LOG=wtmpx 
          DIR=/var/adm 
          cd $DIR || exit 1 
          for GEN in 4 3 2 1 0 
          do 
                  ROT=`expr $GEN + 1` 
                  test -f $LOG.$GEN && compress -f $LOG.$GEN 
                  test -f $LOG.$GEN.Z && mv $LOG.$GEN.Z $LOG.$ROT.Z 
          done 
          BS=372 
          SK=0 
          test -f $LOG.skip && SK=`cat $LOG.skip` 
          dd if=$LOG   of=$LOG.0 bs=$BS skip=$SK 2>/dev/null 
          cp $LOG.0 $LOG 
          SK=`wc -c <$LOG.0` 
          SK=`expr $SK / $BS` 
          echo "$SK" >$LOG.skip 
          chmod 644    $LOG 
          compress $LOG.0 
          #!/end 
          • 2. Re: Solaris 8 login problems: no UTMPX, tmchild: exec service failed...etc
            805322
            Thanks BryanWood. That was helpful. I've checked wtmpx and it's not very big, about 5MB. As far as I can tell remote logins do not run any automated jobs other than users manually transferring files.

            Incidentally, the server failed to boot a few times with an error message saying no boot media was found. I thought the disk drive might be dying, so I looked in the message log.
            I found a lot of SCSI messages like the following:

            SCSI: [ID 107833 kern.warning] WARNING: /pci@0,0/pci1000,f@12(ncrs0)
            SCSI: [ID 107833 kern.warning] WARNING: /pci@0,0/pci1000,f@12/sd@6,0 invalid reselection (6:0)
            SCSI transport failed: reason 'reset': retrying command
            SCSI transport failed: reason 'unexpected_bus_free': retrying command

            I am still looking up what these mean, but it would seem like the disk is failing. Can you (or anyone else) confirm?

            Thanks.
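
To get a sense of how often each kind of SCSI failure occurs, the messages log can be tallied by failure reason. A minimal sketch, assuming the log lines look like the ones quoted above and a POSIX shell; scsi_reasons is a hypothetical helper name, and the log path in the usage note is the usual Solaris location rather than something confirmed in this thread:

```shell
#!/bin/sh
# Sketch: count "SCSI transport failed" lines per reason, most frequent first.
# scsi_reasons is a hypothetical helper; pass it the path of a messages log.
scsi_reasons() {
    grep 'SCSI transport failed' "$1" |
        # Keep only the word that follows "reason '"
        sed "s/.*reason '\([a-z_]*\).*/\1/" |
        sort | uniq -c | sort -rn |
        # Normalise the padded uniq -c output to "count reason"
        awk '{print $1, $2}'
}
```

For example, `scsi_reasons /var/adm/messages`. A count that keeps growing between reboots would support the failing disk (or cabling or termination) theory, though it does not prove it.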