Tuesday, March 27, 2012

Can't change TCP Port using configuration manager for x64 SQL 2005 SP1 virtual instance

Hi, I hope someone can help me! I changed the TCP port that the default instance of SQL 2005 x64 SP1 is listening on and now SQL server won't start as the port is being used - no problem I say to myself, I'll just change it again, however I can't!

My problem is that I can not change the TCP Port back again using Configuration Manager - no matter what values I put in, after a restart of the instance it comes up with this in the IP Addresses tab for TCP/IP Properties and SQL Server won't start:

IP1
Active Yes
Enabled No
IP Address aaa.bbb.ccc.111
TCP Dynamic Ports
TCP Port 0

IP2
Active Yes
Enabled No
IP Address aaa.bbb.ccc.222
TCP Dynamic Ports
TCP Port 0

IP3
Active Yes
Enabled No
IP Address aaa.bbb.ccc.333
TCP Dynamic Ports
TCP Port 0

IP4
Active Yes
Enabled No
IP Address 10.1.1.1
TCP Dynamic Ports
TCP Port 0

IP5
Active Yes
Enabled No
IP Address 127.0.0.1
TCP Dynamic Ports
TCP Port 0

IPALL
TCP Dynamic Ports 2020
TCP Port 2020

Interestingly, the ip address that the server is listening on aaa.bbb.ccc.444 is not reported in the list of IP addresses in the configuration manager for this instance, yet it is when I view the available IP addresses for the other two instances - is this significant?

I have tried failing to the other node, changing it there, changing it at the console and also while logged onto the the virtual machine remote desktop, rebooting both nodes, etc, etc. There are two other instances on this cluster that are still working fine but I hvae not attempted to change the TCP ports they listen on.

I used the same method to change this default instance as I did for a named instance on a similarly configured cluster and it worked fine. How to fix this?

Regards, Mike

2007-02-27 12:03:17.69 Server Server is listening on [ aaa.bbb.ccc.444<ipv4> 2020].
2007-02-27 12:03:17.71 spid9s Starting up database 'tempdb'.
2007-02-27 12:03:17.72 Server Error: 26023, Severity: 16, State: 1.
2007-02-27 12:03:17.72 Server Server TCP provider failed to listen on [ aaa.bbb.ccc.444<ipv4> 2020]. Tcp port is already in use.
2007-02-27 12:03:17.75 Server Error: 17182, Severity: 16, State: 1.
2007-02-27 12:03:17.75 Server TDSSNIClient initialization failed with error 0x2740, status code 0xa.
2007-02-27 12:03:17.75 Server Error: 17182, Severity: 16, State: 1.
2007-02-27 12:03:17.75 Server TDSSNIClient initialization failed with error 0x2740, status code 0x1.
2007-02-27 12:03:17.75 Server Error: 17826, Severity: 18, State: 3.
2007-02-27 12:03:17.75 Server Could not start the network library because of an internal error in the network library. To determine the cause, review the errors immediately preceding this one in the error log.
2007-02-27 12:03:17.77 Server Error: 17120, Severity: 16, State: 1.
2007-02-27 12:03:17.77 Server SQL Server could not spawn FRunCM thread. Check the SQL Server error log and the Windows event logs for information about possible related problems.

BWAHAHAHAHAHAHA!!!!! The same thing happened to me today! What a nightmare! I came about 10min away from nuking my virtual server instance and restoring 50+ databases. It was supposed to be such a simple change... I even tested it on a non-clustered box first. Oh well.

I had *exactly* the same setup as you, except 32bit. I think the problem was that I made the change on my passive node. My cluster has multiple instances spread over all nodes, but the virtual instance I was changing wasn't changed on the active node. I think that's what did me in.

In my situation, the IP settings I ented in caused the SQL Server instance to not start (too lazy to check the actual error at the moment). What happens is that the cluster service is configured to replicate the registry settings that govern the TCP/IP paramenters. If you edit the parameters incorrectly (or at least the way I did) then the cluster service continues to replicate those parameters to all nodes. Mind you, I used the SSCM too! The registry replication occurs when the cluster group changes nodes. If the SQL Server service does not start, then the cluster quorum logs never get cleared. In other words, the cluster doesn't think the registry params got applied correctly. No matter what you do to the registry on each node, the cluster quorum logs will continue to apply the replicated settings until SQL starts successfully (which never happens).

To fix it, I had to stop MSCS from replicating the bad registry params so that I could manually rebuild them. Here's how I did it, YMMV:

0. (I forgot a step) Set your broken SQL instance cluster resource to not restart automatically in MSCS

1. Stop cluster service on all nodes

2. Locate the SQL Server instance cluster resource you're having issues with here: HKLM\Cluster\Resources

Change the following parameter from this:

Software\Microsoft\Microsoft SQL Server\MSSQL.x\MSSQLSERVER

to this:

Software\Microsoft\Microsoft SQL Server\MSSQL.x\MSSQLSERVER\CurrentVersion

on all nodes

3. Fix the following registry branch on all nodes to what they were before (use your other instances as a guide if necessary):

Software\Microsoft\Microsoft SQL Server\MSSQL.x\MSSQLSERVER\supersocketnetlib\TCP

4. Start the cluster service on all nodes.

5. SQL Server will not start because you've messed with the registry replication params. (but it'll be a different error than before)

6. Move this cluster group to all nodes and let it fail on all nodes.

6b. Verify that the TCP/IP parameters are still good on all nodes.

7. Take your group completely offline.

8. Revert the registry replication changes you made in step 2 back to what they should be.

9. Clean up Software\Microsoft\Microsoft SQL Server\MSSQL.x\MSSQLSERVER\CurrentVersion. (It'll be messed up from the forced registry replication. Do this on all nodes.

10. cycle the SQL browser service on all nodes for good measure.

11. Bring your broken SQL instance back to life. I had to cycle SQL Browser one more time while I was doing this.

12. Verify that the SQL instance comes back on all nodes.

This worked for me. Good luck. Oh, and for my gratification, how about you try changing the IP address parameters from the Active node using SSCM while SQL is up? Let me know it if works :) and I'll attempt it on my cluster again.

-EDF

This sucks Beavis.

|||Thanks EDF. This looks very promising. I actually made my changes while RDP'd to the virtual instance name so it must have been the active node but maybe making the change through Remote Desktop stuffed it. I'll give it a go tomorrow and try my IP config changes again from the active node at the console and let everyone know how it goes. Thanks again.|||I tried the above changes but it did not help. I am going to raise a call with MS Support and I'll post the resolution.|||

Gengleman, did either of you successfully change the IP address of the cluster? We are currently attempting this and although I did it once before I think I got lucky. I'd love to see the steps followed.

UJ

|||Yeah, I did. It turns out that "Listen on All IP Addresses" *really* means "Listen on All IP Addresses in the cluster group". Move all your SQL instances to one node. Log into that node and start the IP binding configuration tool, SSCM. Leave the entry for "listen all"=Yes (totally against all that is good & right in me...). I wanted all my SQL instances to listen on the same port (2048), but diff IP addresses. On the IP Addresses tab, I set all my IP addresses entries to Enabled=No. At the bottom, I set IPAll entry so that TCP Dynamic Ports was NULL (not 0) and TCP Port=2048. On start up, each SQL instance then only attempts to bind port 2048 to it's own IP address. Certainly NOT what I would have expected, but I suppose it makes sense.|||

Hi, I was trying to change the port, not the IP address, when somehow it got screwed up - I guess I mis-typed something. I ended up calling in PSS and they fixed it for me as I couldn't reproduce the fix that fischere suggested. It is very easy to screw up your TCP/IP config with the SSCM - the MS tech used it to corrupt his installation to reproduce my problem so it would be a good idea to play around with it on an environment that isn't important first and then make sure you follow the process exactly on your other instances.

If you end up screwing up the instance (by changing IP addresses, ports, etc) so that it won't start, you can reset it back to what it was using the Cluster Server Recovery Utility (ClusterRecovery.exe) which is available for download from the MS website. It's main purpose seems to be to recover from failed drives, but it has the ability to override the registry entries that cluster services wants to apply to a node using the "checkpoint" option. This allows you to overcome the problem of not being able to reset the values that the SSCM tries to write to the registry on each node because when you try and bring an instance on line or fail to another node the settings get restored from cluster services as fischere points out. To fix a failed instance using this tool, follow these steps:

1. move the failed instance to a node

2. change the registry settings to what they should be on that node, eg HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Microsoft SQL Server\MSSQL.1\MSSQLServer\SuperSocketNetLib\Tcp\IPAll

3. checkpoint the cluster using the tool on that node

4. bring the instance on line

5. repeat this process on each node.

I hope this helps someone else!

|||Thanks for posting this solution. It worked like a charm for me.

No comments:

Post a Comment