Amos Jeffries wrote:
> Dave wrote:
>> Hi,
>> Thanks for your reply. The following is the IP and abbreviated message:
>> (reason: 554 5.7.1 Service unavailable; Client host [65.24.5.137]
>> blocked using dnsbl-1.uceprotect.net;
>> To my squid issue, if aufs is less intensive and more efficient
>> I'll definitely switch over to it. As for your suggestion about
>> splitting into multiple files, I believe the version I have can do
>> this; it has multiple acl statements for the safe_ports definition.
>> My issue though is that there are 15,000+ lines in this file, and on
>> investigation some 500 are duplicates. I'd rather not have to
>> manually go through this and do the split. Is there a way I can split
>> based on the dst, dstdomain, or url_regex types you referenced?
>
> I just used the following commands; they pulled off most of the job in
> a few minutes. The remainder left as regex patterns was small. There
> are some that are duplicates of the domain-only list, but that can be
> dealt with later.
>
>
> # Pull out the IPs
> grep -v -E "[a-z]+" porn | sort -u >porn.ipa
>
> # copy everything else into a temp file
> grep -v -E "[a-z]+" porn | sort -u >temp.1
>
> # pull out lines with only domain name
> grep -E "^([0-9a-z\-]\.)+[a-z]+$" temp.1 | sort -u >temp.d
>
> # pull out everything without a domain name into another temp
> grep -v -E "^([0-9a-z\-]+\.)+[a-z]+$" temp.1 | sort -u >temp.2
> rm temp.1
>
> # pull out lines that are domain/ or domain<space> and drop the end
> grep -E "^([0-9a-z\-]\.)+[a-z]+[\/ ]$" temp.2 | sed s/\\/// | sed s/\\
> // | sort -u >>temp.d
>
> # leave the rest as regex patterns
> grep -v -E "^([0-9a-z\-]\.)+[a-z]+[\/ ]$" temp.2 | sort -u >porn.regex
> rm temp.2
>
> # sort the just-domains and make sure there are no duplicates.
> sort -u temp.d > porn.domains
> rm temp.d
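>
> # The leftover cross-file duplicates could be cleaned up later with
> # something like this (untested sketch; -F treats each domain as a
> # fixed string, so lines merely containing a listed domain also get
> # dropped):
> grep -v -F -f porn.domains porn.regex >porn.regex.2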
>
> Amos
For what it's worth, this method will not remove overlapping domains (if
http://yahoo.com/, http://www.yahoo.com/index.html and
http://mail.yahoo.com are all included, you will have more entries than
you need). Minor issue, perhaps, but it can lead to unpredictable
results (the dstdomain acl type will disregard overlaps to keep the tree
sort simple).
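(For illustration: once overlaps are collapsed, a single dstdomain entry
covers all three of the examples above, e.g.

acl allowed dstdomain .yahoo.com

where "allowed" is just a placeholder acl name; ".yahoo.com" matches
yahoo.com and every subdomain, so the narrower entries add nothing.)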
Find attached a (less than pretty) Perl script that will resolve these
issues*. Critiques and patches welcome (hopefully it's commented enough
to make sense to someone else). It's likely not optimized, but in my
case the input list is not changed often, so optimization is not critical.
Chris
* It is only set up to handle a list of URLs. It stuffs IP addresses in
with the other regexes. It will not account for what I assume to be
commented lines (starting with a #), or strings such as
"=female+wrestling", but will treat them as part of the domain name.
Error checking is minimal. It works for me, but comes without
warranty. Salt to taste.
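(For reference, the two files the script writes are meant to be pulled
into squid.conf along these lines, with placeholder acl names:

acl allowdoms dstdomain "/etc/squid/acls/allowdoms"
acl allowurls url_regex "/etc/squid/acls/allowurls"

followed by a "squid -k reconfigure" so Squid re-reads the lists.)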
#!/usr/bin/perl
# Parses the file, determines if a line should really be a site or
# domain block, and pushes the data into the proper files.
use strict;
# Define variables;
$| = 1; # unbuffer STDOUT so the progress dots below appear immediately
my ($url, $host, $scope, $time, $final);
my %domains = ();
my %regex = ();
my @hosts;
my @site_array;
my @lines;
# Open a bunch of file handles.
my $urlfile = "/etc/squid/acls/ExternalLinks.txt";
open (URLFILE, "< $urlfile") or die "Can't read $urlfile: $!";
my $allowurlfile = "/etc/squid/acls/allowurls";
unlink ($allowurlfile);
open (ALLOWURLS, "> $allowurlfile") or die "Can't write $allowurlfile: $!";
my $allowdomfile = "/etc/squid/acls/allowdoms";
unlink ($allowdomfile);
open (ALLOWDOMS, "> $allowdomfile") or die "Can't write $allowdomfile: $!";
# Start reading input
print "Working...";
while ($url = <URLFILE>) {
chomp $url;
my $time = time();
# grab the host & (if it exists) path
(undef, undef, $final) = $url =~ m#^(http(s)?://)?(.*)$#i;
# Split the string on forward slashes
my @url_array = split "/", $final;
# Grab the host
$host = shift @url_array;
# Split the host into domain components
my @host_array = split '\.', $host;
# Check for a leading www (get rid of it!)
if ($host_array[0] eq "www") {
shift @host_array;
}
# Put the fqdn back together.
$host = join (".", @host_array);
if (scalar(@url_array) || isIP(@host_array)) { # Is this REALLY a site allow? (a path or an IP-address host)
# Yes, it's a site.
my $time = time();
# Re-parse the original URL, this time escaping regex metacharacters
(undef, undef, $final) = $url =~ m#^(http(s)?://)?(.*)$#i;
# Escape special regex characters
$final =~ s/(\\|\||\(|\)|\[|\{|\^|\$|\*|\+|\?)/\\$1/g;
# Split the string on forward slashes
my @url_array = split "/", $final;
# Grab the host
my $host = shift @url_array;
# Split the host into domain components
my @host_array = split '\.', $host;
# Check for a leading www (get rid of it!)
if ($host_array[0] eq "www") {
shift @host_array;
}
# Put the fqdn back together.
$host = join (".", @host_array);
$final = join ('.', @host_array);
$final .= "/";
$final .= join ("/", @url_array);
$final =~ s/\./\\\./g;
# Now check for a duplicate site block
if (1 != $regex{$final}->{defined}) {
$regex{$final}->{defined} = 1;
# Create the entry
#print "Added site $url\n";
$scope = "Site";
$domains{$url}->{host} = $host;
$domains{$url}->{final} = $final;
$domains{$url}->{scope} = $scope;
$domains{$url}->{time} = $time;
}
} else {
# It's a Domain.
# Is it a repeat?
if (1 != $domains{$host}->{defined}) {
# Haven't seen this one before. Mark it as seen.
$domains{$host}->{defined} = 1;
$scope = "Domain";
# Rebuild the fqdn with a leading dot so dstdomain matches subdomains too
$final = join ('.', @host_array);
$final = ".$final";
# Create the entry
#print "Added domain $url\n";
$domains{$url}->{host} = $host;
$domains{$url}->{final} = $final;
$domains{$url}->{scope} = $scope;
$domains{$url}->{time} = $time;
push @hosts, $host;
}
}
}
# Done reading the file. Let's filter the data to remove duplication.
# Sort by number of host elements, remove subdomains of defined domains
# Schwartzian transform: pair each host with its dot-separated parts,
# sort ascending by number of parts (shortest hosts first), then unwrap.
my @sortedHosts = map { $_->[0] }
sort {
my @a_fields = @$a[1..$#$a];
my @b_fields = @$b[1..$#$b];
scalar(@a_fields) <=> scalar(@b_fields)
}
map { [$_, split'\.'] } @hosts;
foreach $host (@sortedHosts) {
my $dotHost = ".$host";
foreach my $urlToTest (keys %domains) {
my $hostToTest = $domains{$urlToTest}->{host};
my $dotHostToTest = ".$hostToTest";
my $deleted = 0;
my $different = 0;
# If a subdomain of the host is found, drop it from the list
if ($hostToTest =~ m/\Q$host\E$/) { # \Q...\E so "." matches literally
#print "$dotHost - $dotHostToTest - $urlToTest\n";
# We have a potential match. Verify further...
my @host1 = split'\.', $hostToTest;
my @host2 = split'\.', $host;
my ($test1, $test2);
while ($test1 = pop (@host1)) {
$test2 = pop (@host2);
if (defined($test1) && defined($test2)) {
if ($test1 eq $test2) {
#print "# They match so far ($test1 eq $test2), check the next element\n";
# They match so far, check the next element
next;
} else {
#print "# The hosts are different ($hostToTest $host). Break out of here.\n";
# The hosts are different. Break out of here.
$different = 1;
last;
}
} elsif (!defined($test2)) {
# $host ran out of elements first: $hostToTest is a subdomain. Drop it.
#print "$hostToTest is a subdomain of $host. Deleting.\n";
print "."; # So there is SOME indication of work progressing...
delete $domains{$urlToTest};
$deleted = 1;
last; # stop popping (and re-deleting) once it's gone
}
}
if (!$deleted && !$different && ("Domain" ne $domains{$urlToTest}->{scope})) {
#print "$urlToTest is a subdomain of $host. Deleting.\n";
print "."; # More progress indication
delete $domains{$urlToTest};
}
}
}
}
print "\n";
# Write the data
# Hardcoded site-specific entry; adjust or remove for your network.
print ALLOWDOMS ".apexvs.com\n";
foreach $url (keys %domains) {
$final = $domains{$url}->{final};
$time = $domains{$url}->{time};
if ("Site" eq $domains{$url}->{scope}) {
$scope = "Site";
print ALLOWURLS "$final\n";
} else {
$scope = "Domain";
print ALLOWDOMS "$final\n";
}
}
# Close it all up
close URLFILE;
close ALLOWURLS;
close ALLOWDOMS;
# Set proper ownership
# CHANGE 15 TO THE UID OF squiduser ON YOUR MACHINE
chown (15, -1, $allowurlfile, $allowdomfile);
chmod (0644, $allowurlfile, $allowdomfile);
print "Done. Don't forget to reload Squid to make changes effective.\n";
exit 0;
sub isIP {
my @array = @_;
for (my $i = 0; $i < 4; $i++) {
# Search the first 4 parts of the array for alpha and hyphen.
# Return 0 if found, or if the array is shorter than 4 parts.
return 0 if !defined($array[$i]) || $array[$i] =~ /[a-zA-Z\-]/;
}
# No alpha or hyphen found, and there are at least four parts? It
# could be an IP address.
return 1;
}
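(A rough worked example of the intended behavior, assuming an input
file containing these three lines:

www.example.com
http://other-site.org/some/page.html
mail.example.com

The first becomes ".example.com" in allowdoms, the second becomes the
escaped pattern "other-site\.org/some/page\.html" in allowurls, and the
third is dropped as a redundant subdomain of example.com. The hardcoded
".apexvs.com" line also lands in allowdoms.)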