How should you be using log files to assist your SEO? That’s what we’re discussing on episode 26 of the Knowledge Panel show, where Dixon Jones is joined by Gerry White from Oda, Sophie Brannon from Absolute and Steven van Vessum from Conductor.
Want to Read Instead? Here is the Transcript.
Dixon: Hello and welcome back to The Knowledge Panel Show. It’s Episode 26 and it’s “Using log files in SEO”. So, I’ve got a fantastic panel again with me. I’ve got Gerry, Sophie, and Steven. I’ll ask them all to introduce themselves in just a moment.
For those that are seeing me on camera, I do apologize for the lack of shaving. I spent last week at the British Chess Championships and shaving didn’t seem to be a thing there. They didn’t do that very much. So, I didn’t bother.
Anyway, thanks so much for coming in. Why don’t we just start with the introduction? Sophie, why don’t we start with you? Tell us about yourself. So, who are you and where do you come from?
Sophie: Hi everyone, I’m Sophie Brannon. I’m the client services director at an agency called Absolute Digital Media and I’ve been working in SEO for the last six and a half years. I’ve dealt with campaigns across a broad number of industries, huge websites down to tiny little brochure websites, and like to get my teeth stuck into log files so I’m really excited to join the panel.
Dixon: Fantastic. Thanks very much for coming on. And Gerry, what about you? Where are you and where do you come from?
Gerry: So, at the moment I’m the SEO director for a company called Oda. It’s a Norwegian supermarket. We’re expanding globally as we speak. We’re kind of looking to grow into Germany and into Finland, where it’ll be next week, and, well, that’s basically the start. But historically, I’ve been everywhere: agency side, I’ve been the SEO of JUST EAT, I’ve been at the BBC, I’ve been more places than I can remember in, like I say, 20 years’ worth of experience.
Dixon: Brilliant. Thanks very much for coming in, Gerry. And Steven, tell us about yourself and where do you come from?
Steven: Thanks. So, my name is Steven. I’m the director of organic marketing over at Conductor. Conductor is an enterprise SEO platform. I’m also the co-founder of ContentKing, a real-time SEO auditing and change tracking solution, which was acquired earlier this year by Conductor. And similar to Gerry, I’ve been all over the place. I worked in-house, agency side, run my own agency, and now in the SEO tool space for the last seven years.
Dixon: So, as was said on Twitter earlier on today, a dream panel. Thanks very much for coming in guys. I really do appreciate all of your attention here. As always, this whole event is sponsored by InLinks. So, thanks to inlinks.net. That’s the advert for them out the way.
Let’s just check with my producer though, I haven’t missed anything important before we get on. David, how are you?
David: The only thing I want to say… I’m very good thank you. I just want to say, Dixon, we got many people listening to the show on Apple Podcasts, Google Podcasts, Spotify, but if you can, try and join us live next time. We broadcast live on YouTube, Facebook, and Twitter. So, follow the InLinks channels, especially on YouTube. We go live once a month on YouTube. So, subscribe there and get an alert when we go live, and then you can interact. You can see what the next show is and hopefully, you can watch us live and ask questions on the next show, and I’ll tell you about the next show later in this one.
Dixon: Brilliant. We’ve got a little opt-in if you want to sign up and get an email when the shows are coming on, then if I remember, I send that out an hour before the event, but you know how I am. I’m not always quite as perfect as I should be. Okay guys, so thanks a lot.
I’m going to start, as I always do, around the topic. A lot of people don’t jump into the whole 45 minutes of all these podcasts and wish that we got to the point quicker. So, I tend to start with the question: if people don’t have 45 minutes to hang around, what one tip, suggestion, or takeaway would you give them about using log files for SEO? One thing that you think, “Hey, that’s an easy win. Go do that.” And I’m going to jump in on Gerry first to go with that one because he’s always got an answer for me.
Gerry: To be honest, I’m struggling with this one. There’s so much to it. Log files are actually one of the most complicated parts of it all. There are some great tools to kind of analyze it. I mean, obviously, ContentKing has got the kind of log file analysis part of it, component to it. If you haven’t got that, then Screaming Frog, and they’ve got kind of great dashboarding solutions. These two tools will allow you to kind of do a really quick and deep analysis.
So, basically, you can’t really do log file analysis without a great tool anymore. It’s not something which is possible because you’re talking about terabytes of data. Unless you’re kind of a programming database big data kind of guru then kind of analysis of it does require some kind of tool. Like I say, I’ve used multiple different things in the past but yeah, use a good tool.
Dixon: Okay. Sophie, what about you? Any tip, suggestion, idea?
Sophie: Oh, I think mine comes probably more off of one of my biggest pain points of log file analysis. I guess my tip would be if you can use a CDN to access your log files then do it because it makes it a lot easier to access and it stops all of those headaches of trying to find where they are, trying to get access to it, particularly if you don’t have the full access anyway and you’re trying to go through lines of developers or stakeholders to try and get hold of them. If you can get them in the CDN then that’ll be the easiest way.
Dixon: We’ll probably come back to that because I was going to talk about CDNs anyway because I’m so, so old that CDNs didn’t exist when I started, and to Gerry’s point, I used to use a templated Excel spreadsheet where you could cut and paste everything into a spreadsheet and then I did some analysis.
Sophie: Manual.
Dixon: Yeah, I’m really, really old school but things have moved on a little bit. So, can we come back to CDNs?
Steven, I thought of you and, obviously, I think it’s fair to say: “Use the ContentKing tool”. You’re allowed to say that. It’s okay. I’m not going to stop you from talking about your own tool. But any tip?
Steven: Yeah. So, love what Gerry and Sophie said, and continuing where Sophie left off: using CDNs, you can basically plug the log streams from CDNs into SEO platforms such as ContentKing, for example, and get real-time insight into what’s going on on your site. So, that’s super, super useful. A lot of folks think about log file analysis as, like, a rigid process that you go through once a year and it takes forever and you need to bribe people with donuts to get the log files in the first place, but there’s a whole new world out there and I would highly recommend exploring it.
Dixon: Let’s start with the CDN thing then because that’s kind of… Well, actually no, before we go into the CDN and I’ll come straight on to that one because I think we should get into that but let’s just, for those that CDN is two steps too far, let’s just ask, what is the real difference between a log file analysis and just using Google Analytics? I mean, it’s a question that I’m asking rhetorically but does someone want to jump in and say, what are the main differences between log files and things like Google Analytics, which is what we all use?
Gerry: I’ll jump in. Basically, Google Analytics is really very, very different to log file analysis. I mean, a bit of history, actually. There is the fact that Urchin, actually, used to be… Well, in fact, when we first started doing web analytics it was based on log files, but we are going back to when Dixon was young, so we are talking quite a long time ago.
But basically, since then what we’ve done is we’ve been almost firing this client-side. So, basically, when a user interacts with the page then JavaScript is fired and so we only track users in analytics packages. In fact, we actually try really hard in analytics packages to only track users, and normally the best way to do this is to basically only track the ones that are firing JavaScript and then basically filtering it then down. We’re going to be talking about user agents later on but basically, we try to kind of restrict it right down whereas log files it’s kind of the opposite. It’s literally trying to understand what is hitting the server and how it’s hitting the server, why it’s hitting the server.
What that doesn’t track is when it doesn’t hit the server. So, a good example of that is we often have interactions within pages which don’t necessarily fire kind of a something back to the server. For instance, if you click play on a YouTube video, it won’t actually fire back to your own server, so it doesn’t get tracked in the log file.
So, fundamentally, the two now have very, very different purposes and I think one of the things we’ll be talking about here more for log files is we don’t really even look at users in log files. We want to look at the robots. We want to look at, like, Google: how Google crawls your site, how Bing crawls your site, even how some of the other search engines actually handle your site, and what we can do to kind of either stop them or improve the way in which they go through it.
And I think that’s the biggest thing that we look at in log files is basically the inefficiencies of how all sites are being crawled whereas in analytics packages we’re almost looking at inefficiencies or how to convert users. So, they have kind of similar purposes but really totally different.
Dixon: Steven, do you try and blend sort of JavaScript-based signals with log file-based signals to have one kind of streaming signal, or do you keep them separate?
Steven: Personally, I keep them separate because I look at the crawlers and users in different ways as Gerry explained. So, when it comes to log files, I really look at, I zoom in on Google’s behavior primarily because it’s the dominant search engine in most countries that we target and I am looking for ways to improve the time to crawl, time to index, and time to rank because, obviously, when you push the button and you publish a content piece, you want it to drive organic traffic as soon as possible and especially when you start analyzing that in real-time, at scale, it gets really, really interesting.
Dixon: Okay. So, then Sophie, let’s go back to… Well, feel free to add to those two points, but let’s go back to the CDN thing, you made the point at the top of the show, if you can get your log file through CDN network, instead of through the end server or whatever, then that’s going to be a better thing.
I’ve got a couple of questions, two or three questions really, I’ll just sort of throw them all out there to you and talk to you Sophie, if that’s all right. Do all the CDN systems provide that data very effectively and why is it better to have the CDN stream rather than the end server log file data? Why? Why and how easy is it?
Sophie: So, for me, my personal preference is Cloudflare and that’s for a number of reasons, not just because of log files. I’m talking like DDoS attacks, security, everything on top of that as well, overall site speed. And I think I’ve seen a bigger increase in people using CDNs since the whole Core Web Vitals thing, “everyone needs to improve their site speed, oh quick, let’s shove a CDN on top of it”, without really seeing or doing much else with it.
The reason why I prefer using a CDN is literally the ease of access more than anything, particularly when you’re working with big corporate companies, big brands, or even just people who don’t know where their log files are or they’re dealing with kind of legal teams, really strict legal teams, and there’s a whole other range of issues there which I think you might talk about a little bit later, which we can, with GDPR and things, but it just really helps you to access it a lot quicker.
One of the biggest things that I also find when having to go through a development team or going through a whole list of stakeholders is the length of time those log files are held for. You can kind of control that on a much closer scale when you’re dealing with CDNs because really you can access all of that, whereas when you’re going through the client side and you’re trying to go through the development teams, you turn around and they’re like, “Oh, we only held the log files for 24 hours. Sorry about that.” Great. That’s really helpful when I’m trying to analyze something.
Dixon: Okay. Gerry, anything on that? Any other good reasons to use CDNs?
Gerry: Yeah. I mean, this whole CDN is actually quite a recent thing for me. I say that because back when I was younger, CDNs were almost a problem. They didn’t store log files. They didn’t kind of… In fact, what they tended to do was stop the logs, they’d stop the servers from being hit as much because they’d actually cache a lot of the hits and do other bits and pieces. So, it’s interesting that now we kind of use the CDN as a way to get access to log files that we would not have got before. So, I think that’s absolutely kind of really important that we kind of do use the CDNs. But one of the other advantages is the fact that when I was working, for example, at JUST EAT, the website was made up of like 12 quite separate kind of servers, services. One is a PHP box, one is a .NET box, one is a WordPress box. There’s so many different ways in which these kinds of engines will store the hits, and this really is important: if you then try to analyze it, you need almost the complete picture, you kind of have to understand what’s happening everywhere.
And I think one of the things Steven will probably be able to expand on is the fact that all of these different services have a different log file format. Sometimes it’s a different column, different everything, so even if you got access to all of these, consolidating it, pulling it into one, and kind of analyzing it, you’re trying to look at loads of different formats, and it’s like when 10 people are trying to consolidate one Excel document… sorry, 10 Excel documents. It’s an absolutely horrendous job to do. But yeah, Steven, I think you’ve probably got an opinion.
Steven: Yes, I do. I totally agree, Gerry. It just makes life as an SEO so much easier if you have one place to look for your log files. It’s a breeze.
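To make Gerry’s point about mixed formats concrete, here is a minimal Python sketch that maps two differently shaped log lines onto one common record. It assumes one source writes the common Apache/Nginx “combined” format and another writes JSON lines; the JSON field names (`client_ip`, `uri`, `status`, `user_agent`) are hypothetical and will differ per provider.

```python
import json
import re

# Apache/Nginx "combined" format: ip - - [time] "METHOD url proto" status bytes "referer" "agent"
COMBINED = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[[^\]]+\] "(?P<method>\S+) (?P<url>\S+) [^"]*" '
    r'(?P<status>\d{3}) \S+ "[^"]*" "(?P<agent>[^"]*)"'
)


def normalize(line):
    """Return a dict with ip, url, status and agent, or None if the line is unrecognized."""
    line = line.strip()
    if line.startswith("{"):
        # JSON-lines source; these field names are assumptions for illustration only.
        record = json.loads(line)
        return {
            "ip": record["client_ip"],
            "url": record["uri"],
            "status": int(record["status"]),
            "agent": record["user_agent"],
        }
    match = COMBINED.match(line)
    if match is None:
        return None
    return {
        "ip": match["ip"],
        "url": match["url"],
        "status": int(match["status"]),
        "agent": match["agent"],
    }


APACHE_LINE = (
    '66.249.66.1 - - [10/Aug/2022:10:00:00 +0000] "GET /menu HTTP/1.1" '
    '200 5120 "-" "Googlebot/2.1"'
)
JSON_LINE = (
    '{"client_ip": "66.249.66.1", "uri": "/menu", '
    '"status": 200, "user_agent": "Googlebot/2.1"}'
)

# Both sources end up as the same record, so they can be analyzed together.
print(normalize(APACHE_LINE) == normalize(JSON_LINE))
```

Once every source is reduced to the same handful of fields, the rest of the analysis no longer needs to know which box produced the line.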
Dixon: I guess as well, you know, I mean, Gerry made the point that a large website has six servers, say, but a small website that uses a CDN and has only one server quite likely has, especially if you’ve got static content, cached content that’s in the CDN that would never hit the server, and that’s the whole point of a content display network or a distribution network, sorry, I just can’t remember what CDN stands for [content delivery network]. But that means that it’s very likely that if you try to use the raw server logs and you’re using a CDN, you’re going to miss a whole load of the traffic because the traffic is never actually going to hit your log file. Is that correct or am I being naive?
Gerry: One hundred percent correct, yep. We saw this time and time again. Again, this is before the times when CDNs actually produced their own log files. When I was working at JUST EAT, I was like, “Oh, okay, so none of these actually,” particularly for images, particularly where we have like a long cache life, so any time… I mean, a good example of that is CSS. We did not want every single user to kind of access the CSS file on the server. We wanted the CDN to handle 80% to 90% of that. And so yeah, exactly right, Dixon.
Dixon: I think as well, for those that haven’t seen another show that we’ve got, we’ve got SEO on the Edge, which you can find out there. Well, one of the shows was about the Edge and Edge workers. And Edge workers, for those that don’t know, is where effectively you’re using a CDN network to literally change the code at the DNS level, and that is increasingly common amongst SEOs, not always with the approval of the web developers or the admins, system admins, but it is getting us through quite a few problems from previous issues when admins have been unwilling to give us access to various bits and pieces. So, very useful tool. So, look that up if you want to go into that a bit further.
Okay. So, let’s talk a little bit about the traffic that we see in log files because I know Imperva used to do an annual study on the types of traffic that they see on websites and they tried to work out how much of the content was human versus how much of it was machines, how much of it was malicious versus how much of it was not, and one of the startling things that always came out is that pretty much half the traffic on the Internet is not human orientated at all.
Is that the sort of thing that you see, Steven, in ContentKing or is it something that varies a lot from website to website? And what is the sort of non-human traffic that you see, or one sees?
Steven: I haven’t looked across all of our clients, I can’t, to see what the ratio is between real traffic and bot traffic, but I would wager that it’s more like 80% bot traffic and 20% user traffic nowadays, just because there are so many crawlers out there doing their thing. A lot of them, we don’t really know what they do, what they’re up to, but they’re out there.
And if you’re talking about ratios for sites, it depends. If you have a massive site like Gerry’s, for example, you’d want to see a lot of bot traffic on it because if a lot of pages get refreshed and they’re pushing a lot of new content out there, you want that content to be re-indexed as soon as possible. So, I’d rather have like 10% human traffic and 90% Google traffic on my site than, say, 50/50 because whenever I publish something, I want Google to eat it up.
Dixon: Okay. I wonder if the rest of you, Sophie, have you got the same sort of views or different views?
Sophie: Kind of similar for me. I see there’s a bit of a difference depending on the industry I’m working on. So, like, I find that with kind of health, finance, particularly bigger websites you do tend to see a lot more bot traffic than what you do with smaller ones. So, I’m talking like, I don’t know, Jim the local guy, you’re probably not going to see too much, you know. I like to kind of split it between good bot traffic and bad bot traffic.
Now, that may be controversial depending on who you’re talking to, but for me, like, good bot traffic would be website crawlers, website monitoring, like UptimeRobot or something, scraping, aggregation bots, but then bad bots would be like the spam, the DDoS attacks, ransomware, the ad fraud. It does depend, I really hate saying it, I really try and avoid saying it, but I think it’s just a matter of being able to block some of the bad bot traffic if you’re able to really identify that.
Dixon: It’s hard though, isn’t it, because we, unless I’m mistaken, we’ve got two basic ways of identifying that bot, that traffic, either the user agent or the IP address that it comes from, and if it’s the user agent what I think a lot of people don’t appreciate is that the user agent is given by the choice of the user. I mean, it’s something that’s pushed by the bot. So, the bot could sit there and say, “Hey, I’m Mr. Google,” quite happily or “I’m a very good bot,” and you wouldn’t have any way of knowing, or they could sit there and say, “I am a bot that you’re already familiar with” or “I’m Firefox” or whatever.
And, of course, IP addresses are getting much, much more random and variable these days, particularly with IPv6, and people are changing their IP addresses all the time. So, Gerry, if you’re going to use log files or information from log files to start blocking bots, how much danger are you in? Are you going to slip up and actually start stopping, I don’t know, Google’s image crawler from looking at your information inadvertently, for example?
Gerry: Yeah, absolutely. I won’t mention the client’s name, but basically, I have seen clients where we can see something where somebody is trying to basically hack the site. An example of that is, I think everybody here knows, you can buy usernames and passwords off the internet off of sort of black sites basically, relatively cheaply. I’ve never done it myself, so I don’t know exactly how to do it.
Dixon: I’m glad about that, Gerry.
Sophie: A bit of a disclaimer there.
Gerry: I know. Absolutely. But then you can run these against any kind of site, any major site. And a number of sites that I’ve worked on have sort of spotted that, very specific passwords and things have been done at scale often from a strange country like America if you’re based in the UK or often it’s like Russia or something like that. And, as you say, if I was doing this myself, I would use a range of different IP addresses, different countries, and different user agents to hide it as much as possible.
So, exactly what you say, although we can sort of see this pattern happening, us trying to block it using user agents and us trying to block it using IP addresses, it’s a massive challenge. There are ways in which we can kind of try to fix that. A good example of that is, like, the “is it human” kind of Google thing, the CAPTCHA stuff, but again that means that other bots, which are valuable, will struggle as well.
So, it is that sort of magical balancing act. So, I mean, we do rely a lot on things like Cloudflare’s own protection to kind of say, you know, “We can see bad behavior. Switch it on in Cloudflare.” But again, Cloudflare is not perfect, you know [crosstalk 00:22:05].
Dixon: No, Cloudflare has got this big button, this one button that you can press, and it strikes me that that is using a mallet to knock in a paper clip really.
Sophie: To just decide, yeah, just trust it aside.
Gerry: Yeah, but often when they, I think it’s called Shields up, isn’t it, but basically when you hit that button it’s because you know there’s kind of a security incident going on. And, I mean, I think all three of us, all four of us have worked on big sites where security trumps everything else, you know, you kind of get to that point where you kind of go, “Okay. SEO is really, really important, but security is really, really important.”
But again, you know, I’ve often found it where they’ve hit the button almost and they’ve almost forgotten they’ve hit the button until suddenly we start falling out of Google and kind of going, “Oh, we seem to be blocking Google.” But as a user, I can’t see it because we’ve been whitelisted because we’re UK based or we’re whitelisted because of this or the other, but it’s kind of like until Google starts to sort of say, “Hey, we’re actually starting to drop you out.” It is something which is definitely worth kind of paying attention to.
But so, to your point, IP addresses can be… Well, you can’t fake Google’s IP address but what you can do is you can kind of give yourself a random IP address in effect in any country. You can fake user agents completely. So, yeah, I mean, I for one often browse news websites that I don’t want to pay for by pretending I’m Google and they just magically let me read all the private content and everything.
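Because the user agent is self-declared but Google’s IP addresses are not, a common check is the reverse-then-forward DNS lookup that Google documents for verifying Googlebot. The Python sketch below is one way to write it, not a definitive implementation: the function name and the injectable lookup parameters are assumptions added so the logic can be exercised without network access, and the default forward lookup only handles IPv4.

```python
import socket

# Hostname suffixes Google documents for Googlebot reverse DNS results.
GOOGLE_SUFFIXES = (".googlebot.com", ".google.com")


def is_verified_googlebot(ip, reverse_lookup=None, forward_lookup=None):
    """Check a client IP with a reverse DNS lookup followed by a forward lookup.

    reverse_lookup(ip) -> hostname and forward_lookup(hostname) -> list of IPs
    are injectable (an assumption of this sketch) so the logic can run offline.
    """
    if reverse_lookup is None:
        reverse_lookup = lambda addr: socket.gethostbyaddr(addr)[0]
    if forward_lookup is None:
        # gethostbyname_ex is IPv4 only; IPv6 would need socket.getaddrinfo.
        forward_lookup = lambda host: socket.gethostbyname_ex(host)[2]
    try:
        hostname = reverse_lookup(ip)
        if not hostname.endswith(GOOGLE_SUFFIXES):
            return False
        # The hostname must resolve back to the same IP, otherwise the
        # reverse DNS record itself could have been spoofed.
        return ip in forward_lookup(hostname)
    except OSError:
        # Covers socket.herror and socket.gaierror (no DNS record, lookup failure).
        return False


# Offline demonstration with stubbed lookups (made-up data, no real DNS involved).
stub_reverse = {"66.249.66.1": "crawl-66-249-66-1.googlebot.com",
                "203.0.113.5": "host.example.com"}
stub_forward = {"crawl-66-249-66-1.googlebot.com": ["66.249.66.1"]}

print(is_verified_googlebot("66.249.66.1", stub_reverse.__getitem__, stub_forward.__getitem__))
print(is_verified_googlebot("203.0.113.5", stub_reverse.__getitem__, stub_forward.__getitem__))
```

In practice the result would be cached per IP, since doing two DNS lookups for every log line is slow.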
Sophie: Yeah. I think a good example of that as well was when everyone was… Well, a few people booking in their COVID vaccines, and everyone was trying to skip the NHS queue by doing exactly that. It was all over Twitter changing their user agent. So, it’s hard and same talking about kind of just blocking certain regions or kind of countries, people use VPNs all the time. People are so much more invested in their own privacy and their own security and it’s probably becoming a bigger trend than ever. That’s where it starts getting really dangerous.
Dixon: Yeah, I know, I’m British and I’ve got a TV license. So, I definitely want to be able to use my iPlayer from abroad. So, I use a VPN for that all the time as well. I think we do use that, we’re more and more of us are using VPNs and certainly a hacker, whether good or bad, whether a good person or bad, they know how to use a VPN. So, it’s not going to be an effective barrier.
And I think on Cloudflare there are two different buttons. We’ve got the DDoS one: a denial-of-service attack is happening right now, press this button. I understand that, you know, if you’ve got a panic, it’s great to have a button. What I worry about is the more subtle one, the “here’s the setup that’s running in the background” sort of thing. If it’s just taken at face value, you start to miss an awful lot of potentially good traffic. Coming back to the idea of SEO being the subject of the podcast: if you want somebody to click on your website from a search engine, whether it’s YouTube, whether it’s Majestic, whether it’s Bing, whether it’s an image search engine, or anything that’s not core Google (Google is the biggest search engine, and there’s loads of different Google crawlers and stuff), then if you don’t let the bots see that information, a human will never click on the link that’s indexed from that information. I think people forget that quite a lot. Anyway, feel free to carry on, come back on that one.
But let’s talk about error codes and how they could be useful in log files, and let’s get to something really useful for SEO. Why can’t I see error codes in Google Analytics and what error codes can I see in log files that I can use to improve my SEO? I see Gerry has come off his mic, so we’ll let him go in.
Gerry: Sure. I mean, this feels like a great question for ContentKing, to be honest with you, but I’ll go in first. The fundamental thing is that JavaScript can’t see the error code, and as we mentioned before, Google Analytics uses JavaScript to kind of tell you what sort of page it is, so if it can’t see the status code, then there’s no way to know if it’s an error code. Now, I do actually hack error codes into Google Analytics a lot of the time, so I often kind of go, “Hey, can we make sure that we track which ones are 404s, which ones are 500s,” if you can put your analytics code on the 500 page, which is sometimes a challenge.
So, fundamentally, that’s one of the key things that’s really, really different. So, Google Analytics literally only uses JavaScript to kind of understand what the pages are. So, unless it can know what the error code is, there’s no way for it to see it, but equally, you can build other things into it which you can’t necessarily build into the hit. So, for instance, you can say stock levels, or you can say whether or not somebody is logged in or logged out.
So, there’s loads of other different codes which you can put into analytics that you can’t necessarily put into log file analysis. So, they’re different. And one of the things that is important is understanding what goes where and how to consolidate the information afterwards.
Dixon: I’ll come to Steven last on this one, I think, because you’ve probably got plenty to say, but Sophie, are there any error codes that you very much look out for because you’re looking at log files probably day-to-day more often than Steven or Gerry or me.
Sophie: Yeah. I mean, for me the 404s, the 500 errors, particularly on large e-com websites where there’s, like, loads of people accessing the site on a regular basis, updating products, taking things down, taking things out, and not putting redirects in place, or when they do put a redirect in place, let’s put a 302 in because they don’t know the difference and all this kind of stuff because they’re not there for SEO, they just manage their website, they’re just there to manage their stock. If that is a huge website and you know that kind of thing is happening on a regular basis just from the standard nature of it, that’s when I turn to log files.
I mean, I can run crawlers on sites like Screaming Frog and things like that or ContentKing but just really understanding and drilling into that more regularly and then seeing what Google is hitting because if they’re regularly hitting those 404 pages then you know there’s a problem and you know you need to do something very quickly.
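As a rough illustration of what Sophie describes, the following Python sketch counts which URLs a self-declared Googlebot keeps hitting with a 4xx or 5xx response. It assumes the access log is in the common Apache/Nginx “combined” format; the sample lines are made up, and the user agent match is a plain substring test, so it does not prove the requests really came from Google.

```python
import re
from collections import Counter

# Combined Log Format: ip - - [time] "METHOD url proto" status bytes "referer" "agent"
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) [^"]*" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)


def googlebot_errors(lines):
    """Count (url, status) pairs where a self-declared Googlebot got a 4xx or 5xx."""
    errors = Counter()
    for line in lines:
        match = LOG_PATTERN.match(line)
        if match is None:
            continue  # skip lines that are not in the expected format
        if "Googlebot" not in match["agent"]:
            continue  # only interested in Google's crawler here
        status = int(match["status"])
        if status >= 400:
            errors[(match["url"], status)] += 1
    return errors


GOOGLEBOT_UA = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
BROWSER_UA = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Firefox/103.0"

# Made-up sample lines for illustration.
SAMPLE_LINES = [
    f'66.249.66.1 - - [10/Aug/2022:10:00:00 +0000] "GET /old-product HTTP/1.1" 404 512 "-" "{GOOGLEBOT_UA}"',
    f'66.249.66.1 - - [10/Aug/2022:10:05:00 +0000] "GET /old-product HTTP/1.1" 404 512 "-" "{GOOGLEBOT_UA}"',
    f'66.249.66.1 - - [10/Aug/2022:10:06:00 +0000] "GET / HTTP/1.1" 200 20480 "-" "{GOOGLEBOT_UA}"',
    f'203.0.113.9 - - [10/Aug/2022:10:07:00 +0000] "GET /old-product HTTP/1.1" 404 512 "-" "{BROWSER_UA}"',
    f'66.249.66.1 - - [10/Aug/2022:10:08:00 +0000] "GET /checkout HTTP/1.1" 500 0 "-" "{GOOGLEBOT_UA}"',
]

for (url, status), hits in googlebot_errors(SAMPLE_LINES).most_common():
    print(hits, status, url)
```

Sorting by hit count puts the URLs that Google requests most often at the top, which is usually where a missing redirect hurts most.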
Dixon: Steven, what do you want to add in there?
Steven: So, when it comes to error codes and Google Analytics, as long as a page is loading, and JavaScript is executed, and the Google Analytics JavaScript is executed, it’s going to be logged and tracked in GA. So, typically you’ll see that 4xx error codes are all going to show up in GA. But, for instance, the 302 redirect that Sophie mentioned is definitely not going to show up in Google Analytics, and in most cases it’s the same thing for 5xx status codes.
So, what I like to do is pull all of the status codes that we get into a different place than Google Analytics because it’s just not the right place to make that overview. So, you could build your own platform, or you could use something like ContentKing that continuously monitors your site and leverages log file analysis, Google Search Console data, and Google Analytics data, and you could piece it all together. You can even add the stock levels that Gerry was talking about through an API. So, you’re basically piecing together your own platform and getting the insights that you need.
Dixon: Okay, cool. So, there’s a lot of things you can pull in but apart from things like error codes are there some other things that you pick out in log files at all, that you’ve got a pet choice? Sophie, you’ve come off mic, so I’ll let you dive in there first.
Sophie: Yeah. For me, it’s if a page is unnecessarily large or slow, and the reason why I use log files for this more than anything, and it’s one of my biggest frustrations in SEO, is that people will typically just run the home page of the site through PageSpeed Insights or Lighthouse and just be like, “Oh, everything is fine. This is fine.” That’s not fine.
So, I tend to use it more specifically for that, and for finding things like static resources that are crawled too frequently or not frequently enough. But yes, the page speed, especially with this whole turning around towards Core Web Vitals and user experience, and the big trend towards that. A lot of clients I see on a monthly basis are asking a lot more, what are you doing with our site speed, what are you doing with our Core Web Vitals.
Now, that’s not always going to be the top priority, but being able to really identify that is useful on a much larger website, where it’s going to take a really long time to crawl with a website crawler like Screaming Frog, for example.
Dixon: So, do you then use that also to find really large image files that are sitting there? Just, you know, they’re actually a little icon in the browser but when you look at them, they’re 4K, Ultra HD, and that’s really slowing stuff down and you wouldn’t see it another way. So, I think that’s a really, really good example.
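The oversized image case can be checked from the bytes-sent field that most access logs record. This is a minimal Python sketch, assuming combined-format log lines, an arbitrary 500,000-byte threshold, and made-up sample data; it reports the largest successful response seen for each image path.

```python
import re

# Pulls the request URL, status and bytes sent out of a combined-format log line.
REQUEST_PATTERN = re.compile(
    r'"(?:GET|HEAD) (?P<url>\S+) [^"]*" (?P<status>\d{3}) (?P<size>\d+|-)'
)
IMAGE_EXTENSIONS = (".png", ".jpg", ".jpeg", ".gif", ".webp", ".svg")


def oversized_images(lines, max_bytes=500_000):
    """Return (path, bytes) for image responses above max_bytes, largest first."""
    largest = {}
    for line in lines:
        match = REQUEST_PATTERN.search(line)
        if match is None or match["size"] == "-":
            continue
        if match["status"] != "200":
            continue  # only full responses carry the whole file
        path = match["url"].split("?", 1)[0]  # ignore query strings
        if not path.lower().endswith(IMAGE_EXTENSIONS):
            continue
        size = int(match["size"])
        if size > max_bytes:
            largest[path] = max(size, largest.get(path, 0))
    return sorted(largest.items(), key=lambda item: item[1], reverse=True)


# Made-up sample lines for illustration.
IMAGE_LOG_LINES = [
    '203.0.113.9 - - [10/Aug/2022:10:00:00 +0000] "GET /img/icon.png HTTP/1.1" 200 8400000 "-" "Mozilla/5.0"',
    '203.0.113.9 - - [10/Aug/2022:10:00:01 +0000] "GET /img/logo.svg HTTP/1.1" 200 1200 "-" "Mozilla/5.0"',
    '203.0.113.9 - - [10/Aug/2022:10:00:02 +0000] "GET /img/hero.jpg?v=2 HTTP/1.1" 200 650000 "-" "Mozilla/5.0"',
    '203.0.113.9 - - [10/Aug/2022:10:00:03 +0000] "GET /menu HTTP/1.1" 200 900000 "-" "Mozilla/5.0"',
    '203.0.113.9 - - [10/Aug/2022:10:00:04 +0000] "GET /img/icon.png HTTP/1.1" 304 0 "-" "Mozilla/5.0"',
]

for path, size in oversized_images(IMAGE_LOG_LINES):
    print(f"{size / 1_000_000:.1f} MB  {path}")
```

Unlike a single Lighthouse run on the home page, this covers every image that was actually requested during the logged period.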
Gerry, Steve, any other things that we need to bring out before we hit the top of the show?
Gerry: I think the one thing that I get inspired by and basically find interesting is the wastage. Basically, you often find that there’s, like, parameter URLs that are being crawled at scale. There’s sub-domains that are being crawled or there’s often things which you don’t expect. I mean, the funny thing about log files is the fact that you tend to look for things that you didn’t really see anyway almost. I mean, a lot of the time we kind of look at the website, we have a kind of understanding of the website, and using tools like ContentKing or Screaming Frog or whatever kind of crawler we’ve got, we have a really good idea of how Google should be seeing the website. But then, often we kind of look in the log files, and we’re often surprised when we go, “Oh, Google has got a bit weird over here” or “Google’s found something over here.”
I mean, one of the things that I would mention is before you get into log files, another place you can look for the same kind of information is in Google Search Console. There’s something called the Crawl Stats report, which I think is massively underutilized, and when we’re talking about log files, I think kind of starting there allows you to kind of go, “Okay. This is something which I need to look at,” and then you go to the log files. It’s almost kind of a, “Where do I start to kind of then go somewhere else?”
I mean, one of the things Sophie mentioned before is things like 302s versus 301s. We kind of understand the difference, but it’s about explaining to a developer that a 302 will be hit hundreds of times, whereas the 301 will be hit a few times before Google will kind of go, “Okay. This seems to be permanently moved over here.” 302 is like, “Oh, I have to keep checking back and checking back and checking back.”
So, a lot of the time developers, IT guys, they want to improve the crawling, they want to improve the efficiency of serving content not just to users but also to the search engines, especially if a huge proportion of the traffic is search engine based. We want to improve the efficiency there, and the best way to do that is to kind of go into the log files.
Another thing Sophie mentioned was page speed and one of the things that I never knew before recently was that there was like a status code called a 304. Actually, I mean, I say recently, this is like 5, 10 years ago but when I first found out about that, I was surprised. I was almost worried when I went, “Wow, there’s so many 304s in here. What are they and is it causing me issues?” And it turns out that’s a good thing, but nobody had ever really told me that we should be using 304s to improve a crawl which means that Google will check it and then when it comes back again it won’t rescan the page. It will know that it’s not been modified because that’s a not modified status code.
All of these kinds of little things which you kind of go, “Okay. These are the little things that are kind of massively improving things,” or “This is where Google has gone a bit crazy over here and has found some spider trap or something,” which means Google is kind of finding millions of pages of parameter content that it doesn’t need to find. A good example, going back to kind of my time at JUST EAT, we had hundreds of thousands of sub-domains and Google was crawling them even though they weren’t generating any traffic. It was just basically canonicalized back to the main site and we didn’t realize just how bad it was until we started doing log file analysis.
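Editor's sketch: the kind of wastage Gerry describes (parameter URLs crawled at scale, 302s hit over and over, lots of 304s) can be surfaced with a small summary of crawler hits. This again assumes the "combined" log format and matches the crawler only by a user-agent substring, which can be spoofed; the names `crawl_summary` and `SAMPLE` are illustrative and the sample lines are invented.

```python
import re
from collections import Counter
from urllib.parse import urlsplit

# Assumption: "combined" log format, where the user agent is the last quoted field.
LOG_LINE = re.compile(
    r'"(?P<method>[A-Z]+) (?P<url>\S+) [^"]*" (?P<status>\d{3}) \S+ '
    r'"[^"]*" "(?P<agent>[^"]*)"'
)

def crawl_summary(lines, bot_token="Googlebot"):
    """Count crawler hits by status code and how many had a query string."""
    statuses = Counter()
    parameter_hits = 0
    total = 0
    for line in lines:
        match = LOG_LINE.search(line)
        if not match or bot_token not in match["agent"]:
            continue  # not parseable, or not the crawler we are looking at
        total += 1
        statuses[match["status"]] += 1
        if urlsplit(match["url"]).query:
            parameter_hits += 1  # URL carries parameters, e.g. ?sort=price
    return {"total": total, "statuses": dict(statuses), "parameter_hits": parameter_hits}

GOOGLEBOT = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
# Invented sample lines, for illustration only.
SAMPLE = [
    f'66.249.66.1 - - [10/Aug/2022:10:00:00 +0000] "GET /menu HTTP/1.1" 200 5120 "-" "{GOOGLEBOT}"',
    f'66.249.66.1 - - [10/Aug/2022:10:00:02 +0000] "GET /old HTTP/1.1" 302 0 "-" "{GOOGLEBOT}"',
    f'66.249.66.1 - - [10/Aug/2022:10:00:04 +0000] "GET /logo.png HTTP/1.1" 304 - "-" "{GOOGLEBOT}"',
    f'66.249.66.1 - - [10/Aug/2022:10:00:06 +0000] "GET /menu?sort=price&page=7 HTTP/1.1" 200 5090 "-" "{GOOGLEBOT}"',
    '203.0.113.7 - - [10/Aug/2022:10:00:08 +0000] "GET /menu HTTP/1.1" 200 5120 "-" "Mozilla/5.0"',
]

print(crawl_summary(SAMPLE))
```

A high share of parameter hits, or a redirect status that keeps reappearing for the same URL, is the sort of pattern worth taking to the developers.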
Dixon: I might come back with some of that, but I wanted Steven to have a chance to wade in.
Steven: Yeah. I love the 304 and using Google Search Console’s crawl stats report to get some pointers on where to optimize to use your site’s crawl budget more efficiently. So, what we typically see is that a lot of sites are not leveraging the 304 not modified HTTP status response.
Dixon: So, let me ask, how do you leverage a 304? I mean, so basically, what you’re saying is a 304 response is better than a 200 response? Is that right?
Steven: It depends. So, if you’ve got content that’s not changing or not changing that often, it makes a lot of sense to use the 304 response code because you’re essentially telling both browsers and crawlers like, “Hey, this piece of content, whether it’s a font or a logo or HTML, hasn’t changed. So, you don’t need to fully crawl it. You can use whatever you have.” And that way you decrease unnecessary load on the site.
So, for example, like a company’s logo doesn’t change that often, it’s totally fine to use the 304 HTTP status there and you can use it in a bunch of places where it makes sense.
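Editor's sketch: a 304 is the server's answer to a conditional request, where the client sends back the validator it stored earlier (`If-None-Match` with an ETag, or `If-Modified-Since` with a date). The function below shows that decision in simplified form. It is not any particular server's or plugin's implementation; `conditional_status`, `LOGO_ETAG` and `LOGO_MODIFIED` are hypothetical names, and the headers are assumed to arrive in a plain dict with exactly these capitalised keys.

```python
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime

def _strip_weak(tag):
    """Drop the weak-validator prefix so W/"x" and "x" compare equal."""
    tag = tag.strip()
    return tag[2:] if tag.startswith("W/") else tag

def conditional_status(request_headers, etag, last_modified):
    """Return 304 if the client's cached copy is still current, else 200."""
    client_etags = request_headers.get("If-None-Match")
    if client_etags is not None:
        # When If-None-Match is present it takes precedence over If-Modified-Since.
        candidates = [_strip_weak(tag) for tag in client_etags.split(",")]
        return 304 if "*" in candidates or _strip_weak(etag) in candidates else 200
    since = request_headers.get("If-Modified-Since")
    if since is not None:
        try:
            cached_at = parsedate_to_datetime(since)
        except (TypeError, ValueError):
            return 200  # unparseable date: ignore the header and send the full response
        if cached_at.tzinfo is None:
            cached_at = cached_at.replace(tzinfo=timezone.utc)
        # HTTP dates have one-second resolution, so drop microseconds before comparing.
        if last_modified.replace(microsecond=0) <= cached_at:
            return 304
    return 200

# Hypothetical resource: a logo that last changed on 1 August 2022.
LOGO_ETAG = '"logo-v3"'
LOGO_MODIFIED = datetime(2022, 8, 1, 12, 0, tzinfo=timezone.utc)

print(conditional_status({}, LOGO_ETAG, LOGO_MODIFIED))
print(conditional_status({"If-None-Match": '"logo-v3"'}, LOGO_ETAG, LOGO_MODIFIED))
print(conditional_status(
    {"If-Modified-Since": format_datetime(LOGO_MODIFIED, usegmt=True)},
    LOGO_ETAG, LOGO_MODIFIED,
))
```

A first request without validators gets the full 200 response; a later request that presents a matching ETag or an up-to-date timestamp gets the much smaller 304 with no body.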
Dixon: So, you do it on the logos and the images not necessarily on the HTML pages, URL itself, is that right? How do you do that? I mean, if I’m a WordPress user, am I stuffed? Do I need to be a little bit more tech-savvy than that or can I do it in WordPress?
Steven: There are probably plugins for that. Any speed plugin that’s worth its salt is going to have options for this, so that’s built into pretty much any plugin I’ve seen.
Dixon: Okay. Cool. And… sorry, Gerry?
Gerry: No, I was going to say you can also do that using the CDN layer basically. So, you know, often that’s what CDNs are really, really good for is kind of going, “Actually, I want to cache these hits and use a 304 and other bits and pieces.” So, it does exactly what Steven is saying, it kind of gives you the ability to kind of go, “Okay. These image files, I want to basically extend the, oh God, the data points.” Oh, I’m talking rubbish now. I’ll pass it back to Steven.
Dixon: No, I’ll move on because I want us to get onto GDPR just briefly before the end and before that you brought up, I think it was Gerry, the idea of… Well, actually, Sophie brought it up really, data basically, these large files and that being really obvious when you look at log files and you suddenly see that 80% of your resources have been used to load images and you didn’t know you had an image-rich site or something like that or it’s just been streaming a video to one person for the last 24 hours or something.
Does that mean, is that an opportunity to talk data to developers and systems admins, in a language that they understand better? If you sit there and say, “Look, you’re using 20% of your resources just to load this image up,” are you going to get a faster reaction than if you say, “Look, you’ve got 3000 bots hitting this web page,” or something like that? Is data a better way to communicate with systems admins, or is that an it-depends kind of question of the system admins?
Sophie: I think, it’s always an it-depends question. I think anything is an it-depends question, isn’t it really? I’ve always found it’s easier to build relationships with developers if you are almost speaking their language. If you can show them how it impacts them rather than how it impacts you or kind of just really building that bridge because they don’t really care, they’ve got their own job list, they’ve got their own ticket system, like they’ve got all of these things that they need to work through, why should they care more about what you’re trying to get pushed through than what they already need to.
So, you just really being able to showcase to them what the impact is and kind of talking to them about what the effort behind that is as well and how they can resolve that and leaving that open to them because if you storm into kind of a development meeting being like, “Great. This is definitely a low effort task. I need all of these things done because it’s going to benefit me and it’s going to benefit the SEO,” you’re not really going to get anywhere. They’re going to be completely just shut down from that. So, I always think just talking to them in their language, showing them the data, is going to help.
Dixon: Then I’m going to defend, it’s not it depends, it’s yes. You reckon that if you can talk to them in their language then you’re going to get further faster.
Sophie: Yeah, unless they know SEO, I guess, but that’s really hard. I can’t say I’ve found a developer who’s like the same level of SEO as what an SEO is, but yeah.
Dixon: Yeah. Okay. We agree with that Gerry and Steven?
Gerry: Absolutely.
Steven: Yeah.
Gerry: Go on, Steven.
Steven: Yeah. It’s more of a communications issue as Sophie put it, yeah.
Gerry: I think the worst thing or the worst habit I’ve seen from SEO people is telling developers how to do it rather than what they want to get done. If you kind of go, “Oh, you need to edit the .htaccess file,” and the guy turns around and kind of goes, “We don’t have an .htaccess file, we’re on a .NET server.” You look like a… Yeah, you don’t look like the best SEO guy out there and it is basically your job to kind of talk to them about what you need and why you need it rather than how to do it.
They’re always very good at kind of telling you how they think it should be done and talking to you through it and you need to be in those conversations because often they come up with a solution which doesn’t really work for you, but equally, you know, you almost have to trust them to kind of understand their own systems to kind of come up with the best ideas.
Dixon: Excellent. Okay. Before we… We’re nearly at the end of our time. These things go really quickly when you get into them, but I did want to just finish up with this question which if you don’t want to answer that’s actually fine, but GDPR law is something that, you know, when it came in, I thought, “Well, I believe in this and I do, I want people to opt-in to have their data or not to have my data stored or whatever.”
But one of the quirks of GDPR law is that IP addresses are considered to be personally identifiable information. So, does that mean that server logs are illegal? Who wants to go on with that one?
Gerry: So, fundamentally, no, they’re not illegal, on the basis that they have to be stored. They have to be stored in that way. Basically, the whole GDPR law is about whether your data… It’s a bit complicated but, basically, it’s about whether your data is used and matched up in a different way. Now, if we kind of use the IP address to then re-target you and do all sorts of bits and pieces and match it up with other kinds of data sources, then yeah. Basically, if it’s for the core service that we’re doing then absolutely we need to use the IP address and so on and so forth.
Now, it is debatable whether an IP address is actually personally identifiable. The reason I say that is because…
Dixon: I agree.
Gerry: … I used to work at the BBC and everybody at the BBC, whether they were in the Manchester office, the London office, or any other office, all shared one IP address, and it’s the same with most organizations. And equally, as I browse around at home, my IP address could change. I could pay extra to have a fixed IP address, but I don’t because I’m tight and I don’t need one, but basically an IP address is one of those things which is really just a way of identifying the computer that you’re at, accessing the internet at that time.
And again, as Sophie’s mentioned, so many people now are using VPNs and other bits and pieces. It’s quite interesting that we still consider an IP address to be kind of personally identifiable. However, on the basis that GDPR and, I mean, the company that I work for, we don’t use Google Analytics, for instance, because Europe has deemed it to be potentially illegal. So, in Germany and Finland, we’re looking at different tools but what we are doing is we’re not sharing the IP address externally. We’re not using the IP address that we capture to kind of then share that with other tools and other services to kind of use it in ways for retargeting and other bits and pieces, but we are using it to kind of provide the best service and to make sure that what we do and how we do it is functional.
So, without storing and using IP addresses literally the internet would collapse, and whilst I’m a great believer in my level of privacy I’m also a great believer in having a functional internet.
Dixon: No, I absolutely agree with all you’re saying, it’s just that it wasn’t me that deemed IP addresses to be personally identifiable information. It’s in the legislation which seems to me absurd because there’s no way you can… It’s like having a telephone call between two people but the system doesn’t know what one of the telephone numbers is, that’s not going to work. You need it to be able to do the communications.
Gerry: Yeah. I mean, one of the things I would say is I’ve had some very interesting conversations with legal teams and big companies where they were trying to figure out how we’ve given permission for Google to scrape the site or similar things. Basically, lawyers don’t always understand the implications of law. They understand their side of it. We understand our side of it, but the two are not necessarily very well connected. Internet lawyers are a new breed and I think that’s a fascinating kind of area if I’m honest.
Dixon: Sophie, Steven, you got anything that you want to add in that conversation? I find it fascinating, but it probably bores the hell out of a lot of people. So, I left it to the end of the show.
Sophie: I mean… Go on, Steven.
Steven: Yeah. So, I think the discussion you need to have is how you go about handling IP addresses. In a lot of cases, you can discard them, or you can just remove the last couple of octets. And you can do everything that you wanted to do anyway. So, it’s not really an issue that’s holding back your work as an SEO. So, moving beyond that, it’s pretty easy.
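Editor's sketch: removing the trailing octets, as Steven suggests, can be done with Python's standard `ipaddress` module before log lines are stored or shared. The function name `anonymize_ip` and the default prefix lengths (keep the first 24 bits of an IPv4 address, the first 48 bits of an IPv6 address) are assumptions made for illustration, not a recommendation on what a given legal team will accept.

```python
import ipaddress

def anonymize_ip(address, v4_prefix=24, v6_prefix=48):
    """Zero out the host part of an IP address, keeping only the leading bits."""
    ip = ipaddress.ip_address(address)
    prefix = v4_prefix if ip.version == 4 else v6_prefix
    # strict=False lets us build the enclosing network from a host address.
    network = ipaddress.ip_network(f"{ip}/{prefix}", strict=False)
    return str(network.network_address)

# Documentation-range addresses, used purely as examples.
print(anonymize_ip("203.0.113.77"))                 # last octet removed
print(anonymize_ip("203.0.113.77", v4_prefix=16))   # last two octets removed
print(anonymize_ip("2001:db8:abcd:12:3456:789a:bcde:f012"))
```

Truncated addresses still let you group hits by network, which is usually enough to separate crawler traffic from everything else.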
Dixon: Yeah, that’s certainly what I do. On InLinks we kind of have a tracking system on our, it’s like Intercom, it’s called GoSquared but it’s sort of basically a web chat system, which everyone needs to be logged into to be able to have a chat. So, I’ve got their email address and everything. But where I haven’t got any of that information or haven’t been given permission to hold that information then we kill the last three digits of the IP address, so that we’re GDPR compliant. And obviously, if somebody signs up for the service then it’s a different kind of relationship.
But I find it weird because you go to a different internet lawyer who’s worried about, I don’t know, terrorism, and all of a sudden it’s illegal not to have the IP address of your customers. So, the ISPs are damned if they do, damned if they don’t. So, I find it… Anyway, it’s a story for another day and I don’t want to take people too much off the beaten track, but I do find it fascinating.
Anyway, guys, we’re up to our 45 minutes and a little bit beyond already. So, thank you very much for coming in. Before I ask you all to tell people how they can get hold of you and find out more, because a lot of interesting stuff comes out of this session, I’m going to bring back my producer, David, to let us know what’s happening on the next show and when that’s going to be.
David: Sure. The next show is going to be on the 19th of September, 4:00 pm BST. That will be Episode 27 and the topic will be: “How do you Target Audiences using SEO.” We’re booking a few guests for that and stay tuned to the InLinks channels to find out exactly who’s going to be on that show, but the topic is going to be, “How do you Target Audiences using SEO.”
And the sign-up link is theknowledgepanelshow.com if you want to sign up and watch the next episode live.
Dixon: I might have to talk about GDPR again then, that’s going to happen. Okay, guys, thank you very much for coming in. Guys, before we go, can you tell people how they can get in contact with you as long as you want them to, and please don’t say what you can see on the screen because bear in mind most people are on Spotify or iTunes or whatever. So, Gerry, how can they get you?
Gerry: You can always find me, if you Google me, Gerry White. You can find me on Twitter which is @dergal, or you can find me on LinkedIn. They’re the two best places to find me.
Dixon: Okay. And Gerry is with a G. Sophie, how do they find you?
Sophie: Similar for me as well. Twitter more so just because I’m dreadful with my LinkedIn DMs because they’re normally full of backlink builders, but @SophieBrannon is the best place to get me on Twitter.
Dixon: Excellent. And Johnny Scott is on YouTube in the background. So, thank you very much. Thanks for coming on, Johnny. That’s great. Steven, how do they find you?
Steven: I’m an SEO, not a hard person to find. You can search for Steven van Vessum, and if my last name is too difficult, you can go to steven.land and you’ll be able to find me and contact me.
Dixon: Brilliant. Steven with a V. So, guys, just leaves it for me to say thank you very much for coming in, and if you all want to be on the next session live go to theknowledgepanelshow.com and sign up there. Cheers. Bye for now.
Transcript edited on 8th October 2022.