245264c2c9fix respider frequency bug.
Matt Wells
2013-10-21 15:06:23 -0700
64a1c7c2f2more bug fixes. if spiders disabled for row in url filters, don't spider the url.
Matt Wells
2013-10-21 14:45:12 -0700
978910ca7afix more bugs.
Matt Wells
2013-10-21 14:17:32 -0700
1fb85db307url filters fixes.
Matt Wells
2013-10-21 13:44:30 -0700
dc4afad67edo not respider if collectiverespiderfreq is <= 0.0. added a url filter for that. added a couple url filters for retrying errors (tcp timed out, etc)
Matt Wells
2013-10-21 12:04:08 -0700
605289e130fix a couple collection related bugs causing cores in crawlbot.
Matt Wells
2013-10-21 11:38:33 -0700
d2d4379d5cremove debug point.
Matt Wells
2013-10-20 10:25:26 -0700
54915dc384fix data corruption in RdbMem buffer when running with threads disabled.
Matt Wells
2013-10-19 19:37:29 -0700
85bca4f3d1can now delete collection while spiders are out
Matt Wells
2013-10-18 18:11:14 -0700
889583ec4bnow we can reset collection mid stream
Matt Wells
2013-10-18 17:49:36 -0700
ecab57ff0fchange collnum of reset collection so any adds in progress will fail.
Matt Wells
2013-10-18 15:46:00 -0700
b589b17e63fix collection resetting.
Matt Wells
2013-10-18 15:21:00 -0700
50313a815fuse seeds and spots now
Matt Wells
2013-10-18 11:53:14 -0700
a288217e9fa few bug fixes
Matt Wells
2013-10-17 18:59:00 -0700
84a3aded94spider round updates correction
Matt Wells
2013-10-17 17:18:05 -0700
df7fd21253spider rounds update.
Matt Wells
2013-10-17 17:17:19 -0700
fe8ebd23a3added simplified redirect urls to spiderdb as a new spiderrequest. made XmlDoc::getLinks() call m_links.set(redirUrl.getUrl()) so that it is treated like an outlink on the page and gets added from addOutlinkSpiderRecsToMetaList().
Matt Wells
2013-10-17 12:06:12 -0700
92413001fbdirty word detector revisions. we need a word-based function, isDirtyWord(), which identifies single words, bigrams and trigrams. it'll be much faster than the current approach and won't slow down when the list of dirty words gets big. we then need isDirtyUrl() to use the logic in Speller.cpp to split a url up into its component words to run through isDirtyWord().
mwells
2013-10-16 20:19:49 -0700
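The word-based lookup described in the commit above could look roughly like the following. This is only a sketch under assumptions: the phrase list, `kDirtyPhrases`, `isDirtyWordAt()` and `hasDirtyWord()` are hypothetical names and are not taken from the Gigablast source. It shows why hashing unigrams, bigrams and trigrams keeps the cost per word constant no matter how large the list grows.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical phrase list: lowercase entries, words joined by one space.
static const std::unordered_set<std::string> kDirtyPhrases = {
    "badword",          // single word
    "very badword",     // bigram
    "really very bad",  // trigram
};

// True if the 1-, 2- or 3-word phrase starting at words[i] is in the list.
// Each check is one hash lookup, so the cost does not grow with list size.
bool isDirtyWordAt(const std::vector<std::string>& words, std::size_t i) {
    assert(i < words.size());
    std::string phrase;
    for (std::size_t n = 0; n < 3 && i + n < words.size(); ++n) {
        if (n > 0) phrase += ' ';
        phrase += words[i + n];
        if (kDirtyPhrases.count(phrase) != 0) return true;
    }
    return false;
}

// Scan every starting position in an already tokenized, lowercased text.
bool hasDirtyWord(const std::vector<std::string>& words) {
    for (std::size_t i = 0; i < words.size(); ++i)
        if (isDirtyWordAt(words, i)) return true;
    return false;
}
```

An isDirtyUrl() built on this would first split the url into words (the commit points at the logic in Speller.cpp for that) and then pass the resulting vector to the same lookup.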
b9f94d7d45show cached json objects as application/json without term highlighting and the disclaimer
Matt Wells
2013-10-16 17:54:17 -0700
d9b132fd5amake : into . for indexing json names.
Matt Wells
2013-10-16 17:43:46 -0700
74c2742cedfix mem leak of LinkInfo. fixed json output from injecting url.
Matt Wells
2013-10-16 17:17:28 -0700
70c4ef682dprinting updates
Matt Wells
2013-10-16 16:27:24 -0700
ee06428059fix json indexing and searching
Matt Wells
2013-10-16 16:15:28 -0700
9d6c3626d8json indexing/hashing updates.
Matt Wells
2013-10-16 15:41:12 -0700
bb09b4f742do not store diffbot api url in diffbot reply yet. later may want to store in each diffbot object doc maybe as part of the json content?
Matt Wells
2013-10-16 15:24:22 -0700
11897f09daturn off log debug msg.
mwells
2013-10-16 16:24:08 -0600
f8256c3ef9fix core from diffbot object doc not having valid dmoz info
Matt Wells
2013-10-16 15:14:39 -0700
acad6f48d3Merge branch 'master' of git@github.com:gigablast/open-source-search-engine
Matt Wells
2013-10-16 14:54:02 -0700
dae005e4aeensure dmoz info valid when making titlerec
Matt Wells
2013-10-16 14:53:48 -0700
57ee9739e5fix addColl() logic for collectionless rdbs
Matt Wells
2013-10-16 14:38:09 -0700
fc17521697Merge branch 'master' into diffbot
Matt Wells
2013-10-16 14:28:42 -0700
22ef91a6f1show all colls in json after deleteCrawl operation
Matt Wells
2013-10-16 14:13:28 -0700
e565a861aegive nice reply from seed in json. show how many outlinks from same domain were found and how many outlinks were filtered. same for addurls (bulk add).
Matt Wells
2013-10-16 14:03:14 -0700
36e6e21ae3url filter processing looking good.
Matt Wells
2013-10-16 12:19:25 -0700
f5e5b0f5d3fix crawlbot bugs
Matt Wells
2013-10-16 12:12:22 -0700
d8835acfefcrawlbot api work.
mwells
2013-10-15 11:54:54 -0600
37a9e82060update the dirty word list. but we still should remove tags, except maybe outlinks, and detect the dirty words on what remains. getting too many false positives in tags still.
mwells
2013-10-15 01:01:19 -0700
3db726c22etake out references to AdultBit.cpp, since it is no longer used.
mwells
2013-10-14 23:21:58 -0700
12bff1e9b0fix potential problem of tons of points in our statsdb div graph. use hashtable to dedup points and save from printing out too many <div> tags.
mwells
2013-10-14 22:52:29 -0700
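The hashtable dedup mentioned in the commit above can be sketched as follows. `Point`, `pointKey()` and `dedupPoints()` are hypothetical names, and `std::unordered_set` stands in for whatever hashtable the real code uses. The idea is that two samples landing on the same pixel would produce identical <div> tags, so only the first is kept.

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_set>
#include <vector>

// Hypothetical graph point in pixel coordinates.
struct Point {
    std::int32_t x;
    std::int32_t y;
};

// Pack (x, y) into one 64-bit key so it can be stored in a hash set.
static std::uint64_t pointKey(const Point& p) {
    return (static_cast<std::uint64_t>(static_cast<std::uint32_t>(p.x)) << 32) |
           static_cast<std::uint64_t>(static_cast<std::uint32_t>(p.y));
}

// Keep the first occurrence of each pixel and drop later duplicates,
// preserving the original order of the surviving points.
std::vector<Point> dedupPoints(const std::vector<Point>& in) {
    std::unordered_set<std::uint64_t> seen;
    std::vector<Point> out;
    for (const Point& p : in)
        if (seen.insert(pointKey(p)).second) out.push_back(p);
    assert(out.size() <= in.size());
    return out;
}
```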
90fca8c171fix "search in category" link.
mwells
2013-10-14 22:39:42 -0700
0096877127fix "statsdb" graph so it seems to work now.
mwells
2013-10-14 22:31:00 -0700
9e9ef9c2ccstill getting statsdb link to work. a little better now.
mwells
2013-10-14 21:21:27 -0700
8f93a72961start using html div graph for PageStatsdb.cpp now too.
mwells
2013-10-14 20:35:45 -0700
a0808df2aegot new diffbot api compiled
mwells
2013-10-14 18:19:59 -0600
80918ca6e3remove old libplotter references and files.
mwells
2013-10-13 23:48:07 -0700
553c28fbe0get performance graphing working again. use absolute divs to draw the graph instead of old gif plotter library.
mwells
2013-10-13 23:39:31 -0700
81a09f9835half way done fixing performance graph. needs more work.
mwells
2013-10-13 22:02:21 -0700
c7cf6a817admoz directory root page search box should just search all sites in dmoz.
mwells
2013-10-13 20:13:15 -0700
3ac5838b8ffix the search tabs for the dmoz directory search box. allow more error types when spidering dmoz docs.
mwells
2013-10-13 18:43:45 -0700
66364c581aminor fix when indexing dmoz urls.
mwells
2013-10-13 17:12:20 -0700
876af6d8c6dmoz support is now updated and re-integrated.
mwells
2013-10-13 16:53:28 -0700
3bc85cf528a few cleanups for the new dmoz code.
mwells
2013-10-13 16:48:59 -0700
d4b5c37f45Merge branch 'master' into testing
mwells
2013-10-13 00:20:37 -0700
fbcaefa6ffso we can spider https sites, add the old gigablast private/public key file.
mwells
2013-10-13 00:15:39 -0700
65bad44450try to fix EBADIP stopping a page from getting indexed into dmoz.
mwells
2013-10-13 00:14:16 -0700
c949bfe315ignore certain errors and index the doc anyway so we at least have it in our dmoz index with its designated title and summary from dmoz.
mwells
2013-10-13 00:02:25 -0700
eeb10bb99afix ip vector logic in xmldoc.cpp.
mwells
2013-10-12 23:14:39 -0700
c283e85e40add support for noindex meta tag. use it in the gbdmoz.urls.txt.* files that contain the dmoz urls we want to spider.
mwells
2013-10-12 22:50:23 -0700
d300dc42f7added XmlDoc::getDmozTitles() and related functions.
mwells
2013-10-12 21:56:25 -0700
3374ce450afix a couple catdb generation bugs. MAX_CATIDS violation causing corruption. not saving catdb tree to catdb-saved.dat causing missing catdb recs.
mwells
2013-10-12 20:33:04 -0700
1d133e87c9just print dmoz pages verbatim for now. later we can show the dmoz entries as search results.
mwells
2013-10-10 23:18:57 -0700
547420396fadded Categories::printUrlsInTopic() to print the dmoz urls for a catid. can replace us doing a search for the time being.
mwells
2013-10-10 23:03:07 -0700
38bb82c902make /Top the top dir page now.
mwells
2013-10-10 22:46:31 -0700
55c5ad2921fix "Top/" issues in breadcrumb etc.
mwells
2013-10-10 22:27:49 -0700
ca6af65217got dmoz navigation system working. now we just need to index the urls to populate dmoz.
mwells
2013-10-10 22:08:21 -0700
be01041e36added support for new url filter: "lastspidertime>={roundstart} --> IGNORE" so we can spider all urls before we advance to the next spider round and re-spider everything again. CollectionRec::m_spiderRoundStartTime and CollectionRec::m_spiderRoundNum are the new collection rec parms. show the round stuff on url filters page.
mwells
2013-10-10 18:47:46 -0600
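A minimal sketch of how the "lastspidertime>={roundstart} --> IGNORE" filter from the commit above might behave. `RoundState` is a hypothetical stand-in for the two CollectionRec parms, and the rule for advancing the round (bump the round once every url is ignored) is an assumption inferred from the commit message, not the actual implementation.

```cpp
#include <cassert>
#include <ctime>
#include <vector>

// Hypothetical stand-in for CollectionRec::m_spiderRoundStartTime and
// CollectionRec::m_spiderRoundNum.
struct RoundState {
    std::time_t spiderRoundStartTime;
    int spiderRoundNum;
};

// Mirrors the filter "lastspidertime>={roundstart} --> IGNORE": a url that
// was already spidered during the current round is skipped.
bool ignoredThisRound(std::time_t lastSpiderTime, const RoundState& r) {
    return lastSpiderTime >= r.spiderRoundStartTime;
}

// Assumed policy: once every url is ignored, start the next round at `now`,
// which makes all urls eligible for re-spidering again.
bool maybeAdvanceRound(const std::vector<std::time_t>& lastSpiderTimes,
                       RoundState& r, std::time_t now) {
    assert(now >= r.spiderRoundStartTime);
    for (std::time_t t : lastSpiderTimes)
        if (!ignoredThisRound(t, r)) return false;  // still work left this round
    r.spiderRoundStartTime = now;
    ++r.spiderRoundNum;
    return true;
}
```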
08c153bdb0add old gb.pem file, not used by gigablast any more, but allows https server to startup.
mwells
2013-10-09 17:37:01 -0600
ea859ef685added 'gb emailmandrill' for testing. got it working. it posts json, not url encoded.
mwells
2013-10-09 17:35:51 -0600
2bb8b818d6more bug fixes with notification system.
mwells
2013-10-09 16:28:15 -0600
c1c5c4e3d0send notifications if no urls available for immediate spidering.
mwells
2013-10-09 15:24:35 -0600
24e3b8cf52Merge branch 'diffbot' of github.com:gigablast/open-source-search-engine into diffbot
mwells
2013-10-09 13:07:22 -0600
0b4bbf926efix potential compiler error.
Matt Wells
2013-10-09 11:52:58 -0700
58e2be8b6ftake out log msg
Matt Wells
2013-10-09 11:51:39 -0700
283ec2f6b4email and webhook alerts when spider runs out of urls to spider.
Matt Wells
2013-10-09 11:42:56 -0700
7ba9994804many dmoz fixes. but still more we need to do. isn't printing subcategories right now.
mwells
2013-10-08 23:55:11 -0700
3702a05d64add sendEmailThroughMandrill() to send through mail chimp http api.
Matt Wells
2013-10-08 18:01:38 -0700
9eecfd378cadded support for pageprocesspattern again. || separated strings to find in m_content before sending to diffbot.
Matt Wells
2013-10-08 17:08:58 -0700
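The "||" separated matching described in the commit above can be sketched as below. `splitPatterns()` and `matchesPageProcessPattern()` are hypothetical names. Case-sensitive matching, and treating an empty pattern as "process every page", are assumptions that the commit message does not confirm.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Split a pattern such as "price||add to cart" on the "||" separator.
// Empty pieces (e.g. from "a||||b") are skipped.
std::vector<std::string> splitPatterns(const std::string& pattern) {
    std::vector<std::string> out;
    std::size_t start = 0;
    while (start <= pattern.size()) {
        std::size_t end = pattern.find("||", start);
        if (end == std::string::npos) end = pattern.size();
        if (end > start) out.push_back(pattern.substr(start, end - start));
        start = end + 2;  // step past the separator
    }
    return out;
}

// Assumed semantics: an empty pattern matches every page; otherwise the page
// qualifies if any one of the substrings occurs in the content.
bool matchesPageProcessPattern(const std::string& content,
                               const std::string& pattern) {
    const std::vector<std::string> subs = splitPatterns(pattern);
    if (subs.empty()) return true;
    for (const std::string& s : subs) {
        assert(!s.empty());  // splitPatterns never yields empty pieces
        if (content.find(s) != std::string::npos) return true;
    }
    return false;
}
```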
a78dc35169Merge branch 'diffbot' of github.com:gigablast/open-source-search-engine into diffbot
mwells
2013-10-08 17:51:07 -0600